Merge pull request #175 from datacarpentry/pr-issue-170

Closes #170. Fixing typos and wording
datacarpentry · Jul 26, 2024 · b8251aa
2 parents fcda37f + f228137
Showing 4 changed files with 26 additions and 26 deletions.
diff --git a/episodes/01-format-data.md b/episodes/01-format-data.md
@@ -91,9 +91,9 @@ For instance, we're going to be working with data from a study of
 agricultural practices among farmers in two countries in eastern
 sub-Saharan Africa (Mozambique and Tanzania). Researchers conducted
 interviews with farmers in these countries to collect data on
-household statistics (e.g. number of household members,
+household statistics (e.g., number of household members,
 number of meals eaten per day, availability of water),
-farming practices (e.g. water usage), and assets (e.g. number of farm plots,
+farming practices (e.g., water usage), and assets (e.g., number of farm plots,
 number of livestock). They also recorded the dates and locations of
 each interview.
 
@@ -127,8 +127,8 @@ later in this workshop.
 > The data used in these lessons are taken from interviews of farmers in two
 > countries in eastern sub-Saharan Africa (Mozambique and Tanzania). These
 > interviews were conducted between November 2016 and June 2017 and probed
-> household features (e.g. construction materials used, number of household
-> members), agricultural practices (e.g. water usage), and assets (e.g. number
+> household features (e.g., construction materials used, number of household
+> members), agricultural practices (e.g., water usage), and assets (e.g., number
 > and types of livestock).
 
 This is a real dataset, however, it has been simplified for this workshop. If
@@ -243,15 +243,15 @@ disrupt the formatting of your data file.
 Some of this information may be familiar to learners who conduct analyses on
 survey data or other data sets that come with codebooks. Codebooks will often
 describe the way a variable has been constructed, what prompt was associated with
-it in an survey or interview, and what the meaning of various values are. For example,
+it in a survey or interview, and what the meaning of various values are. For example,
 the [General Social Survey](https://gss.norc.org) maintains their entire codebook online.
 Looking at an entry for a particular variable, such as
 [the variable `SEX`](https://gssdataexplorer.norc.org/variables/81/vshow), provides
 valuable information about what survey waves the variable covers, and the meaning
 of particular values.
 
 Additionally, file or database level metadata describes how files that make up
-the dataset relate to each other; what format are they are
+the dataset relate to each other; what format they are
 in; and whether they supersede or are superseded by previous files. A
 folder-level readme.txt file is the classic way of accounting for
 all the files and folders in a project.
@@ -263,7 +263,7 @@ Research librarians may have specific expertise in this area, and can be
 helpful resources for thinking about ways to purposefully document metatdata
 as part of your research.
 
-(Text on metadata adapted from the online course Research Data [MANTRA](https://datalib.edina.ac.uk/mantra) by EDINA and Data Library, University of Edinburgh. MANTRA is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).)
+(Text on metadata adapted from the online course [MANTRA - Research Data Management Training](https://mantra.ed.ac.uk/) by Research Data Service and the Institute for Academic Development, University of Edinburgh. MANTRA is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).)
 
 :::::::::::::::::::::::::::::::::::::::  challenge
 
@@ -290,13 +290,13 @@ data are:
 
 - the exact wording of questions used in the interviews (if interviews were
   structured) or general prompts used (if interviews were semi-structured)
-- a description of the type of data allowed in each column (e.g. the allowed
+- a description of the type of data allowed in each column (e.g., the allowed
   range for numerical data with a restricted range, a list of allowed options
   for categorical variables, whether data in a numerical column should be
   continuous or discrete)
-- definitions of any categorical variables (e.g. definitions of
+- definitions of any categorical variables (e.g., definitions of
   "burntbricks" and "sunbricks")
-- definitions of what was counted as a "room", a "plot", etc. (e.g. was
+- definitions of what was counted as a "room", a "plot", etc. (e.g., was
   there a minimum size)
 - learners may come up with additional questions to add to this list
 

diff --git a/episodes/02-common-mistakes.md b/episodes/02-common-mistakes.md
@@ -21,7 +21,7 @@ exercises: 0
 This lesson is meant to be used as a reference for discussion as learners identify issues with the messy dataset discussed in the
 previous lesson. Instructors: don't go through this lesson except to refer to responses to the exercise in the previous lesson.
 
-There are a few potential errors to be on the lookout for in your own data as well as data from collaborators or the Internet. If you are aware of the errors and the possible negative effect on downstream data analysis and result interpretation, it might motivate yourself and your project members to try and avoid them. Making small changes to the way you format your data in spreadsheets, can have a great impact on efficiency and reliability when it comes to data cleaning and analysis.
+There are a few potential errors to be on the lookout for in your own data as well as data from collaborators or the Internet. If you are aware of the errors and the possible negative effect on downstream data analysis and result interpretation, it might motivate yourself and your project members to try and avoid them. Making small changes to the way you format your data in spreadsheets can have a great impact on efficiency and reliability when it comes to data cleaning and analysis.
 
 - [Using multiple tables](#tables)
 - [Using multiple tabs](#tabs)
@@ -37,7 +37,7 @@ There are a few potential errors to be on the lookout for in your own data as we
 ## Using multiple tables {#tables}
 
 A common strategy is creating multiple data tables within
-one spreadsheet. This confuses the computer, so don't do this!
+one spreadsheet. This confuses the computer, so try to avoid doing this!
 When you create multiple tables within one
 spreadsheet, you're drawing false associations between things for the computer,
 which sees each row as an observation. You're also potentially using the same
@@ -46,17 +46,17 @@ into a usable form. The example below depicts the problem:
 
 ![](fig/multiple-tables-example2.png){alt='multiple tables'}
 
-In the example above, the computer will see (for example) row 24 and assume that all columns A-J
+In the example above, the computer will see row 24 and assume that all columns A-J
 refer to the same sample. This row actually represents two distinct samples
 (information about livestock for informant 1 and information about plots for informant 2). Other rows are similarly problematic.
 
 ## Using multiple tabs {#tabs}
 
 But what about workbook tabs? That seems like an easy way to organize data, right? Well, yes and no. When you create extra tabs, you fail
 to allow the computer to see connections in the data that are there (you have to introduce spreadsheet application-specific functions or
-scripting to ensure this connection). Say, for instance, you make a separate tab for each day you take a measurement.
+scripting to ensure this connection).
 
-This isn't good practice for two reasons:
+Say you make a separate tab for each day you take a measurement. This isn't good practice for two reasons:
 
 1) you are more likely to accidentally add inconsistencies to your data if each time you take a measurement, you start recording data in a new tab, and
 2) even if you manage to prevent all inconsistencies from creeping in, you will add an extra step for yourself before you analyze the
@@ -98,7 +98,7 @@ subsequent calculations or analyses. For example, the average of a set of number
 **Solution**: One common practice is to record unknown or missing data as -999, 999, or 0. Many statistical programs will not recognize
 that these are intended to represent missing (null) values. How these values are interpreted will depend on the software you use to
 analyze your data. It is essential to use a clearly defined and consistent null indicator.
-Blanks (most applications) and NA (for R) are good choices. White et al, 2013, explain good choices for indicating null values for different software applications in their article:
+Blanks (most applications) and NA (for R) are good choices. White et al., 2013, explain good choices for indicating null values for different software applications in their article:
 [Nine simple ways to make it easier to (re)use your data.](https://ojs.library.queensu.ca/index.php/IEE/article/view/4608) Ideas in Ecology and Evolution.
 
 | Null Values | Problems                                                                                                                                                                   | Compatibility         | Recommendation | 
@@ -116,7 +116,7 @@ Blanks (most applications) and NA (for R) are good choices. White et al, 2013, e
 
 ## Using formatting to convey information {#formatting}
 
-**Example**: highlighting cells, rows or columns that should be excluded from an analysis, leaving blank rows to indicate separations in data.
+**Example**: highlighting cells, rows or columns that should be excluded from an analysis, and leaving blank rows to indicate separations in data.
 
 ![](fig/bad-formatting.png){alt='formatting'}
 

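Note on the null-value guidance in `02-common-mistakes.md`: the point that sentinel values such as -999 can be read as real data is easy to demonstrate. The following is a minimal Python sketch, not part of the lesson or this diff; the readings and the helper name `mean_ignoring_missing` are hypothetical, and `None` stands in for a blank cell.

```python
from statistics import mean


def mean_ignoring_missing(values):
    """Average the values, skipping entries recorded as missing (None)."""
    present = [v for v in values if v is not None]
    return mean(present)


# Hypothetical readings; the third measurement was never taken.
with_sentinel = [2, 4, -999, 6]   # missing value coded as -999
with_blank = [2, 4, None, 6]      # missing value left blank (None)

# The sentinel is silently treated as a real measurement.
naive = mean(with_sentinel)

# A clearly defined null indicator can be skipped explicitly.
clean = mean_ignoring_missing(with_blank)

print(naive)  # -246.75
print(clean)  # 4
```

The sentinel drags the average far below any real reading, while the explicit null is simply left out of the calculation.
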
diff --git a/episodes/03-dates-as-data.md b/episodes/03-dates-as-data.md
@@ -7,7 +7,7 @@ exercises: 10
 ::::::::::::::::::::::::::::::::::::::: objectives
 
 - Recognise problematic or suspicious date formats.
-- Use formulas to separate dates into their component values (e.g. Month, Day, Year).
+- Use formulas to separate dates into their component values (e.g., Year, Month, Day).
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
 
@@ -23,9 +23,9 @@ Dates in spreadsheets are often stored in a single column.
 
 While this seems like a logical way to record dates when you are entering them, or visually reviewing data, it's not actually a best practice for preparing data for analysis.
 
-When working with data, your goal is to have as little ambiguity as possible. Ambiguity can creep into your data when working with dates when there are regional variations either in your observations and when you or your team might be working with different versions or suites of software products (e.g., LibreOffice, Microsoft Excel, Gnumeric).
+When working with data, your goal is to have as little ambiguity as possible. Ambiguity can creep into your data when working with dates when there are regional variations either in your observations or when you or your team might be working with different versions or suites of software products (e.g., LibreOffice, Microsoft Excel, Gnumeric).
 
-To avoid ambiguity between regional differences in date formatting and compatibility across spreadsheet software programs, a good practice is to divide dates into components in different columns - DAY, MONTH, and YEAR.
+To avoid ambiguity between regional differences in date formatting and compatibility across spreadsheet software programs, a good practice is to divide dates into components in different columns - YEAR, MONTH, and DAY.
 
 When working with dates it's also important to remember that functions are guaranteed to be compatible only within the same family of software products (e.g., LibreOffice, Microsoft Excel, Gnumeric). If you need to export your data and conserve the timestamps, you are better off handling dates using one of the solutions discussed below than the single column method.
 
@@ -79,7 +79,7 @@ try to type in a US format date into a UK version of Excel, it may or may not be
 date.
 
 This regional variation is one good reason to treat dates, not as a single data point, but as
-three distinct pieces of data (month, day, and year). Separating dates into their component parts
+three distinct pieces of data (year, month, and day). Separating dates into their component parts
 will avoid this confusion, while also giving the added benefit of allowing you to compare, for
 example data collected in January of multiple years with data collected in February of multiple years.
 
@@ -95,17 +95,17 @@ Choose the tab of the spreadsheet that corresponds to the way you format dates i
 location (either day first `DD_MM_YEAR`, or month first `MM_DD_YEAR`).
 
 Extract the components of the date to new columns. For this we
-can use the built in Excel functions:
+can use the built-in Excel functions:
 
+`=YEAR()`
 `=MONTH()`  
 `=DAY()`  
-`=YEAR()`
 
 Apply each of these formulas to its entire column.
 Make sure the new column is formatted as a number and not as a date.
 
 We now have each component of our date isolated in its own column. This will allow us
-to group our data with respect to month, year, or day of month for our analyses and will
+to group our data with respect to year, month, or day of month for our analyses and will
 also prevent problems when passing data between different versions of spreadsheet
 software (as for example when sharing data with collaborators in different countries).
 
@@ -129,7 +129,7 @@ Note that this solution shows the dates in `MM_DD_YEAR` format.
 
 Using the same spreadsheet you used for the previous exercise, add another data point
 in the `interview_date` column by typing either `11/17` (if your location uses `MM/DD` formatting)
-or `17/11` (if your location uses `DD/MM` formatting). The `Day`, `Month`, and `Year` columns
+or `17/11` (if your location uses `DD/MM` formatting). The `Year`, `Month`, and `Day` columns
 should populate for this new data point. What year is shown in the `Year` column?
 
 :::::::::::::::  solution

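Note on the date-splitting step in `03-dates-as-data.md`: what `=YEAR()`, `=MONTH()`, and `=DAY()` do in the spreadsheet can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the lesson or this diff; the helper name `split_date` and the sample dates are hypothetical.

```python
from datetime import datetime


def split_date(text, fmt):
    """Parse a date string with an explicit format and return (year, month, day)."""
    parsed = datetime.strptime(text, fmt).date()
    return parsed.year, parsed.month, parsed.day


# The same interview date written in month-first and day-first styles.
us_style = split_date("11/17/2016", "%m/%d/%Y")
uk_style = split_date("17/11/2016", "%d/%m/%Y")

# An ambiguous string means different days depending on the regional format.
read_month_first = split_date("03/04/2017", "%m/%d/%Y")
read_day_first = split_date("03/04/2017", "%d/%m/%Y")

print(us_style, uk_style)                # (2016, 11, 17) (2016, 11, 17)
print(read_month_first, read_day_first)  # (2017, 3, 4) (2017, 4, 3)
```

Once the components are stored in separate Year, Month, and Day columns, the regional ordering no longer matters, which is the ambiguity the lesson text describes.
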
diff --git a/episodes/05-exporting-data.md b/episodes/05-exporting-data.md
@@ -38,7 +38,7 @@ version) isn't a good idea. Why?
 
 - The above points also apply to other formats such as open data formats used by LibreOffice. These formats are not static and do not get parsed the same way by different software packages.
 
-As an example of inconsistencies in data storage, do you remember how we talked about how Excel stores dates earlier? It turns out that
+As an example of inconsistencies in data storage, do you remember our earlier discussion about how Excel stores dates? It turns out that
 there are multiple defaults for different versions of the software, and you can switch between them all. So, say you're
 compiling Excel-stored data from multiple sources. There's dates in each file- Excel interprets them as their own internally consistent
 serial numbers. When you combine the data, Excel will take the serial number from the place you're importing it from, and interpret it