Commit

Merge pull request #175 from datacarpentry/pr-issue-170
Closes #170. Fixing typos and wording
josenino95 authored Jul 26, 2024
2 parents fcda37f + f228137 commit b8251aa
Showing 4 changed files with 26 additions and 26 deletions.
20 changes: 10 additions & 10 deletions episodes/01-format-data.md
@@ -91,9 +91,9 @@ For instance, we're going to be working with data from a study of
agricultural practices among farmers in two countries in eastern
sub-Saharan Africa (Mozambique and Tanzania). Researchers conducted
interviews with farmers in these countries to collect data on
household statistics (e.g. number of household members,
household statistics (e.g., number of household members,
number of meals eaten per day, availability of water),
farming practices (e.g. water usage), and assets (e.g. number of farm plots,
farming practices (e.g., water usage), and assets (e.g., number of farm plots,
number of livestock). They also recorded the dates and locations of
each interview.

@@ -127,8 +127,8 @@ later in this workshop.
> The data used in these lessons are taken from interviews of farmers in two
> countries in eastern sub-Saharan Africa (Mozambique and Tanzania). These
> interviews were conducted between November 2016 and June 2017 and probed
> household features (e.g. construction materials used, number of household
> members), agricultural practices (e.g. water usage), and assets (e.g. number
> household features (e.g., construction materials used, number of household
> members), agricultural practices (e.g., water usage), and assets (e.g., number
> and types of livestock).
This is a real dataset, however, it has been simplified for this workshop. If
@@ -243,15 +243,15 @@ disrupt the formatting of your data file.
Some of this information may be familiar to learners who conduct analyses on
survey data or other data sets that come with codebooks. Codebooks will often
describe the way a variable has been constructed, what prompt was associated with
it in an survey or interview, and what the meaning of various values are. For example,
it in a survey or interview, and what the meaning of various values are. For example,
the [General Social Survey](https://gss.norc.org) maintains their entire codebook online.
Looking at an entry for a particular variable, such as
[the variable `SEX`](https://gssdataexplorer.norc.org/variables/81/vshow), provides
valuable information about what survey waves the variable covers, and the meaning
of particular values.

Additionally, file or database level metadata describes how files that make up
the dataset relate to each other; what format are they are
the dataset relate to each other; what format they are
in; and whether they supersede or are superseded by previous files. A
folder-level readme.txt file is the classic way of accounting for
all the files and folders in a project.
@@ -263,7 +263,7 @@ Research librarians may have specific expertise in this area, and can be
helpful resources for thinking about ways to purposefully document metadata
as part of your research.

(Text on metadata adapted from the online course Research Data [MANTRA](https://datalib.edina.ac.uk/mantra) by EDINA and Data Library, University of Edinburgh. MANTRA is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).)
(Text on metadata adapted from the online course [MANTRA - Research Data Management Training](https://mantra.ed.ac.uk/) by Research Data Service and the Institute for Academic Development, University of Edinburgh. MANTRA is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).)

::::::::::::::::::::::::::::::::::::::: challenge

@@ -290,13 +290,13 @@ data are:

- the exact wording of questions used in the interviews (if interviews were
structured) or general prompts used (if interviews were semi-structured)
- a description of the type of data allowed in each column (e.g. the allowed
- a description of the type of data allowed in each column (e.g., the allowed
range for numerical data with a restricted range, a list of allowed options
for categorical variables, whether data in a numerical column should be
continuous or discrete)
- definitions of any categorical variables (e.g. definitions of
- definitions of any categorical variables (e.g., definitions of
"burntbricks" and "sunbricks")
- definitions of what was counted as a "room", a "plot", etc. (e.g. was
- definitions of what was counted as a "room", a "plot", etc. (e.g., was
there a minimum size)
- learners may come up with additional questions to add to this list

14 changes: 7 additions & 7 deletions episodes/02-common-mistakes.md
@@ -21,7 +21,7 @@ exercises: 0
This lesson is meant to be used as a reference for discussion as learners identify issues with the messy dataset discussed in the
previous lesson. Instructors: don't go through this lesson except to refer to responses to the exercise in the previous lesson.

There are a few potential errors to be on the lookout for in your own data as well as data from collaborators or the Internet. If you are aware of the errors and the possible negative effect on downstream data analysis and result interpretation, it might motivate yourself and your project members to try and avoid them. Making small changes to the way you format your data in spreadsheets, can have a great impact on efficiency and reliability when it comes to data cleaning and analysis.
There are a few potential errors to be on the lookout for in your own data as well as data from collaborators or the Internet. If you are aware of the errors and the possible negative effect on downstream data analysis and result interpretation, it might motivate yourself and your project members to try and avoid them. Making small changes to the way you format your data in spreadsheets can have a great impact on efficiency and reliability when it comes to data cleaning and analysis.

- [Using multiple tables](#tables)
- [Using multiple tabs](#tabs)
@@ -37,7 +37,7 @@ There are a few potential errors to be on the lookout for in your own data as we
## Using multiple tables {#tables}

A common strategy is creating multiple data tables within
one spreadsheet. This confuses the computer, so don't do this!
one spreadsheet. This confuses the computer, so try to avoid doing this!
When you create multiple tables within one
spreadsheet, you're drawing false associations between things for the computer,
which sees each row as an observation. You're also potentially using the same
@@ -46,17 +46,17 @@ into a usable form. The example below depicts the problem:

![](fig/multiple-tables-example2.png){alt='multiple tables'}

In the example above, the computer will see (for example) row 24 and assume that all columns A-J
In the example above, the computer will see row 24 and assume that all columns A-J
refer to the same sample. This row actually represents two distinct samples
(information about livestock for informant 1 and information about plots for informant 2). Other rows are similarly problematic.

## Using multiple tabs {#tabs}

But what about workbook tabs? That seems like an easy way to organize data, right? Well, yes and no. When you create extra tabs, you fail
to allow the computer to see connections in the data that are there (you have to introduce spreadsheet application-specific functions or
scripting to ensure this connection). Say, for instance, you make a separate tab for each day you take a measurement.
scripting to ensure this connection).

This isn't good practice for two reasons:
Say you make a separate tab for each day you take a measurement. This isn't good practice for two reasons:

1) you are more likely to accidentally add inconsistencies to your data if each time you take a measurement, you start recording data in a new tab, and
2) even if you manage to prevent all inconsistencies from creeping in, you will add an extra step for yourself before you analyze the
@@ -98,7 +98,7 @@ subsequent calculations or analyses. For example, the average of a set of number
**Solution**: One common practice is to record unknown or missing data as -999, 999, or 0. Many statistical programs will not recognize
that these are intended to represent missing (null) values. How these values are interpreted will depend on the software you use to
analyze your data. It is essential to use a clearly defined and consistent null indicator.
Blanks (most applications) and NA (for R) are good choices. White et al, 2013, explain good choices for indicating null values for different software applications in their article:
Blanks (most applications) and NA (for R) are good choices. White et al., 2013, explain good choices for indicating null values for different software applications in their article:
[Nine simple ways to make it easier to (re)use your data.](https://ojs.library.queensu.ca/index.php/IEE/article/view/4608) Ideas in Ecology and Evolution.
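
To illustrate why a sentinel such as -999 is risky, here is a minimal Python sketch. The measurements and the helper name `mean_ignoring_nulls` are made up for illustration and are not part of the lesson data. It compares the average computed when the missing value is coded as -999 with the average computed when the missing value is stored as a true null.

```python
from statistics import mean

# Hypothetical measurements; the third value is missing in both lists.
with_sentinel = [12.0, 15.0, -999, 9.0]   # missing value coded as -999
with_null = [12.0, 15.0, None, 9.0]       # missing value left as a true null


def mean_ignoring_nulls(values):
    """Average only the entries that are actually present."""
    present = [v for v in values if v is not None]
    return mean(present)


# The sentinel is silently treated as a real measurement.
sentinel_mean = mean(with_sentinel)
# The null is skipped, so only the three real measurements are averaged.
null_aware_mean = mean_ignoring_nulls(with_null)

print(sentinel_mean)
print(null_aware_mean)
```

The sentinel drags the average far below any real measurement, while the null-aware version reflects only the observed values. Analysis software can only skip a missing value if it can recognize it as missing.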

| Null Values | Problems | Compatibility | Recommendation |
@@ -116,7 +116,7 @@ Blanks (most applications) and NA (for R) are good choices. White et al, 2013, e

## Using formatting to convey information {#formatting}

**Example**: highlighting cells, rows or columns that should be excluded from an analysis, leaving blank rows to indicate separations in data.
**Example**: highlighting cells, rows or columns that should be excluded from an analysis, and leaving blank rows to indicate separations in data.

![](fig/bad-formatting.png){alt='formatting'}

16 changes: 8 additions & 8 deletions episodes/03-dates-as-data.md
@@ -7,7 +7,7 @@ exercises: 10
::::::::::::::::::::::::::::::::::::::: objectives

- Recognise problematic or suspicious date formats.
- Use formulas to separate dates into their component values (e.g. Month, Day, Year).
- Use formulas to separate dates into their component values (e.g., Year, Month, Day).

::::::::::::::::::::::::::::::::::::::::::::::::::

@@ -23,9 +23,9 @@ Dates in spreadsheets are often stored in a single column.

While this seems like a logical way to record dates when you are entering them, or visually reviewing data, it's not actually a best practice for preparing data for analysis.

When working with data, your goal is to have as little ambiguity as possible. Ambiguity can creep into your data when working with dates when there are regional variations either in your observations and when you or your team might be working with different versions or suites of software products (e.g., LibreOffice, Microsoft Excel, Gnumeric).
When working with data, your goal is to have as little ambiguity as possible. Ambiguity can creep into your data when working with dates when there are regional variations either in your observations or when you or your team might be working with different versions or suites of software products (e.g., LibreOffice, Microsoft Excel, Gnumeric).

To avoid ambiguity between regional differences in date formatting and compatibility across spreadsheet software programs, a good practice is to divide dates into components in different columns - DAY, MONTH, and YEAR.
To avoid ambiguity between regional differences in date formatting and compatibility across spreadsheet software programs, a good practice is to divide dates into components in different columns - YEAR, MONTH, and DAY.

When working with dates it's also important to remember that functions are guaranteed to be compatible only within the same family of software products (e.g., LibreOffice, Microsoft Excel, Gnumeric). If you need to export your data and conserve the timestamps, you are better off handling dates using one of the solutions discussed below than the single column method.

@@ -79,7 +79,7 @@ try to type in a US format date into a UK version of Excel, it may or may not be
date.

This regional variation is one good reason to treat dates, not as a single data point, but as
three distinct pieces of data (month, day, and year). Separating dates into their component parts
three distinct pieces of data (year, month, and day). Separating dates into their component parts
will avoid this confusion, while also giving the added benefit of allowing you to compare, for
example data collected in January of multiple years with data collected in February of multiple years.

@@ -95,17 +95,17 @@ Choose the tab of the spreadsheet that corresponds to the way you format dates i
location (either day first `DD_MM_YEAR`, or month first `MM_DD_YEAR`).

Extract the components of the date to new columns. For this we
can use the built in Excel functions:
can use the built-in Excel functions:

`=YEAR()`
`=MONTH()`
`=DAY()`
`=YEAR()`

Apply each of these formulas to its entire column.
Make sure the new column is formatted as a number and not as a date.

We now have each component of our date isolated in its own column. This will allow us
to group our data with respect to month, year, or day of month for our analyses and will
to group our data with respect to year, month, or day of month for our analyses and will
also prevent problems when passing data between different versions of spreadsheet
software (as for example when sharing data with collaborators in different countries).
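
The same idea can be sketched outside a spreadsheet. The Python example below is only an illustration: the sample entry `03/04/2017` and the helper names `parse_slash_date` and `split_date` are assumptions, not part of the lesson. It shows how one typed date can mean two different days depending on the regional convention, and how storing year, month, and day as separate integers removes that ambiguity.

```python
from datetime import date


def parse_slash_date(text, day_first):
    """Parse 'a/b/yyyy', reading it as DD/MM or MM/DD depending on day_first."""
    first, second, year = (int(part) for part in text.split("/"))
    day, month = (first, second) if day_first else (second, first)
    return date(year, month, day)


def split_date(d):
    """Return (year, month, day) as plain integers, mirroring =YEAR(), =MONTH(), =DAY()."""
    return d.year, d.month, d.day


ambiguous = "03/04/2017"  # hypothetical entry typed into a single date column
as_day_first = parse_slash_date(ambiguous, day_first=True)     # read as 3 April 2017
as_month_first = parse_slash_date(ambiguous, day_first=False)  # read as 4 March 2017

print(split_date(as_day_first))
print(split_date(as_month_first))
```

Once the three components sit in their own columns, no regional setting is needed to interpret them.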

@@ -129,7 +129,7 @@ Note that this solution shows the dates in `MM_DD_YEAR` format.

Using the same spreadsheet you used for the previous exercise, add another data point
in the `interview_date` column by typing either `11/17` (if your location uses `MM/DD` formatting)
or `17/11` (if your location uses `DD/MM` formatting). The `Day`, `Month`, and `Year` columns
or `17/11` (if your location uses `DD/MM` formatting). The `Year`, `Month`, and `Day` columns
should populate for this new data point. What year is shown in the `Year` column?

::::::::::::::: solution
2 changes: 1 addition & 1 deletion episodes/05-exporting-data.md
@@ -38,7 +38,7 @@ version) isn't a good idea. Why?

- The above points also apply to other formats such as open data formats used by LibreOffice. These formats are not static and do not get parsed the same way by different software packages.

As an example of inconsistencies in data storage, do you remember how we talked about how Excel stores dates earlier? It turns out that
As an example of inconsistencies in data storage, do you remember our earlier discussion about how Excel stores dates? It turns out that
there are multiple defaults for different versions of the software, and you can switch between them all. So, say you're
compiling Excel-stored data from multiple sources. There's dates in each file- Excel interprets them as their own internally consistent
serial numbers. When you combine the data, Excel will take the serial number from the place you're importing it from, and interpret it
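
A rough Python sketch of the serial-number problem described above follows. The lesson text does not name the defaults. This example assumes they are the two date systems Excel documents, one counting from 1900 and one counting from 1904. The function name `serial_to_date` and the sample serial are hypothetical.

```python
from datetime import date, timedelta

# Assumed epochs: counting from 1899-12-30 reproduces the 1900 date system
# for serials of 61 or more (the offset absorbs Excel's fictitious
# 29 February 1900); the 1904 date system treats 1904-01-01 as day 0.
EPOCH_1900 = date(1899, 12, 30)
EPOCH_1904 = date(1904, 1, 1)


def serial_to_date(serial, date_system=1900):
    """Convert a whole-number Excel serial to a calendar date."""
    if date_system == 1900:
        if serial < 61:
            raise ValueError("serials below 61 need special handling in the 1900 system")
        return EPOCH_1900 + timedelta(days=serial)
    if date_system == 1904:
        return EPOCH_1904 + timedelta(days=serial)
    raise ValueError("date_system must be 1900 or 1904")


serial = 42736  # hypothetical value read from a date cell
read_as_1900 = serial_to_date(serial, 1900)
read_as_1904 = serial_to_date(serial, 1904)

print(read_as_1900)
print(read_as_1904)
print((read_as_1904 - read_as_1900).days)
```

The same stored number lands roughly four years apart under the two systems. This is why dates can shift silently when files with different defaults are combined.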
