Address #119

datacarpentry · Sep 20, 2023 · e0f6574
1 parent baf6cfe
commit e0f6574

Showing 3 changed files with 60 additions and 47 deletions.
diff --git a/episodes/03-data-structures-part1.Rmd b/episodes/03-data-structures-part1.Rmd
@@ -260,12 +260,9 @@ str(nordic_2$lifeExp)
 str(nordic$lifeExp)
 ```
 
-The data in `nordic_2$lifeExp` is stored as factors rather than
-numeric. This is because of the "or" character string in the third
-data point. "Factor" is R's special term for categorical data.
-We will be working more with factor data later in this workshop.
-
-
+The data in `nordic_2$lifeExp` is stored as a character vector, rather than as
+a numeric vector. This is because of the "or" character string in the third
+data point.
 
 :::::::::::::::::::::::::
 
@@ -337,18 +334,17 @@ We said that columns in data frames were vectors:
 ```{r}
 str(nordic$lifeExp)
 str(nordic$year)
-```
-
-These make sense. But what about
-
-```{r}
 str(nordic$country)
 ```
 
-Another important data structure is called a factor. Factors look like character
-data, but are used to represent categorical information. For example, let's make
-a vector of strings labeling nordic countries for all the countries in our
-study:
+One final important data structure in R is called a "factor". Factors look like 
+character data, but are used to represent data where each element of the vector
+must be one of a limited number of "levels". To phrase that another way, factors
+are an "enumerated" type where there are a finite number of pre-defined values
+that your vector can have. 
+
+For example, let's make a vector of strings labeling nordic countries for all 
+the countries in our study:
 
 ```{r}
 nordic_countries <- c('Norway', 'Finland', 'Denmark', 'Iceland', 'Sweden')
@@ -387,8 +383,6 @@ Can you guess why these numbers are used to represent these countries?
 
 They are sorted in alphabetical order
 
-
-
 :::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -397,36 +391,61 @@ They are sorted in alphabetical order
 
 ## Challenge 3
 
-Is there a factor in our `nordic` data frame? what is its name? Try using
-`?read.csv` to figure out how to keep text columns as character vectors
-instead of factors; then write a command or two to show that the factor in
-`nordic` is actually a character vector when loaded in this way.
+Convert the `country` column of our `nordic` data frame to a factor. Then try
+converting it back to a character vector. 
+
+Now try converting `lifeExp` in our `nordic` data frame to a factor, then back
+to a numeric vector. What happens if you use `as.numeric()`?
+
+Remember that you can reload the `nordic` data frame using 
+`read.csv("data/nordic-data.csv")` if you accidentally lose some data!
 
 :::::::::::::::  solution
 
 ## Solution to Challenge 3
 
-One solution is use the argument `stringAsFactors`:
+Converting character vectors to factors can be done using the `factor()` 
+function:
 
-```{r, eval=FALSE}
-nordic <- read.csv(file = "data/nordic-data.csv", stringsAsFactors = FALSE)
-str(nordic$country)
+```{r}
+nordic$country <- factor(nordic$country)
+nordic$country
 ```
 
-Another solution is use the argument `colClasses`
-that allow finer control.
+You can convert these back to character vectors using `as.character()`:
 
-```{r, eval=FALSE}
-nordic <- read.csv(file="data/nordic-data.csv", colClasses=c(NA, NA, "character"))
-str(nordic$country)
+```{r}
+nordic$country <- as.character(nordic$country)
+nordic$country
+```
+
+You can convert numeric vectors to factors in the exact same way:
+
+```{r}
+nordic$lifeExp <- factor(nordic$lifeExp)
+nordic$lifeExp
+```
+
+But be careful -- you can't use `as.numeric()` to convert factors to numerics!
+
+```{r}
+as.numeric(nordic$lifeExp)
+```
+
+Instead, `as.numeric()` converts factors to those "numbers under the hood" we 
+talked about. To go from a factor to a number, you need to first turn the factor
+into a character vector, and _then_ turn that into a numeric vector:
+
+```{r}
+nordic$lifeExp <- as.character(nordic$lifeExp)
+nordic$lifeExp <- as.numeric(nordic$lifeExp)
+nordic$lifeExp
 ```
 
 Note: new students find the help files difficult to understand; make sure to let them know
 that this is typical, and encourage them to take their best guess based on semantic meaning,
 even if they aren't sure.
 
-
-
 :::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -523,15 +542,14 @@ nordic[[1]]
 ```
 
 The double brace `[[1]]` returns the contents of the list item. In this case
-it is the contents of the first column, a *vector* of type *factor*.
+it is the contents of the first column, a *vector* of type *character*.
 
 ```{r, eval=TRUE, echo=TRUE}
 nordic$country
 ```
 
-This example uses the `$` character to address items by name. *coat* is the
-first column of the data frame, again a *vector* of type *factor*.
-X
+This example uses the `$` character to address items by name. *country* is the
+first column of the data frame, again a *vector* of type *character*.
 
 ```{r, eval=TRUE, echo=TRUE}
 nordic["country"]
@@ -546,8 +564,7 @@ nordic[1, 1]
 
 This example uses a single brace, but this time we provide row and column
 coordinates. The returned object is the value in row 1, column 1. The object
-is an *integer* but because it is part of a *vector* of type *factor*, R
-displays the label "Denmark" associated with the integer value.
+is a *character*: the first value of the first vector in our `nordic` object.
 
 ```{r, eval=TRUE, echo=TRUE}
 nordic[, 1]
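
Note on the new Challenge 3 solution in the diff above: the factor-to-numeric pitfall can be reproduced without the lesson data. The following is a minimal sketch; the values in `life_exp` are made up for illustration and are not taken from `data/nordic-data.csv`.

```r
# Hypothetical life expectancy values (not from the lesson's CSV file)
life_exp <- c(77.2, 80.0, 79.0)
life_exp_factor <- factor(life_exp)

# The levels are the sorted unique values, stored as character strings
levels(life_exp_factor)

# as.numeric() on a factor returns the integer codes "under the hood",
# i.e. each element's position among the levels, not the original values
codes <- as.numeric(life_exp_factor)
codes

# Converting to character first, then to numeric, recovers the values
recovered <- as.numeric(as.character(life_exp_factor))
recovered
```

Here `codes` holds the level positions rather than the measurements, which is why the solution converts through `as.character()` first.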

diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
@@ -205,7 +205,7 @@ neighbors!
 
 The object `gapminder` is a data frame with columns
 
-- `country` and `continent` are factors.
+- `country` and `continent` are character vectors.
 - `year` is an integer vector.
 - `pop`, `lifeExp`, and `gdpPercap` are numeric vectors.
 
@@ -344,8 +344,7 @@ You can create a new data frame right from within R with the following syntax:
 ```{r}
 df <- data.frame(id = c("a", "b", "c"),
                  x = 1:3,
-                 y = c(TRUE, TRUE, FALSE),
-                 stringsAsFactors = FALSE)
+                 y = c(TRUE, TRUE, FALSE))
 ```
 
 Make a data frame that holds the following information for yourself:
@@ -365,8 +364,7 @@ time for coffee break?"
 ```{r}
 df <- data.frame(first = c("Grace"),
                  last = c("Hopper"),
-                 lucky_number = c(0),
-                 stringsAsFactors = FALSE)
+                 lucky_number = c(0))
 df <- rbind(df, list("Marie", "Curie", 238) )
 df <- cbind(df, coffeetime = c(TRUE, TRUE))
 ```
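
Note on dropping `stringsAsFactors = FALSE` in the hunks above: the commit does not state its reasoning, but the change is consistent with R 4.0.0 and later, where `data.frame()` and `read.csv()` default to `stringsAsFactors = FALSE`. A minimal sketch, assuming R >= 4.0.0 (the names `df_default` and `df_factor` are illustrative only):

```r
# Assumes R >= 4.0.0, where stringsAsFactors defaults to FALSE
df_default <- data.frame(id = c("a", "b", "c"), x = 1:3)
class(df_default$id)   # strings stay character under the new default

# Factors are still available by opting in explicitly
df_factor <- data.frame(id = c("a", "b", "c"), x = 1:3,
                        stringsAsFactors = TRUE)
class(df_factor$id)
```

On R versions before 4.0.0 the first call would produce a factor column instead, so the removed argument only becomes redundant on newer installations.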

diff --git a/renv/activate.R b/renv/activate.R
@@ -295,8 +295,7 @@ local({
         # retrieve package database
         db <- tryCatch(
           as.data.frame(
-            utils::available.packages(type = type, repos = repos),
-            stringsAsFactors = FALSE
+            utils::available.packages(type = type, repos = repos)
           ),
           error = identity
         )
@@ -557,8 +556,7 @@ local({
       sep              = "=",
       quote            = c("\"", "'"),
       col.names        = c("Key", "Value"),
-      comment.char     = "#",
-      stringsAsFactors = FALSE
+      comment.char     = "#"
     )
 
     vars <- as.list(release$Value)