From e0f657494b4de03fd3322c3e339465873c0bcb52 Mon Sep 17 00:00:00 2001
From: Mike Mahoney
Date: Wed, 20 Sep 2023 15:09:02 -0400
Subject: [PATCH] Address #119

---
 episodes/03-data-structures-part1.Rmd | 93 ++++++++++++++++-----------
 episodes/04-data-structures-part2.Rmd |  8 +--
 renv/activate.R                       |  6 +-
 3 files changed, 60 insertions(+), 47 deletions(-)

diff --git a/episodes/03-data-structures-part1.Rmd b/episodes/03-data-structures-part1.Rmd
index 6baab073..17dda20a 100644
--- a/episodes/03-data-structures-part1.Rmd
+++ b/episodes/03-data-structures-part1.Rmd
@@ -260,12 +260,9 @@ str(nordic_2$lifeExp)
 str(nordic$lifeExp)
 ```
 
-The data in `nordic_2$lifeExp` is stored as factors rather than
-numeric. This is because of the "or" character string in the third
-data point. "Factor" is R's special term for categorical data.
-We will be working more with factor data later in this workshop.
-
-
+The data in `nordic_2$lifeExp` is stored as a character vector, rather than as
+a numeric vector. This is because of the "or" character string in the third
+data point.
 
 :::::::::::::::::::::::::
 
@@ -337,18 +334,17 @@ We said that columns in data frames were vectors:
 ```{r}
 str(nordic$lifeExp)
 str(nordic$year)
-```
-
-These make sense. But what about
-
-```{r}
 str(nordic$country)
 ```
 
-Another important data structure is called a factor. Factors look like character
-data, but are used to represent categorical information. For example, let's make
-a vector of strings labeling nordic countries for all the countries in our
-study:
+One final important data structure in R is called a "factor". Factors look like
+character data, but are used to represent data where each element of the vector
+must be one of a limited number of "levels". To phrase that another way, factors
+are an "enumerated" type where there are a finite number of pre-defined values
+that your vector can have.
+
+For example, let's make a vector of strings labeling nordic countries for all
+the countries in our study:
 
 ```{r}
 nordic_countries <- c('Norway', 'Finland', 'Denmark', 'Iceland', 'Sweden')
@@ -387,8 +383,6 @@ Can you guess why these numbers are used to represent these countries?
 
 They are sorted in alphabetical order
 
-
-
 :::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -397,36 +391,61 @@ They are sorted in alphabetical order
 
 ## Challenge 3
 
-Is there a factor in our `nordic` data frame? what is its name? Try using
-`?read.csv` to figure out how to keep text columns as character vectors
-instead of factors; then write a command or two to show that the factor in
-`nordic` is actually a character vector when loaded in this way.
+Convert the `country` column of our `nordic` data frame to a factor. Then try
+converting it back to a character vector.
+
+Now try converting `lifeExp` in our `nordic` data frame to a factor, then back
+to a numeric vector. What happens if you use `as.numeric()`?
+
+Remember that you can reload the `nordic` data frame using
+`read.csv("data/nordic-data.csv")` if you accidentally lose some data!
 
 ::::::::::::::: solution
 
 ## Solution to Challenge 3
 
-One solution is use the argument `stringAsFactors`:
+Converting character vectors to factors can be done using the `factor()`
+function:
 
-```{r, eval=FALSE}
-nordic <- read.csv(file = "data/nordic-data.csv", stringsAsFactors = FALSE)
-str(nordic$country)
+```{r}
+nordic$country <- factor(nordic$country)
+nordic$country
 ```
 
-Another solution is use the argument `colClasses`
-that allow finer control.
+You can convert these back to character vectors using `as.character()`:
 
-```{r, eval=FALSE}
-nordic <- read.csv(file="data/nordic-data.csv", colClasses=c(NA, NA, "character"))
-str(nordic$country)
+```{r}
+nordic$country <- as.character(nordic$country)
+nordic$country
+```
+
+You can convert numeric vectors to factors in the exact same way:
+
+```{r}
+nordic$lifeExp <- factor(nordic$lifeExp)
+nordic$lifeExp
+```
+
+But be careful -- you can't use `as.numeric()` to convert factors to numerics!
+
+```{r}
+as.numeric(nordic$lifeExp)
+```
+
+Instead, `as.numeric()` converts factors to those "numbers under the hood" we
+talked about. To go from a factor to a number, you need to first turn the factor
+into a character vector, and _then_ turn that into a numeric vector:
+
+```{r}
+nordic$lifeExp <- as.character(nordic$lifeExp)
+nordic$lifeExp <- as.numeric(nordic$lifeExp)
+nordic$lifeExp
 ```
 
 Note: new students find the help files difficult to understand; make sure to let them know
 that this is typical, and encourage them to take their best guess based on semantic meaning,
 even if they aren't sure.
 
-
-
 :::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -523,15 +542,14 @@ nordic[[1]]
 ```
 
 The double brace `[[1]]` returns the contents of the list item. In this case
-it is the contents of the first column, a *vector* of type *factor*.
+it is the contents of the first column, a *vector* of type *character*.
 
 ```{r, eval=TRUE, echo=TRUE}
 nordic$country
 ```
 
-This example uses the `$` character to address items by name. *coat* is the
-first column of the data frame, again a *vector* of type *factor*.
-X
+This example uses the `$` character to address items by name. *country* is the
+first column of the data frame, again a *vector* of type *character*.
 
 ```{r, eval=TRUE, echo=TRUE}
 nordic["country"]
@@ -546,8 +564,7 @@ nordic[1, 1]
 ```
 
 This example uses a single brace, but this time we provide row and column coordinates.
 The returned object is the value in row 1, column 1. The object
-is an *integer* but because it is part of a *vector* of type *factor*, R
-displays the label "Denmark" associated with the integer value.
+is an *character*: the first value of the first vector in our `nordic` object.
 
 ```{r, eval=TRUE, echo=TRUE}
 nordic[, 1]
diff --git a/episodes/04-data-structures-part2.Rmd b/episodes/04-data-structures-part2.Rmd
index b6ccbf68..2ce4db0d 100644
--- a/episodes/04-data-structures-part2.Rmd
+++ b/episodes/04-data-structures-part2.Rmd
@@ -205,7 +205,7 @@ neighbors!
 
 The object `gapminder` is a data frame with columns
 
-- `country` and `continent` are factors.
+- `country` and `continent` are character vectors.
 - `year` is an integer vector.
 - `pop`, `lifeExp`, and `gdpPercap` are numeric vectors.
 
@@ -344,8 +344,7 @@ You can create a new data frame right from within R with the following syntax:
 ```{r}
 df <- data.frame(id = c("a", "b", "c"),
                  x = 1:3,
-                 y = c(TRUE, TRUE, FALSE),
-                 stringsAsFactors = FALSE)
+                 y = c(TRUE, TRUE, FALSE))
 ```
 
 Make a data frame that holds the following information for yourself:
@@ -365,8 +364,7 @@ time for coffee break?"
 ```{r}
 df <- data.frame(first = c("Grace"),
                  last = c("Hopper"),
-                 lucky_number = c(0),
-                 stringsAsFactors = FALSE)
+                 lucky_number = c(0))
 df <- rbind(df, list("Marie", "Curie", 238) )
 df <- cbind(df, coffeetime = c(TRUE, TRUE))
 ```
diff --git a/renv/activate.R b/renv/activate.R
index a8fdc320..43c41d8b 100644
--- a/renv/activate.R
+++ b/renv/activate.R
@@ -295,8 +295,7 @@ local({
         # retrieve package database
         db <- tryCatch(
           as.data.frame(
-            utils::available.packages(type = type, repos = repos),
-            stringsAsFactors = FALSE
+            utils::available.packages(type = type, repos = repos)
           ),
           error = identity
         )
@@ -557,8 +556,7 @@ local({
       sep = "=",
       quote = c("\"", "'"),
       col.names = c("Key", "Value"),
-      comment.char = "#",
-      stringsAsFactors = FALSE
+      comment.char = "#"
     )
 
     vars <- as.list(release$Value)