Start to remove stringsAsFactors references #145

Merged
merged 1 commit into from
Oct 10, 2023
93 changes: 55 additions & 38 deletions episodes/03-data-structures-part1.Rmd
@@ -260,12 +260,9 @@ str(nordic_2$lifeExp)
str(nordic$lifeExp)
```

The data in `nordic_2$lifeExp` is stored as factors rather than
numeric. This is because of the "or" character string in the third
data point. "Factor" is R's special term for categorical data.
We will be working more with factor data later in this workshop.


The data in `nordic_2$lifeExp` is stored as a character vector, rather than as
a numeric vector. This is because of the "or" character string in the third
data point.
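
A minimal sketch of the coercion described above, using hypothetical values rather than the lesson's data file. Assuming R 4.0.0 or later, a single non-numeric entry is enough to make the whole vector character:

```r
# Hypothetical values; not taken from data/nordic-data.csv
life_exp_mixed <- c(77.2, 80, "79.0 or 83")
str(life_exp_mixed)   # every element is now a character string

# Without the stray text, the same values stay numeric
life_exp_clean <- c(77.2, 80, 79.0)
str(life_exp_clean)
```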

:::::::::::::::::::::::::

@@ -337,18 +334,17 @@ We said that columns in data frames were vectors:
```{r}
str(nordic$lifeExp)
str(nordic$year)
```

These make sense. But what about

```{r}
str(nordic$country)
```

Another important data structure is called a factor. Factors look like character
data, but are used to represent categorical information. For example, let's make
a vector of strings labeling nordic countries for all the countries in our
study:
One final important data structure in R is called a "factor". Factors look like
character data, but are used to represent data where each element of the vector
must be one of a limited number of "levels". To phrase that another way, factors
are an "enumerated" type where there are a finite number of pre-defined values
that your vector can have.
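
The "limited number of levels" idea can be sketched with a small, hypothetical factor (the `sizes` vector is an illustration, not part of the lesson data). A value outside the pre-defined levels is not accepted; R stores `NA` and issues a warning:

```r
# Hypothetical example: a factor restricted to three pre-defined levels
sizes <- factor(c("small", "large", "small"),
                levels = c("small", "medium", "large"))
levels(sizes)

# "huge" is not one of the levels, so R warns and stores NA instead
sizes[2] <- "huge"
sizes
```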

For example, let's make a vector of strings labeling nordic countries for all
the countries in our study:

```{r}
nordic_countries <- c('Norway', 'Finland', 'Denmark', 'Iceland', 'Sweden')
@@ -387,8 +383,6 @@ Can you guess why these numbers are used to represent these countries?

They are sorted in alphabetical order



:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -397,36 +391,61 @@ They are sorted in alphabetical order

## Challenge 3

Is there a factor in our `nordic` data frame? What is its name? Try using
`?read.csv` to figure out how to keep text columns as character vectors
instead of factors; then write a command or two to show that the factor in
`nordic` is actually a character vector when loaded in this way.
Convert the `country` column of our `nordic` data frame to a factor. Then try
converting it back to a character vector.

Now try converting `lifeExp` in our `nordic` data frame to a factor, then back
to a numeric vector. What happens if you use `as.numeric()`?

Remember that you can reload the `nordic` data frame using
`read.csv("data/nordic-data.csv")` if you accidentally lose some data!

::::::::::::::: solution

## Solution to Challenge 3

One solution is to use the argument `stringsAsFactors`:
Converting character vectors to factors can be done using the `factor()`
function:

```{r, eval=FALSE}
nordic <- read.csv(file = "data/nordic-data.csv", stringsAsFactors = FALSE)
str(nordic$country)
```{r}
nordic$country <- factor(nordic$country)
nordic$country
```

Another solution is to use the argument `colClasses`,
which allows finer control.
You can convert these back to character vectors using `as.character()`:

```{r, eval=FALSE}
nordic <- read.csv(file="data/nordic-data.csv", colClasses=c(NA, NA, "character"))
str(nordic$country)
```{r}
nordic$country <- as.character(nordic$country)
nordic$country
```

You can convert numeric vectors to factors in the exact same way:

```{r}
nordic$lifeExp <- factor(nordic$lifeExp)
nordic$lifeExp
```

But be careful -- you can't use `as.numeric()` to convert factors to numerics!

```{r}
as.numeric(nordic$lifeExp)
```

Instead, `as.numeric()` converts factors to those "numbers under the hood" we
talked about. To go from a factor to a number, you need to first turn the factor
into a character vector, and _then_ turn that into a numeric vector:

```{r}
nordic$lifeExp <- as.character(nordic$lifeExp)
nordic$lifeExp <- as.numeric(nordic$lifeExp)
nordic$lifeExp
```
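
The same pitfall can be seen on a small standalone factor. This is a sketch with hypothetical values, separate from the `nordic` data frame:

```r
# Hypothetical values; the levels are stored sorted as "3.2", "10.5"
f <- factor(c(10.5, 3.2, 10.5))

# Returns the integer codes of the levels, not the original values
as.numeric(f)

# Going through character first recovers the original values
as.numeric(as.character(f))
```

The first call returns the position of each element's level, which is why the result looks like small whole numbers rather than the data.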

Note: new students find the help files difficult to understand; make sure to let them know
that this is typical, and encourage them to take their best guess based on semantic meaning,
even if they aren't sure.



:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -523,15 +542,14 @@ nordic[[1]]
```

The double brace `[[1]]` returns the contents of the list item. In this case
it is the contents of the first column, a *vector* of type *factor*.
it is the contents of the first column, a *vector* of type *character*.

```{r, eval=TRUE, echo=TRUE}
nordic$country
```

This example uses the `$` character to address items by name. *coat* is the
first column of the data frame, again a *vector* of type *factor*.
X
This example uses the `$` character to address items by name. *country* is the
first column of the data frame, again a *vector* of type *character*.

```{r, eval=TRUE, echo=TRUE}
nordic["country"]
@@ -546,8 +564,7 @@ nordic[1, 1]

This example uses a single brace, but this time we provide row and column
coordinates. The returned object is the value in row 1, column 1. The object
is an *integer* but because it is part of a *vector* of type *factor*, R
displays the label "Denmark" associated with the integer value.
is a *character*: the first value of the first vector in our `nordic` object.

```{r, eval=TRUE, echo=TRUE}
nordic[, 1]
8 changes: 3 additions & 5 deletions episodes/04-data-structures-part2.Rmd
@@ -205,7 +205,7 @@ neighbors!

The object `gapminder` is a data frame with columns

- `country` and `continent` are factors.
- `country` and `continent` are character vectors.
- `year` is an integer vector.
- `pop`, `lifeExp`, and `gdpPercap` are numeric vectors.

@@ -344,8 +344,7 @@ You can create a new data frame right from within R with the following syntax:
```{r}
df <- data.frame(id = c("a", "b", "c"),
x = 1:3,
y = c(TRUE, TRUE, FALSE),
stringsAsFactors = FALSE)
y = c(TRUE, TRUE, FALSE))
```

Make a data frame that holds the following information for yourself:
@@ -365,8 +364,7 @@ time for coffee break?"
```{r}
df <- data.frame(first = c("Grace"),
last = c("Hopper"),
lucky_number = c(0),
stringsAsFactors = FALSE)
lucky_number = c(0))
df <- rbind(df, list("Marie", "Curie", 238) )
df <- cbind(df, coffeetime = c(TRUE, TRUE))
```
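
Dropping `stringsAsFactors = FALSE` in the calls above is safe on R 4.0.0 or later, where `data.frame()` keeps strings as character vectors by default. A quick check, assuming such an R version (the `df_check` and `df_factor` names are illustrative only):

```r
# Default behaviour on R >= 4.0.0: strings stay character
df_check <- data.frame(id = c("a", "b", "c"), x = 1:3)
str(df_check$id)

# Factors are still available when requested explicitly
df_factor <- data.frame(id = c("a", "b", "c"), x = 1:3,
                        stringsAsFactors = TRUE)
str(df_factor$id)
```

On older R versions the first call would return a factor, so the explicit argument would still be needed there.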
6 changes: 2 additions & 4 deletions renv/activate.R
@@ -295,8 +295,7 @@ local({
# retrieve package database
db <- tryCatch(
as.data.frame(
utils::available.packages(type = type, repos = repos),
stringsAsFactors = FALSE
utils::available.packages(type = type, repos = repos)
),
error = identity
)
@@ -557,8 +556,7 @@ local({
sep = "=",
quote = c("\"", "'"),
col.names = c("Key", "Value"),
comment.char = "#",
stringsAsFactors = FALSE
comment.char = "#"
)

vars <- as.list(release$Value)