Commit

Address #119
mikemahoney218 committed Sep 20, 2023
1 parent baf6cfe commit e0f6574
Showing 3 changed files with 60 additions and 47 deletions.
93 changes: 55 additions & 38 deletions episodes/03-data-structures-part1.Rmd
@@ -260,12 +260,9 @@ str(nordic_2$lifeExp)
str(nordic$lifeExp)
```

The data in `nordic_2$lifeExp` is stored as factors rather than
numeric. This is because of the "or" character string in the third
data point. "Factor" is R's special term for categorical data.
We will be working more with factor data later in this workshop.


The data in `nordic_2$lifeExp` is stored as a character vector, rather than as
a numeric vector. This is because of the "or" character string in the third
data point.
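
A minimal sketch of why one non-numeric entry changes the type of the whole column. The CSV text and the name `toy_csv` are made up for illustration and are not the lesson's data file; `stringsAsFactors = FALSE` is passed explicitly so the result is the same on R versions before 4.0.0.

```r
# Hypothetical three-row file: the last lifeExp entry mixes text and numbers.
csv_text <- "country,lifeExp\nDenmark,77.2\nSweden,80\nNorway,79.0 or 83"
toy_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# One unparseable entry forces the whole column to be stored as strings.
class(toy_csv$lifeExp)

# Converting to numeric turns the unparseable entry into NA
# (R would normally warn about this; the warning is silenced here).
life_exp_num <- suppressWarnings(as.numeric(toy_csv$lifeExp))
life_exp_num
```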

:::::::::::::::::::::::::

@@ -337,18 +334,17 @@ We said that columns in data frames were vectors:
```{r}
str(nordic$lifeExp)
str(nordic$year)
```

These make sense. But what about

```{r}
str(nordic$country)
```

Another important data structure is called a factor. Factors look like character
data, but are used to represent categorical information. For example, let's make
a vector of strings labeling nordic countries for all the countries in our
study:
One final important data structure in R is called a "factor". Factors look like
character data, but are used to represent data where each element of the vector
must be one of a limited number of "levels". To phrase that another way, factors
are an "enumerated" type where there are a finite number of pre-defined values
that your vector can have.
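
To make the idea of a fixed set of levels concrete, here is a small sketch. The vector `sizes` and its values are hypothetical and do not appear in the lesson.

```r
# Hypothetical factor whose allowed values are declared up front.
sizes <- factor(c("small", "large", "small"),
                levels = c("small", "medium", "large"))
levels(sizes)

# Assigning a value that is not one of the levels produces NA
# (R would normally warn about this; the warning is silenced here).
suppressWarnings(sizes[2] <- "huge")
sizes
```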

For example, let's make a vector of strings labeling nordic countries for all
the countries in our study:

```{r}
nordic_countries <- c('Norway', 'Finland', 'Denmark', 'Iceland', 'Sweden')
@@ -387,8 +383,6 @@ Can you guess why these numbers are used to represent these countries?

They are sorted in alphabetical order



:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -397,36 +391,61 @@ They are sorted in alphabetical order

## Challenge 3

Is there a factor in our `nordic` data frame? what is its name? Try using
`?read.csv` to figure out how to keep text columns as character vectors
instead of factors; then write a command or two to show that the factor in
`nordic` is actually a character vector when loaded in this way.
Convert the `country` column of our `nordic` data frame to a factor. Then try
converting it back to a character vector.

Now try converting `lifeExp` in our `nordic` data frame to a factor, then back
to a numeric vector. What happens if you use `as.numeric()`?

Remember that you can reload the `nordic` data frame using
`read.csv("data/nordic-data.csv")` if you accidentally lose some data!

::::::::::::::: solution

## Solution to Challenge 3

One solution is to use the argument `stringsAsFactors`:
Converting character vectors to factors can be done using the `factor()`
function:

```{r, eval=FALSE}
nordic <- read.csv(file = "data/nordic-data.csv", stringsAsFactors = FALSE)
str(nordic$country)
```{r}
nordic$country <- factor(nordic$country)
nordic$country
```

Another solution is to use the argument `colClasses`,
which allows finer control.
You can convert these back to character vectors using `as.character()`:

```{r, eval=FALSE}
nordic <- read.csv(file="data/nordic-data.csv", colClasses=c(NA, NA, "character"))
str(nordic$country)
```{r}
nordic$country <- as.character(nordic$country)
nordic$country
```

You can convert numeric vectors to factors in the exact same way:

```{r}
nordic$lifeExp <- factor(nordic$lifeExp)
nordic$lifeExp
```

But be careful -- you can't use `as.numeric()` to convert factors to numerics!

```{r}
as.numeric(nordic$lifeExp)
```

Instead, `as.numeric()` converts factors to those "numbers under the hood" we
talked about. To go from a factor to a number, you need to first turn the factor
into a character vector, and _then_ turn that into a numeric vector:

```{r}
nordic$lifeExp <- as.character(nordic$lifeExp)
nordic$lifeExp <- as.numeric(nordic$lifeExp)
nordic$lifeExp
```
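
The same pitfall can be reproduced without the data file. In this sketch the vector `life_exp` and its values are hypothetical.

```r
# Hypothetical life expectancy values stored as a factor.
life_exp <- factor(c("77.2", "80", "79", "77.2"))

# as.numeric() on a factor returns the integer codes of the levels,
# not the numbers printed in the labels.
codes <- as.numeric(life_exp)
codes

# Going through character first recovers the original numbers.
values <- as.numeric(as.character(life_exp))
values

# An equivalent idiom converts each level once, then indexes by the codes.
values_alt <- as.numeric(levels(life_exp))[life_exp]
values_alt
```

The last form only converts each distinct level once, which can matter for long vectors with few distinct values.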

Note: new students find the help files difficult to understand; make sure to let them know
that this is typical, and encourage them to take their best guess based on semantic meaning,
even if they aren't sure.



:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -523,15 +542,14 @@ nordic[[1]]
```

The double bracket `[[1]]` returns the contents of the list item. In this case
it is the contents of the first column, a *vector* of type *factor*.
it is the contents of the first column, a *vector* of type *character*.

```{r, eval=TRUE, echo=TRUE}
nordic$country
```

This example uses the `$` character to address items by name. *coat* is the
first column of the data frame, again a *vector* of type *factor*.
X
This example uses the `$` character to address items by name. *country* is the
first column of the data frame, again a *vector* of type *character*.

```{r, eval=TRUE, echo=TRUE}
nordic["country"]
@@ -546,8 +564,7 @@ nordic[1, 1]

This example uses a single bracket, but this time we provide row and column
coordinates. The returned object is the value in row 1, column 1. The object
is an *integer* but because it is part of a *vector* of type *factor*, R
displays the label "Denmark" associated with the integer value.
is a *character*: the first value of the first vector in our `nordic` object.
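
A rough sketch comparing what each subsetting form returns. The data frame `toy` is a hypothetical stand-in for `nordic`, built with `stringsAsFactors = FALSE` so that the text column is a character vector on any R version.

```r
# Hypothetical stand-in for the lesson's data frame.
toy <- data.frame(country = c("Denmark", "Sweden"),
                  year = c(2002, 2002),
                  stringsAsFactors = FALSE)

class(toy[[1]])        # contents of the first column: a character vector
class(toy$country)     # the same column, addressed by name
class(toy["country"])  # single brackets with a name keep the data frame
toy[1, 1]              # a single element: row 1, column 1
```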

```{r, eval=TRUE, echo=TRUE}
nordic[, 1]
8 changes: 3 additions & 5 deletions episodes/04-data-structures-part2.Rmd
@@ -205,7 +205,7 @@ neighbors!

The object `gapminder` is a data frame with columns

- `country` and `continent` are factors.
- `country` and `continent` are character vectors.
- `year` is an integer vector.
- `pop`, `lifeExp`, and `gdpPercap` are numeric vectors.

@@ -344,8 +344,7 @@ You can create a new data frame right from within R with the following syntax:
```{r}
df <- data.frame(id = c("a", "b", "c"),
x = 1:3,
y = c(TRUE, TRUE, FALSE),
stringsAsFactors = FALSE)
y = c(TRUE, TRUE, FALSE))
```

Make a data frame that holds the following information for yourself:
@@ -365,8 +364,7 @@ time for coffee break?"
```{r}
df <- data.frame(first = c("Grace"),
last = c("Hopper"),
lucky_number = c(0),
stringsAsFactors = FALSE)
lucky_number = c(0))
df <- rbind(df, list("Marie", "Curie", 238) )
df <- cbind(df, coffeetime = c(TRUE, TRUE))
```
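
Dropping `stringsAsFactors = FALSE` is safe on R 4.0.0 and later, where `FALSE` became the default for `data.frame()` and `read.csv()`. The sketch below checks this; the names `df_default` and `df_factor` are hypothetical, and the first result differs on older R versions.

```r
# On R >= 4.0.0, strings stay as character vectors by default.
df_default <- data.frame(id = c("a", "b", "c"), x = 1:3)
class(df_default$id)

# Factors are still available on request.
df_factor <- data.frame(id = c("a", "b", "c"), x = 1:3,
                        stringsAsFactors = TRUE)
class(df_factor$id)
```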
6 changes: 2 additions & 4 deletions renv/activate.R
@@ -295,8 +295,7 @@ local({
# retrieve package database
db <- tryCatch(
as.data.frame(
utils::available.packages(type = type, repos = repos),
stringsAsFactors = FALSE
utils::available.packages(type = type, repos = repos)
),
error = identity
)
@@ -557,8 +556,7 @@ local({
sep = "=",
quote = c("\"", "'"),
col.names = c("Key", "Value"),
comment.char = "#",
stringsAsFactors = FALSE
comment.char = "#"
)

vars <- as.list(release$Value)
