---
title: "Bio720_DataMunging in base R"
author: "Ian Dworkin"
date: "`r format(Sys.time(),'%d %b %Y')`"
output:
  html_document:
    keep_md: yes
    number_sections: yes
    toc: yes
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
options(digits = 3)
```
# Data parsing and munging in Base R
## Brief introduction
One of the most common "programming" activities you will do in genomics and bioinformatics is getting your data into a format that facilitates (and makes possible) the analyses you are doing. This activity can occupy a very large fraction of your time.
These activities have many names: *wrangling data, munging data, cleaning data, data manipulation*. Given how important this is, it will come as no surprise that there are entire R libraries devoted to it. There are several lists that are worth looking at like [here](http://datascienceplus.com/best-packages-for-data-manipulation-in-r/) and [here](https://www.analyticsvidhya.com/blog/2015/12/faster-data-manipulation-7-packages/). Also, books like [Data Manipulation with R](https://www.amazon.ca/Data-Manipulation-R-Phil-Spector/dp/0387747303)!
## Why no tidyverse in this tutorial
You have already been introduced to a little bit of the use of the *dplyr* library and verbs within it like `filter()`, `arrange()`, `group_by()` and `summarize()` (one other important one is `select()`). You will have more opportunities to learn aspects of the tidyverse (more on dplyr and ggplot2 in particular). However, today I am not going to show you data munging this way, instead relying on base R techniques.
Why? tidyverse libraries are excellent in many ways, *but* they do have some issues. First and foremost, they carry many dependencies: loading library "a" may also pull in libraries "b", "c", "d", and so on, which can add up quickly. Second, unlike base R, where the code is very (probably too) conservative and maintainers alter it in a manner that is very unlikely to change downstream behaviour (i.e. code in other libraries), tidyverse libraries **change more rapidly** (and thus, while still unlikely, are more likely to break existing code).
Currently, *[Bioconductor](http://bioconductor.org/)* objects don't always "play nice" with tidyverse tibbles and approaches. These are essentially two independently evolving "ecosystems of syntax" (really macro languages) in *R*. There is a move to integrate these better. But at this time, if you are planning on using libraries from *Bioconductor* for genomic and bioinformatic analysis it is important to know how to work with data in base R.
It is worth thinking about when to use tidyverse, and when not to.
If you are writing scripts that are just for your own data analysis (i.e. you are not writing a general purpose library) and are not interacting with *Bioconductor* objects too much (or don't mind altering objects from one to the other), then the *tidyverse* is an extremely smart choice. Fast to learn, relatively consistent (in comparison to Base *R*), and often faster than base *R*.
## *data.table* library for fast import and execution of large data sets.
Finally, it is worth pointing out that the *data.table* library is not only usually the most efficient way to import large data sets (often by a wide margin), but data munging (sorting, subsetting, etc.) can also be done via this library with its own syntax, and for large datasets it is usually the fastest. Datacamp has a course on this [here](https://www.datacamp.com/courses/data-table-data-manipulation-r-tutorial). Like `readr` and *tibbles* in the tidyverse, a `data.table` is a slightly different object from a standard `data.frame`. If you are working with large datasets in `R`, knowing this library is essential!
## for your own review
We have already spent considerable time in previous tutorials making data frames using `cbind`, `rbind`, `data.frame`, etc. As we have limited time in class, I leave review of that to you, via the screencasts and activities at datacamp.
## install packages.
You may wish to install certain `R` libraries that get used a lot. While RStudio has a number of easy ways to install libraries, I still prefer having it in the script. You only need to install a library once, although it is worth updating libraries regularly, and I re-install them every time I update `R`.
```{r, eval = FALSE}
# remove hash to install.
#install.packages("data.table")
```
## Data we are using.
We will use a number of different data sets to illustrate each of the points, but the main one is an old *Drosophila melanogaster* data set with measures of several traits (lengths) and the number of sex comb teeth (a structure used to clasp females during copulation) for different wild type strains (line) reared at different developmental temperatures (temp), with and without a mutation that affects proximal-distal axis development in limbs (genotype). Note that we are downloading the data from a data repository (Dryad). If you really care, it is from [this](https://www.ncbi.nlm.nih.gov/pubmed/15733306) paper.
```{r, eval = TRUE}
dll_data = read.csv("http://beaconcourse.pbworks.com/f/dll.csv",
                    header = TRUE, stringsAsFactors = TRUE)
```
Before we go on, how should we look at the data to make sure it imported correctly, and the structure (and other information) about the object we have just created?
```{r}
# How do you want to check your data?
```
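For reference, here is one possible answer (any combination of these is fine); this is just a sketch of the usual first checks:

```{r}
str(dll_data)      # structure: variable names, classes, and the first few values
head(dll_data)     # first six rows
dim(dll_data)      # number of rows and columns
summary(dll_data)  # per-variable summaries (means, ranges, factor counts)
```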
- Since we are using base *R*, this is a regular data.frame.
## Cleaning data
### removing missing data
Sometimes your data set has missing data: for some reason you could not measure one of your variables for a particular observation. How to deal with missing data can be a big topic, but for the moment we are going to assume you want to delete rows that contain missing data.
First, let's check whether there is any missing data. Use `is.na()` to find out. How many observations have missing data?
```{r}
is.na(dll_data)
sum(is.na(dll_data))
anyNA(dll_data)
```
More generally, `anyNA()` (shown in the chunk above) gives a quick TRUE/FALSE answer for the whole object.
### Why missing data can cause headaches
```{r}
mean(dll_data$femur)
```
- Since there is some missing data, the `mean()` function returns `NA` to warn you. This is useful behaviour.
```{r}
mean(dll_data$femur, na.rm = TRUE)
```
### What to do with rows containing missing data
We need to decide what to do with the missing data. If you have lots of data and do not mind removing all cases (rows) that contain **any** missing data, then `na.omit()` is a useful solution. How does using `na.omit()` change the number of observations in *dll_data*? Don't erase the original data!
There are numerous options about how `R` should handle operations if an `NA` is found. You can look at the help under `?NA`.
For the moment, we are going to accept the approach above, and use it. However, I don't generally recommend this. Instead, you can first subset the variables you are going to use, and then decide how to deal with the missing data. Or you can allow them into your analysis, but make sure to set the right flags in functions that enable them to deal with missing data.
```{r}
dll_data <- na.omit(dll_data)
mean(dll_data$femur)
```
- Note that the mean now is not the same as before. This is because additional variables (not just femur) had missing data for different observations, but we removed any observation with any missing data!
### finding and removing duplicate rows in a data frame.
The `duplicated()` function allows you to see which rows, if any, are perfect duplicates of one another.
We can first ask if any rows are duplicates. Use `duplicated` to check.
```{r}
sum(duplicated(dll_data))
```
How might we count how many rows are duplicated? And how might you get the answer as a single Boolean (hint: use `any()`)?
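One possible answer, sketched: wrap `duplicated()` in `sum()` for a count, or in `any()` for a single Boolean.

```{r}
sum(duplicated(dll_data))  # count of duplicated rows
any(duplicated(dll_data))  # TRUE if at least one duplicate exists
```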
`R` has a convenience function for the above:
```{r}
anyDuplicated(dll_data)              # index of the first duplicated row, or 0 if none
dll_data[anyDuplicated(dll_data), ]  # empty data frame when there are none
```
So for this example there were no duplicate rows.
However, let's say an accident happened in generating the data set and a few rows were duplicated. I am going to use the `sample()` function to extract 5 random rows and append them.
```{r, eval = TRUE}
new_rows <- dll_data[sample(nrow(dll_data), size = 5, replace = T ),]
dll_data2 <- rbind(dll_data, new_rows)
str(dll_data2)
```
Now you don't know which rows they are. Please show how you would find whether there are any duplicated rows, and which ones they are, using `duplicated()`.
```{r}
sum(duplicated(dll_data2))
dll_data2[duplicated(dll_data2), ]
```
So what should we do? We can use the `unique()` function to remove these duplicated rows, if we know for certain that they do not belong in the data set (i.e. they were duplicated in error).
```{r}
dll_data_unique <- unique(dll_data2)
dim(dll_data_unique)
dim(dll_data2)
dim(dll_data)
```
Let's just do a bit of clean-up.
```{r}
rm(dll_data_unique, dll_data2)
```
## Subsetting data sets (review)
In some sense this is a review of what we learned during the first week, but there are a few details that are worth considering.
Let's look at the data again:
```{r}
str(dll_data)
```
As we can see there are two genotypes. Let's say we wanted to make a data frame (`dll_data_wt`) that was a subset with only the wild type (`wt`) data? How would you do this with the index?
### using the index
```{r}
dll_data_wt <- ?
```
Has this created the appropriate subset of data? How would you check?
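One way to do this with the index (a sketch), along with a couple of quick checks against the full data:

```{r}
dll_data_wt <- dll_data[dll_data$genotype == "wt", ]
nrow(dll_data_wt)                # compare this count with ...
with(dll_data, table(genotype))  # ... the genotype counts in the full data
```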
### subsetting on factors and removing unused levels.
So that seems to match. We should only have one level now in `genotype`. Let's check.
```{r}
levels(dll_data_wt$genotype)
```
So what is going on?
We know we have the correct number of observations, but the factor still has the level `Dll`. Why? Because of the way factors are stored, the full set of levels stays attached to `genotype` even after subsetting. Thankfully it is very easy to fix. Use the `droplevels()` function to drop unused levels.
```{r}
dll_data_wt <-
levels(dll_data_wt$genotype)
```
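A possible completion of the chunk above, sketched:

```{r}
dll_data_wt <- droplevels(dll_data_wt)
levels(dll_data_wt$genotype)  # now only "wt" remains
```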
### using the subset function
Of course the easiest alternative is to use the `subset` function we introduced a few weeks back. Please use this to make a data frame (`dll_data_Dll`) for the *Dll* factor level of genotype and check what has happened to the *wt* factor level. Try that here.
```{r, eval = FALSE}
dll_data_Dll <- ?
dim(dll_data_Dll)
with(dll_data, table(genotype))
levels(dll_data_Dll$genotype)
dll_data_Dll <- droplevels(dll_data_Dll)
```
`subset()`, like indexing, also allows you to select specific columns. Create a new version of `dll_data_Dll` in which the only remaining columns are line, genotype, temp and SCT.
```{r}
dll_data_Dll <- ?
dim(dll_data_Dll)
```
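Possible answers for both `subset()` exercises, sketched: the first argument filters rows, and the `select` argument picks columns.

```{r}
dll_data_Dll <- subset(dll_data, genotype == "Dll")  # rows for the Dll genotype
dll_data_Dll <- droplevels(dll_data_Dll)             # drop the unused "wt" level
dll_data_Dll <- subset(dll_data, genotype == "Dll",
                       select = c(line, genotype, temp, SCT))
dim(dll_data_Dll)  # should now have 4 columns
```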
### the `%in%` matching operator
`%in%` can be very useful for finding things in your data set to work on.
The `%in%` operator is a matching operator (also see `match()`) that returns a Boolean (TRUE or FALSE) for each element. This can be really useful for finding things. Go ahead and find all instances where a couple of the lines, `c("line-Sam", "line-1")`, occur in `dll_data$line`. I recommend first creating a dummy logical variable that can then be used to filter via the index `[]`.
```{r}
matched_set <- ?
dll_data_new_subset <- ?
dim(dll_data_new_subset)
```
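One possible answer, sketched: build the logical vector first, then use it as the row index.

```{r}
matched_set <- dll_data$line %in% c("line-Sam", "line-1")  # TRUE/FALSE per row
dll_data_new_subset <- dll_data[matched_set, ]
dim(dll_data_new_subset)
```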
### Clean up before moving on.
Let's remove the data frames we are not using.
```{r}
rm(dll_data_Dll, dll_data_wt, dll_data_new_subset, matched_set)
```
## Cleaning variable names
Let's take a closer look at the names of the different fly strains used.
```{r}
levels(dll_data$line)
```
Some are numbers, some are letters. We decide that we need to clean this up a bit. First off, we have no need for the "line-" prefix on each of these labels, and it may make things confusing down the road, so we want to get rid of it. However, we **don't** want to edit the spreadsheet: most importantly because editing the raw data is bad scientific practice, but also because it would be a lot of work. So how do we do it efficiently?
We actually have a couple of options.
First thing to keep in mind though is that `dll_data$line` is currently stored as an object of class `factor`. Why is this important?
### `substr()`
Let's try some simple string manipulation. There are a few functions that could be useful here. Let's start with `substr()` (substring), which extracts or replaces substrings within a character vector. So what do we need to do first to this variable (currently stored as a factor)?
```{r}
line_str <- ?
str(line_str)
head(line_str)
```
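A possible answer, sketched: string functions expect a character vector, so the factor needs to be converted first.

```{r}
line_str <- as.character(dll_data$line)  # factor -> character vector
```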
Now we can use `substr()`, which works by position. Look up `?substr`; in the call you will want to use `stop = 1000000L`.
```{r}
line_names <- ?
head(line_names)
tail(line_names)
```
In this case, because the strings we want to keep end at different positions (the line names differ in length), we can't use a fixed short stop (try `stop = 7` to see why).
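One possible answer, sketched: "line-" is always 5 characters, so the part we want starts at position 6, and a very large `stop` keeps everything to the end of each string.

```{r}
line_names <- substr(line_str, start = 6, stop = 1000000L)
```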
### `strsplit()`
An alternative is to use the fact that we (purposefully) used a delimiter character to separate parts of the name: here, "-" separates the word "line" from the actual line name we care about. So we can use a second, very valuable function, `strsplit()`. Use `strsplit()` to create a new object, splitting on the "-".
```{r}
line_names2 <- ?
head(line_names2)
```
Here it has created a list, where each element contains the 2 pieces ("line" and the name of the line, which is what we want). We are going to convert this list to a matrix so we can extract what we need.
```{r}
line_names2_mat <- ?
head(line_names2_mat)
```
And the second column contains what we want (line names) so we could make a new variable.
```{r}
dll_data$line_names <- ?
str(dll_data$line_names)
```
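For reference, the three steps above can be sketched as one pipeline: split each name on "-", bind the resulting list into a two-column matrix, and keep the second column.

```{r}
line_names2 <- strsplit(line_str, split = "-")       # list of c("line", name)
line_names2_mat <- matrix(unlist(line_names2),
                          ncol = 2, byrow = TRUE)    # one row per observation
dll_data$line_names <- factor(line_names2_mat[, 2])  # second column holds the names
```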
As an aside, in my lab I enforce an approach for imaging where all images, videos, or sequence files have a consistent (and identical within project) naming convention. For instance our wing images have a naming convention like initials_project_line_sex_genotype_rep. So for images it may be something like:
`ID_GBE_SAM_F_WT_01.tif`
Here ID = Ian Dworkin (initials), GBE = project name, SAM is the line name, etc. This enables me to import the file names as variables and use `strsplit()`, like we did above, to automatically populate the variables for the experiment. No need to try to separate things in a spreadsheet program!
### `gsub`
Many times, however, there is no consistent delimiter like "-", nor can we assume the part we want sits at a fixed position in the string. So sometimes we need to use some simple regular expressions. `R` has a number of useful regular expression tools in the `base` library (`gsub`, `grep`, `grepl`, etc.). I suggest using `?grep` to take a deeper look. Here we will encounter just a few.
`sub()` and `gsub()` perform replacements for the first match, or globally (the "g") for all matches, respectively. If we go back to the names of the original lines, we will see they all have `line-` in common, so perhaps we can search and replace based on that pattern. This is quite easy as well. Use `gsub()` to extract just the line name, i.e. without the leading "line-".
```{r}
str(line_str)
line_names3 <- ?
head(line_names3)
tail(line_names3)
```
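One possible answer, sketched: replace the "line-" pattern with an empty string.

```{r}
line_names3 <- gsub(pattern = "line-", replacement = "", x = line_str)
```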
### Some other things to help "clean" up names, etc..
You may have noticed that, for the line names which are short words, some are lower case and some are upper case. This inconsistency may cause an issue down the road. So let's say we wanted to take `line_names3` and make all of the words lower case. How would we do that? `R` has some nice functions to help, namely `tolower()` (and `toupper()`) and the more general `chartr()` for character translation.
```{r}
line_names3 <- tolower(line_names3)
head(line_names3)
tail(line_names3, n = 25)
```
Say we wanted to call the "sam" line by its proper name, "Samarkand". How could we do this with `sub()`?
```{r}
line_names3 <- ?
```
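One possible answer, sketched. Since "sam" appears at most once per element, `sub()` is enough (note that this would also match "sam" inside a longer name, so check the result).

```{r}
line_names3 <- sub("sam", "Samarkand", line_names3)
```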
#### Clean up before moving on.
```{r}
rm(line_names2, line_names3, line_names, line_names2_mat, line_str, new_rows)
```
### renaming variables, factors etc..
For this experiment, all flies were reared during their development at two temperatures, 25C and 30C. Currently this is kept as an integer.
```{r}
str(dll_data$temp)
```
However, we might want to treat this as a factor. Also, your PI has decided they want you to encode it as "HighTemp" and "LowTemp". So you have decided to create a new variable to do so. Not surprisingly, you have a few different ways to do this. The most obvious (and what you have done before) is to generate a factor. You could make the variable part of the data frame object, but I am going to keep it separate for now.
Use the `factor()` function to generate a new variable `temp_as_factor` with factor labels "LowTemp", "HighTemp".
```{r}
temp_as_factor <- ?
str(temp_as_factor)
head(temp_as_factor)
tail(temp_as_factor)
```
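One possible answer, sketched: the levels sort numerically (25 before 30), so the labels must be supplied in that order.

```{r}
temp_as_factor <- factor(dll_data$temp,
                         levels = c(25, 30),
                         labels = c("LowTemp", "HighTemp"))
```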
### Using ifelse()
An alternative approach would be to use the `ifelse()` function. Create a new variable (that achieves the same thing) using `ifelse()`
```{r}
temp_as_factor2 <- ?
temp_as_factor2 <- factor(temp_as_factor2)
str(temp_as_factor2)
```
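One possible completion of the `?` above, sketched (the `factor()` call in the chunk then converts the character result):

```{r}
temp_as_factor2 <- ifelse(dll_data$temp == 25, "LowTemp", "HighTemp")
```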
One issue with this approach is that nested `ifelse()` calls, like nested `for` loops, can get messy, and there is a limit to how much nesting is practical (for both `R` and the reader), so you want to take some care in using this approach.
## Sorting data frames
Sometimes you just want to sort your data frame. While this is rarely necessary for statistical analyses or plotting, it is important for looking at the raw data as a double (triple!!!) check that there are no typos or issues you were not aware of. In genomic analyses this matters most when you are looking at the summary of a statistical analysis and want to sort the output by, say, the genes with the greatest effects. While there is a `sort()` function in `R`, it is not generally the way you want to approach this.
```{r}
head(sort(dll_data$SCT))
tail(sort(dll_data$SCT))
```
This just sorts the vector, which does not help us much. Instead we are going to use the `order()` function, which returns the index of each observation in sorted order.
```{r}
head(order(dll_data$SCT))
tail(order(dll_data$SCT))
```
This gives the row numbers, which we can use to sort via the *index*. Use `order()` to sort the dll_data frame into a new object (`dll_data_sorted`) based on `dll_data$SCT`.
```{r}
dll_data_sorted <- ?
head(dll_data_sorted)
```
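One possible answer, sketched: use the output of `order()` as the row index.

```{r}
dll_data_sorted <- dll_data[order(dll_data$SCT), ]
```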
How would you sort on tibia?
### sorting from highest to lowest
So this by default sorts from lowest to highest. Use the help function for `order` and sort the data from highest to lowest.
```{r}
dll_data_sorted <- ?
head(dll_data_sorted)
```
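One possible answer, sketched: `order()` has a `decreasing` argument.

```{r}
dll_data_sorted <- dll_data[order(dll_data$SCT, decreasing = TRUE), ]
```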
### Sorting on multiple different columns.
This is a natural extension of what we have already done. Sort first on *SCT*, then on *temp*
```{r}
dll_data_sorted <- ?
head(dll_data_sorted)
head(dll_data)
```
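One possible answer, sketched: `order()` takes multiple vectors, using each later one to break ties in the earlier ones.

```{r}
dll_data_sorted <- dll_data[order(dll_data$SCT, dll_data$temp), ]
```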
## Merging data frames
We often end up in a situation where we collect more data on the same set of samples (or in the same locales) that we need to include in our analyses. Again, we do not want to go back and edit the spreadsheets, but instead merge the data frames during the analysis. For instance, let's say we had new data on the elevation at which each of the 27 lines was originally collected, and we thought this might explain some important component of the variation in the traits (in this case size). So our collaborator collected this data:
```{r}
line_names <- dll_data$line
levels(line_names)
elevations <- c(100, 300, 270, 250, 500, 900, 500, 1100, 500,
                3000, 500, 570, 150, 800, 600, 500, 1900, 100,
                300, 270, 250, 500, 900, 500, 1100, 500, 600)
MeanDayTimeTemp <- c(rnorm(27, mean = 20, sd = 5))
elevation_data <- data.frame(levels(line_names),
elevations,
MeanDayTimeTemp)
```
So how do we combine these? A for loop or `ifelse()` is possible, but would be messy. There are two easier options, the first being `merge()`.
### `merge()`
The `merge()` function is designed for exactly this. It matches on variables with the same name (an exact match is best), or on matching pairs of columns that you specify.
In our case
```{r}
names(dll_data)
names(elevation_data)
```
The easiest thing to do would be to rename the variable in `elevation_data` so that it matches that in `dll_data`
```{r}
names(elevation_data)[1] <- "line"
str(elevation_data)
```
Now we can merge the data
```{r}
merged_data <- merge(x = elevation_data,
y = dll_data,
sort = FALSE)
```
How can we check if it merged correctly?
```{r}
head(merged_data)
tail(merged_data)
```
The `merge()` function has a great deal more flexibility, and it is worth looking at the examples in its help page! A word of warning with `merge()` and `reshape()` (below): these functions can behave oddly when non-unique rows create conflicts, and they can also stumble on missing data, so be on the lookout for such things when you use them. The other thing to know is that `merge()` will, by default, sort the new object it creates, so if you need to use the result in conjunction with other files, keep a dummy variable with unique identifiers in both files.
### the power of the index
We can also use the index to do something similar.
Say we measured the expression of a particular transcript for both genotypes, *Dll* and *wt*, and we want to include that. It would be very easy to use `ifelse()`, but we can also use the index.
```{r}
geneXpression <- c(Dll=1121, wt = 2051)
merged_data$geneExp <- geneXpression[dll_data$genotype]
head(merged_data)
```
Make a new variable based on genotype for `geneY = c(200, 500)`.
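One possible answer, sketched (the column name `geneYExp` is my choice). Note a subtlety in the chunk above: indexing a named vector by a factor uses the underlying integer codes, so it only works because the vector happens to be in the same order as `levels(genotype)`. Indexing via `as.character()` matches by name, which is safer.

```{r}
geneY <- c(Dll = 200, wt = 500)
# match by name rather than relying on the order of the factor levels;
# as above, this assumes merged_data rows are still aligned with dll_data
merged_data$geneYExp <- geneY[as.character(dll_data$genotype)]
head(merged_data)
```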
#### Clean up for next part.
```{r}
rm(elevation_data, elevations, MeanDayTimeTemp, temp_as_factor, dll_data_sorted, geneXpression, temp_as_factor2, line_names, merged_data)
```
## reshaping data
There are often times when your data is in a format that is not appropriate for the analyses you need to do. Sometimes you need each variable (say, each gene whose expression you are monitoring) in its own column: the so-called *wide* format. Other times it is best to have a single column for gene expression, with a second column indexing which gene each measurement belongs to. All expression data is then in a single column, each row holds a single measurement, and the number of rows equals the number of samples multiplied by the number of genes (lots of rows!!). This is called the *long* format.
This is such a common operation that `R` has a function, `reshape()`, to do it. Unfortunately (but perhaps deservedly), the arguments and argument names of `reshape()` are not very intuitive. So while I give an example below, this is one of those cases where using an additional package (like `reshape2` or `tidyr`) may be a worthwhile investment of time. Indeed, `gather()` and `spread()` (and their successors `pivot_longer()` and `pivot_wider()`) are really straightforward in *tidyr*.
### `reshape()` function in `R`
Let's take a quick look at our data again:
```{r}
head(dll_data)
```
It is clear that we have three variables (femur, tibia and tarsus) which are all measures of size (3 segments of the leg). While we currently have these in the *wide* format, for some analysis (such as in mixed models) we may wish to put them in a *long* format.
First let me show you the code (and that it worked), and then we will go through the various arguments to make sense of what we have done.
```{r}
long_data <- reshape(dll_data,
varying = list(names(dll_data)[5:7]),
direction = "long",
ids = row.names(dll_data),
times = c("femur", "tibia", "tarsus"),
timevar = "leg_segments",
v.names = "length")
```
Let's check that this worked. How might we do this? What do we expect in terms of number of rows in `long_data`?
```{r}
# It should have 3 times the number of rows as our original data
(3*nrow(dll_data)) == nrow(long_data)
head(long_data, n = 10)
tail(long_data, n = 10)
```
### some other approaches to using `reshape()`
Sometimes we already have unique subject identifiers in the data frame, and we don't need to fall back on row numbers. These identifiers can also be a combination of several variables, as long as the combination is unique for each row (for data currently in wide format); see the section below on combining names. It turns out that even if we combined line, genotype, and temp, we would not have unique names, so that strategy does not work here. Instead, we can make a fake identifier.
```{r}
dll_data$subject <- as.character(1:nrow(dll_data))
```
```{r}
long_data2 <- reshape(dll_data,
varying = list(names(dll_data)[5:7]),
direction = "long",
idvar = "subject",
#ids = row.names(dll_data),
times = c("femur", "tibia", "tarsus"),
timevar = "leg_segments",
v.names = "length"
)
```
Which we can check again
```{r}
3*nrow(dll_data) == nrow(long_data2)
head(long_data2)
```
In this case, this is really similar to the approach above as we used the same fundamental IDs.
## Combining variables together.
Sometimes you need to combine several variables together. For instance, we may need to look at combinations of both *genotype* and *temp*
```{r}
str(dll_data)
```
What if we wanted to combine these two variables into a single meta-variable? There are a few good options. One, of course, is `paste()`.
```{r}
meta_variable <- with(dll_data,
paste(genotype, temp,
sep = ":"))
head(meta_variable, n = 20)
str(meta_variable)
```
Note: you get to choose the character used as a separator; ":", "." and "_" are very common!
The only problem is this is now a character vector not a factor, although this is easily fixed.
```{r}
meta_variable <- as.factor(meta_variable)
str(meta_variable)
```
Great!
The other pretty easy way, especially for factors, is to use the `interaction()` function.
```{r}
meta_variable2 <- with(dll_data,
interaction(as.factor(temp), genotype,
drop = TRUE,
sep = ":"))
head(meta_variable2, n = 20)
str(meta_variable2)
```
Please note the `drop = TRUE` flag. You need this if there are combinations of factors that do not exist in the data, and it is almost always the safer choice unless you need all possible combinations (including those that don't occur).
## Please ignore
In class we are going to go through just a few more important activities: combining levels in a factor (e.g. with `grepl`), `transform()`, converting numeric to factor (`cut()`), and creating a dataset from multiple external files (RNAseq example).