---
title: "TeachingSimulationsBasics"
author: "Ian Dworkin"
date: "`r format(Sys.time(),'%d %b %Y')`"
output:
  html_document:
    toc: true
    number_sections: true
    keep_md: true
    code_folding: hide
    theme: journal
  pdf_document:
    toc: true
    number_sections: true
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# The basics of Simulations in R
## What do I mean by simulate? Why Monte Carlo?
## Our first example
Let us simulate 10000 observations from a normally distributed variable with a mean of 5 and a standard deviation of 2. In other words: $\sim N(\mu = 5, \sigma = 2)$
We will use the `rnorm` function, which generates **r**andom values from a **norm**al distribution.
```{r}
norm_simulated <- rnorm(n = 10000, mean = 5, sd = 2)
par(mfrow=c(1,1))
plot(norm_simulated,
pch = 20, col = rgb(0.5,0,0.5,0.25),
ylab = "simulated values")
hist(norm_simulated,
freq = F,
xlab = "simulated values",
main = "histogram of simulated values")
curve(dnorm(x, mean = 5, sd = 2), from = 0, to = 10,
add = T, col = "red")
```
The red curve is the $\sim N(\mu = 5, \sigma = 2)$ density superimposed over the distribution of simulated values.
## Look at summary statistics (mean and sd) from this
```{r}
mean(norm_simulated)
sd(norm_simulated)
```
## Repeat this a few times. What do you observe?
Let's repeat a few times, with a more modest sample size.
```{r}
norm_simulated1 <- rnorm(n = 100, mean = 5, sd = 2)
norm_simulated2 <- rnorm(n = 100, mean = 5, sd = 2)
norm_simulated3 <- rnorm(n = 100, mean = 5, sd = 2)
norm_simulated4 <- rnorm(n = 100, mean = 5, sd = 2)
par(mfrow = c(2,2))
hist(norm_simulated1)
hist(norm_simulated2)
hist(norm_simulated3)
hist(norm_simulated4)
par(mfrow = c(1,1))
mean(norm_simulated1)
mean(norm_simulated2)
mean(norm_simulated3)
mean(norm_simulated4)
```
So for each iteration of the simulation, we sampled 100 observations from a population distributed as $\sim N(\mu = 5, \sigma = 2)$. If we wanted to change the number of observations per iteration, we would simply change `n`. Similarly, we could change the mean (or make it some function, like a line), the standard deviation, or the type of probability distribution we are sampling from.
### Exercise
Please simulate 30 observations from $\sim N(\mu = 10, \sigma = 3)$, 5 times.
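If you get stuck, one possible solution (simply repeating the call, as above) looks like this:

```{r}
# Five separate draws of 30 observations from N(mu = 10, sigma = 3)
sim_1 <- rnorm(n = 30, mean = 10, sd = 3)
sim_2 <- rnorm(n = 30, mean = 10, sd = 3)
sim_3 <- rnorm(n = 30, mean = 10, sd = 3)
sim_4 <- rnorm(n = 30, mean = 10, sd = 3)
sim_5 <- rnorm(n = 30, mean = 10, sd = 3)

mean(sim_1); sd(sim_1)  # compare each to mu = 10, sigma = 3
```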
## Performing the simulation more efficiently
Of course, performing this one iteration at a time is not particularly useful, so instead we want a way to repeat it efficiently.
One way to do this in R is the `replicate` function, which simply repeats a function call n times. Notice that `n = 4` in `replicate` refers to how many times to repeat the call to the function (in this case `rnorm`), while `n = 100` in `rnorm` is how many samples are drawn from the population (distribution) on each call.
```{r}
norm_sim_all_2 <- replicate(n = 4,
rnorm(n = 100, mean = 5, sd = 2))
par(mfrow = c(2,2))
apply(X = norm_sim_all_2, MARGIN = 2, FUN = hist)
apply(X = norm_sim_all_2, MARGIN = 2, FUN = mean)
apply(X = norm_sim_all_2, MARGIN = 2, FUN = sd)
par(mfrow=c(1,1))
```
Let's do this again, but repeat the sampling process (the number of iterations) 2000 times, with $\sim N(\mu = 7, \sigma = 1.8)$.
```{r}
norm_sim_all_3 <- replicate(n=2000,
rnorm(n = 100, mean = 7, sd = 1.8))
```
We can ask about the spread of the means from our simulated trials. In other words for each of the 2000 iterations of the simulation, what is the mean for the 100 values we simulated? How might you code this?
```{r}
hist(apply(X = norm_sim_all_3, MARGIN = 2, FUN = mean),
main = "distribution of means from simulation",
xlab = "mean values")
# standard deviation among sampled means (the standard error of the mean)
sd(apply(X = norm_sim_all_3, MARGIN = 2, FUN = mean))
```
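The spread among the simulated means can be checked against the analytical standard error of the mean, $\sigma / \sqrt{n}$. A quick sketch (re-creating the simulation from the chunk above, so it stands alone):

```{r}
# Compare the empirical SE (sd among simulated means)
# to the theoretical SE, sigma / sqrt(n)
sims <- replicate(n = 2000, rnorm(n = 100, mean = 7, sd = 1.8))

empirical_se   <- sd(apply(sims, MARGIN = 2, FUN = mean))
theoretical_se <- 1.8 / sqrt(100)  # = 0.18

empirical_se
theoretical_se
```

The two values should agree closely; the small discrepancy is Monte Carlo error from using "only" 2000 iterations.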
Let's repeat this experiment with lower sample size (say 25)
```{r}
norm_sim_all_4 <- replicate(n = 2000,
rnorm(n = 25, mean = 5,sd = 2))
# standard deviation among sampled means (the standard error of the mean)
sd(apply(norm_sim_all_4,2, mean))
hist(apply(norm_sim_all_4,2,mean),
main = "distribution of means from simulation",
xlab = "mean values")
```
How about if we have much larger sample size (say 10000)?
```{r}
norm_sim_all_5 <- replicate(n = 2000,
rnorm(10000, 5, 2))
sd(apply(norm_sim_all_5, 2, mean))
hist(apply(norm_sim_all_5, 2, mean),
main = "distribution of means from simulation",
xlab = "mean values")
```
## Do you see a pattern for the precision of the mean across repeated simulated samples?
```{r}
sd(apply(norm_sim_all_4, 2, mean)) # n=25
sd(apply(norm_sim_all_2, 2, mean)) # n=100 (only 4 replicates, so a noisy estimate)
sd(apply(norm_sim_all_5, 2, mean)) # n=10000
```
Notice that as the sample size (for a given simulated sample) gets larger, the variation among the estimated means (the sd among the means) gets smaller!
We have just used Monte Carlo methods. When we simulate from a distribution, as our sample size n increases, the sample mean gets closer and closer to the expected value (which in this case we provided). That is the law of large numbers.
**This hopefully reinforces why sample size influences the estimated standard error so clearly!**
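A small sketch of the law of large numbers in action: the running (cumulative) mean of a single stream of draws settles toward the true mean as the number of observations grows.

```{r}
# Running mean of draws from N(mu = 5, sigma = 2)
draws <- rnorm(n = 10000, mean = 5, sd = 2)
running_mean <- cumsum(draws) / seq_along(draws)

plot(running_mean, type = "l",
     xlab = "number of observations", ylab = "running mean")
abline(h = 5, col = "red", lty = 2)  # the true mean we simulated from
```

Early on the running mean bounces around; by n = 10000 it is very close to 5.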
## Using for loops instead
We can also run such simulations using for loops, which can be more efficient when the number of simulations (or parameters being estimated) is large.
First, how many iterations of the simulation do we want? We will call this `N`.
```{r}
N <- 2000
```
First we initialize a variable to store the mean values we are going to estimate from each iteration of the simulation
```{r}
simulated_means <- rep(NA, N) # Create an empty vector of length N
head(simulated_means) # Filled with "missing data" until we fill it
```
Now we use the for loop to repeat this from i = 1 until i = N (which is 2000 in this case).
```{r}
for (i in 1:N) {
sim_data <- rnorm(n = 100, mean = 5, sd = 2)
simulated_means[i] <- mean(sim_data)
rm(sim_data) # While this gets over-written each iteration of the loop,
#we want to remove it after the last iteration.
}
```
The vector `simulated_means` now contains all of the means from the N iterations of the simulation.
```{r}
hist(simulated_means)
sd(simulated_means)
```
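The same simulation can also be written with `sapply` over the iteration index, which avoids pre-allocating the storage vector by hand (a matter of taste; the for loop above is just as valid):

```{r}
# sapply applies the anonymous function to each index 1..N
# and collects the results into a vector
N <- 2000
simulated_means_2 <- sapply(1:N,
                            function(i) mean(rnorm(n = 100, mean = 5, sd = 2)))

sd(simulated_means_2)  # should again be close to 2/sqrt(100) = 0.2
```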
## Simulations for a simple linear regression
Instead of simulating just a mean and sd, let's specify something more interesting... How about a linear regression!!!!!
Let's use the regression $Y \sim N(\mu = a + bx, \sigma)$.
Specifically, with an intercept of 5 (`a = 5`) and a slope of 0.7 (`b = 0.7`):
$$Y \sim N(\mu = 5 + 0.7x, \sigma)$$
```{r}
a <- 5
b <- 0.7
x <- seq(2, 20) # our predictor variable
y_fixed <- a + b*x # deterministic part of our model
par(mfrow = c(1, 1))
plot(y_fixed ~ x, pch = 20, cex = 1.5, col = "red") # The deterministic relationship.
```
But we want our response to be randomly sampled (conditioned on x), so we need to incorporate a stochastic component as well. Let's say `sd = 2.5`:
$$Y \sim N(\mu = 5 + 0.7x, \sigma = 2.5)$$
```{r}
y_sim_1 <- rnorm(length(x), mean = y_fixed, sd = 2.5)
```
So we have now simulated values of our response variable `y` given a deterministic and stochastic component to the model!
```{r}
par(mfrow = c(1, 1))
plot(y_sim_1 ~ x, pch = 20, cex = 1.5, col = "red")
abline(a = 5, b = 0.7, lwd = 2) # Expected relationship based on the "known" parameters we used.
```
The line is the relationship based on the "known" parameters we used. The red points are the simulated values of $y$ we just generated.
But what are the parameter estimates for the regression, based on the simulated data (for y)? We can use the `lm` function to run a linear regression.
```{r}
y_sim_1_lm <- lm(y_sim_1 ~ x)
# notice parameter estimates and RSE!
coef(y_sim_1_lm)
confint(y_sim_1_lm) # does it include our expected (known) parameter values?
```
```{r}
plot(y_sim_1 ~ x, pch = 20, cex = 1.5, col = "red")
abline(a = 5, b = 0.7, lwd = 2)
abline(reg=y_sim_1_lm, lty = 2, col = "purple", lwd = 2) # estimated values based on simulated data.
```
Not identical, but similar to the "true" (which is known because we chose it) relationship. The point is, we have now done a little simulation of our regression model! We have the R tools we need to really begin learning statistics and about uncertainty!
Let's repeat this simulation a few times and add the fitted lines to the plot. The simulations are in purple; the "true" relationship is in black.
```{r}
y_sim_2 <- rnorm(length(x), mean = y_fixed, sd = 2.5)
y_sim_2_lm <- lm(y_sim_2 ~ x)
y_sim_3 <- rnorm(length(x), mean = y_fixed, sd = 2.5)
y_sim_3_lm <- lm(y_sim_3 ~ x)
plot(y_sim_1 ~ x,
col = "red", type = "n",
ylab = "y (simulated)")
abline(a = 5, b = 0.7, lwd = 2)
abline(reg = y_sim_1_lm, lty = 2, col = "purple", lwd = 2)
abline(reg = y_sim_2_lm, lty = 2, col = "purple", lwd = 2)
abline(reg = y_sim_3_lm, lty = 2, col = "purple", lwd = 2)
```
These are the absolute fundamentals, but there is much more we can do...
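For example, the same Monte Carlo logic applies to the whole regression: repeat the simulate-then-fit cycle many times and look at the spread of the slope estimates. A minimal sketch, reusing the parameter values from above:

```{r}
# Sampling distribution of the estimated slope under repeated simulation
a <- 5
b <- 0.7
x <- seq(2, 20)
y_fixed <- a + b * x

slope_estimates <- replicate(n = 1000, {
  y_sim <- rnorm(length(x), mean = y_fixed, sd = 2.5)
  coef(lm(y_sim ~ x))[2]   # extract the estimated slope
})

hist(slope_estimates,
     main = "sampling distribution of the slope",
     xlab = "estimated slope")
sd(slope_estimates)  # Monte Carlo estimate of the slope's standard error
```

The histogram is centered on the "true" slope of 0.7, and its spread is the standard error of the slope, estimated by simulation rather than by formula.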