---
title: "TeachingSimulationsBasics"
author: "Ian Dworkin"
date: "`r format(Sys.time(),'%d %b %Y')`"
output:
  html_document:
    toc: true
    number_sections: true
    keep_md: true
    code_folding: hide
    theme: journal
  pdf_document:
    toc: true
    number_sections: true
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# The basics of Simulations in R
## What do I mean by simulate? Why Monte Carlo?
## Our first example
Let us simulate 10000 observations from a normally distributed variable with a mean of 5 and a standard deviation of 2. In other words: $\sim N(\mu = 5, \sigma = 2)$
We will use the `rnorm` function, which generates **r**andom values from a **norm**al distribution.
```{r}
norm_simulated <- rnorm(n = 10000, mean = 5, sd = 2)
par(mfrow=c(1,1))
plot(norm_simulated,
pch = 20, col = rgb(0.5,0,0.5,0.25),
ylab = "simulated values")
hist(norm_simulated,
freq = F,
xlab = "simulated values",
main = "histogram of simulated values")
curve(dnorm(x, mean = 5, sd = 2), from = 0, to = 10,
add = T, col = "red")
```
The red curve is the $\sim N(\mu = 5, \sigma = 2)$ density superimposed over the distribution of simulated values.
## Look at summary statistics (mean and sd) from this
```{r}
mean(norm_simulated)
sd(norm_simulated)
```
## Repeat this a few times. What do you observe?
Let's repeat a few times, with a more modest sample size.
```{r}
norm_simulated1 <- rnorm(n = 100, mean = 5, sd = 2)
norm_simulated2 <- rnorm(n = 100, mean = 5, sd = 2)
norm_simulated3 <- rnorm(n = 100, mean = 5, sd = 2)
norm_simulated4 <- rnorm(n = 100, mean = 5, sd = 2)
par(mfrow = c(2,2))
hist(norm_simulated1)
hist(norm_simulated2)
hist(norm_simulated3)
hist(norm_simulated4)
par(mfrow = c(1,1))
mean(norm_simulated1)
mean(norm_simulated2)
mean(norm_simulated3)
mean(norm_simulated4)
```
So for each iteration of the simulation, we sampled 100 observations from a population distributed as $\sim N(\mu = 5, \sigma = 2)$. If we wanted to change the number of observations per iteration, we would simply change `n`. Similarly, we could change the mean (or make it some function, like a line), the standard deviation, or the type of probability distribution we are sampling from.
### Exercise
Please simulate 30 observations from $\sim N(\mu = 10, \sigma = 3)$, 5 times.
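If you get stuck, one possible solution (simply repeating the call, as above) looks like this:

```{r}
# Five separate draws of 30 observations from N(mu = 10, sigma = 3)
sim_1 <- rnorm(n = 30, mean = 10, sd = 3)
sim_2 <- rnorm(n = 30, mean = 10, sd = 3)
sim_3 <- rnorm(n = 30, mean = 10, sd = 3)
sim_4 <- rnorm(n = 30, mean = 10, sd = 3)
sim_5 <- rnorm(n = 30, mean = 10, sd = 3)

mean(sim_1); sd(sim_1)  # compare each to mu = 10, sigma = 3
```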
## Performing the simulation more efficiently
Of course, performing this one iteration at a time is not particularly useful, so instead we want a way to repeat it efficiently.
One way to do this in R is the `replicate` function, which simply repeats a function call n times. Notice that `n = 4` in `replicate` refers to how many times to repeat the call to the function (in this case `rnorm`), while `n = 100` in `rnorm` is how many samples are drawn from the population (distribution) on each call.
```{r}
norm_sim_all_2 <- replicate(n = 4,
rnorm(n = 100, mean = 5, sd = 2))
par(mfrow = c(2,2))
apply(X = norm_sim_all_2, MARGIN = 2, FUN = hist)
apply(X = norm_sim_all_2, MARGIN = 2, FUN = mean)
apply(X = norm_sim_all_2, MARGIN = 2, FUN = sd)
par(mfrow=c(1,1))
```
Let's do this again, but repeat the sampling process (the number of iterations) 2000 times, with $\sim N(\mu = 7, \sigma = 1.8)$.
```{r}
norm_sim_all_3 <- replicate(n=2000,
rnorm(n = 100, mean = 7, sd = 1.8))
```
We can ask about the spread of the means from our simulated trials. In other words for each of the 2000 iterations of the simulation, what is the mean for the 100 values we simulated? How might you code this?
```{r}
hist(apply(X = norm_sim_all_3, MARGIN = 2, FUN = mean),
main = "distribution of means from simulation",
xlab = "mean values")
# standard deviation among sampled means (the standard error of the mean)
sd(apply(X = norm_sim_all_3, MARGIN = 2, FUN = mean))
```
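The spread among the simulated means can be checked against the analytical standard error of the mean, $\sigma / \sqrt{n}$. A quick sketch (re-creating the simulation from the chunk above, so it stands alone):

```{r}
# Compare the empirical SE (sd among simulated means)
# to the theoretical SE, sigma / sqrt(n)
sims <- replicate(n = 2000, rnorm(n = 100, mean = 7, sd = 1.8))

empirical_se   <- sd(apply(sims, MARGIN = 2, FUN = mean))
theoretical_se <- 1.8 / sqrt(100)  # = 0.18

empirical_se
theoretical_se
```

The two values should agree closely; the small discrepancy is Monte Carlo error from using "only" 2000 iterations.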
Let's repeat this experiment with lower sample size (say 25)
```{r}
norm_sim_all_4 <- replicate(n = 2000,
rnorm(n = 25, mean = 5,sd = 2))
# standard deviation among sampled means (the standard error of the mean)
sd(apply(norm_sim_all_4,2, mean))
hist(apply(norm_sim_all_4,2,mean),
main = "distribution of means from simulation",
xlab = "mean values")
```
How about if we have much larger sample size (say 10000)?
```{r}
norm_sim_all_5 <- replicate(n = 2000,
rnorm(10000, 5, 2))
sd(apply(norm_sim_all_5, 2, mean))
hist(apply(norm_sim_all_5, 2, mean),
main = "distribution of means from simulation",
xlab = "mean values")
```
## Do you see a pattern for the precision of the mean across repeated simulated samples?
```{r}
sd(apply(norm_sim_all_4, 2, mean)) # n=25
sd(apply(norm_sim_all_2, 2, mean)) # n=100 (only 4 replicates, so a noisy estimate)
sd(apply(norm_sim_all_5, 2, mean)) # n=10000
```
Notice that as the sample size (for a given simulated sample) gets larger, the variation among the estimated means (the sd among the means) gets smaller!
We have just used Monte Carlo methods. When we simulate from a distribution, as our sample size n increases, the sample mean gets closer and closer to the expected value (which in this case we provided). That is the law of large numbers.
**This hopefully reinforces why sample size influences the estimated standard error so clearly!**
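A small sketch of the law of large numbers in action: the running (cumulative) mean of a single stream of draws settles toward the true mean as the number of observations grows.

```{r}
# Running mean of draws from N(mu = 5, sigma = 2)
draws <- rnorm(n = 10000, mean = 5, sd = 2)
running_mean <- cumsum(draws) / seq_along(draws)

plot(running_mean, type = "l",
     xlab = "number of observations", ylab = "running mean")
abline(h = 5, col = "red", lty = 2)  # the true mean we simulated from
```

Early on the running mean bounces around; by n = 10000 it is very close to 5.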
## Using for loops instead
We can also run such simulations using for loops, which can be more efficient when the number of simulations (or parameters being estimated) is large.
First, how many iterations of the simulation do we want? We will call this `N`.
```{r}
N <- 2000
```
First we initialize a variable to store the mean values we are going to estimate from each iteration of the simulation
```{r}
simulated_means <- rep(NA, N) # Create an empty vector of length N
head(simulated_means) # Filled with "missing data" until we fill it
```
Now we use the for loop to repeat this from i = 1 until i = N (which is 2000 in this case).
```{r}
for (i in 1:N) {
sim_data <- rnorm(n = 100, mean = 5, sd = 2)
simulated_means[i] <- mean(sim_data)
rm(sim_data) # While this gets over-written each iteration of the loop,
#we want to remove it after the last iteration.
}
```
The vector `simulated_means` now contains all of the means from the N iterations of the simulation.
```{r}
hist(simulated_means)
sd(simulated_means)
```
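The same simulation can also be written with `sapply` over the iteration index, which avoids pre-allocating the storage vector by hand (a matter of taste; the for loop above is just as valid):

```{r}
# sapply applies the anonymous function to each index 1..N
# and collects the results into a vector
N <- 2000
simulated_means_2 <- sapply(1:N,
                            function(i) mean(rnorm(n = 100, mean = 5, sd = 2)))

sd(simulated_means_2)  # should again be close to 2/sqrt(100) = 0.2
```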
## Simulations for a simple linear regression
Instead of simulating just a mean and sd, let's specify something more interesting... How about a linear regression!!!!!
Let's use the regression $Y \sim N(\mu = a + bx, \sigma)$.
Specifically, with an intercept of 5 (`a = 5`) and a slope of 0.7 (`b = 0.7`):
$$Y \sim N(\mu = 5 + 0.7x, \sigma)$$
```{r}
a <- 5
b <- 0.7
x <- seq(2, 20) # our predictor variable
y_fixed <- a + b*x # deterministic part of our model
par(mfrow = c(1, 1))
plot(y_fixed ~ x, pch = 20, cex = 1.5, col = "red") # The deterministic relationship.
```
But we want our response to be randomly sampled (conditioned on x), so we need to incorporate a stochastic component as well. Let's say `sd = 2.5`:
$$Y \sim N(\mu = 5 + 0.7x, \sigma = 2.5)$$
```{r}
y_sim_1 <- rnorm(length(x), mean = y_fixed, sd = 2.5)
```
So we have now simulated values of our response variable `y` given a deterministic and stochastic component to the model!
```{r}
par(mfrow = c(1, 1))
plot(y_sim_1 ~ x, pch = 20, cex = 1.5, col = "red")
abline(a = 5, b = 0.7, lwd = 2) # Expected relationship based on the "known" parameters we used.
```
The line is the relationship based on the "known" parameters we used. The red points are the simulated values of $y$ we just generated.
But what are the parameter estimates for the regression, based on the simulated data (for y)? We can use the `lm` function to run a linear regression.
```{r}
y_sim_1_lm <- lm(y_sim_1 ~ x)
# notice parameter estimates and RSE!
coef(y_sim_1_lm)
confint(y_sim_1_lm) # does it include our expected (known) parameter values?
```
```{r}
plot(y_sim_1 ~ x, pch = 20, cex = 1.5, col = "red")
abline(a = 5, b = 0.7, lwd = 2)
abline(reg=y_sim_1_lm, lty = 2, col = "purple", lwd = 2) # estimated values based on simulated data.
```
Not identical, but similar to the "true" (which is known because we chose it) relationship. The point is, we have now done a little simulation of our regression model! We have the R tools we need to really begin learning statistics and about uncertainty!
Let's repeat this simulation a few times and add the fitted lines to the plot. The simulations are in purple; the "true" relationship is in black.
```{r}
y_sim_2 <- rnorm(length(x), mean = y_fixed, sd = 2.5)
y_sim_2_lm <- lm(y_sim_2 ~ x)
y_sim_3 <- rnorm(length(x), mean = y_fixed, sd = 2.5)
y_sim_3_lm <- lm(y_sim_3 ~ x)
plot(y_sim_1 ~ x,
col = "red", type = "n",
ylab = "y (simulated)")
abline(a = 5, b = 0.7, lwd = 2)
abline(reg = y_sim_1_lm, lty = 2, col = "purple", lwd = 2)
abline(reg = y_sim_2_lm, lty = 2, col = "purple", lwd = 2)
abline(reg = y_sim_3_lm, lty = 2, col = "purple", lwd = 2)
```
These are the absolute fundamentals, but there is much more we can do...
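For example, the same Monte Carlo logic applies to the whole regression: repeat the simulate-then-fit cycle many times and look at the spread of the slope estimates. A minimal sketch, reusing the parameter values from above:

```{r}
# Sampling distribution of the estimated slope under repeated simulation
a <- 5
b <- 0.7
x <- seq(2, 20)
y_fixed <- a + b * x

slope_estimates <- replicate(n = 1000, {
  y_sim <- rnorm(length(x), mean = y_fixed, sd = 2.5)
  coef(lm(y_sim ~ x))[2]   # extract the estimated slope
})

hist(slope_estimates,
     main = "sampling distribution of the slope",
     xlab = "estimated slope")
sd(slope_estimates)  # Monte Carlo estimate of the slope's standard error
```

The histogram is centered on the "true" slope of 0.7, and its spread is the standard error of the slope, estimated by simulation rather than by formula.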