-
Notifications
You must be signed in to change notification settings - Fork 0
/
MSIA401 - Homework 5 - 2016-10-28.Rmd
202 lines (170 loc) · 7.42 KB
/
MSIA401 - Homework 5 - 2016-10-28.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
---
title: "MSIA 401 - Homework 5"
author: "Jeff Parker"
date: "October 25, 2016"
output:
html_document:
theme: readable
---
# 4.14
1. First we import our data and our libraries. GLMNET must have the data in matrix form.
```{r}
library(car)
library(glmnet)
a <- read.csv("C:/Users/Jeff/Downloads/acetylene.csv")
a$x1_x2 <- a$x1 * a$x2
a$x1_x3 <- a$x1 * a$x3
a$x2_x3 <- a$x2 * a$x3
a$x1_sq <- a$x1^2
a$x2_sq <- a$x2^2
a$x3_sq <- a$x3^2
y <- a$y
x <- as.matrix(a[c(1,2,3,5,6,7,8,9,10)])
```
2. We fit a model using lambda in sequences of 0.01 from 0 to 10. We cross-validate our model using three folds of the data.
```{r}
ridgefit <- glmnet(x, y, alpha=0,lambda=seq(0,10,0.01))
ridgecv <- cv.glmnet(x,y,alpha=0,nfold=3,lambda=seq(0,10,0.01))
```
3. We now find the minimum lambda from all of the options we used.
```{r}
lambdaridge <- ridgecv$lambda.min
lambdaridge
```
4. Using this value of lambda (which can vary because of the different fold cuts), now we find it's location in the ridgecv and betas for this value of lambda.
```{r}
small.lambda.index <- which(ridgecv$lambda == ridgecv$lambda.min)
small.lambda.betas <- coef(ridgecv$glmnet.fit)[,small.lambda.index]
```
5. Plotting the coefficent by the log of the lambdas we can see the minimum is about 1.
```{r}
plot(ridgefit,xvar="lambda", main="Coeffs of Ridge Regression", type="l",
xlab=expression("log_lambda"), ylab="Coeff")
abline(h=0)
abline(v=log(ridgecv$lambda.min))
grid()
small.lambda.betas
```
# 4.15
1. Similiar to Ridge regression, we fit the Lasso model using many variables of lambda, take a three-fold cross-validation and find the minimum.
```{r}
lassofit=glmnet(x, y, alpha=1,lambda=seq(0,10,0.01))
lassocv=cv.glmnet(x,y,alpha=1,nfold=3,lambda=seq(0,10,0.01))
lambdalasso=lassocv$lambda.min
print(lambdalasso)
```
2. Using this minimum of lambda, we find the location and the betas.
```{r}
small.lambda.index <- which(lassocv$lambda == lassocv$lambda.min)
small.lambda.betas <- coef(lassocv$glmnet.fit)[,small.lambda.index]
```
3. Plotting the lambda against the coefficients, we can see the minimum for all the variables seem to converge at just under 1.
```{r}
plot(lassofit,xvar="lambda",label=TRUE, main="Coeffs of Lasso Regression", type="l",
xlab=expression("log_lambda"), ylab="Coeff")
abline(h=0); abline(v=log(lassocv$lambda.min))
grid()
small.lambda.betas
```
# 5.3
First, I create the data used for this exercise.
```{r}
y <- c(12.37,12.66,12.00,11.93,11.06,13.03,13.13,11.44,12.86,10.84,11.20,11.56,10.83,12.63,12.46)
x <- matrix(data = c(2.23,2.57,3.87,3.10,3.39,2.83,3.02,2.14,3.04,3.26,3.39,2.35,2.76,3.90,3.16,9.66,8.94,4.40,6.64,4.91,8.52,8.04,9.05,7.71,5.11,5.05,8.51,6.59,4.90,6.96), nrow = 15, ncol = 2, dimnames = list(c(1:15),c("x1","x2")))
```
## a.
It appears that the x1 and x2 are highly correllated, but neither x1 or x2 is correlated with y.
```{r}
plot(as.data.frame(cbind(y,x)))
```
Now we take a pairwise regression.
```{r}
lmfit <- lm(y ~ x)
step(lmfit,direction="both")
```
Using both backwards and both stepwise regression, the best options is to include both x1 and x2 as that minimizes the AIC.
## b.
```{r}
summary <- summary(lmfit)
summary$r.squared # Full Model r.squared
lmfit <- lm(y ~ as.data.frame(x)$x1)
summary <- summary(lmfit)
summary$r.squared # x1 only model r.squared
lmfit <- lm(y ~ as.data.frame(x)$x2)
summary <- summary(lmfit)
summary$r.squared # x2 only model r.squared
```
With both variables in the model, the r squared if extremely high. However, with just one variable the model the r squared is very low. The r squared takes into account the correllation of both the independent variables and the dependent variables and is therefore artificially inflated. Since x1 and x2 are highly correllated, we violate multicollinarity. Niether variable should be included in the model. We would not catch this if we use a backwards or both stepwise regression. We would catch this if we used a forward stepwise regression.
## c.
Using the forward method, the model would not find it significant to include x1 in the model. Then it would move to x2 and it would not find it significant to include in the model. Therefore, neither x1 nor x2 would have been found to be included in the model. However, both x1 and x2 should be included in the model
# 5.5
## a.
Variables in Model | $SSE_p$ | $p$ | Error d.f. | $MSE_p$ | $R^2_{adj,p}$ | $C_p$ | $AIC_p$
------------------ | ------- | ---- | ---------- | ------- | ------------- | ----- | -------
None | 950 | 0 | 19 | 50 | 0 | 1 | 79.21
$x_1$ | 720 | 1 | 18 | 40 | 0.2 | 2 | 75.67
$x_2$ | 630 | 1 | 18 | 35 | 0.3 | 2 | 73.00
$x_3$ | 540 | 1 | 18 | 30 | 0.4 | 2 | 69.92
$x_1$,$x_2$ |595 | 2 | 17 | 35 | 0.3 | 3 | 73.86
$x_1$,$x_3$ |425 | 2 | 17 | 25 | 0.5 | 3 | 67.13
$x_2$,$x_3$ |510 | 2 | 17 | 30 | 0.4 | 3 | 70.77
$x_1$,$x_2$,$x_3$ |400 | 3 | 16 | 25 | 0.5 | 4 | 67.91
## b.
I would select the model with $x_1$ and $x_3$ because it has the highest adjusted R squared, lowest $C_p$ and lowest $AIC_p$.
## c.
Since x3 has the highest absolute value of correlation, it is the first to enter the model: $r_{y x_1} = 1 - SSE_0 / SSE_p$
```{r}
r_yx1 = sqrt(1 - 720/950)
r_yx2 = sqrt(1 - 630/950)
r_yx3 = sqrt(1 - 540/950)
```
The $F_{in}$ is:
```{r}
f_x3 = (r_yx3^2)*(20 - (1 + 1))/(1 - (r_yx3)^2)
```
## d.
The second variable to enter the model is X1 with a $f_{in}$ of 4.6 with a partial correlation coefficient of 0.46:
```{r}
r_yx3_x1 = sqrt(1 - 425/540)
r_yx3_x2 = sqrt(1 - 510/540)
f_x3x1 = (r_yx3_x1^2)*(20 - (2 + 1))/(1 - (r_yx3_x1)^2)
```
## e.
The first varible x3 cannot be removed because it's $f_{out}$ = 11.8 > 4.
```{r}
r_yx1_x3 = sqrt((720 -425) / 720)
f_x1x3 = (r_yx1_x3^2)*(20 - (2 + 1))/(1 - (r_yx1_x3)^2)
```
## f.
The third variable x2, cannot be entered into the model because it's $f_{out}$ = 1 $!>$ 4.
```{r}
r_yx3_x1_x2 = sqrt((425 - 400)/425)
f_x3x1x2 = (r_yx3_x1_x2^2)*(20 - (3 + 1))/(1 - (r_yx3_x1_x2)^2)
```
# 5.6
```{r}
c <- read.csv("C:/Users/Jeff/Downloads/carprices.csv")
c$Buick <- ifelse(c$Make == "Buick",1,0)
c$Cadillac <- ifelse(c$Make == "Cadillac",1,0)
c$Chevrolet <- ifelse(c$Make == "Chevrolet",1,0)
c$SAAB <- ifelse(c$Make == "SAAB",1,0)
c$Pontiac <- ifelse(c$Make == "Pontiac",1,0)
c$Saturn <- ifelse(c$Make == "Saturn",1,0)
c$Convertible <- ifelse(c$Type == "Convertible",1,0)
c$Coupe <- ifelse(c$Type == "Coupe",1,0)
c$Hatchback <- ifelse(c$Type == "Hatchback",1,0)
c$Sedan <- ifelse(c$Type == "Sedan",1,0)
c$Wagon <- ifelse(c$Type == "Wagon",1,0)
train_indices <- seq(1, nrow(c), by=2)
train <- c[train_indices, ]
test <- c[-train_indices, ]
model1 <- lm(I(log(c$Price)) ~ c$Mileage + c$Cylinder + c$Liter + c$Buick + c$Cadillac + c$Chevrolet + c$SAAB + c$Convertible + c$Coupe + c$Hatchback + c$Sedan, data = test)
summary(model1)
model2 <- lm(I(log(c$Price)) ~ c$Mileage + c$Liter + c$Cadillac + c$Chevrolet + c$Pontiac + c$SAAB + c$Saturn + c$Convertible + c$Wagon + c$Cylinder, data = test)
summary(model2)
SSE_1 <- sum(resid(model1)^2)
SSE_2 <- sum(resid(model2)^2)
SSE_1
SSE_2
```
The SSE is smaller for Model 1. All the variables are statistically significant at 5% for both models. Since this is the case, both models are highly predicitive. If I had to select a model, I would probably choose Model 1 because the variables included are mostly likely to be available in the future. Model 2 includes Pontiac, SAAB, Saturn, convertible, and wagon, all of which are dead car companies and models not commonly made. Model 1 includes SAAB unfortunalety, but it also includes the now popular hatchback type of vehicle.