-
Notifications
You must be signed in to change notification settings - Fork 0
/
MSIA401 - Homework 3 - 2016-10-13.rmd
239 lines (201 loc) · 8.3 KB
/
MSIA401 - Homework 3 - 2016-10-13.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
---
title: "MSIA401 - Homework 3"
author: "Jeff Parker"
date: "October 11, 2016"
output: html_document
---
# Exercise 1
```{r}
cigarettes <- read.csv("C:/Users/Jeff/Downloads/cigarette.CSV")
attach(cigarettes)
```
## a.
```{r}
lmfit <- lm(Sales ~ Age + HS + Income + Black + Female + Price)
summary(lmfit)
```
The p-value's for HS and Female are highly non significant, 0.94008 and 0.85071 respectively.
```{r}
lmfit <- lm(Sales ~ Age + Income + Black + Price + HS + Female + HS:Female)
summary(lmfit)
```
Jointly, the p-value of HS:Female is 0.68297. This p-value is still non-significant at most alphas.
## b.
```{r}
confint.lm(lmfit, level = .9)
```
Our confidence interval for Income is (-1.606320e+00, 0.97494061). Our $\beta_{Income}$ (0.01920) is between this interval. This agrees with our P-value at a $\alpha$ = 0.10 because the P-value $= 0.06997 < 0.10 =$ $\alpha$.
#### c.
```{r}
lmfit <- lm(Sales ~ Income + Price)
summary(lmfit)
```
$$
\text{Full Model} R^2 : 0.3235 \hspace{2cm} \text{Partial Model} R^2 : 0.2503\\
$$
Using the following formula:
$$
F = \frac{(R^2_p - R^2_q)/(p-q)}{(1-R^2_p)/[n-(p+1)]}
$$
```{r, include=TRUE}
n = 51
p = 6
q = 2
alpha = 0.1
r2_p = 0.3235
r2_q = 0.2503
F <- ((r2_p - r2_q)/(p - q))/((1-r2_p)/(n - (p + 1)))
p_value = qf(1-alpha, df1= p-q , df2= n - (p+ 1))
F
p_value
```
Since our p-value: $2.077194 \nless 1.141784$ = F, we cannot reject the $H_0$. Therefore, the drop in $R^2$ is not statistically significant. The full model increases the $R^2$ but the increase is not enough to justify the extra variables. Therefore, even though it is better fit from the $R^2$ perspective, the simpler, more elegant model with few variables is better model.
# Exercise 2
```{r}
election <- read.csv("C:/Users/Jeff/Downloads/election.CSV")
election$D1 <- 0
election$D1[which(election$D == 1)] <- 1
election$D2 <- 0
election$D2[which(election$D == -1)] <- 1
attach(election)
lmfit <- lm(V ~ I + D + G:I)
summary(lmfit)
```
## a.
With the three options of D, the model can be written as:
$$
\begin{equation}
E(y)=
\begin{cases}
\beta_0 + \beta_1*1 + \beta_2*1 + \beta_3 (G:1) + \epsilon = \beta_0 + \beta_1 + \beta_2 + \beta_3 G + \epsilon & \text{Democrat incumbent running} \\
\beta_0 + \beta_1*0 + \beta_2*0 + \beta_3 (G:0) = \beta_0 + \beta_1 I + \epsilon & \text{Incumbent is not running} \\
\beta_0 + \beta_1*-1 + \beta_2*-1 + \beta_3 (G:-1) + \epsilon = \beta_0 - \beta_1 - \beta_2 - \beta_3 G + \epsilon & \text{Republic incumbent running} \\
\end{cases}
\end{equation}
$$
## b.
```{r}
lmfit <- lm(V ~ D + G:I)
summary(lmfit)
```
I and G:I are measuring two different effects of variables on the dependent variable. Common practice is that the I should be included in the model. However, it is not statisically significant. I indicates the party of the incumbent (if any), while G:I indicates the GDP growth given the incumbent options. Logically, G:I could effect the dependent variable at different rates. If you remove I and G from the model, all of the variables are statistically significant and the model is much more elegant. For this reason, I would remove them from the model.
# Excercise 3
## a.
Using the fitted model $V = \beta_0 + \beta_1 I +\alpha_1 D_1 +\alpha_1 D_1 + \beta_2 (G:I) + \epsilon$ we can write the following three models:
$$
\begin{equation}
E(y)=
\begin{cases}
\beta_0 + \beta_1 + \alpha_1 + \beta_1 (G) + \epsilon & \text{Democrat incumbent running} \\
\beta_0 + \beta_1 I + \beta_1 (G:I) + \epsilon & \text{Incumbent not running} \\
\beta_0 - \beta_1 + \alpha_2 - \beta_1 (G) \epsilon & \text{Republican incumbent running} \\
\end{cases}
\end{equation}
$$
## b.
Showing that the model in Exercise 2 is obtained by assuming $\alpha_1 = -\alpha_2$
$$
\begin{align}
&V = \beta_0 + \beta_1 I +\alpha_1 D_1 +\alpha_1 D_1 + \beta_3 (G:I) & \text{Dummy variable model}\\
&V = \beta_0 + \beta_1 I + \beta_2 D + \beta_3 (G:I) & \text{Original model from Excercise 2}\\
&0 = \beta_2 D -\alpha_1 D_1 -\alpha_1 D_1 & \text{Subtracting like terms}\\
&\beta_2 D = +\alpha_1 D_1 +\alpha_1 D_1 & \text{Algebra}\\
\end{align}
$$
Case #1
$$
D = 1 \hspace{2cm} D_1 = 1 \hspace{2cm} D_2 = 0\\
\begin{align}
&\beta_2 (1) = +\alpha_1 (1) +\alpha_2 (0) & \text{substituting in terms}\\
&\beta_2 = \alpha_1\\
\end{align}
$$
Case #2
$$
D = 1 \hspace{2cm} D_1 = 1 \hspace{2cm} D_2 = 0\\
\begin{align}
&\beta_2 (0) = +\alpha_1 (0) +\alpha_2 (0) & \text{substituting in terms}\\
&0 = 0\\
\end{align}
$$
Case #3
$$
D = 1 \hspace{2cm} D_1 = 1 \hspace{2cm} D_2 = 0\\
\begin{align}
&\beta_2 (-1) = +\alpha_1 (0) +\alpha_2 (-1) & \text{substituting in terms}\\
&\beta_2 = \alpha_2\\
&\beta_2 = - \alpha_2\\
\end{align}
$$
Now comparing the terms from case #1 and case #3
$$\beta_2 = \alpha_1 = - \alpha_2
$$
## c.
```{r}
lmfit <- lm(V ~ I + D1 + D2 + G:I)
summary(lmfit)
```
Fit the model and test that $H_0 : \alpha_1 = -\alpha_2$\
Alternatively $H_0 : \alpha_1 + \alpha_2 = 0$
```{r}
vcov(lmfit)
```
$$
t = \frac{(\hat{\alpha_1} + \hat{\alpha_2})}{\sqrt{Var(\hat{\alpha_1}) + Var(\hat{\alpha_2})+ 2Cov(\hat{\alpha_1},\hat{\alpha_2})}}
$$
```{r}
alpha_1 = 0.062530
alpha_2 = -0.043788
var_alpha_1 = 8.504655e-04
var_alpha_2 = 7.336822e-04
covar_alpha_1_alpha_2 = 5.692558e-05
df = 21 - 6 + 1
t_score <- (alpha_1 + alpha_2)/(sqrt(var_alpha_1 + var_alpha_2 + 2*covar_alpha_1_alpha_2))
t_crit <- c(qt(.05/2, df),qt(1-.05/2,df))
t_score
t_crit
```
Since the t-score (0.454828) is between the critical range (-2.119905, 2.119905), we cannot reject the $H_0$.
Therefore, $\alpha_1$ and $\alpha_2$ do not have statistically significant explanatory power in the model.
# Excercise 4
## a.
```{r}
GPA <- read.csv("C:/Users/Jeff/Downloads/GPA.CSV")
lmfit <- lm(GPA$GPA ~ GPA$Verbal + GPA$Math + I(GPA$Verbal^2) + I(GPA$Math^2) + GPA$Verbal:GPA$Math)
summary(lmfit)
plot(lmfit,which=1:2)
```
Looking at the Residuals vs Fitted, it looks like the variances are slightly funnel shaped which means we may be seeing some heterscedasticity. The variances are distributed at a exponential pattern. A log transformation will likely bring this back to a state of homoscedasticity. Looking at the Q-Q plot, it appears that we do not have any major problems with normality. In short, our homoscedasticity assumption may be violated and this model is a candidate for a log transformation.
## b.
```{r}
lmfit <- lm(log10(GPA$GPA) ~ GPA$Verbal + GPA$Math + I(GPA$Verbal^2) + I(GPA$Math^2) + GPA$Verbal:GPA$Math)
summary(lmfit)
plot(lmfit,which=1:2)
```
With the log transformation, the Residuals vs Fitted plot looks much better, there is no decernible pattern. So are variance is ideal and homoscedastic. The log transformation also helped with our normality assumption as the standardized residuals are much closer to the theoretical quantiles.
#Excercise 4.8
```{r}
Salary <- read.csv("C:/Users/Jeff/Downloads/TEMCO.csv")
Salary$Female <- as.integer(ifelse(Salary$Gender == 'Female',1,0))
Salary$Advert <- as.integer(ifelse(Salary$Dept == 'Advertse',1,0))
Salary$Engg <- as.integer(ifelse(Salary$Dept == 'Engineer',1,0))
Salary$Sales <- as.integer(ifelse(Salary$Dept == 'Sales',1,0))
Salary$Male <- as.integer(ifelse(Salary$Gender == 'Male',1,0))
Salary$Purchase <- as.integer(ifelse(Salary$Dept == 'Purchase',1,0))
```
## a.
```{r}
lmfit <- lm(as.numeric(Salary$Salary) ~ Salary$YrsEm + Salary$Education + Salary$Advert + Salary$Engg + Salary$Sales)
plot(lmfit,which=2)
lmfit <- lm(log10(as.numeric(Salary$Salary)) ~ Salary$YrsEm + Salary$Education + Salary$Advert + Salary$Engg + Salary$Sales)
plot(lmfit,which=2)
```
In this case the log transformation has not much improved the normality. In both cases, the residuals appear to be distributed very normally. Normality is not a major concern for this model as the sample size is quite large.
## b.
```{r}
lmfit <- lm(as.numeric(Salary$Salary) ~ Salary$YrsEm + Salary$Education + Salary$Advert + Salary$Engg + Salary$Sales)
plot(lmfit,which=1)
lmfit <- lm(log10(as.numeric(Salary$Salary)) ~ Salary$YrsEm + Salary$Education + Salary$Advert + Salary$Engg + Salary$Sales)
plot(lmfit,which=1)
```
There does appear to be some heteroscedasticy in the variances of the first model. The log transformation does improve the assumption of homoscedasticity in the second model.