---
title: "Chapter 1: Linear regression"
subtitle: "Categorical variables"
author: "Joris Vankerschaver"
header-includes:
- \useinnertheme[shadow=true]{rounded}
- \usecolortheme{rose}
- \setbeamertemplate{footline}[frame number]
- \usepackage{color}
- \usepackage{graphicx}
- \usepackage{amsmath}
- \graphicspath{{./images/01-linear-regression}}
output:
  beamer_presentation:
    theme: "default"
    keep_tex: true
    includes:
      in_header: columns.tex
---
```{r include = FALSE}
birnh <- read.table("./datasets/01-linear-regression/birnh.txt", header = TRUE, sep = "\t", dec = ".")
attach(birnh)
```
## Example: BIRNH study
- Epidemiological follow-up study from the mid-1980s that collected nutritional and health data in Belgium ($n = 5{,}815$)
- **Goal**: estimate the effect of smoking on cholesterol
- Because smoking and dietary behaviour may differ between provinces, we want to adjust for province
- Possible values of the variable PROVINCE:
    - 1: West Flanders
    - 2: East Flanders
    - 3: Flemish Brabant
    - 7: Antwerp
    - 8: Limburg
## A first analysis ...
\small
```{r echo=FALSE}
model_birnh <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3) + SEX + PROVINCE)
summary(model_birnh)$coefficients
```
\normalsize
Because PROVINCE enters as a numeric variable, the model implicitly assumes that the mean difference in cholesterol between Limburg (code 8) and West Flanders (code 1) is 7 times as large as the one between East Flanders (code 2) and West Flanders.
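## Sketch: what numeric coding implies
A minimal sketch on simulated data (not the BIRNH data; `sim`, `province` and `chol` are hypothetical names) of why entering the province code as a number is restrictive. With a single slope, the fitted difference between codes 8 and 1 is forced to be 7 times the fitted difference between codes 2 and 1, whatever the data say.

\footnotesize

```{r eval=FALSE}
# Simulated illustration only; all names here are hypothetical
set.seed(1)
codes <- c(1, 2, 3, 7, 8)                      # province codes from the slide
sim <- data.frame(province = sample(codes, 500, replace = TRUE))
# Only Limburg (code 8) differs from the other provinces in this simulation
sim$chol <- 200 + 5 * (sim$province == 8) + rnorm(500, sd = 10)

# Province entered as a number: one common slope per unit of code
fit_num <- lm(chol ~ province, data = sim)
pred <- predict(fit_num, newdata = data.frame(province = c(1, 2, 8)))

# Fitted differences relative to West Flanders (code 1)
diff_east    <- unname(pred[2] - pred[1])      # (2 - 1) * slope
diff_limburg <- unname(pred[3] - pred[1])      # (8 - 1) * slope
diff_limburg / diff_east                       # ratio fixed by the coding
```

\normalsize
The ratio depends only on the numeric codes, not on the simulated outcome, which is the restriction the dummy variables remove.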
## Dummy variables
- Create 4 **dummy variables**, one for each province except West Flanders:
\begin{align*}
P_2 & = \left\{\begin{array}{ll}
1 & \textrm{East Flanders}\\
0 & \textrm{otherwise}
\end{array}\right.\\
P_3 & = \left\{\begin{array}{ll}
1 & \textrm{Flemish Brabant}\\
0 & \textrm{otherwise}
\end{array}\right.\\
P_7 & = \left\{\begin{array}{ll}
1 & \textrm{Antwerp}\\
0 & \textrm{otherwise}
\end{array}\right.\\
P_8 & = \left\{\begin{array}{ll}
1 & \textrm{Limburg}\\
0 & \textrm{otherwise}
\end{array}\right.
\end{align*}
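## Sketch: building the dummies by hand
The definitions of $P_2, P_3, P_7, P_8$ can be sketched in \texttt{R} as follows. This uses simulated data with hypothetical names (`sim`, `province`, `chol`), not the BIRNH data, and checks that the hand-made dummies give the same estimates as a factor.

\footnotesize

```{r eval=FALSE}
# Simulated illustration only; all names here are hypothetical
set.seed(2)
codes <- c(1, 2, 3, 7, 8)
sim <- data.frame(province = sample(codes, 300, replace = TRUE))
sim$chol <- 200 + 3 * (sim$province == 7) + rnorm(300, sd = 10)

# One 0/1 indicator per non-reference province (West Flanders = reference)
sim$P2 <- as.integer(sim$province == 2)
sim$P3 <- as.integer(sim$province == 3)
sim$P7 <- as.integer(sim$province == 7)
sim$P8 <- as.integer(sim$province == 8)

fit_manual <- lm(chol ~ P2 + P3 + P7 + P8, data = sim)
fit_factor <- lm(chol ~ factor(province), data = sim)

# Compare the two sets of coefficient estimates
all.equal(unname(coef(fit_manual)), unname(coef(fit_factor)))
```

\normalsize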
## Dummy variables
- These 4 dummy variables carry the same information as the variable PROVINCE:
    - West Flanders: $(P_2,P_3,P_7,P_8)=(0,0,0,0)$
    - East Flanders: $(P_2,P_3,P_7,P_8)=(1,0,0,0)$
    - Flemish Brabant: $(P_2,P_3,P_7,P_8)=(0,1,0,0)$
    - Antwerp: $(P_2,P_3,P_7,P_8)=(0,0,1,0)$
    - Limburg: $(P_2,P_3,P_7,P_8)=(0,0,0,1)$
- Any categorical variable with $k$ levels can be recoded into $k-1$ dummy variables by choosing one level as the **reference**.
- In \texttt{R}, wrapping the variable in \texttt{factor()} does this automatically; by default the first level is the reference:
```{r eval=FALSE}
m <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3)
        + SEX + factor(PROVINCE))
```
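## Sketch: inspecting and changing the reference level
A small sketch of how the dummy coding can be inspected with \texttt{model.matrix()} and how another reference level can be chosen with \texttt{relevel()}. The factor `prov` and its labels are made up for illustration; the BIRNH data store the province as a numeric code.

\footnotesize

```{r eval=FALSE}
# Hypothetical factor with one observation per province
prov <- factor(c(1, 2, 3, 7, 8),
               levels = c(1, 2, 3, 7, 8),
               labels = c("WestFl", "EastFl", "FlBrabant", "Antwerp", "Limburg"))

# Design matrix: intercept plus k - 1 = 4 dummy columns;
# the row for the reference level (WestFl) has all dummies equal to 0
X <- model.matrix(~ prov)
X

# Choose another reference level, here Antwerp
prov_ant <- relevel(prov, ref = "Antwerp")
X_ant <- model.matrix(~ prov_ant)
colnames(X_ant)   # no dummy column for Antwerp any more
```

\normalsize
Changing the reference level changes what each coefficient means, but not the fitted values of the model.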
## Analysis with dummy variables
\footnotesize
```{r echo=FALSE}
m <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3)
+ SEX + factor(PROVINCE))
summary(m)$coefficients
```
\phantom{We need to test whether several coefficients are zero simultaneously.}
## How to test for effect of province?
\footnotesize
```{r echo=FALSE}
m <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3)
+ SEX + factor(PROVINCE))
summary(m)$coefficients
```
\normalsize
We need to test whether several coefficients are zero simultaneously.
## Partial F-test
Suppose we want to compare \alert{2 nested models}:
- **Complete model (C)** with $p_C$ parameters; e.g., $p_C=10$ and
\[
E(Y \mid X,P)=\beta_0+\beta_1X+\beta_2P
\]
- **Reduced model (R)** with $p_R$ parameters; e.g., $p_R=6$ and
\[
E(Y \mid X,P)=\beta^*_0+\beta^*_1X
\]
- Here $X$ stands for the covariates kept in both models and $P$ for the 4 province dummies, so $\beta_2$ has $p_C-p_R=4$ components.
- Testing $H_0:\beta_2=0$ amounts to testing whether the reduced model fits as well as the complete model.
## Partial F-test
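- Under the null hypothesis, the residual sums of squares of both models will be approximately the same.
- The **test statistic** is therefore based on the difference
\[
SSE(R) - SSE(C).
\]
- What is its distribution under the null hypothesis? After scaling by the degrees of freedom,
\[
F = \frac{\bigl(SSE(R) - SSE(C)\bigr)/(p_C-p_R)}{SSE(C)/(n-p_C)} \sim F_{p_C-p_R,\,n-p_C}.
\]
- Large values of $F$ are evidence against the null hypothesis.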
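## Sketch: partial F-test by hand
The formula above can be checked numerically. The sketch below uses simulated data with hypothetical names (`sim`, `x`, `province`, `y`), computes the statistic from the two residual sums of squares, and compares it with the output of \texttt{anova()}.

\footnotesize

```{r eval=FALSE}
# Simulated illustration only; all names here are hypothetical
set.seed(3)
n <- 400
sim <- data.frame(
  x = rnorm(n),
  province = factor(sample(c(1, 2, 3, 7, 8), n, replace = TRUE))
)
sim$y <- 1 + 0.5 * sim$x + rnorm(n)      # no province effect: H0 holds

fit_R <- lm(y ~ x, data = sim)             # reduced model
fit_C <- lm(y ~ x + province, data = sim)  # complete model

sse_R <- sum(resid(fit_R)^2)
sse_C <- sum(resid(fit_C)^2)
p_R <- length(coef(fit_R))                 # intercept + x
p_C <- length(coef(fit_C))                 # adds 4 province dummies

# Partial F statistic and its p-value under F(p_C - p_R, n - p_C)
F_stat <- ((sse_R - sse_C) / (p_C - p_R)) / (sse_C / (n - p_C))
p_val <- pf(F_stat, df1 = p_C - p_R, df2 = n - p_C, lower.tail = FALSE)

# Same quantities as reported by anova()
tab <- anova(fit_R, fit_C)
c(by_hand = F_stat, anova = tab$F[2])
```

\normalsize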
## Partial F-test in \texttt{R}
\footnotesize
```{r}
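# Complete model: includes the province dummies
mC <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3) + SEX
         + factor(PROVINCE))
# Reduced model: same covariates, no province
mR <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3) + SEX)
# Partial F-test of H0: all province coefficients are zero
anova(mR, mC)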
```
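## Sketch: the same test with \texttt{drop1()}
As a possible shortcut, \texttt{drop1()} with \texttt{test = "F"} drops each term of a fitted model in turn, so the reduced model does not have to be fitted by hand. The sketch below uses simulated data with hypothetical names (`sim`, `x`, `province`, `y`), not the BIRNH data, and compares the result with \texttt{anova()}.

\footnotesize

```{r eval=FALSE}
# Simulated illustration only; all names here are hypothetical
set.seed(4)
n <- 400
sim <- data.frame(
  x = rnorm(n),
  province = factor(sample(c(1, 2, 3, 7, 8), n, replace = TRUE))
)
sim$y <- 1 + 0.5 * sim$x + rnorm(n)

fit_C <- lm(y ~ x + province, data = sim)  # complete model
fit_R <- lm(y ~ x, data = sim)             # reduced model

# One F-test per term; the "province" row tests all 4 dummies jointly
tab_drop <- drop1(fit_C, test = "F")
tab_anova <- anova(fit_R, fit_C)

c(drop1 = tab_drop["province", "F value"], anova = tab_anova$F[2])
```

\normalsize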