01_Regression-5.Rmd

---
title: "Chapter 1: Linear regression"
subtitle: "Categorical variables"
author: "Joris Vankerschaver"
header-includes:
  - \useinnertheme[shadow=true]{rounded}
  - \usecolortheme{rose}
  - \setbeamertemplate{footline}[frame number]
  - \usepackage{color}
  - \usepackage{graphicx}
  - \usepackage{amsmath}
  - \graphicspath{{./images/01-linear-regression}}
output: 
  beamer_presentation:
    theme: "default"
    keep_tex: true
    includes:
      in_header: columns.tex
---


```{r include = FALSE}
birnh <- read.table("./datasets/01-linear-regression/birnh.txt", header = T, sep = "\t", dec = ".")
attach(birnh)
```
## Example: BIRNH study

- Epidemiological follow-up study in the mid 80s where nutritional and health data in Belgium were measured ($n=5,815$)
- **Goal**: effect of smoking on cholesterol
- Since people from different provinces might have a different smoking and dietary behaviour, we want to correct for province
- Possible values of this variable:
  - 1: West Flanders
  - 2: East Flanders
  - 3: Flemish Brabant
  - 7: Antwerp
  - 8: Limburg

## A first analysis ...

\small
```{r echo=FALSE}
model_birnh <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3) + SEX + PROVINCE)
summary(model_birnh)$coefficients
```
\normalsize

Model implicitly assumes that mean difference in cholesterol 

- between Limburg and West-Flanders is 7 times as large as 
- the one between East- and West-Flanders


## Dummy variables


- Create 4 **dummy variables**

\begin{align*}
P_2 & =  \left\{\begin{array}{ll}
1 & \textrm{East Flanders}\\
0 & \textrm{other}
\end{array}\right.\\
P_3 & = \left\{\begin{array}{ll}
1 & \textrm{Flemish Brabant}\\
0 & \textrm{other}
\end{array}\right.\\
P_7 & =  \left\{\begin{array}{ll}
1 & \textrm{Antwerp}\\
0 & \textrm{other}
\end{array}\right.\\
P_8 & =  \left\{\begin{array}{ll}
1 & \textrm{Limburg}\\
0 & \textrm{other}
\end{array}\right.\\
\end{align*}


## Dummy variables

- These 4 dummy variables carry same information as variable PROVINCE:
  - In West Flanders: $(P_2,P_3,P_7,P_8)=(0,0,0,0)$
  - In East Flanders: $(P_2,P_3,P_7,P_8)=(1,0,0,0)$
  - In Flemish Brabant: $(P_2,P_3,P_7,P_8)=(0,1,0,0)$
  - In Antwerp: $(P_2,P_3,P_7,P_8)=(0,0,1,0)$
  - In Limburg: $(P_2,P_3,P_7,P_8)=(0,0,0,1)$

- Each categorical variable with $k$ levels can be transformed into $k-1$ dummy variables by choosing 1 level as **reference**:
- In \texttt{R}:

```{r eval=FALSE}
m <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3) 
        + SEX + factor(PROVINCE))
```

## Analysis with dummy variables

\footnotesize
```{r echo=FALSE}
m <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3) 
        + SEX + factor(PROVINCE))
summary(m)$coefficients
```

\phantom{Necessity to test if multiple coefficients are zero}

## How to test for effect of province?

\footnotesize
```{r echo=FALSE}
m <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3) 
        + SEX + factor(PROVINCE))
summary(m)$coefficients
```
\normalsize
Necessity to test if multiple coefficients are zero

## Partial F-test

Assume we want to compare \alert{2 nested models}:

- **Complete model (C)** with $p_C$ parameters; e.g., $p_C=10$ and
\[
  E(Y|X,P)=\beta_0+\beta_1X+\beta_2P
\]
- **Reduced model (R)** with $p_R$ parameters; e.g., $p_R=6$ and
\[
  E(Y|X,P)=\beta^*_0+\beta^*_1X
\]
- Testing $H_0:\beta_2=0$ is equivalent to testing if complete and reduced model are equal.

## Partial F-test

- Under null hypothesis, residual sums of squares of both models will be approximately same.
- **Test statistic**:
\[
  SSE(R) - SSE(C).
\]
- What is distribution under null hypothesis?
\[
  \frac{SSE(R) - SSE(C)}{p_C-p_R}\div \frac{SSE(C)}{n-p_C}\sim F_{p_C-p_R,n-p_C}.
\]
under null hypothesis

## Partial F-test in \texttt{R}

\footnotesize
```{r}
mC <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3) + SEX 
         + factor(PROVINCE))
mR <- lm(TCHOL ~ SMOKING + AGE + I(AGE^2) + I(AGE^3) + SEX)
anova(mR,mC)
```