class: center, middle, inverse, title-slide

# Week 9: Model Building & Comparison
## Data Analysis for Psychology in R 2
### Tom Booth
### Department of Psychology, The University of Edinburgh
### AY 2020-2021
---

# Week's Learning Objectives

1. Understand the principles of model comparison and how to compare models via `\(R^2\)` and `\(F\)`-tests.
2. Understand how to apply AIC and BIC for model comparison.
3. Understand the basics of backward elimination, forward selection, and stepwise regression.
4. Distinguish exploratory from confirmatory approaches.
5. Understand basic principles of cross-validation techniques.

---
# Topics for today

+ The challenge of model selection/comparison
+ Statistical tools for selection/comparison
  + Incremental `\(F\)`-test
  + Nested vs. non-nested models
  + AIC
  + BIC
+ Automated model building/selection

---
# The challenge

.pull-left[
+ Model selection refers to choosing between competing statistical models.
+ An important aspect of model selection is choosing which of the predictors you have collected should be included in your model.
+ The challenge is to strike a balance and avoid:
  + over-fitting
  + under-fitting
+ There are no hard and fast rules for model selection: it requires using your judgement.
]

.pull-right[
<img src="./selection.png" width="401" />
]

---
# Why not include everything?

.pull-left[
+ **Theoretical**: the principle of parsimony
  + Occam's razor: all else being equal, simpler models are better.
+ **Practical**: impact on precision and power
  + Having large numbers of predictors increases standard errors and reduces the power to detect the effects of individual predictors.
]

.pull-right[
<img src="./okam.png" width="361" />
]

---
# Selection criteria

+ **Theoretical**: The theory you are testing implies the predictors to include.
  + e.g., you are testing the theory of planned behaviour, so you must include attitudes, subjective norms, and perceived behavioural control.
+ **Covariate control**: One or more predictors are known to be potential confounds and must be included in the model to control for them.
  + e.g., you are interested in the effects of personality on health but must control for the potential confound of age.
+ **Statistical**: The predictor(s) contribute in a statistically or practically significant sense to improving the variance explained in the outcome.

---
# Statistical approaches

+ For a single predictor, to decide whether we want it in the model we could look at the magnitude and statistical significance of its slope.
+ When thinking about model selection, we usually think about *sets* of predictors. Sets of predictors could include:
  + a structural set, which together encode the effect of a single construct (e.g., dummy variables);
  + a functional set, which represents a block of conceptually related predictors (e.g., the Big Five personality traits).
+ For either a single predictor or a set, we can statistically compare models with and without those variables (a sketch follows on the next slide), using:
  + the incremental F-test
  + AIC
  + BIC
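---
# Testing a set of predictors: a sketch

+ A minimal sketch of the pattern we use for the rest of today. All names here (`dat`, `y`, `x1`, `x2`, and the set `z1`–`z3`) are hypothetical placeholders; the real example follows.

```r
# Hypothetical sketch: fit a model without and with a set of predictors,
# then compare the two fits
m_restricted <- lm(y ~ x1 + x2, data = dat)
m_full       <- lm(y ~ x1 + x2 + z1 + z2 + z3, data = dat)

anova(m_restricted, m_full)  # incremental F-test for the set z1-z3
AIC(m_restricted, m_full)    # information criteria, introduced later today
BIC(m_restricted, m_full)
```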
---
# Example

+ Our example for today uses data from the Midlife in the United States (MIDUS2) study.
+ Outcome: self-rated health
+ Covariates: age, sex
+ Predictors: Big Five traits and Purpose in Life

---
# Data

```r
library(tidyverse)

midus <- read_csv("MIDUS2.csv")

midus2 <- midus %>%
  select(1:4, 31:42) %>%
  mutate(
    # Purpose in Life score: mean of the PIL items
    PIL = rowMeans(.[grep("PIL", names(.))], na.rm = TRUE)
  ) %>%
  select(1:4, 12:17) %>%
  drop_na(.)

slice(midus2, 1:3)
```

```
## # A tibble: 3 x 10
##      ID   age sex    health     O     C     E     A     N   PIL
##   <dbl> <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10002    69 MALE        8  2.14   2.8    2.6   3.4  2     5.86
## 2 10019    51 MALE        8  3.14   3      3.4   3.6  1.5   5.71
## 3 10023    78 FEMALE      4  3.57   3.4    3.6   4    1.75  5.14
```

---
class: center, middle
# Time for a break

**Grab a cup of tea/coffee... a few equations on the way.**

---
class: center, middle
# Welcome Back!

---
# Incremental F-test

+ Recall that the F-ratio for a single model tests the statistical significance of that linear model.
+ The incremental F-test evaluates the statistical significance of the improvement in variance explained in an outcome when further predictor(s) are added.
  + It is based on the difference in variance explained (equivalently, in residual sums of squares) between two models.
+ Note: the F-test as we have already seen it can be viewed in this way, as the comparison between a model with predictors and an "empty" (intercept-only) model.
+ We call the model with the additional predictor(s) model 1, or the full model.
+ We call the model without them model 0, or the restricted model.

---
# Incremental F-test

.pull-left[
`$$\Delta F_{(\Delta k, N- k_1 - 1)} = \frac{(N-k_1 - 1)\Delta R^2}{\Delta k(1-R_1^2)}$$`

$$
`\begin{align}
& \text{Where:} \\
& k_1 = \text{no. predictors in model 1} \\
& \Delta R^2 = \text{difference in } R^2 \text{ between models 0 and 1} \\
& \Delta k = \text{difference in no. predictors between models 0 and 1} \\
& R_1^2 = R^2 \text{ value from model 1} \\
\end{align}`
$$
]

.pull-right[
`$$F_{(df_R-df_F),df_F} = \frac{(SSR_R-SSR_F)/(df_R-df_F)}{SSR_F / df_F}$$`

$$
`\begin{align}
& \text{Where:} \\
& SSR_R = \text{residual sums of squares for the restricted model} \\
& SSR_F = \text{residual sums of squares for the full model} \\
& df_R = \text{residual degrees of freedom from the restricted model} \\
& df_F = \text{residual degrees of freedom from the full model} \\
\end{align}`
$$
]

---
# Models

```r
m0 <- lm(health ~ age + sex, data = midus2)
m0res <- summary(m0)

m1 <- lm(health ~ age + sex + O + C + E + A + N, data = midus2)
m1res <- summary(m1)
```

---
# Incremental F-test: calculation

`$$\Delta F_{(\Delta k, N- k_1 - 1)} = \frac{(N-k_1 - 1)\Delta R^2}{\Delta k(1-R_1^2)}$$`

+ `\(N\)` = 1761
+ `\(k_0\)` = 2
+ `\(k_1\)` = 7
+ `\(\Delta k\)` = 5
+ `\(R_{m0}^2\)` = 0.0046268
+ `\(R_{m1}^2\)` = 0.148435
+ `\(\Delta R^2\)` = 0.1438082

---
# Incremental F-test: calculation

`$$\Delta F_{(\Delta k, N- k_1 - 1)} = \frac{(N-k_1 - 1)\Delta R^2}{\Delta k(1-R_1^2)}$$`

+ Plug in our values:

`$$\Delta F_{(5, 1753)} = \frac{(1761-7 - 1)0.1438}{5(1-0.1484)}$$`

+ Work them through:

`$$\Delta F_{(5, 1753)} = \frac{252.0814}{4.2595}$$`

+ And there we go (we have rounded the numbers above, so this will not match perfectly):

`$$\Delta F_{(5, 1753)} = 59.18098$$`

---
# Incremental F-test: significance

+ Having calculated `\(\Delta F\)`, we compare it against an F-distribution with `\(\Delta k\)` and `\((N-k_1 - 1)\)` degrees of freedom.
+ This provides a p-value for the change in variance explained by model 1 versus model 0, evaluated against a given `\(\alpha\)`.
+ In our example:

```r
pf(59.18, 5, 1753, lower.tail = FALSE)
```

+ The resulting p-value is vanishingly small, consistent with the `< 2.2e-16` reported by `anova()` on the next slide.
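---
# Incremental F-test: by hand in R

+ A quick sketch reproducing the arithmetic above from the summary objects we stored earlier (nothing new here; `N`, `k1`, `dk`, and `dR2` are just placeholder names).

```r
# Reproduce the incremental F-test formula, using m0res and m1res from above
N   <- nrow(midus2)                       # 1761
k1  <- 7                                  # no. predictors in model 1
dk  <- 5                                  # difference in no. predictors
dR2 <- m1res$r.squared - m0res$r.squared  # change in R-squared

((N - k1 - 1) * dR2) / (dk * (1 - m1res$r.squared))  # ~59.2, matches anova(m0, m1)
```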
---
# In R

```r
anova(m0, m1)
```

```
## Analysis of Variance Table
## 
## Model 1: health ~ age + sex
## Model 2: health ~ age + sex + O + C + E + A + N
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1758 4740.2                                  
## 2   1753 4055.4  5    684.85 59.208 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
# Nested vs. non-nested models

+ The incremental F-test requires the models being compared to be nested.
  + Nested means that the predictors in one model are a subset of the predictors in the other.
+ We also require the models to be fitted to the same data.

---
# Nested vs. non-nested models

.pull-left[
**Nested**

```r
m0 <- lm(outcome ~ x1 + x2, data = data)

m1 <- lm(outcome ~ x1 + x2 + x3, data = data)
```

+ These models are nested.
+ `x1` and `x2` appear in both models.
]

.pull-right[
**Non-nested**

```r
m0 <- lm(outcome ~ x1 + x2 + x4, data = data)

m1 <- lm(outcome ~ x1 + x2 + x3, data = data)
```

+ These models are non-nested.
+ There are unique variables in both models:
  + `x4` in `m0`
  + `x3` in `m1`
]

---
class: center, middle
# Time for a break

**Time to have a look at a few model specifications and decide if they are nested.**

---
class: center, middle
# Welcome Back!

**What happens when models are non-nested?**

---
# AIC

`$$AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k$$`

$$
`\begin{align}
& \text{Where:} \\
& SS_{residual} = \text{residual sum of squares} \\
& n = \text{sample size} \\
& k = \text{number of explanatory variables} \\
& \text{ln} = \text{natural log function}
\end{align}`
$$

+ Unlike the incremental F-test, AIC does not require the two models to be nested.
+ Smaller (more negative) values of AIC indicate better-fitting models.
  + So we compare values and choose the model with the smaller AIC.

---
# AIC: parsimony correction

`$$AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k$$`

+ The main point to note is that the term `\(2k\)` applies a penalty for having more predictors.
+ When you add more predictors, fit will improve (the residual sum of squares will get smaller).
+ That decrease is partially offset by the `\(+2k\)` term.
+ This makes AIC a parsimony-corrected statistic.
  + Parsimony-corrected statistics help us avoid over-fitting.

---
# In R

```r
AIC(m0, m1)
```

```
##    df      AIC
## m0  4 6749.246
## m1  9 6484.457
```

---
# BIC

`$$BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n)$$`

$$
`\begin{align}
& \text{Where:} \\
& SS_{residual} = \text{residual sum of squares} \\
& n = \text{sample size} \\
& k = \text{number of explanatory variables} \\
& \text{ln} = \text{natural log function}
\end{align}`
$$

+ Like AIC...
  + BIC does not require nested models.
  + Smaller (more negative) BIC values mean better models.
  + We can compare the BICs for two models and choose the one with the smaller BIC as the better model.

---
# In R

```r
BIC(m0, m1)
```

```
##    df      BIC
## m0  4 6771.141
## m1  9 6533.719
```

---
# Parsimony corrections

`$$AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k$$`

`$$BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n)$$`

+ For typical sample sizes in linear modelling, BIC has a 'harsher' parsimony penalty than AIC.
  + Whenever `\(\text{ln}(n) > 2\)` (i.e., `\(n \geq 8\)`, so essentially all the time!), BIC's penalty is more severe.

---
# Considerations for use of AIC and BIC

+ The AIC and BIC for a model are not meaningful on their own.
  + They only make sense for model comparisons.
+ For AIC, there are no cut-offs to suggest how big a difference between two models is needed to conclude that one is substantively better than the other.
+ For BIC, a difference of 10 can be used as a rule of thumb to suggest that one model is substantively better than another.
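---
# AIC and BIC: formulas vs. R

+ A sketch of the slide formulas as helper functions (`aic_ss()` and `bic_ss()` are hypothetical names). R's `AIC()`/`BIC()` work from the model log-likelihood and count *all* estimated parameters, so the absolute values differ from these formulas by a constant; the *difference* between two models fitted to the same data is identical either way.

```r
# Hypothetical helpers implementing the sums-of-squares formulas from the slides
aic_ss <- function(model) {
  n <- nobs(model)
  k <- length(coef(model)) - 1   # number of explanatory variables
  n * log(sum(resid(model)^2) / n) + 2 * k
}

bic_ss <- function(model) {
  n <- nobs(model)
  k <- length(coef(model)) - 1
  n * log(sum(resid(model)^2) / n) + k * log(n)
}

aic_ss(m1) - aic_ss(m0)   # same as AIC(m1) - AIC(m0)
bic_ss(m1) - bic_ss(m0)   # same as BIC(m1) - BIC(m0)
```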
---
# Using statistical tools

+ AIC and BIC sometimes tell a different story from one another.
  + This is due to their different parsimony corrections: BIC's harsher penalty makes it more likely to lead you to select a 'simpler' model (i.e., a model with fewer predictors) than AIC.
  + Choosing between them requires you to think about whether you are more concerned about avoiding over-fitting or under-fitting.
+ Similarly, both AIC and BIC can tell a different story from the incremental F-test.

---
# Missing data

+ It is important to make sure the two models being compared with the incremental F-test, AIC, or BIC are fitted to the same cases (participants).
+ Beware when you have missing data on your predictors.

---
# Missing data

.pull-left[

| ID           |  Y | x1 | x2 |
|:-------------|---:|---:|---:|
| Participant1 | 10 |  2 |  5 |
| Participant2 | 21 | NA |  7 |
| Participant3 |  5 |  1 |  9 |
| Participant4 | 10 |  3 | NA |
| Participant5 |  5 |  4 | 10 |
| Participant6 |  6 | 10 | 12 |

]

.pull-right[
```r
m0 <- lm(Y ~ x1, data = data)
```

+ `m0` includes participants 1, 3, 4, 5, and 6.

```r
m1 <- lm(Y ~ x2, data = data)
```

+ `m1` includes participants 1, 2, 3, 5, and 6.
]

---
class: center, middle
# Time for a break

**Let's look at some examples and think about the preferred model.**

---
class: center, middle
# Welcome Back!

**So far we have assumed we have one or more theoretical models and we are comparing them.**

**What if we don't?**

---
# Automated approaches

+ Exploratory models
+ "Stepwise" methods
  + Forward selection
  + Backward selection
+ All possible subsets

---
# Forward

+ Start with the variable that has the highest association with the DV.
+ Next, add whichever remaining variable increases the model `\(R^2\)` most.
+ Continue until no variable improves the model `\(R^2\)`.

---
# Backward

+ Start with all variables in the model.
+ Remove the predictor with the highest p-value.
+ Run the model again and repeat.
+ Stop when all p-values for predictors are less than the a priori critical p-value.
+ (An R sketch of both procedures follows after the subsets slides.)

---
# Good idea?

<img src="./fishing.png" width="50%" />

---
# All possible regressions

+ Start with a model with only the intercept (the "empty model": our best guess with no predictors).
+ Then run all possible models with 1 variable.
+ Then run all possible models with 2 variables.
+ Then run all possible models with 3 variables.
+ And so on, until you reach the maximum number of predictors.
+ This results in lots of models: `\(2^k\)`, where `\(k\)` = number of predictors.
  + e.g., 12 predictors = `\(2^{12}\)` = 4096 equations.

---
# All possible subsets

+ All possible subsets is an essentially identical procedure to all-possible-regressions.
+ Here you select a subset of key variables.
  + Note: if you have designed your study well, this will be all your variables.
+ Run all models with 1 predictor, 2 predictors, etc.
+ Select and present the best model for each number of predictors.
+ Thus, if we had 8 predictors, we would evaluate 8 possible models (plus the empty model).
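---
# Stepwise selection in R: a sketch

+ A minimal sketch using `step()` from base R, fitted to today's models. Note that `step()` compares models by AIC rather than by the p-value/`\(R^2\)` rules described on the previous slides, but it follows the same forward/backward logic.

```r
# Sketch: automated selection with stats::step() (AIC-based, not p-value-based)
empty <- lm(health ~ 1, data = midus2)                             # intercept only
full  <- lm(health ~ age + sex + O + C + E + A + N, data = midus2)

# Backward elimination: start full, drop the term whose removal lowers AIC most
step(full, direction = "backward")

# Forward selection: start empty, add terms from the scope of the full model
step(empty, scope = formula(full), direction = "forward")
```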
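---
# All possible subsets in R: a sketch

+ One possible tool (a sketch, assuming the **leaps** package is installed; any best-subsets routine would do) is `regsubsets()`, which reports the best model at each number of predictors.

```r
# Sketch: best model of each size, via exhaustive search over subsets
library(leaps)

all_subsets <- regsubsets(health ~ age + sex + O + C + E + A + N,
                          data = midus2, nvmax = 7)  # up to 7 predictors

summary(all_subsets)   # which predictors the best model of each size includes
```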
---
# Good idea?

<img src="./thinice.png" width="25%" />

---
# Broad considerations

+ Atheoretical.
+ Useful solely for predictive purposes.
+ A solution may be numerically better than others, but it may not be interpretable.
+ General issues with multiple comparisons.
+ The quality of the model is still dependent on the design and the inclusion of a good set of variables.
+ The selection of the "best" model may also be influenced by the criteria you use to define what is best.

---
# Summary of today

+ Introduced the general problem of model selection.
+ Discussed various statistical tools, and their differences, for deciding between models.
+ Discussed data-driven exploratory model-building tools (and their issues).