class: center, middle, inverse, title-slide

.title[
# F-tests & Model Comparison
]

.subtitle[
## Data Analysis for Psychology in R 2
]

.author[
### dapR2 Team
]

.institute[
### Department of Psychology
The University of Edinburgh
]

---
# Week's Learning Objectives

1. Understand the use of `\(F\)` and incremental `\(F\)` tests.
2. Be able to run and interpret `\(F\)`-tests in R.
3. Understand how to use model comparisons to test different types of question.
4. Understand the difference between nested and non-nested models, and the appropriate statistics to use for comparison in each case.

---
class: inverse, center, middle

# Part 1: Recap and `\(F\)`-tests

---
# Where we left off...

+ Last week we looked at:
  + The significance of individual predictors
  + Overall model evaluation through `\(R^2\)` and adjusted `\(R^2\)`, to see how much variance in the outcome has been explained

+ Today we will:
  + Look at significance tests of the overall model
  + Discuss how we can use the same tools to do incremental tests (how much does my model improve when I add variables?)

---
# Significance of the overall model

+ The tests of the individual predictors (IVs, or `\(x\)`'s) do not tell us whether the overall model is significant.
  + Neither does `\(R^2\)`
  + But both are indicative

+ To test the significance of the model as a whole, we conduct an `\(F\)`-test.

---
# F-test & F-ratio

+ An `\(F\)`-test involves testing the statistical significance of a test statistic called (wait for it) the `\(F\)`-ratio.

+ The `\(F\)`-ratio tests the null hypothesis that all the regression slopes in a model are zero.

--

+ In other words, our predictors tell us nothing about our outcome.
  + They explain no variance.

--

+ If our predictors do explain enough variance, our `\(F\)`-ratio will be significant.

---
# Our results (significant F)

```r
performance <- lm(score ~ hours + motivation, data = test_study2)
summary(performance)
```

```
## 
## Call:
## lm(formula = score ~ hours + motivation, data = test_study2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9548  -2.8042  -0.2847   2.9344  13.8240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.86679    0.65473  10.488   <2e-16 ***
## hours        1.37570    0.07989  17.220   <2e-16 ***
## motivation   0.91634    0.38376   2.388   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.386 on 147 degrees of freedom
## Multiple R-squared:  0.6696,	Adjusted R-squared:  0.6651 
## F-statistic: 148.9 on 2 and 147 DF,  p-value: < 2.2e-16
```

---
# F-ratio: Some details

+ The `\(F\)`-ratio is a ratio of the explained to unexplained variance:

`$$F = \frac{\frac{SS_{model}}{df_{model}}}{\frac{SS_{residual}}{df_{residual}}} = \frac{MS_{Model}}{MS_{Residual}}$$`

+ Where MS = mean squares

--

+ **What are mean squares?**
  + Mean squares are sums of squares calculations divided by the associated degrees of freedom.
  + We saw how to calculate model and residual sums of squares last week.

+ But what are degrees of freedom...

---
# Degrees of freedom

+ The degrees of freedom are defined as the number of independent values associated with the different calculations.
  + Df are typically a combination of the amount of data you have (sample size, `\(n\)`) and the number of things you need to calculate/estimate ( `\(k\)` ).

+ **Model degrees of freedom = `\(k\)`**
  + `\(SS_{model}\)` depends on the `\(k\)` estimated `\(\beta\)`s, hence `\(k\)`.

+ **Residual degrees of freedom = `\(n-k-1\)`**
  + The `\(SS_{residual}\)` calculation is based on our model, in which we estimate `\(k\)` `\(\beta\)` terms and an intercept ( `\(1\)` ).

+ **Total degrees of freedom = `\(n-1\)`**
  + The `\(SS_{total}\)` calculation is based on the observed `\(y_i\)` and `\(\bar{y}\)`.
  + Once `\(\bar{y}\)` has been estimated, all but one value of `\(y\)` are free to vary, hence `\(n-1\)`.
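---
# Calculating `\(F\)` by hand

+ To make these formulas concrete, here is a minimal sketch that reassembles the mean squares and `\(F\)`-ratio from the `performance` model above. Nothing here is new estimation; all the names come from the earlier code.

```r
# Sums of squares from the fitted model
ss_residual <- sum(resid(performance)^2)
ss_total    <- sum((test_study2$score - mean(test_study2$score))^2)
ss_model    <- ss_total - ss_residual

# Degrees of freedom
n <- nobs(performance)
k <- length(coef(performance)) - 1  # predictors, excluding the intercept

# Mean squares and the F-ratio
ms_model    <- ss_model / k
ms_residual <- ss_residual / (n - k - 1)
ms_model / ms_residual              # should match the 148.9 from summary()
```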
---
# F-table

| SS       | df    | MS                        | F-ratio                | p-value                  |
|----------|-------|---------------------------|------------------------|--------------------------|
| Model    | k     | SS model / df model       | MS model / MS residual | F(df model, df residual) |
| Residual | n-k-1 | SS residual / df residual |                        |                          |
| Total    | n-1   |                           |                        |                          |

---
# Our example (note the df at the bottom)

```r
summary(performance)
```

```
## 
## Call:
## lm(formula = score ~ hours + motivation, data = test_study2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9548  -2.8042  -0.2847   2.9344  13.8240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.86679    0.65473  10.488   <2e-16 ***
## hours        1.37570    0.07989  17.220   <2e-16 ***
## motivation   0.91634    0.38376   2.388   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.386 on 147 degrees of freedom
## Multiple R-squared:  0.6696,	Adjusted R-squared:  0.6651 
## F-statistic: 148.9 on 2 and 147 DF,  p-value: < 2.2e-16
```

---
# F-ratio

+ Bigger `\(F\)`-ratios indicate better-fitting models.
  + It means the variance explained by the model is big compared to the residual variance.

--

+ `\(H_0\)` for the model says that the best guess of any individual's `\(y\)` value is `\(\bar{y}\)` plus error.
  + Or that the `\(x\)` variables collectively carry no information about `\(y\)`.
  + I.e. all slopes = 0

--

+ The `\(F\)`-ratio will be close to 1 when `\(H_0\)` is true
  + If there is equivalent residual and model variation ( `\(MS_{residual} = MS_{model}\)` ), `\(F = 1\)`
  + If there is more model than residual variation, `\(F > 1\)`

---
# Testing the significance of `\(F\)`

+ The `\(F\)`-ratio is our test statistic for the significance of our model.

+ As with all statistical inferences, we select an `\(\alpha\)` level.

+ We identify the proper null `\(F\)`-distribution and calculate the critical value of `\(F\)` associated with the chosen level of `\(\alpha\)`.

+ Then we compare our `\(F\)`-statistic to the critical value.
  + If our value is more extreme than the critical value, it is considered significant.
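---
# F under the null: a quick simulation

+ As an illustration of the claim that `\(F \approx 1\)` when `\(H_0\)` is true, here is a small simulation sketch. The sample size, number of replications, and seed are arbitrary choices, not part of the original analysis.

```r
# Simulate F-statistics for models where all true slopes are 0
set.seed(1)
f_null <- replicate(5000, {
  y  <- rnorm(150)   # outcome carrying no signal
  x1 <- rnorm(150)   # two arbitrary predictors
  x2 <- rnorm(150)
  summary(lm(y ~ x1 + x2))$fstatistic["value"]
})
mean(f_null)         # close to 1, as expected under the null
```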
---
# Sampling distribution for the null

.pull-left[
+ Similar to the `\(t\)`-distribution, the `\(F\)`-distribution changes shape based on `\(df\)`

+ With an `\(F\)`-statistic, we have to consider both `\(df_{model}\)` and `\(df_{residual}\)`
]

.pull-right[
<img src="dapr2_04_ftests_files/figure-html/unnamed-chunk-6-1.png" width="504" />
]

---
# A decision about the null

+ We have an `\(F\)`-statistic:
  + `\(F = 148.9\)`

--

+ We need to calculate `\(df_{model}\)` and `\(df_{residual}\)` to get our null distribution:
  + `\(df_{model}=k=2\)`
  + `\(df_{residual}=n-k-1=150-2-1=147\)`

--

+ We need to set our `\(\alpha\)` level
  + `\(\alpha = .05\)`

--

+ Now we can compute our critical value for `\(F\)`

---
# Visualize the test

.pull-left[
<img src="dapr2_04_ftests_files/figure-html/unnamed-chunk-7-1.png" width="504" />
]

.pull-right[
+ `\(F\)`-distribution with 2 `\(df_{model}\)` and 147 `\(df_{residual}\)` (our null distribution)
]

---
count: false
# Visualize the test

.pull-left[
<img src="dapr2_04_ftests_files/figure-html/unnamed-chunk-8-1.png" width="504" />
]

.pull-right[
+ `\(F\)`-distribution with 2 `\(df_{model}\)` and 147 `\(df_{residual}\)` (our null distribution)

+ Our critical value

```r
(Crit = round(qf(0.95, 2, 147), 3))
```

```
## [1] 3.058
```
]

---
count: false
# Visualize the test

.pull-left[
<img src="dapr2_04_ftests_files/figure-html/unnamed-chunk-10-1.png" width="504" />
]

.pull-right[
+ `\(F\)`-distribution with 2 `\(df_{model}\)` and 147 `\(df_{residual}\)` (our null distribution)

+ Our critical value

```r
(Crit = round(qf(0.95, 2, 147), 3))
```

```
## [1] 3.058
```

+ We can calculate the probability of an F-statistic at least as extreme as ours, given `\(H_0\)` is true (our `\(p\)`-value):

```r
(pVal = 1-pf(148.9, 2, 147))
```

```
## [1] 0
```
]

---
count: false
# Visualize the test

.pull-left[
<img src="dapr2_04_ftests_files/figure-html/unnamed-chunk-13-1.png" width="504" />
]

.pull-right[
+ `\(F\)`-distribution with 2 `\(df_{model}\)` and 147 `\(df_{residual}\)` (our null distribution)

+ Our critical value

```r
(Crit = round(qf(0.95, 2, 147), 3))
```

```
## [1] 3.058
```

+ We can calculate the probability of an F-statistic at least as extreme as ours, given `\(H_0\)` is true (our `\(p\)`-value):

```r
(pVal = 1-pf(148.9, 2, 147))
```

```
## [1] 0
```

+ Our model significantly predicted the variance in test scores, `\(F(2, 147) = 148.90, p < .001\)`
]

---
class: center, middle

# Questions?

---
class: inverse, center, middle

# Part 2: Model Comparison & Incremental `\(F\)`-tests

---
# Model comparisons

+ So far, our questions have been _is our overall model better than nothing?_ ( `\(F\)`-test ) or _which variables, specifically, are good predictors of the outcome variable?_ ( `\(t\)`-tests of `\(\beta\)` estimates )

--

+ But what if instead we wanted to ask:

> **When I make a change to my model, does it improve or not?**

+ This question is the core of model comparison

--

+ We can adapt this to our models in a more specific way:
  + E.g. is a model with `\(x_1\)`, `\(x_2\)`, and `\(x_3\)` as predictors better than the model with just `\(x_1\)`?

--

+ In our linear model journey so far, we haven't yet explored the tools to answer this question.
  + We have tested individual predictors
  + and we have tested overall models
  + **but we have not tested the improvement when we add predictors**
  + More generally, we have not looked at combined tests of more than one, but fewer than all, of our predictors

---
# `\(F\)`-test as an incremental test

+ One important way we can think about the `\(F\)`-test and the `\(F\)`-ratio is as an incremental test against an "empty" or null model.

+ A null or empty model is a linear model with only the intercept.
  + In this model, our predicted value of the outcome for every case in our data set is the mean of the outcome ( `\(\bar{y}\)` ).
  + That is, with no predictors, we have no information that may help us predict the outcome.
  + So we will be "least wrong" by guessing the mean of the outcome.

+ An empty model is the same as saying all `\(\beta\)` = 0.
  + And remember, this was the null hypothesis of the `\(F\)`-test.

--

+ So in this way, the `\(F\)`-test can be seen as **comparing two models** (see the sketch on the next slide).

+ We can extend this idea, and use the `\(F\)`-test to compare two models that contain different sets of predictors.
  + This is the **incremental `\(F\)`-test**
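---
# Seeing this in R

+ To make the "empty model" idea concrete, here is a quick sketch using the `performance` model and `test_study2` data from Part 1. Comparing an intercept-only model to our full model with the `anova()` function (which we will meet properly shortly) reproduces the overall `\(F\)`-test.

```r
# Intercept-only (empty) model: predicts mean(score) for everyone
m_empty <- lm(score ~ 1, data = test_study2)

# Comparing it to the two-predictor model gives the overall F-test
anova(m_empty, performance)  # F = 148.9, as in summary(performance)
```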
---
# Incremental `\(F\)`-test

.pull-left[
+ The incremental `\(F\)`-test evaluates the statistical significance of the improvement in variance explained in an outcome with the addition of further predictor(s)

+ It is based on the difference in `\(F\)`-values between two models.
  + We call the model with the additional predictor(s) **model 1** or the **full model**
  + We call the model without them **model 0** or the **restricted model**
]

.pull-right[
`$$F_{(df_R-df_F),df_F} = \frac{(SSR_R-SSR_F)/(df_R-df_F)}{SSR_F / df_F}$$`

$$
`\begin{align}
& \text{Where:} \\
& SSR_R = \text{residual sums of squares for the restricted model} \\
& SSR_F = \text{residual sums of squares for the full model} \\
& df_R = \text{residual degrees of freedom from the restricted model} \\
& df_F = \text{residual degrees of freedom from the full model} \\
\end{align}`
$$
]

---
# Why does this matter?

+ A very reasonable question you may have at this point is "so what, why does that matter?"

+ Well, it is very useful if we ever want to formally compare two models and decide which is the better one.

+ And that is what we will discuss next!

---
class: inverse, center, middle

# Part 3: Examples of Model comparisons

---
# Example 1

+ How about this example based on data from the Midlife in the United States (MIDUS2) study:

+ Outcome: self-rated health

+ Covariates: age, sex

+ Predictors: Big Five traits and Purpose in Life.

+ Research question: Does personality predict self-rated health over and above age and sex?

---
# The data

```r
midus <- read_csv("data/MIDUS2.csv")
midus2 <- midus %>%
  select(1:4, 31:42) %>%
  mutate(
    PIL = rowMeans(.[grep("PIL", names(.))], na.rm = TRUE)
  ) %>%
  select(1:4, 12:17) %>%
  drop_na(.)
slice(midus2, 1:3)
```

```
## # A tibble: 3 × 10
##      ID   age sex    health     O     C     E     A     N   PIL
##   <dbl> <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10002    69 MALE        8  2.14   2.8   2.6   3.4  2     5.86
## 2 10019    51 MALE        8  3.14   3     3.4   3.6  1.5   5.71
## 3 10023    78 FEMALE      4  3.57   3.4   3.6   4    1.75  5.14
```

---
# Example 1: The models

+ Does personality significantly predict self-rated health over and above the effects of age and sex?

+ The first step here is to run two models:
  + M1: we predict from age and sex
  + M2: we add in the FFM traits

```r
m1 <- lm(health ~ age + sex, data = midus2)
```

```r
m2 <- lm(health ~ age + sex + O + C + E + A + N, data = midus2)
```

---
# Incremental `\(F\)`-test in R

+ The second step is to compare the two models based on an incremental `\(F\)`-test.

+ In order to apply the `\(F\)`-test for model comparison in R, we use the `anova()` function.

+ `anova()` takes as its arguments the models that we wish to compare.
  + Here we will show an example with 2 models, but we could use more.

```r
anova(m1, m2)
```

---
# Incremental `\(F\)`-test in R

```r
anova(m1, m2)
```

```
## Analysis of Variance Table
## 
## Model 1: health ~ age + sex
## Model 2: health ~ age + sex + O + C + E + A + N
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1758 4740.2                                  
## 2   1753 4055.4  5    684.85 59.208 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
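---
# Incremental `\(F\)` by hand

+ As a sanity check on the formula from Part 2, here is a sketch that reproduces the `\(F\)`-value in the `anova()` table directly from the two fitted models. For a linear model, `deviance()` returns the residual sum of squares and `df.residual()` the residual degrees of freedom.

```r
# Residual SS and df for the restricted (m1) and full (m2) models
ssr_r <- deviance(m1)
ssr_f <- deviance(m2)
df_r  <- df.residual(m1)
df_f  <- df.residual(m2)

# Incremental F, following the formula from Part 2
((ssr_r - ssr_f) / (df_r - df_f)) / (ssr_f / df_f)
# ~59.21, matching the F column of anova(m1, m2)
```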
---
class: center, middle

# Questions?

---
class: inverse, center, middle

# Part 4: Nested models and alternatives to `\(F\)`-tests

---
# Nested vs non-nested models

+ The incremental `\(F\)`-test depends on the comparison models being nested.
  + Nested means that the predictors in one model are a subset of the predictors in the other.

+ We also require the models to be computed on the same data.

---
# Nested vs non-nested models

.pull-left[
**Nested**

```r
m0 <- lm(outcome ~ x1 + x2, data = data)
m1 <- lm(outcome ~ x1 + x2 + x3, data = data)
```

+ These models are nested.

+ `x1` and `x2` appear in both models
]

.pull-right[
**Non-nested**

```r
m0 <- lm(outcome ~ x1 + x2 + x4, data = data)
m1 <- lm(outcome ~ x1 + x2 + x3, data = data)
```

+ These models are non-nested

+ There are unique variables in both models
  + `x4` in `m0`
  + `x3` in `m1`
]

---
# Model comparison for non-nested models

+ So what happens when we have non-nested models?

+ There are two commonly used alternatives
  + AIC
  + BIC

+ Unlike the incremental `\(F\)`-test, AIC and BIC do not require the two models to be nested

+ Smaller (more negative) values indicate better-fitting models.
  + So we compare values and choose the model with the smaller AIC or BIC value

---
# AIC & BIC

.pull-left[
`$$AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k$$`

$$
`\begin{align}
& \text{Where:} \\
& SS_{residual} = \text{sum of squares residuals} \\
& n = \text{sample size} \\
& k = \text{number of explanatory variables} \\
& \text{ln} = \text{natural log function}
\end{align}`
$$
]

.pull-right[
`$$BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n)$$`

$$
`\begin{align}
& \text{Where:} \\
& SS_{residual} = \text{sum of squares residuals} \\
& n = \text{sample size} \\
& k = \text{number of explanatory variables} \\
& \text{ln} = \text{natural log function}
\end{align}`
$$
]

---
# Parsimony corrections

+ Both AIC and BIC contain something called a parsimony correction
  + In essence, they penalise models for being complex
  + This is to help us avoid overfitting (adding predictors arbitrarily to improve fit)

`$$AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k$$`

`$$BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n)$$`

+ For the sample sizes typical of linear model applications, BIC has a harsher parsimony penalty than AIC
  + Whenever `\(\text{ln}(n) > 2\)` (i.e. whenever `\(n \geq 8\)` ), BIC's penalty per predictor is more severe, so essentially all the time!
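---
# AIC by hand

+ A brief sketch applying the AIC formula above to `m1` and `m2`. One caveat worth flagging: R's built-in `AIC()` (used on the next slide) is computed from the log-likelihood and includes constant terms that the formula above omits, so the absolute values differ. Those constants cancel when comparing models fitted to the same data, so the *difference* between models agrees.

```r
# AIC from the slide formula: n * ln(SS_residual / n) + 2k
aic_hand <- function(model) {
  n <- nobs(model)
  k <- length(coef(model)) - 1  # explanatory variables, excluding intercept
  n * log(deviance(model) / n) + 2 * k
}

aic_hand(m2) - aic_hand(m1)  # difference between the two models
AIC(m2) - AIC(m1)            # the same difference from R's AIC()
```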
---
# In R

+ Let's use AIC and BIC on our `m1` and `m2` models from before:

.pull-left[
```r
AIC(m1, m2)
```

```
##    df      AIC
## m1  4 6749.246
## m2  9 6484.457
```
]

.pull-right[
```r
BIC(m1, m2)
```

```
##    df      BIC
## m1  4 6771.141
## m2  9 6533.719
```
]

---
# Let's consider a different example

+ Our previous models were nested
  + `m1` had just the covariates
  + `m2` added personality

+ Using the same data, let's consider a non-nested example.

+ Suppose we want to compare:
  + a model that predicts self-rated health from just the 5 personality variables (`nn1`: non-nested model 1)
  + to a model that predicts from age, sex, and a variable called Purpose in Life (PIL) (`nn2`).

---
# Applied to non-nested models

```r
nn1 <- lm(health ~ O + C + E + A + N, data = midus2)
nn2 <- lm(health ~ age + sex + PIL, data = midus2)
```

```r
AIC(nn1, nn2)
```

```
##     df      AIC
## nn1  7 6501.524
## nn2  5 6564.953
```

```r
BIC(nn1, nn2)
```

```
##     df      BIC
## nn1  7 6539.840
## nn2  5 6592.321
```

---
# Considerations for use of AIC and BIC

+ AIC and BIC can be used for both nested and non-nested models.

--

+ The AIC and BIC values for a single model are not meaningful
  + They only make sense for model comparisons
  + We evaluate these comparisons by looking at the difference, `\(\Delta\)`, between two values

--

+ There are no specific thresholds for `\(\Delta AIC\)` to suggest how big a difference in two models is needed to conclude that one is substantively better than the other

--

+ The following `\(\Delta BIC\)` cutoffs have been suggested (Raftery, 1995):

| Value                 | Interpretation                                     |
|-----------------------|----------------------------------------------------|
| `\(\Delta < 2\)`      | No evidence of difference between models           |
| `\(2 < \Delta < 6\)`  | Positive evidence of difference between models     |
| `\(6 < \Delta < 10\)` | Strong evidence of difference between models       |
| `\(\Delta > 10\)`     | Very strong evidence of difference between models  |

+ For example, for `nn1` and `nn2` above, `\(\Delta BIC = 6592.32 - 6539.84 \approx 52.5\)`: very strong evidence that the personality model (`nn1`) fits better.

---
class: center, middle

# Questions?

---
# Pause to summarise what we know so far

+ So far we have seen how to:
  + run a linear model with a single predictor
  + extend this and add predictors
  + interpret these coefficients either in original units or standardized units
  + test the significance of `\(\beta\)` coefficients
  + test the significance of the overall model
  + estimate the amount of variance explained by our model

+ Short version: well done, you can now run and interpret linear models with continuous predictors.

+ Next week we will put this into action, constructing and implementing an analysis plan for a linear model on a real example.

---
class: inverse, center, middle

# Thanks for listening