F-tests & Model Comparison


Data Analysis for Psychology in R 2

Emma Waterston


Department of Psychology
University of Edinburgh
2025–2026

Course Overview

  • Introduction to Linear Models
    • Intro to Linear Regression
    • Interpreting Linear Models
    • Testing Individual Predictors
    • Model Testing & Comparison
    • Linear Model Analysis
  • Analysing Experimental Studies
    • Categorical Predictors & Dummy Coding
    • Effects Coding & Coding Specific Contrasts
    • Assumptions & Diagnostics
    • Bootstrapping
    • Categorical Predictor Analysis
  • Interactions
    • Interactions I
    • Interactions II
    • Interactions III
    • Analysing Experiments
    • Interaction Analysis
  • Advanced Topics
    • Power Analysis
    • Binary Logistic Regression I
    • Binary Logistic Regression II
    • Logistic Regression Analysis
    • Exam Prep and Course Q&A

This Week’s Learning Objectives

  1. Understand the use of \(F\) and incremental \(F\) tests

  2. Be able to run and interpret \(F\)-tests in R

  3. Understand how to use model comparisons to test different types of questions

  4. Understand the difference between nested and non-nested models, and the appropriate statistics to use for comparison in each case

Part 1: Recap & Overview

Recap

  • Last week we looked at:
    • The significance of individual predictors
    • Overall model evaluation through \(R^2\) and \(\hat R^2\) to see how much variance in the outcome has been explained
  • This week we will:
    • Look at significance tests of the overall model
    • Discuss how we can use the same tools to do incremental tests (how much does my model improve when I add variables)

Statistical Significance of the Overall Model

  • Does our combination of \(x\)’s significantly improve prediction of \(y\), compared to not having any predictors?
  • Some indications that the model might be significant:
    • Slopes for individual predictors associated with significant \(p\)-values
    • High \(R^2\)
  • But these do not directly show model significance
  • To test the significance of the model as a whole, we conduct an \(F\)-test
  • When generally evaluating your model overall, ideally you want to look at all of these components together (i.e., significance of \(\beta\)s, \(R^2\) / \(\hat R^2\), and \(F\)-test)

\(F\)-test & \(F\)-ratio

  • An \(F\)-test involves testing the statistical significance of a test statistic called the \(F\)-ratio (also called \(F\)-statistic)

  • The \(F\)-ratio tests the null hypothesis that all the regression slopes in a model are zero (i.e., \(H_0: \text{All } \beta_j = 0\))

  • How does it work? It compares two models: the one you have specified and an intercept-only model (i.e., a model with no independent variables, where the model’s predictions equal the mean of the outcome)

  • In other words, our predictors tell us nothing about our outcome

  • They explain no variance

  • The more variance our predictors explain, the bigger our \(F\)-ratio

  • As with \(t\)-values and the \(t\)-distribution, we compare the \(F\)-statistic to the \(F\)-distribution to obtain a \(p\)-value

Our Results (Significant \(F\))

performance <- lm(score ~ hours + motivation, data = test_study2)
summary(performance)

Call:
lm(formula = score ~ hours + motivation, data = test_study2)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.955  -2.804  -0.285   2.934  13.824 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.8668     0.6547   10.49   <2e-16 ***
hours         1.3757     0.0799   17.22   <2e-16 ***
motivation    0.9163     0.3838    2.39    0.018 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.39 on 147 degrees of freedom
Multiple R-squared:  0.67,  Adjusted R-squared:  0.665 
F-statistic:  149 on 2 and 147 DF,  p-value: <2e-16

F-ratio: Some Details

  • \(F\)-ratio is a ratio of the explained to unexplained variance:

\[ F = \frac{\frac{SS_{model}}{df_{model}}}{\frac{SS_{residual}}{df_{residual}}} = \frac{MS_{Model}}{MS_{Residual}} \]

  • In other words, the ratio of how much of the variation is explained by the model (per parameter) to how much of the variation is left unexplained (per remaining degree of freedom)
  • What are mean squares (MS)?

    • Mean squares are sums of squares calculations divided by the associated degrees of freedom
    • We saw how to calculate model and residual sums of squares last week
  • But what are model and residual degrees of freedom?

Recap: Sums of Squares

\[ \small SS_{Total} = \sum_{i=1}^{n}(y_i - \bar{y})^2 \]

\[ \small SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

\[ \small SS_{Model} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \]
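  • A minimal sketch computing these three quantities in R, assuming the performance model and test_study2 data from earlier are still in the workspace:

y    <- test_study2$score
yhat <- fitted(performance)                  # model-predicted values
SS_total    <- sum((y - mean(y))^2)          # total variation in y
SS_residual <- sum((y - yhat)^2)             # variation left unexplained
SS_model    <- sum((yhat - mean(y))^2)       # variation explained by the model
c(SS_total, SS_model + SS_residual)          # the two values should match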

Degrees of Freedom

  • The degrees of freedom are defined as the number of independent values associated with the different calculations

    • Conceptually, how many values in the calculation can vary, if we keep the outcome of the calculation fixed
  • \(df\) are typically linked to:
    • the amount of data you have (sample size, \(n\))
    • and the number of things you need to calculate/estimate based on that data (in our case, the number of \(\beta\)s)

Degrees of Freedom

  • Model degrees of freedom = \(k\)

    • \(SS_{model}\) depends on the estimated \(\beta\)s, of which there are \(k + 1\) (the \(k\) predictor slopes plus the intercept)
    • From \(k + 1\), we subtract \(1\), as not all the estimates can vary while holding the outcome constant
    • This gives us \(k\) for the model \(df\)
  • Residual degrees of freedom = \(n-k-1\)
    • \(SS_{residual}\) calculation is based on our individual data points and our model (in which we estimate \(k + 1\) \(\beta\) terms, i.e. the slopes and an intercept)
    • For each coefficient estimated, we lose a degree of freedom, as we’re fitting the model to the data and reducing the flexibility in how much the residuals (errors) can vary
  • Total degrees of freedom = \(n-1\)
    • The \(SS_{total}\) calculation is based on the observed \(y_i\) and \(\bar{y}\)
    • In order to estimate \(\bar{y}\), all but one value of \(y\) are free to vary, hence \(n-1\)

Our Example (note the \(df\) at the bottom)

summary(performance)

Call:
lm(formula = score ~ hours + motivation, data = test_study2)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.955  -2.804  -0.285   2.934  13.824 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.8668     0.6547   10.49   <2e-16 ***
hours         1.3757     0.0799   17.22   <2e-16 ***
motivation    0.9163     0.3838    2.39    0.018 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.39 on 147 degrees of freedom
Multiple R-squared:  0.67,  Adjusted R-squared:  0.665 
F-statistic:  149 on 2 and 147 DF,  p-value: <2e-16

\(F\)-ratio

  • Bigger \(F\)-ratios indicate better fitting models

    • It means the variance explained by the model is big compared to the residual variance
  • \(H_0\) for the model says that the best guess of any individual’s \(y\) value is \(\bar{y}\) (plus error)
    • Or, that the \(x\) variables collectively carry no information about \(y\)
    • All slopes = 0

\[F = \frac{MS_{Model}}{MS_{Residual}}\]

  • \(F\)-ratio will be close to 1 when \(H_0\) is true
    • If the model and residual variation are equivalent ( \(MS_{model} = MS_{residual}\) ), then \(F = 1\)
    • If there is more model than residual variation, then \(F > 1\)
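
  • A minimal sketch of the \(F\)-ratio computed by hand, assuming the performance model and test_study2 data from earlier:

y    <- test_study2$score
yhat <- fitted(performance)                     # model-predicted values
n <- length(y); k <- 2                          # n = 150 observations, k = 2 predictors
MS_model    <- sum((yhat - mean(y))^2) / k              # SS_model / df_model
MS_residual <- sum((y - yhat)^2) / (n - k - 1)          # SS_residual / df_residual
MS_model / MS_residual                          # should match the F-statistic (~149)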

Testing the Significance of \(F\)

  • The \(F\)-ratio is our test statistic for the significance of our model

    • As with all statistical inferences, we select an \(\alpha\) level

    • Identify the proper null \(F\)-distribution and calculate the critical value of \(F\) associated with our chosen \(\alpha\)

    • Compare our \(F\)-statistic to the critical value

    • If our value is more extreme than the critical value, it is considered significant

Sampling Distribution for the Null

  • Similar to the \(t\)-distribution, the \(F\)-distribution changes shape based on \(df\)

  • With an \(F\)-statistic, we have to consider both the \(df_{model}\) and \(df_{residual}\)

  • When reported in parentheses, \(df_{model}\) is shown before \(df_{residual}\), i.e. \(F(df_{model}, df_{residual})\)

A Decision about the Null

  • We need to set our \(\alpha\) level

    • \(\alpha = .05\)
  • We have an \(F\)-statistic (from our model output summary):

    • \(F = 148.9\)
  • We consider \(df_{model}\) and \(df_{residual}\) to get our null distribution:

    • \(df_{model}=k=2\)

    • \(df_{residual}=n-k-1=150-2-1=147\)

  • Now we can compute our critical value for \(F\)

Visualise the Test

  • \(F\)-distribution with 2 \(df_{model}\) and 147 \(df_{residual}\) (our null distribution)

  • Our critical value (using the qf() function)

Crit <- round(qf(0.95, 2, 147), 3)
Crit
[1] 3.06
  • We can calculate the probability of an \(F\)-statistic at least as extreme as ours, given \(H_0\) is true (our \(p\)-value):
pVal <- 1 - pf(148.9, 2, 147)
pVal
[1] 0
  • Our model significantly predicted test scores, \(F(2, 147) = 148.90, p < .001\)

Part 2: Model Comparison & Incremental \(F\)-tests

Model Comparisons

  • So far:

    • We have tested individual predictors
    • and we have tested overall models
  • Our questions have been “is our overall model better than nothing?” (the \(F\)-test) or “which variables, specifically, are good predictors of the outcome variable?” (the \(t\)-tests of \(\beta\) estimates)

  • But what if instead we wanted to ask:

When I make a change to my model, does it improve or not?

  • This question is the core of model comparison
  • We can adapt this to our models in a more specific way:

    • E.g., is a model with \(x_1\), \(x_2\), and \(x_3\) as predictors better than a model with just \(x_1\)?
  • Can ask questions such as does our model improve when we add predictors?

    • To answer, we can look at the combined performance of a subset of predictors

\(F\)-test as an Incremental Test

  • One important way we can think about the \(F\)-test and the \(F\)-ratio is as an incremental test against an “empty” or null model

  • A null or empty model is a linear model with only the intercept

    • In this model, our predicted value of the outcome for every case in our data set is the mean of the outcome ( \(\bar{y}\) )
    • That is, with no predictors, we have no information that may help us predict the outcome
    • So we will be ‘least wrong’ by guessing the mean of the outcome
  • An empty model is the same as saying all slopes \(\beta_j = 0\)

    • And remember, this was the null hypothesis of the \(F\)-test
  • So in this way, the \(F\)-test can be seen as comparing two models
  • We can extend this idea, and use the \(F\)-test to compare two models that contain fewer or more predictors
    • This is the incremental \(F\)-test
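
  • We can check this equivalence directly in R: comparing an intercept-only model to our specified model with anova() reproduces the overall \(F\)-test. A sketch assuming the test_study2 data from Part 1 (mod0 and mod1 are hypothetical names):

mod0 <- lm(score ~ 1, data = test_study2)                   # intercept-only ("empty") model
mod1 <- lm(score ~ hours + motivation, data = test_study2)  # our specified model
anova(mod0, mod1)   # F (~149) and p should match the F-statistic line in summary(mod1)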

Incremental \(F\)-test

  • The incremental \(F\)-test evaluates the statistical significance of the improvement in variance explained in an outcome with the addition of further predictor(s)

  • It is based on the difference in the residual sums of squares of the two models

    • We call the model with the additional predictor(s) the full model (denoted with \(_F\))
    • We call the model without additional predictors the restricted model (denoted with \(_R\))

\[ F_{(df_R-df_F),df_F} = \frac{(SSR_R-SSR_F)/(df_R-df_F)}{SSR_F / df_F} \]

\[ \begin{align} & \text{Where:} \\ & SSR_R = \text{residual sums of squares for the restricted model} \\ & SSR_F = \text{residual sums of squares for the full model} \\ & df_R = \text{residual degrees of freedom from the restricted model} \\ & df_F = \text{residual degrees of freedom from the full model} \\ \end{align} \]
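
  • A minimal sketch of this formula in R, for two hypothetical nested models mR (restricted) and mF (full) fitted to the same data:

SSR_R <- sum(resid(mR)^2); df_R <- df.residual(mR)    # restricted model
SSR_F <- sum(resid(mF)^2); df_F <- df.residual(mF)    # full model
F_inc <- ((SSR_R - SSR_F) / (df_R - df_F)) / (SSR_F / df_F)
pf(F_inc, df_R - df_F, df_F, lower.tail = FALSE)      # p-value; should match anova(mR, mF)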

Example of Model Comparison

  • Consider this example based on data from the Midlife in the United States (MIDUS2) study:

    • Outcome: self-rated health (health)

    • Covariates: age (age), sex (sex)

    • Predictors: Big Five personality traits (O, C, E, A, N) and Purpose in Life (PIL)

  • Research Question: Does personality predict self-rated health over and above age and sex?

    • Two-step process to address the RQ:

      • Step 1: Fit both models

      • Step 2: Statistically compare models

The Data

# A tibble: 10 × 10
      ID   age sex    health     O     C     E     A     N   PIL
   <dbl> <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 10002    69 MALE        8  2.14   2.8   2.6   3.4  2     5.86
 2 10019    51 MALE        8  3.14   3     3.4   3.6  1.5   5.71
 3 10023    78 FEMALE      4  3.57   3.4   3.6   4    1.75  5.14
 4 10039    53 MALE        4  3.57   3.2   3     3.8  1.75  4.57
 5 10040    49 MALE        8  3.43   3.6   3.2   3    3     6   
 6 10042    59 FEMALE      8  3.71   4     3.2   4    1.75  6.71
 7 10047    45 FEMALE      9  2.43   4     3     4    1.75  5.71
 8 10050    44 FEMALE      3  3.71   3.6   2.8   4    3     6.86
 9 10060    58 MALE        7  3.71   3     3     3.6  2.25  6.57
10 10061    81 MALE        8  3      3.6   2.4   3.2  1.75  6.71

The Models

  • Does personality significantly predict self-rated health over and above the effects of age and sex?

  • First step: fit both models

    • M1: predict health from age and sex
    • M2: add the five-factor model (FFM) personality traits
m1 <- lm(health ~ age + sex, data = midus2)
m2 <- lm(health ~ age + sex + O + C + E + A + N, data = midus2)

Model 1 Output (Age + Sex)


Call:
lm(formula = health ~ age + sex, data = midus2)

Residuals:
   Min     1Q Median     3Q    Max 
-7.306 -1.050  0.580  0.853  2.977 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.75597    0.18365   42.23   <2e-16 ***
age         -0.00883    0.00312   -2.83   0.0047 ** 
sexMALE      0.03529    0.07862    0.45   0.6536    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.64 on 1758 degrees of freedom
Multiple R-squared:  0.00463,   Adjusted R-squared:  0.00349 
F-statistic: 4.09 on 2 and 1758 DF,  p-value: 0.017

Model 2 Output (Age + Sex + Personality)


Call:
lm(formula = health ~ age + sex + O + C + E + A + N, data = midus2)

Residuals:
   Min     1Q Median     3Q    Max 
-6.772 -0.792  0.253  1.010  3.955 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.66172    0.45100   14.77  < 2e-16 ***
age         -0.01310    0.00298   -4.40  1.2e-05 ***
sexMALE     -0.09571    0.07955   -1.20     0.23    
O            0.09307    0.08306    1.12     0.26    
C            0.57147    0.08507    6.72  2.5e-11 ***
E            0.56771    0.08061    7.04  2.7e-12 ***
A           -0.40380    0.09025   -4.47  8.2e-06 ***
N           -0.56493    0.06189   -9.13  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.52 on 1753 degrees of freedom
Multiple R-squared:  0.148, Adjusted R-squared:  0.145 
F-statistic: 43.7 on 7 and 1753 DF,  p-value: <2e-16

Incremental \(F\)-test in R

  • Second step: Compare the two models based on an incremental \(F\)-test

  • In order to apply the \(F\)-test for model comparison in R, we use the anova() function

  • anova() takes as its arguments the models that we wish to compare

    • Here we see an example with 2 models, but we could use more (e.g., anova(m1, m2, m3))
anova(m1, m2)

Incremental \(F\)-test in R

anova(m1, m2)
Analysis of Variance Table

Model 1: health ~ age + sex
Model 2: health ~ age + sex + O + C + E + A + N
  Res.Df  RSS Df Sum of Sq    F Pr(>F)    
1   1758 4740                             
2   1753 4055  5       685 59.2 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Personality was found to explain a significant amount of variance in self-rated health over and above the effects of age and sex, \(F(5, 1753) = 59.21, p < .001\).
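
As a quick check, we can reproduce the \(F\) in the table from its ingredients (using the rounded values as printed, so the result is approximate):

((4740 - 4055) / (1758 - 1753)) / (4055 / 1753)   # ~59.2, matching the F column above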

Part 3: Non-Nested Models and Alternatives to \(F\)-tests

Nested vs Non-Nested Models

  • The \(F\)-ratio depends on the comparison models being nested

    • Nested means that the predictors in one model are a subset of the predictors in the other
  • We also require the models to be computed on the same data

    • Be careful when the data contain NAs
    • The lm() function excludes the whole row of data if the \(y\) or any of the \(x\)’s specified in the model is missing on that row
    • If the additional variables you add to the full model have NAs, the data sets used by the two models could end up different! (see the sketch below for one way to guard against this)

You can only do an \(F\)-test if the models are nested: the variables are nested and the data are identical
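
One way to guard against NA-driven mismatches in R (a sketch; the variable names follow the nested example on the next slide) is to fit both models to the complete cases of all variables used in the full model:

vars   <- c("outcome", "x1", "x2", "x3")           # everything the full model uses
dat_cc <- na.omit(data[, vars])                    # keep complete cases only
m0 <- lm(outcome ~ x1 + x2,      data = dat_cc)
m1 <- lm(outcome ~ x1 + x2 + x3, data = dat_cc)
anova(m0, m1)                                      # both models now use identical rows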

Nested vs Non-Nested Models

Assuming no NAs in the data:

Nested

m0 <- lm(outcome ~ x1 + x2 , data = data)

m1 <- lm(outcome ~ x1 + x2 + x3, data = data)
  • x1 and x2 appear in both models

Non-Nested

m0 <- lm(outcome ~ x1 + x2 + x4, data = data)

m1 <- lm(outcome ~ x1 + x2 + x3, data = data)
  • Each model contains a variable the other does not
    • x4 appears only in m0
    • x3 appears only in m1

Model Comparison for Non-Nested Models

  • So what happens when we have non-nested models?

  • There are two commonly used alternatives

    • AIC (Akaike information criterion)
    • BIC (Bayesian information criterion)
  • Unlike the incremental \(F\)-test, AIC and BIC do not require the two models to be nested

  • Smaller (more negative) values indicate better-fitting models

    • So we compare values and choose the model with the smaller AIC or BIC value

AIC & BIC

\[ \small AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k \]

\[ \small BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n) \]

\[ \begin{align} & \text{Where:} \\ & SS_{residual} = \text{sum of squares residuals} \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ & \text{ln} = \text{natural log function} \end{align} \]
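
  • As a sketch, we can compute AIC from this formula for any fitted lm, assuming the m1 and m2 models from the MIDUS example (aic_slides is a hypothetical helper name). Note that R’s built-in AIC() includes extra constants that depend only on \(n\), so differences between models fitted to the same data agree:

aic_slides <- function(mod) {
  n  <- nobs(mod)                  # sample size
  k  <- length(coef(mod)) - 1      # number of explanatory variables
  ss <- sum(resid(mod)^2)          # SS_residual
  n * log(ss / n) + 2 * k          # log() is the natural log in R
}
aic_slides(m2) - aic_slides(m1)    # should match AIC(m2) - AIC(m1)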

Parsimony Corrections

  • Both AIC and BIC contain something called a parsimony correction

    • In essence, they penalise models for being complex

    • This is to help us avoid overfitting (adding predictors arbitrarily to improve fit)

\[ AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k \]

\[ BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n) \]

  • BIC has a harsher parsimony penalty than AIC for typical sample sizes when applying linear models
    • When \(\text{ln}(n) > 2\), i.e. whenever \(n > e^2 \approx 7.39\), BIC has the more severe parsimony penalty (so essentially all the time!)

In R

  • Let’s use AIC and BIC on the m1 and m2 models from earlier
m1 <- lm(health ~ age + sex, data = midus2)
m2 <- lm(health ~ age + sex + O + C + E + A + N, data = midus2)


AIC(m1, m2)
   df  AIC
m1  4 6749
m2  9 6484
BIC(m1, m2)
   df  BIC
m1  4 6771
m2  9 6534
  • Both AIC and BIC are smaller for m2, so the model including personality is preferred (consistent with the incremental \(F\)-test)

A Different Example

  • Our previous models were nested

    • m1 had just covariates
    • m2 added personality
  • Using the same data, let’s consider a non-nested example

  • Suppose we want to compare a model that:

    • predicts self-rated health from just the 5 personality variables (nn1: non-nested model 1)
    • to a model that predicts from age, sex, and a variable called Purpose in Life (PIL) (nn2: non-nested model 2)

Applied to Non-Nested Models

nn1 <- lm(health ~ O + C + E + A + N, data=midus2)
nn2 <- lm(health ~ age + sex + PIL, data = midus2)


AIC(nn1, nn2)
    df  AIC
nn1  7 6502
nn2  5 6565
BIC(nn1, nn2)
    df  BIC
nn1  7 6540
nn2  5 6592
  • Both AIC and BIC are smaller for nn1, so the personality-trait model is preferred

Considerations for use of AIC and BIC

  • AIC and BIC can be used for both nested and non-nested models
  • The AIC and BIC for a single model are not meaningful
    • They only make sense for model comparisons
    • We evaluate these comparisons by looking at the difference, \(\Delta\), between two values
  • There are no specific thresholds for \(\Delta AIC\) to suggest how big a difference in two models is needed to conclude that one is substantively better than the other
  • The following \(\Delta BIC\) cutoffs have been suggested (Raftery, 1995):
\(\Delta BIC\)        Interpretation
\(\Delta < 2\)        No evidence of difference between models
\(2 < \Delta < 6\)    Positive evidence of difference between models
\(6 < \Delta < 10\)   Strong evidence of difference between models
\(\Delta > 10\)       Very strong evidence of difference between models
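
  • A small sketch applying these cutoffs to the non-nested models above:

delta_bic <- abs(diff(BIC(nn1, nn2)$BIC))   # BIC difference between nn1 and nn2
delta_bic                                   # ~52: "very strong evidence" on this scale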

Block 1 Summary

  • So far we have seen how to:
    • run a linear model with a single predictor
    • extend this and add predictors
    • interpret these coefficients either in original units or standardised units
    • test the significance of \(\beta\) coefficients
    • test the significance of the overall model
    • estimate the amount of variance explained by our model
    • evaluate improvements to model fit when variables are added
    • select a better-fitting model between two nested or non-nested models
  • You can now run and interpret linear models with continuous predictors

This Week


Tasks

  • Attend your lab and work together on the exercises
  • Complete the weekly quiz

Support

  • Help each other on the Piazza forum
  • Attend office hours (see Learn page for details)