F-tests & Model Comparison


Data Analysis for Psychology in R 2

Emma Waterston


Department of Psychology
University of Edinburgh
2025–2026

Course Overview

  • Introduction to Linear Models
    • Intro to Linear Regression
    • Interpreting Linear Models
    • Testing Individual Predictors
    • Model Testing & Comparison
    • Linear Model Analysis
  • Analysing Experimental Studies
    • Categorical Predictors & Dummy Coding
    • Effects Coding & Coding Specific Contrasts
    • Assumptions & Diagnostics
    • Bootstrapping
    • Categorical Predictor Analysis
  • Interactions
    • Interactions I
    • Interactions II
    • Interactions III
    • Analysing Experiments
    • Interaction Analysis
  • Advanced Topics
    • Power Analysis
    • Binary Logistic Regression I
    • Binary Logistic Regression II
    • Logistic Regression Analysis
    • Exam Prep and Course Q&A

This Week’s Learning Objectives

  1. Understand the use of \(F\) and incremental \(F\) tests

  2. Be able to run and interpret \(F\)-tests in R

  3. Understand how to use model comparisons to test different types of questions

  4. Understand the difference between nested and non-nested models, and the appropriate statistics to use for comparison in each case

Part 1: Recap & Overview

Recap

  • Last week we looked at:
    • The significance of individual predictors
    • Overall model evaluation through \(R^2\) and \(\hat R^2\) to see how much variance in the outcome has been explained
  • This week we will:
    • Look at significance tests of the overall model
    • Discuss how we can use the same tools to do incremental tests (how much does my model improve when I add variables)

Statistical Significance of the Overall Model

  • Does our combination of \(x\)’s significantly improve prediction of \(y\), compared to not having any predictors?
  • Some indications that the model might be significant:
    • Slopes for individual predictors associated with significant \(p\)-values
    • High \(R^2\)
  • But these do not directly show model significance
  • To test the significance of the model as a whole, we conduct an \(F\)-test
  • When generally evaluating your model overall, ideally you want to look at all of these components together (i.e., significance of \(\beta\)s, \(R^2\) / \(\hat R^2\), and \(F\)-test)

\(F\)-test & \(F\)-ratio

  • An \(F\)-test involves testing the statistical significance of a test statistic called the \(F\)-ratio (also called \(F\)-statistic)

  • The \(F\)-ratio tests the null hypothesis that all the regression slopes in a model are zero (i.e., \(H_0: \text{All } \beta_j = 0\))

  • How does it work? It compares two models: the one you have specified and an intercept-only model (i.e., a model with no independent variables, where the model’s predictions equal the mean of the outcome)

  • In other words, our predictors tell us nothing about our outcome

  • They explain no variance

  • The more variance our predictors explain, the bigger our \(F\)-ratio

  • As with \(t\)-values and the \(t\)-distribution, we compare the \(F\)-statistic to the \(F\)-distribution to obtain a \(p\)-value

Our Results (Significant \(F\))

performance <- lm(score ~ hours + motivation, data = test_study2)
summary(performance)

Call:
lm(formula = score ~ hours + motivation, data = test_study2)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.955  -2.804  -0.285   2.934  13.824 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.8668     0.6547   10.49   <2e-16 ***
hours         1.3757     0.0799   17.22   <2e-16 ***
motivation    0.9163     0.3838    2.39    0.018 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.39 on 147 degrees of freedom
Multiple R-squared:  0.67,  Adjusted R-squared:  0.665 
F-statistic:  149 on 2 and 147 DF,  p-value: <2e-16

F-ratio: Some Details

  • \(F\)-ratio is a ratio of the explained to unexplained variance:

\[ F = \frac{\frac{SS_{model}}{df_{model}}}{\frac{SS_{residual}}{df_{residual}}} = \frac{MS_{Model}}{MS_{Residual}} \]

  • In other words, the ratio of how much of the variation is explained by the model (per parameter) to how much of the variation is left unexplained (per remaining degree of freedom)
  • What are mean squares (MS)?

    • Mean squares are sums of squares calculations divided by the associated degrees of freedom
    • We saw how to calculate model and residual sums of squares last week
  • But what are model and residual degrees of freedom?

Recap: Sums of Squares

\[ \small SS_{Total} = \sum_{i=1}^{n}(y_i - \bar{y})^2 \]

\[ \small SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

\[ \small SS_{Model} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \]
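  • A minimal sketch computing these three quantities in R, assuming the performance model and test_study2 data from earlier are still in the workspace:

y    <- test_study2$score
yhat <- fitted(performance)                  # model-predicted values
SS_total    <- sum((y - mean(y))^2)          # total variation in y
SS_residual <- sum((y - yhat)^2)             # variation left unexplained
SS_model    <- sum((yhat - mean(y))^2)       # variation explained by the model
c(SS_total, SS_model + SS_residual)          # the two values should match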

Degrees of Freedom

  • The degrees of freedom are defined as the number of independent values associated with the different calculations

    • Conceptually, how many values in the calculation can vary, if we keep the outcome of the calculation fixed
  • \(df\) are typically linked to:
    • the amount of data you have (sample size, \(n\))
    • and the number of things you need to calculate/estimate based on that data (in our case, the number of \(\beta\)s)

Degrees of Freedom

  • Model degrees of freedom = \(k\)

    • \(SS_{model}\) depends on the estimated \(\beta\)s, of which there are \(k + 1\) (the \(k\) predictor slopes plus the intercept)
    • From \(k + 1\), we subtract \(1\), as not all the estimates can vary while holding the outcome constant
    • This gives us \(k\) for the model \(df\)
  • Residual degrees of freedom = \(n-k-1\)
    • \(SS_{residual}\) calculation is based on our individual data points and our model (in which we estimate \(k + 1\) \(\beta\) terms, i.e. the slopes and an intercept)
    • For each coefficient estimated, we lose a degree of freedom, as we’re fitting the model to the data and reducing the flexibility in how much the residuals (errors) can vary
  • Total degrees of freedom = \(n-1\)
    • The \(SS_{total}\) calculation is based on the observed \(y_i\) and \(\bar{y}\)
    • In order to estimate \(\bar{y}\), all but one value of \(y\) are free to vary, hence \(n-1\)

Our Example (note the \(df\) at the bottom)

summary(performance)

Call:
lm(formula = score ~ hours + motivation, data = test_study2)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.955  -2.804  -0.285   2.934  13.824 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.8668     0.6547   10.49   <2e-16 ***
hours         1.3757     0.0799   17.22   <2e-16 ***
motivation    0.9163     0.3838    2.39    0.018 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.39 on 147 degrees of freedom
Multiple R-squared:  0.67,  Adjusted R-squared:  0.665 
F-statistic:  149 on 2 and 147 DF,  p-value: <2e-16

\(F\)-ratio

  • Bigger \(F\)-ratios indicate better fitting models

    • It means the variance explained by the model is big compared to the residual variance
  • \(H_0\) for the model says that the best guess of any individual’s \(y\) value is \(\bar{y}\) (plus error)
    • Or, that the \(x\) variables collectively carry no information about \(y\)
    • All slopes = 0

\[F = \frac{MS_{Model}}{MS_{Residual}}\]

  • \(F\)-ratio will be close to 1 when \(H_0\) is true
    • If the model and residual variation are equivalent ( \(MS_{model} = MS_{residual}\) ), then \(F = 1\)
    • If there is more model than residual variation, then \(F > 1\)
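
  • A minimal sketch of the \(F\)-ratio computed by hand, assuming the performance model and test_study2 data from earlier:

y    <- test_study2$score
yhat <- fitted(performance)                     # model-predicted values
n <- length(y); k <- 2                          # n = 150 observations, k = 2 predictors
MS_model    <- sum((yhat - mean(y))^2) / k              # SS_model / df_model
MS_residual <- sum((y - yhat)^2) / (n - k - 1)          # SS_residual / df_residual
MS_model / MS_residual                          # should match the F-statistic (~149)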

Testing the Significance of \(F\)

  • The \(F\)-ratio is our test statistic for the significance of our model

    • As with all statistical inferences, we select an \(\alpha\) level

    • Identify the proper null \(F\)-distribution and calculate the critical value of \(F\) associated with our chosen \(\alpha\)

    • Compare our \(F\)-statistic to the critical value

    • If our value is more extreme than the critical value, it is considered significant

Sampling Distribution for the Null

  • Similar to the \(t\)-distribution, the \(F\)-distribution changes shape based on \(df\)

  • With an \(F\)-statistic, we have to consider both the \(df_{model}\) and \(df_{residual}\)

  • When reported in parentheses, \(df_{model}\) is shown before \(df_{residual}\), i.e. \(F(df_{model}, df_{residual})\)

A Decision about the Null

  • We need to set our \(\alpha\) level

    • \(\alpha = .05\)
  • We have an \(F\)-statistic (from our model output summary):

    • \(F = 148.9\)
  • We consider \(df_{model}\) and \(df_{residual}\) to get our null distribution:

    • \(df_{model}=k=2\)

    • \(df_{residual}=n-k-1=150-2-1=147\)

  • Now we can compute our critical value for \(F\)

Visualise the Test

  • \(F\)-distribution with 2 \(df_{model}\) and 147 \(df_{residual}\) (our null distribution)

  • Our critical value (using the qf() function)

Crit <- round(qf(0.95, 2, 147), 3)
Crit
[1] 3.06
  • We can calculate the probability of an \(F\)-statistic at least as extreme as ours, given \(H_0\) is true (our \(p\)-value):
pVal <- 1 - pf(148.9, 2, 147)
pVal
[1] 0
  • Our model significantly predicted test scores, \(F(2, 147) = 148.90, p < .001\)

Part 2: Model Comparison & Incremental \(F\)-tests

Model Comparisons

  • So far:

    • We have tested individual predictors
    • and we have tested overall models
  • Our questions have been “is our overall model better than nothing?” (the \(F\)-test) or “which variables, specifically, are good predictors of the outcome variable?” (the \(t\)-tests of \(\beta\) estimates)

  • But what if instead we wanted to ask:

When I make a change to my model, does it improve or not?

  • This question is the core of model comparison
  • We can adapt this to our models in a more specific way:

    • E.g., is a model with \(x_1\), \(x_2\), and \(x_3\) as predictors better than a model with just \(x_1\)?
  • Can ask questions such as does our model improve when we add predictors?

    • To answer, we can look at the combined performance of a subset of predictors

\(F\)-test as an Incremental Test

  • One important way we can think about the \(F\)-test and the \(F\)-ratio is as an incremental test against an “empty” or null model

  • A null or empty model is a linear model with only the intercept

    • In this model, our predicted value of the outcome for every case in our data set is the mean of the outcome ( \(\bar{y}\) )
    • That is, with no predictors, we have no information that may help us predict the outcome
    • So we will be ‘least wrong’ by guessing the mean of the outcome
  • An empty model is the same as saying all slopes \(\beta_j = 0\)

    • And remember, this was the null hypothesis of the \(F\)-test
  • So in this way, the \(F\)-test can be seen as comparing two models
  • We can extend this idea, and use the \(F\)-test to compare two models that contain fewer or more predictors
    • This is the incremental \(F\)-test
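
  • We can check this equivalence directly in R: comparing an intercept-only model to our specified model with anova() reproduces the overall \(F\)-test. A sketch assuming the test_study2 data from Part 1 (mod0 and mod1 are hypothetical names):

mod0 <- lm(score ~ 1, data = test_study2)                   # intercept-only ("empty") model
mod1 <- lm(score ~ hours + motivation, data = test_study2)  # our specified model
anova(mod0, mod1)   # F (~149) and p should match the F-statistic line in summary(mod1)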

Incremental \(F\)-test

  • The incremental \(F\)-test evaluates the statistical significance of the improvement in variance explained in an outcome with the addition of further predictor(s)

  • It is based on the difference in the residual sums of squares of the two models

    • We call the model with the additional predictor(s) the full model (denoted with \(_F\))
    • We call the model without additional predictors the restricted model (denoted with \(_R\))

\[ F_{(df_R-df_F),df_F} = \frac{(SSR_R-SSR_F)/(df_R-df_F)}{SSR_F / df_F} \]

\[ \begin{align} & \text{Where:} \\ & SSR_R = \text{residual sums of squares for the restricted model} \\ & SSR_F = \text{residual sums of squares for the full model} \\ & df_R = \text{residual degrees of freedom from the restricted model} \\ & df_F = \text{residual degrees of freedom from the full model} \\ \end{align} \]
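
  • A minimal sketch of this formula in R, for two hypothetical nested models mR (restricted) and mF (full) fitted to the same data:

SSR_R <- sum(resid(mR)^2); df_R <- df.residual(mR)    # restricted model
SSR_F <- sum(resid(mF)^2); df_F <- df.residual(mF)    # full model
F_inc <- ((SSR_R - SSR_F) / (df_R - df_F)) / (SSR_F / df_F)
pf(F_inc, df_R - df_F, df_F, lower.tail = FALSE)      # p-value; should match anova(mR, mF)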

Example of Model Comparison

  • Consider this example based on data from the Midlife in the United States (MIDUS2) study:

    • Outcome: self-rated health (health)

    • Covariates: age (age), sex (sex)

    • Predictors: Big Five personality traits (O, C, E, A, N) and Purpose in Life (PIL)

  • Research Question: Does personality predict self-rated health over and above age and sex?

    • Two-step process to address the RQ:

      • Step 1: Fit both models

      • Step 2: Statistically compare models

The Data

# A tibble: 10 × 10
      ID   age sex    health     O     C     E     A     N   PIL
   <dbl> <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 10002    69 MALE        8  2.14   2.8   2.6   3.4  2     5.86
 2 10019    51 MALE        8  3.14   3     3.4   3.6  1.5   5.71
 3 10023    78 FEMALE      4  3.57   3.4   3.6   4    1.75  5.14
 4 10039    53 MALE        4  3.57   3.2   3     3.8  1.75  4.57
 5 10040    49 MALE        8  3.43   3.6   3.2   3    3     6   
 6 10042    59 FEMALE      8  3.71   4     3.2   4    1.75  6.71
 7 10047    45 FEMALE      9  2.43   4     3     4    1.75  5.71
 8 10050    44 FEMALE      3  3.71   3.6   2.8   4    3     6.86
 9 10060    58 MALE        7  3.71   3     3     3.6  2.25  6.57
10 10061    81 MALE        8  3      3.6   2.4   3.2  1.75  6.71

The Models

  • Does personality significantly predict self-rated health over and above the effects of age and sex?

  • First step: fit both models

    • M1: predict health from age and sex
    • M2: add the five-factor model (FFM) personality traits
m1 <- lm(health ~ age + sex, data = midus2)
m2 <- lm(health ~ age + sex + O + C + E + A + N, data = midus2)

Model 1 Output (Age + Sex)


Call:
lm(formula = health ~ age + sex, data = midus2)

Residuals:
   Min     1Q Median     3Q    Max 
-7.306 -1.050  0.580  0.853  2.977 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.75597    0.18365   42.23   <2e-16 ***
age         -0.00883    0.00312   -2.83   0.0047 ** 
sexMALE      0.03529    0.07862    0.45   0.6536    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.64 on 1758 degrees of freedom
Multiple R-squared:  0.00463,   Adjusted R-squared:  0.00349 
F-statistic: 4.09 on 2 and 1758 DF,  p-value: 0.017

Model 2 Output (Age + Sex + Personality)


Call:
lm(formula = health ~ age + sex + O + C + E + A + N, data = midus2)

Residuals:
   Min     1Q Median     3Q    Max 
-6.772 -0.792  0.253  1.010  3.955 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.66172    0.45100   14.77  < 2e-16 ***
age         -0.01310    0.00298   -4.40  1.2e-05 ***
sexMALE     -0.09571    0.07955   -1.20     0.23    
O            0.09307    0.08306    1.12     0.26    
C            0.57147    0.08507    6.72  2.5e-11 ***
E            0.56771    0.08061    7.04  2.7e-12 ***
A           -0.40380    0.09025   -4.47  8.2e-06 ***
N           -0.56493    0.06189   -9.13  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.52 on 1753 degrees of freedom
Multiple R-squared:  0.148, Adjusted R-squared:  0.145 
F-statistic: 43.7 on 7 and 1753 DF,  p-value: <2e-16

Incremental \(F\)-test in R

  • Second step: Compare the two models based on an incremental \(F\)-test

  • In order to apply the \(F\)-test for model comparison in R, we use the anova() function

  • anova() takes as its arguments the models that we wish to compare

    • Here we see an example with 2 models, but we could use more (e.g., anova(m1, m2, m3))
anova(m1, m2)

Incremental \(F\)-test in R

anova(m1, m2)
Analysis of Variance Table

Model 1: health ~ age + sex
Model 2: health ~ age + sex + O + C + E + A + N
  Res.Df  RSS Df Sum of Sq    F Pr(>F)    
1   1758 4740                             
2   1753 4055  5       685 59.2 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Personality was found to explain a significant amount of variance in self-rated health over and above the effects of age and sex, \(F(5, 1753) = 59.21, p < .001\).
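
As a quick check, we can reproduce the \(F\) in the table from its ingredients (using the rounded values as printed, so the result is approximate):

((4740 - 4055) / (1758 - 1753)) / (4055 / 1753)   # ~59.2, matching the F column above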

Part 3: Non-Nested Models and Alternatives to \(F\)-tests

Nested vs Non-Nested Models

  • The \(F\)-ratio depends on the comparison models being nested

    • Nested means that the predictors in one model are a subset of the predictors in the other
  • We also require the models to be computed on the same data

    • Be careful when the data contain NAs
    • The lm() function excludes the whole row of data if the \(y\) or any of the \(x\)’s specified in the model is missing on that row
    • If the additional variables you add to the full model have NAs, the data sets used by the two models could end up different! (see the sketch below for one way to guard against this)

You can only do an \(F\)-test if the models are nested: the variables are nested and the data are identical
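
One way to guard against NA-driven mismatches in R (a sketch; the variable names follow the nested example on the next slide) is to fit both models to the complete cases of all variables used in the full model:

vars   <- c("outcome", "x1", "x2", "x3")           # everything the full model uses
dat_cc <- na.omit(data[, vars])                    # keep complete cases only
m0 <- lm(outcome ~ x1 + x2,      data = dat_cc)
m1 <- lm(outcome ~ x1 + x2 + x3, data = dat_cc)
anova(m0, m1)                                      # both models now use identical rows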

Nested vs Non-Nested Models

Assuming no NAs in the data:

Nested

m0 <- lm(outcome ~ x1 + x2 , data = data)

m1 <- lm(outcome ~ x1 + x2 + x3, data = data)
  • x1 and x2 appear in both models

Non-Nested

m0 <- lm(outcome ~ x1 + x2 + x4, data = data)

m1 <- lm(outcome ~ x1 + x2 + x3, data = data)
  • Each model contains a variable the other does not
    • x4 appears only in m0
    • x3 appears only in m1

Model Comparison for Non-Nested Models

  • So what happens when we have non-nested models?

  • There are two commonly used alternatives

    • AIC (Akaike information criterion)
    • BIC (Bayesian information criterion)
  • Unlike the incremental \(F\)-test, AIC and BIC do not require the two models to be nested

  • Smaller (more negative) values indicate better-fitting models

    • So we compare values and choose the model with the smaller AIC or BIC value

AIC & BIC

\[ \small AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k \]

\[ \small BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n) \]

\[ \begin{align} & \text{Where:} \\ & SS_{residual} = \text{sum of squares residuals} \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ & \text{ln} = \text{natural log function} \end{align} \]
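
  • As a sketch, we can compute AIC from this formula for any fitted lm, assuming the m1 and m2 models from the MIDUS example (aic_slides is a hypothetical helper name). Note that R’s built-in AIC() includes extra constants that depend only on \(n\), so differences between models fitted to the same data agree:

aic_slides <- function(mod) {
  n  <- nobs(mod)                  # sample size
  k  <- length(coef(mod)) - 1      # number of explanatory variables
  ss <- sum(resid(mod)^2)          # SS_residual
  n * log(ss / n) + 2 * k          # log() is the natural log in R
}
aic_slides(m2) - aic_slides(m1)    # should match AIC(m2) - AIC(m1)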

Parsimony Corrections

  • Both AIC and BIC contain something called a parsimony correction

    • In essence, they penalise models for being complex

    • This is to help us avoid overfitting (adding predictors arbitrarily to improve fit)

\[ AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k \]

\[ BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n) \]

  • BIC has a harsher parsimony penalty than AIC for typical sample sizes when applying linear models
    • When \(\text{ln}(n) > 2\), i.e. whenever \(n > e^2 \approx 7.39\), BIC has the more severe parsimony penalty (so essentially all the time!)

In R

  • Let’s use AIC and BIC on the m1 and m2 models from earlier
m1 <- lm(health ~ age + sex, data = midus2)
m2 <- lm(health ~ age + sex + O + C + E + A + N, data = midus2)


AIC(m1, m2)
   df  AIC
m1  4 6749
m2  9 6484
BIC(m1, m2)
   df  BIC
m1  4 6771
m2  9 6534
  • Both AIC and BIC are smaller for m2, so the model including personality is preferred (consistent with the incremental \(F\)-test)

A Different Example

  • Our previous models were nested

    • m1 had just covariates
    • m2 added personality
  • Using the same data, let’s consider a non-nested example

  • Suppose we want to compare a model that:

    • predicts self-rated health from just the 5 personality variables (nn1: non-nested model 1)
    • to a model that predicts from age, sex, and a variable called Purpose in Life (PIL) (nn2: non-nested model 2)

Applied to Non-Nested Models

nn1 <- lm(health ~ O + C + E + A + N, data=midus2)
nn2 <- lm(health ~ age + sex + PIL, data = midus2)


AIC(nn1, nn2)
    df  AIC
nn1  7 6502
nn2  5 6565
BIC(nn1, nn2)
    df  BIC
nn1  7 6540
nn2  5 6592
  • Both AIC and BIC are smaller for nn1, so the personality-trait model is preferred

Considerations for use of AIC and BIC

  • AIC and BIC can be used for both nested and non-nested models
  • The AIC and BIC for a single model are not meaningful
    • They only make sense for model comparisons
    • We evaluate these comparisons by looking at the difference, \(\Delta\), between two values
  • There are no specific thresholds for \(\Delta AIC\) to suggest how big a difference in two models is needed to conclude that one is substantively better than the other
  • The following \(\Delta BIC\) cutoffs have been suggested (Raftery, 1995):
\(\Delta BIC\)        Interpretation
\(\Delta < 2\)        No evidence of difference between models
\(2 < \Delta < 6\)    Positive evidence of difference between models
\(6 < \Delta < 10\)   Strong evidence of difference between models
\(\Delta > 10\)       Very strong evidence of difference between models
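
  • A small sketch applying these cutoffs to the non-nested models above:

delta_bic <- abs(diff(BIC(nn1, nn2)$BIC))   # BIC difference between nn1 and nn2
delta_bic                                   # ~52: "very strong evidence" on this scale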

Block 1 Summary

  • So far we have seen how to:
    • run a linear model with a single predictor
    • extend this and add predictors
    • interpret these coefficients either in original units or standardised units
    • test the significance of \(\beta\) coefficients
    • test the significance of the overall model
    • estimate the amount of variance explained by our model
    • evaluate improvements to model fit when variables are added
    • select a better-fitting model between two nested or non-nested models
  • You can now run and interpret linear models with continuous predictors

This Week


Tasks

  • Attend your lab and work together on the exercises
  • Complete the weekly quiz

Support

  • Help each other on the Piazza forum
  • Attend office hours (see Learn page for details)