As in simple linear regression, the F-ratio is used to test the null hypothesis that all regression slopes are zero.
It is called the F-ratio because it is the ratio of how much of the variation is explained by the model (per parameter) to how much of the variation is left unexplained (per remaining degree of freedom).
\[ F_{df_{model},df_{residual}} = \frac{MS_{Model}}{MS_{Residual}} = \frac{SS_{Model}/df_{Model}}{SS_{Residual}/df_{Residual}} \\ \quad \\ \begin{align} & \text{Where:} \\ & df_{model} = k \\ & df_{residual} = n-k-1 \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ \end{align} \]
In R, at the bottom of the output of summary(<modelname>), you can view the F-ratio, along with a hypothesis test against the alternative hypothesis that at least one of the coefficients \(\neq 0\) (under the null hypothesis that all coefficients = 0, the ratio of explained to unexplained variance should be approximately 1):
Run the code below. It reads in the wellbeing/rurality study data, and creates a new binary variable which specifies whether or not each participant lives in a rural location.
library(tidyverse)
mwdata2 <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")
mwdata2 <- mwdata2 %>% mutate(
  isRural = ifelse(location=="rural","rural","notrural")
)
Fit the following model, and assign it the name “wb_mdl1.”
\(\text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Social Interactions} + \beta_2 \cdot \text{IsRural} + \epsilon\)
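A minimal sketch of the fit, assuming the variable names wellbeing, social_int, and isRural from the data read in above:

wb_mdl1 <- lm(wellbeing ~ social_int + isRural, data = mwdata2)
summary(wb_mdl1)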
We can compare two models using an incremental F-test if (and only if) they are nested.
Consider two regression models fitted to the same data, where Model 1 contains a subset of the predictors contained in Model 2. Put more simply, Model 2 contains all of the predictors included in Model 1, plus one or more additional predictors. This means that Model 1 is nested within Model 2, or that Model 1 is a submodel of Model 2. These two terms, at least in this setting, are interchangeable; it might be easier to think of Model 1 as your null model and Model 2 as your alternative.
This is a formal test of whether the additional predictors provide a better-fitting model. Formally, this is a test of:
\[ \begin{align} H_0 &: \text{the coefficients of the additional predictors are all } 0 \\ H_1 &: \text{at least one of the additional coefficients} \neq 0 \end{align} \]
The F-ratio for comparing the residual sums of squares between two models can be calculated as:
\[ F_{(df_R-df_F),df_F} = \frac{(SSR_R-SSR_F)/(df_R-df_F)}{SSR_F / df_F} \\ \quad \\ \begin{align} & \text{Where:} \\ & SSR_R = \text{residual sums of squares for the restricted model} \\ & SSR_F = \text{residual sums of squares for the full model} \\ & df_R = \text{residual degrees of freedom from the restricted model} \\ & df_F = \text{residual degrees of freedom from the full model} \\ \end{align} \]
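To make the formula concrete, here is a sketch of computing the incremental F-ratio by hand (mdl_r and mdl_f are hypothetical names for a restricted and a full model fitted to the same data):

ssr_r <- sum(residuals(mdl_r)^2)   # residual sums of squares, restricted model
ssr_f <- sum(residuals(mdl_f)^2)   # residual sums of squares, full model
df_r <- df.residual(mdl_r)         # residual degrees of freedom, restricted model
df_f <- df.residual(mdl_f)         # residual degrees of freedom, full model
Fstat <- ((ssr_r - ssr_f) / (df_r - df_f)) / (ssr_f / df_f)
pf(Fstat, df_r - df_f, df_f, lower.tail = FALSE)   # p-value for the F-ratio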
anova()
Remember that you want your models to be parsimonious: in other words, only as complex as they need to be in order to describe the data well. This means that you need to be able to justify your model choice, and one way to do so is by comparing models via anova(). If your model with multiple IVs does not provide a significantly better fit to your data than a simpler model with fewer IVs, then the simpler model should be preferred.
In R, we can conduct an incremental F-test by constructing two linear regression models and passing them to the anova() function: anova(model1, model2).
If the p-value is sufficiently low (i.e., below your predetermined significance level, usually .05), then you would conclude that Model 2 fits significantly better than Model 1. If p is not < .05, then you should favour the simpler model.
The F-ratio you see at the bottom of summary(model) is actually a comparison between two models: your model (with some explanatory variables predicting \(y\)) and the null model. In regression, the null model can be thought of as the model in which all explanatory variables have zero regression coefficients. It is also referred to as the intercept-only model, because if all predictor coefficients are zero, then we are estimating \(y\) via an intercept alone (which will be the mean, \(\bar y\)).
Use the code below to fit the null model.
Then, use the anova() function to perform a model comparison between your earlier model (wb_mdl1) and the null model. Remember that the null model tests the null hypothesis that all beta coefficients are zero. By comparing null_model to wb_mdl1, we can test whether we should include the two IVs of social_int and isRural.
Check that the F-statistic and the p-value are the same as those given at the bottom of summary(wb_mdl1).
null_model <- lm(wellbeing ~ 1, data = mwdata2)
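Passing both models to anova() then performs the incremental F-test (a sketch, assuming wb_mdl1 was fitted as above):

anova(null_model, wb_mdl1)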
Does weekly outdoor time explain a significant amount of variance in wellbeing scores over and above weekly social interactions and location (rural vs not-rural)?
Provide an answer to this question by fitting and comparing two models (one of them you may already have fitted in an earlier question).
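One way to approach this is sketched below; the name wb_mdl2 is hypothetical, and wb_mdl1 is the model fitted earlier:

wb_mdl2 <- lm(wellbeing ~ social_int + isRural + outdoor_time, data = mwdata2)   # add outdoor_time
anova(wb_mdl1, wb_mdl2)   # does outdoor_time improve fit over and above the other two?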
Recall the data on Big 5 Personality traits, perceptions of social ranks, and depression and anxiety scale scores:
scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")
summary(scs_study)
## zo zc ze za
## Min. :-2.81928 Min. :-3.21819 Min. :-3.00576 Min. :-2.94429
## 1st Qu.:-0.63089 1st Qu.:-0.66866 1st Qu.:-0.68895 1st Qu.:-0.69394
## Median : 0.08053 Median : 0.00257 Median :-0.04014 Median :-0.01854
## Mean : 0.09202 Mean : 0.01951 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.80823 3rd Qu.: 0.71215 3rd Qu.: 0.67085 3rd Qu.: 0.72762
## Max. : 3.55034 Max. : 3.08015 Max. : 2.80010 Max. : 2.97010
## zn scs dass
## Min. :-1.4486 Min. :27.00 Min. :23.00
## 1st Qu.:-0.7994 1st Qu.:33.00 1st Qu.:41.00
## Median :-0.2059 Median :35.00 Median :44.00
## Mean : 0.0000 Mean :35.77 Mean :44.72
## 3rd Qu.: 0.5903 3rd Qu.:38.00 3rd Qu.:49.00
## Max. : 3.3491 Max. :54.00 Max. :68.00
Research questions
Part 1: Beyond Neuroticism and its interaction with social comparison, does Openness predict symptoms of depression, anxiety and stress?
Part 2: Beyond Neuroticism and its interaction with social comparison, do other personality traits predict symptoms of depression, anxiety and stress?
Construct and compare multiple regression models to answer these two questions. Remember to check that your models meet the assumptions (for this exercise, a quick eyeball of the diagnostic plots will suffice; were this an actual research project, you would want to provide a more thorough check, for instance conducting formal tests of the assumptions).
Although the solutions are available immediately for this question, we strongly advocate that you attempt it yourself before looking at them.
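If you want a starting point after attempting it yourself, one possible set of nested models is sketched below (the names m0, m1, and m2 are hypothetical, and your specification may differ):

m0 <- lm(dass ~ scs * zn, data = scs_study)                   # Neuroticism and its interaction with social comparison
m1 <- lm(dass ~ scs * zn + zo, data = scs_study)              # Part 1: add Openness
m2 <- lm(dass ~ scs * zn + zo + zc + ze + za, data = scs_study)  # Part 2: add the other traits
anova(m0, m1)   # does Openness improve fit?
anova(m0, m2)   # do the other traits improve fit?
par(mfrow = c(2, 2)); plot(m2)   # quick eyeball of the diagnostic plots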
If models are not nested, we cannot compare them using an incremental F-test. Instead, for non-nested models, we can use information criterion statistics, such as AIC and BIC.
Consider two regression models fitted to the same data, where Model 1 contains different variables from those contained in Model 2. More simply, each model contains at least one predictor that the other does not. This means that Model 1 and Model 2 are not nested.
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) combine information about the sample size, the number of model parameters, and the residual sums of squares (\(SS_{residual}\)). Models do not need to be nested to be compared via AIC and BIC, but they need to have been fit to the same dataset.
For both of these fit indices, lower values are better, and both include a penalty for the number of predictors in the model (although BIC’s penalty is harsher):
\[ AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k \\ \quad \\ BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n) \\ \quad \\ \begin{align} & \text{Where:} \\ & SS_{residual} = \text{sum of squares residuals} \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ & \text{ln} = \text{natural log function} \end{align} \]
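As a check on these formulas, here is a sketch of computing them by hand for a fitted model mdl (a hypothetical name). Note that R's built-in AIC() and BIC() use the full log-likelihood, so their values differ from these formulas by an additive constant; that constant is shared by all models fitted to the same data, so comparisons are unaffected.

n <- nobs(mdl)                     # sample size
k <- length(coef(mdl)) - 1         # explanatory variables (excluding the intercept)
ss_res <- sum(residuals(mdl)^2)    # residual sum of squares
n * log(ss_res / n) + 2 * k        # AIC, per the formula above
n * log(ss_res / n) + k * log(n)   # BIC, per the formula above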
In R, we can calculate AIC and BIC using the AIC() and BIC() functions.
Let's compare the AIC and BIC values of two models, each looking at the association between DASS scores and two personality traits. Fit the following models, and compare them using AIC() and BIC(). Report which model you think best fits the data.
\(\text{DASS} = \beta_0 + \beta_1 \cdot \text{Neuroticism} + \beta_2 \cdot \text{Extraversion} + \epsilon\)
\(\text{DASS} = \beta_0 + \beta_1 \cdot \text{Openness} + \beta_2 \cdot \text{Agreeableness} +\epsilon\)
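A sketch of the fits and the comparison (the model names mdl_ne and mdl_oa are hypothetical; the trait variables are the z-scored columns from the summary above):

mdl_ne <- lm(dass ~ zn + ze, data = scs_study)   # Neuroticism + Extraversion
mdl_oa <- lm(dass ~ zo + za, data = scs_study)   # Openness + Agreeableness
AIC(mdl_ne, mdl_oa)   # lower is better
BIC(mdl_ne, mdl_oa)   # lower is better; harsher penalty for extra predictors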
The code below fits 5 different models:
model1 <- lm(wellbeing ~ social_int + outdoor_time, data = mwdata2)
model2 <- lm(wellbeing ~ social_int + outdoor_time + age, data = mwdata2)
model3 <- lm(wellbeing ~ social_int + outdoor_time + routine, data = mwdata2)
model4 <- lm(wellbeing ~ social_int + outdoor_time + routine + age, data = mwdata2)
model5 <- lm(wellbeing ~ social_int + outdoor_time + routine + steps_k, data = mwdata2)
For each of the pairs of models below, which comparison methods are and are not available to us, and why?
model1 vs model2
model2 vs model3
model1 vs model4
model3 vs model5
This flowchart might help you to reach your decision:
Extra reading: Joshua Loftus’ Blog: Model selection bias invalidates significance tests