Be sure to check the solutions to last week’s exercises.
You can still ask any questions about previous weeks’ materials if things aren’t clear!

LEARNING OBJECTIVES

  1. Understand measures of model fit using \(R^2\) and F.
  2. Understand the principles of model selection and how to compare models via \(R^2\) and F tests.
  3. Understand AIC and BIC.
  4. Understand the basics of backward elimination, forward selection and stepwise regression.

Model Fit

Adjusted \(R^2\)

We know from our work on simple linear regression that the R-squared can be obtained as: \[ R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}} \]

However, when we add more and more predictors to a multiple regression model, \(SS_{Residual}\) cannot increase, and may decrease by pure chance alone, even if the predictors are unrelated to the outcome variable. Because \(SS_{Total}\) is constant, the quantity \(1-\frac{SS_{Residual}}{SS_{Total}}\) can therefore only stay the same or increase as predictors are added, even by chance alone.

An alternative, the Adjusted-\(R^2\), does not necessarily increase with the addition of more explanatory variables, because it includes a penalty according to the number of explanatory variables in the model. It is not by itself meaningful as a proportion of variance explained, but it can be useful in determining which predictors to include in a model. \[ Adjusted{-}R^2=1-\frac{(1-R^2)(n-1)}{n-k-1} \\ \quad \\ \begin{align} & \text{Where:} \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ \end{align} \]


In R, you can view the multiple and adjusted \(R^2\) at the bottom of the output of summary(<modelname>):

Figure 1: Multiple regression output in R, summary.lm(). R-squared highlighted
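For instance, here is a quick check of the Adjusted-\(R^2\) formula against R’s own calculation - a minimal sketch using the built-in mtcars data as a stand-in (the model and variables are purely illustrative):

mdl <- lm(mpg ~ wt + hp, data = mtcars)
r2  <- summary(mdl)$r.squared
n   <- nobs(mdl)                       # sample size
k   <- length(coef(mdl)) - 1           # number of explanatory variables
1 - (1 - r2) * (n - 1) / (n - k - 1)   # Adjusted R-squared, by hand
summary(mdl)$adj.r.squared             # should match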

F-ratio

As in simple linear regression, the F-ratio is used to test the null hypothesis that all regression slopes are zero.

It is called the F-ratio because it is the ratio of how much of the variation is explained by the model (per parameter) versus how much of the variation is unexplained (per remaining degree of freedom).

\[ F_{df_{model},df_{residual}} = \frac{MS_{Model}}{MS_{Residual}} = \frac{SS_{Model}/df_{Model}}{SS_{Residual}/df_{Residual}} \\ \quad \\ \begin{align} & \text{Where:} \\ & df_{model} = k \\ & df_{residual} = n-k-1 \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ \end{align} \]


In R, at the bottom of the output of summary(<modelname>), you can view the F-ratio, along with a hypothesis test against the alternative hypothesis that at least one of the coefficients \(\neq 0\) (under the null hypothesis that all coefficients = 0, the ratio of explained to unexplained variance should be approximately 1):

Figure 2: Multiple regression output in R, summary.lm(). F statistic highlighted
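To see how the pieces fit together, here is a minimal sketch (again using the built-in mtcars data as a stand-in) which assembles the F-ratio from the sums of squares in the anova() table:

mdl <- lm(mpg ~ wt + hp, data = mtcars)
a   <- anova(mdl)                      # table with one row per term, plus a residual row
ss_model <- sum(head(a$`Sum Sq`, -1))  # model SS: all rows except the residual row
df_model <- sum(head(a$Df, -1))
ss_resid <- tail(a$`Sum Sq`, 1)
df_resid <- tail(a$Df, 1)
(ss_model / df_model) / (ss_resid / df_resid)  # the F-ratio, by hand
summary(mdl)$fstatistic                        # should match, along with its dfs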

Question 1

Run the code below. It reads in the wellbeing/rurality study data, and creates a new binary variable which specifies whether or not each participant lives in a rural location.

library(tidyverse)

# read in the wellbeing/rurality study data
mwdata2 <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")

# create a binary variable: is the participant's location rural or not?
mwdata2 <- 
  mwdata2 %>% mutate(
    isRural = ifelse(location == "rural", "rural", "notrural")
  )

Fit the following model, and assign it the name “wb_mdl1”.

  • \(\text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Social Interactions} + \beta_2 \cdot \text{IsRural} + \epsilon\)

Does the model provide a better fit to the data than a model with no explanatory variables? (i.e., test against the alternative hypothesis that at least one of the explanatory variables significantly predicts wellbeing scores).

Solution

Model Comparison

Incremental F-test

If (and only if) two models are nested (one model contains all the predictors of the other and is fitted to the same data), we can compare them using an incremental F-test.

This is a formal test of whether the additional predictors provide a significantly better fitting model.
Formally, it is the test of:

  • \(H_0:\) the coefficients for the added/omitted variables are all zero.
  • \(H_1:\) at least one of the added/omitted variables has a coefficient that is not zero.

The F-ratio for comparing the residual sums of squares between two models can be calculated as:

\[ F_{(df_R-df_F),df_F} = \frac{(SSR_R-SSR_F)/(df_R-df_F)}{SSR_F / df_F} \\ \quad \\ \begin{align} & \text{Where:} \\ & SSR_R = \text{residual sums of squares for the restricted model} \\ & SSR_F = \text{residual sums of squares for the full model} \\ & df_R = \text{residual degrees of freedom from the restricted model} \\ & df_F = \text{residual degrees of freedom from the full model} \\ \end{align} \]


In R, we can conduct an incremental F-test by constructing two models, and passing them to the anova() function: anova(model1, model2).
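As a minimal sketch of the syntax, using the built-in mtcars data as a stand-in (note that the two models are nested and fitted to the same data):

m_restricted <- lm(mpg ~ wt, data = mtcars)
m_full       <- lm(mpg ~ wt + hp + disp, data = mtcars)
anova(m_restricted, m_full)  # F-test of whether hp and disp jointly improve the fit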

Question 2

The F-ratio you see at the bottom of summary(model) is actually a comparison between two models: your model (with some explanatory variables predicting \(y\)) and the null model. In regression, the null model can be thought of as the model in which all explanatory variables have zero regression coefficients. It is also referred to as the intercept-only model, because if all predictor coefficients are zero, then we are estimating \(y\) via the intercept alone (which will be the mean - \(\bar y\)).

Use the code below to fit the null model.
Then, use the anova() function to perform a model comparison between your earlier model (wb_mdl1) and the null model.
Check that the F statistic is the same as that which is given at the bottom of summary(wb_mdl1).

null_model <- lm(wellbeing ~ 1, data = mwdata2)
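As a quick sanity check of the “intercept-only model estimates the mean” idea:

coef(null_model)          # the estimated intercept...
mean(mwdata2$wellbeing)   # ...is simply the mean wellbeing score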

Solution

Question 3

Does weekly outdoor time explain a significant amount of variance in wellbeing scores over and above weekly social interactions and location (rural vs not-rural)?

Provide an answer to this question by fitting and comparing two models (one of them you may already have fitted in an earlier question).

Solution

Incremental validity - A caution

A common goal for researchers is to determine which variables matter (and which do not) in contributing to some outcome variable. A common approach to answer such questions is to consider whether some variable \(X\)’s contribution remains significant after controlling for variables \(Z\).

The reasoning:

  • If our measure of \(X\) correlates significantly with outcome \(Y\) even when controlling for our measure of \(Z\), then \(X\) contributes to \(Y\) over and above the contribution of \(Z\).

In multiple regression, we might fit the model \(Y = \beta_0 + \beta_1 \cdot X + \beta_2 \cdot Z + \epsilon\) and conclude that \(X\) is a useful predictor of \(Y\) over and above \(Z\) based on the estimate \(\hat \beta_1\), or via model comparison between that model and the model without \(X\) as a predictor (\(Y = \beta_0 + \beta_1 \cdot Z + \epsilon\)).
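In R terms, with a purely hypothetical data frame df containing variables Y, X and Z, the two approaches might look like this:

mdl_full       <- lm(Y ~ X + Z, data = df)
summary(mdl_full)                # t-test of the coefficient for X
mdl_restricted <- lm(Y ~ Z, data = df)
anova(mdl_restricted, mdl_full)  # incremental F-test for adding X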

A Toy Example

Suppose we have monthly data over a seven-year period, capturing the number of shark attacks on swimmers each month, and the number of ice-creams sold by beach vendors each month.
Consider the relationship between the two:

We can fit the linear model and see a significant relationship between ice cream sales and shark attacks:

sharkdata <- read_csv("https://uoepsy.github.io/data/sharks.csv")
shark_mdl <- lm(shark_attacks ~ ice_cream_sales, data = sharkdata)
summary(shark_mdl)
## 
## Call:
## lm(formula = shark_attacks ~ ice_cream_sales, data = sharkdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3945  -4.9268   0.5087   4.8152  15.7023 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.58835    5.19063   1.077    0.285    
## ice_cream_sales  0.32258    0.05809   5.553 3.46e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.245 on 81 degrees of freedom
## Multiple R-squared:  0.2757, Adjusted R-squared:  0.2668 
## F-statistic: 30.84 on 1 and 81 DF,  p-value: 3.461e-07
Question

Does the relationship between ice cream sales and shark attacks make sense? What might be missing from our model?

Solution

You might quite rightly suggest that this relationship is actually being driven by temperature - when it is hotter, there are more ice cream sales and there are more people swimming (hence more shark attacks).

Question

Is \(X\) (the number of ice-cream sales) a useful predictor of \(Y\) (numbers of shark attacks) over and above \(Z\) (temperature)?

We might answer this with a multiple regression model including both temperature and ice cream sales as predictors of shark attacks:

shark_mdl2 <- lm(shark_attacks ~ ice_cream_sales + temperature, data = sharkdata)
summary(shark_mdl2)
## 
## Call:
## lm(formula = shark_attacks ~ ice_cream_sales + temperature, data = sharkdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5359  -3.1353   0.1088   3.1064  17.2566 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.73422    4.27917   0.405    0.686    
## ice_cream_sales  0.08588    0.05997   1.432    0.156    
## temperature      1.31868    0.20457   6.446 8.04e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.914 on 80 degrees of freedom
## Multiple R-squared:  0.5233, Adjusted R-squared:  0.5114 
## F-statistic: 43.91 on 2 and 80 DF,  p-value: 1.345e-13


What do you conclude?

Solution

It appears that the number of ice cream sales is not a significant predictor of shark attack numbers over and above temperature.

However… In psychology, we can rarely observe and directly measure the constructs which we are interested in (for example, personality traits, intelligence, emotional states etc.). We rely instead on measurements of, e.g. behavioural tendencies, as a proxy for personality traits.

Let’s suppose that instead of including temperature in degrees Celsius, we asked a set of people to self-report, on a scale of 1 to 7, how hot it was that day. This measure should hopefully correlate well with the actual temperature; however, there will likely be some variation:
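One quick way to gauge how closely the self-report measure tracks the recorded temperature (both variables are in the sharkdata dataset used above):

cor(sharkdata$temperature, sharkdata$sr_heat)
plot(sr_heat ~ temperature, data = sharkdata)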

Question

Is \(X\) (the number of ice-cream sales) a useful predictor of \(Y\) (numbers of shark attacks) over and above \(Z\) (temperature - measured on our self-reported heat scale)?

shark_mdl2a <- lm(shark_attacks ~ ice_cream_sales + sr_heat, data = sharkdata)
summary(shark_mdl2a)
## 
## Call:
## lm(formula = shark_attacks ~ ice_cream_sales + sr_heat, data = sharkdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4576  -3.7818  -0.0553   3.6712  15.2155 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.96066    4.37145   2.050  0.04366 *  
## ice_cream_sales  0.14943    0.05643   2.648  0.00974 ** 
## sr_heat          2.96130    0.49276   6.010 5.24e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.051 on 80 degrees of freedom
## Multiple R-squared:  0.501,  Adjusted R-squared:  0.4885 
## F-statistic: 40.16 on 2 and 80 DF,  p-value: 8.394e-13


What do you conclude?

Moral of the story: be mindful of what exactly it is that you are measuring.
This example was adapted from Westfall and Yarkoni (2016), which provides a much more extensive discussion of incremental validity and type 1 error rates.

AIC & BIC

We can also compare models using information criterion statistics, such as AIC and BIC. These combine information about the sample size, the number of model parameters and the residual sums of squares (\(SS_{residual}\)). Models do not need to be nested to be compared via AIC and BIC, but they need to have been fit to the same dataset.
For both of these fit indices, lower values are better, and both include a penalty for the number of predictors in the model, although BIC’s penalty is harsher (its \(k\,\text{ln}(n)\) term exceeds AIC’s \(2k\) for any sample size of 8 or more):

\[ AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k \\ \quad \\ BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n) \\ \quad \\ \begin{align} & \text{Where:} \\ & SS_{residual} = \text{sum of squares residuals} \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ & \text{ln} = \text{natural log function} \end{align} \]


In R, we can calculate AIC and BIC by using the AIC() and BIC() functions.
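For example, using the built-in mtcars data as a stand-in (note that R computes AIC and BIC from the model log-likelihood, so the absolute values will differ from the sums-of-squares formulas above by a constant, but differences between models fitted to the same data are unaffected):

m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
AIC(m1, m2)  # lower value = better trade-off of fit against complexity
BIC(m1, m2)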

Question 4

The code below fits 5 different models:

model1 <- lm(wellbeing ~ social_int + outdoor_time, data = mwdata2)
model2 <- lm(wellbeing ~ social_int + outdoor_time + age, data = mwdata2)
model3 <- lm(wellbeing ~ social_int + outdoor_time + routine, data = mwdata2)
model4 <- lm(wellbeing ~ social_int + outdoor_time + routine + age, data = mwdata2)
model5 <- lm(wellbeing ~ social_int + outdoor_time + routine + steps_k, data = mwdata2)

For each of the below pairs of models, what methods are/are not available for us to use for comparison and why?

  • model1 vs model2
  • model2 vs model3
  • model1 vs model4
  • model3 vs model5

Solution

Question 5

Recall the data on Big 5 Personality traits, perceptions of social ranks, and depression and anxiety scale scores:

scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")
summary(scs_study)
##        zo                 zc                 ze                 za          
##  Min.   :-2.81928   Min.   :-3.21819   Min.   :-3.00576   Min.   :-2.94429  
##  1st Qu.:-0.63089   1st Qu.:-0.66866   1st Qu.:-0.68895   1st Qu.:-0.69394  
##  Median : 0.08053   Median : 0.00257   Median :-0.04014   Median :-0.01854  
##  Mean   : 0.09202   Mean   : 0.01951   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.80823   3rd Qu.: 0.71215   3rd Qu.: 0.67085   3rd Qu.: 0.72762  
##  Max.   : 3.55034   Max.   : 3.08015   Max.   : 2.80010   Max.   : 2.97010  
##        zn               scs             dass      
##  Min.   :-1.4486   Min.   :27.00   Min.   :23.00  
##  1st Qu.:-0.7994   1st Qu.:33.00   1st Qu.:41.00  
##  Median :-0.2059   Median :35.00   Median :44.00  
##  Mean   : 0.0000   Mean   :35.77   Mean   :44.72  
##  3rd Qu.: 0.5903   3rd Qu.:38.00   3rd Qu.:49.00  
##  Max.   : 3.3491   Max.   :54.00   Max.   :68.00

Research question

  • Beyond neuroticism and its interaction with social comparison, do other personality traits predict symptoms of depression, anxiety and stress?

Construct and compare multiple regression models to answer this question. Remember to check that your models meet assumptions (for this exercise, a quick eyeball of the diagnostic plots will suffice; were this an actual research project, you would want to provide a more thorough check, for instance by conducting formal tests of the assumptions).

Although the solutions are available immediately for this question, we strongly advocate that you attempt it yourself before looking at them.

Solution

Extra Exercises: Model Selection

“Which predictors should I include in my model?”

As a rule of thumb, you should include as predictors your variables of interest (i.e., those required to answer your questions), and those which theory suggests you should take into account (for instance, if theory tells you that temperature is likely to influence the number of shark attacks on a given day, it would be remiss of you to not include it in your model).

However, in some specific situations, you may simply want to let the data tell you whatever there is to tell, without being guided by theory. This is where analysis becomes exploratory in nature (and therefore should not be used as confirmatory evidence in support of theory).

In both the design and the analysis of a study, you will have to make many, many choices. Each one takes you a different way, and leads to a different set of choices. This idea has become widely known as the garden of forking paths, and has important consequences for your statistical inferences.

Out of all the possible paths you could have taken, some will end with what you consider to be a significant finding, and some you will simply see as dead ends. If you reach a dead-end, do you go back and try a different path? Why might this be a risky approach to statistical analyses?

For a given set of data, there will likely be some significant relationships between variables which are there simply by chance (recall that \(p<.05\) corresponds to a 1 in 20 chance of such a result arising when the null hypothesis is true - if we study 20 different relationships for which the null holds, we would expect one of them to be significant by chance). The more paths we try out, the more likely we are to find a significant relationship, even though it may actually be completely spurious!

Model selection is a means of answering the question “which predictors should I include in my model?” but it is a big maze of forking paths, which will result in keeping only those predictors which meet some criteria (e.g., significance).

Stepwise

Forward Selection

  • Start with the variable which has the highest association with the DV.
  • Add the variable which most increases \(R^2\) out of all which remain.
  • Continue until no variables improve \(R^2\).

Backward Elimination

  • Start with all variables in the model.
  • Remove the predictor with the highest p-value.
  • Run the model again and repeat.
  • Stop when all remaining predictors have p-values below the a priori critical level.


Note that we can use different criteria for selecting models in this stepwise approach - for instance, choosing at each step the model with the biggest decrease in AIC.
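As a minimal sketch of both directions, using the built-in mtcars data as a stand-in:

full_mdl <- lm(mpg ~ ., data = mtcars)  # model with all available predictors
step(full_mdl, direction = "backward")  # backward elimination, minimising AIC

null_mdl <- lm(mpg ~ 1, data = mtcars)  # intercept-only model
step(null_mdl, direction = "forward",
     scope = formula(full_mdl))         # forward selection, up to the full model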

Question 6

Using the backward elimination approach, construct a final model to predict wellbeing scores using the mwdata2 dataset from above.

Solution

Question 7

There are functions in R which automate the stepwise procedure for us.
step(<modelname>) will by default use backward elimination to choose the model with the lowest AIC.

  1. Using data on the Big 5 Personality traits, perceptions of social ranks, and depression and anxiety, fit the full model to predict DASS-21 scores.
  2. Use step() to determine which predictors to keep in your model.
  3. What predictors do you have in your final model?

Solution


Extra reading: Joshua Loftus’ Blog: Model selection bias invalidates significance tests