“Which predictors should I include in my model?”
As a rule of thumb, you should include as predictors your variables of interest (i.e., those required to answer your questions), and those which theory suggests you should take into account (for instance, if theory tells you that temperature is likely to influence the number of shark attacks on a given day, it would be remiss of you to not include it in your model).
However, in some specific situations, you may simply want to let the data tell you whatever there is to tell, without being guided by theory. This is where analysis becomes exploratory in nature (and therefore should not be used as confirmatory evidence in support of theory).
In both the design and the analysis of a study, you will have to make many, many choices. Each one takes you down a different route, and each route leads to a new set of choices. This idea has become widely known as the garden of forking paths, and it has important consequences for your statistical inferences.
Out of all the possible paths you could have taken, some will end with what you consider to be a significant finding, and some you will simply see as dead ends. If you reach a dead-end, do you go back and try a different path? Why might this be a risky approach to statistical analyses?
For a given set of data, there will likely be some significant relationships between variables which are there simply by chance (recall that \(p<.05\) corresponds to a 1 in 20 chance - if we study 20 different relationships, we would expect one of them to be significant by chance). The more paths we try out, the more likely we are to find a significant relationship, even though it may actually be completely spurious!
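To put a rough number on this, the chance of at least one spuriously significant result across 20 independent tests of truly null relationships can be computed directly (the 20 and the .05 are just the figures from the paragraph above):

1 - (1 - 0.05)^20   # probability of at least one p < .05 is roughly 0.64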
Model selection is a means of answering the question “which predictors should I include in my model?”, but it is itself a big maze of forking paths: it leaves you with only those predictors which meet some criterion (e.g., statistical significance).
Stepwise
Forward Selection
- Start with the variable that has the highest association with the DV.
- Add whichever remaining variable most increases \(R^2\).
- Continue until no remaining variable meaningfully improves \(R^2\) (e.g., until the increase in \(R^2\) is no longer statistically significant) - see the sketch below.
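As a rough sketch of what this looks like in R (the data frame dat and the predictors x1, x2, x3 here are purely hypothetical; add1() is one convenient way to compare the candidate additions at each step):

null_model <- lm(y ~ 1, data = dat)                   # start with no predictors
add1(null_model, scope = ~ x1 + x2 + x3, test = "F")  # compare each candidate addition
model_a <- update(null_model, . ~ . + x2)             # add the best candidate (here, say, x2)
add1(model_a, scope = ~ x1 + x2 + x3, test = "F")     # ...and repeat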
Backward Elimination
- Start with all variables in the model.
- Remove the predictor with the highest p-value.
- Refit the model and repeat.
- Stop when all remaining predictors have p-values below the critical level set a priori.
Note that we can have different criteria for selecting models in this stepwise approach, for instance, choosing the model with the biggest decrease in AIC.
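For instance, a single backward step under either criterion can be examined with drop1() (here mod is just a placeholder for an already-fitted lm model):

drop1(mod, test = "F")   # p-value criterion: remove the term with the highest p-value
drop1(mod)               # AIC criterion: remove the term whose removal lowers AIC the most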
Question 1
Using the backward elimination approach, construct a final model to predict wellbeing scores using the mwdata2 dataset.
mwdata2 <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")
Solution
We will stop when all p-values are \(<.05\).
Note that the data contain two variables which are direct transformations of one another - “location” and “isRural” - so we can’t have both in the model (a quick check of this is shown after the data summary below).
## age outdoor_time social_int routine wellbeing
## Min. :18.0 Min. : 1.0 Min. : 3.0 Min. :0.000 Min. :22.0
## 1st Qu.:30.0 1st Qu.:12.8 1st Qu.: 9.0 1st Qu.:0.000 1st Qu.:33.0
## Median :42.0 Median :18.0 Median :12.0 Median :1.000 Median :35.0
## Mean :42.3 Mean :18.3 Mean :12.1 Mean :0.565 Mean :36.3
## 3rd Qu.:54.2 3rd Qu.:23.0 3rd Qu.:15.0 3rd Qu.:1.000 3rd Qu.:40.0
## Max. :70.0 Max. :35.0 Max. :24.0 Max. :1.000 Max. :59.0
##
## location steps_k
## Length:200 Min. : 0.0
## Class :character 1st Qu.: 24.0
## Mode :character Median : 42.5
## Mean : 44.9
## 3rd Qu.: 65.3
## Max. :111.3
## NA's :66
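As a quick check, we can cross-tabulate the two variables to confirm they encode the same information (each location category should map onto a single isRural value):

table(mwdata2$location, mwdata2$isRural)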
full_model <- lm(wellbeing ~ age + outdoor_time + social_int + routine + location + steps_k, data = mwdata2)
summary(full_model)
##
## Call:
## lm(formula = wellbeing ~ age + outdoor_time + social_int + routine +
## location + steps_k, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.57 -2.90 -0.05 2.92 9.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.28922 2.09431 13.99 < 2e-16 ***
## age 0.01832 0.02640 0.69 0.48916
## outdoor_time 0.14557 0.06136 2.37 0.01919 *
## social_int 0.35708 0.09832 3.63 0.00041 ***
## routine 3.03922 0.78825 3.86 0.00018 ***
## locationrural -5.17639 0.97478 -5.31 4.8e-07 ***
## locationsuburb 0.07581 1.11412 0.07 0.94586
## steps_k 0.00611 0.01643 0.37 0.71049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.46 on 126 degrees of freedom
## (66 observations deleted due to missingness)
## Multiple R-squared: 0.371, Adjusted R-squared: 0.336
## F-statistic: 10.6 on 7 and 126 DF, p-value: 1.93e-10
We will remove the “steps_k” variable, as it is the predictor with the highest p-value (don’t be tempted to think that “location” has the highest p-value: the estimated difference between suburban and urban locations does have a high p-value, but the difference between rural and urban has a very low p-value).
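If you want a single test of whether the location factor as a whole should stay (rather than eyeballing its separate dummy coefficients), one option is to compare the model with and without it - a sketch:

model_no_location <- update(full_model, . ~ . - location)  # same model minus the location factor
anova(model_no_location, full_model)                        # F-test treating the factor as a single term

Either way, location stays in the model and “steps_k” goes: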
model1 <- lm(wellbeing ~ age + outdoor_time + social_int + routine + location, data = mwdata2)
summary(model1)
##
## Call:
## lm(formula = wellbeing ~ age + outdoor_time + social_int + routine +
## location, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.917 -2.686 -0.297 2.833 14.623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.7700 1.6581 17.95 < 2e-16 ***
## age -0.0108 0.0204 -0.53 0.60
## outdoor_time 0.1745 0.0429 4.06 7.0e-05 ***
## social_int 0.3873 0.0759 5.10 8.0e-07 ***
## routine 2.9581 0.6115 4.84 2.7e-06 ***
## locationrural -4.9772 0.7181 -6.93 6.1e-11 ***
## locationsuburb -0.2797 0.8680 -0.32 0.75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.28 on 193 degrees of freedom
## Multiple R-squared: 0.39, Adjusted R-squared: 0.371
## F-statistic: 20.5 on 6 and 193 DF, p-value: <2e-16
Of the remaining predictors, “age” now has the highest p-value (setting aside “location”, which we keep because the rural vs urban difference is highly significant), so we remove it next:
model2 <- lm(wellbeing ~ outdoor_time + social_int + routine + location, data = mwdata2)
summary(model2)
##
## Call:
## lm(formula = wellbeing ~ outdoor_time + social_int + routine +
## location, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.865 -2.698 -0.213 2.594 14.826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.2810 1.3730 21.33 < 2e-16 ***
## outdoor_time 0.1756 0.0428 4.10 6.0e-05 ***
## social_int 0.3881 0.0758 5.12 7.2e-07 ***
## routine 2.9519 0.6102 4.84 2.7e-06 ***
## locationrural -4.9632 0.7163 -6.93 6.1e-11 ***
## locationsuburb -0.2843 0.8663 -0.33 0.74
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.27 on 194 degrees of freedom
## Multiple R-squared: 0.389, Adjusted R-squared: 0.373
## F-statistic: 24.7 on 5 and 194 DF, p-value: <2e-16
In this model, all our predictors have p-values lower than our critical level of \(.05\).
Question 2
There are functions in R which automate the stepwise procedure for us. step(<modelname>) will by default use backward elimination to choose the model with the lowest AIC.
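If you want to be explicit, step() also lets you set the search direction and the penalty used in the criterion (these are standard arguments to step(); mod is a placeholder for your fitted model):

step(mod, direction = "backward", k = 2)   # k = 2 gives AIC; k = log(n) would give the BIC penalty instead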
- Using data on the Big 5 Personality traits, perceptions of social ranks, and depression and anxiety, fit the full model to predict DASS-21 scores.
- Use step() to determine which predictors to keep in your model.
- What predictors do you have in your final model?
scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")
scs_study <-
  scs_study %>%
  mutate(
    # mean-centre the scs score by subtracting its sample mean
    scs_mc = scs - mean(scs)
  )
Solution
full_dass_model <- lm(dass ~ zn*scs_mc + zo + zc + ze + za + zn, data = scs_study)
step(full_dass_model)
## Start: AIC=2375
## dass ~ zn * scs_mc + zo + zc + ze + za + zn
##
## Df Sum of Sq RSS AIC
## - zc 1 2 23915 2373
## - za 1 11 23924 2373
## - zo 1 23 23936 2374
## <none> 23913 2375
## - ze 1 147 24060 2377
## - zn:scs_mc 1 2341 26254 2434
##
## Step: AIC=2373
## dass ~ zn + scs_mc + zo + ze + za + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## - za 1 11 23926 2371
## - zo 1 23 23938 2372
## <none> 23915 2373
## - ze 1 148 24063 2375
## - zn:scs_mc 1 2340 26255 2432
##
## Step: AIC=2371
## dass ~ zn + scs_mc + zo + ze + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## - zo 1 22 23948 2370
## <none> 23926 2371
## - ze 1 497 24423 2383
## - zn:scs_mc 1 2340 26266 2431
##
## Step: AIC=2370
## dass ~ zn + scs_mc + ze + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## <none> 23948 2370
## - ze 1 496 24444 2381
## - zn:scs_mc 1 2318 26266 2429
##
## Call:
## lm(formula = dass ~ zn + scs_mc + ze + zn:scs_mc, data = scs_study)
##
## Coefficients:
## (Intercept) zn scs_mc ze zn:scs_mc
## 44.931 1.557 -0.447 0.871 -0.515
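From the output above, the final model retains zn, scs_mc, ze, and the zn:scs_mc interaction. Because step() returns the selected model, we can assign it and inspect it in the usual way (setting trace = 0 simply suppresses the step-by-step printout):

final_dass_model <- step(full_dass_model, trace = 0)
summary(final_dass_model)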
Extra reading: Joshua Loftus’ Blog: Model selection bias invalidates significance tests