“Which predictors should I include in my model?”
As a rule of thumb, you should include as predictors your variables of interest (i.e., those required to answer your questions), and those which theory suggests you should take into account (for instance, if theory tells you that temperature is likely to influence the number of shark attacks on a given day, it would be remiss of you to not include it in your model).
However, in some specific situations, you may simply want to let the data tell you whatever there is to tell, without being guided by theory. This is where analysis becomes exploratory in nature (and therefore should not be used as confirmatory evidence in support of theory).
In both the design and the analysis of a study, you will have to make many, many choices. Each one takes you down a different path, and each path leads to a different set of subsequent choices. This idea has become widely known as the garden of forking paths, and it has important consequences for your statistical inferences.
Out of all the possible paths you could have taken, some will end with what you consider to be a significant finding, and some you will simply see as dead ends. If you reach a dead-end, do you go back and try a different path? Why might this be a risky approach to statistical analyses?
For a given set of data, there will likely be some significant relationships between variables which are there simply by chance (recall that \(p<.05\) corresponds to a 1 in 20 chance of seeing such a result when there is no true relationship - if we test 20 different relationships, we would expect, on average, one of them to be significant by chance). The more paths we try out, the more likely we are to find a significant relationship, even though it may actually be completely spurious!
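To see this in action, here is a small simulation (just an illustrative sketch, not part of the example above): we generate an outcome and 20 predictors which are all pure noise, and count how many of the 20 separate tests nonetheless come out “significant” at the .05 level.
set.seed(123)
n <- 100
y <- rnorm(n)                  # an outcome that is pure noise
xs <- replicate(20, rnorm(n))  # 20 predictors with no true relationship to y
# p-value for the slope in each simple regression of y on one predictor
pvals <- apply(xs, 2, function(x) summary(lm(y ~ x))$coefficients[2, 4])
sum(pvals < .05)               # how many are "significant" purely by chance?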
Model selection is a means of answering the question “which predictors should I include in my model?”, but it is a big maze of forking paths which ends with keeping only those predictors that meet some criterion (e.g., statistical significance).
Stepwise
Forward Selection
- Start with the variable which has the highest association with the DV.
- Add the variable which most increases \(R^2\) out of all those which remain.
- Continue until no remaining variable meaningfully improves \(R^2\) (e.g., the increase is no longer statistically significant).
Backward Elimination
- Start with all variables in the model.
- Remove the predictor with the highest p-value.
- Run the model again and repeat.
- Stop when all p-values for predictors are less than the a priori set critical level.
Note that we can have different criteria for selecting models in this stepwise approach, for instance, choosing the model with the biggest decrease in AIC.
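As a rough illustration of what one round of backward elimination involves (shown here on the built-in mtcars data rather than the exercise data), drop1() refits the model with each term removed in turn and reports an F-test and the AIC for each:
illus_mod <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
drop1(illus_mod, test = "F")   # remove the term with the highest p-value (or whose removal most lowers the AIC), refit, repeat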
Question 1
Using the backward elimination approach, construct a final model to predict wellbeing scores using the mwdata2 dataset.
library(tidyverse)
mwdata2 <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")
Solution
We will stop when all p-values are \(<.05\).
Note that we have two variables in there which are direct transformations of one another - “location” and “isRural”. We can’t have both.
summary(mwdata2)
## age outdoor_time social_int routine wellbeing
## Min. :18.00 Min. : 1.00 Min. : 3.00 Min. :0.000 Min. :22.0
## 1st Qu.:30.00 1st Qu.:12.75 1st Qu.: 9.00 1st Qu.:0.000 1st Qu.:33.0
## Median :42.00 Median :18.00 Median :12.00 Median :1.000 Median :35.0
## Mean :42.30 Mean :18.25 Mean :12.06 Mean :0.565 Mean :36.3
## 3rd Qu.:54.25 3rd Qu.:23.00 3rd Qu.:15.00 3rd Qu.:1.000 3rd Qu.:40.0
## Max. :70.00 Max. :35.00 Max. :24.00 Max. :1.000 Max. :59.0
##
## location steps_k
## Length:200 Min. : 0.00
## Class :character 1st Qu.: 24.00
## Mode :character Median : 42.45
## Mean : 44.93
## 3rd Qu.: 65.28
## Max. :111.30
## NA's :66
full_model <- lm(wellbeing ~ age + outdoor_time + social_int + routine + location + steps_k, data = mwdata2)
summary(full_model)
##
## Call:
## lm(formula = wellbeing ~ age + outdoor_time + social_int + routine +
## location + steps_k, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.567 -2.901 -0.050 2.919 9.416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.289216 2.094310 13.985 < 2e-16 ***
## age 0.018316 0.026404 0.694 0.489155
## outdoor_time 0.145573 0.061363 2.372 0.019189 *
## social_int 0.357076 0.098317 3.632 0.000408 ***
## routine 3.039223 0.788248 3.856 0.000183 ***
## locationrural -5.176386 0.974776 -5.310 4.78e-07 ***
## locationsuburb 0.075809 1.114115 0.068 0.945858
## steps_k 0.006114 0.016434 0.372 0.710489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.457 on 126 degrees of freedom
## (66 observations deleted due to missingness)
## Multiple R-squared: 0.3707, Adjusted R-squared: 0.3358
## F-statistic: 10.6 on 7 and 126 DF, p-value: 1.926e-10
We will remove the “steps_k” variable, as it is the predictor with the highest p-value (don’t be tempted to think that “location” has the highest p-value: the estimated difference between urban and suburban locations does indeed have a high p-value, but “location” is a single predictor, and the difference between rural and urban locations has a very low p-value). Note also that “steps_k” had 66 missing values, so once it is removed the model is fitted to all 200 observations rather than the 134 used above.
model1 <- lm(wellbeing ~ age + outdoor_time + social_int + routine + location, data = mwdata2)
summary(model1)
##
## Call:
## lm(formula = wellbeing ~ age + outdoor_time + social_int + routine +
## location, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9173 -2.6865 -0.2971 2.8334 14.6234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.76995 1.65807 17.955 < 2e-16 ***
## age -0.01079 0.02043 -0.528 0.598
## outdoor_time 0.17452 0.04295 4.063 7.03e-05 ***
## social_int 0.38733 0.07592 5.102 8.01e-07 ***
## routine 2.95814 0.61145 4.838 2.68e-06 ***
## locationrural -4.97721 0.71808 -6.931 6.09e-11 ***
## locationsuburb -0.27975 0.86795 -0.322 0.748
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.278 on 193 degrees of freedom
## Multiple R-squared: 0.3896, Adjusted R-squared: 0.3707
## F-statistic: 20.53 on 6 and 193 DF, p-value: < 2.2e-16
Next, we remove the “age” variable, which now has the highest p-value (again treating “location” as a single predictor rather than looking at its individual dummy coefficients):
model2 <- lm(wellbeing ~ outdoor_time + social_int + routine + location, data = mwdata2)
summary(model2)
##
## Call:
## lm(formula = wellbeing ~ outdoor_time + social_int + routine +
## location, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8649 -2.6977 -0.2127 2.5935 14.8263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.28101 1.37298 21.327 < 2e-16 ***
## outdoor_time 0.17561 0.04282 4.101 6.04e-05 ***
## social_int 0.38815 0.07576 5.123 7.21e-07 ***
## routine 2.95193 0.61020 4.838 2.67e-06 ***
## locationrural -4.96317 0.71625 -6.929 6.08e-11 ***
## locationsuburb -0.28431 0.86630 -0.328 0.743
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.27 on 194 degrees of freedom
## Multiple R-squared: 0.3888, Adjusted R-squared: 0.373
## F-statistic: 24.68 on 5 and 194 DF, p-value: < 2.2e-16
In this model, all our predictors have p-values lower than our critical level of \(.05\) (the non-significant “locationsuburb” coefficient is just one contrast of the “location” variable, which as a whole clearly contributes to the model), so this is our final model.
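As an aside (not part of the backward elimination procedure itself), if you want a single joint test of whether “location” as a whole improves the model, one option is to compare model2 against the same model with “location” removed:
# joint (2 df) test of the location variable via a nested model comparison
anova(lm(wellbeing ~ outdoor_time + social_int + routine, data = mwdata2), model2)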
Question 2
There are functions in R which automate the stepwise procedure for us.
step(<modelname>)
will by default use backward elimination to choose the model with the lowest AIC.
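A couple of optional arguments are worth knowing about: direction (one of "both", "backward", or "forward") controls the type of search, and setting k = log(n) tells step() to use the BIC rather than the AIC as its criterion. For instance (just an illustration, applied to model1 from the previous question):
step(model1, direction = "backward", k = log(nobs(model1)))   # backward elimination using the BIC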
- Using data on the Big 5 Personality traits, perceptions of social ranks, and depression and anxiety, fit the full model to predict DASS-21 scores.
- Use step() to determine which predictors to keep in your model.
- What predictors do you have in your final model?
scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")
scs_study <-
  scs_study %>%
  mutate(
    scs_mc = scs - mean(scs)   # mean-centre the scs score
  )
Solution
full_dass_model <- lm(dass ~ zn*scs_mc + zo + zc + ze + za + zn, data = scs_study)
step(full_dass_model)
## Start: AIC=2374.99
## dass ~ zn * scs_mc + zo + zc + ze + za + zn
##
## Df Sum of Sq RSS AIC
## - zc 1 2.05 23915 2373.0
## - za 1 11.06 23924 2373.3
## - zo 1 22.82 23936 2373.6
## <none> 23913 2375.0
## - ze 1 147.25 24060 2377.0
## - zn:scs_mc 1 2340.68 26254 2434.2
##
## Step: AIC=2373.04
## dass ~ zn + scs_mc + zo + ze + za + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## - za 1 10.66 23926 2371.3
## - zo 1 23.28 23938 2371.7
## <none> 23915 2373.0
## - ze 1 148.17 24063 2375.1
## - zn:scs_mc 1 2339.57 26255 2432.3
##
## Step: AIC=2371.33
## dass ~ zn + scs_mc + zo + ze + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## - zo 1 22.33 23948 2369.9
## <none> 23926 2371.3
## - ze 1 497.48 24423 2382.8
## - zn:scs_mc 1 2340.17 26266 2430.6
##
## Step: AIC=2369.95
## dass ~ zn + scs_mc + ze + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## <none> 23948 2369.9
## - ze 1 496.1 24444 2381.4
## - zn:scs_mc 1 2318.4 26266 2428.6
##
## Call:
## lm(formula = dass ~ zn + scs_mc + ze + zn:scs_mc, data = scs_study)
##
## Coefficients:
## (Intercept) zn scs_mc ze zn:scs_mc
## 44.9311 1.5566 -0.4471 0.8708 -0.5153
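The final model therefore retains zn, scs_mc, ze, and the zn:scs_mc interaction. Note that step() returns the selected model, so you can assign the result and work with it directly (trace = 0 simply suppresses the step-by-step printout shown above):
final_dass_model <- step(full_dass_model, trace = 0)   # same selection as above, without printing each step
summary(final_dass_model)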
Extra reading: Joshua Loftus’ Blog: Model selection bias invalidates significance tests