“Which predictors should I include in my model?”
As a rule of thumb, you should include as predictors your variables of interest (i.e., those required to answer your questions), and those which theory suggests you should take into account (for instance, if theory tells you that temperature is likely to influence the number of shark attacks on a given day, it would be remiss of you to not include it in your model).
However, in some specific situations, you may simply want to let the data tell you whatever there is to tell, without being guided by theory. This is where analysis becomes exploratory in nature (and therefore should not be used as confirmatory evidence in support of theory).
In both the design and the analysis of a study, you will have to make many, many choices. Each one takes you down a different route, and each route leads to a new set of choices. This idea has become widely known as the garden of forking paths, and it has important consequences for your statistical inferences.
Out of all the possible paths you could have taken, some will end with what you consider to be a significant finding, and some you will simply see as dead ends. If you reach a dead-end, do you go back and try a different path? Why might this be a risky approach to statistical analyses?
For a given set of data, there will likely be some significant relationships between variables which are there simply by chance (recall that \(p<.05\) corresponds to a 1 in 20 chance - if we study 20 different relationships, we would expect one of them to be significant by chance). The more paths we try out, the more likely we are to find a significant relationship, even though it may actually be completely spurious!
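To put a rough number on this, the chance of at least one spuriously significant result across 20 independent tests of truly null relationships can be computed directly (the 20 and the .05 are just the figures from the paragraph above):

1 - (1 - 0.05)^20   # probability of at least one p < .05 is roughly 0.64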
Model selection is a means of answering the question “which predictors should I include in my model?”, but it is itself a big maze of forking paths: it leaves you with only those predictors which meet some criterion (e.g., statistical significance).
Stepwise
Forward Selection
- Start with the variable that has the highest association with the DV.
- Add whichever remaining variable most increases \(R^2\).
- Continue until no remaining variable meaningfully improves \(R^2\) (e.g., until the increase in \(R^2\) is no longer statistically significant) - see the sketch below.
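As a rough sketch of what this looks like in R (the data frame dat and the predictors x1, x2, x3 here are purely hypothetical; add1() is one convenient way to compare the candidate additions at each step):

null_model <- lm(y ~ 1, data = dat)                   # start with no predictors
add1(null_model, scope = ~ x1 + x2 + x3, test = "F")  # compare each candidate addition
model_a <- update(null_model, . ~ . + x2)             # add the best candidate (here, say, x2)
add1(model_a, scope = ~ x1 + x2 + x3, test = "F")     # ...and repeat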
Backward Elimination
- Start with all variables in the model.
- Remove the predictor with the highest p-value.
- Refit the model and repeat.
- Stop when all remaining predictors have p-values below the critical level set a priori.
Note that we can have different criteria for selecting models in this stepwise approach, for instance, choosing the model with the biggest decrease in AIC.
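For instance, a single backward step under either criterion can be examined with drop1() (here mod is just a placeholder for an already-fitted lm model):

drop1(mod, test = "F")   # p-value criterion: remove the term with the highest p-value
drop1(mod)               # AIC criterion: remove the term whose removal lowers AIC the most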
Question 1
Using the backward elimination approach, construct a final model to predict wellbeing scores using the mwdata2 dataset.
mwdata2 <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")
Solution
We will stop when all p-values are \(<.05\).
Note that the data contain two variables which are direct transformations of one another - “location” and “isRural” - so we can’t have both in the model (a quick check of this is shown after the data summary below).
## age outdoor_time social_int routine wellbeing
## Min. :18.0 Min. : 1.0 Min. : 3.0 Min. :0.000 Min. :22.0
## 1st Qu.:30.0 1st Qu.:12.8 1st Qu.: 9.0 1st Qu.:0.000 1st Qu.:33.0
## Median :42.0 Median :18.0 Median :12.0 Median :1.000 Median :35.0
## Mean :42.3 Mean :18.3 Mean :12.1 Mean :0.565 Mean :36.3
## 3rd Qu.:54.2 3rd Qu.:23.0 3rd Qu.:15.0 3rd Qu.:1.000 3rd Qu.:40.0
## Max. :70.0 Max. :35.0 Max. :24.0 Max. :1.000 Max. :59.0
##
## location steps_k
## Length:200 Min. : 0.0
## Class :character 1st Qu.: 24.0
## Mode :character Median : 42.5
## Mean : 44.9
## 3rd Qu.: 65.3
## Max. :111.3
## NA's :66
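As a quick check, we can cross-tabulate the two variables to confirm they encode the same information (each location category should map onto a single isRural value):

table(mwdata2$location, mwdata2$isRural)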
full_model <- lm(wellbeing ~ age + outdoor_time + social_int + routine + location + steps_k, data = mwdata2)
summary(full_model)
##
## Call:
## lm(formula = wellbeing ~ age + outdoor_time + social_int + routine +
## location + steps_k, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.57 -2.90 -0.05 2.92 9.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.28922 2.09431 13.99 < 2e-16 ***
## age 0.01832 0.02640 0.69 0.48916
## outdoor_time 0.14557 0.06136 2.37 0.01919 *
## social_int 0.35708 0.09832 3.63 0.00041 ***
## routine 3.03922 0.78825 3.86 0.00018 ***
## locationrural -5.17639 0.97478 -5.31 4.8e-07 ***
## locationsuburb 0.07581 1.11412 0.07 0.94586
## steps_k 0.00611 0.01643 0.37 0.71049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.46 on 126 degrees of freedom
## (66 observations deleted due to missingness)
## Multiple R-squared: 0.371, Adjusted R-squared: 0.336
## F-statistic: 10.6 on 7 and 126 DF, p-value: 1.93e-10
We will remove the “steps_k” variable, as it is the predictor with the highest p-value (don’t be tempted to think that “location” has the highest p-value: the estimated difference between suburban and urban locations does have a high p-value, but the difference between rural and urban has a very low p-value).
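If you want a single test of whether the location factor as a whole should stay (rather than eyeballing its separate dummy coefficients), one option is to compare the model with and without it - a sketch:

model_no_location <- update(full_model, . ~ . - location)  # same model minus the location factor
anova(model_no_location, full_model)                        # F-test treating the factor as a single term

Either way, location stays in the model and “steps_k” goes: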
model1 <- lm(wellbeing ~ age + outdoor_time + social_int + routine + location, data = mwdata2)
summary(model1)
##
## Call:
## lm(formula = wellbeing ~ age + outdoor_time + social_int + routine +
## location, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.917 -2.686 -0.297 2.833 14.623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.7700 1.6581 17.95 < 2e-16 ***
## age -0.0108 0.0204 -0.53 0.60
## outdoor_time 0.1745 0.0429 4.06 7.0e-05 ***
## social_int 0.3873 0.0759 5.10 8.0e-07 ***
## routine 2.9581 0.6115 4.84 2.7e-06 ***
## locationrural -4.9772 0.7181 -6.93 6.1e-11 ***
## locationsuburb -0.2797 0.8680 -0.32 0.75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.28 on 193 degrees of freedom
## Multiple R-squared: 0.39, Adjusted R-squared: 0.371
## F-statistic: 20.5 on 6 and 193 DF, p-value: <2e-16
Of the remaining predictors, “age” now has the highest p-value (setting aside “location”, which we keep because the rural vs urban difference is highly significant), so we remove it next:
model2 <- lm(wellbeing ~ outdoor_time + social_int + routine + location, data = mwdata2)
summary(model2)
##
## Call:
## lm(formula = wellbeing ~ outdoor_time + social_int + routine +
## location, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.865 -2.698 -0.213 2.594 14.826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.2810 1.3730 21.33 < 2e-16 ***
## outdoor_time 0.1756 0.0428 4.10 6.0e-05 ***
## social_int 0.3881 0.0758 5.12 7.2e-07 ***
## routine 2.9519 0.6102 4.84 2.7e-06 ***
## locationrural -4.9632 0.7163 -6.93 6.1e-11 ***
## locationsuburb -0.2843 0.8663 -0.33 0.74
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.27 on 194 degrees of freedom
## Multiple R-squared: 0.389, Adjusted R-squared: 0.373
## F-statistic: 24.7 on 5 and 194 DF, p-value: <2e-16
In this model, all our predictors have p-values lower than our critical level of \(.05\).
Question 2
There are functions in R which automate the stepwise procedure for us. step(<modelname>) will by default use backward elimination to choose the model with the lowest AIC.
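If you want to be explicit, step() also lets you set the search direction and the penalty used in the criterion (these are standard arguments to step(); mod is a placeholder for your fitted model):

step(mod, direction = "backward", k = 2)   # k = 2 gives AIC; k = log(n) would give the BIC penalty instead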
- Using data on the Big 5 Personality traits, perceptions of social ranks, and depression and anxiety, fit the full model to predict DASS-21 scores.
- Use step() to determine which predictors to keep in your model.
- What predictors do you have in your final model?
scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")
scs_study <-
  scs_study %>%
  mutate(
    # mean-centre the scs score by subtracting its sample mean
    scs_mc = scs - mean(scs)
  )
Solution
full_dass_model <- lm(dass ~ zn*scs_mc + zo + zc + ze + za + zn, data = scs_study)
step(full_dass_model)
## Start: AIC=2375
## dass ~ zn * scs_mc + zo + zc + ze + za + zn
##
## Df Sum of Sq RSS AIC
## - zc 1 2 23915 2373
## - za 1 11 23924 2373
## - zo 1 23 23936 2374
## <none> 23913 2375
## - ze 1 147 24060 2377
## - zn:scs_mc 1 2341 26254 2434
##
## Step: AIC=2373
## dass ~ zn + scs_mc + zo + ze + za + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## - za 1 11 23926 2371
## - zo 1 23 23938 2372
## <none> 23915 2373
## - ze 1 148 24063 2375
## - zn:scs_mc 1 2340 26255 2432
##
## Step: AIC=2371
## dass ~ zn + scs_mc + zo + ze + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## - zo 1 22 23948 2370
## <none> 23926 2371
## - ze 1 497 24423 2383
## - zn:scs_mc 1 2340 26266 2431
##
## Step: AIC=2370
## dass ~ zn + scs_mc + ze + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## <none> 23948 2370
## - ze 1 496 24444 2381
## - zn:scs_mc 1 2318 26266 2429
##
## Call:
## lm(formula = dass ~ zn + scs_mc + ze + zn:scs_mc, data = scs_study)
##
## Coefficients:
## (Intercept) zn scs_mc ze zn:scs_mc
## 44.931 1.557 -0.447 0.871 -0.515
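From the output above, the final model retains zn, scs_mc, ze, and the zn:scs_mc interaction. Because step() returns the selected model, we can assign it and inspect it in the usual way (setting trace = 0 simply suppresses the step-by-step printout):

final_dass_model <- step(full_dass_model, trace = 0)
summary(final_dass_model)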
Extra reading: Joshua Loftus’ Blog: Model selection bias invalidates significance tests