“Which predictors should I include in my model?”
As a rule of thumb, you should include as predictors your variables of interest (i.e., those required to answer your questions), and those which theory suggests you should take into account (for instance, if theory tells you that temperature is likely to influence the number of shark attacks on a given day, it would be remiss of you to not include it in your model).
However, in some specific situations, you may simply want to let the data tell you whatever there is to tell, without being guided by theory. This is where analysis becomes exploratory in nature (and therefore should not be used as confirmatory evidence in support of theory).
In both the design and the analysis of a study, you will have to make many, many choices. Each one takes you down a different path, and each path leads to a different set of subsequent choices. This idea has become widely known as the garden of forking paths, and it has important consequences for your statistical inferences.
Out of all the possible paths you could have taken, some will end with what you consider to be a significant finding, and some you will simply see as dead ends. If you reach a dead-end, do you go back and try a different path? Why might this be a risky approach to statistical analyses?
For a given set of data, there will likely be some significant relationships between variables which are there simply by chance (recall that \(p<.05\) corresponds to a 1 in 20 chance of seeing such a result when there is no true relationship - if we test 20 different relationships, we would expect, on average, one of them to be significant by chance). The more paths we try out, the more likely we are to find a significant relationship, even though it may actually be completely spurious!
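To see this in action, here is a small simulation (just an illustrative sketch, not part of the example above): we generate an outcome and 20 predictors which are all pure noise, and count how many of the 20 separate tests nonetheless come out “significant” at the .05 level.
set.seed(123)
n <- 100
y <- rnorm(n)                  # an outcome that is pure noise
xs <- replicate(20, rnorm(n))  # 20 predictors with no true relationship to y
# p-value for the slope in each simple regression of y on one predictor
pvals <- apply(xs, 2, function(x) summary(lm(y ~ x))$coefficients[2, 4])
sum(pvals < .05)               # how many are "significant" purely by chance?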
Model selection is a means of answering the question “which predictors should I include in my model?”, but it is a big maze of forking paths which ends with keeping only those predictors that meet some criterion (e.g., statistical significance).
Stepwise
Forward Selection
- Start with the variable which has the highest association with the DV.
- Add the variable which most increases \(R^2\) out of all those which remain.
- Continue until no remaining variable meaningfully improves \(R^2\) (e.g., the increase is no longer statistically significant).
Backward Elimination
- Start with all variables in the model.
- Remove the predictor with the highest p-value.
- Run the model again and repeat.
- Stop when all p-values for predictors are less than the a priori set critical level.
Note that we can have different criteria for selecting models in this stepwise approach, for instance, choosing the model with the biggest decrease in AIC.
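As a rough illustration of what one round of backward elimination involves (shown here on the built-in mtcars data rather than the exercise data), drop1() refits the model with each term removed in turn and reports an F-test and the AIC for each:
illus_mod <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
drop1(illus_mod, test = "F")   # remove the term with the highest p-value (or whose removal most lowers the AIC), refit, repeat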
Question 1
Using the backward elimination approach, construct a final model to predict wellbeing scores using the mwdata2 dataset.
library(tidyverse)
mwdata2 <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")
Solution
We will stop when all p-values are \(<.05\).
Note that we have two variables in there which are direct transformations of one another - “location” and “isRural”. We can’t have both.
summary(mwdata2)
## age outdoor_time social_int routine wellbeing
## Min. :18.00 Min. : 1.00 Min. : 3.00 Min. :0.000 Min. :22.0
## 1st Qu.:30.00 1st Qu.:12.75 1st Qu.: 9.00 1st Qu.:0.000 1st Qu.:33.0
## Median :42.00 Median :18.00 Median :12.00 Median :1.000 Median :35.0
## Mean :42.30 Mean :18.25 Mean :12.06 Mean :0.565 Mean :36.3
## 3rd Qu.:54.25 3rd Qu.:23.00 3rd Qu.:15.00 3rd Qu.:1.000 3rd Qu.:40.0
## Max. :70.00 Max. :35.00 Max. :24.00 Max. :1.000 Max. :59.0
##
## location steps_k
## Length:200 Min. : 0.00
## Class :character 1st Qu.: 24.00
## Mode :character Median : 42.45
## Mean : 44.93
## 3rd Qu.: 65.28
## Max. :111.30
## NA's :66
full_model <- lm(wellbeing ~ age + outdoor_time + social_int + routine + location + steps_k, data = mwdata2)
summary(full_model)
##
## Call:
## lm(formula = wellbeing ~ age + outdoor_time + social_int + routine +
## location + steps_k, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.567 -2.901 -0.050 2.919 9.416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.289216 2.094310 13.985 < 2e-16 ***
## age 0.018316 0.026404 0.694 0.489155
## outdoor_time 0.145573 0.061363 2.372 0.019189 *
## social_int 0.357076 0.098317 3.632 0.000408 ***
## routine 3.039223 0.788248 3.856 0.000183 ***
## locationrural -5.176386 0.974776 -5.310 4.78e-07 ***
## locationsuburb 0.075809 1.114115 0.068 0.945858
## steps_k 0.006114 0.016434 0.372 0.710489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.457 on 126 degrees of freedom
## (66 observations deleted due to missingness)
## Multiple R-squared: 0.3707, Adjusted R-squared: 0.3358
## F-statistic: 10.6 on 7 and 126 DF, p-value: 1.926e-10
We will remove the “steps_k” variable, as it is the predictor with the highest p-value (don’t be tempted to think that “location” has the highest p-value: the estimated difference between urban and suburban locations does indeed have a high p-value, but “location” is a single predictor, and the difference between rural and urban locations has a very low p-value). Note also that “steps_k” had 66 missing values, so once it is removed the model is fitted to all 200 observations rather than the 134 used above.
model1 <- lm(wellbeing ~ age + outdoor_time + social_int + routine + location, data = mwdata2)
summary(model1)
##
## Call:
## lm(formula = wellbeing ~ age + outdoor_time + social_int + routine +
## location, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9173 -2.6865 -0.2971 2.8334 14.6234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.76995 1.65807 17.955 < 2e-16 ***
## age -0.01079 0.02043 -0.528 0.598
## outdoor_time 0.17452 0.04295 4.063 7.03e-05 ***
## social_int 0.38733 0.07592 5.102 8.01e-07 ***
## routine 2.95814 0.61145 4.838 2.68e-06 ***
## locationrural -4.97721 0.71808 -6.931 6.09e-11 ***
## locationsuburb -0.27975 0.86795 -0.322 0.748
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.278 on 193 degrees of freedom
## Multiple R-squared: 0.3896, Adjusted R-squared: 0.3707
## F-statistic: 20.53 on 6 and 193 DF, p-value: < 2.2e-16
Next, we remove the “age” variable, which now has the highest p-value (again treating “location” as a single predictor rather than looking at its individual dummy coefficients):
model2 <- lm(wellbeing ~ outdoor_time + social_int + routine + location, data = mwdata2)
summary(model2)
##
## Call:
## lm(formula = wellbeing ~ outdoor_time + social_int + routine +
## location, data = mwdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8649 -2.6977 -0.2127 2.5935 14.8263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.28101 1.37298 21.327 < 2e-16 ***
## outdoor_time 0.17561 0.04282 4.101 6.04e-05 ***
## social_int 0.38815 0.07576 5.123 7.21e-07 ***
## routine 2.95193 0.61020 4.838 2.67e-06 ***
## locationrural -4.96317 0.71625 -6.929 6.08e-11 ***
## locationsuburb -0.28431 0.86630 -0.328 0.743
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.27 on 194 degrees of freedom
## Multiple R-squared: 0.3888, Adjusted R-squared: 0.373
## F-statistic: 24.68 on 5 and 194 DF, p-value: < 2.2e-16
In this model, all our predictors have p-values lower than our critical level of \(.05\) (the non-significant “locationsuburb” coefficient is just one contrast of the “location” variable, which as a whole clearly contributes to the model), so this is our final model.
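As an aside (not part of the backward elimination procedure itself), if you want a single joint test of whether “location” as a whole improves the model, one option is to compare model2 against the same model with “location” removed:
# joint (2 df) test of the location variable via a nested model comparison
anova(lm(wellbeing ~ outdoor_time + social_int + routine, data = mwdata2), model2)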
Question 2
There are functions in R which automate the stepwise procedure for us.
step(<modelname>)
will by default use backward elimination to choose the model with the lowest AIC.
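A couple of optional arguments are worth knowing about: direction (one of "both", "backward", or "forward") controls the type of search, and setting k = log(n) tells step() to use the BIC rather than the AIC as its criterion. For instance (just an illustration, applied to model1 from the previous question):
step(model1, direction = "backward", k = log(nobs(model1)))   # backward elimination using the BIC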
- Using data on the Big 5 Personality traits, perceptions of social ranks, and depression and anxiety, fit the full model to predict DASS-21 scores.
- Use step() to determine which predictors to keep in your model.
- What predictors do you have in your final model?
scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")
scs_study <-
  scs_study %>%
  mutate(
    scs_mc = scs - mean(scs)   # mean-centre the scs score
  )
Solution
full_dass_model <- lm(dass ~ zn*scs_mc + zo + zc + ze + za + zn, data = scs_study)
step(full_dass_model)
## Start: AIC=2374.99
## dass ~ zn * scs_mc + zo + zc + ze + za + zn
##
## Df Sum of Sq RSS AIC
## - zc 1 2.05 23915 2373.0
## - za 1 11.06 23924 2373.3
## - zo 1 22.82 23936 2373.6
## <none> 23913 2375.0
## - ze 1 147.25 24060 2377.0
## - zn:scs_mc 1 2340.68 26254 2434.2
##
## Step: AIC=2373.04
## dass ~ zn + scs_mc + zo + ze + za + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## - za 1 10.66 23926 2371.3
## - zo 1 23.28 23938 2371.7
## <none> 23915 2373.0
## - ze 1 148.17 24063 2375.1
## - zn:scs_mc 1 2339.57 26255 2432.3
##
## Step: AIC=2371.33
## dass ~ zn + scs_mc + zo + ze + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## - zo 1 22.33 23948 2369.9
## <none> 23926 2371.3
## - ze 1 497.48 24423 2382.8
## - zn:scs_mc 1 2340.17 26266 2430.6
##
## Step: AIC=2369.95
## dass ~ zn + scs_mc + ze + zn:scs_mc
##
## Df Sum of Sq RSS AIC
## <none> 23948 2369.9
## - ze 1 496.1 24444 2381.4
## - zn:scs_mc 1 2318.4 26266 2428.6
##
## Call:
## lm(formula = dass ~ zn + scs_mc + ze + zn:scs_mc, data = scs_study)
##
## Coefficients:
## (Intercept) zn scs_mc ze zn:scs_mc
## 44.9311 1.5566 -0.4471 0.8708 -0.5153
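The final model therefore retains zn, scs_mc, ze, and the zn:scs_mc interaction. Note that step() returns the selected model, so you can assign the result and work with it directly (trace = 0 simply suppresses the step-by-step printout shown above):
final_dass_model <- step(full_dass_model, trace = 0)   # same selection as above, without printing each step
summary(final_dass_model)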
Extra reading: Joshua Loftus’ Blog: Model selection bias invalidates significance tests