A common goal for researchers is to determine which variables matter (and which do not) in contributing to some outcome variable. A common approach to answer such questions is to consider whether some variable \(X\)’s contribution remains significant after controlling for variables \(Z\).
The reasoning:
In multiple regression, we might fit the model \(Y = b_0 + b_1 \cdot X + b_2 \cdot Z + \epsilon\) and conclude that \(X\) is a useful predictor of \(Y\) over and above \(Z\) based on the estimate \(\hat b_1\), or via model comparison between that model and the model without \(Z\) as a predictor (\(Y = b_0 + b_1 \cdot X + \epsilon\)).
Suppose we have monthly data over a seven year period which captures the number of shark attacks on swimmers each month, and the number of ice-creams sold by beach vendors each month.
Consider the relationship between the two:
We can fit the linear model and see a significant relationship between ice cream sales and shark attacks:
<- read_csv("https://uoepsy.github.io/data/sharks.csv")
sharkdata <- lm(shark_attacks ~ ice_cream_sales, data = sharkdata)
shark_mdl summary(shark_mdl)
##
## Call:
## lm(formula = shark_attacks ~ ice_cream_sales, data = sharkdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.395 -4.927 0.509 4.815 15.702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5883 5.1906 1.08 0.28
## ice_cream_sales 0.3226 0.0581 5.55 3.5e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.24 on 81 degrees of freedom
## Multiple R-squared: 0.276, Adjusted R-squared: 0.267
## F-statistic: 30.8 on 1 and 81 DF, p-value: 3.46e-07
Does the relationship between ice cream sales and shark attacks make sense? What might be missing from our model?
You might quite rightly suggest that this relationship is actually being driven by temperature - when it is hotter, there are more ice cream sales and there are more people swimming (hence more shark attacks).
Is \(X\) (the number of ice-cream sales) a useful predictor of \(Y\) (numbers of shark attacks) over and above \(Z\) (temperature)?
We might answer this with a multiple regression model including both temperature and ice cream sales as predictors of shark attacks:
<- lm(shark_attacks ~temperature + ice_cream_sales, data = sharkdata)
shark_mdl2 anova(shark_mdl2)
## Analysis of Variance Table
##
## Response: shark_attacks
## Df Sum Sq Mean Sq F value Pr(>F)
## temperature 1 3000 3000 85.78 2.7e-14 ***
## ice_cream_sales 1 72 72 2.05 0.16
## Residuals 80 2798 35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What do you conclude?
It appears that numbers of ice cream sales is not a significant predictor of sharks attack numbers over and above the temperature.
However… In psychology, we can rarely observe and directly measure the constructs which we are interested in (for example, personality traits, intelligence, emotional states etc.). We rely instead on measurements of, e.g. behavioural tendencies, as a proxy for personality traits.
Let’s suppose that instead of including temperature in degrees celsius, we asked a set of people to self-report on a scale of 1 to 7 how hot it was that day. This measure should hopefully correlate well with the actual temperature, however, there will likely be some variation:
Is \(X\) (the number of ice-cream sales) a useful predictor of \(Y\) (numbers of shark attacks) over and above \(Z\) (temperature - measured on our self-reported heat scale)?
<- lm(shark_attacks ~ sr_heat + ice_cream_sales, data = sharkdata)
shark_mdl2a anova(shark_mdl2a)
## Analysis of Variance Table
##
## Response: shark_attacks
## Df Sum Sq Mean Sq F value Pr(>F)
## sr_heat 1 2684 2684 73.31 6.4e-13 ***
## ice_cream_sales 1 257 257 7.01 0.0097 **
## Residuals 80 2929 37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What do you conclude?
Moral of the story: be considerate of what exactly it is that you are measuring.
This example was adapted from Westfall and Yarkoni, 2020 which provides a much more extensive discussion of incremental validity and type 1 error rates.