Prelude

A common goal for researchers is to determine which variables matter (and which do not) in contributing to some outcome variable. A common approach to answer such questions is to consider whether some variable \(X\)’s contribution remains significant after controlling for variables \(Z\).

The reasoning:

If our measure of \(X\) correlates significantly with outcome \(Y\) even when controlling for our measure of \(Z\), then \(X\) contributes to \(y\) over and above the contribution of \(Z\).

In multiple regression, we might fit the model \(Y = b_0 + b_1 \cdot X + b_2 \cdot Z + \epsilon\) and conclude that \(X\) is a useful predictor of \(Y\) over and above \(Z\) based on the estimate \(\hat b_1\), or via model comparison between that model and the model without \(Z\) as a predictor (\(Y = b_0 + b_1 \cdot X + \epsilon\)).

Toy Example

Suppose we have monthly data over a seven year period which captures the number of shark attacks on swimmers each month, and the number of ice-creams sold by beach vendors each month.
Consider the relationship between the two:

We can fit the linear model and see a significant relationship between ice cream sales and shark attacks:

sharkdata <- read_csv("https://uoepsy.github.io/data/sharks.csv")
shark_mdl <- lm(shark_attacks ~ ice_cream_sales, data = sharkdata)
summary(shark_mdl)

## 
## Call:
## lm(formula = shark_attacks ~ ice_cream_sales, data = sharkdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.395  -4.927   0.509   4.815  15.702 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.5883     5.1906    1.08     0.28    
## ice_cream_sales   0.3226     0.0581    5.55  3.5e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.24 on 81 degrees of freedom
## Multiple R-squared:  0.276,  Adjusted R-squared:  0.267 
## F-statistic: 30.8 on 1 and 81 DF,  p-value: 3.46e-07

Question

Does the relationship between ice cream sales and shark attacks make sense? What might be missing from our model?

You might quite rightly suggest that this relationship is actually being driven by temperature - when it is hotter, there are more ice cream sales and there are more people swimming (hence more shark attacks).

Question

Is \(X\) (the number of ice-cream sales) a useful predictor of \(Y\) (numbers of shark attacks) over and above \(Z\) (temperature)?

We might answer this with a multiple regression model including both temperature and ice cream sales as predictors of shark attacks:

shark_mdl2 <- lm(shark_attacks ~temperature + ice_cream_sales, data = sharkdata)
anova(shark_mdl2)

## Analysis of Variance Table
## 
## Response: shark_attacks
##                 Df Sum Sq Mean Sq F value  Pr(>F)    
## temperature      1   3000    3000   85.78 2.7e-14 ***
## ice_cream_sales  1     72      72    2.05    0.16    
## Residuals       80   2798      35                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What do you conclude?

It appears that numbers of ice cream sales is not a significant predictor of sharks attack numbers over and above the temperature.

However… In psychology, we can rarely observe and directly measure the constructs which we are interested in (for example, personality traits, intelligence, emotional states etc.). We rely instead on measurements of, e.g. behavioural tendencies, as a proxy for personality traits.

Let’s suppose that instead of including temperature in degrees celsius, we asked a set of people to self-report on a scale of 1 to 7 how hot it was that day. This measure should hopefully correlate well with the actual temperature, however, there will likely be some variation:

Question

Is \(X\) (the number of ice-cream sales) a useful predictor of \(Y\) (numbers of shark attacks) over and above \(Z\) (temperature - measured on our self-reported heat scale)?

shark_mdl2a <- lm(shark_attacks ~ sr_heat + ice_cream_sales, data = sharkdata)
anova(shark_mdl2a)

## Analysis of Variance Table
## 
## Response: shark_attacks
##                 Df Sum Sq Mean Sq F value  Pr(>F)    
## sr_heat          1   2684    2684   73.31 6.4e-13 ***
## ice_cream_sales  1    257     257    7.01  0.0097 ** 
## Residuals       80   2929      37                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What do you conclude?

Moral of the story: be considerate of what exactly it is that you are measuring.

This example was adapted from Westfall and Yarkoni, 2020 which provides a much more extensive discussion of incremental validity and type 1 error rates.

This workbook was written by Josiah King, Umberto Noe, and Martin Corley, and is licensed under a Creative Commons Attribution 4.0 International License.

Incremental validity - A caution

Prelude

Toy Example