Week 9 Exercises: Interactions and Categorical Predictors

Aging Cognition

Data: cogapoe4.csv

Ninety adult participants were recruited for a study investigating how cognitive functioning varies by age, and whether this is different depending on whether people carry an APOE-4 gene.

Research Question: Does the relationship between age and cognitive functioning differ between those with and without the APOE-4 genotype?

The data are available at https://uoepsy.github.io/data/cogapoe4.csv.

variables	description
pid	Participant ID
age	Age (in Years)
educ	Years of Education
birthweight_kg	Birthweight (in Kg)
apoe4	APOE-4 Gene expansion ('none', 'apoe4a', 'apoe4b', apoe4c')
acer	Score on Addenbrooke's Cognitive Examination

Question 1

Does the relationship between age and cognitive functioning differ between those with and without the APOE-4 genotype?

Read in the data and explore the variables which you think you will use to answer the research question above (create some plots, some descriptive stats etc.)

handy functions:

The pairs.panels() function from the psych package is quite a nice way to plot a scatterplot matrix of a dataset.
The describe() function is also quite nice (from the psych package too).

Question 2

Check the apoe4 variable. It currently has four levels (“none”/“apoe4a”/“apoe4b”/“apoe4c”), but the research question is actually interested in two (“none” vs “apoe4”).

Make a new variable that indicates whether or not someone has the APOE4 genotype.

Hints:

One way to do this would be to use ifelse() to define a variable which takes one value (e.g., “NO”) if the observation meets from some condition, or another value (e.g., “YES”) if it does not. You can use it to add a new variable either inside mutate(), or using data$new_variable_name <- ifelse(test, x, y) syntax.

Question 3

Does the relationship between age and cognitive functioning differ between those with and without the APOE-4 genotype?

To answer this question, do you need an interaction? If so, between what kind of variables (continuous x continuous, continuous x categorical, or categorical x categorical)?

Hint:

See 9A#it-depends

Question 4

Especially for this type of interactions, we can start to get a sense of things by plotting the data and having a separate facet for each group.

Produce a visualisation of the relationship between age and cognitive functioning, with separate facets for people with and without the APOE4 gene.

Hint:

remember facet_wrap()?

Question 5

Does the relationship between age and cognitive functioning differ between those with and without the APOE-4 genotype?

Fit a model to answer this research question.

Hint:

for how to fit interactions, see 9A#fitting-interactions-in-r

Question 6

Looking at the coefficient estimates from your model, write a description of what each one corresponds to on the plot shown in Figure 1 (it may help to sketch out the plot yourself and annotate it).

Figure 1: Multiple regression model: ACER ~ Age * isAPOE4
The dotted lines show the extension back to where the x-axis is zero

Some options for you to choose from:

The point at which the blue line cuts the y-axis (where age = 0)
The point at which the red line cuts the y-axis (where age = 0)
The average vertical distance between the red and blue lines.
The vertical distance from the blue to the red line at the y-axis (where age = 0)
The vertical distance from the red to the blue line at the y-axis (where age = 0)
The vertical distance from the blue to the red line at the center of the plot
The vertical distance from the red to the blue line at the center of the plot
The slope (vertical increase on the y-axis associated with a 1 unit increase on the x-axis) of the blue line
The slope of the red line
The adjustment to the slope when you move from the blue to the red line
The adjustment to the slope when you move from the red to the blue line

Question 7

Produce a visualisation of the estimated interaction.

Hints:

e.g. plot your model!
see 9A#visualisation

Question 8

Write a short paragraph explaining the pattern of results, including coefficient estimates as appropriate.

Question 9

It’s important to emphasise that models are structures that we design, then estimate from our data, and assess how well they fit.

Here are three models:

cogapoe$isAPOE4 = ifelse(cogapoe$apoe4 == "none", "No", "Yes")
apoe4mod1 <- lm(acer ~ 1 + age, data = cogapoe)
apoe4mod2 <- lm(acer ~ 1 + age + isAPOE4, data = cogapoe)
apoe4mod3 <- lm(acer ~ 1 + age + isAPOE4 + age:isAPOE4, data = cogapoe)

And below are those three models plotted, with the data added.
Which model corresponds to which plot?

Personality and Social Rank

Data: scs_study.csv

Data from 656 participants containing information on scores on each trait of a Big 5 personality measure, their perception of their own social rank, and their scores on a measure of depression. Previous research has identified an association between an individual’s perception of their social rank and symptoms of depression, anxiety and stress. We are interested in the individual differences in this relationship.

Research Question: After accounting for other personality traits, does the effect of social comparison on symptoms of depression, anxiety and stress vary depending on level of neuroticism?

The data is available at https://uoepsy.github.io/data/scs_study.csv.

The data in scs_study.csv contain seven attributes collected from a random sample of $n=656$ participants:

zo: Openness (Z-scored), measured on the Big-5 Aspects Scale (BFAS)
zc: Conscientiousness (Z-scored), measured on the Big-5 Aspects Scale (BFAS)
ze: Extraversion (Z-scored), measured on the Big-5 Aspects Scale (BFAS)
za: Agreeableness (Z-scored), measured on the Big-5 Aspects Scale (BFAS)
zn: Neuroticism (Z-scored), measured on the Big-5 Aspects Scale (BFAS)
scs: Social Comparison Scale - An 11-item scale that measures an individual’s perception of their social rank, attractiveness and belonging relative to others. The scale is scored as a sum of the 11 items (each measured on a 5-point scale), with higher scores indicating more favourable perceptions of social rank. Scores range from 11 to 55.
dass: Depression Anxiety and Stress Scale - The DASS-21 includes 21 items, each measured on a 4-point scale. The score is derived from the sum of all 21 items, with higher scores indicating higher a severity of symptoms. Scores range from 21 to 84.

What does ‘Z-scored’ mean?

Question 10

Read in the Social Comparison Study data and explore the relevant distributions and relationships between the variables of interest to the research question below.

After accounting for other personality traits, does the effect of social comparison on symptoms of depression, anxiety and stress vary depending on level of neuroticism?

scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")
summary(scs_study)

       zo                 zc                 ze                 za          
 Min.   :-2.81928   Min.   :-3.21819   Min.   :-3.00576   Min.   :-2.94429  
 1st Qu.:-0.63089   1st Qu.:-0.66866   1st Qu.:-0.68895   1st Qu.:-0.69394  
 Median : 0.08053   Median : 0.00257   Median :-0.04014   Median :-0.01854  
 Mean   : 0.09202   Mean   : 0.01951   Mean   : 0.00000   Mean   : 0.00000  
 3rd Qu.: 0.80823   3rd Qu.: 0.71215   3rd Qu.: 0.67085   3rd Qu.: 0.72762  
 Max.   : 3.55034   Max.   : 3.08015   Max.   : 2.80010   Max.   : 2.97010  
       zn               scs             dass      
 Min.   :-1.4486   Min.   :27.00   Min.   :23.00  
 1st Qu.:-0.7994   1st Qu.:33.00   1st Qu.:41.00  
 Median :-0.2059   Median :35.00   Median :44.00  
 Mean   : 0.0000   Mean   :35.77   Mean   :44.72  
 3rd Qu.: 0.5903   3rd Qu.:38.00   3rd Qu.:49.00  
 Max.   : 3.3491   Max.   :54.00   Max.   :68.00

p1 <- ggplot(data = scs_study, aes(x=dass)) + 
  geom_density() + 
  geom_boxplot(width = 1/50) +
  labs(title="Marginal distribution of DASS-21 Scores", 
       x = "Depression Anxiety and Stress Scale", y = "Probability density")

p2 <- ggplot(data = scs_study, aes(x=scs)) + 
  geom_density() + 
  geom_boxplot(width = 1/50) +
  labs(title="Marginal distribution of Social Comparison Scale (SCS) scores", 
       x = "Social Comparison Scale Score", y = "Probability density")

p3 <- ggplot(data = scs_study, aes(x=zn)) + 
  geom_density() + 
  geom_boxplot(width = 1/50) +
  labs(title="Marginal distribution of Neuroticism (Z-Scored)", 
       x = "Neuroticism (Z-Scored)", y = "Probability density")
library(patchwork)
p1 / (p2 + p3)

The marginal distribution of scores on the Depression, Anxiety and Stress Scale (DASS-21) is unimodal with a mean of approximately 45 and a standard deviation of 7.

The marginal distribution of scores on the Social Comparison Scale (SCS) is unimodal with a mean of approximately 36 and a standard deviation of 4. There look to be a number of outliers at the upper end of the scale.

The marginal distribution of Neuroticism (Z-scored) is positively skewed, with the 25% of scores falling below -0.8, 75% of scores falling below 0.59.

p1 <- ggplot(data = scs_study, aes(x=scs, y=dass)) + 
  geom_point()+
  labs(x = "SCS", y = "DASS-21")

p2 <- ggplot(data = scs_study, aes(x=zn, y=dass)) + 
  geom_point()+
  labs(x = "Neuroticism", y = "DASS-21")

p1 / p2

library(knitr) # for making tables look nice
# the kable() function from the knitr package can make table outputs print nicely into html.
scs_study %>%
  select(dass, scs, zn) %>%
  cor %>% 
  kable

	dass	scs	zn
dass	1.0000000	-0.2280126	0.2001885
scs	-0.2280126	1.0000000	0.1146687
zn	0.2001885	0.1146687	1.0000000

There is a weak, negative, linear relationship between scores on the Social Comparison Scale and scores on the Depression Anxiety and Stress Scale for the participants in the sample. Severity of symptoms measured on the DASS-21 tend to decrease, on average, the more favourably participants view their social rank.
There is a weak, positive, linear relationship between the levels of Neuroticism and scores on the DASS-21. Participants who are more neurotic tend to, on average, display a higher severity of symptoms of depression, anxiety and stress.

Question 11

Fit your model using lm(). Examine the coefficients, plot the relevant estimated associations to visually address the research question.

Hints:

there are sort of two bits to this:
- “after accounting for” (i.e., the stuff we learned in 8A#multiple-regression)
- “the effect of this depends on that” (the interaction stuff from 9a#interactions).

Question 12

Think - what does “0” represent in each variable? And what does an increase of 1 represent? Are these meaningful?

The zn variable is standardised, meaning that “0” is the mean, and moving from “0” to “1” is moving 1 standard deviation.

I would argue that because “0” on scs variable is not a possible score that anyone can get, we may want to re-center this variable.

Recenter the scs variable on either:

the minimum possible SCS score
the mean SCS score

Hint:

You can choose either min or mean. It won’t change your model, it will just change what you get out of it.
You can see an example of mean-centering in 9A#getting-more-from-your-model.

Question 13

Re-fit your model using your re-centered SCS scores instead of the original variable.

Fill in the blanks in the statements below.

For those at the mean on all personality traits and scoring ??? on the SCS, the estimated DASS-21 Score is ???
For those who score ??? on the SCS, an increase of ??? in neuroticism is associated with a change of ??? in DASS-21 Scores, holding other personality traits constant.
For those of average neuroticism, an increase of ??? on the SCS is associated with a change of ??? in DASS-21 Scores, holding other personality traits constant.
For every increase of ??? in neuroticism, the change in DASS-21 associated with an increase of ??? on the SCS is asjusted by ???
For every increase of ??? in SCS, the change in DASS-21 associated with an increase of ??? in neuroticism is asjusted by ???

Mean Centered

dass_mdl2 <- lm(dass ~ 1 + scs_Cmean * zn, data = scs_study)
# pull out the coefficients from the summary():
summary(dass_mdl2)$coefficients

               Estimate Std. Error    t value     Pr(>|t|)
(Intercept)  44.9324476 0.24052861 186.807079 0.000000e+00
scs_Cmean    -0.4439065 0.06834135  -6.495431 1.643265e-10
zn            1.5797687 0.24086372   6.558766 1.105118e-10
scs_Cmean:zn -0.5186142 0.06552100  -7.915236 1.063297e-14

For those at the mean on all personality traits and scoring average on the SCS, the estimated DASS-21 Score is 44.93
For those who score average (mean) on the SCS, an increase of 1 standard deviation in neuroticism is associated with a change of 1.58 in DASS-21 Scores, holding other personality traits constant.
For those of average neuroticism, an increase of 1 point on the SCS is associated with a change of -0.44 in DASS-21 Scores, holding other personality traits constant.
For every increase of 1 standard deviation in neuroticism, the change in DASS-21 associated with an increase of 1 point on the SCS is asjusted by -0.52
For every increase of 1 point in SCS, the change in DASS-21 associated with an increase of 1 standard deviation in neuroticism is asjusted by -0.52

Minimum-centered

dass_mdl2a <- lm(dass ~ 1 + scs_Cmin * zn, data = scs_study)
# pull out the coefficients from the summary():
summary(dass_mdl2a)$coefficients

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 48.8233968 0.64359279 75.860695 0.000000e+00
scs_Cmin    -0.4439065 0.06834135 -6.495431 1.643265e-10
zn           6.1255486 0.62628048  9.780839 3.572450e-21
scs_Cmin:zn -0.5186142 0.06552100 -7.915236 1.063297e-14

For those at the mean on all personality traits and scoring minimum on the SCS, the estimated DASS-21 Score is 48.82
For those who score the minimum on the SCS, an increase of 1 standard deviation in neuroticism is associated with a change of 6.13 in DASS-21 Scores, holding other personality traits constant.
For those of average neuroticism, an increase of 1 point on the SCS is associated with a change of -0.44 in DASS-21 Scores, holding other personality traits constant.
For every increase of 1 standard deviation in neuroticism, the change in DASS-21 associated with an increase of 1 point on the SCS is asjusted by -0.52
For every increase of 1 point in SCS, the change in DASS-21 associated with an increase of 1 standard deviation in neuroticism is asjusted by -0.52

Seasonal Extraversion

USMR 2022 Data

The data from the USMR 2022 survey (now closed) can be found at https://uoepsy.github.io/data/usmr2022.csv.

note, this is the survey data just from USMR this year, not other students on other courses or in previous years

Question 14

We want to make a variable that indicates what season each person was born in.

We have seen the ifelse() command a few of times now, but something we are yet to see is that we can have nested statements. So we could turn everybody’s birth-months into a season by doing something like the below, where we ask “if it’s january then it’s winter, else if it’s february then it’s winter, else if it’s march…” and so on.

usmrdata <- read_csv("https://uoepsy.github.io/data/usmr2022.csv")
usmrdata %>% mutate(
  season = ifelse(birthmonth == "jan", "winter",
                  ifelse(birthmonth == "feb", "winter",
                         ifelse(birthmonth == "mar", "spring"
                                ifelse( ..... 
)

We can make this a little more efficient by use of the %in% operator, which allows us to ask “if it’s one of [december, january, february] then it’s winter, else…”.

usmrdata %>% mutate(
  season = ifelse(birthmonth %in% c("dec","jan", "feb"), "winter",
                  ifelse(birthmonth %in% c("mar","apr","may"), "spring",
                         ...
                         ...
)

Complete the code above in order to make a column that assigns the correct birth-season based on their month of birth.

Hints:

you don’t have to write out the last “if”, because you can capture those in the final “else” bit.
if you want a more tidyverse way to do this then look up the help docs for the case_when() function.

Question 15

Make the season variable that you just created a factor.

We would like you to do the following, in order:

fit a model lm(extraversion ~ season), and assign it the name seasonmod1.
relevel the season variable to have “spring” as the reference level.
fit the same model, this time assigning it the name seasonmod2
set the contrasts for the season variable to be “sum contrasts”
fit the same model, this time assigning it the name seasonmod3

Hints:

for relevelling factors, we’ve seen this in 9A#getting-more-from-your-model.
for “sum contrasts” and how to set them, see 9B#sum-contrasts.

Question 16

Calculate the mean extraversion of people born in each season. From these, also calculate the mean of those four means.

Hints:

you can use the data %>% group_by %>% summarise logic here.
to calculate a mean of means, how about data %>% group_by %>% summarise %>% summarise!

Question 17

Because there aren’t any other predictors in our model, our coefficients from all three models we fitted in question 15 are going to be various comparisons between the numbers we calculated in question 16

For every coefficient estimate in the three models, write down what difference it is estimating, and check it against the numbers you calculated in question 16

coef(seasonmod1)

 (Intercept) seasonspring seasonsummer seasonwinter 
   26.117647     4.174020     7.132353     3.609626

coef(seasonmod2)

 (Intercept) seasonautumn seasonsummer seasonwinter 
  30.2916667   -4.1740196    2.9583333   -0.5643939

coef(seasonmod3)

(Intercept)     season1     season2     season3 
 29.8466466   0.4450201  -3.7289996   3.4033534

seasonmod1

coef(seasonmod1)

 (Intercept) seasonspring seasonsummer seasonwinter 
   26.117647     4.174020     7.132353     3.609626

For the first model, the ordering of the levels is alphabetical, and so “autumn” is the reference level. This means the intercept is the extraversion of autumn-babies, and each coefficient is the other months compared to autumn:

usmrdata %>%
  group_by(season) %>%
  summarise(meanEx = mean(extraversion)) %>%
  mutate(compare = meanEx - meanEx[season=="autumn"])

# A tibble: 4 × 3
  season meanEx compare
  <fct>   <dbl>   <dbl>
1 spring   30.3    4.17
2 autumn   26.1    0   
3 summer   33.2    7.13
4 winter   29.7    3.61

seasonmod2

coef(seasonmod2)

 (Intercept) seasonautumn seasonsummer seasonwinter 
  30.2916667   -4.1740196    2.9583333   -0.5643939

For the second model, we changed it so that spring is the reference level, which means spring is the intercept, and the coefficients are each level compared to spring:

usmrdata %>%
  group_by(season) %>%
  summarise(meanEx = mean(extraversion)) %>%
  mutate(compare = meanEx - meanEx[season=="spring"])

# A tibble: 4 × 3
  season meanEx compare
  <fct>   <dbl>   <dbl>
1 spring   30.3   0    
2 autumn   26.1  -4.17 
3 summer   33.2   2.96 
4 winter   29.7  -0.564

seasonmod3

coef(seasonmod3)

(Intercept)     season1     season2     season3 
 29.8466466   0.4450201  -3.7289996   3.4033534

For the final model it is a little more difficult. We’re using sum contrasts, so the intercept is going to be the ‘grand mean’ (mean of season means), which we calculated in question 16 was 29.85.

We know that the coefficients are going to be comparisons between each season and the grand mean. But (annoyingly) they no longer have useful names - the coefficients are just season1 and season2 (whatever they are).

To see it more clearly, we can look at our contrast matrix:

contrasts(usmrdata$season)

       [,1] [,2] [,3]
spring    1    0    0
autumn    0    1    0
summer    0    0    1
winter   -1   -1   -1

We can see from the 1s that our coefficients are going to be spring vs the grand mean, autumn vs the grand mean and summer vs the grand mean.

usmrdata %>%
  group_by(season) %>%
  summarise(meanEx = mean(extraversion)) %>%
  mutate(compare = meanEx - 29.85)

# A tibble: 4 × 3
  season meanEx compare
  <fct>   <dbl>   <dbl>
1 spring   30.3   0.442
2 autumn   26.1  -3.73 
3 summer   33.2   3.4  
4 winter   29.7  -0.123