Standardising Coefficients

Often it is useful to distinguish in notation between the effect we are interested in (the population parameter) and our best guess at it (the estimate). For regression coefficients, this is the notation we tend to use in this course (there are many other conventions):

Population parameter   Fitted estimate
\(b\)                  \(\hat b\)

Remember the scale of your variables!

Recall one of our models from last week, explaining wellbeing by the combination of social interactions and outdoor time:
\[ \textrm{Wellbeing} = b_0 \ + \ b_1 (\textrm{Outdoor Time}) \ + \ b_2 (\textrm{Social Interactions}) \ +\ \epsilon \] We fitted it like so:

library(tidyverse) # for read_csv(), mutate(), the pipe, and ggplot()
mwdata <- read_csv(file = "https://uoepsy.github.io/data/wellbeing.csv")
wbmodel2 <- lm(wellbeing ~ outdoor_time + social_int, data = mwdata)
  • Wellbeing: Warwick-Edinburgh Mental Wellbeing Scale (WEMWBS), a self-report measure of mental health and well-being. The scale is scored by summing responses to each item, with items answered on a 1 to 5 Likert scale. The minimum scale score is 14 and the maximum is 70.
  • Social Interactions: Self report estimated number of social interactions per week (both online and in-person)
  • Outdoor Time: Self report estimated number of hours per week spent outdoors

This influences our interpretation of the coefficients (the \(\hat b\)’s):

  • \(\hat b_0\) = the estimated mean score on the WEMWBS for someone who has zero social interactions per week and zero hours of outdoor time per week.
  • \(\hat b_1\) = for every increase of 1 hour of outdoor time per week, scores on the WEMWBS are estimated to change by \(\hat b_1\) points.
  • \(\hat b_2\) = for every increase of 1 social interaction per week, scores on the WEMWBS are estimated to change by \(\hat b_2\) points.
Question A1

The code below creates a standardised version of the outdoor_time variable, transforming the values so that the mean of the new variable is 0 and the standard deviation is 1. We can either use the scale() function to do this, or do it manually:

mwdata <- mwdata %>%
  mutate(
    outdoor_time_scaled = (outdoor_time - mean(outdoor_time))/sd(outdoor_time)
  )

Note that the shape of the distribution stays exactly the same, but the units on the x-axis are different.
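A quick sanity check (a minimal sketch) that the new variable really does have mean 0 and standard deviation 1:

mwdata %>%
  summarise(
    m = mean(outdoor_time_scaled), # should be (effectively) zero
    s = sd(outdoor_time_scaled)    # should be 1
  )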
How does the interpretation of \(\hat b_0\) and of the coefficient for outdoor_time_scaled change for the model:

lm(wellbeing ~ social_int + outdoor_time_scaled, data=mwdata)

Solution

Question A2

We can also standardise our outcome variable, the wellbeing variable. Let’s use the scale() function this time:

mwdata <- mwdata %>%
  mutate(
    wellbeing_scaled = scale(wellbeing)
  )
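A small aside (not needed for these exercises): scale() returns a one-column matrix with extra attributes rather than a plain vector, which can occasionally trip up other functions. If that happens, wrapping it in as.numeric() drops the attributes:

mwdata <- mwdata %>%
  mutate(
    wellbeing_scaled = as.numeric(scale(wellbeing)) # as.numeric() converts the matrix back to a plain vector
  )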

How will our coefficients change for the following model?

lm(wellbeing_scaled ~ social_int + outdoor_time_scaled, data = mwdata)

Solution

We can get the coefficients we would obtain if we standardised all the variables in our model by using the standardCoefs() function from the lsr package.

In some notations, the distinction between these "standardised coefficients" and the original coefficients in raw units is marked by \(\hat b\) changing to \(\hat \beta\) for the standardised coefficients1:

Notation

Population parameter   Fitted estimate   Standardised Estimate
\(b\)                  \(\hat b\)        \(\hat \beta\)
library(lsr)
standardCoefs(wbmodel2)
##                      b      beta
## social_int   1.8034489 0.6741720
## outdoor_time 0.5923673 0.3527157

The interpretation of the \(\hat \beta\)s (the "beta" column) is now:

  • \(\hat \beta_1\): The estimated change in standard deviations of wellbeing scores associated with a change of 1 standard deviation of number of hours of outdoor time per week.
  • \(\hat \beta_2\): The estimated change in standard deviations of wellbeing scores associated with a change of 1 standard deviation of number of social interactions per week.

A benefit of this is that the standardised coefficients are much more comparable with one another. Comparing \(\hat b_1\) and \(\hat b_2\) amounts to asking whether 1 hour of outdoor time has a bigger effect than 1 social interaction - a bit like comparing apples and oranges.

If instead we compare \(\hat \beta_1\) and \(\hat \beta_2\), then the comparisons we make are in terms of standard deviations of each predictor. Thus we might consider increasing the number of social interactions each week to have a greater effect on wellbeing than increasing the hours of outdoor time (because \(0.67 > 0.35\)).
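If you want to see where the beta column comes from, each standardised coefficient is just the raw coefficient rescaled by the ratio of the standard deviations of the predictor and the outcome. A minimal sketch, using wbmodel2 from above:

# beta_j = b_j * sd(x_j) / sd(y)
coef(wbmodel2)["outdoor_time"] * sd(mwdata$outdoor_time) / sd(mwdata$wellbeing)
coef(wbmodel2)["social_int"] * sd(mwdata$social_int) / sd(mwdata$wellbeing)

These should match the beta values returned by standardCoefs(wbmodel2).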

More Categorical Predictors

We talked briefly about binary predictor variables last week. A categorical variable with two levels (e.g., “Dog” vs “Cat”) can be represented in binary 0s and 1s. R does this translation for us when we put a categorical variable into a linear model, so we might have data that we would plot like the top plot in Figure 1, but our model is really doing what is in the bottom plot.
Figure 1: Binary Predictors

We can have categorical predictors in multiple regression, and not a great deal changes. Last week we visualised a regression surface, and if we have a binary predictor, instead of the continuum, we simply have two lines:
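For instance, one way to visualise this (a sketch, assuming mwdata contains the routine variable used in Question B1 below) is to colour the points by group and fit a line per group:

ggplot(mwdata, aes(x = outdoor_time, y = wellbeing, colour = routine)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # fits a separate line per group

Note that geom_smooth() fits each group's line separately (so the slopes can differ), whereas the additive model wellbeing ~ outdoor_time + routine constrains the two lines to be parallel.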

Question B1

Fit the multiple regression model below using lm(), and assign it to an object to store it in your environment. \[ \textrm{Wellbeing} = b_0 + b_1 (\textrm{Outdoor Time}) + b_2 (\textrm{HasARoutine}) + \epsilon \]

\(\hat b_0\) (the intercept) is the estimated average wellbeing score associated with zero hours of weekly outdoor time and zero in the routine variable.

To which group (routine vs no routine) does the intercept correspond?
Hint: looking at the summary() might help

Solution

Question B2

Get a pen and paper, and sketch out the plot shown in Figure 2.

Figure 2: Multiple regression model: Wellbeing ~ Outdoor Time + Routine

Annotate your plot with labels for each of parameter estimates from your model:

Parameter Estimate   Model Coefficient   Estimate
\(\hat b_0\)         (Intercept)         26.25
\(\hat b_1\)         outdoor_time        0.92
\(\hat b_2\)         routineRoutine      7.29
Hint

Solution

Question B3

Load the sjPlot package using library(sjPlot) and try running the code below (changing wbmodel3 to the name of your fitted model). You may already have the sjPlot package installed from previous exercises; if not, you will need to install it first.

library(sjPlot)
plot_model(wbmodel3)
plot_model(wbmodel3, type = "pred")
plot_model(wbmodel3, type = "pred",  terms=c("outdoor_time","routine"), show.data=TRUE)

What do you think each one is showing?

The plot_model function (and the sjPlot package) can do a lot of different things. Most packages in R come with tutorials (or “vignettes”), for instance: https://strengejacke.github.io/sjPlot/articles/plot_model_estimates.html

Solution

Categorical Predictors with \(k\) levels

We have seen how a binary categorical variable gets entered into our model as a variable of 0s and 1s (these are typically called "dummy variables").

    > Dummy variables are numeric variables that represent categorical data.

When we have a categorical explanatory variable with more than 2 levels, our model gets a bit bigger - it needs not just one, but a number of dummy variables. For a categorical variable with \(k\) levels, we can express it in \(k-1\) dummy variables.

For example, the “species” column below has three levels, and can be expressed by the two variables “species_dog” and “species_parrot”:

##   species species_dog species_parrot
## 1     cat           0              0
## 2     cat           0              0
## 3     dog           1              0
## 4  parrot           0              1
## 5     dog           1              0
## 6     cat           0              0
## 7     ...         ...            ...
  • The “cat” level is expressed whenever both the “species_dog” and “species_parrot” variables are 0.
  • The “dog” level is expressed whenever the “species_dog” variable is 1 and the “species_parrot” variable is 0.
  • The “parrot” level is expressed whenever the “species_dog” variable is 0 and the “species_parrot” variable is 1.
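You can peek at the dummy variables R creates behind the scenes using model.matrix(). A minimal sketch, with a small made-up data frame like the one above:

df <- data.frame(species = c("cat", "cat", "dog", "parrot", "dog", "cat"))
model.matrix(~ species, data = df) # columns: (Intercept), speciesdog, speciesparrot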

R will do all of this re-expression for us. If we include in our model a categorical explanatory variable with 4 different levels, the model will estimate 3 parameters - one for each dummy variable. We can interpret the parameter estimates (the coefficients we obtain using coefficients(), coef(), or summary()) as the estimated increase in the outcome variable associated with an increase of one in each dummy variable (holding all other variables equal).

                Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)        60.28       1.209    49.86  5.273e-39
speciesdog        -11.47        1.71   -6.708  3.806e-08
speciesparrot     -4.916        1.71   -2.875   0.006319

Note that in the above example, an increase of 1 in "species_dog" is the difference between a "cat" and a "dog", and an increase of 1 in "species_parrot" is the difference between a "cat" and a "parrot". We think of the "cat" category in this example as the reference level - the category against which the other categories are compared.
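By default, R chooses the reference level alphabetically (which is why "cat" is the reference here). If you want a different reference level, you can re-order the factor before fitting, e.g. with relevel(). A sketch, reusing the hypothetical df from above:

df$species <- relevel(factor(df$species), ref = "dog")
# a model fitted now would estimate speciescat and speciesparrot,
# each compared against the "dog" reference level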

If you think about it, a regression model with a categorical predictor with \(k\) levels is really just a regression model with \(k-1\) dummy variables as predictors.

Optional: Contrasts!

Interactions

We’re now going to move on to something a little more interesting, and we’re finally going to stop using that dataset about wellbeing, outdoor time and socialising etc.

So far, we’ve been talking a lot about “the effect of x” and “change in y for a 1 unit change in x” etc. But what if we think that some such “effects” are actually dependent upon some other variable? For example, we might think that the amount to which caffeine makes you stressed depends upon how old you are.

An interaction is a quantitative researcher’s way of answering the question “what is the relationship between Y and X?” with, “well, it depends….”

We’re going to look at interactions between two numeric variables, but before that we’re going to look at an interaction between a numeric variable and a categorical variable.

~ Numeric * Categorical

Data: cogapoe4.csv

Ninety adult participants were recruited for a study investigating how cognitive functioning varies by age, and whether this is different depending on whether people carry an APOE-4 gene.

variables        description
pid              Participant ID
age              Age (in Years)
educ             Years of Education
birthweight_kg   Birthweight (in kg)
apoe4            APOE-4 Gene expansion ("none", "apoe4a", "apoe4b", "apoe4c")
acer             Score on Addenbrooke's Cognitive Examination

Download Link

The data are available at https://uoepsy.github.io/data/cogapoe4.csv.

Question C1

Research Question: Does the relationship between age and cognitive function differ between those with and without the APOE-4 genotype?

Read in the data and explore the variables which you think you will use to answer this research question (create some plots, some descriptive stats etc.)

Some tips:

  • The pairs.panels() function from the psych package is a nice way to plot a scatterplot matrix of a dataset.
  • The describe() function (also from the psych package) is a quick way to get descriptive statistics.

Solution

Question C2

Check the apoe4 variable. It currently has four levels ("none"/"apoe4a"/"apoe4b"/"apoe4c"), but the research question is actually interested in two ("none" vs "apoe4"). We'll need to fix this. One way to do this would be to use ifelse() to define a variable which takes one value (e.g., "NO") if the observation meets some condition, or another value (e.g., "YES") if it does not. Type ?ifelse in the console if you want to see the help page. You can use it to add a new variable either inside mutate(), or using the data$new_variable_name <- ifelse(test, x, y) syntax. We saw it briefly in the previous set of exercises, and a generic example is sketched below.
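For instance, with the wellbeing data from earlier (a generic sketch, not the answer to this question - the 10-hour cut-off is made up purely for illustration):

mwdata <- mwdata %>%
  mutate(
    lots_outdoors = ifelse(outdoor_time > 10, "YES", "NO") # "YES" when the condition is TRUE, "NO" otherwise
  )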

Solution

Question C3

Produce a visualisation of the relationship between age and cognitive functioning, with separate facets for people with and without the APOE4 gene.

Hint: remember facet_wrap()?

Solution

Interactions in linear models

Specifying an interaction in a regression model simply involves including the product of the two relevant predictor variables in your set of predictors. So if we want to examine how the effect of \(x\) on \(y\) depends on \(z\), we want to estimate a parameter \(b\) such that our outcome is predicted by \(b(x \times z)\). However, we also need to include \(x\) and \(z\) themselves: \(y = b_0 + b_1(x) + b_2(z) + b_3(x \times z)\).

“Except in special circumstances, a model including a product term for interaction between two explanatory variables should also include terms with each of the explanatory variables individually, even though their coefficients may not be significantly different from zero. Following this rule avoids the logical inconsistency of saying that the effect of \(X_1\) depends on the level of \(X_2\) but that there is no effect of \(X_1\).”
Ramsey and Schafer (2012)

Question C4

To address the research question, we are going to fit the following model:
\[ \text{ACE-R} = b_0 + b_1(\text{Age}) + b_2(\text{isAPOE4}) + b_3 (\text{Age} \times \text{isAPOE4}) + \epsilon \\ \quad \\ \text{where} \quad \epsilon \sim N(0, \sigma) \text{ independently} \]

Fit the model using lm().

Tip:
When fitting a regression model in R with two explanatory variables A and B, and their interaction, these two are equivalent (a quick check is sketched below):

  • y ~ A + B + A:B
  • y ~ A*B
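A quick check that these really are equivalent, using the wellbeing data from earlier (a sketch, assuming mwdata with the routine variable from Question B1):

m1 <- lm(wellbeing ~ outdoor_time + routine + outdoor_time:routine, data = mwdata)
m2 <- lm(wellbeing ~ outdoor_time * routine, data = mwdata)
all.equal(coef(m1), coef(m2)) # TRUE - identical coefficients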

Solution

Interpreting coefficients for A and B in the presence of an interaction A:B

When you include an interaction between \(x_1\) and \(x_2\) in a regression model, you are estimating the extent to which the effect of \(x_1\) on \(y\) is different across the values of \(x_2\).

What this means is that the effect of \(x_1\) on \(y\) depends on/is conditional upon the value of \(x_2\).
(and vice versa, the effect of \(x_2\) on \(y\) is different across the values of \(x_1\)).
This means that we can no longer talk about the “effect of \(x_1\) holding \(x_2\) constant.” Instead we can talk about a conditional effect of \(x_1\) on \(y\) at a specific value of \(x_2\).

When we fit the model \(y = b_0 + b_1(x_1)+ b_2(x_2) + b_3(x_1 \times x_2) + \epsilon\) using lm():

  • the parameter estimate \(\hat b_1\) is the conditional effect of \(x_1\) on \(y\) where \(x_2 = 0\)
  • the parameter estimate \(\hat b_2\) is the conditional effect of \(x_2\) on \(y\) where \(x_1 = 0\)

side note: Regardless of whether or not there is an interaction term in our model, all parameter estimates in multiple regression are “conditional” in the sense that they are dependent upon the inclusion of other variables in the model. For instance, in \(y = b_0 + b_1(x_1) + b_2(x_2) + \epsilon\) the coefficient \(\hat b_1\) is conditional upon holding \(x_2\) constant. The difference with an interaction is that they are conditional upon \(x_2\) being at some specific value.

Interpreting the interaction term A:B

The coefficient for an interaction term can be thought of as providing an adjustment to the slope.

In our model: \(\text{ACE-R} = b_0 + b_1(\text{Age}) + b_2(\text{isAPOE4}) + b_3 (\text{Age} \times \text{isAPOE4}) + \epsilon\), we have a numeric*categorical interaction.
The estimate \(\hat b_3\) is the adjustment to the slope \(\hat b_1\) to be made for the individuals in the \(\text{isAPOE4}=1\) group.
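To see why, substitute the two values of \(\text{isAPOE4}\) into the model (ignoring the error term):
\[
\begin{aligned}
\text{isAPOE4} = 0: \quad \text{ACE-R} &= b_0 + b_1(\text{Age}) \\
\text{isAPOE4} = 1: \quad \text{ACE-R} &= (b_0 + b_2) + (b_1 + b_3)(\text{Age})
\end{aligned}
\]
So the line for the APOE-4 group has its intercept shifted by \(b_2\) and its slope shifted by \(b_3\).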

“The best method of communicating findings about the presence of significant interaction may be to present a table or graph of the estimated means at various combinations of the interacting variables.”
Ramsey and Schafer (2012)

Question C5

Look at the parameter estimates from your model, and write a description of what each one corresponds to on the plot shown in Figure 3 (it may help to sketch out the plot yourself and annotate it).

Figure 3: Multiple regression model: ACER ~ Age * isAPOE4. The dotted lines show the extension back to where the x-axis is zero.

Hints

Solution

Question C6

Make sure the sjPlot package is loaded and try using the function plot_model().
The default behaviour of plot_model() is to plot the parameter estimates and their confidence intervals; this is what type = "est" (the default) gives you. Try to create a plot like Figure 3, which shows the two lines. (Hint: what is this block of exercises all about? type = ???)

Solution

~ Numeric * Numeric

We will now look at a multiple regression model with an interaction between two numeric explanatory variables. For these exercises we'll move on to a different dataset.

Data: scs_study.csv

Data from 656 participants containing information on scores on each trait of a Big 5 personality measure, their perception of their own social rank, and their scores on a measure of depression.

The data in scs_study.csv contain seven attributes collected from a random sample of \(n=656\) participants:

  • zo: Openness (Z-scored), measured on the Big-5 Aspects Scale (BFAS)
  • zc: Conscientiousness (Z-scored), measured on the Big-5 Aspects Scale (BFAS)
  • ze: Extraversion (Z-scored), measured on the Big-5 Aspects Scale (BFAS)
  • za: Agreeableness (Z-scored), measured on the Big-5 Aspects Scale (BFAS)
  • zn: Neuroticism (Z-scored), measured on the Big-5 Aspects Scale (BFAS)
  • scs: Social Comparison Scale - An 11-item scale that measures an individual’s perception of their social rank, attractiveness and belonging relative to others. The scale is scored as a sum of the 11 items (each measured on a 5-point scale), with higher scores indicating more favourable perceptions of social rank.
  • dass: Depression Anxiety and Stress Scale - The DASS-21 includes 21 items, each measured on a 4-point scale. The score is derived from the sum of all 21 items, with higher scores indicating a higher severity of symptoms.

Download link

The data is available at https://uoepsy.github.io/data/scs_study.csv

Refresher: Z-scores

When we standardise a variable, we re-express each value as the distance from the mean in units of standard deviations. These transformed values are called z-scores.

To transform a given value \(x_i\) into a z-score \(z_i\), we simply calculate the distance from \(x_i\) to the mean, \(\bar{x}\), and divide this by the standard deviation, \(s\):
\[ z_i = \frac{x_i - \bar{x}}{s} \]

A Z-score of a value is the number of standard deviations below/above the mean that the value falls.

Question C7

Research question
Previous research has identified an association between an individual’s perception of their social rank and symptoms of depression, anxiety and stress. We are interested in the individual differences in this relationship.
Specifically: Does the effect of social comparison on symptoms of depression, anxiety and stress vary depending on level of neuroticism?

Read in the Social Comparison Study data and explore the relevant distributions and relationships between the variables of interest to the research question.

Solution

Question C8

Specify the model you plan to fit in order to answer the research question (e.g., \(\text{??} = b_0 + b_1 (\text{??}) + .... + \epsilon\))

Solution

Question C9
We named the data scs_study in our environment, but you will likely have named yours something different. Edit the code below accordingly to run it on your data.
The code takes the dataset, and uses the cut() function to add a new variable called "zn_group", which is the "zn" variable split into 4 groups.

Remember: we have to re-assign this output to the name of the dataset (the scs_study <- bit at the beginning) to make these changes occur in our environment (the top-right window of RStudio). If we didn't have the first line, the code would simply print the output.

scs_study <-
  scs_study %>%
  mutate(
    zn_group = cut(zn, 4)
  )

We can see how it has split the “zn” variable by plotting the two against one another:
(Note that the levels of the new variable are named according to the cut-points).

ggplot(data = scs_study, aes(x = zn_group, y = zn)) + 
  geom_point()

Question C10

Plot the relationship between scores on the SCS and scores on the DASS-21, for each group of the variable (zn_group) that we just created.
How does the pattern change? Does it suggest an interaction?

Tip: Rather than creating four separate plots, you might want to map some feature of the plot to the variable we created in the data, or make use of facet_wrap()/facet_grid().

Solution

Cutting one of the explanatory variables up into groups essentially turns a numeric variable into a categorical one. We did this just to make it easier to visualise how a relationship changes across the values of another variable, because we can imagine a separate line for the relationship between SCS and DASS-21 scores for each of the groups of neuroticism. However, in grouping a numeric variable like this we lose information. Neuroticism is measured on a continuous scale, and we want to capture how the relationship between SCS and DASS-21 changes across that continuum (rather than cutting it into chunks).
We could imagine cutting it into more and more chunks (see Figure 4), until what we end up with is an infinite number of lines - i.e., a three-dimensional plane/surface (recall that for a multiple regression model with 2 explanatory variables, we can think of the model as having three dimensions). The inclusion of the interaction term simply results in this surface no longer being necessarily flat. You can see this in Figure 5.

Figure 4: Separate regression lines DASS ~ SCS for neuroticism when cut into 4 (left), 6 (centre), or 12 (right) groups

Figure 5: 3D plot of regression surface with interaction.

Question C11

Fit your model using lm().

Solution

Question C12
             Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)     60.81       2.454    24.78  5.005e-96
scs           -0.4439     0.06834   -6.495  1.643e-10
zn              20.13        2.36    8.531  1.017e-16
scs:zn        -0.5186     0.06552   -7.915  1.063e-14

Recall that the coefficients zn and scs from our model now reflect the estimated change in the outcome associated with an increase of 1 in the explanatory variables, when the other variable is zero.

Think - what is 0 in each variable? what is an increase of 1? Are these meaningful? Would you suggest recentering either variable?

Solution

Question C13

Recenter one or both of your explanatory variables to ensure that 0 is a meaningful value.

Solution

Question C14

We’ll now re-fit the model using mean-centered SCS scores instead of the original variable. Here are the parameter estimates:

dass_mdl2 <- lm(dass ~ 1 + scs_mc * zn, data = scs_study)

# pull out the coefficients from the summary():
summary(dass_mdl2)$coefficients
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 44.9324476 0.24052861 186.807079 0.000000e+00
## scs_mc      -0.4439065 0.06834135  -6.495431 1.643265e-10
## zn           1.5797687 0.24086372   6.558766 1.105118e-10
## scs_mc:zn   -0.5186142 0.06552100  -7.915236 1.063297e-14
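(In case you skipped ahead: scs_mc is the mean-centred SCS variable. A minimal sketch of how it might have been created, assuming simple mean-centring as in Question C13:)

scs_study <- scs_study %>%
  mutate(
    scs_mc = scs - mean(scs) # mean-centred SCS: 0 now represents the average SCS score
  )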

Fill in the blanks in the statements below.

  • For those of average neuroticism and who score average on the SCS, the estimated DASS-21 Score is ???
  • For those who score ??? on the SCS, an increase of ??? in neuroticism is associated with a change of 1.58 in DASS-21 Scores
  • For those of average neuroticism, an increase of ??? on the SCS is associated with a change of -0.44 in DASS-21 Scores
  • For every increase of ??? in neuroticism, the change in DASS-21 associated with an increase of ??? on the SCS is adjusted by ???
  • For every increase of ??? in SCS, the change in DASS-21 associated with an increase of ??? in neuroticism is adjusted by ???

Solution

Question C15

What do we get when we use the plot_model() function from sjPlot to plot this continuous*continuous interaction?

Solution




References

Ramsey, Fred, and Daniel Schafer. 2012. The Statistical Sleuth: A Course in Methods of Data Analysis. Cengage Learning.

  1. A common alternative is to use \(\beta\) for the population parameter, \(\hat \beta\) for the fitted estimate, and something like \(\hat \beta^*\) for the standardised estimate.↩︎


This workbook was written by Josiah King, Umberto Noe, and Martin Corley, and is licensed under a Creative Commons Attribution 4.0 International License.