In last week’s exercises, we extended the simple linear regression model to multiple regression (one outcome variable, multiple explanatory variables). Specifically, we looked at the case of when we had multiple numeric (or ‘continuous’) explanatory variables.
We’re now going to start by taking a look at a model with a mix of numeric and categorical explanatory variables.
Suppose that the group of researchers instead wanted to study the relationship between well-being and time spent outdoors after taking into account the relationship between well-being and having a routine.
Note that in the dataset, ‘having a routine’ is categorical (participants either do, or don’t have a routine). So let’s go over the modelling process again, but paying careful attention to where and how the process differs when we have a categorical predictor.
We have already visualised the marginal distribution of weekly outdoor time in questions last week, as well as its relationship with wellbeing scores.
Read in the wellbeing data (https://uoepsy.github.io/data/wellbeing.csv) and produce visualisations of:
the distribution of the variable routine
the relationship between routine and wellbeing.
Note: We cannot visualise the distribution of routine as a density curve or boxplot, because it is a categorical variable (observations can only take one of a set of discrete response values).
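One possible approach is sketched below. It assumes the tidyverse is loaded and that the data are read into an object called mwdata (the object name is an assumption; use whatever name you have been using for these data). A bar plot works for the categorical routine variable, and a boxplot shows its relationship with wellbeing.
library(tidyverse)
mwdata <- read_csv("https://uoepsy.github.io/data/wellbeing.csv")
# distribution of the categorical routine variable:
ggplot(data = mwdata, aes(x = routine)) +
  geom_bar()
# relationship between routine and wellbeing:
ggplot(data = mwdata, aes(x = routine, y = wellbeing)) +
  geom_boxplot()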
Categorical predictors
When we have a categorical explanatory variable such as this one, what actually gets entered into our model is a set of dummy variables: numeric variables that represent the categorical data.
For a categorical variable with \(k\) levels, we can express it in \(k-1\) dummy variables.
For example, the “species” column below has three levels, and can be expressed by the two variables “species_dog” and “species_parrot”:
## species species_dog species_parrot
## 1 cat 0 0
## 2 cat 0 0
## 3 dog 1 0
## 4 parrot 0 1
## 5 dog 1 0
## 6 cat 0 0
## 7 ... ... ...
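You can see the dummy coding R will use via the model.matrix() function. A minimal sketch (the pets data frame here is a made-up illustration, not part of the wellbeing data):
# a small toy data frame with one categorical variable:
pets <- data.frame(species = c("cat", "cat", "dog", "parrot", "dog", "cat"))
# the design matrix shows the k-1 dummy variables R creates
# (plus the intercept column); "cat" is the reference level:
model.matrix(~ species, data = pets)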
R will do all of this re-expression for us. If we include in our model a categorical explanatory variable with 4 different levels, the model will estimate 3 parameters - one for each dummy variable. We can interpret the parameter estimates (the coefficients we obtain using coefficients(), coef() or summary()) as the estimated increase in the outcome variable associated with an increase of one in each dummy variable (holding all other variables equal).
Note that in the above example, an increase in 1 of "species_dog" is the difference between a "cat" and a "dog". An increase in 1 of "species_parrot" is the difference between a "cat" and a "parrot". We think of the "cat" category in this example as the reference level - it is the category against which the other categories are compared.
Fit the multiple regression model below using lm(), and assign it to an object named wb_mdl2. Examine the summary output of the model.
\[ Wellbeing = \beta_0 + \beta_1 \cdot OutdoorTime + \beta_2 \cdot Routine + \epsilon \]
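A sketch of the corresponding lm() call is below, assuming the data object is named mwdata (an assumption) and the variables are named outdoor_time and routine:
wb_mdl2 <- lm(wellbeing ~ outdoor_time + routine, data = mwdata)
summary(wb_mdl2)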
\(\hat \beta_0\) (the intercept) is the estimated average wellbeing score associated with zero hours of weekly outdoor time and zero in the routine variable.
What group is the intercept the estimated wellbeing score for when they have zero hours of outdoor time? Why (think about what zero in the routine variable means)?
In last week’s example, we had a visualisation of our model as a regression surface.
Here, one of our explanatory variables has only two possible responses. How might we best visualise the model?
Get a pen and paper, and sketch out the plot shown in Figure 3.
Annotate your plot with labels for each of parameter estimates from your model:
Parameter Estimate | Model Coefficient | Estimate |
---|---|---|
\(\hat \beta_0\) | (Intercept) | 26.25 |
\(\hat \beta_1\) | outdoor_time | 0.92 |
\(\hat \beta_2\) | routineRoutine | 7.29 |
Load the sjPlot package using library(sjPlot)
and try running the code below. (You may already have the sjPlot package installed from previous exercises. If not, you will need to install it first).
library(sjPlot)
plot_model(wb_mdl2)
plot_model(wb_mdl2, type = "pred")
plot_model(wb_mdl2, type = "pred", terms=c("outdoor_time","routine"), show.data=TRUE)
What do you think each one is showing?
The plot_model
function (and the sjPlot package) can do a lot of different things. Most packages in R come with tutorials (or “vignettes”), for instance: https://strengejacke.github.io/sjPlot/articles/plot_model_estimates.html
An interaction is a quantitative researcher’s way of answering the question “what is the relationship between Y and X?” with, “well, it depends…”.
Researchers have become interested in how the number of social interactions might influence mental health and wellbeing differently for those living in rural communities compared to those in cities and suburbs. They want to assess whether the effect of social interactions on wellbeing is moderated by (depends upon) whether or not a person lives in a rural area.
For the data we have been working with above (from https://uoepsy.github.io/data/wellbeing.csv), count the number of respondents in each location (City/Suburb/Rural).
Open-ended: Do you think there is enough data to answer this question?
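One quick way to get the counts asked for above, assuming the data object is named mwdata (the name is an assumption) and the variable is called location:
mwdata %>%
  count(location)
# or, equivalently, in base R:
table(mwdata$location)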
Research Question: Does the relationship between number of social interactions and mental wellbeing differ between rural and non-rural residents?
To investigate how the relationship between the number of social interactions and mental wellbeing might be different for those living in rural communities, the researchers conduct a new study, collecting data from 200 randomly selected residents of the Edinburgh & Lothian postcodes.
Specify a multiple regression model to answer the research question.
Read in the data, and assign it the name mwdata2. Then fully explore the variables and relationships which are going to be used in your analysis.
“Except in special circumstances, a model including a product term for interaction between two explanatory variables should also include terms with each of the explanatory variables individually, even though their coefficients may not be significantly different from zero. Following this rule avoids the logical inconsistency of saying that the effect of \(X_1\) depends on the level of \(X_2\) but that there is no effect of \(X_1\).”
Ramsey and Schafer (2012)
Hints:
- Install the psych package (remember to use the console, not your script, to install packages), and then load it (load it in your script/Rmarkdown). The pairs.panels() function from this package will plot all variables in a dataset against one another. This will save you the time you would have spent creating individual plots. A sketch combining these hints follows below.
- You can use ifelse() to define a variable which takes one value ("Rural") if the observation meets some condition, or another value ("Not Rural") if it does not. Type ?ifelse in the console if you want to see the help function. You can use it to add a new variable either inside mutate(), or using data$new_variable_name <- ifelse(test, x, y) syntax.
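A sketch putting these hints together. The data URL matches the one used later in these exercises; the choice of variables passed to pairs.panels() is illustrative (pick whichever variables you plan to use in your model), and only numeric variables are included there for simplicity:
library(tidyverse)
library(psych)
mwdata2 <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")
# create a binary rural/not-rural variable:
mwdata2 <- mwdata2 %>%
  mutate(
    isRural = ifelse(location == "rural", "Rural", "Not Rural")
  )
# pairwise plots of (numeric) variables of interest:
pairs.panels(mwdata2 %>% select(wellbeing, social_int, outdoor_time))
# categorical variables such as isRural can be explored separately,
# e.g. with a count or a boxplot of wellbeing by group.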
Produce a visualisation of the relationship between weekly number of social interactions and well-being, with separate facets for rural vs non-rural respondents.
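One possible sketch using facets, assuming mwdata2 contains variables named social_int and wellbeing, and the isRural variable created above:
ggplot(data = mwdata2, aes(x = social_int, y = wellbeing)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ isRural) +
  labs(x = "Number of social interactions per week", y = "Wellbeing score")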
Fit your model using lm().
Hint: When fitting a regression model in R with two explanatory variables A and B, and their interaction, these two are equivalent:
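The hint refers to the two ways of writing an interaction in an R model formula. Using the variable names assumed above (and wb_mdl3 as a purely illustrative object name), these two calls fit the same model:
# specifying the main effects and the interaction explicitly:
wb_mdl3 <- lm(wellbeing ~ social_int + isRural + social_int:isRural, data = mwdata2)
# the * shorthand expands to the same thing (main effects plus interaction):
wb_mdl3 <- lm(wellbeing ~ social_int * isRural, data = mwdata2)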
Interpreting coefficients for A and B in the presence of an interaction A:B
When you include an interaction between \(x_1\) and \(x_2\) in a regression model, you are estimating the extent to which the effect of \(x_1\) on \(y\) is different across the values of \(x_2\).
What this means is that the effect of \(x_1\) on \(y\) depends on/is conditional upon the value of \(x_2\).
(and vice versa, the effect of \(x_2\) on \(y\) is different across the values of \(x_1\)).
This means that we can no longer talk about the “effect of \(x_1\) holding \(x_2\) constant”. Instead we can talk about a marginal effect of \(x_1\) on \(y\) at a specific value of \(x_2\).
When we fit the model \(y = \beta_0 + \beta_1\cdot x_1 + \beta_2 \cdot x_2 + \beta_3 \cdot x_1 \cdot x_2 + \epsilon\) using lm(), the estimate \(\hat \beta_1\) is the association between \(x_1\) and \(y\) when \(x_2 = 0\), the estimate \(\hat \beta_2\) is the association between \(x_2\) and \(y\) when \(x_1 = 0\), and \(\hat \beta_3\) is the amount by which the association between \(x_1\) and \(y\) changes with each unit increase in \(x_2\) (and vice versa).
side note: Regardless of whether or not there is an interaction term in our model, all parameter estimates in multiple regression are “conditional” in the sense that they are dependent upon the inclusion of other variables in the model. For instance, in \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\) the coefficient \(\hat \beta_1\) is conditional upon holding \(x_2\) constant.
Interpreting the interaction term A:B
The coefficient for an interaction term can be thought of as providing an adjustment to the slope.
In our model: \(\text{wellbeing} = \beta_0 + \beta_1\cdot\text{social-interactions} + \beta_2\cdot\text{isRural} + \beta_3\cdot\text{social-interactions}\cdot\text{isRural} + \epsilon\), we have a numeric*categorical interaction.
The estimate \(\hat \beta_3\) is the adjustment to the slope \(\hat \beta_1\) to be made for the individuals in the \(\text{isRural}=1\) group.
Look at the parameter estimates from your model, and write a description of what each one corresponds to on the plot shown in Figure 4 (it may help to sketch out the plot yourself and annotate it).
“The best method of communicating findings about the presence of significant interaction may be to present a table or graph of the estimated means at various combinations of the interacting variables.”
Ramsey and Schafer (2012)
Make sure the sjPlot package is loaded and try using the function plot_model().
The default behaviour of plot_model() is to plot the parameter estimates and their confidence intervals. This is what the option type = "est" gives you.
Try to create a plot like Figure 4, which shows the two lines (Hint: what is this block of exercises all about? type = ???).
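As a sketch of one route, the same type = "pred" approach used earlier can produce the two lines if you pass both the numeric predictor and the grouping variable to terms (wb_mdl3 is the assumed name of your interaction model from above):
plot_model(wb_mdl3, type = "pred", terms = c("social_int", "isRural"), show.data = TRUE)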
We will now look at a multiple regression model with an interaction between two numeric explanatory variables.
Research question
Previous research has identified an association between an individual’s perception of their social rank and symptoms of depression, anxiety and stress. We are interested in the individual differences in this relationship.
Specifically:
- Does the effect of social comparison on symptoms of depression, anxiety and stress vary depending on level of neuroticism?
Refresher: Z-scores
When we standardise a variable, we re-express each value as the distance from the mean in units of standard deviations. These transformed values are called z-scores.
To transform a given value \(x_i\) into a z-score \(z_i\), we simply calculate the distance from \(x_i\) to the mean, \(\bar{x}\), and divide this by the standard deviation, \(s\):
\[
z_i = \frac{x_i - \bar{x}}{s}
\]
A Z-score of a value is the number of standard deviations below/above the mean that the value falls.
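In R this can be done manually, or with the scale() function. A minimal sketch with a made-up vector x:
x <- c(2, 4, 4, 6, 9)
# manual z-scores:
(x - mean(x)) / sd(x)
# scale() does the same (it returns a matrix, hence as.numeric()):
as.numeric(scale(x))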
Specify the model you plan to fit in order to answer the research question (e.g., \(\text{??} = \beta_0 + \beta_1 \cdot \text{??} + .... + \epsilon\))
Read in the data and assign it the name scs_study. Produce plots of the relevant distributions and relationships involved in the analysis.
Use the cut() function to add a new variable called "zn_group", which is the "zn" variable split into 4 groups.
Remember: we have to re-assign this output as the name of the dataset (the scs_study <- bit at the beginning) to make these changes occur in our environment (the top-right window of RStudio). If we didn't have the first line, then it would simply print the output.
scs_study <-
scs_study %>%
mutate(
zn_group = cut(zn, 4)
)
We can see how it has split the “zn” variable by plotting the two against one another:
(Note that the levels of the new variable are named according to the cut-points).
ggplot(data = scs_study, aes(x = zn_group, y = zn)) +
geom_point()
Plot the relationship between scores on the SCS and scores on the DASS-21, for each group of the variable we just created.
How does the pattern change? Does it suggest an interaction?
Tip: Rather than creating four separate plots, you might want to map some feature of the plot to the variable we created in the data, or make use of facet_wrap()/facet_grid().
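One way to do this, assuming the scores variables in scs_study are named scs and dass (names taken from the model output shown further down), and using the zn_group variable created above:
ggplot(data = scs_study, aes(x = scs, y = dass)) +
  geom_point() +
  facet_wrap(~ zn_group) +
  labs(x = "Social comparison (SCS) score", y = "DASS-21 score")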
Cutting one of the explanatory variables up into groups essentially turns a numeric variable into a categorical one. We did this just to make it easier to visualise how a relationship changes across the values of another variable, because we can imagine a separate line for the relationship between SCS and DASS-21 scores for each of the groups of neuroticism. However, in grouping a numeric variable like this we lose information. Neuroticism is measured on a continuous scale, and we want to capture how the relationship between SCS and DASS-21 changes across that continuum (rather than cutting it into chunks).
We could imagine cutting it into more and more chunks (see Figure 5), until what we end up with is an infinite number of lines - i.e., a three-dimensional plane/surface (recall that for a multiple regression model with 2 explanatory variables, we can think of the model as having three dimensions). The inclusion of the interaction term simply results in this surface no longer being necessarily flat. You can see this in Figure 6.
Fit your model using lm().
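A sketch of the fit (the object name dass_mdl1 is an assumption; the variable names scs, zn and dass match the output shown below):
dass_mdl1 <- lm(dass ~ 1 + scs * zn, data = scs_study)
summary(dass_mdl1)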
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 60.8088736 | 2.4539924 | 24.779569 | 0 |
scs | -0.4439065 | 0.0683414 | -6.495431 | 0 |
zn | 20.1281319 | 2.3595133 | 8.530629 | 0 |
scs:zn | -0.5186142 | 0.0655210 | -7.915236 | 0 |
Recall that the coefficients zn and scs from our model now reflect the estimated change in the outcome associated with an increase of 1 in the explanatory variables, when the other variable is zero.
Think - what is 0 in each variable? what is an increase of 1? Are these meaningful? Would you suggest recentering either variable?
Recenter one or both of your explanatory variables to ensure that 0 is a meaningful value
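For example, mean-centering the SCS scores (this creates the scs_mc variable used in the re-fitted model below):
scs_study <-
  scs_study %>%
  mutate(
    scs_mc = scs - mean(scs)
  )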
We re-fit the model using mean-centered SCS scores instead of the original variable. Here are the parameter estimates:
dass_mdl2 <- lm(dass ~ 1 + scs_mc * zn, data = scs_study)
# pull out the coefficients from the summary():
summary(dass_mdl2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.9324476 0.24052861 186.807079 0.000000e+00
## scs_mc -0.4439065 0.06834135 -6.495431 1.643265e-10
## zn 1.5797687 0.24086372 6.558766 1.105118e-10
## scs_mc:zn -0.5186142 0.06552100 -7.915236 1.063297e-14
Fill in the blanks in the statements below.
Create a new section header in your Rmarkdown document, as we are moving onto a refresher on checking assumptions.
Check that the assumptions of your model are met (you can use the plot() function on your model to get a series of diagnostic plots).
Tips:
- Because our model contains an interaction term, crPlots() will not work. To assess this, you can create a residuals-vs-fitted plot like we saw in the previous week.
- If you want to re-fit a model excluding certain observations (e.g., rows 3 and 5), you can use syntax like lm(y ~ x1 + x2, data = dataset[-c(3,5),]). Be careful to remember that these values remain in the dataset; they have simply been excluded from the model fit.
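A quick sketch of the basic diagnostic plots, using the dass_mdl2 object from above:
# arrange the four standard diagnostic plots in a 2x2 grid:
par(mfrow = c(2, 2))
plot(dass_mdl2)
par(mfrow = c(1, 1))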
We know from our work on simple linear regression that the R-squared can be obtained as: \[ R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}} \]
However, when we add more and more predictors into a multiple regression model, \(SS_{Residual}\) cannot increase, and may decrease by pure chance alone, even if the predictors are unrelated to the outcome variable. Because \(SS_{Total}\) is constant, the calculation \(1-\frac{SS_{Residual}}{SS_{Total}}\) will increase by chance alone.
An alternative, the Adjusted-\(R^2\), does not necessarily increase with the addition of more explanatory variables, by including a penalty according to the number of explanatory variables in the model. It is not by itself meaningful, but can be useful in determining what predictors to include in a model. \[ Adjusted{-}R^2=1-\frac{(1-R^2)(n-1)}{n-k-1} \\ \quad \\ \begin{align} & \text{Where:} \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ \end{align} \]
In R, you can view the multiple and adjusted \(R^2\) at the bottom of the output of summary(<modelname>):
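If you prefer to extract these values rather than read them from the printed output, they are stored in the summary object (wb_mdl2 is used here just as an example model):
summary(wb_mdl2)$r.squared      # multiple R-squared
summary(wb_mdl2)$adj.r.squared  # adjusted R-squared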
As in simple linear regression, the F-ratio is used to test the null hypothesis that all regression slopes are zero.
It is called the F-ratio because it is the ratio of how much of the variation is explained by the model (per parameter) versus how much of the variation is unexplained (per remaining degree of freedom).
\[ F_{df_{model},df_{residual}} = \frac{MS_{Model}}{MS_{Residual}} = \frac{SS_{Model}/df_{Model}}{SS_{Residual}/df_{Residual}} \\ \quad \\ \begin{align} & \text{Where:} \\ & df_{model} = k \\ & df_{error} = n-k-1 \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ \end{align} \]
In R, at the bottom of the output of summary(<modelname>), you can view the F ratio, along with a hypothesis test against the alternative hypothesis that at least one of the coefficients \(\neq 0\) (under the null hypothesis that all coefficients = 0, the ratio of explained:unexplained variance should be approximately 1):
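The F statistic is also stored in the summary object, and the p-value reported at the bottom of the summary can be recovered from it (again using wb_mdl2 purely as an example model):
fs <- summary(wb_mdl2)$fstatistic  # F value, numerator df, denominator df
fs
# the corresponding p-value:
pf(fs[1], fs[2], fs[3], lower.tail = FALSE)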
Run the code below. It reads in the wellbeing/rurality study data, and creates a new binary variable which specifies whether or not each participant lives in a rural location.
library(tidyverse)
mwdata2<-read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")
mwdata2 <-
mwdata2 %>% mutate(
isRural = ifelse(location=="rural","rural","notrural")
)
Fit the following model, and assign it the name “wb_mdl1”.
Does the model provide a better fit to the data than a model with no explanatory variables? (i.e., test against the alternative hypothesis that at least one of the explanatory variables significantly predicts wellbeing scores).
If (and only if) two models are nested (one model contains all the predictors of the other and is fitted to the same data), we can compare them using an incremental F-test.
This is a formal test of whether the additional predictors provide a better fitting model.
Formally, this is a test of the null hypothesis that the coefficients of all the additional predictors in the full model are zero, against the alternative that at least one of them is not.
The F-ratio for comparing the residual sums of squares between two models can be calculated as:
\[ F_{(df_R-df_F),df_F} = \frac{(SSR_R-SSR_F)/(df_R-df_F)}{SSR_F / df_F} \\ \quad \\ \begin{align} & \text{Where:} \\ & SSR_R = \text{residual sums of squares for the restricted model} \\ & SSR_F = \text{residual sums of squares for the full model} \\ & df_R = \text{residual degrees of freedom from the restricted model} \\ & df_F = \text{residual degrees of freedom from the full model} \\ \end{align} \]
In R, we can conduct an incremental F-test by constructing two models, and passing them to the anova() function: anova(model1, model2).
The F-ratio you see at the bottom of summary(model) is actually a comparison between two models: your model (with some explanatory variables in predicting \(y\)) and the null model. In regression, the null model can be thought of as the model in which all explanatory variables have zero regression coefficients. It is also referred to as the intercept-only model, because if all predictor variable coefficients are zero, then we are estimating \(y\) via an intercept only (which will be the mean, \(\bar y\)).
Use the code below to fit the null model. Then, use the anova() function to perform a model comparison between your earlier model (wb_mdl1) and the null model. Check that the F statistic is the same as that which is given at the bottom of summary(wb_mdl1).
null_model <- lm(wellbeing ~ 1, data = mwdata2)
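The comparison itself is then a single call (wb_mdl1 being the model you fitted earlier):
anova(null_model, wb_mdl1)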
Does weekly outdoor time explain a significant amount of variance in wellbeing scores over and above weekly social interactions and location (rural vs not-rural)?
Provide an answer to this question by fitting and comparing two models (one of them you may already have fitted in an earlier question).
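A sketch of one way to set this up, assuming mwdata2 contains social_int, isRural and outdoor_time (variable names taken from elsewhere on this page); the model object names are illustrative:
# restricted model: no outdoor time
wb_mdl_r <- lm(wellbeing ~ social_int + isRural, data = mwdata2)
# full model: adds weekly outdoor time
wb_mdl_f <- lm(wellbeing ~ social_int + isRural + outdoor_time, data = mwdata2)
# incremental F-test:
anova(wb_mdl_r, wb_mdl_f)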
A common goal for researchers is to determine which variables matter (and which do not) in contributing to some outcome variable. A common approach to answer such questions is to consider whether some variable \(X\)’s contribution remains significant after controlling for variables \(Z\).
The reasoning:
In multiple regression, we might fit the model \(Y = \beta_0 + \beta_1 \cdot X + \beta_2 \cdot Z + \epsilon\) and conclude that \(X\) is a useful predictor of \(Y\) over and above \(Z\) based on the estimate \(\hat \beta_1\), or via model comparison between that model and the model without \(Z\) as a predictor (\(Y = \beta_0 + \beta_1 \cdot X + \epsilon\)).
A Toy Example
Suppose we have monthly data over a seven year period which captures the number of shark attacks on swimmers each month, and the number of ice-creams sold by beach vendors each month.
Consider the relationship between the two:
We can fit the linear model and see a significant relationship between ice cream sales and shark attacks:
sharkdata <- read_csv("https://uoepsy.github.io/data/sharks.csv")
shark_mdl <- lm(shark_attacks ~ ice_cream_sales, data = sharkdata)
summary(shark_mdl)
##
## Call:
## lm(formula = shark_attacks ~ ice_cream_sales, data = sharkdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3945 -4.9268 0.5087 4.8152 15.7023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.58835 5.19063 1.077 0.285
## ice_cream_sales 0.32258 0.05809 5.553 3.46e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.245 on 81 degrees of freedom
## Multiple R-squared: 0.2757, Adjusted R-squared: 0.2668
## F-statistic: 30.84 on 1 and 81 DF, p-value: 3.461e-07
Does the relationship between ice cream sales and shark attacks make sense? What might be missing from our model?
You might quite rightly suggest that this relationship is actually being driven by temperature - when it is hotter, there are more ice cream sales and there are more people swimming (hence more shark attacks).
Is \(X\) (the number of ice-cream sales) a useful predictor of \(Y\) (numbers of shark attacks) over and above \(Z\) (temperature)?
We might answer this with a multiple regression model including both temperature and ice cream sales as predictors of shark attacks:
shark_mdl2 <- lm(shark_attacks ~ temperature + ice_cream_sales, data = sharkdata)
anova(shark_mdl2)
## Analysis of Variance Table
##
## Response: shark_attacks
## Df Sum Sq Mean Sq F value Pr(>F)
## temperature 1 3000.00 3000.00 85.7795 2.689e-14 ***
## ice_cream_sales 1 71.71 71.71 2.0503 0.1561
## Residuals 80 2797.87 34.97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What do you conclude?
It appears that the number of ice cream sales is not a significant predictor of shark attack numbers over and above the temperature.
However… In psychology, we can rarely observe and directly measure the constructs which we are interested in (for example, personality traits, intelligence, emotional states etc.). We rely instead on measurements of, e.g. behavioural tendencies, as a proxy for personality traits.
Let’s suppose that instead of including temperature in degrees celsius, we asked a set of people to self-report on a scale of 1 to 7 how hot it was that day. This measure should hopefully correlate well with the actual temperature; however, there will likely be some variation:
Is \(X\) (the number of ice-cream sales) a useful predictor of \(Y\) (numbers of shark attacks) over and above \(Z\) (temperature - measured on our self-reported heat scale)?
shark_mdl2a <- lm(shark_attacks ~ sr_heat + ice_cream_sales, data = sharkdata)
anova(shark_mdl2a)
## Analysis of Variance Table
##
## Response: shark_attacks
## Df Sum Sq Mean Sq F value Pr(>F)
## sr_heat 1 2683.90 2683.90 73.3080 6.378e-13 ***
## ice_cream_sales 1 256.76 256.76 7.0133 0.009744 **
## Residuals 80 2928.90 36.61
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What do you conclude?
Moral of the story: be considerate of what exactly it is that you are measuring.
This example was adapted from Westfall and Yarkoni, 2020 which provides a much more extensive discussion of incremental validity and type 1 error rates.
We can also compare models using information criterion statistics, such as AIC and BIC. These combine information about the sample size, the number of model parameters and the residual sums of squares (\(SS_{residual}\)). Models do not need to be nested to be compared via AIC and BIC, but they need to have been fit to the same dataset.
For both of these fit indices, lower values are better, and both include a penalty for the number of predictors in the model, although BIC’s penalty is harsher:
\[ AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k \\ \quad \\ BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n) \\ \quad \\ \begin{align} & \text{Where:} \\ & SS_{residual} = \text{sum of squares residuals} \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ & \text{ln} = \text{natural log function} \end{align} \]
In R, we can calculate AIC and BIC by using the AIC() and BIC() functions.
The code below fits 5 different models:
model1 <- lm(wellbeing ~ social_int + outdoor_time, data = mwdata2)
model2 <- lm(wellbeing ~ social_int + outdoor_time + age, data = mwdata2)
model3 <- lm(wellbeing ~ social_int + outdoor_time + routine, data = mwdata2)
model4 <- lm(wellbeing ~ social_int + outdoor_time + routine + age, data = mwdata2)
model5 <- lm(wellbeing ~ social_int + outdoor_time + routine + steps_k, data = mwdata2)
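The AIC() and BIC() functions can take several models at once, returning a small table of values for comparison, e.g.:
# lower values indicate better fit (penalised for number of predictors);
# remember that models must be fitted to the same data to be compared this way
AIC(model1, model2, model3, model4, model5)
BIC(model1, model2, model3, model4, model5)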
For each of the below pairs of models, what methods are/are not available for us to use for comparison and why?
- model1 vs model2
- model2 vs model3
- model1 vs model4
- model3 vs model5
Returning to the data on Big 5 Personality traits, perceptions of social ranks, and depression and anxiety scale scores:
Research question
- Beyond neuroticism and its interaction with social comparison, do other personality traits predict symptoms of depression, anxiety and stress?
Construct and compare multiple regression models to answer this question. Remember to check that your models meet assumptions (for this exercise, a quick eyeball of the diagnostic plots will suffice. Were this an actual research project, you would want to provide a more thorough check, for instance conducting formal tests of the assumptions).
Although the solutions are available immediately for this question, we strongly advocate that you attempt it yourself before looking at them.
Ramsey, Fred, and Daniel Schafer. 2012. The Statistical Sleuth: A Course in Methods of Data Analysis. Cengage Learning.