Week 9 Exercises: Path Analysis & Mediation

Education, Skills, and Salary

Dataset: edskillsalary.csv

We sampled 500 people who were all 5 years out of their last year of education. All participants completed an extensive questionnaire to ascertain the number of skills of potential interest to employers each participant perceived themselves to have. This resulted in a “Skillset Metric”.
5 years later, participants were followed up and asked to provide their current salaries. 110 participants failed to respond to follow-ups and thus the final sample included 390 people.

The data are available at https://uoepsy.github.io/data/edskillsalary.csv.

variable	description
Educ	Number of years of education undertaken
Skill	Skillset metric (scores listed from a 100 item list of skills deemed relevant to employers)
Salary	Salary (in thousands of £)

Question 1

Read in the dataset.
Let’s suppose that the only statistical machinery available to us is the good old regression models with lm(), and we are interested in the estimated effect of education on salary.

Which model are you going to fit?

lm(Salary ~ Educ, data = ... )
lm(Salary ~ Skill + Educ, data = ... )

Solution 1. Here’s the data:

library(tidyverse)  # provides read_csv()

edskill <- read_csv("https://uoepsy.github.io/data/edskillsalary.csv")

I’m going to fit both models so that we can discuss them below:

m1 <- lm(Salary ~ Educ, data = edskill)
m2 <- lm(Salary ~ Skill + Educ, data = edskill)
# model tables side by side
sjPlot::tab_model(m1,m2)

	Salary			Salary
Predictors	Estimates	CI	p	Estimates	CI	p
(Intercept)	21.88	15.98 – 27.77	<0.001	20.15	14.39 – 25.91	<0.001
Educ	1.14	0.77 – 1.51	<0.001	0.48	0.03 – 0.92	0.035
Skill				0.27	0.16 – 0.37	<0.001
Observations	390			390
R² / R² adjusted	0.085 / 0.083			0.141 / 0.136

So we have got two quite different pictures. The first model says that for every additional year of education, the salary offered would be £1140 more. The second model says it would only be £480 more, and it looks like skills are also important for getting a good salary, right?

But this makes it sound like “Education” and “Skills” are things whose effects we can compare independently. The implicit model we are referring to when using lm(Salary ~ Skill + Educ) is in the left hand diagram, Model A, of Figure 1. But does it not make a lot more sense to think that education is a cause of people’s levels of skills (especially as the skill level being measured 5 years after their last year of education precludes the opposite direction)? In other words, does Model B in Figure 1 not make more sense?

Figure 1: Two theoretical models of process that gives rise to the data

What Model B suggests is that part of how education influences people’s salaries is by giving them more skills. A perfect example is that you are all now on your way to becoming proficient R users!

This is a great example of when we might not want to include a variable in our model. If you are interested in “what does having an extra year of education do for people’s salaries?”, then controlling for skill in our model actually removes the part of the effect that operates through skills, which is part of the mechanism by which education influences salary.
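To see this with numbers, here is a minimal sketch using simulated data. The variable names (x, m, y) and the effect sizes are made up for illustration and are not taken from edskillsalary.csv. With ordinary least squares, the coefficient for x from the model without the mediator equals the coefficient from the model with the mediator plus the product of the two paths that run through the mediator:

```r
# Simulated data in which x influences y both directly and via m
# (all names and effect sizes are hypothetical)
set.seed(1)
n <- 1000
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)            # x -> m
y <- 0.3 * x + 0.4 * m + rnorm(n)  # x -> y (direct) and m -> y

total_fit  <- lm(y ~ x)      # mediator left out
direct_fit <- lm(y ~ x + m)  # mediator "controlled for"
a_fit      <- lm(m ~ x)      # x -> m path

total_x  <- unname(coef(total_fit)["x"])
direct_x <- unname(coef(direct_fit)["x"])
a_path   <- unname(coef(a_fit)["x"])
b_path   <- unname(coef(direct_fit)["m"])

# what is lost by controlling for m is exactly the part that goes via m
c(total = total_x, direct = direct_x, via_m = a_path * b_path)
```

The first number should equal the sum of the other two, which is why adding the mediator to the model shrinks the coefficient for x.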

Another example

To make this point super clear, suppose we were testing some drug that reduces blood pressure in order to lower the risk of cardiac arrest. We do an experiment where we gave some people the drug, and some people a placebo, then after a year we measured their blood pressure, and then we followed up 5 years later to see how many experienced cardiac arrests.

The diagram to show how these variables relate is clearly the one in Figure 2.

Figure 2: A drug aimed at reducing cardiac arrests by lowering blood pressure. The faint grey line indicates the path that would be estimated if we attempted to control for post-treatment blood-pressure

If we fitted the model lm(cardiac_arrest ~ blood_pressure + drug), then the estimated coefficient we get for drug will be the effect of the drug on cardiac arrests that isn’t due to how it changes blood pressure. But that is exactly the opposite of what we want, because we think the drug works specifically by lowering blood pressure!

Contrast this with another example, where we do not run an experiment and allocate people into “drug” or “placebo”, we simply observe a whole load of people who either do or do not take the drug, and we follow them up to see how many have cardiac arrests. But, it is generally older people who take the drug, whereas younger people do not. In addition, older people have a higher risk of cardiac arrests. So our model is that seen in Figure 3.

Figure 3: We are interested in a drug aimed at reducing cardiac arrests. Older people are more likely to take the drug, and are also more likely to experience cardiac arrest

In this case, controlling for the third variable age is the right thing to do.
If we do not control for age, and fit the model lm(cardiac_arrest ~ drug), then the coefficient for drug will be biased because the group who take the drug are on average older, and thus will tend to experience more cardiac arrests. If the imbalance of age between the drug and no-drug groups is great enough, it could even look like the drug is making cardiac arrests more likely!
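A small simulation can make the confounding story concrete. This is only a sketch under assumptions I have made up: a continuous "risk" score stands in for cardiac arrest, the true drug effect is set to -0.5, and older people are more likely to take the drug.

```r
# Simulated observational study (all values hypothetical):
# older people are more likely to take the drug AND have higher risk
set.seed(2)
n    <- 5000
age  <- runif(n, min = 20, max = 80)
drug <- rbinom(n, size = 1, prob = plogis((age - 50) / 10))
# continuous risk score standing in for cardiac arrest; true drug effect is -0.5
risk <- 0.05 * age - 0.5 * drug + rnorm(n)

naive    <- lm(risk ~ drug)        # age left out
adjusted <- lm(risk ~ age + drug)  # age controlled for

naive_drug    <- unname(coef(naive)["drug"])
adjusted_drug <- unname(coef(adjusted)["drug"])
c(naive = naive_drug, adjusted = adjusted_drug)
```

In a setup like this the naive coefficient should come out positive (the drug looks harmful), while the adjusted coefficient should sit close to the value of -0.5 that was built into the simulation.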

Question 2

Instead, let’s suppose we are actually interested in the mechanism of how education influences salary. Do more educated people tend to have higher salaries in part because of the skills obtained during their education?

Fit a path model in which education has an effect on salary both directly and indirectly, via its influence on the skills obtained (i.e., model B in Figure 1)

Hints

We have an outcome Y, a predictor X, and a mediator M:

mod <- "
Y ~ X + M
M ~ X
"

Question 3

While the model in the previous question better reflects our theoretical notions of how these variables are actually related, we would ideally get out an estimate of the indirect effect.

Edit your model formula from the previous question to also estimate both the total and the indirect effects.

Then re-fit the model, estimating the parameters using bootstrapping.
Is the association between education and salary mediated by skills? What proportion of the effect is mediated?

Hints

You’ll need to add some labels to the existing paths, and then define the indirect and total effects - see the section on mediation in lavaan in Chapter 7.

Solution 3.

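library(lavaan)

mod1 <- "
# outcome regressed on predictor (direct path c) and mediator (path b)
Salary ~ c*Educ + b*Skill
# mediator regressed on predictor (path a)
Skill ~ a*Educ

# effects defined from the labelled paths
indirect := a*b
total := a*b + c
"

# se = "bootstrap" requests bootstrapped standard errors
mod1.est <- sem(mod1, data = edskill, se = "bootstrap")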

Warning message:
In lav_model_nvcov_bootstrap(lavmodel = lavmodel, lavsamplestats = lavsamplestats, :
lavaan WARNING: 8 bootstrap runs failed or did not converge.

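Note that we get a warning here, but it’s not too much to worry about. It tells us that 8 (out of 1000) of the bootstraps have failed to converge. This means the bootstrapped estimates from that run are really based on 992 bootstrap draws (and so a write-up would need to clearly report this). Because the resampling is random, the number of failed draws will differ from run to run unless a seed is set; the summary output below reports 990 successful draws, which suggests it came from a separate run.
If this number was bigger (e.g. 10% or 20% of the bootstraps failed) then it would be more of a cause for concern.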
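For intuition about what bootstrapping an indirect effect involves, the sketch below does it by hand with lm() on simulated data. It illustrates the general idea (resample rows, re-estimate, collect a*b) and is not lavaan’s internal code; the data, the names and the number of draws are all assumptions. Setting a seed makes the resampling reproducible.

```r
# Simulated data with a known mediation structure (hypothetical values)
set.seed(3)
n <- 300
dat <- data.frame(x = rnorm(n))
dat$m <- 0.5 * dat$x + rnorm(n)
dat$y <- 0.3 * dat$x + 0.4 * dat$m + rnorm(n)

# indirect effect a*b from two regressions
indirect_effect <- function(d) {
  a <- coef(lm(m ~ x, data = d))["x"]
  b <- coef(lm(y ~ x + m, data = d))["m"]
  unname(a * b)
}

n_boot <- 1000
boot_ab <- replicate(n_boot, {
  rows <- sample(nrow(dat), replace = TRUE)  # resample people with replacement
  indirect_effect(dat[rows, ])
})

estimate <- indirect_effect(dat)
boot_se  <- sd(boot_ab)                         # bootstrap standard error
boot_ci  <- quantile(boot_ab, c(0.025, 0.975))  # percentile 95% interval
```

Each draw here is a plain regression, so none can fail to converge; in lavaan each draw is a full model fit, which is why a handful may fail.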

summary(mod1.est, ci = TRUE)

lavaan 0.6-20 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

  Number of observations                           390

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                            Bootstrap
  Number of requested bootstrap draws             1000
  Number of successful bootstrap draws             990

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
  Salary ~                                                              
    Educ       (c)    0.479    0.229    2.096    0.036    0.010    0.918
    Skill      (b)    0.266    0.054    4.887    0.000    0.159    0.374
  Skill ~                                                               
    Educ       (a)    2.481    0.168   14.798    0.000    2.134    2.826

Variances:
                   Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
   .Salary          129.722   10.220   12.693    0.000  108.806  148.937
   .Skill           118.754    8.139   14.591    0.000  103.382  135.392

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
    indirect          0.659    0.135    4.873    0.000    0.398    0.929
    total             1.138    0.189    6.032    0.000    0.752    1.488

The relationship between education and salary is mediated by skill level! We have a significant indirect effect of 0.66, 95% CI [0.40, 0.93].

The total effect of education on salary is 1.14, and the indirect effect is 0.66.
\(\frac{0.66}{1.14} = 0.58\), so 58% of the effect is mediated by the influence that education has on skillset and skillset has on salary.
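As a quick check, here is the same arithmetic done in R from the unrounded estimates in the “Defined Parameters” output (the object names are just for illustration):

```r
# estimates copied from the "Defined Parameters" output above
indirect <- 0.659
total    <- 1.138

prop_mediated <- indirect / total
round(prop_mediated, 2)
```

If you would like a bootstrapped interval for this proportion too, one option is to define it inside the model syntax alongside the other effects, for instance with a line such as prop := (a*b) / (a*b + c).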

Question 4

Nothing to code here, just spend a little time to get to grips with what the different parts of the model represent:

If you can, write down an explanation of each of:

a
b
the total effect
the direct effect
the indirect effect

Question 5

In order for us to accurately estimate mediation effects, a few conditions have to be met. One of the biggest is that of no confounding.

An unmeasured variable that is a common cause of both X and Y will bias the total and direct effects, and one that is a common cause of both X and M will bias the indirect effect (it will bias the X\(\rightarrow\)M path). Randomised experiments (i.e. randomly allocating people to different values of X) will avoid this, because nothing but the random allocation would cause X. However, confounding of the indirect effect can also happen if some variable is a common cause of both M and Y, and it is hard to randomly allocate to a mediator¹.

Can you think of two unmeasured variables that could be in the place of the variables indicated by “?” in Figure 4, and that may be confounding our estimates?

Question 6

Another assumption of our model here is that there is no “X-M interaction”. By this, I mean that we are assuming that the effect of the mediator M on the outcome Y does not depend on the level of the predictor X.

Think: what would an X-M interaction mean in this example?

More Conduct Problems

Dataset: conductprobteach.csv

Thus far, we have explored the underlying structure of a scale of adolescent ‘conduct problems’ (PCA & EFA exercises) and we then tested this measurement model when the scale was administered to a further sample (CFA exercises).

This week, we are looking at whether there are associations between conduct problems (both aggressive and non-aggressive) and academic performance and whether the relations are mediated by the quality of relationships with teachers. We collected data on 557 adolescents as they entered school. Their responses to the conduct problem scale were summed to create a scale score. Two years later, we followed up these students, and obtained measures of Academic performance and of their relationship quality with their teachers. Standardised scale scores were created for both of these measures.

The data are available at https://uoepsy.github.io/data/conductprobteach.csv

variable	description
ID	participant ID
Acad	Academic performance (average grade (0-100) based on all available assessments)
Teach_r	Teacher relationship quality (sum score based on the Teacher-Child-Relationship (TCR) scale - 7 items on a 5-point Likert scale)
Non_agg	Non-Aggressive conduct problems (sum score based on items 1-5 of the 10-item conduct problems scale - 5 items on a 5-point Likert scale)
Agg	Aggressive conduct problems (sum score based on items 6-10 of the 10-item conduct problems scale - 5 items on a 5-point Likert scale)

Question 7

As a little exercise before we get started, let’s just show ourselves that we can use lavaan to estimate all sorts of models, including a multiple regression model.

Let’s first just explore the total effects of the two types of conduct problems on academic achievement.

The code below fits the same model using sem() and using lm(). Examine the summary() output for both models, and spot the similarities.

# read in data
cp_teach <- read_csv("https://uoepsy.github.io/data/conductprobteach.csv")

# a straightforward multiple regression model
m1_lm <- lm(Acad ~ Non_agg + Agg, data = cp_teach)

# the same model fitted in lavaan
m1_lav <- 'Acad ~ Non_agg + Agg'
m1_lav.est <- sem(m1_lav, data = cp_teach)
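If you want to line the two sets of estimates up side by side, something like the sketch below is one way to do it. It uses simulated stand-in data with made-up names (x1, x2, y) rather than conductprobteach.csv, and it only runs the lavaan part if the package is installed.

```r
# Simulated stand-in data (hypothetical; not conductprobteach.csv)
set.seed(4)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 - 0.4 * sim$x1 - 0.6 * sim$x2 + rnorm(n)

# slopes from ordinary regression
lm_slopes <- coef(lm(y ~ x1 + x2, data = sim))[c("x1", "x2")]

# only run the lavaan part if the package is installed
if (requireNamespace("lavaan", quietly = TRUE)) {
  fit <- lavaan::sem("y ~ x1 + x2", data = sim)
  pe  <- lavaan::parameterEstimates(fit)
  lav_slopes <- pe$est[pe$op == "~"]  # rows for the regression paths
  print(cbind(lm = lm_slopes, lavaan = lav_slopes))
}
```

The slopes should agree to several decimal places; the test statistics are where the two outputs differ.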

Question 8

Make a sketch for a model in which both aggressive and non-aggressive conduct problems have indirect (via teacher relationships) and direct effects on academic performance.

Sketch the path diagram on a piece of paper, or use a website like https://semdiag.psychstat.org/ or https://www.diagrams.net/.

Question 9

Now specify the model in R, taking care to also define the parameters for the indirect and total effects.

Then estimate the model by bootstrapping.

Hints

You’ll need more labels than just a, b, and c!
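As a rough guide, the earlier Y, X, M template could be extended to two predictors along these lines. X1, X2, M and Y are placeholders, and the label names are just one possible choice:

```r
# Generic two-predictor, one-mediator template (X1, X2, M, Y are placeholders)
mod_template <- "
  # outcome on both predictors (direct effects) and the mediator
  Y ~ c1*X1 + c2*X2 + b*M
  # mediator on both predictors
  M ~ a1*X1 + a2*X2

  # one indirect and one total effect per predictor
  ind1   := a1*b
  ind2   := a2*b
  total1 := a1*b + c1
  total2 := a2*b + c2
"
```

Swapping the placeholders for the variable names in the dataset gives a model string that can be passed to sem() in the same way as in the previous section.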

Question 10

Given that the measures we are using here all have fairly uninterpretable scales (i.e., what does being 1 higher on Agg really represent?), we might prefer standardised coefficients instead.

You can get these using standardizedSolution(model).
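As a reminder of what a standardised coefficient is, the sketch below (simulated data, made-up names) shows that, for a single regression, rescaling a slope by sd(x)/sd(y) gives the same number as refitting the model on z-scored variables. standardizedSolution() reports the model’s paths on this kind of scale.

```r
# Simulated data on arbitrary scales (hypothetical values)
set.seed(5)
n <- 400
x <- rnorm(n, mean = 10, sd = 3)
y <- 50 - 2 * x + rnorm(n, sd = 8)

raw_slope <- unname(coef(lm(y ~ x))["x"])

# option 1: rescale the raw slope
std_by_rescaling <- raw_slope * sd(x) / sd(y)

# option 2: z-score both variables, then refit
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
std_by_refitting <- unname(coef(lm(zy ~ zx))["zx"])

c(rescaled = std_by_rescaling, refitted = std_by_refitting)
```

Either way, the result reads as “a 1 standard deviation increase in x is associated with this many standard deviations of change in y”.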

Question 11

Now visualise the estimated model and its parameters using the semPaths() function from the semPlot package.

Question 12

Write a brief paragraph reporting on the results of the model estimates. Include a Figure or Table to display the parameter estimates.

Solution 12.

A path mediation model was used to test the direct and indirect effects (via teacher relationship quality) of aggressive and non-aggressive conduct problems on academic performance. In the model, academic performance was regressed on teacher relationship quality, non-aggressive conduct problems and aggressive conduct problems, while teacher relationship quality (the mediator) was regressed on aggressive and non-aggressive conduct problems. The indirect effects were tested using the product of the coefficient for the regression of outcome on mediator and the coefficient for the regression of mediator on predictor. The statistical significance of the indirect effects was evaluated using bootstrapped 95% confidence intervals with 1000 bootstrap samples.

Standardised parameter estimates are provided in Figure 5. Solid lines indicate that a parameter is significant at the 5% significance level, while dashed lines represent non-significant paths. Total effects of both non-aggressive and aggressive conduct problems on academic performance were significant and negative, indicating that both conduct problem domains were associated with poorer academic outcomes.

The indirect effects of both conduct problem domains on academic performance via teacher-relationship quality were statistically significant (\(\beta = -0.088,\, 95\%\, CI\, [-0.131, -0.045]\) and \(\beta = -0.135\, [-0.197, -0.074]\) for non-aggressive and aggressive conduct problems respectively). The direct effect of non-aggressive conduct problems was not significant (\(\beta = -0.083\,[-0.198, 0.033]\)), suggesting that teacher-relationship quality fully mediated the relationship between non-aggressive conduct problems and academic performance. In contrast, teacher-relationship quality only partially mediated the effect of aggressive conduct problems (proportion mediated = 45%), as a significant direct effect on academic performance remained (\(\beta = -0.166\, [-0.284, -0.048]\)). These results suggest that while both types of conduct problems impact academics through the teacher-student bond, aggressive behaviors may also carry additional, direct risks to academic success that are independent of the teacher-student relationship.
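For reference, the 45% figure can be reproduced from the standardised estimates quoted above, assuming the total effect is the sum of the direct and indirect effects (the object names are just for illustration):

```r
# standardised estimates copied from the paragraph above
# (aggressive conduct problems)
indirect_agg <- -0.135
direct_agg   <- -0.166

total_agg <- indirect_agg + direct_agg
prop_mediated_agg <- indirect_agg / total_agg
round(prop_mediated_agg, 2)
```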

Figure 5: Effect of conduct problems on academic performance mediated by quality of teacher relationship. Standardised estimates presented

Footnotes

1. There are methods that attempt to “block” a mediator, or to manipulate X in multiple ways in order to increase or decrease the mediator’s effect. If you’re interested, see Design approaches to experimental mediation, Pirlott & MacKinnon 2016.
2. Note that the model fitted with sem() provides \(Z\) values instead of the \(t\)-values in regression models. This is because sem() fits models with maximum likelihood, thereby assuming a reasonably large sample size.
