DAPR3 Lab Exercises
  • Linear mixed models
    • Identifying grouping structure
    • Modelling group-level variability using random effects
    • Reasoning about random effects
  • Measurement & Factor Analysis

Modelling group-level variability using random effects

NoteGet set up
  1. Create a new .Rmd file for this week’s exercises.
  2. Save it somewhere you can find it again.
  3. Give it a clear name (for example, dapr3_lab02.Rmd).
  4. In the first code chunk, load the packages you’ll need this week:
    • tidyverse
    • lme4

Read in the dataset located at https://uoepsy.github.io/data/lmm_jsup.csv and name it wp_data.

RQ: How is the pride that government employees feel about their workplace associated with how long they have been employed and their seniority level?

variable description
department_name Name of government department
dept Department Acronym
virtual Whether the department functions as hybrid department with various employees working remotely (1), or as a fully in-person office (0)
role Employee role (A, B or C)
seniority Employee's seniority level. These map to roles, such that role A is 0-4, role B is 5-9, role C is 10-14. Higher numbers indicate more seniority
employment_length Length of employment in the department (years)
wp Composite measure of 'workplace pride'
NoteMore detail about this dataset

A questionnaire was sent to all UK civil service departments, and the lmm_jsup.csv dataset contains all responses that were received. Some of these departments work as hybrid or ‘virtual’ departments, with a mix of remote and office-based employees. Others are fully office-based.

The questionnaire included items asking about how much the respondent believe in the department and how it engages with the community, what it produces, how it operates and how treats its people. A composite measure of ‘workplace-pride’ was constructed for each employee. Employees in the civil service are categorised into 3 different roles: A, B and C. The roles tend to increase in responsibility, with role C being more managerial, and role A having less responsibility. We also have data on the length of time each employee has been in the department (sometimes new employees come straight in at role C, but many of them start in role A and work up over time).

Question 1

The research question (see above) is asking about how workplace pride is associated with employment length and seniority. From this question, we know that our outcome variable will be wp, and our predictors will be employment_length and seniority.

Now: Our analysis will also contain one set of random effects, because we’ll need to model the random variability contributed by a grouping variable. Identify the grouping variable that we will need to model using random effects.

🗂️ See Identify grouping variables flash card.

Solution 1. The grouping variable that we’ll need to model using random effects because it contributes random variability is either dept or department_name. These two are equivalent: they define the same groups and contain the same information. Because dept is shorter and contains the abbreviations, for simplicity’s sake we’ll use that one below.

The other possible answers and why they’re not applicable:

  • wp is continuous, so it cannot be a grouping variable, and moreover, it’s the outcome variable, so it also cannot be a grouping variable.
  • employment_length is continuous, so it cannot be a grouping variable.
  • seniority is continuous, so it cannot be a grouping variable.
  • virtual is a grouping variable, but it doesn’t contribute random variability, because it’s a variable that the study designers are controlling. If we were including it in the model (which we are not, because it does not pertain to the RQ), we would include it through fixed effects, not random effects.
  • role is a grouping variable, but it doesn’t contribute random variability, because it’s a variable that the study designers are controlling. If we were including it in the model (which we are not, because it does not pertain to the RQ), we would include it through fixed effects, not random effects.

Question 2

Get to know the dataset and describe the grouping variable.

Specifically:

  • What does each observation in wp_data represent?
  • How many levels are there in the grouping variable?
  • What does each level represent?
  • How many observations do we have from each level? Which level has the most observations? Which has the fewest?

🗂️ See Describe group-structured data flash card.

Solution 2. Based on the description of the dataset and a quick look at the first few rows:

head(wp_data)
# A tibble: 6 × 7
  department_name         dept  virtual role  seniority employment_length    wp
  <chr>                   <chr>   <dbl> <chr>     <dbl>             <dbl> <dbl>
1 UK Statistics Authority UKSA        1 A             1                 9  32.9
2 UK Statistics Authority UKSA        1 B             6                20  26.7
3 UK Statistics Authority UKSA        1 B             6                14  31.9
4 UK Statistics Authority UKSA        1 A             1                 9  33.4
5 UK Statistics Authority UKSA        1 A             4                 6  34.4
6 UK Statistics Authority UKSA        1 A             2                11  31  

Each row represents one employee’s data. The employee in the first row, for example, is employed at UKSA, works remotely (virtual = 1), is in a role of class A, has a seniority of 1, has been employed 9 years, and has a workplace pride score of 32.9.

To see how many observations we have per level of our grouping variable dept:

wp_data |>
  group_by(dept) |>
  count()
# A tibble: 16 × 2
# Groups:   dept [16]
   dept       n
   <chr>  <int>
 1 ACE       17
 2 CMA       21
 3 CPS       13
 4 FSA       25
 5 GLD       17
 6 HMRC      16
 7 NCA       20
 8 NS&I      20
 9 OFGEM     15
10 OFQUAL     5
11 OFSTED    17
12 OFWAT     16
13 ORR       17
14 SFO       18
15 UKSA      45
16 UKSC      13

Looking at this data frame, we can tell that there are 16 levels within dept because there are 16 rows, that is, 16 distinct values of dept that we are counting.

Each value of dept represents one department within the UK Government.

The output of group_by() |> count() tells us how many people we have data from in each department. We can also add on the arrange() function to sort these values in order, to easily see what the maximum and minimum values are.

wp_data |>
  group_by(dept) |>
  count() |>
  arrange(n)
# A tibble: 16 × 2
# Groups:   dept [16]
   dept       n
   <chr>  <int>
 1 OFQUAL     5
 2 CPS       13
 3 UKSC      13
 4 OFGEM     15
 5 HMRC      16
 6 OFWAT     16
 7 ACE       17
 8 GLD       17
 9 OFSTED    17
10 ORR       17
11 SFO       18
12 NCA       20
13 NS&I      20
14 CMA       21
15 FSA       25
16 UKSA      45

We see that OFQUAL (Office of Qualifications and Examinations Regulation) has the fewest observations at 5, and UKSA (UK Statistics Authority) has the most observations at 45.

Question 3

Here is a plot of the overall association between employment_length and wp. Each dot is coloured based on that employee’s seniority, so all variables from the RQ are included. First, copy this code and generate this plot yourself.

wp_data |>
  ggplot(aes(x = employment_length, y = wp, colour = seniority)) +
  geom_point()

Modify this plot to show the same data split into different panels (also called facets), one for each level of the grouping variable.

🗂️ See Plot group-structured data flash card.

Solution 3.

wp_data |>
  ggplot(aes(x = employment_length, y = wp, colour = seniority)) +
  facet_wrap(~ dept) +   # <- this is the line that facets the plot by dept
  geom_point()

Question 4

Make some predictions based on the plot you just created.

Specifically:

  • Which level(s) of the grouping variable appear to have higher workplace pride than the others? Which levels appear to have lower workplace pride?
  • Are there particular levels of this grouping variable that appear to have a stronger effect of employment length on workplace pride?

You’ll refer to these levels later, so write them down.

Solution 4. To see which departments have higher workplace pride than others, look for the dots that correspond to larger values on the y axis showing wp.

Eyeballing the plot from the previous solution, it looks to me like FSA and UKSA might have higher workplace pride than other departments. And when looking for smaller values on the y axis, I see that OFQUAL looks like it has pretty low workplace pride scores compared to the others.

To see which departments have a stronger effect of employment length on workplace pride, try to imagine how steep the line would be that correlates employment_length with wp. In other words: As employment_length increases, how fast does wp appear to decrease?

To me, it looks to me like OFSTED and CPS have slopes that might be a bit steeper. That is, their wp seems to decrease faster than in other departments.

Question 5

Now we’ll fit a linear mixed model to this data.

We want to estimate the effects of employment length and seniority on workplace pride. Therefore, we know that our fixed effects will be wp ~ employment_length + seniority.

The model should include random intercepts by the grouping variable, as well as random slopes by that variable over both predictors. Fill in the ... blanks in the R code below with the appropriate values and variable names.

wp_lmm <- lmer(
  wp ~ employment_length + seniority + (... + ... + ... | ...), 
  data = wp_data
)

🗂️ See Add random effects to a model formula flash card.

Solution 5.

wp_lmm <- lmer(
  wp ~ employment_length + seniority + (1 + employment_length + seniority | dept), 
  data = wp_data
)

Let’s look at the term (1 + employment_length + seniority | dept) piece by piece:

  • First, the right-hand side of the term (the bit after the vertical line |) is dept. Therefore the model will find adjustments to the fixed effects for each distinct level of dept.
  • Now, looking at the left-hand side (the bit before the vertical line |:
    • The 1 means that the model will estimate adjustments to the fixed intercept by dept (aka “random intercepts”).
    • The + employment_length means that the model will ALSO estimate adjustments to the fixed slope over employment_length by dept (aka “random slopes over employment_length”).
    • And the + seniority means that the model will ALSO estimate adjustments to the fixed slope over seniority by dept (aka “random slopes over seniority”).

Question 6

Plot the random effects estimated by wp_lmm using dotplot.ranef.mer().

🗂️ See Plot random effects flash card.

Solution 6.

dotplot.ranef.mer(ranef(wp_lmm))$dept

A little trick from the flash cards: to “zoom in” on the slope adjustments, you can modify the scales= argument, which will tell dotplot.ranef.mer() to scale each panel differently.

dotplot.ranef.mer(
  ranef(wp_lmm),
  scales = list(x = list(relation = 'free'))  # allow scales of subplots to vary
)$dept

Question 7

Imagine you were including this dotplot of random effects in a report, and you had write a caption to describe it to your reader. Write one sentence per panel to tell the reader what is being displayed in each panel, and make sure to include what the dots/error bars represent.

Solution 7. One example:

  • The (Intercept) panel shows the adjustments to the fixed intercept for each UK government department (each dot represents the adjustment estimate and the error bars represent its 95% CI).
  • The employment_length panel shows the adjustments to the fixed slope over employment_length for each UK government department.
  • The seniority panel shows the adjustments to the fixed slope over seniority for each UK government department.

(Preview: in future weeks, we’ll look in depth at how to interpret random effects in relation to fixed effects. After that, we can write a more complete caption for this plot which would also tell the reader what it means when the adjustments to each fixed effect are positive or negative.)

Question 8

Compare your earlier predictions to the model-estimated random effects.

Back in Q4, you named a few levels that you thought might behave differently from the others.

Locate those levels in the dotplot of your random effects.

Does it look like the levels you identified are behaving differently from average in the way you predicted? How can you tell?

Solution 8. Here’s the dotplot again for convenient reference:

dotplot.ranef.mer(ranef(wp_lmm), scales = list(x = list(relation = 'free')))$dept

I said that I expected FSA and UKSA to have higher workplace pride than other departments. This would be reflected in positive adjustments to the fixed intercept. And yes, UKSA and FSA both have positive intercept adjustments, along with a few other departments like CPS and OFWAT.

I also said that I expected OFQUAL to have lower workplace pride than other departments. This would be reflected by a negative adjustment to the fixed intercept. And yes, the adjustment for OFQUAL is negative. (Note that its 95% CI includes zero—this means that, if we were doing significance testing for the random effects, we couldn’t reject the H0 that OFQUAL has no adjustment from the fixed intercept. But we’re not doing significance testing of random effects, so we can basically ignore this CI.)

Finally, I thought that OFSTED and CPS might have steeper slopes associating employment_length and wp. “Steeper” in this case means “more negative”, since the slope is already negative and it just gets more steeply negative in those departments. Therefore we’re looking for negative adjustments to the fixed slope over employment_length for OFSTED and CPS. And yes, that’s what we see (and the slope for OFSTED has a much larger adjustment than for CPS).

Question 9

Why are the numbers given to you by ranef(wp_lmm) and by coef(wp_lmm) different?

Explain in a sentence or two.

🗂️ See Extract estimates from a fitted model flash card.

Solution 9.

ranef(wp_lmm)
$dept
       (Intercept) employment_length seniority
ACE         0.1380          -0.01001    0.1799
CMA        -2.3517           0.09808   -0.1955
CPS         2.0262          -0.07105   -0.0194
FSA         1.8789           0.06536    0.4612
GLD         1.3546          -0.09590    0.1840
HMRC       -2.5259           0.08541   -0.1323
NCA        -1.2742           0.00181   -0.1775
NS&I        0.0567           0.00187    0.1002
OFGEM      -1.2272           0.18228   -0.0614
OFQUAL     -2.3517           0.01706   -0.1248
OFSTED      0.2060          -0.21108   -0.2882
OFWAT       2.0169          -0.13475   -0.0406
ORR        -0.7655           0.03461   -0.1711
SFO        -0.1163          -0.08280    0.1473
UKSA        3.1022           0.10238    0.2963
UKSC       -0.1667           0.01674   -0.1581

with conditional variances for "dept" 
coef(wp_lmm)
$dept
       (Intercept) employment_length seniority
ACE           35.6            -0.909    0.2803
CMA           33.2            -0.800   -0.0950
CPS           37.5            -0.970    0.0811
FSA           37.4            -0.833    0.5617
GLD           36.9            -0.994    0.2844
HMRC          33.0            -0.813   -0.0318
NCA           34.2            -0.897   -0.0770
NS&I          35.6            -0.897    0.2007
OFGEM         34.3            -0.716    0.0390
OFQUAL        33.2            -0.881   -0.0243
OFSTED        35.7            -1.110   -0.1877
OFWAT         37.5            -1.033    0.0598
ORR           34.7            -0.864   -0.0707
SFO           35.4            -0.981    0.2478
UKSA          38.6            -0.796    0.3968
UKSC          35.3            -0.882   -0.0576

attr(,"class")
[1] "coef.mer"

ranef() tells you how much each fixed effect should be adjusted for each level of dept. In contrast, coef() combines the random effects with the fixed effects and tells you the adjusted \(\beta\) parameters of the linear expression that associates employment_length and seniority with wp.

Question 10

How would you use the values given to you by ranef(wp_lmm) and fixef() to recreate the values of coef()?

Describe in one sentence.

Bonus coding challenge: Actually use the values from ranef() and fixef() to recreate the values of coef().

Solution 10. The one-sentence answer: add them together.

The explanation: ranef() gives you the adjustments to the fixed effects, and fixef() gives you the fixed effects themselves. So to get the coef() values—the adjusted effects for each level of dept—you must add the adjustments from ranef() to each fixed effect from fixef().

One possible solution to the challenge (but there are many paths to get to the same outcome):

# Store the fixed effect values with shorter variable names:
(fixef_int <- fixef(wp_lmm)[['(Intercept)']])
[1] 35.5
(fixef_emp <- fixef(wp_lmm)[['employment_length']])
[1] -0.899
(fixef_sen <- fixef(wp_lmm)[['seniority']])
[1] 0.1
# Add each fixed effect to each column produced by ranef():
ranef(wp_lmm)$dept |>
  mutate(
    adjusted_int = `(Intercept)` + fixef_int,
    adjusted_emp = employment_length + fixef_emp,
    adjusted_sen = seniority + fixef_sen,
  ) |>
  # view only the new columns
  select(adjusted_int, adjusted_emp, adjusted_sen)
       adjusted_int adjusted_emp adjusted_sen
ACE            35.6       -0.909       0.2803
CMA            33.2       -0.800      -0.0950
CPS            37.5       -0.970       0.0811
FSA            37.4       -0.833       0.5617
GLD            36.9       -0.994       0.2844
HMRC           33.0       -0.813      -0.0318
NCA            34.2       -0.897      -0.0770
NS&I           35.6       -0.897       0.2007
OFGEM          34.3       -0.716       0.0390
OFQUAL         33.2       -0.881      -0.0243
OFSTED         35.7       -1.110      -0.1877
OFWAT          37.5       -1.033       0.0598
ORR            34.7       -0.864      -0.0707
SFO            35.4       -0.981       0.2478
UKSA           38.6       -0.796       0.3968
UKSC           35.3       -0.882      -0.0576

Exactly the same numbers as the output of coef():

coef(wp_lmm)$dep
       (Intercept) employment_length seniority
ACE           35.6            -0.909    0.2803
CMA           33.2            -0.800   -0.0950
CPS           37.5            -0.970    0.0811
FSA           37.4            -0.833    0.5617
GLD           36.9            -0.994    0.2844
HMRC          33.0            -0.813   -0.0318
NCA           34.2            -0.897   -0.0770
NS&I          35.6            -0.897    0.2007
OFGEM         34.3            -0.716    0.0390
OFQUAL        33.2            -0.881   -0.0243
OFSTED        35.7            -1.110   -0.1877
OFWAT         37.5            -1.033    0.0598
ORR           34.7            -0.864   -0.0707
SFO           35.4            -0.981    0.2478
UKSA          38.6            -0.796    0.3968
UKSC          35.3            -0.882   -0.0576

Notice that in the mutate() code above, we need to place ticks (that is, the character ` ) around "(Intercept)" because otherwise, the brackets around “Intercept” will stop mutate() from identifying it as a column name.