Identify possible random effects

The process in words

  1. Use your RQ to figure out what your model’s outcome and predictors are (i.e., your model’s fixed effects).

  2. Identify the grouping variables in your dataset.

  3. Think about the data generating process and figure out which grouping variables contribute random / non-manipulated / non-controlled variability to your data. For every randomly-varying grouping variable, your model must contain a random intercept for that grouping variable.

  4. Refer back to all predictors you identified in Step 1. For each randomly-varying grouping variable you have identified, do at least some of its levels appear with more than one value of each predictor? If yes—if some levels of the grouping variable appear with more than one value of a predictor—then (in addition to the random intercepts) your model can also contain a random slope over that predictor for that grouping variable.

The process in a flowchart

[TODO ranef flowchart]

Example 1: Life satisfaction in Scotland

These data come from 112 people across 12 different Scottish dwellings (cities and towns). Information is captured on their ages and a measure of life satisfaction. The researchers are interested in if there is an association between age and life-satisfaction.

Data are available at https://uoepsy.github.io/data/lmm_lifesatscot.csv.

variable description
age Age (years)
lifesat Life Satisfaction score
dwelling Dwelling (town/city in Scotland)
size Size of Dwelling (> or <100k people)
lifesatscot <- read_csv("https://uoepsy.github.io/data/lmm_lifesatscot.csv")

1. Predictors

The RQ asks whether there’s an association betwen age and lifesat. The model’s outcome is lifesat and the predictor is age.

The RQ doesn’t mention size, so we will not consider it a predictor. (In principle, it looks like a plausible covariate, that is, a way to control for how populous each area is.)

2. Grouping variables

The grouping variables are

  • age (treated as ordinal because each value is spaced five years apart)
  • dwelling (categorical)
  • size (categorical)
# age:
lifesatscot |> group_by(age) |> count()
# A tibble: 11 x 2
# Groups:   age [11]
     age     n
   <dbl> <int>
 1    15     2
 2    20     2
 3    25     9
 4    30     7
 5    35    23
 6    40    25
 7    45    18
 8    50    15
 9    55     7
10    60     3
11    65     1
# dwelling:
lifesatscot |> group_by(dwelling) |> count()
# A tibble: 12 x 2
# Groups:   dwelling [12]
   dwelling         n
   <chr>        <int>
 1 Aberdeen        10
 2 Dumfries        10
 3 Dundee          10
 4 Dunfermline     10
 5 Edinburgh       10
 6 Fort William    10
 7 Glasgow         10
 8 Inverness       10
 9 Kirkcaldy        2
10 Paisley         10
11 Perth           10
12 Stirling        10
# size:
lifesatscot |> group_by(size) |> count()
# A tibble: 2 x 2
# Groups:   size [2]
  size      n
  <chr> <int>
1 <100k    72
2 >100k    40

3. Random intercepts

Thinking about what kind of variability each grouping variable contributes:

  • age contributes reproducible (manipulated) variability: the researchers have gathered data from a bunch of different ages. We are interested in drawing conclusions specifically about the ages analysed.
  • dwelling contributes random variability: the RQ doesn’t specify that we manipulate or control for the dwelling place. And we want our results to be able to generalise beyond these twelve dwelling places.
  • size contributes reproducible (controlled) variability: the researchers have recorded how big each dwelling is in case this information might otherwise confound the association between lifesat and age.

Because dwelling contributes random variability, our model formula must at least include a random intercept by dwelling, which looks like (1 | dwelling).

4. Random slopes

Our model’s only predictor is age, and our only randomly-varying grouping variable is dwelling.

Do at least some levels of dwelling appear with more than one value of age?

stats::xtabs(~ dwelling + age, data = lifesatscot)
              age
dwelling       15 20 25 30 35 40 45 50 55 60 65
  Aberdeen      1  1  0  0  4  2  0  2  0  0  0
  Dumfries      0  0  1  0  3  3  0  2  0  1  0
  Dundee        0  0  1  1  2  3  3  0  0  0  0
  Dunfermline   0  0  0  1  1  2  3  1  2  0  0
  Edinburgh     0  1  1  0  3  2  2  1  0  0  0
  Fort William  0  0  0  0  3  2  1  2  1  1  0
  Glasgow       0  0  3  0  2  2  2  0  0  1  0
  Inverness     0  0  0  2  1  3  1  2  1  0  0
  Kirkcaldy     0  0  0  0  0  0  1  0  0  0  1
  Paisley       0  0  0  1  1  4  1  1  2  0  0
  Perth         0  0  3  1  2  1  1  1  1  0  0
  Stirling      1  0  0  1  1  1  3  3  0  0  0

Yes: in every dwelling, there is more than one value that’s not a zero.

Therefore we can have random slopes over age by dwelling: instead of just (1 | dwelling), our model will include (1 + age | dwelling).

It might be nice to get R to count how many non-zero values there are for each level of dwelling. This way we can make extra sure that the numbers are bigger than 1.

First we check whether each cell’s value is different from zero:

stats::xtabs(~ dwelling + age, data = lifesatscot) != 0
              age
dwelling          15    20    25    30    35    40    45    50    55    60
  Aberdeen      TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE
  Dumfries     FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
  Dundee       FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
  Dunfermline  FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
  Edinburgh    FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
  Fort William FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  Glasgow      FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
  Inverness    FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
  Kirkcaldy    FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
  Paisley      FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
  Perth        FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
  Stirling      TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
              age
dwelling          65
  Aberdeen     FALSE
  Dumfries     FALSE
  Dundee       FALSE
  Dunfermline  FALSE
  Edinburgh    FALSE
  Fort William FALSE
  Glasgow      FALSE
  Inverness    FALSE
  Kirkcaldy     TRUE
  Paisley      FALSE
  Perth        FALSE
  Stirling     FALSE

And we can pipe this table into rowSums() to add up the number of TRUE values per row. (This works because underlyingly, R stores TRUE as the number 1 and FALSE as the number 0.)

(stats::xtabs(~ dwelling + age, data = lifesatscot) != 0) |>
  rowSums()
    Aberdeen     Dumfries       Dundee  Dunfermline    Edinburgh Fort William 
           5            5            5            6            6            6 
     Glasgow    Inverness    Kirkcaldy      Paisley        Perth     Stirling 
           5            6            2            6            7            6 

So in Aberdeen, for example, we have five distinct values of age, and in Kirkcaldy, we have two. All of these numbers are bigger than 1, so we know we can have random slopes over age by dwelling.

Example 2: CBT and stress

These data are simulated to represent data from 50 participants, each measured at three different time-points (pre, during, and post) on a measure of stress. Participants were randomly allocated such that half received some cognitive behavioural therapy (CBT) treatment, and half did not. This study is interested in assessing whether the two groups (control vs treatment) differ in changes in stress across the three time points.

The data are available at https://uoepsy.github.io/data/stressint.csv.

variable description
ppt Participant Identifier
stress Stress (range 0 to 100)
time Time (Pre/During/Post)
group Whether participant is in the CBT group or control group
stress <- read_csv("https://uoepsy.github.io/data/stressint.csv")

1. Predictors

The RQ asks whether control vs. treatment groups differ in how their stress changes across the three time points. (Alarm bells for an interaction model should be ringing!)

The outcome variable is stress. The predictors are time and group.

2. Grouping variables

The grouping variables are

  • ppt (categorical)
  • time (categorical)
  • group (categorical)
# ppt:
stress |> group_by(ppt) |> count()
# A tibble: 50 x 2
# Groups:   ppt [50]
   ppt       n
   <chr> <int>
 1 ID1       3
 2 ID10      3
 3 ID11      3
 4 ID12      3
 5 ID13      3
 6 ID14      3
 7 ID15      3
 8 ID16      3
 9 ID17      3
10 ID18      3
# i 40 more rows
# time:
stress |> group_by(time) |> count()
# A tibble: 3 x 2
# Groups:   time [3]
  time       n
  <chr>  <int>
1 During    50
2 Post      50
3 Pre       50
# group:
stress |> group_by(group) |> count()
# A tibble: 2 x 2
# Groups:   group [2]
  group         n
  <chr>     <int>
1 Control      75
2 Treatment    75

3. Random intercepts

Thinking about what kind of variability each grouping variable contributes:

  • ppt contributes random variability: we’re not interested in these participants specifically; we want our conclusions to be able to generalise across different people than the ones we’ve studied here. If we ran this study again, we would recruit different participants.
  • time contributes reproducible variability: we’re interested in drawing conclusions specifically about these three time points. If we ran this study again, we would keep the time variable the same.
  • group contributes reproducible variability: the design manipulates whether people are in the control group or treatment (CBT) group, and we care about these levels specifically. If we ran this study again, we would keep the group variable the same.

Because ppt contributes random variability, our model formula must at least include a random intercept by ppt, which looks like (1 | ppt).

4. Random slopes

There’s one randomly-varying grouping variable, ppt, and two predictors, time and group. We’ll check each predictor in turn.

Do at least some levels of ppt appear with more than one value of time?

stats::xtabs(~ ppt + time, data = stress)
      time
ppt    During Post Pre
  ID1       1    1   1
  ID10      1    1   1
  ID11      1    1   1
  ID12      1    1   1
  ID13      1    1   1
  ID14      1    1   1
  ID15      1    1   1
  ID16      1    1   1
  ID17      1    1   1
  ID18      1    1   1
  ID19      1    1   1
  ID2       1    1   1
  ID20      1    1   1
  ID21      1    1   1
  ID22      1    1   1
  ID23      1    1   1
  ID24      1    1   1
  ID25      1    1   1
  ID26      1    1   1
  ID27      1    1   1
  ID28      1    1   1
  ID29      1    1   1
  ID3       1    1   1
  ID30      1    1   1
  ID31      1    1   1
  ID32      1    1   1
  ID33      1    1   1
  ID34      1    1   1
  ID35      1    1   1
  ID36      1    1   1
  ID37      1    1   1
  ID38      1    1   1
  ID39      1    1   1
  ID4       1    1   1
  ID40      1    1   1
  ID41      1    1   1
  ID42      1    1   1
  ID43      1    1   1
  ID44      1    1   1
  ID45      1    1   1
  ID46      1    1   1
  ID47      1    1   1
  ID48      1    1   1
  ID49      1    1   1
  ID5       1    1   1
  ID50      1    1   1
  ID6       1    1   1
  ID7       1    1   1
  ID8       1    1   1
  ID9       1    1   1

Yes: every ppt has one data point for every time. Therefore in principle, we can have random slopes over time by ppt: instead of just (1 | ppt), we can write (1 + time | ppt).

(We’ll see in the troubleshooting flash card that although in principle these random slopes are possible, in practice the model won’t always manage to fit them.)

(stats::xtabs(~ ppt + time, data = stress) != 0 ) |>
  rowSums()
 ID1 ID10 ID11 ID12 ID13 ID14 ID15 ID16 ID17 ID18 ID19  ID2 ID20 ID21 ID22 ID23 
   3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
ID24 ID25 ID26 ID27 ID28 ID29  ID3 ID30 ID31 ID32 ID33 ID34 ID35 ID36 ID37 ID38 
   3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
ID39  ID4 ID40 ID41 ID42 ID43 ID44 ID45 ID46 ID47 ID48 ID49  ID5 ID50  ID6  ID7 
   3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
 ID8  ID9 
   3    3 

Because these values are greater than 1, random slopes over time by ppt are possible in principle.

Now: do at least some levels of ppt appear with more than one value of group, the other predictor?

stats::xtabs(~ ppt + group, data = stress)
      group
ppt    Control Treatment
  ID1        3         0
  ID10       3         0
  ID11       3         0
  ID12       3         0
  ID13       3         0
  ID14       3         0
  ID15       3         0
  ID16       3         0
  ID17       3         0
  ID18       3         0
  ID19       3         0
  ID2        3         0
  ID20       3         0
  ID21       3         0
  ID22       3         0
  ID23       3         0
  ID24       3         0
  ID25       3         0
  ID26       0         3
  ID27       0         3
  ID28       0         3
  ID29       0         3
  ID3        3         0
  ID30       0         3
  ID31       0         3
  ID32       0         3
  ID33       0         3
  ID34       0         3
  ID35       0         3
  ID36       0         3
  ID37       0         3
  ID38       0         3
  ID39       0         3
  ID4        3         0
  ID40       0         3
  ID41       0         3
  ID42       0         3
  ID43       0         3
  ID44       0         3
  ID45       0         3
  ID46       0         3
  ID47       0         3
  ID48       0         3
  ID49       0         3
  ID5        3         0
  ID50       0         3
  ID6        3         0
  ID7        3         0
  ID8        3         0
  ID9        3         0

No! Each participant is in only one group (control or treatment), not both. Therefore we cannot include random slopes over group by ppt. Our random effect for ppt remains (1 + time | ppt).

(stats::xtabs(~ ppt + group, data = stress) != 0 ) |>
  rowSums()
 ID1 ID10 ID11 ID12 ID13 ID14 ID15 ID16 ID17 ID18 ID19  ID2 ID20 ID21 ID22 ID23 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
ID24 ID25 ID26 ID27 ID28 ID29  ID3 ID30 ID31 ID32 ID33 ID34 ID35 ID36 ID37 ID38 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
ID39  ID4 ID40 ID41 ID42 ID43 ID44 ID45 ID46 ID47 ID48 ID49  ID5 ID50  ID6  ID7 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
 ID8  ID9 
   1    1 

Because these values are all equal to 1, random slopes over group by ppt are not possible.

Linked flash cards