| variable | description |
|---|---|
| age | Age (years) |
| lifesat | Life Satisfaction score |
| dwelling | Dwelling (town/city in Scotland) |
| size | Size of Dwelling (> or <100k people) |
Identify possible random effects
The process in words
Use your RQ to figure out what your model’s outcome and predictors are (i.e., your model’s fixed effects).
Identify the grouping variables in your dataset.
Think about the data generating process and figure out which grouping variables contribute random / non-manipulated / non-controlled variability to your data. For every randomly-varying grouping variable, your model must contain a random intercept for that grouping variable.
Refer back to all predictors you identified in Step 1. For each randomly-varying grouping variable you have identified, do at least some of its levels appear with more than one value of each predictor? If yes—if some levels of the grouping variable appear with more than one value of a predictor—then (in addition to the random intercepts) your model can also contain a random slope over that predictor for that grouping variable.
The process in a flowchart
[TODO ranef flowchart]
Example 1: Life satisfaction in Scotland
These data come from 112 people across 12 different Scottish dwellings (cities and towns). Information is captured on their ages and a measure of life satisfaction. The researchers are interested in if there is an association between age and life-satisfaction.
Data are available at https://uoepsy.github.io/data/lmm_lifesatscot.csv.
lifesatscot <- read_csv("https://uoepsy.github.io/data/lmm_lifesatscot.csv")1. Predictors
The RQ asks whether there’s an association betwen age and lifesat. The model’s outcome is lifesat and the predictor is age.
The RQ doesn’t mention size, so we will not consider it a predictor. (In principle, it looks like a plausible covariate, that is, a way to control for how populous each area is.)
2. Grouping variables
The grouping variables are
age(treated as ordinal because each value is spaced five years apart)dwelling(categorical)size(categorical)
# age:
lifesatscot |> group_by(age) |> count()# A tibble: 11 x 2
# Groups: age [11]
age n
<dbl> <int>
1 15 2
2 20 2
3 25 9
4 30 7
5 35 23
6 40 25
7 45 18
8 50 15
9 55 7
10 60 3
11 65 1
# dwelling:
lifesatscot |> group_by(dwelling) |> count()# A tibble: 12 x 2
# Groups: dwelling [12]
dwelling n
<chr> <int>
1 Aberdeen 10
2 Dumfries 10
3 Dundee 10
4 Dunfermline 10
5 Edinburgh 10
6 Fort William 10
7 Glasgow 10
8 Inverness 10
9 Kirkcaldy 2
10 Paisley 10
11 Perth 10
12 Stirling 10
# size:
lifesatscot |> group_by(size) |> count()# A tibble: 2 x 2
# Groups: size [2]
size n
<chr> <int>
1 <100k 72
2 >100k 40
3. Random intercepts
Thinking about what kind of variability each grouping variable contributes:
agecontributes reproducible (manipulated) variability: the researchers have gathered data from a bunch of different ages. We are interested in drawing conclusions specifically about the ages analysed.dwellingcontributes random variability: the RQ doesn’t specify that we manipulate or control for the dwelling place. And we want our results to be able to generalise beyond these twelve dwelling places.sizecontributes reproducible (controlled) variability: the researchers have recorded how big each dwelling is in case this information might otherwise confound the association betweenlifesatandage.
Because dwelling contributes random variability, our model formula must at least include a random intercept by dwelling, which looks like (1 | dwelling).
4. Random slopes
Our model’s only predictor is age, and our only randomly-varying grouping variable is dwelling.
Do at least some levels of dwelling appear with more than one value of age?
stats::xtabs(~ dwelling + age, data = lifesatscot) age
dwelling 15 20 25 30 35 40 45 50 55 60 65
Aberdeen 1 1 0 0 4 2 0 2 0 0 0
Dumfries 0 0 1 0 3 3 0 2 0 1 0
Dundee 0 0 1 1 2 3 3 0 0 0 0
Dunfermline 0 0 0 1 1 2 3 1 2 0 0
Edinburgh 0 1 1 0 3 2 2 1 0 0 0
Fort William 0 0 0 0 3 2 1 2 1 1 0
Glasgow 0 0 3 0 2 2 2 0 0 1 0
Inverness 0 0 0 2 1 3 1 2 1 0 0
Kirkcaldy 0 0 0 0 0 0 1 0 0 0 1
Paisley 0 0 0 1 1 4 1 1 2 0 0
Perth 0 0 3 1 2 1 1 1 1 0 0
Stirling 1 0 0 1 1 1 3 3 0 0 0
Yes: in every dwelling, there is more than one value that’s not a zero.
Therefore we can have random slopes over age by dwelling: instead of just (1 | dwelling), our model will include (1 + age | dwelling).
dwelling
It might be nice to get R to count how many non-zero values there are for each level of dwelling. This way we can make extra sure that the numbers are bigger than 1.
First we check whether each cell’s value is different from zero:
stats::xtabs(~ dwelling + age, data = lifesatscot) != 0 age
dwelling 15 20 25 30 35 40 45 50 55 60
Aberdeen TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE
Dumfries FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE
Dundee FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
Dunfermline FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
Edinburgh FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
Fort William FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
Glasgow FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
Inverness FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
Kirkcaldy FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
Paisley FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
Perth FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
Stirling TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
age
dwelling 65
Aberdeen FALSE
Dumfries FALSE
Dundee FALSE
Dunfermline FALSE
Edinburgh FALSE
Fort William FALSE
Glasgow FALSE
Inverness FALSE
Kirkcaldy TRUE
Paisley FALSE
Perth FALSE
Stirling FALSE
And we can pipe this table into rowSums() to add up the number of TRUE values per row. (This works because underlyingly, R stores TRUE as the number 1 and FALSE as the number 0.)
(stats::xtabs(~ dwelling + age, data = lifesatscot) != 0) |>
rowSums() Aberdeen Dumfries Dundee Dunfermline Edinburgh Fort William
5 5 5 6 6 6
Glasgow Inverness Kirkcaldy Paisley Perth Stirling
5 6 2 6 7 6
So in Aberdeen, for example, we have five distinct values of age, and in Kirkcaldy, we have two. All of these numbers are bigger than 1, so we know we can have random slopes over age by dwelling.
Example 2: CBT and stress
These data are simulated to represent data from 50 participants, each measured at three different time-points (pre, during, and post) on a measure of stress. Participants were randomly allocated such that half received some cognitive behavioural therapy (CBT) treatment, and half did not. This study is interested in assessing whether the two groups (control vs treatment) differ in changes in stress across the three time points.
The data are available at https://uoepsy.github.io/data/stressint.csv.
| variable | description |
|---|---|
| ppt | Participant Identifier |
| stress | Stress (range 0 to 100) |
| time | Time (Pre/During/Post) |
| group | Whether participant is in the CBT group or control group |
stress <- read_csv("https://uoepsy.github.io/data/stressint.csv")1. Predictors
The RQ asks whether control vs. treatment groups differ in how their stress changes across the three time points. (Alarm bells for an interaction model should be ringing!)
The outcome variable is stress. The predictors are time and group.
2. Grouping variables
The grouping variables are
ppt(categorical)time(categorical)group(categorical)
# ppt:
stress |> group_by(ppt) |> count()# A tibble: 50 x 2
# Groups: ppt [50]
ppt n
<chr> <int>
1 ID1 3
2 ID10 3
3 ID11 3
4 ID12 3
5 ID13 3
6 ID14 3
7 ID15 3
8 ID16 3
9 ID17 3
10 ID18 3
# i 40 more rows
# time:
stress |> group_by(time) |> count()# A tibble: 3 x 2
# Groups: time [3]
time n
<chr> <int>
1 During 50
2 Post 50
3 Pre 50
# group:
stress |> group_by(group) |> count()# A tibble: 2 x 2
# Groups: group [2]
group n
<chr> <int>
1 Control 75
2 Treatment 75
3. Random intercepts
Thinking about what kind of variability each grouping variable contributes:
pptcontributes random variability: we’re not interested in these participants specifically; we want our conclusions to be able to generalise across different people than the ones we’ve studied here. If we ran this study again, we would recruit different participants.timecontributes reproducible variability: we’re interested in drawing conclusions specifically about these three time points. If we ran this study again, we would keep thetimevariable the same.groupcontributes reproducible variability: the design manipulates whether people are in the control group or treatment (CBT) group, and we care about these levels specifically. If we ran this study again, we would keep thegroupvariable the same.
Because ppt contributes random variability, our model formula must at least include a random intercept by ppt, which looks like (1 | ppt).
4. Random slopes
There’s one randomly-varying grouping variable, ppt, and two predictors, time and group. We’ll check each predictor in turn.
Do at least some levels of ppt appear with more than one value of time?
stats::xtabs(~ ppt + time, data = stress) time
ppt During Post Pre
ID1 1 1 1
ID10 1 1 1
ID11 1 1 1
ID12 1 1 1
ID13 1 1 1
ID14 1 1 1
ID15 1 1 1
ID16 1 1 1
ID17 1 1 1
ID18 1 1 1
ID19 1 1 1
ID2 1 1 1
ID20 1 1 1
ID21 1 1 1
ID22 1 1 1
ID23 1 1 1
ID24 1 1 1
ID25 1 1 1
ID26 1 1 1
ID27 1 1 1
ID28 1 1 1
ID29 1 1 1
ID3 1 1 1
ID30 1 1 1
ID31 1 1 1
ID32 1 1 1
ID33 1 1 1
ID34 1 1 1
ID35 1 1 1
ID36 1 1 1
ID37 1 1 1
ID38 1 1 1
ID39 1 1 1
ID4 1 1 1
ID40 1 1 1
ID41 1 1 1
ID42 1 1 1
ID43 1 1 1
ID44 1 1 1
ID45 1 1 1
ID46 1 1 1
ID47 1 1 1
ID48 1 1 1
ID49 1 1 1
ID5 1 1 1
ID50 1 1 1
ID6 1 1 1
ID7 1 1 1
ID8 1 1 1
ID9 1 1 1
Yes: every ppt has one data point for every time. Therefore in principle, we can have random slopes over time by ppt: instead of just (1 | ppt), we can write (1 + time | ppt).
(We’ll see in the troubleshooting flash card that although in principle these random slopes are possible, in practice the model won’t always manage to fit them.)
time values for each ppt
(stats::xtabs(~ ppt + time, data = stress) != 0 ) |>
rowSums() ID1 ID10 ID11 ID12 ID13 ID14 ID15 ID16 ID17 ID18 ID19 ID2 ID20 ID21 ID22 ID23
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
ID24 ID25 ID26 ID27 ID28 ID29 ID3 ID30 ID31 ID32 ID33 ID34 ID35 ID36 ID37 ID38
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
ID39 ID4 ID40 ID41 ID42 ID43 ID44 ID45 ID46 ID47 ID48 ID49 ID5 ID50 ID6 ID7
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
ID8 ID9
3 3
Because these values are greater than 1, random slopes over time by ppt are possible in principle.
Now: do at least some levels of ppt appear with more than one value of group, the other predictor?
stats::xtabs(~ ppt + group, data = stress) group
ppt Control Treatment
ID1 3 0
ID10 3 0
ID11 3 0
ID12 3 0
ID13 3 0
ID14 3 0
ID15 3 0
ID16 3 0
ID17 3 0
ID18 3 0
ID19 3 0
ID2 3 0
ID20 3 0
ID21 3 0
ID22 3 0
ID23 3 0
ID24 3 0
ID25 3 0
ID26 0 3
ID27 0 3
ID28 0 3
ID29 0 3
ID3 3 0
ID30 0 3
ID31 0 3
ID32 0 3
ID33 0 3
ID34 0 3
ID35 0 3
ID36 0 3
ID37 0 3
ID38 0 3
ID39 0 3
ID4 3 0
ID40 0 3
ID41 0 3
ID42 0 3
ID43 0 3
ID44 0 3
ID45 0 3
ID46 0 3
ID47 0 3
ID48 0 3
ID49 0 3
ID5 3 0
ID50 0 3
ID6 3 0
ID7 3 0
ID8 3 0
ID9 3 0
No! Each participant is in only one group (control or treatment), not both. Therefore we cannot include random slopes over group by ppt. Our random effect for ppt remains (1 + time | ppt).
group values for each ppt
(stats::xtabs(~ ppt + group, data = stress) != 0 ) |>
rowSums() ID1 ID10 ID11 ID12 ID13 ID14 ID15 ID16 ID17 ID18 ID19 ID2 ID20 ID21 ID22 ID23
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ID24 ID25 ID26 ID27 ID28 ID29 ID3 ID30 ID31 ID32 ID33 ID34 ID35 ID36 ID37 ID38
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ID39 ID4 ID40 ID41 ID42 ID43 ID44 ID45 ID46 ID47 ID48 ID49 ID5 ID50 ID6 ID7
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ID8 ID9
1 1
Because these values are all equal to 1, random slopes over group by ppt are not possible.
Linked flash cards
Outgoing links
- TODO
Backlinks
- TODO