Exercises: Longitudinal Models
Last week when we introduced multilevel models (or “mixed effects models” or whatever we’re calling them!), we saw in the lectures a little bit about the idea of having data points from the same participant over time. This kind of data tends to get termed “longitudinal” (mainly used to refer to studies which follow up participants over the course of months or years). The lectures this week have also introduced this idea of ‘change over time’ by looking at some data from Public Health England.
Let’s work our way through an example.
WeightMaintain Data Codebook
The weight maintenance data (WeightMaintain3
), a made-up data set based on Lowe et al. (2014, Obesity, 22, 94-100), contains information on overweight participants who completed a 12-week weight loss program, and were then randomly assigned to one of three weight maintenance conditions:
- None (Control)
- MR (meal replacements): use MR to replace one meal and snack per day
- ED (energy density intervention): book and educational materials on purchasing and preparing foods lower in ED (reducing fat content and/or increasing water content of foods)
Weight was assessed on day 1 of maintenance, 12 months post, 24 months post, and 36 months post.
It is available, in .rda format, at https://uoepsy.github.io/data/WeightMaintain3.rda
| Variable | Description |
|---|---|
| ID | Participant ID |
| Condition | Weight Maintenance Condition (‘None’ = No maintenance program, ‘MR’ = Meal replacement, ‘ED’ = Energy Density intervention) |
| Assessment | Assessment number (0 = Day 1, 1 = 12 months, 2 = 24 months, 3 = 36 months) |
| WeightChange | Difference in weight (lbs) from end of 12-week weight loss program |
Question A1
Load the data, and take a look at what is in there. Hopefully it should match the description above.
Hint: load(url("https://uoepsy.github.io/data/WeightMaintain3.rda"))
Solution
load(url("https://uoepsy.github.io/data/WeightMaintain3.rda"))
summary(WeightMaintain3)
## ID Condition Assessment WeightChange
## 101 : 4 None:240 Min. :0.00 Min. :-8.3781
## 102 : 4 ED :240 1st Qu.:0.75 1st Qu.:-0.5024
## 103 : 4 MR :240 Median :1.50 Median : 0.7050
## 104 : 4 Mean :1.50 Mean : 1.4438
## 105 : 4 3rd Qu.:2.25 3rd Qu.: 2.8806
## 106 : 4 Max. :3.00 Max. :14.9449
## (Other):696
head(WeightMaintain3)
## ID Condition Assessment WeightChange
## 1 101 None 0 -0.116138529
## 2 101 None 1 0.333508907
## 3 101 None 2 1.678653492
## 4 101 None 3 2.756023400
## 5 102 None 0 -0.004420188
## 6 102 None 1 1.746725487
Question A2
Q: Overall, did the participants maintain their weight loss or did their weights change?
We need to remember that each of our participants has measurements at 4 assessments. We have randomly sampled participants, and then within them have measured multiple observations. So our observations are not independent. We’re not interested in estimating differences between specific participants - our participants are just a random sample of people. But we do want to account for the dependency they introduce in our data. This is why we would want to fit a multilevel model and incorporate participant-level variation into our model structure.
- Fit an “intercept-only” model.
- Fit a model with weight change predicted by assessment.
- Compare the two models (use
anova(model1, model2)
to conduct a likelihood ratio test).
Things to think about:
- We cannot compare models that differ in both the fixed and random parts.
- For now, ignore messages saying
boundary (singular) fit: see ?isSingular
(that comes in a couple of questions’ time).
Solution
This is our null model:
library(lme4)  # if not already loaded
m.null <- lmer(WeightChange ~ 1 + (1 | ID), data=WeightMaintain3)
summary(m.null)
## Linear mixed model fit by REML ['lmerMod']
## Formula: WeightChange ~ 1 + (1 | ID)
## Data: WeightMaintain3
##
## REML criterion at convergence: 3601.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4349 -0.5330 -0.1154 0.4255 3.7017
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 3.765 1.940
## Residual 6.431 2.536
## Number of obs: 720, groups: ID, 180
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1.4438 0.1728 8.357
We can see that 3.77 / (3.77 + 6.43), or 0.37, of the total variance is attributable to participant-level variation.
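If you would rather compute this proportion (often called the intra-class correlation, ICC) from the fitted model than by hand, a quick sketch is:
vc <- as.data.frame(VarCorr(m.null))
vc$vcov[1] / sum(vc$vcov)  # participant-level variance / total variance, approx 0.37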
Now let’s suppose we want to compare this null model with a model with an effect of Assessment
(to assess whether there is overall change over time).
Which model should we compare m.null
to?
modA <- lmer(WeightChange ~ 1 + Assessment + (1 + Assessment | ID), data=WeightMaintain3)
modB <- lmer(WeightChange ~ 1 + Assessment + (1 | ID), data=WeightMaintain3)
A comparison between m.null
and modA
will not be assessing the influence of only the fixed effect of Assessment. Remember, we shouldn’t compare models with different random effect structures.
However, modB
doesn’t include our by-participant random effects of assessment, so comparing this to m.null
is potentially going to misattribute random deviations in participants’ change to being an overall effect of assessment.
If we want to conduct a model comparison to isolate the effect of overall change over time (a fixed effect of Assessment
), we might want to compare these two models:
m.base0 <- lmer(WeightChange ~ 1 + (1 + Assessment | ID), data=WeightMaintain3)
m.base <- lmer(WeightChange ~ 1 + Assessment + (1 + Assessment | ID), data=WeightMaintain3)
The first of these models is a bit weird to think about - how can we have by-participant random deviations of Assessment
if we don’t have a fixed effect of Assessment
? That makes very little sense. What it is actually fitting is a model where there is assumed to be no overall effect of Assessment. So the fixed effect is 0.
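One quick way to see this is to inspect the fitted m.base0 (nothing new here, just standard lme4 extractor functions):
fixef(m.base0)           # only an intercept - the overall Assessment slope is fixed at 0
head(ranef(m.base0)$ID)  # but each participant still gets their own Assessment deviation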
# Straightforward LRT
anova(m.base0, m.base)
## Data: WeightMaintain3
## Models:
## m.base0: WeightChange ~ 1 + (1 + Assessment | ID)
## m.base: WeightChange ~ 1 + Assessment + (1 + Assessment | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m.base0 5 2638.0 2660.9 -1314.0 2628.0
## m.base 6 2579.4 2606.8 -1283.7 2567.4 60.66 1 6.782e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This suggests that the inclusion of Assessment does improve model fit, indicating that participants’ weights changed over the course of the 36-month assessment period.
Question A3
Q: Did the experimental condition groups differ in overall weight change and rate of weight change (non-maintenance)?
Hint: It helps to break it down. There are two questions here:
- do groups differ overall?
- do groups differ over time?
We can begin to see that we’re asking two questions about the Condition
variable here: “is there an effect of Condition?” and “Is there an interaction between Assessment and Condition?”
Try fitting two more models which incrementally build these levels of complexity, and compare them (perhaps to one another, perhaps to models from the previous question - think about what each comparison is testing!)
Solution
m.int <- lmer(WeightChange ~ Assessment + Condition + (1 + Assessment | ID),
data=WeightMaintain3)
m.full <- lmer(WeightChange ~ Assessment*Condition + (1 + Assessment | ID),
data=WeightMaintain3)
We’re going to compare each model to the previous one to examine the improvement in fit due to inclusion of each parameter.
We could do this quickly with
anova(m.base0, m.base, m.int, m.full)
## Data: WeightMaintain3
## Models:
## m.base0: WeightChange ~ 1 + (1 + Assessment | ID)
## m.base: WeightChange ~ 1 + Assessment + (1 + Assessment | ID)
## m.int: WeightChange ~ Assessment + Condition + (1 + Assessment | ID)
## m.full: WeightChange ~ Assessment * Condition + (1 + Assessment | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m.base0 5 2638.0 2660.9 -1314.0 2628.0
## m.base 6 2579.4 2606.8 -1283.7 2567.4 60.6605 1 6.782e-15 ***
## m.int 8 2573.9 2610.6 -1279.0 2557.9 9.4418 2 0.008907 **
## m.full 10 2537.5 2583.3 -1258.8 2517.5 40.3814 2 1.703e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Conditions differed overall in weight change \(\chi^2(2)=9.4, p = .009\)
- Conditions differed in change over assessment period \(\chi^2(2)=40.4, p < .001\)
boundary (singular) fit: see ?isSingular
Okay. Let’s talk about those “singular fits” messages we keep getting.
By now, you have hopefully fitted a number of models which incrementally add predictors. Ours are below:
m.base0 <- lmer(WeightChange ~ 1 + (1 + Assessment | ID), data=WeightMaintain3)
m.base <- lmer(WeightChange ~ Assessment + (1 + Assessment | ID), data=WeightMaintain3)
m.int <- lmer(WeightChange ~ Assessment + Condition + (1 + Assessment | ID), data=WeightMaintain3)
m.full <- lmer(WeightChange ~ Assessment * Condition + (1 + Assessment | ID), data=WeightMaintain3)
And many of these models were singular fits, and we just ignored them. We shouldn’t have.
What is the warning message telling us?
The warning is telling us that our model has resulted in a ‘singular fit.’ The easiest way to think of this is that the model is ‘overfitted’ - there is not enough variation in our data for all parts of the model to be estimated properly.
What can we do?
In many cases, perhaps the most intuitive advice would be to remove the most complex part of the random effects structure (i.e. random slopes). This leads to a simpler model that is not over-fitted. In other words, start simplifying from the top (where the most complexity is) to the bottom (where the lowest complexity is).
Additionally, when variance estimates are very low for a specific random effect term, this indicates that the model is not estimating this parameter to differ much between the levels of your grouping variable. It might, in some experimental designs, be perfectly acceptable to remove this or simply include it as a fixed effect.
A key point here is that when fitting a mixed model, we should think about how the data are generated. Asking yourself questions such as “do we have good reason to assume subjects might vary over time, or to assume that they will have different starting points (i.e., different intercepts)?” can help you in specifying your random effect structure.
You can read in depth about what this means by reading the help documentation for ?isSingular
. For our purposes, a relevant section is copied below:
… intercept-only models, or 2-dimensional random effects such as intercept + slope models, singularity is relatively easy to detect because it leads to random-effect variance estimates of (nearly) zero, or estimates of correlations that are (almost) exactly -1 or 1.
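So a quick way to investigate any of the models above that gave the warning is to look at the estimated random-effect variances and correlations, for example:
# near-zero variances, or correlations of exactly -1 or 1, are the tell-tale signs
VarCorr(m.full)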
Question A4
Re-read the description of the data, then ask yourself this question:
Do we think participants will vary in:
- their starting weight differences
(1|ID)
?
- their weight change over the course of the assessment period
(0 + Assessment | ID)
?
- both
(1 + Assessment | ID)
?
Hint: What do we think the baseline weight change should be? Should it be the same for everyone? If so, might we want to remove the random intercept, which we do by setting it to 0?
Can you re-fit your models without encountering singular fits?
Solution
In this very specific research design, it may actually make a lot of sense to not model by-participant variation around the intercept.
As baseline is at the very start of the weight maintenance, it makes sense that we wouldn’t have very much (if any) participant variation in change at this point.
Note that by removing the estimation of this parameter, our models no longer give a singular fit!
m.base0 <- lmer(WeightChange ~ 1 + (0 + Assessment | ID), data=WeightMaintain3)
m.base <- lmer(WeightChange ~ 1 + Assessment + (0 + Assessment | ID), data=WeightMaintain3)
m.int <- lmer(WeightChange ~ 1 + Assessment + Condition + (0 + Assessment | ID), data=WeightMaintain3)
m.full <- lmer(WeightChange ~ 1 + Assessment * Condition + (0 + Assessment | ID), data=WeightMaintain3)
We can check if a model has a singular fit:
isSingular(m.base)
## [1] FALSE
Question A5
Make a graph of the model fit and the observed means and standard errors at each time point for each condition.
Try using the effects package (hint, does this help: as.data.frame(effect("Assessment:Condition", model))
?)
Solution
library(effects)
library(ggplot2)  # if not already loaded (e.g. via library(tidyverse))
ef <- as.data.frame(effect("Assessment:Condition", m.full))
ggplot(ef, aes(Assessment, fit, color=Condition)) +
geom_line() +
stat_summary(data=WeightMaintain3, aes(y=WeightChange),
fun.data=mean_se, geom="pointrange", size=1) +
theme_bw()
Question A6
Now let’s move to interpreting the coefficients.
Remember, we can get the coefficients using fixef(model)
.
We can also use tidy(model)
, and similar to models fitted with lm()
, we can pull out the bit of the summary()
using:
summary(model)$coefficients
From your model from the previous question which investigates whether conditions differed in their rate of weight change, examine the parameter estimates and interpret them (i.e., what does each parameter represent?)
Solution
round(coef(summary(m.full)), 3)
## Estimate Std. Error t value
## (Intercept) 0.060 0.098 0.616
## Assessment 1.849 0.181 10.202
## ConditionED -0.143 0.139 -1.032
## ConditionMR -0.149 0.139 -1.078
## Assessment:ConditionED -1.749 0.256 -6.825
## Assessment:ConditionMR -0.836 0.256 -3.262
(Intercept)
==> weight change at baseline in None group
Assessment
==> slope of weight change in None group
ConditionED
==> baseline weight change in ED group relative to None group
ConditionMR
==> baseline weight change in MR group relative to None group
Assessment:ConditionED
==> slope of weight change in ED group relative to None group
Assessment:ConditionMR
==> slope of weight change in MR groups relative to None group
Question A7
Can you state how the conditions differed?
Solution
summary(m.full)$coefficients
## Estimate Std. Error t value
## (Intercept) 0.06038642 0.09800169 0.6161773
## Assessment 1.84917936 0.18125403 10.2021423
## ConditionED -0.14303302 0.13859533 -1.0320191
## ConditionMR -0.14944649 0.13859533 -1.0782939
## Assessment:ConditionED -1.74949968 0.25633190 -6.8251344
## Assessment:ConditionMR -0.83624053 0.25633190 -3.2623350
Compared to no intervention, weight (re)gain was 1.75 lbs/year slower for the ED intervention and 0.84 lbs/year slower for the MR intervention.
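One way to see where these numbers come from is to add up the relevant fixed effects to get the model-implied slope (change per assessment interval, i.e. per year) in each condition - a quick sketch using m.full:
fe <- fixef(m.full)
fe["Assessment"]                                 # None: ~1.85 lbs regained per year
fe["Assessment"] + fe["Assessment:ConditionED"]  # ED:   ~0.10 lbs per year
fe["Assessment"] + fe["Assessment:ConditionMR"]  # MR:   ~1.01 lbs per year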
Exercises: Logistic MLM
Don’t forget to look back at other materials!
Back in USMR, we introduced logistic regression in week 10. The lectures followed the example of some singing aliens that either survived or were splatted, and the exercises used some simulated data based on a hypothetical study about inattentional blindness.
That content will provide a lot of the groundwork for this week, so we recommend revisiting it if you feel like it might be useful.
lmer() >> glmer()
Remember how we simply used glm()
and could specify the family = "binomial"
in order to fit a logistic regression? Well it’s much the same thing for multi-level models!
- Gaussian model:
lmer(y ~ x1 + x2 + (1 | g), data = data)
- Binomial model:
glmer(y ~ x1 + x2 + (1 | g), data = data, family = binomial(link='logit'))
- or just
glmer(y ~ x1 + x2 + (1 | g), data = data, family = "binomial")
- or
glmer(y ~ x1 + x2 + (1 | g), data = data, family = binomial)
Binary? Binomial? Huh?
Very briefly in the USMR materials we made the distinction between binary and binomial. This wasn’t hugely relevant at the time just because we only looked at binary outcomes.
For binary regression, all the values in our outcome variable have to be either 0 or 1.
For example, the correct
variable below:
| participant | question | correct |
|---|---|---|
| 1 | 1 | 1 |
| 1 | 2 | 0 |
| 1 | 3 | 1 |
| ... | ... | ... |
But we can re-express this information in a different way, when we know the total number of questions asked:
| participant | questions_correct | questions_incorrect |
|---|---|---|
| 1 | 2 | 1 |
| 2 | 1 | 2 |
| 3 | 3 | 0 |
| ... | ... | ... |
To model data when it is in this form, we can express our outcome as cbind(questions_correct, questions_incorrect).
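For instance, here is a sketch of how binary (one row per question) data could be aggregated into binomial form, assuming a hypothetical long-format data frame called trials with columns participant and correct:
library(dplyr)
trials_binomial <- trials %>%
  group_by(participant) %>%
  summarise(
    questions_correct   = sum(correct),
    questions_incorrect = sum(correct == 0)
  )
# which could then be modelled with something like:
# glmer(cbind(questions_correct, questions_incorrect) ~ ... , family = binomial)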
Novel Word Learning: Data Codebook
load(url("https://uoepsy.github.io/msmr/data/nwl.RData"))
In the nwl
data set (accessed using the code above), participants with aphasia are separated into two groups based on the general location of their brain lesion: anterior vs. posterior. There is data on the numbers of correct and incorrect responses participants gave in each of a series of experimental blocks. There were 7 learning blocks, immediately followed by a test. Finally, participants also completed a follow-up test.
Data were also collected from healthy controls.
Figure 1 shows the differences between groups in the average proportion of correct responses at each point in time (i.e., each block, test, and follow-up).
| Variable | Description |
|---|---|
| group | Whether participant is a stroke patient (‘patient’) or a healthy control (‘control’) |
| lesion_location | Location of brain lesion: anterior vs posterior |
| block | Experimental block (1-9). Blocks 1-7 were learning blocks, immediately followed by a test in block 8. Block 9 was a follow-up test at a later point |
| PropCorrect | Proportion of 30 responses in a given block that the participant got correct |
| NumCorrect | Number of responses (out of 30) in a given block that the participant got correct |
| NumError | Number of responses (out of 30) in a given block that the participant got incorrect |
| ID | Participant Identifier |
| Phase | Experimental phase, corresponding to experimental block(s): ‘Learning’, ‘Immediate’, ‘Follow-up’ |
Question B1
Load the data. Take a look around. Any missing values? Can you think of why?
Solution
load(url("https://uoepsy.github.io/msmr/data/nwl.RData"))
summary(nwl)
## group lesion_location block PropCorrect NumCorrect
## control:126 anterior : 45 Min. :1 Min. :0.2000 Min. : 6.00
## patient:117 posterior: 63 1st Qu.:3 1st Qu.:0.5333 1st Qu.:16.00
## NA's :135 Median :5 Median :0.7000 Median :21.00
## Mean :5 Mean :0.6822 Mean :20.47
## 3rd Qu.:7 3rd Qu.:0.8333 3rd Qu.:25.00
## Max. :9 Max. :1.0000 Max. :30.00
##
## NumError ID Phase
## Min. : 0.000 control1 : 9 Length:243
## 1st Qu.: 5.000 control10: 9 Class :character
## Median : 9.000 control11: 9 Mode :character
## Mean : 9.535 control12: 9
## 3rd Qu.:14.000 control13: 9
## Max. :24.000 control14: 9
## (Other) :189
The only missing values are in the lesion location, and it’s probably because the healthy controls don’t have any lesions.
Question B2
Our broader research aim today is to compare two groups (those with anterior vs. posterior lesions) with respect to their accuracy of responses over the course of the study.
- What is our outcome?
Hint: Think carefully: there might be several variables which either fully or partly express the information we are considering the “outcome” here.
Solution
- The outcome here is (in words) “the proportion of correct answers in each block.” This makes it tempting to look straight to the variable called PropCorrect. Unfortunately, this is encoded as a proportion (i.e., it is bounded by 0 and 1), and we don’t yet have the necessary tools to model it directly with the models we have learned about so far.
- linear regression = continuous unbounded outcome variable
- logistic regression = binomial, expressed as
0
or 1
if the number of trials per observation is 1, and expressed as cbind(num_successes, num_failures)
if the number of trials per observation is >1.
Question B3
Research Question 1:
Is the learning rate (training blocks) different between these two groups?
Hints:
Do we want cbind(num_successes, num_failures)
?
Make sure we’re running models on only the data we are actually interested in.
Think back to what we did in the exercises above using model comparison via likelihood ratio tests (using anova(model1, model2, model3, ...)). We could use this approach for this question, to compare:
- A model with just the change over the sequence of blocks
- A model with the change over the sequence of blocks and an overall difference between groups
- A model with groups differing with respect to their change over the sequence of blocks
What about the random effects part?
- What are our observations grouped by?
- What variables can vary within these groups?
- What do you want your model to allow to vary within these groups?
Suggested answers to the hints if you don’t know where to start
- Make sure we’re running models on only the data we are actually interested in.
- We want only the learning blocks, and none of the healthy controls.
- You might want to store this data in a separate object, but in the code for the solution we will just use
filter()
inside the glmer()
.
- A model with just the change over the sequence of blocks:
- outcome ~ block
- A model with the change over the sequence of blocks and an overall difference between groups:
- outcome ~ block + lesion_location
- A model with groups differing with respect to their change over the sequence of blocks:
- outcome ~ block * lesion_location
- What are our observations grouped by?
- repeated measures by-participant. i.e., the
ID
variable
- What variables can vary within these groups?
Block
and Phase
. Be careful though - you can create the Phase
variable out of the Block
variable, so really this is just one piece of information, encoded differently in two variables.
- The other variables (
lesion_location
and group
) do not vary for each ID. Lesions don’t suddenly change where they are located, nor do participants swap between being a patient vs a control (we don’t need the group variable anyway as we are excluding the controls).
What do you want your model to allow to vary within these groups?
- Do you think the change over the course of the blocks is the same for everybody? Or do you think it varies? Is this variation important to think about in terms of your research question?
Solution
m.base <- glmer(cbind(NumCorrect, NumError) ~ block + (block | ID),
data = filter(nwl, block < 8, !is.na(lesion_location)),
family=binomial)
m.loc0 <- glmer(cbind(NumCorrect, NumError) ~ block + lesion_location + (block | ID),
data=filter(nwl, block < 8, !is.na(lesion_location)),
family=binomial)
m.loc1 <- glmer(cbind(NumCorrect, NumError) ~ block * lesion_location + (block | ID),
data=filter(nwl, block < 8, !is.na(lesion_location)),
family=binomial)
#summary(m.loc1)
anova(m.base, m.loc0, m.loc1, test="Chisq")
## Data: filter(nwl, block < 8, !is.na(lesion_location))
## Models:
## m.base: cbind(NumCorrect, NumError) ~ block + (block | ID)
## m.loc0: cbind(NumCorrect, NumError) ~ block + lesion_location + (block | ID)
## m.loc1: cbind(NumCorrect, NumError) ~ block * lesion_location + (block | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m.base 5 454.12 466.27 -222.06 444.12
## m.loc0 6 454.66 469.25 -221.33 442.66 1.4572 1 0.2274
## m.loc1 7 454.47 471.48 -220.23 440.47 2.1974 1 0.1382
No significant difference in learning rate between groups (\(\chi^2(1)=2.2, p = 0.138\)).
Question B4
Research Question 2
In the testing phase, does performance on the immediate test differ between lesion location groups, and does their retention from immediate to follow-up test differ?
Let’s try a different approach to this. Instead of fitting various models and comparing them via likelihood ratio tests, just fit the one model which could answer both parts of the question above.
Hints:
- This might require a bit more data-wrangling beforehand. Think about the order of your factor levels
(alphabetically speaking, “Follow-up” comes before “Immediate”)!
Solution
nwl_test <- filter(nwl, block > 7, !is.na(lesion_location)) %>%
mutate(
Phase = fct_relevel(factor(Phase),"Immediate")
)
m.recall.loc <- glmer(cbind(NumCorrect, NumError) ~ Phase * lesion_location + (Phase | ID),
nwl_test, family="binomial")
summary(m.recall.loc)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(NumCorrect, NumError) ~ Phase * lesion_location + (Phase |
## ID)
## Data: nwl_test
##
## AIC BIC logLik deviance df.resid
## 142.6 150.9 -64.3 128.6 17
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.31708 -0.45014 0.03291 0.46924 1.09355
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 0.47660 0.6904
## PhaseFollow-up 0.02539 0.1593 -1.00
## Number of obs: 24, groups: ID, 12
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.11375 0.35128 -0.324 0.746
## PhaseFollow-up -0.02452 0.24634 -0.100 0.921
## lesion_locationposterior 0.99918 0.46868 2.132 0.033 *
## PhaseFollow-up:lesion_locationposterior -0.25437 0.33904 -0.750 0.453
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) PhsFl- lsn_lc
## PhaseFllw-p -0.578
## lsn_lctnpst -0.750 0.435
## PhsFllw-p:_ 0.421 -0.729 -0.589
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
Interpreting coefficients in logistic regression
Take some time to remind yourself from USMR of the interpretation of logistic regression coefficients.
The interpretation of the fixed effects of a logistic multi-level model is not very different.
We can obtain the fixed effects from our model by using:
fixef(model)
summary(model)$coefficients
coef(summary(model))
tidy(model)
from the broom.mixed package
- (there are probably more ways, but I can’t think of them right now!)
It’s just that for multi-level models, we can model by-cluster random variation around these effects.
Question B5
- In
family = binomial(link='logit')
. What function is used to relate the linear predictors in the model to the expected value of the response variable?
- How do we convert this into something more interpretable?
Solution
The link function is the logit
, or log-odds (other link functions are available).
To convert log-odds to odds, we can use exp()
, to get odds and odds ratios.
Question B6
Make sure you pay attention to trying to interpret each fixed effect from your models.
These can be difficult, especially when it’s logistic, and especially when there are interactions.
- What is the increase in the odds of answering correctly in the immediate test for someone with a posterior lesion compared to someone with an anterior lesion?
Solution
(Intercept)
==> Anterior lesion group performance in immediate test. This is the log-odds of them answering correctly in the immediate test.
PhaseFollow-up
==> Change in performance (for the anterior lesion group) from immediate to follow-up test.
lesion_locationposterior
==> Posterior lesion group performance in immediate test relative to anterior lesion group performance in immediate test
PhaseFollow-up:lesion_locationposterior
==> Change in performance from immediate to follow-up test, posterior lesion group relative to anterior lesion group
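The number below can be reproduced by exponentiating the relevant fixed effect (a quick sketch, using the m.recall.loc model fitted above):
exp(fixef(m.recall.loc))["lesion_locationposterior"]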
## lesion_locationposterior
## 2.716057
Those with posterior lesions have 2.72 times the odds of answering correctly in the immediate test compared to someone with an anterior lesion.
OPTIONAL: B7
Recreate the visualisation in Figure 2.
Solution
ggplot(filter(nwl, !is.na(lesion_location)), aes(block, PropCorrect,
color=lesion_location,
shape=lesion_location)) +
#geom_line(aes(group=ID),alpha=.2) +
stat_summary(fun.data=mean_se, geom="pointrange") +
stat_summary(data=filter(nwl, !is.na(lesion_location), block <= 7),
fun=mean, geom="line") +
geom_hline(yintercept=0.5, linetype="dashed") +
geom_vline(xintercept=c(7.5, 8.5), linetype="dashed") +
scale_x_continuous(breaks=1:9, labels=c(1:7, "Test", "Follow-Up")) +
theme_bw(base_size=10) +
labs(x="Block", y="Proportion Correct", shape="Lesion\nLocation", color="Lesion\nLocation")
Coding Schemes for Categorical Predictors
Remember that categorical predictors get inputted into our model as a series of variables which are 0s and 1s? Previously the variable in our model for lesion location was actually being coded with 0 representing one level, and 1 representing the other. This is known as “treatment coding.”
There are lots of other ways we might encode our categorical predictors. One common approach is to use “effects coding.” In the case where we have a binary predictor, this makes zero the mid-point between the two - i.e., the overall mean (this is a bit like mean-centering a continuous predictor).
If we recall that the intercept is “when all predictors are zero,” and that when we have an interaction Y~A+B+A*B
in our model, the individual coefficients for A
and B
are estimated “when the other variable is zero,” then we can start to understand how these different ways of encoding our categorical predictors can change what we are getting out of our model. (Note, they don’t actually change anything about the model fit, but they change what information we are estimating from the model).
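To see what the two schemes actually look like for a two-level factor, you can print the contrast matrices that base R uses:
contr.treatment(2)  # treatment (dummy) coding: reference level = 0, other level = 1
contr.sum(2)        # effects (sum) coding: +1 / -1, so zero sits at the midpoint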
If you want to understand this a little more, take a look here.
Question B8
This is the code that we used to answer question B4 above, only we have edited it so that lesion location is fitted with “effects coding.”
nwl_test <- filter(nwl, block > 7, !is.na(lesion_location)) %>%
mutate(
Phase = fct_relevel(factor(Phase),"Immediate")
)
m.recall.loc.effcoding <- glmer(cbind(NumCorrect, NumError) ~ Phase * lesion_location + (Phase | ID),
contrasts = list(lesion_location = "contr.sum"),
nwl_test, family="binomial")
The interpretation of this is going to get pretty tricky - we have a logistic regression, we have a different coding scheme for our categorical predictor, and we have an interaction.. 🤯
Can you work out the interpretation of the fixed effects estimates?
Solution
(Intercept)
==> Overall performance in immediate test. This is the overall log-odds of answering correctly in the immediate test.
PhaseFollow-up
==> Average change in performance from immediate to follow-up test.
lesion_location1
==> Anterior lesion group performance in immediate test relative to overall average performance in immediate test
PhaseFollow-up:lesion_location1
==> Change in performance from immediate to follow-up test, anterior lesion group relative to overall average
How do we know that lesion_location1
is the anterior and not the posterior lesion group?
We need to check what the contrasts look like:
contrasts(nwl_test$lesion_location) <- "contr.sum"
contrasts(nwl_test$lesion_location)
## [,1]
## anterior 1
## posterior -1
Because there are only two levels to this variable, the estimate will simply flip sign (positive/negative) depending on which way the contrast is leveled.
Optional: I liked my coefficients being named properly
colnames(contrasts(nwl_test$lesion_location)) <- "PeppaPig"
contrasts(nwl_test$lesion_location)
## PeppaPig
## anterior 1
## posterior -1
modeltest <- glmer(cbind(NumCorrect, NumError) ~ Phase * lesion_location + (Phase | ID),
nwl_test, family="binomial")
summary(modeltest)$coefficients
## Estimate Std. Error z value
## (Intercept) 0.3858449 0.2341574 1.6478017
## PhaseFollow-up -0.1517019 0.1688879 -0.8982400
## lesion_locationPeppaPig -0.4995882 0.2343374 -2.1319179
## PhaseFollow-up:lesion_locationPeppaPig 0.1271798 0.1695166 0.7502495
## Pr(>|z|)
## (Intercept) 0.09939336
## PhaseFollow-up 0.36905763
## lesion_locationPeppaPig 0.03301360
## PhaseFollow-up:lesion_locationPeppaPig 0.45310448
Hang on.. p-values are back?!
We noted at the end of last week that we don’t have p-values for lmer()
, but you might notice that we do now have them when we’ve fitted a model with glmer()
?
The reason is partly just one of convention. There is a standard practice for determining the statistical significance of parameters in the generalised linear model, relying on asymptotic Wald tests (each estimate is divided by its standard error, and the resulting z statistic is compared against a normal distribution).
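For example, the Pr(>|z|) column in the glmer() summary above can be reproduced from the Wald z statistics (a quick sketch using m.recall.loc from Question B4):
z <- coef(summary(m.recall.loc))[, "z value"]
2 * pnorm(abs(z), lower.tail = FALSE)  # two-tailed p-values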
Reading: Inference in MLM
In USMR, we fitted various simple linear models using lm()
, and we could get out either a table of coefficients (for example, Table 1) or a table of the reduction in sums of squared residuals (for example, Table 2). In both of these cases we had a nice set of p-values for us to look at (in R, we often find the p-values in a column named something like Pr(>F)).
A quick reminder: a p-value is the probability of observing results as or more extreme than the data, if the data were really generated by a hypothesised chance process.
Table 1: t-tests for slope coefficients for predictors in a linear model
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 5.3703775 | 4.3205141 | 1.242995 | 0.2238259 |
| outdoor_time | 0.5923673 | 0.1689445 | 3.506284 | 0.0014995 |
| social_int | 1.8034489 | 0.2690982 | 6.701825 | 0.0000002 |
Table 2: F-tests for predictors in a linear model
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| outdoor_time | 1 | 1427.938 | 1427.9375 | 37.77483 | 1.1e-06 |
| social_int | 1 | 1697.825 | 1697.8249 | 44.91446 | 2.0e-07 |
| Residuals | 29 | 1096.238 | 37.8013 | NA | NA |
We could get p-values for these models because we know what the distribution of the test statistics of interest (e.g. \(F\)-statistics and \(t\)-statistics) will look like if the null hypothesis were true (i.e., we can describe what they look like in terms of \(F\) and \(t\) distributions with specific degrees of freedom - look back to the USMR materials for the formula).
Unfortunately, the same is not true of multilevel models. For the multilevel model, we can think of it as having residuals at multiple levels: we have the random variation of clusters around the fixed effects, and then random variation of observations around the cluster-level effects (more on this in Week 4).
In the rare occasion that you have a perfectly balanced experimental design, then ratios of sums of squares for multi-level models follow an \(F\)-distribution, in which we know the numerator and denominator degrees of freedom (this means we can work out the degrees of freedom for the \(t\)-test of our fixed effect parameters). Unfortunately, in the real world where things are not often perfectly balanced, determining the denominator degrees of freedom becomes unclear.
Last week, we mentioned a couple of approaches that we might take for drawing inferences, finding ways to compute p-values or confidence intervals. We’re going to now extend this to include a few more, and discuss the strengths and limitations of the different approaches.
For these examples…
For the following examples, we’re going to return to our dataset of various toys, and we are going to be concerned with whether practice (the hrs_week
variable) is associated with changes in reading ages (R_AGE
variable).
To accommodate for the clustered nature of the data, we are going to fit a model with both intercepts and slopes varying by toy-type.
toys_read <- read_csv("https://uoepsy.github.io/data/toyexample.csv")
full_model <- lmer(R_AGE ~ hrs_week + (1 + hrs_week | toy_type), data = toys_read)
Use a normal approximation (not advisable)
Remember that the \(t\) distribution starts to look more and more like the \(z\) (“normal”) distribution when degrees of freedom increase? We could just assume we have infinite degrees of freedom in our test statistics, and pretend that the \(t\)-values we get are actually \(z\)-values. This is “anti-conservative” inasmuch as it is not a very cautious approach, and we are likely to have a higher false positive rate (e.g. more chance of saying “there is an effect!” when there actually isn’t).
coefs <- as.data.frame(summary(full_model)$coefficients)
coefs$p.z <- 2 * (1 - pnorm(abs(coefs[,3])))
coefs
## Estimate Std. Error t value p.z
## (Intercept) 1.755943 0.9610348 1.827138 0.0676790647
## hrs_week 1.143508 0.2956875 3.867286 0.0001100533
Satterthwaite df approximation
There have been a couple of methods proposed to estimate the degrees of freedom in order to provide a better approximation to the null distribution of our tests. The way the Satterthwaite method has been implemented in R will just add a column for p-values to your summary(model)
output.
Load the lmerTest package, refit the model, and voila!
library(lmerTest)
full_model <- lmer(R_AGE ~ hrs_week + (1 + hrs_week | toy_type), data = toys_read)
summary(full_model)$coefficients
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.755943 0.9610348 16.19251 1.827138 0.086168780
## hrs_week 1.143508 0.2956875 17.51503 3.867286 0.001178702
Reporting
To account for the extra uncertainty brought by the inclusion of random effects in the model, the degrees of freedom in the coefficients tests have been corrected via Satterthwaite’s method.
…
…
Weekly hours of reading practice was associated with increased reading age (\(\beta = 1.14,\ SE = 0.30,\ t(17.52^*) = 3.87,\ p = .001\)).
Note: if you have the lmerTest package loaded, then all the models you fit with lmer()
will show p-values! If you want to stop this, then you will have to detach/unload the package, and refit the model.
detach("package:lmerTest", unload=TRUE)
Kenward-Roger df approximations
The Kenward-Roger approach is slightly more conservative than the Satterthwaite method, and has been implemented for model comparison between a full model and a restricted model (a model without the parameter of interest), using the KR adjustment for the denominator degrees of freedom in the \(F\)-test.
For this, models must be fitted with REML, not ML. The function KRmodcomp()
will take care of this and re-fit them for you.
library(pbkrtest)
restricted_model <- lmer(R_AGE ~ 1 + (1 + hrs_week | toy_type), data = toys_read)
full_model <- lmer(R_AGE ~ hrs_week + (1 + hrs_week | toy_type), data = toys_read)
KRmodcomp(full_model, restricted_model)
## large : R_AGE ~ hrs_week + (1 + hrs_week | toy_type)
## small : R_AGE ~ 1 + (1 + hrs_week | toy_type)
## stat ndf ddf F.scaling p.value
## Ftest 14.637 1.000 17.738 1 0.001266 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reporting
To account for the extra uncertainty brought by the inclusion of random effects in the model, the denominator degrees of freedom have been corrected via the Kenward-Roger method.
…
…
Inclusion of weekly hours of reading practice as a predictor was associated with an improvement in model fit (\(F(1,17.74^*) = 14.64,\ p = .001\)).
Likelihood Ratio Test (LRT)
Conduct a model comparison between your model and a restricted model (a model without the parameter of interest), evaluating the change in log-likelihood.
Likelihood
“likelihood” is a function that associates to a parameter the probability (or probability density) of observing the given sample data.
In simpler terms, the likelihood tells us how probable the observed data are under a particular model (i.e., a particular set of parameter values).
The intuition behind likelihood:
- I toss a coin 10 times and observe 8 Heads.
- We can think of a ‘model’ of the process that governs the coin’s behaviour in terms of just one number: a parameter that indicates the probability of the coin landing on heads.
I have two models:
- Model 1: The coin will land on heads 20% of the time. \(P(Heads)=0.2\)
- Model 2: The coin will land on heads 70% of the time. \(P(Heads)=0.7\)
- Given the data I observe (see 1, above), we can (hopefully) intuit that Model 2 has the higher likelihood: observing 8 heads out of 10 is much more probable if \(P(Heads)=0.7\) than if \(P(Heads)=0.2\).
For a (slightly) more detailed explanation, see here.
This method assumes that twice the difference in the two models’ log-likelihoods (equivalently, minus two times the log of the likelihood ratio) will, as sample size increases, become closer to being \(\chi^2\) distributed, and so it may be unreliable for small samples.
Models must be fitted with ML, not REML. The function anova()
will re-fit them for you.
restricted_model <- lmer(R_AGE ~ 1 + (1 + hrs_week | toy_type), data = toys_read)
full_model <- lmer(R_AGE ~ hrs_week + (1 + hrs_week | toy_type), data = toys_read)
anova(restricted_model, full_model, test = "Chisq")
## refitting model(s) with ML (instead of REML)
## Data: toys_read
## Models:
## restricted_model: R_AGE ~ 1 + (1 + hrs_week | toy_type)
## full_model: R_AGE ~ hrs_week + (1 + hrs_week | toy_type)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## restricted_model 5 660.32 674.73 -325.16 650.32
## full_model 6 650.51 667.80 -319.25 638.51 11.813 1 0.000588 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
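As a sanity check on what anova() is doing, the \(\chi^2\) statistic is just twice the difference in the two models’ ML log-likelihoods - a quick sketch using lme4’s refitML() helper:
ll_full <- logLik(refitML(full_model))
ll_rest <- logLik(refitML(restricted_model))
chisq   <- as.numeric(2 * (ll_full - ll_rest))  # approx 11.81
pchisq(chisq, df = 1, lower.tail = FALSE)       # approx 0.0006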
Reporting
A likelihood ratio test indicated that the inclusion of weekly hours of reading practice as a predictor was associated with an improvement in model fit (\(\chi^2(1) = 11.81, p < .001\)).
Optional: Parametric Bootstrap LRT
There are also various “bootstrapping” methods which it is worth looking into. Think back to USMR when we first learned about hypothesis testing. Remember that we did lots of simulating data, so that we could compare what we actually observe with what we would expect if the null hypothesis were true? By doing this, we were essentially creating a null distribution, so that calculating a p-value becomes a matter of summarising data (e.g. calculating the proportion of our simulated null distribution that is more extreme than our observed statistic).
Instead of assuming that the likelihood ratio test statistics are \(\chi^2\)-distributed, we can bootstrap this test instead. This approach simulates data from the simpler model, fits both the simple model and the complex model, and evaluates the change in log-likelihood. By doing this over and over again, we build a distribution of what changes in log-likelihood we would be likely to see if the more complex model is not any better. In this way it actually constructs a distribution reflecting our null hypothesis, against which we can then compare our actual observed effect.
library(pbkrtest)
PBmodcomp(full_model, restricted_model, nsim=1000)
Reporting
A parametric bootstrap likelihood ratio test (R = 1000) indicated that the inclusion of weekly hours of reading practice as a predictor was associated with an improvement in model fit (\(LRT = 11.79, p = .004\)).
Optional: Parametric Bootstrap Confidence Intervals
Much the same as above, but with just one model we simulate data many times and refit the model, so that we get an empirical distribution that we can use to construct confidence intervals for our effects.
confint(full_model, method="boot", nsim=1000)
Reporting
95% Confidence Intervals were obtained via parametric bootstrapping with 1000 iterations.
…
…
Weekly hours of reading practice was associated with increased reading age (\(\beta = 1.14,\ 95\%\ CI\ [0.58, 1.73]\)).
Optional: Case-based Bootstrap Confidence Intervals
It’s worth noting that there are many different types of bootstrapping that we can conduct. Different methods of bootstrapping vary with respect to the assumptions we will have to make when using them for drawing inferences. For instance, the parametric bootstrap discussed above assumes that explanatory variables are fixed and that model specification and the distributions such as \(\zeta_i \sim N(0,\sigma_{\zeta})\) and \(\varepsilon_i \sim N(0,\sigma_{\varepsilon})\) are correct.
An alternative is to generate a distribution by resampling with replacement from our data, fitting our model to the resample, and then repeating this over and over. This doesn’t have to rely on assumptions about the shape of the distributions of \(\zeta_i\) and \(\varepsilon_i\) - we just need to ensure that we correctly specify the hierarchical dependency of data. It does, however, require the decision of at which levels to resample (this is discussed more in week 4, and is what the lmeresampler package is all about).
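If you are curious, a very rough sketch of what this could look like with the lmeresampler package is below - treat the function and argument names as assumptions to check against that package’s documentation (?lmeresampler::bootstrap), and note that choosing to resample only the toy types (and not observations within them) is itself a design decision:
library(lmeresampler)
# case bootstrap: resample toy types with replacement, keep the observations within each
case_boot <- bootstrap(full_model, .f = fixef, type = "case", B = 1000,
                       resample = c(TRUE, FALSE))
confint(case_boot, type = "perc")  # percentile confidence intervals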
If you want more information (not required reading for this course), then Julian Faraway has a page here with links to some worked examples, and Ben Bolker has a wealth of information on his GLMM FAQ pages.