Recap of multilevel models

No specific exercises this week

This week, there aren’t any exercises, but there is a small recap reading of multilevel models, followed by some ‘flashcard’ type boxes to help you test your understanding of some of the key concepts.

Please use the lab sessions to go over exercises from previous weeks, and to ask any questions about the content below.

Flashcards: lm to lmer

In simple linear regression, there is considered to be only one source of random variability: any variability left unexplained by the predictors (which are modelled as fixed estimates) is captured in the model residuals.

Multi-level (or ‘mixed-effects’) approaches involve modelling more than one source of random variability - as well as the variance resulting from taking a random sample of observations, we can identify random variability across different groups of observations. For example, if we are studying a patient population in a hospital, we would expect there to be variability across our sample of patients, but also across the doctors who treat them.

We can account for this variability by allowing the outcome to be lower/higher for each group (a random intercept) and by allowing the estimated effect of a predictor to vary across groups (random slopes).
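For the hospital example, a random-intercept model and a random-slope model might be specified along these lines. This is a minimal sketch only: the data frame patients and the variables outcome, treatment_dose and doctor are made up for illustration.

library(lme4)

# random intercepts only: each doctor's patients can sit higher/lower overall
m_int <- lmer(outcome ~ 1 + treatment_dose + (1 | doctor), data = patients)

# random intercepts and slopes: the effect of treatment_dose can also
# differ from doctor to doctor
m_slp <- lmer(outcome ~ 1 + treatment_dose + (1 + treatment_dose | doctor),
              data = patients)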

Before you expand each of the boxes below, think about how comfortable you feel with each concept.
This content is very cumulative, which often means going back to isolate exactly where we need to focus our learning efforts.

Simple Linear Regression

Clustered (multi-level) data

Random intercepts

Shrinkage

Random slopes

Fixed effects

Random effects

Group-level coefficients

Visualising Model Fitted values

Visualising Fixed Effects

Plotting random effects

Nested and Crossed structures

MLM in a nutshell

Model Structure & Notation

MLM allows us to model effects in the linear model as varying between groups. The coefficients we remember from simple linear models (the \(\beta\)’s) are modelled as distributions with an overall mean around which our groups vary. We can see this in Figure 1, where both the intercept and the slope of the line are modelled as varying by group. Figure 1 shows the overall line in blue, with a given group’s line in green.

Figure 1: Multilevel Model. Each group (e.g. the group shown by the green line) deviates from the overall fixed effects (the blue line), and the individual observations (green points) deviate from their group’s line.

The formula notation for these models involves separating out our effects \(\beta\) into two parts: the overall effect \(\gamma\) + the group deviations \(\zeta_i\):

\[ \begin{align} & \text{for observation }j\text{ in group }i \\ \quad \\ & \text{Level 1:} \\ & \color{red}{y_{ij}}\color{black} = \color{blue}{\beta_{0i} \cdot 1 + \beta_{1i} \cdot x_{ij}}\color{black} + \varepsilon_{ij} \\ & \text{Level 2:} \\ & \color{blue}{\beta_{0i}}\color{black} = \gamma_{00} + \color{orange}{\zeta_{0i}} \\ & \color{blue}{\beta_{1i}}\color{black} = \gamma_{10} + \color{orange}{\zeta_{1i}} \\ \quad \\ & \text{Where:} \\ & \gamma_{00}\text{ is the population intercept, and }\color{orange}{\zeta_{0i}}\color{black}\text{ is the deviation of group }i\text{ from }\gamma_{00} \\ & \gamma_{10}\text{ is the population slope, and }\color{orange}{\zeta_{1i}}\color{black}\text{ is the deviation of group }i\text{ from }\gamma_{10} \\ \end{align} \]

The group-specific deviations \(\zeta_{0i}\) from the overall intercept are assumed to be normally distributed with mean \(0\) and variance \(\sigma_0^2\). Similarly, the deviations \(\zeta_{1i}\) of the slope for group \(i\) from the overall slope are assumed to come from a normal distribution with mean \(0\) and variance \(\sigma_1^2\). The correlation between random intercepts and slopes is \(\rho = \text{Cor}(\zeta_{0i}, \zeta_{1i}) = \frac{\sigma_{01}}{\sigma_0 \sigma_1}\):

\[ \begin{bmatrix} \zeta_{0i} \\ \zeta_{1i} \end{bmatrix} \sim N \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_0^2 & \rho \sigma_0 \sigma_1 \\ \rho \sigma_0 \sigma_1 & \sigma_1^2 \end{bmatrix} \right) \]
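To make this distribution concrete, we could simulate group deviations from it. This is a small sketch rather than part of the reading, and the values chosen for \(\sigma_0\), \(\sigma_1\) and \(\rho\) are made up:

library(MASS)
sigma0 <- 1.5    # SD of random intercepts (assumed value)
sigma1 <- 0.9    # SD of random slopes (assumed value)
rho    <- 0.5    # intercept-slope correlation (assumed value)

# covariance matrix of the random effects
Sigma <- matrix(c(sigma0^2,              rho * sigma0 * sigma1,
                  rho * sigma0 * sigma1, sigma1^2), nrow = 2)

# one row per group: column 1 = zeta_0i, column 2 = zeta_1i
zetas <- mvrnorm(n = 20, mu = c(0, 0), Sigma = Sigma)
head(zetas)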

The residual errors, independent of the random effects, are assumed to be normally distributed with mean zero and variance \(\sigma_\epsilon^2\):

\[ \epsilon_{ij} \sim N(0, \sigma_\epsilon^2) \]

MLM in R

We can fit these models using the R package lme4, and the function lmer(). Think of it like building your linear model lm(y ~ 1 + x), and then allowing effects (i.e. things on the right-hand side of the ~ symbol) to vary by the grouping of your data. We specify this by adding (vary these effects | by these groups) to the model:

library(lme4)
m1 <- lmer(y ~ x + (1 + x | group), data = df)
summary(m1)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 + x | group)
   Data: df

REML criterion at convergence: 637.9

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.49449 -0.57223 -0.01353  0.62544  2.39122 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 group    (Intercept) 2.2616   1.5038       
          x           0.7958   0.8921   0.55
 Residual             4.3672   2.0898       
Number of obs: 132, groups:  group, 20

Fixed effects:
            Estimate Std. Error t value
(Intercept)   1.7261     0.9673   1.785
x             1.1506     0.2968   3.877

Correlation of Fixed Effects:
  (Intr)
x -0.552

The summary of the lmer output returns estimated values for:

Fixed effects:

  • \(\widehat \gamma_{00} = 1.726\)
  • \(\widehat \gamma_{10} = 1.151\)

Variability of random effects:

  • \(\widehat \sigma_{0} = 1.504\)
  • \(\widehat \sigma_{1} = 0.892\)

Correlation of random effects:

  • \(\widehat \rho = 0.546\)

Residuals:

  • \(\widehat \sigma_\epsilon = 2.09\)
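These quantities can also be extracted directly from the fitted model object (the accessor functions below are standard lme4 functions):

fixef(m1)     # fixed effects: gamma_00 (intercept) and gamma_10 (slope of x)
VarCorr(m1)   # random-effect standard deviations and their correlation
sigma(m1)     # residual standard deviation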

Flashcards: Getting p-values/CIs for Multilevel Models

Inference for multilevel models is tricky, and there are lots of different approaches.
You might find it helpful to think of the majority of these techniques as either:

A. Fitting a full model and a reduced model. These differ only with respect to the relevant effect. Comparing the overall fit of these models by some metric allows us to isolate the improvement due to our effect of interest.

B. Fitting the full model and either:

  • using some method of approximating the degrees of freedom for the tests of whether a coefficient is zero, or
  • constructing confidence intervals (e.g. via bootstrapping) in order to gain a range of plausible values for the estimate (typically we then ask whether zero is within the interval).

Neither of these is correct or incorrect, but they each have different advantages. In brief:

  • Satterthwaite \(df\) = very easy to implement, can fit with REML
  • Kenward-Roger \(df\) = good when working with small samples, can fit with REML
  • Likelihood Ratio Tests = better with bigger samples (of groups, and of observations within groups), require REML = FALSE
  • Parametric Bootstrap = takes time to run, but in general a reliable approach; requires REML = FALSE if doing model comparisons, but not for confidence intervals
  • Non-Parametric Bootstrap = takes time, needs careful thought about which levels to resample, but means we can relax distributional assumptions (e.g. about normality of residuals)

For the examples below, we’re investigating how “hours per week studied” influences a child’s reading age. We have data from 132 children across 20 schools, and we have fitted the model:

full_model <- lmer(reading_age ~ 1 + hrs_week + (1 + hrs_week | school), 
                   data = childread)
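The flashcards below go through each approach in turn. As a rough orientation first, here is a minimal sketch of two of them: a likelihood ratio test and parametric bootstrap confidence intervals. The reduced model here is constructed purely for illustration, lme4 is assumed to be loaded, and the bootstrap can take a while to run.

# likelihood ratio test: compare the full model to a reduced model without
# the fixed effect of hrs_week (both fitted with ML, i.e. REML = FALSE)
full_ml    <- lmer(reading_age ~ 1 + hrs_week + (1 + hrs_week | school),
                   data = childread, REML = FALSE)
reduced_ml <- lmer(reading_age ~ 1 + (1 + hrs_week | school),
                   data = childread, REML = FALSE)
anova(reduced_ml, full_ml)

# parametric bootstrap confidence intervals for the model parameters
confint(full_model, method = "boot", nsim = 1000)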
Use a normal approximation (not advisable)

Satterthwaite df approximation

Kenward Rogers df approximations

Likelihood Ratio Test (LRT)

Parametric Bootstrap LRT

Parametric Bootstrap Confidence Intervals

Non-Parametric Bootstrap Confidence Intervals