require(tidyverse)
require(lme4)
library(lmerTest)
# library(effects)

Recap of multilevel models

Do children's maths scores improve more in school 2020 than in school 2040?

Consider the following data, representing longitudinal measurements on 42 primary school students from two schools (ids 2020 and 2040). The outcome of interest is mathematics achievement. The data were collected at the end of first grade and annually thereafter up to sixth grade, but not all students have six observations. The variable year has been mean-centred to have mean 0, so that the intercept is interpreted at the average value of year.

How many students in each school?

## schoolid
## 2020 2040  Sum 
##   21   21   42

We have 42 students: 21 in the school with id 2020 and 21 in the school with id 2040.
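The table above can be produced by keeping one row per child before counting, since the data are in long format with several rows per child. A minimal sketch in base R, using a small made-up data frame `toy` (its values are hypothetical, not the lab data):

```r
# Toy long-format data: several rows per child (hypothetical values)
toy <- data.frame(
  schoolid = c(2020, 2020, 2020, 2020, 2040, 2040, 2040),
  childid  = c(1, 1, 2, 2, 3, 3, 3)
)

# Keep one row per child, so that children (not observations) are counted
one_per_child <- unique(toy[, c("schoolid", "childid")])

# Tabulate children per school and append the total
counts <- addmargins(table(schoolid = one_per_child$schoolid))
counts
```

Applied to the lab data, the same steps would use `data` in place of `toy`.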

The number of observations per child is as follows.

table(data$childid)

## 
## 253404261 253413681 270199271 273026452 273030991 273059461 278058841 285939962 
##         6         3         3         3         5         5         5         3 
## 288699161 289970511 292017571 292020281 292020361 292025081 292026211 292027291 
##         5         5         3         6         6         5         5         5 
## 292027531 292028181 292028931 292029071 292029901 292033851 292772811 293550291 
##         5         5         5         5         5         5         3         4 
## 295341521 298046562 299680041 301853741 303652591 303653561 303654611 303658951 
##         6         3         5         3         5         3         5         3 
## 303660691 303662391 303663601 303668751 303671891 303672001 303672861 303673321 
##         4         6         6         5         5         3         3         4 
## 307407931 307694141 
##         3         4

We can see that for some children we have fewer than the 6 observations: some have 3, 4, or 5.

School 2020

Let’s start by considering only the children in school 2020. The mathematics achievement over time is shown, for each student, in the plot below:

data2020 <- data %>% 
  filter(schoolid == 2020)

ggplot(data2020, aes(x = year, y = math)) +
  geom_point() +
  facet_wrap(~ childid, labeller = label_both) +
  labs(x = "Year (mean centred)", y = "Maths achievement score")

Clearly, the measurements of mathematics achievement related to each student are grouped data as they refer to the same entity.

If we were to ignore this grouping and consider all children as one single population, we would obtain misleading results. The observations for the same student are clearly correlated. Some students consistently perform much better than others, perhaps due to underlying numerical skills.

A fundamental assumption of linear regression models is that the residuals, and hence the data too, should be uncorrelated. In this example this is not the case.

The following plot considers all data as a single population:

ggplot(data2020, aes(x = year, y = math)) +
    geom_point() +
    geom_smooth(method = lm, se = FALSE) +
    labs(x = "Year (mean centred)", y = "Maths achievement score")

This is a simple linear regression model for the mathematics measurement of individual \(i\) on occasion \(j\): \[ \text{math}_{ij} = \beta_0 + \beta_1 \ \text{year}_{ij} + \epsilon_{ij} \]

where the subscript \(ij\) denotes the \(j\)th measurement from child \(i\).

Let’s fit this in R

m0 <- lm(math ~ year, data = data2020)
summary(m0)

## 
## Call:
## lm(formula = math ~ year, data = data2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6478 -0.6264 -0.1101  0.4543  2.4529 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.14323    0.08493  -1.687    0.095 .  
## year         0.96072    0.05662  16.968   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8126 on 95 degrees of freedom
## Multiple R-squared:  0.7519, Adjusted R-squared:  0.7493 
## F-statistic: 287.9 on 1 and 95 DF,  p-value: < 2.2e-16

The intercept and slope of this model can be visually represented as:

Random intercept and slopes

In reality, we see that each student has their own line, with a different intercept and slope. In other words, they all have different values of maths achievement when year = 0 and they also differ in their learning rate.

ggplot(data2020, aes(x = year, y = math, color = childid)) +
    geom_point() +
    # linewidth replaces size for lines in ggplot2 >= 3.4.0
    geom_smooth(method = lm, se = FALSE, fullrange = TRUE, 
                linewidth = 0.5) +
    labs(x = "Year (mean centred)", y = "Maths achievement score") +
    theme(legend.position = 'bottom')

Let’s now write a model where each student has their own intercept and slope: \[ \begin{aligned} \text{math}_{ij} &= \beta_{0i} + \beta_{1i} \ \text{year}_{ij} + \epsilon_{ij} \\ &= (\text{intercept for child } i) + (\text{slope for child } i) \ \text{year}_{ij} + \epsilon_{ij} \\ &= (\gamma_{00} + \zeta_{0i}) + (\gamma_{10} + \zeta_{1i}) \ \text{year}_{ij} + \epsilon_{ij} \end{aligned} \]

where

\(\beta_{0i}\) is the intercept of the line for child \(i\)
\(\beta_{1i}\) is the slope of the line for child \(i\)
\(\epsilon_{ij}\) are the deviations of each child’s measurement \(\text{math}_{ij}\) from the line of child \(i\)

We can think of each child-specific intercept (respectively, slope) as being made up of two components: an “overall” intercept \(\gamma_{00}\) (slope \(\gamma_{10}\)) and a child-specific deviation from the overall intercept \(\zeta_{0i}\) (slope \(\zeta_{1i}\)):

\(\beta_{0i} = \gamma_{00} + \zeta_{0i} = \text{(overall intercept) + (deviation for child }i)\)
\(\beta_{1i} = \gamma_{10} + \zeta_{1i} = \text{(overall slope) + (deviation for child }i)\)

FACT

Deviations from the mean average to zero (and sum to zero too!)

As you know, deviations from the mean average to 0.

This holds for the errors \(\epsilon_{ij}\), as well as the deviations \(\zeta_{0i}\) from the overall intercept, and the deviations \(\zeta_{1i}\) from the overall slope.

Think of data \(y_1, \dots, y_n\) and their mean \(\bar y\). The average of the deviations from the mean is \[ \begin{aligned} \frac{\sum_i (y_i - \bar y)}{n} = \frac{\sum_i y_i }{n} - \frac{\sum_i \bar y}{n} = \bar y - \frac{n \bar y}{n} = \bar y - \bar y = 0 \end{aligned} \]
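The same fact can be checked numerically. The short sketch below uses an arbitrary made-up vector `y` (hypothetical values); because of floating-point arithmetic the results may be tiny numbers very close to zero rather than exactly zero.

```r
# Arbitrary example values (hypothetical)
y <- c(2.1, 3.4, 5.0, 7.7, 9.3)

# Deviations of each value from the mean
dev <- y - mean(y)

sum(dev)   # sum of the deviations
mean(dev)  # average of the deviations
```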

The child-specific deviations \(\zeta_{0i}\) from the overall intercept are normally distributed with mean \(0\) and variance \(\sigma_0^2\). Similarly, the deviations \(\zeta_{1i}\) of the slope for child \(i\) from the overall slope come from a normal distribution with mean \(0\) and variance \(\sigma_1^2\). The correlation between random intercepts and slopes is \(\rho = \text{Cor}(\zeta_{0i}, \zeta_{1i}) = \frac{\sigma_{01}}{\sigma_0 \sigma_1}\):

\[ \begin{bmatrix} \zeta_{0i} \\ \zeta_{1i} \end{bmatrix} \sim N \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_0^2 & \rho \sigma_0 \sigma_1 \\ \rho \sigma_0 \sigma_1 & \sigma_1^2 \end{bmatrix} \right) \]

The random errors, independently from the random effects, are distributed \[ \epsilon_{ij} \sim N(0, \sigma_\epsilon^2) \]
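To make these distributional assumptions concrete, one can simulate data from the model. The sketch below uses base R only. All parameter values (`gamma00`, `gamma10`, `sigma0`, `sigma1`, `rho`, `sigma_e`) and the balanced design are hypothetical choices for illustration, not estimates from the lab data.

```r
set.seed(1)

n_child <- 21   # number of children (hypothetical)
n_occ   <- 5    # measurement occasions per child (hypothetical)

# Fixed effects and variance components (all hypothetical)
gamma00 <- -0.1; gamma10 <- 1.0
sigma0  <- 0.7;  sigma1  <- 0.1; rho <- 0.8
sigma_e <- 0.4

# Covariance matrix of the random intercepts and slopes
Sigma <- matrix(c(sigma0^2,              rho * sigma0 * sigma1,
                  rho * sigma0 * sigma1, sigma1^2), nrow = 2)

# Draw (zeta0, zeta1) ~ N(0, Sigma): standard normals times the Cholesky factor
Z     <- matrix(rnorm(2 * n_child), ncol = 2) %*% chol(Sigma)
zeta0 <- Z[, 1]
zeta1 <- Z[, 2]

# One row per child and occasion, with a mean-centred year
sim      <- expand.grid(occ = seq_len(n_occ), childid = seq_len(n_child))
sim$year <- sim$occ - mean(seq_len(n_occ))

# math_ij = (gamma00 + zeta0_i) + (gamma10 + zeta1_i) * year_ij + epsilon_ij
sim$math <- (gamma00 + zeta0[sim$childid]) +
            (gamma10 + zeta1[sim$childid]) * sim$year +
            rnorm(nrow(sim), mean = 0, sd = sigma_e)

head(sim)
```

Each child gets one pair of deviations, reused across all of that child's rows, while a fresh error is drawn for every observation.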

This is fitted using lmer():

library(lme4)
library(lmerTest)

m1 <- lmer(math ~ 1 + year + (1 + year | childid), data = data2020)
summary(m1)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: math ~ 1 + year + (1 + year | childid)
##    Data: data2020
## 
## REML criterion at convergence: 166.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3119 -0.6125 -0.0726  0.6002  2.4197 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  childid  (Intercept) 0.50065  0.7076       
##           year        0.01131  0.1063   0.82
##  Residual             0.16345  0.4043       
## Number of obs: 97, groups:  childid, 21
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  -0.1091     0.1605 19.7831   -0.68    0.505    
## year          0.9940     0.0381 13.1895   26.09 9.71e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## year 0.433

The summary of the lmer output returns estimated values for

Fixed effects:

\(\widehat \gamma_{00} = -0.109\)
\(\widehat \gamma_{10} = 0.994\)

Variability of random effects:

\(\widehat \sigma_{0} = 0.708\)
\(\widehat \sigma_{1} = 0.106\)

Correlation of random effects:

\(\widehat \rho = 0.816\)

Residuals:

\(\widehat \sigma_\epsilon = 0.404\)
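These quantities can also be pulled out of a fitted model programmatically rather than read off the summary. The sketch below is self-contained, so it uses the `sleepstudy` data shipped with lme4 as a stand-in for `data2020`; the object names `fit`, `sds`, `rho_hat` and `sigma_e_hat` are illustrative.

```r
library(lme4)

# Stand-in model with the same structure: random intercept and slope per group
fit <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), data = sleepstudy)

# Fixed effects: estimates of gamma_00 and gamma_10
fixef(fit)

# Random-effect standard deviations (sigma_0, sigma_1) and their correlation (rho)
vc      <- VarCorr(fit)
sds     <- attr(vc$Subject, "stddev")
rho_hat <- attr(vc$Subject, "correlation")[1, 2]

# Residual standard deviation (sigma_epsilon)
sigma_e_hat <- sigma(fit)

sds
rho_hat
sigma_e_hat
```

For the model in this section, the same calls would be made on `m1`, with `vc$childid` in place of `vc$Subject`.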

Check normality of random effects:

par(mfrow = c(1,2))
qqnorm(ranef(m1)$childid[, 1], main = "Random intercept")
qqline(ranef(m1)$childid[, 1])

qqnorm(ranef(m1)$childid[, 2], main = "Random slope")
qqline(ranef(m1)$childid[, 2])

Check normality and independence of errors:

par(mfrow = c(1,2))
qqnorm(resid(m1), main = "Residuals")
qqline(resid(m1))

plot(fitted(m1), resid(m1), ylab = "Residuals", xlab = "Fitted values")
abline(h=0)

Visually inspect the correlation between the random intercept and slopes:

ggplot(ranef(m1)$childid,
       aes(x = `(Intercept)`, y = year)) +
    # linewidth replaces size for lines in ggplot2 >= 3.4.0
    geom_smooth(method = lm, se = FALSE, 
                color = 'gray', linewidth = 0.5) +
    geom_point()

Flashcards: `lm` to `lmer`

In a simple linear regression, there is only considered to be one source of random variability: any variability left unexplained by a set of predictors (which are modelled as fixed estimates) is captured in the model residuals.

Multi-level (or ‘mixed-effects’) approaches involve modelling more than one source of random variability - as well as variance resulting from taking a random sample of observations, we can identify random variability across different groups of observations. For example, if we are studying a patient population in a hospital, we would expect there to be variability across our sample of patients, but also across the doctors who treat them.

We can account for this variability by allowing the outcome to be lower/higher for each group (a random intercept) and by allowing the estimated effect of a predictor to vary across groups (random slopes).

Before you expand each of the boxes below, think about how comfortable you feel with each concept.
This content is cumulative, so it often helps to go back and pinpoint the concept on which you need to focus your efforts.

Simple Linear Regression

Clustered (multi-level) data

Random intercepts

Shrinkage

Random slopes

Polynomials!

Sometimes, data have a clear non-linear pattern, such as a curvilinear trend. In such cases, it is reasonable to try modelling the outcome not as a linear function of the variable, but as a curvilinear function of it.

The following plots show data (as black dots) where the outcome \(y\) has a nonlinear and decreasing dependence on \(x\). That is, as \(x\) varies from 1 to 10, the outcome \(y\) decreases in a non-linear fashion. Superimposed on the same data are a linear fit (red line) and a cubic fit (blue line).

The residuals corresponding to each fit are:

Clearly, a linear fit doesn’t capture the real trend in the data, and any leftover systematic pattern that the model doesn’t explicitly account for ends up in the residuals, as the red points show.

On the other hand, once we account for the nonlinear trend, that systematic pattern in the residuals disappears.

The trick is to use as predictors not just \(x\), but the polynomial terms of \(x\) up to a chosen order:

\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon \]
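As an illustration of the cubic equation above, the following sketch fits a straight line and a cubic polynomial to simulated data and compares their residual sums of squares. The data-generating curve, the noise level and the object names are hypothetical choices, picked only to give a decreasing, curved trend over the range 1 to 10.

```r
set.seed(42)

# Simulated data with a decreasing, non-linear trend (hypothetical)
x <- seq(1, 10, length.out = 50)
y <- 10 - 6 * log(x) + rnorm(length(x), sd = 0.5)

# Linear fit: y = b0 + b1 * x
fit_lin <- lm(y ~ x)

# Cubic fit: y = b0 + b1 * x + b2 * x^2 + b3 * x^3
fit_cub <- lm(y ~ x + I(x^2) + I(x^3))

# Residual sum of squares for each fit
rss <- c(linear = sum(resid(fit_lin)^2),
         cubic  = sum(resid(fit_cub)^2))
rss

# Residuals against x: a leftover curve indicates an unmodelled trend
par(mfrow = c(1, 2))
plot(x, resid(fit_lin), main = "Linear fit", ylab = "Residuals"); abline(h = 0)
plot(x, resid(fit_cub), main = "Cubic fit",  ylab = "Residuals"); abline(h = 0)
```

Because the linear model is nested in the cubic one, the cubic fit can never have a larger residual sum of squares. The residual plots are the more informative check.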

Consider the following example data. You can add polynomials up to order 3, for example, of a predictor “time” by saying:

## # A tibble: 5 x 3
##   subject reaction  time
##     <dbl>    <dbl> <int>
## 1       1    0.428     1
## 2       1    0.427     2
## 3       1    0.211     3
## 4       1    0.585     4
## 5       1    0.127     5

source("https://uoepsy.github.io/msmr/functions/code_poly.R")

code_poly(df, predictor = 'time', poly.order = 3, draw.poly = FALSE)

## # A tibble: 5 x 7
##   subject reaction  time time.Index  poly1  poly2     poly3
##     <dbl>    <dbl> <int>      <dbl>  <dbl>  <dbl>     <dbl>
## 1       1    0.428     1          1 -0.632  0.535 -3.16e- 1
## 2       1    0.427     2          2 -0.316 -0.267  6.32e- 1
## 3       1    0.211     3          3  0     -0.535 -4.10e-16
## 4       1    0.585     4          4  0.316 -0.267 -6.32e- 1
## 5       1    0.127     5          5  0.632  0.535  3.16e- 1

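and use those terms when specifying your model. The example below assumes the output of code_poly() has been saved to an object, here given the illustrative name df_poly:

lmer(reaction ~ poly1 + poly2 + poly3 + (1 | subject), data = df_poly)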
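The poly1 to poly3 columns shown in the code_poly() output look like orthogonal polynomial terms. As a sketch, base R's `poly()` builds orthogonal polynomials for a predictor directly. For `time = 1:5` its values should correspond to the columns printed above, although that correspondence is an observation from the printed numbers rather than something the lab states.

```r
time <- 1:5

# Orthogonal polynomial terms of time up to order 3
P <- poly(time, degree = 3)

# Drop the extra attributes so we are left with a plain 5 x 3 matrix
Pm <- matrix(P, ncol = 3, dimnames = list(NULL, c("poly1", "poly2", "poly3")))
round(Pm, 3)

# The columns are orthonormal: their cross-product is the identity matrix
round(crossprod(Pm), 10)
```

Orthogonal terms are uncorrelated with one another, which raw powers such as `time`, `time^2` and `time^3` are not.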

Fixed effects

We can extract the fixed effects, that is, the overall intercept and slope, using the fixef() function:

fixef(random_slopes_model)

## (Intercept)          x1 
## 405.7897675  -0.6722654

Random effects

The plots below show the fitted values for each subject from each model that we have gone through in these expandable boxes (simple linear regression, random intercept, and random intercept & slope):

In the random-intercept model (center panel), the differences from each of the subjects’ intercepts to the fixed intercept (thick green line) have mean 0 and standard deviation \(\sigma_0\). The standard deviation (and variance, which is \(\sigma_0^2\)) is what we see in the random effects part of our model summary (or using the VarCorr() function).

In the random-slope model (right panel), the same is true for the differences from each subject’s slope to the fixed slope. We can extract the deviations for each group from the fixed effect estimates using the ranef() function.

These are the deviations from the overall intercept (\(\widehat \gamma_{00} = 405.79\)) and slope (\(\widehat \gamma_{10} = -0.672\)) for each subject \(i\).

ranef(random_slopes_model)

## $subject
##         (Intercept)          x1
## sub_308   31.327291 -1.43995253
## sub_309  -28.832219  0.41839420
## sub_310    2.711822  0.05993766
## sub_330   59.398971  0.38526670
## sub_331   74.958481  0.17391602
## sub_332   91.086535 -0.23461836
## sub_333   97.852988 -0.19057838
## sub_334  -54.185688 -0.55846794
## sub_335  -16.902018  0.92071637
## sub_337   52.217859 -1.16602280
## sub_349  -67.760246 -0.68438960
## sub_350   -5.821271 -1.23788002
## sub_351   61.198823  0.05499816
## sub_352   -7.905596 -0.66495059
## sub_369  -47.636645 -0.46810258
## sub_370  -33.121093 -1.11001234
## sub_371   77.576205 -0.20402571
## sub_372  -36.389281 -0.45829505
## sub_373 -197.579562  1.79897904
## sub_374  -52.195357  4.60508775
## 
## with conditional variances for "subject"

Group-level coefficients

We can also see the actual intercept and slope for each subject \(i\) directly, using the coef() function.

coef(random_slopes_model)

## $subject
##         (Intercept)         x1
## sub_308    437.1171 -2.1122179
## sub_309    376.9575 -0.2538712
## sub_310    408.5016 -0.6123277
## sub_330    465.1887 -0.2869987
## sub_331    480.7482 -0.4983494
## sub_332    496.8763 -0.9068837
## sub_333    503.6428 -0.8628438
## sub_334    351.6041 -1.2307333
## sub_335    388.8877  0.2484510
## sub_337    458.0076 -1.8382882
## sub_349    338.0295 -1.3566550
## sub_350    399.9685 -1.9101454
## sub_351    466.9886 -0.6172672
## sub_352    397.8842 -1.3372160
## sub_369    358.1531 -1.1403680
## sub_370    372.6687 -1.7822777
## sub_371    483.3660 -0.8762911
## sub_372    369.4005 -1.1305604
## sub_373    208.2102  1.1267137
## sub_374    353.5944  3.9328224
## 
## attr(,"class")
## [1] "coef.mer"

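Notice that the above are the fixed effects + random effects estimates, i.e. the overall intercept and slope + the deviations for each subject. For comparison, in the random-intercept model every subject has their own intercept but shares the same slope: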

coef(random_intercept_model)

## $subject
##         (Intercept)         x1
## sub_308    384.0955 -0.9135829
## sub_309    406.5426 -0.9135829
## sub_310    421.8658 -0.9135829
## sub_330    492.0476 -0.9135829
## sub_331    498.0868 -0.9135829
## sub_332    496.0130 -0.9135829
## sub_333    504.6193 -0.9135829
## sub_334    338.5855 -0.9135829
## sub_335    440.3964 -0.9135829
## sub_337    416.7346 -0.9135829
## sub_349    319.6674 -0.9135829
## sub_350    356.3696 -0.9135829
## sub_351    479.2943 -0.9135829
## sub_352    379.5162 -0.9135829
## sub_369    349.0152 -0.9135829
## sub_370    335.0869 -0.9135829
## sub_371    484.0427 -0.9135829
## sub_372    360.5322 -0.9135829
## sub_373    293.6168 -0.9135829
## sub_374    511.3440 -0.9135829
## 
## attr(,"class")
## [1] "coef.mer"
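Returning to the random-slopes model, the relationship between the three outputs can be checked by hand for a single subject. The sketch below copies the printed values for sub_308 from the fixef() and ranef() outputs; the object names `fixed` and `dev_308` are illustrative.

```r
# Overall intercept and slope, copied from the fixef() output
fixed <- c(`(Intercept)` = 405.7897675, x1 = -0.6722654)

# Deviations for sub_308, copied from the ranef() output
dev_308 <- c(`(Intercept)` = 31.327291, x1 = -1.43995253)

# Subject-specific intercept and slope: compare with the sub_308 row of coef()
fixed + dev_308
```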

Plotting random effects

The quick and easy way to plot your random effects is to use the dotplot.ranef.mer() function in lme4.

randoms <- ranef(random_slopes_model, condVar=TRUE)
dotplot.ranef.mer(randoms)

## $subject

Completely optional - extracting them for plotting in ggplot

Sometimes, however, we might want a bit more control over our plotting. In that case we can extract the estimates and their standard errors for each subject:

#we can get the random effects:
#(note that we use $subject because there might be other groupings, and the ranef() function will give us a list, with one element for each grouping variable)
randoms <-
  ranef(random_slopes_model)$subject %>%
  mutate(subject = row.names(.)) %>%  # the subject IDs are stored in the rownames, so lets add them as a variable
  pivot_longer(cols=1:2, names_to="term",values_to="estimate") # finally, let's reshape it for plotting

#and the same for the standard errors (from the arm package):
randoms_se <-
  arm::se.ranef(random_slopes_model)$subject %>%
  as.data.frame() %>%
  mutate(subject = row.names(.)) %>%
  pivot_longer(cols=1:2, names_to="term",values_to="se")

# join them together, matching on subject and term:
ranefs_plotting <- left_join(randoms, randoms_se, by = c("subject", "term"))

# plot an interval of +/- 2 standard errors around each estimate
ggplot(ranefs_plotting, aes(y=subject, x=estimate))+
  geom_errorbarh(aes(xmin=estimate-2*se, xmax=estimate+2*se))+
  facet_wrap(~term, scales="free_x")

Nested and Crossed structures

Exercise A

Question A1

Research question:

Do children's maths scores improve more in school 2020 than in school 2040?

Load into R the data from the beginning of this lab, on mathematics performance in two schools. These can be found at the following link: https://uoepsy.github.io/data/MathsAchievement.csv

Make sure that variables encoding groups are stored as factors!

Recall that the data represent longitudinal measurements on 42 students from two different schools, with id 2020 and 2040 (21 students from each school). The outcome of interest is mathematics achievement. The data were collected at the end of first grade and annually thereafter up to sixth grade, but not all students have six observations. The variable year has been mean-centred to have mean 0, so that the intercept is interpreted at the average value of year.

Solution

library(tidyverse)

schools <- read_csv('https://uoepsy.github.io/data/MathsAchievement.csv')
schools <- schools %>%
  mutate(schoolid = factor(schoolid),
         childid = factor(childid))

head(schools)

## # A tibble: 6 x 4
##   schoolid childid    year   math
##   <fct>    <fct>     <dbl>  <dbl>
## 1 2020     273026452   0.5  1.15 
## 2 2020     273026452   1.5  1.13 
## 3 2020     273026452   2.5  2.30 
## 4 2020     273030991  -1.5 -1.30 
## 5 2020     273030991  -0.5  0.439
## 6 2020     273030991   0.5  2.43

Question A2

Fit the appropriate model to answer the research question.

Think carefully - what is the question concerning? Where should you include schoolid: as a grouping level, or as a fixed effect?

Solution

modschools <- lmer(math ~ 1 + year + schoolid + (1 + year | childid), data = data)
summary(modschools)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: math ~ 1 + year + schoolid + (1 + year | childid)
##    Data: data
## 
## REML criterion at convergence: 332
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.85909 -0.67031 -0.06586  0.61547  2.34212 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  childid  (Intercept) 0.50671  0.7118       
##           year        0.01184  0.1088   1.00
##  Residual             0.18643  0.4318       
## Number of obs: 186, groups:  childid, 42
## 
## Fixed effects:
##              Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  -0.21077    0.15101 45.36068  -1.396   0.1696    
## year          0.94972    0.02886 24.11700  32.903   <2e-16 ***
## schoolid2040 -0.36306    0.19294 34.82271  -1.882   0.0683 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) year  
## year         0.416       
## schoold2040 -0.650 -0.008
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00288304 (tol = 0.002, component 1)
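Two features of this output are worth a second look. First, the estimated correlation of 1.00 between the random intercepts and slopes, together with the convergence warning, suggests the random-effects estimates may be at the boundary of the parameter space. Second, the schoolid2040 coefficient in this additive model describes a difference in average level between the schools, whereas a difference in the rate of improvement would be captured by a year-by-school interaction. The sketch below shows what such a model could look like. It is fitted to simulated data, with all parameter values and the object names `sim` and `mod_int` being hypothetical, so its estimates say nothing about the real schools.

```r
library(lme4)
set.seed(2020)

# Simulated design: 40 children, 20 per school, 5 mean-centred years each
n_child <- 40
sim <- expand.grid(year = c(-2, -1, 0, 1, 2),
                   childid = factor(seq_len(n_child)))
k <- as.integer(sim$childid)
sim$schoolid <- factor(ifelse(k <= 20, "2020", "2040"))

# Hypothetical child-level deviations and school-specific average slopes
u0    <- rnorm(n_child, sd = 0.7)
u1    <- rnorm(n_child, sd = 0.3)
slope <- ifelse(sim$schoolid == "2040", 0.8, 1.0)
sim$math <- u0[k] + (slope + u1[k]) * sim$year + rnorm(nrow(sim), sd = 0.4)

# year * schoolid expands to year + schoolid + year:schoolid
mod_int <- lmer(math ~ 1 + year * schoolid + (1 + year | childid), data = sim)

# The year:schoolid2040 term estimates the difference in slopes between schools
fixef(mod_int)

# TRUE would indicate a singular fit (e.g. a correlation estimated at +/- 1)
isSingular(mod_int)
```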

Question A3

1. Extract the overall intercept and slope.
2. Extract the deviations from the overall intercept and slope for each child.
3. Extract the actual intercept and slope for the line of child \(i\).
4. How do you compute the intercept and slope for the line of child 273030991 using the output from (1) and (2)? Does it agree with (3)?

Solution

fixef(modschools)

##  (Intercept)         year schoolid2040 
##   -0.2107673    0.9497207   -0.3630573

ranef(modschools)

## $childid
##           (Intercept)         year
## 253404261  0.72336027  0.110392519
## 253413681 -1.74737801 -0.266897951
## 270199271 -0.46293088 -0.070839752
## 273026452  0.21024435  0.032061728
## 273030991  1.13643797  0.173706066
## 273059461  0.63945161  0.097684691
## 278058841  1.28024832  0.195489717
## 285939962  0.43254617  0.066098841
## 288699161  0.57365519  0.087740369
## 289970511  0.09341064  0.014202873
## 292017571  1.40710797  0.215048838
## 292020281 -0.22959172 -0.035153534
## 292020361 -0.37031396 -0.056606784
## 292025081  0.22852401  0.035086662
## 292026211 -0.22032149 -0.033764305
## 292027291  0.24685794  0.037851698
## 292027531  0.06396375  0.009822580
## 292028181 -0.74083925 -0.113102127
## 292028931 -0.07202067 -0.011242204
## 292029071  0.08721703  0.013423313
## 292029901 -0.49863173 -0.075921828
## 292033851 -0.75692101 -0.115459647
## 292772811 -0.70105345 -0.106958259
## 293550291 -0.23040136 -0.035244582
## 295341521  0.06975094  0.010825237
## 298046562 -0.49962987 -0.076393306
## 299680041 -0.04367604 -0.006780972
## 301853741  0.56571795  0.086387594
## 303652591  0.36200112  0.055329846
## 303653561 -0.32092014 -0.049005531
## 303654611 -0.73777527 -0.112808984
## 303658951  0.96735708  0.147887549
## 303660691 -0.71248408 -0.108948838
## 303662391 -0.67733173 -0.103452008
## 303663601 -0.13283943 -0.020263644
## 303668751  0.41668023  0.063369197
## 303671891 -0.52149751 -0.079781383
## 303672001  0.81414597  0.124350637
## 303672861  0.49095808  0.075129264
## 303673321 -0.90990542 -0.139093117
## 307407931  0.73125608  0.111666621
## 307694141 -0.95442964 -0.145837082
## 
## with conditional variances for "childid"

coef(modschools)

## $childid
##             (Intercept)      year schoolid2040
## 253404261  0.5125929533 1.0601132   -0.3630573
## 253413681 -1.9581453272 0.6828228   -0.3630573
## 270199271 -0.6736981926 0.8788810   -0.3630573
## 273026452 -0.0005229579 0.9817824   -0.3630573
## 273030991  0.9256706545 1.1234268   -0.3630573
## 273059461  0.4286842938 1.0474054   -0.3630573
## 278058841  1.0694810056 1.1452104   -0.3630573
## 285939962  0.2217788532 1.0158195   -0.3630573
## 288699161  0.3628878762 1.0374611   -0.3630573
## 289970511 -0.1173566687 0.9639236   -0.3630573
## 292017571  1.1963406577 1.1647695   -0.3630573
## 292020281 -0.4403590352 0.9145672   -0.3630573
## 292020361 -0.5810812731 0.8931139   -0.3630573
## 292025081  0.0177566996 0.9848074   -0.3630573
## 292026211 -0.4310888043 0.9159564   -0.3630573
## 292027291  0.0360906280 0.9875724   -0.3630573
## 292027531 -0.1468035589 0.9595433   -0.3630573
## 292028181 -0.9516065645 0.8366186   -0.3630573
## 292028931 -0.2827879859 0.9384785   -0.3630573
## 292029071 -0.1235502786 0.9631440   -0.3630573
## 292029901 -0.7093990434 0.8737989   -0.3630573
## 292033851 -0.9676883244 0.8342611   -0.3630573
## 292772811 -0.9118207676 0.8427624   -0.3630573
## 293550291 -0.4411686754 0.9144761   -0.3630573
## 295341521 -0.1410163713 0.9605459   -0.3630573
## 298046562 -0.7103971846 0.8733274   -0.3630573
## 299680041 -0.2544433495 0.9429397   -0.3630573
## 301853741  0.3549506383 1.0361083   -0.3630573
## 303652591  0.1512338057 1.0050506   -0.3630573
## 303653561 -0.5316874534 0.9007152   -0.3630573
## 303654611 -0.9485425779 0.8369117   -0.3630573
## 303658951  0.7565897654 1.0976083   -0.3630573
## 303660691 -0.9232513879 0.8407719   -0.3630573
## 303662391 -0.8880990428 0.8462687   -0.3630573
## 303663601 -0.3436067474 0.9294571   -0.3630573
## 303668751  0.2059129127 1.0130899   -0.3630573
## 303671891 -0.7322648190 0.8699393   -0.3630573
## 303672001  0.6033786603 1.0740713   -0.3630573
## 303672861  0.2801907677 1.0248500   -0.3630573
## 303673321 -1.1206727310 0.8106276   -0.3630573
## 307407931  0.5204887711 1.0613873   -0.3630573
## 307694141 -1.1651969571 0.8038836   -0.3630573
## 
## attr(,"class")
## [1] "coef.mer"

fixef(modschools)[c("(Intercept)", "year")] + ranef(modschools)$childid['273030991', ]

##           (Intercept)     year
## 273030991   0.9256707 1.123427

coef(modschools)$childid['273030991', ]

##           (Intercept)     year schoolid2040
## 273030991   0.9256707 1.123427   -0.3630573

Exercise B

The data

Data codebook

44 participants across 4 groups (between-subjects) were tested 5 times (waves) in 11 domains. In each wave, participants received a score (on a 20-point scale) for each domain and a set of questions, which they answered either correctly or incorrectly.

load(url("https://uoepsy.github.io/data/msmr_lab5.RData"))

summary(dat5)

##  Anonymous_Subject_ID   IndivDiff          Wave          Domain         
##  Length:2011          Min.   :39.30   Min.   :1.000   Length:2011       
##  Class :character     1st Qu.:69.20   1st Qu.:2.000   Class :character  
##  Mode  :character     Median :79.70   Median :3.000   Mode  :character  
##                       Mean   :77.73   Mean   :2.712                     
##                       3rd Qu.:88.10   3rd Qu.:4.000                     
##                       Max.   :95.20   Max.   :5.000                     
##                       NA's   :1474                                      
##     Correct           Error            Group               Score     
##  Min.   : 0.000   Min.   :0.00000   Length:2011        Min.   : 0.0  
##  1st Qu.: 4.000   1st Qu.:0.00000   Class :character   1st Qu.: 8.0  
##  Median : 8.000   Median :0.00000   Mode  :character   Median :14.0  
##  Mean   : 9.904   Mean   :0.06216                      Mean   :12.2  
##  3rd Qu.:12.000   3rd Qu.:0.00000                      3rd Qu.:17.0  
##  Max.   :45.000   Max.   :1.00000                      Max.   :20.0  
##

Exercise Ba

Research question: Did the groups differ in overall performance?

There are different ways to test this: use the 20-point score or the accuracy? Keep the domains separate or calculate an aggregate across all domains? Which way makes the most sense to you?

Question B1

Make a plot that corresponds to the research question. Does it look like there’s a difference?

Solution

There are lots of options for this one; here is one that shows Group and Domain differences:

ggplot(dat5, aes(Domain, Score, color=Group)) +
  stat_summary(fun.data=mean_se, geom="pointrange") +
  coord_flip()

Looks like there are group differences and domain differences, but not much in the way of group-by-domain differences.

Question B2

Use a mixed-effects model to test the difference.

Will you use a linear or logistic model?
What should the fixed effect(s) be?
What should the random effect(s) be? We have observations clustered by subjects and by domains - are they nested?

Tip: For now, forget about the longitudinal aspect to the data.

Solution

We’re interested in the extent to which Groups vary in their overall performance, so we want a fixed effect of Group. Subjects and Domains are not nested but crossed: each subject is tested in multiple domains, and each domain is seen by multiple subjects.
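One way to check whether one grouping factor is nested within another is to count how many distinct levels of the second each level of the first appears with. The sketch below does this in base R on a small made-up data frame; `toy` and the helper `is_nested()` are hypothetical names, not part of lme4 or the lab materials.

```r
# Toy data: 3 subjects, each tested in 2 domains, each belonging to one group
toy <- data.frame(
  subject = rep(c("s1", "s2", "s3"), each = 2),
  domain  = rep(c("d1", "d2"), times = 3),
  group   = rep(c("A", "A", "B"), each = 2)
)

# inner is nested in outer if every level of inner occurs with exactly one level of outer
is_nested <- function(inner, outer) {
  all(tapply(outer, inner, function(x) length(unique(x))) == 1)
}

is_nested(toy$subject, toy$group)   # are subjects nested within groups?
is_nested(toy$subject, toy$domain)  # are subjects nested within domains?

# A cross-tabulation shows the crossing: subjects appear under several domains
xtabs(~ subject + domain, data = toy)
```

With the lab data, the same check could be run on `dat5$Anonymous_Subject_ID` and `dat5$Domain`.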

# maximal model doesn't converge, removed random Group slopes for Domain
mod_grp <- lmer(Score ~ Group + 
                   (1 | Anonymous_Subject_ID) + 
                   (1 | Domain), 
                 data=dat5, REML=FALSE)
summary(mod_grp)

## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: Score ~ Group + (1 | Anonymous_Subject_ID) + (1 | Domain)
##    Data: dat5
## 
##      AIC      BIC   logLik deviance df.resid 
##  10398.2  10437.5  -5192.1  10384.2     2004 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.0251 -0.4981  0.0639  0.6338  3.2779 
## 
## Random effects:
##  Groups               Name        Variance Std.Dev.
##  Anonymous_Subject_ID (Intercept) 19.486   4.414   
##  Domain               (Intercept)  1.064   1.031   
##  Residual                          9.122   3.020   
## Number of obs: 2011, groups:  Anonymous_Subject_ID, 44; Domain, 11
## 
## Fixed effects:
##             Estimate Std. Error     df t value Pr(>|t|)    
## (Intercept)   15.832      1.121 50.067  14.122  < 2e-16 ***
## GroupB        -4.159      2.262 44.369  -1.839   0.0727 .  
## GroupC        -3.621      1.768 43.968  -2.048   0.0465 *  
## GroupW        -7.270      1.673 44.042  -4.345 8.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##        (Intr) GroupB GroupC
## GroupB -0.457              
## GroupC -0.585  0.290       
## GroupW -0.618  0.306  0.392

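Yes, there are Group differences: overall, group A (the reference level) does the best, groups B and C score roughly 4 points lower (although the difference for group B is not significant at the 5% level, p = 0.073), and group W does the worst, scoring about 7 points lower than group A.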

Exercise Bb

Research question: Did performance change over time (across waves)? Did the groups differ in pattern of change?

Question B3

Make a plot that corresponds to the research question. Does it look like there was a change? A group difference?

Solution

ggplot(dat5, aes(Wave, Score, color=Group, fill=Group)) +
  stat_summary(fun.data=mean_se, geom="ribbon", alpha=0.3, color=NA) +
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
  stat_summary(fun=mean, geom="line")

Yes, looks like groups A, C, and W are improving, but group B is getting worse.

Question B4

Use mixed-effects model(s) to test this.

Hint: Fit a baseline model in which scores change over time (wave), then assess improvement in model fit due to inclusion of overall group effect and finally the interaction of group with time.

Solution

# baseline: change over time only
mod_wv <- lmer(Score ~ Wave + 
                 (1 + Wave | Anonymous_Subject_ID) + 
                 (1 + Wave | Domain), 
               data=dat5, REML=FALSE,
               control = lmerControl(optimizer = "bobyqa"))

# add the overall group effect
mod_wv_grp <- lmer(Score ~ Wave+Group + 
                     (1 + Wave | Anonymous_Subject_ID) + 
                     (1 + Wave | Domain), 
                   data=dat5, REML=FALSE,
                   control = lmerControl(optimizer = "bobyqa"))

# add the group-by-wave interaction
mod_wv_x_grp <- lmer(Score ~ Wave*Group + 
                       (1 + Wave | Anonymous_Subject_ID) + 
                       (1 + Wave | Domain), 
                     data=dat5, REML=FALSE,
                     control = lmerControl(optimizer = "bobyqa"))

anova(mod_wv, mod_wv_grp, mod_wv_x_grp)

## Data: dat5
## Models:
## mod_wv: Score ~ Wave + (1 + Wave | Anonymous_Subject_ID) + (1 + Wave | Domain)
## mod_wv_grp: Score ~ Wave + Group + (1 + Wave | Anonymous_Subject_ID) + (1 + Wave | Domain)
## mod_wv_x_grp: Score ~ Wave * Group + (1 + Wave | Anonymous_Subject_ID) + (1 + Wave | Domain)
##              npar    AIC    BIC  logLik deviance   Chisq Df Pr(>Chisq)   
## mod_wv          9 9719.6 9770.1 -4850.8   9701.6                         
## mod_wv_grp     12 9710.3 9777.6 -4843.1   9686.3 15.3456  3   0.001544 **
## mod_wv_x_grp   15 9710.5 9794.6 -4840.2   9680.5  5.8105  3   0.121204   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
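Each row of this table is a likelihood ratio test against the model in the row above: the Chisq value is the drop in deviance, and Df is the number of added parameters. As a sketch, the p-values can be reproduced from the printed values with `pchisq()`; the object names are illustrative and the deviances are the rounded values from the table.

```r
# Deviances as printed in the table above (rounded to one decimal)
dev_wv       <- 9701.6
dev_wv_grp   <- 9686.3
dev_wv_x_grp <- 9680.5

# Drop in deviance for each step (approximately the Chisq column)
dev_wv - dev_wv_grp
dev_wv_grp - dev_wv_x_grp

# Number of parameters added at each step: 12 - 9 and 15 - 12
df_grp <- 12 - 9
df_int <- 15 - 12

# Upper-tail chi-square probabilities, using the more precise printed Chisq values
p_grp <- pchisq(15.3456, df = df_grp, lower.tail = FALSE)
p_int <- pchisq(5.8105,  df = df_int, lower.tail = FALSE)
p_grp
p_int
```

On this reading, adding Group improves the fit over the Wave-only model, while adding the three Wave:Group interaction terms as a block does not significantly improve it further.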

summary(mod_wv_x_grp)

## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: Score ~ Wave * Group + (1 + Wave | Anonymous_Subject_ID) + (1 +  
##     Wave | Domain)
##    Data: dat5
## Control: lmerControl(optimizer = "bobyqa")
## 
##      AIC      BIC   logLik deviance df.resid 
##   9710.5   9794.6  -4840.2   9680.5     1996 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.0053 -0.5764  0.0049  0.6132  3.7519 
## 
## Random effects:
##  Groups               Name        Variance Std.Dev. Corr 
##  Anonymous_Subject_ID (Intercept) 22.36604 4.7293        
##                       Wave         0.74787 0.8648   -0.34
##  Domain               (Intercept)  2.05332 1.4329        
##                       Wave         0.02189 0.1479   -0.99
##  Residual                          6.06856 2.4634        
## Number of obs: 2011, groups:  Anonymous_Subject_ID, 44; Domain, 11
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept) 12.77276    1.24928 52.57242  10.224 4.23e-14 ***
## Wave         1.25475    0.23886 44.74521   5.253 4.00e-06 ***
## GroupB      -1.36480    2.47446 45.83802  -0.552  0.58393    
## GroupC      -4.14669    1.91661 43.86581  -2.164  0.03599 *  
## GroupW      -6.31853    1.81665 44.18633  -3.478  0.00115 ** 
## Wave:GroupB -1.14231    0.52920 50.14297  -2.159  0.03570 *  
## Wave:GroupC -0.03687    0.36863 36.73490  -0.100  0.92087    
## Wave:GroupW -0.50887    0.35468 38.78415  -1.435  0.15937    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Wave   GroupB GroupC GroupW Wv:GrB Wv:GrC
## Wave        -0.412                                          
## GroupB      -0.444  0.175                                   
## GroupC      -0.574  0.226  0.290                            
## GroupW      -0.605  0.239  0.306  0.395                     
## Wave:GroupB  0.157 -0.436 -0.389 -0.102 -0.108              
## Wave:GroupC  0.225 -0.625 -0.114 -0.366 -0.155  0.282       
## Wave:GroupW  0.234 -0.650 -0.118 -0.153 -0.370  0.293  0.421

Question B5

Plot the group-level data (see Question B3) and model fitted values for each group from your final model from Question B4.

Hint: using broom.mixed::augment(model) as your starting point will help.

Solution

broom.mixed::augment(mod_wv_x_grp) %>%
  ggplot(., aes(Wave, Score, color=Group)) +
  stat_summary(fun.data=mean_se, geom="pointrange") +
  stat_summary(aes(y=.fitted), fun=mean, geom="line")

We fit a linear model, but the model fit lines are not straight lines. Why is that?

Question B6

Create individual subject plots for the data and the model’s fitted values. Will these show straight lines?

Hint: make use of facet_wrap() to create a different panel for each level of a grouping variable.

Solution

broom.mixed::augment(mod_wv_x_grp) %>%
  ggplot(., aes(Wave, Score, color=Group)) +
  facet_wrap(~ Anonymous_Subject_ID) +
  stat_summary(fun.data=mean_se, geom="pointrange") +
  stat_summary(aes(y=.fitted), fun=mean, geom="line")

The individual subject plots show linear fits, which is a better match to the model. But now we see the missing data – some participants only completed the first few waves.

Question B7

Make a plot of the actual (linear) model prediction.

Hint: Use the effect() function from the effects package.

Solution

library(effects)
ef <- as.data.frame(effect("Wave:Group", mod_wv_x_grp))
ggplot(ef, aes(Wave, fit, color=Group, fill=Group)) +
  geom_ribbon(aes(ymax=fit+se, ymin=fit-se), color=NA, alpha=0.1) +
  geom_line()

Question B8

What important things are different between the plot from question B7 and that from question B5?
You can see the plots we created for these questions below:

Solution

Group B was not actually getting worse. The apparent decline is an artifact of selective drop-out: there are only a few people in this group, and the better-performing ones completed only the first few waves, so the later waves reflect only the worse-performing participants. The model estimates how the better-performing participants would have done in later waves based on their early-wave performance and the pattern of performance of other participants in the study.
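
To see how selective drop-out alone can produce an apparent decline, here is a small toy illustration in base R. The four subjects, their intercepts, and the common slope of 0.5 are invented for the sketch and have nothing to do with the values in dat5.

```r
# Hypothetical toy data: every subject improves by 0.5 points per wave
waves <- 1:4
intercepts <- c(s1 = 20, s2 = 18, s3 = 8, s4 = 6)
slope <- 0.5

full <- expand.grid(Wave = waves, Subject = names(intercepts),
                    stringsAsFactors = FALSE)
full$Score <- unname(intercepts[full$Subject]) + slope * full$Wave

# The two better-performing subjects (s1, s2) drop out after wave 2
observed <- subset(full, !(Subject %in% c("s1", "s2") & Wave > 2))

# Wave means with complete data vs. with drop-out
full_means <- tapply(full$Score, full$Wave, mean)
obs_means  <- tapply(observed$Score, observed$Wave, mean)

rbind(complete = full_means, with_dropout = obs_means)
```

With complete data the wave means rise steadily, but with drop-out the observed mean falls sharply at wave 3, even though no individual's score ever decreases.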

summary(mod_wv_x_grp)$coefficients

##                Estimate Std. Error       df    t value     Pr(>|t|)
## (Intercept) 12.77275565  1.2492775 52.57242 10.2241143 4.233953e-14
## Wave         1.25474611  0.2388610 44.74521  5.2530381 3.998024e-06
## GroupB      -1.36479858  2.4744579 45.83802 -0.5515546 5.839325e-01
## GroupC      -4.14668919  1.9166107 43.86581 -2.1635532 3.598767e-02
## GroupW      -6.31853259  1.8166464 44.18633 -3.4781301 1.146411e-03
## Wave:GroupB -1.14230833  0.5292011 50.14297 -2.1585523 3.569665e-02
## Wave:GroupC -0.03687025  0.3686301 36.73490 -0.1000196 9.208726e-01
## Wave:GroupW -0.50887203  0.3546769 38.78415 -1.4347482 1.593740e-01

Note that the Group A slope (the coefficient for Wave) is 1.255, and the Group B slope differs from it by -1.142 (the coefficient for Wave:GroupB). The model-estimated slope for Group B is therefore 1.2547 - 1.1423 ≈ 0.112, which is very slightly positive, not strongly negative as it appeared in the initial plots.
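
The same arithmetic gives the model-estimated slope for every group. This is a minimal sketch in base R using the fixed-effect estimates copied from the coefficient table; the object names are illustrative.

```r
# Fixed-effect estimates copied from summary(mod_wv_x_grp)$coefficients
wave_slope <- 1.25474611            # slope for the reference group (A)
slope_diffs <- c(A = 0,             # reference group: no adjustment
                 B = -1.14230833,   # Wave:GroupB
                 C = -0.03687025,   # Wave:GroupC
                 W = -0.50887203)   # Wave:GroupW

# Each group's slope is the reference slope plus its interaction coefficient
group_slopes <- wave_slope + slope_diffs
round(group_slopes, 3)
```

All four estimated slopes are positive, with Group B's the closest to zero.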

One of the valuable things about mixed-effects (aka multilevel) modeling is that individual-level and group-level trajectories are estimated. This helps the model overcome missing data in a sensible way. In fact, MLM/MLR models are sometimes used for imputing missing data. However, one has to think carefully about why data are missing. Group B is small and it might just be a coincidence that the better-performing participants dropped out after the first few waves, which would make it easier to generalize the patterns to them. On the other hand, it might be the case that there is something about the study that makes better-performing members of Group B drop out, which should make us suspicious of generalizing to them.

Question B9

Create a plot of the subject and domain random effects. Notice the pattern between the random intercept and random slope estimates for the 11 domains - what in our model is this pattern representing?

Solution

randoms <- ranef(mod_wv_x_grp, condVar=TRUE)
dotplot.ranef.mer(randoms)

## $Anonymous_Subject_ID

## 
## $Domain

Notice that the domains with the lower relative intercept tend to have a higher relative slope (and vice versa). This is the negative correlation between random intercepts and slopes for domain in our model:

VarCorr(mod_wv_x_grp)

##  Groups               Name        Std.Dev. Corr  
##  Anonymous_Subject_ID (Intercept) 4.72927        
##                       Wave        0.86479  -0.336
##  Domain               (Intercept) 1.43294        
##                       Wave        0.14795  -0.993
##  Residual                         2.46344

Try removing the correlation (hint: use the ||) to see what happens. Does it make sense that these would be correlated? (Answer: we don’t really know enough about the study, but it’s something to think about!)
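
As a rough sketch of what the || syntax does, the example below fits a model with and without the intercept-slope correlation on simulated data. Everything here (the data frame d, its columns, and the simulation settings) is made up for illustration, and it assumes lme4 is installed; it is not the dat5 analysis itself.

```r
library(lme4)

# Simulated data (hypothetical): 20 subjects x 6 domains x 4 waves
set.seed(1)
d <- expand.grid(Subject = factor(1:20), Domain = factor(1:6), Wave = 0:3)
subj_int  <- rnorm(20, 0, 4)     # subject random intercepts
dom_int   <- rnorm(6, 0, 1.5)    # domain random intercepts
dom_slope <- rnorm(6, 0, 0.5)    # domain random slopes
d$Score <- 12 + subj_int[d$Subject] + dom_int[d$Domain] +
  (1 + dom_slope[d$Domain]) * d$Wave + rnorm(nrow(d), 0, 2)

# | estimates the intercept-slope correlation for Domain
m_corr <- lmer(Score ~ Wave + (1 | Subject) + (1 + Wave | Domain),
               data = d, REML = FALSE)

# || fixes that correlation at zero (one fewer parameter)
m_nocorr <- lmer(Score ~ Wave + (1 | Subject) + (1 + Wave || Domain),
                 data = d, REML = FALSE)

VarCorr(m_nocorr)          # no Corr entry for Domain
anova(m_nocorr, m_corr)    # 1-df test of the correlation parameter
```

For the exercise, the analogous change would be replacing (1 + Wave | Domain) with (1 + Wave || Domain) in mod_wv_x_grp and comparing the two fits.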
