Each of these pages provides an analysis run through for a different type of design. Each document is structured in the same way:

  • First the data and research context is introduced. For the purpose of these tutorials, we will only use examples where the data can be shared - either because it is from an open access publication, or because it is unpublished or simulated.
  • Second, we go through any tidying of the data that is required, before creating some brief descriptives and visualizations of the raw data.
  • Then, we conduct an analysis. Where possible, we translate the research questions into formal equations prior to fitting the models in lme4. Model comparisons are conducted, along with checks of distributional assumptions on our model residuals.
  • Finally, we visualize and interpret our analysis.

Please note that there will be only minimal explanation of the steps undertaken here, as these pages are intended as example analyses rather than additional labs readings. Please also be aware that there are many decisions to be made throughout conducting analyses, and it may be the case that you disagree with some of the choices we make here. As always with these things, it is how we justify our choices that is important. We warmly welcome any feedback and suggestions to improve these examples: please email ug.ppls.stats@ed.ac.uk.

Overview

Intervention studies are where the researchers intervenes (e.g. through administration of a drug or treatment) as part of the study

The data used for this worked example are simulated to represent data from 50 participants, each measured at 3 different time-points (pre, during, and post) on a measure of stress. Half of the participants are in the control group, and half are in the treatment group (e.g. half received some cognitive behavioural therapy (CBT), half did not).

set.seed(983)
library(tidyverse)
simMIX <- tibble(
  ppt = factor(rep(paste("ID", 1:50, sep=""),each=3)),
  ppt_int = rep(rnorm(50,0,15), each=3), 
  stress = round(
    c(rnorm(75,c(50,45,42),sd=c(5, 6, 5)),
      rnorm(75,c(55,30,20),sd=c(6, 10, 5))) + ppt_int,
    0),
  time = as_factor(rep(c("Pre", "During", "Post"), 50)),
  group = as_factor(c(rep("Control", 75), rep("Treatment", 75)))
) %>% select(-ppt_int)

Data Wrangling

Because we simulated our data, it is already nice and tidy. Each observation is a row, and we have variable indicating participant identifier (the ppt variable).

summary(simMIX)
##       ppt          stress           time          group   
##  ID1    :  3   Min.   :-13.00   Pre   :50   Control  :75  
##  ID10   :  3   1st Qu.: 29.00   During:50   Treatment:75  
##  ID11   :  3   Median : 43.00   Post  :50                 
##  ID12   :  3   Mean   : 42.22                             
##  ID13   :  3   3rd Qu.: 55.75                             
##  ID14   :  3   Max.   : 90.00                             
##  (Other):132
bind_rows(head(simMIX), tail(simMIX))
## # A tibble: 12 x 4
##    ppt   stress time   group    
##    <fct>  <dbl> <fct>  <fct>    
##  1 ID1       65 Pre    Control  
##  2 ID1       81 During Control  
##  3 ID1       55 Post   Control  
##  4 ID2       62 Pre    Control  
##  5 ID2       57 During Control  
##  6 ID2       67 Post   Control  
##  7 ID49      58 Pre    Treatment
##  8 ID49      34 During Treatment
##  9 ID49      25 Post   Treatment
## 10 ID50      35 Pre    Treatment
## 11 ID50      21 During Treatment
## 12 ID50       0 Post   Treatment

Descriptives

We now have two factors of interest, scores over Time and by Group. So we do our descriptive statistics by multiple grouping factors.

sumMIX <- 
  simMIX %>%
  group_by(time, group) %>%
  summarise(
    N = n_distinct(ppt),
    Mean = round(mean(stress, na.rm=T),2),
    SD = round(sd(stress, na.rm=T),2)
    )
library(kableExtra)
kable(sumMIX) %>%
  kable_styling("striped")
time group N Mean SD
Pre Control 25 52.84 14.60
Pre Treatment 25 54.76 16.78
During Control 25 49.64 13.23
During Treatment 25 30.44 15.49
Post Control 25 46.08 14.24
Post Treatment 25 19.56 15.99

Visualizations

We can use a violin plot to visualize our data.

simMIX %>% 
  ggplot(aes(x=time, y=stress, color=group, fill=group))  + 
  geom_violin(alpha = .25) + 
  geom_dotplot(binaxis='y', stackdir='center', position=position_dodge(1),
               dotsize = 0.5) +
  labs(x="Time", y = "Stress Score", 
       title="Stress over Time by Treat Vs Control", color="Condition", fill="Condition")

It is useful to note that the violin plot contains much of the same information that a more traditional boxplot would have, but provides more information on the actual distribution with groups as it draws on density plots.

What we can see here is that over time, the control group shows little change, but the treatment group shows distinct declines.
We can also consider line plots with separate facets for each condition:

simMIX %>% 
  ggplot(aes(x=time, y=stress, col=group))  + 
  geom_point(size=3, alpha=.4)+
  geom_line(aes(group=ppt), alpha=.3)+
  stat_summary(geom="pointrange", col="black")+
  facet_wrap(~group)+
  labs(x="Time", y = "Stress Score", 
       title="Stress over Time by Treat Vs Control", color="Condition", fill="Condition")+
  theme_minimal()

Analysis

Equations

\[ \begin{aligned} \operatorname{stress}_{i[j]} &= \beta_{0i} + \beta_{1i}(\operatorname{timeDuring}_j) + \beta_{2i}(\operatorname{timePost}_j) + \varepsilon_{i[j]} \\ \beta_{0i} &= \gamma_{00} + \gamma_{01}(\operatorname{groupTreatment}_i) + \zeta_{0i} \\ \beta_{1i} &= \gamma_{10} + \gamma_{11}(\operatorname{groupTreatment}_i) \\ \beta_{2i} &= \gamma_{20} + \gamma_{21}(\operatorname{groupTreatment}_i) \\ & \text{for ppt i = 1,} \dots \text{, I} \end{aligned} \]

Fitting the models

library(lme4)

Base model:

m0 <- lmer(stress ~ 1 +
             (1 | ppt), data = simMIX)
summary(m0)
## Linear mixed model fit by REML ['lmerMod']
## Formula: stress ~ 1 + (1 | ppt)
##    Data: simMIX
## 
## REML criterion at convergence: 1286.7
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.02099 -0.49670  0.01511  0.49445  1.96548 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ppt      (Intercept) 179.7    13.40   
##  Residual             209.7    14.48   
## Number of obs: 150, groups:  ppt, 50
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   42.220      2.234    18.9

Main effects:

m1 <- lmer(stress ~ 1 + time + group +
             (1 | ppt), data = simMIX)
summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: stress ~ 1 + time + group + (1 | ppt)
##    Data: simMIX
## 
## REML criterion at convergence: 1185.9
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.69666 -0.62649  0.01574  0.59245  1.92387 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ppt      (Intercept) 166.57   12.91   
##  Residual              98.01    9.90   
## Number of obs: 150, groups:  ppt, 50
## 
## Fixed effects:
##                Estimate Std. Error t value
## (Intercept)      61.100      3.046  20.061
## timeDuring      -13.760      1.980  -6.950
## timePost        -20.980      1.980 -10.596
## groupTreatment  -14.600      3.992  -3.657
## 
## Correlation of Fixed Effects:
##             (Intr) tmDrng timPst
## timeDuring  -0.325              
## timePost    -0.325  0.500       
## groupTrtmnt -0.655  0.000  0.000

Interaction:

m2 <- lmer(stress ~ 1 + time + group + time*group +
             (1 | ppt), data = simMIX)
summary(m2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: stress ~ 1 + time + group + time * group + (1 | ppt)
##    Data: simMIX
## 
## REML criterion at convergence: 1096.5
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.83946 -0.53326 -0.05639  0.53766  2.31546 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ppt      (Intercept) 184.82   13.595  
##  Residual              43.26    6.577  
## Number of obs: 150, groups:  ppt, 50
## 
## Fixed effects:
##                           Estimate Std. Error t value
## (Intercept)                 52.840      3.020  17.494
## timeDuring                  -3.200      1.860  -1.720
## timePost                    -6.760      1.860  -3.634
## groupTreatment               1.920      4.272   0.449
## timeDuring:groupTreatment  -21.120      2.631  -8.028
## timePost:groupTreatment    -28.440      2.631 -10.810
## 
## Correlation of Fixed Effects:
##             (Intr) tmDrng timPst grpTrt tmDr:T
## timeDuring  -0.308                            
## timePost    -0.308  0.500                     
## groupTrtmnt -0.707  0.218  0.218              
## tmDrng:grpT  0.218 -0.707 -0.354 -0.308       
## tmPst:grpTr  0.218 -0.354 -0.707 -0.308  0.500

Comparison with aov()

As we did with the simple repeated measures, we can also compare the sums of squares breakdown for the LMM (m2) by calling anova() on the lmer() model.

First with aov():

m3 <- aov(stress ~ time + group + time*group +
            Error(ppt), data = simMIX)
summary(m3)
## 
## Error: ppt
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## group      1   7993    7993   13.37 0.000633 ***
## Residuals 48  28691     598                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##            Df Sum Sq Mean Sq F value Pr(>F)    
## time        2  11360    5680  131.31 <2e-16 ***
## time:group  2   5452    2726   63.01 <2e-16 ***
## Residuals  96   4153      43                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And then summarise:

anova(m2)
## Analysis of Variance Table
##            npar  Sum Sq Mean Sq F value
## time          2 11360.4  5680.2 131.305
## group         1   578.5   578.5  13.373
## time:group    2  5452.0  2726.0  63.014

For ease, lets compare all models with a parametric bootstrap likelihood ratio test:

library(pbkrtest)
PBmodcomp(m1, m0)
PBmodcomp(m2, m1)
## Bootstrap test; time: 34.66 sec; samples: 1000; extremes: 0;
## large : stress ~ 1 + time + group + (1 | ppt)
## stress ~ 1 + (1 | ppt)
##          stat df   p.value    
## LRT    90.348  3 < 2.2e-16 ***
## PBtest 90.348     0.000999 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Bootstrap test; time: 37.80 sec; samples: 1000; extremes: 0;
## large : stress ~ 1 + time + group + time * group + (1 | ppt)
## stress ~ 1 + time + group + (1 | ppt)
##          stat df   p.value    
## LRT    83.853  2 < 2.2e-16 ***
## PBtest 83.853     0.000999 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And extract some bootstrap 95% CIs

confint(m2,method="boot")
##                                2.5 %      97.5 %
## .sig01                     11.063915  16.8989137
## .sigma                      5.570576   7.3318954
## (Intercept)                46.958215  58.9035049
## timeDuring                 -6.598935  -0.1161317
## timePost                  -10.473009  -3.1302747
## groupTreatment             -7.001190  10.0517531
## timeDuring:groupTreatment -26.395306 -16.1101659
## timePost:groupTreatment   -33.665532 -23.2588741

Check Model

The residuals look reasonably normally distributed, and there seems to be fairly constant variance across the linear predictor. We might be a little concerned about the potential tails of the plot below, at which residuals don’t appear to have a mean of zero

plot(m2, type = c("p","smooth"))

plot(m2, sqrt(abs(resid(.)))~fitted(.))

library(lattice)
qqmath(m2)

Random effects are (roughly) normally distributed:

rans <- as.data.frame(ranef(m2)$ppt)
ggplot(rans, aes(sample = `(Intercept)`)) + 
  stat_qq() + stat_qq_line() +
  labs(title="random intercept")

Visualise Model

library(sjPlot)
plot_model(m2, type="int")

plot_model(m2, type="re") # an alternative: dotplot.ranef.mer(ranef(m1))

Interpret model

Stress at each study time-point (pre-, during- and post-intervention) was modeled using linear mixed effects models, with by-participant random intercepts, and the incremental addition of time, condition (control vs treatment) and their interaction. An interaction between time and condition improved model fit over the model without the interaction (Parametric bootstrap LRT 83.85, p < .001), suggesting that patterns of change over time in stress levels differed for the treatment group from the control group.
Result show that for the control group, stress did not show clear change between time-points 1 and 2 (-3.2, 95% CI [-6.92 – 0.32]), but by time-point 3 had reduced relative to time-point 1 by -6.76 [-10.58 – -3.3] points. Prior to the intervention (time-point 1), the treatment group did not appear to differ from the control group with respect to their level of stress, but showed an additional -21.12 [-26.75 – -15.63] point change by time-point 2 and -28.44 [-33.75 – -23.23] points by the end of the study (time-point 3). This pattern of results is shown in Figure 1.

Figure 1: Stress levels over time for control and treatment groups, with bootstrap prediction intervals (R = 2000)

Figure 1: Stress levels over time for control and treatment groups, with bootstrap prediction intervals (R = 2000)