class: center, middle, inverse, title-slide #
Week 2: Longitudinal Data Analysis using Multilevel Modeling
## Multivariate Statistics and Methodology using R (MSMR)
### Dan Mirman ### Department of Psychology
The University of Edinburgh --- # Longitudinal data are a natural application domain for MLM * Longitudinal measurements are *nested* within subjects (by definition) * Longitudinal measurements are related by a continuous variable, spacing can be uneven across participants, and data can be missing + These are problems rmANOVA * Trajectories of longitudinal change can be nonlinear (we'll get to that next week) -- ## We've already seen some of this... * Weight maintenance (Week 1 lab) * Visual search (not longitudinal, but principles are the same) --- # Another example County-level percentage of adults who are physically active at recommended levels. Data from PHE for 2012-2015 (`PHE_MentalHealth.Rdata`) ```r library(tidyverse) library(lme4) library(effects) load("./data/PHE_MentalHealth.Rdata") str(mh_phe) ``` ``` ## 'data.frame': 9422 obs. of 7 variables: ## $ IndicatorID : int 848 848 848 848 848 848 848 848 848 848 ... ## $ IndicatorName: chr "Depression: Recorded prevalence (aged 18+)" "Depression: Recorded prevalence (aged 18+)" "Depression: Recorded prevalence (aged 18+)" "Depression: Recorded prevalence (aged 18+)" ... ## $ County : chr "Hartlepool" "Middlesbrough" "Redcar and Cleveland" "Stockton-on-Tees" ... ## $ Region : chr "North East" "North East" "North East" "North East" ... ## $ Timeperiod : chr "2013/14" "2013/14" "2013/14" "2013/14" ... ## $ Value : num 5.32 5.74 7.86 9.19 7.38 ... ## $ Year : num 2013 2013 2013 2013 2013 ... ``` --- # Indicators ```r gt::gt(unique(mh_phe[, 1:2])) ```
IndicatorID
IndicatorName
848
Depression: Recorded prevalence (aged 18+)
41001
Suicide rate
90275
Percentage of physically active adults - historical method
90646
Depression: QOF incidence (18+) - new diagnosis
The indicator ID for percentage of physically active adults is 90275 --- ```r filter(mh_phe, IndicatorID == 90275) %>% ggplot(aes(Year, Value, color = Region)) + stat_summary(fun = mean, geom="point") + geom_smooth(method = "lm", se=FALSE) ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` ![](msmr_lec02-2_LinearLDA_files/figure-html/unnamed-chunk-3-1.png)<!-- --> Looks like there are some interesting differences. We can ask two kinds of questions: 1. Did the regions differ in their *baseline (2012)* percentage physically active adults? 2. Did the regions differ in their *slopes of change* of percentage physically active adults? --- # What is the baseline? **Did the regions differ in their *baseline (2012)* percentage physically active adults?** We can answer this question using the **intercept** coefficients -- But those will be estimated at `Year = 0`, and the question is not about physical activity the year Jesus was born. So we need to adjust the time variable so that 2012 corresponds to time 0 (and we can select just the physical activity values while we're at it): ```r active_dat <- filter(mh_phe, IndicatorID == 90275) %>% mutate(Time = Year - 2012) ``` Now `Time` is a variable just like `Year`, but going from 0 to 3 instead of 2012 to 2015 --- # Fit the models ```r # base model: just change over time m <- lmer(Value ~ Time + (Time | County), data = active_dat, REML = F) # add baseline differences between regions m.0 <- lmer(Value ~ Time + Region + (Time | County), data = active_dat, REML = F) # add slope differences between regions m.1 <- lmer(Value ~ Time * Region + (Time | County), data = active_dat, REML = F) ``` --- # Compare models ```r anova(m, m.0, m.1) ``` ``` ## Data: active_dat ## Models: ## m: Value ~ Time + (Time | County) ## m.0: Value ~ Time + Region + (Time | County) ## m.1: Value ~ Time * Region + (Time | County) ## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq) ## m 6 3299 3325 -1643 3287 ## m.0 14 3275 3337 -1624 3247 39.34 8 4.3e-06 *** ## m.1 22 3281 3378 -1619 3237 9.86 8 0.28 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` Regions differed in baseline percentage of physically active adults ( `\(\chi^2(8) = 39.3, p < 0.0001\)` ), but there not in terms of the slope of change ( `\(\chi^2(8) = 9.86, p = 0.28\)` ). --- # Visualize effects, Option 1: model-estimated percentages ```r ef <- as.data.frame(effect("Time:Region", m.1)) ggplot(ef, aes(Time, fit, color=Region)) + geom_line() + theme_bw() + scale_color_brewer(palette = "Set1") ``` ![](msmr_lec02-2_LinearLDA_files/figure-html/unnamed-chunk-7-1.png)<!-- --> --- # Visualize effects, Option 2: estimated parameters [Side note: it took me a while to work out the data wrangling required to get the coefficients into a convenient form for plotting] ```r ggplot(est, aes(Region, estimate, ymin=(estimate-std.error), ymax=(estimate+std.error))) + facet_wrap(~ Type, scales="free_x") + geom_pointrange() + coord_flip() ``` ![](msmr_lec02-2_LinearLDA_files/figure-html/unnamed-chunk-9-1.png)<!-- -->