Week 11 Exercises: More SEM!

PSS & IQ

Prenatal Stress & IQ data

A researcher is interested in the effects of prenatal stress on child cognitive outcomes. She has a 5-item measure of prenatal stress and a 5-subtest measure of child cognitive ability, collected for 500 mother-infant dyads.

The data is available as a .csv file here: https://uoepsy.github.io/data/stressIQ.csv

variable	description
ID	Participant ID
stress1	acute stress
stress2	chronic stress
stress3	environmental stress
stress4	psychological stress
stress5	physiological stress
IQ1	verbal ability
IQ2	verbal memory
IQ3	inductive reasoning
IQ4	spatial orientation
IQ5	perceptual speed

Question 1

Before we do anything with the data, grab some paper and sketch out the full model that you plan to fit to address the researcher’s question.

Tip: everything you need is in the description! Start by drawing the specific path(s) of interest. Are these between latent variables? If so, add in the paths to the indicators for each latent variable.

Question 2

Read in the data and explore it. Look at the individual distributions of each variable to get a sense of univariate normality, as well as the number of response options each item has.

Hints

The psych package has some useful functionality here. Specifically, functions like multi.hist() and describe() will be handy.

If necessary, you can set multi.hist(data, global = FALSE) to let each histogram’s x-axis be on a different scale.
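As a minimal sketch of this exploration (not an official solution), something like the following should work. It assumes the tidyverse and psych packages are installed, and that the participant identifier column is called ID as in the table above. The object name stress_IQ_data is chosen only to match the name used in the solutions further down.

```r
library(tidyverse)
library(psych)

# read the data straight from the course URL
stress_IQ_data <- read_csv("https://uoepsy.github.io/data/stressIQ.csv")

# histograms for every item (dropping the ID column, assumed to be named "ID"),
# letting each histogram have its own x-axis scale
stress_IQ_data |> select(-ID) |> multi.hist(global = FALSE)

# numeric summaries, including skew and kurtosis
stress_IQ_data |> select(-ID) |> describe()

# how many distinct response options does each stress item have?
stress_IQ_data |> select(starts_with("stress")) |> map_int(n_distinct)
```

The last line is just a quick way to spot items with only a handful of response options, which matters later when choosing an estimator.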

Question 3

As always, we want to assess the measurement models of each construct.

Let’s start with IQ. Fit a one factor CFA for the IQ items, and examine the fit. If it doesn’t fit very well, consider checking for areas of local misfit (i.e., check your modindices()), and adjust your model accordingly.

Be sure to use an appropriate estimation method, given the distributions of the indicator variables!

Hints

See the section on non-normality in Chapter 8#SEM-non-normality.

Solution 3. Because our variables seem to be non-normal, we should use a robust estimator such as MLR for our CFA.

model_IQ <- 'IQ =~ IQ1 + IQ2 + IQ3 + IQ4 + IQ5'

model_IQ.est <- cfa(model_IQ, data=stress_IQ_data, estimator='MLR')

We also get out robust fit measures, so we should ask for them:

fitmeasures(model_IQ.est)[c("rmsea.robust","srmr","cfi.robust","tli.robust")]

rmsea.robust         srmr   cfi.robust   tli.robust 
  0.18374167   0.05679702   0.90855073   0.81710146

The model doesn’t fit very well, so we can check the modification indices for local misspecification.

modindices(model_IQ.est, sort=T) |> head()

   lhs op rhs     mi     epc sepc.lv sepc.all sepc.nox
12 IQ1 ~~ IQ2 82.419  30.034  30.034    0.750    0.750
21 IQ4 ~~ IQ5 47.248  22.040  22.040    0.393    0.393
17 IQ2 ~~ IQ4 30.285 -16.713 -16.713   -0.352   -0.352
19 IQ3 ~~ IQ4 16.277  12.365  12.365    0.221    0.221
14 IQ1 ~~ IQ4 15.715 -13.750 -13.750   -0.266   -0.266
15 IQ1 ~~ IQ5 15.368 -12.954 -12.954   -0.273   -0.273

It looks like we might need to include residual covariances between the items IQ1 and IQ2 and maybe also between items IQ4 and IQ5. As always, we need to double check this makes substantive sense. Items IQ1 and IQ2 measure verbal ability and verbal memory, so people might be likely to score low/high on both of these due to their general facility with language. IQ4 and IQ5 might be related because both are tests requiring visual perception, but it’s less obvious. We’d probably want to know more about the specific tests undertaken.

model2_IQ <- '
    IQ=~IQ1+IQ2+IQ3+IQ4+IQ5
    IQ1~~IQ2
'
model2_IQ.est <- cfa(model2_IQ, data=stress_IQ_data, estimator='MLR')

fitmeasures(model2_IQ.est)[c("rmsea.robust","srmr","cfi.robust","tli.robust")]

rmsea.robust         srmr   cfi.robust   tli.robust 
  0.07492424   0.02554867   0.98783535   0.96958837

The fit of the model is now much improved! The RMSEA is still in that grey area between 0.05 and 0.08, so we would probably want to flag this when writing up. We could keep trying to add stuff in order to get it below 0.05, but that means a high risk of overfitting.
Note also that our standardised loadings are all significant and \(> 0.3\) in absolute value.

standardizedsolution(model2_IQ.est)

   lhs op rhs est.std    se      z pvalue ci.lower ci.upper
1   IQ =~ IQ1   0.649 0.037 17.470      0    0.576    0.722
2   IQ =~ IQ2   0.622 0.040 15.421      0    0.543    0.701
3   IQ =~ IQ3   0.660 0.037 17.764      0    0.587    0.733
4   IQ =~ IQ4   0.750 0.030 25.351      0    0.692    0.808
5   IQ =~ IQ5   0.747 0.030 25.108      0    0.688    0.805
6  IQ1 ~~ IQ2   0.467 0.045 10.287      0    0.378    0.556
7  IQ1 ~~ IQ1   0.579 0.048 11.994      0    0.484    0.673
8  IQ2 ~~ IQ2   0.613 0.050 12.196      0    0.514    0.711
9  IQ3 ~~ IQ3   0.564 0.049 11.496      0    0.468    0.660
10 IQ4 ~~ IQ4   0.437 0.044  9.839      0    0.350    0.524
11 IQ5 ~~ IQ5   0.442 0.044  9.960      0    0.355    0.529
12  IQ ~~  IQ   1.000 0.000     NA     NA    1.000    1.000
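If we want a formal comparison to sit alongside the fit indices, one option is a chi-square difference test between the two nested IQ models. The sketch below re-reads the data so that it runs on its own. lavTestLRT() is lavaan's likelihood ratio test function, and because both models were fitted with estimator = "MLR" it works from the scaled test statistics. The object name lrt is just for illustration.

```r
library(lavaan)

stress_IQ_data <- read.csv("https://uoepsy.github.io/data/stressIQ.csv")

# original one-factor model
model_IQ <- 'IQ =~ IQ1 + IQ2 + IQ3 + IQ4 + IQ5'

# model with the added residual covariance
model2_IQ <- '
    IQ =~ IQ1 + IQ2 + IQ3 + IQ4 + IQ5
    IQ1 ~~ IQ2
'

model_IQ.est  <- cfa(model_IQ,  data = stress_IQ_data, estimator = "MLR")
model2_IQ.est <- cfa(model2_IQ, data = stress_IQ_data, estimator = "MLR")

# scaled chi-square difference test between the nested models
lrt <- lavTestLRT(model_IQ.est, model2_IQ.est)
lrt
```

Bear in mind that the extra parameter was chosen by looking at the modification indices for this same sample, so the test is best read as descriptive rather than confirmatory.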

Question 4

Now do the same for a one-factor confirmatory factor analysis for the latent factor of Stress. Note that the items are measured on a 3-point scale!

Hints

See the section on categorical variables in the reading: Chapter 8#SEM-endogenous-categorical.

When you inspect your summary model output, notice that we have a couple of additional things - we have ‘scaled’ and ‘robust’ values for the fit statistics (we have a second column for all the fit indices but using a scaled version of the \(\chi^2\) statistic, and then we also have some extra rows of ‘robust’ measures), and we have the estimated ‘thresholds’ in our output (there are two thresholds per item in this example because we have a three-point response scale). The estimates themselves are not of great interest to us.

Solution 4.

# specify the model
model_stress <- 'Stress =~ stress1 + stress2 + stress3 + stress4 + stress5'

# estimate the model - cfa() will automatically switch to a categorical estimator when we declare
# our five variables as ordered-categorical via the 'ordered' argument
# (ordered=TRUE treats all observed endogenous variables in the model as ordered)
model_stress.est <- 
  cfa(model_stress, data=stress_IQ_data, ordered=TRUE)

# inspect the output
fitmeasures(model_stress.est)[c("rmsea.robust","srmr","cfi.robust","tli.robust")]

rmsea.robust         srmr   cfi.robust   tli.robust 
  0.00000000   0.01234819   1.00000000   1.06512057

summary(model_stress.est, standardized=TRUE)

lavaan 0.6-20 ended normally after 18 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        15

  Number of observations                           500

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.520       0.773
  Degrees of freedom                                 5           5
  P-value (Chi-square)                           0.991       0.979
  Scaling correction factor                                  0.740
  Shift parameter                                            0.070
    simple second-order correction                                

Parameter Estimates:

  Parameterization                               Delta
  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model        Unstructured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Stress =~                                                             
    stress1           1.000                               0.619    0.619
    stress2           1.111    0.121    9.214    0.000    0.688    0.688
    stress3           1.182    0.137    8.614    0.000    0.732    0.732
    stress4           1.049    0.130    8.089    0.000    0.649    0.649
    stress5           1.134    0.129    8.790    0.000    0.702    0.702

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    stress1|t1       -0.151    0.056   -2.680    0.007   -0.151   -0.151
    stress1|t2        1.825    0.108   16.973    0.000    1.825    1.825
    stress2|t1       -0.900    0.065  -13.806    0.000   -0.900   -0.900
    stress2|t2        1.379    0.081   17.124    0.000    1.379    1.379
    stress3|t1        0.700    0.061   11.399    0.000    0.700    0.700
    stress3|t2        2.878    0.315    9.124    0.000    2.878    2.878
    stress4|t1       -1.751    0.102  -17.198    0.000   -1.751   -1.751
    stress4|t2        1.461    0.084   17.324    0.000    1.461    1.461
    stress5|t1       -0.233    0.057   -4.107    0.000   -0.233   -0.233
    stress5|t2        1.825    0.108   16.973    0.000    1.825    1.825

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .stress1           0.616                               0.616    0.616
   .stress2           0.526                               0.526    0.526
   .stress3           0.464                               0.464    0.464
   .stress4           0.578                               0.578    0.578
   .stress5           0.507                               0.507    0.507
    Stress            0.384    0.064    6.011    0.000    1.000    1.000

Question 5

Now it’s time to build the full SEM.
Estimate the effect of prenatal stress on IQ.

Hints

Remember: We know that IQ indicators are non-normal, so we would like to use a robust estimator (e.g. MLR). And the Stress indicators are only on a 3-point scale, so we want to make sure we specify that too. However, as lavaan will tell you if you try using estimator="MLR" at the same time as ordered = c(....), the MLR estimator is not supported for ordered data. It suggests instead using the WLSMV (weighted least square mean and variance adjusted) estimator.

As it happens, the WLSMV estimator is just the “DWLS” one we use for categorical variables, but with robust standard errors and a mean- and variance-adjusted test statistic. If you specify estimator="WLSMV" then your standard errors will be corrected, but don’t be misled by the fact that the summary here will still say that the estimator is DWLS.
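One way to convince yourself of this is to ask the fitted object which options lavaan actually used. Here is a small sketch using just the Stress CFA so that it runs on its own. The names fit_wlsmv and opts are illustrative, and lavInspect(fit, "options") returns the full list of options that lavaan settled on.

```r
library(lavaan)

stress_IQ_data <- read.csv("https://uoepsy.github.io/data/stressIQ.csv")

model_stress <- 'Stress =~ stress1 + stress2 + stress3 + stress4 + stress5'

fit_wlsmv <- cfa(model_stress, data = stress_IQ_data,
                 ordered = paste0("stress", 1:5),
                 estimator = "WLSMV")

# pull out the estimator, standard error type and test statistic that were used
opts <- lavInspect(fit_wlsmv, "options")
opts[c("estimator", "se", "test")]
```

The estimator entry should match what the summary header reports, while the se and test entries show the corrections that "WLSMV" adds on top.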

Solution 5.

SEM_model <- '
    #IQ measurement model
    IQ =~ IQ1 + IQ2 + IQ3 + IQ4 + IQ5 
    IQ1 ~~ IQ2

    #stress measurement model 
    Stress =~ stress1 + stress2 + stress3 + stress4 + stress5 

    #structural part of model
    IQ ~ Stress
'

SEM_model.est <- sem(SEM_model, data=stress_IQ_data,
                     ordered=c('stress1','stress2','stress3','stress4','stress5'),
                     estimator="WLSMV")

Let’s print out the full summary.
Note that when we have a mix of ordered categoricals and continuous variables, then we can’t get the robust estimates of the fit indices. These are now NA. We do still get the scaled versions though, and everything seems to fit fairly well.

summary(SEM_model.est, fit.measures=T, standardized=T)

lavaan 0.6-20 ended normally after 126 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        32

  Number of observations                           500

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                18.118      30.244
  Degrees of freedom                                33          33
  P-value (Chi-square)                           0.983       0.605
  Scaling correction factor                                  0.802
  Shift parameter                                            7.657
    simple second-order correction                                

Model Test Baseline Model:

  Test statistic                              2153.655    1244.795
  Degrees of freedom                                45          45
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.758

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000       1.000
  Tucker-Lewis Index (TLI)                       1.010       1.003
                                                                  
  Robust Comparative Fit Index (CFI)                            NA
  Robust Tucker-Lewis Index (TLI)                               NA

Root Mean Square Error of Approximation:

  RMSEA                                          0.000       0.000
  90 Percent confidence interval - lower         0.000       0.000
  90 Percent confidence interval - upper         0.000       0.029
  P-value H_0: RMSEA <= 0.050                    1.000       1.000
  P-value H_0: RMSEA >= 0.080                    0.000       0.000
                                                                  
  Robust RMSEA                                                  NA
  90 Percent confidence interval - lower                        NA
  90 Percent confidence interval - upper                        NA
  P-value H_0: Robust RMSEA <= 0.050                            NA
  P-value H_0: Robust RMSEA >= 0.080                            NA

Standardized Root Mean Square Residual:

  SRMR                                           0.031       0.031

Parameter Estimates:

  Parameterization                               Delta
  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model        Unstructured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  IQ =~                                                                 
    IQ1               1.000                               6.954    0.658
    IQ2               0.844    0.046   18.428    0.000    5.869    0.630
    IQ3               0.902    0.070   12.908    0.000    6.272    0.676
    IQ4               1.105    0.080   13.768    0.000    7.681    0.736
    IQ5               1.033    0.071   14.601    0.000    7.187    0.730
  Stress =~                                                             
    stress1           1.000                               0.611    0.611
    stress2           1.099    0.125    8.820    0.000    0.672    0.672
    stress3           1.256    0.151    8.336    0.000    0.768    0.768
    stress4           1.066    0.152    7.008    0.000    0.652    0.652
    stress5           1.134    0.126    8.974    0.000    0.693    0.693

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  IQ ~                                                                  
    Stress            4.431    0.792    5.596    0.000    0.390    0.390

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
 .IQ1 ~~                                                                
   .IQ2              26.333    2.861    9.203    0.000   26.333    0.458

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .IQ1              23.108    0.587   39.391    0.000   23.108    2.187
   .IQ2              18.158    0.492   36.916    0.000   18.158    1.949
   .IQ3              10.100    0.563   17.945    0.000   10.100    1.088
   .IQ4              11.050    0.640   17.255    0.000   11.050    1.059
   .IQ5              21.376    0.514   41.601    0.000   21.376    2.171

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    stress1|t1       -0.151    0.056   -2.680    0.007   -0.151   -0.151
    stress1|t2        1.825    0.108   16.973    0.000    1.825    1.825
    stress2|t1       -0.900    0.065  -13.806    0.000   -0.900   -0.900
    stress2|t2        1.379    0.081   17.124    0.000    1.379    1.379
    stress3|t1        0.700    0.061   11.399    0.000    0.700    0.700
    stress3|t2        2.878    0.315    9.124    0.000    2.878    2.878
    stress4|t1       -1.751    0.102  -17.198    0.000   -1.751   -1.751
    stress4|t2        1.461    0.084   17.324    0.000    1.461    1.461
    stress5|t1       -0.233    0.057   -4.107    0.000   -0.233   -0.233
    stress5|t2        1.825    0.108   16.973    0.000    1.825    1.825

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .IQ1              63.305    4.235   14.948    0.000   63.305    0.567
   .IQ2              52.315    3.532   14.814    0.000   52.315    0.603
   .IQ3              46.783    3.610   12.960    0.000   46.783    0.543
   .IQ4              49.830    3.627   13.739    0.000   49.830    0.458
   .IQ5              45.268    3.566   12.696    0.000   45.268    0.467
   .stress1           0.626                               0.626    0.626
   .stress2           0.548                               0.548    0.548
   .stress3           0.411                               0.411    0.411
   .stress4           0.575                               0.575    0.575
   .stress5           0.519                               0.519    0.519
   .IQ               41.026    4.832    8.491    0.000    0.848    0.848
    Stress            0.374    0.065    5.774    0.000    1.000    1.000

We can see that the standardised effect of prenatal stress on offspring IQ is \(\beta\) = 0.39 and is statistically significant (\(p<.001\)).
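Rather than scanning the full summary, we can also pull out just the pieces we want to report. The sketch below refits the same model so that it runs on its own, then extracts the scaled fit indices and the standardised structural path. The object name std_paths is just for illustration.

```r
library(lavaan)

stress_IQ_data <- read.csv("https://uoepsy.github.io/data/stressIQ.csv")

SEM_model <- '
    # IQ measurement model
    IQ =~ IQ1 + IQ2 + IQ3 + IQ4 + IQ5
    IQ1 ~~ IQ2

    # stress measurement model
    Stress =~ stress1 + stress2 + stress3 + stress4 + stress5

    # structural part of model
    IQ ~ Stress
'

SEM_model.est <- sem(SEM_model, data = stress_IQ_data,
                     ordered = paste0("stress", 1:5),
                     estimator = "WLSMV")

# scaled versions of the fit indices (the robust ones are NA for this model)
fitmeasures(SEM_model.est, c("cfi.scaled", "tli.scaled", "rmsea.scaled", "srmr"))

# standardised solution, keeping only the regression path(s)
std_paths <- standardizedsolution(SEM_model.est)
std_paths <- std_paths[std_paths$op == "~", ]
std_paths
```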

Extra - Question 6

Returning to our full SEM, adjust your model so that instead of IQ ~ Stress we are fitting Stress ~ IQ (i.e. child cognitive ability \(\rightarrow\) prenatal stress).

Take a guess: will this model fit better or worse than our first one?

A Replication

Question 7

In order to try and replicate the IQ CFA, our researcher collects a new sample of size \(n=500\). However, she has some missing data (specifically, those who scored poorly on the first few tests tended to feel discouraged and chose not to complete further tests).

Read in the new dataset, plot and numerically summarise the univariate distributions of the measured variables, and then conduct a CFA using the new data, taking account of the missingness (don’t forget to also use an appropriate estimator to account for any non-normality). Does the model fit well?

The data can be found at https://uoepsy.github.io/data/IQdatam.csv

Hints

We can fit the model setting missing='FIML'. If data are missing at random (MAR) - i.e., missingness is related to the measured variables but not the unobserved missing values - then this gives us unbiased parameter estimates. Unfortunately we can never know whether data are MAR for sure as this would require knowledge of the missing values. See Chapter 8#SEM-missing-data.

Solution 7. Here’s the data. As before, the distributions of items look quite skewed.

IQ_data_new <- read_csv("https://uoepsy.github.io/data/IQdatam.csv")

multi.hist(IQ_data_new, global = FALSE)

IQ_data_new |> select(contains("IQ")) |> 
    describe() |> 
    as.data.frame() |>
    rownames_to_column(var = "variable") |> 
    select(variable,mean,sd,skew,kurtosis) |>
    kable(digits = 2) |>
    kable_styling(full_width = FALSE)

variable	mean	sd	skew	kurtosis
IQ1	17.49	9.75	1.28	2.29
IQ2	17.52	9.71	1.25	2.32
IQ3	10.53	9.93	1.39	2.39
IQ4	10.30	9.65	1.29	2.00
IQ5	19.62	9.41	1.76	5.63

IQ_model_missing <- '
  IQ=~IQ1+IQ2+IQ3+IQ4+IQ5
  IQ1~~IQ2
'

IQ_model_missing.est <- cfa(IQ_model_missing, 
                            data=IQ_data_new, 
                            missing='FIML', estimator="MLR")

summary(IQ_model_missing.est, fit.measures=T, standardized=T)

lavaan 0.6-20 ended normally after 65 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        16

  Number of observations                           500
  Number of missing patterns                         3

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 8.070       6.273
  Degrees of freedom                                 4           4
  P-value (Chi-square)                           0.089       0.180
  Scaling correction factor                                  1.286
    Yuan-Bentler correction (Mplus variant)                       

Model Test Baseline Model:

  Test statistic                               910.224     654.153
  Degrees of freedom                                10          10
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.391

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.995       0.996
  Tucker-Lewis Index (TLI)                       0.989       0.991
                                                                  
  Robust Comparative Fit Index (CFI)                         0.997
  Robust Tucker-Lewis Index (TLI)                            0.992

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -8693.803   -8693.803
  Scaling correction factor                                  1.634
      for the MLR correction                                      
  Loglikelihood unrestricted model (H1)      -8689.768   -8689.768
  Scaling correction factor                                  1.565
      for the MLR correction                                      
                                                                  
  Akaike (AIC)                               17419.605   17419.605
  Bayesian (BIC)                             17487.039   17487.039
  Sample-size adjusted Bayesian (SABIC)      17436.254   17436.254

Root Mean Square Error of Approximation:

  RMSEA                                          0.045       0.034
  90 Percent confidence interval - lower         0.000       0.000
  90 Percent confidence interval - upper         0.090       0.076
  P-value H_0: RMSEA <= 0.050                    0.501       0.685
  P-value H_0: RMSEA >= 0.080                    0.112       0.033
                                                                  
  Robust RMSEA                                               0.037
  90 Percent confidence interval - lower                     0.000
  90 Percent confidence interval - upper                     0.093
  P-value H_0: Robust RMSEA <= 0.050                         0.568
  P-value H_0: Robust RMSEA >= 0.080                         0.119

Standardized Root Mean Square Residual:

  SRMR                                           0.015       0.015

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  IQ =~                                                                 
    IQ1               1.000                               6.346    0.651
    IQ2               0.997    0.073   13.673    0.000    6.325    0.652
    IQ3               1.045    0.102   10.226    0.000    6.629    0.668
    IQ4               1.084    0.127    8.514    0.000    6.877    0.713
    IQ5               1.053    0.130    8.083    0.000    6.682    0.711

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
 .IQ1 ~~                                                                
   .IQ2              27.397    4.687    5.845    0.000   27.397    0.504

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .IQ1              17.494    0.436   40.159    0.000   17.494    1.796
   .IQ2              17.522    0.434   40.409    0.000   17.522    1.807
   .IQ3              10.488    0.444   23.615    0.000   10.488    1.056
   .IQ4              10.251    0.431   23.772    0.000   10.251    1.063
   .IQ5              19.467    0.419   46.416    0.000   19.467    2.070

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .IQ1              54.613    6.042    9.039    0.000   54.613    0.576
   .IQ2              54.004    6.203    8.706    0.000   54.004    0.574
   .IQ3              54.654    6.320    8.648    0.000   54.654    0.554
   .IQ4              45.771    5.128    8.926    0.000   45.771    0.492
   .IQ5              43.779    5.887    7.436    0.000   43.779    0.495
    IQ               40.269    6.905    5.832    0.000    1.000    1.000

Our fit indices all look very good!

Extra - Question 8

Note that the summary of the model output when we used FIML also told us that there are 3 patterns of missingness.

Can you find out a) what the patterns are, and b) how many people are in each pattern?

Hints

is.na() will help a lot here, as will distinct() and count().
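To show how these functions fit together without giving away the answer for the real data, here is a sketch on a tiny made-up data frame. The toy values and the object names toy and patterns are purely hypothetical. The idea is to turn every value into TRUE (missing) or FALSE (observed), and then count how many rows share each combination.

```r
library(dplyr)

# hypothetical toy data, NOT the course dataset
toy <- tibble(
  IQ1 = c(10, 12,  9, 15, 11,  8),
  IQ2 = c(11, NA, 10, 14, NA,  9),
  IQ3 = c(NA, NA,  7, 12, NA,  6)
)

# TRUE = missing, FALSE = observed
toy_missing <- toy |> mutate(across(everything(), is.na))

# a) which patterns exist?
toy_missing |> distinct()

# b) how many people are in each pattern?
patterns <- toy_missing |> count(IQ1, IQ2, IQ3, name = "n_people")
patterns
```

In this toy example there are three patterns (nothing missing, only IQ3 missing, and both IQ2 and IQ3 missing), containing 3, 1 and 2 rows respectively.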

Opportunities for growth?

Development of pro-social behaviours in children: PSBtraj.csv

We are interested in the development of pro-social behaviours over childhood. We recruited 50 children at age 4, and they completed a battery of assessments that aimed to measure how much they displayed sharing, cooperation, perspective-taking etc. These tasks altogether resulted in a score of pro-social behaviour — PSB in our data.
The data are available at https://uoepsy.github.io/data/PSBtraj.csv

Question 9

Forget about SEM and latent variables for a minute. We want to study the development of pro-social behaviours over childhood, and we have captured a measure of this (PSB) in ~50 children over 5 years.

We’re going back to lmer()!!

Fit a multi-level/mixed effects model to estimate the trajectory of PSB over childhood.
For now, please set REML = FALSE (this is just because we’re going to compare this model with something else fitted with standard ML).
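A minimal sketch of what such a model might look like is below. It assumes the data contain columns named child, timepoint and PSB (these names are taken from the code and output later on this page), and it uses the object name lmm1 because that is the name the later solution refers to.

```r
library(tidyverse)
library(lme4)

psbtraj <- read_csv("https://uoepsy.github.io/data/PSBtraj.csv")

# random intercepts and random slopes of timepoint for each child,
# fitted with ML rather than REML
lmm1 <- lmer(PSB ~ 1 + timepoint + (1 + timepoint | child),
             data = psbtraj, REML = FALSE)

summary(lmm1)
```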

Question 10

The model that we just fitted contains “random intercepts” and “random slopes”. We’ve talked about these as the group-level variability around the fixed intercept and fixed slope.

Think carefully about the following explanation of our random effects:

our random intercepts and random slopes are normally distributed variables that are not directly observed, that reflect “where children’s PSB starts” and “how children’s PSB changes”.

The quite subtle link that I am trying to make here is that we can think of those random effects in a similar way to how we think about latent variables!

And we can actually fit the exact same model in the latent variable modelling framework of lavaan!!

We have 5 time points, so let’s pivot things wider and consider our data in this format:

psbwide <- 
  psbtraj |> 
  pivot_wider(names_from = timepoint, values_from = PSB,
              names_prefix = "t")

head(psbwide)

# A tibble: 6 × 6
  child    t0    t1    t2    t3    t4
  <int> <int> <int> <int> <int> <int>
1     1    11    14    10    13    17
2     2    20    25    30    34    40
3     3    16    13    11    10     9
4     4    11    16    18    22    29
5     5    18    19    19    23    23
6     6    16    14    10    14    15

In this setup, we can start to think of each column containing scores as an indicator of a child’s underlying latent level of PSB (i.e., a child who scores high on all those observed variables (t0 to t4) is high on the unobserved latent construct of ‘pro-social behaviour’ and one who scores lower is estimated to be lower). With a little bit of trickery, we can encode the time-ordered structure of these indicators and split this up into a latent ‘intercept’ and a latent ‘slope’. To do so, we specify a model like the diagram below.

Note that we are fixing all the factor loadings to specific values. The “Intercepts” latent variable loads equally onto all timepoint scores, and the “Slopes” latent variable loads 0 on the first time point, 1 on the second, and so on.
So a child who is 1 higher on the latent intercept will score 1 higher at all time points. In addition to this, a child who has more of the latent “Slope of PSB” factor will score 1 bit higher at time 2, 2 bits higher at time 3, 3 bits higher at time 4, and so on (where “bits” is yet to be estimated).
Think about what this implies for a child who has a latent intercept value of 10, and a latent slope value of 2:

timepoint	estimate	expectation
0	(1 x intercept) +(0 x slope)	(1 x 10) + (0 x 2) = 10
1	(1 x intercept) +(1 x slope)	(1 x 10) + (1 x 2) = 12
2	(1 x intercept) +(2 x slope)	(1 x 10) + (2 x 2) = 14
3	(1 x intercept) +(3 x slope)	(1 x 10) + (3 x 2) = 16
4	(1 x intercept) +(4 x slope)	(1 x 10) + (4 x 2) = 18
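The same arithmetic can be written as a matrix of fixed loadings multiplied by a child's latent values. This is only a sketch of the idea (the names Lambda and eta are illustrative labels for the loading matrix and the latent scores), but it makes clear that the loadings are doing the work of encoding time.

```r
# fixed loadings: a column of 1s for the intercept factor, 0:4 for the slope factor
Lambda <- cbind(ints = rep(1, 5), slopes = 0:4)
rownames(Lambda) <- paste0("t", 0:4)

# a child with a latent intercept of 10 and a latent slope of 2
eta <- c(ints = 10, slopes = 2)

# expected score at each timepoint
expected <- drop(Lambda %*% eta)
expected
#> t0 t1 t2 t3 t4
#> 10 12 14 16 18
```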

What we are then interested in estimating is the variances and the means of the two latent variables (Intercepts and Slopes). In previous models we’ve been mostly concerned with modelling covariance (i.e., how variables change with one another), but we can also include the estimation of means (the average of each variable). In a diagram, these sometimes get drawn as the paths from a triangle with a 1 in it (this is just like an intercept in regression \(y = b_1\cdot 1 + b_2\cdot x\): it is the \(b_1\) we are estimating, and the 1 in the triangle is just to indicate that it is a constant).

Try fitting the model below.

Note, for the demonstration here I have had to add some extra constraints. This is because our lmer model assumes that the residual variance is the same across time. Our version in lavaan does not by default, so here we’ve used the label “r” to constrain the residual variance to be equal at each indicator t0 to t4.

lgc1 <- "
  ints =~ 1*t0 + 1*t1 + 1*t2 + 1*t3 + 1*t4
  slopes  =~ 0*t0 + 1*t1 + 2*t2 + 3*t3 + 4*t4

  t0 ~~ r*t0
  t1 ~~ r*t1
  t2 ~~ r*t2
  t3 ~~ r*t3
  t4 ~~ r*t4
"

lgc.est1 <- growth(lgc1, data = psbwide)

Examine the parameterestimates() for the lavaan model and the summary() output for the lme4 model, and compare the two. What things are the same, and what things are different?

Solution 10. For our lme4 model, here are the fixed effects and the variance in our random effects:

# fixed effects
fixef(lmm1)

(Intercept)   timepoint 
     14.776       1.594

# random effect variances
as.data.frame( VarCorr(lmm1) )

       grp        var1      var2      vcov      sdcor
1    child (Intercept)      <NA> 12.562986  3.5444302
2    child   timepoint      <NA>  8.864222  2.9772843
3    child (Intercept) timepoint -5.126645 -0.4858101
4 Residual        <NA>      <NA>  2.687334  1.6393091

And for our lavaan model, here are our parameter estimates:

parameterestimates(lgc.est1)

      lhs op    rhs label    est    se      z pvalue ci.lower ci.upper
1    ints =~     t0        1.000 0.000     NA     NA    1.000    1.000
2    ints =~     t1        1.000 0.000     NA     NA    1.000    1.000
3    ints =~     t2        1.000 0.000     NA     NA    1.000    1.000
4    ints =~     t3        1.000 0.000     NA     NA    1.000    1.000
5    ints =~     t4        1.000 0.000     NA     NA    1.000    1.000
6  slopes =~     t0        0.000 0.000     NA     NA    0.000    0.000
7  slopes =~     t1        1.000 0.000     NA     NA    1.000    1.000
8  slopes =~     t2        2.000 0.000     NA     NA    2.000    2.000
9  slopes =~     t3        3.000 0.000     NA     NA    3.000    3.000
10 slopes =~     t4        4.000 0.000     NA     NA    4.000    4.000
11     t0 ~~     t0     r  2.687 0.310  8.660  0.000    2.079    3.296
12     t1 ~~     t1     r  2.687 0.310  8.660  0.000    2.079    3.296
13     t2 ~~     t2     r  2.687 0.310  8.660  0.000    2.079    3.296
14     t3 ~~     t3     r  2.687 0.310  8.660  0.000    2.079    3.296
15     t4 ~~     t4     r  2.687 0.310  8.660  0.000    2.079    3.296
16   ints ~~   ints       12.563 2.841  4.422  0.000    6.994   18.132
17 slopes ~~ slopes        8.864 1.827  4.852  0.000    5.284   12.445
18   ints ~~ slopes       -5.127 1.799 -2.850  0.004   -8.652   -1.602
19     t0 ~1               0.000 0.000     NA     NA    0.000    0.000
20     t1 ~1               0.000 0.000     NA     NA    0.000    0.000
21     t2 ~1               0.000 0.000     NA     NA    0.000    0.000
22     t3 ~1               0.000 0.000     NA     NA    0.000    0.000
23     t4 ~1               0.000 0.000     NA     NA    0.000    0.000
24   ints ~1              14.776 0.532 27.751  0.000   13.732   15.820
25 slopes ~1               1.594 0.427  3.730  0.000    0.756    2.432

We can see that these parts of the model are all identical!

The fixed intercept in the lmer is the same as the ints ~ 1 estimate in lavaan. If you look at summary(lgc.est1), this comes under a section headed “Intercepts” which we haven’t seen before, but is essentially capturing the mean of the latent variable.
The fixed slope in the lmer is the same as the slopes ~ 1 part of the lavaan model.
The random intercept variability in lmer is capturing “how much do children vary in their intercepts?”. In the lavaan model, this is captured by the variance of the latent “ints” variable (in parameterestimates(lgc.est1) this is the ints ~~ ints bit).
The random slope variability in lmer is “how much do children vary in the slopes of PSB~time?”. In our lavaan model, this is the variance of the latent “slopes” variable (slopes ~~ slopes).
The covariance between random intercepts and random slopes in lmer is equivalent to the covariance between the two latent variables (ints ~~ slopes). In lavaan this is shown as the covariance, whereas in lmer we also see it standardised as a correlation.
The residual variability in the lmer is equivalent to the residual variance in lavaan for every observed variable t0 to t4 (remember we constrained these to all be the same as each other).
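If you would like to check this equivalence for yourself without relying on the course data, the sketch below simulates a balanced dataset and fits both models. Everything about the simulation (the seed, sample size, means and standard deviations, and the object names) is an arbitrary, hypothetical choice made for illustration.

```r
library(lme4)
library(lavaan)

set.seed(2024)
n_child <- 200

# simulated data: child-specific intercepts and slopes, plus residual noise
b0 <- rnorm(n_child, mean = 15, sd = 3)
b1 <- rnorm(n_child, mean = 1.5, sd = 1)
long <- expand.grid(timepoint = 0:4, child = seq_len(n_child))
long$PSB <- b0[long$child] + b1[long$child] * long$timepoint +
  rnorm(nrow(long), mean = 0, sd = 1.5)

# mixed model fitted with ML
lmm_sim <- lmer(PSB ~ 1 + timepoint + (1 + timepoint | child),
                data = long, REML = FALSE)

# same data in wide format: one row per child, columns t0 to t4
# (timepoint varies fastest in 'long', so filling by row gives one child per row)
wide <- as.data.frame(matrix(long$PSB, ncol = 5, byrow = TRUE))
names(wide) <- paste0("t", 0:4)

lgc <- "
  ints   =~ 1*t0 + 1*t1 + 1*t2 + 1*t3 + 1*t4
  slopes =~ 0*t0 + 1*t1 + 2*t2 + 3*t3 + 4*t4
  t0 ~~ r*t0
  t1 ~~ r*t1
  t2 ~~ r*t2
  t3 ~~ r*t3
  t4 ~~ r*t4
"
lgc_sim <- growth(lgc, data = wide)

# latent means from lavaan, alongside the fixed effects from lmer
pe <- parameterEstimates(lgc_sim)
lav_means <- pe$est[pe$op == "~1" & pe$lhs %in% c("ints", "slopes")]
comparison <- cbind(lmer = fixef(lmm_sim), lavaan = lav_means)
comparison
```

The two columns should agree up to small numerical differences between the optimisers. The same approach can be extended to compare the variance components from VarCorr() against the ~~ rows of the lavaan output.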

Here’s the quick plot with the estimates:

library(semPlot)
semPaths(lgc.est1, whatLabels = "est")

Bonus! Remember in lme4 we could extract the model predictions for each specific group in our sample, with functions like ranef() and coef()?¹

The equivalent in the lavaan model is to ask where each person falls on the latent variables of “ints” and “slopes”.

And these are identical!

# each child's predicted intercept and slope from the lmer:
lmer_pred <- coef(lmm1)$child

# each child's predicted standing on the latent variables from lavaan:
lav_pred <- lavPredict(lgc.est1)

# bind them together and plot: 
cbind(lmer_pred, lav_pred) |> plot()

These kinds of models are known as “Latent Growth Curve” models, and are really just the same thing as the mixed effects model we have already seen, but from the SEM perspective. We can even start playing with non-linear trajectories by changing the slope loadings to e.g., 0,1,4,9,16.
The downside is that the timepoints need to be consistent: time is not treated continuously, meaning that it becomes difficult if we have child 1 measured at “48 months, 64 months, 74 months” and child 2 at “50 months, 62 months, 76 months”. We end up collapsing them into “time1”, “time2” and “time3”, and therefore lose information. In addition, it’s harder to include more complex grouping structures in the SEM approach. The benefit, however, is that it can be extended to more complex sorts of models that the mixed model might struggle with (like modelling trajectories of two outcomes simultaneously, or using the slope factor as a predictor of some other outcome).
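As a rough sketch of what changing the loadings looks like in practice, the code below fixes the slope loadings to squared time. To keep it self-contained it uses Demo.growth, an example dataset that ships with lavaan and has four repeated measures named t1 to t4, rather than the PSB data, so the loadings are 0, 1, 4, 9. The object names are illustrative, and no claim is made here about whether this shape fits those data well.

```r
library(lavaan)

# growth model where the 'slope' factor is loaded on squared time
lgc_sq <- "
  ints   =~ 1*t1 + 1*t2 + 1*t3 + 1*t4
  slopes =~ 0*t1 + 1*t2 + 4*t3 + 9*t4
"

# Demo.growth is an example dataset included in the lavaan package
lgc_sq.est <- growth(lgc_sq, data = Demo.growth)

# check the fixed loadings and look at the estimated latent means
pe_sq <- parameterEstimates(lgc_sq.est)
pe_sq[pe_sq$op == "=~" | (pe_sq$op == "~1" & pe_sq$lhs %in% c("ints", "slopes")),
      c("lhs", "op", "rhs", "est")]
```

Comparing the fit of a model like this against the linear version (loadings 0, 1, 2, 3) is one way to judge whether a curved trajectory is worth the change.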

Footnotes

¹ Whereas the variance of the random effects told us “how do groups in general vary?”, these predictions tell us “where does [specific group X in our data] fall?”