Recap
Last week we learned about two methods of data reduction: Principal Components Analysis (PCA) and Factor Analysis.
In brief, PCA aims to summarise a set of measured variables into a set of uncorrelated (orthogonal) components, which are linear combinations (a weighted average) of the measured variables. Factor analysis, on the other hand, assumes that the relationships between a set of measured variables can be explained by a number of underlying latent factors.
PCA vs FA
- Principal Component Analysis extracts composites of our observed variables.
- Factor Analysis is a model that predicts our observed variables from some theoretical latent variables (factors).
- If you just want to reduce a set of correlated observed variables down to a smaller number, conduct PCA. If you assume some underlying construct(s) is an underlying cause of your observed variables, and it is these constructs you are interested in, then conduct FA.
In Figure 1, note how the directions of the arrows differ between PCA and FA. In PCA, each component \(C_i\) is a weighted combination of the observed variables \(y_1, ..., y_n\). In FA, instead, each measured variable \(y_i\) is seen as generated by some shared latent factors \(F_1\) and \(F_2\) plus some unexplained variance \(u_i\).
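To make the contrast concrete, here is a minimal sketch using base R functions (the data frame `mydata` is a placeholder for any set of numeric measured variables):
# PCA: components are weighted combinations of the observed variables
pca <- prcomp(mydata, scale. = TRUE)
pca$rotation                         # the weights defining each component
# FA: observed variables modelled as generated by shared latent factors
efa <- factanal(mydata, factors = 2)
efa$loadings                         # estimated factor loadings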
Introducing CFA
When we conduct Exploratory Factor Analysis (EFA), we tend to start with no hypothesis about either the number of latent factors or the specific relationships between latent factors and measured variables (the factor structure). All variables load onto all factors, and often a transformation method (e.g., rotation) is applied to make the results more easily interpretable.
Often, in psychology, we use scales that we already consider to be valid measures of some underlying construct, and we have a theoretical model that we wish to test. Confirmatory Factor Analysis (CFA) is a more hypothesis-driven form of factor analysis, which requires us to prespecify all aspects of our model: we need to have some a priori sense of how many factors exist, which items are related to which factors, and so on.
CFA is almost always used when developing scales, because it allows us to examine the underlying structure of our measure (e.g., questionnaires). It is also useful when investigating the convergent and discriminant validity of a theoretical construct (for instance, we might expect a measure of anxiety to positively relate to (‘converge’ with) a measure of depression, and to differ (‘discriminate’) from a measure of general happiness).
When we have clear a priori hypotheses about relationships between measured variables and latent factors, CFA imposes a specific factor structure on the data, where we pick and choose the paths (arrows) that we want to estimate, and leave out the ones which our theory suggests are not present (as in Figure 2). It is important to note that by excluding a specific path, our model is asserting that that specific relationship is 0 (a bit like how leaving a predictor `z` out of the multiple regression model `y ~ w + x` assumes that the effect of `z` on `y` is 0).
The purpose of CFA can be seen as twofold:
- To obtain parameter estimates (i.e., factor loadings, variances and covariances of factors, residual variances of measured variables)
- To assess whether the model provides a good fit to the data. In other words, to assess if the observed data are consistent with the hypothesised model (stemming from a theory/hypothesis).
CFA as Structural Equation Modelling
CFA is a specific form of a Structural Equation Model (SEM) in which we are defining a (number of) factor structures. SEM is going to be the focus of weeks 9 and 10 of this course. In essence, SEM is a framework in which we can test our theoretical models and hypotheses.
You might be tempted to think “isn’t that what we’ve been doing already!?” and you would be right. However, SEM offers a huge amount more flexibility in the questions we can ask and the types of theoretical model we can think about. In the multiple regression world, we’re restricted to focusing on one outcome variable, and examining the variance explained in that variable by some predictor variables. In SEM, our theoretical model may have multiple outcome variables, mediating paths (“z affects x which in turn affects y”), latent factors, etc.
Sometimes the easiest way into thinking about things in the SEM framework is to draw all your variables on a whiteboard, draw any latent constructs you believe they measure, and then connect them all up with arrows according to your theoretical model. Sound familiar? Figure 2 shows a CFA model represented as a SEM diagram!
Introducing lavaan
For the remaining weeks of the course, we’re going to rely heavily on the lavaan (Latent Variable Analysis) package.
This is the main package in R for fitting structural equation models, and there is a huge scope of what we can do with it.
Operators in lavaan
The first thing to get to grips with is the various new operators which lavaan allows us to use.
Our standard multiple regression formula in R was specified as `y ~ x1 + x2 + x3 + ...`. In lavaan, we continue to fit regressions using the `~` symbol, but we can also specify the construction of latent variables using `=~`, and residual variances & covariances using `~~`.
| formula type | operator | mnemonic |
|---|---|---|
| latent variable definition | `=~` | “is measured by” |
| regression | `~` | “is regressed on” |
| (residual) (co)variance | `~~` | “is correlated with” |
| intercept | `~1` | “intercept” |
(from https://lavaan.ugent.be/tutorial/syntax1.html)
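To see all four operators side by side, here is a toy lavaan model string (a sketch; all variable names are placeholders rather than variables from any dataset in this course):
toy_model <- "
  anxiety =~ q1 + q2 + q3   # latent variable: anxiety is measured by q1-q3
  q4 ~ q5 + q6              # regression: q4 is regressed on q5 and q6
  q1 ~~ q2                  # residual covariance between q1 and q2
  q4 ~ 1                    # intercept of q4
"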
Fitting models with lavaan
In practice, fitting models in lavaan tends to be a little different from things like `lm()` and `(g)lmer()`. Instead of including the model formula inside the fit function (e.g., `lm(y ~ x1 + x2, data = df)`), we tend to do it in a step-by-step process. This is because, as our models become more complex, our formulas can get pretty long! In lavaan, it is typical to write the model as a character string (e.g., `model <- "y ~ x1 + x2"`) and then pass that string, along with the data, to the relevant lavaan function, such as `cfa()` or `sem()`: e.g., `cfa(model, data = mydata)`.
Specify the model:
mymodel <- "
factor1 =~ item1 + item2 + .....
factor2 =~ item6 + ......
...
..
"
Estimate the model:
mymodelfit <- cfa(mymodel, data = mydata)
Examine the fitted model:
summary(mymodelfit)
Optional: Fitting a multiple regression model with lavaan
You can see a multiple regression fitted with lavaan here: we define the model as a character string, and then we pass that to the relevant lavaan function, such as `cfa()` or, in this case, `sem()` (we’ll use `sem()` more in the weeks to come).
library(tidyverse)
library(lavaan)
toys_read <- read_csv("https://uoepsy.github.io/data/toyexample.csv")
# the lm() way
mreg_lm <- lm(R_AGE ~ hrs_week + age, toys_read)
# setting up the model for SEM
mreg_model <- "
#regression
R_AGE ~ 1 + hrs_week + age
"
mreg_sem <- sem(mreg_model, data=toys_read)
These are the coefficients from our `lm()` model:
coefficients(mreg_lm)
## (Intercept) hrs_week age
## 2.122 0.686 0.188
And you can see the estimated parameters are the same for our `sem()` model!
summary(mreg_sem)
## lavaan 0.6.15 ended normally after 19 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 4
##
## Number of observations 132
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## R_AGE ~
## hrs_week 0.686 0.484 1.419 0.156
## age 0.188 0.226 0.832 0.405
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .R_AGE 2.122 2.575 0.824 0.410
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .R_AGE 27.268 3.356 8.124 0.000
Thinking in diagrams
In structural equation modeling, it is common to think about our theories in terms of the connections between variables drawn on a whiteboard. By representing a theory as paths to and from different variables, we open up a whole new way of ‘modelling’ the world around us. These path diagrams have different shapes to denote the covariances, regressions, observed variables and latent variables.
- Observed variables are represented by squares or rectangles. These are the named variables of interest which exist in our dataset - i.e. the ones which we have measured directly.
- Latent variables are represented as ovals/ellipses or circles.
- Covariances are represented by double-headed arrows. In many diagrams these are curved.
- Regressions are shown by single headed arrows (e.g., an arrow from \(x\) to \(y\) for the path \(y \sim x\)). Factor loadings are also regression paths.
Recall that specifying a factor structure is simply to say that some measured variables \(y_i\) are each regressed onto some unmeasured factor(s): \(y = \lambda \cdot F + u\) looks an awful lot like \(y = \beta \cdot x + \epsilon\)!
Making path diagrams in R
There are a couple of packages which can create visual diagrams of structural equation models: semPlot and tidySEM. The semPlot package contains the function `semPaths()`, which is well established and works “out of the box”, but is harder to edit. Alternatively, you can try your hand at a newer package, tidySEM, which promises more customisable features for SEM diagrams. Note that if we want to include a SEM diagram in a report, the raw output from `semPaths()` would not usually meet publication standards, and instead we tend to draw diagrams in programs like PowerPoint!
Optional: visualising a multiple regression model as a path diagram
library(semPlot)
semPaths(mreg_sem, whatLabels = "est", intercepts = F)
library(tidySEM)
graph_sem(mreg_sem, nodes = get_nodes(mreg_sem))
Exercises: One factor model
Data: Conduct Problems Dataset 2
Last week we conducted an exploratory factor analysis of a dataset to try and identify an optimal factor structure for a new measure of conduct (i.e., antisocial behavioural) problems. This week, we’ll conduct some confirmatory factor analyses (CFA) of the same inventory, using some new data collected by the researchers from n=600 adolescents.
The questionnaire items referred to the following 10 behaviours:
- item 1 - Stealing
- item 2 - Lying
- item 3 - Skipping school
- item 4 - Vandalism
- item 5 - Breaking curfew
- item 6 - Threatening others
- item 7 - Bullying
- item 8 - Spreading malicious rumours
- item 9 - Using a weapon
- item 10 - Fighting
The data is available as a .csv at https://uoepsy.github.io/data/conduct_problems_2.csv
Question A1
Read in the new data, and construct a correlation matrix. Maybe create a visualisation of the correlation matrix?
Solution
df <- read.csv("https://uoepsy.github.io/data/conduct_problems_2.csv")
cor(df)
## item1 item2 item3 item4 item5 item6 item7 item8 item9 item10
## item1 1.000 0.525 0.4350 0.483 0.564 0.1345 0.281 0.258 0.244 0.178
## item2 0.525 1.000 0.5010 0.512 0.674 0.1524 0.303 0.275 0.276 0.162
## item3 0.435 0.501 1.0000 0.490 0.599 0.0946 0.256 0.194 0.208 0.129
## item4 0.483 0.512 0.4897 1.000 0.620 0.1485 0.249 0.255 0.226 0.143
## item5 0.564 0.674 0.5991 0.620 1.000 0.1135 0.311 0.246 0.233 0.139
## item6 0.135 0.152 0.0946 0.148 0.114 1.0000 0.546 0.594 0.369 0.463
## item7 0.281 0.303 0.2557 0.249 0.311 0.5462 1.000 0.801 0.599 0.532
## item8 0.258 0.275 0.1936 0.255 0.246 0.5944 0.801 1.000 0.617 0.556
## item9 0.244 0.276 0.2080 0.226 0.233 0.3692 0.599 0.617 1.000 0.361
## item10 0.178 0.162 0.1292 0.143 0.139 0.4632 0.532 0.556 0.361 1.000
pheatmap::pheatmap(cor(df))
Question A2
Using lavaan syntax, specify a model in which all 10 items load on one latent variable.
Do not estimate the model yet, simply specify it in a character string, in preparation to fit it with the cfa()
function.
Hint: Remember that to specify items loading on a latent variable we use the `=~` operator. The latent variable goes on the left hand side, and the list of indicators (i.e., the items used to measure the latent variable) goes on the right hand side, separated by `+`. You can name the latent variable whatever you like.
Solution
model1 <- "CP=~item1+item2+item3+item4+item5+item6+item7+item8+item9+item10"
Question A3
We’re going to use the `cfa()` function to fit our model.
It is not necessary to refer to ALL of our CFA parameters in our model specification in order to estimate the model, as some parameters are estimated or fixed by default by the `cfa()` function. In this case, the residual variances and latent factor variances are missing from our specification because they are estimated by default.
The default scaling/identification constraint imposed by the `cfa()` function is to fix the loading of the first item of each latent variable to 1. We can override this by setting `std.lv=TRUE`, which will instead scale the latent variables by fixing their variances to 1.
Estimate your model using the `cfa()` function from the lavaan package. Scale your latent variable by fixing the latent variable variance to 1.
It is helpful to save the results of `cfa()` to a new object so that we can later inspect that object (to look at the model fit and parameter estimates).
Solution
model1.est <- cfa(model1, data=df, std.lv = TRUE)
Model Fit & Degrees of Freedom
One of the crucial things to realise when you’re learning about these methods is that there are certain terms (things like “model fit” and “degrees of freedom”) which have quite different meanings to those you are likely used to.
“Model Fit”: Same name, different idea
You’ll have heard the term “model fit” many times since September. However, there is a crucial difference in what it means when used in the SEM framework.
In things like multiple regression, we have been using “model fit” as a measure of “how much variance can we explain in y with our set of predictors?”.
In SEM, examining “model fit” is more like asking “how well does our model reproduce the characteristics of the data that we observed?”. If you think of the characteristics of our data being represented by a covariance matrix, then we might think of “model fit” as being “how well can our model reproduce our observed covariance matrix?”.
In regression, we could only talk about model fit if we had more than 2 datapoints. This is because there is only one possible line that we can fit between 2 datapoints, and this line explains all of the variance in the outcome variable (our 2 datapoints are entirely used up estimating the intercept and the slope).
The logic is the same for model fit in SEM (we need more degrees of freedom than we have parameters that are estimated), but we need to remember that we are now concerned with the covariance matrix, rather than with our raw observations. So we need to be estimating fewer paths (i.e., parameters) than there are variances/covariances in our covariance matrix. This is because if we just fit paths between all our variables, then our model would be able to reproduce the data perfectly (just like a regression with 2 datapoints has an \(R^2\) of 1).
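In lavaan we can inspect both sides of this comparison directly. A small sketch, where `fit` is a placeholder for any fitted lavaan model object:
lavInspect(fit, "sampstat")$cov   # observed (sample) covariance matrix
fitted(fit)$cov                   # covariance matrix implied by the model
# good "model fit" = small discrepancies between these two matrices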
Degrees of Freedom in SEM
The degrees of freedom for a Structural Equation Model correspond to the number of knowns (observed covariances/variances from our sample) minus the number of unknowns (parameters to be estimated by the model). A model can only be estimated if it has at least 0 degrees of freedom (i.e., at least as many knowns as unknowns). A model with exactly 0 degrees of freedom is termed just-identified (sometimes called “saturated”).
Under- and over-identified models correspond to those with \(<0\) and \(>0\) degrees of freedom, respectively.
An example of a just-identified model is the multiple regression model! In multiple regression, everything is allowed to vary with everything else, which means that there is a unique solution for all of the model’s parameters because there are as many paths as there are observed covariances. This means that in the SEM world, a multiple regression model is “just-identified”.
Demonstration
If I have two predictors and one outcome variable, then there are 6 variances and covariances available. For instance:
cov(toys_read %>% select(R_AGE, hrs_week, age))
## R_AGE hrs_week age
## R_AGE 28.089 0.649 0.892
## hrs_week 0.649 0.897 0.175
## age 0.892 0.175 4.095
The multiple regression model will estimate the two variances of the predictors, their covariance, the two paths from each predictor to the outcome, and the error variance. This makes up 6 estimated parameters - which is exactly how many known covariances there are.
How many knowns are there?
The number of known variances and covariances for a set of \(k\) observed variables is equal to \(\frac{k \cdot (k+1)}{2}\).
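We can check this formula against the numbers above with a quick sketch (`knowns` is just a helper function we define here):
# number of variances + covariances among k observed variables
knowns <- function(k) k * (k + 1) / 2
knowns(3)    # 6  - matches the multiple regression demonstration above
knowns(10)   # 55 - for the 10 conduct problems items; the one factor
             # model estimates 20 parameters, leaving 55 - 20 = 35 df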
Remember, in SEM the visualisations can play a key part - draw all our variables (both observed and latent) on the whiteboard; connect them up according to our theoretical model; we can then count the number of paths (arrows) and determine whether the \(\text{number of knowns} > \text{number of unknowns}\). We can reduce the number of unknowns by fixing parameters to be specific values.
By constraining some estimated parameter to be some specific value, we free up a degree of freedom! For instance, “the correlation between x1 and x2 is equal to 0.7 (\(r_{x_1x_2} = .7\))”. This turns a previously estimated parameter into a fixed parameter, and gains us the prize of a lovely degree of freedom!
By removing a path altogether, we are constraining it to be zero.
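In lavaan syntax, we fix a parameter by pre-multiplying it with a value (a sketch with placeholder variable names; note that `~~` fixes a covariance rather than a correlation):
fixed_parts <- "
  x1 ~~ 0.7*x2     # fix the x1-x2 covariance at 0.7 (no longer estimated)
  F =~ 1*y1 + y2   # fix the loading of y1 on F to 1
"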
For instance, in Figure 3 we can see the model of a latent factor loading on to 4 items. The number of paths to be estimated here is greater than the number of known covariances. However, we can get around this by fixing certain parameters to specific values. In Figure 4, the latent factor variance is set to 1, and the residual factor loadings are also set to 1.
This has the additional benefit of giving our latent factor some defining features. Because we don’t actually measure the latent variable (it is a hypothetical construct), it doesn’t really have any intrinsic ‘scale’. When we fix its variance to be 1, we give it a defined property, creating a reference point that the other paths to/from the variable are estimated in relation to. A common alternative is to fix the factor loading of the first item to be 1 (see Figure 5).
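In lavaan, these two scaling approaches look as follows (a sketch; the model string and `mydata` are placeholders):
library(lavaan)
# Figure 4 approach: fix the latent variance to 1
fit_a <- cfa("F =~ y1 + y2 + y3 + y4", data = mydata, std.lv = TRUE)
# Figure 5 approach: fix the first loading to 1 (the lavaan default)
fit_b <- cfa("F =~ y1 + y2 + y3 + y4", data = mydata)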
Fit indices: rule-of-thumb cut-offs that people often use
There are many different metrics that people use to examine model fit in SEM, and there is lots of controversy over the various merits and disadvantages of each, as well as over the proposed cut-offs.
The main four fit indices are RMSEA, SRMR, CFI and TLI. We’ll look more into these in a couple of weeks, and we strongly encourage you to take a look at the accompanying reading on CFA which is posted on Learn, as this explains some of the more common measures. Additionally, there are many resources online, for instance David Kenny’s page on measuring model fit.
Rules of thumb:
- Smaller values of RMSEA and SRMR mean better fit while larger values of CFI and TLI mean better fit.
- If \(\textrm{RMSEA} < .05\), \(\textrm{SRMR} < .05\), \(\textrm{TLI} > .95\) and \(\textrm{CFI} > .95\) then the model fits well.
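As a sketch of how to extract just these four indices from a fitted lavaan model (`fit` is a placeholder for any fitted model object):
fitMeasures(fit, c("rmsea", "srmr", "cfi", "tli"))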
Exercises: Two factor model
Question B1
Examine the global fit of your one factor model. Does it fit well? (To obtain the global fit measures, we can use the `summary()` function on our estimated model, setting `fit.measures=TRUE`.)
Solution
summary(model1.est, fit.measures=TRUE)
## lavaan 0.6.15 ended normally after 45 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 20
##
## Number of observations 600
##
## Model Test User Model:
##
## Test statistic 1089.192
## Degrees of freedom 35
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 2836.905
## Degrees of freedom 45
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.622
## Tucker-Lewis Index (TLI) 0.515
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -7722.655
## Loglikelihood unrestricted model (H1) -7178.058
##
## Akaike (AIC) 15485.309
## Bayesian (BIC) 15573.248
## Sample-size adjusted Bayesian (SABIC) 15509.753
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.224
## 90 Percent confidence interval - lower 0.213
## 90 Percent confidence interval - upper 0.236
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 1.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.181
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## CP =~
## item1 0.378 0.042 9.036 0.000
## item2 0.429 0.044 9.782 0.000
## item3 0.327 0.042 7.794 0.000
## item4 0.392 0.045 8.689 0.000
## item5 0.410 0.043 9.522 0.000
## item6 0.602 0.037 16.202 0.000
## item7 0.879 0.033 26.448 0.000
## item8 0.859 0.032 26.712 0.000
## item9 0.680 0.038 18.001 0.000
## item10 0.579 0.037 15.557 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .item1 0.885 0.052 17.038 0.000
## .item2 0.963 0.057 16.984 0.000
## .item3 0.909 0.053 17.114 0.000
## .item4 1.037 0.061 17.061 0.000
## .item5 0.933 0.055 17.004 0.000
## .item6 0.581 0.036 16.213 0.000
## .item7 0.224 0.021 10.721 0.000
## .item8 0.202 0.019 10.379 0.000
## .item9 0.559 0.035 15.843 0.000
## .item10 0.596 0.037 16.324 0.000
## CP 1.000
According to conventionally accepted criteria for RMSEA, SRMR, TLI and CFI, the model fits poorly.
Question B2
Now let’s try a different model.
Specify a CFA model with two correlated latent factors.
Consider items 1 to 5 as indicators of the first latent factor and items 6 to 10 as indicators of the second latent factor.
Specifying models this way requires separating the different (sets of) paths onto new lines.
So for this model you will want something with 3 lines.
You can add comments in as well, which will help!
The first one below is filled in for you:
model2 <- '
# latent factor one "is measured by" items 1 to 5
LV1 =~ item1 + item2 + item3 + item4 + item5
# latent factor two "is measured by" items 6 to 10
...
# latent factor one is correlated with latent factor two
...
'
Strings split over lines making R get stuck?
If you have your cursor on the first line of a multi-line character string, and you press ctrl+enter in order to run it (i.e., send it down to the console), then R will not automatically run the next line. It will give you a little blue `+` in the console, and force you to run the string line by line.
If you are seeing the little blue `+`, you can press the Escape key to cancel the command.
It might be easier to highlight the entire model and run it all at once.
Solution
We can specify the model as below.
model2<-'
LV1=~item1+item2+item3+item4+item5
LV2=~item6+item7+item8+item9+item10
LV1~~LV2
'
We define the two latent variables, ‘LV1’ and ‘LV2’, using the `=~` operator, and we use the `~~` operator to specify the covariance between the two latent variables.
Question B3
Estimate this model using `cfa()`.
Scale the latent variables using a reference indicator (rather than fixing the variance).
Does the model fit well?
Solution
`cfa()` uses a reference indicator to scale the latent variables by default, so we only need to supply the object containing the model syntax and the name of the dataset.
model2.est<-cfa(model2, data=df)
summary(model2.est, fit.measures=T)
## lavaan 0.6.15 ended normally after 27 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 21
##
## Number of observations 600
##
## Model Test User Model:
##
## Test statistic 65.765
## Degrees of freedom 34
## P-value (Chi-square) 0.001
##
## Model Test Baseline Model:
##
## Test statistic 2836.905
## Degrees of freedom 45
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.989
## Tucker-Lewis Index (TLI) 0.985
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -7210.941
## Loglikelihood unrestricted model (H1) -7178.058
##
## Akaike (AIC) 14463.881
## Bayesian (BIC) 14556.217
## Sample-size adjusted Bayesian (SABIC) 14489.548
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.039
## 90 Percent confidence interval - lower 0.025
## 90 Percent confidence interval - upper 0.054
## P-value H_0: RMSEA <= 0.050 0.884
## P-value H_0: RMSEA >= 0.080 0.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.035
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## LV1 =~
## item1 1.000
## item2 1.217 0.076 15.965 0.000
## item3 1.012 0.070 14.424 0.000
## item4 1.146 0.077 14.967 0.000
## item5 1.360 0.078 17.412 0.000
## LV2 =~
## item6 1.000
## item7 1.419 0.082 17.331 0.000
## item8 1.437 0.081 17.662 0.000
## item9 1.093 0.077 14.125 0.000
## item10 0.950 0.073 13.073 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## LV1 ~~
## LV2 0.156 0.023 6.856 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .item1 0.574 0.037 15.398 0.000
## .item2 0.477 0.035 13.778 0.000
## .item3 0.551 0.036 15.262 0.000
## .item4 0.597 0.040 14.867 0.000
## .item5 0.264 0.028 9.537 0.000
## .item6 0.561 0.035 16.193 0.000
## .item7 0.228 0.021 10.767 0.000
## .item8 0.153 0.019 8.122 0.000
## .item9 0.566 0.035 15.978 0.000
## .item10 0.587 0.036 16.351 0.000
## LV1 0.453 0.052 8.701 0.000
## LV2 0.381 0.045 8.396 0.000
This model fits well according to RMSEA, SRMR, CFI and TLI. The \(\chi^2\) value is significant but we don’t need to worry about this because the \(\chi^2\) test has a strong tendency to reject even trivially mis-specified models (more on this next week).
Question B4
Are there any areas of local mis-fit?
By “local” misfit, we mean specific paths in the model that we maybe should have included, but didn’t. We can look for these using the `modindices()` function. This gives us the expected improvement in model fit if a given parameter were added, along with the expected parameter change (an estimate of what the parameter estimate would be if it were included in the model).
Solution
modindices(model2.est, sort=T)
## lhs op rhs mi epc sepc.lv sepc.all sepc.nox
## 72 item6 ~~ item10 10.747 0.082 0.082 0.144 0.144
## 25 LV1 =~ item7 8.106 0.119 0.080 0.080 0.080
## 65 item5 ~~ item7 6.720 0.039 0.039 0.160 0.160
## 71 item6 ~~ item9 6.675 -0.065 -0.065 -0.115 -0.115
## 24 LV1 =~ item6 6.100 -0.136 -0.092 -0.094 -0.094
## 33 LV2 =~ item5 5.273 -0.122 -0.075 -0.072 -0.072
## 60 item4 ~~ item7 4.156 -0.039 -0.039 -0.105 -0.105
## 27 LV1 =~ item9 4.086 0.113 0.076 0.075 0.075
## 78 item9 ~~ item10 4.029 -0.051 -0.051 -0.089 -0.089
## 44 item2 ~~ item4 4.021 -0.058 -0.058 -0.109 -0.109
## 26 LV1 =~ item8 3.912 -0.078 -0.053 -0.054 -0.054
## 66 item5 ~~ item8 3.702 -0.027 -0.027 -0.134 -0.134
## 61 item4 ~~ item8 3.126 0.031 0.031 0.103 0.103
## 29 LV2 =~ item1 2.920 0.105 0.065 0.064 0.064
## 55 item3 ~~ item8 2.869 -0.028 -0.028 -0.098 -0.098
## 64 item5 ~~ item6 2.812 -0.034 -0.034 -0.090 -0.090
## 37 item1 ~~ item5 2.750 -0.044 -0.044 -0.114 -0.114
## 70 item6 ~~ item8 2.726 0.034 0.034 0.115 0.115
## 49 item2 ~~ item9 2.563 0.039 0.039 0.074 0.074
## 54 item3 ~~ item7 2.295 0.027 0.027 0.077 0.077
## 73 item7 ~~ item8 1.964 -0.041 -0.041 -0.222 -0.222
## 30 LV2 =~ item2 1.869 0.081 0.050 0.046 0.046
## 74 item7 ~~ item9 1.602 0.028 0.028 0.077 0.077
## 69 item6 ~~ item7 1.395 -0.025 -0.025 -0.069 -0.069
## 28 LV1 =~ item10 1.357 -0.065 -0.044 -0.046 -0.046
## 68 item5 ~~ item10 1.356 -0.024 -0.024 -0.062 -0.062
## 45 item2 ~~ item5 1.352 0.035 0.035 0.099 0.099
## 34 item1 ~~ item2 1.322 0.031 0.031 0.060 0.060
## 52 item3 ~~ item5 1.175 0.029 0.029 0.076 0.076
## 43 item2 ~~ item3 1.120 -0.028 -0.028 -0.056 -0.056
## 42 item1 ~~ item10 0.944 0.025 0.025 0.043 0.043
## 76 item8 ~~ item9 0.731 0.018 0.018 0.062 0.062
## 59 item4 ~~ item6 0.688 0.022 0.022 0.037 0.037
## 67 item5 ~~ item9 0.672 -0.017 -0.017 -0.044 -0.044
## 41 item1 ~~ item9 0.613 0.020 0.020 0.035 0.035
## 36 item1 ~~ item4 0.604 0.022 0.022 0.038 0.038
## 53 item3 ~~ item6 0.541 -0.018 -0.018 -0.033 -0.033
## 35 item1 ~~ item3 0.528 -0.020 -0.020 -0.035 -0.035
## 51 item3 ~~ item4 0.467 0.019 0.019 0.034 0.034
## 56 item3 ~~ item9 0.373 0.015 0.015 0.027 0.027
## 32 LV2 =~ item4 0.341 0.037 0.023 0.021 0.021
## 58 item4 ~~ item5 0.241 0.014 0.014 0.036 0.036
## 31 LV2 =~ item3 0.224 -0.029 -0.018 -0.017 -0.017
## 40 item1 ~~ item8 0.185 0.007 0.007 0.025 0.025
## 38 item1 ~~ item6 0.145 -0.010 -0.010 -0.017 -0.017
## 50 item2 ~~ item10 0.082 -0.007 -0.007 -0.013 -0.013
## 48 item2 ~~ item8 0.077 0.004 0.004 0.017 0.017
## 62 item4 ~~ item9 0.057 0.006 0.006 0.011 0.011
## 47 item2 ~~ item7 0.054 -0.004 -0.004 -0.012 -0.012
## 75 item7 ~~ item10 0.048 -0.005 -0.005 -0.012 -0.012
## 63 item4 ~~ item10 0.031 -0.005 -0.005 -0.008 -0.008
## 57 item3 ~~ item10 0.024 0.004 0.004 0.007 0.007
## 77 item8 ~~ item10 0.011 -0.002 -0.002 -0.007 -0.007
## 46 item2 ~~ item6 0.007 0.002 0.002 0.004 0.004
## 39 item1 ~~ item7 0.006 -0.001 -0.001 -0.004 -0.004
While the modification indices suggest that the model could be improved with the addition of parameters, none of the expected parameter changes are very large. If we included any additional parameters here, it is likely we would be capitalising on chance. We would also have to consider whether we could justify their inclusion on substantive grounds, i.e., we would ask ourselves ‘does the addition of the parameter make sense given our background knowledge of the construct?’.
Question B5
Take a look at the parameter estimates, are all of your loadings satisfactory? Which items are the best measures of the underlying latent variables?
Hint: It may help to look at the standardised parameter estimates, which we can obtain using `summary(model, standardized = TRUE)`.
Typically we would want standardised loadings to be \(>|.3|\) (though there is no consensus on this; sometimes you will see \(>|.4|\) suggested, other times \(>|.6|\)!).
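As an alternative sketch, lavaan’s `standardizedSolution()` function returns only the standardised estimates (with standard errors and confidence intervals) for a fitted model such as `model2.est`:
standardizedSolution(model2.est)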
Solution
summary(model2.est, standardized=T)
## lavaan 0.6.15 ended normally after 27 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 21
##
## Number of observations 600
##
## Model Test User Model:
##
## Test statistic 65.765
## Degrees of freedom 34
## P-value (Chi-square) 0.001
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## LV1 =~
## item1 1.000 0.673 0.664
## item2 1.217 0.076 15.965 0.000 0.819 0.765
## item3 1.012 0.070 14.424 0.000 0.681 0.676
## item4 1.146 0.077 14.967 0.000 0.771 0.706
## item5 1.360 0.078 17.412 0.000 0.915 0.872
## LV2 =~
## item6 1.000 0.618 0.636
## item7 1.419 0.082 17.331 0.000 0.877 0.878
## item8 1.437 0.081 17.662 0.000 0.887 0.915
## item9 1.093 0.077 14.125 0.000 0.675 0.668
## item10 0.950 0.073 13.073 0.000 0.587 0.608
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## LV1 ~~
## LV2 0.156 0.023 6.856 0.000 0.375 0.375
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .item1 0.574 0.037 15.398 0.000 0.574 0.559
## .item2 0.477 0.035 13.778 0.000 0.477 0.416
## .item3 0.551 0.036 15.262 0.000 0.551 0.543
## .item4 0.597 0.040 14.867 0.000 0.597 0.501
## .item5 0.264 0.028 9.537 0.000 0.264 0.239
## .item6 0.561 0.035 16.193 0.000 0.561 0.595
## .item7 0.228 0.021 10.767 0.000 0.228 0.229
## .item8 0.153 0.019 8.122 0.000 0.153 0.162
## .item9 0.566 0.035 15.978 0.000 0.566 0.554
## .item10 0.587 0.036 16.351 0.000 0.587 0.630
## LV1 0.453 0.052 8.701 0.000 1.000 1.000
## LV2 0.381 0.045 8.396 0.000 1.000 1.000
We can see that all loadings are statistically significant and all standardised loadings are >|.3|. This is good as it suggests all our items measure the relevant latent variable reasonably well. We can identify the ‘best’ indicators of our latent constructs according to those with the highest standardised loadings. We must, however, be aware that item loadings are model-specific: if we changed the model (e.g., the number of factors or which items we included) the loadings could change and so would our interpretation of the latent variables.
Question B6
Now it’s time to get R to draw some diagrams of our model for us!
Using R, represent the model, including the standardised parameters as a SEM diagram.
You can either use the `semPaths()` function from the semPlot package, or you can try your hand at the newer tidySEM package, which promises more customisable features for SEM diagrams.
(Often, if we want to include a SEM diagram in a report, the raw output from `semPaths()` would not meet publication standards, and instead we tend to draw diagrams in programs like PowerPoint!)
Solution
In the `semPaths()` function, we can include the unstandardised estimates by setting `what='est'` and the standardised estimates by setting `what='stand'`.
library(semPlot)
semPaths(model2.est, what='stand')
library(tidySEM)
lay = get_layout(rep(NA, 3), "LV1", rep(NA,2),"LV2", rep(NA,3),
paste0("item",1:10), rows = 2)
graph_sem(model2.est, layout = lay)
Question B7
Write a short paragraph summarising the method and results of the two factor model.
Remember: The main principle behind reporting any analysis is that you should be as transparent as possible (e.g., reporting any model modifications made) and a reader should be able to reproduce your analysis based on your description.
Solution
A two-factor model was tested. Items 1-5 loaded on a ‘non-aggressive conduct problems’ factor and items 6-10 loaded on an ‘aggression’ factor and these factors were allowed to correlate. Scaling and identification were achieved by fixing the loading of item 1 on the non-aggressive conduct problems factor and item 6 on the aggression factor to 1. The model was estimated using maximum likelihood estimation. The model fit well with CFI=.99, TLI=0.99, RMSEA=.04, and SRMR=.04 (Hu & Bentler, 1999). All loadings were statistically significant and >|.3| on the standardised scale. Modification indices and expected parameter changes were inspected but no modifications were made because no expected parameter changes were judged large enough to merit the inclusion of additional parameters given that there was little theoretical rationale for their inclusion. Overall, therefore, a two-factor oblique model was supported for the conduct problems items. The correlation between the factors was r=.38 (p<.001).
| parameter | est | se | z | p-value |
|---|---|---|---|---|
| LV1=~item1 | 1.000 | 0.000 | | |
| LV1=~item2 | 1.217 | 0.076 | 15.96 | <.001 |
| LV1=~item3 | 1.012 | 0.070 | 14.42 | <.001 |
| LV1=~item4 | 1.146 | 0.077 | 14.97 | <.001 |
| LV1=~item5 | 1.360 | 0.078 | 17.41 | <.001 |
| LV2=~item6 | 1.000 | 0.000 | | |
| LV2=~item7 | 1.419 | 0.082 | 17.33 | <.001 |
| LV2=~item8 | 1.437 | 0.081 | 17.66 | <.001 |
| LV2=~item9 | 1.093 | 0.077 | 14.12 | <.001 |
| LV2=~item10 | 0.950 | 0.073 | 13.07 | <.001 |
| LV1~~LV2 | 0.156 | 0.023 | 6.86 | <.001 |
| item1~~item1 | 0.574 | 0.037 | 15.40 | <.001 |
| item2~~item2 | 0.477 | 0.035 | 13.78 | <.001 |
| item3~~item3 | 0.551 | 0.036 | 15.26 | <.001 |
| item4~~item4 | 0.597 | 0.040 | 14.87 | <.001 |
| item5~~item5 | 0.264 | 0.028 | 9.54 | <.001 |
| item6~~item6 | 0.561 | 0.035 | 16.19 | <.001 |
| item7~~item7 | 0.228 | 0.021 | 10.77 | <.001 |
| item8~~item8 | 0.153 | 0.019 | 8.12 | <.001 |
| item9~~item9 | 0.566 | 0.035 | 15.98 | <.001 |
| item10~~item10 | 0.587 | 0.036 | 16.35 | <.001 |
| LV1~~LV1 | 0.453 | 0.052 | 8.70 | <.001 |
| LV2~~LV2 | 0.381 | 0.045 | 8.40 | <.001 |
When you write up a CFA for a report, you should make sure to include the parameter estimates. This can be in the form of a table with the unstandardised loadings and factor covariances, their standard errors, p-values, and corresponding standardised values.
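As a sketch of one way to obtain such a table, lavaan’s `parameterEstimates()` function returns all the estimates as a data frame (here applied to the fitted `model2.est` from above):
parameterEstimates(model2.est, standardized = TRUE)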
Note: according to APA style, you should include a leading zero when a parameter can take values out of the 0-1 range (e.g., SD=0.99) but you should not include a leading zero when a parameter can only take values in the 0-1 range (e.g., r=.5). This is why we have TLI=0.99 but CFI=.99. The former can technically (though rarely does) take values >1 while the latter only takes values between 0 and 1.