class: center, middle, inverse, title-slide

.title[
# Path Analysis
]
.subtitle[
## Data Analysis for Psychology in R 3
]
.author[
### dapR3 Team
]
.institute[
### Department of Psychology
The University of Edinburgh
]

---
# Weeks 7 to 11 Overview

- **Path analysis**
  - Week 7: Fundamentals of path analysis
  - Week 8: Path mediation

- **Data Reduction**
  - Week 9: Principal Components Analysis
  - Week 10: Exploratory Factor Analysis 1
  - Week 11: Exploratory Factor Analysis 2

---
# Diagrammatic Conventions

.pull-left[
- We are going to use lots of diagrams
- Some conventions:
  - Square = observed/measured
  - Circle/ellipse = latent/unobserved
  - Two-headed arrow = covariance
  - Single-headed arrow = regression path
]

.pull-right[
<img src="dapr3_06_pathintro_files/figure-html/unnamed-chunk-1-1.png" width="4495" />
]

---
# Learning Objectives
1. Understand the core principle of path modelling
2. Be able to use `lavaan` to estimate linear models and simple path models
3. Interpret the output from a `lavaan` model

---
class: inverse, center, middle

<h2>Part 1: Introduction and Motivation</h2>
<h2 style="text-align: left;opacity:0.3;">Part 2: Introducing `lavaan`</h2>
<h2 style="text-align: left;opacity:0.3;">Part 3: Model Specification & Estimation</h2>
<h2 style="text-align: left;opacity:0.3;">Part 4: Model Evaluation</h2>

---
# What issue do these methods solve?

- Path models
  - Sometimes we have more than one variable that needs to be treated as an outcome/dependent variable
  - We can't do this in a single linear model
  - **A path model allows us to test several linear models together as a set**

- **Example 1**: Suppose we want to test a model where interest in maths is predicted by Conscientiousness, and interest in English Literature is predicted by Conscientiousness and Extroversion. We want to control for age in this model.
  - Interest in Maths and Interest in English are both outcomes (DVs), so we couldn't use a single linear model

- **Example 2**: Mediation (more on this next week)
  - Mediation is when a predictor X has an effect on an outcome Y via a mediating variable M
  - Conscientiousness (X) affects health (Y) via health behaviors (M)
  - Attitudes to smoking (X) predict intentions to smoke (M), which in turn predict smoking behavior (Y)

---
# Example 1: Diagram

<img src="dapr3_06_pathintro_files/figure-html/unnamed-chunk-2-1.png" width="80%" />

---
# Terminology

- Time for some formal terminology
- Broadly, variables can be categorized as either exogenous or endogenous
- **Exogenous** variables are essentially independent variables
  - They only have directed arrows going out
- **Endogenous** variables are dependent variables in at least one part of the model
  - They have directed arrows going in
  - In a linear model there is only one endogenous variable, but in a path model we can have multiple
  - They also have an associated residual variance
    - Just like in `lm`
    - If something predicts (explains variance in) a variable, there will be something left unexplained

---
# Example 1: With formal terminology

<img src="dapr3_06_pathintro_files/figure-html/unnamed-chunk-3-1.png" width="80%" />

---
# IMPORTANT

- **It is all just covariance/correlation**
  - There is a lot of terminology in path models, and they can be applied to many designs, but just like other statistics, the models are based on the correlation matrix of the measured variables in your study
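- As a minimal illustration (assuming, hypothetically, that the Example 1 variables live in a data frame called `tbl`, the name used in the `sem()` call later), the matrix a path model works from is simply:

```r
# observed correlations between the five measured variables
# (tbl is a hypothetical data frame holding our example data)
round(cor(tbl[, c("ext", "con", "age", "maths", "english")]), 2)
```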
---
class: inverse, center, middle, animated, rotateInDownLeft
# End of Part 1

---
class: inverse, center, middle

<h2 style="text-align: left;opacity:0.3;">Part 1: Introduction and Motivation</h2>
<h2>Part 2: Introducing `lavaan`</h2>
<h2 style="text-align: left;opacity:0.3;">Part 3: Model Specification & Estimation</h2>
<h2 style="text-align: left;opacity:0.3;">Part 4: Model Evaluation</h2>

---
# `lavaan`

- The package we will use to fit our path models is called `lavaan`
- Using `lavaan` requires us to go through three steps:
  1. Specify the model and create a model object (i.e. say what we want to test)
  2. Run the model
  3. Evaluate the results

- This is functionally exactly the same as we have done with `lm` and `lmer`:

```r
results <- lm(y ~ x, data = dataset) # this combines specify and run
summary(results)                     # this evaluates
```

---
# Model statements in `lavaan`

.pull-left[
- When a variable is observed, we use the name it has in our data set
  - Here `X1`, `X2`, `X3`, `Y2`
- When a variable is latent (more on this in the data reduction section), we give it a new name
  - Here `Latent`
- To specify a covariance, we use `~~`
- To specify a regression path, we use `~`
]

.pull-right[
<img src="dapr3_06_pathintro_files/figure-html/unnamed-chunk-5-1.png" width="4495" />
]

---
# Model specification for our example

.pull-left[

```r
example1 <- '
maths ~ con + age
english ~ con + ext + age
'
```
]

.pull-right[
<img src="dapr3_06_pathintro_files/figure-html/unnamed-chunk-7-1.png" width="1045" />
]

---
# Running a `lavaan` model

- Once we have our model statement, we then need to run our model
- There are a number of functions to do this; we will only use `sem()`

```r
library(lavaan)
m1 <- sem(example1,              # your model statement
          ordered = c(),         # if variables are ordered categories, list them
          estimator = "ML",      # name of the estimation method you wish to use
          missing = "listwise",  # name of the missing data method you wish to use
          data = tbl)            # your data set
```

- `lavaan` has sensible defaults, meaning most of the time you will only need to state your model and data
- There is **lots** of information on using `lavaan`, with many worked examples, [online](https://lavaan.ugent.be/)

---
# Viewing the results

- Lastly, we need to use a `summary()` function (like in `lm` and `glm`) to see the results

```r
summary(m1,                # name given to our results object
        fit.measures = T,  # model fit information
        standardized = T)  # provides standardized coefficients
```

---
class: inverse, center, middle, animated, rotateInDownLeft
# End of Part 2

---
class: inverse, center, middle

<h2 style="text-align: left;opacity:0.3;">Part 1: Introduction and Motivation</h2>
<h2 style="text-align: left;opacity:0.3;">Part 2: Introducing `lavaan`</h2>
<h2>Part 3: Model Specification & Estimation</h2>
<h2 style="text-align: left;opacity:0.3;">Part 4: Model Evaluation</h2>

---
# Stages in a path model

- Specification:
  - This is what we have just seen in our example
  - Specification concerns which variables relate to which others, and in what ways
  - **Specification is also where we formally set out our theory and hypotheses**

- We have seen the types of path we can include, but there are some other standard "rules" (made concrete in the sketch below):
  1. All exogenous variables correlate
  2. For endogenous variables, we correlate the residuals, not the variables
  3. Endogenous variable residuals **do not correlate** with exogenous variables
  4. All paths are recursive (i.e. we cannot have loops like A->B, B->A)
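- As a sketch, here is what these rules look like written out explicitly in `lavaan` syntax for Example 1 (by default, `sem()` applies these rules automatically, so in practice the shorter statement we used earlier is enough):

```r
# Example 1 with the specification "rules" written out explicitly
example1_full <- '
# regression paths
maths ~ con + age
english ~ con + ext + age

# rule 1: all exogenous variables correlate
con ~~ ext
con ~~ age
ext ~~ age

# rule 2: the residuals of the endogenous variables correlate
maths ~~ english
'
```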
---
# Model identification

- Identification concerns the number of **knowns** versus **unknowns**
- **The knowns**:
  - Variances of the measured variables
  - Covariances between the variables
  - These are the unique values in a covariance/correlation matrix
- The unknowns are the parameters we want to estimate
  - i.e. all the lines we include in our diagram
- **Degrees of freedom** are the difference between knowns and unknowns
  - Degrees of freedom must not be negative for the model to be identified
  - Positive degrees of freedom mean we have more knowns than unknowns
    - Or in other words, our model simplifies the data

---
# Model Identification: t-rule

- As noted above, identification concerns the degrees of freedom in our model
- We can calculate the number of knowns from the number of observed variables:

$$\frac{k(k+1)}{2}$$

- where *k* is the number of observed variables
- The unknowns are:
  - the variances of all variables (estimated)
  - covariances
  - regression paths

---
# Example 1: Knowns

.pull-left[
- Let's count the knowns
- Our example has 5 variables, so applying the *t*-rule:

$$\frac{k(k+1)}{2} = \frac{5(5+1)}{2} = \frac{30}{2} = 15$$
]

.pull-right[
- Look at the correlation matrix between the 5 variables
- Let's cross out the duplicate numbers, and see how many **unique** values are left

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> ext </th>
   <th style="text-align:right;"> con </th>
   <th style="text-align:right;"> age </th>
   <th style="text-align:right;"> maths </th>
   <th style="text-align:right;"> english </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> ext </td>
   <td style="text-align:right;"> 1.00 </td>
   <td style="text-align:right;"> -0.27 </td>
   <td style="text-align:right;"> -0.18 </td>
   <td style="text-align:right;"> 0.14 </td>
   <td style="text-align:right;"> 0.01 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> con </td>
   <td style="text-align:right;"> -0.27 </td>
   <td style="text-align:right;"> 1.00 </td>
   <td style="text-align:right;"> 0.50 </td>
   <td style="text-align:right;"> -0.09 </td>
   <td style="text-align:right;"> 0.15 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> age </td>
   <td style="text-align:right;"> -0.18 </td>
   <td style="text-align:right;"> 0.50 </td>
   <td style="text-align:right;"> 1.00 </td>
   <td style="text-align:right;"> -0.19 </td>
   <td style="text-align:right;"> 0.16 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> maths </td>
   <td style="text-align:right;"> 0.14 </td>
   <td style="text-align:right;"> -0.09 </td>
   <td style="text-align:right;"> -0.19 </td>
   <td style="text-align:right;"> 1.00 </td>
   <td style="text-align:right;"> -0.05 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> english </td>
   <td style="text-align:right;"> 0.01 </td>
   <td style="text-align:right;"> 0.15 </td>
   <td style="text-align:right;"> 0.16 </td>
   <td style="text-align:right;"> -0.05 </td>
   <td style="text-align:right;"> 1.00 </td>
  </tr>
 </tbody>
</table>
]

---
# Example 1: Unknowns

.pull-left[
- Suppose we start with a model where we propose no relationships between any variables
- All our model will estimate is the variance of each variable (red arrows)
  - Here we have 5 variances
  - We have 15 knowns
  - So we would have 10 degrees of freedom
]

.pull-right[
<img src="dapr3_06_pathintro_files/figure-html/unnamed-chunk-11-1.png" width="868" />
]

---
# Example 1: Unknowns

.pull-left[
- But what about our actual model? What are we estimating here?
]

.pull-right[
<img src="dapr3_06_pathintro_files/figure-html/unnamed-chunk-12-1.png" width="1045" />
]

---
# Example 1: Unknowns

.pull-left[
- But what about our actual model? What are we estimating here?
- In total we have 14 parameters:
  - 3 predictor (exogenous) variable variances (red arrows)
  - 3 predictor (exogenous) variable covariances (yellow arrows)
  - 5 regression paths from predictors to outcomes (blue arrows)
  - 2 residual variances of the outcome (endogenous) variables (orange circles)
  - 1 residual covariance between the outcomes (green arrow)
- 15 knowns - 14 unknowns = 1 degree of freedom
]

.pull-right[
<img src="dapr3_06_pathintro_files/figure-html/unnamed-chunk-13-1.png" width="1051" />
]

---
# Levels of identification

- So for Example 1, our degrees of freedom = 1
- There are three levels of identification:
  - **Under-identified** models: have < 0 degrees of freedom
  - **Just identified** models: have 0 degrees of freedom (all standard linear models are just identified)
  - **Over-identified** models: have > 0 degrees of freedom
- So Example 1 is over-identified

---
# Model estimation

- After we have specified our model (& checked it is identified), we proceed to **estimation**
- Model estimation refers to finding the 'best' values for the unknown parameters
- Maximum likelihood estimation is most commonly used
  - Finds the parameters that maximise the likelihood of the data
  - Begins with a set of starting values
  - Iterative process of improving these values
    - i.e. to minimise the difference between the sample covariance matrix and the covariance matrix implied by the parameter values
  - Terminates when the values are no longer substantially improved across iterations
    - At this point **convergence** is said to have been reached

---
# No convergence?

- Sometimes estimation fails
- Common reasons are:
  1. The model is not identified
  2. The model is very mis-specified
  3. The model is very complex, so more iterations are needed than the program default
- If your model doesn't converge:
  - first check (1) and (2)
  - if it still does not converge, you may need to rethink the model

---
class: inverse, center, middle, animated, rotateInDownLeft
# End of Part 3

---
class: inverse, center, middle

<h2 style="text-align: left;opacity:0.3;">Part 1: Introduction and Motivation</h2>
<h2 style="text-align: left;opacity:0.3;">Part 2: Introducing `lavaan`</h2>
<h2 style="text-align: left;opacity:0.3;">Part 3: Model Specification & Estimation</h2>
<h2>Part 4: Model Evaluation</h2>

---
# From path models to model evaluation

- We saw in Part 3 how path models are estimated from covariances/correlations
  - Let's call this the **observed correlation/covariance**
- When we specify a model, we can use the parameter estimates to re-calculate the covariances/correlations
  - This is called path tracing (see lab)
- Remember, ideally our model is a simplification of our data
  - So these estimates of the correlations won't be identical to the actual correlations
- This covariance/correlation matrix is referred to as the **model-implied correlation/covariance**
- Comparing the observed and model-implied correlation matrices is the key to assessing how good our model is
  - If a simplified model can reproduce all the relationships in the data, it is a good model

---
# Model Evaluation (Fit)

- In path models and their extensions, we tend not to focus on the variance explained in the outcome (though we can calculate this)
- Instead, we tend to think about:
  1. Does our model fit the data?
  2. If it fits the data, what are the parameter estimates?
- "Fitting the data" refers to the comparison between the observed and the model-implied covariance matrices
  - If our model reproduces the observed covariances well, then it is deemed to fit
  - If our model reproduces the observed covariances poorly, then it is deemed not to fit (and we wouldn't interpret the model)

---
# Model fit

- Just-identified models will always fit perfectly
  - They exactly reproduce the observed covariances
- When we have positive degrees of freedom, we can calculate a variety of model fit indices
  - We have seen some of these before (AIC and BIC)
  - But there are a huge number of model fit indices
- For ease, we will note a small number, and focus on the suggested values that indicate good vs bad fit
  - This will give an impression of certainty in the fit vs non-fit decision
  - But be aware this is not a binary choice
  - Model fit is a continuum, and the use of fit indices is much debated

---
# Global fit

- `\(\chi^2\)`
  - When we use maximum likelihood estimation we obtain a `\(\chi^2\)` value for the model
  - This can be compared to a `\(\chi^2\)` distribution with degrees of freedom equal to our model degrees of freedom
  - A statistically significant `\(\chi^2\)` suggests the model does not do a good job of reproducing the observed variance-covariance matrix
- However, `\(\chi^2\)` does not work well in practice
  - It leads to the rejection of models that are only trivially mis-specified

---
# Alternatives to `\(\chi^2\)`

- Absolute fit
  - Standardised root mean square residual (**SRMR**)
    - measures the discrepancy between the observed correlation matrix and the model-implied correlation matrix
    - ranges from 0 to 1, with 0 = perfect fit
    - values < .05 considered good
- Parsimony-corrected
  - Corrects for the complexity of the model by adding a penalty for estimating more parameters (i.e. for using up degrees of freedom)
  - Root mean square error of approximation (**RMSEA**)
    - 0 = perfect fit
    - values < .05 considered good

---
# Incremental fit indices

- Compare the model to a more restricted baseline model
  - Usually an 'independence' model where all observed variable covariances are fixed to 0
- Comparative fit index (**CFI**)
  - ranges between 0 and 1, with 1 = perfect fit
  - values > .95 considered good
- Tucker-Lewis index (**TLI**)
  - includes a parsimony penalty
  - values > .95 considered good

---
# Local Fit

- It is also possible to examine **local** areas of mis-fit
- **Modification indices** estimate the improvement in `\(\chi^2\)` that could be expected from including an additional parameter
- **Expected parameter changes** estimate the value of the parameter were it to be included

---
# Making model modifications

- Modification indices and expected parameter changes can be helpful for identifying how to improve the model
  - These can be extracted using `modindices(model)`
  - They indicate the amount by which model fit would improve if you added a given path to your model
- However:
  - Modifications should be made iteratively
  - They may just be capitalising on chance
  - Make sure the modifications can be justified on substantive grounds
  - Be aware that this becomes an exploratory modelling practice
  - Ideally, replicate the new model in an independent sample
- As a general rule for the dapR3 course, we want you to specify and test a specific model, and not seek to use exploratory modifications

---
# Interpreting the model

- And after all this, we reach interpretation
- Thankfully this is just as we have seen before
- If our specified model estimates (i.e. converges) and fits the data, we can interpret the parameter estimates
- Recall, these are just correlations and regression paths, so we interpret them in exactly the same way as we would `\(r\)` and `\(\beta\)` coefficients
  - e.g. for `\(\beta = .20\)`: for a 1 unit increase in the predictor, there is a 0.20 unit increase in the outcome

---
# Our example

- Let's run through all the stages of our example in R (a sketch pulling the whole workflow together follows the end slide)

---
class: extra, inverse, center, middle, animated, rotateInDownLeft
# End
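---
class: extra
# Appendix: Example 1 workflow sketch

- A minimal sketch of the full specify-run-evaluate workflow, assuming (hypothetically) that the five Example 1 variables live in a data frame called `tbl`:

```r
library(lavaan)

# 1. Specify: two linear models tested together as a set
example1 <- '
maths ~ con + age
english ~ con + ext + age
'

# 2. Run: lavaan's defaults (ML estimation) are sensible here
m1 <- sem(example1, data = tbl)

# 3. Evaluate: fit indices plus standardized coefficients
summary(m1, fit.measures = TRUE, standardized = TRUE)
```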