Methods for Dimension Reduction, Discovery, and Testing

All of the methods here do the same two things:

Start with a set of correlated measured variables
Identify a smaller number of dimensions that capture the (co)variability across the set of measured variables

Depending upon what the goal of our research is, the key questions are going to be:

How do we find the dimensions?
How many dimensions should we retain?
How does each dimension relate back to the original measured variables?
How well do the set of dimensions explain the observed data?
How do we represent a person’s standing on the dimensions?

Naive ‘Scale Scores’

Goal: Pragmatic. I want 1 thing to use in some other setting (e.g., in a further analysis, or in a clinical setting).

How do we find the dimensions?
- Assume there’s only one
How many dimensions should we retain?
- Assume there’s only one
How does each dimension relate back to the original measured variables?
- All variables are equally related to the dimension
How well do the set of dimensions explain the observed data?
- Compute a measure of “reliability”
How do we represent a person’s standing on the dimensions?
- Add up all the responses on each variable \(\text{sum score} = y_{1i} + y_{2i} + y_{3i} +\, ... \, + y_{pi}\), and if you want a mean-score, we then divide by the number of variables: \(\text{mean score} = \frac{y_{1i} + y_{2i} + y_{3i} +\, ... \, + y_{pi}}{p}\) (see Calculating Scale Scores)
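The sum-score and mean-score formulas can be sketched in a few lines of Python. The response data below are made up purely for illustration, and Cronbach's alpha is used as one common choice of “reliability” measure; that choice is an assumption here, not something these notes prescribe.

```python
from statistics import variance

# Hypothetical responses: 4 people (rows) x 3 items (columns),
# all items scored in the same direction.
responses = [
    [1, 2, 3],
    [2, 3, 4],
    [3, 3, 5],
    [4, 5, 5],
]

p = len(responses[0])  # number of variables (items)

# Sum score: add up each person's responses across all items.
sum_scores = [sum(person) for person in responses]

# Mean score: the sum score divided by the number of items.
mean_scores = [total / p for total in sum_scores]


def cronbach_alpha(rows):
    """Cronbach's alpha:
    (p / (p - 1)) * (1 - sum of item variances / variance of sum scores)."""
    n_items = len(rows[0])
    item_vars = [variance(item) for item in zip(*rows)]  # one variance per item
    total_var = variance([sum(r) for r in rows])         # variance of sum scores
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


alpha = cronbach_alpha(responses)

print(sum_scores)
print(mean_scores)
print(round(alpha, 3))
```

Every item gets the same weight of 1 in the sum, which is exactly the “all variables are equally related to the dimension” assumption.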

For example, the AQ10 is a ten-question survey which (despite some well-established issues) is used by the NHS for autism screening. Participants essentially respond “agree” or “disagree” to each question, and the result is scored by counting the number of responses in the expected direction. People must score 6 or above in order to be referred for autism diagnosis on the NHS.

This is a naive scale score. There is only one dimension (some measure of autism), and all questions are assumed to be equally related to this dimension.

Principal Component Analysis (PCA)

Goal: Pragmatic. I want fewer things for use in a subsequent analysis but don’t want to lose too much variability. I want those things to be uncorrelated.¹

How do we find the dimensions?
- A set of orthogonal (i.e., perpendicular² and therefore uncorrelated) dimensions, each capturing as much of the remaining variability as possible.
- Found via some complicated maths that feels a bit like magic! It uses a method called “eigen decomposition”, the details of which are beyond the scope of this course, but the high-level idea is given below.
How many dimensions should we retain?
- Pragmatic: defined by either the number of dimensions we want, or the proportion of variability we want to retain
- Various tools can guide us towards how many dimensions capture a “substantial” amount of variability in the data (See Identifying the Number of Components/Factors).
How does each dimension relate back to the original measured variables?
- Examine correlations between dimensions and variables
- Somewhat irrelevant for ‘pure’ PCA, where we are agnostic/don’t care about what the dimensions are.
How well do the set of dimensions explain the observed data?
- Strictly speaking, PCA is not an “explanatory” tool. It simply combines variables to preserve variance. The closest we might get is to consider how much variance is captured by each dimension.
How do we represent a person’s standing on the dimensions?
- Extract PCA scores (these are a weighted sum of the scores on each variable: \(\text{score}_{\text{component j}} = w_{1j}y_1 + w_{2j}y_2 + w_{3j}y_3 +\, ... \, + w_{pj}y_p\)). See PCA Walkthrough.
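As a rough sketch of the PCA answers above, the eigen decomposition can be run directly on a correlation matrix with NumPy. The data are simulated, and the variable names and the 80% retention target are illustrative assumptions, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 200 people, 5 correlated variables that share one
# common source of variation plus independent noise.
n, p = 200, 5
shared = rng.normal(size=(n, 1))
Y = shared @ np.array([[0.9, 0.8, 0.7, 0.6, 0.5]]) + rng.normal(scale=0.6, size=(n, p))

# Standardise each variable, so we are analysing the correlation matrix.
Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)

# Eigen decomposition. eigh returns eigenvalues in ascending order, so
# reorder them to put the component capturing most variance first.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of total variance captured by each component, and the number
# of components needed to retain at least 80% (an arbitrary target).
prop_var = eigvals / eigvals.sum()
cum_var = np.cumsum(prop_var)
n_keep = int(np.searchsorted(cum_var, 0.80) + 1)

# Component scores: weighted sums of the standardised variables, with the
# weights w_1j, ..., w_pj given by column j of eigvecs.
scores = Z @ eigvecs

# Correlations between each variable (rows) and each component (columns).
loadings = eigvecs * np.sqrt(eigvals)

print(np.round(prop_var, 3))
print(n_keep)
```

The covariance matrix of `scores` is diagonal, with the eigenvalues on the diagonal: the components are uncorrelated, and each one's variance is the amount of variability it captures.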

Exploratory Factor Analysis (EFA)

Goal: Exploratory/Discovery/Measure development. I don’t have a strong theoretical model yet, but I believe there are underlying constructs that explain why these variables correlate with each other. I want to understand what those construct(s) might be and how the variables relate to them.

How do we find the dimensions?
- Estimation
- Models with different numbers of (possibly correlated) latent dimensions are compared
How many dimensions should we retain?
- The model that “best explains” our observed relationships between variables
- “best explains” = a theoretical question as much as a numerical one!
How does each dimension relate back to the original measured variables?
- Examine “loadings” between dimensions and variables
How well do the set of dimensions explain the observed data?
- For EFA, this question is subsumed into the questions of how many dimensions to retain and how each dimension relates to the measured variables. The aim is to settle on the model that makes the most sense.
How do we represent a person’s standing on the dimensions?
- The dimensions in EFA are “latent” in that they are never directly observed, so people’s standing on the dimensions cannot be perfectly determined. There are infinitely many sets of scores for the dimensions that would work equally well for a given model (“factor score indeterminacy”). This means that there are many different ways to estimate scores. See EFA Walkthrough.
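To make “many different ways to estimate scores” concrete, here is a minimal sketch comparing two widely used estimators, regression (Thurstone) scores and Bartlett scores, on simulated data. The loadings are hypothetical and are treated as known; in a real EFA they would be estimated, the factors might be correlated, and the solution would usually be rotated.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical loadings of 6 standardised variables on 2 uncorrelated factors.
Lambda = np.array([
    [0.8, 0.1],
    [0.7, 0.0],
    [0.6, 0.2],
    [0.1, 0.7],
    [0.0, 0.8],
    [0.2, 0.6],
])
psi = 1 - (Lambda ** 2).sum(axis=1)  # unique variance of each variable

# Simulate 1000 people: observed variables = factors x loadings + unique noise.
n = 1000
F_true = rng.normal(size=(n, 2))
E = rng.normal(size=(n, 6)) * np.sqrt(psi)
Y = F_true @ Lambda.T + E
Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)

# Regression (Thurstone) scores: weights are R^{-1} Lambda.
W_reg = np.linalg.solve(R, Lambda)
scores_reg = Z @ W_reg

# Bartlett scores: weights are Psi^{-1} Lambda (Lambda' Psi^{-1} Lambda)^{-1}.
Psi_inv_L = Lambda / psi[:, None]
W_bart = Psi_inv_L @ np.linalg.inv(Lambda.T @ Psi_inv_L)
scores_bart = Z @ W_bart

# How closely do the two estimators agree, and how well do the estimated
# scores track the simulated "true" standing on the first factor?
agreement = np.corrcoef(scores_reg[:, 0], scores_bart[:, 0])[0, 1]
recovery = np.corrcoef(scores_reg[:, 0], F_true[:, 0])[0, 1]

print(round(agreement, 3), round(recovery, 3))
```

The two sets of scores are not identical, and neither correlates perfectly with the simulated factor values, which is the practical face of indeterminacy.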

Confirmatory Factor Analysis (CFA)

Goal: Theory testing. I have a theoretical model of how these variables relate to these constructs. I want to test how well that model is reflected in these data.

How do we find the dimensions?
- They are pre-defined
How many dimensions should we retain?
- Irrelevant as the dimensions are pre-defined
How does each dimension relate back to the original measured variables?
- Pre-defined mapping of which variables to which dimensions. Magnitudes and directions are estimated.
How well do the set of dimensions explain the observed data?
- This is the key part of CFA. There are various measures of “model fit” that are ultimately asking “how well can the model reproduce the observed covariance matrix?”
How do we represent a person’s standing on the dimensions?
- Similar to EFA, standing is represented by estimated “Factor Scores” for the latent dimensions. Because the model structure is pre-defined, these scores are more “theoretically pure” reflections of the construct, though they still carry the same inherent uncertainty (indeterminacy) as EFA scores.
- In practice, CFA naturally extends to more advanced methods that allow us to estimate things of interest (e.g., relationships between constructs) without ever having to estimate individual scores on the construct. Rather than a two-step process of 1) estimating scores and 2) using those scores in a subsequent analysis, methods such as Structural Equation Modelling (SEM) allow us to do both in a single model.
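The idea behind “how well can the model reproduce the observed covariance matrix?” can be sketched as follows. The loadings and factor correlation are hypothetical fixed values (a real CFA would estimate them), and the summary computed here is an SRMR-style root mean square of standardised residuals, used as one illustrative fit measure among the many available.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pre-defined mapping: variables 1-3 load only on factor 1, and
# variables 4-6 only on factor 2. The zeros are fixed by the theory.
Lambda = np.array([
    [0.7, 0.0],
    [0.7, 0.0],
    [0.7, 0.0],
    [0.0, 0.7],
    [0.0, 0.7],
    [0.0, 0.7],
])
Phi = np.array([[1.0, 0.5], [0.5, 1.0]])  # factor correlation of 0.5


def implied_cov(Lambda, Phi):
    """Model-implied covariance: Lambda Phi Lambda' + Theta, with the residual
    variances in Theta chosen so that every variable has variance 1."""
    common = Lambda @ Phi @ Lambda.T
    Theta = np.diag(1 - np.diag(common))
    return common + Theta


def srmr(S, Sigma):
    """Root mean square of the standardised residuals between the observed (S)
    and model-implied (Sigma) covariances, over the lower triangle."""
    d = np.sqrt(np.diag(S))
    resid = (S - Sigma) / np.outer(d, d)
    idx = np.tril_indices_from(resid)
    return float(np.sqrt(np.mean(resid[idx] ** 2)))


# Simulate data from the two-correlated-factors model.
Sigma = implied_cov(Lambda, Phi)
Y = rng.multivariate_normal(np.zeros(6), Sigma, size=2000)
S = np.cov(Y, rowvar=False)

# Compare the correct model with one that wrongly fixes the factor
# correlation to zero.
fit_correct = srmr(S, Sigma)
fit_wrong = srmr(S, implied_cov(Lambda, np.eye(2)))

print(round(fit_correct, 3), round(fit_wrong, 3))
```

The misspecified model cannot reproduce the correlations between the two sets of variables, so its residuals (and hence the summary value) are larger.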

Footnotes

¹ I might want this if, e.g., I have issues with multicollinearity of predictors in a regression model.
² Like all those that we’ve seen above.