3: PCA Walkthrough

This walkthrough:

  • Gives a brief overview of Principal Components Analysis (PCA).
  • Introduces PCA only briefly, as a stepping stone to EFA (Chapter 4).

Dataset: physfunc.csv

These data contain measurements from 220 individuals on various aspects of physical functioning:

  • Mobility: 6-Minute Walk Test (distance in metres walked in 6 minutes).
  • Balance: Single-Leg Stance Test (time in seconds standing on one leg).
  • Endurance: Stair Climb Test (time in seconds taken to climb a set number of stairs).
  • Flexibility: Sit and Reach Test (distance in cm reached beyond the feet while seated).
  • Strength: Grip Strength Test (measured in kilograms using a dynamometer).
  • Fine Motor Skills: Nine-Hole Peg Test (time in seconds to place and remove pegs from a board).
  • Respiratory Function: Forced Expiratory Volume in 1 Second (FEV1) (measured in litres using a spirometer).

Dataset: https://uoepsy.github.io/data/physfunc.csv

library(tidyverse)
physfunc <- read_csv("https://uoepsy.github.io/data/physfunc.csv")
head(physfunc)
# A tibble: 6 × 8
  mobility balance endurance flexibility strength coordination fine_motor
     <dbl>   <dbl>     <dbl>       <dbl>    <dbl>        <dbl>      <dbl>
1      587    23.7      13.1          11       43         1.62       20.8
2      594    23.8      12.9           9       33         1.56       20.8
3      554    29.1      14.3          12       50         1.69       22  
4      525    17.9      16.6           8       20         1.62       20.1
5      541    17.5      20.0           9       43         1.45       20.6
6      540    21.7      16.2          10       40         1.51       20.0
# ℹ 1 more variable: respiratory <dbl>

With 8 observed variables, we are working in 8 dimensions. These observed variables are all correlated, and PCA re-expresses the information on uncorrelated (i.e. perpendicular) dimensions. We can do this using principal() from the {psych} package.

This is achieved using “eigen decomposition”, which is a fairly complex calculation performed on the correlation matrix.

With \(p\) variables, we can get out \(p\) orthogonal dimensions, which here we call “principal components”. In the principal() function, we can set nfactors = 8 to extract 8 dimensions.1

library(psych)
fullpca <- principal(physfunc[,1:8], 
                     nfactors = 8, rotate = "none")
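
If you want to peek under the hood, the eigen decomposition mentioned above can be done directly with base R’s eigen() on the correlation matrix. This is just an aside rather than something you need to do - the eigenvalues it returns should match those stored in the principal() output:

# eigen decomposition of the correlation matrix of our 8 variables
eigen(cor(physfunc[, 1:8]))$values
# the same eigenvalues are stored in the principal() output
fullpca$values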

Variance Accounted For

Because the process of eigen decomposition is essentially just a re-expression of our correlation matrix, if we take our 8 principal components all together, we are explaining all of the variance in the original data. In other words, the total variance of the principal components (the sum of their variances) is equal to the total variance in the original data (the sum of the variances of the variables). This makes sense - we have simply gone from having 8 correlated variables to having 8 orthogonal (unrelated) components - we haven’t lost any variance.

The key benefit of PCA is that our orthogonal components have sequential variance maximisation - i.e., the first captures the most variance, the second captures the second most, and so on. So if we just want to reduce down to 1 dimension, we just keep the first component. If we want to keep some pre-specified amount (e.g., 80%) of the variability, then accessing Vaccounted will give us information on the amount of variance captured by each component:

fullpca$Vaccounted
                            PC1       PC2        PC3        PC4        PC5
SS loadings           4.4621777 1.8232630 0.79738447 0.39431371 0.24229978
Proportion Var        0.5577722 0.2279079 0.09967306 0.04928921 0.03028747
Cumulative Var        0.5577722 0.7856801 0.88535315 0.93464237 0.96492984
Proportion Explained  0.5577722 0.2279079 0.09967306 0.04928921 0.03028747
Cumulative Proportion 0.5577722 0.7856801 0.88535315 0.93464237 0.96492984
                             PC6        PC7          PC8
SS loadings           0.16489579 0.11041822 0.0052472755
Proportion Var        0.02061197 0.01380228 0.0006559094
Cumulative Var        0.98554181 0.99934409 1.0000000000
Proportion Explained  0.02061197 0.01380228 0.0006559094
Cumulative Proportion 0.98554181 0.99934409 1.0000000000
  • SS loadings: The sum of the squared loadings (these are also the “eigenvalues”). These represent the amount of variance explained by each component, out of the total variance, which is 8 because we have 8 variables (the variables are all standardised, meaning they each have a variance of 1, and \(8 \times 1 = 8\), so we can think of the total variance as 8!). We can check this in the snippet below.
  • Proportion Var: The proportion of the overall variance that the component accounts for.
  • Cumulative Var: The cumulative sum of Proportion Var.
  • Proportion Explained: The relative amount of variance explained (\(\frac{\text{Proportion Var}}{\text{sum(Proportion Var)}}\)).
  • Cumulative Proportion: The cumulative sum of Proportion Explained.
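
As a quick sanity check (using the fullpca object from above), we can confirm that the eigenvalues sum to 8, and that dividing each by 8 reproduces the Proportion Var row:

# the SS loadings (eigenvalues) should sum to the total variance of 8
sum(fullpca$Vaccounted["SS loadings", ])
# and each one divided by 8 gives Proportion Var
fullpca$Vaccounted["SS loadings", ] / 8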

There are lots of methods to help us decide how many components we should keep.

Scree plot

Scree plots show the eigenvalues for each component (each new dimension). A typical scree plot features higher variances for the initial components and quickly drops to small variances where the curve is almost flat. The flat part of the curve represents the noise components, which are not able to capture the main sources of variability in the system.

According to the scree plot criterion, we should keep the principal components that come before the “kink” (or “elbow”) in the plot. In the plot below, the kink occurs at 3 by my reckoning (yours might be different!), so I would keep 2.

NOTE: Scree plots are subjective and may have multiple or no obvious kinks/elbows, making them hard to interpret

scree(physfunc[,1:8], pc = TRUE, factors = FALSE)

The horizontal line is an additional suggestion that we should keep all components for which the eigenvalue is >1. This is known as “Kaiser’s Criterion”.

Parallel analysis

Parallel analysis involves simulating lots of datasets of the same dimension but in which the variables are uncorrelated. For each of these simulations, a PCA is conducted on its correlation matrix, and the eigenvalues are extracted. We can then compare our eigenvalues from the PCA on our actual data to the average eigenvalues across these simulations. In theory, for uncorrelated variables, no components should explain more variance than any others, and eigenvalues should be equal to 1. In reality, variables are rarely truly uncorrelated, and so there will be slight variation in the magnitude of eigenvalues simply due to chance. The parallel analysis method suggests keeping those components for which the eigenvalues are greater than those from the simulations.
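
To give a flavour of the logic (this is just a rough sketch of the idea, not how you would do it in practice), we could simulate datasets of uncorrelated variables ourselves and look at the eigenvalues we get purely by chance:

# simulate 500 datasets of 220 observations on 8 uncorrelated variables,
# and store the eigenvalues from each simulated correlation matrix
set.seed(123)
sim_eigs <- replicate(500, eigen(cor(matrix(rnorm(220 * 8), ncol = 8)))$values)
# average eigenvalues expected by chance alone (they hover around 1):
rowMeans(sim_eigs)
# eigenvalues from our actual data, for comparison:
eigen(cor(physfunc[, 1:8]))$values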

It can be conducted in R using the fa.parallel() function.
When conducting an EFA (we’ll see this later on in these readings) as opposed to a PCA, you can set fa = "both" to do this for both factor extraction and principal components.

fa.parallel(physfunc[,1:8], fa = "pc", n.iter = 500)

Parallel analysis suggests that the number of factors =  NA  and the number of components =  2 

NOTE: Parallel analysis will sometimes tend to over-extract (suggest too many components)

MAP

The Minimum Average Partial (MAP) test computes the partial correlation matrix (adjusting for and removing a component from the correlation matrix), sequentially partialling out each component. At each step, the partial correlations are squared and their average is computed.
At first, the components which are removed will be those that are most representative of the shared variance between 2+ variables, meaning that the “average squared partial correlation” will decrease. At some point in the process, the components being removed will begin to represent variance that is specific to individual variables, meaning that the average squared partial correlation will increase.
The MAP method is to keep the number of components for which the average squared partial correlation is at the minimum.
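
For the curious, here is a rough by-hand sketch of that idea (the function we use below does this properly, and its exact implementation may differ in detail): for each number of components \(m\), we subtract the reconstructed correlations from the correlation matrix, rescale what remains into partial correlations, and average their squares.

# rough sketch of Velicer's MAP for m = 1 to 7 components
R <- cor(physfunc[, 1:8])
e <- eigen(R)
A <- e$vectors %*% diag(sqrt(e$values))   # component loadings
map_m <- function(m){
  C <- R - A[, 1:m, drop = FALSE] %*% t(A[, 1:m, drop = FALSE])
  P <- diag(1/sqrt(diag(C))) %*% C %*% diag(1/sqrt(diag(C)))  # partial correlations
  mean(P[lower.tri(P)]^2)   # average squared partial correlation
}
sapply(1:7, map_m)  # compare with the 'map' column of the output below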

We can conduct MAP in R using the VSS() function.

library(psych)
VSS(physfunc[,1:8], rotate = "none", 
    plot = FALSE, fm = "pc")

Very Simple Structure
Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
    n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
VSS complexity 1 achieves a maximimum of 0.96  with  2  factors
VSS complexity 2 achieves a maximimum of 0.99  with  3  factors

The Velicer MAP achieves a minimum of 0.14  with  1  factors 

BIC achieves a minimum of  Inf  with    factors
Sample Size adjusted BIC achieves a minimum of  Inf  with    factors

Statistics by number of factors 
  vss1 vss2  map dof chisq prob sqresid  fit RMSEA BIC SABIC complex eChisq
1 0.83 0.00 0.14   0    NA   NA 4.2e+00 0.83    NA  NA    NA      NA     NA
2 0.96 0.96 0.15   0    NA   NA 8.9e-01 0.96    NA  NA    NA      NA     NA
3 0.72 0.99 0.19   0    NA   NA 2.5e-01 0.99    NA  NA    NA      NA     NA
4 0.59 0.91 0.28   0    NA   NA 9.8e-02 1.00    NA  NA    NA      NA     NA
5 0.67 0.89 0.43   0    NA   NA 3.9e-02 1.00    NA  NA    NA      NA     NA
6 0.67 0.88 0.83   0    NA   NA 1.2e-02 1.00    NA  NA    NA      NA     NA
7 0.54 0.78 1.00   0    NA   NA 2.8e-05 1.00    NA  NA    NA      NA     NA
8 0.54 0.78   NA   0    NA   NA 1.3e-28 1.00    NA  NA    NA      NA     NA
  SRMR eCRMS eBIC
1   NA    NA   NA
2   NA    NA   NA
3   NA    NA   NA
4   NA    NA   NA
5   NA    NA   NA
6   NA    NA   NA
7   NA    NA   NA
8   NA    NA   NA

There is a lot of other information in this output too! For now, just focus on the map column, and the part of the output that says “The Velicer MAP achieves a minimum of 0.14 with 1 factors”.
This statement tells us that the minimum average squared partial correlation is achieved for 1 factor, which means that MAP suggests we keep only one factor. (The map column shows us the average squared partial correlation for each number of factors, and the MAP test will find the lowest value in that column and recommend to us the corresponding number of factors.)

NOTE: The MAP method will sometimes tend to under-extract (suggest too few components)

Loadings

Supposing that we decide (based on the scree plot, parallel analysis and MAP) to retain two components, capturing 79% of the total variance, the “loadings” tell us how each component is related back to the original variables.

So in the output below, we can see that the first component is positively related to most of the variables, apart from endurance (its loading is small and negative) and respiratory (its loading is not even shown because it is so small). The second component is positively related to both of these variables, and not clearly related to the others.

pca_2dim <- principal(physfunc[,1:8], 
                     nfactors = 2, rotate = "none")
pca_2dim$loadings

Loadings:
             PC1    PC2   
mobility      0.889       
balance       0.895       
endurance    -0.115  0.951
flexibility   0.738       
strength      0.833       
coordination  0.877       
fine_motor    0.919       
respiratory          0.952

                 PC1   PC2
SS loadings    4.462 1.823
Proportion Var 0.558 0.228
Cumulative Var 0.558 0.786
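
By default, very small loadings are hidden when printing. If you want to see them all (including the tiny loading of respiratory on the first component), you can lower the print cutoff:

# show all loadings, however small
print(pca_2dim$loadings, cutoff = 0)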

Note: As soon as we start to examine the loadings and try to give some sort of meaning to the components, we are moving away from a pure form of PCA, and using it more like exploratory factor analysis, which we will see more of below and in the next chapter.

Scores

When it comes to using the principal components in some subsequent analysis, we need to access the “scores”.

Scores in PCA represent the relative standing of every observation in our dataset on each component.

These are essentially weighted combinations of an individual’s values on the original variables.
\[ \text{score}_{\text{component j}} = w_{1j}y_1 + w_{2j}y_2 + w_{3j}y_3 +\, ... \, + w_{pj}y_p \] Where \(w\) are the weights, \(y\) the original variable scores.

The weights are calculated from the eigenvectors as \[ w_{ij} = \frac{a_{ij}}{\sqrt{\lambda_j}} \] where \(w_{ij}\) is the weight for a given variable \(i\) on component \(j\) , \(a_{ij}\) is the value from the eigenvector for item \(i\) on component \(j\) and \(\lambda_{j}\) is the eigenvalue for that component.
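
As a rough check of these formulas (an aside rather than something you would normally do), we can compute the first component’s scores by hand and compare them to what principal() gives us. Note that the sign of an eigenvector is arbitrary, so the hand-computed scores may be a perfect mirror image (correlation of -1) rather than an exact match:

# weights for component 1: eigenvector divided by sqrt(eigenvalue)
e <- eigen(cor(physfunc[, 1:8]))
w1 <- e$vectors[, 1] / sqrt(e$values[1])
# scores: weighted combination of the standardised variables
hand_scores <- scale(physfunc[, 1:8]) %*% w1
cor(hand_scores, pca_2dim$scores[, 1])  # should be 1 (or -1)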

The correlations between the scores and the measured variables are what we saw in the “loadings”!

# First PC
cor(pca_2dim$scores[,1], physfunc)
      mobility   balance  endurance flexibility strength coordination
[1,] 0.8889259 0.8950676 -0.1152972   0.7380577 0.833137    0.8769041
     fine_motor respiratory
[1,]  0.9187639 -0.07493035

So we can add these scores to our data:

physfunc$pc1 <- pca_2dim$scores[,1]
physfunc$pc2 <- pca_2dim$scores[,2]

And then go and use those in some subsequent analysis instead of our original 8 variables!
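
As a quick check, and a reminder of why this can be useful, the two sets of scores are uncorrelated with one another (unlike our original variables):

# the component scores are orthogonal (correlation of essentially zero)
cor(physfunc$pc1, physfunc$pc2)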

Footnotes

  1. If we chose a lower number, e.g., nfactors = 6, it would not show the 7th and 8th components.