PCA and unequal variances

Simulating some data

We’re including this code in case you want to create some data and play around with it yourself, but don’t worry about understanding it! In brief, what it does is 1) create a covariance matrix, 2) generate data based on that covariance matrix, and 3) rename the columns to “item1”, “item2”, etc.

Code
library(tidyverse)
set.seed(777)
nitem <- 5
# create a (random) covariance matrix
A <- matrix(runif(nitem^2)*2-1, ncol = nitem)
scor <- t(A) %*% A
# generate 200 observations from a multivariate normal with that covariance matrix
df <- MASS::mvrnorm(n = 200, mu = rep(0, 5), Sigma = scor) %>% as_tibble()
# rename the columns to item1, item2, ...
names(df) <- paste0("item", 1:5)

The data we created has 5 items, all on similar scales:

Code
library(psych)
library(knitr)
kable(describe(df)[,c(3:4)])
          mean      sd
item1    0.054   1.126
item2   -0.098   1.626
item3    0.098   0.957
item4   -0.179   1.180
item5   -0.071   1.141

Doing PCA

We can start conducting a PCA from various points: either from the data itself, or from a matrix representing the relationships between the variables (i.e. either a covariance or a correlation matrix).

When using the principal() function from the psych package, if we give the function the dataset itself, it will internally create a correlation matrix on which to conduct the PCA. The same happens if we give the function the covariance matrix and set covar = FALSE.

Let’s suppose we are reducing down to just 1 component.
These three calls will all give the same result:

Code
principal(df, nfactors = 1)
principal(cor(df), nfactors = 1)
principal(cov(df), nfactors = 1, covar = FALSE)

Here are the PC1 loadings from each call:

    (1) principal(df, nfactors = 1)
    (2) principal(cor(df), nfactors = 1)
    (3) principal(cov(df), nfactors = 1, covar = FALSE)

             (1)       (2)       (3)
item1     -0.861    -0.861    -0.861
item2      0.222     0.222     0.222
item3     -0.834    -0.834    -0.834
item4      0.765     0.765     0.765
item5      0.863     0.863     0.863
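
If you are curious where these numbers come from: the loadings for a principal component are the corresponding eigenvector of the correlation matrix, scaled by the square root of its eigenvalue. Below is a minimal sketch of this (note that the sign of a component is arbitrary, so the loadings may come out multiplied by -1 relative to principal()):

Code
# cov2cor() turns the covariance matrix into the correlation matrix,
# which is why cov(df) with covar = FALSE matches cor(df)
all.equal(cov2cor(cov(df)), cor(df))

# eigendecomposition of the correlation matrix
e <- eigen(cor(df))

# PC1 loadings: first eigenvector scaled by the sqrt of the first eigenvalue
# (signs are arbitrary, so this may be -1 times the loadings shown above)
e$vectors[, 1] * sqrt(e$values[1])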

PCA on the covariance matrix

If we use the covariance matrix (i.e. set covar = TRUE), we get slightly different results, because the loadings now depend on the scale (the variance) of each item rather than being based on standardised variables.

Code
principal(cov(df), nfactors = 1, covar = TRUE)$loadings

Loadings:
      PC1   
item1 -0.796
item2  0.772
item3 -0.874
item4  0.860
item5  0.898

                 PC1
SS loadings    3.540
Proportion Var 0.708
Comparing each item’s variance with its loadings from the correlation-based and the covariance-based PCA:

         variance of item   loadings (cor PCA)   loadings (cov PCA)
item1               1.268               -0.861               -0.796
item2               2.643                0.222                0.772
item3               0.915               -0.834               -0.874
item4               1.392                0.765                0.860
item5               1.302                0.863                0.898
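
The same eigendecomposition logic applies here, only now on the covariance matrix rather than the correlation matrix. Another minimal sketch, which should reproduce the covariance-based loadings above (again, up to an arbitrary sign flip):

Code
# PC1 loadings from the covariance matrix: eigenvector * sqrt(eigenvalue)
e_cov <- eigen(cov(df))
e_cov$vectors[, 1] * sqrt(e_cov$values[1])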

This means that if the items are measured on very different scales, using the covariance matrix will lead to the components being dominated by the items with the largest variance.

Let’s make another dataset in which item2 is measured on a completely different scale:

Code
dfb <- df %>% mutate(item2 = item2*20)
kable(describe(dfb)[,c(3:4)])
          mean       sd
item1    0.054    1.126
item2   -1.964   32.515
item3    0.098    0.957
item4   -0.179    1.180
item5   -0.071    1.141

With this new data, the correlation-based loadings are unchanged, but the covariance-based PCA is dominated by item2:

         variance of item   loadings (cor PCA)   loadings (cov PCA)
item1               1.268               -0.861                0.288
item2            1057.242                0.222               32.515
item3               0.915               -0.834               -0.593
item4               1.392                0.765                0.091
item5               1.302                0.863                0.064
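
One quick way to quantify this domination is to compare the proportion of total variance captured by the first component under each approach. The sketch below uses eigen() directly; because item2’s variance (~1057) dwarfs the others, the covariance-based PC1 should account for nearly all of the total variance, whereas the correlation matrix treats every item equally.

Code
# proportion of total variance captured by PC1 of the covariance matrix
ev_cov <- eigen(cov(dfb))$values
ev_cov[1] / sum(ev_cov)

# the same quantity for the correlation matrix, where each item contributes equally
ev_cor <- eigen(cor(dfb))$values
ev_cor[1] / sum(ev_cor)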

Use of covar=..

The covar=TRUE/FALSE argument of principal() only makes a difference if you give the function a covariance matrix.

If you give the principal() function the raw data, then it will conduct the PCA on the correlation matrix regardless of whether you set covar = TRUE or covar = FALSE.

Here are the PC1 loadings from each call; only the last one (a covariance matrix with covar = TRUE) differs:

    (1) principal(dfb, nfactors = 1, covar = FALSE)
    (2) principal(dfb, nfactors = 1, covar = TRUE)
    (3) principal(cor(dfb), nfactors = 1)
    (4) principal(cov(dfb), nfactors = 1, covar = FALSE)
    (5) principal(cov(dfb), nfactors = 1, covar = TRUE)

             (1)       (2)       (3)       (4)       (5)
item1     -0.861    -0.861    -0.861    -0.861     0.288
item2      0.222     0.222     0.222     0.222    32.515
item3     -0.834    -0.834    -0.834    -0.834    -0.593
item4      0.765     0.765     0.765     0.765     0.091
item5      0.863     0.863     0.863     0.863     0.064
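
If you want to check this for yourself, one simple approach (sketched below) is to compare the loadings from the different calls directly with all.equal():

Code
# raw data: the covar argument should be ignored, so these should be identical
all.equal(unclass(principal(dfb, nfactors = 1, covar = FALSE)$loadings),
          unclass(principal(dfb, nfactors = 1, covar = TRUE)$loadings))

# and both should match a PCA of the correlation matrix
all.equal(unclass(principal(dfb, nfactors = 1, covar = TRUE)$loadings),
          unclass(principal(cor(dfb), nfactors = 1)$loadings))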