PCA and unequal variances

Simulating some data

We’re including this code in case you want to create some data and play around with it yourself, but don’t worry about understanding it! In brief, it 1) creates a covariance matrix, 2) generates data based on that covariance matrix, and 3) renames the columns to “item1”, “item2”, etc.

Code
library(tidyverse)
set.seed(993)
nitem <- 5
# 1) create a (random) covariance matrix
A <- matrix(runif(nitem^2) * 2 - 1, ncol = nitem)
scor <- t(A) %*% A
# 2) generate 200 observations from a multivariate normal with that covariance
df <- MASS::mvrnorm(n = 200, mu = rep(0, nitem), Sigma = scor) %>% as_tibble()
# 3) rename the columns
names(df) <- paste0("item", 1:nitem)

The data we created has 5 items, all on similar scales:

Code
library(psych)
library(knitr)
kable(describe(df)[,c(3:4)])
|       |   mean |   sd |
|:------|-------:|-----:|
| item1 | -0.096 | 1.48 |
| item2 |  0.027 | 1.98 |
| item3 | -0.080 | 1.22 |
| item4 | -0.047 | 1.66 |
| item5 |  0.140 | 1.65 |

Doing PCA

We can start conducting a PCA from several different points. We can either start with the data itself, or with a matrix representing the relationships between the variables (i.e. a covariance or a correlation matrix).

When using the principal() function from the psych package, if we give the function the dataset itself, it will internally compute a correlation matrix on which to conduct the PCA. The same will happen if we give the function the covariance matrix and set covar = FALSE (the covariance matrix is converted to a correlation matrix first).

Let’s suppose we are reducing down to just 1 component.
These three calls will all give the same result:

Code
principal(df, nfactors = 1)
principal(cor(df), nfactors = 1)
principal(cov(df), nfactors = 1, covar = FALSE)

Here are the loadings from each:

|       | principal(df, nfactors = 1) | principal(cor(df), nfactors = 1) | principal(cov(df), nfactors = 1, covar = FALSE) |
|:------|------:|------:|------:|
| item1 |  0.662 |  0.662 |  0.662 |
| item2 |  0.658 |  0.658 |  0.658 |
| item3 |  0.576 |  0.576 |  0.576 |
| item4 |  0.957 |  0.957 |  0.957 |
| item5 | -0.630 | -0.630 | -0.630 |
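If you want to see what principal() is doing under the hood, a minimal sketch in base R (assuming the df simulated above): for a 1-component solution, the loadings are the first eigenvector of the correlation matrix rescaled by the square root of the corresponding eigenvalue.

```r
# PCA loadings from first principles: eigendecompose the correlation
# matrix, then rescale the first eigenvector by sqrt(eigenvalue).
R <- cor(df)
e <- eigen(R)
loadings_pc1 <- e$vectors[, 1] * sqrt(e$values[1])
# Up to an arbitrary sign flip, these match principal(df, nfactors = 1)
round(loadings_pc1, 3)
```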

PCA on the covariance matrix

If we instead conduct the PCA on the covariance matrix (covar = TRUE), we get different results, because the loadings are now on the items’ raw scales: items with larger variances get proportionally larger loadings.

Code
principal(cov(df), nfactors = 1, covar = TRUE)$loadings

Loadings:
      PC1   
item1  0.995
item2  1.689
item3  0.403
item4  1.644
item5 -0.613

                PC1
SS loadings    7.08
Proportion Var 1.42
|       | variance of item | loadings (cor PCA) | loadings (cov PCA) |
|:------|-----:|-------:|-------:|
| item1 | 2.20 |  0.662 |  0.995 |
| item2 | 3.92 |  0.658 |  1.689 |
| item3 | 1.48 |  0.576 |  0.403 |
| item4 | 2.75 |  0.957 |  1.644 |
| item5 | 2.72 | -0.630 | -0.613 |

This means that if the items are measured on very different scales, using the covariance matrix will lead to the components being dominated by the items with the largest variance.
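The same first-principles sketch as before makes this visible (again assuming the df simulated above): eigendecomposing the covariance matrix instead of the correlation matrix gives loadings on the items’ raw scales.

```r
# Covariance-based PCA by hand: eigendecompose cov(df) instead of cor(df).
S <- cov(df)
e <- eigen(S)
loadings_pc1 <- e$vectors[, 1] * sqrt(e$values[1])
# These loadings are on the items' raw scales, so they can exceed 1, and
# items with larger variances get larger loadings.
round(loadings_pc1, 3)
```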

Let’s make another dataset in which item2 is measured on a completely different scale:

Code
dfb <- df %>% mutate(item2 = item2*20)
kable(describe(dfb)[,c(3:4)])
|       |   mean |    sd |
|:------|-------:|------:|
| item1 | -0.096 |  1.48 |
| item2 |  0.545 | 39.58 |
| item3 | -0.080 |  1.22 |
| item4 | -0.047 |  1.66 |
| item5 |  0.140 |  1.65 |
The correlation-based loadings are unchanged, but the covariance-based PCA is now dominated by item2:

|       | variance of item | loadings (cor PCA) | loadings (cov PCA) |
|:------|--------:|-------:|-------:|
| item1 |    2.20 |  0.662 |  0.546 |
| item2 | 1566.53 |  0.658 | 39.579 |
| item3 |    1.48 |  0.576 |  0.020 |
| item4 |    2.75 |  0.957 |  1.337 |
| item5 |    2.72 | -0.630 |  0.069 |
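One way to see that the two analyses differ only by the scaling of the items: if you standardize each item first (z-score it), the covariance matrix of the standardized data is exactly the correlation matrix of the original data, so a covariance-based PCA on the standardized data reproduces the correlation-based solution. A quick sketch, assuming the dfb created above:

```r
# Standardizing each item (z-scoring) makes cov() equal cor(), so the
# covariance-based and correlation-based PCA coincide on scaled data.
dfb_z <- as.data.frame(scale(dfb))
all.equal(cov(dfb_z), cor(dfb), check.attributes = FALSE)  # TRUE
```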

Use of covar=..

The covar = TRUE/FALSE argument of principal() only makes a difference if you give the function a covariance matrix.

If you give the principal() function the raw data, it will conduct the PCA on the correlation matrix regardless of whether you set covar = TRUE or covar = FALSE.

|       | principal(dfb, nfactors = 1, covar = FALSE) | principal(dfb, nfactors = 1, covar = TRUE) | principal(cor(dfb), nfactors = 1) | principal(cov(dfb), nfactors = 1, covar = FALSE) | principal(cov(dfb), nfactors = 1, covar = TRUE) |
|:------|------:|------:|------:|------:|-------:|
| item1 |  0.662 |  0.662 |  0.662 |  0.662 |  0.546 |
| item2 |  0.658 |  0.658 |  0.658 |  0.658 | 39.579 |
| item3 |  0.576 |  0.576 |  0.576 |  0.576 |  0.020 |
| item4 |  0.957 |  0.957 |  0.957 |  0.957 |  1.337 |
| item5 | -0.630 | -0.630 | -0.630 | -0.630 |  0.069 |
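You can check this equivalence programmatically too. A sketch, assuming the dfb created above: the first and last calls from the table should agree (raw data ignores covar) and disagree (covariance matrix respects covar = TRUE), respectively.

```r
library(psych)
# covar is ignored when the input is raw data...
l_raw <- principal(dfb, nfactors = 1, covar = TRUE)$loadings
l_cor <- principal(cor(dfb), nfactors = 1)$loadings
# ...but respected when the input is a covariance matrix.
l_cov <- principal(cov(dfb), nfactors = 1, covar = TRUE)$loadings
isTRUE(all.equal(unclass(l_raw), unclass(l_cor), check.attributes = FALSE))  # TRUE
isTRUE(all.equal(unclass(l_raw), unclass(l_cov), check.attributes = FALSE))  # FALSE
```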