PCA and unequal variances

Simulating some data

We’re including this code in case you want to create some data and play around with it yourself, but don’t worry about understanding it! In brief, what it does is 1) create a covariance matrix, 2) generate data based on that covariance matrix, and 3) rename the columns to “item1”, “item2”, etc.

Code
library(tidyverse)
set.seed(777)
nitem <- 5
# create a (random) covariance matrix
A <- matrix(runif(nitem^2)*2-1, ncol = nitem)
scor <- t(A) %*% A
# generate 200 observations from a multivariate normal with that covariance matrix
df <- MASS::mvrnorm(n = 200, mu = rep(0, 5), Sigma = scor) %>% as_tibble()
# rename the columns to item1, item2, ...
names(df) <- paste0("item", 1:5)

The data we created has 5 items, all on similar scales:

Code
library(psych)
library(knitr)
kable(describe(df)[,c(3:4)])
          mean      sd
item1    0.054   1.126
item2   -0.098   1.626
item3    0.098   0.957
item4   -0.179   1.180
item5   -0.071   1.141

Doing PCA

We can start conducting a PCA from various points: either from the data itself, or from a matrix representing the relationships between the variables (i.e. either a covariance or a correlation matrix).

When using the principal() function from the psych package, if we give the function the dataset itself, it will internally create a correlation matrix on which to conduct the PCA. The same happens if we give the function the covariance matrix and set covar = FALSE.

Let’s suppose we are reducing down to just 1 component.
These three calls will all give the same result:

Code
principal(df, nfactors = 1)
principal(cor(df), nfactors = 1)
principal(cov(df), nfactors = 1, covar = FALSE)

Here are the PC1 loadings from each call:

    (1) principal(df, nfactors = 1)
    (2) principal(cor(df), nfactors = 1)
    (3) principal(cov(df), nfactors = 1, covar = FALSE)

             (1)       (2)       (3)
item1     -0.861    -0.861    -0.861
item2      0.222     0.222     0.222
item3     -0.834    -0.834    -0.834
item4      0.765     0.765     0.765
item5      0.863     0.863     0.863
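
If you are curious where these numbers come from: the loadings for a principal component are the corresponding eigenvector of the correlation matrix, scaled by the square root of its eigenvalue. Below is a minimal sketch of this (note that the sign of a component is arbitrary, so the loadings may come out multiplied by -1 relative to principal()):

Code
# cov2cor() turns the covariance matrix into the correlation matrix,
# which is why cov(df) with covar = FALSE matches cor(df)
all.equal(cov2cor(cov(df)), cor(df))

# eigendecomposition of the correlation matrix
e <- eigen(cor(df))

# PC1 loadings: first eigenvector scaled by the sqrt of the first eigenvalue
# (signs are arbitrary, so this may be -1 times the loadings shown above)
e$vectors[, 1] * sqrt(e$values[1])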

PCA on the covariance matrix

If we use the covariance matrix (i.e. set covar = TRUE), we get slightly different results, because the loadings now depend on the scale (the variance) of each item rather than being based on standardised variables.

Code
principal(cov(df), nfactors = 1, covar = TRUE)$loadings

Loadings:
      PC1   
item1 -0.796
item2  0.772
item3 -0.874
item4  0.860
item5  0.898

                 PC1
SS loadings    3.540
Proportion Var 0.708
Comparing each item’s variance with its loadings from the correlation-based and the covariance-based PCA:

         variance of item   loadings (cor PCA)   loadings (cov PCA)
item1               1.268               -0.861               -0.796
item2               2.643                0.222                0.772
item3               0.915               -0.834               -0.874
item4               1.392                0.765                0.860
item5               1.302                0.863                0.898
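
The same eigendecomposition logic applies here, only now on the covariance matrix rather than the correlation matrix. Another minimal sketch, which should reproduce the covariance-based loadings above (again, up to an arbitrary sign flip):

Code
# PC1 loadings from the covariance matrix: eigenvector * sqrt(eigenvalue)
e_cov <- eigen(cov(df))
e_cov$vectors[, 1] * sqrt(e_cov$values[1])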

This means that if the items are measured on very different scales, using the covariance matrix will lead to the components being dominated by the items with the largest variance.

Let’s make another dataset in which item2 is measured on a completely different scale:

Code
dfb <- df %>% mutate(item2 = item2*20)
kable(describe(dfb)[,c(3:4)])
          mean       sd
item1    0.054    1.126
item2   -1.964   32.515
item3    0.098    0.957
item4   -0.179    1.180
item5   -0.071    1.141

With this new data, the correlation-based loadings are unchanged, but the covariance-based PCA is dominated by item2:

         variance of item   loadings (cor PCA)   loadings (cov PCA)
item1               1.268               -0.861                0.288
item2            1057.242                0.222               32.515
item3               0.915               -0.834               -0.593
item4               1.392                0.765                0.091
item5               1.302                0.863                0.064
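
One quick way to quantify this domination is to compare the proportion of total variance captured by the first component under each approach. The sketch below uses eigen() directly; because item2’s variance (~1057) dwarfs the others, the covariance-based PC1 should account for nearly all of the total variance, whereas the correlation matrix treats every item equally.

Code
# proportion of total variance captured by PC1 of the covariance matrix
ev_cov <- eigen(cov(dfb))$values
ev_cov[1] / sum(ev_cov)

# the same quantity for the correlation matrix, where each item contributes equally
ev_cor <- eigen(cor(dfb))$values
ev_cor[1] / sum(ev_cor)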

Use of covar=..

The covar=TRUE/FALSE argument of principal() only makes a difference if you give the function a covariance matrix.

If you give the principal() function the raw data, then it will conduct the PCA on the correlation matrix regardless of whether you set covar = TRUE or covar = FALSE.

Here are the PC1 loadings from each call; only the last one (a covariance matrix with covar = TRUE) differs:

    (1) principal(dfb, nfactors = 1, covar = FALSE)
    (2) principal(dfb, nfactors = 1, covar = TRUE)
    (3) principal(cor(dfb), nfactors = 1)
    (4) principal(cov(dfb), nfactors = 1, covar = FALSE)
    (5) principal(cov(dfb), nfactors = 1, covar = TRUE)

             (1)       (2)       (3)       (4)       (5)
item1     -0.861    -0.861    -0.861    -0.861     0.288
item2      0.222     0.222     0.222     0.222    32.515
item3     -0.834    -0.834    -0.834    -0.834    -0.593
item4      0.765     0.765     0.765     0.765     0.091
item5      0.863     0.863     0.863     0.863     0.064
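
If you want to check this for yourself, one simple approach (sketched below) is to compare the loadings from the different calls directly with all.equal():

Code
# raw data: the covar argument should be ignored, so these should be identical
all.equal(unclass(principal(dfb, nfactors = 1, covar = FALSE)$loadings),
          unclass(principal(dfb, nfactors = 1, covar = TRUE)$loadings))

# and both should match a PCA of the correlation matrix
all.equal(unclass(principal(dfb, nfactors = 1, covar = TRUE)$loadings),
          unclass(principal(cor(dfb), nfactors = 1)$loadings))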