PCA and unequal variances

Simulating some data

We’re including this code in case you want to create some data and play around with it yourself, but don’t worry about understanding it! In brief, it 1) creates a covariance matrix, 2) generates data based on that covariance matrix, and 3) renames the columns to “item1”, “item2”, etc.

Code
library(tidyverse)
set.seed(993)
nitem <- 5
# 1) create a (random) covariance matrix
A <- matrix(runif(nitem^2) * 2 - 1, ncol = nitem)
scor <- t(A) %*% A
# 2) generate 200 observations from a multivariate normal with that covariance
df <- MASS::mvrnorm(n = 200, mu = rep(0, nitem), Sigma = scor) %>% as_tibble()
# 3) rename the columns
names(df) <- paste0("item", 1:nitem)

The data we created has 5 items, all on similar scales:

Code
library(psych)
library(knitr)
kable(describe(df)[,c(3:4)])
|       |   mean |   sd |
|:------|-------:|-----:|
| item1 | -0.096 | 1.48 |
| item2 |  0.027 | 1.98 |
| item3 | -0.080 | 1.22 |
| item4 | -0.047 | 1.66 |
| item5 |  0.140 | 1.65 |

Doing PCA

We can start conducting a PCA from several different points. We can either start with the data itself, or with a matrix representing the relationships between the variables (i.e. a covariance or a correlation matrix).

When using the principal() function from the psych package, if we give the function the dataset itself, it will internally compute a correlation matrix on which to conduct the PCA. The same will happen if we give the function the covariance matrix and set covar = FALSE (the covariance matrix is converted to a correlation matrix first).

Let’s suppose we are reducing down to just 1 component.
These three calls will all give the same result:

Code
principal(df, nfactors = 1)
principal(cor(df), nfactors = 1)
principal(cov(df), nfactors = 1, covar = FALSE)

Here are the loadings from each:

|       | principal(df, nfactors = 1) | principal(cor(df), nfactors = 1) | principal(cov(df), nfactors = 1, covar = FALSE) |
|:------|------:|------:|------:|
| item1 |  0.662 |  0.662 |  0.662 |
| item2 |  0.658 |  0.658 |  0.658 |
| item3 |  0.576 |  0.576 |  0.576 |
| item4 |  0.957 |  0.957 |  0.957 |
| item5 | -0.630 | -0.630 | -0.630 |
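If you want to see what principal() is doing under the hood, a minimal sketch in base R (assuming the df simulated above): for a 1-component solution, the loadings are the first eigenvector of the correlation matrix rescaled by the square root of the corresponding eigenvalue.

```r
# PCA loadings from first principles: eigendecompose the correlation
# matrix, then rescale the first eigenvector by sqrt(eigenvalue).
R <- cor(df)
e <- eigen(R)
loadings_pc1 <- e$vectors[, 1] * sqrt(e$values[1])
# Up to an arbitrary sign flip, these match principal(df, nfactors = 1)
round(loadings_pc1, 3)
```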

PCA on the covariance matrix

If we instead conduct the PCA on the covariance matrix (covar = TRUE), we get different results, because the loadings are now on the items’ raw scales: items with larger variances get proportionally larger loadings.

Code
principal(cov(df), nfactors = 1, covar = TRUE)$loadings

Loadings:
      PC1   
item1  0.995
item2  1.689
item3  0.403
item4  1.644
item5 -0.613

                PC1
SS loadings    7.08
Proportion Var 1.42
|       | variance of item | loadings (cor PCA) | loadings (cov PCA) |
|:------|-----:|-------:|-------:|
| item1 | 2.20 |  0.662 |  0.995 |
| item2 | 3.92 |  0.658 |  1.689 |
| item3 | 1.48 |  0.576 |  0.403 |
| item4 | 2.75 |  0.957 |  1.644 |
| item5 | 2.72 | -0.630 | -0.613 |

This means that if the items are measured on very different scales, using the covariance matrix will lead to the components being dominated by the items with the largest variance.
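The same first-principles sketch as before makes this visible (again assuming the df simulated above): eigendecomposing the covariance matrix instead of the correlation matrix gives loadings on the items’ raw scales.

```r
# Covariance-based PCA by hand: eigendecompose cov(df) instead of cor(df).
S <- cov(df)
e <- eigen(S)
loadings_pc1 <- e$vectors[, 1] * sqrt(e$values[1])
# These loadings are on the items' raw scales, so they can exceed 1, and
# items with larger variances get larger loadings.
round(loadings_pc1, 3)
```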

Let’s make another dataset in which item2 is measured on a completely different scale:

Code
dfb <- df %>% mutate(item2 = item2*20)
kable(describe(dfb)[,c(3:4)])
|       |   mean |    sd |
|:------|-------:|------:|
| item1 | -0.096 |  1.48 |
| item2 |  0.545 | 39.58 |
| item3 | -0.080 |  1.22 |
| item4 | -0.047 |  1.66 |
| item5 |  0.140 |  1.65 |
The correlation-based loadings are unchanged, but the covariance-based PCA is now dominated by item2:

|       | variance of item | loadings (cor PCA) | loadings (cov PCA) |
|:------|--------:|-------:|-------:|
| item1 |    2.20 |  0.662 |  0.546 |
| item2 | 1566.53 |  0.658 | 39.579 |
| item3 |    1.48 |  0.576 |  0.020 |
| item4 |    2.75 |  0.957 |  1.337 |
| item5 |    2.72 | -0.630 |  0.069 |
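One way to see that the two analyses differ only by the scaling of the items: if you standardize each item first (z-score it), the covariance matrix of the standardized data is exactly the correlation matrix of the original data, so a covariance-based PCA on the standardized data reproduces the correlation-based solution. A quick sketch, assuming the dfb created above:

```r
# Standardizing each item (z-scoring) makes cov() equal cor(), so the
# covariance-based and correlation-based PCA coincide on scaled data.
dfb_z <- as.data.frame(scale(dfb))
all.equal(cov(dfb_z), cor(dfb), check.attributes = FALSE)  # TRUE
```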

Use of covar=..

The covar = TRUE/FALSE argument of principal() only makes a difference if you give the function a covariance matrix.

If you give the principal() function the raw data, it will conduct the PCA on the correlation matrix regardless of whether you set covar = TRUE or covar = FALSE.

|       | principal(dfb, nfactors = 1, covar = FALSE) | principal(dfb, nfactors = 1, covar = TRUE) | principal(cor(dfb), nfactors = 1) | principal(cov(dfb), nfactors = 1, covar = FALSE) | principal(cov(dfb), nfactors = 1, covar = TRUE) |
|:------|------:|------:|------:|------:|-------:|
| item1 |  0.662 |  0.662 |  0.662 |  0.662 |  0.546 |
| item2 |  0.658 |  0.658 |  0.658 |  0.658 | 39.579 |
| item3 |  0.576 |  0.576 |  0.576 |  0.576 |  0.020 |
| item4 |  0.957 |  0.957 |  0.957 |  0.957 |  1.337 |
| item5 | -0.630 | -0.630 | -0.630 | -0.630 |  0.069 |
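You can check this equivalence programmatically too. A sketch, assuming the dfb created above: the first and last calls from the table should agree (raw data ignores covar) and disagree (covariance matrix respects covar = TRUE), respectively.

```r
library(psych)
# covar is ignored when the input is raw data...
l_raw <- principal(dfb, nfactors = 1, covar = TRUE)$loadings
l_cor <- principal(cor(dfb), nfactors = 1)$loadings
# ...but respected when the input is a covariance matrix.
l_cov <- principal(cov(dfb), nfactors = 1, covar = TRUE)$loadings
isTRUE(all.equal(unclass(l_raw), unclass(l_cor), check.attributes = FALSE))  # TRUE
isTRUE(all.equal(unclass(l_raw), unclass(l_cov), check.attributes = FALSE))  # FALSE
```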