Code

```r
library(tidyverse)
set.seed(777)
nitem <- 5
A <- matrix(runif(nitem^2)*2-1, ncol=nitem)
scor <- t(A) %*% A
df <- MASS::mvrnorm(n=200, mu=rep(0,5), Sigma=scor) %>% as_tibble()
names(df) <- paste0("item",1:5)
```
We’re including this code in case you want to create some data and play around with it yourself, but do not worry about understanding it! In brief, it 1) creates a covariance matrix, 2) generates data based on that covariance matrix, and 3) renames the columns to “item1”, “item2”, etc.
The data we created has 5 items, all on similar scales:
```r
library(psych)
library(knitr)
kable(describe(df)[,c(3:4)])
```
| | mean | sd |
|---|---|---|
| item1 | 0.054 | 1.126 |
| item2 | -0.098 | 1.626 |
| item3 | 0.098 | 0.957 |
| item4 | -0.179 | 1.180 |
| item5 | -0.071 | 1.141 |
We can start conducting a PCA from several different points. We can either start with the data itself, or with a matrix representing the relationships between the variables (i.e. either a covariance or a correlation matrix).
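As a quick sanity check on how these starting points relate, here is a small base-R sketch (using toy data, not the `df` above) showing that a correlation matrix is just a covariance matrix rescaled so every variable has unit variance:

```r
# cov2cor() rescales a covariance matrix so each variable has variance 1,
# which yields exactly the corresponding correlation matrix.
set.seed(1)
x <- as.data.frame(matrix(rnorm(200 * 3), ncol = 3))  # toy data, 3 variables
all.equal(cov2cor(cov(x)), cor(x))  # TRUE
```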
When using the `principal()` function from the psych package, if we give the function the dataset itself, it will internally compute a correlation matrix on which to conduct the PCA. The same happens if we give the function a covariance matrix and set `covar = FALSE`.
Let’s suppose we are reducing down to just 1 component.
These will all be the same:
```r
principal(df, nfactors = 1)
principal(cor(df), nfactors = 1)
principal(cov(df), nfactors = 1, covar = FALSE)
```
Here are the loadings:
| | principal(df, nfactors = 1) | principal(cor(df), nfactors = 1) | principal(cov(df), nfactors = 1, covar = FALSE) |
|---|---|---|---|
| item1 | -0.861 | -0.861 | -0.861 |
| item2 | 0.222 | 0.222 | 0.222 |
| item3 | -0.834 | -0.834 | -0.834 |
| item4 | 0.765 | 0.765 | 0.765 |
| item5 | 0.863 | 0.863 | 0.863 |
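Under the hood, one-component loadings like these can be reproduced with base R alone: they are the first eigenvector of the correlation matrix scaled by the square root of its eigenvalue. A sketch on toy data (not the `df` above; note that the sign of an eigenvector is arbitrary, so loadings may come out flipped relative to `principal()`):

```r
# One-component PCA loadings from a correlation matrix via eigen():
# first eigenvector times the square root of the first eigenvalue.
set.seed(1)
x <- matrix(rnorm(200 * 3), ncol = 3)   # toy data, 3 variables
e <- eigen(cor(x))
loadings1 <- e$vectors[, 1] * sqrt(e$values[1])
# the squared loadings sum to the eigenvalue ("SS loadings" in psych output)
sum(loadings1^2)
```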
If instead we conduct the PCA on the covariance matrix itself (with `covar = TRUE`), we get slightly different results, because the loadings now also reflect the scale (the variance) of each item.
```r
principal(cov(df), nfactors = 1, covar = TRUE)$loadings
```

```
Loadings:
         PC1
item1 -0.796
item2  0.772
item3 -0.874
item4  0.860
item5  0.898

                 PC1
SS loadings    3.540
Proportion Var 0.708
```
| | variance of item | loadings cor PCA | loadings cov PCA |
|---|---|---|---|
| item1 | 1.268 | -0.861 | -0.796 |
| item2 | 2.643 | 0.222 | 0.772 |
| item3 | 0.915 | -0.834 | -0.874 |
| item4 | 1.392 | 0.765 | 0.860 |
| item5 | 1.302 | 0.863 | 0.898 |
This means that if the items are measured on very different scales, using the covariance matrix will lead to the components being dominated by the items with the largest variance.
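A minimal base-R illustration of this scale-dependence, using toy data with one variable inflated by a factor of 20: rescaling leaves the correlations (and hence a correlation-based PCA) untouched, while the covariance-based first component swings towards the inflated variable.

```r
set.seed(1)
x  <- matrix(rnorm(200 * 3), ncol = 3)   # toy data, 3 variables
x2 <- x
x2[, 2] <- x2[, 2] * 20                  # put variable 2 on a much bigger scale
all.equal(cor(x), cor(x2))               # TRUE: correlations are scale-free
e <- eigen(cov(x2))
which.max(abs(e$vectors[, 1]))           # 2: first PC is dominated by variable 2
```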
Let’s make another dataset in which item2 is measured on a completely different scale:

```r
dfb <- df %>% mutate(item2 = item2*20)
kable(describe(dfb)[,c(3:4)])
```
| | mean | sd |
|---|---|---|
| item1 | 0.054 | 1.126 |
| item2 | -1.964 | 32.515 |
| item3 | 0.098 | 0.957 |
| item4 | -0.179 | 1.180 |
| item5 | -0.071 | 1.141 |
| | variance of item | loadings cor PCA | loadings cov PCA |
|---|---|---|---|
| item1 | 1.268 | -0.861 | 0.288 |
| item2 | 1057.242 | 0.222 | 32.515 |
| item3 | 0.915 | -0.834 | -0.593 |
| item4 | 1.392 | 0.765 | 0.091 |
| item5 | 1.302 | 0.863 | 0.064 |
covar = TRUE/FALSE

The `covar = TRUE/FALSE` argument of `principal()` only makes a difference if you give the function a covariance matrix. If you give `principal()` the raw data, it will automatically conduct the PCA on the correlation matrix regardless of whether you put `covar = TRUE` or `covar = FALSE`.
| | principal(dfb, nfactors = 1, covar = FALSE) | principal(dfb, nfactors = 1, covar = TRUE) | principal(cor(dfb), nfactors = 1) | principal(cov(dfb), nfactors = 1, covar = FALSE) | principal(cov(dfb), nfactors = 1, covar = TRUE) |
|---|---|---|---|---|---|
| item1 | -0.861 | -0.861 | -0.861 | -0.861 | 0.288 |
| item2 | 0.222 | 0.222 | 0.222 | 0.222 | 32.515 |
| item3 | -0.834 | -0.834 | -0.834 | -0.834 | -0.593 |
| item4 | 0.765 | 0.765 | 0.765 | 0.765 | 0.091 |
| item5 | 0.863 | 0.863 | 0.863 | 0.863 | 0.064 |
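One practical takeaway (a sketch, not something shown above): if you want a covariance-based PCA to behave like the correlation-based one, z-score the variables first with `scale()`, since the covariance matrix of standardized data equals the correlation matrix of the original data.

```r
set.seed(1)
x <- matrix(rnorm(200 * 3), ncol = 3)    # toy data, 3 variables
x[, 2] <- x[, 2] * 20                    # one variable on a wildly different scale
z <- scale(x)                            # centre and scale every column (mean 0, sd 1)
all.equal(cov(z), cor(x), check.attributes = FALSE)  # TRUE
```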