Checking Suitability for Factor Analysis

Overview

There are various ways of assessing the suitability of our items for exploratory factor analysis, and most of them rely on examining the observed correlations between items.

Data

Dataset: phoneaddiction.csv

The dataset at https://uoepsy.github.io/data/phoneaddiction.csv comes from a (fake) study that is interested in developing a measure of “phone-attachment/addiction” - i.e., the idea of being overly attached to a phone.

We have a set of 10 statements (Table 1) that get at different aspects of this idea, and we asked 240 people to rate how much they agreed with each of the 10 statements on a 1-5 scale (1 = strongly disagree, 5 = strongly agree).

phdat <- read_csv("https://uoepsy.github.io/data/phoneaddiction.csv")

Table 1: Item wordings for phoneaddiction.csv

variable	question
item1	I often reach for my phone even when I don't have a specific reason to use it.
item2	I feel a strong urge to check my phone frequently, even during meals or social interactions.
item3	I spend a large portion of my day using my phone, often for more hours than I intend.
item4	My phone use has interfered with my ability to complete tasks or responsibilities at school, work, or home.
item5	When I cannot use my phone (e.g., no battery or no signal), I feel anxious or uncomfortable.
item6	My phone use has negatively affected my sleep schedule, such as staying up late scrolling.
item7	I find it difficult to avoid checking my phone immediately upon waking up or right before bed.
item8	I often check my phone even in situations where it's distracting or socially inappropriate (e.g., meetings or classes).
item9	I sometimes feel that others are overly concerned about my phone use in social situations.
item10	I have tried to reduce my phone usage but find it challenging to do so.

Look at the correlation matrix

Use cor() on the set of items to create the correlation matrix. If it’s a big matrix (they get big quite quickly!) then we can make some sort of visual like a heatmap to get a better sense of the patterns

correlation matrix

Most of the correlations are in the range from 0.2 to 0.4. So they aren’t hugely strong correlations, but they are all related.

cor(phdat)

         item1  item2  item3  item4 item5  item6   item7  item8    item9 item10
item1   1.0000 0.3662 0.3154 0.1704 0.235 0.2848 0.33327 0.3351 -0.05446 0.2633
item2   0.3662 1.0000 0.3187 0.1628 0.219 0.2936 0.31832 0.3859  0.02236 0.2824
item3   0.3154 0.3187 1.0000 0.2329 0.217 0.1389 0.37925 0.3651  0.05906 0.2299
item4   0.1704 0.1628 0.2329 1.0000 0.332 0.2649 0.18160 0.1661  0.02771 0.3870
item5   0.2348 0.2194 0.2165 0.3317 1.000 0.3730 0.19311 0.2086  0.08096 0.4116
item6   0.2848 0.2936 0.1389 0.2649 0.373 1.0000 0.25692 0.1481  0.03049 0.3549
item7   0.3333 0.3183 0.3792 0.1816 0.193 0.2569 1.00000 0.3271  0.00506 0.2136
item8   0.3351 0.3859 0.3651 0.1661 0.209 0.1481 0.32715 1.0000  0.03707 0.1940
item9  -0.0545 0.0224 0.0591 0.0277 0.081 0.0305 0.00506 0.0371  1.00000 0.0743
item10  0.2633 0.2824 0.2299 0.3870 0.412 0.3549 0.21359 0.1940  0.07434 1.0000

correlation heatmap

From the plot below item9 already looks to be a weird one, which is something to keep in mind.

heatmap(cor(phdat))

scatterplot matrix

pairs(phdat)

Rabbit Hole: What if our variables are not continuous?

In psychological research, we spend a lot of time recording responses on a Likert scale (e.g.: “strongly disagree”,“disgree”,“neither”,“agree”,“strongly agree”).

If our measurements have 5+ Likert-like options, it is quite common convention to simply treat them as continuous variables.

But if we have fewer response options (or if people only ever used fewer options), then we don’t want to just use a standard Pearson’s correlation in order to capture the relationships.

Instead we would want to use polychoric correlations, which we can get out the polychoric() function

library(psych)
# get a matrix of polychoric correlations:  
polychoric(data)$rho

And we can easily then conduct the factor analysis by adding cor = "poly" into the use of the fa() function.

fa(data, nfactors = ??, fm = ??, 
         rotate = ??, cor = "poly")

“Polychoric correlations” are estimates of the correlation between two theorized normally distributed continuous variables, based on their observed ordinal manifestations.

Figure 1: the idea of polychoric correlation is to estimate the correlation between two unobserved continuous variables by examining the distributions across the set of ordered categories that we do observe

Bartlett’s Test

In the psych package, the function cortest.bartlett() conducts “Bartlett’s test”. This tests against the null that the correlation matrix is proportional to the identity matrix (a matrix of all 0s except for 1s on the diagonal).

Null hypothesis: observed correlation matrix is equivalent to the identity matrix
Alternative hypothesis: observed correlation matrix is not equivalent to the identity matrix.

The “Identity matrix” is a matrix of all 0s except for 1s on the diagonal.
e.g. for a 3x3 matrix:
\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{bmatrix} \] If a correlation matrix looks like this, then it means there is no shared variance between the items, so it is not suitable for factor analysis.

For the phone-addiction example, the Bartlett test is significant, meaning that we reject the null hypothesis that our correlation matrix is equivalent to the identity matrix. This is a good thing - we want it to be different in order to justify doing factor analysis.

cortest.bartlett(phdat)

$chisq
[1] 408

$p.value
[1] 4.53e-60

$df
[1] 45

Kaiser, Meyer, Olkin Measure of Sampling Adequacy

You can check the “factorability” of the correlation matrix using KMO(data) (also from psych!).

Rules of thumb:
- \(0.8 < MSA < 1\): the sampling is adequate
- \(MSA <0.6\): sampling is not adequate
- \(MSA \sim 0\): large partial correlations compared to the sum of correlations. Not good for FA

These are Kaiser’s corresponding adjectives suggested for each level of the KMO:

0.00 to 0.49 “unacceptable”
0.50 to 0.59 “miserable”
0.60 to 0.69 “mediocre”
0.70 to 0.79 “middling”
0.80 to 0.89 “meritorious”
0.90 to 1.00 “marvelous”

phoneaddiction.csv example:
Overall KMO is good here at 0.83, and individual item KMOs are also looking pretty ‘meritorious’ too! The only thorny one is item9, again!

KMO(phdat)

Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = phdat)
Overall MSA =  0.83
MSA for each item = 
 item1  item2  item3  item4  item5  item6  item7  item8  item9 item10 
  0.86   0.85   0.83   0.82   0.83   0.81   0.85   0.83   0.53   0.82

Check for linearity

It also makes sense to check for linearity of relationships prior to conducting EFA. EFA is all based on correlations, which assume the relations we are capturing are linear.

You can check linearity of relations using pairs.panels(data) (also from psych), and you can view the histograms on the diagonals, allowing you to check univariate normality (which is usually a good enough proxy for multivariate normality).

In our phone-addiction example, because these items are measured on likert scales of 1-5, looking at scatterplots can be a bit harder. The red lines on the plots are essentially “smooth” lines, which follow the mean of the variable on the y-axis as it moves across the x-axis. So we want them to all be straight-ish, which they are. We can see from the diagonals of this plot that the items are all fairly normally distributed too.

pairs.panels(phdat)