W8 Exercises: PCA

Relevant packages

  • psych

Exercises: Police Performance

Data: police_performance.csv

The dataset available at https://uoepsy.github.io/data/police_performance.csv contains records on fifty police officers who were rated in six different categories as part of an HR procedure. The rated skills were:

  • communication skills: commun
  • problem solving: probl_solv
  • logical ability: logical
  • learning ability: learn
  • physical ability: physical
  • appearance: appearance

The data also contains information on each police officer’s arrest rate (proportion of arrests that lead to criminal charges).

We are interested in whether the skills ratings given by HR are a good set of predictors of police officer success (as indicated by their arrest rate).

Question 1

Load the job performance data into R and call it job. Check whether or not the data were read correctly into R - do the dimensions correspond to the description of the data above?

Let’s load the data:

library(tidyverse)

job <- read_csv('https://uoepsy.github.io/data/police_performance.csv')
dim(job)
[1] 50  7

There are 50 observations on 7 variables: the 6 skills ratings plus the arrest rate.

The top 6 rows in the data are:

head(job)
# A tibble: 6 × 7
  commun probl_solv logical learn physical appearance arrest_rate
   <dbl>      <dbl>   <dbl> <dbl>    <dbl>      <dbl>       <dbl>
1     12         52      20    44       48         16      0.613 
2     12         57      25    45       50         16      0.419 
3     12         54      21    45       50         16      0     
4     13         52      21    46       51         17      0.645 
5     14         54      24    46       51         17      0.0645
6     14         48      20    47       51         18      0.645 

Question 2

Provide descriptive statistics for each variable in the dataset.

We now inspect some descriptive statistics for each variable in the dataset:

# Quick summary
summary(job)
     commun       probl_solv      logical       learn         physical   
 Min.   :12.0   Min.   :48.0   Min.   :20   Min.   :44.0   Min.   :48.0  
 1st Qu.:16.0   1st Qu.:52.2   1st Qu.:22   1st Qu.:48.0   1st Qu.:52.2  
 Median :18.0   Median :54.0   Median :24   Median :50.0   Median :54.0  
 Mean   :17.7   Mean   :54.2   Mean   :24   Mean   :50.3   Mean   :54.2  
 3rd Qu.:19.8   3rd Qu.:56.0   3rd Qu.:26   3rd Qu.:52.0   3rd Qu.:56.0  
 Max.   :24.0   Max.   :59.0   Max.   :31   Max.   :56.0   Max.   :59.0  
   appearance    arrest_rate   
 Min.   :16.0   Min.   :0.000  
 1st Qu.:19.0   1st Qu.:0.371  
 Median :21.0   Median :0.565  
 Mean   :21.1   Mean   :0.512  
 3rd Qu.:23.0   3rd Qu.:0.669  
 Max.   :28.0   Max.   :0.935  

OPTIONAL

If you wish to create a nice looking table for a report, you could try the following code. However, I should warn you: this code is quite difficult to understand - have a go at running it a section at a time, slowly adding each function in the pipe to see how the output changes.

library(gt)

# Mean and SD of each variable
job |>
  summarise(across(everything(), list(M = mean, SD = sd))) |>
  pivot_longer(everything()) |>
  mutate(
    value = round(value, 2),
    name = str_replace(name, '_M', '.M'),
    name = str_replace(name, '_SD', '.SD')
  ) |>
  separate(name, into = c('variable', 'summary'), sep = '\\.') |>
  pivot_wider(names_from = summary, values_from = value) |>
  gt()
variable M SD
commun 17.68 2.74
probl_solv 54.16 2.41
logical 24.02 2.49
learn 50.28 2.84
physical 54.16 2.41
appearance 21.06 2.99
arrest_rate 0.51 0.23

Question 3

Working with only the skills ratings (not the arrest rate - we’ll come back to that right at the end), investigate whether or not the variables are highly correlated and explain whether or not PCA might be useful in this case.

We only have 6 variables here, but if we had many, how might you visualise cor(job)? Try the below:

library(pheatmap)
pheatmap(cor(job))

Let’s start by looking at the correlation matrix of the data:

library(pheatmap)

job_skills <- job |> select(-arrest_rate)

R <- cor(job_skills)

pheatmap(R, breaks = seq(-1, 1, length.out = 100))
Figure 1: Correlations between the variables in the “job” dataset

The correlations between the variables seem to be quite large (direction doesn’t matter here, only magnitude; if negative correlations were present, we would consider their absolute values).

There appears to be a group of highly correlated variables comprising physical ability, appearance, communication skills, and learning ability which are correlated among themselves but uncorrelated with another group of variables. The second group comprises problem solving and logical ability.

This suggests that PCA might be useful in this problem to reduce the dimensionality without a significant loss of information.

Question 4

Look at the variance of the skills ratings in the data set. Do you think that PCA should be carried out on the covariance matrix or the correlation matrix? Or does it not matter?

Let’s have a look at the standard deviation of each variable:

job_skills |> 
  summarise(across(everything(), sd))
# A tibble: 1 × 6
  commun probl_solv logical learn physical appearance
   <dbl>      <dbl>   <dbl> <dbl>    <dbl>      <dbl>
1   2.74       2.41    2.49  2.84     2.41       2.99

As the standard deviations (and hence the variances) are fairly similar, performing PCA on the covariance matrix should give results very similar to those from the correlation matrix.
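To see why the choice matters little here: PCA on the correlation matrix is equivalent to PCA on the covariance matrix of the standardised variables. A quick check (a sketch, assuming the `job_skills` data frame created earlier):

```r
# The correlation matrix equals the covariance matrix of the z-scored data,
# so when the variances are similar the two PCA solutions will be close
all.equal(cor(job_skills), cov(scale(job_skills)), check.attributes = FALSE)
```

This returns TRUE (up to floating-point tolerance), since standardising each variable divides each covariance by the product of the two standard deviations, which is exactly the definition of the correlation.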

Question 5

Using the principal() function from the psych package, conduct a PCA of the job skills.

Reading 8: Performing PCA shows an example of how to use the principal() function.
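As a sketch (assuming the `job_skills` data frame created in Question 3), the full unrotated solution used in the questions below can be obtained by asking for as many components as there are variables:

```r
library(psych)

# Unrotated PCA of the six skills ratings, keeping all components for now
# so that we can inspect the full variance breakdown
job_pca <- principal(job_skills, nfactors = ncol(job_skills), rotate = 'none')
job_pca$loadings
```

Keeping all six components at this stage lets us apply the various retention criteria before deciding how many to extract.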

Question 6

Looking at the PCA output, how many principal components would you keep if you were following the cumulative proportion of explained variance criterion?

See Reading 8: How many components to keep? for an explanation of various criteria for deciding how many components we should keep.

Let’s look again at the PCA summary:

job_pca$loadings

Loadings:
           PC1    PC2    PC3    PC4    PC5    PC6   
commun      0.984 -0.120                0.101       
probl_solv  0.223  0.810  0.543                     
logical     0.329  0.747 -0.578                     
learn       0.987 -0.110                       0.105
physical    0.988                      -0.110       
appearance  0.979 -0.125         0.161              

                 PC1   PC2   PC3   PC4   PC5   PC6
SS loadings    4.035 1.261 0.631 0.035 0.022 0.016
Proportion Var 0.673 0.210 0.105 0.006 0.004 0.003
Cumulative Var 0.673 0.883 0.988 0.994 0.997 1.000

The following part of the output tells us that the first two components explain 88.3% of the total variance.

Cumulative Var 0.673 0.883 0.988 0.994 0.997 1.000

According to this criterion, we should keep 2 principal components.
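The same decision can be read off programmatically from the eigenvalues stored in the fitted object (a sketch; in a `principal()` fit, `$values` holds the eigenvalues of the correlation matrix):

```r
# cumulative proportion of variance explained by successive components
cumprop <- cumsum(job_pca$values) / sum(job_pca$values)
round(cumprop, 3)

# smallest number of components explaining at least 80% of the variance
which(cumprop >= 0.80)[1]
# → 2
```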

Question 7

Looking again at the PCA output, how many principal components would you keep if you were following Kaiser’s criterion?

job_pca$loadings

Loadings:
           PC1    PC2    PC3    PC4    PC5    PC6   
commun      0.984 -0.120                0.101       
probl_solv  0.223  0.810  0.543                     
logical     0.329  0.747 -0.578                     
learn       0.987 -0.110                       0.105
physical    0.988                      -0.110       
appearance  0.979 -0.125         0.161              

                 PC1   PC2   PC3   PC4   PC5   PC6
SS loadings    4.035 1.261 0.631 0.035 0.022 0.016
Proportion Var 0.673 0.210 0.105 0.006 0.004 0.003
Cumulative Var 0.673 0.883 0.988 0.994 0.997 1.000

The eigenvalues are shown in the row

SS loadings    4.035 1.261 0.631 0.035 0.022 0.016

From the result we see that only the first two principal components have eigenvalues greater than 1, so this rule suggests keeping only 2 PCs.
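Rather than reading them off the SS loadings row, the eigenvalues can be pulled directly from the fitted object and Kaiser's rule applied in one line (a sketch, assuming the `job_pca` object from Question 5):

```r
# Kaiser's criterion: retain components with eigenvalue > 1
job_pca$values
sum(job_pca$values > 1)
# → 2
```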

Question 8

According to a scree plot, how many principal components would you retain?

scree(cor(job_skills))

This criterion is subjective: it could suggest keeping 1 component, or perhaps 3.

Question 9

How many components should we keep according to the MAP method?

VSS(job_skills, plot=FALSE, method="pc", n = ncol(job_skills))

Very Simple Structure
Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
    n.obs = n.obs, plot = plot, title = title, use = use, cor = cor, 
    method = "pc")
VSS complexity 1 achieves a maximimum of 0.95  with  3  factors
VSS complexity 2 achieves a maximimum of 0.98  with  3  factors

The Velicer MAP achieves a minimum of 0.12  with  2  factors 
BIC achieves a minimum of  -24.1  with  1  factors
Sample Size adjusted BIC achieves a minimum of  -2.87  with  2  factors

Statistics by number of factors 
  vss1 vss2  map dof   chisq prob sqresid  fit RMSEA BIC SABIC complex  eChisq
1 0.89 0.00 0.17   9 1.1e+01 0.27   2.046 0.89 0.065 -24   4.1     1.0 1.1e+01
2 0.92 0.96 0.12   4 2.3e-01 0.99   0.802 0.96 0.000 -15  -2.9     1.1 1.5e-03
3 0.95 0.98 0.29   0 1.5e-01   NA   0.044 1.00    NA  NA    NA     1.1 1.8e-04
4 0.92 0.96 0.50  -3 9.2e-07   NA   0.777 0.96    NA  NA    NA     1.1 1.6e-09
5 0.92 0.96 1.00  -5 5.1e-11   NA   0.788 0.96    NA  NA    NA     1.1 1.5e-13
6 0.92 0.96   NA  -6 5.0e-11   NA   0.788 0.96    NA  NA    NA     1.1 1.5e-13
     SRMR  eCRMS eBIC
1 8.5e-02 0.1096  -24
2 1.0e-03 0.0019  -16
3 3.5e-04     NA   NA
4 1.0e-06     NA   NA
5 9.8e-09     NA   NA
6 9.8e-09     NA   NA

According to the MAP criterion we should keep 2 principal components.

Question 10

How many components should we keep according to parallel analysis?

fa.parallel(job_skills, fa="pc", n.iter = 500)

Parallel analysis suggests that the number of factors =  NA  and the number of components =  1 

Parallel analysis suggests to keep 1 principal component only as there is only one PC with an eigenvalue higher than the simulated random ones in red.

Question 11

Based on all of the criteria above, make a decision on how many components you will keep.

method                      recommendation
explaining > 80% variance   keep 2 components
Kaiser’s rule               keep 2 components
scree plot                  keep 1 or 3 components? (subjective)
MAP                         keep 2 components
parallel analysis           keep 1 component

Because three out of the five selection criteria above suggest keeping 2 principal components, here we will keep 2. This solution explains a reasonable proportion of the variance (88%), but it would be perfectly defensible to instead keep 3, explaining 98%.

Question 12

Perform PCA to extract the desired number of components.

job_pca2 <- principal(job_skills, nfactors = 2, rotate = 'none')

Question 13

Examine the loadings of the 2 Principal Components. Is there a link you can see?

See Reading 8 #Examining Loadings for an explanation of the loadings.

job_pca2$loadings

Loadings:
           PC1    PC2   
commun      0.984 -0.120
probl_solv  0.223  0.810
logical     0.329  0.747
learn       0.987 -0.110
physical    0.988       
appearance  0.979 -0.125

                 PC1   PC2
SS loadings    4.035 1.261
Proportion Var 0.673 0.210
Cumulative Var 0.673 0.883

All loadings on the first PC are of similar magnitude, apart from probl_solv and logical, which are closer to zero. The first component looks like a sort of average of the officers’ performance scores, excluding problem solving and logical ability.

The second principal component, which explains only 21% of the total variance, has two loadings clearly distant from zero: those associated with problem solving and logical ability. It distinguishes police officers with strong logical and problem-solving skills but low scores on the other skills (note the negative loadings).

For interpretation purposes, it can help to hide very small loadings. This can be done by specifying the cutoff value in the print() function. However, this only works when you pass the loadings for all the PCs:

print(job_pca2$loadings, cutoff = 0.3)

Loadings:
           PC1    PC2   
commun      0.984       
probl_solv         0.810
logical     0.329  0.747
learn       0.987       
physical    0.988       
appearance  0.979       

                 PC1   PC2
SS loadings    4.035 1.261
Proportion Var 0.673 0.210
Cumulative Var 0.673 0.883

Question 14

Join the principal component scores for your retained components to the original dataset which has the arrest rates in.

Then fit a linear model to look at how the arrest rate of police officers is predicted by the two components representing different composites of the skills ratings by HR.

Check for multicollinearity between your predictors. How does this compare to a model which uses all 6 of the original variables instead?

We can get the scores using mypca$scores. We can add them to an existing dataset by simply adding them as new columns:

data |>
  mutate(
    score1 = mypca$scores[,1]
  )

To examine multicollinearity - try vif() from the car package.

# add the PCA scores to the dataset
job <- 
  job |> mutate(
    pc1 = job_pca2$scores[,1],
    pc2 = job_pca2$scores[,2]
  )
# use the scores in an analysis
mod <- lm(arrest_rate ~ pc1 + pc2, data = job)

# multicollinearity isn't a problem, because the components are orthogonal!! 
library(car)
vif(mod)
pc1 pc2 
  1   1 
lm(arrest_rate ~ commun+probl_solv+logical+learn+physical+appearance, 
   data = job) |>
  vif()
    commun probl_solv    logical      learn   physical appearance 
     34.67       1.17       1.23      43.56      34.98      21.78