W8 Exercises: PCA

Relevant packages

  • psych

Exercises: Police Performance

Data: police_performance.csv

The dataset available at https://uoepsy.github.io/data/police_performance.csv contains records on fifty police officers who were rated in six different categories as part of an HR procedure. The rated skills were:

  • communication skills: commun
  • problem solving: probl_solv
  • logical ability: logical
  • learning ability: learn
  • physical ability: physical
  • appearance: appearance

The data also contains information on each police officer’s arrest rate (proportion of arrests that lead to criminal charges).

We are interested in whether the skills ratings by HR are a good set of predictors of police officer success (as indicated by their arrest rate).

Question 1

Load the police performance data into R and call it job. Check whether the data were read correctly into R: do the dimensions correspond to the description of the data above?
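A minimal sketch of reading and checking the data (using the URL given above; the expected dimensions assume the arrest rate is a single column, so six skills plus arrest rate gives 7 columns for 50 officers):

```r
# read the data directly from the URL given above
job <- read.csv("https://uoepsy.github.io/data/police_performance.csv")

# 50 officers rated on 6 skills, plus arrest rate -> expect 50 rows, 7 columns
dim(job)
head(job)
```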

Question 2

Provide descriptive statistics for each variable in the dataset.
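One way to do this is with describe() from the psych package, which gives means, standard deviations, ranges, and more for every variable at once:

```r
library(psych)

# descriptive statistics for every variable in the dataset
describe(job)
```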

Question 3

Working with only the skills ratings (not the arrest rate - we’ll come back to that right at the end), investigate whether or not the variables are highly correlated, and explain whether or not PCA might be useful in this case.
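A sketch of inspecting the correlations among the six skills ratings (the column names are those listed above; job is the data frame from Question 1):

```r
# keep only the six skills ratings
skills <- job[, c("commun", "probl_solv", "logical", "learn", "physical", "appearance")]

# correlation matrix, rounded for easier reading
round(cor(skills), 2)
```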

We only have 6 variables here, but if we had many, how might you visualise cor(job)? Try the below:

library(pheatmap)
pheatmap(cor(job))

Question 4

Look at the variance of the skills ratings in the data set. Do you think that PCA should be carried out on the covariance matrix or the correlation matrix? Or does it not matter?
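One way to compare the variances (if they differ substantially, the correlation matrix - i.e. standardised variables - is usually preferable so that no variable dominates just because of its scale):

```r
# variance of each of the six skills ratings
sapply(job[, c("commun", "probl_solv", "logical", "learn", "physical", "appearance")], var)
```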

Question 5

Using the principal() function from the psych package, conduct a PCA of the six skills ratings.

Reading 8: Performing PCA shows an example of how to use the principal() function.
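A sketch of one way to call principal() - extracting all six components unrotated so that the full output (eigenvalues, proportions of variance) can be inspected; the object name job_pca is our choice, not fixed by the exercise:

```r
library(psych)

skills <- job[, c("commun", "probl_solv", "logical", "learn", "physical", "appearance")]

# unrotated PCA, keeping all 6 components so we can inspect the full output
job_pca <- principal(skills, nfactors = ncol(skills), rotate = "none")
job_pca
```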

Question 6

Looking at the PCA output, how many principal components would you keep if you were following the cumulative proportion of explained variance criterion?

See Reading 8: How many components to keep? for an explanation of various criteria for deciding how many components we should keep.
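The cumulative proportions are printed in the PCA output, but they can also be computed directly from the eigenvalues (assuming the PCA object from Question 5 is saved as job_pca; the name is an assumption):

```r
# cumulative proportion of variance explained by successive components
round(cumsum(job_pca$values) / sum(job_pca$values), 3)
```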

Question 7

Looking again at the PCA output, how many principal components would you keep if you were following Kaiser’s criterion?
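Kaiser's criterion keeps components with eigenvalue greater than 1 (when PCA is on the correlation matrix). A quick check, again assuming the PCA object is called job_pca:

```r
# eigenvalues, and how many exceed 1
job_pca$values
sum(job_pca$values > 1)
```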

Question 8

According to a scree plot, how many principal components would you retain?
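The psych package provides scree() for this; setting factors = FALSE plots only the principal component eigenvalues:

```r
library(psych)

skills <- job[, c("commun", "probl_solv", "logical", "learn", "physical", "appearance")]

# scree plot of the component eigenvalues
scree(skills, factors = FALSE)
```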

Question 9

How many components should we keep according to the MAP method?
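Velicer's MAP can be obtained with vss() from the psych package, which prints the number of components at which the MAP criterion reaches its minimum (a sketch; check the printed output rather than the plot):

```r
library(psych)

skills <- job[, c("commun", "probl_solv", "logical", "learn", "physical", "appearance")]

# Velicer's MAP (reported in the printed vss() output)
vss(skills, n = ncol(skills), rotate = "none", plot = FALSE)
```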

Question 10

How many components should we keep according to parallel analysis?
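Parallel analysis is available via fa.parallel() in the psych package; fa = "pc" restricts it to principal components:

```r
library(psych)

skills <- job[, c("commun", "probl_solv", "logical", "learn", "physical", "appearance")]

# parallel analysis for principal components
fa.parallel(skills, fa = "pc")
```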

Question 11

Based on all of the criteria above, make a decision on how many components you will keep.

Question 12

Perform a PCA to extract the desired number of components.
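A sketch, assuming you decided on two components in Question 11 (substitute your own decision for nfactors); the object is called mypca to match the hint code further below:

```r
library(psych)

skills <- job[, c("commun", "probl_solv", "logical", "learn", "physical", "appearance")]

# re-run PCA keeping only the retained components (2 assumed here)
mypca <- principal(skills, nfactors = 2, rotate = "none")
mypca$loadings
```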

Question 13

Examine the loadings of the two principal components. Is there a pattern in which variables load on each component that helps you interpret them?

See Reading 8 #Examining Loadings for an explanation of the loadings.

Question 14

Join the principal component scores for your retained components to the original dataset which has the arrest rates in.

Then fit a linear model to look at how the arrest rate of police officers is predicted by the two components representing different composites of the skills ratings by HR.

Check for multicollinearity between your predictors. How does this compare to a model which uses all 6 of the original variables instead?

We can get the scores out using mypca$scores. We can add them to the existing dataset by adding them as new columns (mutate() here is from dplyr):

job <- job |>
  mutate(
    score1 = mypca$scores[,1]
  )

To examine multicollinearity - try vif() from the car package.
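Putting the pieces together, a sketch of the full workflow (the arrest-rate column name arrest_rate is an assumption - check names(job); mypca is the PCA object from Question 12):

```r
library(dplyr)
library(car)

# add the retained component scores to the dataset
job <- job |>
  mutate(
    comp1 = mypca$scores[, 1],
    comp2 = mypca$scores[, 2]
  )

# arrest rate predicted by the two skill composites
# (arrest-rate column name assumed; check names(job))
mdl_pca <- lm(arrest_rate ~ comp1 + comp2, data = job)
summary(mdl_pca)
vif(mdl_pca)

# for comparison: a model using all six original ratings
mdl_all <- lm(arrest_rate ~ commun + probl_solv + logical + learn +
                physical + appearance, data = job)
vif(mdl_all)
```

Because principal components are uncorrelated by construction, the VIFs for the component-based model should be close to 1, whereas the six correlated ratings will typically show much higher VIFs.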