DAPR3 Lab Exercises
  • Multi-level Models
    • W1: Regression Refresher
    • W2 Exercises: Introducing MLM
    • W3 Exercises: Nested and Crossed Structures
    • W4 Exercises: Centering
    • W5 Exercises: Bringing it all together
  • Measurement & Factor Analysis
    • W7 Exercises: Scale Scores & PCA
    • W8 Exercises: EFA
    • W9 Exercises: CFA
    • W10 Exercises: Reliability & Validity
    • W11 Exercises

On this page

  • Gambler’s fallacy

W7 Exercises: Scale Scores & PCA

Gambler’s fallacy

Dataset: gamblers.csv

A researcher is interested in assessing if people who gamble will tend to lose more if they are more ‘impulsive’, and whether this might depend on whether they are gambling online or in a casino.

They recruited 482 participants (248 in a casino, and 234 on an online gambling site). Each participant filled out a 6 question measure of “impulsivity”, and then their total net gains (or losses) for the day were recorded (in £). All people were only playing the game BlackJack.

Our research question: does greater impulsivity lead to bigger losses when comparing online gamblers to casino gamblers?

Dataset: The data can be found at https://uoepsy.github.io/data/gamblers.csv

Table 1: gamblers.csv Data Dictionary
variable description
online whether the person was gambling in a casino or online
imp_1 I often act on the spur of the moment without thinking.
imp_2 I find it hard to resist temptations.
imp_3 I make decisions quickly, even when they have serious consequences.
imp_4 I find it hard to stay focused on tasks that take a long time to finish.
imp_5 I prefer safe activities rather than risky things just for fun.
imp_6 I am usually patient and can wait for what I want.
gain net losses or gains upon leaving/logging out
Question 1

Read in the data and have a look at it.

  • What does each row represent?
  • What measurement(s) show us a person’s impulsivity?

Solution 1. Here’s the data:

gdat <- read_csv("https://uoepsy.github.io/data/gamblers.csv")
head(gdat)
# A tibble: 6 × 8
  online imp_1                      imp_2          imp_3 imp_4 imp_5 imp_6  gain
  <chr>  <chr>                      <chr>          <chr> <chr> <chr> <chr> <dbl>
1 casino Disagree                   Neither Disag… Disa… Neit… Neit… Neit…     0
2 casino Disagree                   Disagree       Disa… Neit… Agree Agree   -12
3 casino Neither Disagree nor Agree Neither Disag… Neit… Neit… Neit… Disa…   -35
4 casino Neither Disagree nor Agree Agree          Neit… Agree Agree Neit…   -20
5 online Disagree                   Neither Disag… Disa… Neit… Neit… Neit…   -14
6 casino Neither Disagree nor Agree Disagree       Disa… Stro… Neit… Neit…   -28

Each row is a participant, for each person there are 6 columns all measuring the construct of “impulsivity”.
And for each of those columns, there’s a whole load of words in there!

Question 2

First things first, our questionnaire software has given us the responses all in the descriptors used for each point of the likert scale, which is a bit annoying.
Convert them all to numbers, which we can then work with.

What we have What we want
Strongly Agree 5
Agree 4
Agree 4
Strongly Disagree 1
Neither Disagree nor Agree 3
Agree 4
Disagree 2
… …
Hints

See 1: Data Wrangling for Questionnaires#variable-recoding.

Solution 2. We want to turn all of the variables from imp_1 to imp_6 into numbers.

To do it with one variable:

gdat |> mutate(
  imp_1 = case_match(imp_1,
                     "Strongly Disagree" ~ 1,
                     "Disagree" ~ 2,
                     "Neither Disagree nor Agree" ~ 3,
                     "Agree" ~ 4,
                     "Strongly Agree" ~ 5
  )
)

And we can do it to all at once with across().

gdat <- gdat |> mutate(
  across(c(imp_1:imp_6),
         ~case_match(.,
                     "Strongly Disagree" ~ 1,
                     "Disagree" ~ 2,
                     "Neither Disagree nor Agree" ~ 3,
                     "Agree" ~ 4,
                     "Strongly Agree" ~ 5
         ))
  )

head(gdat)
# A tibble: 6 × 8
  online imp_1 imp_2 imp_3 imp_4 imp_5 imp_6  gain
  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 casino     2     3     2     3     3     3     0
2 casino     2     2     2     3     4     4   -12
3 casino     3     3     3     3     3     2   -35
4 casino     3     4     3     4     4     3   -20
5 online     2     3     2     3     3     3   -14
6 casino     3     2     2     1     3     3   -28

Question 3

Just looking at the impulsivity questions, create a correlation matrix of 6 variables.
What do you notice? Does it make sense given the wording of the questions?

Solution 3.

cor(gdat[,2:7])
       imp_1  imp_2  imp_3  imp_4  imp_5  imp_6
imp_1  1.000  0.470  0.555  0.044 -0.464 -0.453
imp_2  0.470  1.000  0.414  0.321 -0.204 -0.305
imp_3  0.555  0.414  1.000  0.102 -0.406 -0.461
imp_4  0.044  0.321  0.102  1.000 -0.244 -0.334
imp_5 -0.464 -0.204 -0.406 -0.244  1.000  0.390
imp_6 -0.453 -0.305 -0.461 -0.334  0.390  1.000

Correlations are all positive except for those with Q5 and Q6. Q5 and Q6 are positively related, but they are negatively related to the other questions.

If you’re like Elizabeth and you have a hard time making sense of a matrix full of numbers, then the function heatmap() might be your friend. It represents the correlation matrix as colours. Cells representing strong positive correlations appear darker, and cells representing strong negative correlations appear lighter.

cor(gdat[,2:7]) |>
  heatmap()

There’s a dark diagonal line down the middle, which represents the perfect correlation between an item and itself. The cells at the intersections of Q5 and Q6 are fairly dark, and the cells at the intersections of Q1–Q4 are also fairly dark (the positive correlations). Where Q5 and Q6 meet Q1–Q4, the cells are fairly light (the negative correlations). In addition to the bracketing structure at the plot margins (in technical terms a “dendrogram”), this visual approach to a correlation matrix shows that Q5 and Q6 tend to pattern together in one way, and that the other four Qs tend to pattern together in another way.

This makes sense given the way the questions are worded - if people are impulsive, they will be more likely to disagree to Q5 and Q6, but agree with the others:

qitems
[1] "I often act on the spur of the moment without thinking."                 
[2] "I find it hard to resist temptations."                                   
[3] "I make decisions quickly, even when they have serious consequences."     
[4] "I find it hard to stay focused on tasks that take a long time to finish."
[5] "I prefer safe activities rather than risky things just for fun."         
[6] "I am usually patient and can wait for what I want."                      

Question 4

Reverse score questions 5 and 6.

Hints
  • See 1: Data Wrangling for Questionnaires#reverse-coding
  • Be careful!! if you have some code that reverse scores a question, and you run it twice, you will essentially reverse-reverse score the question, and it goes back to the original ordering!

Solution 4. There’s only 2, so let’s do this individually for each question:

gdat <- gdat |> 
  mutate(
    imp_5 = 6 - imp_5,
    imp_6 = 6 - imp_6,
)
head(gdat)
# A tibble: 6 × 8
  online imp_1 imp_2 imp_3 imp_4 imp_5 imp_6  gain
  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 casino     2     3     2     3     3     3     0
2 casino     2     2     2     3     2     2   -12
3 casino     3     3     3     3     3     4   -35
4 casino     3     4     3     4     2     3   -20
5 online     2     3     2     3     3     3   -14
6 casino     3     2     2     1     3     3   -28

Question 5

Take a look at the correlation of the impulsivity questions again.
What has changed?

Solution 5. The negative correlations are now positive!

cor(gdat[,2:7])
      imp_1 imp_2 imp_3 imp_4 imp_5 imp_6
imp_1 1.000 0.470 0.555 0.044 0.464 0.453
imp_2 0.470 1.000 0.414 0.321 0.204 0.305
imp_3 0.555 0.414 1.000 0.102 0.406 0.461
imp_4 0.044 0.321 0.102 1.000 0.244 0.334
imp_5 0.464 0.204 0.406 0.244 1.000 0.390
imp_6 0.453 0.305 0.461 0.334 0.390 1.000

Question 6

We’re finally getting somewhere! Let’s create a score for “impulsivity” and add it as a new column onto the existing data frame.

The description of the questionnaire says that we should take the sum of the scores on each question, to get an overall measure of impulsivity.

Hints

The function rowSums() should help us here! See an example in 1: Data Wrangling for Questionnaires#row-scoring

Solution 6.

gdat$impulsivity <- rowSums(gdat[,2:7])

Question 7

Provide some descriptive statistics for the impulsivity scale scores of people at the two locations (online vs casino).

Hints

The describe() and describeBy() functions from the psych package can often pretty useful for this kind of thing. Alternatively, data |> group_by(...) |> summarise(....)!

Solution 7.

library(psych)
describeBy(gdat$impulsivity, group=gdat$online)

 Descriptive statistics by group 
group: casino
   vars   n mean   sd median trimmed  mad min max range  skew kurtosis  se
X1    1 248 16.8 3.22     17    16.8 2.97   7  25    18 -0.11    -0.13 0.2
------------------------------------------------------------ 
group: online
   vars   n mean   sd median trimmed  mad min max range skew kurtosis  se
X1    1 234 19.8 3.12     20    19.8 2.97  11  30    19 0.09    -0.06 0.2

The tidyverse way:

gdat |>
  group_by(online) |>
  summarise(
    mean = mean(impulsivity),
    median = median(impulsivity),
    min = min(impulsivity),
    max = max(impulsivity),
    sd = sd(impulsivity)
  )
# A tibble: 2 × 6
  online  mean median   min   max    sd
  <chr>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1 casino  16.8     17     7    25  3.22
2 online  19.8     20    11    30  3.12

Question 8

Does greater impulsivity lead to bigger losses when comparing online gamblers to casino gamblers?

Using the scale scores that you just computed, create a plot to show how impulsivity is associated with gains/losses of gamblers in the two places (casino vs online).

Solution 8. Something like this should do the trick:

ggplot(gdat,aes(x=impulsivity,y=gain,col=online))+
  geom_point(size = 3, alpha = .3)+
  geom_smooth(method=lm)

Question 9

Based on the plot in the previous question, if you fit the model lm(gain ~ impulsivity * online) to this data (where impulsivity is the scale score), what coefficients would the model estimate? would the sign of each coefficient be positive or negative?

Once you’ve made a good effort to predict the answers to these questions, fit the model and see if your predictions are borne out. (If your predictions are different from the outcomes, reflect on why the outcomes are the way they are.)

Solution 9. Here’s the model:

mod1 <- lm(gain ~ online * impulsivity, data = gdat)

It’s going to estimate 4 things:

estimate prediction explanation
intercept around zero/slightly negative the 'online' variable is coded with casino as the reference level, so the intercept is going to be the height of the casino line where impulsivity is 0. so it looks like it will be around 0, or a bit below.
onlineonline positive this coefficient will tell us the difference between casino and online when impulsivity is zero. the blue line in the plot is going to be higher than the red line when impulsivity is zero, so this coefficient will be positive
impulsivity negative this is going to be how gains/losses change when impulsivity increases, specifically for the casino group. so in my plot it is the slope of the red line. it's going to be decreasing
onlineonline:impulsivity negative this is going to be how the association between impulsivity and gains/losses changes when we move from casino to online. We know this association is negative in the casino group, and the online group looks like it is even more steeply downwards, so this is going to be a negative coefficient
summary(mod1)$coefficients
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                -2.415      3.488  -0.692 0.489038
onlineonline               12.192      5.563   2.192 0.028884
impulsivity                -0.737      0.204  -3.610 0.000338
onlineonline:impulsivity   -0.533      0.297  -1.793 0.073646

Question 10

Take a look again at the wordings of the questions on impulsivity. Do you think they equally represent the construct of ‘impulsivity’?

If you’re stuck, think about whether each question might be measuring something else, in addition to (or instead of?) impulsivity.

variable description
imp_1 I often act on the spur of the moment without thinking.
imp_2 I find it hard to resist temptations.
imp_3 I make decisions quickly, even when they have serious consequences.
imp_4 I find it hard to stay focused on tasks that take a long time to finish.
imp_5 I prefer safe activities rather than risky things just for fun.
imp_6 I am usually patient and can wait for what I want.
Hints

This is a very subjective question. “Impulsivity” will mean subtly different things to each one of us. The idea is that we want to get at whatever idea it is that is shared across us when we use this word. To me, one of these questions feels a little less closely linked to being an ‘impulsive’ behaviour than the others.

Solution 10. I’m going to rank them in order of how much I think each question captures “impulsivity” to me.

I hope that some of you disagree with me about this ranking - that’s what makes measurement frustrating fun!

variable description my_thoughts
imp_1 I often act on the spur of the moment without thinking. clearly impulsivity
imp_3 I make decisions quickly, even when they have serious consequences. could be impulsivity, could be that you're really good at making decisions
imp_6 I am usually patient and can wait for what I want. similar to imp_2, impatience and impulsivity kind of go hand in hand, but this is not quite so clearly the definition of impulsivity as the first two
imp_2 I find it hard to resist temptations. 'temptations' here makes me immediately think of edible temptations! which is one manifestation of impulsivity i guess!
imp_5 I prefer safe activities rather than risky things just for fun. is risk taking the same as impulsivity? you can take calculated risks? people do 'risky' sports like climbing for fun, but not out of impulsivity?
imp_4 I find it hard to stay focused on tasks that take a long time to finish. this doesn't really feel like it is as clearly impulsivity. lots of things can distract us from tasks. boredom?

Question 11

Okay, so if we’re not very happy that our 6 questions are equally representative of “impulsivity” (or maybe groups of questions capture distinct aspects of the construct?), we might not want to work with the plain old sum of impulsivity scores that we used above.

What are we going to do?

Let’s start by doing a Principal Component Analysis (PCA) on the 6 original items, and extracting 6 components.

Hints

See Chapter 3: PCA walkthrough for the demonstration!

Solution 11.

imppca <- principal(gdat[,2:7], nfactors = 6, rotate = "none")
imppca
Principal Components Analysis
Call: principal(r = gdat[, 2:7], nfactors = 6, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
       PC1   PC2   PC3   PC4   PC5   PC6 h2       u2 com
imp_1 0.78 -0.38 -0.09  0.10 -0.32  0.35  1 -4.4e-16 2.4
imp_2 0.66  0.16 -0.66  0.19 -0.08 -0.26  1 -1.1e-15 2.6
imp_3 0.76 -0.31 -0.07 -0.16  0.54  0.06  1 -8.9e-16 2.3
imp_4 0.41  0.87  0.04  0.07  0.10  0.24  1  5.6e-16 1.7
imp_5 0.67 -0.05  0.56  0.45  0.01 -0.19  1 -1.1e-15 2.9
imp_6 0.73  0.14  0.23 -0.56 -0.22 -0.17  1 -4.4e-16 2.5

                       PC1  PC2  PC3  PC4  PC5  PC6
SS loadings           2.78 1.04 0.81 0.59 0.46 0.32
Proportion Var        0.46 0.17 0.14 0.10 0.08 0.05
Cumulative Var        0.46 0.64 0.77 0.87 0.95 1.00
Proportion Explained  0.46 0.17 0.14 0.10 0.08 0.05
Cumulative Proportion 0.46 0.64 0.77 0.87 0.95 1.00

Mean item complexity =  2.4
Test of the hypothesis that 6 components are sufficient.

The root mean square of the residuals (RMSR) is  0 
 with the empirical chi square  0  with prob <  NA 

Fit based upon off diagonal values = 1

Question 12

Take a look at the ‘variance accounted for’ by each component (you could use a scree plot to show this too!), and think back to our research question, which has absolutely nothing to do with whether “impulsivity” is one thing, or two things, or 6 things…

How many components do you want to keep?

Solution 12. Our research question, remember is asking “does greater impulsivity lead to bigger losses when comparing online gamblers to casino gamblers?”

We’re getting bogged down in the weeds of what do we even mean by ‘impulsivity’?? I would make a case that our research question kind of pre-supposes that “impulsivity” is just one thing. If we reduce these 6 questions down to two or more things, then our research question becomes a little bit more complex to answer, with lots of nuance about what sort of impulsivity we’re talking about. So from a purely pragmatic standpoint, I am really hoping we can just keep one thing, and call that thing “impulsivity”!

As it stands, the scree plot is in our favour. It shows a kink (or ‘elbow’, if you’d prefer) at 2 components, suggesting that we would be fine to keep just one.

scree(gdat[,2:7])

The numbers going into the scree plot are shown in the ‘variance accounted for’ bit of the PCA. These tell us that if we keep just one component, then we are capturing 47% of the variability in the questionnaire. If we kept two, then we would capture 64%, 3 would capture 77%, and so on, until we just keep 6 and we’re capturing 100%.

imppca$Vaccounted
                        PC1   PC2   PC3   PC4    PC5    PC6
SS loadings           2.782 1.043 0.810 0.594 0.4552 0.3158
Proportion Var        0.464 0.174 0.135 0.099 0.0759 0.0526
Cumulative Var        0.464 0.637 0.772 0.872 0.9474 1.0000
Proportion Explained  0.464 0.174 0.135 0.099 0.0759 0.0526
Cumulative Proportion 0.464 0.637 0.772 0.872 0.9474 1.0000

There’s no “right” answer here as to how much we should keep. 47% makes it feel like we’re losing more than we’re capturing (which we are), but that might just be what we have to do!

Question 13

Extract the scores for the first principal component, and attach them to your dataset as a new set of scores for “impulsivity”.

Attend also to the loadings for that first component - is it related more to the questions you felt were more clearly asked about ‘impulsivity’?

Hints

To extract the scores, see Chapter 3: PCA walkthrough #scores .

Solution 13. Here are the scores added to our data:

gdat$pc1 <- imppca$scores[,1]

And here are the loadings for that first component. It’s less related to imp_4 and imp_2, and more related to imp_1 and imp_3 (this kind of fits in with my view that these two questions are more obviously asking about “impulsivity” to me).

sort(imppca$loadings[,1])
imp_4 imp_2 imp_5 imp_6 imp_3 imp_1 
0.411 0.660 0.670 0.733 0.762 0.782 

The nice result of this is that, if we use these scores and loadings in subsequent analyses, the questions that are better at targeting the construct we care about are going to be weighted more heavily than the questions that are worse at targeting that construct. The PCA scores are therefore a more accurate way of representing impulsivity than the un-weighted summed score we computed above.

Question 14

Using your PCA scores, not the old summed scale scores, create a plot that shows the relationship between impulsivity and financial loss or gain in the two different locations (casino and online). What changes, compared to the old plot?

Next, fit a new linear model that uses the PCA scores, not the old summed scale scores, to address the question of how impulsivity might affect gains in different locations. What changes, compared to the old model?

Solution 14.

library(patchwork)

p1 <- ggplot(gdat,aes(x=impulsivity,y=gain,col=online))+
  geom_point(size = 3, alpha = .3)+
  geom_smooth(method=lm) +
  labs(title="scale scores")

p2 <- ggplot(gdat,aes(x=pc1,y=gain,col=online))+
  geom_point(size = 3, alpha=.3)+
  geom_smooth(method=lm) +
  labs(title="PC scores")

p1 + p2 + plot_layout(guides="collect")

There are more different pc1 values than there are different scale score values. The scale score values can only be integers (that is, whole numbers), while the pc1 values can be decimal numbers too. This makes sense, because if the principal component scores are some weighting of each variable, then we’re going to end up with far more possible scores.

mod2 <- lm(gain ~ online * pc1, data = gdat)

sjPlot::tab_model(mod1, mod2)
  gain gain
Predictors Estimates CI p Estimates CI p
(Intercept) -2.41 -9.27 – 4.44 0.489 -15.91 -17.32 – -14.51 <0.001
online [online] 12.19 1.26 – 23.12 0.029 2.72 0.68 – 4.75 0.009
impulsivity -0.74 -1.14 – -0.34 <0.001
online [online] ×
impulsivity
-0.53 -1.12 – 0.05 0.074
pc1 -2.70 -4.09 – -1.30 <0.001
online [online] × pc1 -2.16 -4.19 – -0.12 0.038
Observations 482 482
R2 / R2 adjusted 0.091 / 0.085 0.105 / 0.099

Note that the better measurement of “impulsivity” by the PCA (weighting our scores more towards imp_1 and imp_3) results in a significant interaction here.

It’s also worth noting that the online [online] coefficient in the new model is also significant, but that is because the PCA scores are standardised, whereas the scale scores are not. So “0” means something very different on those two measures.