Week 2 Exercises: Logistic and Longitudinal

Great Apes!

Data: msmr_apespecies.csv & msmr_apeage.csv

We have data from a large sample of great apes who have been studied between the ages of 1 and 10 years (i.e. during adolescence). Our data includes 4 species of great apes: Chimpanzees, Bonobos, Gorillas and Orangutans. Each ape has been assessed on a primate dominance scale at various ages. Data collection was not very rigorous, so apes do not have consistent assessment schedules (i.e., one may have been assessed at ages 1, 3 and 6, whereas another at ages 2 and 8).

The researchers are interested in examining how the adolescent development of dominance in great apes differs between species.

Data on the dominance scores of the apes are available at https://uoepsy.github.io/data/msmr_apeage.csv, and the information about which species each ape belongs to is in https://uoepsy.github.io/data/msmr_apespecies.csv.

Table 1: Data Dictionary: msmr_apespecies.csv

variable   description
ape        Ape Name
species    Species (Bonobo, Chimpanzee, Gorilla, Orangutan)
Table 2: Data Dictionary: msmr_apeage.csv

variable    description
ape         Ape Name
age         Age at assessment (years)
dominance   Dominance (Z-scored)
Question 1

Read in the data and check over it. Do any relevant cleaning/wrangling that might be necessary.
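
Here is a minimal sketch of one way to start (assuming the tidyverse is loaded; the object names apespecies, apeage, and apedat are just example names):

library(tidyverse)

# read both files directly from the course website
apespecies <- read_csv("https://uoepsy.github.io/data/msmr_apespecies.csv")
apeage     <- read_csv("https://uoepsy.github.io/data/msmr_apeage.csv")

# join on the common identifier 'ape', so every assessment has the ape's species attached
apedat <- left_join(apeage, apespecies, by = "ape")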

Question 2

How is this data structure “hierarchical” (or “clustered”)? What are our level 1 units, and what are our level 2 units?

Question 3

For how many apes do we have data? How many of each species?
How many datapoints does each ape have?

We saw this last week too: counting the different levels in our data. See 2B #getting-to-know-my-monkeys for an example (also involving primates!).
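
A rough sketch (assuming the joined data from Question 1 is stored in apedat):

n_distinct(apedat$ape)                   # how many apes in total?

apedat |>
  group_by(species) |>
  summarise(n_apes = n_distinct(ape))    # how many apes of each species?

apedat |>
  count(ape) |>                          # number of datapoints per ape
  summarise(min = min(n), max = max(n))  # range of assessments per ape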

Question 4

Make a plot to show how dominance changes as apes get older.

In 2B #exploring-the-data we made a facet for each cluster (each participant). That was fine because we had only 20 people. In this dataset we have 168! That’s too many to facet. The group aesthetic will probably help instead!
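
One possible sketch of a "spaghetti plot", using the group aesthetic to draw one line per ape (the facet by species is optional):

ggplot(apedat, aes(x = age, y = dominance, group = ape, col = species)) +
  geom_line(alpha = 0.3) +
  facet_wrap(~ species)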

Question 5

Recenter the age variable on 1, which is the youngest age at which we have data for any of our species.

Then fit a model that estimates the differences between primate species in how dominance changes over time.
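
A minimal sketch of one way to do this (the random effect structure shown here is one plausible choice, not the only defensible one, and apemod is just an example name):

library(lme4)

# recenter age so that 0 corresponds to age 1
apedat <- apedat |>
  mutate(ageC = age - 1)

# random intercepts and slopes of age for each ape
apemod <- lmer(dominance ~ 1 + ageC * species + (1 + ageC | ape),
               data = apedat)
summary(apemod)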

Question 6

Do primate species differ in the growth of dominance?
Perform an appropriate test/comparison.

This is asking about the age*species interaction, which in our model is represented by 3 parameters. To assess the overall question, it might make more sense to do a model comparison.
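
For example, a model comparison sketch (assuming the model from Question 5 is apemod): refit without the interaction and compare, which jointly tests the 3 age:species parameters. Note that anova() will refit the models with ML for the likelihood ratio test.

apemod_res <- lmer(dominance ~ 1 + ageC + species + (1 + ageC | ape),
                   data = apedat)

anova(apemod_res, apemod)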

Question 7

Plot the average model predicted values for each age.

Before you plot... do you expect to see straight lines? (Remember, not every ape is measured at age 2, or age 3, etc.)

This is like taking predict() from the model, then grouping by age and calculating the mean of those predictions. However, we can do this more easily using augment() and then some fancy stat_summary() in ggplot (see the lecture).
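
A sketch along those lines (broom.mixed provides augment() for lmer models; the aesthetics here are just one option):

library(broom.mixed)

augment(apemod) |>
  ggplot(aes(x = ageC, y = .fitted, col = species)) +
  stat_summary(geom = "line", fun = mean) +
  stat_summary(geom = "pointrange", fun.data = mean_se)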

Question 8

Plot the model-based fixed effects.
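
One way to sketch this by hand: predict over a grid of ages for each species while ignoring the random effects (re.form = NA), then plot those predictions. The grid of 0 to 9 assumes age was recentered on 1 as in Question 5.

plotdat <- expand_grid(ageC = 0:9,
                       species = unique(apedat$species))

plotdat <- plotdat |>
  mutate(.fitted = predict(apemod, newdata = plotdat, re.form = NA))

ggplot(plotdat, aes(x = ageC, y = .fitted, col = species)) +
  geom_line()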

Question 9

Interpret each of the fixed effects from the model (you might also want to get some p-values or confidence intervals).

Each of the estimates should correspond to part of our plot from the previous question.
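
For confidence intervals, one option is below (restricting to the fixed effects with parm = "beta_"); refitting with lmerTest::lmer() is an alternative if you would rather have Satterthwaite-based p-values.

confint(apemod, method = "profile", parm = "beta_")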


Trolley problems

Data: msmr_trolley.csv

The “Trolley Problem” is a thought experiment in moral philosophy that asks you to decide whether or not to pull a lever to divert a trolley. Pulling the lever diverts the trolley from a track on which it will hit 5 people onto a track on which it will hit one person.

Previous research has found that the “framing” of the problem will influence the decisions people make:

positive frame: 5 people will be saved if you pull the lever; one person on another track will be saved if you do not pull the lever. All your actions are legal and understandable. Will you pull the lever?

neutral frame: 5 people will be saved if you pull the lever, but another person will die. One person will be saved if you do not pull the lever, but 5 people will die. All your actions are legal and understandable. Will you pull the lever?

negative frame: One person will die if you pull the lever. 5 people will die if you do not pull the lever. All your actions are legal and understandable. Will you pull the lever?

We conducted a study to investigate whether framing effects on moral judgements depend upon the stakes (i.e. the number of lives saved).

120 participants were recruited, and each gave answers to 12 versions of the thought experiment. For each participant, four versions followed each of the positive/neutral/negative framings described above, and for each framing, 2 would save 5 people and 2 would save 15 people.

The data are available at https://uoepsy.github.io/data/msmr_trolley.csv.

Table 3: Data Dictionary: msmr_trolley.csv

variable   description
PID        Participant ID
frame      Framing of the thought experiment (positive/neutral/negative)
lives      Lives at stake in the thought experiment (5 or 15)
lever      Whether or not the participant chose to pull the lever (1 = yes, 0 = no)
Question 10

Read in the data and check over how many people we have, and whether we have complete data for each participant.

I would maybe try data |> group_by(PID) |> summarise(), and then use the n_distinct() function to count how many “things” each person sees (e.g., 2B #example).
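
A sketch of that idea (trolley is just an example name for the data):

trolley <- read_csv("https://uoepsy.github.io/data/msmr_trolley.csv")

trolley |>
  group_by(PID) |>
  summarise(
    n_trials = n(),              # 12 versions per participant?
    n_frames = n_distinct(frame),
    n_lives  = n_distinct(lives)
  )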

Question 11

Construct an appropriate plot to summarise the data in a suitable way to illustrate the research question.

Something making use of stat_summary() to give proportions, a bit like the plot in 2B #getting-to-know-my-monkeys?
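
For instance (one of many possible plots): because lever is 0/1, the mean of lever in each condition is the proportion of people pulling it, which stat_summary() can compute for us.

ggplot(trolley, aes(x = frame, y = lever, col = factor(lives))) +
  stat_summary(geom = "pointrange", fun.data = mean_se,
               position = position_dodge(width = 0.2))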

Question 12

Fit a model to assess the research aims.
Don’t worry if it gives you an error, we’ll deal with that in a second.

  • Remember, a good way to start is to split this up into 3 parts: 1) the outcome and fixed effects, 2) the grouping structure, and 3) the random slopes.
  • fitting (or attempting to fit!) glmer models might take time!
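
Putting those three parts together, a "maximal" first attempt might look like the sketch below (this is the model that may well fail to converge, and tmod_full is just an example name):

library(lme4)

tmod_full <- glmer(lever ~ frame * lives + (1 + frame * lives | PID),
                   data = trolley, family = binomial)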

Question 13

This is probably the first time we’ve had to deal with a model not converging.

While sometimes changing the optimizer can help, more often than not, the model we are trying to fit is just too complex. Often, the groups in our sample just don’t vary enough for us to estimate a random slope.

The aim here is to simplify our random effect structure in order to obtain a converging model, but be careful not to over simplify.

Try it now. What model do you end up with? (You might not end up with the same model as each other, which is fine. These methods don’t have “cookbook recipes”!)

You could think of the interaction as the ‘most complex’ part of our random effects, so you might want to remove that first.
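
One possible simplification path (not the only one), sketched below: drop the interaction from the random effects first, and only simplify further if the model still does not converge.

tmod2 <- glmer(lever ~ frame * lives + (1 + frame + lives | PID),
               data = trolley, family = binomial)

tmod3 <- glmer(lever ~ frame * lives + (1 + lives | PID),
               data = trolley, family = binomial)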

Question 14

Plot the predicted probabilities from your model for each combination of frame and lives.
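
A sketch, assuming your final converging model is called tmod3: predict on the response (probability) scale for every combination of frame and lives, ignoring the participant-level random effects.

preddat <- expand_grid(frame = unique(trolley$frame),
                       lives = unique(trolley$lives))

preddat <- preddat |>
  mutate(pred = predict(tmod3, newdata = preddat,
                        re.form = NA, type = "response"))

ggplot(preddat, aes(x = frame, y = pred, col = factor(lives))) +
  geom_point(size = 3)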


Optional extra: Novel Word Learning

Data: nwl.RData

load(url("https://uoepsy.github.io/msmr/data/nwl.RData"))

In the nwl data set (accessed using the code above), participants with aphasia are separated into two groups based on the general location of their brain lesion: anterior vs. posterior. There is data on the numbers of correct and incorrect responses participants gave in each of a series of experimental blocks. There were 7 learning blocks, immediately followed by a test. Finally, participants also completed a follow-up test.
Data were also collected from healthy controls.
Figure 1 shows the differences between lesion location groups in the average proportion of correct responses at each point in time (i.e., each block, test, and follow-up).

Figure 1: Differences between groups in the average proportion of correct responses at each block
variable          description
group             Whether participant is a stroke patient ('patient') or a healthy control ('control')
lesion_location   Location of brain lesion: anterior vs posterior
block             Experimental block (1-9). Blocks 1-7 were learning blocks, immediately followed by a test in block 8. Block 9 was a follow-up test at a later point
PropCorrect       Proportion of 30 responses in a given block that the participant got correct
NumCorrect        Number of responses (out of 30) in a given block that the participant got correct
NumError          Number of responses (out of 30) in a given block that the participant got incorrect
ID                Participant Identifier
Phase             Experimental phase, corresponding to experimental block(s): 'Learning', 'Immediate', 'Follow-up'
Question 16

Load the data. Take a look around. Any missing values? Can you think of why?
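
A quick sketch of where you might look (e.g., is lesion_location only missing for the healthy controls?):

load(url("https://uoepsy.github.io/msmr/data/nwl.RData"))

summary(nwl)
table(nwl$group, is.na(nwl$lesion_location))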

Question 17

Our broader research aim today is to compare the two lesion location groups (those with anterior vs. posterior lesions) with respect to their accuracy of responses over the course of the study.

  • What is the outcome variable?

Think carefully: there might be several variables which either fully or partly express the information we are considering the “outcome” here. We saw this back in USMR with the glm()!

Question 18

Research Question 1:
Is the learning rate (training blocks) different between the two lesion location groups?

  • Do we want cbind(num_successes, num_failures)?

  • Ensure you are running models on only the data we are actually interested in.

    • Are the healthy controls included in the research question under investigation?
    • Are the testing blocks included in the research question, or only the learning blocks?
  • We could use model comparison via likelihood ratio tests (using anova(model1, model2, model3, ...)). For this question, we could compare:

    • A model with just the change over the sequence of blocks
    • A model with the change over the sequence of blocks and an overall difference between groups
    • A model with groups differing with respect to their change over the sequence of blocks
  • What about the random effects part?

    1. What are our observations grouped by?
    2. What variables can vary within these groups?
    3. What do you want your model to allow to vary within these groups?
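
A sketch that follows the hints above (the filter assumes the Phase and group labels given in the data dictionary; the model names are just examples, and the random slope of block may need simplifying if the models don't converge):

library(lme4)

nwl_learn <- nwl |>
  filter(Phase == "Learning", group == "patient")

m_base  <- glmer(cbind(NumCorrect, NumError) ~ block + (block | ID),
                 data = nwl_learn, family = binomial)
m_group <- update(m_base, . ~ . + lesion_location)
m_int   <- update(m_base, . ~ . + lesion_location + block:lesion_location)

anova(m_base, m_group, m_int)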

Question 19

Research Question 2
In the testing phase, does performance on the immediate test differ between lesion location groups, and does the retention from immediate to follow-up test differ between the two lesion location groups?

Let’s try a different approach to this. Instead of fitting various models and comparing them via likelihood ratio tests, just fit the one model which could answer both parts of the question above.

  • This might require a bit more data-wrangling beforehand. Think about the order of your factor levels (alphabetically speaking, “Follow-up” comes before “Immediate”)!
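
For example, a sketch of that wrangling and one such model (the random effect structure here is a minimal choice; nwl_test and m_test are just example names):

nwl_test <- nwl |>
  filter(group == "patient", Phase != "Learning") |>
  mutate(Phase = relevel(factor(Phase), ref = "Immediate"))

m_test <- glmer(cbind(NumCorrect, NumError) ~ Phase * lesion_location + (1 | ID),
                data = nwl_test, family = binomial)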

Question 20
  1. In family = binomial(link='logit'), what function is used to relate the linear predictors in the model to the expected value of the response variable?
  2. How do we convert this into something more interpretable?
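
The fixed effects from a logistic model are on the log-odds scale, so one common conversion is to exponentiate them into odds ratios, e.g. (assuming the model from the previous question is m_test):

exp(fixef(m_test))                                      # odds ratios
exp(confint(m_test, method = "Wald", parm = "beta_"))   # and their confidence intervals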

Question 21

Make sure you pay attention to trying to interpret each fixed effect from your models.
These can be difficult, especially when it’s logistic, and especially when there are interactions.

  • What is the increase in the odds of answering correctly in the immediate test if you were to have a posterior lesion instead of an anterior lesion?
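
If "Immediate" is the reference level of Phase and "anterior" is the reference level of lesion_location, this is the exponentiated lesion_location coefficient; the exact coefficient name below is an assumption, so check names(fixef(m_test)) for your own model.

exp(fixef(m_test)["lesion_locationposterior"])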

Question 22

Recreate the visualisation in Figure 2.

Figure 2: Differences between groups in the average proportion of correct responses at each block
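
A starting-point sketch (exactly what to plot depends on the original figure, but something along these lines shows the average proportion correct per block for each lesion location group):

nwl |>
  filter(!is.na(lesion_location)) |>
  ggplot(aes(x = block, y = PropCorrect, col = lesion_location)) +
  stat_summary(geom = "pointrange", fun.data = mean_se) +
  stat_summary(geom = "line", fun = mean)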