Preliminaries

  1. Open RStudio, and create a new project for this course!
  2. Create a new RMarkdown document or R script (whichever you like) for this week.

New Packages!

These are the main packages we’re going to use in this block. It might make sense to install them now if you do not have them already (note, the rstudio.ppls.ed.ac.uk server already has lme4 and tidyverse installed for you).

  • tidyverse : for organising data
  • ICC : for quickly calculating intraclass correlation coefficient
  • lme4 : for fitting generalised linear mixed effects models
  • lmeresampler : for bootstrapping!
  • effects : for tabulating and graphing effects in linear models
  • broom.mixed : tidying methods for mixed models
  • sjPlot : for plotting models
  • DHARMa : for simulating residuals to assess assumptions
  • HLMdiag : for examining case diagnostics at multiple levels
install.packages(c("tidyverse","ICC","lme4","effects","broom.mixed","sjPlot","DHARMa","HLMdiag"))
# the lmeresampler package has had some recent updates. better to install the most recent version:
install.packages("devtools")
devtools::install_github("aloy/lmeresampler")

Linear model refresh

Recall that in the DAPR2 course last year we learned all about the linear regression model, which took the form:

\[ \begin{aligned} & \text{for observation } i \\ & \color{red}{Y_i} = \color{blue}{\beta_0 \cdot 1 + \beta_1 \cdot X_{1i} \ + \ ... \ + \ \beta_p \cdot X_{pi}} + \varepsilon_i \end{aligned} \]

And if we want to write this more simply, we can express \(X_1\) to \(X_p\) as an \(n \times p\) matrix (sample size \(\times\) parameters), and \(\beta_0\) to \(\beta_p\) as a vector of coefficients:

\[ \mathbf{y} = \boldsymbol{X\beta} + \boldsymbol{\varepsilon} \quad \text{where} \quad \varepsilon \sim N(0, \sigma) \text{ independently} \]
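To make the matrix form concrete, here is a minimal sketch on a small made-up dataset (the names `x1`, `x2` are purely illustrative, not from the toy data). `model.matrix()` shows the matrix of predictors that `lm()` constructs internally, including the column of 1s for the intercept, and the least squares solution \(\hat{\beta} = (X^\top X)^{-1} X^\top \mathbf{y}\) matches the coefficients `lm()` returns:

```r
set.seed(1)
# small made-up dataset (illustrative only)
df <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(0, 1, 0, 1))
df$y <- 2 + 0.5 * df$x1 - 1 * df$x2 + rnorm(4, sd = 0.1)

# the design matrix lm() builds: a column of 1s (intercept), then x1, x2
X <- model.matrix(y ~ x1 + x2, data = df)

# least squares by hand: beta-hat = (X'X)^-1 X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% df$y

# compare with the coefficients lm() returns
cbind(beta_hat, coef(lm(y ~ x1 + x2, data = df)))
```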

Data: Toy Data

Let’s consider a little toy example in which we might use linear regression to determine how practice (in hours per week) influences the reading age of different toy figurines.

Imagine that we have data on various types of toys, from Playmobil, to Powerrangers, to farm animals.
You can find a dataset at https://uoepsy.github.io/data/toyexample.csv, and read it into your R environment using the code below:

library(tidyverse)
toys_read <- read_csv("https://uoepsy.github.io/data/toyexample.csv")
The dataset contains information on 132 different toy figures. You can see the variables in the table below.

| variable | description |
|----------|-------------|
| toy_type | Type of Toy |
| toy      | Character |
| hrs_week | Hours of practice per week |
| age      | Age (in years) |
| R_AGE    | Reading Age |

Question A1

Read in the toy data from https://uoepsy.github.io/data/toyexample.csv and plot the bivariate relationship between Reading Age and Hrs per Week practice, and then fit the simple linear model: \[ \text{Reading Age}_i = \beta_0 + \beta_1 \cdot \text{Hours per week practice}_i + \varepsilon_i \]

Solution

Question A2

Think about the assumptions we make about our model: \[ \text{where} \quad \varepsilon_i \sim N(0, \sigma) \text{ independently} \] Have we satisfied this assumption (specifically, the assumption of independence of errors)?

Solution

Question A3

Try running the code below.

ggplot(data = toys_read, aes(x=hrs_week, y=R_AGE))+
  geom_point()+
  geom_smooth(method="lm",se=FALSE)

Then try editing the code to include an aesthetic mapping from the type of toy to the color in the plot.
How do your thoughts about the relationship between Reading Age and Practice change?

Solution

Complete Pooling

We can consider the simple regression model (lm(R_AGE ~ hrs_week, data = toys_read)) to “pool” the information from all observations together. In this ‘Complete Pooling’ approach, we simply ignore the natural clustering of the toys, as if we were unaware of it. The problem is that this assumes the same regression line for all toy types, which might not be that appropriate:

Figure 1: Complete pooling can lead to bad fit for certain groups

No Pooling

There are various ways we could attempt to deal with the problem that our data are in groups (or “clusters”). With the tools you have learned in DAPR2, you may be tempted to try including toy type in the model as another predictor, to allow for some toy types being generally better than others:

lm(R_AGE ~ hrs_week + toy_type, data = toys_read)

Or even to include an interaction to allow for toy types to respond differently to practice:

lm(R_AGE ~ hrs_week * toy_type, data = toys_read)

This approach is termed the “No Pooling” method, because the information from each cluster contributes only to an estimated parameter for that cluster, and there is no pooling of information across clusters. This is a good start, but it means that a) we are estimating a lot of parameters, and b) we are not necessarily estimating the parameter of interest (the overall effect of practice on reading age). Furthermore, the estimates for each group will likely have high variance, because each one is informed by only that group’s observations.
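To see point (a) concretely, here is a quick sketch with simulated clustered data (not the toy dataset; the variable names are illustrative). With 20 clusters and an interaction, the no-pooling model estimates a separate intercept and slope adjustment for every cluster:

```r
set.seed(1)
# simulated data: 20 clusters ("group"), 5 observations each
simdat <- data.frame(
  group = factor(rep(1:20, each = 5)),
  x     = rnorm(100)
)
simdat$y <- 1 + 0.3 * simdat$x + rep(rnorm(20), each = 5) + rnorm(100)

# no pooling with an interaction: a separate intercept and slope per cluster
np_mod <- lm(y ~ x * group, data = simdat)

# 2 reference-level coefficients + 19 intercept offsets + 19 slope offsets
length(coef(np_mod))  # 40 parameters from just 100 observations
```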

Question A4

Fit a linear model which accounts for the grouping of toys into their different types, but holds the effect of practice-hours-per-week on reading age as constant across types:

mod1 <- lm(R_AGE ~ hrs_week + toy_type, data = toys_read)

Can you construct a plot of the fitted values from this model, coloured by toy_type?
(Hint: you might want to use the augment() function from the broom package)

Solution

Question A5

What happens (to the plot, and to your parameter estimates) when you include the interaction between toy_type and hrs_week?

Solution

Some Data Wrangling

Data: Raising the stakes

30 volunteers from an amateur basketball league participated in a study on stress induced by size and type of potential reward for successfully completing a throw. Each participant completed 20 trials in which they were tasked with throwing a basketball and scoring a goal in order to win a wager. The size of the wager varied between trials, ranging from 1 to 20 points, with the order randomised for each participant. If a participant successfully threw the ball in the basket, then their score increased accordingly. If they missed, their score decreased accordingly. Participants were informed of the size of the potential reward/loss prior to each throw.

To examine the influence of the type of reward/loss on stress-levels, the study consisted of two conditions. In the monetary condition, (n = 15) participants were informed at the start of the study that the points corresponded to a monetary reward, and that they would be given their total score in £ at the end of the study. In the reputation condition, (n = 15) participants were informed that the points would be inputted on to a scoreboard and distributed around the local basketball clubs and in the league newsletter.

Throughout each trial, participants’ heart rate variability (HRV) was measured via a chest strap. HRV is considered to be indirectly related to levels of stress (i.e., higher HRV = less stress).

The data are stored in two separate files.

We’re going to need to do some data wrangling now, so take a read through the boxes below on reshaping and merging data.

Pivot!

One of the more confusing things to get to grips with is the idea of reshaping a dataframe.
For different reasons, you might sometimes want to have data in wide, or in long format.

When the data is wide, we can make it long using pivot_longer(). When we make data longer, we’re essentially collapsing lots of columns into 2 longer columns. For example, wide variables x, y and z go into a new longer column called name that specifies which variable (x/y/z) each value came from, and the values themselves get put into the val column.

In full, that reshape is pivot_longer(c(x,y,z), names_to = "name", values_to = "val"). To reverse this, and put it back to being wide, we tell R which columns to take the names and values from: pivot_wider(names_from = name, values_from = val).
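Here is a minimal sketch of both reshapes on a tiny made-up dataframe (this assumes you have the tidyverse, specifically tidyr, installed):

```r
library(tidyr)

# a small wide dataframe, made up for illustration
wide <- data.frame(id = 1:2, x = c(10, 11), y = c(20, 21), z = c(30, 31))

# wide -> long: the x/y/z columns become a "name" column and a "val" column
long <- pivot_longer(wide, c(x, y, z), names_to = "name", values_to = "val")
long  # 6 rows: one per id-by-variable combination

# long -> wide: reverse the reshape
wide_again <- pivot_wider(long, names_from = name, values_from = val)
```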

Joining data

Now comes a fun bit. You may have noticed that we have two datasets for this study. If we are interested in relationships between the heart rate variability (HRV) of participants during each trial, as well as the experimental manipulations (i.e., the condition of each trial), these are currently in different datasets.
Solution: we need to join them together!

Provided that both datasets contain information on participant number and trial number, which together uniquely identify each observation, we can join them by matching on those variables!

There are lots of different ways to join datasets, depending on whether we want to keep rows from one dataset or the other, keep only those present in both, and so on.

Figure 3: Check out the help documentation for them all using ?full_join.
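As a small illustration (with made-up data, not the study datasets), the dplyr join functions differ in which rows they keep:

```r
library(dplyr)

# two small made-up datasets that share an id column
d1 <- data.frame(id = c(1, 2, 3), hrv = c(5.1, 4.8, 5.5))
d2 <- data.frame(id = c(2, 3, 4), condition = c("monetary", "reputation", "monetary"))

inner_join(d1, d2, by = "id")  # only ids present in both (2 and 3)
left_join(d1, d2, by = "id")   # all rows of d1; NA condition for id 1
full_join(d1, d2, by = "id")   # all ids from either dataset (1 to 4)
```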

Question B1

Get the data into your R session.

Note: For one of the files, this is a bit different to how we have given you data in previous exercises. You may remember that for a .csv file, you can read directly into R from the link using read_csv("https://uoepsy.......).

However, in reality you are likely to be confronted with data in all sorts of weird formats, such as .xlsx files from MS Excel. Have a look around the internet to try and find any packages/functions/techniques for getting both the datasets in to R.

Solution

Unfortunately, a few students are getting error messages which we could not solve when trying to read in the xlsx data. The same data is available at https://uoepsy.github.io/data/bballhrv.csv so that you can read it in using:

read_csv("https://uoepsy.github.io/data/bballhrv.csv")
Question B2

Is each dataset in wide or long format? We want them both in long format, so try to reshape either/both if necessary.

Hint - in the tidyverse functions, you can specify all columns between column x and column z by using the colon, x:z.

Solution

Question B3

Join the two datasets (both in long format) together.

Note that the variables we are matching on need to have the information in the same format. For instance, R won’t be able to match "trial_1","trial_2","trial_3" with 1, 2, 3 because they are different things. We would need to edit one of them to be in the same format.
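For example, one hypothetical way to reconcile the formats (using base R’s gsub(); the trial_... values here are illustrative) is to strip the text prefix and convert to numeric:

```r
# illustrative character values in the "trial_1" style
trial_chr <- c("trial_1", "trial_2", "trial_3")

# strip the "trial_" prefix, then convert to numeric
trial_num <- as.numeric(gsub("trial_", "", trial_chr))
trial_num  # 1 2 3

# readr::parse_number(trial_chr) is a tidyverse equivalent
```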

Hint: You should end up with 600 rows.

Solution

Exploring Clustering

Question B4

Continuing with our basketball/hrv study, consider the following questions:

What are the units of observations?
What are the groups/clusters?
What varies within these clusters?
What varies between these clusters?

Solution

Question B5

Now that you have tidied and joined all the data together, plot the relationship between size of reward and HRV, ignoring the fact that there are repeated observations for each subject.
Can you make a separate plot for each of the experimental conditions? (Hint: facet_wrap())

Solution

Question B6

How are stress levels (measured via HRV) influenced by the size of potential reward/loss?

Ignore the clustering, and fit a simple linear regression estimating how heart rate variability is influenced by how high the stakes are (i.e. how big the reward is) for a given throw.

Solution

Question B7

Consider the following research question:

How do size and type of reward/loss interact to influence levels of stress?

Extend your model to include the interaction between stakes and experimental condition and examine the parameter values.

Solution

Question B8

Let’s start to examine the clustering a bit more.
Plot the relationship between size of reward and HRV, with a separate line for each subject.

Hint: remember the group = aesthetic in ggplot!

Solution

Question B9

Calculate the ICC, using the ICCbare() function from the ICC package.

Remember, you can look up the help for a function by typing a ? followed by the function name in the console.

Solution

Optional - Extra difficult. Calculate ICC manually

Understanding ICC a bit better

Think about what ICC represents - the ratio of the variance between the groups to the total variance.
You can think of the “variance between the groups” as the group means varying around the overall mean (the black dots around the black line), and the total variance as that plus the variance of the individual observations around each group mean (each set of coloured points around their respective larger black dot):

ggplot(bball, aes(x=sub, y=hrv))+
  geom_point(aes(col=sub),alpha=.3)+
  stat_summary(geom = "pointrange")+
  geom_hline(yintercept = mean(bball$hrv))+
  guides(col = "none")
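If you want a head start on the optional exercise, here is a rough sketch of the logic on simulated grouped data (not the basketball data). One-way ANOVA mean squares give estimates of the between-group and within-group variance components, and their ratio gives the ICC:

```r
set.seed(42)
# simulated data: 10 groups of 20, between-group sd 2 and within-group sd 1,
# so the true ICC is 4 / (4 + 1) = 0.8
g <- factor(rep(1:10, each = 20))
y <- rep(rnorm(10, sd = 2), each = 20) + rnorm(200, sd = 1)

k  <- 20                              # observations per group
ms <- anova(lm(y ~ g))$`Mean Sq`      # between-group and residual mean squares
var_between <- (ms[1] - ms[2]) / k    # estimated group-level variance
var_within  <- ms[2]                  # estimated residual variance

icc <- var_between / (var_between + var_within)
icc  # estimate of the true value, 0.8
```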

You can also think of the ICC as the correlation between two randomly drawn observations from the same group. This is a bit of a tricky thing to get your head round if you try to relate it to the type of “correlation” that you are familiar with. Pearson’s correlation (e.g. think about a typical scatterplot) operates on pairs of observations (a set of values on the x-axis and their corresponding values on the y-axis), whereas ICC operates on data which is structured in groups.

Optional - ICC as the expected correlation between two observations from same group

Fixed effects

Question C1

Let’s suppose we want to account for the by-participant clustering in our data with the “No pooling” method (i.e., in our multiple regression model, include participant ID as an additional predictor, along with its interaction with explanatory variable of interest).

  1. Make sure that the participant ID variable is a factor!
  2. Fit the model
  3. Use the plot_model() function (with type = "int") to plot the interaction terms between stakes and each participant.

Note: When examining parameter values, remember to think about how HRV is considered to relate to stress, and whether the direction of any effect you see makes theoretical sense.

Solution

Question C2

We have fitted two models so far:

  1. The complete pooling model: lm(hrv ~ stakes, data = bball), which ignores the fact that our data has some inherent grouping (multiple datapoints per participant)
  2. The no pooling model: lm(hrv ~ stakes*sub, data = bball), which estimates only participant-specific effects.

Compare the two models using anova(). Which model provides the best fit?

Solution

Question C3

Recall our research question: “How are stress levels (measured via HRV) influenced by the size of potential reward/loss?”

From the model lm(hrv ~ stakes*sub, data=bball), do we estimate a parameter that allows us to complete the following sentence?:

  • “For every 1 point increase in reward, heart rate variability changed by _ ? _ units.”

Solution

Question C4

Let’s suppose we want to examine the interaction between size and type of reward (stakes * condition), using the “no pooling” method (i.e., including participant as a fixed effect).

We have the variable stakes, that varies within each participant, and another variable condition that varies between participants.

This becomes difficult because each participant took part in only one condition, so the sub variable (the participant id variable) perfectly determines condition. Note that if we fit the following model, some coefficients are not defined because of this redundancy. Try it and see:

lm(hrv ~ stakes*sub + stakes*condition, data=bball)

This sort of perfectly balanced design has traditionally been approached with extensions of ANOVA (“repeated measures ANOVA”, “mixed ANOVA”). These methods can partition out variance due to one level of clustering (e.g. subjects), and can examine factorial designs when one factor is within cluster and the other is between. You can see an example below if you are interested. However, ANOVA has a lot of constraints: it can’t handle multiple levels of clustering (e.g. children in classes in schools), it will likely require treating variables such as time as a factor, and it’s not great with missing data. The multi-level model (MLM) provides a more flexible framework, and this is what we will begin to look at next week.