Exercises: Multiple Regression

More monkeys

Data: monkeytoys.csv

After their recent study investigating how age is associated with inquisitiveness in monkeys (see Week 5 Exercises), our researchers have become interested in whether primates show preferences for certain types of object: are they more interested in toys with moving parts, or in soft plush toys?

They conduct another study (Bolton, Archer, Peng, Winther & Gandolfi, 2024¹) in which they gave 119 monkeys each a different toy, and recorded the amount of time each monkey spent exploring their toy. Toys were categorised as either ‘mechanical’ or ‘soft’. Mechanical toys had several parts that could be manipulated, while soft toys did not. They also recorded the age of each monkey, and a few further attributes of each toy (its size and colour).

The aim of this study is to investigate the following question:

Do monkeys have a preference between soft toys vs toys with moving parts?

The data is available at https://uoepsy.github.io/data/monkeytoys.csv and contains the variables described in Table 1.

Table 1: Data dictionary for monkeytoys.csv
variable           description
name               Monkey Name
age                Age of monkey in years
obj_type           Type of novel object given (mechanical / soft)
obj_colour         Main colour of object (red / green / blue)
obj_size           Size of object in cm (length of largest dimension of the object)
exploration_time   Time (in minutes) spent exploring the object
Question 1

Fit a simple linear model examining whether exploration_time depends on the type of object given to monkeys (obj_type).

Make a plot too if you want!

There’s nothing new here. It’s just lm(outcome ~ predictor).

For the plot, maybe try a boxplot? Or even a violin plot, if you’re feeling adventurous!
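A minimal sketch of what this could look like, assuming the data have been read into a data frame called monkeys (that name is just a placeholder):

library(tidyverse)
monkeys <- read_csv("https://uoepsy.github.io/data/monkeytoys.csv")

# simple model: does exploration time differ by object type?
mod1 <- lm(exploration_time ~ obj_type, data = monkeys)
summary(mod1)

# quick boxplot of exploration time for each object type
ggplot(monkeys, aes(x = obj_type, y = exploration_time)) +
  geom_boxplot()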

Question 2

Is the distribution of ages of the monkeys with soft toys similar to those with mechanical toys?

Is there a way you could test this?

We’re wanting to know if age (continuous) differs between two groups (monkeys given soft toys and monkeys given mechanical toys). Anyone for t?
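For instance, a sketch using the same monkeys data frame as above:

# does the mean age differ between the two object-type groups?
t.test(age ~ obj_type, data = monkeys)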

Question 3

Discuss: What does this mean for our model of exploration_time? Remember - the researchers already discovered last week that younger monkeys tend to be more inquisitive about new objects than older monkeys are.

  • If older monkeys spend less time exploring novel objects
  • And our group of monkeys with mechanical toys is older than the group with soft toys.
  • Then…

Question 4

Fit a model examining the association between exploration time and type of object, while controlling for age.

  • When we add multiple predictors in to lm(), it can sometimes matter what order we put them in (e.g. if we want to use anova(model) to do a quick series of incremental model comparisons as in 7A #shortcuts-for-model-comparisons). Good practice is to put the thing you’re interested in (the ‘focal predictor’) at the end, e.g.: lm(outcome ~ covariates + predictor-of-interest)
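A sketch of what that might look like, with the focal predictor (obj_type) last (object names as in the earlier sketches):

# exploration time by object type, controlling for age
mod2 <- lm(exploration_time ~ age + obj_type, data = monkeys)
summary(mod2)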

Question 5

The thing we’re interested in here is association between exploration_time and obj_type.
How does it differ between the models we’ve created so far, and why?

  • To quickly compare several models side by side, the tab_model() function from the sjPlot package can be quite useful, e.g. tab_model(model1, model2, ...).
  • Alternatively, just use summary() on each model.
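For example (assuming the models are called mod1 and mod2, as in the sketches above):

library(sjPlot)
# both models side by side in one table
tab_model(mod1, mod2)

# or just look at each one separately
summary(mod1)
summary(mod2)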

Question 6

Plot the model estimated difference in exploration time for each object type.

To do this, you’ll need to create a little data frame for plotting, then give that to the augment() function from the broom package. This will then give us the model fitted value and the confidence interval, which we can plot!

  • An example of this whole process is in 7A#model-visualisations.
    • The example has a continuous predictor, so we plotted a line and a ribbon. An alternative for a categorical predictor might be a geom_pointrange().
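A rough sketch of the whole process, assuming the model from Question 4 is called mod2 and that obj_type has the levels “mechanical” and “soft” (age is held at its mean):

library(broom)
# the values at which we want model-estimated exploration times
plotdat <- data.frame(
  obj_type = c("mechanical", "soft"),
  age = mean(monkeys$age)
)
# get fitted values plus confidence intervals, then plot them
augment(mod2, newdata = plotdat, interval = "confidence") |>
  ggplot(aes(x = obj_type, y = .fitted, ymin = .lower, ymax = .upper)) +
  geom_pointrange() +
  labs(x = "Object type", y = "Estimated exploration time (mins)")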

We’ve split this solution into parts so that you can have a go at some bits without seeing it all at once.

Question 7

Are other aspects of the toys (their size and colour) also associated with more/less exploration time?

We can phrase this as “do size and colour explain additional variance in exploration time?”. How might we test such a question?

  • We basically just want to add these new predictors into our model.
  • Don’t worry about interpreting the coefficients right now (we’ll talk more about categorical predictors next week), but we can still test whether the inclusion of size and colour improve our model! (see 7A#model-comparisons).
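One possible sketch, comparing the model from Question 4 (called mod2 here) against a model that also includes size and colour:

mod3 <- lm(exploration_time ~ age + obj_type + obj_size + obj_colour,
           data = monkeys)
# do size and colour explain additional variance in exploration time?
anova(mod2, mod3)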

Social Media Use

Data: socmedmin.csv

Is more social media use associated with more happiness?

77 participants completed a short online form that included a questionnaire (9 questions) to get a measure of their happiness. Information was also recorded on their age, the number of minutes per day spent using social media, and the number of hours per day spent having face-to-face interactions.

The data is available at https://uoepsy.github.io/data/socmedmin.csv

Table 2: Data Dictionary: socmedmin.csv
variable   description
name       Participant Name
age        Participant Age (years)
happy      Happiness Score (sum of 9 Likert questions, each scored 1-5)
smm        Social Media Use (minutes per day)
f2f        Face-to-face interactions (hours per day)
Question 8

Read in the data and have a look around.

Data often doesn’t come to us in a neat format. Something here is a bit messy, so you’ll need to figure out how to tidy it up.

Is every variable of the right type (numeric, character, factor etc)? If not, we’ll probably want to convert any that aren’t the type we want.

Be careful not to lose data when we convert things. Note that R can’t parse text like this as a number, so it just returns NA:

as.numeric("20 minutes")
[1] NA

So maybe some combination of as.numeric() and gsub() might work? (see 6A #dealing-with-character-strings)
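A sketch of the sort of thing that could work. It assumes the messy column is smm, stored as text such as “20 mins” — check your own data to see which column actually needs cleaning:

library(tidyverse)
smmdat <- read_csv("https://uoepsy.github.io/data/socmedmin.csv")

# strip out everything that isn't a digit or a decimal point, then convert
smmdat <- smmdat |>
  mutate(smm = as.numeric(gsub("[^0-9.]", "", smm)))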

Question 9

Something we haven’t really come across until now is the importance of checking the range (i.e. min and max) of our variables. This is a good way to check for errors in the data (i.e. values that we shouldn’t be able to obtain using our measurement tool).

Check the range of the happy variable - does it look okay, based on the description of how it is recorded?

  • min(), max(), or even just range()!!
  • If there are any observations you think have impossible values, then you could set them as NA for now (see 6A #impossible-values).
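For example — the cut-offs below come from the scale description (9 items each scored 1-5, so possible totals run from 9 to 45):

range(smmdat$happy, na.rm = TRUE)

# replace impossible happiness scores with NA
smmdat <- smmdat |>
  mutate(happy = ifelse(happy < 9 | happy > 45, NA, happy))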

Question 10

Finally, we’ve got to a point where we can look at some descriptive statistics - e.g. mean and standard deviations for continuous variables, frequencies (counts) if there are any categorical variables.

You can do this the manual way, calculating it for each variable, but there are also lots of handy functions to make use of.

  • describe() from the psych package
  • CreateTableOne() from the tableone package
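For instance (assuming both packages are installed):

library(psych)
library(tableone)

# means, SDs, ranges etc. for each variable
describe(smmdat)
# a descriptives table including counts for any categorical variables
CreateTableOne(data = smmdat)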

Question 12

For our research question (“Is more social media use associated with more happiness?”), we could consider fitting the following model:

lm(happy ~ smm, data = smmdat)

But is it not a bit more complicated than that? Surely there are lots of other things that are relevant? For instance, it’s quite reasonable to assume that social media use is related to someone’s age, and it’s also quite likely that happiness changes with age. That means our coefficient from happy ~ smm could actually just reflect changes in happiness due to something else (like age). Similarly, people who use social media a lot might just be more sociable people (i.e. they might see more people in real life, and that might be what makes them happy).

Especially in observational research (i.e. we aren’t intervening and asking some people to use social media and others to not use it), figuring out the relevant association that we want can be incredibly tricky.

As it happens, we do have data on these participants’ ages, and on the amount of time they spend having face-to-face interactions with people!

Look at how all these variables correlate with one another, and make some quick plots.

  • You can give a data frame of numeric variables to cor() and it gives you all the correlations between pairs of variables in a “correlation matrix”.
  • If you want some quick pair-wise plots (not pretty, but useful!), try the pairs.panels() function from the psych package.
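A quick sketch (the select() call assumes the numeric variables are named as in the data dictionary):

# correlation matrix of the numeric variables
smmdat |>
  select(happy, smm, age, f2f) |>
  cor(use = "complete.obs")

# quick (not pretty) pairwise plots
library(psych)
pairs.panels(smmdat |> select(happy, smm, age, f2f))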

Question 13

Is social media usage associated with happiness, after accounting for age and the number of face-to-face interactions?

  • This question can be answered in a couple of ways. You might be able to do some sort of model comparison, or you could look at the test of a coefficient.
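Both routes, sketched out (the model names are just placeholders):

# coefficient test: is smm associated with happy, over and above age and f2f?
smmmod <- lm(happy ~ age + f2f + smm, data = smmdat)
summary(smmmod)

# or an equivalent model comparison
# (both models need to be fitted to the same rows, so drop missing values first)
smmdat_cc <- smmdat |> drop_na(happy, smm, age, f2f)
anova(
  lm(happy ~ age + f2f, data = smmdat_cc),
  lm(happy ~ age + f2f + smm, data = smmdat_cc)
)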

Question 14

Plot the model estimated association between social media usage and happiness.

This follows just the same logic as we did for the monkeys!
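A sketch mirroring the monkey example, assuming the model from Question 13 is called smmmod (age and f2f held at their means):

library(broom)
plotdat <- data.frame(
  smm = seq(min(smmdat$smm, na.rm = TRUE), max(smmdat$smm, na.rm = TRUE),
            length.out = 50),
  age = mean(smmdat$age, na.rm = TRUE),
  f2f = mean(smmdat$f2f, na.rm = TRUE)
)
augment(smmmod, newdata = plotdat, interval = "confidence") |>
  ggplot(aes(x = smm, y = .fitted, ymin = .lower, ymax = .upper)) +
  geom_ribbon(alpha = .3) +
  geom_line() +
  labs(x = "Social media use (mins/day)", y = "Estimated happiness score")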

Optional Extra

In all the plots we have been making from our models, the other predictors in our model (e.g. age and f2f in this case) have been held at their mean.

What happens if you create a plot estimating the association between happiness and social media usage, but having age at 15, or 30, or 45?
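One way to sketch this, using expand_grid() from tidyr to cross the social media values with a few chosen ages (f2f still held at its mean; object names as before):

plotdat2 <- expand_grid(
  smm = seq(min(smmdat$smm, na.rm = TRUE), max(smmdat$smm, na.rm = TRUE),
            length.out = 50),
  age = c(15, 30, 45),
  f2f = mean(smmdat$f2f, na.rm = TRUE)
)
augment(smmmod, newdata = plotdat2, interval = "confidence") |>
  ggplot(aes(x = smm, y = .fitted, ymin = .lower, ymax = .upper,
             col = factor(age), fill = factor(age))) +
  geom_ribbon(alpha = .2, colour = NA) +
  geom_line() +
  labs(x = "Social media use (mins/day)", y = "Estimated happiness score",
       col = "Age", fill = "Age")

Because there is no age × smm interaction in this model, the three lines will be parallel — only shifted up or down.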

Question 15

How many observations has our model been fitted to?

It’s not necessarily all 77 people we have in the dataset…
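A couple of quick ways to check, assuming the model from Question 13 is called smmmod:

# how many observations did the model actually use?
nobs(smmmod)

# equivalently: how many rows have complete data on all the model variables?
sum(complete.cases(smmdat[, c("happy", "age", "f2f", "smm")]))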

Question 16

Check the assumptions of your model.

7B Assumptions & Diagnostics shows how we can do this. We can rely on tests if we like, or we can do it visually through plots. Getting a good sense of “what looks weird” in these assumption plots is something that comes with time.
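For the visual route, a minimal sketch (using smmmod again):

# the four standard diagnostic plots for a linear model
par(mfrow = c(2, 2))
plot(smmmod)
par(mfrow = c(1, 1))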

More RMarkdown/Quarto

Question 17
  1. Open a .Rmd or .qmd document, and delete all the template stuff.
    • Keep the code chunk at the top called “setup”
  2. In the “setup” chunk, set echo = FALSE (for a .Rmd, this means changing the knitr::opts_chunk$set() line to echo = FALSE). This will make sure all code gets hidden from the final rendered document.
  3. Make a new code chunk called “analysis”, and for this chunk, set include = FALSE. This will make sure that the code in that chunk gets run, but does not actually produce any output.
  4. Shove all our working analysis (below) in that ‘analysis’ code chunk.
  5. Write a paragraph describing the analysis.
    • what type of model/analysis is being conducted?
    • what is the outcome? the predictors? how are they measured?
  6. Write a paragraph highlighting the key results. Try to use inline R code. This example may help.
  7. In a new code chunk, create and show a plot. Make sure this code chunk is set to include = TRUE, because we do want the output to show (leaving it blank will also work, because this is the default).
  8. Click Knit!
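A rough sketch of what the chunk headers in a .Rmd might end up looking like (the chunk names here are just examples):

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

```{r analysis, include = FALSE}
# all the working analysis code goes in here
```

```{r hrplot, include = TRUE}
# the code that creates and shows the plot goes in here
```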

Law enforcement agencies in some countries regularly rely on ‘polygraph’ tests as a form of ‘lie detection’. These tests involve measuring heart rate, blood pressure, respiratory rate and sweat. However, there is very little evidence that these methods are remotely accurate in determining whether or not someone is lying.

Researchers are interested in whether people’s heart rates during polygraph tests can be influenced by various pre-test strategies, such as deep breathing or closing their eyes. They recruited 142 participants (aged 22 to 64). Participants were told they were playing a game in which their task was to deceive the polygraph test, and that they would receive financial rewards if they managed to do so. At the outset of the study, they completed a questionnaire which asked about their anxiety in relation to taking part. Participants then chose one of 4 strategies to prepare themselves for the test, each lasting 1 minute. These were “do nothing”, “deep breathing”, “close your eyes” or “cough”². The average heart rate of each participant was recorded during their test.

usmr_polygraph.csv data dictionary
variable   description
age        Age of participant (years)
anx        Anxiety measure (Z-scored)
strategy   Pre-test Strategy (0 = do nothing, 1 = close eyes, 2 = cough, 3 = deep breathing)
hr         Average Heart Rate (bpm) during test

Analysis

Code
# load libraries
library(tidyverse)
library(psych)
# read in the data
liedf <- read_csv("https://uoepsy.github.io/data/usmr_polygraph.csv")

# there seems to be a 5 there.. 
table(liedf$strategy)
# the other variables look okay though
describe(liedf)
pairs.panels(liedf)


liedf <- liedf |> 
  filter(strategy!=5) |>
  mutate(
    # strategy is a factor. but currently numbers
    # i'm going to give them better labels too.. 
    # to do this I need to tell factor() what "levels" to look for
    # and then give it some "labels" to apply to those.
    strategy = factor(strategy, 
                      levels = c("0","1","2","3"),
                      labels = c("do nothing", "close eyes",
                                 "cough", "deep breathing")
                      )
  )

liemod <- lm(hr ~ age + anx + strategy, data = liedf)

# Does HR differ between strategies?
anova(liemod)
# the above is a shortcut for getting this comparison out:
anova(
  lm(hr ~ age + anx, data = liedf),
  lm(hr ~ age + anx + strategy, data = liedf)
)

# i want a plot to show the HRs of different strategies.. 
# ??
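# one possible sketch (not the only option): get model-estimated heart rates
# for each strategy, holding age and anx at their means, then plot the
# estimates with their confidence intervals
library(broom)
plotdat <- data.frame(
  strategy = levels(liedf$strategy),
  age = mean(liedf$age),
  anx = mean(liedf$anx)
)
augment(liemod, newdata = plotdat, interval = "confidence") |>
  ggplot(aes(x = strategy, y = .fitted, ymin = .lower, ymax = .upper)) +
  geom_pointrange() +
  labs(x = "Pre-test strategy", y = "Estimated average heart rate (bpm)")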

Footnotes

  1. Another fake study!

  2. Apparently, coughing is a method of immediately lowering heart rate!