Exercises: Multiple Regression

More monkeys

Data: monkeytoys.csv

After their recent study investigating how age is associated with inquisitiveness in monkeys (see Week 5 Exercises), our researchers have become interested in whether primates show preferences for certain types of object - are monkeys more interested in toys with moving parts, or in soft plush toys?

They conducted another study (Liu, Hajnosz, Xu & Li, 2023¹) in which they gave 119 monkeys each a different toy, and recorded the amount of time each monkey spent exploring their toy. Toys were categorised as either ‘mechanical’ or ‘soft’: mechanical toys had several parts that could be manipulated, while soft toys did not. They also recorded the age of each monkey, and a few further attributes of each toy (its size and colour).

The aim of this study is to investigate the following question:

Do monkeys have a preference between soft toys vs toys with moving parts?

The data is available at https://uoepsy.github.io/data/monkeytoys.csv and contains the variables described in Table 1.

Table 1: Data dictionary for monkeytoys.csv

variable           description
name               Monkey Name
age                Age of monkey in years
obj_type           Type of novel object given (mechanical / soft)
obj_colour         Main colour of object (red / green / blue)
obj_size           Size of object in cm (length of largest dimension of the object)
exploration_time   Time (in minutes) spent exploring the object

Question 1

Fit a simple linear model examining whether exploration_time depends on the type of object given to monkeys (obj_type).

Make a plot too if you want!

There’s nothing new here. It’s just lm(outcome ~ predictor).

For the plot, try a boxplot maybe? or even a violin plot if you’re feeling adventurous!

library(tidyverse)  # for read_csv() and ggplot()

monkeytoys <- read_csv("https://uoepsy.github.io/data/monkeytoys.csv")


model1 <- lm(exploration_time ~ obj_type, data = monkeytoys)
summary(model1)

Call:
lm(formula = exploration_time ~ obj_type, data = monkeytoys)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.5361 -2.8155  0.2639  2.7492 11.2639 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.2361     0.5470  16.884   <2e-16 ***
obj_typesoft  -1.2705     0.7835  -1.622    0.108    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.272 on 117 degrees of freedom
Multiple R-squared:  0.02198,   Adjusted R-squared:  0.01362 
F-statistic: 2.629 on 1 and 117 DF,  p-value: 0.1076

From this, we would conclude that monkeys do not significantly differ in how much time they spend exploring one type of toy over another (mechanical or soft).

Let’s go wild and put a boxplot on top of a violin plot!

ggplot(monkeytoys, 
       aes(x=obj_type, y=exploration_time, 
           col=obj_type)) +
  geom_violin() + 
  geom_boxplot(alpha=.3, width=.4)

Question 2

Is the distribution of ages of the monkeys with soft toys similar to those with mechanical toys?

Is there a way you could test this?

We want to know whether age (continuous) differs between two groups (monkeys given soft toys and monkeys given mechanical toys). Anyone for \(t\)?

t.test(age ~ obj_type, data = monkeytoys)

    Welch Two Sample t-test

data:  age by obj_type
t = 4.2966, df = 116.97, p-value = 3.605e-05
alternative hypothesis: true difference in means between group mechanical and group soft is not equal to 0
95 percent confidence interval:
 2.604953 7.059829
sample estimates:
mean in group mechanical       mean in group soft 
                15.57377                 10.74138 

The average age of monkeys with the soft toys is 10.7 years (SD = 6), and the average of those with the mechanical toys is 15.6 (SD = 6.2). This difference is significant as indicated by a Welch two-sample \(t\)-test (\(t(117)=4.3, \, p<0.001\)).

Question 3

Discuss: What does this mean for our model of exploration_time? Remember - the researchers already discovered last week that younger monkeys tend to be more inquisitive about new objects than older monkeys are.

  • If older monkeys spend less time exploring novel objects
  • And our group of monkeys with mechanical toys are older than the group with soft toys.
  • Then…

We have reason to believe that older monkeys spend less time exploring novel objects. We discovered this last week.

Because our group of monkeys with mechanical toys is, on average, older than the group with soft toys, surely we can't discern whether any difference in exploration_time between the two types of toy is due to these age differences or to the type of toy itself?


Question 4

Fit a model examining the association between exploration time and type of object, while controlling for age.

  • When we add multiple predictors into lm(), it can sometimes matter what order we put them in (e.g. if we want to use anova(model) to do a quick series of incremental model comparisons, as in 7A #shortcuts-for-model-comparisons). Good practice is to put the thing you're interested in (the 'focal predictor') at the end, e.g.: lm(outcome ~ covariates + predictor-of-interest)

model2 <- lm(exploration_time ~ age + obj_type, data = monkeytoys)

summary(model2)

Call:
lm(formula = exploration_time ~ age + obj_type, data = monkeytoys)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3277 -2.2886  0.3889  2.3417  8.4889 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.98828    1.03058  13.573  < 2e-16 ***
age          -0.30514    0.05808  -5.253 6.86e-07 ***
obj_typesoft -2.74512    0.76092  -3.608 0.000457 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.856 on 116 degrees of freedom
Multiple R-squared:  0.2099,    Adjusted R-squared:  0.1963 
F-statistic: 15.41 on 2 and 116 DF,  p-value: 1.159e-06
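
Because obj_type was entered last (as the hint above suggests), anova(model2) would also give us an incremental test of obj_type after accounting for age - a quick sketch:

anova(model2)  # sequential (Type I) tests: age first, then obj_type given age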

Question 5

The thing we’re interested in here is association between exploration_time and obj_type.
How does it differ between the models we’ve created so far, and why?

  • To quickly compare several models side by side, the tab_model() function from the sjPlot package can be quite useful, e.g. tab_model(model1, model2, ...).
  • alternatively, just use summary() on each model.
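
For example, a minimal sketch of the side-by-side comparison (assuming the sjPlot package is installed):

library(sjPlot)
tab_model(model1, model2)  # coefficients, CIs and R-squared for both models, side by side

In particular, look at what happens to the obj_typesoft coefficient once age is in the model.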

Question 6

Plot the model estimated difference in exploration time for each object type.

To do this, you’ll need to create a little data frame for plotting, then give that to the augment() function from the broom package. This will then give us the model fitted value and the confidence interval, which we can plot!

  • An example of this whole process is in 7A#model-visualisations.
    • The example has a continuous predictor, so we plotted a line and a ribbon. An alternative for a categorical predictor might be a geom_pointrange().

We've split this solution into parts so that you can have a go at some bits without seeing it all at once.
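
For a rough idea of the end point, one sketch (assuming we're plotting from model2 of Question 4, holding age at its mean) might be:

library(broom)

plotdat <- data.frame(
  age = mean(monkeytoys$age),          # hold age at its mean
  obj_type = c("mechanical", "soft")   # the two object types
)

augment(model2, newdata = plotdat, interval = "confidence") |>
  ggplot(aes(x = obj_type, y = .fitted)) +
  geom_pointrange(aes(ymin = .lower, ymax = .upper)) +
  labs(x = "Object type", y = "Estimated exploration time (mins)")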

Question 7

Are other aspects of the toys (their size and colour) also associated with more/less exploration time?

We can phrase this as “do size and colour explain additional variance in exploration time?”. How might we test such a question?

  • We basically just want to add these new predictors into our model.
  • Don’t worry about interpreting the coefficients right now (we’ll talk more about categorical predictors next week), but we can still test whether the inclusion of size and colour improve our model! (see 7A#model-comparisons).

This is our current model:

model2 <- lm(exploration_time ~ age + obj_type, data = monkeytoys)

And we can add the two colour and size variables:

model3 <- lm(exploration_time ~ age + obj_type + obj_colour +
               obj_size, data = monkeytoys)

Let’s compare them:

anova(model2, model3)
Analysis of Variance Table

Model 1: exploration_time ~ age + obj_type
Model 2: exploration_time ~ age + obj_type + obj_colour + obj_size
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1    116 1725.2                                
2    113 1546.7  3    178.47 4.3464 0.006148 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After accounting for differences due to age and type of object (mechanical vs soft), other features of objects - size (cm) and colour (red/green/blue) - were found to significantly explain variation in the time monkeys spent exploring those objects (\(F(3,113)=4.35, \, p=0.0061\)).

Social Media Use

Data: socmedmin.csv

Is more social media use associated with more happiness?

77 participants completed a short online form that included a questionnaire (9 questions) to get a measure of their happiness. Information was also recorded on their age, the number of minutes per day spent using social media, and the number of hours per day spent having face-to-face interactions.

The data is available at https://uoepsy.github.io/data/socmedmin.csv

Table 2: Data dictionary for socmedmin.csv

variable   description
name       Participant Name
age        Participant Age (years)
happy      Happiness Score (sum of 9 likert questions each scored 1-5)
smm        Social Media Use (minutes per day)
f2f        Face-to-face interactions (hours per day)

Question 8

Read in the data and have a look around.

Data often doesn’t come to us in a neat format. Something here is a bit messy, so you’ll need to figure out how to tidy it up.

Is every variable of the right type (numeric, character, factor etc)? If not, we’ll probably want to convert any that aren’t the type we want.

Be careful not to lose data when we convert things. Note that R cannot do this:

as.numeric("20 minutes")
[1] NA

So maybe some combination of as.numeric() and gsub() might work? (see 6A #dealing-with-character-strings)
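
For instance, here's a rough sketch of the general idea (which variable is messy, and what the extra text looks like, is for you to discover - the " minutes" suffix below is just an assumption):

library(tidyverse)
smmdat <- read_csv("https://uoepsy.github.io/data/socmedmin.csv")

# gsub() strips out the text, and as.numeric() converts what's left:
as.numeric(gsub(" minutes", "", "20 minutes"))

# the same logic inside mutate() would clean up a whole column, e.g. (hypothetical):
# smmdat <- smmdat |> mutate(smm = as.numeric(gsub("[^0-9.]", "", smm)))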

Question 9

Something we haven’t really come across until now is the importance of checking the range (i.e. min and max) of our variables. This is a good way to check for errors in the data (i.e. values that we shouldn’t be able to obtain using our measurement tool).

Check the range of the happy variable - does it look okay, based on the description of how it is recorded?

  • min(), max(), or even just range()!!
  • If there are any observations you think have impossible values, then you could set them as NA for now (see 6A #impossible-values).

range(smmdat$happy)
[1]  6 47

The minimum is 6, and the maximum is 47.

But our scores are based on the sum of 9 questions that can each score from 1 to 5. So surely that means the minimum someone could score would be 9, and the maximum they could score would be 45?

smmdat <- smmdat |>
  mutate(
    happy = ifelse(happy < 9 | happy > 45, NA, happy)
  )

Question 10

Finally, we've got to a point where we can look at some descriptive statistics - e.g. means and standard deviations for continuous variables, and frequencies (counts) if there are any categorical variables.

You can do this the manual way, calculating it for each variable, but there are also lots of handy functions to make use of.

  • describe() from the psych package
  • CreateTableOne() from the tableone package

Tidyverse

In tidyverse, we could do this like:

smmdat |> 
  summarise(
    mean_age = mean(age),
    sd_age = sd(age),
    mean_happy = mean(happy, na.rm=TRUE),
    sd_happy = sd(happy, na.rm=TRUE),
    mean_smm = mean(smm),
    sd_smm = sd(smm)
  )
# A tibble: 1 × 6
  mean_age sd_age mean_happy sd_happy mean_smm sd_smm
     <dbl>  <dbl>      <dbl>    <dbl>    <dbl>  <dbl>
1     27.1   6.62       25.1     6.71     76.4   24.5

psych::describe

We can use the describe function from the psych package:

library(psych)
describe(smmdat)
      vars  n  mean    sd median trimmed   mad min max range  skew kurtosis
name*    1 77 37.90 21.81   38.0   37.87 28.17   1  75    74  0.00    -1.27
age      2 77 27.09  6.62   27.0   27.00  7.41  12  43    31  0.11    -0.54
happy    3 75 25.12  6.71   26.0   25.11  7.41  11  38    27 -0.05    -0.83
smm      4 77 76.36 24.50   75.0   75.08 29.65  30 135   105  0.41    -0.52
f2f      5 77  1.53  0.73    1.5    1.53  0.74   0   3     3 -0.03    -0.68
        se
name* 2.49
age   0.75
happy 0.77
smm   2.79
f2f   0.08

tableone

The tableone package will also be clever and give counts and percentages for categorical data like the name variable. However, names aren’t something we really want to summarise, so it’s easier just to give the function everything except the name variable.

library(tableone)
CreateTableOne(data = smmdat |> select(-name) )
                   
                    Overall      
  n                    77        
  age (mean (SD))   27.09 (6.62) 
  happy (mean (SD)) 25.12 (6.71) 
  smm (mean (SD))   76.36 (24.50)
  f2f (mean (SD))    1.53 (0.73) 

Question 12

For our research question (“Is more social media use associated with more happiness?”), we could consider fitting the following model:

lm(happy ~ smm, data = smmdat)

But is it really as simple as that? Surely there are lots of other things that are relevant. For instance, it's quite reasonable to assume that social media use is related to someone's age, and it's quite likely that happiness changes with age too. That means the coefficient from happy ~ smm could actually be capturing changes in happiness that are due to something else (like age). Similarly, people who use social media a lot might just be more sociable people (i.e. they might see more people in real life, and that might be what makes them happy).

Especially in observational research (i.e. we aren’t intervening and asking some people to use social media and others to not use it), figuring out the relevant association that we want can be incredibly tricky.

As it happens, we do have data on these participants’ ages, and on the amount of time they spend having face-to-face interactions with people!

Look at how all these variables correlate with one another, and make some quick plots.

  • You can give a data frame of numeric variables to cor() and it gives you all the correlations between pairs of variables in a “correlation matrix”
  • if you want some quick pair-wise plots (not pretty, but useful!), try the pairs.panels() function from the psych package.

Here is a correlation matrix. It shows the correlations between each pair of variables.
Note that the bit below the diagonal is the same as the bit above. The diagonal is always going to be all 1’s, because a variable is always perfectly correlated with itself.

smmdat |> 
  select(-name) |>
  filter(!is.na(happy)) |>
  cor()
              age     happy        smm         f2f
age    1.00000000 0.4749254 -0.2426590 -0.02776556
happy  0.47492536 1.0000000  0.3135681  0.74264149
smm   -0.24265898 0.3135681  1.0000000  0.46924107
f2f   -0.02776556 0.7426415  0.4692411  1.00000000

The pairs.panels() function is a useful way to quickly explore the bivariate (two-variable) patterns in a dataset:

library(psych)
smmdat |> 
  select(-name) |>
  pairs.panels()

Question 13

Is social media usage associated with happiness, after accounting for age and the number of face-to-face interactions?

  • This question can be answered in a couple of ways. You might be able to do some sort of model comparison, or you could look at the test of a coefficient.

We could do this by fitting the model with age and f2f predicting happy, and then compare that to the model also with smm:

mod1 <- lm(happy ~ age + f2f, smmdat)
mod2 <- lm(happy ~ age + f2f + smm, smmdat)
anova(mod1, mod2)
Analysis of Variance Table

Model 1: happy ~ age + f2f
Model 2: happy ~ age + f2f + smm
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     72 674.67                              
2     71 646.12  1    28.546 3.1368 0.08084 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is testing the addition of one parameter (thing being estimated) to our model - the coefficient for smm.
We can see that it is just one more parameter because the table above shows that the additional degrees of freedom taken up by mod2 is 1 (the “Df” column, and the change in the “Res.Df” column).

So we could actually just look at the test of that individual parameter, and whether it is different from zero. It's the same test (for a single added parameter, the t-statistic squared equals the F-statistic: \(1.771^2 \approx 3.14\)):

summary(mod2)

Call:
lm(formula = happy ~ age + f2f + smm, data = smmdat)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3208 -1.9972  0.1685  1.7089  7.5643 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.41597    2.10045  -0.674   0.5024    
age          0.53031    0.05516   9.614 1.74e-14 ***
f2f          6.42390    0.54145  11.864  < 2e-16 ***
smm          0.02953    0.01667   1.771   0.0808 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.017 on 71 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.8058,    Adjusted R-squared:  0.7976 
F-statistic: 98.23 on 3 and 71 DF,  p-value: < 2.2e-16

Question 14

Plot the model estimated association between social media usage and happiness.

This follows just the same logic as we did for the monkeys!

library(broom)  # augment() comes from the broom package

plotdat <- data.frame(
  age = mean(smmdat$age), # mean age
  f2f = mean(smmdat$f2f), # mean f2f interactions
  smm = 0:150 # social media use from 0 to 150 mins
)

augment(mod2, newdata = plotdat, interval = "confidence") |>
  ggplot(aes(x = smm, y = .fitted)) + 
  geom_line() + 
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha=.3)+
  labs(x = "social media usage\n(minutes per day)",
       y = "Happiness Score",
       title = "Happiness and social media usage",
       subtitle = "controlling for age and IRL interactions")

Optional Extra

In all the plots we have been making from our models, the other predictors in our model (e.g. age and f2f in this case) have been held at their mean.

What happens if you create a plot estimating the association between happiness and social media usage, but having age at 15, or 30, or 45?

The slope doesn't change at all, but note that the line shifts up and down. This makes sense, because our model coefficients indicated that happiness goes up with age.
Note also that the uncertainty changes, and this is what we want - we have less data on people at age 45, or 15, so we are less confident in our estimates.

I'm going to use this space to show you a little trick that may come in handy. The function expand_grid() will take all combinations of the values you give it for each variable, and expand outwards. For example:

expand_grid(
  v1 = c("a","b"),
  v2 = 1:4
)
# A tibble: 8 × 2
  v1       v2
  <chr> <int>
1 a         1
2 a         2
3 a         3
4 a         4
5 b         1
6 b         2
7 b         3
8 b         4

So instead of creating lots of individual plotdat data frames (one for each of ages 15, 30, and 45), we can create just one that contains all three, and then deal with that in the ggplot.

One-by-one

plotdat1 <- data.frame(
  age = 15,
  f2f = mean(smmdat$f2f), 
  smm = 0:150 
)
plotdat2 <- data.frame(
  age = 30,
  f2f = mean(smmdat$f2f), 
  smm = 0:150 
)
plotdat3 <- data.frame(
  age = 45,
  f2f = mean(smmdat$f2f), 
  smm = 0:150 
)

p1 <- augment(mod2, newdata = plotdat1, interval = "confidence") |>
  ggplot(aes(x = smm, y = .fitted)) + 
  geom_line() + 
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha=.3)+
  ylim(5,45)

p2 <- augment(mod2, newdata = plotdat2, interval = "confidence") |>
  ggplot(aes(x = smm, y = .fitted)) + 
  geom_line() + 
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha=.3)+
  ylim(5,45)

p3 <- augment(mod2, newdata = plotdat3, interval = "confidence") |>
  ggplot(aes(x = smm, y = .fitted)) + 
  geom_line() + 
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha=.3)+
  ylim(5,45)

# the patchwork package allows us to add plots together:
library(patchwork)
p1 + p2 + p3

All-in-one


plotdat <- 
  expand_grid(
    age = c(15,30,45),
    f2f = mean(smmdat$f2f),
    smm = 0:150 
  )

augment(mod2, newdata = plotdat, interval = "confidence") |>
  ggplot(aes(x = smm, y = .fitted)) + 
  geom_line() + 
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha=.3) +
  facet_wrap(~age)

Question 15

How many observations has our model been fitted to?

It's not just the 77 people we have in the dataset...

Because we replaced those impossible happy values (the ones outside the possible 9-45 range) with NA, we have fewer bits of data to fit our model to.

That’s because lm() will default to something called “listwise deletion”, which will remove any observation where any variable in the model (outcome or predictor) is missing.

We can see how many observations went into our model because we know how many residuals we have:

length(residuals(mod2))
[1] 75

And we can also see it from the degrees of freedom at the bottom of the summary() output. We know that we have \(n-k-1\) degrees of freedom (see 7A #the-f-statistic-a-joint-test), and that is shown as 71 here. \(k\) is the number of predictors, which we know is 3. So \(n = 71 + 3 + 1 = 75\)!

summary(mod2)
...
F-statistic: 98.23 on 3 and 71 DF,  p-value: < 2.2e-16
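
A quicker way to get the same number is nobs(), which reports how many observations a fitted model actually used:

nobs(mod2)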

Question 16

Check the assumptions of your model.

7B Assumptions & Diagnostics shows how we can do this. We can rely on tests if we like, or we can do it visually through plots. Getting a good sense of “what looks weird” in these assumption plots is something that comes with time.

Here are our plots...
They don't look too bad to me (you might feel differently!)

plot(mod2)

More RMarkdown

Question 17
  1. Open a .Rmd document, and delete all the template stuff.
    • Keep the code chunk at the top called “setup”
  2. In the "setup" chunk, set echo = FALSE (i.e. knitr::opts_chunk$set(echo = FALSE)). This will make sure all code gets hidden from the final rendered document.
  3. Make a new code chunk called "analysis", and for this chunk, set include = FALSE. This will make sure that the code in that chunk gets run, but does not actually produce any output.
  4. Shove all our working analysis (below) into that "analysis" code chunk.
  5. Write a paragraph describing the analysis.
    • what type of model/analysis is being conducted?
    • what is the outcome? the predictors? how are they scaled?
  6. Write a paragraph highlighting the key results. Try to use inline R code. This example may help.
  7. In a new code chunk, create and show a plot. Make sure this code chunk is set to include = TRUE, because we do want the output to show (leaving it blank will also work, because this is the default).
  8. Click Knit!
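
Steps 1-4 (and a little of step 6) might look something like the skeleton below. The title, chunk names, and the inline nrow() sentence are just placeholders, and the inline code assumes the "analysis" chunk has created the liedf data:

---
title: "Polygraph analysis"
output: html_document
---

```{r setup, include = FALSE}
# step 2: hide all code from the rendered document
knitr::opts_chunk$set(echo = FALSE)
```

```{r analysis, include = FALSE}
# steps 3 & 4: this chunk runs, but shows neither code nor output.
# the working analysis code (below) goes in here.
```

We fitted a linear regression of average heart rate on age, anxiety and
pre-test strategy, using data from `r nrow(liedf)` participants.

```{r plot}
# step 7: a chunk for the plot (include = TRUE is the default, so its output shows)
```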

Law enforcement agencies in some countries regularly rely on ‘polygraph’ tests as a form of ‘lie detection’. These tests involve measuring heart rate, blood pressure, respiratory rate and sweat. However, there is very little evidence to suggest that these methods are remotely accurate in determining whether or not someone is lying.

Researchers are interested in whether people's heart rates during polygraph tests can be influenced by various pre-test strategies, such as deep breathing or closing their eyes. They recruited 142 participants (ages 22 to 64). Participants were told they were playing a game in which their task was to deceive the polygraph test, and that they would receive financial rewards if they managed to do so. At the outset of the study, they completed a questionnaire which asked about their anxiety in relation to taking part. Participants then chose one of 4 strategies to prepare themselves for the test, each lasting 1 minute. These were "do nothing", "deep breathing", "close your eyes" or "cough"². The average heart rate of each participant was recorded during their test.

usmr_polygraph.csv data dictionary

variable   description
age        Age of participant (years)
anx        Anxiety measure (Z-scored)
strategy   Pre-test Strategy (0 = do nothing, 1 = close eyes, 2 = cough, 3 = deep breathing)
hr         Average Heart Rate (bpm) during test

Analysis

# load libraries
library(tidyverse)
library(psych)
# read in the data
liedf <- read_csv("https://uoepsy.github.io/data/usmr_polygraph.csv")

# there seems to be a 5 there.. 
table(liedf$strategy)
# the other variables look okay though
describe(liedf)
pairs.panels(liedf)


liedf <- liedf |> 
  filter(strategy!=5) |>
  mutate(
    # strategy should be a factor, but is currently stored as numbers.
    # I'm going to give the levels better labels too.
    # To do this we need to tell factor() what "levels" to look for,
    # and then give it some "labels" to apply to those.
    strategy = factor(strategy, 
                      levels = c("0","1","2","3"),
                      labels = c("do nothing", "close eyes",
                                 "cough", "deep breathing")
                      )
  )

liemod <- lm(hr ~ age + anx + strategy, data = liedf)

# Does HR differ between strategies?
anova(liemod)
# the above is a shortcut for getting this comparison out:
anova(
  lm(hr ~ age + anx, data = liedf),
  lm(hr ~ age + anx + strategy, data = liedf)
)

# I want a plot to show the HRs of the different strategies...
# ??
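
# one possibility (just a sketch - many sensible plots would work):
# raw heart rates for each strategy
ggplot(liedf, aes(x = strategy, y = hr)) +
  geom_boxplot() +
  labs(x = "Pre-test strategy", y = "Average heart rate (bpm)")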

Footnotes

  1. Another fake study!

  2. Apparently coughing is a method of immediately lowering heart rate!