library(tidyverse)
library(patchwork) #used to arrange plots
<- read_csv("https://uoepsy.github.io/data/wellbeing.csv")
mwdata
<-
wellbeing_plot ggplot(data = mwdata, aes(x = wellbeing)) +
geom_density() +
geom_boxplot(width = 1/250) +
labs(x = "Score on WEMWBS (range 14-70)", y = "Probability\ndensity")
<-
outdoortime_plot ggplot(data = mwdata, aes(x = outdoor_time)) +
geom_density() +
geom_boxplot(width = 1/200) +
labs(x = "Time spent outdoors per week (hours)", y = "Probability\ndensity")
<-
social_plot ggplot(data = mwdata, aes(x = social_int)) +
geom_density() +
geom_boxplot(width = 1/150) +
labs(x = "Number of social interactions per week", y = "Probability\ndensity")
/ outdoortime_plot / social_plot wellbeing_plot
Week 8 Exercises: Multiple Regression
Wellbeing and Outdoor Time
Research Question: after taking into account differences due to people’s level of social interactions, to what extent is well-being associated with the time people spend outdoors?
Data: wellbeing.csv
Researchers interviewed 32 participants, selected at random from the population of residents of Edinburgh & Lothians. They used the Warwick-Edinburgh Mental Wellbeing Scale (WEMWBS), a self-report measure of mental health and well-being. The scale is scored by summing responses to each item, with items answered on a 1 to 5 Likert scale. The minimum scale score is 14 and the maximum is 70.
The researchers also asked participants to estimate the average number of hours they spend outdoors each week, the average number of social interactions they have each week (whether on-line or in-person), and whether they believe that they stick to a routine throughout the week (Yes/No).
The dataset is available at https://uoepsy.github.io/data/wellbeing.csv and contains five attributes:
wellbeing
: Warwick-Edinburgh Mental Wellbeing Scale (WEMWBS), a self-report measure of mental health and well-being. The scale is scored by summing responses to each item, with items answered on a 1 to 5 Likert scale. The minimum scale score is 14 and the maximum is 70.
outdoor_time
: Self report estimated number of hours per week spent outdoors
social_int
: Self report estimated number of social interactions per week (both online and in-person)routine
: Binary Yes/No response to the question “Do you follow a daily routine throughout the week?”location
: Location of primary residence (City, Suburb, Rural)
Read in the data, and explore and describe the variables and relationships that are relevant to the research question.
Hints:
You might want to:
- plot the marginal distributions (the distributions of each variable in the analysis without reference to the other variables)
- plot the marginal relationships between the outcome variable and each of the explanatory variables.
- make a quick correlation matrix of the variables which are to be used in the analysis, and write a short paragraph describing the relationships.
- e.g.
cor(data[,c(1:4)])
will give us a matrix of correlations between each pair of the first 4 variables in the data
- e.g.
Research Question: after taking into account differences due to people’s level of social interactions, to what extent is well-being associated with the time people spend outdoors?
Fit a regression model that you can use to answer the research question (make sure to give it a name to store it in your environment).
Hints:
\(Wellbeing = b_0 \ + \ b_1 \cdot Social Interactions \ + \ b_2 \cdot Outdoor Time \ + \ \epsilon\)
Extract and interpret the parameter estimates (the coefficients) from your model.
Hints:
- There’s a comprehensive section on how we interpret multiple regression coefficients in 8A#multiple-regression-coefficients
Along with the p-values given by summary()
that test the null hypothesis that each coefficient is equal to zero, we can obtain confidence intervals for our estimates to provide a range of plausible values for each coefficient. This may be useful if we want to focus more on the estimated value, rather than “is it zero or not?”.
Look up the function confint()
and use it to obtain some 95% Confidence Intervals for the coefficients.
Does your model provide a better fit to the data than a model with no explanatory variables? (i.e., test against the alternative hypothesis that at least one of the explanatory variables significantly predicts wellbeing scores).
Hints:
- it’s all in the
summary()
!! - this might not be a useful question to ask, but we’re just trying to go through what each bit of the model output shows.
Just to prove it to ourselves, conduct a model comparison between your model and the “null model”, and check that the \(F\)-statistic from the summary()
of your model matches that from the comparison.
Hints:
- we talked about the ‘null model’, and conducting model comparison in 8A#model-comparisons.
- we can fit a model with no predictors such as
lm(wellbeing ~ 1, data = mwdata)
.
Check the assumptions of your model.
Hints:
- See 8B#assumptions
- You don’t necessarily have to perform statistical tests of assumptions, but check the plots at the very least!
- Interpreting these plots is ultimately a judgement call, and comes with experience.
Check for influential observations.
Hints:
- 8B#individual-case-diagnostics covers a whole load of different metrics you might use. As a starting point, perhaps consider Cook’s Distances plots, or use
influence.measures()
to get lots of info.
Create a visualisation of the relationship between wellbeing and outdoor time, after accounting for social interactions.
Hints:
- There’s an example of visualising multiple regression coefficients in 8A#multiple-regression-coefficients.
- to visualise just one association, you might need the
terms
argument inplot_model()
. Don’t forget you can look up the documentation by typing?plot_model
in the console.
Create a regression table to present your results
Hints:
- There is a useful function called
tab_model()
, also from the sjPlot package. You can see it used in 7A#example.
Student Sleep Quality
USMR 2022 Data
The data from the USMR 2022 survey (now closed) can be found at https://uoepsy.github.io/data/usmr2022.csv.
note, this is the survey data just from USMR this year, not other students on other courses or in previous years
Recall that last week we discussed my little theory that people who believe they have more control over life will rate their sleep quality as being better.
Monica has pointed out that it’s likely that the association we have found (Figure 3) might be due to other things. There are lots of explanations we could come up with. Some of these might involve variables we have measured.1 What if the relationship we see between peoples’ loc
and sleeprating
is better explained by their emotional stability? Monica makes a convincing argument that having higher emotional stability may be correlated with feeling more in control of one’s life, and may also influence sleep quality.
Sometimes it can help to think of models in the form of diagrams:
Monica tells me that if I want to look at how ‘locus of control’ is associated with sleep quality, it would be more useful to think about the association after controlling for emotional stability.
How does locus of control (
loc
) influence sleep quality (sleeprating
) after accounting for emotional stability (emot_stability
)?
Hints:
See section 8A#multiple-regression-coefficients for how this might be achieved.
the way the question is phrased suggests that
sleeprating
is our outcome (the thing being explained/influenced/predicted);loc
is the focal predictor (the main predictor of interest); andemot_stability
is a covariate. we treat “covariates” and “focal predictors” exactly the same in the model:lm(outcome ~ covariates and focal predictors)
, it is just terminology that we use to distinguish what we are interested in from what we want to acknowledge as being theoretically relevant.Don’t worry about assumptions, we’ll take a look at them next.
Monica and I want to write up our study of how locus of control is associated with sleep ratings.
However, we haven’t checked any of our model assumptions, and so we shouldn’t really trust the p-values we have been looking at so far.
Take a look at the assumptions and diagnostics of your model. If your assumptions are worryingly weird, there are things that may be useful to consider doing. We would probably recommend bootstrapping (see 8B#back-to-the-bootstrap).
Hints:
You can ask yourself one of two questions (or both if you like), but we recommend the visualisation approach:
- Do the plots look okay? (8B#plotting-assumptions)
- Do the residuals pass statistical tests? (8B#testing-assumptions)
It’s also worth taking a look at the ‘diagnostics’ (such as the influence of individual cases, see 8B#individual-case-diagnostics.
In order to bootstrap, you might need to fit the model to a dataset without any NAs in it.
Monica convinces me that we have theoretical motivation for thinking that sleeprating is influenced by both emotional stability and locus of control.
Tia is curious if other aspects of personality influence sleep quality ratings. She knows that we measured 5 personality traits: emotional stability, imagination, extraversion, agreeableness, and conscientiousness.
Tia asks us:
Over and above the influence of locus of control and emotional stability, are other personality traits important predictors of how sleep quality is rated?
Hints:
- This question asks about the explanatory power of a set of predictors. The best way to test this might be model comparison (see 8A#model-comparisons)
- To compare two models, the models need to be fitted to the same data. However, when we fit a model using
lm()
, rows where there is missing data for any of the variables in the model will be excluded.- if we have a model
mod1 <- lm(y ~ x1)
and we compare it tomod2 <- lm(y ~ x1 + x2)
, if there is missing data onx2
, thenmod2
will be fitted to a smaller set of the data.
- a good approach is to make a separate object in your environment which is the data you want to model.
# select all variables included in either model, and omit the NAs <- data %>% select(y, x1, x2) %>% na.omit moddata # these two models will now be fitted to the same data <- lm(y ~ x1, moddata) mod1 <- lm(y ~ x1 + x2, moddata) mod2 # meaning we can compare them: anova(mod1, mod2)
- if we have a model
RMarkdown
Create an RMarkdown document that presents and describes analyses that addresses the research questions below, and then presents and interprets the results.
Research Questions
- After accounting for emotional stability, how is perceived control over one’s life associated with quality of sleep?
- In addition to the associations studied above, are other personality traits associated with differences in sleep quality?
Hints:
- You’ve done the modelling already in the last couple of questions, so this just a matter of writing, interpreting, and presenting!
- Don’t forget that we have the “Rmd-bootcamp” materials for reference: https://uoepsy.github.io/scs/rmd-bootcamp/
Optional Extra: Conscientiousness
This is Eleanor Abernathy. She has been learning about astrology and horoscopes, but is still skeptical. She wants to know if, after accounting for differences between cat/dog people, a person’s month-of-birth is associated with how conscientious they are.
Using the same USMR 2022 survey data, fit the a model and use it to address the following research question:
After accounting for differences between cat/dog people, is month-of-birth associated with conscientiousness?
Hint:
- Note that the question is more a matter of “is”/“does”, and not “what is”/“how does”.
- To answer this question, we might be more inclined to analyse the variance in conscientiousness explained by birth-months (rather than looking at specific differences) - see 8A#analysis-of-variance.
Footnotes
When designing a study/collected data, it’s good practice to do this sort of thinking before hand, to make sure you measure all the things you want to measure↩︎