Categorical Predictors & Block 1 Recap

Learning Objectives

At the end of this lab, you will:

  1. Understand the meaning of (and how to interpret) a multiple regression model with a binary predictor
  2. Understand how to specify a new baseline/reference level for categorical variables

What You Need

  1. Be up to date with lectures
  2. Have completed Labs 1 - 4

Required R Packages

Remember to load all packages within a code chunk at the start of your RMarkdown file using library(). If you do not have a package installed, install it from the console using install.packages(" "). For further guidance on installing/updating packages, see Section C here.

For this lab, you will need to load the following package(s):

  • tidyverse
  • patchwork
  • sjPlot

Lab Data

You can download the data required for this lab here or read it in via this link https://uoepsy.github.io/data/wellbeing.csv.

Note: this is the same data as Lab 3 & 4.

Section A: Numeric + Categorical

Study Overview

Research Question

Is there an association between well-being and time spent outdoors after taking into account the association between well-being and having a routine?

Wellbeing data codebook.

Setup

  1. Create a new RMarkdown file
  2. Load the required package(s)
  3. Read the wellbeing dataset into R, assigning it to an object named mwdata
  4. Check coding of variables (e.g., make sure that categorical variables are coded as factors)

You will need to use the as_factor() function here. Note that this function creates levels in the order in which the values first appear in your dataset (e.g., 'Routine' is the value of routine in the first row of our mwdata, so it would be assigned as the first level, i.e., the reference group).
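The ordering behaviour described above can be checked on a small made-up vector before applying it to mwdata. This is a minimal sketch, assuming the forcats package (loaded as part of tidyverse) is installed; the vector x is invented for illustration and is not taken from the lab data.

```r
library(forcats)  # part of the tidyverse; provides as_factor()

# Made-up values for illustration only (not taken from mwdata)
x <- c("Routine", "No Routine", "Routine", "No Routine")

# as_factor() orders the levels by first appearance in the data
f_appearance <- as_factor(x)
levels(f_appearance)

# base factor() orders the levels alphabetically instead
f_alphabetical <- factor(x)
levels(f_alphabetical)
```

In both cases the first level listed is the one R treats as the reference group in lm(), so it is worth checking levels() after converting a variable.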

Solution


Question 1

Produce visualisations of:

  1. the distribution of the routine variable
  2. the association between routine and wellbeing.

Provide interpretation of these figures.

Note

We cannot visualise the distribution of routine as a density curve or boxplot, because it is a categorical variable (observations can only take one of a set of discrete response values). Revise the DAPR1 materials for a recap of data types.

Consider using geom_bar() and/or geom_boxplot(). The DAPR1 categorical data lab might provide a useful starting point if needed.
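One possible layout for the two plots is sketched below, assuming ggplot2 and patchwork are installed. The data frame toy is simulated purely so the example runs on its own; with the lab data you would use mwdata and its routine and wellbeing columns instead.

```r
library(ggplot2)
library(patchwork)

# Simulated stand-in data (hypothetical values, NOT the lab data)
set.seed(1)
toy <- data.frame(routine = factor(rep(c("Routine", "No Routine"), times = 20)))
toy$wellbeing <- 30 + 7 * (toy$routine == "Routine") + rnorm(40, sd = 5)

# Distribution of a categorical variable: counts per category
p_bar <- ggplot(toy, aes(x = routine)) +
  geom_bar() +
  labs(x = "Routine", y = "Count")

# Association between a categorical and a numeric variable
p_box <- ggplot(toy, aes(x = routine, y = wellbeing)) +
  geom_boxplot() +
  labs(x = "Routine", y = "Wellbeing score")

# patchwork places the two plots side by side
p_both <- p_bar | p_box
p_both
```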

Solution


Question 2
  1. Formally state:
       • your chosen significance level
       • the null and alternative hypotheses
  2. Fit the multiple regression model below using lm(), and assign it to an object named mdl2.

\[ Wellbeing = \beta_0 + \beta_1 \cdot Routine_{No Routine} + \beta_2 \cdot OutdoorTime + \epsilon \]

Examine the summary() output of the model.

\(\hat \beta_0\) (the intercept) is the estimated average wellbeing score associated with zero hours of weekly outdoor time and zero in the routine variable. What group is the intercept the estimated wellbeing score for when they have zero hours of outdoor time? Why (think about what zero in the routine variable means)?
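To see what "zero in the routine variable" means, it can help to look at the dummy coding R builds for a factor. The sketch below uses simulated stand-in data (the object names toy and toy_mdl and all values are hypothetical, not the lab data) with "Routine" set as the first level, and uses only base R.

```r
# Simulated stand-in data (hypothetical values, NOT the lab data)
set.seed(1)
toy <- data.frame(
  routine = factor(rep(c("Routine", "No Routine"), times = 20),
                   levels = c("Routine", "No Routine")),
  outdoor_time = round(runif(40, min = 0, max = 30))
)
toy$wellbeing <- 30 - 7 * (toy$routine == "No Routine") +
  0.9 * toy$outdoor_time + rnorm(40, sd = 3)

# Dummy coding used by lm(): the reference level is the row coded 0
contrasts(toy$routine)

# Same model structure as in the question, fitted to the toy data
toy_mdl <- lm(wellbeing ~ routine + outdoor_time, data = toy)
summary(toy_mdl)

# The coefficient name shows which level is coded 1
names(coef(toy_mdl))
```

The level that does not appear in the coefficient names is the reference level, which is the group the intercept refers to.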

Solution


Question 3

The researchers have decided that they would prefer ‘no routine’ to be considered the reference level for the “routine” variable instead of ‘routine’. Apply this change to the variable and re-run your model.

You will need to use the relevel() function here.
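A minimal sketch of how relevel() changes the reference level, shown on a small made-up factor rather than on mwdata (the object names f and f_new are hypothetical):

```r
# Made-up factor for illustration only; "Routine" starts as the first level
f <- factor(c("Routine", "No Routine", "Routine"),
            levels = c("Routine", "No Routine"))
levels(f)

# relevel() moves the level given in `ref` to the first position,
# which makes it the reference level in any model fitted afterwards
f_new <- relevel(f, ref = "No Routine")
levels(f_new)
```

The data values themselves are unchanged; only the level order (and therefore the baseline group in lm()) differs.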

Solution


Question 4

We can visualise the model below as two lines.

\(\widehat{Wellbeing} = \hat \beta_0 + \hat \beta_1 \cdot Routine_{Routine} + \hat \beta_2 \cdot OutdoorTime\)

Each line represents the model predicted values for wellbeing scores across the range of weekly outdoor time, with one line for those who report having “Routine” and one for those with “No Routine”.

Get a pen and paper, and sketch out the plot shown in Figure 3.

Figure 3: Multiple regression model: Wellbeing ~ Routine + Outdoor Time

Annotate your plot with labels for each of the parameter estimates from your model:

Parameter Estimate | Model Coefficient | Estimate
\(\hat \beta_0\)   | (Intercept)       | 26.25
\(\hat \beta_1\)   | routineRoutine    | 7.29
\(\hat \beta_2\)   | outdoor_time      | 0.92

Below you can see where to add the labels, but we have not said which is which.

  • A is the vertical distance between the red and blue lines (the lines are parallel, so this distance is the same wherever you cut it on the x-axis).
  • B is the point at which the blue line cuts the y-axis.
  • C is the vertical increase (increase on the y-axis) for the blue line associated with a 1 unit increase on the x-axis (the lines are parallel, so this is the same for the red line).
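If it helps with the sketch, the two lines can be worked out numerically from the estimates in the table. The snippet below is only an illustration: the helper predict_wellbeing() is a hypothetical function written here, with routine coded 1 for "Routine" and 0 for "No Routine" to match the routineRoutine coefficient.

```r
# Estimates as reported in the table above
b0 <- 26.25  # (Intercept)
b1 <- 7.29   # routineRoutine
b2 <- 0.92   # outdoor_time

# Hypothetical helper: routine is 1 for "Routine", 0 for "No Routine"
predict_wellbeing <- function(routine, outdoor_time) {
  b0 + b1 * routine + b2 * outdoor_time
}

# Predicted wellbeing at zero hours of outdoor time for each group
no_routine_at_0 <- predict_wellbeing(routine = 0, outdoor_time = 0)
routine_at_0    <- predict_wellbeing(routine = 1, outdoor_time = 0)

# Vertical gap between the two lines at 0 and at 10 hours
gap_at_0  <- routine_at_0 - no_routine_at_0
gap_at_10 <- predict_wellbeing(1, 10) - predict_wellbeing(0, 10)

# Change in predicted wellbeing for one extra hour outdoors
slope <- predict_wellbeing(0, 1) - predict_wellbeing(0, 0)

c(no_routine_at_0 = no_routine_at_0, routine_at_0 = routine_at_0,
  gap_at_0 = gap_at_0, gap_at_10 = gap_at_10, slope = slope)
```

Comparing these values with the descriptions of A, B and C should make it clear which estimate belongs where.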

Solution


Question 5

Interpret your results in the context of the research question and report your model in full.

Provide key model results in a formatted table.
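One way to build such a table is tab_model() from sjPlot. The sketch below is an assumption about how you might set it up, not a required format: it fits a model to simulated stand-in data (hypothetical values, not the lab data), and the labels are made up for illustration. With your own work you would pass your fitted model instead of toy_mdl.

```r
library(sjPlot)

# Simulated stand-in data (hypothetical values, NOT the lab data)
set.seed(1)
toy <- data.frame(
  routine = factor(rep(c("No Routine", "Routine"), times = 20),
                   levels = c("No Routine", "Routine")),
  outdoor_time = round(runif(40, min = 0, max = 30))
)
toy$wellbeing <- 26 + 7 * (toy$routine == "Routine") +
  0.9 * toy$outdoor_time + rnorm(40, sd = 3)

toy_mdl <- lm(wellbeing ~ routine + outdoor_time, data = toy)

# Build a formatted regression table; labels follow the coefficient order
tab <- tab_model(
  toy_mdl,
  dv.labels   = "Wellbeing score (toy data)",
  pred.labels = c("Intercept", "Routine: Routine", "Outdoor time (hours per week)"),
  title       = "Regression table for an illustrative model"
)
```

The table is stored in tab here so the example runs quietly; in an RMarkdown chunk, evaluating tab (or calling tab_model() without assigning it) should render the table in the knitted document.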

Solution

Section B: Weeks 1 - 4 Recap

In the second part of the lab, there is no new content. The purpose of the recap section is for you to revisit and revise the concepts you have learned over the last 4 weeks.

Before you expand each of the boxes below, think about how comfortable you feel with each concept.

Types of Models: Deterministic vs Statistical

Null & Alternative Hypotheses

Simple Linear Regression

Multiple Linear Regression

Partitioning Variation: Sum of Squares

F-test & F-ratio

R-squared and Adjusted R-squared

Standardisation

Binary Variables

Categorical Predictors with k levels

Steps Involved in Modelling