Setup
- Create a new RMarkdown file
- Load the required package(s)
- Read the wellbeing dataset into R, assigning it to an object named mwdata
- Check the coding of variables (e.g., make sure that categorical variables are coded as factors)
You will need to use the as_factor() function here. Note that this function creates levels in the order in which they appear in your dataset (e.g., routine is in the first row of our mwdata, so would be assigned as the reference group).
Solution
#check coding of routine variable - it should be a factor - check by running `is.factor()`
is.factor(mwdata$routine) #result = FALSE, so we need to make routine a factor
#designate routine as a factor, then re-run the line above to check that the change has been applied (should now return TRUE)
mwdata$routine <- as_factor(mwdata$routine)
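For completeness, here is a minimal sketch of the earlier setup steps (the packages listed and the file name "wellbeing.csv" are assumptions - use the packages and data path given in your lab sheet):
#load packages used in this lab (assumed: tidyverse for wrangling/plotting and as_factor(),
#patchwork for combining plots, sjPlot for tab_model() later on)
library(tidyverse)
library(patchwork)
library(sjPlot)
#read the wellbeing data (the file name here is a placeholder)
mwdata <- read_csv("wellbeing.csv")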
Question 1
Produce visualisations of:
- the distribution of the routine variable
- the association between routine and wellbeing.
Provide interpretation of these figures.
We cannot visualise the distribution of routine as a density curve or boxplot, because it is a categorical variable (observations can only take one of a set of discrete response values). Revise the DAPR1 materials for a recap of data types.
Solution
geom_bar() will count the number of observations falling into each unique level of the routine variable:
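A minimal sketch of that code (the exact axis labels are our own choice, mirroring p2 below):
#bar chart of the number of observations in each level of routine
p1 <- ggplot(data = mwdata, aes(x = routine)) +
  geom_bar() +
  labs(x = "Routine", y = "Count")
p1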
We might plot the association between routine and wellbeing as two boxplots:
p2 <- ggplot(data = mwdata, aes(x = routine, y = wellbeing)) +
  geom_boxplot() +
  labs(x = "Routine", y = "Wellbeing score (WEMWBS)")
p2
#place plots adjacent to one another (the | operator comes from the patchwork package)
p1 | p2
From Figure 1, we can see that there are more individuals who do not have a routine than individuals who do.
From Figure 2, we can see that individuals with a routine tend to have higher wellbeing scores than those who do not.
Question 2
- Formally state:
- your chosen significance level
- the null and alternative hypotheses
- Fit the multiple regression model below using lm(), and assign it to an object named mdl2.
\[
Wellbeing = \beta_0 + \beta_1 \cdot Routine_{No Routine} + \beta_2 \cdot OutdoorTime + \epsilon
\]
Examine the summary() output of the model.
\(\hat \beta_0\) (the intercept) is the estimated average wellbeing score associated with zero hours of weekly outdoor time and a value of zero on the routine variable. For which group is the intercept the estimated wellbeing score when they have zero hours of outdoor time? Why (think about what zero on the routine variable means)?
Solution
Effects will be considered statistically significant at \(\alpha = .05\).
\(H_0: \beta_2 = 0\)
There is no association between wellbeing and time spent outdoors, after taking into account the relationship between wellbeing and routine.
\(H_1: \beta_2 \neq 0\)
There is an association between wellbeing and time spent outdoors, after taking into account the relationship between wellbeing and routine.
mdl2 <- lm(wellbeing ~ routine + outdoor_time, data = mwdata)
summary(mdl2)
Call:
lm(formula = wellbeing ~ routine + outdoor_time, data = mwdata)
Residuals:
Min 1Q Median 3Q Max
-16.3597 -5.7983 0.1047 7.2899 12.5957
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.5472 4.4263 7.579 2.35e-08 ***
routineNo Routine -7.2947 3.2507 -2.244 0.032633 *
outdoor_time 0.9152 0.2358 3.881 0.000552 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.06 on 29 degrees of freedom
Multiple R-squared: 0.4361, Adjusted R-squared: 0.3972
F-statistic: 11.21 on 2 and 29 DF, p-value: 0.0002467
As you can see in the output of the model, we have a coefficient called routineNo Routine. This is the parameter estimate for the dummy variable which has been entered into the model. The lm() function automatically names the dummy variables (and therefore the coefficients) according to the level identified by the 1, using the pattern <variable><Level>, so we can tell that routineNo Routine is 1 for "No Routine" and 0 for "Routine".
The intercept is therefore the estimated wellbeing score for those with a Routine and zero hours of outdoor time.
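If you want to check the coding R is using, one optional approach (not required for the lab) is to look at the contrasts or the model's design matrix:
#check which level is coded as 1 for the routine factor
contrasts(mwdata$routine)
#or look at the first few rows of the design matrix used by the model
head(model.matrix(mdl2))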
Question 3
The researchers have decided that they would prefer ‘no routine’ to be considered the reference level for the “routine” variable instead of ‘routine’. Apply this change to the variable and re-run your model.
You will need to use the relevel() function here.
Solution
Now we are fitting the below model:
\[
Wellbeing = \beta_0 + \beta_1 \cdot Routine_{Routine} + \beta_2 \cdot OutdoorTime + \epsilon
\] So we will need to change our reference group before re-running our model:
#re-order so that no routine is reference level
mwdata$routine <- relevel(mwdata$routine, 'No Routine')
#re-run model and check summary
mdl2_reorder <- lm(wellbeing ~ routine + outdoor_time, data = mwdata)
summary(mdl2_reorder)
Call:
lm(formula = wellbeing ~ routine + outdoor_time, data = mwdata)
Residuals:
Min 1Q Median 3Q Max
-16.3597 -5.7983 0.1047 7.2899 12.5957
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.2525 3.9536 6.640 2.8e-07 ***
routineRoutine 7.2947 3.2507 2.244 0.032633 *
outdoor_time 0.9152 0.2358 3.881 0.000552 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.06 on 29 degrees of freedom
Multiple R-squared: 0.4361, Adjusted R-squared: 0.3972
F-statistic: 11.21 on 2 and 29 DF, p-value: 0.0002467
You should now see that the routine coefficient is labelled routineRoutine instead of routineNo Routine. This means that 1 now stands for "Routine" and 0 for "No Routine".
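If you want to double-check the reordering itself, a quick optional check is to print the factor levels - the reference level is listed first:
#check the order of the levels ("No Routine" should now come first)
levels(mwdata$routine)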
Question 4
We can visualise the model below as two lines.
\(\widehat{Wellbeing} = \hat \beta_0 + \hat \beta_1 \cdot Routine_{Routine} + \hat \beta_2 \cdot OutdoorTime\)
Each line represents the model predicted values for wellbeing scores across the range of weekly outdoor time, with one line for those who report having “Routine” and one for those with “No Routine”.
Get a pen and paper, and sketch out the plot shown in Figure 3.
Annotate your plot with labels for each of the parameter estimates from your model:
| Parameter | Coefficient | Estimate |
|---|---|---|
| \(\hat \beta_0\) | (Intercept) | 26.25 |
| \(\hat \beta_1\) | routineRoutine | 7.29 |
| \(\hat \beta_2\) | outdoor_time | 0.92 |
Below you can see where to add the labels, but we have not said which is which.
- A is the vertical distance between the red and blue lines (the lines are parallel, so this distance is the same wherever you cut it on the x-axis).
- B is the point at which the blue line cuts the y-axis.
- C is the vertical increase (increase on the y-axis) for the blue line associated with a 1 unit increase on the x-axis (the lines are parallel, so this is the same for the red line).
Solution
- A = \(\hat \beta_1\) = routineRoutine coefficient = 7.29
- B = \(\hat \beta_0\) = (Intercept) coefficient = 26.25
- C = \(\hat \beta_2\) = outdoor_time coefficient = 0.92
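If you would like to re-create something like Figure 3 in R rather than on paper, here is a minimal sketch (assuming mdl2_reorder and mwdata from above, and assuming no missing data so the fitted values line up with the rows of mwdata):
#plot the data, with the model-predicted (fitted) values drawn as one line per routine group
ggplot(mwdata, aes(x = outdoor_time, y = wellbeing, colour = routine)) +
  geom_point() +
  geom_line(aes(y = fitted(mdl2_reorder))) +
  labs(x = "Outdoor time (hours per week)", y = "Wellbeing score (WEMWBS)")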
Question 5
Interpret your results in the context of the research question and report your model in full.
Provide key model results in a formatted table.
Solution
#create table for results (tab_model() comes from the sjPlot package)
tab_model(mdl2_reorder,
          dv.labels = "Wellbeing (WEMWBS Scores)",
          pred.labels = c("routineRoutine" = "Has Routine",
                          "outdoor_time" = "Outdoor Time (hours per week)"),
          title = "Regression Table for Wellbeing Model")
Table 1: Regression Table for Wellbeing Model
Outcome: Wellbeing (WEMWBS Scores)

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 26.25 | 18.17 – 34.34 | <0.001 |
| Has Routine | 7.29 | 0.65 – 13.94 | 0.033 |
| Outdoor Time (hours per week) | 0.92 | 0.43 – 1.40 | 0.001 |
| Observations | 32 | | |
| R² / R² adjusted | 0.436 / 0.397 | | |
And now let's write up our results:
Full regression results including 95% Confidence Intervals are shown in Table 1. The \(F\)-test for model utility was significant \((F(2,29) = 11.21, p<.001)\), and the model explained approximately 39.72% of the variability in wellbeing scores.
After controlling for routine, there was a significant association between wellbeing scores and outdoor time \((\beta = 0.92, SE = 0.24, p < .001)\). This suggested that for every additional hour of outdoor time, wellbeing scores were, on average, higher by 0.92 points \((CI_{95}[0.43, 1.40])\). Therefore, we have evidence to reject the null hypothesis (that there was no association between wellbeing and time spent outdoors after taking into account the relationship between wellbeing and routine).
Section B: Weeks 1 - 4 Recap
In the second part of the lab, there is no new content - the purpose of the recap section is for you to revisit and revise the concepts you have learned over the last 4 weeks.
Before you expand each of the boxes below, think about how comfortable you feel with each concept.
Types of Models: Deterministic vs Statistical
Deterministic (Example: Perimeter & Side)
The mathematical model
\[
Perimeter = 4 * Side
\]
or, equivalently, \[
y = 4 * x
\]
represents the relationship between side and perimeter of squares. This is an example of a deterministic model as it is a model of an exact relationship - there can be no deviation.
Statistical (Example: Height & Handspan)
The relationship between height and handspan shows deviations from the ‘average pattern’. Hence, we need to create a model that allows for deviations from the linear relationship - we need a statistical model.
A statistical model includes both a deterministic function and a random error term: \[
Handspan = \beta_0 + \beta_1 * Height + \epsilon
\] or, in short, \[
y = \underbrace{\beta_0 + \beta_1 * x}_{\text{function of }x} + \underbrace{\epsilon}_{\text{random error}}
\]
The deterministic function need not be linear if the scatterplot displays signs of nonlinearity.
In the equation above, the terms \(\beta_0\) and \(\beta_1\) are numbers specifying where the line going through the data meets the y-axis and its slope (direction and gradient of line).
See Week 1 lab and both lecture 1 and lecture 2 for further details and to revise these concepts further.
Null & Alternative Hypotheses
Recall that statistical hypotheses are testable mathematical statements.
We need to define a null (\(H_0\)) and alternative (\(H_1\)) hypothesis.
Points to note:
- We can only ever test the null (\(H_0\)), so all statements must be made in reference to this
- We can only ever reject or fail to reject the null (we can never accept a hypothesis)
Simple Linear Regression
Formula:
\[
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
\] In R:
There are basically two pieces of information that we need to pass to the lm() function:
- The formula: The regression formula should be specified in the form y ~ x, where \(y\) is the dependent variable (DV) and \(x\) the independent variable (IV).
- The data: Specify which dataframe contains the variables specified in the formula.
Run a simple linear regression via the lm() function:
model_name <- lm(DV ~ IV, data = data_name)
OR
model_name <- lm(data_name$DV ~ data_name$IV)
See Week 2 lab and lectures for further details, examples, and to revise these concepts further.
Multiple Linear Regression
Formula:
\[
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i
\] In R:
Multiple and simple linear regression follow the same structure within the lm() function. You simply add (using the + sign) more independent variables.
Run a multiple linear regression (this example includes three independent variables) via the lm() function:
model_name <- lm(DV ~ IV1 + IV2 + IV3, data = data_name)
OR
model_name <- lm(data_name$DV ~ data_name$IV1 + data_name$IV2 + data_name$IV3)
Interpretation of Multiple Regression Coefficients
You’ll hear a lot of different ways that people explain multiple regression coefficients.
For the model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\), the estimate \(\hat \beta_1\) will often be reported as:
the increase in \(y\) for a one unit increase in \(x_1\) when…
- holding the effect of \(x_2\) constant.
- controlling for differences in \(x_2\).
- partialling out the effects of \(x_2\).
- holding \(x_2\) equal.
- accounting for effects of \(x_2\).
See Week 3 lab and lectures for further details, examples, and to revise these concepts further.
Partitioning Variation: Sum of Squares
Sum of Squares
The sum of squares measures the deviation or variation of data points away from the mean (i.e., how spread out the numbers in a given dataset are). We are trying to find the equation/function that best fits our data, i.e., the one that deviates least from our data points.
Total Sum of Squares
Formula:
\[SS_{Total} = \sum_{i=1}^{n}(y_i - \bar{y})^2\]
In words:
Squared distance of each data point from the mean of \(y\).
Description:
How much variation there is in the DV.
Residual Sum of Squares
Formula:
\[SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
In words:
Squared distance of each point from the predicted value.
Description:
How much of the variation in the DV the model did not explain - a measure that captures the unexplained variation in your regression model. Lower residual sum of squares suggests that your model fits the data well, and higher suggests that the model poorly explains the data (in other words, the lower the value, the better the regression model). If the value was zero here, it would suggest the model fits perfectly with no error.
Model Sum of Squares
Formula:
\[SS_{Model} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2\]
Can also be derived from:
\[SS_{Model} = SS_{Total} - SS_{Residual}\] In words:
The deviance of the predicted scores from the mean of \(y\).
Description:
How much of the variation in the DV your model explained - like a measure that captures how well the regression line fits your data.
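As a check of these formulas in R, here is a minimal sketch (assuming the mdl2_reorder model and mwdata from Section A, and no missing data):
#total sum of squares: spread of the outcome around its mean
ss_total <- sum((mwdata$wellbeing - mean(mwdata$wellbeing))^2)
#residual sum of squares: spread of the outcome around the model's predictions
ss_residual <- sum(residuals(mdl2_reorder)^2)
#model sum of squares: the variation the model accounts for
ss_model <- ss_total - ss_residual
#ss_model / ss_total should match the Multiple R-squared from summary()
ss_model / ss_total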
See Week 3 lab and lectures, as well as Week 4 lab and lectures for further details, examples, and to revise these concepts further.
F-test & F-ratio
Formula:
\[
F_{df_{model},df_{residual}} = \frac{MS_{Model}}{MS_{Residual}} = \frac{SS_{Model}/df_{Model}}{SS_{Residual}/df_{Residual}} \\
\quad \\
\begin{align}
& \text{Where:} \\
& df_{model} = k \\
& df_{residual} = n-k-1 \\
& n = \text{sample size} \\
& k = \text{number of explanatory variables} \\
\end{align}
\] Description:
To test the significance of an overall model, we can conduct an \(F\)-test. The \(F\)-test compares your model to a model containing zero predictor variables (i.e., the intercept only model), and tests whether your added predictor variables significantly improved the model.
The \(F\)-test involves testing the statistical significance of the \(F\)-ratio. Q: What does the \(F\)-ratio test? A: The null hypothesis that all regression slopes in a model are zero (i.e., explain no variance in your outcome/DV).
Points to note:
- The larger your \(F\)-ratio, the better your model
- The \(F\)-ratio will be close to 1 when the null is true (i.e., that all slopes are zero)
Interpretation:
If your model predictors explain a significant amount of variance, the \(F\)-ratio will be significant and you would reject the null; this suggests that the predictor variables included in your model improve the model fit (in comparison to the intercept-only model).
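If you want to pull the F-ratio out of a fitted model yourself, a minimal sketch (assuming mdl2_reorder from Section A) is:
#the F-ratio and its degrees of freedom are stored in the model summary
fstat <- summary(mdl2_reorder)$fstatistic
fstat
#corresponding p-value for the overall F-test
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)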
See Week 4 lab and lectures for further details, examples, and to revise these concepts further.
R-squared and Adjusted R-squared
\(R^2\) represents the proportion of variance in \(Y\) that is explained by the model.
The \(R\)-squared coefficient is defined as: \[
R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}
\]
The Adjusted \(R\)-squared coefficient is defined as: \[
\hat R^2 = 1 - \frac{(1 - R^2)(n-1)}{n-k-1}
\quad \\
\begin{align}
& \text{Where:} \\
& n = \text{sample size} \\
& k = \text{number of explanatory variables} \\
\end{align}
\] We can see the Multiple and Adjusted \(R\)-squared in the summary() output of a model.
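A minimal sketch of extracting these from a fitted model object (assuming mdl2_reorder from Section A):
#Multiple R-squared and Adjusted R-squared from the model summary
summary(mdl2_reorder)$r.squared
summary(mdl2_reorder)$adj.r.squared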
Points to note:
- Adjusted-\(R^2\) adjusts for the number of terms in a model, and should be used when there are 2 or more predictors in the model
- Adjusted-\(R^2\) should always be less than or equal to \(R^2\)
See Week 4 lab and lectures for further details, examples, and to revise these concepts further.
Standardisation
\(z\)-score Formula:
\[
z_x = \frac{x - \bar{x}}{s_x}, \qquad z_y = \frac{y - \bar{y}}{s_y}
\]
Recall that a standardized variable has mean of 0 and standard deviation of 1.
In R:
#create z-scored variables
dataframe <- dataframe %>%
  mutate(
    z_variable = (variable - mean(variable)) / sd(variable)
  )
OR
#use scale function
model <- lm(scale(DV) ~ scale(IV), data = dataset)
See Week 4 lab and lectures for further details, examples, and to revise these concepts further.
Binary Variables
Binary predictors in linear regression
We can include categorical predictors in a linear regression, but the interpretation of the coefficients is very specific. Whereas we previously talked about coefficients being interpreted as "the change in \(y\) associated with a 1-unit increase in \(x\)", for categorical explanatory variables the coefficients represent differences in group means. However, the model is actually doing exactly the same thing - it simply translates the levels (like "Yes"/"No") into 0s and 1s!
Our coefficients are just the same as before. The intercept is where our predictor equals zero, and the slope is the change in our outcome variable associated with a 1-unit change in our predictor.
However, “zero” for this predictor variable now corresponds to a whole level. This is known as the “reference level”. Accordingly, the 1-unit change in our predictor (the move from “zero” to “one”) corresponds to the difference between the two levels.
See Week 5 lab and lectures for further details, examples, and to revise these concepts further.
Categorical Predictors with k levels
We saw that a binary categorical variable gets inputted into our model as a variable of 0s and 1s (these typically get called “dummy variables”).
Dummy variables are numeric variables that represent categorical data.
When we have a categorical explanatory variable with more than 2 levels, our model gets a bit more complex - it needs not just one, but several dummy variables. For a categorical variable with \(k\) levels, we can express it in \(k-1\) dummy variables.
For example, the “species” column below has three levels, and can be expressed by the two variables “species_dog” and “species_parrot”:
species species_dog species_parrot
1 cat 0 0
2 cat 0 0
3 dog 1 0
4 parrot 0 1
5 dog 1 0
6 cat 0 0
7 ... ... ...
- The “cat” level is expressed whenever both the “species_dog” and “species_parrot” variables are 0.
- The “dog” level is expressed whenever the “species_dog” variable is 1 and the “species_parrot” variable is 0.
- The “parrot” level is expressed whenever the “species_dog” variable is 0 and the “species_parrot” variable is 1.
R will do all of this re-expression for us. If we include in our model a categorical explanatory variable with 4 different levels, the model will estimate 3 parameters - one for each dummy variable. We can interpret the parameter estimates (the coefficients we obtain using coefficients(), coef() or summary()) as the estimated increase in the outcome variable associated with an increase of one in each dummy variable (holding all other variables equal).
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 60.28 | 1.209 | 49.86 | 5.273e-39 |
| speciesdog | -11.47 | 1.71 | -6.708 | 3.806e-08 |
| speciesparrot | -4.916 | 1.71 | -2.875 | 0.006319 |
Note that in the above example, an increase of 1 in "species_dog" is the difference between a "cat" and a "dog". An increase of 1 in "species_parrot" is the difference between a "cat" and a "parrot". We think of the "cat" category in this example as the reference level - it is the category against which the other categories are compared.
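If you want to see the dummy variables R creates, here is a minimal sketch (pets is a hypothetical data frame built to match the example above):
#build a small example data frame (hypothetical)
pets <- data.frame(species = factor(c("cat", "cat", "dog", "parrot", "dog", "cat")))
#the design matrix shows the 0/1 dummy variables R will use, with "cat" as the reference level
model.matrix(~ species, data = pets)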
See Week 5 lectures for further details, examples, and to revise these concepts further.
Steps Involved in Modelling
You can think of the sequence of steps involved in statistical modeling as:
\[
\text{Choose} \rightarrow \text{Fit} \rightarrow \text{Assess} \rightarrow \text{Use}
\]
A general rule
Do not use (draw inferences or predictions from) a model before you have assessed whether the model satisfies the underlying assumptions
Throughout this block, we have completed three of the four steps (Choose, Fit, and Use) in that we have:
- Explored/visualised our data and specified our model
- Fitted the model in R
- Interpreted our parameter estimates
Please note that when conducting real analyses, it would be inappropriate to complete these steps without also assessing whether a regression model meets the assumptions. You will learn how to do this in Block 2 of Semester 1 for linear regression models.