Model Comparisons

Learning Objectives

At the end of this lab, you will:

Understand measures of model fit using F.
Understand the principles of model selection and how to compare models via F tests.
Understand AIC and BIC.

What You Need

Be up to date with lectures
Have completed previous lab exercises from Semester 1

Required R Packages

Remember to load all packages within a code chunk at the start of your RMarkdown file using library(). If you do not have a package and need to install it, do so within the console using install.packages(" "). For further guidance on installing/updating packages, see Section C here.

For this lab, you will need to load the following package(s):

tidyverse
stargazer

Lab Data

You can download the data required for this lab here or read it in via this link https://uoepsy.github.io/data/wellbeing_rural.csv

Study Overview

Research Questions

RQ1: Is there an overall effect of the number of social interactions on wellbeing scores?

RQ2: Does the association between number of social interactions and wellbeing differ between rural and non-rural residents?

RQ3: Does weekly outdoor time explain a significant amount of variance in wellbeing scores over and above the interaction between weekly social interactions and location (rural vs not-rural)?

Wellbeing/Rurality data codebook.

age	outdoor_time	social_int	routine	wellbeing	location	steps_k
28	12	13	1	36	rural	21.6
56	5	15	1	41	rural	12.3
25	19	11	1	35	rural	49.8
60	25	15	0	35	rural	NA
19	9	18	1	32	rural	48.1
34	18	13	1	34	rural	67.3

Setup

Create a new RMarkdown file
Load the required package(s)
Read the wellbeing_rural dataset into R, assigning it to an object named wrdata

Solution
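A minimal sketch of the setup steps, assuming the packages are already installed and the data URL from the Lab Data section is reachable. The quick checks at the end are optional additions, not part of the setup instructions.

```r
# Load the packages listed under "Required R Packages"
library(tidyverse)
library(stargazer)

# Read the lab data from the URL given under "Lab Data"
wrdata <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")

# Optional checks: dimensions, column types, and a per-column summary
dim(wrdata)
glimpse(wrdata)
summary(wrdata)
```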

Exercises

Question 1

Check coding of variables (e.g., that categorical variables are coded as factors), and create a new binary variable which specifies whether or not each participant lives in a rural location.

Hint

You will need to use two functions - firstly mutate(), and then ifelse() when creating your new variable (which should also be coded as a factor).

Solution
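One possible way to approach this, following the hint. The level labels "rural" and "not rural" are an assumption chosen to match the wording of Question 2; the comparison `location == "rural"` relies on the lowercase coding shown in the data preview.

```r
library(tidyverse)

wrdata <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")

# Check how each variable is currently stored
glimpse(wrdata)

wrdata <- wrdata %>%
  mutate(
    # Store the categorical variables as factors
    location = factor(location),
    routine = factor(routine),
    # New binary variable: "rural" if location is rural, "not rural" otherwise
    isRural = factor(ifelse(location == "rural", "rural", "not rural"))
  )

# Cross-tabulate to confirm the recoding did what was intended
table(wrdata$location, wrdata$isRural)
```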

Question 2

Using fct_relevel(), specify ‘not rural’ as your reference group for your newly created variable (i.e., the isRural variable).

Solution
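A short sketch of the releveling step, assuming isRural was created with the labels "rural" and "not rural" (an assumption carried over from Question 1). Because factor() orders levels alphabetically, "not rural" may already come first; calling fct_relevel() makes the choice of reference group explicit rather than accidental.

```r
library(tidyverse)

wrdata <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv") %>%
  mutate(isRural = factor(ifelse(location == "rural", "rural", "not rural")))

# Move "not rural" to the first position so it is used as the reference level
wrdata <- wrdata %>%
  mutate(isRural = fct_relevel(isRural, "not rural"))

# The first level listed is the reference group in lm()
levels(wrdata$isRural)
```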

Question 3

Fit the five models below, which are required to address the three research questions stated above. Note down which model(s) will be used to address each research question, and examine the results of each model.

Name the models as follows: “wb_mdl0”, “wb_mdl1”, “wb_mdl2”, “wb_mdl3”, and “wb_mdl4”.

\[ \text{Wellbeing} = \beta_0 + \epsilon \]
\[ \text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Social Interactions} + \epsilon \]
\[ \text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Social Interactions} + \beta_2 \cdot \text{Location}_{\text{Rural}} + \epsilon \]
\[ \begin{split} \text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Social Interactions} + \beta_2 \cdot \text{Location}_{\text{Rural}} \\ + \beta_3 \cdot (\text{Social Interactions} \cdot \text{Location}_{\text{Rural}}) + \epsilon \end{split} \]
\[ \begin{split} \text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Social Interactions} + \beta_2 \cdot \text{Location}_{\text{Rural}} \\ + \beta_3 \cdot (\text{Social Interactions} \cdot \text{Location}_{\text{Rural}}) + \beta_4 \cdot \text{Outdoor Time} + \epsilon \end{split} \]

Hint

The summary() function will be useful to examine the model output.

Solution
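The five equations above could be translated into lm() calls along these lines. This is a sketch that assumes isRural has been built as in Questions 1 and 2; the mapping of models to research questions in the comments follows the hints given later in the lab.

```r
library(tidyverse)

wrdata <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv") %>%
  mutate(
    isRural = factor(ifelse(location == "rural", "rural", "not rural")),
    isRural = fct_relevel(isRural, "not rural")
  )

# Null (intercept-only) model
wb_mdl0 <- lm(wellbeing ~ 1, data = wrdata)
# RQ1: wb_mdl0 vs wb_mdl1
wb_mdl1 <- lm(wellbeing ~ social_int, data = wrdata)
# RQ2: wb_mdl2 (additive) vs wb_mdl3 (with interaction)
wb_mdl2 <- lm(wellbeing ~ social_int + isRural, data = wrdata)
wb_mdl3 <- lm(wellbeing ~ social_int * isRural, data = wrdata)
# RQ3: wb_mdl3 vs wb_mdl4 (adds outdoor time)
wb_mdl4 <- lm(wellbeing ~ social_int * isRural + outdoor_time, data = wrdata)

# Examine each model in turn, for example:
summary(wb_mdl4)
```

Writing `social_int * isRural` expands to both main effects plus their interaction, which matches the terms in the fourth and fifth equations.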

Question 4

Provide key model results from the two models required to address RQ1 - whether there is an overall effect of the number of social interactions on wellbeing scores - in a single formatted table.

Hint

You will need to use a new package to do this - stargazer.

Like tab_model(), which you have used in many previous labs, stargazer() can take lots of different arguments to customize and build a table. You may want to consider specifying the arguments below (and remember that you can open the help page via ?stargazer for further information about the functionality of the package):

title = - specify the title of your table
dep.var.labels = - specify the name of your dependent variable(s)
covariate.labels = - specify the names of your covariates (or independent) variables
type = - specify whether you want ‘html’ (use when knitting to HTML), ‘latex’ (use when knitting to PDF), or ‘text’ (use when knitting to Word)
digits = - specify rounding (remember APA standard is, in most cases, 2 decimal places)
intercept.bottom = - specify if you want the intercept (or ‘constant’) value to be printed at the bottom (TRUE) or top (FALSE) of the output

Note

Your table will only render once you have knitted your document. Within your code chunk options, you may need to specify results = 'asis'.

You can learn more about updating your code chunk options here, and you should end up with the below:

Solution
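A sketch of a stargazer() call that could produce a table of this kind, using the arguments listed in the hint. Two choices here are assumptions: `single.row = TRUE` (to place each standard error beside its estimate) and `type = "text"` (so the table previews in the console; switch to "html" with the chunk option results = 'asis' when knitting to HTML).

```r
library(tidyverse)
library(stargazer)

wrdata <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")

wb_mdl0 <- lm(wellbeing ~ 1, data = wrdata)
wb_mdl1 <- lm(wellbeing ~ social_int, data = wrdata)

stargazer(wb_mdl0, wb_mdl1,
          type = "text",                 # "html" when knitting to HTML
          title = "Regression Model Results for RQ1",
          dep.var.labels = "Wellbeing Score",
          # With intercept.bottom = FALSE the intercept label comes first
          covariate.labels = c("Intercept", "Number of Social Interactions"),
          intercept.bottom = FALSE,
          single.row = TRUE,             # estimate and SE on one row
          digits = 2)
```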

**Regression Model Results for RQ1**

	Dependent variable:

	Wellbeing Score
	(1)	(2)

Intercept	36.29^*** (0.38)	32.41^*** (1.18)
Number of Social Interactions		0.32^*** (0.09)

Observations	200	200
R²	0.00	0.06
Adjusted R²	0.00	0.05
Residual Std. Error	5.39 (df = 199)	5.25 (df = 198)
F Statistic		12.15^*** (df = 1; 198)

Note:	^*p<0.1; ^**p<0.05; ^***p<0.01

Question 5

Is there a main effect of the number of weekly social interactions?

Check that the \(F\)-statistic and the \(p\)-value are the same as those given at the bottom of summary(wb_mdl1).

Hint

Use the anova() function to perform a model comparison between your model with social interactions (wb_mdl1) and the null model (wb_mdl0).

Remember that comparing a model against the null (intercept-only) model tests the null hypothesis that all of its slope coefficients are zero. By comparing wb_mdl0 to wb_mdl1, we can test whether we should include the IV social_int.
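For reference, the \(F\)-statistic for comparing two nested models can be written as follows. The notation is an assumption introduced here: \(SSR_R\) and \(SSR_F\) are the residual sums of squares of the restricted model (wb_mdl0) and the full model (wb_mdl1), and \(df_R\) and \(df_F\) are their residual degrees of freedom.

\[ F = \frac{(SSR_R - SSR_F) / (df_R - df_F)}{SSR_F / df_F} \]

Under the null hypothesis, this statistic follows an \(F\)-distribution with \(df_R - df_F\) and \(df_F\) degrees of freedom.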

Solution
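A minimal sketch of the comparison described in the hint, with a check that the anova() result agrees with the overall test reported by summary(). The helper object names (cmp, f_anova, and so on) are illustrative only.

```r
library(tidyverse)

wrdata <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")

wb_mdl0 <- lm(wellbeing ~ 1, data = wrdata)
wb_mdl1 <- lm(wellbeing ~ social_int, data = wrdata)

# Incremental F-test: null model vs model with social_int
cmp <- anova(wb_mdl0, wb_mdl1)
cmp

# F and p from the model comparison (second row of the anova table)
f_anova <- cmp$F[2]
p_anova <- cmp$`Pr(>F)`[2]

# F and p from the bottom line of summary(wb_mdl1)
f_summary <- summary(wb_mdl1)$fstatistic
p_summary <- pf(f_summary["value"], f_summary["numdf"], f_summary["dendf"],
                lower.tail = FALSE)

# Both checks should return TRUE
all.equal(f_anova, unname(f_summary["value"]))
all.equal(p_anova, unname(p_summary))
```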

Question 6

Does the association between number of social interactions and wellbeing differ between rural and non-rural residents?

Provide key model results from the two models in a single formatted table, and report the results of the model comparison in APA format.

Hint

To address RQ2, you need to compare “wb_mdl2” and “wb_mdl3”.

Solution
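A sketch of the model comparison for RQ2, assuming isRural is coded as in Questions 1 and 2. Because wb_mdl3 adds a single term to wb_mdl2, the incremental \(F\) should equal the square of the \(t\)-statistic for the interaction coefficient, which gives a useful sanity check.

```r
library(tidyverse)

wrdata <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv") %>%
  mutate(
    isRural = factor(ifelse(location == "rural", "rural", "not rural")),
    isRural = fct_relevel(isRural, "not rural")
  )

wb_mdl2 <- lm(wellbeing ~ social_int + isRural, data = wrdata)
wb_mdl3 <- lm(wellbeing ~ social_int * isRural, data = wrdata)

# Does adding the interaction significantly reduce residual variance?
cmp <- anova(wb_mdl2, wb_mdl3)
cmp

# Sanity check: with one added term, F equals t^2 for that coefficient
t_int <- coef(summary(wb_mdl3))["social_int:isRuralrural", "t value"]
all.equal(cmp$F[2], t_int^2)
```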

**Regression Model Results for RQ2**

	Dependent variable:

	Wellbeing Score
	(1)	(2)

Intercept	34.12^*** (1.07)	31.00^*** (1.43)
Number of Social Interactions	0.38^*** (0.08)	0.65^*** (0.12)
Location - Rural	-4.85^*** (0.66)	1.39 (2.05)
Number of Social Interactions * Location - Rural		-0.52^*** (0.16)

Observations	200	200
R²	0.26	0.30
Adjusted R²	0.25	0.29
Residual Std. Error	4.66 (df = 197)	4.56 (df = 196)
F Statistic	34.47^*** (df = 2; 197)	27.49^*** (df = 3; 196)

Note:	^*p<0.1; ^**p<0.05; ^***p<0.01

Question 7

Look at the amount of variation in wellbeing scores explained by models “wb_mdl3” and “wb_mdl4”.

From this, can we answer the third research question of whether weekly outdoor time explains a significant amount of variance in wellbeing scores over and above the interaction between weekly social interactions and location (rural vs not-rural)?

Provide justification/rationale for your answer.

Hint

Recall from Semester 1 that to determine how much variation is explained by a model, we need to look at our \(R^2\) values (specifically the adjusted \(R^2\) value in this case, since the models have multiple predictors).

Solution
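One way to pull out the values mentioned in the hint is sketched below, again assuming isRural is coded as in Questions 1 and 2. Note that these values are descriptive: \(R^2\) can never decrease when a predictor is added, so a difference between the two models does not by itself tell us whether the increase is statistically significant. That requires a formal model comparison.

```r
library(tidyverse)

wrdata <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv") %>%
  mutate(
    isRural = factor(ifelse(location == "rural", "rural", "not rural")),
    isRural = fct_relevel(isRural, "not rural")
  )

wb_mdl3 <- lm(wellbeing ~ social_int * isRural, data = wrdata)
wb_mdl4 <- lm(wellbeing ~ social_int * isRural + outdoor_time, data = wrdata)

# Collect R-squared and adjusted R-squared for both models side by side
fit_tbl <- tibble(
  model         = c("wb_mdl3", "wb_mdl4"),
  r_squared     = c(summary(wb_mdl3)$r.squared, summary(wb_mdl4)$r.squared),
  adj_r_squared = c(summary(wb_mdl3)$adj.r.squared, summary(wb_mdl4)$adj.r.squared)
)
fit_tbl
```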

Question 8

Does weekly outdoor time explain a significant amount of variance in wellbeing scores over and above the interaction between weekly social interactions and location (rural vs not-rural)?

Provide key model results from the two models in a single formatted table, and report the results of the model comparison in APA format.

Hint

To address RQ3, you need to compare “wb_mdl3” and “wb_mdl4”.

Solution
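A sketch of the comparison for RQ3, together with a small helper that assembles the result into an APA-style string. The formatting code and the object names (cmp, p_txt, apa) are illustrative assumptions, not a required approach; always check the string against the printed anova table before reporting it.

```r
library(tidyverse)

wrdata <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv") %>%
  mutate(
    isRural = factor(ifelse(location == "rural", "rural", "not rural")),
    isRural = fct_relevel(isRural, "not rural")
  )

wb_mdl3 <- lm(wellbeing ~ social_int * isRural, data = wrdata)
wb_mdl4 <- lm(wellbeing ~ social_int * isRural + outdoor_time, data = wrdata)

# Incremental F-test for adding outdoor_time
cmp <- anova(wb_mdl3, wb_mdl4)
cmp

# Build an APA-style summary: F(df1, df2) = value, p-value
p <- cmp$`Pr(>F)`[2]
p_txt <- if (p < .001) {
  "p < .001"
} else {
  # Drop the leading zero, as APA style does for p-values
  paste0("p = ", sub("^0", "", sprintf("%.3f", p)))
}
apa <- sprintf("F(%d, %d) = %.2f, %s",
               cmp$Df[2], cmp$Res.Df[2], cmp$F[2], p_txt)
apa
```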

**Regression Model Results for RQ3**

	Dependent variable:

	Wellbeing Score
	(1)	(2)

Intercept	31.00^*** (1.43)	27.61^*** (1.61)
Number of Social Interactions	0.65^*** (0.12)	0.65^*** (0.11)
Location - Rural	1.39 (2.05)	1.47 (1.98)
Outdoor Time (Hours)		0.18^*** (0.04)
Number of Social Interactions * Location - Rural	-0.52^*** (0.16)	-0.51^*** (0.16)

Observations	200	200
R²	0.30	0.35
Adjusted R²	0.29	0.34
Residual Std. Error	4.56 (df = 196)	4.39 (df = 195)
F Statistic	27.49^*** (df = 3; 196)	26.28^*** (df = 4; 195)

Note:	^*p<0.1; ^**p<0.05; ^***p<0.01

Question 9

Compare the following two models, each of which looks at the association between wellbeing scores and two different predictor variables.

\(\text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Social Interactions} + \beta_2 \cdot \text{Age} + \epsilon\)

\(\text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Outdoor Time} + \beta_2 \cdot \text{Routine} + \epsilon\)

In APA format, report which model you think best fits the data.

Hint

Compare using AIC() and BIC() since the models are non-nested.
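As a reminder, the two criteria can be written as follows, where \(\hat{L}\) is the maximised likelihood of the model, \(k\) is the number of estimated parameters, and \(n\) is the number of observations (this notation is introduced here for illustration):

\[ \text{AIC} = -2 \ln(\hat{L}) + 2k \qquad \text{BIC} = -2 \ln(\hat{L}) + k \ln(n) \]

For both measures, lower values indicate a better trade-off between fit and complexity. Because \(\ln(n) > 2\) whenever \(n \geq 8\), BIC penalises additional parameters more heavily than AIC in samples of that size or larger.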

Solution
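A minimal sketch of the comparison suggested in the hint. The model names mdl_a and mdl_b are hypothetical labels chosen for this example. AIC and BIC are only comparable when both models are fitted to the same observations, so the sketch checks the sample sizes first.

```r
library(tidyverse)

wrdata <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")

# Two non-nested models: neither is a special case of the other
mdl_a <- lm(wellbeing ~ social_int + age, data = wrdata)
mdl_b <- lm(wellbeing ~ outdoor_time + routine, data = wrdata)

# Confirm both models use the same number of observations
c(mdl_a = nobs(mdl_a), mdl_b = nobs(mdl_b))

# Lower values indicate the preferred model
aic_tbl <- AIC(mdl_a, mdl_b)
bic_tbl <- BIC(mdl_a, mdl_b)
aic_tbl
bic_tbl
```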

Question 10

The code below fits 5 different models based on our wrdata:

model1 <- lm(wellbeing ~ social_int + outdoor_time, data = wrdata)
model2 <- lm(wellbeing ~ social_int + outdoor_time + age, data = wrdata)
model3 <- lm(wellbeing ~ social_int + outdoor_time + routine, data = wrdata)
model4 <- lm(wellbeing ~ social_int + outdoor_time + routine + age, data = wrdata)
model5 <- lm(wellbeing ~ social_int + outdoor_time + routine + steps_k, data = wrdata)

For each of the below pairs of models, what methods are/are not available for us to use for comparison and why?

model1 vs model2
model2 vs model3
model1 vs model4
model3 vs model5

This flowchart might help you to reach your decision:

Hint

You may need to examine the dataset, and check for accuracy (e.g., are there any impossible / out of range values?) and completeness (e.g., are there any missing values?).

Solution
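The checks suggested in the hint could be sketched as follows. This is one possible approach: the object names wr_complete, model3_cc and model5_cc are hypothetical, and refitting on complete cases is an assumption about how one might handle the missing values, not a prescribed method.

```r
library(tidyverse)

wrdata <- read_csv("https://uoepsy.github.io/data/wellbeing_rural.csv")

# Completeness and range checks for every variable
colSums(is.na(wrdata))
summary(wrdata)

model3 <- lm(wellbeing ~ social_int + outdoor_time + routine, data = wrdata)
model5 <- lm(wellbeing ~ social_int + outdoor_time + routine + steps_k, data = wrdata)

# lm() silently drops rows with missing values, so nested-looking models
# can end up fitted to different data; anova(), AIC() and BIC() comparisons
# are not valid in that case
c(model3 = nobs(model3), model5 = nobs(model5))

# Refit both models on the rows that are complete for all variables involved
wr_complete <- wrdata %>%
  drop_na(wellbeing, social_int, outdoor_time, routine, steps_k)
model3_cc <- lm(wellbeing ~ social_int + outdoor_time + routine, data = wr_complete)
model5_cc <- lm(wellbeing ~ social_int + outdoor_time + routine + steps_k, data = wr_complete)

# Now the models are nested and share the same observations
anova(model3_cc, model5_cc)
```

For the other pairs, the key question is nesting: an incremental \(F\)-test requires that one model's predictors are a subset of the other's, whereas AIC and BIC can also be used for non-nested models fitted to the same data.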