This week, we’ll be reviewing what we’ve learned in weeks 7-10 and applying it to a practical example.
Our example is based on the paper *Class Attendance in College: A Meta-Analytic Review of the Relationship of Class Attendance With Grades and Student Characteristics*. This paper examines the association between class attendance and a range of other variables (such as student personality, academic performance, and class marks). Imagine that we’ve read that paper and want to investigate further the variables that contribute to attendance. Specifically, we will investigate a range of categorical predictors here. We’ll also use our data to try to replicate the association between attendance and marks reported in that paper.
During this week’s example, we will conduct two separate, but related, studies. In the first, we are interested in exploring possible predictors of attendance in university courses. In the second, we will investigate the association between attendance and final marks in our sample.
We collected data from 397 university students across years 1-4, as well as students in MSc and PhD programmes. From these students, we gathered data on class time, the use of online materials, and students’ conscientiousness.
| Variable | Description |
|---|---|
| pid | Participant ID number |
| Attendance | Total attendance in days |
| Conscientiousness | Level of conscientiousness (levels = Low; Moderate; High) |
| Time | Class time (levels = 9AM; 10AM; 11AM; 12PM; 1PM; 2PM; 3PM; 4PM) |
| OnlineAccess | Frequency of access to online course materials (levels = Rarely; Sometimes; Often) |
| Year | Year in university (levels = Y1; Y2; Y3; Y4; MSc; PhD) |
We want to use these data to investigate the following research questions:
Research Question 1: Can students’ conscientiousness, frequency of access to online materials, and year in university predict course attendance?
Research Question 2: Is the time at which the class is scheduled associated with student attendance?
First, let’s load the data and all necessary packages. Note that by setting the `stringsAsFactors` argument in the `read.csv` function, all of our character variables are automatically imported as factors, which is very handy when you’re working with a dataset that contains many factors.
library(tidyverse)
library(kableExtra)
library(car)
library(sjPlot)
library(emmeans)
dat <- read.csv('https://uoepsy.github.io/data/DapR2_S1B2_PracticalPart1.csv', stringsAsFactors = T)
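If you forget to set `stringsAsFactors = TRUE` when reading the data, you can convert all character columns to factors afterwards. Here's a quick sketch using `dplyr` (part of the tidyverse we've already loaded) on a small illustrative data frame:

```r
library(dplyr)

# Illustrative data frame with a character column (toy values, not the real data)
dat_chr <- data.frame(pid = 1:3,
                      Conscientiousness = c('Low', 'High', 'Moderate'))

# Convert every character column to a factor in one step
dat_fct <- dat_chr %>% mutate(across(where(is.character), as.factor))

str(dat_fct$Conscientiousness)  # now a factor
```

The same `mutate(across(...))` pattern works on the full `dat` object if it was read with the default `stringsAsFactors = FALSE`.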
First, we’ll have a look at our data using the `summary` function:
summary(dat)
## pid Attendance Conscientiousness Time
## Min. : 1 Min. : 0.00 Length:397 Length:397
## 1st Qu.:100 1st Qu.:16.00 Class :character Class :character
## Median :199 Median :29.00 Mode :character Mode :character
## Mean :199 Mean :28.28
## 3rd Qu.:298 3rd Qu.:40.00
## Max. :395 Max. :60.00
## OnlineAccess Year
## Length:397 Length:397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
You’ll notice that we have a continuous outcome variable, `Attendance`, and discrete predictor variables, `Conscientiousness`, `Time`, `OnlineAccess`, and `Year`.
We’ll also have a look at how our variables are distributed, using a histogram for our continuous outcome and bar plots for our factors:
ggplot(dat, aes(Attendance)) + geom_histogram(colour = 'black', binwidth = 5)
ggplot(dat, aes(Conscientiousness, fill = Conscientiousness)) + geom_bar() +
theme(legend.position = 'none')
ggplot(dat, aes(Year, fill = Year)) + geom_bar() +
theme(legend.position = 'none')
ggplot(dat, aes(Time, fill = Time)) + geom_bar() +
theme(legend.position = 'none')
Given the first research question we’ve specified, *Can students’ conscientiousness, frequency of access to online materials, and year in university predict course attendance?*, we want to know the relationship between the following variables:
\[Attendance \sim Conscientiousness + OnlineAccess + Year\]
Across the three predictors, we have 12 levels in total. Before we run our model, we have to make a few decisions about the coding scheme and the baseline comparisons we’ll be making across predictors. We decide to use dummy coding. We’ll use `Moderate` as the baseline level for `Conscientiousness` and `Sometimes` as the baseline level for `OnlineAccess`. We’ll keep `Y1` as the baseline level for `Year`. We can use the `factor` function to order our levels accordingly:
dat$Conscientiousness <- factor(dat$Conscientiousness, levels = c('Moderate', 'Low', 'High'))
summary(dat$Conscientiousness)
## Moderate Low High
## 146 127 124
dat$OnlineAccess <- factor(dat$OnlineAccess, levels = c('Sometimes', 'Rarely', 'Often'))
summary(dat$OnlineAccess)
## Sometimes Rarely Often
## 170 101 126
dat$Year <- factor(dat$Year, levels = c('Y1', 'Y2', 'Y3', 'Y4', 'MSc', 'PhD'))
summary(dat$Year)
## Y1 Y2 Y3 Y4 MSc PhD
## 89 100 66 71 48 23
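To see exactly what dummy coding R will apply, we can inspect the contrast matrix of a factor. A quick sketch with a toy three-level factor ordered the same way as our `Conscientiousness` variable (the data values are illustrative):

```r
# Toy factor mirroring the level ordering we set above
f <- factor(c('Moderate', 'Low', 'High'),
            levels = c('Moderate', 'Low', 'High'))

# Default treatment (dummy) coding: the first level (Moderate) is the
# baseline, coded 0 in every column; each other level gets its own column
contrasts(f)
```

Each row of this matrix shows how one level is represented in the design matrix, which is why the baseline level never appears as a named coefficient in the model output.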
With that done, we should end up with the model below. Note that some variable names have been shortened for sizing purposes:
\[Attendance = \beta_0+\beta_1Consc_{low} + \beta_2Consc_{high} + \beta_3Online_{rarely} + \beta_4Online_{often} + \beta_5Year_{Y2} + \beta_6Year_{Y3} + \beta_7Year_{Y4} + \beta_8Year_{MSc} + \beta_9Year_{PhD} + \epsilon\]
Now we can run our model:
m1 <- lm(Attendance~Conscientiousness+OnlineAccess+Year, dat)
summary(m1)
##
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess +
## Year, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.657 -6.990 -0.279 6.085 31.844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.874 1.533 18.179 < 2e-16 ***
## ConscientiousnessLow -10.292 1.399 -7.359 1.12e-12 ***
## ConscientiousnessHigh 7.366 1.392 5.292 2.03e-07 ***
## OnlineAccessRarely -5.378 1.441 -3.732 0.000218 ***
## OnlineAccessOften -3.533 1.339 -2.639 0.008649 **
## YearY2 4.574 1.657 2.760 0.006049 **
## YearY3 3.418 1.853 1.844 0.065926 .
## YearY4 4.266 1.817 2.347 0.019418 *
## YearMSc 5.649 2.046 2.760 0.006051 **
## YearPhD 12.484 2.661 4.692 3.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3424
## F-statistic: 23.91 on 9 and 387 DF, p-value: < 2.2e-16
This looks interesting, but before we interpret, let’s check assumptions:
**Linearity:** We can assume linearity when all x-variables entered into the model are categorical.
**Independence of Errors:** We are using between-subjects data, so we’ll also assume independence of our error terms.
**Normality of Residuals:** We can check this using histograms and QQ plots:
hist(m1$residuals)
plot(m1, which = 2)
Tails are a bit fatter than we would expect (particularly on the right), but overall, this looks pretty good. No major concerns.
**Equality of Variance (Homoscedasticity):** We can check for heteroscedasticity using residuals vs. predicted values plots. We can get these using the `residualPlot` function from the `car` package:
residualPlot(m1)
Looks good! Now we can interpret the model.
summary(m1)
##
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess +
## Year, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.657 -6.990 -0.279 6.085 31.844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.874 1.533 18.179 < 2e-16 ***
## ConscientiousnessLow -10.292 1.399 -7.359 1.12e-12 ***
## ConscientiousnessHigh 7.366 1.392 5.292 2.03e-07 ***
## OnlineAccessRarely -5.378 1.441 -3.732 0.000218 ***
## OnlineAccessOften -3.533 1.339 -2.639 0.008649 **
## YearY2 4.574 1.657 2.760 0.006049 **
## YearY3 3.418 1.853 1.844 0.065926 .
## YearY4 4.266 1.817 2.347 0.019418 *
## YearMSc 5.649 2.046 2.760 0.006051 **
## YearPhD 12.484 2.661 4.692 3.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3424
## F-statistic: 23.91 on 9 and 387 DF, p-value: < 2.2e-16
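To see how the dummy-coded coefficients combine into a fitted value, we can compute one prediction by hand. For example, the predicted attendance for a hypothetical Y2 student with low conscientiousness who rarely accesses online materials is the intercept plus the three relevant coefficients (a quick sketch; estimates copied from the summary output above):

```r
# Coefficient estimates copied from summary(m1) above
b0       <- 27.874   # intercept: Y1, Moderate conscientiousness, Sometimes
b_low    <- -10.292  # ConscientiousnessLow
b_rarely <- -5.378   # OnlineAccessRarely
b_y2     <- 4.574    # YearY2

# Predicted attendance (in days) for that combination of levels
b0 + b_low + b_rarely + b_y2
```

The same value could be obtained with `predict(m1, newdata = ...)`; doing it by hand makes clear that each coefficient is an adjustment relative to the baseline group described by the intercept.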
To get some practice with coding schemes, let’s run the same model with effects coding.
First, let’s consider which level we want to drop in each case. Perhaps we’re more interested in how the “extreme” groups differ from the average. We’ll drop the `Moderate` and `Sometimes` levels from the `Conscientiousness` and `OnlineAccess` variables, respectively, and `Y3` from our `Year` variable. To do this, we’ll need to reorder our factor levels so that the dropped level comes last:
dat$Conscientiousness <- factor(dat$Conscientiousness, levels = c('Low', 'High', 'Moderate'))
summary(dat$Conscientiousness)
## Low High Moderate
## 127 124 146
dat$OnlineAccess <- factor(dat$OnlineAccess, levels = c('Rarely', 'Often', 'Sometimes'))
summary(dat$OnlineAccess)
## Rarely Often Sometimes
## 101 126 170
dat$Year <- factor(dat$Year, levels = c('Y1', 'Y2', 'Y4', 'MSc', 'PhD', 'Y3'))
summary(dat$Year)
## Y1 Y2 Y4 MSc PhD Y3
## 89 100 71 48 23 66
We also need to specify effects coding using the `contr.sum` function:
contrasts(dat$Conscientiousness) <- contr.sum
contrasts(dat$OnlineAccess) <- contr.sum
contrasts(dat$Year) <- contr.sum
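To see what this coding actually does, we can print the sum-coded contrast matrix for a three-level factor. Each displayed level gets its own column, and the dropped (last) level is coded -1 in every column, which is what makes each coefficient a comparison against the grand mean:

```r
# Sum (effects) coding for a 3-level factor: rows are levels in order,
# and the final level is coded -1 across all columns
contr.sum(3)
```

This is why, in the model output below, the dropped levels (`Moderate`, `Sometimes`, `Y3`) have no coefficient of their own.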
Now we can run our model. Keep in mind that we don’t need to recheck assumptions; the data are the same, and it’s only the comparisons that differ.
m2 <- lm(Attendance~Conscientiousness+OnlineAccess+Year, dat)
summary(m2)
##
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess +
## Year, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.657 -6.990 -0.279 6.085 31.844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.9933 0.6544 44.306 < 2e-16 ***
## Conscientiousness1 -9.3164 0.8280 -11.251 < 2e-16 ***
## Conscientiousness2 8.3412 0.8243 10.119 < 2e-16 ***
## OnlineAccess1 -2.4075 0.8844 -2.722 0.00678 **
## OnlineAccess2 -0.5626 0.8291 -0.679 0.49785
## Year1 -5.0652 1.1782 -4.299 2.17e-05 ***
## Year2 -0.4909 1.1291 -0.435 0.66396
## Year3 -0.7992 1.2777 -0.626 0.53200
## Year4 0.5836 1.4888 0.392 0.69528
## Year5 7.4193 2.0433 3.631 0.00032 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3424
## F-statistic: 23.91 on 9 and 387 DF, p-value: < 2.2e-16
Before I include the model summary in my write-up, I’ll format it into a nicer-looking table using `tab_model`:
tab_model(m2, pred.labels = c('Intercept', 'Low Conscientiousness', 'High Conscientiousness', 'Online Access - Rarely', 'Online Access - Often', 'Y1', 'Y2', 'Y4', 'MSc', 'PhD'),
title = 'Regression Results - Predictors of Student Attendance')
*Attendance*

| Predictors | Estimates | CI | p |
|---|---|---|---|
| Intercept | 28.99 | 27.71 – 30.28 | <0.001 |
| Low Conscientiousness | -9.32 | -10.94 – -7.69 | <0.001 |
| High Conscientiousness | 8.34 | 6.72 – 9.96 | <0.001 |
| Online Access - Rarely | -2.41 | -4.15 – -0.67 | 0.007 |
| Online Access - Often | -0.56 | -2.19 – 1.07 | 0.498 |
| Y1 | -5.07 | -7.38 – -2.75 | <0.001 |
| Y2 | -0.49 | -2.71 – 1.73 | 0.664 |
| Y4 | -0.80 | -3.31 – 1.71 | 0.532 |
| MSc | 0.58 | -2.34 – 3.51 | 0.695 |
| PhD | 7.42 | 3.40 – 11.44 | <0.001 |
| Observations | 397 | | |
| R2 / R2 adjusted | 0.357 / 0.342 | | |
Our second research question is *Is the time at which class is scheduled associated with student attendance?* To test this, we’ll use the `emmeans` package to write our own contrasts. We want to know the relationship between the following variables:
\[Attendance \sim Time\]
First, let’s relevel our `Time` variable so that its levels are in chronological order:
dat$Time <- factor(dat$Time, levels = c('9AM', '10AM', '11AM','12PM', '1PM', '2PM', '3PM', '4PM'))
summary(dat$Time)
## 9AM 10AM 11AM 12PM 1PM 2PM 3PM 4PM
## 56 48 46 47 45 46 52 57
We’ll look at how `Time` is distributed using a bar plot:
ggplot(dat, aes(Time, fill = Time)) + geom_bar() +
theme(legend.position = 'none')
First, let’s run our model and check assumptions. Remember, we don’t need to check linearity (our only predictor is categorical) or independence (the data are between-subjects).
**Normality of Residuals:**
m3 <- lm(Attendance~Time, dat)
hist(m3$residuals)
plot(m3, which = 2)
The distribution is a bit flat but, if you remember from class, distributional problems are usually not a strong concern (and we can always bootstrap for extra certainty!). Heteroscedasticity is much more of a problem, so let’s check that using our residuals vs. predicted values plot.
residualPlot(m3)
This looks great! We’re going to keep our model as is.
Now let’s compute the estimated marginal means with `emmeans`; we’ll use these to build our contrasts:
timeMean <- emmeans(m3, ~Time)
plot(timeMean)
I’d guess we’re going to see some significant differences here. Let’s say, however, we’re less worried about specific times and more about times of day in general. We decide to look at midday compared to earlier (9AM and 10AM) and later (3PM and 4PM) classes. Before running our analysis, we have to manually group our levels:
levels(dat$Time)
## [1] "9AM" "10AM" "11AM" "12PM" "1PM" "2PM" "3PM" "4PM"
timeComp <- list('Middle of the Day vs Early or Late' = c(-1/4,-1/4, 1/4, 1/4, 1/4, 1/4, -1/4, -1/4))
In this case, our \(H_0\) is
\[\frac{1}{4}(\mu_3+\mu_4+\mu_5+\mu_6) - \frac{1}{4}(\mu_1+\mu_2+\mu_7+\mu_8) = 0\]
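Before running the test, it’s worth a quick sanity check that the weights form a valid contrast, i.e. that they sum to zero (otherwise the comparison would not be a pure difference of means):

```r
# Same weights as above: midday levels (+1/4) vs early/late levels (-1/4)
timeComp <- list('Middle of the Day vs Early or Late' =
                   c(-1/4, -1/4, 1/4, 1/4, 1/4, 1/4, -1/4, -1/4))

sum(timeComp[[1]])  # a valid contrast sums to 0
```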
timeTest <- contrast(timeMean, timeComp)
timeTest
## contrast estimate SE df t.ratio p.value
## Middle of the Day vs Early or Late 5.36 1.35 389 3.963 0.0001
confint(timeTest)
## contrast estimate SE df lower.CL upper.CL
## Middle of the Day vs Early or Late 5.36 1.35 389 2.7 8.01
##
## Confidence level used: 0.95
We collected attendance and final mark data from 200 university students.
| Variable | Description |
|---|---|
| Marks | Final Mark in points |
| Attendance | Total attendance in days |
We want to use these data to investigate the following research question:
Research Question: Can student attendance be used to predict marks?
Our hypotheses, in this case, can be stated in the following way:
\[H_0: \beta_1 = 0\] \[H_1: \beta_1 \neq 0\]
First, let’s load the data.
dat2 <- read.csv('https://uoepsy.github.io/data/DapR2_S1B2_PracticalPart2.csv', stringsAsFactors = T)
Next, let’s have a look at the data:
summary(dat2)
## Marks Attendance
## Min. :25.01 Min. :10.50
## 1st Qu.:38.22 1st Qu.:22.88
## Median :48.81 Median :35.25
## Mean :49.79 Mean :35.25
## 3rd Qu.:60.90 3rd Qu.:47.62
## Max. :98.20 Max. :60.00
You’ll notice that here we have two numerical variables, `Attendance` and `Marks`.
Let’s look at how our variables are distributed, using histograms:
ggplot(dat2, aes(Attendance)) + geom_histogram(colour = 'black', binwidth = 2)
Ok, that looks a bit weird…but let’s keep going.
ggplot(dat2, aes(Marks)) + geom_histogram(colour = 'black', binwidth = 4)
We’ll fit the following model to investigate our research question:
\[Marks = \beta_0 + \beta_1Attendance + \epsilon\]
Let’s run our model, then we’ll check assumptions:
m4 <- lm(Marks~Attendance, dat2)
summary(m4)
##
## Call:
## lm(formula = Marks ~ Attendance, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1477 -4.5210 -0.1861 4.2501 26.8415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.83270 1.25534 11.82 <2e-16 ***
## Attendance 0.99163 0.03296 30.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.727 on 198 degrees of freedom
## Multiple R-squared: 0.8205, Adjusted R-squared: 0.8196
## F-statistic: 905.3 on 1 and 198 DF, p-value: < 2.2e-16
The effect of attendance looks highly significant, but we’ll have to check assumptions before interpreting, especially given how the distribution of attendance looked.
**Linearity:** It’s a simple regression, so we can just look at the scatterplot:
ggplot(dat2, aes(Attendance, Marks)) + geom_point() +
geom_smooth(method = 'lm') +
geom_smooth(method = 'loess', colour = 'red')
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
There’s a very slight curve in the loess line, but based on the points, I think that’s due to possible issues in the distribution of the observations rather than a nonlinear relationship. We’ll say the linearity assumption is not violated.
**Independence of Errors:** We are using between-subjects data, so we’ll assume independence of our error terms.
**Normality of Residuals:** Let’s check with a histogram and QQ plot:
hist(m4$residuals)
A possible outlier to the right, but not bad…
plot(m4, which = 2)
Ok, this actually looks alright, I think. Just the slight tail to the right.
**Equality of Variance (Homoscedasticity):** We’ll use the `residualPlot` function to check for heteroscedasticity:
residualPlot(m4)
Oh no! It’s violated! We can’t trust the SEs in our model, so we’ll need to bootstrap. We can do this using the `Boot` function in the `car` package. Let’s resample 1000 times.
k <- 1000
boot_m4 <- Boot(m4, R = k)
## Loading required namespace: boot
summary(boot_m4)
##
## Number of bootstrap replications R = 1000
## original bootBias bootSE bootMed
## (Intercept) 14.83270 -0.0412065 0.980182 14.75525
## Attendance 0.99163 0.0016119 0.034549 0.99423
Let’s calculate confidence intervals to test our hypothesis:
confint(boot_m4)
## Bootstrap bca confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) 13.0081992 16.799837
## Attendance 0.9223813 1.058881
We can make this look nicer for our write-up:
confint(boot_m4) %>%
kable(digits = 2, caption = 'Bootstrapped 95% Confidence Intervals') %>%
kable_styling(full_width = F)
*Bootstrapped 95% Confidence Intervals*

|  | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 13.01 | 16.80 |
| Attendance | 0.92 | 1.06 |
Because our bootstrapped confidence interval for the coefficient associated with attendance does not include 0, we can conclude that attendance is, in fact, a significant predictor of marks.
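To give the slope a practical interpretation: each additional day of attendance is associated with roughly one extra mark. A quick sketch using the point estimates reported above (the 30-day student is hypothetical):

```r
# Point estimates copied from summary(m4) above
b0 <- 14.83270  # intercept
b1 <- 0.99163   # slope for Attendance

# Predicted final mark for a hypothetical student attending 30 days
b0 + b1 * 30
```

Note that while the bootstrap changed our standard errors and confidence intervals, the point estimates themselves are the same as in the original model.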
This week’s lab involves writing up the results from the three research questions above.