This week, we’ll be reviewing what we’ve learned in weeks 7-10 and applying it to a practical example. Specifically, we’ll cover:
Our example is based on the paper Class Attendance in College: A Meta-Analytic Review of the Relationship of Class Attendance With Grades and Student Characteristics This paper looks at the association between class attendance and a range of other variables (such as student personality, academic performance, class marks, etc.) Imagine that we’ve read that paper and we want to perform further investigation on the variables that contribute to attendance. Specifically, we will investigate a range of categorical variables here. We’ll also use our data to try and replicate the association between attendance and marks that is found in this paper.
During this week’s example, we will conduct two separate, but related, studies. In the first, we are interested in exploring possible predictors of attendance in university courses. In the second, we will investigate the association between attendance and final marks in our sample.
We collected data from 397 university students across years 1-4, as well as students in MSc and PhD programs. From these students, we gathered data on class time, the use of online materials, and student’s self-discipline.
Variable | Description |
---|---|
pid | Participant ID number |
Attendance | Total attendance in days |
Conscientiousness | Level of conscientiousness (levels = Low; Moderate; High) |
Time | Class time (levels = 9AM; 10AM; 11AM; 12PM; 1PM; 2PM; 3PM; 4PM) |
OnlineAccess | Frequency of access to online course materials (levels = Rarely; Sometimes; Often) |
Year | Year in university (levels = Y1; Y2; Y3; Y4; MSc; PhD) |
We want to use these data to investigate the following research questions:
Research Question 1: Can student’s conscientiousness, frequency of access to online materials, and year in University predict course attendance?
Research Question 2: Is the time at which the class is scheduled associated with student attendance?
First, let’s load the data and all necessary packages. Note that by using the stringsAsFactors
argument in the read.csv
function, all of our character variables are automatically imported as factors, which is very handy when you’re working with a dataset with lots of factors in it.
library(tidyverse)
library(emmeans)
library(kableExtra)
library(car)
library(sjPlot)
dat <- read.csv('https://uoepsy.github.io/data/DapR2_S1B2_PracticalPart1.csv', stringsAsFactors = T)
First, we’ll have a look at our data using the summary
function:
dat %>%
summary(.)
## pid Attendance Conscientiousness Time
## Min. : 1 Min. : 0.00 Length:397 Length:397
## 1st Qu.:100 1st Qu.:16.00 Class :character Class :character
## Median :199 Median :29.00 Mode :character Mode :character
## Mean :199 Mean :28.28
## 3rd Qu.:298 3rd Qu.:40.00
## Max. :395 Max. :60.00
## OnlineAccess Year
## Length:397 Length:397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
You’ll notice that we have a continuous outcome variable, Attendance
, and discrete predictor variables, Conscientiousness
, Time
, OnlineAccess
and Year
.
We’ll also have a look at how our variables are distributed, using a histogram for our continous outcome and bar plots for our factors:
ggplot(dat, aes(Attendance)) + geom_histogram(colour = 'black', binwidth = 5)
ggplot(dat, aes(Conscientiousness, fill = Conscientiousness)) + geom_bar() +
theme(legend.position = 'none')
ggplot(dat, aes(Year, fill = Year)) + geom_bar() +
theme(legend.position = 'none')
ggplot(dat, aes(Time, fill = Time)) + geom_bar() +
theme(legend.position = 'none')
Given the first research question we’ve specified, Can student’s conscientiousness, frequency of access to online materials, and year in University predict course attendance?, we’ll use the following model:
\[Attendance \sim Conscientiousness + OnlineAccess + Year\]
Between the 3 predictors, we have 12 levels. Before we run our model, we have to make a few decisions in terms of the coding and baseline comparisons we’ll be making across predictors. We decide to use dummy coding. We’ll use be using Moderate
as the baseline level for Conscientiousness
and Sometimes
as the baseline level for OnlineAccess
. We’ll keep Y1
as the baseline level for Year
. We can use the factor
function to order our levels accordingly:
dat$Conscientiousness <- dat$Conscientiousness %>%
factor(., levels = c('Moderate', 'Low', 'High'))
summary(dat$Conscientiousness)
## Moderate Low High
## 146 127 124
dat$OnlineAccess <- dat$OnlineAccess %>%
factor(., levels = c('Sometimes', 'Rarely', 'Often'))
summary(dat$OnlineAccess)
## Sometimes Rarely Often
## 170 101 126
dat$Year <- dat$Year %>%
factor(., levels = c('Y1', 'Y2', 'Y3', 'Y4', 'MSc', 'PhD'))
summary(dat$Year)
## Y1 Y2 Y3 Y4 MSc PhD
## 89 100 65 72 48 23
Now that we have this done, we should end up with a model that looks like the one below. Note that some variable names have been shortened for sizing purposes:
\[Attendance \sim \beta_0+\beta_1Consc_{low} + \beta_2Consc_{high} + \beta_3Online_{rarely} + \beta_4Online_{often} + \beta_5Year_{Y2} + \beta_6Year_{Y3} + \beta_7Year_{Y4} + \beta_8Year_{MSc} + \beta_9Year_{PhD} + \epsilon\]
Now we can run our model:
m1 <- lm(Attendance~Conscientiousness+OnlineAccess+Year, dat)
summary(m1)
##
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess +
## Year, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.777 -6.996 -0.243 6.079 31.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.882 1.533 18.186 < 2e-16 ***
## ConscientiousnessLow -10.304 1.398 -7.370 1.04e-12 ***
## ConscientiousnessHigh 7.361 1.392 5.288 2.08e-07 ***
## OnlineAccessRarely -5.387 1.441 -3.738 0.000214 ***
## OnlineAccessOften -3.535 1.339 -2.640 0.008631 **
## YearY2 4.573 1.657 2.760 0.006064 **
## YearY3 3.534 1.860 1.899 0.058247 .
## YearY4 4.150 1.812 2.291 0.022528 *
## YearMSc 5.649 2.047 2.760 0.006059 **
## YearPhD 12.483 2.661 4.691 3.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared: 0.3572, Adjusted R-squared: 0.3423
## F-statistic: 23.9 on 9 and 387 DF, p-value: < 2.2e-16
This looks interesting, but before we interpret, let’s check assumptions:
Linearity We can assume linearity when working with categorical predictors (see here)
Independence of Errors We are using between-subjects data, so we’ll also assume independence of our error terms.
Normality of Residuals We can check this using histograms and QQ plots:
hist(m1$residuals)
plot(m1, which = 2)
Tails are a bit fatter than we would expect (particularly on the right), but overall, this looks pretty good. No major concerns.
Equality of Variance (Homoskedasticity)
We can check for heteroskedasticity using residuals vs predicted values plots. We can get these using the residualPlot
from the car
package:
residualPlot(m1)
Looks good! Now we can interpret the model.
summary(m1)
##
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess +
## Year, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.777 -6.996 -0.243 6.079 31.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.882 1.533 18.186 < 2e-16 ***
## ConscientiousnessLow -10.304 1.398 -7.370 1.04e-12 ***
## ConscientiousnessHigh 7.361 1.392 5.288 2.08e-07 ***
## OnlineAccessRarely -5.387 1.441 -3.738 0.000214 ***
## OnlineAccessOften -3.535 1.339 -2.640 0.008631 **
## YearY2 4.573 1.657 2.760 0.006064 **
## YearY3 3.534 1.860 1.899 0.058247 .
## YearY4 4.150 1.812 2.291 0.022528 *
## YearMSc 5.649 2.047 2.760 0.006059 **
## YearPhD 12.483 2.661 4.691 3.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared: 0.3572, Adjusted R-squared: 0.3423
## F-statistic: 23.9 on 9 and 387 DF, p-value: < 2.2e-16
Now that we’ve created a model with dummy-coded predictors, let’s run the same model with effects coding.
First, let’s consider which level we want to drop in each case. Perhaps we’re more interested in how the “extreme” groups differ from average. We’ll drop the Moderate
and Sometimes
levels from the Conscientiousness
and OnlineAccess
variables, respectively. We’ll drop Y3
from our Year
variable. To do this, we’ll need to reorder our variables accordingly:
dat$Conscientiousness <- dat$Conscientiousness %>%
factor(., levels = c('Low', 'High', 'Moderate'))
summary(dat$Conscientiousness)
## Low High Moderate
## 127 124 146
dat$OnlineAccess <- dat$OnlineAccess %>%
factor(., levels = c('Rarely', 'Often', 'Sometimes'))
summary(dat$OnlineAccess)
## Rarely Often Sometimes
## 101 126 170
dat$Year <- dat$Year %>%
factor(., levels = c('Y1', 'Y2', 'Y4', 'MSc', 'PhD', 'Y3'))
summary(dat$Year)
## Y1 Y2 Y4 MSc PhD Y3
## 89 100 72 48 23 65
We also need to specify effects coding using the contr.sum
function:
contrasts(dat$Conscientiousness) <- contr.sum
contrasts(dat$OnlineAccess) <- contr.sum
contrasts(dat$Year) <- contr.sum
Now we can run our model. Keep in mind that we don’t need to recheck assumptions. The data are the same, it’s just the comparisons that differ.
m2 <- lm(Attendance~Conscientiousness+OnlineAccess+Year, dat)
summary(m2)
##
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess +
## Year, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.777 -6.996 -0.243 6.079 31.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.9921 0.6545 44.296 < 2e-16 ***
## Conscientiousness1 -9.3232 0.8278 -11.262 < 2e-16 ***
## Conscientiousness2 8.3421 0.8244 10.119 < 2e-16 ***
## OnlineAccess1 -2.4128 0.8850 -2.727 0.006692 **
## OnlineAccess2 -0.5609 0.8298 -0.676 0.499436
## Year1 -5.0648 1.1784 -4.298 2.18e-05 ***
## Year2 -0.4914 1.1293 -0.435 0.663726
## Year3 -0.9146 1.2725 -0.719 0.472740
## Year4 0.5838 1.4890 0.392 0.695223
## Year5 7.4179 2.0436 3.630 0.000322 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared: 0.3572, Adjusted R-squared: 0.3423
## F-statistic: 23.9 on 9 and 387 DF, p-value: < 2.2e-16
This isn’t entirely necessary, but here I’m just going to create some objects to use for inline coding in the example.
sumM2 <- summary(m2)
FDat <- round(sumM2$fstatistic, 2)
m2Coefs <- round(coef(m2), 2)
tStats <- round(coef(sumM2)[, "t value"],2)
seVals <- round(coef(sumM2)[, "Std. Error"],2)
We will be evaluating all results using \(\alpha = .05\).
Looking at these results, we can see that our overall model is significant, \(F(9, 387) = 23.9, p < .001\). Conscientiousness, frequency of online access, and year in university explain 34% of the variance in attendance. Both those with high and low levels of conscientiousness exhibited attendance rates that were significantly different than average (See Figure 2.1). Specifically, those with high levels of conscientiousness attended class significantly more often than average, \(\beta = 8.34, SE = 0.82, t = 10.12, p < .001\). Conversely, those with low levels of conscientiousness attended class significantly less than average, \(\beta = -9.32, SE = 0.83, t = -11.26, p < .001\). Specifically, those with low levels of conscientiousness attended, on average, 9.32 fewer class periods than average, when controlling for year of study and online access. See Table 2.1.
Attendance | |||
---|---|---|---|
Predictors | Estimates | CI | p |
Intercept | 28.99 | 27.71 – 30.28 | <0.001 |
Low Conscientiousness | -9.32 | -10.95 – -7.70 | <0.001 |
High Conscientiousness | 8.34 | 6.72 – 9.96 | <0.001 |
Online Access - Rarely | -2.41 | -4.15 – -0.67 | 0.007 |
Online Access - Often | -0.56 | -2.19 – 1.07 | 0.499 |
Y1 | -5.06 | -7.38 – -2.75 | <0.001 |
Y2 | -0.49 | -2.71 – 1.73 | 0.664 |
Y4 | -0.91 | -3.42 – 1.59 | 0.473 |
MSc | 0.58 | -2.34 – 3.51 | 0.695 |
PhD | 7.42 | 3.40 – 11.44 | <0.001 |
Observations | 397 | ||
R2 / R2 adjusted | 0.357 / 0.342 |
Note that this write-up does not cover the full interpretation of the results. This is just the interpretation of the overall model and the predictor Conscientiousness. If you were to actually report your results, you would need to also report the findings from the other predictors.
Our second research question is Is the time at which class is scheduled associated with student attendance?. To test this, we’ll use the emmeans
function to write our own contrasts. We’re going to be testing the following model:
\[Attendance \sim Time\]
First, let’s relevel our Time variable
so that it’s in chronological order:
dat$Time <- dat$Time %>%
factor(., levels = c('9AM', '10AM', '11AM','12PM', '1PM', '2PM', '3PM', '4PM'))
summary(dat$Time)
## 9AM 10AM 11AM 12PM 1PM 2PM 3PM 4PM
## 56 48 46 47 45 46 52 57
We’ll look at how Time
is distributed using a bar plot:
ggplot(dat, aes(Time, fill = Time)) + geom_bar() +
theme(legend.position = 'none')
First let’s run our model and check assumptions. Remember, we don’t need to check linearity or independence.
Normality of residuals:
m3 <- lm(Attendance~Time, dat)
hist(m3$residuals)
plot(m3, which = 2)
The distribution is a bit flat, but if you’ll remember from class, distributional problems are usually not a strong concern (and we can always bootstrap for extra certainty!). Heteroskedasticity is much more of a problem. Let’s check that using our residual by predicted values plot.
residualPlot(m3)
This looks great! We’re going to keep our model as is.
Now let’s specify our contrasts with emmeans:
(timeMean <- emmeans(m3, ~Time))
## Time emmean SE df lower.CL upper.CL
## 9AM 20.1 1.79 389 16.6 23.6
## 10AM 27.0 1.94 389 23.2 30.8
## 11AM 27.8 1.98 389 23.9 31.7
## 12PM 31.3 1.96 389 27.5 35.1
## 1PM 33.5 2.00 389 29.5 37.4
## 2PM 32.4 1.98 389 28.5 36.3
## 3PM 31.7 1.86 389 28.0 35.3
## 4PM 24.8 1.78 389 21.3 28.2
##
## Confidence level used: 0.95
plot(timeMean)
I’d guess we’re going to see some significant differences here. Let’s say, however, we’re less worried about specific times and more about times of day in general. We decide to look at midday compared to earlier (9AM and 10AM) and later (3PM and 4PM) classes. Before running our analysis, we have to manually group our levels:
levels(dat$Time)
## [1] "9AM" "10AM" "11AM" "12PM" "1PM" "2PM" "3PM" "4PM"
timeComp <- list('Early or Late vs Middle of the Day' = c(-1/4,-1/4, 1/4, 1/4, 1/4, 1/4, -1/4, -1/4))
In this case, our \(H_0\) is
\[\frac{1}{4}(\mu_1+\mu_2+\mu_7+\mu_8) -\frac{1}{4}(\mu_3+\mu_4+\mu_5+\mu_6) = 0\]
(timeTest <- contrast(timeMean, timeComp))
## contrast estimate SE df t.ratio p.value
## Early or Late vs Middle of the Day 5.36 1.35 389 3.963 0.0001
confint(timeTest)
## contrast estimate SE df lower.CL upper.CL
## Early or Late vs Middle of the Day 5.36 1.35 389 2.7 8.01
##
## Confidence level used: 0.95
Here, we see that, at \(\alpha = .05\), students with class in the middle of the day attend, on average, 5.36 (\(SE = 1.35\)) more class periods than those who have classes either early in the morning or later in the afternoon, \(t(389) = 3.96, p < .001\). We are 95% confident that those who had midday classes attended approximately 2-8 classes more than those who had classes in the morning or late afternoon (\(CI_{95}[2.70, 8.01]\)).
We collected attendance and final mark data from 200 university students.
Variable | Description |
---|---|
Marks | Total attendance in days |
Attendance | Final Mark in points |
We want to use these data to investigate the following research question:
Research Question: Can student attendance be used to predict marks?
Our hypotheses in this case, can be stated in the following way:
\[H_0: \beta_1 = 0\] \[H_1: \beta_1 \neq 0\]
First, let’s load the data.
dat2 <- read.csv('https://uoepsy.github.io/data/DapR2_S1B2_PracticalPart2.csv', stringsAsFactors = T)
First, let’s have a look at the data:
dat2 %>%
summary(.)
## Marks Attendance
## Min. :25.01 Min. :10.50
## 1st Qu.:38.22 1st Qu.:22.88
## Median :48.81 Median :35.25
## Mean :49.79 Mean :35.25
## 3rd Qu.:60.90 3rd Qu.:47.62
## Max. :98.20 Max. :60.00
You’ll notice that here we have two continuous outcome variables, Attendance
and Marks
.
Let’s look at how our variables are distributed, using histograms:
ggplot(dat2, aes(Attendance)) + geom_histogram(colour = 'black')
Ok, that looks a bit weird…but let’s keep going.
ggplot(dat2, aes(Marks)) + geom_histogram(colour = 'black', binwidth = 4)
We’ll use the following model to investigate our research question:
\[Attendance \sim Time\]
Let’s run our model, then we’ll check assumptions:
m4 <- lm(Marks~Attendance, dat2)
summary(m4)
##
## Call:
## lm(formula = Marks ~ Attendance, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1477 -4.5210 -0.1861 4.2501 26.8415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.83270 1.25534 11.82 <2e-16 ***
## Attendance 0.99163 0.03296 30.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.727 on 198 degrees of freedom
## Multiple R-squared: 0.8205, Adjusted R-squared: 0.8196
## F-statistic: 905.3 on 1 and 198 DF, p-value: < 2.2e-16
This model looks highly significant, but we’ll have to check assumptions before interpretation, especially given how the distribution of attendance looked.
Linearity It’s a simple regression, so we can just look at the scatterplot:
ggplot(dat2, aes(Attendance, Marks)) + geom_point() +
geom_smooth(method = 'lm') +
geom_smooth(method = 'loess', colour = 'red')
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
There’s a very slight curve in the loess line, but based on the points, I think that’s due to possible issues in the distribution of the observations rather than a nonlinear relationship. We’ll say linearity assumption is not violated.
Independence of Errors We are using between-subjects data, so we’ll assume independence of our error terms.
Normality of Residuals Let’s check with a histogram and QQ plot:
hist(m4$residuals)
Bit of a tail to the right, but not bad…
plot(m4, which = 2)
Ok, this actually looks alright, I think. Just the slight tail to the right.
Equality of Variance (Homoskedasticity)
We’ll use the residualPlot
to check for heteroskedasticity:
residualPlot(m4)
Oh no! It’s violated! We can’t trust the SEs in our model! We’ll need to bootstrap. We can do this using the Boot
function in the car
package. Let’s resample 1000 times.
k <- 1000
boot_m4 <- Boot(m4, R = k)
summary(boot_m4)
##
## Number of bootstrap replications R = 1000
## original bootBias bootSE bootMed
## (Intercept) 14.83270 0.0735109 1.048128 14.92314
## Attendance 0.99163 -0.0021716 0.036977 0.98813
Let’s calculate confidence intervals to test our hypothesis:
confint(boot_m4)
## Bootstrap bca confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) 12.4846168 16.768616
## Attendance 0.9269348 1.069712
Because our confidence interval for the beta associated with attendance does not include 0, we can be more certain that attendance is, in fact, a significant predictor of marks.
We investigated data from 200 undergraduate students to examine the association between course attendance and final marks.
To answer the research hypothesis, we fitted the following regression model:
\[Marks \sim \beta_0 + \beta_1Attendance + \epsilon\]
However, this model did not satisfy the regression assumptions. Specifically, it exhibited heteroskedasticity (see Figure 3.1), and for this reason we assessed statistical significance using the bootstrap approach with \(R = 1000\) resamples.
The 95% bootstrap confidence intervals are presented in Table 3.1. Results suggested that there is a significant positive relationship between marks and attendance. Specifically, we are 95% confident that for every day one attends class, their final mark will increase, on average, between 0.93 and 1.07 points. As this 95% CI did not contain zero, we rejected the null hypothesis as there was evidence of an association between student’ attendance and their final marks.
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | 12.48 | 16.77 |
Attendance | 0.93 | 1.07 |