1 Overview of the Week

This week, we’ll be reviewing what we’ve learned in weeks 7-10 and applying it to a practical example. Specifically, we’ll cover:

  1. Multiple linear regression with categorical variables;
  2. Dummy, effects, and manual contrast coding;
  3. Checking model assumptions & diagnostics;
  4. Bootstrapping.

Our example is based on the paper Class Attendance in College: A Meta-Analytic Review of the Relationship of Class Attendance With Grades and Student Characteristics. This paper examines the association between class attendance and a range of other variables (such as student personality, academic performance, and class marks). Imagine that we’ve read that paper and want to investigate further the variables that contribute to attendance. Specifically, we will investigate a range of categorical variables here. We’ll also use our data to try to replicate the association between attendance and marks found in that paper.

During this week’s example, we will conduct two separate, but related, studies. In the first, we are interested in exploring possible predictors of attendance in university courses. In the second, we will investigate the association between attendance and final marks in our sample.

2 Study 1

We collected data from 397 university students across years 1-4, as well as students in MSc and PhD programs. From these students, we gathered data on class time, use of online materials, and students’ conscientiousness.

Variable            Description
pid                 Participant ID number
Attendance          Total attendance in days
Conscientiousness   Level of conscientiousness (levels = Low; Moderate; High)
Time                Class time (levels = 9AM; 10AM; 11AM; 12PM; 1PM; 2PM; 3PM; 4PM)
OnlineAccess        Frequency of access to online course materials (levels = Rarely; Sometimes; Often)
Year                Year in university (levels = Y1; Y2; Y3; Y4; MSc; PhD)

We want to use these data to investigate the following research questions:

Research Question 1: Can students’ conscientiousness, frequency of access to online materials, and year in university predict course attendance?

Research Question 2: Is the time at which the class is scheduled associated with student attendance?

2.0.1 Setup

First, let’s load the data and all necessary packages. Note that by setting the stringsAsFactors argument in the read.csv function, all of our character variables are imported as factors, which is handy when working with a dataset that contains many factors.

library(tidyverse)
library(emmeans)
library(kableExtra)
library(car)
library(sjPlot)

dat <- read.csv('https://uoepsy.github.io/data/DapR2_S1B2_PracticalPart1.csv', stringsAsFactors = T)

2.0.2 Checking the Data

First, we’ll have a look at our data using the summary function:

dat %>%
  summary(.)
##       pid        Attendance    Conscientiousness      Time          
##  Min.   :  1   Min.   : 0.00   Length:397         Length:397        
##  1st Qu.:100   1st Qu.:16.00   Class :character   Class :character  
##  Median :199   Median :29.00   Mode  :character   Mode  :character  
##  Mean   :199   Mean   :28.28                                        
##  3rd Qu.:298   3rd Qu.:40.00                                        
##  Max.   :395   Max.   :60.00                                        
##  OnlineAccess           Year          
##  Length:397         Length:397        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

You’ll notice that we have a continuous outcome variable, Attendance, and categorical predictor variables: Conscientiousness, Time, OnlineAccess, and Year.

We’ll also have a look at how our variables are distributed, using a histogram for our continuous outcome and bar plots for our factors:

ggplot(dat, aes(Attendance)) + geom_histogram(colour = 'black', binwidth = 5)

ggplot(dat, aes(Conscientiousness, fill = Conscientiousness)) + geom_bar() +
  theme(legend.position = 'none')

ggplot(dat, aes(Year, fill = Year)) + geom_bar() +
  theme(legend.position = 'none')

ggplot(dat, aes(Time, fill = Time)) + geom_bar() +
  theme(legend.position = 'none')

2.0.3 Investigating RQ 1

Given our first research question, Can students’ conscientiousness, frequency of access to online materials, and year in university predict course attendance?, we’ll use the following model:

\[Attendance \sim Conscientiousness + OnlineAccess + Year\]

2.0.3.1 Dummy Coding

Between the 3 predictors, we have 12 levels. Before we run our model, we have to make a few decisions about the coding and baseline comparisons we’ll make across predictors. We decide to use dummy coding. We’ll be using Moderate as the baseline level for Conscientiousness and Sometimes as the baseline level for OnlineAccess. We’ll keep Y1 as the baseline level for Year. We can use the factor function to order our levels accordingly:

dat$Conscientiousness <- dat$Conscientiousness %>% 
  factor(., levels = c('Moderate', 'Low', 'High'))

summary(dat$Conscientiousness)
## Moderate      Low     High 
##      146      127      124
dat$OnlineAccess <- dat$OnlineAccess %>% 
  factor(., levels = c('Sometimes', 'Rarely', 'Often'))

summary(dat$OnlineAccess)
## Sometimes    Rarely     Often 
##       170       101       126
dat$Year <- dat$Year %>% 
  factor(., levels = c('Y1', 'Y2', 'Y3', 'Y4', 'MSc', 'PhD'))

summary(dat$Year)
##  Y1  Y2  Y3  Y4 MSc PhD 
##  89 100  65  72  48  23
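As an aside (not part of the original steps), if you only need to change the baseline level and are happy to leave the remaining levels in their current order, the relevel function is a convenient shortcut:

```r
# relevel() moves the named level to the front of the level order,
# making it the dummy-coding baseline; the other levels keep their order
dat$Conscientiousness <- relevel(dat$Conscientiousness, ref = 'Moderate')
```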

With that done, we should end up with a model that looks like the one below. Note that some variable names have been shortened for space:

\[Attendance \sim \beta_0+\beta_1Consc_{low} + \beta_2Consc_{high} + \beta_3Online_{rarely} + \beta_4Online_{often} + \beta_5Year_{Y2} + \beta_6Year_{Y3} + \beta_7Year_{Y4} + \beta_8Year_{MSc} + \beta_9Year_{PhD} + \epsilon\]

Now we can run our model:

m1 <- lm(Attendance~Conscientiousness+OnlineAccess+Year, dat)
summary(m1)
## 
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess + 
##     Year, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.777  -6.996  -0.243   6.079  31.849 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             27.882      1.533  18.186  < 2e-16 ***
## ConscientiousnessLow   -10.304      1.398  -7.370 1.04e-12 ***
## ConscientiousnessHigh    7.361      1.392   5.288 2.08e-07 ***
## OnlineAccessRarely      -5.387      1.441  -3.738 0.000214 ***
## OnlineAccessOften       -3.535      1.339  -2.640 0.008631 ** 
## YearY2                   4.573      1.657   2.760 0.006064 ** 
## YearY3                   3.534      1.860   1.899 0.058247 .  
## YearY4                   4.150      1.812   2.291 0.022528 *  
## YearMSc                  5.649      2.047   2.760 0.006059 ** 
## YearPhD                 12.483      2.661   4.691 3.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared:  0.3572, Adjusted R-squared:  0.3423 
## F-statistic:  23.9 on 9 and 387 DF,  p-value: < 2.2e-16

This looks interesting, but before we interpret, let’s check assumptions:

Linearity We can assume linearity when working with categorical predictors, because the fitted values are simply group means, so there is no functional form that could be misspecified.

Independence of Errors We are using between-subjects data, so we’ll also assume independence of our error terms.

Normality of Residuals We can check this using histograms and QQ plots:

hist(m1$residuals)

plot(m1, which = 2)

Tails are a bit fatter than we would expect (particularly on the right), but overall, this looks pretty good. No major concerns.

Equality of Variance (Homoskedasticity) We can check for heteroskedasticity using a residuals vs predicted values plot. We can get this using the residualPlot function from the car package:

residualPlot(m1)

Looks good! Now we can interpret the model.
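As an optional extra check (not part of the original walkthrough), with several predictors you could also screen for multicollinearity using the vif function from the car package; for factors with more than one coefficient, it reports generalised VIFs:

```r
# Generalised variance inflation factors; values near 1 suggest the
# categorical predictors are not strongly related to one another
vif(m1)
```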

summary(m1)
## 
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess + 
##     Year, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.777  -6.996  -0.243   6.079  31.849 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             27.882      1.533  18.186  < 2e-16 ***
## ConscientiousnessLow   -10.304      1.398  -7.370 1.04e-12 ***
## ConscientiousnessHigh    7.361      1.392   5.288 2.08e-07 ***
## OnlineAccessRarely      -5.387      1.441  -3.738 0.000214 ***
## OnlineAccessOften       -3.535      1.339  -2.640 0.008631 ** 
## YearY2                   4.573      1.657   2.760 0.006064 ** 
## YearY3                   3.534      1.860   1.899 0.058247 .  
## YearY4                   4.150      1.812   2.291 0.022528 *  
## YearMSc                  5.649      2.047   2.760 0.006059 ** 
## YearPhD                 12.483      2.661   4.691 3.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared:  0.3572, Adjusted R-squared:  0.3423 
## F-statistic:  23.9 on 9 and 387 DF,  p-value: < 2.2e-16

2.0.3.2 Effects Coding

Now that we’ve created a model with dummy-coded predictors, let’s run the same model with effects coding.

First, let’s consider which level we want to drop in each case, i.e., which level will not receive its own coefficient. Perhaps we’re more interested in how the “extreme” groups differ from the average. We’ll drop the Moderate and Sometimes levels from the Conscientiousness and OnlineAccess variables, respectively, and Y3 from our Year variable. To do this, we’ll need to reorder our factors so that the dropped level comes last:

dat$Conscientiousness <- dat$Conscientiousness %>% 
  factor(., levels = c('Low', 'High', 'Moderate'))

summary(dat$Conscientiousness)
##      Low     High Moderate 
##      127      124      146
dat$OnlineAccess <- dat$OnlineAccess %>% 
  factor(., levels = c('Rarely', 'Often', 'Sometimes'))

summary(dat$OnlineAccess)
##    Rarely     Often Sometimes 
##       101       126       170
dat$Year <- dat$Year %>% 
  factor(., levels = c('Y1', 'Y2', 'Y4', 'MSc', 'PhD', 'Y3'))

summary(dat$Year)
##  Y1  Y2  Y4 MSc PhD  Y3 
##  89 100  72  48  23  65

We also need to specify effects coding using the contr.sum function:

contrasts(dat$Conscientiousness) <- contr.sum
contrasts(dat$OnlineAccess) <- contr.sum
contrasts(dat$Year) <- contr.sum
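If you want to see exactly what contr.sum has done, you can print a factor’s contrast matrix. The level we ordered last (e.g., Moderate for Conscientiousness) is the dropped level, coded -1 in every column:

```r
contrasts(dat$Conscientiousness)
#          [,1] [,2]
# Low         1    0
# High        0    1
# Moderate   -1   -1
```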

Now we can run our model. Keep in mind that we don’t need to recheck assumptions; the data are the same, it’s just the comparisons that differ.

m2 <- lm(Attendance~Conscientiousness+OnlineAccess+Year, dat)
summary(m2)
## 
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess + 
##     Year, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.777  -6.996  -0.243   6.079  31.849 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         28.9921     0.6545  44.296  < 2e-16 ***
## Conscientiousness1  -9.3232     0.8278 -11.262  < 2e-16 ***
## Conscientiousness2   8.3421     0.8244  10.119  < 2e-16 ***
## OnlineAccess1       -2.4128     0.8850  -2.727 0.006692 ** 
## OnlineAccess2       -0.5609     0.8298  -0.676 0.499436    
## Year1               -5.0648     1.1784  -4.298 2.18e-05 ***
## Year2               -0.4914     1.1293  -0.435 0.663726    
## Year3               -0.9146     1.2725  -0.719 0.472740    
## Year4                0.5838     1.4890   0.392 0.695223    
## Year5                7.4179     2.0436   3.630 0.000322 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared:  0.3572, Adjusted R-squared:  0.3423 
## F-statistic:  23.9 on 9 and 387 DF,  p-value: < 2.2e-16

This isn’t entirely necessary, but here I’m just going to create some objects to use for inline coding in the example.

sumM2 <- summary(m2)
FDat <- round(sumM2$fstatistic, 2)
m2Coefs <- round(coef(m2), 2)
tStats <- round(coef(sumM2)[, "t value"],2)
seVals <- round(coef(sumM2)[, "Std. Error"],2)
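In an R Markdown document, these rounded values can then be dropped into the text with inline code, so the reported numbers always match the fitted model. For example (assuming the objects above exist in the knitting environment):

```r
# Inline R Markdown usage (shown as comments, since this is a code chunk):
#   "F(`r FDat[2]`, `r FDat[3]`) = `r FDat[1]`"  renders as  "F(9, 387) = 23.9"
#   "beta = `r m2Coefs['Conscientiousness2']`"   renders as  "beta = 8.34"
```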

2.0.4 Sample write up and model interpretation, RQ1

We will be evaluating all results using \(\alpha = .05\).

Looking at these results, we can see that our overall model is significant, \(F(9, 387) = 23.9, p < .001\). Conscientiousness, frequency of online access, and year in university together explain approximately 34% of the variance in attendance. Both those with high and those with low levels of conscientiousness exhibited attendance rates that were significantly different from the average (see Figure 2.1). Those with high levels of conscientiousness attended class significantly more often than average, \(\beta = 8.34, SE = 0.82, t = 10.12, p < .001\), while those with low levels of conscientiousness attended significantly less than average, \(\beta = -9.32, SE = 0.83, t = -11.26, p < .001\); specifically, they attended, on average, 9.32 fewer class periods than the overall mean, when controlling for year of study and online access. See Table 2.1.

Figure 2.1: Attendance by Level of Conscientiousness

Table 2.1: Regression Results - Predictors of Student Attendance
Outcome: Attendance

Predictors               Estimates   CI                p
Intercept                28.99       27.71 – 30.28     <0.001
Low Conscientiousness    -9.32       -10.95 – -7.70    <0.001
High Conscientiousness   8.34        6.72 – 9.96       <0.001
Online Access - Rarely   -2.41       -4.15 – -0.67     0.007
Online Access - Often    -0.56       -2.19 – 1.07      0.499
Y1                       -5.06       -7.38 – -2.75     <0.001
Y2                       -0.49       -2.71 – 1.73      0.664
Y4                       -0.91       -3.42 – 1.59      0.473
MSc                      0.58        -2.34 – 3.51      0.695
PhD                      7.42        3.40 – 11.44      <0.001
Observations             397
R2 / R2 adjusted         0.357 / 0.342

Note that this write-up does not cover the full interpretation of the results. This is just the interpretation of the overall model and the predictor Conscientiousness. If you were to actually report your results, you would need to also report the findings from the other predictors.

2.0.5 Investigating RQ 2

Our second research question is Is the time at which the class is scheduled associated with student attendance? To test this, we’ll use the emmeans package to write our own contrasts. We’re going to be testing the following model:

\[Attendance \sim Time\]

First, let’s relevel our Time variable so that it’s in chronological order:

dat$Time <- dat$Time %>% 
  factor(., levels = c('9AM', '10AM', '11AM','12PM', '1PM', '2PM', '3PM', '4PM'))

summary(dat$Time)
##  9AM 10AM 11AM 12PM  1PM  2PM  3PM  4PM 
##   56   48   46   47   45   46   52   57

We’ll look at how Time is distributed using a bar plot:

ggplot(dat, aes(Time, fill = Time)) + geom_bar() + 
  theme(legend.position = 'none')

First let’s run our model and check assumptions. Remember, we don’t need to check linearity or independence.

Normality of residuals:

m3 <- lm(Attendance~Time, dat)

hist(m3$residuals)

plot(m3, which = 2)

The distribution is a bit flat but, as you’ll remember from class, distributional problems are usually not a strong concern (and we can always bootstrap for extra certainty!). Heteroskedasticity is much more of a problem, so let’s check that using our residuals vs predicted values plot.

residualPlot(m3)

This looks great! We’re going to keep our model as is.

Now let’s specify our contrasts with emmeans:

(timeMean <- emmeans(m3, ~Time))
##  Time emmean   SE  df lower.CL upper.CL
##  9AM    20.1 1.79 389     16.6     23.6
##  10AM   27.0 1.94 389     23.2     30.8
##  11AM   27.8 1.98 389     23.9     31.7
##  12PM   31.3 1.96 389     27.5     35.1
##  1PM    33.5 2.00 389     29.5     37.4
##  2PM    32.4 1.98 389     28.5     36.3
##  3PM    31.7 1.86 389     28.0     35.3
##  4PM    24.8 1.78 389     21.3     28.2
## 
## Confidence level used: 0.95
plot(timeMean)

I’d guess we’re going to see some significant differences here. Let’s say, however, that we’re less interested in specific times and more in times of day in general. We decide to compare classes in the middle of the day (11AM, 12PM, 1PM, and 2PM) with earlier (9AM and 10AM) and later (3PM and 4PM) classes. Before running our analysis, we have to manually group our levels:

levels(dat$Time)
## [1] "9AM"  "10AM" "11AM" "12PM" "1PM"  "2PM"  "3PM"  "4PM"
timeComp <- list('Early or Late vs Middle of the Day' = c(-1/4,-1/4, 1/4, 1/4, 1/4, 1/4, -1/4, -1/4))

In this case, where \(\mu_1, \dots, \mu_8\) are the mean attendance rates for the 9AM through 4PM classes, respectively, our \(H_0\) is

\[\frac{1}{4}(\mu_3+\mu_4+\mu_5+\mu_6) -\frac{1}{4}(\mu_1+\mu_2+\mu_7+\mu_8) = 0\]

(timeTest <- contrast(timeMean, timeComp))
##  contrast                           estimate   SE  df t.ratio p.value
##  Early or Late vs Middle of the Day     5.36 1.35 389   3.963  0.0001
confint(timeTest)
##  contrast                           estimate   SE  df lower.CL upper.CL
##  Early or Late vs Middle of the Day     5.36 1.35 389      2.7     8.01
## 
## Confidence level used: 0.95
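As a sanity check (not part of the original analysis), because this model contains only Time, the estimated marginal means are just the group means, so the contrast estimate can be reproduced by hand:

```r
# group mean attendance at each class time
grpMeans <- tapply(dat$Attendance, dat$Time, mean)

# midday (11AM-2PM) minus early/late (9AM, 10AM, 3PM, 4PM),
# mirroring the 1/4 weights in timeComp
mean(grpMeans[c('11AM', '12PM', '1PM', '2PM')]) -
  mean(grpMeans[c('9AM', '10AM', '3PM', '4PM')])
# should match the emmeans contrast estimate of 5.36
```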

2.0.6 Sample model interpretation - RQ2

Here, we see that, at \(\alpha = .05\), students with class in the middle of the day attend, on average, 5.36 (\(SE = 1.35\)) more class periods than those who have classes either early in the morning or later in the afternoon, \(t(389) = 3.96, p < .001\). We are 95% confident that those who had midday classes attended approximately 2-8 classes more than those who had classes in the morning or late afternoon (\(CI_{95}[2.70, 8.01]\)).

3 Study 2

We collected attendance and final mark data from 200 university students.

Variable     Description
Marks        Final mark in points
Attendance   Total attendance in days

We want to use these data to investigate the following research question:

Research Question: Can student attendance be used to predict marks?

Our hypotheses in this case can be stated in the following way:

\[H_0: \beta_1 = 0\] \[H_1: \beta_1 \neq 0\]

3.1 Study 2 - Setup

First, let’s load the data.

dat2 <- read.csv('https://uoepsy.github.io/data/DapR2_S1B2_PracticalPart2.csv', stringsAsFactors = T)

3.2 Study 2 - Checking the Data

First, let’s have a look at the data:

dat2 %>%
  summary(.)
##      Marks         Attendance   
##  Min.   :25.01   Min.   :10.50  
##  1st Qu.:38.22   1st Qu.:22.88  
##  Median :48.81   Median :35.25  
##  Mean   :49.79   Mean   :35.25  
##  3rd Qu.:60.90   3rd Qu.:47.62  
##  Max.   :98.20   Max.   :60.00

You’ll notice that here we have two continuous variables: our outcome, Marks, and our predictor, Attendance.

Let’s look at how our variables are distributed, using histograms:

ggplot(dat2, aes(Attendance)) + geom_histogram(colour = 'black')

Ok, that looks a bit weird…but let’s keep going.

ggplot(dat2, aes(Marks)) + geom_histogram(colour = 'black', binwidth = 4)

3.3 Study 2 - Investigating the RQ

We’ll use the following model to investigate our research question:

\[Marks \sim Attendance\]

Let’s run our model, then we’ll check assumptions:

m4 <- lm(Marks~Attendance, dat2)
summary(m4)
## 
## Call:
## lm(formula = Marks ~ Attendance, data = dat2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1477  -4.5210  -0.1861   4.2501  26.8415 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.83270    1.25534   11.82   <2e-16 ***
## Attendance   0.99163    0.03296   30.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.727 on 198 degrees of freedom
## Multiple R-squared:  0.8205, Adjusted R-squared:  0.8196 
## F-statistic: 905.3 on 1 and 198 DF,  p-value: < 2.2e-16

This model looks highly significant, but we’ll have to check assumptions before interpretation, especially given how the distribution of attendance looked.

Linearity It’s a simple regression, so we can just look at the scatterplot:

ggplot(dat2, aes(Attendance, Marks)) + geom_point() + 
  geom_smooth(method = 'lm') + 
  geom_smooth(method = 'loess', colour = 'red')
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

There’s a very slight curve in the loess line, but based on the spread of the points, this looks more like an artefact of how the observations are distributed than a genuinely nonlinear relationship. We’ll say the linearity assumption is not violated.

Independence of Errors We are using between-subjects data, so we’ll assume independence of our error terms.

Normality of Residuals Let’s check with a histogram and QQ plot:

hist(m4$residuals)

Bit of a tail to the right, but not bad…

plot(m4, which = 2)

Ok, this actually looks alright, I think. Just the slight tail to the right.

Equality of Variance (Homoskedasticity) We’ll use the residualPlot to check for heteroskedasticity:

residualPlot(m4)

Oh no! It’s violated! We can’t trust the SEs in our model, so we’ll need to bootstrap. We can do this using the Boot function from the car package. Let’s resample 1000 times.

3.4 Bootstrapping - Study 2

k <- 1000
boot_m4 <- Boot(m4, R = k)

summary(boot_m4)
## 
## Number of bootstrap replications R = 1000 
##             original   bootBias   bootSE  bootMed
## (Intercept) 14.83270  0.0735109 1.048128 14.92314
## Attendance   0.99163 -0.0021716 0.036977  0.98813
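Note that bootstrap results will vary slightly from run to run, since resampling is random. One thing you could do (not done in the original) to make your results exactly reproducible is set a seed before calling Boot:

```r
set.seed(2023)  # arbitrary seed, chosen here purely for illustration
boot_m4 <- Boot(m4, R = 1000)
```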

Let’s calculate confidence intervals to test our hypothesis:

confint(boot_m4)
## Bootstrap bca confidence intervals
## 
##                  2.5 %    97.5 %
## (Intercept) 12.4846168 16.768616
## Attendance   0.9269348  1.069712
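By default, confint on a Boot object returns bias-corrected and accelerated (BCa) intervals, as labelled in the output above. If you wanted simple percentile intervals instead, you could request them via the type argument:

```r
# percentile bootstrap intervals rather than the default BCa intervals
confint(boot_m4, type = 'perc')
```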

Because our confidence interval for the beta associated with attendance does not include 0, we can be more certain that attendance is, in fact, a significant predictor of marks.

3.5 Sample write up and model interpretation - Study 2

We investigated data from 200 undergraduate students to examine the association between course attendance and final marks.

To answer the research hypothesis, we fitted the following regression model:

\[Marks \sim \beta_0 + \beta_1Attendance + \epsilon\]

However, this model did not satisfy the regression assumptions. Specifically, it exhibited heteroskedasticity (see Figure 3.1), and for this reason we assessed statistical significance using the bootstrap approach with \(R = 1000\) resamples.

Figure 3.1: Heteroskedasticity in Attendance Model

The 95% bootstrap confidence intervals are presented in Table 3.1. Results suggested a significant positive relationship between marks and attendance. Specifically, we are 95% confident that each additional day of class attendance is associated with an average increase in final mark of between 0.93 and 1.07 points. As this 95% CI did not contain zero, we rejected the null hypothesis, as there was evidence of an association between students’ attendance and their final marks.

Table 3.1: Bootstrapped 95% Confidence Intervals

              2.5 %   97.5 %
(Intercept)   12.48   16.77
Attendance    0.93    1.07