1 Overview of the Week

This week, we’ll be reviewing what we’ve learned in weeks 7-10 and applying it to a practical example. Specifically, we’ll cover:

  1. Multiple linear regression with categorical variables;
  2. Dummy, effects, and manual contrast coding;
  3. Checking model assumptions;
  4. Bootstrapping

Our example is based on the paper Class Attendance in College: A Meta-Analytic Review of the Relationship of Class Attendance With Grades and Student Characteristics This paper looks at the association between class attendance and a range of other variables (such as student personality, academic performance, class marks, etc.) Imagine that we’ve read that paper and we want to perform further investigation on the variables that contribute to attendance. Specifically, we will investigate a range of categorical variables here. We’ll also use our data to try and replicate the association between attendance and marks that is found in this paper.

During this week’s example, we will conduct two separate, but related, studies. In the first, we are interested in exploring possible predictors of attendance in university courses. In the second, we will investigate the association between attendance and final marks in our sample.

2 Study 1

We collected data from 397 university students across years 1-4, as well as students in MSc and PhD programs. From these students, we gathered data on class time, the use of online materials, and student’s conscientiousness.

Variable Description
pid Participant ID number
Attendance Total attendance in days
Conscientiousness Level of conscientiousness (levels = Low; Moderate; High)
Time Class time (levels = 9AM; 10AM; 11AM; 12PM; 1PM; 2PM; 3PM; 4PM)
OnlineAccess Frequency of access to online course materials (levels = Rarely; Sometimes; Often)
Year Year in university (levels = Y1; Y2; Y3; Y4; MSc; PhD)

We want to use these data to investigate the following research questions:

Research Question 1: Can student’s conscientiousness, frequency of access to online materials, and year in University predict course attendance?

Research Question 2: Is the time at which the class is scheduled associated with student attendance?

2.1 Setup

First, let’s load the data and all necessary packages. Note that by using the stringsAsFactors argument in the read.csv function, all of our character variables are automatically imported as factors, which is very handy when you’re working with a dataset with lots of factors in it.

library(tidyverse)
library(kableExtra)
library(car)
library(sjPlot)
library(emmeans)

dat <- read.csv('https://uoepsy.github.io/data/DapR2_S1B2_PracticalPart1.csv', stringsAsFactors = T)

2.2 Checking the Data

First, we’ll have a look at our data using the summary function:

  summary(dat)
##       pid        Attendance    Conscientiousness      Time          
##  Min.   :  1   Min.   : 0.00   Length:397         Length:397        
##  1st Qu.:100   1st Qu.:16.00   Class :character   Class :character  
##  Median :199   Median :29.00   Mode  :character   Mode  :character  
##  Mean   :199   Mean   :28.28                                        
##  3rd Qu.:298   3rd Qu.:40.00                                        
##  Max.   :395   Max.   :60.00                                        
##  OnlineAccess           Year          
##  Length:397         Length:397        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

You’ll notice that we have a continuous outcome variable, Attendance, and discrete predictor variables, Conscientiousness, Time, OnlineAccess and Year.

We’ll also have a look at how our variables are distributed, using a histogram for our continuous outcome and bar plots for our factors:

ggplot(dat, aes(Attendance)) + geom_histogram(colour = 'black', binwidth = 5)

ggplot(dat, aes(Conscientiousness, fill = Conscientiousness)) + geom_bar() +
  theme(legend.position = 'none')

ggplot(dat, aes(Year, fill = Year)) + geom_bar() +
  theme(legend.position = 'none')

ggplot(dat, aes(Time, fill = Time)) + geom_bar() +
  theme(legend.position = 'none')

2.3 Investigating RQ 1

Given the first research question we’ve specified, Can student’s conscientiousness, frequency of access to online materials, and year in University predict course attendance?, we want to know the relationship between the following variables:

\[Attendance \sim Conscientiousness + OnlineAccess + Year\]

2.3.1 Dummy Coding

Between the 3 predictors, we have 12 levels. Before we run our model, we have to make a few decisions in terms of the coding and baseline comparisons we’ll be making across predictors. We decide to use dummy coding. We’ll use be using Moderate as the baseline level for Conscientiousness and Sometimes as the baseline level for OnlineAccess. We’ll keep Y1 as the baseline level for Year. We can use the factor function to order our levels accordingly:

dat$Conscientiousness <- factor(dat$Conscientiousness, levels = c('Moderate', 'Low', 'High'))

summary(dat$Conscientiousness)
## Moderate      Low     High 
##      146      127      124
dat$OnlineAccess <- factor(dat$OnlineAccess, levels = c('Sometimes', 'Rarely', 'Often'))

summary(dat$OnlineAccess)
## Sometimes    Rarely     Often 
##       170       101       126
dat$Year <- factor(dat$Year, levels = c('Y1', 'Y2', 'Y3', 'Y4', 'MSc', 'PhD'))

summary(dat$Year)
##  Y1  Y2  Y3  Y4 MSc PhD 
##  89 100  66  71  48  23

Now that we have this done, we should end up with a model that looks like the one below. Note that some variable names have been shortened for sizing purposes:

\[Attendance \sim \beta_0+\beta_1Consc_{low} + \beta_2Consc_{high} + \beta_3Online_{rarely} + \beta_4Online_{often} + \beta_5Year_{Y2} + \beta_6Year_{Y3} + \beta_7Year_{Y4} + \beta_8Year_{MSc} + \beta_9Year_{PhD} + \epsilon\]

Now we can run our model:

m1 <- lm(Attendance~Conscientiousness+OnlineAccess+Year, dat)
summary(m1)
## 
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess + 
##     Year, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.657  -6.990  -0.279   6.085  31.844 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             27.874      1.533  18.179  < 2e-16 ***
## ConscientiousnessLow   -10.292      1.399  -7.359 1.12e-12 ***
## ConscientiousnessHigh    7.366      1.392   5.292 2.03e-07 ***
## OnlineAccessRarely      -5.378      1.441  -3.732 0.000218 ***
## OnlineAccessOften       -3.533      1.339  -2.639 0.008649 ** 
## YearY2                   4.574      1.657   2.760 0.006049 ** 
## YearY3                   3.418      1.853   1.844 0.065926 .  
## YearY4                   4.266      1.817   2.347 0.019418 *  
## YearMSc                  5.649      2.046   2.760 0.006051 ** 
## YearPhD                 12.484      2.661   4.692 3.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3424 
## F-statistic: 23.91 on 9 and 387 DF,  p-value: < 2.2e-16

This looks interesting, but before we interpret, let’s check assumptions:

Linearity We can assume linearity when all x-variables entered into the model are categorical (see here)

Independence of Errors We are using between-subjects data, so we’ll also assume independence of our error terms.

Normality of Residuals We can check this using histograms and QQ plots:

hist(m1$residuals)

plot(m1, which = 2)

Tails are a bit fatter than we would expect (particularly on the right), but overall, this looks pretty good. No major concerns.

Equality of Variance (Homoscedasticity) We can check for heteroscedasticity using residuals vs predicted values plots. We can get these using the residualPlot from the car package:

residualPlot(m1)

Looks good! Now we can interpret the model.

summary(m1)
## 
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess + 
##     Year, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.657  -6.990  -0.279   6.085  31.844 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             27.874      1.533  18.179  < 2e-16 ***
## ConscientiousnessLow   -10.292      1.399  -7.359 1.12e-12 ***
## ConscientiousnessHigh    7.366      1.392   5.292 2.03e-07 ***
## OnlineAccessRarely      -5.378      1.441  -3.732 0.000218 ***
## OnlineAccessOften       -3.533      1.339  -2.639 0.008649 ** 
## YearY2                   4.574      1.657   2.760 0.006049 ** 
## YearY3                   3.418      1.853   1.844 0.065926 .  
## YearY4                   4.266      1.817   2.347 0.019418 *  
## YearMSc                  5.649      2.046   2.760 0.006051 ** 
## YearPhD                 12.484      2.661   4.692 3.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3424 
## F-statistic: 23.91 on 9 and 387 DF,  p-value: < 2.2e-16


2.3.2 Effects Coding

To get some practice with coding schemes, let’s run the same model with effects coding.

First, let’s consider which level we want to drop in each case. Perhaps we’re more interested in how the “extreme” groups differ from average. We’ll drop the Moderate and Sometimes levels from the Conscientiousness and OnlineAccess variables, respectively. We’ll drop Y3 from our Year variable. To do this, we’ll need to reorder our variables accordingly:

dat$Conscientiousness <- factor(dat$Conscientiousness, levels = c('Low', 'High', 'Moderate'))

summary(dat$Conscientiousness)
##      Low     High Moderate 
##      127      124      146
dat$OnlineAccess <- factor(dat$OnlineAccess, levels = c('Rarely', 'Often', 'Sometimes'))

summary(dat$OnlineAccess)
##    Rarely     Often Sometimes 
##       101       126       170
dat$Year <- factor(dat$Year, levels = c('Y1', 'Y2', 'Y4', 'MSc', 'PhD', 'Y3'))

summary(dat$Year)
##  Y1  Y2  Y4 MSc PhD  Y3 
##  89 100  71  48  23  66

We also need to specify effects coding using the contr.sum function:

contrasts(dat$Conscientiousness) <- contr.sum
contrasts(dat$OnlineAccess) <- contr.sum
contrasts(dat$Year) <- contr.sum

Now we can run our model. Keep in mind that we don’t need to recheck assumptions. The data are the same, it’s just the comparisons that differ.

m2 <- lm(Attendance~Conscientiousness+OnlineAccess+Year, dat)

summary(m2)
## 
## Call:
## lm(formula = Attendance ~ Conscientiousness + OnlineAccess + 
##     Year, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.657  -6.990  -0.279   6.085  31.844 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         28.9933     0.6544  44.306  < 2e-16 ***
## Conscientiousness1  -9.3164     0.8280 -11.251  < 2e-16 ***
## Conscientiousness2   8.3412     0.8243  10.119  < 2e-16 ***
## OnlineAccess1       -2.4075     0.8844  -2.722  0.00678 ** 
## OnlineAccess2       -0.5626     0.8291  -0.679  0.49785    
## Year1               -5.0652     1.1782  -4.299 2.17e-05 ***
## Year2               -0.4909     1.1291  -0.435  0.66396    
## Year3               -0.7992     1.2777  -0.626  0.53200    
## Year4                0.5836     1.4888   0.392  0.69528    
## Year5                7.4193     2.0433   3.631  0.00032 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.35 on 387 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3424 
## F-statistic: 23.91 on 9 and 387 DF,  p-value: < 2.2e-16

Before I include the model summary in my write-up, I have to format it into a nicer looking table.

tab_model(m2, pred.labels = c('Intercept', 'Low Conscientiousness', 'High Conscientiousness', 'Online Access - Rarely', 'Online Access - Often', 'Y1', 'Y2', 'Y4', 'MSc', 'PhD'),
          title = 'Regression Results - Predictors of Student Attendance')
Regression Results - Predictors of Student Attendance
  Attendance
Predictors Estimates CI p
Intercept 28.99 27.71 – 30.28 <0.001
Low Conscientiousness -9.32 -10.94 – -7.69 <0.001
High Conscientiousness 8.34 6.72 – 9.96 <0.001
Online Access - Rarely -2.41 -4.15 – -0.67 0.007
Online Access - Often -0.56 -2.19 – 1.07 0.498
Y1 -5.07 -7.38 – -2.75 <0.001
Y2 -0.49 -2.71 – 1.73 0.664
Y4 -0.80 -3.31 – 1.71 0.532
MSc 0.58 -2.34 – 3.51 0.695
PhD 7.42 3.40 – 11.44 <0.001
Observations 397
R2 / R2 adjusted 0.357 / 0.342


2.4 Investigating RQ 2

Our second research question is Is the time at which class is scheduled associated with student attendance?. To test this, we’ll use the emmeans function to write our own contrasts. We want to know the relationship between the following variables:

\[Attendance \sim Time\]

First, let’s relevel our Time variable so that it’s in chronological order:

dat$Time <- factor(dat$Time, levels = c('9AM', '10AM', '11AM','12PM', '1PM', '2PM', '3PM', '4PM'))

summary(dat$Time)
##  9AM 10AM 11AM 12PM  1PM  2PM  3PM  4PM 
##   56   48   46   47   45   46   52   57

We’ll look at how Time is distributed using a bar plot:

ggplot(dat, aes(Time, fill = Time)) + geom_bar() + 
  theme(legend.position = 'none')

First let’s run our model and check assumptions. Remember, we don’t need to check linearity or independence.

Normality of residuals:

m3 <- lm(Attendance~Time, dat)

hist(m3$residuals)

plot(m3, which = 2)

The distribution is a bit flat, but if you’ll remember from class, distributional problems are usually not a strong concern (and we can always bootstrap for extra certainty!). Heteroscedasticity is much more of a problem. Let’s check that using our residuals by predicted values plot.

residualPlot(m3)

This looks great! We’re going to keep our model as is.

Now let’s specify our contrasts with emmeans:

timeMean <- emmeans(m3, ~Time)
plot(timeMean)

I’d guess we’re going to see some significant differences here. Let’s say, however, we’re less worried about specific times and more about times of day in general. We decide to look at midday compared to earlier (9AM and 10AM) and later (3PM and 4PM) classes. Before running our analysis, we have to manually group our levels:

levels(dat$Time)
## [1] "9AM"  "10AM" "11AM" "12PM" "1PM"  "2PM"  "3PM"  "4PM"
timeComp <- list('Middle of the Day vs Early or Late' = c(-1/4,-1/4, 1/4, 1/4, 1/4, 1/4, -1/4, -1/4))

In this case, our \(H_0\) is

\[\frac{1}{4}(\mu_3+\mu_4+\mu_5+\mu_6) - \frac{1}{4}(\mu_1+\mu_2+\mu_7+\mu_8) = 0\]

timeTest <- contrast(timeMean, timeComp)

timeTest
##  contrast                           estimate   SE  df t.ratio p.value
##  Middle of the Day vs Early or Late     5.36 1.35 389   3.963  0.0001
confint(timeTest)
##  contrast                           estimate   SE  df lower.CL upper.CL
##  Middle of the Day vs Early or Late     5.36 1.35 389      2.7     8.01
## 
## Confidence level used: 0.95

3 Study 2

We collected attendance and final mark data from 200 university students.

Variable Description
Marks Final Mark in points
Attendance Total attendance in days

We want to use these data to investigate the following research question:

Research Question: Can student attendance be used to predict marks?

Our hypotheses in this case, can be stated in the following way:

\[H_0: \beta_1 = 0\] \[H_1: \beta_1 \neq 0\]

3.1 Study 2 - Setup

First, let’s load the data.

dat2 <- read.csv('https://uoepsy.github.io/data/DapR2_S1B2_PracticalPart2.csv', stringsAsFactors = T)

3.2 Study 2 - Checking the Data

First, let’s have a look at the data:

summary(dat2)
##      Marks         Attendance   
##  Min.   :25.01   Min.   :10.50  
##  1st Qu.:38.22   1st Qu.:22.88  
##  Median :48.81   Median :35.25  
##  Mean   :49.79   Mean   :35.25  
##  3rd Qu.:60.90   3rd Qu.:47.62  
##  Max.   :98.20   Max.   :60.00

You’ll notice that here we have two numerical variables, Attendance and Marks.

Let’s look at how our variables are distributed, using histograms:

ggplot(dat2, aes(Attendance)) + geom_histogram(colour = 'black', binwidth = 2)

Ok, that looks a bit weird…but let’s keep going.

ggplot(dat2, aes(Marks)) + geom_histogram(colour = 'black', binwidth = 4)

3.3 Study 2 - Investigating the RQ

We’ll fit the following model to investigate our research question:

\[Marks \sim \beta_0 + \beta_1Attendance + \epsilon\]

Let’s run our model, then we’ll check assumptions:

m4 <- lm(Marks~Attendance, dat2)
summary(m4)
## 
## Call:
## lm(formula = Marks ~ Attendance, data = dat2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1477  -4.5210  -0.1861   4.2501  26.8415 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.83270    1.25534   11.82   <2e-16 ***
## Attendance   0.99163    0.03296   30.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.727 on 198 degrees of freedom
## Multiple R-squared:  0.8205, Adjusted R-squared:  0.8196 
## F-statistic: 905.3 on 1 and 198 DF,  p-value: < 2.2e-16

This model looks highly significant, but we’ll have to check assumptions before interpretation, especially given how the distribution of attendance looked.

Linearity It’s a simple regression, so we can just look at the scatterplot:

ggplot(dat2, aes(Attendance, Marks)) + geom_point() + 
  geom_smooth(method = 'lm') + 
  geom_smooth(method = 'loess', colour = 'red')
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

There’s a very slight curve in the loess line, but based on the points, I think that’s due to possible issues in the distribution of the observations rather than a nonlinear relationship. We’ll say linearity assumption is not violated.

Independence of Errors We are using between-subjects data, so we’ll assume independence of our error terms.

Normality of Residuals Let’s check with a histogram and QQ plot:

hist(m4$residuals)

A possible outlier to the right, but not bad…

plot(m4, which = 2)

Ok, this actually looks alright, I think. Just the slight tail to the right.

Equality of Variance (Homoscedasticity) We’ll use the residualPlot to check for heteroscedasticity:

residualPlot(m4)

Oh no! It’s violated! We can’t trust the SEs in our model! We’ll need to bootstrap. We can do this using the Boot function in the car package. Let’s resample 1000 times.

3.4 Bootstrapping - Study 2

k <- 1000
boot_m4 <- Boot(m4, R = k)
## Loading required namespace: boot
summary(boot_m4)
## 
## Number of bootstrap replications R = 1000 
##             original   bootBias   bootSE  bootMed
## (Intercept) 14.83270 -0.0412065 0.980182 14.75525
## Attendance   0.99163  0.0016119 0.034549  0.99423

Let’s calculate confidence intervals to test our hypothesis:

confint(boot_m4)
## Bootstrap bca confidence intervals
## 
##                  2.5 %    97.5 %
## (Intercept) 13.0081992 16.799837
## Attendance   0.9223813  1.058881

We can make this look nicer for our write-up:

confint(boot_m4) %>%
  kable(digits = 2, caption = 'Bootstrapped 95% Confidence Intervals') %>%
  kable_styling(full_width = F)
Table 3.1: Table 3.2: Bootstrapped 95% Confidence Intervals
2.5 % 97.5 %
(Intercept) 13.01 16.80
Attendance 0.92 1.06

Because our confidence interval for the beta associated with attendance does not include 0, we can be more certain that attendance is, in fact, a significant predictor of marks.

4 Next steps

The lab this week involves writing up the results from the three RQs above