Motor Offences data

Two datasets can be loaded from the following url:

load(url("https://uoepsy.github.io/data/usmr_1920_assignment.RData"))

The data provided contains information about the nature and circumstances of motorists stopped and breathalysed by the Police.
Data is collected every time that a driver is stopped by the Police and breathalysed. Records indicate the speed at which the driver was travelling when they were stopped, and the blood alcohol content of the driver when measured via breathalyser. Information is also captured on the age and prior motoring offences of the driver, and whether the incident occurred at day or night. Police officers may have had reasons for stopping drivers other than presuming them to be intoxicated (for instance, someone who is stopped for speeding may subsequently be breathalysed if they are deemed to be acting unusually).

Each time a police officer stops a motorist, an incident ID is created. A separate database used primarily for administrative purposes includes records of which officer (recorded as initials) attends which incidents.

Variable	Description
age	Age of driver (in years)
nighttime	Whether or not the incident occurred at night
prior_offence	Offence code for any prior motoring offences
speed	Speed when stopped by police (mph)
bac	Blood Alcohol Content (%) as measured by breathalyser
outcome	Outcome of stop ('fine','warning')
incident_id	ID of incident
officer	Officer attending (initials)

We saw in the lecture a brief explanation of how to approach the following sample question from last year's coursework report:

Sample Question: Driving speeds, night vs. day

Does time of day and speed of driving predict the blood alcohol content over and above driver’s age? Fit appropriate model(s) to test this question, and report the results (you may add a figure or table if appropriate).

Question A1

Explore and clean the dataset (i.e., remove any impossible values etc).
Some info on the lecture slides this week will help with guidance on what to look for.

(for now, you can ignore things like the “prior_offence” variable if you want, as this is a tricky one to tidy up, and isn’t relevant for the sample question we are considering)

Solution

drinkdriving %>% mutate(
  age = case_when(age > 120 | is.na(age) ~ NA_real_, TRUE ~ age),
  outcome = factor(tolower(outcome)),
  nighttime = factor(ifelse(nighttime %in% c("day","night"), nighttime, NA)),
  # prior_offence = fct_other(factor(prior_offence), keep=c("N","DR50")),
  prior_offence = factor(case_when(str_detect(prior_offence, "DR50") ~ "DUI",
                            str_detect(prior_offence, "N") ~ "None",
                            TRUE ~ "Other"))
  ) -> drinkdriving
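Before applying cleaning rules like those above, it helps to see what values are actually in each variable. The following is a minimal sketch on a small made-up data frame (`toy` and its values are invented for illustration and are not the real data): numeric variables are checked by their range, categorical ones by tabulating every distinct value.

```r
# Toy data standing in for the real dataset (all values are made up)
toy <- data.frame(
  age       = c(34, 51, 230, NA, 19),
  nighttime = c("day", "night", "2:05", "night", "day"),
  outcome   = c("Fine", "warning", "fine", "WARNING", "fine")
)

# Numeric variables: look at the range for impossible values
summary(toy$age)

# Categorical variables: tabulate every distinct value, including NA
table(toy$nighttime, useNA = "ifany")
table(toy$outcome, useNA = "ifany")

# Count how many values each cleaning rule would affect before applying it
n_bad_age   <- sum(toy$age > 120, na.rm = TRUE)
n_bad_night <- sum(!toy$nighttime %in% c("day", "night"))
c(n_bad_age, n_bad_night)
```

Counting the affected values first makes it easy to report how many observations were recoded or excluded, and why.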

Question A2

Fit the following models:

m1<-lm(bac~age + nighttime + speed, data=drinkdriving)
m2<-lm(bac~speed + age + nighttime, data=drinkdriving)
m3<-lm(bac~nighttime + speed + age, data=drinkdriving)
m4<-lm(bac~age + speed + nighttime, data=drinkdriving)

Are they different in any way? Are the coefficients different, or the significance tests different?

Solution

Nope! They’re all the same!

You can use summary() here, but we thought you might also like to learn about the tidy() function from the broom package:

library(broom)
tidy(m1)

## # A tibble: 4 x 5
##   term           estimate std.error statistic     p.value
##   <chr>             <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)     14.5       2.73        5.32 0.000000265
## 2 age             -0.0399    0.0323     -1.24 0.218      
## 3 nighttimenight   3.83      0.902       4.25 0.0000325  
## 4 speed            0.198     0.0383      5.17 0.000000529

tidy(m2)

## # A tibble: 4 x 5
##   term           estimate std.error statistic     p.value
##   <chr>             <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)     14.5       2.73        5.32 0.000000265
## 2 speed            0.198     0.0383      5.17 0.000000529
## 3 age             -0.0399    0.0323     -1.24 0.218      
## 4 nighttimenight   3.83      0.902       4.25 0.0000325

tidy(m3)

## # A tibble: 4 x 5
##   term           estimate std.error statistic     p.value
##   <chr>             <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)     14.5       2.73        5.32 0.000000265
## 2 nighttimenight   3.83      0.902       4.25 0.0000325  
## 3 speed            0.198     0.0383      5.17 0.000000529
## 4 age             -0.0399    0.0323     -1.24 0.218

tidy(m4)

## # A tibble: 4 x 5
##   term           estimate std.error statistic     p.value
##   <chr>             <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)     14.5       2.73        5.32 0.000000265
## 2 age             -0.0399    0.0323     -1.24 0.218      
## 3 speed            0.198     0.0383      5.17 0.000000529
## 4 nighttimenight   3.83      0.902       4.25 0.0000325
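If you would rather check this programmatically than by eye, one option is to compare the coefficients after matching them by name. This is a sketch on simulated data (the data frame `sim` and the objects `fit1` and `fit2` are made up for illustration), so the numbers will not match the output above.

```r
set.seed(1)
n <- 200
sim <- data.frame(
  age       = rnorm(n, mean = 40, sd = 12),
  speed     = rnorm(n, mean = 60, sd = 15),
  nighttime = factor(sample(c("day", "night"), n, replace = TRUE))
)
sim$bac <- 10 - 0.05 * sim$age + 0.2 * sim$speed +
  3 * (sim$nighttime == "night") + rnorm(n, sd = 6)

# Same predictors, entered in a different order
fit1 <- lm(bac ~ age + nighttime + speed, data = sim)
fit2 <- lm(bac ~ speed + age + nighttime, data = sim)

# Reorder fit2's coefficients to match fit1, then compare
same_coefs <- all.equal(coef(fit1), coef(fit2)[names(coef(fit1))])

# Do the same for the p-values of the coefficient tests
p1 <- coef(summary(fit1))[, "Pr(>|t|)"]
p2 <- coef(summary(fit2))[names(p1), "Pr(>|t|)"]
same_pvals <- all.equal(p1, p2)

c(same_coefs, same_pvals)
```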

Types of Sums of Squares

Question A3

Run the following:

anova(m1)
anova(m2)
anova(m3)
anova(m4)

Are they different? In what way?

Solution

The results below are all different from one another. The astute among you may notice that the last predictor row of each output (i.e., the row for the last term specified in each of m1, m2, m3 and m4) gives the same \(p\)-value as the coefficient for that term in summary() of any of m1, m2, m3 or m4.

anova(m1)

## Analysis of Variance Table
## 
## Response: bac
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## age         1 1417.8  1417.8  37.020 5.326e-09 ***
## nighttime   1  620.5   620.5  16.202 7.901e-05 ***
## speed       1 1024.8  1024.8  26.760 5.287e-07 ***
## Residuals 214 8195.7    38.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(m2)

## Analysis of Variance Table
## 
## Response: bac
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## speed       1 2271.0 2270.99  59.299 4.956e-13 ***
## age         1  101.8  101.79   2.658    0.1045    
## nighttime   1  690.3  690.33  18.025 3.249e-05 ***
## Residuals 214 8195.7   38.30                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(m3)

## Analysis of Variance Table
## 
## Response: bac
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## nighttime   1  779.3  779.31 20.3489 1.064e-05 ***
## speed       1 2225.2 2225.21 58.1034 7.982e-13 ***
## age         1   58.6   58.58  1.5297    0.2175    
## Residuals 214 8195.7   38.30                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(m4)

## Analysis of Variance Table
## 
## Response: bac
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## age         1 1417.8 1417.77  37.020 5.326e-09 ***
## speed       1  955.0  955.01  24.937 1.228e-06 ***
## nighttime   1  690.3  690.33  18.025 3.249e-05 ***
## Residuals 214 8195.7   38.30                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These results are all different from one another, because when we use anova(), we are by default using Type 1 Sums of Squares. What does this mean? It means that we’re calculating each predictor’s improvement to the model in the order that they are specified in the model (See Week 8 Lecture).
For anova(), order matters.
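To see where the sequential sums of squares come from, we can rebuild them by hand as the drop in residual sum of squares each time a predictor is added. The sketch below uses simulated variables (`x1`, `x2` and `y` are made up for illustration).

```r
set.seed(2)
n  <- 150
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)   # correlated with x1, so order will matter
y  <- 1 + 0.5 * x1 + 0.8 * x2 + rnorm(n)

fit0  <- lm(y ~ 1)          # empty model
fit1  <- lm(y ~ x1)         # add x1
fit12 <- lm(y ~ x1 + x2)    # then add x2

# deviance() of an lm is its residual sum of squares
ss_x1 <- deviance(fit0) - deviance(fit1)    # x1 added to the empty model
ss_x2 <- deviance(fit1) - deviance(fit12)   # x2 added after x1

# These match the "Sum Sq" column of the sequential (Type 1) table
type1 <- anova(fit12)
type1[["Sum Sq"]][1:2]
c(ss_x1, ss_x2)
```

Swapping the order to `y ~ x2 + x1` changes which nested models are being compared at each step, and this is why the table changes.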

Sums of Squares

Type 1 sum of squares (sequential): Variables are tested in the order that they are listed in the model.
Type 2 sums of squares: Rarely used. Similar to Type 3 below, but main effects come before interactions, preserving the principle of marginality.
Type 3 sum of squares (partial): Variables are tested in light of every other term in the model (i.e., as if they are the last term in Type 1).
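One way to test every term "as if it were last" without refitting the model in several orders is base R's drop1(), which removes each term in turn from the full model. This is a sketch on simulated data (all variable names are made up). Note the assumption: the model here has no interactions, in which case Type 2 and Type 3 tests coincide. With interactions, drop1() respects marginality, and something like Anova() from the car package with its `type` argument would be the more usual tool.

```r
set.seed(3)
n  <- 150
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.8 * x2 + 0.3 * x3 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)

# Each row tests one term given every other term in the model
last <- drop1(fit, test = "F")
last

# For single-df terms these p-values match the coefficient tests
p_drop1 <- last[c("x1", "x2", "x3"), "Pr(>F)"]
p_coef  <- unname(coef(summary(fit))[c("x1", "x2", "x3"), "Pr(>|t|)"])
cbind(p_drop1, p_coef)
```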

Demonstration

Take, for instance:

m1<-lm(bac ~ age + nighttime + speed, data=drinkdriving)

anova(m1)

## Analysis of Variance Table
## 
## Response: bac
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## age         1 1417.8  1417.8  37.020 5.326e-09 ***
## nighttime   1  620.5   620.5  16.202 7.901e-05 ***
## speed       1 1024.8  1024.8  26.760 5.287e-07 ***
## Residuals 214 8195.7    38.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the \(p\)-value for the speed variable (the last term entered) is the same as that of the coefficient for speed in the lm()s of m1, m2, m3 and m4:

tidy(m1)

## # A tibble: 4 x 5
##   term           estimate std.error statistic     p.value
##   <chr>             <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)     14.5       2.73        5.32 0.000000265
## 2 age             -0.0399    0.0323     -1.24 0.218      
## 3 nighttimenight   3.83      0.902       4.25 0.0000325  
## 4 speed            0.198     0.0383      5.17 0.000000529

The same applies for age in this model:

m3<-lm(bac~nighttime + speed + age, data=drinkdriving)

anova(m3)

## Analysis of Variance Table
## 
## Response: bac
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## nighttime   1  779.3  779.31 20.3489 1.064e-05 ***
## speed       1 2225.2 2225.21 58.1034 7.982e-13 ***
## age         1   58.6   58.58  1.5297    0.2175    
## Residuals 214 8195.7   38.30                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# These aren't all run here, but they will show exactly the same: 
# tidy(m1) %>% filter(term=="age")
# tidy(m2) %>% filter(term=="age")
# tidy(m3) %>% filter(term=="age")
tidy(m4) %>% filter(term=="age")

## # A tibble: 1 x 5
##   term  estimate std.error statistic p.value
##   <chr>    <dbl>     <dbl>     <dbl>   <dbl>
## 1 age    -0.0399    0.0323     -1.24   0.218

A sample question

Question A4

Recall our question:

Does time of day and speed of driving predict the blood alcohol content over and above driver’s age? Fit appropriate model(s) to test this question, and report the results (you may add a figure or table if appropriate).

How might we use anova() and/or lm() to best answer this question? Can you give extra context to your answer?

Solution

“over and above” here indicates that we are looking at the improvement in model RSS (residual sums of squares), i.e., we would ideally want to examine the effect(s) in question in light of every other term in the model.

However, the question concerns the improvement due to a set of predictors, not just one. So how can we examine this? Well, one option is to use model comparison!

Because we want to compare models, we need to use the same dataset, and lm() will do case-wise deletion of any observations which are missing in any of the predictors. E.g. if we have some NAs in speed, it will drop these from a model which includes speed as a predictor, but include them for a model which doesn’t (provided it is not also NA in the other variables).

drinkdriving2 <- drinkdriving %>% filter(!is.na(age), !is.na(nighttime), !is.na(speed))
modA <- lm(bac ~ age, drinkdriving2)
modB <- lm(bac ~ age + nighttime + speed, drinkdriving2)
anova(modA, modB)

## Analysis of Variance Table
## 
## Model 1: bac ~ age
## Model 2: bac ~ age + nighttime + speed
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1    216 9841.0                                 
## 2    214 8195.7  2    1645.3 21.481 3.15e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
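The case-wise deletion issue described above is easy to miss, so here is a small sketch of it on simulated data (the data frame `d` and the model names are made up for illustration). Two models fitted to the same data frame end up using different numbers of rows until we restrict to complete cases first.

```r
set.seed(4)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 + 0.5 * d$x2 + rnorm(n)
d$x2[1:10] <- NA            # introduce missing values in one predictor

small <- lm(y ~ x1, data = d)
big   <- lm(y ~ x1 + x2, data = d)
c(nobs(small), nobs(big))   # 100 vs 90: not the same observations

# anova(small, big) would stop with an error here, because the models
# were not fitted to the same size of dataset.

# Keep only rows that are complete on every variable used in either model
d_complete <- d[complete.cases(d[, c("y", "x1", "x2")]), ]
small_c <- lm(y ~ x1, data = d_complete)
big_c   <- lm(y ~ x1 + x2, data = d_complete)
anova(small_c, big_c)
```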

And we should remember to assess our model assumptions:

plot(modA)

These plots highlight a couple of points which seem to be extremely influential. On further investigation, one of these observations was travelling very fast (it has the maximum speed observed), and both have very high recorded BAC values.

drinkdriving2[c(101,140),]

## # A tibble: 2 x 7
##     age nighttime prior_offence speed   bac outcome incident_id
##   <dbl> <fct>     <fct>         <dbl> <dbl> <fct>   <chr>      
## 1    29 day       Other         120.   57.3 fine    inc_110    
## 2    44 night     Other          81.0  65.8 fine    inc_154
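Influential points can also be found numerically rather than by reading row labels off a plot. Below is a sketch using cooks.distance() on simulated data with one deliberately extreme observation planted in the last row (all names and values are made up for illustration).

```r
set.seed(5)
n <- 100
x <- rnorm(n, mean = 60, sd = 10)
y <- 5 + 0.2 * x + rnorm(n, sd = 3)

# Plant one extreme point: very high x and very high y
x[n] <- 120
y[n] <- 70

fit <- lm(y ~ x)

# Cook's distance for every observation, largest first
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)

# Row with the largest influence on the fitted model
flagged <- which.max(cd)
flagged

# The same information as a plot
plot(fit, which = 4)
```

Whatever rule you use to flag points, the write-up should say which rows were removed and why.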

Why are these different from the -c(99, 136) in the lecture recordings???
This is a nice little demonstration of what is sometimes termed “researcher degrees of freedom”. Some decision early on (which may not have even felt like a ‘decision’ at the time) has meant that the analysis presented here began to deviate from that presented in the lecture. Small decisions we make have trickle-down effects on our results. This is unavoidable, and provided we feel justified in the steps we have taken, we should be happy to continue. In most cases, we would hope that variance in results due to differences in little decisions does not lead to meaningful differences in conclusions.

What did we do here differently from the lecture?
Just looking quickly:

Lecture removed datapoints where speed = 0, we did not (1 observation).
Lecture recoded datapoints where nighttime = ‘2:05’ to ‘night’. Here we excluded them. (2 observations).

With the limited knowledge we have about the data, both approaches are arguably justified. More often, you will have more scope to query your knowledge about how the data was collected (e.g., if you collected it yourself).

Removing these points from our models does not change the overall conclusion:

modA <- lm(bac ~ age, drinkdriving2[-c(101,140), ])
modB <- lm(bac ~ age + nighttime + speed, drinkdriving2[-c(101,140), ])
anova(modA, modB)

## Analysis of Variance Table
## 
## Model 1: bac ~ age
## Model 2: bac ~ age + nighttime + speed
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    214 6948.0                                  
## 2    212 6219.1  2    728.91 12.424 7.906e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, we might be interested in contextualising our answer to discuss the addition of each effect individually, which we can obtain using, e.g.:

modB <- lm(bac ~ age + nighttime + speed, drinkdriving2[-c(101,140),])
modC <- lm(bac ~ age + speed + nighttime, drinkdriving2[-c(101,140),])
anova(modB)
anova(modC)

Or we can get the addition of speed to (age + nighttime), and the addition of nighttime to (age + speed), which will match the final rows outputted from anova(modB) and anova(modC) respectively (\(|t|\) = \(\sqrt{F}\), see optional box in Week 8 exercises).

# or summary(modC), the results will be the same
summary(modB)

## 
## Call:
## lm(formula = bac ~ age + nighttime + speed, data = drinkdriving2[-c(101, 
##     140), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6457  -4.3757   0.5792   3.5662  14.2890 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    23.52651    2.66327   8.834 3.89e-16 ***
## age            -0.11596    0.02990  -3.878 0.000141 ***
## nighttimenight  3.87648    0.79353   4.885 2.04e-06 ***
## speed           0.04110    0.03958   1.038 0.300244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.416 on 212 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.246,  Adjusted R-squared:  0.2353 
## F-statistic: 23.05 on 3 and 212 DF,  p-value: 5.894e-13
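The link between the coefficient test and the last row of the sequential table noted above can be checked directly: the squared \(t\) statistic for the last term equals its \(F\) statistic. A sketch on simulated data (variable names are made up for illustration):

```r
set.seed(6)
n  <- 120
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 2 + 0.4 * x1 + 0.7 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# t statistic for the last term, from the coefficient table
t_last <- coef(summary(fit))["x2", "t value"]

# F statistic for the last term, from the sequential (Type 1) table
f_last <- anova(fit)["x2", "F value"]

c(t_squared = t_last^2, F = f_last)
```

This only holds for the term entered last, since earlier rows of anova() are not adjusted for the terms that follow them.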

A good answer to this question could be one which draws on model comparison between a restricted “age only” model, and a model with both additional predictors (speed and nighttime) in it.
It would detail any observations excluded from the analysis and the reasons for doing so (e.g., high influence on model fit).

It could refer back to the research question and state a conclusion. In doing so, it would report in text the results which have a bearing on the question (highlighted below), and include also a table/figure where this provides extra information.

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
214	6948.043	NA	NA	NA	NA
212	6219.134	2	728.9098	12.42367	7.905711e-06

It could then discuss in more depth an explanation for this finding, for instance by discussing the improvement due to the speed and nighttime variables individually (for instance, the table below), as well as how the age coefficient changes with their inclusion.

	bac
Predictors	Estimates	CI	p
(Intercept)	23.53	18.28 – 28.78	<0.001
age	-0.12	-0.17 – -0.06	<0.001
nighttime [night]	3.88	2.31 – 5.44	<0.001
speed	0.04	-0.04 – 0.12	0.300
Observations	216
R² / R² adjusted	0.246 / 0.235

Write-up task

Here, we’re going to walk through a high-level step-by-step guide of what to include in a write-up of a statistical analysis. We’re going to use an example analysis using one of the datasets we have worked with on a number of exercises in previous labs concerning personality traits, social comparison, and depression and anxiety.

The aim in writing should be that a reader is able to more or less replicate your analyses without referring to your R code. This requires detailing all of the steps you took in conducting the analysis.
The point of using RMarkdown is that you can pull your results directly from the code. If your analysis changes, so does your report!
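As a sketch of what pulling results from the code can look like, the example below extracts an estimate and its standard error from a fitted model so they can be dropped into the text. The data and object names (`fit`, `b_x`, `se_x`) are made up for illustration. Indexing the coefficient table by term name rather than by row number is a safer habit, because row 1 is the intercept and it is easy to end up reporting the wrong row.

```r
set.seed(7)
n <- 80
x <- rnorm(n)
y <- 3 + 1.5 * x + rnorm(n)
fit <- lm(y ~ x)

est <- coef(summary(fit))

# Index by term name, not row number: row 1 is the intercept
b_x  <- round(est["x", "Estimate"], 2)
se_x <- round(est["x", "Std. Error"], 2)

# In the text of an .Rmd file you could then write inline code such as
#   b = `r b_x`, SE = `r se_x`
# Here we just build the same string to see what would be printed
report <- sprintf("b = %.2f, SE = %.2f", b_x, se_x)
report
```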

You can find a .pdf of the take-everywhere write-up checklist here.

Research question and analysis

Research question
Previous research has identified an association between an individual’s perception of their social rank and symptoms of depression, anxiety and stress. We are interested in the individual differences in this relationship.
Specifically:

Controlling for other personality traits, does neuroticism moderate effects of social comparison on symptoms of depression, anxiety and stress?

library(tidyverse) # for all things!
library(psych) # good for descriptive stats
library(car) # for assumption tests
library(sjPlot) # for plotting models

scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")

# scale scs score
scs_study <- 
  scs_study %>% 
    mutate(
      zscs = (scs-mean(scs))/sd(scs)
    )

# the describe() function is from the psych package
describe(scs_study)

##      vars   n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## zo      1 656  0.09 1.02   0.08    0.08 1.07 -2.82  3.55  6.37  0.15    -0.08
## zc      2 656  0.02 1.00   0.00    0.03 1.03 -3.22  3.08  6.30 -0.08    -0.13
## ze      3 656  0.00 1.00  -0.04   -0.01 1.00 -3.01  2.80  5.81  0.05    -0.13
## za      4 656  0.00 1.00  -0.02    0.01 1.04 -2.94  2.97  5.91 -0.08    -0.04
## zn      5 656  0.00 1.00  -0.21   -0.10 1.00 -1.45  3.35  4.80  0.80     0.04
## scs     6 656 35.77 3.53  35.00   35.59 2.97 27.00 54.00 27.00  0.60     0.96
## dass    7 656 44.72 6.76  44.00   44.62 5.93 23.00 68.00 45.00  0.18     0.33
## zscs    8 656  0.00 1.00  -0.22   -0.05 0.84 -2.48  5.16  7.64  0.60     0.96
##        se
## zo   0.04
## zc   0.04
## ze   0.04
## za   0.04
## zn   0.04
## scs  0.14
## dass 0.26
## zscs 0.04

dass_mdl <- lm(dass ~ 1 + zscs*zn + zo + zc + ze + za, data = scs_study)
plot(dass_mdl)

##       35 
## 2.664798

dass_mdl2 <- lm(dass ~ 1 + zscs*zn + zo + zc + ze + za, data = scs_study[-35, ])

# linearity
plot(dass_mdl2, which=1)

# equal variances
residualPlots(dass_mdl2)

##            Test stat Pr(>|Test stat|)  
## zscs          1.8141          0.07013 .
## zn           -0.5911          0.55467  
## zo            1.7801          0.07553 .
## zc           -0.2403          0.81018  
## ze           -0.9951          0.32004  
## za            0.0725          0.94219  
## Tukey test   -1.6406          0.10089  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ncvTest(dass_mdl2)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.508659, Df = 1, p = 0.21934

# normality
shapiro.test(residuals(dass_mdl2))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(dass_mdl2)
## W = 0.99826, p-value = 0.7616

# independence
dwt(dass_mdl2)

##  lag Autocorrelation D-W Statistic p-value
##    1     -0.03991314      2.078413   0.324
##  Alternative hypothesis: rho != 0

# multicollinearity
vif(dass_mdl2)

##     zscs       zn       zo       zc       ze       za  zscs:zn 
## 1.015133 1.015736 1.013310 1.008235 2.332486 2.342220 1.012475

summary(dass_mdl2)

## 
## Call:
## lm(formula = dass ~ 1 + zscs * zn + zo + zc + ze + za, data = scs_study[-35, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1455  -3.8155  -0.0066   3.6905  18.1483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 44.97703    0.22635 198.708  < 2e-16 ***
## zscs        -1.93818    0.23042  -8.412 2.58e-16 ***
## zn           1.41639    0.22661   6.250 7.44e-10 ***
## zo          -0.31435    0.22056  -1.425    0.155    
## zc           0.09134    0.22515   0.406    0.685    
## ze           0.52695    0.34233   1.539    0.124    
## za           0.33847    0.34281   0.987    0.324    
## zscs:zn     -2.76609    0.24097 -11.479  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.733 on 647 degrees of freedom
## Multiple R-squared:  0.279,  Adjusted R-squared:  0.2712 
## F-statistic: 35.76 on 7 and 647 DF,  p-value: < 2.2e-16

Think

What do you know? What do you hope to learn? What did you learn during the exploratory analysis?

B1: Describe design

If you were reporting on your own study, then first you would want to describe the study design, the data collection strategy, etc.
This is not necessary here, but we could always say something brief like:

Data was obtained from https://uoepsy.github.io/data/scs_study.csv: a dataset containing information on 656 participants

B2: Describe the data

How many observational units?
Are there any observations that have been excluded based on pre-defined criteria? How/why, and how many?
Describe and visualise the variables of interest. How are they scored? Have they been transformed at all?
Describe and visualise relationships between variables. Report covariances/correlations.

Solution

B3: Describe the analytical approach

What type of statistical analysis do you use to answer the research question? (e.g., t-test, simple linear regression, multiple linear regression)
Describe the model/analysis structure
What is your outcome variable? What is its type?
What are your predictors? What are their types?
Any other specifics?

Solution

B4: Planned analysis vs actual analysis

Was there anything you had to do differently than planned during the analysis? Did the modelling highlight issues in your data?
Did you have to do anything (e.g., transform any variables, exclude any observations) in order to meet assumptions?

Solution

Show

Show the mechanics and visualisations which will support your conclusions

B5: Present and describe final model

Present and describe the model or test which you deemed best to answer your question.

Solution

B6: Are the assumptions and conditions of your final test or model satisfied?

For the final model (the one you report results from), were all assumptions met? (Hopefully yes, or there is more work to do…). Include evidence (tests or plots).

Solution

B7: Report your test or model results

Provide a table of results if applicable (for regression tables, try tab_model() from the sjPlot package).
Provide plots if applicable.

Solution

Tell

Communicate your findings

B8: Interpret your results in the context of your research question.

What do your results suggest about your research question?
Make direct links from hypotheses to models (which bit is testing which hypothesis)
Be specific - which statistic did you use/what did the statistical test say? Comment on effect sizes.
Make sure to include measurement units where applicable.

Solution

Tying it all together

All the component parts we have just written in the exercises above can be brought together to make a reasonable draft of a statistical report. There is a lot of variability in how to structure the reporting of statistical analyses, for instance you may be using the same model to test a selection of different hypotheses.

The answers contained within the solution box below are just an example. While we hope it is useful for you when you are writing your report, it should not be taken as an exemplary template for a report which would score 100%.
We have also included the RMarkdown file used to create this, which may be useful to see how things such as formatting and using inline R code can be used.

Solution SPOILERS ALERT

You can find the .Rmd file for the draft below at https://uoepsy.github.io/files/mlr_writeup_example.Rmd. Why not try downloading and compiling it to see how it works?

PLEASE NOTE: A weird issue with hosting .Rmd files like this means that when you download it, it will remove the top bit of meta-data (i.e., the bit which includes author, title etc) meaning it will not compile without a small amount of editing. You can see the .Rmd file more clearly including these first 6 lines here.

Data was obtained from https://uoepsy.github.io/data/scs_study.csv: a dataset containing information on 656 participants, including Z-scores on the 5 personality traits assessed by the Big-Five Aspects Scale (BFAS) (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism). Participants were also assessed on the Social Comparison Scale (SCS), which is an 11-item scale measuring self-perception (relative to others) of social rank, attractiveness and belonging, and the Depression Anxiety and Stress Scale (DASS-21) - a 21 item measure with higher scores indicating higher severity of symptoms. For both of these measures, only total scores are available. Items in the SCS are measured on a 5-point scale, giving minimum and maximum possible scores of 11 and 55 respectively. Items in the DASS-21 are measured on a 4-point scale, meaning that scores can range from a possible 21 to 84.

All participant data was complete (no missing values), with scores on the SCS and the DASS-21 all within possible ranges (see Table 3). Bivariate correlations show a moderate negative relationship between DASS-21 and SCS scores; a moderate positive relationship between DASS-21 and Neuroticism, and a weak positive correlation between SCS and Neuroticism. Additionally, a strong positive relationship is evident between Extraversion and Agreeableness (see Figure 4).

Table 3: SCS and DASS-21 descriptive statistics
	n	mean	sd	min	max
scs	656	35.77	3.53	27	54
dass	656	44.72	6.76	23	68

Figure 4: Bivariate scatter plots (below diagonal), histograms (diagonal), and Pearson correlation coefficient (above diagonal), of personality trait measures and scores on the SCS and the DASS-21

Analysis

To investigate whether, when controlling for other personality traits, neuroticism moderates the effect of social comparison on symptoms of depression, anxiety and stress, total scores on the DASS-21 were modelled using multiple linear regression. The Z-scored measures on each of the big-five personality traits were included as predictors, along with scores on the SCS (Z-scored) and its interaction with the measure of Neuroticism. Effects will be considered statistically significant at \(\alpha = 0.01\). One observation was excluded from the final analysis as it was judged to be too influential on the model (Cook’s Distance = 2.66).
The final model was fitted to the remaining 655 observations, and took the form:
\[ \text{DASS-21} = \beta_0 + \beta_1 \text{O} + \beta_2 \text{C} + \beta_3 \text{E} + \beta_4 \text{A} + \beta_5 \text{N} + \beta_6 \text{SCS} + \beta_7 \text{SCS} \cdot \text{N} + \epsilon \]
where O = Openness, C = Conscientiousness, E = Extraversion, A = Agreeableness, and N = Neuroticism.

To address the research question of whether neuroticism moderates the effect of social comparison on depression and anxiety, we will consider the hypothesis test that the interaction coefficient is equal to zero, where:

\(H_0: \beta_7 = 0\). The interaction between SCS and Neuroticism is equal to zero.
\(H_1: \beta_7 \neq 0\). The interaction between SCS and Neuroticism is not equal to zero.

The regression model met assumptions of linearity (see plot of model residuals vs fitted values, Figure 5), homoscedasticity (non-constant variance test indicated no evidence against the null hypothesis that the error variance is constant across levels of the response, \(\chi^2(1)\)=1.51, \(p\)=0.219), independence of errors (Durbin-Watson test for autocorrelation of residuals: \(DW\)=2.08, \(p\)=0.294), and normality of error term (Shapiro-Wilk test indicated no evidence against the null hypothesis that the residuals were drawn from a normally distributed population: \(W\)=0.998, \(p\)=0.762).

Figure 5: Residuals vs Fitted plot demonstrating overall near constant mean and variance of error term across levels of the response

Results

Results showed a significant conditional association between SCS scores (Z-scored) and DASS-21 Scores (\(\beta\) = -1.94, SE = 0.23, p < .01), suggesting that for those at the mean level of neuroticism, scores on the DASS-21 decrease by 1.94 for every 1 standard deviation increase in SCS scores. A significant conditional association was also evident between Neuroticism (Z-scored) and DASS-21 Scores (\(\beta\) = 1.42, SE = 0.23, p < .01), suggesting that for those who score the mean on the SCS, scores on the DASS-21 increase by 1.42 for every 1 standard deviation increase in neuroticism. Crucially, the association between social comparison and symptoms of depression and anxiety was found to be dependent upon the level of neuroticism, with a greater negative association between the two for those with high levels of neuroticism (\(\beta\) = -2.77, SE = 0.24, p < .01). This interaction is visually presented in Figure 6.
The F-test for model utility was significant (F(7,647)=35.76, p<.001), with the model explaining approximately 27.1% of the variability in DASS-21 Scores. Full regression results including 95% Confidence Intervals are shown in Table 4.
The results presented here indicate that the association between social comparison and depression and anxiety may depend upon individuals’ levels of neuroticism, with lower perceived social rank perhaps leading to more symptoms of depression and anxiety for highly neurotic individuals. However, it is important to note that we can make no claims on the directions of these associations from these data - it may be that social comparison leads to more depression and anxiety in neurotic individuals, but also consistent is the view that - for these individuals - higher levels of depression lead to a greater reduction in perceived social rank.
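To make the interaction more concrete, we can work out the implied slope of SCS at different levels of neuroticism. This is simple arithmetic using the rounded estimates from the regression table in this write-up (Table 4), with the object names invented for illustration: the SCS slope at a given neuroticism score is the SCS coefficient plus the interaction coefficient multiplied by that score.

```r
# Rounded estimates taken from the regression table in this write-up
b_scs <- -1.94   # SCS slope when neuroticism is at its mean (zn = 0)
b_int <- -2.77   # change in the SCS slope per 1 SD increase in neuroticism

# Neuroticism at -1 SD, the mean, and +1 SD
zn <- c(low = -1, mean = 0, high = 1)

# Conditional (simple) slopes of SCS at each level of neuroticism
slopes <- b_scs + b_int * zn
slopes
#   low  mean  high 
#  0.83 -1.94 -4.71
```

So the negative association between SCS and DASS-21 scores is much steeper at +1 SD of neuroticism, and is close to flat (slightly positive) at -1 SD. These are point estimates only. Whether each simple slope differs from zero would need its own test.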

Figure 6: Predicted DASS-21 score across SCS scores, for +/-1 SD Neuroticism

Table 4: Regression table for DASS-21 model. Outcome variable is raw total score on DASS-21, all predictors are Z-scored
	DASS-21
Predictors	Estimates	CI	p
(Intercept)	44.98	44.53 – 45.42	<0.001
Social Comparison Scale	-1.94	-2.39 – -1.49	<0.001
Neuroticism	1.42	0.97 – 1.86	<0.001
Openness	-0.31	-0.75 – 0.12	0.155
Conscientiousness	0.09	-0.35 – 0.53	0.685
Extraversion	0.53	-0.15 – 1.20	0.124
Agreeableness	0.34	-0.33 – 1.01	0.324
Social Comparison Scale : Neuroticism	-2.77	-3.24 – -2.29	<0.001
Observations	655
R² / R² adjusted	0.279 / 0.271

Extra: linear models and other things

Once you start using linear models, you might begin to think about how many other common statistical tests can be put into a linear model framework. Below are some very quick demonstrations of a couple of equivalences, but there are many more, and we encourage you to explore this further by a) playing around with R, and b) reading through some of the examples at https://lindeloev.github.io/tests-as-linear/.

lm and t.test

t.test(drinkdriving$age ~ drinkdriving$outcome, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  drinkdriving$age by drinkdriving$outcome
## t = -9.2825, df = 223, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -26.99844 -17.54249
## sample estimates:
##    mean in group fine mean in group warning 
##              38.32044              60.59091

summary(lm(age ~ outcome, data = drinkdriving))

## 
## Call:
## lm(formula = age ~ outcome, data = drinkdriving)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.59  -9.32  -0.32   8.68  47.68 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      38.320      1.061  36.119   <2e-16 ***
## outcomewarning   22.270      2.399   9.283   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.27 on 223 degrees of freedom
##   (25 observations deleted due to missingness)
## Multiple R-squared:  0.2787, Adjusted R-squared:  0.2755 
## F-statistic: 86.16 on 1 and 223 DF,  p-value: < 2.2e-16
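The equivalence shown in the two outputs above can be checked in code. This sketch uses simulated data (the grouping factor `g` and the outcome `age` are made up and are not the real dataset). The sign of \(t\) differs between the two only because t.test() takes the first group minus the second, while the lm() coefficient is the second group minus the reference group.

```r
set.seed(8)
g   <- factor(rep(c("fine", "warning"), times = c(60, 40)))
age <- rnorm(100, mean = ifelse(g == "fine", 38, 55), sd = 14)

# Equal-variance two-sample t-test
tt <- t.test(age ~ g, var.equal = TRUE)

# The same comparison as a linear model with one binary predictor
fit  <- lm(age ~ g)
t_lm <- coef(summary(fit))["gwarning", "t value"]
p_lm <- coef(summary(fit))["gwarning", "Pr(>|t|)"]

c(t_test = unname(tt$statistic), lm = t_lm)   # same size, opposite sign
c(t_test = tt$p.value, lm = p_lm)             # identical p-values
```

Without `var.equal = TRUE`, t.test() runs Welch's test, which does not match the linear model exactly.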

lm and cor.test

Or a test of the correlation coefficient:

cor.test(drinkdriving$bac, drinkdriving$age)

## 
##  Pearson's product-moment correlation
## 
## data:  drinkdriving$bac and drinkdriving$age
## t = -5.2356, df = 218, p-value = 3.863e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4467334 -0.2112790
## sample estimates:
##        cor 
## -0.3342105

summary(lm(bac ~ age, data = drinkdriving))

## 
## Call:
## lm(formula = bac ~ age, data = drinkdriving)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.719  -4.516  -0.861   3.849  42.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.29455    1.25564  23.330  < 2e-16 ***
## age         -0.14353    0.02741  -5.236 3.86e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.82 on 218 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.1117, Adjusted R-squared:  0.1076 
## F-statistic: 27.41 on 1 and 218 DF,  p-value: 3.863e-07
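Again, we can confirm the match in code rather than by comparing printed output. In this sketch on simulated data (`x` and `y` are made up for illustration), the \(t\) statistic from cor.test() equals the \(t\) for the slope in a simple regression, and the squared correlation equals the model's \(R^2\).

```r
set.seed(9)
x <- rnorm(90)
y <- -0.4 * x + rnorm(90)

# Test of the Pearson correlation
ct <- cor.test(x, y)

# Simple linear regression of y on x
fit  <- lm(y ~ x)
t_lm <- coef(summary(fit))["x", "t value"]
r2   <- summary(fit)$r.squared

c(cor_test = unname(ct$statistic), lm = t_lm)   # same t statistic
c(r_squared = unname(ct$estimate)^2, R2 = r2)   # r^2 equals R^2
```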

Cheat Sheets

You can find many RStudio cheatsheets at https://rstudio.com/resources/cheatsheets/, but some of the more relevant ones to this course are listed below:

Figure 4 was created with the pairs.panels() function from the psych package, if you’re interested.

This workbook was written by Josiah King, Umberto Noe, and Martin Corley, and is licensed under a Creative Commons Attribution 4.0 International License.
