A sample question

We saw in the lecture a brief explanation of approaching the following sample question from a previous year’s coursework report:

Sample Question: Driving speeds, night vs. day

Does time of day and speed of driving predict the blood alcohol content over and above driver’s age? Fit appropriate model(s) to test this question, and report the results (you may add a figure or table if appropriate).

Drinkdriving Data

Two datasets can be loaded from the following url:

load(url("https://uoepsy.github.io/data/usmr_1920_assignment.RData"))

The data provided contains information about the nature and circumstances of motorists stopped and breathalysed by the Police.
Data is collected every time that driver is stopped by the Police and breathalysed. Records indicate the speed at which the driver is travelling when they are stopped, and the blood alcohol content of the driver when measured via breathalyser. Information is also captured on the age and prior motoring offences of the driver, and whether the incident occurred at day or night. Police officers may have had reasons for stopping drivers other than presuming them to be intoxicated (for instance, someone who is stopped for speeding may subsequently be breathalysed if they are deemed to be acting unusually).

Each time a police officer stops a motorist, an incident ID is created. A separate database used primarily for administrative purposes includes records of which officer (recorded as initials) attends which incidents.

Variable	Description
age	Age of driver (in years)
nighttime	Whether or not the incident occurred at night
prior_offence	Offence code for any prior motoring offences
speed	Speed when stopped by police (mph)
bac	Blood Alcohol Content (%) as measured by breathalyser
outcome	Outcome of stop ('fine','warning')
incident_id	ID of incident
officer	Officer attending (initials)

Question A1

Explore and clean the dataset (i.e., remove any impossible values etc).
Some info on the lecture slides this week will help with guidance on what to look for.

(for now, you can ignore things like the “prior_offence” variable if you want, as this is a tricky one to tidy up, and isn’t relevant for the sample question we are considering)

Question A2

Take another look at our question:

Does time of day and speed of driving predict the blood alcohol content over and above driver’s age? Fit appropriate model(s) to test this question, and report the results (you may add a figure or table if appropriate).

Try to provide an answer (hint: we’re probably going to want to use lm() and/or anova()). Can you give extra context to your answer?

“over and above” here indicates that we are looking at the improvement in model RSS (residual sums of squares), i.e., we would ideally want to examine the effect(s) in question in light of every other term in the model.

However, the question concerns the improvement due to a set of predictors, not just one. So how can we examine this? Well, one option is to use model comparison!

Because we want to compare models, we need to use the same dataset, and lm() will do case-wise deletion of any observations which are missing in any of the predictors. E.g. if we have some NAs in speed, it will drop these from a model which includes speed as a predictor, but include them for a model which doesn’t (provided it is not also NA in the other variables).

drinkdriving2 <- drinkdriving %>% filter(!is.na(age), !is.na(nighttime), !is.na(speed))
modA <- lm(bac ~ age, drinkdriving2)
modB <- lm(bac ~ age + nighttime + speed, drinkdriving2)
anova(modA, modB)

## Analysis of Variance Table
## 
## Model 1: bac ~ age
## Model 2: bac ~ age + nighttime + speed
##   Res.Df  RSS Df Sum of Sq    F  Pr(>F)    
## 1    216 9841                              
## 2    214 8196  2      1645 21.5 3.1e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And we should remember to assess our model assumptions:

plot(modA)

plot(modA)

Which highlights a couple of points which seem to be extremely influential. On further investigation, one of these observations seemed to be very fast (was the max speed observed) and both have recorded very high BAC values.

drinkdriving2[c(101,140),]

## # A tibble: 2 x 7
##     age nighttime prior_offence  speed   bac outcome incident_id
##   <dbl> <fct>     <chr>          <dbl> <dbl> <fct>   <chr>      
## 1    29 day       CD60,CD60      120.   57.3 fine    inc_110    
## 2    44 night     CD60,CD10,DR80  81.0  65.8 fine    inc_154

Why are these different from the -c(99, 136) in the lecture recordings???
This is a nice little demonstration of what is sometimes termed “researcher degrees of freedom.” Some decision early on (which may not have even felt like a ‘decision’ at the time) has meant that the analysis presented here began to deviate from that presented in the lecture. Small decisions we make have trickle down effects on our results. This is unavoidable, and provided we feel justified in the steps we have taken, we should be happy to continue. In most cases, we would hope that variance in results due to differences in little decisions do not lead to meaningful differences in conclusions.

What did we do here differently from the lecture?
Just looking quickly:

Lecture removed datapoints where speed = 0, we did not (1 observation).
Lecture recoded datapoints where nighttime = ‘2:05’ to ‘night.’ Here we excluded them. (2 observations).

With the limited knowledge we have about the data, both approaches are arguably justified. More often, you will have more scope to query your knowledge about how the data was collected (e.g., if you collected it yourself).

Removing these points from our models does not change the overall conclusion:

modA <- lm(bac ~ age, drinkdriving2[-c(101,140), ])
modB <- lm(bac ~ age + nighttime + speed, drinkdriving2[-c(101,140), ])
anova(modA, modB)

## Analysis of Variance Table
## 
## Model 1: bac ~ age
## Model 2: bac ~ age + nighttime + speed
##   Res.Df  RSS Df Sum of Sq    F  Pr(>F)    
## 1    214 6948                              
## 2    212 6219  2       729 12.4 7.9e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, we might be interested in contextualising our answer to discuss the addition of each effect individually.
Which we can obtain using, e.g.:

modB <- lm(bac ~ age + nighttime + speed, drinkdriving2[-c(101,140),])
modC <- lm(bac ~ age + speed + nighttime, drinkdriving2[-c(101,140),])
anova(modB)
anova(modC)

In this way, we can get the addition of speed to (age + nighttime), and the addition of nighttime to (age + speed), which will match the final rows outputted from anova(modB) and anova(modC) respectively (\(t\) = \(\sqrt{F}\), see this guide on sums of squares for explanation of this).

# or summary(modC), the results will be the same
summary(modB)

## 
## Call:
## lm(formula = bac ~ age + nighttime + speed, data = drinkdriving2[-c(101, 
##     140), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.646  -4.376   0.579   3.566  14.289 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     23.5265     2.6633    8.83  3.9e-16 ***
## age             -0.1160     0.0299   -3.88  0.00014 ***
## nighttimenight   3.8765     0.7935    4.89  2.0e-06 ***
## speed            0.0411     0.0396    1.04  0.30024    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.42 on 212 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.246,  Adjusted R-squared:  0.235 
## F-statistic: 23.1 on 3 and 212 DF,  p-value: 5.89e-13

A good answer to this question could be one which draws on model comparison between a restricted “age only” model, and a model with both additional predictors (speed and nighttime) in it.
It would detail any observations excluded from the analysis and the reasons for doing so (e.g., high influence on model fit).
It could refer back to the research question and state a conclusion. In doing so, it would report in text the results which have a bearing on the question (highlighted below), and include also a table/figure where this provides extra information.

Res.Df

RSS

Df

Sum of Sq

F

Pr(>F)

214

6948

NA

NA

NA

NA

212

6219

2

729

12.4

0

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
214	6948	NA	NA	NA	NA
212	6219	2	729	12.4	0

It could then discuss in more depth an explanation for this finding, for instance by discussing the improvement due to the speed and nighttime variables individually (for instance, the table below), as well as how the age coefficient changes with their inclusion.

	bac
Predictors	Estimates	CI	p
(Intercept)	23.53	18.28 – 28.78	<0.001
age	-0.12	-0.17 – -0.06	<0.001
nighttime [night]	3.88	2.31 – 5.44	<0.001
speed	0.04	-0.04 – 0.12	0.300
Observations	216
R² / R² adjusted	0.246 / 0.235

Writing-up Guide

Here, we’re going to walk through a high-level step-by-step guide of what to include in a write-up of a statistical analysis. We’re going to use an example analysis using one of the datasets we have worked with on a number of exercises in previous labs concerning personality traits, social comparison, and depression and anxiety.

The aim in writing should be that a reader is able to more or less replicate your analyses without referring to your R code. This requires detailing all of the steps you took in conducting the analysis.
The point of using RMarkdown is that you can pull your results directly from the code. If your analysis changes, so does your report!

You can find a .pdf of the take-everywhere write-up checklist here.

Research Question

Previous research has identified an association between an individual’s perception of their social rank and symptoms of depression, anxiety and stress. We are interested in the individual differences in this relationship.
Specifically:

Research question: Controlling for other personality traits, does neuroticism moderate effects of social comparison on symptoms of depression, anxiety and stress?

Here is our analysis

library(tidyverse) # for all things!
library(psych) # good for descriptive stats
library(car) # for assumption tests
library(sjPlot) # for plotting models

scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")

# scale scs score
scs_study <- 
  scs_study %>% 
    mutate(
      zscs = (scs-mean(scs))/sd(scs)
    )

# the describe() function is from the psych package
describe(scs_study)

##      vars   n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## zo      1 656  0.09 1.02   0.08    0.08 1.07 -2.82  3.55  6.37  0.15    -0.08
## zc      2 656  0.02 1.00   0.00    0.03 1.03 -3.22  3.08  6.30 -0.08    -0.13
## ze      3 656  0.00 1.00  -0.04   -0.01 1.00 -3.01  2.80  5.81  0.05    -0.13
## za      4 656  0.00 1.00  -0.02    0.01 1.04 -2.94  2.97  5.91 -0.08    -0.04
## zn      5 656  0.00 1.00  -0.21   -0.10 1.00 -1.45  3.35  4.80  0.80     0.04
## scs     6 656 35.77 3.53  35.00   35.59 2.97 27.00 54.00 27.00  0.60     0.96
## dass    7 656 44.72 6.76  44.00   44.62 5.93 23.00 68.00 45.00  0.18     0.33
## zscs    8 656  0.00 1.00  -0.22   -0.05 0.84 -2.48  5.16  7.64  0.60     0.96
##        se
## zo   0.04
## zc   0.04
## ze   0.04
## za   0.04
## zn   0.04
## scs  0.14
## dass 0.26
## zscs 0.04

dass_mdl <- lm(dass ~ 1 + zscs*zn + zo + zc + ze + za, data = scs_study)
plot(dass_mdl)

##   35 
## 2.66

dass_mdl2 <- lm(dass ~ 1 + zscs*zn + zo + zc + ze + za, data = scs_study[-35, ])

# linearity
plot(dass_mdl2, which=1)

# equal variances
residualPlots(dass_mdl2)

##            Test stat Pr(>|Test stat|)  
## zscs            1.81            0.070 .
## zn             -0.59            0.555  
## zo              1.78            0.076 .
## zc             -0.24            0.810  
## ze             -1.00            0.320  
## za              0.07            0.942  
## Tukey test     -1.64            0.101  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ncvTest(dass_mdl2)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.51, Df = 1, p = 0.2

# normality
shapiro.test(residuals(dass_mdl2))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(dass_mdl2)
## W = 1, p-value = 0.8

# independence
dwt(dass_mdl2)

##  lag Autocorrelation D-W Statistic p-value
##    1         -0.0399          2.08   0.314
##  Alternative hypothesis: rho != 0

# multicollinearity
vif(dass_mdl2)

##    zscs      zn      zo      zc      ze      za zscs:zn 
##    1.02    1.02    1.01    1.01    2.33    2.34    1.01

summary(dass_mdl2)

## 
## Call:
## lm(formula = dass ~ 1 + zscs * zn + zo + zc + ze + za, data = scs_study[-35, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.146  -3.816  -0.007   3.690  18.148 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.9770     0.2263  198.71  < 2e-16 ***
## zscs         -1.9382     0.2304   -8.41  2.6e-16 ***
## zn            1.4164     0.2266    6.25  7.4e-10 ***
## zo           -0.3143     0.2206   -1.43     0.15    
## zc            0.0913     0.2251    0.41     0.69    
## ze            0.5269     0.3423    1.54     0.12    
## za            0.3385     0.3428    0.99     0.32    
## zscs:zn      -2.7661     0.2410  -11.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.73 on 647 degrees of freedom
## Multiple R-squared:  0.279,  Adjusted R-squared:  0.271 
## F-statistic: 35.8 on 7 and 647 DF,  p-value: <2e-16

1. Think

What do you know? What do you hope to learn? What did you learn during the exploratory analysis?

B1: Describe design

If you were reporting on your own study, then the first you would want to describe the study design, the data collection strategy, etc.
This is not necessary here, but we could always say something brief like:

Data was obtained from https://uoepsy.github.io/data/scs_study.csv: a dataset containing information on 656 participants

B2: Describe the data

How many observational units?
Are there any observations that have been excluded based on pre-defined criteria? How/why, and how many?
Describe and visualise the variables of interest. How are they scored? have they been transformed at all?
Describe and visualise relationships between variables. Report covariances/correlations.

Solution

Table 1: SCS and DASS-21 descriptive statistics
	n	mean	sd	min	max
scs	656	35.8	3.53	27	54
dass	656	44.7	6.76	23	68

B3: Describe the analytical approach

What type of statistical analysis do you use to answer the research question? (e.g., t-test, simple linear regression, multiple linear regression)
Describe the model/analysis structure
What is your outcome variable? What is its type?
What are your predictors? What are their types?
Any other specifics?

Solution

B4: Planned analysis vs actual analysis

Was there anything you had to do differently than planned during the analysis? Did the modelling highlight issues in your data?
Did you have to do anything (e.g., transform any variables, exclude any observations) in order to meet assumptions?

Solution

2. Show

Show the mechanics and visualisations which will support your conclusions

B5: Present and describe final model

Present and describe the model or test which you deemed best to answer your question.

Solution

B6: Are the assumptions and conditions of your final test or model satisfied?

For the final model (the one you report results from), were all assumptions met? (Hopefully yes, or there is more work to do…). Include evidence (tests or plots).

Solution

B7: Report your test or model results

Provide a table of results if applicable (for regression tables, try tab_model() from the sjPlot package).
Provide plots if applicable.

Solution

3. Tell

Communicate your findings

B8: Interpret your results in the context of your research question.

What do your results suggest about your research question?
Make direct links from hypotheses to models (which bit is testing hypothesis)
Be specific - which statistic did you use/what did the statistical test say? Comment on effect sizes.
Make sure to include measurement units where applicable.

Solution

Tying it all together

All the component parts we have just written in the exercises above can be brought together to make a reasonable draft of a statistical report. There is a lot of variability in how to structure the reporting of statistical analyses, for instance you may be using the same model to test a selection of different hypotheses.

The answers contained within the solution boxes above are an example. While we hope it is useful for you when you are writing reports, dissertations etc, it should not be taken as an exemplary template for a report which would score 100%.
You can find an RMarkdown file which contains all these parts here, which may be useful to see how things such as formatting and using inline R code can be used.

Life Beyond USMR

Linear models and other things

Once you start using linear models, you might begin to think about how many other common statistical tests can be put into a linear model framework. Below are some very quick demonstrations of a couple of equivalences, but there are many more, and we encourage you to explore this further by a) playing around with R, and b) reading through some of the examples at https://lindeloev.github.io/tests-as-linear/.

lm and t.test

t.test(drinkdriving$age ~ drinkdriving$outcome, var.equal = T)

## 
##  Two Sample t-test
## 
## data:  drinkdriving$age by drinkdriving$outcome
## t = -9, df = 223, p-value <2e-16
## alternative hypothesis: true difference in means between group fine and group warning is not equal to 0
## 95 percent confidence interval:
##  -27.0 -17.5
## sample estimates:
##    mean in group fine mean in group warning 
##                  38.3                  60.6

summary(lm(age ~ outcome, data = drinkdriving))

## 
## Call:
## lm(formula = age ~ outcome, data = drinkdriving)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.59  -9.32  -0.32   8.68  47.68 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       38.32       1.06   36.12   <2e-16 ***
## outcomewarning    22.27       2.40    9.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.3 on 223 degrees of freedom
##   (25 observations deleted due to missingness)
## Multiple R-squared:  0.279,  Adjusted R-squared:  0.275 
## F-statistic: 86.2 on 1 and 223 DF,  p-value: <2e-16

lm and cor.test

Or a test of the correlation coefficient:

cor.test(drinkdriving$bac, drinkdriving$age)

## 
##  Pearson's product-moment correlation
## 
## data:  drinkdriving$bac and drinkdriving$age
## t = -5, df = 218, p-value = 4e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.447 -0.211
## sample estimates:
##    cor 
## -0.334

summary(lm(bac ~ age, data = drinkdriving))

## 
## Call:
## lm(formula = bac ~ age, data = drinkdriving)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.72  -4.52  -0.86   3.85  42.78 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.2945     1.2556   23.33  < 2e-16 ***
## age          -0.1435     0.0274   -5.24  3.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.82 on 218 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.112,  Adjusted R-squared:  0.108 
## F-statistic: 27.4 on 1 and 218 DF,  p-value: 3.86e-07

Cheat Sheets

You can find many RStudio cheatsheets at https://rstudio.com/resources/cheatsheets/, but some of the more relevant ones to this course are listed below:

Thank you!

Lastly, we’d just like to say a big thank you for following all of our ramblings, for your attendance across the various sessions, and for all your excellent questions on Piazza. We hope that you feel that you have learned something this semester, and that it has been (at least in some ways) an enjoyable experience. Those of you who are planning on taking the Multivariate Stats (MSMR) course next semester are not free of us just yet - we’ll see you in January!

Josiah, Umberto & Emma

This workbook was written by Josiah King, Umberto Noe, and Martin Corley, and is licensed under a Creative Commons Attribution 4.0 International License.

	DASS-21
Predictors	Estimates	CI	p
(Intercept)	44.98	44.53 – 45.42	<0.001
Social Comparison Scale	-1.94	-2.39 – -1.49	<0.001
Neuroticism	1.42	0.97 – 1.86	<0.001
Openness	-0.31	-0.75 – 0.12	0.155
Conscientiousness	0.09	-0.35 – 0.53	0.685
Extraversion	0.53	-0.15 – 1.20	0.124
Agreeableness	0.34	-0.33 – 1.01	0.324
Social Comparison Scale : Neutoricism	-2.77	-3.24 – -2.29	<0.001
Observations	655
R² / R² adjusted	0.279 / 0.271