Section B: Weeks 6-9 Recap
The second part of the lab contains no new content; the purpose of this recap section is for you to revisit and revise the concepts you have learned over the last four weeks.
Before you expand each of the boxes below, think about how comfortable you feel with each concept.
Errors and Power in Hypothesis Testing
When testing a hypothesis, we reach one of the following two decisions:
- failing to reject \(H_0\), as the evidence against it is not sufficient
- rejecting \(H_0\), as we have enough evidence against it
However, irrespective of our decision, the underlying truth can either be that
- \(H_0\) is actually false
- \(H_0\) is actually true
Hence, we have four possible outcomes following a hypothesis test:
- We failed to reject \(H_0\) when it was true, meaning we made a Correct decision
- We rejected \(H_0\) when it was false, meaning we made a Correct decision
- We rejected \(H_0\) when it was true, committing a Type I error
- We failed to reject \(H_0\) when it was false, committing a Type II error
In the first two cases we are correct, while in the latter two we committed an error.
When we reject the null hypothesis, we never know whether we were correct or committed a Type I error; however, we can control the chance of committing a Type I error.
Similarly, when we fail to reject the null hypothesis, we never know whether we were correct or committed a Type II error, but we can also control the chance of committing a Type II error.
The following table summarises the two types of errors that we can commit:

| Decision | \(H_0\) true | \(H_0\) false |
|---|---|---|
| Fail to reject \(H_0\) | Correct decision | Type II error |
| Reject \(H_0\) | Type I error | Correct decision |

A Type I error corresponds to a false discovery, while a Type II error corresponds to a failed discovery/missed opportunity.
Each error has a corresponding probability:
- The probability of incorrectly rejecting a true null hypothesis is \(\alpha = P(\text{Type I error})\)
- The probability of incorrectly not rejecting a false null hypothesis is \(\beta = P(\text{Type II error})\)
A related quantity is Power, which is defined as the probability of correctly rejecting a false null hypothesis.
Power
Power is the probability of rejecting a false null hypothesis. That is, it is the probability that we will find an effect when it is in fact present. \[
\text{Power} = 1 - P(\text{Type II error}) = 1 - \beta
\]
See S2 Week 6 Lecture, and S2 Week 6 Lab for further details, examples, and to revise these concepts further.
Factors Affecting Power
In practice, it is ideal for studies to have high power while using a relatively small significance level such as .05 or .01. For a fixed \(\alpha\), the power increases in the same cases in which \(P(\text{Type II error})\) decreases, namely as the sample size increases and as the parameter value moves farther into the \(H_1\) values, away from the \(H_0\) value.
The power of a test is affected by the following factors:
- sample size: power increases as the sample size increases.
- effect size: power increases as the parameter value moves farther into the \(H_1\) values, away from the \(H_0\) value.
- significance level: power increases as the significance level increases.
Out of these, increasing the significance level \(\alpha\) is never an acceptable way to increase power as it leads to more Type I errors, i.e. a higher chance of incorrectly rejecting a true null hypothesis (false discoveries).
See S2 Week 6 Lecture, and S2 Week 6 Lab for further details, examples, and to revise these concepts further.
Effect Size
Effect size refers to the “detectability” of your alternative hypothesis. In simple terms, it compares the distance between the alternative and the null hypothesis to the variability in your data.
For simplicity, consider comparing a mean: \(H_0: \mu = 0\) vs \(H_1: \mu \neq 0\). If the sample mean is 0.1 and the hypothesised value is 0, the distance is 0.1 - 0 = 0.1.
Now, a distance of 0.1 has a different weight in the following two scenarios.
Scenario 1. Data vary between -1 and 1.
Scenario 2. Data vary between -1000 and 1000.
Clearly, in Scenario 1 a distance of 0.1 is a big difference. Conversely, in Scenario 2 a distance of 0.1 is not an interesting difference; it’s negligible compared to the magnitude of the data.
See S2 Week 6 Lecture, and S2 Week 6 Lab for further details, examples, and to revise these concepts further.
The pwr Package
You will perform power analysis using the pwr package. To install it, run install.packages("pwr") in your console, then load it with library(pwr) in your RMarkdown file.
The following functions are available:

| Function | Purpose |
|---|---|
| pwr.2p.test | Two proportions (equal n) |
| pwr.2p2n.test | Two proportions (unequal n) |
| pwr.anova.test | Balanced one-way ANOVA |
| pwr.chisq.test | Chi-square test |
| pwr.f2.test | General linear model |
| pwr.p.test | Proportion (one sample) |
| pwr.r.test | Correlation |
| pwr.t.test | t-tests (one sample, two samples, paired) |
| pwr.t2n.test | t-test (two samples with unequal n) |
For each function, you can specify three of four arguments (sample size, alpha, effect size, power) and the fourth argument will be calculated for you.
Of the four quantities, effect size is often the most difficult to specify. Calculating effect size typically requires some experience with the measures involved and knowledge of past research.
Typically, specifying effect size requires you to read published literature or past papers on your research topic, to see what effect sizes were found and what significant results were reported. Other times, this might come from previous collected data or subject-knowledge from your colleagues.
But what can you do if you have no clue what effect size to expect in a given study? Cohen (1988) provided guidelines for what a small, medium, or large effect typically is in the behavioral sciences.
| Test | Small | Medium | Large |
|---|---|---|---|
| t-test | 0.20 | 0.50 | 0.80 |
| ANOVA | 0.10 | 0.25 | 0.40 |
| Linear regression | 0.02 | 0.15 | 0.35 |
See S2 Week 6 Lecture, and S2 Week 6 Lab for further details, examples, and to revise these concepts further.
Power for t-tests
We compare the mean of a response variable between two groups using a t-test. For example, if you are comparing the mean response between two groups, say treatment and control, the null and alternative hypotheses are: \[
H_0 : \mu_t - \mu_c = 0
\]
\[
H_1 : \mu_t - \mu_c \neq 0
\]
The effect size in this case is Cohen’s \(D\): \[
D = \frac{(\bar x_t - \bar x_c) - 0}{s_p}
\] where
- \(\bar x_t\) and \(\bar x_c\) are the sample means in the treatment and control groups, respectively
- \(s_p\) is the “pooled” standard deviation
Cohen’s \(D\) measures the distance of (a) the observed difference in means from (b) the hypothesised value 0, and compares this to the variability in the data.
In R we use the function
pwr.t.test(n = , d = , sig.level = , power = , type = , alternative = )
where
- n = the sample size
- d = the effect size
- sig.level = the significance level \(\alpha\) (the default is 0.05)
- power = the power level
- type = the type of t-test to perform: either a two-sample t-test (“two.sample”), a one-sample t-test (“one.sample”), or a dependent-sample t-test (“paired”). A two-sample test is the default.
- alternative = whether the alternative hypothesis is two-sided (“two.sided”) or one-sided (“less” or “greater”). A two-sided test is the default.
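For example, here is a minimal sketch (the effect size, significance level, and power are illustrative values, not taken from the lab data): leaving n unspecified makes the function solve for the required sample size per group.
library(pwr)

# Sample size per group needed to detect a medium effect (Cohen's D = 0.5)
# with 80% power in a two-sided, two-sample t-test at alpha = 0.05.
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80,
           type = "two.sample", alternative = "two.sided")
# The returned n is the required sample size per group; round it up.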
See S2 Week 6 Lecture, and S2 Week 6 Lab for further details, examples, and to revise these concepts further.
Power for Linear Regression
In linear regression, the relevant function in R is pwr.f2.test (listed in the table above for the general linear model):
pwr.f2.test(u = , v = , f2 = , sig.level = , power = )
where
- u = numerator degrees of freedom
- v = denominator degrees of freedom
- f2 = effect size
In the simplest case, you have a single model,
\[
y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon
\]
and you wish to find the minimum sample size required to answer the following test of hypothesis with a given power:
\[
\begin{aligned}
H_0 &: \beta_1 = \beta_2 = \dots = \beta_k = 0 \\
H_1 &: \text{At least one } \beta_i \neq 0 \\
\end{aligned}
\]
The appropriate formula for the effect size is:
\[
f^2 = \frac{R^2}{1 - R^2}
\]
And the numerator degrees of freedom are \(\texttt u = k\), the number of predictors in the model.
The denominator degrees of freedom returned by the function will give you:
\[
\texttt{v} = n - (k + 1) = n - k - 1
\]
From which you can infer the sample size as
\[
n = \texttt{v} + k + 1
\]
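As an illustration, here is a minimal sketch of this first case (the anticipated R-squared, significance level, and power are made-up values):
library(pwr)

k  <- 3                       # number of predictors in the model
r2 <- 0.15                    # anticipated R-squared (illustrative)
f2 <- r2 / (1 - r2)           # effect size

res <- pwr.f2.test(u = k, f2 = f2, sig.level = 0.05, power = 0.80)
ceiling(res$v) + k + 1        # minimum sample size: n = v + k + 1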
A second case is comparing two nested models. Here you would have a smaller model \(m\) with \(k\) predictors and a larger model \(M\) with \(K > k\) predictors:
\[
\begin{aligned}
m &: \quad y = \beta_0 + \beta_1 x_1 + \dots + \beta_{k} x_k + \epsilon \\
M &: \quad y = \beta_0 + \beta_1 x_1 + \dots + \beta_{k} x_k
+ \underbrace{\beta_{k + 1} x_{k+1} + \dots + \beta_{K} x_K}_{\text{extra predictors}}
+ \epsilon \\
\end{aligned}
\]
This case is when you wish to find the minimum sample size required to answer the following test of hypothesis:
\[
\begin{aligned}
H_0 &: \beta_{k+1} = \beta_{k+2} = \dots = \beta_K = 0 \\
H_1 &: \text{At least one of the above } \beta \neq 0 \\
\end{aligned}
\]
You need to use the R-squared from the larger model \(R^2_M\) and the R-squared from the smaller model \(R^2_m\). The appropriate formula for the effect size is:
\[
f^2 = \frac{R^2_{M} - R^2_{m}}{1 - R^2_M}
\]
Here, the numerator degrees of freedom are the extra predictors: \(\texttt u = K - k\).
The denominator degrees of freedom returned by the function will give you (here \(K\) is the total number of predictors in the larger model):
\[
\texttt{v} = n - (K + 1) = n - K - 1
\]
From which you can infer the sample size as
\[
n = \texttt{v} + K + 1
\]
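Here is a minimal sketch of this second case (the two R-squared values and the numbers of predictors are made-up values for illustration):
library(pwr)

k    <- 3                     # predictors in the smaller model
K    <- 5                     # predictors in the larger model
r2_m <- 0.20                  # R-squared of the smaller model (illustrative)
r2_M <- 0.30                  # R-squared of the larger model (illustrative)

f2 <- (r2_M - r2_m) / (1 - r2_M)

res <- pwr.f2.test(u = K - k, f2 = f2, sig.level = 0.05, power = 0.80)
ceiling(res$v) + K + 1        # minimum sample size: n = v + K + 1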
See S2 Week 6 Lecture, and S2 Week 6 Lab for further details, examples, and to revise these concepts further.
Probability, Odds, Log-Odds
The probability \(p\) ranges from 0 to 1.
The odds \(\frac{p}{1-p}\) ranges from 0 to \(\infty\).
The log-odds \(\log \left( \frac{p}{1-p} \right)\) ranges from \(-\infty\) to \(\infty\).
In the labs, we use “log” to denote the natural logarithm. In the lectures, you will have seen this written as “ln”.
In order to understand the connections among these concepts, let’s work with an example where the probability of an event occurring is 0.2:
- Odds of the event occurring:
\[
\text{odds} = \frac{0.2}{1 - 0.2} = \frac{0.2}{0.8} = 0.25
\]
- Log-odds of the event occurring:
\[
\log \left( \frac{0.2}{0.8} \right) = -1.3863
\]
OR
\[
\log(0.25) = -1.3863
\]
- Probability can be reconstructed as:
\[
\frac{\text{odds}}{1 + \text{odds}} = \frac{0.25}{1 + 0.25} = 0.2
\]
OR
\[
\frac{\exp(\log(\text{odds}))}{1 + \exp(\log(\text{odds}))} = \frac{\exp(-1.3863)}{1 + \exp(-1.3863)} = \frac{0.25}{1.25} = 0.2
\]
In R:
- obtain the odds: odds <- 0.2 / (1 - 0.2)
- obtain the log-odds, either from the odds with log_odds <- log(0.25) OR from the probability with log_odds <- qlogis(0.2)
- obtain the probability from the odds: prob_O <- odds / (1 + odds)
- obtain the probability from the log-odds: prob_LO <- exp(log_odds) / (1 + exp(log_odds)) OR prob_LO <- plogis(log_odds)
Binary Logistic Regression
When a response (\(y\)) is binary coded (e.g., 0/1; failure/success; no/yes; fail/pass; unemployed/employed) we use logistic regression. The predictors can be either continuous or categorical.
\[
\log \left( \frac{p_i}{1-p_i} \right) = \beta_0 + \beta_1 \ x_{i1}
\]
In R:
glm(y ~ x1 + x2, data = data, family = binomial)
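As an illustration, here is a minimal sketch using simulated data (the variable names and values are invented for the example, not taken from the lab):
# Minimal sketch with simulated data (invented for illustration only).
set.seed(1)
n  <- 200
x1 <- rnorm(n)                                          # continuous predictor
x2 <- factor(sample(c("control", "treatment"), n, replace = TRUE))
p  <- plogis(-0.5 + 0.8 * x1 + 0.6 * (x2 == "treatment"))
y  <- rbinom(n, size = 1, prob = p)                     # binary response
dat <- data.frame(y, x1, x2)

mdl <- glm(y ~ x1 + x2, data = dat, family = binomial)
summary(mdl)        # coefficients are on the log-odds scale
exp(coef(mdl))      # exponentiate to obtain odds ratios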
Interpretation of Coefficients
To interpret the fitted coefficients, we first exponentiate the model: \[
\begin{aligned}
\log \left( \frac{p_x}{1-p_x} \right) &= \beta_0 + \beta_1 x \\
e^{ \log \left( \frac{p_x}{1-p_x} \right) } &= e^{\beta_0 + \beta_1 x } \\
\frac{p_x}{1-p_x} &= e^{\beta_0} \ e^{\beta_1 x}
\end{aligned}
\]
and recall that the probability of success divided by the probability of failure is the odds of success. \[
\frac{p_x}{1-p_x} = \text{odds}
\]
Let’s apply this to an example where we are trying to determine whether the probability of having senility symptoms changes as a function of Wechsler Adult Intelligence Scale (WAIS) score, where we fit the following model:
\[
\begin{aligned}
\text{Model}_{\text{Senility}}: \qquad \log \left( \frac{p}{1 - p}\right) = \beta_0 + \beta_1 \cdot \text{WAIS Score}
\end{aligned}
\]
Intercept
\[
\text{If }x = 0, \qquad \frac{p_0}{1-p_0} = e^{\beta_0} \ e^{\beta_1 x} = e^{\beta_0}
\]
That is, \(e^{\beta_0}\) represents the odds of having symptoms of senility for individuals with a WAIS score of 0.
In other words, for those with a WAIS score of 0, the probability of having senility symptoms is \(e^{\beta_0}\) times that of not having them.
Slope
\[
\text{If }x = 1, \qquad \frac{p_1}{1-p_1} = e^{\beta_0} \ e^{\beta_1 x} = e^{\beta_0} \ e^{\beta_1}
\]
Now consider taking the ratio of the odds of senility symptoms when \(x=1\) to the odds of senility symptoms when \(x = 0\):
\[
\frac{\text{odds}_{x=1}}{\text{odds}_{x=0}} = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)}
= \frac{e^{\beta_0} \ e^{\beta_1}}{e^{\beta_0}}
= e^{\beta_1}
\]
So, \(e^{\beta_1}\) represents the odds ratio for a 1 unit increase in WAIS score.
There are various ways of describing odds ratios:
- “for a 1 unit increase in \(x\), the odds of \(y\) change by a ratio of exp(b)”
- “for a 1 unit increase in \(x\), the odds of \(y\) are multiplied by exp(b)”
- “for a 1 unit increase in \(x\), there are exp(b) increased/decreased odds of \(y\)”
Whereas a coefficient of 0 indicates “no association” on the log-odds scale, on the odds ratio scale “no association” corresponds to OR = 1.
- OR = 1 : equal odds (\(1 \times \text{odds}\), the odds don’t change)
- OR < 1 : decreased odds (e.g. \(0.5 \times \text{odds}\), the odds are halved)
- OR > 1 : increased odds (e.g. \(2 \times \text{odds}\), the odds are doubled)
So in our example, we could interpret by saying that for a one-unit increase in WAIS score the odds of senility symptoms increase by a factor of \(e^{\beta_1}\).
Often you will hear people interpreting odds ratios as “\(y\) is exp(b) times as likely”.
Although it is true that increased odds is an increased likelihood of \(y\) occurring, double the odds does not mean you will see twice as many occurrences of \(y\) - i.e. it does not translate to doubling the probability.
Here’s a more detailed, step-by-step explanation:
| coefficient | b | exp(b) |
|---|---|---|
| (Intercept) | 3.76 | 43.13 |
| age | -0.62 | 0.54 |
- For children aged 2 years old the log-odds of them taking the marshmallow are 3.76 + -0.62*2 = 2.52.
- Translating this to odds, we exponentiate it, so the odds of them taking the marshmallow are \(e^{(3.76 + -0.62*2)} = e^{2.52} = 12.43\).
(This is the same as \(e^{3.76} \times e^{-0.62*2}\))
- Odds of 12.43 mean that (rounded to the nearest whole number) if we take 13 children aged 2 years, we would expect 12 of them to take the marshmallow and 1 to not take it.
- If we consider how the odds change for every extra year of age (i.e. for 3 year old as opposed to 2 year old children):
- the log-odds of taking the marshmallow decrease by -0.62.
- the odds of taking the marshmallow are multiplied by 0.54.
- so for a 3 year old child, the odds are \(12.43 \times 0.54 = 6.69\).
(And we can also calculate this as \(e^{2.52 + -0.62}\))
- So we have gone from odds of 12.4-to-1 for 2 year olds to 6.7-to-1 for 3 year olds. The odds have been multiplied by 0.54.
But those odds, when converted to probabilities, are 0.93 and 0.87 respectively. So \(0.54 \times \text{odds}\) is not the same as \(0.54 \times \text{probability}\).
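To make this concrete, here is a short sketch reproducing the arithmetic above in R (the coefficients are simply those from the table):
b0 <- 3.76                  # intercept: log-odds at age 0
b1 <- -0.62                 # slope for age: change in log-odds per year

odds_2 <- exp(b0 + b1 * 2)  # odds for a 2 year old, approx. 12.43
odds_3 <- exp(b0 + b1 * 3)  # odds for a 3 year old, approx. 6.69
odds_3 / odds_2             # odds ratio, approx. 0.54 = exp(b1)

odds_2 / (1 + odds_2)       # probability for a 2 year old, approx. 0.93
odds_3 / (1 + odds_3)       # probability for a 3 year old, approx. 0.87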
Because our intercept is at a single point (it’s not an association), we can actually convert this to a probability. Remember that \(odds = \frac{p}{1-p}\), which means that \(p = \frac{odds}{1 + odds}\). So the probability of taking the marshmallow for a child aged zero is \(\frac{43.13}{1 + 43.13} = 0.98\).
Unfortunately, we can’t do the same for any slope coefficients. This is because while the intercept is “odds”, the slopes are “odds ratios” (i.e. changes in odds), and changes in odds are different at different levels of probability.
Consider how, when we multiply the odds by 2, the increase in probability is not constant:

| Odds | Probability |
|---|---|
| 0.5 | \(\frac{0.5}{1+0.5} = 0.33\) |
| 1 | \(\frac{1}{1+1} = 0.5\) |
| 2 | \(\frac{2}{1+2} = 0.66\) |
| 4 | \(\frac{4}{1+4} = 0.8\) |
| 8 | \(\frac{8}{1+8} = 0.88\) |
In R:
To translate log-odds to odds in order to aid interpretation, we can exponentiate the coefficients from our model (i.e., using exp()) via the following command:
exp(coef(model))
We can also use R to extract predicted probabilities from our models.
- Calculate the predicted log-odds (probabilities on the logit scale): predict(model, type = "link")
- Calculate the predicted probabilities: predict(model, type = "response")
Generalized Linear Models
Generalized linear models can be fitted in R using the glm function, which is similar to the lm function for fitting linear models. However, we also need to specify the family (i.e., link) function. There are three key components to consider:
- Random component / probability distribution - The distribution of the response/outcome variable. Can be from any family of distributions as listed below.
- Systematic component / linear predictor - the explanatory/predictor variable(s) (can be continuous or discrete).
- Link function - specifies the link between the random and systematic components.
Formula:
\(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i\)
In R:
glm(y ~ x1 + x2, data = data, family = <INSERT_FAMILY>)
The family argument takes (the name of) a family function which specifies the link function and variance function (as well as a few other arguments not entirely relevant to the purpose of this course).
The exponential family functions available in R are:
- binomial (link = “logit”)
- gaussian (link = “identity”)
- poisson (link = “log”)
- Gamma (link = “inverse”)
- inverse.gaussian (link = “1/mu^2”)
See ?glm for other modeling options. See ?family for other allowable link functions for each family.
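For instance, here is a minimal sketch fitting a Poisson GLM to simulated count data (the variable names and values are invented for illustration):
# Minimal sketch with simulated count data (invented for illustration only).
set.seed(1)
n <- 100
x <- runif(n)
y <- rpois(n, lambda = exp(0.5 + 1.2 * x))   # counts generated on the log scale
dat <- data.frame(y, x)

mdl_pois <- glm(y ~ x, data = dat, family = poisson)
summary(mdl_pois)       # coefficients are on the log (link) scale
exp(coef(mdl_pois))     # multiplicative effects on the expected count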
Drop-in-Deviance Test to Compare Nested Models
When moving from linear regression to more advanced and flexible models, testing of goodness of fit is more often done by comparing a model of interest to a simpler one.
The only caveat is that the two models need to be nested, i.e. one model needs to be a simplification of the other, and all predictors of one model need to be contained in the other.
We want to compare the model we previously fitted against a model where all slopes are 0, i.e. a baseline model:
\[
\begin{aligned}
M_1 : \qquad\log \left( \frac{p}{1 - p} \right) &= \beta_0 \\
M_2 : \qquad \log \left( \frac{p}{1 - p} \right) &= \beta_0 + \beta_1 x
\end{aligned}
\]
In R: We do the comparison as follows:
mdl_reduced <- glm(DV ~ 1, family = "binomial", data = dataset)
mdl_main <- glm(DV ~ IV1 + IV2 ... + IV4, family = "binomial", data = dataset)
anova(mdl_reduced, mdl_main, test = 'Chisq')
In the output we can see the residual deviance of each model. Remember the deviance is the equivalent of residual sum of squares in linear regression.
Akaike and Bayesian Information Criteria
Deviance measures lack of fit, and it can be reduced to zero by making the model more and more complex, effectively estimating the value at each single data point. However, this involves adding more and more predictors, which makes the model more complex (and less interpretable).
Typically, the simpler model should be preferred when it can still explain the data almost as well. This is why information criteria were devised, exactly to account for both the model misfit but also its complexity.
\[
\text{Information Criterion} = \text{Deviance} + \text{Penalty for model complexity}
\]
Depending on the chosen penalty, you get different criteria. Two common ones are the Akaike and Bayesian Information Criteria, AIC and BIC respectively:
\[
\begin{aligned}
\text{AIC} &= \text{Deviance} + 2 p \\
\text{BIC} &= \text{Deviance} + p \log(n)
\end{aligned}
\]
where \(n\) is the sample size and \(p\) is the number of regression coefficients in the model. Models that produce smaller values of these fit criteria should be preferred.
AIC and BIC differ in their degrees of penalization for number of regression coefficients, with BIC usually favouring models with fewer terms.
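As a quick illustration, both criteria can be obtained directly in R for fitted models (here reusing the hypothetical models from the drop-in-deviance example above); smaller values indicate a better trade-off between fit and complexity:
AIC(mdl_reduced, mdl_main)   # Akaike Information Criterion for each model
BIC(mdl_reduced, mdl_main)   # Bayesian Information Criterion for each model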
Comparing lm() and glm() summary output