class: center, middle, inverse, title-slide

.title[
# Bootstrapping
]
.subtitle[
## Data Analysis for Psychology in R 2
]
.author[
### dapR2 Team
]
.institute[
### Department of Psychology
The University of Edinburgh
]

---
# Week's Learning Objectives

1. Understand the principles of bootstrapping.
2. Understand the bootstrap distribution.
3. Understand the application of confidence intervals within bootstrapping.
4. Apply the bootstrap confidence interval to inference in linear models.

---
# Assumption violations

+ Assumption violations make it difficult to draw conclusions from our linear models.
  + It may mean our estimates are bad.
  + Or it may mean our inferences are poor.

+ Violations can have many sources.

---
# Model misspecification

+ Sometimes assumptions appear violated because our model is not correct.

+ Typically we have:
  + Failed to include an interaction
  + Failed to include a non-linear (higher order) effect

+ Usually detected by observing violations of linearity or normality of residuals.

+ Solved by including the missing terms in our linear model.

---
# Non-linear transformations

+ Another approach is a non-linear transformation of the outcome and/or predictors.

+ Often related to non-normal residuals, heteroscedasticity and non-linearity.

+ This involves applying a function to the values of a variable.
  + This changes the values and overall shape of the distribution.

+ For non-normal residuals and heteroscedasticity, skewed outcomes can be transformed to normality.

+ Non-linearity may be helped by a transformation of both predictors and outcomes.

---
# Generalised linear model

+ All the models we have been discussing are suitable for continuous outcome variables.

+ Sometimes our outcomes are not continuous or normally distributed, not because of an error in measurement, but because they would not be expected to be.
  + E.g. reaction times, counts, binary variables.

+ For such data, we need a slightly different version of a linear model.
  + More on this to come later in the course.

---
# Bootstrapped inference

+ One of the concerns when we have violated assumptions is that we make poor inferences.

+ This is because, with violated assumptions, the building blocks of our inferences may be unreliable.

+ Bootstrapping as a tool can help us here.

+ We will cover this in detail later in the course.

---
class: inverse, center, middle

# Part 1
## Bootstrapping

---
# Samples

<center>
<img src="jk_img_sandbox/statistical_inference.png" width="600" height="500" />
</center>

---
# Good Samples

- If a sample of `\(n\)` is drawn at **random**, it will be unbiased and representative of `\(N\)`.
- Point estimates from such samples will be good estimates of the population parameter.
  - Without the need for a census.

![](jk_img_sandbox/sampling_bias.png)

---
# Recap on sampling distributions

.pull-left[
- We have a population.
- We take a sample of size `\(n\)` from it, and calculate our statistic.
- The statistic is our estimate of the population parameter.
- We do this repeatedly, and we can construct a sampling distribution.
- The mean of the sampling distribution will be a good approximation to the population parameter.
- To quantify sampling variation we can refer to the standard deviation of the sampling distribution (the **standard error**).
]
.pull-right[
{{content}}
]

--

+ University students
{{content}}

--

+ We take a sample of 30 students, and calculate the mean height.
{{content}}

--

+ This is our estimate of the mean height of all university students.
{{content}}

--

+ Do this repeatedly (take another sample of 30, calculate mean height).
{{content}}

--

+ The mean of these sample means will be a good approximation of the population mean.
{{content}}

--

+ To quantify sampling variation in mean heights of 30 students, we can refer to the standard deviation of these sample means.
{{content}}
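---
# Recap on sampling distributions

To make the idea concrete, this whole process can be simulated in a few lines of R. A minimal sketch, using a hypothetical population of heights (the numbers here are made up for illustration):

```r
# hypothetical population: 10,000 university students' heights (in cm)
set.seed(42)
population <- rnorm(10000, mean = 170, sd = 10)

# take 500 samples of n = 30, and calculate the mean of each
sample_means <- replicate(500, mean(sample(population, size = 30)))

mean(sample_means) # approximates the population mean
sd(sample_means)   # the standard error of the mean
```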
---
# Practical problem:

.pull-left[
![](dapr2_15_BootstrapLM_files/figure-html/unnamed-chunk-2-1.png)<!-- -->
]
.pull-right[
- This process allows us to get an estimate of the sampling variability, **but is this realistic?**
- Can I really go out and collect 500 samples of 30 students from the population?
- Probably not...
{{content}}
]

--

- So how else can I get a sense of the variability in my sample estimates?

---
.pull-left[
## Solution 1
### Theoretical

- Collect one sample.
- Estimate the standard error using the formula:

<br>
`\(\text{SE} = \frac{\sigma}{\sqrt{n}}\)`
]
.pull-right[
## Solution 2
### Bootstrap

- Collect one sample.
- Mimic the act of repeated sampling from the population by repeated **resampling with replacement** from the original sample.
- Estimate the standard error using the standard deviation of the distribution of **resample** statistics.
]

---
# Resampling 1: The sample

.pull-left[
<img src="jk_img_sandbox/sample.png" width="350" />

Suppose I am interested in the mean age of all characters in The Simpsons, and I have collected a sample of `\(n=10\)`.
{{content}}
]

--

+ The mean age of my sample is 44.9.

--

.pull-right[
```
## # A tibble: 10 × 2
##    name                 age
##    <chr>              <dbl>
##  1 Homer Simpson         39
##  2 Ned Flanders          60
##  3 Chief Wiggum          43
##  4 Milhouse              10
##  5 Patty Bouvier         43
##  6 Janey Powell           8
##  7 Montgomery Burns     104
##  8 Sherri Mackleberry    10
##  9 Krusty the Clown      52
## 10 Jacqueline Bouvier    80
```

```r
simpsons_sample %>%
  summarise(mean_age = mean(age))
```

```
## # A tibble: 1 × 1
##   mean_age
##      <dbl>
## 1     44.9
```
]

---
# Resampling 2: The **re**sample

.pull-left[
I randomly draw out one person from my original sample, I note the value of interest, and then I put that person "back in the pool" (i.e. I sample with replacement).

<br>
<br>
<img src="jk_img_sandbox/resample1.png" width="350" />
]
.pull-right[
```
## # A tibble: 1 × 2
##   name           age
##   <chr>        <dbl>
## 1 Chief Wiggum    43
```
]

---
# Resampling 2: The **re**sample

.pull-left[
Again, I draw one person at random, note the value of interest, and replace them.

<br>
<br>
<img src="jk_img_sandbox/resample2.png" width="350" />
]
.pull-right[
```
## # A tibble: 2 × 2
##   name           age
##   <chr>        <dbl>
## 1 Chief Wiggum    43
## 2 Ned Flanders    60
```
]

---
# Resampling 2: The **re**sample

.pull-left[
And again...

<br>
<br>
<br>
<img src="jk_img_sandbox/resample3.png" width="350" />
]
.pull-right[
```
## # A tibble: 3 × 2
##   name           age
##   <chr>        <dbl>
## 1 Chief Wiggum    43
## 2 Ned Flanders    60
## 3 Janey Powell     8
```
]

---
# Resampling 2: The **re**sample

.pull-left[
And again...

<br>
<br>
<br>
<img src="jk_img_sandbox/resample4.png" width="350" />
]
.pull-right[
```
## # A tibble: 4 × 2
##   name           age
##   <chr>        <dbl>
## 1 Chief Wiggum    43
## 2 Ned Flanders    60
## 3 Janey Powell     8
## 4 Ned Flanders    60
```
]

---
# Resampling 2: The **re**sample

.pull-left[
Repeat until I have the same number of observations as my original sample ( `\(n = 10\)` ).

<br>
<br>
<img src="jk_img_sandbox/resample.png" width="350" />
{{content}}
]
.pull-right[
```
## # A tibble: 10 × 2
##    name                 age
##    <chr>              <dbl>
##  1 Chief Wiggum          43
##  2 Ned Flanders          60
##  3 Janey Powell           8
##  4 Ned Flanders          60
##  5 Patty Bouvier         43
##  6 Chief Wiggum          43
##  7 Jacqueline Bouvier    80
##  8 Montgomery Burns     104
##  9 Sherri Mackleberry    10
## 10 Patty Bouvier         43
```
]

--

- This is known as a **resample**.
{{content}}

--

- The mean age of the resample is 49.4.

---
# Bootstrapping: Resample<sub>1</sub>, ..., Resample<sub>k</sub>

- If I repeat this whole process many times, say `\(k=1000\)`, I will have 1000 means from 1000 resamples.
- Note these are entirely derived from the original sample.
- This is known as **bootstrapping**, and the resultant distribution of statistics (in our example, the distribution of 1000 resample means) is known as a **bootstrap distribution**.

<div style="border-radius: 5px; padding: 20px 20px 10px 20px; margin-top: 20px; margin-bottom: 20px; background-color:#fcf8e3 !important;">
**Bootstrapping** The process of resampling *with replacement* from the original data to generate multiple resamples of the same `\(n\)` as the original data.
</div>

---
# Bootstrap distribution

- Start with an initial sample of size `\(n\)`.

- Take `\(k\)` resamples (sampling with replacement) of size `\(n\)`, and calculate your statistic on each one.

- As `\(k\to\infty\)`, the distribution of the `\(k\)` resample statistics begins to approximate the sampling distribution.
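---
# Bootstrap distribution

As a rough illustration, here is that procedure in a few lines of base R. This is a sketch only, assuming the `simpsons_sample` data from the earlier slides:

```r
# 1000 resample means, each from a resample of n = 10 drawn with replacement
set.seed(123)
boot_means <- replicate(1000, {
  resample <- sample(simpsons_sample$age, size = 10, replace = TRUE)
  mean(resample)
})

hist(boot_means) # the bootstrap distribution of the mean
```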
---
# k = 20, 50, 200, 2000, ...

![](dapr2_15_BootstrapLM_files/figure-html/unnamed-chunk-17-1.png)<!-- -->

---
# Bootstrap Standard Error

- We have seen the standard error in our linear models, and have discussed it as the measure of sampling variability.

--

- It is the SD of the sampling distribution.

--

- In the same vein, we can calculate a bootstrap standard error:

--

- the SD of the bootstrap distribution.

--

![](dapr2_15_BootstrapLM_files/figure-html/unnamed-chunk-18-1.png)<!-- -->

---
class: inverse, center, middle, animated, rotateInDownLeft

# End of Part 1

---
class: inverse, center, middle

# Part 2
## Confidence Intervals

---
# Confidence interval

- Remember, usually we do not know the value of a population parameter.
  - We are trying to estimate this from our data.

--

- A confidence interval defines a plausible range of values for our population parameter.
- To estimate one we need:
  - A **confidence level**
  - A measure of sampling variability (e.g. SE/bootstrap SE).

---
# Confidence interval & level

<div style="border-radius: 5px; padding: 20px 20px 10px 20px; margin-top: 20px; margin-bottom: 20px; background-color:#fcf8e3 !important;">
**x% Confidence interval:** across repeated samples, x% of confidence intervals would be expected to contain the true population parameter value.</div>

x% is the *confidence level*. Commonly, you will use and read about **95%** confidence intervals. If we were to take 100 samples, and calculate a 95% CI on each of them, approximately 95 of them would contain the true population mean.

--

- What are we 95% confident *in?*
  - We are 95% confident that our interval [lower, upper] contains the true population mean.
  - This is subtly different from saying that we are 95% confident that the true mean is inside our interval. The 95% probability is related to the long-run frequencies of our intervals.
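---
# Confidence interval & level

This long-run idea can be checked by simulation. A minimal sketch, using a hypothetical population with a known mean of 170:

```r
# draw 100 samples of n = 30, compute a 95% CI for each, and count
# how many of the 100 intervals contain the true mean of 170
set.seed(456)
covered <- replicate(100, {
  s <- rnorm(30, mean = 170, sd = 10)
  lower <- mean(s) - 1.96 * sd(s) / sqrt(30)
  upper <- mean(s) + 1.96 * sd(s) / sqrt(30)
  lower < 170 & 170 < upper
})

mean(covered) # should be roughly 0.95
```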
---
# Simple Visualisation

.pull-left[
![](dapr2_15_BootstrapLM_files/figure-html/unnamed-chunk-19-1.png)<!-- -->
]
.pull-right[
- The confidence interval works outwards from the centre.
- As such, it "cuts off" the tails.
  - E.g. the most extreme estimates will not fall within the interval.
]

---
# Calculating CI

- We want to identify the upper and lower bounds of the interval (i.e. the red lines from the previous slide).

- These need to be positioned so that 95% of all possible sample mean estimates fall within the bounds.

---
# Calculating CI: 68/95/99.7 Rule

- Remember that sampling distributions become normal...
- There are fixed properties of normal distributions.

--

- Specifically:
  - 68% of density falls within 1 SD of the mean
  - 95% of density falls within 1.96 SD of the mean
  - 99.7% of density falls within 3 SD of the mean

- Remember, the standard error is the SD of the bootstrap (or sampling) distribution...

---
# Calculating CI

- ... so the bounds of the 95% CI for a mean are:

$$
\text{Lower Bound} = \text{mean} - 1.96 \cdot \text{SE}
$$

$$
\text{Upper Bound} = \text{mean} + 1.96 \cdot \text{SE}
$$

---
# Calculating CI - Example

```r
simpsons_sample <- read_csv("https://uoepsy.github.io/data/simpsons_sample.csv")
mean(simpsons_sample$age)
```

```
## [1] 44.9
```

---
# Calculating CI - Example

```r
# Theoretical approach
mean(simpsons_sample$age) - 1.96*(sd(simpsons_sample$age)/sqrt(10))
```

```
## [1] 25.47
```

```r
mean(simpsons_sample$age) + 1.96*(sd(simpsons_sample$age)/sqrt(10))
```

```
## [1] 64.33
```

---
# Calculating CI - Example

```r
# Bootstrap approach
source('https://uoepsy.github.io/files/rep_sample_n.R')
resamples2000 <- rep_sample_n(simpsons_sample, n = 10, samples = 2000, replace = TRUE)
bootstrap_dist <- resamples2000 %>%
  group_by(sample) %>%
  summarise(resamplemean = mean(age))
sd(bootstrap_dist$resamplemean)
```

```
## [1] 9.445
```

```r
mean(simpsons_sample$age) - 1.96*sd(bootstrap_dist$resamplemean)
```

```
## [1] 26.39
```

```r
mean(simpsons_sample$age) + 1.96*sd(bootstrap_dist$resamplemean)
```

```
## [1] 63.41
```

---
# Sampling Distributions and CIs are not just for means

For a 95% confidence interval around any statistic:

$$
\text{Lower Bound} = \text{statistic} - 1.96 \cdot \text{SE}
$$

$$
\text{Upper Bound} = \text{statistic} + 1.96 \cdot \text{SE}
$$

---
class: inverse, center, middle, animated, rotateInDownLeft

# End of Part 2

---
class: inverse, center, middle

# Part 3
## Bootstrapping linear models

---
# Bootstrapping a linear model

- We have looked at bootstrapping of the mean.

- But we can compute a bootstrap distribution of any statistic, so it is a straightforward extension to linear models.

- We can calculate `\(\beta\)` coefficients, `\(R^2\)`, `\(F\)`-statistics etc.

- In each case we:
  - generate a resample
  - run the linear model
  - save the statistic of interest
  - repeat this `\(K\)` times
  - generate the distribution of `\(K\)` statistics of interest.

---
# `Boot` in `car`

- The primary package in R for bootstrapping is called `boot`, but it is moderately complicated to use.

- Thankfully there is an easier-to-use wrapper in the `car` package called `Boot`.
  - Note the capital letter.

```r
library(car)
?Boot
```

---
# `Boot` in `car`

- `Boot` takes the following arguments:

1. Your fitted model.
2. `f`, saying which bootstrap statistics to compute on each bootstrap sample.
    - By default `f = coef`, returning the regression coefficients.
3. `R`, saying how many bootstrap samples to compute.
    - By default `R = 999`.
4. `ncores`, saying whether to perform the calculations in parallel (and more efficiently).
    - By default the function uses `ncores = 1`.
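---
# `Boot` in `car`

Conceptually, `Boot` automates the resample-fit-save loop described on the "Bootstrapping a linear model" slide. A hand-rolled sketch of that loop, assuming the `tib1` data defined on the next slide:

```r
# 1000 bootstrap estimates of the slope of weight on height, "by hand"
set.seed(789)
boot_slopes <- replicate(1000, {
  resample <- tib1[sample(nrow(tib1), replace = TRUE), ] # resample rows
  coef(lm(weight ~ height, data = resample))["height"]   # save the slope
})

sd(boot_slopes) # the bootstrap standard error of the slope
```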
---
# Applying bootstrap

- Step 1. Run the model

```r
tib1 <- tibble(
  name = as_factor(c("John", "Peter", "Robert", "David", "George", "Matthew", "Bradley")),
  height = c(1.52, 1.60, 1.68, 1.78, 1.86, 1.94, 2.09),
  weight = c(54, 49, 50, 67, 70, 110, 98)
)
m1 <- lm(weight ~ height, data = tib1)
```

- Step 2. Load `car`

```r
library(car)
```

- Step 3. Run `Boot`

```r
boot_m1 <- Boot(m1, R = 1000)
```

---
# Applying bootstrap

- Step 4. See the summary results

```r
summary(boot_m1)
```

```
## 
## Number of bootstrap replications R = 1000 
##             original bootBias bootSE bootMed
## (Intercept)     -116    -7.00   59.8    -115
## height           105     3.67   34.1     105
```

---
# Applying bootstrap

- Step 5. Calculate the confidence interval

```r
Confint(boot_m1, type = "perc")
```

```
## Bootstrap percent confidence intervals
## 
##             Estimate   2.5 % 97.5 %
## (Intercept)     -116 -265.86 -21.86
## height           105   47.94 189.31
```

---
# Interpreting the results

- Currently, the intercept makes very little sense:
  - The average expected value of weight when height is equal to zero is -116 kg.

- Neither does the slope:
  - For every metre increase in height, weight increases by 105 kg.

- Let's re-scale `height` to be in centimetres, mean centre it, and re-run.

```r
tib1 <- tib1 %>%
  mutate(
    heightcm = height*100
  )
m2 <- lm(weight ~ scale(heightcm, scale=F), data = tib1)
boot_m2 <- Boot(m2, R = 1000)
Confint(boot_m2, type = "perc")
```

```
## Bootstrap percent confidence intervals
## 
##                            Estimate   2.5 % 97.5 %
## (Intercept)                   71.14 62.7546 81.090
## scale(heightcm, scale = F)     1.05  0.5378  1.887
```

---
# Interpreting the results

```r
resCI <- Confint(boot_m2, type = "perc")
resCI
```

```
## Bootstrap percent confidence intervals
## 
##                            Estimate   2.5 % 97.5 %
## (Intercept)                   71.14 62.7546 81.090
## scale(heightcm, scale = F)     1.05  0.5378  1.887
```

- The average expected weight of participants of average height (178cm) is 71.1 kg.
- For every centimetre increase in height, there is a 1.05 kg increase in weight. The 95% CI [0.54, 1.89] does not include 0, and as such we can reject the null at `\(\alpha = 0.05\)`.

---
# Summary

- Good samples are representative, random and unbiased.

- Bootstrap resampling is a tool to construct a *bootstrap distribution* of any statistic which, with sufficient resamples, will approximate the *sampling distribution* of that statistic.

- Confidence intervals are a tool for considering the plausible values of an unknown population parameter.

- We can use the bootstrap SE to calculate CIs, and using bootstrap CIs is a plausible approach when we have assumption violations, and other issues, in our linear models.

- We have seen how to apply this for `lm` using `Boot` from `car`.

---
class: inverse, center, middle, animated, rotateInDownLeft

# End