class: center, middle, inverse, title-slide

# Semester 2, Week 1: Bootstrapping and Confidence Intervals
## Data Analysis for Psychology in R 1
### Department of Psychology
The University of Edinburgh
### AY 2020-2021

---
# This Week's Learning Objectives

1. Understand how bootstrap resampling with replacement can be used to approximate a sampling distribution.
2. Understand how the bootstrap distribution can be used to construct a range of highly plausible values (a confidence interval).
3. Understand the link between simulation-based standard errors and theory-based standard errors.

---
class: inverse, center, middle

# Part 1
## Bootstrapping

---
# Samples

<center>
<img src="jk_img_sandbox/statistical_inference.png" width="600" height="500" />
</center>

---
# Good Samples

- If a sample of `\(n\)` is drawn at **random**, it will be unbiased and representative of `\(N\)`.
- Point estimates from such samples will be good estimates of the population parameter.
- Without the need for a census.

![](jk_img_sandbox/sampling_bias.png)

---
# Recap on sampling distributions

.pull-left[
- We have a population.
- We take a sample of size `\(n\)` from it, and calculate our statistic.
- The statistic is our estimate of the population parameter.
- We do this repeatedly, and we can construct a sampling distribution.
- The mean of the sampling distribution will be a good approximation to the population parameter.
- To quantify sampling variation we can refer to the standard deviation of the sampling distribution (the **standard error**).
]

.pull-right[
{{content}}
]

--

+ University students
{{content}}

--

+ We take a sample of 30 students, calculate the mean height.
{{content}}

--

+ This is our estimate of the mean height of all university students.
{{content}}

--

+ Do this repeatedly (take another sample of 30, calculate mean height).
{{content}}

--

+ The mean of these sample means will be a good approximation of the population mean.
{{content}}

--

+ To quantify sampling variation in mean heights of 30 students, we can refer to the standard deviation of these sample means.
{{content}}

---
# Practical problem:

.pull-left[
![](dapR1_lec11_Bootstrap_CIs_files/figure-html/unnamed-chunk-2-1.png)<!-- -->
]

.pull-right[
- This process allows us to get an estimate of the sampling variability, **but is this realistic?**
- Can I really go out and collect 500 samples of 30 students from the population?
- Probably not...
{{content}}
]

--

- So how else can I get a sense of the variability in my sample estimates?

---
.pull-left[
## Solution 1
### Theoretical

- Collect one sample.
- Estimate the Standard Error using the formula: <br> `\(\text{SE} = \frac{\sigma}{\sqrt{n}}\)`
]

.pull-right[
## Solution 2
### Bootstrap

- Collect one sample.
- Mimic the act of repeated sampling from the population by repeated **resampling with replacement** from the original sample.
- Estimate the standard error using the standard deviation of the distribution of **resample** statistics.
]
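---
# Solution 1: a quick sketch in R

Before building up Solution 2 step by step, here is a minimal sketch of Solution 1. The `heights` vector below is made up purely for illustration (it is not data used in this lecture); we plug the sample SD into the formula as our estimate of `\(\sigma\)`.

```r
# Hypothetical sample: 30 student heights (in cm), invented for illustration
set.seed(123)
heights <- rnorm(30, mean = 170, sd = 10)

# Theoretical SE: sample SD divided by the square root of n
sd(heights) / sqrt(length(heights))
```

The next slides build up Solution 2, the bootstrap, one resample at a time.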
---
# Resampling 1: The sample

.pull-left[
<img src="jk_img_sandbox/sample.png" width="350" />

Suppose I am interested in the mean age of all characters in The Simpsons, and I have collected a sample of `\(n=10\)`.
{{content}}
]

--

+ The mean age of my sample is 44.9.

--

.pull-right[

```
## # A tibble: 10 x 2
##    name                 age
##    <chr>              <dbl>
##  1 Homer Simpson         39
##  2 Ned Flanders          60
##  3 Chief Wiggum          43
##  4 Milhouse              10
##  5 Patty Bouvier         43
##  6 Janey Powell           8
##  7 Montgomery Burns     104
##  8 Sherri Mackleberry    10
##  9 Krusty the Clown      52
## 10 Jacqueline Bouvier    80
```

```r
simpsons_sample %>%
  summarise(mean_age = mean(age))
```

```
## # A tibble: 1 x 1
##   mean_age
##      <dbl>
## 1     44.9
```
]

---
# Resampling 2: The **re**sample

.pull-left[
I randomly draw out one person from my original sample, note the value of interest, and then put that person "back in the pool" (i.e. I sample with replacement).
<br>
<br>
<img src="jk_img_sandbox/resample1.png" width="350" />
]

.pull-right[

```
## # A tibble: 1 x 2
##   name           age
##   <chr>        <dbl>
## 1 Chief Wiggum    43
```
]

---
# Resampling 2: The **re**sample

.pull-left[
Again, I draw one person at random, note the value of interest, and replace them.
<br>
<br>
<img src="jk_img_sandbox/resample2.png" width="350" />
]

.pull-right[

```
## # A tibble: 2 x 2
##   name           age
##   <chr>        <dbl>
## 1 Chief Wiggum    43
## 2 Ned Flanders    60
```
]

---
# Resampling 2: The **re**sample

.pull-left[
And again...
<br>
<br>
<br>
<img src="jk_img_sandbox/resample3.png" width="350" />
]

.pull-right[

```
## # A tibble: 3 x 2
##   name           age
##   <chr>        <dbl>
## 1 Chief Wiggum    43
## 2 Ned Flanders    60
## 3 Janey Powell     8
```
]

---
# Resampling 2: The **re**sample

.pull-left[
And again...
<br>
<br>
<br>
<img src="jk_img_sandbox/resample4.png" width="350" />
]

.pull-right[

```
## # A tibble: 4 x 2
##   name           age
##   <chr>        <dbl>
## 1 Chief Wiggum    43
## 2 Ned Flanders    60
## 3 Janey Powell     8
## 4 Ned Flanders    60
```
]

---
# Resampling 2: The **re**sample

.pull-left[
Repeat until I have the same number as my original sample (`\(n = 10\)`).
<br>
<br>
<img src="jk_img_sandbox/resample.png" width="350" />
{{content}}
]

.pull-right[

```
## # A tibble: 10 x 2
##    name                 age
##    <chr>              <dbl>
##  1 Chief Wiggum          43
##  2 Ned Flanders          60
##  3 Janey Powell           8
##  4 Ned Flanders          60
##  5 Patty Bouvier         43
##  6 Chief Wiggum          43
##  7 Jacqueline Bouvier    80
##  8 Montgomery Burns     104
##  9 Sherri Mackleberry    10
## 10 Patty Bouvier         43
```
]

--

- This is known as a **resample**.
{{content}}

--

- The mean age of the resample is 49.4.

---
# Bootstrapping: Resample<sub>1</sub>, ..., Resample<sub>k</sub>

- If I repeat this whole process many times, say `\(k = 1000\)`, I will have 1000 means from 1000 resamples.
- Note these are entirely derived from the original sample.
- This is known as **bootstrapping**, and the resultant distribution of statistics (in our example, the distribution of 1000 resample means) is known as a **bootstrap distribution**.

<div style="border-radius: 5px; padding: 20px 20px 10px 20px; margin-top: 20px; margin-bottom: 20px; background-color:#fcf8e3 !important;">
**Bootstrapping** The process of resampling *with replacement* from the original data to generate multiple resamples of the same `\(n\)` as the original data.
</div>

---
# Bootstrap distribution

- Start with an initial sample of size `\(n\)`.
- Take `\(k\)` resamples (sampling with replacement) of size `\(n\)`, and calculate your statistic on each one.
- As `\(k\to\infty\)`, the distribution of the `\(k\)` resample statistics begins to approximate the sampling distribution.
- Note, this is just the same exercise as we did with samples from the population in previous weeks.
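---
# Building the bootstrap distribution: a sketch

A minimal base-R sketch of the recipe above, using the ten ages from our Simpsons sample. This is just one way to do it; the worked example later in the lecture uses the `rep_sample_n()` function instead. The `set.seed()` call is only there so the sketch is reproducible.

```r
# The ten ages from the original sample
ages <- c(39, 60, 43, 10, 43, 8, 104, 10, 52, 80)

# k = 1000 resamples of size n = 10, drawn with replacement,
# keeping the mean of each resample
set.seed(42)
boot_means <- replicate(1000, mean(sample(ages, size = 10, replace = TRUE)))

hist(boot_means)  # the bootstrap distribution
sd(boot_means)    # its SD is the bootstrap standard error (next slides)
```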
---
# k = 20, 50, 200, 2000, ...

![](dapR1_lec11_Bootstrap_CIs_files/figure-html/unnamed-chunk-17-1.png)<!-- -->

---
# Bootstrap Standard Error

- Previously we spoke about the standard error as the measure of sampling variability.

--

- We stated that this was just the SD of the sampling distribution.

--

- In the same vein, we can calculate a bootstrap standard error: the SD of the bootstrap distribution.

![](dapR1_lec11_Bootstrap_CIs_files/figure-html/unnamed-chunk-18-1.png)<!-- -->

---
class: inverse, center, middle, animated, rotateInDownLeft

# End of Part 1

---
class: inverse, center, middle

# Part 2
## Confidence Intervals

---
# Confidence interval

- Remember, usually we do not know the value of a population parameter.
- We are trying to estimate this from our data.

--

- A confidence interval defines a plausible range of values for our population parameter.
- To estimate one we need:
  - A **confidence level**
  - A measure of sampling variability (e.g. SE/bootstrap SE).

---
# Confidence interval & level

<div style="border-radius: 5px; padding: 20px 20px 10px 20px; margin-top: 20px; margin-bottom: 20px; background-color:#fcf8e3 !important;">
**x% Confidence interval** Across repeated samples, [x]% of confidence intervals would be expected to contain the true population parameter value.
</div>

x% is the *confidence level*. Commonly, you will use and read about **95%** confidence intervals. If we were to take 100 samples and calculate a 95% CI on each of them, approximately 95 of them would contain the true population mean.

--

- What are we 95% confident *in?*
  - We are 95% confident that our interval [lower, upper] contains the true population mean.
  - This is subtly different from saying that we are 95% confident that the true mean is inside our interval. The 95% probability is related to the long-run frequencies of our intervals.

---
# Simple Visualization

.pull-left[
![](dapR1_lec11_Bootstrap_CIs_files/figure-html/unnamed-chunk-19-1.png)<!-- -->
]

.pull-right[
- The confidence interval works outwards from the centre.
- As such, it "cuts off" the tails.
  - E.g. the most extreme estimates will not fall within the interval.
]

---
# Calculating CI

- We want to identify the upper and lower bounds of the interval (i.e. the red lines on the previous slide).
- These need to be positioned so that 95% of all possible sample mean estimates fall within the bounds.

---
# Calculating CI: 68/95/99 Rule

- Remember that sampling distributions become approximately normal...
- There are fixed properties of normal distributions.

--

- Specifically:
  - 68% of density falls within 1 SD of the mean
  - 95% of density falls within 1.96 SD of the mean
  - 99.7% of density falls within 3 SD of the mean
- Remember, the standard error = the SD of the bootstrap (or sampling) distribution...
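---
# Where does 1.96 come from?

A quick check of these numbers using R's standard normal distribution functions (this check is an aside, not part of the lecture code):

```r
# Proportion of a normal distribution within 1, 1.96 and 3 SDs of the mean
pnorm(1) - pnorm(-1)        # ~0.68
pnorm(1.96) - pnorm(-1.96)  # ~0.95
pnorm(3) - pnorm(-3)        # ~0.997

# 1.96 is the value leaving 2.5% in each tail, so 95% sits in the middle
qnorm(0.975)
```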
---
# Calculating CI

- ... the bounds of the 95% CI for a mean are:

$$
\text{Lower Bound} = \text{mean} - 1.96 \cdot \text{SE}
$$

$$
\text{Upper Bound} = \text{mean} + 1.96 \cdot \text{SE}
$$

---
# Calculating CI - Example

```r
library(tidyverse) # for read_csv() and %>%
simpsons_sample <- read_csv("https://uoepsy.github.io/data/simpsons_sample.csv")
mean(simpsons_sample$age)
```

```
## [1] 44.9
```

---
# Calculating CI - Example

```r
# Theoretical approach
mean(simpsons_sample$age) - 1.96*(sd(simpsons_sample$age)/sqrt(10))
```

```
## [1] 25.47
```

```r
mean(simpsons_sample$age) + 1.96*(sd(simpsons_sample$age)/sqrt(10))
```

```
## [1] 64.33
```

---
# Calculating CI - Example

```r
# Bootstrap approach
source('https://uoepsy.github.io/files/rep_sample_n.R')

resamples2000 <- rep_sample_n(simpsons_sample, n = 10, samples = 2000, replace = TRUE)

bootstrap_dist <- resamples2000 %>%
  group_by(sample) %>%
  summarise(resamplemean = mean(age))

sd(bootstrap_dist$resamplemean)
```

```
## [1] 9.334
```

```r
mean(simpsons_sample$age) - 1.96*sd(bootstrap_dist$resamplemean)
```

```
## [1] 26.6
```

```r
mean(simpsons_sample$age) + 1.96*sd(bootstrap_dist$resamplemean)
```

```
## [1] 63.2
```

---
# Sampling Distributions and CIs are not just for means.

For a 95% Confidence Interval around a statistic:

$$
\text{Lower Bound} = \text{statistic} - 1.96 \cdot \text{SE}
$$

$$
\text{Upper Bound} = \text{statistic} + 1.96 \cdot \text{SE}
$$

---
# Summary

- Good samples are representative, random, and unbiased.
- Bootstrap resampling is a tool to construct a *bootstrap distribution* of any statistic which, with sufficient resamples, will approximate the *sampling distribution* of the statistic.
- Confidence intervals are a tool for considering the plausible values for an unknown population parameter.
- We can use the bootstrap SE to calculate a CI.

---
class: inverse, center, middle, animated, rotateInDownLeft

# End
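---
# Extra: bootstrapping a statistic other than the mean

An optional extra, not covered in the lecture itself: a minimal sketch applying the same recipe to the *median* of the ten sampled ages, using base R's `sample()` and `replicate()` rather than `rep_sample_n()`.

```r
ages <- c(39, 60, 43, 10, 43, 8, 104, 10, 52, 80)  # the original sample

# Bootstrap distribution of the median
set.seed(42)
boot_medians <- replicate(2000, median(sample(ages, size = 10, replace = TRUE)))

# 95% CI: statistic +/- 1.96 * bootstrap SE
median(ages) - 1.96 * sd(boot_medians)
median(ages) + 1.96 * sd(boot_medians)
```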