class: center, middle, inverse, title-slide

# Bootstrapping Linear Models
## Data Analysis for Psychology in R 2
### dapR2 Team
### Department of Psychology
### The University of Edinburgh

---
# Week's Learning Objectives
1. Recap the principles of bootstrapping.
2. Recap the concept of the bootstrap distribution.
3. Recap confidence intervals.
4. Apply the bootstrap confidence interval to inference in linear models.

---
# Topics for this week
1. Bootstrapping theory (recap)
2. Confidence intervals (recap)
3. Why is this useful for linear models?
4. Applying bootstrap inference to linear models

---
# Inference with assumption violations
- To make inferences about our statistics, we need a known sampling distribution under the null.
- If we have this, we can use our normal tools of inference.

--

- But these sampling distributions are only accurate when model assumptions are met.

--

- If they are not, we are in a tricky position.
  - We cannot trust our estimates or inferences from these models.

--

- So what can we do?

---
# Bootstrapping as an alternative
- We saw that we can compute a confidence interval using bootstrap methods.
- And we know we can use confidence intervals to make inferences:
  - Does the CI include 0?
- The key difference with the bootstrapping procedure is that it makes no distributional assumptions.
  - The bootstrap distribution is drawn from our sample.
  - And the sample can have any distribution it likes.
- Bootstrapping is also useful with small sample sizes.
  - The central limit theorem is a great thing, but when `\(n\)` is small (< 20), we may be best not to rely on it.

---
# Bootstrapping a linear model
- Last time we looked at bootstrapping the mean.
- But we can compute a bootstrap distribution of any statistic.
- As a result, it is a straightforward extension to linear models.
- We can calculate `\(\beta\)` coefficients, `\(R^2\)`, `\(F\)`-statistics etc.
- In each case we:
  - generate a resample,
  - run the linear model,
  - save the statistic of interest,
  - repeat this `\(K\)` times,
  - generate the distribution of `\(K\)` statistics of interest.
- A loop sketch of this procedure follows the toy example below.

---
# Toy example
- Remember the height and weight data from week 2 (probably not!)

```r
tib1 <- tibble(
  name = as_factor(c("John", "Peter", "Robert", "David", "George", "Matthew", "Bradley")),
  height = c(1.52, 1.60, 1.68, 1.78, 1.86, 1.94, 2.09),
  weight = c(54, 49, 50, 67, 70, 110, 98)
)
slice(tib1, 1:3)
```

```
## # A tibble: 3 x 3
##   name   height weight
##   <fct>   <dbl>  <dbl>
## 1 John     1.52     54
## 2 Peter    1.6      49
## 3 Robert   1.68     50
```

- Let's draw 3 resamples, and run the linear model on each, predicting `weight` from `height`.
- Although a toy example, this is a small sample size, so bootstrapping is useful.
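---
# The resampling loop, by hand
- The next three slides carry out single resamples manually; the full `\(K\)`-times procedure described earlier amounts to a loop.
- A minimal sketch, assuming `tib1` from the previous slide; `K`, the seed, and the name `boot_slopes` are illustrative choices, not part of the course code.

```r
set.seed(104)                  # illustrative seed
K <- 1000
boot_slopes <- numeric(K)
for (k in 1:K) {
  # generate a resample: draw rows of tib1 with replacement, same n as the original
  resample <- tib1[sample(nrow(tib1), nrow(tib1), replace = TRUE), ]
  # run the linear model and save the statistic of interest (here, the slope)
  boot_slopes[k] <- coef(lm(weight ~ height, data = resample))["height"]
}
# percentile 95% confidence interval for the slope
quantile(boot_slopes, c(0.025, 0.975))
```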
---
# Resample 1
.pull-left[
```r
set.seed(101)
rep_1 <- tib1[sample(nrow(tib1), 7, replace = T), ]
rep_1
```

```
## # A tibble: 7 x 3
##   name    height weight
##   <fct>    <dbl>  <dbl>
## 1 John      1.52     54
## 2 John      1.52     54
## 3 Matthew   1.94    110
## 4 Bradley   2.09     98
## 5 Bradley   2.09     98
## 6 John      1.52     54
## 7 Peter     1.6      49
```
]

.pull-right[
```r
res1 <- lm(weight ~ height, data = rep_1)
res1$coefficients
```

```
## (Intercept)      height 
##   -85.44045    90.80482
```
]

---
# Resample 2
.pull-left[
```r
set.seed(102)
rep_2 <- tib1[sample(nrow(tib1), 7, replace = T), ]
rep_2
```

```
## # A tibble: 7 x 3
##   name    height weight
##   <fct>    <dbl>  <dbl>
## 1 Bradley   2.09     98
## 2 Matthew   1.94    110
## 3 John      1.52     54
## 4 David     1.78     67
## 5 Bradley   2.09     98
## 6 John      1.52     54
## 7 Peter     1.6      49
```
]

.pull-right[
```r
res2 <- lm(weight ~ height, data = rep_2)
res2$coefficients
```

```
## (Intercept)      height 
##   -89.23772    92.07847
```
]

---
# Resample 3
.pull-left[
```r
set.seed(103)
rep_3 <- tib1[sample(nrow(tib1), 7, replace = T), ]
rep_3
```

```
## # A tibble: 7 x 3
##   name    height weight
##   <fct>    <dbl>  <dbl>
## 1 David     1.78     67
## 2 Bradley   2.09     98
## 3 Matthew   1.94    110
## 4 John      1.52     54
## 5 Matthew   1.94    110
## 6 Matthew   1.94    110
## 7 Matthew   1.94    110
```
]

.pull-right[
```r
res3 <- lm(weight ~ height, data = rep_3)
res3$coefficients
```

```
## (Intercept)      height 
##   -112.4241    109.9596
```
]

---
# Our coefficients

| Resample |  Intercept |     Slope |
|---------:|-----------:|----------:|
|        1 |  -85.44045 |  90.80482 |
|        2 |  -89.23772 |  92.07847 |
|        3 | -112.42413 | 109.95961 |

- Now this would be quite tedious to do 1000 times.
- So thankfully R has some tools to help us out.

---
# `Boot` in `car`
- The primary package in R for bootstrapping is called `boot`.
  - But it is moderately complicated to use.
- Thankfully there is an easier-to-use wrapper in the `car` package called `Boot`.
  - Note the capital "B".

```r
library(car)
?Boot
```

---
# `Boot` in `car`
- `Boot` takes the following arguments:

1. Your fitted model.
2. `f`, saying which bootstrap statistics to compute on each bootstrap sample.
  - By default `f = coef`, returning the regression coefficients (an `\(R^2\)` example appears at the end of this deck).
3. `R`, saying how many bootstrap samples to compute.
  - By default `R = 999`.
4. `ncores`, saying whether to perform the calculations in parallel (and more efficiently).
  - By default the function uses `ncores = 1`.

---
# Applying bootstrap
- Step 1. Run the model

```r
m1 <- lm(weight ~ height, data = tib1)
```

- Step 2. Load `car`

```r
library(car)
```

- Step 3. Run `Boot`

```r
boot_m1 <- Boot(m1, R = 1000)
```

```
## Loading required namespace: boot
```

---
# Applying bootstrap
- Step 4. See the summary results

```r
summary(boot_m1)
```

```
## 
## Number of bootstrap replications R = 1000 
##             original bootBias bootSE bootMed
## (Intercept)  -115.98  -7.6867 55.583 -118.84
## height        105.04   4.1589 32.192  106.00
```
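---
# Inspecting the bootstrap distribution
- Before computing an interval, it can help to look at the bootstrap distribution itself.
- A minimal sketch, assuming the `boot_m1` object from the previous step; `car` supplies a `hist()` method for `Boot` output.

```r
# histograms of the bootstrap distributions of the coefficients
hist(boot_m1)
```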
---
# Applying bootstrap
- Step 5. Calculate the confidence interval

```r
Confint(boot_m1, type = "perc")
```

```
## Bootstrap percent confidence intervals
## 
##              Estimate      2.5 %    97.5 %
## (Intercept) -115.9788 -258.64277 -21.83695
## height       105.0402   47.91165 185.87667
```

---
# Interpreting the results
- Currently, the intercept makes very little sense:
  - The expected value of weight when height is equal to zero is -116 kg.
- Neither does the slope:
  - For every metre increase in height, weight increases by 105 kg.
- Let's re-scale `height` to be in centimetres, mean-centre it, and re-run.

```r
tib1 <- tib1 %>%
  mutate(
    heightcm = height * 100
  )
m2 <- lm(weight ~ scale(heightcm, scale = F), data = tib1)
boot_m2 <- Boot(m2, R = 1000)
Confint(boot_m2, type = "perc")
```

```
## Bootstrap percent confidence intervals
## 
##                             Estimate      2.5 %    97.5 %
## (Intercept)                71.142857 62.1214789 80.382107
## scale(heightcm, scale = F)  1.050402  0.5331947  1.917832
```

---
# Interpreting the results

```r
resCI <- Confint(boot_m2, type = "perc")
resCI
```

```
## Bootstrap percent confidence intervals
## 
##                             Estimate      2.5 %    97.5 %
## (Intercept)                71.142857 62.1214789 80.382107
## scale(heightcm, scale = F)  1.050402  0.5331947  1.917832
```

- The expected weight of participants of average height (178 cm) is 71.1 kg.
- For every centimetre increase in height, there is a 1.05 kg increase in weight.
- The 95% CI [0.53, 1.92] does not include 0, and as such we can reject the null at `\(\alpha = 0.05\)`.

---
# Summary of today
- Today we have looked at the practical application of bootstrap inference to linear models.
- We discussed when it is useful:
  - Small samples
  - Violated assumptions
- And how to conduct the analyses using `Boot` from `car`.
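---
# Aside: bootstrapping other statistics
- `Boot` is not limited to coefficients: its `f` argument accepts any function of the fitted model, refitted on each bootstrap sample.
- A minimal sketch of bootstrapping `\(R^2\)`, assuming the `m2` model from earlier; the `rsq` helper is an illustrative name, not from the course materials.

```r
# illustrative helper: extract R^2 from a fitted model
rsq <- function(model) summary(model)$r.squared

# bootstrap R^2 and compute a percentile CI, as for the coefficients
boot_r2 <- Boot(m2, f = rsq, R = 1000, labels = "R2")
Confint(boot_r2, type = "perc")
```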