class: center, middle, inverse, title-slide .title[ #
Bootstrapping
] .subtitle[ ## Data Analysis for Psychology in R 2
] .author[ ### dapR2 Team ] .institute[ ### Department of Psychology
The University of Edinburgh ] --- ## Week's Learning Objectives 1. Understand the principles of bootstrapping. 2. Understand bootstrap distribution. 3. Understand the application of confidence intervals within bootstrapping 4. Apply the bootstrap confidence interval to inference in linear models --- ## Issues with Linear Models + Sometimes we may not be able to draw accurate conclusions from our linear models. + Assumptions violations may mean our estimates are bad or our inferences poor + Violations can have many sources: + Model misspecification + Atypical distributions within our variables + Unsuitable outcome variable --- ## Model Misspecification + Sometimes assumptions appear violated because our model is not correct. + Typically we have: + Failed to include an interaction + Failed to include a non-linear (higher order) effect + Usually detected by observing violations of linearity or normality of residuals. + Solved by including the terms in our linear model. --- ## Atypical Distributions + We may see non-normal residuals, heteroscedasticity and/or non-linearity due to distributional issues with our model variables + One approach to dealing with this is a non-linear transformation of the outcome and/or predictors. + This involves applying a function to the values of a variable. + This changes the values and overall shape of the distribution + For non-normal residuals and heteroscedasticity, skewed outcomes can be transformed to normality + Non-linearity may be helped by a transformation of both predictors and outcomes --- ## Unsuitable Outcome Variable + Outcome variable is not continuous or normally distributed, not due to a measurement error, but because it is not be expected to be + E.g. Reaction time, counts, binary variables. + For data like these, we need a slightly different version of a linear model. + More on this to come later in the course. --- ## Bootstrapped inference + One of the concerns when we have violated assumptions is that we make poor inferences. + This is because with violated assumptions, the building blocks of our inferences may be unreliable. + Bootstrapping as a tool can help us here. --- class: inverse, center, middle ## Part 1: Bootstrapping --- ## Samples .center[ <img src="figs/Sampling.png" width="65%" /> ] --- ## Good Samples - If a sample of `\(n\)` is drawn at **random**, it should be unbiased and representative of `\(N\)` - Point estimates from such samples will be good estimates of the population parameter. .pull-left[ .center[**Unbiased Sample**] ![](figs/unbiasedSample.png)<!-- --> ] .pull-right[ .center[**Biased Sample**] ![](figs/biasedSample.png)<!-- --> ] --- ## Recap on sampling distributions .pull-left[ - We have a population. - We take a sample of size `\(n\)` from it, and calculate our statistic - The statistic is our estimate of the population parameter. - We do this repeatedly, and we can construct a sampling distribution. - The mean of the sampling distribution will be a good approximation to the population parameter. - To quantify sampling variation we can refer to the standard deviation of the sampling distribution (the **standard error**) ] .pull-right[ {{content}} ] -- + University students {{content}} -- + We take a sample of 30 students, calculate the mean height. {{content}} -- + This is our estimate of the mean height of all university students. {{content}} -- + Do this repeatedly (take another sample of 30, calculate mean height). {{content}} -- + The mean of these sample means will be a good approximation of the population mean. {{content}} -- + To quantify sampling variation in mean heights of 30 students, we can refer to the standard deviation of these sample means. {{content}} --- ## Practical problem: .pull-left[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-5-1.svg)<!-- --> ] .pull-right[ + This process allows us to get an estimate of the sampling variability, **but is it realistic?** + Can I really go out and collect 500 samples of 30 students from the population? + Probably not... {{content}} ] -- + So how else can I get a sense of the variability in my sample estimates? --- .pull-left[ ## Solution 1 ### Theoretical + Collect one sample. + Estimate the Standard Error using the formula: `$$\text{SE} = \frac{\sigma}{\sqrt{n}}$$` ] .pull-right[ ## Solution 2 ### Bootstrap - Collect one sample. - Mimic the act of repeated sampling from the population by repeated **resampling with replacement** from the original sample. - Estimate the standard error using the standard deviation of the distribution of **resample** statistics. ] --- ## Bootstrapping > **Bootstrapping:** The process of resampling *with replacement* from the original data to generate multiple samples with the same `\(n\)` as the original data. -- - Start with an initial sample of size `\(n\)`. - Take `\(k\)` resamples (sampling with replacement) of size `\(n\)`, and calculate your statistic from each one. - As `\(k\to\infty\)`, the distribution of the `\(k\)` resample statistics begins to approximate the sampling distribution. + This distribution of resample statistics is known as the **bootstrap distribution** -- > **Test your understanding:** In our original study, we took a sample of 35 DapR2 students and calculated their mean scores on the exam. We want to use bootstraping. What will our resample `\(n\)` be? What will our resample statistic be? --- ## The sample .pull-left[ <img src="jk_img_sandbox/sample.png" width="350" /> Suppose I am interested in the mean age of all characters in The Simpsons, and I have collected a sample of `\(n=10\)`. {{content}} ] -- + The mean age of my sample is 44.9. -- .pull-right[ ``` ## # A tibble: 10 × 2 ## name age ## <chr> <dbl> ## 1 Homer Simpson 39 ## 2 Ned Flanders 60 ## 3 Chief Wiggum 43 ## 4 Milhouse 10 ## 5 Patty Bouvier 43 ## 6 Janey Powell 8 ## 7 Montgomery Burns 104 ## 8 Sherri Mackleberry 10 ## 9 Krusty the Clown 52 ## 10 Jacqueline Bouvier 80 ``` ```r simpsons_sample %>% summarise(mean_age = mean(age)) ``` ``` ## # A tibble: 1 × 1 ## mean_age ## <dbl> ## 1 44.9 ``` ] --- ## The **re**sample .pull-left[ I randomly draw out one person from my original sample, I note the value of interest, and then I put that person "back in the pool" (i.e. I sample with replacement). <br> <br> <img src="jk_img_sandbox/resample1.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 1 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ``` ] --- ## The **re**sample .pull-left[ Again, I draw one person at random again, note the value of interest, and replace them. <br> <br> <img src="jk_img_sandbox/resample2.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 2 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ``` ] --- ## The **re**sample .pull-left[ And again... <br> <br> <br> <img src="jk_img_sandbox/resample3.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 3 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ## 3 Janey Powell 8 ``` ] --- ## The **re**sample .pull-left[ And again... <br> <br> <br> <img src="jk_img_sandbox/resample4.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 4 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ## 3 Janey Powell 8 ## 4 Ned Flanders 60 ``` ] --- ## The **re**sample .pull-left[ Repeat until I have a the same number as my original sample ( `\(n = 10\)` ). <br> <br> <img src="jk_img_sandbox/resample.png" width="350" /> {{content}} ] .pull-right[ ``` ## # A tibble: 10 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ## 3 Janey Powell 8 ## 4 Ned Flanders 60 ## 5 Patty Bouvier 43 ## 6 Chief Wiggum 43 ## 7 Jacqueline Bouvier 80 ## 8 Montgomery Burns 104 ## 9 Sherri Mackleberry 10 ## 10 Patty Bouvier 43 ``` ] -- - This is the **resample** {{content}} -- - The mean age of the resample is 49.4 --- ## Bootstrapping: `\(Resample_1:Resample_k\)` .pull-left[ + If I repeat this whole process `\(k\)` times, I will have `\(k\)` means from `\(k\)` resamples. + Note these are entirely derived from the original sample of 10 ] --- count: false ## Bootstrapping: `\(Resample_1:Resample_k\)` .pull-left[ + If I repeat this whole process `\(k\)` times, I will have `\(k\)` means from `\(k\)` resamples. + Note these are entirely derived from the original sample of 10 + So, if `\(k\)` = 5, I would have something like this: ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-20-1.svg)<!-- --> ] .pull-right[ <table> <thead> <tr> <th style="text-align:right;"> k </th> <th style="text-align:right;"> M </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 49.4 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 46.9 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 59.6 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 47.4 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 44.8 </td> </tr> </tbody> </table> ] --- ## Increasing `\(k\)` + Recall that as `\(k\)` grows larger, the distribution of resample statistics begins to approximate the sampling distribution. -- .center[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-22-1.svg)<!-- --> ] --- ## Bootstrap Standard Error + We have seen the standard error in our linear models, and have discussed it as the measure of sampling variability. -- + It reflects the SD of the sampling distribution. -- + In the same vein, we can use our bootstrapped distribution to calculate the SE > **Test your understanding:** How do you think we get this value? -- > The SD of the bootstrap distribution. -- .center[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-23-1.svg)<!-- --> ] --- class: center, middle ## Questions? --- class: inverse, center, middle ## Part 2: Confidence Intervals --- ## Confidence interval + Remember, we usually do not know the value of a population parameter. + We are trying to estimate this from our data. -- + A **confidence interval** defines a plausible range of values for our population parameter. + To estimate we need: + A confidence level + A measure of sampling variability (e.g. SE/bootstrap SE). --- ## Confidence interval & level + Recall that the confidence level refers to the percentage of times confidence intervals would be expected to contain the true population parameter across repeated samples + Also remember that the typical confidence level used is **95%** -- + So, if we were to take 100 samples and calculate a 95% CI on each of them, ~95 of those intervals would contain the true population parameter -- + What are we 95% confident *in?* + We are 95% confident that our interval contains the true population mean. + The *95% probability* comes from the long-run frequencies of our intervals. -- > Let's see a quick demonstration of this in R... --- ## Simple Visualization .pull-left[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-24-1.svg)<!-- --> ] .pull-right[ + The confidence interval works outwards from the centre + As such, it "cuts-off" the tails. + E.g. the most extreme estimates will not fall within the interval ] --- count: false ## Simple Visualization .pull-left[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-25-1.svg)<!-- --> ] .pull-right[ + The confidence interval works outwards from the centre + As such, it "cuts-off" the tails. + E.g. the most extreme estimates will not fall within the interval + To calculate the CI, we identify the upper and lower bounds of the interval (i.e. the red lines) + These need to be positioned so that 95%* of all possible sample mean estimates fall within the bounds. + *Replace this value with whatever confidence level you choose ] --- ## Calculating CI: 68/95/99 Rule .pull-left[ + Remember that sampling distributions become normal as `\(k\)` increases, and normal distributions have fixed properties + Specifically... ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-26-1.svg)<!-- --> ] --- ## Calculating CI: 68/95/99 Rule .pull-left[ + Remember that sampling distributions become normal as `\(k\)` increases, and normal distributions have fixed properties + Specifically... + 68% of observations fall within 1 SD of the mean ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-27-1.svg)<!-- --> ] --- count: false ## Calculating CI: 68/95/99 Rule .pull-left[ + Remember that sampling distributions become normal as `\(k\)` increases, and normal distributions have fixed properties + Specifically... + 68% of observations fall within 1 SD of the mean + **95% of observations fall within 1.96 SD of the mean** ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-28-1.svg)<!-- --> ] --- count: false ## Calculating CI: 68/95/99 Rule .pull-left[ + Remember that sampling distributions become normal as `\(k\)` increases, and normal distributions have fixed properties + Specifically... + 68% of observations fall within 1 SD of the mean + **95% of observations fall within 1.96 SD of the mean** + 99% of observations fall within 3 SD of the mean ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-29-1.svg)<!-- --> ] --- ## Calculating CI .pull-left[ + The bounds of the 95% CI for a mean are: $$ \text{Lower Bound} = mean - 1.96 \times SE $$ $$ \text{Upper Bound} = mean + 1.96 \times SE $$ + where `$$SE = \frac{SD}{\sqrt{n}}$$` ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-30-1.svg)<!-- --> ] -- > **Test your understanding:** If we wanted to calculate the 99% confidence interval, what would we replace 1.96 with in the equation above? --- ## Calculating CI - Theoretical Approach ```r simpsons_sample <- read_csv("https://uoepsy.github.io/data/simpsons_sample.csv") mean(simpsons_sample$age) ``` ``` ## [1] 44.9 ``` `$$mean \pm 1.96 \times \frac{SD}{\sqrt{n}}$$` ```r mean(simpsons_sample$age) - 1.96*(sd(simpsons_sample$age)/sqrt(10)) ``` ``` ## [1] 25.47182 ``` ```r mean(simpsons_sample$age) + 1.96*(sd(simpsons_sample$age)/sqrt(10)) ``` ``` ## [1] 64.32818 ``` --- ## Calculating CI - Bootstrap Approach Step 1 - Perform bootstrapping .pull-left[ ```r resamples2000 <- rep_sample_n(simpsons_sample, n = 10, samples = 2000, replace = TRUE) bootstrap_dist <- resamples2000 %>% group_by(sample) %>% summarise(resamplemean = mean(age)) ``` ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-34-1.svg)<!-- --> ] --- ## Calculating CI - Bootstrap Approach Step 2 - Calculate CI .pull-left[ + Remember that the SE = the `\(SD\)` of the bootstrap... ```r sd(bootstrap_dist$resamplemean) ``` ``` ## [1] 9.380293 ``` **Lower Bound:** ```r mean(simpsons_sample$age) - 1.96*sd(bootstrap_dist$resamplemean) ``` ``` ## [1] 26.51463 ``` **Upper Bound:** ```r mean(simpsons_sample$age) + 1.96*sd(bootstrap_dist$resamplemean) ``` ``` ## [1] 63.28537 ``` ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-38-1.svg)<!-- --> ] --- ## Sampling Distributions and CIs are not just for means For a 95% Confidence Interval around a statistic: $$ \text{Lower Bound} = statistic - 1.96 \times SE $$ $$ \text{Upper Bound} = statistic + 1.96 \times SE $$ --- class: center, middle # Questions? --- class: inverse, center, middle # Part 3 ## Bootstrapping linear model --- ## Bootstrapping a linear model - We have looked at bootstrapping of the mean. - But we can compute a bootstrap distribution of any statistic. - As a result, it is a straightforward extension to linear models. - We can calculate `\(\beta\)` coefficients, `\(R^2\)`, `\(F\)`-statistics etc. - In each case we generate a resample - Run the linear model - Save the statistic of interest - Repeat this `\(K\)` times - Generate the distribution of `\(K\)` statistics of interest. --- ## `Boot` in `car` - The primary package in R for doing bootstrapping is called `boot` - But it is moderately complicated to use. - Thankfully there is an easier to use wrapper in the `car` package called `Boot` - Note the capital letters. ```r library(car) ?Boot ``` --- ## `Boot` in `car` - The key arguments to `Boot` are the following: ```r Boot(object, f, R) ``` + `object`: Your fitted model. + `f`: Which bootstrap statistics to compute on each resample. - By default, `f = coef`, returning the regression coefficients. + `R`: how many bootstrap samples to compute. - By default, `R = 999`. --- ## Applying bootstrap .pull-left[ - **Step 1.** Run model ```r m1 <- lm(weight ~ height, data = tib1) ``` - **Step 2.** Load `car` ```r library(car) ``` - **Step 3.** Run `Boot` ```r boot_m1 <- Boot(m1, R = 1000) ``` ] .pull-right[ <table> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:right;"> height </th> <th style="text-align:right;"> weight </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> John </td> <td style="text-align:right;"> 1.52 </td> <td style="text-align:right;"> 54 </td> </tr> <tr> <td style="text-align:left;"> Peter </td> <td style="text-align:right;"> 1.60 </td> <td style="text-align:right;"> 49 </td> </tr> <tr> <td style="text-align:left;"> Robert </td> <td style="text-align:right;"> 1.68 </td> <td style="text-align:right;"> 50 </td> </tr> <tr> <td style="text-align:left;"> David </td> <td style="text-align:right;"> 1.78 </td> <td style="text-align:right;"> 67 </td> </tr> <tr> <td style="text-align:left;"> George </td> <td style="text-align:right;"> 1.86 </td> <td style="text-align:right;"> 70 </td> </tr> <tr> <td style="text-align:left;"> Matthew </td> <td style="text-align:right;"> 1.94 </td> <td style="text-align:right;"> 110 </td> </tr> <tr> <td style="text-align:left;"> Bradley </td> <td style="text-align:right;"> 2.09 </td> <td style="text-align:right;"> 98 </td> </tr> </tbody> </table> ] --- ## Applying bootstrap `$$weight\sim height$$` - **Step 4.** See summary results ```r summary(boot_m1) ``` ``` ## ## Number of bootstrap replications R = 1000 ## original bootBias bootSE bootMed ## (Intercept) -115.98 -7.8534 54.738 -117.81 ## height 105.04 4.0292 31.544 105.45 ``` --- ## Applying bootstrap `$$weight\sim height$$` - **Step 5.** Calculate confidence interval ```r Confint(boot_m1, type = "perc") ``` ``` ## Bootstrap percent confidence intervals ## ## Estimate 2.5 % 97.5 % ## (Intercept) -115.9788 -256.21544 -27.94371 ## height 105.0402 51.32111 184.23926 ``` -- > **Test your understanding:** Given that the intercept and slopes are interpreted in the same way as they would be in the `lm` output, how would you interpret these estimates? --- ## Interpreting the results + So currently, the intercept and slope values make very little sense. + Let's re-scale `height` to be in centimetres, mean centre, and re-run. ```r tib1 <- tib1 %>% mutate( heightcm = height*100 ) m2 <- lm(weight ~ scale(heightcm, scale=F), data = tib1) boot_m2 <- Boot(m2, R = 1000) ``` --- ## Interpreting the results ``` ## Bootstrap percent confidence intervals ## ## Estimate 2.5 % 97.5 % ## (Intercept) 71.142857 62.5664207 81.426269 ## scale(heightcm, scale = F) 1.050402 0.5130964 1.839142 ``` ```r Confint(boot_m2, type = "perc") ``` > **Test your understanding:** Now how will you interpret these estimates? What about the CI? -- > The average expected weight of participants with *average* height (178cm) is 71.1kg. > For every cm increase in height, there is a 1.05kg increase in weight. > We can be 95% confident that the CI [0.51 , 1.84] includes the true population `\(\beta_1\)` estimate. Because the CI does not include 0, we can reject the null at `\(\alpha = 0.05\)` --- class: center, middle ## Questions? --- ## Summary - Good samples are representative, random and unbiased. - Bootstrap resampling is a tool to construct a *bootstrap distribution* of any statistic which, with sufficient resamples, will approximate the *sampling distribution* of the statistic. - Confidence Intervals are a tool for considering the plausible value for an unknown population parameter. - We can use bootstrap SE to calculate CI. - And using bootstrap CI's is a plausible approach when we have assumption violations, and other issues, in our linear models. + We have seen how to apply this for `lm` using `Boot` from `car`. --- class: inverse, center, middle # Thanks for Listening!