class: center, middle, inverse, title-slide #
Bootstrapping Theory
## Data Analysis for Psychology in R 2
### dapR2 Team ### Department of Psychology
The University of Edinburgh --- # Weeks Learning Objectives 1. Recap the principles of bootstrapping. 2. Recap the concept of the bootstrap distribution. 3. Recap confidence intervals 4. Apply the bootstrap confidence interval to inference in linear models --- # Topics for this week 1. Bootstrapping theory (recap) 2. Confidence intervals (recap) 3. Why this is useful for linear models? 4. Applying bootstrap inference to linear models --- class: inverse, center, middle # Part 1 ## Bootstrapping --- # Samples <center> <img src="jk_img_sandbox/statistical_inference.png" width="600" height="500" /> </center> --- # Good Samples - If a sample of `\(n\)` is drawn at **random**, it will be unbiased and representative of `\(N\)` - Point estimates from such samples will be good estimates of the population parameter. - Without the need for census. ![](jk_img_sandbox/sampling_bias.png) --- # Recap on sampling distributions .pull-left[ - We have a population. - We take a sample of size `\(n\)` from it, and calculate our statistic - The statistic is our estimate of the population parameter. - We do this repeatedly, and we can construct a sampling distribution. - The mean of the sampling distribution will be a good approximation to the population parameter. - To quantify sampling variation we can refer to the standard deviation of the sampling distribution (the **standard error**) ] .pull-right[ {{content}} ] -- + University students {{content}} -- + We take a sample of 30 students, calculate the mean height. {{content}} -- + This is our estimate of the mean height of all university students. {{content}} -- + Do this repeatedly (take another sample of 30, calculate mean height). {{content}} -- + The mean of these sample means will be a good approximation of the population mean. {{content}} -- + To quantify sampling variation in mean heights of 30 students, we can refer to the standard deviation of these sample means. {{content}} --- # Practical problem: .pull-left[ ![](dapr2_14_BootstrapTheory_files/figure-html/unnamed-chunk-2-1.png)<!-- --> ] .pull-right[ - This process allows us to get an estimate of the sampling variability, **but is this realistic?** - Can I really go out and collect 500 samples of 30 students from the population? - Probably not... {{content}} ] -- - So how else can I get a sense of the variability in my sample estimates? --- .pull-left[ ## Solution 1 ### Theoretical - Collect one sample. - Estimate the Standard Error using the formula: <br> `\(\text{SE} = \frac{\sigma}{\sqrt{n}}\)` ] .pull-right[ ## Solution 2 ### Bootstrap - Collect one sample. - Mimick the act of repeated sampling from the population by repeated **resampling with replacement** from the original sample. - Estimate the standard error using the standard deviation of the distribution of **resample** statistics. ] --- # Resampling 1: The sample .pull-left[ <img src="jk_img_sandbox/sample.png" width="350" /> Suppose I am interested in the mean age of all characters in The Simpsons, and I have collected a sample of `\(n=10\)`. {{content}} ] -- + The mean age of my sample is 44.9. -- .pull-right[ ``` ## # A tibble: 10 × 2 ## name age ## <chr> <dbl> ## 1 Homer Simpson 39 ## 2 Ned Flanders 60 ## 3 Chief Wiggum 43 ## 4 Milhouse 10 ## 5 Patty Bouvier 43 ## 6 Janey Powell 8 ## 7 Montgomery Burns 104 ## 8 Sherri Mackleberry 10 ## 9 Krusty the Clown 52 ## 10 Jacqueline Bouvier 80 ``` ```r simpsons_sample %>% summarise(mean_age = mean(age)) ``` ``` ## # A tibble: 1 × 1 ## mean_age ## <dbl> ## 1 44.9 ``` ] --- # Resampling 2: The **re**sample .pull-left[ I randomly draw out one person from my original sample, I note the value of interest, and then I put that person "back in the pool" (i.e. I sample with replacement). <br> <br> <img src="jk_img_sandbox/resample1.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 1 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ``` ] --- # Resampling 2: The **re**sample .pull-left[ Again, I draw one person at random again, note the value of interest, and replace them. <br> <br> <img src="jk_img_sandbox/resample2.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 2 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ``` ] --- # Resampling 2: The **re**sample .pull-left[ And again... <br> <br> <br> <img src="jk_img_sandbox/resample3.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 3 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ## 3 Janey Powell 8 ``` ] --- # Resampling 2: The **re**sample .pull-left[ And again... <br> <br> <br> <img src="jk_img_sandbox/resample4.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 4 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ## 3 Janey Powell 8 ## 4 Ned Flanders 60 ``` ] --- # Resampling 2: The **re**sample .pull-left[ Repeat until I have a the same number as my original sample ( `\(n = 10\)` ). <br> <br> <img src="jk_img_sandbox/resample.png" width="350" /> {{content}} ] .pull-right[ ``` ## # A tibble: 10 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ## 3 Janey Powell 8 ## 4 Ned Flanders 60 ## 5 Patty Bouvier 43 ## 6 Chief Wiggum 43 ## 7 Jacqueline Bouvier 80 ## 8 Montgomery Burns 104 ## 9 Sherri Mackleberry 10 ## 10 Patty Bouvier 43 ``` ] -- - This is known as a **resample** {{content}} -- - The mean age of the resample is 49.4 --- # Bootstrapping: Resample<sub>1</sub>, ..., Resample<sub>k</sub> - If I repeat this whole process many times, say k=1000, I will have 1000 means from 1000 resamples. - Note these are entirely derived from the original sample. - This is known as called **bootstrapping**, and the resultant distribution of statistics (in our example, the distribution of 1000 resample means) is known as a **bootstrap distribution.** <div style="border-radius: 5px; padding: 20px 20px 10px 20px; margin-top: 20px; margin-bottom: 20px; background-color:#fcf8e3 !important;"> **Bootstrapping** The process of resampling *with replacement* from the original data to generate a multiple resamples of the same `\(n\)` as the original data. --- # Bootstrap distribution - Start with an initial sample of size `\(n\)`. - Take `\(k\)` resamples (sampling with replacement) of size `\(n\)`, and calculate your statistic on each one. - As `\(k\to\infty\)`, the distribution of the `\(k\)` resample statistics begins to approximate the sampling distribution. - Note, this is just the same exercise as we did with samples from the population in previous weeks. --- # k = 20, 50, 200, 2000, ... ![](dapr2_14_BootstrapTheory_files/figure-html/unnamed-chunk-17-1.png)<!-- --> --- # Bootstrap Standard Error - Previously we spoke about the standard error as the measure of sampling variability. -- - We stated that this was just the SD of the sampling distribution. -- - In the same vein, we can calculate a bootstrap standard error - the SD of the bootstrap distribution. ![](dapr2_14_BootstrapTheory_files/figure-html/unnamed-chunk-18-1.png)<!-- --> --- class: inverse, center, middle, animated, rotateInDownLeft # End of Part 1 --- class: inverse, center, middle # Part 2 ## Confidence Intervals --- # Confidence interval - Remember, usually we do not know the value of a population parameter. - We are trying to estimate this from our data. -- - A confidence interval defines a plausible range of values for our population parameter. - To estimate we need: - A **confidence level** - A measure of sampling variability (e.g. SE/bootstrap SE). --- # Confidence interval & level <div style="border-radius: 5px; padding: 20px 20px 10px 20px; margin-top: 20px; margin-bottom: 20px; background-color:#fcf8e3 !important;"> **x% Confidence interval** across repeated samples, [x]% confidence intervals would be expected to contain the true population parameter value.</div> x% is the *confidence level*. Commonly, you will use and read about **95%** confidence intervals. If we were to take 100 samples, and calculate a 95% CI on each of them, approx 95 of them would contain the true population mean. -- - What are we 95% confident *in?* - We are 95% confident that our interval [lower, upper] contains the true population mean. - This is subtly different from saying that we are 95% confident that the true mean is inside our interval. The 95% probability is related to the long-run frequencies of our intervals. --- # Simple Visualization .pull-left[ ![](dapr2_14_BootstrapTheory_files/figure-html/unnamed-chunk-19-1.png)<!-- --> ] .pull-right[ - The confidence interval works outwards from the centre - As such, it "cuts-off" the tails. - E.g. the most extreme estimates will not fall within the interval ] --- # Calculating CI - We want to identify the upper and lower bounds of the interval (i.e. the red lines from previous slide) - These need to be positioned so that 95% of all possible sample mean estimates fall within the bounds. --- # Calculating CI: 68/95/99 Rule - Remember that sampling distributions become normal... - There are fixed properties of normal distributions. -- - Specifically: - 68% of density falls within 1 SD of the mean - 95% of density falls with 1.96 SD of the mean - 99.7% of density falls within 3 SD of the mean - Remember the standard error = SD of the bootstrap (or sampling distribution)... --- # Calculating CI - ... the bounds of the 95% CI for a mean are: $$ \text{Lower Bound} = mean - 1.96 \cdot SE $$ $$ \text{Upper Bound} = mean + 1.96 \cdot SE $$ --- # Calculating CI - Example ```r simpsons_sample <- read_csv("https://uoepsy.github.io/data/simpsons_sample.csv") mean(simpsons_sample$age) ``` ``` ## [1] 44.9 ``` --- # Calculating CI - Example ```r # Theoretical approach mean(simpsons_sample$age) - 1.96*(sd(simpsons_sample$age)/sqrt(10)) ``` ``` ## [1] 25.47 ``` ```r mean(simpsons_sample$age) + 1.96*(sd(simpsons_sample$age)/sqrt(10)) ``` ``` ## [1] 64.33 ``` --- # Calculating CI - Example ```r # Bootstrap Approach source('https://uoepsy.github.io/files/rep_sample_n.R') resamples2000 <- rep_sample_n(simpsons_sample, n = 10, samples = 2000, replace = TRUE) bootstrap_dist <- resamples2000 %>% group_by(sample) %>% summarise(resamplemean = mean(age)) sd(bootstrap_dist$resamplemean) ``` ``` ## [1] 9.445 ``` ```r mean(simpsons_sample$age) - 1.96*sd(bootstrap_dist$resamplemean) ``` ``` ## [1] 26.39 ``` ```r mean(simpsons_sample$age) + 1.96*sd(bootstrap_dist$resamplemean) ``` ``` ## [1] 63.41 ``` --- # Sampling Distributions and CIs are not just for means. For a 95% Confidence Interval around a statistic: $$ \text{Lower Bound} = statistic - 1.96 \cdot SE $$ $$ \text{Upper Bound} = statistic + 1.96 \cdot SE $$ --- # Summary - Good samples are representative, random and unbiased. - Bootstrap resampling is a tool to construct a *bootstrap distribution* of any statistic which, with sufficient resamples, will approximate the *sampling distribution* of the statistic. - Confidence Intervals are a tool for considering the plausible value for an unknown population parameter. - We can use bootstrap SE to calculate CI. - And using bootstrap CI's is a plausible approach when we have assumption violations, and other issues, in our linear models. --- class: inverse, center, middle, animated, rotateInDownLeft # End