class: center, middle, inverse, title-slide .title[ #
Bootstrapping
] .subtitle[ ## Data Analysis for Psychology in R 2
] .author[ ### dapR2 Team ] .institute[ ### Department of Psychology
The University of Edinburgh ] --- # Course Overview .pull-left[ <table style="border: 1px solid black;"> <tr style="padding: 0 1em 0 1em;"> <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1;text-align:center;vertical-align: middle"> <b>Introduction to Linear Models</b></td> <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1"> Intro to Linear Regression</td> </tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1"> Interpreting Linear Models</td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1"> Testing Individual Predictors</td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1"> Model Testing & Comparison</td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1"> Linear Model Analysis</td></tr> <tr style="padding: 0 1em 0 1em;"> <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1;text-align:center;vertical-align: middle"> <b>Analysing Experimental Studies</b></td> <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1"> Categorical Predictors & Dummy Coding</td> </tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1"> Effects Coding & Coding Specific Contrasts</td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1"> Assumptions & Diagnostics</td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1"> <b>Bootstrapping</b></td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Categorical Predictor Analysis</td></tr> </table> ] .pull-right[ <table style="border: 1px solid black;"> <tr style="padding: 0 1em 0 1em;"> <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4;text-align:center;vertical-align: middle"> <b>Interactions</b></td> <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Interactions I</td> </tr> <tr><td style="border: 1px solid black;padding: 
0 1em 0 1em;opacity:0.4"> Interactions II</td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Interactions III</td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Analysing Experiments</td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Interaction Analysis</td></tr> <tr style="padding: 0 1em 0 1em;"> <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4;text-align:center;vertical-align: middle"> <b>Advanced Topics</b></td> <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Power Analysis</td> </tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Binary Logistic Regression I</td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Binary Logistic Regression II</td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Logistic Regression Analysis</td></tr> <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Exam Prep and Course Q&A</td></tr> </table> ] --- class: inverse, center, middle ## From last week: Dealing with influential cases --- ## What should be done when we identify issues with model diagnostics? + It is easier to identify unusual cases than it is to decide what to do about them -- + In general, it is not a good idea to delete cases automatically for exceeding a threshold + Especially if they don't show large influence + Think of linear model diagnostics more as a way of learning about the limitations of the model + Which cases can't it explain very well? + Why is that? + Do results depend quite strongly on certain cases? -- + When unusual cases arise, try the following: 1. Try to identify why a case is unusual 2. Run a sensitivity analysis --- ## Investigating and dealing with unusual cases + Is there a data entry error? + Is the value within the plausible range of values for a variable? + Can it be corrected? 
If not, the incorrect value should be deleted + Are there unusual values as a result of skewness? + Is the data legitimate but extreme? + Avoid changing or removing legitimate values when possible, because they may reflect natural variance in the population + Before making any changes, check influence values and run a sensitivity analysis + After taking these steps, if a very small proportion of observations are strongly influencing the overall model, consider ways to reduce their influence rather than removing them entirely (e.g. **winsorising**) + There may be model specification problems (e.g. missing variables, interactions - coming soon) ??? + Replacing the extreme value with the next largest value for that variable + Avoids missingness / preserves information + Note that deleting or winsorising values can change the model, and therefore different cases might then show up as outlying, high leverage, or influential + An iterative approach to removing/winsorising cases is needed --- ## Sensitivity Analyses + Sensitivity analysis refers to the idea of checking whether you get similar results irrespective of the methodological decisions you make + Sensitivity analysis can be used to check whether you get the same pattern of results. Do the estimated regression coefficients change substantially: + With versus without including certain unusual cases? + With versus without transforming a variable? + If results are highly similar, you have more confidence that the results aren't very dependent on those decisions + If they differ a lot, this should be reported as a limitation --- class: inverse, center, middle ## Onto this week... --- ## This Week's Learning Objectives 1. Understand the principles of bootstrapping 2. Understand the bootstrap distribution 3. Understand the application of confidence intervals within bootstrapping 4. 
Apply the bootstrap confidence interval to inference in linear models --- ## Issues with Linear Models: Violated Assumptions + Sometimes we may not be able to draw accurate conclusions from our linear models + Assumption violations may mean our estimates are bad or our inferences poor + Violations can have many sources: + Model misspecification + Atypical distributions within our variables + Unsuitable outcome variable --- ## Model Misspecification + Sometimes assumptions appear violated because our model is not correct + Typically we have: + Failed to include an interaction + Failed to include a non-linear (higher order) effect + Usually detected by observing violations of linearity, normality of residuals or heteroscedasticity + Solved by including additional terms in our linear model --- ## Atypical Distributions + We may also see non-normal residuals, heteroscedasticity and/or non-linearity due to distributional issues with our model variables + One approach to dealing with this is a non-linear transformation of the outcome and/or predictors + This involves applying a function to the values of a variable + This changes the values and overall shape of the distribution + For non-normal residuals and heteroscedasticity, skewed outcomes can be transformed to normality + Non-linearity may be helped by a transformation of both predictors and outcomes --- ## Unsuitable Outcome Variable + The outcome variable is not continuous or normally distributed, not due to a measurement error, but because it is not expected to be + E.g. 
Reaction time, counts, binary variables + For data like these, we need a slightly different version of a linear model + More on this to come later in the course --- ## Bootstrapped inference + One of the concerns when we have violated assumptions is that we make poor inferences + This is because with violated assumptions, the building blocks of our inferences may be unreliable + Bootstrapping as a tool can help us here --- class: inverse, center, middle ## Part 1: Bootstrapping --- ## Samples .center[ <img src="figs/Sampling.png" width="65%" /> ] --- ## Good Samples - If a sample of `\(n\)` is drawn at **random**, it should be unbiased and representative of `\(N\)` - Point estimates from such samples will be good estimates of the population parameter .pull-left[ .center[**Unbiased Sample**] ![](figs/unbiasedSample.png)<!-- --> ] .pull-right[ .center[**Biased Sample**] ![](figs/biasedSample.png)<!-- --> ] --- ## Recap on sampling distributions .pull-left[ - We have a population - We take a sample of size `\(n\)` from it, and calculate our statistic - The statistic is our estimate of the population parameter - We do this repeatedly, and we can construct a sampling distribution - The mean of the sampling distribution will be a good approximation to the population parameter - To quantify sampling variation we can refer to the standard deviation of the sampling distribution (the **standard error**) ] .pull-right[ {{content}} ] -- + University students {{content}} -- + We take a sample of 30 students, calculate the mean height {{content}} -- + This is our estimate of the mean height of all university students {{content}} -- + Do this repeatedly (take another sample of 30, calculate mean height) {{content}} -- + The mean of these sample means will be a good approximation of the population mean {{content}} -- + To quantify sampling variation in mean heights of 30 students, we can refer to the standard deviation of these sample means {{content}} --- ## Practical problem: 
.pull-left[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-7-1.svg)<!-- --> ] .pull-right[ + This process allows us to get an estimate of the sampling variability, **but is the process realistic?** + Do I really want to repeat the sampling 500 times? + Probably not... {{content}} ] -- + So how else can I get a sense of the variability in my sample estimates? --- .pull-left[ ## Solution 1 ### Theoretical + Collect one sample + Estimate the Standard Error using the formula: `$$\text{SE} = \frac{\sigma}{\sqrt{n}}$$` ] .pull-right[ ## Solution 2 ### Bootstrap - Collect one sample - Mimic the act of repeated sampling from the population by repeated **resampling with replacement** from the original sample - Estimate the standard error using the standard deviation of the distribution of **resample** statistics ] --- ## Bootstrapping > **Bootstrapping:** The process of resampling *with replacement* from the original data to generate multiple samples with the same `\(n\)` as the original data -- - Start with an initial sample of size `\(n\)` - Take `\(k\)` resamples (sampling with replacement) of size `\(n\)`, and calculate your statistic from each one - As `\(k\to\infty\)`, the distribution of the `\(k\)` resample statistics begins to approximate the sampling distribution + This distribution of resample statistics is known as the **bootstrap distribution** -- > **Test your understanding:** In a study, we took a sample of 35 DapR2 students and calculated their mean scores on the exam. We want to use bootstrapping. What will our resample `\(n\)` be? What will our resample statistic be? 
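---

## Bootstrapping: a sketch in R

The resampling loop described above can be sketched in a few lines of base R. Here the `exam_scores` vector is simulated purely for illustration (it is not real DapR2 data):

``` r
set.seed(1)
# a hypothetical sample of n = 35 exam scores
exam_scores <- round(rnorm(35, mean = 65, sd = 10))

# take k resamples: each draws n values with replacement and stores the mean
k <- 1000
boot_means <- replicate(k, {
  resample <- sample(exam_scores, size = length(exam_scores), replace = TRUE)
  mean(resample)
})

# boot_means is the bootstrap distribution of the mean;
# its standard deviation estimates the standard error
sd(boot_means)
```

Note that every resample is drawn from the original sample only; the population is never touched again after the initial data collection.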
--- ## The sample .pull-left[ <img src="jk_img_sandbox/sample.png" width="350" /> Suppose I am interested in the mean age of all characters in The Simpsons, and I have collected a sample of `\(n=10\)` {{content}} ] -- + The mean age of my sample is 44.9 -- .pull-right[ ``` ## # A tibble: 10 × 2 ## name age ## <chr> <dbl> ## 1 Homer Simpson 39 ## 2 Ned Flanders 60 ## 3 Chief Wiggum 43 ## 4 Milhouse 10 ## 5 Patty Bouvier 43 ## 6 Janey Powell 8 ## 7 Montgomery Burns 104 ## 8 Sherri Mackleberry 10 ## 9 Krusty the Clown 52 ## 10 Jacqueline Bouvier 80 ``` ``` r simpsons_sample %>% summarise(mean_age = mean(age)) ``` ``` ## # A tibble: 1 × 1 ## mean_age ## <dbl> ## 1 44.9 ``` ] --- ## The **re**sample .pull-left[ I randomly draw out one person from my original sample, I note the value of interest, and then I put that person "back in the pool" (i.e. I sample with replacement). <br> <br> <img src="jk_img_sandbox/resample1.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 1 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ``` ] --- ## The **re**sample .pull-left[ Again, I draw one person at random, note the value of interest, and replace them. <br> <br> <img src="jk_img_sandbox/resample2.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 2 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ``` ] --- ## The **re**sample .pull-left[ And again... <br> <br> <br> <img src="jk_img_sandbox/resample3.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 3 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ## 3 Janey Powell 8 ``` ] --- ## The **re**sample .pull-left[ And again... <br> <br> <br> <img src="jk_img_sandbox/resample4.png" width="350" /> ] .pull-right[ ``` ## # A tibble: 4 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ## 3 Janey Powell 8 ## 4 Ned Flanders 60 ``` ] --- ## The **re**sample .pull-left[ Repeat until I have the same number of observations as my original sample ( `\(n = 10\)` ). 
<br> <br> <img src="jk_img_sandbox/resample.png" width="350" /> {{content}} ] .pull-right[ ``` ## # A tibble: 10 × 2 ## name age ## <chr> <dbl> ## 1 Chief Wiggum 43 ## 2 Ned Flanders 60 ## 3 Janey Powell 8 ## 4 Ned Flanders 60 ## 5 Patty Bouvier 43 ## 6 Chief Wiggum 43 ## 7 Jacqueline Bouvier 80 ## 8 Montgomery Burns 104 ## 9 Sherri Mackleberry 10 ## 10 Patty Bouvier 43 ``` ] -- - This is the **resample** {{content}} -- - The mean age of the resample is 49.4 --- ## Bootstrapping: `\(Resample_1:Resample_k\)` .pull-left[ + If I repeat this whole process `\(k\)` times, I will have `\(k\)` means from `\(k\)` resamples. + Note these are entirely derived from the original sample of 10 ] --- count: false ## Bootstrapping: `\(Resample_1:Resample_k\)` .pull-left[ + If I repeat this whole process `\(k\)` times, I will have `\(k\)` means from `\(k\)` resamples. + Note these are entirely derived from the original sample of 10 + So, if `\(k\)` = 5, I would have something like this: ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-22-1.svg)<!-- --> ] .pull-right[ <table> <thead> <tr> <th style="text-align:right;"> k </th> <th style="text-align:right;"> M </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 49.4 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 46.9 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 59.6 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 47.4 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 44.8 </td> </tr> </tbody> </table> ] --- ## Increasing `\(k\)` + Recall that as `\(k\)` grows larger, the distribution of resample statistics begins to approximate the sampling distribution -- .center[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-24-1.svg)<!-- --> ] --- ## Bootstrap Standard Error + We have seen the standard error in our linear 
models, and have discussed it as a measure of sampling variability -- + It reflects the SD of the sampling distribution -- + In the same vein, we can use our bootstrapped distribution to calculate the SE > **Test your understanding:** How do you think we get this value? -- > The SD of the bootstrap distribution -- .center[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-25-1.svg)<!-- --> ] --- class: center, middle ## Questions? --- class: inverse, center, middle ## Part 2: Confidence Intervals --- ## Confidence interval + Remember, we usually do not know the value of a population parameter + We are trying to estimate this from our data -- + A **confidence interval** defines a plausible range of values for our population parameter + To estimate one, we need: + A confidence level + A measure of sampling variability (e.g. SE/bootstrap SE) --- ## Confidence interval & level + Recall that the confidence level refers to the percentage of times confidence intervals would be expected to contain the true population parameter across repeated samples + Also remember that the typical confidence level used is **95%** -- + So, if we were to take 100 samples and calculate a 95% CI on each of them, ~95 of those intervals would contain the true population parameter -- + What are we 95% confident *in?* + We are 95% confident that our interval contains the true population mean + The *95% probability* comes from the long-run frequencies of our intervals -- > Let's see a quick demonstration of this in R... --- ## Simple Visualisation .pull-left[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-26-1.svg)<!-- --> ] .pull-right[ + The confidence interval works outwards from the centre + As such, it "cuts off" the tails + E.g. 
the most extreme estimates will not fall within the interval ] --- count: false ## Simple Visualisation .pull-left[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-27-1.svg)<!-- --> ] .pull-right[ + The confidence interval works outwards from the centre + As such, it "cuts off" the tails + E.g. the most extreme estimates will not fall within the interval + To calculate the CI, we identify the upper and lower bounds of the interval (i.e. the red lines) + These need to be positioned so that 95%* of all possible sample mean estimates fall within the bounds + *Replace this value with whatever confidence level you choose ] --- ## Calculating CI: 68/95/99 Rule .pull-left[ + Remember that sampling distributions become normal as `\(k\)` increases, and normal distributions have fixed properties + Specifically... ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-28-1.svg)<!-- --> ] --- ## Calculating CI: 68/95/99 Rule .pull-left[ + Remember that sampling distributions become normal as `\(k\)` increases, and normal distributions have fixed properties + Specifically... + 68% of observations fall within 1 SD of the mean ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-29-1.svg)<!-- --> ] --- count: false ## Calculating CI: 68/95/99 Rule .pull-left[ + Remember that sampling distributions become normal as `\(k\)` increases, and normal distributions have fixed properties + Specifically... + 68% of observations fall within 1 SD of the mean + **95% of observations fall within 1.96 SD of the mean** ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-30-1.svg)<!-- --> ] --- count: false ## Calculating CI: 68/95/99 Rule .pull-left[ + Remember that sampling distributions become normal as `\(k\)` increases, and normal distributions have fixed properties + Specifically... 
+ 68% of observations fall within 1 SD of the mean + **95% of observations fall within 1.96 SD of the mean** + 99.7% of observations fall within 3 SD of the mean ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-31-1.svg)<!-- --> ] --- ## Calculating CI .pull-left[ + The bounds of the 95% CI for a mean are: $$ \text{Lower Bound} = mean - 1.96 \times SE $$ $$ \text{Upper Bound} = mean + 1.96 \times SE $$ + where `$$SE = \frac{SD}{\sqrt{n}}$$` ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-32-1.svg)<!-- --> ] -- > **Test your understanding:** If we wanted to calculate the 99% confidence interval, what would we replace 1.96 with in the equation above? --- ## Calculating CI - Theoretical Approach ``` r simpsons_sample <- read_csv("https://uoepsy.github.io/data/simpsons_sample.csv") mean(simpsons_sample$age) ``` ``` ## [1] 44.9 ``` `$$mean \pm 1.96 \times \frac{SD}{\sqrt{n}}$$` ``` r mean(simpsons_sample$age) - 1.96*(sd(simpsons_sample$age)/sqrt(10)) ``` ``` ## [1] 25.47182 ``` ``` r mean(simpsons_sample$age) + 1.96*(sd(simpsons_sample$age)/sqrt(10)) ``` ``` ## [1] 64.32818 ``` --- ## Calculating CI - Bootstrap Approach Step 1 - Perform bootstrapping .pull-left[ ``` r resamples2000 <- rep_sample_n(simpsons_sample, n = 10, samples = 2000, replace = TRUE) bootstrap_dist <- resamples2000 %>% group_by(sample) %>% summarise(resamplemean = mean(age)) ``` ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-36-1.svg)<!-- --> ] --- ## Calculating CI - Bootstrap Approach Step 2 - Calculate CI .pull-left[ + Remember that the SE = the `\(SD\)` of the bootstrap... 
``` r sd(bootstrap_dist$resamplemean) ``` ``` ## [1] 9.380293 ``` **Lower Bound:** ``` r mean(simpsons_sample$age) - 1.96*sd(bootstrap_dist$resamplemean) ``` ``` ## [1] 26.51463 ``` **Upper Bound:** ``` r mean(simpsons_sample$age) + 1.96*sd(bootstrap_dist$resamplemean) ``` ``` ## [1] 63.28537 ``` ] .pull-right[ ![](dapr2_09_BootstrapLM_files/figure-html/unnamed-chunk-40-1.svg)<!-- --> ] --- ## Sampling Distributions and CIs are not just for means For a 95% Confidence Interval around a statistic: $$ \text{Lower Bound} = statistic - 1.96 \times SE $$ $$ \text{Upper Bound} = statistic + 1.96 \times SE $$ --- class: center, middle # Questions? --- class: inverse, center, middle # Part 3 ## Bootstrapping a linear model --- ## Bootstrapping a linear model - We have looked at bootstrapping of the mean - But we can compute a bootstrap distribution of any statistic - As a result, it is a straightforward extension to linear models - We can calculate `\(\beta\)` coefficients, `\(R^2\)`, `\(F\)`-statistics etc - In each case we generate a resample - Run the linear model - Save the statistic of interest - Repeat this `\(k\)` times - Generate the distribution of `\(k\)` statistics of interest --- ## `Boot` in `car` - The primary package in R for doing bootstrapping is called `boot` - But it is moderately complicated to use - Thankfully there is an easier-to-use wrapper in the `car` package called `Boot` - Note the capital letters ``` r library(car) ?Boot ``` --- ## `Boot` in `car` - The key arguments to `Boot` are the following: ``` r Boot(object, f, R) ``` + `object`: Your fitted model + `f`: Which bootstrap statistics to compute on each resample - By default, `f = coef`, returning the regression coefficients + `R`: how many bootstrap samples to compute - By default, `R = 999` --- ## Applying bootstrap .pull-left[ - **Step 1.** Run model ``` r m1 <- lm(weight ~ height, data = tib1) ``` - **Step 2.** Load `car` ``` r library(car) ``` - **Step 3.** Run `Boot` ``` r boot_m1 <- 
Boot(m1, R = 1000) ``` ] .pull-right[ <table> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:right;"> height </th> <th style="text-align:right;"> weight </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> John </td> <td style="text-align:right;"> 1.52 </td> <td style="text-align:right;"> 54 </td> </tr> <tr> <td style="text-align:left;"> Peter </td> <td style="text-align:right;"> 1.60 </td> <td style="text-align:right;"> 49 </td> </tr> <tr> <td style="text-align:left;"> Robert </td> <td style="text-align:right;"> 1.68 </td> <td style="text-align:right;"> 50 </td> </tr> <tr> <td style="text-align:left;"> David </td> <td style="text-align:right;"> 1.78 </td> <td style="text-align:right;"> 67 </td> </tr> <tr> <td style="text-align:left;"> George </td> <td style="text-align:right;"> 1.86 </td> <td style="text-align:right;"> 70 </td> </tr> <tr> <td style="text-align:left;"> Matthew </td> <td style="text-align:right;"> 1.94 </td> <td style="text-align:right;"> 110 </td> </tr> <tr> <td style="text-align:left;"> Bradley </td> <td style="text-align:right;"> 2.09 </td> <td style="text-align:right;"> 98 </td> </tr> </tbody> </table> ] --- ## Applying bootstrap `$$weight\sim height$$` - **Step 4.** See summary results ``` r summary(boot_m1) ``` ``` ## ## Number of bootstrap replications R = 1000 ## original bootBias bootSE bootMed ## (Intercept) -115.98 -7.8534 54.738 -117.81 ## height 105.04 4.0292 31.544 105.45 ``` --- ## Applying bootstrap `$$weight\sim height$$` - **Step 5.** Calculate confidence interval ``` r Confint(boot_m1, type = "perc") ``` ``` ## Bootstrap percent confidence intervals ## ## Estimate 2.5 % 97.5 % ## (Intercept) -115.9788 -256.21544 -27.94371 ## height 105.0402 51.32111 184.23926 ``` -- > **Test your understanding:** Given that the intercept and slopes are interpreted in the same way as they would be in the `lm` output, how would you interpret these estimates? 
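---

## Looking inside the bootstrap

One way to connect the `Confint` output above back to the bootstrap distribution is to inspect the resample coefficients directly. The object returned by `Boot()` stores these in its `$t` matrix (one row per resample, one column per coefficient). A sketch, re-creating the example data from the slides:

``` r
library(car)

# the height/weight data from the example table
tib1 <- data.frame(
  height = c(1.52, 1.60, 1.68, 1.78, 1.86, 1.94, 2.09),
  weight = c(54, 49, 50, 67, 70, 110, 98)
)

m1 <- lm(weight ~ height, data = tib1)
boot_m1 <- Boot(m1, R = 1000)

# bootstrap distribution of the slope (second coefficient)
hist(boot_m1$t[, 2], main = "Bootstrap distribution of the height slope")

# the bootSE reported by summary() is the SD of this column
sd(boot_m1$t[, 2])

# the percentile CI cuts off the most extreme 2.5% in each tail,
# matching Confint(boot_m1, type = "perc")
quantile(boot_m1$t[, 2], probs = c(0.025, 0.975))
```

Because resampling is random, the exact SE and interval bounds will differ slightly from the values printed on the previous slides.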
--- ## Interpreting the results + Let's re-scale `height` to be in centimetres, mean-centre it, and re-run the model ``` r tib1 <- tib1 %>% mutate( heightcm = height*100 ) m2 <- lm(weight ~ scale(heightcm, scale=F), data = tib1) boot_m2 <- Boot(m2, R = 1000) ``` --- ## Interpreting the results ``` r Confint(boot_m2, type = "perc") ``` ``` ## Bootstrap percent confidence intervals ## ## Estimate 2.5 % 97.5 % ## (Intercept) 71.142857 62.5664207 81.426269 ## scale(heightcm, scale = F) 1.050402 0.5130964 1.839142 ``` > **Test your understanding:** Now how will you interpret these estimates? What about the CI? -- > The average expected weight of participants with *average* height (178cm) is 71.1kg. > For every cm increase in height, there is an estimated 1.05kg increase in weight. > We can be 95% confident that the CI [0.51 , 1.84] includes the true population `\(\beta_1\)` estimate. Because the CI does not include 0, we can reject the null at `\(\alpha = 0.05\)` --- class: center, middle ## Questions? --- ## Summary - Good samples are representative, random and unbiased - Bootstrap resampling is a tool to construct a *bootstrap distribution* of any statistic which, with sufficient resamples, will approximate the *sampling distribution* of the statistic - Confidence Intervals are a tool for considering the plausible values for an unknown population parameter - We can use the bootstrap SE to calculate CIs - And using bootstrap CIs is a plausible approach when we have assumption violations, and other issues, in our linear models + We have seen how to apply this for `lm` using `Boot` from `car`. 
--- ## This week .pull-left[ ### Tasks <img src="figs/labs.svg" width="10%" /> **Attend your lab and work together on the exercises** <br> <img src="figs/exam.svg" width="10%" /> **Complete the weekly quiz** ] .pull-right[ ### Support <img src="figs/forum.svg" width="10%" /> **Help each other on the Piazza forum** <br> <img src="figs/oh.png" width="10%" /> **Attend office hours (see Learn page for details)** ]