class: center, middle, inverse, title-slide

# Week 8: The Linear Model (ctd)
## Univariate Statistics and Methodology using R
### Martin Corley
### Department of Psychology
The University of Edinburgh

---
class: inverse, center, middle

# Part 1

## Checking Assumptions

---
# Back on the Road Again!

.pull-left[
![](lecture_7_files/figure-html/frecap2-1.svg)<!-- -->

.tc[
"for every extra 0.01% blood alcohol, reaction time slows down by around 32 ms"
]
]
.pull-right[
![](lecture_7_files/img/playmo_police.jpg)
]

---
# The Model We Nearly Made

### All 50 Participants

```
##
## Call:
## lm(formula = RT ~ BloodAlc, data = dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -115.92  -40.42    1.05   42.93  126.64
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)      321         91    3.53  0.00093 ***
## BloodAlc        3228        888    3.64  0.00067 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.8 on 48 degrees of freedom
## Multiple R-squared:  0.216,	Adjusted R-squared:  0.2
## F-statistic: 13.2 on 1 and 48 DF,  p-value: 0.000673
```

???
- note the _huge_ effect of `BloodAlc`
- why is this? [pause?]
- because these are "real" units, not units x 100

---
# Assumptions of Linear Models

.pull-left[
.br3.bg-gray.white.pa2[
### required

- **linearity** of relationship(!)
- for the _residuals_:
  + **normality**
  + **homogeneity of variance**
  + **independence**
]
]
.pull-right[
.br3.bg-gray.white.pa2[
### desirable

<!-- - uncorrelated predictors (no collinearity) -->
- no 'bad' (overly influential) observations
]]

---
# Residuals

.pull-left[
`$$y_i=b_0+b_1\cdot{}x_i+\epsilon_i$$`

`$$\color{red}{\epsilon\sim{}N(0,\sigma)~\text{independently}}$$`

- normally distributed (mean should be `\(\simeq{}\)` zero)
- homogeneous (differences from `\(\hat{y}\)` shouldn't be systematically smaller or larger for different `\(x\)`)
- independent (residuals shouldn't influence other residuals)
]
.pull-right[
![](lecture_7_files/figure-html/resid-1.svg)<!-- -->
]

---
# At A Glance

```r
summary(mod)
```

```
##
## Call:
## lm(formula = RT ~ BloodAlc, data = dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
*## -115.92  -40.42    1.05   42.93  126.64
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)      321         91    3.53  0.00093 ***
## BloodAlc        3228        888    3.64  0.00067 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.8 on 48 degrees of freedom
## Multiple R-squared:  0.216,	Adjusted R-squared:  0.2
## F-statistic: 13.2 on 1 and 48 DF,  p-value: 0.000673
```

---
# In More Detail

.pull-left[
### linearity

```r
plot(mod,which=1)
```

- plots residuals `\(\epsilon_i\)` against fitted values `\(\hat{y}_i\)`
- the 'average residual' is roughly zero across `\(\hat{y}\)`, so relationship is likely to be linear
]
.pull-right[
![](lecture_7_files/figure-html/resid1-1.svg)
]

---
# In More Detail

.pull-left[
### normality

```r
hist(resid(mod))
```
]
.pull-right[
![](lecture_7_files/figure-html/resid2-1.svg)
]

---
count: false
# In More Detail

.pull-left[
### normality

```r
plot(density(resid(mod)))
```

- check that residuals `\(\epsilon\)` are approximately normally distributed
- in fact there's a better way of doing this...
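that better way is the **Q-Q plot**; the comparison it draws can be sketched in a few lines of base R — here with _simulated_ residuals rather than `resid(mod)`, so the snippet stands alone:

```r
# sketch of what a Q-Q plot compares, using simulated residuals
set.seed(1)
e <- scale(rnorm(50))     # 50 standardised "residuals"
qnorm(c(.02, .04))        # theoretical: where 2% and 4% *should* lie if normal
quantile(e, c(.02, .04))  # observed: where they actually lie in this sample
```

the full version of this comparison, across all quantiles, is what `plot(mod, which=2)` draws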
]
.pull-right[
![](lecture_7_files/figure-html/resid3-1.svg)
]

---
name: QQ
# In More Detail

.pull-left[
### normality

```r
plot(mod,which=2)
```

- **Q-Q plot** compares the residuals `\(\epsilon\)` against a known distribution (here, normal)
- observations close to the straight line mean residuals are approximately normal
- numbered observations refer to _row numbers_ in the original data, for checking
]
.pull-right[
![](lecture_7_files/figure-html/resid4-1.svg)
]

---
# Q-Q Plots

.pull-left[
#### y axis

- for a normal distribution, what values _should_ (say) 2%, or 4% of the observations lie below?
- expressed in "standard deviations from the mean"

```r
qnorm(c(.02,.04))
```

```
## [1] -2.054 -1.751
```
]

--

.pull-right[
#### x axis

- from our residuals, what values _are_ (say) 2%, or 4%, of observations found to be less than?
- convert to "standard deviations from the mean"

```r
quantile(scale(resid(mod)),c(.02,.04))
```

```
##     2%     4%
## -1.527 -1.466
```
]

--

- Q-Q Plot shows these values plotted against each other

---
template: QQ

---
# In More Detail

.pull-left[
### homogeneity of variance

```r
plot(mod,which=3)
```

- the _size_ of the residuals is approximately the same across values of `\(\hat{y}\)`
]
.pull-right[
![](lecture_7_files/figure-html/resid5-1.svg)
]

---
# Visual vs Other Methods

- statistical ways of checking assumptions will be introduced in labs
- they tend to have limitations (for example, they're susceptible to sample size)
- nothing beats looking at plots like these (and `plot(<model>)` makes it easy)
- however, two things:

--

.pt2[
1. there are no criteria for deciding exactly when assumptions are sufficiently met
   + it's a matter of experience and judgement
1. we need to talk about **independence** of residuals
]

---
class: inverse, center, middle, animated, flip
# End of Part 1

---
class: inverse, center, middle

# Part 2

## Independence, Influence

---
# Independence

- no easy way to check independence of residuals
- in part, because it depends on the _source_ of the observations

.pt2[
- one determinant might be a single person being observed multiple times
- e.g., my reaction times might tend to be slower than yours<br/>
  `\(\rightarrow\)` multivariate statistics
]

---
# Independence

- another determinant might be _time_
- observations in a sequence might be autocorrelated
- can be checked using the Durbin-Watson Test from the `car` package

```r
library(car)
dwt(mod)
```

```
##  lag Autocorrelation D-W Statistic p-value
*##    1         -0.1377          2.22    0.42
##  Alternative hypothesis: rho != 0
```

- shows no autocorrelation at lag 1

---
# Influence

.center[
![](lecture_7_files/figure-html/baddata1-1.svg)<!-- -->
]

- even substantial **outliers** may only have small effects on the model
- here, only the intercept is affected

---
# Influence

.center[
![](lecture_7_files/figure-html/baddata2-1.svg)<!-- -->
]

- observations with high **leverage** are inconsistent with other data, but may not be distorting the model

---
# Influence

.center[
![](lecture_7_files/figure-html/baddata3-1.svg)<!-- -->
]

- we care about observations with high **influence** (outliers with high leverage)

---
# Cook's Distance

.center[
![](lecture_7_files/figure-html/cook-1.svg)<!-- -->
]

- a standardised measure of "how much the model differs without observation `\(i\)`"

---
.flex.items-center[.w-10.pa1[![:scale 70%](lecture_7_files/img/danger.svg)] .f1.w-80.pa1[Cook's Distance]]

.pt3[
`$$D_i=\frac{\sum_{j=1}^{n}{(\hat{y}_j-\hat{y}_{j(i)})^2}}{(p+1)\hat{\sigma}^2}$$`

- `\(\hat{y}_j\)` is the `\(j\)`th fitted value
- `\(\hat{y}_{j(i)}\)` is the `\(j\)`th value from a fit which doesn't include observation `\(i\)`
- `\(p\)` is the number of regression coefficients
- `\(\hat{\sigma}^2\)` is the estimated variance from the fit, i.e., mean squared error
]

---
# Cook's Distance

.pull-left[
```r
plot(mod,which=4)
```

- observations labelled by row
- various rules of thumb, but start looking when Cook's Distance > 0.5
]
.pull-right[
![](lecture_7_files/figure-html/cook2-1.svg)
]

---
class: animated, rollIn
# Learning to Read

.pull-left[
![](lecture_7_files/img/playmo_teach.jpg)
]
.pull-right[
- the Playmo School has been evaluating its reading programmes, using 50 students
- ages of students
- hours per week students spend reading of their own volition
- whether they are taught using phonics or whole-word methods
- **outcome: "reading age"**
]

---
count: false
# Learning to Read

.pull-left[
![](lecture_7_files/img/playmo_teach.jpg)
]
.pull-right[
.center[
|    age | hrs_wk | method  |  R_AGE |
|-------:|-------:|:--------|-------:|
| 10.266 |  5.821 | phonics | 15.962 |
|  7.385 |  4.281 | phonics | 11.598 |
| 10.445 |  5.552 | phonics | 15.062 |
|  6.189 |  3.063 | phonics |  8.065 |
| 10.463 |  5.690 | phonics | 15.125 |
|  7.801 |  5.158 | word    |  7.194 |
|  9.981 |  3.487 | word    | 10.208 |
|  8.715 |  5.110 | word    |  8.268 |
|  7.408 |  3.435 | word    |  6.570 |
|  7.811 |  5.625 | word    |  6.718 |
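data like the table above can be held in a plain R data frame — a sketch, with just the first two rows typed in by hand:

```r
# first two rows of the table above, entered directly (a sketch,
# not the full `reading` data frame used in these slides)
reading_sample <- data.frame(
  age    = c(10.266, 7.385),
  hrs_wk = c(5.821, 4.281),
  method = c("phonics", "phonics"),
  R_AGE  = c(15.962, 11.598)
)
str(reading_sample)
```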
]]

???
- this is just some of the data in the dataframe
- on the right you can see the outcome variable, `R_AGE` for reading age.
- note that I tend to put my dependent variables in all caps, just so that I can recognise at a glance that they're what we measured.

---
count: false
# Learning to Read

.pull-left[
![](lecture_7_files/img/playmo_teach.jpg)
]
.pull-right[
.center[
|    age | hrs_wk | method  |  R_AGE |
|-------:|-------:|:--------|-------:|
| 10.266 |  5.821 | phonics | 15.962 |
|  7.385 |  4.281 | phonics | 11.598 |
| 10.445 |  5.552 | phonics | 15.062 |
|  6.189 |  3.063 | phonics |  8.065 |
| 10.463 |  5.690 | phonics | 15.125 |
|  7.801 |  5.158 | word    |  7.194 |
|  9.981 |  3.487 | word    | 10.208 |
|  8.715 |  5.110 | word    |  8.268 |
|  7.408 |  3.435 | word    |  6.570 |
|  7.811 |  5.625 | word    |  6.718 |
]]

---
# Does Practice Affect Reading Age?

.pull-left[
```r
p <- reading %>% ggplot(aes(x=hrs_wk,y=R_AGE)) +
  geom_point(size=3) +
  ylab("reading age") +
  xlab("hours reading/week")
p
```
]
.pull-right[
![](lecture_7_files/figure-html/plot1-1.svg)
]

--

- hours per week is correlated with reading age: `\(r=0.3483\)`, `\(p=0.0132\)`
- we can use a linear model to say something about the effect size

---
# Does Practice Affect Reading Age?

.pull-left[
```r
p + geom_smooth(method="lm")
```
]
.pull-right[
![](lecture_7_files/figure-html/plot2-1.svg)
]

--

- each extra hour per week spent reading adds 1.21 years to reading age

---
# A Linear Model

```r
mod <- lm(R_AGE ~ hrs_wk, data=reading)
summary(mod)
```

```
##
## Call:
## lm(formula = R_AGE ~ hrs_wk, data = reading)
##
## Residuals:
##    Min     1Q Median     3Q    Max
*## -5.10  -2.73   0.13   2.42   5.27
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
*## (Intercept)    3.639      1.857    1.96   0.0559 .
*## hrs_wk         1.211      0.351    3.46   0.0012 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.03 on 48 degrees of freedom
*## Multiple R-squared:  0.199,	Adjusted R-squared:  0.183
*## F-statistic: 11.9 on 1 and 48 DF,  p-value: 0.00116
```

---
class: center, middle, inverse

.f1[but...]

---
.center[
![](lecture_7_files/figure-html/diagp2-1.svg)<!-- -->
]

---
# Assumptions Not Met!

.pull-left[
- it seems that the assumptions aren't met for this model
- (another demonstration on the right)
- one reason for this can be that there's still systematic structure in the residuals
- i.e., _more than one thing_ which can explain the variance
]
.pull-right[
![](lecture_7_files/figure-html/re-1.svg)<!-- -->
]

---
class: inverse, center, middle, animated, flip
# End of Part 2

---
class: inverse, center, middle

# Part 3

## Multiple Regression

---
# Adding Age into the Equation

.pull-left[
- so far, have focused on effects of practice
- but presumably older children read better?
|    age | hrs_wk | method  |  R_AGE |
|-------:|-------:|:--------|-------:|
| 10.266 |  5.821 | phonics | 15.962 |
|  7.385 |  4.281 | phonics | 11.598 |
| 10.445 |  5.552 | phonics | 15.062 |
|  8.715 |  5.110 | word    |  8.268 |
|  7.408 |  3.435 | word    |  6.570 |
|  7.811 |  5.625 | word    |  6.718 |
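one way to firm up that hunch is a quick simulation — the data below are _made up_ (a slope of 1 reading-year per year of age, with an arbitrary noise level), not the real `reading` data frame:

```r
# sketch with simulated data (NOT the real `reading` data):
# if reading age grows with age, lm() should recover a positive slope
set.seed(42)
age   <- runif(50, 6, 11)                    # ages like those above
R_AGE <- 2 + 1 * age + rnorm(50, sd = 2.9)   # assumed slope: 1 year per year
coef(lm(R_AGE ~ age))                        # slope estimate should be near 1
```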
] .pull-right[
]

???
- **rotate cube leftward to show R_AGE~age**

---
# Another Model

.center[
![](lecture_7_files/figure-html/anmod-1.svg)<!-- -->
]

---
# Another Model

```r
mod2 <- lm(R_AGE ~ age, data=reading)
summary(mod2)
```

```
##
## Call:
## lm(formula = R_AGE ~ age, data = reading)
##
## Residuals:
##    Min     1Q Median     3Q    Max
*## -4.097 -2.655 -0.474  2.630  5.685
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
*## (Intercept)    1.867      1.857    1.00     0.32
*## age            1.010      0.228    4.42 0.000056 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.86 on 48 degrees of freedom
*## Multiple R-squared:  0.289,	Adjusted R-squared:  0.275
*## F-statistic: 19.6 on 1 and 48 DF,  p-value: 0.0000558
```

???
- point out that intercept is quite meaningless (reading age for age zero)

---
.center[
![](lecture_7_files/figure-html/diag2-1.svg)<!-- -->
]

---
# Two Models, No Answers

.pull-left[
- we now have two models that don't map well to assumptions
- each suggests an effect
  + one of `age`
  + one of `hrs_wk`
]
.pull-right[
- if we run them independently, the chances of a type 1 error are
  + `\(\frac{1}{20}\)` (`mod`, including `hrs_wk`)
  + `\(\frac{1}{20}\)` (`mod2`, including `age`)
- or **`\(\frac{1}{10}\)`** overall
]

--

.pt1[
]
.br3.pa2.bg-gray.white.tc[
we need to test multiple predictors in _one_ linear model
]

---
# Model Equations Again

`$$\color{red}{\textrm{outcome}_i}=\color{blue}{(\textrm{model})_i}+\textrm{error}_i$$`

`$$\color{red}{y_i}=\color{blue}{b_0\cdot{}1+b_1\cdot{}x_i}+\epsilon_i$$`

--

### linear model with two predictors

`$$\color{red}{y_i}=\color{blue}{b_0\cdot{}1+b_1\cdot{}x_{1i}+b_2\cdot{}x_{2i}}+\epsilon_i$$`

`$$\color{red}{\hat{y}_i}=\color{blue}{b_0\cdot{}1+b_1\cdot{}x_{1i}+b_2\cdot{}x_{2i}}$$`

--

.center[
`y ~ 1 + x1 + x2`

`R_AGE ~ 1 + hrs_wk + age` or `R_AGE ~ hrs_wk + age`<sup>1</sup>
]

.footnote[
<sup>1</sup> we'll come back to why order can matter in a bit
]

---
# Running A Multiple Regression

```r
mod.m <- lm(R_AGE ~ age + hrs_wk, data=reading)
summary(mod.m)
```

```
##
## Call:
## lm(formula = R_AGE ~ age + hrs_wk, data = reading)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.249 -2.266  0.742  2.351  3.453
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
*## (Intercept)   -2.092      2.128   -0.98  0.33039
*## age            0.882      0.214    4.12  0.00015 ***
*## hrs_wk         0.966      0.309    3.12  0.00307 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.63 on 47 degrees of freedom
## Multiple R-squared:  0.412,	Adjusted R-squared:  0.386
## F-statistic: 16.4 on 2 and 47 DF,  p-value: 0.00000388
```

---
# Running a Multiple Regression

```
## ...
## (Intercept)   -2.092      2.128   -0.98  0.33039
## age            0.882      0.214    4.12  0.00015 ***
## hrs_wk         0.966      0.309    3.12  0.00307 **
## ...
```

- there are _independent_ effects of age and practice
  + reading age improves by 0.8817 for each year of age
  + reading age improves by 0.9661 for each weekly hour of practice
- note that the _intercept_ (0 years old, 0 hours/week) is meaningless here

--

- important question: is this model _better_ than a model based just on age?

---
# Model Fit

```
## ...
## Residual standard error: 2.63 on 47 degrees of freedom
## Multiple R-squared:  0.412,	Adjusted R-squared:  0.386
## F-statistic: 16.4 on 2 and 47 DF,  p-value: 0.00000388
```

- in multiple regression, `\(R^2\)` measures the fit of the entire model
  + sum of individual `\(R^2\)`s _if predictors not correlated_
- `\(R^2=0.4115\)` looks better than the `\(R^2\)` for `mod2` (`age` as a predictor) of `\(0.2895\)`
- but _any_ predictor will improve `\(R^2\)` (chance associations guarantee this)

```r
*mod.r <- update(mod2, ~ . + runif(50))
summary(mod.r)
```

```
## ...
## Multiple R-squared:  0.32,	Adjusted R-squared:  0.291
## ...
```

???
- Adjusted `\(R^2\)` accounts for multiple predictors
- for `mod2`, it's 0.2747
- for `mod.r`, it's 0.2909

---
# Comparing Models

```
## ...
## Multiple R-squared:  0.32,	Adjusted R-squared:  0.291
*## F-statistic: 11 on 2 and 47 DF,  p-value: 0.000117
```

- `\(F\)` ratio compares model to the null model, which just has an intercept
- but we can also use `\(F\)` ratios to compare models in succession

```r
anova(mod.m)
```

```
## Analysis of Variance Table
##
## Response: R_AGE
##           Df Sum Sq Mean Sq F value   Pr(>F)
## age        1    160   159.5   23.12 0.000016 ***
## hrs_wk     1     67    67.3    9.75   0.0031 **
## Residuals 47    324     6.9
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
# Order Matters!

- `age` then `hrs_wk`

```
## Analysis of Variance Table
##
## Response: R_AGE
##           Df Sum Sq Mean Sq F value   Pr(>F)
## age        1    160   159.5   23.12 0.000016 ***
## hrs_wk     1     67    67.3    9.75   0.0031 **
## Residuals 47    324     6.9
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

- `hrs_wk` then `age`

```
## Analysis of Variance Table
##
## Response: R_AGE
##           Df Sum Sq Mean Sq F value  Pr(>F)
## hrs_wk     1    110   109.8    15.9 0.00023 ***
## age        1    117   117.0    16.9 0.00015 ***
## Residuals 47    324     6.9
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
# Type 1 vs. Type 3 SS

- order matters because R, by default, uses **Type 1** sums of squares
  + calculate each predictor's improvement to the model _in turn_
- compare to **Type 3** sums of squares
  + calculate each predictor's improvement to the model _taking all other predictors into account_
- huge debate about which is "better" (nobody likes Type 2)
- if using Type 1, _predictors should be entered into the model in a theoretically-motivated order_

---
# Type 1 vs. Type 3 SS

.pull-left[
### type 1
.center[
![:scale 70%](lecture_7_files/t13/type0-0.svg)
]]
.pull-right[
### type 3
.center[
![:scale 70%](lecture_7_files/t13/type0-0.svg)
]]

---
count: false
# Type 1 vs. Type 3 SS

.pull-left[
### type 1
.center[
![:scale 70%](lecture_7_files/t13/type1-1.svg)
]]
.pull-right[
### type 3
.center[
![:scale 70%](lecture_7_files/t13/type0-0.svg)
]]

---
count: false
# Type 1 vs. Type 3 SS

.pull-left[
### type 1
.center[
![:scale 70%](lecture_7_files/t13/type1-2.svg)
]]
.pull-right[
### type 3
.center[
![:scale 70%](lecture_7_files/t13/type0-0.svg)
]]

---
count: false
# Type 1 vs. Type 3 SS

.pull-left[
### type 1
.center[
![:scale 70%](lecture_7_files/t13/type1-3.svg)
]]
.pull-right[
### type 3
.center[
![:scale 70%](lecture_7_files/t13/type0-0.svg)
]]

---
count: false
# Type 1 vs. Type 3 SS

.pull-left[
### type 1
.center[
![:scale 70%](lecture_7_files/t13/type1-3.svg)
]]
.pull-right[
### type 3
.center[
![:scale 70%](lecture_7_files/t13/type3-1.svg)
]]

---
count: false
# Type 1 vs. Type 3 SS

.pull-left[
### type 1
.center[
![:scale 70%](lecture_7_files/t13/type1-3.svg)
]]
.pull-right[
### type 3
.center[
![:scale 70%](lecture_7_files/t13/type3-2.svg)
]]

---
count: false
# Type 1 vs. Type 3 SS

.pull-left[
### type 1
.center[
![:scale 70%](lecture_7_files/t13/type1-3.svg)
]]
.pull-right[
### type 3
.center[
![:scale 70%](lecture_7_files/t13/type3-3.svg)
]]

---
# The Two-Predictor Model

```r
mod.m <- lm(R_AGE ~ age + hrs_wk, data=reading)
summary(mod.m)
```

```
##
## Call:
## lm(formula = R_AGE ~ age + hrs_wk, data = reading)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.249 -2.266  0.742  2.351  3.453
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
*## (Intercept)   -2.092      2.128   -0.98  0.33039
*## age            0.882      0.214    4.12  0.00015 ***
*## hrs_wk         0.966      0.309    3.12  0.00307 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.63 on 47 degrees of freedom
## Multiple R-squared:  0.412,	Adjusted R-squared:  0.386
## F-statistic: 16.4 on 2 and 47 DF,  p-value: 0.00000388
```

---
# The Two-Predictor Model

.flex.items-center[
.w-25[
]
.w-50[
]
.w-25[
]]

---
class: inverse, center, middle, animated, flip
# End

---
# Acknowledgements

- icons by Diego Lavecchia from the [Noun Project](https://thenounproject.com/)