class: center, middle, inverse, title-slide .title[ #
Linear Regression Refresh
] .subtitle[ ## Data Analysis for Psychology in R 3 ] .author[ ### Josiah King ] .institute[ ### Department of Psychology
The University of Edinburgh ] --- # QUICK ANNOUNCEMENTS - This afternoon's lecture is in EICC Sidlaw Theatre (see Learn announcement) - Please look at the "Hello and Welcome to DAPR3 (README FIRST)" page on Learn. - The course intro video covers _A LOT_ of logistics - Please update/install RStudio on your computers - see Learn page for details and instructions - Do this sooner rather than later. --- class: inverse, center, middle <h1 style="text-align: left;">This Lecture:</h1> <h3 style="text-align: left;">1. Linear Models Refresher</h3> --- # Models .pull-left[ __deterministic__ given the same input, deterministic functions return *exactly* the same output - `\(y = mx + c\)` - surface area of sphere = `\(4\pi r^2\)` - height of fall = `\(1/2 g t^2\)` - `\(g = \textrm{acceleration due to gravity, }9.8m/s^2\)` - `\(t = \textrm{time (in seconds) of fall}\)` ] -- .pull-right[ __statistical__ .br2.f4.white.bg-gray[ $$ \textrm{outcome} = (\textrm{model}) + \color{black}{\textrm{error}} $$ ] - handspan = height + randomness - cognitive test score = age + premorbid IQ + ... + randomness ] --- # The Linear Model .br3.pa2.f2[ $$ `\begin{align} \color{red}{\textrm{outcome}} & = \color{blue}{(\textrm{model})} + \textrm{error} \\ \color{red}{y_i} & = \color{blue}{\beta_0 \cdot{} 1 + \beta_1 \cdot{} x_i} + \varepsilon_i \\ \text{where } \\ \varepsilon_i & \sim N(0, \sigma) \text{ independently} \\ \end{align}` $$ ] --- # The Linear Model .flex.items-top[ .w-50.pa2[ Our proposed model of the world: `\(\color{red}{y_i} = \color{blue}{\beta_0 \cdot{} 1 + \beta_1 \cdot{} x_i} + \varepsilon_i\)` {{content}} ] .w-50.pa2[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/bb-1.svg)<!-- --> ]] -- Our model _fitted_ to some data (note the `\(\widehat{\textrm{hats}}\)`): `\(\hat{y}_i = \color{blue}{\hat \beta_0 \cdot{} 1 + \hat \beta_1 \cdot{} x_i}\)` {{content}} -- For the `\(i^{th}\)` observation: - `\(\color{red}{y_i}\)` is the value we observe for `\(x_i\)` - `\(\hat{y}_i\)` is the value the model _predicts_ for `\(x_i\)` - `\(\color{red}{y_i} = \hat{y}_i + \hat\varepsilon_i\)` --- # An Example .flex.items-top[ .w-50.pa2[ `\(\color{red}{y_i} = \color{blue}{5 \cdot{} 1 + 2 \cdot{} x_i} + \hat\varepsilon_i\)` __e.g.__ for the observation `\(x_i = 1.2, \; y_i = 9.9\)`: $$ `\begin{align} \color{red}{9.9} & = \color{blue}{5 \cdot{}} 1 + \color{blue}{2 \cdot{}} 1.2 + \hat\varepsilon_i \\ & = 7.4 + \hat\varepsilon_i \\ & = 7.4 + 2.5 \\ \end{align}` $$ ] .w-50.pa2[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/errplot-1.svg)<!-- --> ]] --- # Categorical Predictors .pull-left[ <br><table> <thead> <tr> <th style="text-align:left;"> y </th> <th style="text-align:left;"> x </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 7.99 </td> <td style="text-align:left;"> Category1 </td> </tr> <tr> <td style="text-align:left;"> 4.73 </td> <td style="text-align:left;"> Category0 </td> </tr> <tr> <td style="text-align:left;"> 3.66 </td> <td style="text-align:left;"> Category0 </td> </tr> <tr> <td style="text-align:left;"> 3.41 </td> <td style="text-align:left;"> Category0 </td> </tr> <tr> <td style="text-align:left;"> 5.75 </td> <td style="text-align:left;"> Category1 </td> </tr> <tr> <td style="text-align:left;"> 5.66 </td> <td style="text-align:left;"> Category0 </td> </tr> <tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... 
</td> </tr> </tbody> </table> ] .pull-right[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-2-1.svg)<!-- --> ] --- # Multiple Predictors .pull-left[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-3-1.svg)<!-- --> ] -- .pull-right[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-4-1.svg)<!-- --> ] --- exclude: true # Multiple Predictors .pull-left[ <img src="jk_img_sandbox/SSblank1.png" width="400px" height="350px" style="display: block; margin: auto;" /> ] .pull-right[] --- # Multiple Predictors .pull-left[ <img src="jk_img_sandbox/SSblank1cov.png" width="400px" height="350px" style="display: block; margin: auto;" /> ] .pull-right[ ```r lm(y ~ x1) ``` ``` ## ## Call: ## lm(formula = y ~ x1) ## ## Coefficients: ## (Intercept) x1 ## -0.0509 0.2007 ``` ] --- # Multiple Predictors .pull-left[ <img src="jk_img_sandbox/SSblank1_uncor.png" width="400px" height="350px" style="display: block; margin: auto;" /> ] .pull-right[ ```r lm(y ~ x1 + x2u) ``` ``` ## ## Call: ## lm(formula = y ~ x1 + x2u) ## ## Coefficients: ## (Intercept) x1 x2u ## -0.0446 0.2007 0.1304 ``` ] --- # Multiple Predictors .pull-left[ <img src="jk_img_sandbox/SStype3.png" width="400px" height="350px" style="display: block; margin: auto;" /> ] .pull-right[ ```r lm(y ~ x1 + x2) ``` ``` ## ## Call: ## lm(formula = y ~ x1 + x2) ## ## Coefficients: ## (Intercept) x1 x2 ## -0.0644 0.0912 0.3302 ``` {{content}} ] -- ```r resx1.x2 = resid(lm(x1 ~ x2)) lm(y ~ resx1.x2) ``` ``` ## ## Call: ## lm(formula = y ~ resx1.x2) ## ## Coefficients: ## (Intercept) resx1.x2 ## -0.0697 0.0912 ``` --- # Interactions .pull-left[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-14-1.svg)<!-- --> ] -- .pull-right[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-15-1.svg)<!-- --> ] --- # Interactions .pull-left[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-16-1.svg)<!-- --> ] .pull-right[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-17-1.svg)<!-- --> ] --- # Notation `\(\begin{align} \color{red}{y} \;\;\;\; & = \;\;\;\;\; \color{blue}{\beta_0 \cdot{} 1 + \beta_1 \cdot{} x_1 + ... 
+ \beta_k \cdot x_k} & + & \;\;\;\varepsilon \\ \qquad \\ \color{red}{\begin{bmatrix}y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ \vdots \\ y_n \end{bmatrix}} & = \color{blue}{\begin{bmatrix} 1 & x_{11} & x_{21} & \dots & x_{k1} \\ 1 & x_{12} & x_{22} & & x_{k2} \\ 1 & x_{13} & x_{23} & & x_{k3} \\ 1 & x_{14} & x_{24} & & x_{k4} \\ 1 & x_{15} & x_{25} & & x_{k5} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \dots & x_{kn} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}} & + & \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \vdots \\ \varepsilon_n \end{bmatrix} \\ \qquad \\ \\\color{red}{\boldsymbol y} \;\;\;\;\; & = \qquad \qquad \;\;\; \mathbf{\color{blue}{X \qquad \qquad \qquad \;\;\;\: \boldsymbol \beta}} & + & \;\;\; \boldsymbol \varepsilon \\ \end{align}\)` --- # Link functions `\(\begin{align} \color{red}{y} = \mathbf{\color{blue}{X \boldsymbol{\beta}} + \boldsymbol{\varepsilon}} & \qquad & (-\infty, \infty) \\ \qquad \\ \qquad \\ \color{red}{ln \left( \frac{p}{1-p} \right) } = \mathbf{\color{blue}{X \boldsymbol{\beta}} + \boldsymbol{\varepsilon}} & \qquad & [0,1] \\ \qquad \\ \qquad \\ \color{red}{ln (y) } = \mathbf{\color{blue}{X \boldsymbol{\beta}} + \boldsymbol{\varepsilon}} & \qquad & (0, \infty) \\ \end{align}\)` --- # Linear Models in R - Linear regression ```r linear_model <- lm(continuous_y ~ x1 + x2 + x3*x4, data = df) ``` - Logistic regression ```r logistic_model <- glm(binary_y ~ x1 + x2 + x3*x4, data = df, family=binomial(link="logit")) ``` - Poisson regression ```r poisson_model <- glm(count_y ~ x1 + x2 + x3*x4, data = df, family=poisson(link="log")) ``` --- # Null Hypothesis Testing ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-21-1.svg)<!-- --> --- # Inference for Coefficients <img src="jk_img_sandbox/sum1.png" width="2560" height="500px" /> --- # Inference for Coefficients <img src="jk_img_sandbox/sum2.png" width="2560" height="500px" /> --- # Inference for Coefficients <img src="jk_img_sandbox/sum3.png" width="2560" height="500px" /> --- # Inference for Coefficients <img src="jk_img_sandbox/sum4.png" width="2560" height="500px" /> --- # Sums of Squares Rather than specific coefficients, we can also think of our model as a whole. We can talk in terms of the sums of squared residuals `\(\sum^{n}_{i=1}(y_i - \hat y_i)^2\)`. .pull-left[ <img src="jk_img_sandbox/SS3xmodel1.png" width="400px" height="350px" style="display: block; margin: auto;" /> ] .pull-right[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-27-1.svg)<!-- --> ] --- # `\(R^2\)` .pull-left[ <img src="jk_img_sandbox/SSr2.png" width="400px" height="350px" style="display: block; margin: auto;" /> ] .pull-right[ `\(R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}\)` ```r mdl <- lm(y ~ x1 + x2) summary(mdl)$r.squared ``` ``` ## [1] 0.1935 ``` ] --- # Inference: Joint tests We can test reduction in residual sums of squares: .pull-left[ ```r m1 <- lm(y ~ x1, data = df) ``` <img src="jk_img_sandbox/SS3xmodel1.png" width="400px" height="350px" style="display: block; margin: auto;" /> ] .pull-right[ ```r m2 <- lm(y ~ x1 + x2 + x3, data = df) ``` <img src="jk_img_sandbox/SS3xfullmodel.png" width="400px" height="350px" style="display: block; margin: auto;" /> ] --- # Inference: Joint tests i.e. 
isolating the improvement in model fit due to the inclusion of additional parameters .pull-left[ ```r m1 <- lm(y ~ x1, data = df) m2 <- lm(y ~ x1 + x2 + x3, data = df) anova(m1,m2) ``` ``` ## Analysis of Variance Table ## ## Model 1: y ~ x1 ## Model 2: y ~ x1 + x2 + x3 ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 98 1141 ## 2 96 357 2 785 106 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ] .pull-right[ <img src="jk_img_sandbox/SS3xincrement.png" width="400px" height="350px" style="display: block; margin: auto;" /> ] --- # Inference: Joint tests "additional parameters" could be a set of coefficients for levels of a categorical variable. This provides a way of assessing "are there differences in group means?". .pull-left[ ```r m1 = lm(y ~ x1, data = df) m2 = lm(y ~ x1 + species, data = df) coef(m2) ``` ``` ## (Intercept) x1 speciesb speciesc speciesd ## 13.841 1.059 -3.162 -2.943 -3.415 ``` ```r anova(m1, m2) ``` ``` ## Analysis of Variance Table ## ## Model 1: y ~ x1 ## Model 2: y ~ x1 + species ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 98 1141 ## 2 95 950 3 192 6.39 0.00055 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ] .pull-right[ <img src="jk_img_sandbox/SSblankq.png" width="400px" height="350px" style="display: block; margin: auto;" /> ] --- exclude: true # Inference: Joint tests This is kind of where traditional analysis of variance sits. Think of it as testing the addition of each variable entered into the model. .pull-left[ ```r m2 = lm(y ~ x1 + species, data = df) anova(m2) ``` ``` ## Analysis of Variance Table ## ## Response: y ## Df Sum Sq Mean Sq F value Pr(>F) ## x1 1 91 90.6 9.06 0.00334 ** ## species 3 192 63.9 6.39 0.00055 *** ## Residuals 95 950 10.0 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ] .pull-right[ <img src="jk_img_sandbox/SStype1.png" width="400px" height="350px" style="display: block; margin: auto;" /> ] --- # Assumptions Our model: `\(\color{red}{y} = \color{blue}{\mathbf{X \boldsymbol \beta}} + \varepsilon \qquad \text{where } \boldsymbol \varepsilon \sim N(0, \sigma) \text{ independently}\)` Our ability to generalise from the model we fit on sample data to the wider population requires making some _assumptions._ -- - assumptions about the nature of the **model** .tr[ (linear) ] -- - assumptions about the nature of the **errors** .tr[ (normal) ] .footnote[ You can also phrase the linear model as: `\(\color{red}{\boldsymbol y} \sim Normal(\color{blue}{\mathbf{X \boldsymbol \beta}}, \sigma)\)` ] --- # Assumptions: The Broad Idea All our work here is aimed at making **models of the world**. - Models are models. They are simplifications. They are therefore wrong. .pull-left[] .pull-right[ ![](jk_img_sandbox/joeymap.jpg) ] --- count:false # Assumptions: The Broad Idea All our work here is aimed at making **models of the world**. - Models are models. They are simplifications. They are therefore wrong. - Our residuals ( `\(y - \hat{y}\)` ) reflect everything that we **don't** account for in our model. -- - In an ideal world, our model accounts for _all_ the systematic relationships. The leftovers (our residuals) are just random noise. -- - If our model is mis-specified, or we don't measure some systematic relationship, then our residuals will reflect this. 
-- - We check by examining how much the residuals look like randomness (zero mean, normally distributed, constant variance, i.i.d. "independent and identically distributed") - _these ideas tend to get referred to as our "assumptions"_ -- - We will **never** know whether our residuals contain only randomness - we can never observe everything! --- # Assumptions .pull-left[ What does "zero mean and constant variance" look like? - mean of the residuals = zero across the predicted values on the linear predictor. - spread of residuals is normally distributed and constant across the predicted values on the linear predictor. ] .pull-right[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-42-1.svg)<!-- --> ] --- # Assumptions .pull-left[ What does "zero mean and constant variance" look like? - __mean of the residuals = zero across the predicted values on the linear predictor.__ - spread of residuals is normally distributed and constant across the predicted values on the linear predictor. ] .pull-right[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-43-1.svg)<!-- --> ] --- # Assumptions .pull-left[ What does "zero mean and constant variance" look like? - mean of the residuals = zero across the predicted values on the linear predictor. - __spread of residuals is normally distributed and constant across the predicted values on the linear predictor.__ ] .pull-right[ ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-44-1.svg)<!-- --> ] --- # Assumptions .pull-left[ What does "zero mean and constant variance" look like? - mean of the residuals = zero across the predicted values on the linear predictor. - spread of residuals is normally distributed and constant across the predicted values on the linear predictor. ] .pull-right[ __`plot(model)`__ ```r my_model <- lm(y ~ x, data = df) plot(my_model, which = 1) ``` ![](dapr3_2324_01a_lmrefresh_files/figure-html/unnamed-chunk-46-1.svg)<!-- --> ] --- # Assumptions: Recipe Book <br><br> <center> <div class="acronym"> L </div> inearity<br> <div class="acronym"> I </div> ndependence<br> <div class="acronym"> N </div> ormality<br> <div class="acronym"> E </div> qual variance<br> </center> .footnote["Line without N is a Lie!" (Umberto)] --- # What if our model doesn't meet assumptions? - is our model mis-specified? - is the relationship non-linear? higher order terms? (e.g. `\(y \sim x + x^2\)`) - is there an omitted variable or interaction term? -- - transform the outcome variable? - makes things look more "normal" - but can make things more tricky to interpret: `lm(y ~ x)` and `lm(log(y) ~ x)` are quite different models -- - bootstrap\* (a rough sketch is included after the Summary slide) - do many times: resample (w/ replacement) your data, and refit your model. - obtain a distribution of the parameter estimate of interest. - compute a confidence interval for the estimate - celebrate .footnote[\* not great with small samples.] -- __looking ahead:__ these don't help if we have violated our assumption of independence... --- # Summary - we can fit a linear regression model which takes the form `\(\color{red}{y} = \color{blue}{\mathbf{X} \boldsymbol{\beta}} + \boldsymbol{\varepsilon}\)` - in R, we fit this with `lm(y ~ x1 + ... + xk, data = mydata)`. - we can extend this to different link functions to model outcome variables which follow different distributions. - when drawing inferences from a fitted model to the broader population, we rely on certain assumptions. - one of these is that the errors are independent. 
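---
# Extra: a bootstrap sketch

To make the "resample and refit" recipe from the earlier slide concrete, here is a minimal sketch of a case-resampling bootstrap for a single regression coefficient. The data (`df`, with `x` and `y`) are simulated purely for illustration; in practice you may prefer a packaged implementation (e.g. the `boot` or `car` packages).

```r
# simulate some toy data (purely illustrative)
set.seed(123)
df <- data.frame(x = rnorm(100))
df$y <- 5 + 2 * df$x + rnorm(100)

# the estimate from the original sample
coef(lm(y ~ x, data = df))["x"]

# resample rows with replacement, refit the model, keep the coefficient of interest
boot_est <- replicate(1000, {
  resample <- df[sample(nrow(df), replace = TRUE), ]
  coef(lm(y ~ x, data = resample))["x"]
})

# percentile 95% confidence interval from the bootstrap distribution
quantile(boot_est, probs = c(.025, .975))
```

The same idea applies to any statistic you can compute from a refitted model; the percentile interval shown here is just the simplest way to summarise the bootstrap distribution.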
--- class: inverse, center, middle, animated, rotateInDownLeft # End