class: center, middle, inverse, title-slide

# Week 2: Introduction to the Linear Model
## Data Analysis for Psychology in R 2
### Tom Booth & Alex Doumas

### Department of Psychology
The University of Edinburgh

### AY 2020-2021

---
# Week's Learning Objectives

1. Be able to specify a simple linear model.
2. Understand and describe fitted values and residuals.
3. Be able to interpret the coefficients from a linear model.
4. Be able to test hypotheses and construct confidence intervals for the model coefficients.

---
# Topics for today

+ Moving on from the idea of a line and a function, we will discuss:
  + Least squares and the linear model
  + Differentiating measured data, fitted values and residuals
  + Calculating the slope and intercept

---
# Things to recap

+ This week we will build from:
  + the arithmetic mean
  + the concept of squared deviations

???
+ this is material to point students to.
+ no need to spend time on this here

---
# Recap: correlation

+ The correlation coefficient is a **standardized measure of association** between two variables.
+ We can calculate correlations for different data types.
  + For now, we will focus on two numeric (continuous) variables.
+ Correlations are typically visualized with **scatterplots**.

---
# Scatterplot

+ Scatterplots plot points at the (x, y) co-ordinates of two measured variables.
+ We plot each individual data point (typically a participant's pair of responses).
+ This produces the cloud of points.

---
# Scatterplot

.pull-left[
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-3-1.png" width="90%" />
]

.pull-right[
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:right;"> height </th>
   <th style="text-align:right;"> weight </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> John </td>
   <td style="text-align:right;"> 1.52 </td>
   <td style="text-align:right;"> 54 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Peter </td>
   <td style="text-align:right;"> 1.60 </td>
   <td style="text-align:right;"> 49 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Robert </td>
   <td style="text-align:right;"> 1.68 </td>
   <td style="text-align:right;"> 50 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> David </td>
   <td style="text-align:right;"> 1.78 </td>
   <td style="text-align:right;"> 67 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> George </td>
   <td style="text-align:right;"> 1.86 </td>
   <td style="text-align:right;"> 70 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Matthew </td>
   <td style="text-align:right;"> 1.94 </td>
   <td style="text-align:right;"> 110 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bradley </td>
   <td style="text-align:right;"> 2.09 </td>
   <td style="text-align:right;"> 98 </td>
  </tr>
</tbody>
</table>

+ `name` = nominal variable
+ `height` = height in metres, numeric
+ `weight` = weight in kg, numeric
]

---
# Strength of correlation

<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-5-1.png" width="55%" />

---
# Linear model

+ For the majority of this course, we will focus on how we move from the idea of an association to estimating a model for the relationship.
+ This model is the **linear model**.
+ When using a linear model, we are typically trying to explain variation in an **outcome** (Y, dependent, response) variable using one or more **predictor** (x, independent, explanatory) variables.
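+ In R, a model of this form is fit with the `lm()` function. A minimal sketch, assuming a data frame `dat` with placeholder columns `outcome` and `predictor` (all three names are illustrative only):

```r
# Sketch only: 'dat', 'outcome' and 'predictor' are placeholder names.
# lm() estimates the linear model by least squares (covered below).
model <- lm(outcome ~ predictor, data = dat)
summary(model)
```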
---
# Example

.pull-left[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> student </th>
   <th style="text-align:right;"> hours </th>
   <th style="text-align:right;"> score </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> ID1 </td>
   <td style="text-align:right;"> 0.5 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID2 </td>
   <td style="text-align:right;"> 1.0 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID3 </td>
   <td style="text-align:right;"> 1.5 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID4 </td>
   <td style="text-align:right;"> 2.0 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID5 </td>
   <td style="text-align:right;"> 2.5 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID6 </td>
   <td style="text-align:right;"> 3.0 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID7 </td>
   <td style="text-align:right;"> 3.5 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID8 </td>
   <td style="text-align:right;"> 4.0 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID9 </td>
   <td style="text-align:right;"> 4.5 </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID10 </td>
   <td style="text-align:right;"> 5.0 </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
**Simple data**

+ `student` = ID variable unique to each respondent
+ `hours` = the number of hours spent studying. This will be our predictor ( `\(x\)` )
+ `score` = test score ( `\(y\)` )

**Question: Do students who study more get higher scores on the test?**
]

---
# Scatterplot of our data

.pull-left[
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-7-1.png" width="90%" />
]

.pull-right[
{{content}}
]

--

<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-8-1.png" width="90%" />

{{content}}

???
+ we can visualize our data: the points move from bottom left to top right
+ so the association looks positive
+ Now let's add a line that represents the best model

---
# Definition of the line

+ The line can be described by two values:
  + **Intercept**: the point where the line crosses the `\(y\)`-axis, i.e. the value of `\(y\)` when `\(x = 0\)`
  + **Slope**: the gradient of the line, or rate of change

???
+ In our example, intercept = for someone who doesn't study, what score will they get?
+ Slope = for every hour of study, how much will my score change?

---
# Intercept and slope

.pull-left[
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-10-1.png" width="90%" />
]

.pull-right[
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-11-1.png" width="90%" />
]

---
# How to find a line?

+ The line represents a model of our data.
  + In our example, the model that best characterizes the relationship between hours of study and test score.
+ In the scatterplot, the data are represented by points.
+ So a good line is one that is "close" to all points (see the sketch on the next slide).
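---
# Plotting the data in R

A minimal sketch of this idea, using the example data from the table above (the plotting choices are illustrative only):

```r
# Example data (from the table above)
test <- data.frame(
  student = paste0("ID", 1:10),
  hours   = seq(0.5, 5, by = 0.5),
  score   = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)

# Scatterplot of score against hours
plot(score ~ hours, data = test,
     xlab = "Hours studied", ylab = "Test score")

# Overlay the least squares line; lm() is introduced on the next slide
abline(lm(score ~ hours, data = test))
```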
---
# Linear Model

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ `\(y_i\)` = the outcome variable (e.g. `score`)
+ `\(x_i\)` = the predictor variable (e.g. `hours`)
+ `\(\beta_0\)` = intercept
+ `\(\beta_1\)` = slope
+ `\(\epsilon_i\)` = residual (we will come to this shortly), where `\(\epsilon_i \sim N(0, \sigma)\)` independently
  + `\(\sigma\)` = standard deviation (spread) of the errors
  + The standard deviation of the errors, `\(\sigma\)`, is assumed constant

---
# Linear Model

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ **Why do we have `\(i\)` in some places and not others?**

--

+ `\(i\)` is a subscript indicating that each participant has their own value.
+ So each participant has their own:
  + score on the test ( `\(y_i\)` )
  + number of hours studied ( `\(x_i\)` ) and
  + residual term ( `\(\epsilon_i\)` )

--

+ **What does it mean that the intercept ( `\(\beta_0\)` ) and slope ( `\(\beta_1\)` ) do not have the subscript `\(i\)`?**

--

+ It means there is one value for all observations.
+ Remember, the model is for **all of our data**.

---
# What is `\(\epsilon_i\)`?

.pull-left[
+ `\(\epsilon_i\)`, or the residual, is a measure of how well the model fits each data point.
+ It is the vertical distance (along the `\(y\)`-axis) between the model line and a data point.
  + `\(\epsilon_i\)` is positive if the point is above the line (red in the plot)
  + `\(\epsilon_i\)` is negative if the point is below the line (blue in the plot)
]

.pull-right[
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-12-1.png" width="90%" />
]

???
+ red = positive and bigger (longer arrow): the model is worse for that point
+ blue = negative and smaller (shorter arrow): the model is better
+ key point to link here is the importance of residuals for knowing how good the model is
+ Link to last lecture in that they are the variability
+ that is the link into least squares

---
class: center, middle
# Time for a break

---
class: center, middle
# Welcome Back!

**Where we left off...**

---
# Principle of least squares

+ The numbers `\(\beta_0\)` and `\(\beta_1\)` are typically **unknown** and need to be estimated in order to fit a line through the point cloud.
+ We denote the "best" values as `\(\hat \beta_0\)` and `\(\hat \beta_1\)`
+ The best fitting line is found using **least squares**
  + Minimizes the distances between the actual values of `\(y\)` and the model-predicted values of `\(\hat y\)`
  + Specifically, minimizes the sum of the *squared* deviations

---
# Principle of least squares

+ Actual value = `\(y_i\)`
+ Model-predicted value = `\(\hat y_i = \hat \beta_0 + \hat \beta_1 x_i\)`
+ Deviation or residual = `\(y_i - \hat y_i\)`
+ Minimize the **residual sum of squares**, `\(SS_{Residual}\)`, which is

`$$SS_{Residual} = \sum_{i=1}^{n} [y_i - (\hat \beta_0 + \hat \beta_1 x_{i})]^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$`

---
# Principle of least squares

+ **Why do you think we square the deviations?**
  + HINT: Look back to the "What is `\(\epsilon_i\)`?" slide

--

+ We have positive and negative residual terms.
+ If we simply added them, they would cancel out.

---
# Data, predicted values and residuals

+ Data = `\(y_i\)`
  + This is what we have measured in our study.
  + For us, the test scores.
+ Predicted value = `\(\hat{y}_i = \hat \beta_0 + \hat \beta_1 x_i\)` = the `\(y\)`-value on the line at specific values of `\(x\)`
  + Or, the value of the outcome our model predicts given someone's values for the predictors.
  + In our example: given you study for 4 hours, what test score does our model predict you will get?
+ Residual = difference between `\(y_i\)` and `\(\hat{y}_i\)`. So:

`$$SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$`

???
+ these are important distinctions for understanding linear models
+ return to them a lot
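---
# Fitted values and residuals in R

A minimal sketch of these quantities for our example (the data frame `test` is rebuilt here so the chunk runs on its own):

```r
test <- data.frame(hours = seq(0.5, 5, by = 0.5),
                   score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8))

m <- lm(score ~ hours, data = test)  # least squares fit

fitted(m)        # predicted values, y-hat_i
resid(m)         # residuals, y_i - y-hat_i
sum(resid(m)^2)  # SS_Residual: the quantity least squares minimizes
```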
---
# Fitting the line

+ Calculation for the slope:

`$$\hat \beta_1 = \frac{SP_{xy}}{SS_x}$$`

+ `\(SP_{xy}\)` = sum of cross-products:

`$$SP_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$`

+ `\(SS_x\)` = sum of squared deviations of `\(x\)`:

`$$SS_x = \sum_{i=1}^{n}(x_i - \bar{x})^2$$`

<!-- --- -->
<!-- # Equivalent formula -->
<!-- $$\hat \beta_1 = -->
<!-- \frac{SP_{xy}}{SS_x} = -->
<!-- r \frac{s_y}{s_x}$$ -->
<!-- where -->
<!-- - `\(r = \frac{SP_{xy}}{\sqrt{SS_x \times SS_y}}\)` -->
<!-- - `\(s_y = \sqrt{ \frac{SS_y}{n - 1} } = \sqrt{ \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1} }\)` -->
<!-- - `\(s_x = \sqrt{ \frac{SS_x}{n - 1} } = \sqrt{ \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} }\)` -->

---
# Fitting the line

+ Calculation for the intercept:

`$$\hat \beta_0 = \bar{y} - \hat \beta_1 \bar{x}$$`

+ `\(\hat \beta_1\)` = slope estimate
+ `\(\bar{y}\)` = mean of `\(y\)`
+ `\(\bar{x}\)` = mean of `\(x\)`

---
class: center, middle
# Time for a break

This would be a good time to take a look at the lecture 3 worked example, where we show these calculations for our example.

---
class: center, middle
# Welcome Back!

**Where we left off...**

We calculated the intercept and slope. Now let's think about error...

---
# What is `\(\sigma\)`?

.pull-left[
<center>**Small `\(\sigma\)`**</center>
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-14-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right[
<center>**Large `\(\sigma\)`**</center>
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-15-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# What is `\(\sigma\)`?

+ The less scatter around the line, the smaller the standard deviation of the errors.
+ The less scatter around the line, the stronger the relationship between `\(y\)` and `\(x\)`.

--

+ We estimate `\(\sigma\)` using the residuals.
+ The estimated standard deviation of the errors is:

`$$\hat \sigma = \sqrt{\frac{SS_{Residual}}{n - k - 1}} = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat y_i)^2}{n - k - 1}}$$`

+ In simple linear regression we only have one `\(x\)`, so `\(k = 1\)` and the denominator becomes `\(n - 2\)`.

---
# Summary of today

+ Moved from correlation to the linear model
+ Calculated the slope and intercept
+ Discussed `\(SS_{Residual}\)`
+ Discussed `\(\hat \sigma\)` and its relation to good models

---
# Next tasks

+ This week:
  + Complete your lab
  + Come to office hours
  + Weekly quiz - practice no. 2
    + Opens Monday 09:00
    + Closes Sunday 17:00
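---
# Appendix: checking the calculations in R

A hedged sketch, not part of the lecture itself: the hand formulas above can be computed directly and checked against R's built-in estimates.

```r
test <- data.frame(hours = seq(0.5, 5, by = 0.5),
                   score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8))

# Slope and intercept from the formulas
SP_xy <- sum((test$hours - mean(test$hours)) *
             (test$score - mean(test$score)))
SS_x  <- sum((test$hours - mean(test$hours))^2)
b1 <- SP_xy / SS_x                               # slope
b0 <- mean(test$score) - b1 * mean(test$hours)   # intercept

# Estimated sigma: sqrt(SS_Residual / (n - k - 1)), with k = 1
m <- lm(score ~ hours, data = test)
sigma_hat <- sqrt(sum(resid(m)^2) / (nrow(test) - 2))

# These should match coef(m) and sigma(m)
c(b0, b1); coef(m); sigma_hat; sigma(m)
```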