class: center, middle, inverse, title-slide

.title[
# Introduction to the linear model (LM)
]

.subtitle[
## Data Analysis for Psychology in R 2
]

.author[
### dapR2 Team
]

.institute[
### Department of Psychology
The University of Edinburgh
]

---
# Course Overview

.pull-left[
<table style="border: 1px solid black;">
<tr style="padding: 0 1em 0 1em;">
<td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1;text-align:center;vertical-align: middle"> <b>Introduction to Linear Models</b></td>
<td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1"> <b>Intro to Linear Regression</b></td>
</tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Interpreting Linear Models</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Testing Individual Predictors</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Model Testing & Comparison</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Linear Model Analysis</td></tr>
<tr style="padding: 0 1em 0 1em;">
<td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4;text-align:center;vertical-align: middle"> <b>Analysing Experimental Studies</b></td>
<td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Categorical Predictors & Dummy Coding</td>
</tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Effects Coding & Coding Specific Contrasts</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Assumptions & Diagnostics</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Bootstrapping</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Categorical Predictor Analysis</td></tr>
</table>
]

.pull-right[
<table style="border: 1px solid black;">
<tr style="padding: 0 1em 0 1em;">
<td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4;text-align:center;vertical-align: middle"> <b>Interactions</b></td>
<td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Interactions I</td>
</tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Interactions II</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Interactions III</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Analysing Experiments</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Interaction Analysis</td></tr>
<tr style="padding: 0 1em 0 1em;">
<td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4;text-align:center;vertical-align: middle"> <b>Advanced Topics</b></td>
<td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Power Analysis</td>
</tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Binary Logistic Regression I</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Binary Logistic Regression II</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Logistic Regression Analysis</td></tr>
<tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4"> Exam Prep and Course Q&A</td></tr>
</table>
]

---
# This Week's Learning Objectives

1. Understand the link between models and functions
2. Understand the key concepts (intercept and slope) of the linear model
3. Understand what residuals represent
4. Understand the key principles of least squares
5. Be able to specify a simple linear model (labs)

---
class: inverse, center, middle

# Part 1: Functions & Models

---
# What is a model?
+ Pretty much all statistics is about models
+ A model is a formal representation of a system
+ Put another way, a model is an idea about the way the world is

---
# A model as a function

+ We tend to represent mathematical models as functions
  + A **function** is an expression that defines the relationship between one variable (or set of variables) and another variable (or set of variables)
  + It allows us to specify what is important (arguments) and how these things interact with each other (operations)

+ This allows us to make and test predictions

---
# An Example

+ To think through these relations, we can use a simple example
+ Suppose I have a model for the growth of babies <sup>1</sup>

$$ Length = 55 + 4 * Age $$

--

+ I'm using this model to formally represent the relationship between a baby's age and their length

.footnote[
[1] Length is measured in cm; Age is measured in months
]

---
# Visualising a model

.pull-left[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-3-1.png" width="80%" />
]

.pull-right[
{{content}}
]

--
+ The x-axis shows `Age`

{{content}}

--
+ The y-axis shows `Length`

{{content}}

--
+ The black line represents our model: `\(y = 55+4x\)`

{{content}}

---
# Models as "a state of the world"

+ Let's suppose my model is true
  + That is, it is a perfect representation of how babies grow

+ What are the implications of this?

--

+ My model creates predictions
+ **IF** my model is a true representation of the world, **THEN** data from the world should closely match my predictions.

---
# Predictions and data

.pull-left[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-4-1.png" width="80%" />
]

.pull-right[
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;"> Age </th>
<th style="text-align:right;"> PredictedLength </th>
</tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10.00 </td> <td style="text-align:right;"> 95 </td> </tr>
<tr> <td style="text-align:right;"> 10.25 </td> <td style="text-align:right;"> 96 </td> </tr>
<tr> <td style="text-align:right;"> 10.50 </td> <td style="text-align:right;"> 97 </td> </tr>
<tr> <td style="text-align:right;"> 10.75 </td> <td style="text-align:right;"> 98 </td> </tr>
<tr> <td style="text-align:right;"> 11.00 </td> <td style="text-align:right;"> 99 </td> </tr>
<tr> <td style="text-align:right;"> 11.25 </td> <td style="text-align:right;"> 100 </td> </tr>
<tr> <td style="text-align:right;"> 11.50 </td> <td style="text-align:right;"> 101 </td> </tr>
<tr> <td style="text-align:right;"> 11.75 </td> <td style="text-align:right;"> 102 </td> </tr>
<tr> <td style="text-align:right;"> 12.00 </td> <td style="text-align:right;"> 103 </td> </tr>
</tbody>
</table>
]

+ Our predictions are points which fall on our line (representing the model, as a function)
+ The arrows show how we can use the model to find a predicted value

---
# Predictions and data

.pull-left[
+ Consider the predictions when the children get a lot older...
{{content}}
]

.pull-right[
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;"> Age </th>
<th style="text-align:right;"> Year </th>
<th style="text-align:right;"> Prediction </th>
<th style="text-align:right;"> Prediction_M </th>
</tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 216 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 919 </td> <td style="text-align:right;"> 9.19 </td> </tr>
<tr> <td style="text-align:right;"> 228 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 967 </td> <td style="text-align:right;"> 9.67 </td> </tr>
<tr> <td style="text-align:right;"> 240 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 1015 </td> <td style="text-align:right;"> 10.15 </td> </tr>
<tr> <td style="text-align:right;"> 252 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 1063 </td> <td style="text-align:right;"> 10.63 </td> </tr>
<tr> <td style="text-align:right;"> 264 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 1111 </td> <td style="text-align:right;"> 11.11 </td> </tr>
<tr> <td style="text-align:right;"> 276 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:right;"> 1159 </td> <td style="text-align:right;"> 11.59 </td> </tr>
<tr> <td style="text-align:right;"> 288 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 1207 </td> <td style="text-align:right;"> 12.07 </td> </tr>
<tr> <td style="text-align:right;"> 300 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 1255 </td> <td style="text-align:right;"> 12.55 </td> </tr>
</tbody>
</table>
]

--
+ What does this say about our model?

{{content}}

--
+ If we were to collect actual data on height and age, would our observations fall on the line?

{{content}}

---
# Length & Age is non-linear

.pull-left[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-7-1.png" width="80%" />
]

.pull-right[
+ The red line is plotted based on the mean length at different ages, taken from [real data](https://www.cdc.gov/growthcharts/who/boys_length_weight.htm)
]

---
# How good is my model?

+ How might we judge how good our model is?

--

1. Model is represented as a function
2. We see that as a line (or surface if we have more things to consider)
3. That yields predictions (or values we expect if our model is true)
4. We can collect data
5. If the predictions do not match the observed data (observations deviate from our line), that says something about our model

---
# Models and Statistics

+ In statistics we (roughly) follow this process:
  + We define a model that represents one state of the world (probabilistically)
  + We collect data to compare to it
  + These comparisons lead us to make inferences about how the world actually is, by comparison to a world that we specify by our model
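
+ To make the "model as a function" idea concrete, below is a minimal R sketch of the growth example from Part 1 (illustrative only: the function name `predict_length` is ours, not from the course materials)

``` r
# The toy growth model from Part 1, written as an R function
# (Length in cm, Age in months; Length = 55 + 4 * Age)
predict_length <- function(age_months) {
  55 + 4 * age_months
}

predict_length(10)   # 95 cm, matching the prediction table earlier
predict_length(216)  # 919 cm at age 18 -- a clearly implausible prediction
```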
---
# Deterministic vs Statistical models

.pull-left[
A deterministic model is a model for an **exact** relationship:
$$
y = \underbrace{3 + 2 x}_{f(x)}
$$
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-8-1.png" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
A statistical model allows for case-by-case **variability**:
$$
y = \underbrace{3 + 2 x}_{f(x)} + \epsilon
$$
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-9-1.png" width="70%" style="display: block; margin: auto;" />
]

---
class: center, middle

# Questions?

---
class: inverse, center, middle

# Part 2: The Linear Model

---
# Linear model

+ For the majority of the course, we will focus on how we move from the idea of an association to estimating a model for the relationship

+ We'll mostly look at the **linear model**
  + Assumes the relationship between the outcome variable and the predictor(s) is linear
  + Describes a continuous **outcome** variable as a function of one or more **predictor** variables

+ In other words, in using a linear model, we are typically trying to explain variation in an outcome ( `\(y\)`, AKA dependent or response) variable, using one or more predictor ( `\(x\)`, AKA independent or explanatory) variable(s)

---
# Example

**Question: Do students who study more get higher scores on the test?**

--

.pull-left[

|student | hours| score|
|:-------|-----:|-----:|
|ID1     |   0.5|     1|
|ID2     |   1.0|     3|
|ID3     |   1.5|     1|
|ID4     |   2.0|     2|
|ID5     |   2.5|     2|
|ID6     |   3.0|     6|
|ID7     |   3.5|     3|
|ID8     |   4.0|     3|
|ID9     |   4.5|     4|
|ID10    |   5.0|     8|

]

.pull-right[
**Simple data**

+ `student` = ID variable unique to each respondent
+ `hours` = the number of hours spent studying. This will be our predictor ( `\(x\)` )
+ `score` = test score. This will be our outcome ( `\(y\)` )
]

---
# Scatterplot of our data

.pull-left[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-11-1.png" width="80%" />
]

.pull-right[
{{content}}
]

--
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-12-1.png" width="80%" />

{{content}}

+ The line represents the best-fitting model

---
# Definition of the line

.pull-left[
+ The line can be described by two values:
  + **Intercept**: the point where the line crosses the `\(y\)`-axis, i.e. the value of `\(y\)` when `\(x = 0\)`
  + **Slope**: the gradient of the line, or rate of change

+ What do the intercept and slope stand for in our example?
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-13-1.png" width="80%" />
]

---
# Intercept and slope

.pull-left[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-15-1.png" width="80%" />
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-16-1.png" width="80%" />
]

---
# Linear Model Equation

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ `\(y_i\)` = the outcome variable (e.g. `score`)
+ `\(x_i\)` = the predictor variable (e.g. `hours`)
+ `\(\beta_0\)` = intercept
+ `\(\beta_1\)` = slope
+ `\(\epsilon_i\)` = residual (we will come to this shortly)

---
# Linear Model Equation

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ **Why do we have `\(i\)` in some places and not others?**

--

+ `\(i\)` is a subscript to indicate that each participant has their own value
+ So each participant has their own:
  + score on the test ( `\(y_i\)` )
  + number of hours studied ( `\(x_i\)` ), and
  + residual term ( `\(\epsilon_i\)` )

--

+ **What does it mean that the intercept ( `\(\beta_0\)` ) and slope ( `\(\beta_1\)` ) do not have the subscript `\(i\)`?**

--

+ It means there is one value for all observations
  + Remember the model is for **all of our data**

---
# What is `\(\epsilon_i\)`?

.pull-left[
+ `\(\epsilon_i\)`, or the residual, is a measure of how well the model fits each data point
  + It is the vertical distance (along the `\(y\)`-axis) between the model line and a data point

+ `\(\epsilon_i\)` is positive if the point is above the line (red in plot)

+ `\(\epsilon_i\)` is negative if the point is below the line (blue in plot)
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-17-1.png" width="80%" />
]

???
+ comment red = positive and bigger (longer arrow) model is worse
+ blue is negative, and smaller (shorter arrow) model is better
+ key point to link here is the importance of residuals for knowing how good the model is
+ Link to last lecture in that they are the variability
+ that is the link into least squares

---
# How to find the line?

.pull-left[
+ The line represents a model of our data
  + In our example, the model that best characterises the relationship between hours of study and test score

+ In the scatterplot, the data are represented by points

+ So a good line is a line that is "close" to all points

+ The method that we use to identify the best-fitting line is the **Principle of Least Squares**
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-18-1.png" width="80%" />
]

---
class: center, middle

# Questions?

---
class: inverse, center, middle

# Part 3: Principle of Least Squares

---
# Linear Model

+ The linear model equation:

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ Where
  + `\(y_i\)` is our measured outcome variable
  + `\(x_i\)` is our measured predictor variable
  + `\(\beta_0\)` is the model intercept
  + `\(\beta_1\)` is the model slope
  + `\(\epsilon_i\)` is the residual error (the difference between the model-predicted and the observed value of `\(y\)`)

--

+ The values of `\(y\)` and `\(x\)` come from the observed data
+ We'll now go through calculating `\(\beta_0\)` and `\(\beta_1\)`

---
# Principle of least squares

.pull-left[
+ The values `\(\beta_0\)` and `\(\beta_1\)` are typically **unknown** and need to be estimated from our data
+ We denote the "best" estimated values as `\(\hat \beta_0\)` and `\(\hat \beta_1\)`

+ We find the values of `\(\hat \beta_0\)` and `\(\hat \beta_1\)` (and thus our best line) using **least squares**

+ Least squares minimises the distances between the actual values of `\(y\)` and the model-predicted values `\(\hat y\)`
  + That is, it minimises the residuals across all data points (the line is "close" to them)
]

--

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-19-1.png" width="80%" />
]

---
# Principle of least squares

+ Formally, least squares minimises the **residual sum of squares**

--

.pull-left[
+ Essentially:
  + Fit a line
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-20-1.png" width="80%" />
]

---
count: false
# Principle of least squares

+ Formally, least squares minimises the **residual sum of squares**

.pull-left[
+ Essentially:
  + Fit a line
  + Calculate the residuals
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-21-1.png" width="80%" />
]

---
count: false
# Principle of least squares

+ Formally, least squares minimises the **residual sum of squares**

.pull-left[
+ Essentially:
  + Fit a line
  + Calculate the residuals
  + Square them
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-22-1.png" width="80%" />
]

---
count: false
# Principle of least squares

+ Formally, least squares minimises the **residual sum of squares**

.pull-left[
+ Essentially:
  + Fit a line
  + Calculate the residuals
  + Square them
  + Sum up the squares
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-23-1.png" width="80%" />
]

---
count: false
# Principle of least squares

+ Formally, least squares minimises the **residual sum of squares**

.pull-left[
+ Essentially:
  + Fit a line
  + Calculate the residuals
  + Square them
  + Sum up the squares

+ **Why do you think we square the deviations?**
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-24-1.png" width="80%" />
]

---
# Residual Sum of Squares

`$$SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$`

---
count: false
# Residual Sum of Squares

`$$SS_{Residual} = \sum_{i=1}^{n}(\color{#BF1932}{y_i} - \hat{y}_i)^2$$`

+ Data = `\(y_i\)`
  + This is what we have measured in our study
  + For us, the test scores

---
count: false
# Residual Sum of Squares

`$$SS_{Residual} = \sum_{i=1}^{n}(y_i - \color{#BF1932}{\hat{y}_i})^2$$`

+ Data = `\(y_i\)`
  + This is what we have measured in our study
  + For us, the test scores

+ Predicted value = `\(\hat{y}_i = \hat \beta_0 + \hat \beta_1 x_i\)`
  + Or, the value of the outcome our model predicts given someone's values for the predictors
  + In our example: given you study for 4 hours, what test score does our model predict you will get?

---
count: false
# Residual Sum of Squares

`$$SS_{Residual} = \sum_{i=1}^{n}(\color{#BF1932}{y_i - \hat{y}_i})^2$$`

+ Data = `\(y_i\)`
  + This is what we have measured in our study
  + For us, the test scores

+ Predicted value = `\(\hat{y}_i = \hat \beta_0 + \hat \beta_1 x_i\)`
  + Or, the value of the outcome our model predicts given someone's values for the predictors
  + In our example: given you study for 4 hours, what test score does our model predict you will get?
+ Residual = Difference between `\(y_i\)` and `\(\hat{y}_i\)`

---
# Key Point

+ It is worth a brief pause here, as this is a very important point

> The values of the intercept and slope that minimise the sum of squared residuals are our estimated coefficients from our data

--

> Minimising the `\(SS_{residual}\)` means that across all our data, the predicted values from our model are as close as they can be to the actual measured values of the outcome

---
# Calculating the slope

+ Calculation for the slope:

`$$\hat \beta_1 = \frac{SP_{xy}}{SS_x}$$`

.pull-left[
+ `\(SP_{xy}\)` = sum of cross-products:

`$$SP_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$`

+ `\(SS_x\)` = sum of squared deviations of `\(x\)`:

`$$SS_x = \sum_{i=1}^{n}(x_i - \bar{x})^2$$`
]

.pull-right[
+ `\(x_i\)` = predictor data (in our example, `hours`)
+ `\(y_i\)` = outcome data (in our example, `score`)
+ `\(\bar{y}\)` = mean of `\(y\)`
+ `\(\bar{x}\)` = mean of `\(x\)`
+ `\(n\)` = total number of observations
+ `\(\Sigma\)` = sum it all up
]

---
# Calculating the intercept

+ Calculation for the intercept:

`$$\hat \beta_0 = \bar{y} - \hat \beta_1 \bar{x}$$`

+ `\(\hat \beta_1\)` = slope estimate
+ `\(\bar{y}\)` = mean of `\(y\)`
+ `\(\bar{x}\)` = mean of `\(x\)`

---
class: center, middle

## Questions?

---
class: center, middle

## Time for a little R and to look at an example hand calculation

---
# `lm` in R

+ We do not generally calculate our linear models by hand
+ In R, we use the `lm()` function

``` r
lm(DV ~ IV, data = datasetName)
```

+ The first bit of code is the model formula:
  + The outcome or DV appears on the left of `~`
  + The predictor(s) or IV appear on the right of `~`

+ We then give R the name of the data set
  + This data set must contain variables (columns) with the same names as those specified in the model formula

---
# `lm` in R

.pull-left[

``` r
library(tidyverse) # provides tibble()

test <- tibble(
  student = paste(rep("ID", 10), 1:10, sep = ""),
  hours = seq(0.5, 5, 0.5),
  score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)
```
]

.pull-right[

``` r
head(test, 4)
```

```
## # A tibble: 4 × 3
##   student hours score
##   <chr>   <dbl> <dbl>
## 1 ID1       0.5     1
## 2 ID2       1       3
## 3 ID3       1.5     1
## 4 ID4       2       2
```
]

--

``` r
lm(score ~ hours, data = test)
```

```
## 
## Call:
## lm(formula = score ~ hours, data = test)
## 
## Coefficients:
## (Intercept)        hours  
##       0.400        1.055
```

---
class: center, middle

## We've just run our first linear model!

## Questions?

---
# Summary

+ Take home points...

1. In statistics, we are building models that describe how a set of variables relate
2. The **linear model** is one such model we will use in this course
3. The linear model describes our data based on an intercept and a slope(s)
4. From this model (line) we can make predictions about people's scores on an outcome
5. The degree to which our predictions differ from the observed data = residual = error = how good (or bad) the model is
6. We find our model coefficients based on least squares, which are the coefficients that minimise the sum of squared residuals (see the short sketch below)

+ The majority of this course is going to revolve around getting a deeper understanding of these points
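
+ As a quick check, here is a minimal sketch (our own verification, not part of the course materials) that reproduces the `lm()` coefficients by hand, using the least squares formulas from Part 3 and the `test` data defined above

``` r
# Hand calculation of the least squares estimates
# (assumes the `test` tibble created in the lm() example above)
x <- test$hours
y <- test$score

SP_xy <- sum((x - mean(x)) * (y - mean(y)))  # sum of cross-products
SS_x  <- sum((x - mean(x))^2)                # sum of squared deviations of x

(b1 <- SP_xy / SS_x)            # slope: approx. 1.055, matching lm()
(b0 <- mean(y) - b1 * mean(x))  # intercept: 0.400, matching lm()

b0 + b1 * 4  # predicted score after 4 hours of study: approx. 4.62
```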
---
## This week

.pull-left[
### Tasks

<img src="figs/labs.svg" width="10%" />
**Attend your lab and work together on the exercises**

<br>

<img src="figs/exam.svg" width="10%" />
**Complete the weekly quiz**
]

.pull-right[
### Support

<img src="figs/forum.svg" width="10%" />
**Help each other on the Piazza forum**

<br>

<img src="figs/oh.png" width="10%" />
**Attend office hours (see Learn page for details)**
]

---
class: center, middle

# Thanks for listening!