class: center, middle, inverse, title-slide .title[ #
Introduction to the Linear Model
] .subtitle[ ## DPUK Spring Academy
] .author[ ### Umberto Noè, Josiah King, (and credits to Tom Booth) ] .institute[ ### Department of Psychology
The University of Edinburgh ] .date[ ### April 2025 ]

---

# Overview

- Day 2: What is a linear model?
- Day 3: But I have more variables, what now?
- Day 4: Interactions
- Day 5: Is my model any good?

---
class: center, middle

# Day 2

**What is a linear model?**

---
class: inverse, center, middle

<h2>Part 1: What is the linear model?</h2>
<h2 style="text-align: left;opacity:0.3;">Part 2: Best line </h2>
<h2 style="text-align: left;opacity:0.3;">Part 3: Single continuous predictor = correlation</h2>
<h2 style="text-align: left;opacity:0.3;">Part 4: Single binary predictor = t-test</h2>

---

# What is a model?

+ Pretty much all statistics is about models.

+ A model is an idea about the way the world is.

+ A formal representation of a system or relationships.

+ Typically we represent models as functions.
    + We input data
    + Specify a set of relationships
    + We output a prediction

---

# An Example

+ To think through these relations, we can use a simple example.

+ Suppose I have a model for growth of babies.<sup>1</sup>

$$
Length = 55 + 4 * Month
$$

.footnote[
[1] Length is measured in cm.
]

---

# Visualizing a model

.pull-left[
<!-- -->
]

.pull-right[

{{content}}

]

--

+ The black line represents our model
{{content}}

--

+ The x-axis shows `Age` `\((x)\)`, in months
{{content}}

--

+ The y-axis shows the values of `Length` that our model predicts
{{content}}

---

# Models as "a state of the world"

+ Let's suppose my model is true.
    + That is, it is a perfect representation of how babies grow.

+ My model creates predictions.

+ **IF** my model is a true representation of the world, **THEN** data from the world should closely match my predictions.
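---

# The example model as an R function

A minimal sketch of the baby-growth model written as a function, to make the "input data, output a prediction" idea concrete. The function name `predict_length` and the object names `age` and `predictions` are illustrative assumptions, not part of the original slides.

```r
# Growth model from the slides: Length (cm) = 55 + 4 * Month.
# `predict_length` is a hypothetical helper name used only for illustration.
predict_length <- function(month) {
  55 + 4 * month
}

# Inputs: ages from 10 to 12 months, in quarter-month steps
age <- seq(10, 12, by = 0.25)

# Outputs: the length (in cm) the model predicts at each age
predictions <- data.frame(Age = age, Prediction = predict_length(age))
print(predictions)
```

Every predicted value lies exactly on the line, because the function is the model. Real measurements would only be expected to lie close to it.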
--- # Predictions and data .pull-left[ <!-- --> ] .pull-right[ <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Age </th> <th style="text-align:right;"> Prediction </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 10.00 </td> <td style="text-align:right;"> 95 </td> </tr> <tr> <td style="text-align:right;"> 10.25 </td> <td style="text-align:right;"> 96 </td> </tr> <tr> <td style="text-align:right;"> 10.50 </td> <td style="text-align:right;"> 97 </td> </tr> <tr> <td style="text-align:right;"> 10.75 </td> <td style="text-align:right;"> 98 </td> </tr> <tr> <td style="text-align:right;"> 11.00 </td> <td style="text-align:right;"> 99 </td> </tr> <tr> <td style="text-align:right;"> 11.25 </td> <td style="text-align:right;"> 100 </td> </tr> <tr> <td style="text-align:right;"> 11.50 </td> <td style="text-align:right;"> 101 </td> </tr> <tr> <td style="text-align:right;"> 11.75 </td> <td style="text-align:right;"> 102 </td> </tr> <tr> <td style="text-align:right;"> 12.00 </td> <td style="text-align:right;"> 103 </td> </tr> </tbody> </table> ] ??? + Our predictions are points which fall on our line (representing the model, as a function) + Here the arrows are showing how we can use the model to find a predicted value. + we find the value of the input on the x-axis (here 11), read up to the line, then across to the y-axis --- # Predictions and data .pull-left[ + Consider the predictions when the children get a lot older... 
{{content}} ] .pull-right[ <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Age </th> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> Prediction </th> <th style="text-align:right;"> Prediction_M </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 216 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 919 </td> <td style="text-align:right;"> 9.19 </td> </tr> <tr> <td style="text-align:right;"> 228 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 967 </td> <td style="text-align:right;"> 9.67 </td> </tr> <tr> <td style="text-align:right;"> 240 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 1015 </td> <td style="text-align:right;"> 10.15 </td> </tr> <tr> <td style="text-align:right;"> 252 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 1063 </td> <td style="text-align:right;"> 10.63 </td> </tr> <tr> <td style="text-align:right;"> 264 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 1111 </td> <td style="text-align:right;"> 11.11 </td> </tr> <tr> <td style="text-align:right;"> 276 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:right;"> 1159 </td> <td style="text-align:right;"> 11.59 </td> </tr> <tr> <td style="text-align:right;"> 288 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 1207 </td> <td style="text-align:right;"> 12.07 </td> </tr> <tr> <td style="text-align:right;"> 300 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 1255 </td> <td style="text-align:right;"> 12.55 </td> </tr> </tbody> </table> ] -- + What do you think this would mean for our actual data? {{content}} -- + Will the data fall on the line? {{content}} --- # How good is my model? + How might we judge how good our model is? 1. 
Model is represented as a function
2. We see that as a line (or surface if we have more things to consider)
3. That yields predictions (or values we expect if our model is true)
4. We can collect data
5. If the predictions do not match the data (points deviate from our line), that says something about our model.

---

# Linear model

+ The linear model is the workhorse of statistics.

+ When using a linear model, we are typically trying to explain variation in an **outcome** (Y, dependent, response) variable, using one or more **predictor** (x, independent, explanatory) variable(s).

---

# Example

.pull-left[

|student | hours| score|
|:-------|-----:|-----:|
|ID1     |   0.5|     1|
|ID2     |   1.0|     3|
|ID3     |   1.5|     1|
|ID4     |   2.0|     2|
|ID5     |   2.5|     2|
|ID6     |   3.0|     6|
|ID7     |   3.5|     3|
|ID8     |   4.0|     3|
|ID9     |   4.5|     4|
|ID10    |   5.0|     8|

]

.pull-right[

**Simple data**

+ `student` = ID variable unique to each respondent

+ `hours` = the number of hours spent studying. This will be our predictor ( `\(x\)` )

+ `score` = test score. This will be our outcome ( `\(y\)` )

**Question: Do students who study more get higher scores on the test?**
]

---

# Scatterplot of our data

.pull-left[
<!-- -->
]

.pull-right[

{{content}}

]

--

<!-- -->
{{content}}

???

+ we can visualize our data. We can see points moving bottom left to top right
+ so association looks positive
+ Now let's add a line that represents the best model

---

# Definition of the line

+ The line can be described by two values:

+ **Intercept**: the point where the line crosses the `\(y\)`-axis, i.e. the value of `\(y\)` when `\(x\)` = 0

+ **Slope**: the gradient of the line, or rate of change

???

+ In our example, intercept = for someone who doesn't study, what score will they get?
+ Slope = for every hour of study, how much will my score change?

---

# Intercept and slope

.pull-left[
<!-- -->
]

.pull-right[
<!-- -->
]

---

# How to find a line?

+ The line represents a model of our data.
+ In our example, the model that best characterizes the relationship between hours of study and test score.
+ In the scatterplot, the data is represented by points.

+ So a good line is one that is "close" to all the points.

---

# Linear Model

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ `\(y_i\)` = the outcome variable (e.g. `score`)

+ `\(x_i\)` = the predictor variable (e.g. `hours`)

+ `\(\beta_0\)` = intercept

+ `\(\beta_1\)` = slope

+ `\(\epsilon_i\)` = residual (we will come to this shortly)

where `\(\epsilon_i \sim N(0, \sigma)\)` independently. This means:

+ The errors are normally distributed with mean 0.
+ `\(\sigma\)` = standard deviation (spread) of the errors
+ The standard deviation of the errors, `\(\sigma\)`, is constant: it is the same at every value of `\(x\)`.

---

# Linear Model

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ **Why do we have `\(i\)` in some places and not others?**

--

+ `\(i\)` is a subscript to indicate that each participant has their own value.

+ So each participant has their own:
    + score on the test ( `\(y_i\)` )
    + number of hours studied ( `\(x_i\)` ) and
    + residual term ( `\(\epsilon_i\)` )

--

+ **What does it mean that the intercept ( `\(\beta_0\)` ) and slope ( `\(\beta_1\)` ) do not have the subscript `\(i\)`?**

--

+ It means there is one value for all observations.
    + Remember the model is for **all of our data**

---

# What is `\(\epsilon_i\)`?

.pull-left[

+ `\(\epsilon_i\)`, or the residual, is a measure of how well the model fits each data point.

+ It is the vertical distance (measured along the `\(y\)`-axis) between a data point and the model line.

+ `\(\epsilon_i\)` is positive if the point is above the line (red in plot)

+ `\(\epsilon_i\)` is negative if the point is below the line (blue in plot)

]

.pull-right[
<!-- -->
]

???
+ comment red = positive and bigger (longer arrow) model is worse
+ blue is negative, and smaller (shorter arrow) model is better
+ key point to link here is the importance of residuals for knowing how good the model is
+ Link to last lecture in that they are the variability
+ that is the link into least squares

---
class: inverse, center, middle

<h2 style="text-align: left;opacity:0.3;">Part 1: What is the linear model?</h2>
<h2>Part 2: Best line </h2>
<h2 style="text-align: left;opacity:0.3;">Part 3: Single continuous predictor = correlation</h2>
<h2 style="text-align: left;opacity:0.3;">Part 4: Single binary predictor = t-test</h2>

---

# Principle of least squares

+ The numbers `\(\beta_0\)` and `\(\beta_1\)` are typically **unknown** and need to be estimated in order to fit a line through the point cloud.

+ We denote the "best" values as `\(\hat \beta_0\)` and `\(\hat \beta_1\)`

+ The best fitting line is found using **least squares**
    + Minimizes the distances between the actual values `\(y_i\)` and the model-predicted values `\(\hat y_i\)`
    + Specifically, minimizes the sum of the *squared* deviations

---

# Principle of least squares

+ Actual value = `\(y_i\)`

+ Model-predicted value = `\(\hat y_i = \hat \beta_0 + \hat \beta_1 x_i\)`

+ Deviation or residual = `\(y_i - \hat y_i\)`

+ Minimize the **residual sum of squares**, `\(SS_{Residual}\)`, which is

`$$SS_{Residual} = \sum_{i=1}^{n} [y_i - (\hat \beta_0 + \hat \beta_1 x_{i})]^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$`

---

# Data, predicted values and residuals

+ Data = `\(y_i\)`
    + This is what we have measured in our study.
    + For us, the test scores.

+ Predicted value = `\(\hat{y}_i = \hat \beta_0 + \hat \beta_1 x_i\)` = the y-value on the line at a specific value of `\(x\)`
    + Or, the value of the outcome our model predicts given someone's values for predictors.
    + In our example, given you study for 4 hrs, what test score does our model predict you will get?
+ Residual = difference between `\(y_i\)` and `\(\hat{y}_i\)`. So:

`$$SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$`

???

+ these are important distinctions for understanding linear models
+ return to them a lot.

---

# Data, predicted values and residuals

.pull-left[

`$$SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$`

+ Squared distance of each point from the predicted value.
]

.pull-right[
<!-- -->
]

---
class: inverse, center, middle

<h2 style="text-align: left;opacity:0.3;">Part 1: What is the linear model?</h2>
<h2 style="text-align: left;opacity:0.3;">Part 2: Best line </h2>
<h2>Part 3: Single continuous predictor = correlation</h2>
<h2 style="text-align: left;opacity:0.3;">Part 4: Single binary predictor = t-test</h2>

---

# `lm` in R

``` r
res <- lm(score ~ hours, data = test)
summary(res)
```

```
## 
## Call:
## lm(formula = score ~ hours, data = test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.6182 -1.0773 -0.7454  1.1773  2.4364
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.4000     1.1111   0.360   0.7282
## hours         1.0545     0.3581   2.945   0.0186 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.626 on 8 degrees of freedom
## Multiple R-squared:  0.5201, Adjusted R-squared:  0.4601
## F-statistic:  8.67 on 1 and 8 DF,  p-value: 0.01858
```

---

# Interpretation

+ **Slope is the number of units by which Y increases, on average, for a unit increase in X.**

--

    + Unit of Y = 1 point on the test
    + Unit of X = 1 hour of study

--

    + So, for every additional hour of study, test score increases on average by 1.055 points.

--

+ **Intercept is the expected value of Y when X is 0.**

--

    + X = 0 is a student who does not study.

--

    + So, a student who does not study would be expected to score 0.40 on the test.

???

+ So we know in a general sense what the intercept and slope are, but what do they mean with respect to our data and question?
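---

# Least squares by hand

A minimal sketch of what `lm()` is doing for a single predictor, assuming the ten rows from the Example slide are stored in a data frame called `test`. The closed-form expressions for the slope and intercept are the standard simple-regression formulas; they are not shown in the original slides, and the object names (`b0`, `b1`, `ss_residual`, `ss_other`) are illustrative.

```r
# The ten observations from the Example slide
test <- data.frame(
  student = paste0("ID", 1:10),
  hours   = seq(0.5, 5, by = 0.5),
  score   = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)

x <- test$hours
y <- test$score

# Closed-form least-squares estimates for one predictor (standard formulas)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                      # intercept

# Predicted values and the residual sum of squares for this line
y_hat <- b0 + b1 * x
ss_residual <- sum((y - y_hat)^2)

# Compare with lm(): both rows should hold the same two numbers
res <- lm(score ~ hours, data = test)
print(rbind(by_hand = c(b0, b1), lm = coef(res)))

# A slightly steeper line (an arbitrary alternative) fits worse
ss_other <- sum((y - (b0 + (b1 + 0.1) * x))^2)
print(c(least_squares = ss_residual, steeper_line = ss_other))

# Predicted score for a student who studies for 4 hours
print(predict(res, newdata = data.frame(hours = 4)))
```

The hand-computed intercept and slope should match the 0.4000 and 1.0545 in the `summary(res)` output, and any other line, such as the steeper one tried here, gives a larger residual sum of squares.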
---

# Note of caution on intercepts

+ In our example, 0 has a meaning.
    + It is a student who has studied for 0 hours.

+ But it is not always the case that 0 is meaningful.

+ Suppose our predictor variable was not hours of study, but age.

+ The intercept would then read: **a person of age 0 is expected to have a test score of 0.40.**

+ To make the intercept more meaningful, consider using mean-centred age as the predictor, `age_mc` = `age - mean(age)`.
    + The intercept will be the estimated test score when `age_mc` = 0, i.e. when `age - mean(age)` = 0, i.e. when `age` = `mean(age)`.
    + If the predictor is X_mc, the intercept represents the expected value of Y when X is at the mean.

---

# Unstandardized vs standardized coefficients

- In this example, we have unstandardized `\(\hat \beta_1\)`.

+ We interpreted the slope as the change in `\(y\)` units for a unit change in `\(x\)`
    + Where the unit is determined by how we have measured our variables.

+ However, sometimes we may want to represent our results in standard units:
    + If the scales of our variables are arbitrary.
    + If we want to compare the effects of variables on different scales.

---

# Standardized results

+ We can either convert the unstandardized coefficient into a standardized one:

`$$\hat{\beta_1^*} = \hat \beta_1 \frac{s_x}{s_y}$$`

+ where:
    + `\(\hat{\beta_1^*}\)` = standardized beta coefficient
    + `\(\hat \beta_1\)` = unstandardized beta coefficient
    + `\(s_x\)` = standard deviation of `\(x\)`
    + `\(s_y\)` = standard deviation of `\(y\)`

---

# Standardizing the variables

+ Alternatively, for continuous variables, transforming both the predictor (IV) and the outcome (DV) to `\(z\)`-scores (mean=0, SD=1) prior to fitting the model yields standardized betas.
+ `\(z\)`-score for `\(x\)`:

`$$z_{x_i} = \frac{x_i - \bar{x}}{s_x}$$`

+ and the `\(z\)`-score for `\(y\)`:

`$$z_{y_i} = \frac{y_i - \bar{y}}{s_y}$$`

+ That is, we divide the individual deviations from the mean by the standard deviation

---

# `lm()` using z-scores

``` r
library(dplyr)  # provides mutate()

# Convert outcome and predictor to z-scores (mean 0, SD 1)
test <- test |>
  mutate(
    z_score = scale(score, center = TRUE, scale = TRUE),
    z_hours = scale(hours, center = TRUE, scale = TRUE)
  )

# Refit the model on the standardized variables
res_z <- lm(z_score ~ z_hours, data = test)
round(summary(res_z)$coefficients, 3)
```

```
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.000      0.232   0.000    1.000
## z_hours        0.721      0.245   2.945    0.019
```

---

# Interpreting standardized regression coefficients

+ `\(\hat \beta_0\)` (intercept) = zero when all variables are standardized.

+ The interpretation of the slope becomes the increase in `\(y\)` in standard deviation units for every standard deviation increase in `\(x\)`

+ So, in our example:

>**For every standard deviation increase in hours of study, test score increases on average by 0.72 standard deviations**

---

# Relationship to r

+ Standardized slope ( `\(\hat \beta_1^*\)` ) = correlation coefficient ( `\(r\)` ) for a linear model with a single continuous predictor.

+ In our example, `\(\hat \beta_{hours}^*\)` = 0.72

``` r
round(cor(test$hours, test$score), 3)
```

```
## [1] 0.721
```

+ `\(r\)` is a standardized measure of linear association

+ `\(\hat \beta_1^*\)` is a standardized measure of the linear slope.

---
class: inverse, center, middle

<h2 style="text-align: left;opacity:0.3;">Part 1: What is the linear model?</h2>
<h2 style="text-align: left;opacity:0.3;">Part 2: Best line </h2>
<h2 style="text-align: left;opacity:0.3;">Part 3: Single continuous predictor = correlation</h2>
<h2>Part 4: Single binary predictor = t-test</h2>

---

# Binary variable

+ A binary variable is a categorical variable with two levels.

+ Traditionally coded with a 0 and 1
    + Referred to as dummy coding
    + We will come back to this for categorical variables with 2+ levels

--

+ Why 0 and 1?
+ Quick version: It has some nice properties when it comes to interpretation.

---

# Extending our example

.pull-left[

+ Our in-class example so far has used test scores and revision time for 10 students.

+ Let's say we collect this data on 150 students.

+ We also collected data on who they studied with:
    + 0 = alone
    + 1 = with others

+ So our variable `study` is binary

]

.pull-right[

```
## # A tibble: 10 × 4
##    ID    score hours study
##    <chr> <dbl> <dbl> <dbl>
##  1 ID1       5   3.3     0
##  2 ID2       6   2.6     0
##  3 ID3       5   3.7     1
##  4 ID4       6   3.6     0
##  5 ID5       6   3.7     1
##  6 ID6       7   4.4     1
##  7 ID7       6   3.6     1
##  8 ID8       6   4.1     1
##  9 ID9       5   3.6     0
## 10 ID10      5   3.9     0
```

]

---

# LM with binary predictors

+ Now we can ask the question:
    + **Do students who study with others score better than students who study alone?**

`$$score_i = \beta_0 + \beta_1 study_{i} + \epsilon_i$$`

---

# In `R`

``` r
res2 <- lm(score ~ study, data = df)
summary(res2)
```

```
## 
## Call:
## lm(formula = score ~ study, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.8333 -0.8333  0.1667  0.7778  2.1667
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.2222     0.1076  48.552  < 2e-16 ***
## study         0.6111     0.1492   4.097 6.87e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9127 on 148 degrees of freedom
## Multiple R-squared:  0.1019, Adjusted R-squared:  0.0958
## F-statistic: 16.79 on 1 and 148 DF,  p-value: 6.866e-05
```

---

# Interpretation

.pull-left[

+ As before, the intercept `\(\hat \beta_0\)` is the expected value of `\(y\)` when `\(x=0\)`

+ What is `\(x=0\)` here?
    + It is the students who study alone.

+ So what about `\(\hat \beta_1\)`?

+ **Look at the output on the right hand side.**
    + What do you notice about the difference in averages?
]

.pull-right[

``` r
df |>
* group_by(study) |>
  summarise(
*   Average = round(mean(score), 4)
  )
```

```
## # A tibble: 2 × 2
##   study Average
##   <dbl>   <dbl>
## 1     0    5.22
## 2     1    5.83
```

]

---

# Interpretation

+ `\(\hat \beta_0\)` = predicted (expected) value of `\(y\)` when `\(x = 0\)`
    + Or, the mean of the group coded 0 (those who study alone)

+ `\(\hat \beta_1\)` = predicted difference between the means of the two groups.
    + Group 1 - Group 0 (Mean `score` for those who study with others - mean `score` of those who study alone)
    + Here: 5.83 - 5.22 = 0.61, which is the `study` coefficient.

+ Notice how this maps to our question.
    + Do students who study with others score better than students who study alone?

---

# Visualize the model

<img src="data:image/png;base64,#DPUK_B_2025_files/figure-html/unnamed-chunk-21-1.png" width="55%" />

---

# Hold on... it's a t-test

``` r
t.test(score ~ study, data = df, var.equal = TRUE)
```

```
## 
##  Two Sample t-test
## 
## data:  score by study
## t = -4.0971, df = 148, p-value = 6.866e-05
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.9058636 -0.3163586
## sample estimates:
## mean in group 0 mean in group 1
##        5.222222        5.833333
```

+ This is the same test as the `study` row of the `lm()` output: identical df and p-value, and the same `\(t\)` up to sign (`t.test()` takes group 0 minus group 1, whereas the slope is group 1 minus group 0).

???

Yup!

---
class: center, middle

# Thanks all!