class: center, middle, inverse, title-slide .title[ #
Linear Model: Fundamentals
] .subtitle[ ## Data Analysis for Psychology in R 2
] .author[ ### dapR2 Team ] .institute[ ### Department of Psychology
The University of Edinburgh ] --- # Week's Learning Objectives 1. Be able to interpret the coefficients from a simple linear model. 2. Understand how these interpretations change when we add more predictors. 3. Understand how and why we standardize coefficients and how this impacts interpretation --- class: inverse, center, middle # Part 1: Recap & Coefficient Interpretation --- # Linear Model + Last week we left off having introduced the linear model: `$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$` + Where, + `\(y_i\)` is our measured outcome variable + `\(x_i\)` is our measured predictor variable + `\(\beta_0\)` is the model intercept + `\(\beta_1\)` is the model slope + `\(\epsilon_i\)` is the residual error (difference between the model predicted and the observed value of `\(y\)`) + We spoke about calculating by hand, and also the key concept of **residuals** --- # `lm` in R + We also introduced the basic structure of the `lm()` function. ```r lm(DV ~ IV, data = datasetName) ``` + And we had run our first model.... ```r lm(score ~ hours, data = test) ``` ``` ## ## Call: ## lm(formula = score ~ hours, data = test) ## ## Coefficients: ## (Intercept) hours ## 0.400 1.055 ``` - Today we are going to focus on the interpretation of our model, and how we extend it to include more predictors. --- # `lm` in R ```r summary(lm(score ~ hours, data = test)) ``` ``` ## ## Call: ## lm(formula = score ~ hours, data = test) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.6182 -1.0773 -0.7454 1.1773 2.4364 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.4000 1.1111 0.360 0.7282 ## hours 1.0545 0.3581 2.945 0.0186 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.626 on 8 degrees of freedom ## Multiple R-squared: 0.5201, Adjusted R-squared: 0.4601 ## F-statistic: 8.67 on 1 and 8 DF, p-value: 0.01858 ``` --- # Interpretation .pull-left[ + **Slope is the number of units by which Y increases, on average, for a unit increase in X.** ] .pull-right[ <img src="dapr2_02_LM2_files/figure-html/unnamed-chunk-5-1.png" width="80%" /> <img src="figs/firstLMresults.png" width="80%" /> ] --- count: false # Interpretation .pull-left[ + **Slope is the number of units by which Y increases, on average, for a unit increase in X.** + Unit of Y = 1 point on the test + Unit of X = 1 hour of study ] .pull-right[ <img src="dapr2_02_LM2_files/figure-html/unnamed-chunk-7-1.png" width="80%" /> <img src="figs/firstLMresults.png" width="80%" /> ] --- count: false # Interpretation .pull-left[ + **Slope is the number of units by which Y increases, on average, for a unit increase in X.** + Unit of Y = 1 point on the test + Unit of X = 1 hour of study + So, for every hour of study, test score increases on average by 1.055 points. ] .pull-right[ <img src="dapr2_02_LM2_files/figure-html/unnamed-chunk-9-1.png" width="80%" /> <img src="figs/firstLMresults.png" width="80%" /> ] --- # Interpretation .pull-left[ + **Slope is the number of units by which Y increases, on average, for a unit increase in X.** + Unit of Y = 1 point on the test + Unit of X = 1 hour of study + So, for every hour of study, test score increases on average by 1.055 points. + **Intercept is the expected value of Y when X is 0.** + X = 0 is a student who does not study. ] .pull-right[ <img src="dapr2_02_LM2_files/figure-html/unnamed-chunk-11-1.png" width="80%" /> <img src="figs/firstLMresults.png" width="80%" /> ] --- count: false # Interpretation .pull-left[ + **Slope is the number of units by which Y increases, on average, for a unit increase in X.** + Unit of Y = 1 point on the test + Unit of X = 1 hour of study + So, for every hour of study, test score increases on average by 1.055 points. + **Intercept is the expected value of Y when X is 0.** + X = 0 is a student who does not study. + So, a student who does not study would be expected to score 0.40 on the test. ] .pull-right[ <img src="dapr2_02_LM2_files/figure-html/unnamed-chunk-13-1.png" width="80%" /> <img src="figs/firstLMresults.png" width="80%" /> ] --- # Note of caution on intercepts + In our example, 0 has a meaning. + It is a student who has studied for 0 hours. + But it is not always the case that 0 is meaningful. + Suppose our predictor variable was not hours of study, but age. + **Look back at the interpretation of the intercept, and instead of hours of study, insert age. Read this aloud a couple of times.** -- + This is the first instance of a very general lesson about interpreting statistical tests. + The interpretation is always in the context of the constructs and how we have measured them. --- # Scale of measurement .pull-left[ + Let's have some practice.... + `\(X\)` = unit is 1 year + `\(Y\)` = unit is £1000 + `\(\beta_1\)` = 0.4 ] .pull-right[ + `\(\beta_0\)` = **Intercept is the expected value of Y when X is 0.** + `\(\beta_1\)` = **Slope is the number of units by which Y increases, on average, for a unit increase in X.** ] --- # Scale of measurement .pull-left[ + Let's have some practice.... + `\(X\)` = unit is 1kg + `\(Y\)` = unit is 1cm + `\(\beta_1\)` = -3.2 ] .pull-right[ + `\(\beta_0\)` = **Intercept is the expected value of Y when X is 0.** + `\(\beta_1\)` = **Slope is the number of units by which Y increases, on average, for a unit increase in X.** ] --- # Scale of measurement .pull-left[ + Let's have some practice.... + `\(X\)` = unit is 1 increment on a likert scale ranging from 1 to 5 measuring conscientiousness + `\(Y\)` = unit is 1 increment on a healthy eating scale + `\(\beta_1\)` = 0.25 ] .pull-right[ + `\(\beta_0\)` = **Intercept is the expected value of Y when X is 0.** + `\(\beta_1\)` = **Slope is the number of units by which Y increases, on average, for a unit increase in X.** ] --- class: center, middle # Questions? --- class: inverse, center, middle # Part 2: Standardization --- # Unstandardized vs standardized coefficients - So far we have calculated _unstandardized_ `\(\hat \beta_1\)`. + This means we use the units of the variables we measured. + We interpreted the slope as the change in `\(y\)` units for a unit change in `\(x\)` , where the unit is determined by how we have measured our variables. + However, sometimes these units do not make the most sense + When this is the case, we may want to do something to help interpretation. + This is typically called **standardization** --- # Standardized units + Why might standard units be useful? -- + **If the scales of our variables are arbitrary.** + Example: A sum score of questionnaire items answered on a Likert scale. + A unit here would equal moving from a 2 to 3 on one item. + This is not especially meaningful (and actually has A LOT of associated assumptions) -- + **If we want to compare the effects of variables on different scales** + If we want to say something like, the effect of `\(x_1\)` is stronger than the effect of `\(x_2\)`, we need a common scale. --- # Standardizing the coefficients + After calculating a `\(\hat \beta_1\)`, it can be standardized by: `$$\hat{\beta_1^*} = \hat \beta_1 \frac{s_x}{s_y}$$` + where; + `\(\hat{\beta_1^*}\)` = standardized beta coefficient + `\(\hat \beta_1\)` = unstandardized beta coefficient + `\(s_x\)` = standard deviation of `\(x\)` + `\(s_y\)` = standard deviation of `\(y\)` --- # Standardizing the variables + Alternatively, for continuous variables, transforming both the predictor and outcome variables to `\(z\)`-scores (mean=0, SD=1) prior to fitting the model yields standardised betas. + `\(z\)`-score for `\(x\)`: `$$z_{x_i} = \frac{x_i - \bar{x}}{s_x}$$` + and the `\(z\)`-score for `\(y\)`: `$$z_{y_i} = \frac{y_i - \bar{y}}{s_y}$$` + That is, we divide the individual deviations from the mean by the standard deviation --- # Two approaches in action ```r m1 <- lm(score~hours, data = test) summary(m1)$coefficients ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.400000 1.1111010 0.3600033 0.7281636 ## hours 1.054545 0.3581403 2.9445039 0.0185812 ``` ```r *round(1.054545 * (sd(test$hours)/sd(test$score)),3) ``` ``` ## [1] 0.721 ``` --- # Two approaches in action ```r test <- test %>% mutate( * z_score = scale(score, center = T, scale = T), * z_hours = scale(hours, center = T, scale = T) ) performance_z <- lm(z_score ~ z_hours, data = test) round(summary(performance_z)$coefficients,3) ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.000 0.232 0.000 1.000 ## z_hours 0.721 0.245 2.945 0.019 ``` -- `$$z_{x_i} = \frac{\color{#BF1932}{x_i - \bar{x}}}{s_x}$$` + `center = T` indicates `\(x\)` should be mean centered --- count: false # Two approaches in action ```r test <- test %>% mutate( * z_score = scale(score, center = T, scale = T), * z_hours = scale(hours, center = T, scale = T) ) performance_z <- lm(z_score ~ z_hours, data = test) round(summary(performance_z)$coefficients,3) ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.000 0.232 0.000 1.000 ## z_hours 0.721 0.245 2.945 0.019 ``` `$$z_{x_i} = \frac{x_i - \bar{x}}{\color{#BF1932}{s_x}}$$` + `center = T` indicates `\(x\)` should be mean centered + `scale = T` indicates `\(x\)` should be divided by `\(s_x\)` --- ## Interpreting standardized regression coefficients .pull-left[ **Unstandardized** <img src="figs/unstandardizedResults.png" width="80%" /> ] .pull-right[ **Standardized** <img src="figs/standardizedResults.png" width="80%" /> ] -- + `\(R^2\)` , `\(F\)` and `\(t\)`-test and their corresponding `\(p\)`-values remain the same for the standardized coefficients as for unstandardised coefficients. -- + `\(\beta_0\)` (intercept) = zero when all variables are standardized: `$$\beta_0 = \bar{y}-\hat \beta_1\bar{x}$$` `$$\bar{y} - \hat \beta_1 \bar{x} = 0 - \hat \beta_1 0 = 0$$` --- ## Interpreting standardized regression coefficients + The interpretation of the coefficients becomes the increase in `\(y\)` in standard deviation units for every standard deviation increase in `\(x\)` + So, in our example: >**For every standard deviation increase in hours of study, test score increases by 0.72 standard deviations** --- # Which should we use? + Unstandardized regression coefficients are often more useful when the variables are on meaningful scales + E.g. X additional hours of exercise per week adds Y years of healthy life + Sometimes it's useful to obtain standardized regression coefficients + When the scales of variables are arbitrary + When there is a desire to compare the effects of variables measured on different scales + Cautions + Just because you can put regression coefficients on a common metric doesn't mean they can be meaningfully compared. + The SD is a poor measure of spread for skewed distributions, therefore, be cautious of their use with skewed variables --- # Relationship to correlation ( `\(r\)` ) + Standardized slope ( `\(\hat \beta_1^*\)` ) = correlation coefficient ( `\(r\)` ) for a linear model with a single continuous predictor. + For example: ```r round(lm(z_score ~ z_hours, data = test)$coefficients, 2) ``` ``` ## (Intercept) z_hours ## 0.00 0.72 ``` ```r round(cor(test$hours, test$score),2) ``` ``` ## [1] 0.72 ``` --- # Relationship to correlation ( `\(r\)` ) + They are the same: + `\(r\)` is a standardized measure of linear association + `\(\hat \beta_1^*\)` is a standardized measure of the linear slope. + Something similar is true for linear models with multiple predictors. + Slopes are equivalent to the *part correlation coefficient* --- class: center, middle # Questions? --- class: inverse, center, middle # Part 3: Adding more predictors to our model --- # Multiple predictors (multiple regression) + The aim of a linear model is to explain variance in an outcome + In simple linear models, we have a single predictor, but the model can accommodate (in principle) any number of predictors. + However, when we include multiple predictors, those predictors are likely to correlate. + Thus, a linear model with multiple predictors finds the optimal prediction of the outcome from several predictors, **taking into account their redundancy with one another** --- # Uses of multiple regression + **For prediction:** multiple predictors may lead to improved prediction. + **For theory testing:** often our theories suggest that multiple variables together contribute to variation in an outcome + **For covariate control:** we might want to assess the effect of a specific predictor, controlling for the influence of others. + E.g., effects of personality on health after removing the effects of age and sex --- # Extending the regression model + Our model for a single predictor: `$$y_i = \beta_0 + \beta_1 x_{1i} + \epsilon_i$$` + is extended to include additional `\(x\)`'s: `$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \epsilon_i$$` + For each `\(x\)`, we have an additional `\(\beta\)` + `\(\beta_1\)` is the coefficient for the 1st predictor + `\(\beta_2\)` for the second etc. --- # Interpreting coefficients in multiple regression `$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_j x_{ji} + \epsilon_i$$` + Given that we have additional variables, our interpretation of the regression coefficients changes a little + `\(\beta_0\)` = the predicted value for `\(y\)` **all** `\(x\)` are 0. + Each `\(\beta_j\)` is now a **partial regression coefficient** + It captures the change in `\(y\)` for a one unit change in `\(x\)` **when all other x's are held constant** --- # What does holding constant mean? + Refers to finding the effect of the predictor when the values of the other predictors are fixed + It may also be expressed as the effect of **controlling for**, or **partialling out**, or **residualizing for** the other `\(x\)`'s + With multiple predictors `lm` isolates the effects and estimates the unique contributions of predictors. --- # Visualizing models .pull-left[ <img src="dapr2_02_LM2_files/figure-html/unnamed-chunk-23-1.png" width="80%" /> ] .pull-right[ <img src="figs/lm_surface.png" width="80%" /> ] ??? + In simple linear models, we could visualise the model as a straight line in 2D space + Least squares finds the coefficients that produces the *regression line* that minimises the vertical distances of the observed y-values from the line + In a regression with 2 predictors, this becomes a regression plane in 3D space + The goal now becomes finding the set of coefficients that minimises the vertical distances between the *regression* *plane* and the observed y-values + The logic extends to any number of predictors + (but becomes very difficult to visualise!) --- # Example: lm with 2 predictors + Imagine we extend our study of test scores. + We sample 150 students taking a multiple choice Biology exam (max score 40). + We give all students a survey at the start of the year measuring their school motivation. + We standardize this variable so the mean is 0, negative numbers are low motivation, and positive numbers high motivation. + We then measure the hours they spent studying for the test, and collate their scores on the test. --- # Our data ```r slice(test_study2, 1:6) ``` ``` ## ID score hours motivation ## 1 ID101 7 2 -1.42 ## 2 ID102 23 12 -0.41 ## 3 ID103 17 4 0.49 ## 4 ID104 6 2 0.24 ## 5 ID105 12 2 0.09 ## 6 ID106 24 12 1.05 ``` --- # `lm` code ```r *performance <- lm(score ~ hours + motivation, data = test_study2) ``` + Multiple predictors are separated by `+` --- # Multiple regression coefficients + Before we interpret our results, let's put all the pieces together: -- `$$Score_i = \beta_0 + \beta_1 Hours_{i} + \beta_2 Motivation_{i} + \epsilon_i$$` -- .pull-left[ <img src="figs/perfResults.png" width="125%" /> ] .pull-right[ + `\(\beta_0=\)` + `\(\beta_1=\)` + `\(\beta_2=\)` ] --- count: false # Multiple regression coefficients + Before we interpret our results, let's put all the pieces together: `$$Score_i = \beta_0 + \beta_1 Hours_{i} + \beta_2 Motivation_{i} + \epsilon_i$$` .pull-left[ <img src="figs/perfResults.png" width="125%" /> ] .pull-right[ + `\(\beta_0 = 6.87\)` + `\(\beta_1 = 1.38\)` + `\(\beta_2 = 0.92\)` ] --- count: false # Multiple regression coefficients + Before we interpret our results, let's put all the pieces together: `$$Score_i = \beta_0 + \beta_1 Hours_{i} + \beta_2 Motivation_{i} + \epsilon_i$$` .pull-left[ <img src="figs/perfResults.png" width="125%" /> ] .pull-right[ + `\(\beta_0 = 6.87\)` + `\(\beta_1 = 1.38\)` + `\(\beta_2 = 0.92\)` ```r test_study2[1,] ``` ``` ## ID score hours motivation ## 1 ID101 7 2 -1.42 ``` ```r round(residuals(performance)[1],2) ``` ``` ## 1 ## -1.32 ``` ] -- `$$7 = 6.87 + 1.38 \times 2 + 0.92 \times -1.42 + -1.32$$` --- # Multiple regression coefficients ```r res <- summary(performance) round(res$coefficients,2) ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.87 0.65 10.49 0.00 ## hours 1.38 0.08 17.22 0.00 ## motivation 0.92 0.38 2.39 0.02 ``` + **A student who did not study, and who has average school motivation would be expected to score 6.87 on the test.** --- # Multiple regression coefficients ```r res <- summary(performance) round(res$coefficients,2) ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.87 0.65 10.49 0.00 ## hours 1.38 0.08 17.22 0.00 ## motivation 0.92 0.38 2.39 0.02 ``` + **Controlling for students level of motivation, for every additional hour studied, there is a 1.38 points increase in test score.** --- # Multiple regression coefficients ```r res <- summary(performance) round(res$coefficients,2) ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.87 0.65 10.49 0.00 ## hours 1.38 0.08 17.22 0.00 ## motivation 0.92 0.38 2.39 0.02 ``` + **Controlling for hours of study, for every SD unit increase in motivation, there is a 0.92 points increase in test score.** --- class: center, middle # Questions? --- # Key Take Homes 1. We run linear models using `lm()` in R 2. The intercept is the value of `\(Y\)` when `\(X\)` = 0 3. The slope is the unit change in `\(Y\)` for each unit change in `\(X\)` 4. In certain cases, we may standardize our variables; this will affect their interpretation 5. We can easily add more predictors to our model 6. When we do, our interpretations of the coefficients are when all other predictors are held constant. --- # For next week + Things to recap... + We will look again at significance testing. + And also discuss sampling variability. + If you want a refresh, go back and review sampling and hypothesis testing material from dapR1 --- class: center, middle # Thanks for listening!