class: center, middle, inverse, title-slide

.title[
# Introduction to the Linear Model
]
.subtitle[
## DPUK Spring Academy
]
.author[
### Josiah King, Umberto Noe (and credits to Tom Booth)
]
.institute[
### Department of Psychology
The University of Edinburgh
]
.date[
### April 2025
]

---

# Overview

- Day 2: What is a linear model?
- Day 3: But I have more variables, what now?
- Day 4: Interactions
- Day 5: Is my model any good?

---
class: center, middle

# Day 5
**Is my model any good?**

---
class: inverse, center, middle

<h2>Part 1: Key model assumptions </h2>
<h2 style="text-align: left;opacity:0.3;">Part 2: Basic model diagnostics </h2>
<h2 style="text-align: left;opacity:0.3;">Part 3: What have we not covered? </h2>

---

# Linear model assumptions

+ So far, we have discussed evaluating linear models with respect to:
  + Overall model fit, and comparison of nested models (the `\(F\)`-ratio, `\(R^2\)`)
  + Individual coefficients (and their associated `\(t\)` statistics)

+ The linear model is also built on a set of assumptions.
  + If these assumptions are violated, we cannot generalise from our model to the broader population.
  + We therefore need to assess the extent to which these assumptions are met.

---

# Linear model assumptions

Our model:

`\(\color{red}{y} = \color{blue}{\beta_0 + \beta_1x_1 + ... + \beta_kx_k} + \varepsilon \qquad \text{where } \boldsymbol \varepsilon \sim N(0, \sigma) \text{ independently}\)`

<br><br>

Our ability to generalise from the model we fit on sample data to the wider population requires making some _assumptions._

--

- assumptions about the nature of the **model** (linear)

--

- assumptions about the nature of the **errors** (normal)

---

# 30,000 foot view

All our work here is aimed at making **models of the world**.

- Models are models. They are simplifications. They are therefore wrong.

--

- Our residuals reflect everything that we **don't** account for in our model: `\(y - \hat{y}\)`

--

- In an ideal world, our model accounts for _all_ the systematic relationships. The leftovers (our residuals) are just random noise.

--

- If our model is mis-specified, or we don't measure some systematic relationship, then our residuals may reflect this.

--

- We check by examining how much "like randomness" the residuals appear to be (zero mean, normally distributed, constant variance, i.i.d. ("independent and identically distributed")) - _these ideas tend to get referred to as our "assumptions"_

--

- We will **never** know whether our residuals contain only randomness.

---

# Visualizations vs tests

+ There exist a variety of ways to assess assumptions, which broadly split into statistical tests and visualizations.

+ We will focus on visualization:
  + It is easier to see the nature and magnitude of an assumption violation.
  + There is also a very useful function, `plot(model)`, for producing them all.

+ Statistical tests often suggest assumptions are violated even when the problem is small.
  + This is to do with the statistical power of the tests.
  + They give no information on what the actual problem is.
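---

## assumptions

Before walking through L, I, N, and E one by one, it helps to have a fitted model and its residuals to hand. A minimal sketch, using the `salary2` data that appears in the linearity example below, and assuming a simple `perf ~ serv` model chosen purely for illustration:

``` r
library(tidyverse)

# data used in the linearity example that follows
salary2 <- read_csv("https://uoepsy.github.io/scs/dpuk/data/salary2.csv")

# a hypothetical model: performance predicted by years of service
model <- lm(perf ~ serv, data = salary2)

resid(model)   # the residuals, y minus y-hat
fitted(model)  # the predicted values, y-hat
```

Every check in this part amounts to asking whether `resid(model)` looks like random noise.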
---

## assumptions - L

.pull-left[
What does randomness look like?

- **_L_**: linear across the predictors
- I:
- N:
- E:

- mean of the residuals = zero across the predicted values of the model.
]

.pull-right[
<!-- -->
]

---

## assumptions - L

.pull-left[
What does randomness look like?

- **_L_**: linear across the predictors
- I:
- N:
- E:

- mean of the residuals = zero across the predicted values of the model.
]

.pull-right[
<!-- -->
]

---

## assumptions - L

.pull-left[
What does randomness look like?

- **_L_**: linear across the predictors
- I:
- N:
- E:

- mean of the residuals = zero across the predicted values of the model.

+ **Investigated with**:
  + Scatterplots with loess lines (single variables)
  + Component-residual plots (when we have multiple predictors)
]

.pull-right[

``` r
salary2 <- read_csv("https://uoepsy.github.io/scs/dpuk/data/salary2.csv")

salary2 %>%
  ggplot(., aes(x = serv, y = perf)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", se = FALSE, col = "red") +
  labs(x = "Years of Service", y = "Performance",
       title = "Scatterplot with linear (blue) and loess (red) lines")
```
]

---

## assumptions - L

.pull-left[
What does randomness look like?

- **_L_**: linear across the predictors
- I:
- N:
- E:

- mean of the residuals = zero across the predicted values of the model.

+ **Investigated with**:
  + Scatterplots with loess lines (single variables)
  + Component-residual plots (when we have multiple predictors)
]

.pull-right[
<!-- -->
]

---

## assumptions - N

.pull-left[
What does randomness look like?

- L:
- I:
- **_N_**: normally distributed at each level of the predictor
- E:

- residuals are normally distributed
]

.pull-right[
<!-- -->
]

---

## assumptions - N

.pull-left[
What does randomness look like?

- L:
- I:
- **_N_**: normally distributed at each level of the predictor
- E:

- residuals are normally distributed

+ **Investigated with**:
  + QQ-plots
  + Histograms
]

.pull-right[

``` r
hist(resid(model))
```

<!-- -->
]

---

## assumptions - N

.pull-left[
What does randomness look like?

- L:
- I:
- **_N_**: normally distributed at each level of the predictor
- E:

- residuals are normally distributed

+ **Investigated with**:
  + QQ-plots
  + Histograms
]

.pull-right[

``` r
qqnorm(resid(model))
qqline(resid(model))
```

<!-- -->
]

---

## assumptions - E

.pull-left[
What does randomness look like?

- L:
- I:
- N:
- **_E_**: equal variance across the predictor

- spread of residuals is constant across the predicted values of the model.
]

.pull-right[
<!-- -->
]

---

## assumptions - E

.pull-left[
What does randomness look like?

- L:
- I:
- N:
- **_E_**: equal variance across the predictor

- spread of residuals is constant across the predicted values of the model.

+ **Investigated with**:
  + Plotting residual values against the predicted values ( `\(\hat{y}\)` ).
]

.pull-right[

``` r
plot(model, which = 1)
```

<!-- -->
]

---

## assumptions - LNE

.pull-left[
What does randomness look like?

- **_L_**: linear across the predictors
- I:
- **_N_**: normally distributed at each level of the predictor
- **_E_**: equal variance across the predictor
]

.pull-right[
__`plot(model)`__

``` r
my_model <- lm(y ~ x1 + x2 + ... + xk, data = df)
plot(my_model)
```

<!-- -->
]

---

## assumptions - I

.pull-left[
What does randomness look like?

- L: linear across the predictors
- **_I_**: Independence
- N: normally distributed at each level of the predictor
- E: equal variance across the predictor
]

.pull-right[
- need to know about your study design

|ppt | x1| x2|    x3|    y|
|:---|--:|--:|-----:|----:|
|p_1 |  0| 52|  1.47| 39.6|
|p_2 | 14| 81|  0.18| 45.9|
|p_3 |  9| 62| -1.91| 48.0|
|p_4 |  9| 68|  0.54| 43.2|
|p_5 |  7| 43|  0.51| 46.0|
|p_6 | 18| 63|  0.74| 49.6|
]

---

# Multi-collinearity

+ This is **not strictly an assumption of the linear model**, but it is something we need to consider.
  + It sits between assumptions and case diagnostics.

+ Multi-collinearity refers to the correlation between predictors.

+ We saw this in the formula for the standard error of model slopes for an `lm` with multiple predictors.

+ When there are large correlations between predictors, the standard errors are increased.

+ Therefore, we don't want our predictors to be too correlated.
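---

# Multi-collinearity

For intuition, here is a minimal simulation sketch (hypothetical data, not one of the course datasets) showing how the standard error of a slope inflates as its predictor becomes correlated with another:

``` r
set.seed(1)
n <- 100

# simulate y from two predictors whose correlation we control via r,
# and return the standard error of the slope for x1
sim_se <- function(r) {
  x1 <- rnorm(n)
  x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)  # cor(x1, x2) is approximately r
  y  <- 2 + 0.5 * x1 + 0.5 * x2 + rnorm(n)
  summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
}

sim_se(r = 0)    # SE of the x1 slope with (nearly) uncorrelated predictors
sim_se(r = 0.9)  # SE roughly doubles when the predictors correlate at ~.9
```

The Variance Inflation Factor on the next slide quantifies exactly this: the standard error of a slope is multiplied by roughly the square root of that predictor's VIF.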
---

# Variance Inflation Factor

+ The **Variance Inflation Factor**, or VIF, quantifies the extent to which standard errors are increased by predictor inter-correlations.

+ It can be obtained in R using the `vif()` function (from the **car** package):

``` r
vif(model)
```

```
##       x1       x2       x3       x4 
## 2.188558 1.920981 1.227417 1.011182
```

+ The function gives a VIF value for each predictor.
+ Ideally, we want values to be close to 1.
+ VIFs > 10 indicate a problem.

---

# What to do about multi-collinearity

+ In practice, multi-collinearity is not often a major problem.

+ When issues arise, consider:
  + Combining highly correlated predictors into a single composite
    + E.g. create a sum or average of the two predictors
  + Dropping a predictor that is obviously statistically and conceptually redundant with another in the model

---
class: inverse, center, middle

<h2 style="text-align: left;opacity:0.3;">Part 1: Key model assumptions </h2>
<h2>Part 2: Basic model diagnostics </h2>
<h2 style="text-align: left;opacity:0.3;">Part 3: What have we not covered? </h2>

---

# Three important features

+ Model outliers
  + Cases for which there is a large discrepancy between their predicted value ( `\(\hat{y}_i\)` ) and their observed value ( `\(y_i\)` )

--

+ High leverage cases
  + Cases with an unusual value of the predictor ( `\(x_i\)` )

--

+ High influence cases
  + Cases that have a large impact on the estimation of the model

---

# Influence

+ High leverage cases, when they are also linear model outliers, will have high **influence**.

+ Cases with high influence have a strong effect on the coefficients.

+ If we deleted such a case, the linear model coefficients would change substantially.

---

# Influence

<img src="DPUK_E_2025_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

# Influence

+ If a handful of influential cases are responsible for the linear model results, the conclusions might not generalise very well.

+ There are multiple ways to consider influence.
  + Here we will discuss Cook's distance.

+ Cook's distance of a data point `\(i\)` (it can be written many ways):

`$$D_i = \frac{(\text{StandardizedResidual}_i)^2}{k+1} \times \frac{h_i}{1-h_i}$$`

---

# Cook's Distance

`$$\frac{(\text{StandardizedResidual}_i)^2}{k+1} = \text{Outlyingness}$$`

`$$\frac{h_i}{1-h_i} = \text{Leverage}$$`

+ So `\(D_i = \text{Outlyingness} \times \text{Leverage}\)`

+ Cook's distance refers to **the average distance the `\(\hat{y}\)` values will move if a given case is removed.**
  + If removing one case changes the predicted values a lot (moves the regression line), then that case is influencing our results.

---

# Cook's Distance

+ There are many different suggestions for cut-offs:
  + `\(D_i > 1\)`
  + `\(D_i > \frac{4}{n-k-1}\)`
  + Or size relative to all other values in the data set

---

# Cook's distance in R

.pull-left[

``` r
cooks.distance(model)
influence.measures(model)
```
]

.pull-right[

``` r
plot(model, which = 4)
```

<!-- -->
]
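---

# Cook's distance in R

A short sketch putting the pieces together (assuming `model` is a fitted `lm`, and using the `\(4/(n-k-1)\)` cut-off from the slide above):

``` r
# Cook's distance for every case
D <- cooks.distance(model)

# the cut-off 4 / (n - k - 1), where k is the number of predictors
n <- nobs(model)
k <- length(coef(model)) - 1  # minus 1 for the intercept
cutoff <- 4 / (n - k - 1)

# which cases exceed the cut-off?
which(D > cutoff)
```

Cases flagged this way are candidates for closer inspection, not for automatic deletion.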
---

# Other Influence Measures

+ Cook's distance is a single value summarising the total influence of a case.

+ In the context of an `lm` with 2+ predictors, we may want to look in a little more detail.

+ **DFFit**: The difference between the predicted outcome value for a case with versus without that case included

+ **DFbeta**: The difference between the value of a coefficient with and without a case included

+ **DFbetas**: A standardised version of DFbeta
  + Obtained by dividing by an estimate of the standard error of the regression coefficient with the case removed

---

# Other Influence Measures - COVRATIO

+ Influence on standard errors can be measured using the **COVRATIO** statistic.
  + COVRATIO values < 1 show that precision is decreased (SE increased) by a case.
  + COVRATIO values > 1 show that precision is increased (SE decreased) by a case.

+ Cases with COVRATIO `\(> 1+[3(k+1)/n]\)` or `\(< 1-[3(k+1)/n]\)` can be considered to have a strong influence on the standard errors.

---

# COVRATIO in R

.pull-left[

``` r
covratio(model)
influence.measures(model)
```
]

---
class: inverse, center, middle

<h2 style="text-align: left;opacity:0.3;">Part 1: Key model assumptions </h2>
<h2 style="text-align: left;opacity:0.3;">Part 2: Basic model diagnostics </h2>
<h2>Part 3: What have we not covered? </h2>

---

# In short, quite a lot!

+ Coding schemes for categorical data
+ Coding specific comparisons
+ Interactions with categorical variables with 2+ levels
+ Detailed probing of interactions
+ Statistical tests for assumptions
+ Corrections for assumption violations
+ Bootstrapping
+ Extended model diagnostics
+ Non-continuous outcome variables
+ ....

---

# Material links

+ All our lecture materials can be found [here](https://uoepsy.github.io/)

---
class: center, middle

# Thanks all!