class: center, middle, inverse, title-slide

.title[
# Introduction to the Linear Model
]
.subtitle[
## DPUK Spring Academy
]
.author[
### Josiah King, Umberto Noe (and credits to Tom Booth)
]
.institute[
### Department of Psychology
The University of Edinburgh
]
.date[
### April 2025
]

---

# Overview

- Day 2: What is a linear model?
- Day 3: But I have more variables, what now?
- Day 4: Interactions
- Day 5: Is my model any good?

---
class: center, middle

# Day 5
**Is my model any good?**

---
class: inverse, center, middle

<h2>Part 1: Key model assumptions </h2>
<h2 style="text-align: left;opacity:0.3;">Part 2: Basic model diagnostics </h2>
<h2 style="text-align: left;opacity:0.3;">Part 3: What have we not covered? </h2>

---

# Linear model assumptions

+ So far, we have discussed evaluating linear models with respect to:
  + Overall model fit, and comparison of nested models (the `\(F\)`-ratio, `\(R^2\)`)
  + Individual coefficients (and their associated `\(t\)` statistics)

+ The linear model is also built on a set of assumptions.
  + If these assumptions are violated, we cannot generalise from our model to the broader population.
  + We therefore need to assess the extent to which these assumptions are met.

---

# Linear model assumptions

Our model:

`\(\color{red}{y} = \color{blue}{\beta_0 + \beta_1x_1 + ... + \beta_kx_k} + \varepsilon \qquad \text{where } \boldsymbol \varepsilon \sim N(0, \sigma) \text{ independently}\)`

<br><br>

Our ability to generalise from the model we fit on sample data to the wider population requires making some _assumptions._

--

- assumptions about the nature of the **model** (linear)

--

- assumptions about the nature of the **errors** (normal)

---

# 30,000 foot view

All our work here is aimed at making **models of the world**.

- Models are models. They are simplifications. They are therefore wrong.

--

- Our residuals reflect everything that we **don't** account for in our model: `\(y - \hat{y}\)`

--

- In an ideal world, our model accounts for _all_ the systematic relationships. The leftovers (our residuals) are just random noise.

--

- If our model is mis-specified, or we don't measure some systematic relationship, then our residuals may reflect this.

--

- We check by examining how much "like randomness" the residuals appear to be (zero mean, normally distributed, constant variance, i.i.d. ("independent and identically distributed")) - _these ideas tend to get referred to as our "assumptions"_

--

- We will **never** know whether our residuals contain only randomness.

---

# Visualizations vs tests

+ There exist a variety of ways to assess assumptions, which broadly split into statistical tests and visualizations.

+ We will focus on visualization:
  + It is easier to see the nature and magnitude of an assumption violation.
  + There is also a very useful function, `plot(model)`, for producing them all.

+ Statistical tests often suggest assumptions are violated even when the problem is small.
  + This is to do with the statistical power of the tests.
  + They give no information on what the actual problem is.
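---

## assumptions

Before walking through L, I, N, and E one by one, it helps to have a fitted model and its residuals to hand. A minimal sketch, using the `salary2` data that appears in the linearity example below, and assuming a simple `perf ~ serv` model chosen purely for illustration:

``` r
library(tidyverse)

# data used in the linearity example that follows
salary2 <- read_csv("https://uoepsy.github.io/scs/dpuk/data/salary2.csv")

# a hypothetical model: performance predicted by years of service
model <- lm(perf ~ serv, data = salary2)

resid(model)   # the residuals, y minus y-hat
fitted(model)  # the predicted values, y-hat
```

Every check in this part amounts to asking whether `resid(model)` looks like random noise.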
---

## assumptions - L

.pull-left[
What does randomness look like?

- **_L_**: linear across the predictors
- I:
- N:
- E:

- mean of the residuals = zero across the predicted values of the model.
]

.pull-right[
<!-- -->
]

---

## assumptions - L

.pull-left[
What does randomness look like?

- **_L_**: linear across the predictors
- I:
- N:
- E:

- mean of the residuals = zero across the predicted values of the model.
]

.pull-right[
<!-- -->
]

---

## assumptions - L

.pull-left[
What does randomness look like?

- **_L_**: linear across the predictors
- I:
- N:
- E:

- mean of the residuals = zero across the predicted values of the model.

+ **Investigated with**:
  + Scatterplots with loess lines (single variables)
  + Component-residual plots (when we have multiple predictors)
]

.pull-right[

``` r
salary2 <- read_csv("https://uoepsy.github.io/scs/dpuk/data/salary2.csv")

salary2 %>%
  ggplot(., aes(x = serv, y = perf)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", se = FALSE, col = "red") +
  labs(x = "Years of Service", y = "Performance",
       title = "Scatterplot with linear (blue) and loess (red) lines")
```
]

---

## assumptions - L

.pull-left[
What does randomness look like?

- **_L_**: linear across the predictors
- I:
- N:
- E:

- mean of the residuals = zero across the predicted values of the model.

+ **Investigated with**:
  + Scatterplots with loess lines (single variables)
  + Component-residual plots (when we have multiple predictors)
]

.pull-right[
<!-- -->
]

---

## assumptions - N

.pull-left[
What does randomness look like?

- L:
- I:
- **_N_**: normally distributed at each level of the predictor
- E:

- residuals are normally distributed
]

.pull-right[
<!-- -->
]

---

## assumptions - N

.pull-left[
What does randomness look like?

- L:
- I:
- **_N_**: normally distributed at each level of the predictor
- E:

- residuals are normally distributed

+ **Investigated with**:
  + QQ-plots
  + Histograms
]

.pull-right[

``` r
hist(resid(model))
```

<!-- -->
]

---

## assumptions - N

.pull-left[
What does randomness look like?

- L:
- I:
- **_N_**: normally distributed at each level of the predictor
- E:

- residuals are normally distributed

+ **Investigated with**:
  + QQ-plots
  + Histograms
]

.pull-right[

``` r
qqnorm(resid(model))
qqline(resid(model))
```

<!-- -->
]

---

## assumptions - E

.pull-left[
What does randomness look like?

- L:
- I:
- N:
- **_E_**: equal variance across the predictor

- spread of residuals is constant across the predicted values of the model.
]

.pull-right[
<!-- -->
]

---

## assumptions - E

.pull-left[
What does randomness look like?

- L:
- I:
- N:
- **_E_**: equal variance across the predictor

- spread of residuals is constant across the predicted values of the model.

+ **Investigated with**:
  + Plotting residual values against the predicted values ( `\(\hat{y}\)` ).
]

.pull-right[

``` r
plot(model, which = 1)
```

<!-- -->
]

---

## assumptions - LNE

.pull-left[
What does randomness look like?

- **_L_**: linear across the predictors
- I:
- **_N_**: normally distributed at each level of the predictor
- **_E_**: equal variance across the predictor
]

.pull-right[
__`plot(model)`__

``` r
my_model <- lm(y ~ x1 + x2 + ... + xk, data = df)
plot(my_model)
```

<!-- -->
]

---

## assumptions - I

.pull-left[
What does randomness look like?

- L: linear across the predictors
- **_I_**: Independence
- N: normally distributed at each level of the predictor
- E: equal variance across the predictor
]

.pull-right[
- need to know about your study design

|ppt | x1| x2|    x3|    y|
|:---|--:|--:|-----:|----:|
|p_1 |  0| 52|  1.47| 39.6|
|p_2 | 14| 81|  0.18| 45.9|
|p_3 |  9| 62| -1.91| 48.0|
|p_4 |  9| 68|  0.54| 43.2|
|p_5 |  7| 43|  0.51| 46.0|
|p_6 | 18| 63|  0.74| 49.6|
]

---

# Multi-collinearity

+ This is **not strictly an assumption of the linear model**, but it is something we need to consider.
  + It sits between assumptions and case diagnostics.

+ Multi-collinearity refers to the correlation between predictors.

+ We saw this in the formula for the standard error of model slopes for an `lm` with multiple predictors.

+ When there are large correlations between predictors, the standard errors are increased.

+ Therefore, we don't want our predictors to be too correlated.
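---

# Multi-collinearity

For intuition, here is a minimal simulation sketch (hypothetical data, not one of the course datasets) showing how the standard error of a slope inflates as its predictor becomes correlated with another:

``` r
set.seed(1)
n <- 100

# simulate y from two predictors whose correlation we control via r,
# and return the standard error of the slope for x1
sim_se <- function(r) {
  x1 <- rnorm(n)
  x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)  # cor(x1, x2) is approximately r
  y  <- 2 + 0.5 * x1 + 0.5 * x2 + rnorm(n)
  summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
}

sim_se(r = 0)    # SE of the x1 slope with (nearly) uncorrelated predictors
sim_se(r = 0.9)  # SE roughly doubles when the predictors correlate at ~.9
```

The Variance Inflation Factor on the next slide quantifies exactly this: the standard error of a slope is multiplied by roughly the square root of that predictor's VIF.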
---

# Variance Inflation Factor

+ The **Variance Inflation Factor**, or VIF, quantifies the extent to which standard errors are increased by predictor inter-correlations.

+ It can be obtained in R using the `vif()` function (from the **car** package):

``` r
vif(model)
```

```
##       x1       x2       x3       x4 
## 2.188558 1.920981 1.227417 1.011182
```

+ The function gives a VIF value for each predictor.
+ Ideally, we want values to be close to 1.
+ VIFs > 10 indicate a problem.

---

# What to do about multi-collinearity

+ In practice, multi-collinearity is not often a major problem.

+ When issues arise, consider:
  + Combining highly correlated predictors into a single composite
    + E.g. create a sum or average of the two predictors
  + Dropping a predictor that is obviously statistically and conceptually redundant with another in the model

---
class: inverse, center, middle

<h2 style="text-align: left;opacity:0.3;">Part 1: Key model assumptions </h2>
<h2>Part 2: Basic model diagnostics </h2>
<h2 style="text-align: left;opacity:0.3;">Part 3: What have we not covered? </h2>

---

# Three important features

+ Model outliers
  + Cases for which there is a large discrepancy between their predicted value ( `\(\hat{y}_i\)` ) and their observed value ( `\(y_i\)` )

--

+ High leverage cases
  + Cases with an unusual value of the predictor ( `\(x_i\)` )

--

+ High influence cases
  + Cases that have a large impact on the estimation of the model

---

# Influence

+ High leverage cases, when they are also linear model outliers, will have high **influence**.

+ Cases with high influence have a strong effect on the coefficients.

+ If we deleted such a case, the linear model coefficients would change substantially.

---

# Influence

<img src="DPUK_E_2025_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

# Influence

+ If a handful of influential cases are responsible for the linear model results, the conclusions might not generalise very well.

+ There are multiple ways to consider influence.
  + Here we will discuss Cook's distance.

+ Cook's distance of a data point `\(i\)` (it can be written many ways):

`$$D_i = \frac{(\text{StandardizedResidual}_i)^2}{k+1} \times \frac{h_i}{1-h_i}$$`

---

# Cook's Distance

`$$\frac{(\text{StandardizedResidual}_i)^2}{k+1} = \text{Outlyingness}$$`

`$$\frac{h_i}{1-h_i} = \text{Leverage}$$`

+ So `\(D_i = \text{Outlyingness} \times \text{Leverage}\)`

+ Cook's distance refers to **the average distance the `\(\hat{y}\)` values will move if a given case is removed.**
  + If removing one case changes the predicted values a lot (moves the regression line), then that case is influencing our results.

---

# Cook's Distance

+ There are many different suggestions for cut-offs:
  + `\(D_i > 1\)`
  + `\(D_i > \frac{4}{n-k-1}\)`
  + Or size relative to all other values in the data set

---

# Cook's distance in R

.pull-left[

``` r
cooks.distance(model)
influence.measures(model)
```
]

.pull-right[

``` r
plot(model, which = 4)
```

<!-- -->
]
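---

# Cook's distance in R

A short sketch putting the pieces together (assuming `model` is a fitted `lm`, and using the `\(4/(n-k-1)\)` cut-off from the slide above):

``` r
# Cook's distance for every case
D <- cooks.distance(model)

# the cut-off 4 / (n - k - 1), where k is the number of predictors
n <- nobs(model)
k <- length(coef(model)) - 1  # minus 1 for the intercept
cutoff <- 4 / (n - k - 1)

# which cases exceed the cut-off?
which(D > cutoff)
```

Cases flagged this way are candidates for closer inspection, not for automatic deletion.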
---

# Other Influence Measures

+ Cook's distance is a single value summarising the total influence of a case.

+ In the context of an `lm` with 2+ predictors, we may want to look in a little more detail.

+ **DFFit**: The difference between the predicted outcome value for a case with versus without that case included

+ **DFbeta**: The difference between the value of a coefficient with and without a case included

+ **DFbetas**: A standardised version of DFbeta
  + Obtained by dividing by an estimate of the standard error of the regression coefficient with the case removed

---

# Other Influence Measures - COVRATIO

+ Influence on standard errors can be measured using the **COVRATIO** statistic.
  + COVRATIO values < 1 show that precision is decreased (SE increased) by a case.
  + COVRATIO values > 1 show that precision is increased (SE decreased) by a case.

+ Cases with COVRATIO `\(> 1+[3(k+1)/n]\)` or `\(< 1-[3(k+1)/n]\)` can be considered to have a strong influence on the standard errors.

---

# COVRATIO in R

.pull-left[

``` r
covratio(model)
influence.measures(model)
```
]

---
class: inverse, center, middle

<h2 style="text-align: left;opacity:0.3;">Part 1: Key model assumptions </h2>
<h2 style="text-align: left;opacity:0.3;">Part 2: Basic model diagnostics </h2>
<h2>Part 3: What have we not covered? </h2>

---

# In short, quite a lot!

+ Coding schemes for categorical data
+ Coding specific comparisons
+ Interactions with categorical variables with 2+ levels
+ Detailed probing of interactions
+ Statistical tests for assumptions
+ Corrections for assumption violations
+ Bootstrapping
+ Extended model diagnostics
+ Non-continuous outcome variables
+ ....

---

# Material links

+ All our lecture materials can be found [here](https://uoepsy.github.io/)

---
class: center, middle

# Thanks all!