Assumptions & Diagnostics

class: center, middle, inverse, title-slide

.title[
# <b> Assumptions & Diagnostics </b>
]
.subtitle[
## Data Analysis for Psychology in R 2<br><br>
]
.author[
### dapR2 Team
]
.institute[
### Department of Psychology<br>The University of Edinburgh
]

---

# Course Overview

.pull-left[

<table style="border: 1px solid black;>
  <tr style="padding: 0 1em 0 1em;">
    <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1;text-align:center;vertical-align: middle">
        <b>Introduction to Linear Models</b></td>
    <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1">
        Intro to Linear Regression</td>
  </tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1">
        Interpreting Linear Models</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1">
        Testing Individual Predictors</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1">
        Model Testing & Comparison</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1">
        Linear Model Analysis</td></tr>

<tr style="padding: 0 1em 0 1em;">
    <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1;text-align:center;vertical-align: middle">
        <b>Analysing Experimental Studies</b></td>
    <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1">
        Categorical Predictors & Dummy Coding</td>
  </tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1">
        	Effects Coding & Coding Specific Contrasts</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1">
        <b>Assumptions & Diagnostics</b></td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Bootstrapping</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        	Categorical Predictor Analysis</td></tr>
</table>

]

.pull-right[

<table style="border: 1px solid black;>
  <tr style="padding: 0 1em 0 1em;">
    <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4;text-align:center;vertical-align: middle">
        <b>Interactions</b></td>
    <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Interactions I</td>
  </tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Interactions II</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Interactions III</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Analysing Experiments</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Interaction Analysis</td></tr>

<tr style="padding: 0 1em 0 1em;">
    <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4;text-align:center;vertical-align: middle">
        <b>Advanced Topics</b></td>
    <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Power Analysis</td>
  </tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Binary Logistic Regression I</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Binary Logistic Regression II</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Logistic Regresison Analysis</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        	Exam Prep and Course Q&A</td></tr>
</table>

]

---

## Learning Objectives

1. Be able to state the assumptions underlying a linear model

2. Understand how to test linear model assumptions

3. Understand the difference between outliers and influential points

4. Test and assess the effect of influential cases on linear model coefficients and overall model evaluations

---
class: center, middle, inverse

## Part 1: Assumptions of Linear Model

---
## Linear model assumptions 
+ So far, we have discussed evaluating linear models with respect to:
	+ Overall model fit ( `$F$` -ratio, `$R^2$`)
	+ Individual predictors

+ However, the linear model is also built on a set of assumptions

+ If these assumptions are violated, the conclusions we draw from our model will not be valid

+ Thus, before drawing conclusions about our research questions and hypotheses, we also need to assess the extent to which linear model assumptions are met

---
## Assumptions (LINE)

+ What are the assumptions of linear models?

+ **L**inearity
	+ **I**ndependence of errors
	+ **N**ormality of errors
	+ **E**qual variance (Homoscedasticity)

---
## Visualisations vs tests
+ There exist a variety of ways to assess assumptions, which broadly split into _statistical tests_ and _visualisations_

+ We will focus on **visualisations**:
	+ Easier to see the nature and magnitude of the assumption violation
	+ Visualisations for checking model assumptions easy to produce in R

+ Statistical tests often suggest assumptions are violated even when problem is small
  + This has to do with the statistical power of the tests
  + Harder to diagnose the exact problem from a test than a visualisation

---
## Visualisations made easy
+ For many assumption and diagnostic plots, we will make use of the `plot()` function
  + If we give `plot()` a linear model object (e.g. `m1` or `m2`), we can automatically generate assumption plots

+ We will also make use of some individual functions for specific visualisations

+ Alternatively, we can also use `check_model()` from the `performance` package
  + This provides `ggplot` figures as well as some notes to aid interpretation
  + Caution that these plots are **not in a format to use directly in reports**
  + However it is a useful quick way to check models when we are working through stages of an analysis

---
## **L** - Linearity 
+ **Assumption**: The relationship between each numerical `$x$`, and `$y$` is linear
  + Assuming a linear relation when the true relation is non-linear can result in under-estimating that relation

.pull-left[
.center[**Linear Relationship**]
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-4-1.png" width="90%" />

]

.pull-right[
.center[**Nonlinear Relationship**]
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-5-1.png" width="90%" />

]

+ **Investigated with**:

1. Scatterplots with loess lines (single variables)
  2. Component-residual plots (multiple predictors in the model)

---
##  Scatterplots with loess lines

.pull-left[
+ A *loess line* is a method for helping visualise the shape of relationships
    
+ Stands for...
    + **LO**cally
    + **E**stimated
    + **S**catterplot
    + **S**moothing
    
+ Essentially, produces a line or curve that closely follows the data

]

.pull-right[
.center[**Linear**]
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-6-1.png" width="90%" />

.center[**Nonlinear**]
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-7-1.png" width="90%" />

]

---
## Visualisation of Loess Lines

.pull-left[

``` r
ggplot(df2, aes(x=x, y=y2)) +
  geom_point() +
* geom_smooth(method = "lm",
              se=F, colour = 'blue') +
* geom_smooth(method = 'loess',
              se=F, colour = "red") +
    labs(x='X', y = 'Y')
```
]

.pull-right[
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-9-1.png" width="90%" />
]

---
## Component-residual plots

+ Component-residual plots are also called partial-residual plots 
  
+ They allow us to check for linearity between each predictor and outcome, controlling for the other predictors

+ Component-residual plots have the predictor values on the X-axis and partial residuals on the Y-axis

+ *Partial residuals* for each X variable are:

`$$\epsilon_i + B_jX_{ij}$$`
+ Where :
  + `$\epsilon_i$` is the residual from the linear model including all the predictors
  
  + `$B_jX_{ij}$` is the partial (linear) relation between `$x_j$` and `$y$`

---
## `crPlots()`

.pull-left[

+ Component-residual plots can be obtained using the `crPlots()` function from `car` package

+ The plots for continuous predictors show a linear (dashed) and loess (solid) line

+ The loess line should follow the linear line closely, with deviations suggesting non-linearity

``` r
m1 <- lm(y ~ x1 + x2, data = df3)
crPlots(m1)
```
]

.pull-right[
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-12-1.png" width="90%" />
]

---
count: false

## `crPlots()`

.pull-left[

+ Component-residual plots can be obtained using the `crPlots()` function from `car` package

+ The plots for continuous predictors show a linear (dashed) and loess (solid) line

+ The loess line should follow the linear line closely, with deviations suggesting non-linearity

``` r
m1 <- lm(y ~ x1 + x2, data = df3)
crPlots(m1)
```

> **Test your understanding:** Do both `$x1$` and `$x2$` meet the assumption of linearity?
]

.pull-right[
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-14-1.png" width="90%" />
]

---
## Dealing with violations of linearity

+ Reassess your model

+ Including higher order regression terms can help capture more complex relationships
  
  + Quadratic, cubic, or higher degree polynomial terms are better suited to model nonlinear data
  
  + More on this in DapR3

---
##  **I** - Independence of errors 
+ **Assumption**: The errors are not correlated with one another

+ Difficult to test unless we know the potential source of correlation between cases

+ When we do, we can use a variant of the linear model to account for this
  
  + We will see more of this in year 3

+ Essentially, for now, we will evaluate this based on study design

+ If a design is between-person, we will assume the errors to be independent
  
---
## **N** - Normally distributed errors 
+ **Assumption**: The errors ( `$\epsilon_i$` ) are normally distributed, with a mean of 0

+ **Investigated with**:
  1. Histograms
  2. QQ-plots

---
## Histograms
+ Used to plot the frequency distribution of a variable

+ To test the assumption that error is normally distributed, you plot the residuals ( `$\epsilon_i$` ):

.pull-left[
.center[**Seems Normal Enough**]
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-15-1.png" width="90%" />
]

.pull-right[
.center[**Not So Normal**]
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-16-1.png" width="90%" />
]

---
## Histograms

.pull-left[

``` r
hist(m1$residuals)
```

<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-17-1.png" width="90%" />
]

.pull-left[

``` r
hist(m2$residuals)
```

]

---
## Q-Q Plots

.pull-left[

+ Quantile-quantile plots compare actual and expected quantiles for the model's residuals

+ Plot the standardised residuals from the model against their theoretically expected values

+ If residuals are normally distributed, points should fall neatly on the diagonal of the plot

+ Non-normally distributed residuals cause deviations of points from the diagonal

+ The specific shape of these deviations are characteristic of the distribution of the residuals
]

.pull-right[
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-19-1.png" width="90%" />
]
---
## Q-Q Plots

.pull-left[

``` r
plot(m1, which = 2)
```

<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-20-1.png" width="90%" />
.center[**Looks alright**]
]

.pull-right[

``` r
plot(m2, which = 2)
```

<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-21-1.png" width="90%" />
.center[**Nope.**]
]

---
## **Test your Understanding:**

> Do you think any of these suggest a violated assumption?

.pull-left[

]

.pull-right[

<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-25-1.png" width="90%" />
]

---
## Dealing with non-normality

+ Has been suggested that data be transformed, but...

> "We find that violations of the normality of residuals assumption are rarely problematic for hypothesis testing and parameter estimation, and we argue that the commonly recommended solutions may bear greater risks than the one to be solved." - Knief & Forstmeier, 2021, [here](https://link.springer.com/article/10.3758/s13428-021-01587-5)

+ Unless there is a significant violation, it may be best to report it and carry on with the analysis

---
##  **E** - Equal variance (Homoscedasticity)

+ **Assumption**: The equal variances assumption is that the variance is constant across values of the predictors `$x_1$`, ... `$x_k$`, and across values of the outcome `$\hat{y}$`
	+ Heteroscedasticity refers to when this assumption is violated (non-constant variance)

.pull-left[
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-26-1.png" width="90%" />
]

.pull-right[
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-27-1.png" width="90%" />
]

+ **Investigated by**:
  + Plotting residual values against the predicted values
  + Plotting residual values against the predictors

---
##  Residual-vs-predicted values plot

.pull-left[
+ In R, we can plot the residuals vs predicted values using `residualPlot()` function in the `car` package.

+ Categorical predictors should show a similar spread of residual values across their levels
  
  + The plots for continuous predictors should look like a random array of dots
  
	  + The solid line should follow the dashed line closely
]

.pull-right[
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-28-1.png" width="90%" />

]

---
## Residual-vs-predicted values plot

.pull-left[

``` r
residualPlot(m5)
```

]

.pull-right[

``` r
plot(m5, which = 1)
```

<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-30-1.png" width="90%" />
]

---
## Dealing with heteroscedasticity

+ Violations of heteroscedasticity suggest that the model is not equally good at predicting the outcome along the values of the predictor(s) or outcome

+ Additional variables or interaction terms may improve equal variance

+ If problem is severe, you could use *weighted least squares regression* instead of ordinary least squares

+ Each observation is assigned a weight that determines how much it influences the parameter estimates
  
  + Less weight is given to the observations with higher variance in their fitted values

---

## Summary of assumptions

+ **Linearity**: The relationship between each numerical `$x$`, and `$y$` is linear

+ **Independence of errors**: The errors are not correlated with one another

+ **Normally distributed errors**: The errors ( `$\epsilon_i$` ) are normally distributed around 0

+ **Equal variance**: The equal variances assumption is constant across values of the predictors `$x_1$`, ... `$x_k$`, and across values of the fitted values `$\hat{y}$`

---
class: center, middle
# Questions?

---
class: center, middle, inverse

## Part 2: Linear Model Diagnostics
---
## Linear model diagnostics

+ So far in the course, we have discussed evaluating linear models:
	+ Evaluating model fit ( `$R^2$` and `$F$`-ratio )
	+ Evaluating individual predictors
	+ Evaluating assumptions

+ Another important aspect of evaluating linear models pertains to the cases (individuals, or rows in the data)
	+ Do some individuals have a lot of influence on the results?

+ Linear model diagnostics allow us to explore individual cases in the context of a model

---
##  Three important cases to consider

+ **Model outliers**
	
	+ Cases for which there is a large discrepancy between their predicted value ( `$\hat{y_i}$` ) and their observed value ( `$y_i$` )

+ **High leverage cases**
	
	+ Cases with an unusual value of the predictor ( `$x_i$` )

+ **High influence cases**
	
	+ Cases that are having a large impact on the estimates of the model

---
## Some data for today

.pull-left[
+ We'll look at data predicting salary from years of service and performance ratings.

`$$y_i = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \epsilon_i$$`

+ `$y$` = Salary (unit = thousands of pounds)

+ `$x_1$` = Years of service

+ `$x_2$` = Average performance ratings

+ We will run all diagnostics on the object `m1`:

``` r
m1 <- lm(salary ~ perf + serv, data = salary2)
```

]

.pull-right[

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> id </th>
   <th style="text-align:right;"> salary </th>
   <th style="text-align:right;"> serv </th>
   <th style="text-align:right;"> perf </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> ID101 </td>
   <td style="text-align:right;"> 80.18 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID102 </td>
   <td style="text-align:right;"> 123.98 </td>
   <td style="text-align:right;"> 4.5 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID103 </td>
   <td style="text-align:right;"> 80.55 </td>
   <td style="text-align:right;"> 2.4 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID104 </td>
   <td style="text-align:right;"> 84.35 </td>
   <td style="text-align:right;"> 4.6 </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID105 </td>
   <td style="text-align:right;"> 83.76 </td>
   <td style="text-align:right;"> 4.8 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID106 </td>
   <td style="text-align:right;"> 117.61 </td>
   <td style="text-align:right;"> 4.4 </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID107 </td>
   <td style="text-align:right;"> 96.38 </td>
   <td style="text-align:right;"> 4.3 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID108 </td>
   <td style="text-align:right;"> 96.49 </td>
   <td style="text-align:right;"> 5.0 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID109 </td>
   <td style="text-align:right;"> 88.23 </td>
   <td style="text-align:right;"> 2.4 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID110 </td>
   <td style="text-align:right;"> 143.69 </td>
   <td style="text-align:right;"> 4.6 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
</tbody>
</table>

]

---
##  Model outliers

+ Linear model outliers are cases that have unusual outcome values given their predictor values

+ They will show a large difference between the predicted ( `$\hat{y_i}$` ) and the observed value ( `$y_i$` )

> **Test your understanding:** What quantity have we calculated that would summarise this difference?

+ Outliers will demonstrate large **residuals** ( `$\epsilon_i$` )

+ If you got stuck here, look back at the introduction to linear models

---
##  Why are we interested in linear model outliers?

.pull-left[
+ They can (but do not necessarily) have a strong influence on the model

+ It may be informative to follow-up and investigate outliers
	+ Is it a data entry error?
	+ Does the case somehow not belong with the rest of the data? (E.g., a male in a sample of females)
]

.pull-right[
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-33-1.png" width="90%" />

]

---
##  How do we determine if a case is an outlier?

+ We judge linear model outlying-ness of a case on the basis of the size of its residual

+ Unstandardised residuals are:

`$$y_i - \hat{y_i}$$`

+ They are in the same units as the DV

+ Fine for comparison across cases within the same linear model

+ Difficult to compare across models where the DVs will have different units

---
##  Standardised residuals

+ **Standardised residuals**
	+ Divide the unstandardised residuals by an estimate of their standard deviation
	+ This converts the residuals to z-score units
	+ However, their calculation includes the potential outlier (the standard deviation is bigger than it would be without the outlier)

+ **Studentised residuals**
	+ A version of the standardised residual excluding the specific case from the calculation of the standard deviation
	+ Values **above +2 or below -2** indicate potential outlyingness (these values are 2 or more standard deviations away from the model's estimate)

---
##  Identifying studentised residuals > 2

.pull-left[

``` r
salary2 %>%
  mutate(
*   resid = rstudent(m1)
  ) %>%
* dplyr::filter(resid > 2 | resid < -2)
```

+ Steps:
  + Extract the studentized residuals from our model
  + Identify those that are more extreme than `$\pm 2$`

]

.pull-right[
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> id </th>
   <th style="text-align:right;"> salary </th>
   <th style="text-align:right;"> serv </th>
   <th style="text-align:right;"> perf </th>
   <th style="text-align:right;"> resid </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> ID114 </td>
   <td style="text-align:right;"> 132.12 </td>
   <td style="text-align:right;"> 6.1 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2.090732 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID125 </td>
   <td style="text-align:right;"> 51.78 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> -2.324626 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID127 </td>
   <td style="text-align:right;"> 144.33 </td>
   <td style="text-align:right;"> 4.0 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 2.028044 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID133 </td>
   <td style="text-align:right;"> 41.60 </td>
   <td style="text-align:right;"> 4.8 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> -2.238126 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID159 </td>
   <td style="text-align:right;"> 129.42 </td>
   <td style="text-align:right;"> 3.0 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 2.939648 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID161 </td>
   <td style="text-align:right;"> 148.05 </td>
   <td style="text-align:right;"> 5.6 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 2.194388 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID173 </td>
   <td style="text-align:right;"> 17.86 </td>
   <td style="text-align:right;"> 4.9 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> -2.880247 </td>
  </tr>
</tbody>
</table>

]

---
##  Leverage

.pull-left[
+ High leverage cases are those with an unusual predictor value or combination of predictor values
	
	+ In simple linear model: an `$x$` value far from the `$\bar{x}$`

+ **Why are we interested in leverage?**

+ High leverage cases have potential to influence `$\hat \beta_0$` and `$\hat \beta_1$` 
]

.pull-right[
<img src="dapr2_08_assumptions_diagnostics_files/figure-html/unnamed-chunk-36-1.png" width="90%" />
]

---
##  Finding a case with high leverage

+ **Hat values** ( `$h_i$` ) are used to assess leverage in linear model

+ For a simple linear model, the hat value for case `$i$` for a particular predictor would be:

`$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i=1}^n(x_i - \bar{x})^2}$$`

+ Where:
  + `$n$` is the sample size
  + `$(x_i - \bar{x})^2$` is the squared deviation of the predictor value for that case, `$x_i$`, from the mean `$\bar x$`
  + `$\sum_i^n(x_i - \bar{x})^2$` is the sum of all these squared deviations, for all cases

---
##  Finding a case with high leverage

+ The mean of hat values ( `$\bar{h}$` ) is then:

`$$\bar{h} = (k+1)/n$$`

+ `$k$` is the number of predictors
  + `$n$` is the sample size

+ In simple linear regression `$k = 1$` as there is just one predictor, hence `$\bar h = (1 + 1) / n = 2 /n$`

+ As a rough heuristic, values more than `$2\bar{h}$` are considered high leverage

---
##  Hat values in R

.pull-left[

``` r
salary2 %>%
  mutate(
*   hat = hatvalues(m1)
  ) %>%
* dplyr::filter(., hat > 2*((2+1)/100)) %>%
  kable(.)  %>%
  kable_styling(., full_width = F)
```

+ Steps to identify large `$h_i$` values:
  + Extract the `$h_i$` from our model.
  + Identify those outside `$2\bar{h}$`
]

.pull-right[

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> id </th>
   <th style="text-align:right;"> salary </th>
   <th style="text-align:right;"> serv </th>
   <th style="text-align:right;"> perf </th>
   <th style="text-align:right;"> hat </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> ID122 </td>
   <td style="text-align:right;"> 89.50 </td>
   <td style="text-align:right;"> 6.5 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 0.0694283 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID124 </td>
   <td style="text-align:right;"> 81.67 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.0676132 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID134 </td>
   <td style="text-align:right;"> 125.90 </td>
   <td style="text-align:right;"> 3.0 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 0.0768679 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID151 </td>
   <td style="text-align:right;"> 64.59 </td>
   <td style="text-align:right;"> 4.4 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.0659745 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID173 </td>
   <td style="text-align:right;"> 17.86 </td>
   <td style="text-align:right;"> 4.9 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.0735795 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID182 </td>
   <td style="text-align:right;"> 110.15 </td>
   <td style="text-align:right;"> 0.7 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 0.0659204 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID199 </td>
   <td style="text-align:right;"> 95.12 </td>
   <td style="text-align:right;"> 0.7 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 0.0736123 </td>
  </tr>
</tbody>
</table>

]

---
class: center, middle

## Questions?

---
##  Influence

+ High leverage cases, when they are also linear model outliers, may have high **influence**
  + Cases with high influence have a strong effect on the coefficients ( `$\beta$` )
  + If we deleted such a case, the linear model coefficients would change
  + The degree of change is one way to judge the magnitude of influence

+ If a handful of influential cases are responsible for the linear model results, the conclusions might not generalise very well

+ Multiple ways to consider influence. We will look at:
  1. Cook's Distance
  1. Dfbeta

---
##  Cook's Distance 
 
+ **Cook's Distance** of a data point `$i$`:

`$$D_i = \frac{(\text{StandardizedResidual}_i)^2}{k+1} \times \frac{h_i}{1-h_i}$$`
+ Where

`$$\frac{(\text{StandardizedResidual}_i)^2}{k+1} = \text{Outlyingness}$$`
+ and

`$$\frac{h_i}{1-h_i} = \text{Leverage}$$`

+ So

`$$D_i = \text{Outlyingness} \times \text{Leverage}$$`

---
##  Cook's Distance

+ Cook's distance refers to **the average distance the `$\hat{y}$` values will move if a given case is removed**
  + If removing one case changes the predicted values a lot (moves the regression line), then that case is influencing our results

+ Many different suggestions for cut-offs:
  + `$D_i > 1$` 
  + `$D_i > \frac{4}{n-k-1}$`
  + Or relative to all `$D_i$` values in data set

---
##  Cook's Distance in R

.pull-left[

``` r
salary2 %>%
  mutate(
*   cook = cooks.distance(m1)) %>%
* dplyr::filter(., cook > 4/(100-2-1)) %>%
  arrange(., desc(cook)) %>%
  kable(.)  %>%   kable_styling(., full_width = F) 
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> id </th>
   <th style="text-align:right;"> salary </th>
   <th style="text-align:right;"> serv </th>
   <th style="text-align:right;"> perf </th>
   <th style="text-align:right;"> cook </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> ID173 </td>
   <td style="text-align:right;"> 17.86 </td>
   <td style="text-align:right;"> 4.9 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.2042640 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID159 </td>
   <td style="text-align:right;"> 129.42 </td>
   <td style="text-align:right;"> 3.0 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 0.0867202 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID114 </td>
   <td style="text-align:right;"> 132.12 </td>
   <td style="text-align:right;"> 6.1 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 0.0833774 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID133 </td>
   <td style="text-align:right;"> 41.60 </td>
   <td style="text-align:right;"> 4.8 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 0.0704271 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID161 </td>
   <td style="text-align:right;"> 148.05 </td>
   <td style="text-align:right;"> 5.6 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 0.0591664 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID134 </td>
   <td style="text-align:right;"> 125.90 </td>
   <td style="text-align:right;"> 3.0 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 0.0545706 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID180 </td>
   <td style="text-align:right;"> 111.36 </td>
   <td style="text-align:right;"> 2.4 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 0.0432211 </td>
  </tr>
</tbody>
</table>

]

.pull-right[

``` r
plot(m1, which = 4)
```

]

---
## Influence on coefficients
+ Cook's distance summarises the total influence of an observation

+ In the context of a `lm` with 2+ predictors, we may want to look in a little more detail

+ **DFFit**: The difference between the predicted outcome value for an observation with versus without the observation included

+ **DFbeta**: The difference between the value for a coefficient with and without an observation included

+ **DFbetas**: A standardised version of DFbeta
  + **DFbetas** obtained by dividing **DFbeta** by an estimate of the standard error of the regression coefficient (SE calculated with the observation in question removed)

---
## In R

+ We can extract these measures using the `influence.measures()` function:

``` r
dfs_m1 <- influence.measures(m1)
round(dfs_m1$infmat[1:10,],3)
```

```
##    dfb.1_ dfb.perf dfb.serv  dffit cov.r cook.d   hat
## 1  -0.091    0.042    0.066 -0.106 1.041  0.004 0.023
## 2  -0.005    0.005    0.004  0.008 1.058  0.000 0.025
## 3  -0.083    0.041    0.054 -0.097 1.038  0.003 0.020
## 4   0.057   -0.018   -0.120 -0.184 0.991  0.011 0.018
## 5  -0.004    0.042   -0.065 -0.098 1.046  0.003 0.025
## 6  -0.016    0.007    0.035  0.061 1.040  0.001 0.015
## 7   0.127   -0.140   -0.091 -0.227 0.986  0.017 0.023
## 8   0.183   -0.141   -0.177 -0.278 0.992  0.025 0.032
## 9  -0.030    0.015    0.020 -0.036 1.051  0.000 0.020
## 10 -0.056    0.062    0.029  0.078 1.078  0.002 0.047
```

---
##  Influence on standard errors 
+ Influential cases can impact the `$SE$` as well as the estimation of coefficients
  + This means it impacts our inferences

+ Recall, the standard error for a regression slope:

`$$SE(\hat \beta_j) = \sqrt{\frac{ SS_{Residual}/(n-k-1)}{\sum(x_{ij} - \bar{x_{j}})^2(1-R_{xj}^2)}}$$`

+ `$\sum(x_ij  -\bar{x_j})^2$` is the sum of squared deviations of each `$x$` value from the mean of `$x$`

+ This term implies that increasing the variance in `$x$` will decrease the standard error of the slope estimate

+ High leverage cases (which are far from the mean of `$x$` ) can increase `$x$` variance

---
##  COVRATIO 
+ Influence on standard errors can be measured using the **COVRATIO** statistic
	+ COVRATIO value <1 show that precision is decreased (SE increased)  by a case
	+ COVRATIO value >1 show that precision is increased (SE decreased) by a case

+ Cases can be considered to have a strong influence on the standard errors of `$\beta$`s if COVRATIOS are:
  + larger than `$1+\frac{3(k +1)}{n}$`
  + or smaller than `$1-\frac{3(k +1)}{n}$`

---
##  COVRATIO in R

.pull-left[
+ COVRATIO values can be extracted using the `covratio()` function:

``` r
cr <- covratio(m1)
cr[1:5]
```

```
##        1        2        3        4        5 
## 1.040592 1.057786 1.038253 0.990737 1.045659
```
]

.pull-right[
+ And we can check cases using the cutoffs:

``` r
which(cr > (1+(3*(2+1))/100))
```

```
## 22 51 55 82 
## 22 51 55 82
```

``` r
which(cr < (1-(3*(2+1))/100))
```

```
## 25 59 73 
## 25 59 73
```
]

---
## Multi-collinearity 
+ This is not an assumption of linear model or a case diagnostic, but it is something we need to consider
  
+ Multi-collinearity refers to the correlation between predictors

+ When there are large correlations between predictors, the standard errors of `$\beta$`s are increased
  + If predictors are correlated with each other, we are less certain about the effect of each individual predictor on the outcome
	+ Therefore, we don't want our predictors to be too correlated

+ We can assess this using the **Variance Inflation Factor**, which quantifies the extent to which `$SE$` is increased by predictor inter-correlations

.pull-left[
+ It can be obtained in R using the `vif()` function:

``` r
vif(m1)
```

```
##     perf     serv 
## 1.001337 1.001337
```
]

.pull-right[
+ Ideally, we want values to be close to 1

+ VIFs > 10 indicate a problem, as the SE of `$\beta_i$` is more than `$\sqrt{10} = 3.16$` times larger than it would be if other predictors were removed from the model
]

---
##  What to do about multi-collinearity

+ In practice, multi-collinearity is not often a major problem

+ When issues arise, consider:
	
	+ Combining highly correlated predictors into a single composite
		  + E.g. create a sum or average of the two predictors
	
	+ Dropping an IV that is obviously statistically and conceptually redundant with another from the model

---
class: center, middle
## Questions?

---
##  What should be done when we identify issues?

+ Easier to identify unusual cases than it is to decide what to do about them

+ In general, not a good idea to delete cases automatically for exceeding a threshold
	+ Especially if they don't show large influence

+ Think of linear model diagnostics more as a way of learning about the limitations of the model
	+ Which cases can't it explain very well?
	+ Why is that?
	+ Do results depend quite strongly on certain cases?

+ When unusual cases arise, try the following:
  1. Try to identify why a case is unusual
  2. Run a sensitivity analysis

---
##  Investigating and dealing with unusual cases

+ Is there a data entry error?
	+ Is the value within the plausible range of values for a variable?
	+ Can it be corrected? If not, the incorrect value should be deleted

+ Are there unusual values as a result of skewness?

+ Is the data legitimate but extreme?
  + Avoid changing or removing legitimate values when possible, because they may reflect natural variance in the population
  + Before making any changes, check influence values and run a sensitivity analysis
  + After taking these steps, if a very small proportion of observations are strongly influencing the overall model, consider ways to reduce their influence rather than removing them entirely (e.g. **windsorising**)
	+ There may be model specification problems (e.g. missing variables, interactions - coming soon)

???
+ Replacing the extreme value with the next largest value for that variable
+ Avoids missingness/ preserves information
+ Note that deleting or winsorising values can change the model, therefore, different cases might then show up as outlying, high leverage, or influential
+ Iterative approach to removing/winsorising cases is needed

---
## Sensitivity Analyses

+ Sensitivity analysis refers to the idea of checking whether you get similar results irrespective of the methodological decisions you make

+ Sensitivity analysis can be used to check whether you get the same pattern of results. Do the estimated regression coefficients change substantially:
	+ With versus without including certain unusual cases?
	+ With versus without transforming a variable?

+ If results are highly similar, you have more confidence that the results aren't very dependent on those decisions

+ If they differ a lot, this should be reported as a limitation

---
## Diagnostics Summary

+ We have looked at case diagnostics for `lm`

+ Outliers can be assessed via studentized residuals

+ Leverage can be assessed via hat values

+ Influence can be assessed via Cook's distance
  + As well as dfbeta and COVRATIO

+ In short, we do not want a small number of observations having undue influence over our results, so we should always look closely at diagnostics

---
class: center, middle
## Questions?

---

## This week

.pull-left[

### Tasks

**Attend your lab and work together on the exercises**

<br>

**Complete the weekly quiz**

]

.pull-right[

### Support

**Help each other on the Piazza forum**

<br>

**Attend office hours (see Learn page for details)**

]