Helpful Mnemonics

It may help to think of the sequence of steps involved in statistical modeling as:
\[ \text{Choose} \rightarrow \text{Fit} \rightarrow \text{Assess} \rightarrow \text{Use} \]

We explore/visualise our data and Choose our model specification.
Then we Fit the model in R.
Next, we Assess the fit, to ensure that it meets all the underlying assumptions.
Finally, we Use our model to draw statistical inferences about the world, or to make predictions.

A general rule
Do not use (draw inferences or predictions from) a model before you have assessed that the model satisfies the underlying assumptions.

The assumptions of the linear model can be committed to memory using the LINE mnemonic:

Linearity: The relationship between $y$ and $x$ is linear.
Independence of errors: The error terms should be independent from one another.
Normality: The errors $\epsilon$ are normally distributed.
Equal variances (“Homoscedasticity”): The scale of the variability of the errors $\epsilon$ is constant at all values of $x$.

When we fit a model, we evaluate many of these assumptions by looking at the residuals
(the deviations of the observed values $y_i$ from the model-estimated values $\hat y_i$).

The residuals, $\hat \epsilon$, are our estimates of the actual unknown true error term $\epsilon$. These assumptions hold both for a regression model with a single predictor and for one with multiple predictors.

Setup

:::

For this guide, we are going to use the following model:

\[ \text{Wellbeing} = b_0 + b_1 \cdot \text{Outdoor Time} + b_2 \cdot \text{Social Interactions} + \epsilon \]

We fitted this model in R as follows:

library(tidyverse)
# Read in data
mwdata <- read_csv(file = "https://uoepsy.github.io/data/wellbeing.csv")
# fit the model 
wbmodel <- lm(wellbeing ~ outdoor_time + social_int, data = mwdata)

:::

Checking Each Assumption

Linearity

In simple linear regression with only one explanatory variable, we can assess linearity through a simple scatterplot of the outcome variable against the explanatory. In multiple regression, however, it becomes more necessary to rely on diagnostic plots of the model residuals. This is because we need to know whether the relations are linear between the outcome and each predictor after accounting for the other predictors in the model.

In order to assess this, we use partial-residual plots (also known as ‘component-residual plots’). This is a plot with each explanatory variable $x_j$ on the x-axis, and partial residuals on the y-axis.

Partial residuals for a predictor $x_j$ are calculated as: \[ \hat \epsilon + \hat b_j x_j \]

In R we can easily create these plots for all predictors in the model by using the crPlots() function from the car package.
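To make the formula above concrete, here is a minimal base-R sketch of computing partial residuals by hand. It uses a stand-in model fitted to R's built-in mtcars data (an assumption for illustration only, not the wellbeing data); the object names m, pres_wt, pres_builtin and shift are hypothetical.

```r
# Stand-in model on the built-in mtcars data (illustration only;
# substitute your own fitted model, e.g. wbmodel)
m <- lm(mpg ~ wt + hp, data = mtcars)

# Partial residuals for wt: residuals plus the fitted component b_wt * wt
pres_wt <- residuals(m) + coef(m)["wt"] * mtcars$wt

# Base R can return partial residuals too, but it centres each component
# at the mean of the predictor, so the two differ by a constant shift
pres_builtin <- residuals(m, type = "partial")[, "wt"]
shift <- coef(m)["wt"] * mean(mtcars$wt)
all.equal(unname(pres_wt - shift), unname(pres_builtin))

# Regressing the partial residuals on wt recovers the wt coefficient
coef(lm(pres_wt ~ mtcars$wt))[2]
coef(m)["wt"]
```

Plotting pres_wt against mtcars$wt gives the kind of panel that crPlots() draws for each predictor; a roughly straight-line trend is what we hope to see.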

Question 1

Create partial-residual plots for the wbmodel model.
Remember to load the car package first. If it does not load correctly, it might mean that you need to install it.

Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to the plots.

Solution

Equal variances (Homoscedasticity)

The equal variances assumption is that the error variance $\sigma^2$ is constant across values of the predictors $x_1$, … $x_k$, and across values of the fitted values $\hat y$. This sometimes gets termed “Constant” vs “Non-constant” variance. Figures 1 & 2 show what these look like visually.

Figure 1: Non-constant variance for numeric and categorical x

Figure 2: Constant variance for numeric and categorical x

In R we can create plots of the Pearson residuals against the predicted values $\hat y$ and against the predictors $x_1$, … $x_k$ by using the residualPlots() function from the car package. This function also provides the results of a lack-of-fit test for each of these relationships (note when it is the fitted values $\hat y$ it gets called “Tukey’s test”).

ncvTest(model) (also from the car package) performs a test against the alternative hypothesis that the error variance changes with the level of the fitted value (also known as the “Breusch-Pagan test”). $p >.05$ indicates that we do not have evidence that the assumption has been violated.

Question 2

Use residualPlots() to plot residuals against each predictor, and use ncvTest() to perform a test against the alternative hypothesis that the error variance changes with the level of the fitted value.

Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to plots and/or formal tests where available.

Solution

residualPlots(wbmodel)

##              Test stat Pr(>|Test stat|)
## outdoor_time     -0.35             0.73
## social_int       -0.11             0.92
## Tukey test       -0.42             0.68

#test against the alternative hypothesis that error variance changes with level of fitted value
ncvTest(wbmodel)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.00193, Df = 1, p = 1

Partial residual plots show no clear non-linear trends between residuals and predictors. Visual inspection of the residual plots suggested little sign of non-constant variance, with the Breusch-Pagan test failing to reject the null that error variance does not change across the fitted values ($\chi^2(1)=0.002$, $p = .965$).

Question 3

Create the “residuals vs. fitted plot” - a scatterplot with the residuals $\hat \epsilon$ on the y-axis and the fitted values $\hat y$ on the x-axis.

You can either do this:

manually, using the functions residuals() and fitted(), or
quickly by giving the plot() function your model. This will actually give you lots of plots, so we can specify which plot we want to return - e.g., plot(wbmodel, which = 1)

You can use this plot to visually assess:

Linearity: Does the average of the residuals $\hat \epsilon$ remain close to 0 across the plot?
Equal Variance: Does the spread of the residuals $\hat \epsilon$ remain constant across the predicted values $\hat y$?

Solution

Independence

The “independence of errors” assumption is the condition that the errors do not have some underlying relationship which is causing them to influence one another.
There are many sources of possible dependence, and often these are issues of study design. For example, we may have groups of observations in our data which we would expect to be related (e.g., multiple trials from the same participant). Our modelling strategy would need to take this into account.
One form of dependence is autocorrelation - this is when observations influence those adjacent to them. It is common in data for which time is a variable of interest (e.g., the humidity today is dependent upon the rainfall yesterday).

In R we can test against the alternative hypothesis that there is autocorrelation in our errors using the durbinWatsonTest() function (an abbreviated version, dwt(), is also available) from the car package.
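For intuition about what this test measures, the Durbin-Watson statistic can be computed by hand from the residuals. The sketch below is a base-R illustration on a stand-in model fitted to the built-in mtcars data (an assumption; the names m, e, dw and r1 are hypothetical), and it does not reproduce the p-value that the car function reports.

```r
# Stand-in model on the built-in mtcars data (illustration only)
m <- lm(mpg ~ wt + hp, data = mtcars)
e <- unname(residuals(m))

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals (always between 0 and 4)
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

# DW is approximately 2 * (1 - r1): values near 2 suggest little
# autocorrelation, values near 0 positive and near 4 negative autocorrelation
c(DW = dw, approx = 2 * (1 - r1))
```

Note that the statistic depends on the order of the rows, so it is only meaningful when the rows have a natural ordering (such as time).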

Question 4

Perform a test against the alternative hypothesis that there is autocorrelation in the error terms.

Write a sentence summarising whether or not you consider the assumption of independence to have been met (you may have to assume certain aspects of the study design).

Solution

dwt(wbmodel)

##  lag Autocorrelation D-W Statistic p-value
##    1          -0.318           2.6   0.172
##  Alternative hypothesis: rho != 0

A Durbin-Watson test of autocorrelation failed to reject the null hypothesis that there was no serial dependence in the errors ($DW = 2.6$, $p = .172$; this p-value is bootstrapped, so it varies slightly from run to run). We will also assume that observations were randomly sampled during study recruitment.

Normality of errors

The normality assumption is the condition that the errors $\epsilon$ are normally distributed.

We can visually assess this condition through histograms, density plots, and quantile-quantile plots (QQplots) of our residuals $\hat \epsilon$.
We can also perform a Shapiro-Wilk test against the alternative hypothesis that the residuals were not sampled from a normally distributed population.

The shapiro.test() function in R performs a Shapiro-Wilk test.
plot(model_name, which = 2) gives us a QQplot of the residuals (or you can do it manually by extracting the residuals using resid(model_name)).

Question 5

Assess the normality assumption by producing a QQplot of the residuals (either manually or using plot(model, which = ???)), and conducting a Shapiro-Wilk test.

Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to plots and/or formal tests where available.

Solution

We can get the QQplot from one of the plot(model) plots:

plot(wbmodel, which = 2)

Or we can make our own:

tibble(
  resids = residuals(wbmodel)
) %>% ggplot(aes(sample=resids))+
  geom_qq()+
  geom_qq_line()

shapiro.test(residuals(wbmodel))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(wbmodel)
## W = 0.9, p-value = 0.1

The QQplot indicates that the residuals follow close to a normal distribution, although with evidence of heavier tails. A Shapiro-Wilk test failed to reject the null hypothesis that the residuals were drawn from a normally distributed population ($W = 0.95$, $p = .129$).

Multicollinearity

For the linear model with multiple explanatory variables, we need to also think about multicollinearity - this is when two (or more) of the predictors in our regression model are moderately or highly correlated.
Recall our interpretation of multiple regression coefficients as
“the effect of $x_1$ on $y$ when holding the values of $x_2$, $x_3$, … $x_k$ constant”

This interpretation falls down if predictors are highly correlated because if, e.g., predictors $x_1$ and $x_2$ are highly correlated, then changing the value of $x_1$ necessarily entails a change in the value of $x_2$, meaning that it no longer makes sense to talk about holding $x_2$ constant.

We can assess multicollinearity using the variance inflation factor (VIF), which for a given predictor $x_j$ is calculated as:
\[ VIF_j = \frac{1}{1-R_j^2} \] where $R_j^2$ is the coefficient of determination (the R-squared) resulting from a regression of $x_j$ on all the other predictors in the model (i.e., a model with $x_j$ as the outcome and the remaining predictors as explanatory variables).
The more highly correlated $x_j$ is with other predictors, the bigger $R_j^2$ becomes, and thus the bigger $VIF_j$ becomes.

The square root of VIF indicates how much the SE of the coefficient has been inflated due to multicollinearity. For example, if the VIF of a predictor variable were 4.6 ($\sqrt{4.6} = 2.1$), then the standard error of the coefficient of that predictor is 2.1 times larger than if the predictor had zero correlation with the other predictor variables. Suggested cut-offs for VIF are varied. Some suggest 10, others 5. Define what you will consider an acceptable value prior to calculating it.

In R, the vif() function from the car package will provide VIF values for each predictor in your model.
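As a rough check on the formula, the VIF can also be computed by hand in base R. This sketch uses a stand-in model on the built-in mtcars data with predictors wt and hp (an assumption for illustration; the names r2_wt and vif_wt are hypothetical).

```r
# Stand-in predictors from the built-in mtcars data (illustration only):
# imagine the model of interest is lm(mpg ~ wt + hp, data = mtcars)

# Regress one predictor on the other predictor(s) and keep that R-squared
r2_wt <- summary(lm(wt ~ hp, data = mtcars))$r.squared

# Variance inflation factor for wt
vif_wt <- 1 / (1 - r2_wt)

# sqrt(VIF) is the factor by which the SE of the wt coefficient is inflated
c(VIF = vif_wt, SE_inflation = sqrt(vif_wt))
```

With only two predictors, $R_j^2$ is simply the squared correlation between them, so both predictors end up with the same VIF.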

Question 6

Calculate the variance inflation factor (VIF) for the predictors in the model.

Write a sentence summarising whether or not you consider multicollinearity to be a problem here.

Solution

vif(wbmodel)

## outdoor_time   social_int 
##         1.13         1.13

VIF values <5 indicate that multicollinearity is not adversely affecting model estimates.

Individual Case Diagnostics

In linear regression, individual cases in our data can influence our model more than others. There are a variety of measures we can use to evaluate the amount of misfit and influence that single observations have on our model and our model estimates.

THERE ARE NO HARD RULES FOR WHAT COUNTS AS “INFLUENTIAL” AND HOW WE SHOULD DEAL WITH THESE CASES

There are many ways to make a cake. Recipes can be useful, but you really need to think about what ingredients you actually have (what data you have).

You don’t have to exclude influential observations. Try to avoid blindly following cut-offs, and try to think carefully about outliers and influential points and whether you want to exclude them, and whether there might be some other model specification that captures this in some estimable way. Do these observations change the conclusions you make? (You can try running models with and without certain cases.)

There are various measures of outlyingness and influence. Here are a few:

Regression outliers: Cases with a large residual $\hat \epsilon_i$ - i.e., a big discrepancy between their predicted y-value and their observed y-value.
- Standardised residuals: For residual $\hat \epsilon_i$, divide by the estimate of the standard deviation of the residuals. In R, the rstandard() function will give you these.
- Studentised residuals: For residual $\hat \epsilon_i$, divide by the estimate of the standard deviation of the residuals excluding case $i$. In R, the rstudent() function will give you these. Values greater than 2 in magnitude (above 2 or below -2) are considered potential outliers.
High leverage cases: These are cases which have considerable potential to influence the regression model (e.g., cases with an unusual combination of predictor values).
- Hat values: are used to assess leverage. In R, the hatvalues() function will retrieve these.
  Hat values of more than $2 \bar{h}$ (2 times the average hat value) are considered high leverage. $\bar{h}$ is calculated as $\frac{k + 1}{n}$, where $k$ is the number of predictors, and $n$ is the sample size.
High influence cases: When a case has high leverage and is an outlier, it will have a large influence on the regression model.
- Cook’s Distance: combines leverage (hat values) with outlyingness to capture influence. In R, the cooks.distance() function will provide these.
  There are many suggested Cook’s Distance cut-offs.

Question 7

Create a new tibble which contains:

The original variables from the model (Hint, what does <fitted model>$model give you?)
The fitted values from the model $\hat y$
The residuals $\hat \epsilon$
The studentised residuals
The hat values
The Cook’s Distance values.

Solution
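One possible way to assemble such a table is sketched below in base R. It uses a stand-in model on the built-in mtcars data (an assumption for illustration), so the first three columns are mpg, wt and hp rather than the wellbeing variables. The remaining column names follow those that appear in the later solutions, which refer to the resulting object as mdl_diagnost; here the hypothetical name diagnost is used instead.

```r
# Stand-in model on the built-in mtcars data (illustration only;
# with the wellbeing data you would use wbmodel instead)
m <- lm(mpg ~ wt + hp, data = mtcars)

# m$model holds the variables used in the model; add the diagnostics to it
diagnost <- data.frame(
  m$model,
  fitted  = fitted(m),          # fitted values
  resid   = residuals(m),       # raw residuals
  studres = rstudent(m),        # studentised residuals
  hats    = hatvalues(m),       # hat values (leverage)
  cooksd  = cooks.distance(m)   # Cook's Distance
)

head(diagnost)
```

If the tidyverse is loaded, wrapping the result in as_tibble() gives a tibble rather than a data frame.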

Question 8

Looking at the studentised residuals, are there any extreme values?

Solution

Let’s take studentised residuals of $>2$ or $< -2$ to indicate potential outlyingness.

We can ask R whether the absolute values are $>2$:

abs(mdl_diagnost$studres) > 2

##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##    14    15    16    17    18    19    20    21    22    23    24    25    26 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
##    27    28    29    30    31    32 
## FALSE FALSE FALSE FALSE FALSE FALSE

We could filter our newly created tibble to these observations:

mdl_diagnost %>% 
  filter(abs(studres)>2)

## # A tibble: 0 x 8
## # ... with 8 variables: wellbeing <dbl>, outdoor_time <dbl>, social_int <dbl>,
## #   fitted <dbl>, resid <dbl>, studres <dbl>, hats <dbl>, cooksd <dbl>

There are zero rows.

Question 9

Looking at the hat values, are there any observations with high leverage?

Solution

For our model, the average hat value $\bar h$ is:
\[ \bar h = \frac{k+1}{n} = \frac{2+1}{32} = \frac{3}{32} = 0.094 \]

We can ask whether any observations have hat values which are greater than $2 \times \bar h$:

mdl_diagnost %>%
  filter(hats > (2*0.094))

## # A tibble: 0 x 8
## # ... with 8 variables: wellbeing <dbl>, outdoor_time <dbl>, social_int <dbl>,
## #   fitted <dbl>, resid <dbl>, studres <dbl>, hats <dbl>, cooksd <dbl>

Note that 0 observations have high leverage.
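Rather than typing the average hat value in by hand, it can be computed from the fitted model. The following base-R sketch uses a stand-in model on the built-in mtcars data, which happens to have the same $n = 32$ and $k = 2$ as our model (the data and the names m, h, k, n and h_bar are assumptions for illustration). Counting predictors as the number of coefficients minus one assumes the model includes an intercept.

```r
# Stand-in model on the built-in mtcars data (illustration only)
m <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(m)
k <- length(coef(m)) - 1   # number of predictors (assumes an intercept)
n <- nobs(m)               # sample size

# Average hat value, (k + 1) / n; this is the same as mean(h)
h_bar <- (k + 1) / n

# Cases with more than twice the average leverage
which(h > 2 * h_bar)
```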

Question 10

Plot the Cook’s Distance values. Does it look like there may be any highly influential points?
(You can use plot(model, which = 4) and plot(model, which = 5)).

Solution
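Before plotting, it may help to see how Cook’s Distance combines outlyingness and leverage. One common way of writing it is \[ D_i = \frac{r_i^2}{k+1} \cdot \frac{h_i}{1-h_i} \] where $r_i$ is the standardised residual and $h_i$ the hat value of case $i$. The base-R sketch below checks this against cooks.distance() on a stand-in model fitted to the built-in mtcars data (the data and the object names are assumptions for illustration).

```r
# Stand-in model on the built-in mtcars data (illustration only)
m <- lm(mpg ~ wt + hp, data = mtcars)

r <- rstandard(m)          # standardised residuals (outlyingness)
h <- hatvalues(m)          # hat values (leverage)
p <- length(coef(m))       # number of coefficients, k + 1

# Cook's Distance by hand
d_manual <- (r^2 / p) * (h / (1 - h))

# Compare with the built-in function
all.equal(d_manual, cooks.distance(m))

# The three cases with the largest Cook's Distance
sort(d_manual, decreasing = TRUE)[1:3]
```

A case needs both a sizeable residual and sizeable leverage to get a large value, which is why plot(model, which = 5) shows residuals against leverage with Cook’s Distance contours.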

Other influence.measures()

Alongside Cook’s Distance, we can examine the extent to which model estimates and predictions are affected when an entire case is dropped from the dataset and the model is refitted.

DFFit: the change in the predicted value at the $i^{th}$ observation when the $i^{th}$ observation is included in the regression versus excluded from it.
DFbeta: the change in a specific coefficient when the $i^{th}$ observation is included in the regression versus excluded from it.
DFbetas: the change in a specific coefficient, divided by the standard error, when the $i^{th}$ observation is included in the regression versus excluded from it.
COVRATIO: measures the effect of an observation on the covariance matrix of the parameter estimates. In simpler terms, it captures an observation’s influence on standard errors. Values which are $>1+\frac{3(k+1)}{n}$ or $<1-\frac{3(k+1)}{n}$ are sometimes considered as having strong influence.

Question 11

Use the function influence.measures() to extract these delete-1 measures of influence.

Try plotting the distributions of some of these measures.

Tip: the function influence.measures() returns an infl-type object. To plot this, we need to find a way to extract the actual numbers from it.
What do you think names(influence.measures(<fitted model>)) shows you? How can we use influence.measures(<fitted model>)$ ???? to extract the matrix of numbers?

Solution

influence.measures(wbmodel)

## Influence measures of
##   lm(formula = wellbeing ~ outdoor_time + social_int, data = mwdata) :
## 
##      dfb.1_ dfb.otd_ dfb.scl_   dffit cov.r   cook.d    hat inf
## 1   0.43653 -0.11116 -0.32167  0.4477 1.157 0.066470 0.1489    
## 2  -0.28160  0.03170  0.22954 -0.2917 1.225 0.028838 0.1414    
## 3   0.29581  0.07604 -0.29025  0.3540 1.088 0.041520 0.0967    
## 4  -0.26445 -0.13055  0.29341 -0.3508 1.117 0.040991 0.1071    
## 5  -0.27084  0.22733  0.10472 -0.3290 1.279 0.036700 0.1766    
## 6   0.12462 -0.02693 -0.08288  0.1460 1.141 0.007277 0.0604    
## 7  -0.17361 -0.03392  0.15313 -0.2235 1.087 0.016770 0.0597    
## 8  -0.02879 -0.07259  0.06069 -0.0918 1.309 0.002906 0.1556    
## 9   0.16925  0.08820 -0.17834  0.2477 1.088 0.020553 0.0668    
## 10 -0.24768  0.33112 -0.00551 -0.4267 1.027 0.059217 0.0956    
## 11 -0.02018 -0.02535  0.02534 -0.0492 1.166 0.000834 0.0518    
## 12  0.04644  0.38731 -0.22045  0.4474 1.164 0.066462 0.1517    
## 13 -0.15606  0.21587 -0.02477 -0.3135 1.017 0.032241 0.0626    
## 14  0.13390 -0.31537  0.10705  0.3905 1.039 0.049887 0.0899    
## 15  0.14774 -0.25595  0.08688  0.4273 0.815 0.055918 0.0487    
## 16 -0.11881  0.17855 -0.06061 -0.3502 0.873 0.038519 0.0422    
## 17  0.00121 -0.07684  0.02608 -0.1031 1.177 0.003653 0.0703    
## 18  0.00329  0.04637 -0.01574  0.0739 1.159 0.001877 0.0516    
## 19  0.04301 -0.15310  0.09227  0.2432 1.055 0.019693 0.0546    
## 20  0.05819 -0.10961 -0.03803 -0.2189 1.065 0.016025 0.0508    
## 21 -0.01498  0.00131  0.03428  0.0875 1.131 0.002622 0.0380    
## 22  0.12553 -0.15510 -0.09098 -0.3084 1.020 0.031247 0.0622    
## 23 -0.03324 -0.00580  0.05840  0.1049 1.138 0.003770 0.0466    
## 24  0.24606 -0.23731 -0.18568 -0.4783 0.911 0.071967 0.0774    
## 25  0.06123  0.10878 -0.16017 -0.2209 1.131 0.016498 0.0771    
## 26 -0.02732 -0.29363  0.23790  0.3643 1.243 0.044752 0.1666    
## 27 -0.30514  0.33726  0.20108  0.5966 0.830 0.108234 0.0858    
## 28  0.06176 -0.07036 -0.03455 -0.1080 1.263 0.004013 0.1280    
## 29 -0.15327 -0.04611  0.23001  0.3039 1.067 0.030646 0.0754    
## 30  0.11226  0.25745 -0.30386 -0.3826 1.238 0.049262 0.1686    
## 31 -0.07637 -0.05313  0.12879  0.1542 1.214 0.008153 0.1047    
## 32  0.10249 -0.07000 -0.07662 -0.1399 1.353 0.006736 0.1864   *

Let’s plot the distribution of COVRATIO statistics.
Recall that values which are $>1+\frac{3(k+1)}{n}$ or $<1-\frac{3(k+1)}{n}$ are considered as having strong influence.
For our model: \[ 1 \pm \frac{3(k+1)}{n} \quad = \quad 1 \pm\frac{3(2+1)}{32} \quad = \quad 1\pm \frac{9}{32} \quad = \quad 1\pm0.28 \]

The “infmat” bit of an infl-type object contains the numbers. To use it with ggplot, we will need to turn it into a dataframe (as.data.frame()), or a tibble (as_tibble()):

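# extract the matrix of influence measures and convert it to a tibble
infdata <- influence.measures(wbmodel)$infmat %>%
  as_tibble()

# histogram of the COVRATIO values, with vertical lines at the cut-offs
ggplot(data = infdata, aes(x = cov.r)) + 
  geom_histogram() +
  geom_vline(xintercept = c(1 - 0.28, 1 + 0.28))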

It looks like a few observations may be having quite a high influence here. This is perhaps not that surprising as we only have 32 datapoints.
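The cut-offs can also be computed from the fitted model instead of being typed in, and the cases falling outside them listed directly. This is a base-R sketch on a stand-in model fitted to the built-in mtcars data, which has the same $n = 32$ and $k = 2$ as our model (the data and the names m, k, n, cutoff and cr are assumptions for illustration).

```r
# Stand-in model on the built-in mtcars data (illustration only)
m <- lm(mpg ~ wt + hp, data = mtcars)

k <- length(coef(m)) - 1   # number of predictors (assumes an intercept)
n <- nobs(m)               # sample size
cutoff <- 3 * (k + 1) / n  # 9/32, roughly 0.28

# COVRATIO for each case (the same values as the cov.r column
# of influence.measures())
cr <- covratio(m)

# Cases outside 1 +/- cutoff
cr[cr > 1 + cutoff | cr < 1 - cutoff]
```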

This workbook was written by Josiah King, Umberto Noe, and Martin Corley, and is licensed under a Creative Commons Attribution 4.0 International License.

Assumptions & Diagnostics: The Recipe Book
