In this lab, you will be provided with a comprehensive overview of linear regression assumptions and diagnostics. Therefore, whilst the lab might appear rather lengthy, it will serve as a handy reference that you can refer back to in the future.
LEARNING OBJECTIVES
In the previous labs, we have fitted a number of regression models, including some with multiple predictors. In each case, we first specified the model, then visually explored the marginal distributions and relationships of the variables to be used in the analysis. Finally, we fitted the model, and began to examine the fit by studying what the various parameter estimates represented, and the spread of the residuals (the parts of the output inside the red boxes in Figure 1).
But before we draw inferences using our model estimates or use our model to make predictions, we need to be satisfied that our model meets a specific set of assumptions. If these assumptions are not satisfied, the results will not hold.
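You can remember the four assumptions by memorising the acronym LINE: Linearity, Independence, Normality, and Equal variances.
If at least one of these assumptions does not hold, say N (Normality), you might be reporting a LIE. Recall the assumptions of the linear model: the relationship between the outcome and the predictors is linear (the errors have a mean of zero), the errors are independent of one another, the errors are normally distributed, and the errors have constant variance \(\sigma^2\).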
Because we don’t have the data about the entire population, we check the assumptions on the errors by looking at their sample counterpart: the residuals from the fitted model = observed values - fitted values = \(y_i - \hat y_i\).
The residuals \(\hat \epsilon_i\) are the sample realisation of the actual, but unknown, true errors \(\epsilon_i\) for the entire population. Because these same assumptions hold for a regression model with multiple predictors, we can assess them in a similar way. However, there are a number of important considerations.
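As a small illustration of this definition, the sketch below uses simulated data (the variables x and y and the model name sim_mdl are made up for this example and are not part of the lab datasets) to show that the residuals R returns are the observed values minus the fitted values.

```r
# Simulated data, purely for illustration (not the lab datasets)
set.seed(1)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)
sim_mdl <- lm(y ~ x)

# residuals = observed values - fitted values
manual_resid <- y - fitted(sim_mdl)

# resid() extracts the same quantities from the fitted model
all.equal(unname(manual_resid), unname(resid(sim_mdl)))
```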
In this lab, we will check the assumptions of two models - one simple linear model, and one with multiple predictors - and assess whether these models meet the assumptions outlined above. We will be working with two different datasets that you have used in previous labs: riverview and wellbeing.
Open a new RMarkdown document. Copy the code below to load in the tidyverse packages, read in the riverview.csv and wellbeing.csv datasets and fit the following two models:
\[ \begin{aligned} M1&: \quad \text{Income} = \beta_0 + \beta_1 \cdot \text{Education} + \epsilon \\ M2&: \quad \text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Outdoor Time} + \beta_2 \cdot \text{Social Interactions} + \epsilon \end{aligned} \]
library(tidyverse)
# read in the riverview data
rvdata <- read_csv(file = "https://uoepsy.github.io/data/riverview.csv")
# read in the wellbeing data
wbdata <- read_csv(file = "https://uoepsy.github.io/data/wellbeing.csv")
# fit the linear models:
rv_mdl1 <- lm(income ~ 1 + education, data = rvdata) #riverview model
wb_mdl1 <- lm(wellbeing ~ outdoor_time + social_int, data = wbdata) #wellbeing model
Note: For the wellbeing model, we have forgone writing the 1 in lm(y ~ 1 + x... . The 1 just tells R that we want to estimate the Intercept, and it will do this by default even if we leave it out.
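A quick way to convince yourself of this is to fit the same model both ways and compare the coefficients. The sketch below uses simulated data; sim_data, with_one and without_one are hypothetical names chosen for this example.

```r
# Simulated data, purely for illustration
set.seed(2)
sim_data <- data.frame(x = rnorm(40))
sim_data$y <- 1 + 2 * sim_data$x + rnorm(40)

# the same model, written with and without the explicit 1
with_one    <- lm(y ~ 1 + x, data = sim_data)
without_one <- lm(y ~ x, data = sim_data)

# both fits estimate an intercept and a slope, with identical values
coef(with_one)
coef(without_one)
```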
In simple linear regression (SLR) with only one explanatory variable, we could assess linearity through a simple scatterplot of the outcome variable against the explanatory variable. This would allow us to check if the errors have a mean of zero. If this assumption was met, the residuals would appear to be randomly scattered around zero.
The rationale for this is that, once you remove from the data the linear trend, what’s left over in the residuals should not have any trend, i.e. have a mean of zero.
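To make this concrete, here is a minimal sketch with simulated data in which the true relationship is curved but a straight line is fitted (the names x, y, curved_mdl and res are made up for this example). The residuals average to zero overall, which is always the case when the model includes an intercept, but not within different regions of x. That leftover pattern is what a violation of linearity looks like.

```r
# Simulated data: the true relationship is quadratic
set.seed(3)
x <- seq(-3, 3, length.out = 61)
y <- x^2 + rnorm(61, sd = 0.5)

# fit a straight line anyway
curved_mdl <- lm(y ~ x)
res <- resid(curved_mdl)

mean(res)               # zero overall (up to rounding error)
mean(res[abs(x) < 1])   # negative in the middle of the x range
mean(res[abs(x) > 2])   # positive at the extremes of the x range

# the residuals show a clear U-shaped trend rather than random scatter
plot(x, res, xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 2)
```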
In multiple regression, however, it becomes more necessary to rely on diagnostic plots of the model residuals. This is because we need to know whether the relations are linear between the outcome and each predictor after accounting for the other predictors in the model.
In order to assess this, we use partial-residual plots (also known as ‘component-residual plots’). This is a plot with each explanatory variable \(x_j\) on the x-axis, and partial residuals on the y-axis.
Partial residuals for a predictor \(x_j\) are calculated as: \[ \hat \epsilon + \hat \beta_j x_j \]
In R we can easily create these plots for all predictors in the model by using the crPlots() function from the car package.
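The partial residuals formula above can also be computed by hand in base R, which may help in seeing what such a plot displays. This is a sketch on simulated data; sim_data, sim_mdl and partial_x1 are hypothetical names, and it is not intended to reproduce the exact output of crPlots().

```r
# Simulated data with two predictors, purely for illustration
set.seed(4)
n <- 100
sim_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim_data$y <- 1 + 0.8 * sim_data$x1 - 0.5 * sim_data$x2 + rnorm(n)
sim_mdl <- lm(y ~ x1 + x2, data = sim_data)

# partial residuals for x1: residuals + (estimated slope of x1) * x1
partial_x1 <- resid(sim_mdl) + coef(sim_mdl)["x1"] * sim_data$x1

# plot them against x1
plot(sim_data$x1, partial_x1, xlab = "x1", ylab = "Partial residuals (x1)")

# a line fitted to this plot has the same slope as x1 has in the full model
partial_fit <- lm(partial_x1 ~ sim_data$x1)
abline(partial_fit)
coef(partial_fit)[2]
coef(sim_mdl)["x1"]
```

R's own residuals(sim_mdl, type = "partial") returns a centred version of the same quantity, so it differs from partial_x1 only by a constant vertical shift.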
Check if the fitted model satisfies the linearity assumption for rv_mdl1. Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to the plots.
Create partial-residual plots for the wb_mdl1 model.
Remember to load the car package first. If it does not load correctly, it might mean that you need to install it.
Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to the plots.
The equal variances assumption is that the error variance \(\sigma^2\) is constant across values of the predictor(s) \(x_1, \dots, x_k\), and across values of the fitted values \(\hat y\). This sometimes gets termed “Constant” vs “Non-constant” variance. Figures 3 & 4 show what these look like visually.
In R we can create plots of the Pearson residuals against the predicted values \(\hat y\) and against the predictors \(x_1\), … \(x_k\) by using the residualPlots() function from the car package. This function also provides the results of a lack-of-fit test for each of these relationships (note that when it is the fitted values \(\hat y\) it gets called “Tukey’s test”).
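For intuition about what non-constant variance looks like, the sketch below simulates data in which the error standard deviation grows with the predictor, and draws a residuals-vs-fitted plot using base R only. The names x, y, het_mdl, pearson_res and low are made up for this example. For an unweighted lm(), the Pearson residuals are the same as the ordinary residuals.

```r
# Simulated data where the error SD increases with x (non-constant variance)
set.seed(5)
n <- 200
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 0.5 * x)
het_mdl <- lm(y ~ x)

# Pearson residuals (identical to resid() for an unweighted lm)
pearson_res <- residuals(het_mdl, type = "pearson")

# residuals against fitted values: look for a fan or funnel shape
plot(fitted(het_mdl), pearson_res,
     xlab = "Fitted values", ylab = "Pearson residuals")
abline(h = 0, lty = 2)

# a rough numeric check: compare the spread for low vs high fitted values
low <- fitted(het_mdl) <= median(fitted(het_mdl))
sd(pearson_res[low])
sd(pearson_res[!low])
```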
Check if the fitted models rv_mdl1 and wb_mdl1 satisfy the equal variance assumption. Use residualPlots() to plot residuals against the predictor(s).
Write a sentence summarising whether or not you consider the assumption to have been met for each model. Justify your answer with reference to plots.
The ‘independence of errors’ assumption is the condition that the errors do not have some underlying relationship which is causing them to influence one another.
There are many sources of possible dependence, and often these are issues of study design. For example, we may have groups of observations in our data which we would expect to be related (e.g., multiple trials from the same participant). Our modelling strategy would need to take this into account.
One form of dependence is autocorrelation - this is when observations influence those adjacent to them. It is common in data for which time is a variable of interest (e.g., the humidity today is dependent upon the rainfall yesterday).
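The sketch below shows one way autocorrelation might appear in residuals, using simulated data with errors that deliberately depend on the previous error. The names time, errors, ar_mdl, res and lag1 are hypothetical, and the AR(1) coefficient of 0.8 is an arbitrary choice for illustration.

```r
# Simulated time-ordered data with autocorrelated (AR(1)) errors
set.seed(6)
n <- 200
time <- 1:n
errors <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 10 + 0.05 * time + errors
ar_mdl <- lm(y ~ time)
res <- resid(ar_mdl)

# residuals in observation order: neighbouring residuals tend to be similar
plot(time, res, type = "b", xlab = "Observation order", ylab = "Residuals")
abline(h = 0, lty = 2)

# correlation between each residual and the one before it
lag1 <- cor(res[-1], res[-n])
lag1

# autocorrelation at a range of lags
acf(res)
```

With independent errors, the plot in observation order would show no runs of similar values and the lag-1 correlation would be close to zero.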
For both rv_mdl1 and wb_mdl1, visually assess whether there is autocorrelation in the error terms.
Write a sentence summarising whether or not you consider the assumption of independence to have been met for each (you may have to assume certain aspects of the study design).
The normality assumption is the condition that the errors \(\epsilon\) are normally distributed in the population.
We can visually assess this condition through histograms, density plots, and quantile-quantile plots (QQplots) of our residuals \(\hat \epsilon\).
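As a sketch of how these plots can be built by hand, the example below uses simulated data (x, y, sim_mdl, res and theoretical are made-up names). A QQ plot pairs the sorted residuals with the quantiles we would expect from a normal distribution; points falling close to a straight line suggest approximate normality.

```r
# Simulated data, purely for illustration
set.seed(7)
x <- rnorm(80)
y <- 1 + x + rnorm(80)
sim_mdl <- lm(y ~ x)
res <- resid(sim_mdl)

# histogram and density plot of the residuals
hist(res, main = "Histogram of residuals", xlab = "Residuals")
plot(density(res), main = "Density of residuals")

# manual QQ plot: sorted residuals against theoretical normal quantiles
theoretical <- qnorm(ppoints(length(res)))
plot(theoretical, sort(res),
     xlab = "Theoretical quantiles", ylab = "Sample quantiles (residuals)")

# base R equivalent, with a reference line added
qq <- qqnorm(res)
qqline(res)
```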
Assess the normality assumption by producing a qqplot of the residuals (either manually or using plot(model, which = ???)) for both rv_mdl1 and wb_mdl1.
Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to the plots.
For the linear model with multiple explanatory variables, we need to also think about multicollinearity - this is when two (or more) of the predictors in our regression model are moderately or highly correlated.
We can assess multicollinearity using the variance inflation factor (VIF), which for a given predictor \(x_j\) is calculated as:
\[
VIF_j = \frac{1}{1-R_j^2}
\]
where \(R_j^2\) is the \(R^2\) from a regression of \(x_j\) on all the other predictors in the model.
Suggested cut-offs for VIF are varied. Some suggest 10, others 5. Define what you will consider an acceptable value prior to calculating it. You could loosely interpret VIF values larger than 5 as moderate multicollinearity and values larger than 10 as severe multicollinearity.
In R, the vif() function from the car package will provide VIF values for each predictor in your model.
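To see where these numbers come from, the VIF formula can be applied by hand in base R. The sketch below uses simulated data with two deliberately correlated predictors; all object names are hypothetical. For a model like this one, with only numeric predictors, the results should agree with car::vif().

```r
# Simulated data with two correlated predictors
set.seed(8)
n <- 150
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)
y  <- 1 + x1 + x2 + rnorm(n)
sim_mdl <- lm(y ~ x1 + x2)

# R^2 from regressing each predictor on the other predictor(s)
r2_x1 <- summary(lm(x1 ~ x2))$r.squared
r2_x2 <- summary(lm(x2 ~ x1))$r.squared

# VIF_j = 1 / (1 - R_j^2)
vif_x1 <- 1 / (1 - r2_x1)
vif_x2 <- 1 / (1 - r2_x2)
c(x1 = vif_x1, x2 = vif_x2)

# car::vif(sim_mdl)  # should give the same values, if car is installed
```

With only two predictors the two VIFs are equal, because each \(R_j^2\) is just the squared correlation between x1 and x2.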
Calculate the variance inflation factor (VIF) for the predictors in the wb_mdl1 model.
Write a sentence summarising whether or not you consider multicollinearity to be a problem here.
We have seen in the case of simple linear regression that individual cases in our data can influence our model more than others. We know about:
Standardised residuals: the rstandard() function will give you these.
Studentised residuals: the rstudent() function will give you these.
Hat values (leverage): the hatvalues() function will retrieve these.
Cook’s Distance: the cooks.distance() function will provide these.
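The four functions listed above all take a fitted model and return one value per observation, so their output can be collected side by side. The sketch below does this for a model fitted to simulated data (sim_data, sim_mdl, case_diag and manual_std are made-up names), and includes two checks that may help in understanding what the values mean.

```r
# Simulated data, purely for illustration
set.seed(9)
n <- 60
sim_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim_data$y <- 2 + sim_data$x1 + 0.5 * sim_data$x2 + rnorm(n)
sim_mdl <- lm(y ~ x1 + x2, data = sim_data)

# one row per observation, one column per diagnostic
case_diag <- data.frame(
  std_resid  = rstandard(sim_mdl),
  stud_resid = rstudent(sim_mdl),
  hat        = hatvalues(sim_mdl),
  cooks_d    = cooks.distance(sim_mdl)
)
head(case_diag)

# the hat values sum to the number of estimated coefficients
sum(case_diag$hat)
length(coef(sim_mdl))

# a standardised residual is the residual divided by
# (residual standard error * sqrt(1 - hat value))
manual_std <- resid(sim_mdl) / (sigma(sim_mdl) * sqrt(1 - hatvalues(sim_mdl)))
all.equal(unname(manual_std), unname(rstandard(sim_mdl)))
```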
Alongside Cook’s Distance, we can examine the extent to which model estimates and predictions are affected when an entire case is dropped from the dataset and the model is refitted. You can get a whole bucket-load of these measures with the influence.measures() function:
influence.measures(my_model) will print out a table of the various measures.
summary(influence.measures(my_model)) will provide a nice summary of what R deems to be the influential points.
For questions 8-12, we will be working with our wb_mdl1 only. Feel free to apply the below to your rv_mdl1 too as extra practice.
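As background on the idea of dropping a case and refitting, the sketch below does this by hand for a single, arbitrarily chosen observation in a simulated dataset (all object names, and the choice of case 7, are made up for this example). It then computes Cook's Distance for that case from its definition, the summed squared change in all fitted values scaled by the number of coefficients times the residual variance, and compares it with what cooks.distance() returns.

```r
# Simulated data, purely for illustration
set.seed(10)
n <- 50
sim_data <- data.frame(x = rnorm(n))
sim_data$y <- 1 + 2 * sim_data$x + rnorm(n)
sim_mdl <- lm(y ~ x, data = sim_data)

# refit the model without case i
i <- 7
mdl_drop_i <- lm(y ~ x, data = sim_data[-i, ])

# how much do the estimates change when case i is dropped?
coef(sim_mdl) - coef(mdl_drop_i)

# how much do the predictions (for all n cases) change?
pred_full <- fitted(sim_mdl)
pred_drop <- predict(mdl_drop_i, newdata = sim_data)

# Cook's Distance for case i from its definition
p  <- length(coef(sim_mdl))   # number of estimated coefficients
s2 <- sigma(sim_mdl)^2        # residual variance of the full model
cooks_manual <- sum((pred_full - pred_drop)^2) / (p * s2)

cooks_manual
cooks.distance(sim_mdl)[i]
```

The built-in functions obtain the same quantities without actually refitting the model n times, which is why they are preferable in practice.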
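Create a new tibble which contains:
The variables used in the model (hint: what does wb_mdl1$model give you?)
The studentised residuals
The hat values
The Cook’s Distance values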
Looking at the studentised residuals, are there any outliers?
Looking at the hat values, are there any observations with high leverage?
Looking at the Cook’s Distance values, are there any highly influential points?
You can also display these graphically using plot(model, which = 4) and plot(model, which = 5).
Use the function influence.measures() to extract these delete-1 measures of influence.
Try plotting the distributions of some of these measures.
Tip: the function influence.measures() returns an infl-type object. To plot this, we need to find a way to extract the actual numbers from it.
What do you think names(influence.measures(wb_mdl1)) shows you? How can we use influence.measures(wb_mdl1)$<insert name here> to extract the matrix of numbers?
Create a new section header in your Rmarkdown document, as we are moving onto a different dataset.
The code below loads the dataset of 656 participants’ scores on Big 5 Personality traits, perceptions of social ranks, and scores on a depression and anxiety scale.
scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")
plot() function to get a series of plots).
Tips:
crPlots() will not work. To assess, you can create a residuals-vs-fitted plot like we saw in the guided exercises above.
To fit a model without specific observations, you can exclude their rows when passing the data, e.g. lm(y ~ x1 + x2, data = dataset[-c(3,5),]). Be careful to remember that these values remain in the dataset; they have simply been excluded from the model fit.
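The row-exclusion pattern in the last tip can be checked on a small simulated dataset (the names dataset, full_fit and reduced_fit, and the choice of rows 3 and 5, are purely illustrative). The negative index drops the rows only for that one call to lm(); the dataset itself is unchanged.

```r
# Simulated data, purely for illustration
set.seed(11)
dataset <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
dataset$y <- 1 + dataset$x1 + dataset$x2 + rnorm(20)

# fit using all rows, and again excluding rows 3 and 5
full_fit    <- lm(y ~ x1 + x2, data = dataset)
reduced_fit <- lm(y ~ x1 + x2, data = dataset[-c(3, 5), ])

nrow(dataset)       # 20: the dataset still contains every row
nobs(full_fit)      # 20 observations used in the full fit
nobs(reduced_fit)   # 18 observations used in the reduced fit
```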