Be sure to check the solutions to last week’s exercises.
You can still ask any questions about previous weeks’ materials if things aren’t clear!
LEARNING OBJECTIVES
In the previous labs, we fitted a number of multiple regression models. In each case, we first specified the model, then visually explored the marginal distributions and relationships of the variables to be used in the analysis. Finally, we fitted the model and began to examine the fit by studying what the various parameter estimates represented, and the spread of the residuals (the parts of the output inside the red boxes in Figure 1).
You will recall that before we draw inferences using our model estimates or use our model to make predictions, we need to be satisfied that our model meets the set of assumptions.
Recall the assumptions of the linear model:
When we fit a model, we evaluate many of these assumptions by looking at the residuals (the deviations of the observed values \(y_i\) from the model-estimated values \(\hat y_i\)). The residuals, \(\hat \epsilon\), are our estimates of the actual unknown true error term \(\epsilon\).
Because these same assumptions hold for a regression model with multiple predictors, we can assess them in a similar way. However, there are a number of important considerations.
Open a new RMarkdown document. Copy the code below to load in the tidyverse packages, read in the wellbeing.csv data and fit the following model:
\[ \text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Outdoor Time} + \beta_2 \cdot \text{Social Interactions} + \epsilon \]
library(tidyverse)
# read in the wellbeing data
mwdata = read_csv(file = "https://uoepsy.github.io/data/wellbeing.csv")
# fit the linear model:
wb_mdl1 <- lm(wellbeing ~ outdoor_time + social_int, data = mwdata)
Note: We have forgone writing the 1 in lm(y ~ 1 + x.... The 1 just tells R that we want to estimate the intercept, and it will do this by default even if we leave it out.
In simple linear regression (SLR) with only one explanatory variable, we could assess linearity through a simple scatterplot of the outcome variable against the explanatory variable. In multiple regression, however, it becomes more necessary to rely on diagnostic plots of the model residuals. This is because we need to know whether the relations are linear between the outcome and each predictor after accounting for the other predictors in the model.
In order to assess this, we use partial-residual plots (also known as ‘component-residual plots’). This is a plot with each explanatory variable \(x_j\) on the x-axis, and partial residuals on the y-axis.
Partial residuals for a predictor \(x_j\) are calculated as: \[ \hat \epsilon + \hat \beta_j x_j \]
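To see what this formula is doing, we could compute the partial residuals for one predictor by hand. This is only a sketch, assuming the wb_mdl1 model fitted above; the crPlots() function described next does all of this (and more) for us:

```r
# partial residuals for outdoor_time: the model residuals plus
# the fitted contribution of that one predictor
prs <- resid(wb_mdl1) + coef(wb_mdl1)["outdoor_time"] * mwdata$outdoor_time

# plot them against the predictor; an approximately straight trend
# is consistent with a linear relationship
plot(mwdata$outdoor_time, prs,
     xlab = "Outdoor time", ylab = "Partial residual")
```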
In R we can easily create these plots for all predictors in the model by using the crPlots() function from the car package.
Create partial-residual plots for the wb_mdl1 model.
Remember to load the car package first. If it does not load correctly, it might mean that you need to install it.
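For example (assuming the car package is already installed):

```r
library(car)
# partial-residual (component+residual) plots for every predictor in the model
crPlots(wb_mdl1)
```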
Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to the plots.
The equal variances assumption is that the error variance \(\sigma^2\) is constant across values of the predictors \(x_1\), … \(x_k\), and across values of the fitted values \(\hat y\). This sometimes gets termed “constant” vs “non-constant” variance. Figures 3 & 4 show what these look like visually.
In R we can create plots of the Pearson residuals against the predicted values \(\hat y\) and against the predictors \(x_1\), … \(x_k\) by using the residualPlots() function from the car package. This function also provides the results of a lack-of-fit test for each of these relationships (note that when it is the fitted values \(\hat y\) it gets called “Tukey’s test”).
ncvTest(model) (also from the car package) performs a test against the alternative hypothesis that the error variance changes with the level of the fitted value (also known as the “Breusch-Pagan test”). \(p >.05\) indicates that we do not have evidence that the assumption has been violated.
Use residualPlots() to plot residuals against each predictor, and use ncvTest() to perform a test against the alternative hypothesis that the error variance changes with the level of the fitted value.
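As a sketch of how these calls might look for our model:

```r
library(car)
# residuals against each predictor and against the fitted values,
# with curvature (lack-of-fit) tests printed to the console
residualPlots(wb_mdl1)

# Breusch-Pagan test: the alternative hypothesis is that the error
# variance changes with the level of the fitted values
ncvTest(wb_mdl1)
```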
Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to plots and/or formal tests where available.
Create the “residuals vs. fitted plot” - a scatterplot with the residuals \(\hat \epsilon\) on the y-axis and the fitted values \(\hat y\) on the x-axis.
You can either do this manually, by extracting the values with the residuals() and fitted() functions, or by using the plot() function on your model. Be sure to specify which plot you want to return - e.g., plot(wb_mdl1, which = ???)
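If you take the manual route, a sketch might look like this (using base R plotting; finding the right which = ??? number for the shortcut is left to you):

```r
# residuals vs fitted plot, constructed by hand from the model
plot(fitted(wb_mdl1), residuals(wb_mdl1),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # horizontal reference line at zero
```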
You can use this plot to visually assess:
The “independence of errors” assumption is the condition that the errors do not have some underlying relationship which is causing them to influence one another.
There are many sources of possible dependence, and often these are issues of study design. For example, we may have groups of observations in our data which we would expect to be related (e.g., multiple trials from the same participant). Our modelling strategy would need to take this into account.
One form of dependence is autocorrelation - this is when observations influence those adjacent to them. It is common in data for which time is a variable of interest (e.g., the humidity today is dependent upon the rainfall yesterday).
In R we can test against the alternative hypothesis that there is autocorrelation in our errors using the durbinWatsonTest() function (an abbreviated version, dwt(), is also available) from the car package.
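For our model, the call might look like this:

```r
library(car)
# alternative hypothesis: there is autocorrelation in the errors;
# p > .05 gives no evidence that the assumption is violated
durbinWatsonTest(wb_mdl1)
```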
Perform a test against the alternative hypothesis that there is autocorrelation in the error terms.
Write a sentence summarising whether or not you consider the assumption of independence to have been met (you may have to assume certain aspects of the study design).
The normality assumption is the condition that the errors \(\epsilon\) are normally distributed.
We can visually assess this condition through histograms, density plots, and quantile-quantile plots (QQplots) of our residuals \(\hat \epsilon\).
We can also perform a Shapiro-Wilk test against the alternative hypothesis that the residuals were not sampled from a normally distributed population, using the shapiro.test() function in R.
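A sketch of both approaches for our model (the qqplot drawn by hand here; the plot(model, which = ???) shortcut is left to you):

```r
# quantile-quantile plot of the residuals, with a reference line
qqnorm(resid(wb_mdl1))
qqline(resid(wb_mdl1))

# Shapiro-Wilk test: the alternative hypothesis is that the residuals
# were not sampled from a normally distributed population
shapiro.test(resid(wb_mdl1))
```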
Assess the normality assumption by producing a qqplot of the residuals (either manually or using plot(model, which = ???)), and conducting a Shapiro-Wilk test.
Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to plots and/or formal tests where available.
This interpretation falls down if predictors are highly correlated, because if, e.g., predictors \(x_1\) and \(x_2\) are highly correlated, then changing the value of \(x_1\) necessarily entails a change in the value of \(x_2\), meaning that it no longer makes sense to talk about holding \(x_2\) constant.
We can assess multicollinearity using the variance inflation factor (VIF), which for a given predictor \(x_j\) is calculated as:
\[
VIF_j = \frac{1}{1-R_j^2}
\]
Where \(R_j^2\) is the coefficient of determination (the R-squared) resulting from a regression of \(x_j\) on all the other predictors in the model (\(x_j = x_1 + ... + x_k + \epsilon\), excluding \(x_j\) itself from the right-hand side).
The more highly correlated \(x_j\) is with other predictors, the bigger \(R_j^2\) becomes, and thus the bigger \(VIF_j\) becomes.
The square root of VIF indicates how much the SE of the coefficient has been inflated due to multicollinearity. For example, if the VIF of a predictor variable were 4.6 (\(\sqrt{4.6} = 2.1\)), then the standard error of the coefficient of that predictor is 2.1 times larger than if the predictor had zero correlation with the other predictor variables. Suggested cut-offs for VIF are varied. Some suggest 10, others 5. Define what you will consider an acceptable value prior to calculating it.
In R, the vif() function from the car package will provide VIF values for each predictor in your model.
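For example, alongside vif() we might check the formula above by hand for one predictor. This is a sketch for our two-predictor model, where "all the other predictors" is just social_int:

```r
library(car)
# VIF for every predictor in the model
vif(wb_mdl1)

# checking the formula by hand for outdoor_time:
# regress it on the other predictor(s) and take 1 / (1 - R^2)
r2_j <- summary(lm(outdoor_time ~ social_int, data = mwdata))$r.squared
1 / (1 - r2_j)
```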
Calculate the variance inflation factor (VIF) for the predictors in the model.
Write a sentence summarising whether or not you consider multicollinearity to be a problem here.
We have seen in the case of the simple linear regression that individual cases in our data can influence our model more than others. We know about:
- Standardised residuals: the rstandard() function will give you these
- Studentised residuals: the rstudent() function will give you these
- Hat values (leverage): the hatvalues() function will retrieve these
- Cook’s Distance: the cooks.distance() function will provide these

Create a new tibble which contains:
(Hint: what does wb_mdl1$model give you?)
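One possible sketch (the column names here are just illustrative choices):

```r
# gather the four case diagnostics for wb_mdl1 into one tibble
mdl_diag <- tibble(
  std_res = rstandard(wb_mdl1),        # standardised residuals
  stu_res = rstudent(wb_mdl1),         # studentised residuals
  hats    = hatvalues(wb_mdl1),        # hat values (leverage)
  cooks_d = cooks.distance(wb_mdl1)    # Cook's Distance
)
```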
Looking at the studentised residuals, are there any outliers?
Looking at the hat values, are there any observations with high leverage?
Looking at the Cook’s Distance values, are there any highly influential points?
You can also display these graphically using plot(model, which = 4) and plot(model, which = 5).
Alongside Cook’s Distance, we can examine the extent to which model estimates and predictions are affected when an entire case is dropped from the dataset and the model is refitted.
Use the influence.measures() function to extract these delete-1 measures of influence.
Try plotting the distributions of some of these measures.
Tip: the function influence.measures() returns an infl-type object. To plot this, we need to find a way to extract the actual numbers from it. What do you think names(influence.measures(wb_mdl1)) shows you? How can we use influence.measures(wb_mdl1)$<insert name here> to extract the matrix of numbers?
Create a new section header in your Rmarkdown document, as we are moving onto a different dataset.
The code below loads the dataset of 656 participants’ scores on Big 5 Personality traits, perceptions of social ranks, and scores on a depression and anxiety scale.
scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")
(you can use the plot() function to get a series of plots).

Tips:
- crPlots() will not work. To assess linearity, you can create a residuals-vs-fitted plot like we saw in the guided exercises above.
- To exclude observations from a model, you can fit it to a subset of the data, e.g., lm(y ~ x1 + x2, data = dataset[-c(3,5),]). Be careful to remember that these values remain in the dataset; they have simply been excluded from the model fit.