Be sure to check the solutions to last week’s exercises.
You can still ask any questions about previous weeks’ materials if things aren’t clear!

LEARNING OBJECTIVES

  1. Specify the assumptions underlying a LM with multiple predictors.
  2. Assess if a fitted model satisfies the assumptions.
  3. Test and assess the effect of influential cases on LM coefficients and overall model evaluations.

Refresher: LM assumptions

In the previous labs, we have fitted a number of multiple regression models. In each case, we first specified the model, then visually explored the marginal distributions and relationships of variables which would be used in the analysis. Finally, we fit the model, and began to examine the fit by studying what the various parameter estimates represented, and the spread of the residuals (the parts of the output inside the red boxes in Figure 1)

Multiple regression output in R, summary.lm(). Residuals and Coefficients highlighted

Figure 1: Multiple regression output in R, summary.lm(). Residuals and Coefficients highlighted

You will recall that before we draw inferences using our model estimates or use our model to make predictions, we need to be satisfied that our model meets the set of assumptions.

All of the estimates, intervals and hypothesis tests (see Figure 2) resulting from a regression analysis assume a certain set of conditions have been met. Meeting these conditions is what allows us to generalise our findings beyond our sample (i.e., to the population).
Multiple regression output in R, summary.lm(). Hypothesis tests highlighted

Figure 2: Multiple regression output in R, summary.lm(). Hypothesis tests highlighted

Recall the assumptions of the linear model:

  • Linearity: The relationship between \(y\) and \(x\) is linear.
  • Independence of errors: The error terms should be independent from one another.
  • Normality: The errors \(\epsilon\) are normally distributed
  • Equal variances (“Homoscedasticity”): The scale of the variability of the errors \(\epsilon\) is constant at all values of \(x\).

When we fit a model, we evaluate many of these assumptions by looking at the residuals (the deviations from the observed values \(y_i\) and the model estimated value \(\hat y_i\)). The residuals, \(\hat \epsilon\) are our estimates of the actual unknown true error term \(\epsilon\).
Because these same assumptions hold for a regression model with multiple predictors, we can assess them in a similar way. However, there are a number of important considerations.

Question 1

Open a new RMarkdown document. Copy the code below to load in the tidyverse packages, read in the wellbeing.csv data and fit the following model:

\[ \text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Outdoor Time} + \beta_2 \cdot \text{Social Interactions} + \epsilon \]

library(tidyverse) 

# read in the wellbeing data
mwdata = read_csv(file = "https://uoepsy.github.io/data/wellbeing.csv")
# fit the linear model: 
wb_mdl1 <- lm(wellbeing ~ outdoor_time + social_int, data = mwdata)

Note: We have have forgone writing the 1 in lm(y ~ 1 + x.... The 1 just tells R that we want to estimate the Intercept, and it will do this by default even if we leave it out.

Guided exercises

Linearity

In simple linear regression (SLR) with only one explanatory variable, we could assess linearity through a simple scatterplot of the outcome variable against the explanatory. In multiple regression, however, it becomes more necessary to rely on diagnostic plots of the model residuals. This is because we need to know whether the relations are linear between the outcome and each predictor after accounting for the other predictors in the model.

In order to assess this, we use partial-residual plots (also known as ‘component-residual plots’). This is a plot with each explanatory variable \(x_j\) on the x-axis, and partial residuals on the y-axis.

Partial residuals for a predictor \(x_j\) are calculated as: \[ \hat \epsilon + \hat \beta_j x_j \]

In R we can easily create these plots for all predictors in the model by using the crPlots() function from the car package.

Question 2

Create partial-residual plots for the wb_mdl1 model.
Remember to load the car package first. If it does not load correctly, it might mean that you have need to install it.

Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to the plots.

Solution

Equal variances (Homoscedasticity)

The equal variances assumption is that the error variance \(\sigma^2\) is constant across values of the predictors \(x_1\), … \(x_k\), and across values of the fitted values \(\hat y\). This sometimes gets termed “Constant” vs “Non-constant” variance. Figures 3 & 4 shows what these look like visually.

Non-constant variance for numeric and categorical x

Figure 3: Non-constant variance for numeric and categorical x

Constant variance for numeric and categorical x

Figure 4: Constant variance for numeric and categorical x

In R we can create plots of the Pearson residuals against the predicted values \(\hat y\) and against the predictors \(x_1\), … \(x_k\) by using the residualPlots() function from the car package. This function also provides the results of a lack-of-fit test for each of these relationships (note when it is the fitted values \(\hat y\) it gets called “Tukey’s test”).

ncvTest(model) (also from the car package) performs a test against the alternative hypothesis that the error variance changes with the level of the fitted value (also known as the “Breusch-Pagan test”). \(p >.05\) indicates that we do not have evidence that the assumption has been violated.

Question 3

Use residualPlots() to plot residuals against each predictor, and use ncvTest() to perform a test against the alternative hypothesis that the error variance changes with the level of the fitted value.

Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to plots and/or formal tests where available.

Solution

Question 4

Create the “residuals vs. fitted plot” - a scatterplot with the residuals \(\hat \epsilon\) on the y-axis and the fitted values \(\hat y\) on the x-axis.

You can either do this:

  1. manually, using the functions residuals() and fitted(), or
  2. quickly by giving the plot() function your model. Be sure to specify which plot you want to return - e.g., plot(wb_mdl1, which = ???)

You can use this plot to visually assess:

  • Linearity: Does the average of the residuals \(\hat \epsilon\) remain close to 0 across the plot?
  • Equal Variance: does the spread of the residuals \(\hat \epsilon\) remain constant across the predicted values \(\hat y\)?

Solution

Independence

The “independence of errors” assumption is the condition that the errors do not have some underlying relationship which is causing them to influence one another.
There are many sources of possible dependence, and often these are issues of study design. For example, we may have groups of observations in our data which we would expect to be related (e.g., multiple trials from the same participant). Our modelling strategy would need to take this into account.
One form of dependence is autocorrelation - this is when observations influence those adjacent to them. It is common in data for which time is a variable of interest (e.g, the humidity today is dependent upon the rainfall yesterday).

In R we can test against the alternative hypothesis that there is autocorrelation in our errors using the durbinWatsonTest() (an abbreviated function dwt() is also available) in the car package.

Question 5

Perform a test against the alternative hypothesis that there is autocorrelation in the error terms.

Write a sentence summarising whether or not you consider the assumption of independence to have been met (you may have to assume certain aspects of the study design).

Solution

Normality of errors

The normality assumption is the condition that the errors \(\epsilon\) are normally distributed.

We can visually assess this condition through histograms, density plots, and quantile-quantile plots (QQplots) of our residuals \(\hat \epsilon\).
We can also perform a Shapiro-Wilk test against the alternative hypothesis that the residuals were not sampled from a normally distributed population. The shapiro.test() function in R.

Question 6

Assess the normality assumption by producing a qqplot of the residuals (either manually or using plot(model, which = ???)), and conducting a Shapiro-Wilk test.

Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to plots and/or formal tests where available.

Solution

Multicollinearity

For the linear model with multiple explanatory variables, we need to also think about multicollinearity - this is when two (or more) of the predictors in our regression model are moderately or highly correlated.
Recall our interpretation of multiple regression coefficients as
“the effect of \(x_1\) on \(y\) when holding the values of \(x_2\), \(x_3\), … \(x_k\) constant

This interpretation falls down if predictors are highly correlated because if, e.g., predictors \(x_1\) and \(x_2\) are highly correlated, then changing the value of \(x_1\) necessarily entails a change the value of \(x_2\) meaning that it no longer makes sense to talk about holding \(x_2\) constant.

We can assess multicollinearity using the variance inflation factor (VIF), which for a given predictor \(x_j\) is calculated as:
\[ VIF_j = \frac{1}{1-R_j^2} \\ \] Where \(R_j^2\) is the coefficient of determination (the R-squared) resulting from a regression of \(x_j\) on to all the other predictors in the model (\(x_j = x_1 + ... x_k + \epsilon\)).
The more highly correlated \(x_j\) is with other predictors, the bigger \(R_j^2\) becomes, and thus the bigger \(VIF_j\) becomes.

The square root of VIF indicates how much the SE of the coefficient has been inflated due to multicollinearity. For example, if the VIF of a predictor variable were 4.6 (\(\sqrt{4.6} = 2.1\)), then the standard error of the coefficient of that predictor is 2.1 times larger than if the predictor had zero correlation with the other predictor variables. Suggested cut-offs for VIF are varied. Some suggest 10, others 5. Define what you will consider an acceptable value prior to calculating it.

In R, the vif() function from the car package will provide VIF values for each predictor in your model.

Question 7

Calculate the variance inflation factor (VIF) for the predictors in the model.

Write a sentence summarising whether or not you consider multicollinearity to be a problem here.

Solution

Individual cases

We have seen in the case of the simple linear regression that individual cases in our data can influence our model more than others. We know about:

  • Regression outliers: A large residual \(\hat \epsilon_i\) - i.e., a big discrepancy between their predicted y-value and their observed y-value.
    • Standardised residuals: For residual \(\hat \epsilon_i\), divide by the estimate of the standard deviation of the residuals. In R, the rstandard() function will give you these
    • Studentised residuals: For residual \(\hat \epsilon_i\), divide by the estimate of the standard deviation of the residuals excluding case \(i\). In R, the rstudent() function will give you these.
  • High leverage cases: These are cases which have considerable potential to influence the regression model (e.g., cases with an unusual combination of predictor values).
    • Hat values: are used to assess leverage. In R, The hatvalues() function will retrieve these.
  • High influence cases: When a case has high leverage and is an outlier, it will have a large influence on the regression model.
    • Cook’s Distance: combines leverage (hatvalues) with outlying-ness to capture influence. In R, the cooks.distance() function will provide these.
Question 8

Create a new tibble which contains:

  1. The original variables from the model (Hint, what does wb_mdl1$model give you?)
  2. The fitted values from the model \(\hat y\)
  3. The residuals \(\hat epsilon\)
  4. The studentised residuals
  5. The hat values
  6. The Cook’s Distance values.

Solution

Question 9

Looking at the studentised residuals, are there any outliers?

Solution

Question 10

Looking at the hat values, are there any observations with high leverage?

Solution

Question 11

Looking at the Cook’s Distance values, are there any highly influential points?
You can also display these graphically using plot(model, which = 4) and plot(model, which = 5).

Solution

Other influence.measures()

Alongside Cook’s Distance, we can examine the extent to which model estimates and predictions are affected when an entire case is dropped from the dataset and the model is refitted.

  • DFFit: the change in the predicted value at the \(i^{th}\) observation with and without the \(i^{th}\) observation is included in the regression.
  • DFbeta: the change in a specific coefficient with and without the \(i^{th}\) observation is included in the regression.
  • DFbetas: the change in a specific coefficient divided by the standard error, with and without the \(i^{th}\) observation is included in the regression.
  • COVRATIO: measures the effect of an observation on the covariance matrix of the parameter estimates. In simpler terms, it captures an observation’s influence on standard errors.
Question 12

Use the function influence.measures() to extract these delete-1 measures of influence.

Try plotting the distributions of some of these measures.

Tip: the function influence.measures() returns an infl-type object. To plot this, we need to find a way to extract the actual numbers from it.
What do you think names(influence.measures(wb_mdl1)) shows you? How can we use influence.measures(wb_mdl1)$<insert name here> to extract the matrix of numbers?

Solution



Less guided exercises

Question 13

Create a new section header in your Rmarkdown document, as we are moving onto a different dataset.

The code below loads the dataset of 656 participants’ scores on Big 5 Personality traits, perceptions of social ranks, and scores on a depression and anxiety scale.

scs_study <- read_csv("https://uoepsy.github.io/data/scs_study.csv")


  1. Fit the following interaction model:
    • \(\text{DASS-21 Score} = \beta_0 + \beta_1 \cdot \text{SCS Score} + \beta_2 \cdot \text{Neuroticism} + \beta_3 \cdot \text{SCS score} \cdot \text{Neuroticism} + \epsilon\)
  2. Check that the model meets the assumptions of the linear model (Tip: to get a broad overview you can pass your model to the plot() function to get a series of plots).
  3. If you notice any violated assumptions:
    • address the issues by, e.g., excluding observations from the analysis, or replacing outliers with the next most extreme value (Winsorisation).
    • after fitting a new model which you hope addresses violations, you need to check all of your assumptions again. It can be an iterative process, and the most important thing is that your final model (the one you plan to use) meets all the assumptions.

Tips:

  • When there is an interaction in the model, assessing linearity becomes difficult. In fact, crPlots() will not work. To assess, you can create a residuals-vs-fitted plot like we saw in the guided exercises above.
  • Interaction terms often result in multicollinearity, because these terms are made up of the product of some ‘main effects.’ Mean-centering the variables like we have here will reduce this source of structural multicollinearity (“structural” here refers to the fact that multicollinearity is due to our model specification, rather than the data itself)
  • You can fit a model and exclude specific observations. For instance, to remove the 3rd and 5th rows of the dataset: lm(y ~ x1 + x2, data = dataset[-c(3,5),]). Be careful to remember that these values remain in the dataset, they have simply been excluded from the model fit.

Solution