
Week 4: LM Assumptions

Data Analysis for Psychology in R 2

TOM BOOTH & ALEX DOUMAS

Department of Psychology
The University of Edinburgh

AY 2020-2021

1 / 32

Week's Learning Objectives

  1. Understand the meaning of model coefficients in the case of a binary predictor.

  2. Be able to state the assumptions underlying a linear model.

  3. Understand how to assess if a fitted model satisfies the linear model assumptions.

  4. Understand how to use transformations when the model violates assumptions.

2 / 32

Topics for today

  • What are the assumptions of the linear model, and how can we test them?
    • Linearity
    • Independence of errors
    • Normality of errors
    • Equal variance (Homoscedasticity)
3 / 32

Linear model assumptions

  • So far, we have discussed evaluating linear models with respect to:

    • Overall model fit ($F$-ratio, $R^2$)
    • Individual predictors
  • However, the linear model is also built on a set of assumptions

  • If these assumptions are violated, the model estimates and the inferences we draw from them may not be accurate

  • Thus, we also need to test these assumptions

4 / 32

Today's example

  • Today we will continue using our test score example.

  • As of next week, we will move on to using examples from published papers.

library(tidyverse)  # provides read_csv() and the %>% pipe
df <- read_csv("./dapr2_lec07.csv")
m1 <- lm(score ~ hours, data = df)
m2 <- lm(score ~ study, data = df)
  • We do all our assumption testing after fitting the lm() model.
    • We will need to use information from the objects m1 and m2
5 / 32
  • This is a good time to make sure we are happy with the idea of objects
  • m1 is an lm() model object
  • It contains information about the model we ran: the coefficient estimates, the residuals, predicted scores, etc. (see the sketch below)
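For example (a minimal sketch of pulling that information out of the fitted object):

coef(m1)                # estimated intercept and slope
head(m1$residuals)      # first few estimated residuals
head(m1$fitted.values)  # first few predicted test scores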

Visualizations vs tests

  • In talking about assumption checks, we will present statistical tests and visualizations

  • In general, graphical methods are often more useful

    • Easier to see the nature and magnitude of the assumption violation
    • There is also a very useful function, plot(), for producing them all (see the next slide).
  • Statistical tests often suggest assumptions are violated even when the problem is small; with large samples, tests have the power to detect even trivial departures

6 / 32


Visualizations made easy

  • For a majority of assumption and diagnostic plots, we will make use of the plot() function.

  • If we give plot() a linear model object (e.g. m1 or m2), we can automatically get 6 useful plots.

    • We will explain these over the next few weeks (a minimal example below).
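For example, the call below (a minimal sketch) draws all six plots in one layout:

par(mfrow = c(2, 3))   # arrange the plots in a 2-by-3 grid
plot(m1, which = 1:6)  # the six diagnostic plots for the fitted model
par(mfrow = c(1, 1))   # reset the plotting layout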
7 / 32

Linearity

  • Assumption: The relationship between y and x is linear.
    • Assuming a linear relation when the true relation is non-linear can result in underestimating that relation
  • Investigated with:
    • Scatterplots with loess lines.
8 / 32

Linear vs non-linear

9 / 32

What is a loess line?

  • Method for helping visualize the shape of relationships:

  • Stands for...

    • LOcally
    • Estimated
    • Scatterplot
    • Smoothing
  • Essentially produces a line which follows the data.

10 / 32

Visualization

lin_m1 <- df %>%
  ggplot(., aes(x = hours, y = score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", se = FALSE, col = "red") +
  labs(x = "Hours Study", y = "Test Score",
       title = "Scatterplot with linear (blue) and loess (red) lines")

11 / 32

Normally distributed errors

  • Assumption: The errors ($\epsilon_i$) are normally distributed around each predicted value.

  • Investigated with:

    • QQ-plots
    • Histograms
    • Shapiro-Wilk test
12 / 32

Visualizations

  • Histograms: Plot the frequency distribution of the residuals.
hist(m1$residuals)
  • Q-Q Plots: Quantile comparison plots.
    • Plot the standardized residuals from the model against their theoretically expected values.
    • If the residuals are normally distributed, the points should fall neatly on the diagonal of the plot.
    • Non-normally distributed residuals cause deviations of points from the diagonal.
      • The specific shape of these deviations is characteristic of the distribution of the residuals.
plot(m1, which = 2)
13 / 32

Visualizations

14 / 32

shapiro.test()

  • The Shapiro-Wilk test provides a significance test on the departure from normality.

  • A significant p-value ($\alpha = .05$) suggests that the residuals deviate from normality.

shapiro.test(m1$residuals)
##
## Shapiro-Wilk normality test
##
## data: m1$residuals
## W = 0.99198, p-value = 0.5628
15 / 32

Equal variance (Homoscedasticity)

  • Assumption: The variance of the errors is constant across values of the predictors $x_1, \dots, x_k$, and across values of the fitted values ($\hat{y}$)

    • Heteroscedasticity refers to when this assumption is violated (non-constant variance)
  • Investigated with:

    • Plot residual values against the predicted values ($\hat{y}$).
    • Breusch-Pagan test (Non-constant variance test)
16 / 32

Residual-vs-predicted values plot

  • In R, we can plot the residuals vs predicted values using the residualPlot() function from the car package.

    • Categorical predictors should show a similar spread of residual values across their levels

    • The plots for continuous predictors should look like a random array of dots

      • The solid line should follow the dashed line closely
library(car)  # provides residualPlot(), ncvTest(), and durbinWatsonTest()
residualPlot(m1)
17 / 32

Residual-vs-predicted values plot

18 / 32


Breusch-Pagan test

  • Also called the non-constant variance test

  • Tests whether residual variance depends on the predicted values

  • Implemented using the ncvTest() function in R

    • A non-significant p-value suggests the homoscedasticity assumption holds
ncvTest(m1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 4.975437, Df = 1, p = 0.02571
  • Note that here p = .026, so there is evidence of non-constant variance in m1
19 / 32

Independence of errors

  • Assumption: The errors are not correlated with one another

  • Difficult to test unless we know the potential source of correlation between cases.

  • We can test a limited form of the assumption by testing for autocorrelation between errors.

    • We can test the correlation between each case and adjacent cases in the dataset
    • Achieved using the Durbin-Watson test
20 / 32

Durbin-Watson test

  • The Durbin-Watson test is implemented in R using the durbinWatsonTest() function from the car package:
durbinWatsonTest(m1)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.07672954 2.148216 0.368
## Alternative hypothesis: rho != 0
  • The D-W statistic can take values between 0 and 4
    • 2 = no autocorrelation
  • Therefore, we ideally want D-W values close to 2 and a non-significant p-value
    • Values < 1 or > 3 may indicate problems
21 / 32

Time for a break

And a quiz...identify the plot and the assumption

22 / 32

Violated Assumptions

What do we do about non-normality of residuals, heteroscedasticity and non-linearity?

23 / 32

Non-linear transformations

  • Often non-normal residuals, heteroscedasticity and non-linearity can be ameliorated by a non-linear transformation of the outcome and/or predictors.

  • This involves applying a function (see first week) to the values of a variable.

    • This changes the values and overall shape of the distribution
  • For non-normal residuals and heteroscedasticity, skewed outcomes can be transformed to normality (see the sketch below)

  • Non-linearity may be helped by a transformation of both predictors and outcomes
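A quick illustration with simulated data (a minimal sketch; pos here is a hypothetical variable, not one from our dataset):

set.seed(1)
pos <- rlnorm(1000)  # simulate a positively skewed (log-normal) variable
hist(pos)            # clear positive skew
hist(log(pos))       # approximately normal after the log-transform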

24 / 32

Transforming variables to normality

  • Positively skewed data can be made more normally distributed using a log-transformation.

  • Negatively skewed data can be made more normally distributed using the same procedure, but first reflecting the variable (making the biggest values the smallest and the smallest values the biggest) and then applying the log-transform

  • What does skew look like?

25 / 32

Visualizing Skew

26 / 32

Log-transformations

  • Log-transformations can be implemented in R using the log() function.

  • If your variable contains zero or negative values, you need to first add a constant to make all your values positive

    • A good strategy is to add a constant so that your minimum value is one (a sketch follows below)
    • E.g., if your minimum value is -1.5, add 2.5 to all your values
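A minimal sketch of this strategy (x is a hypothetical variable with negative values):

x <- c(-1.5, 0, 2, 10)       # hypothetical values; the minimum is -1.5
x_shift <- x + (1 - min(x))  # add 2.5 so that the minimum becomes 1
log_x <- log(x_shift)        # now safe to log-transform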
27 / 32

Log-transformation in action

df_skew <- df_skew %>%
  mutate(
    log_pos = log(pos),                       # log-transform the positively skewed variable
    neg_ref = ((-1) * neg) + (max(neg) + 1),  # reflect the negatively skewed variable so its minimum is 1
    log_neg = log(neg_ref)                    # then log-transform the reflected variable
  )
28 / 32

Log-transformation in action

29 / 32

Log-transformation in action

30 / 32

Summary of today

  • Looked at the third set of model evaluations: assumptions.

  • Described and considered how to assess:

    • Linearity
    • Independence of errors
    • Normality of errors
    • Equal variance (Homoscedasticity)
  • Key take home point:

    • There are no hard and fast rules for assessing assumptions
    • It takes practice to consider if violations are a problem
31 / 32

Next tasks

  • This week:
    • Complete your lab
    • Come to office hours
    • Weekly quiz: Assessed quiz - Week 3 content.
      • Open Monday 09:00
      • Closes Sunday 17:00
32 / 32
