Understand the meaning of model coefficients in the case of a binary predictor.
Be able to state the assumptions underlying a linear model.
Understand how to assess if a fitted model satisfies the linear model assumptions.
Understand how to use transformations when the model violates assumptions.
So far, we have discussed evaluating linear models with respect to their coefficients and overall fit.
However, the linear model is also built on a set of assumptions: linearity, normality of errors, equal variance (homoscedasticity), and independence of errors.
If these assumptions are violated, the model's estimates and inferences may be misleading.
Thus, we also need to test these assumptions.
Today we will continue using our test score example.
From next week, we will move on to using examples from published papers.
library(tidyverse)  # for read_csv() and the pipe

df <- read_csv("./dapr2_lec07.csv")
m1 <- lm(score ~ hours, data = df)
m2 <- lm(score ~ study, data = df)
m1 and m2 are lm() model objects.
In talking about assumption checks, we will present statistical tests and visualizations.
In general, graphical methods are often more useful.
Statistical tests often suggest assumptions are violated even when the problem is small: in large samples, such tests have high power and so can flag trivial departures from the assumptions.
For the majority of assumption and diagnostic plots, we will make use of the plot() function.
If we give plot() a linear model object (e.g. m1 or m2), we can automatically get 6 useful plots.
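As a minimal sketch (the 2 x 3 panel layout is our choice, not part of the lecture code), we can ask for all six plots at once:

par(mfrow = c(2, 3))    # arrange the panels in a 2 x 3 grid
plot(m1, which = 1:6)   # request all six diagnostic plots
par(mfrow = c(1, 1))    # reset the plotting layout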
LOESS is a method for helping visualize the shape of relationships.
It stands for LOcally Estimated Scatterplot Smoothing.
It essentially produces a line that follows the data.
lin_m1 <- df %>%
  ggplot(., aes(x = hours, y = score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", se = FALSE, col = "red") +
  labs(x = "Hours Study", y = "Test Score",
       title = "Scatterplot with linear (blue) and loess (red) lines")
Assumption: The errors ($\epsilon_i$) are normally distributed around each predicted value.
Investigated with:
A histogram of the model residuals: hist(m1$residuals)
A QQ-plot (normal quantile-quantile plot): plot(m1, which = 2)
The Shapiro-Wilk test provides a significance test of the departure from normality.
A significant p-value (at $\alpha = .05$) suggests that the residuals deviate from normality.
shapiro.test(m1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  m1$residuals
## W = 0.99198, p-value = 0.5628
Assumption: The error variance is constant (homoscedasticity) across values of the predictors $x_1, \dots, x_k$, and across values of the fitted values $\hat{y}$.
Investigated with a plot of residuals against predicted (fitted) values.
In R, we can produce this using the residualPlot() function from the car package.
Categorical predictors should show a similar spread of residual values across their levels.
The plots for continuous predictors should look like a random array of dots.
library(car)       # provides residualPlot(), ncvTest(), and durbinWatsonTest()
residualPlot(m1)
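As an aside (this is our addition, not part of the lecture code), car also provides the plural residualPlots() function, which draws residuals against each predictor as well as against the fitted values, which is handy once a model has several predictors:

residualPlots(m2)   # one residual plot per predictor, plus residuals vs fitted values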
The Breusch-Pagan test, also called the non-constant variance test.
Tests whether the residual variance depends on the predicted values; a significant result suggests the equal variance assumption is violated.
Implemented using the ncvTest() function in the car package.
ncvTest(m1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 4.975437, Df = 1, p = 0.02571
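Here $p = .026 < .05$, so the test suggests the residual variance of m1 is not constant. For comparison, a sketch assuming the lmtest package is installed (it is not used in this lecture): the closely related studentized Breusch-Pagan test is available as bptest().

library(lmtest)   # assumed installed; not part of the lecture code
bptest(m1)        # studentized Breusch-Pagan test of constant error variance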
Assumption: The errors are not correlated with one another.
Difficult to test unless we know the potential source of correlation between cases.
We can test a limited form of the assumption by testing for autocorrelation between errors.
In R, this is implemented with the durbinWatsonTest() function from the car package:
durbinWatsonTest(m1)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.07672954      2.148216   0.368
##  Alternative hypothesis: rho != 0
And a quiz: identify the plot and the assumption.
What do we do about non-normality of residuals, heteroscedasticity and non-linearity?
Often non-normal residuals, heteroscedasticity and non-linearity can be ameliorated by a non-linear transformation of the outcome and/or predictors.
This involves applying a function (see first week) to the values of a variable.
For non-normal residuals and heteroscedasticity, skewed outcomes can be transformed to normality.
Non-linearity may be helped by a transformation of both predictors and outcomes.
Positively skewed data can be made more normally distributed using a log-transformation.
Negatively skewed data can be made more normally distributed using the same procedure, but first reflecting the variable (making the biggest values the smallest and the smallest the biggest) and then applying the log-transform.
What does skew look like?
Log-transformations can be implemented in R using the log() function.
If your variable contains zero or negative values, you first need to add a constant to make all of the values positive.
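A minimal sketch of this shifting step (the variable and the constant of 1 are hypothetical, not from the lecture data):

x <- c(0, 1, 3, 10, 42)   # hypothetical variable containing a zero
log_x <- log(x + 1)       # add a constant so every value is positive before logging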
df_skew <- df_skew %>%
  mutate(
    log_pos = log(pos),                       # log-transform the positively skewed variable
    neg_ref = ((-1) * neg) + (max(neg) + 1),  # reflect the negatively skewed variable
    log_neg = log(neg_ref)                    # then log-transform the reflected variable
  )
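One way to check that the transformation helped (a sketch using the variables created above) is to compare histograms before and after:

par(mfrow = c(1, 2))                                  # show the two histograms side by side
hist(df_skew$pos, main = "Raw (positively skewed)")
hist(df_skew$log_pos, main = "Log-transformed")
par(mfrow = c(1, 1))                                  # reset the plotting layout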
Looked at the third set of model evaluations: assumptions.
Described and considered how to assess: linearity, normality of errors, equal variance (homoscedasticity), and independence of errors.
Key take-home point: graphical checks are generally more informative than significance tests, and transformations of the outcome and/or predictors can help when assumptions are violated.
Understand the meaning of model coefficients in the case of a binary predictor.
Be able to state the assumptions underlying a linear model.
Understand how to assess if a fitted model satisfies the linear model assumptions.
Understand how to use transformations when the model violates assumptions.