class: center, middle, inverse, title-slide

# Assumptions
## Data Analysis for Psychology in R 2
### dapR2 Team
### Department of Psychology
### The University of Edinburgh

---
# Week's Learning Objectives

1. Be able to state the assumptions underlying a linear model.
2. Understand how to test linear model assumptions.
3. Understand the difference between outliers and influential points.
4. Test and assess the effect of influential cases on LM coefficients and overall model evaluations.
5. Describe and apply some approaches to dealing with violations of model assumptions.

---
# Topics for today

+ What are the assumptions of the linear model and how can we assess them?
  + Linearity
  + Independence of errors
  + Normality of errors
  + Equal variance (Homoscedasticity)

---
# Linear model assumptions

+ So far, we have discussed evaluating linear models with respect to:
  + Overall model fit ( `\(F\)`-ratio, `\(R^2\)` )
  + Individual predictors

+ However, the linear model is also built on a set of assumptions.

+ If these assumptions are violated, the model will not be very accurate.

+ Thus, we also need to assess the extent to which these assumptions are met.

---
# Some data for today

.pull-left[
+ Let's look again at our data predicting salary from years of service and performance ratings (no interaction).

`$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i$$`

+ `\(y\)` = Salary (unit = thousands of pounds).
+ `\(x_1\)` = Years of service.
+ `\(x_2\)` = Average performance ratings.
]

.pull-right[

| id    | salary | serv | perf |
|:------|-------:|-----:|-----:|
| ID101 |  80.18 |  2.2 |    3 |
| ID102 | 123.98 |  4.5 |    5 |
| ID103 |  80.55 |  2.4 |    3 |
| ID104 |  84.35 |  4.6 |    4 |
| ID105 |  83.76 |  4.8 |    3 |
| ID106 | 117.61 |  4.4 |    4 |
| ID107 |  96.38 |  4.3 |    5 |
| ID108 |  96.49 |  5.0 |    5 |
| ID109 |  88.23 |  2.4 |    3 |
| ID110 | 143.69 |  4.6 |    6 |

]

---
# Our model

```r
m1 <- lm(salary ~ perf + serv, data = salary2)
```

+ We will run all of our assumption checks on the model object `m1`.
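+ If you want to follow along without the course data, the sketch below simulates a data frame with the same structure. The variable names match `salary2`, but the values are hypothetical and purely illustrative.

```r
# Hypothetical stand-in for the salary2 data used in the lecture;
# values are simulated for illustration only.
set.seed(1)
n <- 100
salary2 <- data.frame(
  id   = paste0("ID", 101:(100 + n)),
  serv = round(runif(n, 1, 8), 1),       # years of service
  perf = sample(2:7, n, replace = TRUE)  # performance rating
)
salary2$salary <- 60 + 4*salary2$serv + 6*salary2$perf + rnorm(n, sd = 10)

m1 <- lm(salary ~ perf + serv, data = salary2)
```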

---
# Visualizations vs tests

+ There exist a variety of ways to assess assumptions, which broadly split into statistical tests and visualizations.

+ We will focus on visualization:
  + Easier to see the nature and magnitude of the assumption violation.
  + There is also a very useful function for producing them all.

+ Statistical tests often suggest assumptions are violated when the problem is small.
  + This is to do with the statistical power of the tests.
  + They give no information on what the actual problem is.

+ A summary table of tests will be given at the end of the lecture.

---
# Visualizations made easy

+ For the majority of assumption and diagnostic plots, we will make use of the `plot()` function.
  + If we give `plot()` a linear model object (e.g. `m1` or `m2`), we can automatically generate assumption plots.

+ We will also make use of some individual functions for specific visualizations.

+ Alternatively, we can also use `check_model()` from the `performance` package.
  + This provides `ggplot` figures as well as some notes to aid interpretation.
  + Caution: these plots are **not in a format to use directly in reports**.

---
# Linearity

+ **Assumption**: The relationship between `\(y\)` and `\(x\)` is linear.
  + Assuming a linear relation when the true relation is non-linear can result in under-estimating that relation.

+ **Investigated with**:
  + Scatterplots with loess lines (single variables)
  + Component-residual plots (when we have multiple predictors)

---
# Linear vs non-linear

.pull-left[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-4-1.png" width="90%" />
]

.pull-right[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-5-1.png" width="90%" />
]

---
# What is a loess line?

+ A method for helping visualize the shape of relationships:
  + Stands for...
  + **LO**cally
  + **E**stimated
  + **S**catterplot
  + **S**moothing

+ Essentially produces a line which follows the data.

+ Useful for single predictors.

---
# Visualization

.pull-left[

```r
lin_m1 <- salary2 %>%
  ggplot(., aes(x = serv, y = perf)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "loess", se = FALSE, col = "red") + #<<
  labs(x = "Years of Service", y = "Performance",
       title = "Scatterplot with linear (blue) and loess (red) lines")
```
]

.pull-right[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-7-1.png" width="90%" />
]

---
# Non-linearity

+ With multiple predictors, we need to know whether the relations are linear between each predictor and the outcome, controlling for the other predictors.

+ This can be done using **component-residual plots**.
  + Also known as partial-residual plots.

+ Component-residual plots have the `\(x\)` values on the X-axis and partial residuals on the Y-axis.

+ *Partial residuals* for each X variable are:

`$$\epsilon_i + B_jX_{ij}$$`

+ Where:
  + `\(\epsilon_i\)` is the residual from the linear model including all the predictors.
  + `\(B_jX_{ij}\)` is the partial (linear) relation between `\(x_j\)` and `\(y\)`.

---
# `crPlots()`

+ Component-residual plots can be obtained using the `crPlots()` function from the `car` package.

```r
m1 <- lm(salary ~ perf + serv, data = salary2)
crPlots(m1)
```

+ The plots for continuous predictors show a linear (dashed) and loess (solid) line.

+ The loess line should follow the linear line closely, with deviations suggesting non-linearity.

---
# `crPlots()`

<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-9-1.png" width="90%" />

???
+ Here the relations look pretty good.
+ Deviations of the line are minor.
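
---
# Partial residuals by hand

+ As a rough sketch of what `crPlots()` is doing, the partial residuals for `serv` could be computed manually from the formula above. This is illustrative only, and assumes the `m1` model and `salary2` data from earlier.

```r
# Partial residual for 'serv' = model residual + b_serv * serv
b_serv <- coef(m1)["serv"]
partial_res_serv <- resid(m1) + b_serv * salary2$serv

plot(salary2$serv, partial_res_serv,
     xlab = "Years of service", ylab = "Partial residual (serv)")
abline(lm(partial_res_serv ~ salary2$serv), lty = 2)        # linear component
lines(lowess(salary2$serv, partial_res_serv), col = "red")  # loess-style smooth
```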

---
# Normally distributed errors

+ **Assumption**: The errors ( `\(\epsilon_i\)` ) are normally distributed around each predicted value.

+ **Investigated with**:
  + QQ-plots
  + Histograms

---
# Visualizations

+ **Histograms**: Plot the frequency distribution of the residuals.

```r
hist(m1$residuals)
```

--

+ **Q-Q Plots**: Quantile comparison plots.
  + Plot the standardized residuals from the model against their theoretically expected values.
  + If the residuals are normally distributed, the points should fall neatly on the diagonal of the plot.
  + Non-normally distributed residuals cause deviations of points from the diagonal.
  + The specific shape of these deviations is characteristic of the distribution of the residuals.

```r
plot(m1, which = 2)
```

---
# Visualizations

.pull-left[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-12-1.png" width="90%" />
]

.pull-right[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-13-1.png" width="90%" />
]

---
# Equal variance (Homoscedasticity)

+ **Assumption**: The variance of the errors is constant (equal) across values of the predictors `\(x_1\)`, ... `\(x_k\)`, and across values of the fitted values `\(\hat{y}\)`.
  + Heteroscedasticity refers to when this assumption is violated (non-constant variance).

+ **Investigated with**:
  + Plots of the residual values against the predicted values ( `\(\hat{y}\)` ).

---
# Residual-vs-predicted values plot

+ In R, we can plot the residuals vs predicted values using the `residualPlot()` function from the `car` package.

+ Categorical predictors should show a similar spread of residual values across their levels.

+ The plots for continuous predictors should look like a random array of dots.
  + The solid line should follow the dashed line closely.

```r
residualPlot(m1)
```

+ We can also get this plot using:

```r
plot(m1, which = 1)
```

---
# Residual-vs-predicted values plot

.pull-left[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-16-1.png" width="90%" />
]

.pull-right[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-17-1.png" width="90%" />
]
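
---
# By-hand versions of these plots

+ For intuition, the two key diagnostic plots above can also be built directly from `m1` with base R functions. This is just a sketch of what the plots contain; `plot(m1, which = 2)` and `plot(m1, which = 1)` remain the simpler route.

```r
# QQ plot: standardised residuals against theoretical normal quantiles
qqnorm(rstandard(m1))
qqline(rstandard(m1))

# Residuals against fitted values, with a zero reference line and a smooth
plot(fitted(m1), resid(m1),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(fitted(m1), resid(m1)), col = "red")
```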

---
# Independence of errors

+ **Assumption**: The errors are not correlated with one another.

+ Difficult to test unless we know the potential source of correlation between cases.
  + We will see more of this in year 3.

+ Essentially, for now, we will evaluate this based on study design.
  + If a design is between person, we will assume the errors to be independent.

---
# Multi-collinearity

+ This is **not an assumption of the linear model**, but it is something we need to consider.
  + It sits between assumptions and case diagnostics.

+ Multi-collinearity refers to the correlation between predictors.

+ We saw this in the formula for the standard error of model slopes for an `lm` with multiple predictors.
  + When there are large correlations between predictors, the standard errors are increased.
  + Therefore, we don't want our predictors to be too correlated.

---
# Variance Inflation Factor

+ The **Variance Inflation Factor**, or VIF, quantifies the extent to which standard errors are increased by predictor inter-correlations.

+ It can be obtained in R using the `vif()` function:

```r
vif(m1)
```

```
##     perf     serv 
## 1.001337 1.001337
```

+ The function gives a VIF value for each predictor.

+ Ideally, we want values to be close to 1.

+ VIFs > 10 indicate a problem.

---
# What to do about multi-collinearity

+ In practice, multi-collinearity is not often a major problem.

+ When issues arise, consider:
  + Combining highly correlated predictors into a single composite.
    + E.g. create a sum or average of the two predictors.
  + Dropping an IV that is obviously statistically and conceptually redundant with another from the model.

---
class: center, middle

# Time for a break

**And a quiz...identify the plot and the assumption**

---
class: center, middle

# Violated Assumptions

What do we do about non-normality of residuals, heteroscedasticity and non-linearity?

---
# Fixing violations

1. Model misspecification (predictors): add predictors
2. If the outcome is not continuous, use a generalized linear model (more later in the course)
3. Transformations
4. Bootstrapped inference

---
# Model misspecification

+ Sometimes assumptions appear violated because our model is not correct.

+ Typically we have:
  + Failed to include an interaction
  + Failed to include a non-linear (higher order) effect

+ Usually detected by observing violations of linearity or normality of residuals.

+ Solved by including the missing terms in our linear model.

---
# Non-linear transformations

+ Another approach is a non-linear transformation of the outcome and/or predictors.
  + Often used for non-normal residuals, heteroscedasticity and non-linearity.

+ This involves applying a function (see first week) to the values of a variable.
  + This changes the values and overall shape of the distribution.

+ For non-normal residuals and heteroscedasticity, skewed outcomes can be transformed to normality.

+ Non-linearity may be helped by a transformation of both predictors and outcomes.

---
# Transforming variables to normality

+ Positively skewed data can be made more normally distributed using a log-transformation.

+ Negatively skewed data can be made more normally distributed using the same procedure, but first reflecting the variable (making the biggest values the smallest and the smallest the biggest) and then applying the log-transform.

+ What does skew look like?

---
# Visualizing Skew

.pull-left[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-19-1.png" width="90%" />
]

.pull-right[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-20-1.png" width="90%" />
]
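
---
# A simulated skewed example

+ The next slides transform a data frame called `df_skew`, with a positively skewed variable `pos` and a negatively skewed variable `neg`. The course data are not shown here, but a hypothetical stand-in could be simulated as below (values are illustrative only):

```r
# Hypothetical df_skew: 'pos' is positively (right) skewed,
# 'neg' is negatively (left) skewed.
set.seed(42)
df_skew <- data.frame(
  pos = rexp(500, rate = 1),       # right skew
  neg = 10 - rexp(500, rate = 1)   # left skew
)
```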

---
# Log-transformations

+ Log-transformations can be implemented in R using the `log()` function.

+ If your variable contains zero or negative values, you first need to add a constant to make all your values positive.
  + A good strategy is to add a constant so that your minimum value is one.
  + E.g., if your minimum value is -1.5, add 2.5 to all your values.

---
# Log-transformation in action

```r
df_skew <- df_skew %>%
  mutate(
    log_pos = log(pos),                      #<<
    neg_ref = ((-1)*neg) + (max(neg) + 1),   #<<
    log_neg = log(neg_ref)                   #<<
  )
```

---
# Log-transformation in action

.pull-left[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-22-1.png" width="90%" />
]

.pull-right[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-23-1.png" width="90%" />
]

---
# Log-transformation in action

.pull-left[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-24-1.png" width="90%" />
]

.pull-right[
<img src="dapr2_12_assumptions_files/figure-html/unnamed-chunk-25-1.png" width="90%" />
]

---
# Generalised linear model

+ All the models we have been discussing are suitable for continuous outcome variables.

+ Sometimes our outcomes are not continuous or normally distributed, not because of an error in measurement, but because they would not be expected to be.
  + E.g. reaction times, counts, binary variables.

+ For such data, we need a slightly different version of a linear model.
  + More on this to come later in the course.

---
# Bootstrapped inference

+ One of the concerns when we have violated assumptions is that we make poor inferences.

+ This is because, with violated assumptions, the building blocks of our inferences may be unreliable.

+ Bootstrapping is a tool that can help us here.

+ We will cover this in detail later in the course.

---
# Summary of assumptions

+ **Linearity**: The relationship between `\(y\)` and `\(x\)` is linear.
  + Assuming a linear relation when the true relation is non-linear can result in under-estimating that relation.

+ **Normally distributed errors**: The errors ( `\(\epsilon_i\)` ) are normally distributed around each predicted value.

+ **Homoscedasticity**: The variance of the errors is constant (equal) across values of the predictors `\(x_1\)`, ... `\(x_k\)`, and across values of the fitted values `\(\hat{y}\)`.

+ **Independence of errors**: The errors are not correlated with one another.

---
# Summary of today

+ Looked at the third set of model evaluations, assumptions.

+ Described and considered how to assess:
  + Linearity
  + Independence of errors
  + Normality of errors
  + Equal variance (Homoscedasticity)

+ Key take-home point:
  + There are no hard and fast rules for assessing assumptions.
  + It takes practice to consider whether violations are a problem.

---
class: center, middle

# Thanks for listening!