class: center, middle, inverse, title-slide #
Revisiting Assumptions
## Data Analysis for Psychology in R 2
### Tom Booth and Alex Doumas ### Department of Psychology
The University of Edinburgh ### AY 2020-2021 --- # Weeks Learning Objectives 1. Understand the problem of multiple comparisons. 2. Run appropriate tests and corrections for multiple comparisons. 3. Apply assumption tests to linear models for experimental designs. --- # Topics for today + A brief word about assumptions in linear models for experimental designs. --- # Assumptions + And a reminder of linear model assumptions: + **L**inearity: The relationship between `\(y\)` and `\(x\)` is linear. + **I**ndependence of errors: The error terms should be independent from one another. + **N**ormality: The errors `\(\epsilon\)` are normally distributed + **E**qual variances ("Homoscedasticity"): The scale of the variability of the errors `\(\epsilon\)` is constant at all values of `\(x\)`. -- **That is, exactly the same assumptions as would be the case for any other linear model** --- # Assumptions - We've gone over these assumptions when discussing regression. - It's worth a bit of redundency, though, to make sure these assumptions are well understood. - We'll review assumptions, and how to check for violations... --- # Visualizations vs tests + In talking about assumption checks, we use both statistical tests and visualizations. + In general, graphical methods are often more useful. + Easier to see the nature and magnitude of the assumption violation. + There is also a very useful function for producing them all. + Statistical tests often suggest assumptions are violated when problem is small. --- # Visualizations made easy + For a many assumption and diagnostic plots, we will make use of the `plot()` function. + If we give `plot()` a linear model object (e.g. `m1` or `m2`), we can automatically get 6 useful plots. + Today we'll use a couple, but we'll get to others in the coming weeks. --- # Example + The data comes from a study into patient care in a paediatric wards. + A researcher was interested in whether the subjective well-being of patients differed dependent on the post-operation treatment schedule they were given, and the hospital in which they were staying. + **Condition 1**: `Treatment` (Levels: TreatA, TreatB, TreatC). + **Condition 2**: `Hosp` (Levels: Hosp1, Hosp2). + Total sample n = 180 (30 patients in each of 6 groups). + Between person design. + **Outcome**: Subjective well-being (SWB). + An average of multiple raters (the patient, a member of their family, and a friend). + SWB score ranged from 0 to 20. --- # The data ```r hosp_tbl <- read_csv("hospital.csv", col_types = "dff") hosp_tbl %>% slice(1:10) ``` ``` ## # A tibble: 10 x 3 ## SWB Treatment Hospital ## <dbl> <fct> <fct> ## 1 6.2 TreatA Hosp1 ## 2 15.9 TreatA Hosp1 ## 3 7.2 TreatA Hosp1 ## 4 11.3 TreatA Hosp1 ## 5 11.2 TreatA Hosp1 ## 6 9 TreatA Hosp1 ## 7 14.5 TreatA Hosp1 ## 8 7.3 TreatA Hosp1 ## 9 13.7 TreatA Hosp1 ## 10 12.6 TreatA Hosp1 ``` --- # Our results ```r m4 <- lm(SWB ~ Treatment + Hospital + Treatment*Hospital, data = hosp_tbl) anova(m4) ``` ``` ## Analysis of Variance Table ## ## Response: SWB ## Df Sum Sq Mean Sq F value Pr(>F) ## Treatment 2 177.02 88.511 21.5597 4.315e-09 *** ## Hospital 1 9.57 9.568 2.3306 0.1287 ## Treatment:Hospital 2 392.18 196.088 47.7635 < 2.2e-16 *** ## Residuals 174 714.34 4.105 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- # Our results ```r m4sum <- summary(m4) round(m4sum$coefficients,2) ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.80 0.37 29.19 0.00 ## TreatmentTreatB -1.37 0.52 -2.62 0.01 ## TreatmentTreatC -0.70 0.52 -1.33 0.18 ## HospitalHosp2 -2.95 0.52 -5.63 0.00 ## TreatmentTreatB:HospitalHosp2 6.63 0.74 8.97 0.00 ## TreatmentTreatC:HospitalHosp2 0.82 0.74 1.11 0.27 ``` --- # Linearity assumption + For regression, **Assumption**: The relationship between `\(y\)` and `\(x\)` is linear. + Assuming a linear relation when the true relation is non-linear can result in under-estimating that relation + **Investigated with**: + Scatterplots with loess lines. + For **ANOVA** there is no formal linearity assumption because of the group structure of all IVs. --- # Normally distributed errors + **Assumption**: The errors ( `\(\epsilon_i\)` ) are normally distributed around each predicted value. + **Investigated with**: + QQ-plots + Histograms + Shapiro-Wilk test --- # Visualizations + **Histograms**: Plot the frequency distribution of the residuals. ```r hist(m4$residuals) ``` -- + **Q-Q Plots**: Quantile comparison plots. + Plot the standardized residuals from the model against their theoretically expected values. + If the residuals are normally distributed, the points should fall neatly on the diagonal of the plot. + Non-normally distributed residuals cause deviations of points from the diagonal. + The specific shape of these deviations are characteristic of the distribution of the residuals. ```r *plot(m4, which = 2) ``` --- # Visualizations .pull-left[ ![](dapR2_lec28_RevisitingAssumptions_files/figure-html/unnamed-chunk-7-1.png)<!-- --> ] .pull-right[ ![](dapR2_lec28_RevisitingAssumptions_files/figure-html/unnamed-chunk-8-1.png)<!-- --> ] --- # shapiro.test() + The Shapiro-Wilk test provides a significance test on the departure from normality. + A significant `\(p\)`-value ( `\(\alpha = .05\)` ) suggests that the residuals deviate from normality. ```r shapiro.test(m4$residuals) ``` ``` ## ## Shapiro-Wilk normality test ## ## data: m4$residuals ## W = 0.99392, p-value = 0.6677 ``` --- # Equal variance (Homoscedasticity) + **Assumption**: The equal variances assumption is constant across values of the predictors `\(x_1\)`, ... `\(x_k\)`, and across values of the fitted values `\(\hat{y}\)`. + Heteroscedasticity refers to when this assumption is violated (non-constant variance). + **Investigated with**: + Plot Pearson residual values against the predicted values ( `\(\hat{y}\)` ). + Breusch-Pagan test (Non-constant variance test). --- # Residual-vs-predicted values plot .pull-left[ + In R, we can plot the residuals vs predicted values using `residualPlot()` function in the `car` package. + Categorical predictors should show a similar spread of residual values across their levels. + The plots for continuous predictors should look like a random array of dots. + The solid line should follow the dashed line closely. ```r residualPlot(m4) ``` ] .pull-right[ ![](dapR2_lec28_RevisitingAssumptions_files/figure-html/unnamed-chunk-11-1.png)<!-- --> ] --- # Breusch-Pagan test .pull-left[ + Also called the non-constant variance test. + Tests whether residual variance depends on the predicted values. + Implemented using the `ncvTest()` function in R. + Non-significant `\(p\)`-value suggests homoscedasticity assumption holds. ] .pull-right[ ```r ncvTest(m4) ``` ``` ## Non-constant Variance Score Test ## Variance formula: ~ fitted.values ## Chisquare = 0.6824626, Df = 1, p = 0.40874 ``` ] --- # Independence of errors + **Assumption**: The errors are not correlated with one another. + Difficult to test unless we know the potential source of correlation between cases. + We can test a limited form of the assumption by testing for autocorrelation between errors. + Achieved using the Durbin-Watson test. --- # Durbin-Watson test + Durbin-Watson test implemented in R using the `durbinWatsonTest()` function: ```r durbinWatsonTest(m4) ``` ``` ## lag Autocorrelation D-W Statistic p-value ## 1 -0.1408416 2.251814 0.2 ## Alternative hypothesis: rho != 0 ``` + The D-W statistic can take values between 0 and 4. + 2= no autocorrelation. + Therefore, we ideally want D-W values close to 2 and a non-significant `\(p\)`-value. + Values <1 or >3 may indicate problems. --- class: center, middle # Violated Assumptions What do we do about non-normality of residuals, heteroscedasticity and non-linearity? --- # Non-linear transformations + Often non-normal residuals, heteroscedasticity and non-linearity can be ameliorated by a non-linear transformation of the outcome and/or predictors. + This transformation involves applying a function (see first week) to the values of a variable. + The transformation changes the values and overall shape of the distribution. + For non-normal residuals and heteroscedasticity, skewed outcomes can be transformed to normality. + Non-linearity may be helped by a transformation of both predictors and outcomes. --- # Transforming variables to normality + Positively skewed data can be made more normally distributed using a log-transformation. + Negatively skewed data can be made more normally distributed using same procedure but first reflecting the variable (make biggest values the smallest and smallest the biggest) and then applying the log-transform. --- # Visualizing Skew .pull-left[ ![](dapR2_lec28_RevisitingAssumptions_files/figure-html/unnamed-chunk-14-1.png)<!-- --> ] .pull-right[ ![](dapR2_lec28_RevisitingAssumptions_files/figure-html/unnamed-chunk-15-1.png)<!-- --> ] --- # Log-transformations + Log-transformations can be implemented in R using the `log()` function. + If your variable contains zero or negative values, you need to first add a constant to make all your values positive. + A good strategy is to add a constant so that your minimum value is one. + E.g., if your minimum value is -1.5, add 2.5 to all your values. --- # Log-transformation in action ```r df_skew <- df_skew %>% mutate( * log_pos = log(pos), * neg_ref = ((-1)*neg) + (max(neg)+1), * log_neg = log(neg_ref) ) ``` --- # Log-transformation in action .pull-left[ ![](dapR2_lec28_RevisitingAssumptions_files/figure-html/unnamed-chunk-17-1.png)<!-- --> ] .pull-right[ ![](dapR2_lec28_RevisitingAssumptions_files/figure-html/unnamed-chunk-18-1.png)<!-- --> ] --- # Log-transformation in action .pull-left[ ![](dapR2_lec28_RevisitingAssumptions_files/figure-html/unnamed-chunk-19-1.png)<!-- --> ] .pull-right[ ![](dapR2_lec28_RevisitingAssumptions_files/figure-html/unnamed-chunk-20-1.png)<!-- --> ] --- # Summary of today - Reviewed LINE assumptions for linear models. - Discussed these assumptions related to experimental (ANOVA analysed) data and how to check for assumption violations.