class: center, middle, inverse, title-slide

# Week 9: Further diagnostics
## Data Analysis for Psychology in R 2
### Tom Booth & Alex Doumas
### Department of Psychology<br>The University of Edinburgh
### AY 2020-2021

---
# Week's Learning Objectives

1. Specify the assumptions underlying a LM with multiple predictors
2. Assess if a fitted model satisfies the assumptions
3. Test and assess the effect of influential cases on LM coefficients and overall model evaluations

---
# Topics for today

+ Recap diagnostics for LM
+ Additional checks with multiple predictors
+ Additional metrics for assessing influential cases

---
# Three important features

+ Model outliers
  + Cases for which there is a large discrepancy between their predicted value ( `\(\hat{y_i}\)` ) and their observed value ( `\(y_i\)` )

--

+ High leverage cases
  + Cases with an unusual value of the predictor ( `\(x_i\)` )

--

+ High influence cases
  + Cases that have a large impact on the estimation of the model

---
# Some data for today

.pull-left[
+ Let's look again at our data predicting salary from years of service and performance ratings (no interaction).

`$$y_i = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \epsilon_i$$`

+ `\(y\)` = Salary (unit = thousands of pounds).
+ `\(x_1\)` = Years of service.
+ `\(x_2\)` = Average performance ratings.
]

.pull-right[
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> id </th>
   <th style="text-align:right;"> salary </th>
   <th style="text-align:right;"> serv </th>
   <th style="text-align:right;"> perf </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> ID101 </td>
   <td style="text-align:right;"> 80.18 </td>
   <td style="text-align:right;"> 2.2 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID102 </td>
   <td style="text-align:right;"> 123.98 </td>
   <td style="text-align:right;"> 4.5 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID103 </td>
   <td style="text-align:right;"> 80.55 </td>
   <td style="text-align:right;"> 2.4 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID104 </td>
   <td style="text-align:right;"> 84.35 </td>
   <td style="text-align:right;"> 4.6 </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID105 </td>
   <td style="text-align:right;"> 83.76 </td>
   <td style="text-align:right;"> 4.8 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID106 </td>
   <td style="text-align:right;"> 117.61 </td>
   <td style="text-align:right;"> 4.4 </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID107 </td>
   <td style="text-align:right;"> 96.38 </td>
   <td style="text-align:right;"> 4.3 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID108 </td>
   <td style="text-align:right;"> 96.49 </td>
   <td style="text-align:right;"> 5.0 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID109 </td>
   <td style="text-align:right;"> 88.23 </td>
   <td style="text-align:right;"> 2.4 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID110 </td>
   <td style="text-align:right;"> 143.69 </td>
   <td style="text-align:right;"> 4.6 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
</tbody>
</table>
]

---
# Our model

```r
m1 <- lm(salary ~ serv + perf, data = salary2)
```

---
# Our model

```
## 
## Call:
## lm(formula = salary ~ serv + perf, data = salary2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.459  -9.636  -2.292  10.527  50.487 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47.7546     7.4010   6.452 4.33e-09 ***
## serv          0.8749     1.3933   0.628    0.532    
## perf         14.2770     1.4429   9.894  2.27e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.13 on 97 degrees of freedom
## Multiple R-squared:  0.5048,	Adjusted R-squared:  0.4946 
## F-statistic: 49.44 on 2 and 97 DF,  p-value: 1.574e-15
```

---
# Influence on coefficients

+ We have already considered Cook's distance as a measure of influence.
  + Cook's distance is a single value summarizing the total influence of a case.
+ In the context of a LM with 2+ predictors, we may want to look in a little more detail.
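As a quick recap, Cook's distance values can be pulled straight from the fitted model with base R's `cooks.distance()`. A minimal sketch, assuming `m1` and `salary2` from the slides above; the `\(4/(n-k-1)\)` cut-off used here is one common rule of thumb, not the only one:

```r
# Cook's distance: one value per case, summarizing its total influence on m1
cd <- cooks.distance(m1)

# One common rule-of-thumb cut-off: D > 4 / (n - k - 1)
n <- nrow(salary2)   # n = 100 cases
k <- 2               # k = 2 predictors (serv, perf)
which(cd > 4 / (n - k - 1))
```

The measures on the next slides break this single summary value down into influence on individual coefficients, fitted values, and standard errors.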
+ **DFFit**: The difference between the predicted outcome value for a case with versus without that case included
+ **DFbeta**: The difference between the value of a coefficient with and without a case included
+ **DFbetas**: A standardised version of DFbeta
  + Obtained by dividing DFbeta by an estimate of the standard error of the regression coefficient with the case removed

---
# In R

+ We can extract these measures using the `influence.measures()` function:

```r
dfs_m1 <- influence.measures(m1)
dfs_m1$infmat[1:10, 2:7]
```

```
##        dfb.serv     dfb.perf        dffit     cov.r       cook.d        hat
## 1   0.066130606  0.041753985 -0.105593829 1.0405921 3.737060e-03 0.02311790
## 2   0.003851302  0.004708924  0.008013976 1.0577856 2.163037e-05 0.02485396
## 3   0.054421852  0.041360254 -0.097410289 1.0382531 3.180791e-03 0.02042632
## 4  -0.119558235 -0.017643599 -0.184243512 0.9907370 1.121319e-02 0.01771114
## 5  -0.065074949  0.041930739 -0.098328593 1.0456587 3.243668e-03 0.02500085
## 6   0.035248994  0.006607446  0.061406087 1.0396877 1.266791e-03 0.01529208
## 7  -0.090987844 -0.140418705 -0.227310927 0.9856771 1.700975e-02 0.02276073
## 8  -0.177123872 -0.141498344 -0.278126264 0.9919140 2.543642e-02 0.03215527
## 9   0.019939872  0.015154173 -0.035690602 1.0510796 4.287565e-04 0.02042632
## 10  0.029007141  0.061933031  0.078179109 1.0784172 2.055900e-03 0.04711972
```

---
# Influence on standard errors

+ Influential cases can impact the `\(SE\)` of coefficients as well as the estimates of the coefficients themselves.
  + This means they impact our inferences.
+ Recall the standard error for a regression slope (single predictor):

`$$SE(\beta_1) = \sqrt{\frac{\frac{SSE}{(N-k-1)}}{\sum(x_i -\bar{x})^2}}$$`

+ Where:
  + `\(SSE\)` is the sum of squared error (i.e. `\(SS_{residual}\)` )
  + `\(N\)` is the sample size
  + `\(k\)` is the number of predictors

---
# Influence on standard errors

`$$SE(\beta_1) = \sqrt{\frac{\frac{SSE}{(N-k-1)}}{\sum(x_i -\bar{x})^2}}$$`

+ `\(\sum(x_i -\bar{x})^2\)` is the sum of squared deviations of each `\(x\)` value from the mean of `\(x\)`
+ This term implies that increasing the variance in `\(X\)` will decrease the standard error of `\(\beta_1\)`
+ High leverage cases (which are far from the mean of `\(X\)`) can inflate the `\(X\)` variance

---
# COVRATIO

+ Influence on standard errors can be measured using the **COVRATIO** statistic:
  + COVRATIO values < 1 show that precision is decreased (SE increased) by a case
  + COVRATIO values > 1 show that precision is increased (SE decreased) by a case
+ Cases with COVRATIO `\(> 1+[3(k+1)/n]\)` or `\(< 1-[3(k+1)/n]\)` can be considered to have a strong influence on the standard errors

---
# COVRATIO in R

+ COVRATIO values can be extracted using the `covratio()` function:

```r
cr <- covratio(m1)
cr[1:5]
```

```
##        1        2        3        4        5 
## 1.040592 1.057786 1.038253 0.990737 1.045659
```

---
# COVRATIO in R

+ And we can check cases against the upper cut-off above (here `\(k = 2\)` predictors and `\(n = 100\)`, so the cut-off is `\(1 + [3(2+1)/100] = 1.09\)`):

```r
which(cr > (1 + (3 * (2 + 1)) / 100))
```

---
# Summary of today

+ Today we have considered:
  + DFFit
  + DFbeta
  + DFbetas
  + COVRATIO
+ These provide a more detailed evaluation of the influence of cases on coefficients, model fit, and standard errors (inference) for linear models

---
# Next tasks

+ This week:
  + Complete your lab
  + Come to office hours
  + Weekly quiz - content from week 8
    + Open Monday 09:00
    + Closes Sunday 17:00
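---
# Appendix: COVRATIO cut-offs in both directions

The COVRATIO check earlier only looked at the upper cut-off. A minimal sketch, assuming `m1` and `salary2` from the slides above, that flags cases falling outside `\(1 \pm 3(k+1)/n\)` on either side:

```r
cr <- covratio(m1)       # one COVRATIO value per case
k  <- 2                  # number of predictors (serv, perf)
n  <- nrow(salary2)      # sample size
cut <- 3 * (k + 1) / n   # half-width of the cut-off interval around 1

# Cases with a strong influence on the standard errors, in either direction
which(cr > 1 + cut | cr < 1 - cut)
```

Cases below the lower cut-off are the ones that artificially *increase* apparent precision, so they matter for inference just as much as those above the upper cut-off.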