class: center, middle, inverse, title-slide

# Case Diagnostics 2
## Data Analysis for Psychology in R 2
### dapR2 Team
### Department of Psychology
### The University of Edinburgh

---
# Week's Learning Objectives

1. Be able to state the assumptions underlying a linear model.
2. Understand how to test linear model assumptions.
3. Understand the difference between outliers and influential points.
4. Test and assess the effect of influential cases on LM coefficients and overall model evaluations.
5. Describe and apply some approaches to dealing with violations of model assumptions.

---
# Topics for today

+ Recap diagnostics for LM
+ Additional checks with multiple predictors
+ Additional metrics for assessing influential cases

---
# Three important features

+ Model outliers
  + Cases for which there is a large discrepancy between their predicted value ( `\(\hat{y_i}\)` ) and their observed value ( `\(y_i\)` )

--

+ High leverage cases
  + Cases with an unusual value of the predictor ( `\(x_i\)` )

--

+ High influence cases
  + Cases that have a large impact on the estimation of the model

---
# Some data for today

.pull-left[
+ Let's look again at our data predicting salary from years of service and performance ratings (no interaction).

`$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i$$`

+ `\(y\)` = Salary (unit = thousands of pounds).
+ `\(x_1\)` = Years of service.
+ `\(x_2\)` = Average performance ratings.
]

.pull-right[
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> id </th> <th style="text-align:right;"> salary </th> <th style="text-align:right;"> serv </th> <th style="text-align:right;"> perf </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> ID101 </td> <td style="text-align:right;"> 80.18 </td> <td style="text-align:right;"> 2.2 </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> ID102 </td> <td style="text-align:right;"> 123.98 </td> <td style="text-align:right;"> 4.5 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> ID103 </td> <td style="text-align:right;"> 80.55 </td> <td style="text-align:right;"> 2.4 </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> ID104 </td> <td style="text-align:right;"> 84.35 </td> <td style="text-align:right;"> 4.6 </td> <td style="text-align:right;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> ID105 </td> <td style="text-align:right;"> 83.76 </td> <td style="text-align:right;"> 4.8 </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> ID106 </td> <td style="text-align:right;"> 117.61 </td> <td style="text-align:right;"> 4.4 </td> <td style="text-align:right;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> ID107 </td> <td style="text-align:right;"> 96.38 </td> <td style="text-align:right;"> 4.3 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> ID108 </td> <td style="text-align:right;"> 96.49 </td> <td style="text-align:right;"> 5.0 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> ID109 </td> <td style="text-align:right;"> 88.23 </td> <td style="text-align:right;"> 2.4 </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> ID110 </td> <td style="text-align:right;"> 143.69 </td> <td style="text-align:right;"> 4.6 </td> <td style="text-align:right;"> 6 </td> </tr>
</tbody>
</table>
]

---
# Our model

```r
m1 <- lm(salary ~ serv + perf, data = salary2)
```

---
# Our model

```
## 
## Call:
## lm(formula = salary ~ serv + perf, data = salary2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.459  -9.636  -2.292  10.527  50.487 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47.7546     7.4010   6.452 4.33e-09 ***
## serv          0.8749     1.3933   0.628    0.532    
## perf         14.2770     1.4429   9.894 2.27e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.13 on 97 degrees of freedom
## Multiple R-squared:  0.5048, Adjusted R-squared:  0.4946 
## F-statistic: 49.44 on 2 and 97 DF,  p-value: 1.574e-15
```
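---
# Recap: extracting diagnostics in R

+ As a quick recap, the three features above can each be extracted from `m1` with base R functions. Below is a minimal sketch; the cut-offs used are common rules of thumb, not fixed standards:

```r
n <- nrow(salary2)   # 100 cases
k <- 2               # number of predictors in m1

rs <- rstudent(m1)        # studentized residuals (model outliers)
hv <- hatvalues(m1)       # hat values (leverage)
cd <- cooks.distance(m1)  # Cook's distance (overall influence)

which(abs(rs) > 2)           # potential outliers
which(hv > 2 * (k + 1) / n)  # high leverage cases
which(cd > 4 / (n - k - 1))  # high influence cases
```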
---
# Influence on coefficients

+ We have already considered Cook's distance as a measure of influence.
  + Cook's distance is a single value summarizing the total influence of a case.
+ In the context of a linear model with 2+ predictors, we may want to look in a little more detail.
  + **DFFit**: The difference between the predicted value for a case when the model is estimated with versus without that case included.
  + **DFbeta**: The difference between the value of a coefficient when the model is estimated with versus without that case included.
  + **DFbetas**: A standardised version of DFbeta.
    + Obtained by dividing by an estimate of the standard error of the regression coefficient with the case removed.

---
# In R

+ We can extract these measures using the `influence.measures()` function:

```r
dfs_m1 <- influence.measures(m1)
dfs_m1$infmat[1:10, 2:7]
```

```
##       dfb.serv     dfb.perf        dffit     cov.r       cook.d        hat
## 1   0.066130606  0.041753985 -0.105593829 1.0405921 3.737060e-03 0.02311790
## 2   0.003851302  0.004708924  0.008013976 1.0577856 2.163037e-05 0.02485396
## 3   0.054421852  0.041360254 -0.097410289 1.0382531 3.180791e-03 0.02042632
## 4  -0.119558235 -0.017643599 -0.184243512 0.9907370 1.121319e-02 0.01771114
## 5  -0.065074949  0.041930739 -0.098328593 1.0456587 3.243668e-03 0.02500085
## 6   0.035248994  0.006607446  0.061406087 1.0396877 1.266791e-03 0.01529208
## 7  -0.090987844 -0.140418705 -0.227310927 0.9856771 1.700975e-02 0.02276073
## 8  -0.177123872 -0.141498344 -0.278126264 0.9919140 2.543642e-02 0.03215527
## 9   0.019939872  0.015154173 -0.035690602 1.0510796 4.287565e-04 0.02042632
## 10  0.029007141  0.061933031  0.078179109 1.0784172 2.055900e-03 0.04711972
```
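---
# In R: flagging cases on DFbetas

+ The standardised values can also be obtained directly with `dfbetas()`. A minimal sketch flagging cases coefficient by coefficient, using the common `\(2/\sqrt{n}\)` cut-off (a rule of thumb; other cut-offs exist):

```r
dfb <- dfbetas(m1)  # one column per coefficient, including the intercept
head(dfb, 3)

cut <- 2 / sqrt(nrow(salary2))
which(abs(dfb[, "serv"]) > cut)  # large influence on the serv slope
which(abs(dfb[, "perf"]) > cut)  # large influence on the perf slope
```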
---
# Influence on standard errors

+ Influential cases can impact the `\(SE\)` as well as the estimation of coefficients.
  + This means they impact our inferences.
+ Recall the standard error for a regression slope (single predictor):

`$$SE(\beta_1) = \sqrt{\frac{\frac{SSE}{(N-k-1)}}{\sum(x_i -\bar{x})^2}}$$`

+ Where:
  + `\(SSE\)` is the sum of squared error (i.e. `\(SS_{residual}\)` )
  + `\(N\)` is the sample size
  + `\(k\)` is the number of predictors

---
# Influence on standard errors

`$$SE(\beta_1) = \sqrt{\frac{\frac{SSE}{(N-k-1)}}{\sum(x_i -\bar{x})^2}}$$`

+ `\(\sum(x_i -\bar{x})^2\)` is the sum of squared deviations of each `\(x\)` value from the mean of `\(x\)`.
+ This term implies that increasing the variance in `\(x\)` will decrease the standard error of `\(\beta_1\)`.
+ High leverage cases (which are far from the mean of `\(x\)`) can increase the variance of `\(x\)`, giving a misleading impression of precision.

---
# COVRATIO

+ Influence on standard errors can be measured using the **COVRATIO** statistic:
  + COVRATIO values < 1 show that precision is decreased (SE increased) by a case.
  + COVRATIO values > 1 show that precision is increased (SE decreased) by a case.
+ Cases with COVRATIO `\(> 1+[3(k+1)/n]\)` or `\(< 1-[3(k+1)/n]\)` can be considered to have a strong influence on the standard errors.

---
# COVRATIO in R

+ COVRATIO values can be extracted using the `covratio()` function:

```r
cr <- covratio(m1)
cr[1:5]
```

```
##        1        2        3        4        5 
## 1.040592 1.057786 1.038253 0.990737 1.045659
```

---
# COVRATIO in R

+ And we can check cases against the cuts above:

```r
which(cr > (1+(3*2)/100))
```

```
## 10 17 19 20 22 24 29 32 39 41 45 51 55 56 60 63 70 72 82 84 91 92 99 
## 10 17 19 20 22 24 29 32 39 41 45 51 55 56 60 63 70 72 82 84 91 92 99
```
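---
# COVRATIO in R

+ Note that `\(k + 1\)` in the cut-off counts the intercept, so for `m1` (with `\(k = 2\)` predictors and `\(n = 100\)`) the formula above gives cuts of `\(1 \pm 0.09\)`. A minimal sketch computing both cuts directly from the formula and checking both directions:

```r
n <- nrow(salary2)  # 100 cases
k <- 2              # predictors in m1

upper <- 1 + (3 * (k + 1)) / n  # 1.09
lower <- 1 - (3 * (k + 1)) / n  # 0.91

which(cr > upper)  # cases markedly increasing precision (smaller SEs)
which(cr < lower)  # cases markedly decreasing precision (larger SEs)
```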
---
# Summary of today

+ Today we have considered:
  + DFFit
  + DFbeta
  + DFbetas
  + COVRATIO
+ These provide a more detailed evaluation of the influence of cases on coefficients, model fit, and standard errors (inference) for linear models.

---
class: center, middle
# Thanks for listening!