class: center, middle, inverse, title-slide .title[ #
Categorical Predictors: Effects (sum to zero) coding
] .subtitle[ ## Data Analysis for Psychology in R 2
] .author[ ### dapR2 Team ] .institute[ ### Department of Psychology
The University of Edinburgh ] --- # Weeks Learning Objectives 1. Understand the difference between dummy and sum-to-zero coding. 2. Understand the core principle of different coding schemes. 3. Interpret the output from a model using sum-to-zero coding. --- # Overview of next 3 weeks + The next few weeks will focus on how we can work with linear models with categorical predictors to test different questions. + In particular, we will look at the type of questions typically asked as part of experiments. + We will see that the tools we saw in discussing model comparisons will be important. + As well the principles of categorical data coding discussed when talking about binary and dummy variables. + We are going to generalise some of these concepts, and then show how we can use the general principles to test questions. --- # `Hospital` & `Treatment` data + **Condition 1**: `Treatment` (Levels: TreatA, TreatB, TreatC). + **Condition 2**: `Hospital` (Levels: Hosp1, Hosp2). + Total sample n = 180 (30 patients in each of 6 groups). + Between person design. + **Outcome**: Subjective well-being (`SWB`) + An average of multiple raters (the patient, a member of their family, and a friend). + SWB score ranged from 0 to 20. --- # Recap on dummy coding (reference group) + Create `\(k\)`-1 dummy variables/contrasts + where `\(k\)` is the number of levels of the categorical predictor. + Assign reference group 0 on all dummies. + Assign 1 to the focal group for a particular dummy. + Enter the dummies into the linear model and they code the difference in means between the focal group/level and the reference. --- # Model results ```r summary(lm(SWB ~ Treatment + Hospital + Treatment*Hospital, data = hosp_tbl)) ``` ``` ## ## Call: ## lm(formula = SWB ~ Treatment + Hospital + Treatment * Hospital, ## data = hosp_tbl) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6.6000 -1.2533 0.1083 1.2650 5.7000 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.8000 0.3699 29.195 < 2e-16 *** ## TreatmentTreatB -1.3700 0.5232 -2.619 0.0096 ** ## TreatmentTreatC -0.6967 0.5232 -1.332 0.1847 ## HospitalHosp2 -2.9467 0.5232 -5.632 7.02e-08 *** ## TreatmentTreatB:HospitalHosp2 6.6333 0.7399 8.966 4.74e-16 *** ## TreatmentTreatC:HospitalHosp2 0.8233 0.7399 1.113 0.2673 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.026 on 174 degrees of freedom ## Multiple R-squared: 0.4476, Adjusted R-squared: 0.4317 ## F-statistic: 28.2 on 5 and 174 DF, p-value: < 2.2e-16 ``` --- class: center, middle # Principle behind model constraints --- # Why do we need a reference group? + Consider our example and dummy coding. + We have three groups each given a specific Treatment A, B or C + We want a model that represents our data (observations), but all we "know" is what group an observation belongs to. So; `$$y_{ij} = \mu_i + \epsilon_{ij}$$` + Where + `\(y_{ij}\)` are the individual observations + `\(\mu_i\)` is the mean of group `\(i\)` and + `\(\epsilon_{ij}\)` is the individual deviation from that mean. ??? + And this hopefully makes sense. + Given we know someone's group, our best guess is the mean + But people wont all score the mean, so there is some deviation for every person. --- # Why do we need a reference group? + An alternative way to present this idea looks much more like our linear model: `$$y_{ij} = \beta_0 + \underbrace{(\mu_{i} - \beta_0)}_{\beta_i} + \epsilon_{ij}$$` + Where + `\(y_{ij}\)` are the individual observations + `\(\beta_0\)` is an estimate of reference/overall average + `\(\mu_i\)` is the mean of group `\(i\)` + `\(\beta_1\)` is the difference between the reference and the mean of group `\(i\)`, and + `\(\epsilon_{ij}\)` is the individual deviation from that mean. --- # Why do we need a reference group? + We can write this equation more generally as: $$\mu_i = \beta_0 + \beta_i $$ + or for the specific groups (in our case 3): `$$\mu_{treatmentA} = \beta_0 + \beta_{1A}$$` `$$\mu_{treatmentB} = \beta_0 + \beta_{2B}$$` `$$\mu_{treatmentC} = \beta_0 + \beta_{3C}$$` + **The problem**: we have four parameters ( `\(\beta_0\)` , `\(\beta_{1A}\)` , `\(\beta_{2B}\)` , `\(\beta_{3C}\)` ) to model three group means ( `\(\mu_{TreatmentA}\)` , `\(\mu_{TreatmentB}\)` , `\(\mu_{TreatmentC}\)` ) + We are trying to estimate too much with too little. + This is referred to as under-identification. + We need to estimate at least 1 parameter less --- # Constraints fix identification + Consider dummy coding. + Suppose we make Treatment A the reference. Then, `$$\mu_{treatmentA} = \beta_0$$` `$$\mu_{treatmentB} = \beta_0 + \beta_{1B}$$` `$$\mu_{treatmentC} = \beta_0 + \beta_{2C}$$` + **Fixed** ! + We now only have three parameters ( `\(\beta_0\)` , `\(\beta_{1B}\)` , `\(\beta_{2C}\)` ) for the three group means ( `\(\mu_{TreatmentA}\)` , `\(\mu_{TreatmentB}\)` , `\(\mu_{TreatmentC}\)` ). > So when we code categorical variables, we need a constraint so that we can estimate our models. --- class: center, middle # Any questions? --- # Why not always use dummy coding? + We might not always want to compare against a reference group. + We might want to compare to: + The overall or grand mean + Group 1 vs groups 2, 3, 4 combined + and on we go! + Let's consider the example of the grand mean... --- # Effects coding (sum to zero coding) .pull-left[ ![](dapr2_12_effectscoding_files/figure-html/unnamed-chunk-4-1.png)<!-- --> ] .pull-right[ + The coloured points represent the individual SWB values for the 30 individuals in each group. + The solid coloured lines are the group means. + The dashed grey line is the grand mean (the mean of all the observations) ] --- # Model with the grand mean + If we write our model including the grand mean, we get: `$$y_{ij} = \mu + \beta_j + \epsilon_{ij}$$` + where + `\(y_{ij}\)` is the score for a given individual ( `\(i\)` ) in a given group ( `\(j\)` ) + `\(\mu\)` is the grand mean + `\(\beta_j\)` is a group specific effect + `\(\epsilon_{ij}\)` is the individual deviation from the group mean --- # Model with the grand mean + This means that each group mean is: `$$\mu_{TreatmentA} = \mu + \beta_{TreatmentA}$$` `$$\mu_{TreatmentB} = \mu + \beta_{TreatmentB}$$` `$$\mu_{TreatmentC} = \mu + \beta_{TreatmentC}$$` + And as with dummy coding, this means we have 4 things to estimate, but only 3 group means. --- # Sum to zero constraint + In sum to zero coding, we fix this with the following constraint: `$$\sum_{j=1}^m \beta_j = 0$$` + Or alternatively written for the 3 group case: `$$\beta_1 + \beta_2 + \beta_3 = 0$$` --- # Sum to zero constraint + This constraints leads to the following interpretations: + `\(\beta_0\)` is the grand mean (mean of all observations) or `\(\mu\)` + `\(\beta_j\)` are the differences between the coded group and the grand mean: `$$\beta_j = \mu_j - \mu$$` --- # Why the grand mean? `$$\beta_1 + \beta_2 + \beta_3 = 0$$` + Substitute `\(\beta_0\)` : `$$(\mu_1 - \beta_0) + (\mu_2 - \beta_0) + (\mu_3 - \beta_0) = 0$$` `$$\mu_1 + \mu_2 + \mu_3 = 3\beta_0$$` $$\beta_0 = \frac{\mu_1 + \mu_2 + \mu_3}{3} $$ `$$\beta_0 = \mu$$` --- # Sum to zero constraint + Finally, we can get back to our group means from the coefficients as follows: `$$\mu_1 = \beta_0 + \beta_1$$` `$$\mu_2 = \beta_0 + \beta_2$$` `$$\mu_3 = \beta_0 - (\beta_1 + \beta_2)$$` --- class: center, middle # Pause for breath... --- # OK, but how do we apply the constraint? + Answer, in the same way as we did with dummy coding. + We can create a set of sum to zero (sometimes called effect, or deviation) variables + Or the equivalent contrast matrix. + For effect code variables we: + Create `\(k-1\)` variables + For observations in the focal group, assign 1 + For observations in the last group, assign -1 + For all other groups assign 0 --- # Comparing coding matrices .pull-left[ <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Level </th> <th style="text-align:right;"> D1 </th> <th style="text-align:right;"> D2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Treatment A </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Treatment B </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Treatment C </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> `$$y_{ij} = \beta_0 + \beta_1D_1 + \beta_2D_2 + \epsilon_{ij}$$` ] .pull-right[ <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Level </th> <th style="text-align:right;"> E1 </th> <th style="text-align:right;"> E2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Treatment A </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Treatment B </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Treatment C </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -1 </td> </tr> </tbody> </table> `$$y_{ij} = \beta_0 + \beta_1E_1 + \beta_2E_2 + \epsilon_{ij}$$` ] --- # Sum to zero/effects for group means .pull-left[ <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Level </th> <th style="text-align:right;"> E1 </th> <th style="text-align:right;"> E2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Treatment A </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Treatment B </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Treatment C </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -1 </td> </tr> </tbody> </table> `$$\mu_1 = \beta_0 + \beta_1$$` `$$\mu_2 = \beta_0 + \beta_2$$` `$$\mu_3 = \beta_0 - (\beta_1 + \beta_2)$$` ] .pull-right[ `$$\mu_1 = \beta_0 + 1*\beta_1 + 0*\beta_2 = \beta_0 + \beta_1$$` `$$\mu_2 = \beta_0 + 0*\beta_1 + 1*\beta_2 = \beta_0 + \beta_2$$` `$$\mu_3 = \beta_0 -1*\beta_1 -1*\beta_2 = \beta_0 - \beta_1 -\beta_2$$` + Now we will look practically at the implementation and differences ] --- # Group Means ```r hosp_tbl %>% select(1:2) %>% group_by(Treatment) %>% summarise( mean = round(mean(SWB),3), sd = round(sd(SWB),1), N = n() ) ``` ``` ## # A tibble: 3 × 4 ## Treatment mean sd N ## <fct> <dbl> <dbl> <int> ## 1 TreatA 9.33 2.9 60 ## 2 TreatB 11.3 2.5 60 ## 3 TreatC 9.04 2 60 ``` --- # Effects (sum to zero) model + We need to change the contrast scheme from default. ```r contrasts(hosp_tbl$Treatment) <- contr.sum contrasts(hosp_tbl$Treatment) ``` ``` ## [,1] [,2] ## TreatA 1 0 ## TreatB 0 1 ## TreatC -1 -1 ``` --- # Effects (sum to zero) model ```r summary(lm(SWB ~ Treatment, data = hosp_tbl)) ``` ``` ## ## Call: ## lm(formula = SWB ~ Treatment, data = hosp_tbl) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.373 -1.987 -0.300 1.838 7.173 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 9.8806 0.1872 52.791 < 2e-16 *** ## Treatment1 -0.5539 0.2647 -2.093 0.0378 * ## Treatment2 1.3928 0.2647 5.262 4.09e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.511 on 177 degrees of freedom ## Multiple R-squared: 0.1369, Adjusted R-squared: 0.1271 ## F-statistic: 14.04 on 2 and 177 DF, p-value: 2.196e-06 ``` --- # Effects (sum to zero) model .pull-left[ ``` ## (Intercept) Treatment1 Treatment2 ## 9.881 -0.554 1.393 ``` + Coefficients from group means `$$\beta_0 = \frac{\mu_1 + \mu_2 + \mu_3}{3}$$` `$$\beta_1 = \mu_1 - \mu$$` `$$\beta_2 = \mu_2 - \mu$$` ] .pull-right[ <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Treatment </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> Gmean </th> <th style="text-align:right;"> Coefficients </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TreatA </td> <td style="text-align:right;"> 9.327 </td> <td style="text-align:right;"> 9.881 </td> <td style="text-align:right;"> -0.554 </td> </tr> <tr> <td style="text-align:left;"> TreatB </td> <td style="text-align:right;"> 11.273 </td> <td style="text-align:right;"> 9.881 </td> <td style="text-align:right;"> 1.392 </td> </tr> <tr> <td style="text-align:left;"> TreatC </td> <td style="text-align:right;"> 9.042 </td> <td style="text-align:right;"> 9.881 </td> <td style="text-align:right;"> -0.839 </td> </tr> </tbody> </table> ] --- # Effects (sum to zero) model .pull-left[ ``` ## (Intercept) Treatment1 Treatment2 ## 9.881 -0.554 1.393 ``` + Group means from coefficients: `$$\mu_1 = \beta_0 + \beta_1$$` `$$\mu_2 = \beta_0 + \beta_2$$` `$$\mu_3 = \beta_0 - (\beta_1 + \beta_2)$$` ] .pull-right[ <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Treatment </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> Gmean </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TreatA </td> <td style="text-align:right;"> 9.327 </td> <td style="text-align:right;"> 9.881 </td> </tr> <tr> <td style="text-align:left;"> TreatB </td> <td style="text-align:right;"> 11.273 </td> <td style="text-align:right;"> 9.881 </td> </tr> <tr> <td style="text-align:left;"> TreatC </td> <td style="text-align:right;"> 9.042 </td> <td style="text-align:right;"> 9.881 </td> </tr> </tbody> </table> ```r 9.881 + -.554 ``` ``` ## [1] 9.327 ``` ```r 9.881 + 1.393 ``` ``` ## [1] 11.274 ``` ```r 9.881 - (-.554 + 1.393) ``` ``` ## [1] 9.042 ``` ] --- # The wide world of contrasts + We have now seen two examples of coding schemes (dummy and effect). + There are **lots** of different coding schemes we can use for categorical variables to make different comparisons. + If you are interested, see the excellent resource on [UCLA website](https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/) + But always remember... **The data is the same, the tested contrast differs** --- class: center, middle # Questions --- # The data is the same, the tested contrasts differ + Run both models: ```r contrasts(hosp_tbl$Treatment) <- contr.treatment m_dummy <- lm(SWB ~ Treatment, data = hosp_tbl) # Change the contrasts and run again contrasts(hosp_tbl$Treatment) <- contr.sum m_zero <- lm(SWB ~ Treatment, data = hosp_tbl) ``` + Create a small data set: ```r treat <- tibble(Treatment = c("TreatA", "TreatB", "TreatC")) ``` --- # The data is the same, the tested contrasts differ + Add the predicted values from our models ```r treat %>% mutate( pred_dummy = predict(m_dummy, newdata = .), pred_zero = predict(m_zero, newdata = .) ) ``` ``` ## # A tibble: 3 × 3 ## Treatment pred_dummy pred_zero ## <chr> <dbl> <dbl> ## 1 TreatA 9.33 9.33 ## 2 TreatB 11.3 11.3 ## 3 TreatC 9.04 9.04 ``` + No matter what coding or contrasts we use, we are still modelling the group means! --- # Summary of today + We have considered different ways in which we can code categorical predictors. + Take home: + Use of coding matrices allows us to compare groups (or levels) in lots of ways. + Our `\(\beta\)`'s will represent differences in group means. + The scheme we use determines which group or combination of groups we are comparing. + **In all cases the underlying data is unchanged.** + This makes coding schemes a very flexible tool for testing hypotheses. --- class: center, middle # Thanks for listening! Any questions?