class: center, middle, inverse, title-slide #
Categorical Predictors
## Data Analysis for Psychology in R 2
### dapR2 Team ### Department of Psychology
The University of Edinburgh --- # Week's Learning Objectives 1. Understand the meaning of model coefficients in the case of a binary predictor. 2. Understand how to apply dummy coding to include categorical predictors with 2+ levels. 3. Be able to include categorical predictors in an `lm` model in R --- # Recap: Categorical variables + Categorical variables can only take discrete values + E.g., animal type: 1 = duck, 2 = cow, 3 = wasp + They are mutually exclusive + No duck-wasps or cow-ducks! + In R, categorical variables should be of class `factor` + The discrete values are `levels` + Levels can have numeric values (1, 2, 3) and labels (e.g. "duck", "cow", "wasp") + All that these numbers represent is category membership. --- # Recap: Binary variable + A binary variable is a categorical variable with two levels. + Traditionally coded with a 0 and 1 + In `lm`, binary variables are often referred to as dummy variables + When we use multiple dummies, we talk about the general procedure of dummy coding + Why 0 and 1? -- + Quick version: it has some nice properties when it comes to interpretation. -- + What is the interpretation of the intercept? -- + What about the slope? --- # Extending our example .pull-left[ + Our in-class example so far has used test scores, revision time, and motivation. + Suppose we also collected data on who students studied with (`study`); + 0 = alone; 1 = with others + And also which of three different revision methods they used for the test (`method`) + 1 = Notes re-reading; 2 = Notes summarising; 3 = Self-testing ([see here](https://www.psychologicalscience.org/publications/journals/pspi/learning-techniques.html)) + We collect a new sample of 250 students. ] .pull-right[ ```r head(test_study3) %>% kable(.) 
%>% kable_styling(full_width = F) ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:right;"> score </th> <th style="text-align:right;"> hours </th> <th style="text-align:right;"> motivation </th> <th style="text-align:left;"> study </th> <th style="text-align:left;"> method </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> ID101 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> -0.42 </td> <td style="text-align:left;"> others </td> <td style="text-align:left;"> self-test </td> </tr> <tr> <td style="text-align:left;"> ID102 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 0.15 </td> <td style="text-align:left;"> alone </td> <td style="text-align:left;"> summarise </td> </tr> <tr> <td style="text-align:left;"> ID103 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0.35 </td> <td style="text-align:left;"> alone </td> <td style="text-align:left;"> summarise </td> </tr> <tr> <td style="text-align:left;"> ID104 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 0.16 </td> <td style="text-align:left;"> others </td> <td style="text-align:left;"> summarise </td> </tr> <tr> <td style="text-align:left;"> ID105 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> -1.12 </td> <td style="text-align:left;"> alone </td> <td style="text-align:left;"> read </td> </tr> <tr> <td style="text-align:left;"> ID106 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> -0.64 </td> <td style="text-align:left;"> others </td> <td style="text-align:left;"> read </td> </tr> 
</tbody> </table> ] --- # LM with binary predictors + Now let's ask the question: + **Do students who study with others score better than students who study alone?** + Our equation does not change: `$$score_i = \beta_0 + \beta_1 study_{i} + \epsilon_i$$` + And in R: ```r performance_study <- lm(score ~ study, data = test_study3) ``` --- # Model results ```r summary(performance_study) ``` ``` ## ## Call: ## lm(formula = score ~ study, data = test_study3) ## ## Residuals: ## Min 1Q Median 3Q Max ## -22.7902 -4.7902 0.2098 5.4766 18.2098 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 22.7902 0.6603 34.52 < 2e-16 *** ## studyothers 4.7332 1.0093 4.69 4.53e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7.896 on 248 degrees of freedom ## Multiple R-squared: 0.08145, Adjusted R-squared: 0.07775 ## F-statistic: 21.99 on 1 and 248 DF, p-value: 4.525e-06 ``` --- # Interpretation .pull-left[ + As before, the intercept `\(\hat \beta_0\)` is the expected value of `\(y\)` when `\(x=0\)` + What is `\(x=0\)` here? + It is the students who study alone. + So what about `\(\hat \beta_1\)`? + `\(\hat \beta_1\)` = ? + **Look at the output on the right-hand side.** + What do you notice about the difference in averages? ] .pull-right[ ```r test_study3 %>% * group_by(., study) %>% summarise( * Average = round(mean(score),4) ) ``` ``` ## # A tibble: 2 x 2 ## study Average ## <fct> <dbl> ## 1 alone 22.8 ## 2 others 27.5 ``` ] --- # Interpretation + `\(\hat \beta_0\)` = the expected value of `\(y\)` when `\(x = 0\)` + Or, the mean of the group coded 0 (those who study alone) + `\(\hat \beta_1\)` = the predicted difference between the means of the two groups. + Group 1 - Group 0 (mean `score` for those who study with others - mean `score` of those who study alone) + Notice how this maps to our question: + Do students who study with others score better than students who study alone? 
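--- # Checking the interpretation in R

+ We can sanity-check this mapping directly: the intercept should equal the mean of the baseline group, and intercept + slope the mean of the other group.
+ A minimal sketch using simulated data (toy numbers and names, not the course's `test_study3` dataset):

```r
# Minimal sketch with simulated data (toy values, not the course dataset)
set.seed(1)
dat <- data.frame(
  score = rnorm(100, mean = 25, sd = 8),
  study = factor(rep(c("alone", "others"), each = 50))
)
m <- lm(score ~ study, data = dat)

# Intercept = mean of the baseline ("alone") group
all.equal(unname(coef(m)["(Intercept)"]),
          mean(dat$score[dat$study == "alone"]))

# Intercept + slope = mean of the "others" group
all.equal(unname(sum(coef(m))),
          mean(dat$score[dat$study == "others"]))
```

+ Both comparisons return `TRUE`: with a single dummy predictor, the fitted values are exactly the two group means.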
--- class: center, middle # Quick pause for questions --- # Equations for each group + What would our linear model look like if we plugged in the values for `\(x\)`? `$$\widehat{score} = \hat \beta_0 + \hat \beta_1 study$$` + For those who study alone ( `\(study = 0\)` ): `$$\widehat{score}_{alone} = \hat \beta_0 + \hat \beta_1 \times 0$$` + So: `$$\widehat{score}_{alone} = \hat \beta_0$$` --- # Equations for each group + For those who study with others ( `\(study = 1\)` ): `$$\widehat{score}_{others} = \hat \beta_0 + \hat \beta_1 \times 1$$` + So: `$$\widehat{score}_{others} = \hat \beta_0 + \hat \beta_1$$` + And if we rearrange: `$$\hat \beta_1 = \widehat{score}_{others} - \hat \beta_0$$` + Remembering that `\(\widehat{score}_{alone} = \hat \beta_0\)`, we finally obtain: `$$\hat \beta_1 = \widehat{score}_{others} - \widehat{score}_{alone}$$` --- # Visualize the model <img src="dapr2_05_LMcategorical1_files/figure-html/unnamed-chunk-7-1.png" width="504" /> --- # Visualize the model <img src="dapr2_05_LMcategorical1_files/figure-html/unnamed-chunk-8-1.png" width="504" /> --- # Visualize the model <img src="dapr2_05_LMcategorical1_files/figure-html/unnamed-chunk-9-1.png" width="504" /> --- # Visualize the model <img src="dapr2_05_LMcategorical1_files/figure-html/unnamed-chunk-10-1.png" width="504" /> --- # Evaluation of model and significance of `\(\beta_1\)` + `\(R^2\)` and `\(F\)`-ratio interpretation are identical to their interpretation in models with only continuous predictors. + And we assess the significance of predictors in the same way: + We calculate `\(\hat \beta_1\)` = the difference between groups + We use the standard error of the coefficient to construct: + A `\(t\)`-value and associated `\(p\)`-value for the coefficient + Or a confidence interval around the coefficient --- # Hold on... it's a t-test ```r test_study3 %>% t.test(score ~ study, var.equal = T, .) 
``` ``` ## ## Two Sample t-test ## ## data: score by study ## t = -4.6895, df = 248, p-value = 4.525e-06 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -6.721058 -2.745251 ## sample estimates: ## mean in group alone mean in group others ## 22.79021 27.52336 ``` --- # Hold on... it's a t-test ```r summary(performance_study) ``` ``` ## ## Call: ## lm(formula = score ~ study, data = test_study3) ## ## Residuals: ## Min 1Q Median 3Q Max ## -22.7902 -4.7902 0.2098 5.4766 18.2098 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 22.7902 0.6603 34.52 < 2e-16 *** ## studyothers 4.7332 1.0093 4.69 4.53e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7.896 on 248 degrees of freedom ## Multiple R-squared: 0.08145, Adjusted R-squared: 0.07775 ## F-statistic: 21.99 on 1 and 248 DF, p-value: 4.525e-06 ``` ??? Yup! --- class: center, middle # Time for a break! --- # Including categorical predictors with >2 levels in a regression + When analysing categorical data with 2+ levels, we want each of our `\(\beta\)` coefficients to represent a specific difference between means. + E.g., as with a single binary variable, `\(\beta_1\)` is the difference between the two groups. + To achieve this with 2+ levels, we need to apply a **coding scheme**. + Two common coding schemes are: + Dummy coding (which we will discuss now) + Effects coding (or sum to zero, which we will discuss at the start of next semester) + There are LOTS of ways to do this; if curious, see the UCLA guide [here](https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/) --- # Dummy coding + Dummy coding uses 0's and 1's to represent group membership + One level is chosen as a baseline + All other levels are compared against that baseline + Notice, this is identical to the binary variable case already discussed. 
+ Dummy coding is simply the process of producing a set of binary coded variables + For any categorical variable, we will create `\(k\)`-1 dummy variables + `\(k\)` = number of levels --- # Steps in dummy coding 1. Choose a baseline level 2. Assign everyone in the baseline group `0` for all `\(k\)`-1 dummy variables 3. Assign everyone in the next group a `1` for the first dummy variable and a `0` for all the other dummy variables 4. Assign everyone in the next again group a `1` for the second dummy variable and a `0` for all the other dummy variables 5. Repeat step 4 until all `\(k\)`-1 dummy variables have had 0's and 1's assigned 6. Enter the `\(k\)`-1 dummy variables into your regression --- # Choosing a baseline? + Each level of your categorical predictor will be compared against the baseline. + Good baseline levels could be: + The control group in an experiment + The group expected to have the lowest score on the outcome + The largest group + It is best if the baseline is not: + A poorly defined level, e.g. an `Other` group + Much smaller than the other groups --- # Dummy coding .pull-left[ + We start out with a dataset that looks like: ``` ## ID score method ## 1 ID101 18 self-test ## 2 ID102 36 summarise ## 3 ID103 15 summarise ## 4 ID104 29 summarise ## 5 ID105 18 read ## 6 ID106 29 read ## 7 ID107 18 summarise ## 8 ID108 0 read ## 9 ID109 17 read ## 10 ID110 41 read ``` ] .pull-right[ + And end up with one that looks like: ``` ## ID score method method1 method2 ## 1 ID101 18 self-test 1 0 ## 2 ID102 36 summarise 0 1 ## 3 ID103 15 summarise 0 1 ## 4 ID104 29 summarise 0 1 ## 5 ID105 18 read 0 0 ## 6 ID106 29 read 0 0 ## 7 ID107 18 summarise 0 1 ## 8 ID108 0 read 0 0 ## 9 ID109 17 read 0 0 ## 10 ID110 41 read 0 0 ``` ] --- # Dummy coding with `lm` + `lm` automatically applies dummy coding when you include a variable of class `factor` in a model. 
+ It selects the first group as the baseline group + It represents this as a contrast matrix which looks like: ```r contrasts(test_study3$method) ``` ``` ## self-test summarise ## read 0 0 ## self-test 1 0 ## summarise 0 1 ``` --- # Dummy coding with `lm` + So in order to run the `lm` we simply write: ```r mod1 <- lm(score ~ method, data = test_study3) ``` + And `lm` does all the dummy coding work for us --- # Dummy coding with `lm` .pull-left[ + The intercept is the mean of the baseline group (notes re-reading) + The coefficient for `methodself-test` is the mean difference between the self-testing group and the baseline group (re-reading) + The coefficient for `methodsummarise` is the mean difference between the note summarising group and the baseline group (re-reading) ] .pull-right[ ```r mod1 ``` ``` ## ## Call: ## lm(formula = score ~ method, data = test_study3) ## ## Coefficients: ## (Intercept) methodself-test methodsummarise ## 23.4138 4.1620 0.7821 ``` ``` ## # A tibble: 3 x 2 ## method Group_Mean ## <fct> <dbl> ## 1 read 23.4 ## 2 self-test 27.6 ## 3 summarise 24.2 ``` ] --- # Dummy coding with `lm` (full results) ```r summary(mod1) ``` ``` ## ## Call: ## lm(formula = score ~ method, data = test_study3) ## ## Residuals: ## Min 1Q Median 3Q Max ## -23.4138 -5.3593 -0.1959 5.7496 17.8041 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 23.4138 0.8662 27.031 <2e-16 *** ## methodself-test 4.1620 1.3188 3.156 0.0018 ** ## methodsummarise 0.7821 1.1930 0.656 0.5127 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 8.079 on 247 degrees of freedom ## Multiple R-squared: 0.04224, Adjusted R-squared: 0.03448 ## F-statistic: 5.447 on 2 and 247 DF, p-value: 0.004845 ``` --- # Changing the baseline group + The level that `lm` chooses as its baseline may not always be the best choice + You can change it using: ```r contrasts(test_study3$method) <- contr.treatment(3, base = 2) ``` + `contrasts` updates the variable with the new coding scheme + `contr.treatment` specifies that you want dummy coding + `3` is the number of levels of your predictor + `base = 2` is the level number of your new baseline --- # Results using the new baseline .pull-left[ + The intercept is now the mean of the second group (Self-Testing) + `method1` is now the difference between Notes re-reading and Self-Testing + `method3` is now the difference between Notes summarising and Self-Testing ] .pull-right[ ```r contrasts(test_study3$method) <- contr.treatment(3, base = 2) mod2 <- lm(score ~ method, data = test_study3) mod2 ``` ``` ## ## Call: ## lm(formula = score ~ method, data = test_study3) ## ## Coefficients: ## (Intercept) method1 method3 ## 27.576 -4.162 -3.380 ``` ``` ## # A tibble: 3 x 2 ## method Group_Mean ## <fct> <dbl> ## 1 read 23.4 ## 2 self-test 27.6 ## 3 summarise 24.2 ``` ] --- # New baseline (full results) ```r summary(mod2) ``` ``` ## ## Call: ## lm(formula = score ~ method, data = test_study3) ## ## Residuals: ## Min 1Q Median 3Q Max ## -23.4138 -5.3593 -0.1959 5.7496 17.8041 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 27.5758 0.9945 27.729 < 2e-16 *** ## method1 -4.1620 1.3188 -3.156 0.00180 ** ## method3 -3.3799 1.2892 -2.622 0.00929 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 8.079 on 247 degrees of freedom ## Multiple R-squared: 0.04224, Adjusted R-squared: 0.03448 ## F-statistic: 5.447 on 2 and 247 DF, p-value: 0.004845 ``` ??? 
+ Note that the choice of baseline does not affect the R^2 or F-ratio --- # Summary of today + We can include categorical predictors in a linear model + When we have categorical predictors, we are modelling the means of groups. + A categorical variable with 2 levels can be represented as a binary (0,1) variable + the associated `\(\beta\)` is the difference between the groups + When the categorical predictor has more than two levels, we can represent it with dummy coded variables. + We have `\(k\)`-1 dummy variables, where `\(k\)` = number of levels + Each dummy represents the difference between the mean of the group scored 1 and the reference group + In R, we need to make sure... + R recognises the variable as a factor (`factor()`), and + we have the correct reference level (`contr.treatment()`) --- class: center, middle # Thanks for listening!
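--- # Appendix: dummy coding by hand

+ `lm` builds the `\(k\)`-1 dummies for us, but constructing them by hand gives an identical fit.
+ A self-contained sketch with simulated data (toy values and names, not the course's `test_study3`):

```r
# Sketch: manual dummy coding matches lm's automatic coding (toy data)
set.seed(1)
dat <- data.frame(
  score  = rnorm(90, mean = 25, sd = 8),
  method = factor(rep(c("read", "self-test", "summarise"), each = 30))
)

# k = 3 levels, so k - 1 = 2 dummies; "read" (the first level) is the baseline
dat$method1 <- as.numeric(dat$method == "self-test")
dat$method2 <- as.numeric(dat$method == "summarise")

mod_auto   <- lm(score ~ method, data = dat)             # automatic dummy coding
mod_manual <- lm(score ~ method1 + method2, data = dat)  # hand-built dummies

# Identical coefficients up to naming: returns TRUE
all.equal(unname(coef(mod_auto)), unname(coef(mod_manual)))
```

+ The two models have the same design matrix, so every estimate, standard error, and fit statistic agrees.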