class: center, middle, inverse, title-slide #
Categorical Predictors
## Data Analysis for Psychology in R 2
### dapR2 Team ### Department of Psychology
The University of Edinburgh --- # Week's Learning Objectives 1. Understand the meaning of model coefficients in the case of a binary predictor. 2. Understand how to apply dummy coding to include categorical predictors with 2+ levels. 3. Be able to include categorical predictors in an `lm` model in R --- # Recap: Categorical variables + Categorical variables can only take discrete values + E.g., animal type: 1 = duck, 2 = cow, 3 = wasp + They are mutually exclusive + No duck-wasps or cow-ducks! + In R, categorical variables should be of class `factor` + The discrete values are `levels` + Levels can have numeric values (1, 2, 3) and labels (e.g. "duck", "cow", "wasp") + All that these numbers represent is category membership. --- # Recap: Binary variable + A binary variable is a categorical variable with two levels. + Traditionally coded with a 0 and 1 + In `lm`, binary variables are often referred to as dummy variables + When we use multiple dummies, we talk about the general procedure of dummy coding + Why 0 and 1? -- + Quick version: it has some nice properties when it comes to interpretation. -- + What is the interpretation of the intercept? -- + What about the slope? --- # Extending our example .pull-left[ + Our in-class example so far has used test scores, revision time, and motivation. + Suppose we also collected data on who students studied with (`study`); + 0 = alone; 1 = with others + And also which of three different revision methods they used for the test (`method`) + 1 = Notes re-reading; 2 = Notes summarising; 3 = Self-testing ([see here](https://www.psychologicalscience.org/publications/journals/pspi/learning-techniques.html)) + We collect a new sample of 250 students. ] .pull-right[ ```r head(test_study3) %>% kable(.) 
%>% kable_styling(full_width = F) ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:right;"> score </th> <th style="text-align:right;"> hours </th> <th style="text-align:right;"> motivation </th> <th style="text-align:left;"> study </th> <th style="text-align:left;"> method </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> ID101 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> -0.42 </td> <td style="text-align:left;"> others </td> <td style="text-align:left;"> self-test </td> </tr> <tr> <td style="text-align:left;"> ID102 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 0.15 </td> <td style="text-align:left;"> alone </td> <td style="text-align:left;"> summarise </td> </tr> <tr> <td style="text-align:left;"> ID103 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0.35 </td> <td style="text-align:left;"> alone </td> <td style="text-align:left;"> summarise </td> </tr> <tr> <td style="text-align:left;"> ID104 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 0.16 </td> <td style="text-align:left;"> others </td> <td style="text-align:left;"> summarise </td> </tr> <tr> <td style="text-align:left;"> ID105 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> -1.12 </td> <td style="text-align:left;"> alone </td> <td style="text-align:left;"> read </td> </tr> <tr> <td style="text-align:left;"> ID106 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> -0.64 </td> <td style="text-align:left;"> others </td> <td style="text-align:left;"> read </td> </tr> 
</tbody> </table> ] --- # LM with binary predictors + Now let's ask the question: + **Do students who study with others score better than students who study alone?** + Our equation does not change: `$$score_i = \beta_0 + \beta_1 study_{i} + \epsilon_i$$` + And in R: ```r performance_study <- lm(score ~ study, data = test_study3) ``` --- # Model results ```r summary(performance_study) ``` ``` ## ## Call: ## lm(formula = score ~ study, data = test_study3) ## ## Residuals: ## Min 1Q Median 3Q Max ## -22.7902 -4.7902 0.2098 5.4766 18.2098 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 22.7902 0.6603 34.52 < 2e-16 *** ## studyothers 4.7332 1.0093 4.69 4.53e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7.896 on 248 degrees of freedom ## Multiple R-squared: 0.08145, Adjusted R-squared: 0.07775 ## F-statistic: 21.99 on 1 and 248 DF, p-value: 4.525e-06 ``` --- # Interpretation .pull-left[ + As before, the intercept `\(\hat \beta_0\)` is the expected value of `\(y\)` when `\(x=0\)` + What is `\(x=0\)` here? + It is the students who study alone. + So what about `\(\hat \beta_1\)`? + `\(\hat \beta_1\)` = ? + **Look at the output on the right-hand side.** + What do you notice about the difference in averages? ] .pull-right[ ```r test_study3 %>% * group_by(., study) %>% summarise( * Average = round(mean(score),4) ) ``` ``` ## # A tibble: 2 x 2 ## study Average ## <fct> <dbl> ## 1 alone 22.8 ## 2 others 27.5 ``` ] --- # Interpretation + `\(\hat \beta_0\)` = the expected value of `\(y\)` when `\(x = 0\)` + Or, the mean of the group coded 0 (those who study alone) + `\(\hat \beta_1\)` = the predicted difference between the means of the two groups. + Group 1 - Group 0 (mean `score` for those who study with others - mean `score` of those who study alone) + Notice how this maps to our question: + Do students who study with others score better than students who study alone? 
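--- # Checking the interpretation in R

+ We can sanity-check this mapping directly: the intercept should equal the mean of the baseline group, and intercept + slope the mean of the other group.
+ A minimal sketch using simulated data (toy numbers and names, not the course's `test_study3` dataset):

```r
# Minimal sketch with simulated data (toy values, not the course dataset)
set.seed(1)
dat <- data.frame(
  score = rnorm(100, mean = 25, sd = 8),
  study = factor(rep(c("alone", "others"), each = 50))
)
m <- lm(score ~ study, data = dat)

# Intercept = mean of the baseline ("alone") group
all.equal(unname(coef(m)["(Intercept)"]),
          mean(dat$score[dat$study == "alone"]))

# Intercept + slope = mean of the "others" group
all.equal(unname(sum(coef(m))),
          mean(dat$score[dat$study == "others"]))
```

+ Both comparisons return `TRUE`: with a single dummy predictor, the fitted values are exactly the two group means.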
--- class: center, middle # Quick pause for questions --- # Equations for each group + What would our linear model look like if we plugged in the values for `\(x\)`? `$$\widehat{score} = \hat \beta_0 + \hat \beta_1 study$$` + For those who study alone ( `\(study = 0\)` ): `$$\widehat{score}_{alone} = \hat \beta_0 + \hat \beta_1 \times 0$$` + So: `$$\widehat{score}_{alone} = \hat \beta_0$$` --- # Equations for each group + For those who study with others ( `\(study = 1\)` ): `$$\widehat{score}_{others} = \hat \beta_0 + \hat \beta_1 \times 1$$` + So: `$$\widehat{score}_{others} = \hat \beta_0 + \hat \beta_1$$` + And if we rearrange: `$$\hat \beta_1 = \widehat{score}_{others} - \hat \beta_0$$` + Remembering that `\(\widehat{score}_{alone} = \hat \beta_0\)`, we finally obtain: `$$\hat \beta_1 = \widehat{score}_{others} - \widehat{score}_{alone}$$` --- # Visualize the model <img src="dapr2_05_LMcategorical1_files/figure-html/unnamed-chunk-7-1.png" width="504" /> --- # Visualize the model <img src="dapr2_05_LMcategorical1_files/figure-html/unnamed-chunk-8-1.png" width="504" /> --- # Visualize the model <img src="dapr2_05_LMcategorical1_files/figure-html/unnamed-chunk-9-1.png" width="504" /> --- # Visualize the model <img src="dapr2_05_LMcategorical1_files/figure-html/unnamed-chunk-10-1.png" width="504" /> --- # Evaluation of model and significance of `\(\beta_1\)` + `\(R^2\)` and `\(F\)`-ratio interpretation are identical to their interpretation in models with only continuous predictors. + And we assess the significance of predictors in the same way: + We calculate `\(\hat \beta_1\)` = the difference between groups + We use the standard error of the coefficient to construct: + A `\(t\)`-value and associated `\(p\)`-value for the coefficient + Or a confidence interval around the coefficient --- # Hold on... it's a t-test ```r test_study3 %>% t.test(score ~ study, var.equal = T, .) 
``` ``` ## ## Two Sample t-test ## ## data: score by study ## t = -4.6895, df = 248, p-value = 4.525e-06 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -6.721058 -2.745251 ## sample estimates: ## mean in group alone mean in group others ## 22.79021 27.52336 ``` --- # Hold on... it's a t-test ```r summary(performance_study) ``` ``` ## ## Call: ## lm(formula = score ~ study, data = test_study3) ## ## Residuals: ## Min 1Q Median 3Q Max ## -22.7902 -4.7902 0.2098 5.4766 18.2098 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 22.7902 0.6603 34.52 < 2e-16 *** ## studyothers 4.7332 1.0093 4.69 4.53e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7.896 on 248 degrees of freedom ## Multiple R-squared: 0.08145, Adjusted R-squared: 0.07775 ## F-statistic: 21.99 on 1 and 248 DF, p-value: 4.525e-06 ``` ??? Yup! --- class: center, middle # Time for a break! --- # Including categorical predictors with >2 levels in a regression + When analysing categorical data with 2+ levels, we want each of our `\(\beta\)` coefficients to represent a specific difference between means. + E.g., as with a single binary variable, `\(\beta_1\)` is the difference between the two groups. + To achieve this with 2+ levels, we need to apply a **coding scheme**. + Two common coding schemes are: + Dummy coding (which we will discuss now) + Effects coding (or sum to zero, which we will discuss at the start of next semester) + There are LOTS of ways to do this; if curious, see the UCLA guide [here](https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/) --- # Dummy coding + Dummy coding uses 0's and 1's to represent group membership + One level is chosen as a baseline + All other levels are compared against that baseline + Notice, this is identical to the binary variable case already discussed. 
+ Dummy coding is simply the process of producing a set of binary coded variables + For any categorical variable, we will create `\(k\)`-1 dummy variables + `\(k\)` = number of levels --- # Steps in dummy coding 1. Choose a baseline level 2. Assign everyone in the baseline group `0` for all `\(k\)`-1 dummy variables 3. Assign everyone in the next group a `1` for the first dummy variable and a `0` for all the other dummy variables 4. Assign everyone in the next again group a `1` for the second dummy variable and a `0` for all the other dummy variables 5. Repeat step 4 until all `\(k\)`-1 dummy variables have had 0's and 1's assigned 6. Enter the `\(k\)`-1 dummy variables into your regression --- # Choosing a baseline? + Each level of your categorical predictor will be compared against the baseline. + Good baseline levels could be: + The control group in an experiment + The group expected to have the lowest score on the outcome + The largest group + It is best if the baseline is not: + A poorly defined level, e.g. an `Other` group + Much smaller than the other groups --- # Dummy coding .pull-left[ + We start out with a dataset that looks like: ``` ## ID score method ## 1 ID101 18 self-test ## 2 ID102 36 summarise ## 3 ID103 15 summarise ## 4 ID104 29 summarise ## 5 ID105 18 read ## 6 ID106 29 read ## 7 ID107 18 summarise ## 8 ID108 0 read ## 9 ID109 17 read ## 10 ID110 41 read ``` ] .pull-right[ + And end up with one that looks like: ``` ## ID score method method1 method2 ## 1 ID101 18 self-test 1 0 ## 2 ID102 36 summarise 0 1 ## 3 ID103 15 summarise 0 1 ## 4 ID104 29 summarise 0 1 ## 5 ID105 18 read 0 0 ## 6 ID106 29 read 0 0 ## 7 ID107 18 summarise 0 1 ## 8 ID108 0 read 0 0 ## 9 ID109 17 read 0 0 ## 10 ID110 41 read 0 0 ``` ] --- # Dummy coding with `lm` + `lm` automatically applies dummy coding when you include a variable of class `factor` in a model. 
+ It selects the first group as the baseline group + It represents this as a contrast matrix which looks like: ```r contrasts(test_study3$method) ``` ``` ## self-test summarise ## read 0 0 ## self-test 1 0 ## summarise 0 1 ``` --- # Dummy coding with `lm` + So in order to run the `lm` we simply write: ```r mod1 <- lm(score ~ method, data = test_study3) ``` + And `lm` does all the dummy coding work for us --- # Dummy coding with `lm` .pull-left[ + The intercept is the mean of the baseline group (notes re-reading) + The coefficient for `methodself-test` is the mean difference between the self-testing group and the baseline group (re-reading) + The coefficient for `methodsummarise` is the mean difference between the note summarising group and the baseline group (re-reading) ] .pull-right[ ```r mod1 ``` ``` ## ## Call: ## lm(formula = score ~ method, data = test_study3) ## ## Coefficients: ## (Intercept) methodself-test methodsummarise ## 23.4138 4.1620 0.7821 ``` ``` ## # A tibble: 3 x 2 ## method Group_Mean ## <fct> <dbl> ## 1 read 23.4 ## 2 self-test 27.6 ## 3 summarise 24.2 ``` ] --- # Dummy coding with `lm` (full results) ```r summary(mod1) ``` ``` ## ## Call: ## lm(formula = score ~ method, data = test_study3) ## ## Residuals: ## Min 1Q Median 3Q Max ## -23.4138 -5.3593 -0.1959 5.7496 17.8041 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 23.4138 0.8662 27.031 <2e-16 *** ## methodself-test 4.1620 1.3188 3.156 0.0018 ** ## methodsummarise 0.7821 1.1930 0.656 0.5127 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 8.079 on 247 degrees of freedom ## Multiple R-squared: 0.04224, Adjusted R-squared: 0.03448 ## F-statistic: 5.447 on 2 and 247 DF, p-value: 0.004845 ``` --- # Changing the baseline group + The level that `lm` chooses as its baseline may not always be the best choice + You can change it using: ```r contrasts(test_study3$method) <- contr.treatment(3, base = 2) ``` + `contrasts` updates the variable with the new coding scheme + `contr.treatment` specifies that you want dummy coding + `3` is the number of levels of your predictor + `base = 2` is the level number of your new baseline --- # Results using the new baseline .pull-left[ + The intercept is now the mean of the second group (Self-Testing) + `method1` is now the difference between Notes re-reading and Self-Testing + `method3` is now the difference between Notes summarising and Self-Testing ] .pull-right[ ```r contrasts(test_study3$method) <- contr.treatment(3, base = 2) mod2 <- lm(score ~ method, data = test_study3) mod2 ``` ``` ## ## Call: ## lm(formula = score ~ method, data = test_study3) ## ## Coefficients: ## (Intercept) method1 method3 ## 27.576 -4.162 -3.380 ``` ``` ## # A tibble: 3 x 2 ## method Group_Mean ## <fct> <dbl> ## 1 read 23.4 ## 2 self-test 27.6 ## 3 summarise 24.2 ``` ] --- # New baseline (full results) ```r summary(mod2) ``` ``` ## ## Call: ## lm(formula = score ~ method, data = test_study3) ## ## Residuals: ## Min 1Q Median 3Q Max ## -23.4138 -5.3593 -0.1959 5.7496 17.8041 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 27.5758 0.9945 27.729 < 2e-16 *** ## method1 -4.1620 1.3188 -3.156 0.00180 ** ## method3 -3.3799 1.2892 -2.622 0.00929 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 8.079 on 247 degrees of freedom ## Multiple R-squared: 0.04224, Adjusted R-squared: 0.03448 ## F-statistic: 5.447 on 2 and 247 DF, p-value: 0.004845 ``` ??? 
+ Note that the choice of baseline does not affect the R^2 or F-ratio --- # Summary of today + We can include categorical predictors in a linear model + When we have categorical predictors, we are modelling the means of groups. + A categorical variable with 2 levels can be represented as a binary (0,1) variable + the associated `\(\beta\)` is the difference between the groups + When the categorical predictor has more than two levels, we can represent it with dummy coded variables. + We have `\(k\)`-1 dummy variables, where `\(k\)` = number of levels + Each dummy represents the difference between the mean of the group scored 1 and the reference group + In R, we need to make sure... + R recognises the variable as a factor (`factor()`), and + we have the correct reference level (`contr.treatment()`) --- class: center, middle # Thanks for listening!
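--- # Appendix: dummy coding by hand

+ `lm` builds the `\(k\)`-1 dummies for us, but constructing them by hand gives an identical fit.
+ A self-contained sketch with simulated data (toy values and names, not the course's `test_study3`):

```r
# Sketch: manual dummy coding matches lm's automatic coding (toy data)
set.seed(1)
dat <- data.frame(
  score  = rnorm(90, mean = 25, sd = 8),
  method = factor(rep(c("read", "self-test", "summarise"), each = 30))
)

# k = 3 levels, so k - 1 = 2 dummies; "read" (the first level) is the baseline
dat$method1 <- as.numeric(dat$method == "self-test")
dat$method2 <- as.numeric(dat$method == "summarise")

mod_auto   <- lm(score ~ method, data = dat)             # automatic dummy coding
mod_manual <- lm(score ~ method1 + method2, data = dat)  # hand-built dummies

# Identical coefficients up to naming: returns TRUE
all.equal(unname(coef(mod_auto)), unname(coef(mod_manual)))
```

+ The two models have the same design matrix, so every estimate, standard error, and fit statistic agrees.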