class: center, middle, inverse, title-slide .title[ #
Why Do We Need a Reference Group?
] .subtitle[ ## Data Analysis for Psychology in R 2
] .author[ ### dapR2 Team ] .institute[ ### Department of Psychology
The University of Edinburgh ]

---
# Warning!

+ From a practical perspective, what we have covered in the lecture is what you need to know.

+ However, to understand how some of the other coding schemes work, there are a couple of technical points you may find useful to consider.

---
# Why do we need a reference group?

.pull-left[
+ Consider our example and dummy coding.

+ Our `method` variable has three groups/levels (`read`, `self-test` & `summarise`)

+ We want a model that represents our data (observations), but all we "know" is what group an observation belongs to. So:

`$$y_{ij} = \mu_j + \epsilon_{ij}$$`

+ Where
  + `\(y_{ij}\)` are the individual observations
  + `\(\mu_j\)` is the mean of group `\(j\)`, and
  + `\(\epsilon_{ij}\)` is the individual deviation from that mean.
]

.pull-right[

```
##       ID score    method
## 1  ID101    18 self-test
## 2  ID102    36 summarise
## 3  ID103    15 summarise
## 4  ID104    29 summarise
## 5  ID105    18      read
## 6  ID106    29      read
## 7  ID107    18 summarise
## 8  ID108     0      read
## 9  ID109    17      read
## 10 ID110    41      read
```

]

???
+ And this hopefully makes sense.
+ Given we know someone's group, our best guess is that group's mean.
+ But people won't all score the mean, so there is some deviation for every person.

---
# Why do we need a reference group?

+ An alternative way to present this idea looks much more like our linear model:

`$$y_{ij} = \beta_0 + \underbrace{(\mu_{j} - \beta_0)}_{\beta_j} + \epsilon_{ij}$$`

+ Where
  + `\(y_{ij}\)` are the individual observations
  + `\(\beta_0\)` is an estimate of the reference/overall average
  + `\(\mu_j\)` is the mean of group `\(j\)`
  + `\(\beta_j\)` is the difference between the reference and the mean of group `\(j\)`, and
  + `\(\epsilon_{ij}\)` is the individual deviation from that mean.

---
# Why do we need a reference group?
+ We can write this equation more generally as:

$$\mu_j = \beta_0 + \beta_j$$

+ or for the specific groups (in our case 3):

`$$\mu_{read} = \beta_0 + \beta_{1read}$$`

`$$\mu_{self-test} = \beta_0 + \beta_{2self-test}$$`

`$$\mu_{summarise} = \beta_0 + \beta_{3summarise}$$`

+ **The problem**: we have four parameters ( `\(\beta_0\)`, `\(\beta_{1read}\)`, `\(\beta_{2self-test}\)`, `\(\beta_{3summarise}\)` ) to model three group means ( `\(\mu_{read}\)`, `\(\mu_{self-test}\)`, `\(\mu_{summarise}\)` )

+ We are trying to estimate too much with too little: the model is not identified.

+ We need to estimate at least one fewer parameter.

---
# Constraints fix identification

+ Let's think again about dummy coding.

+ Suppose we make `read` the reference, i.e. we constrain `\(\beta_{1read} = 0\)`. Then:

`$$\mu_{read} = \beta_0$$`

`$$\mu_{self-test} = \beta_0 + \beta_{self-test}$$`

`$$\mu_{summarise} = \beta_0 + \beta_{summarise}$$`

+ **Fixed!**

+ We now only have three parameters ( `\(\beta_0\)`, `\(\beta_{self-test}\)`, `\(\beta_{summarise}\)` ) for the three group means ( `\(\mu_{read}\)`, `\(\mu_{self-test}\)`, `\(\mu_{summarise}\)` ).

> So when we code categorical variables, we need a constraint so that we can estimate our models.

---
# One last look at our model

```r
summary(mod1)
```

```
## 
## Call:
## lm(formula = score ~ method, data = test_study3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.4138  -5.3593  -0.1959   5.7496  17.8041 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      23.4138     0.8662  27.031   <2e-16 ***
## methodself-test   4.1620     1.3188   3.156   0.0018 ** 
## methodsummarise   0.7821     1.1930   0.656   0.5127    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.079 on 247 degrees of freedom
## Multiple R-squared:  0.04224, Adjusted R-squared:  0.03448 
## F-statistic: 5.447 on 2 and 247 DF,  p-value: 0.004845
```
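+ You can check the constraint for yourself. Below is a minimal sketch with simulated data (the group means, sample sizes, and the names `sim` and `mod_sim` are invented for illustration; this is not the dapR2 dataset): with dummy coding, `\(\beta_0\)` recovers the reference-group mean exactly, and `\(\beta_0 + \beta_j\)` recovers each other group's mean.

```r
# Simulated sketch (made-up numbers, not the dapR2 data): three groups,
# dummy (treatment) coding as used by lm() for factors by default
set.seed(1)
sim <- data.frame(
  method = factor(rep(c("read", "self-test", "summarise"), each = 50)),
  score  = rnorm(150, mean = rep(c(23, 27, 24), each = 50), sd = 8)
)

# "read" is the reference level (first alphabetically), so its parameter
# is constrained to 0 and the intercept absorbs the read-group mean
mod_sim <- lm(score ~ method, data = sim)

b <- coef(mod_sim)
m <- tapply(sim$score, sim$method, mean)

# mu_read = beta_0; mu_self-test = beta_0 + beta_self-test; etc.
all.equal(unname(b["(Intercept)"]),                        unname(m["read"]))
all.equal(unname(b["(Intercept)"] + b["methodself-test"]), unname(m["self-test"]))
all.equal(unname(b["(Intercept)"] + b["methodsummarise"]), unname(m["summarise"]))
```

+ Running `contrasts(sim$method)` shows the dummy-coding matrix behind this: a row of zeros for the reference group, and a single 1 for each of the others.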