class: center, middle, inverse, title-slide .title[ #
Why Do We Need a Reference Group?
] .subtitle[ ## Data Analysis for Psychology in R 2
] .author[ ### dapR2 Team ] .institute[ ### Department of Psychology
The University of Edinburgh ]

---
# Warning!

+ From a practical perspective, what we have covered in the lecture is what you need to know.

+ However, to understand how some of the other coding schemes work, there are a couple of technical points you may find useful to consider.

---
# Why do we need a reference group?

.pull-left[
+ Consider our example and dummy coding.

+ Our `method` variable has three groups/levels (`read`, `self-test` & `summarise`)

+ We want a model that represents our data (observations), but all we "know" is what group an observation belongs to. So:

`$$y_{ij} = \mu_j + \epsilon_{ij}$$`

+ Where
  + `\(y_{ij}\)` are the individual observations
  + `\(\mu_j\)` is the mean of group `\(j\)`, and
  + `\(\epsilon_{ij}\)` is the individual deviation from that mean.
]

.pull-right[

```
##       ID score    method
## 1  ID101    18 self-test
## 2  ID102    36 summarise
## 3  ID103    15 summarise
## 4  ID104    29 summarise
## 5  ID105    18      read
## 6  ID106    29      read
## 7  ID107    18 summarise
## 8  ID108     0      read
## 9  ID109    17      read
## 10 ID110    41      read
```

]

???
+ And this hopefully makes sense.
+ Given we know someone's group, our best guess is that group's mean.
+ But people won't all score the mean, so there is some deviation for every person.

---
# Why do we need a reference group?

+ An alternative way to present this idea looks much more like our linear model:

`$$y_{ij} = \beta_0 + \underbrace{(\mu_{j} - \beta_0)}_{\beta_j} + \epsilon_{ij}$$`

+ Where
  + `\(y_{ij}\)` are the individual observations
  + `\(\beta_0\)` is an estimate of the reference/overall average
  + `\(\mu_j\)` is the mean of group `\(j\)`
  + `\(\beta_j\)` is the difference between the reference and the mean of group `\(j\)`, and
  + `\(\epsilon_{ij}\)` is the individual deviation from that mean.

---
# Why do we need a reference group?
+ We can write this equation more generally as:

$$\mu_j = \beta_0 + \beta_j$$

+ or for the specific groups (in our case 3):

`$$\mu_{read} = \beta_0 + \beta_{1read}$$`

`$$\mu_{self-test} = \beta_0 + \beta_{2self-test}$$`

`$$\mu_{summarise} = \beta_0 + \beta_{3summarise}$$`

+ **The problem**: we have four parameters ( `\(\beta_0\)`, `\(\beta_{1read}\)`, `\(\beta_{2self-test}\)`, `\(\beta_{3summarise}\)` ) to model three group means ( `\(\mu_{read}\)`, `\(\mu_{self-test}\)`, `\(\mu_{summarise}\)` )

+ We are trying to estimate too much with too little: the model is not identified.

+ We need to estimate at least one fewer parameter.

---
# Constraints fix identification

+ Let's think again about dummy coding.

+ Suppose we make `read` the reference, i.e. we constrain `\(\beta_{1read} = 0\)`. Then:

`$$\mu_{read} = \beta_0$$`

`$$\mu_{self-test} = \beta_0 + \beta_{self-test}$$`

`$$\mu_{summarise} = \beta_0 + \beta_{summarise}$$`

+ **Fixed!**

+ We now only have three parameters ( `\(\beta_0\)`, `\(\beta_{self-test}\)`, `\(\beta_{summarise}\)` ) for the three group means ( `\(\mu_{read}\)`, `\(\mu_{self-test}\)`, `\(\mu_{summarise}\)` ).

> So when we code categorical variables, we need a constraint so that we can estimate our models.

---
# One last look at our model

```r
summary(mod1)
```

```
## 
## Call:
## lm(formula = score ~ method, data = test_study3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.4138  -5.3593  -0.1959   5.7496  17.8041 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      23.4138     0.8662  27.031   <2e-16 ***
## methodself-test   4.1620     1.3188   3.156   0.0018 ** 
## methodsummarise   0.7821     1.1930   0.656   0.5127    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.079 on 247 degrees of freedom
## Multiple R-squared:  0.04224, Adjusted R-squared:  0.03448 
## F-statistic: 5.447 on 2 and 247 DF,  p-value: 0.004845
```
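+ You can check the constraint for yourself. Below is a minimal sketch with simulated data (the group means, sample sizes, and the names `sim` and `mod_sim` are invented for illustration; this is not the dapR2 dataset): with dummy coding, `\(\beta_0\)` recovers the reference-group mean exactly, and `\(\beta_0 + \beta_j\)` recovers each other group's mean.

```r
# Simulated sketch (made-up numbers, not the dapR2 data): three groups,
# dummy (treatment) coding as used by lm() for factors by default
set.seed(1)
sim <- data.frame(
  method = factor(rep(c("read", "self-test", "summarise"), each = 50)),
  score  = rnorm(150, mean = rep(c(23, 27, 24), each = 50), sd = 8)
)

# "read" is the reference level (first alphabetically), so its parameter
# is constrained to 0 and the intercept absorbs the read-group mean
mod_sim <- lm(score ~ method, data = sim)

b <- coef(mod_sim)
m <- tapply(sim$score, sim$method, mean)

# mu_read = beta_0; mu_self-test = beta_0 + beta_self-test; etc.
all.equal(unname(b["(Intercept)"]),                        unname(m["read"]))
all.equal(unname(b["(Intercept)"] + b["methodself-test"]), unname(m["self-test"]))
all.equal(unname(b["(Intercept)"] + b["methodsummarise"]), unname(m["summarise"]))
```

+ Running `contrasts(sim$method)` shows the dummy-coding matrix behind this: a row of zeros for the reference group, and a single 1 for each of the others.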