LEARNING OBJECTIVES
Do you think the type of background music playing in a restaurant influences the amount of money that diners spend on their meal?
tidyverse
package.read_csv()
and name the data rest_spend
.
Produce a boxplot displaying the relationship between restaurant spending and music type, and comment on any visually observed differences among the sample means of the three background music types.
For the restaurant spending data, what is the number of groups (\(g\)) and the numbers of observations per group (\(n\))? What was the average spending and standard deviation of each group?
R by default orders the levels of a categorical variable in alphabetical order:
levels(rest_spend$type)
## [1] "Classical Music" "No Music" "Pop Music"
If we follow the same approach and order the groups by alphabetical ordering, we can represent the data as follows:
Population Name | Population ID | Sample observations | Population mean |
---|---|---|---|
Classical Music | 1 | \(y_{1,1}, y_{1,2}, ..., y_{1,n}\) | \(\mu_{1}\) |
No Music | 2 | \(y_{2,1}, y_{2,2}, ..., y_{2,n}\) | \(\mu_{2}\) |
Pop Music | 3 | \(y_{3,1}, y_{3,2}, ..., y_{3,n}\) | \(\mu_{3}\) |
Under the constraint \(\beta_1 + \beta_2 + \beta_3 = 0\), the model coefficients are interpreted as follows:
We can fit the linear regression model:
\[ y = b_0 + b_1 \ \text{EffectLevel1} + b_2 \ \text{EffectLevel2} + \epsilon \]
where
\[ \text{EffectLevel1} = \begin{cases} 1 & \text{if observation is from category 1} \\ 0 & \text{if observation is from category 2} \\ -1 & \text{otherwise} \end{cases} \\ \text{EffectLevel2} = \begin{cases} 0 & \text{if observation is from category 1} \\ 1 & \text{if observation is from category 2} \\ -1 & \text{otherwise} \end{cases} \]
Schematically,
\[ \begin{matrix} \textbf{Level} & \textbf{EffectLevel1} & \textbf{EffectLevel2} \\ \hline \text{Classical Music} & 1 & 0 \\ \text{No Music} & 0 & 1 \\ \text{Pop Music} & -1 & -1 \end{matrix} \]
In R this is simply done by saying:
contrasts(rest_spend$type) <- "contr.sum"
<- lm(amount ~ type, data = rest_spend)
mdl_sum coef(mdl_sum)
## (Intercept) type1 type2
## 22.7381560 1.4359846 -0.5967688
where type
is a factor. In such case R will automatically create the two variables EffectLevel1
and EffectLevel2
for you!
The summary(mdl)
will return 3 estimated coefficients:
summary
as it is found via the side-constraint as \(b_3 = -(b_1 + b_2) = -0.84\)!Under the constraint \(\beta_1 = 0\), meaning that the first factor level is the reference group,
We can fit the linear regression model:
\[ y = b_0 + b_1 \ \text{IsTypeNoMusic} + b_2 \ \text{IsTypePopMusic} +\epsilon \]
where
\[ \text{IsTypeNoMusic} = \begin{cases} 1 & \text{if observation is from the No Music category} \\ 0 & \text{otherwise} \end{cases} \\ \text{IsTypePopMusic} = \begin{cases} 1 & \text{if observation is from the Pop Music category} \\ 0 & \text{otherwise} \end{cases} \]
Schematically,
\[ \begin{matrix} \textbf{Level} & \textbf{IsTypeNoMusic} & \textbf{IsTypePopMusic} \\ \hline \text{Classical Music} & 0 & 0 \\ \text{No Music} & 1 & 0 \\ \text{Pop Music} & 0 & 1 \end{matrix} \]
In R this is simply done by saying:
contrasts(rest_spend$type) <- "contr.treatment"
<- lm(amount ~ type, data = rest_spend)
mdl_trt coef(mdl_trt)
## (Intercept) typeNo Music typePop Music
## 24.174141 -2.032753 -2.275200
where type
is a factor. In such case R will automatically create the two dummy variables IsTypeNoMusic
and IsTypePopMusic
for you!
The summary(mdl)
will return 3 estimated coefficients:
Possible side-constraints on the parameters are:
Name | Constraint | Meaning of \(\beta_0\) | R |
---|---|---|---|
Sum to zero (Effects Coding) | \(\beta_1 + \beta_2 + \beta_3 = 0\) | \(\beta_0 = \mu\) | contr.sum |
Reference group (Dummy Coding) | \(\beta_1 = 0\) | \(\beta_0 = \mu_1\) | contr.treatment |
IMPORTANT
By default R uses the reference group constraint
If your factor has \(g\) levels, your regression model will have \(g-1\) dummy variables (R creates them for you)
Since R uses the reference group constraint, in this section and the next one, we will not specify any side-constraints, meaning that \(\beta_1 = 0\).
Fit a linear model to the data, having amount
as response variable and the factor type
as predictor. Call the fitted model mdl_rg
(for reference group constraint)
Identify the relevant pieces of information from the commands anova(mdl_rg)
and summary(mdl_rg)
that can be used to conduct an ANOVA F-test against the null hypothesis that all population means are equal.
Interpret the F-test results in the context of the ANOVA null hypothesis.
Examine and investigate the meaning of the coefficients in the output of summary(mdl_rg)
. Next, obtain the estimated (or predicted) group means for the “Classical Music,” “No Music,” and “Pop Music” groups by using the predict(mdl_rg, newdata = ...)
function.
It actually makes more sense to have “No Music” as reference group.
Open the help page of the function fct_relevel()
, and look at the examples.
Within the rest_sped
data, reorder the levels of the factor type
so that your reference group is “No Music.”
NOTE
Because you reordered the levels, now
Refit the linear model, and inspect the model summary once more.
Do you notice any change in the estimated coefficients? Do you notice any change in the F-test for model utility?
Let’s now change the side-constraint from the R default (the reference group or contr.treatment
constraint) to the sum-to-zero constraint using contr.sum
.
Recall that under this constraint the interpretation of the coefficients becomes:
Set the sum to zero constraint for the factor type
of background music.
Fit again the linear model using amount
as response variable and type
as predictor. Call the fitted model using the sum to zero constraint mdl_stz
(for sum to zero).
Examine the output of the summary(mdl_stz)
function:
In R, we can switch back to the default reference group constraint by either of these
# Option 1
contrasts(rest_spend$type) <- NULL
# Option 2
contrasts(rest_spend$type) <- "contr.treatment"
We have seen that the F-test (and incremental F-test using anova()
) provide overall tests of whether there are differences in the group means. When we use dummy (reference) coding, this could be a difference between one group and the reference group. When we use effects (sum-to-zero) coding, this could be a difference between one group and the grand mean. In some cases the investigators will have pre-planned comparisons they would like to make, and it is this we will look at next. To do so, we will use a package called emmeans
.
Contrasts represent a wide range of hypotheses about the population means that can be tested using the fitted model.
However, they need to match the following general form. A contrast only allows us to test hypotheses that can be written as a linear combination of the population means with coefficients summing to zero: \[ H_0 : c_1 \mu_1 + c_2 \mu_2 + c_3 \mu_3 = 0 \qquad \text{with} \qquad c_1 + c_2 + c_3 = 0 \]
#install.packages("emmeans") remember that you only install packages once, and you should do so in the console, not your .Rmd script
library(emmeans)
We were interested in the following comparisons:
These are planned comparisons and can be translated to the following research hypotheses: \[ \begin{aligned} 1. \quad H_0 &: \mu_\text{No Music} = \frac{1}{2} (\mu_\text{Classical Music} + \mu_\text{Pop Music}) \\ 2. \quad H_0 &: \mu_\text{No Music} = \mu_\text{Pop Music} \end{aligned} \]
First of all check the levels of the factor type
, as the contrast coefficients need to match the order of the levels:
levels(rest_spend$type)
## [1] "No Music" "Classical Music" "Pop Music"
We specify the hypotheses by first giving each a name, and then specify the coefficients of the comparisons:
<- list("No Music - Some Music" = c(1, -1/2, -1/2),
comp "No Music - Pop Music" = c(1, 0, -1))
Use the emmeans()
function to obtain the estimated treatment means and uncertainties for your factor:
<- emmeans(mdl_stz, ~ type)
emm emm
## type emmean SE df lower.CL upper.CL
## No Music 22.1 0.259 357 21.6 22.7
## Classical Music 24.2 0.259 357 23.7 24.7
## Pop Music 21.9 0.259 357 21.4 22.4
##
## Confidence level used: 0.95
Next, we run the contrast analysis. To do so, we use the contrast()
function:
<- contrast(emm, method = comp)
comp_res comp_res
## contrast estimate SE df t.ratio p.value
## No Music - Some Music -0.895 0.318 357 -2.818 0.0051
## No Music - Pop Music 0.242 0.367 357 0.661 0.5090
Finally, we can ask for 95% confidence intervals:
confint(comp_res)
## contrast estimate SE df lower.CL upper.CL
## No Music - Some Music -0.895 0.318 357 -1.520 -0.270
## No Music - Pop Music 0.242 0.367 357 -0.479 0.964
##
## Confidence level used: 0.95
Interpret the result of the contrast analysis.
A final situation is where an investigator has no predetermined ideas at all where differences may lie, and they may wish to explore the data. We will discuss methods for such analyses in future weeks.