LEARNING OBJECTIVES

  1. Understand how to specify dummy and effects coding
  2. Understand how to specify contracts to test specific effects

Research question

Do you think the type of background music playing in a restaurant influences the amount of money that diners spend on their meal?

Data: RestaurantSpending.csv

Data Exploration: Descriptives

Question 1
  1. Load the tidyverse package.
  2. Read the restaurant spending data into R using the function read_csv() and name the data rest_spend.
  3. Check for the correct coding of all variables (i.e., categorical variables should be factors and numeric variables should be numeric).

Solution

Question 2

Produce a boxplot displaying the relationship between restaurant spending and music type, and comment on any visually observed differences among the sample means of the three background music types.

Solution

Question 3

For the restaurant spending data, what is the number of groups (\(g\)) and the numbers of observations per group (\(n\))? What was the average spending and standard deviation of each group?

Solution

Levels of Variable & Side Constraints

Levels of variable

R by default orders the levels of a categorical variable in alphabetical order:

levels(rest_spend$type)
## [1] "Classical Music" "No Music"        "Pop Music"

If we follow the same approach and order the groups by alphabetical ordering, we can represent the data as follows:

Population Name Population ID Sample observations Population mean
Classical Music 1 \(y_{1,1}, y_{1,2}, ..., y_{1,n}\) \(\mu_{1}\)
No Music 2 \(y_{2,1}, y_{2,2}, ..., y_{2,n}\) \(\mu_{2}\)
Pop Music 3 \(y_{3,1}, y_{3,2}, ..., y_{3,n}\) \(\mu_{3}\)

Sum to zero constraint - effects coding

Under the constraint \(\beta_1 + \beta_2 + \beta_3 = 0\), the model coefficients are interpreted as follows:

  • \(\beta_0 = \mu\) is interpreted as the overall or global mean — that is, the population mean when all the groups are combined;
  • \(\beta_i = \mu_i - \mu\) is interpreted as the effect of group \(i\) — that is, the difference between the population mean for group \(i\), \(\mu_i\), and the global population mean, \(\mu\).

We can fit the linear regression model:

\[ y = b_0 + b_1 \ \text{EffectLevel1} + b_2 \ \text{EffectLevel2} + \epsilon \]

where

\[ \text{EffectLevel1} = \begin{cases} 1 & \text{if observation is from category 1} \\ 0 & \text{if observation is from category 2} \\ -1 & \text{otherwise} \end{cases} \\ \text{EffectLevel2} = \begin{cases} 0 & \text{if observation is from category 1} \\ 1 & \text{if observation is from category 2} \\ -1 & \text{otherwise} \end{cases} \]

Schematically,

\[ \begin{matrix} \textbf{Level} & \textbf{EffectLevel1} & \textbf{EffectLevel2} \\ \hline \text{Classical Music} & 1 & 0 \\ \text{No Music} & 0 & 1 \\ \text{Pop Music} & -1 & -1 \end{matrix} \]

In R this is simply done by saying:

contrasts(rest_spend$type) <- "contr.sum"
mdl_sum <- lm(amount ~ type, data = rest_spend)
coef(mdl_sum)
## (Intercept)       type1       type2 
##  22.7381560   1.4359846  -0.5967688

where type is a factor. In such case R will automatically create the two variables EffectLevel1 and EffectLevel2 for you!

The summary(mdl) will return 3 estimated coefficients:

  • \(b_0 = 22.74\) is estimated overall mean
  • \(b_1 = 1.44\) is the effect of Classical Music
  • \(b_2 = -0.6\) is the effect of No Music
  • NOTE: The effect of Pop Music is not shown by summary as it is found via the side-constraint as \(b_3 = -(b_1 + b_2) = -0.84\)!

Reference group constraint - dummy coding

Under the constraint \(\beta_1 = 0\), meaning that the first factor level is the reference group,

  • \(\beta_0\) is interpreted as \(\mu_1\), the mean response for the reference group (group 1);
  • \(\beta_i\) is interpreted as the difference between the mean response for group \(i\) and the reference group.

We can fit the linear regression model:

\[ y = b_0 + b_1 \ \text{IsTypeNoMusic} + b_2 \ \text{IsTypePopMusic} +\epsilon \]

where

\[ \text{IsTypeNoMusic} = \begin{cases} 1 & \text{if observation is from the No Music category} \\ 0 & \text{otherwise} \end{cases} \\ \text{IsTypePopMusic} = \begin{cases} 1 & \text{if observation is from the Pop Music category} \\ 0 & \text{otherwise} \end{cases} \]

Schematically,

\[ \begin{matrix} \textbf{Level} & \textbf{IsTypeNoMusic} & \textbf{IsTypePopMusic} \\ \hline \text{Classical Music} & 0 & 0 \\ \text{No Music} & 1 & 0 \\ \text{Pop Music} & 0 & 1 \end{matrix} \]

In R this is simply done by saying:

contrasts(rest_spend$type) <- "contr.treatment"
mdl_trt <- lm(amount ~ type, data = rest_spend)
coef(mdl_trt)
##   (Intercept)  typeNo Music typePop Music 
##     24.174141     -2.032753     -2.275200

where type is a factor. In such case R will automatically create the two dummy variables IsTypeNoMusic and IsTypePopMusic for you!

The summary(mdl) will return 3 estimated coefficients:

  • \(b_0 = 24.17\) is the mean of the base level (level 1)
  • \(b_1 = -2.03\) is the effect of No Music
  • \(b_2 = -2.28\) is the effect of Pop Music
  • NOTE: The effect of Classical Music is not shown as it is equal to zero by the constraint!

Side Contraints

Possible side-constraints on the parameters are:

Name Constraint Meaning of \(\beta_0\) R
Sum to zero (Effects Coding) \(\beta_1 + \beta_2 + \beta_3 = 0\) \(\beta_0 = \mu\) contr.sum
Reference group (Dummy Coding) \(\beta_1 = 0\) \(\beta_0 = \mu_1\) contr.treatment

IMPORTANT

  • By default R uses the reference group constraint

  • If your factor has \(g\) levels, your regression model will have \(g-1\) dummy variables (R creates them for you)

ANOVA F-test

Since R uses the reference group constraint, in this section and the next one, we will not specify any side-constraints, meaning that \(\beta_1 = 0\).

Question 4

Fit a linear model to the data, having amount as response variable and the factor type as predictor. Call the fitted model mdl_rg (for reference group constraint)

Identify the relevant pieces of information from the commands anova(mdl_rg) and summary(mdl_rg) that can be used to conduct an ANOVA F-test against the null hypothesis that all population means are equal.

Interpret the F-test results in the context of the ANOVA null hypothesis.

Solution

Reference group (dummy) constraint

Question 5

Examine and investigate the meaning of the coefficients in the output of summary(mdl_rg). Next, obtain the estimated (or predicted) group means for the “Classical Music,” “No Music,” and “Pop Music” groups by using the predict(mdl_rg, newdata = ...) function.

Solution

Question 6

It actually makes more sense to have “No Music” as reference group.

Open the help page of the function fct_relevel(), and look at the examples.

Within the rest_sped data, reorder the levels of the factor type so that your reference group is “No Music.”

NOTE

Because you reordered the levels, now

  • \(\mu_1\) = mean of no music group
  • \(\mu_2\) = mean of classical music group
  • \(\mu_3\) = mean of pop music group

Solution

Question 7

Refit the linear model, and inspect the model summary once more.

Do you notice any change in the estimated coefficients? Do you notice any change in the F-test for model utility?

Solution

Sum-to-zero (effects) constraint

Let’s now change the side-constraint from the R default (the reference group or contr.treatment constraint) to the sum-to-zero constraint using contr.sum.

Recall that under this constraint the interpretation of the coefficients becomes:

  • \(\beta_0\) represents the global mean
  • \(\beta_i\) the effect due to group \(i\) — that is, the mean response in group \(i\) minus the global mean
Question 8

Set the sum to zero constraint for the factor type of background music.

Fit again the linear model using amount as response variable and type as predictor. Call the fitted model using the sum to zero constraint mdl_stz (for sum to zero).

Examine the output of the summary(mdl_stz) function:

  • Do the displayed coefficients change? What do they represent now?
  • Do the predicted group means change?
  • Is the model utility F-test still the same? Why do you think it’s the case?

Solution

In R, we can switch back to the default reference group constraint by either of these

# Option 1
contrasts(rest_spend$type) <- NULL
# Option 2
contrasts(rest_spend$type) <- "contr.treatment"

Contrasts

We have seen that the F-test (and incremental F-test using anova()) provide overall tests of whether there are differences in the group means. When we use dummy (reference) coding, this could be a difference between one group and the reference group. When we use effects (sum-to-zero) coding, this could be a difference between one group and the grand mean. In some cases the investigators will have pre-planned comparisons they would like to make, and it is this we will look at next. To do so, we will use a package called emmeans.

Contrasts represent a wide range of hypotheses about the population means that can be tested using the fitted model.

However, they need to match the following general form. A contrast only allows us to test hypotheses that can be written as a linear combination of the population means with coefficients summing to zero: \[ H_0 : c_1 \mu_1 + c_2 \mu_2 + c_3 \mu_3 = 0 \qquad \text{with} \qquad c_1 + c_2 + c_3 = 0 \]

#install.packages("emmeans") remember that you only install packages once, and you should do so in the console, not your .Rmd script
library(emmeans)
Question 9

We were interested in the following comparisons:

  • Whether having some kind of background music, rather than no music, makes a difference in mean restaurant spending
  • Whether there is any difference in mean restaurant spending between those with no background music and those with pop music

These are planned comparisons and can be translated to the following research hypotheses: \[ \begin{aligned} 1. \quad H_0 &: \mu_\text{No Music} = \frac{1}{2} (\mu_\text{Classical Music} + \mu_\text{Pop Music}) \\ 2. \quad H_0 &: \mu_\text{No Music} = \mu_\text{Pop Music} \end{aligned} \]

First of all check the levels of the factor type, as the contrast coefficients need to match the order of the levels:

levels(rest_spend$type)
## [1] "No Music"        "Classical Music" "Pop Music"

We specify the hypotheses by first giving each a name, and then specify the coefficients of the comparisons:

comp <- list("No Music - Some Music" = c(1, -1/2, -1/2),
             "No Music - Pop Music" = c(1, 0, -1))

Use the emmeans() function to obtain the estimated treatment means and uncertainties for your factor:

emm <- emmeans(mdl_stz, ~ type)
emm
##  type            emmean    SE  df lower.CL upper.CL
##  No Music          22.1 0.259 357     21.6     22.7
##  Classical Music   24.2 0.259 357     23.7     24.7
##  Pop Music         21.9 0.259 357     21.4     22.4
## 
## Confidence level used: 0.95

Next, we run the contrast analysis. To do so, we use the contrast() function:

comp_res <- contrast(emm, method = comp)
comp_res
##  contrast              estimate    SE  df t.ratio p.value
##  No Music - Some Music   -0.895 0.318 357  -2.818  0.0051
##  No Music - Pop Music     0.242 0.367 357   0.661  0.5090

Finally, we can ask for 95% confidence intervals:

confint(comp_res)
##  contrast              estimate    SE  df lower.CL upper.CL
##  No Music - Some Music   -0.895 0.318 357   -1.520   -0.270
##  No Music - Pop Music     0.242 0.367 357   -0.479    0.964
## 
## Confidence level used: 0.95

Interpret the result of the contrast analysis.

  • What do the p-values tell us?
  • Interpret the confidence intervals in the context of the study.

Solution

A final situation is where an investigator has no predetermined ideas at all where differences may lie, and they may wish to explore the data. We will discuss methods for such analyses in future weeks.

References

North, A., A. Shilcock, and D. Hargreaves. 2003. “The Effect of Musical Style on Restaurant Customers’ Spending.” Environment and Behavior 35: 712–18.