Grouping variables

Grouping variables are ordinal or categorical variables whose values appear more than once in a dataset. In other words, when multiple observations in a dataset share the same value of an ordinal or categorical variable, that variable is a grouping variable.

For example:
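
As a minimal sketch, assume a hypothetical long-format dataset in which each row is one observation of a child. The column names (child_id, age_months, score) and all values are invented for illustration and do not come from the course data. Counting how often each value of child_id occurs shows that it repeats, which is what makes it a grouping variable under the definition above.

```python
from collections import Counter

# Hypothetical data: one row per observation (child_id, age_months, score).
rows = [
    ("child_A", 24, 10),
    ("child_A", 30, 14),
    ("child_A", 36, 19),
    ("child_B", 24, 8),
    ("child_B", 30, 9),
    ("child_C", 24, 12),
    ("child_C", 30, 15),
    ("child_C", 36, 21),
]

# Count how many observations share each value of child_id.
obs_per_child = Counter(child_id for child_id, _, _ in rows)

# The variable groups observations when at least one of its values repeats.
is_grouping_variable = any(n > 1 for n in obs_per_child.values())

print(obs_per_child)
print(is_grouping_variable)
```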

Why are grouping variables important?

In DAPR2, you learned that linear models make a number of assumptions (revisit last year’s DAPR2 flashcards here).

One of those assumptions was the independence of errors. For this assumption to be met, the error associated with each observation must be independent of the error associated with every other observation. But when (for example) our data contains multiple observations from the same child, those data points are not independent: because they all come from the same child, they tend to resemble each other more than they resemble observations from other children.

In general, when our data contains grouping structures, the data points that share a value of a grouping variable are not independent of one another, because they all come from the same group. Our analysis will therefore violate the assumption of independence of errors unless the model accounts for the grouping variables.
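
The following simulation is a rough sketch of why ignoring a grouping variable breaks independence of errors. It assumes a made-up data generating process (a grand mean of 50, a child-specific offset with standard deviation 2, and observation noise with standard deviation 1); none of these numbers come from the course materials. A "model" that ignores the grouping is represented by a single grand mean. The residuals of two observations from the same child then tend to share a sign, so their average product is clearly positive, while the average product for observations from different children stays close to zero.

```python
import random
from statistics import mean

random.seed(1)

n_children, n_obs = 200, 5  # hypothetical design: 5 observations per child

# Simulate scores as grand mean + child-specific offset + observation noise.
data = []
for child in range(n_children):
    child_offset = random.gauss(0, 2)  # shared by all of this child's rows
    for _ in range(n_obs):
        data.append((child, 50 + child_offset + random.gauss(0, 1)))

# A model that ignores the grouping: one grand mean for every observation.
grand_mean = mean(score for _, score in data)
residuals = {}
for child, score in data:
    residuals.setdefault(child, []).append(score - grand_mean)

# Average product of residuals for pairs of observations from the SAME child.
within = mean(
    r[i] * r[j]
    for r in residuals.values()
    for i in range(n_obs)
    for j in range(i + 1, n_obs)
)

# Average product of residuals for pairs from DIFFERENT children
# (each child's first observation paired with the next child's first).
between = mean(
    residuals[c][0] * residuals[c + 1][0] for c in range(n_children - 1)
)

print(f"same child:         {within:.2f}")
print(f"different children: {between:.2f}")
```

In this sketch the within-child value approximates the variance of the child offsets, which is exactly the dependence that a model without the grouping variable leaves in its errors.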

Depending on the data generating process, we model grouping variables either as fixed effects or as random effects.

In particular, if we have a randomly varying grouping variable (see data generating process), then our model needs to contain a random intercept by that variable, and may also contain random slopes for the model’s predictors by that variable.
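
One common way to write such a model is sketched below for a single predictor. The notation is an assumption made for illustration, not taken from the course materials: observation i belongs to group j, u_{0j} is the random intercept, and u_{1j} is the optional random slope. Dropping the u_{1j} term gives the random-intercept-only model.

```latex
\begin{aligned}
y_{ij} &= (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\,x_{ij} + \varepsilon_{ij} \\
(u_{0j},\, u_{1j}) &\sim N(\mathbf{0},\, \Sigma_u), \qquad
\varepsilon_{ij} \sim N(0,\, \sigma^2_{\varepsilon})
\end{aligned}
```

Here \beta_0 and \beta_1 are the fixed intercept and slope shared by all groups, and each group's own intercept and slope deviate from them by u_{0j} and u_{1j}.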
