Identify grouping variables

How to identify grouping variables

We can use the tidyverse functions group_by() and count() to look for grouping variables, that is, for categorical or ordinal variables whose values appear more than once in a dataset.

For example, in a dataset on workplace pride in different departments of the UK civil service, let’s look at the variable department_name.

workpride <- read_csv('https://uoepsy.github.io/data/lmm_jsup.csv')

workpride |>
  group_by(department_name) |>
  count() 
# A tibble: 16 x 2
# Groups:   department_name [16]
   department_name                                                       n
   <chr>                                                             <int>
 1 Charity Commission for England and Wales                             17
 2 Competition and Markets Authority                                    21
 3 Crown Prosecution Service                                            13
 4 Food Standards Agency                                                25
 5 Government Legal Department                                          17
 6 HM Revenue & Customs                                                 16
 7 National Crime Agency                                                20
 8 National Savings and Investments                                     20
 9 Office for Standards in Education, Children's Services and Skills    17
10 Office of Gas and Electricity Markets                                15
11 Office of Qualifications and Examinations Regulation                  5
12 Office of Rail and Road                                              17
13 Serious Fraud Office                                                 18
14 Supreme Court of the United Kingdom                                  13
15 UK Statistics Authority                                              45
16 Water Services Regulation Authority                                  16

Each department is associated with multiple observations. The specific number of observations is different between departments (some have 5, some have 45), but that doesn’t matter. What matters is that there are departments that are linked to more than one observation. Because the count values in the n column are greater than 1, department_name is a grouping variable.

Sometimes you might end up with variables where some levels appear multiple times, but other levels appear only once. Even if this is the case, the variable is still considered a grouping variable.

For example, the starwars dataset (which comes with tidyverse) tells us a bit about several Star Wars characters:

starwars |>
  head()
# A tibble: 6 x 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky~    172    77 blond      fair       blue            19   male  mascu~
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu~
3 R2-D2         96    32 <NA>       white, bl~ red             33   none  mascu~
4 Darth Va~    202   136 none       white      yellow          41.9 male  mascu~
5 Leia Org~    150    49 brown      light      brown           19   fema~ femin~
6 Owen Lars    178   120 brown, gr~ light      blue            52   male  mascu~
# i 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

In the starwars dataset, is species a grouping variable?

starwars |>
  group_by(species) |>
  count()
# A tibble: 38 x 2
# Groups:   species [38]
   species       n
   <chr>     <int>
 1 Aleena        1
 2 Besalisk      1
 3 Cerean        1
 4 Chagrian      1
 5 Clawdite      1
 6 Droid         6
 7 Dug           1
 8 Ewok          1
 9 Geonosian     1
10 Gungan        3
# i 28 more rows

Yes, species is a grouping variable, because at least some levels of species group together more than one item (e.g., the dataset contains six droids, three Gungans).

Note: Counting how many times each level of the variable appears will only work if data is “tidy”. All the data that we’ll use in DAPR3 is tidy, so for this course, the method will always work.

Stylized text providing an overview of Tidy Data. The top reads “Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.” On the left reads “In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.” There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure. Graphic by Allison Horst, CC BY 4.0

According to tidyr’s Tidy data vignette, in tidy data:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

Here’s an example of data that’s tidy:

iris |> 
  head() |>
  gt::gt()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

We know this data is tidy because:

  1. Each column is a variable: length and width of sepal and petal, as well as the species of iris.
  2. Each row contains information gathered for one specific iris: one observation.
  3. Each cell contains a single value of the given variable for the given observation.

Here’s an example of data that’s messy:

relig_income |>   # this dataset comes with tidyverse
  head() |>
  gt::gt()
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k >150k Don't know/refused
Agnostic 27 34 60 81 76 137 122 109 84 96
Atheist 12 27 37 52 35 70 73 59 74 76
Buddhist 27 21 30 34 33 58 62 39 53 54
Catholic 418 617 732 670 638 1116 949 792 633 1489
Don’t know/refused 15 14 15 11 10 35 21 17 18 116
Evangelical Prot 575 869 1064 982 881 1486 949 723 414 1529

We know this data is messy because:

  1. Each column is not a distinct variable; each column header actually contains a value of a variable representing how much money the household makes.
  2. Each row is not a distinct observation (which would correspond to one household), but a collection of frequency values for a given religion.
  3. Each cell does contain a single value, a count of how many households of the given religion have the given income.

Here’s a tidy version of that same data:

relig_income |>
  pivot_longer(-religion, names_to = "income", values_to = "frequency") |>
  head() |>
  gt::gt()
religion income frequency
Agnostic <$10k 27
Agnostic $10-20k 34
Agnostic $20-30k 60
Agnostic $30-40k 81
Agnostic $40-50k 76
Agnostic $50-75k 137

It’s very likely that when you gather data yourself, it will be messy in one way or another.

Before you can plot and analyse your data, you’ll need to tidy it.

Check out tidyr’s Tidy data vignette. It walks you through five common ways in which data can be messy and how to fix each one. (The relig_income example from the rabbit hole above is from that vignette!)

Linked flash cards