Identify grouping variables

How to identify grouping variables

We can use the tidyverse functions group_by() and count() to look for grouping variables, that is, for categorical or ordinal variables whose values appear more than once in a dataset.

For example, in a dataset on workplace pride in different departments of the UK civil service, let’s look at the variable department_name.

workpride <- read_csv('https://uoepsy.github.io/data/lmm_jsup.csv')

workpride |>
  group_by(department_name) |>
  count()

# A tibble: 16 × 2
# Groups:   department_name [16]
   department_name                                                       n
   <chr>                                                             <int>
 1 Charity Commission for England and Wales                             17
 2 Competition and Markets Authority                                    21
 3 Crown Prosecution Service                                            13
 4 Food Standards Agency                                                25
 5 Government Legal Department                                          17
 6 HM Revenue & Customs                                                 16
 7 National Crime Agency                                                20
 8 National Savings and Investments                                     20
 9 Office for Standards in Education, Children's Services and Skills    17
10 Office of Gas and Electricity Markets                                15
11 Office of Qualifications and Examinations Regulation                  5
12 Office of Rail and Road                                              17
13 Serious Fraud Office                                                 18
14 Supreme Court of the United Kingdom                                  13
15 UK Statistics Authority                                              45
16 Water Services Regulation Authority                                  16

Each department is associated with multiple observations. The specific number of observations is different between departments (some have 5, some have 45), but that doesn’t matter. What matters is that there are departments that are linked to more than one observation. Because the count values in the n column are greater than 1, department_name is a grouping variable.

What if some levels have counts of 1?

Sometimes you might end up with variables where some levels appear multiple times, but other levels appear only once. Even if this is the case, the variable is still considered a grouping variable.

For example, the starwars dataset (which comes with tidyverse) tells us a bit about several Star Wars characters:

starwars |>
  head()

# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

In the starwars dataset, is species a grouping variable?

starwars |>
  group_by(species) |>
  count()

# A tibble: 38 × 2
# Groups:   species [38]
   species       n
   <chr>     <int>
 1 Aleena        1
 2 Besalisk      1
 3 Cerean        1
 4 Chagrian      1
 5 Clawdite      1
 6 Droid         6
 7 Dug           1
 8 Ewok          1
 9 Geonosian     1
10 Gungan        3
# ℹ 28 more rows

Yes, species is a grouping variable, because at least some levels of species group together more than one item (e.g., the dataset contains six droids, three Gungans).

Note: Counting how many times each level of the variable appears will only work if data is “tidy”. All the data that we’ll use in DAPR3 is tidy, so for this course, the method will always work.

What is tidy data?

Graphic by Allison Horst, CC BY 4.0

According to tidyr’s Tidy data vignette, in tidy data:

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.

Here’s an example of data that’s tidy:

iris |> 
  head() |>
  gt::gt()

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

We know this data is tidy because:

Each column is a variable: length and width of sepal and petal, as well as the species of iris.
Each row contains information gathered for one specific iris: one observation.
Each cell contains a single value of the given variable for the given observation.

Here’s an example of data that’s messy:

relig_income |>   # this dataset comes with tidyverse
  head() |>
  gt::gt()

religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k	$75-100k	$100-150k	>150k	Don't know/refused
Agnostic	27	34	60	81	76	137	122	109	84	96
Atheist	12	27	37	52	35	70	73	59	74	76
Buddhist	27	21	30	34	33	58	62	39	53	54
Catholic	418	617	732	670	638	1116	949	792	633	1489
Don’t know/refused	15	14	15	11	10	35	21	17	18	116
Evangelical Prot	575	869	1064	982	881	1486	949	723	414	1529

We know this data is messy because:

Each column is not a distinct variable; each column header actually contains a value of a variable representing how much money the household makes.
Each row is not a distinct observation (which would correspond to one household), but a collection of frequency values for a given religion.
Each cell does contain a single value, a count of how many households of the given religion have the given income.

Here’s a tidy version of that same data:

relig_income |>
  pivot_longer(-religion, names_to = "income", values_to = "frequency") |>
  head() |>
  gt::gt()

religion	income	frequency
Agnostic	<$10k	27
Agnostic	$10-20k	34
Agnostic	$20-30k	60
Agnostic	$30-40k	81
Agnostic	$40-50k	76
Agnostic	$50-75k	137

What if I gather data that isn’t tidy?

It’s very likely that when you gather data yourself, it will be messy in one way or another.

Before you can plot and analyse your data, you’ll need to tidy it.

Check out tidyr’s Tidy data vignette. It walks you through five common ways in which data can be messy and how to fix each one. (The relig_income example from the rabbit hole above is from that vignette!)

Identify grouping variables

How to identify grouping variables

Linked flash cards

Outgoing links

Backlinks