Describe group-structured data

You learned in DAPR2 how to report descriptive statistics of your data (see the DAPR2 descriptives flash cards).

Now that we’re modelling grouping structure in our data, we should also describe some properties of those groups.

Why? So that we (and our readers) can tell how well our sample of groups might represent the population we hope to generalise to. For example, if we run a within-participants psychology experiment but all of our participants (i.e., all the levels of our grouping variable) are 19–24-year-old university students, then we probably shouldn’t expect our findings to generalise to the population of all humans in the world (see, e.g., the article Most people are not WEIRD).

Group-level properties to describe

  • What randomly-varying grouping variables (see Data generating process) does the data contain?
  • How many levels does each grouping variable contain?
  • Their relationships: If there are several such grouping variables, how do they relate to each other?
    • For example, if we run a repeated-measures experiment in which every participant sees every experimental stimulus, then our grouping variables are participant and stimulus, and we can say they are “fully crossed”.
    • Or if we collect observational data from many children across different schools, then our grouping variables are child and school. And because each child only attends one school, we can say that child is “nested” within school.
    • Grouping variables can relate to each other in other ways that don’t have fancy labels like “fully crossed” or “nested”. In those cases, you can just describe which levels (if any) occur together and which (if any) do not.
  • Their sample sizes: How many observations do we have for each level of our grouping variables?
  • Summaries of relevant variables by group:
    • For example, how old are the children who took part in the study (mean, SD, range)? How many children did we sample from each school (mean, SD, range)?
    • Or: what are participants’ gender or occupation or first language or other relevant variable (give counts for each category)?
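
The relationships between grouping variables described above can be checked directly in R. What follows is a minimal sketch using small made-up data frames (`obs`, `exp_obs`, and all the values in them are invented for illustration): a grouping variable is nested in another if each of its levels occurs with exactly one level of the other, and two grouping variables are fully crossed if every combination of their levels occurs.

```r
library(dplyr)

# Made-up observational data: each row is one observation of a child in a school
obs <- data.frame(
  child  = c("c1", "c1", "c2", "c2", "c3", "c3"),
  school = c("s1", "s1", "s1", "s1", "s2", "s2")
)

# Which levels occur together, and how often?
obs |> count(child, school)

# Nested: every child occurs with exactly one school
schools_per_child <- obs |>
  group_by(child) |>
  summarise(n_schools = n_distinct(school))
is_nested <- all(schools_per_child$n_schools == 1)

# Made-up experimental data: every participant sees every stimulus
exp_obs <- expand.grid(
  participant = c("p1", "p2", "p3"),
  stimulus    = c("a", "b"),
  stringsAsFactors = FALSE
)

# Fully crossed: the number of distinct combinations that occur equals
# (number of participants) x (number of stimuli)
n_combos <- exp_obs |> distinct(participant, stimulus) |> nrow()
is_crossed <- n_combos ==
  n_distinct(exp_obs$participant) * n_distinct(exp_obs$stimulus)

is_nested
is_crossed
```

If neither check comes out TRUE for your own data, the table from count() is still worth reporting, because it shows which levels occur together and which do not.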

How to summarise data by group

The example data we’ll use are from an experiment designed to investigate whether the realism of a video game is associated with how much unnecessarily aggressive gameplay occurs (termed “needless game violence”, abbreviated NGV), and whether this behaviour is associated with players’ dark triad traits.

Data are available at https://uoepsy.github.io/data/NGV.csv.

library(tidyverse)  # provides read_csv() and the dplyr functions used below

ngv <- read_csv('https://uoepsy.github.io/data/NGV.csv')

Variable    Description
PID         Participant number
age         Participant age (years)
level       Maze level (1 to 20)
character   Whether the objects and characters in the level were presented as 'cartoon' or as 'realistic'
mode        Whether the participant played via a screen or with a VR headset
P           Psychopathy Trait from SD-3 (score 1-5)
N           Narcissism Trait from SD-3 (score 1-5)
M           Machiavellianism Trait from SD-3 (score 1-5)
NGV         Needless Game Violence metric

The experiment involved playing 10 levels of a game in which the objective was to escape a maze. Various obstacles and other characters were present throughout the maze, and players could interact with these by side-stepping or jumping over them, or by pushing or shooting at them. All of these actions took the same amount of effort to complete (pressing a button), and each one achieved the same end (moving beyond the obstacle and being able to continue through the maze).

Each participant completed all 10 levels twice: once with all characters presented as cartoons, and once with all characters presented as realistic humans and animals. The layout of each level was identical in both versions; the only difference was the depiction of objects and characters. For each participant, these 20 levels (\(2 \times 10\) mazes) were presented in a random order. Half of the participants played via a screen, and the other half played via a VR headset. For each level played, we have a record of “needless game violence” (NGV), which was calculated from the number of aggressive (pushing/shooting) actions taken (+0.5 for every action that missed an object, +1 for every action aimed at an inanimate object, and +2 for every action aimed at an animate character).
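
To make the scoring rule concrete, here is a small worked example. It assumes that the score for a level is simply the weighted sum of the three kinds of aggressive action; the description gives the weights, but whether the NGV column in the data is exactly this sum is an assumption. The function `ngv_score()` and its argument names are made up for illustration and are not part of the dataset or any package.

```r
# Hypothetical helper: score one level from counts of aggressive actions,
# using the weights given in the description (+0.5, +1, +2).
ngv_score <- function(n_missed, n_inanimate, n_animate) {
  0.5 * n_missed + 1 * n_inanimate + 2 * n_animate
}

# A made-up level: 2 actions that missed, 3 aimed at inanimate objects,
# and 1 aimed at an animate character
example_score <- ngv_score(n_missed = 2, n_inanimate = 3, n_animate = 1)
example_score  # 0.5*2 + 1*3 + 2*1 = 6
```

Under this assumed rule, a player who only side-steps or jumps over obstacles takes no aggressive actions and scores 0 for that level.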

Prior to the experiment, each participant completed the Short Dark Triad 3 (SD-3), which measures the three traits of machiavellianism, narcissism, and psychopathy.

The randomly-varying grouping variable in this dataset is PID (see Identify grouping variables).

How many levels?

How many participants do we have data from?

n_distinct(ngv$PID)  # count the distinct values in the PID column of ngv
[1] 76

How many observations per level?

How many observations do we have per participant? (You’ll have used this code before to determine whether PID is a grouping variable at all!)

ngv |>
  group_by(PID) |>
  count()
# A tibble: 76 x 2
# Groups:   PID [76]
   PID        n
   <chr>  <int>
 1 ppt_1     20
 2 ppt_10    20
 3 ppt_11    20
 4 ppt_12    20
 5 ppt_13    20
 6 ppt_14    20
 7 ppt_15    20
 8 ppt_16    20
 9 ppt_17    20
10 ppt_18    20
# i 66 more rows
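
A table with one row per participant is hard to report when there are many participants, so it can help to summarise the counts themselves (mean, SD, range). Here is a minimal sketch using a small made-up data frame in which participants contribute different numbers of rows (`toy`, `n_obs`, and the values are invented for illustration; with the real data you would start from ngv instead):

```r
library(dplyr)

# Made-up data: three participants with 3, 2, and 1 observations
toy <- data.frame(
  PID = c("ppt_1", "ppt_1", "ppt_1", "ppt_2", "ppt_2", "ppt_3")
)

obs_summary <- toy |>
  count(PID, name = "n_obs") |>  # one row per participant; n_obs = number of rows
  summarise(
    n_groups = n(),              # number of participants
    obs_mean = mean(n_obs),
    obs_sd   = sd(n_obs),
    obs_min  = min(n_obs),
    obs_max  = max(n_obs)
  )

obs_summary
```

If every participant has the same number of observations, the SD will be 0 and the minimum and maximum will be equal, and you can simply report that single number.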

How old is each person?

ngv |>
  select(PID, age) |>  # keep only the columns PID and age
  distinct()           # drop duplicated rows, keeping only one instance of each
# A tibble: 76 x 2
   PID      age
   <chr>  <dbl>
 1 ppt_1     47
 2 ppt_10    46
 3 ppt_11    21
 4 ppt_12    37
 5 ppt_13    41
 6 ppt_14    48
 7 ppt_15    26
 8 ppt_16    47
 9 ppt_17    18
10 ppt_18    36
# i 66 more rows

(See the documentation for select() and distinct().)

Descriptives for age

Because age is continuous, we should report measures of central tendency (mean) and dispersion (SD, range).

We should always reduce the data to one row per participant before summarising age. If participants contributed different numbers of observations and we just computed mean(ngv$age), then participants with more rows would count more heavily, and we’d get the wrong number out!
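
To see the problem concretely, here is a minimal sketch with made-up data (`toy` and its values are invented for illustration): one participant aged 20 with a single row, and one aged 40 with three rows.

```r
library(dplyr)

# Made-up data: ppt_a has 1 observation, ppt_b has 3
toy <- data.frame(
  PID = c("ppt_a", "ppt_b", "ppt_b", "ppt_b"),
  age = c(20, 40, 40, 40)
)

# Naive mean over all rows: ppt_b's age is counted three times
naive_mean <- mean(toy$age)          # (20 + 40 + 40 + 40) / 4 = 35

# Mean over participants: each person's age is counted once
per_person_mean <- toy |>
  distinct(PID, age) |>              # one row per participant
  summarise(age_mean = mean(age)) |>
  pull(age_mean)                     # (20 + 40) / 2 = 30

naive_mean
per_person_mean
```

The two numbers only agree when every participant contributes the same number of rows, which is why it is safer to reduce to one row per participant every time.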

ngv |>
  select(PID, age) |>  # keep only the columns PID and age
  distinct() |>        # drop duplicated rows, keeping only one instance of each
  summarise(
    age_mean = mean(age),
    age_sd   = sd(age),
    age_min  = min(age),
    age_max  = max(age)
  )
# A tibble: 1 x 4
  age_mean age_sd age_min age_max
     <dbl>  <dbl>   <dbl>   <dbl>
1     35.0   8.82      18      48

(See the documentation for summarise().)
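
The same approach works for categorical variables measured once per participant, except that we report counts for each category instead of a mean and SD. Here is a minimal sketch using a made-up data frame with the same column names as ngv (`toy` and the category labels "Screen" and "VR" are invented for illustration; check how the categories are actually labelled in your data):

```r
library(dplyr)

# Made-up data: two rows per participant, mode constant within participant
toy <- data.frame(
  PID  = c("ppt_1", "ppt_1", "ppt_2", "ppt_2", "ppt_3", "ppt_3"),
  mode = c("Screen", "Screen", "VR", "VR", "Screen", "Screen")
)

mode_counts <- toy |>
  distinct(PID, mode) |>  # one row per participant
  count(mode)             # number of participants in each category

mode_counts
```

Counting rows without the distinct() step would give the number of observations in each category, not the number of participants.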

Linked flash cards