Identifying grouping structure

In this lab, you will start to build skills that you’ll use again and again for the next five weeks (and really, for the rest of your data analysis career!).

You’ll practice identifying the grouping structure within a given dataset and thinking about what kind of variability each grouping variable contributes.

In the questions below, you’ll apply the tools you saw in the lectures to three different group-structured datasets.

(Next week, you’ll apply new tools to these same three datasets to understand their grouping structures even more deeply!)

Get set up

Create a new .Rmd file for this week’s exercises.
Save it somewhere you can find it again.
Give it a clear name (for example, dapr3_lab01.Rmd).
In the first code chunk, load the packages you’ll need this week:
- tidyverse
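A first chunk along these lines should do the job. This is a minimal sketch that assumes the tidyverse is already installed on your machine; if it is not, run install.packages("tidyverse") once in the console first.

```r
# Load the tidyverse, which attaches dplyr, readr, ggplot2, and related packages
library(tidyverse)
```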

Clothing

Read in the dataset located at https://uoepsy.github.io/data/dapr3_mannequin.csv and name it clothing using the following line of code.

clothing <- read_csv("https://uoepsy.github.io/data/dapr3_mannequin.csv")

RQ: Are people more likely to purchase clothing when they see it displayed on a model, and is this association dependent on item price?

variable	description
purch_rating	Purchase rating (sliding scale 0 to 100, with higher ratings indicating greater perceived likelihood of purchase)
price	Price presented for item (range £5 to £100)
ppt	Participant identifier
condition	Whether items are seen on a model or on a white background

More detail about this dataset

Thirty participants were presented with a set of pictures of items of clothing, and rated how likely they were to buy each item. Each participant saw 20 items, ranging in price from £5 to £100. Fifteen participants saw these items worn by a model, while the other 15 saw the items hanging against a white background.

Question 1

Based on the RQ:

Which variable is the outcome variable, aka the dependent variable?
Which variable or variables is/are the predictor variable(s), aka the independent variable(s)?

Question 2

What grouping variable(s) does this dataset contain?

🗂️ See Grouping variables and Identify grouping variables flash cards.

Question 3

Which grouping variable(s) contribute reproducible/manipulated/controlled variability?
Which grouping variable(s) contribute random/non-manipulated/non-controlled variability?

🗂️ See Data generating process flash card.

Monkey status

Read in the dataset located at https://uoepsy.github.io/data/msmr_monkeystatus.csv and name it monkey.

RQ: How is the social status of monkeys associated with their ability to solve problems, while controlling for the difficulty of the problem?

variable	description
status	Social status of monkey (adolescent, subordinate adult, or dominant adult)
difficulty	Problem difficulty ('easy' vs 'difficult')
monkeyID	Monkey name
solved	Whether or not the problem was successfully solved by the monkey

More detail about this dataset

Researchers have given a sample of Rhesus Macaques various problems to solve in order to receive treats. Troops of Macaques have a complex social structure, but adult monkeys can be loosely categorised as having either a “dominant” or “subordinate” status. The monkeys in our sample are either adolescent monkeys, subordinate adults, or dominant adults. Each monkey attempted various problems before they got bored/distracted/full of treats. Each problem was classed as either “easy” or “difficult”, and the researchers recorded whether or not the monkey solved each problem.

Question 4

Based on the RQ:

Which variable is the outcome variable, aka the dependent variable?
Which variable or variables is/are the predictor variable(s), aka the independent variable(s)?

Question 5

What grouping variable(s) does this dataset contain?

🗂️ See Grouping variables and Identify grouping variables flash cards.

Question 6

Which grouping variable(s) contribute reproducible/manipulated/controlled variability?
Which grouping variable(s) contribute random/non-manipulated/non-controlled variability?

🗂️ See Data generating process flash card.

Laughs

Read in the dataset located at https://uoepsy.github.io/data/lmm_laughs.csv and name it laughs.

RQ: How is the delivery format of jokes (audio-only vs. audio AND video) associated with differences in humour ratings?

variable	description
ppt	Participant identification number
joke_label	Joke presented
joke_id	Joke identification number
delivery	Experimental manipulation: whether joke was presented in audio-only ('audio') or in audio and video ('video')
rating	Humour rating chosen on a slider from 0 to 100

More detail about this dataset

These data are simulated to imitate an experiment that investigates the effect of visual non-verbal communication (i.e., gestures, facial expressions) on joke appreciation. Ninety participants took part in the experiment, in which they each rated how funny they found a set of 30 jokes. The order of these 30 jokes was randomised for each participant on each run of the experiment. For each participant, the set of jokes was randomly split into two halves, with the first half being presented in audio-only, and the second half being presented in audio and video. This meant that each participant saw 15 jokes with video and 15 without, and each joke was presented with video roughly half of the time.
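To make the design described above concrete, here is a rough sketch that simulates the same layout from scratch. It is not the code that generated lmm_laughs.csv: the object name design, the integer participant labels, and the seed are all assumptions made for illustration.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Every participant rates every joke: 90 x 30 = 2700 rows
design <- expand_grid(ppt = 1:90, joke_id = 1:30) |>
  group_by(ppt) |>
  # within each participant, shuffle 15 "audio" and 15 "video" labels
  mutate(delivery = sample(rep(c("audio", "video"), each = 15))) |>
  ungroup()

# Each participant should have exactly 15 jokes per delivery format
design |> count(ppt, delivery)

# Each joke should be shown with video to roughly half of the 90 participants
design |> filter(delivery == "video") |> count(joke_id)
```

Running the two count() calls on your own copy of laughs is one way to check that the real data follow the same pattern.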

Question 7

Based on the RQ:

Which variable is the outcome variable, aka the dependent variable?
Which variable or variables is/are the predictor variable(s), aka the independent variable(s)?

Question 8

What grouping variable(s) does this dataset contain?

🗂️ See Grouping variables and Identify grouping variables flash cards.

Solution 8.

laughs <- read_csv("https://uoepsy.github.io/data/lmm_laughs.csv")

We’ll go through each variable in laughs one by one.

ppt is a grouping variable because each of its values appears more than once.

laughs |> group_by(ppt) |> count()

# A tibble: 90 × 2
# Groups:   ppt [90]
   ppt         n
   <chr>   <int>
 1 PPTID1     30
 2 PPTID10    30
 3 PPTID11    30
 4 PPTID12    30
 5 PPTID13    30
 6 PPTID14    30
 7 PPTID15    30
 8 PPTID16    30
 9 PPTID17    30
10 PPTID18    30
# ℹ 80 more rows

joke_label is a grouping variable because each of its values appears more than once.

laughs |> group_by(joke_label) |> count()

# A tibble: 30 × 2
# Groups:   joke_label [30]
   joke_label                                                                  n
   <chr>                                                                   <int>
 1 "A couple of New Jersey hunters are out in the woods when one of them …    90
 2 "A doctor says to his patient, 'I have bad news and worse news'. 'Oh d…    90
 3 "A general noticed one of his soldiers behaving oddly. The soldier wou…    90
 4 "A man and a friend are playing golf one day at their local golf cours…    90
 5 "A man goes into the doctor with a penguin on his head. The doctor say…    90
 6 "A patient says: \"Doctor, last night I made a Freudian slip, I was ha…    90
 7 "A termite walks into a cocktail lounge and asks a customer \"is the b…    90
 8 "A turtle was walking down an alley in New York when he was mugged by …    90
 9 "A woman gets on a bus with her baby. The bus driver says: \"That's th…    90
10 "An Alsatian went to a telegram office, took out a blank form and wrot…    90
# ℹ 20 more rows

joke_id is a grouping variable because each of its values appears more than once. (joke_id and joke_label contain the same information, just with different amounts of detail.)

laughs |> group_by(joke_id) |> count()

# A tibble: 30 × 2
# Groups:   joke_id [30]
   joke_id     n
     <dbl> <int>
 1       1    90
 2       2    90
 3       3    90
 4       4    90
 5       5    90
 6       6    90
 7       7    90
 8       8    90
 9       9    90
10      10    90
# ℹ 20 more rows
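If you want to confirm that two columns really do carry the same information, one option is to count their unique pairings. The sketch below uses a small made-up tibble called toy (a hypothetical stand-in, not the real data) so that it runs on its own.

```r
library(tidyverse)

# Hypothetical toy data: 3 jokes, each rated twice
toy <- tibble(
  joke_id    = c(1, 1, 2, 2, 3, 3),
  joke_label = c("Joke A", "Joke A", "Joke B", "Joke B", "Joke C", "Joke C"),
  rating     = c(40, 55, 10, 25, 80, 70)
)

# Keep one row per unique (joke_id, joke_label) pairing
pairs <- toy |> distinct(joke_id, joke_label)

# If the two columns carry the same information, the number of pairings
# equals the number of unique values in each column on its own
nrow(pairs) == n_distinct(toy$joke_id) &
  nrow(pairs) == n_distinct(toy$joke_label)
```

Swapping toy for laughs runs the same check on the real data, where we would expect 30 pairings if each joke_id matches exactly one joke_label.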

delivery is a grouping variable because each of its values appears more than once.

laughs |> group_by(delivery) |> count()

# A tibble: 2 × 2
# Groups:   delivery [2]
  delivery     n
  <chr>    <int>
1 audio     1350
2 video     1350

rating is not a grouping variable because it’s the outcome variable (and besides, it’s continuous).

In summary, the grouping variables in laughs are ppt, joke_label/joke_id, and delivery.
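As a quicker first pass, you could count the unique values in every column at once rather than one variable at a time. The sketch below does this on a made-up tibble called toy (hypothetical data, chosen so the example is self-contained). Treat the result as a screening step only: a column whose values repeat is a candidate grouping variable, but you still need to think about what each variable represents.

```r
library(tidyverse)

# Hypothetical toy data: 2 participants x 2 jokes
toy <- tibble(
  ppt      = c("P1", "P1", "P2", "P2"),
  joke_id  = c(1, 2, 1, 2),
  delivery = c("audio", "video", "video", "audio"),
  rating   = c(35, 60, 52, 48)
)

# Number of unique values in each column, plus the total number of rows.
# Columns with fewer unique values than rows have values that repeat.
# (n() comes after across() so that n_rows is not itself passed to n_distinct.)
unique_counts <- toy |>
  summarise(across(everything(), n_distinct), n_rows = n())

unique_counts
```

The same summarise() call should work on laughs without changes.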

Question 9

Which grouping variable(s) contribute reproducible/manipulated/controlled variability?
Which grouping variable(s) contribute random/non-manipulated/non-controlled variability?

🗂️ See Data generating process flash card.

The Piazza forum

Question 10

Finally: we want you to get familiar with the course’s Piazza page. [TODO ADD LINK]

Piazza is a discussion forum where you can anonymously post questions that your coursemates, tutors, and instructors can see and respond to. Throughout this course, please use Piazza to ask us your stats questions. Asking on Piazza is better than asking by email because then everybody else can benefit from your questions.

Your final task this week is to get to know Piazza by anonymously posting about something you like. (A cute photo of your pet, a nice thing someone said to you, the best food you ate recently… whatever makes you happy!)