DAPR3 Lab Exercises

Identifying grouping structure

In this lab, you will start to build skills that you’ll use again and again for the next five weeks (and really, for the rest of your data analysis career!).

You’ll practice identifying the grouping structure within a given dataset and thinking about what kind of variability each grouping variable contributes.

In the questions below, you’ll apply the tools you saw in the lectures to three different group-structured datasets.

(Next week, you’ll apply new tools to these same three datasets to understand their grouping structures even more deeply!)

Note: Get set up
  1. Create a new .Rmd file for this week’s exercises.
  2. Save it somewhere you can find it again.
  3. Give it a clear name (for example, dapr3_lab01.Rmd).
  4. In the first code chunk, load the packages you’ll need this week:
    • tidyverse
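
As a minimal sketch, the first code chunk of your .Rmd file could contain just the line below (this assumes the tidyverse package is already installed on your machine):

```r
# Load the tidyverse, which provides read_csv(), group_by(), and count()
library(tidyverse)
```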

Clothing

Read in the dataset located at https://uoepsy.github.io/data/dapr3_mannequin.csv and name it clothing using the following line of code.

clothing <- read_csv("https://uoepsy.github.io/data/dapr3_mannequin.csv")

RQ: Are people more likely to purchase clothing when they see it displayed on a model, and is this association dependent on item price?

variable      description
purch_rating  Purchase rating (sliding scale 0 to 100, with higher ratings indicating greater perceived likelihood of purchase)
price         Price presented for item (range £5 to £100)
ppt           Participant identifier
condition     Whether items are seen on a model or on a white background
Note: More detail about this dataset

Thirty participants were presented with a set of pictures of items of clothing, and rated how likely they were to buy each item. Each participant saw 20 items, ranging in price from £5 to £100. 15 participants saw these items worn by a model, while the other 15 saw the items hanging against a white background.
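
One way to check a design description like this against the data is to summarise, for each participant, how many rows, conditions, and prices they have. The sketch below uses a small made-up data frame (toy_clothing is hypothetical, with 4 participants and 3 prices) so that it runs on its own; the same pipeline could be applied to clothing.

```r
library(dplyr)

# Hypothetical toy data shaped like clothing: 4 participants, 3 prices each.
# Participants 1-2 are in the "model" condition, 3-4 in "item_only".
toy_clothing <- expand.grid(
  ppt = paste0("ppt_", 1:4),
  price = c(5, 10, 15),
  stringsAsFactors = FALSE
)
toy_clothing$condition <- ifelse(
  toy_clothing$ppt %in% c("ppt_1", "ppt_2"), "model", "item_only"
)

# For each participant: how many rows, conditions, and prices do they have?
design_check <- toy_clothing |>
  group_by(ppt) |>
  summarise(
    n_rows = n(),
    n_conditions = n_distinct(condition),
    n_prices = n_distinct(price)
  )
design_check
```

If the real data match the description above, each participant in clothing should have 20 rows, a single condition, and 20 different prices.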

Question 1

Based on the RQ:

  1. Which variable is the outcome variable, aka the dependent variable?
  2. Which variable or variables is/are the predictor variable(s), aka the independent variable(s)?

Solution 1.

  1. Outcome: purch_rating.
  2. Predictors: price, condition.

Question 2

What grouping variable(s) does this dataset contain?

Solution 2. We’ll go through each variable in clothing one by one.

  1. purch_rating is not a grouping variable for two reasons: it’s the outcome variable, and it’s continuous.

  2. price is numeric, so it looks continuous. But we can also think of it as ordinal, since it has a fixed set of discrete values that occur in a specific order. And those values each appear more than once, so price is a grouping variable:

clothing |> group_by(price) |> count()
# A tibble: 20 × 2
# Groups:   price [20]
   price     n
   <dbl> <int>
 1     5    30
 2    10    30
 3    15    30
 4    20    30
 5    25    30
 6    30    30
 7    35    30
 8    40    30
 9    45    30
10    50    30
11    55    30
12    60    30
13    65    30
14    70    30
15    75    30
16    80    30
17    85    30
18    90    30
19    95    30
20   100    30
  3. ppt is a grouping variable because each of its values appears more than once.
clothing |> group_by(ppt) |> count()
# A tibble: 30 × 2
# Groups:   ppt [30]
   ppt        n
   <chr>  <int>
 1 ppt_1     20
 2 ppt_10    20
 3 ppt_11    20
 4 ppt_12    20
 5 ppt_13    20
 6 ppt_14    20
 7 ppt_15    20
 8 ppt_16    20
 9 ppt_17    20
10 ppt_18    20
# ℹ 20 more rows
  4. condition is a grouping variable because each of its values appears more than once.
clothing |> group_by(condition) |> count()
# A tibble: 2 × 2
# Groups:   condition [2]
  condition     n
  <chr>     <int>
1 item_only   300
2 model       300

In summary, the grouping variables in clothing are price, ppt, and condition.
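
A quick way to shortlist candidate grouping variables is to compare the number of distinct values in each column with the number of rows. The sketch below does this on a small made-up data frame (toy_clothing and its ratings are hypothetical, not the real dataset). It is only a first pass: a column with few distinct values still needs the kind of variable-by-variable reasoning shown above.

```r
library(dplyr)

# Hypothetical toy data with the same columns as clothing:
# 4 participants x 3 prices, condition fixed within participant.
toy_clothing <- expand.grid(
  ppt = paste0("ppt_", 1:4),
  price = c(5, 10, 15),
  stringsAsFactors = FALSE
)
toy_clothing$condition <- ifelse(
  toy_clothing$ppt %in% c("ppt_1", "ppt_2"), "model", "item_only"
)
# Made-up ratings, one unique value per row
toy_clothing$purch_rating <- seq(10, by = 7.5, length.out = nrow(toy_clothing))

# Number of rows, and number of distinct values in each column
n_rows <- nrow(toy_clothing)
n_unique <- toy_clothing |>
  summarise(across(everything(), n_distinct))
n_unique
```

In the toy data, ppt, price, and condition have far fewer distinct values than there are rows, so their values must repeat, whereas purch_rating has a different value on every row.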

Question 3
  1. Which grouping variable(s) contribute reproducible/manipulated/controlled variability?
  2. Which grouping variable(s) contribute random/non-manipulated/non-controlled variability?

Solution 3.

  1. Sources of reproducible/manipulated/controlled variability (appearing in the model as predictors):
  • price
  • condition
  2. Sources of random/non-manipulated/non-controlled variability:
  • ppt

Monkey status

Read in the dataset located at https://uoepsy.github.io/data/msmr_monkeystatus.csv and name it monkey.

RQ: How is the social status of monkeys associated with their ability to solve problems, while controlling for the difficulty of the problem?

variable    description
status      Social status of monkey (adolescent, subordinate adult, or dominant adult)
difficulty  Problem difficulty ('easy' vs 'difficult')
monkeyID    Monkey name
solved      Whether or not the problem was successfully solved by the monkey
Note: More detail about this dataset

Researchers have given a sample of Rhesus Macaques various problems to solve in order to receive treats. Troops of Macaques have a complex social structure, but adult monkeys can be loosely categorised as having either a “dominant” or “subordinate” status. The monkeys in our sample are either adolescent monkeys, subordinate adults, or dominant adults. Each monkey attempted various problems before they got bored/distracted/full of treats. Each problem was classed as either “easy” or “difficult”, and the researchers recorded whether or not the monkey solved each problem.

Question 4

Based on the RQ:

  1. Which variable is the outcome variable, aka the dependent variable?
  2. Which variable or variables is/are the predictor variable(s), aka the independent variable(s)?

Solution 4.

  1. Outcome: solved.
  2. Predictors: status, difficulty.

Question 5

What grouping variable(s) does this dataset contain?

Solution 5.

monkey <- read_csv("https://uoepsy.github.io/data/msmr_monkeystatus.csv")

We’ll go through each variable in monkey one by one.

  1. status is a grouping variable because each of its values appears more than once.
monkey |> group_by(status) |> count()
# A tibble: 3 × 2
# Groups:   status [3]
  status          n
  <chr>       <int>
1 adolescent    126
2 dominant      181
3 subordinate    90
  2. difficulty is a grouping variable because each of its values appears more than once.
monkey |> group_by(difficulty) |> count()
# A tibble: 2 × 2
# Groups:   difficulty [2]
  difficulty     n
  <chr>      <int>
1 difficult    189
2 easy         208
  3. monkeyID is a grouping variable because each of its values appears more than once.
monkey |> group_by(monkeyID) |> count()
# A tibble: 50 × 2
# Groups:   monkeyID [50]
   monkeyID      n
   <chr>     <int>
 1 Aliyya        7
 2 Ashley        6
 3 Billy         7
 4 Brianna       5
 5 Catherine     9
 6 Celestina     5
 7 Cheyenne      6
 8 Cinoi        10
 9 Courtney      7
10 Daniel        8
# ℹ 40 more rows
  4. solved is not a grouping variable because it’s the outcome variable.

In summary, the grouping variables in monkey are status, difficulty, and monkeyID.
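
The counts for monkeyID above differ from monkey to monkey, so it can help to summarise each monkey's rows in one table: how many problems it attempted, and how many different values of status and difficulty it has. The sketch below uses a made-up data frame (toy_monkey, its names, and its values are all hypothetical, including the 0/1 coding of solved) so that it runs on its own; the same pipeline could be applied to monkey.

```r
library(dplyr)

# Hypothetical toy data shaped like monkey; names and values are made up.
toy_monkey <- data.frame(
  monkeyID = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  status = c(rep("dominant", 3), rep("adolescent", 2), rep("subordinate", 4)),
  difficulty = c("easy", "difficult", "easy",
                 "easy", "difficult",
                 "difficult", "easy", "easy", "difficult"),
  solved = c(1, 0, 1, 1, 0, 0, 1, 1, 0)
)

# For each monkey: number of problems attempted, and number of
# distinct status and difficulty values among its rows
per_monkey <- toy_monkey |>
  group_by(monkeyID) |>
  summarise(
    n_problems = n(),
    n_status = n_distinct(status),
    n_difficulty = n_distinct(difficulty)
  )
per_monkey
```

In the toy data, each monkey has a single status but attempts problems of both difficulties, and the number of problems varies by monkey. Whether the same pattern holds in the real data is something to check rather than assume.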

Question 6
  1. Which grouping variable(s) contribute reproducible/manipulated/controlled variability?
  2. Which grouping variable(s) contribute random/non-manipulated/non-controlled variability?

Solution 6.

  1. Sources of reproducible/manipulated/controlled variability (appearing in the model as predictors):
  • status
  • difficulty
  2. Sources of random/non-manipulated/non-controlled variability:
  • monkeyID

Laughs

Read in the dataset located at https://uoepsy.github.io/data/lmm_laughs.csv and name it laughs.

RQ: How is the delivery format of jokes (audio-only vs. audio AND video) associated with differences in humour ratings?

variable    description
ppt         Participant identification number
joke_label  Joke presented
joke_id     Joke identification number
delivery    Experimental manipulation: whether joke was presented in audio-only ('audio') or in audio and video ('video')
rating      Humour rating chosen on a slider from 0 to 100
Note: More detail about this dataset

These data are simulated to imitate an experiment that investigates the effect of visual non-verbal communication (i.e., gestures, facial expressions) on joke appreciation. Ninety participants took part in the experiment, in which they each rated how funny they found a set of 30 jokes. For each participant, the order of these 30 jokes was randomised for each run of the experiment. For each participant, the set of jokes was randomly split into two halves, with the first half being presented in audio-only, and the second half being presented in audio and video. This meant that each participant saw 15 jokes with video and 15 without, and each joke would be presented with video roughly half of the time.
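
A design like this can be checked by counting combinations of variables: every participant-joke pair should appear once, each participant's jokes should split evenly between the two delivery formats, and each joke should appear in both formats. The sketch below uses a small made-up data frame (toy_laughs is hypothetical, with 4 participants, 6 jokes, and a deterministic rather than random assignment) so that it runs on its own; the same counts could be computed on laughs.

```r
library(dplyr)

# Hypothetical toy data shaped like laughs: 4 participants x 6 jokes.
toy_laughs <- expand.grid(
  joke_id = 1:6,
  ppt = paste0("PPTID", 1:4),
  stringsAsFactors = FALSE
)
# Made-up assignment: odd-numbered participants get jokes 1-3 as audio,
# even-numbered participants get jokes 4-6 as audio.
ppt_num <- as.integer(sub("PPTID", "", toy_laughs$ppt))
is_audio <- (ppt_num %% 2 == 1) == (toy_laughs$joke_id <= 3)
toy_laughs$delivery <- ifelse(is_audio, "audio", "video")

# Does every participant-joke pair appear exactly once?
pair_counts <- toy_laughs |> count(ppt, joke_id)
all(pair_counts$n == 1)

# How many jokes did each participant get in each delivery format?
per_ppt <- toy_laughs |> count(ppt, delivery)
per_ppt

# How often was each joke presented in each delivery format?
per_joke <- toy_laughs |> count(joke_id, delivery)
per_joke
```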

Question 7

Based on the RQ:

  1. Which variable is the outcome variable, aka the dependent variable?
  2. Which variable or variables is/are the predictor variable(s), aka the independent variable(s)?

Solution 7.

  1. Outcome: rating.
  2. Predictor: delivery.

Question 8

What grouping variable(s) does this dataset contain?

Solution 8.

laughs <- read_csv('https://uoepsy.github.io/data/lmm_laughs.csv')

We’ll go through each variable in laughs one by one.

  1. ppt is a grouping variable because each of its values appears more than once.
laughs |> group_by(ppt) |> count()
# A tibble: 90 × 2
# Groups:   ppt [90]
   ppt         n
   <chr>   <int>
 1 PPTID1     30
 2 PPTID10    30
 3 PPTID11    30
 4 PPTID12    30
 5 PPTID13    30
 6 PPTID14    30
 7 PPTID15    30
 8 PPTID16    30
 9 PPTID17    30
10 PPTID18    30
# ℹ 80 more rows
  2. joke_label is a grouping variable because each of its values appears more than once.
laughs |> group_by(joke_label) |> count()
# A tibble: 30 × 2
# Groups:   joke_label [30]
   joke_label                                                                  n
   <chr>                                                                   <int>
 1 "A couple of New Jersey hunters are out in the woods when one of them …    90
 2 "A doctor says to his patient, 'I have bad news and worse news'. 'Oh d…    90
 3 "A general noticed one of his soldiers behaving oddly. The soldier wou…    90
 4 "A man and a friend are playing golf one day at their local golf cours…    90
 5 "A man goes into the doctor with a penguin on his head. The doctor say…    90
 6 "A patient says: \"Doctor, last night I made a Freudian slip, I was ha…    90
 7 "A termite walks into a cocktail lounge and asks a customer \"is the b…    90
 8 "A turtle was walking down an alley in New York when he was mugged by …    90
 9 "A woman gets on a bus with her baby. The bus driver says: \"That's th…    90
10 "An Alsatian went to a telegram office, took out a blank form and wrot…    90
# ℹ 20 more rows
  3. joke_id is a grouping variable because each of its values appears more than once. (joke_id and joke_label contain the same information, just with different amounts of detail.)
laughs |> group_by(joke_id) |> count()
# A tibble: 30 × 2
# Groups:   joke_id [30]
   joke_id     n
     <dbl> <int>
 1       1    90
 2       2    90
 3       3    90
 4       4    90
 5       5    90
 6       6    90
 7       7    90
 8       8    90
 9       9    90
10      10    90
# ℹ 20 more rows
  4. delivery is a grouping variable because each of its values appears more than once.
laughs |> group_by(delivery) |> count()
# A tibble: 2 × 2
# Groups:   delivery [2]
  delivery     n
  <chr>    <int>
1 audio     1350
2 video     1350
  5. rating is not a grouping variable because it’s the outcome variable (and besides, it’s continuous).

In summary, the grouping variables in laughs are ppt, joke_label/joke_id, and delivery.
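
The claim that joke_id and joke_label carry the same information can be checked by listing the unique combinations of the two columns: if they match one-to-one, no id and no label appears in more than one combination. The sketch below uses a made-up data frame (toy_laughs and its joke texts are hypothetical) so that it runs on its own; the same check could be run on laughs.

```r
library(dplyr)

# Hypothetical toy data: 3 jokes, each rated twice
toy_laughs <- data.frame(
  joke_id = c(1, 1, 2, 2, 3, 3),
  joke_label = c("Joke text A", "Joke text A",
                 "Joke text B", "Joke text B",
                 "Joke text C", "Joke text C")
)

# Keep one row per unique (joke_id, joke_label) combination
id_label_pairs <- toy_laughs |> distinct(joke_id, joke_label)

# One-to-one if no id and no label appears in more than one combination
one_to_one <- n_distinct(id_label_pairs$joke_id) == nrow(id_label_pairs) &&
  n_distinct(id_label_pairs$joke_label) == nrow(id_label_pairs)
one_to_one
```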

Question 9
  1. Which grouping variable(s) contribute reproducible/manipulated/controlled variability?
  2. Which grouping variable(s) contribute random/non-manipulated/non-controlled variability?

Solution 9.

  1. Sources of reproducible/manipulated/controlled variability (appearing in the model as predictors):
  • delivery
  2. Sources of random/non-manipulated/non-controlled variability:
  • ppt
  • joke_label/joke_id

The Piazza forum

Question 10

Finally: we want you to get familiar with the course’s Piazza page. [TODO ADD LINK]

Piazza is a discussion forum where you can anonymously post questions that your coursemates, tutors, and instructors can see and respond to. Throughout this course, please use Piazza to ask us your stats questions. Asking on Piazza is better than asking by email because then everybody else can benefit from your questions.

Your final task this week is to get to know Piazza by anonymously posting about something you like. (A cute photo of your pet, a nice thing someone said to you, the best food you ate recently… whatever makes you happy!)