WalkThrough: Advanced Data Wrangling

In many projects (both in academic research and in other sectors), more time is spent cleaning and organising data than is spent actually conducting statistical analyses (a well-designed study with a well-thought-through data collection process can go a long way to remedy this!).

For this reason, we’re going to take a little detour away from statistics to get some more practice wrangling and cleaning data in R. Don’t worry about trying to remember all of the new R functions introduced in this topic - there are a lot. Use them as a means of learning about some of the different ways of doing things in R.

Study Background & Data

The data we’re going to look at now is from an experiment on language comprehension, looking at whether people perceive blinking as a sign of lying.

Research Question: Is the rate of blinking during speech interpreted as a sign of dishonesty (in the context of a lie-detection game)?

Participants were informed that they were going to take part in a lie-detection game. They were presented with audiovisual recordings of a speaker stating behind which of two objects (displayed on screen) there was hidden treasure. Utterances took the form of “The treasure is behind the [target name]”.
Over 20 trials, participants were tasked with using the mouse to click on the object they believed the treasure to be behind. They were told that the speaker in the video was attempting to mislead them, meaning that sometimes they told the truth and sometimes they lied. Crucially, in the videos of the speaker producing the utterances, we manipulated the number of times the speaker blinked (from 1 to 10 times). Participants’ eyes were tracked for the duration of the experiment, with the time spent looking at either object taken as an implicit indication of perceiving a truthful utterance (in which case the participant looks at and clicks on the ‘target’ object, the one identified by the speaker as hiding the treasure) or a dishonest one (in which case the participant looks at and clicks on the alternative ‘distractor’ object).

blink_setup.csv

The data from the experimental design are available at https://uoepsy.github.io/data/blink_setup.csv. In this data, each participant is a row, and the information about which video was presented in each trial is stored in a separate column for each trial. The first bit of the data looks like this:

1 subject_1 /files/vids/blinks_1.mp4 /files/vids/blinsk_8.mp4
2 subject_2 /files/vids/blinks_2.mp4 /files/vids/blinks_4.mp4
3 subject_3 /files/vids/blinks_4.mp4 /files/vids/blinks_5.mp4
4 subject_4 /files/vids/blinks_4.mp4 /files/vids/blinks_7.mp4
5 subject_5 /files/vids/blinks_1.mp4 /files/vids/blinks_4.mp4
6 subject_6 /files/vids/blinks_2.mp4 /files/vids/blinks_3.mp4
blink_eyegaze.xlsx

The data from the eye-tracker, which has been processed to show the proportion of time spent looking at the distractor object in each trial, can be found at https://uoepsy.github.io/data/blink_eyegaze.xlsx. In contrast to the blink_setup.csv data, in this data each trial is a row, so we have 20 rows per participant.

variable_names description
sub Participant number
trial_no Trial number
distractor_fix Time spent looking at distractor object (measured in milliseconds from onset of noun phrase)
rt Time taken to click on an object (measured in milliseconds from the onset of the noun phrase)

The top of the data looks like this:

sub trial_no distractor_fix rt
1 1 503.990657311976 2812
1 2 2810.1367654 2974
1 3 706.739099152984 2257
1 4 NA NA
1 5 223.327680772201 4546
1 6 NA NA

Different Data Formats

Data can come in lots of different formats, meaning that we need lots of different ways to read data into R. Below is some information on some of the more common functions for reading and writing different types of data.

Text based files

  • .csv - comma separated values
    reading: tidyverse - read_csv(); also read.csv(), read.table(..., sep = ",")
    writing: tidyverse - write_csv(); also write.csv(), write.table(..., sep = ",")
  • .tsv - tab separated values
    reading: tidyverse - read_tsv(); also read.table(..., sep = "\t")
    writing: tidyverse - write_tsv(); also write.table(..., sep = "\t")
  • .txt - anything-separated values!
    reading: read.table(..., sep = ...)
    writing: write.table(..., sep = ...)

R files

  • .RDS - 1 file = a single R object
    reading: readRDS()
    writing: saveRDS()
  • .RData - 1 file = a collection of R objects
    reading: load()
    writing: save(), or save.image() to save all objects in the environment

Excel files
The package readxl provides a variety of functions for reading in different types of Microsoft Excel spreadsheet, such as read_excel(), read_xls(), read_xlsx().

Other software
The package haven provides functions for files which have been saved from other statistical software, for instance with read_spss()/read_sav() and read_sas() for files from SPSS and SAS.

Google sheets
The googlesheets4 package can read in data directly from a spreadsheet stored on google drive. You simply find the id of the sheet (it’s the big long string of numbers & letters in the url of your google sheet), and pass it to read_sheet().
It will prompt you to authenticate your account via your browser, but it’s really easy!
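
For example, a minimal sketch (the sheet id below is just a made-up placeholder - replace it with the id from your own sheet’s url):

library(googlesheets4)
# read a sheet directly from google drive using its id
mydata <- read_sheet("1AbCdEfGhIjKlMnOpQrStUv")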

Read in the two data-sets. Take care to look at the file extension (e.g., .csv, .tsv, .xlsx) as an indicator of which function to try.
Make sure you assign them identifiable names.
Once you’ve loaded the data-sets, take a look at them using functions like summary(), str(), dim()/nrow(), or by clicking on them in the environment.

Hints:

  • Some functions, like read_excel(), can’t read directly from a url in the way we have been doing with .csv files.
    • Solution 1:
      • Download the data to your computer
      • upload to the rstudio server if you are using it
      • Direct the function to read it from the place you stored it.
    • Solution 2:
      • Make R download the data directly to somewhere in your working directory (see download.file()).
  • Do both the data-sets have column names? By default, R will assume the first row contains the column names. Look in the help documentation to see how to stop this from happening.
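
One possible way in looks something like this (a sketch; the hint about column names applies to the .csv file, and the object names are just our choices):

library(tidyverse)
library(readxl)

# the .csv can be read straight from the url;
# col_names = FALSE stops R treating the first row of data as column names
setupdata <- read_csv("https://uoepsy.github.io/data/blink_setup.csv", col_names = FALSE)

# read_excel() can't read from a url, so download the file to the working directory first
download.file("https://uoepsy.github.io/data/blink_eyegaze.xlsx",
              destfile = "blink_eyegaze.xlsx", mode = "wb")
eyegaze <- read_excel("blink_eyegaze.xlsx")

# quick checks
dim(setupdata)
str(eyegaze)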

Renaming Columns

You can access the column names from a data-set using names() or colnames().

names(data)
colnames(data)

And we can easily rename these using indexing:

#name the third column "peppapig"
names(data)[3]<-"peppapig"

Or in tidyverse, using rename():

data |>
  rename(newname = currentname)

Problem

The blink_setup.csv file doesn’t have any column names!
We know that there are 20 trials for each participant, and we can see that the 2nd column has information about which subject it is.
Columns 3:22 are trials 1 to 20.

head(setupdata)
   X1        X2                       X3 ...
1   1 subject_1 /files/vids/blinks_1.mp4 ...
2   2 subject_2 /files/vids/blinks_2.mp4 ...
3   3 subject_3 /files/vids/blinks_4.mp4 ...
4   4 subject_4 /files/vids/blinks_4.mp4 ...
5   5 subject_5 /files/vids/blinks_1.mp4 ...
6   6 subject_6 /files/vids/blinks_2.mp4 ...
7 ...       ...                      ... ...
8 ...       ...                      ... ...

Task

  1. Remove the first column
  2. Rename columns 2 to 22 with sensible names.

Hints:

names(setupdata) # what are the names
names(setupdata)[2] # what is the 2nd name
names(setupdata) <- c("...", "...", "...",..) # set the names
c("kermit", paste("peppapig", 1:3, sep="_"))
[1] "kermit"     "peppapig_1" "peppapig_2" "peppapig_3"

Reshaping data

Pivot!

One of the more confusing things to get to grips with is the idea of reshaping a dataframe.
For different reasons, you might sometimes want to have data in wide, or in long format.

Source: https://fromthebottomoftheheap.net/2019/10/25/pivoting-tidily/

When the data is wide, we can make it long using pivot_longer(). When we make data longer, we’re essentially making lots of columns into 2 longer columns. In the animation above, the wide variables x, y and z go into a new, longer column called name that specifies which column (x/y/z) each value came from, and the values themselves are put into the val column.

The animation takes a shortcut in the code it displays above, but you could also use pivot_longer(c(x,y,z), names_to = "name", values_to = "val"). To reverse this, and put it back to being wide, we tell R which columns to take the names and values from: pivot_wider(names_from = name, values_from = val).
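
As a toy illustration (made-up data, not the study data):

library(tidyverse)

wide <- tibble(id = 1:2, x = c(10, 11), y = c(20, 21), z = c(30, 31))

# wide to long: the x, y and z columns become two columns, "name" and "val"
long <- wide |>
  pivot_longer(c(x, y, z), names_to = "name", values_to = "val")

# and back to wide again
long |>
  pivot_wider(names_from = name, values_from = val)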

Problem
The blink_setup.csv file has the data in a different shape to the blink_eyegaze.xlsx file.

  • blink_setup.csv : one row per participant
  • blink_eyegaze.xlsx : one row per trial

Task

Reshape the data to make it so that there is one row per trial.

Hint

  • in the tidyverse functions, you can specify all columns between column x and column z by using the colon, x:z.
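
One possible sketch, assuming the columns were renamed to subject and trial_1 to trial_20 as in the earlier sketch (the object name setuplong matches the one used further down this page):

setuplong <- setupdata |>
  pivot_longer(trial_1:trial_20, names_to = "trial", values_to = "video")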

Dealing with character strings

There are loads of functions we can use to do various things with character strings in R.
Here are a few examples:

substitute a string of characters for another string:

gsub("don't like","love", "i really really really don't like statistics!")
[1] "i really really really love statistics!"

separate a column into multiple columns by splitting at a set of characters

mupsimp <- read_csv("https://uoepsy.github.io/data/muppet_simp.csv")
mupsimp
# A tibble: 18 × 1
   show_name                     
   <chr>                         
 1 simpsons_Marge Simpson        
 2 muppets_Scooter               
 3 muppets_Rowlf the Dog         
 4 muppets_Fozzie Bear           
 5 simpsons_Abraham Simpson      
 6 muppets_Walter                
 7 muppets_Pepe the King Prawn   
 8 muppets_Gonzo                 
 9 simpsons_Santa's Little Helper
10 simpsons_Snowball II/V        
11 simpsons_Maggie Simpson       
12 simpsons_Lisa Simpson         
13 simpsons_Bart Simpson         
14 muppets_Animal                
15 simpsons_Homer Simpson        
16 muppets_Miss Piggy            
17 muppets_Rizzo the Rat         
18 muppets_Kermit the Frog       
mupsimp |> 
  separate(show_name, into = c("show","name"), sep = "_")
# A tibble: 18 × 2
   show     name                 
   <chr>    <chr>                
 1 simpsons Marge Simpson        
 2 muppets  Scooter              
 3 muppets  Rowlf the Dog        
 4 muppets  Fozzie Bear          
 5 simpsons Abraham Simpson      
 6 muppets  Walter               
 7 muppets  Pepe the King Prawn  
 8 muppets  Gonzo                
 9 simpsons Santa's Little Helper
10 simpsons Snowball II/V        
11 simpsons Maggie Simpson       
12 simpsons Lisa Simpson         
13 simpsons Bart Simpson         
14 muppets  Animal               
15 simpsons Homer Simpson        
16 muppets  Miss Piggy           
17 muppets  Rizzo the Rat        
18 muppets  Kermit the Frog      

Extract or replace substrings in a character vector.

# get the first 3 letters
substr(mupsimp$show_name, 1, 3)
 [1] "sim" "mup" "mup" "mup" "sim" "mup" "mup" "mup" "sim" "sim" "sim" "sim"
[13] "sim" "mup" "sim" "mup" "mup" "mup"

These can be combined with functions like nchar() (which finds the number of characters in each string), and they can just as easily be used inside tidyverse functions:

mupsimp |>
  mutate(
    first3 = substr(show_name, 1, 3),
    last3 = substr(show_name, nchar(show_name)-2, nchar(show_name))
  )
# A tibble: 18 × 3
   show_name                      first3 last3
   <chr>                          <chr>  <chr>
 1 simpsons_Marge Simpson         sim    son  
 2 muppets_Scooter                mup    ter  
 3 muppets_Rowlf the Dog          mup    Dog  
 4 muppets_Fozzie Bear            mup    ear  
 5 simpsons_Abraham Simpson       sim    son  
 6 muppets_Walter                 mup    ter  
 7 muppets_Pepe the King Prawn    mup    awn  
 8 muppets_Gonzo                  mup    nzo  
 9 simpsons_Santa's Little Helper sim    per  
10 simpsons_Snowball II/V         sim    I/V  
11 simpsons_Maggie Simpson        sim    son  
12 simpsons_Lisa Simpson          sim    son  
13 simpsons_Bart Simpson          sim    son  
14 muppets_Animal                 mup    mal  
15 simpsons_Homer Simpson         sim    son  
16 muppets_Miss Piggy             mup    ggy  
17 muppets_Rizzo the Rat          mup    Rat  
18 muppets_Kermit the Frog        mup    rog  

paste() can quickly combine two character vectors:

paste("hello","everyone",sep=" ")
[1] "hello everyone"

You can also use it to collapse a vector into a single string:

paste(mupsimp$show_name, collapse=" ")
[1] "simpsons_Marge Simpson muppets_Scooter muppets_Rowlf the Dog muppets_Fozzie Bear simpsons_Abraham Simpson muppets_Walter muppets_Pepe the King Prawn muppets_Gonzo simpsons_Santa's Little Helper simpsons_Snowball II/V simpsons_Maggie Simpson simpsons_Lisa Simpson simpsons_Bart Simpson muppets_Animal simpsons_Homer Simpson muppets_Miss Piggy muppets_Rizzo the Rat muppets_Kermit the Frog"

and paste0() is a quick shortcut for using sep="":

paste0("hello","everyone")
[1] "helloeveryone"

Problem
If you look at what data was captured by the software to indicate which video was used in each trial, there is a lot of unnecessary information there. The number in the filename indicates how many blinks are in the video. This is the only bit of information we want.

head(setuplong$video)
[1] "/files/vids/blinks_1.mp4"  "/files/vids/blinsk_8.mp4" 
[3] "/files/vids/blinks_1.mp4"  "/files/vids/blinks_5.mp4" 
[5] "/files/vids/blinks_4.mp4"  "/files/vids/blinks_10.mp4"

Task

  • In your (now reshaped to long) blink_setup.csv data, make a new, or edit an existing column, which is a numeric variable containing the number of blinks presented in the video in each trial

Hints:

  • there are lots of different ways you could do this.
  • you can substitute out multiple different strings by separating them with the | symbol:
  gsub("dog|cat", "horse", "I have a dog and a cat and the dogs name is Graham")
[1] "I have a horse and a horse and the horses name is Graham"

Joining/merging

Now comes a fun bit.
Recall that the research question concerns the relationship between the number of times the speaker was seen to blink and the time participants spent looking at the distractor object (indicating perceived dishonesty).
You may have noticed that these variables are currently in different data-sets! The blink_setup.csv contains information about the numbers of blinks in the videos, and the blink_eyegaze.xlsx contains the data on the fixations.

Solution: we need to join them together!

Note that because both data-sets contain information on participant number and trial number, which together uniquely identify each observation, we can join them by matching on these variables!

There are lots of different ways to join data-sets, depending on whether we want to keep rows from one data-set or the other, or keep only those in both data-sets etc.

Check out the help documentation for them all using ?full_join.

Problem
Variables are in different data-sets.

Task

  1. Join the two data-sets (the reshaped-to-long blink_setup.csv data, and the blink_eyegaze.xlsx data) together, and store the joined data in a new object (you can use your own name, but the solutions will use the name blinks_full).

Hints:

We want to match the observations based on two columns which are present in each data-set, indicating which participant, and which trial.

  • Remember that R doesn’t have your intelligence - it doesn’t know that in one data-set the variable is called e.g., trial_no and in the other it is called trial_number.
  • Another thing which R doesn’t know is that “subject_1” in setup data is the same participant as “1” in the eye gaze data. It needs to match the same symbols, and what is more, it needs the variables to be the same type (character, numeric, factor etc).
    • you might want to make use of the skills you learned for manipulating character strings.
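
Putting those hints together, a possible sketch (subject and trial are the column names from the sketches above; sub and trial_no are the names in the eye gaze data):

setuplong <- setuplong |>
  mutate(
    # "subject_1" becomes 1 and "trial_1" becomes 1, matching the eye gaze data
    sub = as.numeric(gsub("subject_", "", subject)),
    trial_no = as.numeric(gsub("trial_", "", trial))
  ) |>
  select(sub, trial_no, video, nr_blinks)

blinks_full <- full_join(setuplong, eyegaze, by = c("sub", "trial_no"))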

Impossible Values

It’s important to check that there are no values in the data which are impossible, given what you know about how the data was measured. This is where exploratory plots and descriptive statistics come in handy.

hist(as.numeric(blinks_full$distractor_fix))

In some trials, participants spent less than 0 ms fixating on the distractor object!?!?!?

We have a couple of options as to how to deal with them.

  1. Delete the entire row
  2. Change the specific entry/entries in that variable to be NA (Not Available) - this has the benefit of keeping the rows, should we consider those rows to have valid observations in other variables (for instance the rt - reaction time)

Some of the tools we learned in Reading 1B will come in handy here.

Problem
Some impossible values in the distractor_fix variable.

Task
- Assign the entries of the distractor_fix variable which are < 0 to be NA.
- Are there any other impossible values (or combinations of values) in the data?

Hints:

  • While you’re there, why not convert any variables to the right type (numeric, factor, etc).
  • We might not have come across this before, but there is a really useful function called ifelse().
    Play around with the below code to learn:
tibble(x = 1:10) |>
  mutate(
    new_variable = ifelse(x>5,1,0),
    another_new_variable = ifelse(x>5,"peppapig","kermit"),
    morevariables = ifelse(another_new_variable == "kermit","kermit the frog", another_new_variable)
  )
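
Using ifelse(), one possible approach for the task above (assuming the joined data are stored as blinks_full):

blinks_full <- blinks_full |>
  mutate(
    # make sure the variable is numeric first
    distractor_fix = as.numeric(distractor_fix),
    # set the impossible (negative) fixation times to NA
    distractor_fix = ifelse(distractor_fix < 0, NA, distractor_fix)
  )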

Missing Data in R

Missing data can be a big problem for statistics. For those of you thinking of taking Multivariate Statistics & Methodology in R next semester, you can look forward to discussions around this sort of issue.

Here, however, we are simply going to discuss the practicalities of how to make R code work when some of your values are NAs.

Consider:

vec <- c(1,2,3,4,NA)
mean(vec)
[1] NA

Think about why this is: \[ \text{mean(vec)} = \frac{1+2+3+4+\text{NA}}{5} = \frac{\text{??}}{5} = \text{??} \] There are numerous different ways that functions in R cope with missing values, but if you’re ever in doubt, try na.rm = TRUE. This will basically tell R to “remove the NAs before doing the calculation”.

mean(vec, na.rm=T)
[1] 2.5

Other functions include na.omit(), which removes any row which has an NA anywhere in it:

complete_data <- na.omit(data)

Outliers

Outliers are the extreme - but plausible - values in variables. There is no single way to identify what is extreme enough to be considered an outlier, nor is there one way to handle them.
Some outliers could be considered important observations which we would not want to exclude. However, being an outlier can (though not always) result in an observation exerting too great an influence on our analysis.

Some common approaches to identifying outliers:

  • observations which are \(> 3\) (sometimes \(> 2.5\)) standard deviations away from the mean.
  • observations greater than \(1.5 \times IQR\) below the first quartile \(Q_1\) or above the third quartile \(Q_3\).

Some common approaches to handling outliers:

  • Exclude now - for instance, set as NA
  • “Winsorize” - set to a specified percentile. For example, all observations below the 5th percentile set to the 5th percentile, and all observations above the 95th percentile set to the 95th percentile
  • Exclude from analysis later, based on measures of influence (we’ll learn about this in future topics)

Make a boxplot of the distractor_fix variable. Does it look like there might be any outliers?
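
For example (assuming the joined data are stored as blinks_full, with distractor_fix converted to numeric):

boxplot(blinks_full$distractor_fix)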

Custom Functions

Writing your own function

We already saw some custom functions in the first week, where we made some called dice() and wdice().
Can you write a function which, given a vector, returns TRUE for each entry that is an outlier and FALSE for each that is not, based on the criterion of being \(>3\) standard deviations away from the mean?

outliers <- function(obs){
 ...
 ...
 ...
}

Look through the solutions to question B8 above, and make sure that you are comfortable with how writing a function works.

Can you edit the outliers() function you wrote to make it work with vectors which include NAs?

Problem
Possible outliers in the distractor_fix variable.

Task

  1. Replace any values of the distractor_fix variable which are \(>3\) standard deviations from the mean with NA.
  2. Make a new boxplot of the variable

If you skipped questions B8 and B9, then copy this code into your document and run it. It will give you a function which takes a vector and returns TRUEs and FALSEs according to whether each entry is more than 3 standard deviations from the mean.

outliers <- function(obs, x = 3){
  abs(obs - mean(obs, na.rm=TRUE)) > (x * sd(obs, na.rm=TRUE))
}
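
One way to use the function above for the task (a sketch, continuing with the blinks_full data):

blinks_full <- blinks_full |>
  mutate(
    distractor_fix = ifelse(outliers(distractor_fix), NA, distractor_fix)
  )

# new boxplot of the variable
boxplot(blinks_full$distractor_fix)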

Summary plots

So where are we now? We have a dataset that has one row per trial per subject, and it contains information on the number of blinks, and on the time spent looking at the distractor:

head(blinks_full)
# A tibble: 6 × 6
    sub trial_no video                     nr_blinks distractor_fix    rt
  <dbl>    <dbl> <chr>                         <dbl>          <dbl> <dbl>
1     1        1 /files/vids/blinks_1.mp4          1           504.  2812
2     1        2 /files/vids/blinsk_8.mp4          8            NA   2974
3     1        3 /files/vids/blinks_1.mp4          1           707.  2257
4     1        4 /files/vids/blinks_5.mp4          5            NA     NA
5     1        5 /files/vids/blinks_4.mp4          4           223.  4546
6     1        6 /files/vids/blinks_10.mp4        10            NA     NA

Thinking way back to the top of this page, we recall that our research question is concerned with whether perceived lying (as implied by more looking at the distractor) increases with the number of blinks seen.

You might notice that the observations in our dataset aren’t independent from one another - we have 20 datapoints for each subject. And if we split it by “number of blinks”, then we have roughly 45 datapoints for each value. Independence of observations is actually an assumption that is relied upon by all the tests that we cover in this course. So we can’t actually perform an analysis on this data as it is (we’ll learn how to deal with this sort of data structure next semester in the Multivariate Statistics course).

However, it provides a useful example for us to try some plotting:

The stat_summary() function allows us to plot summarised data, rather than the data itself. If we choose to plot it as a “pointrange”, it gives us the mean and standard error of \(y\) at each level of \(x\):

ggplot(blinks_full, aes(x = nr_blinks, y = distractor_fix)) +
  stat_summary(geom = "pointrange")

This is the same as doing some grouping and summarising first, and then giving those summarised values to the plot:

blinks_full |>
  group_by(nr_blinks) |>
  summarise(mean = mean(distractor_fix, na.rm=TRUE),
            se = sd(distractor_fix, na.rm=TRUE)/sqrt(n()),
            lwr = mean-se,
            upr = mean+se
            ) |>
  ggplot(aes(x = nr_blinks, y = mean, ymin = lwr, ymax = upr)) +
  geom_pointrange()

Build a model!

We’re now finally getting to the analysis. As we said earlier, this can sometimes be very straightforward in comparison to the amount of effort involved in cleaning data.

Recall that we’re interested in whether the perception that a speaker is lying about the location of some hidden treasure (as measured by the pattern of eye fixations towards the object not referred to by the speaker) is influenced by the number of times the speaker is seen to blink while producing the utterance.

Plot the relationship between the two variables you think will be best used to answer this question.

Question B12

Fit the linear model specified below to the data using the lm() function and store the output in the environment as an object named blinks_mdl.

\[ \begin{align} & \text{Fixation time to distractor} = b_0 + b_1 \ \text{Number of blinks} + \epsilon \\ \quad \\ & \text{where} \quad \\ & \epsilon \sim N(0, \sigma) \text{ independently} \end{align} \]
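
A minimal sketch of the corresponding lm() call (variable names follow the blinks_full data shown above):

blinks_mdl <- lm(distractor_fix ~ nr_blinks, data = blinks_full)
summary(blinks_mdl)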

Question B13

The \(\epsilon \sim N(0, \sigma) \text{ independently}\) bit of our model is an assumption we have to make. It concerns the errors (the deviations of the observations from our line). Our model assumes that these are normally distributed and centered on 0. We can plot the distribution of residuals to check how well this assumption holds:

hist(residuals(blinks_mdl))

However, we also make the statement that the errors are independent - i.e. they are not related to one another.

For us, this is not the case, and so we should not be using this simple linear regression here.

In what way are we violating the assumption of independence?

Question B14

One option open to us now is to simply “aggregate up”, so that we remove the dependence between the rows of our dataset. However, this means reducing the number of rows.

This is sub-optimal (and we’ll see why next semester!), but we can calculate the average distractor_fix for each level of nr_blinks, and then use those in our model.

We can do this the same way as we saw with our summary plots - with group_by() and summarise()!
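
For example, a possible sketch (the names blinks_avg and blinks_mdl2 are just ones we have made up here):

blinks_avg <- blinks_full |>
  group_by(nr_blinks) |>
  summarise(distractor_fix = mean(distractor_fix, na.rm = TRUE))

blinks_mdl2 <- lm(distractor_fix ~ nr_blinks, data = blinks_avg)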