Research design & data

Semester 1 - Week 1

1 Getting started

Tip: Lab instructions
  • Please work through the lab exercises in small groups of 3 to 5 students.
  • You will be given some data that you will use throughout the next 5 weeks.
  • As a group, you have to produce a data analysis PDF report on those data.
  • In week 5, you will be asked to submit the PDF report, for which you will receive formative feedback in week 6.
  • One person is the driver, responsible for typing on the PC, and the rest are navigators and cannot type. Navigators are responsible for commenting on the strategy, code, and spotting typos or fixing errors. Each week you will rotate so that everyone experiences being a driver.
  • Driver: open an Rmd file, and start writing your work there.
  • Navigators: be alert and start providing suggestions and comments on the strategy and code.

Format

  • PDF file, maximum 4 sides of A4 paper; keep the default Rmd knitting settings for font and page margins.
  • Appendix with all the code in a code chunk with the option results='hide'.

The lab is structured to provide various levels of support. When attending the labs, you should directly attempt and work on the tasks. However, if you are unsure or stuck at any point, you can make use of the following help:

  • Simply raise your hand and get help from a tutor
  • Hover your mouse on the superscript number to get a hint. The hints may sometimes show multiple equivalent ways of getting an answer - you just need one way
  • Scroll down to the Worked Example section, where you can read through a worked example.

Try these steps first to register for RStudio server online:

  • Log in to EASE using your university UUN and password.
  • Set your RStudio password here, the username will be the same as your UUN (make sure you type your UUN correctly).
  • Access the server from https://rstudio.ppls.ed.ac.uk using your university UUN and the password you set in the previous step.
  • Please complete this form and wait for an email. Please note that this can take up to four working days.

  • Once you receive an email from us, please follow these instructions:

    • Set your RStudio password here; the username will be the same as your UUN (make sure you type your UUN correctly).
    • Access the server from https://rstudio.ppls.ed.ac.uk using your university UUN and the password you have just set.

Before you begin, make sure you have tinytex installed in R so that you can “Knit” your Rmd document to a PDF file:

install.packages("tinytex")
tinytex::install_tinytex()

2 Formative report A

In the first five weeks of the course you should produce a PDF report using Rmarkdown, for which you will receive formative feedback in week 6. The report should not include any reference to R code or functions, but should be written for a generic reader who is only assumed to have a basic statistical understanding, without any R knowledge. You should also avoid any R code output or printout in the PDF file.

To not show the code of an R code chunk, and only show the output, write:

```{r, echo=FALSE}
# code goes here
```

To show the code of an R code chunk, but hide the output, write:

```{r, results='hide'}
# code goes here
```

To hide both code and output of an R code chunk, write:

```{r, include=FALSE}
# code goes here
```

2.1 Data

Hollywood Movies. At the link https://uoepsy.github.io/data/hollywood_movies_subset.csv you will find data on Hollywood movies released between 2012 and 2018 from the top 5 lead studios and top 10 genres. The following variables were recorded:

  • Movie: Title of the movie
  • LeadStudio: Primary U.S. distributor of the movie
  • RottenTomatoes: Rotten Tomatoes rating (critics)
  • AudienceScore: Audience rating (via Rotten Tomatoes)
  • Genre: One of Action Adventure, Black Comedy, Comedy, Concert, Documentary, Drama, Horror, Musical, Romantic Comedy, Thriller, or Western
  • TheatersOpenWeek: Number of screens for opening weekend
  • OpeningWeekend: Opening weekend gross (in millions)
  • BOAvgOpenWeekend: Average box office income per theater, opening weekend
  • Budget: Production budget (in millions)
  • DomesticGross: Gross income for domestic (U.S.) viewers (in millions)
  • WorldGross: Gross income for all viewers (in millions)
  • ForeignGross: Gross income for foreign viewers (in millions)
  • Profitability: WorldGross as a percentage of Budget
  • OpenProfit: Percentage of budget recovered on opening weekend
  • Year: Year the movie was released
  • IQ1-IQ50: IQ score of each of 50 audience raters
  • Snacks: How many of the 50 audience raters brought snacks
  • PrivateTransport: How many of the 50 audience raters reached the cinema via private transportation

For formative report A, please only focus on the variables Movie to Year, ignoring anything beyond that. In other words, do not analyse the variables IQ1 to PrivateTransport in the next five weeks of the course. We will use those later in the course.

2.2 Tasks

For formative report A, you will be asked to perform the following tasks, each related to a week of teaching in this course:

Note: This week’s task

A1) Read the data into R, inspect it, and write a concise introduction to the data and its structure

A2) Display and describe the categorical variables
A3) Display and describe six numerical variables of your choice
A4) Display and describe a relationship of interest between two or three variables of your choice
A5) Finish the report write-up, knit to PDF, and submit the PDF for formative feedback

This week you will only focus on task A1. Below there are some guided sub-steps you may want to consider to complete task A1.

2.3 A1 sub-tasks

Tip

To see the hints, hover your cursor on the superscript numbers.

  • Read the movie data into R, and give it a useful name. Inspect the data by viewing it in RStudio. By viewing, we mean looking at the data either in the viewer or in the console.¹

  • How many observations are there?²

  • How many variables are there?³

  • What does dim(DATA) return?
  • What is the effect of appending [1] or [2]?
  • What is the type of each variable?⁴

  • What’s the minimum and maximum budget in the sample? What about the average Rotten Tomatoes rating?⁵

  • Do you notice any issues when computing the minimum and maximum Budget and the average RottenTomatoes rating?⁶

  • What is the range (i.e. minimum and maximum) of the variables in the data? What about the number of missing values for each variable?⁷

  • Write up a description of the dataset for the reader. You don’t need to show the actual data in the report; a description in words is sufficient.
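The sub-tasks above can be sketched on a tiny made-up data frame before trying them on the movie data. This is only an illustration: toy and its columns (title, budget, score) are hypothetical and are not part of the course data. It shows what indexing dim() with [1] and [2] gives you, and how missing values affect min(), max(), and mean().

```r
# Toy data frame with hypothetical values (NOT the movie data)
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  budget = c(10, NA, 35, 20),
  score  = c(55, 80, 72, NA)
)

dims   <- dim(toy)   # a vector of two numbers: rows, then columns
n_obs  <- dims[1]    # first element: number of rows, same as nrow(toy)
n_vars <- dims[2]    # second element: number of columns, same as ncol(toy)

min_budget_raw <- min(toy$budget)                # NA, because one budget is missing
min_budget     <- min(toy$budget, na.rm = TRUE)  # ignores the missing value
max_budget     <- max(toy$budget, na.rm = TRUE)
mean_score     <- mean(toy$score, na.rm = TRUE)  # mean of 55, 80 and 72

# Number of missing values in each column
n_missing <- colSums(is.na(toy))
```

Here is.na() marks each missing cell as TRUE, and colSums() counts the TRUE values column by column, which is one possible way to get the number of missing values per variable.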

3 Worked example

Consider the dataset available at https://uoepsy.github.io/data/RestaurantTips.csv, containing 157 observations on the following 7 variables:

Variable Name   Description
Bill            Size of the bill (in dollars)
Tip             Size of the tip (in dollars)
Credit          Paid with a credit card? n or y
Guests          Number of people in the group
Day             Day of the week: m=Monday, t=Tuesday, w=Wednesday, th=Thursday, or f=Friday
Server          Code for specific waiter/waitress: A, B, or C
PctTip          Tip as a percentage of the bill

These data were collected by the owner of a bistro in the US, who was interested in understanding the tipping patterns of their customers. The data are adapted from Lock et al. (2020).

library(tidyverse)  # we use read_csv and glimpse from tidyverse
tips <- read_csv("https://uoepsy.github.io/data/RestaurantTips.csv")
head(tips)
# A tibble: 6 × 7
   Bill   Tip Credit Guests Day   Server PctTip
  <dbl> <dbl> <chr>   <dbl> <chr> <chr>   <dbl>
1  23.7 10    n           2 f     A        42.2
2  36.1  7    n           3 f     B        19.4
3  32.0  5.01 y           2 f     A        15.7
4  17.4  3.61 y           2 f     B        20.8
5  15.4  3    n           2 f     B        19.5
6  18.6  2.5  n           2 f     A        13.4

head() shows by default the top 6 rows of the data. Use the n = ... option to change the default behaviour, e.g. head(<data>, n = 10).

dim(tips)
[1] 157   7

dim() returns the number of rows followed by the number of columns.

glimpse(tips)
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <chr> "n", "n", "y", "y", "n", "n", "n", "n", "n", "n", "n", "n", "n"…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f"…
$ Server <chr> "A", "B", "A", "B", "B", "A", "B", "A", "A", "B", "B", "A", "B"…
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…

glimpse is part of the tidyverse package.

An alternative to glimpse is the data “structure” function, str:

str(tips)
spc_tbl_ [157 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Bill  : num [1:157] 23.7 36.1 32 17.4 15.4 ...
 $ Tip   : num [1:157] 10 7 5.01 3.61 3 2.5 3.44 2.42 3 2 ...
 $ Credit: chr [1:157] "n" "n" "y" "y" ...
 $ Guests: num [1:157] 2 3 2 2 2 2 2 2 2 2 ...
 $ Day   : chr [1:157] "f" "f" "f" "f" ...
 $ Server: chr [1:157] "A" "B" "A" "B" ...
 $ PctTip: num [1:157] 42.2 19.4 15.7 20.8 19.5 13.4 16 12.4 12.7 10.7 ...
 - attr(*, "spec")=
  .. cols(
  ..   Bill = col_double(),
  ..   Tip = col_double(),
  ..   Credit = col_character(),
  ..   Guests = col_double(),
  ..   Day = col_character(),
  ..   Server = col_character(),
  ..   PctTip = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

or:

sapply(tips, data.class)
       Bill         Tip      Credit      Guests         Day      Server 
  "numeric"   "numeric" "character"   "numeric" "character" "character" 
     PctTip 
  "numeric" 

Example writeup

A dataset containing records on 7 variables related to tipping was obtained from https://uoepsy.github.io/data/RestaurantTips.csv, and was provided by the owner of a bistro in the US interested in studying which factors affected the tipping behaviour of the bistro’s customers. The data contain measurements for a total of 157 parties on four numeric variables: size of the bill (in dollars), size of the tip (in dollars), number of guests in the group, and tip as a percentage of the bill total. The data also include three categorical variables indicating whether or not the party paid with a credit card, the day of the week, and a server-specific identifier.
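The counts of numeric and categorical variables quoted in a write-up like this can be checked in code rather than by eye. The sketch below is one possible way to do it, shown on a small made-up data frame (toy is hypothetical, with only four columns) so that it runs without downloading anything.

```r
# Hypothetical mini data frame, for illustration only
toy <- data.frame(
  bill   = c(23.70, 36.11),
  tip    = c(10, 7),
  credit = c("n", "n"),
  guests = c(2, 3),
  stringsAsFactors = FALSE
)

# Type of each column, as in sapply(tips, data.class)
types <- sapply(toy, data.class)

# How many columns there are of each type
type_counts <- table(types)
type_counts
```

Applying the same two lines to the real data would give the number of numeric and character columns directly.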

summary(tips)
      Bill            Tip            Credit              Guests     
 Min.   : 1.66   Min.   : 0.250   Length:157         Min.   :1.000  
 1st Qu.:15.19   1st Qu.: 2.075   Class :character   1st Qu.:2.000  
 Median :20.22   Median : 3.340   Mode  :character   Median :2.000  
 Mean   :22.73   Mean   : 3.807                      Mean   :2.096  
 3rd Qu.:28.84   3rd Qu.: 5.000                      3rd Qu.:2.000  
 Max.   :70.51   Max.   :15.000                      Max.   :7.000  
                 NA's   :1                                          
     Day               Server              PctTip      
 Length:157         Length:157         Min.   :  6.70  
 Class :character   Class :character   1st Qu.: 14.30  
 Mode  :character   Mode  :character   Median : 16.20  
                                       Mean   : 17.89  
                                       3rd Qu.: 18.20  
                                       Max.   :221.00  
                                                       

summary returns a quick summary of the data.

You probably won’t understand some parts of the output above, but we will learn more in the coming weeks, so don’t worry too much about it. For the moment, you should be able to understand the minimum, maximum, and the mean.
Currently, it is not showing very informative output for the categorical variables.

We can convert each categorical variable to a factor and, at the same time, replace each level with a clearer label:

tips$Day <- factor(tips$Day, 
                   levels = c("m", "t", "w", "th", "f"),
                   labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))

tips$Credit <- factor(tips$Credit, 
                      levels = c("n", "y"),
                      labels = c("No", "Yes"))

tips$Server <- factor(tips$Server)
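To see what the levels and labels arguments do, here is a minimal sketch on a short made-up vector of day codes (day_codes is hypothetical and not taken from the tips data). It also shows why the conversion should be run only once on the same column.

```r
# Hypothetical day codes, for illustration only
day_codes <- c("m", "w", "f", "w", "th")

summary(day_codes)  # for characters, only Length, Class and Mode are reported

# levels = the codes as stored in the data, labels = what to display instead
day <- factor(day_codes,
              levels = c("m", "t", "w", "th", "f"),
              labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))

levels(day)
day_counts <- summary(day)  # for factors, one count per level (0 if a level is unused)

# Converting again with the old codes: "m", "t", ... no longer appear in day,
# so nothing matches and every value becomes NA
day_again <- factor(day, levels = c("m", "t", "w", "th", "f"))
all(is.na(day_again))
```

The order given in levels is also the order in which the categories are later displayed, which is why the days are listed from Monday to Friday rather than alphabetically.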

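Alternatively, using tidyverse, the function mutate is used to create or modify a variable (column) in the data. Run only one of the two versions: once the columns have been converted, the original codes such as "m" or "n" are no longer present, so converting a second time would turn every value into NA.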

tips <- tips %>%
    mutate(
        Day = factor(Day,
                     levels = c("m", "t", "w", "th", "f"),
                     labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")),
        Credit = factor(Credit,
                        levels = c("n", "y"),
                        labels = c("No", "Yes")),
        Server = factor(Server)
    )

The operator %>% and the function mutate are part of the tidyverse package. The former, %>%, is called the pipe.

The pipe works by taking what’s on the left and passing it to the operation on the right. For example, rounding to 2 decimal places the logarithm of the whole numbers from 1 to 10:

round(log(1:10), digits = 2)
 [1] 0.00 0.69 1.10 1.39 1.61 1.79 1.95 2.08 2.20 2.30

is equivalent to:

1:10 %>%
    log() %>%
    round(digits = 2)
 [1] 0.00 0.69 1.10 1.39 1.61 1.79 1.95 2.08 2.20 2.30
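As a further illustration of the same idea, the sketch below uses R's built-in pipe |> so that it runs without loading any package. This assumes R version 4.1 or later; for simple calls like these, %>% from tidyverse behaves in the same way. The numbers are made up.

```r
values <- c(4, 9, 16)

# Nested call: read from the inside out
nested <- sum(sqrt(values))

# Piped call: read from left to right
piped <- values |>
  sqrt() |>
  sum()

# The left-hand side becomes the first argument of the function on the right,
# so x |> round(digits = 1) is the same as round(x, digits = 1)
rounded <- 3.14159 |> round(digits = 1)
```

Both versions compute the same thing; the piped form simply lists the steps in the order in which they are carried out.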

Let’s check the result of the changes to the variable types:

glimpse(tips)
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <fct> No, No, Yes, Yes, No, No, No, No, No, No, No, No, No, No, No, N…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <fct> Friday, Friday, Friday, Friday, Friday, Friday, Friday, Friday,…
$ Server <fct> A, B, A, B, B, A, B, A, A, B, B, A, B, B, B, B, C, C, C, C, C, …
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…

summary(tips)
      Bill            Tip         Credit        Guests             Day    
 Min.   : 1.66   Min.   : 0.250   No :106   Min.   :1.000   Monday   :20  
 1st Qu.:15.19   1st Qu.: 2.075   Yes: 51   1st Qu.:2.000   Tuesday  :13  
 Median :20.22   Median : 3.340             Median :2.000   Wednesday:62  
 Mean   :22.73   Mean   : 3.807             Mean   :2.096   Thursday :36  
 3rd Qu.:28.84   3rd Qu.: 5.000             3rd Qu.:2.000   Friday   :26  
 Max.   :70.51   Max.   :15.000             Max.   :7.000                 
                 NA's   :1                                                
 Server     PctTip      
 A:60   Min.   :  6.70  
 B:65   1st Qu.: 14.30  
 C:32   Median : 16.20  
        Mean   : 17.89  
        3rd Qu.: 18.20  
        Max.   :221.00  
                        

After converting the categorical variables to factors, summary shows the count of each category.

The tip as a percentage of the bill (PctTip) has a maximum value of 221, which seems very strange: a customer is very unlikely to leave a tip worth more than twice the bill.

Let’s inspect the row where PctTip is greater than 100:

tips[tips$PctTip > 100, ]
# A tibble: 1 × 7
   Bill   Tip Credit Guests Day      Server PctTip
  <dbl> <dbl> <fct>   <dbl> <fct>    <fct>   <dbl>
1  49.6    NA Yes         4 Thursday C         221

Alternatively, using tidyverse, the function filter is used to keep only the rows that satisfy a condition:

tips %>% 
    filter(PctTip > 100)
# A tibble: 1 × 7
   Bill   Tip Credit Guests Day      Server PctTip
  <dbl> <dbl> <fct>   <dbl> <fct>    <fct>   <dbl>
1  49.6    NA Yes         4 Thursday C         221
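One detail worth knowing about the square-bracket version is how it behaves when the column being tested contains missing values. The sketch below uses a made-up data frame (bills is hypothetical) and base R only; as far as the comparison goes, filter from tidyverse behaves like the which() and subset() versions, keeping only rows where the condition is TRUE.

```r
# Made-up bills; the third percentage is missing
bills <- data.frame(bill = c(20.00, 49.59, 31.00),
                    pct  = c(15, 221, NA))

# Logical indexing returns a row of NAs wherever the test itself is NA
flagged_raw <- bills[bills$pct > 100, ]

# which() returns only the positions where the test is TRUE
flagged <- bills[which(bills$pct > 100), ]

# subset() also drops rows where the test is NA
flagged_subset <- subset(bills, pct > 100)

nrow(flagged_raw)  # includes the extra all-NA row
nrow(flagged)
```

This matters if the same check is repeated after some values have been set to NA, because the square-bracket version will then show an extra row filled with NA.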

With a bill of 49.59, the tip would be 109.59 dollars:

49.59 * 221 / 100
[1] 109.5939

Furthermore, we also notice that the tip amount for this row is not available (NA). The corresponding value of the tip as a percentage of the bill is likely a data entry error, perhaps due to typing the leading 2 twice when recording the data. We will set that value to not available (NA) with the following code:

tips$PctTip[tips$PctTip > 100] <- NA

a > b tests whether a is greater than b. a < b tests whether a is smaller than b. a == b tests whether a is equal to b; notice the double equal sign! You can also use >= (greater than or equal to) or <= (smaller than or equal to).
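These comparisons work on a whole column at once and return one TRUE or FALSE per value. A minimal sketch on a made-up vector (pct is hypothetical) shows this, and also that comparing a missing value gives NA rather than TRUE or FALSE.

```r
# Made-up percentages; the last one is missing
pct <- c(15.2, 221, 18, NA)

above_100  <- pct > 100   # one result per element
equal_18   <- pct == 18   # double equal sign for "is equal to"
at_most_18 <- pct <= 18

# TRUE counts as 1 and FALSE as 0, so sum() counts how many values pass the test
n_above <- sum(above_100, na.rm = TRUE)
```

Note that a single equal sign would not perform a test, which is why the double equal sign is needed for comparisons.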

Alternatively you can use tidyverse:

tips <- tips %>%
    mutate(
        PctTip = ifelse(PctTip > 100, NA, PctTip)
    )

Here the function ifelse selects a value depending on a condition to test: ifelse(test, value_if_true, value_if_false). In the case above, each value in the column PctTip is replaced by NA if PctTip > 100, and it is kept the same otherwise.
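The behaviour of ifelse can be checked on a short made-up vector before applying it to a whole column. In this sketch pct is hypothetical; the point is that ifelse works element by element, and that a value which was already missing stays missing.

```r
# Made-up percentages; the last one is already missing
pct <- c(15.2, 221, 18, NA)

# For each element: NA if the test is TRUE, the original value if it is FALSE
cleaned <- ifelse(pct > 100, NA, pct)
cleaned

n_missing_before <- sum(is.na(pct))
n_missing_after  <- sum(is.na(cleaned))

# The implausible value no longer inflates the mean
mean_cleaned <- mean(cleaned, na.rm = TRUE)
```

Comparing the number of missing values before and after is a quick way to confirm that only the intended values were changed.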

summary(tips)
      Bill            Tip         Credit        Guests             Day    
 Min.   : 1.66   Min.   : 0.250   No :106   Min.   :1.000   Monday   :20  
 1st Qu.:15.19   1st Qu.: 2.075   Yes: 51   1st Qu.:2.000   Tuesday  :13  
 Median :20.22   Median : 3.340             Median :2.000   Wednesday:62  
 Mean   :22.73   Mean   : 3.807             Mean   :2.096   Thursday :36  
 3rd Qu.:28.84   3rd Qu.: 5.000             3rd Qu.:2.000   Friday   :26  
 Max.   :70.51   Max.   :15.000             Max.   :7.000                 
                 NA's   :1                                                
 Server     PctTip     
 A:60   Min.   : 6.70  
 B:65   1st Qu.:14.30  
 C:32   Median :16.15  
        Mean   :16.59  
        3rd Qu.:18.05  
        Max.   :42.20  
        NA's   :1      

Example writeup

The average bill size was $22.73, and the average tip was $3.81, corresponding to roughly 17% of the total bill. Out of 157 parties, only 51 paid with a credit card. Most parties consisted of around 2 people, and people tended to go to the restaurant most often on Wednesdays. Among the three servers, server C served the fewest parties. The data also included a missing tip value, corresponding to a bill of $49.59, and a data entry error in the corresponding measure of the tip as a percentage of the total bill.

4 Student Glossary

To conclude the lab, create a glossary of R functions. You can do so by opening Microsoft Word, Excel, or OneNote and creating a table with two columns: one where you should write the name of an R function, and the other column where you should provide a brief description of what the function does.

This “do it yourself” glossary is an opportunity for you to revise what you have learned in today’s lab and write down a few take-home messages. You will find this glossary handy as a reference to keep next to you when you are doing the assessed weekly quizzes.

Below you can find an example to get you started:

Function    Use and package
read_csv    For reading comma separated value files. Part of tidyverse package
View        ?
head        ?
nrow        ?
ncol        ?
dim         ?
glimpse     ?
str         ?
summary     ?
factor      ?

References

Lock, Robin H, Patti Frazer Lock, Kari Lock Morgan, Eric F Lock, and Dennis F Lock. 2020. Statistics: Unlocking the Power of Data. John Wiley & Sons.

Footnotes

  1. Hint: View(DATA)
    or head(DATA)

  2. Hint: nrow(DATA)
    or dim(DATA)[1]

  3. Hint: ncol(DATA)
    or dim(DATA)[2]

  4. Hint: glimpse(DATA) from tidyverse
    or str(DATA)
    or sapply(DATA, data.class)

  5. Hint: summary(DATA)
    or min(DATA$VARIABLE) and max(DATA$VARIABLE)
    Hint: mean(DATA$VARIABLE)

  6. For some movies, data on the budget or Rotten Tomatoes rating are not available (NA). These are also called missing values.
    If you use the functions min(), max(), or mean() on a variable with missing values, you will get NA as a result. This is because you cannot compute a summary of something you don’t know. For example, what is the mean of 5, 10, and NA? How would I compute (5 + 10 + NA) / 3? I don’t know, so it remains NA.
    You can tell R to ignore the missing values by writing min(DATA$VARIABLE, na.rm = TRUE), and similarly for max and mean.
    In contrast, summary() does this for you automatically and immediately tells you whether a variable has any NAs and how many.

  7. Hint: summary(DATA) and nrow(DATA)