Research design & data

Semester 1 - Week 1

1 Instructions

1.1 Lab format

  • At the start of each teaching block, you will be given a dataset that you will use throughout the labs of that block’s five weeks. By the end of each block your group should have produced a report that analyses the given dataset.
  • The reports are due:
    • Formative Report A (Block 1): 12 noon, Friday 20th October 2023.
    • Formative Report B (Block 2): 12 noon, Friday 1st December 2023.
    • Formative Report C (Block 3): 12 noon, Friday 16th February 2024.
    • Assessed Report (Block 4): 12 noon, Friday 29th March 2024.
  • You will be required to submit a PDF file, not the Rmd file used to create the PDF.
  • No extensions. As these are group-based submissions, no extensions will be given.
  • You will receive formative feedback on each of the formative reports the week after the report due date. This will be signposted via announcements.

1.2 Group setup

  • Work through the lab tasks in groups of up to 5 students.
  • In each group, each week one person is the driver and the rest are the navigators.
    • The driver is responsible for typing on the PC keyboard for that week.
    • The navigators are responsible for commenting on the strategy, code, and spotting typos or fixing errors.
    • Each week the driver will rotate so that everyone experiences being a driver at least once.
  • Driver: download the template Rmd file, upload it to RStudio server online, and start writing your work there. Don’t forget to save your file regularly via File -> Save.
  • Navigators: be alert and start providing suggestions and comments on the strategy and code.

Template Rmd file

Click here to download the template Rmd file

Complete it in the following weeks, and follow instructions in week 5 on how to “knit” and submit the PDF file.

1.3 Report format

  • Each submitted report must be a PDF file of max 6 sides of A4 paper.
  • Keep the default settings in terms of Rmd knitting font and page margins.
  • At the end of the file, you will place the appendices and these will not count towards the page limit.
    • You can include an optional appendix for additional tables and figures which you can’t fit in the main part of the report;
    • You must include a compulsory appendix listing all of the R code used in the report. This is done automatically if you end your file with the following section, which is already included in the template Rmd file:
# Appendix: R code

```{r ref.label=knitr::all_labels(), echo=TRUE, eval=FALSE}

```

1.4 Lab help and support

The lab is structured to provide various levels of support. When attending a lab, you should prioritise completing that week’s tasks. However, if you are unsure or stuck at any point, you should make use of all the available help:

  • Raise your hand to get help from a tutor;
  • Hover your mouse on the superscript number to get a quick hint. The hints may sometimes show multiple equivalent ways of getting an answer - you just need one way;
  • Scroll down to the Worked Example section, where you can read through a worked example.
  • Even if you don’t use the Worked Example to complete the tasks, ensure you review and study its content during your independent study time.

1.5 Important steps

1.5.1 Did you register for RStudio Server Online?

Try these steps first to register for RStudio server online:

  • Log in to EASE using your university UUN and password.
  • Set your RStudio password here; the username will be the same as your UUN (make sure you type your UUN correctly).
  • Access the server from https://rstudio.ppls.ed.ac.uk using your university UUN and the password you set in the previous step.
  • If the steps above do not work, please complete this form and wait for an email. Please note that this can take up to four working days.

  • Once you receive an email from us, please follow these instructions:

    • Set your RStudio password here; the username will be the same as your UUN (make sure you type your UUN correctly).
    • Access the server from https://rstudio.ppls.ed.ac.uk using your university UUN and the password you just set.

1.5.2 Install tinytex

Every single student, when logging into their personal RStudio Server account, must do the following at least once. In other words, everyone in each group has to do it at some point in their own RStudio when they are the driver.

In order to generate a PDF file from RStudio, you must have a package called tinytex installed. This allows you to “Knit” your Rmd document (i.e. combine together text, code, and output) to produce a PDF file. Copy and paste the following code in your console, and press Enter.

install.packages("tinytex")
tinytex::install_tinytex()

2 Formatting resources

These will be useful in week 5 when finalising your report formatting prior to submission.

2.1 APA style

Check the following guide for reporting numbers and statistics in APA style (7th edition).

2.2 Hiding code and/or output

To not show the code of an R code chunk, and only show the output, write:

```{r, echo=FALSE}
# code goes here
```

To show the code of an R code chunk, but hide the output, write:

```{r, results='hide'}
# code goes here
```

To hide both code and output of an R code chunk, write:

```{r, include=FALSE}
# code goes here
```
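Chunk options can also be combined. As a sketch that is not part of the template, the chunk below uses the knitr options message=FALSE and warning=FALSE alongside echo=FALSE, so that the output is shown while the code, any package start-up messages, and any warnings are hidden. This can be handy for a chunk that loads packages, since the report should not contain R printouts:

```{r, echo=FALSE, message=FALSE, warning=FALSE}
# code goes here: its output is shown, but the code, messages, and warnings are not
```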

2.3 Checklist for successful knitting

If you encounter errors when knitting the Rmd file, go through the following checklist to try to find the source of the errors.

3 Formative Report A

In the first five weeks of the course your group should produce a PDF report (using R Markdown) for which you will receive formative feedback in week 6.

The report should not include any reference to R code or functions, but be written for a generic reader who is only assumed to have a basic statistical understanding without any R knowledge. You should also avoid any R code output or printout in the PDF file.

3.1 Data

Hollywood Movies. At the link https://uoepsy.github.io/data/hollywood_movies_subset.csv you will find data on Hollywood movies released between 2012 and 2018 from the top 5 lead studios and top 10 genres. The following variables were recorded:

Variable          Description
Movie             Title of the movie
LeadStudio        Primary U.S. distributor of the movie
RottenTomatoes    Rotten Tomatoes rating (critics)
AudienceScore     Audience rating (via Rotten Tomatoes)
Genre             One of Action Adventure, Black Comedy, Comedy, Concert, Documentary, Drama, Horror, Musical, Romantic Comedy, Thriller, or Western
TheatersOpenWeek  Number of screens for opening weekend
OpeningWeekend    Opening weekend gross (in millions)
BOAvgOpenWeekend  Average box office income per theater, opening weekend
Budget            Production budget (in millions)
DomesticGross     Gross income for domestic (U.S.) viewers (in millions)
WorldGross        Gross income for all viewers (in millions)
ForeignGross      Gross income for foreign viewers (in millions)
Profitability     WorldGross as a percentage of Budget
OpenProfit        Percentage of budget recovered on opening weekend
Year              Year the movie was released
IQ1-IQ50          (ignore for Formative report A) IQ score of each of 50 audience raters
Snacks            (ignore for Formative report A) How many of the 50 audience raters bought snacks
PrivateTransport  (ignore for Formative report A) How many of the 50 audience raters reached the cinema via private transportation

For formative report A, please only focus on the variables Movie to Year, ignoring anything beyond that. In other words, do not analyse the variables IQ1 to PrivateTransport in the next five weeks of the course. We will use those later in the course.

3.2 Tasks

For formative report A, you will be asked to perform the following tasks, each related to a week of teaching in this course.
This week’s task is highlighted in bold below. Please only focus on completing that task this week. In the next section, you will also find guided sub-steps you may want to consider to complete this week’s task.

A1) Read the data into R, inspect it, and write a concise introduction to the data and its structure.
A2) Display and describe the categorical variables.
A3) Display and describe six numerical variables of your choice.
A4) Display and describe a relationship of interest between two or three variables of your choice.
A5) Finish the report write-up, knit to PDF, and submit the PDF for formative feedback.

3.3 A1 sub-tasks

This week you will only focus on task A1. Below there are some guided sub-steps you may want to consider to complete task A1.

Tip

To see the hints, hover your cursor over the superscript numbers.

  • Read the movie data into R, and give it a useful name. Inspect the data by viewing it in RStudio. By viewing, we mean looking at the data either in the viewer or in the console.¹

  • How many observations are there?²

  • How many variables are there?³

    • What does dim(DATA) return?
    • What is the effect of appending [1] or [2]?

  • What is the type of each variable?⁴

  • What are the minimum and maximum budget in the sample? What about the average Rotten Tomatoes rating?⁵

  • Do you notice any issues when computing the minimum and maximum Budget and the average RottenTomatoes rating?⁶

  • What is the range (i.e. minimum and maximum) of the variables in the data? What about the number of missing values for each variable?⁷

  • Write up a description of the dataset for the reader. You don’t need to show the actual data in the report; a description in words is sufficient for the reader.
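The inspection steps above can be tried on a small made-up data frame before applying them to the movie data. The sketch below uses base R only; the object toy and its columns are hypothetical and are not part of the lab dataset.

```r
# A small made-up data frame (hypothetical values) with some missing entries
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  budget = c(10, NA, 35, 50),
  rating = c(80, 65, NA, NA)
)

dim(toy)                 # number of rows, then number of columns
n_obs  <- dim(toy)[1]    # first element: rows, same as nrow(toy)
n_vars <- dim(toy)[2]    # second element: columns, same as ncol(toy)

# Number of missing values in each column
n_missing <- colSums(is.na(toy))
n_missing
```

Appending [1] or [2] simply picks the first or second element of the vector returned by dim().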

4 Worked example

Consider the dataset available at https://uoepsy.github.io/data/RestaurantTips.csv, containing 157 observations on the following 7 variables:

Variable Name  Description
Bill           Size of the bill (in dollars)
Tip            Size of the tip (in dollars)
Credit         Paid with a credit card? n or y
Guests         Number of people in the group
Day            Day of the week: m=Monday, t=Tuesday, w=Wednesday, th=Thursday, or f=Friday
Server         Code for specific waiter/waitress: A, B, or C
PctTip         Tip as a percentage of the bill

These data were collected by the owner of a bistro in the US, who was interested in understanding the tipping patterns of their customers. The data are adapted from Lock et al. (2020).

We load the tidyverse package as we will use the functions read_csv and glimpse from this package.

library(tidyverse)
tips <- read_csv("https://uoepsy.github.io/data/RestaurantTips.csv")

read_csv is the function to read CSV (comma separated values) files. Once we have read the file, it is stored into an object called tips using the arrow (<-).

head(tips)
# A tibble: 6 × 7
   Bill   Tip Credit Guests Day   Server PctTip
  <dbl> <dbl> <chr>   <dbl> <chr> <chr>   <dbl>
1  23.7 10    n           2 f     A        42.2
2  36.1  7    n           3 f     B        19.4
3  32.0  5.01 y           2 f     A        15.7
4  17.4  3.61 y           2 f     B        20.8
5  15.4  3    n           2 f     B        19.5
6  18.6  2.5  n           2 f     A        13.4

head() shows by default the top 6 rows of the data. Use the n = ... option to change the default behaviour, e.g. head(<data>, n = 10).
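As a quick sketch of the n option, the code below uses mtcars, a small example dataset that ships with R (it is used here only for illustration and is unrelated to the lab data). The base R function tail() is the counterpart that works from the bottom of the data.

```r
# mtcars is a small example dataset that ships with R
first_six  <- head(mtcars)           # default: first 6 rows
first_ten  <- head(mtcars, n = 10)   # first 10 rows
last_three <- tail(mtcars, n = 3)    # last 3 rows

nrow(first_six)
nrow(first_ten)
nrow(last_three)
```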

dim(tips)
[1] 157   7

dim() returns the number of rows and the number of columns in the data, in that order.

glimpse(tips)
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <chr> "n", "n", "y", "y", "n", "n", "n", "n", "n", "n", "n", "n", "n"…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f"…
$ Server <chr> "A", "B", "A", "B", "B", "A", "B", "A", "A", "B", "B", "A", "B"…
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…

glimpse is part of the tidyverse package.

An alternative to glimpse is the data “structure” function, str:

str(tips)
spc_tbl_ [157 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Bill  : num [1:157] 23.7 36.1 32 17.4 15.4 ...
 $ Tip   : num [1:157] 10 7 5.01 3.61 3 2.5 3.44 2.42 3 2 ...
 $ Credit: chr [1:157] "n" "n" "y" "y" ...
 $ Guests: num [1:157] 2 3 2 2 2 2 2 2 2 2 ...
 $ Day   : chr [1:157] "f" "f" "f" "f" ...
 $ Server: chr [1:157] "A" "B" "A" "B" ...
 $ PctTip: num [1:157] 42.2 19.4 15.7 20.8 19.5 13.4 16 12.4 12.7 10.7 ...
 - attr(*, "spec")=
  .. cols(
  ..   Bill = col_double(),
  ..   Tip = col_double(),
  ..   Credit = col_character(),
  ..   Guests = col_double(),
  ..   Day = col_character(),
  ..   Server = col_character(),
  ..   PctTip = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

or:

sapply(tips, data.class)
       Bill         Tip      Credit      Guests         Day      Server 
  "numeric"   "numeric" "character"   "numeric" "character" "character" 
     PctTip 
  "numeric" 
Example writeup

A dataset containing records on 7 variables related to tipping was obtained from https://uoepsy.github.io/data/RestaurantTips.csv, and was provided by the owner of a bistro in the US interested in studying which factors affected the tipping behaviour of the bistro’s customers. The data contains measurements for a total of 157 parties on four numeric variables: size of the bill (in dollars), size of the tip, number of guests in the group, and tip as a percentage of the bill total. The data also includes three categorical variables indicating whether or not the party paid with a credit card, the day of the week, as well as a server-specific identifier.

summary(tips)
      Bill            Tip            Credit              Guests     
 Min.   : 1.66   Min.   : 0.250   Length:157         Min.   :1.000  
 1st Qu.:15.19   1st Qu.: 2.075   Class :character   1st Qu.:2.000  
 Median :20.22   Median : 3.340   Mode  :character   Median :2.000  
 Mean   :22.73   Mean   : 3.807                      Mean   :2.096  
 3rd Qu.:28.84   3rd Qu.: 5.000                      3rd Qu.:2.000  
 Max.   :70.51   Max.   :15.000                      Max.   :7.000  
                 NA's   :1                                          
     Day               Server              PctTip      
 Length:157         Length:157         Min.   :  6.70  
 Class :character   Class :character   1st Qu.: 14.30  
 Mode  :character   Mode  :character   Median : 16.20  
                                       Mean   : 17.89  
                                       3rd Qu.: 18.20  
                                       Max.   :221.00  
                                                       

summary returns a quick summary of the data, i.e. a list of numerical summaries.

You probably won’t understand some parts of the output above, but we will learn more in the coming weeks, so don’t worry too much about it. For the moment, you should be able to understand the minimum, maximum, and the mean.
Currently, the output is not very informative for the categorical variables, because they are stored as text (character) rather than as factors, the type R uses for categorical variables.

We can replace each factor level with a clearer label. The following code takes the column Day from the tips data and assigns a new label “Monday” to the level “m”, etc.

tips$Day <- factor(tips$Day, 
                   levels = c("m", "t", "w", "th", "f"),
                   labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))

tips$Credit <- factor(tips$Credit, 
                      levels = c("n", "y"),
                      labels = c("No", "Yes"))

tips$Server <- factor(tips$Server)
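To see what levels and labels do, here is a minimal sketch on a made-up vector of day codes (the vector days is hypothetical). Note that any value not listed in levels becomes NA, which is also what would happen if the recoding were run a second time on a column that has already been recoded.

```r
days <- c("m", "f", "w", "m", "x")   # "x" is deliberately not a valid code

days_f <- factor(days,
                 levels = c("m", "t", "w", "th", "f"),
                 labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))

days_f            # the unknown code "x" has become NA
levels(days_f)    # the labels, in the order given
table(days_f)     # count per day, including days with a count of zero
```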

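Alternatively, using tidyverse, the function mutate is used to change (or create) a variable (column) in the data. Use either this version or the one above, not both: re-running the recoding on columns that are already factors would turn their values into NA.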

tips <- tips %>%
    mutate(
        Day = factor(Day,
                     levels = c("m", "t", "w", "th", "f"),
                     labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")),
        Credit = factor(Credit,
                        levels = c("n", "y"),
                        labels = c("No", "Yes")),
        Server = factor(Server)
    )

The functions %>% and mutate are part of the tidyverse package. The former, %>%, is called the pipe.

The pipe works by taking what’s on the left and passing it to the operation on the right. For example, rounding to 2 decimal places the logarithm of the whole numbers from 1 to 10:

round(log(1:10), digits = 2)
 [1] 0.00 0.69 1.10 1.39 1.61 1.79 1.95 2.08 2.20 2.30

is equivalent to:

1:10 %>%
    log() %>%
    round(digits = 2)
 [1] 0.00 0.69 1.10 1.39 1.61 1.79 1.95 2.08 2.20 2.30
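
As an aside, and assuming R version 4.1 or later, base R also provides a native pipe, |>, which gives the same result here without loading any package. This is a sketch for comparison only; the lab materials use %>%.

```r
nested <- round(log(1:10), digits = 2)

piped <- 1:10 |>
    log() |>
    round(digits = 2)

identical(nested, piped)   # both approaches give the same vector
```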

Let’s check the result of the changes to the variable types:

glimpse(tips)
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <fct> No, No, Yes, Yes, No, No, No, No, No, No, No, No, No, No, No, N…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <fct> Friday, Friday, Friday, Friday, Friday, Friday, Friday, Friday,…
$ Server <fct> A, B, A, B, B, A, B, A, A, B, B, A, B, B, B, B, C, C, C, C, C, …
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…
summary(tips)
      Bill            Tip         Credit        Guests             Day    
 Min.   : 1.66   Min.   : 0.250   No :106   Min.   :1.000   Monday   :20  
 1st Qu.:15.19   1st Qu.: 2.075   Yes: 51   1st Qu.:2.000   Tuesday  :13  
 Median :20.22   Median : 3.340             Median :2.000   Wednesday:62  
 Mean   :22.73   Mean   : 3.807             Mean   :2.096   Thursday :36  
 3rd Qu.:28.84   3rd Qu.: 5.000             3rd Qu.:2.000   Friday   :26  
 Max.   :70.51   Max.   :15.000             Max.   :7.000                 
                 NA's   :1                                                
 Server     PctTip      
 A:60   Min.   :  6.70  
 B:65   1st Qu.: 14.30  
 C:32   Median : 16.20  
        Mean   : 17.89  
        3rd Qu.: 18.20  
        Max.   :221.00  
                        

After making categorical variables factors, summary shows the count of each category for the categorical variables.

The tip as a percentage of the bill (PctTip) has a maximum value of 221, which seems very strange: a customer is very unlikely to tip more than twice their bill total.

Let’s inspect the row where PctTip is greater than 100:

tips[tips$PctTip > 100, ]
# A tibble: 1 × 7
   Bill   Tip Credit Guests Day      Server PctTip
  <dbl> <dbl> <fct>   <dbl> <fct>    <fct>   <dbl>
1  49.6    NA Yes         4 Thursday C         221

Alternatively, using tidyverse, the function filter keeps only the rows that satisfy a condition:

tips %>% 
    filter(PctTip > 100)
# A tibble: 1 × 7
   Bill   Tip Credit Guests Day      Server PctTip
  <dbl> <dbl> <fct>   <dbl> <fct>    <fct>   <dbl>
1  49.6    NA Yes         4 Thursday C         221
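
One difference between the two approaches is worth knowing, sketched below on a made-up data frame (toy and its columns are hypothetical). With square brackets, a missing value in the condition produces an extra row filled with NA, whereas filter drops such rows. The sketch uses base R only, with which() and subset() showing the same row-dropping behaviour as filter.

```r
toy <- data.frame(id = 1:4, pct = c(15, 221, NA, 18))

by_brackets <- toy[toy$pct > 100, ]          # keeps a row of NAs for the missing pct
by_which    <- toy[which(toy$pct > 100), ]   # which() ignores NA in the condition
by_subset   <- subset(toy, pct > 100)        # subset() also drops NA conditions

nrow(by_brackets)
nrow(by_which)
nrow(by_subset)
```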

With a bill of 49.6, the tip would be 109.62 dollars:

49.6 * 221 / 100
[1] 109.616
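
As a sanity check on how PctTip relates to the other columns, the first three rows printed by glimpse earlier are consistent with PctTip being 100 * Tip / Bill, rounded to one decimal place. This is a sketch using values copied from that output; the rounding rule is an assumption inferred from the numbers, not something stated by the data provider.

```r
# Bill and Tip for the first three rows, copied from the glimpse output
bill <- c(23.70, 36.11, 31.99)
tip  <- c(10.00, 7.00, 5.01)

pct_tip <- round(100 * tip / bill, digits = 1)
pct_tip   # compare with the first three PctTip values: 42.2, 19.4, 15.7
```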

Furthermore, we also notice that the tip amount for this row is not available (NA). The corresponding value of the tip as a percentage of the bill is therefore likely a data entry error, perhaps due to double typing the leading 2 when recording the data. We will set that value to not available (NA) with the following code:

tips$PctTip[tips$PctTip > 100] <- NA

a > b tests whether a is greater than b. a < b tests whether a is smaller than b. a == b tests whether a is equal to b; notice the double equal sign! You can also use >= (greater than or equal to) or <= (smaller than or equal to).

Alternatively you can use tidyverse:

tips <- tips %>%
    mutate(
        PctTip = ifelse(PctTip > 100, NA, PctTip)
    )

Here the function ifelse selects a value depending on a condition to test: ifelse(test, value_if_true, value_if_false). In the case above, each value in the column PctTip is replaced by NA if PctTip > 100, and it is kept the same otherwise.
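
ifelse works element by element, which can be seen in this minimal sketch on a made-up vector (pct is hypothetical and stands in for the PctTip column):

```r
pct <- c(15.7, 221, 20.8)

# Replace values above 100 with NA, keep the others unchanged
cleaned <- ifelse(pct > 100, NA, pct)
cleaned
```

Only the second element is replaced; the first and third are returned as they were.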

summary(tips)
      Bill            Tip         Credit        Guests             Day    
 Min.   : 1.66   Min.   : 0.250   No :106   Min.   :1.000   Monday   :20  
 1st Qu.:15.19   1st Qu.: 2.075   Yes: 51   1st Qu.:2.000   Tuesday  :13  
 Median :20.22   Median : 3.340             Median :2.000   Wednesday:62  
 Mean   :22.73   Mean   : 3.807             Mean   :2.096   Thursday :36  
 3rd Qu.:28.84   3rd Qu.: 5.000             3rd Qu.:2.000   Friday   :26  
 Max.   :70.51   Max.   :15.000             Max.   :7.000                 
                 NA's   :1                                                
 Server     PctTip     
 A:60   Min.   : 6.70  
 B:65   1st Qu.:14.30  
 C:32   Median :16.15  
        Mean   :16.59  
        3rd Qu.:18.05  
        Max.   :42.20  
        NA's   :1      
Example writeup

The average bill size was $22.73, and the average tip was $3.81, corresponding to roughly 17% of the total bill. Out of 157 parties, only 51 paid with a credit card. Most parties consisted of around 2 people, and people tended to go to that restaurant more often on Wednesday. Among the three servers, server C was the one that served the fewest parties. The data also included a missing tipping value, corresponding to a bill of $49.59, and a data inputting error for the corresponding measure of the tip as a percentage of the total bill.

5 Student Glossary

To conclude the lab, create a glossary of R functions. You can do so by opening Microsoft Word, Excel, or OneNote and creating a table with two columns: one where you should write the name of an R function, and the other column where you should provide a brief description of what the function does.

This “do it yourself” glossary is an opportunity for you to revise what you have learned in today’s lab and write down a few take-home messages. You will find this glossary handy as a reference to keep next to you when you are doing the assessed weekly quizzes.

Below you can find an example to get you started:

Function  Use and package
read_csv  For reading comma separated value files. Part of tidyverse package
View      ?
head      ?
nrow      ?
ncol      ?
dim       ?
glimpse   ?
str       ?
summary   ?
factor    ?

References

Lock, Robin H, Patti Frazer Lock, Kari Lock Morgan, Eric F Lock, and Dennis F Lock. 2020. Statistics: Unlocking the Power of Data. John Wiley & Sons.

Footnotes

  1. Hint: To read the data use read_csv() from the tidyverse package.
    To preview the data, use View(DATA) or head(DATA)

  2. Hint: nrow(DATA)
    or dim(DATA)[1]

  3. Hint: ncol(DATA)
    or dim(DATA)[2]

  4. Hint: glimpse(DATA) from tidyverse
    or str(DATA)
    or sapply(DATA, data.class)

  5. Hint: summary(DATA)
    or min(DATA$VARIABLE) and max(DATA$VARIABLE)
    Hint: mean(DATA$VARIABLE)

  6. For some movies, data on the budget or Rotten Tomatoes rating are not available (NA). These are also called missing values.
    If you used the functions min(), max(), mean() you will get NA as a result. This is because if a value is missing, you cannot compute the mean of something you don’t know. For example, what is the mean of 5, 10, and NA? How would I compute (5 + 10 + NA) / 3? I don’t know, so it remains NA.
    You can tell R to ignore the missing values by saying min(DATA$VARIABLE, na.rm = TRUE) and similarly for max and mean.
    Instead, summary() does this for you automatically and immediately tells you if a variable had any NAs and how many.

  7. Hint: summary(DATA) and nrow(DATA)
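
The behaviour described in hint 6 can be checked directly with the three values used there. This is a small base R sketch; the vector x is only for illustration.

```r
x <- c(5, 10, NA)

mean(x)                 # NA, because one value is unknown
mean(x, na.rm = TRUE)   # 7.5, the mean of 5 and 10
min(x, na.rm = TRUE)    # 5
max(x, na.rm = TRUE)    # 10
sum(is.na(x))           # 1 missing value
```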