Research design & data

Semester 1 - Week 1

0 Setup

Practicalities:

  • Sit in tables of at most 5 students per table.
  • Check your table name and register for that group on LEARN. To do so, navigate to the course LEARN page > click Groups > select Labs_1_2_3_4 > find your group > click Join.
  • Each week, one person at each table will be the “driver”, responsible for typing on the PC, while the others are the “navigators”, responsible for spotting typos and suggesting the strategy/code the driver should use to answer the tasks. This is based on a novel pedagogical practice called pair programming.
  • The person who is the “driver” will rotate every week, so that everyone experiences being a driver at least once.
  • Driver:
    • Log into the RStudio Server Online: go to the course LEARN page > click Quick Links > click RStudio Server Online (by Noteable) > select RStudio from the dropdown > click Start.
    • Click here to download the template Rmd file
    • Upload the template Rmd file to the RStudio server and add your work to that file.
    • Save your work regularly by clicking File > Save.
    • At the end of the lab, download the Rmd file and then share it on your Group Discussion Space.
  • Navigators:
    • Have the lab tasks open and be ready to tell the driver what to do.
  • If you have questions, raise your hand and a tutor will come over to help!

Overview:

  • At the start of each teaching block (recall that each block is five weeks long), you will be given a dataset that you will use throughout the labs of that block. By the end of each block your group should have produced a report that analyses the given dataset.
  • The reports are due by:
    • Formative Report A (Block 1): 12 noon, Friday the 17th of October 2025.
    • Formative Report B (Block 2): 12 noon, Friday the 28th of November 2025.
    • Formative Report C (Block 3): 12 noon, Friday the 13th of February 2026.
    • Assessed Report (Block 4): 12 noon, Friday the 27th of March 2026.
  • You will be required to submit a PDF file, not the Rmd file used to create the PDF.
  • No extensions will be available for the formative reports.
  • You will receive written formative feedback on each of the formative reports the week after the report due date. This will be signposted via announcements.

Lab help and support:

  • The lab is structured to provide various levels of support.
  • When attending a lab, you should put away distractions and prioritise completing that week’s tasks to make the most of the help available.
  • If you are unsure or stuck at any point, you should make use of all the available help:
    • Raise your hand to get help from a tutor;
    • Hover your mouse on the superscript number to get a quick hint. The hints may sometimes show multiple equivalent ways of getting an answer - you just need one way;
    • Scroll down to the Worked Example section, where you can read through a worked example.
    • Even if you don’t use the Worked Example to complete the tasks, ensure you review and study its content during your independent study time.

1 Formative Report A

Formative Report A covers the labs from weeks 1-5 of the DAPR1 course in semester 1. You’ll need to create a PDF report using RMarkdown, which must be submitted by 12 noon on Friday, 17th October 2025. No extensions are available as this is a formative report. Expect written formative feedback in week 6 of semester 1.

Your report should be tailored for a reader with basic statistical knowledge and should not include any references to R code or functions in the main report write-up. Instead, keep the main report focused on text, figures, and tables. All R code should be included in the compulsory Appendix B for reproducibility, which is automatically created for you in the template Rmd file (do not edit that section). If you need to add extra tables or figures that don’t fit in the main part of the report, you can use an optional Appendix A. Remember, the main report should be a PDF file and should not exceed six sides of A4 paper, though appendices at the end don’t count towards this limit.

Make sure to use the default settings for font and page margins in your RMarkdown file. Also, make sure your report title includes your group name, in the form Group NAME.LETTER, and list the exam numbers of all group members in the author section.

On this page you can find resources to help you with your report formatting.

1.1 Data

Hollywood Movies. At the link https://uoepsy.github.io/data/hollywood_movies_subset.csv you will find data on Hollywood movies released between 2012 and 2018 from the top 5 lead studios and top 10 genres.

The following variables were recorded. For formative report A, please only focus on the variables Movie to Year, ignoring anything beyond that. In other words, do not analyse the variables IQ1 to PrivateTransport in the next five weeks of the course. We will use those later in the course.

Variable          Description
Movie             Title of the movie
LeadStudio        Primary U.S. distributor of the movie
RottenTomatoes    Rotten Tomatoes rating (critics)
AudienceScore     Audience rating (via Rotten Tomatoes)
Genre             One of Action Adventure, Black Comedy, Comedy, Concert, Documentary, Drama, Horror, Musical, Romantic Comedy, Thriller, or Western
TheatersOpenWeek  Number of screens for opening weekend
OpeningWeekend    Opening weekend gross (in millions)
BOAvgOpenWeekend  Average box office income per theater, opening weekend
Budget            Production budget (in millions)
DomesticGross     Gross income for domestic (U.S.) viewers (in millions)
WorldGross        Gross income for all viewers (in millions)
ForeignGross      Gross income for foreign viewers (in millions)
Profitability     WorldGross as a percentage of Budget
OpenProfit        Percentage of budget recovered on opening weekend
Year              Year the movie was released
IQ1-IQ50          (ignore for Formative report A) IQ score of each of 50 audience raters
Snacks            (ignore for Formative report A) How many of the 50 audience raters bought snacks
PrivateTransport  (ignore for Formative report A) How many of the 50 audience raters reached the cinema via private transportation

1.2 This week’s task

Task A1

A1) Read the data into R, inspect it, and write a concise introduction to the data and its structure

Sub-steps

Below there are sub-steps you need to consider to complete this week’s task.

Tip

To see the hints, hover your cursor on the superscript numbers.

  • Read the movie data into R, and give it a useful name. Inspect the data by viewing it in RStudio. By viewing, we mean looking at the data either in the viewer or in the console.1

  • How many observations are there?2

  • How many variables are there?3

  • What does dim(DATA) return?
  • What is the effect of appending [1] or [2] to dim(DATA)?
  • What is the type of each variable?4

  • What’s the minimum and maximum budget in the sample? What about the minimum and maximum Rotten Tomatoes rating?5

  • Do you notice any issues when computing the minimum/maximum Budget and the minimum/maximum RottenTomatoes rating?6

  • Which variables have missing values in the dataset, and how many missing values does each have?7

  • Write up a description of the dataset for the reader. You don’t need to show the actual data in the report; a description in words is sufficient.

2 Worked example

Consider the dataset available at https://uoepsy.github.io/data/RestaurantTips.csv, containing 157 observations on the following 7 variables:

Variable Name  Description
Bill           Size of the bill (in dollars)
Tip            Size of the tip (in dollars)
Credit         Paid with a credit card? n or y
Guests         Number of people in the group
Day            Day of the week: m=Monday, t=Tuesday, w=Wednesday, th=Thursday, or f=Friday
Server         Code for specific waiter/waitress: A, B, or C
PctTip         Tip as a percentage of the bill

These data were collected by the owner of a bistro in the US, who was interested in understanding the tipping patterns of their customers. The data are adapted from Lock et al. (2020).

We load the tidyverse package as we will use the functions read_csv and glimpse from this package.

library(tidyverse)

tips <- read_csv("https://uoepsy.github.io/data/RestaurantTips.csv")

read_csv is the function to read CSV (comma separated values) files. Once we have read the file, it is stored into an object called tips using the arrow (<-).
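
The read-and-store step can also be tried without an internet connection. The following is a minimal sketch, assuming a tiny made-up CSV file written to a temporary location. The names csv_path and toy are hypothetical, and base R's read.csv() is swapped in for read_csv() so that no package is needed; the two behave similarly for a simple file like this one, although read_csv() returns a tibble rather than a plain data frame.

```r
# Write a tiny, made-up CSV file to a temporary location (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Bill,Tip,Guests",
             "23.70,10.00,2",
             "36.11,7.00,3",
             "31.99,5.01,2"),
           csv_path)

# Read the file and store the result in an object called `toy` using <-.
# read.csv() is the base R counterpart of read_csv().
toy <- read.csv(csv_path)

# Typing the name of the object prints its contents.
toy
```

Nothing is printed by the assignment itself; the data only appear once the object name is typed on its own line.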

head(tips)
# A tibble: 6 × 7
   Bill   Tip Credit Guests Day   Server PctTip
  <dbl> <dbl> <chr>   <dbl> <chr> <chr>   <dbl>
1  23.7 10    n           2 f     A        42.2
2  36.1  7    n           3 f     B        19.4
3  32.0  5.01 y           2 f     A        15.7
4  17.4  3.61 y           2 f     B        20.8
5  15.4  3    n           2 f     B        19.5
6  18.6  2.5  n           2 f     A        13.4

head() shows by default the top 6 rows of the data. Use the n = ... option to change the default behaviour, e.g. head(<data>, n = 10).

dim(tips)
[1] 157   7

This returns the number of rows and the number of columns in the data, in that order.
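
Because dim() returns two numbers, each one can be extracted by its position. Here is a small sketch on a made-up data frame (the object toy and its values are hypothetical, used only for illustration):

```r
# A small made-up data frame with 4 rows and 3 columns (hypothetical values).
toy <- data.frame(
    Bill   = c(23.70, 36.11, 31.99, 17.39),
    Tip    = c(10.00, 7.00, 5.01, 3.61),
    Guests = c(2, 3, 2, 2)
)

dim(toy)      # both dimensions: rows first, then columns
dim(toy)[1]   # first element only: the number of rows
dim(toy)[2]   # second element only: the number of columns

nrow(toy)     # same value as dim(toy)[1]
ncol(toy)     # same value as dim(toy)[2]
```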

glimpse(tips)
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <chr> "n", "n", "y", "y", "n", "n", "n", "n", "n", "n", "n", "n", "n"…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f"…
$ Server <chr> "A", "B", "A", "B", "B", "A", "B", "A", "A", "B", "B", "A", "B"…
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…

glimpse is part of the tidyverse package.

An alternative to glimpse is the base R “structure” function, str:

str(tips)
spc_tbl_ [157 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Bill  : num [1:157] 23.7 36.1 32 17.4 15.4 ...
 $ Tip   : num [1:157] 10 7 5.01 3.61 3 2.5 3.44 2.42 3 2 ...
 $ Credit: chr [1:157] "n" "n" "y" "y" ...
 $ Guests: num [1:157] 2 3 2 2 2 2 2 2 2 2 ...
 $ Day   : chr [1:157] "f" "f" "f" "f" ...
 $ Server: chr [1:157] "A" "B" "A" "B" ...
 $ PctTip: num [1:157] 42.2 19.4 15.7 20.8 19.5 13.4 16 12.4 12.7 10.7 ...
 - attr(*, "spec")=
  .. cols(
  ..   Bill = col_double(),
  ..   Tip = col_double(),
  ..   Credit = col_character(),
  ..   Guests = col_double(),
  ..   Day = col_character(),
  ..   Server = col_character(),
  ..   PctTip = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

or:

sapply(tips, data.class)
       Bill         Tip      Credit      Guests         Day      Server 
  "numeric"   "numeric" "character"   "numeric" "character" "character" 
     PctTip 
  "numeric" 

Example writeup

A dataset containing records on 7 variables related to tipping was obtained from https://uoepsy.github.io/data/RestaurantTips.csv, and was provided by the owner of a bistro in the US interested in studying which factors affected the tipping behaviour of the bistro’s customers. The data contains measurements for a total of 157 parties on four numeric variables: size of the bill (in dollars), size of the tip, number of guests in the group, and tip as a percentage of the bill total. The data also includes three categorical variables indicating whether or not the party paid with a credit card, the day of the week, as well as a server-specific identifier.

summary(tips)
      Bill            Tip            Credit              Guests     
 Min.   : 1.66   Min.   : 0.250   Length:157         Min.   :1.000  
 1st Qu.:15.19   1st Qu.: 2.075   Class :character   1st Qu.:2.000  
 Median :20.22   Median : 3.340   Mode  :character   Median :2.000  
 Mean   :22.73   Mean   : 3.807                      Mean   :2.096  
 3rd Qu.:28.84   3rd Qu.: 5.000                      3rd Qu.:2.000  
 Max.   :70.51   Max.   :15.000                      Max.   :7.000  
                 NA's   :1                                          
     Day               Server              PctTip      
 Length:157         Length:157         Min.   :  6.70  
 Class :character   Class :character   1st Qu.: 14.30  
 Mode  :character   Mode  :character   Median : 16.20  
                                       Mean   : 17.89  
                                       3rd Qu.: 18.20  
                                       Max.   :221.00  
                                                       

summary returns a quick summary of the data, i.e. a list of numerical summaries.

You probably won’t understand some parts of the output above, but we will learn more in the coming weeks, so don’t worry too much about it. For the moment, you should be able to understand the minimum, maximum, and the mean.
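Currently, the output is not very informative for the categorical variables (Credit, Day, and Server): because they are stored as text (character), summary only reports their length and class. In R, categorical variables are best stored as factors.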

We can replace each factor level with a clearer label. The following code takes the column Day from the tips data and assigns a new label “Monday” to the level “m”, etc.

tips$Day <- factor(tips$Day, 
                   levels = c("m", "t", "w", "th", "f"),
                   labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))

tips$Credit <- factor(tips$Credit, 
                      levels = c("n", "y"),
                      labels = c("No", "Yes"))

tips$Server <- factor(tips$Server)
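
To see what levels and labels do, here is a minimal sketch on a short made-up vector of day codes (day_codes and day_fct are hypothetical names, not part of the tips data):

```r
# Made-up day codes, using the same coding scheme as the Day column.
day_codes <- c("f", "m", "w", "w", "th")

# levels = the values as they appear in the data, in the order we want them;
# labels = the clearer names to display instead (matched by position).
day_fct <- factor(day_codes,
                  levels = c("m", "t", "w", "th", "f"),
                  labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))

levels(day_fct)   # the five labels, with Monday first
table(day_fct)    # count per day, including days that never occur

# Caution: the original codes no longer exist after relabelling, so applying
# the same conversion a second time turns every value into NA.
factor(day_fct, levels = c("m", "t", "w", "th", "f"))
```

Note that the order given in levels, not alphabetical order, decides the order in which the categories are displayed.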

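Alternatively, you can change/update a variable (column) in the data using the function mutate from the tidyverse package. Use either the code above or the code below, not both: once the original codes have been relabelled, converting them a second time would turn every value into NA. It works as follows: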

tips <- tips |>
    mutate(
        Day = factor(Day,
                     levels = c("m", "t", "w", "th", "f"),
                     labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")),
        Credit = factor(Credit,
                        levels = c("n", "y"),
                        labels = c("No", "Yes")),
        Server = factor(Server)
    )

The operator |> is called the pipe and works by taking what’s on the left and passing it as the first argument to the function on the right. For example, you can take the logarithm of the whole numbers from 1 to 10 and round them to 2 digits using this code:

round(log(1:10), digits = 2)
 [1] 0.00 0.69 1.10 1.39 1.61 1.79 1.95 2.08 2.20 2.30

or this equivalent code that uses the pipe |>:

1:10 |>
    log() |>
    round(digits = 2)
 [1] 0.00 0.69 1.10 1.39 1.61 1.79 1.95 2.08 2.20 2.30

You can loosely think of the pipe as “then”. In fact, the pipe takes what’s to its left, and then passes it on to what’s on its right.

Curiosity: Sometimes you may also find an older version of the pipe, which is %>% and works in the same style. However, it requires the package tidyverse to be loaded before you can use the older pipe %>%.

Let’s check the result of the changes to the variable types:

glimpse(tips)
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <fct> No, No, Yes, Yes, No, No, No, No, No, No, No, No, No, No, No, N…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <fct> Friday, Friday, Friday, Friday, Friday, Friday, Friday, Friday,…
$ Server <fct> A, B, A, B, B, A, B, A, A, B, B, A, B, B, B, B, C, C, C, C, C, …
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…

summary(tips)
      Bill            Tip         Credit        Guests             Day    
 Min.   : 1.66   Min.   : 0.250   No :106   Min.   :1.000   Monday   :20  
 1st Qu.:15.19   1st Qu.: 2.075   Yes: 51   1st Qu.:2.000   Tuesday  :13  
 Median :20.22   Median : 3.340             Median :2.000   Wednesday:62  
 Mean   :22.73   Mean   : 3.807             Mean   :2.096   Thursday :36  
 3rd Qu.:28.84   3rd Qu.: 5.000             3rd Qu.:2.000   Friday   :26  
 Max.   :70.51   Max.   :15.000             Max.   :7.000                 
                 NA's   :1                                                
 Server     PctTip      
 A:60   Min.   :  6.70  
 B:65   1st Qu.: 14.30  
 C:32   Median : 16.20  
        Mean   : 17.89  
        3rd Qu.: 18.20  
        Max.   :221.00  
                        

After making categorical variables factors, summary shows the count of each category for the categorical variables.

The tip as a percentage of the bill (PctTip) has a maximum value of 221, which seems very strange: it is unlikely that someone would tip more than twice their bill total.

Let’s inspect the row where PctTip is greater than 100:

tips[tips$PctTip > 100, ]
# A tibble: 1 × 7
   Bill   Tip Credit Guests Day      Server PctTip
  <dbl> <dbl> <fct>   <dbl> <fct>    <fct>   <dbl>
1  49.6    NA Yes         4 Thursday C         221

Alternatively, using tidyverse, the function filter keeps only the rows that satisfy a condition:

tips |> 
    filter(PctTip > 100)
# A tibble: 1 × 7
   Bill   Tip Credit Guests Day      Server PctTip
  <dbl> <dbl> <fct>   <dbl> <fct>    <fct>   <dbl>
1  49.6    NA Yes         4 Thursday C         221

With a bill of 49.6, the tip would be 109.62 dollars:

49.6 * 221 / 100
[1] 109.616

Furthermore, we also notice that the tip amount for this row is not available (NA). The corresponding value of the tip as a percentage of the bill seems likely to be a data entry error, perhaps due to typing the leading 2 twice when recording the data. We will set that value to not available (NA) with the following code:

tips$PctTip[tips$PctTip > 100] <- NA

a > b tests whether a is greater than b. a < b tests whether a is smaller than b. a == b tests whether a is equal to b; notice the double equal sign! You can also use >= (greater than or equal to) or <= (smaller than or equal to).
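
As a small sketch of how these comparisons behave on a whole column at once, consider a made-up vector of percentages (pct is a hypothetical name used only for illustration):

```r
# Made-up percentages, one of them implausibly large.
pct <- c(15.7, 221, 20.8, 100)

pct > 100     # FALSE  TRUE FALSE FALSE
pct >= 100    # FALSE  TRUE FALSE  TRUE
pct == 100    # FALSE FALSE FALSE  TRUE

# A logical vector inside [ ] keeps only the elements where it is TRUE.
pct[pct > 100]    # 221

# Summing a logical vector counts the TRUEs, since each TRUE counts as 1.
sum(pct > 100)    # 1
```

Each comparison is carried out element by element, which is why one TRUE or FALSE is returned per value.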

Alternatively, you can do the same replacement using tidyverse:

tips <- tips |>
    mutate(
        PctTip = ifelse(PctTip > 100, NA, PctTip)
    )

Here, the function ifelse selects a value depending on a condition to test: ifelse(test, value_if_true, value_if_false). In the case above, each value in the column PctTip is replaced by NA if PctTip > 100, and is kept the same otherwise.
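
The behaviour of ifelse can be checked on a short made-up vector before applying it to a whole column. This is a minimal sketch; pct and cleaned are hypothetical names:

```r
# Made-up percentages, one of them implausibly large.
pct <- c(15.7, 221, 20.8)

# For each element: if the test is TRUE use NA, otherwise keep the element.
cleaned <- ifelse(pct > 100, NA, pct)

cleaned    # 15.7   NA 20.8
```

The result has the same length as the input, so it can be stored back into the same column.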

summary(tips)
      Bill            Tip         Credit        Guests             Day    
 Min.   : 1.66   Min.   : 0.250   No :106   Min.   :1.000   Monday   :20  
 1st Qu.:15.19   1st Qu.: 2.075   Yes: 51   1st Qu.:2.000   Tuesday  :13  
 Median :20.22   Median : 3.340             Median :2.000   Wednesday:62  
 Mean   :22.73   Mean   : 3.807             Mean   :2.096   Thursday :36  
 3rd Qu.:28.84   3rd Qu.: 5.000             3rd Qu.:2.000   Friday   :26  
 Max.   :70.51   Max.   :15.000             Max.   :7.000                 
                 NA's   :1                                                
 Server     PctTip     
 A:60   Min.   : 6.70  
 B:65   1st Qu.:14.30  
 C:32   Median :16.15  
        Mean   :16.59  
        3rd Qu.:18.05  
        Max.   :42.20  
        NA's   :1      

The function summary() returns a numeric answer for the min/max/mean, as shown above, even in the presence of missing values (NAs).

However, if you use functions such as min(), max(), mean(), which compute the minimum, maximum, and mean (i.e., average) of a variable respectively, they will return NA when applied to a variable that contains a missing value:

min(tips$Tip)
[1] NA

To get a numeric result, you need to include the argument (i.e. the option) na.rm = TRUE, which removes the missing values before computing:

min(tips$Tip, na.rm = TRUE)
[1] 0.25
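
The same idea applies to other functions, and missing values can also be counted directly. Below is a small sketch on made-up values (x and toy are hypothetical objects; is.na() and colSums() are base R functions):

```r
x <- c(5, 10, NA)

mean(x)                 # NA: the missing value makes the result unknown
mean(x, na.rm = TRUE)   # 7.5, the mean of 5 and 10
max(x, na.rm = TRUE)    # 10

# is.na() flags missing values; summing the flags counts them.
is.na(x)                # FALSE FALSE  TRUE
sum(is.na(x))           # 1

# For a whole data frame, colSums() gives one count per column.
toy <- data.frame(Bill = c(23.7, 49.6, 15.4), Tip = c(10, NA, 3))
colSums(is.na(toy))     # Bill 0, Tip 1
```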

Example writeup

The average bill size was $22.73, and the average tip was $3.81, corresponding to roughly 17% of the total bill. Out of 157 parties, only 51 paid with a credit card. Most parties consisted of around 2 people, and people tended to go to the restaurant most often on Wednesdays. Among the three servers, server C served the fewest parties. The data also included a missing tip value, corresponding to a bill of $49.59, and a data entry error in the corresponding tip as a percentage of the total bill.

3 Student Glossary

A good understanding of R functions is essential for your success in the DAPR curriculum and the degree more generally. We strongly recommend you get into the habit of keeping track of any new R function that you encounter and write a short description of what the function does and which package it comes from. You could save this into a Word, Excel, or Notebook file. You should get into the habit of updating that document as you encounter new R functions. For this week, we have provided below a completed table to help you get started!

Function  Use and package
<-        Assignment operator. Stores the value on the right into the named object on the left
read_csv  For reading comma separated value files. Part of tidyverse package
View      Opens a spreadsheet-like data viewer in RStudio. Base R function
head      Shows the first 6 rows of a dataset (by default). Base R function
nrow      Returns the number of rows. Base R function
ncol      Returns the number of columns. Base R function
dim       Returns the dimensions (rows and columns). Base R function
glimpse   Similar to str. Displays an overview of the data variables with their types. Part of tidyverse package
str       Similar to glimpse. Displays an overview of the data variables with their types. Base R function
summary   Produces summary statistics. Base R function
factor    Creates or modifies a variable into a categorical one (factor) with levels. Base R function

References

Lock, Robin H, Patti Frazer Lock, Kari Lock Morgan, Eric F Lock, and Dennis F Lock. 2020. Statistics: Unlocking the Power of Data. John Wiley & Sons.

Footnotes

  1. Hint: To read the data use read_csv() from the tidyverse package.
    To preview the data, use View(DATA) or head(DATA)

  2. Hint: nrow(DATA)
    or dim(DATA)[1]

  3. Hint: ncol(DATA)
    or dim(DATA)[2]

  4. Hint: glimpse(DATA) from tidyverse
    or str(DATA)
    or sapply(DATA, data.class)

  5. Hint: summary(DATA$VARIABLE)
    or min(DATA$VARIABLE) and max(DATA$VARIABLE)

  6. For some movies, data on the budget or Rotten Tomatoes rating are not available (NA). These are also called missing values.
    If you used the functions min() or max(), you will get NA as a result. This is because, if a value is missing, R cannot know whether it would have been the smallest or the largest. The same applies to the mean: what is the mean of 5, 10, and NA? How would I compute (5 + 10 + NA) / 3? I don’t know, so it remains NA.
    You can tell R to ignore the missing values by writing min(DATA$VARIABLE, na.rm = TRUE), and similarly for max and other functions like mean, which computes the average.
    The summary() function, instead, does this for you automatically and also tells you whether a variable has any NAs and how many.

  7. Hint: summary(DATA) shows the variables and the number of missing values in each variable.