install.packages("tinytex")
tinytex::install_tinytex()
Research design & data
Semester 1 - Week 1
Lab format
- At the start of each teaching block, you will be given a dataset that you will use throughout the labs of that block. By the end of each block your group should have produced a report that analyses the given dataset.
- The reports are due by:
- Formative Report A (Block 1): 12 noon, Friday the 18th of October 2024.
- Formative Report B (Block 2): 12 noon, Friday the 29th of November 2024.
- Formative Report C (Block 3): 12 noon, Friday the 14th of February 2025.
- Assessed Report (Block 4): 12 noon, Friday the 28th of March 2025.
- You will be required to submit a PDF file, not the Rmd file used to create the PDF.
- As these are group-based submissions, no extensions will be given.
- You will receive written formative feedback on each of the formative reports the week after the report due date. This will be signposted via announcements.
Group setup
- Work through the lab tasks in groups of up to 5 students.
- In each group, each week one person is the driver and the rest are the navigators.
- The driver is responsible for typing on the PC keyboard for that week.
- The navigators are responsible for commenting on the strategy, code, and spotting typos or fixing errors.
- Each week the driver will rotate so that everyone experiences being a driver at least once.
- Driver: download the template Rmd file below, upload it to the RStudio server, and start writing your work there. Don’t forget to save your file regularly via File -> Save.
- Navigators: be alert and start providing suggestions and comments on the strategy and code.
Report format
- Each submitted report must be a PDF file of max 6 sides of A4 paper.
- Keep the default settings in terms of Rmd knitting font and page margins.
- At the end of the file, you will place the appendices and these will not count towards the page limit.
- You can include an optional appendix for additional tables and figures which you can’t fit in the main part of the report;
- You must include a compulsory appendix listing all of the R code used in the report. This is done automatically if you end your file with the following section, which is already included in the template Rmd file:
# Appendix: R code
```{r ref.label=knitr::all_labels(), echo=TRUE, eval=FALSE}
```
Lab help and support
The lab is structured to provide various levels of support. When attending a lab, you should prioritise completing that week’s tasks. However, if you are unsure or stuck at any point, you should make use of all the available help:
- Raise your hand to get help from a tutor;
- Hover your mouse on the superscript number to get a quick hint. The hints may sometimes show multiple equivalent ways of getting an answer - you just need one way;
- Scroll down to the Worked Example section, where you can read through a worked example.
- Even if you don’t use the Worked Example to complete the tasks, ensure you review and study its content during your independent study time.
Important steps
Did you register for RStudio Server Online?
- Log in to EASE using your university UUN and password.
- Access the server from https://rstudio.ppls.ed.ac.uk using your university UUN and RStudio password.
Try these steps first to register for RStudio server online:
1. Log in to EASE using your university UUN and password.
2. Set your RStudio password here. The username will be the same as your UUN (make sure you type your UUN correctly).
3. Access the server from https://rstudio.ppls.ed.ac.uk using your university UUN and the password you set in step 2.
- Please complete this form and wait for an email. Please note that this can take up to four working days.
Install tinytex
Every single student, when logging into their personal RStudio Server account, must do the following at least once. In other words, everyone in each group has to do it at some point in their own RStudio when they are the driver.
In order to generate a PDF file from RStudio, you must have a package called `tinytex` installed. This allows you to “Knit” your Rmd document (i.e. combine together text, code, and output) to produce a PDF file. Copy and paste the following two commands into your console, pressing Enter after each one: first `install.packages("tinytex")`, then `tinytex::install_tinytex()`.
Useful resources when finalising your report formatting prior to submission in week 5.
Checklist for successful knitting
If you encounter errors when knitting the Rmd file, go through the following checklist to find the source of the errors.
APA style
Check the following guide for reporting numbers and statistics in APA style (7th edition).
Hiding code and/or output
To not show the code of an R code chunk, and only show the output, write:
```{r, echo=FALSE}
# code goes here
```
To show the code of an R code chunk, but hide the output, write:
```{r, results='hide'}
# code goes here
```
To hide both code and output of an R code chunk, write:
```{r, include=FALSE}
# code goes here
```
1 Formative Report A
In the first five weeks of the course your group should produce a PDF report (using Rmarkdown) for which you will receive written formative feedback in week 6.
The report should not include any reference to R code or functions, but be written for a generic reader who is only assumed to have a basic statistical understanding without any R knowledge. You should also avoid any R code output or printout in the PDF file.
1.1 Data
Hollywood Movies. At the link https://uoepsy.github.io/data/hollywood_movies_subset.csv you will find data on Hollywood movies released between 2012 and 2018 from the top 5 lead studios and top 10 genres. The following variables were recorded:
Variable | Description |
---|---|
Movie | Title of the movie |
LeadStudio | Primary U.S. distributor of the movie |
RottenTomatoes | Rotten Tomatoes rating (critics) |
AudienceScore | Audience rating (via Rotten Tomatoes) |
Genre | One of Action Adventure, Black Comedy, Comedy, Concert, Documentary, Drama, Horror, Musical, Romantic Comedy, Thriller, or Western |
TheatersOpenWeek | Number of screens for opening weekend |
OpeningWeekend | Opening weekend gross (in millions) |
BOAvgOpenWeekend | Average box office income per theater, opening weekend |
Budget | Production budget (in millions) |
DomesticGross | Gross income for domestic (U.S.) viewers (in millions) |
WorldGross | Gross income for all viewers (in millions) |
ForeignGross | Gross income for foreign viewers (in millions) |
Profitability | WorldGross as a percentage of Budget |
OpenProfit | Percentage of budget recovered on opening weekend |
Year | Year the movie was released |
IQ1-IQ50 (ignore for Formative report A) | IQ score of each of 50 audience raters |
Snacks (ignore for Formative report A) | How many of the 50 audience raters bought snacks |
PrivateTransport (ignore for Formative report A) | How many of the 50 audience raters reached the cinema via private transportation |
For formative report A, please only focus on the variables `Movie` to `Year`, ignoring anything beyond that. In other words, do not analyse the variables `IQ1` to `PrivateTransport` in the next five weeks of the course. We will use those later in the course.
1.2 Tasks
For formative report A, you will be asked to perform the following tasks, each related to a week of teaching in this course.
This week’s task is highlighted in bold below. Please only focus on completing that task this week. In the next section, you will also find guided sub-steps you may want to consider to complete this week’s task.
A1) Read the data into R, inspect it, and write a concise introduction to the data and its structure
A2) Display and describe the categorical variables.
A3) Display and describe six numerical variables of your choice.
A4) Display and describe a relationship of interest between two or three variables of your choice.
A5) Finish the report write-up, knit to PDF, and submit the PDF for formative feedback.
1.3 A1 sub-tasks
This week you will only focus on task A1. Below there are some guided sub-steps you may want to consider to complete task A1.
To see the hints, hover your cursor on the superscript numbers.
Read the movie data into R, and give it a useful name. Inspect the data by viewing it in RStudio. By viewing, we mean looking at the data either in the viewer or in the console.1
How many observations are there?2
How many variables are there?3
- What does `dim(DATA)` return?
- What is the function of appending a `[1]` or `[2]`?
What is the type of each variable?4
What’s the minimum and maximum budget in the sample? What about the minimum and maximum Rotten Tomatoes rating?5
Do you notice any issues when computing the minimum/maximum Budget and the minimum/maximum RottenTomatoes rating?6
Which variables have missing values in the dataset, and how many missing values does each have?7
Write-up a description of the dataset for the reader. You don’t need to show the actual data in the report, but a description in words is sufficient for the reader.
2 Worked example
Consider the dataset available at https://uoepsy.github.io/data/RestaurantTips.csv, containing 157 observations on the following 7 variables:
Variable Name | Description |
---|---|
Bill | Size of the bill (in dollars) |
Tip | Size of the tip (in dollars) |
Credit | Paid with a credit card? n or y |
Guests | Number of people in the group |
Day | Day of the week: m=Monday, t=Tuesday, w=Wednesday, th=Thursday, or f=Friday |
Server | Code for specific waiter/waitress: A, B, or C |
PctTip | Tip as a percentage of the bill |
These data were collected by the owner of a bistro in the US, who was interested in understanding the tipping patterns of their customers. The data are adapted from Lock et al. (2020).
We load the `tidyverse` package with `library(tidyverse)`, as we will use the functions `read_csv` and `glimpse` from this package.
```r
tips <- read_csv("https://uoepsy.github.io/data/RestaurantTips.csv")
```
`read_csv` is the function to read CSV (comma separated values) files. Once we have read the file, it is stored in an object called `tips` using the assignment arrow (`<-`).
```r
head(tips)
```
```
# A tibble: 6 × 7
   Bill   Tip Credit Guests Day   Server PctTip
  <dbl> <dbl> <chr>   <dbl> <chr> <chr>   <dbl>
1  23.7 10    n           2 f     A        42.2
2  36.1  7    n           3 f     B        19.4
3  32.0  5.01 y           2 f     A        15.7
4  17.4  3.61 y           2 f     B        20.8
5  15.4  3    n           2 f     B        19.5
6  18.6  2.5  n           2 f     A        13.4
```
`head()` shows by default the top 6 rows of the data. Use the `n = ...` option to change the default behaviour, e.g. `head(<data>, n = 10)`.
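As a quick illustration of the `n` option, here is a minimal sketch on a small made-up data frame. The object `toy` and its values are hypothetical and are not part of the restaurant data.

```r
# Hypothetical toy data with 10 rows (an assumption for illustration only)
toy <- data.frame(id = 1:10, score = seq(50, 95, by = 5))

head(toy)          # default behaviour: the first 6 rows
head(toy, n = 3)   # only the first 3 rows

# Store the shortened preview and count its rows
first_three <- head(toy, n = 3)
nrow(first_three)
```

The same call works on any data frame or tibble, so `head(tips, n = 10)` would preview the first ten parties.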
```r
dim(tips)
```
```
[1] 157   7
```
`dim` returns the number of rows and the number of columns in the data, in that order.
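Because `dim()` returns two numbers, square brackets can be used to pick out just one of them. The sketch below uses a hypothetical data frame called `toy` (an assumption for illustration, not the restaurant data) to show how this relates to `nrow()` and `ncol()`.

```r
# Hypothetical toy data: 10 rows and 2 columns
toy <- data.frame(id = 1:10, score = seq(50, 95, by = 5))

dims <- dim(toy)      # a vector of length 2: rows first, then columns
n_rows <- dim(toy)[1] # first element: number of rows
n_cols <- dim(toy)[2] # second element: number of columns

# nrow() and ncol() give the same two numbers directly
n_rows == nrow(toy)
n_cols == ncol(toy)
```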
```r
glimpse(tips)
```
```
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <chr> "n", "n", "y", "y", "n", "n", "n", "n", "n", "n", "n", "n", "n"…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f"…
$ Server <chr> "A", "B", "A", "B", "B", "A", "B", "A", "A", "B", "B", "A", "B"…
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…
```
`glimpse` is part of the tidyverse package.
An alternative to `glimpse` is the data “structure” function, `str`:
```r
str(tips)
```
```
spc_tbl_ [157 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Bill  : num [1:157] 23.7 36.1 32 17.4 15.4 ...
 $ Tip   : num [1:157] 10 7 5.01 3.61 3 2.5 3.44 2.42 3 2 ...
 $ Credit: chr [1:157] "n" "n" "y" "y" ...
 $ Guests: num [1:157] 2 3 2 2 2 2 2 2 2 2 ...
 $ Day   : chr [1:157] "f" "f" "f" "f" ...
 $ Server: chr [1:157] "A" "B" "A" "B" ...
 $ PctTip: num [1:157] 42.2 19.4 15.7 20.8 19.5 13.4 16 12.4 12.7 10.7 ...
 - attr(*, "spec")=
  .. cols(
  ..   Bill = col_double(),
  ..   Tip = col_double(),
  ..   Credit = col_character(),
  ..   Guests = col_double(),
  ..   Day = col_character(),
  ..   Server = col_character(),
  ..   PctTip = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
```
or:
```r
sapply(tips, data.class)
```
```
       Bill         Tip      Credit      Guests         Day      Server
  "numeric"   "numeric" "character"   "numeric" "character" "character"
     PctTip
  "numeric"
```
A dataset containing records on 7 variables related to tipping was obtained from https://uoepsy.github.io/data/RestaurantTips.csv, and was provided by the owner of a bistro in the US interested in studying which factors affected the tipping behaviour of the bistro’s customers. The data contains measurements for a total of 157 parties on four numeric variables: size of the bill (in dollars), size of the tip, number of guests in the group, and tip as a percentage of the bill total. The data also includes three categorical variables indicating whether or not the party paid with a credit card, the day of the week, as well as a server-specific identifier.
```r
summary(tips)
```
```
      Bill            Tip             Credit              Guests
 Min.   : 1.66   Min.   : 0.250   Length:157         Min.   :1.000
 1st Qu.:15.19   1st Qu.: 2.075   Class :character   1st Qu.:2.000
 Median :20.22   Median : 3.340   Mode  :character   Median :2.000
 Mean   :22.73   Mean   : 3.807                      Mean   :2.096
 3rd Qu.:28.84   3rd Qu.: 5.000                      3rd Qu.:2.000
 Max.   :70.51   Max.   :15.000                      Max.   :7.000
                 NA's   :1
     Day               Server              PctTip
 Length:157         Length:157         Min.   :  6.70
 Class :character   Class :character   1st Qu.: 14.30
 Mode  :character   Mode  :character   Median : 16.20
                                       Mean   : 17.89
                                       3rd Qu.: 18.20
                                       Max.   :221.00
```
`summary` returns a quick summary of the data, i.e. a list of numerical summaries.
You probably won’t understand some parts of the output above, but we will learn more in the coming weeks, so don’t worry too much about it. For the moment, you should be able to understand the minimum, maximum, and the mean.
Currently, `summary` is not showing very informative output for the categorical variables, also known as factors.
We can replace each factor level with a clearer label. The following code takes the column `Day` from the `tips` data and assigns a new label “Monday” to the level “m”, etc.
You can change/update a variable (column) in the data using the function `mutate` from the tidyverse package. It works as follows:
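The relabelling code is not reproduced at this point on the page, so the following is only a minimal sketch of what such a `mutate` call might look like. It assumes the `dplyr` package (part of the tidyverse) is installed, uses a small hypothetical tibble called `factor_demo` instead of the full `tips` data, and chooses labels to match the output shown further down ("No"/"Yes" and full day names).

```r
library(dplyr)  # part of the tidyverse; provides tibble() and mutate()

# Hypothetical mini version of the tips data (values made up for illustration)
factor_demo <- tibble(
  Day    = c("m", "t", "w", "th", "f", "w"),
  Credit = c("n", "y", "n", "n", "y", "y"),
  Server = c("A", "B", "C", "A", "B", "C")
)

# Overwrite each character column with a factor version of itself.
# `levels` lists the codes as stored in the data, `labels` the clearer names.
factor_demo <- factor_demo |>
  mutate(
    Day = factor(Day,
                 levels = c("m", "t", "w", "th", "f"),
                 labels = c("Monday", "Tuesday", "Wednesday",
                            "Thursday", "Friday")),
    Credit = factor(Credit, levels = c("n", "y"), labels = c("No", "Yes")),
    Server = factor(Server)
  )

summary(factor_demo)  # now shows a count for each category
```

Listing the levels explicitly keeps the days in weekday order rather than alphabetical order, which is why Monday appears first in the summaries.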
The operator `|>` is called the pipe and works by taking what’s on the left and passing it to the operation on the right. For example, you can take the logarithm of the whole numbers from 1 to 10 and round them to 2 digits either with nested function calls or with equivalent code that uses the pipe `|>`:
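The two versions of the logarithm example are not shown at this point, so here is a minimal sketch of both. It assumes R 4.1 or later, where the native pipe `|>` is available; the object names `nested` and `piped` are chosen only for illustration.

```r
# Version 1: nested function calls, read from the inside out
nested <- round(log(1:10), digits = 2)

# Version 2: the pipe, read from top to bottom
# "take 1 to 10, then take the log, then round to 2 digits"
piped <- 1:10 |>
  log() |>
  round(digits = 2)

# Both approaches give exactly the same numbers
identical(nested, piped)
```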
You can loosely think of the pipe as “then”. In fact, the pipe takes what’s to its left, and then passes it on to what’s on its right.
Curiosity: Sometimes you may also find an older version of the pipe, `%>%`, which works in the same style. However, it requires the tidyverse package to be loaded before you can use it.
Let’s check the result of the changes to the variable types:
```r
glimpse(tips)
```
```
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <fct> No, No, Yes, Yes, No, No, No, No, No, No, No, No, No, No, No, N…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <fct> Friday, Friday, Friday, Friday, Friday, Friday, Friday, Friday,…
$ Server <fct> A, B, A, B, B, A, B, A, A, B, B, A, B, B, B, B, C, C, C, C, C, …
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…
```
```r
summary(tips)
```
```
      Bill            Tip         Credit       Guests             Day
 Min.   : 1.66   Min.   : 0.250   No :106   Min.   :1.000   Monday   :20
 1st Qu.:15.19   1st Qu.: 2.075   Yes: 51   1st Qu.:2.000   Tuesday  :13
 Median :20.22   Median : 3.340             Median :2.000   Wednesday:62
 Mean   :22.73   Mean   : 3.807             Mean   :2.096   Thursday :36
 3rd Qu.:28.84   3rd Qu.: 5.000             3rd Qu.:2.000   Friday   :26
 Max.   :70.51   Max.   :15.000             Max.   :7.000
                 NA's   :1
 Server     PctTip
 A:60   Min.   :  6.70
 B:65   1st Qu.: 14.30
 C:32   Median : 16.20
        Mean   : 17.89
        3rd Qu.: 18.20
        Max.   :221.00
```
After making categorical variables factors, `summary` shows the count of each category for the categorical variables.
The tip as a percentage of the bill (`PctTip`) has a maximum value of 221, which seems very strange. Someone is very unlikely to tip more than their bill total, so a tip of 221% of the bill seems implausible.
Let’s inspect the row where `PctTip` is greater than 100:
```r
tips[tips$PctTip > 100, ]
```
```
# A tibble: 1 × 7
   Bill   Tip Credit Guests Day      Server PctTip
  <dbl> <dbl> <fct>   <dbl> <fct>    <fct>   <dbl>
1  49.6    NA Yes         4 Thursday C         221
```
Alternatively, using tidyverse, the function `filter` keeps only the rows that satisfy a condition:
```r
tips |>
  filter(PctTip > 100)
```
```
# A tibble: 1 × 7
   Bill   Tip Credit Guests Day      Server PctTip
  <dbl> <dbl> <fct>   <dbl> <fct>    <fct>   <dbl>
1  49.6    NA Yes         4 Thursday C         221
```
With a bill of 49.6, a tip of 221% would be 109.62 dollars:
```r
49.6 * 221 / 100
```
```
[1] 109.616
```
Furthermore, we also notice that the tipping amount is not available (NA). The corresponding value of the tip as a percentage of the bill is likely an inputting error, perhaps due to double typing the leading 2 when recording the data. We will set that value to not available (NA) with the following code:
```r
tips$PctTip[tips$PctTip > 100] <- NA
```
`a > b` tests whether a is greater than b. `a < b` tests whether a is smaller than b. `a == b` tests whether a is equal to b; notice the double equal sign! You can also use `>=` or `<=`.
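To see what these comparisons return, here is a minimal sketch on a short made-up vector. The name `pct` and its three values are hypothetical and only loosely modelled on the percentages in the data.

```r
# Hypothetical percentages, one of which is implausibly large
pct <- c(15.7, 221, 20.8)

pct > 100     # FALSE  TRUE FALSE : one answer per value
pct <= 20.8   #  TRUE FALSE  TRUE
pct == 221    # FALSE  TRUE FALSE : note the double equal sign

# A TRUE/FALSE vector inside square brackets keeps only the TRUE positions
pct[pct > 100]
```

This is the mechanism used in `tips$PctTip[tips$PctTip > 100] <- NA`: the comparison marks which values to select, and only those are overwritten.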
Alternatively, you can do this with tidyverse and the function `ifelse`, which selects a value depending on a condition to test: `ifelse(test, value_if_true, value_if_false)`. Here, each value in the column `PctTip` is replaced by NA if `PctTip > 100`, and it is kept the same otherwise.
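The tidyverse version of the code is not reproduced at this point, so the following is only a sketch of how `mutate` and `ifelse` might be combined. It assumes the `dplyr` package is installed and uses a hypothetical three-row tibble called `pct_demo` rather than the full `tips` data.

```r
library(dplyr)  # part of the tidyverse; provides tibble() and mutate()

# Hypothetical mini data with one implausible percentage
pct_demo <- tibble(
  Bill   = c(32.0, 49.6, 17.4),
  PctTip = c(15.7, 221, 20.8)
)

# Replace PctTip by NA where it exceeds 100, keep it unchanged otherwise
pct_demo <- pct_demo |>
  mutate(PctTip = ifelse(PctTip > 100, NA, PctTip))

pct_demo$PctTip
```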
```r
summary(tips)
```
```
      Bill            Tip         Credit       Guests             Day
 Min.   : 1.66   Min.   : 0.250   No :106   Min.   :1.000   Monday   :20
 1st Qu.:15.19   1st Qu.: 2.075   Yes: 51   1st Qu.:2.000   Tuesday  :13
 Median :20.22   Median : 3.340             Median :2.000   Wednesday:62
 Mean   :22.73   Mean   : 3.807             Mean   :2.096   Thursday :36
 3rd Qu.:28.84   3rd Qu.: 5.000             3rd Qu.:2.000   Friday   :26
 Max.   :70.51   Max.   :15.000             Max.   :7.000
                 NA's   :1
 Server     PctTip
 A:60   Min.   : 6.70
 B:65   1st Qu.:14.30
 C:32   Median :16.15
        Mean   :16.59
        3rd Qu.:18.05
        Max.   :42.20
        NA's   :1
```
The function `summary()` returns a numeric answer for the min/max/mean, see above, even in the presence of missing values (NAs).
However, functions such as `min()`, `max()`, and `mean()`, which compute the minimum, maximum, and mean (i.e., average) of a variable respectively, will return NA when applied to a variable that contains a missing value:
```r
min(tips$Tip)
```
```
[1] NA
```
To get a numeric result, you need to include the argument (i.e. the option) `na.rm = TRUE`:
```r
min(tips$Tip, na.rm = TRUE)
```
```
[1] 0.25
```
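The same option works for `max()` and `mean()`. Below is a minimal sketch on a tiny made-up vector; the name `x` and its values are hypothetical, chosen so the results are easy to check by hand.

```r
# Hypothetical values with one missing entry
x <- c(5, 10, NA)

mean(x)                 # NA: the missing value makes the result unknown
mean(x, na.rm = TRUE)   # 7.5, i.e. (5 + 10) / 2
min(x, na.rm = TRUE)    # 5
max(x, na.rm = TRUE)    # 10

# Count how many values are missing
sum(is.na(x))           # 1
```

Note that with `na.rm = TRUE` the mean is divided by the number of non-missing values (2 here), not by the original length of the vector.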
The average bill size was $22.73, and the average tip was $3.81, corresponding to roughly 17% of the total bill. Out of 157 parties, only 51 paid with a credit card. Most parties consisted of around 2 people, and people tended to go to the restaurant most often on Wednesday. Among the three servers, server C served the fewest parties. The data also included a missing tipping value, corresponding to a bill of $49.59, and a data inputting error for the corresponding measure of the tip as a percentage of the total bill.
3 Student Glossary
To conclude the lab, create a glossary of R functions. You can do so by opening Microsoft Word, Excel, or OneNote and creating a table with two columns: one where you should write the name of an R function, and the other column where you should provide a brief description of what the function does.
This “do it yourself” glossary is an opportunity for you to revise what you have learned in today’s lab and write down a few take-home messages. You will find this glossary handy as a reference to keep next to you when you are doing the assessed weekly quizzes.
Below you can find an example to get you started:
Function | Use and package |
---|---|
read_csv | For reading comma separated value files. Part of tidyverse package |
View | ? |
head | ? |
nrow | ? |
ncol | ? |
dim | ? |
glimpse | ? |
str | ? |
summary | ? |
factor | ? |
References
Footnotes
1. Hint: To read the data use `read_csv()` from the `tidyverse` package. To preview the data, use `View(DATA)` or `head(DATA)`.
2. Hint: `nrow(DATA)` or `dim(DATA)[1]`
3. Hint: `ncol(DATA)` or `dim(DATA)[2]`
4. Hint: `glimpse(DATA)` from `tidyverse` or `str(DATA)` or `sapply(DATA, data.class)`
5. Hint: `summary(DATA$VARIABLE)` or `min(DATA$VARIABLE)` and `max(DATA$VARIABLE)`
6. For some movies, data on the budget or rotten tomatoes rating are not available (NA). These are also called missing values. If you used the functions `min()`, `max()`, you will get NA as a result. This is because if a value is missing, you cannot compute the mean of something you don’t know. For example, what is the mean of 5, 10, and NA? How would I compute (5 + 10 + NA) / 3? I don’t know, so it remains NA. You can tell R to ignore the missing values by saying `min(DATA$VARIABLE, na.rm = TRUE)` and similarly for `max` and other functions like `mean`, which computes the average instead. The `summary()` function, instead, does this for you automatically and immediately tells you if a variable had any NAs and how many.
7. Hint: `summary(DATA)` shows the variables and the number of missing values in each variable.