install.packages("tinytex")
tinytex::install_tinytex()
Research design & data
Semester 1 - Week 1
1 Getting started
- Please work through the lab exercises in small groups of 3 to 5 students.
- You will be given some data that you will use throughout the next 5 weeks.
- As a group, you have to produce a data analysis PDF report on those data.
- In week 5, you will be asked to submit the PDF report, for which you will receive formative feedback in week 6.
- One person is the driver, responsible for typing on the PC, and the rest are navigators and cannot type. Navigators are responsible for commenting on the strategy, code, and spotting typos or fixing errors. Each week you will rotate so that everyone experiences being a driver.
- Driver: open an Rmd file, and start writing your work there.
- Navigators: be alert and start providing suggestions and comments on the strategy and code.
Format
- PDF file, max 4 sides of A4 paper. Keep the default Rmd knitting settings for font and page margins.
- Appendix with all the code in a code chunk with the option `results='hide'`.
The lab is structured to provide various levels of support. When attending the labs, you should directly attempt and work on the tasks. However, if you are unsure or stuck at any point, you can make use of the following help:
- Simply raise your hand and get help from a tutor
- Hover your mouse on the superscript number to get a hint. The hints may sometimes show multiple equivalent ways of getting an answer - you just need one way
- Scroll down to the Worked Example section, where you can read through a worked example.
- Log in to EASE using your university UUN and password.
- Access the server from https://rstudio.ppls.ed.ac.uk using your university UUN and RStudio password.
Try these steps first to register for RStudio server online:
- Log in to EASE using your university UUN and password.
- Set your RStudio password here; the username will be the same as your UUN (make sure you type your UUN correctly).
- Access the server from https://rstudio.ppls.ed.ac.uk using your university UUN and the password you set in the previous step.
Please complete this form and wait for an email. Please note that this can take up to four working days.
Once you receive an email from us, please follow these instructions:
- Set your RStudio password here; the username will be the same as your UUN (make sure you type your UUN correctly).
- Access the server from https://rstudio.ppls.ed.ac.uk using your university UUN and the password you just set above.
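Before you begin, make sure you have tinytex installed in R so that you can “Knit” your Rmd document to a PDF file. To install it, run the two commands shown at the top of this page, `install.packages("tinytex")` followed by `tinytex::install_tinytex()`, once in the R console.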
2 Formative report A
In the first five weeks of the course you should produce a PDF report using Rmarkdown, for which you will receive formative feedback in week 6. The report should not include any reference to R code or functions, but should be written for a generic reader who is only assumed to have a basic statistical understanding without any R knowledge. You should also avoid any R code output or printout in the PDF file.
To not show the code of an R code chunk, and only show the output, write:
```{r, echo=FALSE}
# code goes here
```
To show the code of an R code chunk, but hide the output, write:
```{r, results='hide'}
# code goes here
```
To hide both code and output of an R code chunk, write:
```{r, include=FALSE}
# code goes here
```
2.1 Data
Hollywood Movies. At the link https://uoepsy.github.io/data/hollywood_movies_subset.csv you will find data on Hollywood movies released between 2012 and 2018 from the top 5 lead studios and top 10 genres. The following variables were recorded:
- `Movie`: Title of the movie
- `LeadStudio`: Primary U.S. distributor of the movie
- `RottenTomatoes`: Rotten Tomatoes rating (critics)
- `AudienceScore`: Audience rating (via Rotten Tomatoes)
- `Genre`: One of Action Adventure, Black Comedy, Comedy, Concert, Documentary, Drama, Horror, Musical, Romantic Comedy, Thriller, or Western
- `TheatersOpenWeek`: Number of screens for opening weekend
- `OpeningWeekend`: Opening weekend gross (in millions)
- `BOAvgOpenWeekend`: Average box office income per theater, opening weekend
- `Budget`: Production budget (in millions)
- `DomesticGross`: Gross income for domestic (U.S.) viewers (in millions)
- `WorldGross`: Gross income for all viewers (in millions)
- `ForeignGross`: Gross income for foreign viewers (in millions)
- `Profitability`: WorldGross as a percentage of Budget
- `OpenProfit`: Percentage of budget recovered on opening weekend
- `Year`: Year the movie was released
- `IQ1`-`IQ50`: IQ score of each of 50 audience raters
- `Snacks`: How many of the 50 audience raters brought snacks
- `PrivateTransport`: How many of the 50 audience raters reached the cinema via private transportation
For formative report A, please only focus on the variables `Movie` to `Year`, ignoring anything beyond that. In other words, do not analyse the variables `IQ1` to `PrivateTransport` in the next five weeks of the course. We will use those later in the course.
2.2 Tasks
For formative report A, you will be asked to perform the following tasks, each related to a week of teaching in this course:
A1) Read the data into R, inspect it, and write a concise introduction to the data and its structure
A2) Display and describe the categorical variables
A3) Display and describe six numerical variables of your choice
A4) Display and describe a relationship of interest between two or three variables of your choice
A5) Finish the report write-up, knit to PDF, and submit the PDF for formative feedback
This week you will only focus on task A1. Below there are some guided sub-steps you may want to consider to complete task A1.
2.3 A1 sub-tasks
To see the hints, hover your cursor on the superscript numbers.
Read the movie data into R, and give it a useful name. Inspect the data by viewing it in RStudio. By viewing, we mean looking at the data either in the viewer or in the console.[^1]
How many observations are there?[^2]
How many variables are there?[^3]
- What does `dim(DATA)` return?
- What is the function of appending a `[1]` or `[2]`?
What is the type of each variable?[^4]
What’s the minimum and maximum budget in the sample? What about the average Rotten Tomatoes rating?[^5]
Do you notice any issues when computing the minimum and maximum Budget and the average RottenTomatoes rating?[^6]
What is the range (i.e. minimum and maximum) of the variables in the data? What about the number of missing values for each variable?[^7]
Write up a description of the dataset for the reader. You don’t need to show the actual data in the report; a description in words is sufficient.
3 Worked example
Consider the dataset available at https://uoepsy.github.io/data/RestaurantTips.csv, containing 157 observations on the following 7 variables:
Variable Name | Description |
---|---|
Bill | Size of the bill (in dollars) |
Tip | Size of the tip (in dollars) |
Credit | Paid with a credit card? n or y |
Guests | Number of people in the group |
Day | Day of the week: m=Monday, t=Tuesday, w=Wednesday, th=Thursday, or f=Friday |
Server | Code for specific waiter/waitress: A, B, or C |
PctTip | Tip as a percentage of the bill |
These data were collected by the owner of a bistro in the US, who was interested in understanding the tipping patterns of their customers. The data are adapted from Lock et al. (2020).
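The outputs in this worked example come from a data frame called `tips`. A minimal sketch of how it could be read in and previewed, assuming the tidyverse package is installed and the URL above is reachable (the object name `tips` is taken from the calls used in the rest of the example):

```r
# Load the tidyverse, which provides read_csv()
library(tidyverse)

# Read the CSV file straight from the URL and store it under the name `tips`
tips <- read_csv("https://uoepsy.github.io/data/RestaurantTips.csv")

# Preview the first six rows (the default for head())
head(tips)
```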
# A tibble: 6 × 7
Bill Tip Credit Guests Day Server PctTip
<dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
1 23.7 10 n 2 f A 42.2
2 36.1 7 n 3 f B 19.4
3 32.0 5.01 y 2 f A 15.7
4 17.4 3.61 y 2 f B 20.8
5 15.4 3 n 2 f B 19.5
6 18.6 2.5 n 2 f A 13.4
`head()` shows by default the top 6 rows of the data. Use the `n = ...` option to change the default behaviour, e.g. `head(<data>, n = 10)`.
dim(tips)
[1] 157 7
This returns the number of rows and columns, in that order.
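As a small illustration of how `dim()` relates to `nrow()` and `ncol()`, here is a sketch on a made-up data frame (the object `toy` is hypothetical and only borrows a few values from the tips data):

```r
# A hypothetical data frame with 3 rows and 2 columns
toy <- data.frame(Bill   = c(23.70, 36.11, 31.99),
                  Guests = c(2, 3, 2))

dim(toy)      # rows first, then columns
dim(toy)[1]   # first element: the number of rows, same as nrow(toy)
dim(toy)[2]   # second element: the number of columns, same as ncol(toy)
```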
glimpse(tips)
Rows: 157
Columns: 7
$ Bill <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <chr> "n", "n", "y", "y", "n", "n", "n", "n", "n", "n", "n", "n", "n"…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f"…
$ Server <chr> "A", "B", "A", "B", "B", "A", "B", "A", "A", "B", "B", "A", "B"…
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…
`glimpse` is part of the tidyverse package.
An alternative to `glimpse` is the data “structure” function:
str(tips)
spc_tbl_ [157 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Bill : num [1:157] 23.7 36.1 32 17.4 15.4 ...
$ Tip : num [1:157] 10 7 5.01 3.61 3 2.5 3.44 2.42 3 2 ...
$ Credit: chr [1:157] "n" "n" "y" "y" ...
$ Guests: num [1:157] 2 3 2 2 2 2 2 2 2 2 ...
$ Day : chr [1:157] "f" "f" "f" "f" ...
$ Server: chr [1:157] "A" "B" "A" "B" ...
$ PctTip: num [1:157] 42.2 19.4 15.7 20.8 19.5 13.4 16 12.4 12.7 10.7 ...
- attr(*, "spec")=
.. cols(
.. Bill = col_double(),
.. Tip = col_double(),
.. Credit = col_character(),
.. Guests = col_double(),
.. Day = col_character(),
.. Server = col_character(),
.. PctTip = col_double()
.. )
- attr(*, "problems")=<externalptr>
or:
sapply(tips, data.class)
Bill Tip Credit Guests Day Server
"numeric" "numeric" "character" "numeric" "character" "character"
PctTip
"numeric"
A dataset containing records on 7 variables related to tipping was obtained from https://uoepsy.github.io/data/RestaurantTips.csv, and was provided by the owner of a bistro in the US interested in studying which factors affected the tipping behaviour of the bistro’s customers. The data contains measurements for a total of 157 parties on four numeric variables: size of the bill (in dollars), size of the tip, number of guests in the group, and tip as a percentage of the bill total. The data also includes three categorical variables indicating whether or not the party paid with a credit card, the day of the week, as well as a server-specific identifier.
summary(tips)
Bill Tip Credit Guests
Min. : 1.66 Min. : 0.250 Length:157 Min. :1.000
1st Qu.:15.19 1st Qu.: 2.075 Class :character 1st Qu.:2.000
Median :20.22 Median : 3.340 Mode :character Median :2.000
Mean :22.73 Mean : 3.807 Mean :2.096
3rd Qu.:28.84 3rd Qu.: 5.000 3rd Qu.:2.000
Max. :70.51 Max. :15.000 Max. :7.000
NA's :1
Day Server PctTip
Length:157 Length:157 Min. : 6.70
Class :character Class :character 1st Qu.: 14.30
Mode :character Mode :character Median : 16.20
Mean : 17.89
3rd Qu.: 18.20
Max. :221.00
`summary` returns a quick summary of the data.
You probably won’t understand some parts of the output above, but we will learn more in the coming weeks, so don’t worry too much about it. For the moment, you should be able to understand the minimum, maximum, and the mean.
Currently, it is not showing very informative output for the categorical variables.
We can replace each factor level with a clearer label:
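A minimal base R sketch of this relabelling, assuming the codes listed in the variable table above (`n`/`y` for Credit and `m`, `t`, `w`, `th`, `f` for Day). The small data frame `tips_small` is a hypothetical excerpt used only to keep the example self-contained; in the lab the same calls would be applied to `tips`:

```r
# A hypothetical excerpt with the same coding as the tips data
tips_small <- data.frame(
  Credit = c("n", "n", "y", "y"),
  Day    = c("m", "t", "th", "f"),
  Server = c("A", "B", "A", "C")
)

# Convert each categorical variable to a factor.
# `levels` lists the codes stored in the data; `labels` gives the text shown instead.
tips_small$Credit <- factor(tips_small$Credit,
                            levels = c("n", "y"),
                            labels = c("No", "Yes"))
tips_small$Day <- factor(tips_small$Day,
                         levels = c("m", "t", "w", "th", "f"),
                         labels = c("Monday", "Tuesday", "Wednesday",
                                    "Thursday", "Friday"))
# Server codes are already readable, so only the type changes
tips_small$Server <- factor(tips_small$Server)

summary(tips_small)
```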
Using tidyverse, the function `mutate` is used to create or modify a variable (column) in the data:
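A sketch of what the tidyverse version might look like, again on a hypothetical excerpt called `tips_small` rather than the full data, and assuming the same codes as in the variable table:

```r
library(tidyverse)

# A hypothetical excerpt with the same coding as the tips data
tips_small <- tibble(
  Credit = c("n", "n", "y", "y"),
  Day    = c("m", "t", "th", "f"),
  Server = c("A", "B", "A", "C")
)

# mutate() overwrites each column with its factor version
tips_small <- tips_small %>%
  mutate(
    Credit = factor(Credit, levels = c("n", "y"), labels = c("No", "Yes")),
    Day    = factor(Day,
                    levels = c("m", "t", "w", "th", "f"),
                    labels = c("Monday", "Tuesday", "Wednesday",
                               "Thursday", "Friday")),
    Server = factor(Server)
  )

glimpse(tips_small)
```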
The functions `%>%` and `mutate` are part of the tidyverse package. The former, `%>%`, is called pipe.
The pipe works by taking what’s on the left and passing it to the operation on the right. For example, rounding to 2 decimal places the logarithm of the whole numbers from 1 to 10:
is equivalent to:
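A minimal sketch of the two forms, the piped one first and the nested one second, assuming the tidyverse is loaded (the names `piped` and `nested` are only for illustration):

```r
library(tidyverse)

# Piped form: start from 1:10, take the logarithm, then round to 2 decimals
piped <- 1:10 %>%
  log() %>%
  round(digits = 2)

# Nested form: the same operations written inside-out
nested <- round(log(1:10), digits = 2)

piped
identical(piped, nested)   # both forms give the same result
```

The piped form reads in the order the operations are carried out, which is often easier to follow once several steps are chained.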
Let’s check the result of the changes to the variable types:
glimpse(tips)
Rows: 157
Columns: 7
$ Bill <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <fct> No, No, Yes, Yes, No, No, No, No, No, No, No, No, No, No, No, N…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day <fct> Friday, Friday, Friday, Friday, Friday, Friday, Friday, Friday,…
$ Server <fct> A, B, A, B, B, A, B, A, A, B, B, A, B, B, B, B, C, C, C, C, C, …
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…
summary(tips)
Bill Tip Credit Guests Day
Min. : 1.66 Min. : 0.250 No :106 Min. :1.000 Monday :20
1st Qu.:15.19 1st Qu.: 2.075 Yes: 51 1st Qu.:2.000 Tuesday :13
Median :20.22 Median : 3.340 Median :2.000 Wednesday:62
Mean :22.73 Mean : 3.807 Mean :2.096 Thursday :36
3rd Qu.:28.84 3rd Qu.: 5.000 3rd Qu.:2.000 Friday :26
Max. :70.51 Max. :15.000 Max. :7.000
NA's :1
Server PctTip
A:60 Min. : 6.70
B:65 1st Qu.: 14.30
C:32 Median : 16.20
Mean : 17.89
3rd Qu.: 18.20
Max. :221.00
After making categorical variables factors, `summary` shows the count of each category for the categorical variables.
The tip as a percentage of the bill (`PctTip`) has a maximum value of 221, which seems very strange: a customer is very unlikely to tip more than twice their bill total.
Let’s inspect the row where `PctTip` is greater than 100:
tips[tips$PctTip > 100, ]
# A tibble: 1 × 7
Bill Tip Credit Guests Day Server PctTip
<dbl> <dbl> <fct> <dbl> <fct> <fct> <dbl>
1 49.6 NA Yes 4 Thursday C 221
With a bill of 49.59, the tip would be 109.59 dollars:
49.59 * 221 / 100
[1] 109.5939
Furthermore, we also notice that the tipping amount is not available (NA). The corresponding value of the tip as a percentage of the bill is likely a data entry error, perhaps due to typing the leading 2 twice when recording the data. We will set that value to not available (NA) with the following code:
tips$PctTip[tips$PctTip > 100] <- NA
`a > b` tests whether a is greater than b. `a < b` tests whether a is smaller than b. `a == b` tests whether a is equal to b; notice the double equal sign! You can also use `>=` or `<=`.
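To make the comparison operators concrete, here is a short sketch on a made-up vector (`pct` and its values are hypothetical, chosen to resemble the percentages in the data):

```r
# Hypothetical values, used only to illustrate the comparison operators
pct <- c(42.2, 19.4, 221, 15.7)

pct > 100      # TRUE only where the value exceeds 100
pct == 221     # the double equal sign tests equality
pct <= 19.4    # less than or equal to

# A logical vector can be used inside [ ] to keep the matching values
pct[pct > 100]
```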
Alternatively you can use tidyverse:
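A minimal sketch of the tidyverse version, shown on a hypothetical three-row excerpt (`tips_small`) so that it runs on its own; in the lab the same `mutate` call would be applied to `tips`:

```r
library(tidyverse)

# Hypothetical excerpt: one implausible percentage above 100
tips_small <- tibble(
  Bill   = c(23.70, 49.59, 17.39),
  PctTip = c(42.2, 221, 20.8)
)

# Replace values above 100 with NA and keep all other values unchanged
tips_small <- tips_small %>%
  mutate(PctTip = ifelse(PctTip > 100, NA, PctTip))

tips_small
```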
Where the function `ifelse` selects a value depending on a condition to test: `ifelse(test, value_if_true, value_if_false)`. In the case above, each value in the column `PctTip` is replaced by NA if `PctTip > 100`, and it is kept the same otherwise.
summary(tips)
Bill Tip Credit Guests Day
Min. : 1.66 Min. : 0.250 No :106 Min. :1.000 Monday :20
1st Qu.:15.19 1st Qu.: 2.075 Yes: 51 1st Qu.:2.000 Tuesday :13
Median :20.22 Median : 3.340 Median :2.000 Wednesday:62
Mean :22.73 Mean : 3.807 Mean :2.096 Thursday :36
3rd Qu.:28.84 3rd Qu.: 5.000 3rd Qu.:2.000 Friday :26
Max. :70.51 Max. :15.000 Max. :7.000
NA's :1
Server PctTip
A:60 Min. : 6.70
B:65 1st Qu.:14.30
C:32 Median :16.15
Mean :16.59
3rd Qu.:18.05
Max. :42.20
NA's :1
The average bill size was $22.73, and the average tip was $3.81, corresponding to roughly 17% of the total bill. Out of 157 parties, only 51 paid with a credit card. Most parties consisted of about 2 people, and people tended to go to that restaurant more often on Wednesday. Among the three servers, server C was the one that served the fewest parties. The data also included a missing tipping value, corresponding to a bill of $49.59, and a data entry error for the corresponding measure of the tip as a percentage of the total bill.
4 Student Glossary
To conclude the lab, create a glossary of R functions. You can do so by opening Microsoft Word, Excel, or OneNote and creating a table with two columns: one for the name of an R function, and the other for a brief description of what the function does.
This “do it yourself” glossary is an opportunity for you to revise what you have learned in today’s lab and write down a few take-home messages. You will find this glossary handy as a reference to keep next to you when doing the assessed weekly quizzes.
Below you can find an example to get you started:
Function | Use and package |
---|---|
read_csv | For reading comma separated value files. Part of tidyverse package |
View | ? |
head | ? |
nrow | ? |
ncol | ? |
dim | ? |
glimpse | ? |
str | ? |
summary | ? |
factor | ? |
References
Footnotes
[^1]: Hint: `View(DATA)` or `head(DATA)`
[^2]: Hint: `nrow(DATA)` or `dim(DATA)[1]`
[^3]: Hint: `ncol(DATA)` or `dim(DATA)[2]`
[^4]: Hint: `glimpse(DATA)` from tidyverse or `str(DATA)` or `sapply(DATA, data.class)`
[^5]: Hint for the minimum and maximum: `summary(DATA)` or `min(DATA$VARIABLE)` and `max(DATA$VARIABLE)`. Hint for the average: `mean(DATA$VARIABLE)`
[^6]: For some movies, data on the budget or Rotten Tomatoes rating are not available (NA). These are also called missing values. If you used the functions `min()`, `max()`, `mean()` you will get NA as a result. This is because if a value is missing, you cannot compute the mean of something you don’t know. For example, what is the mean of 5, 10, and NA? How would I compute (5 + 10 + NA) / 3? I don’t know, so it remains NA. You can tell R to ignore the missing values by saying `min(DATA$VARIABLE, na.rm = TRUE)` and similarly for `max` and `mean`. Instead, `summary()` does this for you automatically and immediately tells you if a variable had any NAs and how many.
[^7]: Hint: `summary(DATA)` and `nrow(DATA)`
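The missing-value behaviour described in the footnote on budget and rating can be tried out on the footnote's own numbers. This is only a sketch; the vector name `x` is made up for illustration:

```r
# The example from the footnote: the values 5, 10, and one missing value
x <- c(5, 10, NA)

mean(x)                 # NA, because one value is unknown
mean(x, na.rm = TRUE)   # 7.5, the mean of the two known values
min(x, na.rm = TRUE)    # 5
max(x, na.rm = TRUE)    # 10
sum(is.na(x))           # 1 missing value
```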