Numeric data

Semester 1 - Week 3

Formative report A

In the first five weeks of the course your group should produce a PDF report using Rmarkdown, to be submitted at the end of week 5. You will receive formative feedback on your submission in week 6.
The submitted report should be a PDF file of 4 pages at most. In week 5, you can add an Appendix in which you will collate all the R code in a chunk with the setting results = 'hide', which does not count towards the page limit.
The report should not include any reference to R code or functions, but be written for a generic reader who is only assumed to have a basic statistical understanding without any R knowledge. You should also avoid any R code output or printout in the PDF file.

Data

For formative report A, please only focus on the variables Movie to Year, ignoring anything beyond that. In other words, do not analyse the variables IQ1 to PrivateTransport in the next five weeks of the course, we will use those later in the course.

Hollywood Movies. At the link https://uoepsy.github.io/data/hollywood_movies_subset.csv you will find data on Hollywood movies released between 2012 and 2018 from the top 5 lead studios and top 10 genres. The following variables were recorded:

  • Movie: Title of the movie
  • LeadStudio: Primary U.S. distributor of the movie
  • RottenTomatoes: Rotten Tomatoes rating (critics)
  • AudienceScore: Audience rating (via Rotten Tomatoes)
  • Genre: One of Action Adventure, Black Comedy, Comedy, Concert, Documentary, Drama, Horror, Musical, Romantic Comedy, Thriller, or Western
  • TheatersOpenWeek: Number of screens for opening weekend
  • OpeningWeekend: Opening weekend gross (in millions)
  • BOAvgOpenWeekend: Average box office income per theater, opening weekend
  • Budget: Production budget (in millions)
  • DomesticGross: Gross income for domestic (U.S.) viewers (in millions)
  • WorldGross: Gross income for all viewers (in millions)
  • ForeignGross: Gross income for foreign viewers (in millions)
  • Profitability: WorldGross as a percentage of Budget
  • OpenProfit: Percentage of budget recovered on opening weekend
  • Year: Year the movie was released
  • (Ignore for now) IQ1-IQ50: IQ score of each of 50 audience raters
  • (Ignore for now) Snacks: How many of the 50 audience raters brought snacks
  • (Ignore for now) PrivateTransport: How many of the 50 audience raters reached the cinema via private transportation

1 Tasks

For formative report A, you will be asked to perform the following tasks, each related to a week of teaching in this course.
This week you will only focus on task A3. In the next section you will find some guided sub-steps you may want to consider to complete task A3.

A1) Read the data into R, inspect it, and write a concise introduction to the data and its structure
A2) Display and describe the categorical variables

This week’s task

A3) Display and describe six numerical variables of your choice

A4) Display and describe a relationship of interest between two or three variables of your choice
A5) Finish the report write-up, knit to PDF, and submit the PDF for formative feedback

2 A3 sub-tasks

Tip

To see the hints, hover your cursor on the superscript numbers.

In this section you will find some guided sub-steps you may want to consider to complete task A3.

  • Reopen last week’s Rmd file, as you will continue last week’s work and build on it.1

We display numeric variables with histograms, density plots, or boxplots. Respectively, these use the function geom_histogram(), geom_density(), or geom_boxplot() from ggplot2, which is a package automatically loaded when you load tidyverse via library(tidverse). For illustration purposes, we will use the starwars dataset from tidyverse, containing information on Starwars characters.

Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

The distribution of the character heights (cm) can be displayed with a histogram:

ggplot(starwars, aes(x = height)) +
    geom_histogram(color = 'gray', fill = 'lightblue') +
    labs(x = "Character height (cm)", y = "Frequency")

The distribution of the character heights (cm) can be displayed with a density plot:

ggplot(starwars, aes(x = height)) +
    geom_density(color = 'dodgerblue') +
    labs(x = "Character height (cm)")

The distribution of the character heights (cm) can be displayed with a box plot:

ggplot(starwars, aes(x = height)) +
    geom_boxplot() +
    labs(x = "Character height (cm)")

  • Create six plots, each displaying the distribution of:2

    • Production budgets
    • Audience scores
    • Rotten Tomatoes ratings
    • World gross income
    • Foreign gross income
    • Year of movie release
  • Arrange the above plots as a single figure comprising 2 by 3 panels3

Consider again the starwars dataset. The mean and SD of the height variable can be computed as:

starwars %>%
    summarise(M = mean(height, na.rm = TRUE),
              SD = sd(height, na.rm = TRUE)) %>%
    round(digits = 2)
# A tibble: 1 × 2
      M    SD
  <dbl> <dbl>
1  174.  34.8

To make a nice table for the PDF document, you can use the kbl() function from the kableExtra package:

library(kableExtra)

starwars %>%
    summarise(M = mean(height, na.rm = TRUE),
              SD = sd(height, na.rm = TRUE)) %>%
    kbl(booktabs = TRUE, digits = 2)
M SD
174.36 34.77
  • Create a table of descriptive statistics that summarises the production budgets using the mean and standard deviation.4

Using summarise for more than a couple of variables would make the job very tedious and long. There is a shortcut, which uses the describe function from the psych package.
The following code creates a table of descriptive statistics (via the describe function from the psych package) and ensures the table is in proper format by using the kbl function from the kableExtra package.

library(kableExtra) # for the kbl function
library(psych)      # for the describe function

starwars %>%
    select(height, mass) %>%
    describe() %>%
    kbl(digits = 2, booktabs = TRUE)   # kbl converts to a nice PDF table
vars n mean sd median trimmed mad min max range skew kurtosis se
height 1 81 174.36 34.77 180 178.17 19.27 66 264 198 -1.03 1.78 3.86
mass 2 59 97.31 169.46 79 75.44 16.31 15 1358 1343 6.97 48.93 22.06

To only show the columns n, mean, sd, median you can use:

library(kableExtra) # for the kbl function
library(psych)      # for the describe function

starwars %>%
    select(height, mass) %>%
    describe() %>%
    select(n, mean, sd, median) %>%
    kbl(digits = 2, booktabs = TRUE)   # kbl converts to a nice PDF table
n mean sd median
height 81 174.36 34.77 180
mass 59 97.31 169.46 79
  • Create a table of descriptive statistics that summarises (using the mean and standard deviation) the six numeric variables which you plotted above. 5
  • Write up a summary of what you have reported in the table, using proper rounding to 2 decimal places and avoiding any reference to R code or functions.

To not show the code of an R code chunk, and only show the output, write:

```{r, echo=FALSE}
# code goes here
```

To show the code of an R code chunk, but hide the output, write:

```{r, results='hide'}
# code goes here
```

To hide both code and output of an R code chunk, write:

```{r, include=FALSE}
# code goes here
```
  • Make sure that all R code and output is not visible in the PDF report. The PDF report should only include text, tables and plots.

3 Worked example

Consider the dataset available at https://uoepsy.github.io/data/RestaurantTips.csv, containing 157 observations on the following 7 variables:

Variable Name Description
Bill Size of the bill (in dollars)
Tip Size of the tip (in dollars)
Credit Paid with a credit card? n or y
Guests Number of people in the group
Day Day of the week: m=Monday, t=Tuesday, w=Wednesday, th=Thursday, or f=Friday
Server Code for specific waiter/waitress: A, B, or C
PctTip Tip as a percentage of the bill

These data were collected by the owner of a bistro in the US, who was interested in understanding the tipping patterns of their customers. The data are adapted from Lock et al. (2020).

library(tidyverse)  # we use read_csv and glimpse from tidyverse
tips <- read_csv("https://uoepsy.github.io/data/RestaurantTips.csv")
head(tips)
# A tibble: 6 × 7
   Bill   Tip Credit Guests Day   Server PctTip
  <dbl> <dbl> <chr>   <dbl> <chr> <chr>   <dbl>
1  23.7 10    n           2 f     A        42.2
2  36.1  7    n           3 f     B        19.4
3  32.0  5.01 y           2 f     A        15.7
4  17.4  3.61 y           2 f     B        20.8
5  15.4  3    n           2 f     B        19.5
6  18.6  2.5  n           2 f     A        13.4
glimpse(tips)
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <chr> "n", "n", "y", "y", "n", "n", "n", "n", "n", "n", "n", "n", "n"…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f"…
$ Server <chr> "A", "B", "A", "B", "B", "A", "B", "A", "A", "B", "B", "A", "B"…
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…
tips$Day <- factor(tips$Day, 
                   levels = c("m", "t", "w", "th", "f"),
                   labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))

tips$Credit <- factor(tips$Credit, 
                      levels = c("n", "y"),
                      labels = c("No", "Yes"))

tips$Server <- factor(tips$Server)

glimpse(tips)
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <fct> No, No, Yes, Yes, No, No, No, No, No, No, No, No, No, No, No, N…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <fct> Friday, Friday, Friday, Friday, Friday, Friday, Friday, Friday,…
$ Server <fct> A, B, A, B, B, A, B, A, A, B, B, A, B, B, B, B, C, C, C, C, C, …
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…
tips <- tips %>%
    mutate(
        PctTip = ifelse(PctTip > 100, NA, PctTip)
    )

We can create a histogram of tips via:

ggplot(tips, aes(x = Bill)) + 
    geom_histogram(color = 'white') + 
    labs(x = "Size of the tip (US dollars)")

We can create a single figure with the distribution of all numeric variables by using the patchwork package:

library(patchwork)

pltBill <- ggplot(tips, aes(x = Bill)) +
    geom_histogram(color = 'white', fill = 'lightblue') +
    labs(x = "Size of the bill (US dollars)")

pltTip <- ggplot(tips, aes(x = Tip)) +
    geom_histogram(color = 'white', fill = 'lightblue') +
    labs(x = "Size of the tip (US dollars)")

pltGuests <- ggplot(tips, aes(x = Guests)) +
    geom_bar(fill = 'lightblue') +
    labs(x = "Number of people in the group")

pltPctTip <- ggplot(tips, aes(x = PctTip)) +
    geom_histogram(color = 'white', fill = 'lightblue') +
    labs(x = "Tip as a percentage of the bill")

(pltBill | pltTip) / (pltGuests | pltPctTip)

To summarise one numeric variable, you can use the summarise function from tidyverse, which takes the data and computes a numeric summary. The syntax is:

data %>%
    summarise(
        column_name = computation
    )

This computes the mean and SD of tip size (in US dollars):

tips %>%
    summarise(
        M = mean(Tip, na.rm = TRUE),
        SD = sd(Tip, na.rm = TRUE)
    ) %>%
    kbl(digits = 2, booktabs = TRUE)
M SD
3.81 2.37

To summarise all of the numeric variables into a single table of descriptive statistics you can use the describe function from the psych package:

library(kableExtra)
library(psych)

tips %>%
    select(Bill, Tip, Guests, PctTip) %>%
    describe() %>%
    kbl(booktabs = TRUE, digits = 2)
vars n mean sd median trimmed mad min max range skew kurtosis se
Bill 1 157 22.73 12.16 20.22 21.37 11.03 1.66 70.51 68.85 1.22 1.89 0.97
Tip 2 156 3.81 2.37 3.34 3.50 1.99 0.25 15.00 14.75 1.50 3.13 0.19
Guests 3 157 2.10 0.93 2.00 1.98 0.00 1.00 7.00 6.00 2.22 7.81 0.07
PctTip 4 156 16.59 4.39 16.15 16.25 2.74 6.70 42.20 35.50 2.50 12.39 0.35

To only keep the sample size, mean, SD, median, min, max, we use the select function from tidyverse:

library(kableExtra)
library(psych)

tips %>%
    select(Bill, Tip, Guests, PctTip) %>%
    describe() %>%
    select(n, mean, sd, median, min, max) %>%
    kbl(booktabs = TRUE, digits = 2)
n mean sd median min max
Bill 157 22.73 12.16 20.22 1.66 70.51
Tip 156 3.81 2.37 3.34 0.25 15.00
Guests 157 2.10 0.93 2.00 1.00 7.00
PctTip 156 16.59 4.39 16.15 6.70 42.20
Use the appropriate summary for each variable type

Ensure that you summarise variables correctly:

  • For categorical variables use frequency tables
  • For continuous variables use a table of descriptive statistics (mean, SD, Median, etc.)

You should not summarise categorical variables with the mean, SD, and this is why it’s important to use select() before describe() to only keep the variables that are continuous.

Example writeup

The distributions of bill size, tip size, and group size are skewed to the right. The distribution of tips, as a percentage of the total bill, appears to be approximately bell shaped, with three outliers in the right tail of the distribution.
The average bill was about $22.73, with a SD of $12.16. The average tip was $3.81, with a SD of $2.37, corresponding to an average tip as a percentage of the total bill of $16.59, with a SD of $4.39. The average party size comprised 2 guests, with a SD of roughly 1 person.

4 Student Glossary

To conclude the lab, add the new functions to the glossary of R functions that you started last week.

Function Use and package
geom_histogram ?
geom_density ?
geom_boxplot ?
patchwork: | and / ?
summarise ?
select ?
kbl ?
describe ?

References

Lock, Robin H, Patti Frazer Lock, Kari Lock Morgan, Eric F Lock, and Dennis F Lock. 2020. Statistics: Unlocking the Power of Data. John Wiley & Sons.

Footnotes

  1. Hint: ask last week’s driver for the Rmd file, they should share it with the group via email or Teams.
    To download the file from the server, go to the RStudio Files pane, tick the box next to the Rmd file, and select More > Export.↩︎

  2. Hint: geom_histogram(), geom_density(), or geom_boxplot()↩︎

  3. Hint: the | and / functions from the patchwork package.

    For example, if plt1, plt2, plt3, and plt4 are plots, this creates a single figure:

    (plt1 | plt2) / (plt3 | plt4)↩︎

  4. Hint: the summarise() function from tidyverse, mean() and sd().

    Hint: if you remember, there are NAs in the variable. What do you need to add to the mean() and sd() function to ensure they do not affect the computation?↩︎

  5. Hint: the describe function from the psych package.↩︎