Categorical data

Semester 1 - Week 2

1 Formative report A

In the first five weeks of the course your group should produce a PDF report using Rmarkdown, to be submitted at the end of week 5. You will receive formative feedback on your submission in week 6.
The submitted report should be a PDF file of 4 pages at most. In week 5, you can add an Appendix, which does not count towards the page limit, collating all the R code in a chunk with the setting results = 'hide'.
The report should not include any reference to R code or functions, but be written for a generic reader who is only assumed to have a basic statistical understanding without any R knowledge. You should also avoid any R code output or printout in the PDF file.

To not show the code of an R code chunk, and only show the output, write:

```{r, echo=FALSE}
# code goes here
```

To show the code of an R code chunk, but hide the output, write:

```{r, results='hide'}
# code goes here
```

To hide both code and output of an R code chunk, write:

```{r, include=FALSE}
# code goes here
```

1.1 Data

For formative report A, please focus only on the variables Movie to Year and ignore anything beyond that. In other words, do not analyse the variables IQ1 to PrivateTransport in the next five weeks of the course; we will use those later in the course.

Hollywood Movies. At the link https://uoepsy.github.io/data/hollywood_movies_subset.csv you will find data on Hollywood movies released between 2012 and 2018 from the top 5 lead studios and top 10 genres. The following variables were recorded:

  • Movie: Title of the movie
  • LeadStudio: Primary U.S. distributor of the movie
  • RottenTomatoes: Rotten Tomatoes rating (critics)
  • AudienceScore: Audience rating (via Rotten Tomatoes)
  • Genre: One of Action Adventure, Black Comedy, Comedy, Concert, Documentary, Drama, Horror, Musical, Romantic Comedy, Thriller, or Western
  • TheatersOpenWeek: Number of screens for opening weekend
  • OpeningWeekend: Opening weekend gross (in millions)
  • BOAvgOpenWeekend: Average box office income per theater, opening weekend
  • Budget: Production budget (in millions)
  • DomesticGross: Gross income for domestic (U.S.) viewers (in millions)
  • WorldGross: Gross income for all viewers (in millions)
  • ForeignGross: Gross income for foreign viewers (in millions)
  • Profitability: WorldGross as a percentage of Budget
  • OpenProfit: Percentage of budget recovered on opening weekend
  • Year: Year the movie was released
  • (Ignore for now) IQ1-IQ50: IQ score of each of 50 audience raters
  • (Ignore for now) Snacks: How many of the 50 audience raters brought snacks
  • (Ignore for now) PrivateTransport: How many of the 50 audience raters reached the cinema via private transportation

1.2 Tasks

For formative report A, you will be asked to perform the following tasks, each related to a week of teaching in this course.
This week you will only focus on task A2. In the next section you will find some guided sub-steps you may want to consider to complete task A2.

A1) Read the data into R, inspect it, and write a concise introduction to the data and its structure

This week’s task

A2) Display and describe the categorical variables

A3) Display and describe six numerical variables of your choice
A4) Display and describe a relationship of interest between two or three variables of your choice
A5) Finish the report write-up, knit to PDF, and submit the PDF for formative feedback

1.3 A2 sub-tasks

Tip

The number at the end of each sub-task points to the matching hint in the Footnotes section at the end of this document.

In this section you will find some guided sub-steps you may want to consider to complete task A2.

  • Reopen last week’s Rmd file, as you will continue last week’s work and build on it.[1]

Consider a table of toy data comprising a participant identifier (id: 1 to 5), the participant’s age, the course (A or B) they are enrolled in, and their height:

library(tidyverse)  # provides tibble(), select(), and %>%

toy_data <- tibble(
    id = 1:5,
    age = c(18, 20, 25, 22, 19),
    course = c("A", "B", "A", "B", "A"),
    height = c(171, 180, 168, 193, 174)
)
toy_data
# A tibble: 5 × 4
     id   age course height
  <int> <dbl> <chr>   <dbl>
1     1    18 A         171
2     2    20 B         180
3     3    25 A         168
4     4    22 B         193
5     5    19 A         174

To select the first three columns, you can specify the range from:to using either the column numbers or the column names:

toy_data %>%
    select(1:3)
# A tibble: 5 × 3
     id   age course
  <int> <dbl> <chr> 
1     1    18 A     
2     2    20 B     
3     3    25 A     
4     4    22 B     
5     5    19 A     

toy_data %>%
    select(id:course)
# A tibble: 5 × 3
     id   age course
  <int> <dbl> <chr> 
1     1    18 A     
2     2    20 B     
3     3    25 A     
4     4    22 B     
5     5    19 A     

However, if you check toy_data, you will see that it has not changed. The result of the above computation was only printed to the screen, not stored.

toy_data
# A tibble: 5 × 4
     id   age course height
  <int> <dbl> <chr>   <dbl>
1     1    18 A         171
2     2    20 B         180
3     3    25 A         168
4     4    22 B         193
5     5    19 A         174

To store it, we need to assign the result:

toy_data <- toy_data %>%
    select(id:course)
toy_data
# A tibble: 5 × 3
     id   age course
  <int> <dbl> <chr> 
1     1    18 A     
2     2    20 B     
3     3    25 A     
4     4    22 B     
5     5    19 A     

By doing the above, we have overwritten the data stored in toy_data with the selected columns.
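If you would rather keep the original data untouched, one option is to assign the selection to a new object instead of overwriting. The sketch below is only illustrative: the names toy_full and toy_subset are made up for this example.

```r
library(tidyverse)

# Hypothetical copy of the toy data, so that toy_data itself is left alone
toy_full <- tibble(
    id = 1:5,
    age = c(18, 20, 25, 22, 19),
    course = c("A", "B", "A", "B", "A"),
    height = c(171, 180, 168, 193, 174)
)

# Store the selection under a new name; toy_full keeps all four columns
toy_subset <- toy_full %>%
    select(id:course)

names(toy_full)    # "id" "age" "course" "height"
names(toy_subset)  # "id" "age" "course"
```

Overwriting keeps your workspace tidy, while a new name lets you go back to the full data without re-reading the file.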

  • Overwrite the data to only include the first 15 variables (i.e. columns).[2]

  • Create a plot displaying the frequency distribution of movie genres.[3]

  • Create a plot displaying the frequency distribution of the lead studios.[4]

  • Would it make sense to create a plot of the frequency distribution of movie names?[5]

Tip

Before applying a function to your data, you should always ask yourself if what you are about to do is going to convey any insight about the data, compared to just looking at the data itself.
The goal of data analysis is to go from a multitude of values to insights that provide actionable information at a quick glance.
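One quick check before plotting a categorical variable is to compare its number of distinct values with the number of rows. The sketch below uses a made-up table (toy_check and distinct_summary are hypothetical names). If a variable has as many distinct values as there are rows, every bar in its bar plot would have height 1.

```r
library(tidyverse)

# Hypothetical toy data: id is unique per row, course is not
toy_check <- tibble(
    id = 1:5,
    course = c("A", "B", "A", "B", "A")
)

# Count the rows and the distinct values of each variable
distinct_summary <- toy_check %>%
    summarise(
        n_rows = n(),
        n_ids = n_distinct(id),
        n_courses = n_distinct(course)
    )
distinct_summary
```

Here id has 5 distinct values across 5 rows, so a bar plot of id would add nothing, whereas course has only 2 distinct values and is worth summarising.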

  • Describe the distribution of movie genres. You may want to include both the frequency and the percentage frequency.[6]

Consider this code:

toy_data %>%
    count(course) %>%
    mutate(
        perc = round(n / sum(n) * 100, 2)
    )
# A tibble: 2 × 3
  course     n  perc
  <chr>  <int> <dbl>
1 A          3    60
2 B          2    40

You can rename the frequency column from n to Freq, and the percentage column from perc to Perc, using:

toy_data %>%
    count(course, name = "Freq") %>%
    mutate(
        Perc = round(Freq / sum(Freq) * 100, 2)
    )
# A tibble: 2 × 3
  course  Freq  Perc
  <chr>  <int> <dbl>
1 A          3    60
2 B          2    40

An alternative to the above involves using the group_by(), summarise(), and n() functions from tidyverse. The n() function counts the number of values, and we use it inside summarise() because we are summarising the data with a number. Before summarise, we use group_by(course) to tell R to do the computation for each unique course entry, i.e. for each group of rows defined by course.

toy_data %>%
    group_by(course) %>%
    summarise(
        Freq = n()
    ) %>%
    mutate(
        Perc = round(Freq / sum(Freq) * 100, 2)
    )
# A tibble: 2 × 3
  course  Freq  Perc
  <chr>  <int> <dbl>
1 A          3    60
2 B          2    40
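
When reporting rounded percentages, keep in mind that they need not add up to exactly 100. A minimal sketch with made-up data (toy_thirds and tbl_thirds are hypothetical names):

```r
library(tidyverse)

# Hypothetical data with three equally frequent categories
toy_thirds <- tibble(group = c("a", "b", "c"))

tbl_thirds <- toy_thirds %>%
    count(group, name = "Freq") %>%
    mutate(Perc = round(Freq / sum(Freq) * 100, 2))

tbl_thirds$Perc       # 33.33 33.33 33.33
sum(tbl_thirds$Perc)  # 99.99, not 100, because each value was rounded
```

A small discrepancy of this kind is a rounding effect and not an error in the table.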

  • Describe the distribution of lead studios. You may want to include both the frequency and the percentage frequency.[7]

  • What is the most common genre and the most common lead studio?[8]

2 Worked example

Consider the dataset available at https://uoepsy.github.io/data/RestaurantTips.csv, containing 157 observations on the following 7 variables:

| Variable Name | Description |
|---|---|
| Bill | Size of the bill (in dollars) |
| Tip | Size of the tip (in dollars) |
| Credit | Paid with a credit card? n or y |
| Guests | Number of people in the group |
| Day | Day of the week: m=Monday, t=Tuesday, w=Wednesday, th=Thursday, or f=Friday |
| Server | Code for specific waiter/waitress: A, B, or C |
| PctTip | Tip as a percentage of the bill |

These data were collected by the owner of a bistro in the US, who was interested in understanding the tipping patterns of their customers. The data are adapted from Lock et al. (2020).

library(tidyverse)  # we use read_csv and glimpse from tidyverse
tips <- read_csv("https://uoepsy.github.io/data/RestaurantTips.csv")
head(tips)
# A tibble: 6 × 7
   Bill   Tip Credit Guests Day   Server PctTip
  <dbl> <dbl> <chr>   <dbl> <chr> <chr>   <dbl>
1  23.7 10    n           2 f     A        42.2
2  36.1  7    n           3 f     B        19.4
3  32.0  5.01 y           2 f     A        15.7
4  17.4  3.61 y           2 f     B        20.8
5  15.4  3    n           2 f     B        19.5
6  18.6  2.5  n           2 f     A        13.4

head() shows the first 6 rows of the data by default. Use the n = ... option to change this, e.g. head(<data>, n = 10).

glimpse(tips)
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <chr> "n", "n", "y", "y", "n", "n", "n", "n", "n", "n", "n", "n", "n"…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f"…
$ Server <chr> "A", "B", "A", "B", "B", "A", "B", "A", "A", "B", "B", "A", "B"…
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…

glimpse is part of the tidyverse package and is used to check the type of each variable.

We can use better labels for the categorical variables:

tips$Day <- factor(tips$Day, 
                   levels = c("m", "t", "w", "th", "f"),
                   labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))

tips$Credit <- factor(tips$Credit, 
                      levels = c("n", "y"),
                      labels = c("No", "Yes"))

tips$Server <- factor(tips$Server)

glimpse(tips)
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <fct> No, No, Yes, Yes, No, No, No, No, No, No, No, No, No, No, No, N…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <fct> Friday, Friday, Friday, Friday, Friday, Friday, Friday, Friday,…
$ Server <fct> A, B, A, B, B, A, B, A, A, B, B, A, B, B, B, B, C, C, C, C, C, …
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…

The categorical variable course in toy_data with levels “A” and “B” can have the values relabelled to “Year1” and “Year2” via:

toy_data$course <- factor(
  toy_data$course, 
  levels = c("A", "B"), 
  labels = c("Year1", "Year2")
)
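
When relabelling with factor(), it helps to know that the order given in levels becomes the order used in tables and plots, and that any value not listed in levels is turned into NA. The following sketch illustrates this on a made-up vector (day_codes and day_fct are hypothetical names):

```r
# Hypothetical vector of day codes; "sa" is not among the listed levels
day_codes <- c("m", "t", "sa", "m")

day_fct <- factor(
    day_codes,
    levels = c("m", "t"),
    labels = c("Monday", "Tuesday")
)

day_fct              # Monday Tuesday <NA> Monday
levels(day_fct)      # "Monday" "Tuesday"
sum(is.na(day_fct))  # 1 value was not in levels and became NA
```

For this reason, it is worth checking the distinct values of a column, for example with count(), before deciding on the levels.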

Last week, we also saw that if someone tipped more than 100% of the bill size, it was likely a data input error, and we decided to replace that value with NA (not available). We do this by combining mutate and ifelse.

The mutate function takes as arguments:

  • column name
  • =
  • how to compute that column

The syntax for ifelse is:

ifelse(test_condition, 
       value_if_true, 
       value_if_false)

Applying both to the tips data:

tips <- tips %>%
    mutate(
        PctTip = ifelse(PctTip > 100, NA, PctTip)
    )
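
To see what ifelse does element by element, here is a small sketch on a made-up vector (pct and pct_clean are hypothetical names, and the values are invented):

```r
# Made-up percentages; 221 plays the role of a likely input error
pct <- c(15.5, 221, 20, NA)

# Values above 100 become NA, all other values are kept as they are
pct_clean <- ifelse(pct > 100, NA, pct)

pct_clean              # 15.5   NA 20.0   NA
sum(is.na(pct_clean))  # 2
```

Note that a value that was already missing stays missing, because the test condition itself evaluates to NA for that element.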

The following code displays the frequency distribution of the Credit variable, i.e. how many bills were and were not paid by credit card:

plt_credit <- ggplot(tips, aes(x = Credit)) +
    geom_bar() +
    labs(x = "Paid by credit card?", y = "Count")
plt_credit

You can even flip the coordinates, if you wish to, using the coord_flip() function:

ggplot(tips, aes(x = Credit)) +
    geom_bar() +
    labs(x = "Paid by credit card?") +
    coord_flip()
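
geom_bar() counts the rows for you. If you would rather display percentages, one possible approach is to compute them first and then plot the precomputed values with geom_col(). The sketch below uses made-up data, and the names toy_courses, tbl_courses, and plt_perc are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data with one categorical variable
toy_courses <- tibble(course = c("A", "B", "A", "B", "A"))

# Compute the frequency and percentage of each category first
tbl_courses <- toy_courses %>%
    count(course, name = "Freq") %>%
    mutate(Perc = round(Freq / sum(Freq) * 100, 2))

# Then plot the precomputed percentages: geom_col() uses the y values as given
plt_perc <- ggplot(tbl_courses, aes(x = course, y = Perc)) +
    geom_col() +
    labs(x = "Course", y = "Percentage of participants")
plt_perc
```

The difference is that geom_bar() needs only an x variable, whereas geom_col() needs both x and y.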

You can use the patchwork package to place graphs side by side. Simply create an object for each graph, and concatenate the objects with | for horizontal concatenation and / for vertical concatenation of graphs. You can even combine this by using parentheses, e.g. (plot1 | plot2) / (plot3 | plot4) for 2 rows and 2 columns.

Run install.packages("patchwork") first in your R console

To rotate x-axis labels by 90 degrees, you can use this code:

theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

To rotate the labels by 45 degrees, you can use:

theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Don’t worry, no one remembers it. People always google “rotate x-axis labels ggplot” to find it.

We can display the frequency distributions of all the categorical variables (Credit, Day, and Server) side by side:

library(patchwork)

plt1 <- ggplot(tips, aes(x = Credit)) +
    geom_bar() +
    labs(x = "Paid by credit card?", y = "Count") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

plt2 <- ggplot(tips, aes(x = Day)) +
    geom_bar() +
    labs(x = "Day of week", y = "Count") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

plt3 <- ggplot(tips, aes(x = Server)) +
    geom_bar() +
    labs(y = "Count") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

plt1 | plt2 | plt3
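
The | and / operators can also be combined with parentheses to build a grid. A minimal sketch on made-up data, where toy_plot_data, plt_course, plt_year, and plt_grid are hypothetical names:

```r
library(tidyverse)
library(patchwork)

# Hypothetical toy data with two categorical variables
toy_plot_data <- tibble(
    course = c("A", "B", "A", "B", "A"),
    year = c("Year1", "Year1", "Year2", "Year2", "Year2")
)

plt_course <- ggplot(toy_plot_data, aes(x = course)) +
    geom_bar() +
    labs(x = "Course", y = "Count")

plt_year <- ggplot(toy_plot_data, aes(x = year)) +
    geom_bar() +
    labs(x = "Year", y = "Count")

# Two plots side by side on the top row, one plot spanning the bottom row
plt_grid <- (plt_course | plt_year) / plt_course
plt_grid
```

The parentheses control grouping, just as in arithmetic: the top row is assembled first and is then stacked on top of the bottom plot.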

A frequency table can be obtained using:

tbl_credit <- tips %>%
    count(Credit) %>%
    mutate(
        perc = round((n / sum(n)) * 100, 2)
    )
tbl_credit
# A tibble: 2 × 3
  Credit     n  perc
  <fct>  <int> <dbl>
1 No       106  67.5
2 Yes       51  32.5

tbl_day <- tips %>%
    count(Day) %>%
    mutate(
        perc = round((n / sum(n)) * 100, 2)
    )
tbl_day
# A tibble: 5 × 3
  Day           n  perc
  <fct>     <int> <dbl>
1 Monday       20 12.7 
2 Tuesday      13  8.28
3 Wednesday    62 39.5 
4 Thursday     36 22.9 
5 Friday       26 16.6 

tbl_server <- tips %>%
    count(Server) %>%
    mutate(
        perc = round((n / sum(n)) * 100, 2)
    )
tbl_server
# A tibble: 3 × 3
  Server     n  perc
  <fct>  <int> <dbl>
1 A         60  38.2
2 B         65  41.4
3 C         32  20.4

You can create nice tables with the kbl command from the kableExtra package.

Run install.packages("kableExtra") first in your R console

library(kableExtra)

kbl(list(tbl_credit, tbl_day, tbl_server), booktabs = TRUE)

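Frequency tables of categorical variables

Paid with a credit card

| Credit | n | perc |
|---|---|---|
| No | 106 | 67.52 |
| Yes | 51 | 32.48 |

Day of the week

| Day | n | perc |
|---|---|---|
| Monday | 20 | 12.74 |
| Tuesday | 13 | 8.28 |
| Wednesday | 62 | 39.49 |
| Thursday | 36 | 22.93 |
| Friday | 26 | 16.56 |

Server

| Server | n | perc |
|---|---|---|
| A | 60 | 38.22 |
| B | 65 | 41.40 |
| C | 32 | 20.38 |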
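
Since the report should not show R variable names, you may want reader-friendly column headers and a caption. One way to do this, sketched on made-up data, is through the col.names and caption arguments of kbl(). The names tbl_toy and tbl_out, as well as the caption text, are hypothetical.

```r
library(tidyverse)
library(kableExtra)

# Hypothetical frequency table, built from made-up data
tbl_toy <- tibble(course = c("A", "B", "A", "B", "A")) %>%
    count(course, name = "Freq") %>%
    mutate(Perc = round(Freq / sum(Freq) * 100, 2))

# col.names replaces the raw column names; caption adds a table title
tbl_out <- kbl(
    tbl_toy,
    col.names = c("Course", "Frequency", "Percentage"),
    caption = "Frequency distribution of course enrolment",
    booktabs = TRUE
)
tbl_out
```

The vector given to col.names must have one entry per column, in the same order as the columns of the table.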

To sort a frequency table from the most to the least frequent category, add arrange(desc(column_of_freq)) at the end. For example:

tbl_day <- tips %>%
    count(Day) %>%
    mutate(
        perc = round((n / sum(n)) * 100, 2)
    ) %>%
    arrange(desc(n))
tbl_day
# A tibble: 5 × 3
  Day           n  perc
  <fct>     <int> <dbl>
1 Wednesday    62 39.5 
2 Thursday     36 22.9 
3 Friday       26 16.6 
4 Monday       20 12.7 
5 Tuesday      13  8.28

If you have renamed the frequency column, pass the new name to desc(). For example:

tbl_day <- tips %>%
    count(Day, name = "Freq") %>%
    mutate(
        Perc = round((Freq / sum(Freq)) * 100, 2)
    ) %>%
    arrange(desc(Freq))
tbl_day
# A tibble: 5 × 3
  Day        Freq  Perc
  <fct>     <int> <dbl>
1 Wednesday    62 39.5 
2 Thursday     36 22.9 
3 Friday       26 16.6 
4 Monday       20 12.7 
5 Tuesday      13  8.28

From the univariate distribution (or marginal distribution) of each categorical variable we see that the most common payment method was not a credit card, and the most common day of the week to dine at that restaurant was Wednesday. Finally, server B waited on more parties than either of the other servers.

The most common value is the mode.
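
As an alternative to sorting the whole table, the row with the highest frequency can be extracted directly. The sketch below uses slice_max() from tidyverse on made-up data (tbl_mode is a hypothetical name):

```r
library(tidyverse)

# Hypothetical toy data: course A appears 3 times, course B twice
tbl_mode <- tibble(course = c("A", "B", "A", "B", "A")) %>%
    count(course, name = "Freq") %>%
    slice_max(Freq, n = 1)  # keep the row(s) with the largest Freq
tbl_mode
```

By default slice_max() keeps all tied rows, so a variable with two equally common categories would return two rows.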

3 Student Glossary

To conclude the lab, add the new functions to the glossary of R functions that you started last week.

| Function | Use and package |
|---|---|
| factor | ? |
| %>% | ? |
| geom_bar | ? |
| labs | ? |
| count | ? |
| mutate | ? |
| sum | ? |
| round | ? |
| coord_flip | ? |
| kbl | ? |
| arrange | ? |
| desc | ? |

References

Lock, Robin H, Patti Frazer Lock, Kari Lock Morgan, Eric F Lock, and Dennis F Lock. 2020. Statistics: Unlocking the Power of Data. John Wiley & Sons.

Footnotes

  1. Hint: ask last week’s driver for the Rmd file; they should share it with the group via email or Teams.

  2. Hint: use the select() function from tidyverse.

  3. Hint: we display categorical variables with barplots. Consider the geom_bar() function.

    Example: For the toy_data from above, the frequency distribution of course enrollment is:

    ggplot(toy_data, aes(x = course)) +
        geom_bar() +
        labs(x = "Enrollment per course", y = "Frequency")

  4. Hint: similar to above, change the column to LeadStudio.

  5. Hint: what would be the height of each bar? Would adding such a plot to a report bring any insights and be useful to a decision maker?

  6. Hint: We describe categorical variables with frequency distributions.

    Consider using the count function from tidyverse and mutate for adding percentages.

    Example:

    toy_data %>%
        count(course) %>%
        mutate(
            perc = round(n / sum(n) * 100, 2)
        )
    # A tibble: 2 × 3
      course     n  perc
      <chr>  <int> <dbl>
    1 A          3    60
    2 B          2    40

  7. Hint: similar to above, but replacing Genre with LeadStudio.

  8. Hint: What is the mode of Genre and LeadStudio? In other words, which category in each of those frequency distributions has the highest frequency?
    Tip: You may want to sort the frequency tables in descending order. The function arrange(desc(<column_of_freq>)) may help.