Probability Theory

Semester 1 - Week 7

1 Formative report B

Download the template Rmd file and start your work there. Save your work regularly by clicking File > Save.

Formative Report B covers the labs from weeks 7-11 of the DAPR1 course. You’ll need to create a PDF report using RMarkdown, which will be submitted by 12 noon on Friday, 29th November 2024. Remember, these submissions are group-based, so there are no extensions. Expect written formative feedback in week 12.

Your report should be tailored for a reader with basic statistical knowledge and should not include any references to R code or functions in the main report write-up. Instead, keep the main report focused on text, figures, and tables. All R code should be included in the compulsory Appendix B for reproducibility, which is automatically created for you in the template Rmd file. If you need to add extra tables or figures that don’t fit in the main part of the report, you can use an optional Appendix A. Remember, the main report should be a PDF file and should not exceed six sides of A4 paper, though appendices at the end don’t count towards this limit.

Ensure to use the default settings for font and page margins in your RMarkdown file. Also, make sure your report title includes your group name: Group NAME.LETTER, and list the exam numbers of all group members in the author section.

Don’t forget to register for your lab table group on LEARN by navigating to the course LEARN page, clicking on Groups, selecting Labs_1_2_3, finding your group, and clicking Join.

At this page you can find resources to help you with your report formatting.

1.1 Data

Hollywood Movies. At the link https://uoepsy.github.io/data/hollywood_movies_subset.csv you will find data on Hollywood movies released between 2012 and 2018 from the top 5 lead studios and top 10 genres. Among all variables, the following will be used for this report:

  • LeadStudio: Primary U.S. distributor of the movie
  • AudienceScore: Audience rating (via Rotten Tomatoes)
  • Genre: One of Action Adventure, Black Comedy, Comedy, Concert, Documentary, Drama, Horror, Musical, Romantic Comedy, Thriller, or Western
  • IQ1-IQ50: IQ score of each of 50 audience raters (every movie had different raters)
  • Snacks: How many of the 50 audience raters bought snacks
  • PrivateTransport: How many of the 50 audience raters reached the cinema via private transportation

1.2 Tasks

For formative report B, you will be asked to perform the following tasks, each related to a week of teaching in this course.
This week’s task is highlighted in bold below. Please only focus on completing that task this week. In the next section, you will also find the guided sub-steps that you need to consider to complete this week’s task.

B1) Create a new categorical variable, Rating, taking the value ‘Good’ if the audience score is > 50, and ‘Bad’ otherwise. Inspect and describe the joint probability distribution of movie genre and rating.
B2) Investigate if a movie receiving a good rating is independent of the genre.
B3) Computing and plotting probabilities with a binomial distribution.
B4) Computing and plotting probabilities with a normal distribution.
B5) Finish the report write-up and formatting, knit to PDF, and submit the PDF for formative feedback.

1.3 B1 sub-tasks

This week you will only focus on task B1. Below there are sub-steps you need to consider to complete task B1.

Tip

To see the hints, hover your cursor on the superscript numbers.

  • Read the Hollywood movies data into R and give it a helpful name. Data link: https://uoepsy.github.io/data/hollywood_movies_subset.csv

  • Find the 3 most frequent movie genres. Next, filter the data to only keep movies from the 3 most frequent genres. This is the dataset you will be working with for the rest of the report.1

  • Create a variable called Rating where the AudienceScore variable is recoded so that the movies scoring less than or equal to 50 are coded as “Bad” and those scoring over 50 (i.e., >50) are “Good”.2
  • Ensure Rating and Genre are coded as factors.3
  • Create a contingency table of Rating by Genre.4
  • Transform the table of counts to a relative frequency table.5
  • Do the numbers in the table satisfy the requirements of probabilities?6
  • Instead of checking the probability requirements manually, add row and column totals to the relative frequency table.7
  • Visualise the relative frequency table as a mosaic plot, making sure to add a main title and clear axis titles.8
  • Does your plot have any NAs? If yes, you can drop the NAs before re-plotting, see hint. If there are no NAs, no action is needed but read over the hint anyway.9
  • In the introduction section of your report, write up a small introduction to the data and variables.

    • What are the data that you are analysing and where can these be found?
    • Which variables are you using and what do they represent?
    • What is the type of these variables?
    • Are there any missing values in these variables?
    • In the next weeks, update the introduction based on the new variables that you will consider.
  • In the analysis section of your report, present and write up your results from above, using proper rounding to 2 decimal places and avoiding any reference to R code or functions.

2 Worked Example

Consider the dataset available at https://uoepsy.github.io/data/RestaurantTips.csv, containing 157 observations on the following 7 variables:

Variable Name Description
Bill Size of the bill (in dollars)
Tip Size of the tip (in dollars)
Credit Paid with a credit card? n or y
Guests Number of people in the group
Day Day of the week: m=Monday, t=Tuesday, w=Wednesday, th=Thursday, or f=Friday
Server Code for specific waiter/waitress: A, B, or C
PctTip Tip as a percentage of the bill

These data were collected by the owner of a bistro in the US, who was interested in understanding the tipping patterns of their customers. The data are adapted from Lock et al. (2020).

library(tidyverse)  # we use read_csv and glimpse from tidyverse
tips <- read_csv("https://uoepsy.github.io/data/RestaurantTips.csv")
head(tips)
# A tibble: 6 × 7
   Bill   Tip Credit Guests Day   Server PctTip
  <dbl> <dbl> <chr>   <dbl> <chr> <chr>   <dbl>
1  23.7 10    n           2 f     A        42.2
2  36.1  7    n           3 f     B        19.4
3  32.0  5.01 y           2 f     A        15.7
4  17.4  3.61 y           2 f     B        20.8
5  15.4  3    n           2 f     B        19.5
6  18.6  2.5  n           2 f     A        13.4
glimpse(tips)
Rows: 157
Columns: 7
$ Bill   <dbl> 23.70, 36.11, 31.99, 17.39, 15.41, 18.62, 21.56, 19.58, 23.59, …
$ Tip    <dbl> 10.00, 7.00, 5.01, 3.61, 3.00, 2.50, 3.44, 2.42, 3.00, 2.00, 1.…
$ Credit <chr> "n", "n", "y", "y", "n", "n", "n", "n", "n", "n", "n", "n", "n"…
$ Guests <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 1, 5, 5, …
$ Day    <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f", "f"…
$ Server <chr> "A", "B", "A", "B", "B", "A", "B", "A", "A", "B", "B", "A", "B"…
$ PctTip <dbl> 42.2, 19.4, 15.7, 20.8, 19.5, 13.4, 16.0, 12.4, 12.7, 10.7, 11.…

We can filter our data to only include rows of data from Servers A and B, and save this filtered data to a new dataset called “tips2”.

Because we want to include servers A and B, but not C, we can use the != (or does not equal) operator.

Some common operators include:

Operator Description
< less than
> greater than
<= less than or equal to
>= greater than or equal to
== equal to
!= not equal to
LEFT %in% RIGHT is LEFT a member of RIGHT?
tips2 <- tips |> 
    filter(Server != "C")

Consider if, for example, we wanted to recode the Tip variable so that those tipping less than or equal to 15% were coded as “Below” average, and those tipping over 15% were coded as “Above” average (15% being used as below/above average tips cut-off in relation to usual US tipping rates). To do so, we could create a new variable called “Tip_Avg”.

tips2 <- tips2 |>
    mutate(Tip_Avg = ifelse(PctTip <= 15, 'Below', 'Above'))

table(tips2$Tip_Avg)

Above Below 
   80    45 

Now that we have the variables we want, it would be a good point to make these both factors:

tips2$Server <- factor(tips2$Server)
tips2$Tip_Avg <- factor(tips2$Tip_Avg)

To visually represent the distribution of how many customers were served by each server and if they left a below or above average tip, we could Create a contingency table:

freq_tbl <- table(tips2$Server, tips2$Tip_Avg)
freq_tbl
   
    Above Below
  A    40    20
  B    40    25

We could then transform the table of counts above to instead represent a relative frequency table:

rel_freq_tbl <- freq_tbl |>
    prop.table()
rel_freq_tbl
   
    Above Below
  A  0.32  0.16
  B  0.32  0.20

Before we interpret our results, we must ensure that the numbers above satisfy the requirements of probabilities. We can do this two ways:

Check that all values in the proportions table are greater than or equal 0, all values are less than or equal to 1, and they all sum to 1:

all(rel_freq_tbl >= 0)
[1] TRUE
all(rel_freq_tbl <= 1)
[1] TRUE
sum(rel_freq_tbl)
[1] 1

Instead of checking manually, we can use the function addmargins() to check that the probabilities sum to 1:

rel_freq_tbl <- freq_tbl |>
    prop.table() |>
    addmargins()
rel_freq_tbl
     
      Above Below  Sum
  A    0.32  0.16 0.48
  B    0.32  0.20 0.52
  Sum  0.64  0.36 1.00

In order to visualise our results in a figure, we could use a mosaic plot:

library(ggmosaic)
mos_plot <- ggplot(tips2) +
    geom_mosaic(aes(x = product(Server, Tip_Avg), fill = Server)) +  
    labs(title = "Association between Servers and Tips", x = "Tip", y = "Server")
mos_plot
Figure 3: Association between Servers and Tips
Example writeup

More customers tipped above (64%) than below (36%) average. Both Server A and Server B received an equal distribution of tips above average (32%), but server B had a higher proportion of tips below average (20%) in comparison to server A (16%). These associations are visually represented in Figure 3.

3 Student Glossary

To conclude the lab, add the new functions to the glossary of R functions.

Function Use and package
filter ?
mutate ?
drop_na ?
factor ?
table ?
ifelse ?
prop.table ?
addmargins ?
all ?
| ?
geom_mosaic ?
Back to top

References

Lock, Robin H, Patti Frazer Lock, Kari Lock Morgan, Eric F Lock, and Dennis F Lock. 2020. Statistics: Unlocking the Power of Data. John Wiley & Sons.

Footnotes

  1. Hint: use sort(table(DATA$COLUMN)) to find top 3 movie genres, and filter() to select the specific 3 genres.

    Example: For the starwars dataset, we can filter to include only the two most frequent species, Humans and Droids, via the following code:

    # sort the frequency table
    sort(table(starwars$species))
    
    # or, for decreasing order, you can use:
    sort(table(starwars$species), decreasing = TRUE)
    
    # keep only the rows where species is Human or Droid
    # in R, or is the vertical bar |
    starwars2 <- starwars |> 
        filter(species == "Human" | species ==  "Droid")
    ↩︎
  2. Hint: use ifelse() and mutate() functions to create a new ‘Rating’ variable capturing whether movies are good or bad.

    Example: For the starwars dataset, we can create a variable to capture whether characters were short (<180cm) or tall (>180cm) based on their recorded height using the following code:

    starwars2 <- starwars2 |>
        mutate(size = ifelse(height < 180, 's', 't'))

    In the code above, mutate() changes to data to include a column named size, computed with the computation that follows the equal sign.
    The ifelse function checks a condition (is height < 180?) and uses the value 's' if the condition is TRUE, and 't' if the condition is FALSE.↩︎

  3. Hint: use the factor() function.

    Example: For the starwars data, if I wanted to give better labels for my size variable, I could use the following code. Since the names are fine as is for species, I will just specify factor().

    starwars2$size <- factor(starwars2$size,
                             levels = c("s", "t"),
                             labels = c("short", "tall"))
    
    starwars2$species <- factor(starwars2$species)
    ↩︎
  4. Hint: use the table() function, passing to it the two columns that you want to be included in your contingency table.

    Example: For the starwars dataset, if I wanted a contingency table displaying how many humans and droids were short and tall, I could use the following code:

    swars_freq_tbl <- table(starwars2$species, starwars2$size)
    swars_freq_tbl
    
            short tall
      Droid     4    1
      Human    13   17
    ↩︎
  5. Hint: use the prop.table() function. You will want to pass your contingency table created above to this.

    Example: For the starwars dataset, if I wanted a relative frequency table, I could use the following code:

    swars_rel_freq_tbl <- swars_freq_tbl |>
        prop.table()
    swars_rel_freq_tbl
    
                 short       tall
      Droid 0.11428571 0.02857143
      Human 0.37142857 0.48571429
    ↩︎
  6. Hint: Recall the requirements of probabilities - are the proportions >= 0 or <= 1?; and that the values in the relative frequency table should sum to 1. The all() and sum() functions could be useful here.

    Example: For the starwars dataset, I could check these requirements using the following code:

    all(swars_rel_freq_tbl >= 0)
    [1] TRUE
    all(swars_rel_freq_tbl <= 1)
    [1] TRUE
    sum(swars_rel_freq_tbl)
    [1] 1
    ↩︎
  7. Hint: Use the addmargins() function.

    Example: For the starwars dataset, I could add this using the following code:

    swars_rel_freq_sum <- swars_freq_tbl |>
        prop.table() |>
        addmargins()
    swars_rel_freq_sum
    
                 short       tall        Sum
      Droid 0.11428571 0.02857143 0.14285714
      Human 0.37142857 0.48571429 0.85714286
      Sum   0.48571429 0.51428571 1.00000000
    ↩︎
  8. Hint: Make sure to load the ggmosaic() package so that you can specify geom_mosaic() when building your plot. To add a title, as well as x- and y-axis titles, specify labs(title = , x = , y = ).

    Example: For the starwars dataset, I create a mosaic plot using the following code:

    library(ggmosaic)
    m_plot <- ggplot(starwars2) +
        geom_mosaic(aes(x = product(species, size), fill = species)) +
        labs(title = "Starwars Mosaic Plot Example Title", 
             x = "Size", 
             y = "Species",
             fill = "Species")
    m_plot
    Figure 1: Starwars Mosaic Plot Example Title
    ↩︎
  9. Hint: Here you could make sense of the drop_na() function.

    Example: For the starwars dataset, I do have NAs in my ‘size’ variable, so I need to remove these before re-plotting:

    starwars2 <- starwars2 |>
        drop_na(size) |>
        mutate(size = factor(size))
    library(ggmosaic)
    m_plot <- ggplot(starwars2) +
        geom_mosaic(aes(x = product(species, size), fill = species)) +
        labs(title = "Starwars Mosaic Plot Example Title", 
             x = "Size", 
             y = "Species",
             fill = "Species")
    m_plot
    Figure 2: Starwars Mosaic Plot Example Title - No NAs
    ↩︎