Exercises: More R; Estimates & Intervals

Question 0

Now that you’ve had a little bit of playing around in R, we’re going to ask you to change some of RStudio’s global settings.

There was a section in reading 01A #useful-settings that showed a couple of settings that are very useful. Please change these now.

While you’re at it - pick your favourite colour scheme for RStudio and change that too!

Data manipulation & visualisation

Data: Past Surveys
In the last few years, we have asked students of the statistics courses in the Psychology department to fill out a little survey.
Anonymised data are available at https://uoepsy.github.io/data/surveydata_historical.csv.

Note: this does not contain the responses from this year.

Question 1

Read in the data, giving it a name to store it in your environment.

library(tidyverse)

surveydata <- read_csv("https://uoepsy.github.io/data/surveydata_historical.csv")

Question 2

How many previous USMR students are born in the same month as you?

  • The data contains students from some of the other statistics courses we teach, so this will involve filtering your data to USMR students first.
  • In tidyverse you can make a table using ... |> select(variable) |> table()
  • You can also try ... |> count(variable) to get the same information.

surveydata |> 
  filter(course == "usmr") |>
  count(birthmonth)
# A tibble: 13 × 2
   birthmonth     n
   <chr>      <int>
 1 apr           26
 2 aug           20
 3 dec           21
 4 feb           18
 5 jan           21
 6 jul           28
 7 jun           32
 8 mar           28
 9 may           29
10 nov           29
11 oct           29
12 sep           26
13 <NA>           1

Question 3

Create a new variable in the dataset which indicates whether people were taller than 6 foot (182cm).

You might want to use mutate(). Remember to make the changes apply to the objects in your environment, rather than just printing it out.
data <- data |> mutate(...)
(see 2A #advances-in-r)

surveydata <- surveydata |>
  mutate(
    over6ft = height > 182
  )

Question 4

What percentage of respondents to the survey (for whom we have data on their height) are greater than 6 foot tall?

  • Try table(), and then think about how we can convert the counts to percentages (what does sum() of the table give you?).
    • table() will actually by default count only those values which aren’t missing, so this means you don’t have to do anything extra here (if you wanted it to also count the missing values, we can use table(data$variable, useNA = "ifany"))
  • See also 2A #categorical.

We can divide the table by the sum of the table

table(surveydata$over6ft) / sum(table(surveydata$over6ft))

     FALSE       TRUE 
0.95903166 0.04096834 

We can also use prop.table()

prop.table(table(surveydata$over6ft))

     FALSE       TRUE 
0.95903166 0.04096834 

Question 5

Calculate the mean and standard deviation of heights of all respondents to the survey.

Can you also do this using the tidyverse syntax?

  • We can do it with mean(data$variable), but it will be useful to practice tidyverse style. You’ll probably want to summarise() the data.
  • We’re likely to have missing data in here, so na.rm=TRUE will be handy (see 2A #numeric)

This returns us NA:

mean(surveydata$height)
[1] NA

So we need to make sure we use na.rm = TRUE (we can use T as a shorthand for TRUE).

mean(surveydata$height, na.rm = T)
[1] 167.7519
sd(surveydata$height, na.rm = T)
[1] 8.270978

or in tidyverse:

surveydata |> 
  summarise(
    meanheight = mean(height, na.rm = T),
    sdheight = sd(height, na.rm = T)
  )
# A tibble: 1 × 2
  meanheight sdheight
       <dbl>    <dbl>
1       168.     8.27

Question 6

Plot the distribution of heights of all respondents. Try to make it ‘publication ready’.

if we want a histogram, then hist() won’t cut it here, we’re going to want to use ggplot to make a lovely pretty histogram, (ggplot was introduced in 2A #ggplot).

ggplot(surveydata, aes(x=height)) + 
  geom_histogram(binwidth = 5) + 
  labs(x = "Height (cm)", 
       y = "Frequency", 
       title = "Heights of respondents to the survey") + 
  theme_minimal()

Question 7

For respondents from each of the different courses, calculate the mean and standard deviation of heights.

This is just like when we did it for all the respondents - we want to summarise our data. Only this time we need to group_by something else first.

We’re using the same code as we did before, but we’ve added in one new line using group_by().

surveydata |> 
  group_by(course) |>
  summarise(
    meanheight = mean(height, na.rm = T),
    sdheight = sd(height, na.rm = T)
  )
# A tibble: 4 × 3
  course meanheight sdheight
  <chr>       <dbl>    <dbl>
1 dapr1        168.     7.36
2 dapr2        167.     7.79
3 rms2         167.     7.98
4 usmr         168.     8.75

Question 8

Based on your answer to the previous question, can you picture what the distributions are going to look like?

Plot the distributions of heights for each course to see if you’re correct.

Try looking up the documentation for ?facet_wrap. It is an incredibly useful extension of ggplot which allows you to create the same plot for different groups.

You might also want to add an extra aesthetic mapping from the course variable to some feature of your plot (e.g. ‘colour’ or ‘fill’).

They’re all going to look pretty similar - the center will be at around 167 and the majority of the distribution will lie within about 15cm either side (2 times the standard deviation of about 7.5).
Note that the USMR students have a slightly higher mean as well as a slightly bigger standard deviation. So this plot will be slightly shifted to the right, and will be slightly more spread out. It might be hard to see these differences just by looking at the plots though.

ggplot(surveydata, aes(x=height, fill = course)) + 
  geom_histogram(binwidth = 5) + 
  facet_wrap(~course) + 
  labs(x = "Height (cm)", 
       y = "Frequency", 
       title = "Heights of respondents to the survey") + 
  guides(fill = "none") + 
  theme_minimal()

Question 9

How many people in the survey do we have height data for? (i.e. how many are not missing?)
To find this out we might need to use is.na() (see below for a little example for you to play with).

The is.na(x) function is a bit like asking x == NA. It is necessary because NA is a special thing in R, which means we can’t ask questions like 3 == NA (because we don’t know what that NA is - it could be 3 for all we know!).

mynumbers <- c(1,5,NA,3,6,NA)
mynumbers == 5
[1] FALSE  TRUE    NA FALSE FALSE    NA
mynumbers == NA
[1] NA NA NA NA NA NA

Instead, we can use is.na() to ask “is this thing an NA?”

# for each number, TRUE if it's an NA, otherwise FALSE
is.na(mynumbers)
[1] FALSE FALSE  TRUE FALSE FALSE  TRUE
# ! means "not", so this is asking if each number is "not" an NA
!is.na(mynumbers)
[1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
# how many non-NAs are there? 
sum(!is.na(mynumbers))
[1] 4

To get the number of respondents with height data, we can sum the non-NA values:

sum(!is.na(surveydata$height))
[1] 537

and we have this many missing:

sum(is.na(surveydata$height))
[1] 11

which makes sense because we have this many entries in the data:

nrow(surveydata)
[1] 548

Question 10

Just like finding out how many people were over 6 foot, we can quickly work out how many respondents are taller than <insert your height here>, by using table().
For instance, here is a table of how many people are taller than me (Josiah):

table(surveydata$height > 177)

FALSE  TRUE 
  469    68 

However, as a learning exercise, let’s make sure we can recreate these numbers by doing the calculations manually.

Without using table() or prop.table(), find out:

  1. How many respondents are taller than you?
  2. What proportion of respondents (with valid height data) are taller than you?
  • Remember that we can sum() a condition as a quick way of counting: sum(data$variable == "thing") adds up all the TRUE responses.
    • We’re going to need to make sure we tell sum() to ignore the missing values (na.rm=TRUE will come in handy again).
  • Our denominator (bit on the bottom) for calculating the proportion \(\frac{\text{nr people taller than me}}{\text{total nr of people}}\), is the total number of people for whom we have height data. We just calculated that in the previous question!
  • Can you also do this in tidyverse syntax? (most of it can be done inside summarise()).

There are this many people who are taller than me. We need to ignore the NAs again:

sum(surveydata$height > 177, na.rm = T)
[1] 68

To get the total number of respondents with height data, we can sum the non-NA values:

sum(!is.na(surveydata$height))
[1] 537

And if we divide the number who are taller than me by the total number of respondents, we get the proportion:

sum(surveydata$height > 177, na.rm = T) / sum(!is.na(surveydata$height))
[1] 0.1266294

We can do all this inside tidyverse too!

surveydata |> 
  summarise(
    n_taller = sum(height > 177, na.rm = T),
    n_heights = sum(!is.na(height)),
    prop_taller = n_taller / n_heights
  )
# A tibble: 1 × 3
  n_taller n_heights prop_taller
     <int>     <int>       <dbl>
1       68       537       0.127

Estimates & Intervals

For these next exercises we are going to be focusing on self-perceived sleep quality ratings. Our survey contains a set of respondents who completed the question below. We’re going to use this sample to get an estimate of the sleep quality rating in the wider population.

Question 11

We only asked the sleep quality rating question to students in USMR in the 2022 cohort, so to make things easier, let’s create a subset of the dataset which includes only those students.

This will need some filtering, and assigning (e.g. usmr2022 <-) to a new name.

usmr2022 <- surveydata |> 
  filter(course == "usmr", year == 2022)
dim(usmr2022)
[1] 80 25

Question 12

For the USMR students in the 2022 academic cohort, calculate the following:

  • mean Sleep-Quality rating
  • standard deviation of Sleep-Quality ratings
  • number of respondents who completed Sleep-Quality rating

You can do this with things like mean(data$variable), or you can do it all in tidyverse (see the example of summarise in the intro to tidyverse: 2A #advances-in-r).

usmr2022 |> 
  summarise(
    m_sleep = mean(sleeprating, na.rm = TRUE),
    sd_sleep = sd(sleeprating, na.rm = TRUE),
    n_sleep = sum(!is.na(sleeprating)), 
    n_total = n() # n() will give the total count
  )
# A tibble: 1 × 4
  m_sleep sd_sleep n_sleep n_total
    <dbl>    <dbl>   <int>   <int>
1    66.0     22.6      79      80

Question 13

Using your answers to the previous question, construct a 95% confidence interval for the average Sleep-Quality rating.
Why might it be a bad idea to use this as an estimate of the average Sleep-Quality rating of the global population?

  • The previous question gives you all the pieces that you need. You’ll just need to put them together in the way seen in 2B #confidence-intervals.
  • Think about who makes up the sample (e.g. USMR students). Are they representative of the population we are trying to generalise to?

So we have our estimate, which is the mean sleep quality rating in our sample: \(\bar x = 65.99\). And we have a standard deviation of our sleep quality ratings: \(s = 22.63\). While our sample has 80 students in it, only 79 responded to the sleep question, so really our \(n\) for this calculation is 79.

We can calculate our standard error as \(SE = \frac{s}{\sqrt{n}}\), which is \(\frac{22.63}{\sqrt{79}} = 2.55\).

This tells us that if we were to collect lots of samples of size 79, we would expect their mean sleep quality ratings to be normally distributed around 65.99 with a standard deviation of 2.55. So we can expect that 95% of the samples we could take would have means between \(1.96 \times 2.55\) above and below 65.99. Those are our confidence intervals!

\[ \begin{align} \text{Mean = }& 65.99 \\ \text{95\% CI = }& 65.99 \pm 1.96 \times 2.55 \\ &[60.99, 70.99] \end{align} \]

How well does our sample represent the global population? Well, for one thing, they’re all university students, and while there will be many students from across the globe, it’s highly likely that our sample is biased towards certain countries or continents. Additionally, we’re going to have a lot of people in our sample in a fairly narrow age range. It might be that sleep quality changes a lot as people age, in which case our results are probably not going to generalise very well to, for instance, people who are 70 years old. It’s also important to remember that this is a snapshot taken at a certain point in time, and might not generalise to other times. There are lots more ways in which our sample might be biased - you can find a nice little 1 page article at https://www.nature.com/articles/466029a to get you thinking.

Optional Extras

Optional Extra

Note that the confidence interval from the previous question is concerned with describing the abstract and theoretical distribution of “what the mean sleep quality rating would look like from all possible samples of this size that I could take”. In order to do this we used a formula to describe the spread of this distribution, and in doing so had to assume that the standard deviation of our sample is a good approximation of the standard deviation of the population, and that the population is normally distributed.

We can also avoid ever using the standard deviation of our sample (sd(usmr2022$sleeprating)), and instead approximate the sampling distribution of the mean by “bootstrapping” - taking repeated resamples with replacement from the original sample (see 2B#standard-error.

bootstrap_means <- replicate(1000, mean(sample(observed_sample, replace = TRUE)))
  1. Create an object that contains the 10,000 means from 10,000 resamples of our sleep ratings.
  2. The distribution of resample means is the ‘bootstrap distribution’. Plot a histogram of it. What is the standard deviation? How does it compare to the standard error you calculated in the previous question with the formula?
  3. At what values does the middle 95% of the bootstrap distribution fall?

For 3, look up quantile(). We saw this in 2B #confidence-intervals.

Resample means

Here is our sample of sleep ratings:

sleeprates <- usmr2022$sleeprating

And we can get rid of the NA’s:

sleeprates <- sleeprates[!is.na(sleeprates)]

We can resample with replacement from this set of numbers by using the replace = TRUE argument in the sample() function.
Note, we’re leaving size = blank, which means it will stop at the same length as the original vector we give it.

sample(sleeprates, replace = TRUE)

and the mean of a given resample is calculated by wrapping mean() around the above code:

mean(sample(sleeprates, replace = TRUE))
[1] 63.73418

finally, we’ll do it lots and lots of times, using replicate():

BSmeans <- replicate(10000, mean(sample(sleeprates, replace = TRUE)))

Bootstrap Distribution

Here’s the histogram of the bootstrap distribution:

hist(BSmeans)

And here’s the standard deviation of that distribution. This is a bootstrapped estimate of the standard error.

sd(BSmeans)
[1] 2.535323

Recall our standard error calculated using \(\frac{s}{\sqrt{n}}\) from the previous question was 2.55

Percentiles

We can get the 2.5% and 97.5% percentiles (i.e. getting the middle 95%), using the code below. Recall our confidence intervals that we computed analytically were 60.99 and 70.99.

quantile(BSmeans, c(.025,.975))
    2.5%    97.5% 
60.84778 70.74684 



Bootstrapping is a great way to learn about sampling variability because it allows us to actually plot, summarise and describe what would otherwise be an abstract conceptual distribution.

It can also be a useful tool in practice, but it doesn’t come without its own problems/complexities. One important thing to note is that it often works worse than traditional methods for small samples, especially skewed samples (i.e. bootstrapping a “95% CI” for a small sample will often be too narrow and <95%).