Exercises: More R; Estimates & Intervals

Question 0

Now that you’ve had a little bit of playing around in R, we’re going to ask you to change some of RStudio’s global settings.

There was a section in reading 01A #useful-settings that showed a couple of settings that are very useful. Please change these now.

While you’re at it - pick your favourite colour scheme for RStudio and change that too!

Data manipulation & visualisation

Data: Past Surveys
In the last few years, we have asked students of the statistics courses in the Psychology department to fill out a little survey.
Anonymised data are available at https://uoepsy.github.io/data/surveydata_historical.csv.

Note: this does not contain the responses from this year.

Question 1

Read in the data, giving it a name to store it in your environment.

Question 2

How many previous USMR students are born in the same month as you?

  • The data contains students from some of the other statistics courses we teach, so this will involve filtering your data to USMR students first.
  • In tidyverse you can make a table using ... |> select(variable) |> table()
  • You can also try ... |> count(variable) to get the same information.

Question 3

Create a new variable in the dataset which indicates whether people were taller than 6 foot (182cm).

You might want to use mutate(). Remember to make the changes apply to the objects in your environment, rather than just printing it out.
data <- data |> mutate(...)
(see 2A #advances-in-r)

Question 4

What percentage of respondents to the survey (for whom we have data on their height) are greater than 6 foot tall?

  • Try table(), and then think about how we can convert the counts to percentages (what does sum() of the table give you?).
    • table() will actually by default count only those values which aren’t missing, so this means you don’t have to do anything extra here (if you wanted it to also count the missing values, we can use table(data$variable, useNA = "ifany"))
  • See also 2A #categorical.

Question 5

Calculate the mean and standard deviation of heights of all respondents to the survey.

Can you also do this using the tidyverse syntax?

  • We can do it with mean(data$variable), but it will be useful to practice tidyverse style. You’ll probably want to summarise() the data.
  • We’re likely to have missing data in here, so na.rm=TRUE will be handy (see 2A #numeric)

Question 6

Plot the distribution of heights of all respondents. Try to make it ‘publication ready’.

if we want a histogram, then hist() won’t cut it here, we’re going to want to use ggplot to make a lovely pretty histogram, (ggplot was introduced in 2A #ggplot).

Question 7

For respondents from each of the different courses, calculate the mean and standard deviation of heights.

This is just like when we did it for all the respondents - we want to summarise our data. Only this time we need to group_by something else first.

Question 8

Based on your answer to the previous question, can you picture what the distributions are going to look like?

Plot the distributions of heights for each course to see if you’re correct.

Try looking up the documentation for ?facet_wrap. It is an incredibly useful extension of ggplot which allows you to create the same plot for different groups.

You might also want to add an extra aesthetic mapping from the course variable to some feature of your plot (e.g. ‘colour’ or ‘fill’).

Question 9

How many people in the survey do we have height data for? (i.e. how many are not missing?)
To find this out we might need to use is.na() (see below for a little example for you to play with).

The is.na(x) function is a bit like asking x == NA. It is necessary because NA is a special thing in R, which means we can’t ask questions like 3 == NA (because we don’t know what that NA is - it could be 3 for all we know!).

mynumbers <- c(1,5,NA,3,6,NA)
mynumbers == 5
[1] FALSE  TRUE    NA FALSE FALSE    NA
mynumbers == NA
[1] NA NA NA NA NA NA

Instead, we can use is.na() to ask “is this thing an NA?”

# for each number, TRUE if it's an NA, otherwise FALSE
is.na(mynumbers)
[1] FALSE FALSE  TRUE FALSE FALSE  TRUE
# ! means "not", so this is asking if each number is "not" an NA
!is.na(mynumbers)
[1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
# how many non-NAs are there? 
sum(!is.na(mynumbers))
[1] 4

Question 10

Just like finding out how many people were over 6 foot, we can quickly work out how many respondents are taller than <insert your height here>, by using table().
For instance, here is a table of how many people are taller than me (Josiah):

table(surveydata$height > 177)

FALSE  TRUE 
  403    61 

However, as a learning exercise, let’s make sure we can recreate these numbers by doing the calculations manually.

Without using table() or prop.table(), find out:

  1. How many respondents are taller than you?
  2. What proportion of respondents (with valid height data) are taller than you?
  • Remember that we can sum() a condition as a quick way of counting: sum(data$variable == "thing") adds up all the TRUE responses.
    • We’re going to need to make sure we tell sum() to ignore the missing values (na.rm=TRUE will come in handy again).
  • Our denominator (bit on the bottom) for calculating the proportion \(\frac{\text{nr people taller than me}}{\text{total nr of people}}\), is the total number of people for whom we have height data. We just calculated that in the previous question!
  • Can you also do this in tidyverse syntax? (most of it can be done inside summarise()).

Estimates & Intervals

For these next exercises we are going to be focusing on self-perceived sleep quality ratings. Our survey contains a set of respondents who completed the question below. We’re going to use this sample to get an estimate of the sleep quality rating in the wider population.

Question 11

We only asked the sleep quality rating question to students in USMR in the 2022 cohort, so to make things easier, let’s create a subset of the dataset which includes only those students.

This will need some filtering, and assigning (e.g. usmr2022 <-) to a new name.

Question 12

For the USMR students in the 2022 academic cohort, calculate the following:

  • mean Sleep-Quality rating
  • standard deviation of Sleep-Quality ratings
  • number of respondents who completed Sleep-Quality rating

You can do this with things like mean(data$variable), or you can do it all in tidyverse (see the example of summarise in the intro to tidyverse: 2A #advances-in-r).

Question 13

Using your answers to the previous question, construct a 95% confidence interval for the average Sleep-Quality rating.
Why might it be a bad idea to use this as an estimate of the average Sleep-Quality rating of the global population?

  • The previous question gives you all the pieces that you need. You’ll just need to put them together in the way seen in 2B #confidence-intervals.
  • Think about who makes up the sample (e.g. USMR students). Are they representative of the population we are trying to generalise to?

Optional Extras

Optional Extra

Note that the confidence interval from the previous question is concerned with describing the abstract and theoretical distribution of “what the mean sleep quality rating would look like from all possible samples of this size that I could take”. In order to do this we used a formula to describe the spread of this distribution, and in doing so had to assume that the standard deviation of our sample is a good approximation of the standard deviation of the population, and that the population is normally distributed.

We can also avoid ever using the standard deviation of our sample (sd(usmr2022$sleeprating)), and instead approximate the sampling distribution of the mean by “bootstrapping” - taking repeated resamples with replacement from the original sample (see 2B#standard-error.

bootstrap_means <- replicate(1000, mean(sample(observed_sample, replace = TRUE)))
  1. Create an object that contains the 10,000 means from 10,000 resamples of our sleep ratings.
  2. The distribution of resample means is the ‘bootstrap distribution’. Plot a histogram of it. What is the standard deviation? How does it compare to the standard error you calculated in the previous question with the formula?
  3. At what values does the middle 95% of the bootstrap distribution fall?

For 3, look up quantile(). We saw this in 2B #confidence-intervals.