Week 2 Exercises: More R; Estimates & Intervals

In the last couple of years during welcome week, we have asked students of the statistics courses in the Psychology department to fill out a little survey.
Anonymised data are available at https://uoepsy.github.io/data/surveydata_allcourse22.csv.

Data manipulation & visualisation

Question 1

Create a new variable in the dataset which indicates whether people are taller than 6 foot (182cm).

Hints:
You might want to use mutate(). Remember to make the changes apply to the objects in your environment, rather than just printing it out.
data <- data %>% mutate(...)
(see 2A #tidyverse-and-pipes)

Question 2

What percentage of respondents to the survey (for whom we have data on their height) are greater than 6 foot tall?

Hints:

Try table(), and then think about how we can convert the counts to percentages (what does sum() of the table give you?).
- table() will actually by default count only those values which aren’t missing, so this means you don’t have to do anything extra here (if you wanted it to also count the missing values, we can use table(data$variable, useNA = "ifany"))
See also 2A #categorical.

Question 3

how many of USMR students in 2024/25 are born in the same month as you?

Hints:
This will involve filtering your data to current USMR students first.
In tidyverse you can make a table using ... %>% select(variable) %>% table()
You can also try ... %>% count(variable) to get the same information.

Question 4

Calculate the mean and standard deviation of heights of all respondents to the survey.

Can you (also) do this using the tidyverse syntax?

Hints:
We can do it with mean(data$variable), but it will be useful to practice tidyverse style. You’ll want to summarise() the data.
We’re likely to have missing data in here, so na.rm=TRUE will be handy (see 2A #numeric)

Question 5

Plot the distribution of heights of all respondents. Try to make it ‘publication ready’.

Hints:
hist() won’t cut it here, we’re going to want ggplot, which was introduced in 2A #ggplot.

Question 6

For respondents from each of the different courses, calculate the mean and standard deviation of heights.

Hints:
This is just like question 4 - we want to summarise our data. Only this time we need to group_by something else first.

Question 7

Plot the distributions of heights for each course.
Based on your answer to the previous question, can you picture what the distributions are going to look like before you plot them?

Hints:
Try looking up the documentation for ?facet_wrap. It is an incredibly useful extension of ggplot which allows you to create the same plot for different groups.

You might also want to add an extra aesthetic mapping from the course variable to some feature of your plot (e.g. ‘colour’ or ‘fill’).

Question 8

What proportion of respondents to the survey are taller than you?

Hints:

Remember that we can sum() a condition as a quick way of counting: sum(data$variable == "thing") adds up all the TRUE responses.
- We’re going to need to make sure we tell sum() to ignore the missing values (na.rm=TRUE will come in handy again).
We’re also going to need to figure out what our denominator is when calculating the proportion $\frac{\text{nr people taller than me}}{\text{total nr of people}}$.
The denominator (the bit on the bottom) should really be the total number of people for whom we have height data. We should exclude the missing data because otherwise we will be counting them as if they are all shorter than you (but they might not be - we don’t know!). To find out how many in a variable are not NAs, we might need to use is.na() (see below for a little example for you to play with)
Can you also do this in tidyverse syntax? (most of it can be done inside summarise()).

is.na()

The is.na(x) function is a bit like asking x == NA. It is necessary because NA is a special thing in R, which means we can’t ask questions like 3 == NA (because we don’t know what that NA is - it could be 3 for all we know!).

mynumbers <- c(1,5,NA,3,6,NA)
# for each number, TRUE if it's an NA, otherwise FALSE
is.na(mynumbers)

[1] FALSE FALSE  TRUE FALSE FALSE  TRUE

# ! means "not", so this is asking if each number is "not" an NA
!is.na(mynumbers)

[1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE

# how many non-NAs are there? 
sum(!is.na(mynumbers))

[1] 4

Estimates & Intervals

For these next exercises we are going to be focusing on self-perceived sleep quality ratings. Our survey contains a set of respondents who completed the question below. We’re going to use this sample to get an estimate of the sleep quality rating in the wider population.

Question 9

We only asked the sleep quality rating question to students in USMR in the 2022/23 academic year, so to make things easier, let’s create a subset of the dataset which includes only those students.

Hints:
This will need some filtering, and assigning (usmrdata <-) to a new name.

Question 10

For the USMR students in the 2022/23 academic cohort, calculate the following:

mean Sleep-Quality rating
standard deviation of Sleep-Quality ratings
number of respondents who completed Sleep-Quality rating

Hints:
You can do this with things like mean(data$variable), or you can do it all in tidyverse (see the example of summarise in the intro to tidyverse: 2A #tidyverse-and-pipes).

Question 11

Using your answers to the previous question, construct a 95% confidence interval for the average Sleep-Quality rating.

Why might it be a bad idea to use this as an estimate of the average Sleep-Quality rating of the global population?

Hints:

The previous question gives you all the pieces that you need. You’ll just need to put them together in the way seen in 2B #confidence-intervals.
Think about who makes up the sample (e.g. USMR students). Are they representative of the population we are trying to generalise to?