Be sure to check the solutions to last week’s exercises.
You can still ask any questions about previous weeks’ materials if things aren’t clear!

LEARNING OBJECTIVES

LO1: Understand the relation between X-Y (explanatory/outcome) specification and practical research questions.
L02: Understand how to summarise and visualize numeric-categorical relationships.
LO3: Understand how to summarise and visualize numeric-numeric relationships.

Outcome vs Explanatory

In the previous couple of labs, we looked at how to handle different types of data, and how to describe and visualise categorical and numeric distributions. More often than not, research involves investigating relationships between variables, rather than studying variables in isolation.

If we are using one variable to help us understand or predict values of another variable, we call the former the explanatory variable and the latter the outcome variable.

Other names

outcome variable = dependent variable = response variable = Y
explanatory variable = independent variable = predictor variable = X

(referring to outcome/explanatory variables as Y and X respectively matches up with how we often want to plot them - the outcome variable on the y-axis, and the explanatory variable on the x-axis)

The distinction between explanatory and outcome variables is borne out in how we design experimental studies - the researcher manipulates the explanatory variable for each unit before the response variable is measured (for instance, we might randomly allocate participants to one of two conditions). This contrasts with observational studies in which the researcher does not control the value of any variable, but simply observes the values as they naturally exist.

We’re going to pick up again with the data collected on the Stroop task from the previous lab.

library(tidyverse)
stroopdata <- read_csv("https://uoepsy.github.io/data/strooptask.csv")

# calculate the "stroop effect" - the difference in time taken to complete
# the matching vs mismatching sets 
stroopdata <- 
    stroopdata %>% 
        mutate(
            stroop_effect = mismatching - matching
        )

The data is experimental - researchers controlled the presentation of the stimuli (coloured words) and the assignment of whether or not participants received practice.

The researchers are interested in two relationships:

the relationship between receiving practice (categorical) and the stroop-effect (numeric)
the relationship between age (numeric) and the stroop-effect (numeric)

Numeric and Categorical

Recall that the “stroop-effect” is the difference (in seconds) between participants’ times on the mismatching set of words vs the matching set. We know how to describe a numeric variable such as the stroop-effect, for instance by calculating the mean and standard deviation, or median and IQR. We saw how to produce visualisations of numeric variables in the form of density curves, histogram, and boxplots.

# take the "stroopdata" dataframe %>%
# summarise() it, such that there is a value called "mean_stroop", which
# is the mean() of the "stroop_effect" variable, and a value called "sd_stroop", which
# is the standard deviation of the "stroop_effect" variable.
stroopdata %>% 
  summarise(
    mean_stroop = mean(stroop_effect),
    sd_stroop = sd(stroop_effect)
  )

## # A tibble: 1 x 2
##   mean_stroop sd_stroop
##         <dbl>     <dbl>
## 1        2.40      5.02

ggplot(data = stroopdata, aes(x = stroop_effect)) + 
  geom_histogram()

To understand the relationship between categorical (practice) and the numeric (stroop effect), for now we will simply calculate these summary statistics for the numeric variable when it is split by the different levels in the categorical variable.

In other words, we want to calculate the mean and standard deviation of the stroop_effect variable separately for those observations where practice is “no”, and for those where practice is “yes”:

practice	stroop_effect
no	9.69
no	10.07
yes	-2.97
yes	-0.23
yes	-5.59
no	3.67
yes	1.41
yes	2.1
yes	-0.33
…	…

We can do this using the group_by() function.

group_by()

The group_by() function creates a grouping in the dataframe, so that subsequent functions will be computed on each group.

It is most useful in combination with summarise(), to reduce a variable into a summary value for each group in a grouping variable:

# take the data %>%
# make it grouped by each unique value in the "grouping_variable" %>%
# summarise() it FOR EACH GROUP, creating a value called "summary_value" ()
data %>% 
  group_by(grouping_variable) %>%
  summarise(
    summary_value = ...
  )

Let’s do this for the Stroop Task data - we will summarise() the stroop_effect variable, after grouping the data by the practice variable:

# take the "stroopdata" %>%
# and group it by each unique value in the "practice" variable (yes/no) %>%
# then summarise() it FOR EACH GROUP, creating summary values called 
# "mean_stroop" and "sd_stroop" which are the means and standard deviations of 
# the "stroop_effect" variable entries for each group of "practice".
stroopdata %>%
  group_by(practice) %>%
  summarise(
    mean_stroop = mean(stroop_effect),
    sd_stroop = sd(stroop_effect)
  )

## # A tibble: 2 x 3
##   practice mean_stroop sd_stroop
##   <chr>          <dbl>     <dbl>
## 1 no            4.54        4.25
## 2 yes           0.0229      4.75

Visualising - Colours

Question

Given the output above, which of the following visualisations is most representative of these statistics?

Answer

We can visualise the data using the same code we had before, but with one small addition - we tell ggplot to colour the data according to the different values in the practice variable.
Note we add this inside the aes() mappings, because we are mapping something on the plot (the colour) to something in the data (the practice variable). If we just wanted to make the line blue, we could put col = "blue" outside the aes().

ggplot(data = stroopdata, aes(x = stroop_effect, col = practice)) +
  geom_density()

Visualising - Facets

Interpreting two density curves on top of one another works well, but overlaying two histograms on top of one another doesn’t. Instead, we might want to create separate histograms for each set of values (the stroop_effect variable values for each of practice/no practice groups).

facet_wrap() is a handy part of ggplot which allows us to easily split one plot into many:

ggplot(data = stroopdata, aes(x = stroop_effect)) +
  geom_histogram() +
  facet_wrap(~practice)

Numeric and Numeric

When we are interested in the relationship between two numeric variables, such as the one we have between age and the stroop-effect, the most easily interpreted visualisation of this relationship is in the form of a scatterplot:

# make a ggplot with the stroopdata
# put the possible values of the "age" variable on the x axis,
# and put the possible values of the "stroop_effect" variable on the y axis.
# for each entry in the data, add a "tomato1" coloured geom_point() to the plot, 
ggplot(data = stroopdata, aes(x = age, y = stroop_effect)) +
  geom_point(col="tomato1")

The visual pattern that these points make on the plot tells us something about the data - it looks like the older participants tended to have a greater stroop-effect.
But we can also have relationships between two numeric variables that look the opposite, or have no obvious pattern, or have a more consistent patterning (see Figure 1)

Figure 1: Relationships between two numeric variables can look very different

As a means of summarising these different types of relationships, we can calculate the covariance to describe in what direction, and how strong (i.e., how clear and consistent) the pattern is.

Covariance

We know that variance is the measure of how much a single numeric variable varies around its mean.
Covariance is a measure of how two numeric variables vary together, and can express the directional relationship between them.

Covariance is the measure of how two variables vary together.
For samples, covariance is calculated using the following formula:

\[\mathrm{cov}(x,y)=\frac{1}{n-1}\sum_{i=1}^n (x_{i}-\bar{x})(y_{i}-\bar{y})\]

where:

$x$ and $y$ are two variables;
$i$ denotes the observational unit, such that $x_i$ is value that the $x$ variable takes on the $i$th observational unit, and similarly for $y_i$;
$n$ is the sample size.

It often helps to understand covariance by working through a visual explanation.

Consider the following scatterplot:

Now let’s superimpose a vertical dashed line at the mean of $x$ ($\bar{x}$) and a horizontal dashed line at the mean of $y$ ($\bar{y}$):
Now let’s pick one of the points, call it $x_i$, and show $(x_{i}-\bar{x})$ and $(y_{i}-\bar{y})$.
Notice that this makes a rectangle.
As $(x_{i}-\bar{x})$ and $(y_{i}-\bar{y})$ are both positive values, their product - $(x_{i}-\bar{x})(y_{i}-\bar{y})$ - is positive.

In fact, for all these points in red, the product $(x_{i}-\bar{x})(y_{i}-\bar{y})$ is positive (remember that a negative multiplied by a negative gives a positive):
And for these points in blue, the product $(x_{i}-\bar{x})(y_{i}-\bar{y})$ is negative:

Now take another look at the formula for covariance:

\[\mathrm{cov}(x,y)=\frac{\sum_{i=1}^n (x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1}\]

It is the sum of all these products divided by $n-1$. It is the average of the products!

We can easily calculate the covariance between variables in R using the cov() function. cov() takes two variables cov(x = , y = ).
We can either use the $ to pull out the variables from the datset:

cov(stroopdata$age, stroopdata$stroop_effect)

## [1] 23.9597

Or we can specify the dataframe, use the %>% symbol, and call cov() inside summarise():

stroopdata %>%
  summarise(
    mean_age = mean(age),
    mean_stroop = mean(stroop_effect),
    cov_agestroop = cov(age, stroop_effect)
  )

## # A tibble: 1 x 3
##   mean_age mean_stroop cov_agestroop
##      <dbl>       <dbl>         <dbl>
## 1     42.8        2.40          24.0

Categorical and Categorical

What if we are interested in the relationship between two variables that are both categorical?

As a quick example, let’s read in a dataset containing information on passengers from the Titanic. We can see from the first few rows of the dataset that there are quite a few categorical variables here:

titanic <- read_csv("https://uoepsy.github.io/data/titanic.csv")
head(titanic)

## # A tibble: 6 x 5
##      X1 class     age    sex   survived
##   <dbl> <chr>     <chr>  <chr> <chr>   
## 1     1 1st class adults man   yes     
## 2     2 1st class adults man   yes     
## 3     3 1st class adults man   yes     
## 4     4 1st class adults man   yes     
## 5     5 1st class adults man   yes     
## 6     6 1st class adults man   yes

Recall that we summarised one categorical variable using a frequency table:

titanic %>%
  count(survived)

## # A tibble: 2 x 2
##   survived     n
##   <chr>    <int>
## 1 no         817
## 2 yes        499

We can also achieve this using the table() function:

#thes two lines of code do exactly the same thing!
table(titanic$survived)
titanic %>% select(survived) %>% table()

## 
##  no yes 
## 817 499

Contingency Tables

Let’s suppose we are interested in how the Class of passengers’ tickets (1st Class, 2nd Class, 3rd Class) can be used to understand their survival.
We can create two-way table, where we have each variable on either dimension of the table:

#Either this:
table(titanic$class, titanic$survived)
# or this:
titanic %>% select(class, survived) %>% table()

##            
##              no yes
##   1st class 122 203
##   2nd class 167 118
##   3rd class 528 178

And we can pass this to prop.table() to turn these into proportions.
We can turn them into:

proportions of the total:

titanic %>% 
  select(class, survived) %>% 
  table() %>%
  prop.table()

##            survived
## class               no        yes
##   1st class 0.09270517 0.15425532
##   2nd class 0.12689970 0.08966565
##   3rd class 0.40121581 0.13525836

proportions of each row:

titanic %>% 
  select(class, survived) %>% 
  table() %>%
  prop.table(margin = 1)

##            survived
## class              no       yes
##   1st class 0.3753846 0.6246154
##   2nd class 0.5859649 0.4140351
##   3rd class 0.7478754 0.2521246

proportions of each column:

titanic %>% 
  select(class, survived) %>% 
  table() %>%
  prop.table(margin = 2)

##            survived
## class              no       yes
##   1st class 0.1493268 0.4068136
##   2nd class 0.2044064 0.2364729
##   3rd class 0.6462668 0.3567134

Mosaic Plots

The equivalent way to visualise a contingency table is in the form of a mosaic plot.

titanic %>% 
  select(class, survived) %>% 
  table() %>%
  plot()

You can think of the prop.table(margin = ) as scaling the areas of one of the variables to be equal:

titanic %>% 
  select(class, survived) %>% 
  table() %>%
  prop.table(margin = 1)

##            survived
## class              no       yes
##   1st class 0.3753846 0.6246154
##   2nd class 0.5859649 0.4140351
##   3rd class 0.7478754 0.2521246

In the table above, each row (representing each level of the “Class” variable) sums to 1. The equivalent plot would make each of level of the “Class” variable as the same area:

titanic %>% 
  select(class, survived) %>% 
  table() %>%
  prop.table(margin = 1) %>%
  plot()

Glossary

Explanatory variable: A variable used to understand or predict values of an outcome variable.
Outcome variable: A variable which we are aiming to understand or predict via some explanatory variable(s).
Scatterplot: A plot in which the values of two variables are plotted along the two axes, the pattern of the resulting points revealing any relationship which is present.
Covariance: A measure of the extent to which two variables vary together.
cov() To calculate the covariance between two variables.
group_by() To apply a grouping in a dataframe for each level of a given variable. Grouped dataframes will retain their grouping, so that if we use summarise() it will provide a summary calculation for each group.
geom_point() To add points/dots to a ggplot.
facet_wrap() To split a ggplot into multiple plots (facets) for each level of a given variable.
Inline R code (introduced in the exercises below): R code which is written in-text, rather than inside a code-chunk. When a document with inline R code is rendered, the calculations are performed and resulting output is placed within the written text.

Exercises

Question 1

Open a new Rmarkdown file, and give it an appropriate title for this set of exercises.
In a new code-chunk, load the tidyverse package.
Now create a new heading of “Stroop task data”.
- Hint: remember that we specify headings and subheadings using #, ##, ### etc.
In a new code-chunk, read in the Stroop task data from https://uoepsy.github.io/data/strooptask.csv.
Add a new variable to the data which is the “stroop-effect” for each observation (the difference in times taken to complete the matching and mismatchings sets of coloured words)

Solution

Question 2

Recall that above we visualised the relationship between practice (categorical) and Stroop-effect (numeric) by plotting two density curves overlayed on one another, or two histograms side-by-side.

What other way might we visualise this relationship?
Hint:

ggplot(data = stroopdata, aes(x = practice, y = stroop_effect)) + 
geom_?????

Solution

Take a moment to think - we have visualised numeric variables in a variety of ways: histograms, density curves, and boxplots.
Both histograms and density curves already have values on the y-axis (histograms have “count” on the y, and density curves have “density”).

ggplot(data = stroopdata, aes(x = practice, y = stroop_effect)) +
  geom_boxplot()

Question 3

Check what other variables there are in the data - use the function names(), and give it your dataframe:

names(stroopdata)

## [1] "id"            "age"           "practice"      "matching"     
## [5] "mismatching"   "height"        "stroop_effect"

We also have information recorded on the participants’ heights (measured in cm).
Produce a visualisation of the relationship between height and the Stroop-effect. Before you do so, think about what you expect to see?
Calculate the covariance.

Solution

We wouldn’t really expect height to influence the Stroop-effect. We can look at the relationship in a scatterplot:

ggplot(data = stroopdata, aes(x = height, y = stroop_effect)) +
  geom_point()

And the covariance is small (in comparison to the relationship we saw previous between the stroop effect and age):

cov(stroopdata$height, stroopdata$stroop_effect)

## [1] -3.024087

Question 4

We’re going to move to a different dataset now, using data from Edinburgh City Council, and from the Scottish Environment Protection Agency.

Create a new heading in your Rmardkown for “Edinburgh Cyclists”.

The first dataset we are going to use is the from the Bicycle Counters on Middle Meadow Walk (see Figure 2).

$Photo from \@SustransScot on Twitter, 2017$

Figure 2: Photo from @SustransScot on Twitter, 2017

Read in the data https://uoepsy.github.io/data/cycling_mmwalk.csv and assign it a name of your choice.

Variable Name	Description
month	Month (first three letters)
day	Day of the month (1 - 31)
hour	Hour of the dat (0 - 24)
cyclists	Number of cyclists recorded by the bike counters (in either direction) on Middle Meadow Walk, Edinburgh

Solution

We’re going to assign it the name “mm_walk”.

mm_walk <- read_csv("https://uoepsy.github.io/data/cycling_mmwalk.csv")

Question 5

The data contains information on the total number of cyclists travelling in either direction on Middle Meadow Walk for each hour of each day in 2012.

Plot the relationship between hour of day and the number of cyclists.
Tip: You will want to make the hour of day a factor first, so that R treats it as different levels.

Solution

Currently, the hour variable is numeric, and R won’t know that we want a separate boxplot for each hour:

ggplot(data = mm_walk, aes(x = hour, y = cyclists)) + 
  geom_boxplot()

What we need to do is make the hour variable a factor:

mm_walk$hour <- factor(mm_walk$hour)

mm_walk <- 
  mm_walk %>% 
  mutate(
    hour = factor(hour)
  )

and then plot it. Note that ggplot now knows to create a separate boxplot for each hour:

ggplot(data = mm_walk, aes(x = hour, y = cyclists)) + 
  geom_boxplot()

Question 6

Using group_by() and summarise(), you can aggregate the data in to grouped averages (using mean()), or grouped totals (using sum()).

Which month had the highest total number of cyclists?
Which hour of the day had the highest average number of cyclists?

Solution

To calculate the total number of cyclists in each month, we want to: 1. group by the month
2. add together (sum) all the numbers of cyclists in that month

mm_walk %>%
  group_by(month) %>%
  summarise(
    monthly_cyclists = sum(cyclists)
  )

## # A tibble: 12 x 2
##    month monthly_cyclists
##    <chr>            <dbl>
##  1 Apr              37707
##  2 Aug              55165
##  3 Dec              23507
##  4 Feb              38781
##  5 Jan              33517
##  6 Jul              36707
##  7 Jun              39379
##  8 Mar              50997
##  9 May              46364
## 10 Nov              43962
## 11 Oct              49160
## 12 Sep              44351

August had the most cyclists, with 55,165. August tends to be the busiest month in Edinburgh because of the festival.

To calculate the average number of cyclists in each hour of the day, we want to: 1. group by the hour of the day 2. calculate the average of the numbers of cyclists for each hour

Tip: we can use arrange(somevariable) order our data from smallest to largest according to “somevariable”, and by adding in desc() too, we can go from largest to smallest!

mm_walk %>%
  group_by(hour) %>%
  summarise(
    avg_cyclists = mean(cyclists)
  ) %>%
  arrange(desc(avg_cyclists))

## # A tibble: 24 x 2
##    hour  avg_cyclists
##    <fct>        <dbl>
##  1 17           144. 
##  2 8            139. 
##  3 18           113. 
##  4 16           108. 
##  5 9            104. 
##  6 15            84.2
##  7 13            76.6
##  8 14            76.0
##  9 19            71.0
## 10 12            69.2
## # … with 14 more rows

Matching the plot we made in the previous question, there are two peak hours - 8am and 5pm. 5pm has a slightly higher mean number of cyclists, with 143.

Question 7

Another dataset, available at https://uoepsy.github.io/data/cycling_invrow.csv, contains information on the total number of cyclists each month from 2014 to 2016 on Inverleith Row, Edinburgh. It also contains data on the monthly rainfall (in millimetres) measured at the nearby Royal Botanic Gardens.

Read the data into R, and produce a visualisation of the relationship between monthly rainfall and monthly cyclists.

Variable Name	Description
year	Year (2014 - 2016)
month	Month (first three letters)
cyclists	Number of cyclists recorded by the bike counters (in either direction) on Inverleith Row, Edinburgh
rainfall_mm	Rainfall (in millimetres) recorded at the Royal Botanic Gardens, Edinburgh

Solution

inv_row <- read_csv("https://uoepsy.github.io/data/cycling_invrow.csv")

ggplot(data = inv_row, aes(x = rainfall_mm, y = cyclists))+
  geom_point()

Question 8 - Harder

Make a new heading of “Pulse rate and exercise”.
Read in a new dataset from https://uoepsy.github.io/data/pulse.csv.

Variable	Description
active	Pulse rate (beats per minute) after exercise
rest	Resting pulse rate (beats per minute)
smoke	“Y” = Smoker or “N” = Nonsmoker
sex	“F” = Female or “M” = Male
exercise	Typical hours of exercise (per week)
hgt	Height (in inches)
wgt	Weight (in pounds)

Explore visually the relationship between active and resting pulse rate and the variables which influence resting pulse rate.
For each plot you make:

Think about what variable we are assuming to influence another. Which should be on the x-axis, and which on the y?
Think about what type of measurement each variable is - e.g., numeric, categorical. How is it best to visualise each relationship?
Try to make your plots easily interpretable (for instance, add useful axes and titles using labs(title = "---", x = "---", y = "---")), and aesthetically pleasing (try out some of the different themes).

Solution

Read in the data:

pulse <- read_csv("https://uoepsy.github.io/data/pulse.csv")

Visualise the relationship between resting and active pulse rate. This should just be a scatterplot:

ggplot(data = pulse, aes(x = rest, y = active)) + 
  geom_point(col="red")+
  labs(title = "Relationship between resting and active pulse rate",x = "Resting pulse", y = "Pulse after exercise")+
  theme_minimal()

For a visual exploration, we now want to create similar plots between each variable and the rest variable, to see which ones we see patterns in.

ggplot(data = pulse, aes(x = sex, y = rest)) + 
  geom_boxplot()

ggplot(data = pulse, aes(x = smoke, y = rest)) + 
  geom_boxplot()

When you look at the data, you’ll notice that because the exercise per week is rounded to the nearest hour, there are actually only three different distinct values (1, 2, or 3). A scatterplot might have initially seemed the best choice (because we think of hours per week as numeric). However, a boxplot might be more convenient.
Note that we have x = factor(exercise) in the code below. This temporarily makes the exercise variable a factor (for the plot only, and making no permanent changes to the dataframe itself).

ggplot(data = pulse, aes(x = factor(exercise), y = rest)) + 
  geom_boxplot()

ggplot(data = pulse, aes(x = hgt, y = rest)) + 
  geom_point()

ggplot(data = pulse, aes(x = wgt, y = rest)) + 
  geom_point()

Question 9

Create a two-way contingency table to assess the relationship between Smoker/Non-Smoker and Male/Female.
What percentage of Males are Smokers?

Solution

pulse %>%
  select(sex, smoke) %>%
  table()

##    smoke
## sex   N   Y
##   F 101   9
##   M 105  17

pulse %>%
  select(sex, smoke) %>%
  table() %>% 
  prop.table(margin=1)

##    smoke
## sex          N          Y
##   F 0.91818182 0.08181818
##   M 0.86065574 0.13934426

14% of Males are smokers.

Question 10

We’re going to now learn about how to write a paragraph of text with embedded calculations. Each time we render the document, the calculations get performed and the ouput gets created. This is really useful as it means we can write a report which we can really easily update if we change our analysis, or include more data.

Instead of writing in code-chunks, as we have been, we can also write inline R code, which we specify as shown in Figure 3.

Writing Inline R code. The code on the left gets compiled to look like what is shown on the right (source: https://rmarkdown.rstudio.com/authoring_quick_tour.html)

Figure 3: Writing Inline R code. The code on the left gets compiled to look like what is shown on the right (source: https://rmarkdown.rstudio.com/authoring_quick_tour.html)

Write the paragraph below, and fill in the blanks with inline R code.

The average resting pulse was ????, with a standard deviation of ????. Heights of participants in the study ranged from ???? to ????, with a median of ???? and an IQR of ????.

Tip: for writing inline code, often using code like mean(data$variable) is easier, because it is more compact than data %>% summarise(mean = mean(variable)) %>% pull(mean).

Solution

Question 11

Final task for this set of exercises!

Compile your Rmarkdown document into an html. Remember that you can do this using the “Knit” button at the top.

You’ll notice that as the .html file shows up in the files tab in the bottom-right of Rstudio.
Make sure that your inline R code is displaying correctly.

Upload your .html to the discussion board, so we can all admire each other’s lovely ggplots!

Visualising and describing relationships

Outcome vs Explanatory

Numeric and Categorical

Visualising - Colours

Visualising - Facets

Numeric and Numeric

Covariance

Categorical and Categorical

Contingency Tables

Mosaic Plots

Glossary

Exercises