Be sure to check the solutions to last week’s exercises.
You can still ask any questions about previous weeks’ materials if things aren’t clear!
LEARNING OBJECTIVES
In the previous couple of labs, we looked at how to handle different types of data, and how to describe and visualise categorical and numeric distributions. More often than not, research involves investigating relationships between variables, rather than studying variables in isolation.
If we are using one variable to help us understand or predict values of another variable, we call the former the explanatory variable and the latter the outcome variable.
Other names
(referring to outcome/explanatory variables as Y and X respectively matches up with how we often want to plot them - the outcome variable on the y-axis, and the explanatory variable on the x-axis)
The distinction between explanatory and outcome variables is borne out in how we design experimental studies - the researcher manipulates the explanatory variable for each unit before the response variable is measured (for instance, we might randomly allocate participants to one of two conditions). This contrasts with observational studies in which the researcher does not control the value of any variable, but simply observes the values as they naturally exist.
We’re going to pick up again with the data collected on the Stroop task from the previous lab.
library(tidyverse)
stroopdata <- read_csv("https://uoepsy.github.io/data/strooptask.csv")
# calculate the "stroop effect" - the difference in time taken to complete
# the matching vs mismatching sets
stroopdata <-
stroopdata %>%
mutate(
stroop_effect = mismatching - matching
)
The data is experimental - researchers controlled the presentation of the stimuli (coloured words) and the assignment of whether or not participants received practice.
The researchers are interested in two relationships:
Recall that the “stroop-effect” is the difference (in seconds) between participants’ times on the mismatching set of words vs the matching set. We know how to describe a numeric variable such as the stroop-effect, for instance by calculating the mean and standard deviation, or median and IQR. We saw how to produce visualisations of numeric variables in the form of density curves, histogram, and boxplots.
# take the "stroopdata" dataframe %>%
# summarise() it, such that there is a value called "mean_stroop", which
# is the mean() of the "stroop_effect" variable, and a value called "sd_stroop", which
# is the standard deviation of the "stroop_effect" variable.
stroopdata %>%
summarise(
mean_stroop = mean(stroop_effect),
sd_stroop = sd(stroop_effect)
)
## # A tibble: 1 x 2
## mean_stroop sd_stroop
## <dbl> <dbl>
## 1 2.40 5.02
ggplot(data = stroopdata, aes(x = stroop_effect)) +
geom_histogram()
To understand the relationship between categorical (practice) and the numeric (stroop effect), for now we will simply calculate these summary statistics for the numeric variable when it is split by the different levels in the categorical variable.
In other words, we want to calculate the mean and standard deviation of the stroop_effect variable separately for those observations where practice is “no”, and for those where practice is “yes”:
practice | stroop_effect |
---|---|
no | 9.69 |
no | 10.07 |
yes | -2.97 |
yes | -0.23 |
yes | -5.59 |
no | 3.67 |
yes | 1.41 |
yes | 2.1 |
yes | -0.33 |
… | … |
We can do this using the group_by()
function.
group_by()
The group_by()
function creates a grouping in the dataframe, so that subsequent functions will be computed on each group.
It is most useful in combination with summarise()
, to reduce a variable into a summary value for each group in a grouping variable:
# take the data %>%
# make it grouped by each unique value in the "grouping_variable" %>%
# summarise() it FOR EACH GROUP, creating a value called "summary_value" ()
data %>%
group_by(grouping_variable) %>%
summarise(
summary_value = ...
)
Let’s do this for the Stroop Task data - we will summarise()
the stroop_effect variable, after grouping the data by the practice variable:
# take the "stroopdata" %>%
# and group it by each unique value in the "practice" variable (yes/no) %>%
# then summarise() it FOR EACH GROUP, creating summary values called
# "mean_stroop" and "sd_stroop" which are the means and standard deviations of
# the "stroop_effect" variable entries for each group of "practice".
stroopdata %>%
group_by(practice) %>%
summarise(
mean_stroop = mean(stroop_effect),
sd_stroop = sd(stroop_effect)
)
## # A tibble: 2 x 3
## practice mean_stroop sd_stroop
## <chr> <dbl> <dbl>
## 1 no 4.54 4.25
## 2 yes 0.0229 4.75
Given the output above, which of the following visualisations is most representative of these statistics?
We can visualise the data using the same code we had before, but with one small addition - we tell ggplot to colour the data according to the different values in the practice variable.
Note we add this inside the aes()
mappings, because we are mapping something on the plot (the colour) to something in the data (the practice variable). If we just wanted to make the line blue, we could put col = "blue"
outside the aes()
.
ggplot(data = stroopdata, aes(x = stroop_effect, col = practice)) +
geom_density()
Interpreting two density curves on top of one another works well, but overlaying two histograms on top of one another doesn’t. Instead, we might want to create separate histograms for each set of values (the stroop_effect variable values for each of practice/no practice groups).
facet_wrap()
is a handy part of ggplot
which allows us to easily split one plot into many:
ggplot(data = stroopdata, aes(x = stroop_effect)) +
geom_histogram() +
facet_wrap(~practice)
When we are interested in the relationship between two numeric variables, such as the one we have between age and the stroop-effect, the most easily interpreted visualisation of this relationship is in the form of a scatterplot:
# make a ggplot with the stroopdata
# put the possible values of the "age" variable on the x axis,
# and put the possible values of the "stroop_effect" variable on the y axis.
# for each entry in the data, add a "tomato1" coloured geom_point() to the plot,
ggplot(data = stroopdata, aes(x = age, y = stroop_effect)) +
geom_point(col="tomato1")
The visual pattern that these points make on the plot tells us something about the data - it looks like the older participants tended to have a greater stroop-effect.As a means of summarising these different types of relationships, we can calculate the covariance to describe in what direction, and how strong (i.e., how clear and consistent) the pattern is.
We know that variance is the measure of how much a single numeric variable varies around its mean.
Covariance is a measure of how two numeric variables vary together, and can express the directional relationship between them.
Covariance is the measure of how two variables vary together.
For samples, covariance is calculated using the following formula:
\[\mathrm{cov}(x,y)=\frac{1}{n-1}\sum_{i=1}^n (x_{i}-\bar{x})(y_{i}-\bar{y})\]
where:
It often helps to understand covariance by working through a visual explanation.
Consider the following scatterplot:
Now let’s superimpose a vertical dashed line at the mean of \(x\) (\(\bar{x}\)) and a horizontal dashed line at the mean of \(y\) (\(\bar{y}\)):
Now let’s pick one of the points, call it \(x_i\), and show \((x_{i}-\bar{x})\) and \((y_{i}-\bar{y})\).
Notice that this makes a rectangle.
As \((x_{i}-\bar{x})\) and \((y_{i}-\bar{y})\) are both positive values, their product - \((x_{i}-\bar{x})(y_{i}-\bar{y})\) - is positive.
In fact, for all these points in red, the product \((x_{i}-\bar{x})(y_{i}-\bar{y})\) is positive (remember that a negative multiplied by a negative gives a positive):
And for these points in blue, the product \((x_{i}-\bar{x})(y_{i}-\bar{y})\) is negative:
Now take another look at the formula for covariance:
\[\mathrm{cov}(x,y)=\frac{\sum_{i=1}^n (x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1}\]
It is the sum of all these products divided by \(n-1\). It is the average of the products!
We can easily calculate the covariance between variables in R using the cov()
function.
cov()
takes two variables cov(x = , y = )
.
We can either use the $
to pull out the variables from the datset:
cov(stroopdata$age, stroopdata$stroop_effect)
## [1] 23.9597
Or we can specify the dataframe, use the %>%
symbol, and call cov()
inside summarise()
:
stroopdata %>%
summarise(
mean_age = mean(age),
mean_stroop = mean(stroop_effect),
cov_agestroop = cov(age, stroop_effect)
)
## # A tibble: 1 x 3
## mean_age mean_stroop cov_agestroop
## <dbl> <dbl> <dbl>
## 1 42.8 2.40 24.0
What if we are interested in the relationship between two variables that are both categorical?
As a quick example, let’s read in a dataset containing information on passengers from the Titanic. We can see from the first few rows of the dataset that there are quite a few categorical variables here:
titanic <- read_csv("https://uoepsy.github.io/data/titanic.csv")
head(titanic)
## # A tibble: 6 x 5
## X1 class age sex survived
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 1st class adults man yes
## 2 2 1st class adults man yes
## 3 3 1st class adults man yes
## 4 4 1st class adults man yes
## 5 5 1st class adults man yes
## 6 6 1st class adults man yes
Recall that we summarised one categorical variable using a frequency table:
titanic %>%
count(survived)
## # A tibble: 2 x 2
## survived n
## <chr> <int>
## 1 no 817
## 2 yes 499
We can also achieve this using the table()
function:
#thes two lines of code do exactly the same thing!
table(titanic$survived)
titanic %>% select(survived) %>% table()
##
## no yes
## 817 499
Let’s suppose we are interested in how the Class of passengers’ tickets (1st Class, 2nd Class, 3rd Class) can be used to understand their survival.
We can create two-way table, where we have each variable on either dimension of the table:
#Either this:
table(titanic$class, titanic$survived)
# or this:
titanic %>% select(class, survived) %>% table()
##
## no yes
## 1st class 122 203
## 2nd class 167 118
## 3rd class 528 178
And we can pass this to prop.table()
to turn these into proportions.
We can turn them into:
proportions of the total:
titanic %>%
select(class, survived) %>%
table() %>%
prop.table()
## survived
## class no yes
## 1st class 0.09270517 0.15425532
## 2nd class 0.12689970 0.08966565
## 3rd class 0.40121581 0.13525836
proportions of each row:
titanic %>%
select(class, survived) %>%
table() %>%
prop.table(margin = 1)
## survived
## class no yes
## 1st class 0.3753846 0.6246154
## 2nd class 0.5859649 0.4140351
## 3rd class 0.7478754 0.2521246
proportions of each column:
titanic %>%
select(class, survived) %>%
table() %>%
prop.table(margin = 2)
## survived
## class no yes
## 1st class 0.1493268 0.4068136
## 2nd class 0.2044064 0.2364729
## 3rd class 0.6462668 0.3567134
The equivalent way to visualise a contingency table is in the form of a mosaic plot.
titanic %>%
select(class, survived) %>%
table() %>%
plot()
You can think of the prop.table(margin = )
as scaling the areas of one of the variables to be equal:
titanic %>%
select(class, survived) %>%
table() %>%
prop.table(margin = 1)
## survived
## class no yes
## 1st class 0.3753846 0.6246154
## 2nd class 0.5859649 0.4140351
## 3rd class 0.7478754 0.2521246
In the table above, each row (representing each level of the “Class” variable) sums to 1. The equivalent plot would make each of level of the “Class” variable as the same area:
titanic %>%
select(class, survived) %>%
table() %>%
prop.table(margin = 1) %>%
plot()
cov()
To calculate the covariance between two variables.group_by()
To apply a grouping in a dataframe for each level of a given variable. Grouped dataframes will retain their grouping, so that if we use summarise()
it will provide a summary calculation for each group.geom_point()
To add points/dots to a ggplot.facet_wrap()
To split a ggplot into multiple plots (facets) for each level of a given variable.
Recall that above we visualised the relationship between practice (categorical) and Stroop-effect (numeric) by plotting two density curves overlayed on one another, or two histograms side-by-side.
What other way might we visualise this relationship?
Hint:
ggplot(data = stroopdata, aes(x = practice, y = stroop_effect)) +
geom_?????
Check what other variables there are in the data - use the function names()
, and give it your dataframe:
names(stroopdata)
## [1] "id" "age" "practice" "matching"
## [5] "mismatching" "height" "stroop_effect"
We also have information recorded on the participants’ heights (measured in cm).
Produce a visualisation of the relationship between height and the Stroop-effect. Before you do so, think about what you expect to see?
Calculate the covariance.
We’re going to move to a different dataset now, using data from Edinburgh City Council, and from the Scottish Environment Protection Agency.
Create a new heading in your Rmardkown for “Edinburgh Cyclists”.
The first dataset we are going to use is the from the Bicycle Counters on Middle Meadow Walk (see Figure 2).
Read in the data https://uoepsy.github.io/data/cycling_mmwalk.csv and assign it a name of your choice.
Variable Name | Description |
---|---|
month | Month (first three letters) |
day | Day of the month (1 - 31) |
hour | Hour of the dat (0 - 24) |
cyclists | Number of cyclists recorded by the bike counters (in either direction) on Middle Meadow Walk, Edinburgh |
The data contains information on the total number of cyclists travelling in either direction on Middle Meadow Walk for each hour of each day in 2012.
Plot the relationship between hour of day and the number of cyclists.
Tip: You will want to make the hour of day a factor first, so that R treats it as different levels.
Using group_by()
and summarise()
, you can aggregate the data in to grouped averages (using mean()
), or grouped totals (using sum()
).
Another dataset, available at https://uoepsy.github.io/data/cycling_invrow.csv, contains information on the total number of cyclists each month from 2014 to 2016 on Inverleith Row, Edinburgh. It also contains data on the monthly rainfall (in millimetres) measured at the nearby Royal Botanic Gardens.
Read the data into R, and produce a visualisation of the relationship between monthly rainfall and monthly cyclists.
Variable Name | Description |
---|---|
year | Year (2014 - 2016) |
month | Month (first three letters) |
cyclists | Number of cyclists recorded by the bike counters (in either direction) on Inverleith Row, Edinburgh |
rainfall_mm | Rainfall (in millimetres) recorded at the Royal Botanic Gardens, Edinburgh |
Make a new heading of “Pulse rate and exercise”.
Read in a new dataset from https://uoepsy.github.io/data/pulse.csv.
Variable | Description |
---|---|
active | Pulse rate (beats per minute) after exercise |
rest | Resting pulse rate (beats per minute) |
smoke | “Y” = Smoker or “N” = Nonsmoker |
sex | “F” = Female or “M” = Male |
exercise | Typical hours of exercise (per week) |
hgt | Height (in inches) |
wgt | Weight (in pounds) |
Explore visually the relationship between active and resting pulse rate and the variables which influence resting pulse rate.
For each plot you make:
labs(title = "---", x = "---", y = "---")
), and aesthetically pleasing (try out some of the different themes).
Create a two-way contingency table to assess the relationship between Smoker/Non-Smoker and Male/Female.
What percentage of Males are Smokers?
We’re going to now learn about how to write a paragraph of text with embedded calculations. Each time we render the document, the calculations get performed and the ouput gets created. This is really useful as it means we can write a report which we can really easily update if we change our analysis, or include more data.
Instead of writing in code-chunks, as we have been, we can also write inline R code, which we specify as shown in Figure 3.
Tip: for writing inline code, often using code like mean(data$variable)
is easier, because it is more compact than data %>% summarise(mean = mean(variable)) %>% pull(mean)
.
Final task for this set of exercises!
Compile your Rmarkdown document into an html. Remember that you can do this using the “Knit” button at the top.
You’ll notice that as the .html file shows up in the files tab in the bottom-right of Rstudio.
Make sure that your inline R code is displaying correctly.
Upload your .html to the discussion board, so we can all admire each other’s lovely ggplots!