Covariance & Correlation

Our data for this walkthrough is from a (hypothetical) study on memory. Twenty participants studied passages of text (around 500 words long), and were tested a week later. The testing phase presented participants with 100 statements about the text. They had to answer whether each statement was true or false, as well as rate their confidence in each answer (on a sliding scale from 0 to 100). The dataset contains, for each participant, the percentage of items correctly answered, and the average confidence rating. Participants’ ages were also recorded.

Let’s take a look at the relationships between the percentage of items answered correctly (recall_accuracy) and two other variables: participants’ average self-rating of confidence in their answers (recall_confidence), and participants’ ages (age):

library(tidyverse)
library(patchwork)

recalldata <- read_csv("https://uoepsy.github.io/data/recalldata.csv")

# with patchwork loaded, we can combine the two plots using +
ggplot(recalldata, aes(x = recall_confidence, y = recall_accuracy)) +
  geom_point() + 
ggplot(recalldata, aes(x = age, y = recall_accuracy)) +
  geom_point()

These two relationships look quite different.

  • For participants who tended to be more confident in their answers, the percentage of items they correctly answered tends to be higher.
  • The older participants were, the lower the percentage of items they correctly answered tended to be.

Which relationship should we be more confident in and why?

Ideally, we would have some means of quantifying the strength and direction of these sorts of relationships. This is where we come to the two summary statistics that we can use to talk about the association between two numeric variables: Covariance and Correlation.

Covariance

Covariance is a measure of how two variables vary together: the extent to which changes in one variable are associated with changes in the other.

For samples, covariance is calculated using the following formula:

\[\mathrm{cov}(x,y)=\frac{1}{n-1}\sum_{i=1}^n (x_{i}-\bar{x})(y_{i}-\bar{y})\]

where:

  • \(x\) and \(y\) are two variables; e.g., age and recall_accuracy;
  • \(i\) denotes the observational unit, such that \(x_i\) is the value that the \(x\) variable takes on the \(i\)th observational unit, and similarly for \(y_i\);
  • \(n\) is the sample size.

In R
We can calculate covariance in R using the cov() function.
cov() takes two variables: cov(x = , y = ).

cov(x = recalldata$recall_accuracy, y = recalldata$recall_confidence)
## [1] 118.0768
Optional: Manually calculating covariance
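
If you want to see the formula in action, below is a minimal sketch of the same calculation done step by step, using the recalldata from above:

x <- recalldata$recall_accuracy
y <- recalldata$recall_confidence

# sum of the products of the deviations from each mean, divided by n-1
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
## [1] 118.0768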

Optional: Covariance explained visually

Correlation - \(r\)

You can think of correlation as a standardised covariance. It has a scale from -1 to 1, on which the distance from zero indicates the strength of the relationship.
Just like covariance, the sign (positive or negative) reflects the direction of the relationship.

The correlation coefficient is a standardised number which quantifies the strength and direction of the linear relationship between two variables. In a population it is denoted by \(\rho\), and in a sample it is denoted by \(r\).

We can calculate \(r\) using the following formula:
\[ r_{(x,y)}=\frac{\mathrm{cov}(x,y)}{s_xs_y} \]

We can actually rearrange this formula to show that the correlation is simply the covariance, but with the values \((x_i - \bar{x})\) divided by the standard deviation (\(s_x\)), and the values \((y_i - \bar{y})\) divided by \(s_y\): \[ r_{(x,y)}=\frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_{i}-\bar{x}}{s_x} \right) \left( \frac{y_{i}-\bar{y}}{s_y} \right) \]
The correlation is simply the covariance of standardised variables (variables expressed as the distance in standard deviations from the mean).

Properties of correlation coefficients

  • \(-1 \leq r \leq 1\)
  • The sign indicates the direction of association
    • positive association (\(r > 0\)) means that values of one variable tend to be higher when values of the other variable are higher
    • negative association (\(r < 0\)) means that values of one variable tend to be lower when values of the other variable are higher
    • no linear association (\(r \approx 0\)) means that higher/lower values of one variable do not tend to occur with higher/lower values of the other variable
  • The closer \(r\) is to \(\pm 1\), the stronger the linear association
  • \(r\) has no units and does not depend on the units of measurement
  • The correlation between \(x\) and \(y\) is the same as the correlation between \(y\) and \(x\)

In R
Just like R has a cov() function for calculating covariance, there is a cor() function for calculating correlation:

cor(x = recalldata$recall_accuracy, y = recalldata$recall_confidence)
## [1] 0.6993654
Optional: Manually calculating correlation
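
As with covariance, we can check the formula by hand. A minimal sketch, reusing the x and y we created above:

# the covariance divided by the product of the two standard deviations
cov(x, y) / (sd(x) * sd(y))
## [1] 0.6993654

# equivalently: the covariance of the standardised variables
cov((x - mean(x)) / sd(x), (y - mean(y)) / sd(y))
## [1] 0.6993654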

Correlation Test

Now that we’ve seen the formulae for covariance and correlation, as well as how to quickly calculate them in R using cov() and cor(), we can use a statistical test to establish the probability of finding an association this strong by chance alone.

Hypotheses

Remember, hypotheses are about the population parameter (in this case the correlation between the two variables in the population - i.e., \(\rho\)).

Null Hypothesis

  • There is not a linear relationship between \(x\) and \(y\) in the population.
    \(H_0: \rho = 0\)


Alternative Hypothesis

  1. There is a positive linear relationship between \(x\) and \(y\) in the population.
    \(H_1: \rho > 0\)
  2. There is a negative linear relationship between \(x\) and \(y\) in the population.
    \(H_1: \rho < 0\)
  3. There is a linear relationship between \(x\) and \(y\) in the population.
    \(H_1: \rho \neq 0\)

Test statistic

Our test statistic here is another \(t\) statistic, the formula for which depends on both the observed correlation (\(r\)) and the sample size (\(n\)):

\[t = r \sqrt{\frac{n-2}{1-r^2}}\]

\(p\)-value

We calculate the p-value for our \(t\)-statistic as the long-run probability of a \(t\)-statistic with \(n-2\) degrees of freedom being less than, greater than, or more extreme in either direction (depending on the direction of our alternative hypothesis) than our observed \(t\)-statistic.

Assumptions

  • Both variables are quantitative
  • Both variables should be drawn from normally distributed populations.
  • The relationship between the two variables should be linear.

In R
We can test the significance of the correlation coefficient really easily with the function cor.test():

cor.test(recalldata$recall_accuracy, recalldata$recall_confidence)
## 
##  Pearson's product-moment correlation
## 
## data:  recalldata$recall_accuracy and recalldata$recall_confidence
## t = 4.1512, df = 18, p-value = 0.0005998
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3719603 0.8720125
## sample estimates:
##       cor 
## 0.6993654
Optional: Manually conducting the correlation test
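
For the curious, here is a rough sketch of the calculations that cor.test() performs internally, following the formulae above:

r <- cor(recalldata$recall_accuracy, recalldata$recall_confidence)
n <- nrow(recalldata)

# the test statistic
tstat <- r * sqrt((n - 2) / (1 - r^2))
tstat   # matches t = 4.1512 from cor.test()

# two-sided p-value: the probability of a t-statistic with n-2 degrees
# of freedom being more extreme, in either direction, than the one observed
2 * pt(abs(tstat), df = n - 2, lower.tail = FALSE)   # matches p = 0.0005998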

Cautions!

Correlation is an invaluable tool for quantifying relationships between variables, but must be used with care.

Below are a few things to be aware of when we talk about correlation.

Correlation can be heavily affected by outliers. Always plot your data!

r = 0 means no linear association. The variables could still be otherwise associated. Always plot your data!

Correlation does not imply causation!
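
To see the first two of these cautions in action, here is a quick illustration with some made-up numbers:

x <- 1:10
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)
cor(x, y)   # a strong positive linear association

# adding a single outlying observation drastically changes r
cor(c(x, 30), c(y, -20))

# a perfect (but nonlinear) association can still produce r = 0
x2 <- -5:5
cor(x2, x2^2)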

Game: Guess the \(r\)

Take a break and play the “guess the correlation” game at http://guessthecorrelation.com/ to get an idea of what different strengths and directions of \(r\) can look like.

Correlation Exercises

Data: Sleep levels and daytime functioning

A researcher is interested in the relationship between hours slept per night and self-rated effects of sleep on daytime functioning. She recruited 50 healthy adults, and collected data on Total Sleep Time (TST) over the course of a seven-day period via sleep-tracking devices.
At the end of the seven-day period, participants completed a Daytime Functioning (DTF) questionnaire. This involved participants rating their agreement with ten statements (see Table 1) on a scale from 1 to 5. An overall score of daytime functioning can be calculated by:

  1. reversing the scores for items 4, 5 and 6 (because those items reflect agreement with positive statements, whereas the others reflect agreement with negative statements);
  2. summing the scores on each item; and
  3. subtracting the sum score from 50 (the max possible score). This will make higher scores reflect better perceived daytime functioning.

The data is available at https://uoepsy.github.io/data/sleepdtf.csv.

Table 1: Daytime Functioning Questionnaire

Item      Statement
Item_1    I often felt an inability to concentrate
Item_2    I frequently forgot things
Item_3    I found thinking clearly required a lot of effort
Item_4    I often felt happy
Item_5    I had lots of energy
Item_6    I worked efficiently
Item_7    I often felt irritable
Item_8    I often felt stressed
Item_9    I often felt sleepy
Item_10   I often felt fatigued

Question A1

Read in the data, and calculate the overall daytime functioning score, following the criteria outlined above. Make this a new column in your dataset.

Hints:

  • To reverse items 4, 5 and 6, we need to make all the scores of 1 become 5, scores of 2 become 4, and so on… What number satisfies all of these equations: ? - 5 = 1, ? - 4 = 2, ? - 3 = 3?
  • To quickly sum across rows, you can use the rowSums() function.

Solution
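
One possible approach is sketched below. The item column names (item_1 to item_10) are an assumption; check them against names(sleepdtf) before running.

sleepdtf <- read_csv("https://uoepsy.github.io/data/sleepdtf.csv")

sleepdtf <- sleepdtf %>%
  mutate(
    # reverse-score items 4, 5 and 6: 6 - score maps 1 to 5, 2 to 4, ..., 5 to 1
    item_4 = 6 - item_4,
    item_5 = 6 - item_5,
    item_6 = 6 - item_6,
    # sum all ten items and subtract from 50, so higher = better functioning
    dtf = 50 - rowSums(across(item_1:item_10))
  )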

Question A2

Calculate the correlation between the total sleep time (TST) and the overall daytime functioning score.
Conduct a test to establish the probability of observing a correlation this strong in a sample of this size assuming the true correlation to be 0.

Write a sentence or two summarising the results.

Solution
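
Continuing from the sketch above (again assuming the total sleep time column is named TST; adjust to your data):

cor(sleepdtf$TST, sleepdtf$dtf)
cor.test(sleepdtf$TST, sleepdtf$dtf)

A summary should report the direction and strength of the sample correlation, along with the test statistic, the degrees of freedom (n - 2 = 48 here) and the p-value.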

Question A3

Open-ended: Think about this relationship in terms of causation.

Claim: Less sleep causes poorer daytime functioning.

Why might it be inappropriate to make the claim above based on these data alone? Think about what sort of study could provide stronger evidence for such a claim.

Things to think about:

  • comparison groups.
  • random allocation.
  • measures of daytime functioning.
  • measures of sleep time.
  • other (unmeasured) explanatory variables.

Functions and Models Exercises

Question B1

The Scottish National Gallery kindly provided us with measurements of side and perimeter (in metres) for a sample of 10 square paintings.

The data are provided below:

sng <- tibble(
  side = c(1.3, 0.75, 2, 0.5, 0.3, 1.1, 2.3, 0.85, 1.1, 0.2),
  perimeter = c(5.2, 3.0, 8.0, 2.0, 1.2, 4.4, 9.2, 3.4, 4.4, 0.8)
)

Plot the data from the Scottish National Gallery using ggplot().

We know that there is a mathematical model for the relationship between the side-length and perimeter of squares: \(perimeter = 4 \times \ side\).
Try adding the following line to your plot:

  stat_function(fun = ~.x * 4)

Solution
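
One way to build this plot; the stat_function() line draws the model \(perimeter = 4 \times side\) over the data:

ggplot(sng, aes(x = side, y = perimeter)) +
  geom_point() +
  stat_function(fun = ~.x * 4)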

Question B2

Use our mathematical model to predict the perimeter of a painting with a side of 1.5 metres.
We do not have a painting with a side of 1.5 metres within our sample from the Scottish National Gallery, but we can use the mathematical model to predict the perimeter of such an unobserved square painting.

You can obtain this prediction either using a visual approach or an algebraic one.

Solution
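
Algebraically, we simply substitute side = 1.5 into the model:

# perimeter = 4 * side
4 * 1.5
## [1] 6

So we would predict a perimeter of 6 metres. Visually, this corresponds to reading off the height of the model line at side = 1.5.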

Question B3

Consider now the relationship between height (in inches) and handspan (in cm). Utts and Heckard (2015) provide data for a sample of 167 students who reported their height and handspan as part of a class survey.

Data: handheight.csv

Read the handheight data into R, and investigate how handspan varies as a function of height for the students in the sample.

Do you notice any outliers or points that do not fit with the pattern in the rest of the data?

Comment on any main differences you notice between this relationship and the relationship between sides and perimeter of squares.

Solution
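
A sketch of a first look at the data, assuming the columns are named height and handspan (check names(handheight) first):

handheight <- read_csv("https://uoepsy.github.io/data/handheight.csv")

ggplot(handheight, aes(x = height, y = handspan)) +
  geom_point()

Unlike side and perimeter, the points will not all fall exactly on one line: students of the same height report different handspans.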

Question B4

Using the following command, superimpose on top of your scatterplot a best-fit line describing how handspan varies as a function of height. For the moment, the argument se = FALSE tells R to not display uncertainty bands.

geom_smooth(method = lm, se = FALSE)

Comment on any differences you notice with the line summarising the linear relationship between side and perimeter.

Solution
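
Adding the best-fit line to the previous scatterplot (same assumed column names):

ggplot(handheight, aes(x = height, y = handspan)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)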

The mathematical model \(Perimeter = 4 \times \ Side\) represents the exact relationship between side-length and perimeter of squares.

In contrast, the relationship between height and handspan shows deviations from an “average pattern.” Hence, we need to create a model that allows for deviations from the linear relationship. This is called a statistical model.

A statistical model includes both a deterministic function and a random error term: \[ Handspan = \beta_0 + \beta_1 \ Height + \epsilon \] or, in short, \[ y = \underbrace{\beta_0 + \beta_1 \ x}_{f(x)} + \underbrace{\epsilon}_{\text{random error}} \]

The deterministic function need not be linear if the scatterplot displays signs of nonlinearity. In the equation above, \(\beta_0\) is the intercept (where the line going through the data meets the y-axis) and \(\beta_1\) is the slope (the line’s rate of increase or decrease).

Question B5

The line of best-fit is given by¹: \[ \widehat{Handspan} = -3 + 0.35 \ Height \]

What is your best guess for the handspan of a student who is 73in tall?

And for students who are 5in?

Solution
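
Substituting the two heights into the fitted line:

-3 + 0.35 * 73
## [1] 22.55

-3 + 0.35 * 5
## [1] -1.25

The first prediction (22.55 cm) is plausible. The second (-1.25 cm) is impossible: 5 inches is far outside the range of heights observed in the sample, and the line should not be used to extrapolate that far beyond the data.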

References

Utts, Jessica M, and Robert F Heckard. 2015. Mind on Statistics. Cengage Learning.

  1. Yes, the error term is gone. This is because the line of best-fit gives you the prediction of the average handspan for a given height, and not the individual handspan of a person, which will almost surely be different from the prediction of the line.


This workbook was written by Josiah King, Umberto Noe, and Martin Corley, and is licensed under a Creative Commons Attribution 4.0 International License.