S2W3 - Hypothesis Testing: critical values

class: center, middle, inverse, title-slide

# <b>S2W3 - Hypothesis Testing: critical values</b>
## Data Analysis for Psychology in R 1
### Umberto Noè
### Department of Psychology<br/>The University of Edinburgh
### AY 2020-2021

---

# Learning objectives

1. Recognise the difference between a bootstrap and null distribution.

1. Understand the parallel between p-values and critical values.

1. Be able to perform a one-sided or two-sided hypothesis test using the critical value method.

1. Understand the link between z-scores and critical values.

---
class: inverse, center, middle

# Part 0
## What you need to know

---

# Parameters, statistics, best estimates

- _Parameter_: a number that summarises some aspect of the population.

- _Statistic_: a numerical summary of the sample data.

- The statistic calculated from the sample is our _best estimate_ of the true but unknown value of the population parameter

.footnote[
To revise [click here](https://uoepsy.github.io/dapr1/labs/11_sampling_distributions.html)
]

---

# Normal distributions

.pull-left[
In a bell-shaped distribution, approximately:

- 68% of the values are within 1 SD of the mean

- 95.4% of the values are within 2 SD of the mean

- 99.7% of the values are within 3 SD of the mean
]

.pull-right[
<img src="https://uoepsy.github.io/dapr1/labs/images/prob/normal_rule.png" width="100%" style="display: block; margin: auto;" />
]

To be precise, 95% of the values are within 1.96 SD of the mean:

```r
qnorm(p = c(0.025, 0.975), mean = 0, sd = 1)
```

```
## [1] -1.96  1.96
```

---

# Quantiles (a.k.a. percentiles)

.pull-left[
- The `$p$`-quantile is the value that cuts an area equal to `$p$` to its left.

- It is the value `$x_p$` such that `$P(X \leq x_p) = p$`

- We use the term percentile when using _percentages_. The 0.5-quantile is the 50th percentile.
]

.pull-right[
<img src="https://uoepsy.github.io/dapr1/labs/images/prob/normal_quantile.png" width="90%" style="display: block; margin: auto;" />
]

.footnote[
To revise [click here](https://uoepsy.github.io/dapr1/labs/10_continuous_distributions.html)
]

---

# Quantiles (a.k.a. percentiles)

.pull-left[
**Resampling approach**

With data `df$y`
  
```
quantile(<data>, probs = <probs_to_the_left>)

quantile(df$y, probs = c(0.25, 0.5, 0.75))
```

With bootstrap means `bootstrap$means`:

```r
quantile(bootstrap$means, probs = 0.5)
```

```
##   50% 
## 1.265
```
]

.pull-right[
**Theoretical approach**

Find the quantiles of a probability distribution with the function `q` followed by the distribution name.
  
_Example._ Quantiles of a normal distribution:
```
qnorm(p = <prob_to_left>, mean, sd)
```

The 0.025 and 0.975-quantiles of a standard normal distribution are:

```r
qnorm(p = c(0.025, 0.975), mean = 0, sd = 1)
```

```
## [1] -1.96  1.96
```
]

---

# Hypothesis testing 101

- Pose a question that you would like to investigate or an hypothesis you'd like to empirically check.

- Identify the relevant population _parameters_.

- Translate that question or hypothesis into null `$(H_0)$` and alternative `$(H_1)$` hypotheses.

For example:
  
  `$$H_0 : \mu = 0$$`
  `$$H_1 : \mu > 0$$`
  
--

- Find or collect data that will help you answer
this question.

- Compute the statistic that estimates the parameter of interest. For example, the sample mean `$\bar x$`

.footnote[
To revise [click here](https://uoepsy.github.io/dapr1/labs/13_hypothesis_testing.html)
]

---

# Measuring evidence against `$H_0$`

- Light blue: _null distribution_ of sample means from samples of size `$n = 50$`.

- Red line: _observed statistic_.

.pull-left[

]

.pull-right[

]

Which scenario do you think provides more evidence that the population mean is greater than 0?

---

# P-values and statistical significance

.pull-left[

$$
p = 0.14
$$

]

.pull-right[

$$
p < .001
$$

]

If results as extreme or more extreme than the observed statistic are unlikely to occur by sampling variation alone when the null hypothesis is true, we say the sample results are statistically significant.

Statistical significance means that we have convincing evidence against `$H_0$` and in favour of `$H_1$`.

---
class: inverse, center, middle

# Part A
## Research question and data

---

# Research question

> Is reaction time in identifying ink colours increased when the ink is used to spell a different colour?

Researchers recruited 131 participants for a study. Each participant was asked to complete two tasks, both requiring them to pronounce the _colour_ of words shown on a screen.

In task (a) the colour and words matched, while in task (b) the colour and words did not match.

---

# Research question

To evaluate whether mismatching words and colours increased participants reaction times, we can compute for each participant the _difference_ between the time to complete the mismatching colour-word task and the matching colour-word task.

If the _average difference_ is larger than 0, then the mismatching colour-word task took _on average_ longer to complete.

We are not interested in whether the mismatching colour-word task took longer to complete than the other task for one particular individual. This might happen by change. What we really want to do is assess if there is a pattern, hence why the mean!

---

# Data

The data can be found at: https://uoepsy.github.io/data/stroopexpt2.csv

.pull-left[

```r
library(tidyverse)
library(kableExtra)

data <- read_csv("https://uoepsy.github.io/data/stroopexpt2.csv")

dim(data)
```

```
## [1] 131   5
```

```r
head(data) %>% 
  kable(digits = 2)
```
]

.pull-right[
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> id </th>
   <th style="text-align:right;"> age </th>
   <th style="text-align:right;"> matching </th>
   <th style="text-align:right;"> mismatching </th>
   <th style="text-align:right;"> stroop_effect </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 40 </td>
   <td style="text-align:right;"> 12.61 </td>
   <td style="text-align:right;"> 14.00 </td>
   <td style="text-align:right;"> 1.39 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 48 </td>
   <td style="text-align:right;"> 14.84 </td>
   <td style="text-align:right;"> 14.87 </td>
   <td style="text-align:right;"> 0.03 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 35 </td>
   <td style="text-align:right;"> 15.94 </td>
   <td style="text-align:right;"> 19.60 </td>
   <td style="text-align:right;"> 3.66 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 47 </td>
   <td style="text-align:right;"> 9.73 </td>
   <td style="text-align:right;"> 4.64 </td>
   <td style="text-align:right;"> -5.09 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 27 </td>
   <td style="text-align:right;"> 14.71 </td>
   <td style="text-align:right;"> 14.65 </td>
   <td style="text-align:right;"> -0.06 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 55 </td>
   <td style="text-align:right;"> 20.20 </td>
   <td style="text-align:right;"> 17.10 </td>
   <td style="text-align:right;"> -3.10 </td>
  </tr>
</tbody>
</table>

]

<br>

where: `stroop_effect = mismatching - matching`

---

# Data

Visualise the distribution of the `stroop_effect` variable in the sample data:

.pull-left[

```r
ggplot(data) +
  geom_histogram(aes(x = stroop_effect), 
                 color = 'white') +
  labs(x = 'Stroop effect')
```

It is not very symmetric and bell-shaped...

```r
data %>%
  summarise(Mean = mean(stroop_effect),
            SD = sd(stroop_effect)) %>%
  kable(digits = 3)
```

]

.pull-right[
<img src="dapR1_lec13_nhstcrit_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]

---

# Parameters and hypotheses

- `$\texttt{mismatching}_i$` = time participant `$i$` took to complete the _mismatching_ colour-word task

- `$\texttt{matching}_i$` = time participant `$i$` took to complete the _matching_ colour-word task

- `$D_i = \texttt{mismatching}_i - \texttt{matching}_i$` = difference in completion times ("_Stroop effect_")

- `$\mu$` = population mean difference in completion times

We wish to test whether the population mean difference in completion times is larger than 0.

That is, if the mean completion time of the mismatching colour-word task is higher than the matching colour-word task.

`$$H_0 : \mu = 0$$`
`$$H_1 : \mu > 0$$`

---

# Sample statistic

The observed sample mean difference in completion times is:

```r
xbar_obs <- mean(data$stroop_effect)
xbar_obs
```

```
## [1] 0.8843
```

<br>

This is just the mean of the differences (`stroop_effect`) in the sample data:

`$$\bar{x}_{obs} = \frac{ \sum_{i=1}^n D_i }{ n }$$`

---
class: inverse, center, middle

# Part B
## Bootstrap distribution vs Null distribution

---

# Bootstrap distribution

```r
source('https://uoepsy.github.io/files/rep_sample_n.R')
```

Set the random seed:

```r
set.seed(1)
```

Generate the bootstrap distribution:

```r
n <- nrow(data)
n
```

```
## [1] 131
```

```r
boot_dist <- data %>%
  rep_sample_n(n = n, samples = 5000, replace = TRUE) %>%
  group_by(sample) %>%
  summarise(xbar = mean(stroop_effect))
```

---

# Bootstrap distribution

.pull-left[

```r
head(boot_dist)
```

```
## # A tibble: 6 x 2
##   sample  xbar
##    <dbl> <dbl>
## 1      1 1.24 
## 2      2 0.797
## 3      3 1.10 
## 4      4 0.267
## 5      5 0.901
## 6      6 1.07
```

]

.pull-right[
<img src="dapR1_lec13_nhstcrit_files/figure-html/unnamed-chunk-24-1.png" width="80%" style="display: block; margin: auto;" />
]

- Centre = mean of bootstrap distribution = 0.884 = observed sample mean

- Spread = standard deviation of bootstrap distribution = 0.411 = bootstrap standard error

---

# Null distribution: Resampling approach

Centred at the value specified in the null hypothesis!

```r
data <- data %>%
  mutate(
    stroop_effect_shifted = stroop_effect - xbar_obs
  )

mean(data$stroop_effect_shifted) %>% 
  round(digits = 3)
```

```
## [1] 0
```

```r
null_dist <- data %>%
  rep_sample_n(n = n, samples = 5000, replace = TRUE) %>%
  group_by(sample) %>%
  summarise(xbar = mean(stroop_effect_shifted))
```

---

# Null distribution: Resampling approach

.pull-left[
Centre and spread

```r
# mean
mu <- mean(null_dist$xbar)
# standard error
se <- sd(null_dist$xbar)

tibble(mu, se) %>%
    kable(digits = 3)
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> mu </th>
   <th style="text-align:right;"> se </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.006 </td>
   <td style="text-align:right;"> 0.411 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
<img src="dapR1_lec13_nhstcrit_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]

---

# Null distribution: Theoretical approach

Recall:
$$
\bar X \sim N(\mu, SE), \qquad \qquad SE = \frac{s}{\sqrt{n}}
$$

<br>

But under the null hypothesis we assume that `$H_0: \mu = 0$`, so
$$
\bar X \sim N(0, SE), \qquad \qquad SE = \frac{s}{\sqrt{n}}
$$

---

# Null distribution: Theoretical approach

.pull-left[
In R:

```r
mu_theory <- 0
se_theory <- 
  sd(data$stroop_effect) / sqrt(n)

tibble(mu_theory, se_theory) %>%
    kable(digits = 3)
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> mu_theory </th>
   <th style="text-align:right;"> se_theory </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0.414 </td>
  </tr>
</tbody>
</table>

Then use `dnorm` for the density, `qnorm` for the quantiles, and `pnorm` for the probabilities of a normal distribution having mean and SE computed above.
]

.pull-right[
<img src="dapR1_lec13_nhstcrit_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

# Part C
## Critical values (one-sided `$H_1$`)

`$$H_0: \mu = 0$$`
`$$H_1: \mu > 0$$`

---

# Recap of p-values!

.pull-left[

```r
pvalue <- 
  sum(null_dist$xbar >= xbar_obs) / 
  nrow(null_dist)

pvalue
```

```
## [1] 0.0152
```

The probability of observing a sample mean as large as 0.884 or larger, when the null hypothesis is true, is 0.0152.

At the 5% significance level, the sample results provide strong evidence that the population mean difference in completion times is larger than 0.

]

.pull-right[
<img src="dapR1_lec13_nhstcrit_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />
]

---

# Recap of p-values!

---

# Critical values: Resampling approach

.pull-left[

```r
q0.95 <- quantile(null_dist$xbar, 
                  probs = 0.95)
q0.95
```

```
##    95% 
## 0.6755
```

```r
tibble(q0.95, 
       xbar_obs, 
       xbar_obs >= q0.95) %>%
  kable()
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> q0.95 </th>
   <th style="text-align:right;"> xbar_obs </th>
   <th style="text-align:left;"> xbar_obs &gt;= q0.95 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.6755 </td>
   <td style="text-align:right;"> 0.8843 </td>
   <td style="text-align:left;"> TRUE </td>
  </tr>
</tbody>
</table>
]

.pull-right[
<img src="dapR1_lec13_nhstcrit_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />
]

---

# Critical values: Theoretical approach

.pull-left[

```r
# Resampling
quantile(null_dist$xbar, probs = 0.95)
```

```
##    95% 
## 0.6755
```

```r
# Theoretical
q0.95_theory <- qnorm(p = 0.95, mu_theory, se_theory)
q0.95_theory
```

```
## [1] 0.6808
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> q0.95_theory </th>
   <th style="text-align:right;"> xbar_obs </th>
   <th style="text-align:left;"> xbar_obs &gt;= q0.95_theory </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.6808 </td>
   <td style="text-align:right;"> 0.8843 </td>
   <td style="text-align:left;"> TRUE </td>
  </tr>
</tbody>
</table>
]

.pull-right[
<img src="dapR1_lec13_nhstcrit_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

# Part D
## Critical values (two-sided `$H_1$`)

`$$H_0: \mu = 0$$`
`$$H_1: \mu \neq 0$$`

---

# Critical values: Resampling approach

.pull-left[

```r
quantile(null_dist$xbar, 
         probs = c(0.025, 0.975))
```

```
##    2.5%   97.5% 
## -0.8074  0.7998
```

```r
# Observed statistic
xbar_obs
```

```
## [1] 0.8843
```

The observed statistic `$\bar x_{obs}$` is larger than the 97.5th percentile, so we reject the null hypothesis.

People often say that the observed statistic falls in the rejection region (the red intervals).

]

.pull-right[
<img src="dapR1_lec13_nhstcrit_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" />
]

---

# Critical values: Theoretical approach

.pull-left[

```r
# Resampling
quantile(null_dist$xbar, 
         probs = c(0.025, 0.975))
```

```
##    2.5%   97.5% 
## -0.8074  0.7998
```

```r
# Theoretical
qnorm(p = c(0.025, 0.975), 
      mean = mu_theory, sd = se_theory)
```

```
## [1] -0.8112  0.8112
```

```r
# Observed statistic
xbar_obs
```

```
## [1] 0.8843
```
]

.pull-right[
<img src="dapR1_lec13_nhstcrit_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

# Part E
## Standardized statistics (a.k.a. z-scores)

---

# Z-scores

- For data `$x_1, ..., x_n$`, the z-score is
$$
z_i = \frac{x_i - \mu}{\sigma}
$$

- For our 5,000 means `$\bar x_1, ..., \bar x_{5000}$` from the null distribution, we compute the z-score as:
$$
z_i = \frac{\bar x_i - 0}{SE} = \frac{\bar x_i - \text{hypothesised value}}{ SE }
$$

where `$SE$` = standard error of the mean.

- Don't forget to also transform the observed statistic to standard units! We need to z-score the observed statistic to bring it to the same scale:

`$$z_{obs} = \frac{\bar x_{obs} - \text{hypothesised value}}{ SE }$$`

---

# Z-scores

How is the standard error computed?

- **Resampling approach**: `$SE$` = standard deviation of the null distribution

- **Theoretical approach**: `$SE = \frac{s}{\sqrt{n}}$`

---

# Resampling approach

.pull-left[

```r
hypothesised_value <- 0

null_dist$z <- 
  (null_dist$xbar - hypothesised_value) /
  sd(null_dist$xbar)

z_obs <- (xbar_obs - hypothesised_value) / 
  sd(null_dist$xbar)
z_obs
```

```
## [1] 2.152
```

```r
quantile(null_dist$z, probs = c(0.025, 0.975))
```

```
##   2.5%  97.5% 
## -1.965  1.947
```

]

.pull-right[
<img src="dapR1_lec13_nhstcrit_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" />
]

---

# Theoretical approach

.pull-left[

```r
# Resampling
quantile(null_dist$z, 
         probs = c(0.025, 0.975))
```

```
##   2.5%  97.5% 
## -1.965  1.947
```

```r
# Theoretical
q <- qnorm(c(0.025, 0.975), mean=0, sd=1)
q
```

```
## [1] -1.96  1.96
```

```r
# Observed z-score
z_obs_theory <- (xbar_obs - hypothesised_value) / 
  se_theory
z_obs_theory
```

```
## [1] 2.136
```

]

.pull-right[
<img src="dapR1_lec13_nhstcrit_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" />
]

---

# What if you don't want to z-score `$\bar x_{obs}$`?

- Recall that
$$
\text{When } H_0 : \mu = 0 \text{ is true:} \qquad \bar X \sim N(0, SE)
$$

- We also know that for a normal distribution, 95% of the values lie within 1.96 SE of the mean (= 0 in this case).

- So, we would reject the null hypothesis if the observed sample mean is smaller than `$-1.96 \cdot SE$` or larger than `$1.96 \cdot SE$`.

- If we use the absolute value to ignore the sign, we can simply refer to the upper tail and

`$$\text{Reject } H_0 \text{ if:} \qquad |\bar x_{obs}| \geq 1.96 \cdot SE$$`

---

class: inverse, center, middle, animated, rotateInDownLeft

# End