Exercises: t-test, effect size, assumptions
You will be using the following dataset: https://uoepsy.github.io/data/students_reading_scores.csv
These contain measurements of age (in years) and reading scores (0 - 100) for 35 university students randomly sampled from a hypothetical university.
Question 1
Solution
library(tidyverse)
students <- read_csv("https://uoepsy.github.io/data/students_reading_scores.csv")
head(students)
## # A tibble: 6 × 2
## age read_score
## <dbl> <dbl>
## 1 28 84
## 2 -99 75
## 3 27 NA
## 4 29 NA
## 5 24 52
## 6 25 23
Question 2
Perform preliminary checks on the data by:
- plotting the variables in the dataset;
- computing a table of descriptive statistics.
Solution
Plot the variables in the dataset:
library(patchwork)
plt_age <- ggplot(students, aes(x = age)) +
geom_histogram()
plt_read <- ggplot(students, aes(x = read_score)) +
geom_histogram()
plt_age | plt_read
Table of descriptive statistics:
students %>%
pivot_longer(cols = c(age, read_score), names_to = "Variable", values_to = "Values") %>%
group_by(Variable) %>%
summarise(M = mean(Values),
SD = sd(Values),
Min = min(Values),
Med = median(Values),
Max = max(Values),
NMiss = sum(is.na(Values)))
## # A tibble: 2 × 7
## Variable M SD Min Med Max NMiss
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 age -9.49 57.5 -99 24 32 0
## 2 read_score NA NA NA NA NA 10
Question 3
What do you notice? Are there any issues with the data that need fixing?
If yes, make appropriate changes to the data to fix those issues.
Solution
It looks like some students had their age stored as “-99”. This is an impossible value for age, and it was used by those entering the data as a missing value; i.e. when a student preferred to not disclose that information. The values “-99” should be replaced to be NAs.
The reading scores for some students were also missing but, if you look at the CSV file itself you will see that those were stored as spaces rather than -99s.
This is in the CSV file, and not the object created into R when you read the CSV file data into R.
However, R is smart enough that when you read CSV data into R and there are spaces, those are interpreted as missing values and converted to NAs automatically.
Alternative 1:
students$age[students$age == -99] <- NA
Alternative 2:
students <- students %>%
mutate(
age = ifelse(age == -99, NA, age)
)
head(students)
## # A tibble: 6 × 2
## age read_score
## <dbl> <dbl>
## 1 28 84
## 2 NA 75
## 3 27 NA
## 4 29 NA
## 5 24 52
## 6 25 23
Plot the variables in the dataset:
library(patchwork)
plt_age <- ggplot(students, aes(x = age)) +
geom_histogram()
plt_read <- ggplot(students, aes(x = read_score)) +
geom_histogram()
plt_age | plt_read
Table of descriptive statistics:
students %>%
pivot_longer(cols = c(age, read_score), names_to = "Variable", values_to = "Values") %>%
group_by(Variable) %>%
summarise(M = mean(Values),
SD = sd(Values),
Min = min(Values),
Med = median(Values),
Max = max(Values),
NMiss = sum(is.na(Values)))
## # A tibble: 2 × 7
## Variable M SD Min Med Max NMiss
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 age NA NA NA NA NA 10
## 2 read_score NA NA NA NA NA 10
The table still has NAs, because when you compute the mean for example, if you add a number to NA, you get NA. For example, 2 + NA = NA.
To tell R to ignore NAs when computing the mean, you say na.rm = TRUE
.
students %>%
pivot_longer(cols = c(age, read_score), names_to = "Variable", values_to = "Values") %>%
group_by(Variable) %>%
summarise(M = mean(Values, na.rm = TRUE),
SD = sd(Values, na.rm = TRUE),
Min = min(Values, na.rm = TRUE),
Med = median(Values, na.rm = TRUE),
Max = max(Values, na.rm = TRUE),
NMiss = sum(is.na(Values)))
## # A tibble: 2 × 7
## Variable M SD Min Med Max NMiss
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 age 26.3 3.82 19 26 32 10
## 2 read_score 59 17.6 23 59 92 10
Question 4
We will only be using the reading scores.
Remove from the data any rows for which the reading score is missing.
Solution
Alternative 1:
students_tidy <- students %>%
drop_na(read_score)
head(students_tidy)
## # A tibble: 6 × 2
## age read_score
## <dbl> <dbl>
## 1 28 84
## 2 NA 75
## 3 24 52
## 4 25 23
## 5 23 40
## 6 24 58
Alternative 2:
students_tidy <- students %>%
filter(complete.cases(read_score))
If you used na.omit(students)
or drop_na()
, you would remove students having NAs also for age. However, students may not have provided their age, but we can have their reading score, so if we were to do this, we would throw away available data!
Question 5
At the 5% significance level, test whether the sample data provide significance evidence that the mean reading score for all students in that university is not equal to 50.
Solution
xbar <- mean(students_tidy$read_score)
xbar
## [1] 59
n <- nrow(students_tidy)
n
## [1] 25
s <- sd(students_tidy$read_score)
s
## [1] 17.57365
se <- s / sqrt(n)
se
## [1] 3.514731
Observed t-statistic
mu0 <- 50
tobs <- (xbar - mu0) / se
tobs
## [1] 2.560651
P-value = 2 * Probability to the right of tobs:
2 * pt(abs(tobs), df = n - 1, lower.tail = FALSE)
## [1] 0.01716085
The sample results are significant, so it’s good practice to follow up with a confidence interval
tstar <- qt(c(0.025, 0.975), df = n - 1)
tstar
## [1] -2.063899 2.063899
ci <- xbar + tstar * se
ci
## [1] 51.74595 66.25405
Question 6
The t-test for one population mean relies on some assumptions for the results to be valid.
Check whether these are satisfied or not in this sample.
Solution
The data come from a random sample of students from that university, hence the independence assumption is satisfied.
The sample size - that is, the number of data points that are not missing - is \(n = 25\). This is not a large enough sample size for the sample mean to be normally distributed, so we will check whether the sample came from a population that follows a normal distribution. If this is the case, then the sample mean would be normally distributed irrespectively of the sample size.
plt1 <- ggplot(students_tidy, aes(x = read_score)) +
geom_density() +
labs(x = "Reading scores (0-100)", title = "(a) Density plot")
plt2 <- ggplot(students_tidy, aes(sample = read_score)) +
geom_qq() +
geom_qq_line() +
labs(x = "Theoretical quantiles", y = "Sample quantiles", title = "(b) qqplot")
plt1 | plt2
The density plot highlights a roughly bell-shaped distribution with a single peak near 60. The qqplot doesn’t highlight any severe departures from normality, in fact the sample quantiles follow the theoretical quantiles from a normal distribution pretty closely.
Alternatively, you can also obtain the qqplot as follows:
qqnorm(students_tidy$read_score)
qqline(students_tidy$read_score)
We can also perform a Shapiro-Wilk test:
shapiro.test(students_tidy$read_score)
##
## Shapiro-Wilk normality test
##
## data: students_tidy$read_score
## W = 0.97792, p-value = 0.8411
Question 7
Provide a write up pf your results in the context of the study.
Solution
Data were obtained from https://uoepsy.github.io/data/students_reading_scores.csv. The dataset contains measurements on age and reading scores for 35 randomly sampled students from a hypothetical university. Due to random sampling of the students, the independence assumption is satisfied. Some students had missing values for their reading scores, so those students were dropped from the analysis, leaving a sample of 25 students for the analysis. A Shapiro-Wilk normality test was performed at the 5% significance level (\(W = 0.98, p = 0.84\)). The sample data do not provide sufficient evidence to reject the null hypothesis of normality of the population data. Furthermore, the density plot shown in Figure 1 (a) highlights a roughly bell-shaped distribution, and the qqplot in Figure 1 (b) doesn’t show violations from normality.
At the 5% significance level, we performed a hypothesis test against the null hypothesis that the mean reading score of all students in that university was equal to 50. The sample data provide very strong evidence to reject the null hypothesis in favour of the alternative one that the population mean reading score is different from 50: \(t(24) = 2.56, p = 0.02\), two-sided.
We are 95% confident that the mean reading score for all students in that university is between 52 and 66.
Question 8
Is there a significant effect? Is the effect important?
Solution
Yes, there is a significant effect, as we found a p-value smaller than the significance level of 5%. However, this is no guarantee that the effect is of practical importance.
To judge whether the effect is important, we need to compute the effect size, i.e. Cohen’s D:
D <- (xbar - mu0) / s
D
## [1] 0.5121302
Cohen’D highlights a medium effect size, highlighting that the sample result is not only significant, but perhaps also of moderate importance and worth noting.
Exercises: power
To compute power, you need to know the distribution of the sample mean
- when \(H_0\) is true
- when \(H_1\) is true
This seldom happens in practice, so these exercises will be more conceptual.
Before going ahead, though, remember that if the population has mean \(\mu\) the sampling distribution of the mean is \(N(\mu, \frac{\sigma}{\sqrt n})\), where \(\sigma\) is the population SD and \(n\) the size of a sample.
Suppose you are testing
\[
H_0 : \mu = 0 \\
H_1 : \mu > 0
\]
What is the power of the test if:
- the population has a mean \(\mu = 5\)
- the population has a standard deviation \(\sigma = 3\)
- you will take a sample of size 30
The distribution of the sample mean in the population is
\[\bar X \sim N(5, \frac{3}{\sqrt{30}})\]
If \(H_0\) is true, however, the distribution would be:
\[\bar X \sim N(0, \frac{3}{\sqrt{30}})\]
You would reject the null hypothesis, at the 5% significance level, for values larger the 0.95 quantile of a \(N(0, \frac{3}{\sqrt{30}})\) distribution:
sigma <- 3
n <- 30
SE <- 3 / sqrt(30)
tstar <- qnorm(0.95, mean = 0, sd = SE)
tstar
## [1] 0.9009234
The power is the probability of rejecting the null hypothesis when the null is false.
So, if the null is false, the mean follows the distribution \(N(5, \frac{3}{\sqrt{30}})\). The probability of correctly rejecting the null would be the probability under that distribution beyond -1.07
Pow <- 1 - pnorm(0.90, mean = 5, sd = SE)
Pow
## [1] 1
The power is 1.
Question 9
Assume now that the true population mean is \(\mu = 2\). What is the power of the test?
Solution
The distribution of the mean when \(H_0\) is false is then
\[\bar X \sim N(2, \frac{3}{\sqrt{30}})\]
We need to compute the probability in a \(N(2, \frac{3}{\sqrt{30}})\) distribution to the right of 0.90.
Pow <- pnorm(0.90, mean = 2, sd = SE, lower.tail = FALSE)
Pow
## [1] 0.9776951
or
Pow <- 1 - pnorm(0.90, mean = 2, sd = SE)
Pow
## [1] 0.9776951
The power, i.e. the probability that the test rejects the null hypothesis when it is false, is 0.98.
Question 10
Assume now that the true population mean is \(\mu = 2\). What is the probability of a Type II error?
Solution
The easiest way to compute it is as \(\beta = 1 - \text{Power}\) because \(\text{Power} = 1 - \beta\).
beta <- 1 - Pow
beta
## [1] 0.02230486
The probability of not rejecting the null when it is false is 0.02.
If you want to compute it from scratch you can compute it as the probability to the left of 0.90. This is because values below 0.90 lead to not rejecting the null hypothesis.
beta <- pnorm(0.90, mean = 2, sd = SE)
beta
## [1] 0.02230486