3B: Practical Inference

This reading:

How does hypothesis testing work in practice?
How do we do all this in R?
- spoiler: it’s easier than you think
What are some basic hypothesis tests that we can conduct?

In the previous section we saw how we can apply the logic of Null Hypothesis Significance Testing (NHST), allowing us to draw inferences about parameters in the population, based on statistics computed on the sample we have collected.

NHST

We have a sample (\(n=10\)):

mysample <- c(1, -4, 6, 4, -2, 3, 2, -5, 6, 8)

And a sample mean:

mean(mysample)

[1] 1.9

We want to test this against the null hypothesis that the mean in the population is actually 0.

Remember, there are lots of samples of size \(n=10\) that we could take, and they all have different means. To quantify the spread of these different means we can use the standard error, calculated using \(SE = \frac{\sigma}{\sqrt{n}}\):

sd(mysample) / sqrt(length(mysample))

[1] 1.394035
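To see why the standard error describes the spread of sample means, we can simulate taking lots of samples. This is only a sketch: the population standard deviation of 4.4 and the seed are made-up values for illustration, not something we would know in practice.

```r
set.seed(123)   # arbitrary seed, only so the simulation is reproducible
n     <- 10     # sample size, as in the text
sigma <- 4.4    # hypothetical population sd (an assumption for this sketch)

# draw 10,000 samples of size n and record the mean of each one
sample_means <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))

sd(sample_means)   # empirical spread of the simulated sample means
sigma / sqrt(n)    # what the standard error formula says it should be
```

The two numbers should be very close, which is the sense in which the standard error is "the standard deviation of the sample means".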

We can use this information to express how far away from the null hypothesis (mean = 0) our observed sample is, in terms of standard errors:

\[ Z \ = \ \frac{\text{estimate}-\text{null}}{SE} \ = \ \frac{1.9 - 0}{1.394} \ = \ 1.36 \]

We then ask, if the mean in the population is 0, what is the probability of obtaining a \(Z\)-statistic from a sample of this size at least as extreme as the one we have observed?

The resulting probability is our p-value:

2*pnorm(1.36, mean = 0, sd = 1, lower.tail = FALSE)

[1] 0.1738299

As our \(p\)-value is above our threshold of \(\alpha=.05\), we fail to reject the null hypothesis that the mean in the population is zero.

We can get to the same conclusion by constructing a 95% confidence interval:

xbar = mean(mysample)
se = sd(mysample) / sqrt(length(mysample))
c(xbar - (1.96 * se), xbar + (1.96 * se))

[1] -0.8323084  4.6323084

As this interval includes zero, then at the 5% level we fail to reject the null hypothesis that the population mean is zero.¹

While in practice NHST follows the logic described above, there is something important that we have been sweeping under the carpet.

In our estimation of the standard error we have used the formula that includes \(\sigma\), which refers to the population standard deviation. However, we never know this value (because we don’t have data for the population), so we have been using the sample standard deviation \(s\) instead. This is an approximation. It might be okay when we have a very large \(n\) (meaning \(s\) provides an accurate estimate of \(\sigma\)), but in practice collecting a very large sample is not always feasible.

\[ SE = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} \]
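A small simulation can show what this approximation costs us when \(n\) is small. The sketch below assumes a population that really does have a mean of 0 (so the null hypothesis is true), and counts how often we would wrongly reject it at \(\alpha=.05\). The seed and the number of replications are arbitrary choices. For comparison it also uses a cut-off taken from the \(t\)-distribution, which is introduced in the next section.

```r
set.seed(2024)   # arbitrary seed for reproducibility
n    <- 10       # small sample size, as in our example
reps <- 10000    # number of simulated samples

# simulate samples for which the null (population mean = 0) is TRUE,
# computing (mean - 0) / (s / sqrt(n)) for each one
stats <- replicate(reps, {
  x <- rnorm(n, mean = 0, sd = 1)
  mean(x) / (sd(x) / sqrt(n))
})

# proportion of samples in which we would reject the (true) null
reject_normal <- mean(abs(stats) > qnorm(0.975))           # normal cut-off (1.96)
reject_t      <- mean(abs(stats) > qt(0.975, df = n - 1))  # t cut-off, 9 df
c(normal = reject_normal, t = reject_t)
```

With the normal cut-off we should see noticeably more than 5% false rejections, whereas the \(t\)-based cut-off should sit close to the 5% we intended.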

\(t\)-distributions

To resolve the issues with this approximation (using \(s\) in place of \(\sigma\)), we tend to use the \(t\)-distribution instead of the normal distribution.
The \(t\)-distribution is very similar to the normal distribution, but it has slightly heavier tails:

Figure 2: Normal distribution (black) vs t-distribution with 3 degrees of freedom (red)

\(t\)-distributions are always centered on zero, and the precise shape (how heavy the tails are) depends upon a parameter known as the degrees of freedom.

Degrees of Freedom - \(df\)

‘Degrees of freedom’ is a tricky concept. One of the most intuitive ways to understand it is to think of it as the number of independent bits of information that go into calculating an estimate. Put another way, it is the number of datapoints that are free to vary.

Degrees of freedom (df)

When we estimate the mean from a sample, we use up one of our degrees of freedom, and so our test of a single mean will require us to use a \(t\)-distribution with \(n-1\) degrees of freedom. For \(t\)-distributions, as the \(df\) increases the distribution becomes closer and closer to a normal distribution (see Figure 3) - the use of these \(t\)-distributions is exactly what we need to account for using \(s\) in our calculation of the standard error.

Figure 3: t distributions with various degrees of freedom.
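We can also see this convergence numerically rather than visually. The sketch below compares the value that cuts off the top 2.5% of several \(t\)-distributions with the equivalent value for the normal distribution. The particular set of \(df\) values is just an illustrative choice.

```r
dfs <- c(3, 9, 30, 100, 1000)   # illustrative degrees of freedom

# the value cutting off the top 2.5% of each t-distribution ...
t_crit <- qt(0.975, df = dfs)
# ... and the same cut-off for the standard normal distribution
z_crit <- qnorm(0.975)

round(rbind(df = dfs, t_crit = t_crit, z_crit = z_crit), 3)

# heavier tails: with few df, more probability lies beyond +/- 2
2 * pt(2, df = 3, lower.tail = FALSE)
2 * pnorm(2, lower.tail = FALSE)
```

As the \(df\) grows, the \(t\) cut-off shrinks towards the normal one, and the extra probability in the tails disappears.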

In order to utilise the \(t\)-distribution in hypothesis testing, we need to move to performing \(t\)-tests!

The logic remains the same as before, but where we previously were relying on the normal distribution:

pnorm() for our \(p\)-values
qnorm() in order to calculate our confidence intervals (qnorm(0.975) gives the 1.96 we have been using)

We can use pt() and qt() to conduct the same process but in reference to the appropriate \(t\)-distribution.

Demonstration: t-test for a single mean

Recalling our sample, we’ll now perform the appropriate test against the null hypothesis that the mean in the population is zero.

mysample <- c(1, -4, 6, 4, -2, 3, 2, -5, 6, 8)

Now that we move to using \(t\)-distributions, our test-statistic is going to be a \(t\)-statistic:

\[ \begin{align} & t = \frac{\bar x - \mu_0}{\frac{s}{\sqrt{n}}}\\ \ \\ & \text{where:} \\ & \bar x : \text{sample mean} \\ & \mu_0 : \text{hypothesised population mean} \\ & s : \text{sample standard deviation} \\ & n : \text{sample size} \\ \end{align} \]

Which can be calculated as:

xbar = mean(mysample)
se = sd(mysample) / sqrt(length(mysample))
tstat = (xbar - 0) / se
tstat

[1] 1.36295

Because we have \(n=10\), and we are estimating a sample mean, we are going to be referring to a \(t\)-distribution with 9 (\(10-1\)) degrees of freedom (we lose one by calculating the mean).

Our p-value can be found with the pt() function:

2*pt(tstat, df = 9, lower.tail = FALSE)

[1] 0.2060213

And our confidence interval can be constructed using:

\[ \text{95% CI} = \bar{x} \pm t^* \times SE \]

Note that \(t^*\) has replaced the 1.96 we saw in previous chapters, because we obtained that using the normal distribution. The code qnorm(c(0.025, 0.975)) shows us that 95% of the normal distribution lies within 1.96 standard deviations of the mean. But what we actually want to know is where the middle 95% of the \(t\)-distribution with \(df=9\) lies, so instead we can use:

qt(c(0.025, 0.975), df = 9)

[1] -2.262157  2.262157

And our confidence interval is:

xbar = mean(mysample)
se = sd(mysample) / sqrt(length(mysample))
c(xbar - (2.262 * se), xbar + (2.262 * se))

[1] -1.253307  5.053307

R: Making things easier

All of the above is crucial for understanding how this all works, but in practice we can avoid the rigmarole of calculating the standard error by hand or using functions like pt() and qt(). This is where R starts to become far more powerful - there are functions that do all this sort of stuff for us, in just one line of code!

Take a look at the output of the function below. I have given it the sample, and specified that we want it to test against the null hypothesis that \(\mu=0\).

t.test(mysample, mu = 0)


    One Sample t-test

data:  mysample
t = 1.363, df = 9, p-value = 0.206
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -1.253526  5.053526
sample estimates:
mean of x 
      1.9

The t.test() function here gives us the \(t\)-statistic, the \(df\), the \(p\)-value, the 95% CI, the mean \(\bar x\), and it even tells us the alternative hypothesis (that the true mean is \(\neq 0\)).

All of these numbers will match those that we calculated above (there may be a small bit of rounding error).
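If we want to check this for ourselves, we can store the output of t.test() and pull out the individual pieces. The sketch below compares them with the manual calculations, using the unrounded \(t^*\) from qt() rather than 2.262. The object names res, pval and ci are just labels chosen here.

```r
mysample <- c(1, -4, 6, 4, -2, 3, 2, -5, 6, 8)

# store the result rather than just printing it
res <- t.test(mysample, mu = 0)

# manual versions of the same quantities
xbar  <- mean(mysample)
se    <- sd(mysample) / sqrt(length(mysample))
tstat <- (xbar - 0) / se
pval  <- 2 * pt(abs(tstat), df = 9, lower.tail = FALSE)
ci    <- xbar + qt(c(0.025, 0.975), df = 9) * se

# compare (all.equal() allows for tiny floating-point differences)
all.equal(unname(res$statistic), tstat)
all.equal(res$p.value, pval)
all.equal(as.numeric(res$conf.int), ci)
```

Storing the result like this is also handy when writing up, because we can refer to res$p.value and friends instead of copying numbers by hand.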

It’s that easy!

Checking/Testing Assumptions

These sorts of tests often require us to meet a set of conditions in order for our inferences to be valid. When we perform tests like these that involve estimating a mean, a common requirement is that the deviations from that mean are close to normally distributed.

Assumption Plots

We can evaluate how close to normal a distribution is by visualising it via histograms and density plots and making a judgment call, but this can sometimes be hard:


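library(tidyverse)  # provides tibble() and ggplot()
library(patchwork)  # assumed here: it is what lets `+` place two ggplots side by side
data <- tibble(mysample = mysample)
ggplot(data, aes(x = mysample)) + geom_histogram(bins = 14) +
  ggplot(data, aes(x = mysample)) + geom_density()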

Another useful visualisation tool is the QQplot. The closer to the diagonal line, the closer our data is to being normally distributed:

qqnorm(data$mysample) # make the plot
qqline(data$mysample) # add the line

Optional: What are the ‘theoretical quantiles’?
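One way to get a feel for the theoretical quantiles is to compute them ourselves. The sketch below assumes the standard approach taken by qqnorm(), which spreads \(n\) probabilities evenly between 0 and 1 using ppoints() and then asks where a standard normal distribution would put those probabilities. The object names are just labels chosen here.

```r
mysample <- c(1, -4, 6, 4, -2, 3, 2, -5, 6, 8)
n <- length(mysample)

# evenly spread probabilities between 0 and 1, one per data point
probs <- ppoints(n)

# the values a standard normal distribution has at those probabilities
theoretical <- qnorm(probs)

# pair them with our observed values, ordered from smallest to largest
data.frame(theoretical = round(theoretical, 2), observed = sort(mysample))

# these are the same x-axis values that qqnorm() computes
qq <- qqnorm(mysample, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
```

If the data were perfectly normal, the observed values would increase in step with the theoretical ones, which is why the points in the QQplot would fall on a straight line.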

Assumption Tests

There are also, if we wish to make use of them, specific hypothesis tests that assess normality, such as the ‘Shapiro-Wilk’ test. The null hypothesis for this test is that the data we give it are drawn from a normal distribution. This means that a p-value greater than .05 gives us no grounds to doubt normality. So in the example below, we have no reason to reject the hypothesis that our data are drawn from a normal distribution. This means we can continue to conduct a t-test.

shapiro.test(mysample)


    Shapiro-Wilk normality test

data:  mysample
W = 0.94722, p-value = 0.6358

Basic Tests

Now that we’ve gone through all the nitty-gritty bits of how hypothesis testing works, the heavy lifting is done.
We’re going to start to look at some of the different basic hypothesis tests that we can perform.

For each test below we show an example conducted the quick way (e.g. t.test() function), and also the manual computations (for those of you who are interested!). We’ve already seen the one sample \(t\)-test in the example above, so you might want to skim over that section.

something to bear in mind

These tests are the simple hypothesis tests that were developed in the 19th and 20th centuries, and provide a good basis of understanding the null hypothesis significance testing framework.

In the latter half of this course, we move to focus on a modelling based approach for analysing data. We will start to see how many of these simple tests that we are learning now are actually special cases of a more general statistical model.

One sample t-test

Purpose

The one sample t-test is what we have already seen above. We use it to test whether the mean is different from/greater than/less than some hypothesised value.

Examples:
- Is the mean age of USMR students different from 20?
- Is the mean IQ different from 100?
- Do people read more than 250 words per minute?

Assumptions:

The data are continuous (not discrete)
The data are independent (i.e. the value of a datapoint does not depend on the value of another datapoint in any way)
The data are normally distributed OR the sample size is large enough (rule-of-thumb n = 30) and the data are not strongly skewed

Research Question: Do people read more than 250 words per minute?

Fifty participants were recruited and tasked with reading a passage of text that was 2000 words long. Their reading speeds (in words per minute) were recorded, and these are accessible at https://uoepsy.github.io/data/usmr_tread.csv.

wpmtime <- read_csv("https://uoepsy.github.io/data/usmr_tread.csv")
head(wpmtime)

# A tibble: 6 × 2
  id      wpm
  <chr> <dbl>
1 ppt_1   307
2 ppt_2   265
3 ppt_3   205
4 ppt_4   300
5 ppt_5   207
6 ppt_6   300

The quick and easy way
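Because the research question is directional ("more than 250"), we can ask t.test() for a one-sided test with alternative = "greater". The sketch below uses only the six values printed by head(wpmtime) above, so that it runs without downloading anything. The results are therefore illustrative and are not the results for the full sample of fifty.

```r
# only the six rows shown by head(wpmtime); the full dataset has 50 participants
wpm_head <- c(307, 265, 205, 300, 207, 300)

# H0: mu = 250 ; H1: mu > 250 (one-sided, "greater than")
res <- t.test(wpm_head, mu = 250, alternative = "greater")
res

# with the full data loaded, the equivalent call would be:
# t.test(wpmtime$wpm, mu = 250, alternative = "greater")
```

Note that with a one-sided alternative the confidence interval that R reports is also one-sided, with an upper limit of Inf.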

Manually
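The same one-sided test can be sketched by hand, following the steps from the demonstration earlier. Again this uses only the six values shown by head(wpmtime), so the numbers are illustrative rather than the result for the full dataset. Because the alternative is "greater than", we take only the upper tail and do not multiply by 2.

```r
# only the six rows shown by head(wpmtime); illustrative, not the full data
wpm_head <- c(307, 265, 205, 300, 207, 300)

n     <- length(wpm_head)
xbar  <- mean(wpm_head)
se    <- sd(wpm_head) / sqrt(n)
tstat <- (xbar - 250) / se      # distance from the null value of 250, in SEs

# one-sided p-value: probability of a t at least this LARGE if H0 is true
pval <- pt(tstat, df = n - 1, lower.tail = FALSE)

# one-sided 95% CI has a lower bound only, so we use the 0.95 quantile
lower <- xbar - qt(0.95, df = n - 1) * se

c(t = tstat, df = n - 1, p = pval, ci_lower = lower)
```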

Two sample t-test

Purpose
The two sample t-test is used to test whether the mean of one group is different from/greater than/less than the mean of another.

Examples:
- Is the mean age of cat people different from the mean age of dog people?
- Do people who identify as “morning people” have a higher average rating of sleep quality than those who identify as “evening people”?
- Is the average reaction time different between people who do and don’t drink caffeinated drinks?

Assumptions:

The data are continuous (not discrete)
The data are independent (i.e. the value of a datapoint does not depend on the value of another datapoint in any way)
The data are normally distributed for each group, OR the sample size is large enough (rule-of-thumb n = 30) and the data are not strongly skewed
The variance is equal across groups*.

*We can relax this assumption by using an adjusted test called the “Welch \(t\)-test”, which calculates the standard error slightly differently, and estimates the degrees of freedom differently too. This is actually the default in R, and we can easily switch between the two versions using t.test(..., var.equal = FALSE) for the Welch test or t.test(..., var.equal = TRUE) to assume equal variances.
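To see what the var.equal argument changes, the sketch below runs both versions on some simulated data. The two groups (group_a and group_b), their sizes, means and standard deviations are all made up for illustration.

```r
set.seed(42)   # arbitrary seed for reproducibility
# hypothetical groups with unequal spread and unequal size
group_a <- rnorm(40, mean = 400, sd = 90)
group_b <- rnorm(60, mean = 450, sd = 110)

welch   <- t.test(group_a, group_b)                    # default: var.equal = FALSE
student <- t.test(group_a, group_b, var.equal = TRUE)  # assumes equal variances

# the equal-variance df is n1 + n2 - 2; the Welch df is usually not a whole number
c(welch_df = unname(welch$parameter), student_df = unname(student$parameter))
```

The fractional degrees of freedom in the output are a quick way to tell that R has used the Welch version.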

Research Question: Is the average reaction time different between people who do and don’t drink caffeinated drinks?

One hundred participants were recruited and completed a simple reaction time task. They were also surveyed on whether they regularly drank caffeine in any form. The data are accessible at https://uoepsy.github.io/data/usmr_tcaff.csv.

tcaff <- read_csv("https://uoepsy.github.io/data/usmr_tcaff.csv")
head(tcaff)

# A tibble: 6 × 2
     rt caff 
  <dbl> <chr>
1  482. yes  
2  389. yes  
3  484. no   
4  601. no   
5  409. yes  
6  368. no

The quick and easy way

First some quick descriptives, and a plot:

tcaff %>% 
  group_by(caff) %>%
  summarise(
    m = mean(rt),
    s = sd(rt)
  )

# A tibble: 2 × 3
  caff      m     s
  <chr> <dbl> <dbl>
1 no     408.  88.9
2 yes    465. 109.

ggplot(tcaff, aes(x = rt)) +
  geom_histogram() + 
  facet_wrap(~caff)

The data look fairly close to normally distributed for each group here. One thing to note is that the variances look like they may be different between the two groups. The caffeine drinkers’ reaction times have a standard deviation of 109ms, and the non-caffeine drinkers’ have an sd of only 89ms.

However, we’ll perform the Welch t-test here, which doesn’t require us to assume equal variances.

We can give R the two sets of data in two ways. Either by extracting the relevant entries:

t.test(x = tcaff$rt[tcaff$caff=="no"], 
       y = tcaff$rt[tcaff$caff=="yes"])

Or using the formula notation, with the ~ (“tilde”) symbol. In R, you can interpret y ~ x as “y is modeled as a function of x”. By splitting the numeric values (rt variable) by the categories of the caff variable, we can conduct a \(t\)-test using:

t.test(rt ~ caff, data = tcaff)


    Welch Two Sample t-test

data:  rt by caff
t = -2.8497, df = 93.971, p-value = 0.005377
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
 -96.20205 -17.19423
sample estimates:
 mean in group no mean in group yes 
         408.0505          464.7486

A Welch two sample t-test was used to assess whether the mean reaction time of people who regularly drink caffeine (\(n = 60\)) was different to that of people who do not (\(n=40\)). There was a significant difference in average reaction time between the caffeine (Mean=465; SD=109) and non-caffeine (Mean=408; SD=89) groups (\(t(94)=-2.85, p = .005\), two-tailed). Therefore, we reject the null hypothesis that there is no difference in reaction times between caffeine drinkers and non-caffeine drinkers.


ggplot(tcaff, aes(x = caff, y = rt)) +
  geom_boxplot()+
  labs(x="drinks caffeine",y="reaction time (ms)")

Manually
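The Welch test can also be sketched by hand. The standard error combines the two groups' variances, and the degrees of freedom come from the Welch-Satterthwaite approximation. To keep the example self-contained it uses only the six (rounded) values printed by head(tcaff) above, three per group, so the numbers are illustrative and will not match the output for the full dataset.

```r
# only the rows shown by head(tcaff), split by the caff column (illustrative)
rt_no  <- c(484, 601, 368)
rt_yes <- c(482, 389, 409)

n1 <- length(rt_no)
n2 <- length(rt_yes)
v1 <- var(rt_no)  / n1   # squared standard error of the "no" group mean
v2 <- var(rt_yes) / n2   # squared standard error of the "yes" group mean

diff_means <- mean(rt_no) - mean(rt_yes)
se    <- sqrt(v1 + v2)   # standard error of the difference in means
tstat <- diff_means / se

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

pval <- 2 * pt(abs(tstat), df = df, lower.tail = FALSE)
ci   <- diff_means + qt(c(0.025, 0.975), df = df) * se

c(t = tstat, df = df, p = pval)
ci
```

Running t.test(rt_no, rt_yes) on the same values should give the same \(t\), \(df\), \(p\)-value and interval.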

Paired sample t-test

Purpose

The paired sample t-test is used to test whether the mean difference between two sets of paired observations is different from 0.

Examples:
- Is the mean cognitive score of participants at age 60 different from when they are re-tested at age 70?
- Are scores on test 1 different on average from scores on test 2 (with participants completing both tests)?

Assumptions:

The data are continuous (not discrete)
The differences are independent (i.e. the value of the difference for one pair does not depend on the values of another pair in any way)
The differences are normally distributed OR the sample size is large enough (rule-of-thumb n = 30) and the data are not strongly skewed

Research Question: Is the mean cognitive score of participants at age 60 different from when they are re-tested at age 70?

Addenbrooke’s Cognitive Examination-III (ACE-III) is a brief cognitive test that assesses five cognitive domains: attention, memory, verbal fluency, language and visuospatial abilities. The total score is 100 with higher scores indicating better cognitive functioning. A research project is examining changes in cognitive functioning with age, and administers the ACE-III to a set of participants at age 60, then again at age 70. The data are accessible at https://uoepsy.github.io/data/acedata.csv.

acedata <- read_csv("https://uoepsy.github.io/data/acedata.csv")
head(acedata)

# A tibble: 6 × 3
  participant ace_60 ace_70
  <chr>        <dbl>  <dbl>
1 sub1            93     85
2 sub2            95     92
3 sub3            93     90
4 sub4            93     95
5 sub5            96     88
6 sub6            91     85

The paired t test is the one sample t test in disguise
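We can check this equivalence directly: a paired test on two columns should give the same answer as a one sample test on their differences. The sketch below uses only the six rows printed by head(acedata) above, so the numbers are illustrative rather than the result for the full dataset.

```r
# only the six rows shown by head(acedata); illustrative, not the full data
ace_60 <- c(93, 95, 93, 93, 96, 91)
ace_70 <- c(85, 92, 90, 95, 88, 85)

# paired test on the two sets of scores ...
paired <- t.test(ace_70, ace_60, paired = TRUE)

# ... versus a one sample test of whether the mean difference is 0
one_smp <- t.test(ace_70 - ace_60, mu = 0)

all.equal(unname(paired$statistic), unname(one_smp$statistic))
all.equal(paired$p.value, one_smp$p.value)
all.equal(as.numeric(paired$conf.int), as.numeric(one_smp$conf.int))
```

This is why the assumptions listed above are about the differences rather than about the two sets of scores separately.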

Footnotes

¹ Remember that confidence intervals provide a range of plausible values for the population mean. In this case, zero is a plausible value.