4A: Chi-Square Tests

This reading:

  • What are the basic hypothesis tests that we can conduct when we are interested in variables that have categories instead of numbers?

Here we continue with our brief explainers of different basic statistical tests. The past few weeks have focused on tests for numeric outcome variables, where we have been concerned with the mean of that variable (e.g. whether that mean is different from some specific value, or whether it is different between two groups). We now turn to investigate tests for categorical outcome variables.

The test-statistics for these tests (denoted \(\chi^2\), spelled chi-square, pronounced “kai-square”) are obtained by adding up the standardized squared deviations in each cell of a table of frequencies:

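\[ \chi^2 = \sum_{\text{all cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \] where Observed is the count actually recorded in a cell, and Expected is the count we would expect in that cell if the null hypothesis were true.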

\(\chi^2\) Goodness of Fit Test

Purpose

The \(\chi^2\) Goodness of Fit Test is typically used to investigate whether observed sample proportions are consistent with a hypothesis about the proportional breakdown of the various categories in the population.

  • Examples:
    • Do 20% of the adult population suffer from some form of depression?
    • Are people equally likely to be born on any of the seven days of the week?
    • Are 25% of Smarties brown?
    • Are 2/3 of people ‘dog people’ and 1/3 of people ‘cat people’?

Assumptions

  1. Data should be randomly sampled from the population.
  2. Data should be categorical (e.g. nominal); the goodness-of-fit test is not appropriate for continuous data.
  3. Expected counts should be at least 5.
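
The third assumption can be checked before running any test, because expected counts depend only on the sample size and the hypothesised proportions. Here is a minimal sketch using the 'dog people' vs 'cat people' example above. The two sample sizes (30 and 12) and all object names are assumptions made purely for illustration.

```r
# Hypothesised proportions from the 'dog people' vs 'cat people' example
p_null <- c(dog = 2/3, cat = 1/3)

# Hypothetical sample sizes (illustrative only, not from a real dataset)
n_large <- 30
n_small <- 12

# Expected count in each category = sample size * hypothesised proportion
expected_large <- n_large * p_null
expected_small <- n_small * p_null

# Check the "at least 5" rule for every category
ok_large <- all(expected_large >= 5)
ok_small <- all(expected_small >= 5)
```

With 30 people the expected counts are 20 and 10, so the rule is met. With 12 people they are 8 and 4, so the 'cat' category falls below 5 and the test would not be appropriate.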

Example

Research Question: Have proportions of adults suffering no/mild/moderate/severe depression changed from 2019?

In 2019, it was reported that 80% of adults (18+) experienced no symptoms of depression, 12% experienced mild symptoms, 4% experienced moderate symptoms, and 4% experienced severe symptoms.
The dataset accessible at https://uoepsy.github.io/data/usmr_chisqdep.csv contains data from 1000 people to whom the PHQ-9 depression scale was administered in 2022.

library(tidyverse) # provides read_csv()
depdata <- read_csv("https://uoepsy.github.io/data/usmr_chisqdep.csv")
head(depdata)
# A tibble: 6 × 3
  id    dep    fam_hist
  <chr> <chr>  <chr>   
1 ID1   severe n       
2 ID2   mild   n       
3 ID3   no     n       
4 ID4   no     n       
5 ID5   no     n       
6 ID6   no     n       

We can see our table of observed counts with the table() function:

table(depdata$dep)

    mild moderate       no   severe 
     143       34      771       52 

The quick and easy way
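
One way to run this test in a single step is base R's chisq.test() function. The sketch below assumes that approach and, to stay self-contained, types in the observed counts from the table above rather than reading the file. The object names are illustrative. Note that the proportions passed to p must be in the same order as the categories in the table (alphabetical), not the order in which the 2019 figures were reported.

```r
# Observed 2022 counts, copied from the table() output above
observed <- c(mild = 143, moderate = 34, no = 771, severe = 52)

# 2019 proportions, reordered to match the alphabetical order of the table
p_2019 <- c(mild = 0.12, moderate = 0.04, no = 0.80, severe = 0.04)

# Goodness of fit test: are the 2022 counts consistent with the 2019 proportions?
gof <- chisq.test(x = observed, p = p_2019)

gof$statistic  # the chi-square statistic
gof$parameter  # degrees of freedom (number of categories - 1)
gof$p.value    # p-value
gof$expected   # expected counts under the null hypothesis
```

With the data loaded as above, the equivalent call would be chisq.test(table(depdata$dep), p = c(0.12, 0.04, 0.80, 0.04)). For these counts the statistic is roughly 9.96 on 3 degrees of freedom, with a p-value below 0.05.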

Manually
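
The same statistic can be built up by hand from the formula at the top of this reading. This is a sketch under the same assumptions as before: the counts are typed in from the table above and the object names are illustrative.

```r
# Observed 2022 counts and the 2019 proportions, in matching (alphabetical) order
observed <- c(mild = 143, moderate = 34, no = 771, severe = 52)
p_2019   <- c(mild = 0.12, moderate = 0.04, no = 0.80, severe = 0.04)

# Expected counts if the 2019 proportions still held in a sample of this size
expected <- sum(observed) * p_2019

# Standardised squared deviation for each category
contributions <- (observed - expected)^2 / expected

# The test statistic is the sum over all categories
chisq_stat <- sum(contributions)

# Degrees of freedom: number of categories minus 1
df <- length(observed) - 1

# p-value: probability of a statistic at least this large under the null
p_value <- pchisq(chisq_stat, df = df, lower.tail = FALSE)
```

Printing contributions shows which categories drive the result. Here the 'mild' and 'severe' categories deviate most from their expected counts of 120 and 40.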

\(\chi^2\) Test of Independence

Purpose

The \(\chi^2\) Test of Independence is used to determine whether or not there is a significant association between two categorical variables. To examine the independence of two categorical variables, we use a contingency table, which cross-tabulates the counts for each combination of categories:

                   Family History of Depression
Depression Severity   n   y
           mild      93  50
           moderate  23  11
           no       532 239
           severe    37  15

  • Examples:
    • Is depression severity associated with having a family history of depression?
    • Are people with blue eyes more likely to be over 6 foot tall?
    • Are people who carry the APOE-4 gene more likely to have mild cognitive impairment?

Assumptions

  1. Two or more categories (groups) for each variable.
  2. Independence of observations
    • each observation contributes to exactly one cell, and there is no relationship between the subjects in each group
  3. Large enough sample size, such that:
    • expected frequencies for each cell are at least 1
    • expected frequencies should be at least 5 for the majority (80%) of cells
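
The sample-size conditions can be checked directly from the expected frequencies. Here is a minimal sketch, assuming the contingency table shown in the Purpose section above is typed in by hand (object names are illustrative). It uses the expected counts that base R's chisq.test() stores in its result.

```r
# Contingency table copied from the Purpose section above
tab <- matrix(
  c( 93,  50,
     23,  11,
    532, 239,
     37,  15),
  ncol = 2, byrow = TRUE,
  dimnames = list(dep = c("mild", "moderate", "no", "severe"),
                  fam_hist = c("n", "y"))
)

# chisq.test() stores the expected frequencies it computes
expected <- chisq.test(tab)$expected

# Condition 1: every expected frequency is at least 1
all_at_least_1 <- all(expected >= 1)

# Condition 2: at least 80% of cells have an expected frequency of 5 or more
enough_at_least_5 <- mean(expected >= 5) >= 0.8
```

For this table both conditions hold: the smallest expected frequency (moderate depression with a family history) is about 10.7.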

Example

Research Question: Is severity of depression associated with having a family history of depression?

The dataset accessible at https://uoepsy.github.io/data/usmr_chisqdep.csv contains data from 1000 people to whom the PHQ-9 depression scale was administered in 2022, and who were also asked to complete a brief questionnaire to establish whether they had a family history of depression.

depdata <- read_csv("https://uoepsy.github.io/data/usmr_chisqdep.csv")
head(depdata)
# A tibble: 6 × 3
  id    dep    fam_hist
  <chr> <chr>  <chr>   
1 ID1   severe n       
2 ID2   mild   n       
3 ID3   no     n       
4 ID4   no     n       
5 ID5   no     n       
6 ID6   no     n       

We can create our contingency table:

table(depdata$dep, depdata$fam_hist)
          
             n   y
  mild      93  50
  moderate  23  11
  no       532 239
  severe    37  15

We can even create a quick-and-dirty visualisation of this table (plotting a two-way table in base R gives a mosaic plot):

plot(table(depdata$dep, depdata$fam_hist))

The quick and easy way
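
One way to run this test in a single step is to pass the contingency table to base R's chisq.test(). The sketch below assumes that approach and types in the table from above so that it runs on its own. The object names are illustrative.

```r
# Contingency table of depression severity by family history (counts from above)
tab <- matrix(
  c( 93,  50,
     23,  11,
    532, 239,
     37,  15),
  ncol = 2, byrow = TRUE,
  dimnames = list(dep = c("mild", "moderate", "no", "severe"),
                  fam_hist = c("n", "y"))
)

# Test of independence: is severity associated with family history?
indep <- chisq.test(tab)

indep$statistic  # the chi-square statistic
indep$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
indep$p.value    # p-value
```

With the data loaded as above, the equivalent call would be chisq.test(table(depdata$dep, depdata$fam_hist)). For these counts the statistic is roughly 1.07 on 3 degrees of freedom, with a p-value well above 0.05, so these data give no evidence of an association.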

Manually
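
The statistic can also be built up by hand. Under independence, the expected count in each cell is its row total multiplied by its column total, divided by the overall total. This is a sketch with the table typed in from above and illustrative object names.

```r
# Contingency table of depression severity by family history (counts from above)
tab <- matrix(
  c( 93,  50,
     23,  11,
    532, 239,
     37,  15),
  ncol = 2, byrow = TRUE,
  dimnames = list(dep = c("mild", "moderate", "no", "severe"),
                  fam_hist = c("n", "y"))
)

# Marginal totals and overall sample size
row_totals <- rowSums(tab)
col_totals <- colSums(tab)
n <- sum(tab)

# Expected counts under independence: (row total * column total) / n
expected <- outer(row_totals, col_totals) / n

# Sum the standardised squared deviations over all cells
chisq_stat <- sum((tab - expected)^2 / expected)

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)

# p-value: probability of a statistic at least this large under the null
p_value <- pchisq(chisq_stat, df = df, lower.tail = FALSE)
```

The observed and expected counts are close in every cell (for example, 93 observed against about 98 expected for mild depression with no family history), which is why the statistic is small.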