Nominal Data

Univariate Statistics and Methodology using R

Martin Corley

Psychology, PPLS

University of Edinburgh

Binomial Test

Last Week

the \(t\)-test and \(z\)-test deal with interval (at least) data
they can compare continuous data to a distribution with
- known \(\mu\) and \(\sigma\) (\(z\)-test)
- known \(\mu\) and unknown \(\sigma\) (one-sample \(t\)-test)

they can compare continuous data from two samples
- independent groups (independent-samples \(t\)-test)
- related groups (paired-samples \(t\)-test)

Not Everything is Numbers

sometimes the things we are interested in aren’t numeric
fashionable or not?
heads or tails?
USMR student or not?

Binomial Distribution

two possible outcomes with fixed probabilities \(p\) and \(1-p\)
with enough trials, proportion of outcomes will be \(p\) and \(1-p\)

the number of trials (\(n\)) is fixed
each observation is independent
each observation has one of two outcomes
the probability of each outcome is consistent

Binomial?

a fair coin is tossed 20 times
- \(x\) = # of heads
draw 3 cards at random without replacement
- \(x\) = # of diamonds
draw 3 cards at random with replacement
- \(x\) = # of diamonds
the probability of having blood type B is .1; choose 4 people at random.
- \(x\) = # with type B

Binomial Distribution

toss a coin 100 times
how many “successes” (=heads, say) are you likely to get?

Binomial Test

if you toss a coin 4 times, what’s the probability of it landing on heads at least 3 times?
\(2^4 = 16\) possible sequences of outcomes
Of those 16, 5 outcomes include \(\ge\) 3 heads
\(p = 5/16 = .3125\)

Toss1	Toss2	Toss3	Toss4	HEADS
H	H	H	H	4
T	H	H	H	3
H	T	H	H	3
T	T	H	H	2
H	H	T	H	3
T	H	T	H	2
H	T	T	H	2
T	T	T	H	1
H	H	H	T	3
T	H	H	T	2
H	T	H	T	2
T	T	H	T	1
H	H	T	T	2
T	H	T	T	1
H	T	T	T	1
T	T	T	T	0

Binomial Test

print(
  binom.test(3, 4, p = 0.5, alternative = "greater"),
  digits = 7
)


    Exact binomial test

data:  3 and 4
number of successes = 3, number of trials = 4, p-value = 0.3125
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.2486046 1.0000000
sample estimates:
probability of success 
                  0.75

don’t be fooled by the probability of success which is just 3/4

Binomial Test (2)

approximately 9% of the world’s population have blue eyes; is the USMR class of 2022–23 a representative sample?

Eye Colours for USMR

(eyes <- table(usmrEyes))

usmrEyes
 blue brown green  grey hazel other 
   18    43     7     1     8     3

Binomial Test

approximately 9% of the world’s population have blue eyes

binom.test(
  eyes["blue"], sum(eyes),
  0.09, alternative = "two.sided"
)


    Exact binomial test

data:  eyes["blue"] and sum(eyes)
number of successes = 18, number of trials = 80, p-value = 0.0002
alternative hypothesis: true probability of success is not equal to 0.09
95 percent confidence interval:
 0.1391 0.3321
sample estimates:
probability of success 
                 0.225

THe \(\chi^2\) Distribution

Goodness-of-Fit Test

so what happens when we are interested in more than two outcomes?
we have already talked about dice “numbers” being categories
we know that, in a fair die, the probability of getting each number is \(\frac{1}{6}\) (H₀)
can we assess the probability of getting a known set of throws if H₀ is true?
if the probability is low enough (\(p<.05\)) we can assert that the die is biased

Calculating \(\chi^2\)

let’s assume we throw the die 600 times
if everything worked out perfectly for an unbiased die, our expected values would be:

(expected <- 600 * c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6))

[1] 100 100 100 100 100 100

and we can calculate a \(\chi^2\) statistic using the formula

\[\chi^2 = \sum{\frac{(O_i-E_i)^2}{E_i}}\] where \(O_i\) is the \(i\textrm{th}\) observed and \(E_i\) is the \(i\textrm{th}\) expected value

Calculating \(\chi^2\)

we don’t have to do this calculation by hand
we can do it piece-by-piece, starting with 600 throws I’ve ‘recorded’

throws

6 3 3 3 1 5 4 6 4 6 2 4 6 6 2 6 1 5 6 1 4 6 4 2 1 1 1 4 1 3 2 3 3 6 1 1 6 6 6 2 3 1 2 6 1 5 3 6 3 5 1 5 1 4 4 3 3 1 3 6 2 4 5 3 4 4 2 5 5 6 3 2 2 6 6 3 1 2 1 6 ...

table(throws)

throws
  1   2   3   4   5   6 
 92 124 102  96  77 109

Calculating \(\chi^2\)

chiTab <- data.frame(
  expected=expected, # from earlier calculation
  observed=table(throws) |> as.integer()
)
chiTab

  expected observed
1      100       92
2      100      124
3      100      102
4      100       96
5      100       77
6      100      109

\[\chi^2 = \sum{\frac{(O_i-E_i)^2}{E_i}}\]

Calculating \(\chi^2\)

chiTab <- chiTab |>
    mutate(sq_diff = (observed - expected)^2)

\[\chi^2 = \sum{\frac{\color{red}{(O_i-E_i)^2}}{E_i}}\]

Calculating \(\chi^2\)

chiTab <- chiTab |>
    mutate(sq_diff = (observed - expected)^2,
        std_sq_diff = (sq_diff/expected))

\[\chi^2 = \sum{\color{red}{\frac{(O_i-E_i)^2}{E_i}}}\]

Calculating \(\chi^2\)

chiTab <- chiTab |>
    mutate(sq_diff = (observed - expected)^2,
        std_sq_diff = (sq_diff/expected))
chiTab

\[\chi^2 = \sum{\color{red}{\frac{(O_i-E_i)^2}{E_i}}}\]

Calculating \(\chi^2\)

chiTab <- chiTab |>
    mutate(sq_diff = (observed - expected)^2,
        std_sq_diff = (sq_diff/expected))
chiTab

  expected observed sq_diff std_sq_diff
1      100       92      64        0.64
2      100      124     576        5.76
3      100      102       4        0.04
4      100       96      16        0.16
5      100       77     529        5.29
6      100      109      81        0.81

sum(chiTab$std_sq_diff)

[1] 12.7

df = \(\textrm{number of groups}-1 = 5\)

\[\chi^2 = \sum{\color{red}{\frac{(O_i-E_i)^2}{E_i}}}\]

\[\chi^2 = \color{red}{\sum{\frac{(O_i-E_i)^2}{E_i}}}\]

Evaluating \(\chi^2\)

so for the particular random throws we did, \(\chi^2=12.7\)
what we want to know is how probable that value is in a world where chance governs dice throws

we already know two important things
1. we’re going to have to work out the distribution of \(\chi^2\) and work out the probability of getting that value or more
2. the reason we’re calling the value we’ve calculated \(\chi^2\) is because we’re going to compare it to the \(\chi^2\) distribution

Why do Things the Easy Way?

calculate and plot 10,000 random 600-fair-dice-throw \(\chi^2\)s

diceChi <- function(n) {
    dice <- sample(1:6, n,
        replace = TRUE)
    chisq.test(table(dice))$statistic
}

chiDist <- replicate(10000,
    diceChi(600))

plot(density(chiDist), main = "chisq(5)",
    lwd = 2)

The \(\chi^2\) Distribution

\(\chi^2\) Probability

for our random 600 dice throws a couple of slides back
- \(\chi^2=12.7\), \(\textrm{df} = 5\)
we can use pchisq()

pchisq(12.7, 5, lower.tail=FALSE)

[1] 0.02636

looks like we can conclude that our die is unlikely to be fair

Two-Dimensional \(\chi^2\)

Types of \(\chi^2\) Test

what we’ve just seen is a goodness of fit calculation
- do the data come from a specific distribution?

test of homogeneity
- do groups have same distribution of a variable of interest?
test of independence
- are categorical variables associated with each other?

course	Elsewhere	UK
dapr1	9	49
dapr2	5	21
rms2	9	29
usmr	17	27

Where are You From?

data from all of the stats modules in Psychology

statsClasses |>
    select(course, in_uk) |>
    table()

       in_uk
course  Elsewhere UK
  dapr1         9 49
  dapr2         5 21
  rms2          9 29
  usmr         17 27

statsClasses |>
    select(course, in_uk) |>
    table() |>
    addmargins()

       in_uk
course  Elsewhere  UK Sum
  dapr1         9  49  58
  dapr2         5  21  26
  rms2          9  29  38
  usmr         17  27  44
  Sum          40 126 166

contingency tables

Under the Null Hypothesis

students on each module would be equally likely to come from the UK
in other words, of the 58 students on dapr1, \(\frac{126}{166}\times{}58\), or approx 44.02 students, should come from the UK under H₀

       in_uk
course  Elsewhere  UK Sum
  dapr1         9  49  58
  dapr2         5  21  26
  rms2          9  29  38
  usmr         17  27  44
  Sum          40 126 166

we can repeat this calculation for each cell of the table, to give “expected values”
- like the probability-based values for dice

Expected Values

repeating the calculation is easy using the “outer product (\(\otimes\))” operator %o% in R (this takes two vectors and multiplies them out into a matrix)

\[ (a,b) \otimes (y, z) = \begin{bmatrix} a \times y & b \times y \\ a \times z & b \times z \\ \end{bmatrix} \]

Expected Values

repeating the calculation is easy using the “outer product (\(\otimes\))” operator %o% in R (this takes two vectors and multiplies them out into a matrix)

t <- statsClasses |>
  select(course, in_uk) |>
  table()
t

       in_uk
course  Elsewhere UK
  dapr1         9 49
  dapr2         5 21
  rms2          9 29
  usmr         17 27

# expected values
e <- rowSums(t) %o%
  colSums(t)/sum(t)
e

      Elsewhere    UK
dapr1    13.976 44.02
dapr2     6.265 19.73
rms2      9.157 28.84
usmr     10.602 33.40

Restating the Null Hypothesis

under H₀:
- knowing which class people are in gives no additional information about where they come from
- knowing where they’re from gives no additional information about which class they’re in

observed

       in_uk
course  Elsewhere UK
  dapr1         9 49
  dapr2         5 21
  rms2          9 29
  usmr         17 27

expected

      Elsewhere    UK
dapr1    13.976 44.02
dapr2     6.265 19.73
rms2      9.157 28.84
usmr     10.602 33.40

Using `plot()`

plot(t)

plot(as.table(e))

Are The Courses Different?

chisq.test(t)


    Pearson's Chi-squared test

data:  t
X-squared = 7.8, df = 3, p-value = 0.05

might need to look more closely

chisq.test(t)$p.value

[1] 0.05124

Bonus: Eye Colours

are USMR eyes different from estimated world proportions?

chisq.test(
  table(uc),
  p = c(0.09, 0.75, 0.02, 0.03, 0.05, 0.06)
)


    Chi-squared test for given probabilities

data:  table(uc)
X-squared = 160, df = 5, p-value <2e-16

Bonus: Eye Colours

are eye colours different between stats classes?

chisq.test(table(ec))


    Pearson's Chi-squared test

data:  table(ec)
X-squared = 34, df = 18, p-value = 0.01

Toss1	Toss2	Toss3	Toss4	HEADS
H	H	H	H	4
T	H	H	H	3
H	T	H	H	3
T	T	H	H	2
H	H	T	H	3
T	H	T	H	2
H	T	T	H	2
T	T	T	H	1
H	H	H	T	3
T	H	H	T	2
H	T	H	T	2
T	T	H	T	1
H	H	T	T	2
T	H	T	T	1
H	T	T	T	1
T	T	T	T	0

Toss1	Toss2	Toss3	Toss4	HEADS
H	H	H	H	4
T	H	H	H	3
H	T	H	H	3
T	T	H	H	2
H	H	T	H	3
T	H	T	H	2
H	T	T	H	2
T	T	T	H	1
H	H	H	T	3
T	H	H	T	2
H	T	H	T	2
T	T	H	T	1
H	H	T	T	2
T	H	T	T	1
H	T	T	T	1
T	T	T	T	0

Nominal Data

Binomial Test

Last Week

Not Everything is Numbers

Binomial Distribution

Binomial?

Binomial Distribution

Binomial Test

Binomial Test

Binomial Test (2)

Eye Colours for USMR

Binomial Test

THe \(\chi^2\) Distribution

Goodness-of-Fit Test

Calculating \(\chi^2\)

Calculating \(\chi^2\)

Calculating \(\chi^2\)

Calculating \(\chi^2\)

Calculating \(\chi^2\)

Calculating \(\chi^2\)

Calculating \(\chi^2\)

Evaluating \(\chi^2\)

Why do Things the Easy Way?

The \(\chi^2\) Distribution

\(\chi^2\) Probability

Two-Dimensional \(\chi^2\)

Types of \(\chi^2\) Test

Where are You From?

Under the Null Hypothesis

Expected Values

Expected Values

Restating the Null Hypothesis

Using plot()

Are The Courses Different?

Bonus: Eye Colours

Bonus: Eye Colours

Using `plot()`

Toss1	Toss2	Toss3	Toss4	HEADS
H	H	H	H	4
T	H	H	H	3
H	T	H	H	3
T	T	H	H	2
H	H	T	H	3
T	H	T	H	2
H	T	T	H	2
T	T	T	H	1
H	H	H	T	3
T	H	H	T	2
H	T	H	T	2
T	T	H	T	1
H	H	T	T	2
T	H	T	T	1
H	T	T	T	1
T	T	T	T	0