Univariate Statistics and Methodology using R
Psychology, PPLS
University of Edinburgh
the \(t\)-test and \(z\)-test deal with interval (at least) data
they can compare continuous data to a distribution with
sometimes the things we are interested in aren’t numeric
fashionable or not?
heads or tails?
USMR student or not?
two possible outcomes with fixed probabilities \(p\) and \(1-p\)
with enough trials, proportion of outcomes will be \(p\) and \(1-p\)
if you toss a coin 4 times, what’s the probability of it landing on heads at least 3 times?
\(2^4 = 16\) possible sequences of outcomes
Of those 16, 5 outcomes include \(\ge\) 3 heads
\(p = 5/16 = .3125\)
Toss1 | Toss2 | Toss3 | Toss4 | HEADS |
---|---|---|---|---|
H | H | H | H | 4 |
T | H | H | H | 3 |
H | T | H | H | 3 |
T | T | H | H | 2 |
H | H | T | H | 3 |
T | H | T | H | 2 |
H | T | T | H | 2 |
T | T | T | H | 1 |
H | H | H | T | 3 |
T | H | H | T | 2 |
H | T | H | T | 2 |
T | T | H | T | 1 |
H | H | T | T | 2 |
T | H | T | T | 1 |
H | T | T | T | 1 |
T | T | T | T | 0 |
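The count in the table can be checked in R. This is a minimal sketch assuming a fair coin; the object names (`tosses`, `heads`, `p_enum`, `p_binom`) are illustrative and not from the original slides. It enumerates all 16 sequences, then compares the result with the binomial distribution and with a one-sided `binom.test()`.

```r
# enumerate all 2^4 sequences of four tosses (object names are illustrative)
tosses <- expand.grid(rep(list(c("H", "T")), 4), stringsAsFactors = FALSE)
heads  <- rowSums(tosses == "H")   # number of heads in each sequence

p_enum  <- mean(heads >= 3)        # proportion of sequences with >= 3 heads
p_binom <- sum(dbinom(3:4, size = 4, prob = 0.5))  # same probability from the binomial

# one-sided exact test: 3 heads in 4 tosses, H0: p = 0.5
res <- binom.test(3, 4, p = 0.5, alternative = "greater")
c(enumerated = p_enum, binomial = p_binom, test = res$p.value)
```

All three routes should agree, because the exact test simply adds up the binomial probabilities of the outcomes at least as extreme as the one observed.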
Exact binomial test
data: 3 and 4
number of successes = 3, number of trials = 4, p-value = 0.3125
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
0.2486046 1.0000000
sample estimates:
probability of success
0.75
the estimated probability of success is just 3/4
approximately 9% of the world’s population have blue eyes; is the USMR class of 2022–23 a representative sample?
Exact binomial test
data: eyes["blue"] and sum(eyes)
number of successes = 18, number of trials = 80, p-value = 0.0002
alternative hypothesis: true probability of success is not equal to 0.09
95 percent confidence interval:
0.1391 0.3321
sample estimates:
probability of success
0.225
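A sketch of a call that would give output like this, assuming only the counts shown in it (18 blue-eyed students out of 80). The original passes `eyes["blue"]` and `sum(eyes)`; the `eyes` table itself is not reproduced here, so `blue` and `n` are stand-in names.

```r
# counts taken from the printed output: 18 blue-eyed students out of 80
blue <- 18
n    <- 80

# two-sided exact test against the world-wide proportion of 9%
res <- binom.test(blue, n, p = 0.09)
res$estimate   # sample proportion, 18/80
res$p.value    # two-sided p-value under H0: p = 0.09
res$conf.int   # 95% confidence interval for the class proportion
```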
so what happens when we are interested in more than two outcomes?
we have already talked about dice “numbers” being categories
we know that, in a fair die, the probability of getting each number is \(\frac{1}{6}\) (H0)
can we assess the probability of getting a known set of throws if H0 is true?
if the probability is low enough (\(p<.05\)) we can assert that the die is biased
let’s assume we throw the die 600 times
if everything worked out perfectly for an unbiased die, our expected values would be \(600 \times \frac{1}{6} = 100\) throws for each number
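The expected counts are just the number of throws multiplied by the probability of each face under H0. A minimal sketch; `expected` is the name used later in the slides, while `n_throws` and `p_fair` are illustrative names.

```r
n_throws <- 600                 # number of throws assumed in the example
p_fair   <- rep(1/6, 6)         # H0: each face is equally likely
expected <- n_throws * p_fair   # expected count for each face
expected
```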
\[\chi^2 = \sum{\frac{(O_i-E_i)^2}{E_i}}\] where \(O_i\) is the \(i\textrm{th}\) observed and \(E_i\) is the \(i\textrm{th}\) expected value
we don’t have to do this calculation by hand
we can do it piece-by-piece, starting with 600 throws I’ve ‘recorded’
6 3 3 3 1 5 4 6 4 6 2 4 6 6 2 6 1 5 6 1 4 6 4 2 1 1 1 4 1 3 2 3 3 6 1 1 6 6 6 2 3 1 2 6 1 5 3 6 3 5 1 5 1 4 4 3 3 1 3 6 2 4 5 3 4 4 2 5 5 6 3 2 2 6 6 3 1 2 1 6 ...
chiTab <- data.frame(
  expected = expected, # from earlier calculation
  observed = table(throws) |> as.integer()
)
chiTab
expected observed
1 100 92
2 100 124
3 100 102
4 100 96
5 100 77
6 100 109
\[\chi^2 = \sum{\frac{(O_i-E_i)^2}{E_i}}\]
library(dplyr) # provides mutate()
chiTab <- chiTab |>
  mutate(sq_diff = (observed - expected)^2,
         std_sq_diff = sq_diff / expected)
chiTab
expected observed sq_diff std_sq_diff
1 100 92 64 0.64
2 100 124 576 5.76
3 100 102 4 0.04
4 100 96 16 0.16
5 100 77 529 5.29
6 100 109 81 0.81
\[\chi^2 = \sum{\frac{(O_i-E_i)^2}{E_i}}\]
so for the particular random throws we did, \(\chi^2=12.7\)
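The same value can be reproduced directly from the observed counts in the table above. This is a sketch that types the six counts in by hand, since the raw `throws` vector is only shown in part; the name `chisq` is illustrative.

```r
observed <- c(92, 124, 102, 96, 77, 109)   # counts for faces 1 to 6, from the table
expected <- rep(sum(observed) / 6, 6)      # 100 per face under H0

# sum of the standardised squared differences
chisq <- sum((observed - expected)^2 / expected)
chisq
```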
what we want to know is how probable that value is in a world where chance governs dice throws
we already know two important things
we’re going to have to work out the distribution of \(\chi^2\) and work out the probability of getting that value or more
the reason we’re calling the value we’ve calculated \(\chi^2\) is because we’re going to compare it to the \(\chi^2\) distribution
for our random 600 dice throws a couple of slides back
we can use pchisq()
[1] 0.02636
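A sketch of the call that would give this probability, assuming the usual goodness-of-fit degrees of freedom (number of categories minus one, so 5 for a die). The names `chisq`, `df` and `p` are illustrative; `chisq.test()` is base R's built-in version of the same calculation.

```r
chisq <- 12.7    # statistic calculated from the 600 throws
df    <- 6 - 1   # assumption: categories minus one

# probability of a value this large or larger under H0
p <- pchisq(chisq, df = df, lower.tail = FALSE)
p

# the built-in test on the observed counts (equal probabilities by default)
observed <- c(92, 124, 102, 96, 77, 109)
res <- chisq.test(observed)
res
```

`lower.tail = FALSE` matters here: we want the area to the right of the observed value, not the area to the left.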
what we’ve just seen is a goodness of fit calculation
test of homogeneity
test of independence
course | Elsewhere | UK |
---|---|---|
dapr1 | 9 | 49 |
dapr2 | 5 | 21 |
rms2 | 9 | 29 |
usmr | 17 | 27 |
under H0, students on each module would be equally likely to come from the UK
in other words, of the 58 students on dapr1, \(\frac{126}{166}\times{}58\), or approx 44.02 students, should come from the UK under H0
in_uk
course Elsewhere UK Sum
dapr1 9 49 58
dapr2 5 21 26
rms2 9 29 38
usmr 17 27 44
Sum 40 126 166
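A sketch of how the table with margins, and the expected value for one cell, could be built in R. The counts are typed in from the table above; the object names `tab`, `tab_m` and `exp_dapr1_uk` are illustrative, not from the original slides.

```r
# counts from the table above, one row per course
tab <- as.table(matrix(
  c( 9, 49,
     5, 21,
     9, 29,
    17, 27),
  ncol = 2, byrow = TRUE,
  dimnames = list(course = c("dapr1", "dapr2", "rms2", "usmr"),
                  in_uk  = c("Elsewhere", "UK"))
))

tab_m <- addmargins(tab)   # adds the "Sum" row and column
tab_m

# expected UK students on dapr1 under H0: overall UK proportion x dapr1 total
exp_dapr1_uk <- tab_m["Sum", "UK"] / tab_m["Sum", "Sum"] * tab_m["dapr1", "Sum"]
exp_dapr1_uk
```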
we can repeat this calculation for each cell of the table, to give “expected values”
we can use %o% in R (this takes two vectors and multiplies them out into a matrix)
\[ (a,b) \otimes (y, z) = \begin{bmatrix} a \times y & a \times z \\ b \times y & b \times z \\ \end{bmatrix} \]
under H0:
knowing which class people are in gives no additional information about where they come from
knowing where they’re from gives no additional information about which class they’re in
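Under this H0, every expected count is (row total × column total) / grand total, which is what the outer product gives in one step. A minimal sketch with the counts typed in by hand; `tab` and `expected` are illustrative names.

```r
# observed counts, one row per course
tab <- as.table(matrix(
  c(9, 49, 5, 21, 9, 29, 17, 27),
  ncol = 2, byrow = TRUE,
  dimnames = list(course = c("dapr1", "dapr2", "rms2", "usmr"),
                  in_uk  = c("Elsewhere", "UK"))
))

# outer product of row totals and column totals, divided by the grand total
expected <- rowSums(tab) %o% colSums(tab) / sum(tab)
round(expected, 3)
```

The expected counts keep the same row and column totals as the observed table; only the spread within each row changes.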
observed
in_uk
course Elsewhere UK
dapr1 9 49
dapr2 5 21
rms2 9 29
usmr 17 27
expected
Elsewhere UK
dapr1 13.976 44.02
dapr2 6.265 19.73
rms2 9.157 28.84
usmr 10.602 33.40
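With observed and expected counts in hand, the statistic is calculated exactly as for the die. This sketch assumes the usual degrees of freedom for a two-way table, (rows − 1) × (columns − 1); the object names are illustrative, and the result is not taken from the original slides.

```r
tab <- as.table(matrix(
  c(9, 49, 5, 21, 9, 29, 17, 27),
  ncol = 2, byrow = TRUE,
  dimnames = list(course = c("dapr1", "dapr2", "rms2", "usmr"),
                  in_uk  = c("Elsewhere", "UK"))
))
expected <- rowSums(tab) %o% colSums(tab) / sum(tab)

# same formula as before, summed over every cell of the table
chisq <- sum((tab - expected)^2 / expected)
df    <- (nrow(tab) - 1) * (ncol(tab) - 1)   # assumption: (4 - 1) * (2 - 1) = 3
p     <- pchisq(chisq, df = df, lower.tail = FALSE)
c(chisq = chisq, df = df, p = p)

# the built-in test should give the same statistic
res <- chisq.test(tab)
res
```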
plot()
Pearson's Chi-squared test
data: table(ec)
X-squared = 34, df = 18, p-value = 0.01