Univariate Statistics and Methodology using R
Psychology, PPLS
University of Edinburgh
| id | quality | SPLATTED |
|---|---|---|
| The Great Odorjan of Erpod | 84 | 0 |
| Hapetox Bron | 34 | 1 |
| Loorn Molzeks | 92 | 0 |
| Ba'lite Adrflen | 49 | 1 |
| Tedlambo Garilltet | 93 | 0 |
| Goraveola Grellorm | 5 | 1 |
| Colonel Garqun | 55 | 1 |
| Bosgogo Lurcat | 64 | 1 |
| Osajed Voplily | 45 | 0 |
| Subcommander Edorop | 90 | 0 |
quality = quality of singing
SPLATTED = whether the singer got splatted (1 = splatted, 0 = not)
[plot: the raw data, drawn with geom_jitter() with alpha = .5]
each alien either gets splatted or doesn’t
underlyingly, there’s a binomial distribution
for each value of “quality of singing” there’s a probability of getting splatted
for each individual alien, the observed outcome is deterministic (they either were splatted or they weren’t)
but it’s the probability we are ultimately interested in
we can approximate it by binning our data…
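a sketch of one way to get those binned proportions, assuming the data is in a tibble called singers:

```r
library(tidyverse)

## cut quality into 10 equal-width bins and calculate the
## proportion splatted within each bin
singers |>
  mutate(bin = cut(quality, 10)) |>
  group_by(bin) |>
  summarise(prop = mean(SPLATTED))
```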
# A tibble: 10 × 2
   bin            prop
   <fct>         <dbl>
 1 [1,10.9]    0.982
 2 (10.9,20.8] 0.959
 3 (20.8,30.7] 0.935
 4 (30.7,40.6] 0.803
 5 (40.6,50.5] 0.573
 6 (50.5,60.4] 0.311
 7 (60.4,70.3] 0.115
 8 (70.3,80.2] 0.0388
 9 (80.2,90.1] 0.00893
10 (90.1,100]  0.0351
we can fit our data using a standard linear model
but there’s something very wrong…
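a sketch of the problem: a straight line is unbounded, so fitted values can fall outside \([0,1]\), where no probability can live (the name mod_lm and the prediction points are ours):

```r
## naive linear fit of the binary outcome
mod_lm <- lm(SPLATTED ~ quality, data = singers)

## predictions just outside the observed range of quality can
## drop below 0 or climb above 1, which is impossible for a probability
predict(mod_lm, newdata = data.frame(quality = c(-10, 110)))
```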
\[\textrm{odds}(y)=\frac{p(y)}{1-p(y)}\]
\[0<p<1\] \[0<\textrm{odds}<\infty\]
| event | \(p(y)\) | \(\textrm{odds}(y)\) |
|---|---|---|
| throw heads | \(\frac{1}{2}\) | \(\frac{1}{1}\) |
| throw 8 with two dice | \(\frac{5}{36}\) | \(\frac{5}{31}\) |
| get splatted | \(\frac{99}{100}\) | \(\frac{99}{1}\) |
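for example, for getting splatted: \(\textrm{odds}=\frac{99/100}{1-99/100}=\frac{99/100}{1/100}=\frac{99}{1}\)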
odds never go below zero
odds rise to \(\infty\)
\[10^3=1000; \log_{10}(1000)=3\]
\[e^{6.908}=1000; \log(1000) = 6.908\]
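in R, \(e\) is exp(1); a sketch of one way to print it to 20 decimal places (producing the output below):

```r
sprintf("%.20f", exp(1))
```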
[1] "2.71828182845904509080"
[1] "2.71828182845904509080"
if log-odds are less than zero, the odds go down (multiplied by <1)
if log-odds are greater than zero, the odds go up (multiplied by >1)
high odds = high probability
generalises the linear model using mapping functions
coefficients are in logit (log-odds) units
coefficients use Wald’s \(z\) instead of \(t\)
fit using maximum likelihood
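in R, such a model is fitted with glm(); a minimal sketch (the name splat_mod is ours and is reused in later sketches):

```r
## logistic regression: binomial family with the default logit link
splat_mod <- glm(SPLATTED ~ quality, family = binomial, data = singers)
```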

NB: no statistical test is done by default
deviance compares the likelihood of the new model to that of the previous model
a generalisation of sums of squares
lower “residual deviance” is better (a bit like the residual sum of squares)
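the deviance table below can be produced with anova(); a sketch, assuming the model fitted above:

```r
anova(splat_mod)
```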
        Df Deviance Resid. Df Resid. Dev
NULL                      999       1377
quality  1      800       998        577
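the residual deviances above are \(-2\times\) the log-likelihoods of the null (intercept-only) and fitted models; a sketch of checking this:

```r
logLik(glm(SPLATTED ~ 1, family = binomial, data = singers))  # null model
logLik(splat_mod)                                             # fitted model
```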
'log Lik.' -688.5 (df=1)
'log Lik.' -288.6 (df=2)
model deviance maps to the \(\chi^2\) distribution
can specify a \(\chi^2\) test to statistically evaluate model in a similar way to \(F\) ratio
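a sketch of asking for that test, using the model fitted earlier:

```r
anova(splat_mod, test = "Chisq")
```

the full model summary (below) comes from summary(splat_mod):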
Call:
glm(formula = SPLATTED ~ quality, family = binomial, data = singers)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.08191    0.33410    15.2   <2e-16 ***
quality     -0.10557    0.00642   -16.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1377.06  on 999  degrees of freedom
Residual deviance:  577.29  on 998  degrees of freedom
AIC: 581.3

Number of Fisher Scoring iterations: 6
zero = “50/50” (odds of 1)
value below zero: probability of being splatted decreases as quality increases
a calculation for quality = 50
log-odds: \(5.08 + (-0.11 \cdot 50) = \color{red}{-0.42}\)
odds: \(e^{-0.42}=\color{red}{0.657}\)
probability: \(\frac{0.657}{1+0.657}=\color{red}{0.3965}\)
\[\hat{y}_i=b_0+b_1x_i\] \[\textrm{odds}=e^{\hat{y}_i}\] \[p=\frac{\textrm{odds}}{1+\textrm{odds}}\]
more intuitive to think in probability than in logits
useful to write a function which takes a value in logits \(l\) and converts it to a probability \(p\) (sketched below)
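a minimal sketch (the name l2p is made up here):

```r
## convert log-odds (logits) to probability
l2p <- function(l) {
  odds <- exp(l)      # logits -> odds
  odds / (1 + odds)   # odds -> probability
}

l2p(-0.42)  # ~0.3965, matching the worked example above
```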
so far we’ve looked at
model deviance and \(\chi^2\) (similar to sums of squares and \(F\))
model coefficients and how to map them to probability
what about “explained variance” (similar to \(R^2\))?
no really good way of doing this, though many have been proposed
SPSS uses something called “accuracy” (how well does the model predict actual data?)
not very informative, but good for learning R
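a sketch of that accuracy calculation for the model fitted above, counting an alien as predicted-splatted when the model gives a probability over .5:

```r
## predicted probabilities -> hard predictions -> proportion correct
guess <- predict(splat_mod, type = "response") > .5
mean(guess == singers$SPLATTED)
```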
logistic (logit) regression is one type of generalised linear model (GLM)
others make use of different link functions (through family=...)
poisson: number of events in a time period
inverse gaussian: time to reach some criterion
…
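hedged sketches of what such models might look like (dat, n_events, time_taken, and x are hypothetical):

```r
glm(n_events ~ x, family = poisson, data = dat)             # counts of events
glm(time_taken ~ x, family = inverse.gaussian, data = dat)  # time to criterion
```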
these distinctions matter differently for predictors and outcomes:

| predictors | outcomes |
|---|---|
| linear | linear |
| convertible to linear (use log() etc.) | convertible to linear (use log() etc.) |
| non-convertible (use contrasts() etc. to map) | non-convertible (use glm() with family=...) |
| don't affect the choice of model | directly affect the choice of model |
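a sketch of that last contrast (dat, x, y, and y01 are hypothetical):

```r
lm(y ~ log(x), data = dat)                    # convertible predictor: transform it
glm(y01 ~ x, family = binomial, data = dat)   # non-convertible outcome: change the model
```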