Univariate Statistics and Methodology using R
Psychology, PPLS
University of Edinburgh
| id | quality | SPLATTED |
|---|---|---|
| The Great Odorjan of Erpod | 84 | 0 |
| Hapetox Bron | 34 | 1 |
| Loorn Molzeks | 92 | 0 |
| Ba'lite Adrflen | 49 | 1 |
| Tedlambo Garilltet | 93 | 0 |
| Goraveola Grellorm | 5 | 1 |
| Colonel Garqun | 55 | 1 |
| Bosgogo Lurcat | 64 | 1 |
| Osajed Voplily | 45 | 0 |
| Subcommander Edorop | 90 | 0 |
quality = quality of singing
SPLATTED = whether the singer got splatted (1 = splatted, 0 = not)
[plot: the raw data, drawn with geom_jitter() with alpha = .5]
each alien either gets splatted or doesn’t
underlyingly, there’s a binomial distribution
for each value of “quality of singing” there’s a probability of getting splatted
for each individual alien, the observed outcome is deterministic (they either were splatted or they weren’t)
but it’s the probability we are ultimately interested in
we can approximate it by binning our data…
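a sketch of one way to get those binned proportions, assuming the data is in a tibble called singers:

```r
library(tidyverse)

## cut quality into 10 equal-width bins and calculate the
## proportion splatted within each bin
singers |>
  mutate(bin = cut(quality, 10)) |>
  group_by(bin) |>
  summarise(prop = mean(SPLATTED))
```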
# A tibble: 10 × 2
   bin            prop
   <fct>         <dbl>
 1 [1,10.9]    0.982
 2 (10.9,20.8] 0.959
 3 (20.8,30.7] 0.935
 4 (30.7,40.6] 0.803
 5 (40.6,50.5] 0.573
 6 (50.5,60.4] 0.311
 7 (60.4,70.3] 0.115
 8 (70.3,80.2] 0.0388
 9 (80.2,90.1] 0.00893
10 (90.1,100]  0.0351
we can fit our data using a standard linear model
but there’s something very wrong…
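a sketch of the problem: a straight line is unbounded, so fitted values can fall outside \([0,1]\), where no probability can live (the name mod_lm and the prediction points are ours):

```r
## naive linear fit of the binary outcome
mod_lm <- lm(SPLATTED ~ quality, data = singers)

## predictions just outside the observed range of quality can
## drop below 0 or climb above 1, which is impossible for a probability
predict(mod_lm, newdata = data.frame(quality = c(-10, 110)))
```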
\[\textrm{odds}(y)=\frac{p(y)}{1-p(y)}\]
\[0<p<1\] \[0<\textrm{odds}<\infty\]
| event | \(p(y)\) | \(\textrm{odds}(y)\) |
|---|---|---|
| throw heads | \(\frac{1}{2}\) | \(\frac{1}{1}\) |
| throw 8 with two dice | \(\frac{5}{36}\) | \(\frac{5}{31}\) |
| get splatted | \(\frac{99}{100}\) | \(\frac{99}{1}\) |
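for example, for getting splatted: \(\textrm{odds}=\frac{99/100}{1-99/100}=\frac{99/100}{1/100}=\frac{99}{1}\)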
odds never go below zero
odds rise to \(\infty\)
\[10^3=1000; \log_{10}(1000)=3\]
\[e^{6.908}=1000; \log(1000) = 6.908\]
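in R, \(e\) is exp(1); a sketch of one way to print it to 20 decimal places (producing the output below):

```r
sprintf("%.20f", exp(1))
```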
[1] "2.71828182845904509080"
[1] "2.71828182845904509080"
if log-odds are less than zero, the odds go down (multiplied by <1)
if log-odds are greater than zero, the odds go up (multiplied by >1)
high odds = high probability
generalises the linear model using mapping functions
coefficients are in logit (log-odds) units
coefficients use Wald’s \(z\) instead of \(t\)
fit using maximum likelihood
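in R, such a model is fitted with glm(); a minimal sketch (the name splat_mod is ours and is reused in later sketches):

```r
## logistic regression: binomial family with the default logit link
splat_mod <- glm(SPLATTED ~ quality, family = binomial, data = singers)
```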

NB: no statistical test is done by default
deviance compares the likelihood of the new model to that of the previous model
a generalisation of sums of squares
lower “residual deviance” is better (a bit like the residual sum of squares)
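the deviance table below can be produced with anova(); a sketch, assuming the model fitted above:

```r
anova(splat_mod)
```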
        Df Deviance Resid. Df Resid. Dev
NULL                      999       1377
quality  1      800       998        577
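the residual deviances above are \(-2\times\) the log-likelihoods of the null (intercept-only) and fitted models; a sketch of checking this:

```r
logLik(glm(SPLATTED ~ 1, family = binomial, data = singers))  # null model
logLik(splat_mod)                                             # fitted model
```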
'log Lik.' -688.5 (df=1)
'log Lik.' -288.6 (df=2)
model deviance maps to the \(\chi^2\) distribution
can specify a \(\chi^2\) test to statistically evaluate model in a similar way to \(F\) ratio
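a sketch of asking for that test, using the model fitted earlier:

```r
anova(splat_mod, test = "Chisq")
```

the full model summary (below) comes from summary(splat_mod):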
Call:
glm(formula = SPLATTED ~ quality, family = binomial, data = singers)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.08191    0.33410    15.2   <2e-16 ***
quality     -0.10557    0.00642   -16.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1377.06  on 999  degrees of freedom
Residual deviance:  577.29  on 998  degrees of freedom
AIC: 581.3

Number of Fisher Scoring iterations: 6
zero = “50/50” (odds of 1)
value below zero: probability of being splatted decreases as quality increases
a calculation for quality = 50
log-odds: \(5.08 + (-0.11 \cdot 50) = \color{red}{-0.42}\)
odds: \(e^{-0.42}=\color{red}{0.657}\)
probability: \(\frac{0.657}{1+0.657}=\color{red}{0.3965}\)
\[\hat{y}_i=b_0+b_1x_i\] \[\textrm{odds}=e^{\hat{y}_i}\] \[p=\frac{\textrm{odds}}{1+\textrm{odds}}\]
more intuitive to think in probability than in logits
useful to write a function which takes a value in logits \(l\) and converts it to a probability \(p\) (sketched below)
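a minimal sketch (the name l2p is made up here):

```r
## convert log-odds (logits) to probability
l2p <- function(l) {
  odds <- exp(l)      # logits -> odds
  odds / (1 + odds)   # odds -> probability
}

l2p(-0.42)  # ~0.3965, matching the worked example above
```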
so far we’ve looked at
model deviance and \(\chi^2\) (similar to sums of squares and \(F\))
model coefficients and how to map them to probability
what about “explained variance” (similar to \(R^2\))?
no really good way of doing this, though many have been proposed
SPSS uses something called “accuracy” (how well does the model predict actual data?)
not very informative, but good for learning R
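a sketch of that accuracy calculation for the model fitted above, counting an alien as predicted-splatted when the model gives a probability over .5:

```r
## predicted probabilities -> hard predictions -> proportion correct
guess <- predict(splat_mod, type = "response") > .5
mean(guess == singers$SPLATTED)
```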
logistic (logit) regression is one type of generalised linear model (GLM)
others make use of different link functions (through family=...)
poisson: number of events in a time period
inverse gaussian: time to reach some criterion
…
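hedged sketches of what such models might look like (dat, n_events, time_taken, and x are hypothetical):

```r
glm(n_events ~ x, family = poisson, data = dat)             # counts of events
glm(time_taken ~ x, family = inverse.gaussian, data = dat)  # time to criterion
```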
these distinctions matter differently for predictors and outcomes:

| predictors | outcomes |
|---|---|
| linear | linear |
| convertible to linear (use log() etc.) | convertible to linear (use log() etc.) |
| non-convertible (use contrasts() etc. to map) | non-convertible (use glm() with family=...) |
| don't affect the choice of model | directly affect the choice of model |
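a sketch of that last contrast (dat, x, y, and y01 are hypothetical):

```r
lm(y ~ log(x), data = dat)                    # convertible predictor: transform it
glm(y01 ~ x, family = binomial, data = dat)   # non-convertible outcome: change the model
```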