The Generalized Linear Model

Univariate Statistics and Methodology using R

Martin Corley

Psychology, PPLS

University of Edinburgh

Aliens

A Binary World

A Binary World

A Binary World

1,000 Aliens

id quality SPLATTED
The Great Odorjan of Erpod 84 0
Hapetox Bron 34 1
Loorn Molzeks 92 0
Ba'lite Adrflen 49 1
Tedlambo Garilltet 93 0
Goraveola Grellorm 5 1
Colonel Garqun 55 1
Bosgogo Lurcat 64 1
Osajed Voplily 45 0
Subcommander Edorop 90 0
  • quality = quality of singing

  • SPLATTED = whether splatted (1 or 0)

1,000 Aliens

1,000 Aliens

  • using geom_jitter() with alpha=.5

Binomial Regression, Conceptually

  • each alien either gets splatted or doesn’t

    • each observation is either a 1 or a 0
  • underlyingly, there’s a binomial distribution

  • for each value of “quality of singing” there’s a probability of getting splatted

  • for each alien, the outcome is deterministic

  • but it’s the probability we are ultimately interested in

  • we can approximate it by binning our data…

Binned Data

singers <- singers |>
    mutate(bin = cut_interval(quality,
        10))

dat <- singers |>
    group_by(bin) |>
    summarise(prop = mean(SPLATTED))

dat
# A tibble: 10 × 2
   bin            prop
   <fct>         <dbl>
 1 [1,10.9]    0.982  
 2 (10.9,20.8] 0.959  
 3 (20.8,30.7] 0.935  
 4 (30.7,40.6] 0.803  
 5 (40.6,50.5] 0.573  
 6 (50.5,60.4] 0.311  
 7 (60.4,70.3] 0.115  
 8 (70.3,80.2] 0.0388 
 9 (80.2,90.1] 0.00893
10 (90.1,100]  0.0351 

Binned Data

singers <- singers |>
    mutate(bin = cut_interval(quality,
        10))

dat <- singers |>
    group_by(bin) |>
    summarise(prop = mean(SPLATTED))

dat |>
    ggplot(aes(x = bin, y = prop)) +
    xlab("quality bin") +
    ylab("prop splatted") +
    geom_point(size = 3) +
    scale_x_discrete(label = 1:10)

Best Fit Lines

  • we can fit our data using a standard linear model

  • but there’s something very wrong…

The Problem with Probability

The Problem with Probability

  • a linear model predicts impossible values because probability isn’t linear; it’s asymptotic

The Problem with Probability

  • variance necessarily covaries with probability

Assumptions

Log Odds

Probability and Odds

\[\textrm{odds}(y)=\frac{p(y)}{1-p(y)}\]

\[0<p<1\] \[0<\textrm{odds}<\infty\]

 

 

\(p(y)\)

\(\textrm{odds}(y)\)

 

 

throw heads

\(\frac{1}{2}\)

\(\frac{1}{1}\)

 

 

throw 8 with two dice

\(\frac{5}{36}\)

\(\frac{5}{31}\)

 

 

get splatted

\(\frac{99}{100}\)

\(\frac{99}{1}\)

 

Probability and Odds

 

  • odds never goes below zero

  • odds rises to \(\infty\)

Refresher: (Natural) Logarithms

  • a logarithm is the power you would raise a number to to obtain another number

\[10^3=1000; \log_{10}(1000)=3\]

log(1000, base = 10)
[1] 3

Refresher: (Natural) Logarithms

  • a logarithm is the power you would raise a number to to obtain another number

\[10^3=1000; \log_{10}(1000)=3\]

  • a natural logarithm is the power you would raise the number \(e\) to to obtain another number

\[e^{6.908}=1000; \log(1000) = 6.908\]

exp(6.908)
[1] 1000
log(1000)
[1] 6.908

Refresher 2

  • what value does \(e\) have? (\(e^1=e\))
exp(1)
[1] "2.71828182845904509080"

Why \(e\)?

  • \(\log(l)\) = area under curve between \(1\) and \(l\) (negative if \(l<1\))
log(3)
[1] 1.099
log(0.3)
[1] -1.204

Refresher 2

  • what value does \(e\) have? (\(e^1=e\))
exp(1)
[1] "2.71828182845904509080"
  • \((\textrm{any number})^0=1\)
log(1)
[1] 0
  • \(\log(0)=-\infty\)
log(0)
[1] -Inf
  • \(\log(\infty)=\infty\)
log(Inf)
[1] Inf

Probability and Log-Odds

  • \(\log(0)=-\infty\); \(\log(\infty)=+\infty\); \(\log(1)=0\)
  • log-odds of \(0\) (odds of \(1\)) are exactly 50:50 (\(p=0.5\))

Probability and Log-Odds

  • if log-odds are less than zero, the odds go down (multiplied by <1)

  • if log-odds are greater than zero, the odds go up (multiplied by >1)

  • high odds = high probability

The Generalized Linear Model

The Generalized Linear Model

  • generalises the linear model using mapping functions

  • coefficients are in logit (log-odds) units

 

  • coefficients use Wald’s \(z\) instead of \(t\)

  • fit using maximum likelihood

Likelihood

The Generalized Linear Model

  • generalises the linear model using mapping functions

  • coefficients are in logit (log-odds) units

 

  • coefficients use Wald’s \(z\) instead of \(t\)

  • fit using maximum likelihood

 

  • but actually it’s all quite straightforward…

Alien Singer Splat Probability

id quality SPLATTED
The Great Odorjan of Erpod 84 0
Hapetox Bron 34 1
Loorn Molzeks 92 0
Ba'lite Adrflen 49 1
Tedlambo Garilltet 93 0
Goraveola Grellorm 5 1
Colonel Garqun 55 1
Bosgogo Lurcat 64 1

 

  • use glm() instead of lm()

  • specify link function with family = binomial

mod.b <- glm(SPLATTED ~ quality,
    family = binomial, data = singers)

Evaluating the Model

anova(mod.b)
Analysis of Deviance Table

Model: binomial, link: logit

Response: SPLATTED

Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev
NULL                      999       1377
quality  1      800       998        577
  • NB., no statistical test done by default

  • deviance compares the likelihood of the new model to that of the previous model

    • a generalisation of sums of squares

    • lower “residual deviance” is good (a bit like Residual Sums of Squares)

Evaluating the Model

        Df Deviance Resid. Df Resid. Dev
NULL                      999       1377
quality  1      800       998        577
  • deviance is \(-2\times\) the log-likelihood ratio of the reduced compared to the full model
  • higher deviance is good (a bit like \(F\))
mod.n <- glm(SPLATTED ~ 1, family = binomial, data = singers)
logLik(mod.n)
'log Lik.' -688.5 (df=1)
logLik(mod.b)
'log Lik.' -288.6 (df=2)
-2 * (logLik(mod.n) -
  logLik(mod.b))
'log Lik.' 799.8 (df=1)

Evaluating the Model

  • model deviance maps to the \(\chi^2\) distribution

  • can specify a \(\chi^2\) test to statistically evaluate model in a similar way to \(F\) ratio

anova(mod.b, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: SPLATTED

Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
NULL                      999       1377             
quality  1      800       998        577   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Coefficients


Call:
glm(formula = SPLATTED ~ quality, family = binomial, data = singers)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.08191    0.33410    15.2   <2e-16 ***
quality     -0.10557    0.00642   -16.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1377.06  on 999  degrees of freedom
Residual deviance:  577.29  on 998  degrees of freedom
AIC: 581.3

Number of Fisher Scoring iterations: 6

Model Coefficients

coefficients are in logits (= log-odds)

...
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.08191    0.33410    15.2   <2e-16 ***
quality     -0.10557    0.00642   -16.5   <2e-16 ***
...
  • zero = “50/50” (odds of 1)

  • value below zero: probability of being splatted decreases as quality increases

Log-Odds, Odds, and Probability

...
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.08191    0.33410    15.2   <2e-16 ***
quality     -0.10557    0.00642   -16.5   <2e-16 ***
...

a calculation for quality = 50

  • log-odds: \(5.08+-0.11\cdot50=\color{red}{-0.42}\)

  • odds: \(e^{-0.42}=\color{red}{0.657}\)

  • probability: \(\frac{0.657}{1+0.657}=\color{red}{0.397}\)


\[\hat{y}_i=b_0+b_1x_i\] \[\textrm{odds}=e^{\hat{y}_i}\] \[p=\frac{\textrm{odds}}{1+\textrm{odds}}\]

A Useful Function

  • intuitive to think in probability

  • useful to write a function which takes a value in logits l and converts it to a probability p

l2p <- function(logits) {
  odds <- exp(logits)
  prob <- odds/(1 + odds)
  return(prob)
}
  • qualities 50 and 51
l2p(5.08+-0.11*50)
[1] 0.3965
l2p(5.08+-0.11*51)
[1] 0.3705
  • qualities 70 and 71
l2p(5.08+-0.11*70)
[1] 0.06786
l2p(5.08+-0.11*71)
[1] 0.06123

Representing the Model Graphically

singers |>
  ggplot(aes(x = quality, y = SPLATTED)) +
  ylab("p(SPLATTED)") +
  geom_jitter(size = 3, width = 0, height = 0.2, alpha = 0.1) +
  geom_smooth(method = "glm", method.args = list(family = binomial)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.2))

One Last Trick

  • so far we’ve looked at

    • model deviance and \(\chi^2\) (similar to sums of squares and \(F\))

    • model coefficients and how to map them to probability

  • what about “explained variance” (similar to \(R^2\))?

  • no really good way of doing this, many proposals

  • SPSS uses something called “accuracy” (how well does the model predict actual data?)

  • not very informative, but good for learning R

Accuracy

  • first, what does the model predict (in logit units)?
guess <- predict(mod.b)  # in logit units
  • if the chance of being splatted is more than .5 (logit > 0) call it a “splat”
guess <- ifelse(guess > 0, 1, 0)
  • how well do predicted splats match actual splats?
hits <- sum(guess == singers$SPLATTED)
hits/length(singers$SPLATTED)
[1] 0.879
  • present model “correctly predicts” 87.9% of the observations

Other Types of Data

  • logit regression is one type of GLM

  • others make use of different link functions (through family=...)

 

  • poisson: number of events in a time period

  • inverse gaussian: time to reach some criterion

GLMs

Predictor Variables

  • linear

  • convertible to linear (use log() etc.)

  • non-convertible (use contrasts() etc. to map)

  • don’t affect the choice of model

 

Dependent Variables

  • linear

  • convertible to linear (use log() etc.)

  • non-convertible (use glm() with family=...)

  • affect the choice of model

End