Some very quick recaps

Simple linear regression

Multiple linear regression

Interactions

Categorical explanatory variables

Binary Outcomes!

So what do we do if we want to do a similar analysis but for an outcome variable that is not numeric?
We measure a lot of things as categories, and sometimes what our research is interested in is what variables influence the likelihood that an observation will fall into a given category.

In the lecture, we saw an example in which our participants (aliens) were either splatted or survived.

Consider the following questions:

What features of speech influence the likelihood of an utterance being perceived as true/false?
What lifestyle and demographic variables influence the chances that a person does/does not smoke?
Does preference for milk vs dark chocolate depend on personality traits?

The common thread in the questions above is that they all contain a dichotomy: “splatted/not splatted,” “true/false,” “does/does not,” “milk vs dark.”

A note: This sort of binary thinking can be useful, but it is important to remember that it can in part be simply a result of measurement. For instance, we might initially consider “has blond hair” to be binary “Yes/No,” but that doesn’t mean that in the real world people either have blond hair or they don’t. We could instead choose to measure hair colour via colorimetric measures of the energy at each spectral wavelength, meaning “blondness” could become a continuum. While binary thinking can be useful, it is not difficult to think of ways in which it can be harmful.

Introducing GLM

Research Question: Is susceptibility to change blindness influenced by level of alcohol intoxication and perceptual load?

Watch the following video

Simons, D. J., & Levin, D. T. (1997). Change blindness. Trends in cognitive sciences, 1(7), 261-267.

You may well have already heard of this series of experiments, or have seen similar things on TV.

Question A1

For a given participant in the ‘Door Study,’ give a description of the outcome variable of interest, and ask yourself whether it is binary.

Solution

drunkdoor.csv Dataset

variable	description
id	Unique ID number
bac	Blood Alcohol Content (BAC). A BAC of 0.0 is sober; in the United States 0.08 is legally intoxicated, and levels above that indicate severe impairment. BAC levels above 0.40 are potentially fatal.
age	Age (in years)
condition	Condition: perceptual load created by the distracting object (door) and by the detail and amount of papers handled in front of the participant (Low vs High)
notice	Whether or not the participant noticed the swap (Yes = 1 vs No = 0)

Question A2

Read in the data, and plot the relationship between the age variable and the notice variable.
Use geom_point(), and add to the plot geom_smooth(method="lm"). This will plot the regression line for a simple model of lm(notice ~ age).

Solution

Question A3

Just visually following the line from the plot produced in the previous question, what do you think the predicted model value would be for someone who is aged 30?
What does this value mean?

Solution

What we are interested in evaluating here is really the probability of noticing the experimenter-swap.

This is where the Generalised Linear Model (GLM) comes in. With a little bit of trickery, we can translate probability (restricted to between 0 and 1, and not represented by a straight line), into “log-odds,” which is both linear and unbounded (see Figure 2).

Probability, odds, log-odds

If we let $p$ denote the probability of a given event, then:

$\frac{p}{(1-p)}$ are the odds of the event happening. For example, the odds of rolling a 6 with a normal die are 1/5 (sometimes this is expressed ‘1:5,’ and in gambling the order is sometimes flipped and you’ll see ‘5/1’ or ‘odds of five to one’).
$ln(\frac{p}{(1-p)})$ are the log-odds of the event.
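As a small illustration of these definitions in R, the sketch below converts a probability into odds and log-odds using the die example. The object names (p, odds, logodds) are just illustrative.

```r
# probability of rolling a 6 with a fair six-sided die
p <- 1/6

# odds = p / (1 - p)
odds <- p / (1 - p)

# log-odds = natural log of the odds
logodds <- log(odds)

odds     # 0.2, i.e. 1/5
logodds  # negative, because the event is less likely than not
```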

Figure 2: https://uoepsy.github.io/usmr/lectures/lecture_9.html#32

Question A4

The probability of a coin landing on heads is 0.5, or 50%. What are the odds, and what are the log-odds?
This year’s Tour de France winner, Tadej Pogacar, was given odds of 11 to 4 by a popular gambling site (i.e., if we could run the race over and over again, for every 4 he won he would lose 11). Translate this into the implied probability of him winning.

Solution

How is this useful? Recall the linear model formula we have seen a lot of over the last couple of weeks: \[ \color{red}{Y} = \color{blue}{\beta_0 \cdot{} 1 + \beta_1 \cdot{} X_1 + ... + \beta_k \cdot{} X_k} \] Because we are defining a linear relationship here, we can’t directly model the probabilities (because they are bounded by 0 and 1, and they are not linear). But we can model the log-odds of the event happening.
Our model formula thus becomes: \[ \color{red}{ln\left(\frac{p}{1-p} \right)} = \color{blue}{\beta_0 \cdot{} 1 + \beta_1 \cdot{} X_1 + ... + \beta_k \cdot{} X_k} \\ \quad \\ \text{where } Y_i \sim Binomial(n, p_i) \text{ for a given } x_i \\ \text{and } n = 1 \text{ for binary responses} \]
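A brief sketch of why this works: whatever value the linear predictor takes, converting it back from log-odds always yields a probability between 0 and 1. The coefficient values b0 and b1 here are made up purely for illustration; qlogis() and plogis() are base R's logit and inverse-logit functions.

```r
# hypothetical coefficients, chosen only for illustration
b0 <- 2
b1 <- -0.5
x  <- seq(-20, 20, by = 5)

# the linear predictor is unbounded (it lives on the log-odds scale)
logodds <- b0 + b1 * x

# inverse-logit: exp(logodds) / (1 + exp(logodds))
p <- plogis(logodds)

range(logodds)  # covers both negative and positive values
range(p)        # always stays between 0 and 1

# qlogis() is the logit, so it takes us back to the log-odds
all.equal(qlogis(p), logodds)
```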

Optional A bit of a tangent - How does the model get estimated?

Maximum Likelihood Estimation

For a linear regression, we heard about how the regression line can be found by “minimising the residual sums of squares” (i.e., we rotate the line to find the point at which $\sum{(y - \hat{y})^2}$ is smallest). This gave us the “best fitting line” (Figure 3).

Figure 3: https://uoepsy.github.io/usmr/lectures/lecture_6.html#36

For the logistic regression model, what we’re really wanting is the “best fitting squiggle” (Figure ??), and to get to this we must do something else. The reason we have to do something different is because for our actual observations, the event has either happened or it hasn’t. So we can’t take the raw data as “probabilities” which we can translate into log-odds. We can try, but it doesn’t make sense, and we would just be trying to fit some line between infinity and negative infinity. This would mean the residuals would also all be infinity, and so it becomes impossible to work out anything!

drunkdoor %>% 
  mutate(
    notice_odds = notice/(1-notice),
    notice_logodds = log(notice_odds)
  )

## # A tibble: 120 x 7
##    id        bac   age condition notice notice_odds notice_logodds
##    <chr>   <dbl> <dbl> <chr>      <dbl>       <dbl>          <dbl>
##  1 ID1   0.0674     43 Low            1         Inf            Inf
##  2 ID2   0.00331    64 Low            0           0           -Inf
##  3 ID3   0.00323    44 Low            1         Inf            Inf
##  4 ID4   0.0798     67 High           0           0           -Inf
##  5 ID5   0.0668     62 High           0           0           -Inf
##  6 ID6   0.0155     45 Low            1         Inf            Inf
##  7 ID7   0.00113    50 High           0           0           -Inf
##  8 ID8   0.0511     45 High           0           0           -Inf
##  9 ID9   0.00314    48 High           0           0           -Inf
## 10 ID10  0.0416     61 Low            0           0           -Inf
## # ... with 110 more rows

Instead, maximum likelihood estimation is used to find the set of coefficients which best reproduce the observed data.

For a given line on the plot where y = log-odds, and x = predictor variable, we can project our points on to it to find some candidate log-odds values for each observation, which we can then compare to the observed data (the 0s and 1s, or -Inf and Inf on the log-odds scale).

Consider three possible lines we might fit below. You can see the observations at the Inf and -Inf points of the y-axis, and they are coloured by whether the event was observed or not. These are then projected down on to the possible lines. Recall that a log-odds of 0 is 50/50, or an odds of 1.
In the left hand plot, all of the observations where the event was observed to happen (red dots at log-odds of +Inf), are modelled as having a log-odds of > 0 (red dots on the line). However, so are a lot of the observations where the event was not observed (blue dots at log-odds of -Inf).
In the middle plot, some red dots get incorrectly given log-odds < 0, and a lot of blue dots get incorrectly given log-odds > 0. The right hand plot seems to fit a little better - there are only a few points which get given log-odds the wrong way from what we would expect.

Figure 4: Note: the dotted lines are not residuals, but just to show the projections of observations down to the line

If we take that right-hand plot above, and translate the log-odds back into probabilities, we see the squiggle!

We can then ask of each candidate line: “what is the likelihood of the data, given this line?” To get this, we take, for each observation, the predicted probability of the outcome that was actually observed: the predicted probability $p$ if the observation is a 1, and $1-p$ if it is a 0. Multiplying these together across all observations (or, equivalently, summing their logs) gives the likelihood of the data.
For instance, the furthest right point is an observed 0, and so the likelihood of this data given the predicted probability is 1-0.04, or 0.96.

The cool thing is that computers do all this for us. Maximum likelihood estimation is just a method to find out which parameters maximise the likelihood of the data. In our context, it tells us what values of coefficients in $\color{red}{ln\left(\frac{p}{1-p} \right)} = \color{blue}{\beta_0 \cdot{} 1 + \beta_1 \cdot{} X_1 }$ maximise the likelihood of reproducing the data we observed.
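To make the idea concrete, here is a minimal sketch that computes the log-likelihood for candidate lines by hand and then searches for the best one. It uses simulated data (not the drunkdoor data), and the names loglik and ml are just illustrative. Note that glm() itself uses a different algorithm (the "Fisher Scoring" mentioned in its output), so optim() stands in here only to show the principle.

```r
set.seed(123)

# simulated data: a binary outcome y and one predictor x
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# log-likelihood of the data for a candidate intercept and slope:
# sum of log(p) for observed 1s and log(1 - p) for observed 0s
loglik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)
  sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

# compare two candidate lines
loglik(c(0, 0))       # a flat line: every observation gets p = 0.5
loglik(c(-0.5, 1.2))  # the line used to simulate the data

# search for the coefficients with the highest log-likelihood
# (optim() minimises, so we hand it the negative log-likelihood)
ml <- optim(c(0, 0), function(beta) -loglik(beta), method = "BFGS")
ml$par

# glm() arrives at essentially the same estimates
coef(glm(y ~ x, family = "binomial"))
```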

Question A5

To fit a logistic regression model, we need to use glm(), a slightly more general form of lm().
The syntax is pretty much the same, but we need to add in a family at the end, to tell it that we are doing a binomial¹ logistic regression.

Using the drunkdoor.csv data, fit a model investigating whether participants’ age predicts the log-odds of noticing the person they are talking to being switched out mid-conversation. Look at the summary() output of your model.

lm(y ~ x1 + x2, data = data)
is the same as:
glm(y ~ x1 + x2, data = data, family = "gaussian")

(Gaussian is another name for the normal distribution)
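A quick way to convince yourself of this equivalence is to fit both and compare the coefficients. The sketch below uses mtcars, a dataset that ships with R, purely as a stand-in; the model itself is arbitrary.

```r
# mtcars is built in to R; the variables here are just for illustration
m_lm  <- lm(mpg ~ wt + hp, data = mtcars)
m_glm <- glm(mpg ~ wt + hp, data = mtcars, family = "gaussian")

# the coefficient estimates agree
all.equal(coef(m_lm), coef(m_glm))
```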

Solution

model1 <- glm(notice ~ age, data = drunkdoor, family="binomial")
summary(model1)

## 
## Call:
## glm(formula = notice ~ age, family = "binomial", data = drunkdoor)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3142  -0.9686  -0.3496   0.9635   1.7526  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.15625    1.57163   4.553 5.28e-06 ***
## age         -0.12999    0.02819  -4.612 4.00e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 166.32  on 119  degrees of freedom
## Residual deviance: 136.84  on 118  degrees of freedom
## AIC: 140.84
## 
## Number of Fisher Scoring iterations: 4

Question A6

Interpreting coefficients from a logistic regression can be difficult at first.
Based on your model output, complete the following sentence:

“Being 1 year older decreases _________ by 0.13.”

Solution

Question A7

Unfortunately, if we talk about increases/decreases in log-odds, it’s not that intuitive.
What we often do is translate this back into odds.
The inverse of the natural logarithm is the exponential, and in R these functions are log() and exp():

log(2)

## [1] 0.6931472

exp(log(2))

## [1] 2

log(exp(0.6931472))

## [1] 0.6931472

Exponentiate the coefficients from your model in order to translate them back from log-odds, and provide an interpretation of what the resulting numbers mean.

Solution

exp(coef(model1))

##  (Intercept)          age 
## 1282.0913563    0.8781015

The odds of noticing a mid-conversation person-switch for someone aged 0 are 1282:1.
For every year older someone is, the odds of noticing are multiplied by 0.88 (i.e., they decrease by about 12%).

Question A8

Based on your answer to the previous question, calculate the odds of noticing the swap for a one year-old (for now, forget about the fact that this experiment wouldn’t work on a 1 year old!)
And what about for a 40 year old?

Can you translate the odds back to probabilities?

Solution

exp(coef(model1))

##  (Intercept)          age 
## 1282.0913563    0.8781015

The odds of noticing a mid-conversation person-switch for someone aged 0 are 1282:1.
For every year older someone is, the odds of noticing are multiplied by 0.88.

This means that for a one year old, the odds of noticing are $1282*0.88$, or about 1128:1.
The odds for a 40 year old are $1282*0.88^{40}$, or about 7.71:1. (These use the rounded odds ratio of 0.88. With the unrounded coefficients the values are about 1126:1 and 7.07:1, because rounding error compounds when raised to the 40th power.)

And we can always then turn these back into probabilities.

From probabilities to log-odds (or logit):
\[ logit_i=log\left(\frac{p_i}{1-p_i}\right) \] From log-odds to probability:
\[ p_i=\frac{e^{logit_i}}{(1+e^{logit_i})} \]

Because $e^{logit_i}$ is simply the odds, this is the same as dividing the odds by one plus the odds.

Predicted probability of noticing for a one year old = $\frac{1128}{1 + 1128} = 0.999$ Predicted probability of noticing for a 40 year old = $\frac{7.71}{1 + 7.71} = 0.89$
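The same calculation can be done in R without rounding the odds ratio first. This sketch uses the coefficient estimates as printed in the summary() output earlier; the object names are illustrative. Small differences from the hand calculation arise because 0.88 is a rounded value, and that rounding compounds over 40 years.

```r
# coefficient estimates copied from the summary() output
b0 <- 7.15625   # intercept
b1 <- -0.12999  # age

age <- c(1, 40)

logodds <- b0 + b1 * age      # predicted log-odds
odds    <- exp(logodds)       # predicted odds
prob    <- odds / (1 + odds)  # predicted probability

data.frame(age, logodds, odds, prob)
```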

Question A9

We can easily get R to extract these predicted probabilities for us.

Calculate the predicted log-odds (probabilities on the logit scale): predict(model, type="link")
Calculate the predicted probabilities: predict(model, type="response")

The code below creates a dataframe with the variable age in it, which has the values 1 to 100. Can you use this object in the predict() function, along with your model, to calculate the predicted probabilities of the outcome (noticing the swap) for each value of age? How about then plotting them?

ages100 <- tibble(age = 1:100)

Solution
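The general pattern is sketched below. Because the drunkdoor data are not reproduced here, the sketch fits the model to simulated stand-in data (the object sim and its values are hypothetical), and it uses data.frame() rather than tibble() so that it runs in base R.

```r
set.seed(2021)

# simulated stand-in for the drunkdoor data (hypothetical values)
sim <- data.frame(age = round(runif(120, 20, 80)))
sim$notice <- rbinom(120, size = 1, prob = plogis(7 - 0.13 * sim$age))

sim_model <- glm(notice ~ age, data = sim, family = "binomial")

# newdata must contain a column named like the predictor in the model
ages100 <- data.frame(age = 1:100)

ages100$logodds <- predict(sim_model, newdata = ages100, type = "link")
ages100$prob    <- predict(sim_model, newdata = ages100, type = "response")

head(ages100)

# to see the curve: plot(ages100$age, ages100$prob, type = "l")
```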

Question A10

We have the following model coefficients, in terms of log-odds:

summary(model1)$coefficients

##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  7.1562479 1.57163317  4.553383 5.279002e-06
## age         -0.1299931 0.02818848 -4.611569 3.996402e-06

We can convert these to odds by using exp():

exp(coef(model1))

##  (Intercept)          age 
## 1282.0913563    0.8781015

This lets us say something along the lines of “For every year older someone is, the odds of noticing the mid-conversation swap (the outcome event happening) are multiplied by 0.88.”

Why can we not translate this into a straightforward statement about the change in probability of the outcome for every year older someone is?

Solution

We have to be careful - we are talking about an odds ratio.

Our model holds that a 21 year old has 0.88 times the odds that a 20 year old has, and that a 51 year old has 0.88 times the odds of a 50 year old. But these are not the same in terms of probability. The relationship between age and the probability isn’t linear; it is sigmoidal.

example <- tibble(
  age = c(20,21,50,51)
)

example <- 
  example %>% mutate(
    logodds = predict(model1, newdata = example, type = "link"),
    odds = exp(logodds),
    probs = predict(model1, newdata = example, type = "response"),
  )

example

## # A tibble: 4 x 4
##     age logodds  odds probs
##   <dbl>   <dbl> <dbl> <dbl>
## 1    20   4.56  95.2  0.990
## 2    21   4.43  83.6  0.988
## 3    50   0.657  1.93 0.658
## 4    51   0.527  1.69 0.629

In log odds the difference is constant (it is a linear relationship):

example$logodds[example$age == 21] - example$logodds[example$age == 20]

##          2 
## -0.1299931

example$logodds[example$age == 51] - example$logodds[example$age == 50]

##          4 
## -0.1299931

Translated back into odds, the difference is not constant:

example$odds[example$age == 21] - example$odds[example$age == 20]

##         2 
## -11.60945

example$odds[example$age == 51] - example$odds[example$age == 50]

##         4 
## -0.235046

The multiplication, however, IS constant (this is the odds ratio):

example$odds[example$age == 21]/example$odds[example$age == 20]

##         2 
## 0.8781015

example$odds[example$age == 51]/example$odds[example$age == 50]

##         4 
## 0.8781015

This is the same thing we saw visually in Figure 2!

GLM as a classifier

Question B1

From the model we created in the earlier exercises:

drunkdoor <- read_csv("")
model1 <- glm(notice ~ age, data = drunkdoor, family="binomial")

Add a new column to the drunkdoor dataset which contains the predicted probability of the outcome for each observation.
Then, using ifelse(), add another column which is these predicted probabilities translated into the predicted binary outcome (0 or 1) based on whether the probability is greater than 0.5.
Create a two-way contingency table of the predicted outcome and the observed outcome.

Hint: you don’t need the newdata argument for predict() if you want to use the original data the model was fitted on.

Solution

drunkdoor <-
  drunkdoor %>%
  mutate(
    predprobs = predict(model1, type="response"),
    predclass = ifelse(predprobs > 0.5, 1, 0)
  )


drunkdoor %>%
  select(notice, predclass) %>%
  table()

##       predclass
## notice  0  1
##      0 42 19
##      1 20 39

A table of predicted outcome vs observed outcome sometimes gets referred to as a confusion matrix, and we can think of the different cells in general terms (Figure 5).
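A well-fitting model will tend to have a high proportion of correct classifications, (TP + TN)/n, or, put another way, a low proportion of misclassifications, (FP + FN)/n.
Strictly speaking, the model is fitted by maximising the likelihood rather than by maximising classification accuracy, but the idea is similar in spirit to the good old idea of minimising sums of squares (where we minimise the extent to which the predicted values differ from the observed values).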

Figure 5: Confusion Matrix

Question B2

What percentage of the n = 120 observations are correctly classified by our model, when the threshold is set at 0.5?

Solution

sum(drunkdoor$predclass == drunkdoor$notice) / nrow(drunkdoor)

## [1] 0.675

The model correctly classifies 67.5% of the observations.
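The same figure can be recovered directly from the cells of the confusion matrix. The sketch below enters the counts from the contingency table by hand and also computes two other commonly reported summaries, sensitivity and specificity; the object names are illustrative.

```r
# counts copied from the contingency table
# rows = observed notice (0, 1), columns = predicted class (0, 1)
conf_mat <- matrix(c(42, 19,
                     20, 39),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(notice = c("0", "1"),
                                   predclass = c("0", "1")))

TN <- conf_mat["0", "0"]; FP <- conf_mat["0", "1"]
FN <- conf_mat["1", "0"]; TP <- conf_mat["1", "1"]

accuracy    <- (TP + TN) / sum(conf_mat)  # proportion correctly classified
sensitivity <- TP / (TP + FN)             # observed 1s predicted as 1
specificity <- TN / (TN + FP)             # observed 0s predicted as 0

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```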

Exercises

Approaching a research question

Question C1

Recall our research question, which we will now turn to:

Research Question: Is susceptibility to change blindness influenced by level of alcohol intoxication and perceptual load?

Try to make a mental list of the different relationships between variables that this question invokes. Can you identify one variable as the ‘outcome’ or ‘response’ variable? (It often helps to think about the implicit direction of the relationship in the question.)

Solution

Question C2

Think about our outcome variable and how it is measured. What type of data is it? Numeric? Categorical?
What type of distribution does it follow? For instance, do values vary around a central point, or fall into one of various categories, or follow the count of successes in a number of trials?

Solution

Question C3

Think about our explanatory variable(s). Is there more than one? What type of variables are they? Do we want to model these together? Might they be correlated?
Are there any other variables (measured or unmeasured) which our prior knowledge or theory suggests might be relevant?

Solution

We know that drunkdoor$bac is simply the observed blood alcohol content (BAC). This is technically measured as a proportion, but for the current purposes we can just treat it as any other numeric scale. However, we might consider rescaling the variable: instead of the coefficient representing the change when moving from 0% to 1% BAC (1% blood alcohol is fatal!), we might want it to represent the change associated with moving from 0% to 0.01% BAC (i.e., we want to talk about effects in terms of a change of 1/100th of a percentage of BAC). The drunkdoor$condition variable is an experimental manipulation. That is, the researchers had control over which observations fell into which group. We therefore wouldn’t expect any correlation between this and the bac variable (unless researchers did not allocate participants to conditions randomly).

If we were actually doing research in this area, we might already have the idea based on previous research that change-blindness appears to vary depending upon age. From the earlier exercises here we have some evidence to corroborate this. We may therefore want to include this in our model. If we don’t, any results might simply be due to, e.g., older people tending to be more highly intoxicated (and so we see an effect of alcohol that might actually be simply the effect of age).

We might also think about any other possible variables which might influence our results, even if we didn’t measure them. This sort of thinking becomes important in the discussion section.

John Stuart Mill - Three Criteria for Causality
To varying extents, all of these criteria can be incredibly difficult to satisfy. Criterion 3 especially is one of the things that makes scientific investigation so interesting.

1. The cause precedes the effect
2. The cause is demonstrably related to the effect
3. There are no plausible alternative explanations

Mill, J. S. (1869). A System of Logic, Ratiocinative and Inductive: Being a Connected View of the Principles of Evidence and the Methods of Scientific Investigation. Harper and brothers.

Fitting the model

Question C4

Write a sentence describing the model you will fit. It might help to also describe each variable as you introduce it.
Fit the model.

Things to think about:

Do you want BAC on the current scale, or could you transform it somehow?
Is condition a factor? What is your reference level? Have you checked contrasts(drunkdoor$condition)? (This will only work if you make it a factor first)

Solution

Whether or not participants noticed the swap mid-conversation (binary 0 vs 1) is modelled using logistic regression, with blood alcohol content (measured in 100ths of a percentage), perceptual load condition (low load vs high load, with low as the reference level), and age (in years) as predictors.

In the sentence above, I stated that I want blood alcohol in terms of 100ths of percentages, rather than percentages.

drunkdoor <- drunkdoor %>% 
  mutate(
    bac100 = bac*100
  )

I also stated that the low-load will be the reference level.
Currently it is the other way around:

# make it a factor
drunkdoor$condition<-factor(drunkdoor$condition)
contrasts(drunkdoor$condition)

##      Low
## High   0
## Low    1

So let’s change it:

# change 0,1 column to 1,0, and rename it so it 
# compares high against low (not low against high)
contrasts(drunkdoor$condition) <- cbind(High=c(1,0))
contrasts(drunkdoor$condition)

##      High
## High    1
## Low     0
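An alternative route to the same reference level, sketched here on a toy vector rather than the real data, is to set the order of the factor levels so that "Low" comes first. The object names condition and condition_f are hypothetical. With R's default treatment contrasts, the resulting coefficient would still be labelled conditionHigh.

```r
# toy vector standing in for drunkdoor$condition
condition <- c("Low", "High", "High", "Low")

# listing "Low" first makes it the reference level
condition_f <- factor(condition, levels = c("Low", "High"))
contrasts(condition_f)

# relevel() does the same for an existing factor
contrasts(relevel(factor(condition), ref = "Low"))
```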

changeblind_model <- glm(notice ~ bac100 + condition + age, data = drunkdoor, family = "binomial")
summary(changeblind_model)

## 
## Call:
## glm(formula = notice ~ bac100 + condition + age, family = "binomial", 
##     data = drunkdoor)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.01149  -0.31516  -0.00735   0.26706   2.23101  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   17.01764    3.55483   4.787 1.69e-06 ***
## bac100         0.60763    0.17431   3.486 0.000491 ***
## conditionHigh -5.60871    1.11634  -5.024 5.06e-07 ***
## age           -0.30482    0.06232  -4.891 1.00e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 166.322  on 119  degrees of freedom
## Residual deviance:  66.763  on 116  degrees of freedom
## AIC: 74.763
## 
## Number of Fisher Scoring iterations: 6

Question C5

Compute 95% confidence intervals for the log-odds coefficients using confint(). Wrap the whole thing in exp() in order to convert all these back into odds and odds-ratios.

Try the sjPlot package, using plot_model() on your model. What do you get? Tip: for some people this plot seems to miss out the BAC effect, so you might need to add:

plot_model(model) +
  scale_y_log10(limits = c(1e-05,10))

Solution

confint(changeblind_model)

##                    2.5 %     97.5 %
## (Intercept)   11.0145707 25.2161032
## bac100         0.3009620  0.9932602
## conditionHigh -8.1588018 -3.7128209
## age           -0.4487299 -0.1994882

exp(confint(changeblind_model))

##                      2.5 %       97.5 %
## (Intercept)   6.075294e+04 8.937467e+10
## bac100        1.351158e+00 2.700023e+00
## conditionHigh 2.862051e-04 2.440857e-02
## age           6.384385e-01 8.191499e-01
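For comparison, a quicker approximation is the Wald interval (estimate plus or minus 1.96 standard errors), which is a different technique from the profile-likelihood intervals that confint() gives for a glm, so the numbers will not match exactly. The sketch below uses the estimates and standard errors copied from the summary() output; the object names are illustrative.

```r
# estimates and standard errors copied from summary(changeblind_model)
est <- c(bac100 = 0.60763, conditionHigh = -5.60871, age = -0.30482)
se  <- c(bac100 = 0.17431, conditionHigh = 1.11634,  age = 0.06232)

# Wald 95% interval on the log-odds scale: estimate +/- 1.96 * SE
z <- qnorm(0.975)
wald <- cbind(lower = est - z * se, upper = est + z * se)

# exponentiate to get odds ratios with approximate intervals
round(exp(cbind(OR = est, wald)), 3)
```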

sjPlot::plot_model(changeblind_model)+
  geom_hline(yintercept=1)+
  scale_y_log10(limits = c(1e-05,10))

Looking beyond

This week we have looked at one specific type of Generalised Linear Model, in order to fit a binary logistic regression. We can use GLM to fit all sorts of models, depending on what type of data our outcome variable is, and this is all through the family = part of the model syntax.

For instance, if we had data which was binomial but with an $n > 1$, for instance the number of correct answers in 10 trials:

participant	trials_correct	trials_incorrect	x1
id1	1	9	114
id2	4	6	113
id3	7	3	83
id4	8	2	141
id5	3	7	115
...	...	...	...

We could model this with family = "binomial" using the two columns as the outcome:
glm(cbind(trials_correct, trials_incorrect) ~ x1, data = data, family = "binomial")
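A runnable sketch of this, using simulated data in place of the table above (the data frame dat and the values used to generate it are hypothetical):

```r
set.seed(10)

# simulated example: 50 participants, 10 trials each
dat <- data.frame(x1 = round(rnorm(50, mean = 110, sd = 15)))
dat$trials_correct   <- rbinom(50, size = 10, prob = plogis(-4 + 0.035 * dat$x1))
dat$trials_incorrect <- 10 - dat$trials_correct

# the outcome is the pair of columns: successes and failures
m_binom <- glm(cbind(trials_correct, trials_incorrect) ~ x1,
               data = dat, family = "binomial")

coef(m_binom)  # still on the log-odds scale
```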

Or if we had count data, which can range from 0 to Infinity (theoretically), e.g.:

person	n_fish_caught	age
id1	8	44
id2	7	46
id3	13	29
id4	6	56
id5	14	37
...	...	...

We could model this using family = "poisson":
glm(n_fish_caught ~ age, data = data, family = "poisson")
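Again as a sketch with simulated data (dat and its generating values are hypothetical), note that the Poisson family uses a log link by default, so exponentiated coefficients describe multiplicative changes in the expected count:

```r
set.seed(11)

# simulated example: number of fish caught by 50 people
dat <- data.frame(age = round(runif(50, 20, 60)))
dat$n_fish_caught <- rpois(50, lambda = exp(3 - 0.02 * dat$age))

m_pois <- glm(n_fish_caught ~ age, data = dat, family = "poisson")

# multiplicative change in expected count per additional year of age
exp(coef(m_pois))
```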

If you want a really nice resource to help you in your future studies, then https://bookdown.org/roback/bookdown-BeyondMLR/ is an excellent read.

For real studies demonstrating the effects used in the example here, see:

¹ In this case it happens to be the special case of a binomial where $n=1$, which sometimes gets referred to as ‘binary logistic regression’.

This workbook was written by Josiah King, Umberto Noe, and Martin Corley, and is licensed under a Creative Commons Attribution 4.0 International License.