class: center, middle, inverse, title-slide

# Binary Logistic Model
## Data Analysis for Psychology in R 2
### dapR2 Team
### Department of Psychology, The University of Edinburgh
### AY 2020-2021

---
# Week's Learning Objectives
1. Identify and provide examples of binary psychological outcomes.
2. Understand why a standard LM is not appropriate for binary data.
3. Fit and interpret a logistic model.

---
# Topics for today
+ Logistic regression
  + Why do we need logistic regression?
  + The logistic regression model
  + Overall model evaluation
  + Logistic regression in R

---
# Binary outcomes
+ Thus far we have discussed:
  + Linear regression with a continuous DV
  + Linear regression with categorical (including binary) predictors

+ What if we have a binary outcome variable? For example:
  + Healthy vs diseased
  + Died vs survived
  + Hired vs not hired
  + Correct vs incorrect

+ When we have a binary outcome variable, linear regression is no longer appropriate.

+ Let's see what happens when we fit a linear regression model with a binary outcome variable.

---
# Applying linear regression to binary outcomes
+ We can code our outcome in terms of whether or not an event happened:
  + `\(Y=1\)` for a job offer (event occurred)
  + `\(Y=0\)` for no job offer (event did not occur)

+ If we then fit a linear regression model, our model predicts the probability of the event occurring, `\(P(Y=1)\)` or `\(P(y_i)\)`

---
# Some example data
+ Imagine we're interested in predicting hiring decisions.

+ We collect data on n=242 job-seekers, measuring:
  + Age
  + Effort put into the job application

+ Our variables:
  + DV: `work` (0 = did not get job; 1 = did get job)
  + IV1: `age` (in years)
  + IV2: `msrch` (effort put into the job application; 0 = low effort, 1 = high effort)

---
# Visualize Our Data
.pull-left[
+ Here we can see:

1. Slightly more people were hired than not (light blue bar in the left-hand plot)
2. Those who were hired (blue) are younger than those who were not (red) (density plot, top right)
3. A greater proportion of those who were hired (right-hand bar of the bottom-right plot) put high effort into their application (blue)
]

.pull-right[
![](dapR2_lec33_BinaryLogistic1_files/figure-html/unnamed-chunk-2-1.png)<!-- -->
]

---
# Proportion hired

```r
hire %>%
  group_by(work) %>%
  summarise(
    n = n()) %>%
  mutate(
    Prop = round(n/sum(n), 2)
  )
```

```
## # A tibble: 2 x 3
##   work      n  Prop
## * <fct> <int> <dbl>
## 1 0       117  0.48
## 2 1       125  0.52
```

---
# Linear probability model
+ So let's just use `lm()` to predict our outcome (hire vs no hire)

+ **Note**: R will actually try to stop us doing this if `work` is a factor
  + So we have to pretend to R that it is numeric.

```r
hire <- hire %>%
  mutate(
*    work = as.numeric(work)
  )

m1 <- lm(work ~ age, data = hire)
```

---
# Linear probability model
+ Nothing looks too amiss here

```
## 
## Call:
## lm(formula = work ~ age, data = hire)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7971 -0.4497  0.2345  0.4240  0.7082 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.028641   0.320801   9.441  < 2e-16 ***
## age         -0.031578   0.006668  -4.736 3.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4799 on 240 degrees of freedom
## Multiple R-squared:  0.08545, Adjusted R-squared:  0.08164 
## F-statistic: 22.42 on 1 and 240 DF,  p-value: 3.739e-06
```

+ But let's look at our assumption checks.

---
# Violated assumptions
.pull-left[
+ The plots here were produced using the `check_model()` function from the `performance` package.

+ These are not ideally formatted plots for reports, but it is a nice quick way to look at all assumption plots.

+ It can also be used for other diagnostics, e.g. `vif`.
]

.pull-right[
![](dapR2_lec33_BinaryLogistic1_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

---
# Impossible probabilities
.pull-left[
+ And there is a further issue.

+ By definition, probabilities must be between 0 and 1.

+ But our model predicts some `\(P(Y=1)\)` values > 1 within the plausible range of age values
]

.pull-right[
![](dapR2_lec33_BinaryLogistic1_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]
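---
# Impossible probabilities
+ We can also check this numerically with `predict()`; a quick sketch (the ages below are arbitrary values chosen for illustration):

```r
# predicted "probabilities" from the linear model m1 at a few ages
predict(m1, newdata = data.frame(age = c(25, 45, 65, 85)))

# with an intercept of ~3.03 and an age slope of ~-0.032,
# the predicted values exceed 1 for ages below roughly 64
```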
---
# The problem with linear regression
+ In general, when we apply linear regression to binary outcomes:
  + The distribution of the residuals is bimodal (not normal)
  + The variance of the residuals is not constant
  + The relation between `\(x\)` and `\(y\)` is not linear
  + Predicted probabilities are not constrained to be between 0 and 1

+ The logistic regression model solves each of these problems.

---
class: center, middle

# Time for a break

---
class: center, middle

# Welcome Back!
**Let's get some details on the logistic model**

---
# The logistic regression model
+ In logistic regression, we predict the probability that `\(Y=1\)`, `\(P(y_i)\)`, from our predictors, using:

`$$P(y_i) = \frac{1}{1+e^{-(\beta_0 + \beta_1x_1)}}$$`

+ `\(e\)` is the exponential constant (approximately 2.718)
+ `\(\beta_0 + \beta_1x_1\)` is a linear combination with:
  + a constant `\(\beta_0\)` (intercept), and
  + `\(\beta_1\)` capturing the effect of `\(x_1\)` on the outcome `\(y\)`

---
# The logistic regression model

<img src="logistic_model.png" width="70%" style="display: block; margin: auto;" />

---
# Probability, odds and log-odds
+ An alternative way to think about this is in terms of probability, odds and log-odds.

+ Our presentation above passes the linear combination of predictors through an exponential function so that (a) predictions are bounded between 0 and 1, and (b) we keep a linear representation underneath.

+ The alternative way to view it is to change what we are predicting...

---
# Probability, odds and log-odds
.pull-left[
+ `\(Probability = P(Y_i)\)`

+ `\(odds = \frac{P(Y=1)}{1-P(Y=1)}\)`

+ `\(logodds = ln \left (\frac{P(Y=1)}{1-P(Y=1)} \right)\)`
]

.pull-right[
<img src="MC_logistic.png" width="80%" style="display: block; margin: auto;" />
]
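---
# Probability, odds and log-odds
+ These three quantities are just transformations of one another; a minimal sketch in R (the example probability of 0.75 is arbitrary):

```r
p <- 0.75                # an example probability
odds <- p / (1 - p)      # odds = 0.75/0.25 = 3
logodds <- log(odds)     # natural log of the odds, ~1.10

# and back from log-odds to probability:
exp(logodds) / (1 + exp(logodds))   # recovers p = 0.75

# equivalently, using R's built-in logistic functions:
qlogis(p)          # probability -> log-odds
plogis(logodds)    # log-odds -> probability
```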
---
# Logistic models with different slopes
+ Given that the log-odds have a linear relation with the predictors, we can think about predicting these:

`$$ln \left (\frac{P(Y=1)}{1-P(Y=1)} \right) = \beta_0 + \beta_1x_1$$`

+ and we can easily extend to models with multiple predictors:

`$$P(y_i) = \frac{1}{1+e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k)}}$$`

+ Or

`$$ln \left (\frac{P(Y=1)}{1-P(Y=1)} \right) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k$$`

---
# Estimating logistic regression coefficients
+ Linear regression models can be estimated using least squares estimation

+ Logistic regression models are estimated using maximum likelihood estimation (MLE)
  + MLE finds the logistic regression coefficients that *maximise the likelihood of the observed data having occurred*

+ While least squares estimation minimises the SSE to find the coefficients of the line of best fit, MLE maximises the log-likelihood
  + Larger log-likelihood values indicate better fitting models

---
# Overall model evaluation
+ To evaluate overall model fit in logistic regression we:
  + Compare our model to a baseline model with no predictors (null model)
  + Assess the improvement in fit

+ *Note we have already seen this idea when discussing the `\(F\)`-test and incremental `\(F\)`-test*

+ The baseline model is one where the predicted values for the DV are based on the most frequent value of the DV (0 or 1)
  + Our best guess of the DV value in the absence of informative predictors.
  + Analogous to using the mean DV value as the baseline in linear regression.

---
# Overall model evaluation
+ We compare our model with the baseline model using **deviance**:

`$$deviance = -2*loglikelihood$$`

+ Deviance is often denoted **-2LL**

---
# Overall model evaluation
+ We calculate the -2LL difference between our model and the baseline model.

+ We assess the statistical significance of the -2LL difference to see if our model significantly improves on the baseline model.

+ We compare the -2LL difference to a `\(\chi^2\)` distribution with `\(df = k\)`
  + `\(k\)` = number of predictors in the model

+ A significant `\(p\)`-value indicates that our model improves on the baseline model.

+ Assessing the -2LL difference in this way is called a likelihood ratio test (or chi-square difference test).

---
class: center, middle

# Time for a break

---
class: center, middle

# Welcome Back!
**Logistic regression in R**

---
# The `glm()` function
+ In R, we conduct logistic regression using the `glm()` function
  + `glm` stands for **g**eneralised **l**inear **m**odel
  + More on this next week

+ Very similar in structure to the `lm()` function

---
# Run our model: a little more code

```r
*m2 <- glm(work ~ age + msrch, data = hire,
          family = "binomial")
```

+ `glm()`: the R function for running generalised linear models.
  + Provide a formula in the same style as `lm()`
  + Provide the name of the dataset

---
# Run our model: a little more code

```r
m2 <- glm(work ~ age + msrch, data = hire,
*         family = "binomial")
```

+ The new bit.

+ We need to state what family of probability distributions we want for our DV.
  + This relates to the type of variable our DV is.
  + We will talk a little more about this next lecture.

+ For now, for a binary variable, we want `family = "binomial"`

---
# Quick aside: Why binomial?
+ The binomial distribution is a discrete probability distribution (you do not need to remember the probability mass function below, don't worry):

`$$f(k,n,p) = Pr(X = k) = \binom{n}{k}p^{k}q^{n-k}$$`

+ `\(k\)` = number of successes
+ `\(n\)` = total number of trials
+ `\(p\)` = probability of success
+ `\(q\)` = `\(1-p\)`, or the probability of failure
+ `\(\binom{n}{k}\)` = `\(n\)` choose `\(k\)`, or the number of ways to select `\(k\)` unordered items from a set of `\(n\)` items.

+ Practically, we can think of every hire in our example as a "success", and every no-hire as a "failure".
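---
# Quick aside: Why binomial?
+ To make the formula concrete, a quick sketch in R (the values of `\(n\)`, `\(k\)` and `\(p\)` are arbitrary illustrative choices):

```r
n <- 10   # total number of trials
k <- 7    # number of successes
p <- 0.5  # probability of success on each trial

# the probability mass function written out by hand: ~0.117
choose(n, k) * p^k * (1 - p)^(n - k)

# R's built-in binomial pmf gives the same value
dbinom(k, size = n, prob = p)
```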
---
# `glm()` output

```r
m2 <- glm(work ~ age + msrch, data = hire,
          family = "binomial")
summary(m2)
```

---
# `glm()` output

```
## 
## Call:
## glm(formula = work ~ age + msrch, family = "binomial", data = hire)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8834  -1.0496   0.6436   0.9204   2.0589  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.52505    1.56972   2.883 0.003943 ** 
## age         -0.11848    0.03214  -3.687 0.000227 ***
## msrch1       1.68335    0.33446   5.033 4.83e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.22  on 241  degrees of freedom
## Residual deviance: 285.36  on 239  degrees of freedom
## AIC: 291.36
## 
## Number of Fisher Scoring iterations: 4
```

---
# Overall Model Test

```r
m2_null <- glm(work ~ 1, family = "binomial", data = hire)
anova(m2_null, m2, test = "Chisq")
```

```
## Analysis of Deviance Table
## 
## Model 1: work ~ 1
## Model 2: work ~ age + msrch
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       241     335.22                          
## 2       239     285.36  2   49.858 1.491e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
# Summary of today
+ For binary outcomes we use logistic rather than linear regression.

+ Logistic regression predicts the probability of `\(Y=1\)` from our IVs.

+ Logistic regression is estimated using maximum likelihood estimation.

+ Implemented using the `glm()` function in R.

+ The model can be evaluated by comparison with a baseline model using a likelihood ratio test.
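---
# Extra: deviance and the model test by hand
+ A minimal sketch linking the output above back to the deviance formula, assuming the `m2` and `m2_null` objects fitted earlier:

```r
# deviance = -2 * log-likelihood
logLik(m2)                    # log-likelihood of our model
-2 * as.numeric(logLik(m2))   # ~285.36, the residual deviance reported above
deviance(m2)                  # the same value, extracted directly

# likelihood ratio (chi-square difference) test by hand
lrt <- deviance(m2_null) - deviance(m2)    # 335.22 - 285.36 = ~49.86
pchisq(lrt, df = 2, lower.tail = FALSE)    # matches the p-value from anova() above
```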