class: center, middle, inverse, title-slide

# Week 7: The Linear Model
## Univariate Statistics and Methodology using R
### Martin Corley
### Department of Psychology
The University of Edinburgh

---
class: inverse, center, middle

# Part 1: Correlation++

---
# Back on the Road

.pull-left[
![](lecture_6_files/figure-html/frecap-1.svg)<!-- -->

.center[
`\(r= 0.4648\)`, `\(p = 0.0007\)`
]
]

.pull-right[
![](lecture_6_files/img/playmo_police.jpg)
]

???
- is there anything more the police's data about blood alcohol and reaction time can tell us?
- so far we know they're correlated, but it would be good if we could say something about _how bad_ blood alcohol is
  + how much worse does drinking more make things?

---
# Back on the Road (2)

.pull-left[
![](lecture_6_files/figure-html/frecap2-1.svg)<!-- -->

.tc[
"for every extra 0.01% blood alcohol, reaction time slows down by around 32 ms"
]
]

.pull-right[
![](lecture_6_files/img/playmo_police.jpg)
]

???
- this kind of information is based on the assumption that the relationship between blood alcohol and RT is _linear_
- that is, that each additional unit of blood alcohol affects reaction time by the same amount

---
# The Only Equation You Will Ever Need

.br3.pa2.f2.white.bg-gray[
$$ \textrm{outcome}_i = (\textrm{model})_i + \textrm{error}_i $$
]

.center.pt2[
![:scale 25%](lecture_6_files/img/model_error.svg)
]

---
count: false
# The Only Equation You Will Ever Need

.br3.pa2.f2.white.bg-gray[
$$ \textrm{outcome}_i = (\textrm{model})_i + \textrm{error}_i $$
]

- to get any further, we need to make _assumptions_

- nature of the **model** .tr[
(linear)
]

- nature of the **errors** .tr[
(normal)
]

---
# A Linear Model

.flex.items-top[
.w-40.pa2[
$$ \color{red}{\textrm{outcome}_i} = \color{blue}{(\textrm{model})_i} + \textrm{error}_i $$

$$ \color{red}{y_i} = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_i} + \epsilon_i $$

.center[
so the linear model itself is...
]

$$ \hat{y}_i = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_i} $$

.center[
![:scale 30%](lecture_6_files/img/formula.svg)
]
]
.w-60[
]]

---
count: false
# A Linear Model

.flex.items-top[
.w-40.pa2[
$$ \color{red}{\textrm{outcome}_i} = \color{blue}{(\textrm{model})_i} + \textrm{error}_i $$

$$ \color{red}{y_i} = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_i} + \epsilon_i $$

.center[
so the linear model itself is...
]

$$ \hat{y}_i = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_i} $$

.center[
![:scale 30%](lecture_6_files/img/formula.svg)
]]
.w-60.pa2[
![](lecture_6_files/figure-html/bb-1.svg)<!-- -->
]]

---
count: false
# A Linear Model

.flex.items-top[
.w-40.pa2[
$$ \color{red}{\textrm{outcome}_i} = \color{blue}{(\textrm{model})_i} + \textrm{error}_i $$

$$ \color{red}{y_i} = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_i} + \epsilon_i $$

.center[
so the linear model itself is...
]

$$ \hat{y}_i = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_i} $$

.center[
![:scale 30%](lecture_6_files/img/formula.svg)
]

.br3.pa3.bg-light-yellow[
$$ \hat{y}_i = \color{blue}{b_0 + b_1 \cdot{} x_i} $$

.center[
![:scale 30%](lecture_6_files/img/formula2.svg)
]]
]
.w-60.pa2[
![](lecture_6_files/figure-html/bb2-1.svg)<!-- -->
]]

---
# Take An Observation

.flex.items-top[
.w-40.pa2[
.br3.bg-gray.white.f2.pa2.tc[
x<sub>i</sub> = 1.2, y<sub>i</sub> = 9.9
]

$$ \hat{y}_i = b_0 + b_1\cdot{}x_i = 7.4 $$

$$ y_i = \hat{y}_i + \epsilon_i = 7.4 + \color{red}{2.5} $$
]
.w-60.pa2[
![](lecture_6_files/figure-html/errplot-1.svg)<!-- -->
]
]

???
- `\(\hat{y}_i\)` is what the model _predicts_ for `\(x_i\)`
- `\(y_i\)` is the actual value that was observed for `\(x_i\)`
- why would we care?
  + for one thing, the model can predict `\(\hat{y}\)` for values of `\(x\)` that we have never observed
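---
# Aside: The Model in Code

A minimal sketch of the two assumptions above, not part of the lecture code: every value here (the intercept 5, the slope 2, the `x` values, the error SD) is invented purely for illustration.

```r
set.seed(1)                               # arbitrary seed, for reproducibility
b0 <- 5; b1 <- 2                          # a "true" intercept and slope (invented)
x  <- runif(20, 0, 3)                     # 20 invented predictor values
y  <- b0 + b1 * x + rnorm(20, sd = 1.5)   # outcome = (linear) model + (normal) error
y.hat <- b0 + b1 * x                      # what the model predicts
head(data.frame(x, y, y.hat, error = y - y.hat))
```

every `\(y_i\)` is its model prediction `\(\hat{y}_i\)` plus a normally distributed error `\(\epsilon_i\)`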
---
# Simplifying the Data

.pull-left[
- **NB** we're doing this for illustrative purposes only

```r
ourDat <- dat %>% sample_n(20) # take 20 data points
```
]
.pull-right[
![](lecture_6_files/figure-html/plot1-1.svg)<!-- -->
]

---
count: false
# Simplifying the Data

.pull-left[
- **NB** we're doing this for illustrative purposes only

```r
ourDat <- dat %>% sample_n(20) # take 20 data points
```

```r
ourDat <- ourDat %>% mutate(BloodAlc = BloodAlc*100)
```
]
.pull-right[
![](lecture_6_files/figure-html/plot3-1.svg)<!-- -->
]

---
count: false
# Simplifying the Data

.pull-left[
- **NB** we're doing this for illustrative purposes only

```r
ourDat <- dat %>% sample_n(20) # take 20 data points
```

```r
ourDat <- ourDat %>% mutate(BloodAlc = BloodAlc*100)
```

.center[
"for every extra unit of blood alcohol, reaction time slows down by around 33 ms"
]
]
.pull-right[
![](lecture_6_files/figure-html/plot2-1.svg)<!-- -->
]

--

.center.f3.pt3[
but how can we evaluate our model?
]

---
# Linear Models in R

```r
mod <- lm(RT ~ BloodAlc, data=ourDat)
```

---
count: false
# Linear Models in R

```r
mod <- lm(RT ~ BloodAlc, data=ourDat)
summary(mod)
```

```
## 
## Call:
## lm(formula = RT ~ BloodAlc, data = ourDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -121.18  -39.10   -4.71   44.55  121.14 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
*## (Intercept)    322.6      158.0    2.04    0.056 .
*## BloodAlc        32.7       15.4    2.12    0.048 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.7 on 18 degrees of freedom
## Multiple R-squared:  0.2,  Adjusted R-squared:  0.156 
## F-statistic: 4.51 on 1 and 18 DF,  p-value: 0.0479
```

---
# Intercept and Slope Again

$$ b_0= 322.6; \quad b_1= 32.7 $$

.center[
![](lecture_6_files/figure-html/git-1.svg)<!-- -->
]

???
- `\(b_0\)` is the predicted value, `\(\hat{y}\)`, when `\(x=0\)`. Here I've represented it as a line, with its top at the intercept value of 322.629

---
count: false
# Intercept and Slope Again

$$ b_0= 322.6; \quad b_1= 32.7 $$

.center[
![](lecture_6_files/figure-html/git2-1.svg)<!-- -->
]

---
count: false
# Intercept and Slope Again

$$ b_0= 322.6; \quad b_1= 32.7 $$

.center[
![](lecture_6_files/figure-html/git3-1.svg)<!-- -->
]

---
count: false
# Intercept and Slope Again

$$ b_0= 322.6; \quad b_1= 32.7 $$

.center[
![](lecture_6_files/figure-html/git4-1.svg)<!-- -->
]

???
- note that the intercept is really very far from the data we're interested in.
- it may be pretty meaningless to talk about the reaction time of someone with zero blood alcohol
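---
# Getting at the Coefficients

Rather than reading `\(b_0\)` and `\(b_1\)` off the printout, we can ask the fitted model directly. A minimal sketch, assuming the `mod` and `ourDat` objects above (the blood-alcohol values passed to `predict()` are invented for illustration, on the same ×100 scale as `ourDat`):

```r
coef(mod)            # named vector of b0 ("(Intercept)") and b1 ("BloodAlc")

# predicted RT for blood-alcohol levels we never actually observed
newBA <- data.frame(BloodAlc = c(8, 10, 12))   # invented values
predict(mod, newdata = newBA)
```

each prediction is just `\(b_0 + b_1 \cdot{} x\)` for the corresponding value of `BloodAlc`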
---
class: inverse, center, middle, animated, fadeInDownBig

# End of Part 1

---
class: inverse, center, middle

# Part 2

## Significance

---
# Intercept and Slope

```r
mod <- lm(RT ~ BloodAlc, data=ourDat)
summary(mod)
```

```
## 
## Call:
## lm(formula = RT ~ BloodAlc, data = ourDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -121.18  -39.10   -4.71   44.55  121.14 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
*## (Intercept)    322.6      158.0    2.04    0.056 .
*## BloodAlc        32.7       15.4    2.12    0.048 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.7 on 18 degrees of freedom
## Multiple R-squared:  0.2,  Adjusted R-squared:  0.156 
## F-statistic: 4.51 on 1 and 18 DF,  p-value: 0.0479
```

---
# Are We Impressed?

- we have an intercept of 322.6 and a slope of 32.7
- in NHST world, our pressing question is

--

.pt4[ ]

.br3.bg-gray.pa2.white.tc[
how likely would we have been to find these parameters under the null hypothesis?
]

---
# Testing Chance

.center[
![](lecture_6_files/figure-html/chanceg-1.svg)<!-- -->
]

- repeatedly sampling 20 datapoints from the population
  + variability in _height_ of line = variability in intercept ( `\(b_0\)` )
  + variability in _angle_ of line = variability in slope ( `\(b_1\)` )

---
# We've Seen This Before

.center[
![](lecture_6_files/figure-html/figfig-1.svg)<!-- -->
]

- shaded area represents "95% confidence interval"
  + if we repeatedly sampled 20 items from the population...
  + assumes that the 20 we have are the _best estimate_ of the population

???
- if you were able to lay these graphs over each other, they wouldn't be the same
- in the simulation graph, we _know_ the entire population (of 50 drink-drivers) we're repeatedly sampling 20 from
- in this graph, the only information we have is that our 20 drink-drivers represent "the population", but we have no other information about that population
  + so the 20 are our "best estimate" of the population
  + **just like means and standard errors** (here the edges of the grey regions are 2.1009 standard errors from the line)

---
# The Good Old _t_-Test

```
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    322.6      158.0    2.04    0.056 .
## BloodAlc        32.7       15.4    2.12    0.048 *
```

- for each model parameter we are interested in whether it is _different from zero_
  - **intercept**: just like a mean
  - **slope**: does the best-fit line differ from horizontal?

--

- these are just (two-tailed) one-sample `\(t\)`-tests
  + the **standard error** is the standard deviation of the estimates we would expect if we repeated the sampling many times
  + the **t value** is `\(\frac{\textrm{Estimate}}{\textrm{Std. Error}}\)`
  + to calculate `\(p\)`, we need to know the _degrees of freedom_

---
# Degrees of Freedom

.center[
![](lecture_6_files/figure-html/plotdat-1.svg)<!-- -->
]

---
count: false
# Degrees of Freedom

.center[
![](lecture_6_files/figure-html/dfplots2-1.svg)<!-- -->
]

???
- we can always add in two points to create whatever straight line we want to see
- this is one way of showing that for `\(n\)` data points, there are `\(n-2\)` degrees of freedom

---
# Degrees of Freedom

- in fact we subtract 2 degrees of freedom because we "know" two things
  + intercept ( `\(b_0\)` )
  + slope ( `\(b_1\)` )
- the remaining degrees of freedom are the _residual_ degrees of freedom

--

- the _model_ also has associated degrees of freedom
  + 2 (intercept, slope) - 1 (knowing one affects the other)

.br3.bg-gray.pa2.white.tc[
the models we have been looking at have 20 observations and 1 predictor, so

**(1, 18)** degrees of freedom
]
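---
# Checking the Arithmetic

A minimal sketch, assuming the `mod` object above, that reproduces the `\(t\)` and `\(p\)` values in the coefficient table from the estimates, the standard errors, and the 18 residual degrees of freedom:

```r
tab   <- coef(summary(mod))                # Estimate, Std. Error, t value, Pr(>|t|)
t.val <- tab[, "Estimate"] / tab[, "Std. Error"]
p.val <- 2 * pt(abs(t.val), df = 18, lower.tail = FALSE)  # two-tailed, 20 - 2 df
round(cbind(t.val, p.val), 3)

confint(mod)   # b0 and b1, each +/- ~2.1 standard errors (t with 18 df)
```

these should match the `t value` and `Pr(>|t|)` columns printed by `summary(mod)`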
---
# Linear Models in R

```r
mod <- lm(RT ~ BloodAlc, data=ourDat)
summary(mod)
```

```
## 
## Call:
## lm(formula = RT ~ BloodAlc, data = ourDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -121.18  -39.10   -4.71   44.55  121.14 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    322.6      158.0    2.04    0.056 .
## BloodAlc        32.7       15.4    2.12    0.048 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.7 on 18 degrees of freedom
*## Multiple R-squared:  0.2,  Adjusted R-squared:  0.156 
*## F-statistic: 4.51 on 1 and 18 DF,  p-value: 0.0479
```

---
# Total Sum of Squares

$$ \textrm{total SS}=\sum{(y - \bar{y})^2} $$

.pull-left[
- sum of squared differences between observed `\(y\)` and the mean `\(\bar{y}\)`
- represents the total amount of variance in the model
- how much does the observed data vary from a model which says "there is no effect of `\(x\)`" (the **null model**)?
]
.pull-right[
![](lecture_6_files/figure-html/totss-1.svg)<!-- -->
]

---
# Residual Sum of Squares

$$ \textrm{residual SS} = \sum{(y - \hat{y})^2} $$

.pull-left[
- sum of squared differences between observed `\(y\)` and predicted `\(\hat{y}\)`
- represents the unexplained variance in the model
- how much does the observed data vary from the existing model?
]
.pull-right[
![](lecture_6_files/figure-html/residp-1.svg)<!-- -->
]

---
# Model Sum of Squares

$$ \textrm{model SS} = \sum{(\hat{y} - \bar{y})^2} $$

.pull-left[
- sum of squared differences between predicted `\(\hat{y}\)` and the mean `\(\bar{y}\)`
- represents the additional variance explained by the current model over the null model
]
.pull-right[
![](lecture_6_files/figure-html/modp-1.svg)<!-- -->
]

---
# Testing the Model: _R_<sup>2</sup>

.pull-left[
$$ R^2 = \frac{\textrm{model SS}}{\textrm{total SS}} = \frac{\sum{(\hat{y}-\bar{y})^2}}{\sum{(y-\bar{y})^2}} $$

"how much the model improves over the null"

.pt3[
- `\(0 \le R^2 \le 1\)`
- we want `\(R^2\)` to be large
- for a single predictor, `\(\sqrt{R^2} = |r|\)`
]]
.pull-right[
![](lecture_6_files/figure-html/unnamed-chunk-1-1.svg)<!-- -->
]

---
# Testing the Model: _F_

.pull-left[
`\(F\)` ratio depends on **mean squares**
<br/>
.tc[
( `\(\textrm{MS}_x = \textrm{SS}_x/\textrm{df}_x\)` )
]

`$$F=\frac{\textrm{model MS}}{\textrm{residual MS}}=\frac{\sum{(\hat{y}-\bar{y})^2}/\textrm{df}_m}{\sum{(y-\hat{y})^2}/\textrm{df}_r}$$`

"how much the model improves over chance"

.pt3[
- `\(0 < F\)`
- we want `\(F\)` to be large
- significance of `\(F\)` does not always equate to a large (or theoretically sensible) effect
]]
.pull-right[
![](lecture_6_files/figure-html/unnamed-chunk-2-1.svg)<!-- -->
]
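---
# _R_<sup>2</sup> and _F_ by Hand

To connect these formulas to the R output, a minimal sketch (again assuming the `mod` and `ourDat` objects from earlier) computing the three sums of squares, `\(R^2\)`, and `\(F\)` directly:

```r
y     <- ourDat$RT
y.hat <- fitted(mod)                  # the model's predicted values
ss.t  <- sum((y - mean(y))^2)         # total SS
ss.r  <- sum((y - y.hat)^2)           # residual SS
ss.m  <- sum((y.hat - mean(y))^2)     # model SS

c(R.squared = ss.m / ss.t,
  F = (ss.m / 1) / (ss.r / 18))       # df_m = 1, df_r = 18
```

these should reproduce the `Multiple R-squared` and `F-statistic` lines of `summary(mod)`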
---
# A Linear Model for 20 Drinkers

.center[
![](lecture_6_files/figure-html/mod-1.svg)<!-- -->
]

- a linear model describes the **best-fit line** through the data
- minimises the error terms `\(\epsilon\)` or **residuals**

---
# Two Types of Significance

```r
mod <- lm(RT ~ BloodAlc, data=ourDat)
summary(mod)
```

```
## 
## Call:
## lm(formula = RT ~ BloodAlc, data = ourDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -121.18  -39.10   -4.71   44.55  121.14 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
*## (Intercept)    322.6      158.0    2.04    0.056 .
*## BloodAlc        32.7       15.4    2.12    0.048 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.7 on 18 degrees of freedom
*## Multiple R-squared:  0.2,  Adjusted R-squared:  0.156 
*## F-statistic: 4.51 on 1 and 18 DF,  p-value: 0.0479
```

---
# The Good, the Bad, and the Ugly

.flex.items-center[
.w-50.br3.bg-washed-green.pa2[
- we can easily extend this approach
  + use more than one predictor
  + generalised linear model
]
.w-50.br3.bg-washed-red.pa2[
- not a panacea
  + depends on _assumptions_ about the data
  + depends on _decisions_ about analysis
]]

--

.flex.items-center[
.w-70.br3.pa2.bg-gray.white[
- like other statistics, linear models don't tell you "about" your data
- they simply assess what is (un)likely to be due to chance
- the key to good statistics is _common sense and good interpretation_
]
.w-30.center[
![:scale 70%](lecture_6_files/img/playmo_clown.jpg)
]
]

---
class: inverse, center, middle, animated, fadeInDownBig

# End