Univariate Statistics and Methodology using R
Psychology, PPLS
University of Edinburgh
...
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
...
there is no relationship between our independent and dependent variables
y ~ 1
or RT ~ 1
how much better does RT ~ BloodAlc do
than RT ~ 1?
\[ \textrm{total SS}=\sum{(y - \bar{y})^2} \]
sum of squared differences between observed \(y\) and mean \(\bar{y}\)
= total amount of variance in the data
= total unexplained variance from the null model
\[\textrm{residual SS} = \sum{(y - \hat{y})^2}\]
sum of squared differences between observed \(y\) and predicted \(\hat{y}\)
= total unexplained variance from a given model in the data
\[ \textrm{model SS} = \sum{(\hat{y} - \bar{y})^2} \]
sum of squared differences between predicted \(\hat{y}\) and mean \(\bar{y}\)
= total variance explained by a given model
\[ R^2 = \frac{\textrm{model SS}}{\textrm{total SS}} = \frac{\sum{(\hat{y}-\bar{y})^2}}{\sum{(y-\bar{y})^2}} \]
how much the model improves over the null
\(0 \le R^2 \le 1\)
we want \(R^2\) to be large
\(F\) ratio depends on mean squares
\[(\textrm{MS}_x=\textrm{SS}_x/\textrm{df}_x)\]
\[F=\frac{\textrm{model MS}}{\textrm{residual MS}}=\frac{\sum{(\hat{y}-\bar{y})^2}/\textrm{df}_m}{\sum{(y-\hat{y})^2}/\textrm{df}_r}\]
how much the model ‘explains’
\(0<F\)
we want \(F\) to be large
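These quantities can all be computed by hand in R as a check on the definitions above. A minimal sketch, assuming the model is fitted to the `dat` data frame shown in the output below and stored in an object called `mod` (the object name is illustrative):

```r
mod <- lm(RT ~ BloodAlc, data = dat)   # the model from the output below

y     <- dat$RT        # observed values
y.hat <- fitted(mod)   # predictions from the model
y.bar <- mean(y)       # the mean (= prediction of the null model RT ~ 1)

total.SS    <- sum((y - y.bar)^2)      # total variance in the data
residual.SS <- sum((y - y.hat)^2)      # variance the model leaves unexplained
model.SS    <- sum((y.hat - y.bar)^2)  # variance the model explains

R2 <- model.SS / total.SS              # "Multiple R-squared" in summary()

df.m <- 1                # model df: one predictor
df.r <- length(y) - 2    # residual df: n minus two estimated coefficients
F.ratio <- (model.SS / df.m) / (residual.SS / df.r)  # the F-statistic
```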
Call:
lm(formula = RT ~ BloodAlc, data = dat)
Residuals:
Min 1Q Median 3Q Max
-115.92 -40.42 1.05 42.93 126.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared: 0.216, Adjusted R-squared: 0.2
F-statistic: 13.2 on 1 and 48 DF, p-value: 0.000673
...
Multiple R-squared: 0.216, Adjusted R-squared: 0.2
F-statistic: 13.2 on 1 and 48 DF, p-value: 0.000673
Analysis of Variance Table
Model 1: RT ~ 1
Model 2: RT ~ BloodAlc
Res.Df RSS Df Sum of Sq F Pr(>F)
1 49 190340
2 48 149223 1 41117 13.2 0.00067 ***
Analysis of Variance Table
Response: RT
Df Sum Sq Mean Sq F value Pr(>F)
BloodAlc 1 41117 41117 13.2 0.00067 ***
Residuals 48 149223 3109
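Both tables can be reproduced with anova(): the first by comparing the null and alcohol models explicitly, the second by passing a single model. A sketch, with illustrative object names:

```r
mod0 <- lm(RT ~ 1,        data = dat)   # null (intercept-only) model
mod1 <- lm(RT ~ BloodAlc, data = dat)

anova(mod0, mod1)   # explicit model comparison (first table)
anova(mod1)         # sequential table for a single model (second table)
```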
Adding a predictor of blood alcohol improved the amount of variance explained over the null model (R2=0.22, F(1,48)=13.2, p=.0007).
Anscombe (1973)
linearity of relationship(!)
for the residuals:
\[y_i=b_0+b_1\cdot{}x_i+\epsilon_i\]
\[\color{red}{\epsilon\sim{}N(0,\sigma)~\text{independently}}\]
homogeneous (differences from \(\hat{y}\) shouldn’t be systematically smaller or larger for different \(x\))
independent (residuals shouldn’t influence other residuals)
Call:
lm(formula = RT ~ BloodAlc, data = dat)
Residuals:
Min 1Q Median 3Q Max
-115.92 -40.42 1.05 42.93 126.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared: 0.216, Adjusted R-squared: 0.2
F-statistic: 13.2 on 1 and 48 DF, p-value: 0.000673
from our residuals, below what value do (say) 5%, or 10%, of observations fall?
convert to “standard deviations from the mean”
Q-Q plot compares the residuals \(\epsilon\) against a known distribution (here, normal)
observations close to the straight line mean residuals are approximately normal
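One way to produce the Q-Q plot in R, assuming the fitted model is called `mod`:

```r
qqnorm(resid(mod))   # residual quantiles against theoretical normal quantiles
qqline(resid(mod))   # reference line: where normally-distributed residuals fall
plot(mod, which = 2) # equivalent built-in diagnostic (standardised residuals)
```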
no easy way to check independence of residuals
in part, because it depends on the source of the observations
one determinant might be a single person being observed multiple times
e.g., my reaction times might tend to be slower than yours
\(\rightarrow\) multivariate statistics
another determinant might be time
observations in a sequence might be autocorrelated
can be checked using the Durbin-Watson Test from the car package
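A sketch of that check, assuming the model object is called `mod`:

```r
library(car)           # provides durbinWatsonTest()
durbinWatsonTest(mod)  # tests for autocorrelation in the residuals
```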
more statistical ways of checking assumptions will be introduced in labs
they tend to have limitations (for example, they’re susceptible to sample size)
nothing beats looking at plots like these (and plot(<model>) makes it easy)
there are no criteria for deciding exactly when assumptions are sufficiently met
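For reference, the diagnostic plots come from a single call (the model name `mod` is illustrative):

```r
par(mfrow = c(2, 2))  # arrange the four panels in a 2x2 grid
plot(mod)             # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
```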
we need to talk about influence
even substantial outliers may only have small effects on the model
here, only the intercept is affected
\[D_i=\frac{\sum_{j=1}^{n}{(\hat{y}_j-\hat{y}_{j(i)})^2}}{(p+1)\hat{\sigma}^2}\]
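R computes Cook's distance directly; a sketch, again assuming the model is called `mod`:

```r
cooks.distance(mod)   # one D value per observation
plot(mod, which = 4)  # Cook's distance by observation number
```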
Call:
lm(formula = RT ~ BloodAlc, data = dat)
Residuals:
Min 1Q Median 3Q Max
-115.92 -40.42 1.05 42.93 126.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared: 0.216, Adjusted R-squared: 0.2
F-statistic: 13.2 on 1 and 48 DF, p-value: 0.000673
Adding a predictor of blood alcohol improved the amount of variance explained over the null model (R2=0.22, F(1,48)=13.2, p=.0007). Reaction time slowed by 32.3 ms for every additional 0.01% blood alcohol by volume (t(48)=3.64, p=.0007).
the Playmo School has been evaluating its reading programmes, using 50 students
ages of students
hours per week students spend reading of their own volition
whether they are taught using phonics or whole-word methods
outcome: “reading age”
| age | hrs_wk | method | R_AGE |
|------|--------|---------|-------|
| 10.0 | 5.1 | phonics | 14.1 |
| 8.0 | 4.4 | phonics | 11.8 |
| 9.3 | 5.8 | phonics | 13.8 |
| 8.7 | 5.4 | phonics | 13.3 |
| 10.2 | 4.6 | phonics | 14.3 |
| 10.5 | 4.6 | word | 9.6 |
| 9.4 | 4.6 | word | 8.0 |
| 8.6 | 3.6 | word | 6.6 |
| 8.1 | 4.4 | word | 7.5 |
| 6.1 | 5.1 | word | 5.5 |
hours per week is correlated with reading age: r=0.3812, p=0.0063
we can use a linear model to say something about the effect size
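The correlation reported above is a standard Pearson test; a sketch, assuming the data frame `reading` used in the model output below:

```r
cor.test(reading$hrs_wk, reading$R_AGE)  # r = .38, p = .0063
```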
Call:
lm(formula = R_AGE ~ hrs_wk, data = reading)
Residuals:
Min 1Q Median 3Q Max
-5.813 -1.826 -0.017 2.426 4.729
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.069 2.119 1.92 0.0608 .
hrs_wk 1.193 0.417 2.86 0.0063 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.7 on 48 degrees of freedom
Multiple R-squared: 0.145, Adjusted R-squared: 0.127
F-statistic: 8.16 on 1 and 48 DF, p-value: 0.00631
but…
it seems that the assumptions aren’t met for this model
one reason for this can be because there’s still systematic structure in the residuals
i.e., more than one thing which can explain the variance
so far, have focused on effects of practice
but presumably older children read better?
| age | hrs_wk | method | R_AGE |
|-------|-------|---------|--------|
| 9.982 | 5.137 | phonics | 14.090 |
| 8.006 | 4.353 | phonics | 11.762 |
| 9.349 | 5.808 | phonics | 13.838 |
| 8.620 | 3.612 | word | 6.620 |
| 8.060 | 4.447 | word | 7.524 |
| 6.117 | 5.085 | word | 5.502 |
Call:
lm(formula = R_AGE ~ age, data = reading)
Residuals:
Min 1Q Median 3Q Max
-3.642 -2.601 -0.454 2.280 4.543
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.238 1.904 2.23 0.0307 *
age 0.699 0.225 3.10 0.0032 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.66 on 48 degrees of freedom
Multiple R-squared: 0.167, Adjusted R-squared: 0.15
F-statistic: 9.62 on 1 and 48 DF, p-value: 0.00323
we now have two models that don’t map well to assumptions
each suggests an effect
one of age
one of hrs_wk
if we run them independently, the chances of a type 1 error are
\(\frac{1}{20}\) (mod, including hrs_wk)
\(\frac{1}{20}\) (mod2, including age)
or \(\frac{1}{10}\) overall
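the \(\frac{1}{10}\) is approximate: with two independent tests at \(\alpha=.05\), the chance of at least one type 1 error is
\[ 1-(1-.05)^2 = 0.0975 \approx \tfrac{1}{10} \]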
\[\color{red}{\textrm{outcome}_i}=\color{blue}{(\textrm{model})_i}+\textrm{error}_i\]
\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_i}+\epsilon_i\]
\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\epsilon_i\]
\[\color{red}{\hat{y}_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}\]
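In R, the second predictor is simply added to the model formula with `+`; a sketch (the object name `mod.m` is illustrative):

```r
mod.m <- lm(R_AGE ~ age + hrs_wk, data = reading)  # additive model: age and practice
summary(mod.m)
```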
Call:
lm(formula = R_AGE ~ age + hrs_wk, data = reading)
Residuals:
Min 1Q Median 3Q Max
-3.820 -2.382 0.074 2.404 3.549
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.518 2.422 0.21 0.832
age 0.578 0.222 2.61 0.012 *
hrs_wk 0.945 0.406 2.33 0.024 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.55 on 47 degrees of freedom
Multiple R-squared: 0.253, Adjusted R-squared: 0.221
F-statistic: 7.97 on 2 and 47 DF, p-value: 0.00105
...
(Intercept) 0.518 2.422 0.21 0.832
age 0.578 0.222 2.61 0.012 *
hrs_wk 0.945 0.406 2.33 0.024 *
...
there are independent effects of age and practice
reading age improves by 0.58 years for each year of age
reading age improves by 0.95 years for each hour of practice
the intercept (0 years old, 0 hours/week) is meaningless!
...
Multiple R-squared: 0.253, Adjusted R-squared: 0.221
...
in multiple regression, R2 measures the fit of the entire model
R2=0.25 looks better than the R2 for mod2 (age as a predictor) of 0.17
but any predictor will improve R2 (chance associations guarantee this)
...
Multiple R-squared: 0.196, Adjusted R-squared: 0.161
...
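This is what the adjusted \(R^2\) in the output corrects for: it penalises \(R^2\) for the number of predictors \(k\) relative to the number of observations \(n\):
\[ R^2_{adj} = 1-(1-R^2)\frac{n-1}{n-k-1} \]
For the two-predictor model (\(R^2=0.253\), \(n=50\), \(k=2\)): \(1-(1-0.253)\frac{49}{47} \approx 0.221\), as reported.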
...
Multiple R-squared: 0.253, Adjusted R-squared: 0.221
F-statistic: 7.97 on 2 and 47 DF, p-value: 0.00105
F ratio compares model to the null model
but we can also compare models in succession
Analysis of Variance Table
Model 1: R_AGE ~ 1
Model 2: R_AGE ~ age
Model 3: R_AGE ~ age + hrs_wk
Res.Df RSS Df Sum of Sq F Pr(>F)
1 49 409
2 48 340 1 68.2 10.50 0.0022 **
3 47 305 1 35.3 5.43 0.0241 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
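A sketch of the calls producing this table (object names are illustrative):

```r
mod0  <- lm(R_AGE ~ 1,            data = reading)  # null model
mod2  <- lm(R_AGE ~ age,          data = reading)  # age only
mod.m <- lm(R_AGE ~ age + hrs_wk, data = reading)  # age and practice

anova(mod0, mod2, mod.m)  # does each model improve on the previous one?
```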
Analysis of Variance Table
Model 1: R_AGE ~ 1
Model 2: R_AGE ~ age
Model 3: R_AGE ~ age + hrs_wk
Res.Df RSS Df Sum of Sq F Pr(>F)
1 49 409
2 48 340 1 68.2 10.50 0.0022
3 47 305 1 35.3 5.43 0.0241
age, then hrs_wk
Analysis of Variance Table
Response: R_AGE
Df Sum Sq Mean Sq F value Pr(>F)
age 1 68.2 68.2 10.50 0.0022
hrs_wk 1 35.3 35.3 5.43 0.0241
Residuals 47 305.2 6.5
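The same sequential table comes from anova() applied to the full model on its own; predictors are assessed in the order they appear in the formula, so swapping `age` and `hrs_wk` would change the sum of squares attributed to each. A sketch using the illustrative `mod.m` from above:

```r
anova(mod.m)  # mod.m <- lm(R_AGE ~ age + hrs_wk, data = reading)
```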