The Linear Model

Univariate Statistics and Methodology using R

Martin Corley

Psychology, PPLS

University of Edinburgh

Model Comparison

Where we Were

...
(Intercept)   321.24      91.05    3.53  0.00093 ***
BloodAlc       32.28       8.88    3.64  0.00067 ***
...

Intercept and Slope

...
(Intercept)   321.24      91.05    3.53  0.00093 ***
BloodAlc       32.28       8.88    3.64  0.00067 ***
...

What If We Know Nothing

suppose there is no relationship between our independent and dependent variables

  • equivalent to “not knowing” the independent variable
  • if we’ve measured a bunch of RTs but nothing else, our best estimate of a new RT is the mean RT

y ~ 1 or RT ~ 1

The Null Model

  • how much better is RT ~ BloodAlc than RT ~ 1?

Simplify the Data

  • to make the next few graphs less busy
ourDat <- dat |>
    slice_sample(n = 20)

Total Sum of Squares

\[ \textrm{total SS}=\sum{(y - \bar{y})^2} \]

  • sum of squared differences between observed \(y\) and mean \(\bar{y}\)

  • = total amount of variance in the data

  • = total unexplained variance from the null model

Residual Sum of Squares

\[\textrm{residual SS} = \sum{(y - \hat{y})^2}\]

  • sum of squared differences between observed \(y\) and predicted \(\hat{y}\)

  • = total variance in the data left unexplained by a given model

Model Sum of Squares

\[ \textrm{model SS} = \sum{(\hat{y} - \bar{y})^2} \]

  • sum of squared differences between predicted \(\hat{y}\) and mean \(\bar{y}\)

  • total variance explained by a given model

Testing the Model: R2

\[ R^2 = \frac{\textrm{model SS}}{\textrm{total SS}} = \frac{\sum{(\hat{y}-\bar{y})^2}}{\sum{(y-\bar{y})^2}} \]

how much the model improves over the null

  • \(0 \le R^2 \le 1\)

  • we want \(R^2\) to be large

Testing the Model: F

\(F\) ratio depends on mean squares

\[(\textrm{MS}_x=\textrm{SS}_x/\textrm{df}_x)\]

\[F=\frac{\textrm{model MS}}{\textrm{residual MS}}=\frac{\sum{(\hat{y}-\bar{y})^2}/\textrm{df}_m}{\sum{(y-\hat{y})^2}/\textrm{df}_r}\]

how much the model ‘explains’

  • \(0<F\)

  • we want \(F\) to be large
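
As a minimal sketch (using the dat blood-alcohol data from these slides, and the mod model fitted on the next slide), the same quantities can be pieced together by hand:

mod <- lm(RT ~ BloodAlc, data = dat)     # the model fitted on the next slide
y     <- dat$RT
y_hat <- fitted(mod)

ss_total    <- sum((y - mean(y))^2)      # total SS
ss_residual <- sum((y - y_hat)^2)        # residual SS
ss_model    <- sum((y_hat - mean(y))^2)  # model SS

r_squared <- ss_model / ss_total
f_value   <- (ss_model / 1) / (ss_residual / (length(y) - 2))  # df_m = 1, df_r = n - 2
c(R2 = r_squared, F = f_value)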

Two Types of Significance

mod <- lm(RT ~ BloodAlc, data = dat)
summary(mod)

Call:
lm(formula = RT ~ BloodAlc, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-115.92  -40.42    1.05   42.93  126.64 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   321.24      91.05    3.53  0.00093 ***
BloodAlc       32.28       8.88    3.64  0.00067 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared:  0.216, Adjusted R-squared:   0.2 
F-statistic: 13.2 on 1 and 48 DF,  p-value: 0.000673

Two Types of Significance

summary(mod)
...
Multiple R-squared:  0.216, Adjusted R-squared:   0.2 
F-statistic: 13.2 on 1 and 48 DF,  p-value: 0.000673
  • \(F\)-statistic comes from comparison of model with null model
mod0 <- lm(RT ~ 1, data = dat)
anova(mod0, mod)
Analysis of Variance Table

Model 1: RT ~ 1
Model 2: RT ~ BloodAlc
  Res.Df    RSS Df Sum of Sq    F  Pr(>F)    
1     49 190340                              
2     48 149223  1     41117 13.2 0.00067 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Two Types of Significance

  • in fact, because the comparison with the null model is implicit, anova(mod) gives the same result:
anova(mod0, mod)
Analysis of Variance Table

Model 1: RT ~ 1
Model 2: RT ~ BloodAlc
  Res.Df    RSS Df Sum of Sq    F  Pr(>F)    
1     49 190340                              
2     48 149223  1     41117 13.2 0.00067 ***

 

anova(mod)
Analysis of Variance Table

Response: RT
          Df Sum Sq Mean Sq F value  Pr(>F)    
BloodAlc   1  41117   41117    13.2 0.00067 ***
Residuals 48 149223    3109                    

Summarising…

Adding blood alcohol as a predictor improved the amount of variance explained over the null model (\(R^2=0.22\), \(F(1,48)=13.2\), \(p=.0007\)).

  • we also have the \(t\)-statistic describing the effect…

Hold On

  • we’ve made a lot of assumptions

Assumptions

The Most Important Rule

  • look at your data, not just the statistics

Anscombe’s Quartet

Anscombe (1973)

Assumptions of Linear Models

required

  • linearity of relationship(!)

  • for the residuals:

    • normality
    • homogeneity of variance
    • independence

desirable

  • no ‘bad’ (overly influential) observations

Residuals

\[y_i=b_0+b_1\cdot{}x_i+\epsilon_i\]

\[\color{red}{\epsilon\sim{}N(0,\sigma)~\text{independently}}\]

  • normally distributed (mean should be \(\simeq{}\) zero)

  • homogeneous (differences from \(\hat{y}\) shouldn’t be systematically smaller or larger for different \(x\))

  • independent (residuals shouldn’t influence other residuals)
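
As a one-line sanity check on the first point (assuming mod is the RT ~ BloodAlc model fitted earlier):

mean(resid(mod))   # the 'average residual'; should be essentially zero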

At A Glance

summary(mod)

Call:
lm(formula = RT ~ BloodAlc, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-115.92  -40.42    1.05   42.93  126.64 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   321.24      91.05    3.53  0.00093 ***
BloodAlc       32.28       8.88    3.64  0.00067 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared:  0.216, Adjusted R-squared:   0.2 
F-statistic: 13.2 on 1 and 48 DF,  p-value: 0.000673

In More Detail

linearity

plot(mod, which = 1)
  • plots residuals \(\epsilon_i\) against fitted values \(\hat{y}_i\)

  • the ‘average residual’ is roughly zero across \(\hat{y}\), so the relationship is likely to be linear

In More Detail

normality

hist(resid(mod))

In More Detail

normality

plot(density(resid(mod)))
  • check that residuals \(\epsilon\) are approximately normally distributed

  • in fact there’s a better way of doing this…

In More Detail

normality

plot(mod, which = 2)
  • Q-Q plot compares the residuals \(\epsilon\) against a known distribution (here, normal)

  • observations close to the straight line mean that the residuals are approximately normal

Q-Q Plots

y axis

  • for a normal distribution, what values should (say) 5%, or 10% of the observations lie below?

  • expressed in “standard deviations from the mean”

qnorm(c(0.05, 0.1))
[1] -1.645 -1.282

Q-Q Plots

x axis

  • in our residuals, below what values do (say) 5%, or 10%, of the observations actually fall?

  • convert to “standard deviations from the mean”

quantile(
  scale(resid(mod)),
  c(0.05, 0.1)
)
    5%    10% 
-1.405 -1.280 

  • Q-Q Plot shows these values plotted against each other
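
As a rough sketch of what that plot is doing (plot(mod, which = 2) on the next slide produces a polished version):

z <- sort(scale(resid(mod)))   # standardised residuals, in order
p <- ppoints(length(z))        # probability point for each observation
plot(qnorm(p), z,
     xlab = "theoretical quantiles",
     ylab = "standardised residuals")
abline(0, 1, lty = 2)          # points near this line = approximately normal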

In More Detail

normality

plot(mod, which = 2)
  • Q-Q plot compares the residuals \(\epsilon\) against a known distribution (here, normal)

  • observations close to the straight line mean that the residuals are approximately normal

  • numbered points refer to row numbers in the original data

In More Detail

homogeneity of variance

plot(mod, which = 3)
  • graph shows \(\sqrt{|\textrm{standardized residuals}|}\)

  • the size of the residuals is approximately the same across values of \(\hat{y}\)

What about Independence?

  • no easy way to check independence of residuals

  • in part, because it depends on the source of the observations

  • one determinant might be a single person being observed multiple times

  • e.g., my reaction times might tend to be slower than yours
    \(\rightarrow\) multivariate statistics

Independence

  • another determinant might be time

  • observations in a sequence might be autocorrelated

  • can be checked using the Durbin-Watson Test from the car package

library(car)
dwt(mod)
 lag Autocorrelation D-W Statistic p-value
   1         -0.1377          2.22   0.422
 Alternative hypothesis: rho != 0

Visual vs Other Methods

  • more statistical ways of checking assumptions will be introduced in labs

  • they tend to have limitations (for example, they’re sensitive to sample size)

  • nothing beats looking at plots like these (and plot(<model>) makes it easy)
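
For example, the standard diagnostic plots for a model can all be shown in one panel:

par(mfrow = c(2, 2))   # 2 x 2 grid of plots
plot(mod)
par(mfrow = c(1, 1))   # back to one plot per page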

Assumption Checking

  • there are no criteria for deciding exactly when assumptions are sufficiently met

    • it’s a matter of experience and judgement
  • we need to talk about influence

Influence

Influence

  • even substantial outliers may only have small effects on the model

  • here, only the intercept is affected

Influence

  • observations with high leverage are inconsistent with other data, but may not be distorting the model

Influence

  • we care about observations with high influence (outliers with high leverage)

Cook’s Distance

  • a standardised measure of “how much the model differs without observation \(i\)”

Cook’s Distance

\[D_i=\frac{\sum_{j=1}^{n}{(\hat{y}_j-\hat{y}_{j(i)})^2}}{(p+1)\hat{\sigma}^2}\]

  • \(\hat{y}_j\) is the \(j\)th fitted value
  • \(\hat{y}_{j(i)}\) is the \(j\)th value from a fit which doesn’t include observation \(i\)
  • \(p\) is the number of regression coefficients
  • \(\hat{\sigma}^2\) is the estimated variance from the fit, i.e., mean squared error
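
In R, these values come straight from the fitted model; a minimal sketch:

cd <- cooks.distance(mod)   # one Cook's Distance per observation
head(round(cd, 3))
which(cd > 0.5)             # observations worth a closer look, by the rule of thumb below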

Cook’s Distance

plot(bad, which = 4)
  • observations labelled by row

  • various rules of thumb, but start looking when Cook’s Distance > 0.5

Cook’s Distance

plot(mod, which = 4)
  • observations labelled by row

  • various rules of thumb, but start looking when Cook’s Distance > 0.5

So Now We Know…

  • the relationship is approximately linear
  • the residuals are approximately normal
  • the variance is approximately homogeneous
  • we believe the observations are independent
  • there are no overly-influential observations


Call:
lm(formula = RT ~ BloodAlc, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-115.92  -40.42    1.05   42.93  126.64 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   321.24      91.05    3.53  0.00093 ***
BloodAlc       32.28       8.88    3.64  0.00067 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared:  0.216, Adjusted R-squared:   0.2 
F-statistic: 13.2 on 1 and 48 DF,  p-value: 0.000673
  • we are ready to report our observations

What We Know

Adding blood alcohol as a predictor improved the amount of variance explained over the null model (\(R^2=0.22\), \(F(1,48)=13.2\), \(p=.0007\)). Reaction time slowed by 32.3 ms for every additional 0.01% blood alcohol by volume (\(t(48)=3.64\), \(p=.0007\)).

Learning to Read

Learning to Read

  • the Playmo School has been evaluating its reading programmes, using 50 students

  • ages of students

  • hours per week students spend reading of their own volition

  • whether they are taught using phonics or whole-word methods

  • outcome: “reading age”

Learning to Read

 age hrs_wk  method R_AGE
10.0    5.1 phonics  14.1
 8.0    4.4 phonics  11.8
 9.3    5.8 phonics  13.8
 8.7    5.4 phonics  13.3
10.2    4.6 phonics  14.3
10.5    4.6    word   9.6
 9.4    4.6    word   8.0
 8.6    3.6    word   6.6
 8.1    4.4    word   7.5
 6.1    5.1    word   5.5

Does Practice Affect Reading Age?

p <- reading |>
    ggplot(aes(x = hrs_wk,
        y = R_AGE)) +
    geom_point(size = 3) +
    ylab("reading age") +
    xlab("hours reading/week")
p

  • hours per week is correlated with reading age: r=0.3812, p=0.0063

  • we can use a linear model to say something about the effect size
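
As a minimal sketch (assuming the reading data frame shown above), the correlation quoted here can be obtained with cor.test():

cor.test(reading$hrs_wk, reading$R_AGE)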

Does Practice Affect Reading Age?

p + geom_smooth(method = "lm")

  • each extra hour spent reading a week adds 1.193 years to reading age

A Linear Model

mod <- lm(R_AGE ~ hrs_wk, data = reading)
summary(mod)

Call:
lm(formula = R_AGE ~ hrs_wk, data = reading)

Residuals:
   Min     1Q Median     3Q    Max 
-5.813 -1.826 -0.017  2.426  4.729 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    4.069      2.119    1.92   0.0608 . 
hrs_wk         1.193      0.417    2.86   0.0063 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.7 on 48 degrees of freedom
Multiple R-squared:  0.145, Adjusted R-squared:  0.127 
F-statistic: 8.16 on 1 and 48 DF,  p-value: 0.00631

but…

Assumptions Not Met!

  • it seems that the assumptions aren’t met for this model

  • one reason for this is that there is still systematic structure in the residuals

  • i.e., there is more than one thing that can explain the variance
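
A quick sketch, reusing the diagnostic plots from earlier (and assuming mod is now the R_AGE ~ hrs_wk model above), makes that leftover structure visible:

plot(mod, which = 1)   # residuals vs fitted: a systematic pattern suggests a missing predictor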