The Linear Model

Univariate Statistics and Methodology using R

Martin Corley

Psychology, PPLS

University of Edinburgh

Model Comparison

Where we Were

...
(Intercept)   321.24      91.05    3.53  0.00093 ***
BloodAlc       32.28       8.88    3.64  0.00067 ***
...

Intercept and Slope

...
(Intercept)   321.24      91.05    3.53  0.00093 ***
BloodAlc       32.28       8.88    3.64  0.00067 ***
...

What If We Know Nothing

there is no relationship between our independent and dependent variables

  • equivalent to “not knowing” the independent variable
  • if we’ve measured a bunch of RTs but nothing else, our best estimate of a new RT is the mean RT

y ~ 1 or RT ~ 1

The Null Model

  • how much better is RT ~ BloodAlc than RT ~ 1?

Simplify the Data

  • to make the next few graphs less busy
ourDat <- dat |>
    slice_sample(n = 20)

Total Sum of Squares

\[ \textrm{total SS}=\sum{(y - \bar{y})^2} \]

  • sum of squared differences between observed \(y\) and mean \(\bar{y}\)

  • = total amount of variance in the data

  • = total unexplained variance from the null model

Residual Sum of Squares

\[\textrm{residual SS} = \sum{(y - \hat{y})^2}\]

  • sum of squared differences between observed \(y\) and predicted \(\hat{y}\)

  • = total unexplained variance from a given model in the data

Model Sum of Squares

\[ \textrm{model SS} = \sum{(\hat{y} - \bar{y})^2} \]

  • sum of squared differences between predicted \(\hat{y}\) and observed \(\bar{y}\)

  • total variance explained by a given model

Testing the Model: R2

\[ R^2 = \frac{\textrm{model SS}}{\textrm{total SS}} = \frac{\sum{(\hat{y}-\bar{y})^2}}{\sum{(y-\bar{y})^2}} \]

how much the model improves over the null

  • \(0 \le R^2 \le 1\)

  • we want \(R^2\) to be large

Testing the Model: F

\(F\) ratio depends on mean squares

\[(\textrm{MS}_x=\textrm{SS}_x/\textrm{df}_x)\]

\[F=\frac{\textrm{model MS}}{\textrm{residual MS}}=\frac{\sum{(\hat{y}-\bar{y})^2}/\textrm{df}_m}{\sum{(y-\hat{y})^2}/\textrm{df}_r}\]

how much the model ‘explains’

  • \(0<F\)

  • we want \(F\) to be large

Two Types of Significance

mod <- lm(RT ~ BloodAlc, data = dat)
summary(mod)

Call:
lm(formula = RT ~ BloodAlc, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-115.92  -40.42    1.05   42.93  126.64 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   321.24      91.05    3.53  0.00093 ***
BloodAlc       32.28       8.88    3.64  0.00067 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared:  0.216, Adjusted R-squared:   0.2 
F-statistic: 13.2 on 1 and 48 DF,  p-value: 0.000673

Two Types of Significance

mod <- lm(RT ~ BloodAlc, data = dat)
summary(mod)

Call:
lm(formula = RT ~ BloodAlc, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-115.92  -40.42    1.05   42.93  126.64 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   321.24      91.05    3.53  0.00093 ***
BloodAlc       32.28       8.88    3.64  0.00067 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared:  0.216, Adjusted R-squared:   0.2 
F-statistic: 13.2 on 1 and 48 DF,  p-value: 0.000673

Two Types of Significance

summary(mod)
...
Multiple R-squared:  0.216, Adjusted R-squared:   0.2 
F-statistic: 13.2 on 1 and 48 DF,  p-value: 0.000673
  • \(F\)-statistic comes from comparison of model with null model
mod0 <- lm(RT ~ 1, data = dat)
anova(mod0, mod)
Analysis of Variance Table

Model 1: RT ~ 1
Model 2: RT ~ BloodAlc
  Res.Df    RSS Df Sum of Sq    F  Pr(>F)    
1     49 190340                              
2     48 149223  1     41117 13.2 0.00067 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Two Types of Significance

  • in fact, because comparison is implicit:
anova(mod0,mod)
Analysis of Variance Table

Model 1: RT ~ 1
Model 2: RT ~ BloodAlc
  Res.Df    RSS Df Sum of Sq    F  Pr(>F)    
1     49 190340                              
2     48 149223  1     41117 13.2 0.00067 ***

 

anova(mod)
Analysis of Variance Table

Response: RT
          Df Sum Sq Mean Sq F value  Pr(>F)    
BloodAlc   1  41117   41117    13.2 0.00067 ***
Residuals 48 149223    3109                    

Summarising…

Adding a predictor of blood alcohol improved the amount of variance explained over the null model (R2=0.22, F(1,48)=13.2, p=.0007).

  • we also have the \(t\)-statistic describing the effect…

Hold On

  • we’ve made a lot of assumptions

Assumptions

The Most Important Rule

The Most Important Rule

  • look at your data, not just the statistics

Anscombe’s Quartet

Anscombe (1973)

Assumptions of Linear Models

required

  • linearity of relationship(!)

  • for the residuals:

    • normality
    • homogeneity of variance
    • independence

desirable

  • no ‘bad’ (overly influential) observations

Residuals

\[y_i=b_0+b_1\cdot{}x_i+\epsilon_i\]

\[\color{red}{\epsilon\sim{}N(0,\sigma)~\text{independently}}\]

  • normally distributed (mean should be \(\simeq{}\) zero)

  • homogeneous (differences from \(\hat{y}\) shouldn’t be systematically smaller or larger for different \(x\))

  • independent (residuals shouldn’t influence other residuals)

At A Glance

summary(mod)

Call:
lm(formula = RT ~ BloodAlc, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-115.92  -40.42    1.05   42.93  126.64 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   321.24      91.05    3.53  0.00093 ***
BloodAlc       32.28       8.88    3.64  0.00067 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared:  0.216, Adjusted R-squared:   0.2 
F-statistic: 13.2 on 1 and 48 DF,  p-value: 0.000673

In More Detail

linearity

plot(mod, which = 1)
  • plots residuals \(\epsilon_i\) against fitted values \(\hat{y}_i\)

  • the ‘average residual’ is roughly zero across \(\hat{y}\), so relationship is likely to be linear

In More Detail

normality

hist(resid(mod))

In More Detail

normality

plot(density(resid(mod)))
  • check that residuals \(\epsilon\) are approximately normally distributed

  • in fact there’s a better way of doing this…

In More Detail

normality

plot(mod, which = 2)
  • Q-Q plot compares the residuals \(\epsilon\) against a known distribution (here, normal)

  • observations close to the straight line mean residuals are approximately normal

Q-Q Plots

y axis

  • for a normal distribution, what values should (say) 5%, or 10% of the observations lie below?

  • expressed in “standard deviations from the mean”

qnorm(c(0.05, 0.1))
[1] -1.645 -1.282

Q-Q Plots

x axis

  • from our residuals, what values are (say) 5%, or 10%, of observations found to be less than?

  • convert to “standard deviations from the mean”

quantile(
  scale(resid(mod)),
  c(0.05, 0.1)
)
    5%    10% 
-1.405 -1.280 

  • Q-Q Plot shows these values plotted against each other

In More Detail

normality

plot(mod,which=2)
  • Q-Q plot compares the residuals \(\epsilon\) against a known distribution (here, normal)

  • observations close to the straight line mean residuals are approximately normal

  • numbered points refer to row numbers in the original data

In More Detail

homogeneity of variance

plot(mod, which = 3)
  • graph shows \(\sqrt{|\textrm{standardized residuals|}}\)

  • the size of the residuals is approximately the same across values of \(\hat{y}\)

What about Independence?

  • no easy way to check independence of residuals

  • in part, because it depends on the source of the observations

  • one determinant might be a single person being observed multiple times

  • e.g., my reaction times might tend to be slower than yours
    \(\rightarrow\) multivariate statistics

Independence

  • another determinant might be time

  • observations in a sequence might be autocorrelated

  • can be checked using the Durbin-Watson Test from the car package

library(car)
dwt(mod)
 lag Autocorrelation D-W Statistic p-value
   1         -0.1377          2.22   0.422
 Alternative hypothesis: rho != 0

Visual vs Other Methods

  • more statistical ways of checking assumptions will be introduced in labs

  • they tend to have limitations (for example, they’re susceptible to sample size)

  • nothing beats looking at plots like these (and plot(<model>) makes it easy)

Assumption Checking

  • there are no criteria for deciding exactly when assumptions are sufficiently met

    • it’s a matter of experience and judgement
  • we need to talk about influence

Influence

Influence

  • even substantial outliers may only have small effects on the model

  • here, only the intercept is affected

Influence

  • observations with high leverage are inconsistent with other data, but may not be distorting the model

Influence

  • we care about observations with high influence (outliers with high leverage)

Cook’s Distance

  • a standardised measure of “how much the model differs without observation \(i\)

Cook’s Distance

\[D_i=\frac{\sum_{j=1}^{n}{(\hat{y}_j-\hat{y}_{j(i)})^2}}{(p+1)\hat{\sigma}^2}\]

  • \(\hat{y}_j\) is the \(j\)th fitted value
  • \(\hat{y}_{j(i)}\) is the \(j\)th value from a fit which doesn’t include observation \(i\)
  • \(p\) is the number of regression coefficients
  • \(\hat{\sigma}^2\) is the estimated variance from the fit, i.e., mean squared error

Cook’s Distance

plot(bad, which = 4)
  • observations labelled by row

  • various rules of thumb, but start looking when Cook’s Distance > 0.5

Cook’s Distance

plot(mod, which = 4)
  • observations labelled by row

  • various rules of thumb, but start looking when Cook’s Distance > 0.5

So Now We Know…

  • the relationship is approximately linear
  • the residuals are approximately normal
  • the variance is approximately homogeneous
  • we believe the observations are independent
  • there are no overly-influential observations


Call:
lm(formula = RT ~ BloodAlc, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-115.92  -40.42    1.05   42.93  126.64 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   321.24      91.05    3.53  0.00093 ***
BloodAlc       32.28       8.88    3.64  0.00067 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared:  0.216, Adjusted R-squared:   0.2 
F-statistic: 13.2 on 1 and 48 DF,  p-value: 0.000673
  • we are ready to report our observations

What We Know

Adding a predictor of blood alcohol improved the amount of variance explained over the null model (R2=0.22, F(1,48)=13.2, p=.0007). Reaction time slowed by 32.3 ms for every additional 0.01% blood alcohol by volume (t(48)=3.64, p=.0007).

Learning to Read

Learning to Read

  • the Playmo School has been evaluating its reading programmes, using 50 students

  • ages of students

  • hours per week students spend reading of their own volition

  • whether they are taught using phonics or whole-word methods

  • outcome: “reading age”

Learning to Read

age hrs_wk method R_AGE
10.0 5.1 phonics 14.1
8.0 4.4 phonics 11.8
9.3 5.8 phonics 13.8
8.7 5.4 phonics 13.3
10.2 4.6 phonics 14.3
10.5 4.6 word 9.6
9.4 4.6 word 8.0
8.6 3.6 word 6.6
8.1 4.4 word 7.5
6.1 5.1 word 5.5

Learning to Read

age hrs_wk method R_AGE
10.0 5.1 phonics 14.1
8.0 4.4 phonics 11.8
9.3 5.8 phonics 13.8
8.7 5.4 phonics 13.3
10.2 4.6 phonics 14.3
10.5 4.6 word 9.6
9.4 4.6 word 8.0
8.6 3.6 word 6.6
8.1 4.4 word 7.5
6.1 5.1 word 5.5

Does Practice Affect Reading Age?

p <- reading |>
    ggplot(aes(x = hrs_wk,
        y = R_AGE)) +
    geom_point(size = 3) +
    ylab("reading age") +
    xlab("hours reading/week")
p

  • hours per week is correlated with reading age: r=0.3812, p=0.0063

  • we can use a linear model to say something about the effect size

Does Practice Affect Reading Age?

p + geom_smooth(method = "lm")

  • each extra hour spent reading a week adds 1.193 years to reading age

A Linear Model

summary(mod)

Call:
lm(formula = R_AGE ~ hrs_wk, data = reading)

Residuals:
   Min     1Q Median     3Q    Max 
-5.813 -1.826 -0.017  2.426  4.729 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    4.069      2.119    1.92   0.0608 . 
hrs_wk         1.193      0.417    2.86   0.0063 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.7 on 48 degrees of freedom
Multiple R-squared:  0.145, Adjusted R-squared:  0.127 
F-statistic: 8.16 on 1 and 48 DF,  p-value: 0.00631

but…

Assumptions Not Met!

  • it seems that the assumptions aren’t met for this model

  • (another demonstration on the right)

  • one reason for this can be because there’s still systematic structure in the residuals

  • i.e., more than one thing which can explain the variance

Adding Age into the Equation

  • so far, have focused on effects of practice

  • but presumably older children read better?

age hrs_wk method R_AGE
9.982 5.137 phonics 14.090
8.006 4.353 phonics 11.762
9.349 5.808 phonics 13.838
8.620 3.612 word 6.620
8.060 4.447 word 7.524
6.117 5.085 word 5.502

Another Model

Another Model

mod2 <- lm(R_AGE ~ age, data = reading)
summary(mod2)

Call:
lm(formula = R_AGE ~ age, data = reading)

Residuals:
   Min     1Q Median     3Q    Max 
-3.642 -2.601 -0.454  2.280  4.543 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    4.238      1.904    2.23   0.0307 * 
age            0.699      0.225    3.10   0.0032 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.66 on 48 degrees of freedom
Multiple R-squared:  0.167, Adjusted R-squared:  0.15 
F-statistic: 9.62 on 1 and 48 DF,  p-value: 0.00323

  • R2=0.17

Two Models, No Answers

  • we now have two models that don’t map well to assumptions

  • each suggests an effect

    • one of age

    • one of hrs_wk

  • if we run them independently, the chances of a type 1 error are

    • \(\frac{1}{20}\) (mod, including hrs_wk)

    • \(\frac{1}{20}\) (mod2, including age)

  • or \(\frac{1}{10}\) overall

  • we need to test multiple predictors in one linear mod

Model Equations Again

\[\color{red}{\textrm{outcome}_i}=\color{blue}{(\textrm{model})_i}+\textrm{error}_i\]

\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_i}+\epsilon_i\]

linear model with two predictors

\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\epsilon_i\]

\[\color{red}{\hat{y}_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}\]

Two Predictors

\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\epsilon_i\]

\[\color{red}{\hat{y}_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}\]

 

 

mod.m <- lm(R_AGE ~ 1 + hrs_wk + age, data = reading)

or

mod.m <- lm(R_AGE ~ 1 + age + hrs_wk, data = reading)

Running A Multiple Regression

mod.m <- lm(R_AGE ~ age + hrs_wk, data = reading)
summary(mod.m)

Call:
lm(formula = R_AGE ~ age + hrs_wk, data = reading)

Residuals:
   Min     1Q Median     3Q    Max 
-3.820 -2.382  0.074  2.404  3.549 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    0.518      2.422    0.21    0.832  
age            0.578      0.222    2.61    0.012 *
hrs_wk         0.945      0.406    2.33    0.024 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.55 on 47 degrees of freedom
Multiple R-squared:  0.253, Adjusted R-squared:  0.221 
F-statistic: 7.97 on 2 and 47 DF,  p-value: 0.00105

Running a Multiple Regression

...
(Intercept)    0.518      2.422    0.21    0.832  
age            0.578      0.222    2.61    0.012 *
hrs_wk         0.945      0.406    2.33    0.024 *
...
  • there are independent effects of age and practice

    • reading age improves by 0.58 years for each year of age

    • reading age improves by 0.95 years for each hour of practice

  • the intercept (0 years old, 0 hours/week) is meaningless!

  • important question: is this model better than a model based just on age?

Model Fit

...
Multiple R-squared:  0.253, Adjusted R-squared:  0.221 
...
  • in multiple regression, R2 measures the fit of the entire model

    • sum of individual R2s if predictors not correlated
  • R2=0.25 looks better than the R2 for mod2 (age as a predictor) of 0.17

  • but any predictor will improve R2 (chance associations guarantee this)

mod.r <- update(mod2, ~. + runif(50))
summary(mod.r)
...
Multiple R-squared:  0.196, Adjusted R-squared:  0.161 
...

Comparing Models

...
Multiple R-squared:  0.253, Adjusted R-squared:  0.221 
F-statistic: 7.97 on 2 and 47 DF,  p-value: 0.00105
  • F ratio compares model to the null model

  • but we can also compare models in succession

mod.n <- lm(R_AGE ~ 1, data = reading)
anova(mod.n, mod2, mod.m)
Analysis of Variance Table

Model 1: R_AGE ~ 1
Model 2: R_AGE ~ age
Model 3: R_AGE ~ age + hrs_wk
  Res.Df RSS Df Sum of Sq     F Pr(>F)   
1     49 409                             
2     48 340  1      68.2 10.50 0.0022 **
3     47 305  1      35.3  5.43 0.0241 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing Models

explicit way

anova(mod.n, mod2, mod.m)
Analysis of Variance Table

Model 1: R_AGE ~ 1
Model 2: R_AGE ~ age
Model 3: R_AGE ~ age + hrs_wk
  Res.Df RSS Df Sum of Sq     F Pr(>F)
1     49 409                          
2     48 340  1      68.2 10.50 0.0022
3     47 305  1      35.3  5.43 0.0241

lazy way

anova(mod.m)
Analysis of Variance Table

Response: R_AGE
          Df Sum Sq Mean Sq F value Pr(>F)
age        1   68.2    68.2   10.50 0.0022
hrs_wk     1   35.3    35.3    5.43 0.0241
Residuals 47  305.2     6.5               

Order Matters!

age then hrs_wk

mod <- lm(R_AGE ~ age + hrs_wk, data=reading)
anova(mod)
Analysis of Variance Table

Response: R_AGE
          Df Sum Sq Mean Sq F value Pr(>F)
age        1   68.2    68.2   10.50 0.0022
hrs_wk     1   35.3    35.3    5.43 0.0241
Residuals 47  305.2     6.5               

hrs_wk then age

mod <- lm(R_AGE ~ hrs_wk + age, data=reading)
anova(mod)
Analysis of Variance Table

Response: R_AGE
          Df Sum Sq Mean Sq F value Pr(>F)
hrs_wk     1   59.4    59.4    9.14  0.004
age        1   44.1    44.1    6.79  0.012
Residuals 47  305.2     6.5               

Order Matters!

  • order affects F ratios because R, by default, uses Type 1 sums of squares
    • calculate each predictor’s improvement to the model in turn
  • compare to Type 3 sums of squares
    • calculate each predictor’s improvement to the model taking all other predictors into account
  • in R, predictors should be entered into the model in a theoretically-motivated order

The Two-Predictor Model

End