Multiple Regression

Univariate Statistics and Methodology using R

Martin Corley

Psychology, PPLS

University of Edinburgh

Adding Age into the Equation

  • so far, have focused on effects of practice

  • but presumably older children read better?

   age hrs_wk  method  R_AGE
 9.982  5.137 phonics 14.090
 8.006  4.353 phonics 11.762
 9.349  5.808 phonics 13.838
 8.620  3.612    word  6.620
 8.060  4.447    word  7.524
 6.117  5.085    word  5.502

Another Model

mod2 <- lm(R_AGE ~ age, data = reading)
summary(mod2)

Call:
lm(formula = R_AGE ~ age, data = reading)

Residuals:
   Min     1Q Median     3Q    Max 
-3.642 -2.601 -0.454  2.280  4.543 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    4.238      1.904    2.23   0.0307 * 
age            0.699      0.225    3.10   0.0032 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.66 on 48 degrees of freedom
Multiple R-squared:  0.167, Adjusted R-squared:  0.15 
F-statistic: 9.62 on 1 and 48 DF,  p-value: 0.00323

  • R2=0.17

Two Models, No Answers

  • we now have two separate models, neither of which maps well onto the assumptions

  • each suggests an effect

    • one of age

    • one of hrs_wk

  • if we run them independently, the chances of a type 1 error are

    • \(\frac{1}{20}\) (mod, including hrs_wk)

    • \(\frac{1}{20}\) (mod2, including age)

  • or \(\frac{1}{10}\) overall

  • we need to test multiple predictors in one linear model
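The \(\frac{1}{10}\) figure is a close approximation; treating the two tests as independent, the exact familywise rate is:

```r
# probability of at least one type 1 error across two
# independent tests, each run at alpha = .05
alpha <- 0.05
familywise <- 1 - (1 - alpha)^2
familywise   # 0.0975, just under 1/10
```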

Model Equations Again

\[\color{red}{\textrm{outcome}_i}=\color{blue}{(\textrm{model})_i}+\textrm{error}_i\]

\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_i}+\epsilon_i\]

linear model with two predictors

\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\epsilon_i\]

\[\color{red}{\hat{y}_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}\]

Two Predictors

\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\epsilon_i\]

\[\color{red}{\hat{y}_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}\]

mod.m <- lm(R_AGE ~ 1 + hrs_wk + age, data = reading)

or

mod.m <- lm(R_AGE ~ 1 + age + hrs_wk, data = reading)
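As a quick sanity check that the two orderings really are the same model, here is a sketch using simulated stand-in data (the variable ranges and coefficients below are made up, since the full reading data aren't reproduced here):

```r
# simulated stand-in for the 'reading' data (values are illustrative only)
set.seed(42)
reading_sim <- data.frame(age    = runif(50, min = 6, max = 10),
                          hrs_wk = runif(50, min = 3, max = 6))
reading_sim$R_AGE <- 0.5 + 0.6 * reading_sim$age +
  0.9 * reading_sim$hrs_wk + rnorm(50, sd = 2.5)

m1 <- lm(R_AGE ~ 1 + hrs_wk + age, data = reading_sim)
m2 <- lm(R_AGE ~ 1 + age + hrs_wk, data = reading_sim)

# same model, so identical coefficient estimates (printing order aside)
all.equal(coef(m1)[c("(Intercept)", "age", "hrs_wk")],
          coef(m2)[c("(Intercept)", "age", "hrs_wk")])   # TRUE
```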

Running a Multiple Regression

mod.m <- lm(R_AGE ~ age + hrs_wk, data = reading)
summary(mod.m)

Call:
lm(formula = R_AGE ~ age + hrs_wk, data = reading)

Residuals:
   Min     1Q Median     3Q    Max 
-3.820 -2.382  0.074  2.404  3.549 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    0.518      2.422    0.21    0.832  
age            0.578      0.222    2.61    0.012 *
hrs_wk         0.945      0.406    2.33    0.024 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.55 on 47 degrees of freedom
Multiple R-squared:  0.253, Adjusted R-squared:  0.221 
F-statistic: 7.97 on 2 and 47 DF,  p-value: 0.00105

Running a Multiple Regression

...
(Intercept)    0.518      2.422    0.21    0.832  
age            0.578      0.222    2.61    0.012 *
hrs_wk         0.945      0.406    2.33    0.024 *
...
  • there are independent effects of age and practice

    • reading age improves by 0.58 years for each year of age

    • reading age improves by 0.95 years for each hour of practice

  • the intercept (0 years old, 0 hours/week) is meaningless!

  • important question: is this model better than a model based just on age?
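One way to read the slopes: plugging the estimates into the prediction equation, a one-year difference in age (at fixed hrs_wk) changes the predicted reading age by exactly the age coefficient. A small sketch using the rounded estimates from the summary:

```r
# coefficients taken (rounded) from summary(mod.m)
b <- c("(Intercept)" = 0.518, age = 0.578, hrs_wk = 0.945)
pred <- function(age, hrs_wk) {
  b["(Intercept)"] + b["age"] * age + b["hrs_wk"] * hrs_wk
}
# a 9-year-old vs an 8-year-old, both practising 4 hrs/wk
unname(pred(9, 4) - pred(8, 4))   # 0.578, the age coefficient
```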

Model Fit

...
Multiple R-squared:  0.253, Adjusted R-squared:  0.221 
...
  • in multiple regression, R2 measures the fit of the entire model

    • equal to the sum of the individual R2s only if the predictors are uncorrelated

  • R2=0.25 looks better than the R2 for mod2 (age as a predictor) of 0.17

  • but any predictor will improve R2 (chance associations guarantee this)

mod.r <- update(mod2, ~. + runif(50))
summary(mod.r)
...
Multiple R-squared:  0.196, Adjusted R-squared:  0.161 
...
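Adjusted R2 applies the penalty for extra predictors that plain R2 lacks; its formula can be checked against the output above (n = 50 children, k = 2 predictors):

```r
# R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
n  <- 50      # observations
k  <- 2       # predictors (age, hrs_wk)
r2 <- 0.253   # Multiple R-squared from summary(mod.m)
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(r2_adj, 3)   # 0.221, as reported by summary(mod.m)
```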

Comparing Models

...
Multiple R-squared:  0.253, Adjusted R-squared:  0.221 
F-statistic: 7.97 on 2 and 47 DF,  p-value: 0.00105
  • F ratio compares model to the null model

  • but we can also compare models in succession

mod.n <- lm(R_AGE ~ 1, data = reading)
anova(mod.n, mod2, mod.m)
Analysis of Variance Table

Model 1: R_AGE ~ 1
Model 2: R_AGE ~ age
Model 3: R_AGE ~ age + hrs_wk
  Res.Df RSS Df Sum of Sq     F Pr(>F)   
1     49 409                             
2     48 340  1      68.2 10.50 0.0022 **
3     47 305  1      35.3  5.43 0.0241 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing Models

explicit way

anova(mod.n, mod2, mod.m)
Analysis of Variance Table

Model 1: R_AGE ~ 1
Model 2: R_AGE ~ age
Model 3: R_AGE ~ age + hrs_wk
  Res.Df RSS Df Sum of Sq     F Pr(>F)
1     49 409                          
2     48 340  1      68.2 10.50 0.0022
3     47 305  1      35.3  5.43 0.0241

lazy way

anova(mod.m)
Analysis of Variance Table

Response: R_AGE
          Df Sum Sq Mean Sq F value Pr(>F)
age        1   68.2    68.2   10.50 0.0022
hrs_wk     1   35.3    35.3    5.43 0.0241
Residuals 47  305.2     6.5               
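Each F in these tables is just the predictor's mean square divided by the residual mean square; for example, the hrs_wk row can be reproduced (up to rounding of the printed inputs) by hand:

```r
# F for hrs_wk: (Sum Sq / Df) / (residual Sum Sq / residual Df)
ss_hrs <- 35.3    # Sum Sq for hrs_wk
rss    <- 305.2   # residual Sum Sq
f_hrs  <- (ss_hrs / 1) / (rss / 47)
round(f_hrs, 1)   # 5.4, matching the 5.43 in the table
```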

Order Matters!

age then hrs_wk

Analysis of Variance Table

Response: R_AGE
          Df Sum Sq Mean Sq F value Pr(>F)
age        1   68.2    68.2   10.50 0.0022
hrs_wk     1   35.3    35.3    5.43 0.0241
Residuals 47  305.2     6.5               

hrs_wk then age

Analysis of Variance Table

Response: R_AGE
          Df Sum Sq Mean Sq F value Pr(>F)
hrs_wk     1   59.4    59.4    9.14  0.004
age        1   44.1    44.1    6.79  0.012
Residuals 47  305.2     6.5               

Order Matters!

  • order affects the F ratios because R, by default, uses Type 1 sums of squares

    • these calculate each predictor’s improvement to the model in turn

  • compare this to Type 3 sums of squares

    • these calculate each predictor’s improvement to the model taking all other predictors into account

  • in R, predictors should be entered into the model in a theoretically motivated order
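The order dependence can be demonstrated directly with simulated correlated predictors (the names x1, x2, and y below are made up for illustration):

```r
set.seed(1)
x1 <- rnorm(60)
x2 <- 0.5 * x1 + rnorm(60)   # x2 is correlated with x1
y  <- x1 + x2 + rnorm(60)

a12 <- anova(lm(y ~ x1 + x2))   # x1 entered first
a21 <- anova(lm(y ~ x2 + x1))   # x1 entered second

# the sequential (Type 1) SS attributed to x1 depends on its position...
a12["x1", "Sum Sq"] == a21["x1", "Sum Sq"]   # FALSE
# ...but the fitted model, and hence the residual SS, is identical
all.equal(a12["Residuals", "Sum Sq"], a21["Residuals", "Sum Sq"])   # TRUE
```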

The Two-Predictor Model

End