Scaling and Contrasts

Univariate Statistics and Methodology using R

Martin Corley

Psychology, PPLS

University of Edinburgh

Scaling

Learning to Read

age hrs_wk method R_AGE
9.982 5.137 phonics 14.090
8.006 4.353 phonics 11.762
9.349 5.808 phonics 13.838
8.678 5.441 phonics 13.315
10.150 4.589 phonics 14.271
10.454 4.605 word 9.634
9.366 4.612 word 7.981
8.620 3.612 word 6.620
8.060 4.447 word 7.524
6.117 5.085 word 5.502

Learning to Read

...
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    0.518      2.422    0.21    0.832  
age            0.578      0.222    2.61    0.012 *
hrs_wk         0.945      0.406    2.33    0.024 *
...
  • as we noted last week, the intercept for this model is nonsensical

    • “children aged zero who read for zero hours a week have a predicted reading age of 0.52”
  • perhaps there’s something we can do about this?

One-Predictor Model

  • let’s start with a model with a single predictor of age
# model
mod2 <- lm(R_AGE ~ age,
    data = reading)

# figure
reading |>
    ggplot(aes(x = age,
        y = R_AGE)) + xlab("age (yrs)") +
    ylab("reading age (yrs)") +
    geom_point(size = 3) +
    geom_smooth(method = "lm")

Changing the Intercept

  • actually it’s fairly easy to move the intercept

  • we can just pick a “useful-looking” value

  • for example, we might want the intercept to tell us about students at age 8

    • this is a decision; no magic about it


A Model With a New Intercept

original model

...
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    4.238      1.904    2.23   0.0307 * 
age            0.699      0.225    3.10   0.0032 **
...

new model

mod2b <- lm(R_AGE ~ I(age-8), data=reading)
summary(mod2b)
...
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.829      0.382    25.7   <2e-16 ***
I(age - 8)     0.699      0.225     3.1   0.0032 ** 
...
  • I() protects arithmetic operators such as +, -, and * from their special meanings in the formula context
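For example (hypothetical x and y, not the reading data): in a formula, a bare `-` is a structural operator, so `y ~ x - 1` removes the intercept; wrapped in I(), it is ordinary subtraction.

```r
set.seed(42)
x <- 1:10
y <- 2 * x + rnorm(10)

coef(lm(y ~ x - 1))     # one coefficient: `- 1` drops the intercept
coef(lm(y ~ I(x - 1)))  # two coefficients: the intercept now refers to x = 1
```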

Fit Remains Unchanged

original model

...
Multiple R-squared:  0.167, Adjusted R-squared:  0.15 
F-statistic: 9.62 on 1 and 48 DF,  p-value: 0.00323

new model

...
Multiple R-squared:  0.167, Adjusted R-squared:  0.15 
F-statistic: 9.62 on 1 and 48 DF,  p-value: 0.00323

A Model with a New Slope

  • it’s also easy to linearly scale the slope

  • we can just pick a “useful” scale

  • for example, we might want to examine the effect per month of age

    • this is a decision; no magic about it


A Model With a New Slope

original model

...
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    4.238      1.904    2.23   0.0307 * 
age            0.699      0.225    3.10   0.0032 **
...

new model

mod2c <- lm(R_AGE ~ I(age*12), data=reading)
summary(mod2c)
...
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   4.2385     1.9037    2.23   0.0307 * 
I(age * 12)   0.0582     0.0188    3.10   0.0032 **
...
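The new slope is just the old one rescaled: 0.699 / 12 ≈ 0.0582. A quick check with simulated stand-in data (not the real reading data):

```r
# simulated stand-in for the reading data
set.seed(1)
age <- runif(50, 6, 11)
R_AGE <- 2 + 0.7 * age + rnorm(50)

# multiplying the predictor by 12 divides its slope by 12
coef(lm(R_AGE ~ age))["age"] / 12
coef(lm(R_AGE ~ I(age * 12)))["I(age * 12)"]  # same value
```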

Fit Remains Unchanged

original model

...
Multiple R-squared:  0.167, Adjusted R-squared:  0.15 
F-statistic: 9.62 on 1 and 48 DF,  p-value: 0.00323

new model

...
Multiple R-squared:  0.167, Adjusted R-squared:  0.15 
F-statistic: 9.62 on 1 and 48 DF,  p-value: 0.00323

We Can Get Fancy About This

mod.mb <- lm(R_AGE ~ I((age-8)*12) + I(hrs_wk-mean(hrs_wk)), data=reading)
summary(mod.mb)

Call:
lm(formula = R_AGE ~ I((age - 8) * 12) + I(hrs_wk - mean(hrs_wk)), 
    data = reading)

Residuals:
   Min     1Q Median     3Q    Max 
-3.820 -2.382  0.074  2.404  3.549 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)                9.8627     0.3657   26.97   <2e-16 ***
I((age - 8) * 12)          0.0482     0.0185    2.61    0.012 *  
I(hrs_wk - mean(hrs_wk))   0.9455     0.4056    2.33    0.024 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.55 on 47 degrees of freedom
Multiple R-squared:  0.253, Adjusted R-squared:  0.221 
F-statistic: 7.97 on 2 and 47 DF,  p-value: 0.00105

But We Need to Know What We’re Doing

(Intercept)                9.8627     0.3657   26.97   <2e-16 ***
  • the modelled reading age of someone

    • aged 8
    • who reads for the average number of hours per week
I((age - 8) * 12)          0.0482     0.0185    2.61    0.012 *  
  • the modelled increase in reading age per month of age
I(hrs_wk - mean(hrs_wk))   0.9455     0.4056    2.33    0.024 *  
  • the modelled increase in reading age per hour/week of reading
  • note that we need to understand all the other coefficients in order to interpret the intercept

Which Has a Bigger Effect?

  • in our two-predictor model, is age more important than practice? or vice versa?

  • hard to tell because the predictors are in different units

...
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    0.518      2.422    0.21    0.832  
age            0.578      0.222    2.61    0.012 *
hrs_wk         0.945      0.406    2.33    0.024 *
...

Standardisation

  • if the predictors and outcome are very roughly normally distributed…

  • we can calculate z-scores by subtracting the mean and dividing by the standard deviation

\[z_i=\frac{x_i-\bar{x}}{\sigma_x}\]
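In other words (a minimal sketch with made-up numbers): subtract the mean, divide by the standard deviation. R's scale() does exactly this.

```r
x <- c(4.2, 5.1, 3.8, 6.0, 4.9)
z <- (x - mean(x)) / sd(x)   # z-scores by hand
all.equal(z, as.vector(scale(x)))  # TRUE: scale() gives the same values
```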

Standardisation

  • in R, the scale() function calculates z-scores

  • in R, you don’t need to (and often shouldn’t) create new columns

mod.ms <- lm(scale(R_AGE) ~ scale(age) + scale(hrs_wk), data=reading)
  • variables all in terms of standard deviations from the mean

  • at the intercept, age and hrs_wk are at their means

  • slopes: “how many standard deviations does R_AGE change for a one standard deviation change in the predictor?”

  • model fit won’t change
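We can verify with simulated data (not the real reading data) that standardisation leaves the fit untouched:

```r
# simulated stand-in for the reading data
set.seed(1)
age <- runif(50, 6, 11)
hrs_wk <- rnorm(50, 5, 1)
R_AGE <- 1 + 0.6 * age + 0.9 * hrs_wk + rnorm(50, sd = 2)
d <- data.frame(R_AGE, age, hrs_wk)

r2_raw <- summary(lm(R_AGE ~ age + hrs_wk, data = d))$r.squared
r2_std <- summary(lm(scale(R_AGE) ~ scale(age) + scale(hrs_wk), data = d))$r.squared
all.equal(r2_raw, r2_std)  # TRUE: R-squared is unchanged
```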

Standardisation

summary(mod.ms)
...
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)   2.39e-16   1.25e-01    0.00    1.000  
scale(age)    3.38e-01   1.30e-01    2.61    0.012 *
scale(hrs_wk) 3.02e-01   1.30e-01    2.33    0.024 *
...
  • R_AGE changes 0.34 sds for a 1-sd change in age, and 0.30 sds for a 1-sd change in hrs_wk

  • reasonable conclusion might be that age has a greater effect on reading age than does practice
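A standardised slope is just the raw slope rescaled by sd(x)/sd(y); a single-predictor sketch with simulated data (not the reading data):

```r
set.seed(2)
x <- rnorm(40, 10, 2)
y <- 3 + 0.5 * x + rnorm(40)

b_raw <- coef(lm(y ~ x))["x"]
b_std <- coef(lm(scale(y) ~ scale(x)))["scale(x)"]
b_raw * sd(x) / sd(y)  # equals the standardised slope b_std
```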

Categorical Predictors

Playmobil vs. SuperZings

  • some important pretesting went into these lectures

  • every individual figure rated for “usefulness” in explaining stats

  • how do we decide which to use?

Playmobil vs. SuperZings

type UTILITY
playmo 9.7
zing 3.8
playmo 4.7
zing 0.1
playmo 5.1
zing 1.2
playmo 5.1
zing 2.1
playmo 2.5
zing 2.9
  • we know one way to answer this
t.test(UTILITY ~ type, data = toys,
    var.equal = TRUE)

    Two Sample t-test

data:  UTILITY by type
t = 2.5, df = 8, p-value = 0.03
alternative hypothesis: true difference in means between group playmo and group zing is not equal to 0
95 percent confidence interval:
 0.3115 6.4885
sample estimates:
mean in group playmo   mean in group zing 
                5.42                 2.02 

The Only Equation You’ll Ever Need

  • which toys are the most useful?

\[\textrm{outcome}_i=(\textrm{model})_i+\textrm{error}_i\]

\[\textrm{utility}_i=(\textrm{some function of type})_i+\epsilon_i\]

  • we need to represent type as a number

  • the simplest way of doing this is to use 0 or 1

Quantifying a Nominal Predictor

toys <- toys |>
    mutate(is_zing = ifelse(type ==
        "zing", 1, 0))
toys
# A tibble: 10 × 3
   type   UTILITY is_zing
   <fct>    <dbl>   <dbl>
 1 playmo     9.7       0
 2 zing       3.8       1
 3 playmo     4.7       0
 4 zing       0.1       1
 5 playmo     5.1       0
 6 zing       1.2       1
 7 playmo     5.1       0
 8 zing       2.1       1
 9 playmo     2.5       0
10 zing       2.9       1
  • this maps to a linear model

\[\color{red}{\textrm{utility}_i}=\color{blue}{b_0+b_1\cdot{}\textrm{is_zing}}+\epsilon_i\]

  • \(\overline{\textrm{utility}}\) for playmobil is intercept

  • “change due to zing-ness” is slope
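We can check this directly from the ratings in the table above:

```r
playmo <- c(9.7, 4.7, 5.1, 5.1, 2.5)
zing   <- c(3.8, 0.1, 1.2, 2.1, 2.9)

mean(playmo)               # 5.42: the intercept
mean(zing) - mean(playmo)  # -3.4: the slope for is_zing
```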

Linear Model Using is_zing

mod1 <- lm(UTILITY ~ is_zing, data = toys)
summary(mod1)

Call:
lm(formula = UTILITY ~ is_zing, data = toys)

Residuals:
   Min     1Q Median     3Q    Max 
-2.920 -0.795 -0.320  0.680  4.280 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.420      0.947    5.72  0.00044 ***
is_zing       -3.400      1.339   -2.54  0.03479 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.12 on 8 degrees of freedom
Multiple R-squared:  0.446, Adjusted R-squared:  0.377 
F-statistic: 6.44 on 1 and 8 DF,  p-value: 0.0348

Let R Do the Work

contrasts(toys$type)
       zing
playmo    0
zing      1
  • already built-in to factors

  • NB the first value will be the default intercept (because \(b_n=0\) for that value)

  • as long as we have a factor, can just use lm() with that column

Linear Model Using type

original

summary(mod1)
...
(Intercept)    5.420      0.947    5.72  0.00044 ***
is_zing       -3.400      1.339   -2.54  0.03479 *  
...

with the factor type

mod2 <- lm(UTILITY~type, data=toys)
summary(mod2)
...
(Intercept)    5.420      0.947    5.72  0.00044 ***
typezing      -3.400      1.339   -2.54  0.03479 *  
...
  • the intercept corresponds to playmo, which we infer from the name of the other coefficient (typezing)

Graphically

  • shows “what the model is doing”, but isn’t a very good presentation

  • the line suggests you can make predictions for types between playmo and zing

Graphically

  • error bars represent one standard error of the mean

What About Lego Figures?

  • we now have three groups

  • can’t label them c(0, 1, 2) because that would express a linear relationship
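A sketch with simulated (hypothetical) data of why c(0, 1, 2) won't do: numeric coding spends a single degree of freedom on one slope, forcing the group means onto a straight line, whereas the factor gets two.

```r
set.seed(3)
type <- factor(rep(c("lego", "playmo", "zing"), each = 5))
utility <- c(rnorm(5, 1), rnorm(5, 5), rnorm(5, 2))

anova(lm(utility ~ as.numeric(type)))  # 1 df: forces equal spacing of means
anova(lm(utility ~ type))              # 2 df: each group mean estimated freely
```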

Independent Effects

  • “change due to lego-ness” is independent of other changes

  • solution: add another predictor

toys <- toys |>
  mutate(
    is_playmo = ifelse(type == "playmo", 1, 0),
    is_zing = ifelse(type == "zing", 1, 0)
  )
head(toys)
# A tibble: 6 × 4
  type   UTILITY is_playmo is_zing
  <fct>    <dbl>     <dbl>   <dbl>
1 lego       0.1         0       0
2 playmo     9.7         1       0
3 zing       3.8         0       1
4 lego       1.2         0       0
5 playmo     4.7         1       0
6 zing       0.1         0       1

Three-Level Predictor: Two Coefficients

type UTILITY is_playmo is_zing
lego 0.1 0 0
playmo 9.7 1 0
zing 3.8 0 1
lego 1.2 0 0
playmo 4.7 1 0
zing 0.1 0 1

\[\textrm{utility}_i=\color{blue}{b_0}\color{gray}{+b_1\cdot\textrm{is_playmo}_i+b_2\cdot\textrm{is_zing}_i}+\epsilon_i\]

\[\textrm{utility}_i=\color{blue}{b_0}\color{gray}{+b_1\cdot0+b_2\cdot0}+\epsilon_i\]

“utility of a lego minifig”

Three-Level Predictor: Two Coefficients

type UTILITY is_playmo is_zing
lego 0.1 0 0
playmo 9.7 1 0
zing 3.8 0 1
lego 1.2 0 0
playmo 4.7 1 0
zing 0.1 0 1

\[\textrm{utility}_i=\color{blue}{b_0}\color{red}{+b_1\cdot\textrm{is_playmo}_i}+\color{gray}{b_2\cdot\textrm{is_zing}_i}+\epsilon_i\] \[\textrm{utility}_i=\color{blue}{b_0}\color{red}{+b_1\cdot1}+\color{gray}{b_2\cdot0}+\epsilon_i\]

“change in utility from lego due to being a playmo”

Three-Level Predictor: Two Coefficients

type UTILITY is_playmo is_zing
lego 0.1 0 0
playmo 9.7 1 0
zing 3.8 0 1
lego 1.2 0 0
playmo 4.7 1 0
zing 0.1 0 1

\[\textrm{utility}_i=\color{blue}{b_0}\color{gray}{+b_1\cdot\textrm{is_playmo}_i}+\color{red}{b_2\cdot\textrm{is_zing}_i}+\epsilon_i\] \[\textrm{utility}_i=\color{blue}{b_0}\color{gray}{+b_1\cdot0}+\color{red}{b_2\cdot1}+\epsilon_i\]

“change in utility from lego due to being a zing”

Three-level Predictor Linear Model

mod <- lm(UTILITY ~ is_playmo + is_zing, data = toys)
summary(mod)

Call:
lm(formula = UTILITY ~ is_playmo + is_zing, data = toys)

Residuals:
   Min     1Q Median     3Q    Max 
 -2.92  -0.77   0.04   0.49   4.28 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    1.160      0.792    1.47   0.1686   
is_playmo      4.260      1.120    3.80   0.0025 **
is_zing        0.860      1.120    0.77   0.4573   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.77 on 12 degrees of freedom
Multiple R-squared:  0.574, Adjusted R-squared:  0.503 
F-statistic:  8.1 on 2 and 12 DF,  p-value: 0.00595

Let R Do the Work

  • what we saw was, conceptually, a multiple regression (two predictors)
contrasts(toys$type)
       playmo zing
lego        0    0
playmo      1    0
zing        0    1
  • already built-in to factors, as before

  • NB this time lego is the intercept

Let R Do the Work

mod <- lm(UTILITY ~ type, data = toys)
summary(mod)

Call:
lm(formula = UTILITY ~ type, data = toys)

Residuals:
   Min     1Q Median     3Q    Max 
 -2.92  -0.77   0.04   0.49   4.28 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    1.160      0.792    1.47   0.1686   
typeplaymo     4.260      1.120    3.80   0.0025 **
typezing       0.860      1.120    0.77   0.4573   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.77 on 12 degrees of freedom
Multiple R-squared:  0.574, Adjusted R-squared:  0.503 
F-statistic:  8.1 on 2 and 12 DF,  p-value: 0.00595

\[\hat{\textrm{utility}} = b_0\]

\[+ \color{blue}{b_1}\cdot\textrm{is_playmo}\]

\[+ \color{red}{b_2}\cdot\textrm{is_zing}\]

Graphically

\[\textrm{utility}_i=\color{grey}{b_0}\color{blue}{+b_1}\cdot\textrm{is_playmo}_i+\color{red}{b_2}\cdot\textrm{is_zing}_i+\epsilon_i\]

Not Quite a Multiple Regression

anova(mod)
Analysis of Variance Table

Response: UTILITY
          Df Sum Sq Mean Sq F value Pr(>F)   
type       2   50.7   25.37     8.1 0.0059 **
Residuals 12   37.6    3.13                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • type is still one variable

  • in some sense it can’t be decomposed

  • but it has two degrees of freedom because \(b_1\) and \(b_2\) are independent of each other
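A sketch with simulated three-group data (not the toys data): the factor's two-degree-of-freedom F test is the same test as comparing the model against an intercept-only model.

```r
set.seed(4)
g <- factor(rep(c("lego", "playmo", "zing"), each = 5))
y <- rnorm(15) + as.numeric(g)

anova(lm(y ~ g))             # the factor's 2-df F test...
anova(lm(y ~ 1), lm(y ~ g))  # ...equals this model comparison's F
```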

But You Never Compared Playmo to Zing

  • change the order of the factor
toys <-
  toys |> mutate(type=fct_relevel(type,c('playmo','lego','zing')))
mod.r <- lm(UTILITY~type, data=toys)
summary(mod.r)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.420      0.792    6.85 0.000018 ***
typelego      -4.260      1.120   -3.80   0.0025 ** 
typezing      -3.400      1.120   -3.04   0.0103 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.77 on 12 degrees of freedom
Multiple R-squared:  0.574, Adjusted R-squared:  0.503 
F-statistic:  8.1 on 2 and 12 DF,  p-value: 0.00595

Different Contrast Codings

  • multiple models increase the chance of a type 1 error
  • different contrast codings may suit our research question
  • a predictor with \(g\) levels (or ‘groups’) has \(g-1\) possible contrasts
  • contrasts can be anything you like (a few are built in to R)

 

  • contrasts act like ‘tests of differences of interest’

Sum Contrasts

  • how much better are playmos than the average toy?

  • let’s set lego aside and just look at playmos and zings

type UTILITY is_playmo is_zing
playmo 9.7 1 0
zing 3.8 0 1
playmo 4.7 1 0
zing 0.1 0 1
playmo 5.1 1 0
zing 1.2 0 1
playmo 5.1 1 0
zing 2.1 0 1
playmo 2.5 1 0
zing 2.9 0 1
  • ‘average’ (grand mean) utility = mean(toys$UTILITY) = 3.72

Sum Contrasts

  • we can change the contrast type of toys$type
contrasts(toys$type) <- "contr.sum"
contrasts(toys$type)
       [,1]
playmo    1
zing     -1
mod.s <- lm(UTILITY ~ type, data=toys)
summary(mod.s)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.72       0.67    5.56  0.00054 ***
type1           1.70       0.67    2.54  0.03479 *  
...
  • mean = 3.72
  • difference = 1.7
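Again we can check this against the ratings from the earlier table: with sum contrasts, the intercept is the mean of the group means, and type1 is playmo's distance from it.

```r
playmo <- c(9.7, 4.7, 5.1, 5.1, 2.5)
zing   <- c(3.8, 0.1, 1.2, 2.1, 2.9)

gm <- (mean(playmo) + mean(zing)) / 2  # mean of group means
gm                 # 3.72: the intercept
mean(playmo) - gm  # 1.7: the type1 coefficient
```

(With equal group sizes, as here, the mean of the group means equals the grand mean of all the ratings.)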

End