age | hrs_wk | method | R_AGE |
---|---|---|---|
9.982 | 5.137 | phonics | 14.090 |
8.006 | 4.353 | phonics | 11.762 |
9.349 | 5.808 | phonics | 13.838 |
8.620 | 3.612 | word | 6.620 |
8.060 | 4.447 | word | 7.524 |
6.117 | 5.085 | word | 5.502 |
Univariate Statistics and Methodology using R
Psychology, PPLS
University of Edinburgh
age | hrs_wk | method | R_AGE |
---|---|---|---|
9.982 | 5.137 | phonics | 14.090 |
8.006 | 4.353 | phonics | 11.762 |
9.349 | 5.808 | phonics | 13.838 |
8.620 | 3.612 | word | 6.620 |
8.060 | 4.447 | word | 7.524 |
6.117 | 5.085 | word | 5.502 |
Call:
lm(formula = R_AGE ~ age + hrs_wk, data = reading)
Residuals:
Min 1Q Median 3Q Max
-3.820 -2.382 0.074 2.404 3.549
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.518 2.422 0.21 0.832
age 0.578 0.222 2.61 0.012 *
hrs_wk 0.945 0.406 2.33 0.024 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.55 on 47 degrees of freedom
Multiple R-squared: 0.253, Adjusted R-squared: 0.221
F-statistic: 7.97 on 2 and 47 DF, p-value: 0.00105
our model says that age
and hrs_wk
have orthogonal effects
what if practice affects people of different ages differently?
\[\color{red}{\textrm{outcome}_i}=\color{blue}{(\textrm{model})_i}+\textrm{error}_i\] \[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\epsilon_i\]
\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\color{blue}{b_3}\cdot{}\color{orange}{x_{1i}x_{2i}}+\epsilon_i\] \[\color{red}{\hat{y_i}}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\color{blue}{b_3}\cdot{}\color{orange}{x_{1i}x_{2i}}\]
\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\color{blue}{b_3}\cdot{}\color{orange}{x_{1i}x_{2i}}+\epsilon_i\]
\[\color{red}{\hat{y_i}}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\color{blue}{b_3}\cdot{}\color{orange}{x_{1i}x_{2i}}\]
\[\color{red}{\hat{y_i}}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\color{blue}{b_3}\cdot{}\color{orange}{x_{1i}x_{2i}}\]
\(\color{blue}{b_0=1}\)
\(\color{blue}{b_1=1.5}\)
\(\color{blue}{b_2=2}\)
\(\color{blue}{b_3=0.5}\)
\(\color{red}{x_{1i}=2}\)
\[\color{red}{\hat{y_i}}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\color{blue}{b_3}\cdot{}\color{orange}{x_{1i}x_{2i}}\]
\(\color{blue}{b_0=1}\)
\(\color{blue}{b_1=1.5}\)
\(\color{blue}{b_2=2}\)
\(\color{blue}{b_3=0.5}\)
\(\color{red}{x_{1i}=2}\)
\(\color{green}{x_{1i}=4}\)
Call:
lm(formula = R_AGE ~ 1 + hrs_wk + age + hrs_wk:age, data = reading)
Residuals:
Min 1Q Median 3Q Max
-4.217 -2.172 -0.049 2.144 4.318
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.792 11.904 2.08 0.043 *
hrs_wk -3.818 2.324 -1.64 0.107
age -2.348 1.423 -1.65 0.106
hrs_wk:age 0.569 0.274 2.08 0.043 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.46 on 46 degrees of freedom
Multiple R-squared: 0.317, Adjusted R-squared: 0.273
F-statistic: 7.13 on 3 and 46 DF, p-value: 0.000496
how does interaction work with categorical predictors?
(as you’ll see) it’s all just multiplication
age | hrs_wk | method | R_AGE |
---|---|---|---|
9.982 | 5.137 | phonics | 14.090 |
8.006 | 4.353 | phonics | 11.762 |
9.349 | 5.808 | phonics | 13.838 |
8.620 | 3.612 | word | 6.620 |
8.060 | 4.447 | word | 7.524 |
6.117 | 5.085 | word | 5.502 |
method
coded by R?we know that hrs_wk
affects reading age
perhaps method
affects reading age too?
this is a question of model improvement
Analysis of Variance Table
Response: R_AGE
Df Sum Sq Mean Sq F value Pr(>F)
hrs_wk 1 59.4 59.4 26.9 4.4e-06 ***
method 1 245.6 245.6 111.3 5.5e-14 ***
Residuals 47 103.7 2.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = R_AGE ~ hrs_wk + method, data = reading)
Residuals:
Min 1Q Median 3Q Max
-3.389 -1.008 0.209 1.055 2.430
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.237 1.205 6.00 2.6e-07 ***
hrs_wk 1.003 0.231 4.35 7.2e-05 ***
methodword -4.446 0.421 -10.55 5.5e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.49 on 47 degrees of freedom
Multiple R-squared: 0.746, Adjusted R-squared: 0.735
F-statistic: 69.1 on 2 and 47 DF, p-value: 1.01e-14
note that the lines are parallel
an hour of practice has the same effect, however you’re taught
Analysis of Variance Table
Response: R_AGE
Df Sum Sq Mean Sq F value Pr(>F)
hrs_wk 1 59.4 59.4 29.5 2.0e-06 ***
method 1 245.6 245.6 122.2 1.5e-14 ***
hrs_wk:method 1 11.3 11.3 5.6 0.022 *
Residuals 46 92.5 2.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = R_AGE ~ hrs_wk + method + hrs_wk:method, data = reading)
Residuals:
Min 1Q Median 3Q Max
-2.995 -0.674 0.242 1.017 2.593
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.499 1.365 4.03 0.00021 ***
hrs_wk 1.347 0.264 5.11 0.0000061 ***
methodword 1.183 2.413 0.49 0.62629
hrs_wk:methodword -1.134 0.479 -2.37 0.02224 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.42 on 46 degrees of freedom
Multiple R-squared: 0.774, Adjusted R-squared: 0.759
F-statistic: 52.4 on 3 and 46 DF, p-value: 6.95e-15
\[\color{red}{\hat{y_i}}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\color{blue}{b_3}\cdot{}\color{orange}{x_{1i}x_{2i}}\]
(Intercept) hrs_wk methodword hrs_wk:methodword
5.499 1.347 1.183 -1.134
\[\color{red}{\hat{\textrm{R_AGE}}}=\color{blue}{5.499}\cdot{}\color{orange}{1}+\color{blue}{1.347}\cdot{}\color{orange}{\textrm{hrs_wk}}+\color{blue}{1.183}\cdot{}\color{orange}{\textrm{method}}+\color{blue}{-1.134}\cdot{}\color{orange}{\textrm{hrs_wk}\cdot{}\textrm{method}}\]
the coefficients show you which way things are coded
methodword
can be read as “when method
is word
”
method
is (coded as) zero (for phonics):\[\color{red}{\hat{\textrm{R_AGE}}}=\color{blue}{5.499}\cdot{}\color{orange}{1}+\color{blue}{1.347}\cdot{}\color{orange}{\textrm{hrs_wk}}+\color{blue}{1.183}\cdot{}\color{orange}{0}+\color{blue}{-1.134}\cdot{}\color{orange}{\textrm{hrs_wk}\cdot{}0}\]
\[\color{red}{\hat{\textrm{R_AGE}}}=\color{blue}{5.499}\cdot{}\color{orange}{1}+\color{blue}{1.347}\cdot{}\color{orange}{\textrm{hrs_wk}}\]
method
is (coded as) one (for word):\[\color{red}{\hat{\textrm{R_AGE}}}=\color{blue}{5.499}\cdot{}\color{orange}{1}+\color{blue}{1.347}\cdot{}\color{orange}{\textrm{hrs_wk}}+\color{blue}{1.183}\cdot{}\color{orange}{1}+\color{blue}{-1.134}\cdot{}\color{orange}{\textrm{hrs_wk}\cdot{}1}\]
\[\color{red}{\hat{\textrm{R_AGE}}}=\color{blue}{6.682}\cdot{}\color{orange}{1}+\color{blue}{0.213}\cdot{}\color{orange}{\textrm{hrs_wk}}\]
geom_smooth(method="lm")
effectively runs a linear model, to make a graph
it’s not an analysis
age | hrs_wk | method | school | R_AGE |
---|---|---|---|---|
9.982 | 5.137 | phonics | private | 14.090 |
8.006 | 4.353 | phonics | state | 11.762 |
9.349 | 5.808 | phonics | state | 13.838 |
8.620 | 3.612 | word | private | 6.620 |
8.060 | 4.447 | word | private | 7.524 |
6.117 | 5.085 | word | state | 5.502 |
shoehorning in one more reading example
does where it’s taught affect the efficacy of a method?
Analysis of Variance Table
Response: R_AGE
Df Sum Sq Mean Sq F value Pr(>F)
hrs_wk 1 59.4 59.4 29.5 2.0e-06
method 1 245.6 245.6 122.2 1.5e-14
hrs_wk:method 1 11.3 11.3 5.6 0.022
Residuals 46 92.5 2.0
where you start depends on what you’re doing
have I got a good model? (no: one of many possible issues: missing predictor)
can I improve a model? (for example, I am exploring which predictors are relevant)
in the present case, we already have a theory (that different schools are using the teaching methods differently)
this time we’ll start by looking at the data
looking at the graph, it seems as if
the state school is doing a bit better
but it depends which method we look at
we wouldn’t be able to predict how well a child would do without knowing which kind of school and which method
it seems (graphically) as if our theory/hunch was right
what about statistically?
Analysis of Variance Table
Response: R_AGE
Df Sum Sq Mean Sq F value Pr(>F)
method 1 263 263.2 110.56 8.1e-14 ***
school 1 5 5.0 2.11 0.15357
method:school 1 31 31.0 13.02 0.00076 ***
Residuals 46 110 2.4
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = R_AGE ~ method + school + method:school, data = reading)
Residuals:
Min 1Q Median 3Q Max
-2.620 -1.132 -0.030 0.926 3.877
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.258 0.428 26.31 < 2e-16 ***
methodword -3.038 0.618 -4.92 0.000012 ***
schoolstate 2.209 0.618 3.58 0.00083 ***
methodword:schoolstate -3.152 0.873 -3.61 0.00076 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.54 on 46 degrees of freedom
Multiple R-squared: 0.732, Adjusted R-squared: 0.715
F-statistic: 41.9 on 3 and 46 DF, p-value: 3.32e-13
F(3,46)=41.9
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.258 0.428 26.31 < 2e-16 ***
methodword -3.038 0.618 -4.92 0.000012 ***
schoolstate 2.209 0.618 3.58 0.00083 ***
methodword:schoolstate -3.152 0.873 -3.61 0.00076 ***
the predicted reading age of someone who has been to private school and taught using phonics
methodword
mean?being taught using the word method reduces your reading age by 3 years if you went to private school
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.258 0.428 26.31 < 2e-16 ***
methodword -3.038 0.618 -4.92 0.000012 ***
schoolstate 2.209 0.618 3.58 0.00083 ***
methodword:schoolstate -3.152 0.873 -3.61 0.00076 ***
schoolstate
mean?being taught in a state school increases your predicted reading age by 2.2 years if you are taught using the phonics method
compared to being taught using the phonics method in a private school, your predicted reading age is -3+2.2+-3.2 years different (i.e., it is 11.3-4 or 7.3 years)
we can’t say anything general about the value of each method
we can’t say anything general about the effectiveness of schools
no main effects
possibly the only useful information we have is the interaction
we can fix this…
for example the interaction term can only be added when \(\color{orange}{x_{1i}}\) and \(\color{orange}{x_{2i}}\) (here, method
and school
) are equal to 1
school
, word method
\[\color{red}{\hat{y_i}}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\color{blue}{b_3}\cdot{}\color{orange}{x_{1i}x_{2i}}\]
what we’ve done is changed the values of \(\color{orange}{x_{1}}\) and \(\color{orange}{x_{2}}\)
where method
i is phonics, \(\color{orange}{x_{1i}}=-0.5\)
where method
i is word, \(\color{orange}{x_{1i}}=+0.5\)
similarly for school
and \(\color{orange}{x_{2}}\)
Call:
lm(formula = R_AGE ~ method * school, data = reading)
Residuals:
Min 1Q Median 3Q Max
-2.620 -1.132 -0.030 0.926 3.877
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.056 0.218 46.05 < 2e-16 ***
method1 -4.614 0.437 -10.56 6.9e-14 ***
school1 0.634 0.437 1.45 0.15357
method1:school1 -3.152 0.873 -3.61 0.00076 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.54 on 46 degrees of freedom
Multiple R-squared: 0.732, Adjusted R-squared: 0.715
F-statistic: 41.9 on 3 and 46 DF, p-value: 3.32e-13
F(3,46)=41.9
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.056 0.218 46.05 < 2e-16 ***
method1 -4.614 0.437 -10.56 6.9e-14 ***
school1 0.634 0.437 1.45 0.15357
method1:school1 -3.152 0.873 -3.61 0.00076 ***
note the output isn’t as helpful here, you have to remember which values are 1
depends which you assigned positive values
here, word method
, state school
school
using phonics method
reading age = 10.1 - \(\frac{1}{2}\)·-4.6 - \(\frac{1}{2}\)·0.6 + \(\frac{1}{4}\)·-3.2
reading age = 10.1 + 2.3 - 0.3 - 0.8 = 11.3 years
…etc.
school
with (-.5, .5) contrast codingmethod
with (-.5, .5) contrast coding Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.056 0.218 46.05 < 2e-16 ***
method1 -4.614 0.437 -10.56 6.9e-14 ***
school1 0.634 0.437 1.45 0.15357
method1:school1 -3.152 0.873 -3.61 0.00076 ***
on average, phonics is a better method for teaching
on average, there is no (statistical) difference between types of school
there’s also an interaction between these two predictors (but we should really calculate it out to ensure that we know what it “means”)
a quick read: when school
is state and method
is word, reading age is reduced
but this obscures the fact that state schools are particularly good on phonics—perhaps we should recode?
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.025 0.246 40.71 < 2e-16 ***
method1 -4.589 0.492 -9.32 0.0000000000024 ***