Univariate Statistics and Methodology using R
Psychology, PPLS
University of Edinburgh
...
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
...
there is no relationship between our independent and dependent variables
y ~ 1
or RT ~ 1
how much better does RT ~ BloodAlc do
than RT ~ 1?
\[ \textrm{total SS}=\sum{(y - \bar{y})^2} \]
sum of squared differences between observed \(y\) and mean \(\bar{y}\)
= total amount of variance in the data
= total unexplained variance from the null model
\[\textrm{residual SS} = \sum{(y - \hat{y})^2}\]
sum of squared differences between observed \(y\) and predicted \(\hat{y}\)
= total unexplained variance from a given model in the data
\[ \textrm{model SS} = \sum{(\hat{y} - \bar{y})^2} \]
sum of squared differences between predicted \(\hat{y}\) and mean \(\bar{y}\)
= total variance explained by a given model
\[ R^2 = \frac{\textrm{model SS}}{\textrm{total SS}} = \frac{\sum{(\hat{y}-\bar{y})^2}}{\sum{(y-\bar{y})^2}} \]
how much the model improves over the null
\(0 \le R^2 \le 1\)
we want \(R^2\) to be large
\(F\) ratio depends on mean squares
\[(\textrm{MS}_x=\textrm{SS}_x/\textrm{df}_x)\]
\[F=\frac{\textrm{model MS}}{\textrm{residual MS}}=\frac{\sum{(\hat{y}-\bar{y})^2}/\textrm{df}_m}{\sum{(y-\hat{y})^2}/\textrm{df}_r}\]
how much the model ‘explains’
\(0<F\)
we want \(F\) to be large
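These quantities can all be computed by hand in R as a check on the definitions above. A minimal sketch, assuming the model is fitted to the `dat` data frame shown in the output below and stored in an object called `mod` (the object name is illustrative):

```r
mod <- lm(RT ~ BloodAlc, data = dat)   # the model from the output below

y     <- dat$RT        # observed values
y.hat <- fitted(mod)   # predictions from the model
y.bar <- mean(y)       # the mean (= prediction of the null model RT ~ 1)

total.SS    <- sum((y - y.bar)^2)      # total variance in the data
residual.SS <- sum((y - y.hat)^2)      # variance the model leaves unexplained
model.SS    <- sum((y.hat - y.bar)^2)  # variance the model explains

R2 <- model.SS / total.SS              # "Multiple R-squared" in summary()

df.m <- 1                # model df: one predictor
df.r <- length(y) - 2    # residual df: n minus two estimated coefficients
F.ratio <- (model.SS / df.m) / (residual.SS / df.r)  # the F-statistic
```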
Call:
lm(formula = RT ~ BloodAlc, data = dat)
Residuals:
Min 1Q Median 3Q Max
-115.92 -40.42 1.05 42.93 126.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared: 0.216, Adjusted R-squared: 0.2
F-statistic: 13.2 on 1 and 48 DF, p-value: 0.000673
...
Multiple R-squared: 0.216, Adjusted R-squared: 0.2
F-statistic: 13.2 on 1 and 48 DF, p-value: 0.000673
Analysis of Variance Table
Model 1: RT ~ 1
Model 2: RT ~ BloodAlc
Res.Df RSS Df Sum of Sq F Pr(>F)
1 49 190340
2 48 149223 1 41117 13.2 0.00067 ***
Analysis of Variance Table
Response: RT
Df Sum Sq Mean Sq F value Pr(>F)
BloodAlc 1 41117 41117 13.2 0.00067 ***
Residuals 48 149223 3109
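Both tables can be reproduced with anova(): the first by comparing the null and alcohol models explicitly, the second by passing a single model. A sketch, with illustrative object names:

```r
mod0 <- lm(RT ~ 1,        data = dat)   # null (intercept-only) model
mod1 <- lm(RT ~ BloodAlc, data = dat)

anova(mod0, mod1)   # explicit model comparison (first table)
anova(mod1)         # sequential table for a single model (second table)
```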
Adding a predictor of blood alcohol improved the amount of variance explained over the null model (R2=0.22, F(1,48)=13.2, p=.0007).
Anscombe (1973)
linearity of relationship(!)
for the residuals:
\[y_i=b_0+b_1\cdot{}x_i+\epsilon_i\]
\[\color{red}{\epsilon\sim{}N(0,\sigma)~\text{independently}}\]
homogeneous (differences from \(\hat{y}\) shouldn’t be systematically smaller or larger for different \(x\))
independent (residuals shouldn’t influence other residuals)
Call:
lm(formula = RT ~ BloodAlc, data = dat)
Residuals:
Min 1Q Median 3Q Max
-115.92 -40.42 1.05 42.93 126.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared: 0.216, Adjusted R-squared: 0.2
F-statistic: 13.2 on 1 and 48 DF, p-value: 0.000673
from our residuals, below what value do (say) 5%, or 10%, of observations fall?
convert to “standard deviations from the mean”
Q-Q plot compares the residuals \(\epsilon\) against a known distribution (here, normal)
observations close to the straight line mean residuals are approximately normal
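One way to produce the Q-Q plot in R, assuming the fitted model is called `mod`:

```r
qqnorm(resid(mod))   # residual quantiles against theoretical normal quantiles
qqline(resid(mod))   # reference line: where normally-distributed residuals fall
plot(mod, which = 2) # equivalent built-in diagnostic (standardised residuals)
```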
no easy way to check independence of residuals
in part, because it depends on the source of the observations
one determinant might be a single person being observed multiple times
e.g., my reaction times might tend to be slower than yours
\(\rightarrow\) multivariate statistics
another determinant might be time
observations in a sequence might be autocorrelated
can be checked using the Durbin-Watson Test from the car package
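A sketch of that check, assuming the model object is called `mod`:

```r
library(car)           # provides durbinWatsonTest()
durbinWatsonTest(mod)  # tests for autocorrelation in the residuals
```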
more statistical ways of checking assumptions will be introduced in labs
they tend to have limitations (for example, they’re susceptible to sample size)
nothing beats looking at plots like these (and plot(<model>) makes it easy)
there are no criteria for deciding exactly when assumptions are sufficiently met
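For reference, the diagnostic plots come from a single call (the model name `mod` is illustrative):

```r
par(mfrow = c(2, 2))  # arrange the four panels in a 2x2 grid
plot(mod)             # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
```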
we need to talk about influence
even substantial outliers may only have small effects on the model
here, only the intercept is affected
\[D_i=\frac{\sum_{j=1}^{n}{(\hat{y}_j-\hat{y}_{j(i)})^2}}{(p+1)\hat{\sigma}^2}\]
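R computes Cook's distance directly; a sketch, again assuming the model is called `mod`:

```r
cooks.distance(mod)   # one D value per observation
plot(mod, which = 4)  # Cook's distance by observation number
```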
Call:
lm(formula = RT ~ BloodAlc, data = dat)
Residuals:
Min 1Q Median 3Q Max
-115.92 -40.42 1.05 42.93 126.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared: 0.216, Adjusted R-squared: 0.2
F-statistic: 13.2 on 1 and 48 DF, p-value: 0.000673
Adding a predictor of blood alcohol improved the amount of variance explained over the null model (R2=0.22, F(1,48)=13.2, p=.0007). Reaction time slowed by 32.3 ms for every additional 0.01% blood alcohol by volume (t(48)=3.64, p=.0007).
the Playmo School has been evaluating its reading programmes, using 50 students
ages of students
hours per week students spend reading of their own volition
whether they are taught using phonics or whole-word methods
outcome: “reading age”
| age | hrs_wk | method | R_AGE |
|------|--------|---------|-------|
| 10.0 | 5.1 | phonics | 14.1 |
| 8.0 | 4.4 | phonics | 11.8 |
| 9.3 | 5.8 | phonics | 13.8 |
| 8.7 | 5.4 | phonics | 13.3 |
| 10.2 | 4.6 | phonics | 14.3 |
| 10.5 | 4.6 | word | 9.6 |
| 9.4 | 4.6 | word | 8.0 |
| 8.6 | 3.6 | word | 6.6 |
| 8.1 | 4.4 | word | 7.5 |
| 6.1 | 5.1 | word | 5.5 |
hours per week is correlated with reading age: r=0.3812, p=0.0063
we can use a linear model to say something about the effect size
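The correlation reported above is a standard Pearson test; a sketch, assuming the data frame `reading` used in the model output below:

```r
cor.test(reading$hrs_wk, reading$R_AGE)  # r = .38, p = .0063
```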
Call:
lm(formula = R_AGE ~ hrs_wk, data = reading)
Residuals:
Min 1Q Median 3Q Max
-5.813 -1.826 -0.017 2.426 4.729
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.069 2.119 1.92 0.0608 .
hrs_wk 1.193 0.417 2.86 0.0063 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.7 on 48 degrees of freedom
Multiple R-squared: 0.145, Adjusted R-squared: 0.127
F-statistic: 8.16 on 1 and 48 DF, p-value: 0.00631
but…
it seems that the assumptions aren’t met for this model
one reason for this can be because there’s still systematic structure in the residuals
i.e., more than one thing which can explain the variance
so far, have focused on effects of practice
but presumably older children read better?
| age | hrs_wk | method | R_AGE |
|-------|-------|---------|--------|
| 9.982 | 5.137 | phonics | 14.090 |
| 8.006 | 4.353 | phonics | 11.762 |
| 9.349 | 5.808 | phonics | 13.838 |
| 8.620 | 3.612 | word | 6.620 |
| 8.060 | 4.447 | word | 7.524 |
| 6.117 | 5.085 | word | 5.502 |
Call:
lm(formula = R_AGE ~ age, data = reading)
Residuals:
Min 1Q Median 3Q Max
-3.642 -2.601 -0.454 2.280 4.543
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.238 1.904 2.23 0.0307 *
age 0.699 0.225 3.10 0.0032 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.66 on 48 degrees of freedom
Multiple R-squared: 0.167, Adjusted R-squared: 0.15
F-statistic: 9.62 on 1 and 48 DF, p-value: 0.00323
we now have two models that don’t map well to assumptions
each suggests an effect
one of age
one of hrs_wk
if we run them independently, the chances of a type 1 error are
\(\frac{1}{20}\) (mod, including hrs_wk)
\(\frac{1}{20}\) (mod2, including age)
or \(\frac{1}{10}\) overall
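the \(\frac{1}{10}\) is approximate: with two independent tests at \(\alpha=.05\), the chance of at least one type 1 error is
\[ 1-(1-.05)^2 = 0.0975 \approx \tfrac{1}{10} \]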
\[\color{red}{\textrm{outcome}_i}=\color{blue}{(\textrm{model})_i}+\textrm{error}_i\]
\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_i}+\epsilon_i\]
\[\color{red}{y_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}+\epsilon_i\]
\[\color{red}{\hat{y}_i}=\color{blue}{b_0}\cdot{}\color{orange}{1}+\color{blue}{b_1}\cdot{}\color{orange}{x_{1i}}+\color{blue}{b_2}\cdot{}\color{orange}{x_{2i}}\]
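In R, the second predictor is simply added to the model formula with `+`; a sketch (the object name `mod.m` is illustrative):

```r
mod.m <- lm(R_AGE ~ age + hrs_wk, data = reading)  # additive model: age and practice
summary(mod.m)
```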
Call:
lm(formula = R_AGE ~ age + hrs_wk, data = reading)
Residuals:
Min 1Q Median 3Q Max
-3.820 -2.382 0.074 2.404 3.549
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.518 2.422 0.21 0.832
age 0.578 0.222 2.61 0.012 *
hrs_wk 0.945 0.406 2.33 0.024 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.55 on 47 degrees of freedom
Multiple R-squared: 0.253, Adjusted R-squared: 0.221
F-statistic: 7.97 on 2 and 47 DF, p-value: 0.00105
...
(Intercept) 0.518 2.422 0.21 0.832
age 0.578 0.222 2.61 0.012 *
hrs_wk 0.945 0.406 2.33 0.024 *
...
there are independent effects of age and practice
reading age improves by 0.58 years for each year of age
reading age improves by 0.95 years for each hour of practice
the intercept (0 years old, 0 hours/week) is meaningless!
...
Multiple R-squared: 0.253, Adjusted R-squared: 0.221
...
in multiple regression, R2 measures the fit of the entire model
R2=0.25 looks better than the R2 for mod2 (age as a predictor) of 0.17
but any predictor will improve R2 (chance associations guarantee this)
...
Multiple R-squared: 0.196, Adjusted R-squared: 0.161
...
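This is what the adjusted \(R^2\) in the output corrects for: it penalises \(R^2\) for the number of predictors \(k\) relative to the number of observations \(n\):
\[ R^2_{adj} = 1-(1-R^2)\frac{n-1}{n-k-1} \]
For the two-predictor model (\(R^2=0.253\), \(n=50\), \(k=2\)): \(1-(1-0.253)\frac{49}{47} \approx 0.221\), as reported.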
...
Multiple R-squared: 0.253, Adjusted R-squared: 0.221
F-statistic: 7.97 on 2 and 47 DF, p-value: 0.00105
F ratio compares model to the null model
but we can also compare models in succession
Analysis of Variance Table
Model 1: R_AGE ~ 1
Model 2: R_AGE ~ age
Model 3: R_AGE ~ age + hrs_wk
Res.Df RSS Df Sum of Sq F Pr(>F)
1 49 409
2 48 340 1 68.2 10.50 0.0022 **
3 47 305 1 35.3 5.43 0.0241 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
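A sketch of the calls producing this table (object names are illustrative):

```r
mod0  <- lm(R_AGE ~ 1,            data = reading)  # null model
mod2  <- lm(R_AGE ~ age,          data = reading)  # age only
mod.m <- lm(R_AGE ~ age + hrs_wk, data = reading)  # age and practice

anova(mod0, mod2, mod.m)  # does each model improve on the previous one?
```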
Analysis of Variance Table
Model 1: R_AGE ~ 1
Model 2: R_AGE ~ age
Model 3: R_AGE ~ age + hrs_wk
Res.Df RSS Df Sum of Sq F Pr(>F)
1 49 409
2 48 340 1 68.2 10.50 0.0022
3 47 305 1 35.3 5.43 0.0241
age, then hrs_wk
Analysis of Variance Table
Response: R_AGE
Df Sum Sq Mean Sq F value Pr(>F)
age 1 68.2 68.2 10.50 0.0022
hrs_wk 1 35.3 35.3 5.43 0.0241
Residuals 47 305.2 6.5
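The same sequential table comes from anova() applied to the full model on its own; predictors are assessed in the order they appear in the formula, so swapping `age` and `hrs_wk` would change the sum of squares attributed to each. A sketch using the illustrative `mod.m` from above:

```r
anova(mod.m)  # mod.m <- lm(R_AGE ~ age + hrs_wk, data = reading)
```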