Univariate Statistics and Methodology using R
Psychology, PPLS
University of Edinburgh
...
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
...
the null hypothesis: there is no relationship between our independent and dependent variables
the corresponding null model: y ~ 1 (here, RT ~ 1)
the question: how much better is RT ~ BloodAlc than RT ~ 1?
\[ \textrm{total SS}=\sum{(y - \bar{y})^2} \]
sum of squared differences between observed \(y\) and mean \(\bar{y}\)
= total amount of variance in the data
= total unexplained variance from the null model
\[\textrm{residual SS} = \sum{(y - \hat{y})^2}\]
sum of squared differences between observed \(y\) and predicted \(\hat{y}\)
= total unexplained variance from a given model in the data
\[ \textrm{model SS} = \sum{(\hat{y} - \bar{y})^2} \]
sum of squared differences between predicted \(\hat{y}\) and mean \(\bar{y}\)
= total variance explained by a given model
\[ R^2 = \frac{\textrm{model SS}}{\textrm{total SS}} = \frac{\sum{(\hat{y}-\bar{y})^2}}{\sum{(y-\bar{y})^2}} \]
how much the model improves over the null
\(0 \le R^2 \le 1\)
we want \(R^2\) to be large
\(F\) ratio depends on mean squares
\[(\textrm{MS}_x=\textrm{SS}_x/\textrm{df}_x)\]
\[F=\frac{\textrm{model MS}}{\textrm{residual MS}}=\frac{\sum{(\hat{y}-\bar{y})^2}/\textrm{df}_m}{\sum{(y-\hat{y})^2}/\textrm{df}_r}\]
how much the model ‘explains’
\(0<F\)
we want \(F\) to be large
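As a sketch of where these quantities come from, they can be computed by hand from a fitted model (assuming the data frame dat used in the output below; the object names here are ours):

mod <- lm(RT ~ BloodAlc, data = dat)     # fitted model
y <- dat$RT                              # observed values
y.hat <- fitted(mod)                     # predicted values
ss.t <- sum((y - mean(y))^2)             # total SS
ss.r <- sum((y - y.hat)^2)               # residual SS
ss.m <- sum((y.hat - mean(y))^2)         # model SS
ss.m / ss.t                              # R-squared
(ss.m / 1) / (ss.r / (length(y) - 2))    # F: df_m = 1 predictor, df_r = n - 2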
Call:
lm(formula = RT ~ BloodAlc, data = dat)
Residuals:
Min 1Q Median 3Q Max
-115.92 -40.42 1.05 42.93 126.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared: 0.216, Adjusted R-squared: 0.2
F-statistic: 13.2 on 1 and 48 DF, p-value: 0.000673
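The output above is what summary() prints for a fitted model; a minimal sketch, assuming the data frame is called dat as in the Call line:

mod <- lm(RT ~ BloodAlc, data = dat)
summary(mod)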
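A comparison like the following can be produced by passing two nested models to anova() (a sketch; mod0 and mod1 are our names):

mod0 <- lm(RT ~ 1, data = dat)           # the null model
mod1 <- lm(RT ~ BloodAlc, data = dat)    # the model of interest
anova(mod0, mod1)                        # does mod1 reduce the unexplained variance?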
Analysis of Variance Table
Model 1: RT ~ 1
Model 2: RT ~ BloodAlc
Res.Df RSS Df Sum of Sq F Pr(>F)
1 49 190340
2 48 149223 1 41117 13.2 0.00067 ***
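The same F-test also comes from anova() applied to the single fitted model, as in the table below:

anova(mod1)    # mod1 as fitted above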
Analysis of Variance Table
Response: RT
Df Sum Sq Mean Sq F value Pr(>F)
BloodAlc 1 41117 41117 13.2 0.00067 ***
Residuals 48 149223 3109
Adding blood alcohol as a predictor improved the amount of variance explained over the null model (\(R^2=0.22\), \(F(1,48)=13.2\), \(p=.0007\)).
Anscombe (1973): datasets with identical summary statistics can look radically different when plotted
linearity of relationship(!)
the other assumptions are about the residuals:
\[y_i=b_0+b_1\cdot{}x_i+\epsilon_i\]
\[\epsilon\sim{}N(0,\sigma)~\text{independently}\]
homogeneous (differences from \(\hat{y}\) shouldn’t be systematically smaller or larger for different \(x\))
independent (residuals shouldn’t influence other residuals)
from our residuals: below what value do (say) 5%, or 10%, of observations fall?
convert these quantiles to “standard deviations from the mean”
a Q-Q plot compares the residuals \(\epsilon\) against a known distribution (here, the normal)
observations close to the straight line indicate that the residuals are approximately normal
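A minimal sketch of such a plot in base R, assuming the model fitted earlier:

mod <- lm(RT ~ BloodAlc, data = dat)
qqnorm(resid(mod))    # residual quantiles against theoretical normal quantiles
qqline(resid(mod))    # reference line through the first and third quartiles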
no easy way to check independence of residuals
in part, because it depends on the source of the observations
one determinant might be a single person being observed multiple times
e.g., my reaction times might tend to be slower than yours
\(\rightarrow\) multivariate statistics
another determinant might be time
observations in a sequence might be autocorrelated
can be checked using the Durbin-Watson test from the car package
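For example (a sketch, assuming the car package is installed):

library(car)
mod <- lm(RT ~ BloodAlc, data = dat)
durbinWatsonTest(mod)    # tests for autocorrelation among the residuals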
more statistical ways of checking assumptions will be introduced in labs
they tend to have limitations (for example, they're sensitive to sample size)
nothing beats looking at plots like these (and plot(<model>) makes it easy)
there are no criteria for deciding exactly when assumptions are sufficiently met
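A sketch of those plots for the model above:

mod <- lm(RT ~ BloodAlc, data = dat)
par(mfrow = c(2, 2))    # arrange the four diagnostic plots in a grid
plot(mod)               # residuals vs fitted, Q-Q, scale-location, residuals vs leverage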
we need to talk about influence
even substantial outliers may only have small effects on the model
here, only the intercept is affected
Cook's distance quantifies the influence of observation \(i\) on the model's predictions:
\[D_i=\frac{\sum_{j=1}^{n}{(\hat{y}_j-\hat{y}_{j(i)})^2}}{(p+1)\hat{\sigma}^2}\]
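In R, these values come from cooks.distance(); the 4/n cutoff below is one common rule of thumb, not something fixed:

mod <- lm(RT ~ BloodAlc, data = dat)
d <- cooks.distance(mod)    # D_i for each observation
which(d > 4 / length(d))    # flag unusually influential observations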
Adding blood alcohol as a predictor improved the amount of variance explained over the null model (\(R^2=0.22\), \(F(1,48)=13.2\), \(p=.0007\)). Reaction time slowed by 32.3 ms for every additional 0.01% blood alcohol by volume (\(t(48)=3.64\), \(p=.0007\)).
the Playmo School has been evaluating its reading programmes, using 50 students
ages of students
hours per week students spend reading of their own volition
whether they are taught using phonics or whole-word methods
outcome: “reading age”
| age | hrs_wk | method | R_AGE |
|------|--------|---------|-------|
| 10.0 | 5.1 | phonics | 14.1 |
| 8.0 | 4.4 | phonics | 11.8 |
| 9.3 | 5.8 | phonics | 13.8 |
| 8.7 | 5.4 | phonics | 13.3 |
| 10.2 | 4.6 | phonics | 14.3 |
| 10.5 | 4.6 | word | 9.6 |
| 9.4 | 4.6 | word | 8.0 |
| 8.6 | 3.6 | word | 6.6 |
| 8.1 | 4.4 | word | 7.5 |
| 6.1 | 5.1 | word | 5.5 |
hours per week is correlated with reading age: r=0.3812, p=0.0063
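A correlation like this can be obtained with cor.test() (a sketch, assuming the reading data frame used in the model below):

cor.test(reading$hrs_wk, reading$R_AGE)    # Pearson correlation with a t-based p-value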
we can use a linear model to say something about the effect size
Call:
lm(formula = R_AGE ~ hrs_wk, data = reading)
Residuals:
Min 1Q Median 3Q Max
-5.813 -1.826 -0.017 2.426 4.729
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.069 2.119 1.92 0.0608 .
hrs_wk 1.193 0.417 2.86 0.0063 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.7 on 48 degrees of freedom
Multiple R-squared: 0.145, Adjusted R-squared: 0.127
F-statistic: 8.16 on 1 and 48 DF, p-value: 0.00631
but…
it seems that the assumptions aren't met for this model
one reason for this can be that there is still systematic structure in the residuals
i.e., more than one variable may explain variance in the outcome
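One way to see such structure (a sketch; grouping the residuals by teaching method is our illustration of a second source of variance):

mod.r <- lm(R_AGE ~ hrs_wk, data = reading)
boxplot(resid(mod.r) ~ reading$method)    # residuals pattern systematically by method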