Univariate Statistics and Methodology using R
Psychology, PPLS
University of Edinburgh
variance
\[ s^2 = \frac{\sum{(x-\bar{x})^2}}{n} = \frac{\sum{(x-\bar{x})(x-\bar{x})}}{n} \]
covariance
\[ \textrm{cov}(x,y) = \frac{\sum{\color{blue}{(x-\bar{x})}\color{red}{(y-\bar{y})}}}{n} \]
\(\color{blue}{x-\bar{x}}\) | \(\color{red}{y-\bar{y}}\) | \(\color{blue}{(x-\bar{x})}\color{red}{(y-\bar{y})}\) |
---|---|---|
-0.3 | 1.66 | -0.5 |
0.81 | 2.21 | 1.79 |
-1.75 | -2.85 | 4.99 |
-0.14 | -3.58 | 0.49 |
1.37 | 2.56 | 3.52 |
 | \(\sum\) | 10.29 |
\[\textrm{cov}(x,y) = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{n} = \frac{10.29}{5} \simeq \color{red}{2.06}\]
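The calculation can be checked directly in R using the deviation scores from the table. Note that R's built-in `cov()` divides by \(n-1\) rather than \(n\), so the sketch below computes the sum of products by hand to match the formula:

```r
## deviation scores from the table above
dx <- c(-0.30, 0.81, -1.75, -0.14, 1.37)  # x - mean(x)
dy <- c(1.66, 2.21, -2.85, -3.58, 2.56)   # y - mean(y)

## covariance as defined above: sum of products over n
sum(dx * dy) / length(dx)   # approximately 2.06
```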
miles
\(x-\bar{x}\) | \(y-\bar{y}\) | \((x-\bar{x})(y-\bar{y})\) |
---|---|---|
-0.3 | 1.66 | -0.5 |
0.81 | 2.21 | 1.79 |
-1.75 | -2.85 | 4.99 |
-0.14 | -3.58 | 0.49 |
1.37 | 2.56 | 3.52 |
 | \(\sum\) | 10.29 |
\[\textrm{cov}(x,y)=\frac{10.29}{5}\simeq 2.06\]
kilometres
\(x-\bar{x}\) | \(y-\bar{y}\) | \((x-\bar{x})(y-\bar{y})\) |
---|---|---|
-0.48 | 2.68 | -1.29 |
1.3 | 3.56 | 4.64 |
-2.81 | -4.59 | 12.91 |
-0.22 | -5.77 | 1.27 |
2.21 | 4.12 | 9.12 |
 | \(\sum\) | 26.65 |
\[ \textrm{cov}(x,y)=\frac{26.65}{5}\simeq 5.33 \]
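The difference between the two covariances is purely a unit change: multiplying both variables by a constant \(k\) multiplies the covariance by \(k^2\). A quick sketch, using 1 mile \(\approx\) 1.609 km:

```r
dx <- c(-0.30, 0.81, -1.75, -0.14, 1.37)   # deviations, in miles
dy <- c(1.66, 2.21, -2.85, -3.58, 2.56)
k  <- 1.609                                # miles -> kilometres

cov_mi <- sum(dx * dy) / length(dx)
cov_km <- sum((k * dx) * (k * dy)) / length(dx)
c(cov_mi, cov_km, cov_km / cov_mi)   # ratio is k^2, about 2.59
```

so the same relationship yields a different covariance depending solely on the units of measurement.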
\[r = \frac{\textrm{covariance}(x,y)}{\textrm{standard deviation}(x)\cdot\textrm{standard deviation}(y)}\]
\[r=\frac{\frac{\sum{(x-\bar{x})(y-\bar{y})}}{n}}{\sqrt{\frac{\sum{(x-\bar{x})^2}}{n}}\sqrt{\frac{\sum{(y-\bar{y})^2}}{n}}}\]
\[r=\frac{\frac{\sum{(x-\bar{x})(y-\bar{y})}}{\color{red}{n}}}{\sqrt{\frac{\sum{(x-\bar{x})^2}}{\color{red}{n}}}\sqrt{\frac{\sum{(y-\bar{y})^2}}{\color{red}{n}}}}\]
\[r=\frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sqrt{\sum{(x-\bar{x})^2}}\sqrt{\sum{(y-\bar{y})^2}}}\]
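Because the standard deviations carry the same units as the covariance, the units cancel: \(r\) comes out identical whichever units we measure in. A sketch using the same deviation scores:

```r
dx <- c(-0.30, 0.81, -1.75, -0.14, 1.37)
dy <- c(1.66, 2.21, -2.85, -3.58, 2.56)

## r as defined above (the n terms have cancelled)
r <- function(dx, dy) sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

r(dx, dy)                   # miles
r(1.609 * dx, 1.609 * dy)   # kilometres: the same value
```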
\(r\) is a standardised measure of how related two variables are
\(-1 \le r \le 1\) (\(\pm 1\) = perfect fit; \(0\) = no fit)
\[ r=0.4648 \]
\[ r=-0.4648 \]
\[ r = 0.4648 \]
we can measure a correlation using \(r\)
we want to know whether that correlation is significant
cardinal rule in NHST: compare everything to chance
let’s investigate…
pick some pairs of numbers at random, return correlation
arbitrarily, I’ve picked numbers uniformly distributed between 0 and 100
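A minimal sketch of this simulation (the details here — 1,000 repetitions, 10 pairs per sample — are illustrative choices, not fixed anywhere above):

```r
set.seed(42)   # for reproducibility
## each repetition: correlate 10 random pairs, uniform on [0, 100]
rs <- replicate(1000, cor(runif(10, 0, 100),
                          runif(10, 0, 100)))
hist(rs, main = "1,000 correlations of random pairs")
range(rs)   # even pure chance produces some sizeable correlations
```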
\[t= r\sqrt{\frac{n-2}{1-r^2}}\]
calculate \(t\)
[1] 3.637
\[r= 0.4648, p = 0.0007\]
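The \(t\) value above can be reproduced from the formula, with \(r = 0.4648\) and \(n = 50\) (the 48 degrees of freedom plus the 2 we "know"):

```r
r <- 0.4648
n <- 50
tval <- r * sqrt((n - 2) / (1 - r^2))
tval                              # approximately 3.64
2 * pt(-abs(tval), df = n - 2)    # two-tailed p, approximately 0.0007
```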
reaction time is positively associated with blood alcohol
not a very complete picture
how much does alcohol affect RT?
\[\color{red}{\textrm{outcome}_i} = (\textrm{model})_i + \textrm{error}_i\]
\[\color{red}{\textrm{outcome}_i} = \color{blue}{(\textrm{model})_i} + \textrm{error}_i\]
maximise the explanatory worth of the model
minimise the amount of unexplained error
explain more than one outcome!
so far, we have been talking about the \(i\)th observation
we want to generalise (“for any i”)
\[\color{red}{\textrm{outcomes}}=\color{blue}{\textrm{(model)}}+\textrm{errors}\]
\[\color{red}{\textrm{outcome}_i} = \color{blue}{(\textrm{model})_i} + \textrm{error}_i\] \[\color{red}{y_i} = \color{blue}{\textrm{intercept}\cdot{}1+\textrm{slope}\cdot{}x_i}+\epsilon_i\]
\[\color{red}{y_i} = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_i} + \epsilon_i\] so the linear model itself is…
\[\hat{y}_i = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_i}\]
\[\color{blue}{b_0=5}, \color{blue}{b_1=2}\] \[\color{orange}{x_i=1.2},\color{red}{y_i=9.9}\] \[\hat{y}_i=7.4\]
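Plugging the numbers in: the model predicts \(\hat{y}_i = 5 + 2 \times 1.2 = 7.4\), leaving an error of \(9.9 - 7.4 = 2.5\). As a sketch:

```r
b0 <- 5; b1 <- 2      # model coefficients
x  <- 1.2; y <- 9.9   # observed data
y_hat <- b0 * 1 + b1 * x
y_hat                 # model prediction: 7.4
y - y_hat             # the error term: 2.5
```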
\[\hat{y}_i = \color{blue}{b_0 \cdot{}}\color{orange}{1} \color{blue}{+b_1 \cdot{}} \color{orange}{x_i}\]
values of the linear model (coefficients)
values we provide (inputs)
\[\color{red}{\textrm{outcome}_i} = \color{blue}{(\textrm{model})_i} + \textrm{error}_i\] \[\color{red}{y_i} = \color{blue}{\textrm{intercept}}\cdot{}\color{orange}{1}+\color{blue}{\textrm{slope}}\cdot{}\color{orange}{x_i}+\epsilon_i\] \[\color{red}{y_i} = \color{blue}{b_0} \cdot{} \color{orange}{1} + \color{blue}{b_1} \cdot{} \color{orange}{x_i} + \epsilon_i\] so the linear model itself is…
\[\hat{y}_i = \color{blue}{b_0} \cdot{} \color{orange}{1} + \color{blue}{b_1} \cdot{} \color{orange}{x_i}\]
\[\hat{y}_i = \color{blue}{b_0} + \color{blue}{b_1} \cdot{} \color{orange}{x_i}\]
Call:
lm(formula = RT ~ BloodAlc, data = ourDat)
Residuals:
Min 1Q Median 3Q Max
-115.92 -40.42 1.05 42.93 126.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared: 0.216, Adjusted R-squared: 0.2
F-statistic: 13.2 on 1 and 48 DF, p-value: 0.000673
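With a single predictor, the summary ties straight back to the correlation: the Multiple R-squared is just \(r^2\), and the \(F\) statistic is the square of the slope's \(t\) value.

```r
0.4648^2   # = 0.216, the Multiple R-squared above
3.64^2     # approximately 13.2, the F statistic (to rounding)
```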
for every extra 0.01% blood alcohol, reaction time slows down by around 32 ms
b0 won’t be significantly different from zero
b1 won’t be significantly different from zero
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 321.24 91.05 3.53 0.00093 ***
BloodAlc 32.28 8.88 3.64 0.00067 ***
...
the logic is the same as for \(t\) tests
\(t\) value is \(\frac{\textrm{estimate}}{\textrm{standard error}}\)
standard errors are calculated from the model, just as they are for \(t\)-tests
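For example, for the BloodAlc row of the output above:

```r
32.28 / 8.88                      # t value: approximately 3.64
2 * pt(-32.28 / 8.88, df = 48)    # p value: approximately 0.0007
```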
we subtract 2 df because we “know” two things
intercept (b0)
slope (b1)
the remaining df are the residual degrees of freedom
...
BloodAlc 32.28 8.88 3.64 0.00067 ***
...
F-statistic: 13.2 on 1 and 48 DF, p-value: 0.000673
reaction time slowed by 32.3 ms for every additional 0.01% blood alcohol by volume (t(48)=3.64, p=.0007)