Data Analysis for Psychology in R 3
Psychology, PPLS
University of Edinburgh
Please check the Learn page for the Course Introduction video!
Please make sure you install (or update) R and Rstudio (see instructions on Learn)
multilevel modelling: working with group structured data
- regression refresher
- introducing multilevel models
- more complex groupings
- centering, assumptions, and diagnostics
- recap

factor analysis: working with multi-item measures
- what is a psychometric test?
- using composite scores to simplify data (PCA)
- uncovering underlying constructs (EFA)
- more EFA
- recap
deterministic
given the same input, deterministic functions return exactly the same output
surface area of a sphere = \(4\pi r^2\)
distance fallen in time \(t\) = \(\frac{1}{2} g t^2\)
statistical
\[ \color{red}{\textrm{outcome}} \color{black}{=} \color{blue}{(\textrm{model})} + \color{black}{\textrm{error}} \]
handspan = height + randomness
cognitive test score = age + premorbid IQ + … + randomness
\[ \begin{align} \color{red}{\textrm{outcome}} & = \color{blue}{(\textrm{model})} + \textrm{error} \\ \qquad \\ \qquad \\ \color{red}{y_i} & = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_i} + \varepsilon_i \\ \qquad \\ & \text{where } \varepsilon_i \sim N(0, \sigma) \text{ independently} \\ \end{align} \]
Our proposed model of the world:
\(\color{red}{y_i} = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_i} + \varepsilon_i\)
Our model \(\hat{\textrm{f}}\textrm{itted}\) to some data:
\(\hat{y}_i = \color{blue}{\hat b_0 \cdot{} 1 + \hat b_1 \cdot{} x_i}\)
For the \(i^{th}\) observation:
Our model \(\hat{\textrm{f}}\textrm{itted}\) to some data:
\(\color{red}{y_i} = \color{blue}{5 \cdot{} 1 + 2 \cdot{} x_i} + \hat\varepsilon_i\)
For the observation \(x_i = 1.2, \; y_i = 9.9\):
\[ \begin{align} \color{red}{9.9} & = \color{blue}{5 \cdot{}} 1 + \color{blue}{2 \cdot{}} 1.2 + \hat\varepsilon_i \\ & = 7.4 + \hat\varepsilon_i \\ & = 7.4 + 2.5 \\ \end{align} \]
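In R, once a model has been fitted with lm(), the estimated coefficients, fitted values, and residuals for each observation can be pulled out directly. A minimal sketch (handdata is a hypothetical data frame, echoing the handspan example above):

```r
# handdata is hypothetical: a data frame of heights and handspans
mod <- lm(handspan ~ height, data = handdata)

coef(mod)        # estimated intercept and slope (b0-hat, b1-hat)
fitted(mod)[3]   # model-predicted handspan for the 3rd observation
resid(mod)[3]    # that observation's residual: observed minus fitted
```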
y | x |
---|---|
7.99 | Category1 |
4.73 | Category0 |
3.66 | Category0 |
3.41 | Category0 |
5.75 | Category1 |
5.66 | Category0 |
... | ... |
More than one predictor?
\(\color{red}{y} = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_1 + \, ... \, + b_k \cdot x_k} + \varepsilon\)
Call:
lm(formula = y ~ x1 + x2, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-28.85 -6.27 -0.19 6.97 26.05
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.372 1.124 18.12 <2e-16 ***
x1 1.884 1.295 1.46 0.1488
x2 2.042 0.624 3.28 0.0015 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.2 on 97 degrees of freedom
Multiple R-squared: 0.135, Adjusted R-squared: 0.118
F-statistic: 7.6 on 2 and 97 DF, p-value: 0.00086
term | est | interpretation |
---|---|---|
(Intercept) | 20.37 | estimated y when all predictors are zero/reference level |
x1 | 1.88 | estimated change in y when x1 increases by 1, and all other predictors are held constant |
x2 | 2.04 | estimated change in y when x2 increases by 1, and all other predictors are held constant |
Design is so important! If possible, we can design the study so that X and Z are orthogonal (in the long run), e.g., by randomisation.
The association between X and Y changes if we adjust for Z (a is smaller than on the previous slide), because there is a part (b) that could be attributed to Z instead.
Call:
lm(formula = knowledge ~ age + education, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-28.85 -6.27 -0.19 6.97 26.05
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.372 1.124 18.12 <2e-16 ***
age 1.884 1.295 1.46 0.1488
education 2.042 0.624 3.28 0.0015 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.2 on 97 degrees of freedom
Multiple R-squared: 0.135, Adjusted R-squared: 0.118
F-statistic: 7.6 on 2 and 97 DF, p-value: 0.00086
term | est | interpretation |
---|---|---|
(Intercept) | 20.37 | estimated knowledge for age 0, education 0 |
age | 1.88 | estimated change in knowledge for every 1 year older, holding education constant |
education | 2.04 | estimated change in knowledge for every 1 year of education, holding age constant |
what if the relationship of interest depends on the level of some other variable?
by adding a product term (x1 \(\times\) x2) to our model, we can model this:
\(\color{red}{y} = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_1 + b_2 \cdot x_2 + b_3 \cdot x_1 \cdot x_2} + \varepsilon\)
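In R, the product term can be written out explicitly, or more commonly with the * shorthand, which expands to both main effects plus their interaction. A sketch with the generic names used above:

```r
# two equivalent ways of fitting the interaction model
lm(y ~ x1 + x2 + x1:x2, data = mydata)
lm(y ~ x1 * x2, data = mydata)   # '*' = main effects + interaction
```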
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.3521 1.0097 35.01 < 2e-16 ***
x1 0.1719 0.0476 3.61 0.00038 ***
x2Level2 -4.3901 0.6740 -6.51 6e-10 ***
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.009 2.008 -2.00 0.052 .
x1 4.345 0.329 13.22 < 2e-16 ***
x2 3.189 0.402 7.93 3.2e-10 ***
\(\begin{align} \color{red}{y} \;\;\;\; & = \;\;\;\;\; \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x_1 + ... + b_k \cdot x_k} & + & \;\;\;\varepsilon \\ \qquad \\ \color{red}{\begin{bmatrix}y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ \vdots \\ y_n \end{bmatrix}} & = \color{blue}{\begin{bmatrix} 1 & x_{11} & x_{21} & \dots & x_{k1} \\ 1 & x_{12} & x_{22} & & x_{k2} \\ 1 & x_{13} & x_{23} & & x_{k3} \\ 1 & x_{14} & x_{24} & & x_{k4} \\ 1 & x_{15} & x_{25} & & x_{k5} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \dots & x_{kn} \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_k \end{bmatrix}} & + & \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \vdots \\ \varepsilon_n \end{bmatrix} \\ \\\color{red}{\boldsymbol y} \;\;\;\;\; & = \qquad \qquad \;\;\; \mathbf{\color{blue}{X \qquad \qquad \qquad \;\;\;\: \boldsymbol \beta}} & + & \;\;\; \boldsymbol \varepsilon \\ \end{align}\)
\(\begin{align} \color{red}{y} = \mathbf{\color{blue}{X \boldsymbol{\beta}} + \boldsymbol{\varepsilon}} & \qquad & (-\infty, \infty) \\ \qquad \\ \qquad \\ \color{red}{ln \left( \frac{p}{1-p} \right) } = \mathbf{\color{blue}{X \boldsymbol{\beta}} + \boldsymbol{\varepsilon}} & \qquad & [0,1] \\ \qquad \\ \qquad \\ \color{red}{ln (y) } = \mathbf{\color{blue}{X \boldsymbol{\beta}} + \boldsymbol{\varepsilon}} & \qquad & (0, \infty) \\ \end{align}\)
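In R, the logit version is fitted with glm() and a binomial family; for a strictly positive outcome we can either model log(y) with lm() or use a log link in glm(). A sketch with hypothetical outcome names (ybin, ypos):

```r
# binary outcome, probabilities bounded in [0,1]: logistic regression (logit link)
glm(ybin ~ x1 + x2, data = mydata, family = binomial)

# strictly positive outcome in (0, Inf): log-transform, or use a log link
lm(log(ypos) ~ x1 + x2, data = mydata)
glm(ypos ~ x1 + x2, data = mydata, family = gaussian(link = "log"))
```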
Rather than focussing on slope coefficients, we can also think of our model in terms of sums of squares (SS).
\(SS_{total} = \sum^{n}_{i=1}(y_i - \bar y)^2\)
\(SS_{model} = \sum^{n}_{i=1}(\hat y_i - \bar y)^2\)
\(SS_{residual} = \sum^{n}_{i=1}(y_i - \hat y_i)^2\)
Thinking in terms of sums of squares also gives us \(R^2\), the proportion of total variation explained by the model:
\(R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}\)
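Both pieces can be computed by hand from a fitted lm object; a short sketch reusing the mydata model from earlier:

```r
mod <- lm(y ~ x1 + x2, data = mydata)

ss_total    <- sum((mydata$y - mean(mydata$y))^2)   # SS_total
ss_residual <- sum(resid(mod)^2)                    # SS_residual
ss_model    <- ss_total - ss_residual               # SS_model

ss_model / ss_total   # R^2: should match summary(mod)$r.squared
```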
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) ... ... ... ...
z ... ... ... ...
x ... ... ... ...
...
...
Multiple R-squared: 0.134, Adjusted R-squared: 0.116
...
Model comparisons:
isolate the improvement in model fit due to inclusion of additional parameters
Test everything in the model all at once by comparing it to a ‘null model’ with no predictors:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.851 0.660 21.00 < 2e-16 ***
speciesdog -3.210 0.933 -3.44 0.00086 ***
specieshorse -3.316 0.933 -3.55 0.00059 ***
speciesparrot -2.903 0.933 -3.11 0.00245 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.3 on 96 degrees of freedom
Multiple R-squared: 0.152, Adjusted R-squared: 0.126
F-statistic: 5.75 on 3 and 96 DF, p-value: 0.00117
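The same overall test can be carried out explicitly as a comparison of nested models with anova(); a sketch using the species model above (the data frame name petdata is hypothetical):

```r
mod0 <- lm(y ~ 1,       data = petdata)   # 'null' model: intercept only
mod1 <- lm(y ~ species, data = petdata)   # model with the species predictor

anova(mod0, mod1)   # incremental F-test of the improvement in fit
```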
This is roughly where traditional “analysis of (co)variance” sits.
There are different ‘types’ of ANOVA (e.g., Type I, II, and III sums of squares):
Anova Table (Type III tests)
Response: y
Sum Sq Df F value Pr(>F)
(Intercept) 12041 1 3242.2 < 2e-16 ***
z2 301 1 80.9 2.1e-14 ***
z 352 1 94.8 5.4e-16 ***
x 105 1 28.2 7.0e-07 ***
Residuals 357 96
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
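A table like the one above can be produced with the Anova() function from the car package (model and data names here are placeholders):

```r
library(car)

mod <- lm(y ~ z2 + z + x, data = mydata)
# for Type III tests involving factors, sum-to-zero contrasts are usually recommended
Anova(mod, type = "III")
```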
Our model:
\(\color{red}{y} = \color{blue}{\mathbf{X \boldsymbol \beta}} + \varepsilon \qquad \text{where } \boldsymbol \varepsilon \sim N(0, \sigma) \text{ independently}\)
Our ability to generalise from the model we fit on sample data to the wider population requires making some assumptions.
All our work here is aimed at making models of the world.
Models are models. They are simplifications. They are therefore wrong.
Our residuals reflect everything that we don’t account for in our model. \(y - \hat{y}\)
In an ideal world, our model accounts for all the systematic relationships. The leftovers (our residuals) are just random noise.
We check by examining how much the residuals appear to be “like randomness”: zero mean, normally distributed, constant variance, and i.i.d. (“independent and identically distributed”).
We will never know whether our residuals contain only randomness - we can never observe everything!
What does randomness look like?
“zero mean and constant variance”
mean of the residuals = zero across the predicted values of the model.
spread of residuals is normally distributed and constant across the predicted values of the model.
Linearity
Independence
Normality
Equal variances
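A quick visual check of these assumptions comes from the default diagnostic plots of a fitted lm object; a minimal sketch:

```r
mod <- lm(y ~ x1 + x2, data = mydata)

par(mfrow = c(2, 2))   # arrange the four default plots in a 2x2 grid
plot(mod)              # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
```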
first thing to do: think!
…
highly skewed?
lm(y ~ x) and lm(log(y) ~ x) are quite different models…
non-constant variance?
…
you can’t necessarily see violations of independence in diagnostic plots - we need to think about how the data were generated.
transformations, bootstraps, etc. don’t help us if we have violated our assumption of independence…
children within schools
patients within clinics
observations within individuals
children within classrooms within schools within districts etc…
patients within doctors within hospitals…
time-periods within trials within individuals
Measurements on observational units within a given cluster are often more similar to each other than to those in other clusters.
Clustering is something systematic that our model should (arguably) take into account.
Standard errors will often be smaller than they should be, meaning that confidence intervals will be too narrow, test statistics too large, and p-values too small: we risk an inflated rate of false positives (Type I errors).
Clustering can be expressed in terms of the expected correlation among the measurements within the same cluster - known as the intra-class correlation coefficient (ICC).
There are various formulations of ICC, but the basic principle = ratio of variance between groups to total variance.
\(\rho = \frac{\sigma^2_{b}}{\sigma^2_{b} + \sigma^2_e} \\ \qquad \\\textrm{Where:} \\ \sigma^2_{b} = \textrm{variance between clusters} \\ \sigma^2_e = \textrm{variance within clusters (residual variance)} \\\)
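One common way to estimate the ICC in R is from the variance components of an intercept-only multilevel model (a sketch using the lme4 package; df and cluster are placeholder names):

```r
library(lme4)

# intercept-only model with a random intercept for each cluster
mod <- lmer(y ~ 1 + (1 | cluster), data = df)

vc  <- as.data.frame(VarCorr(mod))   # variance components
icc <- vc$vcov[1] / sum(vc$vcov)     # sigma^2_b / (sigma^2_b + sigma^2_e)
icc
```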
Wide Data
observations are spread across columns
# A tibble: 5 × 5
ID age trial_1 trial_2 trial_3
<chr> <chr> <chr> <chr> <chr>
1 001 62 10 12.5 18
2 002 66 7.5 7 5
3 003 22 12 14.5 11
4 004 34 10.5 17 14
5 ... ... ... ... ...
Long Data
each observation of the outcome is a separate row
# A tibble: 13 × 4
ID age trial score
<chr> <chr> <chr> <chr>
1 001 44 trial_1 10
2 001 44 trial_2 12.5
3 001 44 trial_3 18
4 002 38 trial_1 7.5
5 002 38 trial_2 7
6 002 38 trial_3 5
7 003 74 trial_1 12
8 003 74 trial_2 14.5
9 003 74 trial_3 11
10 004 58 trial_1 10.5
11 004 58 trial_2 17
12 004 58 trial_3 14
13 ... ... ... ...
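Reshaping from wide to long is typically done with tidyr; a sketch assuming the wide data above is stored in a data frame called dwide:

```r
library(tidyr)

dlong <- pivot_longer(
  dwide,
  cols      = starts_with("trial"),   # trial_1, trial_2, trial_3
  names_to  = "trial",
  values_to = "score"
)
```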
lm()
Are older people more satisfied with life? We have 112 people from 12 different dwellings (cities/towns) in Scotland, with information on their ages and a measure of life satisfaction.
# A tibble: 6 × 4
age lifesat dwelling size
<dbl> <dbl> <chr> <chr>
1 40 31 Aberdeen >100k
2 45 56 Glasgow >100k
3 40 51 Glasgow >100k
4 40 55 Dundee >100k
5 40 41 Dundee >100k
6 55 69 Perth <100k
(Complete pooling)
lm(y ~ 1 + x, data = df)
Information from all clusters is pooled together to estimate a single association between x and y.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.018 4.889 6.14 1.3e-08 ***
age 0.499 0.118 4.22 5.1e-05 ***
But residuals are not independent.
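One simple way to see this non-independence is to group the complete-pooling residuals by cluster; a sketch, assuming the life-satisfaction data above is in df:

```r
mod_pool <- lm(lifesat ~ 1 + age, data = df)

# clusters whose residuals sit systematically above or below zero
# indicate dependence that the pooled model ignores
boxplot(resid(mod_pool) ~ df$dwelling, las = 2, ylab = "residual")
```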
(No pooling)
lm(y ~ cluster + x, data = df)
Completely partition out cluster differences in average \(y\).
Treat every cluster as an independent entity.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.3330 4.1493 3.70 0.00036 ***
dwellingDumfries 8.0741 3.8483 2.10 0.03844 *
dwellingDundee 15.9882 3.8236 4.18 6.3e-05 ***
dwellingDunfermline 26.2600 3.8910 6.75 1.0e-09 ***
dwellingEdinburgh 12.6929 3.8195 3.32 0.00125 **
dwellingFort William 1.3577 3.8999 0.35 0.72849
dwellingGlasgow 18.2906 3.8213 4.79 5.9e-06 ***
dwellingInverness 5.0718 3.8542 1.32 0.19124
... ...
... ...
age 0.6047 0.0888 6.81 7.6e-10 ***
---
(No pooling)
lm(y ~ cluster * x, data = df)
Completely partition out cluster differences in \(y \sim x\).
Treat every cluster as an independent entity.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.345 8.867 -0.49 0.62532
dwellingDumfries 13.480 14.370 0.94 0.35078
dwellingDundee 59.479 17.650 3.37 0.00112 **
dwellingDunfermline 42.938 17.206 2.50 0.01444 *
... ...
age 1.159 0.239 4.84 0.0000054 ***
dwellingDumfries:age -0.206 0.360 -0.57 0.56813
dwellingDundee:age -1.181 0.463 -2.55 0.01243 *
dwellingDunfermline:age -0.486 0.408 -1.19 0.23638
... ...
(No pooling)
lm(y ~ cluster * x, data = df)
Completely partition out cluster differences in \(y \sim x\).
Treat every cluster as an independent entity.
This prevents us from studying cluster-level effects (e.g., dwelling size), because they are completely confounded with the cluster indicators.
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.345 8.867 -0.49 0.62532
dwellingDumfries 13.480 14.370 0.94 0.35078
dwellingDundee 59.479 17.650 3.37 0.00112 **
dwellingDunfermline 42.938 17.206 2.50 0.01444 *
... ...
... ...
dwellingPaisley 28.665 16.732 1.71 0.09020 .
dwellingPerth 27.473 12.901 2.13 0.03600 *
dwellingStirling 42.272 13.379 3.16 0.00217 **
age 1.159 0.239 4.84 0.0000054 ***
size>100k NA NA NA NA
dwellingDumfries:age -0.206 0.360 -0.57 0.56813
dwellingDundee:age -1.181 0.463 -2.55 0.01243 *
... ...
size
dwelling <100k >100k
Aberdeen 0 10
Dumfries 10 0
Dundee 0 10
Dunfermline 10 0
Edinburgh 0 10
Fort William 10 0
Glasgow 0 10
Inverness 10 0
Kirkcaldy 2 0
Paisley 10 0
Perth 10 0
Stirling 10 0
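A cross-tabulation like the one above can be produced directly in R, and makes the confounding visible: each dwelling falls in exactly one size category.

```r
# every dwelling appears in only one size category,
# so 'size' is redundant with the dwelling indicators
xtabs(~ dwelling + size, data = df)
```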
we can fit a linear regression model which takes the form \(\color{red}{y} = \color{blue}{\mathbf{X} \boldsymbol{\beta}} + \boldsymbol{\varepsilon}\)
in R, we fit this with lm(y ~ x1 + ... + xk, data = mydata)
we can extend this to different link functions to model outcome variables which follow different distributions.
when drawing inferences from a fitted model to the broader population, we rely on certain assumptions.
Complete readings
Attend your lab and work together on the exercises
Complete the weekly quiz
Piazza forum!
Office hours (see Learn page for details)