Data Analysis for Psychology in R 2
Department of Psychology
University of Edinburgh
2025–2026
Block | Topic |
---|---|
Introduction to Linear Models | Intro to Linear Regression |
 | Interpreting Linear Models |
 | Testing Individual Predictors |
 | Model Testing & Comparison |
 | Linear Model Analysis |
Analysing Experimental Studies | Categorical Predictors & Dummy Coding |
 | Effects Coding & Coding Specific Contrasts |
 | Assumptions & Diagnostics |
 | Bootstrapping |
 | Categorical Predictor Analysis |
Interactions | Interactions I |
 | Interactions II |
 | Interactions III |
 | Analysing Experiments |
 | Interaction Analysis |
Advanced Topics | Power Analysis |
 | Binary Logistic Regression I |
 | Binary Logistic Regression II |
 | Logistic Regression Analysis |
 | Exam Prep and Course Q&A |
Understand the link between models and functions
Understand the key concepts (intercept and slope) of the linear model
Understand what residuals represent
Understand the key principles of least squares
Be able to specify a simple linear model (labs)
Pretty much all statistics is about models
A model is a formal representation of a system
Put another way, a model is an idea about the way the world is
We tend to represent mathematical models as functions
A function is an expression that defines the relationship between one variable (or set of variables) and another variable (or set of variables)
It allows us to specify what is important (arguments) and how these things interact with each other (operations)
This allows us to make and test predictions
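As a minimal sketch of this idea in R (the function name `f` and the numbers are arbitrary, purely for illustration):

```r
# A function: `x` is the argument (what we say is important),
# and `3 + 2 * x` is the operation (how the inputs interact)
f <- function(x) {
  3 + 2 * x
}

f(10)  # input 10 -> output 23: a prediction we could test
```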
To think through these relations, we can use a simpler example
Suppose I have a model for the growth of babies, with length measured in cm and age in months:
\[ \text{Length} = 55 + 4 \times \text{Age} \]
The x-axis shows Age
The y-axis shows Length
The black line represents our model: \(y = 55+4x\)
Age (months) | PredictedLength (cm) |
---|---|
10.0 | 95 |
10.2 | 96 |
10.5 | 97 |
10.8 | 98 |
11.0 | 99 |
11.2 | 100 |
11.5 | 101 |
11.8 | 102 |
12.0 | 103 |
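As a sketch, these predictions can be generated directly in R (the variable names are our own; the table above shows the ages and predictions rounded):

```r
# Predicted length (cm) from the model: Length = 55 + 4 * Age
age <- seq(10, 12, by = 0.25)   # ages in months
pred_length <- 55 + 4 * age     # model predictions

data.frame(Age = age, PredictedLength = round(pred_length))
```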
What does this say about our model?
If we were to collect actual data on length and age, would our observations fall on the line?
Age (months) | Year | Prediction (cm) | Prediction_M (m) |
---|---|---|---|
216 | 18 | 919 | 9.19 |
228 | 19 | 967 | 9.67 |
240 | 20 | 1015 | 10.15 |
252 | 21 | 1063 | 10.63 |
264 | 22 | 1111 | 11.11 |
276 | 23 | 1159 | 11.59 |
288 | 24 | 1207 | 12.07 |
300 | 25 | 1255 | 12.55 |
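Note what happens when we extrapolate: the model predicts adults over nine metres long. A sketch reproducing these numbers in R (again with our own variable names):

```r
# The same model, Length = 55 + 4 * Age, applied to adult ages (in months)
age_months <- seq(216, 300, by = 12)

data.frame(
  Age          = age_months,
  Year         = age_months / 12,
  Prediction   = 55 + 4 * age_months,          # predicted length in cm
  Prediction_M = (55 + 4 * age_months) / 100   # predicted length in metres
)
```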
How might we judge how good our model is?
Model is represented as a function
We see that as a line (or surface if we have more things to consider)
That yields predictions (or values we expect if our model is true)
We can collect data
If the predictions do not match the observed data (observations deviate from our line), that says something about our model
In statistics we (roughly) follow this process:
We define a model that represents one state of the world (probabilistically)
We collect data to compare to it
These comparisons lead us to make inferences about how the world actually is, by comparison to a world that we specify by our model
A deterministic model is a model for an exact relationship:
\[ y = \underbrace{3 + 2 x}_{f(x)} \]
A statistical model allows for case-by-case variability:
\[ y = \underbrace{3 + 2 x}_{f(x)} + \epsilon \]
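To make the distinction concrete, here is a minimal simulation sketch in R (the sample size, seed, and error distribution are our own choices for illustration):

```r
set.seed(123)  # arbitrary seed, for reproducibility
x <- runif(100, min = 0, max = 10)

y_deterministic <- 3 + 2 * x               # every point falls exactly on f(x)
y_statistical   <- 3 + 2 * x + rnorm(100)  # each case deviates from f(x) by epsilon

plot(x, y_statistical)   # points scatter around the line
abline(a = 3, b = 2)     # the deterministic part, f(x)
```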
For the majority of the course, we will focus on how we move from the idea of an association to estimating a model for the relationship
We’ll mostly look at the linear model
Assumes the relationship between the outcome variable and the predictor(s) is linear
Describes a continuous outcome variable as a function of one or more predictor variables
Question: Do students who study more get higher scores on the test?
student | hours | score |
---|---|---|
ID1 | 0.5 | 1 |
ID2 | 1.0 | 3 |
ID3 | 1.5 | 1 |
ID4 | 2.0 | 2 |
ID5 | 2.5 | 2 |
ID6 | 3.0 | 6 |
ID7 | 3.5 | 3 |
ID8 | 4.0 | 3 |
ID9 | 4.5 | 4 |
ID10 | 5.0 | 8 |
Codebook:

`student` = ID variable unique to each respondent

`hours` = the number of hours spent studying. This will be our predictor (\(x\))

`score` = test score. This will be our outcome (\(y\))
The line can be described by two values:
Intercept: the point where the line crosses the \(y\)-axis, i.e. the value of \(y\) when \(x = 0\)
Slope: the gradient of the line, or rate of change
\[y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i\]
\(y_i\) = the outcome variable (e.g. `score`)
\(x_i\) = the predictor variable (e.g. `hours`)
\(\beta_0\) = intercept
\(\beta_1\) = slope
\(\epsilon_i\) = residual (we will come to this shortly)
\[y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i\]
\(i\) is a subscript to indicate that each participant has their own value.
So each participant has their own value of the outcome (\(y_i\)), the predictor (\(x_i\)), and the residual (\(\epsilon_i\)).
\(\epsilon_i\), or the residual, is a measure of how well the model fits each data point.
It is the vertical distance, measured on the \(y\)-axis, between the model line and a data point.
\(\epsilon_i\) is positive if the point is above the line (red in plot)
\(\epsilon_i\) is negative if the point is below the line (blue in plot)
The line represents a model of our data.
In the scatterplot, the data are represented by points
So a good line is a line that is “close” to all points
The method that we use to identify the best-fitting line is the Principle of Least Squares
\[y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i\]
How do we calculate \(\beta_0\) and \(\beta_1\)?
The values \(\beta_0\) and \(\beta_1\) are typically unknown and need to be estimated from our data.
We find the values of \(\hat \beta_0\) and \(\hat \beta_1\) (and thus our best line) using least squares
Least squares:
minimises the distance between the actual values of \(y\) and the model-predicted values \(\hat y\)
more precisely, it minimises the sum of the squared residuals across all data points (so the line is “close” to the data overall)
Essentially:
\[SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]

Why do you think we square the deviations?
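A minimal sketch of this calculation in R, using the `test` data from above (the function name `ss_residual` is ours):

```r
# Sum of squared residuals for a candidate intercept (b0) and slope (b1)
ss_residual <- function(b0, b1, x, y) {
  y_hat <- b0 + b1 * x   # model-predicted values
  sum((y - y_hat)^2)     # square each deviation, then sum
}

# Squaring stops positive and negative residuals cancelling out,
# and weights large deviations more heavily
ss_residual(b0 = 0.4, b1 = 1.055, x = test$hours, y = test$score)
```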
The values of the intercept and slope that minimise the sum of squared residuals are our estimated coefficients from our data
Minimising the \(SS_{residual}\) means that across all our data, the predicted values from our model are as close as they can be to the actual measured values of the outcome
\[\hat \beta_1 = \frac{SP_{xy}}{SS_x}\]
\[SP_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\]
\[SS_x = \sum_{i=1}^{n}(x_i - \bar{x})^2\]
where \(x\) is our predictor (`hours`) and \(y\) is our outcome (`score`)

\[\hat \beta_0 = \bar{y} - \hat \beta_1 \bar{x}\]
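As a sketch, the estimates can be computed by hand in R from the `test` data:

```r
x <- test$hours
y <- test$score

SP_xy <- sum((x - mean(x)) * (y - mean(y)))  # sum of cross-products
SS_x  <- sum((x - mean(x))^2)                # sum of squared deviations of x

b1 <- SP_xy / SS_x            # slope estimate (~1.055)
b0 <- mean(y) - b1 * mean(x)  # intercept estimate (~0.40)
```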
\[\hat{y}_i = \color{blue}{\hat \beta_0 \cdot{}} \color{orange}{1} \color{blue}{\; + \; \hat \beta_1 \cdot{}} \color{orange}{x_i}\]

Blue: the values of the linear model (coefficients)
Orange: the values we provide (inputs)
In R, we specify the model as a formula: `y ~ 1 + x`. The outcome goes on the left of the `~` and the predictors on the right; the `1` stands for the intercept.

We fit the model with the `lm()` function, giving it the formula and the name of the data set, and we store the result in an object named `mod1`. We can then view the fitted model with `summary()`:
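For example, with the `test` data from earlier (the object name `mod1` and the call below match the output that follows):

```r
# Fit the linear model and store it in an object named mod1
# (score ~ hours is shorthand for score ~ 1 + hours;
#  R includes the intercept by default)
mod1 <- lm(score ~ hours, data = test)

# View the model results
summary(mod1)
```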
```
Call:
lm(formula = score ~ hours, data = test)

Residuals:
   Min     1Q Median     3Q    Max 
-1.618 -1.077 -0.746  1.177  2.436 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    0.400      1.111    0.36    0.728  
hours          1.055      0.358    2.94    0.019 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.63 on 8 degrees of freedom
Multiple R-squared: 0.52,  Adjusted R-squared: 0.46
F-statistic: 8.67 on 1 and 8 DF,  p-value: 0.0186
```
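A brief sketch of what we can then do with the fitted object, using base-R functions:

```r
# Extract just the estimated coefficients
coef(mod1)   # intercept ~0.40, slope ~1.055

# Model-predicted score for a student who studies for 3 hours:
# 0.40 + 1.055 * 3, roughly 3.6
predict(mod1, newdata = data.frame(hours = 3))
```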
Attend your lab and work together on the exercises
Complete the weekly quiz
Help each other on the Piazza forum
Attend office hours (see Learn page for details)