```r
# Scatterplot with a straight (lm) line and a LOESS line to check linearity
library(ggplot2)

df_nonlin |>
  ggplot(aes(x = x1, y = y)) +
  geom_point(size = 5) +
  geom_smooth(method = 'lm',
              colour = 'blue', se = FALSE, linewidth = 3) +
  geom_smooth(method = 'loess',
              colour = 'red', se = FALSE, linewidth = 3)
```
Data Analysis for Psychology in R 2
Department of Psychology
University of Edinburgh
2025–2026
| Block | Lecture |
|---|---|
| Introduction to Linear Models | Intro to Linear Regression |
| | Interpreting Linear Models |
| | Testing Individual Predictors |
| | Model Testing & Comparison |
| | Linear Model Analysis |
| Analysing Experimental Studies | Categorical Predictors & Dummy Coding |
| | Effects Coding & Coding Specific Contrasts |
| | Assumptions & Diagnostics |
| | Bootstrapping |
| | Categorical Predictor Analysis |
| Interactions | Interactions I |
| | Interactions II |
| | Interactions III |
| | Analysing Experiments |
| | Interaction Analysis |
| Advanced Topics | Power Analysis |
| | Binary Logistic Regression I |
| | Binary Logistic Regression II |
| | Logistic Regression Analysis |
| | Exam Prep and Course Q&A |
What does a linear model assume is true about the data that it models? (Four assumptions)
What three properties of a single data point might affect a linear model’s estimates? How can we diagnose each property?
What relationship between predictors do we want to avoid? How can we diagnose it?
When we use any statistical model, we are saying “This model is the process that I think the world used to generate my data.”
Linear models are constrained in certain ways: for example, associations between predictors and outcomes can only be linear.
These constraints are the assumptions that the linear model makes about how the data was generated.
| | Assumption | Looks fine | Suspicious |
|---|---|---|---|
| L | Linearity: The association between predictor and outcome is a straight line. | | |
| I | Independence: Every data point's error is independent of every other data point's error. (Until DAPR3, we'll assume this is true as long as we have between-participant data.) | | |
| N | Normally-distributed errors: The differences between fitted line and each data point (i.e., the residuals) follow a normal distribution. | | |
| E | Equal variance of errors: The differences between fitted line and each data point (i.e., the residuals) are dispersed by a similar amount across the whole range of the predictor. | | |
Make a scatterplot with a straight line and a “LOESS” line (LOcally Estimated Scatterplot Smoothing).
The linear variable:
You want the LOESS line (method = 'loess', red) to stick close to the straight line (method = 'lm', blue). Deviations suggest non-linearity.
With multiple predictors, things get slightly more complex.
Now we need to hold the other predictors constant while we look at each one in turn.
The solution: Component+residual plots (aka “partial-residual plots”).
Basically, CR plots let us look at the linearity of each predictor without any of the other predictors getting in the way.
A component-residual plot shows each predictor's values on the x axis and the partial residuals (that predictor's component plus the model's residuals) on the y axis.
Again, we want the LOESS line to match the straight line as closely as possible. Deviations suggest non-linearity.
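In R, component+residual plots are available via crPlots() from the car package. A minimal sketch, assuming a model with two predictors (the model name mdl_multi and the second predictor x2 are hypothetical, not from the original):

```r
library(car)

# Hypothetical multi-predictor model; assumes df_nonlin also contains
# a second predictor called x2
mdl_multi <- lm(y ~ x1 + x2, data = df_nonlin)

crPlots(mdl_multi)   # one panel per predictor: straight line vs. smoothed line
```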
How much of a deviation is a problem? This is kind of a judgement call. Some deviation is normal.
In order of increasing spiciness:
Keep the variable as-is and report the non-linearity in your write-up. A good solution if the deviation isn’t huge.
Transform the variable until it looks more linear (e.g., what if you take the exponential with exp()? The logarithm with log()? See the sketch after this list.)
Beyond DAPR3: You can use so-called “higher-order” regression terms, which let you model particular kinds of curves (quadratic functions, cubic functions, etc.).
Beyond DAPR3: You can capture basically any non-linear relationship using “generalised additive models” (GAMs).
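As a rough sketch of the transformation option, using the df_nonlin data from earlier; whether log() actually helps depends on the data, and this assumes x1 is strictly positive:

```r
mdl_raw <- lm(y ~ x1, data = df_nonlin)        # original predictor
mdl_log <- lm(y ~ log(x1), data = df_nonlin)   # log-transformed predictor

# Re-draw the straight-line vs. LOESS plot with log(x1) on the x axis to see
# whether the two lines now agree more closely.
```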
| | Assumption | Looks fine | Suspicious |
|---|---|---|---|
| L | Linearity: The association between predictor and outcome is a straight line. | | |
| I | Independence: Every data point's error is independent of every other data point's error. (Until DAPR3, we'll assume this is true as long as we have between-participant data.) | | |
| N | Normally-distributed errors: The differences between fitted line and each data point (i.e., the residuals) follow a normal distribution. | | |
| E | Equal variance of errors: The differences between fitted line and each data point (i.e., the residuals) are dispersed by a similar amount across the whole range of the predictor. | | |
The most common source of non-independence is when multiple observations are gathered from the same source.
For example: when the same participant contributes several data points, those points will tend to resemble each other more than they resemble other participants' data.
Until DAPR3, you can assume that errors are independent as long as the experimental design is between-subjects (i.e., as long as each person only contributes data to one experimental condition).
In order of increasing spiciness:
Keep the variable as-is and report the non-independence in your write-up.
In DAPR3, you’ll learn how to tell a model that some data points probably behave more like one another than they behave like others by including so-called “random effects”.
| | Assumption | Looks fine | Suspicious |
|---|---|---|---|
| L | Linearity: The association between predictor and outcome is a straight line. | | |
| I | Independence: Every data point's error is independent of every other data point's error. (Until DAPR3, assume this is true as long as you have between-participant data.) | | |
| N | Normally-distributed errors: The differences between fitted line and each data point (i.e., the residuals) follow a normal distribution. | | |
| E | Equal variance of errors: The differences between fitted line and each data point (i.e., the residuals) are dispersed by a similar amount across the whole range of the predictor. | | |
The differences between the fitted line and each data point (aka the residuals) should be normally-distributed.
Before there can be any residuals, there must be a fitted line.
So first, fit a model.
Now we have a couple of options:
With a histogram of the residuals, we can eyeball how normally-distributed they appear.
Normally-distributed errors: the histogram matches the bell curve shape pretty well.
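A minimal sketch of this in R, assuming a fitted model called mdl and a data frame dat with outcome y and predictor x (all hypothetical names):

```r
mdl <- lm(y ~ x, data = dat)   # fit first: residuals need a fitted line

# Histogram of the residuals, with a normal curve overlaid for comparison
hist(resid(mdl), breaks = 20, freq = FALSE,
     main = "Model residuals", xlab = "Residual")
curve(dnorm(x, mean = 0, sd = sigma(mdl)), add = TRUE, col = "red")
```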
We can compare the model’s residuals to what the residuals WOULD look like in a world where they were perfectly normally-distributed.
This is what a “quantile-quantile plot”, a Q-Q plot, does. The dots should follow the diagonal line.
A perfect match to the diagonal is rare. Ask: How big is the mismatch?
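A sketch of how to draw one, using the same hypothetical model mdl as above:

```r
plot(mdl, which = 2)   # built-in Q-Q plot of standardised residuals for lm objects

# or by hand:
qqnorm(rstandard(mdl))
qqline(rstandard(mdl))
```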

If errors aren’t normally distributed, then a model which assumes they are will produce weird (aka biased) estimates:
Next week: we’ll learn how to get around this problem using a method called bootstrapping.
| | Assumption | Looks fine | Suspicious |
|---|---|---|---|
| L | Linearity: The association between predictor and outcome is a straight line. | | |
| I | Independence: Every data point's error is independent of every other data point's error. (Until DAPR3, assume this is true as long as you have between-participant data.) | | |
| N | Normally-distributed errors: The differences between fitted line and each data point (i.e., the residuals) follow a normal distribution. | | |
| E | Equal variance of errors: The differences between fitted line and each data point (i.e., the residuals) are dispersed by a similar amount across the whole range of the predictor. | | |
When the variance of errors is not equal, the model is not equally good at estimating the outcome for all values of the predictor.
Again, to find the residuals—the differences between the fitted line and each data point—we need a fitted line.
So first, fit a model.
Investigate by plotting the residuals against the predicted outcome values (also called the “fitted” values).
A plot of residuals vs. predicted values shows the predicted (“fitted”) values on the x axis and the residuals on the y axis.
What we want to see: a roughly even band of points around zero across the whole range of fitted values, with no funnel or fan shape.
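A sketch of this plot in R, again using the hypothetical model mdl from above:

```r
plot(fitted(mdl), resid(mdl),
     xlab = "Fitted (predicted) values", ylab = "Residuals")
abline(h = 0, lty = 2)   # points should scatter evenly around this line

plot(mdl, which = 1)     # base R's built-in version of the same plot
```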
In order of increasing spiciness:
Keep the variable as-is and report the non-equal variance in your write-up.
Include additional predictors or interaction terms (more on interactions next semester). These may help account for some of that extra variance.
Use weighted least squares regression (WLS) instead of ordinary least squares (OLS). More details in this week’s flash cards.
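A very rough sketch of what WLS looks like in R (the flash cards have the details). The object names are hypothetical, and the weighting recipe below is just one common choice, not the only one:

```r
mdl <- lm(y ~ x, data = dat)                     # ordinary least squares

# Model how the residual spread changes with the fitted values, then weight
# each observation by the inverse of its estimated variance
spread_fit <- lm(abs(resid(mdl)) ~ fitted(mdl))
w <- 1 / fitted(spread_fit)^2

mdl_wls <- lm(y ~ x, data = dat, weights = w)    # weighted least squares
summary(mdl_wls)
```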
| | Assumption | Looks fine | Suspicious |
|---|---|---|---|
| L | Linearity: The association between predictor and outcome is a straight line. | | |
| I | Independence: Every data point's error is independent of every other data point's error. (Until DAPR3, assume this is true as long as you have between-participant data.) | | |
| N | Normally-distributed errors: The differences between fitted line and each data point (i.e., the residuals) follow a normal distribution. | | |
| E | Equal variance of errors: The differences between fitted line and each data point (i.e., the residuals) are dispersed by a similar amount across the whole range of the predictor. | | |
Checking assumptions is not an absolute science. It relies on intuitions and vibes (sorry!!)
\(\rightarrow\) Look at the plots, motivate your reasoning, and you’ll be fine.
Diagnosing unusual properties of individual data points (aka “case diagnostics”):
Diagnosing undesirable relationships between predictors:
An outlier is a data point whose value for the outcome variable is unusually extreme.
The outcome is usually plotted on the y axis, so outliers are usually weird in the vertical direction (↕).
“Unusually extreme” for the model, not necessarily for the overall distribution of data!
Residuals extracted from the linear model are on the scale of the outcome. For example: if Model A predicts height in cm, its residuals are in cm; if Model B predicts salary in millions of pounds, its residuals are in millions of pounds.
The scales of these residuals will be totally different. No single threshold for outliers will work for cm AND log units AND yards AND millions of pounds AND …
\(\downarrow\)
Standardised residuals convert residuals from their original scale into z-scores.
Now the residuals of Models A and B are on the same scale.
But to get z-scores, we compare each data point to the mean and SD of all data points.
The mean and SD will be affected by our potential outliers, so we’re slightly comparing a data point to itself.
\(\downarrow\)
Studentised residuals are a version of standardised residuals that exclude the specific data point we’re looking at.
You might recognise “Student” from “Student’s t-test”.
Studentised residuals are a kind of residual that follows a t-distribution.
If you see data points with studentised residuals less than –2 or more than 2, treat them as candidate outliers worth a closer look.
A value less than –2 or more than 2 is a necessary but not a sufficient condition for calling a data point an outlier.
\(\rightarrow\) The more extreme the studentised residual, the more likely it is that the data point is an outlier.
Once we’ve fit a model, we can use the rstudent() function to get all the studentised residuals.
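For example (a sketch; the model name mdl_outl is an assumption), output like the six values below comes from:

```r
head(rstudent(mdl_outl))
```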
1 2 3 4 5 6
-0.0255 0.0261 0.2353 -0.3360 0.8212 -0.6834
To filter these residuals for the ones more extreme than \(\pm\) 2:
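One way to do this with dplyr (a sketch; the names df_outl, y_outl, and mdl_outl are assumptions based on the output shown below):

```r
library(dplyr)

df_outl |>
  mutate(stud_resid = rstudent(mdl_outl)) |>
  filter(abs(stud_resid) > 2) |>
  select(x, y_outl, stud_resid)
```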
# A tibble: 4 x 3
x y_outl stud_resid
<dbl> <dbl> <dbl>
1 -1.28 3.95 2.15
2 0.44 2.53 -2.24
3 1.76 1.24 -7.02
4 1.8 9.56 2.07
We expected one outlier.
But it looks like we have four…?
Studentised residuals for the data without outliers:
Studentised residuals for the data with one outlier:
A value more extreme than \(\pm\) 2 does not necessarily mean the data point is an outlier.
We would expect extreme values 5% of the time.
\(\rightarrow\) The more extreme the studentised residual, the more likely it is that the data point is an outlier.
We’ll talk in detail about how to deal with unusual data points after we’ve looked at all three kinds.
Preview:
More on that in a bit!
| Unusual property of a data point | Looks fine | Suspicious |
|---|---|---|
| 1. Outlyingness: Unusual value of the outcome (↕), when compared to the model. | | |
| 2. High leverage: Unusual value of the predictor (↔︎), when compared to other predictor values. | | |
| 3. High influence: High outlyingness and/or high leverage. | | |
A data point with high leverage has an unusually extreme value for a predictor variable.
Predictors are usually plotted on the x axis, so high-leverage cases are usually weird in the horizontal direction (↔︎).
Can’t use residuals for high-leverage values, because residuals represent vertical (↕) distance.
Instead, for measuring horizontal (↔︎) distance: hat values.
Hat values \(h\) are a standardised way of measuring how different a data point’s value is from other data points. (More mathy details are in the appendix of the slides.)
Step 1: Compute the mean hat value \(\bar{h}\) for a model with \(k\) predictors and \(n\) data points.
\[\bar{h} = \frac{k+1}{n}\]
Heuristic (= rule of thumb) for high leverage: data points with hat values larger than \(2 \times \bar{h}\).
For example, a model with one predictor (\(k=1\)) and 104 observations (\(n=104\)) has a mean hat value \(\bar{h}\) of:
\[ \begin{align} \bar{h} &= \frac{k+1}{n} \\ \bar{h} &= \frac{1+1}{104} \\ \bar{h} &= \frac{2}{104} \\ \bar{h} &\approx 0.0192 \end{align} \]
So our heuristic value for comparison is \(2 ~ \times\) this number, so 0.0384.
Step 2: Let R compute the hat value \(h_i\) for each individual data point \(i\) using hatvalues().
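For example (a sketch; the model name mdl_lev is an assumption), output like the values below comes from:

```r
head(hatvalues(mdl_lev))
```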
1 2 3 4 5 6
0.0369 0.0359 0.0348 0.0338 0.0328 0.0319
Step 3: Compare each data point’s hat values to the heuristic comparison value from Step 1.
But not this time! :)
Hat values for the data without high-leverage cases:
Hat values for the data with one high-leverage case:
Why are the dots curved?
The curved shape happens because hat values measure distance from the mean of x.
The mean of x is 0, so the hat values of data points on either side of 0 get bigger as the points get farther from the mean.
| Unusual property of a data point | Looks fine | Suspicious |
|---|---|---|
| 1. Outlyingness: Unusual value of the outcome (↕), when compared to the model. | | |
| 2. High leverage: Unusual value of the predictor (↔︎), when compared to other predictor values. | | |
| 3. High influence: High outlyingness and/or high leverage. | | |
A data point with high influence has an unusually extreme value for the outcome variable and/or for a predictor variable. High influence points have the most potential to influence a linear model’s estimates.
High-influence cases can be weird (relative to the model) in both the vertical direction (↕) and/or the horizontal direction (↔︎).
Two measures to diagnose high influence: Cook’s distance and COVRATIO.
Cook’s distance — interpretation: the average distance that the predicted outcome values will move if a given data point is removed.
Cook’s distance is essentially outlyingness \(\times\) leverage (mathy details in the appendix).
| Outlyingness | | Leverage | | Influence |
|---|---|---|---|---|
| small | × | small | = | small |
| small | × | BIG | = | BIG |
| BIG | × | small | = | BIG |
| BIG | × | BIG | = | VERY BIG |
A few possible threshold values exist for comparison; here we use the heuristic that data points with \(D_i > \frac{4}{n-k-1}\) have high influence.
Step 1: Compute the threshold value of Cook’s distance for our model and data:
\(n = 102\) data points, \(k = 1\) predictor.
In maths:
\[ \begin{align} D &= \frac{4}{n-k-1}\\ D &= \frac{4}{102-1-1}\\ D &= \frac{4}{100}\\ D &= 0.04 \end{align} \]
Heuristic for high influence for THIS model: data points with Cook’s distance \(D\) larger than 0.04.
Step 2: Let R compute Cook’s distance \(D_i\) for each data point using cooks.distance().
Step 3: Compare each data point’s \(D_i\) to the heuristic comparison value from Step 1.
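A sketch of Steps 2 and 3 together (the names df_infl, y_infl, and mdl_infl are assumptions based on the output shown below):

```r
library(dplyr)

df_infl |>
  mutate(D = cooks.distance(mdl_infl)) |>
  filter(D > 4 / (102 - 1 - 1)) |>   # the 4 / (n - k - 1) heuristic from Step 1
  select(x, y_infl, D)
```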
# A tibble: 5 x 3
x y_infl D
<dbl> <dbl> <dbl>
1 -1.76 2.50 0.0499
2 -1.28 3.95 0.0578
3 -1.08 0.109 0.0404
4 1.8 9.56 0.0726
5 -1 6.5 0.165
We expected one high-influence value, but now we have five…?
\(D\) for the data with no extreme high-influence values:
\(D\) for the data with one extreme high-influence value:
\(\rightarrow\) The more extreme the Cook’s distance, the higher the influence of that data point.
Cook’s distance looks at how each data point influences the model predictions overall.
The next measure, COVRATIO, looks at how each data point influences the regression coefficients (the slopes and intercepts).
COVRATIO stands for “covariance ratio”.
Interpretation: how much a given data point affects the standard error (i.e., the variability) of the regression parameters.
A ratio is a fraction that compares two values.
\[ \text{COVRATIO}_i = \frac{\text{a parameter's standard error, including data point}\ i}{\text{a parameter's standard error, NOT including data point}\ i} \]
If the given data point does not affect the standard error, then the ratio = 1.
If the SE gets bigger without a data point:
\[ \frac {\text{small SE with}\ i} {\text{big SE without}\ i} = \frac{\text{small}}{\text{big}} < 1 \]
If the SE gets smaller without a data point:
\[ \frac {\text{big SE with}\ i} {\text{small SE without}\ i} = \frac{\text{big}}{\text{small}} > 1 \]
Threshold values for a model with \(k\) predictors and \(n\) data points: COVRATIOs outside the range \(1 \pm \frac{3(k+1)}{n}\) suggest high influence.
Step 1: Compute the COVRATIO threshold value for our model and data: \(n=102\) data points, \(k=1\) predictor.
In maths (just the bit we add/subtract from 1):
\[ \begin{align} & \frac{3(k + 1)}{n}\\ = & \frac{3(1 + 1)}{102}\\ = & \frac{3 \times 2}{102}\\ = & \frac{6}{102}\\ = & 0.059\\ \end{align} \]
Heuristic for high influence for THIS model: COVRATIO below \(1 - 0.059 = 0.941\) or above \(1 + 0.059 = 1.059\).
Step 2: Let R compute the COVRATIO for each data point using covratio().
Step 3: Compare each data point’s COVRATIO to the heuristic comparison values from Step 1.
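A sketch of Steps 2 and 3 together (df_infl, y_infl, and mdl_infl are assumed names, as above):

```r
library(dplyr)

df_infl |>
  mutate(covr = covratio(mdl_infl)) |>
  filter(covr < 1 - 0.059 | covr > 1 + 0.059) |>   # thresholds from Step 1
  select(x, y_infl, covr)
```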
# A tibble: 6 x 3
x y_infl covr
<dbl> <dbl> <dbl>
1 -2 0.347 1.06
2 -1.96 0.476 1.06
3 -1.28 3.95 0.936
4 0.44 2.53 0.900
5 1.92 7.79 1.06
6 -1 6.5 0.678
Six values that affect the standard error of the regression parameters a lot!
For the data with no extreme high-influence values:
For the data with one extreme high-influence value:
\(\rightarrow\) The more extreme the COVRATIO, the higher the influence of that data point.
| Unusual property of a data point | Looks fine | Suspicious |
|---|---|---|
| 1. Outlyingness: Unusual value of the outcome (↕), when compared to the model. | | |
| 2. High leverage: Unusual value of the predictor (↔︎), when compared to other predictor values. | | |
| 3. High influence: High outlyingness and/or high leverage. | | |
influence.measures() computes all the data point diagnostics we’ve talked about (except studentised residuals for outlyingness).
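A sketch of the call (mdl_infl is an assumed model name); influence.measures() returns a list whose infmat component holds one row of measures per data point:

```r
infl <- influence.measures(mdl_infl)
round(head(infl$infmat), 3)   # first six rows, rounded for display
```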
dfb.1_ dfb.x dffit cov.r cook.d hat
1 -0.003 0.005 -0.006 1.06 0.000 0.038
2 0.002 -0.004 0.005 1.06 0.000 0.037
3 0.025 -0.042 0.049 1.06 0.001 0.036
4 -0.038 0.062 -0.073 1.05 0.003 0.035
5 0.089 -0.142 0.169 1.04 0.014 0.034
6 -0.077 0.119 -0.142 1.04 0.010 0.033
- dfb.1_: difference in the estimated intercept with and without this data point
- dfb.x (aka “DFBETA”): difference in a predictor’s estimated slope with and without this data point (there will be one of these measures per predictor)
- dffit: difference in the predicted outcome values with and without this data point
- cov.r: covariance ratio of the regression parameters with and without this data point
- cook.d: Cook’s Distance of this data point
- hat: hat value of this data point
Ignore them and pretend they don’t exist?
Check if they could be a mistake?
Delete them?
Mention them in your write-up?
Replace them with less extreme values?
Check how much they influence your conclusions?
A sensitivity analysis asks: Do our conclusions change if we leave out the unusual data point(s)?
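In practice this means refitting the model without the flagged point(s) and comparing the two sets of estimates. A minimal sketch (object names are hypothetical, and flagging via the largest Cook’s distance is just one way to pick the point to drop):

```r
worst <- which.max(cooks.distance(mdl_infl))   # row with the largest D

mdl_full    <- lm(y_infl ~ x, data = df_infl)
mdl_reduced <- lm(y_infl ~ x, data = df_infl[-worst, ])

summary(mdl_full)      # conclusions with the unusual point included
summary(mdl_reduced)   # conclusions with it left out
```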
Example 1:
Imagine we want to know whether there’s a significant positive association between x and y, but there’s a data point that we suspect might have high influence on our model.
One high-influence data point?
We try removing that data point to see what happens.
A model fit to the data that contains the high-influence value:
Call:
lm(formula = y_infl ~ x, data = df_infl)
Residuals:
Min 1Q Median 3Q Max
-2.481 -0.626 -0.035 0.488 4.222
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.1787 0.0962 43.4 <2e-16 ***
x 1.9007 0.0826 23.0 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.972 on 100 degrees of freedom
Multiple R-squared: 0.841, Adjusted R-squared: 0.839
F-statistic: 529 on 1 and 100 DF, p-value: <2e-16
A model fit to the data with the high-influence value removed:
Call:
lm(formula = y_good ~ x, data = df_outl)
Residuals:
Min 1Q Median 3Q Max
-2.4526 -0.5939 0.0037 0.5157 2.2888
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.1369 0.0874 47.3 <2e-16 ***
x 1.9315 0.0749 25.8 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.878 on 99 degrees of freedom
Multiple R-squared: 0.87, Adjusted R-squared: 0.869
F-statistic: 664 on 1 and 99 DF, p-value: <2e-16
Even if we remove the data point, we still see a significant positive association between x and y.
\(\rightarrow\) The high-influence point doesn’t affect our conclusions, so it’s not a major cause for concern.
Again, imagine we want to know whether there’s a significant positive association between x and y.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.395 0.259 1.53 0.130
x 0.448 0.211 2.12 0.036 *
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2189 0.1804 1.21 0.23
x -0.0703 0.1547 -0.45 0.65
\(\rightarrow\) The high-influence point DOES affect our conclusions, so it IS a major problem for our analysis.
Diagnosing unusual properties of individual data points (aka “case diagnostics”):
Diagnosing undesirable relationships between predictors:
When two predictors are correlated, they contain similar information.
If you know one, you can guess the other.
The model cannot tell which predictor is contributing what information, so its estimates are less precise. In other words, the variance of its estimates increases.
We have a data set called corr_df with an outcome variable y and two predictors x1 and x2.
In corr_df, x1 and x2 are highly correlated.
The correlation appears as a strong diagonal line.
We have another data set called uncorr_df:
In uncorr_df, x1 and x2 are not very correlated.
The lack of correlation appears as a cloud of data points.
When predictors are correlated, the variance of the model’s estimates increases.
We can detect this using the Variance Inflation Factor or VIF.
Interpreting VIF: a VIF of 1 means a predictor is not correlated with the other predictors; the bigger the VIF, the more that predictor’s variance (and therefore its standard error) is inflated. The square root of the VIF tells you how many times bigger the SE is. As a common rule of thumb, VIFs above about 5–10 are cause for concern.
We calculate the Variance Inflation Factor in R using vif() from the package car.
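A sketch of the call for the correlated-predictors data set (the model name mdl_corr is an assumption; corr_df, y, x1, and x2 are from above):

```r
library(car)

mdl_corr <- lm(y ~ x1 + x2, data = corr_df)
vif(mdl_corr)   # one VIF per predictor; values near 1 are ideal
```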
Correlated predictors:
The SE of each predictor is \(\sqrt{7.4} = 2.72\) times bigger than it would be without the other predictors.
Slightly worrisome!
In order of increasing spiciness:
If the correlation isn’t too worrying, then leave the model as-is and report the VIFs.
If the correlations are large, remove one of the correlated predictors from the model—it’s not adding any new information anyway.
In DAPR3: Make a composite predictor that combines the correlated predictors. For example: a sum, an average, or a cleverer technique like Principal Component Analysis.
The check_model() function from the performance package can draw many of these diagnostic plots in one go.
What does a linear model assume is true about the data that it models? (Four assumptions)
What three properties of a single data point might affect a linear model’s estimates? How can we diagnose each property?
What relationship between predictors do we want to avoid? How can we diagnose it?
Attend your lab and work together on the exercises
Help each other on the Piazza forum
Complete the weekly quiz

Attend office hours (see Learn page for details)
Hat values ( \(h_i\) ) are used to assess data points’ leverage in a linear model.
In essence: we find the difference between each data point and the mean. We standardise that difference with respect to how big all the differences are and with respect to how many data points we have overall.
For a simple linear model, the hat value for case \(i\) would be
\[h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i=1}^n(x_i - \bar{x})^2}\]
where \(n\) is the number of data points, \(x_i\) is data point \(i\)’s value on the predictor, and \(\bar{x}\) is the mean of the predictor.
The mean of all hat values ( \(\bar{h}\) ) is:
\[\bar{h} = (k+1)/n\]
In a simple linear regression with one predictor, \(k=1\).
So \(\bar h = (1 + 1) / n = 2 /n\).
Cook’s Distance of a data point \(i\):
\[D_i = \frac{(\text{StandardizedResidual}_i)^2}{k+1} \times \frac{h_i}{1-h_i}\]
Where
\[\frac{(\text{StandardizedResidual}_i)^2}{k+1} = \text{Outlyingness}\]
and
\[\frac{h_i}{1-h_i} = \text{Leverage},\]
So
\[D_i = \text{Outlyingness} \times \text{Leverage}.\]