library(tidyverse)
library(kableExtra)
Calculating Sums of Squares & R-Squared for the Linear Model
Data Analysis for Psychology in R 2
In this document we are going to work through the hand calculations for:
- \(SS_\text{total}\)
- \(SS_\text{residual}\)
- \(SS_\text{model}\)
- \(R^2\)
- \(\hat R^2\)
Packages
For this example we will only need the tidyverse and kableExtra packages.
Data
test_study2 <- read_csv("https://uoepsy.github.io/data/test_study2.csv")
We will make our calculations for the model we have been using in class this week:
performance <- lm(score ~ hours + motivation, data = test_study2)
summary(performance)
Call:
lm(formula = score ~ hours + motivation, data = test_study2)
Residuals:
Min 1Q Median 3Q Max
-12.9548 -2.8042 -0.2847 2.9344 13.8240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.86679 0.65473 10.488 <2e-16 ***
hours 1.37570 0.07989 17.220 <2e-16 ***
motivation 0.91634 0.38376 2.388 0.0182 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared: 0.6696, Adjusted R-squared: 0.6651
F-statistic: 148.9 on 2 and 147 DF, p-value: < 2.2e-16
Sums of Squares Total
\[ SS_{Total} = \sum_{i=1}^{n}(y_i - \bar{y})^2 \]
Applying the formula involves:
- taking the observed value of score for each individual (\(y_i\))
- taking the mean of score (\(\bar{y}\)), which is the same for everyone, and subtracting it from each individual value of score (i.e., \(y_i - \bar{y}\))
- squaring each of the obtained \(y_i - \bar{y}\) values
- summing these squared values together
We will build an object called ss_tab (to stand for “sums of squares table”). Here we will add our calculations to the original data.
ss_tab <- test_study2 |>
mutate(
y_dev = score - mean(score), #subtract the mean from each observed value of score
y_dev2 = round(y_dev^2, 2) # square these values, and round to 2dp
)
In the code above, y_dev corresponds to \((y_i - \bar{y})\), and y_dev2 squares these values. These values are calculated for each individual, as we can see below where we print the top 10 rows of ss_tab.
| ID | score | hours | motivation | y_dev | y_dev2 |
|---|---|---|---|---|---|
| ID101 | 7 | 2 | -1.42 | -9.14 | 83.54 |
| ID102 | 23 | 12 | -0.41 | 6.86 | 47.06 |
| ID103 | 17 | 4 | 0.49 | 0.86 | 0.74 |
| ID104 | 6 | 2 | 0.24 | -10.14 | 102.82 |
| ID105 | 12 | 2 | 0.09 | -4.14 | 17.14 |
| ID106 | 24 | 12 | 1.05 | 7.86 | 61.78 |
| ID107 | 11 | 3 | -0.09 | -5.14 | 26.42 |
| ID108 | 23 | 8 | 0.39 | 6.86 | 47.06 |
| ID109 | 25 | 10 | 0.34 | 8.86 | 78.50 |
| ID110 | 15 | 9 | -0.08 | -1.14 | 1.30 |
We can then calculate the sum of y_dev2 to give us our sums of squares total:
ss_tot <- sum(ss_tab$y_dev2)
ss_tot

[1] 8556.12
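Note that because y_dev2 was rounded to 2dp, ss_tot is itself very slightly rounded. A one-line version without any intermediate rounding might look like this (assuming test_study2 has been read in as above):

```r
# SS_total without intermediate rounding
sum((test_study2$score - mean(test_study2$score))^2)
```

This quantity is also \((n-1)\) times the sample variance of score, i.e. var(test_study2$score) * (nrow(test_study2) - 1).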
Sums of Squares Residual
\[ SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]
Applying the formula involves:
- taking the observed value of score for each individual (\(y_i\))
- subtracting from each \(y_i\) the model-predicted value of score for that individual (\(\hat{y}_i\))
- squaring each of the obtained \(y_i - \hat{y}_i\) values
- summing these squared values together
ss_tab <- ss_tab |>
mutate(
y_pred = round(performance$fitted.values,2), # extract the predicted values from the model
pred_dev = round((score - y_pred),2), # subtract these from the observed values of score
pred_dev2 = round(pred_dev^2,2) # square these values, and round to 2dp
)

| ID | score | hours | y_dev2 | y_pred | pred_dev | pred_dev2 |
|---|---|---|---|---|---|---|
| ID101 | 7 | 2 | 83.54 | 8.32 | -1.32 | 1.74 |
| ID102 | 23 | 12 | 47.06 | 23.00 | 0.00 | 0.00 |
| ID103 | 17 | 4 | 0.74 | 12.82 | 4.18 | 17.47 |
| ID104 | 6 | 2 | 102.82 | 9.84 | -3.84 | 14.75 |
| ID105 | 12 | 2 | 17.14 | 9.70 | 2.30 | 5.29 |
| ID106 | 24 | 12 | 61.78 | 24.34 | -0.34 | 0.12 |
| ID107 | 11 | 3 | 26.42 | 10.91 | 0.09 | 0.01 |
| ID108 | 23 | 8 | 47.06 | 18.23 | 4.77 | 22.75 |
| ID109 | 25 | 10 | 78.50 | 20.94 | 4.06 | 16.48 |
| ID110 | 15 | 9 | 1.30 | 19.17 | -4.17 | 17.39 |
As we did previously, we can then sum this final column to give the sums of squares residual:
ss_resid <- sum(ss_tab$pred_dev2)
ss_resid

[1] 2826.83
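R can also extract this quantity for us directly. Assuming the performance model fitted above, two equivalent one-liners are:

```r
# sum of squared residuals, taken straight from the model object
sum(residuals(performance)^2)
# deviance() returns the same quantity for an lm fit
deviance(performance)
```

Any small difference from ss_resid above comes from the rounding to 2dp in the table.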
Sums of Squares Model
\[ SS_{Model} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \]
Applying the formula involves:
- taking the model-predicted value of score for each individual (\(\hat{y}_i\))
- subtracting the mean of score (\(\bar{y}\)) from each of the \(\hat{y}_i\) values
- squaring each of the obtained \(\hat{y}_i - \bar{y}\) values
- summing these squared values together
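The direct calculation described above can be sketched in one line, assuming test_study2 and the performance model exist from the code above:

```r
# SS_model computed directly from the model's fitted values,
# without the rounding used in the tables above
sum((fitted(performance) - mean(test_study2$score))^2)
```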
However, there is a much more efficient way to do this, which is simply:
\[ SS_{Model} = SS_{Total} - SS_{Residual} \]
ss_mod <- ss_tot - ss_resid
ss_mod

[1] 5729.29
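As a check on this partitioning, the anova() function breaks \(SS_{Model}\) down into the sequential sums of squares for each predictor. Assuming the performance model from above, the predictor rows should add up to ss_mod (up to rounding):

```r
a <- anova(performance)
# rows 1-2 are the sequential SS for hours and motivation;
# the final row is the residual sum of squares
sum(a$`Sum Sq`[1:2])
```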
Calculate \(R^2\)
\[ R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}} \]
ss_mod/ss_tot

[1] 0.6696131
or
1 - (ss_resid/ss_tot)

[1] 0.6696131
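For a model with an intercept, \(R^2\) is also the squared correlation between the observed and model-predicted values of the outcome. Assuming the performance model from above, a quick sketch of this check:

```r
# squared correlation between observed and fitted values
cor(test_study2$score, fitted(performance))^2
```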
We can compare this to the value at the bottom of our model summary:
summary(performance)
Call:
lm(formula = score ~ hours + motivation, data = test_study2)
Residuals:
Min 1Q Median 3Q Max
-12.9548 -2.8042 -0.2847 2.9344 13.8240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.86679 0.65473 10.488 <2e-16 ***
hours 1.37570 0.07989 17.220 <2e-16 ***
motivation 0.91634 0.38376 2.388 0.0182 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared: 0.6696, Adjusted R-squared: 0.6651
F-statistic: 148.9 on 2 and 147 DF, p-value: < 2.2e-16
If we wanted to save this specific value from the summary output, we can do the following:
sum_performance <- summary(performance)
sum_performance$r.squared

[1] 0.6695598
Calculate \(\hat R^2\)
\[ \hat R^2 = 1 - (1 - R^2)\frac{N-1}{N-k-1} \]
To calculate adjusted \(R^2\) (denoted as \(\hat R^2\)), we need to know:
- \(R^2\)
- \(N\) (sample size)
- \(k\) (number of predictors)
Once we have obtained these values, we can substitute into the formula:
\[
\begin{align}
\hat R^2 &= 1 - (1 - R^2)\frac{N-1}{N-k-1} \\
\hat R^2 &= 1 - (1 - 0.6696)\frac{150-1}{150-2-1} \\
\hat R^2 &= 1 - (0.3304)\frac{149}{147} \\
\hat R^2 &= 1 - (0.3304 \cdot 1.013605) \\
\hat R^2 &= 0.6651049
\end{align}
\]
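The substitution above can be reproduced in R. Assuming the performance model and test_study2 data from above, a sketch (using the exact rather than the rounded \(R^2\)):

```r
r2 <- summary(performance)$r.squared  # exact R-squared
n  <- nrow(test_study2)               # sample size (150)
k  <- 2                               # number of predictors
1 - (1 - r2) * (n - 1) / (n - k - 1)
```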
We can compare this to the value at the bottom of our model summary:
summary(performance)
Call:
lm(formula = score ~ hours + motivation, data = test_study2)
Residuals:
Min 1Q Median 3Q Max
-12.9548 -2.8042 -0.2847 2.9344 13.8240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.86679 0.65473 10.488 <2e-16 ***
hours 1.37570 0.07989 17.220 <2e-16 ***
motivation 0.91634 0.38376 2.388 0.0182 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared: 0.6696, Adjusted R-squared: 0.6651
F-statistic: 148.9 on 2 and 147 DF, p-value: < 2.2e-16
If we wanted to save this specific value from the summary output, we can do the following:
sum_performance <- summary(performance)
sum_performance$adj.r.squared

[1] 0.665064