Calculating Sums of Squares & R-Squared for the Linear Model


Data Analysis for Psychology in R 2

Emma Waterston


Department of Psychology
University of Edinburgh
2025–2026

In this document we are going to work through the hand calculations for the sums of squares (total, residual, and model), \(R^2\), and adjusted \(R^2\) for the linear model.

Packages

For this example we will only need the tidyverse and kableExtra packages.

library(tidyverse)
library(kableExtra)

Data

test_study2 <- read_csv("https://uoepsy.github.io/data/test_study2.csv")

We will make our calculations for the model we have been using in class this week:

performance <- lm(score ~ hours + motivation, data = test_study2)
summary(performance)

Call:
lm(formula = score ~ hours + motivation, data = test_study2)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.9548  -2.8042  -0.2847   2.9344  13.8240 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.86679    0.65473  10.488   <2e-16 ***
hours        1.37570    0.07989  17.220   <2e-16 ***
motivation   0.91634    0.38376   2.388   0.0182 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared:  0.6696,    Adjusted R-squared:  0.6651 
F-statistic: 148.9 on 2 and 147 DF,  p-value: < 2.2e-16

Sums of Squares Total

\[ SS_{Total} = \sum_{i=1}^{n}(y_i - \bar{y})^2 \]

Applying the formula involves:

  • taking the observed value of score for each individual (\(y_i\))
  • taking the mean of score (\(\bar{y}\)), which is the same for everyone, and subtracting it from each individual value of score (\(y_i\)) (i.e., \(y_i - \bar{y}\))
  • squaring each of the obtained \(y_i - \bar{y}\) values
  • and summing them together
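For example, for the first participant in the data (ID101, whose score is 7; the mean of score here is 16.14), the first three steps give:

\[ (y_1 - \bar{y})^2 = (7 - 16.14)^2 = (-9.14)^2 = 83.54 \]

which matches the first row of the table below.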

We will build an object called ss_tab (to stand for “sums of squares table”). Here we will add our calculations to the original data.

ss_tab <- test_study2 |>
    mutate(
        y_dev = score - mean(score), #subtract the mean from each observed value of score
        y_dev2 = round(y_dev^2, 2) # square these values, and round to 2dp
    )

In the code above, y_dev corresponds to \((y_i - \bar{y})\), and y_dev2 squares these values. These values are calculated for each individual, as we can see below, where we print the top 10 rows of ss_tab.

ID score hours motivation y_dev y_dev2
ID101 7 2 -1.42 -9.14 83.54
ID102 23 12 -0.41 6.86 47.06
ID103 17 4 0.49 0.86 0.74
ID104 6 2 0.24 -10.14 102.82
ID105 12 2 0.09 -4.14 17.14
ID106 24 12 1.05 7.86 61.78
ID107 11 3 -0.09 -5.14 26.42
ID108 23 8 0.39 6.86 47.06
ID109 25 10 0.34 8.86 78.50
ID110 15 9 -0.08 -1.14 1.30

We can then calculate the sum of y_dev2 to give us our sums of squares total:

ss_tot <- sum(ss_tab$y_dev2)
ss_tot
[1] 8556.12

Sums of Squares Residual

\[ SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

Applying the formula involves:

  • taking the observed values of score for each individual ( \(y_i\) )
  • subtracting from each \(y_i\) the model-predicted value of score for that individual ( \(\hat{y}_i\) )
  • squaring each of the obtained \(y_i - \hat{y}_i\) values
  • and summing them together

ss_tab <- ss_tab |>
  mutate(
    y_pred = round(performance$fitted.values,2), # extract the predicted values from the model
    pred_dev = round((score - y_pred),2), # subtract these from the observed values of score
    pred_dev2 = round(pred_dev^2,2) # square these values, and round to 2dp
  )
ID score hours y_dev2 y_pred pred_dev pred_dev2
ID101 7 2 83.54 8.32 -1.32 1.74
ID102 23 12 47.06 23.00 0.00 0.00
ID103 17 4 0.74 12.82 4.18 17.47
ID104 6 2 102.82 9.84 -3.84 14.75
ID105 12 2 17.14 9.70 2.30 5.29
ID106 24 12 61.78 24.34 -0.34 0.12
ID107 11 3 26.42 10.91 0.09 0.01
ID108 23 8 47.06 18.23 4.77 22.75
ID109 25 10 78.50 20.94 4.06 16.48
ID110 15 9 1.30 19.17 -4.17 17.39

As we did previously, we can then sum this final column to give the sums of squares residual:

ss_resid <- sum(ss_tab$pred_dev2)
ss_resid
[1] 2826.83

Sums of Squares Model

\[ SS_{Model} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \]

Applying the formula involves:

  • taking the predicted value of score for each individual from our model ( \(\hat{y}_i\) )
  • subtracting the mean of score ( \(\bar{y}\) ) from each of the \(\hat{y}_i\) values
  • squaring each of the obtained \(\hat{y}_i - \bar{y}\) values
  • and summing them
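If we wanted to follow these steps directly, a sketch (assuming the ss_tab object built above, which already contains the y_pred column) would look like:

```r
# Direct calculation of SS_Model from its definition
ss_tab <- ss_tab |>
  mutate(
    mod_dev  = y_pred - mean(score), # subtract the mean of score from each predicted value
    mod_dev2 = mod_dev^2             # square these values
  )
sum(ss_tab$mod_dev2)
```

Because y_pred was rounded to 2 decimal places earlier, this will agree with the shortcut below only up to rounding.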

However, there is a much more efficient way to do this, which is simply:

\[ SS_{Model} = SS_{Total} - SS_{Residual} \]

ss_mod <- ss_tot - ss_resid
ss_mod
[1] 5729.29

Calculate \(R^2\)

\[ R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}} \]

ss_mod/ss_tot
[1] 0.6696131

or

1 - (ss_resid/ss_tot)
[1] 0.6696131

We can compare this to the value at the bottom of our model summary:

summary(performance)

Call:
lm(formula = score ~ hours + motivation, data = test_study2)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.9548  -2.8042  -0.2847   2.9344  13.8240 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.86679    0.65473  10.488   <2e-16 ***
hours        1.37570    0.07989  17.220   <2e-16 ***
motivation   0.91634    0.38376   2.388   0.0182 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared:  0.6696,    Adjusted R-squared:  0.6651 
F-statistic: 148.9 on 2 and 147 DF,  p-value: < 2.2e-16

If we wanted to save this specific value from the summary output, we can do the following:

sum_performance <- summary(performance)
sum_performance$r.squared
[1] 0.6695598

Calculate \(\hat R^2\)

\[ \hat R^2 = 1 - (1 - R^2)\frac{N-1}{N-k-1} \]

To calculate adjusted \(R^2\) (denoted as \(\hat R^2\)), we need to know:

  • \(R^2\)
  • \(N\) (sample size)
  • \(k\) (number of predictors)

Once we have obtained these values, we can substitute into the formula:

\[ \begin{aligned} \hat R^2 &= 1 - (1 - R^2)\frac{N-1}{N-k-1} \\ &= 1 - (1 - 0.6696)\frac{150-1}{150-2-1} \\ &= 1 - 0.3304 \times \frac{149}{147} \\ &= 1 - (0.3304 \times 1.013605) \\ &= 0.6651 \end{aligned} \]
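The same substitution can be sketched in R, assuming the test_study2, ss_mod, and ss_tot objects from above:

```r
n  <- nrow(test_study2)  # sample size (150)
k  <- 2                  # number of predictors (hours and motivation)
r2 <- ss_mod / ss_tot    # R-squared from our hand calculations

1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Small differences from the hand calculation above come from rounding \(R^2\) to 4 decimal places there.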

We can compare this to the value at the bottom of our model summary:

summary(performance)

Call:
lm(formula = score ~ hours + motivation, data = test_study2)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.9548  -2.8042  -0.2847   2.9344  13.8240 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.86679    0.65473  10.488   <2e-16 ***
hours        1.37570    0.07989  17.220   <2e-16 ***
motivation   0.91634    0.38376   2.388   0.0182 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared:  0.6696,    Adjusted R-squared:  0.6651 
F-statistic: 148.9 on 2 and 147 DF,  p-value: < 2.2e-16

If we wanted to save this specific value from the summary output, we can do the following:

sum_performance <- summary(performance)
sum_performance$adj.r.squared
[1] 0.665064