library(tidyverse)
library(kableExtra)
Calculating Sums of Squares & R-Squared for the Linear Model
Data Analysis for Psychology in R 2
In this document we are going to work through the hand calculations for:
- \(SS_\text{total}\)
- \(SS_\text{residual}\)
- \(SS_\text{model}\)
- \(R^2\)
- \(\hat R^2\)
Packages
For this example we will only need the tidyverse and kableExtra packages.
Data
test_study2 <- read_csv("https://uoepsy.github.io/data/test_study2.csv")
We will make our calculations for the model we have been using in class this week:
performance <- lm(score ~ hours + motivation, data = test_study2)
summary(performance)
Call:
lm(formula = score ~ hours + motivation, data = test_study2)
Residuals:
Min 1Q Median 3Q Max
-12.9548 -2.8042 -0.2847 2.9344 13.8240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.86679 0.65473 10.488 <2e-16 ***
hours 1.37570 0.07989 17.220 <2e-16 ***
motivation 0.91634 0.38376 2.388 0.0182 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared: 0.6696, Adjusted R-squared: 0.6651
F-statistic: 148.9 on 2 and 147 DF, p-value: < 2.2e-16
Sums of Squares Total
\[ SS_{Total} = \sum_{i=1}^{n}(y_i - \bar{y})^2 \]
Applying the formula involves:
- taking the observed value of score for each individual (\(y_i\))
- taking the mean of score (\(\bar{y}\)), which is the same for everyone, and subtracting it from each individual value of score (i.e., \(y_i - \bar{y}\))
- squaring each of the obtained \(y_i - \bar{y}\) values
- summing these squared values together
We will build an object called ss_tab (to stand for “sums of squares table”). Here we will add our calculations to the original data.
ss_tab <- test_study2 |>
mutate(
y_dev = score - mean(score), #subtract the mean from each observed value of score
y_dev2 = round(y_dev^2, 2) # square these values, and round to 2dp
)
In the code above, y_dev corresponds to \((y_i - \bar{y})\), and y_dev2 squares these values. These values are calculated for each individual, as we can see below where we print the top 10 rows of ss_tab.
| ID | score | hours | motivation | y_dev | y_dev2 |
|---|---|---|---|---|---|
| ID101 | 7 | 2 | -1.42 | -9.14 | 83.54 |
| ID102 | 23 | 12 | -0.41 | 6.86 | 47.06 |
| ID103 | 17 | 4 | 0.49 | 0.86 | 0.74 |
| ID104 | 6 | 2 | 0.24 | -10.14 | 102.82 |
| ID105 | 12 | 2 | 0.09 | -4.14 | 17.14 |
| ID106 | 24 | 12 | 1.05 | 7.86 | 61.78 |
| ID107 | 11 | 3 | -0.09 | -5.14 | 26.42 |
| ID108 | 23 | 8 | 0.39 | 6.86 | 47.06 |
| ID109 | 25 | 10 | 0.34 | 8.86 | 78.50 |
| ID110 | 15 | 9 | -0.08 | -1.14 | 1.30 |
We can then calculate the sum of y_dev2 to give us our sums of squares total:
ss_tot <- sum(ss_tab$y_dev2)
ss_tot

[1] 8556.12
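Note that because y_dev2 was rounded to 2dp, ss_tot is itself very slightly rounded. A one-line version without any intermediate rounding might look like this (assuming test_study2 has been read in as above):

```r
# SS_total without intermediate rounding
sum((test_study2$score - mean(test_study2$score))^2)
```

This quantity is also \((n-1)\) times the sample variance of score, i.e. var(test_study2$score) * (nrow(test_study2) - 1).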
Sums of Squares Residual
\[ SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]
Applying the formula involves:
- taking the observed value of score for each individual (\(y_i\))
- subtracting from each \(y_i\) the model-predicted value of score for that individual (\(\hat{y}_i\))
- squaring each of the obtained \(y_i - \hat{y}_i\) values
- summing these squared values together
ss_tab <- ss_tab |>
mutate(
y_pred = round(performance$fitted.values,2), # extract the predicted values from the model
pred_dev = round((score - y_pred),2), # subtract these from the observed values of score
pred_dev2 = round(pred_dev^2,2) # square these values, and round to 2dp
)

| ID | score | hours | y_dev2 | y_pred | pred_dev | pred_dev2 |
|---|---|---|---|---|---|---|
| ID101 | 7 | 2 | 83.54 | 8.32 | -1.32 | 1.74 |
| ID102 | 23 | 12 | 47.06 | 23.00 | 0.00 | 0.00 |
| ID103 | 17 | 4 | 0.74 | 12.82 | 4.18 | 17.47 |
| ID104 | 6 | 2 | 102.82 | 9.84 | -3.84 | 14.75 |
| ID105 | 12 | 2 | 17.14 | 9.70 | 2.30 | 5.29 |
| ID106 | 24 | 12 | 61.78 | 24.34 | -0.34 | 0.12 |
| ID107 | 11 | 3 | 26.42 | 10.91 | 0.09 | 0.01 |
| ID108 | 23 | 8 | 47.06 | 18.23 | 4.77 | 22.75 |
| ID109 | 25 | 10 | 78.50 | 20.94 | 4.06 | 16.48 |
| ID110 | 15 | 9 | 1.30 | 19.17 | -4.17 | 17.39 |
As we did previously, we can then sum this final column to give the sums of squares residual:
ss_resid <- sum(ss_tab$pred_dev2)
ss_resid

[1] 2826.83
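R can also extract this quantity for us directly. Assuming the performance model fitted above, two equivalent one-liners are:

```r
# sum of squared residuals, taken straight from the model object
sum(residuals(performance)^2)
# deviance() returns the same quantity for an lm fit
deviance(performance)
```

Any small difference from ss_resid above comes from the rounding to 2dp in the table.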
Sums of Squares Model
\[ SS_{Model} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \]
Applying the formula involves:
- taking the model-predicted value of score for each individual (\(\hat{y}_i\))
- subtracting the mean of score (\(\bar{y}\)) from each of the \(\hat{y}_i\) values
- squaring each of the obtained \(\hat{y}_i - \bar{y}\) values
- summing these squared values together
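The direct calculation described above can be sketched in one line, assuming test_study2 and the performance model exist from the code above:

```r
# SS_model computed directly from the model's fitted values,
# without the rounding used in the tables above
sum((fitted(performance) - mean(test_study2$score))^2)
```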
However, there is a much more efficient way to do this, which is simply:
\[ SS_{Model} = SS_{Total} - SS_{Residual} \]
ss_mod <- ss_tot - ss_resid
ss_mod

[1] 5729.29
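As a check on this partitioning, the anova() function breaks \(SS_{Model}\) down into the sequential sums of squares for each predictor. Assuming the performance model from above, the predictor rows should add up to ss_mod (up to rounding):

```r
a <- anova(performance)
# rows 1-2 are the sequential SS for hours and motivation;
# the final row is the residual sum of squares
sum(a$`Sum Sq`[1:2])
```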
Calculate \(R^2\)
\[ R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}} \]
ss_mod/ss_tot

[1] 0.6696131
or
1 - (ss_resid/ss_tot)

[1] 0.6696131
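For a model with an intercept, \(R^2\) is also the squared correlation between the observed and model-predicted values of the outcome. Assuming the performance model from above, a quick sketch of this check:

```r
# squared correlation between observed and fitted values
cor(test_study2$score, fitted(performance))^2
```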
We can compare this to the value at the bottom of our model summary:
summary(performance)
Call:
lm(formula = score ~ hours + motivation, data = test_study2)
Residuals:
Min 1Q Median 3Q Max
-12.9548 -2.8042 -0.2847 2.9344 13.8240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.86679 0.65473 10.488 <2e-16 ***
hours 1.37570 0.07989 17.220 <2e-16 ***
motivation 0.91634 0.38376 2.388 0.0182 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared: 0.6696, Adjusted R-squared: 0.6651
F-statistic: 148.9 on 2 and 147 DF, p-value: < 2.2e-16
If we wanted to save this specific value from the summary output, we can do the following:
sum_performance <- summary(performance)
sum_performance$r.squared

[1] 0.6695598
Calculate \(\hat R^2\)
\[ \hat R^2 = 1 - (1 - R^2)\frac{N-1}{N-k-1} \]
To calculate adjusted \(R^2\) (denoted as \(\hat R^2\)), we need to know:
- \(R^2\)
- \(N\) (sample size)
- \(k\) (number of predictors)
Once we have obtained these values, we can substitute into the formula:
\[
\begin{align}
\hat R^2 &= 1 - (1 - R^2)\frac{N-1}{N-k-1} \\
\hat R^2 &= 1 - (1 - 0.6696)\frac{150-1}{150-2-1} \\
\hat R^2 &= 1 - (0.3304)\frac{149}{147} \\
\hat R^2 &= 1 - (0.3304 \cdot 1.013605) \\
\hat R^2 &= 0.6651049
\end{align}
\]
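The substitution above can be reproduced in R. Assuming the performance model and test_study2 data from above, a sketch (using the exact rather than the rounded \(R^2\)):

```r
r2 <- summary(performance)$r.squared  # exact R-squared
n  <- nrow(test_study2)               # sample size (150)
k  <- 2                               # number of predictors
1 - (1 - r2) * (n - 1) / (n - k - 1)
```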
We can compare this to the value at the bottom of our model summary:
summary(performance)
Call:
lm(formula = score ~ hours + motivation, data = test_study2)
Residuals:
Min 1Q Median 3Q Max
-12.9548 -2.8042 -0.2847 2.9344 13.8240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.86679 0.65473 10.488 <2e-16 ***
hours 1.37570 0.07989 17.220 <2e-16 ***
motivation 0.91634 0.38376 2.388 0.0182 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared: 0.6696, Adjusted R-squared: 0.6651
F-statistic: 148.9 on 2 and 147 DF, p-value: < 2.2e-16
If we wanted to save this specific value from the summary output, we can do the following:
sum_performance <- summary(performance)
sum_performance$adj.r.squared

[1] 0.665064