F-Ratio & Model Comparisons
Data Analysis for Psychology in R 2
In this document we are going to work through the hand calculations for:
- \(F\)-statistic
- Incremental \(F\)-test
Packages
For this example we will only need the tidyverse package.
library(tidyverse)
Data
test_study2 <- read_csv("https://uoepsy.github.io/data/test_study2.csv")
We will make our calculations for the model we have been using in class this week:
performance <- lm(score ~ hours + motivation, data = test_study2)
summary(performance)
Call:
lm(formula = score ~ hours + motivation, data = test_study2)
Residuals:
Min 1Q Median 3Q Max
-12.9548 -2.8042 -0.2847 2.9344 13.8240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.86679 0.65473 10.488 <2e-16 ***
hours 1.37570 0.07989 17.220 <2e-16 ***
motivation 0.91634 0.38376 2.388 0.0182 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared: 0.6696, Adjusted R-squared: 0.6651
F-statistic: 148.9 on 2 and 147 DF, p-value: < 2.2e-16
\(F\)-Statistic
\[ F = \frac{\frac{SS_{model}}{df_{model}}}{\frac{SS_{residual}}{df_{residual}}} = \frac{MS_{Model}}{MS_{Residual}} \]
Sums of Squares Residual
\[ SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]
ss_tab <- test_study2 |>
mutate(
y_pred = performance$fitted.values, # extract the predicted values from the model
pred_dev = round((score - y_pred),2), # subtract these from the observed values of score
pred_dev2 = round(pred_dev^2,2) # square these values, and round to 2dp
)
As we did previously, we can then sum this final column to give the sums of squares residual:
ss_resid <- sum(ss_tab$pred_dev2)
ss_resid
[1] 2826.83
Sums of Squares Model
\[ SS_{Model} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \]
ss_tab <- ss_tab |>
mutate(
mean_pred_dev = performance$fitted.values - mean(score), # extract the predicted values from the model and subtract mean
mean_pred_dev2 = round(mean_pred_dev^2,2) # square these values, and round to 2dp
)
ss_mod <- sum(ss_tab$mean_pred_dev2)
ss_mod
[1] 5728.81
DF Model
\[ df_\text{model} = k = 2 \]
DF Residual
\[ df_\text{residual} = n - k - 1= 150 - 2 - 1 = 147 \]
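Rather than counting by hand, both degrees of freedom can also be read straight off a fitted lm object. The sketch below uses the built-in mtcars data with two predictors (so \(k = 2\), as in our model), since the idea does not depend on the particular dataset:

```r
# fit a two-predictor model on built-in data (k = 2 predictors)
m <- lm(mpg ~ wt + hp, data = mtcars)

# df_model = k = number of estimated slopes
df_model <- length(coef(m)) - 1   # subtract 1 for the intercept

# df_residual = n - k - 1
df_residual <- df.residual(m)

df_model      # 2
df_residual   # 32 - 2 - 1 = 29
```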
Back to \(F\)-Statistic
\[ \begin{align} F = \frac{\frac{SS_{model}}{df_{model}}}{\frac{SS_{residual}}{df_{residual}}} = \frac{MS_{Model}}{MS_{Residual}} \\ \\ F = \frac{\frac{5728.81}{2}}{\frac{2826.83}{147}} = \frac{2864.40}{19.23} \\ \\ F = 148.95 \end{align} \]
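The same arithmetic can be run in R using the sums of squares we calculated above. Note that because those values were rounded to 2dp along the way, the result will differ very slightly from the 148.93 reported by summary():

```r
ss_mod   <- 5728.81   # hand-calculated SS_model (rounded to 2dp)
ss_resid <- 2826.83   # hand-calculated SS_residual (rounded to 2dp)

ms_mod   <- ss_mod / 2      # MS_model = SS_model / df_model
ms_resid <- ss_resid / 147  # MS_residual = SS_residual / df_residual

f_stat <- ms_mod / ms_resid
f_stat
```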
We can compare this to the value at the bottom of our model summary():
summary(performance)
Call:
lm(formula = score ~ hours + motivation, data = test_study2)
Residuals:
Min 1Q Median 3Q Max
-12.9548 -2.8042 -0.2847 2.9344 13.8240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.86679 0.65473 10.488 <2e-16 ***
hours 1.37570 0.07989 17.220 <2e-16 ***
motivation 0.91634 0.38376 2.388 0.0182 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared: 0.6696, Adjusted R-squared: 0.6651
F-statistic: 148.9 on 2 and 147 DF, p-value: < 2.2e-16
If we wanted to save this specific value from the summary output, we can do the following:
sum_performance <- summary(performance)
sum_performance$fstatistic
   value    numdf    dendf
148.9305   2.0000 147.0000
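If we also wanted the p-value the summary prints, we can compute it ourselves from the \(F\)-distribution with pf(), using the value and dfs extracted above (typed in here so the snippet stands on its own):

```r
# upper-tail probability of F(2, 147) at the observed value
pf(148.9305, df1 = 2, df2 = 147, lower.tail = FALSE)
```

This returns a value smaller than the 2.2e-16 floor that summary() reports as "< 2.2e-16".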
Model Comparisons
Research Question: Does motivation significantly predict score over and above the effects of study hours?
Step 1: Fit & Run Models
#model with just study hours
m1 <- lm(score ~ hours, data = test_study2)
summary(m1)
Call:
lm(formula = score ~ hours, data = test_study2)
Residuals:
Min 1Q Median 3Q Max
-13.4074 -2.8477 -0.4074 3.2950 13.3165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.78794 0.66420 10.22 <2e-16 ***
hours 1.36195 0.08094 16.83 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.455 on 148 degrees of freedom
Multiple R-squared: 0.6567, Adjusted R-squared: 0.6544
F-statistic: 283.2 on 1 and 148 DF, p-value: < 2.2e-16
#model with study hours and motivation
m2 <- lm(score ~ hours + motivation, data = test_study2)
summary(m2)
Call:
lm(formula = score ~ hours + motivation, data = test_study2)
Residuals:
Min 1Q Median 3Q Max
-12.9548 -2.8042 -0.2847 2.9344 13.8240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.86679 0.65473 10.488 <2e-16 ***
hours 1.37570 0.07989 17.220 <2e-16 ***
motivation 0.91634 0.38376 2.388 0.0182 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.386 on 147 degrees of freedom
Multiple R-squared: 0.6696, Adjusted R-squared: 0.6651
F-statistic: 148.9 on 2 and 147 DF, p-value: < 2.2e-16
Step 2: Statistically Compare Models
Using anova()
anova(m1, m2)
Analysis of Variance Table
Model 1: score ~ hours
Model 2: score ~ hours + motivation
Res.Df RSS Df Sum of Sq F Pr(>F)
1 148 2936.9
2 147 2827.3 1 109.66 5.7017 0.01822 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
By Hand (ADVANCED/OPTIONAL)
We can extract the information we need from the summary() output given the relationship between the residual sum of squares and residual standard error:
\[ \text{SS}_\text{Residual} = \text{Residual Standard Error}^2 \cdot df_\text{Residual} \]
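This identity is easy to verify on any fitted model; for a self-contained check we can use a model fit to the built-in mtcars data rather than our study data:

```r
m <- lm(mpg ~ wt, data = mtcars)
s <- summary(m)

# residual sum of squares computed directly
rss_direct <- sum(resid(m)^2)

# reconstructed from the residual standard error and residual df
rss_from_sigma <- s$sigma^2 * df.residual(m)

all.equal(rss_direct, rss_from_sigma)   # TRUE
```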
Therefore we can calculate our \(SS_\text{Residual}\) as follows:
Restricted Model
sum_m1 <- summary(m1)
sum_m1$sigma #the residual standard error, 4.4547
[1] 4.454672
sum_m1$df #the residual df is the 2nd element: 148
[1]   2 148   2
\[ \begin{align} SSR_R = (4.4547^2) \cdot 148 \\ \\ SSR_R = 2936.964 \\ \end{align} \]
Full Model
sum_m2 <- summary(m2)
sum_m2$sigma #the residual standard error, 4.3856
[1] 4.385557
sum_m2$df #the residual df is the 2nd element: 147
[1]   3 147   3
\[ \begin{align} SSR_F = (4.3856^2) \cdot 147 \\ \\ SSR_F = 2827.323 \\ \end{align} \]
Once we have obtained these values, we can substitute into the formula:
\[ \begin{align} F_{(df_R-df_F),df_F} = \frac{(SSR_R-SSR_F)/(df_R-df_F)}{SSR_F / df_F} \\ \\ F_{(148-147),147} = \frac{(2936.964-2827.323)/(148-147)}{2827.323 / 147} \\ \\ F_{1,147} = \frac{109.641/1}{ 19.2335} \\ \\ F_{1,147} = \frac{109.641}{19.2335} \\ \\ F_{1,147} = 5.70 \end{align} \]
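The substitution above is plain arithmetic, so we can let R do it with the values we just extracted (typed in here so the snippet runs on its own):

```r
ssr_r <- 2936.964  # SS_residual, restricted model (m1)
ssr_f <- 2827.323  # SS_residual, full model (m2)
df_r  <- 148       # residual df, restricted model
df_f  <- 147       # residual df, full model

f_inc <- ((ssr_r - ssr_f) / (df_r - df_f)) / (ssr_f / df_f)
round(f_inc, 2)   # matches the 5.70 above (anova() reports 5.7017)

# corresponding p-value from the F(1, 147) distribution
pf(f_inc, df1 = 1, df2 = 147, lower.tail = FALSE)   # approx .018
```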
Step 3: Write Up
Motivation was found to explain a significant amount of variance in scores over and above the effects of study hours alone, \(F(1, 147) = 5.70, p < .05\).