  1. Understand the calculation and interpretation of the coefficient of determination.
  2. Understand the calculation and interpretation of the F-test of model utility.
  3. Understand how to standardize model coefficients and when this is appropriate to do.
  4. Understand the relationship between the correlation coefficient and the regression slope.

Data recap

Question 1

Read the riverview data from the previous lab into R, and fit a linear model (referred to as mdl in the questions below) to investigate how income varies with years of formal education.


Partitioning variation

We might ask ourselves if the model is useful. To quantify and assess model utility, we split the total variability of the response into two terms: the variability explained by the model plus the variability left unexplained in the residuals.

\[ \text{total variability in response} = \text{variability explained by model} + \text{unexplained variability in residuals} \]

Each term is quantified by a sum of squares:

\[ \begin{aligned} SS_{Total} &= SS_{Model} + SS_{Residual} \\ \sum_{i=1}^n (y_i - \bar y)^2 &= \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2 \end{aligned} \]
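
The decomposition above can be checked numerically. The following is a minimal sketch on simulated data, not the riverview data: the data frame `toy`, its made-up values, and the model name `toy_mdl` are all assumptions introduced purely for illustration.

```r
# Simulated stand-in data (hypothetical values, NOT the riverview dataset)
set.seed(1)
toy <- data.frame(education = round(runif(30, 8, 24)))
toy$income <- 10 + 2.5 * toy$education + rnorm(30, sd = 8)

# Simple linear regression of income on education
toy_mdl <- lm(income ~ education, data = toy)

y     <- toy$income       # observed responses
y_hat <- fitted(toy_mdl)  # fitted values from the model

# The three sums of squares from the formula above
ss_total    <- sum((y - mean(y))^2)
ss_model    <- sum((y_hat - mean(y))^2)
ss_residual <- sum((y - y_hat)^2)

c(total = ss_total, model = ss_model, residual = ss_residual)

# The partition should hold up to floating-point tolerance
all.equal(ss_total, ss_model + ss_residual)
```

The identity holds for any least-squares line fitted with an intercept, so the same steps should carry over to a model fitted to real data.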

Question 2

What is the proportion of the total variability in incomes explained by the linear relationship with education level?

Hint: The question asks to compute the value of \(R^2\).
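
For reference, the standard definition of the coefficient of determination, written with the sums of squares introduced above, is:

\[ R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}} \]

The two forms agree because \(SS_{Total} = SS_{Model} + SS_{Residual}\).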


Model utility test

To test if the model is useful — that is, if the explanatory variable is a useful predictor of the response — we test the following hypotheses:

\[ \begin{aligned} H_0 &: \text{the model is ineffective, } \beta_1 = 0 \\ H_1 &: \text{the model is effective, } \beta_1 \neq 0 \end{aligned} \]

The relevant test-statistic is the F-statistic:

\[ \begin{split} F = \frac{MS_{Model}}{MS_{Residual}} = \frac{SS_{Model} / 1}{SS_{Residual} / (n-2)} \end{split} \]

which compares the amount of variation in the response explained by the model to the amount of variation left unexplained in the residuals.

The sample F-statistic is compared to an F-distribution with \(df_{1} = 1\) and \(df_{2} = n - 2\) degrees of freedom (see footnote 1).
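
As a rough illustration of how the F-statistic and its p-value can be computed from the definition, here is a sketch on simulated data. The objects `toy`, `toy_mdl`, `f_obs`, `f_crit` and `p_value` are hypothetical names chosen for this example, and the data are made up rather than taken from the lab.

```r
# Simulated stand-in data (hypothetical, NOT the riverview dataset)
set.seed(2)
n <- 40
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 0.5 * toy$x + rnorm(n)

toy_mdl <- lm(y ~ x, data = toy)

# Sums of squares for the model and the residuals
ss_model    <- sum((fitted(toy_mdl) - mean(toy$y))^2)
ss_residual <- sum(residuals(toy_mdl)^2)

# F-statistic from its definition: MS_Model / MS_Residual
f_obs <- (ss_model / 1) / (ss_residual / (n - 2))

# Critical value at the 5% level and the upper-tail p-value
f_crit  <- qf(0.95, df1 = 1, df2 = n - 2)
p_value <- pf(f_obs, df1 = 1, df2 = n - 2, lower.tail = FALSE)

c(F = f_obs, critical = f_crit, p = p_value)
```

The null hypothesis would be rejected at the 5% level when `f_obs` exceeds `f_crit`, or equivalently when `p_value` is below 0.05.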

Question 3

Perform a model utility test at the 5% significance level, by computing the F-statistic using its definition.


Question 4

Look at the output of summary(mdl) and anova(mdl).

For each output, identify the relevant information to conduct an F-test against the null hypothesis that the model is ineffective at predicting income using education level.


Question 5

Consider the F value output of anova(mdl) and the t value for education returned by summary(mdl):

F value = 51.452
t value = 7.173

Do you notice any relationship between the F-statistic for overall model utility and the t-statistic for \(H_0: \beta_1 = 0\)?


Back to regression coefficients

Question 6

Compute the average education level and the average income in the sample.

Use the predict() function to compute the predicted income for those with average education level.

What do you notice?


Question 7

Let’s formalise the previous question using symbols. Consider the fitted model \(\hat{y} = \hat \beta_0 + \hat \beta_1 x\).

What is the predicted response for an individual having an explanatory variable at the average level \(\bar{x}\)?

Hint: Substitute the formula of \(\hat \beta_0\) into the equation of the fitted model.
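
For reference, the least-squares estimate of the intercept in simple linear regression, which the hint refers to, is:

\[ \hat \beta_0 = \bar{y} - \hat \beta_1 \bar{x} \]

Here \(\bar{x}\) and \(\bar{y}\) are the sample means of the explanatory and response variables.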



Question 8

Add to the riverview dataset two variables called z_education and z_income representing the standardized education and income variables, respectively.

Without using R, if you were to fit a linear regression model using the standardized response and standardized predictor, what would the intercept be?

Hint: Recall the formula for the \(z\)-score: \[ z_x = \frac{x - \bar{x}}{s_x}, \qquad z_y = \frac{y - \bar{y}}{s_y} \]
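
The \(z\)-score formula translates directly into R. The sketch below uses a simulated variable, not the riverview data; `toy`, `x`, `z_x` and `z_alt` are hypothetical names used only for illustration.

```r
# Simulated stand-in variable (hypothetical, NOT the riverview dataset)
set.seed(3)
toy <- data.frame(x = rnorm(25, mean = 16, sd = 4))

# z-score computed directly from the formula: (x - mean) / sd
toy$z_x <- (toy$x - mean(toy$x)) / sd(toy$x)

# scale() centres and scales in one step; it returns a one-column
# matrix, so as.numeric() turns the result back into a plain vector
z_alt <- as.numeric(scale(toy$x))

# Both approaches should agree up to floating-point tolerance
all.equal(toy$z_x, z_alt)

# Mean and standard deviation of the standardized variable
c(mean = mean(toy$z_x), sd = sd(toy$z_x))
```

Writing the formula out by hand keeps each step visible, whereas `scale()` is more compact when several columns need standardizing.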


Question 9

Using R, fit the regression model using the standardized response and explanatory variables.

What is the slope equal to?


Question 10

Interpret the slope of the model fitted to the standardized variables.



  1. \(SS_{Total}\) has \(n - 1\) degrees of freedom as one degree of freedom is lost in estimating the population mean with the sample mean \(\bar{y}\). \(SS_{Residual}\) has \(n - 2\) degrees of freedom. There are \(n\) residuals, but two degrees of freedom are lost in estimating the intercept and slope of the line used to obtain the \(\hat y_i\)s. Hence, by difference, \(SS_{Model}\) has \(n - 1 - (n - 2) = 1\) degree of freedom.