Be sure to check the solutions to last week’s exercises.
You can still ask any questions about previous weeks’ materials if things aren’t clear!
LEARNING OBJECTIVES
Read the riverview data from the previous lab into R, and fit a linear model to investigate how income varies with years of formal education.
We might ask ourselves if the model is useful. To quantify and assess model utility, we split the total variability of the response into two terms: the variability explained by the model plus the variability left unexplained in the residuals.
\[ \text{total variability in response = variability explained by model + unexplained variability in residuals} \]
Each term is quantified by a sum of squares:
\[ \begin{aligned} SS_{Total} &= SS_{Model} + SS_{Residual} \\ \sum_{i=1}^n (y_i - \bar y)^2 &= \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2 \end{aligned} \]
What is the proportion of the total variability in incomes explained by the linear relationship with education level?
Hint: The question asks to compute the value of \(R^2\).
To test if the model is useful — that is, if the explanatory variable is a useful predictor of the response — we test the following hypotheses:
\[ \begin{aligned} H_0 &: \text{the model is ineffective, } \beta_1 = 0 \\ H_1 &: \text{the model is effective, } \beta_1 \neq 0 \end{aligned} \]
The relevant test-statistic is the F-statistic:
\[ \begin{split} F = \frac{MS_{Model}}{MS_{Residual}} = \frac{SS_{Model} / 1}{SS_{Residual} / (n-2)} \end{split} \]
which compares the amount of variation in the response explained by the model to the amount of variation left unexplained in the residuals.
The sample F-statistic is compared to an F-distribution with \(df_{1} = 1\) and \(df_{2} = n - 2\) degrees of freedom.1
Perform a model utility test at the 5% significance level, by computing the F-statistic using its definition.
Look at the output of summary(mdl)
and anova(mdl)
.
For each output, identify the relevant information to conduct an F-test against the null hypothesis that the model is ineffective at predicting income using education level.
Consider the F value
output of anova(mdl)
and the t value
for education returned by summary(mdl)
F value = 51.452
t value = 7.173
Do you notice any relationship between the F-statistic for overall model utility and the t-statistic for \(H_0: \beta_1 = 0\)?
Compute the average education level and the average income in the sample.
Use the predict()
function to compute the predicted income for those with average education level.
What do you notice?
Let’s formalise the previous question using symbols. Consider the fitted model \(\hat{y} = \hat \beta_0 + \hat \beta_1 x\).
What is the predicted response for an individual having an explanatory variable at the average level \(\bar{x}\)?
Hint: Substitute the formula of \(\hat \beta_0\) into the equation of the fitted model.
Add to the riverview dataset two variables called z_education
and z_income
representing the standardized education and income variables, respectively.
Without using R, if you were to fit a linear regression model using the standardized response and standardized predictor, what would the intercept be?
Hint: Recall the formula for the \(z\)-score:
\[
z_x = \frac{x - \bar{x}}{s_x}, \qquad z_y = \frac{y - \bar{y}}{s_y}
\]
Using R, fit the regression model using the standardized response and explanatory variables.
What is the slope equal to?
Interpret the slope of the standardized variables.
\(SS_{Total}\) has \(n - 1\) degrees of freedom as one degree of freedom is lost in estimating the population mean with the sample mean \(\bar{y}\). \(SS_{Residual}\) has \(n - 2\) degrees of freedom. There are \(n\) residuals, but two degrees of freedom are lost in estimating the intercept and slope of the line used to obtain the \(\hat y_i\)s. Hence, by difference, \(SS_{Model}\) has \(n - 1 - (n - 2) = 1\) degree of freedom.↩︎