This lab is a long one! We did cover most of this content to varying degrees in the lectures over the past two weeks.
A note on notation
You will see a variety of different ways of specifying the linear model form in different resources.
Some use \(\beta\), some use \(b\).
In the lectures, you have seen the form \(\color{red}{y} = \color{blue}{b_0 \cdot{} 1 + b_1 \cdot{} x} + \epsilon\).
In the exercises below, we will tend to use \(\color{red}{Y} = \color{blue}{\beta_0 \cdot{} 1 + \beta_1 \cdot{} X} + \epsilon\), to denote our population model, which when fitted on some sample data becomes \(\color{red}{\hat{Y}} = \color{blue}{\hat{\beta}_0 \cdot{} 1 + \hat{\beta}_1 \cdot{} X} + \hat{\epsilon}\) (the little hats indicate that they are fitted estimates).
It’s important to take regular breaks. This will help (a bit) with not getting overwhelmed. If you were to sit down and read through this all in one sitting, it would be completely understandable that you might end up shouting at your computer.
These things (both the statistics side and the programming side) take time - and repeated practice - to sink in.
Let’s imagine a study into income disparity for workers in a local authority. We might carry out interviews and find that there is a link between the level of education and an employee’s income. Those with more formal education seem to be better paid. Now we wouldn’t have time to interview everyone who works for the local authority so we would have to interview a sample, say 10%.
In this lab we will use the riverview data (available at https://uoepsy.github.io/data/riverview.csv) to examine whether education level is related to income among the employees working for the city of Riverview, a hypothetical midwestern city in the US.
Load the required libraries and import the riverview data into a variable named riverview.
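One way to do this (a sketch, assuming you are working with the tidyverse and reading the data straight from the URL above):

library(tidyverse)
# read the data from the web into a variable named "riverview"
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")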
We first want to visualise and describe the marginal distributions (the distribution of each variable without reference to the values of the other variables) of employee incomes and education levels.
Hint: use geom_density() for a density plot, or geom_histogram() for a histogram.
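For instance, a sketch of one plot of each kind (this assumes the dataset has columns named education and income; the binwidth value is an arbitrary choice):

# marginal distribution of income, as a density plot
ggplot(data = riverview, aes(x = income)) +
  geom_density() +
  labs(x = "Income", title = "Marginal distribution of income")

# marginal distribution of education level, as a histogram
ggplot(data = riverview, aes(x = education)) +
  geom_histogram(binwidth = 1, colour = "white") +
  labs(x = "Education (years)", title = "Marginal distribution of education")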
After examining the marginal distributions of the variables of interest in the analysis, we typically move on to examining relationships between the variables.
Visualise and describe the relationship between income and level of education among the employees in the sample. Think about the direction, form, and strength of the relationship.
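A sketch of one possible plot (again assuming the columns are named education and income):

ggplot(data = riverview, aes(x = education, y = income)) +
  geom_point() +
  labs(x = "Education (years)", y = "Income")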
The scatterplot highlights a linear relationship, where the data points are scattered around an underlying linear pattern with a roughly-constant spread as x varies.
Hence, we will try to fit a simple (one \(x\) variable only) linear regression model:
\[ Income = \beta_0 + \beta_1 \ Education + \epsilon \quad \\ \text{where} \quad \epsilon \sim N(0, \sigma) \text{ independently} \]
where “\(\epsilon \sim N(0, \sigma) \text{ independently}\)” means that the errors around the line have mean zero and constant spread as x varies.
Fit the linear model to the sample data using the lm() function and name the output mdl.
Hint: The syntax of the lm() function is:
lm(<response variable> ~ 1 + <explanatory variable>, data = <dataframe>)
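Applied to our data, the call would look something like this (assuming the response and explanatory columns are named income and education):

mdl <- lm(income ~ 1 + education, data = riverview)
mdl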
Interpret the estimated intercept and slope in the context of the question of interest.
To obtain the estimated regression coefficients you can either:
- type mdl, i.e. simply invoke the name of the fitted model;
- use mdl$coefficients;
- use the coef(mdl) function;
- use the coefficients(mdl) function;
- use the summary(mdl) function and look under the "Estimate" column.

The estimated parameters returned by the above methods are all equivalent. However, summary() returns more information and you need to look under the column "Estimate".
The parameter estimates from our simple linear regression model take the form of a line, representing the systematic part of our model \(\beta_0 + \beta_1 x\), which in our case is \(11.32 + 2.65 \ Education\). Deviations from the line are determined by the random error component \(\hat \epsilon\) (the red lines in Figure 4 below).
Consider the following:
We can obtain the estimated standard deviation of the errors (\(\hat \sigma\)) from the fitted model using sigma(mdl).
What does this tell us?
To quantify the amount of uncertainty in each estimated coefficient that is due to sampling variability, we use the standard error (SE) of the coefficient. Recall that a standard error gives a numerical answer to the question of how variable a statistic will be because of random sampling.
The standard errors are found in the column "Std. Error" of the summary() of a model:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.321379 6.1232350 1.848921 7.434602e-02
## education 2.651297 0.3696232 7.172972 5.562116e-08
In this example the slope, 2.651, has a standard error of 0.37. One way to envision this is as a distribution. Our best guess (mean) for the slope parameter is 2.651. The standard deviation of this distribution is 0.37, which indicates the precision (uncertainty) of our estimate.
We can perform a test against the null hypothesis that the true parameter value is zero. The reference distribution in this case is a t-distribution with \(n-2\) degrees of freedom, where \(n\) is the sample size, and our test statistic is:
\[
t = \frac{\hat \beta_1 - 0}{SE(\hat \beta_1)}
\]
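To make the formula concrete, here is how the test statistic and p-value could be computed by hand from the coefficient table (a sketch; summary(mdl) reports these for you):

# slope estimate and its standard error from the coefficient table
beta1_hat <- coef(summary(mdl))["education", "Estimate"]
se_beta1  <- coef(summary(mdl))["education", "Std. Error"]

# t statistic for the null hypothesis that the slope is zero
tstat <- (beta1_hat - 0) / se_beta1

# two-sided p-value from a t-distribution with n - 2 degrees of freedom
n <- nrow(riverview)
2 * pt(abs(tstat), df = n - 2, lower.tail = FALSE)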
Test the hypothesis that the population slope is zero — that is, that there is no linear association between income and education level in the population.
(Hint: you can find all the necessary information in summary(mdl).)
To compute the model-predicted values for the data in the sample, you can use any of:
- predict(<fitted model>)
- fitted(<fitted model>)
- fitted.values(<fitted model>)
- mdl$fitted.values

For example:
predict(mdl)
## 1 2 3 4 5 6 7 8
## 32.53175 32.53175 37.83435 37.83435 37.83435 43.13694 43.13694 43.13694
## 9 10 11 12 13 14 15 16
## 43.13694 48.43953 48.43953 48.43953 51.09083 53.74212 53.74212 53.74212
## 17 18 19 20 21 22 23 24
## 53.74212 53.74212 56.39342 59.04472 59.04472 61.69601 61.69601 64.34731
## 25 26 27 28 29 30 31 32
## 64.34731 64.34731 64.34731 66.99861 66.99861 69.64990 69.64990 74.95250
To compute model-predicted values for other data:
predict(<fitted model>, newdata = <dataframe>)
# make a tibble/dataframe with values for the predictor:
education_query <- tibble(education = c(11, 18))
# model predicted values of income, for the values of education
# inside the "education_query" data
predict(mdl, newdata = education_query)
## 1 2
## 40.48564 59.04472
Compute the model-predicted income for someone with 1 year of education.
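Following the same pattern as the code above (a sketch):

# model-predicted income for someone with 1 year of education
predict(mdl, newdata = tibble(education = 1))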
We might ask ourselves if the model is useful. To quantify and assess model utility, we split the total variability of the response into two terms: the variability explained by the model plus the variability left unexplained in the residuals.
\[ \text{total variability in response = variability explained by model + unexplained variability in residuals} \]
Each term is quantified by a sum of squares:
\[ \begin{aligned} SS_{Total} &= SS_{Model} + SS_{Residual} \\ \sum_{i=1}^n (y_i - \bar y)^2 &= \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2 \\ \quad \\ \text{Where:} \\ y_i = \text{observed value} \\ \bar{y} = \text{mean} \\ \hat{y}_i = \text{model predicted value} \\ \end{aligned} \]
The R-squared coefficient is defined as the proportion of the total variability in the outcome variable which is explained by our model:
\[
R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}
\]
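To make these quantities concrete, here is a sketch of how the sums of squares and \(R^2\) could be computed by hand (assuming the outcome column is named income; summary(mdl) also reports \(R^2\) directly as "Multiple R-squared"):

# observed values, model-predicted values, and the sample mean
y     <- riverview$income
y_hat <- fitted(mdl)
y_bar <- mean(y)

ss_total    <- sum((y - y_bar)^2)       # total variability
ss_model    <- sum((y_hat - y_bar)^2)   # variability explained by the model
ss_residual <- sum((y - y_hat)^2)       # unexplained variability

ss_model / ss_total          # R-squared
1 - ss_residual / ss_total   # equivalent
summary(mdl)$r.squared       # as reported by summary()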
What is the proportion of the total variability in incomes explained by the linear relationship with education level?
Hint: The question asks to compute the value of \(R^2\), but you might be able to find it already computed somewhere.
To test if the model is useful — that is, if the explanatory variable is a useful predictor of the response — we test the following hypotheses:
\[ \begin{aligned} H_0 &: \text{the model is ineffective, } \beta_1 = 0 \\ H_1 &: \text{the model is effective, } \beta_1 \neq 0 \end{aligned} \] The relevant test-statistic is the F-statistic:
\[ \begin{split} F = \frac{MS_{Model}}{MS_{Residual}} = \frac{SS_{Model} / 1}{SS_{Residual} / (n-2)} \end{split} \]
which compares the amount of variation in the response explained by the model to the amount of variation left unexplained in the residuals.
The sample F-statistic is compared to an F-distribution with \(df_{1} = 1\) and \(df_{2} = n - 2\) degrees of freedom.
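The relevant quantities can also be pulled directly out of the fitted model, for example (a sketch):

# F statistic with its numerator and denominator degrees of freedom
fstat <- summary(mdl)$fstatistic
fstat

# corresponding p-value from the F distribution
pf(fstat["value"], df1 = fstat["numdf"], df2 = fstat["dendf"], lower.tail = FALSE)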
Look at the output of summary(mdl). Identify the relevant information to conduct an F-test against the null hypothesis that the model is ineffective at predicting income using education level.
As we are about to move on to multiple regression, why not go and make a cup of tea/coffee and go for a walk, listen to some music. Anything but thinking about statistics for at least 20 minutes!
In this next block of exercises, we move from the simple linear regression model (one outcome variable, one explanatory variable) to the multiple regression model (one outcome variable, multiple explanatory variables).
Everything we just learned about simple linear regression can be extended (with minor modification) to the multiple regression model. The key conceptual difference is that for simple linear regression we think of the distribution of errors at some fixed value of the explanatory variable, and for multiple linear regression, we think about the distribution of errors at a fixed set of values for all our explanatory variables.
Research question
Researchers are interested in the relationship between psychological wellbeing and time spent outdoors.
The researchers know that other aspects of people's lifestyles, such as how much social interaction they have, can influence their mental wellbeing. They would like to study whether there is a relationship between wellbeing and time spent outdoors after taking into account the relationship between wellbeing and social interactions.
Create a new section heading in your RMarkdown document for the multiple regression exercises.
Import the wellbeing data into R. Assign them to an object called mwdata.
Produce plots of the marginal distributions (the distributions of each variable in the analysis without reference to the other variables) of the wellbeing, outdoor_time, and social_int variables.
Produce plots of the marginal relationships between the outcome variable (wellbeing) and each of the explanatory variables.
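For example (a sketch; the axis labels are assumptions about the units):

ggplot(data = mwdata, aes(x = outdoor_time, y = wellbeing)) +
  geom_point() +
  labs(x = "Outdoor time (hours per week)", y = "Wellbeing score")

ggplot(data = mwdata, aes(x = social_int, y = wellbeing)) +
  geom_point() +
  labs(x = "Social interactions (per week)", y = "Wellbeing score")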
Produce a correlation matrix of the variables which are to be used in the analysis, and write a short paragraph describing the relationships.
Correlation matrix
A table showing the correlation coefficients - \(r_{(x,y)}=\frac{\mathrm{cov}(x,y)}{s_xs_y}\) - between variables. Each cell in the table shows the relationship between two variables. The diagonals show the correlation of a variable with itself (and are therefore always equal to 1).
Hint: We can create a correlation matrix easily by giving the cor() function a dataframe. However, we only want to give it 3 columns here. Think about how we select specific columns, either using select(), or giving the column numbers inside [].
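One way to do this (a sketch):

# correlation matrix of the three variables used in the analysis
mwdata %>%
  select(wellbeing, outdoor_time, social_int) %>%
  cor()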
Model formula
For multiple linear regression, the model formula is an extension of the one predictor (“simple”) regression model, to include any number of predictors:
\[
y = \beta_0 \ + \ \beta_1 x_1 \ + \ \beta_2 x_2 \ + \ ... \ + \beta_k x_k \ + \ \epsilon \\
\quad \\
\text{where} \quad \epsilon \sim N(0, \sigma) \text{ independently}
\]
In the model specified above, \(y\) denotes the outcome (response) variable, \(x_1, \dots, x_k\) denote the \(k\) explanatory variables, \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_k\) are the coefficients for each explanatory variable, and \(\epsilon\) is the error term.
Visual
Note that for simple linear regression we talked about our model as a line in 2 dimensions: the systematic part \(\beta_0 + \beta_1 x\) defined a line for \(\mu_y\) across the possible values of \(x\), with \(\epsilon\) as the random deviations from that line. But in multiple regression we have more than two variables making up our model.
In this particular case of three variables (one outcome + two explanatory), we can think of our model as a regression surface (See Figure 8). The systematic part of our model defines the surface across a range of possible values of both \(x_1\) and \(x_2\). Deviations from the surface are determined by the random error component, \(\hat \epsilon\).
Don’t worry about trying to figure out how to visualise it if we had any more explanatory variables! We can only conceive of 3 spatial dimensions. One could imagine this surface changing over time, which would bring in a 4th dimension, but beyond that, it’s not worth trying!
The scatterplots we created in an earlier exercise show moderate, positive, and linear relationships both between outdoor time and wellbeing, and between numbers of social interactions and wellbeing.
In R, using lm(), fit the linear model specified by the formula below, assigning the output to an object called wb_mdl1.
\[ Wellbeing = \beta_0 \ + \ \beta_1 \cdot Outdoor Time \ + \ \beta_2 \cdot Social Interactions \ + \ \epsilon \]
Tip:
As we did for simple linear regression, we can fit our multiple regression model using the lm() function. We can add as many explanatory variables as we like, separating them with a +.
lm( <response variable> ~ 1 + <explanatory variable 1> + <explanatory variable 2> + ... , data = <dataframe>)
Interpretation of Multiple Regression Coefficients
The parameters of a multiple regression model are the intercept \(\beta_0\), a slope coefficient \(\beta_j\) for each explanatory variable \(x_j\), and the standard deviation of the errors, \(\sigma\).
You’ll hear a lot of different ways that people explain multiple regression coefficients.
For the model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\), the estimate \(\hat \beta_1\) will often be reported as:
the increase in \(y\) for a one unit increase in \(x_1\) when…
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3703775 4.3205141 1.242995 2.238259e-01
## outdoor_time 0.5923673 0.1689445 3.506284 1.499467e-03
## social_int 1.8034489 0.2690982 6.701825 2.369845e-07
The coefficient 0.59 of weekly outdoor time for predicting wellbeing score says that among those with the same number of social interactions per week, those who have one additional hour of outdoor time tend to, on average, score 0.59 higher on the WEMWBS wellbeing scale. The multiple regression coefficient measures that average conditional relationship.
Just like the simple linear regression, when we estimate parameters from the available data, we have:
- estimated coefficients \(\hat \beta_0, \hat \beta_1, \hat \beta_2\) (summary(), coef(), $coefficients etc. will be useful here)
- an estimated standard deviation of the errors, \(\hat \sigma\) (sigma() or part of the output from summary() will help you for this)
Obtain 95% confidence intervals for the regression coefficients, and write a sentence describing each.
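The confint() function computes these from a fitted model, for example:

# 95% confidence intervals for the coefficients of wb_mdl1
confint(wb_mdl1, level = 0.95)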
So far, we have been fitting and interpreting our regression models. In each case, we first specified the model, then visually explored the marginal distributions and relationships of variables which would be used in the analysis. Then, once we fitted the model, we began to examine the fit by studying what the various parameter estimates represented, and the spread of the residuals. We saw these in the output of summary() of a model - they were shown in the parts of the output inside the red boxes in Figure 9.
IMPORTANT!
It may help to think of the sequence of steps involved in statistical modeling as:
\[
\text{Choose} \rightarrow \text{Fit} \rightarrow \text{Assess} \rightarrow \text{Use}
\]
The assumptions of the linear model can be committed to memory using the LINE mnemonic: Linearity, Independence, Normality, and Equal variance.
When we fit a model, we evaluate many of these assumptions by looking at the residuals (the deviations of the observed values \(y_i\) from the model estimated values \(\hat y_i\)).
The residuals, \(\hat \epsilon\), are our estimates of the actual unknown true error term \(\epsilon\). These assumptions hold both for a regression model with a single predictor and for one with multiple predictors.
Create a new section heading for “Assumptions”.
Recall the form of the model which we fitted and stored as wb_mdl1:
\[ \text{Wellbeing} = \beta_0 + \beta_1 \cdot \text{Outdoor Time} + \beta_2 \cdot \text{Social Interactions} + \epsilon \]
Which we fitted in R using:
wb_mdl1 <- lm(wellbeing ~ outdoor_time + social_int, data = mwdata)
Note: We have forgone writing the 1 in lm(y ~ 1 + x...). The 1 just tells R that we want to estimate the intercept, and it will do this by default even if we leave it out.
In simple linear regression with only one explanatory variable, we can assess linearity through a simple scatterplot of the outcome variable against the explanatory variable. In multiple regression, however, we need to rely on diagnostic plots of the model residuals, because we need to know whether the relationship between the outcome and each predictor is linear after accounting for the other predictors in the model.
In order to assess this, we use partial-residual plots (also known as ‘component-residual plots’). This is a plot with each explanatory variable \(x_j\) on the x-axis, and partial residuals on the y-axis.
Partial residuals for a predictor \(x_j\) are calculated as: \[ \hat \epsilon + \hat \beta_j x_j \]
In R we can easily create these plots for all predictors in the model by using the crPlots() function from the car package.
Create partial-residual plots for the wb_mdl1 model.
Remember to load the car package first. If it does not load correctly, it might mean that you need to install it.
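For example:

library(car)
# partial-residual (component + residual) plots for each predictor in wb_mdl1
crPlots(wb_mdl1)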
Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to the plots.
The equal variances assumption is that the error variance \(\sigma^2\) is constant across values of the predictors \(x_1\), … \(x_k\), and across values of the fitted values \(\hat y\). This sometimes gets termed "constant" vs "non-constant" variance. Figures 11 and 12 show what these look like visually.
In R we can create plots of the Pearson residuals against the predicted values \(\hat y\) and against the predictors \(x_1\), … \(x_k\) by using the residualPlots() function from the car package. This function also provides the results of a lack-of-fit test for each of these relationships (when it is against the fitted values \(\hat y\) this gets called "Tukey's test").
The ncvTest(model) function (also from the car package) performs a test against the alternative hypothesis that the error variance changes with the level of the fitted value (also known as the "Breusch-Pagan test"). \(p >.05\) indicates that we do not have evidence that the assumption has been violated.
Use residualPlots() to plot residuals against each predictor, and use ncvTest() to perform a test against the alternative hypothesis that the error variance changes with the level of the fitted value.
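For example:

# residuals plotted against each predictor and against the fitted values,
# with the accompanying lack-of-fit tests
residualPlots(wb_mdl1)

# test against the alternative hypothesis of non-constant error variance
ncvTest(wb_mdl1)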
Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to plots and/or formal tests where available.
Create the “residuals vs. fitted plot” - a scatterplot with the residuals \(\hat \epsilon\) on the y-axis and the fitted values \(\hat y\) on the x-axis.
You can either do this:
- manually, using the residuals() and fitted() functions, or
- by giving your model to the plot() function. This will actually give you lots of plots, so we can specify which plot we want to return - e.g., plot(wb_mdl1, which = 1)
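A sketch of the manual approach:

# residuals vs fitted values, constructed by hand
plot(x = fitted(wb_mdl1), y = residuals(wb_mdl1),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # horizontal reference line at zero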
You can use this plot to visually assess whether the residuals are scattered evenly around zero across the range of fitted values (equal variance), and whether any systematic pattern remains (which would suggest non-linearity).
The “independence of errors” assumption is the condition that the errors do not have some underlying relationship which is causing them to influence one another.
There are many sources of possible dependence, and often these are issues of study design. For example, we may have groups of observations in our data which we would expect to be related (e.g., multiple trials from the same participant). Our modelling strategy would need to take this into account.
One form of dependence is autocorrelation - this is when observations influence those adjacent to them. It is common in data for which time is a variable of interest (e.g., the humidity today is dependent upon the rainfall yesterday).
In R we can test against the alternative hypothesis that there is autocorrelation in our errors using the durbinWatsonTest() function (an abbreviated form, dwt(), is also available) from the car package.
Perform a test against the alternative hypothesis that there is autocorrelation in the error terms.
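For example:

# Durbin-Watson test for autocorrelation in the residuals of wb_mdl1
durbinWatsonTest(wb_mdl1)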
Write a sentence summarising whether or not you consider the assumption of independence to have been met (you may have to assume certain aspects of the study design).
The normality assumption is the condition that the errors \(\epsilon\) are normally distributed.
We can visually assess this condition through histograms, density plots, and quantile-quantile plots (QQplots) of our residuals \(\hat \epsilon\).
We can also perform a Shapiro-Wilk test against the alternative hypothesis that the residuals were not sampled from a normally distributed population, using the shapiro.test() function in R.
Assess the normality assumption by producing a qqplot of the residuals (either manually or using plot(model, which = ???)), and conducting a Shapiro-Wilk test.
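For example, a QQ-plot constructed by hand, alongside the Shapiro-Wilk test (a sketch):

# QQ-plot of the residuals
qqnorm(residuals(wb_mdl1))
qqline(residuals(wb_mdl1))

# Shapiro-Wilk test of the residuals
shapiro.test(residuals(wb_mdl1))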
Write a sentence summarising whether or not you consider the assumption to have been met. Justify your answer with reference to plots and/or formal tests where available.
This interpretation falls down if predictors are highly correlated: if, for example, predictors \(x_1\) and \(x_2\) are highly correlated, then changing the value of \(x_1\) necessarily entails a change in the value of \(x_2\), meaning that it no longer makes sense to talk about holding \(x_2\) constant.
We can assess multicollinearity using the variance inflation factor (VIF), which for a given predictor \(x_j\) is calculated as:
\[
VIF_j = \frac{1}{1-R_j^2} \\
\]
Where \(R_j^2\) is the coefficient of determination (the R-squared) resulting from a regression of \(x_j\) on all the other predictors in the model.
The more highly correlated \(x_j\) is with other predictors, the bigger \(R_j^2\) becomes, and thus the bigger \(VIF_j\) becomes.
The square root of VIF indicates how much the SE of the coefficient has been inflated due to multicollinearity. For example, if the VIF of a predictor variable were 4.6 (\(\sqrt{4.6} = 2.1\)), then the standard error of the coefficient of that predictor is 2.1 times larger than if the predictor had zero correlation with the other predictor variables. Suggested cut-offs for VIF are varied. Some suggest 10, others 5. Define what you will consider an acceptable value prior to calculating it.
In R, the vif() function from the car package will provide VIF values for each predictor in your model.
Calculate the variance inflation factor (VIF) for the predictors in the model.
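For example:

# variance inflation factors for the predictors in wb_mdl1
vif(wb_mdl1)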
Write a sentence summarising whether or not you consider multicollinearity to be a problem here.
We have seen in the case of the simple linear regression that individual cases in our data can influence our model more than others. We know about:
- Regression outliers: cases with a very large residual. We can assess these via the standardised residuals (the rstandard() function will give you these) or the studentised residuals (the rstudent() function will give you these). Values \(>|2|\) (greater in magnitude than two) are considered potential outliers.
- High leverage cases: cases with an unusual value (or combination of values) of the predictor(s). The hat values quantify leverage, and the hatvalues() function will retrieve these.
- High influence cases: cases which strongly influence the model estimates, commonly quantified by Cook's Distance. The cooks.distance() function will provide these.

Create a new tibble which contains the data used to fit the model, along with the studentised residuals, hat values, and Cook's Distance for each observation. (Hint: what does wb_mdl1$model give you?)
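A sketch of one way to construct such a tibble (the column names chosen here are arbitrary):

mdl_diagnostics <-
  as_tibble(wb_mdl1$model) %>%   # the data used to fit the model
  mutate(
    studentised_resid = rstudent(wb_mdl1),      # studentised residuals
    hat_value         = hatvalues(wb_mdl1),     # hat values (leverage)
    cooks_d           = cooks.distance(wb_mdl1) # Cook's Distance
  )
mdl_diagnostics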
Looking at the studentised residuals, are there any outliers?
Looking at the hat values, are there any observations with high leverage?
Looking at the Cook’s Distance values, are there any highly influential points?
(You can also display these graphically using plot(model, which = 4) and plot(model, which = 5).)
Alongside Cook’s Distance, we can examine the extent to which model estimates and predictions are affected when an entire case is dropped from the dataset and the model is refitted.
Use the function influence.measures() to extract these delete-1 measures of influence.
Try plotting the distributions of some of these measures.
Tip: the function influence.measures() returns an infl-type object. To plot this, we need to find a way to extract the actual numbers from it.
What do you think names(influence.measures(wb_mdl1)) shows you? How can we use influence.measures(wb_mdl1)$<insert name here> to extract the matrix of numbers?
\(SS_{Total}\) has \(n - 1\) degrees of freedom as one degree of freedom is lost in estimating the population mean with the sample mean \(\bar{y}\). \(SS_{Residual}\) has \(n - 2\) degrees of freedom. There are \(n\) residuals, but two degrees of freedom are lost in estimating the intercept and slope of the line used to obtain the \(\hat y_i\)s. Hence, by difference, \(SS_{Model}\) has \(n - 1 - (n - 2) = 1\) degree of freedom.