Week 7 Exercises: Linear Regression

Income and Education

Research Question: is there an association between the level of education and an employee’s income?

Data: riverview.csv

Our data for these exercises is from an hypothetical study into income disparity for employees in a local authority. The riverview data, which come from Lewis-Beck, 2015, contain five attributes collected from a random sample of \(n=32\) employees working for the city of Riverview, a hypothetical midwestern city in the US. The attributes include:

  • education: Years of formal education
  • income: Annual income (in thousands of U.S. dollars)
  • seniority: Years of seniority
  • gender: Employee’s gender
  • male: Dummy coded gender variable (0 = Female, 1 = Male)
  • party: Political party affiliation

The first six rows of the data are:

education income seniority gender male party
8 37.449 7 male 1 Democrat
8 26.430 9 female 0 Independent
10 47.034 14 male 1 Democrat
10 34.182 16 female 0 Independent
10 25.479 1 female 0 Republican
12 46.488 11 female 0 Democrat

The data is available at https://uoepsy.github.io/data/riverview.csv.

Question 1

Load the required libraries (probably just tidyverse for now), and read in the riverview data to your R session.

Let us first visualise and describe the marginal distributions of those variables which are of interest to us. These are the distribution of each variable (employee incomes and education levels) without reference to the values of the other variables.

Hints:

  • You could use, for example, geom_density() for a density plot or geom_histogram() for a histogram.
  • Look at the shape, center and spread of the distribution. Is it symmetric or skewed? Is it unimodal or bimodal?
  • Do you notice any extreme observations?

Solution

Question 2

After we’ve looked at the marginal distributions of the variables of interest in the analysis, we typically move on to examining relationships between the variables.

Visualise and describe the relationship between income and level of education among the employees in the sample.

Hints:
Think about:

  • Direction of association
  • Form of association (can it be summarised well with a straight line?)
  • Strength of association (how closely do points fall to a recognizable pattern such as a line?)
  • Unusual observations that do not fit the pattern of the rest of the observations and which are worth examining in more detail.

Solution

Question 3

Using the lm() function, fit a linear model to the sample data, in which employee income is explained by education level. Assign it to a name to store it in your environment.

Hints:
You can see how to fit linear models in R using lm() in 7A #fitting-linear-models-in-r

Solution

Question 4

Interpret the estimated intercept and slope in the context of the question of interest.

Hints:
We saw how to extract lots of information on our model using summary() (see 7A #model-summary), but there are lots of other functions too.

If we called our linear model object “model1” in the environment, then we can use:

  • type model1, i.e. simply invoke the name of the fitted model;
  • type model1$coefficients;
  • use the coef(model1) function;
  • use the coefficients(model1) function;
  • use the summary(model1)$coefficients to extract just that part of the summary.

Solution

Question 5

Test the hypothesis that the population slope is zero — that is, that there is no linear association between income and education level in the population.

Hints:
You don’t need to do anything for this, you can find all the necessary information in summary() of your model.
See 7A #inference-for-regression-coefficients.

Solution

Question 6

What is the proportion of the total variability in incomes explained by the linear relationship with education level?

Hints:

  • The question asks to compute the value of \(R^2\), but you might be able to find it already computed somewhere (so much stuff is already in summary() of the model.
  • See 7A #r2 if you’re unsure about what \(R^2\) is.

Solution

Question 7

Look at the output of summary() of your model. Identify the relevant information to conduct an F-test against the null hypothesis that the model is ineffective at predicting income using education level.

Hints:
7A #the-f-statistic will help.

Solution

Question 8

Create a visualisation of the estimated association between income and education.

Hints: Check 7A#example, which shows an example of using the sjPlot package to create a plot.

Solution

Optional: Question 9A

Consider the following:

  1. In fitting a linear regression model, we make the assumption that the errors around the line are normally distributed around zero (this is the \(\epsilon \sim N(0, \sigma)\) bit.)
  2. About 95% of values from a normal distribution fall within two standard deviations of the centre.

We can obtain the estimated standard deviation of the errors (\(\hat \sigma\)) from the fitted model using sigma() and giving it the name of our model.
What does this tell us?

Hints:
See 7A #the-error.

Solution

Optional: Question 9B

Compute the model-predicted income for someone with 1 year of education.

Hints:
Given that you know the intercept and slope, you can calculate this algebraically. However, try to also use the predict() function (see 7A#model-predictions).

Solution

Income and Degree Status

Let’s suppose instead of having measured education in years, researchers simply asked “have you had more than 18 years of education?”, assuming this to be indicative of having a degree.
The code below creates a this new variable for us:

riverview <- riverview %>%
  mutate(
    degree = ifelse(education >= 18, "yes","no")
  )
Question 10

Fit the following model, and interpret the coefficients.

\[ Income = b_0 + b_1 \ Degree + \varepsilon \]

Hints:
For help interpreting the coefficients, see 7A #binary-predictors.

Solution

Question 11

We’ve actually already seen a way to analyse questions of this sort (“is the average income different between those with and those without a degree?”)

Run the following t-test, and consider the statistic, p value etc. How does it compare to the model in the previous question?

t.test(income ~ degree, data = riverview, var.equal = TRUE)

    Two Sample t-test

data:  income by degree
t = -4.6408, df = 30, p-value = 6.412e-05
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
 -27.15051 -10.55673
sample estimates:
 mean in group no mean in group yes 
         46.08284          64.93646 

Solution

Student Sleep Quality

In the survey that many of you completed at the start of term (thank you to those that did!), we had a long set of items that were from various personality and psychometric scales.

One of these was a set of items that measure a persons’ “locus of control” (belief that you have control over your life, as opposed to it being decided by external forces such as fate/luck/destiny/god/the flying spaghetti monster). We also asked you to rate your average sleep quality from 0 to 100.

My pet theory is that people who believe they have more control over life are more inclined to have a better sleep routine, and so will rate their sleep quality as being better than those who believe that life is out of their control.

Data from the 2022 survey (now closed) can be found at https://uoepsy.github.io/data/usmr2022.csv.

usmrdata <- read_csv("https://uoepsy.github.io/data/usmr2022.csv")

The relevant variables are:

variable description
pseudonym User provided pseudonym to enable lookup
loc Locus of Control Score. Minimum score = 6, maximum score = 30. Higher scores indicate more perceived self-control over life
sleeprating Rating of average quality of sleep. Higher values indicate better sleep quality.
... ...
... ...
Question 12

Fit a linear regression model to test the association.

Hints:
Try to do the following steps:

  1. Explore the data (plot and describe)
  2. Fit the model
  3. Take a look at the assumption plots (see 7A#assumptions).
    • The trick to looking at assumption plots is to look for “things that don’t look random”.
    • As well as looking for patterns, these plots can also higlight individual datapoints that might be skewing the results. Can you figure out if there is someone weird in our dataset? Can you re-fit the model without that person? When you do so, do your conclusions change?

Solution

Footnotes

  1. Still not perfect, but don’t worry about that right now↩︎