Week 7 Exercises: Linear Regression

Income and Education

Research Question: Is there an association between the level of education and an employee’s income?

Data: riverview.csv

Our data for these exercises are from a hypothetical study into income disparity for employees in a local authority. The riverview data, which come from Lewis-Beck (2015), contain six attributes collected from a random sample of \(n=32\) employees working for the city of Riverview, a hypothetical midwestern city in the US. The attributes include:

  • education: Years of formal education
  • income: Annual income (in thousands of U.S. dollars)
  • seniority: Years of seniority
  • gender: Employee’s gender
  • male: Dummy coded gender variable (0 = Female, 1 = Male)
  • party: Political party affiliation

The first six rows of the data are:

education  income  seniority  gender  male  party
8          37.449  7          male    1     Democrat
8          26.430  9          female  0     Independent
10         47.034  14         male    1     Democrat
10         34.182  16         female  0     Independent
10         25.479  1          female  0     Republican
12         46.488  11         female  0     Democrat

The data is available at https://uoepsy.github.io/data/riverview.csv.

Question 1

Load the required libraries (probably just tidyverse for now), and read in the riverview data to your R session.

Let us first visualise and describe the marginal distributions of those variables which are of interest to us. That is, the distribution of each variable (employee incomes and education levels) without reference to the values of the other variables.

Hints:

  • You could use, for example, geom_density() for a density plot or geom_histogram() for a histogram.
  • Look at the shape, center and spread of the distribution. Is it symmetric or skewed? Is it unimodal or bimodal?
  • Do you notice any extreme observations?

library(tidyverse)

riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")
head(riverview)
# A tibble: 6 × 6
  education income seniority gender  male party      
      <dbl>  <dbl>     <dbl> <chr>  <dbl> <chr>      
1         8   37.4         7 male       1 Democrat   
2         8   26.4         9 female     0 Independent
3        10   47.0        14 male       1 Democrat   
4        10   34.2        16 female     0 Independent
5        10   25.5         1 female     0 Republican 
6        12   46.5        11 female     0 Democrat   

We can plot the marginal distribution of employee incomes as a density curve, and add a boxplot underneath to check for the presence of outliers. The default width of geom_boxplot() is quite wide, so I want to make it narrower so that we can see it at the same time as the density plot. Deciding on the exact value for the width here is just trial and error:

library(patchwork)
# the patchwork library allows us to combine plots together
ggplot(data = riverview, aes(x = income)) +
  geom_density() +
  geom_boxplot(width = 1/300) +
  labs(x = "Income (in thousands of U.S. dollars)", 
       y = "Probability density") +

ggplot(data = riverview, aes(x = education)) +
  geom_density() +
  geom_boxplot(width = 1/100) +
  labs(x = "Education (in years)", 
       y = "Probability density")

Figure 1: Density plot and boxplot of employee incomes and education levels

The plots suggest that the distributions of employee incomes and education levels are both unimodal. Most of the incomes are between roughly $45,000 and $70,000, and most people have between 12 and 20 years of education. Furthermore, the boxplots do not highlight any outliers in either variable.

To further summarize a distribution, it is typical to compute and report numerical summary statistics such as the mean and standard deviation.

As we have seen in earlier weeks, one way to compute these values is to use the summarise()/summarize() function from the tidyverse:

riverview %>% 
  summarize(
    mean_income = mean(income), 
    sd_income = sd(income),
    mean_edu = mean(education),
    sd_edu = sd(education)
    )
# A tibble: 1 × 4
  mean_income sd_income mean_edu sd_edu
        <dbl>     <dbl>    <dbl>  <dbl>
1        53.7      14.6       16   4.36

The marginal distribution of income is unimodal with a mean of approximately $53,700. There is variation in employees’ salaries (SD = $14,553).
The marginal distribution of education is unimodal with a mean of 16 years. There is variation in employees’ level of education (SD = 4.4 years).
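
As an aside, if we had many variables to summarise, across() would save some typing - a sketch that returns the same four numbers as above, just with automatically generated column names:

riverview %>% 
  summarize(across(c(income, education), list(mean = mean, sd = sd)))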

Question 2

After we’ve looked at the marginal distributions of the variables of interest in the analysis, we typically move on to examining relationships between the variables.

Visualise and describe the relationship between income and level of education among the employees in the sample.

Hints:
Think about:

  • Direction of association
  • Form of association (can it be summarised well with a straight line?)
  • Strength of association (how closely do points fall to a recognizable pattern such as a line?)
  • Unusual observations that do not fit the pattern of the rest of the observations and which are worth examining in more detail.

Because we are investigating how income varies with years of formal education, income here is the dependent variable (on the y-axis), and education is the independent variable (on the x-axis).

ggplot(data = riverview, aes(x = education, y = income)) +
  geom_point(alpha = 0.5) +
  labs(x = "Education (in years)", 
       y = "Income (in thousands of U.S. dollars)")

Figure 2: The relationship between employees’ education level and income.

There appears to be a strong positive linear relationship between education level and income for the employees in the sample. Higher incomes tend, on average, to be observed with more years of formal education. The scatterplot does not highlight any outliers.

To comment numerically on the strength of the linear association, we might compute the correlation coefficient that we were introduced to in 5A: Covariance & Correlation:

riverview %>%
  select(education, income) %>%
  cor()
          education    income
education 1.0000000 0.7947847
income    0.7947847 1.0000000

That is, \(r_{\text{education, income}} = 0.79\).
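
If we wanted a formal test of this correlation, cor.test() provides one; for the simple regression we fit below, it is equivalent to the t-test of the slope:

cor.test(riverview$education, riverview$income)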

Question 3

Using the lm() function, fit a linear model to the sample data, in which employee income is explained by education level. Assign it to a name to store it in your environment.

Hints:
You can see how to fit linear models in R using lm() in 7A #fitting-linear-models-in-r

As the variables are in the riverview dataframe, we would write:

model1 <- lm(income ~ 1 + education, data = riverview)

Question 4

Interpret the estimated intercept and slope in the context of the question of interest.

Hints:
We saw how to extract lots of information on our model using summary() (see 7A #model-summary), but there are lots of other functions too.

If we called our linear model object “model1” in the environment, then we can use:

  • type model1, i.e. simply invoke the name of the fitted model;
  • type model1$coefficients;
  • use the coef(model1) function;
  • use the coefficients(model1) function;
  • use summary(model1)$coefficients to extract just that part of the summary.

coef(model1)
(Intercept)   education 
  11.321379    2.651297 
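
For completeness, the other approaches listed above return the same estimates; the last one also gives the standard errors, t-values and p-values:

model1$coefficients
coefficients(model1)
summary(model1)$coefficients   # estimates plus SEs, t-values and p-values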

From this, we get that the fitted line is: \[ \widehat{Income} = 11.32 + 2.65 \cdot Education \]

We can interpret the estimated intercept as:

The estimated average income associated with zero years of formal education is $11,321.

For the estimated slope we get:

The estimated increase in average income associated with a one year increase in education is $2,651.

Question 5

Test the hypothesis that the population slope is zero — that is, that there is no linear association between income and education level in the population.

Hints:
You don’t need to do anything for this, you can find all the necessary information in summary() of your model.
See 7A #inference-for-regression-coefficients.

The information is already contained in the row corresponding to the variable “education” in the output of summary(), which reports the t-statistic under t value and the p-value under Pr(>|t|):

summary(model1)

Call:
lm(formula = income ~ 1 + education, data = riverview)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.809  -5.783   2.088   5.127  18.379 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.3214     6.1232   1.849   0.0743 .  
education     2.6513     0.3696   7.173 5.56e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.978 on 30 degrees of freedom
Multiple R-squared:  0.6317,    Adjusted R-squared:  0.6194 
F-statistic: 51.45 on 1 and 30 DF,  p-value: 5.562e-08

Recall that the p-value 5.56e-08 in the Pr(>|t|) column simply means \(5.56 \times 10^{-8}\). This is a very small value, hence we will report it as <.001 following the APA guidelines.

A significant association was found between level of education (in years) and income, with income increasing by on average $2,651 for every additional year of education (\(b = 2.65\), \(SE = 0.37\), \(t(30)=7.17\), \(p<.001\)).
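
If we also wanted a 95% confidence interval for the slope (often reported alongside this test), confint() gives it directly:

confint(model1, level = 0.95)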

Question 6

What is the proportion of the total variability in incomes explained by the linear relationship with education level?

Hints:

  • The question asks you to compute the value of \(R^2\), but you might be able to find it already computed somewhere (so much stuff is already in the summary() of the model).
  • See 7A #r2 if you’re unsure about what \(R^2\) is.

For the moment, ignore “Adjusted R-squared”. We will come back to this later on.

summary(model1)

Call:
lm(formula = income ~ 1 + education, data = riverview)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.809  -5.783   2.088   5.127  18.379 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.3214     6.1232   1.849   0.0743 .  
education     2.6513     0.3696   7.173 5.56e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.978 on 30 degrees of freedom
Multiple R-squared:  0.6317,    Adjusted R-squared:  0.6194 
F-statistic: 51.45 on 1 and 30 DF,  p-value: 5.562e-08

Approximately 63% of the total variability in employee incomes is explained by the linear association with education level.
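
Rather than reading the value off the printed summary, we can also extract it directly:

summary(model1)$r.squared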

Question 7

Look at the output of summary() of your model. Identify the relevant information to conduct an F-test against the null hypothesis that the model is ineffective at predicting income using education level.

Hints:
7A #the-f-statistic will help.

summary(model1)

Call:
lm(formula = income ~ 1 + education, data = riverview)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.809  -5.783   2.088   5.127  18.379 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.3214     6.1232   1.849   0.0743 .  
education     2.6513     0.3696   7.173 5.56e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.978 on 30 degrees of freedom
Multiple R-squared:  0.6317,    Adjusted R-squared:  0.6194 
F-statistic: 51.45 on 1 and 30 DF,  p-value: 5.562e-08

From the summary(model1), the relevant row is just below the \(R^2\), where it states:

F-statistic: 51.45 on 1 and 30 DF,  p-value: 5.562e-08

The overall test of model utility was significant \(F(1, 30) = 51.45, p < .001\), indicating evidence against the null hypothesis that the model is ineffective (that education is not a useful predictor of income).
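
Equivalently, the F-test compares our model against an intercept-only ("null") model. We can make that comparison explicit with anova() - a sketch that reproduces the same F-statistic:

model0 <- lm(income ~ 1, data = riverview)
anova(model0, model1)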

Question 8

Create a visualisation of the estimated association between income and education.

Hints: Check 7A#example, which shows an example of using the sjPlot package to create a plot.

library(sjPlot)
plot_model(model1, type = "eff", terms = c("education"))
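
Alternatively, staying within ggplot, geom_smooth(method = lm) overlays the fitted line (with a confidence band) on the scatterplot:

ggplot(riverview, aes(x = education, y = income)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm) +
  labs(x = "Education (in years)", 
       y = "Income (in thousands of U.S. dollars)")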

Optional: Question 9A

Consider the following:

  1. In fitting a linear regression model, we make the assumption that the errors around the line are normally distributed around zero (this is the \(\epsilon \sim N(0, \sigma)\) bit.)
  2. About 95% of values from a normal distribution fall within two standard deviations of the centre.

We can obtain the estimated standard deviation of the errors (\(\hat \sigma\)) from the fitted model using sigma() and giving it the name of our model.
What does this tell us?

Hints:
See 7A #the-error.

The estimated standard deviation of the errors can be obtained by:

sigma(model1)
[1] 8.978116

For any particular level of education, employee incomes should be distributed above and below the regression line with standard deviation estimated to be \(\hat \sigma = 8.98\). Since \(2 \hat \sigma = 2 (8.98) = 17.96\), we expect most (about 95%) of the employee incomes to be within about $18,000 from the regression line.
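
The arithmetic, done in R:

# ~95% of incomes are expected within 2 estimated SDs of the regression line
2 * sigma(model1)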

Optional: Question 9B

Compute the model-predicted income for someone with 1 year of education.

Hints:
Given that you know the intercept and slope, you can calculate this algebraically. However, try to also use the predict() function (see 7A#model-predictions).

Using predict(), we need to give it our model, plus some new data which contains a person with only 1 year of education. First we make a new dataframe with an education variable, with one entry which has the value 1, and then we give that to predict():

education_query <- data.frame(education = c(1))
predict(model1, newdata = education_query)
       1 
13.97268 

Given that our fitted model takes the form:

\[ \widehat{Income} = 11.32 + 2.65 \cdot Education \]

We are asking what the predicted income is for someone with 1 year of education, so we can substitute in 1 for the Education variable: \[ \begin{align} \widehat{Income} &= 11.32 + 2.65 \cdot Education \\ \widehat{Income} &= 11.32 + 2.65 \cdot 1 \\ \widehat{Income} &= 11.32 + 2.65 \\ \widehat{Income} &= 13.97 \end{align} \]
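
As an optional extra, predict() can also return an interval around this prediction; for example, a 95% prediction interval for a single new employee's income (a sketch):

predict(model1, newdata = education_query, interval = "prediction", level = 0.95)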

Income and Degree Status

Let’s suppose that instead of having measured education in years, researchers simply asked “have you had at least 18 years of education?”, taking this to be indicative of having a degree.
The code below creates this new variable for us:

riverview <- riverview %>%
  mutate(
    degree = ifelse(education >= 18, "yes","no")
  )
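
It's worth quickly checking how many employees fall into each category:

riverview %>% count(degree)
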
Question 10

Fit the following model, and interpret the coefficients.

\[ Income = b_0 + b_1 \ Degree + \varepsilon \]

Hints:
For help interpreting the coefficients, see 7A #binary-predictors.

model2 <- lm(income ~ degree, data = riverview)
summary(model2)

Call:
lm(formula = income ~ degree, data = riverview)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.6038  -8.6706   0.3053   8.8789  18.8432 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   46.083      2.589  17.797  < 2e-16 ***
degreeyes     18.854      4.063   4.641 6.41e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.29 on 30 degrees of freedom
Multiple R-squared:  0.4179,    Adjusted R-squared:  0.3985 
F-statistic: 21.54 on 1 and 30 DF,  p-value: 6.412e-05

  • (Intercept) = the estimated income of employees without a degree ($46,083)
  • degreeyes = the estimated difference in average income between employees without a degree and those with a degree ($18,854 higher for degree holders)

ggplot(riverview, aes(x = degree, y = income)) +
  geom_boxplot()
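
We can check these interpretations against the group means directly: the intercept should match the mean income of the “no” group, and intercept + slope (46.083 + 18.854 = 64.937) should match the mean of the “yes” group:

riverview %>% 
  group_by(degree) %>%
  summarise(mean_income = mean(income))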

Question 11

We’ve actually already seen a way to analyse questions of this sort (“is the average income different between those with and those without a degree?”)

Run the following t-test, and consider the statistic, p value etc. How does it compare to the model in the previous question?

t.test(income ~ degree, data = riverview, var.equal = TRUE)

    Two Sample t-test

data:  income by degree
t = -4.6408, df = 30, p-value = 6.412e-05
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
 -27.15051 -10.55673
sample estimates:
 mean in group no mean in group yes 
         46.08284          64.93646 

It is identical! The \(t\)-statistics are the same (apart from the sign, which flips simply because the t-test computes group “no” minus group “yes”), the p-values are the same, and so are the degrees of freedom. Everything!

The two-sample t-test is actually just a special case of the linear model, where we have a numeric outcome variable and a binary predictor!

And… the one-sample t-test is the linear model without any predictors, so just with an intercept.

t.test(riverview$income, mu = 0)

    One Sample t-test

data:  riverview$income
t = 20.89, df = 31, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 48.49519 58.98906
sample estimates:
mean of x 
 53.74213 

interceptonly_model <- lm(income ~ 1, data = riverview)
summary(interceptonly_model)$coefficients
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 53.74212    2.57264 20.88988 7.995568e-20

Again the numbers match: the intercept estimate is the sample mean income (53.74), and the t-value (20.89) is the same as in the one-sample t-test above.

Student Sleep Quality

In the survey that many of you completed at the start of term (thank you to those that did!), we had a long set of items that were from various personality and psychometric scales.

One of these was a set of items that measure a person’s “locus of control” (the belief that you have control over your life, as opposed to it being decided by external forces such as fate/luck/destiny/god/the flying spaghetti monster). We also asked you to rate your average sleep quality from 0 to 100.

My pet theory is that people who believe they have more control over life are more inclined to have a better sleep routine, and so will rate their sleep quality as being better than those who believe that life is out of their control.

Data from the 2022 survey (now closed) can be found at https://uoepsy.github.io/data/usmr2022.csv.

usmrdata <- read_csv("https://uoepsy.github.io/data/usmr2022.csv")

The relevant variables are:

variable      description
pseudonym     User-provided pseudonym to enable lookup
loc           Locus of Control score. Minimum score = 6, maximum score = 30. Higher scores indicate more perceived self-control over life
sleeprating   Rating of average quality of sleep. Higher values indicate better sleep quality.
...           ...
...           ...
Question 12

Fit a linear regression model to test the association.

Hints:
Try to do the following steps:

  1. Explore the data (plot and describe)
  2. Fit the model
  3. Take a look at the assumption plots (see 7A#assumptions).
    • The trick to looking at assumption plots is to look for “things that don’t look random”.
    • As well as looking for patterns, these plots can also highlight individual datapoints that might be skewing the results. Can you figure out if there is someone weird in our dataset? Can you re-fit the model without that person? When you do so, do your conclusions change?

usmrdata <- read_csv("https://uoepsy.github.io/data/usmr2022.csv")

Let’s just take a look at the relevant variables and relationship:

p1 <- ggplot(usmrdata, aes(x=loc))+
  geom_histogram()
p2 <- ggplot(usmrdata, aes(x=sleeprating))+
  geom_histogram() 
p3 <- ggplot(usmrdata, aes(x=loc, y=sleeprating))+
  geom_point()
# remember, to make this plot arranging work,
# we need to load the patchwork package
library(patchwork)
p3 / (p1 + p2)

And go straight ahead and fit our model:

sleepmod <- lm(sleeprating ~ loc, usmrdata)
summary(sleepmod)

Call:
lm(formula = sleeprating ~ loc, data = usmrdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-66.358 -13.856   2.647  14.644  41.649 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  21.3352    13.2292   1.613 0.110892    
loc           2.0009     0.5831   3.432 0.000968 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.21 on 77 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.1326,    Adjusted R-squared:  0.1214 
F-statistic: 11.78 on 1 and 77 DF,  p-value: 0.0009681

Wooh, looks like we were right! People with greater perceived control tend to report better sleep quality.

Our assumption plots don’t look all that great though...

plot(sleepmod)

From the bottom-right plot above, we can see that the 65th observation is looking a bit influential. There is also one datapoint that looks weird in both of the left-hand plots - this may well be the same 65th person.
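
One way to confirm this programmatically (a sketch) is via Cook's distance - the influence measure behind that bottom-right panel; this should point at the same observation flagged in the plot:

which.max(cooks.distance(sleepmod))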
I wonder who that is...

usmrdata[65, c("pseudonym","sleeprating","loc")]
# A tibble: 1 × 3
  pseudonym          sleeprating   loc
  <chr>                    <dbl> <dbl>
1 Definitely not Ian           3     6

Ah, it’s Ian!
Ian is a guy who sleeps very badly (low sleeprating score) and feels like he has no control over his life (low loc score).
The thing is - Ian is a valuable datapoint - he’s a real person and we shouldn’t just get rid of him because he’s “a bit odd”.

However, it would be nice to know how much Ian’s “oddness” is affecting our conclusions, so let’s re-fit the model on everybody except Ian:

sleepmod_noian <- lm(sleeprating ~ loc, usmrdata[-65, ])
summary(sleepmod_noian)

Call:
lm(formula = sleeprating ~ loc, data = usmrdata[-65, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-65.093 -12.731   2.907  14.326  39.106 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  32.4454    14.7558   2.199   0.0309 *
loc           1.5249     0.6465   2.359   0.0209 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.98 on 76 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.06821,   Adjusted R-squared:  0.05595 
F-statistic: 5.563 on 1 and 76 DF,  p-value: 0.02091

Our conclusions haven’t changed - we still have a significant positive association - although the estimated slope is noticeably smaller without Ian (1.52 vs 2.00).

What we have just done is called a “sensitivity analysis” - we’ve asked if our analyses are sensitive to including/excluding Ian.

ggplot(usmrdata, aes(x=loc,y=sleeprating))+
  geom_point()+
  geom_smooth(method=lm, fullrange=TRUE)+
  xlim(5,30)+
  labs(title="With Ian") +

ggplot(usmrdata[-65, ], aes(x=loc,y=sleeprating))+
  geom_point()+
  geom_smooth(method=lm, fullrange=TRUE)+
  xlim(5,30)+
  labs(title="Without Ian")

We now have a decision to make. Do we continue with Ian removed, or do we keep him in? There’s not really a right answer here, but it’s worth noting a practical issue - our assumption plots look considerably better¹ for our model without Ian.

Personally, I would probably remove Ian, and when writing up the analysis, mention clearly why we exclude him, and how that decision has/hasn’t influenced our conclusions.

Footnotes

  1. Still not perfect, but don’t worry about that right now.