Week 7 Exercises: Linear Regression
Income and Education
Research Question: is there an association between the level of education and an employee’s income?
Data: riverview.csv
Our data for these exercises is from a hypothetical study into income disparity for employees in a local authority. The riverview data, which come from Lewis-Beck (2015), contain six attributes collected from a random sample of \(n=32\) employees working for the city of Riverview, a hypothetical midwestern city in the US. The attributes include:
education
: Years of formal education

income
: Annual income (in thousands of U.S. dollars)

seniority
: Years of seniority

gender
: Employee's gender

male
: Dummy coded gender variable (0 = Female, 1 = Male)

party
: Political party affiliation
The first six rows of the data are:

| education | income | seniority | gender | male | party |
|---|---|---|---|---|---|
| 8 | 37.449 | 7 | male | 1 | Democrat |
| 8 | 26.430 | 9 | female | 0 | Independent |
| 10 | 47.034 | 14 | male | 1 | Democrat |
| 10 | 34.182 | 16 | female | 0 | Independent |
| 10 | 25.479 | 1 | female | 0 | Republican |
| 12 | 46.488 | 11 | female | 0 | Democrat |
The data is available at https://uoepsy.github.io/data/riverview.csv.
Load the required libraries (probably just tidyverse for now), and read in the riverview data to your R session.
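A minimal sketch of this first step (package and URL as given above):

```r
library(tidyverse)

# read the riverview data from the course website
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")

# quick check: we expect n = 32 rows
dim(riverview)
names(riverview)
```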
Let us first visualise and describe the marginal distributions of those variables which are of interest to us. These are the distribution of each variable (employee incomes and education levels) without reference to the values of the other variables.
Hints:
- You could use, for example, `geom_density()` for a density plot or `geom_histogram()` for a histogram.
- Look at the shape, centre and spread of the distribution. Is it symmetric or skewed? Is it unimodal or bimodal?
- Do you notice any extreme observations?
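One possible sketch of these plots (the histogram binwidth is an arbitrary choice):

```r
library(tidyverse)
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")

# density plot of employee incomes
ggplot(riverview, aes(x = income)) +
  geom_density() +
  labs(x = "Income (thousands of U.S. dollars)")

# histogram of education levels
ggplot(riverview, aes(x = education)) +
  geom_histogram(binwidth = 2) +
  labs(x = "Education (years)")

# numeric summaries of centre and spread
riverview %>%
  summarise(across(c(income, education), list(mean = mean, sd = sd)))
```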
After we’ve looked at the marginal distributions of the variables of interest in the analysis, we typically move on to examining relationships between the variables.
Visualise and describe the relationship between income and level of education among the employees in the sample.
Hints:
Think about:
- Direction of association
- Form of association (can it be summarised well with a straight line?)
- Strength of association (how closely do points fall to a recognizable pattern such as a line?)
- Unusual observations that do not fit the pattern of the rest of the observations and which are worth examining in more detail.
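A possible way to visualise the relationship, with the correlation coefficient as a numeric summary of direction and strength:

```r
library(tidyverse)
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")

# scatterplot of income against education
ggplot(riverview, aes(x = education, y = income)) +
  geom_point() +
  labs(x = "Education (years)", y = "Income (thousands of U.S. dollars)")

# correlation between the two variables
cor(riverview$education, riverview$income)
```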
Using the `lm()` function, fit a linear model to the sample data, in which employee income is explained by education level. Assign it to a name to store it in your environment.
Hints:
You can see how to fit linear models in R using `lm()` in 7A #fitting-linear-models-in-r.
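A sketch of the fitting step ("model1" is just an assumed name for the stored model):

```r
library(tidyverse)
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")

# income explained by education level, stored as "model1"
model1 <- lm(income ~ education, data = riverview)
model1
```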
Interpret the estimated intercept and slope in the context of the question of interest.
Hints:
We saw how to extract lots of information on our model using `summary()` (see 7A #model-summary), but there are lots of other functions too.
If we called our linear model object "model1" in the environment, then we can:

- type `model1`, i.e. simply invoke the name of the fitted model;
- type `model1$coefficients`;
- use the `coef(model1)` function;
- use the `coefficients(model1)` function;
- use `summary(model1)$coefficients` to extract just that part of the summary.
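The listed approaches all return the same two numbers. A self-contained illustration using only the six preview rows shown above (in the exercise you would use the complete data):

```r
# six preview rows only, for illustration
riverview_preview <- data.frame(
  education = c(8, 8, 10, 10, 10, 12),
  income    = c(37.449, 26.430, 47.034, 34.182, 25.479, 46.488)
)
model1 <- lm(income ~ education, data = riverview_preview)

model1                        # invoke the name of the fitted model
model1$coefficients           # the stored coefficient vector
coef(model1)                  # extractor function
coefficients(model1)          # identical to coef()
summary(model1)$coefficients  # estimates plus SEs, t- and p-values
```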
Test the hypothesis that the population slope is zero — that is, that there is no linear association between income and education level in the population.
Hints:
You don't need to calculate anything new for this; you can find all the necessary information in the `summary()` of your model.
See 7A #inference-for-regression-coefficients.
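The test of the slope appears in the coefficient table of the summary; a sketch:

```r
library(tidyverse)
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")
model1 <- lm(income ~ education, data = riverview)

# the "education" row gives the estimate, standard error, t-statistic,
# and p-value for the test of H0: population slope = 0
summary(model1)$coefficients
```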
What is the proportion of the total variability in incomes explained by the linear relationship with education level?
Hints:
- The question asks you to compute the value of \(R^2\), but you might be able to find it already computed somewhere (so much is already in the `summary()` of the model).
- See 7A #r2 if you're unsure about what \(R^2\) is.
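A sketch of pulling \(R^2\) straight out of the model summary:

```r
library(tidyverse)
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")
model1 <- lm(income ~ education, data = riverview)

# proportion of the total variability in incomes
# explained by the linear relationship with education
summary(model1)$r.squared
```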
Look at the output of `summary()` of your model. Identify the relevant information to conduct an F-test against the null hypothesis that the model is ineffective at predicting income using education level.
Hints:
7A #the-f-statistic will help.
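The F-test information can also be extracted directly from the summary object; a sketch:

```r
library(tidyverse)
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")
model1 <- lm(income ~ education, data = riverview)

summary(model1)             # the F-statistic and its df appear in the final line
summary(model1)$fstatistic  # the F value with its numerator and denominator df
```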
Create a visualisation of the estimated association between income and education.
Hints: Check 7A #example, which shows an example of using the sjPlot package to create a plot.
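A sketch using sjPlot (assuming the package is installed; `plot_model()` with `type = "pred"` plots the model-estimated values across a predictor):

```r
library(tidyverse)
library(sjPlot)
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")
model1 <- lm(income ~ education, data = riverview)

# model-estimated income across values of education, with a confidence band
plot_model(model1, type = "pred", terms = "education")
```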
Consider the following:
- In fitting a linear regression model, we make the assumption that the errors around the line are normally distributed around zero (this is the \(\epsilon \sim N(0, \sigma)\) bit).
- About 95% of values from a normal distribution fall within two standard deviations of the centre.
We can obtain the estimated standard deviation of the errors (\(\hat \sigma\)) from the fitted model using `sigma()` and giving it the name of our model.
What does this tell us?
Hints:
See 7A #the-error.
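A sketch of extracting \(\hat \sigma\) and applying the "two standard deviations" idea from above:

```r
library(tidyverse)
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")
model1 <- lm(income ~ education, data = riverview)

sig <- sigma(model1)  # estimated standard deviation of the errors
sig

# roughly 95% of observed incomes should fall within
# about 2 * sig of the model-predicted income
2 * sig
```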
Compute the model-predicted income for someone with 1 year of education.
Hints:
Given that you know the intercept and slope, you can calculate this algebraically. However, try to also use the `predict()` function (see 7A #model-predictions).
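Both routes give the same number; a sketch:

```r
library(tidyverse)
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")
model1 <- lm(income ~ education, data = riverview)

# algebraically: intercept + slope * 1
coef(model1)[1] + coef(model1)[2] * 1

# or with predict(), passing new data with education = 1
predict(model1, newdata = data.frame(education = 1))
```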
Income and Degree Status
Let’s suppose instead of having measured education in years, researchers simply asked “have you had more than 18 years of education?”, assuming this to be indicative of having a degree.
The code below creates this new variable for us:

```r
riverview <- riverview %>%
  mutate(
    degree = ifelse(education >= 18, "yes", "no")
  )
```
Fit the following model, and interpret the coefficients.
\[ Income = b_0 + b_1 \ Degree + \varepsilon \]
Hints:
For help interpreting the coefficients, see 7A #binary-predictors.
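A sketch of creating the binary variable (as in the code above) and fitting the model; "degmod" is just an assumed name:

```r
library(tidyverse)
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")
riverview <- riverview %>%
  mutate(degree = ifelse(education >= 18, "yes", "no"))

degmod <- lm(income ~ degree, data = riverview)
# intercept = mean income for the "no" group;
# "degreeyes" coefficient = difference in mean income, "yes" minus "no"
summary(degmod)$coefficients
```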
We’ve actually already seen a way to analyse questions of this sort (“is the average income different between those with and those without a degree?”)
Run the following t-test, and consider the statistic, p value etc. How does it compare to the model in the previous question?
```r
t.test(income ~ degree, data = riverview, var.equal = TRUE)
```

```
	Two Sample t-test

data:  income by degree
t = -4.6408, df = 30, p-value = 6.412e-05
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
 -27.15051 -10.55673
sample estimates:
 mean in group no mean in group yes 
         46.08284          64.93646 
```
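To see the correspondence, compare the slope's t-statistic and p-value from the regression with the t-test output above. The sign flips because the t-test computes "no" minus "yes", whereas the regression coefficient is "yes" minus "no"; the magnitude and p-value are identical:

```r
library(tidyverse)
riverview <- read_csv("https://uoepsy.github.io/data/riverview.csv")
riverview <- riverview %>%
  mutate(degree = ifelse(education >= 18, "yes", "no"))
degmod <- lm(income ~ degree, data = riverview)

# same magnitude as the t-test's t = -4.6408, opposite sign, identical p-value
summary(degmod)$coefficients["degreeyes", ]
```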
Student Sleep Quality
In the survey that many of you completed at the start of term (thank you to those that did!), we had a long set of items that were from various personality and psychometric scales.
One of these was a set of items that measure a person's "locus of control" (the belief that you have control over your life, as opposed to it being decided by external forces such as fate/luck/destiny/god/the flying spaghetti monster). We also asked you to rate your average sleep quality from 0 to 100.
My pet theory is that people who believe they have more control over life are more inclined to have a better sleep routine, and so will rate their sleep quality as being better than those who believe that life is out of their control.
Data from the 2022 survey (now closed) can be found at https://uoepsy.github.io/data/usmr2022.csv.
```r
usmrdata <- read_csv("https://uoepsy.github.io/data/usmr2022.csv")
```
The relevant variables are:
| variable | description |
|---|---|
| pseudonym | User provided pseudonym to enable lookup |
| loc | Locus of Control Score. Minimum score = 6, maximum score = 30. Higher scores indicate more perceived self-control over life |
| sleeprating | Rating of average quality of sleep. Higher values indicate better sleep quality. |
| ... | ... |
Fit a linear regression model to test the association.
Hints:
Try to do the following steps:
- Explore the data (plot and describe)
- Fit the model
- Take a look at the assumption plots (see 7A #assumptions).
- The trick to looking at assumption plots is to look for "things that don't look random".
- As well as looking for patterns, these plots can also highlight individual datapoints that might be skewing the results. Can you figure out if there is someone weird in our dataset? Can you re-fit the model without that person? When you do so, do your conclusions change?
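A sketch of these steps on the survey data (variable names from the table above). One common way to flag unusual observations is Cook's distance; the 4/n cut-off used here is a rule of thumb, not a fixed rule:

```r
library(tidyverse)
usmrdata <- read_csv("https://uoepsy.github.io/data/usmr2022.csv") %>%
  drop_na(loc, sleeprating)

sleepmod <- lm(sleeprating ~ loc, data = usmrdata)

# standard diagnostic plots: residuals vs fitted, QQ-plot, etc.
par(mfrow = c(2, 2))
plot(sleepmod)
par(mfrow = c(1, 1))

# flag unusually influential observations (rule-of-thumb threshold)
flagged <- which(cooks.distance(sleepmod) > 4 / nobs(sleepmod))
flagged

# refit without the flagged observation(s) and compare the coefficients
if (length(flagged) > 0) {
  sleepmod2 <- lm(sleeprating ~ loc, data = usmrdata[-flagged, ])
  summary(sleepmod2)$coefficients
}
```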
Footnotes
Still not perfect, but don’t worry about that right now↩︎