Back to Basics

For an overview of basic statistical tests and core concepts (e.g., $p$-values), please revisit the DAPR1 materials for a refresher (also accessible via the DAPR1 Learn page).

Terminology

Term	Definition
(Observational) unit	The individual entities on which data are collected
Variable	Any characteristic recorded on the observational units
Numeric variable	A variable that records a numerical quantity for each case. For such variables standard arithmetic operations make sense. For example: height, IQ, and weight
Categorical variable	A categorical variable places units into one of several groups. For example: country of birth, dominant hand, and eye colour
Binary variable	A special case of categorical variable with only 2 possible levels. For example: handedness (left or right), smoking status (smoker or non-smoker), pass test (yes or no)
Response variable (also called a dependent variable or outcome variable)	Measures the outcome of interest in a study
Explanatory/independent variable (also called a predictor)	Used to explain differences/changes in the response variable
Observational study	An observational study is a study in which the researcher does not manipulate any of the variables involved in the study, but merely records the values as they naturally exist
Experimental study	An experiment is a study in which the researcher imposes the values of the explanatory variable on the units before measuring the response variable

Data Exploration

The common first port of call for almost any statistical analysis is to explore the data, and we can do this visually and/or numerically.

	Marginal Distributions	Bivariate Associations
Description	The distribution of each variable individually (i.e., without reference to the values of the other variables).	Describing the association between two numeric variables.
Visually	Plot each variable individually. You could use, for example, `geom_density()` for a density plot or `geom_histogram()` for a histogram to comment on and/or examine: (1) the shape, centre and spread of the distribution (is it symmetric or skewed? unimodal or bimodal?); (2) any unusual observations (do you notice any extreme observations, i.e., outliers?).	Plot associations between two variables. You could use, for example, `geom_point()` for a scatterplot to comment on and/or examine: (1) the direction of the association (positive or negative); (2) the form of the association (whether the relationship can be summarized well with a straight line or needs some more complicated pattern); (3) the strength of the association (how closely the points fall to a recognizable pattern such as a line); (4) any unusual observations that do not fit the pattern of the rest of the observations and which are worth examining in more detail.
Numerically	Compute and report summary statistics (e.g., mean, standard deviation, median, minimum, maximum). You could, for example, calculate the mean (`mean()`) and standard deviation (`sd()`) within `summarize()`.	Compute and report the correlation coefficient. You can use the `cor()` function to calculate this.

Numeric Exploration

Numeric exploration of data involves examining key statistics like mean, median, and standard deviation via descriptives tables; and assessing the associations among variables through correlation coefficients. Exploring our data numerically helps us to identify patterns and associations in the data.

Descriptives

Descriptives Tables

Descriptives Tables - Examples

The tidyverse way
The psych way

We can use the summarise() function to numerically summarise/describe our data. Some key values we may want to extract include (but are not limited to): the mean (via mean()), standard deviation (via sd()), minimum value (via min()), maximum value (via max()), standard error (base R has no built-in se() function, so this is typically computed as sd(x) / sqrt(n())), and skewness (via skew() from the psych package).

Numeric values only example:
Categorical and numeric values example:

library(tidyverse)
library(kableExtra)

# using the pre-loaded iris dataset
# taking the mean and standard deviation of sepal length via the summarize function
# returning a table with a caption, where numbers are rounded to 2 dp
# asking for a table that is not the full width of the window display
iris %>%
    summarize(
        M_Length = mean(Sepal.Length),
        SD_Length = sd(Sepal.Length)
    ) %>%
    kable(caption = "Sepal Length Descriptives (in cm)", digits = 2) %>%
    kable_styling(full_width = FALSE)

Sepal Length Descriptives (in cm)
M_Length	SD_Length
5.84	0.83

library(tidyverse)
library(kableExtra)

# using the pre-loaded iris dataset
# grouping by Species. NOTE: we can group by 2 variables - we would just separate by a comma within group_by( , )
# taking the mean and standard deviation of sepal length via the summarize function
# returning a table of sepal length grouped by species with a caption, where numbers are rounded to 2 dp
# asking for a table that is not the full width of the window display
iris %>%
    group_by(Species) %>%
    summarize(
        M_Length = mean(Sepal.Length),
        SD_Length = sd(Sepal.Length)
    ) %>%
    kable(caption = "Sepal Length (in cm) Grouped by Species Descriptives Table", digits = 2) %>%
    kable_styling(full_width = FALSE)

Sepal Length (in cm) Grouped by Species Descriptives Table
Species	M_Length	SD_Length
setosa	5.01	0.35
versicolor	5.94	0.52
virginica	6.59	0.64
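The grouped summary above can be extended with further statistics. The following is a minimal sketch, assuming only that dplyr is installed; the column names (N, SE_Length, etc.) are illustrative choices, and the standard error is computed by hand as the standard deviation divided by the square root of the group size.

```r
library(dplyr)

# Grouped descriptives for sepal length, one row per species.
# SE is computed manually as sd / sqrt(n); column names are illustrative.
desc_tbl <- iris %>%
    group_by(Species) %>%
    summarise(
        N = n(),
        M_Length = mean(Sepal.Length),
        SD_Length = sd(Sepal.Length),
        SE_Length = sd(Sepal.Length) / sqrt(n()),
        Min_Length = min(Sepal.Length),
        Max_Length = max(Sepal.Length)
    )

desc_tbl
```

The resulting tibble could then be passed on to kable() in the same way as the tables above.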

The describe() function will produce a table of descriptive statistics. If you would like only a subset of this output (e.g., mean, sd), you can use select() after calling describe() e.g., describe() %>% select(mean, sd).

Numeric values only example:
Categorical and numeric values options:

library(tidyverse) # provides select(), used below
library(psych)
library(kableExtra)

# using the pre-loaded iris dataset
# we want to get descriptive statistics of the iris dataset, specifically the sepal length column
# we specifically want to select the mean and standard deviation from the descriptive statistics available (try this without including this argument to see what values you all get out)
# returning a table with a caption, where numbers are rounded to 2 dp
# asking for a table that is not the full width of the window display
describe(iris$Sepal.Length) %>%
    select(mean, sd) %>%
    kable(caption = "Sepal Length Descriptives (in cm)", digits = 2) %>%
    kable_styling(full_width = FALSE)

Sepal Length Descriptives (in cm)
	mean	sd
X1	5.84	0.83

Note that this is quite an overly complex way to return these summary statistics - using the tidyverse way is much more intuitive and straightforward!

library(tidyverse) # provides rownames_to_column() and select(), used below
library(psych)
library(kableExtra)

# using the pre-loaded iris dataset
# we want to get descriptive statistics of the iris dataset, specifically the sepal length column by Species
# we want to return a matrix (hence mat = TRUE), then convert this to a dataframe
# we specifically want to select the mean and standard deviation from the descriptive statistics available (try this without including this argument to see what values you all get out)
# returning a table with new column names of Group, Mean, SD; adding a caption; numbers are rounded to 2 dp
# asking for a table that is not the full width of the window display


describeBy(Sepal.Length ~ Species, data = iris, mat = TRUE, digits = 2) %>%
    as.data.frame() %>%
    rownames_to_column() %>%
    select(group1, mean, sd) %>%
    kable(col.names = c("Group", "Mean", "SD"), caption = "Sepal Length Descriptives (in cm)", digits = 2) %>%
    kable_styling(full_width = FALSE)

Sepal Length Descriptives (in cm)
Group	Mean	SD
setosa	5.01	0.35
versicolor	5.94	0.52
virginica	6.59	0.64

Correlation

Correlation Coefficient

Correlation Matrix

A correlation matrix is a table showing the correlation coefficients - $r_{(x,y)}=\frac{\mathrm{cov}(x,y)}{s_xs_y}$ - between variables. Each cell in the table shows the association between two variables. The diagonals show the correlation of a variable with itself (and are therefore always equal to 1).
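As a worked check of the formula above, the following minimal sketch computes the correlation between petal length and sepal length in the iris dataset by hand (covariance divided by the product of the standard deviations) and compares it with the value from cor(). The object names are illustrative.

```r
# Two numeric variables from the pre-loaded iris dataset
x <- iris$Petal.Length
y <- iris$Sepal.Length

# Correlation by hand: covariance divided by the product of the SDs
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Correlation from the built-in function
r_builtin <- cor(x, y)

round(c(manual = r_manual, builtin = r_builtin), 2)
```

Both values should agree, since cor() computes the same Pearson coefficient by default.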

In R

We can create a correlation matrix by giving the cor() function a dataframe. It is important to remember that all variables must be numeric.

Let’s check the structure of the iris dataset to ensure that all variables are numeric:

str(iris)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We can see that the variable Species in column 5 is a factor - this means that we cannot include it in our correlation matrix. Therefore, we need to subset, or, in other words, select specific columns. We can do this either by giving the column numbers inside [], or by using select(). In our case, we want the variables in columns 1 - 4, just not 5.

If you had NA values within your dataset, note that cor() has no na.rm argument; instead, you could exclude incomplete cases by specifying use = "complete.obs" (or use = "pairwise.complete.obs") inside the cor() function.

Index dataframe ([])
Variable selection (select())

round(cor(iris[,c(1:4)]), digits = 2)

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00

# select only the columns we want by variable name, and pass this to cor()
iris %>% 
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  cor() %>%
  round(digits = 2)

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00
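To illustrate how missing values can be handled, here is a minimal sketch using the use argument of cor(). The copy iris_na, with one value deliberately set to NA, is a hypothetical example constructed purely for illustration.

```r
# Copy the numeric columns and introduce one missing value (illustration only)
iris_na <- iris[, 1:4]
iris_na[1, "Sepal.Length"] <- NA

# Default behaviour: correlations involving Sepal.Length are returned as NA
cor_default <- cor(iris_na)

# Drop any row with a missing value before computing all correlations
cor_complete <- cor(iris_na, use = "complete.obs")

# Use all rows available for each individual pair of variables
cor_pairwise <- cor(iris_na, use = "pairwise.complete.obs")

round(cor_complete, 2)
```

With "complete.obs" every correlation is based on the same set of rows, whereas "pairwise.complete.obs" may use a different number of rows for each pair.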

Correlation - Hypothesis Testing

Correlation - Hypothesis Testing in R

Visual Exploration

Visual exploration of our data allows us to visualize the distributions of our data, and to identify potential associations between variables.

How to Visualise Data

Data Visualisation - Marginal Examples

Histogram
Density

A histogram shows the frequency of values which fall within bins of an equal width.

Basic:

x-axis: possible values of some variable, grouped into bins
y-axis: frequency of a given value or values within bins
What are bins?: A bin represents a range of scores

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_histogram() +
    labs(x = "Sepal Length (in cm)")

Updating Bins:

Within geom_histogram(), we can specify bins = to specify the number of columns we want (for this example, let’s say we want 10):

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_histogram(bins = 10) +
    labs(x = "Sepal Length (in cm)")

Alternatively, we can specify binwidth = to specify the width of each bin (it is very helpful to be aware of the scale of your variable here!):

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_histogram(binwidth = 0.1) +
    labs(x = "Sepal Length (in cm)")
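To get a feel for the scale of a variable before choosing binwidth =, it can help to look at its range. The sketch below is a rough guide only (ggplot2 may place bin edges slightly differently), and the candidate widths of 0.1 and 0.5 are just illustrative.

```r
x <- iris$Sepal.Length

# Smallest and largest observed values
x_range <- range(x)

# Total span covered by the data
x_span <- diff(x_range)

# Rough number of bins implied by two candidate bin widths
approx_bins_01 <- x_span / 0.1
approx_bins_05 <- x_span / 0.5

c(min = x_range[1], max = x_range[2], bins_at_0.1 = approx_bins_01, bins_at_0.5 = approx_bins_05)
```

A very small width relative to the span gives many narrow, sparsely filled columns, while a very large width can hide the shape of the distribution.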

Outline columns with color:

Within geom_histogram(), we can specify color = to set a colored outline of the columns:

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_histogram(color = "red") +
    labs(x = "Sepal Length (in cm)")

Fill columns with color:

Within geom_histogram(), we can specify fill = to fill the columns with a color:

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_histogram(fill = "purple") +
    labs(x = "Sepal Length (in cm)")

A density plot is a smoothed visualization of the distribution of a numeric variable.

Basic:

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_density() +
    labs(x = "Sepal Length (in cm)")

Filled:

We can fill our plot with colour by specifying fill = within geom_density():

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_density(fill = "lightblue") +
    labs(x = "Sepal Length (in cm)")

Line Type & Width:

We can change the type and width of the line by specifying linetype = and linewidth = within geom_density():

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_density(linetype = 6, linewidth = 3) +
    labs(x = "Sepal Length (in cm)")

Data Visualisation - Bivariate Examples

Unlike in our marginal plots where we specified our x-axis variable within aes(), to visualise bivariate associations, we need to specify what variables we want on both our x- and y-axis.

We can use a scatterplot (since the variables are numeric and continuous) to visualise the association between the two numeric variables - these will be our x- and y-axis values.

We plot these values for each row of our dataset, and we should end up with a cloud of scattered points.

Here we will want to comment on any key observations that we notice, including if we detect outliers or points that do not fit with the pattern in the rest of the data. Outliers are extreme observations that are not possible values of a variable or that do not seem to fit with the rest of the data. This could either be:

marginally along one axis: points that have an unusual (too high or too low) x-coordinate or y-coordinate;
jointly: observations that do not fit with the rest of the point cloud
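Marginal outliers can also be flagged numerically. The sketch below assumes the common boxplot convention (points more than 1.5 times the interquartile range beyond the box), which is one possible rule rather than the only definition; joint outliers are usually easier to spot on the scatterplot itself.

```r
# Sepal width from the pre-loaded iris dataset
sw <- iris$Sepal.Width

# Values flagged as outliers under the 1.5 x IQR boxplot convention
marginal_out <- boxplot.stats(sw)$out

# Inspect the rows that contain those values
flagged_rows <- iris[sw %in% marginal_out, c("Sepal.Width", "Species")]
flagged_rows
```

Flagged values are candidates for a closer look, not automatic exclusions.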

Basic:

We need to specify + geom_point() to get a scatterplot:

ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
    geom_point() +
    labs(x = "Petal Length (in cm)", y = "Sepal Length (in cm)")

Fill points with color:

Within geom_point(), we can specify color = to fill the points with a color:

ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
    geom_point(color = "orange") +
    labs(x = "Petal Length (in cm)", y = "Sepal Length (in cm)")

Change size and opacity:

We can change the size (using size =) and the opacity (using alpha =) of our geom elements on the plot. Let’s include this below:

ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
    geom_point(size = 3, alpha = 0.5) +
    labs(x = "Petal Length (in cm)", y = "Sepal Length (in cm)")

Add a line of best fit:

We can superimpose (i.e., add) a line of best fit by adding a + geom_smooth() layer. Since we want to fit a straight line, we want to use method = "lm". We can also specify whether we want to display confidence intervals around our line by specifying se = TRUE / FALSE.

ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = "Petal Length (in cm)", y = "Sepal Length (in cm)")
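The straight line drawn with method = "lm" is a least-squares regression line of the y variable on the x variable. As a minimal sketch, the same intercept and slope can be obtained numerically with lm(); the object name fit is illustrative.

```r
# Fit the regression of sepal length on petal length
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Intercept and slope of the fitted line
coef(fit)

# The slope can also be written as r * (sd of y / sd of x)
slope_from_r <- cor(iris$Petal.Length, iris$Sepal.Length) *
    sd(iris$Sepal.Length) / sd(iris$Petal.Length)
slope_from_r
```

The second calculation links the line of best fit back to the correlation coefficient: the sign of the slope always matches the sign of the correlation.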

We can use a boxplot to visualise the association between one numeric variable and one categorical variable - these will be our y- and x-axis values respectively.

Basic:

We need to specify + geom_boxplot() to get a boxplot:

ggplot(data = iris, aes(x = Species, y = Sepal.Length)) +
    geom_boxplot() +
    labs(x = "Species", y = "Sepal Length (in cm)")

Change boxplot fill colours by group:

Within aes(), we can specify fill = Species to fill each box with a colour according to its group:

ggplot(data = iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
    geom_boxplot() +
    labs(x = "Species", y = "Sepal Length (in cm)")

Change boxplot line colours by group:

Within aes(), we can specify color = Species to colour the outline of each box according to its group:

ggplot(data = iris, aes(x = Species, y = Sepal.Length, color = Species)) +
    geom_boxplot() +
    labs(x = "Species", y = "Sepal Length (in cm)")

Adding jitter:

We can add jittered points to a boxplot to better see the underlying distribution of the data (by adding a little random variation to each data point) via geom_jitter():

ggplot(data = iris, aes(x = Species, y = Sepal.Length, color = Species)) +
    geom_boxplot() +
    geom_jitter() + 
    labs(x = "Species", y = "Sepal Length (in cm)")

Change legend position:

We can add + theme(legend.position = ) to move (or even remove) the legend by specifying, for example, "right", "left", "top", "bottom", or "none" (to remove it).

# legend at bottom of plot
ggplot(data = iris, aes(x = Species, y = Sepal.Length, color = Species)) +
    geom_boxplot() +
    labs(x = "Species", y = "Sepal Length (in cm)") + 
    theme(legend.position = "bottom")

When we have two numeric variables as well as one or more categorical variables, we can use facet_wrap() / facet_grid() to divide our plot into panels. If we had two categorical variables, we could further group our plots by stringing them together, specifying facet_wrap(~ cat_variable1 + cat_variable2).

Basic:

We need to specify + geom_point() to get a scatterplot, and either + facet_wrap() or + facet_grid() to separate by your categorical variable:

ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
    geom_point() +
    facet_wrap(~Species) + 
    labs(x = "Petal Length (in cm)", y = "Sepal Length (in cm)")

Add a line of best fit:

We can superimpose (i.e., add) a line of best fit by adding a + geom_smooth() layer. Since we want to fit a straight line, we want to use method = "lm". We can also specify whether we want to display confidence intervals around our line by specifying se = TRUE / FALSE. Note that a separate line is fitted for every level of your categorical variable:

ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    facet_wrap(~Species) +
    labs(x = "Petal Length (in cm)", y = "Sepal Length (in cm)")
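Because a separate line is fitted within each facet, each species has its own intercept and slope. A minimal sketch of the equivalent numeric summary, using base R to fit one regression per species (object names are illustrative):

```r
# Split the data by species and fit the same simple regression in each subset
fits_by_species <- lapply(split(iris, iris$Species), function(d) {
    lm(Sepal.Length ~ Petal.Length, data = d)
})

# Extract the slope from each fitted model
slopes <- sapply(fits_by_species, function(m) coef(m)[["Petal.Length"]])
slopes
```

Comparing these slopes shows how the association between petal length and sepal length differs across the panels.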

Subplot layout:

You can change the overall layout of the subplots by specifying dir = within facet_wrap(), where “h” will return a horizontal layout (this is the default) and “v” a vertical layout.

You can also change the position of the subplot labels by specifying strip.position = within facet_wrap(), where labels can be arranged to display at the “top” (this is the default), “bottom”, “left” or “right”.

ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
    geom_point() +
    facet_wrap(~Species, dir = "v", strip.position = "right") + 
    labs(x = "Petal Length (in cm)", y = "Sepal Length (in cm)")

Functions and Mathematical Models

Basic functions and mathematical models are foundational tools used to describe and predict associations between variables.

Identification & Specification

Deterministic Models - Description & Specification

Deterministic Models - Visualisation

Deterministic Models - Predicted Values

Statistical Models

Statistical models are used to understand the associations among variables.

Specifying Hypotheses

Simple Linear Regression Models - Description & Specification

The association between two variables (e.g., recall accuracy and age) will show deviations from the ‘average pattern’. Hence, we need to create a model that allows for deviations from the linear relationship - we need a statistical model.

A statistical model includes both a deterministic function and a random error term. We typically refer to the outcome (‘dependent’) variable with the letter $y$ and to our predictor (‘explanatory’/‘independent’) variables with the letter $x$. A simple (i.e., one $x$ variable only) linear regression model thus takes the following form, where $\beta_0$ and $\beta_1$ are numbers specifying where the line going through the data meets the y-axis (the intercept, $\beta_0$, i.e., the value of $y$ where $x = 0$) and its slope (the direction and gradient of the line, $\beta_1$):

Model Specification

\[ y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i \]

Model Specification: Annotated

\[ y_i = \underbrace{\beta_0 + \beta_1 \cdot x_i}_{\text{function of }x} + \underbrace{\epsilon_i}_{\text{random error}} \\ \]

\[ \quad \text{where} \quad \epsilon_i \sim N(0, \sigma) \text{ independently} \]

Model Specification: Explained

Let’s break down what $y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i \quad \text{where} \quad \epsilon_i \sim N(0, \sigma) \text{ independently}$ actually means by considering the statement in smaller parts:

$y_i = \beta_0 + \beta_1 \cdot x_i$
- $y_i$ is our measured outcome variable (our DV)
- $x_i$ is our measured predictor variable (our IV)
- $\beta_0$ is the model intercept
- $\beta_1$ is the model slope
$\epsilon_i \sim N(0, \sigma) \text{ independently}$
- $\epsilon_i$ is the residual error
- $\sim$ means ‘distributed according to’
- $N(0, \sigma) \text{ independently}$ means ‘normal distribution with a mean of 0 and a standard deviation of $\sigma$ (i.e., a variance of $\sigma^2$)’, with each error independent of the others
- Together, we can say that the errors around the line have a mean of zero and constant spread as x varies
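To make the model statement concrete, the sketch below simulates data from it and then fits the model. All numbers (an intercept of 2, a slope of 0.5, an error standard deviation of 2, and 1000 observations) are arbitrary values chosen for illustration.

```r
set.seed(123)

# Assumed "true" values, chosen arbitrarily for this illustration
n <- 1000
beta_0 <- 2
beta_1 <- 0.5
sigma_true <- 2

# Deterministic part plus independent normal errors with mean 0 and SD sigma
x <- runif(n, min = 0, max = 10)
epsilon <- rnorm(n, mean = 0, sd = sigma_true)
y <- beta_0 + beta_1 * x + epsilon

# Fit the simple linear regression and inspect the estimates
sim_fit <- lm(y ~ x)
coef(sim_fit)   # estimates of beta_0 and beta_1
sigma(sim_fit)  # estimate of the error standard deviation
```

The estimates should land close to, but not exactly on, the values used to generate the data, because each sample contains random error.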

In R

There are basically two pieces of information that we need to pass to the lm() function:

The formula: The regression formula should be specified in the form y ~ x where $y$ is the dependent variable (DV) and $x$ the independent variable (IV).
The data: Specify which dataframe contains the variables specified in the formula.

In R, the syntax of the lm() function can be specified as follows (where DV = dependent variable, IV = independent variable, and data_name = the name of your dataset):

Option A
Option B

model_name <- lm(DV ~ IV, data = data_name)

model_name <- lm(data_name$DV ~ data_name$IV)

You can also specify this as:

Option A
Option B

model_name <- lm(DV ~ 1 + IV, data = data_name)

model_name <- lm(data_name$DV ~ 1 + data_name$IV)

Why is there a 1 in the two bottom options?

When we specify the linear model in R, we include after the tilde sign ($\sim$), the variables that appear to the right of the $\hat \beta$s. The intercept, or $\beta_0$, is a constant. That is, we could write it as multiplied by 1.

Including the 1 explicitly is not necessary because it is included by default (you can check this by comparing the outputs of A & B above with and without the 1 included - the estimates are the same!). After a while, you will find you just want to drop the 1 when calling lm() because you know that it’s going to be there, but in these early weeks we tried to keep it explicit to make it clear that you want the intercept to be estimated.
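As a quick check of this point, the following sketch fits the same model to the iris dataset with and without the explicit 1 and compares the estimates; the model names are illustrative.

```r
# Intercept included implicitly (the default)
m_implicit <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Intercept requested explicitly with 1
m_explicit <- lm(Sepal.Length ~ 1 + Petal.Length, data = iris)

# The two sets of estimates are identical
coef(m_implicit)
coef(m_explicit)
same_estimates <- all.equal(coef(m_implicit), coef(m_explicit))
same_estimates
```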

Numeric Outcomes & Predictors

Simple Linear Regression Models - Example

Multiple Linear Regression Models - Description & Specification

Multiple linear regression involves looking at one continuous outcome (i.e., DV), with two or more independent variables (i.e., IVs).

A multiple linear regression model takes the following form:

\[ y_i = \beta_0 + \beta_1 \cdot x_{1i} + \beta_2 \cdot x_{2i} + \epsilon_i \]

\[ \quad \text{where} \quad \epsilon_i \sim N(0, \sigma) \text{ independently} \]

So, for example, we could extend our recall accuracy model to include recall confidence as a predictor:

\[ \text{Recall Accuracy}_i = \beta_0 + \beta_1 \cdot \text{Recall Confidence}_i + \beta_2 \cdot \text{Age}_i + \epsilon_i \]

In R:

Multiple and simple linear regression follow the same structure within the lm() function - the logic scales up to however many predictor variables we want to include in our model. You simply add (using the + sign) more independent variables. For example, if we wanted to build a multiple linear regression that included three independent variables, we could fit one of the following via the lm() function:

Option A
Option B

model_name <- lm(DV ~ IV1 + IV2 + IV3, data = data_name)

model_name <- lm(data_name$DV ~ data_name$IV1 + data_name$IV2 + data_name$IV3)
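As a concrete instance of Option A, the sketch below fits a model with three predictors to the pre-loaded iris dataset. The choice of outcome and predictors is purely illustrative.

```r
# One numeric outcome, three numeric predictors
iris_mlr <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Sepal.Width, data = iris)

# One intercept plus one coefficient per predictor
coef(iris_mlr)
```

Each additional predictor separated by + adds one coefficient to the fitted model.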

Interpretation of Multiple Regression Coefficients

You’ll hear a lot of different ways that people explain multiple regression coefficients.

For the model $y = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \epsilon$, the estimate $\hat \beta_1$ will often be reported as:

“the increase in $y$ for a one unit increase in $x_1$ when…”

“holding the effect of $x_2$ constant.”
“controlling for differences in $x_2$.”
“partialling out the effects of $x_2$.”
“holding $x_2$ equal.”
“accounting for effects of $x_2$.”

For models with 3+ predictors, just like building the model in R, the logic of the above simply extends.

For example “the increase in [outcome] for a one unit increase in [predictor] when…”

“holding [other predictors] constant.”
“accounting for [other predictors].”
“controlling for differences in [other predictors].”
“partialling out the effects of [other predictors].”
“holding [other predictors] equal.”
“accounting for effects of [other predictors].”
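The phrase “holding constant” can be demonstrated numerically. In this minimal sketch (using iris, with illustrative object names and arbitrary predictor values), two hypothetical flowers differ by one unit in petal length but have the same sepal width, so their predicted values differ by exactly the petal length coefficient.

```r
# Model with two predictors
fit2 <- lm(Sepal.Length ~ Petal.Length + Sepal.Width, data = iris)

# Two hypothetical flowers: petal length differs by 1, sepal width is held equal
new_flowers <- data.frame(Petal.Length = c(4, 5), Sepal.Width = c(3, 3))

# Predicted sepal length for each flower
preds <- predict(fit2, newdata = new_flowers)

# Difference in predictions vs. the Petal.Length coefficient
pred_difference <- unname(diff(preds))
pred_difference
coef(fit2)[["Petal.Length"]]
```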

Simple Linear Regression Models - Visualisation

Multiple Linear Regression Models - Visualisation

Numeric Outcomes & Categorical Predictors

Overview

Coding Variables as Factors

When we have categorical predictors, it is important that we tell R specifically to code them appropriately as factors.

In R

We can use various functions to convert between different types of data, such as:

factor() / as_factor() - to turn a variable into a factor
as.numeric() - to turn a variable into numbers

As a first step, it is a good idea to look at the structure of the dataset you are working with. For the purpose of this example, our dataset is called “tips” (you might recall this from DAPR1):

str(tips)

spc_tbl_ [157 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Bill  : num [1:157] 23.7 36.1 32 17.4 15.4 ...
 $ Tip   : num [1:157] 10 7 5.01 3.61 3 2.5 3.44 2.42 3 2 ...
 $ Credit: chr [1:157] "n" "n" "y" "y" ...
 $ Guests: num [1:157] 2 3 2 2 2 2 2 2 2 2 ...
 $ Day   : chr [1:157] "f" "f" "f" "f" ...
 $ Server: chr [1:157] "A" "B" "A" "B" ...
 $ PctTip: num [1:157] 42.2 19.4 15.7 20.8 19.5 13.4 16 12.4 12.7 10.7 ...
 - attr(*, "spec")=
  .. cols(
  ..   Bill = col_double(),
  ..   Tip = col_double(),
  ..   Credit = col_character(),
  ..   Guests = col_double(),
  ..   Day = col_character(),
  ..   Server = col_character(),
  ..   PctTip = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

From the output, we can see that Credit (whether guests paid with a credit card; n/y responses) was coded as a <chr> or character variable. If we wanted to set this as a factor so that R recognises it as a categorical variable, we can use one of the following:

as_factor()
factor()

tips <- tips %>% 
  mutate(Credit = as_factor(Credit))

We could also use the factor() function, and at the same time label the levels appropriately to aid reader interpretation (it may not be immediately clear to some that n represents ‘No’ and y represents ‘Yes’). To do so, we list all the levels of Credit, and provide a new label corresponding to each level:

tips$Credit <- factor(tips$Credit, 
                      levels = c("n", "y"),
                      labels = c("No", "Yes"))

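Using either of the above approaches, if we now run str(tips) again, you should see that Credit is now coded as a factor with 2 levels (the output below is from the as_factor() approach; with the factor() approach, the levels would instead be labelled “No” and “Yes”):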

str(tips)

tibble [157 × 7] (S3: tbl_df/tbl/data.frame)
 $ Bill  : num [1:157] 23.7 36.1 32 17.4 15.4 ...
 $ Tip   : num [1:157] 10 7 5.01 3.61 3 2.5 3.44 2.42 3 2 ...
 $ Credit: Factor w/ 2 levels "n","y": 1 1 2 2 1 1 1 1 1 1 ...
 $ Guests: num [1:157] 2 3 2 2 2 2 2 2 2 2 ...
 $ Day   : chr [1:157] "f" "f" "f" "f" ...
 $ Server: chr [1:157] "A" "B" "A" "B" ...
 $ PctTip: num [1:157] 42.2 19.4 15.7 20.8 19.5 13.4 16 12.4 12.7 10.7 ...
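Since the tips dataset is not built into R, here is a minimal self-contained sketch of the factor() approach on a small made-up data frame. The data frame payments and its values are hypothetical and stand in for the Credit column.

```r
# A small made-up data frame with a character column (illustration only)
payments <- data.frame(
    Credit = c("n", "n", "y", "y", "n"),
    stringsAsFactors = FALSE
)

# Convert to a factor, giving each level a more readable label
payments$Credit <- factor(payments$Credit,
                          levels = c("n", "y"),
                          labels = c("No", "Yes"))

# Check the result
levels(payments$Credit)
table(payments$Credit)
```

The order given in levels = also determines which category R treats as the first (reference) level.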

Binary Predictors

Categorical Predictors with k levels

When we have a categorical explanatory variable with more than 2 levels, our model gets a bit more complex - it needs not just one, but a number of dummy variables. For a categorical variable with $k$ levels, we can express it in $k-1$ dummy variables.

For example, the “species” column below has three levels, and can be expressed by the two variables “species_dog” and “species_parrot”:

  species species_dog species_parrot
1     cat           0              0
2     cat           0              0
3     dog           1              0
4  parrot           0              1
5     dog           1              0
6     cat           0              0
7     ...         ...            ...

The “cat” level is expressed whenever both the “species_dog” and “species_parrot” variables are 0.
The “dog” level is expressed whenever the “species_dog” variable is 1 and the “species_parrot” variable is 0.
The “parrot” level is expressed whenever the “species_dog” variable is 0 and the “species_parrot” variable is 1.

R will do all of this re-expression for us. If we include in our model a categorical explanatory variable with 3 different levels, the model will estimate 2 parameters (in addition to the intercept) - one for each dummy variable. We can interpret the parameter estimates (the coefficients we obtain using coefficients(), coef() or summary()) as the estimated increase in the outcome variable associated with an increase of one in each dummy variable (holding all other variables equal).

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	60.28	1.209	49.86	5.273e-39
speciesdog	-11.47	1.71	-6.708	3.806e-08
speciesparrot	-4.916	1.71	-2.875	0.006319

Recall that the intercept is the estimated outcome when all predictors are zero. In our example then, this represents the cat. We think of the “cat” category in this example as the reference level - it is the category against which the other categories are compared. Therefore, in the above example, an increase of 1 in “species_dog” is the difference between a “cat” and a “dog”. An increase of 1 in “species_parrot” is the difference between a “cat” and a “parrot”.
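The dummy variables that R builds behind the scenes can be inspected with model.matrix(). This is a minimal sketch on a made-up data frame mirroring the first six rows of the species example; the data frame pets is hypothetical, and dummy (treatment) coding is requested explicitly so the result does not depend on session settings.

```r
# Made-up data mirroring the species example (illustration only)
pets <- data.frame(
    species = factor(c("cat", "cat", "dog", "parrot", "dog", "cat"))
)

# Build the design matrix that lm() would use for a model with species
dummies <- model.matrix(~ species, data = pets,
                        contrasts.arg = list(species = "contr.treatment"))
dummies
```

The columns speciesdog and speciesparrot correspond to the two dummy variables, and “cat” (the first level alphabetically) is the reference level, represented by zeros in both.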

Dummy vs Effects Coding

Name	Constraint	Meaning of \(\beta_0\)	R
Sum to zero (Effects Coding)	\(\beta_1 + \beta_2 + \beta_3 = 0\)	\(\beta_0 = \mu\)	`contr.sum`
Reference group (Dummy Coding)	\(\beta_1 = 0\)	\(\beta_0 = \mu_1\)	`contr.treatment`
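The two coding schemes in the table can be compared directly by printing the contrast matrices R uses for a factor with three levels. A minimal sketch, with illustrative object names:

```r
# Dummy (treatment) coding: the first level is the reference (all zeros)
dummy_codes <- contr.treatment(3)
dummy_codes

# Effects (sum-to-zero) coding: the last level is coded -1 in every column
effects_codes <- contr.sum(3)
effects_codes

# Each column of the effects coding matrix sums to zero
colSums(effects_codes)
```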

Categorical Predictors - Interpretation

Let’s apply this to a different example - the iris dataset, and specifically, the Species categorical variable:

#check what levels we have
levels(iris$Species)

[1] "setosa"     "versicolor" "virginica"

Dummy Coding
Effects Coding

From the above, we can see that Species has 3 levels - “setosa”, “versicolor”, and “virginica”. If we put these into a model, assuming R’s default ordering, we know that R will automatically apply dummy (or treatment) coding, and “setosa” will be taken as our reference group:

#fit model
spec_model <- lm(Sepal.Length ~ Species, data = iris)
summary(spec_model)


Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

Let’s first map our coefficients and estimates:

Coefficient	Estimate	Corresponds to
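(Intercept)	5.0060	$\hat \beta_0 = \hat \mu_1$
Speciesversicolor	0.9300	$\hat \beta_1 = \hat \mu_2 - \hat \mu_1$, so $\hat \beta_0 + \hat \beta_1 = \hat \mu_2$
Speciesvirginica	1.5820	$\hat \beta_2 = \hat \mu_3 - \hat \mu_1$, so $\hat \beta_0 + \hat \beta_2 = \hat \mu_3$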

The first estimate corresponds to (Intercept) and was $\hat \beta_0 = \hat \mu_1 = 5.01$. The estimated mean sepal length for the species setosa was approximately $5.01~cm$.
The second estimate corresponds to Speciesversicolor and was $\hat \beta_1 = 0.93$. The difference in mean sepal length between setosa and versicolor species was estimated to be $0.93~cm$. Thus, $\hat \mu_2 = 5.01 + 0.93 = 5.94$. We could say - the species iris versicolor had a mean sepal length of approximately 5.94cm, and this was approximately 0.93cm longer than the iris setosa. This difference was statistically significant $(p < .001)$.
The third estimate corresponds to Speciesvirginica and was $\hat \beta_2 = 1.58$. The difference in mean sepal length between setosa and virginica species was estimated to be $1.58~cm$. Thus, $\hat \mu_3 = 5.01 + 1.58 = 6.59$. We could say - the species iris virginica had a mean sepal length of approximately 6.59cm, and this was approximately 1.58cm longer than the iris setosa. This difference was statistically significant $(p < .001)$.
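The mapping between dummy-coded coefficients and group means can be verified directly. In this minimal sketch, the contrasts argument of lm() requests treatment coding explicitly, and the object names are illustrative.

```r
# Fit the model with dummy (treatment) coding requested explicitly
dummy_fit <- lm(Sepal.Length ~ Species, data = iris,
                contrasts = list(Species = "contr.treatment"))
b <- coef(dummy_fit)

# Reconstruct each group mean from the coefficients
means_from_model <- c(
    setosa = b[[1]],              # intercept = reference group mean
    versicolor = b[[1]] + b[[2]], # intercept + versicolor difference
    virginica = b[[1]] + b[[3]]   # intercept + virginica difference
)

# Compare with the observed group means
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
round(means_from_model, 2)
round(group_means, 2)
```

The reconstructed values match the grouped means from the descriptives tables earlier (5.01, 5.94, and 6.59 when rounded).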

First we need to tell R to apply effects (or sum-to-zero) coding and check the ordering of the levels:

contrasts(iris$Species) <- "contr.sum"
contrasts(iris$Species)

           [,1] [,2]
setosa        1    0
versicolor    0    1
virginica    -1   -1

Then we can run our model:

#fit model
spec_model2 <- lm(Sepal.Length ~ Species, data = iris)
summary(spec_model2)


Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.84333    0.04203 139.020   <2e-16 ***
Species1    -0.83733    0.05944 -14.086   <2e-16 ***
Species2     0.09267    0.05944   1.559    0.121    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

           versicolor virginica
setosa              0         0
versicolor          1         0
virginica           0         1

Let’s first map our coefficients and estimates:

Coefficient	Estimate	Corresponds to
(Intercept)	5.84333	$\beta_0 = \frac{\mu_1 + \mu_2 + \mu_3}{3} = \mu$
Species1	-0.83733	$\beta_1 = \mu_1 - \mu$
Species2	0.09267	$\beta_2 = \mu_2 - \mu$

The first estimate corresponding to (Intercept) contains $\hat \beta_0 = \hat \mu = 5.84$. The estimated average sepal length across iris species was approximately $5.84~cm$.
The second estimate corresponds to Species1 and was $\hat \beta_1 = -0.84$. The difference in mean sepal length between setosa $(\hat \mu_1)$ and the grand mean $(\hat \mu)$ was estimated to be $-0.84~cm$. In other words, the iris species setosa had a sepal length $0.84~cm$ shorter than average, where its length was estimated to be $5.84333 + (-0.83733) = 5.01~cm$. This difference in length was statistically significant $(p < .001)$.
The third estimate corresponds to Species2 and was $\hat \beta_2 = 0.09$. The difference in mean sepal length between versicolor $(\hat \mu_2)$ and the grand mean $(\hat \mu)$ was estimated to be $0.09~cm$. In other words, the iris species versicolor had a sepal length $0.09~cm$ longer than average, where its length was estimated to be $5.84333 + 0.09267 = 5.94~cm$. This difference in length was not statistically significant $(p = .121)$.
The estimate for Species3, representing the difference between “virginica” and the grand mean, is not shown by summary(). Because of the side-constraint, we know that $\beta_3 = -(\beta_1 + \beta_2)$, and so $\mu_3 = \beta_0 - (\beta_1 + \beta_2)$. The difference in sepal length between virginica and the grand mean was estimated to be $-(-0.83733 + 0.09267) = 0.74466$. In other words, the virginica iris species had a sepal length $0.74~cm$ longer than average, where its length was estimated to be $5.84333 - (-0.83733 + 0.09267) = 6.59~cm$.
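
The side-constraint can be checked numerically. The following is a minimal base R sketch, with illustrative object names (`iris_chk`, `mdl_sum`, `deviations`), that recovers the omitted third deviation and compares all three deviations with the group means minus the grand mean. Sum-to-zero coding is requested inside `lm()` rather than by modifying `iris` itself.

```r
# Fresh copy of the data; sum-to-zero coding is set inside lm()
iris_chk <- datasets::iris

mdl_sum <- lm(Sepal.Length ~ Species, data = iris_chk,
              contrasts = list(Species = "contr.sum"))
b <- coef(mdl_sum)   # (Intercept), Species1, Species2

# Deviation for the last level follows from the sum-to-zero constraint
b3 <- -(b["Species1"] + b["Species2"])

# Observed group means and their (unweighted) grand mean
group_means <- tapply(iris_chk$Sepal.Length, iris_chk$Species, mean)
grand_mean  <- mean(group_means)

deviations <- c(setosa     = unname(b["Species1"]),
                versicolor = unname(b["Species2"]),
                virginica  = unname(b3))

round(deviations, 3)                 # deviations from the coefficients
round(group_means - grand_mean, 3)   # deviations from the raw means
```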

Specifying Reference Levels

Interaction Models

Specifying Interaction Models

Interpreting Coefficients

Interpreting coefficients for A and B in the presence of an interaction A:B
Interpreting the interaction term A:B

When you include an interaction between $x_1$ and $x_2$ in a regression model, you are estimating the extent to which the effect of $x_1$ on $y$ is different across the values of $x_2$.

What this means is that the effect of $x_1$ on $y$ depends on/is conditional upon the value of $x_2$ (and vice versa, the effect of $x_2$ on $y$ is different across the values of $x_1$).

This means that we can no longer talk about the “effect of $x_1$ holding $x_2$ constant”. Instead we can talk about a marginal effect of $x_1$ on $y$ at a specific value of $x_2$.

When we fit the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \cdot x_2) + \epsilon$ using lm():

the parameter estimate $\hat \beta_1$ is the marginal effect of $x_1$ on $y$ where $x_2 = 0$
the parameter estimate $\hat \beta_2$ is the marginal effect of $x_2$ on $y$ where $x_1 = 0$

In other words, when we fit a model with an interaction in R, we get out coefficients for both predictors, and for the interaction. The coefficients for each individual predictor reflect the effect on the outcome when the other predictor is zero.

N.B. Regardless of whether or not there is an interaction term in our model, all parameter estimates in multiple regression are “conditional” in the sense that they are dependent upon the inclusion of other variables in the model. For instance, in $y = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \epsilon$ the coefficient $\hat \beta_1$ is conditional upon holding $x_2$ constant.

The coefficient for an interaction term can be thought of as providing an adjustment to the slope.

Let’s take the following model as an example:

lm(formula = y ~ x1 * x2, data = df)
...

...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)  ...        ....       ...      ...  
x1           ...        ....       ...      ...  
x2           ...        ....       ...      ...  
x1:x2        ...        ....       ...      ...  
---

These coefficients can be interpreted, in turn as:

Coefficient	Interpretation
`(Intercept)`	the estimated $y$ when all predictors ($x_1$ and $x_2$) are zero is [estimate]
`x1`	when $x_2$ is zero, a 1 unit increase in $x_1$ is associated with a [estimate] change in $y$
`x2`	when $x_1$ is zero, a 1 unit increase in $x_2$ is associated with a [estimate] change in $y$.
`x1:x2`	as $x_2$ increases by 1, the association between $x_1$ and $y$ changes by [estimate] or as $x_1$ increases by 1, the association between $x_2$ and $y$ changes by [estimate]
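
To make the "adjustment to the slope" reading concrete, here is a small sketch on simulated data (the data, the true coefficient values and the helper `slope_x1_at()` are all made up for illustration). It computes the slope of `x1` at chosen values of `x2` as $\hat \beta_1 + \hat \beta_3 \cdot x_2$, and cross-checks one of them by shifting `x2` so that zero corresponds to the value of interest.

```r
set.seed(1)

# Simulated data with a known interaction (values chosen arbitrarily)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 1 * df$x2 + 0.5 * df$x1 * df$x2 + rnorm(n)

mdl_int <- lm(y ~ x1 * x2, data = df)
b <- coef(mdl_int)

# Slope of x1 at a given value of x2: beta1 + beta3 * x2
slope_x1_at <- function(x2_value) unname(b["x1"] + b["x1:x2"] * x2_value)
slope_x1_at(c(0, 1, 2))

# Cross-check: shift x2 so that zero now means "x2 = 2", then refit.
# The coefficient for x1 in the refitted model is the slope of x1 at x2 = 2.
df$x2_shift <- df$x2 - 2
mdl_shift <- lm(y ~ x1 * x2_shift, data = df)
coef(mdl_shift)["x1"]
```

The same logic is why mean centering a predictor changes the coefficient of the other predictor in the interaction, but not the interaction coefficient itself.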

What if there are other things (e.g., other predictors/covariates) in the model too?

Note that the interaction x1:x2 changes how we interpret the individual coefficients for x1 and x2.

It does not change how we interpret coefficients for other predictors that might be in our model. For variables that aren’t involved in the interaction term, these are still held constant.

For example, suppose we also had another predictor $c_1$ in our model:

lm(y ~ c1 + x1 + x2 + x1:x2)

Coefficient	Interpretation
`(Intercept)`	the estimated $y$ when all predictors ($c_1$, $x_1$ and $x_2$) are zero is [estimate]
`c1`	a 1 unit increase in $c_1$ is associated with a [estimate] increase in $y$, holding constant all other variables in the model ($x_1$ and $x_2$)
`x1`	holding $c_1$ constant, when $x_2$ is zero, a 1 unit increase in $x_1$ is associated with a [estimate] change in $y$
`x2`	holding $c_1$ constant, when $x_1$ is zero, a 1 unit increase in $x_2$ is associated with a [estimate] change in $y$.
`x1:x2`	holding $c_1$ constant, as $x_2$ increases by 1, the association between $x_1$ and $y$ changes by [estimate] or holding $c_1$ constant, as $x_1$ increases by 1, the association between $x_2$ and $y$ changes by [estimate]

Example Data

Numeric x Categorical Example

Research Question

Does the association between body mass and flipper length differ between species of penguin?

Visualise Data

Model Specification

Model Building

Results Interpretation

$\beta_0$ = (Intercept) = 4060.55

The intercept, or predicted body mass when flipper length was average and species was Adelie.
- An Adelie penguin with an average flipper length was expected to have a body mass of $4060.55g$.

$\beta_1$ = mc_flipper_length_mm = 32.83

The simple slope of flipper length (in mm) for species reference group (Adelie).
- For an Adelie penguin, every 1 additional mm in flipper length was associated with a significant $32.83g$ increase in their body mass $(p < .001)$.

$\beta_2$ = speciesChinstrap = -151.42

The simple effect of species (or the difference in body mass between Adelie and Chinstrap penguins) when flipper length was average.
- A Chinstrap penguin with an average flipper length had a body mass $151.42g$ lighter than an Adelie penguin with the same flipper length. Note that this difference was not statistically different from zero $(p = .062)$.

$\beta_3$ = speciesGentoo = 126.66

The simple effect of species (or the difference in body mass between Adelie and Gentoo penguins) when flipper length was average.
- A Gentoo penguin with an average flipper length had a body mass $126.66g$ heavier than an Adelie penguin with the same flipper length. Note that this difference was not statistically different from zero $(p = .242)$.

$\beta_4$ = mc_flipper_length_mm:speciesChinstrap = 1.74

The interaction between flipper length (in mm; mean centered) and species (levels: Adelie/Chinstrap). This is the estimated difference in simple slopes of flipper length for Adelie vs. Chinstrap penguins.
- In comparison to Adelie penguins, for a Chinstrap penguin every 1 additional mm in flipper length was associated with a $1.74g$ greater increase in their body mass. Note that this adjustment was not statistically different from zero $(p = .825)$.

$\beta_5$ = mc_flipper_length_mm:speciesGentoo = 21.79

The interaction between flipper length (in mm; mean centered) and species (levels: Adelie/Gentoo). This is the estimated difference in simple slopes of flipper length for Adelie vs. Gentoo penguins.
- In comparison to Adelie penguins, for a Gentoo penguin every 1 additional mm in flipper length was associated with a $21.79g$ greater increase in their body mass. Note that this adjustment was statistically significant $(p = .002)$.
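
Because the interaction coefficients are adjustments to the reference-group slope, the simple slope for each species can be obtained by addition. The sketch below does this arithmetic using the rounded estimates reported above (the object names are illustrative, and the values are typed in by hand rather than extracted from a fitted model).

```r
# Estimates as reported in the interpretation above
b_flipper           <- 32.83   # slope of flipper length for Adelie (reference)
b_flipper_chinstrap <- 1.74    # slope adjustment for Chinstrap
b_flipper_gentoo    <- 21.79   # slope adjustment for Gentoo

# Simple slope of flipper length (g per mm) within each species
simple_slopes <- c(Adelie    = b_flipper,
                   Chinstrap = b_flipper + b_flipper_chinstrap,
                   Gentoo    = b_flipper + b_flipper_gentoo)
simple_slopes
```

On these estimates, each additional mm of flipper length corresponds to roughly $34.57g$ more body mass for Chinstrap penguins and $54.62g$ for Gentoo penguins.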

Model Visualisation

Numeric x Numeric Example

Research Question

Does the influence of bill length on body mass vary depending on flipper length?

Visualise Data

Model Specification

Model Building

Results Interpretation

Model Visualisation

We can do this using the probe_interaction() function from the interactions package.

In terms of specification, it might be useful to look up the help page for the function (i.e., ?probe_interaction). As a quick guide:

model =: The name of the model to be used
pred =: The continuous predictor variable that will appear on the x-axis
modx =: The continuous moderator variable
interval =: If we say TRUE, then confidence/prediction intervals will be plotted around the line
jnplot =: Since we are looking at a numeric x numeric interaction, we want to specify that this is TRUE

Remember to give your plot informative titles/labels. You, for example, likely want to give your plot:

a clear and concise title (specify main.title =)
axis labels with units or scale included (specify x.label = and y.label =)
a legend title (specify legend.main =)

library(interactions)

plt_mdl_nn <- probe_interaction(model = mdl_nn,
                                pred = mc_flipper_length_mm,
                                modx = mc_bill_length_mm,
                                cond.int = TRUE,
                                interval = TRUE,
                                jnplot = TRUE,
                                main.title = "Bill Length Moderating the Effect of Flipper Length on Body Mass",
                                x.label = "Flipper Length (in mm; Mean Centered)",
                                y.label = "Body Mass (in g)",
                                legend.main = "Bill Length (in mm; Mean Centered)")

From the above, we can choose to extract different information/visualisations of simple slopes (this will likely be dependent upon the question(s) you are trying to answer): the interaction plot, the simple slopes analysis only, the Johnson-Neyman plot only, or both the simple slopes and the Johnson-Neyman plot.

The default simple slopes analysis selects $z$-values for us at which to test the slope. The defaults are: the mean of $Z$, and $+1~SD$ and $-1~SD$ from the mean:

plt_mdl_nn$interactplot

Here we can look at the significance of each slope:

plt_mdl_nn$simslopes$slopes

  Value of mc_bill_length_mm     Est.     S.E.     2.5%    97.5%   t val.
1              -5.459584e+00 38.84037 3.185723 32.57403 45.10672 12.19201
2               1.018029e-15 45.39104 2.108209 41.24417 49.53790 21.53061
3               5.459584e+00 51.94170 2.222151 47.57071 56.31269 23.37452
             p
1 1.379943e-28
2 2.378006e-65
3 1.402468e-72
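
For intuition about where such a table comes from, the following is a base R sketch of a simple slopes calculation on simulated data. It is not the code used by the interactions package, and the data and object names (`dat`, `mdl_sim`, `z_vals`) are made up. It assumes the usual result that the slope of $X$ at moderator value $z$ is $\hat \beta_X + \hat \beta_{XZ} \cdot z$, with variance $Var(\hat \beta_X) + z^2 Var(\hat \beta_{XZ}) + 2z \, Cov(\hat \beta_X, \hat \beta_{XZ})$.

```r
set.seed(2)

# Simulated data with a numeric x numeric interaction (arbitrary values)
n   <- 300
dat <- data.frame(x = rnorm(n), z = rnorm(n, sd = 2))
dat$y <- 10 + 3 * dat$x + 1 * dat$z + 0.8 * dat$x * dat$z + rnorm(n)

mdl_sim <- lm(y ~ x * z, data = dat)
b <- coef(mdl_sim)
V <- vcov(mdl_sim)   # variance-covariance matrix of the coefficients

# Probe the slope of x at the mean of z and at -1 SD / +1 SD
z_vals <- c(mean(dat$z) - sd(dat$z), mean(dat$z), mean(dat$z) + sd(dat$z))

est   <- b["x"] + b["x:z"] * z_vals
se    <- sqrt(V["x", "x"] + z_vals^2 * V["x:z", "x:z"] + 2 * z_vals * V["x", "x:z"])
t_val <- est / se
p_val <- 2 * pt(abs(t_val), df = df.residual(mdl_sim), lower.tail = FALSE)

data.frame(z = z_vals, est = est, se = se, t = t_val, p = p_val)
```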

The Johnson-Neyman plot allows us to visualise the regions of significance - i.e., it identifies the range of the moderator variable $(Z)$ where the effect of the independent variable $(X)$ on the dependent variable $(Y)$ is statistically significant (e.g., $p < .05$). Outwith these regions, the effect of the independent variable is not significant.

Pointers to help with interpretation of the plot:

x-axis = Values of moderator variable $(Z)$
y-axis = The conditional effect (slope) of the independent variable $(X)$ on the dependent variable $(Y)$
Bold black line (range of observed data) = The actual range of the moderator variable $(Z)$ values within the dataset. This helps with interpretation of results, and more importantly, avoid extrapolation - i.e., should help to ensure that the interpretations of the plot are data-driven and based on the actually observed data
Zero line = The horizontal line at $y = 0$ indicates the point where the effect of the IV $(X)$ on the DV $(Y)$ is neither positive nor negative
Shaded areas = Regions where the effect is significant (e.g., outside the bounds of 95% confidence intervals that include zero) are highlighted in blue. Regions where the effect is non-significant (e.g., crosses the zero line, inside the bounds of the 95% confidence intervals that include zero) are highlighted in red

plt_mdl_nn$simslopes$jnplot
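
The idea behind the regions of significance can be sketched in base R by testing the conditional slope over a fine grid of moderator values. This is an illustrative approximation on simulated data, not the exact calculation performed by the interactions package, and all object names are hypothetical. The data are simulated so that the slope of `x` changes sign within the observed range of `z`.

```r
set.seed(3)

# Simulated data: the true slope of x is 0.5 + 0.8 * z, which is zero near z = -0.625
n   <- 300
dat <- data.frame(x = rnorm(n), z = runif(n, -3, 3))
dat$y <- 0.5 * dat$x + 0.3 * dat$z + 0.8 * dat$x * dat$z + rnorm(n)

mdl_sim <- lm(y ~ x * z, data = dat)
b <- coef(mdl_sim)
V <- vcov(mdl_sim)

# Conditional slope of x, its standard error and p-value across the observed range of z
z_grid <- seq(min(dat$z), max(dat$z), length.out = 500)
slope  <- b["x"] + b["x:z"] * z_grid
se     <- sqrt(V["x", "x"] + z_grid^2 * V["x:z", "x:z"] + 2 * z_grid * V["x", "x:z"])
p_val  <- 2 * pt(abs(slope / se), df = df.residual(mdl_sim), lower.tail = FALSE)

# Approximate region of z where the slope of x is NOT significant at alpha = .05
nonsig <- z_grid[p_val >= .05]
range(nonsig)
```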

When we return both the simple slopes analysis and Johnson-Neyman plot, we can see that some text is also provided to aid our interpretation of the plot:

plt_mdl_nn$simslopes

JOHNSON-NEYMAN INTERVAL

When mc_bill_length_mm is OUTSIDE the interval [-83.07, -23.71], the slope
of mc_flipper_length_mm is p < .05.

Note: The range of observed values of mc_bill_length_mm is [-11.82, 15.68]

SIMPLE SLOPES ANALYSIS

When mc_bill_length_mm = -5.459584e+00 (- 1 SD): 

                                         Est.    S.E.   t val.      p
----------------------------------- --------- ------- -------- ------
Slope of mc_flipper_length_mm           38.84    3.19    12.19   0.00
Conditional intercept                 4076.93   42.62    95.65   0.00

When mc_bill_length_mm =  1.018029e-15 (Mean): 

                                         Est.    S.E.   t val.      p
----------------------------------- --------- ------- -------- ------
Slope of mc_flipper_length_mm           45.39    2.11    21.53   0.00
Conditional intercept                 4141.49   26.45   156.56   0.00

When mc_bill_length_mm =  5.459584e+00 (+ 1 SD): 

                                         Est.    S.E.   t val.      p
----------------------------------- --------- ------- -------- ------
Slope of mc_flipper_length_mm           51.94    2.22    23.37   0.00
Conditional intercept                 4206.05   35.60   118.14   0.00

In our example, we could say:

Example Interpretation

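The association between body mass (in g) and flipper length (in mm; mean centered) was significant when bill length (in mm; mean centered) was below $-83.07$mm (i.e., more than 83.07mm below the mean) or above $-23.71$mm (i.e., anywhere above 23.71mm below the mean). Because the observed values of mean-centered bill length ranged from $-11.82$mm to $15.68$mm, the association was significant across the entire observed range.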

Categorical x Categorical Example

Research Question

Do differences in body mass between species differ by sex?

Visualise Data

Model Specification

Model Building

Results Interpretation

$\beta_0$ = (Intercept) = 3368.84

The intercept, or predicted body mass of a female Adelie (i.e., when both predictor variables are at their reference levels).
- A female Adelie penguin was expected to have a body mass of $3368.84g$.

$\beta_1$ = speciesChinstrap = 158.37

The difference in body mass between female Adelie and Chinstrap penguins.
- In comparison to female Adelie penguins, Chinstrap penguins of the same sex were significantly $(p < .05)$ heavier by $158.37g$.

$\beta_2$ = speciesGentoo = 1310.91

The difference in body mass between female Adelie and Gentoo penguins.
- In comparison to female Adelie penguins, Gentoo penguins of the same sex were significantly $(p < .001)$ heavier by $1310.91g$.

$\beta_3$ = sexmale = 674.66

The difference in body mass between female and male Adelie penguins.
- In comparison to female Adelie penguins, male Adelie penguins were significantly heavier by $674.66g$ $(p < .001)$.

$\beta_4$ = speciesChinstrap:sexmale = -262.89

The difference between Chinstrap and Adelie penguins in the male-female difference in body mass (i.e., a difference of differences), estimated to be $-262.89g$.
- The difference in body mass between female and male penguins was $262.89g$ less for Chinstrap penguins in comparison to Adelie penguins. This difference was statistically significant $(p < .05)$.
- The difference in body mass between female and male penguins was significantly less for the Chinstrap species in comparison to Adelie (where the difference between the two sexes was $262.89g$ smaller).

$\beta_5$ = speciesGentoo:sexmale = 130.44

The difference between Gentoo and Adelie penguins in the male-female difference in body mass (i.e., a difference of differences), estimated to be $130.44g$.
- The difference in body mass between female and male penguins was $130.44g$ more for Gentoo penguins in comparison to Adelie penguins. This difference was not statistically significant $(p = .089)$.
- The difference in body mass between female and male penguins was more for the Gentoo species in comparison to Adelie (where the difference between the two sexes was $130.44g$ larger), though this difference was not statistically significant.
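
One way to see how the six coefficients fit together is to rebuild the six cell means from them. The sketch below does this by hand using the rounded estimates reported above (the object names are illustrative, and the values are typed in rather than extracted from the fitted model).

```r
# Estimates as reported above
b0 <- 3368.84   # (Intercept): female Adelie
b1 <- 158.37    # speciesChinstrap
b2 <- 1310.91   # speciesGentoo
b3 <- 674.66    # sexmale
b4 <- -262.89   # speciesChinstrap:sexmale
b5 <- 130.44    # speciesGentoo:sexmale

# Predicted body mass (g) for each species x sex combination
cell_means <- matrix(
  c(b0,      b0 + b3,
    b0 + b1, b0 + b1 + b3 + b4,
    b0 + b2, b0 + b2 + b3 + b5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Adelie", "Chinstrap", "Gentoo"), c("female", "male"))
)
cell_means

# Male minus female difference within each species
cell_means[, "male"] - cell_means[, "female"]
```

The male-female difference is $674.66g$ for Adelie, $674.66 - 262.89 = 411.77g$ for Chinstrap, and $674.66 + 130.44 = 805.10g$ for Gentoo, which is what the two interaction coefficients adjust.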

Model Visualisation

Coding Constraints

When we have categorical predictors, our choice of contrasts coding changes the bits that we’re getting out of our model.

Suppose we have a 2x2 design (condition A and B, in groups 1 and 2):

Figure 5: Categorical x Categorical Interaction plot

When we are using the default contrasts coding (i.e., treatment or dummy) in R, then our coefficients for the individual predictors represent moving between the dots in Figure 5.

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)            1.9098     0.1759  10.855  < 2e-16 ***
conditionB             1.1841     0.2488   4.759 5.65e-06 ***
grouping2             -1.6508     0.2488  -6.635 1.09e-09 ***
conditionB:grouping2  -2.1627     0.3519  -6.146 1.15e-08 ***
---

The intercept is the red circle in Figure 5.
The coefficient for condition is the difference between the red circle and the red triangle in Figure 5.
The coefficient for grouping is the difference between the red circle and the blue circle in Figure 5.
The interaction coefficient is the difference from the slope of the red line to the slope of the blue line.

However, when we change to using effects (or sum-to-zero) coding, we’re switching where zero is in our model. So if we change to sum contrasts (here we’ve changed both predictors to using sum-to-zero coding), then we end up estimating the effect of each predictor averaged across the other.

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           1.13577    0.08796  12.912  < 2e-16 ***
conditionB            0.05141    0.08796   0.584     0.56    
grouping2            -1.36607    0.08796 -15.530  < 2e-16 ***
conditionB:grouping2 -0.54066    0.08796  -6.146 1.15e-08 ***
---

The intercept is the grey X in Figure 6.
The coefficient for condition is the difference between the grey X and the grey triangle in Figure 6.
The coefficient for grouping is the difference between the grey X and the blue line in Figure 6.
The interaction coefficient is the difference from the slope of the grey line to the slope of the blue line.

Figure 6: Visualisation of sum-to-zero for categorical x categorical interaction plot

It can get quite confusing when we start switching up the contrasts, but it’s all just because we’re changing what “zero” means, and what “moving 1” means:

Dummy/Treatment Coding
Effects/Sum-to-Zero Coding
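
The change in what "zero" means can be illustrated with a small simulated, balanced 2x2 design. This is a sketch with made-up cell means and hypothetical object names, and it assumes R's built-in `contr.sum` scheme, which codes the two levels of each factor as $+1$ and $-1$.

```r
set.seed(4)

# Balanced 2x2 design: 25 observations per cell (cell means chosen arbitrarily)
d <- expand.grid(condition = c("A", "B"), grouping = c("1", "2"), rep = 1:25)
true_means <- c(A.1 = 2, B.1 = 3, A.2 = 0.5, B.2 = -0.5)
d$y <- unname(true_means[paste(d$condition, d$grouping, sep = ".")]) + rnorm(nrow(d))

# Same model under treatment (default) and sum-to-zero coding
mdl_trt <- lm(y ~ condition * grouping, data = d)
mdl_sum <- lm(y ~ condition * grouping, data = d,
              contrasts = list(condition = "contr.sum", grouping = "contr.sum"))

cell_means <- tapply(d$y, list(d$condition, d$grouping), mean)

# Treatment coding: the intercept is the reference cell (condition A, group 1)
c(intercept = unname(coef(mdl_trt)[1]), cell_A1 = cell_means["A", "1"])

# Sum-to-zero coding: the intercept is the mean of the four cell means
c(intercept = unname(coef(mdl_sum)[1]), mean_of_cells = mean(cell_means))

# In a 2x2 design the sum-coded interaction is a quarter of the treatment-coded one
c(treatment = unname(coef(mdl_trt)[4]), sum_coded = unname(coef(mdl_sum)[4]))
```

The last comparison mirrors the two outputs above, where $-2.1627 / 4 \approx -0.5407$.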

Simple Effects

Simple effects involve examining the effect of one independent variable at a specific level of a second independent variable. This can allow us to better understand the nature of the interaction as we can examine group differences within one level of one of the independent variables.

Going back to our penguins example, we could ask:

Is there an effect of sex for Chinstrap penguins? Or in other words, is there a difference in body mass between male and female Chinstrap penguins?

To test simple effects, we can use the emmeans package.

In R

#load the emmeans package
library(emmeans)

#obtain estimates of simple comparisons of cell means
mdl_cc_emm <- emmeans(mdl_cc, ~species*sex)

#return cell means
mdl_cc_emm

 species   sex    emmean   SE  df lower.CL upper.CL
 Adelie    female   3369 36.2 327     3298     3440
 Chinstrap female   3527 53.1 327     3423     3632
 Gentoo    female   4680 40.6 327     4600     4760
 Adelie    male     4043 36.2 327     3972     4115
 Chinstrap male     3939 53.1 327     3835     4043
 Gentoo    male     5485 39.6 327     5407     5563

Confidence level used: 0.95

#specify that we want to compare sex for each species
mdl_cc_simple <- pairs(mdl_cc_emm, simple = "sex")

#return comparison 
mdl_cc_simple

species = Adelie:
 contrast      estimate   SE  df t.ratio p.value
 female - male     -675 51.2 327 -13.174  <.0001

species = Chinstrap:
 contrast      estimate   SE  df t.ratio p.value
 female - male     -412 75.0 327  -5.487  <.0001

species = Gentoo:
 contrast      estimate   SE  df t.ratio p.value
 female - male     -805 56.7 327 -14.188  <.0001

We can also visualise the interaction using emmip(). Here we need to specify the model object, what we want on the x- and y-axis, and whether we want 95% CIs to be displayed.

In R

## first argument is the model object we need to use to visualise the slopes
## species is on the x axis
## separate lines for each level of sex
## return 95% CIs is set to true (default is false)

emmip(mdl_cc, sex ~ species,
      CIs = TRUE,
      ylab = "Predicted Weight (g)",
      xlab = "Species")

Figure 7: Predicted Penguin Weight by Species and Sex

Example Interpretation

A simple effects analysis examined whether there was a difference in body mass between male and female Chinstrap penguins. Males $(M = 3939~g)$ were estimated to be heavier than females $(M = 3527~g)$ with an estimated difference of $412~g$, and this difference was statistically significant $(p < .001)$. This difference is visually presented in Figure 7.

General

Extracting Information

Multiple regression output in R, model formula highlighted

The call section at the very top of the summary() output shows us the formula that was specified in R to fit the regression model.

In the above, we can see that recall accuracy is our DV, recall confidence and age were our two IVs, and our dataset was named recalldata.

Multiple regression output in R, residuals highlighted

Residuals are the difference between the observed values and model predicted values of the DV.

Ideally, we want the median residual to be around 0, with the quartiles and the minimum/maximum roughly symmetric about it, as this would suggest that the errors are random fluctuations around the fitted line. When this is the case, it suggests that our model is not systematically over- or under-predicting at the high and low ends of our dataset, and that our residuals are roughly symmetrical.

Multiple regression output in R, model coefficients highlighted

Our model estimates help us to build our best fitting equation of the line that represents the association between our DV and our IV(s).

In the above example, we can build our equation for our model from this information:

\[ \text{Recall Accuracy}_i = \beta_0 + \beta_1 \cdot \text{Recall Confidence}_i + \beta_2 \cdot \text{Age}_i + \epsilon_i \] \[ \widehat{\text{Recall Accuracy}} = 36.16 + 0.90 \cdot \text{Recall Confidence} - 0.34 \cdot \text{Age} \]

In R

There are numerous equivalent ways to obtain the estimated regression coefficients (that is, $\hat \beta_0$, $\hat \beta_1$, ..., $\hat \beta_k$) from the fitted model (for the example below, our fitted model has been named mdl):

mdl
mdl$coefficients
coef(mdl)
coefficients(mdl)

The standard error of the coefficient is an estimate of the standard deviation of the coefficient (i.e., how much uncertainty there is in our estimated coefficient).

In R

If you wanted to obtain just the standard error for each estimated regression coefficient, you could do the following (for this below example, our fitted model has been named mdl):

summary(mdl)$coefficients[,2]

Using the standard error, we can create confidence intervals to estimate a plausible range of values for the true population parameter. Recall the formula for obtaining a confidence interval for the population slope is:

\[ \hat \beta_j \pm t^* \cdot SE(\hat \beta_j) \] where $t^*$ denotes the critical value chosen from the $t$-distribution with $n-k-1$ degrees of freedom (where $k$ = number of predictors and $n$ = sample size) for the desired confidence level (e.g., 95% when $\alpha = .05$).

In R

We can obtain the confidence intervals for the regression coefficients using the command confint().
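
As an illustration, the sketch below applies the formula by hand and compares the result with `confint()`. It uses the built-in `mtcars` data and a stand-in model named `mdl_demo` (both chosen purely for illustration; they are not the recall model discussed above).

```r
# Stand-in model on built-in data (not the recall model from the text)
mdl_demo <- lm(mpg ~ wt + hp, data = mtcars)

est <- coef(mdl_demo)
se  <- summary(mdl_demo)$coefficients[, "Std. Error"]

# Critical value for a 95% interval, with n - k - 1 residual degrees of freedom
t_star <- qt(0.975, df = df.residual(mdl_demo))

# Confidence interval by hand: estimate +/- t* x SE
manual_ci <- cbind(lower = est - t_star * se, upper = est + t_star * se)
manual_ci

# Same intervals from the built-in function
confint(mdl_demo, level = 0.95)
```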

The $t$-statistic is the estimated $\beta$ coefficient divided by its standard error:

\[ t = \frac{\hat \beta_j - 0}{SE(\hat \beta_j)} \]

which follows a $t$-distribution with $n-k-1$ degrees of freedom (where $k$ = number of predictors and $n$ = sample size).

With this, we can test the null hypothesis $H_0: \beta_j = 0$.

Generally speaking, you want your model coefficients to have large $t$-statistics (in absolute value) as this would indicate that the standard error was small in comparison to the coefficient. The larger the absolute value of our $t$-statistic, the more confident we can be that the coefficient is not 0.

In R

If you wanted to obtain just the $t$-values for each estimated regression coefficient, you could do the following (for this below example, our fitted model has been named mdl):

coef(summary(mdl))[, "t value"]
summary(mdl)$coefficients[,3]

From our $t$-value, we can compute our $p$-value. The $p$-value helps us to understand whether our coefficient(s) are statistically significant (i.e., that the coefficient is statistically different from 0). The $p$-value of each estimate indicates the probability of observing a $t$-value at least as extreme as the one calculated from the sample data when assuming the null hypothesis to be true.

In Psychology, a $p$-value < .05 is usually used to make statements regarding statistical significance (it is important that you always state your $\alpha$ level to help your reader understand any statements regarding statistical significance).

The number of asterisks corresponds with the significance of the coefficient (see the ‘Signif. codes’ legend just under the coefficients section).

In R

If you wanted to obtain just the $p$-values for each estimated regression coefficient, you could do the following (for this below example, our fitted model has been named mdl):

summary(mdl)$coefficients[,4]
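
To connect the formulas to the output, the following sketch recomputes the $t$-values and two-sided $p$-values by hand and places them next to the values reported by `summary()`. As before, `mdl_demo` is a stand-in model fitted to the built-in `mtcars` data, used only for illustration.

```r
# Stand-in model on built-in data (not the recall model from the text)
mdl_demo <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(mdl_demo)$coefficients

# t = estimate / standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from the t-distribution with n - k - 1 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(mdl_demo), lower.tail = FALSE)

cbind(t_manual, t_reported = coefs[, "t value"],
      p_manual, p_reported = coefs[, "Pr(>|t|)"])
```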

Multiple regression output in R, model standard deviation of the errors highlighted

The standard deviation of the errors, denoted by $\sigma$, is an important quantity that our model estimates. It represents how much individual data points tend to deviate above and below the regression line - in other words, it tells us how well the model fits the data.

A small $\sigma$ indicates that the points hug the line closely and we should expect fairly accurate predictions, while a large $\sigma$ suggests that, even if we estimate the line perfectly, we can expect individual values to deviate from it by substantial amounts.

The estimated standard deviation of the errors is denoted $\hat \sigma$, and is estimated by essentially averaging squared residuals (giving the variance) and taking the square-root:

\[ \begin{aligned} & \hat \sigma = \sqrt{\frac{SS_{Residual}}{n - k - 1}} \\ \\ & \text{where} \\ & SS_{Residual} = \textrm{Sum of Squared Residuals} = \sum_{i=1}^n{(\hat \epsilon_i)^2} \end{aligned} \]

In R

There are a couple of equivalent ways to obtain the estimated standard deviation of the errors — that is, $\hat \sigma$ — from the fitted model (for this example, our fitted model has been named mdl):

sigma(mdl)
summary(mdl)
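
The formula for $\hat \sigma$ can also be applied directly. Below is a short sketch using a stand-in model (`mdl_demo`, fitted to the built-in `mtcars` data for illustration only) that computes $\hat \sigma$ from the residuals and compares it with the values R reports.

```r
# Stand-in model on built-in data (not the recall model from the text)
mdl_demo <- lm(mpg ~ wt + hp, data = mtcars)

ss_residual <- sum(residuals(mdl_demo)^2)   # sum of squared residuals
n <- nobs(mdl_demo)                         # sample size
k <- length(coef(mdl_demo)) - 1             # number of predictors

sigma_manual <- sqrt(ss_residual / (n - k - 1))

c(manual  = sigma_manual,
  sigma   = sigma(mdl_demo),
  summary = summary(mdl_demo)$sigma)
```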

Manual Contrasts

Dummy and effects coding allow us to test the significance of the difference between means of groups and some other mean (either reference group or grand mean respectively). However, in some cases, we may want to test more specific hypotheses that require us to test the difference between particular combinations of groups. In such cases, we can use manual contrasts.

Rules

In R - Additive Model

Example - Additive Model

Suppose we wanted to address the following question:

Research Question

Does the sepal length of an iris grown in Western states (i.e., iris setosa) differ from the sepal length of an iris grown in Eastern states (i.e., iris versicolor and virginica)?

We could specify our hypothesis as:

\[ \begin{aligned} \quad H_0 &: \mu_\text{Western} = \mu_\text{Eastern} \\ \quad H_0 &: \mu_\text{Setosa} = \frac{1}{2} (\mu_\text{Versicolor} + \mu_\text{Virginica}) \\ \\ \quad H_1 &: \mu_\text{Western} \neq \mu_\text{Eastern} \\ \quad H_1 &: \mu_\text{Setosa} \neq \frac{1}{2} (\mu_\text{Versicolor} + \mu_\text{Virginica}) \\ \\ \end{aligned} \]

And then conduct our manual contrast analysis:

# Step 1: Fit and run the model 
spec_model <- lm(Sepal.Length ~ Species, data = iris)

# Step 2: Use `emmeans()` & `plot()`
seplength_mean <- emmeans(spec_model, ~ Species)
seplength_mean

 Species    emmean     SE  df lower.CL upper.CL
 setosa       5.01 0.0728 147     4.86     5.15
 versicolor   5.94 0.0728 147     5.79     6.08
 virginica    6.59 0.0728 147     6.44     6.73

Confidence level used: 0.95

plot(seplength_mean)

# Step 3: Check levels order via `levels()`
levels(iris$Species)

[1] "setosa"     "versicolor" "virginica"

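# Step 4: Define contrast & weights - want to compare iris setosa to iris versicolor and iris virginica
# (with weights c(-1, 1/2, 1/2), a positive estimate means the versicolor/virginica average is the larger one)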
seplength_comp <- list("Western State Iris - Eastern State Iris" = c(-1, 1/2, 1/2))

# Step 5: run contrast analysis
seplength_comp_test <- contrast(seplength_mean, method = seplength_comp)
seplength_comp_test

 contrast                                estimate     SE  df t.ratio p.value
 Western State Iris - Eastern State Iris     1.26 0.0892 147  14.086  <.0001

# Step 6: confidence intervals
confint(seplength_comp_test)

 contrast                                estimate     SE  df lower.CL upper.CL
 Western State Iris - Eastern State Iris     1.26 0.0892 147     1.08     1.43

Confidence level used: 0.95

# Bonus Step: Run inferential test and return CIs in one command
summary(seplength_comp_test, infer = TRUE)

 contrast                                estimate     SE  df lower.CL upper.CL
 Western State Iris - Eastern State Iris     1.26 0.0892 147     1.08     1.43
 t.ratio p.value
  14.086  <.0001

Confidence level used: 0.95

And write up our findings in the context of the hypothesis / research question:

Example Interpretation

We performed a test against $H_0: \mu_1 - \frac{1}{2}(\mu_2 + \mu_3) = 0$. At the 5% significance level, there was evidence that iris sepal length differed between Western and Eastern states in the US $(t(147) = 14.09, p < .001, \text{two-sided})$, and this difference was estimated to be 1.26cm. We are 95% confident that the sepals of irises grown in Eastern states were, on average, between 1.08cm and 1.43cm longer than those of irises grown in Western states $(CI_{95}[1.08, 1.43])$.
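
For readers who want to see what the contrast analysis is doing, the sketch below reproduces the estimate, standard error, $t$-ratio and confidence interval in base R. It assumes the standard result that a contrast $\sum_g w_g \hat \mu_g$ has standard error $\hat \sigma \sqrt{\sum_g w_g^2 / n_g}$; the object names (`iris_chk`, `mdl_chk`, `w`) are illustrative.

```r
iris_chk <- datasets::iris
mdl_chk  <- lm(Sepal.Length ~ Species, data = iris_chk)

# Same weights as in Step 4, in the order of levels(iris$Species)
w  <- c(setosa = -1, versicolor = 1/2, virginica = 1/2)
mu <- as.numeric(tapply(iris_chk$Sepal.Length, iris_chk$Species, mean))
n  <- as.numeric(table(iris_chk$Species))

estimate <- sum(w * mu)                            # weighted sum of group means
se       <- sigma(mdl_chk) * sqrt(sum(w^2 / n))    # standard error of the contrast
df_res   <- df.residual(mdl_chk)
t_ratio  <- estimate / se
p_value  <- 2 * pt(abs(t_ratio), df = df_res, lower.tail = FALSE)
ci       <- estimate + c(-1, 1) * qt(0.975, df_res) * se

round(c(estimate = estimate, SE = se, t = t_ratio,
        lower = ci[1], upper = ci[2]), 3)
```

These values should agree, up to rounding, with the `emmeans` output above.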

In R - Interaction Model

Example - Interaction Model

Suppose we wanted to address the following question:

Research Question

Does the difference in body mass between male and female penguins differ between those residing exclusively on the Antarctic Continent (i.e., Adelie) and those living in both Antarctica and the sub-Antarctic islands (i.e., Gentoo and Chinstrap)?

We could specify our hypothesis as:

\[ \begin{aligned} H_0 &: \mu_\text{(Male, Antarctic Continent)} - \mu_\text{(Female, Antarctic Continent)} = \\ &\mu_\text{(Male, Northern Antarctica | sub-Antarctic islands)} - \mu_\text{(Female, Northern Antarctica | sub-Antarctic islands)} \end{aligned} \]

\[ \begin{aligned} H_1 &: \mu_\text{(Male, Antarctic Continent)} - \mu_\text{(Female, Antarctic Continent)} \neq \\ &\mu_\text{(Male, Northern Antarctica | sub-Antarctic islands)} - \mu_\text{(Female, Northern Antarctica | sub-Antarctic islands)} \end{aligned} \]

Or equivalently as:

\[ \begin{aligned} H_0 &: \mu_\text{(Male, Adelie)} - \mu_\text{(Female, Adelie)} = \\ & \frac{1}{2} (\mu_\text{(Male, Gentoo)} + \mu_\text{(Male, Chinstrap)}) - \frac{1}{2}(\mu_\text{(Female, Gentoo)} + \mu_\text{(Female, Chinstrap)}) \\ \end{aligned} \]

\[ \begin{aligned} H_1 &: \mu_\text{(Male, Adelie)} - \mu_\text{(Female, Adelie)} \neq \\ &\frac{1}{2} (\mu_\text{(Male, Gentoo)} + \mu_\text{(Male, Chinstrap)}) - \frac{1}{2}(\mu_\text{(Female, Gentoo)} + \mu_\text{(Female, Chinstrap)}) \\ \end{aligned} \]

And then conduct our manual contrast analysis:

# Step 1: Fit and run the model 
mdl_cc <- lm(body_mass_g ~ species * sex, data = penguins)

# Step 2: Specify the coefficients to be used in the contrast analysis, and present in a formatted table

#TWO EQUALLY VALID WAYS TO DO THIS - SELECT ONLY ONE:
#Option 1:
species_coef  <- c('Adelie' = -1, 'chinstrap' = 0.5, 'Gentoo' = 0.5)
sex_coef  <- c('male' = -1, 'female' = 1)
contr_coef <- outer(species_coef, sex_coef)
contr_coef

          male female
Adelie     1.0   -1.0
chinstrap -0.5    0.5
Gentoo    -0.5    0.5

#Option 2:
species_coef  <- c('Adelie' = -1, 'chinstrap' = 0.5, 'Gentoo' = 0.5)
sex_coef  <- c('male' = -1, 'female' = 1)
contr_coef_2 <- species_coef %o% sex_coef
contr_coef_2

          male female
Adelie     1.0   -1.0
chinstrap -0.5    0.5
Gentoo    -0.5    0.5

#Convert into a well-formatted table:
contr_coef %>% 
    kable(., caption = "Penguin Contrast Weights") %>%
    kable_styling(full_width = FALSE)

Penguin Contrast Weights
	male	female
Adelie	1.0	-1.0
chinstrap	-0.5	0.5
Gentoo	-0.5	0.5

# Step 3: Use `emmeans()` & `plot()`
species_sex_mean <- emmeans(mdl_cc, ~ species*sex)
species_sex_mean

 species   sex    emmean   SE  df lower.CL upper.CL
 Adelie    female   3369 36.2 327     3298     3440
 Chinstrap female   3527 53.1 327     3423     3632
 Gentoo    female   4680 40.6 327     4600     4760
 Adelie    male     4043 36.2 327     3972     4115
 Chinstrap male     3939 53.1 327     3835     4043
 Gentoo    male     5485 39.6 327     5407     5563

Confidence level used: 0.95

plot(species_sex_mean)

# Step 4/5: Define contrast & weights, and give a name to this contrast 
species_sex_comp <- contrast(species_sex_mean,
                             method = list('Penguin Hyp' = c(-1, 0.5, 0.5, 1, -0.5, -0.5))
                     )

# Step 6: examine output and return inferential stats - CIs, t-ratio, p-value

#OPTION 1:
#examine output
species_sex_comp

 contrast    estimate   SE  df t.ratio p.value
 Penguin Hyp     66.2 69.5 327   0.952  0.3416

#obtain confidence intervals
confint(species_sex_comp)

 contrast    estimate   SE  df lower.CL upper.CL
 Penguin Hyp     66.2 69.5 327    -70.6      203

Confidence level used: 0.95

#OPTION 2:
summary(species_sex_comp, infer = TRUE)

 contrast    estimate   SE  df lower.CL upper.CL t.ratio p.value
 Penguin Hyp     66.2 69.5 327    -70.6      203   0.952  0.3416

Confidence level used: 0.95

And write up our findings in the context of the hypothesis / research question:

Example Interpretation

At the 5% significance level, there was no evidence that the difference in body weight between male and female penguins differed by where the species was based geographically $(t(327) = 0.95, p = .342, \text{two-sided})$.
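To see where the contrast estimate and its inferential statistics come from, the following sketch re-computes them by hand. It assumes the cell means typed in from the (rounded) emmeans() output above, so the hand-computed estimate differs slightly from the reported 66.2; the object names are illustrative only.

```r
# Cell means copied (rounded) from the emmeans() output above, in the
# same order as the emmeans grid: females first, then males
cell_means <- c(adelie_f = 3369, chinstrap_f = 3527, gentoo_f = 4680,
                adelie_m = 4043, chinstrap_m = 3939, gentoo_m = 5485)

# Contrast weights used in Step 4/5
weights <- c(-1, 0.5, 0.5, 1, -0.5, -0.5)

# The weights of a contrast sum to zero
sum(weights)

# Contrast estimate = weighted sum of the cell means
est_by_hand <- sum(weights * cell_means)
est_by_hand

# t-ratio, two-sided p-value and 95% CI from the reported estimate and SE
est <- 66.2
se  <- 69.5
df  <- 327
t_ratio <- est / se
p_value <- 2 * pt(abs(t_ratio), df = df, lower.tail = FALSE)
ci      <- est + c(-1, 1) * qt(0.975, df = df) * se
round(c(t = t_ratio, p = p_value, lower = ci[1], upper = ci[2]), 3)
```

Because the inputs are rounded, small discrepancies from the emmeans output in the last digit are expected.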

Multiple Comparisons

Pairwise Comparisons

Why does the Number of Tests Matter?

When to use Which Correction

Bonferroni

Use Bonferroni’s method when you are interested in a small number of planned contrasts (or pairwise comparisons).
Bonferroni’s method is to divide alpha by the number of tests/confidence intervals.
Does not require the comparisons to be independent of one another (the adjustment remains valid, if conservative, when the tests are correlated).
It sacrifices slightly more power than Tukey’s method (discussed below), but it can be applied to any set of contrasts or linear combinations (i.e., it is useful in more situations than Tukey).
It is usually better than Tukey if we want to do a small number of planned comparisons.

Šídák

(A bit) more powerful than the Bonferroni method.
Assumes that all comparisons are independent of one another.
Less common than Bonferroni method, largely because it is more difficult to calculate (not a problem now we have computers).

Tukey

It specifies an exact family significance level for comparing all pairs of treatment means.
Use Tukey’s method when you are interested in all (or most) pairwise comparisons of means.

Scheffe

It is the most conservative (least powerful) of all tests.
It controls the family alpha level for testing all possible contrasts.
It should be used if you have not planned contrasts in advance.
For testing pairs of treatment means it is too conservative (you should use Bonferroni or Šídák).

Others

In the wider literature, Holm’s step-down and Hochberg’s step-up procedures are also mentioned. Do feel free to read about these in your spare time - there are lots of resources online.

In R

You can easily change which correction you are using via the adjust = argument. For example, using the estimated marginal means (mdl_cc_emm) from our categorical x categorical interaction model (mdl_cc), fitted to the penguins dataset:

bonf_pair_comp <- pairs(mdl_cc_emm, adjust = "bonferroni")

bonf_pair_comp

 contrast                          estimate   SE  df t.ratio p.value
 Adelie female - Chinstrap female      -158 64.2 327  -2.465  0.2131
 Adelie female - Gentoo female        -1311 54.4 327 -24.088  <.0001
 Adelie female - Adelie male           -675 51.2 327 -13.174  <.0001
 Adelie female - Chinstrap male        -570 64.2 327  -8.875  <.0001
 Adelie female - Gentoo male          -2116 53.7 327 -39.425  <.0001
 Chinstrap female - Gentoo female     -1153 66.8 327 -17.246  <.0001
 Chinstrap female - Adelie male        -516 64.2 327  -8.037  <.0001
 Chinstrap female - Chinstrap male     -412 75.0 327  -5.487  <.0001
 Chinstrap female - Gentoo male       -1958 66.2 327 -29.564  <.0001
 Gentoo female - Adelie male            636 54.4 327  11.691  <.0001
 Gentoo female - Chinstrap male         741 66.8 327  11.085  <.0001
 Gentoo female - Gentoo male           -805 56.7 327 -14.188  <.0001
 Adelie male - Chinstrap male           105 64.2 327   1.627  1.0000
 Adelie male - Gentoo male            -1441 53.7 327 -26.855  <.0001
 Chinstrap male - Gentoo male         -1546 66.2 327 -23.345  <.0001

P value adjustment: bonferroni method for 15 tests

plot(bonf_pair_comp)

sidak_pair_comp <- pairs(mdl_cc_emm, adjust = "sidak")

sidak_pair_comp

 contrast                          estimate   SE  df t.ratio p.value
 Adelie female - Chinstrap female      -158 64.2 327  -2.465  0.1931
 Adelie female - Gentoo female        -1311 54.4 327 -24.088  <.0001
 Adelie female - Adelie male           -675 51.2 327 -13.174  <.0001
 Adelie female - Chinstrap male        -570 64.2 327  -8.875  <.0001
 Adelie female - Gentoo male          -2116 53.7 327 -39.425  <.0001
 Chinstrap female - Gentoo female     -1153 66.8 327 -17.246  <.0001
 Chinstrap female - Adelie male        -516 64.2 327  -8.037  <.0001
 Chinstrap female - Chinstrap male     -412 75.0 327  -5.487  <.0001
 Chinstrap female - Gentoo male       -1958 66.2 327 -29.564  <.0001
 Gentoo female - Adelie male            636 54.4 327  11.691  <.0001
 Gentoo female - Chinstrap male         741 66.8 327  11.085  <.0001
 Gentoo female - Gentoo male           -805 56.7 327 -14.188  <.0001
 Adelie male - Chinstrap male           105 64.2 327   1.627  0.8096
 Adelie male - Gentoo male            -1441 53.7 327 -26.855  <.0001
 Chinstrap male - Gentoo male         -1546 66.2 327 -23.345  <.0001

P value adjustment: sidak method for 15 tests

plot(sidak_pair_comp)

tukey_pair_comp <- pairs(mdl_cc_emm, adjust = "tukey")

tukey_pair_comp

 contrast                          estimate   SE  df t.ratio p.value
 Adelie female - Chinstrap female      -158 64.2 327  -2.465  0.1376
 Adelie female - Gentoo female        -1311 54.4 327 -24.088  <.0001
 Adelie female - Adelie male           -675 51.2 327 -13.174  <.0001
 Adelie female - Chinstrap male        -570 64.2 327  -8.875  <.0001
 Adelie female - Gentoo male          -2116 53.7 327 -39.425  <.0001
 Chinstrap female - Gentoo female     -1153 66.8 327 -17.246  <.0001
 Chinstrap female - Adelie male        -516 64.2 327  -8.037  <.0001
 Chinstrap female - Chinstrap male     -412 75.0 327  -5.487  <.0001
 Chinstrap female - Gentoo male       -1958 66.2 327 -29.564  <.0001
 Gentoo female - Adelie male            636 54.4 327  11.691  <.0001
 Gentoo female - Chinstrap male         741 66.8 327  11.085  <.0001
 Gentoo female - Gentoo male           -805 56.7 327 -14.188  <.0001
 Adelie male - Chinstrap male           105 64.2 327   1.627  0.5812
 Adelie male - Gentoo male            -1441 53.7 327 -26.855  <.0001
 Chinstrap male - Gentoo male         -1546 66.2 327 -23.345  <.0001

P value adjustment: tukey method for comparing a family of 6 estimates

plot(tukey_pair_comp)

scheffe_pair_comp <- pairs(mdl_cc_emm, adjust = "scheffe")

scheffe_pair_comp

 contrast                          estimate   SE  df t.ratio p.value
 Adelie female - Chinstrap female      -158 64.2 327  -2.465  0.3014
 Adelie female - Gentoo female        -1311 54.4 327 -24.088  <.0001
 Adelie female - Adelie male           -675 51.2 327 -13.174  <.0001
 Adelie female - Chinstrap male        -570 64.2 327  -8.875  <.0001
 Adelie female - Gentoo male          -2116 53.7 327 -39.425  <.0001
 Chinstrap female - Gentoo female     -1153 66.8 327 -17.246  <.0001
 Chinstrap female - Adelie male        -516 64.2 327  -8.037  <.0001
 Chinstrap female - Chinstrap male     -412 75.0 327  -5.487  <.0001
 Chinstrap female - Gentoo male       -1958 66.2 327 -29.564  <.0001
 Gentoo female - Adelie male            636 54.4 327  11.691  <.0001
 Gentoo female - Chinstrap male         741 66.8 327  11.085  <.0001
 Gentoo female - Gentoo male           -805 56.7 327 -14.188  <.0001
 Adelie male - Chinstrap male           105 64.2 327   1.627  0.7539
 Adelie male - Gentoo male            -1441 53.7 327 -26.855  <.0001
 Chinstrap male - Gentoo male         -1546 66.2 327 -23.345  <.0001

P value adjustment: scheffe method with rank 5

plot(scheffe_pair_comp)

none_pair_comp <- pairs(mdl_cc_emm, adjust = "none")

none_pair_comp

 contrast                          estimate   SE  df t.ratio p.value
 Adelie female - Chinstrap female      -158 64.2 327  -2.465  0.0142
 Adelie female - Gentoo female        -1311 54.4 327 -24.088  <.0001
 Adelie female - Adelie male           -675 51.2 327 -13.174  <.0001
 Adelie female - Chinstrap male        -570 64.2 327  -8.875  <.0001
 Adelie female - Gentoo male          -2116 53.7 327 -39.425  <.0001
 Chinstrap female - Gentoo female     -1153 66.8 327 -17.246  <.0001
 Chinstrap female - Adelie male        -516 64.2 327  -8.037  <.0001
 Chinstrap female - Chinstrap male     -412 75.0 327  -5.487  <.0001
 Chinstrap female - Gentoo male       -1958 66.2 327 -29.564  <.0001
 Gentoo female - Adelie male            636 54.4 327  11.691  <.0001
 Gentoo female - Chinstrap male         741 66.8 327  11.085  <.0001
 Gentoo female - Gentoo male           -805 56.7 327 -14.188  <.0001
 Adelie male - Chinstrap male           105 64.2 327   1.627  0.1047
 Adelie male - Gentoo male            -1441 53.7 327 -26.855  <.0001
 Chinstrap male - Gentoo male         -1546 66.2 327 -23.345  <.0001

plot(none_pair_comp)
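The Bonferroni and Šídák adjusted p-values shown above can be reproduced by hand from the unadjusted ones. The sketch below is illustrative: it assumes two unadjusted p-values copied (rounded) from the adjust = "none" output, and the object names are made up for this example.

```r
# Unadjusted p-values for two comparisons, copied from the
# adjust = "none" output above
p_raw <- c("Adelie female - Chinstrap female" = 0.0142,
           "Adelie male - Chinstrap male"     = 0.1047)

# Number of pairwise comparisons among the 6 cell means
n_tests <- choose(6, 2)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni", n = n_tests)

# Sidak: 1 - (1 - p)^m, where m is the number of tests
p_sidak <- 1 - (1 - p_raw)^n_tests

round(cbind(p_raw, p_bonf, p_sidak), 4)
```

The Šídák values are never larger than the Bonferroni ones, which is the sense in which Šídák is (a bit) more powerful.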

Model Predicted Values & Residuals

Model predicted values are the estimates generated by a regression model for the dependent variable based on the independent variable(s), whilst residuals are the differences between these predicted values and the actual observed values (in turn indicating the accuracy of the model’s predictions).
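As a minimal sketch of these two ideas, the code below fits a simple model and extracts predicted values and residuals. It assumes the built-in mtcars dataset as a stand-in (it is not one of the course datasets), and the new car in the last step is hypothetical.

```r
# mtcars is a built-in R dataset, used here only as a stand-in
mdl <- lm(mpg ~ wt, data = mtcars)

# Model predicted values for the observed data
y_hat <- fitted(mdl)

# Residuals: observed value minus predicted value
res <- resid(mdl)
all.equal(unname(res), mtcars$mpg - unname(y_hat))

# Predicted value for a new, hypothetical car with wt = 3
# (wt is recorded in units of 1000 lbs)
pred_new <- predict(mdl, newdata = data.frame(wt = 3))
pred_new
```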

Predicted Values

Residuals

Predicted Values - Example

Data Transformations

There are many transformations we can do to a continuous variable, but the most common ones are centering and scaling. These transformations can help to aid interpretability of our statistical models.

Centering

Scaling

Standardisation

predictor	outcome	in lm	coefficient	interpretation
standardised	raw	`y ~ scale(x)`	\(\beta = b \cdot s_x\)	“difference in Y for a 1 SD increase in X”
standardised	standardised	`scale(y) ~ scale(x)`	\(\beta = b \cdot \frac{s_x}{s_y}\)	“difference in SD of Y for a 1 SD increase in X”
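The relationships in the table can be checked numerically. The sketch below assumes the built-in mtcars dataset as a stand-in, and it standardises the variables before fitting (rather than calling scale() inside the formula) purely to keep the coefficient names simple; the column names z_wt and z_mpg are made up for this example.

```r
# Create standardised (z-scored) copies of predictor and outcome
dat <- transform(mtcars,
                 z_wt  = as.numeric(scale(wt)),
                 z_mpg = as.numeric(scale(mpg)))

s_x <- sd(dat$wt)
s_y <- sd(dat$mpg)

# Raw slope, slope with standardised predictor, and fully standardised slope
b_raw <- coef(lm(mpg ~ wt, data = dat))[["wt"]]
b_x   <- coef(lm(mpg ~ z_wt, data = dat))[["z_wt"]]
b_xy  <- coef(lm(z_mpg ~ z_wt, data = dat))[["z_wt"]]

# Standardised predictor only: beta = b * s_x
all.equal(b_x, b_raw * s_x)

# Standardised predictor and outcome: beta = b * s_x / s_y
all.equal(b_xy, b_raw * s_x / s_y)
```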

Model Fit

Assessing model fit involves examining metrics like the sum of squares to measure variability explained by the model, the $F$-ratio to evaluate the overall significance of the model by comparing explained variance to unexplained variance, and $R$-squared / Adjusted $R$-squared to quantify the proportion of variance in the dependent variable explained by the independent variable(s).

Sums of Squares

F-ratio

Overview:

We can perform a test to investigate whether a model is ‘useful’, that is, whether our explanatory variable explains more variance in our outcome than we would expect from a variable that is related to the outcome only by random chance.

With one predictor, the $F$-statistic is used to test the null hypothesis that the regression slope for that predictor is zero:

\[ H_0: \text{the model is ineffective, }b_1 = 0 \\ \] \[ H_1 : \text{the model is effective, }b_1 \neq 0 \\ \]

In multiple regression, the logic is the same, but we are now testing against the null hypothesis that all regression slopes are zero. Our test is framed in terms of the following hypotheses:

\[ H_0: \text{the model is ineffective, }b_1 = b_2 = \dots = b_k = 0 \\ \]

\[ H_1 : \text{the model is effective, at least one } b_j \neq 0 \text{ (for } j = 1, \dots, k\text{)} \\ \]

The relevant test-statistic is the $F$-statistic, which uses “Mean Squares” (these are Sums of Squares divided by the relevant degrees of freedom). We then compare that against (you guessed it) an $F$-distribution! $F$-distributions vary according to two parameters, which are both degrees of freedom.

Formula:

\[ F_{(df_{Model},~df_{Residual})} = \frac{MS_{Model}}{MS_{Residual}} = \frac{SS_{Model}/df_{Model}}{SS_{Residual}/df_{Residual}} \]

\[ \begin{aligned} & \text{Where:} \\ & df_{Model} = k \\ & df_{Residual} = n-k-1 \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ \end{aligned} \]

Description:

To test the significance of an overall model, we can conduct an $F$-test. The $F$-test compares your model to a model containing zero predictor variables (i.e., the intercept only model), and tests whether your added predictor variables significantly improved the model.

It is called the $F$-ratio because it is the ratio of how much of the variation is explained by the model (per parameter) to how much of the variation is unexplained (per remaining degree of freedom).

The $F$-test involves testing the statistical significance of the $F$-ratio.

Q: What does the $F$-ratio test?
A: The null hypothesis that all regression slopes in a model are zero (i.e., explain no variance in your outcome/DV). The alternative hypothesis is that at least one of the slopes is not zero.

The $F$-ratio you see at the bottom of summary(model) is actually a comparison between two models: your model (with some explanatory variables in predicting $y$) and the null model.

In regression, the null model can be thought of as the model in which all explanatory variables have zero regression coefficients. It is also referred to as the intercept-only model, because if all predictor variable coefficients are zero, then we are only estimating $y$ via an intercept (which will be the mean - $\bar y$).

Interpretation:

Alongside the $F$-ratio, you can see the results from testing the null hypothesis that all of the coefficients are $0$ (the alternative hypothesis is that at least one coefficient is $\neq 0$). Under the null hypothesis that all coefficients $= 0$, the ratio of explained:unexplained variance should be approximately 1.

If your predictors explain more variance than would be expected by chance, the $F$-ratio will be large enough to be significant, and you would reject the null hypothesis. This suggests that the predictor variables included in your model improved the model fit (in comparison to the intercept-only model).

Points to note:

The larger your $F$-ratio, the more variance your model explains relative to the variance it leaves unexplained
The $F$-ratio will be close to 1 when the null is true (i.e., that all slopes are zero)

In R

We can see the $F$-statistic and associated $p$-value at the bottom of the output of summary(<modelname>):

Multiple regression output in R, F statistic highlighted
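To connect the formula to the R output, the sketch below computes the $F$-ratio by hand from the sums of squares and compares it with the value stored in summary(). It assumes the built-in mtcars dataset as a stand-in for the course data, with two illustrative predictors.

```r
# Stand-in model with k = 2 explanatory variables
mdl <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
k <- 2

# Sums of squares
ss_total    <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
ss_residual <- sum(resid(mdl)^2)
ss_model    <- ss_total - ss_residual

# F = MS_Model / MS_Residual
f_by_hand <- (ss_model / k) / (ss_residual / (n - k - 1))

# p-value from the F-distribution with (k, n - k - 1) degrees of freedom
p_by_hand <- pf(f_by_hand, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

# Compare with the F-statistic reported at the bottom of summary()
f_from_summary <- summary(mdl)$fstatistic
c(by_hand = f_by_hand, f_from_summary["value"])
```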

Example Interpretation

The linear model with recall confidence and age explained a significant amount of variance in recall accuracy beyond what we would expect by chance, $F(2, 17) = 12.92, p < .001$.

R-squared and Adjusted R-squared

Model Comparisons

One useful thing we might want to do is compare our models with and without some predictor(s). There are numerous ways we can do this, but the method chosen depends on the models and underlying data:

Nested vs Non-Nested Models

Incremental F-test

AIC & BIC
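As a rough sketch of the comparisons named above, the code below compares two nested models with an incremental $F$-test and with AIC and BIC. It assumes the built-in mtcars dataset as a stand-in, and the choice of predictors is purely illustrative.

```r
# Two nested models: the restricted model's predictors are a subset
# of the full model's predictors
m_restricted <- lm(mpg ~ wt, data = mtcars)
m_full       <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Incremental F-test: do the extra predictors improve model fit?
cmp <- anova(m_restricted, m_full)
cmp

# Information criteria (smaller values indicate a better trade-off
# between fit and complexity); these do not require nested models
aic_tab <- AIC(m_restricted, m_full)
bic_tab <- BIC(m_restricted, m_full)
aic_tab
bic_tab
```

The incremental $F$-test is only appropriate for nested models fitted to the same data, whereas AIC and BIC can also be used to compare non-nested models.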

Model Assumptions

Linear models rely on numerous underlying assumptions about the data. These assumptions ensure that the association between variables is appropriately captured, and that inferences drawn from the model are accurate and valid. Model diagnostics can help further assess whether these assumptions hold. When these assumptions are violated, there are numerous techniques that can be employed, such as through data transformations or using robust alternatives, to ensure reliable model interpretations.

Linearity

Simple Linear Regression

In simple linear regression with only one explanatory variable, we could assess linearity through a simple scatterplot of the outcome variable against the explanatory variable. This allows us to check whether the errors have a mean of zero: if this assumption is met, the points appear randomly scattered around the fitted straight line (i.e., the residuals are randomly scattered around zero).

The rationale for this is that, once you remove from the data the linear trend, what’s left over in the residuals should not have any trend, i.e. have a mean of zero.

Multiple Regression

In multiple regression, however, it becomes more necessary to rely on diagnostic plots of the model residuals. This is because we need to know whether the relations are linear between the outcome and each predictor after accounting for the other predictors in the model.

In order to assess this, we use partial-residual plots (also known as ‘component-residual plots’). This is a plot with each explanatory variable $x_j$ on the x-axis, and partial residuals on the y-axis***.

Partial residuals for a predictor $x_j$ are calculated as: \[ \hat \epsilon + \hat \beta_j x_j \]
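To make the formula concrete, the sketch below computes partial residuals by hand for one predictor. It assumes the built-in mtcars dataset as a stand-in, and the object names are illustrative.

```r
# Stand-in multiple regression model
mdl <- lm(mpg ~ wt + hp, data = mtcars)

# Partial residuals for wt: model residuals plus the fitted
# contribution of wt (residual + b_wt * wt)
partial_wt <- resid(mdl) + coef(mdl)[["wt"]] * mtcars$wt

# Regressing the partial residuals on wt recovers the slope for wt
# from the full model, which is why the plot isolates that predictor
slope_check <- coef(lm(partial_wt ~ mtcars$wt))[[2]]
c(partial_slope = slope_check, model_slope = coef(mdl)[["wt"]])

# A basic partial-residual plot
plot(mtcars$wt, partial_wt, xlab = "wt", ylab = "Partial residuals (wt)")
```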

In R:

Simple Linear Regression

#specify model
recall_simp <- lm(recall_accuracy ~ age, data = recalldata)

#create plot
ggplot(recalldata, aes(x = age, y = recall_accuracy)) + 
    geom_point() + 
    geom_smooth(method = "lm", se = FALSE, colour = "blue") + #fit straight line to data
    geom_smooth(method = "loess", se = FALSE, colour = "red") + #fit loess line to data
    labs(x = "Age", y = "Recall Accuracy")

Interpretation Guidance

The loess line (red) should closely follow the straight line (blue). Marked, systematic departures between the two suggest that the association is not linear.

Multiple Linear Regression

We can create these plots for all predictors in the model by using the crPlots() function from the car package:

#specify model
recall_mdl <- lm(recall_accuracy ~ recall_confidence + age, data = recalldata)

#create plots
crPlots(recall_mdl)

Interpretation Guidance

You are looking for the pink line to follow a linear trend line (i.e., follow the blue line). In other words, the loess line should closely follow the linear line.

Important to Note for Interaction Models

***When there is an interaction in the model, assessing linearity becomes difficult. In fact, crPlots() will not work. To assess, you can create a residuals-vs-fitted plot.

Independence (of errors)

Normality (of errors)

Equal Variances (Homoscedasticity)

Useful Assumption Plots

plot(modelname)
check_model(modelname)

We can run plot(mymodel), which will cycle through these plots (asking us to press enter each time to move to the next plot), or we can arrange these plots in a matrix via par(mfrow), for example in a 2 x 2 matrix as shown below (make sure to always reset your graphical parameters!). If needed, we could also extract specific plots using, for instance, plot(mymodel, which = 3) for the third plot.

In R

par(mfrow=c(2,2))
plot(recall_mdl)

par(mfrow=c(1,1))

Interpretation Guidance

Top Left: For the Residuals vs Fitted plot, we want the red line to be horizontal at close to zero across the plot. We don’t want the residuals (the points) to be fanning in/out.
Top Right: For the Normal Q-Q plot, we want the residuals (the points) to follow closely to the diagonal line, indicating that they are relatively normally distributed.⁴
Bottom Left: For the Scale-Location plot, we want the red line to be horizontal across the plot. These plots allow us to examine the extent to which the variance of the residuals changes across the fitted values. If it is angled, we are likely to see fanning in/out of the points in the residuals vs fitted plot.
Bottom Right: The Residuals vs Leverage plot indicates points that might be of individual interest as they may be unduly influencing the model. There are funnel-shaped lines on this plot (sometimes out of scope of the plotting window). Ideally, we want our residuals inside the funnel - the further the residual is to the right (the more leverage it has), the closer to 0 we want it to be.

Note, if we have only categorical predictors in our model, many of these will show vertical lines of points. This doesn’t indicate that anything is wrong, and the same principles described above continue to apply.

The check_model() function from the performance package is a useful way to check the assumptions of models, as it also returns some useful notes to aid your interpretation. However, it is important to check each assumption individually with plots that are more suitable for a statistics report.

In R

library(performance)
check_model(recall_mdl)

Multicollinearity

Individual Case Diagnostics

We have seen that some specific individual cases in our data can influence our model more than others. We can identify these as:

Regression outliers: A large residual $\hat \epsilon_i$ - i.e., a big discrepancy between their predicted y-value and their observed y-value.
- Standardised residuals: For residual $\hat \epsilon_i$, divide by the estimate of the standard deviation of the residuals. In R, the rstandard() function will give you these
- Studentised residuals: For residual $\hat \epsilon_i$, divide by the estimate of the standard deviation of the residuals excluding case $i$. In R, the rstudent() function will give you these.
High leverage cases: These are cases which have considerable potential to influence the regression model (e.g., cases with an unusual combination of predictor values).
- Hat values: are used to assess leverage. In R, the hatvalues() function will retrieve these.
High influence cases: When a case has high leverage and is an outlier, it will have a large influence on the regression model.
- Cook’s Distance: combines leverage (hatvalues) with outlying-ness to capture influence: $D_i = \text{Outlyingness} \times \text{Leverage}$. Cook’s distance refers to the average distance the $\hat{y}$ values will move if a given case is removed. In R, the cooks.distance() function will provide these values. Alongside Cook’s Distance, we can examine the extent to which model estimates and predictions are affected when an entire case is dropped from the dataset and the model is refitted.
DFFit: the change in the predicted value at the $i^{th}$ observation when the $i^{th}$ observation is excluded from, versus included in, the regression.
DFbeta: the change in a specific coefficient when the $i^{th}$ observation is excluded from, versus included in, the regression. A large DFbeta value would suggest that a case has a substantial impact on the estimated coefficient, and thus a high influence on the model results; a small DFbeta value would suggest that the case has less influence on the estimated coefficient.
DFbetas: the change in a specific coefficient divided by its standard error, when the $i^{th}$ observation is excluded from, versus included in, the regression. Because DFbetas is on a standardised scale, a commonly used cut-off or threshold to compare $|DFBETAS|$ values (absolute values) against is $\frac{2}{\sqrt{n}}$ (see Belsley et al., (1980) p. 28 for more info)⁵.
COVRATIO: measures the effect of an observation on the covariance matrix of the parameter estimates. In simpler terms, it captures an observation’s influence on standard errors. Values which are $>1+\frac{3(k+1)}{n}$ or $<1-\frac{3(k+1)}{n}$ are considered as having strong influence.

In R, we can get lots of these measures with the influence.measures() function:

influence.measures(my_model) will return a table of the various measures for each case.
summary(influence.measures(my_model)) will provide a nice summary of what R deems to be the influential points.
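The individual functions named above can also be collected into one table. This is a minimal sketch, assuming the built-in mtcars dataset as a stand-in model; the column names and the flag column are illustrative.

```r
# Stand-in model with k = 2 predictors
mdl <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
k <- 2

# Collect case diagnostics into one data frame (one row per case)
diag_tab <- data.frame(
  std_resid  = rstandard(mdl),       # standardised residuals
  stud_resid = rstudent(mdl),        # studentised residuals
  hat        = hatvalues(mdl),       # leverage
  cooks_d    = cooks.distance(mdl),  # influence
  covratio   = covratio(mdl)         # influence on standard errors
)

# Flag cases using the COVRATIO cut-offs described above
upper_cut <- 1 + 3 * (k + 1) / n
lower_cut <- 1 - 3 * (k + 1) / n
diag_tab$covratio_flag <- diag_tab$covratio > upper_cut |
  diag_tab$covratio < lower_cut

# Show the cases with the largest Cook's distance first
head(diag_tab[order(-diag_tab$cooks_d), ])
```

The related functions dffits(), dfbeta() and dfbetas() return the remaining measures in the same one-row-per-case layout.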

Next Steps: What to do with Violations of Assumptions / Problematic Case Diagnostic Results

There are lots of different options available, and there is no one right answer. Assuming that we have no issues with model specification (i.e., are not missing variables, have modeled appropriately), then we may want to consider one of the below approaches (note: this is not an exhaustive list!)

The first step is to re-examine your data. It is important to be familiar with your dataset, as you need to know what values are typical, normal, and possible. Could it be the case that you have missed some impossible values (e.g., a negative value of a person’s height), values outwith the possible range (e.g., a score of 55 on a survey where scores can only range 10-50), values that don’t make any sense (e.g., an age of 200), or maybe even typos / data entry errors (e.g., forgetting to put a decimal point, so having a height of 152m instead of 1.52m)?

If there is a simple error in the data, it could be that you can fix the typo. If that is not possible (maybe you didn’t collect the data, so are unsure of what the value(s) should/could be), you will need to delete the value (i.e., set as an NA), because you know that it is incorrect.

We should aim to never change a legitimate value where possible (and remember that if you have a large dataset, a small number of extreme values will be unlikely to have a strong influence on your results).

If there is an extreme, but legitimate, value that you have determined is adversely influencing your model (i.e., by examining the assumptions and diagnostics as outlined above), you may want to consider ways to reduce this influence. One option is winsorizing, which caps the identified extreme values at a specified percentile, in turn reducing their influence on the model without completely eliminating the observation(s). For example, you could replace values below the 5th percentile with the 5th percentile value, and values above the 95th percentile with the 95th percentile value.
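Winsorizing at the 5th and 95th percentiles can be sketched in a few lines of base R. The winsorize() helper and the height values below are hypothetical, written only for illustration; they are not part of the course materials or of any package.

```r
# Hypothetical helper: cap values at the chosen lower/upper percentiles
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- unname(quantile(x, probs = c(lower, upper), na.rm = TRUE))
  # Raise values below the lower bound, then lower values above the upper bound
  pmin(pmax(x, bounds[1]), bounds[2])
}

# Made-up heights in metres, with one extreme (but possible) value
heights <- c(1.52, 1.60, 1.65, 1.70, 1.72, 1.75, 1.78, 1.80, 1.85, 2.30)

heights_w <- winsorize(heights)

# Compare original and winsorized values side by side
cbind(original = heights, winsorized = heights_w)
```

Only the most extreme values at each end are changed; the values in between are left as they were.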

If after re-examining your data you cannot identify any atypical, non-normal, or impossible values, you may need to select a different approach as outlined below.

A sensitivity analysis allows us to assess how sensitive our results (i.e., parameter estimates, p-values, confidence intervals) are to changes in our modelling approach (here, the removal of observations).

We can re-fit our model after excluding our identified outliers and potentially influential observations, and compare these results to the original model.

Process of Removing Observations

The current example involves removing all identified outliers and potentially influential observations at the same time. Ideally, and to ensure a more thorough sensitivity analysis, you would remove each of these observations one at a time, assess the effects on the model by comparing to your original, reassessing the remaining pre-identified observations, and repeating the process if necessary.

## wellbeing model
wb_mdl1 <- lm(wellbeing ~ outdoor_time + social_int, data = mwdata) 
summary(wb_mdl1)


Call:
lm(formula = wellbeing ~ outdoor_time + social_int, data = mwdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.7611  -3.1308  -0.4213   3.3126  18.8406 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  28.62018    1.48786  19.236  < 2e-16 ***
outdoor_time  0.19909    0.05060   3.935 0.000115 ***
social_int    0.33488    0.08929   3.751 0.000232 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.065 on 197 degrees of freedom
Multiple R-squared:  0.1265,    Adjusted R-squared:  0.1176 
F-statistic: 14.26 on 2 and 197 DF,  p-value: 1.644e-06

## wellbeing model, re-fitted without the flagged observations
wb_mdl2 <- lm(wellbeing ~ outdoor_time + social_int, data = mwdata[-c(16, 25, 50, 53, 56, 58, 59, 60, 62, 72, 73, 75, 76, 78, 79, 85, 101, 109, 125, 126, 127, 131, 149, 151, 159, 163, 165, 169, 173, 176, 179, 197), ])
summary(wb_mdl2)


Call:
lm(formula = wellbeing ~ outdoor_time + social_int, data = mwdata[-c(16, 
    25, 50, 53, 56, 58, 59, 60, 62, 72, 73, 75, 76, 78, 79, 85, 
    101, 109, 125, 126, 127, 131, 149, 151, 159, 163, 165, 169, 
    173, 176, 179, 197), ])

Residuals:
    Min      1Q  Median      3Q     Max 
-8.7700 -2.6445 -0.6073  2.8586  9.6605 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27.91311    1.42612  19.573  < 2e-16 ***
outdoor_time  0.19356    0.04901   3.950 0.000116 ***
social_int    0.39830    0.08964   4.443 1.62e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.044 on 165 degrees of freedom
Multiple R-squared:  0.1774,    Adjusted R-squared:  0.1675 
F-statistic:  17.8 on 2 and 165 DF,  p-value: 1.004e-07

tab_model(wb_mdl1, wb_mdl2,
          dv.labels = c("Wellbeing (WEMWBS Scores)", "Wellbeing (WEMWBS Scores)"),
          pred.labels = c("outdoor_time" = "Outdoor Time (hours per week)",
                          "social_int" = "Social Interactions (number per week)"),
          title = "Regression Table for Wellbeing Models wb1 and wb2")

Regression Table for Wellbeing Models wb1 and wb2
	Wellbeing (WEMWBS Scores)			Wellbeing (WEMWBS Scores)
Predictors	Estimates	CI	p	Estimates	CI	p
(Intercept)	28.62	25.69 – 31.55	<0.001	27.91	25.10 – 30.73	<0.001
Outdoor Time (hours per week)	0.20	0.10 – 0.30	<0.001	0.19	0.10 – 0.29	<0.001
Social Interactions (number per week)	0.33	0.16 – 0.51	<0.001	0.40	0.22 – 0.58	<0.001
Observations	200			168
R² / R² adjusted	0.126 / 0.118			0.177 / 0.167

We conducted a sensitivity analysis to assess how robust our conclusions were regarding outdoor time and the weekly number of social interactions in the presence of previously identified outliers and potentially influential observations. We re-fit the model, excluding these 32 observations (16% of our original sample), and compared these model results (wb_mdl2) to those of our original model (wb_mdl1).

There was little difference in the estimates from wb_mdl1 and wb_mdl2, and so we can conclude that after conducting a sensitivity analysis, there were no meaningful differences in our results, and hence our conclusions from our original model hold. Specifically:

The direction of all model estimates is the same in wb_mdl1 and wb_mdl2 (i.e., all positive)
There is no difference in statistical significance, and the p-values were of a similar magnitude (i.e., all < .001)
The estimate and confidence intervals for outdoor_time are very similar
There are some quantitative differences in the estimate and confidence intervals for social_int (the estimate differs in magnitude by roughly 0.06), but given that it remains positive and significant, we do not need to be too concerned about this.

The bootstrap method is an alternative non-parametric method of constructing a standard error. Instead of having to rely on calculating the standard error with a formula and potentially applying fancy mathematical corrections, bootstrapping involves mimicking the idea of “repeatedly sampling from the population”. It does so by repeatedly resampling with replacement from our original sample.

What this means is that we don’t have to rely on any assumptions about our model residuals, because we generate an actual distribution that we can take as an approximation of our sampling distribution. We can then look directly at where 95% of that distribution falls, without having to rely on any summing of squared deviations.

Note, the bootstrap may provide us with an alternative way of conducting inference, but our model may still be mis-specified. It is also very important to remember that bootstrapping is entirely reliant on utilising our original sample to pretend that it is a population (and mimic sampling from that population). If our original sample is not representative of the population that we’re interested in, bootstrapping doesn’t help us at all.
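To make the idea of “resampling with replacement” concrete, here is a minimal hand-rolled sketch in base R. It uses simulated data (an assumption for illustration, not the course data), and the helper `boot_slope()` is our own hypothetical function rather than part of any package:

```r
set.seed(1)

# Simulated data (an assumption for illustration; not the course data)
n <- 100
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Model fitted to the original sample
fit <- lm(y ~ x, data = dat)

# One bootstrap resample: draw n row indices with replacement,
# refit the model, and keep the slope
boot_slope <- function(data) {
  idx <- sample(nrow(data), size = nrow(data), replace = TRUE)
  coef(lm(y ~ x, data = data[idx, ]))[["x"]]
}

# Repeat many times to approximate the sampling distribution of the slope
boot_slopes <- replicate(1000, boot_slope(dat))

# Bootstrap standard error: the SD of the bootstrap distribution
boot_se <- sd(boot_slopes)

# Percentile interval: where the middle 95% of bootstrap slopes fall
boot_ci <- quantile(boot_slopes, probs = c(0.025, 0.975))

# Formula-based standard error from the usual lm() output, for comparison
formula_se <- summary(fit)$coefficients["x", "Std. Error"]

print(c(boot_se = boot_se, formula_se = formula_se))
print(boot_ci)
```

Because the simulated errors here are well behaved, the two standard errors should be close; the bootstrap becomes more interesting when the usual assumptions are in doubt.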

The method of ordinary least squares regression (OLS: i.e., the type of regression model you have been fitting on the course) assumes that there is constant variance in the errors (homoscedasticity). The method of weighted least squares (WLS) can be used when the ordinary least squares assumption of constant variance in the errors is violated (i.e., you have evidence of heteroscedasticity, like we do in Q3 of this lab).

If we have some specific belief that our non-constant variance is due to differences in the variance of the outcome between various groups, then it might be better to use weighted least squares.

As an example, imagine we are looking at the weights of different dog breeds (Figure 10). The weights of chihuahuas are all quite close together (between 2 and 5kg), but the weight of, for example, spaniels is anywhere from 8 to 25kg - a much bigger variance.

Figure 10: The weights of 49 dogs, of 7 breeds

Recall that, by default, lm() deals with categorical predictors such as dog breed by comparing each level to a reference level. In this case, that reference level is “beagle” (first in the alphabet). Looking at Figure 10 above, which comparison do you feel more confident in?

A: Beagles (14kg) vs Pugs (9.1kg). A difference of 4.9kg.
B: Beagles (14kg) vs Spaniels (19kg). A difference of 5kg.

Hopefully, your intuition is that A looks like a clearer difference than B because there’s less overlap between Beagles and Pugs than between Beagles and Spaniels. Our standard linear model, however, assumes the standard errors are identical for each comparison:


Call:
lm(formula = weight ~ breed, data = dogdf)
...
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             13.996      1.649   8.489 1.17e-10 ***
breedpug                -4.858      2.332  -2.084   0.0433 *  
breedspaniel             5.052      2.332   2.167   0.0360 *  
breedchihuahua         -10.078      2.332  -4.322 9.28e-05 ***
breedboxer              20.625      2.332   8.846 3.82e-11 ***
breedgolden retriever   17.923      2.332   7.687 1.54e-09 ***
breedlurcher             5.905      2.332   2.533   0.0151 *  
---

Furthermore, we can see that we have heteroscedasticity in our residuals - the variance is not constant across the model:

plot(dogmodel, which=3)

Weighted least squares is a method that allows us to apply weights to each observation, where the size of the weight indicates the precision of the information contained in that observation.

We can, in our dog-breeds example, allocate different weights to each breed. Accordingly, the Chihuahuas are given higher weights (and so Chihuahua comparisons result in a smaller SE), and Spaniels and Retrievers are given lower weights.

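library(nlme)
# load the dog weights data
load(url("https://uoepsy.github.io/data/dogweight.RData"))
# fit the model with gls(), allowing a different residual variance for each breed
dogmod_wls <- gls(weight ~ breed, data = dogdf, 
                  weights = varIdent(form = ~ 1 | breed))
summary(dogmod_wls)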

Coefficients:
                           Value Std.Error   t-value p-value
(Intercept)            13.995640  1.044722 13.396516  0.0000
breedpug               -4.858097  1.271562 -3.820576  0.0004
breedspaniel            5.051696  2.763611  1.827933  0.0747
breedchihuahua        -10.077615  1.095964 -9.195207  0.0000
breedboxer             20.625429  1.820370 11.330351  0.0000
breedgolden retriever  17.922779  2.976253  6.021927  0.0000
breedlurcher            5.905261  1.362367  4.334559  0.0001

We can also apply weights that change according to continuous predictors (e.g. observations with a smaller value of $x$ are given more weight than observations with larger values).
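As a rough illustration of weights that depend on a continuous predictor, the sketch below uses the weights argument of lm() (rather than gls()) on simulated data. Both the data and the choice of weights (1 / x^2, which assumes the error variance is proportional to x^2) are assumptions made purely for this example:

```r
set.seed(2)

# Simulated data (an assumption for illustration): the error SD grows with x
n <- 200
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, mean = 0, sd = 0.5 * x)
dat <- data.frame(x = x, y = y)

# Ordinary least squares: every observation counts equally
ols_fit <- lm(y ~ x, data = dat)

# Weighted least squares: weights proportional to 1 / variance.
# Here we assume Var(error) is proportional to x^2, so the weights are 1 / x^2,
# giving observations with smaller x more weight
wls_fit <- lm(y ~ x, data = dat, weights = 1 / x^2)

# Compare estimates and standard errors from the two fits
print(summary(ols_fit)$coefficients)
print(summary(wls_fit)$coefficients)
```

In practice the form of the weights is rarely known exactly, so the choice should be justified (e.g., from residual plots) rather than taken from this sketch.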

A data transformation involves the replacement of a variable (e.g., $y$) by a function of that variable in order to change the shape of a distribution or association (e.g., to help reduce skew). We can transform the outcome variable prior to fitting the model, using something such as log(y) or sqrt(y). This will sometimes allow us to estimate a model for which our assumptions are satisfied.

Some of the most common (not an exhaustive list) transformations are:

Log (log(y)): Often used for reducing right skewness. Note, this transformation cannot be applied to zero or negative values (make sure to check your data!)
Square root (sqrt(y)): Also often used for reducing right skewness. This transformation can be applied to zero values (but not negative), and is commonly applied to count data
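To see how these two transformations affect right skew, here is a small sketch on a simulated, strictly positive outcome (an assumption for illustration). The `skewness()` helper is our own simple definition (the average cubed z-score), not a function from a package:

```r
set.seed(3)

# Simulated right-skewed, strictly positive outcome (an assumption for illustration)
y <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Sample skewness: the average cubed z-score (close to 0 for a symmetric distribution)
skewness <- function(v) mean(((v - mean(v)) / sd(v))^3)

# log() needs strictly positive values, so check the data first
stopifnot(all(y > 0))

# Skewness of the raw, square-root, and log-transformed outcome
skew_table <- c(
  raw  = skewness(y),
  sqrt = skewness(sqrt(y)),
  log  = skewness(log(y))
)
print(round(skew_table, 2))
```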

Figure 11: A model of a transformed outcome variable can sometimes avoid violations of assumptions that arise when modeling the outcome variable directly. Data from https://uoepsy.github.io/data/trouble1.csv

The major downside of this is that we are no longer modelling $y$, but some transformation $f(y)$ ($y$ with some function $f$ applied to it). Interpretation of the coefficients changes accordingly, such that we are no longer talking in terms of changes in $y$, but changes in $f(y)$. When the transformation function used is non-linear (see the right-hand panel of Figure 12), a given change in $f(y)$ does not correspond to the same change in $y$ for every value of $y$.

Figure 12: The log transformation is non-linear

For certain transformations, we can re-express coefficients to be interpretable with respect to $y$ itself. For instance, the model using a log transform, $\ln(y) = b_0 + b_1 x$, gives us a coefficient that represents statement A below. We can re-express this by applying the inverse of the logarithm, the exponential function exp(). Similar to how this works in logistic regression, the exponentiated coefficients obtained from exp(coef(model)) are multiplicative, meaning we can say something such as statement B.

A: “a 1 unit change in $x$ is associated with a $b_1$ unit change in $\ln(y)$”.
B: “a 1 unit change in $x$ is associated with $y$ being multiplied by $e^{b_1}$, i.e., a $(e^{b_1} - 1) \times 100$ percent change in $y$”.
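A short sketch of this back-transformation, using simulated data in which log(y) is linear in x (an assumption for illustration; the object names are our own):

```r
set.seed(4)

# Simulated data (an assumption for illustration): log(y) is linear in x
n <- 200
x <- runif(n, min = 0, max = 10)
y <- exp(1 + 0.2 * x + rnorm(n, sd = 0.1))
dat <- data.frame(x = x, y = y)

# Model the log-transformed outcome
log_fit <- lm(log(y) ~ x, data = dat)

# Slope on the log scale: change in ln(y) per 1 unit change in x
b1 <- coef(log_fit)[["x"]]

# Exponentiate to get the multiplicative change in y per 1 unit change in x
mult_change <- exp(b1)

# Express the same quantity as a percent change in y
pct_change <- (mult_change - 1) * 100

print(c(b1 = b1, mult_change = mult_change, pct_change = pct_change))
```

Here a multiplicative change of, say, 1.22 would mean that $y$ is about 22% higher for each additional unit of $x$.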

Finding the optimal transformation to use can be difficult, but there are methods out there to help you. One such method is the Box-Cox transformation, which can be conducted using BoxCox(variable, lambda = "auto") from the forecast package.⁶

Generalized Linear Models

Generalized Linear Models (GLMs) can appropriately deal with outcomes that do not follow a normal distribution (traditional linear models assume normally distributed errors). They can accommodate various types of distributions, including the Poisson, binomial, and gamma distributions. This makes them suitable for modelling count data (e.g., number of sunny days Edinburgh has per year - yes, count data can include 0!), binary data (where there are only two possible values, e.g., doesn’t wear glasses vs wears glasses, smoker vs non-smoker, i.e., values that are yes/no or 0/1), and other types of non-normal data.

We will explore some GLMs later in the course (Semester 2 Block 4), where we will work with logistic regression models.

Higher Order Terms

Higher order regression terms refer to the inclusion of polynomial terms of degree higher than one in a regression model. In a linear regression model, the association between the dependent variable ($Y$) and the independent variable ($X$) is assumed to be linear, which means the association can be represented by a straight line. However, in many real-world scenarios, associations between variables are not strictly linear, and including higher order regression terms can help capture more complex relationships. Higher order terms that you could incorporate include quadratic, cubic, or higher degree polynomial terms.

For example, in a quadratic regression model, the relationship between $Y$ and $X$ can be represented as:

\[ Y = \beta_0 + \beta_1 \cdot X + \beta_2 \cdot X^2 + \epsilon \]
\[ \begin{aligned} & \text{Where:} \\ & Y = \text{Dependent Variable} \\ & X = \text{Independent Variable} \end{aligned} \]

As in the models we’ve seen so far, $\beta_0$, $\beta_1$, and $\beta_2$ are the coefficients to be estimated. What is different from what we’ve seen in DAPR2 is the term $\beta_2 \cdot X^2$, which is the quadratic term. This allows a curved, as opposed to straight, line to represent the association between $Y$ and $X$, and hence allows us to capture more complex relationships. For example, we can model the association between height and age:

Figure 13: Two linear models, one with a quadratic term (right)
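For readers curious about what fitting such a model might look like, here is a minimal sketch comparing a straight-line model with a quadratic one. The data are simulated with a curved association (an assumption for illustration; this is not the height and age data behind Figure 13):

```r
set.seed(5)

# Simulated data (an assumption for illustration): a curved association
n <- 150
x <- runif(n, min = 0, max = 10)
y <- 5 + 4 * x - 0.3 * x^2 + rnorm(n)
dat <- data.frame(x = x, y = y)

# Straight-line model
lin_fit <- lm(y ~ x, data = dat)

# Quadratic model: I() makes R treat x^2 as arithmetic, not formula syntax
quad_fit <- lm(y ~ x + I(x^2), data = dat)

# Compare the proportion of variance explained by each model
r2 <- c(linear    = summary(lin_fit)$r.squared,
        quadratic = summary(quad_fit)$r.squared)
print(round(r2, 3))

# Estimated coefficients of the quadratic model (intercept, linear, quadratic)
print(coef(quad_fit))
```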

Please note that these types of models are beyond the scope of the DAPR2 course, but if you want to know more, please do read up on these in your own time.

Removing outliers and potentially influential observations should be a last resort - not all outliers are inherently ‘bad’ - we do expect natural variation in our population(s) of interest. Outliers can be informative about the topic under investigation, and this is why you need to be very careful about excluding outliers due only to their ‘extremeness’. In doing so, you can distort your results by removing variability - i.e., by forcing the data to be more normal and less variable than it actually is, and reduce statistical power by reducing the size of your sample.

If you do decide to remove observations, you will need to document what specific data points you excluded, and provide an explanation as to why these were excluded.

To set specific values to NA in our dataset (and save this updated dataset in a new object named mwdata2), we could use the following code. For the purpose of this demonstration, let’s say that we wanted to set any age values below 20 to NA. In the original dataset mwdata, we had 3 individuals aged 18 and 6 aged 19, so we should end up with 9 NA values in the age column of mwdata2:

# load tidyverse for %>% and mutate()
library(tidyverse)
# set age values below 20 to NA, saving to a new object (mwdata2) to avoid overwriting the original data
mwdata2 <- mwdata %>% 
    mutate(age = replace(age, age < 20, NA))

#check how many NA values we have - there should be 9 (so 9 TRUEs):
table(is.na(mwdata2$age))


FALSE  TRUE 
  191     9

If we wanted to remove full rows from the dataset, we could use the following code. For the purpose of this demonstration, let’s say that we wanted to remove all rows that were highlighted in the above assumption and diagnostic checks as potentially having an adverse influence on our model estimates:

# create new dataset 'mwdata3' without (by specifying -) identified outliers and potentially influential observations
mwdata3 <- mwdata[-c(16, 25, 50, 53, 56, 58, 59, 60, 62, 72, 73, 75, 76, 78, 79, 85, 101, 109, 125, 126, 127, 131, 149, 151, 159, 163, 165, 169, 173, 176, 179, 197), ]

# check dimensions - should now have 32 fewer rows than the original dataset (200 - 32 = 168)
dim(mwdata3)

[1] 168   7

Bootstrap

The bootstrap is a general approach to assessing whether the sample results are statistically significant or not, and allows us to draw inferences to the population from a regression model. This method does not rely on distributional conditions such as normality of the residuals.

It is based on repeatedly sampling with replacement from the data at hand (sampling with replacement avoids always getting exactly the original sample back), and then computing the regression coefficients from each resample. We will use the terms “bootstrap sample” and “resample” interchangeably (both mean a sample drawn with replacement).

Overview

Terminology

In R

Follow these steps:

1: Load the car package.
2: Use the Boot() function (do not forget the uppercase B!) which takes as arguments:
- the fitted model
- f, saying which bootstrap statistics to compute on each bootstrap sample. By default f = coef, returning the regression coefficients.
- R, saying how many bootstrap samples to compute. By default R = 999, but this could be any number. To experiment, we recommend 1000; when you want to produce results for journals, it is typical to go with 10,000 or more.
- ncores, saying whether to perform the calculations in parallel (and more efficiently). However, this will depend on your PC, and you need to find how many cores you have by running parallel::detectCores() on your PC. By default the function uses ncores = 1.
3: Run the code. However, please remember that the Boot() function does not want a model which was fitted using data with NAs. To remove these, you could use, for example, na.omit().
4: Look at the summary() of the bootstrap results. The output will show, for each regression coefficient, the value in the original sample in the column original and, in the bootSE column, the estimate of the variability of the coefficient from bootstrap sample to bootstrap sample. The bootSE is the bootstrap standard error, or bootstrap SE for short. We can use this to answer the key question of how accurate our estimate is.
5: Compute confidence intervals via the Confint() function. Use your preferred confidence level (usually, and by default, 95%) by specifying level =. If you select 95% confidence intervals and also specify the type = "perc" argument, R will return the two values that contain the middle 95% of the bootstrap estimates between them, i.e., the value with 2.5% of the bootstrap estimates below it and the value with 97.5% of the bootstrap estimates below it (and so 2.5% above it).
6: Provide interpretation in the context of your research question and report results in APA format. (Note: the actual estimates are those from our original model, it is just the bounds of the interval that bootstrapping is providing us with).
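The advice in step 3 about missing values can be sketched as follows. The small data frame `recall_demo` is made up for illustration (its column names mirror the model used in this section, but the values are hypothetical):

```r
# Small made-up data frame with missing values (an assumption for illustration)
recall_demo <- data.frame(
  recall_accuracy   = c(70, 65, NA, 80, 75, 60, 72, 68),
  recall_confidence = c(50, 45, 55, NA, 60, 40, 52, 47),
  age               = c(25, 31, 42, 38, 29, 51, 33, 45)
)

# Keep only rows with no missing value in any column
recall_complete <- na.omit(recall_demo)

# Fit the model on the complete cases; a model fitted this way
# could then be passed on to Boot()
demo_mdl <- lm(recall_accuracy ~ recall_confidence + age, data = recall_complete)

print(nrow(recall_demo))      # rows before
print(nrow(recall_complete))  # rows after dropping incomplete rows
```

Note that na.omit() drops a row if any column is missing, including columns that are not in the model, so it can be worth selecting only the model variables first.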

In R

#specify model
recall_mdl <- lm(recall_accuracy ~ recall_confidence + age, data = recalldata)

#step 1: load car package
library(car)

#step 2/3: bootstrap model (asking to resample 1000 times, i.e., getting a distribution of 1000 values for the coefficients)
bootmymodel <- Boot(recall_mdl, R = 1000)

#step 4: check summary
summary(bootmymodel)


Number of bootstrap replications R = 1000 
                  original   bootBias   bootSE  bootMed
(Intercept)       36.15959 -3.2711632 16.47382 34.86127
recall_confidence  0.89573  0.0522770  0.25721  0.92141
age               -0.33916  0.0027399  0.11400 -0.32973

#step 5: confidence intervals
Confint(bootmymodel, level = 0.95, type = "perc")

Bootstrap percent confidence intervals

                    Estimate      2.5 %     97.5 %
(Intercept)       36.1595862 -1.0225793 61.5221101
recall_confidence  0.8957292  0.5188793  1.5129003
age               -0.3391577 -0.5824887 -0.1458007

Visualisation

General Formatting & Presenting of Results

LaTeX Symbols & Equations

By embedding LaTeX into RMarkdown, you can accurately and precisely format mathematical expressions, ensuring that they are not only technically correct but also visually appealing and easy to interpret.

LaTeX Guide

APA Formatting

APA format is a writing/presentation style that is often used in psychology to ensure consistency in communication. APA formatting applies to all aspects of writing, from the formatting of papers (including tables and figures) to the citation of sources and reference lists. This means that it also applies to how you present results in your Psychology courses, including DAPR2.

APA Formatting Guides

Tables

We want to ensure that we are presenting results in a well formatted table. To do so, there are lots of different packages available (see Lesson 4 of the RMD bootcamp).

One of the most convenient ways to present results from regression models is to use the tab_model() function from the sjPlot package.

Creating tables via tab_model

Cross Referencing

Cross-referencing is a very helpful way to direct your reader through your document, and the good news is that this can be done automatically in RMarkdown.

Cross Referencing
