Back to Basics

For an overview of basic statistical tests and core concepts (e.g., $p$-values), please revisit the DAPR1 materials for a refresher (also accessible via the DAPR1 Learn page).

Terminology

Term	Definition
(Observational) unit	The individual entities on which data are collected
Variable	Any characteristic recorded on the observational units
Numeric variable	A variable that records a numerical quantity for each case. For such variables standard arithmetic operations make sense. For example: height, IQ, and weight
Categorical variable	A categorical variable places units into one of several groups. For example: country of birth, dominant hand, and eye colour
Binary variable	A special case of categorical variable with only 2 possible levels. For example: handedness (left or right), smoking status (smoker or non-smoker), pass test (yes or no)
Response variable (also more commonly called a dependent variable, or outcome variable)	Measures the outcome of interest in a study
Explanatory/independent variable (also called predictors)	Are used to explain differences/changes in the response variable
Observational study	An observational study is a study in which the researcher does not manipulate any of the variables involved in the study, but merely records the values as they naturally exist
Experimental study	An experiment is a study in which the researcher imposes the values of the explanatory variable on the units before measuring the response variable

Data Exploration

The common first port of call for almost any statistical analysis is to explore the data, and we can do this visually and/or numerically.

Marginal Distributions & Bivariate Associations

	Marginal Distributions	Bivariate Associations
Description	The distribution of each variable individually (i.e., without reference to the values of the other variables).	Describing the association between two numeric variables.
Visually	Plot each variable individually. You could use, for example, `geom_density()` for a density plot, `geom_boxplot()` for a boxplot or `geom_histogram()` for a histogram to comment on and/or examine: The shape of the distribution. Look at the shape, centre and spread of the distribution. Is it symmetric or skewed? Is it unimodal or bimodal? Identify any unusual observations. Do you notice any extreme observations (i.e., outliers)?	Plot associations among two variables. You could use, for example, `geom_point()` for a scatterplot to comment on and/or examine: The direction of the association indicates whether there is a positive or negative association The form of association refers to whether the relationship between the variables can be summarized well with a straight line or some more complicated pattern The strength of association entails how closely the points fall to a recognizable pattern such as a line Unusual observations that do not fit the pattern of the rest of the observations and which are worth examining in more detail
Numerically	Compute and report summary statistics e.g., mean, standard deviation, median, min, max, etc. You could, for example, calculate summary statistics such as the mean (`mean()`) and standard deviation (`sd()`), etc. within `summarize()`	Compute and report the correlation coefficient. You can use the `cor()` function to calculate this

Numeric Exploration

Numeric exploration of data involves examining and describing key statistics like mean, median, and standard deviation via descriptives tables; and assessing the associations among variables through correlation coefficients. Exploring our data numerically helps us to identify patterns and associations in the data. When doing so, it is important to contextualise the descriptive statistics within the scope of the research question and associated scales.

Descriptives

Descriptives Tables

Descriptives Tables - Examples

Sepal Length Descriptives (in cm)
M_Length	SD_Length
5.84	0.83

Sepal Length Descriptives (in cm)
	mean	sd
X1	5.84	0.83

Correlation

Correlation Coefficient

Correlation Matrix

A correlation matrix is a table showing the correlation coefficients between variables. Each cell in the table shows the association between two variables. The diagonals show the correlation of a variable with itself (and are therefore always equal to 1).

In R

We can create a correlation matrix by giving the cor() function a dataframe. It is important to remember that all variables must be numeric. One way to check this is by using the str() argument.

Let’s check the structure of the iris dataset to ensure that all variables are numeric:

str(iris)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We can see that the variable Species in column 5 is a factor - this means that we cannot include this in our correlation matrix. Therefore, we need to subset, or, in other words, select specific columns. We can do this either giving the column numbers inside [], or using select(). In our case, we want the variables in columns 1 - 4, just not 5.

If you had NA values within your dataset, you could choose to remove these NAs using na.rm = TRUE inside the cor() function.

In R

Index dataframe ([])
Variable selection (select())

round(cor(iris[,c(1:4)]), digits = 2)

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00

# select only the columns we want by variable name, and pass this to cor()
iris |> 
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) |>
  cor() |>
  round(digits = 2)

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00

Correlation - Hypothesis Testing

Correlation - Hypothesis Testing in R

Visual Exploration

Visual exploration of our data allows us to visualize the distributions of our data, and to identify potential associations among variables.

How to Visualise Data

Marginal Distributions - Examples

A histogram shows the frequency of values which fall within bins of an equal width.

Basic:

x-axis: possible values of some variable, grouped into bins
y-axis: frequency of a given value or values within bins
What are bins?: A bin represents a range of scores

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_histogram() +
    labs(x = "Sepal Length (in cm)")

Updating Bins:

Within geom_histogram(), we can specify bins = to specify the number of columns we want (for this example, lets say we want 10):

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_histogram(bins = 10) +
    labs(x = "Sepal Length (in cm)")

Alternatively, we can specify binwidth = to specify the width of each bin (it is very helpful to be aware of the scale of your variable here!):

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_histogram(binwidth = 0.1) +
    labs(x = "Sepal Length (in cm)")

Outline Columns with Color:

Within geom_histogram(), we can specify color = to set a colored outline of the columns:

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_histogram(color = "darkred") +
    labs(x = "Sepal Length (in cm)")

Fill Columns with Color:

Within geom_histogram(), we can specify fill = to fill the columns with a color:

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_histogram(fill = "darkred") +
    labs(x = "Sepal Length (in cm)")

“Density” is a bit similar to the notion of “relative frequency” (or “proportion”), in that for a density curve, the values on the y-axis are scaled so that the total area under the curve is equal to 1. In creating a curve for which the total area underneath is equal to one, we can use the area under the curve in a range of values to indicate the proportion of values in that range.

Basic:

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_density() +
    labs(x = "Sepal Length (in cm)")

Filled:

We can fill our plot with colour by specifying fill = within geom_density():

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_density(fill = "darkred") +
    labs(x = "Sepal Length (in cm)")

Line Type & Width:

We can change the type and width of the line by specifying linetype = and linewidth = within geom_density():

ggplot(data = iris, aes(x = Sepal.Length)) +
    geom_density(linetype = 6, linewidth = 3) +
    labs(x = "Sepal Length (in cm)")

Boxplots provide a useful way of visualising the interquartile range (IQR). You can see what each part of the boxplot represents in Figure Figure 1.

Basic:

We need to specify + geom_boxplot() to get a boxplot:

ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_boxplot() +
    labs(x = "Sepal Length (in cm)")

Rotate Boxplot:

If we had set aes(y = Sepal.Length) instead, then it would simply be rotated 90 degrees:

ggplot(data = iris, aes(y = Sepal.Length)) +
  geom_boxplot() +
    labs(y = "Sepal Length (in cm)")

Bivariate Associations - Examples

Unlike in our marginal plots where we specified our x-axis variable within aes(), to visualise bivariate associations, we need to specify what variables we want on both our x- and y-axis.

Scatterplot
Scatterplot of Matrices (SPLOM)

We can use a scatterplot (since the variables are numeric and continuous) to visualise the association between the two numeric variables - these will be our x- and y-axis values.

We plot these values for each row of our dataset, and we should end up with a cloud of scattered points.

Here we will want to comment on any key observations that we notice, including if we detect outliers or points that do not fit with the pattern in the rest of the data. Outliers are extreme observations that are not possible values of a variable or that do not seem to fit with the rest of the data. This could either be:

marginally along one axis: points that have an unusual (too high or too low) x-coordinate or y-coordinate
jointly: observations that do not fit with the rest of the point cloud

Basic:

We need to specify + geom_point() to get a scatterplot:

ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
    geom_point() +
    labs(x = "Petal Length (in cm)", y = "Sepal Length (in cm)")

Fill Points with Color:

Within geom_point(), we can specify color = to fill the points with a color:

ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
    geom_point(color = "darkred") +
    labs(x = "Petal Length (in cm)", y = "Sepal Length (in cm)")

Change Size and Opacity:

We can change the size (using size =) and the opacity (using alpha =) of our geom elements on the plot. Let’s include this below:

ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
    geom_point(size = 3, alpha = 0.5) +
    labs(x = "Petal Length (in cm)", y = "Sepal Length (in cm)")

Add a Line of Best Fit:

We can superimpose (i.e., add) a line of best fit by including the argument + geom_smooth(). Since we want to fit a straight line, we want to use method = "lm". We can also specify whether we want to display confidence intervals around our line by specifying se = TRUE / FALSE.

ggplot(data = iris, aes(x = Petal.Length, y = Sepal.Length)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = "Petal Length (in cm)", y = "Sepal Length (in cm)")

Using pairs.panels() is likely the most useful way to visualise the associations among numeric variables. It returns a scatterplot of matrices (SPLOM) returning you (1) the marginal distribution of each variable via a histogram, (2) the correlation between variables, and (3) bivariate scatterplots.

iris |>
    select(Sepal.Length, Petal.Length, Petal.Width) |>
    pairs.panels(main = "Iris SPLOM")

Multivariate Associations - Examples

Functions and Mathematical Models

Basic functions and mathematical models are foundational tools used to describe and predict associations between variables.

Identification & Specification

Deterministic Models

Description & Specification

Visualisation

Predicted Values

Statistical Models

Statistical models are used to understand the associations among variables.

Specifying Hypotheses

Numeric Outcomes & Numeric Predictors

Simple Linear Regression Models

Description & Model Specification

The association between two variables (e.g., recall accuracy and age) will show deviations from the ‘average pattern’. Hence, we need to create a model that allows for deviations from the linear relationship - we need a statistical model.

A statistical model includes both a deterministic function and a random error term. We typically refer to the outcome (‘dependent’) variable with the letter $y$ and to our predictor (‘explanatory’/‘independent’) variables with the letter $x$. A simple (i.e., one x variable only) linear regression model thus takes the following form (where the terms $\beta_0$ and $\beta_1$ are numbers specifying where the line going through the data meets the y-axis (i.e., the intercept - where $x$ = 0; $\beta_0$) and its slope (direction and gradient of line; $\beta_1$):

Model Specification

\[ y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i \]

Model Specification: Annotated

\[ y_i = \underbrace{\beta_0 + \beta_1 \cdot x_i}_{\text{function of }x} + \underbrace{\epsilon_i}_{\text{random error}} \\ \]

\[ \quad \text{where} \quad \epsilon_i \sim N(0, \sigma) \text{ independently} \]

Model Specification: Explained

Let’s break down what $y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i \quad \text{where} \quad \epsilon_i \sim N(0, \sigma) \text{ independently}$ actually means by considering the statement in smaller parts:

$y_i = \beta_0 + \beta_1 \cdot x_i$
- $y_i$ is our measured outcome variable (our DV)
- $x_i$ is our measured predictor variable (our IV)
- $\beta_0$ is the model intercept
- $\beta_1$ is the model slope
$\epsilon \sim N(0, \sigma) \text{ independently}$
- $\epsilon$ is the residual error
- $\sim$ means ‘distributed according to’
- $N(0, \sigma) \text{ independently}$ means ‘normal distribution with a mean of 0 and a variance of $\sigma$’
- Together, we can say that the errors around the line have a mean of zero and constant spread as x varies

In R

There are basically two pieces of information that we need to pass to the lm() function:

The formula: The regression formula should be specified in the form y ~ x where $y$ is the dependent variable (DV) and $x$ the independent variable (IV).
The data: Specify which dataframe contains the variables specified in the formula.

In R, the syntax of the lm() function can be specified as follows (where DV = dependent variable, IV = independent variable, and data_name = the name of your dataset):

Option A
Option B

model_name <- lm(DV ~ IV, data = data_name)

model_name <- lm(data_name$DV ~ data_name$IV)

you can also specify as:

Option A
Option B

model_name <- lm(DV ~ 1 + IV, data = data_name)

model_name <- lm(data_name$DV ~ 1 + data_name$IV)

Why is there a 1 in the two bottom options?

When we specify the linear model in R, we include after the tilde sign ($\sim$), the variables that appear to the right of the $\hat \beta$s. The intercept, or $\beta_0$, is a constant. That is, we could write it as multiplied by 1.

Including the 1 explicitly is not necessary because it is included by default (you can check this by comparing the outputs of A & B above with and without the 1 included - the estimates are the same!). After a while, you will find you just want to drop the 1 when calling lm() because you know that it’s going to be there, but in these early weeks we tried to keep it explicit to make it clear that you want the intercept to be estimated.

Example

Research Question

Is there an association between recall accuracy and age?

Overview

Visualise Data

Model & Hypothesis Specification

Model Building

Results Interpretation

Model Visualisation

Multiple Linear Regression Models

Description & Model Specification

Interpretation of Coefficients

Example

Research Question

Is recall accuracy associated with recall confidence and age?

Overview

Visualise Data

Model & Hypothesis Specification

Model Building

Results Interpretation

Model Visualisation

General - Extracting Information

It is important to have a good grasp of how to understand and interpret the key components of your model summary() output, including model coefficients, standard errors, $t$-values, $p$-values, etc., and how these can be used in further calculations (such as confidence intervals). As well as knowing how to extract from R, it is necessary to understand how to compute some of these statistics by hand too.

Model Call

Residuals

Model Coefficients

Our model estimates help us to build our best fitting equation of the line that represents the association between our DV and our IV(s).

In the above example, we can build our equation for our model from this information:

\[ \text{Recall Accuracy}_i = \beta_0 + \beta_1 \cdot \text{Recall Confidence}_i + \beta_2 \cdot \text{Age}_i + \epsilon_i \] \[ \widehat{\text{Recall Accuracy}} = 36.16 + 0.90 \cdot \text{Recall Confidence} - 0.34 \cdot \text{Age} \]

How to calculate $\hat \beta_0$ and $\hat \beta_1$

By Hand
Using R

Let’s apply to a straightforward example to try by-hand. Suppose you have a simple linear regression model (i.e., with only one IV) where you have the following data points:

Observed $x_i$	Observed $y_i$
1	5
2	7
3	8
4	6
5	9

Step 1: Calculate mean of both $x$ and $y$

$\bar x = {\frac{1+2+3+4+5}{5}} = 3$

$\bar y = {\frac{5+7+8+6+9}{5}} = 7$

Step 2: Calculate $\beta_0$ and $\beta_1$

We need to calculate the slope first, as we need to know the value of $\beta_1$ in order to calculate $\beta_0$

Slope ($\beta_1$)

\[ \begin{align} & \hat \beta_1 = \frac{SP_{xy}}{SS_x} \\ \\ \\ & \text{Where}: \\ & \text{SP}_\text{xy} = \text{sum of cross-products:} \\ & \text{SP}_\text{xy} = \sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \\ & \text{and} \\ & \text{SS}_\text{x} = \text{sums of squared deviations of x:} \\ & \text{SS}_\text{x} = \sum_{i = 1}^{n}(x_i - \bar{x})^2 \\ \end{align} \]

\[ \begin{align} &\text{SP}_\text{xy} =\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \\ & (1-3)(5-7) + (2-3)(7-7) + (3-3)(8-7) + (4-3)(6-7) + (5-3)(9-7) = \\ & 4 + 0 + 0 + (-1) + 4 = \\ & 7 \\ \\ & \text{SS}_\text{x}= \sum_{i = 1}^{n}(x_i - \bar{x})^2 = \\ & (1-3)^2 + (2-3)^2 + (3-3)^22 + (4-3)^2 + (5-3)^2 = \\ & 4 + 1 + 0 + 1 + 4 = \\ & 10 \\ \\ & \hat \beta_1 = \frac{SP_{xy}}{SS_x} = \frac{7}{10} = 0.7 \\ \end{align} \]

Intercept ($\beta_0$)

\[ \begin{align} &\hat \beta_0 = \bar{y} - \hat \beta_1 \bar{x} \\ &\hat \beta_0 = 7 - 0.7 \cdot 3 \\ &\hat \beta_0 = 7 - 2.1 \\ &\hat \beta_0 = 4.9 \end{align} \]

In R

There are numerous equivalent ways to obtain the estimated regression coefficients — that is, $\hat \beta_0$, $\hat \beta_1$, …., $\hat \beta_k$ — from the fitted model (for this below example, our fitted model has been named mdl):

mdl
mdl$coefficients
coef(mdl)
coefficients(mdl)

The standard error of the coefficient is an estimate of the standard deviation of the coefficient (i.e., how much uncertainty there is in our estimated coefficient).

The formula for the standard error of the slope is:

\[ \begin{align} & SE(\hat \beta_j) = \sqrt{\frac{\text{SS}_\text{Residual}/(n-k-1)}{\sum(x_{ij} - \bar{x_{j}})^2(1-R_{xj}^2)}} \\ \\ & \text{Where}: \\ \\ & \text{SS}_\text{Residual} = \text{ residual sum of squares} \\ & n = \text{ sample size} \\ & k = \text{ number of predictors} \\ & x_{ij} = \text{ the observed value of a predictor (j) for an individual (i)} \\ & \bar{x_{j}} = \text{the mean of a predictor (j)} \\ & R_{xj}^2 = \text{the multiple correlation coefficient of the predictors} \\ \end{align} \]

Let’s apply to a straightforward example. Suppose you have a simple linear regression model (i.e., with only one IV, which means that $R_{xj}^2 = 0$ since there is only one predictor) and the following data points:

Observed $x_i$	Observed $y_i$
1	5
2	7
3	8
4	6
5	9

There are a number of steps you need to take to calculate by hand:

Calculate sum of the squared residuals
1. Calculate predicted values
2. Calculate residuals (i.e., the difference between the observed value ($y_i$) and the predicted value ($\hat{y}_i$) for each observation)
3. Square the residuals
4. Calculate the Sum of Squared Residuals
Calculate the sum of squared deviations of the ($x$) values from their mean
Use values from 1 & 2 to calculate $SE(\hat \beta_j)$

Step 1.1: Calculate predicted values

Using $\hat{y}_i = \beta_0 + \beta_1 \cdot x_i$ and our model coefficients $\beta_0 = 4.9$ and $\beta_1 = 0.7$:

Observed ($x_i$)	Observed ($y_i$)	Predicted ($\hat{y}_i$)
1	5	4.9 + (0.7*1) = 5.6
2	7	4.9 + (0.7*2) = 6.3
3	8	4.9 + (0.7*3) = 7
4	6	4.9 + (0.7*4) = 7.7
5	9	4.9 + (0.7*5) = 8.4

Step 1.2: Calculate residuals

$\epsilon_1 = 5 − 5.6 = -0.6$
$\epsilon_2 = 7 - 6.3 = 0.7$
$\epsilon_3 = 8 - 7 = 1$
$\epsilon_4 = 6 - 7.7 = -1.7$
$\epsilon_5 = 9 − 8.4 = 0.6$

Step 1.3: Square the residuals

$\epsilon_1^2 = -0.6^2 = 0.36$
$\epsilon_2^2 = 0.7^2 = 0.49$
$\epsilon_3^2 = 1^2 = 1$
$\epsilon_4^2 = -1.7^2 = 2.89$
$\epsilon_5^2 = 0.6^2 = 0.36$

Step 1.4: Calculate the Sum of Squared Residuals

\[ \sum \epsilon_i^2 = 0.36 + 0.49 + 1 + 2.89 + 0.36 = 5.1 \]

Step 2. Calculate the sum of squared deviations of the ($x$) values from their mean

The mean of $x$ can be calculated as: $\bar x = {\frac{1+2+3+4+5}{5}} = 3$. Using this, we can then calculate the sum of squared deviations of $x$:

\[ \begin{align} & \sum_{i = 1}^{n}(x_i - \bar{x})^2 = \\ & (1-3)^2 + (2-3)^2 + (3-3)^22 + (4-3)^2 + (5-3)^2 = \\ & 4 + 1 + 0 + 1 + 4 = \\ & 10 \\ \end{align} \]

Step 3: Calculate $SE(\hat \beta_j)$

From this, we can finally calculate $SE(\hat \beta_j)$:

\[ \begin{align} & SE(\hat \beta_j) = \sqrt{\frac{\text{SS}_\text{Residual}/(n-k-1)}{\sum(x_{ij} - \bar{x_{j}})^2(1-R_{xj}^2)}} \\ & SE(\hat \beta_j) = \sqrt{\frac{5.1/(5-1-1)}{10 \cdot (1-0)}} \\ & SE(\hat \beta_j) = \sqrt{\frac{5.1/3}{10}} \\ & SE(\hat \beta_j) = \sqrt{\frac{1.7}{10}} \\ & SE(\hat \beta_j) = \sqrt{0.17} \\ & SE(\hat \beta_j) = 0.4207 \\ \end{align} \]

In R

If you wanted to obtain just the standard error for each estimated regression coefficient, you could do the following (for this below example, our fitted model has been named mdl):

summary(mdl)$coefficients[,2]

The t-statistic is the $\beta$ coefficient divided by the standard error:

\[ t = \frac{\hat \beta_j - 0}{SE(\hat \beta_j)} \]

which follows a $t$-distribution with $n-k-1$ degrees of freedom (where $k$ = number of predictors and $n$ = sample size).

With this, we can test the the null hypothesis $H_0: \beta_j = 0$.

Generally speaking, you want your model coefficients to have large $t$-statistics as this would indicate that the standard error was small in comparison to the coefficient. The larger our $t$-statistic, the more confident we can be that the coefficient is not 0.

How to calculate $t = \frac{\hat \beta_j - 0}{SE(\hat \beta_j)}$

By Hand
In R

We can calculate the test statistic $t$ for $\beta_\text{Age}$ (or $\beta_2$) by hand from our recall_multi model as follows:

\[ \begin{align} \\ & t = \frac{\hat \beta_j - 0}{SE(\hat \beta_j)} \\ \\ & t = \frac{-0.3392 - 0}{0.1534} \\ \\ & t = -2.211213 \\ \\ & t = -2.21 \\ \end{align} \]

We then need to calculate $t^*$:

n <- nrow(recalldata)
k <- 2
tstar <- qt(0.975, df = n - k - 1)
tstar

[1] 2.109816

And finally compare $|t|$ to $t^*$. Since $|t|$ is larger than $t^*$ (-2.21 > 2.11), we can reject the null hypothesis.

In R

If you wanted to obtain just the $t$-values for each estimated regression coefficient, you could use the following:

coef(summary(mdl))[, "t value"]
summary(mdl)$coefficients[,3]

For example:

coef(summary(recall_multi))[, "t value"]

      (Intercept) recall_confidence               age 
         2.815890          4.684654         -2.211515

summary(recall_multi)$coefficients[,3]

      (Intercept) recall_confidence               age 
         2.815890          4.684654         -2.211515

From our $t$-value, we can compute our $p$-value. The $p$-value help us to understand whether our coefficient(s) are statistically significant (i.e., that the coefficient is statistically different from 0). The $p$-value of each estimate indicates the probability of observing a $t$-value at least as extreme as, or more extreme than, the one calculated from the sample data when assuming the null hypothesis to be true.

In Psychology, a $p$-value < .05 is usually used to make statements regarding statistical significance (it is important that you always state your $\alpha$ level to help your reader understand any statements regarding statistical significance).

The number of asterisks marks corresponds with the significance of the coefficient (see the ‘Signif. codes’ legend just under the coefficients section).

In R

If you wanted to obtain just the $p$-values for each estimated regression coefficient, you could do the following (for this below example, our fitted model has been named mdl):

summary(mdl)$coefficients[,4]

Confidence Intervals

Using the estimate and standard error of a given $\beta$ coefficient, we can create confidence intervals to estimate a plausible range of values for the true population parameter. Recall the formula for obtaining a confidence interval for the population slope is:

\[ \hat \beta_j \pm t^* \cdot SE(\hat \beta_j) \]

where $t^*$ denotes the critical value chosen from $t$-distribution with $n-k-1$ degrees of freedom (where $k$ = number of predictors and $n$ = sample size) for a desired $\alpha$ level of confidence.

How to calculate $\hat \beta_j \pm t^* \cdot SE(\hat \beta_j)$

By Hand
Using R

To calculate by hand for $\hat \beta_\text{Age}$ from our recall_multi model, we first need to calculate $t^*$:

n <- nrow(recalldata)
k <- 2
tstar <- qt(0.975, df = n - k - 1)
tstar

[1] 2.109816

For 95% confidence intervals, we use $t^* = 2.1098$, and can simply substitute into the formula:

\[ \begin{align} & \text{Lower CI} = \hat \beta_\text{Age} - t^* \cdot SE(\hat \beta_\text{Age}) \\ & \text{Lower CI} = -0.3392 - (2.1098 \cdot 0.1534) \\ & \text{Lower CI} = -0.6628433 \\ & \text{Lower CI} = -0.663 \\ \\ & \text{Upper CI} = \hat \beta_\text{Age} + t^* \cdot SE(\hat \beta_\text{Age}) \\ & \text{Upper CI} = -0.3392 + (2.1098 \cdot 0.1534) \\ & \text{Upper CI} = -0.01555668 \\ & \text{Upper CI} = -0.016 \\ \end{align} \]

In R

We can obtain the confidence intervals for the regression coefficients using the command confint().

confint(recall_multi)

                       2.5 %      97.5 %
(Intercept)        9.0668871 63.25228524
recall_confidence  0.4923220  1.29913640
age               -0.6627188 -0.01559663

Or alternatively use R to compute using the manual process (though it makes more sense to use confint() given it is less prone to typos!):

tibble(
  b2_LowerCI = round(-0.3392 - (qt(0.975, n-3) * 0.1534), 3),
  b2_UpperCI = round(-0.3392 + (qt(0.975, n-3) * 0.1534), 3)
      )

# A tibble: 1 × 2
  b2_LowerCI b2_UpperCI
       <dbl>      <dbl>
1     -0.663     -0.016

$\sigma$

Multiple regression output in R, model standard deviation of the errors highlighted

The standard deviation of the errors, denoted by $\sigma$, is an important quantity that our model estimates. It represents how much individual data points tend to deviate above and below the regression line - in other words, it tells us how well the model fits the data.

A small $\sigma$ indicates that the points hug the line closely and we should expect fairly accurate predictions, while a large $\sigma$ suggests that, even if we estimate the line perfectly, we can expect individual values to deviate from it by substantial amounts.

The estimated standard deviation of the errors is denoted $\hat \sigma$, and is estimated by essentially averaging squared residuals (giving the variance) and taking the square-root:

\[ \begin{align} & \hat \sigma = \sqrt{\frac{\text{SS}_\text{Residual}}{n - k - 1}} \\ \qquad \\ & \text{Where:} \\ & \text{SS}_\text{Residual} = \textrm{Sum of Squared Residuals} = \sum_{i=1}^n{(\epsilon_i)^2} \\ \\ \\ & \text{and so, equivalently:} \\ \\ & \hat \sigma = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - k - 1}} \\ \end{align} \]

How to calculate $\hat \sigma$

By Hand
Using R

There are a number of steps you need to take to calculate by hand:

Calculate predicted values
Calculate residuals (i.e., the difference between the observed value ($y_i$) and the predicted value ($\hat{y}_i$) for each observation)
Square the residuals
Calculate the Sum of Squared Residuals
Determine the Residual Standard Deviation ($\sigma$)

Let’s apply to a straightforward example. Suppose you have a simple linear regression model (i.e., with only one IV) and the following data points:

Observed ($x_i$)	Observed ($y_i$)
1	5
2	7
3	8
4	6
5	9

Step 1: Calculate predicted values

Using $\hat{y}_i = \beta_0 + \beta_1 \cdot x_i$ and our model coefficients $\beta_0 = 4.9$ and $\beta_1 = 0.7$:

Observed ($x_i$)	Observed ($y_i$)	Predicted ($\hat{y}_i$)
1	5	4.9 + (0.7*1) = 5.6
2	7	4.9 + (0.7*2) = 6.3
3	8	4.9 + (0.7*3) = 7
4	6	4.9 + (0.7*4) = 7.7
5	9	4.9 + (0.7*5) = 8.4

Step 2: Calculate residuals

$\epsilon_1 = 5 − 5.6 = -0.6$
$\epsilon_2 = 7 - 6.3 = 0.7$
$\epsilon_3 = 8 - 7 = 1$
$\epsilon_4 = 6 - 7.7 = -1.7$
$\epsilon_5 = 9 − 8.4 = 0.6$

Step 3: Square the residuals

$\epsilon_1^2 = -0.6^2 = 0.36$
$\epsilon_2^2 = 0.7^2 = 0.49$
$\epsilon_3^2 = 1^2 = 1$
$\epsilon_4^2 = -1.7^2 = 2.89$
$\epsilon_5^2 = 0.6^2 = 0.36$

Step 4: Calculate the Sum of Squared Residuals

\[ \sum \epsilon_i^2 = 0.36 + 0.49 + 1 + 2.89 + 0.36 = 5.1 \]

Step 5: Determine the Residual Standard Deviation ($\sigma$)

\[ \begin{align} & \hat \sigma = \sqrt{\frac{\text{SS}_\text{Residual}}{n - k - 1}} \\ \\ & \hat \sigma = \sqrt{\frac{5.1}{5 - 1 - 1}} \\ \\ & \hat \sigma = \sqrt{\frac{5.1}{3}} \\ \\ & \hat \sigma = \sqrt{1.70} \\ \\ & \hat \sigma = 1.304 \\ \end{align} \]

In R

There are a couple of equivalent ways to obtain the estimated standard deviation of the errors — that is, $\hat \sigma$ — from the fitted model (for this example, our fitted model has been named mdl):

sigma(mdl)
summary(mdl)

Model Predicted Values & Residuals

Model predicted values are the estimates generated by a regression model for the dependent variable based on the independent variable(s), whilst residuals are the differences between these predicted values and the actual observed values (in turn indicating the accuracy of the model’s predictions).

Predicted Values

Residuals

Predicted Values - Example

Data Transformations

There are many transformations we can do to a continuous variable, but the most common ones are centering and scaling. These transformations can help to aid interpretability of our statistical models.

Centering

Scaling

Standardisation

predictor	outcome	in lm	coefficient	interpretation
standardised	raw	`y ~ scale(x)`	\(\beta = b \cdot s_x\)	“difference in Y for a 1 SD increase in X”
standardised	standardised	`scale(y) ~ scale(x)`	\(\beta = b \cdot \frac{s_x}{s_y}\)	“difference in SD of Y for a 1 SD increase in X”

Model Fit

Linear Models

Assessing model fit involves examining metrics like the sum of squares to measure variability explained by the model, the $F$-ratio to evaluate the overall significance of the model by comparing explained variance to unexplained variance, and $R$-squared / Adjusted $R$-squared to quantify the proportion of variance in the dependent variable explained by the independent variable(s).

Sums of Squares

To quantify and assess a model’s utility in explaining variance in an outcome variable, we can split the total variability of that outcome variable into two terms: the variability explained by the model plus the variability left unexplained in the residuals.

The sum of squares measures the deviation or variation of data points away from the mean (i.e., how spread out are the numbers in a given dataset). We are trying to find the equation/function that best fits our data by varying the least from our data points.

Total Sum of Squares

Formula:

\[ \text{SS}_\text{Total} = \sum_{i=1}^{n}(y_i - \bar{y})^2 \] Can also be derived from:

\[ \text{SS}_\text{Total} = \text{SS}_\text{Model} + \text{SS}_\text{Residual} \]

In words:

Squared distance of each data point from the mean of $y$.

Description:

How much variation there is in the DV.

Example:

Let’s apply to a straightforward example to try by-hand. Suppose you have a simple linear regression model (i.e., with only one IV) where you have the following data points:

Observed $x_i$	Observed $y_i$
1	5
2	7
3	8
4	6
5	9

Steps:

Calculate the mean of $y$ ($\bar y$)
Calculate for each observation $y_i$ - $\bar y$
Square each of the obtained $y_i$ - $\bar y$ values
Sum squared values

Step 1: Calculate the mean of $y_i$

$\bar y = {\frac{5+7+8+6+9}{5}} = 7$

Step 2 & 3: Calculate for each observation $y_i$ - $\bar y$ & square values

Observed $y_i$	$y_i$ - $\bar y$	$(y_i - \bar y)^2$
5	5 - 7 = -2	$-2^2 = 4$
7	7 - 7 = 0	$0^2 = 0$
8	8 - 7 = 1	$1^2 = 1$
6	6 - 7 = -1	$-1^2 = 1$
9	9 - 7 = 2	$2^2 = 4$

Step 4: Calculate sum squared values

$\text{SS}_\text{Total} = 4 + 0 + 1 + 1 + 4 = 10$

Residual Sum of Squares

Formula:

\[ \text{SS}_\text{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

In words:

Squared distance of each point from the predicted value.

Description:

How much of the variation in the DV the model did not explain - a measure that captures the unexplained variation in your regression model. Lower residual sum of squares suggests that your model fits the data well, and higher suggests that the model poorly explains the data (in other words, the lower the value, the better the regression model). If the value was zero here, it would suggest the model fits perfectly with no error.

Example:

Let’s apply to a straightforward example to try by-hand. Suppose you have a simple linear regression model (i.e., with only one IV) where you have the following data points:

Observed $x_i$	Observed $y_i$
1	5
2	7
3	8
4	6
5	9

Steps:

Calculate predicted values ($\hat{y}_i$)
Calculate residuals (i.e., the difference between the observed value ($y_i$) and the predicted value ($\hat{y}_i$) for each observation)
Square the residuals
Sum squared values

Step 1: Calculate predicted values

Using $\hat{y}_i = \beta_0 + \beta_1 \cdot x_i$ and our model coefficients $\beta_0 = 4.9$ and $\beta_1 = 0.7$:

Observed $x_i$	Observed $y_i$	Predicted ($\hat{y}_i$)
1	5	$4.9 + (0.7*1) = 5.6$
2	7	$4.9 + (0.7*2) = 6.3$
3	8	$4.9 + (0.7*3) = 7$
4	6	$4.9 + (0.7*4) = 7.7$
5	9	$4.9 + (0.7*5) = 8.4$

Step 2: Calculate residuals

$\epsilon_1 = 5 − 5.6 = -0.6$
$\epsilon_2 = 7 - 6.3 = 0.7$
$\epsilon_3 = 8 - 7 = 1$
$\epsilon_4 = 6 - 7.7 = -1.7$
$\epsilon_5 = 9 − 8.4 = 0.6$

Step 3: Square the residuals

$\epsilon_1^2 = -0.6^2 = 0.36$
$\epsilon_2^2 = 0.7^2 = 0.49$
$\epsilon_3^2 = 1^2 = 1$
$\epsilon_4^2 = -1.7^2 = 2.89$
$\epsilon_5^2 = 0.6^2 = 0.36$

Step 4: Calculate sum of squared values

$\text{SS}_\text{Residual} = 0.36 + 0.49 + 1 + 2.89 + 0.36 = 5.1$

Model Sum of Squares

Formula:

\[ \text{SS}_\text{Model} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \]

Can also be derived from:

\[ \text{SS}_\text{Model} = \text{SS}_\text{Total} - \text{SS}_\text{Residual} \]

In words:

The deviance of the predicted scores from the mean of $y$.

Description:

How much of the variation in the DV your model explained - like a measure that captures how well the regression line fits your data.

Example:

Let’s apply to a straightforward example to try by-hand. Suppose you have a simple linear regression model (i.e., with only one IV) where you have the following data points:

Observed $x_i$	Observed $y_i$
1	5
2	7
3	8
4	6
5	9

Steps:

Calculate mean of $y$ ($\bar y$)
Calculate predicted values ($\hat{y}_i$)
Calculate for each observation $\hat{y}_i - \bar y$
Squaring each of the obtained $\hat{y}_i - \bar y$ values
Sum squared values

Step 1: Calculate the mean of $y_i$

$\bar y = {\frac{5+7+8+6+9}{5}} = 7$

Step 2: Calculate predicted values

Using $\hat{y}_i = \beta_0 + \beta_1 \cdot x_i$ and our model coefficients $\beta_0 = 4.9$ and $\beta_1 = 0.7$:

Observed ($x_i$)	Observed ($y_i$)	Predicted ($\hat{y}_i$)
1	5	$4.9 + (0.7*1) = 5.6$
2	7	$4.9 + (0.7*2) = 6.3$
3	8	$4.9 + (0.7*3) = 7$
4	6	$4.9 + (0.7*4) = 7.7$
5	9	$4.9 + (0.7*5) = 8.4$

Step 3 & 4: Calculate for each observation $\hat{y}_i$ - $\bar y$ & square values

$\hat{y}_i$ - $\bar y$	$(\hat{y}_i - \bar y)^2$
$5.6 - 7 = -1.4$	$(-1.4)^2 = 1.96$
$6.3 - 7 = -0.7$	$(-0.7)^2 = 0.49$
$7 - 7 = 0$	$(0)^2 = 0$
$7.7 - 7 = 0.7$	$(0.7)^2 = 0.49$
$8.4 - 7 = 1.4$	$(1.4)^2 = 1.96$

Step 5: Calculate sum of squared values

$\text{SS}_\text{Model} = 1.96 + 0.49 + 0 + 0.49 + 1.96 = 4.9$

Alternatively:

\[ \begin{align} & \text{SS}_\text{Model} = \text{SS}_\text{Total} - \text{SS}_\text{Residual} \\ & \text{SS}_\text{Model} = 10 - 5.1 \\ & \text{SS}_\text{Model} = 4.9 \\ \end{align} \]

F-ratio

Overview:

We can perform a test to investigate if a model is ‘useful’ — that is, a test to see if our explanatory variable explains more variance in our outcome than we would expect by just some random chance variable.

With one predictor, the $F$-statistic is used to test the null hypothesis that the regression slope for that predictor is zero:

\[ H_0: \text{the model is ineffective, }b_1 = 0 \\ \] \[ H_1 : \text{the model is effective, }b_1 \neq 0 \\ \]

In multiple regression, the logic is the same, but we are now testing against the null hypothesis that all regression slopes are zero. Our test is framed in terms of the following hypotheses:

\[ H_0: \text{the model is ineffective, }b_1,...., b_k = 0 \\ \]

\[ H_1 : \text{the model is effective, }b_1,...., b_k \neq 0 \\ \]

The relevant test-statistic is the $F$-statistic, which uses “Mean Squares” (these are Sums of Squares divided by the relevant degrees of freedom). We then compare that against (you guessed it) an $F$-distribution! $F$-distributions vary according to two parameters, which are both degrees of freedom.

Formula:

\[ \text{F}_{(df_{model},~df_{residual})} = \frac{\text{MS}_\text{Model}}{\text{MS}_\text{Residual}} = \frac{\text{SS}_\text{Model}/\text{df}_\text{Model}}{\text{SS}_\text{Residual}/\text{df}_\text{Residual}} \\ \quad \\ \]

\[ \begin{align} & \text{Where:} \\ & df_{model} = k \\ & df_{residual} = n-k-1 \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ \end{align} \]

Description:

To test the significance of an overall model, we can conduct an $F$-test. The $F$-test compares your model to a model containing zero predictor variables (i.e., the intercept only model), and tests whether your added predictor variables significantly improved the model.

It is called the $F$-ratio because it is the ratio of the how much of the variation is explained by the model (per parameter) versus how much of the variation is unexplained (per remaining degrees of freedom).

The $F$-test involves testing the statistical significance of the $F$-ratio.

Q: What does the $F$-ratio test?
A: The null hypothesis that all regression slopes in a model are zero (i.e., explain no variance in your outcome/DV). The alternative hypothesis is that at least one of the slopes is not zero.

The $F$-ratio you see at the bottom of summary(model) is actually a comparison between two models: your model (with some explanatory variables in predicting $y$) and the null model.

In regression, the null model can be thought of as the model in which all explanatory variables have zero regression coefficients. It is also referred to as the intercept-only model, because if all predictor variable coefficients are zero, then we are only estimating $y$ via an intercept (which will be the mean - $\bar y$).

Interpretation:

Alongside viewing the $F$-ratio, you can see the results from testing the null hypothesis that all of the coefficients are $0$ (the alternative hypothesis is that at least one coefficient is $\neq 0$. Under the null hypothesis that all coefficients = 0, the ratio of explained:unexplained variance should be approximately 1)

If your model predictors do explain some variance, the $F$-ratio will be significant, and you would reject the null, as this would suggest that your predictor variables included in your model improved the model fit (in comparison to the intercept only model).

Points to note:

The larger your $F$-ratio, the better your model
The $F$-ratio will be close to 1 when the null is true (i.e., that all slopes are zero)

How to calculate $F$-ratio

By Hand
Using R

Steps:

Calculate model sum of squares
Calculate residual sum of squares
Calculate total sum of squares
Calculate $df_{model}$
Calculate $df_{residual}$

Step 1, 2, & 3

Follow steps above in the Sums of Squares flashcard:

\[ \begin{align} & \text{SS}_\text{Total} = 10 \\ & \text{SS}_\text{Residual} = 5.1 \\ & \text{SS}_\text{Model} = 4.9 \\ \end{align} \]

Step 4: Calculate $df_{model}$

\[ \begin{align} &df_{model} = k \\ &df_{model} = 1 \end{align} \]

Step 5: Calculate $df_{residual}$

\[ \begin{align} &df_{residual} = n-k-1 \\ &df_{residual} = 5-1-1 \\ &df_{residual} = 3 \end{align} \]

\[ \begin{align} &\text{F}_{(df_{model},~df_{residual})} = \frac{\text{MS}_\text{Model}}{\text{MS}_\text{Residual}} = \frac{\text{SS}_\text{Model}/\text{df}_\text{Model}}{\text{SS}_\text{Residual}/\text{df}_\text{Residual}} \\ \\ &\text{F}_{(df_{model},~df_{residual})} = \frac{4.9/1}{5.1/3} \\ \\ &\text{F}_{(df_{model},~df_{residual})} = \frac{4.9}{1.7} \\ \\ &\text{F}_{(df_{model},~df_{residual})} = 2.88 \end{align} \]

In R

We can see the $F$-statistic and associated $p$-value at the bottom of the output of summary(<modelname>):

Multiple regression output in R, F statistic highlighted

Alternatively, you can extract this information as it is stored in the summary() of the model:

#F-Statistic
summary(recall_multi)$fstatistic

   value    numdf    dendf 
12.92426  2.00000 17.00000

#P-Value
pf(summary(recall_multi)$fstatistic[1], 
   summary(recall_multi)$fstatistic[2], 
   summary(recall_multi)$fstatistic[3], 
   lower.tail = FALSE)

       value 
0.0003866881

Example Interpretation

The linear model with recall confidence and age explained a significant amount of variance in recall accuracy beyond what we would expect by chance $F(2, 17) = 12.92, p < .001$.

R-squared and Adjusted R-squared

Overview:

$R^2$ represents the proportion of variance in $Y$ that is explained by the model predictor variables.

Formula:

The $R^2$ coefficient is defined as the proportion of the total variability in the outcome variable which is explained by our model:

\[ R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}} \]

The Adjusted $R^2$ coefficient is defined as:

\[ \hat R^2 = 1 - \frac{(1 - R^2)(n-1)}{n-k-1} \quad \\ \]

\[ \begin{align} & \text{Where:} \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ \end{align} \]

When to report Multiple $R^2$ vs. Adjusted $R^2$:

The Multiple $R^2$ value should be reported for a simple linear regression model (i.e., one predictor).

Unlike $R^2$, Adjusted-$R^2$ does not necessarily increase with the addition of more explanatory variables, by the inclusion of a penalty according to the number of explanatory variables in the model. Since Adjusted-$R^2$ is adjusted for the number of predictors in the model, this should be used when there are 2 or more predictors in the model. As a side note, the Adjusted-$R^2$ should always be less than or equal to $R^2$.

How to calculate Multiple $R^2$ & Adjusted $R^2$

By Hand
Using R

Using the information calculated above in the Sums of Squares flashcard above, we can simply substitute values into the formula for $R^2$:

\[ \begin{align} & R^2 = \frac{\text{SS}_{\text{Model}}}{\text{SS}_{\text{Total}}} = 1 - \frac{\text{SS}_{\text{Residual}}}{\text{SS}_{\text{Total}}} \\ \\ & R^2 = \frac{4.9}{10} = 1 - \frac{5.1}{10} \\ \\ & R^2 = 0.49 = 0.49 \end{align} \]

And for Adjusted-$R^2$:

\[ \begin{align} & \text{Adjusted-R}^2 = 1 - \frac{(1 - R^2)(n-1)}{n-k-1} \\ & \quad \\ & \text{Adjusted-R}^2 = 1 - \frac{(1 - 0.49)(5-1)}{5-1-1} \\ & \quad \\ & \text{Adjusted-R}^2 = 1 - \frac{(0.51)(4)}{3} \\ & \quad \\ & \text{Adjusted-R}^2 = 1 - \frac{2.04}{3} \\ & \quad \\ & \text{Adjusted-R}^2 = 1 - 0.68 \\ & \quad \\ & \text{Adjusted-R}^2 = 0.32 \\ \end{align} \]

In R

We can see both $R^2$ and Adjusted-$R^2$ in the second bottom row of the summary(<modelname>):

Multiple regression output in R, R^2 statistic highlighted

Alternatively, you can extract this information as it is stored in the summary() of the model:

#R-Squared
summary(recall_multi)$r.squared

[1] 0.6032536

#Adjusted R-Squared
summary(recall_multi)$adj.r.squared

[1] 0.5565775

Example Interpretation

Together, recall confidence and age explained approximately 55.66% of the variance in recall accuracy.

Model Comparisons

Linear Models

One useful thing we might want to do is compare our models with and without some predictor(s).There are numerous ways we can do this, but the method chosen depends on the models and underlying data:

Nested vs Non-Nested Models

Incremental F-test

AIC & BIC

AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) combine information about the sample size, the number of model parameters, and the residual sums of squares ($SS_{residual}$). Models do not need to be nested to be compared via AIC and BIC, but they need to have been fit to the same dataset.

AIC can be calculated as:

\[ \begin{align} & AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k \\ \end{align} \quad \\ \]

BIC can be calculated as:

\[ \begin{align} & BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n) \\ \end{align} \quad \\ \]

Where for both AIC and BIC:

\[ \begin{align} & SS_{residual} = \text{sum of squares residuals} \\ & n = \text{sample size} \\ & k = \text{number of explanatory variables} \\ & \text{ln} = \text{natural log function} \end{align} \]

For both of these fit indices, lower values are better, and both include a penalty for the number of predictors in the model (although BIC’s penalty is harsher).

So how do we determine whether there is a statistical difference between two models? To evaluate our model comparisons, we need to look at the difference ($\Delta$) between the two values:

AIC: There are no specific thresholds to suggest how big a difference in two models is needed to conclude that one is substantively better than the other
BIC: Using the following $\Delta BIC$ cutoffs (Raftery, 1995):

Value	Interpretation of Difference between Models
$\Delta < 2$	No evidence
$2 > \Delta < 6$	Positive evidence
$6 > \Delta < 10$	Strong evidence
$\Delta > 10$	Very strong evidence

In R

We can calculate AIC and BIC by using the AIC() and BIC() functions respectively:

#AIC
AIC(modelname)

#BIC
BIC(modelname)

For example, with AIC:

and BIC:

Example Interpretation

Based on both AIC and BIC, the model predicting recall accuracy that included both recall confidence and age was better fitting $(\text{AIC} = 152.28; \text{BIC} = 156.27)$ than the model with age alone $(\text{AIC} = 166.86; \text{BIC} = 169.85)$.

General Formatting & Presenting of Results

LaTeX Symbols & Equations

By embedding LaTeX into RMarkdown, you can accurately and precisely format mathematical expressions, ensuring that they are not only technically correct but also visually appealing and easy to interpret.

LaTeX Guide

APA Formatting

APA format is a writing/presentation style that is often used in psychology to ensure consistency in communication. APA formatting applies to all aspects of writing - from formatting of papers (including tables and figures), citation of sources, and reference lists. This means that it also applies to how you present results in your Psychology courses, including DAPR2.

APA Formatting Guides

Tables

We want to ensure that we are presenting results in a well formatted table. To do so, there are lots of different packages available (see Lesson 4 of the RMD bootcamp).

One of the most convenient ways to present results from regression models is to use the tab_model() function from sjPlot.

Creating tables via tab_model

Cross Referencing

Cross-referencing is a very helpful way to direct your reader through your document, and the good news is that this can be done automatically in RMarkdown.

Cross Referencing

Observed \(y_i\)	\(y_i\) - \(\bar y\)	\((y_i - \bar y)^2\)
5	5 - 7 = -2	\(-2^2 = 4\)
7	7 - 7 = 0	\(0^2 = 0\)
8	8 - 7 = 1	\(1^2 = 1\)
6	6 - 7 = -1	\(-1^2 = 1\)
9	9 - 7 = 2	\(2^2 = 4\)

Observed \(x_i\)	Observed \(y_i\)	Predicted (\(\hat{y}_i\))
1	5	\(4.9 + (0.7*1) = 5.6\)
2	7	\(4.9 + (0.7*2) = 6.3\)
3	8	\(4.9 + (0.7*3) = 7\)
4	6	\(4.9 + (0.7*4) = 7.7\)
5	9	\(4.9 + (0.7*5) = 8.4\)

Observed (\(x_i\))	Observed (\(y_i\))	Predicted (\(\hat{y}_i\))
1	5	\(4.9 + (0.7*1) = 5.6\)
2	7	\(4.9 + (0.7*2) = 6.3\)
3	8	\(4.9 + (0.7*3) = 7\)
4	6	\(4.9 + (0.7*4) = 7.7\)
5	9	\(4.9 + (0.7*5) = 8.4\)

\(\hat{y}_i\) - \(\bar y\)	\((\hat{y}_i - \bar y)^2\)
\(5.6 - 7 = -1.4\)	\((-1.4)^2 = 1.96\)
\(6.3 - 7 = -0.7\)	\((-0.7)^2 = 0.49\)
\(7 - 7 = 0\)	\((0)^2 = 0\)
\(7.7 - 7 = 0.7\)	\((0.7)^2 = 0.49\)
\(8.4 - 7 = 1.4\)	\((1.4)^2 = 1.96\)

Block 1 Flash Cards

R Packages

Presenting Results

Back to Basics

Data Exploration

Numeric Exploration

Descriptives

Correlation

Size

Direction

Visual Exploration

Functions and Mathematical Models

Deterministic Models

Statistical Models

Numeric Outcomes & Numeric Predictors

Simple Linear Regression Models

Example

Multiple Linear Regression Models

Example

General - Extracting Information

Model Predicted Values & Residuals

Model predicted values (\(\hat y_i\)) for sample data

Model predicted values for other (unobserved) data

Data Transformations

Model Fit

Linear Models

Total Sum of Squares

Residual Sum of Squares

Model Sum of Squares

Model Comparisons

Linear Models

General Formatting & Presenting of Results

LaTeX Symbols & Equations

APA Formatting

Tables

Cross Referencing

Footnotes