Blocks 1, 2, & 3 Flash Cards

Flash Card Aims

The purpose of these flashcards is to complement your Semester 1 Weeks 1 - 11 and Semester 2 Weeks 1 - 5 core learning materials (i.e., your lecture and lab materials) by offering additional guidance and examples on key concepts/topics. They are designed to deepen your understanding, clarify complex concepts, and help you make connections between different areas of study. Think of them as an extra resource that supports what you’re learning in the classroom.

You may want to use the below as a supporting document whilst you work through lab exercises, and/or refer to it to aid revision.

R Packages

Within this reading, the following packages are used:

  • tidyverse
  • sjPlot
  • kableExtra
  • psych
  • emmeans
  • performance
  • car
  • interactions

Presenting Results

Note that you must not copy any of the write-ups included below for future reports - if you do, you will be committing plagiarism, and this type of academic misconduct is taken very seriously by the University. You can find out more here.

Back to Basics

For an overview of basic statistical tests and core concepts (e.g., \(p\)-values), please revisit the DAPR1 materials for a refresher (also accessible via the DAPR1 Learn page).

Terminology

Data Exploration

The common first port of call for almost any statistical analysis is to explore the data, and we can do this visually and/or numerically.

Marginal Distributions

Description: The distribution of each variable individually (i.e., without reference to the values of the other variables).

Visually: Plot each variable individually. You could use, for example, geom_density() for a density plot or geom_histogram() for a histogram to comment on and/or examine:
  • The shape of the distribution. Look at the shape, centre and spread of the distribution. Is it symmetric or skewed? Is it unimodal or bimodal?
  • Any unusual observations. Do you notice any extreme observations (i.e., outliers)?

Numerically: Compute and report summary statistics, e.g., mean, standard deviation, median, min, max. You could, for example, calculate summary statistics such as the mean (mean()) and standard deviation (sd()) within summarize().

Bivariate Associations

Description: The association between two numeric variables.

Visually: Plot the association between two variables. You could use, for example, geom_point() for a scatterplot to comment on and/or examine:
  • The direction of the association, i.e., whether there is a positive or negative association
  • The form of the association, i.e., whether the relationship between the variables can be summarized well with a straight line or needs some more complicated pattern
  • The strength of the association, i.e., how closely the points fall to a recognizable pattern such as a line
  • Unusual observations that do not fit the pattern of the rest of the observations and which are worth examining in more detail

Numerically: Compute and report the correlation coefficient. You can use the cor() function to calculate this.
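The exploration steps above can be sketched as follows. This is a minimal illustration on simulated data; the object and variable names (study_data, hours, score) are made up for the example and do not come from the course materials.

```r
library(tidyverse)

# Simulated (hypothetical) data: hours studied and test score
set.seed(1)
study_data <- tibble(
  hours = runif(100, min = 0, max = 10),
  score = 40 + 4 * hours + rnorm(100, mean = 0, sd = 8)
)

# Marginal distribution, visually: density plot and histogram of one variable
p_density <- ggplot(study_data, aes(x = score)) +
  geom_density()
p_hist <- ggplot(study_data, aes(x = score)) +
  geom_histogram(bins = 20)

# Bivariate association, visually: scatterplot of two numeric variables
p_scatter <- ggplot(study_data, aes(x = hours, y = score)) +
  geom_point()

# Marginal distribution, numerically: summary statistics within summarize()
descriptives <- study_data %>%
  summarize(
    M = mean(score),
    SD = sd(score),
    Mdn = median(score),
    Min = min(score),
    Max = max(score)
  )

# Bivariate association, numerically: correlation coefficient
r_hours_score <- cor(study_data$hours, study_data$score)
```

Printing p_density, p_hist, or p_scatter draws the corresponding plot; printing descriptives shows a one-row table of summary statistics.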

Numeric Exploration

Numeric exploration of data involves examining key statistics such as the mean, median, and standard deviation via descriptives tables, and assessing the associations among variables through correlation coefficients. Exploring our data numerically helps us to identify patterns and associations in the data.
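As a rough sketch of what this can look like in base R, the example below uses the built-in mtcars data purely for illustration (the choice of dataset and variables is an assumption, not taken from the course materials).

```r
# Built-in mtcars data, used here purely for illustration
vars <- mtcars[, c("mpg", "wt", "hp")]

# Descriptives: mean, SD, and median for each variable
descriptives <- data.frame(
  M   = sapply(vars, mean),
  SD  = sapply(vars, sd),
  Mdn = sapply(vars, median)
)
round(descriptives, 2)

# Correlation matrix: pairwise correlations among all variables
cor_matrix <- cor(vars)
round(cor_matrix, 2)

# Hypothesis test for a single correlation (H0: the population correlation is 0)
cor_result <- cor.test(vars$mpg, vars$wt)
cor_result$estimate  # sample correlation r
cor_result$p.value   # two-sided p-value
```

The psych package listed above also provides a describe() function that produces a fuller descriptives table in one call.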

Descriptives

Descriptives Tables


Descriptives Tables - Examples

Correlation

Correlation Coefficient


Correlation Matrix


Correlation - Hypothesis Testing


Correlation - Hypothesis Testing in R

Visual Exploration

Visual exploration allows us to examine the distributions of our variables, and to identify potential associations between them.

How to Visualise Data


Data Visualisation - Marginal Examples


Data Visualisation - Bivariate Examples

Functions and Mathematical Models

Basic functions and mathematical models are foundational tools used to describe and predict associations between variables.
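For instance, a deterministic model maps each input to exactly one output, with no error term. The sketch below uses a hypothetical linear function (the intercept of 2 and slope of 3 are arbitrary values chosen for illustration).

```r
# A hypothetical deterministic model: y = 2 + 3x (no error term)
linear_fn <- function(x, intercept = 2, slope = 3) {
  intercept + slope * x
}

# Predicted values for a handful of inputs
x_values <- 0:4
predicted <- linear_fn(x_values)
predicted
```

Because the model is deterministic, the same x always returns the same y, and each one-unit increase in x changes y by exactly the slope.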

Identification & Specification


Deterministic Models - Description & Specification


Deterministic Models - Visualisation


Deterministic Models - Predicted Values

Statistical Models

Statistical models are used to understand the associations among variables.
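As a minimal sketch, linear regression models can be fitted with lm(). The example uses the built-in mtcars data, which is an assumption made for illustration only.

```r
# Simple linear regression: one numeric predictor
simple_model <- lm(mpg ~ wt, data = mtcars)

# Multiple linear regression: two numeric predictors
multiple_model <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient estimates, standard errors, t-values, and p-values
summary(simple_model)

# Just the estimated coefficients (intercept and slopes)
coef(multiple_model)
```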

Specifying Hypotheses


Simple Linear Regression Models - Description & Specification

Numeric Outcomes & Predictors

Simple Linear Regression Models - Example


Multiple Linear Regression Models - Description & Specification


Simple Linear Regression Models - Visualisation


Multiple Linear Regression Models - Visualisation

Numeric Outcomes & Categorical Predictors

Overview


Coding Variables as Factors


Binary Predictors


Categorical Predictors with k levels


Dummy vs Effects Coding


Categorical Predictors - Interpretation


Specifying Reference Levels

Interaction Models

Specifying Interaction Models


Interpreting Coefficients


Example Data

Numeric x Categorical Example

Research Question

Does the association between body mass and flipper length differ between species of penguin?

Visualise Data


Model Specification


Model Building


Results Interpretation


Model Visualisation


Numeric x Numeric Example

Research Question

Does the influence of bill length on body mass vary depending on flipper length?

Visualise Data


Model Specification


Model Building


Results Interpretation


Model Visualisation

Categorical x Categorical Example

Research Question

Do differences in body mass between species differ by sex?

Visualise Data


Model Specification


Model Building


Results Interpretation


Model Visualisation


Coding Constraints


Simple Effects


General

Extracting Information

Manual Contrasts

Dummy and effects coding allow us to test the significance of the difference between means of groups and some other mean (either reference group or grand mean respectively). However, in some cases, we may want to test more specific hypotheses that require us to test the difference between particular combinations of groups. In such cases, we can use manual contrasts.
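To make the idea concrete, here is a hand-computed contrast in base R, assuming the built-in PlantGrowth data and a hypothetical question: does the control group differ from the average of the two treatment groups? This is a sketch of the underlying arithmetic rather than the course's own workflow; the emmeans package listed above offers a contrast() function for the same purpose.

```r
# Cell-means model: with no intercept, each coefficient is one group's mean
means_model <- lm(weight ~ 0 + group, data = PlantGrowth)
group_means <- coef(means_model)

# Contrast weights (they must sum to zero):
# control vs. the average of the two treatment groups
contrast_weights <- c(ctrl = 1, trt1 = -1/2, trt2 = -1/2)

# Contrast estimate: weighted sum of the group means
estimate <- sum(contrast_weights * group_means)

# Standard error from the coefficient covariance matrix
std_error <- sqrt(drop(
  t(contrast_weights) %*% vcov(means_model) %*% contrast_weights
))

# t-test of H0: the contrast equals zero
t_value <- estimate / std_error
p_value <- 2 * pt(abs(t_value), df = df.residual(means_model), lower.tail = FALSE)
```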

Rules


In R - Additive Model


Example - Additive Model


In R - Interaction Model


Example - Interaction Model

Multiple Comparisons

Pairwise Comparisons


Why does the Number of Tests Matter?


When to use Which Correction

Model Predicted Values & Residuals

Model predicted values are the estimates generated by a regression model for the dependent variable based on the independent variable(s), whilst residuals are the differences between these predicted values and the actual observed values (in turn indicating the accuracy of the model’s predictions).
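The relationship between observed values, predicted values, and residuals can be checked directly. The sketch below assumes the built-in mtcars data and a hypothetical new_cars data frame for out-of-sample predictions.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Predicted (fitted) values for the observations used to fit the model
predicted <- fitted(model)

# Residuals: observed value minus predicted value
residuals_manual <- mtcars$mpg - predicted
all.equal(residuals_manual, resid(model))

# Predicted values for new (hypothetical) predictor values
new_cars <- data.frame(wt = c(2.5, 3.5))
new_predictions <- predict(model, newdata = new_cars)
new_predictions
```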

Predicted Values


Residuals


Predicted Values - Example

Data Transformations

There are many transformations we can apply to a continuous variable, but the most common ones are centering and scaling. These transformations can aid the interpretability of our statistical models.
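A short sketch of these transformations in base R, using the wt variable from the built-in mtcars data as an arbitrary example:

```r
wt <- mtcars$wt

# Centering: subtract the mean, so the new mean is 0 (the SD is unchanged)
wt_centred <- wt - mean(wt)

# Scaling: divide by the SD, so values are in SD units
wt_scaled <- wt / sd(wt)

# Standardisation (z-scores): centre, then scale; mean 0 and SD 1
wt_z <- (wt - mean(wt)) / sd(wt)

# scale() centres and scales in one step (it returns a matrix)
wt_z_scale <- as.numeric(scale(wt))

# Effect on a model: centering the predictor changes the intercept
# (now the predicted outcome at the mean of the predictor), not the slope
cars <- mtcars
cars$wt_centred <- wt_centred
raw_model <- lm(mpg ~ wt, data = cars)
centred_model <- lm(mpg ~ wt_centred, data = cars)
coef(raw_model)
coef(centred_model)
```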

Centering


Scaling


Standardisation

Model Fit

Assessing model fit involves examining metrics like the sum of squares to measure variability explained by the model, the \(F\)-ratio to evaluate the overall significance of the model by comparing explained variance to unexplained variance, and \(R\)-squared / Adjusted \(R\)-squared to quantify the proportion of variance in the dependent variable explained by the independent variable(s).
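These quantities can be computed by hand and compared against the output of summary(). The sketch below assumes the built-in mtcars data and an arbitrary two-predictor model.

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg
n <- nrow(mtcars)

# Sums of squares
ss_total    <- sum((y - mean(y))^2)    # total variability in the outcome
ss_residual <- sum(resid(model)^2)     # variability left unexplained
ss_model    <- ss_total - ss_residual  # variability explained by the model

# Degrees of freedom
df_model    <- length(coef(model)) - 1  # number of predictors
df_residual <- df.residual(model)       # n - number of predictors - 1

# F-ratio: explained variance relative to unexplained variance
f_ratio <- (ss_model / df_model) / (ss_residual / df_residual)

# R-squared and Adjusted R-squared
r_squared     <- ss_model / ss_total
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_residual

# Compare with the values reported by summary()
model_summary <- summary(model)
c(by_hand = f_ratio, summary = model_summary$fstatistic[["value"]])
c(by_hand = r_squared, summary = model_summary$r.squared)
```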

Sums of Squares


F-ratio


R-squared and Adjusted R-squared

Model Comparisons

One useful thing we might want to do is compare our models with and without some predictor(s). There are numerous ways we can do this, but the method chosen depends on the models and the underlying data.
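As a minimal sketch, assuming the built-in mtcars data and two arbitrary nested models (the restricted model's predictors are a subset of the full model's, and both are fitted to the same data):

```r
restricted <- lm(mpg ~ wt, data = mtcars)
full       <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Incremental F-test (nested models only): do the extra predictors
# explain significantly more variance?
f_test <- anova(restricted, full)
f_test

# AIC and BIC (can also be used for non-nested models fitted to the same data):
# lower values indicate a better trade-off between fit and complexity
aic_table <- AIC(restricted, full)
bic_table <- BIC(restricted, full)
aic_table
bic_table
```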

Nested vs Non-Nested Models


Incremental F-test


AIC & BIC

Model Assumptions

Linear models rely on several underlying assumptions about the data. These assumptions ensure that the association between variables is appropriately captured, and that inferences drawn from the model are accurate and valid. Model diagnostics can help to assess whether these assumptions hold. When they are violated, there are several techniques that can be employed, such as transforming the data or using robust alternatives, to ensure reliable model interpretations.
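A few base R tools for these checks are sketched below, assuming the built-in mtcars data and an arbitrary two-predictor model. The variance inflation factor is computed by hand here to show what it measures; the car package listed above provides vif() for the same purpose.

```r
model <- lm(mpg ~ wt + hp, data = mtcars)

# Diagnostic plots
plot(model, which = 1)  # residuals vs fitted: linearity
plot(model, which = 2)  # normal QQ plot of residuals: normality of errors
plot(model, which = 3)  # scale-location: equal variances

# Multicollinearity: VIF for wt is 1 / (1 - R-squared from
# regressing wt on the other predictor(s))
r2_wt  <- summary(lm(wt ~ hp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)

# Individual case diagnostics
cooks_d  <- cooks.distance(model)  # influence of each case on the model
leverage <- hatvalues(model)       # how unusual each case's predictor values are
head(sort(cooks_d, decreasing = TRUE), 3)
```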

Linearity


Independence (of errors)


Normality (of errors)


Equal Variances (Homoscedasticity)


Useful Assumption Plots


Multicollinearity


Individual Case Diagnostics


Next Steps: What to do with Violations of Assumptions / Problematic Case Diagnostic Results

Bootstrap

The bootstrap is a general approach to assessing whether sample results are statistically significant, and it allows us to draw inferences about the population from a regression model. This method does not rely on distributional conditions such as normality of the residuals, although it still assumes that the sample is representative of the population.

It is based on repeatedly sampling with replacement from the data at hand (sampling without replacement would always return the original sample exactly), and then computing the regression coefficients from each resample. We use the terms “bootstrap sample” and “resample” interchangeably; both mean a sample drawn with replacement.
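The resampling procedure described above can be sketched by hand in base R. This is an illustration of the idea, assuming the built-in mtcars data, 1000 resamples, and a percentile confidence interval; it is not necessarily the function or settings used in the course materials.

```r
set.seed(123)
n_resamples <- 1000
n <- nrow(mtcars)

# For each resample: draw n rows with replacement, refit the model,
# and keep the slope for wt
boot_slopes <- replicate(n_resamples, {
  rows <- sample(n, size = n, replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[rows, ]))[["wt"]]
})

# 95% percentile bootstrap confidence interval for the slope
boot_ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
boot_ci

# Distribution of the bootstrap slopes
hist(boot_slopes, main = "Bootstrap distribution of the wt slope")
```

If the interval excludes zero, the slope would be judged statistically significant at the 5% level.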

Overview


Terminology


In R


Visualisation

General Formatting & Presenting of Results

LaTeX Symbols & Equations

By embedding LaTeX into RMarkdown, you can accurately and precisely format mathematical expressions, ensuring that they are not only technically correct but also visually appealing and easy to interpret.
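For example, a simple linear regression model and its fitted line could be written in the body of an RMarkdown document as shown below. The symbols follow a common convention and are given as an illustration, not as required notation.

```latex
$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$

$$
\widehat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i
$$
```

Double dollar signs produce a displayed equation on its own line; single dollar signs (e.g., `$\beta_1$`) place the expression inline within a sentence.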

LaTeX Guide

APA Formatting

APA format is a writing/presentation style that is often used in psychology to ensure consistency in communication. APA formatting applies to all aspects of writing, from the formatting of papers (including tables and figures) to the citation of sources and reference lists. This means that it also applies to how you present results in your Psychology courses, including DAPR2.

APA Formatting Guides

Tables

We want to ensure that we are presenting results in a well-formatted table. To do so, there are lots of different packages available (see Lesson 4 of the RMD bootcamp).

One of the most convenient ways to present results from regression models is to use the tab_model() function from sjPlot.
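A minimal sketch of a tab_model() call is given below, assuming the built-in mtcars data; the labels and title are made up for illustration.

```r
library(sjPlot)

model <- lm(mpg ~ wt + hp, data = mtcars)

# Build a regression table with readable labels (labels are illustrative)
results_table <- tab_model(
  model,
  dv.labels   = "Miles per gallon",
  pred.labels = c("Intercept", "Weight (1000 lbs)", "Horsepower"),
  title       = "Regression table for fuel efficiency"
)
```

In an RMarkdown code chunk, calling tab_model() directly (without assigning the result) renders the table in the knitted document.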

Creating tables via tab_model

Cross Referencing

Cross-referencing is a very helpful way to direct your reader through your document, and the good news is that this can be done automatically in RMarkdown.

Cross Referencing

References

Ramsey, Fred, and Daniel Schafer. 2012. The Statistical Sleuth: A Course in Methods of Data Analysis. Cengage Learning.

Footnotes

  1. Yes, the error term is gone. This is because the line of best fit gives you the prediction of the average recall accuracy for a given age, and not the recall accuracy of an individual person, which will almost surely be different from the prediction of the line.

  2. Horst, A. M., Hill, A. P., & Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/ doi: 10.5281/zenodo.3960218

  3. What defines a ‘family’ of tests is debatable.

  4. QQ plots plot the sample values against the associated percentiles of the normal distribution. So if we had ten values, the plot would order them from lowest to highest, then plot them on the y-axis against the 10th, 20th, 30th, and so on, percentiles of the standard normal distribution (mean 0, SD 1).

  5. Belsley, D. A., Kuh, E., & Welsch, R. E. (2005). Regression diagnostics: Identifying influential data and sources of collinearity. John Wiley & Sons. DOI: 10.1002/0471725153

  6. This method finds an appropriate value for \(\lambda\) such that the transformation \((\text{sign}(x)\,|x|^{\lambda} - 1)/\lambda\) results in a close to normal distribution.