LEARNING OBJECTIVES

Review the main concepts from introductory statistics.
Understand the concept of a function.
Be able to discuss what a statistical model is.
Understand the link between models and functions.

Refresher of basic terminology

Question 1

Provide a short definition for each of these terms:

(Observational) unit
Variable
Categorical variable
Numeric variable
Response/dependent variable
Explanatory/independent variable
Observational study
Experiment

Solution

Term	Definition
(Observational) unit	The individual entities on which data are collected.
Variable	Any characteristic recorded on the observational units.
Categorical variable	A categorical variable places units into one of several groups. Examples are “country of birth,” “dominant hand,” and “eye colour.”
Numeric variable	A variable that records a numerical quantity for each case. For such variables standard arithmetic operations make sense. For example, average height or weight makes sense.
Response/dependent variable	A response variable (also called a dependent variable) measures the outcome of interest in a study.
Explanatory/independent variable	Explanatory variables (also called independent variables or predictors) are used to explain changes in the response variable.
Observational study	An observational study is a study in which the researcher does not manipulate any of the variables involved in the study, but merely records the values as they naturally exist.
Experiment	An experiment is a study in which the researcher imposes the values of the explanatory variable on the units before measuring the response variable.

Functions and mathematical models

Question 2

Consider the function \(y = 2 + 5 \ x\).

Identify the independent variable
Identify the dependent variable
Describe in words what the function does, and compute the output for the following input: \[ x = \begin{bmatrix} 2 \\ 6 \end{bmatrix} \]

Solution

Question 3

Write down in words and in symbols the function summarising the relationship between the side of a square and its perimeter.

Hint: We are interested in how the perimeter varies as a function of its side. Hence, the perimeter is the dependent variable, and the side is the independent variable.

Solution

In today’s lab you will compute the output of functions and plot them. To do so, you will need functionality from the tidyverse package such as tibble(), mutate(), and ggplot().

Question 4

Load the tidyverse package.

Create a data set called squares containing the perimeter of squares having sides of length \(0, 2, 5, 9\) metres.

Hint: Remember that to combine multiple numbers together we use the function c().

Solution

library(tidyverse)

squares <- tibble(
  side = c(0, 2, 5, 9),
  perimeter = 4 * side
)

squares

## # A tibble: 4 x 2
##    side perimeter
##   <dbl>     <dbl>
## 1     0         0
## 2     2         8
## 3     5        20
## 4     9        36

Question 5

Plot the squares data as points.

Solution

ggplot(data = squares, aes(x = side, y = perimeter)) +
  geom_point() +
  labs(x = 'Side (m)', y = 'Perimeter (m)')

Now, instead of just 4 points, we will obtain many more, one hundred, and use them to visualise the relationship between side and perimeter of squares.

Question 6

Create a sequence of one hundred side lengths (x) going from 0 to 3 metres.

Compute the corresponding perimeters (y).

Plot the side and perimeter data as points on a graph.

Visualise the functional relationship between side and perimeter of squares. To do so, use the function geom_line() to connect the computed points with lines.

Solution

Let’s start by creating the side and perimeter data:

squares_grid <- tibble(
  side = seq(0, 3, length.out = 100),
  perimeter = 4 * side
)

Plot the individual points:

ggplot(data = squares_grid, aes(x = side, y = perimeter)) +
  geom_point() +
  labs(x = 'Side (m)', y = 'Perimeter (m)', title = 'Perimeter = 4 Side')

Visualise the functional relationship by connecting the individual points with lines:

ggplot(data = squares_grid, aes(x = side, y = perimeter)) +
  geom_line(colour = 'blue') +
  labs(x = 'Side (m)', y = 'Perimeter (m)', title = 'Perimeter = 4 Side')

The function \(y = 4 \ x\) that you plotted above is an example of a function representing a mathematical model.

We typically validate a model using experimental data. However, we all know how squares work and that two squares with the same side will have the same perimeter. Hence this is a deterministic model as it is a model of an exact relationship.

Question 7

The Scottish National Gallery kindly provided us with measurements of side and perimeter (in metres) for a sample of 10 square paintings.

The data are provided below:

sng <- tibble(
  side = c(1.3, 0.75, 2, 0.5, 0.3, 1.1, 2.3, 0.85, 1.1, 0.2),
  perimeter = c(5.2, 3.0, 8.0, 2.0, 1.2, 4.4, 9.2, 3.4, 4.4, 0.8)
)

Plot the mathematical model of the relationship between side and perimeter for squares, and superimpose on top the experimental data from the Scottish National Gallery.

Solution

sng <- tibble(
  side = c(1.3, 0.75, 2, 0.5, 0.3, 1.1, 2.3, 0.85, 1.1, 0.2),
  perimeter = c(5.2, 3.0, 8.0, 2.0, 1.2, 4.4, 9.2, 3.4, 4.4, 0.8)
)

ggplot() +
  geom_line(data = squares_grid, aes(x = side, y = perimeter), colour = 'blue') +
  geom_point(data = sng, aes(x = side, y = perimeter), colour = 'black', 
             alpha = 0.5, size = 3) +
  labs(x = 'Side (m)', y = 'Perimeter (m)')

Figure 1: The exact relationship between side and perimeter of squares.

The above plot shows perfect agreement between the observed data and the model.

Question 8

Use the mathematical model to predict the perimeter of a painting with a side of 1.5 metres.

Solution

Statistical models

Consider now the relationship between height (in inches) and handspan (in cm). Utts and Heckard (2015) provides data for a sample of 167 students which reported their height and handspan as part of a class survey.

Data: handheight.csv. Click the plus to expand →

height	handspan
68	21.5
71	23.5
73	22.5
64	18.0
68	23.5
59	20.0

Question 9

Read the handheight data into R and name the data set handheight.

Solution

handheight <- read_csv(file = 'https://uoepsy.github.io/data/handheight.csv')
head(handheight)

## # A tibble: 6 x 2
##   height handspan
##    <dbl>    <dbl>
## 1     68     21.5
## 2     71     23.5
## 3     73     22.5
## 4     64     18  
## 5     68     23.5
## 6     59     20

Question 10

Investigate how handspan varies as a function of height for the students in the sample.

Do you notice any outliers or points that do not fit with the pattern in the rest of the data?

Comment on any main differences you notice with the relationship between side and perimeter of squares.

Hint: Use a scatterplot to visualise the relationship between two numeric variables.

Solution

The handheight data set contains two variables, height and handspan, which are both numeric and continuous. We display the relationship between two numeric variables with a scatterplot.

We can also add marginal boxplots for each variable using the package ggExtra. Before using the package, make sure you have it installed via install.packages('ggExtra').

library(ggExtra)

plt <- ggplot(handheight, aes(x = height, y = handspan)) +
  geom_point(size = 3, alpha = 0.5) +
  labs(x = 'Height (in.)', y = 'Handspan (cm)')

ggMarginal(plt, type = 'boxplot')

Figure 2: The statistical relationship between height and handspan.

Outliers are extreme observations that do not seem to fit with the rest of the data. This could either be:

marginally along one axis: points that have an unusual (too high or too low) x-coordinate or y-coordinate;
jointly: observations that do not fit with the rest of the point cloud.

The boxplots in Figure 2 do not highlight any outliers in the marginal distributions of height and handspan. Furthermore, from the scatterplot we do not notice any extreme observations or points that do not fit with the rest of the point cloud.

We notice a moderate, positive (that is, increasing) linear relationship between height and handspan.

Recall Figure 1, displaying the relationship between side and perimeters of squares. In the plot we notice two points on top of each other, reflecting the fact that two squares having the same side will always have the same perimeter. In fact, the data from the Scottish National Gallery include two squared paintings with a side of 1.1m, both having a measured perimeter of 4.4m.

Figure 2, instead, displays the relationship between height and handspan of a sample of students. The first thing that grabs our attention is the fact that students having the same height do not necessarily have the same handspan. Rather, we clearly see a variety of handspan values for students all having a height of, for example, 70in. To be more precise, the seven students who are 70 in. tall all have differing handspans.

Question 11

Using the following command, superimpose on top of the scatterplot a best-fit line describing how handspan varies as a function of height. For the moment, the argument se = FALSE tells R to not display uncertainty bands.

geom_smooth(method = lm, se = FALSE)

Comment on any differences you notice with the line summarising the linear relationship between side and perimeter.

Solution

ggplot(handheight, aes(x = height, y = handspan)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth(method = lm, se = FALSE) +
  labs(x = 'Height (in.)', y = 'Handspan (cm)')

Figure 3: The best-fit line.

The line representing the relationship between side and perimeter of squares is able to predict the actual perimeter value from the measurement of the side of a square. This is possible because the relationship between side and perimeter is an exact one. That is, any squares having the same side will have the same perimeter, and there will be no variation in those values.

The line that best fits the relationship between height and handspan, instead, is only able to predict the average handspan for a given value of height. This is because there will be a distribution of handspans at each value of height. The line will fit the trend/pattern in the values, but there will be individual-to-individual variability that we must accept around that average pattern.

The mathematical model \[ Perimeter = 4 \ Side \] or, equivalently, \[ y = 4 \ x \] represents the exact relationship between side and perimeter of squares.

Contrary to the relationship represented by the mathematical model above, the relationship between height and handspan shows deviations from the “average pattern.” Hence, we need to create a model that allows for deviations from the linear relationship. This is called a statistical model.

A statistical model includes both a deterministic function and a random error term: \[ Handspan = \beta_0 + \beta_1 \ Height + \epsilon \] or, in short, \[ y = \underbrace{\beta_0 + \beta_1 \ x}_{f(x)} + \underbrace{\epsilon}_{\text{random error}} \]

The deterministic function need not be linear if the scatterplot displays signs of nonlinearity. In the equation above, the terms \(\beta_0\) and \(\beta_1\) are numbers specifying where the line going through the data meets the y-axis and its slope (rate of increase/decrease).

Question 12

The line of best-fit is given by:¹ \[ \widehat{Handspan} = -3 + 0.35 \ Height \]

What is your best guess for the handspan of a student who is 73in tall?

And for students who are 5in?

Solution

References

Utts, Jessica M, and Robert F Heckard. 2015. Mind on Statistics. Cengage Learning.

Yes, the error term is gone. This is because the line of best-fit gives you the prediction of the average handspan for a given height, and not the individual handspan of a person, which will almost surely be different from the prediction of the line.↩︎

Functions and models

Refresher of basic terminology

Functions and mathematical models

Statistical models

References