Information about solutions

Solutions for these exercises are available immediately below each question.
We would like to emphasise that much evidence suggests that testing enhances learning, and we strongly encourage you to make a concerted attempt at answering each question before looking at the solutions. Immediately looking at the solutions and then copying the code into your work will lead to poorer learning.
We would also like to note that there are always many different ways to achieve the same thing in R, and the solutions provided are simply one approach.

LEARNING OBJECTIVES

Review the main concepts from introductory statistics.
Understand the concept of a function.
Be able to discuss what a statistical model is.
Understand the link between models and functions.

Refresher of basic terminology

Question 1

Provide a short definition for each of these terms:

(Observational) unit
Variable
Categorical variable
Numeric variable
Response/dependent variable
Explanatory/independent variable
Observational study
Experiment

Solution

Term	Definition
(Observational) unit	The individual entities on which data are collected.
Variable	Any characteristic recorded on the observational units.
Categorical variable	A categorical variable places units into one of several groups. Examples are “country of birth,” “dominant hand,” and “eye colour.”
Numeric variable	A variable that records a numerical quantity for each case. For such variables standard arithmetic operations make sense. For example, average height or weight makes sense.
Response/dependent variable	A response variable (also called a dependent variable) measures the outcome of interest in a study.
Explanatory/independent variable	Explanatory variables (also called independent variables or predictors) are used to explain changes in the response variable.
Observational study	An observational study is a study in which the researcher does not manipulate any of the variables involved in the study, but merely records the values as they naturally exist.
Experiment	An experiment is a study in which the researcher imposes the values of the explanatory variable on the units before measuring the response variable.

Functions and mathematical models

Question 2

Consider the function \(y = 2 + 5 \ x\).

Identify the independent variable
Identify the dependent variable
Describe in words what the function does, and compute the output for the following input: \[ x = \begin{bmatrix} 2 \\ 6 \end{bmatrix} \]

Solution
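
One possible answer, sketched here rather than taken from the original solution: \(x\) is the independent variable (the input) and \(y\) is the dependent variable (the output). In words, the function multiplies the input by 5 and then adds 2. The helper name `f` below is illustrative only.

```r
# Illustrative helper (the name `f` is an assumption): y = 2 + 5x
f <- function(x) 2 + 5 * x

# R arithmetic is vectorised, so both inputs are transformed in one call
x <- c(2, 6)
y <- f(x)
y
## [1] 12 32
```

So an input of 2 gives an output of 12, and an input of 6 gives an output of 32.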

Question 3

Write down in words and in symbols the function summarising the relationship between the side of a square and its perimeter.

Hint: We are interested in how the perimeter varies as a function of its side. Hence, the perimeter is the dependent variable, and the side is the independent variable.

Solution
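
A sketch of one possible answer: in words, the perimeter of a square is four times the length of its side; in symbols, \(y = 4 \ x\), where \(x\) is the side and \(y\) is the perimeter. The same rule can be written as an R function (the name `perimeter_of_square` is illustrative, not part of the lab materials):

```r
# Illustrative helper: perimeter (dependent variable) as a function of
# side (independent variable)
perimeter_of_square <- function(side) 4 * side

perimeter_of_square(2)
## [1] 8
```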

In today’s lab you will compute the output of functions and plot them. To do so, you will need functionality from the tidyverse package such as tibble(), mutate(), and ggplot(). If you don’t have the package installed, go to the RStudio menu, select Tools, click Install Packages, type tidyverse and click install.

Question 4

Load the tidyverse package.

Create a data set called squares containing the perimeter of squares having sides of length \(0, 2, 5, 9\) metres.

Hint: Remember that to combine multiple numbers together we use the function c().

Solution

library(tidyverse)

squares <- tibble(
  side = c(0, 2, 5, 9),
  perimeter = 4 * side
)

squares

## # A tibble: 4 x 2
##    side perimeter
##   <dbl>     <dbl>
## 1     0         0
## 2     2         8
## 3     5        20
## 4     9        36

Question 5

Plot the squares data as points.

Solution
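
A minimal sketch of one way to do this, assuming the squares data from Question 4 (recreated here so the code runs on its own). The axis labels and the object name `plt_squares` are illustrative choices.

```r
library(tidyverse)

# Recreate the data from Question 4
squares <- tibble(
  side = c(0, 2, 5, 9),
  perimeter = 4 * side
)

# Side on the x-axis (independent), perimeter on the y-axis (dependent)
plt_squares <- ggplot(squares, aes(x = side, y = perimeter)) +
  geom_point() +
  labs(x = 'Side (m)', y = 'Perimeter (m)')

plt_squares
```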

Now, instead of just four points, we will compute one hundred and use them to visualise the relationship between the side and perimeter of squares.

Question 6

Create a sequence of one hundred side lengths (x) going from 0 to 3 metres.

Compute the corresponding perimeters (y).

Plot the side and perimeter data as points on a graph.

Visualise the functional relationship between side and perimeter of squares. To do so, use the function geom_line() to connect the computed points with lines.

Solution
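
One possible approach, sketched under the assumption that the side lengths should be equally spaced; `seq()` with `length.out = 100` does this. The object name `squares_grid` and the axis labels are illustrative.

```r
library(tidyverse)

# One hundred equally spaced side lengths from 0 to 3 metres,
# with the perimeter computed from each side
squares_grid <- tibble(
  x = seq(0, 3, length.out = 100),
  y = 4 * x
)

# Points show the computed values; the line connects them
ggplot(squares_grid, aes(x = x, y = y)) +
  geom_point() +
  geom_line() +
  labs(x = 'Side (m)', y = 'Perimeter (m)')
```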

The function \(y = 4 \ x\) that you plotted above is an example of a function representing a mathematical model.

We typically validate a model using experimental data. However, we all know how squares work and that two squares with the same side will have the same perimeter. Hence this is a deterministic model as it is a model of an exact relationship.

Question 7

The Scottish National Gallery kindly provided us with measurements of side and perimeter (in metres) for a sample of 10 square paintings.

The data are provided below:

sng <- tibble(
  side = c(1.3, 0.75, 2, 0.5, 0.3, 1.1, 2.3, 0.85, 1.1, 0.2),
  perimeter = c(5.2, 3.0, 8.0, 2.0, 1.2, 4.4, 9.2, 3.4, 4.4, 0.8)
)

Plot the mathematical model of the relationship between side and perimeter for squares, and superimpose on top the experimental data from the Scottish National Gallery.

Solution
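
A sketch of one way to build this plot: draw the model \(y = 4 \ x\) as a line over a grid of side lengths, then add the gallery measurements as points by passing a different data set to `geom_point()`. The names `model_line` and `plt_sng`, the grid range, and the colour are illustrative choices.

```r
library(tidyverse)

# Measurements provided in the question
sng <- tibble(
  side = c(1.3, 0.75, 2, 0.5, 0.3, 1.1, 2.3, 0.85, 1.1, 0.2),
  perimeter = c(5.2, 3.0, 8.0, 2.0, 1.2, 4.4, 9.2, 3.4, 4.4, 0.8)
)

# Mathematical model evaluated on a grid of side lengths
model_line <- tibble(
  side = seq(0, 3, length.out = 100),
  perimeter = 4 * side
)

# Line = model; points = measured paintings (same aesthetics, different data)
plt_sng <- ggplot(model_line, aes(x = side, y = perimeter)) +
  geom_line(colour = 'blue') +
  geom_point(data = sng, size = 3, alpha = 0.5) +
  labs(x = 'Side (m)', y = 'Perimeter (m)')

plt_sng
```

With these data, every measured point falls on the line, because each recorded perimeter equals four times the recorded side.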

Question 8

Use the mathematical model to predict the perimeter of a painting with a side of 1.5 metres.

Solution
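
A sketch of the calculation: substitute the side length into the model \(y = 4 \ x\). The variable names are illustrative.

```r
# Predicted perimeter (in metres) for a painting with a side of 1.5 metres
side <- 1.5
predicted_perimeter <- 4 * side
predicted_perimeter
## [1] 6
```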

Note!

In this labs workbook we often provide examples of write-ups or interpretations of results in boxes that look like the one below. Keep an eye on them as they will help you in reporting your results!

Example!

Statistical models

Consider now the relationship between height (in inches) and handspan (in cm). Utts and Heckard (2015) provide data for a sample of 167 students who reported their height and handspan as part of a class survey.

Data: handheight.csv.

height	handspan
68	21.5
71	23.5
73	22.5
64	18.0
68	23.5
59	20.0

Question 9

Read the handheight data into R and name the data set handheight.

Solution

handheight <- read_csv(file = 'https://uoepsy.github.io/data/handheight.csv')
head(handheight)

## # A tibble: 6 x 2
##   height handspan
##    <dbl>    <dbl>
## 1     68     21.5
## 2     71     23.5
## 3     73     22.5
## 4     64     18  
## 5     68     23.5
## 6     59     20

Question 10

Investigate how handspan varies as a function of height for the students in the sample.

Do you notice any outliers or points that do not fit with the pattern in the rest of the data?

Comment on any main differences you notice with the relationship between side and perimeter of squares.

Hint: Use a scatterplot to visualise the relationship between two numeric variables.

Solution

The handheight data set contains two variables, height and handspan, which are both numeric and continuous. We display the relationship between two numeric variables with a scatterplot.

We can also add marginal boxplots for each variable using the package ggExtra. Before using the package, make sure you have it installed via install.packages('ggExtra').

library(ggExtra)

plt <- ggplot(handheight, aes(x = height, y = handspan)) +
  geom_point(size = 3, alpha = 0.5) +
  labs(x = 'Height (in.)', y = 'Handspan (cm)')

ggMarginal(plt, type = 'boxplot')

Figure 2: The statistical relationship between height and handspan.

Outliers are extreme observations that are either impossible values of a variable or do not seem to fit with the rest of the data. An observation can be outlying either:

marginally along one axis: points that have an unusual (too high or too low) x-coordinate or y-coordinate;
jointly: observations that do not fit with the rest of the point cloud.

The boxplots in Figure 2 do not highlight any outliers in the marginal distributions of height and handspan. Furthermore, from the scatterplot we do not notice any extreme observations or points that do not fit with the rest of the point cloud.

We notice a moderate, positive (that is, increasing) linear relationship between height and handspan.

Recall Figure 1, displaying the relationship between the side and perimeter of squares. In the plot we notice two points on top of each other, reflecting the fact that two squares having the same side will always have the same perimeter. In fact, the data from the Scottish National Gallery include two square paintings with a side of 1.1m, both having a measured perimeter of 4.4m.

Figure 2, instead, displays the relationship between height and handspan for a sample of students. The first thing that grabs our attention is that students with the same height do not necessarily have the same handspan. Rather, we clearly see a variety of handspan values for students all having a height of, for example, 70 in. To be more precise, the seven students who are 70 in. tall all have differing handspans.

Question 11

Using the following command, superimpose on top of the scatterplot a best-fit line describing how handspan varies as a function of height. For the moment, the argument se = FALSE tells R to not display uncertainty bands.

geom_smooth(method = lm, se = FALSE)

Comment on any differences you notice with the line summarising the linear relationship between side and perimeter.

Solution
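
A minimal sketch of how the command can be added to the scatterplot. To keep it runnable offline, it uses only the six rows shown in the data preview above (an assumption made for illustration); in the lab you would use the full handheight data read in Question 9, and the fitted line will differ. The names `handheight_preview` and `plt_fit` are illustrative.

```r
library(tidyverse)

# Assumption: only the six preview rows, not the full sample of 167 students
handheight_preview <- tibble(
  height   = c(68, 71, 73, 64, 68, 59),
  handspan = c(21.5, 23.5, 22.5, 18.0, 23.5, 20.0)
)

# Scatterplot with a best-fit straight line and no uncertainty band
plt_fit <- ggplot(handheight_preview, aes(x = height, y = handspan)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth(method = lm, se = FALSE) +
  labs(x = 'Height (in.)', y = 'Handspan (cm)')

plt_fit
```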

The mathematical model \[ Perimeter = 4 \ Side \] or, equivalently, \[ y = 4 \ x \] represents the exact relationship between side and perimeter of squares.

Contrary to the relationship represented by the mathematical model above, the relationship between height and handspan shows deviations from the “average pattern.” Hence, we need to create a model that allows for deviations from the linear relationship. This is called a statistical model.

A statistical model includes both a deterministic function and a random error term: \[ Handspan = \beta_0 + \beta_1 \ Height + \epsilon \] or, in short, \[ y = \underbrace{\beta_0 + \beta_1 \ x}_{\text{function of }x} + \underbrace{\epsilon}_{\text{random error}} \]

The deterministic function need not be linear if the scatterplot displays signs of nonlinearity. In the equation above, \(\beta_0\) and \(\beta_1\) are numbers: \(\beta_0\) is the intercept, specifying where the line going through the data meets the y-axis, and \(\beta_1\) is the slope, specifying the rate of increase or decrease of the line.
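
To make the two parts of a statistical model concrete, here is a small simulation sketch. It borrows the intercept and slope from the best-fit line quoted in Question 12; the error standard deviation of 1 cm, the normal distribution for the errors, and the seed are assumptions made purely for illustration.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

beta0 <- -3    # intercept, from the best-fit line in Question 12
beta1 <- 0.35  # slope, from the best-fit line in Question 12
sigma <- 1     # assumed spread of the random error (cm)

# Seven hypothetical students who are all 70 in. tall
height <- rep(70, 7)

# Deterministic part: identical for everyone with the same height
deterministic <- beta0 + beta1 * height

# Random error: differs from student to student
epsilon <- rnorm(length(height), mean = 0, sd = sigma)

# Statistical model: function of height plus random error
handspan <- deterministic + epsilon

tibble::tibble(height, deterministic, epsilon, handspan)
```

The deterministic column is the same in every row, while the simulated handspans differ, mirroring what the scatterplot shows for students of equal height.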

Question 12

The line of best-fit is given by:¹ \[ \widehat{Handspan} = -3 + 0.35 \ Height \]

What is your best guess for the handspan of a student who is 73in tall?

And for a student who is 5in tall?

Solution
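
A sketch of the calculation, plugging each height into the best-fit line given in the question. The helper name `predict_handspan` is illustrative.

```r
# Best-fit line from the question: predicted handspan (cm) from height (in.)
predict_handspan <- function(height) -3 + 0.35 * height

predictions <- predict_handspan(c(73, 5))
predictions
## [1] 22.55 -1.25
```

The prediction for a height of 73in is 22.55cm. For 5in the line returns a negative handspan, which is impossible; this suggests the line should not be trusted for heights far outside those observed in the sample.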

References

Utts, Jessica M, and Robert F Heckard. 2015. Mind on Statistics. Cengage Learning.

¹ Yes, the error term is gone. This is because the line of best-fit gives you the prediction of the average handspan for a given height, and not the individual handspan of a person, which will almost surely be different from the prediction of the line.
