class: center, middle, inverse, title-slide .title[ #
Introduction to the linear model (LM)
] .subtitle[ ## Data Analysis for Psychology in R 2
] .author[ ### dapR2 Team ] .institute[ ### Department of Psychology
The University of Edinburgh
]

---
# Week's Learning Objectives

1. Understand the link between models and functions.
2. Understand the key concepts (intercept and slope) of the linear model.
3. Understand what residuals represent.
4. Understand the key principles of least squares.
5. Be able to specify a simple linear model (labs).

---
class: inverse, center, middle

# Part 1: Functions & Models

---
# What is a model?

+ Pretty much all statistics is about models.

+ A model is a formal representation of a system.

+ Put another way, a model is an idea about the way the world is.

---
# A model as a function

+ We tend to represent mathematical models as functions.
  + A **function** is an expression that defines the relationship between one variable (or set of variables) and another variable (or set of variables).

+ It allows us to specify what is important (arguments) and how these things interact with each other (operations).

+ This allows us to make and test predictions.

---
# An Example

+ To think through these relations, we can use a simpler example.

+ Suppose I have a model for the growth of babies.<sup>1</sup>

$$
Length = 55 + 4 * Age
$$

--

+ I'm using this model to formally represent the relationship between a baby's age and their length.

.footnote[
[1] Length is measured in cm; Age is measured in months.
]

---
# Visualizing a model

.pull-left[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-1-1.png" width="80%" />
]

.pull-right[
{{content}}
]

--

+ The x-axis shows `Age`
{{content}}

--

+ The y-axis shows `Length`
{{content}}

--

+ The black line represents our model: `\(y = 55+4x\)`
{{content}}

---
# Models as "a state of the world"

+ Let's suppose my model is true.
  + That is, it is a perfect representation of how babies grow.

+ What are the implications of this?

--

+ My model creates predictions.

+ **IF** my model is a true representation of the world, **THEN** data from the world should closely match my predictions.

---
# Predictions and data

.pull-left[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-2-1.png" width="80%" />
]

.pull-right[
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> Age (months) </th> <th style="text-align:right;"> Predicted Length (cm) </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10.00 </td> <td style="text-align:right;"> 95 </td> </tr>
<tr> <td style="text-align:right;"> 10.25 </td> <td style="text-align:right;"> 96 </td> </tr>
<tr> <td style="text-align:right;"> 10.50 </td> <td style="text-align:right;"> 97 </td> </tr>
<tr> <td style="text-align:right;"> 10.75 </td> <td style="text-align:right;"> 98 </td> </tr>
<tr> <td style="text-align:right;"> 11.00 </td> <td style="text-align:right;"> 99 </td> </tr>
<tr> <td style="text-align:right;"> 11.25 </td> <td style="text-align:right;"> 100 </td> </tr>
<tr> <td style="text-align:right;"> 11.50 </td> <td style="text-align:right;"> 101 </td> </tr>
<tr> <td style="text-align:right;"> 11.75 </td> <td style="text-align:right;"> 102 </td> </tr>
<tr> <td style="text-align:right;"> 12.00 </td> <td style="text-align:right;"> 103 </td> </tr>
</tbody>
</table>
]

???
+ Our predictions are points which fall on our line (representing the model, as a function).
+ Here the arrows show how we can use the model to find a predicted value.
+ We find the value of the input on the x-axis (here 11), read up to the line, then across to the y-axis.
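---
# The model as a function, in R

+ As a minimal sketch (the function name `predicted_length` is just for illustration), we can write the growth model as an R function and reproduce the predicted values from the table:

```r
# Length = 55 + 4 * Age  (Length in cm, Age in months)
predicted_length <- function(age) {
  55 + 4 * age
}

predicted_length(11)                  # a single prediction: 99 cm
predicted_length(seq(10, 12, 0.25))   # the predicted values shown in the table
```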
---
# Predictions and data

.pull-left[
+ Consider the predictions when the children get a lot older...

{{content}}
]

.pull-right[
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> Age (months) </th> <th style="text-align:right;"> Age (years) </th> <th style="text-align:right;"> Prediction (cm) </th> <th style="text-align:right;"> Prediction (m) </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 216 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 919 </td> <td style="text-align:right;"> 9.19 </td> </tr>
<tr> <td style="text-align:right;"> 228 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 967 </td> <td style="text-align:right;"> 9.67 </td> </tr>
<tr> <td style="text-align:right;"> 240 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 1015 </td> <td style="text-align:right;"> 10.15 </td> </tr>
<tr> <td style="text-align:right;"> 252 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 1063 </td> <td style="text-align:right;"> 10.63 </td> </tr>
<tr> <td style="text-align:right;"> 264 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 1111 </td> <td style="text-align:right;"> 11.11 </td> </tr>
<tr> <td style="text-align:right;"> 276 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:right;"> 1159 </td> <td style="text-align:right;"> 11.59 </td> </tr>
<tr> <td style="text-align:right;"> 288 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 1207 </td> <td style="text-align:right;"> 12.07 </td> </tr>
<tr> <td style="text-align:right;"> 300 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 1255 </td> <td style="text-align:right;"> 12.55 </td> </tr>
</tbody>
</table>
]

--

+ What does this say about our model?
{{content}}

--

+ If we were to collect actual data on height and age, would our observations fall on the line?
{{content}}

---
# Length & Age is non-linear

.pull-left[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-5-1.png" width="80%" />
]

.pull-right[
+ The red line is plotted based on the mean length at different ages ([real data](https://www.cdc.gov/growthcharts/who/boys_length_weight.htm)).
]

---
# How good is my model?

+ How might we judge how good our model is?

--

1. The model is represented as a function.
2. We see that as a line (or a surface, if we have more things to consider).
3. That yields predictions (or values we expect if our model is true).
4. We can collect data.
5. If the predictions do not match the observed data (observations deviate from our line), that says something about our model.

---
# Models and Statistics

+ In statistics, we (roughly) follow this process:
  + We define a model that represents one state of the world (probabilistically).
  + We collect data to compare to it.
  + These comparisons lead us to make inferences about how the world actually is, by comparison to a world that we specify by our model.
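---
# Checking a model against data, in R

+ A toy sketch of this process (the "observations" here are simulated, not real measurements): if the growth model were true, data from the world should sit close to its predictions.

```r
set.seed(1)
age       <- 1:12                                  # months
predicted <- 55 + 4 * age                          # the model's predictions (cm)
observed  <- predicted + rnorm(length(age), 0, 2)  # hypothetical observations

observed - predicted   # deviations: how far the "data" fall from our line
```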
---
# Deterministic vs Statistical models

.pull-left[
A deterministic model is a model for an **exact** relationship:

$$
y = \underbrace{3 + 2 x}_{f(x)}
$$

<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-6-1.png" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
A statistical model allows for case-by-case **variability**:

$$
y = \underbrace{3 + 2 x}_{f(x)} + \epsilon
$$

<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-7-1.png" width="70%" style="display: block; margin: auto;" />
]

---
class: center, middle

# Time to take a breath. Questions...

---
class: inverse, center, middle

# Part 2: The Linear Model

---
# Linear model

+ For the majority of the course, we will focus on how we move from the idea of an association to estimating a model for the relationship.

+ We'll mostly look at the **linear model**:
  + Assumes the relationship between the outcome variable and the predictor(s) is linear
  + Describes a continuous **outcome** variable as a function of one or more **predictor** variables

+ In other words, in using a linear model, we are typically trying to explain variation in an outcome (AKA `\(Y\)`, dependent, response) variable using one or more predictor ( `\(x\)`, independent, explanatory) variable(s).

---
# Example

**Question: Do students who study more get higher scores on the test?**

--

.pull-left[
<table>
<thead>
<tr> <th style="text-align:left;"> student </th> <th style="text-align:right;"> hours </th> <th style="text-align:right;"> score </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> ID1 </td> <td style="text-align:right;"> 0.5 </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> ID2 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> ID3 </td> <td style="text-align:right;"> 1.5 </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> ID4 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> ID5 </td> <td style="text-align:right;"> 2.5 </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> ID6 </td> <td style="text-align:right;"> 3.0 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> ID7 </td> <td style="text-align:right;"> 3.5 </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> ID8 </td> <td style="text-align:right;"> 4.0 </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> ID9 </td> <td style="text-align:right;"> 4.5 </td> <td style="text-align:right;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> ID10 </td> <td style="text-align:right;"> 5.0 </td> <td style="text-align:right;"> 8 </td> </tr>
</tbody>
</table>
]

.pull-right[
**Simple data**

+ `student` = ID variable unique to each respondent
+ `hours` = the number of hours spent studying. This will be our predictor ( `\(x\)` ).
+ `score` = test score. This will be our outcome ( `\(y\)` ).
]

---
# Scatterplot of our data

.pull-left[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-9-1.png" width="80%" />
]

.pull-right[
{{content}}
]

--

<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-10-1.png" width="80%" />
{{content}}

???
+ We can visualize our data.
+ We can see points moving from the bottom left to the top right, so the association looks positive.
+ Now let's add a line that represents the best model.
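---
# Plotting the data, in R

+ A sketch of how a plot like this could be drawn with `ggplot2` (one option among many; the data are the study-time data from the table):

```r
library(ggplot2)

# the study-time data (built again later in the deck with tibble())
test <- data.frame(
  hours = seq(0.5, 5, 0.5),
  score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)

ggplot(test, aes(x = hours, y = score)) +
  geom_point() +                          # the raw data
  geom_smooth(method = "lm", se = FALSE)  # add the best-fitting line
```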
---
# Definition of the line

.pull-left[
+ The line can be described by two values:

+ **Intercept**: the point where the line crosses the `\(y\)`-axis (i.e. where `\(x = 0\)`)

+ **Slope**: the gradient of the line, or rate of change
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-11-1.png" width="80%" />
]

???
+ In our example, intercept = for someone who doesn't study, what score will they get?
+ Slope = for every hour of study, how much will my score change?

---
# Intercept and slope

.pull-left[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-13-1.png" width="80%" />
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-14-1.png" width="80%" />
]

---
# Linear Model Equation

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ `\(y_i\)` = the outcome variable (e.g. `score`)
+ `\(x_i\)` = the predictor variable (e.g. `hours`)
+ `\(\beta_0\)` = intercept
+ `\(\beta_1\)` = slope
+ `\(\epsilon_i\)` = residual (we will come to this shortly)

---
# Linear Model Equation

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ **Why do we have `\(i\)` in some places and not others?**

--

+ `\(i\)` is a subscript to indicate that each participant has their own value.

+ So each participant has their own:
  + score on the test ( `\(y_i\)` )
  + number of hours studied ( `\(x_i\)` ), and
  + residual term ( `\(\epsilon_i\)` )

--

+ **What does it mean that the intercept ( `\(\beta_0\)` ) and slope ( `\(\beta_1\)` ) do not have the subscript `\(i\)`?**

--

+ It means there is one value for all observations.
  + Remember, the model is for **all of our data**.

---
# What is `\(\epsilon_i\)`?

.pull-left[
+ `\(\epsilon_i\)`, or the residual, is a measure of how well the model fits each data point.

+ It is the distance between the model line (on the `\(y\)`-axis) and a data point.

+ `\(\epsilon_i\)` is positive if the point is above the line (red in plot).

+ `\(\epsilon_i\)` is negative if the point is below the line (blue in plot).
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-15-1.png" width="80%" />
]

???
+ Comment: red residuals are positive; the bigger (longer arrow), the worse the model fits that point.
+ Blue residuals are negative; the smaller (shorter arrow), the better the model fits that point.
+ Key point to link here is the importance of residuals for knowing how good the model is.
+ Link to last lecture in that they are the variability.
+ That is the link into least squares.

---
# How to find the line?

.pull-left[
+ The line represents a model of our data.
  + In our example, the model that best characterizes the relationship between hours of study and test score.

+ In the scatterplot, the data are represented by points.

+ So a good line is a line that is "close" to all points.

+ The method that we use to identify the best-fitting line is the **Principle of Least Squares**.
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-16-1.png" width="80%" />
]

---
class: center, middle

# Questions?
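---
# Residuals, in R

+ A small sketch of the equation above (the coefficient values used here are the ones `lm()` estimates later in the deck): compute each student's predicted score and residual.

```r
hours <- seq(0.5, 5, 0.5)                 # predictor (the study-time data)
score <- c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)  # outcome

b0 <- 0.400   # intercept
b1 <- 1.055   # slope

y_hat    <- b0 + b1 * hours   # model-predicted scores
residual <- score - y_hat     # epsilon_i: observed minus predicted
residual                      # positive = above the line, negative = below
```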
---
class: inverse, center, middle

# Part 3: Principle of Least Squares

---
# Linear Model

+ Yesterday we left off having introduced the linear model:

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ Where:
  + `\(y_i\)` is our measured outcome variable
  + `\(x_i\)` is our measured predictor variable
  + `\(\beta_0\)` is the model intercept
  + `\(\beta_1\)` is the model slope
  + `\(\epsilon_i\)` is the residual error (the difference between the model-predicted and the observed value of `\(y\)`)

--

+ The values of `\(y\)` and `\(x\)` come from the observed data.

+ We'll now go through calculating `\(\beta_0\)` and `\(\beta_1\)`.

---
# Principle of least squares

.pull-left[
+ The values `\(\beta_0\)` and `\(\beta_1\)` are typically **unknown** and need to be estimated from our data.

+ We denote the "best" estimated values as `\(\hat \beta_0\)` and `\(\hat \beta_1\)`.

+ We find the values of `\(\hat \beta_0\)` and `\(\hat \beta_1\)` (and thus our best line) using **least squares**.

+ Least squares minimizes the distances between the actual values of `\(y\)` and the model-predicted values `\(\hat y\)`.
  + That is, it minimizes the residuals for each data point (the line is "close").
]

--

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-17-1.png" width="80%" />
]

---
# Principle of least squares

+ Formally, least squares minimizes the **residual sum of squares**

--

.pull-left[
+ Essentially:
  + Fit a line.
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-18-1.png" width="80%" />
]

---
count: false
# Principle of least squares

+ Formally, least squares minimizes the **residual sum of squares**

.pull-left[
+ Essentially:
  + Fit a line.
  + Calculate the residuals.
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-19-1.png" width="80%" />
]

---
count: false
# Principle of least squares

+ Formally, least squares minimizes the **residual sum of squares**

.pull-left[
+ Essentially:
  + Fit a line.
  + Calculate the residuals.
  + Square them.
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-20-1.png" width="80%" />
]

---
count: false
# Principle of least squares

+ Formally, least squares minimizes the **residual sum of squares**

.pull-left[
+ Essentially:
  + Fit a line.
  + Calculate the residuals.
  + Square them.
  + Sum up the squares.
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-21-1.png" width="80%" />
]

---
count: false
# Principle of least squares

+ Formally, least squares minimizes the **residual sum of squares**

.pull-left[
+ Essentially:
  + Fit a line.
  + Calculate the residuals.
  + Square them.
  + Sum up the squares.

+ **Why do you think we square the deviations?**
]

.pull-right[
<img src="dapr2_01_introlm_lecture_files/figure-html/unnamed-chunk-22-1.png" width="80%" />
]

---
# Residual Sum of Squares

`$$SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$`

---
count: false
# Residual Sum of Squares

`$$SS_{Residual} = \sum_{i=1}^{n}(\color{#BF1932}{y_i} - \hat{y}_i)^2$$`

+ Data = `\(y_i\)`
  + This is what we have measured in our study.
  + For us, the test scores.

---
count: false
# Residual Sum of Squares

`$$SS_{Residual} = \sum_{i=1}^{n}(y_i - \color{#BF1932}{\hat{y}_i})^2$$`

+ Data = `\(y_i\)`
  + This is what we have measured in our study.
  + For us, the test scores.

+ Predicted value = `\(\hat{y}_i = \hat \beta_0 + \hat \beta_1 x_i\)`
  + Or, the value of the outcome our model predicts given someone's values for the predictors.
  + In our example: given you study for 4 hours, what test score does our model predict you will get?

---
count: false
# Residual Sum of Squares

`$$SS_{Residual} = \sum_{i=1}^{n}(\color{#BF1932}{y_i - \hat{y}_i})^2$$`

+ Data = `\(y_i\)`
  + This is what we have measured in our study.
  + For us, the test scores.

+ Predicted value = `\(\hat{y}_i = \hat \beta_0 + \hat \beta_1 x_i\)`
  + Or, the value of the outcome our model predicts given someone's values for the predictors.
  + In our example: given you study for 4 hours, what test score does our model predict you will get?

+ Residual = the difference between `\(y_i\)` and `\(\hat{y}_i\)`.
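---
# Residual sum of squares, in R

+ A minimal sketch of the calculation above, applied to the study-time data (the coefficient values are the ones `lm()` gives later in the deck; any other line would produce a larger sum):

```r
hours <- seq(0.5, 5, 0.5)
score <- c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)

y_hat <- 0.400 + 1.055 * hours   # predicted values from the fitted line
sum((score - y_hat)^2)           # the residual sum of squares
```

+ Least squares finds the intercept and slope that make this sum as small as possible.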
---
# Key Point

+ It is worth a brief pause, as this is a very important point.

> The values of the intercept and slope that minimize the sum of squared residuals are our estimated coefficients from our data.

--

> Minimizing the `\(SS_{residual}\)` means that, across all our data, the predicted values from our model are as close as they can be to the actual measured values of the outcome.

---
# Calculating the slope

+ Calculation for the slope:

`$$\hat \beta_1 = \frac{SP_{xy}}{SS_x}$$`

.pull-left[
+ `\(SP_{xy}\)` = sum of cross-products:

`$$SP_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$`

+ `\(SS_x\)` = sum of squared deviations of `\(x\)`:

`$$SS_x = \sum_{i=1}^{n}(x_i - \bar{x})^2$$`
]

.pull-right[
+ `\(x_i\)` = predictor data (in our example, `hours`)
+ `\(y_i\)` = outcome data (in our example, `score`)
+ `\(\bar{y}\)` = mean of `\(y\)`
+ `\(\bar{x}\)` = mean of `\(x\)`
+ `\(n\)` = total number of observations
+ `\(\Sigma\)` = sum it all up
]

---
# Calculating the intercept

+ Calculation for the intercept:

`$$\hat \beta_0 = \bar{y} - \hat \beta_1 \bar{x}$$`

+ `\(\hat \beta_1\)` = slope estimate
+ `\(\bar{y}\)` = mean of `\(y\)`
+ `\(\bar{x}\)` = mean of `\(x\)`

---
class: center, middle

## Questions?

--

## Time for a little R and to look at an example hand calculation.

---
# `lm` in R

+ We do not generally calculate our linear models by hand.

+ In R, we use the `lm()` function.

```r
lm(DV ~ IV, data = datasetName)
```

+ The first bit of code is the model formula:
  + The outcome or DV appears on the left of `~`
  + The predictor(s) or IV appear on the right of `~`

+ We then give R the name of the data set.
  + This data set must contain variables (columns) with the same names as you have specified in the model formula.

---
# `lm` in R

.pull-left[
```r
library(tibble)  # tibble() comes from the tibble package (part of the tidyverse)

test <- tibble(
  student = paste(rep("ID", 10), 1:10, sep = ""),
  hours = seq(0.5, 5, 0.5),
  score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)
```
]

.pull-right[
```r
head(test, 4)
```

```
## # A tibble: 4 × 3
##   student hours score
##   <chr>   <dbl> <dbl>
## 1 ID1       0.5     1
## 2 ID2       1       3
## 3 ID3       1.5     1
## 4 ID4       2       2
```
]

--

```r
lm(score ~ hours, data = test)
```

```
## 
## Call:
## lm(formula = score ~ hours, data = test)
## 
## Coefficients:
## (Intercept)        hours  
##       0.400        1.055
```

---
class: center, middle

## We've just run our first linear model!

## Questions?
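---
# Hand calculation, in R

+ As a sketch (the object names `sp_xy`, `ss_x`, `b0`, and `b1` are just for illustration), we can run the hand calculation from earlier on the `test` data and check it against the `lm()` output:

```r
sp_xy <- sum((test$hours - mean(test$hours)) *
             (test$score - mean(test$score)))    # sum of cross-products
ss_x  <- sum((test$hours - mean(test$hours))^2)  # sum of squared deviations of x

b1 <- sp_xy / ss_x                               # slope estimate
b0 <- mean(test$score) - b1 * mean(test$hours)   # intercept estimate

round(c(intercept = b0, slope = b1), 3)          # matches lm(): 0.400 and 1.055
```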
---
# Summary

+ Take home points...

1. In statistics, we are building models that describe how a set of variables relate.
2. The **linear model** is one such model we will use in this course.
3. The linear model describes our data based on an intercept and a slope(s).
4. From this model (line) we can make predictions about people's scores on an outcome.
5. The degree to which our predictions differ from the observed data = residual = error = how good (or bad) the model is.
6. We find our model coefficients using least squares: they are the coefficients that minimize the sum of squared residuals.

+ The majority of this course is going to revolve around getting a deeper understanding of these points.

---
class: center, middle

# Thanks for listening!