class: center, middle, inverse, title-slide

# <b>Introduction to the linear model (LM)</b>
## Data Analysis for Psychology in R 2<br><br>
### dapR2 Team
### Department of Psychology<br>The University of Edinburgh

---

# Week's Learning Objectives
1. Understand the link between models and functions.

2. Understand the key concepts (intercept and slope) of the linear model.

3. Understand what residuals represent.

4. Be able to specify a simple linear model (labs).

---
# What is a model?
+ Pretty much all statistics is about models.

+ A model is a formal representation of a system.

+ Put another way, a model is an idea about the way the world is.

---
# A model as a function
+ We tend to represent mathematical models as functions.
  + which can be very helpful.
  
+ A function lets us specify precisely what is important (arguments) and what those things do (operations).
  + This leads to predictions
  + And these predictions can be tested.

---
# An Example
+ To think through these relations, we can use a simpler example.

+ Suppose I have a model for growth of babies.<sup>1</sup>

$$
\text{Length} = 55 + 4 \times \text{Month}
$$

.footnote[
[1] Length is measured in cm.
]
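
As a minimal sketch, the growth model above can be written as an R function. The name `predict_length` is our own choice for illustration: the argument is the input (age in months) and the body is the operation the model applies to it.

```r
# Growth model from the slide: Length (cm) = 55 + 4 * Month
# `predict_length` is a hypothetical name used only for illustration.
predict_length <- function(month) {
  55 + 4 * month
}

# Predicted length (cm) at 11 months
predict_length(11)

# The function is vectorised, so it accepts several ages at once
ages <- seq(10, 12, by = 0.25)
data.frame(Age = ages, Prediction = predict_length(ages))
```

Each call turns an input into a prediction, which is exactly what we will compare against data.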

---
# Visualizing a model

.pull-left[
<img src="dapr2_01_introlm_files/figure-html/unnamed-chunk-1-1.png" width="80%" />
]

.pull-right[

{{content}}
]

--
+ The black line represents our model
{{content}}

--
+ The x-axis shows `Age` `$(x)$`
{{content}}

--
+ The y-axis shows the values of `Length` our model predicts
{{content}}

---
# Models as "a state of the world"
+ Let's suppose my model is true.
  + That is, it is a perfect representation of how babies grow.
  
+ What are the implications of this?

---
# Models and predictions
+ My model creates predictions.

+ **IF** my model is a true representation of the world, **THEN** data from the world should closely match my predictions.

---
# Predictions and data

.pull-left[
<img src="dapr2_01_introlm_files/figure-html/unnamed-chunk-2-1.png" width="80%" />
]

.pull-right[

<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> Age (months) </th>
   <th style="text-align:right;"> Prediction (cm) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 10.00 </td>
   <td style="text-align:right;"> 95 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10.25 </td>
   <td style="text-align:right;"> 96 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10.50 </td>
   <td style="text-align:right;"> 97 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10.75 </td>
   <td style="text-align:right;"> 98 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11.00 </td>
   <td style="text-align:right;"> 99 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11.25 </td>
   <td style="text-align:right;"> 100 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11.50 </td>
   <td style="text-align:right;"> 101 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11.75 </td>
   <td style="text-align:right;"> 102 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 12.00 </td>
   <td style="text-align:right;"> 103 </td>
  </tr>
</tbody>
</table>

]

???
+ Our predictions are points that fall on our line (representing the model, as a function).
+ Here the arrows show how we can use the model to find a predicted value.
+ We find the value of the input on the x-axis (here 11), read up to the line, then across to the y-axis.

---
# Predictions and data

.pull-left[

+ Consider the predictions when the children get a lot older...

]

.pull-right[
<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> Age (months) </th>
   <th style="text-align:right;"> Age (years) </th>
   <th style="text-align:right;"> Prediction (cm) </th>
   <th style="text-align:right;"> Prediction (m) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 216 </td>
   <td style="text-align:right;"> 18 </td>
   <td style="text-align:right;"> 919 </td>
   <td style="text-align:right;"> 9.19 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 228 </td>
   <td style="text-align:right;"> 19 </td>
   <td style="text-align:right;"> 967 </td>
   <td style="text-align:right;"> 9.67 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 240 </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 1015 </td>
   <td style="text-align:right;"> 10.15 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 252 </td>
   <td style="text-align:right;"> 21 </td>
   <td style="text-align:right;"> 1063 </td>
   <td style="text-align:right;"> 10.63 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 264 </td>
   <td style="text-align:right;"> 22 </td>
   <td style="text-align:right;"> 1111 </td>
   <td style="text-align:right;"> 11.11 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 276 </td>
   <td style="text-align:right;"> 23 </td>
   <td style="text-align:right;"> 1159 </td>
   <td style="text-align:right;"> 11.59 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 288 </td>
   <td style="text-align:right;"> 24 </td>
   <td style="text-align:right;"> 1207 </td>
   <td style="text-align:right;"> 12.07 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 300 </td>
   <td style="text-align:right;"> 25 </td>
   <td style="text-align:right;"> 1255 </td>
   <td style="text-align:right;"> 12.55 </td>
  </tr>
</tbody>
</table>

]
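
The table can be reproduced with a short sketch, assuming the same growth model (Length = 55 + 4 * Month). The function and column names below are our own, chosen for illustration.

```r
# Same illustrative growth model: Length (cm) = 55 + 4 * Month
predict_length <- function(month) {
  55 + 4 * month
}

# Ages 18 to 25 years, converted to months
age_years  <- 18:25
age_months <- age_years * 12

older <- data.frame(
  age_months = age_months,
  age_years  = age_years,
  length_cm  = predict_length(age_months),       # prediction in cm
  length_m   = predict_length(age_months) / 100  # prediction in metres
)
older
```

The model predicts that an 18-year-old is more than 9 metres long, which shows what can happen when a model is used far outside the ages it was meant to describe.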

--
+ What do you think this would mean for our actual data?
{{content}}

--
+ Will the data fall on the line?
{{content}}

---
# How good is my model?
+ How might we judge how good our model is?

1. The model is represented as a function.

2. We see that as a line (or a surface, if we have more things to consider).

3. That yields predictions (the values we expect if our model is true).

4. We can collect data.

5. If the predictions do not match the data (points deviate from our line), that says something about our model.

---
# Models and Statistics
+ In statistics we (roughly) follow this process.

+ We define a model that represents one state of the world (probabilistically)

+ We then collect data to compare to it.

+ These comparisons let us make inferences about how the world actually is, relative to the world specified by our model.

---
# The Length & Age relationship is non-linear

.pull-left[
<img src="dapr2_01_introlm_files/figure-html/unnamed-chunk-5-1.png" width="80%" />
]

.pull-right[

+ Our red line plots the mean length at different ages, based on [real data](https://www.cdc.gov/growthcharts/who/boys_length_weight.htm)

]

---
# Deterministic vs Statistical models

.pull-left[
A deterministic model is a model for an **exact** relationship:
$$
y = \underbrace{3 + 2 x}_{f(x)}
$$
<img src="dapr2_01_introlm_files/figure-html/unnamed-chunk-6-1.png" width="70%" style="display: block; margin: auto;" />

]

.pull-right[
A statistical model allows for case-by-case **variability**:
$$
y = \underbrace{3 + 2 x}_{f(x)} + \epsilon
$$
<img src="dapr2_01_introlm_files/figure-html/unnamed-chunk-7-1.png" width="70%" style="display: block; margin: auto;" />
]
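
The difference between the two models can be sketched by simulation. The range of `x`, the error standard deviation (`sd = 2`) and the seed are all assumed values chosen for illustration; the slide does not state them.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

x <- seq(0, 10, by = 0.5)  # assumed range of x, chosen for illustration

# Deterministic model: y is exactly f(x) = 3 + 2x
y_det <- 3 + 2 * x

# Statistical model: f(x) plus random case-by-case error
# sd = 2 is an assumed value, not one given on the slide
epsilon <- rnorm(length(x), mean = 0, sd = 2)
y_stat  <- 3 + 2 * x + epsilon

# Deterministic values sit exactly on the line; statistical values scatter around it
head(data.frame(x, y_det, y_stat))
```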

---
class: center, middle
# Time to take a breath. Questions...

---
# Linear model
+ For the majority of the course, we will focus on how we move from the idea of an association to estimating a model for the relationship.

+ This model is the **linear model**.

+ When using a linear model, we are typically trying to explain variation in an **outcome** ( `$y$`, dependent, response) variable, using one or more **predictor** ( `$x$`, independent, explanatory) variables.

---
# Example

.pull-left[

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> student </th>
   <th style="text-align:right;"> hours </th>
   <th style="text-align:right;"> score </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> ID1 </td>
   <td style="text-align:right;"> 0.5 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID2 </td>
   <td style="text-align:right;"> 1.0 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID3 </td>
   <td style="text-align:right;"> 1.5 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID4 </td>
   <td style="text-align:right;"> 2.0 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID5 </td>
   <td style="text-align:right;"> 2.5 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID6 </td>
   <td style="text-align:right;"> 3.0 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID7 </td>
   <td style="text-align:right;"> 3.5 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID8 </td>
   <td style="text-align:right;"> 4.0 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID9 </td>
   <td style="text-align:right;"> 4.5 </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID10 </td>
   <td style="text-align:right;"> 5.0 </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
</tbody>
</table>

]

.pull-right[

**Simple data**

+ `student` = ID variable unique to each respondent

+ `hours` = the number of hours spent studying. This will be our predictor ( `$x$` )

+ `score` = test score. This will be our outcome ( `$y$` )

**Question: Do students who study more get higher scores on the test?**
]
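
For anyone who wants to follow along, the data in the table can be entered in R as below. The object name `test` is an assumption made for this sketch.

```r
# Data from the table on this slide
test <- data.frame(
  student = paste0("ID", 1:10),
  hours   = seq(0.5, 5, by = 0.5),
  score   = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)

# Inspect the structure and summarise both variables
str(test)
summary(test[, c("hours", "score")])

# A first numerical look at the question: correlation between hours and score
cor(test$hours, test$score)
```

The correlation is positive, which is consistent with students who study more tending to score higher.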

---
# Scatterplot of our data

.pull-left[
<img src="dapr2_01_introlm_files/figure-html/unnamed-chunk-9-1.png" width="80%" />
]

.pull-right[

]

???
+ we can visualize our data. We can see points moving bottom left to top right
+ so association looks positive
+ Now let's add a line that represents the best model

---
# Definition of the line
+ The line can be described by two values:

+ **Intercept**: the point where the line crosses the `$y$`-axis, that is, the value of `$y$` when `$x$` = 0

+ **Slope**: the gradient of the line, or rate of change
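
A small sketch of both definitions, reusing the growth model from earlier in the lecture (Length = 55 + 4 * Month) as the example line. The names `intercept`, `slope` and `growth_line` are our own.

```r
# Illustrative line: Length = 55 + 4 * Month
intercept <- 55
slope     <- 4

growth_line <- function(x) intercept + slope * x

# Intercept: the value of y when x = 0
growth_line(0)                   # 55

# Slope: the change in y for a one-unit increase in x, wherever we start
growth_line(1) - growth_line(0)  # 4
growth_line(7) - growth_line(6)  # 4
```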

???
+ In our example, intercept = for someone who doesn't study, what score will they get?
+ Slope = for every hour of study, how much will my score change

---
# Intercept and slope

.pull-left[

]

.pull-right[

]

---
# How to find a line?
+ The line represents a model of our data.
    + In our example, the model that best characterizes the relationship between hours of study and test score.

+ In the scatterplot, the data is represented by points.

+ So a good line is a line that is "close" to all points.
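
One common way to make "close" precise, and the one assumed in this sketch, is the sum of squared vertical distances between the points and the line (the idea behind least squares). The helper `closeness` and the two candidate lines are made up for illustration.

```r
test <- data.frame(
  hours = seq(0.5, 5, by = 0.5),
  score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)

# Sum of squared vertical distances between the data and a candidate line
# (`closeness` is a hypothetical helper; smaller values mean a closer line)
closeness <- function(intercept, slope, data) {
  predicted <- intercept + slope * data$hours
  sum((data$score - predicted)^2)
}

# Two made-up candidate lines
candidate_a <- closeness(intercept = 0, slope = 1, data = test)
candidate_b <- closeness(intercept = 5, slope = -1, data = test)

c(A = candidate_a, B = candidate_b)
```

Candidate A gives the smaller value, so by this criterion it is the closer of the two lines.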

---
# Linear Model

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ `$y_i$` = the outcome variable (e.g. `score`)

+ `$x_i$` = the predictor variable, (e.g. `hours`)

+ `$\beta_0$` = intercept

+ `$\beta_1$` = slope

+ `$\epsilon_i$` = residual (we will come to this shortly)
  + where `$\epsilon_i \sim N(0, \sigma)$` independently.
    + `$\sigma$` = standard deviation (spread) of the errors
    + The standard deviation of the errors, `$\sigma$`, is constant
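
In R, a model of this form can be specified with `lm()`. The sketch below assumes the study-hours data from the earlier example, stored in an object we call `test`.

```r
test <- data.frame(
  hours = seq(0.5, 5, by = 0.5),
  score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)

# score is the outcome (y), hours is the predictor (x)
# lm() includes the intercept automatically
model <- lm(score ~ hours, data = test)

coef(model)   # estimates of beta_0 (intercept) and beta_1 (slope)
sigma(model)  # estimate of sigma, the standard deviation of the errors
```

For these data the estimates are roughly 0.40 for the intercept and 1.05 for the slope.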

---
# Linear Model

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ **Why do we have `$i$` in some places and not others?**

+ `$i$` is a subscript to indicate that each participant has their own value.

+ So each participant has their own: 
    + score on the test ( `$y_i$` )
    + number of hours studied ( `$x_i$` ) and
    + residual term ( `$\epsilon_i$` )

--
+ **What does it mean that the intercept ( `$\beta_0$` ) and slope ( `$\beta_1$` ) do not have the subscript `$i$`?**

+ It means there is one value for all observations.
    + Remember the model is for **all of our data**

---
# What is `$\epsilon_i$`?

.pull-left[
+ `$\epsilon_i$`, or the residual, is a measure of how well the model fits each data point.

+ It is the vertical distance (measured along the `$y$`-axis) between a data point and the model line.

+ `$\epsilon_i$` is positive if the point is above the line (red in plot)

+ `$\epsilon_i$` is negative if the point is below the line (blue in plot)

]

.pull-right[

]
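
As a sketch, the residuals for the study-hours example can be computed by hand and checked against `resid()`. The object and column names (`test`, `predicted`, `residual`, `position`) are assumptions made for illustration.

```r
test <- data.frame(
  hours = seq(0.5, 5, by = 0.5),
  score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)
model <- lm(score ~ hours, data = test)

# Residual = observed score minus the score predicted by the line
test$predicted <- fitted(model)
test$residual  <- test$score - test$predicted

# Positive residuals are points above the line, negative ones are below it
test$position <- ifelse(test$residual > 0, "above", "below")
test

# resid() returns the same values directly
all.equal(unname(resid(model)), test$residual)
```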

???
+ Comment: red = positive; the bigger the residual (longer arrow), the worse the model fits that point
+ Blue = negative; the smaller the residual (shorter arrow), the better the model fits that point
+ Key point to link here is the importance of residuals for knowing how good the model is
+ Link to last lecture, in that residuals are the variability
+ That is the link into least squares

---
# Summary
+ Take home points...

1. In statistics, we are building models that describe how a set of variables relate.
2. The **linear model** is one such model we will use in this course.
3. The linear model describes our data based on an intercept and one or more slopes.
4. From this model (line) we can make predictions about people's scores on an outcome.
5. The degree to which our predictions differ from the observed data = residual = error = how good (or bad) the model is.

+ The majority of this course is going to revolve around getting a deeper understanding of these 5 points.

---
class: center, middle
# That is all for this week