class: center, middle, inverse, title-slide

# Week 2: Introduction to the Linear Model
## Data Analysis for Psychology in R 2
### Tom Booth & Alex Doumas

### Department of Psychology
The University of Edinburgh

### AY 2020-2021

---
# Week's Learning Objectives

1. Be able to specify a simple linear model.
2. Understand and describe fitted values and residuals.
3. Be able to interpret the coefficients from a linear model.
4. Be able to test hypotheses and construct confidence intervals for the model coefficients.

---
# Topics for today

+ Moving on from the idea of a line and a function, we will discuss:
  + Least squares and the linear model
  + Differentiating measured data, fitted values and residuals
  + Calculating the slope and intercept

---
# Things to recap

+ This week we will build from:
  + the arithmetic mean
  + the concept of squared deviations

???
+ this is material to point students to.
+ no need to spend time on this here

---
# Recap: correlation

+ The correlation coefficient is a **standardized measure of association** between two variables.
+ We can calculate correlations for different data types.
  + For now, we will focus on two numeric (continuous) variables.
+ Correlations are typically visualized with **scatterplots**.

---
# Scatterplot

+ Scatterplots plot points at the (x, y) co-ordinates of two measured variables.
+ We plot each individual data point (typically a participant's pair of responses).
+ This produces the cloud of points.

---
# Scatterplot

.pull-left[
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-3-1.png" width="90%" />
]

.pull-right[
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:right;"> height </th>
   <th style="text-align:right;"> weight </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> John </td>
   <td style="text-align:right;"> 1.52 </td>
   <td style="text-align:right;"> 54 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Peter </td>
   <td style="text-align:right;"> 1.60 </td>
   <td style="text-align:right;"> 49 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Robert </td>
   <td style="text-align:right;"> 1.68 </td>
   <td style="text-align:right;"> 50 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> David </td>
   <td style="text-align:right;"> 1.78 </td>
   <td style="text-align:right;"> 67 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> George </td>
   <td style="text-align:right;"> 1.86 </td>
   <td style="text-align:right;"> 70 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Matthew </td>
   <td style="text-align:right;"> 1.94 </td>
   <td style="text-align:right;"> 110 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bradley </td>
   <td style="text-align:right;"> 2.09 </td>
   <td style="text-align:right;"> 98 </td>
  </tr>
</tbody>
</table>

+ `name` = nominal variable
+ `height` = height in metres, numeric
+ `weight` = weight in kg, numeric
]

---
# Strength of correlation

<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-5-1.png" width="55%" />

---
# Linear model

+ For the majority of this course, we will focus on how we move from the idea of an association to estimating a model for the relationship.
+ This model is the **linear model**.
+ When using a linear model, we are typically trying to explain variation in an **outcome** (Y, dependent, response) variable using one or more **predictor** (x, independent, explanatory) variables.
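+ In R, a model of this form is fit with the `lm()` function. A minimal sketch, assuming a data frame `dat` with placeholder columns `outcome` and `predictor` (all three names are illustrative only):

```r
# Sketch only: 'dat', 'outcome' and 'predictor' are placeholder names.
# lm() estimates the linear model by least squares (covered below).
model <- lm(outcome ~ predictor, data = dat)
summary(model)
```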
---
# Example

.pull-left[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> student </th>
   <th style="text-align:right;"> hours </th>
   <th style="text-align:right;"> score </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> ID1 </td>
   <td style="text-align:right;"> 0.5 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID2 </td>
   <td style="text-align:right;"> 1.0 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID3 </td>
   <td style="text-align:right;"> 1.5 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID4 </td>
   <td style="text-align:right;"> 2.0 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID5 </td>
   <td style="text-align:right;"> 2.5 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID6 </td>
   <td style="text-align:right;"> 3.0 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID7 </td>
   <td style="text-align:right;"> 3.5 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID8 </td>
   <td style="text-align:right;"> 4.0 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID9 </td>
   <td style="text-align:right;"> 4.5 </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ID10 </td>
   <td style="text-align:right;"> 5.0 </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
**Simple data**

+ `student` = ID variable unique to each respondent
+ `hours` = the number of hours spent studying. This will be our predictor ( `\(x\)` )
+ `score` = test score ( `\(y\)` )

**Question: Do students who study more get higher scores on the test?**
]

---
# Scatterplot of our data

.pull-left[
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-7-1.png" width="90%" />
]

.pull-right[
{{content}}
]

--

<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-8-1.png" width="90%" />

{{content}}

???
+ we can visualize our data: the points move from bottom left to top right
+ so the association looks positive
+ Now let's add a line that represents the best model

---
# Definition of the line

+ The line can be described by two values:
  + **Intercept**: the point where the line crosses the `\(y\)`-axis, i.e. the value of `\(y\)` when `\(x = 0\)`
  + **Slope**: the gradient of the line, or rate of change

???
+ In our example, intercept = for someone who doesn't study, what score will they get?
+ Slope = for every hour of study, how much will my score change?

---
# Intercept and slope

.pull-left[
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-10-1.png" width="90%" />
]

.pull-right[
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-11-1.png" width="90%" />
]

---
# How to find a line?

+ The line represents a model of our data.
  + In our example, the model that best characterizes the relationship between hours of study and test score.
+ In the scatterplot, the data are represented by points.
+ So a good line is one that is "close" to all points (see the sketch on the next slide).
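---
# Plotting the data in R

A minimal sketch of this idea, using the example data from the table above (the plotting choices are illustrative only):

```r
# Example data (from the table above)
test <- data.frame(
  student = paste0("ID", 1:10),
  hours   = seq(0.5, 5, by = 0.5),
  score   = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8)
)

# Scatterplot of score against hours
plot(score ~ hours, data = test,
     xlab = "Hours studied", ylab = "Test score")

# Overlay the least squares line; lm() is introduced on the next slide
abline(lm(score ~ hours, data = test))
```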
---
# Linear Model

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ `\(y_i\)` = the outcome variable (e.g. `score`)
+ `\(x_i\)` = the predictor variable (e.g. `hours`)
+ `\(\beta_0\)` = intercept
+ `\(\beta_1\)` = slope
+ `\(\epsilon_i\)` = residual (we will come to this shortly), where `\(\epsilon_i \sim N(0, \sigma)\)` independently
  + `\(\sigma\)` = standard deviation (spread) of the errors
  + The standard deviation of the errors, `\(\sigma\)`, is assumed constant

---
# Linear Model

`$$y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i$$`

+ **Why do we have `\(i\)` in some places and not others?**

--

+ `\(i\)` is a subscript indicating that each participant has their own value.
+ So each participant has their own:
  + score on the test ( `\(y_i\)` )
  + number of hours studied ( `\(x_i\)` ) and
  + residual term ( `\(\epsilon_i\)` )

--

+ **What does it mean that the intercept ( `\(\beta_0\)` ) and slope ( `\(\beta_1\)` ) do not have the subscript `\(i\)`?**

--

+ It means there is one value for all observations.
+ Remember, the model is for **all of our data**.

---
# What is `\(\epsilon_i\)`?

.pull-left[
+ `\(\epsilon_i\)`, or the residual, is a measure of how well the model fits each data point.
+ It is the vertical distance (along the `\(y\)`-axis) between the model line and a data point.
  + `\(\epsilon_i\)` is positive if the point is above the line (red in the plot)
  + `\(\epsilon_i\)` is negative if the point is below the line (blue in the plot)
]

.pull-right[
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-12-1.png" width="90%" />
]

???
+ red = positive and bigger (longer arrow): the model is worse for that point
+ blue = negative and smaller (shorter arrow): the model is better
+ key point to link here is the importance of residuals for knowing how good the model is
+ Link to last lecture in that they are the variability
+ that is the link into least squares

---
class: center, middle
# Time for a break

---
class: center, middle
# Welcome Back!

**Where we left off...**

---
# Principle of least squares

+ The numbers `\(\beta_0\)` and `\(\beta_1\)` are typically **unknown** and need to be estimated in order to fit a line through the point cloud.
+ We denote the "best" values as `\(\hat \beta_0\)` and `\(\hat \beta_1\)`
+ The best fitting line is found using **least squares**
  + Minimizes the distances between the actual values of `\(y\)` and the model-predicted values of `\(\hat y\)`
  + Specifically, minimizes the sum of the *squared* deviations

---
# Principle of least squares

+ Actual value = `\(y_i\)`
+ Model-predicted value = `\(\hat y_i = \hat \beta_0 + \hat \beta_1 x_i\)`
+ Deviation or residual = `\(y_i - \hat y_i\)`
+ Minimize the **residual sum of squares**, `\(SS_{Residual}\)`, which is

`$$SS_{Residual} = \sum_{i=1}^{n} [y_i - (\hat \beta_0 + \hat \beta_1 x_{i})]^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$`

---
# Principle of least squares

+ **Why do you think we square the deviations?**
  + HINT: Look back to the "What is `\(\epsilon_i\)`?" slide

--

+ We have positive and negative residual terms.
+ If we simply added them, they would cancel out.

---
# Data, predicted values and residuals

+ Data = `\(y_i\)`
  + This is what we have measured in our study.
  + For us, the test scores.
+ Predicted value = `\(\hat{y}_i = \hat \beta_0 + \hat \beta_1 x_i\)` = the `\(y\)`-value on the line at specific values of `\(x\)`
  + Or, the value of the outcome our model predicts given someone's values for the predictors.
  + In our example: given you study for 4 hours, what test score does our model predict you will get?
+ Residual = difference between `\(y_i\)` and `\(\hat{y}_i\)`. So:

`$$SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$`

???
+ these are important distinctions for understanding linear models
+ return to them a lot
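---
# Fitted values and residuals in R

A minimal sketch of these quantities for our example (the data frame `test` is rebuilt here so the chunk runs on its own):

```r
test <- data.frame(hours = seq(0.5, 5, by = 0.5),
                   score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8))

m <- lm(score ~ hours, data = test)  # least squares fit

fitted(m)        # predicted values, y-hat_i
resid(m)         # residuals, y_i - y-hat_i
sum(resid(m)^2)  # SS_Residual: the quantity least squares minimizes
```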
---
# Fitting the line

+ Calculation for the slope:

`$$\hat \beta_1 = \frac{SP_{xy}}{SS_x}$$`

+ `\(SP_{xy}\)` = sum of cross-products:

`$$SP_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$`

+ `\(SS_x\)` = sum of squared deviations of `\(x\)`:

`$$SS_x = \sum_{i=1}^{n}(x_i - \bar{x})^2$$`

<!-- --- -->
<!-- # Equivalent formula -->
<!-- $$\hat \beta_1 = -->
<!-- \frac{SP_{xy}}{SS_x} = -->
<!-- r \frac{s_y}{s_x}$$ -->
<!-- where -->
<!-- - `\(r = \frac{SP_{xy}}{\sqrt{SS_x \times SS_y}}\)` -->
<!-- - `\(s_y = \sqrt{ \frac{SS_y}{n - 1} } = \sqrt{ \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1} }\)` -->
<!-- - `\(s_x = \sqrt{ \frac{SS_x}{n - 1} } = \sqrt{ \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} }\)` -->

---
# Fitting the line

+ Calculation for the intercept:

`$$\hat \beta_0 = \bar{y} - \hat \beta_1 \bar{x}$$`

+ `\(\hat \beta_1\)` = slope estimate
+ `\(\bar{y}\)` = mean of `\(y\)`
+ `\(\bar{x}\)` = mean of `\(x\)`

---
class: center, middle
# Time for a break

This would be a good time to take a look at the lecture 3 worked example, where we show these calculations for our example.

---
class: center, middle
# Welcome Back!

**Where we left off...**

We calculated the intercept and slope. Now let's think about error...

---
# What is `\(\sigma\)`?

.pull-left[
<center>**Small `\(\sigma\)`**</center>
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-14-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right[
<center>**Large `\(\sigma\)`**</center>
<img src="dapR2_lec03_LMintro_files/figure-html/unnamed-chunk-15-1.png" width="90%" style="display: block; margin: auto;" />
]

---
# What is `\(\sigma\)`?

+ The less scatter around the line, the smaller the standard deviation of the errors.
+ The less scatter around the line, the stronger the relationship between `\(y\)` and `\(x\)`.

--

+ We estimate `\(\sigma\)` using the residuals.
+ The estimated standard deviation of the errors is:

`$$\hat \sigma = \sqrt{\frac{SS_{Residual}}{n - k - 1}} = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat y_i)^2}{n - k - 1}}$$`

+ In simple linear regression we only have one `\(x\)`, so `\(k = 1\)` and the denominator becomes `\(n - 2\)`.

---
# Summary of today

+ Moved from correlation to the linear model
+ Calculated the slope and intercept
+ Discussed `\(SS_{Residual}\)`
+ Discussed `\(\hat \sigma\)` and its relation to good models

---
# Next tasks

+ This week:
  + Complete your lab
  + Come to office hours
  + Weekly quiz - practice no. 2
    + Opens Monday 09:00
    + Closes Sunday 17:00
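---
# Appendix: checking the calculations in R

A hedged sketch, not part of the lecture itself: the hand formulas above can be computed directly and checked against R's built-in estimates.

```r
test <- data.frame(hours = seq(0.5, 5, by = 0.5),
                   score = c(1, 3, 1, 2, 2, 6, 3, 3, 4, 8))

# Slope and intercept from the formulas
SP_xy <- sum((test$hours - mean(test$hours)) *
             (test$score - mean(test$score)))
SS_x  <- sum((test$hours - mean(test$hours))^2)
b1 <- SP_xy / SS_x                               # slope
b0 <- mean(test$score) - b1 * mean(test$hours)   # intercept

# Estimated sigma: sqrt(SS_Residual / (n - k - 1)), with k = 1
m <- lm(score ~ hours, data = test)
sigma_hat <- sqrt(sum(resid(m)^2) / (nrow(test) - 2))

# These should match coef(m) and sigma(m)
c(b0, b1); coef(m); sigma_hat; sigma(m)
```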