Testing and Evaluating LM

class: center, middle, inverse, title-slide

.title[
# <b> Testing and Evaluating LM</b>
]
.subtitle[
## Data Analysis for Psychology in R 2<br><br>
]
.author[
### dapR2 Team
]
.institute[
### Department of Psychology<br>The University of Edinburgh
]

---

# Course Overview

.pull-left[

<table style="border: 1px solid black;">
  <tr style="padding: 0 1em 0 1em;">
    <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1;text-align:center;vertical-align: middle">
        <b>Introduction to Linear Models</b></td>
    <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1">
        Intro to Linear Regression</td>
  </tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1">
        Interpreting Linear Models</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:1">
        <b>Testing Individual Predictors</b></td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Model Testing & Comparison</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Linear Model Analysis</td></tr>

<tr style="padding: 0 1em 0 1em;">
    <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4;text-align:center;vertical-align: middle">
        <b>Analysing Experimental Studies</b></td>
    <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Categorical Predictors & Dummy Coding</td>
  </tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        	Effects Coding & Coding Specific Contrasts</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Assumptions & Diagnostics</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Bootstrapping</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        	Categorical Predictor Analysis</td></tr>
</table>

]

.pull-right[

<table style="border: 1px solid black;">
  <tr style="padding: 0 1em 0 1em;">
    <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4;text-align:center;vertical-align: middle">
        <b>Interactions</b></td>
    <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Interactions I</td>
  </tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Interactions II</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Interactions III</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Analysing Experiments</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Interaction Analysis</td></tr>

<tr style="padding: 0 1em 0 1em;">
    <td rowspan="5" style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4;text-align:center;vertical-align: middle">
        <b>Advanced Topics</b></td>
    <td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Power Analysis</td>
  </tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Binary Logistic Regression I</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Binary Logistic Regression II</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        Logistic Regression Analysis</td></tr>
  <tr><td style="border: 1px solid black;padding: 0 1em 0 1em;opacity:0.4">
        	Exam Prep and Course Q&A</td></tr>
</table>

]

---

# This Week's Learning Objectives

1. Understand how to interpret significance tests for `$\beta$` coefficients

2. Understand how to calculate and interpret `$R^2$` and adjusted- `$R^2$` as a measure of model quality

3. Be able to locate information on the significance of individual predictors and overall model fit in R `lm` model output

---
class: inverse, center, middle

# Part 1: Overview

---
# Recap
+ Last week we expanded the general linear model equation to include multiple predictors:

`$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_j x_{ji} + \epsilon_i$$`

+ And we ran an example concerning test scores:

`$$score_i = \beta_0 + \beta_1 hours_{i} + \beta_2 motivation_{i} + \epsilon_i$$`

+ And we looked at how to run this model in R:

``` r
lm(score ~ hours + motivation, data = test_study2)
```

---
# Evaluating our model

So far we have estimated values for the key parameters of our model ( `$\beta$`s )

+ Now we have to think about how we evaluate the model

+ Evaluating a model will consist of:

  1. Evaluating the individual coefficients

  2. Evaluating the overall model quality

  3. Evaluating the model assumptions

**Important:** Before accepting a set of results, all three of these aspects of evaluation must be considered

+ We will talk about evaluating individual coefficients and model quality today
+ Model assumptions are covered later in the course (Semester 1, Week 8)

---
#  Significance of individual effects 
+ A general way to ask this question would be to state:

> **Is our model informative about the relationship between X and Y?**

+ In the context of our example from last lecture, we could ask,

> **Is study time a useful predictor of test score?**

+ The above is a research question

+ We need to turn this into a testable statistical hypothesis

---
#  Evaluating individual predictors 
+ Steps in hypothesis testing:

--
 
  + Research question
    
--
  
  + Statistical hypothesis
    
--
  
  + Define the null hypothesis
    
--
  
  + Calculate an estimate of effect of interest
  
--
  
  + Calculate an appropriate test statistic
    
--
  
  + Evaluate the test statistic against the null

---
# Research question and hypotheses

+ **Research questions** are statements of what we intend to study.

+ A good question defines:

  + constructs under study
  + the relationship being tested
  + a direction of relationship
  + target populations etc.

> **Does increased study time improve test scores in school-age children?**

+ **Statistical hypotheses** are testable mathematical statements.

+ In typical testing in Psychology, we define a **null ( `$H_0$` )** and an **alternative ( `$H_1$` )** hypothesis.
  + `$H_0$` is precise, and states a specific value for the effect of interest
  + `$H_1$` is not specific, and simply says "something else other than the null is more likely"

---
# Statistical significance: Overview

+ Remember, we can only ever test the null hypothesis

+ We select a significance level, `$\alpha$` (typically .05)

+ Then we calculate the `$p$`-value associated with our test statistic

+ If the associated `$p$` is smaller than `$\alpha$`, then we **reject** the null

+ If it is larger, then we **fail to reject** the null

---

class: center, middle

# Questions?

---
class: inverse, center, middle

# Part 2: Steps in significance testing

---
# Defining null

.pull-left[
+ Conceptually:
	+ If `$x$` yields no information on `$y$`, then `$\beta_1 = 0$`
	
+ **Why would this be the case?**
]

---
count: false

# Defining null

.pull-left[
+ Conceptually:
	+ If `$x$` yields no information on `$y$`, then `$\beta_1 = 0$`
	
+ **Why would this be the case?**

+ `$\beta$` gives the predicted change in `$y$` for a unit change in `$x$`.
	+ If `$x$` and `$y$` are unrelated, then a change in `$x$` will not result in any change to the predicted value of `$y$`
	+ So for a unit change in `$x$`, there is no (=0) change in `$y$`
	
+ We can state this formally as a null and alternative:

`$$H_0: \beta_1 = 0$$`
`$$H_1: \beta_1 \neq 0$$`
]

.pull-right[

![](dapr2_03_testingbeta_files/figure-html/unnamed-chunk-5-1.svg)

]

???
+ For the null to be testable, we need to formally define it. 
+ Point out here the difference in the specificity of the hypotheses. `$H_0$` is that the `$b_1$` takes a specific value. `$H_1$` is that `$b_1$` has some value that is not this specific value. i.e. one is directly testable, the other is not.

---
# Point estimate and test statistic

+ We have already seen how we calculate `$\hat \beta_1$`.

+ The associated test statistic for `$\beta$` coefficients is a `$t$`-statistic

`$$t = \frac{\hat \beta}{SE(\hat \beta)}$$`

+ where

  + `$\hat \beta$` = any `$\beta$` coefficient we have estimated
  + `$SE(\hat \beta)$` = standard error of `$\hat \beta$`

+ **Recall** that the standard error describes the spread of the sampling distribution
  + The standard error (SE) provides a measure of sampling variability
  + A smaller SE suggests a more precise estimate (=good)
  
???
+ brief reminders on test statistics
  + every quantity we wish to calculate a significance test for needs a test statistic.
  + the test statistic is a value that has a known sampling distribution
+ If sampling distribution is unfamiliar, again, recap the hypothesis testing material

---
# Let's look at the output from `lm` again

``` r
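# Fit the model from the recap, then print its summary
performance <- lm(score ~ hours + motivation, data = test_study2)
summary(performance)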
```

```
## 
## Call:
## lm(formula = score ~ hours + motivation, data = test_study2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9548  -2.8042  -0.2847   2.9344  13.8240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.86679    0.65473  10.488   <2e-16 ***
## hours        1.37570    0.07989  17.220   <2e-16 ***
## motivation   0.91634    0.38376   2.388   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.386 on 147 degrees of freedom
## Multiple R-squared:  0.6696,	Adjusted R-squared:  0.6651 
## F-statistic: 148.9 on 2 and 147 DF,  p-value: < 2.2e-16
```
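Each part of this output can also be extracted from a fitted model directly. The sketch below is illustrative only: it uses R's built-in `mtcars` data as a stand-in (an assumption, because `test_study2` is not available here), and the object names `fit`, `fit_summary` and `coef_table` are hypothetical.

``` r
# Stand-in model on built-in data (not the course data set)
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_summary <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(fit_summary)
coef_table

# Overall model fit
fit_summary$r.squared      # Multiple R-squared
fit_summary$adj.r.squared  # Adjusted R-squared
df.residual(fit)           # residual degrees of freedom (n - k - 1)
```

The same calls should work on `performance` once `test_study2` has been loaded.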

---
# And work out the `$t$`-values

+ Let's check the value for `motivation` together:

`$$t = \frac{\hat \beta_2}{SE(\hat \beta_2)} = \frac{0.91634}{0.38376} = 2.388 \text{ (3dp)}$$`

+ (Feel free to check `hours` in your own time)

+ So we know where the `$\beta$` values come from, and we have just seen `$t$`
+ What about the `$SE$` and `$p$`?
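The same check can be run in R for both slopes. This is a small sketch that types in the estimates and standard errors printed in the coefficient table; the vector names `est`, `se` and `t_values` are made up for illustration.

``` r
# Estimates and standard errors copied from the summary() output
est <- c(hours = 1.37570, motivation = 0.91634)
se  <- c(hours = 0.07989, motivation = 0.38376)

# t = estimate / standard error
t_values <- est / se
round(t_values, 3)
```

Up to rounding of the inputs, the results should match the `t value` column of the model summary.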

---
#  SE( `$\hat \beta_j$` )

+ The formula for the standard error of the slope is:

`$$SE(\hat \beta_j) = \sqrt{\frac{ SS_{Residual}/(n-k-1)}{\sum(x_{ij} - \bar{x_{j}})^2(1-R_{xj}^2)}}$$`

+ Where:
	+ `$SS_{Residual}$` is the residual sum of squares
	+ `$n$` is the sample size
	+ `$k$` is the number of predictors
	+ `$x_{ij}$` is the observed value of a predictor ( `$j$` ) for an individual ( `$i$` )
	+ `$\bar{x_{j}}$` is the mean of a predictor
	+ `$R_{xj}^2$` is the `$R^2$` from regressing predictor `$j$` on all of the other predictors (the squared multiple correlation)

+ `$R_{xj}^2$` captures the degree to which predictor `$j$` is related to the other predictors
  + For simple linear models, `$R_{xj}^2$` = 0 as there is only 1 predictor
  
---
# SE( `$\hat \beta_j$` )

`$$SE(\hat \beta_j) = \sqrt{\frac{ SS_{Residual}/(n-k-1)}{\sum(x_{ij} - \bar{x_{j}})^2(1-R_{xj}^2)}}$$`  
+ We want our `$SE$` to be small, as this means our estimate is more precise

+ Examining the above formula we can see that:
	+ `$SE$` is smaller when residual variance ( `$SS_{residual}$` ) is smaller
	+ `$SE$` is smaller when sample size ( `$n$` ) is larger
	+ `$SE$` is larger when the number of predictors ( `$k$` ) is larger
	+ `$SE$` is larger when a predictor is strongly correlated with other predictors ( `$R_{xj}^2$` )
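To see the formula at work, the sketch below computes the standard error of one slope by hand and compares it with the value `lm` reports. It is a sketch under stated assumptions: the built-in `mtcars` data stands in for the course data, and all object names are hypothetical.

``` r
# Stand-in model with k = 2 predictors (built-in data, not test_study2)
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
k <- 2

# Numerator: residual sum of squares divided by n - k - 1
ss_residual <- sum(residuals(fit)^2)

# Denominator: variation in the predictor of interest (wt), scaled by
# 1 - R^2 from regressing wt on the other predictor (hp)
ss_x  <- sum((mtcars$wt - mean(mtcars$wt))^2)
r2_xj <- summary(lm(wt ~ hp, data = mtcars))$r.squared

se_by_hand <- sqrt((ss_residual / (n - k - 1)) / (ss_x * (1 - r2_xj)))
se_from_lm <- coef(summary(fit))["wt", "Std. Error"]

c(by_hand = se_by_hand, from_lm = se_from_lm)
```

The two values should agree, which shows that the standard error in the `lm` output follows this formula.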

???
+ We'll return to this later when we discuss multi-collinearity issues

---
# Sampling distribution for the null

.pull-left[

+ So what about `$p$`?

+ `$p$` refers to the probability of obtaining results at least as extreme as ours, given `$H_0$` is true

+ To compute that probability, we need a sampling distribution for the null

+ For `$\beta$`, this is a **`$t$`-distribution**

+ Remember, the shape of the `$t$`-distribution changes depending on the degrees of freedom

]

.pull-right[
![](dapr2_03_testingbeta_files/figure-html/unnamed-chunk-8-1.svg)

]

+ For `$\beta$`, we use a `$t$`-distribution with **`$n-k-1$` degrees of freedom**.
	+ `$n$` = sample size
	+ `$k$` = number of predictors
	+ The additional `$-1$` accounts for the intercept
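As a minimal sketch, the two-tailed `$p$`-value for `motivation` can be recovered from its `$t$`-value and the 147 residual degrees of freedom reported in the model output. The object names `t_obs`, `df_resid` and `p_value` are illustrative only.

``` r
t_obs    <- 2.388  # t value for motivation from the model summary
df_resid <- 147    # residual degrees of freedom from the model summary

# Two-tailed p: area beyond |t| in one tail, doubled
p_value <- 2 * pt(abs(t_obs), df = df_resid, lower.tail = FALSE)
round(p_value, 3)

# Decision at alpha = .05
p_value < 0.05
```

The result should agree with the `Pr(>|t|)` column for `motivation` up to rounding.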

---
#  A decision about the null 
+ We have a `$t$`-value associated with our `$\beta$` coefficient in the R model summary
	
	+ `$t$` = 2.388

+ We evaluate it against a `$t$`-distribution with `$n-k-1$` degrees of freedom

	+ `$df$` = 150-2-1 = 147

+ As with all tests we need to set our `$\alpha$`
	
	+ Let's set `$\alpha$` = 0.05 (two tailed)

+ Now we need a critical value to compare our observed `$t$`-value to

---
# Visualise the null

.pull-left[
![](dapr2_03_testingbeta_files/figure-html/unnamed-chunk-9-1.svg)

]

.pull-right[

+ `$t$`-distribution with 147 df (our null distribution)

]
---
count: false

# Visualise the null

.pull-left[
![](dapr2_03_testingbeta_files/figure-html/unnamed-chunk-10-1.svg)

]

.pull-right[
+ `$t$`-distribution with 147 df (our null distribution)

+ Critical values `$(t^*)$` establish a boundary for significance
  
  + The probability that a `$t$`-value will fall within these extreme regions of the distribution given `$H_0$` is true is equal to `$\alpha$`
    + Because we are performing a two-tailed test, `$\alpha$` is split equally between the two tails (0.025 in each)

]

---
count: false

# Visualise the null

.pull-left[
![](dapr2_03_testingbeta_files/figure-html/unnamed-chunk-11-1.svg)

]

.pull-right[
+ `$t$`-distribution with 147 df (our null distribution)

``` r
(LowerCrit = round(qt(0.025, 147), 3))
```

```
## [1] -1.976
```

``` r
(UpperCrit = round(qt(0.975, 147), 3))
```

```
## [1] 1.976
```

]

---
count: false

# Visualise the null

.pull-left[
![](dapr2_03_testingbeta_files/figure-html/unnamed-chunk-14-1.svg)

]

.pull-right[
+ `$t$`-distribution with 147 df (our null distribution)

``` r
(LowerCrit = round(qt(0.025, 147), 3))
```

```
## [1] -1.976
```

``` r
(UpperCrit = round(qt(0.975, 147), 3))
```

```
## [1] 1.976
```

+ `$t$` = 2.388, `$p$` = .018

]

???
+ discuss this plot.
+ remind them of 2-tailed
+ areas
+ % underneath each end
+ comment on how it would be different one tailed
+ remind about what X is, thus where the line is

---
class: center, middle

# Questions?

---
class: inverse, center, middle

# Part 3: An alternative using confidence intervals

---
# Refresher: What is a confidence interval?

+ When we perform these analyses, we obtain a parameter estimate from our sample (e.g. `$\hat \beta_2 = 0.92$`)

+ It's unlikely that the true value is exactly equal to our parameter estimate

+ We can be much more certain we've captured the true value if we report **confidence intervals**
  
  + Range of plausible values for the parameter
  
  + The wider the range, the more confident we can be that our interval captures the true value

+ How many of you are confident that I'm exactly 35 years old?
      
  + How many of you are confident that I'm between 33 & 38 years old?

  + How many of you are confident that I'm between 29 & 42 years old?

  + How many of you are confident that I'm between 25 & 46 years old?

---
# Refresher: What is a confidence level?

+ To create a confidence interval we must decide on a **confidence level**
  
  + A number between 0 and 1 specified by us
  
  + How confident do you want to be that the confidence interval will contain the true parameter value?

+ Typical confidence levels are 90%, 95%, or 99%

> **Test your understanding:** If we select a 90% confidence level, will the range of values included in our CI be smaller or larger than if we selected a 99% confidence level?

---
#  Confidence intervals for `$\beta$`
+ We can also compute confidence intervals for `$\hat \beta$`

`$$\hat \beta_1 \pm t^* \times SE(\hat \beta_1)$$`
--

+ Typically, the confidence level we report relates to our chosen `$\alpha$`, and we calculate it as `$100 \times (1 - \alpha)$`

+ So, the 95% confidence interval for the effect (slope) of `motivation` would be:

``` r
(LowerCI = round(0.91634 - (qt(0.975, 147) * 0.38376), 3))
```

```
## [1] 0.158
```

``` r
(UpperCI = round(0.91634 + (qt(0.975, 147)* 0.38376), 3))
```

```
## [1] 1.675
```

+ We can be 95% confident that the range from 0.158 to 1.675 contains the true value of `$\beta_2$`

---
# `confint` function

+ We can get confidence intervals for our models more easily:

``` r
confint(performance)
```

```
##                 2.5 %   97.5 %
## (Intercept) 5.5728881 8.160686
## hours       1.2178208 1.533576
## motivation  0.1579477 1.674729
```

+ The confidence intervals for both `motivation` and `hours` do not include the null value (in this case, 0)

+ This agrees with the significance tests ( `$p<.05$` ): **motivation and hours are statistically significant predictors of test scores**
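To connect `confint` back to the formula `$\hat \beta \pm t^* \times SE(\hat \beta)$`, here is a rough sketch that rebuilds all three intervals from the estimates and standard errors in the model summary. The names `est`, `se`, `t_crit` and `ci` are hypothetical, and the inputs are the rounded values printed in the output.

``` r
# Estimates and standard errors copied from the summary() output
est <- c("(Intercept)" = 6.86679, hours = 1.37570, motivation = 0.91634)
se  <- c("(Intercept)" = 0.65473, hours = 0.07989, motivation = 0.38376)

# Critical t for a 95% interval with 147 degrees of freedom
t_crit <- qt(0.975, df = 147)

# estimate -/+ critical value * standard error
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 3)
```

Replacing `0.975` with `0.995` would give the wider 99% intervals.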

---
class: center, middle

# Questions?

---
class: inverse, center, middle

# Part 4: Coefficient of determination ( `$R^2$` )

---
# Model output again

``` r
performance <- lm(score ~ hours + motivation, data = test_study2)
summary(performance)
```

---
#  Quality of the overall model

+ When we measure an outcome ( `$y$` ) in some data, the scores will vary (we hope).

+ Variation in `$y$` = total variation of interest

+ The aim of our linear model is to build a model which describes our outcome variable as a function of our predictor variable(s)

+ We are trying to explain variation in `$y$` using variation in `$x$`
  	+ When `$y$` co-varies with `$x$`...
  	+ we can predict changes in `$y$` based on changes in `$x$`...
  	+ so we say the variance in `$y$` is explained or accounted for

+ But the model will not explain all the variance in `$y$`

+ What is left unexplained is called the residual variance

+ We can break down variation in our data (i.e. variation in `$y$`) based on sums of squares as:

`$$SS_{Total} = SS_{Model} + SS_{Residual}$$`

---
#  Coefficient of determination

+ One way to judge how good our model is would be to consider the proportion of total variance it accounts for

`$$R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}$$`

+ `$R^2$` = coefficient of determination

+ Quantifies the amount of variability in the outcome accounted for by the predictors
  + The more variance accounted for, the better the model fit
  + Represents the extent to which the prediction of `$y$` is improved when predictions are based on the linear relation between `$x$` and `$y$`, compared to not considering `$x$`

+ To illustrate, we can calculate the different sums of squares

---
# Total Sum of Squares

.pull-left[
+ Each Sums of Squares measure quantifies different sources of variation

`$$SS_{Total} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$`

+ Squared distance of each data point from the mean of `$y$`

+ Mean is our baseline

> **Test your understanding:** Why might this be the case?

]

.pull-right[

![](dapr2_03_testingbeta_files/figure-html/unnamed-chunk-20-1.svg)

]

---
count: false

# Total Sum of Squares

.pull-left[
+ Each Sums of Squares measure quantifies different sources of variation

`$$SS_{Total} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$`

+ Squared distance of each data point from the mean of `$y$`

+ Mean is our baseline

> **Test your understanding:** Why might this be the case?

> Without any other information, our best guess at the value of `$y$` for any person is the mean.

]

.pull-right[

![](dapr2_03_testingbeta_files/figure-html/unnamed-chunk-21-1.svg)

]

---
# Residual Sum of Squares

.pull-left[
+ Each Sums of Squares measure quantifies different sources of variation

`$$SS_{Residual} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$`

+ This may look familiar

+ Squared distance of each point from the predicted value
]

.pull-right[

![](dapr2_03_testingbeta_files/figure-html/unnamed-chunk-22-1.svg)

]

---
# Model Sum of Squares

.pull-left[
+ Each Sums of Squares measure quantifies different sources of variation

`$$SS_{Model} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$`

+ Squared distance of each predicted value from the mean of `$y$`

+ Easy to calculate if we know total sum of squares and residual sum of squares

`$$SS_{Model} = SS_{Total} - SS_{Residual}$$`

]

.pull-right[

![](dapr2_03_testingbeta_files/figure-html/unnamed-chunk-23-1.svg)

]

---
# Values in our sample
+ In the current example, these values are:

  + `$SS_{Total}$` = 8556.06
  + `$SS_{Residual}$` = 2826.83
  + `$SS_{Model}$` = 5729.23

+ In the Learn folder for this week, there is a document that shows the calculations from the raw data
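For a sense of what such a calculation looks like, the sketch below computes the three sums of squares from raw data. It is not the Learn document: it assumes the built-in `mtcars` data as a stand-in for `test_study2`, and the object names are illustrative.

``` r
# Stand-in model on built-in data (not the course data set)
fit   <- lm(mpg ~ wt + hp, data = mtcars)
y     <- mtcars$mpg
y_hat <- fitted(fit)

ss_total    <- sum((y - mean(y))^2)      # observed values vs. mean of y
ss_residual <- sum((y - y_hat)^2)        # observed values vs. predicted values
ss_model    <- sum((y_hat - mean(y))^2)  # predicted values vs. mean of y

c(total = ss_total, model = ss_model, residual = ss_residual)

# The decomposition SS_Total = SS_Model + SS_Residual holds
all.equal(ss_total, ss_model + ss_residual)
```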

---
#  Coefficient of determination 
+ Let's come back to `$R^2$`

`$$R^2 = 1 - \frac{SS_{Residual}}{SS_{Total}}$$`

+ Or

`$$R^2 = \frac{SS_{Model}}{SS_{Total}}$$`

+ So in our example:

`$$R^2 = \frac{SS_{Model}}{SS_{Total}} = \frac{5729.23}{8556.06} = 0.6696$$`

**`$R^2$` = 0.6696 means that 66.96% of the variation in test scores is accounted for by hours of revision and student motivation.**
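Both versions of the formula can be checked quickly in R using the sums of squares reported for our sample. This is a small sketch; the object names are illustrative.

``` r
# Sums of squares reported for the test score example
ss_total    <- 8556.06
ss_residual <- 2826.83
ss_model    <- 5729.23

r2_from_model    <- ss_model / ss_total
r2_from_residual <- 1 - ss_residual / ss_total

round(c(r2_from_model, r2_from_residual), 4)
```

Both routes give the same value, which should match the `Multiple R-squared` line of the model summary.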

---
#  Check against model output

``` r
summary(performance)
```

???
We can check this against the R-output:
Be sure to flag small amounts of rounding difference from working through "by hand" and so presenting to fewer decimal places.

---
#  Adjusted `$R^2$`

+ When there are two or more predictors, `$R^2$` tends to be an inflated estimate of the corresponding population value

+ Due to random sampling fluctuation, even when `$R^2 = 0$` in the population, its value in the sample may `$\neq 0$`

+ In **smaller samples**, the fluctuations from zero will be larger on average

+ With **more predictors**, there are more opportunities to add to the positive fluctuation

+ We therefore compute an adjusted `$R^2$`

`$$\hat R^2 = 1 - (1 - R^2)\frac{N-1}{N-k-1}$$`

+ Adjusted `$R^2$` adjusts for both sample size ( `$N$` ) and number of predictors ( `$k$` )
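As a quick worked check of the formula, the sketch below plugs in the values from our example: the `Multiple R-squared` from the model summary, the sample size of 150 and the 2 predictors. The object names are illustrative.

``` r
r2 <- 0.6696  # Multiple R-squared from the model summary
N  <- 150     # sample size
k  <- 2       # number of predictors

adj_r2 <- 1 - (1 - r2) * (N - 1) / (N - k - 1)
round(adj_r2, 4)
```

The result should match the `Adjusted R-squared` line of the model summary.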

---
#  In our example

.pull-left[

``` r
summary(performance)
```

<img src="figs/perfResults.png" height="50%" />
]

.pull-right[
+ **Based on adjusted R-squared, hours studying and student motivation explain 66.5% of the variance in test scores**

+ As the sample size is large and the number of predictors small, unadjusted (0.67) and adjusted R-squared (0.665) are similar
]

---
class: center, middle

# Questions?

---
# Summary

+ Key take homes:
  1. We have an inferential test, based on a `$t$`-distribution, for the significance of `$\beta$`
  2. We can compute confidence intervals that give us more certainty that we have captured the true value of `$\beta$`
  3. We are more likely to find a statistically significant effect when residuals are small and we have a large sample
  4. We can assess the degree to which our model explains variance in the outcome based on `$R^2$`
  5. When we have multiple predictors, we should use the adjusted `$R^2$` to get a more conservative estimate
  
+ Next week we will look at overall model significance and comparisons between models

---

## This week

.pull-left[

### Tasks

**Attend your lab and work together on the exercises**

<br>

**Complete the weekly quiz**

Quizzes from now onwards contribute to your final mark (14/18 best scores counted)

]

.pull-right[

### Support

**Help each other on the Piazza forum**

<br>

**Attend office hours (see Learn page for details)**

]

---
class: inverse, center, middle

# Thanks for listening