class: center, middle, inverse, title-slide

.title[
# F-tests & Model Comparison
]

.subtitle[
## Data Analysis for Psychology in R 2
]

.author[
### dapR2 Team
]

.institute[
### Department of Psychology
The University of Edinburgh
]

---
# Week's Learning Objectives

1. Understand the use of `\(F\)` and incremental `\(F\)` tests.
2. Be able to run and interpret `\(F\)`-tests in R.
3. Understand how to use model comparisons to test different types of question.
4. Understand the difference between nested and non-nested models, and the appropriate statistics to use for comparison in each case.

---
class: inverse, center, middle

# Part 1: Recap and `\(F\)`-tests

---
# Where we left off...

+ Last week we looked at:
  + The significance of individual predictors
  + Overall model evaluation through `\(R^2\)` and adjusted `\(R^2\)`, to see how much variance in the outcome has been explained

+ Today we will:
  + Look at significance tests of the overall model
  + Discuss how we can use the same tools to do incremental tests (how much does my model improve when I add variables?)

---
# Significance of the overall model

+ The tests of the individual predictors (IVs, or `\(x\)`'s) do not tell us whether the overall model is significant.
  + Neither does `\(R^2\)`
  + But both are indicative

+ To test the significance of the model as a whole, we conduct an `\(F\)`-test.

---
# F-test & F-ratio

+ An `\(F\)`-test involves testing the statistical significance of a test statistic called (wait for it) the `\(F\)`-ratio.

+ The `\(F\)`-ratio tests the null hypothesis that all the regression slopes in a model are zero.

--

+ In other words, our predictors tell us nothing about our outcome.
  + They explain no variance.

--

+ If our predictors do explain enough variance, our `\(F\)`-ratio will be significant.

---
# Our results (significant F)

```r
performance <- lm(score ~ hours + motivation, data = test_study2)
summary(performance)
```

```
## 
## Call:
## lm(formula = score ~ hours + motivation, data = test_study2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9548  -2.8042  -0.2847   2.9344  13.8240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.86679    0.65473  10.488   <2e-16 ***
## hours        1.37570    0.07989  17.220   <2e-16 ***
## motivation   0.91634    0.38376   2.388   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.386 on 147 degrees of freedom
## Multiple R-squared:  0.6696,	Adjusted R-squared:  0.6651 
## F-statistic: 148.9 on 2 and 147 DF,  p-value: < 2.2e-16
```

---
# F-ratio: Some details

+ The `\(F\)`-ratio is a ratio of the explained to unexplained variance:

`$$F = \frac{\frac{SS_{model}}{df_{model}}}{\frac{SS_{residual}}{df_{residual}}} = \frac{MS_{Model}}{MS_{Residual}}$$`

+ Where MS = mean squares

--

+ **What are mean squares?**
  + Mean squares are sums of squares calculations divided by the associated degrees of freedom.
  + We saw how to calculate model and residual sums of squares last week.

+ But what are degrees of freedom...

---
# Degrees of freedom

+ The degrees of freedom are defined as the number of independent values associated with the different calculations.
  + Df are typically a combination of the amount of data you have (sample size, `\(n\)`) and the number of things you need to calculate/estimate ( `\(k\)` ).

+ **Model degrees of freedom = `\(k\)`**
  + `\(SS_{model}\)` depends on the `\(k\)` estimated `\(\beta\)`s, hence `\(k\)`.

+ **Residual degrees of freedom = `\(n-k-1\)`**
  + The `\(SS_{residual}\)` calculation is based on our model, in which we estimate `\(k\)` `\(\beta\)` terms and an intercept ( `\(1\)` ).

+ **Total degrees of freedom = `\(n-1\)`**
  + The `\(SS_{total}\)` calculation is based on the observed `\(y_i\)` and `\(\bar{y}\)`.
  + Once `\(\bar{y}\)` has been estimated, all but one value of `\(y\)` are free to vary, hence `\(n-1\)`.
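---
# Calculating `\(F\)` by hand

+ To make these formulas concrete, here is a minimal sketch that reassembles the mean squares and `\(F\)`-ratio from the `performance` model above. Nothing here is new estimation; all the names come from the earlier code.

```r
# Sums of squares from the fitted model
ss_residual <- sum(resid(performance)^2)
ss_total    <- sum((test_study2$score - mean(test_study2$score))^2)
ss_model    <- ss_total - ss_residual

# Degrees of freedom
n <- nobs(performance)
k <- length(coef(performance)) - 1  # predictors, excluding the intercept

# Mean squares and the F-ratio
ms_model    <- ss_model / k
ms_residual <- ss_residual / (n - k - 1)
ms_model / ms_residual              # should match the 148.9 from summary()
```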
---
# F-table

| SS       | df    | MS                        | F-ratio                | p-value                  |
|----------|-------|---------------------------|------------------------|--------------------------|
| Model    | k     | SS model / df model       | MS model / MS residual | F(df model, df residual) |
| Residual | n-k-1 | SS residual / df residual |                        |                          |
| Total    | n-1   |                           |                        |                          |

---
# Our example (note the df at the bottom)

```r
summary(performance)
```

```
## 
## Call:
## lm(formula = score ~ hours + motivation, data = test_study2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9548  -2.8042  -0.2847   2.9344  13.8240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.86679    0.65473  10.488   <2e-16 ***
## hours        1.37570    0.07989  17.220   <2e-16 ***
## motivation   0.91634    0.38376   2.388   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.386 on 147 degrees of freedom
## Multiple R-squared:  0.6696,	Adjusted R-squared:  0.6651 
## F-statistic: 148.9 on 2 and 147 DF,  p-value: < 2.2e-16
```

---
# F-ratio

+ Bigger `\(F\)`-ratios indicate better-fitting models.
  + It means the variance explained by the model is big compared to the residual variance.

--

+ `\(H_0\)` for the model says that the best guess of any individual's `\(y\)` value is `\(\bar{y}\)` plus error.
  + Or that the `\(x\)` variables collectively carry no information about `\(y\)`.
  + I.e. all slopes = 0

--

+ The `\(F\)`-ratio will be close to 1 when `\(H_0\)` is true
  + If there is equivalent residual and model variation ( `\(MS_{residual} = MS_{model}\)` ), `\(F = 1\)`
  + If there is more model than residual variation, `\(F > 1\)`

---
# Testing the significance of `\(F\)`

+ The `\(F\)`-ratio is our test statistic for the significance of our model.

+ As with all statistical inferences, we select an `\(\alpha\)` level.

+ We identify the proper null `\(F\)`-distribution and calculate the critical value of `\(F\)` associated with the chosen level of `\(\alpha\)`.

+ Then we compare our `\(F\)`-statistic to the critical value.
  + If our value is more extreme than the critical value, it is considered significant.
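---
# F under the null: a quick simulation

+ As an illustration of the claim that `\(F \approx 1\)` when `\(H_0\)` is true, here is a small simulation sketch. The sample size, number of replications, and seed are arbitrary choices, not part of the original analysis.

```r
# Simulate F-statistics for models where all true slopes are 0
set.seed(1)
f_null <- replicate(5000, {
  y  <- rnorm(150)   # outcome carrying no signal
  x1 <- rnorm(150)   # two arbitrary predictors
  x2 <- rnorm(150)
  summary(lm(y ~ x1 + x2))$fstatistic["value"]
})
mean(f_null)         # close to 1, as expected under the null
```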
---
# Sampling distribution for the null

.pull-left[
+ Similar to the `\(t\)`-distribution, the `\(F\)`-distribution changes shape based on `\(df\)`

+ With an `\(F\)`-statistic, we have to consider both `\(df_{model}\)` and `\(df_{residual}\)`
]

.pull-right[
<img src="dapr2_04_ftests_files/figure-html/unnamed-chunk-6-1.png" width="504" />
]

---
# A decision about the null

+ We have an `\(F\)`-statistic:
  + `\(F = 148.9\)`

--

+ We need to calculate `\(df_{model}\)` and `\(df_{residual}\)` to get our null distribution:
  + `\(df_{model}=k=2\)`
  + `\(df_{residual}=n-k-1=150-2-1=147\)`

--

+ We need to set our `\(\alpha\)` level
  + `\(\alpha = .05\)`

--

+ Now we can compute our critical value for `\(F\)`

---
# Visualize the test

.pull-left[
<img src="dapr2_04_ftests_files/figure-html/unnamed-chunk-7-1.png" width="504" />
]

.pull-right[
+ `\(F\)`-distribution with 2 `\(df_{model}\)` and 147 `\(df_{residual}\)` (our null distribution)
]

---
count: false
# Visualize the test

.pull-left[
<img src="dapr2_04_ftests_files/figure-html/unnamed-chunk-8-1.png" width="504" />
]

.pull-right[
+ `\(F\)`-distribution with 2 `\(df_{model}\)` and 147 `\(df_{residual}\)` (our null distribution)

+ Our critical value

```r
(Crit = round(qf(0.95, 2, 147), 3))
```

```
## [1] 3.058
```
]

---
count: false
# Visualize the test

.pull-left[
<img src="dapr2_04_ftests_files/figure-html/unnamed-chunk-10-1.png" width="504" />
]

.pull-right[
+ `\(F\)`-distribution with 2 `\(df_{model}\)` and 147 `\(df_{residual}\)` (our null distribution)

+ Our critical value

```r
(Crit = round(qf(0.95, 2, 147), 3))
```

```
## [1] 3.058
```

+ We can calculate the probability of an F-statistic at least as extreme as ours, given `\(H_0\)` is true (our `\(p\)`-value):

```r
(pVal = 1-pf(148.9, 2, 147))
```

```
## [1] 0
```
]

---
count: false
# Visualize the test

.pull-left[
<img src="dapr2_04_ftests_files/figure-html/unnamed-chunk-13-1.png" width="504" />
]

.pull-right[
+ `\(F\)`-distribution with 2 `\(df_{model}\)` and 147 `\(df_{residual}\)` (our null distribution)

+ Our critical value

```r
(Crit = round(qf(0.95, 2, 147), 3))
```

```
## [1] 3.058
```

+ We can calculate the probability of an F-statistic at least as extreme as ours, given `\(H_0\)` is true (our `\(p\)`-value):

```r
(pVal = 1-pf(148.9, 2, 147))
```

```
## [1] 0
```

+ Our model significantly predicted the variance in test scores, `\(F(2, 147) = 148.90, p < .001\)`
]

---
class: center, middle

# Questions?

---
class: inverse, center, middle

# Part 2: Model Comparison & Incremental `\(F\)`-tests

---
# Model comparisons

+ So far, our questions have been _is our overall model better than nothing?_ ( `\(F\)`-test ) or _which variables, specifically, are good predictors of the outcome variable?_ ( `\(t\)`-tests of `\(\beta\)` estimates )

--

+ But what if instead we wanted to ask:

> **When I make a change to my model, does it improve or not?**

+ This question is the core of model comparison

--

+ We can adapt this to our models in a more specific way:
  + E.g. is a model with `\(x_1\)`, `\(x_2\)`, and `\(x_3\)` as predictors better than the model with just `\(x_1\)`?

--

+ In our linear model journey so far, we haven't yet explored the tools to answer this question.
  + We have tested individual predictors
  + and we have tested overall models
  + **but we have not tested the improvement when we add predictors**
  + More generally, we have not looked at combined tests of more than one, but fewer than all, of our predictors

---
# `\(F\)`-test as an incremental test

+ One important way we can think about the `\(F\)`-test and the `\(F\)`-ratio is as an incremental test against an "empty" or null model.

+ A null or empty model is a linear model with only the intercept.
  + In this model, our predicted value of the outcome for every case in our data set is the mean of the outcome ( `\(\bar{y}\)` ).
  + That is, with no predictors, we have no information that may help us predict the outcome.
  + So we will be "least wrong" by guessing the mean of the outcome.

+ An empty model is the same as saying all `\(\beta\)` = 0.
  + And remember, this was the null hypothesis of the `\(F\)`-test.

--

+ So in this way, the `\(F\)`-test can be seen as **comparing two models** (see the sketch on the next slide).

+ We can extend this idea, and use the `\(F\)`-test to compare two models that contain different sets of predictors.
  + This is the **incremental `\(F\)`-test**
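---
# Seeing this in R

+ To make the "empty model" idea concrete, here is a quick sketch using the `performance` model and `test_study2` data from Part 1. Comparing an intercept-only model to our full model with the `anova()` function (which we will meet properly shortly) reproduces the overall `\(F\)`-test.

```r
# Intercept-only (empty) model: predicts mean(score) for everyone
m_empty <- lm(score ~ 1, data = test_study2)

# Comparing it to the two-predictor model gives the overall F-test
anova(m_empty, performance)  # F = 148.9, as in summary(performance)
```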
---
# Incremental `\(F\)`-test

.pull-left[
+ The incremental `\(F\)`-test evaluates the statistical significance of the improvement in variance explained in an outcome with the addition of further predictor(s)

+ It is based on the difference in `\(F\)`-values between two models.
  + We call the model with the additional predictor(s) **model 1** or the **full model**
  + We call the model without them **model 0** or the **restricted model**
]

.pull-right[
`$$F_{(df_R-df_F),df_F} = \frac{(SSR_R-SSR_F)/(df_R-df_F)}{SSR_F / df_F}$$`

$$
`\begin{align}
& \text{Where:} \\
& SSR_R = \text{residual sums of squares for the restricted model} \\
& SSR_F = \text{residual sums of squares for the full model} \\
& df_R = \text{residual degrees of freedom from the restricted model} \\
& df_F = \text{residual degrees of freedom from the full model} \\
\end{align}`
$$
]

---
# Why does this matter?

+ A very reasonable question you may have at this point is "so what, why does that matter?"

+ Well, it is very useful if we ever want to formally compare two models and decide which is the better one.

+ And that is what we will discuss next!

---
class: inverse, center, middle

# Part 3: Examples of Model comparisons

---
# Example 1

+ How about this example based on data from the Midlife in the United States (MIDUS2) study:

+ Outcome: self-rated health

+ Covariates: age, sex

+ Predictors: Big Five traits and Purpose in Life.

+ Research question: Does personality predict self-rated health over and above age and sex?

---
# The data

```r
midus <- read_csv("data/MIDUS2.csv")
midus2 <- midus %>%
  select(1:4, 31:42) %>%
  mutate(
    PIL = rowMeans(.[grep("PIL", names(.))], na.rm = TRUE)
  ) %>%
  select(1:4, 12:17) %>%
  drop_na(.)
slice(midus2, 1:3)
```

```
## # A tibble: 3 × 10
##      ID   age sex    health     O     C     E     A     N   PIL
##   <dbl> <dbl> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10002    69 MALE        8  2.14   2.8   2.6   3.4  2     5.86
## 2 10019    51 MALE        8  3.14   3     3.4   3.6  1.5   5.71
## 3 10023    78 FEMALE      4  3.57   3.4   3.6   4    1.75  5.14
```

---
# Example 1: The models

+ Does personality significantly predict self-rated health over and above the effects of age and sex?

+ The first step here is to run two models:
  + M1: we predict from age and sex
  + M2: we add in the FFM traits

```r
m1 <- lm(health ~ age + sex, data = midus2)
```

```r
m2 <- lm(health ~ age + sex + O + C + E + A + N, data = midus2)
```

---
# Incremental `\(F\)`-test in R

+ The second step is to compare the two models based on an incremental `\(F\)`-test.

+ In order to apply the `\(F\)`-test for model comparison in R, we use the `anova()` function.

+ `anova()` takes as its arguments the models that we wish to compare.
  + Here we will show an example with 2 models, but we could use more.

```r
anova(m1, m2)
```

---
# Incremental `\(F\)`-test in R

```r
anova(m1, m2)
```

```
## Analysis of Variance Table
## 
## Model 1: health ~ age + sex
## Model 2: health ~ age + sex + O + C + E + A + N
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1758 4740.2                                  
## 2   1753 4055.4  5    684.85 59.208 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
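---
# Incremental `\(F\)` by hand

+ As a sanity check on the formula from Part 2, here is a sketch that reproduces the `\(F\)`-value in the `anova()` table directly from the two fitted models. For a linear model, `deviance()` returns the residual sum of squares and `df.residual()` the residual degrees of freedom.

```r
# Residual SS and df for the restricted (m1) and full (m2) models
ssr_r <- deviance(m1)
ssr_f <- deviance(m2)
df_r  <- df.residual(m1)
df_f  <- df.residual(m2)

# Incremental F, following the formula from Part 2
((ssr_r - ssr_f) / (df_r - df_f)) / (ssr_f / df_f)
# ~59.21, matching the F column of anova(m1, m2)
```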
---
class: center, middle

# Questions?

---
class: inverse, center, middle

# Part 4: Nested models and alternatives to `\(F\)`-tests

---
# Nested vs non-nested models

+ The incremental `\(F\)`-test depends on the comparison models being nested.
  + Nested means that the predictors in one model are a subset of the predictors in the other.

+ We also require the models to be computed on the same data.

---
# Nested vs non-nested models

.pull-left[
**Nested**

```r
m0 <- lm(outcome ~ x1 + x2, data = data)
m1 <- lm(outcome ~ x1 + x2 + x3, data = data)
```

+ These models are nested.

+ `x1` and `x2` appear in both models
]

.pull-right[
**Non-nested**

```r
m0 <- lm(outcome ~ x1 + x2 + x4, data = data)
m1 <- lm(outcome ~ x1 + x2 + x3, data = data)
```

+ These models are non-nested

+ There are unique variables in both models
  + `x4` in `m0`
  + `x3` in `m1`
]

---
# Model comparison for non-nested models

+ So what happens when we have non-nested models?

+ There are two commonly used alternatives
  + AIC
  + BIC

+ Unlike the incremental `\(F\)`-test, AIC and BIC do not require the two models to be nested

+ Smaller (more negative) values indicate better-fitting models.
  + So we compare values and choose the model with the smaller AIC or BIC value

---
# AIC & BIC

.pull-left[
`$$AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k$$`

$$
`\begin{align}
& \text{Where:} \\
& SS_{residual} = \text{sum of squares residuals} \\
& n = \text{sample size} \\
& k = \text{number of explanatory variables} \\
& \text{ln} = \text{natural log function}
\end{align}`
$$
]

.pull-right[
`$$BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n)$$`

$$
`\begin{align}
& \text{Where:} \\
& SS_{residual} = \text{sum of squares residuals} \\
& n = \text{sample size} \\
& k = \text{number of explanatory variables} \\
& \text{ln} = \text{natural log function}
\end{align}`
$$
]

---
# Parsimony corrections

+ Both AIC and BIC contain something called a parsimony correction
  + In essence, they penalise models for being complex
  + This is to help us avoid overfitting (adding predictors arbitrarily to improve fit)

`$$AIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + 2k$$`

`$$BIC = n\,\text{ln}\left( \frac{SS_{residual}}{n} \right) + k\,\text{ln}(n)$$`

+ For the sample sizes typical of linear model applications, BIC has a harsher parsimony penalty than AIC
  + Whenever `\(\text{ln}(n) > 2\)` (i.e. whenever `\(n \geq 8\)` ), BIC's penalty per predictor is more severe, so essentially all the time!
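---
# AIC by hand

+ A brief sketch applying the AIC formula above to `m1` and `m2`. One caveat worth flagging: R's built-in `AIC()` (used on the next slide) is computed from the log-likelihood and includes constant terms that the formula above omits, so the absolute values differ. Those constants cancel when comparing models fitted to the same data, so the *difference* between models agrees.

```r
# AIC from the slide formula: n * ln(SS_residual / n) + 2k
aic_hand <- function(model) {
  n <- nobs(model)
  k <- length(coef(model)) - 1  # explanatory variables, excluding intercept
  n * log(deviance(model) / n) + 2 * k
}

aic_hand(m2) - aic_hand(m1)  # difference between the two models
AIC(m2) - AIC(m1)            # the same difference from R's AIC()
```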
---
# In R

+ Let's use AIC and BIC on our `m1` and `m2` models from before:

.pull-left[
```r
AIC(m1, m2)
```

```
##    df      AIC
## m1  4 6749.246
## m2  9 6484.457
```
]

.pull-right[
```r
BIC(m1, m2)
```

```
##    df      BIC
## m1  4 6771.141
## m2  9 6533.719
```
]

---
# Let's consider a different example

+ Our previous models were nested
  + `m1` had just the covariates
  + `m2` added personality

+ Using the same data, let's consider a non-nested example.

+ Suppose we want to compare:
  + a model that predicts self-rated health from just the 5 personality variables (`nn1`: non-nested model 1)
  + to a model that predicts from age, sex, and a variable called Purpose in Life (PIL) (`nn2`).

---
# Applied to non-nested models

```r
nn1 <- lm(health ~ O + C + E + A + N, data = midus2)
nn2 <- lm(health ~ age + sex + PIL, data = midus2)
```

```r
AIC(nn1, nn2)
```

```
##     df      AIC
## nn1  7 6501.524
## nn2  5 6564.953
```

```r
BIC(nn1, nn2)
```

```
##     df      BIC
## nn1  7 6539.840
## nn2  5 6592.321
```

---
# Considerations for use of AIC and BIC

+ AIC and BIC can be used for both nested and non-nested models.

--

+ The AIC and BIC values for a single model are not meaningful
  + They only make sense for model comparisons
  + We evaluate these comparisons by looking at the difference, `\(\Delta\)`, between two values

--

+ There are no specific thresholds for `\(\Delta AIC\)` to suggest how big a difference in two models is needed to conclude that one is substantively better than the other

--

+ The following `\(\Delta BIC\)` cutoffs have been suggested (Raftery, 1995):

| Value                 | Interpretation                                     |
|-----------------------|----------------------------------------------------|
| `\(\Delta < 2\)`      | No evidence of difference between models           |
| `\(2 < \Delta < 6\)`  | Positive evidence of difference between models     |
| `\(6 < \Delta < 10\)` | Strong evidence of difference between models       |
| `\(\Delta > 10\)`     | Very strong evidence of difference between models  |

+ For example, for `nn1` and `nn2` above, `\(\Delta BIC = 6592.32 - 6539.84 \approx 52.5\)`: very strong evidence that the personality model (`nn1`) fits better.

---
class: center, middle

# Questions?

---
# Pause to summarise what we know so far

+ So far we have seen how to:
  + run a linear model with a single predictor
  + extend this and add predictors
  + interpret these coefficients either in original units or standardized units
  + test the significance of `\(\beta\)` coefficients
  + test the significance of the overall model
  + estimate the amount of variance explained by our model

+ Short version: well done, you can now run and interpret linear models with continuous predictors.

+ Next week we will put this into action, constructing and implementing an analysis plan for a linear model on a real example.

---
class: inverse, center, middle

# Thanks for listening