class: center, middle, inverse, title-slide .title[ #
Chi-Square Tests
] .subtitle[ ## Data Analysis for Psychology in R 1 ] .author[ ### DapR1 Team ] .institute[ ### Department of Psychology
The University of Edinburgh ]

---
# Week's Learning Objectives

1. Understand the difference between the `\(\chi^2\)` goodness-of-fit test and the `\(\chi^2\)` test of independence
2. Perform a `\(\chi^2\)` goodness-of-fit test and interpret results
3. Perform a `\(\chi^2\)` test of independence and interpret results
4. Understand the assumptions for `\(\chi^2\)` tests

---
class: inverse, center, middle

# Part 1
## Introduction to `\(\chi^2\)`

---
# Moving on from `\(t\)`-tests...

+ `\(t\)`-tests have allowed you to make comparisons using *continuous* data:
    + A continuous outcome variable from two separate groups (independent-samples `\(t\)`-test)
    + A continuous outcome variable from one group at two timepoints (paired-samples `\(t\)`-test)
    + One continuous variable against a single value (one-sample `\(t\)`-test)

--

+ You may instead want to test whether data are distributed across *categories* in the way that you would expect:
    + Is your sample distributed equally across levels of education?
    + Is smoking (Y/N) associated with cardiovascular disease (Y/N)?
    + Do sharks prefer to eat humans or fish?

--

+ In this case, you will need a test that checks whether data are grouped according to your expectations.
+ `\(\chi^2\)`-tests are used to compare **frequencies** across categories in your data

---
# `\(\chi^2\)`-tests vs `\(t\)`-tests

+ Similar to a `\(t\)`-test,
    1. Compute a test statistic
    2. Locate the test statistic on a distribution that reflects the probability of each test statistic value, given that `\(H_0\)` is true.
    3. If the probability associated with your test statistic is small enough, your results are considered significant.

--

+ Like the `\(t\)`-distribution, the shape of the distribution depends on the degrees of freedom
+ Unlike the `\(t\)`-distribution, *df* in a `\(\chi^2\)` test isn't computed from the sample size, but from the number of groups within your data.
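
To see how the degrees of freedom change the `\(\chi^2\)` distribution, the short sketch below uses base R's `qchisq()` to find the value that cuts off the top 5% of the distribution. The three *df* values and the name `crit95` are assumptions chosen for illustration only; they are not part of the lecture data.

```r
# Illustrative degrees of freedom (assumed for this sketch)
dfs <- c(1, 3, 6)

# 95th percentile of the chi-square distribution for each df,
# i.e. the critical value at alpha = .05
crit95 <- qchisq(0.95, df = dfs)

round(crit95, 2)
```

The cut-off grows as *df* grows, so a given `\(\chi^2\)` value is less surprising when there are more groups.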
.pull-left[
.center[
**`\(t\)` Distribution**

![](dapR1_lec19_Chisquare_files/figure-html/unnamed-chunk-1-1.svg)<!-- -->
]
]

.pull-right[
.center[
**`\(\chi^2\)` Distribution**

![](dapR1_lec19_Chisquare_files/figure-html/unnamed-chunk-2-1.svg)<!-- -->
]
]

---
# `\(\chi^2\)` distribution

.pull-left[
+ As the number of comparison groups increases, the distribution curve flattens
    + Larger `\(\chi^2\)` values become more probable
    + A wider range of `\(\chi^2\)` values becomes more likely

+ The `\(\chi^2\)` distribution begins at 0
    + Categorical variables don't have direction
    + We can investigate this further by looking at the `\(\chi^2\)` formula
]

.pull-right[
.center[
**`\(\chi^2\)` Distribution**

![](dapR1_lec19_Chisquare_files/figure-html/unnamed-chunk-3-1.svg)<!-- -->
]
]

---
# The basic `\(\chi^2\)` formula

.center.f2[
`\(\chi^2 = \Sigma \frac{(O-E)^2}{E}\)`
]

+ `\(\Sigma\)` = sum up
+ `\(E\)` = Expected Cases
    + The values that you expect, given `\(H_0\)` is true
+ `\(O\)` = Observed Cases
    + The values you actually have
+ Every term is a squared difference divided by a positive expected count, so `\(\chi^2\)` can never be negative

---
# Assumptions of `\(\chi^2\)` tests

+ Sufficiently large `\(n\)` so that the `\(\chi^2\)` distribution is a good approximation to the distribution of the test statistic
    + What is 'sufficiently large' will depend on the number of cells you have.
    + Expected cases > 5 in each cell
+ Observations are independent
    + Each observation appears only in a single cell.

---
# Types of `\(\chi^2\)` tests

+ Goodness of Fit
+ Test of Independence

---
class: inverse, center, middle

# Part 2
## `\(\chi^2\)` Goodness of Fit test

---
# `\(\chi^2\)` Goodness of Fit test

.pull-left[
+ Tests whether the values you actually have are consistent with the values you expect.
+ Looks at the distribution of data across a single category
+ **Hypotheses:**
    + `\(H_0: p_1 = p_{1,0},\ p_2 = p_{2,0},\ ...,\ p_C = p_{C,0}\)`
    + `\(H_1:\)` Some `\(p_i \neq p_{i,0}\)`
]

---
count: false

# `\(\chi^2\)` Goodness of Fit test

.pull-left[
+ Tests whether the values you actually have are consistent with the values you expect.
+ Looks at the distribution of data across a single category + **Hypotheses:** + `\(H_0: p_1 = p_{1,0},\ p_2 = p_{2,0},\ ...,\ p_C = p_{C,0}\)` + `\(H_1:\)` Some `\(p_i \neq p_{i,0}\)` ] .pull-right[ .pull-left.center[**Expected Values ** <img src="figures/chiSqGoF_Exp.png" width="80%" /> ] .pull-right.center[ **Observed Values ** <img src="figures/chiSqGoF_Obs.png" width="80%" /> ] ] --- # `\(\chi^2\)` Goodness of Fit test .center.f3[ `\(\chi^2 = \sum\limits_{i=1}^k \frac{(O_i - E_i)^2}{E_i}\)` ] + `\(\sum\limits_{i=1}^k\)` : Sum all values from levels 1 through k + `\(i\)` : Current level --- # Performing a `\(\chi^2\)` Goodness of Fit test .pull-left[ + A new flower shop is trying to decide which days of the week they will be open + They want to know whether order number is consistent across days of the week + They count the total number of orders they take each day of the week over the course of a month ] .pull-right.center[ <br> <table> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:right;"> Orders </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Monday </td> <td style="text-align:right;"> 54 </td> </tr> <tr> <td style="text-align:left;"> Tuesday </td> <td style="text-align:right;"> 39 </td> </tr> <tr> <td style="text-align:left;"> Wednesday </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:left;"> Thursday </td> <td style="text-align:right;"> 47 </td> </tr> <tr> <td style="text-align:left;"> Friday </td> <td style="text-align:right;"> 68 </td> </tr> <tr> <td style="text-align:left;"> Saturday </td> <td style="text-align:right;"> 72 </td> </tr> <tr> <td style="text-align:left;"> Sunday </td> <td style="text-align:right;"> 53 </td> </tr> </tbody> </table> ] --- count: false # Performing a `\(\chi^2\)` Goodness of Fit test .pull-left[ + A new flower shop is trying to decide which days of the week they will be open + They want to know whether order number is consistent across days of the week 
+ They count the total number of orders they take each day of the week over the course of a month
+ `\(H_0\)`: Orders will be consistent throughout the week
    + `\(p_{Monday}=p_{Tuesday}=\cdots=p_{Sunday}=\frac{1}{7}\)`
+ `\(H_1\)`: Orders will differ across the week
    + Some `\(p_{i}\neq p_{i,0}\)`

]

.pull-right.center[
<br>
<table> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:right;"> Orders </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Monday </td> <td style="text-align:right;"> 54 </td> </tr> <tr> <td style="text-align:left;"> Tuesday </td> <td style="text-align:right;"> 39 </td> </tr> <tr> <td style="text-align:left;"> Wednesday </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:left;"> Thursday </td> <td style="text-align:right;"> 47 </td> </tr> <tr> <td style="text-align:left;"> Friday </td> <td style="text-align:right;"> 68 </td> </tr> <tr> <td style="text-align:left;"> Saturday </td> <td style="text-align:right;"> 72 </td> </tr> <tr> <td style="text-align:left;"> Sunday </td> <td style="text-align:right;"> 53 </td> </tr> </tbody> </table>
]

---
# Performing a `\(\chi^2\)` Goodness of Fit test

**Compute the test statistic**

.center.f3[
`\(\chi^2 = \sum\limits_{i=1}^k \frac{(O_i-E_i)^2}{E_i}\)`
]

+ `\(E_i=n\cdot\ p_i\)`
+ In this example, `\(H_0\)` says orders are equally likely on every day, so the expected proportion is the same for each level: `\(p_i = \frac{1}{7}\)`.

---
count: false

# Performing a `\(\chi^2\)` Goodness of Fit test

**Compute the test statistic**

.center.f3[
`\(\chi^2 = \sum\limits_{i=1}^k \frac{(O_i-\color{#BF1932}{E_i})^2}{\color{#BF1932}{E_i}}\)`
]

+ `\(E_i=n\cdot\ p_i\)`
+ In this example, `\(H_0\)` says orders are equally likely on every day, so the expected proportion is the same for each level: `\(p_i = \frac{1}{7}\)`.
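
The expected counts can be sketched directly in R. The slides never show how `flowerDat` is created, so the data frame below is a hypothetical reconstruction from the orders table, and the names `n`, `k` and `p0` are likewise assumptions made for illustration.

```r
# Hypothetical reconstruction of the lecture's data (assumption: the
# slides do not show how flowerDat was built)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
flowerDat <- data.frame(
  Day    = factor(days, levels = days),
  Orders = c(54, 39, 44, 47, 68, 72, 53)
)

n  <- sum(flowerDat$Orders)     # total number of orders
k  <- nlevels(flowerDat$Day)    # number of categories (days)
p0 <- rep(1 / k, k)             # equal proportions under H0

# E_i = n * p_i for every day
flowerDat$Expected <- n * p0
round(flowerDat$Expected, 2)
```

Storing the result as an `Expected` column keeps the observed and expected counts side by side for the later steps.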
```r
# Expected count per day under H0: total orders times 1/k
exVal <- sum(flowerDat$Orders) * (1 / length(levels(flowerDat$Day)))
round(exVal, 2)
```

```
## [1] 53.86
```

---
# Performing a `\(\chi^2\)` Goodness of Fit test

**Compute the test statistic**

.center.f3[
`\(\chi^2 = \sum\limits_{i=1}^k \frac{(O_i-\color{#BF1932}{E_i})^2}{\color{#BF1932}{E_i}}\)`
]

<table> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:right;"> Orders </th> <th style="text-align:right;"> Expected </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Monday </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 53.86 </td> </tr> <tr> <td style="text-align:left;"> Tuesday </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 53.86 </td> </tr> <tr> <td style="text-align:left;"> Wednesday </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> 53.86 </td> </tr> <tr> <td style="text-align:left;"> Thursday </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 53.86 </td> </tr> <tr> <td style="text-align:left;"> Friday </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 53.86 </td> </tr> <tr> <td style="text-align:left;"> Saturday </td> <td style="text-align:right;"> 72 </td> <td style="text-align:right;"> 53.86 </td> </tr> <tr> <td style="text-align:left;"> Sunday </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 53.86 </td> </tr> </tbody> </table>

---
# Performing a `\(\chi^2\)` Goodness of Fit test

**Compute the test statistic**

.center.f3[
`\(\chi^2 = \sum\limits_{i=1}^k \frac{(\color{#BF1932}{O_i - E_i})^2}{E_i}\)`
]

<table> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:right;"> Orders </th> <th style="text-align:right;"> Expected </th> <th style="text-align:right;"> Difference </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Monday </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 53.86 </td> <td
style="text-align:right;"> 0.14 </td> </tr> <tr> <td style="text-align:left;"> Tuesday </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -14.86 </td> </tr> <tr> <td style="text-align:left;"> Wednesday </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -9.86 </td> </tr> <tr> <td style="text-align:left;"> Thursday </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -6.86 </td> </tr> <tr> <td style="text-align:left;"> Friday </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 14.14 </td> </tr> <tr> <td style="text-align:left;"> Saturday </td> <td style="text-align:right;"> 72 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 18.14 </td> </tr> <tr> <td style="text-align:left;"> Sunday </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -0.86 </td> </tr> </tbody> </table> --- # Performing a `\(\chi^2\)` Goodness of Fit test **Compute the test statistic** .center.f3[ `\(\chi^2 = \sum\limits_{i=1}^k \frac{(O_i - E_i)\color{#BF1932}{^2}}{E_i}\)` ] <table> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:right;"> Orders </th> <th style="text-align:right;"> Expected </th> <th style="text-align:right;"> Difference </th> <th style="text-align:right;"> Squared </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Monday </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.02 </td> </tr> <tr> <td style="text-align:left;"> Tuesday </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -14.86 </td> <td 
style="text-align:right;"> 220.73 </td> </tr> <tr> <td style="text-align:left;"> Wednesday </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -9.86 </td> <td style="text-align:right;"> 97.16 </td> </tr> <tr> <td style="text-align:left;"> Thursday </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -6.86 </td> <td style="text-align:right;"> 47.02 </td> </tr> <tr> <td style="text-align:left;"> Friday </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 14.14 </td> <td style="text-align:right;"> 200.02 </td> </tr> <tr> <td style="text-align:left;"> Saturday </td> <td style="text-align:right;"> 72 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 18.14 </td> <td style="text-align:right;"> 329.16 </td> </tr> <tr> <td style="text-align:left;"> Sunday </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -0.86 </td> <td style="text-align:right;"> 0.73 </td> </tr> </tbody> </table> --- # Performing a `\(\chi^2\)` Goodness of Fit test **Compute the test statistic** .center.f3[ `\(\chi^2 = \sum\limits_{i=1}^k \color{#BF1932}{\frac{(O_i - E_i)^2}{E_i}}\)` ] <table> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:right;"> Orders </th> <th style="text-align:right;"> Expected </th> <th style="text-align:right;"> Difference </th> <th style="text-align:right;"> Squared </th> <th style="text-align:right;"> SqbyExp </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Monday </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;"> 0.00 </td> </tr> <tr> <td style="text-align:left;"> Tuesday </td> <td 
style="text-align:right;"> 39 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -14.86 </td> <td style="text-align:right;"> 220.73 </td> <td style="text-align:right;"> 4.10 </td> </tr> <tr> <td style="text-align:left;"> Wednesday </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -9.86 </td> <td style="text-align:right;"> 97.16 </td> <td style="text-align:right;"> 1.80 </td> </tr> <tr> <td style="text-align:left;"> Thursday </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -6.86 </td> <td style="text-align:right;"> 47.02 </td> <td style="text-align:right;"> 0.87 </td> </tr> <tr> <td style="text-align:left;"> Friday </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 14.14 </td> <td style="text-align:right;"> 200.02 </td> <td style="text-align:right;"> 3.71 </td> </tr> <tr> <td style="text-align:left;"> Saturday </td> <td style="text-align:right;"> 72 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 18.14 </td> <td style="text-align:right;"> 329.16 </td> <td style="text-align:right;"> 6.11 </td> </tr> <tr> <td style="text-align:left;"> Sunday </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -0.86 </td> <td style="text-align:right;"> 0.73 </td> <td style="text-align:right;"> 0.01 </td> </tr> </tbody> </table> --- # Performing a `\(\chi^2\)` Goodness of Fit test **Compute the test statistic** .center.f3[ `\(\chi^2 = \color{#BF1932}{\sum\limits_{i=1}^k} \frac{(O_i - E_i)^2}{E_i}=\)` 16.62 ] <table> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:right;"> Orders </th> <th style="text-align:right;"> Expected </th> <th style="text-align:right;"> Difference </th> <th style="text-align:right;"> Squared </th> <th 
style="text-align:right;"> SqbyExp </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Monday </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;color: #BF1932 !important;"> 0.00 </td> </tr> <tr> <td style="text-align:left;"> Tuesday </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -14.86 </td> <td style="text-align:right;"> 220.73 </td> <td style="text-align:right;color: #BF1932 !important;"> 4.10 </td> </tr> <tr> <td style="text-align:left;"> Wednesday </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -9.86 </td> <td style="text-align:right;"> 97.16 </td> <td style="text-align:right;color: #BF1932 !important;"> 1.80 </td> </tr> <tr> <td style="text-align:left;"> Thursday </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -6.86 </td> <td style="text-align:right;"> 47.02 </td> <td style="text-align:right;color: #BF1932 !important;"> 0.87 </td> </tr> <tr> <td style="text-align:left;"> Friday </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 14.14 </td> <td style="text-align:right;"> 200.02 </td> <td style="text-align:right;color: #BF1932 !important;"> 3.71 </td> </tr> <tr> <td style="text-align:left;"> Saturday </td> <td style="text-align:right;"> 72 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 18.14 </td> <td style="text-align:right;"> 329.16 </td> <td style="text-align:right;color: #BF1932 !important;"> 6.11 </td> </tr> <tr> <td style="text-align:left;"> Sunday </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> -0.86 </td> <td 
style="text-align:right;"> 0.73 </td> <td style="text-align:right;color: #BF1932 !important;"> 0.01 </td> </tr> </tbody> </table>

---
# Performing a `\(\chi^2\)` Goodness of Fit test

**Find the test statistic on the distribution**

.pull-left[
+ `\(df=k-1\)`
    + `\(k\)` = number of levels within the categorical variable
]

---
count: false

# Performing a `\(\chi^2\)` Goodness of Fit test

**Find the test statistic on the distribution**

.pull-left[
+ `\(df=k-1\)`
    + `\(k\)` = number of levels within the categorical variable

```r
# df = k - 1, where k is the number of days
length(levels(flowerDat$Day)) - 1
```

```
## [1] 6
```

]

.pull-right[
![](dapR1_lec19_Chisquare_files/figure-html/unnamed-chunk-15-1.svg)<!-- -->
]

---
count: false

# Performing a `\(\chi^2\)` Goodness of Fit test

**Find the test statistic on the distribution**

.pull-left[
+ `\(df=k-1\)`
    + `\(k\)` = number of levels within the categorical variable

```r
# df = k - 1, where k is the number of days
length(levels(flowerDat$Day)) - 1
```

```
## [1] 6
```

+ `\(\chi^2 =\)` 16.62
]

.pull-right[
![](dapR1_lec19_Chisquare_files/figure-html/unnamed-chunk-17-1.svg)<!-- -->
]

---
# Performing a `\(\chi^2\)` Goodness of Fit test

**Compute the probability of a score at least as extreme as the test statistic**

.pull-left[
+ What proportion of the area under the curve falls in the shaded region?
]

.pull-right[
![](dapR1_lec19_Chisquare_files/figure-html/unnamed-chunk-18-1.svg)<!-- -->
]

---
# Performing a `\(\chi^2\)` Goodness of Fit test

**Compute the probability of a score at least as extreme as the test statistic**

.pull-left[
+ What proportion of the area under the curve falls in the shaded region?

```r
# Upper-tail probability of the observed chi-square statistic
pchisq(sum(flowerDat$SqbyExp), df = 6, lower.tail = FALSE)
```

```
## [1] 0.01080571
```

+ The probability that we would have a `\(\chi^2\)` value at least as extreme as 16.62 if `\(H_0\)` is true is only 0.01.
]

.pull-right[
![](dapR1_lec19_Chisquare_files/figure-html/unnamed-chunk-20-1.svg)<!-- -->
]

---
class: center, middle

.f1[Questions?]
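
---
# Performing a `\(\chi^2\)` Goodness of Fit test

**Check the hand calculation in one step**

The steps above can also be run with base R's `chisq.test()`. This is a minimal sketch rather than the lecture's own code: the vector `orders` and the object name `gof` are assumptions, with the counts copied from the orders table.

```r
# Observed orders per day, copied from the table in the slides
orders <- c(Monday = 54, Tuesday = 39, Wednesday = 44, Thursday = 47,
            Friday = 68, Saturday = 72, Sunday = 53)

# Goodness-of-fit test against equal proportions (H0: every p_i = 1/7)
gof <- chisq.test(x = orders, p = rep(1 / length(orders), length(orders)))

round(unname(gof$statistic), 2)   # chi-square statistic
unname(gof$parameter)             # degrees of freedom (k - 1)
round(gof$p.value, 4)             # upper-tail p-value
```

The statistic, degrees of freedom and p-value should agree with the values worked out by hand (`\(\chi^2\)` of about 16.62 on 6 *df*, p of about 0.011).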
---
# Exploring our Results Further

+ If our results are significant, we are likely interested in knowing which levels of our categorical variable had the biggest differences.
+ We can get this information by looking at the Pearson residuals (sometimes called standardized residuals).
    + `\(\frac{O_i-E_i}{\sqrt{E_i}}\)`
    + In R, `chisq.test()` stores these as `residuals`; its `stdres` element is a different, adjusted quantity

--

```r
# Pearson residual for Monday (row 1)
(flowerDat$Orders[1] - flowerDat$Expected[1]) / sqrt(flowerDat$Expected[1])
```

```
## [1] 0.01946616
```

<table> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:right;"> Orders </th> <th style="text-align:right;"> Expected </th> <th style="text-align:right;"> Difference </th> <th style="text-align:right;"> Squared </th> <th style="text-align:right;"> SqbyExp </th> <th style="text-align:right;"> Residuals </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Monday </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 53.86 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.02 </td> </tr> </tbody> </table>

---
# Exploring our Results Further

.pull-left[
+ Positive residuals indicate that the frequency of the corresponding level is higher than expected
+ Negative residuals indicate that the frequency of the corresponding level is lower than expected
+ More extreme residuals indicate that the values are contributing more strongly to the results
    + Values `\(\leq\)` -2 indicate the frequency of that level is **much lower** than expected
    + Values `\(\geq\)` 2 indicate the frequency of that level is **much higher** than expected
]

.pull-right[
<br>
<table> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:right;"> Orders </th> <th style="text-align:right;"> Residuals </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Monday </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 0.02 </td> </tr> <tr> <td style="text-align:left;"> Tuesday </td> <td
style="text-align:right;"> 39 </td> <td style="text-align:right;"> -2.02 </td> </tr> <tr> <td style="text-align:left;"> Wednesday </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> -1.34 </td> </tr> <tr> <td style="text-align:left;"> Thursday </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> -0.93 </td> </tr> <tr> <td style="text-align:left;"> Friday </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 1.93 </td> </tr> <tr> <td style="text-align:left;"> Saturday </td> <td style="text-align:right;"> 72 </td> <td style="text-align:right;"> 2.47 </td> </tr> <tr> <td style="text-align:left;"> Sunday </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> -0.12 </td> </tr> </tbody> </table> ] --- # Drawing Conclusions .pull-left[ **If you owned the flower shop, which two days would you choose to close each week?** ] .pull-right[ <br> <table> <thead> <tr> <th style="text-align:left;"> Day </th> <th style="text-align:right;"> Orders </th> <th style="text-align:right;"> Residuals </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Monday </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 0.02 </td> </tr> <tr> <td style="text-align:left;"> Tuesday </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> -2.02 </td> </tr> <tr> <td style="text-align:left;"> Wednesday </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> -1.34 </td> </tr> <tr> <td style="text-align:left;"> Thursday </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> -0.93 </td> </tr> <tr> <td style="text-align:left;"> Friday </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 1.93 </td> </tr> <tr> <td style="text-align:left;"> Saturday </td> <td style="text-align:right;"> 72 </td> <td style="text-align:right;"> 2.47 </td> </tr> <tr> <td style="text-align:left;"> Sunday </td> <td 
style="text-align:right;"> 53 </td> <td style="text-align:right;"> -0.12 </td> </tr> </tbody> </table> ] --- class: inverse, center, middle # Part 3 ## `\(\chi^2\)` Test of Independence --- # `\(\chi^2\)` Test of Independence .pull-left[ + Checks whether two categorical variables from a single population are independent of each other. + Specifically, tests whether membership in Variable 1 is dependent upon membership in Variable 2 + **Hypotheses:** + `\(H_0:\)` Variable A is not associated with variable B + `\(H_1:\)` Variable A is associated with variable B ] --- count: false # `\(\chi^2\)` Test of Independence .pull-left[ + Checks whether two categorical variables from a single population are independent of each other. + Specifically, tests whether membership in Variable 1 is dependent upon membership in Variable 2 + **Hypotheses:** + `\(H_0:\)` Variable A is not associated with variable B + `\(H_1:\)` Variable A is associated with variable B ] .pull-right.center[ **Expected Values ** <img src="figures/chiSqToI_Exp.png" width="90%" /> **Observed Values ** <img src="figures/chiSqToI_Obs.png" width="90%" /> ] --- # `\(\chi^2\)` Test of Independence .center.f3[ `\(\chi^2 = \sum\limits_{i=1}^r \sum\limits_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)` ] + `\(i\)` : current level within Variable A + `\(r\)` : total levels within A + `\(j\)` : levels within Variable B + `\(c\)` : total levels within B --- # Performing a `\(\chi^2\)` Test of Independence .pull-left[ + The flower shop is trying to decide on their flower stock + They want to know whether the flower type that sells the best depends on the season + `\(H_0\)`: Flower orders will be independent of season + `\(H_1\)`: Flower orders will be dependent on season ] .pull-right.center[ <br> <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td 
style="text-align:left;"> Spring </td> <td style="text-align:right;"> 186 </td> <td style="text-align:right;"> 232 </td> <td style="text-align:right;"> 185 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 228 </td> <td style="text-align:right;"> 192 </td> </tr> <tr> <td style="text-align:left;"> Autumn </td> <td style="text-align:right;"> 168 </td> <td style="text-align:right;"> 219 </td> <td style="text-align:right;"> 164 </td> </tr> <tr> <td style="text-align:left;"> Winter </td> <td style="text-align:right;"> 183 </td> <td style="text-align:right;"> 246 </td> <td style="text-align:right;"> 173 </td> </tr> </tbody> </table>
]

---
# Performing a `\(\chi^2\)` Test of Independence

**Compute the test statistic**

.center.f3[
`\(\chi^2 = \sum\limits_{i=1}^r \sum\limits_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)`
]

+ `\(E_{ij}=\frac{R_i\ \cdot\ C_j}{n}\)`
    + `\(R_i\)` = total for row `\(i\)`, `\(C_j\)` = total for column `\(j\)`, `\(n\)` = overall total
+ In this example, under `\(H_0\)` we expect each flower type to take the same share of orders in every season, so each expected count depends only on its row and column totals

---
# Performing a `\(\chi^2\)` Test of Independence

**Compute the test statistic**

+ `\(E_{ij}=\frac{R_i\ \cdot\ C_j}{n}\)`

<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> <th style="text-align:right;"> Sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 186 </td> <td style="text-align:right;"> 232 </td> <td style="text-align:right;"> 185 </td> <td style="text-align:right;color: #BF1932 !important;"> 603 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 228 </td> <td style="text-align:right;"> 192 </td> <td style="text-align:right;color: #BF1932 !important;"> 592 </td> </tr> <tr> <td style="text-align:left;"> Autumn </td> <td style="text-align:right;"> 168 </td>
<td style="text-align:right;"> 219 </td> <td style="text-align:right;"> 164 </td> <td style="text-align:right;color: #BF1932 !important;"> 551 </td> </tr> <tr> <td style="text-align:left;"> Winter </td> <td style="text-align:right;"> 183 </td> <td style="text-align:right;"> 246 </td> <td style="text-align:right;"> 173 </td> <td style="text-align:right;color: #BF1932 !important;"> 602 </td> </tr> <tr> <td style="text-align:left;color: #BF1932 !important;"> Sum </td> <td style="text-align:right;color: #BF1932 !important;"> 709 </td> <td style="text-align:right;color: #BF1932 !important;"> 925 </td> <td style="text-align:right;color: #BF1932 !important;"> 714 </td> <td style="text-align:right;color: #BF1932 !important;color: #BF1932 !important;"> 2348 </td> </tr> </tbody> </table>

<br>

| Season | Lilies           | Roses            | Tulips           |
|--------|------------------|------------------|------------------|
| Spring | (603 x 709)/2348 | (603 x 925)/2348 | (603 x 714)/2348 |
| Summer | (592 x 709)/2348 | (592 x 925)/2348 | (592 x 714)/2348 |
| Autumn | (551 x 709)/2348 | (551 x 925)/2348 | (551 x 714)/2348 |
| Winter | (602 x 709)/2348 | (602 x 925)/2348 | (602 x 714)/2348 |

---
# Performing a `\(\chi^2\)` Test of Independence

**Compute the test statistic**

.center.f3[
`\(\chi^2 = \sum\limits_{i=1}^r \sum\limits_{j=1}^c \frac{(O_{ij} - \color{#BF1932}{E_{ij}})^2}{\color{#BF1932}{E_{ij}}}\)`
]

.pull-left.center[
**Observed Values**
<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 186 </td> <td style="text-align:right;"> 232 </td> <td style="text-align:right;"> 185 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 228 </td> <td style="text-align:right;"> 192 </td> </tr> </tbody> </table>
]
.pull-right.center[ **Expected Values** <table> <thead> <tr> <th style="text-align:left;"> Seasons </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 182.08 </td> <td style="text-align:right;"> 237.55 </td> <td style="text-align:right;"> 183.37 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> 178.76 </td> <td style="text-align:right;"> 233.22 </td> <td style="text-align:right;"> 180.02 </td> </tr> </tbody> </table> ] --- # Performing a `\(\chi^2\)` Test of Independence **Compute the test statistic** .center.f3[ `\(\chi^2 = \sum\limits_{i=1}^r \sum\limits_{j=1}^c \frac{\color{#BF1932}{(O_{ij} - E_{ij})}^2}{E_{ij}}\)` ] .pull-left.center[ **Observed Values** <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 186 </td> <td style="text-align:right;"> 232 </td> <td style="text-align:right;"> 185 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> 172 </td> <td style="text-align:right;"> 228 </td> <td style="text-align:right;"> 192 </td> </tr> </tbody> </table> ] .pull-right.center[ **Expected Values** <table> <thead> <tr> <th style="text-align:left;"> Seasons </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 182.08 </td> <td style="text-align:right;"> 237.55 </td> <td style="text-align:right;"> 183.37 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td 
style="text-align:right;"> 178.76 </td> <td style="text-align:right;"> 233.22 </td> <td style="text-align:right;"> 180.02 </td> </tr> </tbody> </table> ] .center[ **Difference** <table> <thead> <tr> <th style="text-align:left;"> Seasons </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 3.92 </td> <td style="text-align:right;"> -5.55 </td> <td style="text-align:right;"> 1.63 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> -6.76 </td> <td style="text-align:right;"> -5.22 </td> <td style="text-align:right;"> 11.98 </td> </tr> </tbody> </table> ] --- # Performing a `\(\chi^2\)` Test of Independence **Compute the test statistic** .center.f3[ `\(\chi^2 = \sum\limits_{i=1}^r \sum\limits_{j=1}^c \frac{(O_{ij} - E_{ij})\color{#BF1932}{^2}}{E_{ij}}\)` ] .pull-left.center[ **Difference** <table> <thead> <tr> <th style="text-align:left;"> Seasons </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 3.92 </td> <td style="text-align:right;"> -5.55 </td> <td style="text-align:right;"> 1.63 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> -6.76 </td> <td style="text-align:right;"> -5.22 </td> <td style="text-align:right;"> 11.98 </td> </tr> <tr> <td style="text-align:left;"> Autumn </td> <td style="text-align:right;"> 1.62 </td> <td style="text-align:right;"> 1.93 </td> <td style="text-align:right;"> -3.55 </td> </tr> <tr> <td style="text-align:left;"> Winter </td> <td style="text-align:right;"> 1.22 </td> <td style="text-align:right;"> 8.84 </td> <td style="text-align:right;"> -10.06 </td> </tr> </tbody> </table> ] 
.pull-right.center[
**Squared**

<table> <thead> <tr> <th style="text-align:left;"> Seasons </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 15.36 </td> <td style="text-align:right;"> 30.84 </td> <td style="text-align:right;"> 2.67 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> 45.69 </td> <td style="text-align:right;"> 27.25 </td> <td style="text-align:right;"> 143.51 </td> </tr> <tr> <td style="text-align:left;"> Autumn </td> <td style="text-align:right;"> 2.63 </td> <td style="text-align:right;"> 3.73 </td> <td style="text-align:right;"> 12.62 </td> </tr> <tr> <td style="text-align:left;"> Winter </td> <td style="text-align:right;"> 1.49 </td> <td style="text-align:right;"> 78.16 </td> <td style="text-align:right;"> 101.23 </td> </tr> </tbody> </table>
]

---

# Performing a `\(\chi^2\)` Test of Independence

**Compute the test statistic**

.center.f3[
`\(\chi^2 = \sum\limits_{i=1}^r \sum\limits_{j=1}^c \color{#BF1932}{\frac{(O_{ij} - E_{ij})^2}{E_{ij}}}\)`
]

.pull-left.center[
**Squared**

<table> <thead> <tr> <th style="text-align:left;"> Seasons </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 15.36 </td> <td style="text-align:right;"> 30.84 </td> <td style="text-align:right;"> 2.67 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> 45.69 </td> <td style="text-align:right;"> 27.25 </td> <td style="text-align:right;"> 143.51 </td> </tr> </tbody> </table>
]

.pull-right.center[
**Expected**

<table> <thead> <tr> <th style="text-align:left;"> Seasons </th> <th style="text-align:right;"> Lilies </th> <th
style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 182.08 </td> <td style="text-align:right;"> 237.55 </td> <td style="text-align:right;"> 183.37 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> 178.76 </td> <td style="text-align:right;"> 233.22 </td> <td style="text-align:right;"> 180.02 </td> </tr> </tbody> </table>
]

.center[
**Squared over Expected**

<table> <thead> <tr> <th style="text-align:left;"> Seasons </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> 0.01 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> 0.26 </td> <td style="text-align:right;"> 0.12 </td> <td style="text-align:right;"> 0.80 </td> </tr> </tbody> </table>
]

---

# Performing a `\(\chi^2\)` Test of Independence

**Compute the test statistic**

.center.f3[
`\(\chi^2 = \color{#BF1932}{\sum\limits_{i=1}^r \sum\limits_{j=1}^c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)`
]

.pull-left.center[
**Squared over Expected**

```r
kable(divTab, digits = 2)
```

<table> <thead> <tr> <th style="text-align:left;"> Seasons </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> 0.01 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> 0.26 </td> <td style="text-align:right;"> 0.12 </td> <td style="text-align:right;"> 0.80 </td> </tr> <tr>
<td style="text-align:left;"> Autumn </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:right;"> 0.08 </td> </tr> <tr> <td style="text-align:left;"> Winter </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;"> 0.33 </td> <td style="text-align:right;"> 0.55 </td> </tr> </tbody> </table>
]

.pull-right.center[
**`\(\chi^2\)`**

```r
sum(divTab[flowerCols])
```

```
## [1] 2.397417
```
]

---

# Performing a `\(\chi^2\)` Test of Independence

**Find the test statistic on the distribution**

.pull-left[
+ `\(df=(r-1)(c-1)\)`
    + `\(c\)` = number of levels within Variable 1
    + `\(r\)` = number of levels within Variable 2
]

---
count: false

# Performing a `\(\chi^2\)` Test of Independence

**Find the test statistic on the distribution**

.pull-left[
+ `\(df=(r-1)(c-1)\)`
    + `\(c\)` = number of levels within Variable 1
    + `\(r\)` = number of levels within Variable 2

```r
r <- length(levels(seasonDat$Season))   # number of seasons (table rows)
c <- length(levels(seasonDat$Flowers))  # number of flower types (table columns)
(r-1)*(c-1)
```

```
## [1] 6
```
]

.pull-right[
![](dapR1_lec19_Chisquare_files/figure-html/unnamed-chunk-42-1.svg)<!-- -->
]

---
count: false

# Performing a `\(\chi^2\)` Test of Independence

**Find the test statistic on the distribution**

.pull-left[
+ `\(df=(r-1)(c-1)\)`
    + `\(c\)` = number of levels within Variable 1
    + `\(r\)` = number of levels within Variable 2

```r
r <- length(levels(seasonDat$Season))   # number of seasons (table rows)
c <- length(levels(seasonDat$Flowers))  # number of flower types (table columns)
(r-1)*(c-1)
```

```
## [1] 6
```
]

.pull-right[
![](dapR1_lec19_Chisquare_files/figure-html/unnamed-chunk-44-1.svg)<!-- -->
]

---

# Performing a `\(\chi^2\)` Test of Independence

**Compute the probability of a score at least as extreme as the test statistic**

.pull-left[
+ What proportion of the plot falls in the shaded area?
]

.pull-right[
![](dapR1_lec19_Chisquare_files/figure-html/unnamed-chunk-45-1.svg)<!-- -->
]

---

# Performing a `\(\chi^2\)` Test of Independence

**Compute the probability of a score at least as extreme as the test statistic**

.pull-left[
+ What proportion of the plot falls in the shaded area?

```r
# Upper-tail probability: P(chi-square >= observed statistic) with 6 df
pchisq(sum(divTab[flowerCols]), df = 6, lower.tail = FALSE)
```

```
## [1] 0.8797671
```

+ If `\(H_0\)` is true, the probability of obtaining a `\(\chi^2\)` value at least as extreme as 2.4 is 0.88.
]

.pull-right[
![](dapR1_lec19_Chisquare_files/figure-html/unnamed-chunk-47-1.svg)<!-- -->
]

---
class: center, middle

.f1[Questions?]

---

# Exploring our Results Further

+ We can also compute standardized residuals (Pearson residuals) for the Test of Independence
+ In this case, you calculate them separately for each cell.
+ `\(\frac{O_{ij}-E_{ij}}{\sqrt{E_{ij}}}\)`

--

```r
(ObsVals['Spring', 'Lilies']-exVals[1, 'Lilies'])/sqrt(exVals[1, 'Lilies'])
```

```
##      Lilies
## 1 0.2904051
```

<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Lilies </th> <th style="text-align:right;"> Roses </th> <th style="text-align:right;"> Tulips </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 0.29 </td> <td style="text-align:right;"> -0.36 </td> <td style="text-align:right;"> 0.12 </td> </tr> <tr> <td style="text-align:left;"> Summer </td> <td style="text-align:right;"> -0.51 </td> <td style="text-align:right;"> -0.34 </td> <td style="text-align:right;"> 0.89 </td> </tr> <tr> <td style="text-align:left;"> Autumn </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> -0.27 </td> </tr> <tr> <td style="text-align:left;"> Winter </td> <td style="text-align:right;"> 0.09 </td> <td style="text-align:right;"> 0.57 </td> <td style="text-align:right;"> -0.74 </td> </tr> </tbody> </table>

---

# Effect Sizes

+ There are 3 possibilities:
    + Phi coefficient
    + Cramer's V
    + Odds
Ratios
+ You will learn more about odds ratios in DapR2, so we will focus on Phi and Cramer's V

---

# Phi coefficient

.center.f3[
`\(\phi=\sqrt{\frac{\chi^2}{n}}\)`
]

+ `\(n\)`: total number of observations
+ Should only be used when you have a 2x2 contingency table (2 categorical variables with 2 levels each)
+ Interpretation:
    + 0.1: small effect
    + 0.3: medium effect
    + 0.5: large effect

---

# Cramer's V

.center.f3[
`\(V=\sqrt{\frac{\chi^2}{n\cdot\ df^*}}\)`
]

+ where `\(df^* = \min(r-1, c-1)\)`
+ Can be used when you aren't working with a 2x2 contingency table
+ Interpretation:
    + Cramer's V is interpreted based on `\(df^*\)`:

| `\(df^*\)` | small | medium | large |
|--------|-------|--------|-------|
| 1 | .10 | .30 | .50 |
| 2 | .07 | .21 | .35 |
| 3 | .06 | .17 | .29 |
| 4 | .05 | .15 | .25 |
| 5 | .04 | .13 | .22 |

---
class: center, middle

.f1[Questions?]

---

# Summary of Today

+ We learned about the `\(\chi^2\)` distribution and how it compares to the `\(t\)` distribution.
+ We discussed the assumptions of `\(\chi^2\)` tests.
+ We differentiated between the `\(\chi^2\)` Goodness of Fit test and the `\(\chi^2\)` Test of Independence.
+ We walked through how to calculate both types of `\(\chi^2\)` values.
+ We talked about standardized residuals and how they relate to your `\(\chi^2\)` results.
+ We covered the measures of effect size you may use with `\(\chi^2\)` tests.

---
class: center, middle

.f1[Thanks for listening!]
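
---

# Appendix: Checking the Walkthrough in R

The p-value and Cramer's V formulas from the walkthrough can be sketched in a few lines of base R. The `\(\chi^2\)` value and the 4 x 3 table dimensions are taken from the slides. The total sample size is not reported in the deck, so `n_hypothetical` is a placeholder chosen only for illustration, and the helper name `cramers_v()` is likewise an assumption rather than a function from any package.

```r
# Values reported in the walkthrough
chi_sq <- 2.397417   # sum of (O - E)^2 / E over all cells
n_rows <- 4          # seasons
n_cols <- 3          # flower types

# Degrees of freedom for a test of independence
df <- (n_rows - 1) * (n_cols - 1)

# Upper-tail probability of a value at least as extreme as chi_sq
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Hypothetical helper: Cramer's V = sqrt(chi_sq / (n * df_star)),
# where df_star = min(r - 1, c - 1)
cramers_v <- function(chi_sq, n, n_rows, n_cols) {
  df_star <- min(n_rows - 1, n_cols - 1)
  sqrt(chi_sq / (n * df_star))
}

# ASSUMPTION: the deck does not report the total n; replace this placeholder
# with sum() of your own observed table before interpreting the result
n_hypothetical <- 2400
v <- cramers_v(chi_sq, n_hypothetical, n_rows, n_cols)

round(p_value, 2)
v
```

With `\(df^* = 2\)` for this table, the resulting value of V would be read against the second row of the interpretation table for Cramer's V.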