class: center, middle, inverse, title-slide .title[ #
Week 3: Describing Continuous Data
] .subtitle[ ## Data Analysis for Psychology in R 1
] .author[ ### Patrick Sturt ] .institute[ ### Department of Psychology
The University of Edinburgh ]

---
# Week's Learning Objectives

1. Understand the appropriate visualization for the distribution of numeric data.
2. Understand methods to calculate the spread for the distribution of numeric data.
3. Understand methods to calculate central tendency for the distribution of numeric data.

---
# Topics for today

+ Histograms
+ Mean
+ Variance and standard deviation

???

+ Points to mention
  + continuing on how we describe data
  + focus on continuous numeric data
  + will consider visualizations, central tendency and dispersion

---
# Recap: Continuous data

+ Continuous (numeric) data is typically classed as interval or ratio
+ That means:
  + The numeric values are meaningful as numbers.
  + We are able to apply mathematical operations to the values.

---
# Visualization

.pull-left[

+ Last lecture we discussed bar plots for frequency distributions of categorical variables.
+ For continuous data, we visualize the distribution using a histogram.

**Example histogram on the height of a class**

]

.pull-right[

![](data:image/png;base64,#dapR1_lec3_DescribingContData_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

]

---
# Histogram

.pull-left[

+ Properties of a histogram:
  + X-axis: possible values of some variable.
    + Commonly presented in "bins"
    + A bin represents a range of scores (the plot can look very different depending on the bins)
    + Scale = dependent on the form of measurement, here centimetres
  + Y-axis: frequency of a given value or values within "bins"
+ Here our data is heights of the class
  + X-axis values are the possible heights in bins of 4cm.
  + Y-axis values are the counts of the number of students in each bin.

]

.pull-right[

![](data:image/png;base64,#dapR1_lec3_DescribingContData_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

]

---
# Pause for thought?

**Why have we used bins for ranges of values and not individual values?**

???

+ In recording, literally invite them to pause and write down thoughts
+ then do verbal explanation.
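---
# Bins: a small example

One way to see what a bin does is to count the same values using two different bin widths. This is only a sketch: the vector `heights` is made up for illustration (it is not the class data plotted on these slides), and it uses base R's `cut()` and `table()` rather than a plot.

```r
# Hypothetical heights in cm, made up for illustration only
heights <- c(158, 161, 163, 165, 166, 168, 170, 171, 174, 179)

# Count heights in bins of 4cm: [156,160), [160,164), ..., [176,180)
bins_4cm <- table(cut(heights, breaks = seq(156, 180, by = 4), right = FALSE))

# Count the same heights in bins of 12cm: [156,168), [168,180)
bins_12cm <- table(cut(heights, breaks = seq(156, 180, by = 12), right = FALSE))

bins_4cm
bins_12cm
```

With the wider bins, every height falls into one of just two groups, so most of the detail visible with 4cm bins is lost.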
--- # Impact of bins .pull-left[ ![](data:image/png;base64,#dapR1_lec3_DescribingContData_files/figure-html/unnamed-chunk-4-1.png)<!-- --> ] .pull-right[ ![](data:image/png;base64,#dapR1_lec3_DescribingContData_files/figure-html/unnamed-chunk-5-1.png)<!-- --> ] --- # Stats summer school example: test score .pull-left[ <table> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> Degree </th> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> Score1 </th> <th style="text-align:right;"> Score2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> ID101 </td> <td style="text-align:left;"> Psych </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 71 </td> <td style="text-align:right;"> 74 </td> </tr> <tr> <td style="text-align:left;"> ID102 </td> <td style="text-align:left;"> Ling </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 72 </td> </tr> <tr> <td style="text-align:left;"> ID103 </td> <td style="text-align:left;"> Ling </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 72 </td> </tr> <tr> <td style="text-align:left;"> ID104 </td> <td style="text-align:left;"> Phil </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 74 </td> </tr> <tr> <td style="text-align:left;"> ID105 </td> <td style="text-align:left;"> Ling </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> 69 </td> </tr> <tr> <td style="text-align:left;"> ID106 </td> <td style="text-align:left;"> Ling </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 72 </td> </tr> <tr> <td style="text-align:left;"> ID107 </td> <td style="text-align:left;"> Phil </td> <td style="text-align:right;"> 3 </td> <td 
style="text-align:right;"> 66 </td> <td style="text-align:right;"> 75 </td> </tr>
<tr> <td style="text-align:left;"> ID108 </td> <td style="text-align:left;"> Psych </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 71 </td> </tr>
<tr> <td style="text-align:left;"> ID109 </td> <td style="text-align:left;"> Psych </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 73 </td> </tr>
<tr> <td style="text-align:left;"> ID110 </td> <td style="text-align:left;"> Ling </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 72 </td> </tr>
</tbody> </table>

]

.pull-right[

![](data:image/png;base64,#dapR1_lec3_DescribingContData_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

]

---
# Stats summer school example: test score

.pull-left[

```r
ex1 %>%
  ggplot(., aes(x=Score1)) +
* geom_histogram(bins = 15,
*                color = "white",
*                fill = "steelblue4")+
  xlab("Pre- Statistics Test Score") +
  ylab("Count \n")
```

+ New bits of code:
  + `geom_histogram` is used to make histograms
  + `bins` is the number of bins (columns) we want
  + `color` provides the colour for the outline of the column
  + `fill` provides the main colour

]

.pull-right[

![](data:image/png;base64,#dapR1_lec3_DescribingContData_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

]

---
# Central Tendency: Mean

.pull-left[

+ Last lecture we looked at the mode and median.
+ Both can be used for continuous data, but the optimal measure is the **arithmetic mean.**
+ **Mean:** the sum of all values, divided by the total number of observations.
+ I.e. this is what most people have in mind when they talk about "the average".
]

.pull-right[

$$
\bar{x} = \frac{\sum_{i=1}^{N}{x_i}}{N}
$$

+ `\(\bar{x}\)` = estimate of mean of variable `\(x\)`
+ `\(x_i\)` = individual values of `\(x\)`
+ `\(N\)` = sample size

]

---
# Hand calculation

$$
\bar{x} = \frac{\sum_{i=1}^{N}{x_i}}{N}
$$

+ Our data:

`$$x=[10,40,30,25,15,6]$$`

+ Worked calculation:

`$$\bar{x} = \frac{10+40+30+25+15+6}{6} = \frac{126}{6} = 21$$`

---
# Arithmetic Mean: Test score

.pull-left[

**Following the hand calculation in `R`**

```r
sum(ex1$Score1)/length(ex1$Score1)
```

```
## [1] 65.9
```

**Short way in `R`**

```r
mean(ex1$Score1)
```

```
## [1] 65.9
```

]

.pull-right[

**Working with `tidyverse`**

```r
ex1 %>%
  summarise(
    mean = mean(Score1)
  )
```

```
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1  65.9
```

+ We will work with `tidyverse` and `summarise` as we can build up summary tables for our data sets.

]

???

+ In this particular case the tidyverse is overkill, but we're using it because it allows us to build up summary tables very cleanly, which will come in handy as we get more statistics we'd like to summarise.

---
# Variation around the mean

![](data:image/png;base64,#dapR1_lec3_DescribingContData_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---
# Variation around the mean

![](data:image/png;base64,#dapR1_lec3_DescribingContData_files/figure-html/unnamed-chunk-14-1.png)<!-- -->

---
# Sum of deviations

+ We could just add up the amount by which each observation differs from the mean.
+ This is called the **sum of deviations.** $$ SumDev = \sum_{i=1}^{N}{(x_i - \bar{x})} $$ + `\(x_i\)` = individual observations + `\(\bar{x}\)` = mean of `\(x\)` --- # Calculation: First 10 rows <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:right;"> Score1 </th> <th style="text-align:right;"> Score </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Deviance </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> ID101 </td> <td style="text-align:right;"> 71 </td> <td style="text-align:right;"> 71 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> 5.2 </td> </tr> <tr> <td style="text-align:left;"> ID102 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -0.8 </td> </tr> <tr> <td style="text-align:left;"> ID103 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -1.8 </td> </tr> <tr> <td style="text-align:left;"> ID104 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> 3.2 </td> </tr> <tr> <td style="text-align:left;"> ID105 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -3.8 </td> </tr> <tr> <td style="text-align:left;"> ID106 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> 2.2 </td> </tr> <tr> <td style="text-align:left;"> ID107 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 66 </td> <td 
style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> 0.2 </td> </tr>
<tr> <td style="text-align:left;"> ID108 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -1.8 </td> </tr>
<tr> <td style="text-align:left;"> ID109 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -0.8 </td> </tr>
<tr> <td style="text-align:left;"> ID110 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -1.8 </td> </tr>
</tbody> </table>

---
# Problem: Sum of deviations

```r
ex1 %>%
  summarise(
    Variable = "Statistics Test Score",
*   "Sum Deviation" = round(sum(Score1 - mean(Score1)),2)
  )
```

<table> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:right;"> Sum Deviation </th> </tr> </thead>
<tbody> <tr> <td style="text-align:left;"> Statistics Test Score </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table>

+ Uh oh! The positive and negative values cancel.
+ That means the sum of deviations from the mean will always be 0.

---
# Variance

+ In order to remove the effect of sign, we can square each of the deviations.
+ This is called the ***variance***.

$$
\sigma^2 = \frac{\sum_{i=1}^{N}{(x_i - \bar{x})}^2}{N}
$$

+ Variance is the average squared deviation from the mean.
+ `\(\sigma^2\)` = variance (Greek letter lower case sigma) --- # Calculation <table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:right;"> Score1 </th> <th style="text-align:right;"> Score </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Deviance </th> <th style="text-align:right;"> Deviance_sq </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> ID101 </td> <td style="text-align:right;"> 71 </td> <td style="text-align:right;"> 71 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> 5.2 </td> <td style="text-align:right;"> 27.04 </td> </tr> <tr> <td style="text-align:left;"> ID102 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -0.8 </td> <td style="text-align:right;"> 0.64 </td> </tr> <tr> <td style="text-align:left;"> ID103 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -1.8 </td> <td style="text-align:right;"> 3.24 </td> </tr> <tr> <td style="text-align:left;"> ID104 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> 3.2 </td> <td style="text-align:right;"> 10.24 </td> </tr> <tr> <td style="text-align:left;"> ID105 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -3.8 </td> <td style="text-align:right;"> 14.44 </td> </tr> <tr> <td style="text-align:left;"> ID106 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> 2.2 </td> <td 
style="text-align:right;"> 4.84 </td> </tr>
<tr> <td style="text-align:left;"> ID107 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> 0.2 </td> <td style="text-align:right;"> 0.04 </td> </tr>
<tr> <td style="text-align:left;"> ID108 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -1.8 </td> <td style="text-align:right;"> 3.24 </td> </tr>
<tr> <td style="text-align:left;"> ID109 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -0.8 </td> <td style="text-align:right;"> 0.64 </td> </tr>
<tr> <td style="text-align:left;"> ID110 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 65.8 </td> <td style="text-align:right;"> -1.8 </td> <td style="text-align:right;"> 3.24 </td> </tr>
</tbody> </table>

---
# Variance

```r
ex1 %>%
  summarise(
    Variable = "Statistics Test Score",
    "Sum Deviation" = round(sum(Score1 - mean(Score1)),2),
*   Variance = round((sum((Score1 - mean(Score1))^2))/length(Score1),2)
  )
```

<table> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:right;"> Sum Deviation </th> <th style="text-align:right;"> Variance </th> </tr> </thead>
<tbody> <tr> <td style="text-align:left;"> Statistics Test Score </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 7.65 </td> </tr> </tbody> </table>

+ Problem:
  + Our units are no longer those of the original variable.
  + Variance is the mean **squared** deviation from the mean, so it is expressed in squared units.

---
# Standard deviation

+ What about a measure of variation in the same units as the mean/variable?
+ The ***standard deviation.***
+ The standard deviation is the square root of the variance.
+ Taking the square root undoes (or fixes) the squaring of deviations that we did to get variance.

$$
\sigma = \sqrt{\frac{\sum_{i=1}^{N}{(x_i - \bar{x})}^2}{N}}
$$

---
# Standard deviation

```r
ex1 %>%
  summarise(
    Variable = "Statistics Test Score",
    Variance = round((sum((Score1 - mean(Score1))^2))/length(Score1),2),
*   SD = round(sqrt((sum((Score1 - mean(Score1))^2))/length(Score1)),2)
  )
```

```
## # A tibble: 1 × 3
##   Variable              Variance    SD
##   <chr>                    <dbl> <dbl>
## 1 Statistics Test Score     7.65  2.77
```

---
# Standard deviation

+ Easier `R` calculation

```r
ex1 %>%
  summarise(
    Variable = "Statistics Test Score",
    "Sum Deviation" = round(sum(Score1 - mean(Score1)),2),
*   Variance = round(var(Score1),2),
*   SD = round(sd(Score1),2)
  )
```

```
## # A tibble: 1 × 4
##   Variable              `Sum Deviation` Variance    SD
##   <chr>                           <dbl>    <dbl> <dbl>
## 1 Statistics Test Score               0      7.7  2.78
```

---
# An important difference with `var` and `sd`

.pull-left[

**Population Variance**

$$
\sigma^2 = \frac{\sum_{i=1}^{N}{(x_i - \bar{x})}^2}{N}
$$

**Population SD**

$$
\sigma = \sqrt{\frac{\sum_{i=1}^{N}{(x_i - \bar{x})}^2}{N}}
$$

]

.pull-right[

**Sample Variance**

$$
s^2 = \frac{\sum_{i=1}^{N}{(x_i - \bar{x})}^2}{N-1}
$$

**Sample SD**

$$
s = \sqrt{\frac{\sum_{i=1}^{N}{(x_i - \bar{x})}^2}{N-1}}
$$

]

--

+ NOTE: `var()` and `sd()` in R compute the sample versions (dividing by `\(N-1\)`), which is why they give slightly larger values than the hand calculation that divides by `\(N\)`.

???

Make the point that we will come back to this in more detail, but for now it is just important to note that R defaults to the sample formulas.
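---
# Checking the difference in `R`

The two sets of formulas can be compared directly. This is a minimal sketch using the ten `Score1` values listed in the earlier table, typed into a stand-alone vector (the name `scores` is an assumption made here). Because it uses only those ten values, the numbers will not match the full `ex1` output on the other slides.

```r
# Ten Score1 values copied from the earlier table (not the full ex1 data)
scores <- c(71, 65, 64, 69, 62, 68, 66, 64, 65, 64)
n <- length(scores)

# Population variance: sum of squared deviations divided by N
pop_var <- sum((scores - mean(scores))^2) / n

# Sample variance: same sum divided by N - 1, which is what var() computes
samp_var <- sum((scores - mean(scores))^2) / (n - 1)

all.equal(samp_var, var(scores))               # TRUE
all.equal(sqrt(samp_var), sd(scores))          # TRUE
all.equal(pop_var, var(scores) * (n - 1) / n)  # TRUE
```

The last line shows one way to get the population variance from `var()`: multiply by `(n - 1) / n`.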
---
# Summary of last 2 lectures

<table class="table" style="font-size: 22px; margin-left: auto; margin-right: auto;">
<thead> <tr> <th style="text-align:left;"> Measure </th> <th style="text-align:left;"> Strength </th> <th style="text-align:left;"> Weakness </th> </tr> </thead>
<tbody>
<tr> <td style="text-align:left;"> Mode </td> <td style="text-align:left;"> Actually occurs in our data </td> <td style="text-align:left;"> Not algebraically calculable </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> Unaffected by extreme values </td> <td style="text-align:left;"> Probably does not exist for true continuous data (think reaction time) </td> </tr>
<tr> <td style="text-align:left;"> Median </td> <td style="text-align:left;"> No assumptions about interval value of data </td> <td style="text-align:left;"> Not relatable to measures of dispersion (covered this week) </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> Unaffected by extreme values </td> <td style="text-align:left;"> </td> </tr>
<tr> <td style="text-align:left;"> Mean </td> <td style="text-align:left;"> Algebraically tractable </td> <td style="text-align:left;"> Sensitive to extreme values </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> Related to measures of dispersion (covered this week) </td> <td style="text-align:left;"> Assumes data are interval or better </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> Possibly no case in your data takes the value of the mean </td> </tr>
</tbody> </table>

---
# Which measure should we use?
<table class="table" style="font-size: 26px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Variable Type </th> <th style="text-align:left;"> Central Tendency </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Categorical (Nominal) </td> <td style="text-align:left;"> Mode </td> </tr> <tr> <td style="text-align:left;"> Categorical (Ordered) </td> <td style="text-align:left;"> Mode/Median </td> </tr> <tr> <td style="text-align:left;"> Continuous </td> <td style="text-align:left;"> Mean (any in fact) </td> </tr> <tr> <td style="text-align:left;"> Count </td> <td style="text-align:left;"> Mode (mean) </td> </tr> </tbody> </table> + Depends on the level of measurement. --- # Which measure should we use? <table class="table" style="font-size: 22px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Variable Type </th> <th style="text-align:left;"> Central Tendency </th> <th style="text-align:left;"> Dispersion </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Categorical (Nominal) </td> <td style="text-align:left;"> Mode </td> <td style="text-align:left;"> Frequency Table </td> </tr> <tr> <td style="text-align:left;"> Categorical (Ordered) </td> <td style="text-align:left;"> Mode/Median </td> <td style="text-align:left;"> Range </td> </tr> <tr> <td style="text-align:left;"> Continuous </td> <td style="text-align:left;"> Mean (any in fact) </td> <td style="text-align:left;"> Variance & Standard Deviation </td> </tr> <tr> <td style="text-align:left;"> Count </td> <td style="text-align:left;"> Mode (mean) </td> <td style="text-align:left;"> Range (Variance & SD) </td> </tr> </tbody> </table> + Depends on the level of measurement. --- # A few extra bits? 
+ You may come across the mathematical language of *moments.*
+ Moments describe the shape of a set of points
  + Mean
  + Variance
  + Skew
  + Kurtosis

---
# Skew

![](data:image/png;base64,#dapR1_lec3_DescribingContData_files/figure-html/unnamed-chunk-25-1.png)<!-- -->

+ Skew is a measure of the asymmetry of a distribution.

---
# Kurtosis

![](data:image/png;base64,#dapR1_lec3_DescribingContData_files/figure-html/unnamed-chunk-26-1.png)<!-- -->

+ Kurtosis is a measure of the flatness of the peak and the fatness of the tails of the distribution.

---
# Do they matter?

.pull-left[

![](data:image/png;base64,#dapR1_lec3_DescribingContData_files/figure-html/unnamed-chunk-27-1.png)<!-- -->

]

.pull-right[

+ It can make a difference in how we describe data.
+ Both skew and kurtosis impact the **normality** of the distribution of the data.

]

---
# Summary of today

+ Continuous variables are...
  + Visualized with a histogram
  + Summarised with the mean and standard deviation
+ We can describe the shape of the distribution with skew and kurtosis

---
# Next tasks

+ Next week, we will look at describing relationships.
+ This week:
  + Complete your lab
  + Come to office hours
  + Weekly quiz - first assessed quiz on content of weeks 1 and 2
    + Opens Monday 09:00
    + Closes Sunday 17:00
    + Feedback available next week on Monday at 09:00
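---
# Appendix: skew and kurtosis by hand

For anyone curious how skew and kurtosis relate to the idea of moments, here is a rough sketch using one common moment-based definition. The vector `x` is made up for illustration. Other definitions exist (for example sample-adjusted versions, or "excess" kurtosis, which subtracts 3), so packages may report slightly different numbers.

```r
# Hypothetical right-skewed values, made up for illustration only
x <- c(1, 2, 2, 3, 3, 3, 4, 10)

m   <- mean(x)
s_p <- sqrt(mean((x - m)^2))   # population SD (divides by N)

# Moment-based skew: average cubed deviation, scaled by SD cubed
skew <- mean((x - m)^3) / s_p^3

# Moment-based kurtosis: average 4th-power deviation, scaled by SD to the 4th
kurt <- mean((x - m)^4) / s_p^4

skew
kurt
```

A positive value of `skew` here reflects the long right tail created by the single large value.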