class: center, middle, inverse, title-slide

# Week 2: Measurement and Distributions
## Univariate Statistics and Methodology using R

### Martin Corley

### Department of Psychology
The University of Edinburgh

---
class: inverse, center, middle

# Part 1

.center[
![](lecture_2_files/img/stick.png)
]

---
# Notch on a stick

.flex.items-center[
.w-50.pa2[
![](lecture_2_files/img/stick_num.png)
]
.w-50.pa2[
![](lecture_2_files/img/stick_numnum.png)
]]

---
count: false
class: middle

.br3.center.pa2.pt2.bg-gray.white.f3[
It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.
]

---
# Problem is...

.pull-left[
- we don't have any way of measuring accurately enough
- our measurements are likely to be _close to_ the truth
- they might vary, if we measure more than once
]
.pull-right[
![](lecture_2_files/img/stick_caliper.png)
]

---
# Measurement

.pull-left[
![:scale 70%](lecture_2_files/img/playmo_proft.svg)
]
.pull-right[
- we might _expect_ values close to the "true" measurement to be more frequent
- something like this:

.center[
![](lecture_2_files/figure-html/sketchnorm-1.svg)<!-- -->
]]

???
- so let's do a thought experiment, and imagine what things would look like if lots and lots of people tried to measure the "true" distance of the notch from the end of the stick
- most of them would be quite competent, and we would expect the majority of the measurements to be close to the "true" value
- every now and again, someone would overshoot or undershoot by rather more
- _theoretically_ they might be completely off-beam, although the chances of being way off get vanishingly small quite quickly

---
# Something quite familiar

.center[
![](lecture_2_files/figure-html/normnorm-1.svg)<!-- -->
]

???
- if we think a bit more about our thought experiment, we've actually described something quite familiar: a bell curve
- but what is a "bell curve"?
- to answer that question, let's start back where we were last week, with dice

---
# Dice again

.center[
![](lecture_2_files/figure-html/dice-1.svg)<!-- -->
]

???
- the heights of the bars represent the number of times we obtain each value
- but why are the bars not touching each other?

---
# Dice throws aren't really numbers

.pull-left[
- **A** = ![:scale 20%](lecture_2_files/img/A1.svg)
- **B** = ![:scale 20%](lecture_2_files/img/B1.svg) or ![:scale 20%](lecture_2_files/img/B2.svg)
- **C** = ![:scale 20%](lecture_2_files/img/C1.svg) or ![:scale 20%](lecture_2_files/img/C2.svg) or ![:scale 20%](lecture_2_files/img/C3.svg)
]
.pull-right[
.center[
![](lecture_2_files/figure-html/dice2-1.svg)<!-- -->
]]

.pt3[
- a bar plot ("bar chart") always has gaps between bars
- represents frequencies of _discrete categories_ (`factors`)
]

???
- I could just label the possible outcomes of throwing two dice arbitrarily
- if you think about it, there are only 11 possible values that the sum of two dice can take
- and if the dice didn't have actual numbers on their faces, you could still enumerate the outcomes
- so the outcomes are _discrete_ (you can never throw a value between 3 and 4, or between "B" and "C")
- and the bars on a bar plot have gaps between them to show this

---
# Back to the Stick

- let's look at our stick measurement again

.flex.items-center[.w-30.pa2[
<img src="lecture_2_files/img/stick_numnum.png" width="587" />
]
.w-10[ ]
.w-30.pa2[
![](lecture_2_files/figure-html/sketchnorm-1.svg)<!-- -->
]
.w-30.pa2[
- the "true measurement" on the graph _must_ be approximate
]
]

???
- the line we used to indicate the "true measure" isn't fully accurate
- we would have to draw an infinitely thin line with infinite precision

---
# Zooming In...
.pull-left[
![](lecture_2_files/figure-html/zoom-3.svg)<!-- -->
]
.pull-right[
- we are using the _width_ of the line to show measurement precision
- here, we know the value is between 0.0761 and 0.0762
  + but we can't be any more precise
]

---
# Histograms

.pull-left[
- this principle allows us to draw a _histogram_ of all the measurements taken
- the bars are touching because this represents _continuous_ data
]
.pull-right[
![](lecture_2_files/figure-html/hist1-3.svg)<!-- -->
]

---
count: false
# Histograms

.pull-left[
- this principle allows us to draw a _histogram_ of all the measurements taken
- the bars are touching because this represents _continuous_ data
- we know that there were 12 measurements around 0.076
  + strictly, between 0.0755 and 0.0765
]
.pull-right[
![](lecture_2_files/figure-html/redhist-3.svg)<!-- -->
]

---
# Histograms (2)

.pull-left[
- note that the _bin width_ of the histogram matters
- these show the same data as on the previous slide
]
.pull-right[
![](lecture_2_files/figure-html/hist2-3.svg)<!-- -->
]

---
count: false
# Histograms (2)

.pull-left[
- note that the _bin width_ of the histogram matters
- these show the same data as on the previous slide
]
.pull-right[
![](lecture_2_files/figure-html/hist3-3.svg)<!-- -->
]

---
# Histograms in R

.pull-left[
```r
head(notches)  # some measurements
```

```
## [1] 0.07432 0.07612 0.07955 0.07861 0.07653 0.07422
```

```r
hist(notches)
```
]
.pull-right[
![](lecture_2_files/figure-html/histstix-1.svg)<!-- -->
]

.flex.items-center[
.w-5.pa1[
![:scale 70%](lecture_1_files/img/danger.svg)
]
.w-95.pa1[
- you can make prettier graphs using `ggplot()`
- `hist()` and friends are useful for exploring data
]]

---
class: inverse, center, middle, animated, swing
# End of Part 1

---
class: inverse, center, middle
# Part 2
## The Normal Distribution

---
# Histograms

.flex.items-top[
.w-50.br3.pa2.bg-light-green[
### The Good
- a way to examine the _distribution_ of data
- easy to interpret ( `\(y\)` axis = counts )
- sometimes helpful in spotting weird data
]
.w-50.br3.pa2.bg-light-red[
### The Bad
- changing the bin width can completely change the graph
  + can lack precision if bins are too wide
  + can appear sparse if bins are too narrow
]]

---
# Density Plots

.center[
![](lecture_2_files/img/pusher.svg)
]

- we can "squeeze" the bars until they're "infinitely thin"

---
# Density Plots

.center[
![](lecture_2_files/figure-html/dense1-4.svg)<!-- -->
]

---
count: false
# Density Plots

.center[
![](lecture_2_files/figure-html/dense2-4.svg)<!-- -->
]

- values are _interpolated_
- depends on a kernel (smoothing) function

---
# Density Plots

.center[
![](lecture_2_files/figure-html/dense3-4.svg)<!-- -->
]

- note that the `\(y\)` axis is no longer a count
- multiplied by a _range_ on the `\(x\)` axis, it gives the **proportion** of cases (0.1714)
- the _total area_ under the curve = "all possibilities" = **1**

???
- note that this isn't the whole curve, because the curve is **asymptotic**
  + there's a tiny possibility that someone will say the distance is -1000 or 873

---
# A Famous Density Plot

.pull-left[
- when we started thinking about measurement, we thought things might look a bit like this
- the so-called **normal curve**
  + a hypothetical density plot
]
.pull-right[
![](lecture_2_files/figure-html/normnorm-1.svg)<!-- -->
]

???
- in part 3, we'll look at where the normal curve comes from
- for now, let's look at some of its features

---
# Normal Curves

.center[
![](lecture_2_files/figure-html/norms-5.svg)<!-- -->
]

- normal curves are centred about the **mean** (or "true measurement")
- the area under the (asymptotic) curve is always **1**

???
- the `\(y\)` value here is just what is needed to ensure that the area under the curve is 1

---
# Standard Deviation

.pull-left[
- normal curves can be defined in terms of _two parameters_
- one is the centre or **mean** of the distribution ( `\(\bar{x}\)`, or sometimes `\(\mu\)` )
- the other is the **standard deviation** ( `\(\textrm{sd}\)`, or sometimes `\(\sigma\)` )

`$$\textrm{sd}=\sqrt{\frac{\sum{(x-\bar{x})^2}}{n-1}}$$`
]
.pull-right[
![](lecture_2_files/figure-html/annotated-5.svg)<!-- -->
]

.pt3.pa2[
- the standard deviation is, roughly, the "average distance of observations from the mean"
]

---
# The Standard Normal Curve

.pull-left[
- we can **standardize** any value on any normal curve by
- subtracting the mean
  + the effective mean is now **zero**
- dividing by the standard deviation
  + the effective standard deviation is now **one**
]
.pull-right[
![](lecture_2_files/figure-html/snorm-5.svg)<!-- -->

$$ z_i = \frac{x_i - \bar{x}}{\sigma} $$
]

---
# The Standard Normal Curve

.pull-left[
- the normal curve is a _density plot_
- the area between 1 standard deviation below the mean and 1 standard deviation above the mean is _always_ 0.6827
]
.pull-right[
![](lecture_2_files/figure-html/snorm95-5.svg)<!-- -->
]

---
count: false
# The Standard Normal Curve

.pull-left[
- the normal curve is a _density plot_
- the area between 1 standard deviation below the mean and 1 standard deviation above the mean is _always_ 0.6827
- we can ask the question the other way around: _an area of .95_ lies between -1.96 and 1.96 standard deviations from the mean
  + "95% of the predicted observations" (the 95% confidence interval)
]
.pull-right[
![](lecture_2_files/figure-html/snorm68-5.svg)<!-- -->
]

---
class: inverse, center, middle, animated, swing
# End of Part 2

---
class: inverse, center, middle
# Part 3
## Sampling from a Population

---
background-image: url("lecture_2_files/img/playmo_pop.jpg")

???
We want to say something about the population, rather than about one stick.
For example -- what's their average height? (We know this, it's on the packet: 7.5cm. But in the real world, people differ...)

Let's go into RStudio and do a little simulation.

---
class: inverse, center, middle, animated, swing
# End of Part 3

---
class: inverse, center, middle
# Part 4
## Towards Statistical Testing

---
# Central Limit Theorem

- what we have just seen is a demonstration of the **Central Limit Theorem**
- lay version: _sample means will be normally distributed about the true mean_

.br3.center.pa2.pt2.bg-gray.white.f3[
the standard deviation ("width") of the distribution of sample means is referred to as the **standard error** of the distribution
]

---
# Central Limit Theorem (2)

- if you look up the CLT on Wikipedia you'll see it's defined in terms of _adding two numbers_
  + the sample mean is a sum of _many_ numbers, divided by `\(n\)`
  + adding many numbers is like adding two numbers:
.pt0[
`\(1 + 3 + 2 + 5 = (1 + 3 + 2) + 5 = 6 + 5\)`
]
  + dividing by something doesn't make any difference

---
# `\(n-1\)`

- we've just shown how adding many numbers is equivalent to adding two numbers
- so _if we know the sum_ of a bunch of numbers, `\(n-1\)` of those numbers can be anything

.center[
| sum of `\(n-1\)` numbers | `\(n\)`th number | sum |
|-------------------------:|-----------------:|----:|
| 90 | 10 | 100 |
| 102 | -2 | 100 |
| 67 | 33 | 100 |
]

- so if we know a summary statistic (e.g., mean, sd) we know about the data with `\(n-1\)` **degrees of freedom**

---
# Statistical Estimates

- so far, we've talked about sampling repeatedly from a population
- this might not be possible(!)
- if we only have one sample, we can make _estimates_ of the mean and standard error
  + the estimated _mean_ is the sample mean (we have no other info)
  + the estimated _standard error_ of the mean is defined in terms of the sample standard deviation

$$ \textrm{se} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{\frac{\sum{(x-\bar{x})^2}}{n-1}}}{\sqrt{n}} $$

---
# Putting it Together

- the _normal curve_ is a density plot with known properties
  + it can be defined in terms of two parameters: the mean and the standard deviation
- if we repeatedly sample from a population and measure the mean of each sample, we'll get a normal distribution
  + its mean will be (close to) the population mean
- if we sample once from a population which is approximately normal
  + our estimated mean and sd for the population are the sample mean and sd
  + the _standard error_, or standard deviation of the sample means, can be estimated as `\(\sigma/\sqrt{n}\)`

---
# Can We Use This For Real?

- we have some survey data from the USMR class, including _height_ in cm
- perhaps we're interested in the "average height of a young statistician" (!)
  + "young statisticians" are a **population**
  + the USMR class of 2020 is a **sample**

.pt2[ ]

.br3.center.pa2.pt2.bg-gray.white.f3[
can we use the information from the sample of 41 responses we have to say anything about the population?
]

---
# Looking at the class data

.pull-left[
```r
# the class heights in cm are in hData
hist(hData, xlab = "height in cm")
```
]
.pull-right[
![](lecture_2_files/figure-html/doahist-1.svg)<!-- -->
]

.flex.items-center[
.w-5.pa1[
![:scale 70%](lecture_1_files/img/danger.svg)
]
.w-95.pa1[
- data taken directly from the class survey responses
- uses the `googlesheets4` library
]]

---
# Mean, Standard Deviation

.pull-left[
- information about the distribution of the sample

```r
mean(hData)
```

```
## [1] 167.9
```

```r
sd(hData)
```

```
## [1] 9.264
```
]
.pull-right[
![](lecture_2_files/figure-html/realnorm-6.svg)<!-- -->
]

---
# Standard Error

.pull-left[
- the **standard error** is the "standard deviation of the mean"
- as we saw in the simulation
- can be _estimated_ as `\(\frac{\sigma}{\sqrt{n}}\)`

```r
n <- length(hData)
# standard error
sd(hData) / sqrt(n)
```

```
## [1] 1.447
```
]
.pull-right[
![](lecture_2_files/figure-html/senorm-6.svg)<!-- -->
]

---
# Statistically Useful Information

.flex.items-center[.w-50.pa2[
![](lecture_2_files/figure-html/senorm2-6.svg)<!-- -->

- we know that the area between `\(\bar{x}-1.96\textrm{se}\)` and `\(\bar{x}+1.96\textrm{se}\)` is 0.95
]
.w-50.pa2[
.br3.center.pa2.pt2.bg-gray.white.f3[
if we measure the mean height of 41 people from the same population as the USMR class, we estimate that the answer we obtain will lie between 165.1cm and 170.8cm 95% of the time
]
]]

---
# The Aim of the Game

- as statisticians, a major goal is to infer from **samples** to **populations**
- more about how we do this next time

---
class: inverse, center, middle, animated, swing
# End

---
# Acknowledgements

- icons by Diego Lavecchia from the [Noun Project](https://thenounproject.com/)
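---
# Appendix: Standardizing in R

a minimal sketch of the standardization formula from Part 2, using made-up example heights (not the class data)

```r
# ten hypothetical heights in cm
heights <- c(158, 172, 165, 181, 160, 170, 175, 163, 168, 177)

# standardize: subtract the mean, divide by the standard deviation
z <- (heights - mean(heights)) / sd(heights)

mean(z)  # (effectively) zero
sd(z)    # one
```

- `scale(heights)` does the same job in one step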
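---
# Appendix: Simulating the Central Limit Theorem

a minimal sketch of the Part 3 simulation, assuming a hypothetical normally-distributed "population" with mean 7.5cm and sd 1cm (illustrative values)

```r
set.seed(42)

# draw 1000 samples of size 20, and record each sample's mean
means <- replicate(1000, mean(rnorm(20, mean = 7.5, sd = 1)))

mean(means)  # close to the population mean, 7.5
sd(means)    # the standard error: close to 1 / sqrt(20)
```

- `hist(means)` is approximately normal, as the Central Limit Theorem predicts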
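---
# Appendix: A 95% Confidence Interval in R

a sketch of the Part 4 calculation, using made-up example heights in place of `hData` (the real values come from the class survey)

```r
heights <- c(158, 172, 165, 181, 160, 170, 175, 163, 168, 177)

m  <- mean(heights)
se <- sd(heights) / sqrt(length(heights))

# 95% of sample means are expected within 1.96 standard errors of the mean
c(lower = m - 1.96 * se, upper = m + 1.96 * se)
```

- with the real class data, the same calculation gives the 165.1cm--170.8cm interval reported in Part 4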