Univariate Statistics and Methodology using R
Psychology, PPLS
University of Edinburgh
we don’t have any way of measuring accurately enough
our measurements are likely to be close to the truth
they might vary, if we measure more than once
the heights of the bars represent the numbers of times we obtain each value
but why are the bars not touching each other?
factors
written | min | max |
---|---|---|
7.5 | ≥ 7.450 | < 7.550 |
7.50 | ≥ 7.495 | < 7.505 |
we can represent all the measurements with a histogram
the bars are touching because this represents continuous data
note that the bin width of the histogram matters
these histograms all show the same data
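as a quick illustration (a minimal sketch using simulated values, not the class data):

```r
# same simulated data, three different bin widths
set.seed(1)
x <- rnorm(100, mean = 7.5, sd = 0.05)   # made-up measurements

par(mfrow = c(1, 3))
hist(x, breaks = 5,  main = "5 bins",  xlab = "measurement")
hist(x, breaks = 15, main = "15 bins", xlab = "measurement")
hist(x, breaks = 50, main = "50 bins", xlab = "measurement")
par(mfrow = c(1, 1))
# note: `breaks` is a suggestion; hist() picks "pretty" cut points near it
```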
the good
way to examine the distribution of data
easy to interpret (\(y\) axis = counts)
sometimes helpful in spotting weird data
the bad
changing bin width can completely change graph
only gives info about distribution and mode
[1] 7.504 4.196 7.516 7.385 7.550 7.500 7.473 7.453 7.424 7.583 7.445 7.609
[13] 7.502 7.466 7.531 7.425 7.546 7.452 7.490 7.463 7.473 7.481 7.580 7.544
[25] 7.482 4.199 7.628 7.489 7.560 7.471 7.488 7.503 7.507 7.406 7.500 7.565
[37] 7.466 7.394 7.509 7.522 7.462 7.529 7.567 7.461 7.514 7.474 7.532 7.530
[49] 7.462 7.508 7.569 7.539 7.566 7.447 7.486 7.627 7.501 7.487 7.539 7.513
[61] 7.581 7.522 7.529 7.500 7.491 7.523 7.485 7.527 7.412 7.560 7.512 7.650
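if the values printed above are stored in a vector (here assumed to be called `measurements`), the two values near 4.2 are easy to spot:

```r
# a histogram shows two isolated bars far below the rest
hist(measurements, xlab = "measurement")
# or just look at the smallest values directly
sort(measurements)[1:3]
```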
density plots depend on a smoothing function
essentially, they’re making guesses where there is no data
this is equivalent to saying that if I pick an observation \(x_i\) from this sample at random, there is a probability of .1885 that \(7.500 \le x_i < 7.525\)
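to see the effect of the smoothing, we can vary the bandwidth (a sketch, again assuming the values are in `measurements`):

```r
# `adjust` scales the default bandwidth of the smoothing kernel
plot(density(measurements, adjust = 0.5), main = "density estimates")  # narrow: bumpy
lines(density(measurements, adjust = 2), lty = 2)                      # wide: very smooth
```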
when we started thinking about measurement, we thought things might look a bit like this
the so-called normal curve
the normal curve is a hypothetical, asymptotic, density plot, with an area under the curve of 1
normal curves can be defined in terms of two parameters
one is the centre or mean of the distribution (\(\bar{x}\), or sometimes \(\mu\))
the other is the spread or standard deviation of the distribution (sd, or sometimes \(\sigma\))
\[\textrm{sd}=\sqrt{\frac{\sum{(x-\bar{x})^2}}{n-1}}\]
the area under the curve within a given number of sds of the mean is always the same
for ± 1 sd it’s 0.6827
there is a 68% chance of obtaining data within 1 sd of the mean
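these areas can be checked directly with R's normal distribution functions:

```r
# area under the standard normal curve within ±1 sd of the mean
pnorm(1) - pnorm(-1)        # 0.6826895
# and within ±1.96 sd
pnorm(1.96) - pnorm(-1.96)  # 0.9500042
```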
we can standardize any value on any normal curve by
subtracting the mean
dividing by the standard deviation
\[ z_i = \frac{x_i - \bar{x}}{\sigma} \]
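in R, standardising is just those two steps (a sketch; `measurements` is an assumed vector name, and `scale()` does the same thing):

```r
# subtract the mean, divide by the standard deviation
z <- (measurements - mean(measurements)) / sd(measurements)
# scale() returns the same values (as a one-column matrix)
z2 <- scale(measurements)
```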
± 1.96 sd either side of the mean contains 95% of the hypothetical observations (the basis of the 95% confidence interval)
lay version: sample means will be normally distributed about the true mean
if we repeatedly sample from a population, we’ll get a normal distribution of means
the mean of the distribution of means will be (close to) the population mean
the standard deviation (“width”) of the distribution of sample means is referred to as the standard error of the mean
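a sketch of this in R, using a made-up population rather than any real data:

```r
set.seed(42)
population <- rnorm(1e5, mean = 170, sd = 10)   # imaginary population of heights

# draw 1000 samples of size 50 and keep each sample mean
means <- replicate(1000, mean(sample(population, size = 50)))

hist(means)   # approximately normal
mean(means)   # close to the population mean (170)
sd(means)     # the standard error: close to 10 / sqrt(50)
```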
if you look up CLT on Wikipedia you’ll see it’s defined in terms of adding two numbers
the sample mean is a sum of many numbers, divided by \(n\)
adding many numbers is like adding two numbers:
\[\color{red}{1 + 3 + 2} + 5 = \color{red}{(1 + 3 + 2)} + 5 = \color{red}{6} + 5\]
we’ve just shown how adding many numbers is equivalent to adding two numbers
so if we know the sum of a bunch of numbers, \(n-1\) of those numbers can be anything; the \(n\)th number is then fixed
sum of n-1 numbers | nth number | sum |
---|---|---|
90 | 10 | 100 |
102 | -2 | 100 |
67 | 33 | 100 |
the normal curve is a density plot with known properties
if we repeatedly sample from a population and measure the mean, we’ll get a normal distribution
the mean of means will be (close to) the population mean
the standard deviation of the sample means is called the standard error
if we don’t know the population, we’ll have to estimate the population mean
we’ll want some kind of way of assessing how good our estimate is
the estimated mean is the sample mean (we have no other info)
the estimated standard error of the mean is defined in terms of the sample standard deviation
\[ \textrm{se} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{\frac{\sum{(x-\bar{x})^2}}{n-1}}}{\sqrt{n}} \]
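in R, this is a one-liner (a sketch; `x` stands for any sample vector):

```r
# estimated standard error of the mean: sample sd over the square root of n
se <- function(x) sd(x) / sqrt(length(x))
```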
there is only one population mean
our best estimate is the sample mean
we know that 95% of sample means fall within ±1.96 standard errors of the population mean, so we can be 95% confident that the population mean lies within ±1.96 standard errors of our sample mean
we have some survey data from the last couple of USMR classes, including height in cm
perhaps we’re interested in the “mean height of a young statistician” (!)
“young statisticians” are a population
the USMR classes are a sample
can we use the information from the sample of 304 responses we have to say anything about the population?
based on our sample of 304 people from the USMR class, we are 95% confident that the population mean lies between 166.9cm and 168.9cm
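that interval is just mean ± 1.96 standard errors (a sketch using the figures quoted above; `qnorm(0.975)` supplies the 1.96):

```r
xbar    <- 167.9    # sample mean (cm)
se_mean <- 0.5021   # estimated standard error (cm)

xbar + c(-1, 1) * qnorm(0.975) * se_mean
# approximately 166.9 and 168.9
```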
course | mean (cm) | se (cm) | n |
---|---|---|---|
dapr1 | 168.0 | 0.6780 | 118 |
dapr2 | 167.0 | 1.1239 | 48 |
rms2 | 167.4 | 0.9750 | 67 |
usmr | 167.9 | 0.5021 | 304 |
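a sketch of how a table like this could be produced, assuming a data frame called `surveys` with columns `course` and `height_cm` (both made-up names):

```r
# mean, standard error, and n of height for each course
aggregate(height_cm ~ course, data = surveys,
          FUN = function(h) c(mean = mean(h),
                              se   = sd(h) / sqrt(length(h)),
                              n    = length(h)))
```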
USMR
\(\bar{x}=167.9\)
\(\text{se}=0.50\)
DAPR2
\(\bar{x}=167.0\)
\(\text{se}=1.12\)
not much evidence that DAPR2 and USMR come from different populations
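one rough way to see this is to compare the two 95% confidence intervals, using the means and standard errors from the table above; they overlap substantially (overlapping intervals are only a heuristic, not a formal test):

```r
# USMR:  mean 167.9, se 0.50
167.9 + c(-1, 1) * qnorm(0.975) * 0.50   # roughly 166.9 to 168.9
# DAPR2: mean 167.0, se 1.12
167.0 + c(-1, 1) * qnorm(0.975) * 1.12   # roughly 164.8 to 169.2
```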
inferring from samples to populations is a major goal of statistics
more about this next time