Univariate Statistics and Methodology using R
Psychology, PPLS
University of Edinburgh
we don’t have any way of measuring accurately enough
our measurements are likely to be close to the truth
they might vary, if we measure more than once
the heights of the bars represent the numbers of times we obtain each value
but why are the bars not touching each other?
factors
written | min | max |
---|---|---|
7.5 | ≥ 7.450 | < 7.550 |
7.50 | ≥ 7.495 | < 7.505 |
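the table can be reproduced with a few lines of R. This is only a sketch: `implied_range()` is a hypothetical helper written for this example (it is not part of R or of the course materials), and it assumes values are written by ordinary rounding to a fixed number of decimal places.

```r
# hypothetical helper: the range of true values that would be written as `x`
# when rounded to `digits` decimal places
implied_range <- function(x, digits) {
  half_step <- 0.5 * 10^(-digits)   # half of the smallest written step
  c(min = x - half_step, max = x + half_step)
}

implied_range(7.5, digits = 1)    # from 7.45 up to (but not including) 7.55
implied_range(7.50, digits = 2)   # from 7.495 up to (but not including) 7.505
```

writing more decimal places claims a narrower range for the true value.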
we can represent all the measurements with a histogram
the bars are touching because this represents continuous data
note that the bin width of the histogram matters
these histograms all show the same data
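a rough sketch of how bin width changes the picture. The data here are simulated (an assumption for illustration, not the measurements from these slides), and the three bin widths are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, so the simulated values are reproducible
# simulated measurements (assumed values; not the data shown in the slides)
x <- rnorm(500, mean = 7.5, sd = 0.05)

widths <- c(0.005, 0.02, 0.1)   # three arbitrary bin widths
par(mfrow = c(1, 3))            # three plots side by side
hists <- lapply(widths, function(w) {
  # breaks in steps of w that cover the whole range of x
  brks <- seq(min(x) - w, max(x) + w, by = w)
  hist(x, breaks = brks, main = paste("bin width =", w), xlab = "measurement")
})

# every histogram still accounts for all 500 observations
sapply(hists, function(h) sum(h$counts))
```

the counts always add up to the same total; only the way they are grouped changes.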
the good
way to examine the distribution of data
easy to interpret (\(y\) axis = counts)
sometimes helpful in spotting weird data
the bad
changing bin width can completely change graph
only gives info about distribution and mode
[1] 7.504 4.196 7.516 7.385 7.550 7.500 7.473 7.453 7.424 7.583 7.445 7.609
[13] 7.502 7.466 7.531 7.425 7.546 7.452 7.490 7.463 7.473 7.481 7.580 7.544
[25] 7.482 4.199 7.628 7.489 7.560 7.471 7.488 7.503 7.507 7.406 7.500 7.565
[37] 7.466 7.394 7.509 7.522 7.462 7.529 7.567 7.461 7.514 7.474 7.532 7.530
[49] 7.462 7.508 7.569 7.539 7.566 7.447 7.486 7.627 7.501 7.487 7.539 7.513
[61] 7.581 7.522 7.529 7.500 7.491 7.523 7.485 7.527 7.412 7.560 7.512 7.650
density plots depend on a smoothing function
essentially, they’re making guesses where there is no data
this is equivalent to saying that if I pick an observation \(x_i\) from this sample at random, there is a probability of .1885 that \(7.500 \le x_i < 7.525\)
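one way to read a statement like this is as the proportion of the sample that falls inside the interval. A minimal sketch, using simulated data (an assumption; these are not the measurements from the slides, so the result will not be exactly .1885):

```r
set.seed(2)  # arbitrary seed for reproducibility
# simulated measurements (assumed mean and sd; not the slide data)
x <- rnorm(1000, mean = 7.5, sd = 0.05)

# TRUE for each observation with 7.500 <= x < 7.525
in_interval <- x >= 7.500 & x < 7.525

# the mean of a logical vector is the proportion of TRUEs, i.e. the
# probability that a randomly picked observation lies in the interval
p_hat <- mean(in_interval)
p_hat
```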
when we started thinking about measurement, we thought things might look a bit like this
the so-called normal curve
the normal curve is a hypothetical, asymptotic, density plot, with an area under the curve of 1
normal curves can be defined in terms of two parameters
one is the centre or mean of the distribution (\(\bar{x}\), or sometimes \(\mu\))
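the other is the spread or standard deviation of the distribution (\(\textrm{sd}\), or sometimes \(\sigma\))
\[\textrm{sd}=\sqrt{\frac{\sum{(x-\bar{x})^2}}{n-1}}\]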
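the formula can be checked against R's built-in `sd()`. A small sketch, using a handful of the measurements printed earlier; the variable names are just for this example.

```r
# six of the measurements printed earlier
x <- c(7.504, 7.516, 7.385, 7.550, 7.500, 7.473)

n     <- length(x)
x_bar <- mean(x)

# the formula, term by term: squared deviations from the mean,
# summed, divided by n - 1, then square-rooted
sd_by_hand <- sqrt(sum((x - x_bar)^2) / (n - 1))

sd_by_hand
sd(x)   # built-in function; should give the same value
```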
the proportion of the area under the curve within a given number of sds of the mean is always the same
for ± 1 sd it’s 0.6827
there is a 68% chance of obtaining data within 1 sd of the mean
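these areas can be looked up with `pnorm()`, which gives the area under a normal curve to the left of a value. A short sketch; the mean of 100 and sd of 10 in the last line are arbitrary values chosen for illustration.

```r
# area within 1 sd of the mean on the standard normal curve
area_1sd <- pnorm(1) - pnorm(-1)
round(area_1sd, 4)   # 0.6827

# area within 2 sd of the mean
area_2sd <- pnorm(2) - pnorm(-2)
round(area_2sd, 4)   # 0.9545

# the same proportion holds for any mean and sd (arbitrary example values)
area_any <- pnorm(110, mean = 100, sd = 10) - pnorm(90, mean = 100, sd = 10)
round(area_any, 4)   # 0.6827
```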
we can standardize any value on any normal curve by
subtracting the mean
dividing by the standard deviation
\[ z_i = \frac{x_i - \bar{x}}{\sigma} \]
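a minimal sketch of the two steps in R, using a handful of the measurements printed earlier and the sample mean and sd as the estimates:

```r
# six of the measurements printed earlier
x <- c(7.504, 7.516, 7.385, 7.550, 7.500, 7.473)

# subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)
z

# standardized values have a mean of 0 and an sd of 1
mean(z)
sd(z)
```

each \(z_i\) says how many standard deviations the original value lies above or below the mean.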
95% of the hypothetical observations lie within ±1.96 sd of the mean (95% confidence interval)
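the cut-offs that enclose the middle 95% of a normal curve can be found with `qnorm()`, the inverse of `pnorm()`. A short sketch on the standard normal curve (mean 0, sd 1):

```r
# z values that leave 2.5% of the area in each tail
z_cut <- qnorm(c(0.025, 0.975))
round(z_cut, 2)   # -1.96  1.96

# check: the area between the two cut-offs
area_mid <- pnorm(z_cut[2]) - pnorm(z_cut[1])
area_mid          # 0.95
```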
lay version of the central limit theorem (CLT): sample means will be normally distributed about the true mean
if we repeatedly sample from a population, we’ll get a normal distribution of means
the mean of the distribution of means will be (close to) the population mean
the standard deviation (“width”) of the distribution of sample means is referred to as the standard error of the mean
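a simulation sketch of these points. The population here is an assumption made for illustration: values drawn uniformly between 0 and 1 (so deliberately not normal), with samples of 30 and 5000 repetitions chosen arbitrarily.

```r
set.seed(3)        # arbitrary seed for reproducibility
n      <- 30       # size of each sample (arbitrary)
n_reps <- 5000     # number of repeated samples (arbitrary)

# draw a sample from the assumed population and keep only its mean, many times
sample_means <- replicate(n_reps, mean(runif(n)))

# the distribution of means looks roughly normal
hist(sample_means, main = "distribution of sample means")

# its mean is close to the population mean (0.5 for this population)
mean(sample_means)

# its sd (the standard error) is close to population sd / sqrt(n)
sd(sample_means)
sqrt(1 / 12) / sqrt(n)   # sd of a uniform(0, 1) population is sqrt(1/12)
```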
if you look up CLT on Wikipedia you’ll see it’s defined in terms of adding two numbers
the sample mean is a sum of many numbers, divided by \(n\)
adding many numbers is like adding two numbers:
\[\color{red}{1 + 3 + 2} + 5 = \color{red}{(1 + 3 + 2)} + 5 = \color{red}{6} + 5\]
we’ve just shown how adding many numbers is equivalent to adding two numbers
so if we know the sum of a bunch of numbers, \(n-1\) of those numbers can be anything
sum of n-1 numbers | nth number | sum |
---|---|---|
90 | 10 | 100 |
102 | -2 | 100 |
67 | 33 | 100 |
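the same table as arithmetic in R: once the sum is known, the first \(n-1\) numbers are free to vary, but the last one is fixed by them. The variable names are just for this example.

```r
total          <- 100
sum_of_n_minus_1 <- c(90, 102, 67)   # the three rows of the table

# the nth number is whatever is left over
nth_number <- total - sum_of_n_minus_1
nth_number   # 10 -2 33
```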
if we only have one sample (e.g., from an experiment) we can make estimates of the mean and standard error
the estimated standard error of the mean is defined in terms of the sample standard deviation
\[ \textrm{se} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{\frac{\sum{(x-\bar{x})^2}}{n-1}}}{\sqrt{n}} \]
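a minimal sketch of the estimate in R. `se()` is a helper defined here for the example (base R has no function of that name), and the data are a handful of the measurements printed earlier.

```r
# helper for this example: estimated standard error of the mean
se <- function(x) sd(x) / sqrt(length(x))

# six of the measurements printed earlier
x <- c(7.504, 7.516, 7.385, 7.550, 7.500, 7.473)

mean(x)   # estimated mean
se(x)     # estimated standard error of the mean

# the same thing written out as in the formula
n <- length(x)
se_by_hand <- sqrt(sum((x - mean(x))^2) / (n - 1)) / sqrt(n)
se_by_hand
```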
the normal curve is a density plot with known properties
if we repeatedly sample from a population and measure the mean, we’ll get a normal distribution of sample means
if we sample once from a population which is approximately normal
our estimated mean and sd for the population are the sample mean and sd
the standard error, or standard deviation of the sample means, can be estimated as \(\sigma/\sqrt{n}\)
we have some survey data from the last couple of USMR classes, including height in cm
perhaps we’re interested in the “mean height of a young statistician” (!)
“young statisticians” are a population
the USMR classes are a sample
can we use the information from the sample of 231 responses we have to say anything about the population?
if we measure the mean height of 231 people from the same population as the USMR class, we estimate that the answer we obtain will lie between 167.0cm and 169.4cm 95% of the time
course | mean | se | n |
---|---|---|---|
dapr1 | 168.0 | 0.6780 | 118 |
dapr2 | 167.0 | 1.1239 | 48 |
rms2 | 167.4 | 0.9750 | 67 |
usmr | 168.2 | 0.6004 | 231 |
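the interval quoted earlier (167.0cm to 169.4cm) can be reproduced from the usmr row of the table. A sketch that starts from the rounded mean and se in the table rather than from the raw survey data, and assumes a normal distribution of sample means:

```r
x_bar <- 168.2    # usmr mean height in cm, from the table
se    <- 0.6004   # usmr standard error, from the table

# mean plus or minus 1.96 standard errors
ci_95 <- x_bar + c(-1, 1) * qnorm(0.975) * se
round(ci_95, 1)   # 167.0 169.4
```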
USMR
\(\bar{x}=168.2\)
\(\text{se}=0.60\)
DAPR2
\(\bar{x}=167.0\)
\(\text{se}=1.12\)
not much evidence that DAPR2 and USMR come from different populations
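one informal way to see this is to compare the 95% intervals for the two courses. This is a sketch only: `ci_95()` is a helper defined for the example, and checking whether intervals overlap is a rough visual-style check, not a formal test.

```r
# helper for this example: 95% interval from a mean and a standard error
ci_95 <- function(m, se) {
  m + c(lower = -1, upper = 1) * qnorm(0.975) * se
}

usmr  <- ci_95(168.2, 0.60)
dapr2 <- ci_95(167.0, 1.12)

round(usmr, 1)
round(dapr2, 1)

# do the two intervals overlap?
overlap <- usmr[["lower"]] <= dapr2[["upper"]] &&
  dapr2[["lower"]] <= usmr[["upper"]]
overlap   # TRUE
```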
inferring from samples to populations is a major goal of statistics
more about this next time