Measurement and Distributions

Univariate Statistics and Methodology using R

Martin Corley

Psychology, PPLS

University of Edinburgh

Measurement

The problem with measurement

  • when we measure something, we want to identify its true measurement (the ground truth)
  • we don’t have any way of measuring accurately enough

  • our measurements are likely to be close to the truth

  • they might vary, if we measure more than once

Measurement

  • we might expect values close to the “true” measurement to be more frequent

Something quite familiar

Dice again

  • the heights of the bars represent the numbers of times we obtain each value

  • but why are the bars not touching each other?

Dice throws aren’t really numbers

  • A = [die face]

  • B = [die face] or [die face]

  • C = [die face] or [die face] or [die face]

  • bar plot (“bar chart”) always has gaps between bars
  • represents frequencies of discrete categories (factors)
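A minimal sketch of the bar plot described above. The throws are simulated here (an assumption; these are not the lecture's data), and the faces are stored as a factor so that R treats them as categories rather than numbers.

```r
# simulated throws of a fair die (illustrative data, not from the lecture)
set.seed(1)
throws <- sample(1:6, size = 600, replace = TRUE)

# treat the faces as discrete categories (a factor), not as numbers
faces <- factor(throws, levels = 1:6)
counts <- table(faces)

# barplot() draws one bar per category, with gaps between the bars
barplot(counts, xlab = "face", ylab = "frequency")
```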

Back to playmobil

  • height is a Ratio variable
  • there will be limits to our precision, conventionally indicated by number of digits
height of figure (cm)
written   min (≥)   max (<)
7.5       7.450     7.550
7.50      7.495     7.505
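The limits in the table follow from taking half a unit of the last written digit on either side. A small sketch of that arithmetic; `implied_range()` is a hypothetical helper written for this illustration, not a built-in R function.

```r
# hypothetical helper: the range of true values consistent with a
# measurement written to `digits` decimal places
implied_range <- function(written, digits) {
  half_step <- 0.5 * 10^(-digits)   # half a unit of the last digit
  c(min = written - half_step, max = written + half_step)
}

implied_range(7.5,  digits = 1)   # written 7.5:  >= 7.45  and < 7.55
implied_range(7.50, digits = 2)   # written 7.50: >= 7.495 and < 7.505
```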

Histograms

  • we can represent all the measurements with a histogram

  • the bars are touching because this represents continuous data

  • we know that there were 7 measurements of about 7.50 cm
    • strictly, ≥ 7.495 and < 7.505 cm

Histograms (2)

  • note that the bin width of the histogram matters

  • these histograms all show the same data
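One way to see the effect of bin width is to draw the same data with several sets of breaks. This is a sketch using simulated heights; the mean, sd, and the three bin widths are assumptions chosen for illustration.

```r
# simulated heights in cm (mean and sd are illustrative assumptions)
set.seed(42)
sim_heights <- rnorm(100, mean = 7.5, sd = 0.05)

# same data, three different bin widths, side by side
op <- par(mfrow = c(1, 3))
for (width in c(0.01, 0.025, 0.05)) {
  # the breaks must span the full range of the data
  lo <- floor(min(sim_heights) / width) * width
  hi <- ceiling(max(sim_heights) / width) * width
  hist(sim_heights, breaks = seq(lo, hi, by = width),
       main = paste("bin width", width), xlab = "height (cm)")
}
par(op)
```

Each panel accounts for all 100 observations; only the grouping changes.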

Histograms in R

head(heights)
#> [1] 7.482 7.569 7.424 7.530 7.501 7.516
hist(heights)

Histograms

the good

  • way to examine the distribution of data

  • easy to interpret (\(y\) axis = counts)

  • sometimes helpful in spotting weird data

the bad

  • changing bin width can completely change graph

  • only gives info about distribution and mode

    • not, e.g., mean or median

Histograms

 [1] 7.504 4.196 7.516 7.385 7.550 7.500 7.473 7.453 7.424 7.583 7.445 7.609
[13] 7.502 7.466 7.531 7.425 7.546 7.452 7.490 7.463 7.473 7.481 7.580 7.544
[25] 7.482 4.199 7.628 7.489 7.560 7.471 7.488 7.503 7.507 7.406 7.500 7.565
[37] 7.466 7.394 7.509 7.522 7.462 7.529 7.567 7.461 7.514 7.474 7.532 7.530
[49] 7.462 7.508 7.569 7.539 7.566 7.447 7.486 7.627 7.501 7.487 7.539 7.513
[61] 7.581 7.522 7.529 7.500 7.491 7.523 7.485 7.527 7.412 7.560 7.512 7.650
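A sketch of how a histogram can help with the values printed above: two of them sit far away from the rest. The cut-off of 5 cm used to pick them out is an arbitrary choice made for this example.

```r
# the 72 values printed above
x <- c(7.504, 4.196, 7.516, 7.385, 7.550, 7.500, 7.473, 7.453, 7.424, 7.583, 7.445, 7.609,
       7.502, 7.466, 7.531, 7.425, 7.546, 7.452, 7.490, 7.463, 7.473, 7.481, 7.580, 7.544,
       7.482, 4.199, 7.628, 7.489, 7.560, 7.471, 7.488, 7.503, 7.507, 7.406, 7.500, 7.565,
       7.466, 7.394, 7.509, 7.522, 7.462, 7.529, 7.567, 7.461, 7.514, 7.474, 7.532, 7.530,
       7.462, 7.508, 7.569, 7.539, 7.566, 7.447, 7.486, 7.627, 7.501, 7.487, 7.539, 7.513,
       7.581, 7.522, 7.529, 7.500, 7.491, 7.523, 7.485, 7.527, 7.412, 7.560, 7.512, 7.650)

# the two unusual values show up as an isolated bar on the left
hist(x, xlab = "height (cm)", main = "")

# which observations are they? (threshold of 5 is arbitrary)
odd <- which(x < 5)
odd      # positions in the vector
x[odd]   # the values themselves
```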

Density Plots

  • density plots depend on a smoothing function

  • essentially, they’re making guesses where there is no data

Density Plots

  • \(y\) axis is no longer a count
  • total area under curve = “all possibilities” = 1

Density Plots

  • partial area under the curve gives proportion of cases (here, 0.1885 of cases are ≥ 7.500 and < 7.525)

this is equivalent to saying that if I pick an observation \(x_i\) from this sample at random, there is a probability of .1885 that \(7.500 \le x_i < 7.525\)
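A rough sketch of both ideas, total area and partial area, using `density()` and the trapezium rule. The data are simulated (an assumption), so the partial area will not match the 0.1885 quoted above; `trapezium()` is a helper defined here for illustration.

```r
# simulated heights in cm (illustrative, not the lecture's sample)
set.seed(42)
sim_heights <- rnorm(100, mean = 7.5, sd = 0.05)

# helper: area under a curve given as points (x, y), by the trapezium rule
trapezium <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# density estimate over the whole range (default kernel and bandwidth)
d_all <- density(sim_heights)
plot(d_all, main = "", xlab = "height (cm)")
total_area <- trapezium(d_all$x, d_all$y)

# the same estimate, evaluated only between 7.500 and 7.525
d_part <- density(sim_heights, from = 7.500, to = 7.525)
part_area <- trapezium(d_part$x, d_part$y)

total_area   # close to 1
part_area    # estimated proportion of cases in the interval

# for comparison: the raw proportion of observations in the interval
mean(sim_heights >= 7.500 & sim_heights < 7.525)
```

The smoothed and raw proportions will usually differ a little, because the density curve spreads each observation over its neighbourhood.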

The Normal Distribution

A Famous Density Plot

  • when we started thinking about measurement, we thought things might look a bit like this

  • the so-called normal curve

the normal curve is a hypothetical, asymptotic, density plot, with an area under the curve of 1

Normal Curves: Mean

  • normal curves can be defined in terms of two parameters

  • one is the centre or mean of the distribution (\(\bar{x}\), or sometimes \(\mu\))

Standard Deviation

  • the other is the standard deviation (sd, or sometimes \(\sigma\))

\[\textrm{sd}=\sqrt{\frac{\sum{(x-\bar{x})^2}}{n-1}}\]

  • standard deviation is the “average distance of observations from the mean”
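The formula can be checked against R's built-in `sd()`. This sketch uses the six values shown earlier by `head(heights)`; the last line computes the literal average absolute distance from the mean, to show that the quoted description is an informal gloss rather than the definition.

```r
# the six heights shown earlier by head(heights)
x <- c(7.482, 7.569, 7.424, 7.530, 7.501, 7.516)
n <- length(x)

# the formula, step by step
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

manual_sd
sd(x)        # built-in function, same value

# the literal "average distance from the mean" is a different, smaller number
mean_abs_dev <- mean(abs(x - mean(x)))
mean_abs_dev
```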

Why Does the Height Vary?

  • the area under the curve within a given number of sds of the mean is always the same

  • for ± 1 sd it’s 0.6827

there is a 68% chance of obtaining data within 1 sd of the mean
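The 0.6827 figure can be reproduced with `pnorm()`, which gives the area under a normal curve to the left of a value. The second calculation uses an arbitrary mean and sd (assumptions for illustration) to show that the proportion does not depend on them.

```r
# area within 1 sd of the mean on the standard normal curve
area_std <- pnorm(1) - pnorm(-1)

# the same for a normal curve with an arbitrary mean and sd
area_other <- pnorm(110, mean = 100, sd = 10) - pnorm(90, mean = 100, sd = 10)

round(area_std, 4)
round(area_other, 4)
```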

The Standard Normal Curve

  • we can standardize any value on any normal curve by

  • subtracting the mean

    • the effective mean is now zero
  • dividing by the standard deviation

    • the effective standard deviation is now one

\[ z_i = \frac{x_i - \bar{x}}{\sigma} \]
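A short sketch of standardizing in R, again using the six values shown earlier by `head(heights)`. Here the sample sd stands in for \(\sigma\), which is an assumption of this example.

```r
x <- c(7.482, 7.569, 7.424, 7.530, 7.501, 7.516)

# subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)   # effectively zero (up to rounding error)
sd(z)     # one

# scale() performs the same two steps
z_scaled <- as.numeric(scale(x))
```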

The Standard Normal Curve

  • the area between 1 standard deviation below the mean and 1 standard deviation above the mean is, as we know, 0.6827

  • we can ask the question the other way around: an area of .95 lies between -1.96 and 1.96 standard deviations from the mean

95% of the hypothetical observations (95% confidence interval)
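Asking the question "the other way around" corresponds to `qnorm()`, the inverse of `pnorm()`. A sketch of how the 1.96 figure can be recovered:

```r
# a central area of .95 leaves .025 in each tail
z_crit <- qnorm(0.975)
z_crit

# check: the area between -1.96 and 1.96
central_area <- pnorm(1.96) - pnorm(-1.96)
central_area
```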

Sampling from a Population

Samples vs Populations

  • population: all members of group you are hypothesizing about
  • sample: the subset of the population you’re testing

Central Limit Theorem

  • lay version: sample means will be normally distributed about the true mean

  • if we repeatedly sample from a population, we’ll get a normal distribution of means

  • the mean of the distribution of means will be (close to) the population mean

the standard deviation (“width”) of the distribution of sample means is referred to as the standard error of the mean
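A simulation sketch of repeated sampling. The population here is deliberately not normal (exponential, with mean 1 and sd 1); that choice, the sample size of 50, and the 5000 repetitions are all assumptions made for illustration.

```r
set.seed(2024)
n <- 50   # size of each sample

# draw 5000 samples from a skewed population and keep each sample's mean
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

# the means pile up symmetrically around the population mean
hist(sample_means, xlab = "sample mean", main = "")

mean(sample_means)   # close to the population mean, 1
sd(sample_means)     # the "width" of the distribution of means
1 / sqrt(n)          # population sd divided by sqrt(n), for comparison
```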

Central Limit Theorem (2)

  • if you look up CLT on Wikipedia you’ll see it’s defined in terms of adding two numbers

    • the sample mean is a sum of many numbers, divided by \(n\)

    • adding many numbers is like adding two numbers:

    \[\color{red}{1 + 3 + 2} + 5 = \color{red}{(1 + 3 + 2)} + 5 = \color{red}{6} + 5\]

    • dividing by a constant (\(n\)) only rescales the sum; it doesn’t change the shape of the distribution

\(n-1\)

  • we’ve just shown how adding many numbers is equivalent to adding two numbers

  • so if we know the sum of a bunch of numbers, \(n-1\) of those numbers can be anything

sum of n-1 numbers   nth number   sum
90                   10           100
102                  -2           100
67                   33           100
  • so if we know a summary statistic (e.g., mean, sd) we know about the data with \(n-1\) degrees of freedom
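The table can be reproduced in a couple of lines: once the total is fixed, the last number is determined by the others.

```r
total <- 100
sum_of_others <- c(90, 102, 67)   # the n-1 numbers can add up to anything

# the nth number has no freedom left
nth_number <- total - sum_of_others
nth_number   # 10 -2 33
```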

Statistical Estimates

  • if we only have one sample (e.g., from an experiment) we can make estimates of the mean and standard error

    • the estimated mean is the sample mean (we have no other info)
  • the estimated standard error of the mean is defined in terms of the sample standard deviation

    \[ \textrm{se} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{\frac{\sum{(x-\bar{x})^2}}{n-1}}}{\sqrt{n}} \]

Putting it Together

  • the normal curve is a density plot with known properties

    • it can be defined in terms of two parameters: mean, and standard deviation
  • if we repeatedly sample from a population and measure the mean, we’ll get a normal distribution

    • the mean of means will be (close to) the population mean
  • if we sample once from a population which is approximately normal

    • our estimated mean and sd for the population are the sample mean and sd

    • the standard error, or standard deviation of the sample means, can be estimated as \(\sigma/\sqrt{n}\)

Can We Use This For Real?

  • we have some survey data from the last couple of USMR classes, including height in cm

  • perhaps we’re interested in the “mean height of a young statistician” (!)

    • “young statisticians” are a population

    • the USMR classes are a sample

can we use the information from the sample of 304 responses we have to say anything about the population?

Looking at the Class Data

# the class heights are in uheights
hist(uheights, xlab = "height (cm)")

  • histogram suggests that the heights are (approximately) normally distributed

Mean, Standard Deviation

  • information about the distribution of the sample
mean(uheights)
#> [1] 167.9
sd(uheights)
#> [1] 8.755

Standard Error

  • standard error is the “standard deviation of the mean”

  • as we saw in the simulation

  • can be estimated as \(\frac{\sigma}{\sqrt{n}}\)

n <- length(uheights)
# standard error
sd(uheights)/sqrt(n)
#> [1] 0.5021

Statistically Useful Information

  • we know that the area between \(\bar{x}-1.96\sigma\) and \(\bar{x}+1.96\sigma\) is 0.95 (for the distribution of sample means, \(\sigma\) is the standard error)

if we measure the mean height of 304 people from the same population as the USMR class, we estimate that the answer we obtain will lie between 166.9 cm and 168.9 cm 95% of the time
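The two limits can be recovered from the rounded summary values reported above (mean 167.9, standard error 0.5021). This sketch works from those rounded figures rather than from the raw `uheights` data.

```r
x_bar <- 167.9    # sample mean, as reported above
se    <- 0.5021   # estimated standard error, as reported above

# mean plus or minus 1.96 standard errors
ci <- x_bar + c(-1, 1) * qnorm(0.975) * se
round(ci, 1)   # 166.9 168.9
```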

Statistically Useful Information

  • we also have information from 3 other statistics courses
course   mean    se       n
dapr1    168.0   0.6780   118
dapr2    167.0   1.1239   48
rms2     167.4   0.9750   67
usmr     167.9   0.5021   304
  • are the young statisticians on those courses from different populations? (in terms of height)
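One informal way to approach the question is to put an interval of ±1.96 standard errors around each course mean and see whether the intervals are compatible. This is a sketch built from the table values only; it is a rough check, not a formal test.

```r
# summary values from the table above
courses <- data.frame(
  course = c("dapr1", "dapr2", "rms2", "usmr"),
  mean   = c(168.0, 167.0, 167.4, 167.9),
  se     = c(0.6780, 1.1239, 0.9750, 0.5021),
  n      = c(118, 48, 67, 304)
)

# interval of plus or minus 1.96 standard errors around each mean
courses$lower <- courses$mean - 1.96 * courses$se
courses$upper <- courses$mean + 1.96 * courses$se
courses

# does each course's interval contain the usmr mean?
usmr_mean <- courses$mean[courses$course == "usmr"]
contains_usmr <- courses$lower <= usmr_mean & courses$upper >= usmr_mean
contains_usmr
```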

Standard Errors (again)

USMR

\(\bar{x}=167.9\)

\(\text{se}=0.50\)

DAPR2

\(\bar{x}=167.0\)

\(\text{se}=1.12\)

Statistical Inference

  • not much evidence that DAPR2 and USMR come from different populations

  • inferring from samples to populations is a major goal of statistics

  • more about this next time

End