Sampling distributions
1 Fundamentals of inference
This section contains essential terminology and functions that are needed to complete the exercises provided.
Typically:
- We do not have data for the entire population. There are different possible reasons:
- It’s too expensive to collect them
- Because of deadlines, there is not sufficient time
- It’s not possible to reach the entire population
- It’s much easier to obtain data by taking a sample from that population of interest and measuring only the units chosen in the sample.
- Note that units do not necessarily have to be individuals, but they could be schools, companies, etc.
- We wish to use the sample data to:
- Investigate a claim about the whole population.
- Test a hypothesis about the entire population.
- Answer a question about the whole population.
The process of using information from a sample (the part) in order to draw conclusions about the entire population (the whole) is known as statistical inference.
As we do not typically have the data for the entire population, the population is considered "unknown" with respect to the data that we wish to investigate. In short, the population data are unknown.
For example, if we are interested in the average IQ in the population, we don’t have the resources to go to every single individual and test their IQ score. So, in this respect, the population IQ scores are unknown.
As a consequence of this, any numerical summary of the population data is also unknown. In the above example, the population mean IQ score is unknown and needs to be estimated.
Sample data are more readily available or feasible to collect. Imagine collecting a sample of 50 individuals, chosen at random from the population, and testing each to obtain their IQ score. If you performed the random experiment, you would then obtain a sequence of 50 IQ measurements.
At the same time, it is also feasible to compute any numerical summary of the sample data. For example, you can compute the mean IQ score for those 50 individuals in the sample.
We use the following terminology to distinguish a numerical summary computed on the population (unknown) from one computed on the collected sample (known).
A parameter is a numerical summary of a population.
A statistic is a numerical summary of the sample.
A statistic is often used as a “best guess” or “estimate” for the unknown parameter. That is, we use the (sample) statistic to estimate a (population) parameter.
In the above example, the population mean IQ score is the parameter of interest, while the sample mean IQ score is the statistic.
It is typical to use special notation to distinguish between parameters and statistics in order to convey with a single letter: (1) which numerical summary is being computed, and (2) if it is computed on the population or on the sample data.
The following table summarizes standard notation for some population parameters, typically unknown, and the corresponding estimates computed on a sample.
Numerical summary | Population parameter | Sample statistic |
---|---|---|
Mean | \(\mu\) | \(\bar{x}\) or \(\hat{\mu}\) |
Standard deviation | \(\sigma\) | \(s\) or \(\hat{\sigma}\) |
Proportion | \(p\) | \(\hat{p}\) |
The Greek letter \(\mu\) (mu) represents the population mean (parameter), while \(\bar{x}\) (x-bar) or \(\hat{\mu}\) (mu-hat) is the mean computed from the sample data (sample statistic).
The Greek letter \(\sigma\) (sigma) represents the population standard deviation (parameter), while \(s\) or \(\hat{\sigma}\) (sigma-hat) is the standard deviation computed from the sample data (sample statistic).
The Greek letter \(p\) represents the population proportion (parameter), while \(\hat{p}\) (p-hat) is the proportion computed from the sample data (sample statistic).
(Optional) Statistics are random variables
The process of sampling \(n\) people at random from the population is a random experiment, as it leads to an uncertain outcome. Hence, a statistic is a numerical summary of a random experiment and for this reason it is a random variable (random number).
Before actually performing the random experiment and picking individuals for the sample, the sample mean is a random number, as its value is uncertain. Once we actually perform the random experiment and measure the individuals in the sample, we have an observed value for the sample mean, i.e. a known number.
We will then distinguish between:
- a statistic - a random variable (random number) that is uncertain because it involves a random experiment.
- an observed statistic - the actual number that is computed once the sample data have been collected by performing the chance experiment.
Notation-wise we distinguish between a statistic (random variable) and an observed statistic (observed number) by using, respectively, an uppercase letter or a lowercase letter.
Uppercase letters refer to random variables. Recall that a random variable represents a well-defined number whose value is uncertain, as an experiment of chance is involved in arriving at an actual value.
For example, \(\bar X\) = “mean IQ score in a random sample of 50 individuals” is a random variable. It is clearly defined in operational terms by: (1) take a sample of 50 individuals at random, (2) measure their IQ score, and (3) compute the mean of their 50 IQ scores. However, the actual value that we can obtain is uncertain as it is the result of an experiment of chance involving random sampling from a population. There are many possible values we could obtain, so we are not sure which one we will see.
Another example: \(X\) = “number of heads in 10 flips of a coin”. This is also a clearly defined experiment: (1) flip a coin 10 times and (2) count the number of heads. However, the result is a random number, which is uncertain and will be unknown until we actually perform the experiment and flip a coin 10 times.
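If you like, you can simulate this chance experiment in R; the sketch below uses rbinom(), which draws the number of successes in a given number of trials (here, heads in 10 fair coin flips):

# One performance of the chance experiment: the result is a
# realised value of X and will typically change from run to run
rbinom(1, size = 10, prob = 0.5)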
Lowercase letters refer to observed (i.e., realised) values of the random variable. An observed value of a random variable is just a number.
For example, once we actually collect 50 individuals and measure their IQ scores, we can sum those 50 numbers and divide the sum by 50 to obtain the observed sample mean. Say the sample mean IQ score is 102.3, we would write \(\bar x = 102.3\).
In short, we would use for a sample mean:
- Statistic (sample mean): an uppercase letter before the value is actually known: \(\bar X\)
- Observed statistic (observed sample mean): a lowercase letter once the value is actually known: \(\bar x\)
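As an illustration only, we can mimic this process in R by simulating a hypothetical sample. The population values used below (mean 100, SD 15) are the conventional IQ scaling, not data from a real study:

# Simulate one hypothetical sample of 50 IQ scores, then compute
# the observed sample mean (x-bar)
iq_sample <- rnorm(50, mean = 100, sd = 15)
mean(iq_sample)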
Sampling bias occurs when the method used to select which units enter the sample causes the sample to not be a good representation of the population.
If sampling bias exists, we cannot generalise our sample conclusions to the population.
To be able to draw conclusions about the population, we need a representative sample. The key in choosing a representative sample is random sampling. Imagine an urn with tickets, where each ticket has the name of each population unit. Random sampling would involve mixing the urn and blindly drawing out some tickets from the urn. Random sampling is a strategy to avoid sampling bias.
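In R, drawing tickets from the urn corresponds to the sample() function. A minimal sketch with an urn of 10 numbered tickets:

# An "urn" with 10 tickets, one per population unit
urn <- 1:10
# Mix the urn and blindly draw 3 tickets (without replacement)
sample(urn, size = 3)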
Simple random sampling
When we select the units entering the sample via simple random sampling, each unit in the population has an equal chance of being selected, meaning that we avoid sampling bias.
When instead some units have a higher chance of entering the sample, we have misrepresentation of the population and sampling bias.
In general, we have bias when the method of collecting data causes the data to inaccurately reflect the population.
The natural gestation period (in days) for human births is normally distributed in the population with mean 266 days and standard deviation 16 days. This is a special case which rarely happens in practice: we actually know what the distribution looks like in the population.
We will use this unusual example to study how well the sample mean estimates the population mean. To do so, we need to know what the population mean is, so that we can compare the estimate with the true value. Remember, however, that in practice the population parameter would not be known.
We will consider data about the gestation period of the 49,863 women who gave birth in Scotland in 2019. These can be found at the following address: https://uoepsy.github.io/data/pregnancies.csv
First, we read the population data:
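A minimal way to do this, assuming the tidyverse is loaded (we name the data gest to match the code that follows):

library(tidyverse)
gest <- read_csv('https://uoepsy.github.io/data/pregnancies.csv')
dim(gest)   # number of rows and columns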
[1] 49863 2
The data set contains information about 49,863 cases: for each case, an identifier and the length of the pregnancy. Look at the top six rows of the data set (the "head"):
head(gest)
# A tibble: 6 × 2
id gest_period
<dbl> <dbl>
1 1 256.
2 2 269.
3 3 253.
4 4 292.
5 5 271.
6 6 253.
We now want to investigate how much the sample means will vary from sample to sample. To do so, we will take many samples from the population of all gestation periods, and compute for each sample the mean.
To do so, we will load a function which we prepared for you, called rep_sample_n(). This function takes a sample of \(n\) units from the population, and it lets you repeat this process many times.
To get the function on your computer, run this code:
source('https://uoepsy.github.io/files/rep_sample_n.R')
NOTE
You need to copy and paste the line
source('https://uoepsy.github.io/files/rep_sample_n.R')
at the top of each file in which you want to use the rep_sample_n() function.
The function takes the following arguments:
rep_sample_n(data, n = <sample size>, samples = <how many samples>)
- data is the population
- n is the sample size
- samples is how many samples of size \(n\) you want to take
Before doing anything involving random sampling, it is good practice to set the random seed. This is to ensure reproducibility of the results. Random number generation in R works by specifying a starting seed, and then numbers are generated starting from there.
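For example (a small illustration, using an arbitrary seed of 2024), restarting from the same seed reproduces the same random draws:

set.seed(2024)   # an arbitrary seed, for illustration
sample(1:10, 3)
set.seed(2024)   # restart from the same seed...
sample(1:10, 3)  # ...and the same triple is drawn again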
Set the random seed to any number you wish. You will reproduce the results shown here only if you use the same seed:
set.seed(1234)
Obtain 10 samples of \(n = 5\) individuals each:
samples <- rep_sample_n(gest, n = 5, samples = 10)
samples
# A tibble: 50 × 3
sample id gest_period
<dbl> <dbl> <dbl>
1 1 40784 255.
2 1 40854 275.
3 1 41964 281.
4 1 15241 246.
5 1 33702 247.
6 2 35716 267.
7 2 17487 289.
8 2 15220 266.
9 2 19838 238.
10 2 2622 281.
# ℹ 40 more rows
The samples data frame contains 3 columns:
- sample, telling us which sample each row refers to
- id, telling us the units chosen to enter each sample
- gest_period, telling us the gestation period (in days) of each individual
Note that the tibble samples has 50 rows, which is given by 5 individuals in each sample × 10 samples.
Now, imagine computing the mean of the five observations in each sample. This will lead to 10 means, one for each of the 10 samples (of 5 individuals each).
sample_means <- samples %>%
group_by(sample) %>%
summarise(mean_gest = mean(gest_period))
sample_means
# A tibble: 10 × 2
sample mean_gest
<dbl> <dbl>
1 1 261.
2 2 268.
3 3 273.
4 4 269.
5 5 261.
6 6 271.
7 7 252.
8 8 262.
9 9 258.
10 10 267.
As you can see, this leads to a tibble with 10 rows (one for each sample), where each row is a mean computed from the 5 individuals who were chosen to enter that sample.
The gestation periods (in days) for the first five women sampled are the five values in sample 1 above. This sample has a mean of \(\bar x\) = 260.93 days.
The second sample of 5 women (sample 2 above) has a mean gestation period of \(\bar x\) = 267.98 days.
In Figure 1 we display the individual gestation periods in each sample as dots, along with the mean gestation period \(\bar x\) of each sample. The position of each sample mean is given by a red vertical bar.
We then increased the sample size to 50 women and took 10 samples each of 50 individuals. This set of samples together with their means is also plotted in Figure 1.
Two important points need to be made from Figure 1. First, each sample (and therefore each sample mean) is different. This is due to the randomness of which individuals end up in each sample. The sample means vary in an unpredictable way, illustrating the fact that \(\bar X\) is a summary of a random experiment (randomly choosing a sample) and hence is a random variable. Second, as we increase the sample size from 5 to 50, there appears to be a decrease in the variability of the sample means (compare the variability of the vertical bars in panels (a) and (b)). That is, with a larger sample size, the sample means fluctuate less and are more "consistent".
To further investigate the variability of sample means, we will now generate many more sample means computed on:
- 1,000 samples of \(n = 5\) women
- 1,000 samples of \(n = 50\) women
- 1,000 samples of \(n = 500\) women
We will also add at the end of each tibble a column specifying the sample size. In the first tibble, mutate(n = 5) creates a column called n where all values will be 5, to remind ourselves that those means were computed with samples of size \(n = 5\). Remember that mutate() takes a tibble and creates a new column or changes an existing one.
# (a) 1,000 means from 1,000 samples of 5 women each
sample_means_5 <- rep_sample_n(gest, n = 5, samples = 1000) %>%
group_by(sample) %>%
summarise(mean_gest = mean(gest_period)) %>%
mutate(n = 5)
head(sample_means_5)
# A tibble: 6 × 3
sample mean_gest n
<dbl> <dbl> <dbl>
1 1 267. 5
2 2 263. 5
3 3 255. 5
4 4 269. 5
5 5 267. 5
6 6 266. 5
# (b) 1,000 means from 1,000 samples of 50 women each
sample_means_50 <- rep_sample_n(gest, n = 50, samples = 1000) %>%
group_by(sample) %>%
summarise(mean_gest = mean(gest_period)) %>%
mutate(n = 50)
head(sample_means_50)
# A tibble: 6 × 3
sample mean_gest n
<dbl> <dbl> <dbl>
1 1 263. 50
2 2 262. 50
3 3 269. 50
4 4 266. 50
5 5 267. 50
6 6 268. 50
# (c) 1,000 means from 1,000 samples of 500 women each
sample_means_500 <- rep_sample_n(gest, n = 500, samples = 1000) %>%
group_by(sample) %>%
summarise(mean_gest = mean(gest_period)) %>%
mutate(n = 500)
head(sample_means_500)
# A tibble: 6 × 3
sample mean_gest n
<dbl> <dbl> <dbl>
1 1 265. 500
2 2 265. 500
3 3 267. 500
4 4 266. 500
5 5 265. 500
6 6 265. 500
We now combine the above datasets of sample means for different sample sizes into a single tibble. The function bind_rows() takes multiple tibbles and stacks them one under the other.
sample_means_n <- bind_rows(sample_means_5,
sample_means_50,
sample_means_500)
We now plot three different density histograms showing the distribution of 1,000 sample means computed from samples of size 5, 50, and 500.
This would correspond to creating a histogram of the "red vertical bars" from Figure 1; the only difference is that we have many more samples (1,000).
ggplot(sample_means_n) +
geom_histogram(aes(mean_gest, after_stat(density)),
color = 'white', binwidth = 1) +
facet_grid(n ~ ., labeller = label_both) +
theme_bw() +
labs(x = 'Sample mean of gestation period (days)', y = 'Density')
Each of the density histograms above displays the distribution of the sample mean, computed on samples of the same size and from the same population.
Such a distribution is called the sampling distribution of the sample mean.
Sampling distribution
The sampling distribution of a statistic is the distribution of a sample statistic computed on many different samples of the same size from the same population.
A sampling distribution shows how the statistic varies from sample to sample due to sampling variation.
What is the mean and standard deviation of each histogram?
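We can compute them with a grouped summary (this mirrors the code used later in this section):

sample_means_n %>%
    group_by(n) %>%
    summarise(mean_xbar = mean(mean_gest),
              sd_xbar = sd(mean_gest))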
# A tibble: 3 × 3
n mean_xbar sd_xbar
<dbl> <dbl> <dbl>
1 5 266. 7.38
2 50 266. 2.23
3 500 266. 0.734
Compare these quantities to the population mean and standard deviation: \(\mu\) = 266 and \(\sigma\) = 16.1.
Regardless of the size of the samples we were drawing (5, 50, or 500), the average of the sample means was equal to the population mean. However, the standard deviation of the sample means was smaller than the population standard deviation, and the variability in sample means decreases as the sample size increases.
There is an interesting pattern in the decrease, which we will now verify. It can be proved that the standard deviation of the sample mean is \(\sigma_{\bar X} = \frac{\sigma}{\sqrt{n}}\), i.e. the population standard deviation divided by \(\sqrt{n}\), with \(n\) being the sample size.
Obtain the population standard deviation. Remember that the entire population data were called gest. In this case we are very lucky to have the data for the entire population; typically we would have neither the population data nor the population standard deviation.
sigma <- sd(gest$gest_period)
Now add a column that compares the SD from sampling with the theory-based one:
sample_means_n %>%
group_by(n) %>%
summarise(mean_xbar = mean(mean_gest),
sd_xbar = sd(mean_gest)) %>%
mutate(sd_theory = sigma / sqrt(n))
# A tibble: 3 × 4
n mean_xbar sd_xbar sd_theory
<dbl> <dbl> <dbl> <dbl>
1 5 266. 7.38 7.20
2 50 266. 2.23 2.28
3 500 266. 0.734 0.720
The last two columns will become closer and closer as we increase the number of different samples we take from the population (e.g. 5,000, 10,000, or even more samples).
The following result holds: \[ \begin{aligned} \mu_{\bar X} &= \mu = \text{Population mean} \\ \sigma_{\bar X} &= \frac{\sigma}{\sqrt{n}} = \frac{\text{Population standard deviation}}{\sqrt{\text{Sample size}}} \end{aligned} \]
Because on average the sample mean (i.e. the estimate) is equal to the population mean (i.e. the parameter), the sample mean \(\bar X\) is an unbiased estimator of the population mean. In other words, it does not consistently “miss” the target. (However, if your sampling method is biased, the sample mean will be biased too.)
The standard deviation of the sample means tells us that the variability in the sample means gets smaller and smaller as the sample size increases. Because \(\sqrt{4} = 2\), we halve \(\sigma_{\bar X}\) by making the sample size 4 times as large. Similarly, as \(\sqrt{9} = 3\), we reduce \(\sigma_{\bar X}\) to one third of its value by making the sample size 9 times as large.
The variability, or spread, of the sampling distribution shows how much the sample statistics tend to vary from sample to sample. This is key in understanding how accurate our estimate of the population parameter, based on just one sample, will be.
Recall that the standard deviation tells us the size of a typical deviation from the mean. Here, the mean is the population parameter \(\mu\), and a deviation of \(\bar x\) from \(\mu\) is called an estimation error. Hence, the standard deviation of the sample mean is called the standard error of the mean. This tells us the typical estimation error that we commit when we estimate a population mean with a sample mean.
Standard error
The standard error of a statistic, denoted \(SE\), is the standard deviation of its sampling distribution.
So, the standard error of the mean can be either computed as the standard deviation of the sampling distribution, or using the formula \[ SE = \sigma_{\bar X} = \frac{\sigma}{\sqrt{n}} \]
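For example, for samples of \(n = 50\) women, \(SE = \frac{16.1}{\sqrt{50}} \approx 2.28\) days, which matches the sd_theory column in the table above.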
We also notice that the density histograms in Figure 2 are symmetric and bell-shaped. Hence, they follow the shape of the normal curve.
The random variable \(\bar X\) follows a normal distribution: \[ \bar X \sim N(\mu,\ SE) \]
We can also compute a z-score. We have that: \[ Z = \frac{\bar X - \mu}{SE} \sim N(0, 1) \]
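For example (with made-up numbers), a sample of \(n = 50\) women with \(\bar x = 270\) days gives \(z = \frac{270 - 266}{2.28} \approx 1.75\): that sample mean is about 1.75 standard errors above the population mean.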
We know that for a normally distributed random variable, approximately 95% of all values fall within two standard deviations of its mean. Thus, for approximately 95% of all samples, the sample mean falls within \(\pm 2 SE\) of the population mean \(\mu\). Similarly, since \(P(-3 < Z < 3) = 0.997\), it is even rarer to get a sample mean which is more than three standard errors away from the population mean (only 0.3% of the time).
This suggests that:
The standard error \(SE\) is a measure of precision of \(\bar x\) as an estimate of \(\mu\).
It is a pretty safe bet to say that the true value of \(\mu\) lies somewhere between \(\bar x - 2 SE\) and \(\bar x + 2 SE\).
We will doubt any hypothesis specifying that the population mean is \(\mu\) when the value \(\mu\) is more than \(2 SE\) away from the sample mean we got from our data, \(\bar x\). We shall be even more suspicious when the hypothesised value \(\mu\) is more than \(3 SE\) away from \(\bar x\).
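To make this concrete with the numbers from above: the second sample of 5 women had \(\bar x = 267.98\) days and \(SE = \frac{16.1}{\sqrt{5}} \approx 7.2\), so the interval from \(\bar x - 2 SE \approx 253.6\) to \(\bar x + 2 SE \approx 282.4\) days does indeed contain the true mean \(\mu = 266\).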
Centre and shape of a sampling distribution
Centre: If samples are randomly selected, the sampling distribution will be centred around the population parameter. (No bias)
Shape: For most of the statistics we consider, if the sample size is large enough, the sampling distribution will follow a normal distribution, i.e. it is symmetric and bell-shaped. (Central Limit Theorem)
Clearly, we can compute sampling distributions for other statistics too: the proportion, the standard deviation, …
This requires the following steps:
1. Obtain multiple samples, all of the same size, from the same population;
2. For each sample, calculate the value of the statistic;
3. Plot the distribution of the computed statistics.
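For instance, here is a minimal sketch of these steps for the sample standard deviation, reusing rep_sample_n() and the gest data from above:

# Sampling distribution of the sample SD, based on 1,000 samples
# of n = 50 women each
sample_sds_50 <- rep_sample_n(gest, n = 50, samples = 1000) %>%
    group_by(sample) %>%
    summarise(sd_gest = sd(gest_period))

ggplot(sample_sds_50) +
    geom_histogram(aes(sd_gest, after_stat(density)), color = 'white') +
    labs(x = 'Sample SD of gestation period (days)', y = 'Density')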
You might be wondering: why did we take multiple samples of size \(n\) from the population when, in practice, we can only afford to take one?
This is a good question. We have taken multiple samples to show how the estimation error varies with the sample size. We saw in Figure 2, shown again below, that smaller sample sizes lead to more variable statistics, while larger sample sizes lead to more precise statistics, i.e. the estimates are more concentrated around the true parameter value.
This teaches us that, when we have to design a study, it is better to obtain just one sample with size \(n\) as large as we can afford.
What would the sampling distribution of the mean look like if we could afford to take samples as big as the entire population, i.e. of size \(n = N\)?
If you can, it is best to measure the entire population.
If we could afford to measure the entire population, then we would find the exact value of the parameter all the time. By taking multiple samples of size equal to the entire population, every time we would obtain the population parameter exactly, so the distribution would look like a histogram with a single bar on top of the true value: we would find the true parameter with a probability of one, and the estimation error would be 0.
pop_means <- gest %>%
rep_sample_n(n = nrow(gest), samples = 10) %>%
group_by(sample) %>%
summarise(mean_gest = mean(gest_period))
pop_means
# A tibble: 10 × 2
sample mean_gest
<dbl> <dbl>
1 1 266.
2 2 266.
3 3 266.
4 4 266.
5 5 266.
6 6 266.
7 7 266.
8 8 266.
9 9 266.
10 10 266.
The following is a dotplot of the means computed above:
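(A plot along these lines could be produced with the sketch below; this is one way to draw it, not necessarily the exact code originally used.)

ggplot(pop_means) +
    geom_dotplot(aes(x = mean_gest)) +
    labs(x = 'Sample mean of gestation period (days)')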
To summarize:
We have high precision when the estimates are less variable, and this happens for a large sample size.
We have no bias when we select samples that are representative of the population, and this happens when we do random sampling. No bias means that the estimates will be centred at the true population parameter to be estimated.
2 Glossary
- Statistical inference. The process of drawing conclusions about the population from the data collected in a sample.
- Population. The entire collection of units of interest.
- Sample. A subset of the entire population.
- Random sample. A subset of the entire population, picked at random, so that any conclusion made from the sample data can be generalised to the entire population.
- Representation bias. Happens when some units of the population are systematically underrepresented in samples.
- Generalisability. When information from the sample can be used to draw conclusions about the entire population. This is only possible if the sampling procedure leads to samples that are representative of the entire population (such as those drawn at random).
- Parameter. A fixed but typically unknown quantity describing the population.
- Statistic. A quantity computed on a sample.
- Sampling distribution. The distribution of the values that a statistic takes on different samples of the same size and from the same population.
- Standard error. The standard error of a statistic is the standard deviation of the sampling distribution of the statistic.