LEARNING OBJECTIVES

  1. Recognise the difference between parameters and statistics
  2. Be able to use a sample statistic to estimate an unknown parameter
  3. Understand what a sampling distribution is
  4. Understand the concept of standard error
  5. Recognise why sample size matters

Fundamentals of inference

This section contains essential terminology and functions that are needed to complete the exercises provided.

Population vs sample

Parameters vs statistics

Avoiding bias due to sampling

Sampling distribution

Centre and spread of a sampling distribution

The sample mean is normally distributed

Why sample size matters


Exercises

Question 1

Two students, Mary and Alex, wanted to investigate the average hours of study per week among students in their university. Each was given the task of sampling \(n = 20\) students many times and computing the mean of each sample of size 20. Mary sampled the students at random, while Alex asked students in the library. The distributions of the sample means computed by Mary and Alex are shown in the dotplot below in green and red, respectively.

What do you notice about the two distributions? Why did Mary and Alex get such different results?

Solution
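
The contrast between the two sampling schemes can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population size, the share of library users, and the study-hour distributions are all made up for illustration and are not taken from the exercise.

```r
set.seed(1)

# Hypothetical population of 5000 students (all numbers are made up).
N <- 5000
library_user <- rbinom(N, size = 1, prob = 0.3) == 1

# Assumption: library users study more hours per week on average.
hours <- ifelse(library_user,
                rnorm(N, mean = 25, sd = 5),
                rnorm(N, mean = 15, sd = 5))

# Mary: 1000 simple random samples of size 20 from the whole population.
mary <- replicate(1000, mean(sample(hours, size = 20)))

# Alex: 1000 samples of size 20 drawn only from library users.
alex <- replicate(1000, mean(sample(hours[library_user], size = 20)))

# Compare the centre of each set of sample means with the population mean.
c(population = mean(hours), mary = mean(mary), alex = mean(alex))
```

Under these made-up settings, Mary's sample means should centre close to the population mean, while Alex's should sit higher, because library users are over-represented in every one of his samples.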

Question 2

Average price of goods sold by ACME Corporation

Suppose you work for a company that is interested in buying ACME Corporation [1], and your boss wants to know within the next 30 minutes the average price of the goods that company sells and how much those prices vary.

Since ACME Corporation has such a big mail-order catalogue (see Figure 4), we will assume that the company sells many products. Furthermore, we only have the catalogue in paper form, and no online list of prices is available.

Figure 4: Product catalogue of ACME corporation.

  1. Identify the population of interest and the population parameters.
  2. Can we compute the parameters within the next 30 minutes?
  3. How would you proceed in estimating the population parameters if you just had time to read through 100 item descriptions? Would you pick the first 100 items or would you pick 100 random page numbers?
  4. State which statistics you would use to estimate the population parameters.

Solution
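
For part 3, choosing pages at random can be sketched in R as follows. The catalogue length `n_pages` is a hypothetical value, since the exercise does not say how many pages the catalogue has.

```r
set.seed(42)

# Hypothetical catalogue length (assumption; not given in the exercise).
n_pages <- 500

# Draw 100 distinct page numbers at random, then sort them for easy lookup.
pages <- sort(sample(seq_len(n_pages), size = 100))

head(pages)
```

One item would then be read from each selected page, and the sample mean and sample standard deviation of the 100 recorded prices would serve as estimates of the population mean and standard deviation.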

Question 3

What is a parameter? Give two examples of parameters.

What is a statistic? Give two examples of statistics.

What is an estimate?

Solution

Question 4

What is the difference between a statistic before and after collecting the sample data?

Why is this distinction made? What notational device is used to communicate it?

Solution

Question 8

Sampling distributions

What is a sampling distribution?

Solution

Data: HollywoodMovies.csv

Question 5

Reading data into R

Read the Hollywood movies data into R, and call it hollywood.

Check that the data were read into R correctly.

Solution
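
A minimal sketch of reading and checking the data. To keep it runnable anywhere, it first writes a tiny stand-in file with made-up rows; with the real data, `path` would instead point to your copy of HollywoodMovies.csv (its location is an assumption you need to adapt). Base `read.csv()` returns a data frame; `read_csv()` from the readr package would return a tibble instead.

```r
# Stand-in file with made-up rows so that the sketch is self-contained.
# With the real data, set `path` to the location of HollywoodMovies.csv.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Movie,Genre,Budget",
  "Film A,Comedy,30",
  "Film B,Action,150",
  "Film C,Drama,"
), path)

hollywood <- read.csv(path, stringsAsFactors = FALSE)

# Checks that the data were read correctly.
dim(hollywood)    # number of rows and columns
head(hollywood)   # first few rows
str(hollywood)    # column names and types
```

The empty Budget field in the last stand-in row is read as `NA`, which is how missing entries in numeric columns typically appear after import.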

Question 6

Extracting relevant variables

Extract from the hollywood tibble the three variables of interest (Movie, Genre, Budget) and keep the movies for which we have all information (no missing entries).

Hint: Check the help page for the function drop_na() or na.omit().

Solution
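
One possible approach, sketched on a made-up stand-in for the hollywood data (the rows and the extra `Year` column are invented for illustration). It uses base R column indexing with `na.omit()`; with the tidyverse loaded, `select()` followed by `drop_na()` would do the same job.

```r
# Made-up stand-in for the hollywood data, with some missing entries.
hollywood <- data.frame(
  Movie  = c("Film A", "Film B", "Film C", "Film D"),
  Genre  = c("Comedy", "Action", NA, "Drama"),
  Budget = c(30, 150, 20, NA),
  Year   = c(2010, 2011, 2012, 2013),  # extra column that will be dropped
  stringsAsFactors = FALSE
)

# Keep the three variables of interest, then drop rows with any missing value.
hollywood <- na.omit(hollywood[, c("Movie", "Genre", "Budget")])

hollywood
```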

Question 7

Proportion of comedy movies

What is the population proportion of comedy movies? What is an estimate of the proportion of comedy movies using a sample of size 20? Using the appropriate notation, report your results in one or two sentences.

Solution
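
A sketch of the two computations, treating the full data set as the population. The `hollywood` data frame here is a made-up stand-in with invented genre frequencies, so the printed values will not match those from the real file.

```r
set.seed(7)

# Made-up stand-in for the cleaned hollywood data (500 invented movies).
hollywood <- data.frame(
  Genre = sample(c("Action", "Comedy", "Drama"), size = 500,
                 replace = TRUE, prob = c(0.30, 0.25, 0.45)),
  stringsAsFactors = FALSE
)

# Population proportion p: the share of all movies that are comedies.
p <- mean(hollywood$Genre == "Comedy")

# Sample proportion p-hat: the share of comedies in one random sample of 20.
rows  <- sample(nrow(hollywood), size = 20)
p_hat <- mean(hollywood$Genre[rows] == "Comedy")

c(p = p, p_hat = p_hat)
```

In the report, \(p\) denotes the population proportion (a parameter) and \(\hat{p}\) the sample proportion (a statistic, used here as an estimate of \(p\)).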

Question 9

Sampling distribution of the proportion

Compute the sampling distribution of the proportion of comedy movies for samples of size \(n = 20\), using 1000 different samples.

Is it centred at the population value?

Solution
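
A sketch using `replicate()` to draw the 1000 samples. The `genre` vector is a made-up stand-in for the Genre column of the cleaned data, and `prop_comedy()` is a helper defined here for illustration only.

```r
set.seed(9)

# Made-up stand-in for the Genre column of the cleaned hollywood data.
genre <- sample(c("Action", "Comedy", "Drama"), size = 1000,
                replace = TRUE, prob = c(0.30, 0.25, 0.45))
p <- mean(genre == "Comedy")  # population proportion

# Illustrative helper: proportion of comedies in one random sample of size n.
prop_comedy <- function(n) mean(sample(genre, size = n) == "Comedy")

# Sampling distribution: the sample proportion across 1000 samples of size 20.
p_hats <- replicate(1000, prop_comedy(20))

# Compare the centre of the sampling distribution with the population value.
c(population = p, mean_of_p_hats = mean(p_hats))
```

Calling `hist(p_hats)` afterwards gives a quick picture of the distribution; for random samples, its centre should lie close to `p`.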

Question 10

Standard error

Using the replicated samples from the previous question, what is the standard error of the sample proportion of comedy movies?

Solution
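
Since the standard error is the standard deviation of the sampling distribution, it can be estimated by applying `sd()` to the replicated sample proportions. The sketch below rebuilds them from a made-up stand-in for the Genre column, so the numbers are illustrative only.

```r
set.seed(10)

# Made-up stand-in for the Genre column of the cleaned hollywood data.
genre <- sample(c("Action", "Comedy", "Drama"), size = 1000,
                replace = TRUE, prob = c(0.30, 0.25, 0.45))
p <- mean(genre == "Comedy")

# Sample proportions from 1000 random samples of size 20.
p_hats <- replicate(1000, mean(sample(genre, size = 20) == "Comedy"))

# Standard error: the standard deviation of the sampling distribution.
se_hat <- sd(p_hats)

# Rough theoretical benchmark sqrt(p * (1 - p) / n); approximate here because
# the samples are drawn without replacement from a finite population.
se_theory <- sqrt(p * (1 - p) / 20)

c(simulated = se_hat, approx_theory = se_theory)
```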

Question 11

The effect of sample size on the standard error of the sample proportion

How does the sample size affect the standard error of the sample proportion? Compute the sampling distribution for the proportion of comedy movies using 1,000 samples each of size \(n = 20\), \(n = 50\), and \(n = 200\) respectively.

Solution
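
The same simulation can be repeated for each sample size and the resulting standard errors compared. As before, `genre` is a made-up stand-in for the real Genre column and `prop_comedy()` is an illustrative helper.

```r
set.seed(11)

# Made-up stand-in for the Genre column of the cleaned hollywood data.
genre <- sample(c("Action", "Comedy", "Drama"), size = 1000,
                replace = TRUE, prob = c(0.30, 0.25, 0.45))

# Illustrative helper: proportion of comedies in one random sample of size n.
prop_comedy <- function(n) mean(sample(genre, size = n) == "Comedy")

# For each sample size, draw 1000 samples and take the SD of the proportions.
sizes   <- c(20, 50, 200)
se_by_n <- sapply(sizes, function(n) sd(replicate(1000, prop_comedy(n))))
names(se_by_n) <- paste0("n = ", sizes)

se_by_n
```

The standard error should shrink as the sample size grows: larger samples give sample proportions that vary less from one sample to the next.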

Question 12

Comparing the budget for action and comedy movies

What is the population average budget (in millions of dollars) allocated for making action movies, and what is it for comedy movies? What are the corresponding standard deviations?

Solution
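
A sketch of the group-wise summaries using `tapply()`, on a tiny made-up stand-in for the cleaned data (the budgets are invented). Note that `sd()` in R divides by \(n - 1\); if the data set is treated as the whole population, dividing by \(n\) instead gives a slightly smaller value.

```r
# Tiny made-up stand-in for the cleaned hollywood data.
hollywood <- data.frame(
  Movie  = c("Film A", "Film B", "Film C", "Film D", "Film E"),
  Genre  = c("Action", "Action", "Comedy", "Comedy", "Drama"),
  Budget = c(100, 200, 20, 40, 10),  # in millions of dollars
  stringsAsFactors = FALSE
)

# Keep only the two genres being compared.
sub <- subset(hollywood, Genre %in% c("Action", "Comedy"))

# Mean and standard deviation of Budget within each genre.
budget_mean <- tapply(sub$Budget, sub$Genre, mean)
budget_sd   <- tapply(sub$Budget, sub$Genre, sd)

rbind(mean = budget_mean, sd = budget_sd)
```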


Glossary

  • Statistical inference. The process of drawing conclusions about the population from the data collected in a sample.
  • Population. The entire collection of units of interest.
  • Sample. A subset of the entire population.
  • Random sample. A subset of the entire population, picked at random, so that any conclusion made from the sample data can be generalised to the entire population.
  • Representation bias. Happens when some units of the population are systematically underrepresented in samples.
  • Generalisability. When information from the sample can be used to draw conclusions about the entire population. This is only possible if the sampling procedure leads to samples that are representative of the entire population (such as those drawn at random).
  • Parameter. A fixed but typically unknown quantity describing the population.
  • Statistic. A quantity computed on a sample.
  • Sampling distribution. The distribution of the values that a statistic takes on different samples of the same size and from the same population.
  • Standard error. The standard error of a statistic is the standard deviation of the sampling distribution of the statistic.

  1. You might remember it from the cartoon Wile E. Coyote and the Road Runner.