Data generating process

Since DAPR1, you’ve been summarising data not just in terms of its central tendency (e.g., by using the mean or the median), but also in terms of its spread (e.g., by using the standard deviation or the range). We need to use both kinds of measures because there’s always variability in the data we observe. When we think about the process that generated our data, we are asking: Where does that variability come from?

Three sources of variability

(1) Our outcome data may contain variability which is generated by manipulating or controlling for certain variables. This is the kind of variability you’ve been modelling in DAPR1 and DAPR2. For example:

The presence of a depressive disorder might generate wellbeing scores that are lower for people who have the disorder and higher for people who don’t have the disorder.
An experimental manipulation might generate faster reaction times in one condition and slower reaction times in another condition.
Handedness might generate right-handed keyboard responses that are faster for people who are right-handed and slower for people who are left-handed.

This kind of variability is reproducible. If you gathered new data and ran the analysis again, the variables would need to contain exactly the same levels. For example, you would look specifically at the same depressive disorder, or test specifically the same experimental manipulation. And you would likely find a similar pattern of variability.

(2) Our outcome data may also contain variability which is generated by certain grouping variables that we don’t manipulate or control. This is the kind of variability that we will focus on in DAPR3. For example:

A person with fast reflexes might generate faster reaction times than a person with slow reflexes (grouping variable: person).
A difficult question in our test battery might generate lower success rates than an easier question (grouping variable: test question).

This kind of variability is random. If you gathered new data and ran the analysis again, these variables would not need to contain the exact same levels. For example, it doesn’t matter exactly who takes part in our reaction time experiment, because we want our results to generalise to a broader population of people. And it doesn’t matter exactly what questions we use in a test battery, because we want our results to generalise over a broad range of questions.

(3) Our outcome data will also always contain left-over variability that’s not associated with any variable in our dataset at all. This variability is random noise: no two measurements are ever exactly alike, even when every variable we recorded takes the same value.
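
To make the three sources concrete, here is a minimal simulation sketch of the reaction time example. Everything in it is an assumption made for illustration: the function name `simulate_reaction_times`, the baseline of 500 ms, the 50 ms condition effect, and the two standard deviations are invented values, not estimates from any real dataset.

```python
import random
import statistics


def simulate_reaction_times(n_participants=200, n_trials=20, seed=1):
    """Simulate reaction times (ms) built from three sources of variability.

    All numeric values below are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    baseline = 500.0         # average RT in the control condition
    condition_effect = 50.0  # source (1): the manipulation adds 50 ms
    participant_sd = 40.0    # source (2): spread of participant offsets
    noise_sd = 30.0          # source (3): left-over trial-to-trial noise

    rows = []
    for participant in range(n_participants):
        # Source (2): each participant gets their own random offset.
        offset = rng.gauss(0.0, participant_sd)
        for trial in range(n_trials):
            # Source (1): condition alternates within each participant.
            condition = trial % 2  # 0 = control, 1 = manipulated
            # Source (3): fresh random noise on every single trial.
            noise = rng.gauss(0.0, noise_sd)
            rt = baseline + condition_effect * condition + offset + noise
            rows.append((participant, condition, rt))
    return rows


rows = simulate_reaction_times()

# The condition difference should land close to the 50 ms we built in.
mean_control = statistics.mean(rt for _, c, rt in rows if c == 0)
mean_manipulated = statistics.mean(rt for _, c, rt in rows if c == 1)
condition_difference = mean_manipulated - mean_control

# Participant means differ from each other mostly because of the offsets.
participant_means = [
    statistics.mean(rt for p, _, rt in rows if p == participant)
    for participant in range(200)
]
participant_spread = statistics.stdev(participant_means)

print(round(condition_difference, 1), round(participant_spread, 1))
```

If you reran this with a different `seed`, the participant offsets and the noise would change, but the condition difference would stay close to 50 ms. That is the sense in which the first source is reproducible and the other two are random.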

Models formalise the data generating process

When we define a model for our data, we’re essentially saying “this is how we think this data was produced”. More technically, we can say that we’re using a model to “formalise” the data generating process.

Each kind of variability outlined above appears in our model in a particular way:

We model manipulated / controlled / reproducible variability using fixed effects.
We model non-manipulated / non-controlled / random variability using random effects.
Purely random left-over variability appears in our models as the models’ residuals.
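
As a sketch of how these three pieces can sit together in one model, consider the simplest case: one predictor and one grouping variable, with a random intercept only. The notation here is an assumption chosen for illustration. $y_{ij}$ is observation $i$ in group $j$ (for example, trial $i$ from participant $j$) and $x_{ij}$ is the predictor value for that observation.

```latex
\begin{aligned}
y_{ij} &= \underbrace{\beta_0 + \beta_1 x_{ij}}_{\text{fixed effects}}
        + \underbrace{u_j}_{\text{random effect}}
        + \underbrace{\varepsilon_{ij}}_{\text{residual}} \\
u_j &\sim N(0, \sigma_u^2) \\
\varepsilon_{ij} &\sim N(0, \sigma_\varepsilon^2)
\end{aligned}
```

Here $\beta_0$ and $\beta_1$ capture the reproducible variability, $u_j$ captures how far group $j$ sits from the overall average, and $\varepsilon_{ij}$ is whatever is left over for each individual observation.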
