class: center, middle, inverse, title-slide #
Missing Data
## Data Analysis for Psychology in R 2
### dapR2 Team ### Department of Psychology
The University of Edinburgh ### AY 2020-2021 --- # Weeks Learning Objectives + Develop awareness of the glm for modelling non-continuous outcomes + Understand the types of missing data and common causes. + Develop awareness of approaches to dealing with missing data. --- # Topics for today - Conceptual introduction to missing data. - Mechanisms of missingness - Main methods you will read about in the literature. --- # Missing Data - Missing data is very common - We worry about two main things with missing data: - Loss of efficiency (due to smaller sample size) - Bias (i.e., incorrect estimates) - The effects of missing data depend on a combination of the missing data mechanism and how we deal with the missing data --- # Missing Data Mechanisms - Missing data can be of several types: - Missing completely at random (MCAR) - Missing at random (MAR) - Missing not at random (MNAR) --- # MAR - Missing at random (MAR) means: **When the probability of missing data on a variable `\(Y\)` is related to other variables in the model but not to the values of `\(Y\)` itself** - Example: - `\(X\)` = self-control and `\(Y\)` = aggression. - People with lower self-control are more likely to have missing data on aggression. - After taking into account self-control, people who are high in aggression are no more likely to have missing data on aggression. - Challenge is that there is no way to confirm that there is no relation between aggression scores and missing data on aggression because that would knowledge of the missing scores. --- # MCAR - Missing completely at random (MCAR) means: **Genuinely random missingness. No relation between `\(Y\)` or any other variable in the model and missingness on `\(Y\)`** - The data you have are a simple random sample of the complete data. - The ideal missing data scenario! - Example: - `\(X\)` = self-control and `\(Y\)`= aggression - People of all levels of self-control and aggression are equally likely to have missing data on aggression --- # MNAR - Missing not random (MNAR) means: **When the probability of missingness on `\(Y\)` is related to the values of `\(Y\)` itself** - Example: - `\(X\)` = self-control, `\(Y\)` = aggression - Those high in aggression are more likely to have missing data on the aggression variable, even after taking into account self-control. - As with MAR, there is no way to verify that data are MNAR without knowledge of the missing values --- # Methods for Missing Data - Deletion methods: - Listwise deletion - Pairwise deletion - Imputation methods: - Mean imputation - Regression imputation - Multiple imputation - Maximum likelihood estimation - Methods for MNAR data: - Pattern mixture models - Random coefficient models --- # Deletion Methods - Listwise deletion or complete case analysis - Delete everyone from the analysis who has missing data on either self-control or aggression - Will give biased results unless data are MCAR - Even if data are MCAR, power will be reduced by reducing the sample size - Bottom line: **not recommended** --- # Deletion Methods - Pairwise deletion or available-case analysis - Uses available data for each analysis - Different cases contribute to different correlations in a correlation matrix - Example: - Cases 2,3,7,18, 56, 100 not used in the self-control- aggression correlation - Cases 2,7,18,77, 103 not used in the aggression-substance use correlation - Doesn’t reduce power as much as listwise deletion - But still gives biased results whenever data are not MCAR - Bottom line: **not recommended** --- # Mean Imputation - Replace missing values with the mean on that variable - Two major issues: - artificially reduces the variability of the data - can give very biased estimates even when data are MCAR - Bottom line: **not recommended** > ‘possibly the worst missing data handling method available… you should absolutely avoid this approach’ Enders (2011) --- # Regression Imputation - Replaces the missing values with values predicted from a regression - Estimate a set of regression equations where the incomplete variables are predicted from the complete variables - Use the regression equations to calculate the predicted values on the incomplete variables - Based on the principle of using information from the complete data to estimate the missing data - Two forms: - ‘normal’ regression imputation - Stochastic regression imputation (adds a residual term to overcome loss of variance) - Stochastic regression is preferred and gives unbiased results if data are MAR --- # Multiple Imputation - Procedure: - Imputes missing data several times to create multiple complete datasets - Analyses are conducted for each dataset - Analysis results are pooled across datasets to get parameter estimates and standard errors - 3-5 datasets are often enough but more is better and 20+ is ideal - Unlike in most single imputation approaches, the standard errors take account of the additional uncertainty due to missingness - Gives unbiased parameter estimates under MAR - Can be tedious if pooling has to be done by hand -Bottom line: **recommended method if data are likely to be MAR** - If using MI, include as many variables and higher-order effects as possible in the imputation phase --- # Maximum Likelihood Estimation - Estimation method - Makes use of all the information in the model to arrive at the parameter estimates (e.g. regression coefficient) ‘as if’ the data were complete - Does not ‘impute’ individual values - Gives unbiased estimates under MAR - Assumes multivariate normality - Even under MCAR it is superior to listwise and pairwise deletion because it uses more information from the observed data. - Practical advantage= easy to implement, usually much more so than multiple imputation - Bottom line: **recommended method if data are likely to be MAR** --- # Methods for MNAR - Selection models: - Combines model for predicting missingness as well as the analysis model of interest, e.g.: - Selection model= predict missingness on aggression from covariates such as gender, age, ADHD scores - Substantive model= predict aggression from self-control - The parameter estimates of the latter are adjusted by the inclusion of the former - Makes strong, untestable assumptions - Under many realistic conditions, gives worse results than MI or MLE, even when the MAR assumption is violated - Bottom line: **Good to include as part of a sensitivity analysis but often between to use MI or MLE** --- # Methods for MNAR - Pattern mixture models - Stratifies the sample according to different missing data patterns - E.g. pattern 1= complete data , pattern 2= ‘missing on aggression’ , pattern 3= ‘missing on self-control’ - Estimates the substantive model in each subgroup - Pools the results from each subgroup to get a weighted average parameter estimate - Similar to selection model, requires strong untestable assumptions - Violations of the assumptions lead to bias - Bottom line: **Good to include as part of a sensitivity analysis but often between to use MI or MLE** --- # What to do? .pull-left[ - When MLE may be better: - When the substantive model includes interactions - For structural equation models (more on this in dapR3) - For the inexperienced (easier to learn and implement) ] .pull-right[ - When MI may be better: - When a structural equation model has categorical indicators - When there is missing data on the predictors - When including auxiliary variables ] --- # Summary of today - This was intended as a short overview of missing data and the methods you may read about. - Goal was to make you aware of the issue. - We will return to FIML approaches in dapR3. - Most likely to work effectively for your own practical research.