Methods for Handling Missing Data
Joseph Hogan Brown University
MDEpiNet Conference Workshop
October 22, 2018
Hogan (MDEpiNet) Missing Data October 22, 2018 1 / 160
Course Overview
1 Introduction and Background
◮ Introduce case studies
◮ Missing data mechanisms
◮ Review and critique of commonly-used methods
2 Case Study 1: Growth Hormone Study
◮ Analysis using mixture models
◮ Setting up sensitivity analysis
◮ Inference about treatment effects
3 Case Study 2: Smoking cessation study
◮ Exploratory analysis for long sequence of binary data
◮ Analysis via IPW methods under MAR
◮ Comparative analysis via GEE, ML, LOCF
◮ Placebo
◮ rhGH
◮ Exercise + Placebo (EP)
◮ Exercise + rhGH (EG)
◮ Quadriceps strength, in ft-lbs of torque
◮ Measured at baseline, 6 months, 12 months
◮ Mean quad strength at 12 months
◮ Compare EP and EG arms only, for illustration
◮ Supervised exercise vs. Wellness education program
◮ Weekly smoking status over 12 weeks
◮ Smoking rate at week 12 following baseline
◮ Binary outcomes
◮ Mean has some structure as a function of time
◮ Large number of repeated measures
◮ Sampling variability
◮ Uncertainty due to missing data or untestable assumptions
◮ Mechanisms that lead to missing data
◮ Biases missing data may cause
◮ Methods of addressing missing data
◮ These are called missing data mechanisms
◮ State the assumptions unambiguously so others can critique them
◮ Carry out sensitivity analysis wherever possible
◮ Have model covariates only
◮ Have model covariates and auxiliary information
◮ Missing data mechanism
◮ Selection mechanism
◮ Estimates will be consistent
◮ Standard errors may be larger than if you had the full data
◮ Men have higher BP
◮ Men less likely to be deleted
◮ ⇒ those with higher BP less likely to be deleted
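The blood pressure example above can be checked with simple arithmetic. The sketch below is only an illustration: the group sizes, mean blood pressures, and retention probabilities are hypothetical numbers chosen for the example, not values from the workshop. It shows that when deletion depends on sex and sex is related to BP, the mean among retained subjects drifts away from the full-data mean.

```python
def observed_mean(groups):
    """Mean of Y among retained subjects.

    groups: list of (n, mean_y, p_retained) tuples, one per stratum.
    """
    # Expected number retained in each stratum, paired with the stratum mean.
    kept = [(n * p, m) for n, m, p in groups]
    total_kept = sum(w for w, _ in kept)
    return sum(w * m for w, m in kept) / total_kept


# Hypothetical strata: (n, mean BP, probability of being retained).
men = (500, 130.0, 0.9)
women = (500, 120.0, 0.5)

full_mean = (men[0] * men[1] + women[0] * women[1]) / (men[0] + women[0])
obs_mean = observed_mean([men, women])

print("full-data mean:", full_mean)
print("observed-data mean:", round(obs_mean, 2))
```

Within each sex the retained subjects are still representative, which is why adjusting for sex can remove the bias in this setting.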
◮ The X distribution is different between those with missing and observed Y's
◮ Among those in follow up at time j, missingness is independent of the outcome Yj
◮ Missingness does not depend on present or future Y's, given the past.
1 Selection mechanism
◮ Can model selection probability as a function of observed past
2 Imputation mechanism
◮ Can impute missing Yj using a model of the observed Yj
◮ Critical: Must correctly specify observed-data model
◮ Missing Yj is equal to the most recently observed value of Y
◮ Missing value filled in with probability one (no variance)
◮ Conditional distribution of missing Yj not equal to that for observed Yj
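To make the LOCF rule concrete, here is a minimal sketch. It assumes missing values are coded as `None` and that the sequence is ordered in time; the function name `locf` and the example data are hypothetical.

```python
def locf(values):
    """Fill each missing entry (None) with the most recently observed value.

    Entries before the first observed value stay None, because there is
    nothing to carry forward.
    """
    filled = []
    last_seen = None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)
    return filled


# Hypothetical weekly smoking status (1 = smoking, 0 = quit, None = missing).
weekly_status = [1, 1, None, 0, None, None]
print(locf(weekly_status))  # [1, 1, 1, 0, 0, 0]
```

Each filled-in value is a single fixed number, which is the "probability one (no variance)" property noted above.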
◮ Particular way of structuring mean and variance
◮ MAR holds
◮ All parts of the model are correctly specified
◮ Not necessarily a full parametric model
◮ Inferences are most efficient when covariance correctly specified
◮ Correct inference about time-specific means even if covariance mis-specified
◮ Reason: all information about time-specific means is already observed
◮ Information about time-specific means relies on ‘imputation’ of missing observations
◮ These imputations come from the conditional distribution
◮ The form of the conditional distribution depends on the working covariance
◮ Can get different treatment effects with different working covariances
1 Introduce modeling approach
2 Relate modeling approach to missing data hierarchy
3 Illustrate on simple cases
4 Include a treatment comparison
5 Discussion of key points from case study
◮ Variable: Y3
◮ Ignoring baseline covariates
◮ Using information from baseline covariate Y1
◮ MAR and MNAR (sensitivity analysis)
◮ Expand to longitudinal case
◮ MAR – using regression imputation
◮ MNAR – sensitivity analysis
Observed-data mean: $\bar{Y}_{obs} = \sum_i R_i Y_i / \sum_i R_i$
◮ Estimator is the observed-data mean
◮ Shift observed-data mean by ∆ P(R = 0)
◮ Shift is proportional to fraction of missing observations
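The shift described in the bullets above can be written out directly. This is a minimal sketch under the assumption that the missing outcomes have mean equal to the observed-data mean plus ∆; the function name `shifted_mean`, the `None` coding for missing values, and the example numbers are hypothetical.

```python
def shifted_mean(y, delta):
    """Estimate E(Y) when missing values (None) are assumed to have mean
    equal to the observed-data mean plus delta.

    The result is: observed mean + delta * (fraction missing).
    """
    observed = [v for v in y if v is not None]
    p_missing = 1 - len(observed) / len(y)
    obs_mean = sum(observed) / len(observed)
    return obs_mean + delta * p_missing


# Hypothetical outcomes: three observed, two missing.
y3 = [80.0, 90.0, 100.0, None, None]
for delta in (-10, 0, 10):
    print(delta, shifted_mean(y3, delta))
```

Setting `delta = 0` returns the observed-data mean, and the size of the shift grows with the fraction of missing observations.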
[Figure: y3.delta plotted against delta]
◮ Fit model for F1 using observed data; call it $\hat{F}_1$
◮ Use this model to impute missing values of Y3
◮ Parameterize a model so that F0 is related to F1 through a sensitivity parameter ∆
◮ Generically write this as $F_0 = F_1^{\Delta}$
◮ Use fitted version of $F_1^{\Delta}$ to impute missing Y3
1 Fit the model E(Y3|Y1, R = 1) = α1 + β1Y1
2 For those with R = 0, impute predicted value via $\hat{Y}_{3i} = \hat{\alpha}_1 + \hat{\beta}_1 Y_{1i}$
3 Estimate overall mean as the mixture $\hat{E}(Y_3) = \frac{1}{n} \sum_i \{ R_i Y_{3i} + (1 - R_i) \hat{Y}_{3i} \}$
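The three regression-imputation steps can be sketched in a few lines. This is an illustration under stated assumptions, not the workshop's own code: missing Y3 is coded as `None`, the imputation model is a simple linear regression of Y3 on Y1 among those with Y3 observed, and the function names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = alpha + beta * x; returns (alpha, beta)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta


def mar_imputed_mean(y1, y3):
    """Regression-imputation estimate of E(Y3) under MAR."""
    # Step 1: fit the model using only those with Y3 observed (R = 1).
    observed = [(a, b) for a, b in zip(y1, y3) if b is not None]
    alpha, beta = fit_line([a for a, _ in observed], [b for _, b in observed])
    # Step 2: replace each missing Y3 by its predicted value.
    filled = [b if b is not None else alpha + beta * a for a, b in zip(y1, y3)]
    # Step 3: average observed and imputed values together.
    return sum(filled) / len(filled)


# Hypothetical data in which observed Y3 equals Y1 + 5 exactly.
y1 = [60, 80, 100, 70, 90]
y3 = [65, 85, 105, None, None]
print(mar_imputed_mean(y1, y3))
```

In this toy example the imputed values are 75 and 95, so the estimate is the mean of 65, 85, 105, 75 and 95.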
[Figure: scatterplot of y3 against y1 among those with r == 1]
Mixture estimate of the overall mean: $\hat{E}(Y_3) = \hat{P}(R=1)\,\bar{Y}_3^{[R=1]} + \hat{P}(R=0)\,\bigl(\hat{\alpha}_1 + \hat{\beta}_1 \bar{Y}_1^{[R=0]}\bigr)$
◮ Draw bootstrap sample
◮ Carry out imputation procedure
◮ Repeat for lots of bootstrap samples (say B)
◮ Base SE and CI on the B bootstrapped estimators
◮ Estimators are linear
◮ Bootstrap takes care of missing data uncertainty here
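The bootstrap procedure above can be sketched as follows. The key point is that the imputation model is re-fitted inside every bootstrap sample, so the standard error reflects uncertainty in the imputation as well as sampling variability. The function names, the `None` coding for missing Y3, the seed, and the data are assumptions made for this illustration. Skipping degenerate resamples (too few distinct observed points to fit a line) is a convenience of this sketch, not part of the described procedure.

```python
import random
import statistics


def imputed_mean(y1, y3):
    """Regression-imputation estimate of E(Y3); None marks a missing Y3."""
    obs = [(a, b) for a, b in zip(y1, y3) if b is not None]
    n_obs = len(obs)
    mean1 = sum(a for a, _ in obs) / n_obs
    mean3 = sum(b for _, b in obs) / n_obs
    sxx = sum((a - mean1) ** 2 for a, _ in obs)
    beta = sum((a - mean1) * (b - mean3) for a, b in obs) / sxx
    alpha = mean3 - beta * mean1
    filled = [b if b is not None else alpha + beta * a for a, b in zip(y1, y3)]
    return sum(filled) / len(filled)


def bootstrap_se(y1, y3, n_boot=500, seed=2018):
    """Bootstrap standard error of the regression-imputation estimator."""
    rng = random.Random(seed)
    n = len(y1)
    estimates = []
    while len(estimates) < n_boot:
        # Resample subjects with replacement, keeping (Y1, Y3) pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            estimates.append(
                imputed_mean([y1[i] for i in idx], [y3[i] for i in idx])
            )
        except ZeroDivisionError:
            # Degenerate resample: the line cannot be fitted, so draw again.
            continue
    return statistics.stdev(estimates)


# Hypothetical data.
y1 = [60, 80, 100, 70, 90, 75, 85, 95]
y3 = [66, 83, 107, None, 96, None, 88, None]
print("estimate:", imputed_mean(y1, y3))
print("bootstrap SE:", bootstrap_se(y1, y3))
```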
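1 Fit the model E(Y3|Y1, R = 1) = α1 + β1Y1
2 For those with R = 0, impute predicted value via $\hat{Y}_{3i} = \hat{\alpha}_1 + \Delta + \hat{\beta}_1 Y_{1i}$ (∆ = 0 recovers the MAR imputation)
3 Estimate overall mean as the mixture $\hat{E}(Y_3) = \hat{P}(R=1)\,\bar{Y}_3^{[R=1]} + \hat{P}(R=0)\,\bigl(\hat{\alpha}_1 + \Delta + \hat{\beta}_1 \bar{Y}_1^{[R=0]}\bigr)$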
◮ Usually appropriate to anchor analysis at MAR
◮ Examine effect of MNAR by varying ∆ away from 0
◮ Will always be specific to application.
◮ Ensure that range is appropriate to context (see upcoming example)
◮ Can use data-driven range for ∆, e.g. based on SD
◮ ‘Stress test’ approach
◮ Inverted sensitivity analysis: find values of ∆ that would change substantive conclusions
◮ Average over plausible ∆ values
◮ ∆ > 0 ⇒ dropouts have higher mean
◮ ∆ < 0 ⇒ dropouts have lower mean
◮ Residual variation in outcome quantified by SD of regression error
◮ Suggests scaling ∆ in units of σ
◮ Will illustrate in longitudinal case
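One way to organize such a sensitivity analysis is to sweep ∆ over a grid expressed in units of the residual SD. The sketch below assumes the MNAR imputation adds ∆σ to the MAR prediction from a linear regression of Y3 on Y1; the function name, the `None` coding, the grid, and the data are all hypothetical.

```python
import math


def sensitivity_means(y1, y3, deltas):
    """Estimate E(Y3) for each delta, where delta is in residual-SD units.

    Missing Y3 (None) is imputed as alpha + beta * y1 + delta * sigma, with
    alpha, beta, sigma estimated from those with Y3 observed.
    Returns (dict mapping delta -> estimate, sigma).
    """
    obs = [(a, b) for a, b in zip(y1, y3) if b is not None]
    n_obs = len(obs)
    mean1 = sum(a for a, _ in obs) / n_obs
    mean3 = sum(b for _, b in obs) / n_obs
    sxx = sum((a - mean1) ** 2 for a, _ in obs)
    beta = sum((a - mean1) * (b - mean3) for a, b in obs) / sxx
    alpha = mean3 - beta * mean1
    # Residual SD of the regression error among the observed.
    rss = sum((b - alpha - beta * a) ** 2 for a, b in obs)
    sigma = math.sqrt(rss / (n_obs - 2))
    results = {}
    for d in deltas:
        filled = [
            b if b is not None else alpha + beta * a + d * sigma
            for a, b in zip(y1, y3)
        ]
        results[d] = sum(filled) / len(filled)
    return results, sigma


# Hypothetical data: four observed Y3, two dropouts.
y1 = [60, 80, 100, 70, 90, 75]
y3 = [66, 83, 107, 72, None, None]
means, sigma = sensitivity_means(y1, y3, deltas=[-1, 0, 1])
for d in sorted(means):
    print(d, round(means[d], 2))
```

The entry for `delta = 0` is the MAR-anchored estimate; the other entries show how far the conclusion moves when dropouts are assumed to differ by one residual SD in either direction.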
◮ Compare treatments
◮ Illustrate sensitivity analysis under MNAR
◮ Discuss how to report results
1 Fit model for $E(Y_2 \mid Y_1, R_2 = 1) = \beta_0^{(2)} + \beta_1^{(2)} Y_1$
2 Impute missing Y2 as before: $\hat{Y}_{2i} = \hat{\beta}_0^{(2)} + \hat{\beta}_1^{(2)} Y_{1i}$
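1 Fit model for $E(Y_3 \mid Y_1, Y_2, R_3 = 1) = \beta_0^{(3)} + \beta_1^{(3)} Y_1 + \beta_2^{(3)} Y_2$, giving estimates $\hat{\beta}_0^{(3)}, \hat{\beta}_1^{(3)}, \hat{\beta}_2^{(3)}$
2 Impute missing Y3 as follows:
◮ For those with Y1, Y2 observed, $\hat{Y}_{3i} = \hat{\beta}_0^{(3)} + \hat{\beta}_1^{(3)} Y_{1i} + \hat{\beta}_2^{(3)} Y_{2i}$
◮ For those with only Y1 observed, $\hat{Y}_{3i} = \hat{\beta}_0^{(3)} + \hat{\beta}_1^{(3)} Y_{1i} + \hat{\beta}_2^{(3)} \hat{Y}_{2i}$, where $\hat{Y}_{2i}$ is the imputed value of Y2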
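The sequential imputation just described (impute Y2 from Y1, then Y3 from Y1 and Y2) can be sketched as below. The assumptions are mine: dropout is monotone, missing values are coded as `None`, both models are linear, and the least-squares solver is a small hand-written routine suitable only for tiny examples. All names and data are hypothetical.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (small problems only)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(p)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def sequential_impute(rows):
    """rows: list of (y1, y2, y3) with None for missing, monotone dropout."""
    # Model for Y2 given Y1, fitted among those with Y2 observed.
    obs2 = [r for r in rows if r[1] is not None]
    b2 = ols([[1.0, r[0]] for r in obs2], [r[1] for r in obs2])
    # Model for Y3 given Y1 and Y2, fitted among those with Y3 observed.
    obs3 = [r for r in rows if r[2] is not None]
    b3 = ols([[1.0, r[0], r[1]] for r in obs3], [r[2] for r in obs3])
    filled = []
    for y1, y2, y3 in rows:
        if y2 is None:
            y2 = b2[0] + b2[1] * y1            # impute Y2 first
        if y3 is None:
            y3 = b3[0] + b3[1] * y1 + b3[2] * y2  # then Y3, using Y2 or its imputation
        filled.append((y1, y2, y3))
    return filled


# Hypothetical data: four completers, one dropout after visit 2, one after visit 1.
rows = [
    (60, 64, 63), (80, 82, 82), (100, 108, 105), (70, 71, 71.5),
    (90, 95, None), (75, None, None),
]
for row in sequential_impute(rows):
    print(row)
```

The estimate of E(Y3) is then the average of the third column over all subjects.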
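For those with only Y1 observed: $\hat{Y}_{3i} = \hat{\beta}_0^{(3)} + \hat{\beta}_1^{(3)} Y_{1i} + \hat{\beta}_2^{(3)} \hat{Y}_{2i}$, where $\hat{Y}_{2i} = \hat{\beta}_0^{(2)} + \hat{\beta}_1^{(2)} Y_{1i}$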
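Under MNAR, shift the imputed mean in units of the residual SD: $\hat{Y}_{3i} = \hat{\beta}_0^{(3)} + \hat{\beta}_1^{(3)} Y_{1i} + \hat{\beta}_2^{(3)} Y_{2i} + \Delta \hat{\sigma}_3$, where $\sigma_3^2 = \mathrm{var}(Y_3 \mid Y_1, Y_2, R_3 = 1)$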
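Mixture over dropout patterns: $\hat{E}(Y_3) = \sum_{k=1}^{3} \hat{P}(K = k)\, \hat{E}(Y_3 \mid K = k)$, with $\hat{E}(Y_3 \mid K = 3) = \bar{Y}_3^{[K=3]}$ and the K = 2 and K = 1 terms obtained from the imputation models evaluated at the pattern-specific means $\bar{Y}_2^{[K=2]}$ and $\bar{Y}_1^{[K=1]}$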
◮ Can check validity of imputation
◮ Treatment effect estimates
◮ p-values
◮ Treatment effect, SE, p-value
◮ These are computed using bootstrap
[Figures: PANSS plotted against Visit (visits 1 to 6), shown on four consecutive slides]
◮ We used regression imputation, but this is only one possibility
◮ Regression models need not be linear
◮ More complex models may require more complex imputation procedures
◮ Missing data distribution indexed by sensitivity parameter that cannot be estimated
◮ Separates testable from untestable assumptions
◮ Easy to assess effect of departures from MAR
◮ Need to limit number of ∆'s to make inferences manageable
◮ Need sensible scale and range for ∆'s
◮ Scope of sensitivity analysis should be specified as part of trial protocol to avoid ...
◮ Sensitivity analysis provides range of conclusions
◮ Can use as a ‘stress-test’: under what MNAR scenario would our conclusions change?
◮ Can also use Bayesian formulations that average results over a prior for sensitivity parameters
i β
i
i
[Figure: prob plotted against y1]
[Figure: wt plotted against y3 among those with r == 1]
Linear predictor of the selection model: $X_{i1}^T \beta + X_{ij}^T \gamma + \theta Y_{i,j-1}$
1 Formulate and fit models for
2 Compute estimated value of
3 Compute weighted mean
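The three steps above can be sketched for the simplest case of one outcome and one fully observed covariate. This is an illustration under my own assumptions: the response model is a logistic regression of the response indicator on a single centered covariate, fitted by Newton-Raphson; the weighted mean is normalized by the sum of the weights; and all names and data are hypothetical.

```python
import math


def fit_logistic(x, r, n_iter=25):
    """Newton-Raphson fit of logit P(R = 1 | x) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * a))) for a in x]
        w = [q * (1.0 - q) for q in p]
        # Score vector.
        g0 = sum(ri - q for ri, q in zip(r, p))
        g1 = sum((ri - q) * a for ri, q, a in zip(r, p, x))
        # Information matrix entries.
        h00 = sum(w)
        h01 = sum(wi * a for wi, a in zip(w, x))
        h11 = sum(wi * a * a for wi, a in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def ipw_mean(x, y):
    """Inverse-probability-weighted mean of y; None marks a missing y."""
    r = [0 if v is None else 1 for v in y]
    # Step 1: fit the response model.
    b0, b1 = fit_logistic(x, r)
    # Step 2: estimated probability of being observed for each subject.
    prob = [1.0 / (1.0 + math.exp(-(b0 + b1 * a))) for a in x]
    # Step 3: weighted mean among the observed, weights = 1 / probability.
    num = sum(v / q for v, q in zip(y, prob) if v is not None)
    den = sum(1.0 / q for v, q in zip(y, prob) if v is not None)
    return num / den, prob


# Hypothetical data: covariate is baseline y1, centered and scaled.
y1 = [60, 70, 80, 90, 100, 60, 70, 80, 90, 100]
x = [(v - 80) / 10 for v in y1]
y3 = [None, 72, 85, 93, 104, 63, None, 81, None, 101]
estimate, prob = ipw_mean(x, y3)
print("IPW estimate:", round(estimate, 2))
print("smallest fitted probability:", round(min(prob), 3))
```

Printing the smallest fitted probability is a quick check for very large weights, which tend to make the weighted mean unstable.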
◮ Weights near zero lead to bias and inefficiency
◮ Need to check histogram
◮ Can use stabilized weights
◮ No guarantees here
◮ Can use lack of fit diagnostics to weed out poor-fitting models
◮ Poses a more serious problem with respect to final inferences
◮ Not good to pick weight model that gives lowest p-value!
◮ Pre-specify weight covariates that are related to missingness and outcome
◮ constant quit rate up to week 4
◮ separate treatment quit rates after week 4, but constant over time
◮ Goal 1: Estimate E(Y3) in GH data
◮ Goal 2: Estimate treatment effect in CTQ data
◮ Model specification
◮ Drawing imputed values from the model
◮ Combining observed and imputed information
◮ Standard error estimation
◮ Sensitivity analysis (CTQ data)
1 MAR, in that
2 Model for f(y | x, v, r = 1) has known form
1 Fit a model for f(y | x, v, r = 1); if it is a parametric model, this means estimating the model parameters
2 For each person having R = 0, take a draw of Y | X, V from the fitted model.
3 Do this several times for each individual, so that each person has multiple draws of Y: $Y_i^{(1)}, Y_i^{(2)}, \ldots, Y_i^{(K)}$
4 Perform the analysis you would have carried out had the data been complete. If
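5 Now need an estimate and standard error
◮ The estimate is the sample mean of the K completed-data estimates: $\bar{\theta}_K = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}^{(k)}$
◮ The (estimate of) variance of $\bar{\theta}_K$ combines within- and between-imputation variability: $\frac{1}{K} \sum_{k=1}^{K} \hat{V}^{(k)} + \left(1 + \frac{1}{K}\right) \frac{1}{K-1} \sum_{k=1}^{K} \bigl(\hat{\theta}^{(k)} - \bar{\theta}_K\bigr)^2$, where $\hat{V}^{(k)}$ is the variance estimate from the k-th completed dataset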
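The combining step can be sketched with the standard multiple-imputation combining rules usually attributed to Rubin: average the K estimates, then add the average within-imputation variance to an inflated between-imputation variance. The function name `combine_imputations` and the example numbers are hypothetical.

```python
import math
import statistics


def combine_imputations(estimates, variances):
    """Combine K completed-data estimates and their variance estimates.

    Returns (pooled estimate, total variance), where
    total = within + (1 + 1/K) * between.
    """
    k = len(estimates)
    pooled = sum(estimates) / k
    within = sum(variances) / k
    between = statistics.variance(estimates)  # sample variance, divisor K - 1
    total = within + (1 + 1 / k) * between
    return pooled, total


# Hypothetical results from K = 3 completed datasets.
pooled, total_var = combine_imputations([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
print("pooled estimate:", pooled)
print("standard error:", round(math.sqrt(total_var), 3))
```

The between-imputation term is what carries the extra uncertainty due to the missing values; with a single imputation it cannot be estimated.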
◮ This implies that the model f(y3 | y1, r = 1) is a normal distribution with mean and
1 Fit a model for f(y3 | y1, r3 = 1)
2 For each person having R = 0, take a draw of Y3 | Y1 from the fitted model. That
3 Do this several times for each individual, so that each person has multiple draws of Y3: $Y_{3i}^{(1)}, Y_{3i}^{(2)}, \ldots, Y_{3i}^{(K)}$
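Steps 1 to 3 for the Y3 example can be sketched as below. The sketch assumes a normal linear model for Y3 given Y1 among those with Y3 observed, and it draws imputations with the fitted parameters held fixed at their estimates. That is a simplification of this illustration: a fuller procedure would also reflect uncertainty in the fitted parameters. The function name, the `None` coding, the seed, and the data are hypothetical.

```python
import math
import random


def imputation_means(y1, y3, k=10, seed=2018):
    """Return the mean of Y3 in each of k completed datasets.

    Missing Y3 (None) is drawn from Normal(alpha + beta * y1, sigma), with
    alpha, beta, sigma estimated from those with Y3 observed.
    """
    rng = random.Random(seed)
    obs = [(a, b) for a, b in zip(y1, y3) if b is not None]
    n_obs = len(obs)
    mean1 = sum(a for a, _ in obs) / n_obs
    mean3 = sum(b for _, b in obs) / n_obs
    sxx = sum((a - mean1) ** 2 for a, _ in obs)
    beta = sum((a - mean1) * (b - mean3) for a, b in obs) / sxx
    alpha = mean3 - beta * mean1
    rss = sum((b - alpha - beta * a) ** 2 for a, b in obs)
    sigma = math.sqrt(rss / (n_obs - 2))
    means = []
    for _ in range(k):
        # One completed dataset: observed values kept, missing values drawn.
        completed = [
            b if b is not None else rng.gauss(alpha + beta * a, sigma)
            for a, b in zip(y1, y3)
        ]
        means.append(sum(completed) / len(completed))
    return means


# Hypothetical data: four observed Y3, two missing.
y1 = [60, 80, 100, 70, 90, 75]
y3 = [66, 83, 107, 72, None, None]
draws = imputation_means(y1, y3, k=10)
print([round(m, 2) for m in draws])
```

Unlike regression imputation with predicted values, each draw includes residual noise, so the completed datasets differ from one another.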
4 Perform the analysis you would have carried out had the data been complete. If
5 The (estimate of) variance of the combined estimate $\bar{\theta}_K = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}^{(k)}$ is $\frac{1}{K} \sum_{k=1}^{K} \hat{V}^{(k)} + \left(1 + \frac{1}{K}\right) \frac{1}{K-1} \sum_{k=1}^{K} \bigl(\hat{\theta}^{(k)} - \bar{\theta}_K\bigr)^2$, where $\hat{\theta}^{(k)}$ and $\hat{V}^{(k)}$ are the estimate and its variance from the k-th completed dataset
2 Fit imputation model
3 Generate imputation
4 Repeat 10 times and re-fit treatment model to filled-in datasets
5 Summarize results over imputed datasets, and compare to original model (used
i
i
i )
i )
i )
i
i
i )
i )
i )
◮ As ∆ → ∞, $\phi_i^{\Delta} \to 1$
◮ As ∆ → −∞, $\phi_i^{\Delta} \to 0$
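One transformation that has these two limits is a shift on the log-odds scale. The sketch below is an assumption made for illustration and is not necessarily the formula used on the slides: it takes a MAR-based probability `phi` and returns the probability whose log-odds are those of `phi` plus ∆. The function name `tilt_probability` is hypothetical.

```python
import math


def tilt_probability(phi, delta):
    """Shift the log-odds of phi by delta and return the new probability.

    Assumes 0 < phi < 1. With delta = 0 the input is returned unchanged.
    """
    log_odds = math.log(phi / (1.0 - phi))
    return 1.0 / (1.0 + math.exp(-(log_odds + delta)))


# Hypothetical MAR-based probability of 0.3 under several values of delta.
for delta in (-5, -1, 0, 1, 5):
    print(delta, round(tilt_probability(0.3, delta), 4))
```

On this scale ∆ is a log odds ratio comparing those with missing outcomes to those with observed outcomes, which can make a plausible range easier to discuss.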