Dealing with Missing Data: Challenges and Solutions

Nicole Erler
Department of Biostatistics, Erasmus Medical Center
n.erler@erasmusmc.nl · N_Erler · www.nerler.com · NErler
13 January 2020

Handling Missing Values is Easy!

Functions automatically exclude missing values:

    ## [...]
    ## Residual standard error: 2.305 on 69 degrees of freedom
    ##   (25 observations deleted due to missingness)
    ## Multiple R-squared: 0.09255, Adjusted R-squared: 0.02679
    ## F-statistic: 1.407 on 5 and 69 DF, p-value: 0.2325

Imputation is super easy:

    library("mice")
    imp <- mice(mydata)

However ...
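The silent exclusion shown in the output above can be reproduced with a minimal sketch. It assumes the built-in `airquality` data instead of the presentation's own dataset (which is not available here), so the counts will differ from those on the slide.

```r
# lm() uses na.action = na.omit by default: rows with a missing value in
# any model variable are dropped before the model is fitted.
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

n_total   <- nrow(airquality)        # rows in the data
n_used    <- nobs(fit)               # rows actually used in the fit
n_dropped <- length(fit$na.action)   # rows deleted due to missingness

c(total = n_total, used = n_used, dropped = n_dropped)
```

Checking `nobs()` against `nrow()` is a quick way to notice how much data a model has quietly discarded.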

Handling Missing Values Correctly is Not So Easy!

Complete case analysis is usually biased.

(Imputation) methods make certain assumptions, e.g.:
◮ missingness is M(C)AR
◮ the incomplete variable has a certain conditional distribution (e.g. normal)
◮ all associations are linear
◮ compatibility and congeniality

violation ➡ bias

Imputation ???

Remind me, how did that imputation thing work again???

Imputation

Imputation: filling in missing values with (good) "guesses"

Important: Missing values ➡ uncertainty
This needs to be taken into account!!!

Donald Rubin (in the 1970s): Represent each missing value with multiple imputed values
Multiple Imputation

Note: Imputation is not the only approach to handle missing values.
(Also: maximum likelihood, inverse probability weighting, ...)

Multiple Imputation

[Figure: incomplete data ➡ multiple imputed datasets ➡ analysis results ➡ pooled results]

1. Imputation: impute multiple times ➡ multiple completed datasets
2. Analysis: analyse each of the datasets
3. Pooling: combine results, taking into account additional uncertainty
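The three steps above can be sketched with the mice package. This is an illustration under assumptions, not the analysis from the presentation: it uses the `nhanes` example data shipped with mice, and the analysis model `bmi ~ age + chl`, the number of imputations and the seed are arbitrary choices.

```r
library("mice")

# 1. Imputation: create m = 5 completed versions of the incomplete data
imp <- mice(nhanes, m = 5, seed = 2020, printFlag = FALSE)

# 2. Analysis: fit the same model to each completed dataset
fits <- with(imp, lm(bmi ~ age + chl))

# 3. Pooling: combine the m sets of estimates with Rubin's rules, so that
#    the standard errors also reflect the variation between imputations
pooled <- pool(fits)
summary(pooled)
```

The pooled standard errors are typically larger than those from any single completed dataset, which is how the additional uncertainty about the missing values enters the results.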

Imputation Step

Two main approaches:

Joint Model Multiple Imputation
◮ the "original" approach
◮ often using a multivariate normal distribution

Multiple Imputation with Chained Equations (MICE)
◮ also: Fully Conditional Specification (FCS)
◮ now often considered the gold standard

Multiple Imputation with Chained Equations (MICE)

For each incomplete variable, specify a model using all other variables (the full conditionals):

[Figure: data table with columns x1, x2, x3, x4, ... in which some cells are NA]

x1 ∼ x2 + x3 + x4 + ...
x2 ∼ x1 + x3 + x4 + ...
x3 ∼ x1 + x2 + x4 + ...
x4 ∼ x1 + x2 + x3 + ...
...

For example:
◮ linear regression
◮ logistic regression
◮ ...
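As a rough sketch of how such full conditionals might be specified in mice, the example below assigns one imputation model per incomplete variable. It assumes the `nhanes2` example data from the mice package; the choice of `"norm"` (Bayesian linear regression) and `"logreg"` (logistic regression) is illustrative and not taken from the presentation.

```r
library("mice")

# nhanes2: age (3-level factor, complete), bmi and chl (numeric),
# hyp (binary factor)
meth <- make.method(nhanes2)   # default method for each incomplete variable
meth["bmi"] <- "norm"          # linear regression for continuous bmi
meth["chl"] <- "norm"          # linear regression for continuous chl
meth["hyp"] <- "logreg"        # logistic regression for binary hyp

# Predictor matrix: a 1 in row i, column j means variable j is used as a
# predictor when imputing variable i. The default uses all other variables.
pred <- make.predictorMatrix(nhanes2)

imp <- mice(nhanes2, method = meth, predictorMatrix = pred,
            m = 5, seed = 2020, printFlag = FALSE)
```

Complete variables such as `age` keep an empty method string, because nothing needs to be imputed for them.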

Multiple Imputation with Chained Equations (MICE)

MICE is an iterative algorithm:

[Figure: data table with columns x1, x2, x3, x4, ... in which some cells are NA]

◮ start with initial guess
◮ update x1 based on initial values of x2, x3, x4, ...
◮ update x2 based on new x1 and initial values of x3, x4, ...
◮ ...
◮ update x1 again, based on updated x2, x3, x4, ...
◮ ...
◮ until convergence

Values from last iteration ➡ one imputed dataset
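The iteration above can be illustrated with a toy loop in base R. This is a deliberately simplified sketch and not the algorithm as implemented in mice: it handles continuous variables only, uses linear regression for every variable, adds residual noise to the predictions but does not redraw the regression parameters, and runs a fixed number of iterations instead of checking convergence. The function name `chained_impute` and the use of the built-in `airquality` data are assumptions made for the example.

```r
# Toy chained-equations loop for continuous variables (base R only).
chained_impute <- function(dat, n_iter = 10) {
  miss <- is.na(dat)   # remember which cells were originally missing

  # Initial guess: fill each missing cell with a randomly chosen observed value
  for (v in names(dat)) {
    if (any(miss[, v])) {
      dat[miss[, v], v] <- sample(dat[!miss[, v], v], sum(miss[, v]),
                                  replace = TRUE)
    }
  }

  for (it in seq_len(n_iter)) {
    for (v in names(dat)) {
      if (!any(miss[, v])) next
      # Full conditional: regress v on all other variables (in their current,
      # filled-in state), using only rows where v was actually observed
      fit <- lm(reformulate(setdiff(names(dat), v), response = v),
                data = dat[!miss[, v], , drop = FALSE])
      pred <- predict(fit, newdata = dat[miss[, v], , drop = FALSE])
      # Update the missing cells of v: prediction plus residual noise
      dat[miss[, v], v] <- pred + rnorm(length(pred), sd = sigma(fit))
    }
  }
  dat   # values from the last iteration form one imputed dataset
}

set.seed(2020)
toy <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
completed <- chained_impute(toy, n_iter = 10)
```

Running the function several times with different seeds would give several imputed datasets, which is the starting point for the analysis and pooling steps.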

MICE Makes Assumptions

(Imputation) methods make certain assumptions, e.g.:
◮ missingness is M(C)AR
◮ the incomplete variable has a certain conditional distribution (e.g. normal)
◮ all associations are linear
◮ compatibility and congeniality

Missing Data Mechanisms

Missing Completely At Random (MCAR)
$p(R \mid X_{obs}, X_{mis}) = p(R)$
Missingness is independent of all data.
Example: questionnaire got lost in mail

Missing At Random (MAR)
$p(R \mid X_{obs}, X_{mis}) = p(R \mid X_{obs})$
Missingness depends only on observed data.
Example: overweight participants are less likely to report their chocolate consumption (and we know their weight)

Missing Not At Random (MNAR)
$p(R \mid X_{obs}, X_{mis}) \neq p(R \mid X_{obs})$
Missingness depends (also) on unobserved data.
Example: overweight participants are less likely to report their weight
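The three mechanisms can be mimicked in a small simulation, loosely following the weight and chocolate examples above. All numbers (sample size, means, slopes, missingness probabilities) are invented for illustration and are not from the presentation.

```r
set.seed(2020)
n <- 5000
weight    <- rnorm(n, mean = 75, sd = 12)
chocolate <- 50 + 0.5 * (weight - 75) + rnorm(n, sd = 10)  # related to weight

# MCAR: every chocolate value has the same chance of being missing
miss_mcar <- rbinom(n, 1, 0.3) == 1

# MAR: the chance that chocolate is missing depends on the observed weight
miss_mar <- rbinom(n, 1, plogis(-1 + 0.1 * (weight - 75))) == 1

# MNAR: the chance that weight is missing depends on the weight itself
miss_mnar <- rbinom(n, 1, plogis(-1 + 0.1 * (weight - 75))) == 1

# Complete-case means next to the full-data means
c(chocolate_full = mean(chocolate),
  chocolate_mcar = mean(chocolate[!miss_mcar]),
  chocolate_mar  = mean(chocolate[!miss_mar]),
  weight_full    = mean(weight),
  weight_mnar    = mean(weight[!miss_mnar]))
```

Because heavier participants are more often missing in the MAR and MNAR settings, the complete-case means there are pulled downward by construction, while the MCAR mean should differ from the full-data mean only by sampling noise. In the MAR case the observed weights carry the information needed to correct this; in the MNAR case they do not.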

MICE Makes Assumptions

(Imputation) methods make certain assumptions, e.g.:
◮ missingness is M(C)AR
◮ the incomplete variable has a certain conditional distribution (e.g. normal)
◮ all associations are linear
◮ compatibility and congeniality

In case of MNAR: MICE ➡ bias
