A missing value tour
Julie Josse, École Polytechnique & INRIA, 26 January 2020
Workshop of the Applied Machine Learning Days 2020, Lausanne
Overview
1. Introduction
2. Handling missing values (inferential framework)
3. Supervised learning
[Figure: percentage of missing values per variable in the trauma data (Acide.tranexamique, AIS.externe, AIS.face, AIS.tete, ..., Cause.du.DC, Regr.mydriase.osmo). Missingness is coded in several ways: NA, Not Informed, Not made, Not Applicable, Impossible.]
In R, the default is listwise deletion: see ?lm and ?glm, whose na.action argument defaults to na.omit, so incomplete rows are silently dropped before fitting.
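Complete-case (listwise) deletion is only safe when missingness is unrelated to the data. A small illustrative sketch (in Python rather than the talk's R; all parameters are made up) of how deletion biases a simple mean estimate when missingness depends on an observed covariate (MAR):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)          # y correlated with x; E[y] = 0

# y is missing more often when x is large: a MAR mechanism
p_miss = 1 / (1 + np.exp(-2 * x))   # logistic in x
miss = rng.random(n) < p_miss

print(round(y.mean(), 3))           # full data: close to 0
print(round(y[~miss].mean(), 3))    # complete cases: clearly biased low
```

The complete cases over-represent small values of x, so the naive complete-case mean of y is biased downward even though y itself is never censored directly.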
Books: Schafer (2002), Little & Rubin (2002); Kim & Shao (2013); Carpenter & Kenward (2013); van Buuren (2018), etc.
7
Books: Schafer (2002), Little & Rubin (2002); Kim & Shao (2013); Carpenter & Kenward (2013); van Buuren (2018), etc.
7
Books: Schafer (2002), Little & Rubin (2002); Kim & Shao (2013); Carpenter & Kenward (2013); van Buuren (2018), etc.
7
Toy example: (X, Y) i.i.d. ~ N2((µx, µy), Σxy).

[Figure: three (X, Y) scatterplots of the bivariate Gaussian sample; the third panel shows the data after mean imputation of the missing Y values.]
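The distortion visible in the mean-imputation panel can be checked numerically. A sketch (Python, illustrative parameters): imputing the observed mean preserves the mean of Y but shrinks its variance and its correlation with X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
cov = [[1.0, 0.8], [0.8, 1.0]]     # illustrative Sigma_xy, correlation 0.8
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

miss = rng.random(n) < 0.5         # half of the y values go missing (MCAR)
y_imp = y.copy()
y_imp[miss] = y[~miss].mean()      # mean imputation

print(np.var(y), np.var(y_imp))                              # variance shrinks
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x, y_imp)[0, 1])  # correlation shrinks
```

With half the values imputed, the variance of Y is roughly halved, so any downstream inference treating the imputed table as complete data is badly miscalibrated.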
[Figure: individuals and variables factor maps from two PCAs of the leaf economics data (traits LL, LMA, Nmass, Pmass, Amass, Rmass; biomes alpine, boreal, desert, grass/m, temp_for, temp_rf, trop_for, trop_rf, tundra, wland): one configuration with Dim 1 = 44.79%, Dim 2 = 23.50%, the other with Dim 1 = 91.18%, Dim 2 = 4.97%, obtained under two different imputations of the missing values.]
library(FactoMineR)
PCA(ecolo)
# Warning message: Missing values are imputed by the mean of the variable

# You should use imputePCA from missMDA instead:
library(missMDA)
imp <- imputePCA(ecolo)
PCA(imp$completeObs)
[1] Wright, I. et al. (2004). The worldwide leaf economics spectrum. Nature.
[Figure: three (X, Y) scatterplots comparing mean imputation, regression imputation, and stochastic regression imputation of the missing Y values.]
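The three strategies can be sketched numerically (Python, illustrative model y = 0.5 + x + ε): regression imputation places every imputed point exactly on the fitted line and so understates the variance of Y, while stochastic regression imputation adds residual noise and restores it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
y = 0.5 + x + rng.normal(scale=0.5, size=n)   # illustrative linear model

miss = rng.random(n) < 0.4                    # 40% of y missing (MCAR)
xo, yo = x[~miss], y[~miss]

b1, b0 = np.polyfit(xo, yo, 1)                # regression fit on complete cases
resid_sd = np.std(yo - (b0 + b1 * xo))

y_reg = y.copy()
y_reg[miss] = b0 + b1 * x[miss]               # regression imputation: on the line
y_sto = y.copy()
y_sto[miss] = b0 + b1 * x[miss] + rng.normal(scale=resid_sd, size=miss.sum())

print(np.var(y), np.var(y_reg), np.var(y_sto))  # y_reg understates the variance
```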
Low-rank imputation of i.i.d. data (softImpute, Hastie & Mazumder; missMDA, Josse & Husson)
[2] Josse, Husson, Robin & Narasimhan (2018). Imputation of mixed data with multilevel SVD.
[3] https://cran.r-project.org/web/views/MissingData.html
Observed data (rows come in identical pairs; Feat2 and Feat3 are missing for the last six individuals):

                Feat1 Feat2 Feat3 Feat4 Feat5
C1, C2             1     1     1     1     1
C3, C4             2     2     2     2     2
C5, C6             3     3     3     3     3
C7, C8             4     4     4     4     4
C9, C10            5     5     5     5     5
C11, C12           6     6     6     6     6
C13, C14           7     7     7     7     7
Igor, Frank        8    NA    NA     8     8
Bertrand, Alex     9    NA    NA     9     9
Yohann, Jean      10    NA    NA    10    10

One imputation fills every missing Feat2 and Feat3 entry with the same value, 6.87; an imputation that exploits the structure of the table (each row repeats a single value across all features) recovers the underlying values 8, 8, 9, 9, 10, 10.
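Recovering the structured values requires an imputation that exploits the similarities between rows and columns. A simplified alternating-SVD sketch in Python (EM-style iterative PCA, without the regularization that imputePCA or softImpute actually use):

```python
import numpy as np

def iterative_svd_impute(X, rank=2, n_iter=100, tol=1e-6):
    """Fill NaNs by alternating a rank-`rank` SVD approximation with
    refilling only the missing entries (iterative-PCA-style sketch)."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        new = np.where(mask, low_rank, X)  # observed entries stay fixed
        if np.linalg.norm(new - filled) < tol:
            filled = new
            break
        filled = new
    return filled

# Rank-1 matrix with two entries removed: the structure determines them
M = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
M_miss = M.copy()
M_miss[1, 2] = np.nan   # true value 6
M_miss[3, 0] = np.nan   # true value 4
M_hat = iterative_svd_impute(M_miss, rank=1)
print(M_hat[1, 2], M_hat[3, 0])  # close to the true 6 and 4
```

Because the observed entries pin down the rank-1 structure, the iterates converge to the true completion, unlike the column-mean starting point.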
[Table: an incomplete trauma table with covariates X1, X2, X3, ... and outcome Y (shock / no shock), with NA entries scattered across X1–X3.]

Single imputation produces one completed table, each NA replaced by a single plausible value. Multiple imputation produces several completed tables whose imputed entries vary from one table to the next, reflecting the uncertainty about the missing values.

library(mice);    mice(traumadata)
library(missMDA); MIPCA(traumadata)
[Figure: the multiply imputed tables projected as supplementary elements onto a PCA configuration (Dim 1: 71.33%, Dim 2: 16.94%), showing the position of individuals 1–12 across imputations.]
Multiple imputation in practice: generate M completed data sets, apply the analysis model (here a logistic regression) to each of the M tables, and combine the M results.
imp.mice <- mice(traumadata)
lm.mice.out <- with(imp.mice, glm(Y ~ ., family = "binomial"))
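The M analyses are then combined with Rubin's rules: average the point estimates, and add the between-imputation variance to the average within-imputation variance. A numeric sketch (Python; the five estimates are hypothetical, not from the trauma data):

```python
import numpy as np

# Hypothetical estimates of one coefficient from M = 5 imputed data sets
est = np.array([0.42, 0.47, 0.39, 0.45, 0.44])       # point estimates
var = np.array([0.010, 0.011, 0.009, 0.010, 0.012])  # squared standard errors

M = len(est)
q_bar = est.mean()        # pooled point estimate: average of the M estimates
w = var.mean()            # average within-imputation variance
b = est.var(ddof=1)       # between-imputation variance
t = w + (1 + 1 / M) * b   # total variance (Rubin's rules)

print(q_bar, t ** 0.5)    # pooled estimate and its standard error
```

The (1 + 1/M) factor corrects for using a finite number of imputations; the total variance is always larger than the naive within-imputation variance.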
Supervised learning: estimate a prediction function f : X → Y by empirical risk minimization,

  f̂_n ∈ argmin_{f : X → Y} (1/n) Σ_{i=1}^n ℓ(f(X_i), Y_i),

which, as n → ∞, targets the Bayes predictor f⋆ minimizing E[ℓ(f⋆(X), Y)].
Impute-then-regress with a constant: for x = (x1, x2, . . . , xd), each coordinate keeps its value when observed and is replaced by its marginal mean when missing, e.g.

  x′1 = x1 · 1{M1 = 0} + E[X1] · 1{M1 = 1},

so the imputation function is impute(x) = (x′1, . . . , x′d), and the predictor is learned on the imputed data.
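A minimal impute-then-regress pipeline, sketched with scikit-learn (Python, illustrative data; SimpleImputer with strategy="mean" applies exactly this constant imputation, learning the means on the training set):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
n = 2_000
X = rng.normal(size=(n, 3))
y = X[:, 0] ** 2 + X[:, 1]             # target built from the complete data
X[rng.random((n, 3)) < 0.2] = np.nan   # then 20% of entries go missing (MCAR)

# Impute-then-regress: mean imputation followed by a flexible learner
model = make_pipeline(SimpleImputer(strategy="mean"),
                      RandomForestRegressor(n_estimators=50, random_state=0))
model.fit(X[:1500], y[:1500])
print(model.score(X[1500:], y[1500:]))  # R^2 on held-out incomplete rows
```

The imputation lives inside the pipeline, so the same train-set means are reused at prediction time, which is what consistency of impute-then-regress requires.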
[Figure: (x, y) scatterplots with fitted regression functions, illustrating prediction when x contains missing values.]
[Figure: explained variance versus sample size (10^3 to 10^5) for decision trees, random forests, and XGBoost on three simulated problems (linear, high noise; Friedman, high noise; non-linear, low noise), comparing surrogate splits (rpart), mean imputation, Gaussian imputation, MIA, block propagation (XGBoost), and the Bayes rate.]
"The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where the problem is sufficiently minor that it can be legitimately handled in this way and situations where standard estimators applied to the imputed data have substantial biases." (Dempster & Rubin, 1983)
Sportisse, Boyer & Josse (2019), work on Missing Not At Random data.
[4] https://www.r-consortium.org/projects/call-for-proposals
[5] https://rmisstastic.netlify.com/lectures/
[6] https://rmisstastic.netlify.com/tutorials/erler_course_
[7] https://rmisstastic.netlify.com/tutorials/Josse_slides_imputation_PCA_2018.pdf