Treatment effect estimation with missing attributes - Julie Josse - PowerPoint PPT Presentation

SLIDE 1

Treatment effect estimation with missing attributes

Julie Josse

École Polytechnique, INRIA; Visiting Researcher, Google Brain

Mathematical Methods of Modern Statistics, June 2020

SLIDE 2

Collaborators

Methods: Imke Mayer (PhD X, EHESS), Jean-Philippe Vert (Google Brain), Stefan Wager (Stanford); Assistance Publique - Hôpitaux de Paris.

SLIDE 3

Covid data

  • 4780 patients (patients with at least one PCR-documented SARS-CoV-2 RNA from a nasopharyngeal sample)

  • 119 continuous and categorical variables: heterogeneous
  • 34 hospitals: multilevel data

  Hospital   Treatment   Age   Sex   Weight   DDI    BP    dead28   ...
  Beaujon    HCQ         54    m     85       NM     180   yes
  Pitie      AZ          76    m     NR       NR     131   no
  Beaujon    HCQ+AZ      63    m     80       270    145   yes
  Pitie      HCQ         80    f     NR       NR     107   no
  HEGP       none        66    m     98       5890   118   no
  ...

SLIDE 4

Covid data

(Same data as the previous slide.)

⇒ Estimate the causal effect: administration of the treatment "Hydroxychloroquine" (HCQ) on the outcome 28-day mortality.

SLIDE 5

Observational data: non random assignment

            survived       deceased      Pr(survived | treatment)   Pr(deceased | treatment)
  HCQ       497 (11.4%)    111 (2.6%)    0.817                      0.183
  HCQ+AZI   158 (3.6%)     54 (1.2%)     0.745                      0.255
  none      2699 (62.1%)   830 (19.1%)   0.765                      0.235

Mortality rate: 23% overall; 18% for the HCQ-treated; 24% for the non-treated. Does the treatment help?

[Figure: density of Age by treatment arm (HCQ vs. nothing), with mean and median marked. Comparison of the distribution of Age between HCQ-treated and non-treated patients.]

Severe patients (with a higher risk of death) are less likely to be treated. If the control group does not look like the treatment group, the difference in response may be confounded by differences between the groups.

SLIDE 6

Potential outcome framework (Neyman, 1923, Rubin, 1974)

Causal effect

  • n i.i.d. samples (Xi, Wi, Yi(1), Yi(0)) ∈ ℝ^d × {0, 1} × ℝ × ℝ
  • Individual causal effect of the treatment: ∆i := Yi(1) − Yi(0)

Missing-data problem: ∆i is never observed (we only observe one outcome per individual).

  Covariates            Treatment   Outcome(s)
  X1     X2    X3       W           Y(0)       Y(1)
  1.1    20    F        1           ?          Survived
  -6     45    F        0           Dead       ?
  15     …     M        1           ?          Survived
  …      …     …        …           …          …
  -2     52    M        0           Survived   ?

SLIDE 7

Potential outcome framework (Neyman, 1923, Rubin, 1974)


Average treatment effect (ATE): τ := E[∆i] = E[Yi(1) − Yi(0)]

The ATE is the difference between the average outcome had everyone been treated and the average outcome had nobody been treated. For a binary death outcome, ATE = 0.05 means the mortality rate in the treated group is 5 percentage points higher than in the control group: on average, the treatment increases the risk of dying.

SLIDE 8

Assumption for ATE identifiability in observational data

Unconfoundedness (selection on observables): {Yi(0), Yi(1)} ⊥⊥ Wi | Xi

Treatment assignment Wi is random conditionally on the covariates Xi: measure enough covariates to capture the dependence between Wi and the outcomes.

  • Observed outcome: Yi = WiYi(1) + (1 − Wi)Yi(0)

Unconfoundedness: graphical model

[DAG: covariates X point to the treatment W and to the outcome Y = {Y(0), Y(1)}.]

Unobserved confounders, correlated with both the outcome and the treatment, make it impossible to separate correlation from causality. The ATE is not identifiable without this assumption: it is not a sample-size problem!

SLIDE 9

Assumption for ATE identifiability in observational data

Propensity score: the probability of treatment given the observed covariates, e(x) := P(Wi = 1 | Xi = x), ∀x ∈ 𝒳.

Overlap assumption: η < e(x) < 1 − η for all x ∈ 𝒳 and some η > 0.

Left: non-smokers, never treated. Right: smokers, all treated. If the probability of being treated when a smoker is e(x) = 1, how can we estimate the outcome for smokers when not treated, Y(0)? How to extrapolate under total confounding?

SLIDE 10

Inverse-propensity weighting estimation of ATE

Average treatment effect (ATE): τ := E[∆i] = E[Yi(1) − Yi(0)]
Propensity score: e(x) := P(Wi = 1 | Xi = x)

IPW estimator (Horvitz-Thompson, from survey sampling):

  τ̂_IPW := (1/n) Σ_{i=1}^{n} [ Wi Yi / ê(Xi) − (1 − Wi) Yi / (1 − ê(Xi)) ]

⇒ Balances the differences between the two groups.
⇒ Consistent estimator of τ as long as ê(·) is consistent.
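As a quick illustration (not from the slides), the IPW estimator fits in a few lines of R; the simulated data and the logistic propensity model below are our own assumptions:

  # IPW sketch: simulate data with confounding, then reweight
  set.seed(1)
  n <- 5000
  x <- rnorm(n)
  w <- rbinom(n, 1, plogis(x))             # treatment depends on x
  y <- x + w + rnorm(n)                    # true ATE = 1

  e_hat   <- fitted(glm(w ~ x, family = binomial))
  tau_ipw <- mean(w * y / e_hat - (1 - w) * y / (1 - e_hat))
  tau_ipw                                  # close to 1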

[Figure: treated observations have higher X's on average; reweighting control observations with high X's adjusts for the difference. Credit: S. Athey]

SLIDE 11

Doubly robust ATE estimation

Model Treatment on Covariates: e(x) := P(Wi = 1 | Xi = x)
Model Outcome on Covariates: µ(w)(x) := E[Yi(w) | Xi = x]

Augmented IPW - Doubly Robust (DR):

  τ̂_AIPW := (1/n) Σ_{i=1}^{n} [ µ̂(1)(Xi) − µ̂(0)(Xi) + Wi (Yi − µ̂(1)(Xi)) / ê(Xi) − (1 − Wi) (Yi − µ̂(0)(Xi)) / (1 − ê(Xi)) ]

τ̂_AIPW is consistent if either the µ̂(w)(x) are consistent or ê(x) is consistent. Any (machine learning) procedure, such as random forests or deep nets, can be used to estimate ê(x) and µ̂(w)(x) without harming the interpretability of the causal effect estimation.

Properties - Double Machine Learning (Chernozhukov et al., 2018): if ê(x) and µ̂(w)(x) converge at the rate n^{1/4}, then √n (τ̂_DR − τ) →_d N(0, V*) as n → ∞, with V* the semiparametric efficient variance.
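To make the formula concrete, here is a minimal sketch in R (our own simulated data; the nuisance models are a plain logistic and linear regression, though any ML learner can be plugged in):

  # AIPW sketch: fit the two nuisance models, then average the DR scores
  set.seed(1)
  n <- 5000
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$w <- rbinom(n, 1, plogis(d$x1))
  d$y <- d$x2 + d$w + rnorm(n)             # true ATE = 1

  e_hat <- fitted(glm(w ~ x1 + x2, family = binomial, data = d))
  mu1   <- predict(lm(y ~ x1 + x2, data = d[d$w == 1, ]), newdata = d)
  mu0   <- predict(lm(y ~ x1 + x2, data = d[d$w == 0, ]), newdata = d)

  tau_aipw <- mean(mu1 - mu0 +
                   d$w * (d$y - mu1) / e_hat -
                   (1 - d$w) * (d$y - mu0) / (1 - e_hat))
  tau_aipw                                  # close to 1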

SLIDE 12

Missing values

[Figure: percentage of missing values per variable (age, days_since_first_case_datetime, gender, num_hospitals, on_corticoids, period, asthma, cancer, chemotherapy_radiotherapy, chronic_hepatic_disease, chronic_obstructive_pulmonary_disease, chronic_respiratory_failure, diabetes, dyslipidemia, heart_arrhythmia, hematological_malignancies, hypertension, ischemic_heart_disease, kidney_disease, obesity, smoker, CREAT_value, CRP_value, PNN_value, LYM_value, TP_value, GDS_PaCO2_value, GDS_PaO2_value, weigh_kg, GDS_SAT_value, LDH_value, DDI_value), roughly 20-60%.]

SLIDE 13

Missing values

[Same figure as the previous slide: percentage of missing values per variable.]

Deleting rows with missing values? "One of the ironies of Big Data is that missing data play an ever more significant role" (R. Samworth, 2019).

For an n × p matrix in which each entry is missing independently with probability 0.01, a row is complete with probability 0.99^p:
p = 5 ⇒ ≈ 95% of rows kept; p = 300 ⇒ ≈ 5% of rows kept.
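A one-line check of that arithmetic in R:

  p <- c(5, 300)
  round(0.99^p, 3)   # 0.951 0.049: fraction of fully observed rows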

SLIDE 14

Missing (informative) values in the covariates

A straightforward - but often biased - solution is complete-case analysis.

  Covariates            Treatment   Outcome(s)
  X1     X2    X3       W           Y(0)       Y(1)
  NA     20    F        1           ?          Survived
  -6     45    NA       0           Dead       ?
  NA     …     M        1           ?          Survived
  NA     32    F        1           ?          Dead
  1      63    M        1           ?          Dead
  -2     NA    M        0           Survived   ?

→ Often not a good idea! What are the alternatives? Three families of methods, with different assumptions:

  • 1. Unconfoundedness despite missingness + (no) missing-values mechanism - Mayer, J., Wager, Sverdrup, Moyer & Gauss, AOAS 2020.
  • 2. Classical unconfoundedness + classical missing-values mechanisms.
  • 3. Latent unconfoundedness + classical missing-values mechanisms - Mayer, J., Raimundo & Vert, 2020.

SLIDE 15
1. Unconfoundedness despite missingness + (no) missing-values assumption

  Covariates            Treatment   Outcome(s)
  X1*    X2*   X3*      W           Y(0)   Y(1)
  NA     20    F        1           ?      S
  -6     45    NA       0           D      ?
  NA     …     M        1           ?      S
  NA     32    F        1           ?      D
  1      63    M        1           ?      D
  -2     NA    M        0           S      ?

Unconfoundedness, {Yi(1), Yi(0)} ⊥⊥ Wi | X, is not testable from the data.
⇒ Doctors give us the DAG (the covariates relevant both for the treatment decision and for predicting the outcome).

Unconfoundedness with missing values: {Yi(1), Yi(0)} ⊥⊥ Wi | X*,
where X* := (1 − R) ⊙ X + R ⊙ NA, with Rij = 1 if Xij is missing and 0 otherwise.
⇒ Doctors decide to treat a patient based on what they observe/record; we have access to the same information as the doctors.

SLIDE 16

Under 1: Double Robust with missing values

AIPW with missing values:

  τ̂* := (1/n) Σ_{i=1}^{n} [ µ̂*(1)(X*i) − µ̂*(0)(X*i) + Wi (Yi − µ̂*(1)(X*i)) / ê*(X*i) − (1 − Wi) (Yi − µ̂*(0)(X*i)) / (1 − ê*(X*i)) ]

  • Generalized propensity score (Rosenbaum and Rubin, 1984): e*(x*) := P(W = 1 | X* = x*). One model per pattern:

      e*(x*) = Σ_{r ∈ {0,1}^d} E[W | Xobs(r), R = r] 1_{R=r}

⇒ Supervised learning with missing values.¹

  • Mean imputation is consistent with a universally consistent learner.
  • Missing Incorporated in Attributes (MIA) for tree-based methods.

¹ Consistency of supervised learning with missing values. J., Prost, Scornet, Varoquaux. JMLR 2020.

SLIDE 17

Under 1: Double Robust with missing values


Implemented in the grf package: combine two non-parametric forest models (conditional outcome and treatment assignment), adapted to any missing values with MIA. τ̂*_AIPW is √n-consistent and asymptotically normal, provided the product of the RMSEs of the nuisance estimates decays as o(n^{−1/2}). (Mayer et al., AOAS 2020)
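A minimal sketch of this pipeline, assuming a grf version that accepts NA covariates via MIA splitting (the simulated data and names are ours):

  library(grf)

  set.seed(2)
  n <- 2000; p <- 4
  X <- matrix(rnorm(n * p), n, p)
  W <- rbinom(n, 1, plogis(X[, 1]))
  Y <- X[, 2] + W + rnorm(n)                       # true ATE = 1

  X_star <- X
  X_star[matrix(runif(n * p) < 0.3, n, p)] <- NA   # ~30% missing values

  cf <- causal_forest(X_star, Y, W)                # forests split on NA via MIA
  average_treatment_effect(cf)                     # AIPW estimate of tau, with SE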

SLIDE 18
2. Classical unconfoundedness + missing-values mechanisms

Aside on the taxonomy of missing-values mechanisms (Rubin, 1976)

[Figure: three scatterplots of Systolic Blood Pressure (SBP) against Gravity score (GCS), one per mechanism: MCAR, MAR, MNAR. Orange: missing values for Systolic Blood Pressure; the Gravity score (GCS) is always observed.]

MCAR (missing completely at random): the probability of being missing depends neither on SBP nor on gravity.
MAR (missing at random): the probability depends on gravity (we do not measure SBP for the most severe patients).
MNAR (missing not at random): the probability depends on SBP itself (low SBP values are not measured).
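The three mechanisms are easy to simulate; a hedged sketch in R (variable names ours):

  # Simulate MCAR / MAR / MNAR missingness on systolic blood pressure
  n <- 1000
  gravity <- runif(n, 3, 15)                 # always observed
  sbp     <- rnorm(n, 120, 20)

  p_mcar <- rep(0.3, n)                      # constant probability
  p_mar  <- plogis(gravity - 10)             # depends on observed gravity
  p_mnar <- plogis((100 - sbp) / 10)         # depends on sbp itself

  sbp_mar <- ifelse(runif(n) < p_mar, NA, sbp)   # e.g. the MAR version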

SLIDE 19

Under 2: Multiple Imputation

Consistency of IPW with missing values (Seaman and White, 2014)

Assume a Missing At Random (MAR) mechanism. Multiple imputation (MICE using (X*, W, Y)) followed by IPW on each imputed data set is consistent for Gaussian covariates and a logistic/linear treatment/outcome model.

  X1*   X2*   X3*   ...   W   Y
  NA    20    10    ...   1   survived
  -6    45    NA    ...   1   survived
  NA    30    …     ...   0   died
  NA    32    35    ...   0   survived
  -2    NA    12    ...   0   died
  1     63    40    ...   1   survived

1) Generate M plausible values for each missing value.

[Three imputed data sets: each NA is replaced by a different plausible draw, e.g. the missing X11 becomes 3, −7 and 7 across the three copies.]

2) Estimate the ATE on each imputed data set: τ̂m, Var(τ̂m).

3) Combine the results (Rubin's rules):

  τ̂ = (1/M) Σ_{m=1}^{M} τ̂m

  Var(τ̂) = (1/M) Σ_{m=1}^{M} Var(τ̂m) + (1 + 1/M) · (1/(M−1)) Σ_{m=1}^{M} (τ̂m − τ̂)²
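A hedged sketch of this recipe with the mice package ('dat' is a hypothetical data frame holding (X*, W, Y) with NAs in the covariates; only the point estimate is pooled here, and the full variance would combine the within- and between-imputation terms shown above):

  library(mice)

  M   <- 20
  imp <- mice(dat, m = M, printFlag = FALSE)      # 1) M imputed data sets

  ests <- sapply(1:M, function(m) {
    d <- complete(imp, m)
    e <- fitted(glm(W ~ . - Y, family = binomial, data = d))
    mean(d$W * d$Y / e - (1 - d$W) * d$Y / (1 - e))   # 2) IPW ATE per set
  })

  tau_hat <- mean(ests)                            # 3) Rubin's rule, point estimate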

SLIDE 20
3. Latent unconfoundedness + missing-values mechanisms

Latent confounding assumption: the covariates X are noisy (incomplete) proxies of the true latent confounders Z (Kallus et al., 2018; Louizos et al., 2017).

X* := (1 − R) ⊙ X + R ⊙ NA, with Rij = 1 if Xij is missing and 0 otherwise.
Observed outcome: Yi = WiYi(1) + (1 − Wi)Yi(0)

[DAG: latent confounders Z point to X, W and Y = {Y(0), Y(1)}; X and the mask R generate X*.]

SLIDE 21
3. Latent unconfoundedness + missing-values mechanisms

Matrix factorization as a pre-processing step

  • Assume the data are generated as a low-rank structure corrupted by noise. Estimate Z using matrix completion from X* (softImpute-type methods).
  • Plug Ẑ into a regression model of the outcome on treatment and confounders, Y = τW + Zβ + ε, ε ∼ N(0, σ²I) (or into the (A)IPW estimators), as sketched below.
  • Kallus et al. (2018) show that τ̂ is a consistent estimator of the Average Treatment Effect under MCAR.
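A hedged sketch of that pre-processing step (X_star, Y and W are hypothetical inputs; softImpute is one possible completion method):

  library(softImpute)

  q   <- 3                                              # assumed rank of Z
  fit <- softImpute(X_star, rank.max = q, lambda = 1)   # soft-thresholded SVD
  Z_hat <- fit$u %*% diag(fit$d, nrow = length(fit$d))  # estimated latent scores

  ols <- lm(Y ~ W + Z_hat)                              # Y = tau*W + Z*beta + eps
  coef(ols)["W"]                                        # plug-in ATE estimate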

SLIDE 22
3. Latent unconfoundedness + missing-values mechanisms

Latent confounding assumption: the covariates X (n × d) are proxies of the latent confounders Z (n × q).

X* := (1 − R) ⊙ X + R ⊙ NA, with Rij = 1 if Xij is missing and 0 otherwise.

[DAG: latent confounders Z point to X, W and Y = {Y(0), Y(1)}; X and the mask R generate X*.]

MissDeepCausal (MDC) - Mayer, J., Raimundo, Vert, 2020.

  • Assume a deep latent variable model instead of linear factor analysis.
  • Leverage a VAE with MAR values (Mattei and Frellsen, 2019): imputing NA with 0 maximizes an ELBO of the observed log-likelihood.
  • Draw (Z(j))_{1≤j≤B} from the posterior distribution P(Z | X*) (using importance sampling with Q(Z | X*) as proposal).
  • MDC-Multiple Imputation: estimate the ATE on each Z(j).
  • MDC-process: plug Ẑ(x*) := E[Z | X* = x*] into the classical estimators.

Flexible, with promising empirical results.

SLIDE 23

Methods to do causal inference with missing values

  Method             Covariates           Missingness        Unconfoundedness              Models for (W, Y)
                     mv normal  general   M(C)AR  general    Missing  Latent  Classical    logistic-linear  non-param.
  1. (SA)EM ²        ✓          ✗         ✓       ✗          ✓        ✗       ✗            ✓                ✗
  1. Mean.GRF        ✓          ✓         ✓       (✓)        ✓        ✗       ✗            ✓                ✓
  1. MIA.GRF         ✓          ✓         ✓       (✓)        ✓        ✗       ✗            ✓                ✓
  2. Mult. Imp.      ✓          ✓         ✓       ✗          (✗)      ✗       ✓            ✓                (✗)
  3. MatrixFact.     ✓          ✗         ✓       ✗          ✗        ✓       ✗            ✓                (✗)
  3. MissDeepCausal  ✓          ✓         ✓       ✗          ✗        ✓       ✗            ✓                ✓

Methods & assumptions on the data-generating process (models for covariates, outcome, treatment), missing-values mechanism and identifiability conditions.
✓: can be handled; ✗: not applicable in theory; (✓): empirical results and ongoing work on theoretical guarantees; (✗): no theoretical guarantees but heuristics.

² Use of EM algorithms for logistic regression with missing values: Jiang et al. (2020).

SLIDE 24

Simulations: no overall best performing method.

  • 10 covariates generated with a Gaussian mixture model: Xi | Ci = ci ∼ Nd(µ(ci), Σ(ci)), with C drawn from a multinomial distribution with three categories.
  • Unconfoundedness on the complete/observed covariates; 30% NA.
  • Logistic-linear model for (W, Y): logit(e(Xi·)) = αᵀXi·, Yi ∼ N(βᵀXi· + τWi, σ²).

Figure 1: ATE estimated with AIPW; true ATE τ = 1. [Boxplots for n = 100, 500, 1000, 5000 under two settings, unconfoundedness despite missingness and complete-data unconfoundedness; methods: mean.loglin, mice, mf, saem, grf.]

→ GRF-MIA is asymptotically unbiased under unconfoundedness despite missingness.
→ Multiple imputation requires many imputations to remove the bias.

SLIDE 25

Simulations: no overall best performing method.

  • 100 covariates generated with a DLVM, latent confounding (q = 3): Zi ∼ Nq(0, σz); covariates Xi sampled from Nd(µ(Zi), Σ(Zi)), where (µ(Z), Σ(Z)) = (V tanh(UZ + a) + b, diag{exp(ηᵀ tanh(UZ + a) + δ)}), with U, V, a, b, δ, η drawn from standard Gaussian and uniform distributions.
  • 30% MCAR, n = 1000.
  • Logistic-linear model for (W, Y): logit(e(Zi·)) = αᵀZi·, Yi ∼ N(βᵀZi· + τWi, σ²).

Figure 1: ATE estimated with AIPW; true ATE τ = 1. [Boxplots for the methods MDC.mi, MDC.process, MF, MICE, X, Z; estimates roughly between 0.6 and 1.2.]

→ MDC is empirically unbiased when the number of features d is much larger than the dimension q of the latent space.
Tuning: the variance of the prior on Z and q̂ are chosen by cross-validation using the ELBO.

SLIDE 26

Results for Covid Patients

33 covariates, 26 confounders, 4137 patients. ATE estimations (×100): effect of Hydroxychloroquine on 28-day mortality.

[Figure: "HCQ vs Nothing, ATE estimation (4137 patients)". y-axis: estimation approach (Matrix Facto.grf, GRF-MIA, MICE.grf, Matrix Facto.glm, MICE.glm); x-axis: ATE (×100) with confidence intervals, roughly from −50 to 50; solid: doubly robust AIPW, dotted: IPW.]

The obtained value corresponds to the difference in percentage points between the mortality rates in the treatment and control groups. Light blue: unadjusted (−5.3).

SLIDE 27

Conclusion and perspectives

Take-away messages

  • Missing attributes alter causal analyses. The performance of the methods depends on the underlying assumptions.

SLIDE 28

Conclusion and perspectives

Further details in the original papers:

Mayer, I., J., Wager, S., Sverdrup, E., Moyer, J.D. & Gauss, T. (2020). Doubly robust treatment effect estimation with missing attributes. Annals of Applied Statistics.

Mayer, I., J., Raimundo, F. & Vert, J.-P. (2020). MissDeepCausal: causal inference from incomplete data using deep latent variable models.

Sbidian, E. et al. (2020). Hydroxychloroquine with or without azithromycin and in-hospital mortality or discharge in patients hospitalized for COVID-19 infection: a cohort study of 4,642 in-patients in France.

Future work

  • Coupling of observational data and RCT data
  • Heterogeneous treatment effects
  • Architecture of neural nets with missing values
  • More with MNAR data

SLIDE 29

Missing value website

More information and details on missing values: the R-miss-tastic platform (Mayer et al., 2019).

→ Theoretical and practical tutorials, popular datasets, bibliography, workflows (in R and in Python), active contributors/researchers in the community, etc.: rmisstastic.netlify.com

Interested in contributing to our platform? Feel free to contact us!

SLIDE 30

MERCI (Thank you!)

SLIDE 31

References

SLIDE 32

References i

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1-C68.

Jiang, W., Josse, J., Lavielle, M., and the TraumaBase Group (2020). Logistic regression with missing covariates - parameter estimation, model selection and prediction within a joint-modeling framework. Computational Statistics & Data Analysis, 145:106907.

Kallus, N., Mao, X., and Udell, M. (2018). Causal inference with noisy and missing covariates via matrix factorization. In Advances in Neural Information Processing Systems, pages 6921-6932.

Louizos, C., Shalit, U., Mooij, J. M., Sontag, D., Zemel, R., and Welling, M. (2017). Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6446-6456.

Mattei, P.-A. and Frellsen, J. (2019). MIWAE: Deep generative modelling and imputation of incomplete data sets. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4413-4423. PMLR.

Mayer, I., Josse, J., Tierney, N., and Vialaneix, N. (2019). R-miss-tastic: a unified platform for missing values methods and workflows. arXiv preprint arXiv:1908.04822.

Rosenbaum, P. R. and Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387):516-524.

SLIDE 33

References ii

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581-592.

Seaman, S. and White, I. (2014). Inverse probability weighting with missing predictors of treatment assignment or missingness. Communications in Statistics - Theory and Methods, 43(16):3499-3515.

SLIDE 34

Observational data: non random assignment

⇒ Treatment assignment W depends on the covariates X: the distributions of the covariates of the treated and the controls are different.

SLIDE 35
1. Unconfoundedness despite missingness

Adapt the initial assumptions so that treatment assignment is unconfounded given only the observed covariates and the response pattern.

Notations: mask R ∈ {0, 1}^d, with Rij = 1 when Xij is missing and 0 otherwise; X* := (1 − R) ⊙ X + R ⊙ NA ∈ (ℝ ∪ {NA})^d.

Unconfoundedness despite missingness: {Yi(1), Yi(0)} ⊥⊥ Wi | X*i

CIT: Wi ⊥⊥ Xi | X*i, Ri   or   CIO: Yi(w) ⊥⊥ Xi | X*i, Ri for w ∈ {0, 1}

[Two DAGs, one per condition (CIT and CIO), each over X, X*, R, W, Y = {Y(0), Y(1)}.]

SLIDE 36

Mean imputation

  • (xi, yi) ∼ i.i.d. N₂((µx, µy), Σxy)

   X       Y
  -0.56   -1.93
  -0.86   -1.50
   ...     ...
   2.16    0.7
   0.16    0.74

[Scatterplot of the complete (X, Y) sample.]

True values: µy = 0, σy = 1, ρxy = 0.6. Estimates: µ̂y = −0.01, σ̂y = 1.01, ρ̂ = 0.66.

SLIDE 37

Mean imputation

  • (xi, yi) ∼ i.i.d. N₂((µx, µy), Σxy)
  • 70% of entries missing completely at random on Y

   X       Y
  -0.56    NA
  -0.86    NA
   ...     ...
   2.16    0.7
   0.16    NA

[Scatterplot of the observed (X, Y) pairs.]

True values: µy = 0, σy = 1, ρxy = 0.6. Estimates: µ̂y = 0.18, σ̂y = 0.9, ρ̂xy = 0.6.

SLIDE 38

Mean imputation

  • (xi, yi) ∼ i.i.d. N₂((µx, µy), Σxy)
  • 70% of entries missing completely at random on Y
  • Estimate the parameters on the mean-imputed data

   X       Y
  -0.56    0.01
  -0.86    0.01
   ...     ...
   2.16    0.7
   0.16    0.01

[Scatterplot of the mean-imputed data.]

True values: µy = 0, σy = 1, ρxy = 0.6. Estimates: µ̂y = 0.01, σ̂y = 0.5, ρ̂ = 0.30.

Mean imputation deforms the joint and marginal distributions.
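The deformation is easy to reproduce; a hedged sketch in R (our own simulation):

  # 70% MCAR on y, then mean imputation: sd and correlation collapse
  set.seed(3)
  n <- 1000
  x <- rnorm(n)
  y <- 0.6 * x + sqrt(1 - 0.6^2) * rnorm(n)    # mu_y = 0, sd_y = 1, rho = 0.6

  y_mis <- y
  y_mis[runif(n) < 0.7] <- NA

  y_imp <- ifelse(is.na(y_mis), mean(y_mis, na.rm = TRUE), y_mis)
  c(mean(y_imp), sd(y_imp), cor(x, y_imp))     # mean ok; sd and rho shrink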

SLIDE 39

Mean imputation is bad for estimation

[Figure: PCA individuals and variables factor maps on the ecological data, after mean imputation (Dim 1: 44.79%, Dim 2: 23.50%) and after EM-PCA imputation (Dim 1: 91.18%, Dim 2: 4.97%). Groups: alpine, boreal, desert, grass/m, temp_for, temp_rf, trop_for, trop_rf, tundra, wland; variables: LL, LMA, Nmass, Pmass, Amass, Rmass.]
  library(FactoMineR)
  PCA(ecolo)
  # Warning message: Missing values are imputed by the mean of the variable:
  # You should use the imputePCA function of the missMDA package

  library(missMDA)
  imp <- imputePCA(ecolo)
  PCA(imp$completeObs)

Ecological data³: n = 69000 species, 6 traits. Estimated correlation between Pmass & Rmass: ≈ 0 (mean imputation) vs. ≈ 1 (EM-PCA).

³ Wright, I. et al. (2004). The worldwide leaf economics spectrum. Nature.

SLIDE 40

Imputation methods

  • Imputation by regression takes the relationship into account: estimate β, impute ŷi = β̂0 + β̂1 xi ⇒ variance underestimated and correlation overestimated.
  • Imputation by stochastic regression: estimate β and σ, impute from the predictive distribution yi ∼ N(xi β̂, σ̂²) ⇒ preserves the distributions. Here β̂ and σ̂² are estimated on the complete data, but the MLE can be obtained with EM. (A sketch follows the figure below.)

[Figure: scatterplots of mean imputation, regression imputation and stochastic regression imputation of Y given X.]

                µ̂y     σ̂y     ρ̂xy
  True          0      1       0.6
  Mean          0.01   0.5     0.30
  Regression    0.01   0.72    0.78
  Stochastic    0.01   0.99    0.59

SLIDE 41

Imputation methods for multivariate data

Assuming a joint model

  • Gaussian distribution: xi· ∼ N(µ, Σ) (Amelia; Honaker, King, Blackwell)
  • Low rank: Xn×d = µn×d + ε, with εij ∼ iid N(0, σ²) and µ of low rank k (softImpute, Hastie & Mazumder; missMDA, J. & Husson)
  • Latent class / nonparametric Bayesian (dpmpm, Reiter)
  • Deep learning using variational autoencoders (MIWAE; Mattei, 2018)

Using conditional models (joint implicitly defined)

  • With logistic, multinomial, Poisson regressions (mice, van Buuren)
  • Iteratively impute each variable by random forests (missForest, Stekhoven)

Imputation for categorical, mixed, block/multilevel data⁴, etc. ⇒ Missing-values platform⁵ (J., Mayer, Tierney, Vialaneix)

⁴ J., Husson, Robin & Narasimhan (2018). Imputation of mixed data with multilevel SVD.
⁵ https://rmisstastic.netlify.com/

SLIDE 42

Mean imputation consistent

Learning on the mean-imputed training data, imputing the test set with the same means, and predicting is optimal if the missing data are MAR and the learning algorithm is universally consistent.

Framework and assumptions:

  • Y = f(X) + ε
  • X = (X1, ..., Xd) has a continuous density g > 0 on [0, 1]^d
  • ‖f‖∞ < ∞
  • Missing data MAR on X1: R1 ⊥⊥ X1 | X2, ..., Xd
  • (x2, ..., xd) ↦ P[R1 = 1 | X2 = x2, ..., Xd = xd] is continuous
  • ε is a centered noise independent of (X, R1)

(The result remains valid when missing values occur for variables X1, ..., Xj.)

SLIDE 43

Mean imputation consistent

Learning on the mean-imputed training data, imputing the test set with the same means, and predicting is optimal if the missing data are MAR and the learning algorithm is universally consistent.

Mean-imputed entry x′ = (x′1, x2, ..., xd), with x′1 = x1 1_{R1=0} + E[X1] 1_{R1=1};
X̃ = X ⊙ (1 − R) + NA ⊙ R (takes values in ℝ ∪ {NA}).

Theorem: prediction with mean imputation is equal to the Bayes function almost everywhere,

  f⋆_impute(x′) = E[Y | X̃ = x̃]

Other values than the mean are OK but use the same value for the train and test sets, otherwise the algorithm may fail as the distributions differ
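A hedged sketch of that train/test discipline (X_train and X_test are hypothetical incomplete matrices):

  # Impute train AND test with the means learned on the training set
  mu_train <- colMeans(X_train, na.rm = TRUE)

  impute_with <- function(X, mu) {
    for (j in seq_len(ncol(X))) X[is.na(X[, j]), j] <- mu[j]
    X
  }

  X_train_imp <- impute_with(X_train, mu_train)
  X_test_imp  <- impute_with(X_test,  mu_train)   # same constants, not test means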

SLIDE 44

Consistency of supervised learning with NA: Rationale


  • Use a specific, systematic value, like a code for "missing".
  • The learner detects the code and recognizes it at test time.
  • With categorical data, just code "Missing".
  • With continuous data, use any constant out of range.
  • Needs a lot of data (asymptotic result) and a powerful learner.
[Figure: train and test samples, with the out-of-range constant coding the missing entries.]

Mean imputation is not bad for prediction; it is consistent, despite its drawbacks for estimation. Useful in practice!

SLIDE 46

Consistency: 40% missing values MCAR

[Figure: explained variance as a function of sample size (10³ to 10⁵) for three problems (linear, high noise; Friedman, high noise; non-linear, low noise) and three learners (decision tree, random forest, XGBoost). Methods compared: surrogates (rpart), mean imputation, Gaussian imputation, MIA, Bayes rate, block (XGBoost).]

SLIDE 47

End-to-end learning with missing values

[Diagram: a learner f̂ is fit on an incomplete (Xtrain, Ytrain) and used to predict Ŷtest from an incomplete Xtest.]

  • Random forests: a powerful learner.
  • Trees are well suited for empirical risk minimization with missing values: they handle the half-discrete data X* that takes values in ℝ ∪ {NA}.

SLIDE 48

CART (Breiman, 1984)

Built recursively by splitting the current cell into two children: find the feature j⋆ and the threshold z⋆ which minimize the (quadratic) loss

  (j⋆, z⋆) ∈ argmin_{(j,z) ∈ S} E[ (Y − E[Y | Xj ≤ z])² · 1_{Xj ≤ z} + (Y − E[Y | Xj > z])² · 1_{Xj > z} ]

[Diagram: recursive partition of the (X1, X2) cell: first split at X1 ≤ 3.3 vs. X1 > 3.3, then at X2 ≤ 1.5 vs. X2 > 1.5.]


SLIDE 51

CART with missing values

  X1   X2   Y
  1    2    …
  NA   3    …
  NA   4    …

[Diagram: root node.]

SLIDE 52

CART with missing values

[Diagram: root node split into X1 ≤ s1 vs. X1 > s1.]

1) Select the variable and the threshold on the observed data⁶:

  (j⋆, z⋆) ∈ argmin_{(j,z)} E[ (Y − E[Y | Xj ≤ z, Rj = 0])² · 1_{Xj ≤ z, Rj = 0} + (Y − E[Y | Xj > z, Rj = 0])² · 1_{Xj > z, Rj = 0} ]

⁶ Variable selection bias (not a problem for prediction): partykit package, Hothorn et al.

SLIDE 53

CART with missing values

[Diagram: the tree after both splits: X1 ≤ s1 vs. X1 > s1, then X2 ≤ s2 vs. X2 > s2.]

2) How to propagate the observations (2 & 3) with missing values?

  • Probabilistic split: Bernoulli(#L / (#L + #R)) (RWeka)
  • Block: send them all to one side, chosen to minimize the error (xgboost, lightgbm)
  • Surrogate split: search for another variable that gives a close partition (rpart)


SLIDE 54

Missing Incorporated in Attributes (MIA) (Twala et al., 2008)

One step: select the variable and the threshold, and propagate the missing values. Use missingness to make the best possible splits:

  f⋆ ∈ argmin_{f ∈ P_c,miss} E[(Y − f(X*))²], where P_c,miss = P_c,miss,L ∪ P_c,miss,R ∪ P_c,miss,sep with

  1. P_c,miss,L → {{X*j ≤ z ∨ X*j = NA}, {X*j > z}}
  2. P_c,miss,R → {{X*j ≤ z}, {X*j > z ∨ X*j = NA}}
  3. P_c,miss,sep → {{X*j ≠ NA}, {X*j = NA}}

  • Missing values are treated like a category (well suited to handle ℝ ∪ {NA}).
  • Good for informative patterns (R explains Y).
  • Implementation trick: duplicate the incomplete columns, and replace the missing entries once by +∞ and once by −∞ (J. Tibshirani).⁷ (See the sketch below.)

Target model/pattern: E[Y | X*] = Σ_{r ∈ {0,1}^d} E[Y | Xobs(r), R = r] 1_{R=r}

Does not require the missing data to be MAR.

⁷ Implemented in conditional forests (partykit), generalized random forests (grf), scikit-learn.
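A hedged base-R sketch of that duplication trick (mia_expand is our own illustrative helper):

  # MIA via column duplication: NA goes left on one copy, right on the other
  mia_expand <- function(X_star) {
    cols <- lapply(seq_len(ncol(X_star)), function(j) {
      xj <- X_star[, j]
      if (!anyNA(xj)) return(matrix(xj, ncol = 1))
      lo <- hi <- xj
      lo[is.na(lo)] <- -Inf      # this copy sends NA to the left child
      hi[is.na(hi)] <-  Inf      # this copy sends NA to the right child
      cbind(lo, hi)
    })
    do.call(cbind, cols)
  }

  # Any standard tree learner fit on mia_expand(X_star) can then choose,
  # split by split, which side the missing entries go to.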