 
Imbalance in representation space ► We do not want treatment groups to be identical ► Treatment group imbalance: $p_\Phi^{T=0}(x) \neq p_\Phi^{T=1}(x)$ [Figure: covariates $x$ mapped to representations $\Phi(x)$, shown for control ($T = 0$) and treated ($T = 1$) units.]
Integral probability metric penalty ► Regularizer to improve counterfactual estimation ► Penalize treatment distributional distance in representation space [Figure: network architecture. $x \to \Phi$, then two heads selected by $T$: if $T = 1$, head $h_1$ with loss $L(h_1(\Phi), Y(1))$; if $T = 0$, head $h_0$ with loss $L(h_0(\Phi), Y(0))$; penalty $\mathrm{IPM}_G(\hat{p}_\Phi^{T=0}, \hat{p}_\Phi^{T=1})$ on the representation.] ► Integral Probability Metrics (IPM) such as Wasserstein distance and MMD. With $G$ a function family: $\mathrm{IPM}_G(p_0, p_1) = \sup_{g \in G} \left| \int_{\mathcal{S}} g(s)\,(p_0(s) - p_1(s))\,ds \right|$
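As a rough illustration of the penalty above, the sketch below estimates one particular IPM, the squared MMD with a Gaussian RBF kernel, between control and treated representations. The data, the bandwidth and the function names (`rbf_kernel`, `mmd_squared`) are illustrative assumptions, not taken from the slides; the learned representation $\Phi$ is replaced by synthetic points.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(rep_control, rep_treated, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    k_cc = rbf_kernel(rep_control, rep_control, bandwidth)
    k_tt = rbf_kernel(rep_treated, rep_treated, bandwidth)
    k_ct = rbf_kernel(rep_control, rep_treated, bandwidth)
    return k_cc.mean() + k_tt.mean() - 2.0 * k_ct.mean()

# Synthetic stand-ins for Phi(x) in each treatment group (assumption).
rng = np.random.default_rng(0)
phi_control = rng.normal(0.0, 1.0, size=(200, 2))
phi_treated_far = rng.normal(1.5, 1.0, size=(200, 2))   # strongly imbalanced
phi_treated_near = rng.normal(0.1, 1.0, size=(200, 2))  # nearly balanced

imbalance_far = mmd_squared(phi_control, phi_treated_far)
imbalance_near = mmd_squared(phi_control, phi_treated_near)
print(imbalance_far, imbalance_near)
```

In a training loop this quantity would be added to the factual loss as a regularizer; here it is only evaluated on fixed samples.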
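Individual-level treatment effect generalization bound ► Factual per-treatment group prediction error: $\epsilon_F^{T=0} = \int_{\mathcal{X}} (\hat{Y}(0) - Y(0))^2\, p^{T=0}(x)\,dx$ and $\epsilon_F^{T=1} = \int_{\mathcal{X}} (\hat{Y}(1) - Y(1))^2\, p^{T=1}(x)\,dx$ ► Precision in Estimation of Heterogeneous Effects¹: with $\widehat{\mathrm{CATE}}_{\Phi,h}(x) = h(\Phi(x), 1) - h(\Phi(x), 0)$, $\epsilon_{\mathrm{CATE}}(\Phi, h) = \int_{\mathcal{X}} (\widehat{\mathrm{CATE}}_{\Phi,h}(x) - \mathrm{CATE}(x))^2\, p(x)\,dx$ ► Theorem 1: $\epsilon_{\mathrm{CATE}}(\Phi, h) \le 2\left[\epsilon_F^{T=0}(\Phi, h) + \epsilon_F^{T=1}(\Phi, h) + B_\Phi\, \mathrm{IPM}_G(p_\Phi^{T=1}, p_\Phi^{T=0})\right]$ (effect error ≤ prediction error + treatment group distance) ¹ Hill, Journal of Computational and Graphical Statistics 2011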
► Theorem 1: $\epsilon_{\mathrm{CATE}} \le 2\left[\epsilon_F^{T=0}(\Phi, h) + \epsilon_F^{T=1}(\Phi, h) + B_\Phi\, \mathrm{IPM}_G(p_\Phi^{T=1}, p_\Phi^{T=0})\right]$ (effect error ≤ prediction error + treatment group distance) • Problem with Theorem 1: too loose when we have overlap + infinite samples • We should be able to achieve the prediction error itself on either group
Trading off accuracy for balance ► Our full architecture learns a representation $\Phi(x)$, a re-weighting $w_t(x)$ and hypotheses $h_t(\Phi)$ to trade off between the re-weighted loss $w\ell$ and imbalance between re-weighted representations [Figure: architecture diagram with blocks labeled Context $x$, Repres. $\Phi$, Treatment $t$, Hypotheses $h_0$, $h_1$, Weighted loss $w\ell$, DNN, Weighting $w$, and Imbalance $\mathrm{IPM}(w_0\, p_\Phi^{t=0}, w_1\, p_\Phi^{t=1})$.]
Individual-treatment effect generalization bound ► Theorem 2*: (Representation learning) $\epsilon_{\mathrm{CATE}} \le 2 \sum_{t \in \{0,1\}} \left[ \epsilon_t^{w_t}(\Phi, h) + B_\Phi\, \mathrm{IPM}_G\left(p_\Phi^{1-t}(x),\, w_t\, p_\Phi^{t}(x)\right) \right]$ (effect risk ≤ re-weighted factual loss + imbalance of re-weighted representations) ► Letting $\Phi(x) = x$, and $w_t(x)$ be inverse propensity weights, we recover the classic result ► Minimizing a weighted loss and IPM converges to the representation and hypothesis that minimize CATE error *Extension to finite samples available
Evaluating Individual Treatment Effect (CATE) Estimates ► No ground truth, similar to off-policy evaluation in reinforcement learning ► Requires either: ► Knowledge of the true outcome (synthetic) ► Knowledge of treatment assignment policy (e.g. a randomized controlled trial) ► Our framework has proven effective in both settings
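A minimal sketch of the first option (synthetic outcomes): when both potential-outcome means are known by construction, the CATE MSE of any estimator can be computed directly. The outcome functions and the two toy estimators below are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(-1.0, 1.0, size=n)

# Hypothetical synthetic potential-outcome means, known by construction.
mu0 = np.sin(x)
mu1 = np.sin(x) + 1.0 + 0.5 * x
true_cate = mu1 - mu0  # equals 1 + 0.5 * x

def cate_mse(cate_estimate, cate_truth):
    """Mean squared error between estimated and true individual effects."""
    return float(np.mean((cate_estimate - cate_truth) ** 2))

# Two toy estimators (assumptions): one predicts the average effect for
# everyone, the other is heterogeneous with a slightly wrong slope.
constant_estimate = np.full(n, true_cate.mean())
linear_estimate = 1.0 + 0.4 * x

mse_constant = cate_mse(constant_estimate, true_cate)
mse_linear = cate_mse(linear_estimate, true_cate)
print(mse_constant, mse_linear)
```

With real observational data only one potential outcome per unit is observed, so this computation is not available; that is the point of the slide.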
IHDP Benchmark¹ ► The Infant Health and Development Program (IHDP) ► Studied the effects of home visits and other interventions ► Real covariates and treatment, synthesized outcome ► Overlap is not satisfied (by design) ► Used to evaluate MSE in CATE prediction ¹ Hill, JCGS, 2011
Empirical results ► BART, Bayesian Additive Regression Trees, are state-of-the-art baselines ► Standard neural networks competitive ► Shared representation learning with ERM halves the MSE on IHDP² ► Minimizing upper bounds on risk, including $d_{\mathcal{H}}$, further reduces the MSE
Method                                   CATE MSE
BART¹                                    2.3 ± 0.1
Neural net                               2.0 ± 0.0
Shared rep.²                             1.0 ± 0.0
Shared rep. + invariance²                0.8 ± 0.0
Shared rep. + invariance + weighting³    0.7 ± 0.0
¹ Hill, JCGS, 2011, ² S., Johansson, Sontag. ICML, 2017, ³ Johansson, Kallus, S., Sontag. arXiv, 2018
Intermediate conclusions ► ML is well understood when test data ≈ training data ► Learning individualized policies from observational data requires going beyond test ≈ train ► Fewer/worse guarantees when assumptions are violated
Outline • ML for causal inference • Causal inference for ML • Off-policy evaluation in a partially observable Markov decision process • Robust learning for unsupervised covariate shift “Off-Policy Evaluation in Partially Observable Environments”, Tennenholtz, Mannor, S AAAI 2020
Healthcare with time-varying decisions • Physicians make ongoing decisions: treat, see change in patient's state, modify treatment, and so on [Figure: doctor and patient]
Healthcare with time-varying decisions • Maps very well to reinforcement learning paradigm Figure: Shweta Bhatt
Reinforcement learning (RL) and causal inference ► From causal inference perspective: • RL usually assumes we can intervene directly • → mostly about how to experiment optimally in a dynamic environment ► From RL perspective: • Causal inference usually deals with cases we cannot intervene directly • Causal inference usually focuses on single point-in-time actions • → mostly about off-policy evaluation of a simple policy such as “treat everyone”
A meeting point of RL and causal inference • When performing off-policy evaluation of data from (i) a dynamic environment with ongoing actions, (ii) while we possibly do not have access to the same data as the agent • Example: learning from records of physicians treating patients in an intensive care unit (ICU) • Mistakes were made: applying RL to observational intensive care unit data without considering hidden confounders or overlap (common support / positivity) (see “Guidelines for Reinforcement Learning in Healthcare” Gottesman et al. 2019) • In RL nomenclature, hidden confounding can be described by a Partially Observable Markov Decision Process (POMDP)
Partially Observable Markov Decision Process (POMDP): some formalism
POMDP causal graph
Symbol   Causal name                       RL name                          Example
$u_t$    confounder (possibly “hidden”)    state (possibly “unobserved”)    information available to the doctor
$a_t$    action, treatment                 action                           medications, procedures…
$r_t$    outcome                           reward                           mortality
$\pi_b$  treatment assignment process      behavioral policy                the way doctors treat patients
$z_t$    proxy variable                    observation                      electronic health record
POMDP causal graph • Observe data from $\pi_b$, with $u_t$ unobserved
Our goal: evaluate a new policy $\pi_e$ given data from $\pi_b$ • Observe data from $\pi_b$, with $u_t$ unobserved… • Evaluate a proposed policy $\pi_e(z_t)$ in terms of policy value (discounted over a finite horizon) • Why a function of $z_t$? Because $u_t$ is unobserved • How to evaluate $\pi_e(z_t)$ given only observations from $\pi_b$, with $u_t$ unobserved? • This is a problem anyone trying to create optimal dynamic treatment policies with observational data must address
Our goal: evaluate a new policy $\pi_e$ given data from $\pi_b$ • Observe data from $\pi_b$, with $u_t$ unobserved… • Evaluate a proposed policy $\pi_e(z_t)$ in terms of policy value (discounted over a finite horizon) • Denote $p^{\pi_b}(a, b, c, \ldots \mid d, e, f, \ldots)$ probabilities from observed behavioral policy • Can sample from this distribution • Denote $p^{\pi_e}(a, b, c, \ldots \mid d, e, f, \ldots)$ probabilities from targeted evaluation policy • Cannot sample from this distribution
Our goal: evaluate a new policy $\pi_e$ given data from $\pi_b$ • Observing data from $\pi_b$, with $u_t$ unobserved, evaluate a proposed policy $\pi_e(z_t)$ in terms of policy value (discounted over a finite horizon) • Without further assumptions: IMPOSSIBLE • Example: ICU doctors treating sicker patients more aggressively • Impossible even when conditioning on entire observable history $z_1, a_1, r_1, \ldots, z_T, a_T, r_T$ • Due to hidden confounding by $u_t$ • But much harder: confounder ↔ action dynamics
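The ICU example can be made concrete with a single-step simulation, sketched below under invented numbers: a hidden severity $u$ drives both the treatment choice and the outcome, so the naive treated-versus-untreated contrast has the wrong sign. The adjusted estimate stratifies on $u$, which is possible here only because the simulation exposes it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hidden confounder: 1 = sicker patient (all numbers are assumptions).
u = rng.binomial(1, 0.5, size=n)
# Behavior policy: doctors treat sicker patients more aggressively.
a = rng.binomial(1, np.where(u == 1, 0.9, 0.1))
# Outcome: sickness hurts (-0.5), treatment helps (+0.1).
r = 0.8 - 0.5 * u + 0.1 * a + rng.normal(0.0, 0.1, size=n)

# Naive contrast using only observed actions and outcomes.
naive_effect = r[a == 1].mean() - r[a == 0].mean()

# Oracle adjustment: stratify on u, then average over its distribution.
adjusted_effect = 0.0
for level in (0, 1):
    stratum = u == level
    diff = r[stratum & (a == 1)].mean() - r[stratum & (a == 0)].mean()
    adjusted_effect += diff * stratum.mean()

print(naive_effect, adjusted_effect)
```

The naive estimate comes out near -0.3 while the true effect built into the simulation is +0.1; in the sequential setting the same bias compounds across time steps.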
Proxies and negative controls • Miao, Geng, & Tchetgen Tchetgen. “Identifying causal effects with proxy variables of an unmeasured confounder.” Biometrika (2018) [Figure: causal graph over $z$, $u$, $w$, $a$, $r$] • Only $u$ is unobserved • Goal: identify the causal effect of $a$ on $r$ • $z \perp\!\!\!\perp w \mid u$ • In general: impossible • New identification condition: matrices $M_{ij}(a) = p(w = i \mid z = j, a = a)$ are invertible for all $a$ • Requires $w$ and $z$ to be discrete with as many categories as discrete $u$
Our goal: evaluate a new policy $\pi_e$ given data from $\pi_b$ • Assume $z_t$ are discrete with at least as many categories as $u_t$ (untestable from data) • Let $(M_t(a))_{ij} = p^{\pi_b}(z_t = i \mid z_{t-1} = j, a_t = a)$ • Theorem: If $M_t(a)$ are all invertible then we can evaluate the value of a proposed policy $\pi_e(z_t)$ given observational data gathered under $\pi_b$, without observing $u_t$ • Future and past observations $z_t$ are conditionally independent proxies for unobserved $u_t$ • Invertibility example: if $z_t$ are binary, then a sufficient condition for invertibility of $M_t(a)$ is $p(z_t = 1 \mid z_{t-1} = 1, a) \neq p(z_t = 1 \mid z_{t-1} = 0, a)$
Assumptions 1. Assume $z_t$ are discrete with at least as many categories as $u_t$ 2. Matrices $(M_t(a))_{ij} = p^{\pi_b}(z_t = i \mid z_{t-1} = j, a_t = a)$ are invertible for all $a$ and $t$ • Allow off-policy evaluation for a class of POMDPs • No need to measure or even know what $u_t$ is • As usual in causal inference, some of the assumptions are unverifiable from data
Assumptions 1. Assume $z_t$ are discrete with at least as many categories as $u_t$ 2. Matrices $(M_t(a))_{ij} = p^{\pi_b}(z_t = i \mid z_{t-1} = j, a_t = a)$ are invertible for all $a$ and $t$ • Observed sequence $\tau = (z_0, a_0, \ldots, z_T, a_T) \in \mathcal{T}_T$ • $(N_t(a))_{ij} = p^{\pi_b}(z_t = i, z_{t-1} \mid z_{t-2} = j, a_{t-1} = a)$ • $W_t(\tau) = M_t(a_t)^{-1} N_t(a_{t-1})$ • $(Q_0(\tau))_i = \sum_j (M_0(a_0)^{-1})_{ij}\, p^{\pi_b}(z_0 = j)$ • $\Omega(\tau) = \prod_{t=0}^{T} W_t(\tau) \cdot Q_0(\tau)$ • $\Lambda_e(\tau) = \prod_{t=0}^{T} \pi_e(a_t \mid z_0, a_0, \ldots, z_{t-1}, a_{t-1}, z_t)$ • Then: $p^{\pi_e}(r_t) = \sum_{\tau \in \mathcal{T}_t} \Lambda_e(\tau)\, p^{\pi_b}(r_t, z_t \mid a_t, z_{t-1})\, \Omega(\tau)$
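The invertibility assumption on the observation matrices can be probed empirically. The sketch below simulates one transition of a toy binary POMDP (all probabilities are invented), estimates the matrix with entries $p(z_t = i \mid z_{t-1} = j, a_t = a)$ from counts, and reports its determinant and condition number. For binary observations the determinant equals $p(z_t = 1 \mid z_{t-1} = 1, a) - p(z_t = 1 \mid z_{t-1} = 0, a)$, so invertibility reduces to those two probabilities being different. The helper names `flip` and `estimate_m` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

def flip(bits, prob):
    """Flip each binary entry independently with probability prob."""
    return np.where(rng.random(bits.shape) < prob, 1 - bits, bits)

# Toy dynamics (assumptions): sticky hidden state, noisy observations,
# and a behavior policy that depends on the hidden state.
u_prev = rng.binomial(1, 0.5, size=n)
z_prev = flip(u_prev, 0.2)
u_now = flip(u_prev, 0.1)
z_now = flip(u_now, 0.2)
a_now = rng.binomial(1, np.where(u_now == 1, 0.7, 0.3))

def estimate_m(z_prev, a, z, action, n_obs=2):
    """Estimate m[i, j] = p(z_t = i | z_{t-1} = j, a_t = action) from counts."""
    m = np.zeros((n_obs, n_obs))
    for j in range(n_obs):
        rows = (z_prev == j) & (a == action)
        for i in range(n_obs):
            m[i, j] = np.mean(z[rows] == i)
    return m

m_by_action = {
    action: estimate_m(z_prev, a_now, z_now, action) for action in (0, 1)
}
for action, m in m_by_action.items():
    print(action, np.linalg.det(m), m[1, 1] - m[1, 0], np.linalg.cond(m))
```

A determinant near zero, or a large condition number, signals that the inverse used in the evaluation formula would amplify estimation noise.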
Off-policy POMDP evaluation • The above evaluation requires estimating the inverses of many conditional probability tables • Scales poorly statistically • We introduce another causal model called decoupled POMDP • Similar causal graph • Significantly reduces the dimensions and improves the condition number of the estimated inverse matrices
Decoupled POMDP
Off-policy POMDP evaluation • The above evaluation requires estimating the inverses of many conditional probability tables • Scales poorly statistically • We introduce another causal model called decoupled POMDP • Similar causal graph • Significantly reduces the dimensions and improves the condition number of the estimated inverse matrices • Current challenge: scaling to realistic health data
Outline • ML for causal inference • Causal inference for ML • Off-policy evaluation in a partially observable Markov decision process • Robust learning for unsupervised covariate shift “Robust learning with the Hilbert-Schmidt independence criterion”, Greenfeld & S arXiv:1910.00270
Classic non-causal tasks in machine learning: many success stories • Classification • ImageNet • MNIST • TIMIT (sound) • Sentiment analysis • Prediction • Which patients will die? • Which users will click? • (under current practice)
Failures of ML classification models • Test set ≠ train set, but we know humans succeed here
How to learn models which are robust to a-priori unknown changes in test distribution? • Source distribution $P_S(X, Y)$ • Learn model that works well on unknown target distributions $P_T(X, Y) \in \mathcal{Q}$ [Figure: source $P_S$ and the set $\mathcal{Q}$ of possible targets]
How to learn models which are robust to a-priori unknown changes in test distribution? • Source distribution $P_S(X, Y)$ • Learn model that works well on all target distributions $P_T(X, Y) \in \mathcal{Q}$ • What is $\mathcal{Q}$? • We assume covariate shift: for all $P_T(X, Y) \in \mathcal{Q}$, $P_T(Y \mid X) = P_S(Y \mid X)$ • Further restrictions on $\mathcal{Q}$ to follow • Covariate shift is easy if learning $P_S(Y \mid X)$ is easy • Focus on tasks where it’s hard
Unsupervised covariate shift • A model that works well even when the underlying distribution of instances changes • Works as long as $P(Y \mid X)$ is stable • When does this happen?
Causal mechanisms are stable
Learning with an independence criterion • $X$ causes $Y$, structural causal model: $Y = f^*(X) + \epsilon$, $\epsilon \perp\!\!\!\perp X$ • $f^*(x)$ is the mechanism tying $X$ to $Y$ • $\epsilon$ is independent additive noise • Therefore, $Y - f^*(X) \perp\!\!\!\perp X$ • Mooij, Janzing, Peters & Schölkopf (2009): Learn structure of causal models by learning functions $f$ such that $Y - f(X)$ is approximately independent of $X$ • Need a non-parametric measure of independence • Hilbert-Schmidt independence criterion, HSIC
Hilbert-Schmidt independence criterion: HSIC • Let $X$, $Y$ be two metric spaces with a joint distribution $P(X, Y)$ • $\mathcal{F}$ and $\mathcal{G}$ are reproducing kernel Hilbert spaces on $X$ and $Y$ induced by kernels $K(\cdot,\cdot)$ and $L(\cdot,\cdot)$ respectively • $\mathrm{HSIC}(X, Y)$ measures the degree of dependence between $X$ and $Y$ • Empirical version: sample $(x_1, y_1), \ldots, (x_n, y_n)$. Denote (some abuse of notation) $K$ the $n \times n$ kernel matrix on $X$, $L$ the $n \times n$ kernel matrix on $Y$ • $\widehat{\mathrm{HSIC}}(X, Y; \mathcal{F}, \mathcal{G}) = \frac{1}{(n-1)^2}\, \mathrm{tr}(KHLH)$, where $H$ is a centering matrix, $H_{ij} = \delta_{ij} - \frac{1}{n}$
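The empirical estimator above translates almost line by line into NumPy. The sketch below assumes Gaussian RBF kernels with bandwidth 1 for both variables (the slide does not fix a kernel) and compares a dependent pair with an independent one; the function names are illustrative.

```python
import numpy as np

def rbf_gram(v, bandwidth=1.0):
    """n x n Gaussian RBF kernel matrix for a sample of scalars or vectors."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    sq = np.sum(v**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * v @ v.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def hsic(x, y, bandwidth=1.0):
    """Empirical HSIC: tr(K H L H) / (n - 1)^2 with H = I - (1/n) 11^T."""
    n = len(x)
    k_x = rbf_gram(x, bandwidth)
    k_y = rbf_gram(y, bandwidth)
    h = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(k_x @ h @ k_y @ h)) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y_dependent = x**2 + 0.1 * rng.normal(size=300)  # nonlinear dependence on x
y_independent = rng.normal(size=300)             # drawn independently of x

hsic_dep = hsic(x, y_dependent)
hsic_ind = hsic(x, y_independent)
print(hsic_dep, hsic_ind)
```

The dependent pair has zero linear correlation in expectation, yet its HSIC value is clearly larger, which is why a non-parametric measure is needed.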
Learning with HSIC • Hypothesis class $\mathcal{H}$ • Classic learning for loss $\ell$, e.g. squared loss: $\min_{h \in \mathcal{H}} \mathbb{E}[\ell(Y, h(X))]$ • Learning with HSIC (Mooij et al., 2009): $\min_{h \in \mathcal{H}} \mathrm{HSIC}(X, Y - h(X); \mathcal{F}, \mathcal{G})$
Learning with HSIC • Learning with HSIC (Mooij et al., 2009): $\min_{h \in \mathcal{H}} \mathrm{HSIC}(X, Y - h(X); \mathcal{F}, \mathcal{G})$ • Recall: $Y - f^*(X) \perp\!\!\!\perp X$ • If objective equals 0 then $h^*(X) = f^*(X) + b$ for some constant $b$ • Can learn up to an additive bias term
Learning with HSIC • Learning with HSIC (Mooij et al., 2009): $\min_{h \in \mathcal{H}} \mathrm{HSIC}(X, Y - h(X); \mathcal{F}, \mathcal{G})$ • Differentiable with respect to $h(X)$ • We optimize with SGD using mini-batches to approximate HSIC
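A small sketch of the HSIC objective on a one-parameter hypothesis class $h(x) = s \cdot x$. It swaps the SGD optimizer mentioned on the slide for a plain grid search over the slope $s$, to stay dependency-free, and then checks that shifting the residuals by a constant leaves the loss unchanged, which is the "up to an additive bias" statement. The data-generating mechanism and the RBF kernels are assumptions made for illustration.

```python
import numpy as np

def rbf_gram(v, bandwidth=1.0):
    """n x n Gaussian RBF kernel matrix for a 1-D sample."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    sq = np.sum(v**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * v @ v.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def hsic(x, y, bandwidth=1.0):
    """Empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = len(x)
    h = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(rbf_gram(x, bandwidth) @ h @ rbf_gram(y, bandwidth) @ h)) / (n - 1) ** 2

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
noise = rng.normal(0.0, 0.3, size=n)  # independent of x
y = 1.5 * x + 0.7 + noise             # hypothetical mechanism f*(x) = 1.5 x + 0.7

# HSIC loss of the residual y - s * x for each candidate slope s.
slopes = np.linspace(0.0, 3.0, 31)
losses = [hsic(x, y - s * x) for s in slopes]
best_slope = float(slopes[int(np.argmin(losses))])

# A constant shift of the residuals does not change the loss.
loss_no_bias = hsic(x, y - 1.5 * x)
loss_shifted = hsic(x, y - 1.5 * x - 10.0)
print(best_slope, loss_no_bias, loss_shifted)
```

Because the loss cannot see the intercept, a separate step (for example matching the mean of the predictions to the mean of $Y$ on the source data) would be needed to fix the bias term.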
Theoretical results • Learnability: minimizing HSIC-loss over a sample leads to generalization • Robustness: minimizing HSIC-loss leads to tightly-bounded error in unsupervised covariate shift • If the density ratio $P_{\mathrm{target}} / P_{\mathrm{source}}$ is “nice” in the sense of low RKHS norm.
Experiments – rotated MNIST (Heinze-Deml & Meinshausen 2017) • Train on ordinary MNIST • Test on MNIST rotated uniformly at random [-45°,45°] [Figure: accuracy (axis from 60 to 100) on source and target data for CNN and MLP architectures (2x256, 2x524, 2x1024, 4x256, 4x512, 4x1024), comparing two training schemes: HSIC and cross entropy.]
Outline • ML for causal inference • Causal inference for ML • Off-policy evaluation in a partially observable Markov decision process • Robust learning for unsupervised covariate shift
Summary • Machine learning for causal inference: • Individual-level treatment effects from observational data - robustness to treatment assignment process • Using recently proposed “negative controls” to create the first off-policy evaluation scheme for POMDPs, with past and future in the role of the controls • Learning models robust against unknown covariate shift
Thank you to all my collaborators! • Fredrik Johansson (Chalmers) • David Sontag (MIT) • Nathan Kallus (Cornell-Tech) • Guy Tennenholtz (Technion) • Shie Mannor (Technion) • Daniel Greenfeld (Technion)
Even estimating average effects from observational data is hard! Do we believe we can estimate individual-level effects? • Causal identification assumptions: • Hidden confounding: No unmeasured factors that affect both treatment and outcome • Common support: $T = 1$ and $T = 0$ populations should be similar • Accurate effect estimates: be able to approximate $\mathbb{E}[Y \mid x, T = t]$
Even estimating average effects from observational data is hard! Do we believe we can estimate individual-level effects? • Causal identification assumptions: • Hidden confounding • Common support • Accurate effect estimates • We focus on tasks where we hope we can address all three concerns • And still be useful • Designing for causal identification
You have condition A. Treatment options are T=0, T=1
Obviously, give T=0 No need for algorithmic decision support
Obviously, give T=0 Obviously, give T=1 No need for algorithmic decision support
Obviously, give T=0 Obviously, give T=1 I’m not so sure…