Machine learning and causal inference: a two-way road Uri Shalit - - PowerPoint PPT Presentation



SLIDE 1

Machine learning and causal inference: a two-way road

Uri Shalit, Technion – Israel Institute of Technology
DATAIA Seminar, Paris, January 2020

SLIDE 2

What is causality?

SLIDE 3

A big question!

SLIDE 4

Extremely short intro to causality

(in the context of statistics and learning)

  • Aspirin caused my headache to disappear
  • The car crashed because it didn’t brake in time
  • The students succeeded because of the new teacher
SLIDE 5

Extremely short intro to causality
(in the context of statistics and learning)

  • Aspirin caused my headache to disappear
  • Had I not taken Aspirin, I would still have had the headache
  • The car crashed because it didn’t brake in time
  • Had the car braked in time, it wouldn’t have crashed
  • The students succeeded because of the new teacher
  • Had the students remained with the old teacher, they wouldn’t have succeeded


SLIDE 7

Extremely short intro to causality
(in the context of statistics and learning)

  • Aspirin caused my headache to disappear
  • Had I not taken Aspirin, I would still have had the headache
  • The car crashed because it didn’t brake in time
  • Had the car braked in time, it wouldn’t have crashed
  • The students succeeded because of the new teacher
  • Had the students remained with the old teacher, they wouldn’t have succeeded

Counterfactuals: imagine a world where everything is the same except the “cause”

SLIDE 8

Counterfactuals

  • Often phrased in terms of imagined interventions
  • Never directly observable – we need a causal model
  • The “counterfactual world” is sometimes statistically identical to observed reality, for example in Randomized Controlled Trials
SLIDE 10

Outline

  • ML for causal inference
  • Causal inference for ML
  • Off-policy evaluation in a partially observable Markov decision process
  • Robust learning for unsupervised covariate shift
SLIDE 11

Causal effect inference questions

  • Which medication will make patients better?
  • Which economic policy will lower unemployment?
  • The effects of actions on outcomes
SLIDE 12

Causal effect inference from observational data

  • Which medication will make patients better?
  • Infer from medical records
  • Which economic policy will lower unemployment?
  • Infer from past economic measurement
  • The effects of actions on outcomes
SLIDE 13

Causal inference from observational data – confounding

  • Which medication will make patients better?
  • Infer from medical records
  • Maybe younger/wealthier/female/… patients tend to receive medication A over B?
  • Which economic policy will lower unemployment?
  • Infer from past economic measurements
  • Maybe the policy was enacted in better past economic times?
SLIDE 14

This part is based on work with Fredrik Johansson (MIT → Chalmers), Nathan Kallus (Cornell) and David Sontag (MIT)

(i) Johansson, F., Shalit, U., & Sontag, D. (2016). Learning representations for counterfactual inference. In International Conference on Machine Learning.
(ii) Shalit, U., Johansson, F., & Sontag, D. (2017). Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning.
(iii) Johansson, F., Kallus, N., Shalit, U., & Sontag, D. (2020). Generalization bounds and representation learning for estimation of potential outcomes and causal effects. arXiv preprint arXiv:2001.07426.
SLIDE 15

Anna, May 15
Age = 54, Gender = Female, Race = Asian, Blood sugar = 7.7%, WBC count = 6.8×10⁹/L, Temperature = 36.7 °C, Blood pressure = 150/95

Our goal: Conditional Average Treatment Effect (CATE)
SLIDE 16

Anna, May 15
Age = 54, Gender = Female, Race = Asian, Blood sugar = 7.7%, WBC count = 6.8×10⁹/L, Temperature = 36.7 °C, Blood pressure = 150/95

May 15 → Sep. 15: blood pressure = ? under treatment T = 0, giving Y(0); blood pressure = ? under treatment T = 1, giving Y(1)

Y(0), Y(1): potential outcomes (Rubin–Neyman causal model)

Our goal: Conditional Average Treatment Effect (CATE)
SLIDE 17

X: patient features
CATE(X) := E[Y(1) − Y(0) ∣ X]

Y(0), Y(1): potential outcomes (Rubin–Neyman causal model)

Our goal: Conditional Average Treatment Effect (CATE)

SLIDE 18

X: patient features
CATE(X) := E[Y(1) − Y(0) ∣ X]

  • We never directly observe CATE
  • We only see either Y(1) or Y(0)
  • The choice is not random
  • How to estimate the CATE function?

Y(0), Y(1): potential outcomes (Rubin–Neyman causal model)
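The fundamental problem above can be made concrete in a few lines. This is a toy simulation, not from the talk: writing X for a feature, T for treatment, and Y(0)/Y(1) for potential outcomes, we generate both outcomes, hide one of them via a non-random assignment, and see the naive comparison go wrong.

```python
import numpy as np

# Toy illustration: simulate both potential outcomes y0, y1 for n
# patients with a single confounder x, then hide one of them according
# to a non-random treatment assignment. All numbers are illustrative.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                        # patient feature (confounder)
y0 = x + rng.normal(scale=0.1, size=n)        # outcome without treatment
y1 = x + 1.0 + rng.normal(scale=0.1, size=n)  # outcome with treatment

true_ate = (y1 - y0).mean()                   # ~1.0 by construction

# Assignment depends on x: higher-x patients are treated more often
t = (rng.uniform(size=n) < 1 / (1 + np.exp(-2 * x))).astype(int)
y_obs = np.where(t == 1, y1, y0)              # only one outcome observed

# Naive difference of observed group means is confounded by x
naive = y_obs[t == 1].mean() - y_obs[t == 0].mean()
print(true_ate, naive)                        # naive estimate is biased upward
```

Because the treated group has systematically larger x, the naive contrast mixes the treatment effect with the confounder's effect – exactly the "choice is not random" point on the slide.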

SLIDE 19

Estimating potential outcomes

  • Outcomes under treatment and control: Y(1), Y(0) ∈ ℝ
  • Treatments T ∈ {0, 1}; observed outcome Y = T·Y(1) + (1 − T)·Y(0)
  • Confounders X ∈ ℝᵈ
  • Conditional effect (CATE): τ(X) := E[Y(1) − Y(0) ∣ X]

Only one outcome is observed for any one patient!

SLIDE 20

Observational datasets: rheumatoid arthritis

► Historical records of treatments and outcomes

Patient | Age | Prior disease activity | Observed treatment (T) | Disease activity (Y)
Anna    | 54  | High                   | A                      | High
Calvin  | 52  | High                   | A                      | Low
John    | 48  | Low                    | B                      | Low
Peter   | 60  | Low                    | B                      | High

SLIDE 21

Observational datasets: rheumatoid arthritis

► Unobserved counterfactual outcomes – outcomes under alternative treatments

Patient | Age | Prior disease activity | Disease activity (A) | Disease activity (B)
Anna    | 54  | High                   | High                 | ?
Calvin  | 52  | High                   | Low                  | ?
John    | 48  | Low                    | ?                    | Low
Peter   | 60  | Low                    | ?                    | High

(X = covariates, Y(0) and Y(1) = the two potential outcomes)

SLIDE 22

Estimating potential outcomes

[Figure: mortality vs. age. Fit the control outcome E[Y(0) ∣ X] and the treated outcome E[Y(1) ∣ X]; the effect of treatment τ(x) is the gap between the two curves.]

SLIDE 23

Estimating potential outcomes

[Figure: mortality vs. age, with the treated group and control group occupying different regions of the age axis. Group distributions p₀(X) := p(X ∣ T = 0) and p₁(X) := p(X ∣ T = 1); outcome curves Y(0) and Y(1) with effect τ.]
SLIDE 24

Estimating the counterfactual for the treated

[Figure: mortality vs. age; the control outcome curve E[Y(0) ∣ X] must be extrapolated into the region occupied by the treated group.]
SLIDE 25

Formalizing sufficient assumptions

1. Ignorability (no unmeasured confounders): “Patients with similar X respond similarly”
   ∀t : Y(t) ⫫ T ∣ X
2. Overlap: “Similar patients with different treatments exist”
   ∀t, x : p(T = t ∣ X = x) > 0
3. SUTVA: “No patient–patient interference”
4. Consistency: “We observe Y(t) for patients with T = t”
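Of these, overlap is the one most amenable to a quick empirical sanity check. A hypothetical sketch (my own, not from the talk): bin a confounder x and flag populated bins where only one treatment arm ever appears, which is the empirical footprint of an overlap violation.

```python
import numpy as np

# Crude empirical overlap check: bin the confounder x and count
# populated bins that contain only one treatment arm.
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
t = (rng.uniform(size=n) < 1 / (1 + np.exp(-x))).astype(int)  # stochastic policy

def count_single_arm_bins(x, t, edges, min_count=20):
    idx = np.digitize(x, edges)
    bad = 0
    for b in np.unique(idx):
        arm = t[idx == b]
        # a populated bin where everyone (or no one) is treated
        if len(arm) >= min_count and arm.mean() in (0.0, 1.0):
            bad += 1
    return bad

edges = np.linspace(-2, 2, 9)
violations = count_single_arm_bins(x, t, edges)        # stochastic -> overlap holds
t_bad = (x > 0).astype(int)                            # deterministic assignment
violations_bad = count_single_arm_bins(x, t_bad, edges)
print(violations, violations_bad)
```

Under the stochastic policy every bin mixes both arms; under the deterministic rule every bin is single-arm, so no amount of data lets us compare like with like there.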

SLIDE 26

Take-aways

1. These are strong assumptions that don’t always hold
2. Even when they do, estimation is still challenging

SLIDE 27

Classical view

  • Causal estimation often focused on parameter estimation
  • E.g., assume: Y = βᵀX + τ·T + ε (Y the observed outcome, τ the treatment effect). Goal: find τ!
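A toy version of this classical view, with illustrative numbers of my own: generate data from Y = βᵀX + τ·T + ε and recover the treatment-effect parameter τ by ordinary least squares on [X, T].

```python
import numpy as np

# Simulate the linear model Y = beta^T X + tau*T + eps and estimate
# the treatment-effect coefficient tau by least squares.
rng = np.random.default_rng(2)
n, d = 2000, 3
X = rng.normal(size=(n, d))
beta, tau = np.array([0.5, -1.0, 2.0]), 1.5
T = (rng.uniform(size=n) < 0.5).astype(float)         # randomized treatment
y = X @ beta + tau * T + rng.normal(scale=0.1, size=n)

# Regress y on [X, T]; the last coefficient is the estimate of tau
design = np.column_stack([X, T])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print("estimated tau:", coef[-1])
```

The point of the slide is that everything here rides on the assumed parametric form; the ML view that follows drops it in favor of a prediction-error criterion.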

SLIDE 28

Machine learning view

  • Causal estimation often focused on parameter estimation
  • E.g., assume: Y = βᵀX + τ·T + ε (Y the observed outcome, τ the treatment effect). Goal: find τ!
  • ML view: find a prediction τ̂ of τ = Y(1) − Y(0) with small error L(τ̂, τ):

    τ̂* = argmin_{τ̂ ∈ F} E[L(τ̂, τ)] = argmin_{τ̂ ∈ F} E[(τ̂(X) − τ(X))²]

slide-29
SLIDE 29

► Treatment is assigned unfirmly at random: 𝑞 𝑈 = 1

𝑌 = 𝑄 𝑈 = 1

► Here: every dot is a unit, color indicates observed treatment ► Predict outcome under unobserved treatment

Easier: Randomized Controlled Trials (RCT)

𝑦L 𝑦b

Control, 𝑈 = 0 Treated, 𝑈 = 1

“Training set” distribution = “Test set” distribution

slide-30
SLIDE 30

► In randomized control trials, there is no confounding – just do regression! ► New architecture for estimating counterfactuals and CATE ► One “head” per potential outcome – avoids washing away treatment ► Shared representation layers Φ 𝑦 for sample efficiency

Neural network architecture: TARNet

(Treatment-Agnostic Representation Network)

𝑈 𝑦

𝑀(ℎL(Φ), 𝑍(1))

Φ

ℎK

𝑀 ℎK Φ , 𝑍(0)

… … …

ℎL

𝑗𝑔 𝑈 = 0 𝑗𝑔 𝑈 = 1
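The wiring of the two-headed architecture can be sketched in a few lines of numpy. This is a minimal forward pass with random, untrained weights, meant only to show how the factual loss is routed to the head matching each unit's observed treatment; it is not the talk's implementation.

```python
import numpy as np

# Minimal TARNet-style forward pass: shared representation Phi(x),
# two outcome heads h0/h1, factual loss routed by observed treatment.
rng = np.random.default_rng(3)
d, k = 5, 8                              # input dim, representation dim
W_phi = rng.normal(size=(d, k)) * 0.1    # shared representation layer
w_h0 = rng.normal(size=k) * 0.1          # control head
w_h1 = rng.normal(size=k) * 0.1          # treated head

def tarnet_forward(x):
    phi = np.maximum(x @ W_phi, 0.0)     # ReLU representation Phi(x)
    return phi @ w_h0, phi @ w_h1        # predicted Y(0), Y(1)

x = rng.normal(size=(4, d))
t = np.array([0, 1, 1, 0])
y = rng.normal(size=4)

y0_hat, y1_hat = tarnet_forward(x)
y_factual = np.where(t == 1, y1_hat, y0_hat)   # only the factual head gets a loss
loss = np.mean((y_factual - y) ** 2)
cate_hat = y1_hat - y0_hat                     # per-unit CATE estimate
print(loss, cate_hat)
```

Both heads always produce predictions, so a CATE estimate is available for every unit, but each training example only supervises the head of its observed arm.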

slide-31
SLIDE 31

► Predict outcome under unobserved treatment ► Treatment is not assigned equally at random: 𝑞 𝑈 = 1

𝑌 ≠ 𝑄 𝑈 = 1

► There is a non-negligible difference between treatment group distributions

Observational studies: test ≠ train

𝑒 Control, 𝑈 = 0 Treated, 𝑈 = 1

Example: A difference in means “Treated tend to be younger”

𝑦L 𝑦b

slide-32
SLIDE 32

► Learn a representation Φ of the data that makes it more like an RCT ► A shared representation helps identify meaningful interactions ► Penalize the distributional distance between treatment groups

New type of bias-variance tradeoff

Representation learning

Φ(𝑦)L

Φ(𝑦)b

Φ 𝑦

Representation space

𝑦L 𝑦b

Original space

Control, 𝑈 = 0 Treated, 𝑈 = 1 Control, 𝑈 = 0 Treated, 𝑈 = 1

slide-33
SLIDE 33

► We do not want treatment groups to be identical

Imbalance in representation space

Φ(𝑦)L

Φ(𝑦)b 𝑞j

klL 𝑦 ≠ 𝑞j klK 𝑦

Φ 𝑦

𝑦L 𝑦b Treatment group imbalance

Control, 𝑈 = 0 Treated, 𝑈 = 1

SLIDE 34

Integral probability metric penalty

► Regularizer to improve counterfactual estimation
► Penalize the treatment distributional distance in representation space
► Integral Probability Metrics (IPM) such as the Wasserstein distance and MMD

[Architecture as in TARNet, with an added penalty IPM_G(p̂_Φ^{t=0}, p̂_Φ^{t=1}) on the representation Φ]

With G a function family:

IPM_G(p₀, p₁) = sup_{g∈G} | ∫_S g(s) (p₀(s) − p₁(s)) ds |
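One simple member of the IPM family mentioned on the slide is the MMD; with a linear kernel its squared value is just the distance between the group mean representations. A small illustrative sketch (my own toy data):

```python
import numpy as np

# Squared linear-kernel MMD between treated/control representations:
# ||mean(phi_t0) - mean(phi_t1)||^2. Large when the groups differ.
rng = np.random.default_rng(4)
phi_t0 = rng.normal(loc=0.0, size=(200, 8))   # control representations
phi_t1 = rng.normal(loc=1.0, size=(200, 8))   # treated representations, shifted

def mmd2_linear(a, b):
    """Squared MMD with a linear kernel: ||mean(a) - mean(b)||^2."""
    diff = a.mean(axis=0) - b.mean(axis=0)
    return float(diff @ diff)

penalty = mmd2_linear(phi_t0, phi_t1)         # imbalanced groups -> large
balanced = mmd2_linear(rng.normal(size=(200, 8)),
                       rng.normal(size=(200, 8)))  # same distribution -> small
print("IPM penalty:", penalty, "balanced:", balanced)
```

In training, this quantity would be added (weighted) to the factual prediction loss, pushing Φ toward representations whose treated and control distributions match; richer function families G give the Wasserstein or kernelized MMD versions.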


SLIDE 36

Individual-level treatment effect generalization bound

► Precision in Estimation of Heterogeneous Effects¹ (PEHE):

   CATÊ_{Φ,h}(x) = h(Φ(x), 1) − h(Φ(x), 0)
   ε_PEHE(Φ, h) = ∫ ( CATÊ_{Φ,h}(x) − CATE(x) )² p(x) dx

► Factual per-treatment-group prediction errors:

   ε_F^{t=0}(Φ, h) = ∫ ( h(Φ(x), 0) − Y(0) )² p^{t=0}(x) dx
   ε_F^{t=1}(Φ, h) = ∫ ( h(Φ(x), 1) − Y(1) )² p^{t=1}(x) dx

► Theorem 1:

   ε_PEHE(Φ, h) ≤ 2 [ ε_F^{t=0}(Φ, h) + ε_F^{t=1}(Φ, h) + C_Φ · IPM_G(p_Φ^{t=1}, p_Φ^{t=0}) ]

   (effect error ≤ prediction error + treatment group distance)

¹Hill, Journal of Computational and Graphical Statistics 2011

SLIDE 37

► Theorem 1:

   ε_PEHE ≤ 2 [ ε_F^{t=0}(Φ, h) + ε_F^{t=1}(Φ, h) + C_Φ · IPM_G(p_Φ^{t=1}, p_Φ^{t=0}) ]

   (effect error ≤ prediction error + treatment group distance)

  • Problem with Theorem 1: too loose when we have overlap + infinite samples
  • We should be able to achieve the prediction error itself on either group

SLIDE 38

Trading off accuracy for balance

► Our full architecture learns a representation Φ(x), a re-weighting w_t(x), and hypotheses h_t(Φ) to trade off between the re-weighted loss w·ℓ and the imbalance between the re-weighted representations

[Architecture: context x → representation Φ (DNN) → hypotheses h₀, h₁ and weighting w; the objective combines the weighted loss w·ℓ with the imbalance term IPM(w₀ p_Φ^{t=0}, w₁ p_Φ^{t=1}); the treatment t selects the head]
SLIDE 39

Individual treatment effect generalization bound

► Theorem 2* (representation learning):

   ε_PEHE ≤ 2 Σ_{t∈{0,1}} [ ε_w^{t}(Φ, h) + C_Φ · IPM_G(p_Φ^{1−t}(x), w_t · p_Φ^{t}(x)) ]

   (effect risk ≤ re-weighted factual loss + imbalance of re-weighted representations)

► Letting Φ(x) = x and w_t(x) be inverse propensity weights, we recover a classic result
► Minimizing the weighted loss and IPM converges to the representation and hypothesis that minimize the CATE error

*Extension to finite samples available
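The special case mentioned above – identity representation plus inverse propensity weights – is the classic IPW estimator. An illustrative sketch with synthetic data of my own, using the true propensities for clarity (in practice they would be estimated):

```python
import numpy as np

# IPW (Horvitz-Thompson style) estimate of the average treatment effect
# under confounded assignment, versus the naive group-mean difference.
rng = np.random.default_rng(5)
n = 20000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                  # propensity p(T=1 | x)
t = (rng.uniform(size=n) < e).astype(float)
y = x + 2.0 * t + rng.normal(scale=0.1, size=n)   # true effect = 2

# Re-weight each observed outcome by the inverse of its assignment prob.
ate_ipw = np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))
naive = y[t == 1].mean() - y[t == 0].mean()
print(ate_ipw, naive)
```

Re-weighting removes the bias that confounded assignment injects into the naive contrast; the theorem's more general form lets the representation and weights be learned jointly instead of fixed to x and 1/e.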

SLIDE 40

Evaluating Individual Treatment Effect (CATE) Estimates

► No ground truth – similar to off-policy evaluation in reinforcement learning

SLIDE 41

Evaluating Individual Treatment Effect (CATE) Estimates

► No ground truth – similar to off-policy evaluation in reinforcement learning
► Requires either:
  ► Knowledge of the true outcome (synthetic)
  ► Knowledge of the treatment assignment policy (e.g. a randomized controlled trial)

SLIDE 42

Evaluating Individual Treatment Effect (CATE) Estimates

► No ground truth – similar to off-policy evaluation in reinforcement learning
► Requires either:
  ► Knowledge of the true outcome (synthetic)
  ► Knowledge of the treatment assignment policy (e.g. a randomized controlled trial)
► Our framework has proven effective in both settings

SLIDE 43

IHDP Benchmark¹

► The Infant Health and Development Program (IHDP)
► Studied the effects of home visits and other interventions
► Real covariates and treatment, synthesized outcome
► Overlap is not satisfied (by design)
► Used to evaluate MSE in CATE prediction

¹Hill, JCGS, 2011

SLIDE 44

Empirical results

Method                                  | CATE MSE
BART¹                                   | 2.3 ± 0.1
Neural net                              | 2.0 ± 0.0
Shared rep.²                            | 2.1 ± 1.1
Shared rep. + invariance²               | 1.9 ± 1.1
Shared rep. + invariance + weighting³   | 1.8 ± 1.1

► BART (Bayesian Additive Regression Trees) is a state-of-the-art baseline
► Standard neural networks are competitive
► Shared representation learning with ERM halves the MSE on IHDP²
► Minimizing upper bounds on the risk, including the IPM term, further reduces the MSE

¹Hill, JCGS, 2011; ²Shalit, Johansson, Sontag. ICML, 2017; ³Johansson, Kallus, Shalit, Sontag. arXiv, 2018

SLIDE 45

Intermediate conclusions

► ML is well understood when test data ≈ training data
► Learning individualized policies from observational data requires going beyond test ≈ train
► Fewer/worse guarantees when assumptions are violated

SLIDE 47

Outline

  • ML for causal inference
  • Causal inference for ML
  • Off-policy evaluation in a partially observable Markov decision process
  • Robust learning for unsupervised covariate shift

“Off-Policy Evaluation in Partially Observable Environments”, Tennenholtz, Mannor, Shalit, AAAI 2020

SLIDE 48

Healthcare with time-varying decisions

  • Physicians make ongoing decisions: treat, see a change in the patient’s state, modify treatment, and so on

[Figure: doctor–patient interaction loop]

SLIDE 49

Healthcare with time-varying decisions

  • Maps very well to the reinforcement learning paradigm

Figure: Shweta Bhatt

SLIDE 54

Reinforcement learning (RL) and causal inference

From the causal inference perspective:
  • RL usually assumes we can intervene directly
  • → mostly about how to experiment optimally in a dynamic environment

From the RL perspective:
  • Causal inference usually deals with cases where we cannot intervene directly
  • Causal inference usually focuses on single point-in-time actions
  • → mostly about off-policy evaluation of a simple policy such as “treat everyone”

SLIDE 55

A meeting point of RL and causal inference

  • When performing off-policy evaluation of data from
    i. a dynamic environment with ongoing actions,
    ii. while we possibly do not have access to the same data as the agent
  • Example: learning from records of physicians treating patients in an intensive care unit (ICU)
  • Mistakes were made: applying RL to observational intensive care unit data without considering hidden confounders or overlap (common support / positivity)
    (see “Guidelines for Reinforcement Learning in Healthcare”, Gottesman et al. 2019)
  • In RL nomenclature, hidden confounding can be described by a Partially Observable Markov Decision Process (POMDP)
SLIDE 56

Partially Observable Markov Decision Process (POMDP): some formalism

SLIDE 57

POMDP causal graph

Causal name                              | RL name                       | Example
u_t: confounder (possibly “hidden”)      | state (possibly “unobserved”) | information available to the doctor
a_t: action, treatment                   | action                        | medications, procedures…
r_t: outcome                             | reward                        | mortality
π_b: treatment assignment process        | behavioral policy             | the way doctors treat patients
z_t: proxy variable                      | observation                   | electronic health record

SLIDE 62

POMDP causal graph

  • Observe data from π_b, with u_t unobserved

SLIDE 63

Our goal: evaluate a new policy π_e given data from π_b

  • Observe data from π_b, with u_t unobserved…
  • Evaluate a proposed policy π_e(z_t) in terms of policy value (discounted over a finite horizon)
  • Why a function of z_t? Because u_t is unobserved
  • How to evaluate π_e(z_t) given only observations from π_b, with u_t unobserved?
  • This is a problem anyone trying to create optimal dynamic treatment policies with observational data must address

SLIDE 64

Our goal: evaluate a new policy π_e given data from π_b

  • Observe data from π_b, with u_t unobserved…
  • Evaluate a proposed policy π_e(z_t) in terms of policy value (discounted over a finite horizon)
  • Denote by P_{π_b}(a, b, c, … ∣ d, e, f, …) probabilities under the observed behavioral policy – we can sample from this distribution
  • Denote by P_{π_e}(a, b, c, … ∣ d, e, f, …) probabilities under the targeted evaluation policy – we cannot sample from this distribution

SLIDE 65

Our goal: evaluate a new policy π_e given data from π_b

  • Observing data from π_b, with u_t unobserved, evaluate a proposed policy π_e(z_t) in terms of policy value (discounted over a finite horizon)
  • Without further assumptions: IMPOSSIBLE
  • Example: ICU doctors treating sicker patients more aggressively
  • Impossible even when conditioning on the entire observable history z₁, a₁, r₁, …, z_T, a_T, r_T
  • Due to hidden confounding by u_t
  • But much harder than the static case: confounder ↔ action dynamics
SLIDE 66

Proxies and negative controls

  • Miao, Geng, & Tchetgen Tchetgen. “Identifying causal effects with proxy variables of an unmeasured confounder.” Biometrika (2018)
  • Only u is unobserved; z and w are proxies with z ⫫ w ∣ u
  • Goal: identify the causal effect of a on r
  • In general: impossible
  • New identification condition: the matrices M_{ij}(a) = P(w = i ∣ z = j, a) are invertible for all a
  • Requires w and z to be discrete with at least as many categories as the discrete u

[Causal graph over u, a, r, z, w]

SLIDE 67

Our goal: evaluate a new policy π_e given data from π_b

  • Assume the z_t are discrete with at least as many categories as u_t (untestable from data)
  • Let M_{jk,t}(a) = P_{π_b}(z_t = j ∣ z_{t−1} = k, a_t = a)
  • Theorem: if the M_t(a) are all invertible, then we can evaluate the value of a proposed policy π_e(z_t) given observational data gathered under π_b, without observing u_t
  • Future and past observations z_t are conditionally independent proxies for the unobserved u_t

Invertibility example: if the z_t are binary, then a sufficient condition for invertibility of M_t(a) is
P(z_t = 1 ∣ z_{t−1} = 1, a) ≠ P(z_t = 1 ∣ z_{t−1} = 0, a)
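The binary invertibility example above is easy to verify numerically: arrange the conditional probabilities as a 2×2 column-stochastic matrix and check its determinant, which works out to exactly the difference between the two probabilities. A small sketch (probability values are my own illustrative picks):

```python
import numpy as np

# M[i, j] = P(z_t = i | z_{t-1} = j, a), columns sum to 1.
# det(M) = p0 - p1, so M is invertible exactly when the two
# conditional probabilities differ, matching the slide's condition.
def proxy_matrix(p1, p0):
    """p1 = P(z_t=1 | z_{t-1}=1, a), p0 = P(z_t=1 | z_{t-1}=0, a)."""
    return np.array([[1 - p1, 1 - p0],
                     [p1,     p0]])

M_good = proxy_matrix(0.9, 0.3)   # probabilities differ -> invertible
M_bad = proxy_matrix(0.6, 0.6)    # equal -> singular, proxy uninformative

print(np.linalg.det(M_good), np.linalg.det(M_bad))
```

A singular matrix means the proxy carries no information about the hidden state transition, so the off-policy value cannot be recovered from it.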

SLIDE 68

Assumptions

1. The z_t are discrete with at least as many categories as u_t
2. The matrices M_{jk,t}(a) = P_{π_b}(z_t = j ∣ z_{t−1} = k, a_t = a) are invertible for all a and t

  • This allows off-policy evaluation for a class of POMDPs
  • No need to measure, or even know what, u_t is
  • As usual in causal inference, some of the assumptions are unverifiable from data

SLIDE 69

Assumptions

1. The z_t are discrete with at least as many categories as u_t
2. The matrices M_{jk,t}(a) = P_{π_b}(z_t = j ∣ z_{t−1} = k, a_t = a) are invertible for all a and t

  • Observed sequence τ = (z₀, a₀, …, z_T, a_T)
  • O_{jk,t}(a) = P_{π_b}(z_t = j, z_{t−1} = z_{t−1}^τ ∣ z_{t−2} = k, a_{t−1} = a)
  • W_t(τ) = M_t(a_{t−1})⁻¹ · O_t(a_{t−1})
  • R_{j,0}(τ) = Σ_k [M₀(a₀)⁻¹]_{jk} · P_{π_b}(z₀ = k)
  • Ω(τ) = ( Π_{t=0}^{T} W_t(τ) ) · R₀(τ)
  • Λ_e(τ) = Π_{t=0}^{T} π_e(a_t ∣ z₀, a₀, …, z_{t−1}, a_{t−1}, z_t)
  • Then: P_{π_e}(r_t) = Σ_τ Λ_e(τ) · P_{π_b}(r_t, z_t ∣ a_t, z_{t−1}) · Ω(τ)

SLIDE 70

Off-policy POMDP evaluation

  • The above evaluation requires estimating the inverses of many conditional probability tables
  • Scales poorly statistically
  • We introduce another causal model, called the decoupled POMDP
  • Similar causal graph
  • Significantly reduces the dimensions and improves the condition number of the estimated inverse matrices

SLIDE 71

Decoupled POMDP

SLIDE 72

Off-policy POMDP evaluation

  • Current challenge: scaling to realistic health data
SLIDE 74

Outline

  • ML for causal inference
  • Causal inference for ML
  • Off-policy evaluation in a partially observable Markov decision process
  • Robust learning for unsupervised covariate shift

“Robust learning with the Hilbert–Schmidt independence criterion”, Greenfeld & Shalit, arXiv:1910.00270

SLIDE 75

Classic non-causal tasks in machine learning: many success stories

  • Classification
  • ImageNet
  • MNIST
  • TIMIT (sound)
  • Sentiment analysis
  • Prediction
  • Which patients will die?
  • Which users will click?
  • (under current practice)
SLIDE 76

Failures of ML classification models

SLIDE 77

Failures of ML classification models

test set ≠ train set, but we know humans succeed here

SLIDE 78

How to learn models which are robust to a-priori unknown changes in the test distribution?

  • Source distribution p_S(X, Y)
  • Learn a model that works well on unknown target distributions p_T(X, Y) ∈ 𝒯

[Figure: the source p_S sits inside a set 𝒯 of possible targets]
SLIDE 79

How to learn models which are robust to a-priori unknown changes in the test distribution?

  • Source distribution p_S(X, Y)
  • Learn a model that works well on all target distributions p_T(X, Y) ∈ 𝒯
  • What is 𝒯?
  • We assume covariate shift: for all p_T(X, Y) ∈ 𝒯, p_T(Y ∣ X) = p_S(Y ∣ X)
  • Further restrictions on 𝒯 to follow
  • Covariate shift is easy if learning p_S(Y ∣ X) is easy
  • Focus on tasks where it’s hard
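Covariate shift is easy to manufacture in a toy setting of my own (the sine mechanism and the shift amount are illustrative): keep the labeling rule p(Y ∣ X) fixed, move only p(X), and watch a model that merely fits the source region fail on the target region.

```python
import numpy as np

# p(y|x) is fixed (y = sin(x) + noise); only p(x) shifts between
# source and target. A linear fit on the source extrapolates badly.
rng = np.random.default_rng(6)

def mechanism(x):                       # the stable conditional p(y|x)
    return np.sin(x) + rng.normal(scale=0.05, size=len(x))

x_src = rng.normal(loc=0.0, scale=0.5, size=2000)   # source covariates
x_tgt = rng.normal(loc=2.0, scale=0.5, size=2000)   # shifted target covariates
y_src, y_tgt = mechanism(x_src), mechanism(x_tgt)

a, b = np.polyfit(x_src, y_src, 1)      # linear model fit on source only
mse_src = np.mean((a * x_src + b - y_src) ** 2)
mse_tgt = np.mean((a * x_tgt + b - y_tgt) ** 2)
print(mse_src, mse_tgt)                 # target error blows up under the shift
```

The linear model is a fine approximation of sin(x) near x = 0 but not near x = 2, which is the "easy if learning p(Y ∣ X) is easy" point: only models that capture the stable mechanism transfer.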

SLIDE 80

Unsupervised covariate shift

  • A model that works well even when the underlying distribution of instances changes
  • Works as long as p(Y ∣ X) is stable
  • When does this happen?
SLIDE 81

Causal mechanisms are stable

SLIDE 82

Learning with an independence criterion

  • X causes Y; structural causal model: Y = g*(X) + ε, ε ⫫ X
  • g*(x) is the mechanism tying X to Y
  • ε is independent additive noise
  • Therefore, Y − g*(X) ⫫ X
  • Mooij, Janzing, Peters & Schölkopf (2009): learn the structure of causal models by learning functions g such that Y − g(X) is approximately independent of X
  • Need a non-parametric measure of independence
  • The Hilbert–Schmidt independence criterion, HSIC
SLIDE 83

Hilbert–Schmidt independence criterion: HSIC

  • Let X, Y be two metric spaces with a joint distribution p(X, Y)
  • H_X and H_Y are reproducing kernel Hilbert spaces on X and Y, induced by kernels k(·,·) and l(·,·) respectively
  • HSIC(X, Y) measures the degree of dependence between X and Y
  • Empirical version: sample (x₁, y₁), …, (x_n, y_n); denote (with some abuse of notation) by K the n × n kernel matrix on X and by L the n × n kernel matrix on Y:

    HSIĈ(X, Y; H_X, H_Y) = (1 / (n − 1)²) · tr(K H L H)

    where H is the centering matrix, H_ij = δ_ij − 1/n
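The empirical estimator above fits in a few lines of numpy. A sketch with Gaussian kernels and a fixed bandwidth (the bandwidth choice and the test data are my own; the median heuristic is a common alternative):

```python
import numpy as np

# Biased empirical HSIC: tr(K H L H) / (n - 1)^2 with centering
# matrix H = I - (1/n) 11^T and Gaussian kernel Gram matrices.
def gaussian_gram(v, bandwidth=1.0):
    sq = (v[:, None] - v[None, :]) ** 2
    return np.exp(-sq / (2 * bandwidth ** 2))

def hsic(x, y):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = gaussian_gram(x), gaussian_gram(y)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(7)
x = rng.normal(size=300)
h_dep = hsic(x, x ** 2)                 # strongly dependent pair
h_ind = hsic(x, rng.normal(size=300))   # independent pair
print(h_dep, h_ind)                     # dependent pair scores higher
```

Note that HSIC picks up the nonlinear dependence between x and x² even though their linear correlation is zero, which is why a non-parametric criterion is needed here.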

SLIDE 84

Learning with HSIC

  • Hypothesis class ℋ
  • Classic learning for a loss ℓ, e.g. squared loss: min_{h∈ℋ} E[ℓ(Y, h(X))]
  • Learning with HSIC (Mooij et al., 2009): min_{h∈ℋ} HSIC(X, Y − h(X); H_X, H_Y)
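The HSIC objective can be demonstrated end-to-end on a one-parameter hypothesis class. This toy sketch (my own setup, using a grid scan over the slope rather than the gradient-based optimization described later in the talk) generates y = 2x + ε and selects the slope whose residuals are most independent of x:

```python
import numpy as np

# Learn h(x) = w*x by minimizing HSIC(x, y - w*x): the true slope makes
# the residual equal to the independent noise, hence minimal dependence.
def gaussian_gram(v, bandwidth=1.0):
    sq = (v[:, None] - v[None, :]) ** 2
    return np.exp(-sq / (2 * bandwidth ** 2))

def hsic(x, y):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(gaussian_gram(x) @ H @ gaussian_gram(y) @ H) / (n - 1) ** 2

rng = np.random.default_rng(8)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.3, size=200)   # additive independent noise

grid = np.linspace(0.0, 4.0, 41)
scores = [hsic(x, y - w * x) for w in grid]
w_best = grid[int(np.argmin(scores))]
print("HSIC-selected slope:", w_best)           # close to the true slope 2
```

No squared loss appears anywhere: the criterion only asks that the residual carry no information about x, which is exactly the independence property of the true mechanism.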

SLIDE 85

Learning with HSIC

  • Learning with HSIC (Mooij et al., 2009): min_{h∈ℋ} HSIC(X, Y − h(X); H_X, H_Y)
  • Recall: Y − g*(X) ⫫ X
  • If the objective equals 0, then h*(x) = g*(x) + c for some constant c
  • Can learn up to an additive bias term
SLIDE 86

Learning with HSIC

  • Learning with HSIC (Mooij et al., 2009): min_{h∈ℋ} HSIC(X, Y − h(X); H_X, H_Y)
  • Differentiable with respect to h(X)
  • We optimize with SGD, using mini-batches to approximate HSIC
SLIDE 87

Theoretical results

  • Learnability: minimizing the HSIC loss over a sample leads to generalization
  • Robustness: minimizing the HSIC loss leads to tightly-bounded error under unsupervised covariate shift, provided the density ratio p_target(x) / p_source(x) is “nice” in the sense of a low RKHS norm
SLIDE 88

Experiments – rotated MNIST (Heinze-Deml & Meinshausen 2017)

  • Train on ordinary MNIST
  • Test on MNIST rotated uniformly at random in [−45°, 45°]
slide-89
SLIDE 89

Experiments – rotated MNIST (Heinze-Deml & Meinshausen 2017)

  • Train on ordinary MNIST
  • Test on MNIST rotated uniformly at random [-45°,45°]

60 65 70 75 80 85 90 95 100

Accuracy

C11 - sourcH 0LP 2x256 - sourcH 0LP 2x524 - sourcH 0LP 2x1024 - sourcH 0LP 4x256 - sourcH 0LP 4x512 - sourcH 0LP 4x1024 - sourcH C11 - targHt 0LP 2x256 - targHt 0LP 2x524 - targHt 0LP 2x1024 - targHt 0LP 4x256 - targHt 0LP 4x512 - targHt 0LP 4x1024 - targHt

7raLnLng schHPH

H6IC Cross HntroSy 60 65 70 75 80 85 90 95 100

Accuracy

C11 - sourcH 0LP 2x256 - sourcH 0LP 2x524 - sourcH 0LP 2x1024 - sourcH 0LP 4x256 - sourcH 0LP 4x512 - sourcH 0LP 4x1024 - sourcH C11 - targHt 0LP 2x256 - targHt 0LP 2x524 - targHt 0LP 2x1024 - targHt 0LP 4x256 - targHt 0LP 4x512 - targHt 0LP 4x1024 - targHt

7raLnLng schHPH

H6IC Cross HntroSy

Source{ Target{

60 65 70 75 80 85 90 95 100

Accuracy

C11 - sourcH 0LP 2x256 - sourcH 0LP 2x524 - sourcH 0LP 2x1024 - sourcH 0LP 4x256 - sourcH 0LP 4x512 - sourcH 0LP 4x1024 - sourcH C11 - targHt 0LP 2x256 - targHt 0LP 2x524 - targHt 0LP 2x1024 - targHt 0LP 4x256 - targHt 0LP 4x512 - targHt 0LP 4x1024 - targHt

7raLnLng schHPH

H6IC Cross HntroSy

70 80 90 100

Accuracy

60 65 70 75 80 85 90 95 100

Accuracy

C11 - sourcH 0LP 2x256 - sourcH 0LP 2x524 - sourcH 0LP 2x1024 - sourcH 0LP 4x256 - sourcH 0LP 4x512 - sourcH 0LP 4x1024 - sourcH C11 - targHt 0LP 2x256 - targHt 0LP 2x524 - targHt 0LP 2x1024 - targHt 0LP 4x256 - targHt 0LP 4x512 - targHt 0LP 4x1024 - targHt

7raLnLng schHPH

H6IC Cross HntroSy 60 65 70 75 80 85 90 95 100

Accuracy

C11 - sourcH 0LP 2x256 - sourcH 0LP 2x524 - sourcH 0LP 2x1024 - sourcH 0LP 4x256 - sourcH 0LP 4x512 - sourcH 0LP 4x1024 - sourcH C11 - targHt 0LP 2x256 - targHt 0LP 2x524 - targHt 0LP 2x1024 - targHt 0LP 4x256 - targHt 0LP 4x512 - targHt 0LP 4x1024 - targHt

[Figure: test accuracy (60–100%) by training scheme (HSIC vs. cross entropy) for a CNN and for MLPs of 2 or 4 layers with 256, 512, or 1024 units, evaluated on both the source and target distributions]
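The bars above compare a standard cross-entropy objective against an HSIC-based training scheme. As a minimal sketch (assuming Gaussian kernels; the function names here are ours, not from the talk), the standard biased empirical HSIC estimator can be computed as:

```python
import numpy as np

def gaussian_kernel(z, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix for the rows of z."""
    sq = np.sum(z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * z @ z.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC between paired samples x and y (n rows each).

    HSIC = trace(K H L H) / (n - 1)^2, where H centers the kernel matrices.
    Values near zero suggest (approximate) independence of x and y.
    """
    n = x.shape[0]
    K = gaussian_kernel(x, sigma)
    L = gaussian_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

In the covariate-shift setting of these experiments, the idea is to penalize statistical dependence (as measured by HSIC) where cross entropy would not, which is what the "HSIC" training scheme in the chart refers to.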

slide-90
SLIDE 90
  • Train on ordinary MNIST
  • Test on MNIST rotated uniformly at random [-45°,45°]

[Figure: example source (upright) and target (rotated) MNIST digits, and test accuracy (60–100%) by training scheme (HSIC vs. cross entropy) for each architecture on source and target]

Experiments – rotated MNIST (Heinze-Deml & Meinshausen 2017)
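The test-set construction above (train on upright digits, test on digits rotated uniformly in [-45°, 45°]) takes only a few lines; `scipy.ndimage.rotate` is one way to do it (the helper name is ours, for illustration):

```python
import numpy as np
from scipy.ndimage import rotate

def rotate_uniformly(images, low=-45.0, high=45.0, seed=0):
    """Rotate each HxW image by an independent angle drawn uniformly
    from [low, high] degrees, keeping the original image shape."""
    rng = np.random.default_rng(seed)
    out = np.empty_like(images)
    for i, img in enumerate(images):
        angle = rng.uniform(low, high)
        # reshape=False keeps the 28x28 frame; exposed corners fill with 0
        out[i] = rotate(img, angle, reshape=False, order=1, cval=0.0)
    return out
```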

slide-91
SLIDE 91

Outline

  • ML for causal inference
  • Causal inference for ML
  • Off-policy evaluation in a partially observable Markov decision process
  • Robust learning for unsupervised covariate shift
slide-92
SLIDE 92

Summary

  • Machine learning for causal inference: individual-level treatment effects from observational data, robust to the treatment assignment process
  • Using recently proposed “negative controls” to create the first off-policy evaluation scheme for POMDPs, with past and future in the role of the controls
  • Learning models robust against unknown covariate shift

slide-93
SLIDE 93

Thank you to all my collaborators!

  • Fredrik Johansson (Chalmers)
  • David Sontag (MIT)
  • Nathan Kallus (Cornell-Tech)
  • Guy Tennenholtz (Technion)
  • Shie Mannor (Technion)
  • Daniel Greenfeld (Technion)
slide-94
SLIDE 94
slide-95
SLIDE 95

Even estimating average effects from observational data is hard! Do we believe we can estimate individual-level effects?

  • Causal identification assumptions:
  • Hidden confounding: no unmeasured factors that affect both treatment and outcome
  • Common support: the 𝑇 = 1 and 𝑇 = 0 populations should be similar
  • Accurate effect estimates: be able to approximate 𝔼[𝑌 | 𝑥, 𝑇 = 𝑡]

slide-96
SLIDE 96

Even estimating average effects from observational data is hard! Do we believe we can estimate individual-level effects?

  • Causal identification assumptions:
  • Hidden confounding
  • Common support
  • Accurate effect estimates
  • We focus on tasks where we hope we can address all three concerns
  • And still be useful
  • Designing for causal identification
slide-97
SLIDE 97

You have condition A. Treatment options are T=0, T=1

slide-98
SLIDE 98

Obviously, give T=0

No need for algorithmic decision support

slide-99
SLIDE 99

Obviously, give T=0 Obviously, give T=1

No need for algorithmic decision support

slide-100
SLIDE 100

Obviously, give T=0 Obviously, give T=1 I’m not so sure…

slide-101
SLIDE 101

Obviously, give T=1 Obviously, give T=0 I’m not so sure… Recommend T=0

slide-102
SLIDE 102

Obviously, give T=1 Obviously, give T=0 I’m not so sure… Recommend T=0

  • If the decision could really go either way: recommending a suboptimal action is not as risky

slide-103
SLIDE 103
  • If the decision could really go either way: recommending a suboptimal action is not as risky
  • Need not make an explicit recommendation

Obviously, give T=1 Obviously, give T=0 I’m not so sure… T=0 T=1

slide-104
SLIDE 104

Estimating average effects is hard! When do we believe we can estimate individual-level effects?

  • Causal identification assumptions:
  • Hidden confounding → a conscious point-in-time decision by trained decision makers
  • Common support → focus on cases with explicit decision uncertainty
  • Accurate effect estimates → sign(CATE(𝑥)) more important than the exact number

slide-105
SLIDE 105

Estimating average effects is hard! When do we believe we can estimate individual-level effects?

  • We don’t need to estimate the effects for each patient correctly
  • It suffices to give useful recommendations in cases of physician uncertainty
  • Physician uncertainty is exactly where we will have more data regarding treatment alternatives for similar patients
  • Include a “we have no recommendation” option
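The “we have no recommendation” option can be made concrete with a simple abstention rule: only recommend when a confidence interval on the estimated CATE excludes zero. A minimal sketch (the function and threshold convention are ours, for illustration):

```python
def recommend(cate_low, cate_high):
    """Map a confidence interval [cate_low, cate_high] for CATE(x)
    (outcome under T=1 minus outcome under T=0; higher is better)
    to a treatment recommendation.

    Abstain when the interval straddles zero -- the
    "we have no recommendation" option from the slides.
    """
    if cate_low > 0:
        return "T=1"
    if cate_high < 0:
        return "T=0"
    return "no recommendation"
```

For example, `recommend(0.2, 0.9)` returns `"T=1"`, while `recommend(-0.3, 0.4)` abstains; the interval itself would come from, e.g., the bootstrap mentioned later in the talk.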
slide-106
SLIDE 106

We are developing a best-practice “pipeline” for clinical point-in-time decision-support models

slide-107
SLIDE 107

Focus on process, not specific models

slide-108
SLIDE 108

Preliminary results – study 2 Acute disease treatment

  • Investigating the causal effects of diuretics on kidney function in hospitalized acute heart failure patients with kidney injury at Rambam Medical Center
  • Physicians tell us: they have poor guidance on how to prescribe diuretics and blood-pressure medications to these patients
  • 2,157 patients
  • More than 200 covariates which are potential confounders: demographics, lab tests, diagnoses, medications, administrative data and more
  • Empirically: half of the cohort had diuretics increased, half had diuretics decreased

slide-109
SLIDE 109

Preliminary results – study 2 Acute disease treatment

  • T=1: “Decrease diuretics”
  • Often improves kidney function
  • Might hurt heart function
  • Physicians must balance multiple outcomes
  • For now we only examined effect on kidney function
slide-110
SLIDE 110

Policy value

  • From the estimated CATE(𝑥) we can derive a policy recommendation for treatment
  • Simple: 𝜋(𝑥) = 𝕀[CATE(𝑥) > 0]
  • For any policy 𝜋 we can estimate its policy value: the expected outcome if patients were treated by policy 𝜋
  • We use the doubly-robust policy value estimate (Dudík et al. 2011, 2014)
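The doubly-robust policy value estimate combines an outcome model with an inverse-propensity correction applied on the units where the logged treatment agrees with the policy. A minimal sketch of the estimator (the argument names are ours; see Dudík et al. for the general form):

```python
import numpy as np

def dr_policy_value(pi_action, t, y, mu_hat, propensity):
    """Doubly-robust estimate of the value V(pi) from logged data.

    pi_action:  action the policy pi would take for each unit (0/1)
    t, y:       observed treatment and observed outcome
    mu_hat:     (n, 2) outcome-model predictions, mu_hat[i, a] ~ E[Y | x_i, a]
    propensity: estimated probability of the *observed* action, P(t_i | x_i)
    """
    n = len(y)
    idx = np.arange(n)
    direct = mu_hat[idx, pi_action]            # outcome-model term at pi(x)
    match = (t == pi_action).astype(float)     # logged action agrees with pi?
    correction = match / propensity * (y - mu_hat[idx, t])
    return float(np.mean(direct + correction))
```

The estimate is consistent if either the outcome model or the propensity model is correct, which is what makes it attractive with a cohort of this size.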
slide-111
SLIDE 111
  • Increase or decrease diuretics?
  • Policy value: % improvement in kidney function (creatinine)
  • 100%: excellent
  • 0%: no improvement
  • Recommendations for 461 out of 530 (test set)
  • Estimated CATE: T-learner XGBoost
  • 𝜋(𝑥) = 𝕀[CATE(𝑥) > 0]
  • Bootstrap confidence intervals
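The estimated CATE here is a T-learner: fit a separate outcome model on each treatment arm and subtract the predictions. A minimal sketch with scikit-learn’s gradient boosting standing in for the XGBoost used on the slide:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def t_learner_cate(x, t, y, x_new):
    """T-learner: fit separate outcome models on control (t=0) and
    treated (t=1) units; estimate CATE(x) = mu1(x) - mu0(x) on x_new."""
    mu0 = GradientBoostingRegressor(random_state=0).fit(x[t == 0], y[t == 0])
    mu1 = GradientBoostingRegressor(random_state=0).fit(x[t == 1], y[t == 1])
    return mu1.predict(x_new) - mu0.predict(x_new)

def policy(cate_hat):
    """The slide's decision rule: pi(x) = 1[CATE(x) > 0]."""
    return (cate_hat > 0).astype(int)
```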

slide-116
SLIDE 116
  • Increase or decrease diuretics?
  • Policy value: % improvement in kidney function (creatinine)
  • 100%: excellent
  • 0%: no improvement
  • Recommendations for 461 out of 530 (test set)
  • Estimated CATE: T-learner XGBoost
  • 𝜋(𝑥) = 𝕀[CATE(𝑥) > 0]
  • Bootstrap confidence intervals

[Figure: estimated policy values (scale 0%–40%) of “increase everyone”, the doctors’ policy, a random policy, “decrease everyone”, and the CATE-based policy]
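The confidence intervals on those policy-value bars come from the bootstrap: resample patients with replacement and recompute the estimate. A minimal percentile-bootstrap sketch (the function and argument names are ours; `estimator` stands for any per-cohort statistic, e.g. a policy-value estimate):

```python
import numpy as np

def bootstrap_ci(values, estimator, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a statistic of
    per-unit values: resample units with replacement n_boot times and
    take empirical quantiles of the recomputed statistic."""
    rng = np.random.default_rng(seed)
    n = len(values)
    stats = np.array([
        estimator(values[rng.integers(0, n, size=n)]) for _ in range(n_boot)
    ])
    return np.quantile(stats, alpha / 2), np.quantile(stats, 1 - alpha / 2)
```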

slide-117
SLIDE 117
  • Our recommendations are better than current practice (p=0.015)
  • Our recommendations have approximately the same value as “decrease diuretics for all patients”
  • Our recommendations decrease diuretics for only 50% of patients
  • More flexibility with respect to other outcomes
  • Effect on other outcomes is work in progress

[Figure: estimated policy values (scale 0%–40%) of “increase everyone”, the doctors’ policy, a random policy, “decrease everyone”, and the CATE-based policy]