Gumbel-Max Structural Causal Models, Michael Oberst and David Sontag (PowerPoint presentation)

SLIDE 1

Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models

Michael Oberst (MIT, @MichaelOberst) and David Sontag (MIT)

SLIDE 2

Motivation: Building trust in RL policies

►Goal: Apply reinforcement learning in high-risk settings (e.g., healthcare)

►Problem: How to safely evaluate a policy? No simulator, and off-policy evaluation can fail due to

►Confounding
►Small sample sizes
►Poorly specified rewards

►Could try to interpret the policy directly, but if not possible, what can we do?

SLIDE 3

Motivation: Building trust in RL policies

Observational Data

Markov Decision Process (MDP): P(S′, R | S, A)

Policy: π(A | S)

?

S: Current State, A: Action, R: Reward, S′: Next State

Suppose we are given:

  • Markov Decision Process (MDP)
  • Policy (e.g., learned using MDP)
SLIDE 4

Using counterfactuals to “sanity check”

Observed trajectory (over time; S: State, A: Action):
States: …patient has infection, …drug reaction, …significant agitation
Actions: A1 Antibiotics, A2 Mechanical Ventilation, A3 Sedation

SLIDE 5

If the new policy had been applied to this patient…

Using counterfactuals to “sanity check”

Observed trajectory (over time; S: State, A: Action):
States: …patient has infection, …drug reaction, …significant agitation
Actions: A1 Antibiotics, A2 Mechanical Ventilation, A3 Sedation

SLIDE 6

If the new policy had been applied to this patient…

Using counterfactuals to “sanity check”

New policy: S0 …patient has infection, A1 Antibiotics

Observed trajectory (over time; S: State, A: Action):
States: …patient has infection, …drug reaction, …significant agitation
Actions: A1 Antibiotics, A2 Mechanical Ventilation, A3 Sedation

SLIDE 7

If the new policy had been applied to this patient…

Using counterfactuals to “sanity check”

New policy: S0 …patient has infection, A1 Antibiotics, S1 …infection cleared

Observed trajectory (over time; S: State, A: Action):
States: …patient has infection, …drug reaction, …significant agitation
Actions: A1 Antibiotics, A2 Mechanical Ventilation, A3 Sedation

SLIDE 8

If the new policy had been applied to this patient…

Using counterfactuals to “sanity check”

New policy: S0 …patient has infection, A1 Antibiotics, S1 …infection cleared

Model-based rollout not a fair comparison

Observed trajectory (over time; S: State, A: Action):
States: …patient has infection, …drug reaction, …significant agitation
Actions: A1 Antibiotics, A2 Mechanical Ventilation, A3 Sedation

SLIDE 9

If the new policy had been applied to this patient…

Using counterfactuals to “sanity check”

New policy: S0 …patient has infection, A1 Antibiotics, S1

Observed trajectory (over time; S: State, A: Action):
States: …patient has infection, …drug reaction, …significant agitation
Actions: A1 Antibiotics, A2 Mechanical Ventilation, A3 Sedation

SLIDE 10

If the new policy had been applied to this patient…

Using counterfactuals to “sanity check”

New policy: S0 …patient has infection, A1 Antibiotics, S1 …drug reaction

Counterfactual influenced by actual outcome

Observed trajectory (over time; S: State, A: Action):
States: …patient has infection, …drug reaction, …significant agitation
Actions: A1 Antibiotics, A2 Mechanical Ventilation, A3 Sedation

SLIDE 11

If the new policy had been applied to this patient…

Using counterfactuals to “sanity check”

New policy: S0 …patient has infection, A1 Antibiotics, S1 …drug reaction, A2 No action, S2 …patient recovers, A3 Discharge

Observed trajectory (over time; S: State, A: Action):
States: …patient has infection, …drug reaction, …significant agitation
Actions: A1 Antibiotics, A2 Mechanical Ventilation, A3 Sedation

SLIDE 12

If the new policy had been applied to this patient…

Using counterfactuals to “sanity check”

New policy: S0 …patient has infection, A1 Antibiotics, S1 …drug reaction, A2 No action, S2 …patient recovers, A3 Discharge

Observed trajectory (over time; S: State, A: Action):
States: …patient has infection, …drug reaction, …significant agitation
Actions: A1 Antibiotics, A2 Mechanical Ventilation, A3 Sedation

Idea: If the counterfactual trajectory is unreasonable given full context of patient, the model / policy may be flawed

SLIDE 13

Using counterfactuals to “sanity check”

Approach

1 Decomposition of reward over real episodes, to identify interesting cases

See paper / poster for synthetic case study motivated by sepsis management

SLIDE 14

Using counterfactuals to “sanity check”

Approach

1 Decomposition of reward over real episodes, to identify interesting cases

Example

See paper / poster for synthetic case study motivated by sepsis management

SLIDE 15

Using counterfactuals to “sanity check”

Approach

1 Decomposition of reward over real episodes, to identify interesting cases
2 Examine counterfactual trajectories under new policy
3 Validate and/or criticize conclusions, using full patient information (e.g., chart review)

Example

See paper / poster for synthetic case study motivated by sepsis management

SLIDE 16

Simulating counterfactual trajectories

What we need

1 Observed trajectories
2 Policy to evaluate: π(A | S)
3 Model of discrete dynamics, e.g., Markov Decision Process (graph: S, A → S′)

S: Current State, A: Action, S′: Next State

SLIDE 17

Simulating counterfactual trajectories

What we need

1 Observed trajectories
2 Policy to evaluate: π(A | S)
3 Model of discrete dynamics, e.g., Markov Decision Process (graph: S, A → S′)

+ Structural Causal Model (SCM) (graph: S, A, U_S′ → S′)

S′ = f(S, A, U_S′)
U_S′ ∼ P(U_S′)

S: Current State, A: Action, S′: Next State

SLIDE 18

Simulating counterfactual trajectories

What we need

1 Observed trajectories
2 Policy to evaluate: π(A | S)
3 Model of discrete dynamics, e.g., Markov Decision Process (graph: S, A → S′)

+ Structural Causal Model (SCM) (graph: S, A, U_S′ → S′)

S′ = f(S, A, U_S′)
U_S′ ∼ P(U_S′)

Problem: Choice of SCM is not identifiable from data!

S: Current State, A: Action, S′: Next State

SLIDE 19

So, what should we use for the structural causal model (SCM)?

Key challenge: Non-identifiability

There are multiple SCMs consistent with P(S′ | S, A) but with different counterfactual distributions

For binary variables, assuming the property of monotonicity (Pearl, 2000) is sufficient to identify the counterfactual distribution

But most real-world MDPs have non-binary states!
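
As a concrete illustration of the non-identifiability point on this slide, the sketch below builds two hypothetical SCMs for a binary next state. Both reproduce the same P(S′ = 1 | A), yet they disagree completely on the counterfactual. The probabilities, function names, and the grid standing in for the uniform noise are all assumptions made for this example, not taken from the talk.

```python
# Hypothetical illustration: two SCMs for a binary next state S' that share
# the same P(S' = 1 | A) but imply different counterfactuals.

P_NEXT = {0: 0.5, 1: 0.5}  # assumed P(S' = 1 | A = a), current state held fixed


def scm_monotone(a, u):
    """S' = 1 when the shared noise u falls below P(S' = 1 | A = a)."""
    return int(u < P_NEXT[a])


def scm_flipped(a, u):
    """Same marginals, but action 1 reads the noise from the opposite end."""
    if a == 0:
        return int(u < P_NEXT[0])
    return int(u >= 1.0 - P_NEXT[1])


def grid(n=10_000):
    """Midpoints of n equal bins: a deterministic stand-in for U ~ Uniform(0, 1)."""
    return [(i + 0.5) / n for i in range(n)]


def marginal(scm, a):
    """Interventional probability P(S' = 1 | do(A = a)) under the given SCM."""
    us = grid()
    return sum(scm(a, u) for u in us) / len(us)


def counterfactual(scm, a_obs, s_obs, a_new):
    """P(S' = 1 under a_new | observed a_obs, s_obs): keep only consistent noise."""
    consistent = [u for u in grid() if scm(a_obs, u) == s_obs]
    return sum(scm(a_new, u) for u in consistent) / len(consistent)


if __name__ == "__main__":
    for name, scm in [("monotone", scm_monotone), ("flipped", scm_flipped)]:
        print(name, marginal(scm, 0), marginal(scm, 1), counterfactual(scm, 0, 1, 1))
```

Data generated from either model would look identical, so observations alone cannot pick between them; only an added assumption (monotonicity in the binary case) rules out the second one.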

SLIDE 20

So, what should we use for the structural causal model (SCM)?

Key challenge: Non-identifiability

There are multiple SCMs consistent with P(S′ | S, A) but with different counterfactual distributions

For binary variables, assuming the property of monotonicity (Pearl, 2000) is sufficient to identify the counterfactual distribution

But most real-world MDPs have non-binary states!

Theorem 1 (informal): (Newly defined) property of counterfactual stability generalizes monotonicity to categorical variables

SLIDE 21

So, what should we use for the structural causal model (SCM)?

Key challenge: Non-identifiability

There are multiple SCMs consistent with P(S′ | S, A) but with different counterfactual distributions

For binary variables, assuming the property of monotonicity (Pearl, 2000) is sufficient to identify the counterfactual distribution

But most real-world MDPs have non-binary states!

Theorem 1 (informal): (Newly defined) property of counterfactual stability generalizes monotonicity to categorical variables

Gumbel-Max SCM

Use the Gumbel-Max trick to sample from a categorical distribution with k categories:

g_j ∼ Gumbel
S′ = argmax_j { log P(S′ = j | S, A) + g_j }

Theorem 2: Gumbel-Max SCM satisfies the counterfactual stability condition
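
The Gumbel-Max sampling step on this slide can be sketched in a few lines of standard-library Python. The function names are hypothetical. The counterfactual step uses plain rejection sampling to find Gumbel noise consistent with the observed next state, which is a simple stand-in chosen for this example and not necessarily the procedure the authors use.

```python
import math
import random


def sample_gumbel(rng):
    """One standard Gumbel(0, 1) draw via the inverse CDF: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # rng.random() can return 0.0, which would give log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max(probs, gumbels):
    """Return argmax_j { log P(S' = j | S, A) + g_j }."""
    scores = [math.log(p) + g if p > 0 else -math.inf
              for p, g in zip(probs, gumbels)]
    return max(range(len(probs)), key=scores.__getitem__)


def posterior_gumbels(probs_obs, s_obs, rng):
    """Rejection-sample Gumbel noise that reproduces the observed next state.

    Assumption for this sketch: simple rejection sampling, which is slow
    when the observed state had low probability.
    """
    if probs_obs[s_obs] <= 0:
        raise ValueError("observed state has zero probability")
    while True:
        g = [sample_gumbel(rng) for _ in probs_obs]
        if gumbel_max(probs_obs, g) == s_obs:
            return g


def counterfactual_next_state(probs_obs, s_obs, probs_cf, rng):
    """Reuse the posterior noise under the counterfactual transition probabilities."""
    return gumbel_max(probs_cf, posterior_gumbels(probs_obs, s_obs, rng))


if __name__ == "__main__":
    rng = random.Random(0)
    p_obs = [0.2, 0.5, 0.3]  # assumed P(S' | S, A) under the observed action
    p_cf = [0.6, 0.1, 0.3]   # assumed P(S' | S, A') under the new action
    print(counterfactual_next_state(p_obs, 1, p_cf, rng))
```

Because the same noise is reused, a counterfactual action that leaves the transition probabilities unchanged always reproduces the observed next state, and raising the observed state's probability relative to every other state cannot move the outcome away from it.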

SLIDE 22

Thank you!

Come to our poster for more details: Pacific Ballroom #72