Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models
Michael Oberst (MIT), David Sontag (MIT) | @MichaelOberst
Motivation: Building trust in RL policies
►Goal: Apply reinforcement learning in high-risk settings (e.g., healthcare)
►Problem: How to safely evaluate a policy? No simulator, and off-policy evaluation can fail due to:
  ►Confounding
  ►Small sample sizes
  ►Poorly specified rewards
►Could try to interpret the policy directly, but if that is not possible, what can we do?
Suppose we are given:

P(S′, R | S, A)
π(A | S)

S: Current State, A: Action, R: Reward, S′: Next State
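The quantities assumed given (a transition/reward model over discrete states and a policy to evaluate) can be sketched as arrays; a minimal illustration with hypothetical sizes and random numbers:

```python
import numpy as np

# Hypothetical discrete MDP with 3 states and 2 actions.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Transition model P(S' | S, A): shape (S, A, S'), each row a distribution.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Reward model R(S, A) and policy to evaluate pi(A | S).
R = rng.normal(size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

# Sanity checks: transition rows and policy rows are valid distributions.
assert np.allclose(P.sum(axis=-1), 1.0)
assert np.allclose(pi.sum(axis=-1), 1.0)
```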
Observed trajectory over time (S: State, A: Action):
►…patient has infection → A1: Antibiotics
►…drug reaction → A2: Mechanical Ventilation
►…significant agitation → A3: Sedation

If the new policy had been applied to this patient…
►A model-based rollout starts at S0 (…patient has infection), takes A1 (Antibiotics), and might sample S1 = …infection cleared. But a model-based rollout is not a fair comparison.
►A counterfactual rollout instead gives S1 = …drug reaction: the counterfactual is influenced by the actual outcome.
►Full counterfactual trajectory: S0 …patient has infection, A1 Antibiotics → S1 …drug reaction, A2 No action → S2 …patient recovers, A3 Discharge.

Idea: If the counterfactual trajectory is unreasonable given the full context of the patient, the model / policy may be flawed.
Approach:
1 Identify interesting cases
2 Examine counterfactual trajectories under new policy
3 Validate and/or criticize conclusions, using full patient information (e.g., chart review)

Example: See paper / poster for a synthetic case study motivated by sepsis management.
What we need:
1 Observed trajectories
2 Policy to evaluate: π(A | S)
3 Model of discrete dynamics, e.g., a Markov Decision Process (S → A → S′), plus a Structural Causal Model (SCM):

S′ = f(S, A, U_{S′}),  U_{S′} ∼ P(U_{S′})

Problem: Choice of SCM is not identifiable from data!

S: Current State, A: Action, S′: Next State
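One concrete SCM consistent with a given categorical transition distribution is the inverse-CDF construction; a hypothetical sketch (probabilities made up for illustration):

```python
import numpy as np

def scm_inverse_cdf(p, u):
    """One possible structural function f(s, a, u): pick the next state
    by inverting the CDF of P(S' | s, a) at exogenous noise u ~ Uniform(0, 1)."""
    return int(np.searchsorted(np.cumsum(p), u))

p = np.array([0.5, 0.3, 0.2])           # hypothetical P(S' | s, a)
u = np.random.default_rng(0).uniform()  # the exogenous noise U_{S'}
s_next = scm_inverse_cdf(p, u)

# Note: permuting the category order gives a *different* SCM with the same
# transition probabilities but different counterfactuals -- this is why the
# choice of SCM is not identifiable from observational data alone.
```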
Key challenge: Non-identifiability

►There are multiple SCMs consistent with P(S′ | S, A) but with different counterfactual distributions.
►For binary variables, assuming the property of monotonicity (Pearl, 2000) is sufficient to identify the counterfactual distribution. But most real-world MDPs have non-binary states!
►Theorem 1 (informal): (Newly defined) property of counterfactual stability generalizes monotonicity to categorical variables.

Gumbel-Max SCM: Use the Gumbel-max trick to sample from a categorical distribution with k categories:

g ∼ Gumbel,  S′ = argmax_j { log P(S′ = j | S, A) + g_j }

Theorem 2: The Gumbel-Max SCM satisfies the counterfactual stability condition.
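A minimal sketch of counterfactual inference in a Gumbel-Max SCM (probabilities are hypothetical, and the posterior noise is drawn by simple rejection sampling rather than a more efficient truncated-Gumbel sampler):

```python
import numpy as np

def gumbel_max(log_p, g):
    """Gumbel-max trick: S' = argmax_j { log P(S'=j | s, a) + g_j }."""
    return int(np.argmax(log_p + g))

def posterior_noise(log_p, s_next_obs, rng, max_tries=100_000):
    """Rejection-sample Gumbel noise consistent with the observed transition:
    draw g ~ Gumbel(0, 1)^k until the trick reproduces s_next_obs."""
    for _ in range(max_tries):
        g = rng.gumbel(size=log_p.shape)
        if gumbel_max(log_p, g) == s_next_obs:
            return g
    raise RuntimeError("no consistent noise found")

def counterfactual_next_state(p_obs, p_cf, s_next_obs, rng):
    """Infer noise from the observed (s, a, s') transition, then replay the
    same noise under the counterfactual action's transition distribution."""
    g = posterior_noise(np.log(p_obs), s_next_obs, rng)
    return gumbel_max(np.log(p_cf), g)

rng = np.random.default_rng(0)
p_obs = np.array([0.7, 0.2, 0.1])  # hypothetical P(S' | s, a_observed)
p_cf  = np.array([0.2, 0.7, 0.1])  # hypothetical P(S' | s, a_new)
s_cf = counterfactual_next_state(p_obs, p_cf, s_next_obs=0, rng=rng)

# Sanity check: if the counterfactual action leaves P(S' | s, a) unchanged,
# the counterfactual next state must equal the observed one.
assert counterfactual_next_state(p_obs, p_obs, 0, rng) == 0
```

Here the inferred Gumbel noise plays the role of U_{S′}: it is held fixed across the factual and counterfactual worlds, which is what makes the counterfactual depend on the actually observed outcome.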
Come to our poster for more details: Pacific Ballroom #72