Partially-Observable Markov Decision Processes as Dynamical Causal Models
Finale Doshi-Velez
NIPS Causality Workshop 2013
The POMDP Mindset
We poke the world (perform an action). We get a poke back (see an observation). We get a poke back (get a reward).
[Figure: the agent-world interaction loop; the agent acts on the world, and the world responds with an observation and a reward (e.g., -$1)]
What next? The world is a mystery... The agent needs a representation to use when making decisions:
- a representation of the current world state
- a representation of how the world works
Many problems can be framed this way
- Robot navigation (take movement actions, receive sensor measurements)
- Dialog management (ask questions, receive answers)
- Target tracking (search a particular area, receive sensor measurements)
... the list goes on ...
The Causal Process, Unrolled
[Figure: the process unrolled over time steps t-1 through t+2, with observations o_{t-1}, ..., o_{t+2}, actions a_{t-1}, ..., a_{t+2}, and rewards r_{t-1}, ..., r_{t+2} (e.g., -$1, -$1, -$5, $10)]
Given a history of actions, observations, and rewards, how can we act in order to maximize long-term future rewards?
Key challenge: the entire history may be needed to make near-optimal decisions, because all past events are needed to predict future events.
Introducing hidden states s_{t-1}, ..., s_{t+2} gives a representation that is a sufficient statistic summarizing the history. We call this representation the information state.
What is state?
- Sometimes there exists an obvious choice for this hidden variable (such as a robot's true position).
- At other times, learning a representation that makes the system Markovian may provide insights into the problem.
Formal POMDP definition
A POMDP consists of
- A set of states S, actions A, and observations O
- A transition function T( s' | s, a )
- An observation function O( o | s, a )
- A reward function R( s, a )
- A discount factor γ
The goal is to maximize E\left[\sum_{t=1}^{\infty} \gamma^t R_t\right], the expected long-term discounted reward.
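To make these ingredients concrete, here is a minimal tabular POMDP sketch in Python. The class name, the array layout (T[a, s, s'], O[a, s', o], R[s, a]), and the step method are our own illustrative conventions, not from the talk.

    import numpy as np

    class TabularPOMDP:
        """Minimal tabular POMDP: T[a, s, s'], O[a, s', o], R[s, a], discount gamma."""

        def __init__(self, T, O, R, gamma):
            self.T, self.O, self.R, self.gamma = T, O, R, gamma
            self.n_states, self.n_obs = T.shape[1], O.shape[2]

        def step(self, s, a, rng):
            """One 'poke' of the world: sample a next state, an observation, and a reward."""
            s_next = rng.choice(self.n_states, p=self.T[a, s])
            o = rng.choice(self.n_obs, p=self.O[a, s_next])
            return s_next, o, self.R[s, a]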
Relationship to Other Models
[Figure: four unrolled graphical models, organized by two questions ("Decisions?" and "Hidden state?"): the Markov Model (no decisions, observed state), the Hidden Markov Model (no decisions, hidden state), the Markov Decision Process (decisions, observed state), and the POMDP (decisions, hidden state)]
Returning to the formal definition, two problems fall out of it:
- Solving the optimization (choosing how to act to maximize the expected discounted reward, given the model) is called “planning”.
- Estimating the model components T, O, and R from data is called “learning”.
State and State, a quick aside
- In the POMDP literature, the term “state” usually refers to the hidden state (i.e., the robot's true location).
- The posterior distribution over states s is called the “belief” b(s). It is a sufficient statistic for the history, and thus the information state for the POMDP.
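As a sketch of how the belief is maintained, the Bayes-filter update below conditions on the latest action and observation; the array layout matches the TabularPOMDP sketch above and is our own convention.

    import numpy as np

    def belief_update(b, a, o, T, O):
        """Bayes filter: b_a^o(s') ∝ O(o | s', a) * sum_s T(s' | s, a) b(s)."""
        predicted = b @ T[a]             # predict: sum_s T(s' | s, a) b(s)
        unnorm = predicted * O[a, :, o]  # correct: weight by the observation likelihood
        return unnorm / unnorm.sum()     # normalize to get the new belief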
Planning
Bellman recursion for the value (the long-term expected reward):

V(b) = \max \; E\left[\sum_{t=1}^{\infty} \gamma^t R_t \mid b_0 = b\right]
     = \max_a \left[ R(b, a) + \gamma \sum_{o \in O} P(b_a^o \mid b) \, V(b_a^o) \right]

- b is the belief (the sufficient statistic / information state), and b_a^o is the updated belief after taking action a and seeing observation o.
- R(b, a) is the immediate reward for taking action a in belief b.
- The second term collects the expected future rewards.
... especially when b is high-dimensional, solving for this continuous function is not easy (it is PSPACE-hard).
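To ground the recursion, here is a crude value-iteration sweep over a fixed grid of support beliefs. Mapping successor beliefs to their nearest support point is a simplifying assumption of this sketch; real solvers such as SARSOP choose and back up support points far more cleverly.

    import numpy as np

    def bellman_backup(values, beliefs, pomdp):
        """One sweep of the Bellman recursion over a list of support beliefs.

        values[i] holds the current estimate of V(beliefs[i]); successor
        beliefs are mapped to the nearest support point (a crude choice).
        """
        def nearest(b):
            return min(range(len(beliefs)),
                       key=lambda i: np.abs(beliefs[i] - b).sum())

        n_a, n_o = pomdp.T.shape[0], pomdp.O.shape[2]
        new_values = []
        for b in beliefs:
            q_best = -np.inf
            for a in range(n_a):
                q = b @ pomdp.R[:, a]                    # R(b, a)
                predicted = b @ pomdp.T[a]               # P(s' | b, a)
                for o in range(n_o):
                    p_o = predicted @ pomdp.O[a, :, o]   # P(o | b, a)
                    if p_o > 0:
                        b_next = predicted * pomdp.O[a, :, o] / p_o
                        q += pomdp.gamma * p_o * values[nearest(b_next)]
                q_best = max(q_best, q)
            new_values.append(q_best)
        return new_values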
Planning: Yes, we can!
- Global: approximate the entire function V(b) via a set of support points b' (e.g., SARSOP).
- Local: approximate the value for a particular belief with forward simulation (e.g., POMCP), as sketched below.
[Figure: a forward-simulation tree rooted at the current belief b, branching over candidate actions a into simulated future beliefs]
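A much-simplified sketch of the local idea, reusing the TabularPOMDP and belief_update sketches above: estimate the value of the current belief by Monte Carlo rollouts under some rollout policy (an assumed callable mapping beliefs to actions). POMCP itself builds a UCT search tree over action-observation histories rather than doing plain rollouts.

    import numpy as np

    def rollout_value(pomdp, b, policy, horizon, n_sims, rng):
        """Monte Carlo estimate of V(b) under a given rollout policy."""
        total = 0.0
        for _ in range(n_sims):
            s = rng.choice(len(b), p=b)        # sample a hidden state from the belief
            bt, discount = b.copy(), 1.0
            for _ in range(horizon):
                a = policy(bt)                 # rollout policy: belief -> action
                s, o, r = pomdp.step(s, a, rng)
                total += discount * r
                discount *= pomdp.gamma
                bt = belief_update(bt, a, o, pomdp.T, pomdp.O)
        return total / n_sims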
Learning
- Given histories h = (a_1, r_1, o_1, a_2, r_2, o_2, \ldots, a_T, r_T, o_T), we can learn T, O, and R via forward-filtering/backward-sampling or <fill in your favorite timeseries algorithm>.
- Two principles usually suffice for exploring to learn:
  - Optimism under uncertainty: try actions that might be good.
  - Risk control: if an action seems risky, ask for help.
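As a sketch of what “learning T, O, R” means, the counting estimator below assumes, purely to keep the example short, that the hidden states are recorded too (e.g., in simulation); with truly hidden states one would impute them via forward-filtering/backward-sampling and iterate these counts.

    import numpy as np

    def estimate_model(histories, n_s, n_a, n_o, alpha=1.0):
        """Count-based estimates of T, O, R with Dirichlet(alpha) smoothing.

        Each history is a list of (s, a, r, o, s_next) tuples; recording the
        hidden state s is a simplifying assumption of this sketch.
        """
        T_counts = np.full((n_a, n_s, n_s), alpha)
        O_counts = np.full((n_a, n_s, n_o), alpha)
        R_sum, R_n = np.zeros((n_s, n_a)), np.zeros((n_s, n_a))
        for h in histories:
            for s, a, r, o, s_next in h:
                T_counts[a, s, s_next] += 1     # transition s -(a)-> s'
                O_counts[a, s_next, o] += 1     # observation emitted at s'
                R_sum[s, a] += r; R_n[s, a] += 1
        T = T_counts / T_counts.sum(axis=2, keepdims=True)
        O = O_counts / O_counts.sum(axis=2, keepdims=True)
        R = R_sum / np.maximum(R_n, 1.0)        # mean observed reward per (s, a)
        return T, O, R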
Example: Timeseries in Diabetes
[Figure: a coupled model with hidden clinician states and hidden patient states, linked through meds (anti-diabetic agents) and lab results (A1c)]
Data: electronic health records of ~17,000 diabetics with 5+ A1c lab measurements and 5+ anti-diabetic agents prescribed.
Collaborators: Isaac Kohane, Stan Shaw

Discovered Patient States
- The “patient states” each correspond to a range of A1c levels: A1c < 5.5, 5.5-6.5, 6.5-7.5, 7.5-8.5, and > 8.5 (unsurprising).

Discovered Clinician States
- The “clinician states” follow the standard treatment protocols for diabetes: Metformin; Metformin + Glipizide; Metformin + Glyburide; basic insulins; and Glargine, Lispro, Aspart, with transitions driven by whether A1c is under control or going up (unsurprising, but exciting that we discovered this in a completely unsupervised manner).
- Next steps: incorporate more variables; identify patient and clinician outliers (quality of care).
Example: Experimental Design
- In a very general sense:
  - Action space: all possible experiments + “submit”
  - State space: which hypothesis is true
  - Observation space: results of experiments
  - Reward: cost of experiment
- Allows for non-myopic sequencing of experiments.
- Example: Bayesian Optimization
Joint with: Ryan Adams/HIPS group
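A toy instance of this framing, reusing the TabularPOMDP sketch from earlier (all numbers invented for illustration): two hypotheses, one noisy experiment, and separate “submit H0” / “submit H1” actions.

    import numpy as np

    n_s, n_o = 2, 2                            # states = hypotheses; obs = experiment outcome
    T = np.stack([np.eye(n_s)] * 3)            # the true hypothesis never changes
    O = np.empty((3, n_s, n_o))
    O[0] = [[0.8, 0.2], [0.2, 0.8]]            # experiment: an 80%-reliable reading
    O[1] = O[2] = [[0.5, 0.5], [0.5, 0.5]]     # submitting yields no information
    R = np.array([[-1.0, 10.0, -10.0],         # R[s, a]: each experiment costs $1;
                  [-1.0, -10.0, 10.0]])        # a correct submission pays, a wrong one hurts
    lab = TabularPOMDP(T, O, R, gamma=0.95)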
Summary
- POMDPs provide a framework for
  - modeling causal dynamical systems
  - making optimal sequential decisions
- POMDPs can be learned and solved!
[Figure: the unrolled POMDP graphical model, with observations, actions, rewards, and hidden states over time]