SLIDE 1

Partially-Observable Markov Decision Processes as Dynamical Causal Models

Finale Doshi-Velez, NIPS Causality Workshop 2013

SLIDE 2

The POMDP Mindset

We poke the world (perform an action)

[Figure: agent acting on the world]
SLIDE 3

The POMDP Mindset

  • We poke the world (perform an action)
  • We get a poke back (see an observation)
  • We get a poke back (get a reward)

[Figure: agent–world loop, annotated with a $1 reward]
SLIDE 4

What next?

  • We poke the world (perform an action)
  • We get a poke back (see an observation)
  • We get a poke back (get a reward)

[Figure: agent–world loop, annotated with a $1 reward]
SLIDE 5

What next?

  • We poke the world (perform an action)
  • We get a poke back (see an observation)
  • We get a poke back (get a reward)

The world is a mystery...

[Figure: agent facing an unknown world, annotated with a $1 reward]
SLIDE 6

The agent needs a representation to use when making decisions

  • We poke the world (perform an action)
  • We get a poke back (see an observation)
  • We get a poke back (get a reward)

The agent keeps a representation of the current world state and a representation of how the world works.

The world is a mystery...

[Figure: agent facing an unknown world, annotated with a $1 reward]
SLIDE 7

Many problems can be framed this way

  • Robot navigation (take movement actions, receive sensor measurements)
  • Dialog management (ask questions, receive answers)
  • Target tracking (search a particular area, receive sensor measurements)

... the list goes on ...
SLIDE 8

The Causal Process, Unrolled

[Figure: the process unrolled over time, with observations o_{t-1}, o_t, o_{t+1}, o_{t+2}, actions a_{t-1}, ..., a_{t+2}, and rewards r_{t-1}, ..., r_{t+2} (e.g. $1, $1, $5, $10)]
SLIDE 9

The Causal Process, Unrolled

[Figure: the same unrolled process of observations, actions, and rewards]

Given a history of actions, observations, and rewards, how can we act in order to maximize long-term future rewards?
SLIDE 10

The Causal Process, Unrolled

[Figure: the same unrolled process of observations, actions, and rewards]

Key Challenge: The entire history may be needed to make near-optimal decisions.
SLIDE 11

The Causal Process, Unrolled

[Figure: the unrolled process, with dependencies from all past events to future events]

All past events are needed to predict future events.
SLIDE 12

The Causal Process, Unrolled

[Figure: the unrolled process with hidden states s_{t-1}, s_t, s_{t+1}, s_{t+2} added]

The representation is a sufficient statistic that summarizes the history.
SLIDE 13

The Causal Process, Unrolled

[Figure: the unrolled process with hidden states s_{t-1}, ..., s_{t+2}]

We call this representation the information state.

The representation is a sufficient statistic that summarizes the history.
SLIDE 14

What is state?

  • Sometimes, there exists an obvious choice for this hidden variable (such as a robot's true position).
  • At other times, learning a representation that makes the system Markovian may provide insights into the problem.
SLIDE 15

Formal POMDP definition

A POMDP consists of

  • A set of states S, actions A, and observations O
  • A transition function T(s' | s, a)
  • An observation function O(o | s, a)
  • A reward function R(s, a)
  • A discount factor γ

The goal is to maximize E[ ∑_{t=1}^∞ γ^t R_t ], the expected long-term discounted reward.
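As a concrete sketch (illustrative, not from the talk), the tuple above maps naturally onto a small container class; all names and conventions here are assumptions, not part of the slides:

```python
import numpy as np

# A minimal, hypothetical container for a discrete POMDP (S, A, O, T, O, R, γ).
# Assumed conventions: T[a, s, s2] = T(s2 | s, a); Z[a, s2, o] = O(o | s2, a),
# the chance of seeing o in the state reached after taking a; R[s, a] = reward.
class POMDP:
    def __init__(self, T, Z, R, gamma):
        self.T = np.asarray(T)   # |A| x |S| x |S| transition tensor
        self.Z = np.asarray(Z)   # |A| x |S| x |O| observation tensor
        self.R = np.asarray(R)   # |S| x |A| reward matrix
        self.gamma = gamma       # discount factor γ in [0, 1)
```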

SLIDE 16

Relationship to Other Models

[Figure: four unrolled graphical models compared along two axes: Decisions? Hidden state?]

                   No decisions           Decisions (actions, rewards)
State observed     Markov Model           Markov Decision Process
State hidden       Hidden Markov Model    POMDP
SLIDE 17

Formal POMDP definition

A POMDP consists of

  • A set of states S, actions A, and observations O
  • A transition function T(s' | s, a)
  • An observation function O(o | s, a)
  • A reward function R(s, a)
  • A discount factor γ

The goal is to maximize E[ ∑_{t=1}^∞ γ^t R_t ], the expected long-term discounted reward.

This optimization is called “Planning”.
SLIDE 18

Formal POMDP definition

A POMDP consists of

  • A set of states S, actions A, and observations O
  • A transition function T(s' | s, a)
  • An observation function O(o | s, a)
  • A reward function R(s, a)
  • A discount factor γ

The goal is to maximize E[ ∑_{t=1}^∞ γ^t R_t ], the expected long-term discounted reward.

Learning these functions (T, O, R) is called “Learning”.
SLIDE 19

Planning

V(b) = max E[ ∑_{t=1}^∞ γ^t R_t | b_0 = b ]
     = max_a [ R(b, a) + γ ∑_{o∈O} P(b_a^o | b) V(b_a^o) ]

where b_a^o is the updated belief after taking action a and observing o.

Bellman Recursion for the value (long-term expected reward)
SLIDE 20

State and State, a quick aside

  • In the POMDP literature, the term “state” usually refers to the hidden state (i.e. the robot's true location).
  • The posterior distribution over states s is called the “belief” b(s). It is a sufficient statistic for the history, and thus the information state for the POMDP.
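Concretely, the belief is maintained by Bayesian filtering. A minimal sketch, assuming the hypothetical POMDP container introduced after the definition slide:

```python
import numpy as np

def belief_update(pomdp, b, a, o):
    """Bayes filter: b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) b(s)."""
    predicted = pomdp.T[a].T @ b          # Σ_s T(s'|s,a) b(s), one entry per s'
    b_new = pomdp.Z[a, :, o] * predicted  # weight by observation likelihood
    return b_new / b_new.sum()            # normalize; assumes P(o | b, a) > 0
```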

SLIDE 21

Planning

V(b) = max E[ ∑_{t=1}^∞ γ^t R_t | b_0 = b ]
     = max_a [ R(b, a) + γ ∑_{o∈O} P(b_a^o | b) V(b_a^o) ]

Bellman Recursion for the value (long-term expected reward)
SLIDE 22

Planning

V(b) = max E[ ∑_{t=1}^∞ γ^t R_t | b_0 = b ]
     = max_a [ R(b, a) + γ ∑_{o∈O} P(b_a^o | b) V(b_a^o) ]

Bellman Recursion for the value (long-term expected reward)

Belief b (sufficient statistic / information state)
SLIDE 23

Planning

V(b) = max E[ ∑_{t=1}^∞ γ^t R_t | b_0 = b ]
     = max_a [ R(b, a) + γ ∑_{o∈O} P(b_a^o | b) V(b_a^o) ]

Bellman Recursion for the value (long-term expected reward)

Immediate reward for taking action a in belief b
SLIDE 24

Planning

V(b) = max E[ ∑_{t=1}^∞ γ^t R_t | b_0 = b ]
     = max_a [ R(b, a) + γ ∑_{o∈O} P(b_a^o | b) V(b_a^o) ]

Bellman Recursion for the value (long-term expected reward)

Expected future rewards
SLIDE 25

Planning

V(b) = max E[ ∑_{t=1}^∞ γ^t R_t | b_0 = b ]
     = max_a [ R(b, a) + γ ∑_{o∈O} P(b_a^o | b) V(b_a^o) ]

Bellman Recursion for the value (long-term expected reward)

… especially when b is high-dimensional, solving for this continuous function is not easy (PSPACE-hard).
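To make the recursion concrete, here is a sketch of one exact backup at a single belief, reusing the hypothetical container and belief conventions from earlier (real solvers represent V with alpha-vectors or sampled belief points rather than an arbitrary callable):

```python
import numpy as np

def bellman_backup(pomdp, V, b):
    """One Bellman backup: max_a [ R(b,a) + γ Σ_o P(o|b,a) V(b_a^o) ]."""
    best = -np.inf
    nA, nO = pomdp.R.shape[1], pomdp.Z.shape[2]
    for a in range(nA):
        q = b @ pomdp.R[:, a]                    # R(b, a) = Σ_s b(s) R(s, a)
        predicted = pomdp.T[a].T @ b             # next-state distribution under a
        for o in range(nO):
            p_o = pomdp.Z[a, :, o] @ predicted   # P(o | b, a) = P(b_a^o | b)
            if p_o > 1e-12:
                b_ao = (pomdp.Z[a, :, o] * predicted) / p_o  # updated belief
                q += pomdp.gamma * p_o * V(b_ao)
        best = max(best, q)
    return best
```

Applying this backup repeatedly over a set of support beliefs is, in spirit, the “global” strategy the next slide mentions.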

SLIDE 26

Planning: Yes, we can!

  • Global: Approximate the entire function V(b) via a set of support points b'. (e.g. SARSOP)
  • Local: Approximate the value for a particular belief with forward simulation. (e.g. POMCP)

[Figure: a forward-simulation search tree rooted at the current belief b_t]
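To make the “local” idea concrete, here is a deliberately simplified sketch (not POMCP itself, which adds UCT-style tree search) that estimates Q(b, a) by forward simulation through the model, reusing the earlier hypothetical container and belief_update:

```python
import numpy as np

def rollout_value(pomdp, b, depth, rng):
    """Discounted return of a uniform-random rollout policy from belief b."""
    value, discount = 0.0, 1.0
    for _ in range(depth):
        a = rng.integers(pomdp.R.shape[1])
        s = rng.choice(len(b), p=b)                          # sample hidden state
        value += discount * pomdp.R[s, a]
        s2 = rng.choice(len(b), p=pomdp.T[a, s])             # s' ~ T(·|s, a)
        o = rng.choice(pomdp.Z.shape[2], p=pomdp.Z[a, s2])   # o ~ O(·|s', a)
        b = belief_update(pomdp, b, a, o)
        discount *= pomdp.gamma
    return value

def q_estimate(pomdp, b, a, depth=10, n=200, seed=0):
    """Monte Carlo estimate of Q(b, a): immediate reward + discounted rollout."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n):
        s = rng.choice(len(b), p=b)
        s2 = rng.choice(len(b), p=pomdp.T[a, s])
        o = rng.choice(pomdp.Z.shape[2], p=pomdp.Z[a, s2])
        total += pomdp.R[s, a] + pomdp.gamma * rollout_value(
            pomdp, belief_update(pomdp, b, a, o), depth, rng)
    return total / n
```

The uniform-random rollout is only a baseline policy; a real planner would search over actions at each level of the tree instead.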

SLIDE 27

Learning

  • Given histories h = (a_1, r_1, o_1, a_2, r_2, o_2, ..., a_T, r_T, o_T), we can learn T, O, R via forward-filtering/backward-sampling or <fill in your favorite timeseries algorithm> (a counting-based sketch follows below).
  • Two principles usually suffice for exploring to learn:
    • Optimism under uncertainty: Try actions that might be good.
    • Risk control: If an action seems risky, ask for help.
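As a rough sketch of the estimation step only (the forward-filtering/backward-sampling sweep that produces hidden-state samples is omitted), T, O, and R can be re-estimated from sampled state sequences by smoothed counting; all conventions follow the hypothetical container above:

```python
import numpy as np

def reestimate(histories, state_seqs, nS, nA, nO):
    """Re-estimate T, Z, R from histories [(a_t, r_t, o_t), ...] paired with
    sampled hidden-state sequences (e.g. one FFBS sweep per history)."""
    T = np.ones((nA, nS, nS))            # Laplace-smoothed transition counts
    Z = np.ones((nA, nS, nO))            # Laplace-smoothed observation counts
    R_sum = np.zeros((nS, nA))
    R_cnt = np.zeros((nS, nA))
    for h, seq in zip(histories, state_seqs):
        for t, (a, r, o) in enumerate(h):
            s = seq[t]
            Z[a, s, o] += 1
            R_sum[s, a] += r
            R_cnt[s, a] += 1
            if t + 1 < len(h):
                T[a, s, seq[t + 1]] += 1
    T /= T.sum(axis=2, keepdims=True)    # normalize over next states
    Z /= Z.sum(axis=2, keepdims=True)    # normalize over observations
    R = R_sum / np.maximum(R_cnt, 1)     # mean observed reward per (s, a)
    return T, Z, R
```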

SLIDE 28

Example: Timeseries in Diabetes

[Figure: coupled clinician model and patient model, linked through lab results (A1c) and meds (anti-diabetic agents)]

Data: Electronic health records of ~17,000 diabetics with 5+ A1c lab measurements and 5+ anti-diabetic agents prescribed.

Collaborators: Isaac Kohane, Stan Shaw

SLIDE 30

Discovered Patient States

[Figure: five discovered states, labeled A1c < 5.5, A1c 5.5–6.5, A1c 6.5–7.5, A1c 7.5–8.5, A1c > 8.5]

  • The “patient states” each correspond to a set of A1c levels (unsurprising).

SLIDE 32

Discovered Clinician States

[Figure: discovered clinician states, Metformin → Metformin, Glipizide → Metformin, Glyburide → Basic Insulins → Glargine, Lispro, Aspart, with “A1c up” transitions advancing treatment and “A1c control” transitions holding steady]

  • The “clinician states” follow the standard treatment protocols for diabetes (unsurprising, but exciting that we discovered this in a completely unsupervised manner).
  • Next steps: Incorporate more variables; identify patient and clinician outliers (quality of care).
SLIDE 33

Example: Experimental Design

  • In a very general sense (a toy instantiation is sketched below):
    • Action space: all possible experiments + “submit”
    • State space: which hypothesis is true
    • Observation space: results of experiments
    • Reward: cost of experiment
  • Allows for non-myopic sequencing of experiments.
  • Example: Bayesian Optimization

Joint with: Ryan Adams/HIPS group
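As a toy, made-up instantiation of this framing (two hypotheses, one noisy experiment, “submit” actions; all numbers are invented), using the hypothetical POMDP class and belief_update from earlier:

```python
import numpy as np

nS, nA, nO = 2, 3, 2                 # states: H0/H1; actions: test, submit-H0, submit-H1
T = np.stack([np.eye(nS)] * nA)      # the true hypothesis never changes
Z = np.full((nA, nS, nO), 0.5)       # submit actions reveal nothing
Z[0] = [[0.8, 0.2], [0.2, 0.8]]      # the experiment is 80% accurate
R = np.array([[-1.0, 10.0, -10.0],   # under H0: testing costs 1, correct submit +10
              [-1.0, -10.0, 10.0]])  # under H1: symmetric payoffs
toy = POMDP(T, Z, R, gamma=0.95)

b = np.array([0.5, 0.5])             # uniform prior over hypotheses
b = belief_update(toy, b, a=0, o=0)  # run one experiment, see outcome 0
print(b)                             # belief shifts toward H0: [0.8, 0.2]
```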

SLIDE 34

Summary

  • POMDPs provide a framework for
    • modeling causal dynamical systems
    • making optimal sequential decisions
  • POMDPs can be learned and solved!

[Figure: the unrolled POMDP graphical model, with observations o_t, actions a_t, rewards r_t, and hidden states s_t over time]