Goal-Directed MDPs: Models and Algorithms
Mausam, Indian Institute of Technology, Delhi
Joint work with Andrey Kolobov and Dan Weld
Planning à la Sutton
- control
- full sequential
- model-based
- value-based
- tabular/function-approximation
- TD/Monte-Carlo
Typical Planning Setting
- vs. RL: model of the world is known
- vs. flat: model of the world in a declarative representation
– symbolic
– large problems
- vs. reward: goal directed
- vs. complete state space: knowledge of the start state
- domain independent: no additional human input
3 Key Messages
- M#0: No need for exploration-exploitation tradeoff
– planning is purely a computational problem (value iteration vs. Q-learning)
- M#1: Search in planning
– states can be ignored or reordered for efficient computation
- M#2: Representation in planning
– develop interesting representations for Factored MDPs
Exploit structure to design domain-independent algorithms
- M#3: Goal-directed MDPs
– design algorithms/models that use explicit knowledge of goals
Agenda
- Background: Stochastic Shortest Paths MDPs
- Background: Heuristic Search for SSP MDPs
- Algorithms: Automatic Basis Function Discovery
- Models: SSPs → Generalized SSPs
Infinite Horizon Discounted Reward MDP
- S: A set of states
- A: A set of actions
- T(s,a,s’): transition model
- R(s,a,s’): reward
- γ: discount factor
Where Does γ Come From?
- γ can affect the optimal policy significantly
– γ = 0 + ε: yields myopic policies for “impatient” agents
– γ = 1 - ε: yields far-sighted policies, inefficient to compute
- How to set it?
– Sometimes suggested by data (e.g., inflation or interest rate)
– Often set to whatever gives a reasonable policy
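To see the effect concretely (illustrative numbers, not from the talk): a reward $r$ received $t$ steps in the future contributes $\gamma^t r$ to the return, so
$$100 \cdot 0.5^{10} \approx 0.098 \qquad \text{vs.} \qquad 100 \cdot 0.99^{10} \approx 90.4,$$
i.e., a small γ makes distant rewards nearly invisible.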
Stochastic Shortest Path MDP
(vs. IHDR: reward R becomes cost C, the discount factor γ is dropped, and goals G are added)
- S: A set of states
- A: A set of actions
- T(s,a,s’): transition model
- C(s,a,s’): cost
- G: set of goals
Minimize
- expected cost to reach a goal
- under full observability
- indefinite horizon
Bellman Equations for SSP
(add base case; no discount factor)
$$V^*(s) = 0 \quad \text{if } s \in G$$
$$V^*(s) = \min_{a \in A} \sum_{s' \in S} T(s,a,s')\,\left[C(s,a,s') + V^*(s')\right] \quad \text{otherwise}$$
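A minimal sketch of this backup in Python, under an assumed dictionary representation (`T[s][a]` maps successors to probabilities; all names here are illustrative, not from the talk):

```python
def bellman_backup(s, V, A, T, C, goals):
    """One Bellman backup for an SSP; returns the new value and a greedy action."""
    if s in goals:
        return 0.0, None  # base case: goals are absorbing and cost-free
    best_q, best_a = float('inf'), None
    for a in A(s):
        # Q(s,a) = sum over successors s2 of T(s,a,s2) * [C(s,a,s2) + V(s2)]
        q = sum(p * (C(s, a, s2) + V[s2]) for s2, p in T[s][a].items())
        if q < best_q:
            best_q, best_a = q, a
    return best_q, best_a
```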
SSP vs. IHDR?
SSP strictly generalizes discounted-reward and finite-horizon MDPs [Bertsekas&Tsitsiklis 95].
Reduction (figure): a discounted-reward MDP with rewards r1, r2 and transition probabilities t1, t2 becomes an SSP with costs C = -r1, C = -r2, transitions scaled to T = γt1, T = γt2, and a new goal state SG entered from every state with probability 1-γ at cost 0.
When is an SSP well formed/defined?
Under two conditions:
- There is a proper policy (reaches a goal with P=1 from all states)
- Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P=1
[Bertsekas, 1995]
Agenda
- Background: Stochastic Shortest Paths MDPs
- Background: Heuristic Search for SSP MDPs
- Algorithms: Automatic Basis Function Discovery
- Models: SSPs → Generalized SSPs
Heuristic Search
- Limitations of VI
– enumeration of the state space
– curse of dimensionality
- Heuristic search: insights
– knowledge of a start state to save on computation
~ (all-sources shortest path → single-source shortest path)
– additional knowledge in the form of a heuristic fn
~ (DFS/BFS → A*)
SSPs0
Under two conditions:
- There is a proper policy (reaches a goal with P=1 from all states)
- Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P=1
- S: A set of states
- A: A set of actions
- T(s,a,s’): transition model
- C(s,a,s’): cost
- G: set of goals
- s0: start state
SSPs0
- What is a solution to SSPs0?
- A policy (S → A)?
– are states that are not reachable from s0 relevant?
– what about states that are never visited (even though reachable)?
Partial Policy
- Define partial policy
– π: S' → A, where S' ⊆ S
- Define partial policy closed w.r.t. a state s
– a partial policy πs
– defined for all states s' reachable by πs starting from s
Partial policy closed wrt s0 (worked example)
Figures (animation frames): states s0…s9 with goal Sg; the partial policy is built up as πs0(s0) = a1, πs0(s1) = a2, πs0(s2) = a1, πs0(s6) = a1. At each step the slide asks: is this policy closed wrt s0? It is once every state reachable under πs0 from s0 has an assigned action.
Policy Graph of πs0
Figure: the subgraph of states reached by πs0 starting from s0.
Greedy Policy Graph
- Define greedy policy: πV(s) = argmina QV(s,a)
- Define greedy partial policy rooted at s0
– the partial policy rooted at s0 that follows the greedy policy; denoted π_V^s0
- Define greedy policy graph
– the policy graph of π_V^s0; denoted G_V^s0
Heuristic Function
- h: S → ℝ
– estimates V*(s)
– gives an indication of the “goodness” of a state
– usually used in initialization: V0(s) = h(s)
– helps us avoid seemingly bad states
- Define admissible heuristic
– optimistic: h(s) ≤ V*(s)
A General Scheme for Heuristic Search in MDPs
- Two (over)simplified intuitions
– Focus on states in the greedy policy wrt V rooted at s0
– Focus on states with residual > ε
- Find & Revise:
– repeat
- find a state that satisfies the two properties above
- perform a Bellman backup
– until no such state remains
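The scheme above as a Python skeleton, reusing `bellman_backup` from the earlier sketch; `find_inconsistent_state` is a hypothetical helper standing in for whatever FIND strategy an instance of the scheme uses:

```python
def find_and_revise(s0, V, A, T, C, goals, eps=1e-4):
    """Generic FIND & REVISE loop for SSP heuristic search."""
    while True:
        # FIND: a state in the greedy graph rooted at s0 with residual > eps
        s = find_inconsistent_state(s0, V, A, T, C, goals, eps)
        if s is None:
            return V  # no such state remains: converged on the greedy envelope
        # REVISE: perform a Bellman backup on it
        V[s], _ = bellman_backup(s, V, A, T, C, goals)
```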
FIND & REVISE [Bonet&Geffner 03a]
- Convergence to V* is guaranteed
– if the heuristic function is admissible
– and no state gets starved in ∞ FIND steps
(REVISE = perform Bellman backups)
LAO* family
add s0 to the fringe and to greedy policy graph
repeat
- FIND: expand some states on the fringe (in greedy graph)
- initialize all new states by their heuristic value
- choose a subset of affected states
- perform some REVISE computations on this subset
- recompute the greedy graph
until greedy graph has no fringe & residuals in greedy graph small
- Output the greedy graph as the final policy
LAO* [Hansen&Zilberstein 98]
add s0 to the fringe and to greedy policy graph
repeat
- FIND: expand best state s on the fringe (in greedy graph)
- initialize all new states by their heuristic value
- subset = all states in expanded graph that can reach s
- perform VI on this subset
- recompute the greedy graph
until greedy graph has no fringe & residuals in greedy graph small
- Output the greedy graph as the final policy
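A compact, runnable sketch of the loop (my own simplification, not the authors' code: it expands every fringe state and runs VI over the whole greedy envelope rather than only ancestors; assumes h(goal) = 0, the `T[s][a]` dictionary representation from earlier, and `bellman_backup` from the first sketch):

```python
from collections import deque

def lao_star(s0, h, A, T, C, goals, eps=1e-4):
    V, expanded = {s0: h(s0)}, set()
    while True:
        # Trace the greedy envelope from s0, collecting unexpanded fringe states.
        envelope, fringe, queue = set(), [], deque([s0])
        while queue:
            s = queue.popleft()
            if s in envelope or s in goals:
                continue
            envelope.add(s)
            if s not in expanded:
                fringe.append(s)  # don't look past unexpanded states
                continue
            _, a = bellman_backup(s, V, A, T, C, goals)  # greedy action under V
            queue.extend(T[s][a].keys())
        if not fringe:
            return V  # greedy graph is goal-closed and locally converged
        for s in fringe:  # FIND: expand fringe states, init successors with h
            expanded.add(s)
            for a in A(s):
                for s2 in T[s][a]:
                    V.setdefault(s2, h(s2))
        # REVISE: value iteration over the envelope until residuals are small
        while True:
            residual = 0.0
            for s in envelope:
                new_v, _ = bellman_backup(s, V, A, T, C, goals)
                residual = max(residual, abs(new_v - V[s]))
                V[s] = new_v
            if residual < eps:
                break
```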
LAO* worked example
Figures (animation frames): LAO* run on a small example with states s0…s8 and goal Sg. The greedy graph grows from s0; fringe states carry their heuristic values h, and VI revises only states that can reach the newly expanded state. At convergence:
- s4 was never expanded
- s8 was never touched
M#1: some states can be ignored for efficient computation
LAO* [Hansen&Zilberstein 98]
add s0 to the fringe and to greedy policy graph
repeat
- FIND: expand best state s on the fringe (in greedy graph)
- initialize all new states by their heuristic value
- subset = all states in expanded graph that can reach s
- perform VI on this subset
- recompute the greedy graph
until greedy graph has no fringe
- Output the greedy graph as the final policy
One expansion → a lot of computation!
Optimizations in LAO*
add s0 to the fringe and to greedy policy graph
repeat
- FIND: expand best state s on the fringe (in greedy graph)
- initialize all new states by their heuristic value
- subset = all states in expanded graph that can reach s
- VI iterations until greedy graph changes (or low residuals)
- recompute the greedy graph
until greedy graph has no fringe
- Output the greedy graph as the final policy
Optimizations in LAO*
add s0 to the fringe and to greedy policy graph
repeat
- FIND: expand all states in greedy fringe
- initialize all new states by their heuristic value
- subset = all states in expanded graph that can reach the expanded states
- VI iterations until greedy graph changes (or low residuals)
- recompute the greedy graph
until greedy graph has no fringe
- Output the greedy graph as the final policy
iLAO* [Hansen&Zilberstein 01]
add s0 to the fringe and to greedy policy graph
repeat
- FIND: expand all states in greedy fringe
- initialize all new states by their heuristic value
- subset = all states in expanded graph that can reach the expanded states
- Only one backup per state in greedy graph
- recompute the greedy graph
until greedy graph has no fringe
- Output the greedy graph as the final policy
In what order? (fringe → start): DFS postorder
Real Time Dynamic Programming [Barto et al 95]
- Original motivation
– agent acting in the real world
- Trial
– simulate the greedy policy starting from the start state
– perform Bellman backups on visited states
– stop when you hit the goal
- RTDP: repeat trials forever
– converges in the limit #trials → ∞
No termination condition!
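A sketch of one trial, again over the assumed dictionary representation and reusing `bellman_backup`; `max_steps` is a safety cap I've added since, as noted above, plain RTDP has no termination condition:

```python
import random

def rtdp_trial(s0, V, h, A, T, C, goals, max_steps=10_000):
    """One RTDP trial: back up and follow the greedy policy until the goal."""
    s, visited = s0, []
    while s not in goals and len(visited) < max_steps:
        visited.append(s)
        for a in A(s):                      # make sure successors have values
            for s2 in T[s][a]:
                V.setdefault(s2, h(s2))
        V[s], a = bellman_backup(s, V, A, T, C, goals)
        succs, probs = zip(*T[s][a].items())
        s = random.choices(succs, weights=probs)[0]  # simulate greedy action
    return visited  # the visited states are useful for labeling (see LRTDP)
```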
Trial (worked example)
Figures (animation frames): start at the start state; repeat { perform a Bellman backup, simulate the greedy action } until you hit the goal. Visited states switch from heuristic values h to revised values V along the way.
RTDP: repeat such trials forever.
RTDP Family of Algorithms
repeat
  s ← s0
  repeat //trials
    REVISE s; identify a_greedy
    FIND: pick s' s.t. T(s, a_greedy, s') > 0
    s ← s'
  until s ∈ G
until termination test
Termination Test: Labeling
- Admissible heuristic ⇒ V(s) ≤ V*(s) and Q(s,a) ≤ Q*(s,a)
- Label a state s as solved if V(s) has converged
Figure: along the best action from s toward sg:
– Res_V(s) < ε ⇒ V(s) won't change! ⇒ label s as solved
Labeling (contd)
– Res_V(s) < ε and s' already solved ⇒ V(s) won't change! ⇒ label s as solved
– Res_V(s) < ε and Res_V(s') < ε ⇒ V(s), V(s') won't change! ⇒ label s, s' as solved
M#3: some algorithms use explicit knowledge of goals
M#1: some states can be ignored for efficient computation
Labeled RTDP [Bonet&Geffner 03b]
label all goal states as solved
repeat
  s ← s0
  repeat //trials
    REVISE s; identify a_greedy
    FIND: sample s' from T(s, a_greedy, ·)
    s ← s'
  until s is solved
  for all states s in the trial, try to label s as solved
until s0 is solved
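A simplified sketch of the labeling check (the published checkSolved procedure differs in details; representation assumptions and `bellman_backup` as in the earlier sketches):

```python
def try_label_solved(s, V, A, T, C, goals, solved, eps=1e-4):
    """Label s (and its greedy envelope) solved if all residuals are < eps."""
    stack, seen, ok = [s], set(), True
    while stack:
        u = stack.pop()
        if u in seen or u in goals or u in solved:
            continue  # goals and already-solved states are converged leaves
        seen.add(u)
        new_v, a = bellman_backup(u, V, A, T, C, goals)
        if abs(new_v - V[u]) >= eps:
            ok = False      # large residual: not solved yet...
            V[u] = new_v    # ...but the backup is still useful
            continue
        stack.extend(T[u][a].keys())  # follow the greedy action's successors
    if ok:
        solved |= seen  # the whole greedy envelope of s has converged
    return ok
```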
LRTDP
- terminates in finite time
– due to the labeling procedure
- anytime
– focuses attention on more probable states
- fast convergence
– focuses attention on unconverged states
LRTDP Extensions
- Different ways to pick the next state
- Different termination conditions
- Bounded RTDP [McMahan et al 05]
- Focused RTDP [Smith&Simmons 06]
- Value of Perfect Information RTDP [Sanner et al 09]
Where do Heuristics come from?
- Domain-dependent heuristics
- Domain-independent heuristics
– dependent on specific domain representation
M#2: factored representations expose useful problem structure
Take-Homes
- efficient computation given start state s0
– heuristic search
- automatic computation of heuristics
– in a domain-independent manner
Agenda
- Background: Stochastic Shortest Paths MDPs
- Background: Heuristic Search for SSP MDPs
- Algorithms: Automatic Basis Function Discovery
- Models: SSPs → Generalized SSPs
Previous Work
- Determinization
– determinize the MDP
– classical planners are fast
– e.g., FF-Replan
– cons: may be troubled by complex contingencies and probabilities
- Function Approximation
– dimensionality reduction
– represent state values with basis functions, e.g., V*(s) ≈ ∑i wi bi(s)
– cons: need a human to design the bi
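The linear form above, as a one-line Python sketch (names illustrative):

```python
def approx_value(s, basis, weights):
    """V*(s) ≈ sum_i w_i * b_i(s): each b_i maps a state to a feature value."""
    return sum(w * b(s) for b, w in zip(basis, weights))
```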
Our Work
Marry these paradigms to extract problem-specific structure in a fast, problem-independent way.
Example Domain
Figure: actions GetS, GetW, GetH.
Example Domain (cont'd)
Figure: actions Smash, Tweak.
SSPs0 MDP
- S: A set of states
- A: A set of actions (GetW, GetH, GetS, Tweak, Smash)
- T(s,a,s’): transition model
- C(s,a,s’): action cost
- s0: start state
- G: set of goals
Contributions
ReTrASE — a scalable approximate MDP solver
– combines function approximation with classical planning
– uses a classical planner to automatically generate basis functions
– fast, memory-efficient, high-quality policies
The Big Picture: ReTrASE
Architecture (figure): determinize the MDP P; run a classical planner on Det(P) from a given state to get a trajectory; regress the trajectory to obtain basis functions; SixthSense turns dead ends into nogoods. A state-space exploration routine (e.g., RTDP) queries the extraction module to evaluate states, yielding Value(s) and the policy.
[Kolobov, Mausam, Weld, AIJ'12]
Determinizing the Domain
Figure: each probabilistic outcome (P = 9/10, P = 1/10) becomes a separate deterministic action in Det(P).
Generating Trajectories
(Architecture figure repeated.) Run the classical planner on Det(P) from the current state; the resulting plan is a trajectory to the goal.
Computing Basis Functions: Regressing Trajectories
(Architecture figure repeated.) Figure: regressing the trajectory yields basis functions; a basis function guarantees the goal is reachable from any state that satisfies it. Each basis function receives an initial weight (1 and 2 in the example).
Computing Values: Meaning of Basis Function Weights
(Architecture figure repeated.) Figure: we want to compute basis-function weights so that the blue basis function looks “better” than the pink one.
Value of a Basis Function
- A basis function enables at least one trajectory
– applicable from all relevant states
- Trajectories combine to form policies
- Value of a basis function ~ “quality” of its policies
- Algorithm based on RTDP
– learn basis function values
– use them to compute values of states
Experimental Results
- Criteria:
– scalability (vs. VI/RTDP-based planners)
– solution quality (vs. IPPC winners)
- Domains: 6 from IPPC-06 and IPPC-08
- Competitors:
– best performer on the particular domain
– best performer in the particular IPPC
– LRTDP
The Big Picture
- ReTrASE is vastly more scalable than VI/RTDP-based planners
- ReTrASE typically rivals or outperforms the best-performing planners on IPPC goal-oriented domains
Triangle-Tire: Memory Consumption
Figure: log10(amount of memory) vs. Triangle-Tire problem #, for LRTDP-OPT, ReTrASE, and LRTDP-FF.
Triangle-Tire: Success Rate
Figure: % of successful trials vs. Triangle-Tire World'08 problem #, for ReTrASE, HMDPP, and RFF-PG.
Exploding Blocks World: Success Rate
Figure: % of successful trials vs. Exploding Blocks World'06 problem #, for ReTrASE, FFReplan, and FPG (~2800 states!).
SSPs0 (revisited)
Under two conditions:
- There is a proper policy (reaches a goal with P=1 from all states)
- Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P=1
But do these conditions hold in practice?
Key Drawback of ReTrASE…
- Dead-end handling is expensive
– expensive to identify: a drain on time
– too many to store: a drain on space
Research Question
Can we devise a sound dead-end identification procedure fast enough to obviate memoization?
Nogoods
SixthSense learns feature combinations (nogoods) whose presence guarantees a state to be a dead end.
Generate-and-Test Procedure
- Generate a nogood candidate (see the sketch below)
– key insight: a nogood is a conjunction that defeats all basis functions
– for each basis function, pick a literal that defeats it
- Test the candidate
– needed for soundness, since we don't know all basis functions
– use the non-relaxed Planning Graph algorithm
- SixthSense can act as a submodule of many planners to identify dead ends
– by checking discovered nogoods against every state
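A sketch of just the generation step, under an assumed representation where a basis function is a set of literals and a literal "defeats" it by contradicting one of its members (`negate` is a hypothetical helper; the soundness test is deliberately left out):

```python
def candidate_nogood(basis_functions, dead_end_literals):
    """Pick, for each known basis function, one literal of the dead-end state
    that defeats it; the conjunction of the picks is a nogood CANDIDATE."""
    candidate = set()
    for bf in basis_functions:
        # assumes some literal of the dead-end state contradicts each bf
        defeater = next(l for l in dead_end_literals if negate(l) in bf)
        candidate.add(defeater)
    return candidate  # must still be tested (planning-graph check) before use
```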
Take Homes
- Novel ideas to learn structure in the domain
- Basis functions
– learned by regressing trajectories
– represent good structure
– generalize across states
- Nogoods
– learned inductively; proved using a sound procedure
– represent bad structure
– generalize across dead-end states
- A novel use of classical planners for MDP algorithms
– retains the decision-theoretic nature of MDPs
– exploits the scalability of classical planners
- Automatic ways to generate basis functions
– no longer an onus on the human designer
– exploits the factored domain model
M#2: factored representations expose useful problem structure
Agenda
- Background: Stochastic Shortest Paths MDPs
- Background: Heuristic Search for SSP MDPs
- Algorithms: Automatic Basis Function Discovery
- Models: SSPs → Generalized SSPs
Theme of the Workshop
- Value Functions → Generalized Value Functions
- Gradient → Extra-gradient
- KL divergence → Bregman divergence
- Contextual bandits → Linear bandits
- SSPs → ?
SSP/SSPs0
An SSP MDP is a tuple <S, A, T, C, G, (s0)>, where:
- S is a finite state space
- A is a finite action set
- T is a stationary transition function
- C is a stationary cost function
- G is a set of absorbing, cost-free goal states
- (s0 is an initial state)
Under two conditions:
- There is a proper policy (reaches a goal with P=1 from all states)
- Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P=1
(These conditions disallow dead ends; if we allowed dead ends, cost would become a meaningless criterion and algorithms could fail to halt.)
Stochastic Shortest-Path MDPs
- Example applications:
– controlling a Mars rover: “How to collect scientific data without damaging the rover?”
– route planning: “How to climb Mount Everest in the cheapest way?”
Dead ends are common!
Discrete MDP Research So Far
Figure: SSP MDPs (together with negative MDPs and positive-bounded MDPs) sit inside the larger space of goal-oriented MDPs, most of which is unexplored (“????”).
- SSPs model many interesting scenarios and are efficiently* solvable by heuristic search
- Beyond them: what interesting problems are here? How do we solve them efficiently?
SSPADE: Dead Ends are Avoidable from s0
- Dead ends may be avoidable from s0 via an optimal policy
- Can't compute V*(s) for every state
- But we need only “relevant” states to get the “right” value
- Can be solved with optimal heuristic search from s0
– FIND shouldn't starve states; REVISE should halt
Figure: an example MDP (states S0, S1, S2, goal SG; actions a1, a2, a3) with a dead end that an optimal policy from S0 avoids.
[Kolobov, Mausam, Weld, UAI'12]
fSSPUDE: SSP with Unavoidable Dead Ends (and a Finite Penalty on Them)
- First attempt: if the agent reaches a dead end, it pays penalty D
- Counterexample (figure): from state s, action a has cost C = ε(D+1), reaches the goal sg with T(s, a, sg) = ε, and reaches a dead end d with T(s, a, d) = 1-ε. Then
$$V^*(s) = \varepsilon(D+1) + \varepsilon \cdot 0 + (1-\varepsilon) \cdot D = D + \varepsilon$$
- Makes non-dead-end states more “expensive” than dead ends! Oops…
fSSPUDE: SSP with Unavoidable Dead Ends (and a Finite Penalty on Them)
- Second attempt: the agent is allowed to stop at any state
– by paying a price = penalty D
– intuition: achieving a goal is worth -D to the agent
- Equivalent to an SSP MDP with a special a_stop action
– applicable in each state
– leads directly to a goal, paying cost D
- Thus, algorithms for SSP apply to fSSPUDE!
[Kolobov, Mausam, Weld, UAI’12]
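One way to realize that construction over the dictionary representation used in the earlier sketches (a hypothetical wrapper, not the paper's implementation):

```python
def add_stop_action(A, T, C, goals, D, stop_goal="GOAL_STOP"):
    """fSSPUDE -> SSP: add an a_stop action to every state that jumps to a
    fresh virtual goal state at cost D."""
    def A2(s):
        return list(A(s)) + ["a_stop"]
    T2 = {s: {**acts, "a_stop": {stop_goal: 1.0}} for s, acts in T.items()}
    def C2(s, a, s2):
        return D if a == "a_stop" else C(s, a, s2)
    return A2, T2, C2, goals | {stop_goal}
```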
MAXPROB: Dealing with Unavoidable, Infinitely Damaging Dead Ends
Figure: an example MDP (states S0, S1, dead end Sd, goal SG) where the goal is reached with probability at most 0.3, e.g., P*G(s1) = 0.3 and P*G(sd) = 0.
- Comparing policies in terms of cost is meaningless here
- MAXPROB/GSSP MDPs: evaluate policies by the probability of reaching the goal
– set all action costs to 0 (they don't matter); reward 1 for reaching the goal
– fixed-point methods such as VI or LRTDP don't converge because of traps
[Kolobov, Mausam, Weld, Geffner ICAPS'11]
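Written as a value function (a standard rendering of the cost-0/reward-1 construction above), the goal-probability function satisfies
$$P^*_G(s) = 1 \ \text{if } s \in G, \qquad P^*_G(s) = \max_{a \in A} \sum_{s' \in S} T(s,a,s')\, P^*_G(s') \ \text{otherwise},$$
and the traps mentioned above are cycles on which this maximum is attained without any progress toward the goal, which is why plain fixed-point iteration can stall.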
MDP Examples
Figures: three small goal-oriented MDPs over states S0…S4 and goal G (rewards such as 2, 1, -1; a branch with probability 0.5), annotated with whether each qualifies as an SSP.
Generalized SSPs: Definition
- An MDP M = <S, A, T, R, G, s0> for which
– there is a proper policy (reaches the goal with P=1)
– the sum of non-negative rewards accumulated by any policy starting at s0 is bounded from above
- Solving a GSSP = finding a reward-maximizing Markovian policy that reaches the goal
Generalized SSPs: Example
Figures: the example MDPs above are GSSPs: a proper policy exists, and for any policy π the sum of non-negative rewards is ≤ 2. A policy that reaches the goal is a solution; one that never does is not.
GSSPs: Is V* A Fixed Point of B?
- Reminder: in SSPs, V* = B V*, where B is the Bellman backup operator
$$B\,V(s) = \max_{a \in A} \Big\{ R(s,a) + \sum_{s' \in succ(s,a)} T(s,a,s')\,V(s') \Big\}$$
- In SSPs, V* is a fixed point of B
– still true in GSSPs (figure: example values on the running example)
GSSPs: Is V* The Unique Fixed Point of B?
- In SSPs, V* is the unique fixed point of B
– i.e., V* = B ∘ B ∘ … ∘ B V0, where V0 is a heuristic value function
– not true in GSSPs (figure: a second, suboptimal fixed point on the running example)
– moreover, all suboptimal fixed points are admissible!
GSSPs: Is Every V*-greedy π A Solution?
- In SSPs, every π greedy w.r.t. V* reaches the goal
– not true in GSSPs (figure: a V*-greedy policy that never reaches the goal)
Efficiently Solving GSSPs: Attempt #1
- Just run F&R!
– start with an admissible V0
– done!
(Figure: F&R converges, but to a suboptimal fixed point on the running example.)
Attempt #1: What Went Wrong?
- In GSSPs, suboptimal fixed points are admissible!
– when starting with V0 ≥ V*, F&R can hit one of them
– B can't change V over traps: strongly connected components in V's greedy graph
- Can yield an arbitrarily poor solution
(Figure: the trap in the running example.)
Efficiently Solving GSSPs: FRET
- Find, Revise, Eliminate Traps
– first heuristic search algorithm for MDPs beyond SSP
– provably optimal if the heuristic is admissible
- Main idea (see the sketch below):
– run F&R until convergence
– eliminate traps in the policy envelope
– repeat until no more traps
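The main idea as a Python skeleton, reusing `find_and_revise` from earlier; `find_traps` and `eliminate_trap` are hypothetical helpers (trap detection amounts to finding strongly connected components of V's greedy graph, per the previous slide):

```python
def fret(s0, V0, A, T, C, goals, eps=1e-4):
    """FRET skeleton: Find & Revise, then eliminate traps, until none remain."""
    V = dict(V0)  # must start admissible for optimality
    while True:
        find_and_revise(s0, V, A, T, C, goals, eps)  # run F&R until convergence
        traps = find_traps(s0, V, A, T, goals)       # SCCs of V's greedy graph
        if not traps:
            return V                                 # no traps left: done!
        for trap in traps:
            eliminate_trap(trap, V, A, T, C, goals)  # re-value/collapse the trap
```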
FRET Example: Finding V*
Figures (animation frames): starting from an admissible V0, FRET alternates running Find-and-Revise to convergence with eliminating traps in the resulting Vi, repeating until no traps are left — done!
FRET Example: Extracting π*
- Iteratively “connect” states to the goals
– using optimal actions
– until s0 is connected
(Figure: the running example with its optimal values.)
Experimental Setup
- Problems: MAXPROB versions of EBW (Exploding Blocks World)
- Planners: VI vs. FRET
- Heuristics: Zero for VI; One + SixthSense for FRET
– SixthSense soundly identifies some of the dead ends; their values are set to 0
Goal-Oriented MDP Hierarchy
Figure: discounted-reward and finite-horizon MDPs sit inside SSP, which is extended in turn by SSPADE, fSSPUDE, iSSPUDE, GSSP, and S3Ps.
Future Work: Solving S3P
- Stochastic Safest and Shortest Path (S3P) MDPs
– Teichteil-Koenigsbuch, AAAI'12
– goal-oriented MDPs with no restriction on costs
Figure: an example S3P (states S0, S1, S2, goal SG) featuring alternating cycles, non-positive cycles, and unavoidable dead ends; costs mix signs, e.g., C(s0, a2, s0) = -7.2 and C(s1, a2, sG) = 1.
Take Homes
- SSP MDPs exclude interesting planning scenarios
- Generalized SSPs
– handle zero-cost cycles
– GSSP contains SSP and several other MDP classes
– heuristic search algorithm (FRET)
- Dead ends are tricky in undiscounted goal MDPs
- Well-formed extensions of SSP MDPs
– can have unintuitive DP properties
– what is beyond GSSPs?
– loads of open questions: theoretical & algorithmic
M#3: some models use explicit knowledge of goals
Agenda
- Background: Stochastic Shortest Paths MDPs
- Background: Heuristic Search for SSP MDPs
- Algorithms: Automatic Basis Function Discovery
- Models: SSPs → Generalized SSPs
AND-OR Graph in Flat Space vs. ASAP, AS, and ASAM Graphs
Figures: the AND-OR graph in flat space for an example MDP (states S0…S3 paired with actions L, S, R, and goal G), alongside the corresponding ASAP graph, AS graph [1], and ASAM graph [2].
[1]: Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 2003.
[2]: Balaraman Ravindran and A. Barto. Approximate homomorphisms: A framework for non-exact minimization in Markov decision processes. In ICKBCS, 2004.
Key Properties
- PROPERTY 1: The original MDP does not reduce to an abstract MDP
- PROPERTY 2: ASAP subsumes abstractions computed by AS and ASAM
- PROPERTY 3: Value Iteration on the abstract AND-OR graph returns optimal value functions for the original MDP
Experiments
[Anand, Grover, Mausam, Singla – submitted]
M#1: states can be ignored (abstracted) for efficient computation
3 Key Messages
- M#0: No need for exploration-exploitation tradeoff
– planning is purely a computational problem (value iteration vs. Q-learning)
- M#1: Search in planning
– states can be ignored or reordered for efficient computation
- M#2: Representation in planning
– develop interesting representations for Factored MDPs
Exploit structure to design domain-independent algorithms
- M#3: Goal-directed MDPs
– design algorithms/models that use explicit knowledge of goals