slide-1
SLIDE 1

Goal-Directed MDPs

Models and Algorithms

Mausam Indian Institute of Technology, Delhi Joint work with Andrey Kolobov and Dan Weld

slide-2
SLIDE 2

Planning à la Sutton

  • control
  • full sequential
  • model-based
  • value-based
  • tabular/function-approximation
  • TD/Monte-Carlo
slide-3
SLIDE 3

Typical Planning Setting

  • vs. RL: model of the world is known
  • vs. flat: model of the world in a declarative representation

– symbolic – large problems

  • vs. reward: goal directed
  • vs. complete state space: knowledge of the start state
  • domain independent: no additional human input
slide-4
SLIDE 4

3 Key Messages

  • M#0: No need for exploration-exploitation tradeoff

– planning is purely a computational problem (V.I. vs. Q)

  • M#1: Search in planning

– states can be ignored or reordered for efficient computation

  • M#2: Representation in planning

– develop interesting representations for Factored MDPs

→ Exploit structure to design domain-independent algorithms

  • M#3: Goal-directed MDPs

– design algorithms/models that use explicit knowledge of goals

4

slide-5
SLIDE 5

Agenda

  • Background: Stochastic Shortest Paths MDPs
  • Background: Heuristic Search for SSP MDPs
  • Algorithms: Automatic Basis Function Discovery
  • Models: SSPs → Generalized SSPs
slide-6
SLIDE 6

Infinite Horizon Discounted Reward MDP

  • S: A set of states
  • A: A set of actions
  • T(s,a,s’): transition model

  • R(s,a,s’): reward
  • γ: discount factor
slide-7
SLIDE 7

Where Does γ Come From?

  • γ can affect optimal policy significantly

– γ = 0 + ε: yields myopic policies for “impatient” agents
– γ = 1 - ε: yields far-sighted policies, inefficient to compute

  • How to set it?

– Sometimes suggested by data

  • (e.g., inflation or interest rate)

– Often set to whatever gives a reasonable policy

7

slide-8
SLIDE 8

Infinite Horizon Discounted Reward MDP

  • S: A set of states
  • A: A set of actions
  • T(s,a,s’): transition model

  • R(s,a,s’): reward
  • γ: discount factor
slide-9
SLIDE 9

Stochastic Shortest Path MDP

  • S: A set of states
  • A: A set of actions
  • T(s,a,s’): transition model

  • R(s,a,s’): reward
  • γ: discount factor
slide-10
SLIDE 10

Stochastic Shortest Path MDP

  • S: A set of states
  • A: A set of actions
  • T(s,a,s’): transition model

  • C(s,a,s’): cost
  • γ: discount factor
slide-11
SLIDE 11

Stochastic Shortest Path MDP

  • S: A set of states
  • A: A set of actions
  • T(s,a,s’): transition model

  • C(s,a,s’): cost
slide-12
SLIDE 12

Stochastic Shortest Path MDP

  • S: A set of states
  • A: A set of actions
  • T(s,a,s’): transition model

  • C(s,a,s’): cost
  • G: set of goals

Minimize

  • expected cost to reach a goal
  • under full observability
  • indefinite horizon
slide-13
SLIDE 13

Bellman Equations for SSP

add base case; no discount factor

V*(s) = 0                                                        if s ∈ G
V*(s) = min_{a ∈ A} ∑_{s' ∈ S} T(s,a,s') [ C(s,a,s') + V*(s') ]   otherwise
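A minimal, illustrative sketch of value iteration specialized to these equations. The dict-based encoding of T and C, the actions(s) callback, and the convergence threshold are assumptions made for the example, not part of the slides.

```python
def ssp_value_iteration(states, actions, T, C, goals, eps=1e-6):
    """Value iteration for an SSP: V(s) = 0 for goal states, Bellman backups elsewhere.

    T[(s, a)] is a list of (s_next, prob); C[(s, a, s_next)] is the action cost.
    """
    V = {s: 0.0 for s in states}
    while True:
        max_residual = 0.0
        for s in states:
            if s in goals:
                continue  # base case: V*(s) = 0 for goal states
            q_values = []
            for a in actions(s):
                q = sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
                q_values.append(q)
            new_v = min(q_values)
            max_residual = max(max_residual, abs(new_v - V[s]))
            V[s] = new_v
        if max_residual < eps:
            return V
```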

slide-14
SLIDE 14

SSP vs. IHDR?

(Diagram: SSP MDPs contain both discounted-reward MDPs and finite-horizon MDPs.)

slide-15
SLIDE 15

Discounted Reward MDP → SSP

[Bertsekas&Tsitsiklis 95]

(Figure: the reduction. In the discounted MDP, action a from s0 reaches s1 with probability t1 and reward r1, and s2 with probability t2 and reward r2. In the equivalent SSP, a reaches s1 with probability γ·t1 and cost -r1, s2 with probability γ·t2 and cost -r2, and a new absorbing goal sG with probability 1-γ and cost 0.)
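A sketch of that compilation, assuming the same dict-based MDP encoding used in the earlier sketch; the goal label "GOAL" is an illustrative choice.

```python
def discounted_to_ssp(T, R, gamma, goal="GOAL"):
    """Compile a discounted-reward MDP into an SSP [Bertsekas & Tsitsiklis 95].

    For each (s, a): successors keep probability gamma * t and get cost -reward;
    with probability 1 - gamma the action jumps to an absorbing goal at cost 0.
    """
    T_ssp, C_ssp = {}, {}
    for (s, a), succs in T.items():
        new_succs = []
        for s2, t in succs:
            new_succs.append((s2, gamma * t))
            C_ssp[(s, a, s2)] = -R[(s, a, s2)]
        new_succs.append((goal, 1.0 - gamma))
        C_ssp[(s, a, goal)] = 0.0
        T_ssp[(s, a)] = new_succs
    return T_ssp, C_ssp
```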

slide-16
SLIDE 16

When is SSP well formed/defined

Under two conditions:

  • There is a proper policy (reaches a goal with P= 1 from all states)
  • Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P = 1

16

[Bertsekas, 1995]

  • S: A set of states
  • A: A set of actions
  • T(s,a,s’): transition model
  • C(s,a,s’): cost
  • G: set of goals
slide-17
SLIDE 17

Agenda

  • Background: Stochastic Shortest Paths MDPs
  • Background: Heuristic Search for SSP MDPs
  • Algorithms: Automatic Basis Function Discovery
  • Models: SSPs → Generalized SSPs
slide-18
SLIDE 18

Heuristic Search

  • Limitations of VI

– enumeration of the state space
– curse of dimensionality

  • Heuristic search: insights

– knowledge of a start state to save on computation

~ (all-sources shortest path → single-source shortest path)

– additional knowledge in the form of heuristic fn

~ (DFS/BFS → A*)

slide-19
SLIDE 19

SSPs0

Under two conditions:

  • There is a proper policy (reaches a goal with P= 1 from all states)
  • Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P = 1

19

  • S: A set of states
  • A: A set of actions
  • T(s,a,s’): transition model
  • C(s,a,s’): cost
  • G: set of goals
  • s0: start state
slide-20
SLIDE 20

SSPs0

  • What is a solution to SSPs0?
  • A policy (S → A)?

– are states that are not reachable from s0 relevant?
– what about states that are never visited (even though reachable)?

slide-21
SLIDE 21

Partial Policy

  • Define a partial policy
    – π: S’ → A, where S’ ⊆ S

  • Define a partial policy closed w.r.t. a state s
    – a partial policy πs
    – defined for all states s’ reachable by πs starting from s
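A small sketch of checking this definition: a partial policy is closed w.r.t. s if every state reachable by following it from s is either a goal or in the policy's domain. The dict/set encoding below is an assumption for illustration.

```python
def is_closed(policy, T, s, goals):
    """policy: dict state -> action; T[(s, a)]: list of (s_next, prob); goals: set."""
    frontier, seen = [s], {s}
    while frontier:
        cur = frontier.pop()
        if cur in goals:
            continue
        if cur not in policy:
            return False  # reachable state with no action assigned
        for s2, p in T[(cur, policy[cur])]:
            if p > 0 and s2 not in seen:
                seen.add(s2)
                frontier.append(s2)
    return True
```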

21

slide-22
SLIDE 22

Partial policy closed wrt s0

(Figure: example graph with start state s0, goal Sg, and states s1 … s9.)

slide-23
SLIDE 23

Partial policy closed wrt s0

(Figure: the same graph, with partial policy πs0(s0) = a1, πs0(s1) = a2, πs0(s2) = a1.)

Is this policy closed wrt s0?

slide-24
SLIDE 24

Partial policy closed wrt s0

(Figure: the same graph and partial policy, πs0(s0) = a1, πs0(s1) = a2, πs0(s2) = a1.)

Is this policy closed wrt s0?

slide-25
SLIDE 25

Partial policy closed wrt s0

(Figure: the partial policy extended with πs0(s6) = a1, i.e., πs0(s0) = a1, πs0(s1) = a2, πs0(s2) = a1, πs0(s6) = a1.)

Is this policy closed wrt s0?

slide-26
SLIDE 26

Policy Graph of πs0

(Figure: the policy graph of πs0, where πs0(s0) = a1, πs0(s1) = a2, πs0(s2) = a1, πs0(s6) = a1.)

slide-27
SLIDE 27

Greedy Policy Graph

  • Define greedy policy: πV(s) = argmin_a QV(s,a)
  • Define greedy partial policy rooted at s0
    – a partial policy rooted at s0
    – greedy w.r.t. V
    – denoted by πVs0
  • Define greedy policy graph
    – policy graph of πVs0: denoted by GVs0

slide-28
SLIDE 28

Heuristic Function

  • h(s): S → R
    – estimates V*(s)
    – gives an indication about the “goodness” of a state
    – usually used in initialization: V0(s) = h(s)
    – helps us avoid seemingly bad states

  • Define admissible heuristic
    – optimistic: h(s) ≤ V*(s)

28

slide-29
SLIDE 29

A General Scheme for Heuristic Search in MDPs

  • Two (over)simplified intuitions

– Focus on states in the greedy policy w.r.t. V rooted at s0
– Focus on states with residual > ε

  • Find & Revise:

– repeat

  • find a state that satisfies the two properties above
  • perform a Bellman backup

– until no such state remains
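For concreteness, a schematic rendering of this loop; the greedy_graph_states, residual, and bellman_backup callbacks and the admissible initial V are all assumptions made for the sketch.

```python
def find_and_revise(s0, V, residual, greedy_graph_states, bellman_backup, eps=1e-4):
    """Generic FIND & REVISE: repeatedly back up states that are (a) in the
    greedy policy graph rooted at s0 and (b) have residual > eps."""
    while True:
        candidates = [s for s in greedy_graph_states(s0, V) if residual(s, V) > eps]
        if not candidates:
            return V          # no state satisfies both properties: done
        for s in candidates:  # FIND a state ...
            V[s] = bellman_backup(s, V)  # ... and REVISE it
```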

29

slide-30
SLIDE 30

FIND & REVISE [Bonet&Geffner 03a]

  • Convergence to V* is guaranteed

– if the heuristic function is admissible
– no state gets starved over infinitely many FIND steps

(REVISE = perform Bellman backups)

slide-31
SLIDE 31

LAO* family

add s0 to the fringe and to the greedy policy graph
repeat

  • FIND: expand some states on the fringe (in the greedy graph)
  • initialize all new states by their heuristic value
  • choose a subset of affected states
  • perform some REVISE computations on this subset
  • recompute the greedy graph

until the greedy graph has no fringe & residuals in the greedy graph are small

  • Output the greedy graph as the final policy
slide-32
SLIDE 32

LAO* [Hansen&Zilberstein 98]

add s0 to the fringe and to the greedy policy graph
repeat

  • FIND: expand the best state s on the fringe (in the greedy graph)
  • initialize all new states by their heuristic value
  • subset = all states in the expanded graph that can reach s
  • perform VI on this subset
  • recompute the greedy graph

until the greedy graph has no fringe & residuals in the greedy graph are small

  • Output the greedy graph as the final policy
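A schematic LAO* skeleton mirroring the pseudocode above. Every helper (greedy_graph, expand, ancestors, run_vi, pick) is an assumed callback, so this is a sketch of the control flow rather than a full implementation.

```python
def lao_star(s0, h, expand, ancestors, run_vi, greedy_graph, pick, eps=1e-4):
    """LAO* skeleton. Assumed helpers:
    greedy_graph(s0, V, expanded) -> (policy_graph, fringe, max_residual);
    expand(s) -> successor states of s;
    ancestors(s, expanded) -> expanded states that can reach s;
    run_vi(subset, V) -> updates V in place; pick(fringe) -> fringe state."""
    V = {s0: h(s0)}
    expanded = set()
    while True:
        policy_graph, fringe, max_residual = greedy_graph(s0, V, expanded)
        if not fringe and max_residual < eps:
            return policy_graph              # output the greedy graph as the policy
        s = pick(fringe)                     # FIND: expand the chosen fringe state
        expanded.add(s)
        for s2 in expand(s):
            V.setdefault(s2, h(s2))          # initialize new states with h
        run_vi(ancestors(s, expanded), V)    # VI on the states that can reach s
```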
slide-33
SLIDE 33

34

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

add s0 in the fringe and in greedy graph

s0

V(s0) = h(s0)

slide-34
SLIDE 34

35

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0

V(s0) = h(s0)

FIND: expand some states on the fringe (in greedy graph)

slide-35
SLIDE 35

36

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset

s0 s1 s2 s3 s4

V(s0) h h h h

slide-36
SLIDE 36

37

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset recompute the greedy graph

s0 s1 s2 s3 s4

V(s0) h h h h

slide-37
SLIDE 37

38

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0 s1 s2 s3 s4 s6 s7

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset recompute the greedy graph

h h h h h h V(s0)

slide-38
SLIDE 38

39

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0 s1 s2 s3 s4 s6 s7

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset recompute the greedy graph

h h h h h h V(s0)

slide-39
SLIDE 39

40

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0 s1 s2 s3 s4 s6 s7

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset recompute the greedy graph

h h V h h h V

slide-40
SLIDE 40

41

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0 s1 s2 s3 s4 s6 s7

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset recompute the greedy graph

h h V h h h V

slide-41
SLIDE 41

42

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0

Sg

s1 s2 s3 s4 s5 s6 s7

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset recompute the greedy graph

h h V h h h V V h

slide-42
SLIDE 42

43

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0

Sg

s1 s2 s3 s4 s5 s6 s7

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset recompute the greedy graph

h h V h h h V V h

slide-43
SLIDE 43

44

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0

Sg

s1 s2 s3 s4 s5 s6 s7

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset recompute the greedy graph

V h V h h h V V h

slide-44
SLIDE 44

45

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0

Sg

s1 s2 s3 s4 s5 s6 s7

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset recompute the greedy graph

V h V h h h V V h

slide-45
SLIDE 45

46

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0

Sg

s1 s2 s3 s4 s5 s6 s7

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset recompute the greedy graph

V V V h h h V V h

slide-46
SLIDE 46

47

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0

Sg

s1 s2 s3 s4 s5 s6 s7

FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform VI on this subset recompute the greedy graph

V V V h h h V V h

slide-47
SLIDE 47

48

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0

Sg

s1 s2 s3 s4 s5 s6 s7

  • Output the greedy graph as the final policy

V V V h V h V V h

slide-48
SLIDE 48

49

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0

Sg

s1 s2 s3 s4 s5 s6 s7

  • Output the greedy graph as the final policy

V V V h V h V V h

slide-49
SLIDE 49

50

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

LAO*

s0

Sg

s1 s2 s3 s4 s5 s6 s7

s4 was never expanded; s8 was never touched

M#1: some states can be ignored for efficient computation

slide-50
SLIDE 50

51

LAO* [Hansen&Zilberstein 98]

add s0 to the fringe and to greedy policy graph repeat

  • FIND: expand best state s on the fringe (in greedy graph)
  • initialize all new states by their heuristic value
  • subset = all states in expanded graph that can reach s
  • perform VI on this subset
  • recompute the greedy graph

until greedy graph has no fringe

  • Output the greedy graph as the final policy

One expansion → a lot of computation

slide-51
SLIDE 51

52

Optimizations in LAO*

add s0 to the fringe and to greedy policy graph repeat

  • FIND: expand best state s on the fringe (in greedy graph)
  • initialize all new states by their heuristic value
  • subset = all states in expanded graph that can reach s
  • VI iterations until greedy graph changes (or low residuals)
  • recompute the greedy graph

until greedy graph has no fringe

  • Output the greedy graph as the final policy
slide-52
SLIDE 52

53

Optimizations in LAO*

add s0 to the fringe and to greedy policy graph repeat

  • FIND: expand all states in greedy fringe
  • initialize all new states by their heuristic value
  • subset = all states in expanded graph that can reach s
  • VI iterations until greedy graph changes (or low residuals)
  • recompute the greedy graph

until greedy graph has no fringe

  • Output the greedy graph as the final policy
slide-53
SLIDE 53

54

iLAO* [Hansen&Zilberstein 01]

add s0 to the fringe and to greedy policy graph repeat

  • FIND: expand all states in greedy fringe
  • initialize all new states by their heuristic value
  • subset = all states in expanded graph that can reach s
  • Only one backup per state in the greedy graph
  • recompute the greedy graph

until greedy graph has no fringe

  • Output the greedy graph as the final policy

in what order? (fringe → start): DFS postorder

slide-54
SLIDE 54

Real Time Dynamic Programming

[Barto et al 95]

  • Original Motivation

– agent acting in the real world

  • Trial

– simulate the greedy policy starting from the start state
– perform a Bellman backup on each visited state
– stop when you hit the goal

  • RTDP: repeat trials forever
    – converges in the limit as #trials → ∞

55

No termination condition!

slide-55
SLIDE 55

Trial

56

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

slide-56
SLIDE 56

Trial

57

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

h h h h V

start at start state repeat perform a Bellman backup simulate greedy action

slide-57
SLIDE 57

Trial

58

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

h h h h V

start at start state repeat perform a Bellman backup simulate greedy action

h h

slide-58
SLIDE 58

Trial

59

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

h h V h V

start at start state repeat perform a Bellman backup simulate greedy action

h h

slide-59
SLIDE 59

Trial

60

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

h h V h V

start at start state repeat perform a Bellman backup simulate greedy action

h h

slide-60
SLIDE 60

Trial

61

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

h h V h V

start at start state repeat perform a Bellman backup simulate greedy action

V h

slide-61
SLIDE 61

Trial

62

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

h h V h V

start at start state repeat perform a Bellman backup simulate greedy action until hit the goal

V h

slide-62
SLIDE 62

Trial

63

s0

Sg

s1 s2 s3 s4 s5 s6 s7 s8

h h V h V

start at start state repeat perform a Bellman backup simulate greedy action until hit the goal

V h

RTDP repeat forever

slide-63
SLIDE 63

RTDP Family of Algorithms

repeat
  s ← s0
  repeat  // trials
    REVISE s; identify a_greedy
    FIND: pick s’ s.t. T(s, a_greedy, s’) > 0
    s ← s’
  until s ∈ G
until termination test

64
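A runnable sketch of one way to implement these trials. The dict-based MDP encoding, random.choices sampling, and the trial/step budgets are illustrative assumptions, not part of the slides.

```python
import random

def rtdp(s0, goals, actions, T, C, h, n_trials=1000, max_steps=500):
    """RTDP: repeat trials of (Bellman backup at the current state, then
    simulate the greedy action) from s0 until a goal is reached."""
    V = {}
    def value(s):
        return 0.0 if s in goals else V.setdefault(s, h(s))
    def q(s, a):
        return sum(p * (C[(s, a, s2)] + value(s2)) for s2, p in T[(s, a)])
    for _ in range(n_trials):
        s, steps = s0, 0
        while s not in goals and steps < max_steps:
            a_greedy = min(actions(s), key=lambda a: q(s, a))
            V[s] = q(s, a_greedy)                              # REVISE (backup)
            succs = T[(s, a_greedy)]
            s = random.choices([s2 for s2, _ in succs],
                               weights=[p for _, p in succs])[0]  # simulate
            steps += 1
    return V
```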

slide-64
SLIDE 64
Termination Test: Labeling

  • Admissible heuristic
    ⇒ V(s) ≤ V*(s), Q(s,a) ≤ Q*(s,a)
  • Label a state s as solved
    – if V(s) has converged

(Figure: s connected to the goal sg along best actions.)
Res_V(s) < ε  ⇒  V(s) won’t change!  ⇒  label s as solved

slide-65
SLIDE 65

Labeling (contd)

(Figure: s reaches an already-solved state s' along best actions toward sg.)
Res_V(s) < ε and s' already solved  ⇒  V(s) won’t change!  ⇒  label s as solved

slide-66
SLIDE 66

Labeling (contd)

(Figure: two labeling cases along best actions toward sg.)
Res_V(s) < ε and s' already solved  ⇒  V(s) won’t change!  ⇒  label s as solved
Res_V(s) < ε and Res_V(s') < ε  ⇒  V(s), V(s') won’t change!  ⇒  label s, s' as solved

M#3: some algorithms use explicit knowledge of goals
M#1: some states can be ignored for efficient computation

slide-67
SLIDE 67

Labeled RTDP [Bonet&Geffner 03b]

repeat
  s ← s0
  label all goal states as solved
  repeat  // trial
    REVISE s; identify a_greedy
    FIND: sample s’ from T(s, a_greedy, ·)
    s ← s’
  until s is solved
  for all states s in the trial, try to label s as solved
until s0 is solved
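A simplified sketch of that loop, reusing the earlier encoding assumptions. The labeling rule used here (residual below ε and every greedy successor already solved) mirrors the labeling slides; Bonet & Geffner's full CheckSolved additionally handles cyclic dependencies among unsolved states, which this sketch does not.

```python
import random

def lrtdp(s0, goals, actions, T, C, h, eps=1e-4, max_steps=500):
    """Simplified Labeled RTDP sketch (illustrative, not the full CheckSolved)."""
    V, solved = {}, set(goals)                          # goal states start solved
    def value(s):
        return 0.0 if s in goals else V.setdefault(s, h(s))
    def q(s, a):
        return sum(p * (C[(s, a, s2)] + value(s2)) for s2, p in T[(s, a)])
    def greedy(s):
        return min(actions(s), key=lambda a: q(s, a))
    while s0 not in solved:
        s, trial = s0, []
        while s not in solved and len(trial) < max_steps:
            trial.append(s)
            a = greedy(s)
            V[s] = q(s, a)                              # REVISE
            succs = T[(s, a)]
            s = random.choices([s2 for s2, _ in succs],
                               weights=[p for _, p in succs])[0]
        for s in reversed(trial):                       # try to label as solved
            a = greedy(s)
            if abs(q(s, a) - value(s)) < eps and all(s2 in solved for s2, _ in T[(s, a)]):
                solved.add(s)
            else:
                break
    return V, solved
```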

68

slide-68
SLIDE 68
LRTDP

  • terminates in finite time
    – due to the labeling procedure
  • anytime
    – focuses attention on more probable states
  • fast convergence
    – focuses attention on unconverged states

slide-69
SLIDE 69

LRTDP Extensions

  • Different ways to pick next state
  • Different termination conditions
  • Bounded RTDP [McMahan et al 05]
  • Focused RTDP [Smith&Simmons 06]
  • Value of Perfect Information RTDP [Sanner et al 09]

70

slide-70
SLIDE 70

Where do Heuristics come from?

  • Domain-dependent heuristics
  • Domain-independent heuristics

– dependent on specific domain representation

71

M#2: factored representations expose useful problem structure

slide-71
SLIDE 71

Take-Homes

  • efficient computation given start state s0

– heuristic search

  • automatic computation of heuristics

– domain independent manner

slide-72
SLIDE 72

Shameless Plug

74

slide-73
SLIDE 73

Agenda

  • Background: Stochastic Shortest Paths MDPs
  • Background: Heuristic Search for SSP MDPs
  • Algorithms: Automatic Basis Function Discovery
  • Models: SSPs → Generalized SSPs
slide-74
SLIDE 74

Previous Work

76

  • Determinization
    – Determinize the MDP
    – Classical planners are fast
    – E.g., FF-Replan
    – Cons: may be troubled by
      • complex contingencies
      • probabilities

  • Function Approximation
    – Dimensionality reduction
    – Represent state values with basis functions
      • E.g., V*(s) ≈ ∑_i w_i b_i(s)
    – Cons:
      • Need a human to provide the b_i

Our Work

Marry these paradigms to extract problem-specific structure in a fast, problem-independent way.

slide-75
SLIDE 75

Example Domain

78

(Figure: actions GetS, GetW, GetH.)

slide-76
SLIDE 76

Example Domain (cont’d)

79

(Figure: actions Smash, Tweak.)

slide-77
SLIDE 77

SSPs0 MDP

  • S: A set of states
  • A: A set of actions
  • T(s,a,s’): transition model

  • C(s,a,s’): action cost
  • s0: start state
  • G: set of goals

GetW, GetH, GetS, Tweak, Smash

slide-78
SLIDE 78

Contributions

ReTrASE — a scalable approximate MDP solver

– Combines function approximation with classical planning
– Uses a classical planner to automatically generate basis functions
– Fast, memory-efficient, high-quality policies

81

slide-79
SLIDE 79

The Big Picture: ReTrASE

(Architecture diagram: MDP P is determinized into Det(P); a classical planner run on Det(P) produces trajectories, which are regressed into basis functions; a state-space exploration routine (e.g., RTDP) on P asks the extraction module to evaluate states from those basis functions; SixthSense flags dead-end states via nogoods.)

[Kolobov, Mausam, Weld, AIJ’12]

slide-80
SLIDE 80

Determinizing the Domain

P = 9/10 P = 1/10

83

slide-81
SLIDE 81

Generating Trajectories

(Architecture diagram as above; this step runs the classical planner on Det(P) to generate trajectories.)

slide-82
SLIDE 82

Generating Trajectories

85

slide-83
SLIDE 83

Computing Basis Functions

(Architecture diagram as above; this step regresses the planner's trajectories into basis functions.)

slide-84
SLIDE 84

Regressing Trajectories

87

(Figure: regressing the trajectory produces basis functions; a basis function guarantees the goal is reachable from any state it applies to. Initial weights, e.g., 1 and 2, are attached to the basis functions.)

slide-85
SLIDE 85

Basis Functions

88

slide-86
SLIDE 86

Computing Values

(Architecture diagram as above; this step uses the basis functions to evaluate states for the exploration routine.)

slide-87
SLIDE 87

Meaning of Basis Function Weights

90 90

Want to compute basis function weights so that the blue basis function looks “better” than the pink one!

slide-88
SLIDE 88

Value of a Basis Function

  • Basis function enables at least one trajectory

– applicable from all relevant states

  • Trajectories combine to form policies
  • Value of a basis function ~ “quality” of its policies
  • Algorithm based on RTDP

– Learn basis function values
– Use them to compute values of states
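A toy sketch of the value-extraction idea: a basis function here is a set (conjunction) of literals obtained by regression, it "applies" to a state containing those literals, and the state's value is approximated from the weights of the applicable basis functions. The min-aggregation and the penalty for states with no applicable basis function are illustrative assumptions, not the exact ReTrASE update.

```python
def approx_state_value(state_literals, basis_functions, weights, dead_end_penalty=1e6):
    """state_literals: set of literals true in the state.
    basis_functions: list of frozensets of literals (regressed conjunctions).
    weights: dict basis_function -> learned weight (lower = better)."""
    applicable = [b for b in basis_functions if b <= state_literals]
    if not applicable:
        return dead_end_penalty        # no known way to reach the goal from here
    return min(weights[b] for b in applicable)

# Tiny usage example with two hypothetical basis functions:
b1, b2 = frozenset({"have_wood"}), frozenset({"have_wood", "have_saw"})
print(approx_state_value({"have_wood", "have_saw"}, [b1, b2], {b1: 5.0, b2: 2.0}))
```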

91

slide-89
SLIDE 89

Experimental Results

  • Criteria:

– Scalability (vs. VI/RTDP-based planners) – Solution quality (vs. IPPC winners)

  • Domains: 6 from IPPC-06 and IPPC-08
  • Competitors:

– Best performer on the particular domain – Best performer in the particular IPPC – LRTDP

92

slide-90
SLIDE 90

The Big Picture

  • ReTrASE is vastly more scalable than VI/RTDP-based planners
  • ReTrASE typically rivals or outperforms the best-performing planners on IPPC goal-oriented domains

93

slide-91
SLIDE 91

Triangle-Tire: Memory Consumption

94

(Chart: log10 of memory consumed vs. Triangle-Tire problem number, comparing LRTDP_OPT, ReTrASE, and LRTDP_FF.)

slide-92
SLIDE 92

Triangle-Tire: Success Rate

95

(Chart: % of successful trials vs. Triangle-Tire World’08 problem number, comparing ReTrASE, HMDPP, and RFF-PG.)

slide-93
SLIDE 93

Exploding Blocks World: Success Rate

96

(Chart: % of successful trials vs. Exploding Blocks World’06 problem number, comparing ReTrASE, FFReplan, and FPG. ~2800 states!)

slide-94
SLIDE 94

SSPs0

Under two conditions:

  • There is a proper policy (reaches a goal with P= 1 from all states)
  • Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P = 1

97

  • S: A set of states
  • A: A set of actions
  • T(s,a,s’): transition model
  • C(s,a,s’): cost
  • G: set of goals
  • s0: start state

?

slide-95
SLIDE 95

Key Drawback of ReTrASE…

  • Dead-end handling expensive

– expensive to identify: drain on time
– too many to store: drain on space

slide-96
SLIDE 96

Computing Values

(Architecture diagram as above.)

slide-97
SLIDE 97

Research Question

Can we devise a sound dead-end identification procedure fast enough to obviate memoization?

100

Learns feature combinations whose presence guarantees a state to be a dead end

slide-98
SLIDE 98

Nogoods

101

Nogood

slide-99
SLIDE 99

Generate-and-Test Procedure

  • Generate a nogood candidate

– Key insight: Nogood = conjunction that defeats all b.f.s
– For each b.f., pick a literal that defeats it

  • Test the candidate
    – Needed for soundness, since we don’t know all b.f.s
    – Use the non-relaxed Planning Graph algorithm
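A rough schematic of the generate-and-test idea under the assumption that "defeating" a basis function means picking a literal of the current state that conflicts with it; the soundness test (in the paper, a non-relaxed planning-graph check) is abstracted into a callback, so both helpers are hypothetical.

```python
def generate_nogood_candidate(state_literals, basis_functions, defeats):
    """Build a candidate nogood: for every known basis function pick one literal
    of the state that defeats it; the conjunction of those literals is the
    candidate. `defeats(literal, bf)` is an assumed predicate."""
    candidate = set()
    for bf in basis_functions:
        blockers = [lit for lit in state_literals if defeats(lit, bf)]
        if not blockers:
            return None      # this basis function is not defeated: no candidate
        candidate.add(blockers[0])
    return frozenset(candidate)

def confirm_nogood(candidate, is_provably_dead_end):
    """Test step: keep the candidate only if a sound procedure proves that every
    state containing it is a dead end."""
    return candidate is not None and is_provably_dead_end(candidate)
```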

102

slide-100
SLIDE 100
Benefits of SixthSense

  • Can act as a submodule of many planners to identify dead ends
    – by checking discovered nogoods against every state

110

slide-101
SLIDE 101

Take Homes

  • Novel ideas to learn structure in the domain
  • Basis functions

– Learn by regressing trajectories
– Represent good structure
– Generalize across states

  • Nogoods
    – Learn inductively; prove using a sound procedure
    – Represent bad structure
    – Generalize across dead-end states

slide-102
SLIDE 102

Take Homes

  • A novel use of classical planners for MDP algos

– retains the decision-theoretic nature of MDPs
– exploits the scalability of classical planners

  • Automatic ways to generate basis functions
    – no longer an onus on the human designer
    – exploits the factored domain model

M#2: factored representations expose useful problem structure

slide-103
SLIDE 103

Agenda

  • Background: Stochastic Shortest Paths MDPs
  • Background: Heuristic Search for SSP MDPs
  • Algorithms: Automatic Basis Function Discovery
  • Models: SSPs → Generalized SSPs
slide-104
SLIDE 104

Theme of the Workshop

  • Value Functions → Generalized Value Functions
  • Gradient → Extra-gradient
  • KL divergence → Bregman divergence
  • Contextual bandits → Linear bandits
  • SSPs → ?
slide-105
SLIDE 105

SSP/SSPs0

SSP MDP is a tuple <S, A, T, C, G, (s0)>, where:

  • S is a finite state space
  • A is a finite action set
  • T is a stationary transition function
  • C is a stationary cost function
  • G is a set of absorbing cost-free goal states
  • (s0 is an initial state)

Under two conditions:

  • There is a proper policy (reaches a goal with P=1 from all states)
  • Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P = 1

These conditions disallow dead ends: if dead ends were allowed, expected cost would become a meaningless criterion and algorithms could fail to halt.

slide-106
SLIDE 106

Stochastic Shortest-Path MDPs

  • Example applications:

– Controlling a Mars rover

“How to collect scientific data without damaging the rover?”

– Route planning

“How to climb mount Everest in the cheapest way?”

121

Dead ends are common!

slide-107
SLIDE 107

Discrete MDP Research So Far

(Diagram: SSP MDPs sit inside the broader class of goal-oriented MDPs, alongside negative MDPs and positive-bounded MDPs; the region beyond them is marked “????”.)

  • SSP MDPs model many interesting scenarios and are efficiently* solvable by heuristic search
  • What interesting problems are in the unexplored region, and how do we solve them efficiently?

122

slide-108
SLIDE 108

SSPADE: Dead Ends are Avoidable from s0

  • D.e.s may be avoidable from s0 via an optimal policy
  • Can’t compute V*(s) for every state
  • But need only “relevant” states to get the “right” value
  • Can be solved with optimal heuristic search from s0

– FIND shouldn’t starve states; REVISE should halt

(Figure: example MDP over states s0, s1, s2 and goal sG with actions a1, a2, a3.)

[Kolobov, Mausam, Weld, UAI’12]

slide-109
SLIDE 109

fSSPUDE: SSP with Unavoidable Dead Ends (and a Finite Penalty on Them)

  • First attempt: if the agent reaches a d.e., it pays D

V*(s) = ε(D+1) + ε·0 + (1- ε)·D = D + ε

  • Makes non-d.e.s more “expensive” than d.e.s!

– Oops…

124

(Figure: from state s, action a costs ε(D+1) and reaches the goal sg with probability ε and the dead end d with probability 1-ε; the dead end carries penalty D.)

slide-110
SLIDE 110

fSSPUDE: SSP with Unavoidable Dead Ends (and a Finite Penalty on Them)

  • Second attempt: the agent is allowed to stop at any state
    – by paying a price = penalty D
    – Intuition: achieving a goal is worth –D to the agent

  • Equivalent to an SSP MDP with a special a_stop action
    – applicable in each state
    – leads directly to a goal by paying cost D

  • Thus, algorithms for SSP apply to fSSPUDE!
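A small sketch of that equivalence: add a stop action to every non-goal state that leads straight to an (absorbing) goal at cost D, then hand the result to any SSP solver. The action name, the fresh goal label, and the encoding are illustrative assumptions.

```python
def add_stop_action(states, goals, T, C, D, stop_action="a_stop", stop_goal="GOAL_STOP"):
    """Compile an fSSPUDE instance into a plain SSP by letting the agent stop
    anywhere for a one-time penalty D."""
    T2, C2 = dict(T), dict(C)
    for s in states:
        if s in goals:
            continue
        T2[(s, stop_action)] = [(stop_goal, 1.0)]   # deterministic jump to a goal
        C2[(s, stop_action, stop_goal)] = D         # ...at cost D
    return T2, C2, goals | {stop_goal}
```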

[Kolobov, Mausam, Weld, UAI’12]

slide-111
SLIDE 111

MAXPROB: Dealing with Unavoidable Infinitely Damaging Dead Ends-1

(Figure: example MDP with action costs, an unavoidable dead end sd, and maximum goal-reachability probabilities such as P*G(s1) = 0.3 and P*G(sd) = 0.)

  • Comparing policies in terms of cost meaningless
  • MAXPROB/GSSP MDPs: evaluate policies by probability of reaching goal

– Set all action costs to 0 (they don’t matter); reward 1 for reaching the goal
– Fixed-point methods such as VI or LRTDP don’t converge because of traps


[Kolobov, Mausam, Weld, Geffner ICAPS’11]
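To make the reformulation concrete, here is a sketch of the MAXPROB criterion as a Bellman-style backup (probability of eventually reaching the goal), using the same assumed dict-based encoding as the earlier sketches. As the slide notes, plain fixed-point iteration of this backup can fail to converge to the right answer when there are traps, which is what motivates FRET below.

```python
def maxprob_backup(s, V, goals, actions, T):
    """One MAXPROB backup: goal states have probability 1; otherwise take the
    action maximizing the expected goal-reachability of the successors."""
    if s in goals:
        return 1.0
    return max(sum(p * V.get(s2, 0.0) for s2, p in T[(s, a)])
               for a in actions(s))
```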

slide-112
SLIDE 112

MDP Examples

(Figure: three small example MDPs over states s0 … s4 and a goal G, with action costs and transition probabilities, used to illustrate which of them qualify as SSPs.)

127

slide-113
SLIDE 113

Generalized SSPs: Definition

  • An MDP M = <S, A, T, R, G, s0> for which

– There is a proper policy (reaches the goal with P = 1)
– The sum of non-negative rewards accumulated by any policy starting at s0 is bounded from above

  • Solving a GSSP = finding a reward-maximizing Markovian policy that reaches the goal

128

slide-114
SLIDE 114

Generalized SSPs: Example

(Figure: two example GSSPs over states s0 … s4 and goal G, with rewards and transition probabilities.)

slide-115
SLIDE 115

Generalized SSPs: Example

(Figure: the example GSSP.) A proper policy exists.

130

slide-116
SLIDE 116

Generalized SSPs: Example

(Figure: the same example.) For any policy π, the sum of non-negative rewards ≤ 2.

131

slide-117
SLIDE 117

Generalized SSPs: Example

(Figure: two policies for the example; one is a solution, the other is not.)

132

slide-118
SLIDE 118

GSSPs: Is V* A Fixed Point of B?

  • Reminder: in SSPs, V* = B V*, where
    – B is the Bellman backup operator
    – B V(s) = max_a { R(s,a) + ∑_{s' in succ(s,a)} T(s,a,s') V(s') }
  • In SSPs, V* is a fixed point of B
    – Still true in GSSPs

(Figure: example confirming V* is a fixed point of B.)

133

slide-119
SLIDE 119

GSSPs: Is V* The Unique Fixed Point of B?

  • In SSPs, V* is the unique fixed point of B
    – I.e., V* = B ∘ B ∘ … ∘ B V0, where V0 is a heuristic value function
    – Not true in GSSPs
    – Moreover, all suboptimal fixed points are admissible!

(Figure: a GSSP value function that is a suboptimal fixed point of B.)

134

slide-120
SLIDE 120

GSSPs: Is Every V*-greedy ∏ A Solution?

  • In SSPs, every π greedy w.r.t. V* reaches the goal
    – Not true in GSSPs

(Figure: a GSSP counterexample where a V*-greedy policy never reaches the goal.)

135

slide-121
SLIDE 121

Efficiently Solving GSSPs: Attempt #1

  • Just run F&R!
    – Start with an admissible V0
    – Done!

(Figure: the value function F&R converges to on the example.)

136

slide-122
SLIDE 122

Attempt #1: What Went Wrong?

  • In GSSPs, suboptimal fixed points are admissible!
    – When starting with V0 ≥ V*, F&R can hit one of them
    – B can’t change V over traps: strongly connected components in V’s greedy graph
  • Can yield an arbitrarily poor solution

(Figure: the trap in the example.)
slide-123
SLIDE 123

Efficiently Solving GSSPs: FRET

  • Find, Revise, Eliminate Traps
    – First heuristic search algorithm for MDPs beyond SSP
    – Provably optimal if the heuristic is admissible

  • Main idea
    – Run F&R until convergence
    – Eliminate traps in the policy envelope
    – Repeat until no more traps
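A high-level sketch of that loop; trap detection and elimination are abstracted into callbacks (in the paper, traps are strongly connected components of the greedy graph whose values B cannot change), so all helpers here are assumptions.

```python
def fret(s0, V0, find_and_revise, find_traps, eliminate_trap):
    """FRET outer loop: run Find-and-Revise to a fixed point, then eliminate
    traps (SCCs of the greedy policy graph that keep B from improving V),
    and repeat until no traps remain."""
    V = dict(V0)                       # start from an admissible V0
    while True:
        V = find_and_revise(s0, V)     # converge to some fixed point of B
        traps = find_traps(s0, V)      # traps in the current policy envelope
        if not traps:
            return V                   # no traps left: done
        for trap in traps:
            eliminate_trap(trap, V)    # adjust V over the trap's states
```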

139

slide-124
SLIDE 124

FRET Example: Finding V*

(Figure: the value function after each phase, starting from an admissible V0: Find-and-Revise until convergence, eliminate traps in the resulting Vi, then repeat Find-and-Revise and trap elimination until no traps are left — done!)

slide-125
SLIDE 125

FRET Example: Extracting π*

(Figure: the example with its optimal values.)

  • Iteratively “connect” states to the goals
    – using optimal actions
    – until s0 is connected

141

slide-126
SLIDE 126

Experimental Setup

  • Problems: MAXPROB versions of EBW (Exploding Blocks World)
  • Planners: VI vs FRET
  • Heuristics: Zero for VI, One+SixthSense for FRET

– SixthSense soundly identifies some of the “dead ends”; their values are set to 0

142

slide-127
SLIDE 127

Experimental Setup

143

slide-128
SLIDE 128

Goal-Oriented MDP Hierarchy

144

(Diagram: the goal-oriented MDP hierarchy. SSP contains discounted-reward and finite-horizon MDPs, and is extended by SSPADE, fSSPUDE, iSSPUDE, GSSP, and S3Ps.)

slide-129
SLIDE 129

Future Work: Solving S3P

  • Stochastic Safest and Shortest Path (S3P) MDPs

– Teichteil-Koenigsbuch, AAAI’12
– Goal-oriented MDPs with no restriction on costs

(Figure: example S3P MDP with positive and negative action costs, exhibiting alternating cycles, non-positive cycles, and unavoidable dead ends.)

slide-130
SLIDE 130

Take Homes

  • SSP MDPs exclude interesting planning scenarios
  • Generalized SSPs

– handle zero-cost cycles
– GSSP contains SSP and several other MDP classes
– heuristic search algorithm (FRET)

  • Dead ends are tricky in undiscounted goal MDPs
  • Well-formed extensions of SSP MDPs
    – can have unintuitive DP properties
    – what is beyond GSSPs?
    – loads of open questions: theoretical & algorithmic

M#3: some models use explicit knowledge of goals

slide-131
SLIDE 131

Agenda

  • Background: Stochastic Shortest Paths MDPs
  • Background: Heuristic Search for SSP MDPs
  • Algorithms: Automatic Basis Function Discovery
  • Models: SSPs  Generalized SSPs
slide-132
SLIDE 132

(Figure: the AND-OR graph in flat state space and the abstract graphs produced by the AS [1], ASAM [2], and ASAP abstractions for a small example.)

[1] Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 2003.
[2] Balaraman Ravindran and A. Barto. Approximate homomorphisms: A framework for nonexact minimization in Markov decision processes. In ICKBCS, 2004.

slide-133
SLIDE 133

Key Properties

PROPERTY 1: The original MDP does not reduce to an abstract MDP
PROPERTY 2: ASAP subsumes abstractions computed by AS and ASAM
PROPERTY 3: Value iteration on the abstract AND-OR graph returns optimal value functions for the original MDP
slide-134
SLIDE 134

Experiments

[Anand, Grover, Mausam, Singla – submitted]

M#1: states can be ignored (abstracted) for efficient computation

slide-135
SLIDE 135

3 Key Messages

  • M#0: No need for exploration-exploitation tradeoff

– planning is purely a computational problem (V.I. vs. Q)

  • M#1: Search in planning

– states can be ignored or reordered for efficient computation

  • M#2: Representation in planning

– develop interesting representations for Factored MDPs

→ Exploit structure to design domain-independent algorithms

  • M#3: Goal-directed MDPs

– design algorithms/models that use explicit knowledge of goals

151