Practical Linear-Value Approximation Techniques for First-order MDPs

Scott Sanner & Craig Boutilier
University of Toronto


SLIDE 1

Practical Linear-Value Approximation Techniques for First-order MDPs

Scott Sanner & Craig Boutilier
University of Toronto
UAI 2006

SLIDE 2

Why Solve First-order MDPs?

• Relational description of a (probabilistic) planning domain in (P)PDDL:

  (:action load-box-on-truck-in-city
    :parameters (?b - box ?t - truck ?c - city)
    :precondition (and (BIn ?b ?c) (TIn ?t ?c))
    :effect (and (On ?b ?t) (not (BIn ?b ?c))))

• Box World cities: London, Paris, Rome, Berlin, Moscow
• Can solve a ground MDP for each domain instantiation (e.g., 3 trucks, 2 planes, 4 boxes)
• Or solve a first-order MDP for all domain instantiations at once!
  • Lift the PPDDL MDP specification to first-order (FOMDP)
  • Soln makes value distinctions for all dom. instantiations!
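The grounding blowup motivating the lifted approach is easy to see; here is a minimal sketch (the domain sizes are illustrative, not taken from the competition problems) that enumerates the ground instances of the single lifted load action:

```python
from itertools import product

def ground_load_actions(boxes, trucks, cities):
    """Enumerate every ground instance of the lifted action
    load-box-on-truck-in-city(?b ?t ?c)."""
    return [f"load({b},{t},{c})" for b, t, c in product(boxes, trucks, cities)]

boxes  = [f"b{i}" for i in range(4)]            # 4 boxes
trucks = [f"t{i}" for i in range(3)]            # 3 trucks
cities = ["London", "Paris", "Rome", "Berlin", "Moscow"]

actions = ground_load_actions(boxes, trucks, cities)
print(len(actions))                             # 4 * 3 * 5 = 60 ground actions
```

A ground solver must reason about every such instance per problem instantiation; the FOMDP approach handles the one lifted schema and so covers all instantiations at once.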

SLIDE 3

Background / Talk Outline

1) Symbolic DP for first-order MDPs (BRP, 2001)
  • Defines FOMDP / operators / value iteration
  • Requires FO simplification for compactness
2) First-order approx. linear prog. (SB, 2005)
  • Approximate value with a linear comb. of basis funs.
  • No simplification → project onto weight space ☺
3) Many practical questions remaining (SB, 2006)
  • Other algorithms: first-order API?
  • Where do basis functions come from?
  • How to efficiently handle universal rewards?
  • Optimizations for scalability?

SLIDE 4

FOMDP Foundation: SitCalc

• Deterministic actions: loadS(b,t), unloadS(b,t), …
• Situations: S0, do(loadS(b,t), S0), …
• Fluents: BIn(b,c,s), TIn(t,c,s), On(b,t,s)
• Successor-state axioms (SSAs) for each fluent F:
  • Describe how each action affects the fluent (like a det. FO-DBN)
  • Ex: BIn(b,c,do(a,s)) ≡ (1) BIn(b,c,s) ∧ a ≠ loadS(b,t), or (2) for some t: a = unloadS(b,t) ∧ TIn(t,c,s)
• Regression operator: Regr(ϕ) = ϕ′
  • Takes a formula ϕ describing a post-action state
  • Uses the SSAs to build ϕ′ describing the pre-action state
  • Crucial for backing up a value fun to produce a Q-fun!
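To make the regression operator concrete, here is a toy sketch (not the authors' implementation) of Regr for the single fluent BIn, following the SSA above; formulas are represented as nested tuples:

```python
# Regress the fluent BIn(b,c) through a deterministic action, using the
# successor-state axiom from the slide:
#   BIn(b,c,do(a,s)) ≡ BIn(b,c,s) ∧ a ≠ loadS(b,t)
#                      ∨ ∃t. a = unloadS(b,t) ∧ TIn(t,c,s)

def regress_BIn(b, c, action):
    """Return a formula (nested tuple) over the PRE-action state that is
    equivalent to BIn(b, c) holding AFTER `action`."""
    name, *args = action
    if name == "loadS" and args[0] == b:
        # loadS(b,t) makes BIn(b,c) false: the box leaves the city.
        return ("false",)
    if name == "unloadS" and args[0] == b:
        # unloadS(b,t) makes BIn(b,c) true iff the truck was in c,
        # or the box was already there.
        t = args[1]
        return ("or", ("BIn", b, c), ("TIn", t, c))
    # Any other action leaves BIn(b,c) unchanged.
    return ("BIn", b, c)

print(regress_BIn("b1", "Paris", ("unloadS", "b1", "t1")))
# → ('or', ('BIn', 'b1', 'Paris'), ('TIn', 't1', 'Paris'))
```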

SLIDE 5

FOMDP Case Representation

• Case: Assign value to a first-order state abstraction
• E.g., can express the reward in the BoxWorld FOMDP as:

  rCase(s) =  [    ∀b,c. Dest(b,c) ⇒ BIn(b,c,s)  : 1 ]
              [ ¬( ∀b,c. Dest(b,c) ⇒ BIn(b,c,s) ) : 0 ]

• Operators: Define unary and binary case operations
  • E.g., can take the "cross-sum" ⊕ of two cases (other unary/binary ops are defined analogously)…
  • Must remove inconsistent elements (the crossed-out partition below):

  [ ¬∃x.A(x) : 20 ]     [ ¬∃y.A(y)∧B(y) : 4 ]     [ ¬∃x.A(x) ∧ ¬∃y.A(y)∧B(y) : 24 ]
  [  ∃x.A(x) : 10 ]  ⊕  [  ∃y.A(y)∧B(y) : 3 ]  =  [ ¬∃x.A(x) ∧  ∃y.A(y)∧B(y) : 23 ]  ← inconsistent, removed
                                                  [  ∃x.A(x) ∧ ¬∃y.A(y)∧B(y) : 14 ]
                                                  [  ∃x.A(x) ∧  ∃y.A(y)∧B(y) : 13 ]
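The cross-sum with inconsistency pruning can be sketched as follows; the `consistent` callback stands in for the first-order simplifier the approach assumes, and is hard-coded here just for this one example:

```python
from itertools import product

def cross_sum(case1, case2, consistent=lambda f: True):
    """⊕ on cases: conjoin partitions pairwise, add their values, and
    prune partitions whose conjoined formula is inconsistent."""
    result = []
    for (f1, v1), (f2, v2) in product(case1, case2):
        conj = f"({f1}) ∧ ({f2})"
        if consistent(conj):
            result.append((conj, v1 + v2))
    return result

c1 = [("¬∃x.A(x)", 20), ("∃x.A(x)", 10)]
c2 = [("¬∃y.A(y)∧B(y)", 4), ("∃y.A(y)∧B(y)", 3)]

# Hard-coded consistency check for this example: ¬∃x.A(x) contradicts
# ∃y.A(y)∧B(y), so that partition (value 23) is pruned.
def consistent(f):
    return not ("¬∃x.A(x)" in f and "∃y.A(y)∧B(y)" in f and "¬∃y" not in f)

for formula, value in cross_sum(c1, c2, consistent):
    print(value, formula)                 # 24, 14, 13 survive; 23 is pruned
```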

SLIDE 6

FOMDP Actions and FODTR

• SitCalc is deterministic; how to handle probabilities?
  • User's stochastic action: load(b,t)
  • Nature's deterministic choices: loadS(b,t), loadF(b,t)
  • Probability distribution over Nature's choice:

    P(loadS(b,t) | load(b,t)) =  [ ¬snow(s) : .5 ]
                                 [  snow(s) : .1 ]
    P(loadF(b,t) | load(b,t)) = 1 − P(loadS(b,t) | load(b,t))

• First-order decision-theoretic regression (FODTR):
  • Given a value fun vCase(s) and a user action, produces a first-order description of the "Q-fun" (modulo reward):

    "Q-fun" = FODTR[ vCase(s), load(b,t) ]
            =   P( loadS… | load… ) ⊗ Regr[ vCase( after loadS… ) ]
              ⊕ P( loadF… | load… ) ⊗ Regr[ vCase( after loadF… ) ]
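A numeric sketch of FODTR's structure, combining the probability case over Nature's choices with regressed value cases via ⊗ and ⊕. The regressed cases here are hypothetical placeholders, and the pruning of inconsistent cross partitions such as snow(s) ∧ ¬snow(s) is omitted:

```python
from itertools import product

def prod_case(p_case, v_case):
    """⊗: conjoin partitions pairwise, multiply values."""
    return [(f"{fp} ∧ {fv}", p * v) for (fp, p), (fv, v) in product(p_case, v_case)]

def sum_case(c1, c2):
    """⊕: conjoin partitions pairwise, add values."""
    return [(f"{f1} ∧ {f2}", v1 + v2) for (f1, v1), (f2, v2) in product(c1, c2)]

# Probability case over Nature's choice, from the slide; loadF gets the
# complementary probability mass.
p_loadS = [("snow(s)", 0.1), ("¬snow(s)", 0.5)]
p_loadF = [("snow(s)", 0.9), ("¬snow(s)", 0.5)]

# Hypothetical placeholders for Regr[vCase(after loadS)] and
# Regr[vCase(after loadF)].
regr_S = [("ψS(s)", 10.0)]
regr_F = [("ψF(s)", 0.0)]

# FODTR ≈ P(loadS|load) ⊗ Regr[… loadS …] ⊕ P(loadF|load) ⊗ Regr[… loadF …]
q = sum_case(prod_case(p_loadS, regr_S), prod_case(p_loadF, regr_F))
for f, v in q:
    print(v, f)
```

The consistent partitions give an expected value of 1.0 under snow and 5.0 without it; a real implementation would simplify away the snow(s) ∧ ¬snow(s) cross terms.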

SLIDE 7

FOMDP Backup Operators

In fact, there are 3 types of "Q-funs"/backup operators:

1) B^A(x)[vCase(s)] = rCase(s) ⊕ γ⋅FODTR[vCase(s)]
   • Think of as Q(A(x),s); note the free vars!
2) B^A[vCase(s)] = ∃x. B^A(x)[vCase(s)]   (action abstraction!)
   • Think of as ~Q(A,s); no free vars, but now overlap!
3) B^A_max[vCase(s)] = max( B^A[vCase(s)] )
   • Think of as Q(A,s); no free vars and no overlap!

Example. Let B^load(b,t)[vCase(s)] =  [  ϕ(b,t) : .9 ]
                                      [ ¬ϕ(b,t) : 0  ]

Then B^load[vCase(s)] =  [ ∃b,t. ϕ(b,t)  : .9 ]
                         [ ∃b,t. ¬ϕ(b,t) : 0  ]

And B^load_max[vCase(s)] =  [ ∃b,t. ϕ(b,t)                    : .9 ]
                            [ ¬(∃b,t. ϕ(b,t)) ∧ ∃b,t. ¬ϕ(b,t) : 0  ]
SLIDE 8

First-order Approx. Linear Prog. (FOALP)

• Represent the value fn as a linear comb. of k basis fns:

  vCase(s) = w1 ⊗ [  ∃b,c. BIn(b,c,s) : 1 ]  ⊕ … ⊕  wk ⊗ [  ∃t,c. TIn(t,c,s) : 1 ]
                  [ ¬∃b,c. BIn(b,c,s) : 0 ]              [ ¬∃t,c. TIn(t,c,s) : 0 ]

• Reduces the MDP solution to finding good weights… generalizes the approx. LP used by (van Roy, GKP, SP):

  Vars: wi, i = 1..k
  Minimize:  Σ_s Σ_{i=1..k} wi ⊗ bCasei(s)
  Subject to:  0 ≥ B^a_max[ ⊕_{i=1..k} wi ⊗ bCasei(s) ] ⊖ ⊕_{i=1..k} wi ⊗ bCasei(s) ,  ∀a ∈ A, s

• FOALP issues resolved in (SB, 2005):
  • ∞ sum in objective: we give a principled approximation
  • ∞ constraints: only a finite set of distinct constraints; solve exactly & efficiently w/ constraint generation (SP)
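The constraint-generation loop can be illustrated on a deliberately tiny ground, propositional analogue (this toy MDP, its single "move right" action, and its constant basis function are invented for illustration, not taken from the paper):

```python
# Toy, ground analogue of ALP constraint generation: value V(s) = w·b(s),
# and constraints  0 >= R(s) + γ·E[V(s')] − V(s)  for every state.
gamma = 0.9
states = [0, 1, 2]
b = {s: 1.0 for s in states}                      # one constant basis function
R = {0: 0.0, 1: 0.0, 2: 1.0}                      # reward
P = {s: {s2: 1.0 if s2 == min(s + 1, 2) else 0.0  # one action: "move right"
         for s2 in states} for s in states}

def violation(w, s):
    """How much state s violates its ALP constraint (want <= 0)."""
    backup = R[s] + gamma * sum(P[s][s2] * w * b[s2] for s2 in states)
    return backup - w * b[s]

def min_w(active):
    """Smallest w satisfying the active constraints (one-variable LP)."""
    bounds = []
    for s in active:
        coeff = b[s] - gamma * sum(P[s][s2] * b[s2] for s2 in states)
        if coeff > 0:
            bounds.append(R[s] / coeff)
    return max(bounds, default=0.0)

w, active = 0.0, set()
while True:
    s_viol = max(states, key=lambda s: violation(w, s))
    if violation(w, s_viol) <= 1e-9:
        break                                     # no violated constraint left
    active.add(s_viol)                            # add most violated constraint
    w = min_w(active)

print(round(w, 4))                                # → 10.0
```

The first-order version plays the same game, but each "most violated constraint" is found over case partitions rather than enumerated states, which is why only finitely many distinct constraints ever arise.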

SLIDE 9

First-order Approx. Policy Iter. (FOAPI)

• Need an explicit representation of a policy:

  πCase(s) = max( ∪_{i=1..m} B^Ai[vCase(s)] )

  • Each case partition should retain its mapping to Ai
• Now separate partitions into Ai-specific policies:

  πCase_Ai(s) = { part ∈ πCase(s) s.t. part → Ai }

  • Specifies the states where the policy would apply Ai
• FOAPI: direct generalization of GKP (exact objective!)
  • Start w/ wi^(0) = 0, πCase^0(s); iterate the LP soln until π^(j+1) = π^(j):

    Vars: wi^(j+1), i = 1..k
    Minimize:  φ^(j+1)
    Subject to:  φ^(j+1) ≥ | πCase^j_a(s) ⊗ ( B^a_max[ ⊕_{i=1..k} wi^(j+1) ⊗ bCasei(s) ] ⊖ ⊕_{i=1..k} wi^(j+1) ⊗ bCasei(s) ) | ,  ∀a ∈ A, s

  • Use cgen; if converges, obtain bounds on policy (GKP)!

SLIDE 10

Generating Basis Functions

• Where do basis functions come from? A major question for automation!
  • Huge candidate space if systematically building basis functions for all first-order formulae
• Idea (GT, 2004): Regressions from the goal make good candidate basis functions!
  • Given an initial basis function for the reward: ∃b. BIn(b,P,s)
  • Regr w/ unload: ∃b. BIn(b,P,s) ∨ (∃b*,t*. TIn(t*,P,s) ∧ On(b*,t*,s))
  • Render each basis disjoint from its parents; will use later
• Iteratively solve the FOMDP:
  • Retain all basis functions with wgt. > threshold τ
  • Generate new basis fns from the retained set
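The generate/solve/retain loop can be sketched schematically; `regress` and `solve_weights` below are hypothetical stubs standing in for first-order regression and the FOALP/FOAPI solver:

```python
def generate_basis(reward_basis, regress, solve_weights, tau, iters=3):
    """Grow the basis set: solve for weights, retain the useful basis
    functions, and propose their regressions as new candidates."""
    basis = [reward_basis]
    for _ in range(iters):
        weights = solve_weights(basis)                    # FOALP / FOAPI step
        retained = [f for f, w in zip(basis, weights) if abs(w) > tau]
        for f in retained:
            for cand in regress(f):                       # candidate children
                if cand not in basis:
                    basis.append(cand)
    return basis

# Toy illustration: "formulas" are plain strings, and regression merely
# tags the formula with the action it was regressed through.  (The paper
# additionally renders each child disjoint from its parent.)
regress = lambda f: [f + "<-unload", f + "<-load"]
solve_weights = lambda basis: [1.0] * len(basis)          # pretend all useful
print(len(generate_basis("∃b.BIn(b,P,s)", regress, solve_weights, tau=0.1, iters=1)))
# → 3
```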

SLIDE 11

Problems w/ Universal Rewards

• Universal rewards are difficult for FOMDPs, e.g., given the reward:

  rCase(s) =  [    ∀b,c. Dest(b,c) ⇒ BIn(b,c,s)  : 1 ]
              [ ¬( ∀b,c. Dest(b,c) ⇒ BIn(b,c,s) ) : 0 ]

• The exact n-stage-to-go value function has the form:

  vCase_n(s) =  [ n-1 boxes not at dest           : γ^(n-1) ]
                [ …                               : …       ]
                [ 1 box not at dest               : γ       ]
                [ ∀b,c. Dest(b,c) ⇒ BIn(b,c,s)    : 1       ]

• The exact value function has infinitely many values!
• Cannot compactly represent such structure with a piecewise-constant case approximation of the value fn

SLIDE 12

Additive Goal Decomposition

• Solution for universal rewards: when the reward is in simple implicative form, solve for a single goal with distinguished constants.
  • E.g., given: ∀b,c. Dest(b,c) ⇒ BIn(b,c,s)
  • Solve the FOMDP for: BIn(b*,c*,s)
  • Given the solution, gen. Q-funs Q(A,s)_<b*,c*> for all a ∈ A
• At run-time: given a concrete domain, e.g.
  • Instantiation: { Dest(b1,c1), Dest(b2,c2), Dest(b3,c3) }
  • Let overall Q(A,s) = Q(A,s)_<b1,c1> + Q(A,s)_<b2,c2> + Q(A,s)_<b3,c3> for all a ∈ A
• To execute the policy: select the action that maximizes the sum of values across all Q-funs, i.e., Q(A,s)
• Only a heuristic: works in many, but not all, cases
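Executing the additively decomposed policy amounts to an argmax over summed per-goal Q-values; a minimal sketch with hypothetical Q-numbers (not outputs of the solved FOMDP):

```python
def best_action(actions, q_funs, state):
    """Argmax over actions of the SUM of per-goal Q-values."""
    return max(actions, key=lambda a: sum(q(a, state) for q in q_funs))

# Hypothetical Q-functions for goal instances <b1,c1> and <b2,c2>.
q1 = lambda a, s: {"load(b1,t1)": 5.0, "load(b2,t1)": 0.0, "noop": 1.0}[a]
q2 = lambda a, s: {"load(b1,t1)": 0.0, "load(b2,t1)": 4.0, "noop": 1.0}[a]

choice = best_action(["load(b1,t1)", "load(b2,t1)", "noop"], [q1, q2], state=None)
print(choice)                    # → load(b1,t1)  (sum 5.0 beats 4.0 and 2.0)
```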

SLIDE 13

Optimizations

• Exploiting disjointness in basis functions:
  • Worst case for a set B of basis functions: must examine 2^|B| case partitions in constraint generation
  • But for any pairwise disjoint set B′ of basis functions, need examine only |B′| case partitions in cgen
  • Basis generation enforces disjointness b/w child and parent!
• Exploiting the implicit max in constraint generation:
  • In constraints, substitute 0 ≥ B^a_max[…] with 0 ≥ B^a[…]
• Removing internal redundancy/inconsistency w/ BDDs:
  • Given: (∃x A(x)) ∧ (∃x A(x)) ∧ (∃x A(x)∧B(x))
  • Map FO formulas to prop. vars: a ↦ ∃x A(x)∧B(x), b ↦ ∃x A(x), with the implication a ⇒ b (i.e., ¬b ⇒ ¬a)
  • Evaluating the conjunction as a BDD under this implication simplifies it to ∃x A(x)∧B(x)
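The implication-based pruning behind the BDD trick can be sketched without a BDD package: record which propositional variables entail which others, and drop any conjunct entailed by another conjunct:

```python
def simplify_conjunction(conjuncts, implies):
    """Drop any conjunct entailed by a DIFFERENT conjunct.
    `implies[x]` lists the variables that x entails."""
    kept = []
    for x in conjuncts:
        if not any(x in implies.get(y, []) for y in conjuncts if y != x):
            kept.append(x)
    return kept

# From the slide:  a ↦ ∃x.A(x)∧B(x),  b ↦ ∃x.A(x),  and a ⇒ b.
implies = {"a": ["b"]}
# (∃x A(x)) ∧ (∃x A(x)) ∧ (∃x A(x)∧B(x))  becomes  b ∧ b ∧ a,
# and both b conjuncts are redundant given a.
print(simplify_conjunction(["b", "b", "a"], implies))     # → ['a']
```

A real implementation uses BDDs so the same pruning also detects inconsistency (a conjunction reducing to false), not just redundancy.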

SLIDE 14

Empirical Results: Runtime

• Offline solution times for BoxWorld & BlocksWorld (runtime plots omitted):
  • Without optimizations, cannot get past iteration 2 (> 36000 sec.)
  • BoxWorld: policies simple, fewer constraints for FOAPI
  • BlocksWorld: policies complex (lots of equality)

SLIDE 15

Empirical Results: Performance

• Evaluated cumulative reward on the ICAPS 2004 Prob. Planning Comp. BoxWorld (bx) and BlocksWorld (bw) problems (performance charts omitted)
• Competitors: G2: temp. logic w/ control knowledge; P: RTDP-based; J1: human-coded policy; J2: inductive FO policy iter.; J3: deterministic FF-replanner

SLIDE 16

Related Work

• Direct value iteration:
  • ReBel algorithm for RMDPs (KvOdR, 2004)
  • FOVIA algorithm for the fluent calculus (KS, 2005)
  • First-order decision diagrams (JKW, 2006)
  → all disallow ∀ quant., e.g., universal cond. effects
• Sampling and/or inductive techniques:
  • Approx. linear programming for RMDPs (GKGK, 2003)
  • Inductive policy selection using FO regression (GT, 2004)
  • Approximate policy iteration (FYG, 2004)
  → sampled domain instantiations do not ensure generalization across all possible worlds
  → nonetheless, these methods have worked well empirically

SLIDE 17

Conclusions and Future Work

• Conclusions:
  • Developed domain-independent linear-value approximation techniques / optimizations for FOMDPs
  • Encouraging empirical results on the ICAPS 2004 IPPC
  • 2nd place in the ICAPS 2006 IPPC by # problems solved
• Future work:
  • Goal decomposition for complex ∀ rewards, e.g.: (∀b,c. Dest(b,c) ⇒ BIn(b,c,s)) ∨ ∃b. BIn(b,Paris,s)
  • Online search to "patch up" decomposition error
    • E.g., additive decomposition is inadequate to solve some difficult problems in BlocksWorld
  • More expressive rewards, e.g.: Σ_b (∀c. Dest(b,c) ⇒ BIn(b,c,s))