Fast approximate planning in POMDPs (Geoff Gordon)

SLIDE 1

Fast approximate planning in POMDPs

Geoff Gordon

ggordon@cs.cmu.edu

Joelle Pineau, Geoff Gordon, Sebastian Thrun. Point-based value iteration: an anytime algorithm for POMDPs

Fast approximate planning in POMDPs – p.1/37

SLIDE 2

Overview

POMDPs are too slow

SLIDE 4

Overview

Review of POMDPs
Review of POMDP value iteration algorithms
Point-based value iteration
Theoretical results
Actual results

SLIDE 5

POMDP overview

Planning in an uncertain world
Actions have random effects
Don’t observe full world state

SLIDE 6

POMDP definition

State $x \in X$, actions $a \in A$, observations $z \in Z$
Rewards $r_a$ (column vectors), discount $\gamma \in [0, 1)$
Belief $b \in P(X)$ (row vectors)
Starting belief $b_0$

SLIDE 7

POMDP definition cont’d

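Transitions $b \to b T_a$ ($T_a$ stochastic)
Observation likelihoods $w_z$ (row vectors), with $\sum_z w_z = 1$ in every state
Observation update: $b \leftarrow \eta \, (w_z \times b)$, where $\times$ is pointwise multiplication and $\eta$ rescales $b$ to sum to 1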
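The two update rules above can be combined into a single belief-tracking step. The following is a minimal sketch, assuming beliefs are plain Python lists, `T_a[x][y]` is the probability of moving from state `x` to state `y`, and `w_z[y]` is the likelihood of the observation in state `y`. The function name `belief_update` and the two-state numbers are hypothetical, chosen only for illustration.

```python
def belief_update(b, T_a, w_z):
    """Return (new_belief, P(z | b, a)) after acting with T_a and observing z."""
    n = len(b)
    # Transition step: row vector b times the stochastic matrix T_a.
    predicted = [sum(b[x] * T_a[x][y] for x in range(n)) for y in range(n)]
    # Observation step: pointwise product with the likelihood vector w_z.
    unnormalized = [w_z[y] * predicted[y] for y in range(n)]
    prob_z = sum(unnormalized)  # this is 1 / eta
    if prob_z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / prob_z for u in unnormalized], prob_z


# Hypothetical two-state example (not from the talk).
T_listen = [[1.0, 0.0], [0.0, 1.0]]  # the state does not change
w_hear_left = [0.85, 0.15]           # P(z = hear-left | state)
b1, p = belief_update([0.5, 0.5], T_listen, w_hear_left)
print(b1, p)
```

The normalizer is returned alongside the belief because the same quantity, P(z | b, a), reappears in the Bellman equation later in the talk.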

SLIDE 8

Value functions

Just like MDP value function (but bigger)
$V(b)$ = expected total discounted future reward starting from $b$
Knowing $V$ means planning is 1-step lookahead
If we discretize belief simplex, we are “done”
From $b$ get to $b_{z_1}, b_{z_2}, \ldots$ according to $P(z \mid b, a)$

SLIDE 9

Value functions

Additional structure: convexity
Consider beliefs $b_1$, $b_2$, $b_3 = \frac{b_1 + b_2}{2}$
$b_3$: flip a coin, then start in $b_1$ if heads, $b_2$ if tails
$b_3$ is never better than the average of $b_1$, $b_2$: $V(b_3) \le \frac{V(b_1) + V(b_2)}{2}$

SLIDE 10

Representation

Represent $V$ as the upper surface of a (possibly infinite) set of hyperplanes
$\mathcal{V}$ is the set of hyperplanes
Hyperplanes represented by normals $v$ (column vectors)
$V(b) = \max_{v \in \mathcal{V}} b \cdot v$
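A short sketch of this representation, assuming each hyperplane is stored as a list of per-state values. The three normals in `V_set` are made up for illustration. The example also checks the convexity property from the previous slide: the value at a midpoint belief is no larger than the average of the endpoint values.

```python
def value(b, hyperplanes):
    """V(b) = max over normals v of the dot product b . v."""
    return max(sum(bi * vi for bi, vi in zip(b, v)) for v in hyperplanes)


# Hypothetical set of three hyperplane normals over two states.
V_set = [[3.0, 1.0], [1.0, 3.0], [2.2, 2.2]]
b1, b2 = [1.0, 0.0], [0.0, 1.0]
b3 = [(p + q) / 2 for p, q in zip(b1, b2)]  # the coin-flip belief
avg = (value(b1, V_set) + value(b2, V_set)) / 2
print(value(b3, V_set), "<=", avg)
```

Because a maximum of linear functions is always convex, the inequality holds for any choice of normals, not just these.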

SLIDE 11

Value iteration

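Bellman’s equation:

$$V(b) = \max_a Q(b, a)$$

$$Q(b, a) = r_a + \gamma \sum_z P(z \mid b, a)\, V(b_{az})$$

where $b_{az} = \eta \, (b T_a) \times w_z$ (here $r_a$ stands for the expected immediate reward $b \cdot r_a$)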
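As a sketch of how this one-step lookahead could be evaluated when V is given by a finite set of hyperplanes: the code below assumes `T[a]` is a transition matrix, `w` is a list of observation likelihood vectors, and `r[a]` is a per-state reward vector, so the immediate reward is taken to be the dot product of `b` and `r[a]`. The function name and the numbers are illustrative only.

```python
def q_value(b, a, T, w, r, gamma, hyperplanes):
    """Q(b, a) = b . r_a + gamma * sum_z P(z | b, a) * V(b_az)."""
    n = len(b)
    predicted = [sum(b[x] * T[a][x][y] for x in range(n)) for y in range(n)]
    total = sum(b[x] * r[a][x] for x in range(n))
    for w_z in w:
        unnorm = [w_z[y] * predicted[y] for y in range(n)]
        p_z = sum(unnorm)  # P(z | b, a)
        if p_z > 0.0:
            b_az = [u / p_z for u in unnorm]
            v_next = max(sum(bi * vi for bi, vi in zip(b_az, v))
                         for v in hyperplanes)
            total += gamma * p_z * v_next
    return total


# Hypothetical problem: one action that leaves the state unchanged.
T = [[[1.0, 0.0], [0.0, 1.0]]]
w = [[0.85, 0.15], [0.15, 0.85]]
r = [[0.0, 0.0]]
V_set = [[1.0, 0.0], [0.0, 1.0]]
q = q_value([0.5, 0.5], 0, T, w, r, 0.9, V_set)
print(q)
```

Taking the maximum of `q_value` over all actions gives V(b) for the next iteration at that single belief.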

SLIDE 12

Convergence

Backup operator $T$: $V \leftarrow TV$
$T$ is a contraction on $P(X) \to \mathbb{R}$
$\|b - b'\| = \max_x |b(x) - b'(x)|$

SLIDE 13

Sondik’s algorithm (1972)

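Rearrange Bellman equation to make it linear: $\eta^{-1} = P(z \mid b, a)$, and $V(\eta b) = \eta V(b)$, so

$$
\begin{aligned}
Q(b, a) &= r_a + \gamma \sum_z V((b T_a) \times w_z) \\
&= r_a + \gamma \sum_z \max_{v \in \mathcal{V}} ((b T_a) \times w_z) \cdot v \\
&= r_a + \gamma \sum_z \max_{v \in \mathcal{V}} b \cdot T_a (w_z \times v)
\end{aligned}
$$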

SLIDE 14

Evaluate from inside out

Suppose $V_t(b) = b \cdot v$
$v_z = w_z \times v$
$v_{az} = \gamma T_a v_z$
$v_a = r_a + v_{az_1} + v_{az_2} + \ldots$
$\mathcal{V}' = \{v_{a_1}, v_{a_2}, \ldots\}$
Now $V_{t+1}(b) = \max_{v \in \mathcal{V}'} b \cdot v$

SLIDE 15

More than 1 hyperplane

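Suppose $V_t(b) = \max_{v \in \mathcal{V}} b \cdot v$
$\mathcal{V}_z = w_z \times \mathcal{V}$ (set ops are elementwise)
$\mathcal{V}_{az} = \gamma T_a \mathcal{V}_z$
$\mathcal{V}_a = r_a + \mathcal{V}_{az_1} \oplus \mathcal{V}_{az_2} \oplus \ldots$ (expensive!)
$\mathcal{V}' = \mathcal{V}_{a_1} \cup \mathcal{V}_{a_2} \cup \ldots$
Now $V_{t+1}(b) = \max_{v \in \mathcal{V}'} b \cdot v$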

above representation due to [Cassandra et al]
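The set operations on this slide can be sketched directly, with the cross-sum written as a Cartesian product over one vector per observation. This is an unpruned illustration under assumed data structures (lists of lists), not the implementation used in the talk; the tiny two-state problem is hypothetical.

```python
from itertools import product


def exact_backup(V, T, w, r, gamma):
    """Return the unpruned set V' as a list of hyperplane normals."""
    n = len(r[0])
    new_V = []
    for a in range(len(T)):
        V_az = []
        for w_z in w:
            # V_z = w_z x V (pointwise), then V_az = gamma * T_a * V_z.
            V_z = [[w_z[y] * v[y] for y in range(n)] for v in V]
            V_az.append([[gamma * sum(T[a][x][y] * vz[y] for y in range(n))
                          for x in range(n)] for vz in V_z])
        # Cross-sum: pick one vector per observation and add them to r_a.
        for choice in product(*V_az):
            new_V.append([r[a][x] + sum(v[x] for v in choice)
                          for x in range(n)])
    return new_V


# Hypothetical problem: "stay" and "swap" actions, two observations.
T = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
w = [[0.85, 0.15], [0.15, 0.85]]
r = [[0.0, 0.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
new_V = exact_backup(V, T, w, r, 0.9)
print(len(new_V))
```

With 2 actions, 2 observations and 2 starting hyperplanes, the result has 2 × 2² = 8 vectors, which is the growth that makes the cross-sum the expensive step.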

SLIDE 16

A note on complexity

Or, some very large numbers

Set                   Comment            Total size                   Time/element
$\mathcal{V}_z$       same size as V     $|Z|\,|\mathcal{V}|$         $O(|X|)$
$\mathcal{V}_{az}$    still same size    $|A|\,|Z|\,|\mathcal{V}|$    $O(|X|^2)$
$\mathcal{V}_a$       big!               $|A|\,|\mathcal{V}|^{|Z|}$   $O(|X|)$

For example, w/ 5 actions, 5 observations: 1, 5, 15625, $4.6566 \times 10^{21}$, $1.0948 \times 10^{109}$, . . .
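The example sequence follows from applying the size formula for the unpruned backup, $|A|\,|\mathcal{V}|^{|Z|}$, repeatedly. A small sketch that reproduces it (the function name is made up for illustration):

```python
def hyperplane_counts(num_actions, num_observations, steps, start=1):
    """Sizes of the unpruned hyperplane set after each exact backup."""
    counts = [start]
    for _ in range(steps):
        counts.append(num_actions * counts[-1] ** num_observations)
    return counts


counts = hyperplane_counts(5, 5, 4)
for c in counts:
    print(f"{c:.4e}")
```

Python integers are arbitrary precision, so the counts are exact; with 5 actions and 5 observations they are the powers 5^0, 5^1, 5^6, 5^31 and 5^156.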

SLIDE 17

Witnesses (Littman 1994)

Don’t need all elements of $\mathcal{V}$
Just those which are $\arg\max_v b \cdot v$ for some $b$
If we have the $b$ (a witness), fast to check that $v$ is indeed $\arg\max$

SLIDE 18

Witness details

Linear feasibility problem (size about $|\mathcal{V}| \times |X|$)
$b \cdot v \ge b \cdot v_i \;\; \forall i$
$b \cdot 1 = 1$
$b \ge 0$
Solve one LF per element of $\mathcal{V}$: expensive, but well worth it
Can add margin $\epsilon > 0$ for approximate solution
  • don’t have to have all witnesses

SLIDE 19

Incremental pruning

(Cassandra, Littman, Zhang 1997)
Prune $\mathcal{V}_z$, $\mathcal{V}_{az}$, and $\mathcal{V}_a$ as they are constructed
Another big win in runtime
We are now up to 16-state POMDPs

SLIDE 20

Summary so far

Solve POMDPs by repeatedly applying backup $T$
Represent $V$ with set of hyperplanes $\mathcal{V}$
$\mathcal{V}$ grows fast
Can prune $\mathcal{V}$ using witnesses

SLIDE 21

Plan for rest of talk

Better use of witnesses: point backups
Better way to find witnesses: exploration
PBVI = point backups + exploration for witnesses
PBVI examples

SLIDE 22

Backups at a point

Computing witnesses is expensive
What if we knew a witness $b$ already?
Fast to compute both $V(b)$ and $\frac{d}{db} V(b)$
Intuitive, then formal derivation

SLIDE 23

Point backup—intuition

$V(b')$ depends on $P(z \mid b, a)\, b_{az}$ for all $a$, $z$
$P(z \mid b, a)\, b_{az}$ are linear functions of $b$
$V(P(z \mid b, a)\, b_{az})$ is scaled/shifted copy of $V$
Adding these copies: hard over $P(X)$, easy at $b$

SLIDE 24

Point backup—math

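When $\mathcal{V} \to \mathcal{V}'$, we want $\max_{v \in \mathcal{V}'} b \cdot v$
That’s $\max_a \max_{v \in \mathcal{V}_a} b \cdot v$, since $\mathcal{V}' = \mathcal{V}_{a_1} \cup \mathcal{V}_{a_2} \ldots$
But $\max_{v \in \mathcal{V}_a} b \cdot v$ is

$$b \cdot r_a + \max_{v_1 \in \mathcal{V}_{az_1}} b \cdot v_1 + \max_{v_2 \in \mathcal{V}_{az_2}} b \cdot v_2 + \ldots$$

since any $v \in \mathcal{V}_a$ is $r_a + v_1 + v_2 + \ldots$
. . . and $\mathcal{V}_{az}$ is quick to compute.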
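The derivation above suggests a direct procedure: for each action, pick the best vector per observation at the given belief and add them up, then keep the best action. The sketch below assumes list-based vectors and a per-state reward vector `r[a]` added once per action; the names and the tiny example are hypothetical.

```python
def dot(b, v):
    return sum(bi * vi for bi, vi in zip(b, v))


def point_backup(b, V, T, w, r, gamma):
    """Return the single hyperplane of V' that is maximal at belief b."""
    n = len(b)
    best, best_val = None, float("-inf")
    for a in range(len(T)):
        v_a = list(r[a])
        for w_z in w:
            # Build V_az and keep only the member that is best at b.
            V_az = [[gamma * sum(T[a][x][y] * w_z[y] * v[y] for y in range(n))
                     for x in range(n)] for v in V]
            v_best = max(V_az, key=lambda u: dot(b, u))
            v_a = [p + q for p, q in zip(v_a, v_best)]
        if dot(b, v_a) > best_val:
            best, best_val = v_a, dot(b, v_a)
    return best


# Hypothetical problem: one action that leaves the state unchanged.
T = [[[1.0, 0.0], [0.0, 1.0]]]
w = [[0.85, 0.15], [0.15, 0.85]]
r = [[0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
v_new = point_backup([0.5, 0.5], V, T, w, r, 0.9)
print(v_new)
```

The work per belief is proportional to |A| |Z| |V| |X|², with no cross-sum. The returned normal gives both the backed-up value at `b` (its dot product with `b`) and the gradient of the value function there.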

SLIDE 25

Advantage of point-based backups

Suppose we have a set $B$ of witnesses and $\mathcal{V}$ of hyperplanes
Pruning $\mathcal{V}$ takes time $O(|B|\,|\mathcal{V}|\,|X|)$ (w/ small constant)
Without knowing witnesses, solve $|\mathcal{V}|$ LFs, each $|\mathcal{V}| \times |X|$
Higher order, worse constants
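A sketch of witness-based pruning under the same list-based assumptions: for every belief in B, find the maximizing hyperplane and keep it. The function name and the example vectors are illustrative.

```python
def prune_with_witnesses(V, B):
    """Keep only hyperplanes that are maximal at some belief in B."""
    kept = []
    for b in B:
        best = max(V, key=lambda v: sum(bi * vi for bi, vi in zip(b, v)))
        if best not in kept:
            kept.append(best)
    return kept


# Hypothetical hyperplanes; [1.5, 1.5] is dominated everywhere.
V = [[3.0, 1.0], [1.0, 3.0], [1.5, 1.5], [2.2, 2.2]]
B = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
pruned = prune_with_witnesses(V, B)
print(pruned)
```

Only dot products are needed, so no linear program is solved. The trade-off is that a hyperplane that is maximal only at beliefs outside B gets dropped.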

SLIDE 26

Where do witnesses come from?

Grids (note difference to discretizing belief simplex)
Random (Poon 2001)
Interleave point-based with incremental pruning (Zhang & Zhang 2000)
We are now up to 90-state POMDPs

SLIDE 27

New theorem

Bound error of the point-based backup operator
Bound depends on how densely we sample reachable beliefs
Probably exists an extension to “easily reachable” beliefs
Error bound on one step + contraction of value iteration = overall error bound
First result of this sort for POMDP VI

SLIDE 28

Definitions

Let $\Delta$ be the set of reachable beliefs
Let $B$ be a set of witnesses
Let $\epsilon(B)$ be the worst-case density of $B$ in $\Delta$:

$$\epsilon(B) = \max_{b' \in \Delta} \min_{b \in B} \|b - b'\|_1$$
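The density can be estimated numerically if the reachable set is approximated by a finite sample. That substitution is an assumption of this sketch, since the true set of reachable beliefs is generally infinite; the names and sample beliefs are hypothetical.

```python
def l1_distance(b, c):
    return sum(abs(p - q) for p, q in zip(b, c))


def density(B, reachable_sample):
    """Worst-case 1-norm distance from a sampled reachable belief to B."""
    return max(min(l1_distance(b, bp) for b in B) for bp in reachable_sample)


# Hypothetical witness set and sample of reachable beliefs.
B = [[1.0, 0.0], [0.0, 1.0]]
sample = [[1.0, 0.0], [0.5, 0.5], [0.25, 0.75]]
eps = density(B, sample)
print(eps)
```

Here the worst-covered belief is the midpoint, at 1-norm distance 1 from both witnesses. Because only a sample is used, the result is a lower bound on the true value.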

SLIDE 29

Theorem

A single point-based backup’s error is at most

$$\frac{\epsilon(B)\,(R_{\max} - R_{\min})}{1 - \gamma}$$

That means the error after value iteration is at most

$$\frac{\epsilon(B)\,(R_{\max} - R_{\min})}{(1 - \gamma)^2}$$

plus a bit for stopping at finite horizon

SLIDE 30

Policy error

We therefore have that policy error is:

$$\frac{\epsilon(B)\,(R_{\max} - R_{\min})}{(1 - \gamma)^3}$$

$(1 - \gamma)^3$, ouch!
But it does go to 0 as $\epsilon(B) \to 0$
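To get a feel for how quickly the three bounds grow with the discount factor, they can be evaluated for sample numbers. The inputs below (density 0.1, rewards in [0, 1], discount 0.9) are made up for illustration and do not come from the talk.

```python
def pbvi_error_bounds(eps, r_max, r_min, gamma):
    """Evaluate the single-backup, value-iteration and policy error bounds."""
    span = eps * (r_max - r_min)
    return {
        "single_backup": span / (1 - gamma),
        "value_iteration": span / (1 - gamma) ** 2,
        "policy": span / (1 - gamma) ** 3,
    }


bounds = pbvi_error_bounds(eps=0.1, r_max=1.0, r_min=0.0, gamma=0.9)
for name, bound in bounds.items():
    print(f"{name}: {bound:.3f}")
```

With a discount of 0.9, each extra factor of 1/(1 − γ) multiplies the bound by about 10, so the policy bound is roughly 100 times the single-backup bound.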

SLIDE 31

Exploration

Theorem tells us we want to sample reachable beliefs with high worst-case 1-norm density
We can do this by simulating forward from $b_0$
Generate a set of candidate witnesses
Accept those which are farthest (1-norm) from current set
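The acceptance rule in the last step can be sketched as a greedy farthest-point selection. The candidates are taken as given here; how they are generated by forward simulation is left out of this sketch, and the function name and numbers are hypothetical.

```python
def l1(b, c):
    return sum(abs(p - q) for p, q in zip(b, c))


def select_witnesses(B, candidates, how_many):
    """Greedily add the candidates farthest (1-norm) from the current set."""
    chosen = [list(b) for b in B]
    pool = [list(c) for c in candidates]
    for _ in range(how_many):
        if not pool:
            break
        # Distance from a candidate to the set is the distance to its
        # nearest member; accept the candidate whose distance is largest.
        farthest = max(pool, key=lambda c: min(l1(c, b) for b in chosen))
        pool.remove(farthest)
        chosen.append(farthest)
    return chosen


# Hypothetical current set and candidate beliefs.
B = [[1.0, 0.0]]
candidates = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
new_B = select_witnesses(B, candidates, 2)
print(new_B)
```

Each accepted belief is the one currently worst covered, so every addition targets the largest gap in the candidate pool.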

SLIDE 32

Selecting new witnesses

SLIDE 33

Summary of algorithm

$B \leftarrow \{b_0\}$
$\mathcal{V} = \{0\}$ (or whatever, e.g., use QMDP)
Do some point-based backups on $\mathcal{V}$ using $B$
  • we backup $k$ times, where $\gamma^k$ is small
Add more beliefs to $B$
  • we double the size of $B$ each time
Repeat
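The loop above can be put together in a compact sketch. Several simplifications are assumptions of this example rather than details from the talk: successors are enumerated one step ahead instead of sampled by simulation, each belief contributes at most one new belief per round (so B at most doubles), and the two-state problem at the bottom is invented.

```python
def dot(b, v):
    return sum(p * q for p, q in zip(b, v))


def point_backup(b, V, T, w, r, gamma):
    n = len(b)
    options = []
    for a in range(len(T)):
        v_a = list(r[a])
        for w_z in w:
            V_az = [[gamma * sum(T[a][x][y] * w_z[y] * v[y] for y in range(n))
                     for x in range(n)] for v in V]
            v_best = max(V_az, key=lambda u: dot(b, u))
            v_a = [p + q for p, q in zip(v_a, v_best)]
        options.append(v_a)
    return max(options, key=lambda u: dot(b, u))


def successor(b, T_a, w_z):
    n = len(b)
    u = [w_z[y] * sum(b[x] * T_a[x][y] for x in range(n)) for y in range(n)]
    s = sum(u)
    return [ui / s for ui in u] if s > 0 else None


def expand(B, T, w):
    """For each belief, add its one-step successor farthest from the set."""
    new_B = list(B)
    for b in B:
        succ = [successor(b, T_a, w_z) for T_a in T for w_z in w]
        succ = [s for s in succ if s is not None]
        if not succ:
            continue
        def dist(s):
            return min(sum(abs(p - q) for p, q in zip(s, c)) for c in new_B)
        far = max(succ, key=dist)
        if dist(far) > 0.0:
            new_B.append(far)
    return new_B


def pbvi(b0, T, w, r, gamma, rounds, backups_per_round):
    B = [list(b0)]
    V = [[0.0] * len(b0)]
    for _ in range(rounds):
        for _ in range(backups_per_round):
            V = [point_backup(b, V, T, w, r, gamma) for b in B]
        B = expand(B, T, w)
    return V, B


# Hypothetical noisy two-state problem with "stay" and "switch" actions.
T = [[[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.9, 0.1]]]
w = [[0.8, 0.2], [0.2, 0.8]]
r = [[1.0, 0.0], [0.0, 1.0]]
b0 = [1.0, 0.0]
V, B = pbvi(b0, T, w, r, 0.9, rounds=3, backups_per_round=10)
print(len(B), max(dot(b0, v) for v in V))
```

The hyperplane set never holds more than one vector per belief in B, which is what keeps each round cheap compared with an exact backup.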

SLIDE 34

Tag problem

870 states, 2×29 observations, 5 actions
Fixed opponent policy

SLIDE 35

Results

SLIDE 36

Results

Catches opponent 60% of time
Don’t know of another value iteration algorithm which could do this well
On smaller problems, gets policies as good as other algorithms
But uses a small fraction of the compute time

SLIDE 37

Contributions and Conclusion

Others have used point-based backups
  • mostly in combination with other, more expensive ops
Others have tried to select witnesses quickly
  • on small problems, random & grid are good heuristics
Pushed to 10× larger problems with efficient algorithm and intelligent search for witnesses
Our theorem is the strongest of its type
