Fast approximate planning in POMDPs


  1. Fast approximate planning in POMDPs. Geoff Gordon (ggordon@cs.cmu.edu). Based on: Joelle Pineau, Geoff Gordon, Sebastian Thrun, "Point-based value iteration: an anytime algorithm for POMDPs." (Fast approximate planning in POMDPs, p.1/37)

  2. Overview: POMDPs are too slow.

  3. Overview: POMDPs are too slow.

  4. Overview
     - Review of POMDPs
     - Review of POMDP value iteration algorithms
     - Point-based value iteration
     - Theoretical results
     - Actual results

  5. POMDP overview
     - Planning in an uncertain world
     - Actions have random effects
     - Don't observe full world state

  6. POMDP definition
     - States x ∈ X, actions a ∈ A, observations z ∈ Z
     - Rewards r_a (column vectors), discount γ ∈ [0, 1)
     - Belief b ∈ P(X) (row vectors)
     - Starting belief b_0
     [figure: the belief simplex]

  7. POMDP definition cont'd
     - Transitions b → b T_a (T_a stochastic)
     - Observation likelihoods w_z (row vectors), with Σ_z w_z = 1
     - Observation update: b ← η (w_z × b), where × is pointwise multiplication and η normalizes
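
The two-step update on this slide can be sketched in NumPy; the function name and the 2-state numbers below are made up for illustration:

```python
import numpy as np

def belief_update(b, T_a, w_z):
    """One POMDP belief update: transition under action a, then
    condition on observation z (T_a, w_z as defined on the slides)."""
    b_pred = b @ T_a       # b -> b T_a (row vector times stochastic matrix)
    b_new = b_pred * w_z   # pointwise multiply by observation likelihoods
    eta = b_new.sum()      # normalizer; equals P(z | b, a)
    return b_new / eta, eta

# Hypothetical 2-state example
b = np.array([0.5, 0.5])
T_a = np.array([[0.9, 0.1],
                [0.2, 0.8]])
w_z = np.array([0.7, 0.3])
b_post, p_z = belief_update(b, T_a, w_z)
```

The returned normalizer is exactly the P(z | b, a) used later in the Bellman equation, which is why Sondik's rearrangement (slide 13) works.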

  8. Value functions
     - Just like the MDP value function (but bigger)
     - V(b) = expected total discounted future reward starting from b
     - Knowing V means planning is 1-step lookahead
     - If we discretize the belief simplex, we are "done"
     - From b we get to b_z1, b_z2, ... according to P(z | b, a)

  9. Value functions
     - Additional structure: convexity
     - Consider beliefs b_1, b_2, and b_3 = (b_1 + b_2)/2
     - b_3: flip a coin, then start in b_1 if heads, b_2 if tails
     - b_3 is always worse than the average of b_1, b_2

  10. Representation
     - Represent V as the upper surface of a (possibly infinite) set V of hyperplanes
     - Hyperplanes represented by normals v (column vectors)
     - V(b) = max_{v ∈ V} b · v
     [figure: upper surface of a set of hyperplanes over the belief simplex]
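
In this representation, evaluating V at a belief is just a max of dot products. A minimal sketch (the toy hyperplanes are mine):

```python
import numpy as np

def value(b, hyperplanes):
    """V(b) = max over v in the set of b . v: the upper surface."""
    return max(float(b @ v) for v in hyperplanes)

# Two hyperplanes over a 2-state belief simplex
hyperplanes = [np.array([2.0, 1.0]), np.array([1.0, 3.0])]
v_mid = value(np.array([0.5, 0.5]), hyperplanes)
```

Each corner of the simplex is dominated by a different hyperplane, and the max makes V piecewise linear and convex, matching slide 9.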

  11. Value iteration
     Bellman's equation:
       V(b) = max_a Q(b, a)
       Q(b, a) = b · r_a + γ Σ_z P(z | b, a) V(b_az)
     where b_az = η (b T_a) × w_z

  12. Convergence
     - Backup operator T: V ← T V
     - T is a contraction on functions P(X) ↦ ℝ
     - ‖b − b′‖ = max_x |b(x) − b′(x)|
     [figure: a value function before and after one backup]

  13. Sondik's algorithm (1972)
     Rearrange the Bellman equation to make it linear: η⁻¹ = P(z | b, a), and V(η b) = η V(b), so
       Q(b, a) = b · r_a + γ Σ_z V((b T_a) × w_z)
               = b · r_a + γ Σ_z max_{v ∈ V} ((b T_a) × w_z) · v
               = b · r_a + γ Σ_z max_{v ∈ V} b · T_a (w_z × v)

  14. Evaluate from inside out
     Suppose V_t(b) = b · v. Then
       v_z = w_z × v
       v_az = γ T_a v_z
       v_a = r_a + v_az1 + v_az2 + ...
       V′ = {v_a1, v_a2, ...}
     Now V_{t+1}(b) = max_{v ∈ V′} b · v

  15. More than 1 hyperplane
     Suppose V_t(b) = max_{v ∈ V} b · v. Then (set operations are elementwise)
       V_z = w_z × V
       V_az = γ T_a V_z
       V_a = r_a + V_az1 ⊕ V_az2 ⊕ ...   (the cross-sum ⊕ is expensive!)
       V′ = V_a1 ∪ V_a2 ∪ ...
     Now V_{t+1}(b) = max_{v ∈ V′} b · v. This representation is due to [Cassandra et al.]
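
The set-valued backup on this slide can be sketched directly. The names below are mine, and no pruning is done, so the output has the full |A| |V|^|Z| size:

```python
import numpy as np
from itertools import product

def exact_backup(alphas, T, W, R, gamma):
    """One exact value-iteration backup in the hyperplane
    representation: for each action, project every vector through
    every observation (V_az = gamma * T_a (w_z x v)), then take the
    cross-sum over observations and add the reward vector."""
    new = []
    for a in range(len(T)):
        # proj[z] holds gamma * T_a (w_z x v) for every v in the current set
        proj = [[gamma * T[a] @ (W[z] * v) for v in alphas] for z in range(len(W))]
        # cross-sum: one projected vector per observation, plus r_a
        for choice in product(*proj):
            new.append(R[a] + sum(choice))
    return new

# Toy problem: 1 action, 2 observations, 2 hyperplanes (numbers are mine)
T = [np.eye(2)]
W = [np.array([0.6, 0.4]), np.array([0.4, 0.6])]
R = [np.array([1.0, 0.0])]
alphas = [np.zeros(2), np.ones(2)]
V_new = exact_backup(alphas, T, W, R, gamma=0.9)
```

With |A| = 1, |Z| = 2, and |V| = 2, the result has 1 · 2² = 4 vectors, which is exactly the blowup quantified on the next slide.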

  16. A note on complexity (or, some very large numbers)

       Set    Comment           Total size       Time/element
       V_z    same size as V    |Z| |V|          O(|X|)
       V_az   still same size   |A| |Z| |V|      O(|X|²)
       V_a    big!              |A| |V|^|Z|      O(|X|)

     For example, with 5 actions and 5 observations, successive set sizes are:
     1, 5, 15625, 4.6566 × 10^21, 1.0948 × 10^109, ...

  17. Witnesses (Littman 1994)
     - Don't need all elements of V: just those which are arg max_v b · v for some b
     - If we have such a b (a witness), it is fast to check that v is indeed the arg max
     [figure: hyperplanes, with a witness belief for a maximal one]

  18. Witness details
     Linear feasibility problem (size about |V| × |X|):
       b · v ≥ b · v_i  ∀i
       b · 1 = 1
       b ≥ 0
     - Solve one LF per element of V: expensive, but well worth it
     - Can add a margin ε > 0 for an approximate solution: don't have to find all witnesses
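
A sketch of this feasibility check using scipy.optimize.linprog: maximizing a margin variable δ turns it into an LP whose optimal sign tells us whether v has a witness. The function name and encoding are my own:

```python
import numpy as np
from scipy.optimize import linprog

def find_witness(v, others, eps=1e-9):
    """Is there a belief b with b.v >= b.v_i for all competitors v_i?
    Variables are [b (n entries), delta]; we maximize delta subject to
    b.(v_i - v) + delta <= 0, b summing to 1, and b >= 0.  A positive
    optimum yields a witness belief; otherwise v is dominated."""
    n = len(v)
    c = np.zeros(n + 1)
    c[-1] = -1.0                      # linprog minimizes, so minimize -delta
    A_ub = np.array([np.append(vi - v, 1.0) for vi in others])
    b_ub = np.zeros(len(others))
    A_eq = np.array([np.append(np.ones(n), 0.0)])   # b . 1 = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]       # b >= 0, delta free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.success and -res.fun > eps:
        return res.x[:n]              # the witness belief
    return None

# v1 is maximal at the first corner; v3 is dominated everywhere
w = find_witness(np.array([1.0, 0.0]), [np.array([0.0, 1.0])])
dominated = find_witness(np.array([0.4, 0.4]),
                         [np.array([1.0, 0.0]), np.array([0.0, 1.0])])
```

The `eps` threshold plays the role of the margin ε on the slide: raising it prunes vectors that are maximal only by a sliver.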

  19. Incremental pruning (Cassandra, Littman, Zhang 1997)
     - Prune V_z, V_az, and V_a as they are constructed
     - Another big win in runtime
     - We are now up to 16-state POMDPs

  20. Summary so far
     - Solve POMDPs by repeatedly applying the backup operator T
     - Represent V with a set of hyperplanes V
     - V grows fast
     - Can prune V using witnesses

  21. Plan for rest of talk
     - Better use of witnesses: point backups
     - Better way to find witnesses: exploration
     - PBVI = point backups + exploration for witnesses
     - PBVI examples

  22. Backups at a point
     - Computing witnesses is expensive
     - What if we knew a witness b already?
     - Fast to compute both V(b) and (d/db) V(b)
     - Intuitive, then formal derivation

  23. Point backup: intuition
     - V(b′) depends on P(z | b, a) b_az for all a, z
     - P(z | b, a) b_az are linear functions of b
     - V(P(z | b, a) b_az) is a scaled/shifted copy of V
     - Adding these copies: hard over all of P(X), easy at a single b
     [figure: scaled copies of V summed at one belief]

  24. Point backup: math
     When V → V′, we want max_{v ∈ V′} b · v.
     That's max_a max_{v ∈ V_a} b · v, since V′ = V_a1 ∪ V_a2 ∪ ...
     But max_{v ∈ V_a} b · v is
       max_{v1 ∈ V_az1} b · v1 + max_{v2 ∈ V_az2} b · v2 + ...
     since any v ∈ V_a is v1 + v2 + ..., and V_az is quick to compute.
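
A point-based backup at a single belief b can be sketched as follows (names are mine), following the V_a = r_a + cross-sum structure of slide 15 but choosing only the maximizer at b for each observation:

```python
import numpy as np

def point_backup(b, alphas, T, W, R, gamma):
    """Backup at one belief b: for each action, pick per observation the
    projected vector gamma * T_a (w_z x v) that is maximal at b, sum
    them with r_a, and return the best action's vector.  Runs in
    O(|A| |Z| |V| |X|) instead of building the full cross-sum."""
    best_val, best_vec = -np.inf, None
    for a in range(len(T)):
        vec = R[a].astype(float)
        for z in range(len(W)):
            cands = [gamma * T[a] @ (W[z] * v) for v in alphas]
            vec = vec + max(cands, key=lambda u: float(b @ u))
        if b @ vec > best_val:
            best_val, best_vec = float(b @ vec), vec
    return best_vec

# Toy 2-state, 2-action, 1-observation problem (numbers are mine)
b = np.array([0.5, 0.5])
T = [np.eye(2), np.eye(2)]
W = [np.array([1.0, 1.0])]
R = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
vec = point_backup(b, [np.zeros(2)], T, W, R, gamma=0.9)
```

The returned vector is a valid hyperplane of T V everywhere, but it is only guaranteed to be the maximal one at b, which is why the error bound in the theorem depends on how densely the witnesses cover the reachable beliefs.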

  25. Advantage of point-based backups
     - Suppose we have a set B of witnesses and a set V of hyperplanes
     - Pruning V then takes time O(|B| |V| |X|) (with a small constant)
     - Without knowing witnesses, solve |V| LFs, each of size |V| × |X|: higher order, worse constants

  26. Where do witnesses come from?
     - Grids (note the difference from discretizing the belief simplex)
     - Random sampling (Poon 2001)
     - Interleave point-based backups with incremental pruning (Zhang & Zhang 2000)
     - We are now up to 90-state POMDPs

  27. New theorem
     - Bound the error of the point-based backup operator
     - The bound depends on how densely we sample reachable beliefs
     - Probably extends to "easily reachable" beliefs
     - Error bound on one step + contraction of value iteration = overall error bound
     - First result of this sort for POMDP VI

  28. Definitions
     Let ∆ be the set of reachable beliefs, and let B be a set of witnesses.
     Let ε(B) be the worst-case density of B in ∆:
       ε(B) = max_{b′ ∈ ∆} min_{b ∈ B} ‖b − b′‖₁
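
For a finite sample of ∆, this density is a max-of-mins over 1-norm distances; a small sketch (function name mine):

```python
import numpy as np

def density(B, Delta):
    """eps(B): worst-case 1-norm distance from any reachable belief in
    Delta (here a finite sample) to its nearest witness in B."""
    return max(min(np.abs(bp - b).sum() for b in B) for bp in Delta)

# One witness at a corner of a 2-state simplex (numbers are mine)
eps = density([np.array([1.0, 0.0])],
              [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])])
```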

  29. Theorem
     A single point-based backup's error is at most
       ε(B) (R_max − R_min) / (1 − γ)
     That means the error after value iteration is
       ε(B) (R_max − R_min) / (1 − γ)²
     plus a bit for stopping at a finite horizon.

  30. Policy error
     We therefore have that the policy error is at most
       ε(B) (R_max − R_min) / (1 − γ)³
     (1 − γ)³, ouch! But it does go to 0 as ε(B) → 0.

  31. Exploration
     - The theorem tells us we want to sample reachable beliefs with high worst-case 1-norm density
     - We can do this by simulating forward from b_0
     - Generate a set of candidate witnesses
     - Accept those which are farthest (in 1-norm) from the current set
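
The acceptance rule is greedy farthest-point selection in 1-norm; a sketch, with names of my choosing (it assumes k is at most the number of candidates):

```python
import numpy as np

def select_new_beliefs(B, candidates, k):
    """Greedily accept the k candidates farthest (1-norm) from the
    current witness set, updating distances after each acceptance."""
    kept = list(B)
    cands = list(candidates)
    chosen = []
    for _ in range(k):
        dists = [min(np.abs(c - b).sum() for b in kept) for c in cands]
        i = int(np.argmax(dists))
        chosen.append(cands.pop(i))
        kept.append(chosen[-1])
    return chosen

# With one witness at a corner, the far corner wins over a nearby point
chosen = select_new_beliefs([np.array([1.0, 0.0])],
                            [np.array([0.9, 0.1]), np.array([0.0, 1.0])], 1)
```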

  32. Selecting new witnesses [figure: candidate beliefs; the farthest candidate (*) is accepted]

  33. Summary of algorithm
     - B ← {b_0}
     - V = {0} (or whatever, e.g., use the QMDP value function)
     - Do some point-based backups on V using B: we back up k times, where γ^k is small
     - Add more beliefs to B: we double the size of B each time
     - Repeat
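
The outer loop above can be sketched self-contained as follows. All names are mine, and the expansion rule (one random simulated successor per belief, kept if it is new) is a crude stand-in for the farthest-point exploration of slide 31:

```python
import numpy as np

rng = np.random.default_rng(0)

def pbvi(b0, T, W, R, gamma, rounds=2, k=10):
    """PBVI outer loop, a sketch: B starts as {b0} and V as one zero
    vector; each round does k point-based backups over B, then grows B
    by simulating one step forward from each belief."""
    n = len(b0)
    B = [np.asarray(b0, float)]
    V = [np.zeros(n)]

    def backup(b, V):
        best = None
        for a in range(len(T)):
            vec = R[a].astype(float)
            for z in range(len(W)):
                vec = vec + max((gamma * T[a] @ (W[z] * v) for v in V),
                                key=lambda u: float(b @ u))
            if best is None or b @ vec > b @ best:
                best = vec
        return best

    for _ in range(rounds):
        for _ in range(k):                      # back up until gamma^k is small
            V = [backup(b, V) for b in B]
        for b in list(B):                       # expand B by forward simulation
            a = rng.integers(len(T))
            z = rng.integers(len(W))
            nb = (b @ T[a]) * W[z]
            if nb.sum() > 0:
                nb = nb / nb.sum()
                if min(np.abs(nb - bb).sum() for bb in B) > 1e-6:
                    B.append(nb)
    return V, B

# A trivial 1-state problem: reward 2 every step, gamma = 0.5, so V* = 4
V_out, B_out = pbvi(np.array([1.0]), [np.eye(1)], [np.ones(1)],
                    [np.array([2.0])], gamma=0.5)
```

On the 1-state problem, 20 backups bring the value within 0.5^20 of the true 2/(1 − γ) = 4, illustrating the "back up until γ^k is small" stopping rule.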

  34. Tag problem
     - 870 states, 2 × 29 observations, 5 actions
     - Fixed opponent policy

  35. Results [figure]

  36. Results
     - Catches the opponent 60% of the time
     - We don't know of another value iteration algorithm which could do this well
     - On smaller problems, gets policies as good as other algorithms
     - But uses a small fraction of the compute time
