
Chapter 16: Planning Based on Markov Decision Processes (Dana S. Nau)



  1. Lecture slides for Automated Planning: Theory and Practice
     Chapter 16: Planning Based on Markov Decision Processes
     Dana S. Nau, University of Maryland
     Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License: http://creativecommons.org/licenses/by-nc-sa/2.0/

  2. Motivation
     [Figure: grasp(c) applied to a stack of blocks a, b, c, showing the intended outcome (c grasped) and an unintended outcome (c dropped).]
     ● Until now, we've assumed that each action has only one possible outcome
       ◆ But often that's unrealistic
     ● In many situations, actions may have more than one possible outcome
       ◆ Action failures
         » e.g., gripper drops its load
       ◆ Exogenous events
         » e.g., road closed
     ● Would like to be able to plan in such situations
     ● One approach: Markov Decision Processes

  3. Stochastic Systems
     ● Stochastic system: a triple Σ = (S, A, P)
       ◆ S = finite set of states
       ◆ A = finite set of actions
       ◆ P_a(s′ | s) = probability of going to s′ if we execute a in s
       ◆ ∑_{s′ ∈ S} P_a(s′ | s) = 1
     ● Several different possible action representations
       ◆ e.g., Bayes networks, probabilistic operators
     ● The book does not commit to any particular representation
       ◆ It only deals with the underlying semantics
       ◆ Explicit enumeration of each P_a(s′ | s) (a sketch of this representation follows below)
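The book only assumes the explicit-enumeration semantics above. As a concrete illustration, here is a minimal Python sketch (not from the book) of a stochastic system stored as dictionaries; the states, actions, and transition probabilities are reconstructed from the robot example on the following slides, so treat the exact numbers as assumptions.

```python
# A minimal sketch (assumption) of the explicit-enumeration representation:
# a stochastic system Sigma = (S, A, P) stored as Python dictionaries.
# Transition probabilities are reconstructed from the worked examples on the
# following slides and may differ from the figure in the book.

S = {"s1", "s2", "s3", "s4", "s5"}                 # robot r1 at locations l1..l5
A = {"move(r1,l1,l2)", "move(r1,l1,l4)", "move(r1,l2,l1)",
     "move(r1,l2,l3)", "move(r1,l3,l4)", "move(r1,l5,l4)", "wait"}

# P[a][s] is a dict {s_next: P_a(s_next | s)}; states where a is not applicable are omitted.
P = {
    "move(r1,l1,l2)": {"s1": {"s2": 1.0}},
    "move(r1,l1,l4)": {"s1": {"s4": 0.5, "s1": 0.5}},
    "move(r1,l2,l1)": {"s2": {"s1": 1.0}},
    "move(r1,l2,l3)": {"s2": {"s3": 0.8, "s5": 0.2}},
    "move(r1,l3,l4)": {"s3": {"s4": 1.0}},
    "move(r1,l5,l4)": {"s5": {"s4": 1.0}},
    "wait": {s: {s: 1.0} for s in S},
}

# Sanity check: for each action and each state where it is defined,
# the outgoing probabilities sum to 1, as the definition requires.
for a, by_state in P.items():
    for s, dist in by_state.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9, (a, s)
```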

  4. Example
     [Figure: five-state domain with Start = s1 and Goal = s4; the arcs are move actions such as move(r1,l2,l1), plus wait self-loops at the states.]
     ● Robot r1 starts at location l1
       ◆ State s1 in the diagram
     ● Objective is to get r1 to location l4
       ◆ State s4 in the diagram

  5. Example (continued)
     ● Robot r1 starts at location l1 (state s1); the objective is to get r1 to location l4 (state s4)
     ● No classical plan (sequence of actions) can be a solution, because we can't guarantee we'll be in a state where the next action is applicable
       ◆ e.g., π = 〈 move(r1,l1,l2), move(r1,l2,l3), move(r1,l3,l4) 〉 fails if move(r1,l2,l3) ends up in s5 instead of s3

  6. Policies
     ● Policy: a function that maps states into actions
       ◆ Write it as a set of state-action pairs (see the sketch below)
     π1 = { (s1, move(r1,l1,l2)),
            (s2, move(r1,l2,l3)),
            (s3, move(r1,l3,l4)),
            (s4, wait),
            (s5, wait) }
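A policy is then just a lookup table from states to actions. The sketch below (an assumption, matching the dictionary representation above) encodes π1 and samples one execution step against the transition model P from the previous sketch.

```python
import random

# A minimal sketch (assumption): a policy as a dict from states to actions,
# plus one step of execution in a stochastic system P.

pi1 = {"s1": "move(r1,l1,l2)", "s2": "move(r1,l2,l3)",
       "s3": "move(r1,l3,l4)", "s4": "wait", "s5": "wait"}

def step(P, pi, s):
    """Sample the next state reached by executing pi(s) in state s."""
    dist = P[pi[s]][s]                      # {s_next: probability}
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs)[0]
```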

  7. Initial States
     ● For every state s, there will be a probability P(s) that the system starts in s
     ● The book assumes there's a unique state s0 such that the system always starts in s0
     ● In the example, s0 = s1
       ◆ P(s1) = 1
       ◆ P(s) = 0 for all s ≠ s1

  8. Histories
     ● History: a sequence of system states
         h = 〈 s0, s1, s2, s3, s4, … 〉
     ● Examples in this domain:
         h0 = 〈 s1, s3, s1, s3, s1, … 〉
         h1 = 〈 s1, s2, s3, s4, s4, … 〉
         h2 = 〈 s1, s2, s5, s5, s5, … 〉
         h3 = 〈 s1, s2, s5, s4, s4, … 〉
         h4 = 〈 s1, s4, s4, s4, s4, … 〉
         h5 = 〈 s1, s1, s4, s4, s4, … 〉
         h6 = 〈 s1, s1, s1, s4, s4, … 〉
         h7 = 〈 s1, s1, s1, s1, s1, … 〉
     ● Each policy induces a probability distribution over histories
       ◆ If h = 〈 s0, s1, … 〉 then P(h | π) = P(s0) ∏_{i ≥ 0} P_{π(s_i)}(s_{i+1} | s_i)   (a sketch of this computation follows the slide)
       ◆ The book omits the P(s0) factor because it assumes a unique starting state
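The product formula above is easy to evaluate on a finite prefix of a history, since in these examples every omitted tail factor is 1. A minimal sketch, assuming the P and pi1 dictionaries from the earlier blocks and a unique starting state s1:

```python
# A minimal sketch (assumption) of P(h | pi) = P(s0) * prod_i P_{pi(s_i)}(s_{i+1} | s_i),
# evaluated over a finite state sequence h. P0 is the initial-state distribution.

P0 = {"s1": 1.0}      # unique starting state s1, as in the example

def history_probability(P, P0, pi, h):
    """Probability of the finite state sequence h when pi is executed from P0."""
    prob = P0.get(h[0], 0.0)
    for s, s_next in zip(h, h[1:]):
        prob *= P[pi[s]][s].get(s_next, 0.0)
    return prob
```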

  9. Example
     π1 = { (s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait) }
     h1 = 〈 s1, s2, s3, s4, s4, … 〉    P(h1 | π1) = 1 × 1 × 0.8 × 1 × … = 0.8   (reaches the goal)
     h2 = 〈 s1, s2, s5, s5, … 〉        P(h2 | π1) = 1 × 1 × 0.2 × 1 × … = 0.2
     P(h | π1) = 0 for all other h
     So π1 reaches the goal with probability 0.8
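Using the history_probability sketch from the previous block, the two probabilities above can be reproduced directly; the hypothetical lists h1 and h2 are finite prefixes of the infinite histories, whose omitted factors are all 1.

```python
# Usage sketch, reusing P, P0, pi1, and history_probability from the blocks above.
h1 = ["s1", "s2", "s3", "s4", "s4"]
h2 = ["s1", "s2", "s5", "s5"]
print(history_probability(P, P0, pi1, h1))   # 0.8
print(history_probability(P, P0, pi1, h2))   # 0.2 (up to floating-point rounding)
```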

  10. Example
     π2 = { (s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4)) }
     h1 = 〈 s1, s2, s3, s4, s4, … 〉    P(h1 | π2) = 1 × 1 × 0.8 × 1 × … = 0.8   (reaches the goal)
     h3 = 〈 s1, s2, s5, s4, s4, … 〉    P(h3 | π2) = 1 × 1 × 0.2 × 1 × … = 0.2   (reaches the goal)
     P(h | π2) = 0 for all other h
     So π2 reaches the goal with probability 1

  11. Example
     π3 = { (s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4)) }
     h4 = 〈 s1, s4, s4, s4, … 〉          P(h4 | π3) = 0.5 × 1 × 1 × 1 × … = 0.5     (reaches the goal)
     h5 = 〈 s1, s1, s4, s4, s4, … 〉      P(h5 | π3) = 0.5 × 0.5 × 1 × 1 × … = 0.25
     h6 = 〈 s1, s1, s1, s4, s4, … 〉      P(h6 | π3) = 0.5 × 0.5 × 0.5 × 1 × … = 0.125
     • • •
     h7 = 〈 s1, s1, s1, s1, s1, … 〉      P(h7 | π3) = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × … = 0
     So π3 reaches the goal with probability 1.0 (the probabilities 0.5, 0.25, 0.125, … sum to 1; see the check below)
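A quick numerical check of that claim, assuming the 0.5 success probability of move(r1,l1,l4) used above: the history that stays in s1 for k failed attempts and then reaches s4 has probability 0.5^(k+1), and these probabilities form a geometric series that sums to 1.

```python
# Partial sums of 0.5 + 0.25 + 0.125 + ... approach 1, so the goal is reached
# with probability 1 even though no single finite bound on the number of steps exists.
print(sum(0.5 ** (k + 1) for k in range(60)))   # ~1.0
```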

  12. Utility
     [Figure: the same domain, with rewards marked at the states: r = 100 at the old goal state s4 and r = –100 at s5.]
     ● Numeric cost C(s, a) for each state s and action a
     ● Numeric reward R(s) for each state s
     ● No explicit goals any more
       ◆ Desirable states have high rewards
     ● Example:
       ◆ C(s, wait) = 0 at every state except s3
       ◆ C(s, a) = 1 for each “horizontal” action
       ◆ C(s, a) = 100 for each “vertical” action
       ◆ R as shown in the figure
     ● Utility of a history (sketched in code below):
       ◆ If h = 〈 s0, s1, … 〉, then V(h | π) = ∑_{i ≥ 0} [R(s_i) – C(s_i, π(s_i))]
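To make the definition concrete, here is a minimal sketch of V(h | π) over a finite prefix of a history. The reward and cost tables are assumptions reconstructed from the worked example on the next slide (R(s4) = 100, R(s5) = -100, and only the costs that π1 actually incurs are listed); anything not listed defaults to cost 0.

```python
# A minimal sketch (assumption) of V(h | pi) = sum_i [ R(s_i) - C(s_i, pi(s_i)) ]
# over a finite prefix of h. R and C are reconstructed from the example.

R = {"s1": 0, "s2": 0, "s3": 0, "s4": 100, "s5": -100}
C = {("s1", "move(r1,l1,l2)"): 100,    # "vertical" move
     ("s2", "move(r1,l2,l3)"): 1,      # "horizontal" move
     ("s3", "move(r1,l3,l4)"): 100}    # "vertical" move
# Pairs not listed (e.g., wait actions) default to cost 0 below.

def utility(pi, h):
    """Undiscounted utility of a finite prefix of history h under policy pi."""
    return sum(R[s] - C.get((s, pi[s]), 0) for s in h)
```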

  13. Example
     π1 = { (s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait) }
     h1 = 〈 s1, s2, s3, s4, s4, … 〉
     h2 = 〈 s1, s2, s5, s5, … 〉
     V(h1 | π1) = [R(s1) – C(s1, π1(s1))] + [R(s2) – C(s2, π1(s2))] + [R(s3) – C(s3, π1(s3))]
                  + [R(s4) – C(s4, π1(s4))] + [R(s4) – C(s4, π1(s4))] + …
                = [0 – 100] + [0 – 1] + [0 – 100] + [100 – 0] + [100 – 0] + … = ∞
     V(h2 | π1) = [0 – 100] + [0 – 1] + [–100 – 0] + [–100 – 0] + [–100 – 0] + … = –∞
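The divergence above can also be seen numerically with the utility sketch from the previous block: as the finite prefix of each history grows, the sums run off toward +∞ and –∞ respectively.

```python
# Sketch reusing R, C, pi1, and utility from above.
for n in (10, 100, 1000):
    print(n,
          utility(pi1, ["s1", "s2", "s3"] + ["s4"] * n),   # grows without bound
          utility(pi1, ["s1", "s2"] + ["s5"] * n))         # falls without bound
```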

  14. Discounted Utility
     ● We often need to use a discount factor γ, with 0 ≤ γ ≤ 1 (γ = 0.9 in the example)
     ● Discounted utility of a history:
         V(h | π) = ∑_{i ≥ 0} γ^i [R(s_i) – C(s_i, π(s_i))]
       ◆ Distant rewards/costs have less influence
       ◆ Convergence is guaranteed if 0 ≤ γ < 1
     ● Expected utility of a policy (sketched in code below):
       ◆ E(π) = ∑_h P(h | π) V(h | π)
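Both definitions translate directly into code. The sketch below (assuming the R, C, and pi1 dictionaries from the earlier blocks) computes the discounted utility of a finite history prefix and the expected utility of a policy over an explicitly enumerated, finite set of histories, which is all the example needs since only two histories have nonzero probability under π1.

```python
# A minimal sketch (assumption):
#   V(h | pi) = sum_i gamma**i * (R(s_i) - C(s_i, pi(s_i)))
#   E(pi)     = sum_h P(h | pi) * V(h | pi)

def discounted_utility(pi, h, gamma=0.9):
    """Discounted utility of a finite prefix of history h under policy pi."""
    return sum(gamma ** i * (R[s] - C.get((s, pi[s]), 0)) for i, s in enumerate(h))

def expected_utility(pi, histories, gamma=0.9):
    """histories: iterable of (probability, finite history prefix) pairs."""
    return sum(p * discounted_utility(pi, h, gamma) for p, h in histories)
```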

  15. Example
     π1 = { (s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait) },   γ = 0.9
     h1 = 〈 s1, s2, s3, s4, s4, … 〉
     h2 = 〈 s1, s2, s5, s5, … 〉
     V(h1 | π1) = 0.9^0 [0 – 100] + 0.9^1 [0 – 1] + 0.9^2 [0 – 100] + 0.9^3 [100 – 0] + 0.9^4 [100 – 0] + … = 547.1
     V(h2 | π1) = 0.9^0 [0 – 100] + 0.9^1 [0 – 1] + 0.9^2 [–100 – 0] + 0.9^3 [–100 – 0] + … = –910.9
     E(π1) = 0.8 V(h1 | π1) + 0.2 V(h2 | π1) = 0.8(547.1) + 0.2(–910.9) = 255.5
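Using the discounted_utility and expected_utility sketches from the previous block, these numbers can be reproduced by letting a long finite prefix stand in for each infinite tail; the discounted series converges, so the truncation error is negligible.

```python
# Usage sketch, reusing pi1, discounted_utility, and expected_utility from above.
h1 = ["s1", "s2", "s3"] + ["s4"] * 500
h2 = ["s1", "s2"] + ["s5"] * 500
print(discounted_utility(pi1, h1))                      # about 547.1
print(discounted_utility(pi1, h2))                      # about -910.9
print(expected_utility(pi1, [(0.8, h1), (0.2, h2)]))    # about 255.5
```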
