  1. CSE 573 Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming
     Slides adapted from Andrey Kolobov and Mausam

  2. Stochastic Shortest-Path MDPs: Motivation
     • Assume the agent pays a cost to achieve a goal
     • Example applications:
       – Controlling a Mars rover: “How do we collect scientific data without damaging the rover?”
       – Navigation: “What’s the fastest way to get to a destination, taking traffic jams into account?”

  3. Stochastic Shortest-Path MDPs: Definition [Bertsekas, 1995]
     An SSP MDP is a tuple < S, A, T, C, G >, where:
     • S is a finite state space
     • (the horizon D is an infinite sequence (1, 2, …))
     • A is a finite action set
     • T : S × A × S → [0, 1] is a stationary transition function
     • C : S × A × S → R is a stationary cost function (low cost is good!)
     • G is a set of absorbing, cost-free goal states
     Under two conditions:
     • There is a proper policy (one that reaches a goal with probability 1 from every state)
     • Every improper policy incurs infinite cost from every state from which it does not reach a goal with probability 1
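
A minimal sketch of this tuple as a data structure, used by the later sketches in this transcript. The container and field names (SSPMDP, T keyed by (s, a), C keyed by (s, a, s')) are illustrative choices, not from the slides.

```python
from dataclasses import dataclass

@dataclass
class SSPMDP:
    states: set    # S: finite state space
    actions: dict  # A(s): actions applicable in each state
    T: dict        # T[(s, a)] = {s_next: probability}
    C: dict        # C[(s, a, s_next)] = cost (low cost is good)
    goals: set     # G: absorbing, cost-free goal states

# Tiny instance in the spirit of the slides' examples (values are made up)
mdp = SSPMDP(
    states={"s1", "s2", "sG"},
    actions={"s1": ["a1"], "s2": ["a2"], "sG": []},
    T={("s1", "a1"): {"s2": 1.0},
       ("s2", "a2"): {"sG": 0.7, "s2": 0.3}},
    C={("s1", "a1", "s2"): 1.0,
       ("s2", "a2", "sG"): 1.0, ("s2", "a2", "s2"): 1.0},
    goals={"sG"},
)
```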

  4. SSP MDP Details
     • In an SSP, maximizing expected linear additive utility (ELAU) = minimizing expected cost
     • Every cost-minimizing policy is proper!
     • Thus, an optimal policy = the cheapest way to a goal
     • Why are SSP MDPs called “indefinite-horizon”?
       – An optimal policy takes a finite, but a priori unknown, number of steps to reach the goal

  5. SSP MDP Example… not!
     [Figure: states s1, s2, s3 and goal sG. From s1: a1 reaches s2 (C = 1), a2 self-loops (C = 7.2). From s2: a2 reaches sG with Pr = 0.7 (C = 1) or stays in s2 with Pr = 0.3 (C = -3); a1 reaches s1 with Pr = 0.4 (C = -1) or s3 with Pr = 0.6 (C = 5). Both actions in s3 self-loop (C = 2.4 and 0.8), so s3 is a dead end; sG is absorbing and cost-free.]
     No dead ends allowed!

  6. SSP MDP Example… also not!
     [Figure: states s1, s2 and goal sG. From s1: a1 reaches s2 (C = 1), a2 self-loops (C = 7.2). From s2: a2 reaches sG with Pr = 0.7 (C = 1) or stays in s2 with Pr = 0.3 (C = -3); a1 returns to s1 (C = -1). Cycling s1 → s2 → s1 costs 1 + (-1) = 0, so an improper policy need not incur infinite cost.]
     No cost-free “loops” allowed!

  7. SSP MDP Example
     [Figure: a valid SSP. From s1: a1 reaches s2 (C = 1), a2 self-loops (C = 7.2). From s2: a2 reaches sG with Pr = 0.7 (C = 1) or stays in s2 with Pr = 0.3 (C = 1); a1 returns to s1 (C = 0). sG is absorbing and cost-free: a proper policy exists, and any policy that avoids the goal keeps paying positive cost.]

  8. SSP MDPs: Optimality Principle
     For an SSP MDP, the value of a policy π is well-defined for every history h:
       V^π(h) = E_h[ C_1 + C_2 + … ]   (expected linear additive utility)
     (Every policy either takes a finite expected number of steps to reach a goal, or incurs infinite cost.)
     Then:
     • V* exists and is stationary Markovian; π* exists and is stationary deterministic Markovian
     • For all s:
       V*(s) = min_{a ∈ A} Σ_{s' ∈ S} T(s, a, s') [ C(s, a, s') + V*(s') ]
       π*(s) = argmin_{a ∈ A} Σ_{s' ∈ S} T(s, a, s') [ C(s, a, s') + V*(s') ]
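
A minimal sketch of the Bellman optimality backup above, assuming the SSPMDP container sketched earlier; goal states are pinned at value 0. The function name is illustrative.

```python
def bellman_backup(mdp, V, s):
    """Return (new value, greedy action) for state s under value function V."""
    if s in mdp.goals:
        return 0.0, None
    best_q, best_a = float("inf"), None
    for a in mdp.actions[s]:
        # Q(s, a) = sum over s' of T(s,a,s') * (C(s,a,s') + V(s'))
        q = sum(p * (mdp.C[(s, a, s2)] + V[s2])
                for s2, p in mdp.T[(s, a)].items())
        if q < best_q:
            best_q, best_a = q, a
    return best_q, best_a
```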

  9. Fundamentals of MDPs
     ✓ General MDP Definition
     ✓ Expected Linear Additive Utility
     ✓ The Optimality Principle
     ✓ Finite-Horizon MDPs
     ✓ Infinite-Horizon Discounted-Reward MDPs
     ✓ Stochastic Shortest-Path MDPs
     • A Hierarchy of MDP Classes
     • Factored MDPs
     • Computational Complexity

  10. SSP and Other MDP Classes
      [Venn diagram: the SSP class contains both the FH and IHDR classes; related classes exist as well, e.g., indefinite-horizon discounted-reward MDPs.]

  11. SSP and Other MDP Classes
      [Venn diagram: SSP ⊇ IHDR ⊇ FH]
      • FH ⇒ SSP: turn all horizon-L states (s, L) into goals
      • IHDR ⇒ SSP: add transitions to a goal with probability 1 − γ
      • We will concentrate on SSPs in the rest of the tutorial

  12. IHDR → SSP
      [Figure: an example IHDR MDP with rewards of +1 and +2 on its recurrent transitions and -10 on another transition, before conversion.]

  13. IHDR → SSP
      1) Invert rewards to costs
      [Figure: the same MDP with rewards negated into costs: -1 and -2 on the recurrent transitions, +10 on the other.]

  14. IHDR → SSP
      1) Invert rewards to costs
      2) Add a new goal state G and edges from absorbing states
      3) ∀ s, a: add edges to the goal with P = 1 − δ
      4) Normalize (scale the original transition probabilities by δ)
      [Figure: the converted SSP. Original ½-probability transitions become ½δ, every (s, a) pair gains a cost-0 transition to G with probability 1 − δ, and the old absorbing state reaches G with probability 1.]
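
A hedged sketch of the conversion in slides 12 to 14: negate rewards into costs, scale every original transition by the discount δ, and route the remaining 1 − δ probability mass to a new cost-free goal. Function and variable names (ihdr_to_ssp, R keyed by (s, a, s')) are illustrative, not from the slides.

```python
def ihdr_to_ssp(states, T, R, delta, goal="G"):
    """Reduce an IHDR MDP (transitions T, rewards R, discount delta) to an SSP."""
    T2, C2 = {}, {}
    for (s, a), dist in T.items():
        # scale original transitions by delta, send the rest to the goal
        T2[(s, a)] = {s2: p * delta for s2, p in dist.items()}
        T2[(s, a)][goal] = T2[(s, a)].get(goal, 0.0) + (1.0 - delta)
        for s2 in dist:
            C2[(s, a, s2)] = -R[(s, a, s2)]   # invert rewards to costs
        C2[(s, a, goal)] = 0.0                # reaching the goal is free
    return states | {goal}, T2, C2
```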

  15. Computational Complexity of MDPs
      • Good news:
        – Solving IHDR and SSP MDPs in flat representation is P-complete
        – Solving FH MDPs in flat representation is P-hard
        – That is, they don’t benefit from parallelization, but they are solvable in polynomial time!

  16. Computational Complexity of MDPs
      • Bad news:
        – Solving FH, IHDR, and SSP MDPs in factored representation is EXPTIME-complete!
        – The factored representation doesn’t make MDPs harder to solve; it makes big ones easier to describe.

  17. Running Example
      [Figure: states s0, …, s4 and goal sg. From s0, actions a00 and a01 lead toward s2 and s1; a20, a21, a1, and a3 connect the intermediate states. From s4, a40 reaches sg deterministically with C = 5, while a41 (C = 2) reaches sg with Pr = 0.6 and s3 with Pr = 0.4. All costs are 1 unless otherwise marked.]

  18. Bellman Backup
      At s4, with V1(sg) = 0 and V1(s3) = 2:
        Q2(s4, a40) = 5 + 0 = 5
        Q2(s4, a41) = 2 + 0.6 × 0 + 0.4 × 2 = 2.8
      Taking the min over actions: V2(s4) = 2.8, with greedy action a_greedy = a41
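
A small check of this backup, reusing the earlier sketches. Here both outcomes of a41 are given cost 2, matching the single "C = 2" label in the figure; that reading is an assumption.

```python
mdp4 = SSPMDP(
    states={"s3", "s4", "sg"},
    actions={"s4": ["a40", "a41"], "s3": [], "sg": []},
    T={("s4", "a40"): {"sg": 1.0},
       ("s4", "a41"): {"sg": 0.6, "s3": 0.4}},
    C={("s4", "a40", "sg"): 5.0,
       ("s4", "a41", "sg"): 2.0, ("s4", "a41", "s3"): 2.0},
    goals={"sg"},
)
V1 = {"sg": 0.0, "s3": 2.0, "s4": 2.0}
print(bellman_backup(mdp4, V1, "s4"))   # -> (2.8, 'a41'), as on the slide
```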

  19. Value Iteration [Bellman 57]
      [Pseudocode: start from any V0 (no restriction on the initial value function); at each iteration n, perform a Bellman backup at every state; terminate when the ε-consistency condition holds. A runnable sketch follows.]
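
A minimal value-iteration sketch matching the slide: arbitrary initialization, repeated sweeps of Bellman backups, and an ε-consistency stopping rule. It reuses the bellman_backup sketch above; note it updates values in place (Gauss-Seidel style), which also converges for SSPs.

```python
def value_iteration(mdp, V0, eps=1e-4):
    V = dict(V0)                      # no restriction on the initial V
    while True:
        residual = 0.0
        for s in mdp.states:          # one full sweep = one iteration n
            new_v, _ = bellman_backup(mdp, V, s)
            residual = max(residual, abs(new_v - V[s]))
            V[s] = new_v
        if residual < eps:            # epsilon-consistency reached
            return V
```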

  20. Running Example
      [Figure: the same running-example MDP as slide 17.]
      Values Vn(s) after each iteration n:

        n    Vn(s0)    Vn(s1)    Vn(s2)    Vn(s3)    Vn(s4)
        0    3         3         2         2         1
        1    3         3         2         2         2.8
        2    3         3         3.8       3.8       2.8
        3    4         4.8       3.8       3.8       3.52
        4    4.8       4.8       4.52      4.52      3.52
        5    5.52      5.52      4.52      4.52      3.808
        20   5.99921   5.99921   4.99969   4.99969   3.99969

  21. Convergence & Optimality
      • For an SSP MDP, ∀ s ∈ S, lim_{n → ∞} Vn(s) = V*(s), irrespective of the initialization.

  22. VI → Asynchronous VI
      • Is backing up all states in every iteration essential? No!
      • States may be backed up as many times as desired, in any order
      • As long as no state gets starved, the convergence properties still hold!

  23. Residual wrt Value Function V (Res^V)
      • Residual at s with respect to V: the magnitude of the change in V(s) after one Bellman backup at s
        Res^V(s) = | V(s) − min_{a ∈ A} Σ_{s' ∈ S} T(s, a, s') [ C(s, a, s') + V(s') ] |
      • Residual with respect to V: the maximum residual over states
        Res^V = max_s Res^V(s)
      • ε-consistency: Res^V < ε
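
A direct transcription of these two definitions, reusing the bellman_backup sketch; the helper names are illustrative.

```python
def residual(mdp, V, s):
    """Res^V(s): change in V(s) after one Bellman backup at s."""
    new_v, _ = bellman_backup(mdp, V, s)
    return abs(V[s] - new_v)

def max_residual(mdp, V):
    """Res^V: the maximum residual over all states."""
    return max(residual(mdp, V, s) for s in mdp.states)
```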

  24. (General) Asynchronous VI
      [Pseudocode: repeat: pick any state s and perform a Bellman backup at s; stop when ε-consistency holds. A sketch follows.]
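
A hedged sketch of general asynchronous VI: back up states in an arbitrary order; as long as no state is starved, the same convergence guarantees hold. The random schedule here is one illustrative choice (it avoids starvation almost surely), and recomputing the max residual every step is done only for clarity.

```python
import random

def asynchronous_vi(mdp, V0, eps=1e-4, seed=0):
    rng = random.Random(seed)
    V = dict(V0)
    backup_order = [s for s in mdp.states if s not in mdp.goals]
    while max_residual(mdp, V) >= eps:
        s = rng.choice(backup_order)   # any non-starving schedule works
        V[s], _ = bellman_backup(mdp, V, s)
    return V
```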

  25. Heuristic Search Algorithms
      • Definitions
      • Find & Revise Scheme
      • LAO* and Extensions
      • RTDP and Extensions
      • Other Uses of Heuristics/Bounds
      • Heuristic Design

  26. Notation
      [Figure: the example state space used in the following slides: start state s0 with actions A1 and A2, intermediate states s1, …, s9, and goal Sg.]

  27. Heuristic Search
      • Insight 1: use knowledge of a start state to save computation
        (analogous to going from all-sources to single-source shortest path)
      • Insight 2: use additional knowledge in the form of a heuristic function
        (analogous to going from DFS/BFS to A*)

  28. Model
      • An SSP (as before) with an additional start state s0, denoted SSP_s0
      • What is the solution to an SSP_s0? A policy (S → A)?
        – Are states that are not reachable from s0 relevant?
        – What about states that are never visited (even though reachable)?

  29. Partial Policy
      • Define a partial policy:
        – π : S' → A, where S' ⊆ S
      • Define a partial policy closed w.r.t. a state s:
        – a partial policy π_s
        – defined for all states s' reachable by π_s starting from s
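
A hedged sketch of checking this closedness condition (used in the next two slides): every state reachable from s0 under the policy must have an action assigned. The function name and the policy-as-dict representation are illustrative.

```python
def is_closed(mdp, policy, s0):
    """True iff partial policy `policy` (state -> action) is closed w.r.t. s0."""
    stack, seen = [s0], {s0}
    while stack:
        s = stack.pop()
        if s in mdp.goals:
            continue                    # goals need no action
        if s not in policy:
            return False                # reachable but unassigned: not closed
        for s2 in mdp.T[(s, policy[s])]:
            if s2 not in seen:
                seen.add(s2)
                stack.append(s2)
    return True
```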

  30. Partial Policy Closed wrt s0
      [Figure: the notation state space; a1 is the left action, a2 the right.]
      π_s0(s0) = a1,  π_s0(s1) = a2,  π_s0(s2) = a1
      Is this policy closed wrt s0? (No: it can reach a state for which it is undefined.)

  31. Partial Policy Closed wrt s0
      [Figure: the notation state space; a1 is the left action, a2 the right.]
      π_s0(s0) = a1,  π_s0(s1) = a2,  π_s0(s2) = a1,  π_s0(s6) = a1
      Is this policy closed wrt s0? (Yes: adding π_s0(s6) = a1 covers every reachable state.)

  32. Policy Graph of π_s0
      [Figure: the policy graph induced by π_s0(s0) = a1, π_s0(s1) = a2, π_s0(s2) = a1, π_s0(s6) = a1: the states reachable from s0 under the policy and the transitions among them.]

  33. Greedy Policy Graph
      • Define the greedy policy: π^V(s) = argmin_a Q^V(s, a)
      • Define the greedy partial policy rooted at s0:
        – the partial policy rooted at s0 that acts greedily w.r.t. V
        – denoted π^V_s0
      • Define the greedy policy graph:
        – the policy graph of π^V_s0, denoted G^V_s0
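
A hedged sketch of building the greedy policy graph G^V_s0: start from s0, take the greedy action at each state, and collect the states and edges reached. It reuses bellman_backup for the greedy choice; the return format is an illustrative choice.

```python
def greedy_policy_graph(mdp, V, s0):
    """Return (greedy partial policy rooted at s0, list of (s, a, s') edges)."""
    policy, edges = {}, []
    stack, seen = [s0], {s0}
    while stack:
        s = stack.pop()
        if s in mdp.goals:
            continue
        _, a = bellman_backup(mdp, V, s)   # greedy action at s
        policy[s] = a
        for s2 in mdp.T[(s, a)]:
            edges.append((s, a, s2))
            if s2 not in seen:
                seen.add(s2)
                stack.append(s2)
    return policy, edges
```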

  34. Heuristic Function
      • h : S → R
        – estimates V*(s)
        – gives an indication of the “goodness” of a state
        – usually used for initialization: V0(s) = h(s)
        – helps us avoid seemingly bad states
      • Define an admissible heuristic:
        – optimistic (underestimates cost)
        – h(s) ≤ V*(s)

  35. Admissible Heuristics
      • Basic idea
        – Relax the probabilistic domain to a deterministic domain
        – Use heuristics from classical planning
      • All-outcome determinization (see the sketch below)
        – For each probabilistic outcome of each action, create a separate deterministic action
        [Figure: an action a from s with outcomes s1 and s2 becomes two deterministic actions, a1 : s → s1 and a2 : s → s2.]
      • Admissible heuristics
        – Cheapest-cost solution in the determinized domain
        – Classical heuristics over the determinized domain
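
A hedged sketch of all-outcome determinization: each probabilistic outcome of each action becomes its own deterministic action. The cheapest plan cost in the resulting domain underestimates V*, so heuristics computed on it are admissible. The output format is an illustrative choice.

```python
def all_outcome_determinization(mdp):
    """Map each (state, derived action) to a single (successor, cost) pair."""
    det = {}
    for (s, a), dist in mdp.T.items():
        for i, s2 in enumerate(dist):
            # one deterministic action per probabilistic outcome of a
            det[(s, f"{a}_{i}")] = (s2, mdp.C[(s, a, s2)])
    return det
```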

  36. Heuristic Search Algorithms
      • Definitions
      • Find & Revise Scheme
      • LAO* and Extensions
      • RTDP and Extensions
      • Other Uses of Heuristics/Bounds
      • Heuristic Design

  37. A General Scheme for Heuristic Search in MDPs
      • Two (over)simplified intuitions:
        – Focus on states in the greedy policy w.r.t. V rooted at s0
        – Focus on states with residual > ε
      • Find & Revise:
        – repeat:
          • find a state that satisfies both properties above
          • perform a Bellman backup there
        – until no such state remains

  38. FIND & REVISE [Bonet & Geffner 03a]
      (FIND a state as above, REVISE it with a Bellman backup)
      • Convergence to V* is guaranteed
        – if the heuristic function is admissible
        – and no state gets starved over infinitely many FIND steps
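
A hedged sketch of the Find & Revise scheme from the last two slides, reusing the earlier sketches: repeatedly FIND a state in the greedy graph rooted at s0 whose residual exceeds ε, and REVISE it with a Bellman backup. `h` is assumed to be an admissible heuristic given as a callable; picking the first flawed state is just one non-starving FIND strategy.

```python
def find_and_revise(mdp, h, s0, eps=1e-4):
    # initialize V with the (admissible) heuristic; goals are worth 0
    V = {s: (0.0 if s in mdp.goals else h(s)) for s in mdp.states}
    while True:
        policy, _ = greedy_policy_graph(mdp, V, s0)
        # FIND: greedy-graph states with residual > eps
        flawed = [s for s in policy if residual(mdp, V, s) > eps]
        if not flawed:
            return V, policy           # epsilon-consistent on the greedy graph
        # REVISE: Bellman backup at one such state
        s = flawed[0]
        V[s], _ = bellman_backup(mdp, V, s)
```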
