 
CSE 573
Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming
Slides adapted from Andrey Kolobov and Mausam
Stochastic Shortest-Path MDPs: Motivation
• Assume the agent pays a cost to achieve a goal
• Example applications:
  – Controlling a Mars rover: “How to collect scientific data without damaging the rover?”
  – Navigation: “What’s the fastest way to get to a destination, taking into account the traffic jams?”
Stochastic Shortest-Path MDPs: Definition [Bertsekas, 1995]
An SSP MDP is a tuple <S, A, T, C, G>, where:
• S is a finite state space
• (D is an infinite sequence (1, 2, …))
• A is a finite action set
• T : S × A × S → [0, 1] is a stationary transition function
• C : S × A × S → R is a stationary cost function (low cost is good!)
• G is a set of absorbing cost-free goal states
Under two conditions:
• There is a proper policy (reaches a goal with P = 1 from all states)
• Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P = 1
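As a rough illustration of the definition, the tuple can be held in a small container. This is a minimal sketch, not part of the original slides: the class name `SSP`, its field names, the `check` method, and the toy instance are all hypothetical. The check covers only the structural requirements (probabilities sum to 1, goals are absorbing and cost-free), not the two conditions on proper and improper policies.

```python
from dataclasses import dataclass


@dataclass
class SSP:
    """Hypothetical container for the tuple <S, A, T, C, G>."""
    states: set   # S
    actions: set  # A
    T: dict       # T[(s, a, s_next)] = probability (missing key = 0)
    C: dict       # C[(s, a, s_next)] = cost (missing key = 0)
    goals: set    # G, a subset of S

    def check(self, tol=1e-9):
        """Check structural requirements only; the two conditions on
        proper and improper policies are not verified here."""
        assert self.goals <= self.states
        for s in self.states:
            for a in self.actions:
                total = sum(self.T.get((s, a, s2), 0.0) for s2 in self.states)
                assert abs(total - 1.0) < tol, f"T({s},{a},.) sums to {total}"
                if s in self.goals:
                    # Goals are absorbing and cost-free.
                    assert abs(self.T.get((s, a, s), 0.0) - 1.0) < tol
                    assert self.C.get((s, a, s), 0.0) == 0.0
        return True


# Toy instance (illustrative): from "s", action "a" reaches the goal "g"
# with probability 0.5 and otherwise stays in "s"; each step costs 1.
toy = SSP(
    states={"s", "g"},
    actions={"a"},
    T={("s", "a", "g"): 0.5, ("s", "a", "s"): 0.5, ("g", "a", "g"): 1.0},
    C={("s", "a", "g"): 1.0, ("s", "a", "s"): 1.0, ("g", "a", "g"): 0.0},
    goals={"g"},
)
print(toy.check())
```

In the toy instance the only policy is proper, so both side conditions hold trivially.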
SSP MDP Details
• In SSP, maximizing expected linear additive utility (ELAU) = minimizing expected cost
• Every cost-minimizing policy is proper!
• Thus, an optimal policy = cheapest way to a goal
• Why are SSP MDPs called “indefinite-horizon”?
  – If a policy is optimal, it will take a finite, but a priori unknown, time to reach the goal
SSP MDP Example, not!
[Figure: states S1, S2, S3, SG with actions a1, a2]
• Costs: C(s1, a1, s2) = 1; C(s1, a2, s1) = 7.2; C(s2, a1, s1) = -1; C(s2, a1, s3) = 5; C(s2, a2, s2) = -3; C(s2, a2, sG) = 1; C(s3, a1, s3) = 2.4; C(s3, a2, s3) = 0.8; C(sG, a1, sG) = 0; C(sG, a2, sG) = 0
• Transitions: T(s2, a1, s1) = 0.4; T(s2, a1, s3) = 0.6; T(s2, a2, sG) = 0.3; T(s2, a2, sG) = 0.7
• No dead ends allowed!
SSP MDP Example, also not!
[Figure: states S1, S2, SG with actions a1, a2]
• Costs: C(s1, a1, s2) = 1; C(s1, a2, s1) = 7.2; C(s2, a1, s1) = -1; C(s2, a2, s2) = -3; C(s2, a2, sG) = 1; C(sG, a1, sG) = 0; C(sG, a2, sG) = 0
• Transitions: T(s2, a2, sG) = 0.3; T(s2, a2, sG) = 0.7
• No cost-free “loops” allowed!
SSP MDP Example
[Figure: states S1, S2, SG with actions a1, a2]
• Costs: C(s1, a1, s2) = 1; C(s1, a2, s1) = 7.2; C(s2, a1, s1) = 0; C(s2, a2, s2) = 1; C(s2, a2, sG) = 1; C(sG, a1, sG) = 0; C(sG, a2, sG) = 0
• Transitions: T(s2, a2, sG) = 0.3; T(s2, a2, sG) = 0.7
SSP MDPs: Optimality Principle
For an SSP MDP, let:
– V^π(h) = E_h[C_1 + C_2 + …] for all h (expected linear additive utility)
  For every history, the value of a policy is well-defined! Every policy either takes a finite expected number of steps to reach a goal, or has an infinite cost.
Then:
– V* exists and is stationary Markovian; π* exists and is stationary deterministic Markovian
– For all s:
  V*(s) = min_{a ∈ A} Σ_{s' ∈ S} T(s, a, s') [C(s, a, s') + V*(s')]
  π*(s) = argmin_{a ∈ A} Σ_{s' ∈ S} T(s, a, s') [C(s, a, s') + V*(s')]
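The two equations for V* and π* map almost directly onto code. The sketch below is illustrative only: it uses the costs of the valid three-state SSP example, but the transition probabilities not printed on that slide (the 1.0 entries, and 0.7 for the s2 self-loop under a2) are assumptions, and the names `mdp`, `q_value`, and `backup` are hypothetical. It iterates the right-hand side of the equation for V* until it stops changing, then reads off the argmin.

```python
# mdp[s][a] is a list of (probability, next_state, cost) outcomes.
# Costs follow the valid SSP example; probabilities marked "assumed" are
# not given in the source.
mdp = {
    "s1": {"a1": [(1.0, "s2", 1.0)],      # assumed deterministic
           "a2": [(1.0, "s1", 7.2)]},     # assumed deterministic
    "s2": {"a1": [(1.0, "s1", 0.0)],      # assumed deterministic
           "a2": [(0.3, "sG", 1.0), (0.7, "s2", 1.0)]},  # 0.7 self-loop assumed
}


def q_value(V, s, a):
    """Sum over s' of T(s, a, s') * [C(s, a, s') + V(s')]."""
    return sum(p * (c + V[s2]) for p, s2, c in mdp[s][a])


def backup(V, s):
    """Right-hand side of the equation for V*: (min value, argmin action)."""
    return min((q_value(V, s, a), a) for a in mdp[s])


# Plain fixed-point iteration; the goal sG keeps value 0 throughout.
V = {"s1": 0.0, "s2": 0.0, "sG": 0.0}
for _ in range(500):
    V = {**V, **{s: backup(V, s)[0] for s in mdp}}

policy = {s: backup(V, s)[1] for s in mdp}
print(V, policy)
```

Under these assumptions the fixed point is V(s2) = 10/3 and V(s1) = 13/3, with the greedy policy choosing a1 in s1 and a2 in s2.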
Fundamentals of MDPs
✓ General MDP Definition
✓ Expected Linear Additive Utility
✓ The Optimality Principle
✓ Finite-Horizon MDPs
✓ Infinite-Horizon Discounted-Reward MDPs
✓ Stochastic Shortest-Path MDPs
• A Hierarchy of MDP Classes
• Factored MDPs
• Computational Complexity
SSP and Other MDP Classes
[Figure: diagram relating the classes FH, IHDR, and SSP; callout: “E.g., indefinite-horizon discounted reward”]
SSP and Other MDP Classes
[Figure: diagram relating the classes SSP, IHDR, and FH]
• FH => SSP: turn all states (s, L) into goals
• IHDR => SSP: add (1 − γ)-probability transitions to goal
• Will concentrate on SSP in the rest of the tutorial
IHDR → SSP
[Figure: example discounted-reward MDP with edge rewards -10, +1, +2, +1, +2]
IHDR → SSP
1) Invert rewards to costs
[Figure: the same MDP with edge costs -1, +10, -1, -2, -1, -2]
IHDR → SSP
1) Invert rewards to costs
2) Add new goal state & edges from absorbing states
3) ∀ s, a: add edges to goal with P = 1 − δ
4) Normalize
[Figure: the converted MDP with goal state G; edge labels include probabilities ½δ, δ, 1 − δ, 1.0 and costs +10, -1, -2, 0]
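The four steps can be sketched in code. This is one plausible reading of the construction, not a definitive one: `gamma` stands in for the slide's δ (assumed here to be the discount factor), edges into the new goal are assumed to cost 0, and the function names, the `"to_goal"` action, and the toy MDP are all hypothetical. Under these assumptions the SSP values equal −γ times the discounted-reward values, so the optimal policy is unchanged.

```python
def ihdr_to_ssp(mdp, gamma, goal="G"):
    """mdp[s][a] = list of (prob, next_state, reward); a state with no
    actions is treated as absorbing. Returns ssp[s][a] = list of
    (prob, next_state, cost)."""
    ssp = {}
    for s, actions in mdp.items():
        if not actions:
            # Step 2: absorbing states get a free edge to the new goal.
            ssp[s] = {"to_goal": [(1.0, goal, 0.0)]}
            continue
        ssp[s] = {}
        for a, outcomes in actions.items():
            # Steps 1 and 4: negate rewards, scale probabilities by gamma.
            edges = [(gamma * p, s2, -r) for p, s2, r in outcomes]
            # Step 3: jump to the goal with probability 1 - gamma.
            edges.append((1.0 - gamma, goal, 0.0))
            ssp[s][a] = edges
    return ssp


def solve(ssp, goal="G", sweeps=2000):
    """Undiscounted value iteration on the SSP (in-place sweeps)."""
    V = {s: 0.0 for s in ssp}
    V[goal] = 0.0
    for _ in range(sweeps):
        for s in ssp:
            V[s] = min(sum(p * (c + V[s2]) for p, s2, c in outs)
                       for outs in ssp[s].values())
    return V


# Toy discounted-reward MDP (illustrative): "stay" earns reward 1 forever,
# "leave" earns reward 2 once and moves to the absorbing state "end".
toy = {
    "s": {"stay": [(1.0, "s", 1.0)], "leave": [(1.0, "end", 2.0)]},
    "end": {},
}
gamma = 0.9
V = solve(ihdr_to_ssp(toy, gamma))
print(V)
```

In the toy problem the discounted optimum is 1 / (1 − 0.9) = 10 for "s" (always stay), and the converted SSP yields −0.9 × 10 = −9.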
Computational Complexity of MDPs
• Good news:
  – Solving IHDR, SSP in flat representation is P-complete
  – Solving FH in flat representation is P-hard
  – That is, they don’t benefit from parallelization, but are solvable in polynomial time!
Computational Complexity of MDPs
• Bad news:
  – Solving FH, IHDR, SSP in factored representation is EXPTIME-complete!
  – Factored representation doesn’t make MDPs harder to solve, it makes big ones easier to describe.
Running Example
[Figure: MDP with states s0, s1, s2, s3, s4 and goal sg; actions a00, a01, a1, a20, a21, a3, a40, a41. Edge labels: a40 has C=5; a41 has C=2, with Pr=0.6 and Pr=0.4 on its two outcomes]
All costs 1 unless otherwise marked
Bellman Backup
[Figure: s4 with actions a40 (C=5, to sg) and a41 (C=2; Pr=0.6 to sg, Pr=0.4 to s3); V_1(sg) = 0, V_1(s3) = 2]
Q_2(s4, a40) = 5 + 0 = 5
Q_2(s4, a41) = 2 + 0.6 × 0 + 0.4 × 2 = 2.8
V_2(s4) = min{5, 2.8} = 2.8
a_greedy = a41
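The same backup can be reproduced in a few lines. This is a minimal sketch; the dictionary layout and variable names are illustrative, while the numbers are the ones on the slide.

```python
# Outcomes of the two actions available in s4: (probability, next_state, cost).
s4_actions = {
    "a40": [(1.0, "sg", 5.0)],
    "a41": [(0.6, "sg", 2.0), (0.4, "s3", 2.0)],
}
V1 = {"sg": 0.0, "s3": 2.0}  # values from the previous iteration

# Q_2(s4, a) = sum over outcomes of Pr * (C + V_1(next state))
Q2 = {a: sum(p * (c + V1[s2]) for p, s2, c in outs)
      for a, outs in s4_actions.items()}

greedy = min(Q2, key=Q2.get)  # action achieving the minimum
V2_s4 = Q2[greedy]            # backed-up value of s4
print(Q2, greedy, V2_s4)
```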
Value Iteration [Bellman 57]
[Algorithm figure; annotations: “No restriction on initial value function”, “iteration n”, “ε-consistency”, “termination condition”]
Running Example
[Figure: the running-example MDP, with states s0 to s4, goal sg, and actions a00, a01, a1, a20, a21, a3, a40 (C=5), a41 (C=2; Pr=0.6 / Pr=0.4)]

n     V_n(s0)   V_n(s1)   V_n(s2)   V_n(s3)   V_n(s4)
0     3         3         2         2         1
1     3         3         2         2         2.8
2     3         3         3.8       3.8       2.8
3     4         4.8       3.8       3.8       3.52
4     4.8       4.8       4.52      4.52      3.52
5     5.52      5.52      4.52      4.52      3.808
20    5.99921   5.99921   4.99969   4.99969   3.99969
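The table can be regenerated with a short synchronous value-iteration loop. This is a sketch under stated assumptions: the edge targets below (a00 to s2, a01 to s1, a1 to s3, a21 to s4, a3 to s4) are inferred from the value table rather than read from the figure, and the target of a20 (back to s0) is a guess that does not affect any value shown. The names `MDP` and `value_iteration` are illustrative.

```python
# MDP[s][a] = list of (probability, next_state, cost); topology is assumed.
MDP = {
    "s0": {"a00": [(1.0, "s2", 1.0)], "a01": [(1.0, "s1", 1.0)]},
    "s1": {"a1": [(1.0, "s3", 1.0)]},
    "s2": {"a20": [(1.0, "s0", 1.0)],   # target of a20 is a guess
           "a21": [(1.0, "s4", 1.0)]},
    "s3": {"a3": [(1.0, "s4", 1.0)]},
    "s4": {"a40": [(1.0, "sg", 5.0)],
           "a41": [(0.6, "sg", 2.0), (0.4, "s3", 2.0)]},
}


def value_iteration(V0, n_iters):
    """Synchronous VI: every backup in iteration n reads V_{n-1}."""
    history = [dict(V0, sg=0.0)]
    for _ in range(n_iters):
        prev = history[-1]
        new = {s: min(sum(p * (c + prev[s2]) for p, s2, c in outs)
                      for outs in MDP[s].values())
               for s in MDP}
        new["sg"] = 0.0  # the goal is absorbing and cost-free
        history.append(new)
    return history


V0 = {"s0": 3.0, "s1": 3.0, "s2": 2.0, "s3": 2.0, "s4": 1.0}  # row n = 0
history = value_iteration(V0, 20)
for n in (0, 1, 2, 3, 4, 5, 20):
    row = [round(history[n][s], 5) for s in ("s0", "s1", "s2", "s3", "s4")]
    print(n, row)
```

The values approach V* = (6, 6, 5, 5, 4), which solves V(s4) = 2 + 0.4 (1 + V(s4)).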
Convergence & Optimality
• For an SSP MDP, ∀ s ∈ S, lim_{n → ∞} V_n(s) = V*(s), irrespective of the initialization.
VI → Asynchronous VI
• Is backing up all states in an iteration essential?
  – No!
• States may be backed up
  – as many times
  – in any order
• If no state gets starved
  – convergence properties still hold!!
Residual wrt Value Function V (Res^V)
• Residual at s with respect to V
  – magnitude(ΔV(s)) after one Bellman backup at s
  – Res^V(s) = | V_i(s) − min_{a ∈ A} Σ_{s' ∈ S} T(s, a, s') [C(s, a, s') + V_i(s')] |
• Residual with respect to V
  – max residual: Res^V = max_s Res^V(s)
  – Res^V < ε (ε-consistency)
(General) Asynchronous VI
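A minimal sketch of asynchronous VI with the ε-consistency stopping rule, run on the running example. The assumptions are the same as elsewhere: the edge targets in `MDP` are inferred rather than read from the figure, picking states uniformly at random is just one way to ensure that no state is starved, and all names are illustrative.

```python
import random

# MDP[s][a] = list of (probability, next_state, cost); topology is assumed.
MDP = {
    "s0": {"a00": [(1.0, "s2", 1.0)], "a01": [(1.0, "s1", 1.0)]},
    "s1": {"a1": [(1.0, "s3", 1.0)]},
    "s2": {"a20": [(1.0, "s0", 1.0)], "a21": [(1.0, "s4", 1.0)]},
    "s3": {"a3": [(1.0, "s4", 1.0)]},
    "s4": {"a40": [(1.0, "sg", 5.0)],
           "a41": [(0.6, "sg", 2.0), (0.4, "s3", 2.0)]},
}


def backup(V, s):
    """One Bellman backup at s (returns the new value, does not store it)."""
    return min(sum(p * (c + V[s2]) for p, s2, c in outs)
               for outs in MDP[s].values())


def residual(V, s):
    """Res^V(s): how much a backup at s would change V(s)."""
    return abs(V[s] - backup(V, s))


def async_vi(V, eps=1e-6, seed=0):
    rng = random.Random(seed)
    states = list(MDP)
    # Stop once V is eps-consistent: max_s Res^V(s) < eps.
    while max(residual(V, s) for s in states) >= eps:
        s = rng.choice(states)  # any order works if no state is starved
        V[s] = backup(V, s)     # in-place update, visible to later backups
    return V


V = async_vi({s: 0.0 for s in list(MDP) + ["sg"]})
print({s: round(v, 4) for s, v in V.items()})
```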
Heuristic Search Algorithms
• Definitions
• Find & Revise Scheme
• LAO* and Extensions
• RTDP and Extensions
• Other uses of Heuristics/Bounds
• Heuristic Design
Notation
[Figure: example state graph with start state s0, states s1 to s9, goal Sg, and actions A1, A2]
Heuristic Search
• Insight 1
  – knowledge of a start state to save on computation
  ~ (all sources shortest path → single source shortest path)
• Insight 2
  – additional knowledge in the form of heuristic function
  ~ (dfs/bfs → A*)
Model
• SSP (as before) with an additional start state s0
  – denoted by SSP_s0
• What is the solution to an SSP_s0?
• Policy (S → A)?
  – are states that are not reachable from s0 relevant?
  – what about states that are never visited (even though reachable)?
Partial Policy
• Define partial policy
  – π : S’ → A, where S’ ⊆ S
• Define partial policy closed w.r.t. a state s
  – is a partial policy π_s
  – defined for all states s’ reachable by π_s starting from s
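Closedness can be checked by following the policy from the start state and confirming that every state reached has an action assigned. The sketch below is illustrative only: the successor graph is made up (it is not the graph from the slides), goal states are assumed to need no action, and the function names are hypothetical.

```python
def reachable_under(policy, successors, start):
    """States reachable from start when following the partial policy."""
    seen, stack = set(), [start]
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        if s in policy:  # expand only where the policy is defined
            stack.extend(successors[(s, policy[s])])
    return seen


def is_closed(policy, successors, start, goals):
    """Closed w.r.t. start: every reachable non-goal state has an action."""
    return all(s in policy or s in goals
               for s in reachable_under(policy, successors, start))


# Made-up successor sets: successors[(state, action)] = possible next states.
successors = {
    ("s0", "a1"): ["s1", "s2"],
    ("s1", "a2"): ["s2"],
    ("s2", "a1"): ["s6"],
    ("s6", "a1"): ["sg"],
}
goals = {"sg"}

open_policy = {"s0": "a1", "s1": "a2", "s2": "a1"}  # reaches s6, undefined there
closed_policy = {**open_policy, "s6": "a1"}         # now defined at s6 as well

print(is_closed(open_policy, successors, "s0", goals))    # False
print(is_closed(closed_policy, successors, "s0", goals))  # True
```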
Partial policy closed wrt s0
[Figure: state graph with s0 at the root, states s1 to s9, and goal Sg; a1 is the left action, a2 is the right action]
π_s0(s0) = a1
π_s0(s1) = a2
π_s0(s2) = a1
Is this policy closed wrt s0?
Partial policy closed wrt s0
[Figure: state graph with s0 at the root, states s1 to s9, and goal Sg; a1 is the left action, a2 is the right action]
π_s0(s0) = a1
π_s0(s1) = a2
π_s0(s2) = a1
π_s0(s6) = a1
Is this policy closed wrt s0?
Policy Graph of π_s0
[Figure: state graph with s0 at the root, states s1 to s9, and goal Sg; a1 is the left action, a2 is the right action]
π_s0(s0) = a1
π_s0(s1) = a2
π_s0(s2) = a1
π_s0(s6) = a1
Greedy Policy Graph
• Define greedy policy: π^V(s) = argmin_a Q^V(s, a)
• Define greedy partial policy rooted at s0
  – Partial policy rooted at s0
  – Greedy policy
  – denoted by π^V_s0
• Define greedy policy graph
  – Policy graph of π^V_s0: denoted by G^V_s0
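A sketch of how the greedy partial policy rooted at s0 and its policy graph might be built for a given V. It uses the running example with the n = 0 row of the value-iteration table as V; the edge targets in `MDP` are assumptions (inferred, not read from the figure), and the function name is illustrative.

```python
# MDP[s][a] = list of (probability, next_state, cost); topology is assumed.
MDP = {
    "s0": {"a00": [(1.0, "s2", 1.0)], "a01": [(1.0, "s1", 1.0)]},
    "s1": {"a1": [(1.0, "s3", 1.0)]},
    "s2": {"a20": [(1.0, "s0", 1.0)], "a21": [(1.0, "s4", 1.0)]},
    "s3": {"a3": [(1.0, "s4", 1.0)]},
    "s4": {"a40": [(1.0, "sg", 5.0)],
           "a41": [(0.6, "sg", 2.0), (0.4, "s3", 2.0)]},
}
GOALS = {"sg"}


def greedy_policy_graph(V, s0):
    """Return (policy, edges): the greedy partial policy rooted at s0 and
    the edges (s, a, s') of its policy graph."""
    policy, edges, stack = {}, set(), [s0]
    while stack:
        s = stack.pop()
        if s in GOALS or s in policy:
            continue
        # Q^V(s, a) for every action, then the greedy (argmin) choice.
        q = {a: sum(p * (c + V[s2]) for p, s2, c in outs)
             for a, outs in MDP[s].items()}
        a = min(q, key=q.get)
        policy[s] = a
        for _, s2, _ in MDP[s][a]:
            edges.add((s, a, s2))
            stack.append(s2)
    return policy, edges


V = {"s0": 3.0, "s1": 3.0, "s2": 2.0, "s3": 2.0, "s4": 1.0, "sg": 0.0}
policy, edges = greedy_policy_graph(V, "s0")
print(policy)
print(sorted(edges))
```

Under these assumptions s1 never enters the graph: the greedy action in s0 leads to s2, so s1 is reachable in the MDP but not under the greedy policy.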
Heuristic Function
• h(s): S → R
  – estimates V*(s)
  – gives an indication about “goodness” of a state
  – usually used in initialization V_0(s) = h(s)
  – helps us avoid seemingly bad states
• Define admissible heuristic
  – Optimistic (underestimates cost)
  – h(s) ≤ V*(s)
Admissible Heuristics
• Basic idea
  – Relax probabilistic domain to deterministic domain
  – Use heuristics (classical planning)
• All-outcome determinization
  – For each outcome create a different action
  [Figure: action a from s with outcomes s1 and s2 becomes action a1 (s to s1) and action a2 (s to s2)]
• Admissible heuristics
  – Cheapest cost solution for determinized domain
  – Classical heuristics over determinized domain
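As an illustration of the “cheapest cost solution for determinized domain” heuristic, the sketch below treats every probabilistic outcome as its own deterministic edge and runs Dijkstra's algorithm backwards from the goal. It assumes non-negative costs (required by Dijkstra) and uses the running example with assumed edge targets; the function name is hypothetical.

```python
import heapq

# MDP[s][a] = list of (probability, next_state, cost); topology is assumed.
MDP = {
    "s0": {"a00": [(1.0, "s2", 1.0)], "a01": [(1.0, "s1", 1.0)]},
    "s1": {"a1": [(1.0, "s3", 1.0)]},
    "s2": {"a20": [(1.0, "s0", 1.0)], "a21": [(1.0, "s4", 1.0)]},
    "s3": {"a3": [(1.0, "s4", 1.0)]},
    "s4": {"a40": [(1.0, "sg", 5.0)],
           "a41": [(0.6, "sg", 2.0), (0.4, "s3", 2.0)]},
}


def determinization_heuristic(mdp, goals):
    """h(s) = cheapest deterministic path cost from s to a goal in the
    all-outcome determinization (assumes non-negative costs)."""
    # Reverse graph: each outcome with positive probability is one edge.
    reverse = {}
    for s, actions in mdp.items():
        for outs in actions.values():
            for p, s2, c in outs:
                if p > 0:
                    reverse.setdefault(s2, []).append((s, c))
    h = {g: 0.0 for g in goals}
    frontier = [(0.0, g) for g in goals]
    heapq.heapify(frontier)
    while frontier:
        d, s = heapq.heappop(frontier)
        if d > h.get(s, float("inf")):
            continue  # stale queue entry
        for pred, c in reverse.get(s, []):
            nd = d + c
            if nd < h.get(pred, float("inf")):
                h[pred] = nd
                heapq.heappush(frontier, (nd, pred))
    return h


h = determinization_heuristic(MDP, {"sg"})
print(h)
```

For this example the result is h = (4, 4, 3, 3, 2) for s0 to s4, which stays below V* = (6, 6, 5, 5, 4) as admissibility requires.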
Heuristic Search Algorithms
• Definitions
• Find & Revise Scheme
• LAO* and Extensions
• RTDP and Extensions
• Other uses of Heuristics/Bounds
• Heuristic Design
A General Scheme for Heuristic Search in MDPs
• Two (over)simplified intuitions
  – Focus on states in the greedy policy wrt V rooted at s0
  – Focus on states with residual > ε
• Find & Revise:
  – repeat
    • find a state that satisfies the two properties above
    • perform a Bellman backup
  – until no such state remains
FIND & REVISE [Bonet & Geffner 03a]
[Algorithm figure; annotation: “(perform Bellman backups)”]
• Convergence to V* is guaranteed
  – if the heuristic function is admissible
  – ~no state gets starved in ∞ FIND steps
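A minimal sketch of the Find & Revise loop on the running example. It is one possible instantiation, not the algorithm as given by Bonet and Geffner: FIND is implemented as a depth-first search of the greedy graph rooted at s0, the edge targets in `MDP` are assumptions, and the initial heuristic `h` is hard-coded to the cheapest-path costs of the all-outcome determinization of that assumed MDP (an admissible choice).

```python
# MDP[s][a] = list of (probability, next_state, cost); topology is assumed.
MDP = {
    "s0": {"a00": [(1.0, "s2", 1.0)], "a01": [(1.0, "s1", 1.0)]},
    "s1": {"a1": [(1.0, "s3", 1.0)]},
    "s2": {"a20": [(1.0, "s0", 1.0)], "a21": [(1.0, "s4", 1.0)]},
    "s3": {"a3": [(1.0, "s4", 1.0)]},
    "s4": {"a40": [(1.0, "sg", 5.0)],
           "a41": [(0.6, "sg", 2.0), (0.4, "s3", 2.0)]},
}
GOALS = {"sg"}


def find_and_revise(s0, h, eps=1e-6):
    V = dict(h)  # initialize V_0 = h

    def q(s, a):
        return sum(p * (c + V[s2]) for p, s2, c in MDP[s][a])

    def backup(s):
        return min(q(s, a) for a in MDP[s])

    while True:
        # FIND: look through the greedy graph rooted at s0 for a state
        # whose residual exceeds eps.
        target, seen, stack = None, set(), [s0]
        while stack:
            s = stack.pop()
            if s in GOALS or s in seen:
                continue
            seen.add(s)
            if abs(V[s] - backup(s)) > eps:
                target = s
                break
            a = min(MDP[s], key=lambda act: q(s, act))  # greedy action
            stack.extend(s2 for _, s2, _ in MDP[s][a])
        if target is None:
            return V  # no such state remains
        V[target] = backup(target)  # REVISE: one Bellman backup


# Admissible initial heuristic (determinization path costs, assumed MDP).
h = {"s0": 4.0, "s1": 4.0, "s2": 3.0, "s3": 3.0, "s4": 2.0, "sg": 0.0}
V = find_and_revise("s0", h)
print({s: round(v, 4) for s, v in V.items()})
```

Only states that appear in the greedy graph are ever backed up, which is where the savings over plain value iteration come from on larger problems.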