CSE 573 Markov Decision Processes: Heuristic Search & Real-Time - PowerPoint PPT Presentation

CSE 573 Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming Slides adapted from Andrey Kolobov and Mausam 1

Stochastic Shortest-Path MDPs: Motivation • Assume the agent pays cost to achieve a goal • Example applications: – Controlling a Mars rover “How to collect scientific data without damaging the rover?” – Navigation “What’s the fastest way to get to a destination, taking into account the traffic jams?” 8

Stochastic Shortest-Path MDPs: Definition Bertsekas, 1995 SSP MDP is a tuple < S, A, T, C, G >, where: • S is a finite state space • ( D is an infinite sequence (1,2, …) ) • A is a finite action set • T : S x A x S à [0, 1] is a stationary transition function • C : S x A x S à R is a stationary cost function (low cost is good!) • G is a set of absorbing cost-free goal states Under two conditions: • There is a proper policy (reaches a goal with P= 1 from all states) • Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P=1 9

SSP MDP Details • In SSP, maximizing ELAU = minimizing exp. cost • Every cost-minimizing policy is proper! • Thus, an optimal policy = cheapest way to a goal • Why are SSP MDPs called “indefinite-horizon”? – If a policy is optimal, it will take a finite, but apriori unknown, time to reach goal 10

SSP MDP Example , not! C(s G , a 2 , s G ) = 0 C(s 2 , a 2 , s G ) = 1 C(s 1 , a 1 , s 2 ) = 1 a 2 T(s 2 , a 2 , s G ) = 0.3 C(s 1 , a 2 , s 1 ) = 7.2 a 1 a 2 a 2 S 1 S 2 S G C(s 2 , a 2 , s 2 ) = -3 a 1 C(s 2 , a 1 , s 1 ) = -1 T(s 2 , a 2 , s G ) = 0.7 a 1 T(s 2 , a 1 , s 1 ) = 0.4 C(s G , a 1 , s G ) = 0 No dead ends C(s 2 , a 1 , s 3 ) = 5 T(s 2 , a 1 , s 3 ) = 0.6 allowed! a 1 a 2 S 3 C(s 3 , a 1 , s 3 ) = 2.4 C(s 3 , a 2 , s 3 ) = 0.8 11

SSP MDP Example , also not! C(s G , a 2 , s G ) = 0 C(s 2 , a 2 , s G ) = 1 C(s 1 , a 1 , s 2 ) = 1 a 2 T(s 2 , a 2 , s G ) = 0.3 C(s 1 , a 2 , s 1 ) = 7.2 a 1 a 2 a 2 S 1 S 2 S G a 1 C(s 2 , a 2 , s 2 ) = -3 C(s 2 , a 1 , s 1 ) = -1 T(s 2 , a 2 , s G ) = 0.7 a 1 C(s G , a 1 , s G ) = 0 No cost-free “loops” allowed! 12

SSP MDP Example C(s G , a 2 , s G ) = 0 C(s 2 , a 2 , s G ) = 1 C(s 1 , a 1 , s 2 ) = 1 T(s 2 , a 2 , s G ) = 0.3 C(s 1 , a 2 , s 1 ) = 7.2 a 1 a 2 a 2 S 1 S 2 S G a 1 C(s 2 , a 2 , s 2 ) = 1 C(s 2 , a 1 , s 1 ) = 0 T(s 2 , a 2 , s G ) = 0.7 C(s G , a 1 , s G ) = 0 13

SSP MDPs: Optimality Principle For an SSP MDP, let: For every history, Exp. Lin. Add. Utility the value of a policy π – V π (h) = E h [ C 1 + C 2 + … ] for all h is well-defined! Every policy either takes a Then: finite exp. # of steps to reach a goal, or has an infinite cost. – V* exists and is stationary Markovian, π* exists and is stationary deterministic Markovian – For all s : V*(s) = min a in A [ ∑ s’ in S T(s, a, s’) [ C(s, a, s’) + V*(s’) ] ] π*(s) = argmin a in A [ ∑ s’ in S T(s, a, s’) [ C(s, a, s’) + V*(s’) ] ] 14

Fundamentals of MDPs ü General MDP Definition ü Expected Linear Additive Utility ü The Optimality Principle ü Finite-Horizon MDPs ü Infinite-Horizon Discounted-Reward MDPs ü Stochastic Shortest-Path MDPs • A Hierarchy of MDP Classes • Factored MDPs • Computational Complexity 15

SSP and Other MDP Classes E.g., Indefinite-horizon discounted reward IHDR FH IHDR SSP 16

SSP and Other MDP Classes SSP IHDR FH • FH => SSP : turn all states (s, L) into goals • IHDR => SSP : add γ-probability transitions to goal • Will concentrate on SSP in the rest of the tutorial 17

IHDR à SSP -10 +1 +2 +1 +2 18

IHDR à SSP 1) Invert rewards to costs -1 +10 -1 -2 -1 -2 19

IHDR à SSP 1) Invert rewards to costs 2) Add new goal state & edges from absorbing states 3) ∀ s,a, add edges to goal with P = 1- 𝛿 4) Normalize ½ 𝛿 𝛿 +10 -1 ½ 𝛿 -2 𝛿 1- 𝛿 ½ 𝛿 -1 1- 𝛿 0 0 -2 1.0 0 1- 𝛿 G 0 20

Computational Complexity of MDPs • Good news : – Solving IHDR, SSP in flat representation is P- complete – Solving FH in flat representation is P- hard – That is, they don’t benefit from parallelization, but are solvable in polynomial time! 22

Computational Complexity of MDPs • Bad news : – Solving FH, IHDR, SSP in factored representation is EXPTIME - complete! – Flat representation doesn’t make MDPs harder to solve, it makes big ones easier to describe. 23

Running Example a 20 a 40 s 2 s 4 a 00 C=5 a 41 Pr=0.6 a 21 a 1 s 0 s g C=2 a 3 Pr=0.4 a 01 s 1 s 3 All costs 1 unless otherwise marked 26

Bellman Backup a 40 Q 2 (s 4 ,a 40 ) = 5 + 0 s 4 Q 2 (s 4 ,a 41 ) = 2+ 0.6 × 0 C=5 a 41 + 0.4 × 2 Pr=0.6 s g C=2 a 3 = 2.8 Pr=0.4 min s 3 a gr dy = = a 41 greedy 41 s g V 1 = 0 a 40 C=5 V 2 = = 2. 2.8 C=2 s 4 a 41 s 3 V 1 = 2

Value Iteration [Bellman 57] No restriction on initial value function iteration n ℇ - consistency termination condition 28

Running Example a 20 a 40 s 2 s 4 a 00 C=5 a 41 Pr=0.6 a 21 a 1 s 0 s g C=2 a 3 Pr=0.4 a 01 s 1 s 3 n V n (s 0 ) V n (s 1 ) V n (s 2 ) V n (s 3 ) V n (s 4 ) 0 3 3 2 2 1 1 3 3 2 2 2.8 2 3 3 3.8 3.8 2.8 3 4 4.8 3.8 3.8 3.52 4 4.8 4.8 4.52 4.52 3.52 5 5.52 5.52 4.52 4.52 3.808 20 5.99921 5.99921 4.99969 4.99969 3.99969 29

Convergence & Optimality • For an SSP MDP, ∀ s ∊ S , lim n à ∞ V n (s) = V*(s) irrespective of the initialization. 30

VI à Asynchronous VI • Is backing up all states in an iteration essential? – No! • States may be backed up – as many times – in any order • If no state gets starved – convergence properties still hold!! 35

Residual wrt Value Function V ( Res V ) • Residual at s with respect to V – magnitude( Δ V(s)) after one Bellman backup at s Res v (s) = | V i (s) – Min Σ T(s,a,s’)[C(s,a,s’) + V i (s’)] | a ∊ 𝓑 s ∊ 𝓣 • Residual wrt respect to V – max residual Res V < ∊ – Res V = max s (Res V (s )) ( ∊ - consistency ) 36

(General) Asynchronous VI 37

Heuristic Search Algorithms • Definitions • Find & Revise Scheme. • LAO* and Extensions • RTDP and Extensions • Other uses of Heuristics/Bounds • Heuristic Design 38

Notation s 0 A 2 A 1 s 2 s 3 s 4 s 1 s 9 s 5 s 6 s 7 s 8 S g 39

Heuristic Search • Insight 1 – knowledge of a start state to save on computation ~ (all sources shortest path à single source shortest path) • Insight 2 – additional knowledge in the form of heuristic function ~ (dfs/bfs à A*) 40

Model • SSP (as before) with an additional start state s 0 – denoted by SSP s0 • What is the solution to an SSP s0 • Policy ( S à A )? – are states that are not reachable from s 0 relevant? – states that are never visited (even though reachable)? 41

Partial Policy • Define Partial policy – π : S’ à A , where S’ ⊆ S • Define Partial policy closed w.r.t. a state s. – is a partial policy π s – defined for all states s’ reachable by π s starting from s 42

Partial policy closed wrt s 0 s 0 a 1 is left action a 2 is on right s 2 s 3 s 4 s 1 s 9 s 5 s 6 s 7 s 8 S g π s0 (s 0 )= a 1 Is this policy closed wrt s 0 ? π s0 (s 1 )= a 2 π s0 (s 2 )= a 1 43

Partial policy closed wrt s 0 s 0 a 1 is left action a 2 is on right s 2 s 3 s 4 s 1 s 9 s 5 s 6 s 7 s 8 S g π s0 (s 0 )= a 1 Is this policy closed wrt s 0 ? π s0 (s 1 )= a 2 π s0 (s 2 )= a 1 π s0 (s 6 )= a 1 44

Policy Graph of π s 0 s 0 a 1 is left action a 2 is on right s 2 s 3 s 4 s 1 s 9 s 5 s 6 s 7 s 8 S g π s0 (s 0 )= a 1 π s0 (s 1 )= a 2 π s0 (s 2 )= a 1 π s0 (s 6 )= a 1 45

Greedy Policy Graph • Define greedy policy : π V = argmin a Q V (s,a) • Define greedy partial policy rooted at s 0 – Partial policy rooted at s 0 – Greedy policy – denoted by π V ¼ s0 • Define greedy policy graph – Policy graph of π : denoted by G V V ¼ s 0 s0 46

Heuristic Function • h(s): S à R – estimates V*(s) – gives an indication about “goodness” of a state – usually used in initialization V 0 (s) = h(s) – helps us avoid seemingly bad states • Define admissible heuristic – Optimistic (underestimates cost) – h(s) ≤ V*(s) 47

Admissible Heuristics • Basic idea – Relax probabilistic domain to deterministic domain a – Use heuristics(classical planning) s s 1 s 2 • All-outcome Determinization – For each outcome create a different action a 1 • Admissible Heuristics s s 1 a 2 – Cheapest cost solution for determinized domain s 2 – Classical heuristics over determinized domain 48

Heuristic Search Algorithms • Definitions • Find & Revise Scheme. • LAO* and Extensions • RTDP and Extensions • Other uses of Heuristics/Bounds • Heuristic Design 49

A General Scheme for Heuristic Search in MDPs • Two (over)simplified intuitions – Focus on states in greedy policy wrt. V rooted at s 0 – Focus on states with residual > ε • Find & Revise: – repeat • find a state that satisfies the two properties above • perform a Bellman backup – until no such state remains 50

FIND & REVISE [Bonet&Geffner 03a] ( perform Bellman backups ) • Convergence to V* is guaranteed – if heuristic function is admissible – ~no state gets starved in ∞ FIND steps 51

CSE 573 Markov Decision Processes: Heuristic Search & Real-Time - PowerPoint PPT Presentation

CSE 573 Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming Slides adapted from Andrey Kolobov and Mausam 1 Stochastic Shortest-Path MDPs: Motivation Assume the agent pays cost to achieve a goal Example

CSE 3401 Functional and Logic Programming York University CSE 3401 Vida Movahedi 1 York University

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu

CSE 312 Final Review: Section AA CSE 312 TAs December 8, 2011 CSE 312 Final Review: Section AA

Welcome to CSE 506 Introduc/on & Review Don Porter 1 2 CSE 506: Opera.ng Systems CSE 506:

4 5 6 CSE 142 vs CSE 143 CSE 142 / AP CS A CSE 143 You learned how to write Return of

Office of the University Registrar 125 Jesse Hall 573-882-7881 https://registrar.missouri.edu/

Systems & Applications: Introduction Ling 573 NLP Systems and Applications March 29, 2016

Systems & Applications: Introduction Ling 573 NLP Systems and Applications April 1, 2014

CS 573: Algorithms . Sariel Har-Peled sariel@illinois.edu 3306 SC University of Illinois,

Administrivia, Introduction CS 573: Algorithms Lecture 1 August 27, 2013 Sariel Har-Peled

CSE 573: Artificial Intelligence Reinforcement Learning Dan Weld/ University of Washington [Many

CSE 573: Artificial Intelligence Bayes Net Teaser Daniel Weld [Most slides were created by

CSE 573: Artificial Intelligence Logistics 1 Autumn 2012 Dan in Boston (UIST) on Wed 10/10

CSE 573: Artificial Intelligence Hanna Hajishirzi Expectimax Complex Games slides adapted

CSE 573: Artificial Intelligence Reinforcement Learning Dan Weld/ University of Washington [Many

CSE 573: Artificial Intelligence Winter 2017 Introduction & Agents Dan Weld TBD Gagan

Variational Inference for Bayesian Neural Networks Jesse Bettencourt, Harris Chan, Ricky Chen,

Primary Care Research: No Longer Lost in Translation James W. Mold, MD, MPH George Lynn Cross

Class 20 Fault localization (contd) Test-data generation Exam review: Nov 3,

Focusing via display Giuseppe Greco - University of Utrecht - (work in progress with W. Fussner,

A Simple Obfuscation Scheme for Pattern-Matching with Wildcards Allison Bishop Lucas Kowalczyk

Crea&ng Change: Primary Preven&on of Sexual Violence

RF Strategy & the M uCool T est Area Meeting the RF in Magnetic Field Challenge Alan Bross

INF 111 / CSE 121: Software Tools and Methods Lecture Notes for Fall Quarter, 2007 Michele

CSE 573 Markov Decision Processes: Heuristic Search & Real-Time - PowerPoint PPT Presentation

CSE 573 Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming Slides adapted from Andrey Kolobov and Mausam 1 Stochastic Shortest-Path MDPs: Motivation Assume the agent pays cost to achieve a goal Example

CSE 3401 Functional and Logic Programming York University CSE 3401 Vida Movahedi 1 York University

CSE 182-L2:Blast &amp; variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu

CSE 312 Final Review: Section AA CSE 312 TAs December 8, 2011 CSE 312 Final Review: Section AA

Welcome to CSE 506 Introduc/on &amp; Review Don Porter 1 2 CSE 506: Opera.ng Systems CSE 506:

4 5 6 CSE 142 vs CSE 143 CSE 142 / AP CS A CSE 143 You learned how to write Return of

Office of the University Registrar 125 Jesse Hall 573-882-7881 https://registrar.missouri.edu/

Systems &amp; Applications: Introduction Ling 573 NLP Systems and Applications March 29, 2016

Systems &amp; Applications: Introduction Ling 573 NLP Systems and Applications April 1, 2014

CS 573: Algorithms . Sariel Har-Peled sariel@illinois.edu 3306 SC University of Illinois,

Administrivia, Introduction CS 573: Algorithms Lecture 1 August 27, 2013 Sariel Har-Peled

CSE 573: Artificial Intelligence Reinforcement Learning Dan Weld/ University of Washington [Many

CSE 573: Artificial Intelligence Bayes Net Teaser Daniel Weld [Most slides were created by

CSE 573: Artificial Intelligence Logistics 1 Autumn 2012 Dan in Boston (UIST) on Wed 10/10

CSE 573: Artificial Intelligence Hanna Hajishirzi Expectimax Complex Games slides adapted

CSE 573: Artificial Intelligence Reinforcement Learning Dan Weld/ University of Washington [Many

CSE 573: Artificial Intelligence Winter 2017 Introduction &amp; Agents Dan Weld TBD Gagan

Variational Inference for Bayesian Neural Networks Jesse Bettencourt, Harris Chan, Ricky Chen,

Primary Care Research: No Longer Lost in Translation James W. Mold, MD, MPH George Lynn Cross

Class 20 Fault localization (contd) Test-data generation Exam review: Nov 3,

Focusing via display Giuseppe Greco - University of Utrecht - (work in progress with W. Fussner,

A Simple Obfuscation Scheme for Pattern-Matching with Wildcards Allison Bishop Lucas Kowalczyk

Crea&amp;ng Change: Primary Preven&amp;on of Sexual Violence

RF Strategy &amp; the M uCool T est Area Meeting the RF in Magnetic Field Challenge Alan Bross

INF 111 / CSE 121: Software Tools and Methods Lecture Notes for Fall Quarter, 2007 Michele

CSE 182-L2:Blast & variants I Dynamic Programming www.cse cse. .ucsd ucsd. .edu

Welcome to CSE 506 Introduc/on & Review Don Porter 1 2 CSE 506: Opera.ng Systems CSE 506:

Systems & Applications: Introduction Ling 573 NLP Systems and Applications March 29, 2016

Systems & Applications: Introduction Ling 573 NLP Systems and Applications April 1, 2014

CSE 573: Artificial Intelligence Winter 2017 Introduction & Agents Dan Weld TBD Gagan

Crea&ng Change: Primary Preven&on of Sexual Violence

RF Strategy & the M uCool T est Area Meeting the RF in Magnetic Field Challenge Alan Bross