
Module 6: Value Iteration (CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo)



  1. Module 6: Value Iteration. CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo. (c) 2013 Pascal Poupart

  2. Markov Decision Process
• Definition
  – Set of states: $S$
  – Set of actions (i.e., decisions): $A$
  – Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
  – Reward model (i.e., utility): $R(s_t, a_t)$
  – Discount factor: $0 \le \gamma \le 1$
  – Horizon (i.e., # of time steps): $h$
• Goal: find an optimal policy $\pi$
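As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of these ingredients; the two-state, two-action numbers and the array layout (T[a][s, s'], R[a][s]) are my own made-up conventions:

```python
import numpy as np

n_states, n_actions = 2, 2            # S = {0, 1}, A = {0, 1}
gamma = 0.9                           # discount factor, 0 <= gamma <= 1
horizon = 10                          # number of time steps h

# Transition model Pr(s' | s, a), stored as T[a][s, s'] (each row sums to 1).
T = np.array([[[0.9, 0.1],            # action 0
               [0.2, 0.8]],
              [[0.5, 0.5],            # action 1
               [0.4, 0.6]]])

# Reward model R(s, a), stored as R[a][s].
R = np.array([[1.0, 0.0],             # action 0
              [0.5, 2.0]])            # action 1

assert np.allclose(T.sum(axis=2), 1.0)  # each Pr(.|s, a) is a distribution
```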

  3. Finite Horizon
• Policy evaluation
  $V_h^\pi(s) = \sum_{t=0}^{h} \gamma^t \sum_{s'} \Pr(S_t = s' \mid S_0 = s, \pi)\, R(s', \pi_t(s'))$
• Recursive form (dynamic programming)
  $V_0^\pi(s) = R(s, \pi_0(s))$
  $V_t^\pi(s) = R(s, \pi_t(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_t(s))\, V_{t-1}^\pi(s')$
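A hedged sketch of this recursion in NumPy (the array layout R[a][s], T[a][s, s'], the time-indexed policy[t][s] and the name evaluate_policy are assumptions of this snippet, not from the slides):

```python
import numpy as np

def evaluate_policy(R, T, policy, gamma, h):
    """R[a][s] rewards, T[a][s, s'] transitions, policy[t][s] an action index.
    Returns V[t][s] for t = 0..h, following the recursion above."""
    n_states = R.shape[1]
    V = np.zeros((h + 1, n_states))
    # Base case: V_0(s) = R(s, pi_0(s))
    V[0] = R[policy[0], np.arange(n_states)]
    for t in range(1, h + 1):
        for s in range(n_states):
            a = policy[t][s]
            # V_t(s) = R(s, pi_t(s)) + gamma * sum_{s'} Pr(s'|s, pi_t(s)) V_{t-1}(s')
            V[t, s] = R[a, s] + gamma * T[a, s] @ V[t - 1]
    return V
```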

  4. Finite Horizon
• Optimal policy $\pi^*$
  $V_h^{\pi^*}(s) \ge V_h^\pi(s) \;\; \forall \pi, s$
• Optimal value function $V^*$ (shorthand for $V^{\pi^*}$)
  $V_0^*(s) = \max_a R(s, a)$
  $V_t^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s')$   (Bellman's equation)

  5. Value Iteration Algorithm
valueIteration(MDP)
  $V_0^*(s) \leftarrow \max_a R(s, a) \;\; \forall s$
  For $t = 1$ to $h$ do
    $V_t^*(s) \leftarrow \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s') \;\; \forall s$
  Return $V^*$
Optimal policy $\pi^*$
  $t = 0$:  $\pi_0^*(s) \leftarrow \mathrm{argmax}_a\, R(s, a) \;\; \forall s$
  $t > 0$:  $\pi_t^*(s) \leftarrow \mathrm{argmax}_a\, R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s') \;\; \forall s$
NB: $\pi^*$ is non-stationary (i.e., time dependent)
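One possible NumPy rendering of this algorithm, a sketch assuming the R[a][s] / T[a][s, s'] layout of the earlier snippet (the function name is my own):

```python
import numpy as np

def value_iteration_finite(R, T, gamma, h):
    """Finite-horizon value iteration: returns optimal values V[t] and a
    non-stationary greedy policy pi[t] for t = 0..h."""
    n_states = R.shape[1]
    V = np.zeros((h + 1, n_states))
    pi = np.zeros((h + 1, n_states), dtype=int)
    V[0], pi[0] = R.max(axis=0), R.argmax(axis=0)          # t = 0 backup
    for t in range(1, h + 1):
        # Q[a, s] = R(s, a) + gamma * sum_{s'} Pr(s'|s, a) V_{t-1}(s')
        Q = R + gamma * np.einsum('asn,n->as', T, V[t - 1])
        V[t], pi[t] = Q.max(axis=0), Q.argmax(axis=0)
    return V, pi
```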

  6. Value Iteration
• Matrix form:
  $R^a$: $|S| \times 1$ column vector of rewards for $a$
  $V_t^*$: $|S| \times 1$ column vector of state values
  $T^a$: $|S| \times |S|$ matrix of transition probabilities for $a$
valueIteration(MDP)
  $V_0^* \leftarrow \max_a R^a$
  For $t = 1$ to $h$ do
    $V_t^* \leftarrow \max_a R^a + \gamma T^a V_{t-1}^*$
  Return $V^*$
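The same backup written as per-action matrix-vector products, as a small sketch (the helper name backup and the array layout are assumptions):

```python
import numpy as np

def backup(R, T, gamma, V_prev):
    """One matrix-form step: V_t = max_a (R^a + gamma * T^a @ V_{t-1})."""
    candidates = [R[a] + gamma * T[a] @ V_prev for a in range(R.shape[0])]
    return np.max(candidates, axis=0)   # elementwise (per-state) max over actions
```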

  7. Infinite Horizon
• Let $h \to \infty$
• Then $V_h^\pi \to V_\infty^\pi$ and $V_{h-1}^\pi \to V_\infty^\pi$
• Policy evaluation:
  $V_\infty^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s))\, V_\infty^\pi(s') \;\; \forall s$
• Bellman's equation:
  $V_\infty^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_\infty^*(s')$

  8. Policy Evaluation
• Linear system of equations
  $V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s))\, V^\pi(s') \;\; \forall s$
• Matrix form:
  $R$: $|S| \times 1$ column vector of state rewards for $\pi$
  $V$: $|S| \times 1$ column vector of state values for $\pi$
  $T$: $|S| \times |S|$ matrix of transition probabilities for $\pi$
  $V = R + \gamma T V$

  9. Solving Linear Equations
• Linear system: $V = R + \gamma T V$
• Gaussian elimination: $(I - \gamma T) V = R$
• Compute inverse: $V = (I - \gamma T)^{-1} R$
• Iterative methods
  – Value iteration (a.k.a. Richardson iteration)
  – Repeat $V \leftarrow R + \gamma T V$
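A minimal sketch of the three approaches side by side, using made-up numbers for a fixed policy's rewards and transitions:

```python
import numpy as np

gamma = 0.9
R_pi = np.array([1.0, 2.0])                  # rewards under the fixed policy
T_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])                # transitions under the fixed policy
I = np.eye(2)

# Gaussian elimination on (I - gamma*T) V = R (what np.linalg.solve does).
V_solve = np.linalg.solve(I - gamma * T_pi, R_pi)

# Explicit inverse: V = (I - gamma*T)^{-1} R (fine for tiny |S|, costly in general).
V_inv = np.linalg.inv(I - gamma * T_pi) @ R_pi

# Richardson iteration: repeat V <- R + gamma*T*V.
V_iter = np.zeros(2)
for _ in range(1000):
    V_iter = R_pi + gamma * T_pi @ V_iter

print(V_solve, V_inv, V_iter)                # all three agree to high precision
```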

  10. Contraction
• Let $H(V) \triangleq R + \gamma T V$ be the policy evaluation operator
• Lemma 1: $H$ is a contraction mapping.
  $\| H(V) - H(\tilde V) \|_\infty \le \gamma \| V - \tilde V \|_\infty$
• Proof
  $\| H(V) - H(\tilde V) \|_\infty = \| R + \gamma T V - R - \gamma T \tilde V \|_\infty$   (by definition)
  $= \gamma \| T (V - \tilde V) \|_\infty$   (simplification)
  $\le \gamma \| T \|_\infty \| V - \tilde V \|_\infty$   (since $\|AB\| \le \|A\|\,\|B\|$)
  $= \gamma \| V - \tilde V \|_\infty$   (since $\max_s \sum_{s'} T(s, s') = 1$)
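A quick numerical illustration of Lemma 1 (a sanity check, not a proof; the MDP numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
R_pi = np.array([1.0, 2.0])
T_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
H = lambda V: R_pi + gamma * T_pi @ V        # policy evaluation operator H

for _ in range(5):
    V, W = rng.normal(size=2), rng.normal(size=2)
    lhs = np.max(np.abs(H(V) - H(W)))        # ||H(V) - H(W)||_inf
    rhs = gamma * np.max(np.abs(V - W))      # gamma * ||V - W||_inf
    assert lhs <= rhs + 1e-12                # contraction property holds
```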

  11. Convergence
• Theorem 2: Policy evaluation converges to $V^\pi$ for any initial estimate $V$
  $\lim_{n \to \infty} H^{(n)}(V) = V^\pi \;\; \forall V$
• Proof
  – By definition $V^\pi = H^{(\infty)}(0)$, but policy evaluation computes $H^{(\infty)}(V)$ for any initial $V$
  – By Lemma 1, $\| H^{(n)}(V) - H^{(n)}(\tilde V) \|_\infty \le \gamma^n \| V - \tilde V \|_\infty$
  – Hence, when $n \to \infty$, $\| H^{(n)}(V) - H^{(n)}(0) \|_\infty \to 0$ and $H^{(\infty)}(V) = V^\pi \;\; \forall V$

  12. Approximate Policy Evaluation
• In practice, we can't perform an infinite number of iterations.
• Suppose that we perform value iteration for $k$ steps and $\| H^{(k)}(V) - H^{(k-1)}(V) \|_\infty = \epsilon$: how far is $H^{(k)}(V)$ from $V^\pi$?

  13. Approximate Policy Evaluation
• Theorem 3: If $\| H^{(k)}(V) - H^{(k-1)}(V) \|_\infty \le \epsilon$ then
  $\| V^\pi - H^{(k)}(V) \|_\infty \le \frac{\epsilon}{1 - \gamma}$
• Proof
  $\| V^\pi - H^{(k)}(V) \|_\infty = \| H^{(\infty)}(V) - H^{(k)}(V) \|_\infty$   (by Theorem 2)
  $= \| \sum_{t=1}^{\infty} \left( H^{(t+k)}(V) - H^{(t+k-1)}(V) \right) \|_\infty$
  $\le \sum_{t=1}^{\infty} \| H^{(t+k)}(V) - H^{(t+k-1)}(V) \|_\infty$   (since $\|A + B\| \le \|A\| + \|B\|$)
  $\le \sum_{t=1}^{\infty} \gamma^t \epsilon = \frac{\gamma \epsilon}{1 - \gamma} \le \frac{\epsilon}{1 - \gamma}$   (by Lemma 1)
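A small numerical check of the Theorem 3 bound (illustration only, with the same made-up policy evaluation problem as before):

```python
import numpy as np

gamma = 0.9
R_pi = np.array([1.0, 2.0])
T_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
V_exact = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)   # V^pi

V_prev = np.zeros(2)
V = R_pi + gamma * T_pi @ V_prev
for _ in range(20):                                         # k = 21 sweeps in total
    V_prev, V = V, R_pi + gamma * T_pi @ V
eps = np.max(np.abs(V - V_prev))                            # ||H^(k)V - H^(k-1)V||_inf
assert np.max(np.abs(V_exact - V)) <= eps / (1 - gamma) + 1e-12
```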

  14. Optimal Value Function
• Non-linear system of equations
  $V_\infty^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_\infty^*(s') \;\; \forall s$
• Matrix form:
  $R^a$: $|S| \times 1$ column vector of rewards for $a$
  $V^*$: $|S| \times 1$ column vector of optimal values
  $T^a$: $|S| \times |S|$ matrix of transition probabilities for $a$
  $V^* = \max_a R^a + \gamma T^a V^*$

  15. Contraction
• Let $H^*(V) \triangleq \max_a R^a + \gamma T^a V$ be the operator in value iteration
• Lemma 3: $H^*$ is a contraction mapping.
  $\| H^*(V) - H^*(\tilde V) \|_\infty \le \gamma \| V - \tilde V \|_\infty$
• Proof: without loss of generality, let $H^*(V)(s) \ge H^*(\tilde V)(s)$ and
  let $a_s^* = \mathrm{argmax}_a\, R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V(s')$

  16. Contraction
• Proof continued:
• Then
  $0 \le H^*(V)(s) - H^*(\tilde V)(s)$   (by assumption)
  $\le R(s, a_s^*) + \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, V(s') - R(s, a_s^*) - \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, \tilde V(s')$   (by definition)
  $= \gamma \sum_{s'} \Pr(s' \mid s, a_s^*) \left( V(s') - \tilde V(s') \right)$
  $\le \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, \| V - \tilde V \|_\infty$   (max-norm upper bound)
  $= \gamma \| V - \tilde V \|_\infty$   (since $\sum_{s'} \Pr(s' \mid s, a_s^*) = 1$)
• Repeat the same argument for $H^*(\tilde V)(s) \ge H^*(V)(s)$ and for each $s$

  17. Convergence
• Theorem 4: Value iteration converges to $V^*$ for any initial estimate $V$
  $\lim_{n \to \infty} H^{*(n)}(V) = V^* \;\; \forall V$
• Proof
  – By definition $V^* = H^{*(\infty)}(0)$, but value iteration computes $H^{*(\infty)}(V)$ for some initial $V$
  – By Lemma 3, $\| H^{*(n)}(V) - H^{*(n)}(\tilde V) \|_\infty \le \gamma^n \| V - \tilde V \|_\infty$
  – Hence, when $n \to \infty$, $\| H^{*(n)}(V) - H^{*(n)}(0) \|_\infty \to 0$ and $H^{*(\infty)}(V) = V^* \;\; \forall V$

  18. Value Iteration
• Even when the horizon is infinite, perform finitely many iterations
• Stop when $\| V_n - V_{n-1} \|_\infty \le \epsilon$
valueIteration(MDP)
  $V_0 \leftarrow \max_a R^a$;  $n \leftarrow 0$
  Repeat
    $n \leftarrow n + 1$
    $V_n \leftarrow \max_a R^a + \gamma T^a V_{n-1}$
  Until $\| V_n - V_{n-1} \|_\infty \le \epsilon$
  Return $V_n$
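A sketch of this infinite-horizon variant in NumPy, with the same assumed R[a][s] / T[a][s, s'] layout as in the earlier snippets:

```python
import numpy as np

def value_iteration(R, T, gamma, eps=1e-6):
    """Infinite-horizon value iteration with the ||V_n - V_{n-1}||_inf <= eps
    stopping rule; returns the final value estimate V_n."""
    V = R.max(axis=0)                                   # V_0 = max_a R^a
    while True:
        # V_n = max_a (R^a + gamma * T^a @ V_{n-1})
        V_new = (R + gamma * np.einsum('asn,n->as', T, V)).max(axis=0)
        if np.max(np.abs(V_new - V)) <= eps:            # ||V_n - V_{n-1}||_inf <= eps
            return V_new
        V = V_new
```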

  19. Induced Policy
• Since $\| V_n - V_{n-1} \|_\infty \le \epsilon$, by Theorem 4 we know that
  $\| V_n - V^* \|_\infty \le \frac{\epsilon}{1 - \gamma}$
• But how good is the stationary policy $\pi_n(s)$ extracted based on $V_n$?
  $\pi_n(s) = \mathrm{argmax}_a\, R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_n(s')$
• How far is $V^{\pi_n}$ from $V^*$?
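Extracting this greedy policy from a value estimate is a one-liner in the same assumed array layout (a sketch; the name greedy_policy is mine):

```python
import numpy as np

def greedy_policy(R, T, gamma, V):
    """pi_n(s) = argmax_a R(s, a) + gamma * sum_{s'} Pr(s'|s, a) V(s')."""
    Q = R + gamma * np.einsum('asn,n->as', T, V)        # Q[a, s]
    return Q.argmax(axis=0)                             # one action index per state
```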

  20. Induced Policy
• Theorem 5: $\| V^{\pi_n} - V^* \|_\infty \le \frac{2\epsilon}{1 - \gamma}$
• Proof
  $\| V^{\pi_n} - V^* \|_\infty = \| V^{\pi_n} - V_n + V_n - V^* \|_\infty$
  $\le \| V^{\pi_n} - V_n \|_\infty + \| V_n - V^* \|_\infty$   (since $\|A + B\| \le \|A\| + \|B\|$)
  $= \| H_{\pi_n}^{(\infty)}(V_n) - V_n \|_\infty + \| H^{*(\infty)}(V_n) - V_n \|_\infty$
  $\le \frac{\epsilon}{1 - \gamma} + \frac{\epsilon}{1 - \gamma}$   (by Theorems 2 and 4)
  $= \frac{2\epsilon}{1 - \gamma}$

  21. Summary
• Value iteration
  – Simple dynamic programming algorithm
  – Complexity: $O(n |A| |S|^2)$, where $n$ is the number of iterations
• Can we optimize the policy directly instead of optimizing the value function and then inducing a policy?
  – Yes: by policy iteration
