

  1. Module 7: Policy Iteration. CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo

  2. Policy Optimization
     • Value iteration
       – Optimize the value function
       – Extract the induced policy
     • Can we directly optimize the policy?
       – Yes, by policy iteration

  3. Policy Iteration
     • Alternate between two steps
       1. Policy evaluation:
          V^π(s) = R(s, π(s)) + γ Σ_{s'} Pr(s' | s, π(s)) V^π(s')   ∀s
       2. Policy improvement:
          π(s) ← argmax_a R(s, a) + γ Σ_{s'} Pr(s' | s, a) V^π(s')   ∀s

  4. Algorithm policyIteration(MDP)
     Initialize π_0 to any policy
     n ← 0
     Repeat
       Eval:    V_n = R^{π_n} + γ T^{π_n} V_n
       Improve: π_{n+1} ← argmax_a R^a + γ T^a V_n
       n ← n + 1
     Until π_{n+1} = π_n
     Return π_n
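     A minimal NumPy sketch of this loop (an illustration, not the course's reference code): it assumes the MDP is given as a transition tensor T of shape (|S|, |A|, |S|) and a reward matrix R of shape (|S|, |A|), and solves the evaluation equation V = R^π + γ T^π V exactly with a linear solve.

        import numpy as np

        def policy_iteration(T, R, gamma):
            # T: transitions, shape (S, A, S) with T[s, a, s2] = Pr(s2 | s, a)  (assumed layout)
            # R: rewards, shape (S, A) with R[s, a] = R(s, a)                   (assumed layout)
            # gamma: discount factor in [0, 1)
            S, A = R.shape
            pi = np.zeros(S, dtype=int)            # pi_0: arbitrary initial policy
            while True:
                # Policy evaluation: solve (I - gamma T^pi) V = R^pi exactly
                T_pi = T[np.arange(S), pi]         # (S, S) transition matrix under pi
                R_pi = R[np.arange(S), pi]         # (S,) reward vector under pi
                V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
                # Policy improvement: greedy policy with respect to V
                Q = R + gamma * T @ V              # (S, A) action values
                new_pi = Q.argmax(axis=1)
                if np.array_equal(new_pi, pi):     # pi_{n+1} = pi_n: stop
                    return pi, V
                pi = new_pi

     The exact linear solve in the evaluation step is what accounts for the O(|S|³) term in the per-iteration complexity quoted later.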

  5. Monotonic Improvement
     • Lemma 1: Let V_n and V_{n+1} be successive value functions in policy iteration. Then V_{n+1} ≥ V_n.
     • Proof:
       – We know that H* V_n ≥ H^{π_n} V_n = V_n
       – Let π_{n+1} = argmax_a R^a + γ T^a V_n
       – Then H* V_n = R^{π_{n+1}} + γ T^{π_{n+1}} V_n ≥ V_n
       – Rearranging: R^{π_{n+1}} ≥ (I − γ T^{π_{n+1}}) V_n
       – Hence V_{n+1} = (I − γ T^{π_{n+1}})^{−1} R^{π_{n+1}} ≥ V_n
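     Written out in standard vector notation (this restates the slide's argument; H* is the Bellman optimality backup and H^{π_n} the evaluation backup for π_n), the chain is:

        \begin{align*}
        H^* V_n &= R^{\pi_{n+1}} + \gamma T^{\pi_{n+1}} V_n \;\ge\; H^{\pi_n} V_n = V_n \\
        \Rightarrow\; R^{\pi_{n+1}} &\ge (I - \gamma T^{\pi_{n+1}})\, V_n \\
        \Rightarrow\; V_{n+1} &= (I - \gamma T^{\pi_{n+1}})^{-1} R^{\pi_{n+1}} \;\ge\; V_n
        \end{align*}

     The last step uses the fact that (I − γ T^{π_{n+1}})^{−1} = Σ_{k≥0} γ^k (T^{π_{n+1}})^k has only non-negative entries, so applying it to both sides preserves the componentwise inequality.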

  6. Convergence
     • Theorem 2: Policy iteration converges to π* and V* in finitely many iterations when S and A are finite.
     • Proof:
       – We know that V_{n+1} ≥ V_n ∀n by Lemma 1.
       – Since A and S are finite, there are finitely many policies, and therefore the algorithm terminates in finitely many iterations.
       – At termination, π_{n+1} = π_n, and therefore V_n satisfies Bellman's equation:
         V_n = V_{n+1} = max_a R^a + γ T^a V_n

  7. Complexity
     • Value iteration:
       – Each iteration: O(|S|² |A|)
       – Many iterations: linear convergence
     • Policy iteration:
       – Each iteration: O(|S|³ + |S|² |A|)
       – Few iterations: linear-quadratic convergence

  8. Modified Policy Iteration
     • Alternate between two steps
       1. Partial policy evaluation:
          Repeat k times:
            V^π(s) ← R(s, π(s)) + γ Σ_{s'} Pr(s' | s, π(s)) V^π(s')   ∀s
       2. Policy improvement:
          π(s) ← argmax_a R(s, a) + γ Σ_{s'} Pr(s' | s, a) V^π(s')   ∀s

  9. Algorithm modifiedPolicyIteration(MDP)
     Initialize π_0 and V_0 to anything
     n ← 0
     Repeat
       Eval:    Repeat k times: V_n ← R^{π_n} + γ T^{π_n} V_n
       Improve: π_{n+1} ← argmax_a R^a + γ T^a V_n
                V_{n+1} ← max_a R^a + γ T^a V_n
       n ← n + 1
     Until ||V_n − V_{n−1}||_∞ ≤ ε
     Return π_n
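     A sketch of the same loop in NumPy, under the same assumed layout as before (T of shape (|S|, |A|, |S|), R of shape (|S|, |A|); function name and defaults are illustrative): the exact linear solve is replaced by k Bellman backups under the fixed policy.

        import numpy as np

        def modified_policy_iteration(T, R, gamma, k=5, eps=1e-6):
            # T: transitions, shape (S, A, S); R: rewards, shape (S, A)  (assumed layout)
            # k: number of partial-evaluation backups; eps: stopping threshold
            S, A = R.shape
            V = np.zeros(S)                      # V_0: arbitrary initial values
            pi = np.zeros(S, dtype=int)          # pi_0: arbitrary initial policy
            while True:
                V_prev = V
                # Partial policy evaluation: k backups under the current policy
                T_pi = T[np.arange(S), pi]
                R_pi = R[np.arange(S), pi]
                for _ in range(k):
                    V = R_pi + gamma * T_pi @ V
                # Policy improvement, plus one Bellman optimality backup for V
                Q = R + gamma * T @ V
                pi = Q.argmax(axis=1)
                V = Q.max(axis=1)
                if np.max(np.abs(V - V_prev)) <= eps:   # ||V_n - V_{n-1}||_inf <= eps
                    return pi, V

     Swapping the exact solve for k backups is what brings the per-iteration cost down from O(|S|³) to O(k|S|²) plus the improvement step, as summarized on the complexity slide below.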

  10. Convergence
      • Same convergence guarantees as value iteration:
        – Value function V_n:  ||V_n − V*||_∞ ≤ ε / (1 − γ)
        – Value function V^{π_n} of policy π_n:  ||V^{π_n} − V*||_∞ ≤ 2ε / (1 − γ)
      • Proof: somewhat complicated (see Section 6.5 of Puterman's book)
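      As a quick numerical illustration (γ = 0.9 and ε = 0.01 are assumed values, not from the slides), the two bounds give:

         \|V_n - V^*\|_\infty \le \frac{\epsilon}{1-\gamma} = \frac{0.01}{0.1} = 0.1,
         \qquad
         \|V^{\pi_n} - V^*\|_\infty \le \frac{2\epsilon}{1-\gamma} = 0.2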

  11. Complexity
      • Value iteration:
        – Each iteration: O(|S|² |A|)
        – Many iterations: linear convergence
      • Policy iteration:
        – Each iteration: O(|S|³ + |S|² |A|)
        – Few iterations: linear-quadratic convergence
      • Modified policy iteration:
        – Each iteration: O(k|S|² + |S|² |A|)
        – Few iterations: linear-quadratic convergence

  12. Summary
      • Policy iteration
        – Iteratively refine the policy
      • Can we treat the search for a good policy as an optimization problem?
        – Yes: by linear programming
