SLIDE 1

Module 7 Policy Iteration

CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo
(c) 2013 Pascal Poupart

SLIDE 2

Policy Optimization

  • Value iteration

– Optimize value function
– Extract induced policy

  • Can we directly optimize the policy?

– Yes, by policy iteration

SLIDE 3

Policy Iteration

  • Alternate between two steps
  • 1. Policy evaluation

V^π(s) = R(s, π(s)) + γ Σ_{s′} Pr(s′ | s, π(s)) V^π(s′)   ∀s

  • 2. Policy improvement

π(s) ← argmax_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V^π(s′)   ∀s
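A minimal Python sketch of these two steps, assuming a tabular MDP stored as arrays R[s, a] (rewards), P[s, a, s′] (transition probabilities Pr(s′ | s, a)), and a scalar discount gamma; the array layout and names are illustrative assumptions, not from the slides:

```python
import numpy as np

def policy_evaluation_sweep(V, policy, R, P, gamma):
    # One synchronous sweep of V(s) = R(s, pi(s)) + gamma * sum_s' Pr(s'|s, pi(s)) V(s')
    S = len(V)
    return np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                     for s in range(S)])

def policy_improvement(V, R, P, gamma):
    # pi(s) = argmax_a [ R(s, a) + gamma * sum_s' Pr(s'|s, a) V(s') ]
    Q = R + gamma * P @ V   # Q[s, a]; the matmul sums over s'
    return Q.argmax(axis=1)
```

One iteration of policy iteration alternates the two functions, with evaluation run to convergence (or solved exactly, as on the next slide).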

SLIDE 4

Algorithm

policyIteration(MDP)
    Initialize π_0 to any policy
    n ← 0
    Repeat
        Eval:    V_n = R^{π_n} + γ T^{π_n} V_n
        Improve: π_{n+1} ← argmax_a R^a + γ T^a V_n
        n ← n + 1
    Until π_{n+1} = π_n
    Return π_n
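A runnable sketch of this algorithm, under the same assumed R[s, a] and P[s, a, s′] arrays as above; the evaluation step is solved exactly as the linear system (I − γ T^{π_n}) V_n = R^{π_n}:

```python
import numpy as np

def policy_iteration(R, P, gamma):
    S, A = R.shape
    policy = np.zeros(S, dtype=int)               # pi_0: an arbitrary initial policy
    while True:
        # Eval: V_n = R^{pi_n} + gamma T^{pi_n} V_n, solved exactly
        T_pi = P[np.arange(S), policy]            # T_pi[s, s'] = Pr(s' | s, pi(s))
        R_pi = R[np.arange(S), policy]
        V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
        # Improve: pi_{n+1}(s) = argmax_a R(s, a) + gamma * sum_s' Pr(s'|s, a) V(s')
        new_policy = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):    # until pi_{n+1} = pi_n
            return policy, V
        policy = new_policy
```

Each improvement either yields a strictly better policy or leaves it fixed, so the loop terminates (see Theorem 2 below).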

SLIDE 5

Monotonic Improvement

  • Lemma 1: Let V_n and V_{n+1} be successive value functions in policy iteration. Then V_{n+1} ≥ V_n.
  • Proof:

– We know that H* V_n ≥ H^{π_n} V_n = V_n, where H^π and H* denote the policy and optimal Bellman backup operators
– Let π_{n+1} = argmax_a R^a + γ T^a V_n
– Then H* V_n = R^{π_{n+1}} + γ T^{π_{n+1}} V_n ≥ V_n
– Rearranging: R^{π_{n+1}} ≥ (I − γ T^{π_{n+1}}) V_n
– Hence V_{n+1} = (I − γ T^{π_{n+1}})^{−1} R^{π_{n+1}} ≥ V_n, since (I − γ T^{π_{n+1}})^{−1} = Σ_k (γ T^{π_{n+1}})^k has nonnegative entries and therefore preserves the inequality
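A quick numerical illustration of the lemma on a randomly generated MDP (a hypothetical check, not part of the slides); it asserts that successive value functions never decrease:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
R = rng.random((S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)    # normalize each row into a distribution

policy = np.zeros(S, dtype=int)
V_prev = np.full(S, -np.inf)
for _ in range(20):
    T_pi = P[np.arange(S), policy]   # transition matrix under the current policy
    V = np.linalg.solve(np.eye(S) - gamma * T_pi, R[np.arange(S), policy])
    assert np.all(V >= V_prev - 1e-9), "Lemma 1 violated"
    V_prev = V
    policy = (R + gamma * P @ V).argmax(axis=1)
```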

SLIDE 6

Convergence

  • Theorem 2: Policy iteration converges to π* and V* in finitely many iterations when S and A are finite.
  • Proof:

– We know that V_{n+1} ≥ V_n ∀n by Lemma 1.
– Since S and A are finite, there are finitely many policies, and therefore the algorithm terminates in finitely many iterations.
– At termination, π_{n+1} = π_n, and therefore V_n satisfies Bellman’s equation:
  V_n = V_{n+1} = max_a R^a + γ T^a V_n

SLIDE 7

Complexity

  • Value Iteration:

– Each iteration: O(|S|²|A|)
– Many iterations: linear convergence

  • Policy Iteration:

– Each iteration: O(|S|³ + |S|²|A|), where the |S|³ term comes from solving the linear system in the evaluation step
– Few iterations: linear-quadratic convergence

SLIDE 8

Modified Policy Iteration

  • Alternate between two steps
  • 1. Partial policy evaluation

Repeat k times:
V^π(s) ← R(s, π(s)) + γ Σ_{s′} Pr(s′ | s, π(s)) V^π(s′)   ∀s

  • 2. Policy improvement

π(s) ← argmax_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V^π(s′)   ∀s

SLIDE 9

Algorithm

modifiedPolicyIteration(MDP)
    Initialize π_0 and V_0 to anything
    n ← 0
    Repeat
        Eval:    Repeat k times: V_n ← R^{π_n} + γ T^{π_n} V_n
        Improve: π_{n+1} ← argmax_a R^a + γ T^a V_n
                 V_{n+1} ← max_a R^a + γ T^a V_n
        n ← n + 1
    Until ‖V_n − V_{n−1}‖_∞ ≤ ε
    Return π_n
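A Python sketch of modifiedPolicyIteration under the same assumed R[s, a], P[s, a, s′] layout as before; the k partial-evaluation sweeps replace the exact linear solve of plain policy iteration:

```python
import numpy as np

def modified_policy_iteration(R, P, gamma, k=5, eps=1e-6):
    S, A = R.shape
    policy = np.zeros(S, dtype=int)        # pi_0: anything
    V = np.zeros(S)                        # V_0: anything
    while True:
        # Eval: repeat k times  V <- R^{pi_n} + gamma T^{pi_n} V
        for _ in range(k):
            V = R[np.arange(S), policy] + gamma * P[np.arange(S), policy] @ V
        # Improve: pi_{n+1} <- argmax_a ...;  V_{n+1} <- max_a R^a + gamma T^a V_n
        Q = R + gamma * P @ V
        policy = Q.argmax(axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) <= eps:   # until ||V_n - V_{n-1}||_inf <= eps
            return policy, V_new
        V = V_new
```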

SLIDE 10

Convergence

  • Same convergence guarantees as value iteration:
  • Value function V_n:  ‖V_n − V*‖_∞ ≤ ε/(1−γ)
  • Value function V^{π_n} of policy π_n:  ‖V^{π_n} − V*‖_∞ ≤ 2ε/(1−γ)
  • Proof: somewhat complicated (see Section 6.5 of Puterman’s book)
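For a concrete feel (illustrative numbers, not from the slides): with γ = 0.9 and threshold ε = 0.01, the bounds give ‖V_n − V*‖_∞ ≤ 0.01/(1 − 0.9) = 0.1 and ‖V^{π_n} − V*‖_∞ ≤ 0.02/(1 − 0.9) = 0.2.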
SLIDE 11

Complexity

  • Value Iteration:

– Each iteration: O(|S|²|A|)
– Many iterations: linear convergence

  • Policy Iteration:

– Each iteration: O(|S|³ + |S|²|A|)
– Few iterations: linear-quadratic convergence

  • Modified Policy Iteration:

– Each iteration: O(k|S|² + |S|²|A|)
– Few iterations: linear-quadratic convergence

SLIDE 12

Summary

  • Policy iteration

– Iteratively refine the policy

  • Can we treat the search for a good policy as an optimization problem?

– Yes: by linear programming
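As a preview (the standard textbook formulation, which may differ in detail from what the course presents next): the optimal value function solves a linear program with one variable V(s) per state and one constraint per state-action pair:

min Σ_s V(s)   subject to   V(s) ≥ R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V(s′)   ∀s, a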