Module 7: Policy Iteration
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
(c) 2013 Pascal Poupart
Policy Optimization
- Value iteration
– Optimize value function
– Extract induced policy
- Can we directly optimize the policy?
– Yes, by policy iteration
Policy Iteration
- Alternate between two steps
- 1. Policy evaluation
V_π(s) = R(s, π(s)) + γ Σ_{s′} Pr(s′ | s, π(s)) V_π(s′)   ∀s
- 2. Policy improvement
π(s) ← argmax_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_π(s′)   ∀s
Algorithm
policyIteration(MDP)
  Initialize π_0 to any policy
  n ← 0
  Repeat
    Eval:    V_n = R^{π_n} + γ T^{π_n} V_n   (solve this linear system for V_n)
    Improve: π_{n+1} ← argmax_a R^a + γ T^a V_n
    n ← n + 1
  Until π_{n+1} = π_n
  Return π_n
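Below is a minimal NumPy sketch of this loop (the function name, array layout, and variable names are my assumptions, not from the slides). The evaluation step solves the linear system (I − γ T^π) V = R^π directly; the improvement step is a greedy argmax over actions.

import numpy as np

def policy_iteration(R, T, gamma):
    # Assumed layout: R[s, a] = R(s, a), shape (|S|, |A|)
    #                 T[a, s, s2] = Pr(s2 | s, a), shape (|A|, |S|, |S|)
    n_states, _ = R.shape
    pi = np.zeros(n_states, dtype=int)            # pi_0: arbitrary policy
    while True:
        # Eval: solve (I - gamma T^pi) V = R^pi exactly (the O(|S|^3) step)
        R_pi = R[np.arange(n_states), pi]
        T_pi = T[pi, np.arange(n_states), :]      # row s is Pr(. | s, pi(s))
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        # Improve: greedy policy with respect to V
        Q = R + gamma * (T @ V).T                 # Q[s, a]
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):            # Until pi_{n+1} = pi_n
            return pi, V
        pi = pi_new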
Monotonic Improvement
- Lemma 1: Let V_n and V_{n+1} be successive value functions in policy iteration. Then V_{n+1} ≥ V_n.
- Proof:
– We know that H* V_n ≥ H^{π_n} V_n = V_n
(here H* denotes the optimal Bellman backup operator and H^π the Bellman operator of policy π)
– Let π_{n+1} = argmax_a R^a + γ T^a V_n
– Then H* V_n = R^{π_{n+1}} + γ T^{π_{n+1}} V_n ≥ V_n
– Rearranging: R^{π_{n+1}} ≥ (I − γ T^{π_{n+1}}) V_n
– Hence V_{n+1} = (I − γ T^{π_{n+1}})^{-1} R^{π_{n+1}} ≥ V_n, since (I − γ T^{π_{n+1}})^{-1} = Σ_{k≥0} (γ T^{π_{n+1}})^k has only nonnegative entries and therefore preserves the inequality
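As a quick sanity check of Lemma 1, the snippet below (a hypothetical experiment on a random MDP, not part of the slides) alternates exact evaluation and greedy improvement, asserting that successive value functions never decrease:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP: rewards in [0, 1) and row-stochastic transition matrices
R = rng.random((n_states, n_actions))
T = rng.random((n_actions, n_states, n_states))
T /= T.sum(axis=2, keepdims=True)

pi = np.zeros(n_states, dtype=int)
V_prev = None
for _ in range(20):
    # Exact evaluation of the current policy pi_n
    R_pi = R[np.arange(n_states), pi]
    T_pi = T[pi, np.arange(n_states), :]
    V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
    # Lemma 1: V_{n+1} >= V_n componentwise (up to float tolerance)
    if V_prev is not None:
        assert np.all(V >= V_prev - 1e-10)
    V_prev = V
    # Greedy improvement
    pi = (R + gamma * (T @ V).T).argmax(axis=1)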
Convergence
- Theorem 2: Policy iteration converges to π* and V* in finitely many iterations when S and A are finite.
- Proof:
– We know that V_{n+1} ≥ V_n ∀n by Lemma 1.
– Since A and S are finite, there are finitely many policies, and therefore the algorithm terminates in finitely many iterations.
– At termination, π_{n+1} = π_n, and therefore V_n satisfies Bellman's equation:
V_n = V_{n+1} = max_a R^a + γ T^a V_n
Complexity
- Value Iteration:
– Each iteration: O(|S|² |A|)
– Many iterations: linear convergence
- Policy Iteration:
– Each iteration: O(|S|³ + |S|² |A|) (the |S|³ term comes from solving the linear system in the evaluation step)
– Few iterations: linear-quadratic convergence
Modified Policy Iteration
- Alternate between two steps
- 1. Partial policy evaluation
Repeat k times:
V_π(s) ← R(s, π(s)) + γ Σ_{s′} Pr(s′ | s, π(s)) V_π(s′)   ∀s
- 2. Policy improvement
π(s) ← argmax_a R(s, a) + γ Σ_{s′} Pr(s′ | s, a) V_π(s′)   ∀s
Algorithm
modifiedPolicyIteration(MDP)
  Initialize π_0 and V_0 to anything
  n ← 0
  Repeat
    Eval:    Repeat k times
               V_n ← R^{π_n} + γ T^{π_n} V_n
    Improve: π_{n+1} ← argmax_a R^a + γ T^a V_n
             V_{n+1} ← max_a R^a + γ T^a V_n
    n ← n + 1
  Until ‖V_n − V_{n−1}‖∞ ≤ ε
  Return π_n
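A NumPy sketch of modified policy iteration under the same assumed array layout as before (the function name, shapes, and the defaults for k and ε are mine, not the slides'). The exact linear solve of policy iteration is replaced by k applications of the policy backup:

import numpy as np

def modified_policy_iteration(R, T, gamma, k=5, eps=1e-6):
    # Assumed layout: R is (|S|, |A|), T is (|A|, |S|, |S|)
    n_states, _ = R.shape
    pi = np.zeros(n_states, dtype=int)       # pi_0 and V_0: arbitrary
    V = np.zeros(n_states)
    while True:
        V_old = V
        # Partial evaluation: k backups instead of an exact solve, O(k|S|^2)
        R_pi = R[np.arange(n_states), pi]
        T_pi = T[pi, np.arange(n_states), :]
        for _ in range(k):
            V = R_pi + gamma * T_pi @ V
        # Improvement: greedy policy plus one optimal backup
        Q = R + gamma * (T @ V).T
        pi = Q.argmax(axis=1)
        V = Q.max(axis=1)
        if np.max(np.abs(V - V_old)) <= eps:  # ||V_n - V_{n-1}||_inf <= eps
            return pi, V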
Convergence
- Same convergence guarantees as value iteration:
- Value function V_n:  ‖V_n − V*‖∞ ≤ ε / (1 − γ)
- Value function V^{π_n} of policy π_n:  ‖V^{π_n} − V*‖∞ ≤ 2ε / (1 − γ)
(e.g., with γ = 0.9 and tolerance ε = 0.01, the returned policy's value is within 2(0.01)/0.1 = 0.2 of optimal)
- Proof: somewhat complicated (see Section 6.5 of Puterman's book)
Complexity
- Value Iteration:
– Each iteration: O(|S|² |A|)
– Many iterations: linear convergence
- Policy Iteration:
– Each iteration: O(|S|³ + |S|² |A|)
– Few iterations: linear-quadratic convergence
- Modified Policy Iteration:
– Each iteration: O(k|S|² + |S|² |A|)
– Few iterations: linear-quadratic convergence
Summary
- Policy iteration
– Iteratively refine policy
- Can we treat the search for a good policy as an optimization problem?