  1. Module 8: Linear Programming. CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo. (c) 2013 Pascal Poupart

  2. Policy Optimization
  • Value and policy iteration
    – Iterative algorithms that implicitly solve an optimization problem
  • Can we explicitly write down this optimization problem?
    – Yes, it can be formulated as a linear program

  3. Primal Linear Program
  • primalLP(MDP):
    $\min_V \sum_s w(s)\, V(s)$
    subject to $V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\, V(s') \quad \forall s,a$
    return $V$
  • Variables: $V(s)$ for all $s$
  • Objective: $\min_V \sum_s w(s)\, V(s)$, where $w(s)$ is a weight assigned to state $s$
  • Constraints: $V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\, V(s')$ for all $s,a$
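
To make the program concrete, here is a minimal sketch that builds and solves primalLP with scipy.optimize.linprog. The function name primal_lp and the array conventions (P[a, s, t] for transition probabilities, R[s, a] for rewards) are illustrative assumptions, not notation from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def primal_lp(P, R, gamma, w):
    """Solve the primal LP of an MDP (sketch).

    P: transitions, shape (A, S, S), P[a, s, t] = Pr(t | s, a)
    R: rewards, shape (S, A)
    gamma: discount factor in [0, 1)
    w: positive state weights, shape (S,)
    Returns the optimal value function V* as a length-S vector.
    """
    S, A = R.shape
    # One constraint per (s, a):  V(s) >= R(s,a) + gamma * sum_t P(t|s,a) V(t),
    # rewritten in linprog's  A_ub @ V <= b_ub  form:
    #   gamma * P[a, s, :] @ V - V(s) <= -R(s, a)
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = s * A + a
            A_ub[row, :] = gamma * P[a, s, :]
            A_ub[row, s] -= 1.0
            b_ub[row] = -R[s, a]
    # V is unbounded in sign (linprog defaults to x >= 0, so override)
    res = linprog(c=w, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x
```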

  4. Objective
  • Why do we minimize a weighted combination of the values? Shouldn't we maximize value?
  • Value functions $V$ that satisfy the constraints are upper bounds on the optimal value function $V^*$: $V(s) \ge V^*(s)$ for all $s$
  • Minimizing the value ensures that we choose the lowest upper bound: at the minimum, $V(s) = V^*(s)$ for all $s$

  5. Upper Bound
  • Theorem: Value functions $V$ that satisfy $V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\, V(s')$ for all $s,a$ are upper bounds on the optimal value function $V^*$: $V(s) \ge V^*(s)$ for all $s$
  • Proof:
    – Since $V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\, V(s')$ for all $s,a$,
    – then $V(s) \ge \max_a \left[ R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\, V(s') \right] = H^*(V)(s)$ for all $s$.
    – Furthermore, since $H^*$ is monotonic, $V \ge H^* V \ge H^*(H^* V) \ge \dots \ge (H^*)^\infty V = V^*$

  6. Weight Function (initial state)
  • How do we choose the weight function?
  • If the policy always starts in the same initial state $s_0$, then set $w(s) = 1$ if $s = s_0$ and $0$ otherwise
  • This ensures that $\sum_s w(s)\, V(s) = V^*(s_0)$

  7. Weight Function (any state)
  • If the policy may start in any state, then assign a positive weight to each state, i.e., $w(s) > 0$ for all $s$
  • This ensures that $V$ is minimized at each $s$ and therefore $V(s) = V^*(s)$ for all $s$
  • The magnitude of the weights doesn't matter when the LP is solved exactly. We will revisit the choice of $w(s)$ when we discuss approximate linear programming.

  8. Optimal Policy
  • The linear program finds $V^*$
  • We can extract $\pi^*$ from $V^*$ as usual (see the sketch below):
    $\pi^*(s) \leftarrow \operatorname{argmax}_a \; R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\, V^*(s')$
  • Or check the active constraints:
    – For each $s$, among the constraints $V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\, V(s')$ for all $a$, check which $a^*$ leads to equality: $V(s) = R(s,a^*) + \gamma \sum_{s'} \Pr(s' \mid s,a^*)\, V(s')$
    – Set $\pi^*(s) \leftarrow a^*$
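
A sketch of the argmax extraction, under the same assumed array conventions as primal_lp above (extract_policy is an illustrative name):

```python
import numpy as np

def extract_policy(P, R, gamma, V):
    """Greedy policy extraction from the optimal value function V* (sketch)."""
    # Q[s, a] = R(s, a) + gamma * sum_t Pr(t | s, a) V(t)
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    return Q.argmax(axis=1)  # pi*(s) = argmax_a Q(s, a)
```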

  9. Direct Policy Optimization
  • The optimal solution to the primal linear program is $V^*$, but we still have to extract $\pi^*$
  • Could we directly optimize $\pi$?
    – Yes, by considering the dual linear program

  10. Dual Linear Program
  • dualLP(MDP):
    $\max_y \sum_{s,a} y(s,a)\, R(s,a)$
    subject to $\sum_{a'} y(s',a') = w(s') + \gamma \sum_{s,a} \Pr(s' \mid s,a)\, y(s,a) \quad \forall s'$
    and $y(s,a) \ge 0 \quad \forall s,a$
    Let $\pi(a \mid s) = \Pr(a \mid s) = y(s,a) / \sum_a y(s,a)$
    return $\pi$
  • Variables: $y(s,a)$ for all $s,a$
    – frequency of each $(s,a)$-pair (proportional to $\pi$)
  • Objective: $\max_y \sum_{s,a} y(s,a)\, R(s,a)$
  • Constraints: $\sum_{a'} y(s',a') = w(s') + \gamma \sum_{s,a} \Pr(s' \mid s,a)\, y(s,a)$ for all $s'$
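
A matching sketch of dualLP with linprog, again under the assumed conventions above (dual_lp is an illustrative name). Note that linprog minimizes, so the objective is negated, and linprog's default bounds already enforce $y \ge 0$.

```python
import numpy as np
from scipy.optimize import linprog

def dual_lp(P, R, gamma, w):
    """Solve the dual LP of an MDP and recover a stochastic policy (sketch)."""
    S, A = R.shape
    c = -R.reshape(S * A)  # maximize sum y(s,a) R(s,a)  ==  minimize -R . y
    # One equality constraint per s':
    #   sum_a' y(s', a')  -  gamma * sum_{s,a} P(s'|s,a) y(s,a)  =  w(s')
    A_eq = np.zeros((S, S * A))
    for s in range(S):
        for a in range(A):
            col = s * A + a
            A_eq[s, col] += 1.0
            A_eq[:, col] -= gamma * P[a, s, :]
    res = linprog(c=c, A_eq=A_eq, b_eq=w)  # default bounds give y >= 0
    y = res.x.reshape(S, A)
    # With w > 0 every state has positive frequency, so the sums are nonzero
    pi = y / y.sum(axis=1, keepdims=True)  # pi(a|s) = y(s,a) / sum_a y(s,a)
    return pi, y
```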

  11. Duality
  • For every primal linear program of the form
    $\min_x c^\top x$ s.t. $Ax \ge b$
  • there is an equivalent dual linear program of the form
    $\max_y b^\top y$ s.t. $A^\top y = c$ and $y \ge 0$
  • where $\min_x c^\top x = \max_y b^\top y$
  • Interpretation for the MDP LPs: $x = V$, $c = w$, $y \propto \pi$, $A = [I - \gamma T_a]$ for all $a$, $b = [R_a]$ for all $a$
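
To make the interpretation concrete, here is the MDP primal written in that matrix form, stacking one block of constraints per action (a sketch; $T_a$ is the $|S| \times |S|$ transition matrix of action $a$ and $R_a$ its reward vector, following the slide's naming):

```latex
\min_{V}\; w^\top V
\quad \text{s.t.} \quad
\begin{bmatrix} I - \gamma T_{a_1} \\ \vdots \\ I - \gamma T_{a_{|A|}} \end{bmatrix} V
\;\ge\;
\begin{bmatrix} R_{a_1} \\ \vdots \\ R_{a_{|A|}} \end{bmatrix}
```

Taking the dual of this stacked system yields one equality constraint per state and one nonnegative variable $y(s,a)$ per stacked constraint row, which is exactly the dual LP of the previous slide.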

  12. State Frequency
  • Let $f(s)$ be the (discounted) frequency of $s$ under policy $\pi$:
    – 0 steps: $f^0(s) = w(s)$
    – 1 step: $f^1(s') = w(s') + \gamma \sum_s \Pr(s' \mid s, \pi(s))\, w(s)$
    – 2 steps: $f^2(s'') = w(s'') + \gamma \sum_{s'} \Pr(s'' \mid s', \pi(s'))\, w(s') + \gamma^2 \sum_{s,s'} \Pr(s' \mid s, \pi(s)) \Pr(s'' \mid s', \pi(s'))\, w(s)$
    – ...
    – n steps: $f^n(s_n) = w(s_n) + \gamma \sum_{s_{n-1}} \Pr(s_n \mid s_{n-1}, \pi(s_{n-1}))\, f^{n-1}(s_{n-1})$
    – ∞ steps: $f(s') = w(s') + \gamma \sum_s \Pr(s' \mid s, \pi(s))\, f(s)$
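
For a fixed deterministic policy, the infinite-horizon fixed point on the last line is a linear system and can be solved directly. A minimal sketch (state_frequency is an illustrative name; pi is an integer array with pi[s] the action chosen in state s):

```python
import numpy as np

def state_frequency(P, pi, gamma, w):
    """Discounted state frequencies of a deterministic policy (sketch).

    Solves  f(s') = w(s') + gamma * sum_s Pr(s'|s, pi(s)) f(s),
    i.e. the linear system  (I - gamma * P_pi^T) f = w.
    """
    S = len(w)
    P_pi = P[pi, np.arange(S), :]  # P_pi[s, t] = Pr(t | s, pi(s))
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, w)
```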

  13. State-Action Frequency
  • Let $y(s,a)$ be the state-action frequency: $y(s,a) = \pi(a \mid s)\, f(s)$, where $\pi(a \mid s) = \Pr(a \mid s)$ is a stochastic policy
  • Then the following equations are equivalent:
    $f(s') = w(s') + \gamma \sum_s \Pr(s' \mid s, \pi(s))\, f(s)$
    $\Leftrightarrow\; f^\pi(s') = w(s') + \gamma \sum_{s,a} \Pr(s' \mid s,a)\, \pi(a \mid s)\, f^\pi(s)$
    $\Leftrightarrow\; \sum_{a'} y(s',a') = w(s') + \gamma \sum_{s,a} \Pr(s' \mid s,a)\, y(s,a)$
  • The last equation is the constraint of the dual LP

  14. Policy
  • We can recover $\pi$ from $y$:
    $y(s,a) = \pi(a \mid s)\, f(s)$ (by definition)
    $\pi(a \mid s) = y(s,a) / f(s)$ (isolate $\pi$)
    $\pi(a \mid s) = y(s,a) / \sum_a y(s,a)$ (since $f(s) = \sum_a y(s,a)$ by definition)
  • $\pi$ may be stochastic
  • Actions with non-zero probability are necessarily optimal

  15. Objective
  • Duality theory guarantees that the objectives of the primal and dual LPs are equal:
    $\max_y \sum_{s,a} y(s,a)\, R(s,a) = \min_V \sum_s w(s)\, V(s)$
  • This means that $\sum_{s,a} y(s,a)\, R(s,a)$ implicitly measures the value of the optimal policy
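
This equality is easy to check numerically with the earlier sketches. The toy MDP below (hypothetical numbers) solves both LPs and compares the two objectives:

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers); uses the
# primal_lp and dual_lp sketches from earlier in this module.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.7, 0.3]]])        # P[a, s, t] = Pr(t | s, a)
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])          # R[s, a]
gamma, w = 0.95, np.array([0.5, 0.5])

V = primal_lp(P, R, gamma, w)
pi, y = dual_lp(P, R, gamma, w)

# Strong duality: both objectives equal the weighted optimal value
assert np.isclose(w @ V, (y * R).sum())
```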

  16. Solution Algorithms
  • Two broad classes of algorithms:
    – Simplex (searches the corners of the feasible region)
    – Interior point methods (iterative methods that move through the interior)
  • Polynomial complexity: linear programs can be solved in polynomial time, so solving MDPs is in P
  • Many packages for linear programming
    – CPLEX (robust, efficient, and free for academia)
