  1. Dynamic Programming, Prof. Kuan-Ting Lai, 2020/4/10

  2. Dynamic Programming
  • Dynamic programming is for problems with two properties:
    1. Optimal substructure
      • The optimal solution can be decomposed into solutions of subproblems
    2. Overlapping subproblems
      • Subproblems recur many times
      • Solutions can be cached and reused (see the memoization sketch below)
  • Examples:
    − Shortest path, Tower of Hanoi, ...
    − Markov Decision Processes
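To make the two properties concrete, here is a minimal Python memoization sketch. Fibonacci numbers are not mentioned in the slides; the example is chosen only because its recursion makes optimal substructure and overlapping subproblems easy to see:

    from functools import lru_cache

    # fib(n) has optimal substructure: it decomposes into fib(n-1) and fib(n-2).
    # The same subproblems recur many times, so caching (memoizing) the results
    # turns an exponential-time recursion into a linear-time one.
    @lru_cache(maxsize=None)
    def fib(n: int) -> int:
        if n < 2:
            return n
        return fib(n - 1) + fib(n - 2)

    print(fib(40))  # 102334155, computed almost instantly thanks to the cache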

  3. Sutton, Richard S. and Barto, Andrew G., Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning series), p. 189

  4. Dynamic Programming for MDPs
  • The Bellman equation gives a recursive decomposition (written out below)
  • The value function stores and reuses solutions to subproblems
  • Dynamic programming assumes full knowledge of the MDP
  • Used for model-based planning
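The recursive decomposition referred to above is the Bellman expectation equation; in standard Sutton-and-Barto notation (not copied verbatim from the slide):

    v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]

Dynamic programming turns this fixed-point equation into an iterative update, as the next slide shows.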

  5. Policy Evaluation (Prediction)
  • Compute the state-value function $v_\pi$ for an arbitrary policy $\pi$
  • Can be solved iteratively:
    $v_{k+1}(s) \leftarrow \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_k(S_{t+1}) \mid S_t = s \right]$
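A minimal sketch of iterative policy evaluation for a tabular MDP. The model format (P[s][a] as a list of (prob, next_state, reward, done) tuples, similar to the gym toy-text convention) and all names here are assumptions made for illustration, not taken from the slides:

    import numpy as np

    def policy_evaluation(P, policy, gamma=1.0, theta=1e-8):
        """Sweep v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a)[r + gamma*v_k(s')] until convergence.

        P[s][a]      : list of (prob, next_state, reward, done) tuples (assumed model format)
        policy[s][a] : probability of taking action a in state s
        """
        n_states = len(P)
        v = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = 0.0
                for a, pi_sa in enumerate(policy[s]):
                    for prob, s_next, reward, done in P[s][a]:
                        # Terminal successors contribute no future value.
                        v_new += pi_sa * prob * (reward + gamma * (0.0 if done else v[s_next]))
                delta = max(delta, abs(v_new - v[s]))
                v[s] = v_new          # in-place (Gauss-Seidel style) update
            if delta < theta:
                return v

The small grid world on the next slide is a natural test case for this sketch.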

  6. Policy Evaluation in Small Grid World
  • One terminal state (shown twice, as the shaded squares)
  • Actions leading out of the grid leave the state unchanged
  • Reward is -1 until the terminal state is reached (see the environment sketch below)
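A sketch of this grid world in the same assumed (prob, next_state, reward, done) model format, so it can be fed directly to the policy_evaluation sketch above. The 4x4 size and the equiprobable random policy follow the standard example in Sutton and Barto; treat the details as illustrative assumptions:

    import numpy as np

    N = 4                                          # 4x4 grid, states 0..15
    TERMINALS = {0, 15}                            # the terminal state, shown as the two shaded corners
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    P = {}
    for s in range(N * N):
        row, col = divmod(s, N)
        P[s] = {}
        for a, (dr, dc) in enumerate(ACTIONS):
            if s in TERMINALS:
                P[s][a] = [(1.0, s, 0.0, True)]    # terminal: no further reward
                continue
            r2, c2 = row + dr, col + dc
            if not (0 <= r2 < N and 0 <= c2 < N):
                r2, c2 = row, col                  # moves off the grid leave the state unchanged
            s2 = r2 * N + c2
            P[s][a] = [(1.0, s2, -1.0, s2 in TERMINALS)]   # reward -1 on every step

    random_policy = np.full((N * N, len(ACTIONS)), 0.25)   # equiprobable random policy
    # v = policy_evaluation(P, random_policy, gamma=1.0)   # using the sketch from the previous slide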

  7. How to Improve a Policy
  1. Evaluate the policy:
     $v_\pi(s) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \cdots \mid S_t = s \right]$
  2. Improve the policy by acting greedily with respect to $v_\pi$:
     $\pi' = \mathrm{greedy}(v_\pi)$
  • This process of policy iteration always converges to the optimal policy $\pi^*$ (a greedy-improvement sketch follows below)
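A sketch of the greedy improvement step $\pi' = \mathrm{greedy}(v_\pi)$: a one-step lookahead with the model, then all probability is put on a best action. Same assumed model format and illustrative names as above:

    import numpy as np

    def greedy_policy(P, v, gamma=1.0):
        """Return a deterministic policy that acts greedily with respect to v."""
        n_states, n_actions = len(P), len(P[0])
        policy = np.zeros((n_states, n_actions))
        for s in range(n_states):
            # One-step lookahead: q(s, a) = sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward, done in P[s][a]:
                    q[a] += prob * (reward + gamma * (0.0 if done else v[s_next]))
            policy[s, np.argmax(q)] = 1.0          # act greedily: all probability on a best action
        return policy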

  8. Policy Iteration
  • Policy evaluation: estimate $v_\pi$
  • Policy improvement: generate $\pi' \geq \pi$ (the full loop is sketched below)
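Putting the two steps together gives the policy iteration loop. This compact sketch assumes the illustrative policy_evaluation and greedy_policy helpers defined above:

    import numpy as np

    def policy_iteration(P, gamma=1.0):
        """Alternate evaluation and greedy improvement until the policy is stable."""
        n_states, n_actions = len(P), len(P[0])
        policy = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the random policy
        while True:
            v = policy_evaluation(P, policy, gamma)    # evaluate the current policy
            new_policy = greedy_policy(P, v, gamma)    # improve greedily with respect to v
            if np.array_equal(new_policy, policy):     # stable policy => policy and v are optimal
                return new_policy, v
            policy = new_policy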

  9. Jack’s Car Rental

  10. Policy Improvement (1)

  11. Policy Improvement (2)

  12. Modified Policy Iteration
  • Do we need to iterate evaluation until $v_\pi$ converges?
  • Can we simply stop after k iterations?
    − Example: the small grid world achieves the optimal policy after k = 3 iterations
  • Update the policy every iteration? => Value Iteration

  13. Value Iteration
  • Update the value function $v$ only; the policy function $\pi$ is not computed explicitly
  • The policy is implicitly built from $v$ (see the sketch below)
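A minimal value iteration sketch under the same assumed model format: each sweep applies the Bellman optimality backup $v(s) \leftarrow \max_a \sum_{s',r} p(s',r \mid s,a)\,[r + \gamma\, v(s')]$, and a greedy policy is only extracted from $v$ at the end:

    import numpy as np

    def value_iteration(P, gamma=1.0, theta=1e-8):
        """Iterate the Bellman optimality backup, then read the greedy policy off the result."""
        n_states, n_actions = len(P), len(P[0])

        def lookahead(s, v):
            # q(s, a) for all actions via a one-step lookahead with the model
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward, done in P[s][a]:
                    q[a] += prob * (reward + gamma * (0.0 if done else v[s_next]))
            return q

        v = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                best = lookahead(s, v).max()       # Bellman optimality backup
                delta = max(delta, abs(best - v[s]))
                v[s] = best
            if delta < theta:
                break

        # The policy is implicit in v: act greedily with a one-step lookahead.
        policy = np.zeros((n_states, n_actions))
        for s in range(n_states):
            policy[s, np.argmax(lookahead(s, v))] = 1.0
        return v, policy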

  14. Shortest Path Example

  15. Policy Iteration vs. Value Iteration
  • Policy iteration
  • Value iteration

  16. References
  • David Silver, Lecture 3: Planning by Dynamic Programming (https://www.youtube.com/watch?v=Nd1-UUMVfz4&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=3)
  • Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction,” 2nd edition, Nov. 2018, Chapter 4
