Policy Gradient, Prof. Kuan-Ting Lai, 2020/5/22 (PowerPoint presentation)



  1. Policy Gradient Prof. Kuan-Ting Lai 2020/5/22

  2. Advantages of Policy-based RL • Previously we focused on approximating the value or action-value function: • Policy gradient methods instead parameterize the policy directly (see the comparison below):
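
     For concreteness, one way to write the contrast on this slide (notation as in Sutton & Barto, Ch. 13):

        \hat{v}(s, \mathbf{w}) \approx v_\pi(s), \quad \hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)
        \qquad \text{vs.} \qquad
        \pi(a \mid s, \boldsymbol{\theta}) = \Pr\{ A_t = a \mid S_t = s, \boldsymbol{\theta}_t = \boldsymbol{\theta} \}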

  3. 3 Types of Reinforcement Learning • Value-based − Learn value function − Implicit policy (e.g., DQN) • Policy-based − No value function − Learn policy directly (e.g., Policy Gradient) • Actor-critic − Learn both value function and policy [Diagram: overlapping circles for Value-based (DQN) and Policy-based (Policy Gradient), with Actor-critic in the intersection]

  4. Lex Fridman, MIT Deep Learning, https://deeplearning.mit.edu/

  5. Policy Objective Function • Goal: given a policy $\pi_\theta(s, a)$ with parameters $\theta$, find the best $\theta$ • How to measure the quality of a policy? Use the value of the start state: $J(\theta) \doteq v_{\pi_\theta}(s_0) = \mathbb{E}\big[ \sum_a \pi_\theta(a \mid s)\, q_{\pi_\theta}(s, a) \big]$

  6. Short Corridor with Switched Actions
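
     Since later slides return to this example, here is a minimal Python sketch of the short-corridor gridworld (Sutton & Barto, Example 13.1): three nonterminal states, reward -1 per step, and the two actions reversed in the middle state. The class name and the reset/step interface are illustrative choices, not from the slides.

        class ShortCorridor:
            """Short corridor with switched actions (Sutton & Barto, Example 13.1).
            Nonterminal states 0, 1, 2; reaching state 3 terminates the episode.
            In state 1 the effect of left/right is reversed; reward is -1 per step."""

            def reset(self):
                self.state = 0
                return self.state

            def step(self, action):          # action: 0 = left, 1 = right
                if self.state == 1:          # the "switched" state
                    action = 1 - action
                if action == 1:
                    self.state += 1
                else:
                    self.state = max(0, self.state - 1)
                done = (self.state == 3)
                return self.state, -1.0, done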

  7. Policy Optimization • Policy-based RL is an optimization problem that can be solved by: − Hill climbing − Simplex / amoeba / Nelder-Mead − Genetic algorithms − Gradient descent − Conjugate gradient − Quasi-Newton

  8. Computing Gradients by Finite Differences • Estimate the k-th partial derivative of the objective function w.r.t. θ • Perturb θ by a small amount ε in the k-th dimension, where u_k is the unit vector with 1 in the k-th component and 0 elsewhere • Simple, noisy, inefficient, but sometimes works! • Works for all kinds of policies, even if the policy is not differentiable
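
     A minimal sketch of this estimator, assuming a hypothetical evaluate_policy(theta) routine (not from the slides) that returns a noisy estimate of the objective J(theta), e.g. the average return over a few rollouts, and a NumPy array theta:

        import numpy as np

        def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
            """Estimate each partial derivative of J(theta) by perturbing theta
            along one coordinate at a time (simple and noisy, but works even for
            non-differentiable policies)."""
            grad = np.zeros_like(theta)
            J = evaluate_policy(theta)
            for k in range(len(theta)):
                u_k = np.zeros_like(theta)
                u_k[k] = 1.0                  # unit vector in the k-th dimension
                grad[k] = (evaluate_policy(theta + eps * u_k) - J) / eps
            return grad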

  9. Score Function • Assume $\pi_\theta$ is differentiable whenever it is non-zero • The score function is $\nabla_\theta \log \pi_\theta(s, a)$
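
     The score function matters because of the likelihood-ratio identity, which holds wherever $\pi_\theta$ is non-zero:

        \nabla_\theta \pi_\theta(s, a)
        = \pi_\theta(s, a) \, \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}
        = \pi_\theta(s, a) \, \nabla_\theta \log \pi_\theta(s, a)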

  10. Softmax Policy • Softmax over action preferences • Use a linear function approximation for the preferences (a sketch follows below)
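
     A minimal sketch of a linear softmax policy and its score function, assuming preferences linear in the features, h(s, a, θ) = θᵀ x(s, a), so that the score is ∇_θ log π_θ(s, a) = x(s, a) − Σ_b π_θ(b|s) x(s, b). The function names and the features argument (one row of features per action) are illustrative choices.

        import numpy as np

        def softmax_policy(theta, features):
            """features[a] is the feature vector x(s, a); preferences h = features @ theta."""
            prefs = features @ theta
            prefs = prefs - prefs.max()          # shift for numerical stability
            probs = np.exp(prefs)
            return probs / probs.sum()

        def score(theta, features, action):
            """Score of the linear softmax policy:
            grad_theta log pi(a|s) = x(s, a) - sum_b pi(b|s) x(s, b)."""
            probs = softmax_policy(theta, features)
            return features[action] - probs @ features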

  11. Policy Gradient Theorem • Generalized policy gradient (proof in Sutton & Barto, 2nd ed., p. 325)
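
     For reference, the episodic form of the theorem proved there (Sutton & Barto, Eq. 13.5), where μ is the on-policy state distribution:

        \nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a) \, \nabla_\theta \pi(a \mid s, \theta)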

  12. Proof of Policy Gradient Theorem (2-1)

  13. Proof of Policy Gradient Theorem (2-2)

  14. REINFORCE: Monte Carlo Policy Gradient • REINFORCE update
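
     The REINFORCE update referred to here is (Sutton & Barto, Eq. 13.8), with G_t the return from time t:

        \theta_{t+1} = \theta_t + \alpha \, \gamma^t \, G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta_t)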

  15. Pseudo Code of REINFORCE
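
     A minimal Python sketch of that pseudocode, assuming an environment with the reset/step interface sketched earlier and hypothetical policy_probs(theta, s) and score(theta, s, a) helpers (e.g. the softmax-policy sketch above wrapped with per-state features); none of these names come from the slides.

        import numpy as np

        def reinforce(env, policy_probs, score, theta,
                      num_episodes=1000, alpha=2e-4, gamma=1.0):
            """Monte Carlo policy gradient (REINFORCE), per Sutton & Barto Sec. 13.3."""
            for _ in range(num_episodes):
                # Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T following pi
                states, actions, rewards = [], [], []
                s, done = env.reset(), False
                while not done:
                    probs = policy_probs(theta, s)
                    a = np.random.choice(len(probs), p=probs)
                    s_next, r, done = env.step(a)
                    states.append(s); actions.append(a); rewards.append(r)
                    s = s_next
                # Compute the return G_t for every step, then apply the REINFORCE update
                G, returns = 0.0, []
                for r in reversed(rewards):
                    G = r + gamma * G
                    returns.append(G)
                returns.reverse()
                for t, (s_t, a_t) in enumerate(zip(states, actions)):
                    theta = theta + alpha * (gamma ** t) * returns[t] * score(theta, s_t, a_t)
            return theta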

  16. REINFORCE on Short Corridor

  17. REINFORCE with Baseline • Include an arbitrary baseline function b(s) − The equation remains valid because the baseline contributes zero to the gradient in expectation (see the identity below)
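
     The subtraction is valid because the action probabilities sum to one, so the baseline term vanishes:

        \sum_a b(s) \, \nabla_\theta \pi(a \mid s, \theta)
        = b(s) \, \nabla_\theta \sum_a \pi(a \mid s, \theta)
        = b(s) \, \nabla_\theta 1 = 0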

  18. Gradient of REINFORCE with Baseline
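
     Written out, the resulting update (Sutton & Barto, Sec. 13.4) subtracts a learned state-value baseline $\hat{v}(S_t, \mathbf{w})$ from the return:

        \delta_t = G_t - \hat{v}(S_t, \mathbf{w}), \qquad
        \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \, \delta_t \, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}), \qquad
        \theta \leftarrow \theta + \alpha^{\theta} \, \gamma^t \, \delta_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)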

  19. Baseline Can Help to Learn Faster

  20. Actor-Critic Methods • The baseline in REINFORCE does not bootstrap − Use a learned state-value function as a bootstrapping critic as well as a baseline -> Actor-Critic
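
     A minimal sketch of the resulting one-step actor-critic update (Sutton & Barto, Sec. 13.5), assuming a linear critic and the hypothetical state_features and score helpers used in the earlier sketches:

        import numpy as np

        def actor_critic_step(theta, w, s, a, r, s_next, done, I,
                              state_features, score,
                              alpha_theta=1e-3, alpha_w=1e-2, gamma=1.0):
            """One step of online actor-critic with a linear state-value critic.
            The critic v(s) = w @ x(s) supplies the TD error used by the actor."""
            x_s = state_features(s)
            v_s = w @ x_s
            v_next = 0.0 if done else w @ state_features(s_next)
            delta = r + gamma * v_next - v_s                  # one-step TD error
            w = w + alpha_w * delta * x_s                     # critic update
            theta = theta + alpha_theta * I * delta * score(theta, s, a)  # actor update
            return theta, w, gamma * I                        # I accumulates gamma^t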

  21. Policy Gradient for Continuing Problems • Continuing problem (no episode boundaries) − Use the average reward per time step: TD(λ)
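
     In the average-reward formulation the objective and the TD error become (Sutton & Barto, Sec. 13.6), with $\bar{R}$ an estimate of the average reward $r(\pi)$:

        J(\theta) = r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}[R_t], \qquad
        \delta_t = R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})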

  22. Actor-Critic with Eligibility Traces
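
     The trace version replaces the single-step gradients with accumulating eligibility traces for both the critic and the actor (Sutton & Barto, Secs. 13.5-13.6):

        \mathbf{z}^{\mathbf{w}} \leftarrow \gamma \lambda^{\mathbf{w}} \mathbf{z}^{\mathbf{w}} + \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}), \qquad
        \mathbf{z}^{\theta} \leftarrow \gamma \lambda^{\theta} \mathbf{z}^{\theta} + I \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)

        \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \, \delta_t \, \mathbf{z}^{\mathbf{w}}, \qquad
        \theta \leftarrow \theta + \alpha^{\theta} \, \delta_t \, \mathbf{z}^{\theta}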

  23. Policy Parameterization for Continuous Actions
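
     A common parameterization for continuous actions (Sutton & Barto, Sec. 13.7) is a Gaussian policy whose mean and standard deviation are learned functions of the state:

        \pi(a \mid s, \theta) = \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} \exp\!\left( -\frac{(a - \mu(s, \theta))^2}{2\sigma(s, \theta)^2} \right),
        \quad \mu(s, \theta) = \theta_\mu^\top x_\mu(s), \quad \sigma(s, \theta) = \exp\!\big(\theta_\sigma^\top x_\sigma(s)\big)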

  24. References 1. David Silver, "Lecture 7: Policy Gradient." 2. Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction," 2nd edition, Nov. 2018, Chapter 13.
