SLIDE 1 Policy Gradient
2020/5/22
SLIDE 2 Advantages of Policy-based RL
- Previously we focused on approximating the value or action-value function: $V_\theta(s) \approx V^\pi(s)$, $Q_\theta(s, a) \approx Q^\pi(s, a)$
- Policy gradient methods instead parameterize the policy directly: $\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$
SLIDE 3 Three Types of Reinforcement Learning
(Figure: taxonomy of RL approaches: model-based, value-based, and policy-based, with actor-critic in the overlap of the latter two)
- Value-based (e.g. DQN): learn a value function; the policy is implicit (e.g. greedy)
- Policy-based (e.g. Policy Gradient): no value function; learn the policy directly
- Actor-Critic: learn both a value function and a policy
SLIDE 4 (figure from Lex Fridman, MIT Deep Learning, https://deeplearning.mit.edu/)
SLIDE 5 Policy Objective Function
- Goal: given a policy $\pi_\theta(s, a)$ with parameters $\theta$, find the best $\theta$
- How to measure the quality of a policy?
$J(\theta) \doteq v_{\pi_\theta}(s_0) = \mathbb{E}\left[\sum_a \pi_\theta(a \mid s)\, q_{\pi_\theta}(s, a)\right]$
SLIDE 6 Short Corridor with Switched Actions
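This is Example 13.1 in Sutton & Barto: a corridor of three nonterminal states that all look identical under the chosen function approximation, and in the second state the action effects are reversed. No deterministic policy can act correctly in every state, so the optimal policy is stochastic (moving right with probability of roughly 0.59). A minimal sketch of the environment, assuming the transition and reward details from the book:

```python
# Minimal sketch of the short-corridor environment (details per the book):
# three nonterminal states 0-2, terminal goal state 3, reward -1 per step.
# In state 1 the actions are switched: "right" moves left and vice versa.
# In state 0, moving left causes no movement.
class ShortCorridor:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):               # action: 0 = left, 1 = right
        if self.state == 1:               # the switched state
            action = 1 - action
        if action == 1:
            self.state += 1
        else:
            self.state = max(0, self.state - 1)
        done = (self.state == 3)
        return self.state, -1.0, done

env = ShortCorridor()
env.reset()
print(env.step(1))   # -> (1, -1.0, False)
```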
SLIDE 7 Policy Optimization
- Policy-based RL is an optimization problem that can be solved by:
− Hill climbing
− Simplex / amoeba / Nelder-Mead
− Genetic algorithms
− Gradient descent
− Conjugate gradient
− Quasi-Newton
SLIDE 8 Computing Gradients By Finite Differences
- Estimate the k-th partial derivative of the objective function w.r.t. $\theta$
- Perturb $\theta$ by a small amount $\epsilon$ in the k-th dimension:
  $\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}$
  where $u_k$ is the unit vector with 1 in the k-th component, 0 elsewhere
- Simple, noisy, and inefficient, but sometimes works!
- Works for any kind of policy, even one that is not differentiable
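A minimal sketch of this estimator; the objective J below is a hypothetical stand-in, whereas in practice J(θ) would be a noisy policy-evaluation rollout:

```python
import numpy as np

# Sketch of the finite-difference gradient estimate: perturb one component
# of theta at a time and take the forward difference of the objective.
def finite_difference_gradient(J, theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0                      # unit vector in the k-th dimension
        grad[k] = (J(theta + eps * u_k) - J(theta)) / eps
    return grad

J = lambda theta: -np.sum((theta - 1.0) ** 2)      # stand-in objective
print(finite_difference_gradient(J, np.zeros(3)))  # approx. [2., 2., 2.]
```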
SLIDE 9 Score Function
- Assume the policy $\pi_\theta$ is differentiable whenever it is non-zero
- The score function is $\nabla_\theta \log \pi_\theta(s, a)$
- Likelihood ratio trick: $\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)$
SLIDE 10 Softmax Policy
- Softmax policy: weight actions by exponentiated preferences
  $\pi_\theta(a \mid s) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}$
- Use a linear approximation for the preferences: $h(s, a, \theta) = \theta^\top x(s, a)$
- The score function is then $\nabla_\theta \log \pi_\theta(a \mid s) = x(s, a) - \sum_b \pi_\theta(b \mid s)\, x(s, b)$
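A small sketch of this softmax policy and its score function; the one-hot feature function x() is a made-up example, not part of the slides:

```python
import numpy as np

# Sketch of a softmax policy with linear preferences h(s,a,theta) =
# theta^T x(s,a) and its score function.
def x(state, action, dim=4):
    feats = np.zeros(dim)
    feats[action] = 1.0                   # toy features: depend only on action
    return feats

def softmax_policy(theta, state, n_actions=2):
    prefs = np.array([theta @ x(state, a) for a in range(n_actions)])
    prefs -= prefs.max()                  # numerical stability
    return np.exp(prefs) / np.exp(prefs).sum()

def score(theta, state, action, n_actions=2):
    # grad log pi(a|s) = x(s,a) - sum_b pi(b|s) x(s,b)
    probs = softmax_policy(theta, state, n_actions)
    expected = sum(probs[b] * x(state, b) for b in range(n_actions))
    return x(state, action) - expected

theta = np.zeros(4)
print(softmax_policy(theta, 0))           # uniform: [0.5 0.5]
print(score(theta, 0, 1))                 # [-0.5  0.5  0.   0. ]
```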
SLIDE 11 Policy Gradient Theorem
- Generalized policy gradient (proof in Sutton & Barto, p. 325):
  $\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta)$
  where $\mu(s)$ is the on-policy state distribution
SLIDE 12 Proof of Policy Gradient Theorem (1/2)
SLIDE 13 Proof of Policy Gradient Theorem (2/2)
SLIDE 14 REINFORCE: Monte Carlo Policy Gradient
- REINFORCE update: $\theta_{t+1} \doteq \theta_t + \alpha\, \gamma^t\, G_t\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta_t)$
- Each action's log-probability is weighted by the full Monte Carlo return $G_t$
SLIDE 15 Pseudo Code of REINFORCE
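A compact sketch matching the book's REINFORCE pseudocode, assuming an environment with the reset()/step() interface of the ShortCorridor sketch above. The state-independent two-preference softmax policy fits the short corridor (all states look alike); the step size and episode count are illustrative choices, not the book's exact values:

```python
import numpy as np

# Sketch of REINFORCE (Monte Carlo policy gradient) following the book's
# pseudocode: generate an episode under pi_theta, then update
# theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t|S_t, theta).
def pi(theta):
    prefs = theta - theta.max()              # stabilized softmax
    return np.exp(prefs) / np.exp(prefs).sum()

def run_episode(env, theta, rng, max_steps=1000):
    env.reset()
    traj = []
    for _ in range(max_steps):
        a = rng.choice(2, p=pi(theta))
        _, r, done = env.step(a)
        traj.append((a, r))
        if done:
            break
    return traj

def reinforce(env, alpha=2e-4, gamma=1.0, n_episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(n_episodes):
        traj = run_episode(env, theta, rng)
        # compute returns G_t backwards through the episode
        G, returns = 0.0, []
        for _, r in reversed(traj):
            G = gamma * G + r
            returns.append(G)
        returns.reverse()
        for t, ((a, _), G_t) in enumerate(zip(traj, returns)):
            grad_log = -pi(theta)
            grad_log[a] += 1.0               # softmax score function
            theta += alpha * (gamma ** t) * G_t * grad_log
    return theta

# Usage: theta = reinforce(ShortCorridor()); pi(theta)[1] should drift
# toward the optimal right-probability (about 0.59 in the book's figure).
```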
SLIDE 16 REINFORCE on Short Corridor
SLIDE 17 REINFORCE with Baseline
- Include an arbitrary baseline function b(s):
  $\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \big(q_\pi(s, a) - b(s)\big)\, \nabla_\theta \pi(a \mid s, \theta)$
− The equation remains valid because the subtracted term is zero:
  $\sum_a b(s)\, \nabla_\theta \pi(a \mid s, \theta) = b(s)\, \nabla_\theta \sum_a \pi(a \mid s, \theta) = b(s)\, \nabla_\theta 1 = 0$
SLIDE 18 Gradient of REINFORCE with Baseline
- Update: $\theta_{t+1} \doteq \theta_t + \alpha\, \gamma^t\, \big(G_t - b(S_t)\big)\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta_t)$, typically with a learned state value $b(S_t) = \hat{v}(S_t, \mathbf{w})$
SLIDE 19 Baseline Can Help to Learn Faster
SLIDE 20 Actor-Critic Methods
- REINFORCE with a baseline cannot bootstrap: it still waits for the complete Monte Carlo return $G_t$
− Use the learned state-value function not only as the baseline but also to bootstrap a one-step target -> Actor-Critic (see the sketch below)
  $\delta_t = R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}), \qquad \theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)$
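A minimal sketch of this per-step update, the one-step (episodic) actor-critic. The helpers v_hat, grad_v, and grad_log_pi stand for whatever critic and policy parameterizations are chosen; the names and step sizes are assumptions for illustration:

```python
# Sketch of the per-step update in one-step actor-critic (episodic case).
# v_hat(s, w), grad_v(s, w), and grad_log_pi(s, a, theta) are assumed
# callables; disc carries the gamma^t factor across the episode.
def actor_critic_step(s, a, r, s_next, done, theta, w, disc,
                      v_hat, grad_v, grad_log_pi,
                      alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    target = r if done else r + gamma * v_hat(s_next, w)
    delta = target - v_hat(s, w)             # TD error: critique of the actor
    w = w + alpha_w * delta * grad_v(s, w)   # critic update
    theta = theta + alpha_theta * disc * delta * grad_log_pi(s, a, theta)
    return theta, w, disc * gamma            # advance gamma^t
```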
SLIDE 21
SLIDE 22 Policy Gradient for Continuing Problems
- Continuing problems have no episode boundaries, so episodic returns are unavailable
− Use the average reward per time step as the objective:
  $J(\theta) \doteq r(\pi) = \sum_s \mu(s) \sum_a \pi(a \mid s, \theta) \sum_{s', r} p(s', r \mid s, a)\, r$
− Learn the differential value function with TD(λ), i.e. eligibility traces (next slide)
SLIDE 23 Actor-Critic with Eligibility Traces
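In the same style, a sketch of the per-step update for the continuing-case actor-critic with eligibility traces (Sutton & Barto, Sec. 13.6); the approximator callbacks, trace-decay rates, and step sizes are again placeholder assumptions:

```python
# Sketch of actor-critic with eligibility traces for the continuing
# (average-reward) setting. v_hat, grad_v, and grad_log_pi are assumed
# callables, as in the one-step version above.
def actor_critic_traces_step(s, a, r, s_next, theta, w, z_theta, z_w, avg_r,
                             v_hat, grad_v, grad_log_pi,
                             alpha_theta=1e-3, alpha_w=1e-2, alpha_r=1e-2,
                             lam_theta=0.9, lam_w=0.9):
    delta = r - avg_r + v_hat(s_next, w) - v_hat(s, w)  # differential TD error
    avg_r = avg_r + alpha_r * delta             # track average-reward estimate
    z_w = lam_w * z_w + grad_v(s, w)            # critic eligibility trace
    w = w + alpha_w * delta * z_w
    z_theta = lam_theta * z_theta + grad_log_pi(s, a, theta)  # actor trace
    theta = theta + alpha_theta * delta * z_theta
    return theta, w, z_theta, z_w, avg_r
```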
SLIDE 24 Policy Parameterization for Continuous Actions
- Learn the parameters of a probability distribution over actions instead of action probabilities, e.g. a Gaussian:
  $\pi(a \mid s, \theta) \doteq \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} \exp\!\left(-\frac{(a - \mu(s, \theta))^2}{2\sigma(s, \theta)^2}\right)$
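A small sketch of such a Gaussian policy for a scalar action, with a linear mean and an exponentiated-linear standard deviation so that sigma stays positive; the feature vector x_s is a made-up placeholder, while the score formulas are the standard closed forms:

```python
import numpy as np

# Sketch of a Gaussian policy for a scalar continuous action:
# pi(a|s) = N(mu(s), sigma(s)^2) with mu = theta_mu^T x(s) and
# sigma = exp(theta_sigma^T x(s)).
def gaussian_sample(x_s, theta_mu, theta_sigma, rng):
    mu = theta_mu @ x_s
    sigma = np.exp(theta_sigma @ x_s)       # exp keeps sigma > 0
    return rng.normal(mu, sigma), mu, sigma

def gaussian_score(a, x_s, mu, sigma):
    # closed-form grad log pi w.r.t. theta_mu and theta_sigma
    grad_mu = (a - mu) / sigma**2 * x_s
    grad_sigma = ((a - mu)**2 / sigma**2 - 1.0) * x_s
    return grad_mu, grad_sigma

rng = np.random.default_rng(0)
x_s = np.array([1.0, 0.5])
a, mu, sigma = gaussian_sample(x_s, np.zeros(2), np.zeros(2), rng)
print(a, gaussian_score(a, x_s, mu, sigma))
```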
SLIDE 25 Reference
- 1. David Silver, Lecture 7: Policy Gradient
- 2. Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction," 2nd edition, Nov. 2018, Chapter 13