SLIDE 1

Policy Gradient

  • Prof. Kuan-Ting Lai

2020/5/22

SLIDE 2

Advantages of Policy-based RL

  • Previously we focused on approximating the value or action-value function:
  • Policy Gradient methods focus on parameterizing the policy directly:
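The two formulas on this slide were images; in Sutton & Barto's notation they would read:

$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s) \qquad \text{or} \qquad \hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$$

versus learning a parameterized policy directly:

$$\pi(a \mid s, \theta) = \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\}$$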
SLIDE 3

3 Types of Reinforcement Learning

[Figure: Venn diagram of the three types of RL (Model-based, Policy-based, Value-based), with DQN under Value-based, Policy Gradient under Policy-based, and Actor-critic at the overlap of Value-based and Policy-based.]

  • Value-based
    − Learn value function
    − Implicit policy
  • Policy-based
    − No value function
    − Learn policy directly
  • Actor-critic
    − Learn both value and policy functions

SLIDE 4

Lex Fridman, MIT Deep Learning, https://deeplearning.mit.edu/

SLIDE 5

Policy Objective Function

  • Goal: given a policy $\pi_\theta(s, a)$ with parameters $\theta$, find the best $\theta$
  • How to measure the quality of a policy?

$$J(\theta) \doteq v_{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta}\Big[\sum_a \pi(a \mid s)\, q_\pi(s, a)\Big]$$
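A minimal sketch of estimating this objective by Monte Carlo rollouts; the `env` and `policy` interfaces here are hypothetical stand-ins, not from the slides:

```python
import numpy as np

def estimate_objective(env, policy, theta, n_episodes=100, gamma=1.0):
    """Estimate J(theta) = v_{pi_theta}(s0) by averaging sampled returns."""
    returns = []
    for _ in range(n_episodes):
        state, done = env.reset(), False
        G, discount = 0.0, 1.0
        while not done:
            action = policy.sample(state, theta)     # a ~ pi(.|s, theta)
            state, reward, done = env.step(action)
            G += discount * reward
            discount *= gamma
        returns.append(G)
    return np.mean(returns)
```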

SLIDE 6

Short Corridor with Switched Actions
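The figure on this slide is Example 13.1 from Sutton & Barto: a three-state corridor with reward −1 per step, where the middle state reverses the effect of the two actions, and where function approximation makes all states look identical, so the best policy is necessarily stochastic (moving right with probability of roughly 0.59). A minimal sketch of that environment:

```python
class ShortCorridor:
    """Short corridor with switched actions (Sutton & Barto, Example 13.1).

    States 0, 1, 2 are nonterminal; state 3 is the goal. Reward is -1 on
    every step. In state 1 the actions are reversed: 'right' moves left
    and vice versa. Under the intended function approximation all states
    share the same features, so the agent cannot tell them apart and a
    deterministic policy performs poorly.
    """
    RIGHT, LEFT = 0, 1

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == self.RIGHT else -1
        if self.state == 1:              # switched actions here
            move = -move
        self.state = max(0, self.state + move)
        done = (self.state == 3)
        return self.state, -1.0, done
```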

SLIDE 7

Policy Optimization

  • Policy-based RL is an optimization problem that can be solved by:

    − Hill climbing
    − Simplex / amoeba / Nelder-Mead
    − Genetic algorithms
    − Gradient descent
    − Conjugate gradient
    − Quasi-Newton

SLIDE 8

Computing Gradients By Finite Differences

  • Estimate the k-th partial derivative of the objective function w.r.t. θ
  • by perturbing θ by a small amount ε in the k-th dimension
  • Simple, noisy, and inefficient, but it sometimes works!
  • Works for all kinds of policies, even when the policy is not differentiable

$$\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon\, u_k) - J(\theta)}{\epsilon}$$

where $u_k$ is the unit vector with 1 in the k-th component and 0 elsewhere
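A minimal sketch of this estimator, assuming a black-box `evaluate(theta)` that returns a (typically noisy) rollout estimate of J(θ):

```python
import numpy as np

def finite_difference_gradient(evaluate, theta, eps=1e-2):
    """Estimate grad J(theta) one coordinate at a time.

    Requires n objective evaluations for n parameters, and each
    evaluation is itself a noisy rollout estimate: simple but costly.
    """
    grad = np.zeros_like(theta)
    base = evaluate(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0                 # unit vector in the k-th dimension
        grad[k] = (evaluate(theta + eps * u_k) - base) / eps
    return grad
```

Note that no differentiability of the policy is needed: the objective is probed only through evaluations.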

SLIDE 9

Score Function

  • Assume the policy $\pi_\theta$ is differentiable whenever it is non-zero
  • The score function is $\nabla_\theta \log \pi_\theta(s, a)$
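The reason the score function matters is the likelihood-ratio trick from Silver's Lecture 7: the gradient of the policy can be rewritten in terms of the policy itself,

$$\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a)\, \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)$$

which turns gradients of expectations under the policy into expectations of the score weighted by returns, so they can be estimated from samples.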
SLIDE 10

Softmax Policy

  • Softmax function: weight actions by exponentiated preferences, $\pi_\theta(s, a) \propto e^{\phi(s, a)^\top \theta}$
  • Use a linear approximation for the action preferences: $\phi(s, a)^\top \theta$
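A minimal sketch of this policy in Python, where `phi(state, action)` is a hypothetical feature function returning a vector. For the softmax with linear preferences, the score works out to $\nabla_\theta \log \pi_\theta(s, a) = \phi(s, a) - \sum_b \pi_\theta(s, b)\, \phi(s, b)$:

```python
import numpy as np

def softmax_policy(phi, state, actions, theta):
    """pi(a|s) proportional to exp(phi(s,a)^T theta)."""
    prefs = np.array([phi(state, a) @ theta for a in actions])
    prefs -= prefs.max()                 # for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def score(phi, state, actions, theta, action_idx):
    """grad_theta log pi(a|s): feature minus expected feature."""
    probs = softmax_policy(phi, state, actions, theta)
    feats = np.array([phi(state, a) for a in actions])
    return feats[action_idx] - probs @ feats
```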
SLIDE 11

Policy Gradient Theorem

  • Generalized policy gradient (proof in Sutton & Barto, p. 325)
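As stated in Sutton & Barto (eq. 13.5):

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta)$$

where $\mu$ is the on-policy state distribution under $\pi$. Crucially, the gradient involves no derivative of the state distribution, which is what makes sample-based estimation practical.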
SLIDE 12

Proof of Policy Gradient Theorem (2-1)

SLIDE 13

Proof of Policy Gradient Theorem (2-2)

SLIDE 14

REINFORCE: Monte Carlo Policy Gradient

REINFORCE Update
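The update (Sutton & Barto, eq. 13.8) moves the parameters in the direction of the score, scaled by the full Monte Carlo return:

$$\theta_{t+1} \doteq \theta_t + \alpha\, \gamma^t\, G_t\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta_t)$$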

SLIDE 15

Pseudo Code of REINFORCE
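The pseudocode on this slide was an image; a minimal Python sketch of the same algorithm, reusing the `softmax_policy` and `score` helpers from the softmax-policy slide and the hypothetical `env` interface sketched earlier:

```python
import numpy as np

def reinforce(env, phi, actions, theta, alpha=2e-4, gamma=1.0,
              n_episodes=1000):
    """Monte Carlo policy gradient: update theta from complete episodes."""
    for _ in range(n_episodes):
        # Generate one full episode following pi(.|., theta).
        states, chosen, rewards = [], [], []
        state, done = env.reset(), False
        while not done:
            probs = softmax_policy(phi, state, actions, theta)
            a_idx = np.random.choice(len(actions), p=probs)
            states.append(state)
            chosen.append(a_idx)
            state, reward, done = env.step(actions[a_idx])
            rewards.append(reward)

        # Walk backwards to form each return G_t, then apply the update.
        G = 0.0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            theta += alpha * (gamma ** t) * G * score(
                phi, states[t], actions, theta, chosen[t])
    return theta
```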

SLIDE 16

REINFORCE on Short Corridor

SLIDE 17

REINFORCE with Baseline

  • Include an arbitrary baseline function b(s)

− The equation remains valid because the subtracted baseline term contributes nothing to the gradient:
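Spelling this out: the baseline does not depend on the action, so it factors out of the sum over actions, and the probabilities sum to one,

$$\sum_a b(s)\, \nabla_\theta \pi(a \mid s, \theta) = b(s)\, \nabla_\theta \sum_a \pi(a \mid s, \theta) = b(s)\, \nabla_\theta 1 = 0$$

so the baseline changes the variance of the gradient estimate but not its expectation.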

SLIDE 18

Gradient of REINFORCE with Baseline
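This slide's content was an image; the corresponding update from Sutton & Barto (Section 13.4), using a learned state value $\hat{v}(S_t, \mathbf{w})$ as the baseline, is:

$$\delta \doteq G_t - \hat{v}(S_t, \mathbf{w}), \qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \nabla \hat{v}(S_t, \mathbf{w}), \qquad \theta \leftarrow \theta + \alpha^{\theta}\, \gamma^t\, \delta\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)$$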

SLIDE 19

Baseline Can Help to Learn Faster

SLIDE 20

Actor-Critic Methods

  • The baseline in REINFORCE does not bootstrap; updates still use the full Monte Carlo return

− Use a learned state-value function to bootstrap the target → Actor-Critic (see the sketch below)
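A minimal sketch of one-step actor-critic with a linear critic, following the episodic algorithm in Sutton & Barto (Section 13.5); the value-feature function `x` is a hypothetical stand-in, and `softmax_policy`/`score` are the helpers from the softmax-policy slide:

```python
import numpy as np

def one_step_actor_critic(env, phi, x, actions, theta, w,
                          alpha_theta=1e-3, alpha_w=1e-2, gamma=1.0,
                          n_episodes=1000):
    """Actor-critic: the critic v(s) = x(s)^T w bootstraps the target."""
    for _ in range(n_episodes):
        state, done, I = env.reset(), False, 1.0
        while not done:
            probs = softmax_policy(phi, state, actions, theta)
            a_idx = np.random.choice(len(actions), p=probs)
            next_state, reward, done = env.step(actions[a_idx])

            # One-step TD error: bootstrapped target minus current value.
            v = x(state) @ w
            v_next = 0.0 if done else x(next_state) @ w
            delta = reward + gamma * v_next - v

            w += alpha_w * delta * x(state)            # critic update
            theta += alpha_theta * I * delta * score(  # actor update
                phi, state, actions, theta, a_idx)
            I *= gamma
            state = next_state
    return theta, w
```

Unlike REINFORCE, the parameters are updated at every step rather than at the end of the episode, at the cost of bias from the bootstrapped target.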

SLIDE 22

Policy Gradient for Continuing Problems

  • Continuing problems have no episode boundaries

− Use the average reward per time step as the objective (learned with TD(λ))
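In the average-reward formulation (Sutton & Barto, Section 13.6), performance is the reward rate, and returns are measured relative to it:

$$J(\theta) \doteq r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\big[R_t \mid A_{0:t-1} \sim \pi\big], \qquad G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + \cdots$$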

SLIDE 23

Actor-Critic with Eligibility Traces
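This slide's pseudocode was an image; the per-step updates in Sutton & Barto's episodic actor-critic with eligibility traces keep one trace vector per parameter set (my transcription of the textbook algorithm):

$$\begin{aligned}
\delta &\leftarrow R + \gamma\, \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w}) \\
\mathbf{z}^{\mathbf{w}} &\leftarrow \gamma \lambda^{\mathbf{w}} \mathbf{z}^{\mathbf{w}} + \nabla \hat{v}(S, \mathbf{w}), \qquad
\mathbf{z}^{\theta} \leftarrow \gamma \lambda^{\theta} \mathbf{z}^{\theta} + I\, \nabla_\theta \ln \pi(A \mid S, \theta) \\
\mathbf{w} &\leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \mathbf{z}^{\mathbf{w}}, \qquad
\theta \leftarrow \theta + \alpha^{\theta}\, \delta\, \mathbf{z}^{\theta}, \qquad I \leftarrow \gamma I
\end{aligned}$$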

SLIDE 24

Policy Parameterization for Continuous Actions
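For continuous action spaces, Sutton & Barto (Section 13.7) parameterize the policy as a Gaussian whose mean and standard deviation are learned functions of the state:

$$\pi(a \mid s, \theta) \doteq \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} \exp\!\left(-\frac{(a - \mu(s, \theta))^2}{2\,\sigma(s, \theta)^2}\right), \qquad \mu(s, \theta) \doteq \theta_\mu^\top \mathbf{x}_\mu(s), \qquad \sigma(s, \theta) \doteq \exp\!\big(\theta_\sigma^\top \mathbf{x}_\sigma(s)\big)$$

The exponential form for $\sigma$ keeps the standard deviation positive, and the agent samples actions from this distribution rather than choosing among a discrete set.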

SLIDE 25

Reference

  • 1. David Silver, Lecture 7: Policy Gradient
  • 2. Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction," 2nd edition, Nov. 2018, Chapter 13