10703 Deep Reinforcement Learning: Policy Gradient Methods

SLIDE 1

10703 Deep Reinforcement Learning

Policy Gradient Methods

Tom Mitchell, October 1, 2018

Reading: Barto & Sutton, Chapter 13

SLIDE 2

Used Materials

  • Much of the material and slides for this lecture were taken from Chapter 13 of the Barto & Sutton textbook.
  • Some slides are borrowed from Ruslan Salakhutdinov, who in turn borrowed from Rich Sutton’s RL class and David Silver’s Deep RL tutorial.

SLIDE 3

Policy-Based Reinforcement Learning

  • So far we approximated the value or action-value function using parameters θ (e.g. neural networks)
  • A policy was generated directly from the value function, e.g. using ε-greedy
  • We will focus again on model-free reinforcement learning
  • In this lecture we will directly parameterize the policy
SLIDE 4

Policy-Based Reinforcement Learning

  • So far we approximated the value or action-value function using parameters θ (e.g. neural networks)
  • A policy was generated directly from the value function, e.g. using ε-greedy
  • In this lecture we will directly parameterize the policy:

      πθ(s,a) = Pr{At = a | St = s, θ}

  • We will focus again on model-free reinforcement learning

Sometimes I will also use the notation: π(a|s,θ) = πθ(s,a)

SLIDE 5

Typical Parameterized Differentiable Policy

  • Softmax:

      π(a|s,θ) = exp(h(s,a,θ)) / Σb exp(h(s,b,θ))

where h(s,a,θ) is any function of s, a with params θ; e.g., a linear function of features x(s,a) you make up, or h(s,a,θ) is the output of a trained neural net
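For concreteness, here is a minimal sketch (an illustration, not the slide's own code) of a softmax policy with linear preferences h(s,a,θ) = θᵀx(s,a); the feature function x and both helper names are hypothetical placeholders you would supply (gradient formula per Sutton & Barto, Section 13.3):

```python
import numpy as np

def softmax_policy(theta, x, state, actions):
    """Action probabilities pi(a|s,theta) = exp(h(s,a,theta)) / sum_b exp(h(s,b,theta)),
    with linear preferences h(s,a,theta) = theta @ x(state, action)."""
    prefs = np.array([theta @ x(state, a) for a in actions])
    prefs -= prefs.max()                  # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def grad_log_pi(theta, x, state, action, actions):
    """Gradient of ln pi(a|s,theta) for the linear-softmax parameterization:
    x(s,a) - sum_b pi(b|s,theta) x(s,b)."""
    probs = softmax_policy(theta, x, state, actions)
    expected_x = sum(p * x(state, b) for p, b in zip(probs, actions))
    return x(state, action) - expected_x
```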

SLIDE 6

Value-Based and Policy-Based RL

  • Value Based
    • Learn a Value Function
    • Implicit policy (e.g. ε-greedy)
  • Policy Based
    • Learn a Policy directly
  • Actor-Critic
    • Learn a Value Function, and
    • Learn a Policy
SLIDE 7

Advantages of Policy-Based RL

  • Advantages
    • Better convergence properties
    • Effective in high-dimensional, even continuous, action spaces
    • Can learn stochastic policies

  • Disadvantages
    • Typically converge to a local rather than global optimum
SLIDE 8

Example: Why use a non-deterministic policy? (E.g., when states are aliased under function approximation, as in Sutton & Barto's short-corridor gridworld used later: any deterministic policy must fail in one of the indistinguishable states, while a stochastic policy can do well in both.)

SLIDE 9

SLIDE 10

What Policy Learning Objective?

  • Goal: given policy πθ(s,a) with parameters θ, wish to find the best θ
  • define “best θ” as argmaxθ J(θ) for some objective J(θ)
  • In episodic environments we can optimize the value of the start state s1:

      J1(θ) = V^πθ(s1)

Remember: episode of experience under policy π: S1, A1, R2, S2, A2, ..., ST ~ π

SLIDE 11

What Policy Learning Objective?

  • Goal: given policy πθ(s,a) with parameters θ, wish to find the best θ
  • define “best θ” as argmaxθ J(θ) for some objective J(θ)
  • In episodic environments we can optimize the value of the start state s1:
      J1(θ) = V^πθ(s1)
  • In continuing environments we can optimize the average value:
      JavV(θ) = Σs d^πθ(s) V^πθ(s)
  • Or the average immediate reward per time-step:
      JavR(θ) = Σs d^πθ(s) Σa πθ(s,a) R(s,a)

where d^πθ(s) is the stationary distribution of the Markov chain for πθ

SLIDE 12

Policy Optimization

  • Policy-based reinforcement learning is an optimization problem
    • Find θ that maximizes J(θ)
  • Some approaches do not use the gradient
    • Hill climbing
    • Genetic algorithms
  • Greater efficiency is often possible using the gradient
    • Gradient descent
    • Conjugate gradient
    • Quasi-Newton
  • We focus on gradient ascent, many extensions possible
    • And on methods that exploit sequential structure
SLIDE 13

Gradient of Policy Objective

  • Let J(θ) be any policy objective function
  • Policy gradient algorithms search for a local maximum in J(θ) by ascending the gradient of J(θ) w.r.t. the policy parameters θ:

      Δθ = α ∇θ J(θ)

where α is a step-size parameter (learning rate) and ∇θ J(θ) is the policy gradient:

      ∇θ J(θ) = (∂J(θ)/∂θ1, ..., ∂J(θ)/∂θn)ᵀ

SLIDE 14

Computing Gradients By Finite Differences

  • To evaluate the policy gradient of πθ(s, a):
  • For each dimension k in [1, n]
    • Estimate the kth partial derivative of the objective function w.r.t. θ
    • By perturbing θ by a small amount ε in the kth dimension:

      ∂J(θ)/∂θk ≈ (J(θ + ε uk) − J(θ)) / ε

where uk is a unit vector with 1 in the kth component, 0 elsewhere

  • Uses n evaluations to compute the policy gradient in n dimensions
  • Simple, inefficient – but general purpose!
  • Works for arbitrary policies, even if the policy is not differentiable
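A minimal sketch of this estimator, assuming a user-supplied evaluate(theta) routine (a hypothetical helper that returns an estimate of J(θ), e.g. the average return of a few policy rollouts):

```python
import numpy as np

def finite_difference_gradient(evaluate, theta, eps=1e-2):
    """Estimate grad J(theta) with n forward differences.

    evaluate: function theta -> estimate of J(theta) (assumed given,
              e.g. the average return of several policy rollouts).
    """
    n = theta.size
    grad = np.zeros(n)
    j0 = evaluate(theta)
    for k in range(n):
        u_k = np.zeros(n)
        u_k[k] = 1.0                       # unit vector in the k-th dimension
        grad[k] = (evaluate(theta + eps * u_k) - j0) / eps
    return grad

# Gradient-ascent update: theta <- theta + alpha * finite_difference_gradient(...)
```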

SLIDE 15

How do we find an expression for ∇θ J(θ)?

Consider the episodic case. Problem in calculating ∇θ J(θ): doesn’t a change to θ alter both:

  • the action chosen by πθ in each state s
  • the distribution of states we’ll encounter?

Remember: episode of experience under policy π: S1, A1, R2, S2, A2, ..., ST ~ π

SLIDE 16

How do we find an expression for ∇θ J(θ)?

Consider the episodic case. Problem in calculating ∇θ J(θ): doesn’t a change to θ alter both:

  • the action chosen by πθ in each state s
  • the distribution of states we’ll encounter?

Good news: the policy gradient theorem:

      ∇θ J(θ) ∝ Σs μ(s) Σa qπ(s,a) ∇θ π(a|s,θ)

where μ(s) is the probability distribution over states (the on-policy distribution under πθ)
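Why this is good news: the sum over states weighted by μ is exactly an expectation over on-policy samples, so the gradient can be estimated from experience. A sketch of the standard rewriting (following Sutton & Barto, Sections 13.2 and 13.3; the last step samples At ~ π and uses ∇π = π ∇ln π):

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &\propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla_\theta \pi(a \mid s, \theta) \\
  &= \mathbb{E}_\pi\Big[\sum_a q_\pi(S_t,a)\, \nabla_\theta \pi(a \mid S_t, \theta)\Big] \\
  &= \mathbb{E}_\pi\big[\, q_\pi(S_t,A_t)\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \,\big]
\end{aligned}
```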

SLIDE 17

SLIDE 18

SLIDE 19

SGD Approach to Optimizing J(θ): Approach 1

SLIDE 20

SGD Approach to Optimizing J(θ): Approach 2

SLIDE 21

SGD Approach to Optimizing J(θ): Approach 2

SLIDE 22

SGD Approach to Optimizing J(θ): Approach 2

SLIDE 23

REINFORCE algorithm
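As a concrete reference, a minimal Monte-Carlo REINFORCE sketch in the spirit of Sutton & Barto’s pseudocode, assuming hypothetical helpers run_episode(theta) (returns one episode’s (S, A, R) triples) and grad_log_pi (as sketched earlier):

```python
import numpy as np

def reinforce(theta, run_episode, grad_log_pi, alpha=1e-3, gamma=1.0,
              num_episodes=1000):
    """Monte-Carlo policy gradient (REINFORCE).

    run_episode(theta) -> [(S_0, A_0, R_1), ..., (S_{T-1}, A_{T-1}, R_T)]
    grad_log_pi(theta, s, a) -> gradient of ln pi(a|s,theta)
    (both assumed supplied by the user).
    """
    for _ in range(num_episodes):
        trajectory = run_episode(theta)
        G = 0.0
        returns = []
        for (_, _, r) in reversed(trajectory):   # compute returns G_t backwards
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for t, (s, a, _) in enumerate(trajectory):
            # theta <- theta + alpha * gamma^t * G_t * grad ln pi(A_t|S_t,theta)
            theta = theta + alpha * (gamma ** t) * returns[t] * grad_log_pi(theta, s, a)
    return theta
```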

SLIDE 24

Note: ∇θ π(At|St,θ) / π(At|St,θ) = ∇θ ln π(At|St,θ), because ∇ ln f = ∇f / f

SLIDE 25

Typical Parameterized Differentiable Policy

  • Softmax:

      π(a|s,θ) = exp(h(s,a,θ)) / Σb exp(h(s,b,θ))

where h(s,a,θ) is any function of s, a with params θ; e.g., a linear function of features x(s,a) you make up, or h(s,a,θ) is the output of a trained neural net

SLIDE 26

REINFORCE algorithm on Short Corridor World

SLIDE 27

Good news:

  • REINFORCE converges to a local optimum under the usual SGD assumptions
    • because Eπ[Gt | St, At] = qπ(St, At), so the sampled gradient is unbiased

But variance is high:

  • recall the high variance of Monte Carlo sampling
SLIDE 28
SLIDE 29

Adding a baseline to REINFORCE Algorithm

Replace qπ(s,a) by qπ(s,a) − b(s), for some fixed function b(s) that captures a prior for s. Note the equation is still valid, because the baseline term contributes zero:

      Σa b(s) ∇θ π(a|s,θ) = b(s) ∇θ Σa π(a|s,θ) = b(s) ∇θ 1 = 0

Result:

      ∇θ J(θ) ∝ Σs μ(s) Σa (qπ(s,a) − b(s)) ∇θ π(a|s,θ)

SLIDE 30

Adding a baseline to REINFORCE Algorithm

Replacing Gt by Gt − b(St) for a good b(St) reduces variance in the training target.

  • One typical b(St) is a learned value function: b(St) = v̂(St, w), with weights w learned from observed returns
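A minimal sketch of the resulting update, in the spirit of Sutton & Barto’s REINFORCE-with-baseline pseudocode (Section 13.4), assuming grad_log_pi as before and a linear baseline v̂(s,w) = wᵀφ(s) with a hypothetical feature map phi:

```python
import numpy as np

def reinforce_with_baseline_update(theta, w, phi, grad_log_pi, trajectory,
                                   alpha_theta=1e-3, alpha_w=1e-2, gamma=1.0):
    """One episode's worth of REINFORCE-with-baseline updates.

    trajectory: [(S_t, A_t, R_{t+1}), ...];  phi(s) -> feature vector,
    so the baseline is b(s) = v_hat(s, w) = w @ phi(s).
    """
    G = 0.0
    returns = []
    for (_, _, r) in reversed(trajectory):        # returns G_t, computed backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for t, (s, a, _) in enumerate(trajectory):
        delta = returns[t] - w @ phi(s)           # G_t - b(S_t)
        w = w + alpha_w * delta * phi(s)          # improve the baseline
        theta = theta + alpha_theta * (gamma ** t) * delta * grad_log_pi(theta, s, a)
    return theta, w
```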

SLIDE 31

SLIDE 32
SLIDE 33

Good news:

  • REINFORCE converges to a local optimum under the usual SGD assumptions
    • because Eπ[Gt | St, At] = qπ(St, At), so the sampled gradient is unbiased

But variance is high:

  • recall the high variance of Monte Carlo sampling
SLIDE 34

Actor-Critic Model

  • learn both Q and π
  • use the learned value estimate to generate target values, instead of G

One-step actor-critic model (the critic here is a learned state-value function v̂(s,w); cf. Sutton & Barto, Section 13.5), with TD error δ:

      δ = R + γ v̂(S′, w) − v̂(S, w)
      w ← w + αw δ ∇w v̂(S, w)        (critic update)
      θ ← θ + αθ δ ∇θ ln π(A|S, θ)    (actor update)
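A minimal sketch of that loop, assuming a gym-style env with reset()/step(), plus the hypothetical helpers select_action, grad_log_pi, and a linear critic with feature map phi; the variable I accumulates the γ^t factor from the book's pseudocode:

```python
import numpy as np

def one_step_actor_critic(env, theta, w, phi, select_action, grad_log_pi,
                          alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99,
                          num_episodes=500):
    """One-step actor-critic with a linear critic v_hat(s, w) = w @ phi(s).

    Assumed helpers: select_action(theta, s) samples A ~ pi(.|s, theta);
    grad_log_pi(theta, s, a) returns grad of ln pi(a|s, theta);
    env follows the gym convention step(a) -> (s_next, r, done, info).
    """
    for _ in range(num_episodes):
        s = env.reset()
        I = 1.0                                           # gamma^t
        done = False
        while not done:
            a = select_action(theta, s)
            s_next, r, done, _ = env.step(a)
            v_next = 0.0 if done else w @ phi(s_next)
            delta = r + gamma * v_next - w @ phi(s)       # TD error
            w = w + alpha_w * delta * phi(s)              # critic update
            theta = theta + alpha_theta * I * delta * grad_log_pi(theta, s, a)
            I *= gamma
            s = s_next
    return theta, w
```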

SLIDE 35