SLIDE 1

Policy gradients

Deep Reinforcement Learning and Control Katerina Fragkiadaki

Carnegie Mellon School of Computer Science CMU 10-403

SLIDE 2

Used Materials

  • Disclaimer: Much of the material and slides for this lecture were borrowed from Russ Salakhutdinov, Rich Sutton's class, and David Silver's class on Reinforcement Learning.

SLIDE 3

Revision

SLIDE 4

Deep Q-Networks (DQNs)

  • Represent the state-action value function by a Q-network Q(s, a, w) with weights w
SLIDE 5

Cost function

  • We do not know the ground-truth value
  • Minimize the mean-squared error between the true action-value function qπ(S, A) and the approximate Q function:

J(w) = 𝔼π[(qπ(S, A) − Q(S, A, w))²]

  • Minimize the MSE loss by stochastic gradient descent:

ℒ = (r + γ max_a′ Q(s, a′, w) − Q(s, a, w))²

wrong! (the bootstrapped target should use the next state s′, not the current state s)

SLIDE 6

Cost function

  • We do not know the ground-truth value
  • Minimize the mean-squared error between the true action-value function qπ(S, A) and the approximate Q function:

J(w) = 𝔼π[(qπ(S, A) − Q(S, A, w))²]

  • Minimize the MSE loss by stochastic gradient descent:

ℒ = (r + γ max_a′ Q(s′, a′, w) − Q(s, a, w))²
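A minimal PyTorch sketch of this loss; the network size, state/action dimensions, and variable names are illustrative assumptions rather than the lecture's code:

```python
import torch
import torch.nn as nn

# Hypothetical Q-network: 4-dimensional states, 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

def td_loss(s, a, r, s_next, done):
    """One-step TD loss: (r + γ max_a' Q(s', a', w) − Q(s, a, w))²."""
    with torch.no_grad():                                   # no gradient flows through the target
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a, w)
    return ((target - q_sa) ** 2).mean()
```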

SLIDE 7

Q-Learning: Off-Policy TD Control

  • One-step Q-learning: Q(St, At) ← Q(St, At) + α [Rt+1 + γ max_a Q(St+1, a) − Q(St, At)]
SLIDE 8
Stability problems when training DQNs

  • Minimize the MSE loss by stochastic gradient descent
  • Converges to Q∗ using a table-lookup representation
  • But diverges when using neural networks, due to:
  • 1. Correlations between samples
  • 2. Non-stationary targets
  • Solutions:
  • 1. Experience replay buffer
  • 2. Targets stay fixed for many iterations

ℒ = (r + γ max_a′ Q(s′, a′, w) − Q(s, a, w))²
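A sketch of how the two fixes are typically wired together, continuing the hypothetical snippet above (buffer size, batch size, and sync interval are arbitrary assumptions):

```python
import copy
import random
from collections import deque

import torch

replay = deque(maxlen=100_000)        # 1. experience buffer: sampling from it breaks correlations
target_net = copy.deepcopy(q_net)     # 2. frozen copy of q_net keeps the TD target stationary

def train_step(step, batch_size=32, sync_every=1_000):
    s, a, r, s_next, done = map(torch.stack, zip(*random.sample(replay, batch_size)))
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % sync_every == 0:        # refresh the frozen targets only occasionally
        target_net.load_state_dict(q_net.state_dict())
```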

SLIDE 9
Learning a DQN supervised from a planner

  • Minimize the MSE loss by stochastic gradient descent
  • Boils down to a supervised learning problem
  • I use MCTS to play 800 games, gather the Q estimates of the states and actions in the MCTS trees, and train a regressor.
  • Any problems?
  • Any solutions?
  • DAGGER!

ℒ = (Q_MCTS(s, a) − Q(s, a, w))²

SLIDE 10
Learning a DQN supervised from a planner

  • Minimize the MSE loss by stochastic gradient descent
  • Boils down to a supervised learning problem
  • I use MCTS to play 800 games, gather the Q estimates of the states and actions in the MCTS trees, and train a regressor. Then use it to find a policy.
  • Any problems?
  • Any solutions?
  • DAGGER!
  • Also: training a classifier directly worked best!

ℒ = (Q_MCTS(s, a) − Q(s, a, w))²
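The problem hinted at above is distribution mismatch: the regressor is fit on states visited by the MCTS planner, but at test time the learned policy visits its own states; DAGGER addresses this by repeatedly relabeling the learner's visited states with the planner. A sketch of the regression step itself (the batch layout is an assumption):

```python
def mcts_regression_loss(q_net, batch):
    """Fit Q(s, a, w) to the planner's estimates: ℒ = (Q_MCTS(s, a) − Q(s, a, w))²."""
    s, a, q_mcts = batch                                  # states, actions, MCTS Q targets
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return ((q_mcts - q_sa) ** 2).mean()
```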

SLIDE 11

Policy-Based Reinforcement Learning

  • So far we approximated the value or action-value function using parameters θ (e.g. neural networks)
  • A policy was generated directly from the value function, e.g. using ε-greedy
  • We will not use any models, and we will learn from experience, not imitation
  • In this lecture we will directly parameterize the policy
SLIDE 12

Policy-Based Reinforcement Learning

  • So far we approximated the value or action-value function using parameters θ (e.g. neural networks)
  • A policy was generated directly from the value function, e.g. using ε-greedy
  • In this lecture we will directly parameterize the policy
  • We will focus again on model-free reinforcement learning

Sometimes I will also use the notation:

SLIDE 13

Value-Based and Policy-Based RL

  • Value Based
  • Learned Value Function
  • Implicit policy (e.g. ε-greedy)
  • Policy Based
  • No Value Function
  • Learned Policy
  • Actor-Critic
  • Learned Value Function
  • Learned Policy
SLIDE 14

Advantages of Policy-Based RL

  • Advantages
  • Effective in high-dimensional or continuous action spaces
  • Can learn stochastic policies 

  • We will look into the benefits of stochastic policies in a future lecture

SLIDE 15

Policy function approximators

[Figure: a network takes the state s as input and outputs probabilities over discrete actions, e.g. "go left" vs. "go right".]

Output is a distribution over a discrete set of actions. With continuous policy parameterization the action probabilities change smoothly as a function of the learned parameter, whereas in ε-greedy selection the action probabilities may change dramatically for an arbitrarily small change in the estimated action values, if that change results in a different action having the maximal value.

SLIDE 16

Policy function approximators

[Figure: three policy parameterizations — a deterministic continuous policy a = πθ(s); a discrete-action policy whose output is a distribution over actions ("go left" / "go right"); and a stochastic continuous policy with heads µθ(s) and σθ(s), from which a ∼ N(µθ(s), σθ²(s)).]
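A minimal PyTorch sketch of the stochastic continuous (Gaussian) policy shown above; the layer sizes and the state-independent log-σ are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim=8, action_dim=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh())
        self.mu_head = nn.Linear(64, action_dim)                # µθ(s)
        self.log_std = nn.Parameter(torch.zeros(action_dim))   # log σθ (state-independent here)

    def forward(self, s):
        mu = self.mu_head(self.body(s))
        dist = torch.distributions.Normal(mu, self.log_std.exp())
        a = dist.sample()                                       # a ∼ N(µθ(s), σθ²)
        return a, dist.log_prob(a).sum(-1)                      # action and log πθ(a|s)
```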

SLIDE 17

Policy Objective Functions

  • Goal: given a policy πθ(s, a) with parameters θ, find the best θ
  • But how do we measure the quality of a policy πθ?
  • In episodic environments we can use the start value
  • In continuing environments we can use the average value
  • Or the average reward per time-step

where d^πθ(s) is the stationary distribution of the Markov chain induced by πθ
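In the standard notation of David Silver's class (from which these slides borrow), the three objectives are:

$$ J_1(\theta) = V^{\pi_\theta}(s_1), \qquad J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}^a_s $$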

SLIDE 18

Policy Objective Functions

  • Goal: given a policy πθ(s, a) with parameters θ, find the best θ
  • But how do we measure the quality of a policy πθ?
  • In continuing environments we can use the average value
  • In the episodic case, the state distribution is defined via the expected number of time steps t on which St = s in a randomly generated episode starting in s0 and following π and the dynamics of the MDP.

Remember: episode of experience under policy π:

SLIDE 19

Policy Optimization

  • Policy based reinforcement learning is an optimization problem
  • Find θ that maximizes J(θ)

  • Some approaches do not use the gradient:
  • Hill climbing
  • Genetic algorithms
  • We focus on gradient ascent on J(θ); many extensions are possible
  • And on methods that exploit sequential structure
  • Greater efficiency is often possible when using the gradient
SLIDE 20

Policy Gradient

  • Let J(θ) be any policy objective function
  • Policy gradient algorithms search for a local maximum in J(θ) by ascending the gradient of the policy w.r.t. the parameters θ:

Δθ = α ∇θ J(θ)

where α is a step-size parameter (learning rate) and ∇θ J(θ) is the policy gradient.

SLIDE 21

Computing Gradients By Finite Differences

  • To evaluate the policy gradient of πθ(s, a)
  • For each dimension k in [1, n]:
  • Estimate the kth partial derivative of the objective function w.r.t. θ
  • By perturbing θ by a small amount ε in the kth dimension:

∂J(θ)/∂θk ≈ (J(θ + ε uk) − J(θ)) / ε

where uk is a unit vector with 1 in the kth component, 0 elsewhere

  • Uses n evaluations to compute the policy gradient in n dimensions
  • Simple, noisy, inefficient - but sometimes effective
  • Works for arbitrary policies, even if the policy is not differentiable
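A small numerical sketch of the estimator; J stands for any routine that returns an (empirical) estimate of the objective for a given parameter vector, which is an assumption, not a course API:

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-2):
    """Estimate ∇θ J(θ) with n evaluations of the perturbed objective."""
    grad = np.zeros_like(theta)
    j0 = J(theta)                           # baseline value of the objective
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0                        # unit vector: 1 in the kth component, 0 elsewhere
        grad[k] = (J(theta + eps * u_k) - j0) / eps
    return grad
```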

SLIDE 22

Learning an AIBO running policy

SLIDE 23

Learning an AIBO running policy

Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion, Kohl and Stone, 2004

SLIDE 24

Learning an AIBO running policy

[Videos: the AIBO gait initially, during training, and after training.]

SLIDE 25

Policy Gradient: Score Function

  • We now compute the policy gradient analytically
  • Assume:
  • the policy πθ is differentiable whenever it is non-zero
  • we know the gradient ∇θ πθ(s, a)
  • Likelihood ratios exploit the following identity:

∇θ πθ(s, a) = πθ(s, a) (∇θ πθ(s, a) / πθ(s, a)) = πθ(s, a) ∇θ log πθ(s, a)

  • The score function is ∇θ log πθ(s, a)
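The identity is useful because it turns the gradient of an expectation under πθ into an expectation of a (log-probability) gradient, which can be estimated from samples drawn from the policy itself:

$$ \nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta}[f(a)] = \sum_a \nabla_\theta \pi_\theta(a)\, f(a) = \sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)\, f(a) = \mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a)\, f(a)\big] $$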
SLIDE 26

Softmax Policy: Discrete Actions

  • We will use a softmax policy as a running example
  • Weight actions using a linear combination of features: φ(s, a)ᵀθ
  • The probability of an action is proportional to the exponentiated weight: πθ(s, a) ∝ e^{φ(s, a)ᵀθ}

Think of a neural network with a softmax output producing the action probabilities.

Nonlinear extension: replace φ(s, a)ᵀθ with a deep neural network with trainable weights w.

SLIDE 27

Softmax Policy: Discrete Actions

  • We will use a softmax policy as a running example
  • Weight actions using a linear combination of features: φ(s, a)ᵀθ
  • The probability of an action is proportional to the exponentiated weight: πθ(s, a) ∝ e^{φ(s, a)ᵀθ}
  • The score function is

∇θ log πθ(s, a) = φ(s, a) − 𝔼πθ[φ(s, ·)]

Think of a neural network with a softmax output producing the action probabilities.

Nonlinear extension: replace φ(s, a)ᵀθ with a deep neural network with trainable weights w.
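A small numpy sketch of the linear softmax policy and its score function; phi is an assumed user-supplied feature function returning φ(s, a) as a vector:

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """πθ(a|s) ∝ exp(φ(s, a)ᵀθ) over the given discrete actions."""
    logits = np.array([phi(s, a) @ theta for a in actions])
    p = np.exp(logits - logits.max())       # subtract max for numerical stability
    return p / p.sum()

def score(theta, phi, s, a, actions):
    """∇θ log πθ(s, a) = φ(s, a) − Σ_b πθ(s, b) φ(s, b)."""
    p = softmax_policy(theta, phi, s, actions)
    expected_phi = sum(p[i] * phi(s, b) for i, b in enumerate(actions))
    return phi(s, a) - expected_phi
```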

SLIDE 28

Gaussian Policy: Continuous Actions

  • In continuous action spaces, a Gaussian policy is natural
  • The mean is a linear combination of state features: µ(s) = φ(s)ᵀθ
  • The variance may be fixed at σ², or can also be parameterized
  • The policy is Gaussian: a ∼ N(µ(s), σ²)
  • The score function is

∇θ log πθ(s, a) = (a − µ(s)) φ(s) / σ²

Nonlinear extension: replace φ(s)ᵀθ with a deep neural network with trainable weights w.
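As a quick check of this score, assuming the linear mean µ(s) = φ(s)ᵀθ and a fixed σ:

$$ \log \pi_\theta(a \mid s) = -\frac{\big(a - \phi(s)^\top \theta\big)^2}{2\sigma^2} + \text{const} \quad\Longrightarrow\quad \nabla_\theta \log \pi_\theta(a \mid s) = \frac{\big(a - \phi(s)^\top \theta\big)\,\phi(s)}{\sigma^2} $$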
SLIDE 29

One-step MDP

  • Consider a simple class of one-step MDPs
  • Starting in state s ∼ d(s)
  • Terminating after one time-step with reward r = R_{s,a}
  • First, let's look at the objective:

J(θ) = 𝔼πθ[r] = Σ_s d(s) Σ_a πθ(s, a) R_{s,a}

Intuition: under this MDP, the objective is just the expected immediate reward, averaged over the start-state distribution d(s) and the policy's action probabilities.

SLIDE 30

One-step MDP

  • Consider a simple class of one-step MDPs
  • Starting in state s ∼ d(s)
  • Terminating after one time-step with reward r = R_{s,a}
  • Use likelihood ratios to compute the policy gradient:

∇θ J(θ) = Σ_s d(s) Σ_a πθ(s, a) ∇θ log πθ(s, a) R_{s,a} = 𝔼πθ[∇θ log πθ(s, a) r]
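A sketch of estimating this gradient by sampling, reusing the hypothetical softmax_policy and score helpers from the earlier sketch; sample_state and reward stand in for d(s) and R_{s,a}:

```python
import numpy as np

def one_step_policy_gradient(theta, phi, sample_state, reward, actions, n_samples=1000):
    """Monte-Carlo estimate of ∇θ J(θ) = 𝔼πθ[∇θ log πθ(s, a) r] for the one-step MDP."""
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        s = sample_state()                              # s ∼ d(s)
        p = softmax_policy(theta, phi, s, actions)
        i = np.random.choice(len(actions), p=p)         # a ∼ πθ(·|s)
        r = reward(s, actions[i])                       # terminate with reward r = R_{s,a}
        grad += score(theta, phi, s, actions[i], actions) * r
    return grad / n_samples
```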