CS 287 Lecture 18 (Fall 2019) RL I: Policy Gradients (Pieter Abbeel)


slide-1
SLIDE 1

CS 287 Lecture 18 (Fall 2019) RL I: Policy Gradients

Pieter Abbeel UC Berkeley EECS

Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics

slide-2
SLIDE 2

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient basic derivation
  - Temporal decomposition
  - Baseline subtraction
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-3
SLIDE 3

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient basic derivation
  - Temporal decomposition
  - Baseline subtraction
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-4
SLIDE 4

[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Markov Decision Process

Assumption: agent gets to observe the state

slide-5
SLIDE 5

Markov Decision Process (S, A, T, R, γ, H)

Given:
- S: set of states
- A: set of actions
- T: S × A × S × {0, 1, …, H} → [0, 1], with T_t(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
- R: S × A × S × {0, 1, …, H} → R, with R_t(s, a, s') = reward for (s_{t+1} = s', s_t = s, a_t = a)
- γ ∈ (0, 1]: discount factor
- H: horizon over which the agent will act

Goal:
- Find π*: S × {0, 1, …, H} → A that maximizes the expected sum of rewards, i.e.,

\pi^* = \arg\max_\pi \; E\Big[\sum_{t=0}^{H} \gamma^t R_t(s_t, a_t, s_{t+1}) \,\Big|\, \pi\Big]

slide-6
SLIDE 6

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient basic derivation
  - Temporal decomposition
  - Baseline subtraction
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-7
SLIDE 7

Reinforcement Learning

[Figure source: Sutton & Barto, 1998]

Still an MDP, BUT: the MDP is not given to us; the agent needs to learn to optimize reward through trial and error.
slide-8
SLIDE 8

Policy Optimization in the RL Landscape

slide-9
SLIDE 9

Policy Optimization

[Figure: agent-environment loop; the agent chooses actions u_t according to πθ(u|s)]

[Figure source: Sutton & Barto, 1998]

slide-10
SLIDE 10

Policy Optimization

- Consider a control policy parameterized by parameter vector θ
- Stochastic policy class (smooths out the problem): πθ(u|s) = probability of action u in state s

\max_\theta \; E\Big[\sum_{t=0}^{H} R(s_t) \,\Big|\, \pi_\theta\Big]

[Figure source: Sutton & Barto, 1998]

slide-11
SLIDE 11

Why Policy Optimization

- π often can be simpler than Q or V
  - E.g., robotic grasp
- V: doesn't prescribe actions
  - Would need dynamics model (+ compute 1 Bellman back-up)
- Q: need to be able to efficiently solve \arg\max_u Q_\theta(s, u)
  - Challenge for continuous / high-dimensional action spaces*

*some recent work (partially) addressing this:
NAF: Gu, Lillicrap, Sutskever, Levine, ICML 2016; Input Convex NNs: Amos, Xu, Kolter, arXiv 2016; Deep Energy Q: Haarnoja, Tang, Abbeel, Levine, ICML 2017

slide-12
SLIDE 12

Pioneering Policy Optimization Success Stories

Kohl and Stone, 2004; Tedrake et al, 2005; Kober and Peters, 2009; Ng et al, 2004; Silver et al, 2014 (DPG); Lillicrap et al, 2015 (DDPG); Schulman et al, 2016 (TRPO + GAE); Levine*, Finn*, et al, 2016 (GPS); Mnih et al, 2015 (A3C); Silver*, Huang*, et al, 2016 (AlphaGo**)

slide-13
SLIDE 13

Policy Optimization vs. Dynamic Programming

Conceptually:
- Policy Optimization: optimize what you care about
- Dynamic Programming: indirect, exploit the problem structure, self-consistency

Empirically:
- Policy Optimization: more compatible with rich architectures (including recurrence), more versatile, more compatible with auxiliary objectives
- Dynamic Programming: more compatible with exploration and off-policy learning, more sample-efficient when they work

slide-14
SLIDE 14

Note: We have done policy optimization before!

- iLQR
- Optimization-based Control: Collocation, Shooting, MPC, Contact Invariant Optimization

→ But these assumed access to the dynamics model, which we don't have available now.
Note: in the 3rd lecture on RL we'll cover model-based RL, which learns the dynamics model and can then use the above methods.

slide-15
SLIDE 15

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient standard derivation
  - Temporal decomposition
  - Policy Gradient importance sampling derivation
  - Baseline subtraction
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-16
SLIDE 16

Black Box Gradient Computation

slide-17
SLIDE 17

Challenge: Noise Can Dominate

[Plot: E_{πθ}[R(τ)] and individual sampled returns R(τ) as a function of θ; single-rollout estimates scatter widely around the expectation]

slide-18
SLIDE 18

Solution 1: Average over many samples

[Plot: E_{πθ}[R(τ)] vs. θ, with many sampled returns R(τ) averaged per value of θ]

slide-19
SLIDE 19

Solution 2: Fix random seed

[Plot: E_{πθ}[R(τ)] vs. θ, with returns sampled under a fixed random seed]

slide-20
SLIDE 20

Solution 2: Fix random seed

- Randomness in policy and dynamics
- But can often only control randomness in the policy...
- Example: wind influence on a helicopter is stochastic, but if we assume the same wind pattern across trials, this will make the different choices of θ more readily comparable
- Note: equally applicable to evolutionary methods

[Ng & Jordan, 2000] provide a theoretical analysis of the gains from fixing randomness ("PEGASUS")

slide-21
SLIDE 21

[Ng et al, ISER 2004] [Policy search was done in simulation]

slide-22
SLIDE 22

Learning to Hover

slide-23
SLIDE 23

Example: Sidewinding

[Andrew Ng] [Video: SNAKE – climbStep+sidewin

slide-24
SLIDE 24

Example: Learning to Walk

Initial A Learning Trial After Learning [1K Trials] [Kohl and Stone, ICRA 2004]

slide-25
SLIDE 25

Example: Learning to Walk

Initial

[Video: AIBO WALK – init [Kohl and Stone, ICRA 2004]

slide-26
SLIDE 26

Example: Learning to Walk

Training

[Video: AIBO WALK – tra [Kohl and Stone, ICRA 2004]

slide-27
SLIDE 27

Example: Learning to Walk

Finished

[Video: AIBO WALK – finis [Kohl and Stone, ICRA 2004]

slide-28
SLIDE 28

Finite Differences

- Can work well!
- Most success in low-dimensional spaces…

slide-29
SLIDE 29

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient standard derivation
  - Temporal decomposition
  - Policy Gradient importance sampling derivation
  - Baseline subtraction
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-30
SLIDE 30

Evolutionary Methods

\max_\theta U(\theta) = \max_\theta \; E\Big[\sum_{t=0}^{H} R(s_t) \,\Big|\, \pi_\theta\Big]

General Algorithm:
- Make some random change to the parameters
- If the result improves, keep the change
- Repeat
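A minimal sketch of this general algorithm, assuming a caller-supplied `evaluate_policy(theta)` that runs rollouts under π_θ and returns an estimate of U(θ); the function name and perturbation scale are assumptions, not part of the slides:

```python
import numpy as np

def hill_climb(evaluate_policy, theta_init, n_iters=100, noise_scale=0.1, seed=0):
    """Random-search hill climbing: perturb parameters, keep the change if the return improves."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta_init, dtype=float)
    best_return = evaluate_policy(theta)
    for _ in range(n_iters):
        candidate = theta + noise_scale * rng.standard_normal(theta.shape)
        candidate_return = evaluate_policy(candidate)
        if candidate_return > best_return:   # keep the change only if it helps
            theta, best_return = candidate, candidate_return
    return theta, best_return
```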

slide-31
SLIDE 31

Cross-Entropy Method

CEM:
  Initialize µ ∈ R^d, σ ∈ R^d_{>0}
  for iteration = 1, 2, …
    Sample n parameters θ_i ∼ N(µ, diag(σ²))
    For each θ_i, perform one rollout to get return R(τ_i)
    Select the top k% of the θ_i, and fit a new diagonal Gaussian to those samples
    Update µ, σ
  endfor
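A compact sketch of the CEM loop above; `rollout_return(theta)` is an assumed stand-in for performing one rollout with parameters θ and returning R(τ), and the population size / elite fraction are arbitrary choices:

```python
import numpy as np

def cross_entropy_method(rollout_return, dim, n_samples=100, elite_frac=0.2,
                         n_iters=50, seed=0):
    """CEM: sample parameters from a diagonal Gaussian, keep the top fraction by return,
    refit the Gaussian to the elites, repeat."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iters):
        thetas = mu + sigma * rng.standard_normal((n_samples, dim))  # theta_i ~ N(mu, diag(sigma^2))
        returns = np.array([rollout_return(th) for th in thetas])    # one rollout per sample
        elites = thetas[np.argsort(returns)[-n_elite:]]              # top k% of the samples
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6   # refit diagonal Gaussian
    return mu
```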

slide-32
SLIDE 32

Cross-Entropy Method

- Very simple and can work surprisingly well
- Very scalable
- Does not take advantage of any temporal structure

slide-33
SLIDE 33

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient standard derivation
  - Temporal decomposition
  - Policy Gradient importance sampling derivation
  - Baseline subtraction
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-34
SLIDE 34

Likelihood Ratio Policy Gradient

slide-35
SLIDE 35

Likelihood Ratio Policy Gradient

[Aleksandrov, Sysoyev, & Shemeneva, 1968] [Rubinstein, 1969] [Glynn, 1986] [Reinforce, Williams 1992] [GPOMDP, Baxter & Bartlett, 2001]

slide-41
SLIDE 41

Likelihood Ratio Gradient: Validity

\nabla U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \, R(\tau^{(i)})

- Valid even when:
  - R is discontinuous and/or unknown
  - The sample space (of paths) is a discrete set
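A small sketch of this estimator for a linear-softmax policy over discrete actions (so that ∇_θ log π_θ has a closed form); trajectories are assumed to be lists of (state_features, action, reward) tuples, and the dynamics terms drop out of ∇_θ log P(τ; θ) because they do not depend on θ:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def grad_log_softmax_policy(theta, s_feat, u):
    """grad_theta log pi_theta(u|s) for a linear-softmax policy over discrete actions.
    theta has shape (n_actions, feat_dim); s_feat is the state feature vector."""
    probs = softmax(theta @ s_feat)
    grad = -np.outer(probs, s_feat)   # -pi(u'|s) * phi(s) for every action u'
    grad[u] += s_feat                 # + phi(s) for the action actually taken
    return grad

def reinforce_gradient(trajectories, theta):
    """g_hat = 1/m * sum_i grad_theta log P(tau_i; theta) * R(tau_i).
    Each trajectory is a list of (s_feat, u, r) tuples."""
    g = np.zeros_like(theta)
    for traj in trajectories:
        R = sum(r for _, _, r in traj)                     # total return of the path
        grad_logp = sum(grad_log_softmax_policy(theta, s, u) for s, u, _ in traj)
        g += grad_logp * R
    return g / len(trajectories)
```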

slide-42
SLIDE 42

Likelihood Ratio Gradient: Intuition

\nabla U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \, R(\tau^{(i)})

- The gradient tries to:
  - Increase the probability of paths with positive R
  - Decrease the probability of paths with negative R

→ The likelihood ratio gradient changes the probabilities of experienced paths; it does not try to change the paths themselves (<-> path derivative methods)

slide-43
SLIDE 43

Let’s Decompose Path into States and Actions

slide-47
SLIDE 47

Likelihood Ratio Gradient Estimate

slide-48
SLIDE 48

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient standard derivation
  - Temporal decomposition
  - Policy Gradient importance sampling derivation
  - Baseline subtraction
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-49
SLIDE 49

Derivation from Importance Sampling

U(\theta) = E_{\tau \sim \theta_{old}}\left[ \frac{P(\tau \mid \theta)}{P(\tau \mid \theta_{old})} R(\tau) \right]

\nabla_\theta U(\theta) = E_{\tau \sim \theta_{old}}\left[ \frac{\nabla_\theta P(\tau \mid \theta)}{P(\tau \mid \theta_{old})} R(\tau) \right]

\nabla_\theta U(\theta)\big|_{\theta = \theta_{old}} = E_{\tau \sim \theta_{old}}\left[ \frac{\nabla_\theta P(\tau \mid \theta)\big|_{\theta_{old}}}{P(\tau \mid \theta_{old})} R(\tau) \right] = E_{\tau \sim \theta_{old}}\left[ \nabla_\theta \log P(\tau \mid \theta)\big|_{\theta_{old}} R(\tau) \right]

Note: Suggests we can also look at more than just the gradient!

[Tang & Abbeel, NeurIPS 2011]

slide-53
SLIDE 53

Derivation from Importance Sampling

U(\theta) = E_{\tau \sim \theta_{old}}\left[ \frac{P(\tau \mid \theta)}{P(\tau \mid \theta_{old})} R(\tau) \right]

\nabla_\theta U(\theta) = E_{\tau \sim \theta_{old}}\left[ \frac{\nabla_\theta P(\tau \mid \theta)}{P(\tau \mid \theta_{old})} R(\tau) \right]

\nabla_\theta U(\theta)\big|_{\theta = \theta_{old}} = E_{\tau \sim \theta_{old}}\left[ \frac{\nabla_\theta P(\tau \mid \theta)\big|_{\theta_{old}}}{P(\tau \mid \theta_{old})} R(\tau) \right] = E_{\tau \sim \theta_{old}}\left[ \nabla_\theta \log P(\tau \mid \theta)\big|_{\theta_{old}} R(\tau) \right]

Suggests we can also look at more than just the gradient! E.g., can use the importance-sampled objective as a "surrogate loss" (locally) [→ later: PPO]

[Tang & Abbeel, NeurIPS 2011]

slide-54
SLIDE 54

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient standard derivation
  - Temporal decomposition
  - Policy Gradient importance sampling derivation
  - Baseline subtraction and temporal structure
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-55
SLIDE 55

Likelihood Ratio Gradient Estimate

- As formulated thus far: unbiased but very noisy
- Fixes that lead to real-world practicality:
  - Baseline
  - Temporal structure
  - [later] Trust region / natural gradient

slide-56
SLIDE 56

Likelihood Ratio Gradient: Intuition

\nabla U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \, R(\tau^{(i)})

- The gradient tries to:
  - Increase the probability of paths with positive R
  - Decrease the probability of paths with negative R

→ The likelihood ratio gradient changes the probabilities of experienced paths; it does not try to change the paths themselves (<-> path derivative methods)

slide-57
SLIDE 57

Likelihood Ratio Gradient: Baseline

\nabla U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \, R(\tau^{(i)})

→ Consider a baseline b:

\nabla U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \left( R(\tau^{(i)}) - b \right)

still unbiased!  [Williams 1992]

E\left[ \nabla_\theta \log P(\tau; \theta)\, b \right]
= \sum_\tau P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, b
= \sum_\tau P(\tau; \theta)\, \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)}\, b
= \sum_\tau \nabla_\theta P(\tau; \theta)\, b
= b\, \nabla_\theta \Big( \sum_\tau P(\tau; \theta) \Big)
= b \cdot \nabla_\theta (1) = 0

(OK as long as the baseline doesn't depend on the action inside log π(action | state))

slide-58
SLIDE 58

Likelihood Ratio and Temporal Structure

Current estimate:

\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \left( R(\tau^{(i)}) - b \right)
        = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \right) \left( \sum_{t=0}^{H-1} R(s_t^{(i)}, u_t^{(i)}) - b \right)

Removing terms that don't depend on the current action can lower variance. Expanding the return around time t,

\frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left[ \sum_{k=0}^{t-1} R(s_k^{(i)}, u_k^{(i)}) + \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b \right]

the first (past-reward) sum does not depend on u_t^{(i)} and can be dropped, while the baseline is OK to depend on s_t^{(i)}:

\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b(s_t^{(i)}) \right)

[Policy Gradient Theorem: Sutton et al, NIPS 1999; GPOMDP: Bartlett & Baxter, JAIR 2001; Survey: Peters & Schaal, IROS 2006]
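A sketch of the final estimator above (reward-to-go with a state-dependent baseline); `grad_log_pi(s, u)` and `baseline(s)` are assumed caller-supplied functions, not something defined on the slides:

```python
import numpy as np

def pg_with_reward_to_go(trajectories, grad_log_pi, baseline):
    """g_hat = 1/m sum_i sum_t grad log pi(u_t|s_t) * (sum_{k>=t} r_k - b(s_t)).
    Each trajectory is a list of (s, u, r) tuples."""
    g = None
    for traj in trajectories:
        rewards = np.array([r for _, _, r in traj])
        rtg = np.cumsum(rewards[::-1])[::-1]   # reward-to-go: sum of rewards from t onward
        for t, (s, u, _) in enumerate(traj):
            term = grad_log_pi(s, u) * (rtg[t] - baseline(s))
            g = term if g is None else g + term
    return g / len(trajectories)
```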

slide-59
SLIDE 59

Baseline Choices

Good choice for b?

- Constant baseline:  b = E[R(\tau)] \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)})
- Optimal constant baseline  [See: Greensmith, Bartlett, Baxter, JMLR 2004 for variance reduction techniques]
- Time-dependent baseline:  b_t = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)})
- State-dependent expected return:  b(s_t) = E[r_t + r_{t+1} + r_{t+2} + \dots + r_{H-1}] = V^\pi(s_t)

→ Increase the log-probability of an action proportionally to how much its returns are better than the expected return under the current policy

slide-60
SLIDE 60

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient standard derivation
  - Temporal decomposition
  - Policy Gradient importance sampling derivation
  - Baseline subtraction & temporal structure
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-61
SLIDE 61

Monte Carlo Estimation of V^π

\frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - V^\pi(s_t^{(i)}) \right)

How to estimate V^\pi?

- Init V^\pi_{\phi_0}
- Collect trajectories \tau_1, \dots, \tau_m
- Regress against the empirical return:

\phi_{i+1} \leftarrow \arg\min_\phi \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \left( V^\pi_\phi(s_t^{(i)}) - \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) \right)^2
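A minimal sketch of the Monte Carlo regression step, assuming a linear value function V_φ(s) = φ·feat(s) and a caller-supplied feature map `feat`; the slide does not commit to a particular function class, so the linear / least-squares choice here is an assumption:

```python
import numpy as np

def fit_value_monte_carlo(trajectories, feat):
    """Regress a linear value function V_phi(s) = phi . feat(s) against the
    empirical returns sum_{k>=t} r_k (ordinary least squares)."""
    X, y = [], []
    for traj in trajectories:
        rewards = np.array([r for _, _, r in traj])
        rtg = np.cumsum(rewards[::-1])[::-1]   # empirical return from each time step t
        for t, (s, _, _) in enumerate(traj):
            X.append(feat(s))
            y.append(rtg[t])
    phi, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return phi
```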
slide-62
SLIDE 62

Bootstrap Estimation of V^π

- Bellman equation for V^\pi:

V^\pi(s) = \sum_u \pi(u \mid s) \sum_{s'} P(s' \mid s, u) \left[ R(s, u, s') + \gamma V^\pi(s') \right]

- Init V^\pi_{\phi_0}
- Collect data {s, u, s', r}
- Fitted V iteration:

\phi_{i+1} \leftarrow \min_\phi \sum_{(s,u,s',r)} \left\| r + \gamma V^\pi_{\phi_i}(s') - V_\phi(s) \right\|_2^2 + \lambda \left\| \phi - \phi_i \right\|_2^2
slide-63
SLIDE 63

Vanilla Policy Gradient

~ [Williams, 1992]
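The algorithm box on this slide did not survive extraction; below is a hedged sketch of the usual vanilla policy gradient loop (roll out π_θ, fit a baseline, form reward-to-go advantages, take a gradient step). All helper names (`sample_trajectories`, `grad_log_pi`, `fit_baseline`) are assumptions, not the slide's own code:

```python
import numpy as np

def vanilla_policy_gradient(sample_trajectories, grad_log_pi, fit_baseline,
                            theta, n_iterations=100, step_size=1e-2):
    """One common form of the vanilla PG loop (REINFORCE with a state baseline)."""
    for _ in range(n_iterations):
        trajectories = sample_trajectories(theta)    # roll out the current policy
        baseline = fit_baseline(trajectories)        # e.g. Monte Carlo value regression
        g = np.zeros_like(theta)
        for traj in trajectories:
            rewards = np.array([r for _, _, r in traj])
            rtg = np.cumsum(rewards[::-1])[::-1]     # reward-to-go
            for t, (s, u, _) in enumerate(traj):
                g += grad_log_pi(theta, s, u) * (rtg[t] - baseline(s))
        theta = theta + step_size * g / len(trajectories)   # gradient ascent on E[R]
    return theta
```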

slide-64
SLIDE 64

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient standard derivation
  - Temporal decomposition
  - Policy Gradient importance sampling derivation
  - Baseline subtraction & temporal structure
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-65
SLIDE 65

Recall Our Likelihood Ratio PG Estimator

\frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - V^\pi(s_t^{(i)}) \right)

- Estimation of Q from a single roll-out: Q^\pi(s, u) = E[r_0 + r_1 + r_2 + \cdots \mid s_0 = s, u_0 = u]
  = high variance per sample / no generalization used
- Reduce variance by discounting
- Reduce variance by function approximation (= critic)


slide-68
SLIDE 68

Recall Our Likelihood Ratio PG Estimator

\frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - V^\pi(s_t^{(i)}) \right)

- Estimation of Q from a single roll-out: Q^\pi(s, u) = E[r_0 + r_1 + r_2 + \cdots \mid s_0 = s, u_0 = u]
  = high variance per sample / no generalization
- Reduce variance by discounting
- Reduce variance by function approximation (= critic)

slide-69
SLIDE 69

Further Refinements

\frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - V^\pi(s_t^{(i)}) \right)

- Estimation of Q from a single roll-out: Q^\pi(s, u) = E[r_0 + r_1 + r_2 + \cdots \mid s_0 = s, u_0 = u]
  = high variance per sample / no generalization
- Reduce variance by discounting
- Reduce variance by function approximation (= critic)


slide-71
SLIDE 71

Variance Reduction by Discounting

→ Introduce a discount factor as a hyperparameter to improve the estimate of Q:

Q^\pi(s, u) = E[r_0 + r_1 + r_2 + \cdots \mid s_0 = s, u_0 = u]

Q^{\pi,\gamma}(s, u) = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, u_0 = u]

slide-72
SLIDE 72

Reducing Variance by Function Approximation

Q^{\pi,\gamma}(s, u) = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, u_0 = u]
                     = E[r_0 + \gamma V^\pi(s_1) \mid s_0 = s, u_0 = u]
                     = E[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2) \mid s_0 = s, u_0 = u]
                     = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V^\pi(s_3) \mid s_0 = s, u_0 = u]
                     = \cdots

- Generalized Advantage Estimation uses an exponentially weighted average of these
- ~ TD(λ)
slide-73
SLIDE 73

n Generalized Advantage Estimation uses an exponentially

weighted average of these

n ~ TD(lambda)

Reducing Variance by Function Approximation

Qπ,γ(s, u) = E[r0 + γr1 + γ2r2 + · · · | s0 = s, u0 = u] = E[r0 + γV π(s1) | s0 = s, u0 = u] = E[r0 + γr1 + γ2V π(s2) | s0 = s, u0 = u] = E[r0 + γr1 + +γ2r2 + γ3V π(s3) | s0 = s, u0 = u] = · · ·

slide-74
SLIDE 74

n Generalized Advantage Estimation uses an exponentially

weighted average of these

n ~ TD(lambda)

Reducing Variance by Function Approximation

Qπ,γ(s, u) = E[r0 + γr1 + γ2r2 + · · · | s0 = s, u0 = u] = E[r0 + γV π(s1) | s0 = s, u0 = u] = E[r0 + γr1 + γ2V π(s2) | s0 = s, u0 = u] = E[r0 + γr1 + +γ2r2 + γ3V π(s3) | s0 = s, u0 = u] = · · ·

slide-75
SLIDE 75

Reducing Variance by Function Approximation

Q^{\pi,\gamma}(s, u) = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, u_0 = u]
                     = E[r_0 + \gamma V^\pi(s_1) \mid s_0 = s, u_0 = u]
                     = E[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2) \mid s_0 = s, u_0 = u]
                     = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V^\pi(s_3) \mid s_0 = s, u_0 = u]
                     = \cdots

- Async Advantage Actor Critic (A3C) [Mnih et al, 2016]: \hat{Q} is one of the above choices (e.g., a k = 5 step lookahead)
slide-76
SLIDE 76

Reducing Variance by Function Approximation

Q^{\pi,\gamma}(s, u) = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, u_0 = u]
                     = E[r_0 + \gamma V^\pi(s_1) \mid s_0 = s, u_0 = u]
                     = E[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2) \mid s_0 = s, u_0 = u]
                     = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V^\pi(s_3) \mid s_0 = s, u_0 = u]
                     = \cdots

- Generalized Advantage Estimation (GAE) [Schulman et al, ICLR 2016]: \hat{Q} is the lambda-exponentially-weighted average of all of the above, with weights (1 − λ), (1 − λ)λ, (1 − λ)λ², (1 − λ)λ³, …
- ~ TD(λ) / eligibility traces [Sutton and Barto, 1990]
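A sketch of the standard backward recursion for GAE advantages (A_t = δ_t + γλ A_{t+1}, with δ_t the TD residual), which is equivalent to the exponentially weighted average of the k-step estimators above:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via the backward recursion.
    `values` must contain V(s_0), ..., V(s_H) (one more entry than `rewards`)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t
        last = delta + gamma * lam * last                        # A_t = delta_t + gamma*lambda*A_{t+1}
        adv[t] = last
    return adv
```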


slide-78
SLIDE 78

Actor-Critic with A3C or GAE

Policy Gradient + Generalized Advantage Estimation:

- Init V^\pi_{\phi_0}, \pi_{\theta_0}
- Collect roll-outs {s, u, s', r} and \hat{Q}_i(s, u)
- Update:

\phi_{i+1} \leftarrow \min_\phi \sum_{(s,u,s',r)} \left\| \hat{Q}_i(s, u) - V^\pi_\phi(s) \right\|_2^2 + \kappa \left\| \phi - \phi_i \right\|_2^2

\theta_{i+1} \leftarrow \theta_i + \alpha \frac{1}{m} \sum_{k=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_{\theta_i}(u_t^{(k)} \mid s_t^{(k)}) \left( \hat{Q}_i(s_t^{(k)}, u_t^{(k)}) - V^\pi_{\phi_i}(s_t^{(k)}) \right)

Note: many variations; e.g., could instead use a 1-step estimate for V and the full roll-out for \pi:

\phi_{i+1} \leftarrow \min_\phi \sum_{(s,u,s',r)} \left\| r + \gamma V^\pi_{\phi_i}(s') - V_\phi(s) \right\|_2^2 + \lambda \left\| \phi - \phi_i \right\|_2^2

\theta_{i+1} \leftarrow \theta_i + \alpha \frac{1}{m} \sum_{k=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_{\theta_i}(u_t^{(k)} \mid s_t^{(k)}) \left( \sum_{t'=t}^{H-1} r_{t'}^{(k)} - V^\pi_{\phi_i}(s_t^{(k)}) \right)
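One possible rendering of a single actor-critic iteration in the spirit of this slide, assuming a linear critic V_φ(s) = φ·feat(s) and discounted reward-to-go as Q̂; in practice A3C/GAE would use n-step or λ-weighted targets and neural-network function approximators:

```python
import numpy as np

def actor_critic_update(trajectories, theta, phi, feat, grad_log_pi,
                        step_size=1e-2, gamma=0.99, reg=1e-2):
    """Actor: ascend grad log pi * (Q_hat - V_phi). Critic: refit phi to the Q_hat targets."""
    g = np.zeros_like(theta)
    X, y = [], []
    for traj in trajectories:
        q_hat, targets = 0.0, []
        for _, _, r in reversed(traj):
            q_hat = r + gamma * q_hat        # discounted reward-to-go as Q_hat
            targets.append(q_hat)
        targets.reverse()
        for (s, u, _), q in zip(traj, targets):
            g += grad_log_pi(theta, s, u) * (q - feat(s) @ phi)   # advantage = Q_hat - V
            X.append(feat(s))
            y.append(q)
    theta = theta + step_size * g / len(trajectories)
    X, y = np.array(X), np.array(y)
    I = np.eye(X.shape[1])
    phi = np.linalg.solve(X.T @ X + reg * I, X.T @ y + reg * phi)  # regularized critic refit
    return theta, phi
```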

slide-79
SLIDE 79

Async Advantage Actor Critic (A3C)  [Mnih et al, ICML 2016]

- Likelihood Ratio Policy Gradient
- n-step Advantage Estimation

slide-80
SLIDE 80

A3C -- labyrinth

slide-81
SLIDE 81

Example: Toddler Robot

[Tedrake, Zhang and Seung, 2005] [Video: TODDLER – 40s]

slide-82
SLIDE 82

GAE: Effect of gamma and lambda

[Schulman et al, 2016 -- GAE]

slide-83
SLIDE 83

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient standard derivation
  - Temporal decomposition
  - Policy Gradient importance sampling derivation
  - Baseline subtraction & temporal structure
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-84
SLIDE 84

Step-sizing and Trust Regions

- Step-sizing is necessary because the gradient is only a first-order approximation

slide-85
SLIDE 85

What's in a step-size?

- Terrible step sizes are always an issue, but how about just not-so-great ones?
- Supervised learning
  - Step too far → the next update will correct for it
- Reinforcement learning
  - Step too far → terrible policy
  - Next mini-batch: collected under this terrible policy!
  - Not clear how to recover short of going back and shrinking the step size

slide-86
SLIDE 86

Step-sizing and Trust Regions

- Simple step-sizing: line search in the direction of the gradient
  - Simple, but expensive (evaluations along the line)
  - Naïve: ignores where the first-order approximation is good/poor

slide-87
SLIDE 87

Step-sizing and Trust Regions

- Advanced step-sizing: trust regions
- The first-order approximation from the gradient is a good approximation within a "trust region" → solve for the best point within the trust region:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad KL\big(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)\big) \le \varepsilon

slide-88
SLIDE 88

Evaluating the KL

Our problem:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad KL\big(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)\big) \le \varepsilon

Recall:

P(\tau; \theta) = P(s_0) \prod_{t=0}^{H-1} \pi_\theta(u_t \mid s_t) \, P(s_{t+1} \mid s_t, u_t)

Hence (the dynamics cancels out!):

KL\big(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)\big)
= \sum_\tau P(\tau; \theta) \log \frac{P(\tau; \theta)}{P(\tau; \theta + \delta\theta)}
= \sum_\tau P(\tau; \theta) \log \frac{P(s_0) \prod_{t=0}^{H-1} \pi_\theta(u_t \mid s_t) P(s_{t+1} \mid s_t, u_t)}{P(s_0) \prod_{t=0}^{H-1} \pi_{\theta+\delta\theta}(u_t \mid s_t) P(s_{t+1} \mid s_t, u_t)}
= \sum_\tau P(\tau; \theta) \log \frac{\prod_{t=0}^{H-1} \pi_\theta(u_t \mid s_t)}{\prod_{t=0}^{H-1} \pi_{\theta+\delta\theta}(u_t \mid s_t)}
\approx \frac{1}{M} \sum_{s \,\in\, \text{roll-outs under } \theta} \sum_u \pi_\theta(u \mid s) \log \frac{\pi_\theta(u \mid s)}{\pi_{\theta+\delta\theta}(u \mid s)}
= \frac{1}{M} \sum_{s \sim \theta} KL\big(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta+\delta\theta}(\cdot \mid s)\big)


slide-92
SLIDE 92

Evaluating the KL

Our problem:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad KL\big(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)\big) \le \varepsilon

Recall:

P(\tau; \theta) = P(s_0) \prod_{t=0}^{H-1} \pi_\theta(u_t \mid s_t) \, P(s_{t+1} \mid s_t, u_t)

Hence (the dynamics cancels out!):

KL\big(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)\big)
= \sum_\tau P(\tau; \theta) \log \frac{P(\tau; \theta)}{P(\tau; \theta + \delta\theta)}
= \sum_\tau P(\tau; \theta) \log \frac{P(s_0) \prod_{t=0}^{H-1} \pi_\theta(u_t \mid s_t) P(s_{t+1} \mid s_t, u_t)}{P(s_0) \prod_{t=0}^{H-1} \pi_{\theta+\delta\theta}(u_t \mid s_t) P(s_{t+1} \mid s_t, u_t)}
= \sum_\tau P(\tau; \theta) \log \frac{\prod_{t=0}^{H-1} \pi_\theta(u_t \mid s_t)}{\prod_{t=0}^{H-1} \pi_{\theta+\delta\theta}(u_t \mid s_t)}
\approx \frac{1}{M} \sum_{s \,\in\, \text{roll-outs under } \theta} \sum_u \pi_\theta(u \mid s) \log \frac{\pi_\theta(u \mid s)}{\pi_{\theta+\delta\theta}(u \mid s)}
= \frac{1}{M} \sum_{s \sim \theta} KL\big(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta+\delta\theta}(\cdot \mid s)\big)
\approx \frac{1}{M} \sum_{(s,u) \,\in\, \text{roll-outs under } \theta} \log \frac{\pi_\theta(u \mid s)}{\pi_{\theta+\delta\theta}(u \mid s)}

slide-93
SLIDE 93

Evaluating the KL

Our problem:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad KL\big(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)\big) \le \varepsilon

has become:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad \frac{1}{M} \sum_{(s,u) \sim \theta} \log \frac{\pi_\theta(u \mid s)}{\pi_{\theta+\delta\theta}(u \mid s)} \le \varepsilon

slide-94
SLIDE 94

Evaluating the KL

Our problem:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad KL\big(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)\big) \le \varepsilon

has become:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad \frac{1}{M} \sum_{(s,u) \sim \theta} \log \frac{\pi_\theta(u \mid s)}{\pi_{\theta+\delta\theta}(u \mid s)} \le \varepsilon

- How to enforce this constraint given complex policies like neural nets?
- Use a 2nd-order approximation of the KL divergence:
  - (1) the first-order approximation is constant (the linear term vanishes at δθ = 0)
  - (2) the Hessian is the Fisher information matrix

slide-95
SLIDE 95

Evaluating the KL

Our problem:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad KL\big(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)\big) \le \varepsilon

has become:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad \frac{1}{M} \sum_{(s,u) \sim \theta} \log \frac{\pi_\theta(u \mid s)}{\pi_{\theta+\delta\theta}(u \mid s)} \le \varepsilon

2nd-order approximation to the KL:

KL\big(\pi_\theta(u \mid s) \,\|\, \pi_{\theta+\delta\theta}(u \mid s)\big) \approx \delta\theta^\top \Big( \sum_{(s,u) \sim \theta} \nabla_\theta \log \pi_\theta(u \mid s) \, \nabla_\theta \log \pi_\theta(u \mid s)^\top \Big) \delta\theta = \delta\theta^\top F_\theta \, \delta\theta

slide-96
SLIDE 96

Evaluating the KL

Our problem:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad KL\big(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)\big) \le \varepsilon

has become:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad \frac{1}{M} \sum_{(s,u) \sim \theta} \log \frac{\pi_\theta(u \mid s)}{\pi_{\theta+\delta\theta}(u \mid s)} \le \varepsilon

2nd-order approximation to the KL:

KL\big(\pi_\theta(u \mid s) \,\|\, \pi_{\theta+\delta\theta}(u \mid s)\big) \approx \delta\theta^\top \Big( \sum_{(s,u) \sim \theta} \nabla_\theta \log \pi_\theta(u \mid s) \, \nabla_\theta \log \pi_\theta(u \mid s)^\top \Big) \delta\theta = \delta\theta^\top F_\theta \, \delta\theta

→ The Fisher matrix is easily computed from gradient calculations
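A sketch of the resulting update: build the empirical Fisher matrix from per-sample ∇_θ log π_θ(u|s) vectors, then solve max ĝᵀδθ s.t. δθᵀF δθ ≤ ε in closed form. TRPO avoids forming F explicitly by using conjugate gradient; forming it densely here is only for illustration, and the damping term is an added assumption to keep F invertible:

```python
import numpy as np

def natural_gradient_step(grad_log_pis, g, epsilon=1e-2, damping=1e-3):
    """grad_log_pis: (M, d) array of per-sample grad log pi vectors; g: policy gradient (d,).
    Returns dtheta maximizing g^T dtheta subject to dtheta^T F dtheta <= epsilon."""
    F = grad_log_pis.T @ grad_log_pis / len(grad_log_pis)   # empirical Fisher matrix
    F += damping * np.eye(F.shape[0])                       # damping (assumed, for stability)
    step_dir = np.linalg.solve(F, g)                        # F^{-1} g (TRPO: conjugate gradient)
    step_size = np.sqrt(epsilon / (g @ step_dir + 1e-12))   # scale to hit the trust-region boundary
    return step_size * step_dir
```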

slide-97
SLIDE 97

Evaluating the KL

Our problem:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad \delta\theta^\top F_\theta \, \delta\theta \le \varepsilon

Done?
- Deep RL → θ is high-dimensional, so building / inverting F_θ is impractical
  - Efficient scheme through conjugate gradient [Schulman et al, 2015, TRPO]
- Can we do better?
  - Replace the objective by a surrogate loss that is a higher-order approximation yet equally efficient to evaluate [Schulman et al, 2015, TRPO]
  - Note: the surrogate loss idea is generally applicable when likelihood ratio gradients are used


slide-101
SLIDE 101

Evaluating the KL

Our problem:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad \delta\theta^\top F_\theta \, \delta\theta \le \varepsilon

Done?
- Deep RL → θ is high-dimensional, so building / inverting F_θ is impractical
  - Efficient scheme through conjugate gradient [Schulman et al, 2015, TRPO]
- Can we do even better?
  - Replace the objective by a surrogate loss that is a higher-order approximation yet equally efficient to evaluate [Schulman et al, 2015, TRPO]
  - Note: the surrogate loss idea is generally applicable when likelihood ratio gradients are used


slide-103
SLIDE 103

Evaluating the KL

Our problem:

\max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad \delta\theta^\top F_\theta \, \delta\theta \le \varepsilon

Done?
- Deep RL → θ is high-dimensional, so building / inverting F_θ is impractical
  - Efficient scheme through conjugate gradient [Schulman et al, 2015, TRPO]
- Can we do even better?
  - Replace the objective by a surrogate loss that is a higher-order approximation yet equally efficient to evaluate [Schulman et al, 2015, TRPO]
  - Note: the surrogate loss idea is generally applicable when likelihood ratio gradients are used

slide-104
SLIDE 104

TRPO

Surrogate loss:

\max_\pi \; L(\pi) = E_{\pi_{old}}\left[ \frac{\pi(a \mid s)}{\pi_{old}(a \mid s)} \, A^{\pi_{old}}(s, a) \right]

Constraint:

E_{\pi_{old}}\left[ KL(\pi \,\|\, \pi_{old}) \right] \le \varepsilon

slide-105
SLIDE 105

[Schulman, Levine, Moritz, Jordan, Abbeel, 2014]

Experiments in Locomotion

slide-106
SLIDE 106

Learning Curves -- Comparison

slide-107
SLIDE 107

Learning Curves -- Comparison

slide-108
SLIDE 108

Atari Games

- Deep Q-Network (DQN) [Mnih et al, 2013/2015]
- DAgger with Monte Carlo Tree Search [Xiao-Xiao et al, 2014]
- Trust Region Policy Optimization [Schulman, Levine, Moritz, Jordan, Abbeel, 2015]

Pong, Enduro, Beamrider, Q*bert

slide-109
SLIDE 109

Natural Gradients Work

slide-110
SLIDE 110

Learning Locomotion (TRPO + GAE)

[Schulman, Moritz, Levine, Jordan, Abbeel, 2016]

slide-111
SLIDE 111

Outline for Today's Lecture

- Super-quick Refresher: Markov Decision Processes (MDPs)
- Reinforcement Learning
- Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient standard derivation
  - Temporal decomposition
  - Policy Gradient importance sampling derivation
  - Baseline subtraction & temporal structure
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
  - Trust Region Policy Optimization (TRPO)
  - Proximal Policy Optimization (PPO)

slide-112
SLIDE 112

A better TRPO?

- Not easy to enforce the trust region constraint for complex policy architectures
  - Networks that have stochasticity, like dropout
  - Parameter sharing between policy and value function
- Conjugate gradient implementation is complex
- Would be good to harness good first-order optimizers like Adam, RMSProp…

A better TRPO?

slide-113
SLIDE 113

Proximal Policy Optimization V1 – “Dual Descent TRPO”

TRPO PPO v1

Do dual descent update for beta
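The objective on this slide was an image; below is a sketch of the KL-penalty surrogate it refers to, with a simple adaptive rule for β. The specific multiply/divide-by-2 rule follows the PPO paper's adaptive-KL variant and is an assumption about what the slide showed:

```python
import numpy as np

def ppo_v1_objective_and_beta(ratio, adv, kl, beta, kl_target=0.01):
    """KL-penalty ("dual descent") surrogate: E[ratio * A] - beta * E[KL],
    followed by an adaptive update of beta toward a target KL."""
    objective = np.mean(ratio * adv) - beta * np.mean(kl)
    mean_kl = np.mean(kl)
    if mean_kl > 1.5 * kl_target:
        beta *= 2.0      # penalize more if the policy moved too far
    elif mean_kl < kl_target / 1.5:
        beta /= 2.0      # penalize less if the policy barely moved
    return objective, beta
```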

slide-114
SLIDE 114

Can we simplify further?

slide-115
SLIDE 115

Proximal Policy Optimization V2 – “Clipped Surrogate Loss”

Let r_t(\theta) = \frac{\pi_\theta(u_t \mid s_t)}{\pi_{\theta_{old}}(u_t \mid s_t)}.  Optimize: L^{CLIP}(\theta) = E_t\left[ \min\big( r_t(\theta)\, \hat{A}_t, \; \text{clip}(r_t(\theta), 1 - \varepsilon, 1 + \varepsilon)\, \hat{A}_t \big) \right]
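A sketch of the clipped surrogate loss named on this slide, written as a numpy function over batch arrays; in practice the same expression would be built inside an autodiff framework and minimized with a first-order optimizer:

```python
import numpy as np

def clipped_surrogate_loss(log_prob_new, log_prob_old, adv, clip_eps=0.2):
    """PPO clipped surrogate: with r_t = pi_theta(u_t|s_t) / pi_theta_old(u_t|s_t),
    maximize E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)].
    Returns the negative objective so it can be minimized."""
    ratio = np.exp(log_prob_new - log_prob_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))
```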

slide-116
SLIDE 116

RL: Learning Soccer

[Bansal et al, 2017]

slide-117
SLIDE 117

OpenAI-5 was trained with PPO

slide-118
SLIDE 118

OpenAI In-Hand Re-Orientation

slide-119
SLIDE 119

OpenAI Rubik’s Cube