SLIDE 1

Natural Policy Gradients, TRPO, PPO

Deep Reinforcement Learning and Control Katerina Fragkiadaki

Carnegie Mellon School of Computer Science CMU 10703

SLIDE 2

Part of the slides are adapted from John Schulman and Joshua Achiam.

SLIDE 3

Stochastic policies

Continuous actions: usually a multivariate Gaussian, $a \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta^2(s))$, where the policy network with parameters $\theta$ outputs the mean $\mu_\theta(s)$ and standard deviation $\sigma_\theta(s)$.

Discrete actions: almost always categorical, $a \sim \mathrm{Cat}(p_\theta(s))$, where the network outputs the probability vector $p_\theta(s)$.
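As a concrete illustration (not from the slides), here is a minimal PyTorch sketch of the two parametrizations; the layer sizes and the `dist` helper are placeholder choices.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Continuous actions: a ~ N(mu_theta(s), sigma_theta(s)^2)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log-std (a common simplification)

    def dist(self, s):
        return torch.distributions.Normal(self.mu(s), self.log_std.exp())

class CategoricalPolicy(nn.Module):
    """Discrete actions: a ~ Cat(p_theta(s))."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.logits = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def dist(self, s):
        return torch.distributions.Categorical(logits=self.logits(s))

# Sampling an action and its log-probability, the two quantities every policy-gradient
# estimator in this lecture needs:
#   pi = GaussianPolicy(obs_dim=3, act_dim=1)
#   d = pi.dist(s); a = d.sample(); logp = d.log_prob(a)
```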
SLIDE 4

Policy Gradients

Monte Carlo policy gradient (REINFORCE), gradient direction:
$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$
Actor-critic policy gradient:
$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_w(s_t)\right]$$
Update rule: $\theta_{new} = \theta_{old} + \epsilon \cdot \hat{g}$

  • 1. Collect trajectories for policy $\pi_\theta$
  • 2. Estimate advantages $\hat{A}$
  • 3. Compute policy gradient $\hat{g}$
  • 4. Update policy parameters
  • 5. GOTO 1

(Figure: the Gaussian policy $\mu_\theta(s), \sigma_\theta(s)$ before and after the update, i.e., $\theta_{old} \to \theta_{new}$.)

This lecture is all about the stepsize $\epsilon$. (A sketch of the loop above follows below.)
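As a point of reference (not from the slides), here is a minimal sketch of steps 1-5 for the Monte Carlo case; it reuses the hypothetical `policy.dist` helper from the earlier sketch and uses plain reward-to-go returns as the advantage estimate.

```python
import torch

def reinforce_update(policy, optimizer, trajectories, gamma=0.99):
    """Steps 2-4 of the loop; step 1 produced `trajectories` with the current policy,
    and step 5 is simply calling this function again with freshly collected data."""
    loss = 0.0
    for states, actions, rewards in trajectories:
        # Step 2: Monte Carlo advantage estimate (here just the reward-to-go, no baseline).
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        # Step 3: loss whose gradient is -E[ grad log pi_theta(a|s) * A ].
        logp = policy.dist(states).log_prob(actions)
        if logp.dim() > 1:              # diagonal Gaussian: sum log-probs over action dims
            logp = logp.sum(-1)
        loss = loss - (logp * returns).sum()
    # Step 4: theta_new = theta_old + eps * g_hat (eps is the optimizer's learning rate).
    optimizer.zero_grad()
    (loss / len(trajectories)).backward()
    optimizer.step()
```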
SLIDE 5

What is the underlying objective function?

The policy gradient estimate is
$$\hat{g} \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) A\left(s_t^{(i)}, a_t^{(i)}\right), \qquad \tau_i \sim \pi_\theta$$
Policy gradients: what is our objective? The estimate above is the result of differentiating the objective function
$$J^{PG}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) A\left(s_t^{(i)}, a_t^{(i)}\right), \qquad \tau_i \sim \pi_\theta$$
Is this our objective? We cannot both maximize over a variable and sample from it. Compare to supervised learning and maximum likelihood estimation (MLE). Imagine we had access to expert actions; then the loss function we would want to optimize is
$$J^{SL}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \log \pi_\theta\left(\tilde{a}_t^{(i)} \mid s_t^{(i)}\right), \qquad \tau_i \sim \pi^*$$
which maximizes the probability of expert actions in the training set. Is $J^{PG}$ our objective in this same sense? Well, we cannot optimize it too far: our advantage estimates come from samples of $\pi_{\theta_{old}}$. However, this constraint of "do not optimize too far from $\theta_{old}$" does not appear anywhere in the objective. (In the supervised case what we really care about is test error, but the short answer is that $J^{SL}$ is good enough to optimize if we add regularization.) A code sketch of $J^{PG}$ follows below.
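To make the distinction between sampling and maximizing concrete, here is a hedged sketch (not from the slides) of how $J^{PG}$ is typically implemented: the advantages are treated as constants, while the samples themselves came from the policy being updated.

```python
import torch

def j_pg(policy, states, actions, advantages):
    """Surrogate objective J^PG: mean of log pi_theta(a|s) * A, with A held fixed.

    Its gradient is exactly the policy-gradient estimator g_hat, but the samples and
    the advantages were produced by pi_theta_old, so the objective is only trustworthy
    in a neighborhood of theta_old.
    """
    logp = policy.dist(states).log_prob(actions)
    if logp.dim() > 1:                       # diagonal Gaussian: sum over action dimensions
        logp = logp.sum(-1)
    return (logp * advantages.detach()).mean()

# Maximizing it: loss = -j_pg(policy, s, a, adv); loss.backward(); optimizer.step()
```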
SLIDE 6

Policy Gradients

(REINFORCE and actor-critic gradient directions, the update rule $\theta_{new} = \theta_{old} + \epsilon \cdot \hat{g}$, and steps 1-5 as on Slide 4.)

This lecture is all about the stepsize. It is also about writing down an objective that we can optimize with PG, so that the procedure 1-5 is the result of this objective maximization.
SLIDE 7

Policy Gradients

(REINFORCE and actor-critic gradient directions, the update rule $\theta_{new} = \theta_{old} + \epsilon \cdot \hat{g}$, and steps 1-5 as on Slide 4.)

Two problems with the vanilla formulation:

  • 1. Hard to choose the stepsize $\epsilon$
  • 2. Sample inefficient: we cannot reuse data collected with policies of previous iterations
SLIDE 8

Hard to choose stepsizes

  • Step too big: bad policy → data is collected under that bad policy → we cannot recover. (In supervised learning, the data does not depend on the neural network weights.)
  • Step too small: inefficient use of experience. (In supervised learning, data can be trivially re-used.)

Gradient descent in parameter space does not take into account the resulting distance in the (output) policy space between $\pi_{\theta_{old}}(s)$ and $\pi_{\theta_{new}}(s)$.

(REINFORCE and actor-critic gradient directions, the update rule, and steps 1-5 as on Slide 4.)
SLIDE 9

Hard to choose stepsizes

Consider a family of policies with parametrization:
$$\pi_\theta(a) = \begin{cases} \sigma(\theta) & a = 1 \\ 1 - \sigma(\theta) & a = 2 \end{cases}$$

Figure: Small changes in the policy parameters can unexpectedly lead to big changes in the policy. (A numeric illustration follows below.)

(REINFORCE and actor-critic gradient directions, the update rule, and steps 1-5 as on Slide 4.)
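A minimal numeric sketch of the figure's point, assuming nothing beyond the sigmoid parametrization given above: the same parameter step $d\theta$ changes the action distribution by wildly different amounts depending on where $\theta$ currently is.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_theta = 2.0  # identical parameter-space step everywhere
for theta in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    p_old, p_new = sigmoid(theta), sigmoid(theta + d_theta)
    # KL(pi_theta || pi_{theta + d_theta}) for the two-action policy:
    kl = p_old * np.log(p_old / p_new) + (1 - p_old) * np.log((1 - p_old) / (1 - p_new))
    print(f"theta={theta:+.1f}  pi(a=1): {p_old:.3f} -> {p_new:.3f}  KL={kl:.4f}")
```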
SLIDE 10

Notation

We will use the following to denote values of parameters and corresponding policies before and after an update:

$\theta_{old} \to \theta_{new}$, $\pi_{old} \to \pi_{new}$; equivalently $\theta \to \theta'$, $\pi \to \pi'$.
SLIDE 11

Gradient Descent in Parameter Space

The stepsize in gradient descent results from solving the following optimization problem, e.g., using line search, with a Euclidean distance constraint in parameter space:
$$\text{SGD:}\qquad d^* = \arg\max_{\|d\| \le \epsilon} J(\theta + d), \qquad \theta_{new} = \theta_{old} + d^*$$
It is hard to predict the effect of such a step on the parameterized distribution (e.g., on $\mu_\theta(s)$ and $\sigma_\theta(s)$ for a Gaussian policy).
SLIDE 12

Gradient Descent in Distribution Space

SGD, Euclidean distance in parameter space:
$$d^* = \arg\max_{\|d\| \le \epsilon} J(\theta + d), \qquad \theta_{new} = \theta_{old} + d^*$$
It is hard to predict the result on the parameterized distribution, and hard to pick the threshold $\epsilon$.

Natural gradient descent, KL divergence in distribution space: the stepsize in parameter space is determined by considering the KL divergence between the distributions before and after the update,
$$d^* = \arg\max_{d:\; D_{KL}(\pi_\theta \| \pi_{\theta+d}) \le \epsilon} J(\theta + d)$$
Here it is much easier to pick the distance threshold!
SLIDE 13

Solving the KL Constrained Problem

First-order Taylor expansion for the loss and second-order for the KL. Unconstrained penalized objective:
$$d^* = \arg\max_d \; J(\theta + d) - \lambda\left(D_{KL}\left[\pi_\theta \,\|\, \pi_{\theta+d}\right] - \epsilon\right)$$
$$\approx \arg\max_d \; J(\theta_{old}) + \nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} \cdot d - \frac{1}{2}\lambda \left(d^\top \nabla^2_\theta D_{KL}\left[\pi_{\theta_{old}} \,\|\, \pi_\theta\right]\big|_{\theta=\theta_{old}}\, d\right) + \lambda\epsilon$$
SLIDE 14

Taylor expansion of KL

$$D_{KL}(p_{\theta_{old}} \| p_\theta) = \mathbb{E}_{x\sim p_{\theta_{old}}} \log\frac{p_{\theta_{old}}(x)}{p_\theta(x)}$$
$$D_{KL}(p_{\theta_{old}} \| p_\theta) \approx D_{KL}(p_{\theta_{old}} \| p_{\theta_{old}}) + d^\top \nabla_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}} + \frac{1}{2} d^\top \nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}}\, d$$
The first-order term vanishes:
$$\nabla_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}} = -\nabla_\theta \mathbb{E}_{x\sim p_{\theta_{old}}} \log p_\theta(x)\big|_{\theta=\theta_{old}} = -\mathbb{E}_{x\sim p_{\theta_{old}}} \nabla_\theta \log p_\theta(x)\big|_{\theta=\theta_{old}}$$
$$= -\mathbb{E}_{x\sim p_{\theta_{old}}} \frac{1}{p_{\theta_{old}}(x)} \nabla_\theta p_\theta(x)\big|_{\theta=\theta_{old}} = -\int_x \nabla_\theta p_\theta(x)\big|_{\theta=\theta_{old}} = -\nabla_\theta \int_x p_\theta(x)\bigg|_{\theta=\theta_{old}} = 0$$
SLIDE 15

Taylor expansion of KL

$$D_{KL}(p_{\theta_{old}} \| p_\theta) \approx D_{KL}(p_{\theta_{old}} \| p_{\theta_{old}}) + d^\top \nabla_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}} + \frac{1}{2} d^\top \nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}}\, d$$
The second-order term:
$$\nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}} = -\mathbb{E}_{x\sim p_{\theta_{old}}} \nabla^2_\theta \log p_\theta(x)\big|_{\theta=\theta_{old}} = -\mathbb{E}_{x\sim p_{\theta_{old}}} \nabla_\theta\left(\frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}\right)\bigg|_{\theta=\theta_{old}}$$
$$= -\mathbb{E}_{x\sim p_{\theta_{old}}}\left(\frac{\nabla^2_\theta p_\theta(x)\, p_\theta(x) - \nabla_\theta p_\theta(x)\, \nabla_\theta p_\theta(x)^\top}{p_\theta(x)^2}\right)\bigg|_{\theta=\theta_{old}}$$
$$= -\mathbb{E}_{x\sim p_{\theta_{old}}} \frac{\nabla^2_\theta p_\theta(x)\big|_{\theta=\theta_{old}}}{p_{\theta_{old}}(x)} + \mathbb{E}_{x\sim p_{\theta_{old}}} \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\big|_{\theta=\theta_{old}}$$
$$= \mathbb{E}_{x\sim p_{\theta_{old}}} \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\big|_{\theta=\theta_{old}}$$
(The first term vanishes by the same argument as on the previous slide: $\mathbb{E}_{x\sim p_{\theta_{old}}} \nabla^2_\theta p_\theta(x) / p_{\theta_{old}}(x) = \nabla^2_\theta \int_x p_\theta(x) = 0$.)
SLIDE 16

Fisher Information Matrix

$$F(\theta) = \mathbb{E}_{x\sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right]$$
Exactly equivalent to the Hessian of the KL divergence:
$$F(\theta_{old}) = \nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}}$$
so the Taylor expansion becomes
$$D_{KL}(p_{\theta_{old}} \| p_\theta) \approx \frac{1}{2} d^\top F(\theta_{old})\, d = \frac{1}{2}(\theta - \theta_{old})^\top F(\theta_{old})\, (\theta - \theta_{old})$$

Since the KL divergence is roughly analogous to a distance measure between distributions, the Fisher information serves as a local distance metric between distributions: it tells you how much you change the distribution if you move the parameters a little bit in a given direction. (A small sketch of estimating $F$ from samples follows below.)
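As an aside not on the slides, the definition above can be estimated directly from samples by averaging outer products of score vectors; the `score_fn` interface and the 1-D Gaussian check below are illustrative assumptions.

```python
import numpy as np

def empirical_fisher(score_fn, samples):
    """Estimate F(theta) = E[ grad log p_theta(x) grad log p_theta(x)^T ] from x ~ p_theta.

    score_fn(x) must return the score vector grad_theta log p_theta(x); in deep RL this
    would come from autodiff through the policy's log-probability.
    """
    scores = np.stack([score_fn(x) for x in samples])   # shape (N, dim_theta)
    return scores.T @ scores / len(samples)             # average of outer products

# Sanity check: for p_theta = N(mu, 1) with theta = mu, the score is (x - mu) and F = 1.
mu = 0.0
xs = np.random.default_rng(0).normal(mu, 1.0, size=100_000)
print(empirical_fisher(lambda x: np.array([x - mu]), xs))   # ~[[1.0]]
```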
SLIDE 17

Solving the KL Constrained Problem (continued)

First-order Taylor expansion for the loss and second-order for the KL. Unconstrained penalized objective:
$$d^* = \arg\max_d \; J(\theta + d) - \lambda\left(D_{KL}\left[\pi_\theta \,\|\, \pi_{\theta+d}\right] - \epsilon\right)$$
$$\approx \arg\max_d \; J(\theta_{old}) + \nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} \cdot d - \frac{1}{2}\lambda \left(d^\top \nabla^2_\theta D_{KL}\left[\pi_{\theta_{old}} \,\|\, \pi_\theta\right]\big|_{\theta=\theta_{old}}\, d\right) + \lambda\epsilon$$
Substitute the Fisher information matrix for the KL Hessian:
$$= \arg\max_d \; \nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} \cdot d - \frac{1}{2}\lambda \left(d^\top F(\theta_{old})\, d\right) = \arg\min_d \; -\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} \cdot d + \frac{1}{2}\lambda \left(d^\top F(\theta_{old})\, d\right)$$
SLIDE 18

Natural Gradient Descent

Setting the gradient with respect to $d$ to zero:
$$0 = \frac{\partial}{\partial d}\left(-\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} \cdot d + \frac{1}{2}\lambda\, d^\top F(\theta_{old})\, d\right) = -\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} + \lambda\, F(\theta_{old})\, d$$
$$\Rightarrow\quad d = \frac{1}{\lambda} F^{-1}(\theta_{old})\, \nabla_\theta J(\theta)\big|_{\theta=\theta_{old}}$$
The natural gradient (the constant $1/\lambda$ is absorbed into the stepsize $\alpha$):
$$\tilde{\nabla} J(\theta) = F^{-1}(\theta_{old})\, \nabla_\theta J(\theta), \qquad \theta_{new} = \theta_{old} + \alpha \cdot F^{-1}(\theta_{old})\, \hat{g}$$
The stepsize $\alpha$ is chosen so that the quadratic KL approximation $D_{KL}(\pi_{\theta_{old}} \| \pi_\theta) \approx \frac{1}{2}(\theta - \theta_{old})^\top F(\theta_{old})(\theta - \theta_{old})$ sits exactly on the trust-region boundary: with $g_N = F^{-1}(\theta_{old})\hat{g}$,
$$\frac{1}{2}(\alpha g_N)^\top F\, (\alpha g_N) = \epsilon \quad\Rightarrow\quad \alpha = \sqrt{\frac{2\epsilon}{g_N^\top F\, g_N}}$$
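A minimal sketch of this update for a small parameter vector, assuming the gradient and the Fisher matrix have already been estimated (for large networks one would never form $F$ explicitly; see the conjugate-gradient sketch after Slide 20).

```python
import numpy as np

def natural_gradient_step(theta, grad_J, fisher, eps=0.01):
    """theta_new = theta + alpha * F^{-1} g, with alpha chosen so that the quadratic
    KL approximation 0.5 * d^T F d equals the trust-region radius eps."""
    g_nat = np.linalg.solve(fisher, grad_J)                  # F^{-1} g without an explicit inverse
    alpha = np.sqrt(2.0 * eps / (g_nat @ fisher @ g_nat))
    return theta + alpha * g_nat
```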
SLIDE 19

Natural Gradient Descent

(Reference for the algorithm boxes: "Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation", Wu et al. 2017.)

Both use samples from the current policy $\pi_k$.
SLIDE 20

Natural Gradient Descent

(Same algorithm boxes as on the previous slide.)

Computing $F^{-1}$ is very expensive for a large number of parameters! (The standard workaround is sketched below.)
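The sketch below, a standard workaround rather than anything specific to these slides, avoids forming or inverting $F$: it approximately solves $Fx = \hat{g}$ with conjugate gradients using only Fisher-vector products, which is what the TRPO algorithm on Slide 28 does.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only a function fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual g - F x (x starts at zero)
    p = r.copy()
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

# In deep RL, fvp(v) is computed by autodiff as grad((grad KL)^T v): a couple of backward
# passes per product, with no O(dim^2) memory for F itself.
```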
SLIDE 21

Policy Gradients

(REINFORCE and actor-critic gradient directions, the update rule $\theta_{new} = \theta_{old} + \epsilon \cdot \hat{g}$, and steps 1-5 as on Slide 4, with trajectories collected under $\pi_{\theta_{old}}$.)
SLIDE 22

Policy Gradients

(REINFORCE and actor-critic gradient directions, the update rule, and steps 1-5 as on Slide 4.)

  • On-policy learning can be extremely inefficient
  • The policy changes only a little bit with each gradient step
  • I want to be able to use earlier data... how to do that?
SLIDE 23

Off-policy learning with Importance Sampling

$$J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[R(\tau)\right] = \sum_\tau \pi_\theta(\tau) R(\tau) = \sum_\tau \pi_{\theta_{old}}(\tau)\, \frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)}\, R(\tau) = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}} \frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)}\, R(\tau)$$
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}} \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)}\, R(\tau)$$
The gradient evaluated at $\theta_{old}$ is unchanged:
$$\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}} \nabla_\theta \log \pi_\theta(\tau)\big|_{\theta=\theta_{old}}\, R(\tau)$$
SLIDE 24

Off-policy learning with Importance Sampling (continued)

$$J(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}} \frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)}\, R(\tau), \qquad \frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)} = \prod_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$
Written per timestep with advantage estimates:
$$J(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}} \left[\sum_{t=1}^{T} \left(\prod_{t'=1}^{t} \frac{\pi_\theta(a_{t'} \mid s_{t'})}{\pi_{\theta_{old}}(a_{t'} \mid s_{t'})}\right) \hat{A}_t\right]$$
Now we can use data from the old policy, but the variance has increased by a lot: those products of ratios can explode or vanish! (A small numeric demo follows below.)
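A tiny simulation (an illustrative assumption, not from the slides) of why the product of per-step ratios is so high-variance, even when the two policies are close at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200                                          # horizon
# Per-step ratios pi_theta(a_t|s_t)/pi_theta_old(a_t|s_t): log-ratios with small spread,
# mimicking two nearby policies that disagree by ~10% on any single action.
log_ratios = rng.normal(loc=0.0, scale=0.1, size=(1000, T))
traj_weights = np.exp(log_ratios.sum(axis=1))    # product of ratios over a whole trajectory

print("min %.2e  median %.2e  max %.2e"
      % (traj_weights.min(), np.median(traj_weights), traj_weights.max()))
# The trajectory-level weights span several orders of magnitude: exactly the explosion
# (or vanishing) that makes this estimator impractical for long horizons.
```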
SLIDE 25

Trust Region Policy Optimization

$$\operatorname*{maximize}_{\theta} \;\; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right] \qquad \text{subject to} \qquad \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta$$
It is also worth considering using a penalty instead of a constraint:
$$\operatorname*{maximize}_{\theta} \;\; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right] - \beta\, \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$$
Again the KL-penalized problem!

J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization". ICML 2015.
SLIDE 26

Trust Region Policy Optimization: solving the KL-penalized problem

$$\operatorname*{maximize}_{\theta} \;\; \mathcal{L}_{\pi_{\theta_{old}}}(\pi_\theta) - \beta \cdot \mathrm{KL}_{\pi_{\theta_{old}}}(\pi_\theta)$$
Make a linear approximation to $\mathcal{L}_{\pi_{\theta_{old}}}$ and a quadratic approximation to the KL term:
$$\operatorname*{maximize}_{\theta} \;\; g \cdot (\theta - \theta_{old}) - \frac{\beta}{2}(\theta - \theta_{old})^\top F\, (\theta - \theta_{old})$$
$$\text{where}\quad g = \frac{\partial}{\partial\theta}\mathcal{L}_{\pi_{\theta_{old}}}(\pi_\theta)\bigg|_{\theta=\theta_{old}}, \qquad F = \frac{\partial^2}{\partial\theta^2}\mathrm{KL}_{\pi_{\theta_{old}}}(\pi_\theta)\bigg|_{\theta=\theta_{old}}$$
Exactly what we saw with the natural policy gradient! One important detail remains.
SLIDE 27

Trust Region Policy Optimization: line search

Due to the quadratic approximation, the KL constraint may be violated! What if we just do a line search to find the best stepsize, making sure that:

  • I am improving my objective $J(\theta)$
  • The KL constraint is not violated!

$$\operatorname*{maximize}_{\theta} \;\; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right] \qquad \text{subject to} \qquad \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta$$

Algorithm 2: Line Search for TRPO
  • Compute the proposed policy step $\Delta_k = \sqrt{\frac{2\delta}{\hat{g}_k^\top \hat{H}_k^{-1} \hat{g}_k}}\, \hat{H}_k^{-1}\hat{g}_k$
  • for $j = 0, 1, 2, \ldots, L$ do
  •   Compute the proposed update $\theta = \theta_k + \alpha^j \Delta_k$
  •   if $\mathcal{L}_{\theta_k}(\theta) \ge 0$ and $\bar{D}_{KL}(\theta \| \theta_k) \le \delta$ then accept the update, set $\theta_{k+1} = \theta_k + \alpha^j \Delta_k$, and break
  •   end if
  • end for

(A code sketch of this backtracking search follows below.)
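A hedged sketch of the backtracking loop above; the `surrogate` and `kl` callables are assumed to be provided by the caller and evaluated on trajectories collected with $\pi_{\theta_k}$.

```python
import numpy as np

def trpo_line_search(theta_k, delta_step, surrogate, kl, max_kl, backtrack=0.8, max_iters=10):
    """Backtracking line search applied after the natural-gradient step.

    surrogate(theta): estimated improvement L_{theta_k}(theta) - L_{theta_k}(theta_k)
    kl(theta):        estimated mean KL between pi_{theta_k} and pi_theta
    """
    for j in range(max_iters):
        theta = theta_k + (backtrack ** j) * delta_step
        if surrogate(theta) >= 0.0 and kl(theta) <= max_kl:
            return theta          # first step that both improves and stays inside the trust region
    return theta_k                # otherwise keep the old parameters
```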
SLIDE 28

Algorithm 3: Trust Region Policy Optimization
  • Input: initial policy parameters $\theta_0$
  • for $k = 0, 1, 2, \ldots$ do
  •   Collect a set of trajectories $\mathcal{D}_k$ on policy $\pi_k = \pi(\theta_k)$
  •   Estimate advantages $\hat{A}^{\pi_k}_t$ using any advantage estimation algorithm
  •   Form sample estimates for the policy gradient $\hat{g}_k$ (using the advantage estimates) and the KL-divergence Hessian-vector product function $f(v) = \hat{H}_k v$
  •   Use CG with $n_{cg}$ iterations to obtain $x_k \approx \hat{H}_k^{-1}\hat{g}_k$
  •   Estimate the proposed step $\Delta_k \approx \sqrt{\frac{2\delta}{x_k^\top \hat{H}_k x_k}}\, x_k$
  •   Perform a backtracking line search with exponential decay to obtain the final update $\theta_{k+1} = \theta_k + \alpha^j \Delta_k$
  • end for

TRPO = NPG + line search
SLIDE 29

(Same Algorithm 3 as on the previous slide.)

TRPO = NPG + line search + monotonic improvement theorem!
SLIDE 30

Relating objectives of two policies

Policy objective:
$$J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta} \sum_{t=0}^{\infty} \gamma^t r_t$$
The objective of a new policy can be written in terms of the old one:
$$J(\pi_{\theta'}) - J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_{\theta'}} \sum_{t=0}^{\infty} \gamma^t A^{\pi_\theta}(s_t, a_t)$$
Equivalently, for succinctness:
$$J(\pi') - J(\pi) = \mathbb{E}_{\tau\sim\pi'} \sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)$$
SLIDE 31

Relating objectives of two policies: proof

$$J(\pi') - J(\pi) = \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\right]$$
since
$$\mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\right] = \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty} \gamma^t \left(R(s_t, a_t, s_{t+1}) + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)\right)\right]$$
$$= J(\pi') + \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty} \gamma^{t+1} V^{\pi}(s_{t+1}) - \sum_{t=0}^{\infty} \gamma^{t} V^{\pi}(s_{t})\right]$$
$$= J(\pi') + \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=1}^{\infty} \gamma^{t} V^{\pi}(s_{t}) - \sum_{t=0}^{\infty} \gamma^{t} V^{\pi}(s_{t})\right]$$
$$= J(\pi') - \mathbb{E}_{\tau\sim\pi'}\left[V^{\pi}(s_0)\right] = J(\pi') - J(\pi)$$
The initial state distribution is the same for both policies!

Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002.
SLIDE 32

Relating objectives of two policies

Discounted state visitation distribution:
$$d^\pi(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$$
$$J(\pi') - J(\pi) = \mathbb{E}_{\tau\sim\pi'} \sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t) = \mathbb{E}_{s\sim d^{\pi'},\, a\sim\pi'}\, A^{\pi}(s, a) = \mathbb{E}_{s\sim d^{\pi'},\, a\sim\pi}\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^{\pi}(s, a)\right]$$
But how are we supposed to sample states from the policy we are trying to optimize for? Let's use the previous policy to sample them:
$$J(\pi') - J(\pi) \approx \mathbb{E}_{s\sim d^{\pi},\, a\sim\pi}\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^{\pi}(s, a)\right] = \mathcal{L}_\pi(\pi')$$
It turns out we can bound this approximation error:
$$\left| J(\pi') - \left(J(\pi) + \mathcal{L}_\pi(\pi')\right) \right| \le C\, \sqrt{\mathbb{E}_{s\sim d^\pi}\left[D_{KL}(\pi' \,\|\, \pi)[s]\right]}$$
Constrained Policy Optimization, Achiam et al., 2017.
SLIDE 33

Relating objectives of two policies

$$\mathcal{L}^{\pi'}_{\pi} = \mathbb{E}_{s\sim d^\pi,\, a\sim\pi}\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^{\pi}(s, a)\right] = \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty} \gamma^t\, \frac{\pi'(a_t \mid s_t)}{\pi(a_t \mid s_t)}\, A^{\pi}(s_t, a_t)\right]$$
This is something we can optimize using trajectories from the old policy! And now we do not have the product of ratios, so the gradient will have much smaller variance. (Yes, but we have approximated; that's why!) What is the gradient?
$$\nabla_\theta \mathcal{L}^{\theta}_{\theta_k}\big|_{\theta=\theta_k} = \mathbb{E}_{\tau\sim\pi_{\theta_k}}\left[\sum_{t=0}^{\infty} \gamma^t\, \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)\big|_{\theta=\theta_k}}{\pi_{\theta_k}(a_t \mid s_t)}\, A^{\pi_{\theta_k}}(s_t, a_t)\right] = \mathbb{E}_{\tau\sim\pi_{\theta_k}}\left[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big|_{\theta=\theta_k}\, A^{\pi_{\theta_k}}(s_t, a_t)\right]$$
Compare to importance sampling over whole trajectories:
$$J(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}}\left[\sum_{t=1}^{T} \left(\prod_{t'=1}^{t} \frac{\pi_\theta(a_{t'} \mid s_{t'})}{\pi_{\theta_{old}}(a_{t'} \mid s_{t'})}\right) \hat{A}_t\right]$$
SLIDE 34

Monotonic Improvement Theorem

$$\left| J(\pi') - \left(J(\pi) + \mathcal{L}_\pi(\pi')\right) \right| \le C\, \sqrt{\mathbb{E}_{s\sim d^\pi}\left[\mathrm{KL}(\pi' \,\|\, \pi)[s]\right]} \;\;\Rightarrow\;\; J(\pi') - J(\pi) \ge \mathcal{L}_\pi(\pi') - C\, \sqrt{\mathbb{E}_{s\sim d^\pi}\left[\mathrm{KL}(\pi' \,\|\, \pi)[s]\right]}$$
Given policy $\pi$, we want to optimize over policy $\pi'$ to maximize $J(\pi')$.

  • If we maximize the RHS, we are guaranteed to maximize the LHS.
  • We know how to maximize the RHS: both quantities involving $\pi'$ can be estimated with samples from $\pi$.
  • But will I have a better policy $\pi'$? Knowing that the RHS is maximized is not enough; its value also needs to be positive or equal to zero.
SLIDE 35

Monotonic Improvement Theorem: proof of the improvement guarantee

Suppose $\pi_{k+1}$ and $\pi_k$ are related by
$$\pi_{k+1} = \arg\max_{\pi'} \;\mathcal{L}_{\pi_k}(\pi') - C\, \sqrt{\mathbb{E}_{s\sim d^{\pi_k}}\left[D_{KL}(\pi' \,\|\, \pi_k)[s]\right]}$$
$\pi_k$ is a feasible point, and the objective at $\pi_k$ is equal to $0$:
$$\mathcal{L}_{\pi_k}(\pi_k) \propto \mathbb{E}_{s,a\sim d^{\pi_k},\pi_k}\left[A^{\pi_k}(s, a)\right] = 0, \qquad D_{KL}(\pi_k \,\|\, \pi_k)[s] = 0$$
$\Rightarrow$ the optimal value is $\ge 0$ $\Rightarrow$ by the performance bound, $J(\pi_{k+1}) - J(\pi_k) \ge 0$.
SLIDE 36
Approximate Monotonic Improvement

  • The theory is very conservative (high value of $C$), so we will use the KL distance between $\pi'$ and $\pi$ as a constraint (trust region) as opposed to a penalty:

$$\pi_{k+1} = \arg\max_{\pi'} \;\mathcal{L}_{\pi_k}(\pi') \quad \text{s.t.} \quad \mathbb{E}_{s\sim d^{\pi_k}}\left[D_{KL}(\pi' \,\|\, \pi_k)[s]\right] \le \delta$$
SLIDE 37

(Same Algorithm 3: Trust Region Policy Optimization, as on Slide 28.)

TRPO = NPG + line search + monotonic improvement theorem!
SLIDE 38

Proximal Policy Optimization

Can I achieve similar performance without second-order information (no Fisher matrix)?

Adaptive KL Penalty: the policy update solves an unconstrained optimization problem,
$$\theta_{k+1} = \arg\max_\theta \;\mathcal{L}_{\theta_k}(\theta) - \beta_k\, \bar{D}_{KL}(\theta \,\|\, \theta_k)$$
where the penalty coefficient $\beta_k$ changes between iterations to approximately enforce the KL-divergence constraint.

Clipped Objective: a new objective function. Let $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_k}(a_t \mid s_t)$. Then
$$L^{CLIP}_{\theta_k}(\theta) = \mathbb{E}_{\tau\sim\pi_k}\left[\sum_{t=0}^{T} \min\left(r_t(\theta)\, \hat{A}^{\pi_k}_t,\; \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}^{\pi_k}_t\right)\right]$$
where $\epsilon$ is a hyperparameter (e.g., $\epsilon = 0.2$). The policy update is $\theta_{k+1} = \arg\max_\theta L^{CLIP}_{\theta_k}(\theta)$.

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal Policy Optimization Algorithms". 2017.
SLIDE 39

PPO: Adaptive KL Penalty

  • Input: initial policy parameters $\theta_0$, initial KL penalty $\beta_0$, target KL-divergence $\delta$
  • for $k = 0, 1, 2, \ldots$ do
  •   Collect a set of partial trajectories $\mathcal{D}_k$ on policy $\pi_k = \pi(\theta_k)$
  •   Estimate advantages $\hat{A}^{\pi_k}_t$ using any advantage estimation algorithm
  •   Compute the policy update $\theta_{k+1} = \arg\max_\theta \mathcal{L}_{\theta_k}(\theta) - \beta_k \bar{D}_{KL}(\theta \,\|\, \theta_k)$ by taking $K$ steps of minibatch SGD (via Adam)
  •   if $\bar{D}_{KL}(\theta_{k+1} \,\|\, \theta_k) \ge 1.5\,\delta$ then $\beta_{k+1} = 2\beta_k$
  •   else if $\bar{D}_{KL}(\theta_{k+1} \,\|\, \theta_k) \le \delta / 1.5$ then $\beta_{k+1} = \beta_k / 2$
  •   end if
  • end for

Don't use the second-order approximation of the KL, which is expensive; use standard gradient descent. (A sketch of the $\beta$ adaptation follows below.)
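A direct transcription of the $\beta$ update rule above into code; the target value in the usage comment is an illustrative assumption.

```python
def adapt_kl_penalty(beta, measured_kl, target_kl):
    """Grow beta when the update moved too far (KL too large), shrink it when it was too timid."""
    if measured_kl >= 1.5 * target_kl:
        return beta * 2.0
    if measured_kl <= target_kl / 1.5:
        return beta / 2.0
    return beta

# Called once per iteration, after the K minibatch steps on L(theta) - beta * KL:
#   beta = adapt_kl_penalty(beta, mean_kl(theta_new, theta_old), target_kl=0.01)
```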
SLIDE 40

PPO: Clipped Objective

Recall the surrogate objective:
$$L^{IS}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t\right] \qquad (1)$$
Form a lower bound via clipped importance ratios:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t\right)\right] \qquad (2)$$
(A code sketch of this loss follows below.)

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal Policy Optimization Algorithms". 2017.
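A minimal sketch of equation (2) as a training loss, reusing the hypothetical `policy.dist` helper from the earlier sketches; `logp_old` is stored when the data is collected and kept fixed during the update.

```python
import torch

def ppo_clip_loss(policy, states, actions, advantages, logp_old, clip_eps=0.2):
    """Negative of L^CLIP, suitable for minimization with Adam."""
    logp = policy.dist(states).log_prob(actions)
    if logp.dim() > 1:                      # diagonal Gaussian: sum over action dimensions
        logp = logp.sum(-1)
    ratio = torch.exp(logp - logp_old)      # r_t(theta) = pi_theta(a|s) / pi_theta_k(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Each iteration: collect data with pi_theta_k, then take K minibatch Adam steps on this loss.
```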
SLIDE 41

Proximal Policy Optimization: Clipped Objective

But how does clipping keep the policy close? By making the objective as pessimistic as possible about performance far away from $\theta_k$.

Figure: Various objectives as a function of the interpolation factor $\alpha$ between $\theta_k$ and $\theta_{k+1}$ after one update of PPO-Clip.
SLIDE 42

PPO: Clipped Objective

  • Input: initial policy parameters $\theta_0$, clipping threshold $\epsilon$
  • for $k = 0, 1, 2, \ldots$ do
  •   Collect a set of partial trajectories $\mathcal{D}_k$ on policy $\pi_k = \pi(\theta_k)$
  •   Estimate advantages $\hat{A}^{\pi_k}_t$ using any advantage estimation algorithm
  •   Compute the policy update $\theta_{k+1} = \arg\max_\theta L^{CLIP}_{\theta_k}(\theta)$ by taking $K$ steps of minibatch SGD (via Adam), where
$$L^{CLIP}_{\theta_k}(\theta) = \mathbb{E}_{\tau\sim\pi_k}\left[\sum_{t=0}^{T} \min\left(r_t(\theta)\, \hat{A}^{\pi_k}_t,\; \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}^{\pi_k}_t\right)\right]$$
  • end for

Clipping removes the policy's incentive to move far away from $\theta_k$. Clipping seems to work at least as well as PPO with the KL penalty, but it is simpler to implement.
SLIDE 43

Empirical Performance of PPO

Figure: Performance comparison between PPO with the clipped objective and various other deep RL methods on a slate of MuJoCo tasks.

PPO: Clipped Objective
SLIDE 44

Summary

  • Gradient descent in parameter space vs. distribution space
  • Natural gradients: we need to keep track of how the KL changes from iteration to iteration
  • Natural policy gradients
  • The clipped objective works well

Related Readings

  • S. Kakade. "A Natural Policy Gradient". NIPS 2001.
  • S. Kakade and J. Langford. "Approximately Optimal Approximate Reinforcement Learning". ICML 2002.
  • J. Peters and S. Schaal. "Natural Actor-Critic". Neurocomputing, 2008.
  • J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization". ICML 2015.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal Policy Optimization Algorithms". 2017.
  • J. Achiam, D. Held, A. Tamar, and P. Abbeel. "Constrained Policy Optimization". ICML 2017.