SLIDE 1

Proximal Policy Optimization

Ruifan Yu (ruifan.yu@uwaterloo.ca) CS 885 June 20

SLIDE 2

Proximal Policy Optimization (OpenAI)

"PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance."

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

https://arxiv.org/pdf/1707.06347 https://blog.openai.com/openai-baselines-ppo/

SLIDE 3

Policy Gradient (REINFORCE)

In practice, we update on each batch (trajectory). * We use the same notation as the paper.
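The slide's equation did not survive extraction; for reference, the policy-gradient estimator and the loss it differentiates, written in the paper's notation (with Â_t an advantage estimate), are:

```latex
\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right],
\qquad
L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]
```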

SLIDE 4

Problem?

  • Unstable update
  • Step size is very important:
  • If the step size is too large:
  • Large step → bad policy
  • The next batch is generated from the current bad policy → we collect bad samples
  • Bad samples → an even worse policy

(compare to supervised learning: the correct labels and data in the following batches may correct it)

  • If the step size is too small: the learning process is slow

  • Data Inefficiency
  • On-policy method: for each new policy, we need to generate a completely new trajectory
  • The data is thrown out after just one gradient update
  • Since complex neural networks need many updates, this makes training very slow
SLIDE 5

Importance Sampling

Estimate an expectation under one distribution by sampling from another distribution

\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1,\; x_i \sim p}^{N} f(x_i)

\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_{x \sim q}\!\left[f(x)\, \frac{p(x)}{q(x)}\right] \approx \frac{1}{N} \sum_{i=1,\; x_i \sim q}^{N} f(x_i)\, \frac{p(x_i)}{q(x_i)}
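As a concrete illustration (my addition, not from the slides), a small NumPy sketch that estimates E_{x∼p}[f(x)] using only samples from a proposal q, reweighting each sample by p(x)/q(x):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2                                  # quantity whose expectation we want
# Target p = N(1, 1); proposal q = N(0, 1).
x = rng.normal(0.0, 1.0, size=100_000)                # samples drawn from q only
w = normal_pdf(x, 1.0) / normal_pdf(x, 0.0)           # importance ratios p(x) / q(x)

print(np.mean(w * f(x)))                              # ~2.0, since E_{x~p}[x^2] = 1^2 + 1
```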

SLIDE 6

Data Inefficiency

Data Inefficiency → Make it efficient

Use previous samples? We still need to evaluate the gradient of the current policy.

Can we estimate an expectation under one distribution without taking samples from it? That would let us avoid sampling from the current policy, like the replay buffer in DQN.

SLIDE 7

Importance Sampling in Policy Gradient

\nabla J(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_\theta}\!\left[\nabla \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right] = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_\theta(s_t, a_t)}{\pi_{\theta_{\mathrm{old}}}(s_t, a_t)}\, \nabla \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]

J(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_\theta(s_t, a_t)}{\pi_{\theta_{\mathrm{old}}}(s_t, a_t)}\, A(s_t, a_t)\right]

Surrogate objective function

\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]
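As a minimal sketch of this surrogate in code (my addition; `new_logprobs`, `old_logprobs`, and `advantages` are assumed to come from trajectories collected under π_θold):

```python
import torch

def surrogate_objective(new_logprobs, old_logprobs, advantages):
    """J(theta) = E[ pi_theta(a|s) / pi_theta_old(a|s) * A(s, a) ], estimated on a batch."""
    ratio = torch.exp(new_logprobs - old_logprobs)   # pi_theta / pi_theta_old
    return (ratio * advantages).mean()
```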

SLIDE 8

Importance Sampling

Problem? No free lunch! The two expectations are equal, but we estimate them by sampling, so the variance also matters.

\mathrm{Var}[X] = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2

Sampling from p directly:

\mathrm{Var}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim p}\!\left[f(x)^2\right] - \left(\mathbb{E}_{x \sim p}[f(x)]\right)^2

Sampling from q with importance weights:

\mathrm{Var}_{x \sim q}\!\left[f(x)\frac{p(x)}{q(x)}\right] = \mathbb{E}_{x \sim q}\!\left[\left(f(x)\frac{p(x)}{q(x)}\right)^2\right] - \left(\mathbb{E}_{x \sim q}\!\left[f(x)\frac{p(x)}{q(x)}\right]\right)^2 = \mathbb{E}_{x \sim p}\!\left[f(x)^2\frac{p(x)}{q(x)}\right] - \left(\mathbb{E}_{x \sim p}[f(x)]\right)^2

Price (tradeoff): we may need to sample more data if p(x)/q(x) is far from 1.
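To make the tradeoff concrete (my addition), a small NumPy experiment showing how the spread of the importance-sampling estimate grows as the proposal q drifts away from the target p = N(1, 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2
n, runs = 1_000, 500

for q_mean in [1.0, 0.0, -1.0, -2.0]:                 # q_mean = 1.0 means q = p (ratios are all 1)
    estimates = []
    for _ in range(runs):
        x = rng.normal(q_mean, 1.0, size=n)           # samples from q
        w = normal_pdf(x, 1.0) / normal_pdf(x, q_mean)  # ratios p(x) / q(x)
        estimates.append(np.mean(w * f(x)))
    print(f"q = N({q_mean:+.1f}, 1): std of the estimate = {np.std(estimates):.3f}")
```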

SLIDE 9

Unstable Update

Unstable → Stable

Adaptive learning rate? Limit the range of each policy update.
Can we measure the distance between two distributions? Then we can make confident updates.

SLIDE 10

KL Divergence

Measure the distance of two distributions

D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

D_{KL}(\pi_1 \,\|\, \pi_2)[s] = \sum_{a \in \mathcal{A}} \pi_1(a \mid s) \log \frac{\pi_1(a \mid s)}{\pi_2(a \mid s)} \quad \text{(KL divergence of two policies)}

* image: Kullback–Leibler divergence (Wikipedia) https://en.wikipedia.org/wiki/Kullback–Leibler_divergence
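A small sketch (my addition) of the discrete KL divergence between two policies' action distributions at one state:

```python
import numpy as np

def kl_divergence(pi1, pi2, eps=1e-12):
    """D_KL(pi1 || pi2) for two discrete action distributions at a single state."""
    pi1, pi2 = np.asarray(pi1, float), np.asarray(pi2, float)
    return float(np.sum(pi1 * np.log((pi1 + eps) / (pi2 + eps))))

print(kl_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))   # small: the policies are close
print(kl_divergence([0.7, 0.2, 0.1], [0.1, 0.1, 0.8]))   # large: the policies are far apart
```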

SLIDE 11

Trust Region Policy Optimization (TRPO)

TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of β that performs well across different problems, or even within a single problem, where the characteristics change over the course of learning. A common trick in optimization: the Lagrangian dual.
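For reference, the constrained problem TRPO solves and its penalized variant (as recapped in the PPO paper) are:

```latex
\underset{\theta}{\text{maximize}}\;\;
\hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]
\quad\text{subject to}\quad
\hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\,\pi_\theta(\cdot \mid s_t)\right]\right] \le \delta
```

```latex
\underset{\theta}{\text{maximize}}\;\;
\hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t
- \beta\,\mathrm{KL}\!\left[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\,\pi_\theta(\cdot \mid s_t)\right]\right]
```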

SLIDE 12

Proximal Policy Optimization (PPO)

TRPO uses conjugate gradient descent to handle the constraint. This involves the Hessian matrix → expensive in both computation and space.

Idea: the constraint helps the training process. However, maybe it does not have to be a strict constraint: does it matter if we break the constraint just a few times? What if we treat it as a "soft" constraint and add a proximal term to the objective function?
SLIDE 13

PPO with Adaptive KL Penalty

It is hard to pick the penalty coefficient β → use an adaptive β. But we still need to set a KL divergence target value… (see the sketch below)
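The penalized objective and the adaptive-β rule, as given in the paper, are:

```latex
L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t
- \beta\,\mathrm{KL}\!\left[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\,\pi_\theta(\cdot \mid s_t)\right]\right]
```

After each policy update, compute d = Ê_t[KL[π_θold(·|s_t), π_θ(·|s_t)]]: if d < d_targ / 1.5, set β ← β / 2; if d > d_targ × 1.5, set β ← β × 2.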

SLIDE 14

PPO with Adaptive KL Penalty

* CS294 Fall 2017, Lecture 13 http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf

SLIDE 15

PPO with Clipped Objective

Fluctuation happens when r changes too quickly → limit r to a range?

(figure: the probability ratio r clipped to the range [1 − ε, 1 + ε], shown for positive and negative advantage)
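The clipped surrogate objective from the paper is

```latex
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;\mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad r_t(\theta)=\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

A minimal PyTorch-style sketch of how it is typically computed (my addition; tensor names such as `advantages` are illustrative, and ε = 0.2 is the paper's suggested value):

```python
import torch

def clipped_surrogate(new_logprobs, old_logprobs, advantages, eps=0.2):
    """L^CLIP: minimum of the unclipped and clipped ratio-weighted advantage."""
    ratio = torch.exp(new_logprobs - old_logprobs)              # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()                 # maximize this quantity
```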

SLIDE 16

PPO with Clipped Objective

* CS294 Fall 2017, Lecture 13 http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf

SLIDE 17

PPO in practice

The full objective combines three terms:
  • the surrogate objective function (L^CLIP)
  • a squared-error loss for the "critic" (value function)
  • an entropy bonus to ensure sufficient exploration, encouraging "diversity"

* c1, c2 are empirical values; in the paper, c1 = 1 and c2 = 0.01. A sketch of the combined loss follows.
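The combined objective from the paper, with value-function and entropy terms, is

```latex
L^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[L^{CLIP}_t(\theta) - c_1\,L^{VF}_t(\theta) + c_2\,S[\pi_\theta](s_t)\right],
\qquad L^{VF}_t(\theta) = \left(V_\theta(s_t) - V^{\mathrm{targ}}_t\right)^2
```

A minimal PyTorch-style sketch of the combined loss (my addition; names are illustrative, and the objective is negated so it can be minimized with gradient descent):

```python
import torch

def ppo_loss(new_logprobs, old_logprobs, advantages, values, value_targets, entropy,
             eps=0.2, c1=1.0, c2=0.01):
    """Negative of L^{CLIP+VF+S}: clipped surrogate - c1 * value loss + c2 * entropy bonus."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    l_clip = torch.min(ratio * advantages,
                       torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    l_vf = ((values - value_targets) ** 2).mean()        # squared-error loss for the critic
    return -(l_clip - c1 * l_vf + c2 * entropy.mean())   # minimize the negative objective
```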

SLIDE 18

Performance

Results from the continuous-control benchmark: average normalized scores (over 21 runs of the algorithm, on 7 environments)

SLIDE 19

Performance

Results in MuJoCo environments, training for one million timesteps

SLIDE 20

Related Works

[1] Emergence of Locomotion Behaviours in Rich Environments
  • Distributed PPO
  • Interesting fact: this paper was published before the PPO paper; DeepMind got the idea from OpenAI's talk at NIPS 2016

[2] An Adaptive Clipping Approach for Proximal Policy Optimization
  • PPO-λ: change the clipping range adaptively

[1] https://arxiv.org/abs/1707.02286
[2] https://arxiv.org/abs/1804.06461

SLIDE 21

END

Thank you