SLIDE 1

Trust Region Policy Optimization

John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel @ICML 2015

Presenter: Shivam Kalra (Shivam.kalra@uwaterloo.ca)
CS 885 (Reinforcement Learning), Prof. Pascal Poupart
June 20th, 2018

SLIDE 2

Reinforcement Learning

Landscape of RL methods (diagram): Action-Value Function, Q-Learning, Policy Gradients, TRPO, Actor-Critic, PPO, A3C, ACKTR

Ref: https://www.youtube.com/watch?v=CKaN5PgkSBc

SLIDE 3

Policy Gradient

For i = 1, 2, …
    Collect N trajectories under the current policy Ο€_ΞΈ
    Estimate the advantage function A
    Compute the policy gradient g
    Update the policy parameters: ΞΈ = ΞΈ_old + Ξ±g
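
To make the loop concrete, here is a minimal sketch of one such iteration for a discrete-action task. It assumes a linear softmax policy, a plain REINFORCE-style return-to-go (no baseline) in place of a learned advantage, and a toy env object whose step() returns (next_state, reward, done); these are illustrative assumptions, not the presenter's code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(env, theta, n_traj=10, gamma=0.99, alpha=0.01):
    """One vanilla policy-gradient iteration with a linear softmax policy.
    theta: (n_actions, n_features) weights -- a hypothetical parameterization."""
    grad = np.zeros_like(theta)
    for _ in range(n_traj):
        s, done, traj = env.reset(), False, []
        while not done:
            probs = softmax(theta @ s)                  # pi_theta(. | s)
            a = np.random.choice(len(probs), p=probs)
            s_next, r, done = env.step(a)
            traj.append((s, a, r, probs))
            s = s_next
        G = 0.0                                         # return-to-go as a crude advantage
        for s, a, r, probs in reversed(traj):
            G = r + gamma * G
            dlogp = -np.outer(probs, s)                 # grad of log pi_theta(a|s) w.r.t. theta
            dlogp[a] += s
            grad += dlogp * G
    return theta + alpha * grad / n_traj                # theta = theta_old + alpha * g
```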

SLIDE 4

Problems of Policy Gradient

For i = 1, 2, …
    Collect N trajectories under the current policy Ο€_ΞΈ
    Estimate the advantage function A
    Compute the policy gradient g
    Update the policy parameters: ΞΈ = ΞΈ_old + Ξ±g

Problem: the input data are non-stationary. Because the policy keeps changing, the distributions of observations and rewards change with it.

SLIDE 5

Problems of Policy Gradient

For i = 1, 2, …
    Collect N trajectories under the current policy Ο€_ΞΈ
    Estimate the advantage function A
    Compute the policy gradient g
    Update the policy parameters: ΞΈ = ΞΈ_old + Ξ±g

Problem: the advantage estimate is essentially random early in training, so the feedback it gives the policy ("you're bad") is unreliable.

SLIDE 6

Problems of Policy Gradient

For i = 1, 2, …
    Collect N trajectories under the current policy Ο€_ΞΈ
    Estimate the advantage function A
    Compute the policy gradient g
    Update the policy parameters: ΞΈ = ΞΈ_old + Ξ±g

We need a more carefully crafted policy update: we want improvement, not degradation. Idea: update the old policy Ο€_old to a new policy π̃ such that the two stay a "trusted" distance apart. Such a conservative policy update yields improvement instead of degradation.
SLIDE 7

RL to Optimization

  • Most of ML is optimization
  • Supervised learning minimizes a training loss
  • RL: what is the policy gradient optimizing?
  • It favors state-action pairs (s, a) that had higher advantage A
  • Can we write down an optimization problem that lets us take a small step on a policy Ο€ using data sampled from Ο€ (on-policy data)?

Ref: https://www.youtube.com/watch?v=xvRrgxcpaHY (6:40)

SLIDE 8

What loss to optimize?

  • Optimize Ξ·(Ο€), i.e., the expected return of a policy Ο€
  • We collect data with Ο€_old and optimize the objective to obtain a new policy π̃

πœƒ 𝜌 = 𝔽𝑑0~𝜍0,𝑏𝑒~𝜌 . 𝑑𝑒 ෍

𝑒=0 ∞

𝛿𝑒𝑠

𝑒
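
For intuition, Ξ·(Ο€) can be estimated by Monte-Carlo: average the discounted return over trajectories sampled with Ο€. A tiny sketch, assuming trajectories are given simply as lists of rewards (an illustrative format):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for a single trajectory."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

def estimate_eta(reward_trajectories, gamma=0.99):
    """Monte-Carlo estimate of eta(pi) from trajectories sampled under pi."""
    return np.mean([discounted_return(rs, gamma) for rs in reward_trajectories])
```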

SLIDE 9

What loss to optimize?

  • We can express Ξ·(π̃) in terms of the advantage over the original policy [1]:

Ξ·(π̃) = Ξ·(Ο€_old) + E_{Ο„ ~ π̃} [ Ξ£_{t=0}^∞ Ξ³^t A_{Ο€_old}(s_t, a_t) ]

[1] Kakade, Sham, and John Langford. "Approximately optimal approximate reinforcement learning." ICML. Vol. 2. 2002.

(Ξ·(π̃): expected return of the new policy; Ξ·(Ο€_old): expected return of the old policy; the expectation is over trajectories sampled from the new policy.)

SLIDE 10

What loss to optimize?

  • The previous equation can be rewritten as [1]:

Ξ·(π̃) = Ξ·(Ο€_old) + Ξ£_s ρ_π̃(s) Ξ£_a π̃(a|s) A_{Ο€_old}(s, a)

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

(Ξ·(π̃): expected return of the new policy; Ξ·(Ο€_old): expected return of the old policy.)

Discounted visitation frequency: ρ_Ο€(s) = P(s_0 = s) + Ξ³ P(s_1 = s) + Ξ³Β² P(s_2 = s) + β‹―
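
The discounted visitation frequency also has a direct empirical counterpart: weight each visit to a state by Ξ³^t and average over trajectories. A small sketch, assuming trajectories are given as lists of hashable states sampled under Ο€:

```python
from collections import defaultdict

def discounted_visitation(state_trajectories, gamma=0.99):
    """Empirical rho_pi(s) = P(s0=s) + gamma*P(s1=s) + gamma^2*P(s2=s) + ..."""
    rho = defaultdict(float)
    for states in state_trajectories:
        for t, s in enumerate(states):
            rho[s] += gamma**t
    n = len(state_trajectories)
    return {s: v / n for s, v in rho.items()}
```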

SLIDE 11

πœƒ ΰ·€ 𝜌 = πœƒ πœŒπ‘π‘šπ‘’ + ෍

𝑑

𝜍ΰ·₯

𝜌(𝑑) ෍ 𝑏

ΰ·€ 𝜌 𝑏 𝑑 𝐡𝜌(𝑑, 𝑏)

What loss to optimize?

New Expected Return Old Expected Return

β‰₯ 𝟏

SLIDE 12

πœƒ ΰ·€ 𝜌 = πœƒ πœŒπ‘π‘šπ‘’ + ෍

𝑑

𝜍ΰ·₯

𝜌(𝑑) ෍ 𝑏

ΰ·€ 𝜌 𝑏 𝑑 𝐡𝜌(𝑑, 𝑏)

New Expected Return Old Expected Return

>

What loss to optimize?

β‰₯ 𝟏

Guaranteed Improvement from πœŒπ‘π‘šπ‘’ β†’ ΰ·€ 𝜌

SLIDE 13

πœƒ ΰ·€ 𝜌 = πœƒ πœŒπ‘π‘šπ‘’ + ෍

𝑑

𝜍ΰ·₯

𝜌(𝑑) ෍ 𝑏

ΰ·€ 𝜌 𝑏 𝑑 𝐡𝜌(𝑑, 𝑏)

New State Visitation is Difficult

State visitation based on new policy New policy β€œComplex dependency of 𝜍ΰ·₯

𝜌(𝑑) on

ව 𝜌 makes the equation difficult to

  • ptimize directly.” [1]

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

SLIDE 14

πœƒ ΰ·€ 𝜌 = πœƒ πœŒπ‘π‘šπ‘’ + ෍

𝑑

𝜍ΰ·₯

𝜌(𝑑) ෍ 𝑏

ΰ·€ 𝜌 𝑏 𝑑 𝐡𝜌(𝑑, 𝑏)

New State Visitation is Difficult

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

𝑀 ΰ·€ 𝜌 = πœƒ πœŒπ‘π‘šπ‘’ + ෍

𝑑

𝜍𝜌(𝑑) ෍

𝑏

ΰ·€ 𝜌 𝑏 𝑑 𝐡𝜌(𝑑, 𝑏)

Local approximation of 𝜽(ΰ·₯ 𝝆)

SLIDE 15

Local approximation of Ξ·(π̃)

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

L(π̃) = Ξ·(Ο€_old) + Ξ£_s ρ_{Ο€_old}(s) Ξ£_a π̃(a|s) A_{Ο€_old}(s, a)

The approximation is accurate within a step size Ξ΄ (the trust region): as long as Ο€_ΞΈβ€²(a|s) does not change dramatically from Ο€_ΞΈ(a|s), monotonic improvement is guaranteed.

SLIDE 16
Local approximation of Ξ·(π̃)

  • The following bound holds:

Ξ·(π̃) β‰₯ L(π̃) βˆ’ C Β· D_KL^max(Ο€_old, π̃),   where C = 4Ργ / (1 βˆ’ Ξ³)Β² and Ξ΅ = max_{s,a} |A_{Ο€_old}(s, a)|

  • Monotonically improving policies can therefore be generated by:

Ο€_new = argmax_π̃ [ L(π̃) βˆ’ C Β· D_KL^max(Ο€_old, π̃) ]

SLIDE 17

Minorization-Maximization (MM) algorithm

(Figure: the actual objective Ξ·(Ο€) and the surrogate function L(Ο€) βˆ’ C Β· D_KL^max(Ο€_old, Ο€), which minorizes it; maximizing the surrogate can only improve the actual objective.)

SLIDE 18

Optimization of Parameterized Policies

  • Now policies are parameterized: Ο€_ΞΈ(a|s) with parameter vector ΞΈ
  • Accordingly, the surrogate objective becomes:

argmax_ΞΈ [ L(ΞΈ) βˆ’ C Β· D_KL^max(ΞΈ_old, ΞΈ) ]

SLIDE 19

Optimization of Parameterized Policies

argmax_ΞΈ [ L(ΞΈ) βˆ’ C Β· D_KL^max(ΞΈ_old, ΞΈ) ]

In practice the penalty coefficient C results in very small step sizes. One way to take larger steps is to constrain the KL divergence between the new policy and the old policy, i.e., to impose a trust region constraint:

maximize_ΞΈ L(ΞΈ)   subject to   D_KL^max(ΞΈ_old, ΞΈ) ≀ Ξ΄

SLIDE 20

Solving KL-Penalized Problem

  • maximize_ΞΈ [ L(ΞΈ) βˆ’ C Β· D_KL^max(ΞΈ_old, ΞΈ) ]
  • Use the mean KL divergence over states instead of the max,
  • i.e., maximize_ΞΈ [ L(ΞΈ) βˆ’ C Β· D_KL(ΞΈ_old, ΞΈ) ]

  • Make a linear approximation to L and a quadratic approximation to the KL term:

maximize_ΞΈ  g Β· (ΞΈ βˆ’ ΞΈ_old) βˆ’ (c/2) (ΞΈ βˆ’ ΞΈ_old)α΅€ F (ΞΈ βˆ’ ΞΈ_old)

where g = βˆ‚L(ΞΈ)/βˆ‚ΞΈ |_{ΞΈ = ΞΈ_old} and F = βˆ‚Β²D_KL(ΞΈ_old, ΞΈ)/βˆ‚ΞΈΒ² |_{ΞΈ = ΞΈ_old}

SLIDE 21

Solving KL-Penalized Problem

  • Make a linear approximation to L and a quadratic approximation to the KL term:

maximize_ΞΈ  g Β· (ΞΈ βˆ’ ΞΈ_old) βˆ’ (c/2) (ΞΈ βˆ’ ΞΈ_old)α΅€ F (ΞΈ βˆ’ ΞΈ_old)

where g = βˆ‚L(ΞΈ)/βˆ‚ΞΈ |_{ΞΈ = ΞΈ_old} and F = βˆ‚Β²D_KL(ΞΈ_old, ΞΈ)/βˆ‚ΞΈΒ² |_{ΞΈ = ΞΈ_old}

  • Solution: ΞΈ βˆ’ ΞΈ_old = (1/c) F⁻¹ g
  • We do not want to form the full Hessian matrix F
  • We can compute F⁻¹g approximately with the conjugate gradient algorithm, without forming F explicitly

SLIDE 22

Conjugate Gradient (CG)

  • The conjugate gradient algorithm approximately solves x = A⁻¹b without explicitly forming the matrix A; it only needs matrix-vector products Ax
  • After k iterations, CG has minimized Β½ xα΅€Ax βˆ’ bα΅€x over a k-dimensional Krylov subspace

SLIDE 23

TRPO: KL-Constrained

  • Unconstrained (penalized) problem: maximize_ΞΈ [ L(ΞΈ) βˆ’ C Β· D_KL(ΞΈ_old, ΞΈ) ]
  • Constrained problem: maximize_ΞΈ L(ΞΈ) subject to D_KL(ΞΈ_old, ΞΈ) ≀ Ξ΄
  • Ξ΄ is a hyper-parameter that remains fixed over the whole learning process
  • Solve the constrained quadratic problem: compute F⁻¹g and then rescale the step to get the correct KL

maximize_ΞΈ  g Β· (ΞΈ βˆ’ ΞΈ_old)   subject to   Β½ (ΞΈ βˆ’ ΞΈ_old)α΅€ F (ΞΈ βˆ’ ΞΈ_old) ≀ Ξ΄

  • Lagrangian: β„’(ΞΈ, Ξ») = g Β· (ΞΈ βˆ’ ΞΈ_old) βˆ’ (Ξ»/2) [ (ΞΈ βˆ’ ΞΈ_old)α΅€ F (ΞΈ βˆ’ ΞΈ_old) βˆ’ Ξ΄ ]
  • Differentiating with respect to ΞΈ gives ΞΈ βˆ’ ΞΈ_old = (1/Ξ») F⁻¹ g
  • We want the step s to satisfy Β½ sα΅€ F s = Ξ΄
  • Given a candidate step s_unscaled, rescale it to s = √( 2Ξ΄ / (s_unscaledα΅€ F s_unscaled) ) Β· s_unscaled

SLIDE 24

TRPO Algorithm

For i = 1, 2, …
    Collect N trajectories under the current policy Ο€_ΞΈ
    Estimate the advantage function A
    Compute the policy gradient g
    Use CG to compute the step direction F⁻¹g
    Compute the rescaled step s = Ξ± F⁻¹g (rescaling plus line search)
    Apply the update: ΞΈ = ΞΈ_old + Ξ± F⁻¹g

This approximately solves: maximize_ΞΈ L(ΞΈ) subject to D_KL(ΞΈ_old, ΞΈ) ≀ Ξ΄

SLIDE 25

Questions?