SLIDE 1

Trust Region Policy Optimization

John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel @ICML 2015

Presenter: Shivam Kalra (Shivam.kalra@uwaterloo.ca)
CS 885 (Reinforcement Learning), Prof. Pascal Poupart
June 20th, 2018

SLIDE 2

Reinforcement Learning

Landscape of RL methods (diagram): Action-Value Function, Q-Learning, Policy Gradients, TRPO, Actor-Critic, PPO, A3C, ACKTR

Ref: https://www.youtube.com/watch?v=CKaN5PgkSBc

SLIDE 3

Policy Gradient

For i = 1, 2, …
    Collect N trajectories under the current policy Ο€_ΞΈ
    Estimate the advantage function A
    Compute the policy gradient g
    Update the policy parameters: ΞΈ = ΞΈ_old + Ξ±g
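
To make the loop concrete, here is a minimal sketch of one such iteration for a discrete-action task. It assumes a linear softmax policy, a plain REINFORCE-style return-to-go (no baseline) in place of a learned advantage, and a toy env object whose step() returns (next_state, reward, done); these are illustrative assumptions, not the presenter's code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(env, theta, n_traj=10, gamma=0.99, alpha=0.01):
    """One vanilla policy-gradient iteration with a linear softmax policy.
    theta: (n_actions, n_features) weights -- a hypothetical parameterization."""
    grad = np.zeros_like(theta)
    for _ in range(n_traj):
        s, done, traj = env.reset(), False, []
        while not done:
            probs = softmax(theta @ s)                  # pi_theta(. | s)
            a = np.random.choice(len(probs), p=probs)
            s_next, r, done = env.step(a)
            traj.append((s, a, r, probs))
            s = s_next
        G = 0.0                                         # return-to-go as a crude advantage
        for s, a, r, probs in reversed(traj):
            G = r + gamma * G
            dlogp = -np.outer(probs, s)                 # grad of log pi_theta(a|s) w.r.t. theta
            dlogp[a] += s
            grad += dlogp * G
    return theta + alpha * grad / n_traj                # theta = theta_old + alpha * g
```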

SLIDE 4

Problems of Policy Gradient

For i = 1, 2, …
    Collect N trajectories under the current policy Ο€_ΞΈ
    Estimate the advantage function A
    Compute the policy gradient g
    Update the policy parameters: ΞΈ = ΞΈ_old + Ξ±g

Problem: the input data are non-stationary. Because the policy keeps changing, the distributions of observations and rewards change with it.

SLIDE 5

Problems of Policy Gradient

For i = 1, 2, …
    Collect N trajectories under the current policy Ο€_ΞΈ
    Estimate the advantage function A
    Compute the policy gradient g
    Update the policy parameters: ΞΈ = ΞΈ_old + Ξ±g

Problem: the advantage estimate is essentially random early in training, so the feedback it gives the policy ("you're bad") is unreliable.

SLIDE 6

Problems of Policy Gradient

For i = 1, 2, …
    Collect N trajectories under the current policy Ο€_ΞΈ
    Estimate the advantage function A
    Compute the policy gradient g
    Update the policy parameters: ΞΈ = ΞΈ_old + Ξ±g

We need a more carefully crafted policy update: we want improvement, not degradation. Idea: update the old policy Ο€_old to a new policy π̃ such that the two stay a "trusted" distance apart. Such a conservative policy update yields improvement instead of degradation.
SLIDE 7

RL to Optimization

  • Most of ML is optimization
  • Supervised learning minimizes a training loss
  • RL: what is the policy gradient optimizing?
  • It favors state-action pairs (s, a) that had higher advantage A
  • Can we write down an optimization problem that lets us take a small step on a policy Ο€ using data sampled from Ο€ (on-policy data)?

Ref: https://www.youtube.com/watch?v=xvRrgxcpaHY (6:40)

SLIDE 8

What loss to optimize?

  • Optimize Ξ·(Ο€), i.e., the expected return of a policy Ο€
  • We collect data with Ο€_old and optimize the objective to obtain a new policy π̃

πœƒ 𝜌 = 𝔽𝑑0~𝜍0,𝑏𝑒~𝜌 . 𝑑𝑒 ෍

𝑒=0 ∞

𝛿𝑒𝑠

𝑒
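
For intuition, Ξ·(Ο€) can be estimated by Monte-Carlo: average the discounted return over trajectories sampled with Ο€. A tiny sketch, assuming trajectories are given simply as lists of rewards (an illustrative format):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for a single trajectory."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

def estimate_eta(reward_trajectories, gamma=0.99):
    """Monte-Carlo estimate of eta(pi) from trajectories sampled under pi."""
    return np.mean([discounted_return(rs, gamma) for rs in reward_trajectories])
```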

SLIDE 9

What loss to optimize?

  • We can express Ξ·(π̃) in terms of the advantage over the original policy [1]:

Ξ·(π̃) = Ξ·(Ο€_old) + E_{Ο„ ~ π̃} [ Ξ£_{t=0}^∞ Ξ³^t A_{Ο€_old}(s_t, a_t) ]

[1] Kakade, Sham, and John Langford. "Approximately optimal approximate reinforcement learning." ICML. Vol. 2. 2002.

(Ξ·(π̃): expected return of the new policy; Ξ·(Ο€_old): expected return of the old policy; the expectation is over trajectories sampled from the new policy.)

SLIDE 10

What loss to optimize?

  • The previous equation can be rewritten as [1]:

Ξ·(π̃) = Ξ·(Ο€_old) + Ξ£_s ρ_π̃(s) Ξ£_a π̃(a|s) A_{Ο€_old}(s, a)

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

(Ξ·(π̃): expected return of the new policy; Ξ·(Ο€_old): expected return of the old policy.)

Discounted visitation frequency: ρ_Ο€(s) = P(s_0 = s) + Ξ³ P(s_1 = s) + Ξ³Β² P(s_2 = s) + β‹―
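
The discounted visitation frequency also has a direct empirical counterpart: weight each visit to a state by Ξ³^t and average over trajectories. A small sketch, assuming trajectories are given as lists of hashable states sampled under Ο€:

```python
from collections import defaultdict

def discounted_visitation(state_trajectories, gamma=0.99):
    """Empirical rho_pi(s) = P(s0=s) + gamma*P(s1=s) + gamma^2*P(s2=s) + ..."""
    rho = defaultdict(float)
    for states in state_trajectories:
        for t, s in enumerate(states):
            rho[s] += gamma**t
    n = len(state_trajectories)
    return {s: v / n for s, v in rho.items()}
```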

SLIDE 11

πœƒ ΰ·€ 𝜌 = πœƒ πœŒπ‘π‘šπ‘’ + ෍

𝑑

𝜍ΰ·₯

𝜌(𝑑) ෍ 𝑏

ΰ·€ 𝜌 𝑏 𝑑 𝐡𝜌(𝑑, 𝑏)

What loss to optimize?

New Expected Return Old Expected Return

β‰₯ 𝟏

SLIDE 12

πœƒ ΰ·€ 𝜌 = πœƒ πœŒπ‘π‘šπ‘’ + ෍

𝑑

𝜍ΰ·₯

𝜌(𝑑) ෍ 𝑏

ΰ·€ 𝜌 𝑏 𝑑 𝐡𝜌(𝑑, 𝑏)

New Expected Return Old Expected Return

>

What loss to optimize?

β‰₯ 𝟏

Guaranteed Improvement from πœŒπ‘π‘šπ‘’ β†’ ΰ·€ 𝜌

SLIDE 13

πœƒ ΰ·€ 𝜌 = πœƒ πœŒπ‘π‘šπ‘’ + ෍

𝑑

𝜍ΰ·₯

𝜌(𝑑) ෍ 𝑏

ΰ·€ 𝜌 𝑏 𝑑 𝐡𝜌(𝑑, 𝑏)

New State Visitation is Difficult

State visitation based on new policy New policy β€œComplex dependency of 𝜍ΰ·₯

𝜌(𝑑) on

ව 𝜌 makes the equation difficult to

  • ptimize directly.” [1]

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

SLIDE 14

πœƒ ΰ·€ 𝜌 = πœƒ πœŒπ‘π‘šπ‘’ + ෍

𝑑

𝜍ΰ·₯

𝜌(𝑑) ෍ 𝑏

ΰ·€ 𝜌 𝑏 𝑑 𝐡𝜌(𝑑, 𝑏)

New State Visitation is Difficult

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

𝑀 ΰ·€ 𝜌 = πœƒ πœŒπ‘π‘šπ‘’ + ෍

𝑑

𝜍𝜌(𝑑) ෍

𝑏

ΰ·€ 𝜌 𝑏 𝑑 𝐡𝜌(𝑑, 𝑏)

Local approximation of 𝜽(ΰ·₯ 𝝆)

SLIDE 15

Local approximation of Ξ·(π̃)

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

L(π̃) = Ξ·(Ο€_old) + Ξ£_s ρ_{Ο€_old}(s) Ξ£_a π̃(a|s) A_{Ο€_old}(s, a)

The approximation is accurate within a step size Ξ΄ (the trust region): as long as Ο€_ΞΈβ€²(a|s) does not change dramatically from Ο€_ΞΈ(a|s), monotonic improvement is guaranteed.

SLIDE 16
Local approximation of Ξ·(π̃)

  • The following bound holds:

Ξ·(π̃) β‰₯ L(π̃) βˆ’ C Β· D_KL^max(Ο€_old, π̃),   where C = 4Ργ / (1 βˆ’ Ξ³)Β² and Ξ΅ = max_{s,a} |A_{Ο€_old}(s, a)|

  • Monotonically improving policies can therefore be generated by:

Ο€_new = argmax_π̃ [ L(π̃) βˆ’ C Β· D_KL^max(Ο€_old, π̃) ]

SLIDE 17

Minorization-Maximization (MM) algorithm

(Figure: the actual objective Ξ·(Ο€) and the surrogate function L(Ο€) βˆ’ C Β· D_KL^max(Ο€_old, Ο€), which minorizes it; maximizing the surrogate can only improve the actual objective.)

SLIDE 18

Optimization of Parameterized Policies

  • Now policies are parameterized: Ο€_ΞΈ(a|s) with parameter vector ΞΈ
  • Accordingly, the surrogate objective becomes:

argmax_ΞΈ [ L(ΞΈ) βˆ’ C Β· D_KL^max(ΞΈ_old, ΞΈ) ]

SLIDE 19

Optimization of Parameterized Policies

argmax_ΞΈ [ L(ΞΈ) βˆ’ C Β· D_KL^max(ΞΈ_old, ΞΈ) ]

In practice the penalty coefficient C results in very small step sizes. One way to take larger steps is to constrain the KL divergence between the new policy and the old policy, i.e., to impose a trust region constraint:

maximize_ΞΈ L(ΞΈ)   subject to   D_KL^max(ΞΈ_old, ΞΈ) ≀ Ξ΄

SLIDE 20

Solving KL-Penalized Problem

  • maximize_ΞΈ [ L(ΞΈ) βˆ’ C Β· D_KL^max(ΞΈ_old, ΞΈ) ]
  • Use the mean KL divergence over states instead of the max,
  • i.e., maximize_ΞΈ [ L(ΞΈ) βˆ’ C Β· D_KL(ΞΈ_old, ΞΈ) ]

  • Make a linear approximation to L and a quadratic approximation to the KL term:

maximize_ΞΈ  g Β· (ΞΈ βˆ’ ΞΈ_old) βˆ’ (c/2) (ΞΈ βˆ’ ΞΈ_old)α΅€ F (ΞΈ βˆ’ ΞΈ_old)

where g = βˆ‚L(ΞΈ)/βˆ‚ΞΈ |_{ΞΈ = ΞΈ_old} and F = βˆ‚Β²D_KL(ΞΈ_old, ΞΈ)/βˆ‚ΞΈΒ² |_{ΞΈ = ΞΈ_old}

SLIDE 21

Solving KL-Penalized Problem

  • Make a linear approximation to L and a quadratic approximation to the KL term:

maximize_ΞΈ  g Β· (ΞΈ βˆ’ ΞΈ_old) βˆ’ (c/2) (ΞΈ βˆ’ ΞΈ_old)α΅€ F (ΞΈ βˆ’ ΞΈ_old)

where g = βˆ‚L(ΞΈ)/βˆ‚ΞΈ |_{ΞΈ = ΞΈ_old} and F = βˆ‚Β²D_KL(ΞΈ_old, ΞΈ)/βˆ‚ΞΈΒ² |_{ΞΈ = ΞΈ_old}

  • Solution: ΞΈ βˆ’ ΞΈ_old = (1/c) F⁻¹ g
  • We do not want to form the full Hessian matrix F
  • We can compute F⁻¹g approximately with the conjugate gradient algorithm, without forming F explicitly

SLIDE 22

Conjugate Gradient (CG)

  • The conjugate gradient algorithm approximately solves x = A⁻¹b without explicitly forming the matrix A; it only needs matrix-vector products Ax
  • After k iterations, CG has minimized Β½ xα΅€Ax βˆ’ bα΅€x over a k-dimensional Krylov subspace

SLIDE 23

TRPO: KL-Constrained

  • Unconstrained (penalized) problem: maximize_ΞΈ [ L(ΞΈ) βˆ’ C Β· D_KL(ΞΈ_old, ΞΈ) ]
  • Constrained problem: maximize_ΞΈ L(ΞΈ) subject to D_KL(ΞΈ_old, ΞΈ) ≀ Ξ΄
  • Ξ΄ is a hyper-parameter that remains fixed over the whole learning process
  • Solve the constrained quadratic problem: compute F⁻¹g and then rescale the step to get the correct KL

maximize_ΞΈ  g Β· (ΞΈ βˆ’ ΞΈ_old)   subject to   Β½ (ΞΈ βˆ’ ΞΈ_old)α΅€ F (ΞΈ βˆ’ ΞΈ_old) ≀ Ξ΄

  • Lagrangian: β„’(ΞΈ, Ξ») = g Β· (ΞΈ βˆ’ ΞΈ_old) βˆ’ (Ξ»/2) [ (ΞΈ βˆ’ ΞΈ_old)α΅€ F (ΞΈ βˆ’ ΞΈ_old) βˆ’ Ξ΄ ]
  • Differentiating with respect to ΞΈ gives ΞΈ βˆ’ ΞΈ_old = (1/Ξ») F⁻¹ g
  • We want the step s to satisfy Β½ sα΅€ F s = Ξ΄
  • Given a candidate step s_unscaled, rescale it to s = √( 2Ξ΄ / (s_unscaledα΅€ F s_unscaled) ) Β· s_unscaled

SLIDE 24

TRPO Algorithm

For i = 1, 2, …
    Collect N trajectories under the current policy Ο€_ΞΈ
    Estimate the advantage function A
    Compute the policy gradient g
    Use CG to compute the step direction F⁻¹g
    Compute the rescaled step s = Ξ± F⁻¹g (rescaling plus line search)
    Apply the update: ΞΈ = ΞΈ_old + Ξ± F⁻¹g

This approximately solves: maximize_ΞΈ L(ΞΈ) subject to D_KL(ΞΈ_old, ΞΈ) ≀ Ξ΄

SLIDE 25

Questions?