Trust Region Policy Optimization


  1. Trust Region Policy Optimization. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel @ ICML 2015. Presenter: Shivam Kalra (Shivam.kalra@uwaterloo.ca), CS 885 (Reinforcement Learning), Prof. Pascal Poupart, June 20th, 2018.

  2. Reinforcement Learning. [Diagram: the RL method landscape, split into action-value-function methods (Q-Learning) and policy-gradient methods (Actor-Critic), with TRPO, PPO, A3C, and ACKTR in the policy-gradient/actor-critic family.] Ref: https://www.youtube.com/watch?v=CKaN5PgkSBc

  3. Policy Gradient. For i = 1, 2, …: collect N trajectories for policy π_θ; estimate the advantage function A; compute the policy gradient g; update the policy parameters θ = θ_old + α·g.
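A minimal sketch of this loop in PyTorch. The environment `env`, the `collect_trajectories` and `estimate_advantages` helpers, and the `policy.log_prob(states, actions)` method are assumptions for illustration, not part of the slides:

```python
# Minimal vanilla policy-gradient step (REINFORCE-style with advantages).
import torch

def policy_gradient_step(policy, optimizer, env, n_traj=10, gamma=0.99):
    # 1. Collect N trajectories under the current policy pi_theta
    trajectories = collect_trajectories(env, policy, n_traj)

    # 2. Estimate the advantage A(s, a) for every visited (s, a) pair
    states, actions, advantages = estimate_advantages(trajectories, gamma)

    # 3. Policy-gradient objective: E[ log pi_theta(a|s) * A(s, a) ]
    log_probs = policy.log_prob(states, actions)
    loss = -(log_probs * advantages).mean()   # negate: optimizers minimize

    # 4. Update theta = theta_old + alpha * g (alpha is the optimizer's lr)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```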

  4. Problems of Policy Gradient. For i = 1, 2, …: collect N trajectories for policy π_θ; estimate the advantage function A; compute the policy gradient g; update θ = θ_old + α·g. Problem: the input data are non-stationary, because the state and reward distributions change as the policy changes.

  5. Problems of Policy Gradient. For i = 1, 2, …: collect N trajectories for policy π_θ; estimate the advantage function A; compute the policy gradient g; update θ = θ_old + α·g. Problem: the advantage estimates are very noisy initially (early on, the advantage signal amounts to little more than telling the policy "you're bad").

  6. Problems of Policy Gradient. For i = 1, 2, …: collect N trajectories for policy π_θ; estimate the advantage function A; compute the policy gradient g; update θ = θ_old + α·g. We need a more carefully crafted policy update: we want improvement, not degradation. Idea: update the old policy π_old to a new policy π̃ such that the two stay a "trusted" distance apart. Such a conservative policy update allows improvement instead of degradation.

  7. RL to Optimization. Most of ML is optimization; supervised learning is reducing a training loss. What is the policy gradient optimizing? It favors state-action pairs (s, a) that gave more advantage A. Can we write down an optimization problem that lets us make a small update to a policy π based on data sampled from π (on-policy data)? Ref: https://www.youtube.com/watch?v=xvRrgxcpaHY (6:40)

  8. What loss to optimize? Optimize η(π), the expected return of a policy π:
η(π) = E_{s_0 ~ ρ_0, a_t ~ π(·|s_t)} [ Σ_{t=0}^∞ γ^t r(s_t) ].
We collect data with π_old and optimize this objective to get a new policy π̃.
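The expectation can be estimated by Monte Carlo rollouts. A small sketch, assuming a hypothetical `run_episode(env, policy)` helper that returns the list of rewards from one episode:

```python
# Monte Carlo estimate of the expected discounted return eta(pi).
def estimate_return(env, policy, n_episodes=100, gamma=0.99):
    total = 0.0
    for _ in range(n_episodes):
        rewards = run_episode(env, policy)               # r(s_0), r(s_1), ...
        total += sum(gamma**t * r for t, r in enumerate(rewards))
    return total / n_episodes                            # sample mean approximates eta(pi)
```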

  9. What loss to optimize? We can express η(π̃) in terms of the advantage over the original policy [1]:
η(π̃) = η(π_old) + E_{τ ~ π̃} [ Σ_{t=0}^∞ γ^t A_{π_old}(s_t, a_t) ],
i.e., the expected return of the new policy equals the expected return of the old policy plus an advantage term evaluated on trajectories sampled from the new policy.
[1] Kakade, Sham, and John Langford. "Approximately optimal approximate reinforcement learning." ICML, 2002.

  10. What loss to optimize? The previous equation can be rewritten as [1]:
η(π̃) = η(π_old) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_{π_old}(s, a),
where ρ_π(s) = P(s_0 = s) + γ P(s_1 = s) + γ² P(s_2 = s) + … is the discounted state-visitation frequency.
[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
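To make the visitation frequency concrete, here is a tiny tabular sketch of estimating ρ_π(s) from sampled trajectories; `trajectories` (a list of state sequences with hashable states) is an assumption for illustration:

```python
# Empirical discounted state-visitation weights rho_pi(s).
from collections import defaultdict

def visitation_frequency(trajectories, gamma=0.99):
    rho = defaultdict(float)
    for states in trajectories:
        for t, s in enumerate(states):
            rho[s] += gamma**t          # P(s_t = s) contributes weight gamma^t
    n = len(trajectories)
    return {s: v / n for s, v in rho.items()}   # average over trajectories
```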

  11. What loss to optimize?
η(π̃) = η(π_old) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_{π_old}(s, a)
Old expected return: η(π_old). New expected return: η(π̃). If the advantage term (the double sum) is ≥ 0, the new expected return is at least the old one.

  12. What loss to optimize?
η(π̃) = η(π_old) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_{π_old}(s, a)
If the advantage term is ≥ 0, then the new expected return is greater than the old expected return: guaranteed improvement from π_old → π̃.

  13. New State Visitation is Difficult. In
η(π̃) = η(π_old) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_{π_old}(s, a),
the state visitation ρ_π̃(s) is based on the new policy π̃. "The complex dependency of ρ_π̃(s) on π̃ makes the equation difficult to optimize directly." [1]
[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.

  14. New State Visitation is Difficult.
η(π̃) = η(π_old) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_{π_old}(s, a)
L(π̃) = η(π_old) + Σ_s ρ_{π_old}(s) Σ_a π̃(a|s) A_{π_old}(s, a)
L(π̃), which replaces the new policy's state visitation ρ_π̃ with the old policy's ρ_{π_old}, is a local approximation of η(π̃). [1]
[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.

  15. Local approximation of η(π̃).
L(π̃) = η(π_old) + Σ_s ρ_{π_old}(s) Σ_a π̃(a|s) A_{π_old}(s, a)
The approximation is accurate within a small step size (the trust region): as long as π_θ'(a|s) does not change dramatically, monotonic improvement is guaranteed. [1]
[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
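In practice the local approximation is estimated from samples gathered under the old policy, with the inner sum over actions replaced by an importance-sampling ratio. A minimal PyTorch sketch, assuming a hypothetical `policy.log_prob(states, actions)` method and precomputed old log-probabilities and advantages:

```python
# Sample-based surrogate L(theta): states come from the old policy's
# visitation, and the sum over actions is importance-sampled, giving
# E_{s,a ~ pi_old} [ pi_theta(a|s) / pi_old(a|s) * A_old(s, a) ].
import torch

def surrogate_loss(policy, old_log_probs, states, actions, advantages):
    log_probs = policy.log_prob(states, actions)
    ratio = torch.exp(log_probs - old_log_probs)   # pi_theta / pi_old
    return (ratio * advantages).mean()             # to be maximized
```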

  16. Local approximation of η(π̃). The following bound holds:
η(π̃) ≥ L(π̃) − C · D_KL^max(π_old, π̃),  where C = 4εγ / (1 − γ)²  (ε is the maximum absolute advantage).
Monotonically improving policies can be generated by
π̃ = argmax_π [ L(π) − C · D_KL^max(π_old, π) ].

  17. Minorization-Maximization (MM) algorithm. [Figure: the surrogate function L(π) − C · D_KL^max(π_old, π) minorizes the actual objective η(π); maximizing the surrogate at each iteration therefore improves the actual objective.]

  18. Optimization of Parameterized Policies. Policies are now parameterized: π_θ(a|s) with parameters θ. Accordingly, the surrogate objective becomes
argmax_θ [ L(θ) − C · D_KL^max(θ_old, θ) ].

  19. Optimization of Parameterized Policies.
argmax_θ [ L(θ) − C · D_KL^max(θ_old, θ) ]
In practice, the penalty coefficient C results in very small step sizes. One way to take larger steps is to constrain the KL divergence between the new policy and the old policy, i.e., impose a trust-region constraint:
maximize_θ L(θ)  subject to  D_KL(θ_old, θ) ≤ δ.
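The constrained quantity is the KL divergence between the two policies' action distributions, averaged over sampled states (the mean KL used on the next slide). A minimal sketch for a discrete-action policy, where `old_probs` and `new_probs` are assumed to be [n_states, n_actions] tensors of action probabilities:

```python
# Mean KL divergence between old and new policies over sampled states,
# the quantity bounded by the trust-region constraint D_KL <= delta.
import torch

def mean_kl(old_probs, new_probs, eps=1e-8):
    kl = (old_probs * (torch.log(old_probs + eps)
                       - torch.log(new_probs + eps))).sum(dim=-1)
    return kl.mean()   # averaged over the sampled states
```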

  20. Solving the KL-Penalized Problem.
maximize_θ L(θ) − C · D_KL^max(θ_old, θ)
Use the mean KL divergence instead of the max, i.e.,
maximize_θ L(θ) − C · D_KL(θ_old, θ).
Make a linear approximation to L and a quadratic approximation to the KL term:
maximize_θ g · (θ − θ_old) − (C/2) (θ − θ_old)ᵀ F (θ − θ_old),
where g = ∂L(θ)/∂θ |_{θ=θ_old} and F = ∂²D_KL(θ_old, θ)/∂θ² |_{θ=θ_old}.

  21. Solving the KL-Penalized Problem. Make a linear approximation to L and a quadratic approximation to the KL term:
maximize_θ g · (θ − θ_old) − (C/2) (θ − θ_old)ᵀ F (θ − θ_old),
where g = ∂L(θ)/∂θ |_{θ=θ_old} and F = ∂²D_KL(θ_old, θ)/∂θ² |_{θ=θ_old}.
Solution: θ − θ_old = (1/C) F⁻¹ g. We don't want to form the full Hessian matrix F; instead, F⁻¹g can be computed approximately with the conjugate gradient algorithm without forming F explicitly.
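Conjugate gradient only needs matrix-vector products F·v, which can be obtained by double backprop through the mean KL without ever materializing F. A sketch in PyTorch; `params` (the policy's parameter tensors), `vector` (a flat tensor of matching total size), and the damping term are assumptions for illustration:

```python
# Hessian-vector product of the mean KL (a Fisher-vector product),
# computed with double backprop so the full matrix F is never formed.
import torch

def fisher_vector_product(mean_kl, params, vector, damping=0.1):
    grads = torch.autograd.grad(mean_kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * vector).sum()           # scalar: (d KL / d theta) . v
    hvp = torch.autograd.grad(grad_v, params)     # gradient of that scalar = F v
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * vector            # damping stabilizes CG
```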

  22. Conjugate Gradient (CG). The conjugate gradient algorithm approximately solves x = A⁻¹b without explicitly forming the matrix A. After k iterations, CG has minimized the quadratic (1/2) xᵀAx − bᵀx over the subspace spanned by the first k search directions.
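A textbook conjugate-gradient routine written against a matrix-vector-product callback (such as the Fisher-vector product above), so A is never formed; `n_iters` and `tol` are illustrative defaults:

```python
# Conjugate gradient: approximately solve A x = b given only Avp(v) = A v.
import torch

def conjugate_gradient(Avp, b, n_iters=10, tol=1e-10):
    x = torch.zeros_like(b)
    r = b.clone()            # residual b - A x (x = 0 initially)
    p = b.clone()            # search direction
    r_dot = r.dot(r)
    for _ in range(n_iters):
        Ap = Avp(p)
        alpha = r_dot / p.dot(Ap)
        x += alpha * p
        r -= alpha * Ap
        new_r_dot = r.dot(r)
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x
```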

  23. TRPO: KL-Constrained.
Unconstrained (penalized) problem: maximize_θ L(θ) − C · D_KL(θ_old, θ).
Constrained problem: maximize_θ L(θ) subject to D_KL(θ_old, θ) ≤ δ.
δ is a hyperparameter that remains fixed over the whole learning process.
Solve the constrained quadratic problem: compute F⁻¹g, then rescale the step to get the correct KL:
maximize_θ g · (θ − θ_old)  subject to  (1/2) (θ − θ_old)ᵀ F (θ − θ_old) ≤ δ.
Lagrangian: ℒ(θ, λ) = g · (θ − θ_old) − (λ/2) [ (θ − θ_old)ᵀ F (θ − θ_old) − δ ].
Differentiating with respect to θ gives θ − θ_old = (1/λ) F⁻¹ g.
We want (1/2) sᵀ F s = δ, so given a candidate step s_unscaled, rescale it to
s = √( 2δ / (s_unscaledᵀ F s_unscaled) ) · s_unscaled.
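A minimal sketch of that rescaling, assuming `step_unscaled` is the CG solution F⁻¹g and `Avp` is the Fisher-vector-product callback; the default δ is illustrative:

```python
# Rescale the CG step so the quadratic KL estimate hits the trust-region
# boundary: 1/2 * s^T F s = delta.
import torch

def rescale_step(step_unscaled, Avp, delta=0.01):
    quad = step_unscaled.dot(Avp(step_unscaled))      # s^T F s
    scale = torch.sqrt(2.0 * delta / (quad + 1e-8))   # sqrt(2*delta / s^T F s)
    return scale * step_unscaled
```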

  24. TRPO Algorithm. For i = 1, 2, …: collect N trajectories for policy π_θ; estimate the advantage function A; compute the policy gradient g; use CG to compute F⁻¹g; compute the rescaled step s = α F⁻¹ g (with rescaling and a line search); apply the update θ = θ_old + α F⁻¹ g. Each update solves
maximize_θ L(θ)  subject to  D_KL(θ_old, θ) ≤ δ.
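Putting the pieces together, one TRPO iteration might look like the following sketch. Every helper named here (collect_trajectories, estimate_advantages, surrogate_loss, mean_kl_to_old, fisher_vector_product, conjugate_gradient, rescale_step, flat_params, set_flat_params) is an assumed/hypothetical utility, not code from the slides:

```python
# End-to-end sketch of one TRPO update, tying together the earlier pieces.
import torch

def trpo_update(policy, env, n_traj=10, delta=0.01, backtrack=10):
    trajectories = collect_trajectories(env, policy, n_traj)
    # variant of the earlier helper that also returns old log-probabilities
    states, actions, advantages, old_log_probs = estimate_advantages(trajectories)

    # Policy gradient g of the surrogate at theta_old
    loss = surrogate_loss(policy, old_log_probs, states, actions, advantages)
    g = torch.cat([gr.reshape(-1) for gr in
                   torch.autograd.grad(loss, list(policy.parameters()))])

    # Approximately solve F s = g with conjugate gradient, then rescale
    Avp = lambda v: fisher_vector_product(mean_kl_to_old(policy, states),
                                          list(policy.parameters()), v)
    step = conjugate_gradient(Avp, g)
    step = rescale_step(step, Avp, delta)

    # Backtracking line search: accept the largest fraction of the step that
    # improves the surrogate while keeping the sampled mean KL below delta
    old_params = flat_params(policy)
    for frac in (0.5 ** k for k in range(backtrack)):
        set_flat_params(policy, old_params + frac * step)
        new_loss = surrogate_loss(policy, old_log_probs, states, actions, advantages)
        if new_loss > loss and mean_kl_to_old(policy, states) <= delta:
            break
    else:
        set_flat_params(policy, old_params)   # no acceptable step found
```

The backtracking loop is what the slide calls "rescaling and line search": the step is shrunk until the surrogate actually improves and the sampled mean KL respects δ.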

  25. Questions?
