Natural Policy Gradients (cont.)
Deep Reinforcement Learning and Control Katerina Fragkiadaki
Carnegie Mellon School of Computer Science
Revision: Policy Gradients
1. Collect trajectories for the current policy πθ
2. Estimate advantages Â
3. Compute the policy gradient ĝ
4. Update the parameters: θnew = θ + ε · ĝ
θold → θnew
(Figure: Gaussian policy before and after the update: μθ(s), σθ(s) and μθnew(s), σθnew(s).)
Update rule: θnew = θ + ε · ĝ, where ĝ is estimated from advantages A under the current policy πθ.
Two questions:
- How to estimate this gradient?
- How to estimate the stepsize?
- Bad policy → data collected under a bad policy → we cannot recover (in supervised learning, the data does not depend on the neural network weights).
- Inefficient use of experience (in supervised learning, data can be trivially re-used).
$\hat{g} \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)} | s_t^{(i)})\, A(s_t^{(i)}, a_t^{(i)}), \qquad \tau_i \sim \pi_\theta$
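As an illustration of this estimator (my own sketch, not from the slides), here it is for a linear-softmax policy over discrete actions; `states`, `actions`, and `advantages` are assumed to hold the flattened (s, a, A) samples from the N collected trajectories, and the sum is averaged over all samples rather than per trajectory, which only rescales the gradient.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def policy_gradient_estimate(theta, states, actions, advantages):
    """Monte-Carlo estimate of g_hat = mean over samples of
    grad_theta log pi_theta(a_t|s_t) * A(s_t, a_t),
    for a linear-softmax policy pi_theta(a|s) = softmax(s^T theta)."""
    grad = np.zeros_like(theta)              # theta: (state_dim, n_actions)
    probs = softmax(states @ theta)          # (num_samples, n_actions)
    for s, a, adv, p in zip(states, actions, advantages, probs):
        one_hot = np.zeros(theta.shape[1])
        one_hot[a] = 1.0
        # grad_theta log pi(a|s) for a linear-softmax policy is outer(s, one_hot(a) - p)
        grad += np.outer(s, one_hot - p) * adv
    return grad / len(states)
```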
Policy gradients: this results from differentiating the objective
$U^{PG}(\theta) = \mathbb{E}_t\left[\log \pi_\theta(a_t|s_t)\, A(s_t, a_t)\right]$
Compare this to supervised learning with expert actions and a maximum-likelihood objective:
$U^{SL}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(\tilde{a}_t^{(i)} | s_t^{(i)}), \qquad \tau_i \sim \pi^* \;(+\,\text{regularization}), \quad \tilde{a} \sim \pi^*$
We started here:
$\max_\theta \; U(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}[R(\tau)] = \sum_\tau P(\tau;\theta)\, R(\tau)$
What we optimize instead: $\max_\theta \; U^{PG}(\theta)$
This is not the right objective: we can't optimize too far (the advantage values become invalid), and this constraint shows up nowhere in the optimization:
$\hat{g} = \mathbb{E}_t\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, A(s_t, a_t)\right]$
θnew = θ + ε · ĝ
The same parameter step changes the policy distribution more or less dramatically depending on where in the parameter space we are.
Consider a family of policies with parametrization:
$\pi_\theta(a) = \begin{cases} \sigma(\theta) & a = 1 \\ 1 - \sigma(\theta) & a = 2 \end{cases}$
For example, the same step Δθ = −2 changes the action distribution dramatically when θ is near 0, but hardly at all when θ is large.
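A quick numerical illustration of this point (my own sketch, not from the slides): the same parameter step changes the action distribution a lot near θ = 0 and barely at all when θ is large.

```python
import numpy as np

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

for theta in [0.0, 5.0]:
    d_theta = -2.0                       # the same parameter step in both cases
    p_old, p_new = sigma(theta), sigma(theta + d_theta)
    print(f"theta={theta:4.1f}: P(a=1) goes from {p_old:.3f} to {p_new:.3f}")
# theta= 0.0: P(a=1) goes from 0.500 to 0.119  -> large change in the policy
# theta= 5.0: P(a=1) goes from 0.993 to 0.953  -> tiny change in the policy
```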
Notation: θold and θnew denote the parameter values (and πθold, πθnew the corresponding policies) before and after an update.
The stepsize in gradient descent results from solving the following optimization problem, e.g., using line search:
$\theta_{new} = \theta_{old} + d^*, \qquad d^* = \arg\max_{\|d\| \le \epsilon} U(\theta + d)$
SGD: Euclidean distance in parameter space. It is hard to predict the effect on the parameterized distribution, so it is hard to pick the threshold ε.
Natural gradient descent: the stepsize in parameter space is determined by the KL divergence between the distributions before and after the update:
$d^* = \arg\max_{d \,:\, KL(\pi_\theta \| \pi_{\theta+d}) \le \epsilon} U(\theta + d)$
It is easier to pick the distance threshold, and we made the ``don't optimize too much'' constraint explicit.
Let's solve it. Unconstrained penalized objective:
$d^* = \arg\max_d \; U(\theta + d) - \lambda\left(D_{KL}[\pi_\theta \| \pi_{\theta+d}] - \epsilon\right)$
First-order Taylor expansion for the loss and second-order for the KL:
$d^* \approx \arg\max_d \; U(\theta_{old}) + \nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d - \tfrac{1}{2}\lambda\, d^\top \nabla^2_\theta D_{KL}[\pi_{\theta_{old}} \| \pi_\theta]|_{\theta=\theta_{old}}\, d + \lambda\epsilon$
Q: How will you compute this?
$U(\theta) = \mathbb{E}_t\left[\log \pi_\theta(a_t|s_t)\, A(s_t, a_t)\right]$
Second-order Taylor expansion of the KL divergence:
$D_{KL}(p_{\theta_{old}} \| p_\theta) \approx D_{KL}(p_{\theta_{old}} \| p_{\theta_{old}}) + d^\top \nabla_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)|_{\theta=\theta_{old}} + \tfrac{1}{2} d^\top \nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)|_{\theta=\theta_{old}}\, d$
The first two terms vanish, so
$D_{KL}(p_{\theta_{old}} \| p_\theta) \approx \tfrac{1}{2} d^\top \nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)|_{\theta=\theta_{old}}\, d = \tfrac{1}{2} d^\top F(\theta_{old})\, d = \tfrac{1}{2}(\theta - \theta_{old})^\top F(\theta_{old})(\theta - \theta_{old})$
Fisher information matrix:
$F(\theta) = \mathbb{E}_\theta\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right], \qquad F(\theta_{old}) = \nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)|_{\theta=\theta_{old}}$
Since KL divergence is roughly analogous to a distance measure between distributions, the Fisher information serves as a local distance metric between distributions: how much you change the distribution if you move the parameters a little bit in a given direction.
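A tiny numerical check (my own sketch, not from the slides), reusing the two-action sigmoid policy from before: for that policy the Fisher information is the scalar σ(θ)(1 − σ(θ)), and ½ d⊤F d closely matches the exact KL for a small step.

```python
import numpy as np

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

def kl_bernoulli(p, q):
    # exact KL between two Bernoulli action distributions
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta_old, d = 1.0, 0.1
p, q = sigma(theta_old), sigma(theta_old + d)

# Fisher information of the sigmoid policy: F(theta) = E[(d/dtheta log pi)^2] = sigma(theta)(1 - sigma(theta))
F = p * (1 - p)

print(kl_bernoulli(p, q))   # exact KL(pi_theta_old || pi_theta_old+d)
print(0.5 * F * d ** 2)     # quadratic approximation 1/2 d F d
# the two numbers agree closely for this small step
```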
Putting it together. Unconstrained penalized objective:
$d^* = \arg\max_d \; U(\theta + d) - \lambda\left(D_{KL}[\pi_\theta \| \pi_{\theta+d}] - \epsilon\right)$
First-order Taylor expansion for the loss and second-order for the KL:
$\approx \arg\max_d \; U(\theta_{old}) + \nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d - \tfrac{1}{2}\lambda\, d^\top \nabla^2_\theta D_{KL}[\pi_{\theta_{old}} \| \pi_\theta]|_{\theta=\theta_{old}}\, d + \lambda\epsilon$
Substitute the Fisher information matrix:
$= \arg\max_d \; \nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d - \tfrac{1}{2}\lambda\, d^\top F(\theta_{old})\, d = \arg\min_d \; -\nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d + \tfrac{1}{2}\lambda\, d^\top F(\theta_{old})\, d$
Setting the gradient with respect to d to zero:
$0 = \frac{\partial}{\partial d}\left(-\nabla_\theta U(\theta)|_{\theta=\theta_{old}} \cdot d + \tfrac{1}{2}\lambda\, d^\top F(\theta_{old})\, d\right) = -\nabla_\theta U(\theta)|_{\theta=\theta_{old}} + \lambda F(\theta_{old})\, d$
$d = \tfrac{1}{\lambda} F^{-1}(\theta_{old})\, \nabla_\theta U(\theta)|_{\theta=\theta_{old}}$
The natural gradient: $g_N = F^{-1}(\theta_{old})\, \nabla_\theta U(\theta)$
Let's solve for the stepsize along the natural gradient direction:
$\theta_{new} = \theta_{old} + \alpha \cdot g_N$
We want the KL between the old and new policies to be ε:
$D_{KL}(\pi_{\theta_{old}} \| \pi_\theta) \approx \tfrac{1}{2}(\theta - \theta_{old})^\top F(\theta_{old})(\theta - \theta_{old}) = \tfrac{1}{2}(\alpha g_N)^\top F(\theta_{old})(\alpha g_N) = \epsilon$
$\Rightarrow \; \alpha = \sqrt{\dfrac{2\epsilon}{g_N^\top F(\theta_{old})\, g_N}}$
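A minimal NumPy sketch of this update (my own illustration, not from the slides): the Fisher matrix is estimated as the average outer product of per-sample score vectors, the damping term is my addition to keep the estimate invertible, and `g` is the vanilla policy-gradient estimate from above.

```python
import numpy as np

def natural_gradient_step(score_vectors, g, eps=0.01, damping=1e-3):
    """Natural-gradient update direction and stepsize.

    score_vectors: array (N, d) of per-sample grad log pi_theta(a|s),
        used to estimate the Fisher matrix F = E[score score^T].
    g: the vanilla policy gradient estimate, shape (d,).
    eps: the desired KL divergence between the old and new policy.
    """
    F = score_vectors.T @ score_vectors / len(score_vectors)
    F += damping * np.eye(F.shape[0])               # damping keeps F invertible
    g_nat = np.linalg.solve(F, g)                   # g_N = F^{-1} g
    alpha = np.sqrt(2 * eps / (g_nat @ F @ g_nat))  # 1/2 (alpha g_N)^T F (alpha g_N) = eps
    return alpha * g_nat                            # parameter update d = alpha * g_N
```

In practice (e.g., in TRPO) the product F⁻¹g is computed with conjugate gradients rather than an explicit solve, which avoids forming F for large networks.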
Both use samples from the current policy πk = π(θk)
The Fisher matrix is very expensive to compute (and invert) for a large number of parameters!
Recall the ``don't optimize too much'' constraint as a penalized objective:
$\max_d \; \mathbb{E}_t\left[\log \pi_{\theta+d}(a_t|s_t)\, A(s_t, a_t)\right] - \lambda\, D_{KL}\left[\pi_\theta \,\|\, \pi_{\theta+d}\right]$
We used the first-order approximation for the first term, but what if d is large?
Importance sampling with respect to the old policy:
$U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_\tau \pi_\theta(\tau) R(\tau) = \sum_\tau \pi_{\theta_{old}}(\tau) \frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)} R(\tau) = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)} R(\tau)\right]$
$\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\left[\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)} R(\tau)\right]$
$\nabla_\theta U(\theta)|_{\theta=\theta_{old}} = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\left[\nabla_\theta \log \pi_\theta(\tau)|_{\theta=\theta_{old}}\, R(\tau)\right]$ ← the gradient evaluated at θold is unchanged
This leads to the surrogate objective:
$\max_\theta \; \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\, A(s_t, a_t)\right] - \lambda\, D_{KL}\left[\pi_{\theta_{old}} \,\|\, \pi_\theta\right]$
Trust Region Policy Optimization (TRPO), Schulman et al., ICML 2015.
Unconstrained (penalized) objective:
$\max_\theta \; \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\, A(s_t, a_t)\right] - \beta\, \mathbb{E}_t\left[D_{KL}\left[\pi_{\theta_{old}}(\cdot|s_t) \,\|\, \pi_\theta(\cdot|s_t)\right]\right]$
Constrained objective:
$\max_\theta \; \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\, A(s_t, a_t)\right] \quad \text{subject to} \quad \mathbb{E}_t\left[D_{KL}\left[\pi_{\theta_{old}}(\cdot|s_t) \,\|\, \pi_\theta(\cdot|s_t)\right]\right] \le \delta$
Can we achieve similar performance without second-order information (no Fisher matrix)?
Proximal Policy Optimization (PPO), Schulman et al., 2017.
$\max_\theta \; L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, A(s_t, a_t),\; \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A(s_t, a_t)\right)\right]$
where $r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$
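A minimal sketch of the clipped surrogate (my own illustration, not from the slides): in a real implementation `log_probs_new` would come from a differentiable policy network and the negative of this quantity would be minimized; plain NumPy here just shows the arithmetic.

```python
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """L^CLIP = E_t[ min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) ],
    with r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)."""
    ratios = np.exp(log_probs_new - log_probs_old)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))  # maximize this (or minimize its negative)
```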
Empirical Performance of PPO
Figure: Performance comparison between PPO with the clipped objective and various other deep RL methods on a slate of MuJoCo tasks.
Training linear policies to solve control tasks with natural policy gradients
https://youtu.be/frojcskMkkY
State s: joint positions, joint velocities, contact info
So far we train one policy/value function per task, e.g., win the game of Tetris, win the game of Go, reach a *particular* location, put the green cube inside the gray bucket, etc.
Universal Value Function Approximators, Schaul et al.
The policy/value function now takes as input not only the state s but also a goal g, which stays constant throughout the episode.
What should my goal representation be? (Not an easy question; same as for your state representation.) E.g., the location of the gripper, etc.; or a network learns to map the goal to an embedding vector by minimizing a reconstruction loss.
Main idea: use failed executions under one goal g as successful executions under an alternative goal g′ (which is where we ended up at the end of the episode).
(Figure: our reacher at the end of the episode.)
Under goal g: no reward :-( → transition (s, g, a, 0, s′)
Under goal g′: reward :-) → transition (s, g′, a, 1, s′)
Usually, as the additional goal we pick the goal that this episode achieved, and the reward becomes non-zero.
HER does not require reward shaping! :-) (Reward shaping: instead of using binary rewards, use continuous rewards, e.g., by considering Euclidean distances from the goal configuration.) The burden goes from designing the reward to designing the goal encoding. :-(
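A minimal sketch of the relabeling step (my own illustration; the tuple layout and `reward_fn` are assumptions). It follows the "final" strategy of taking the achieved goal to be where the episode ended; in general a state-to-goal mapping is needed.

```python
def her_relabel(episode, reward_fn):
    """Hindsight Experience Replay: in addition to the original transitions
    (s, g, a, r, s'), store a copy relabeled with the goal g' actually achieved
    at the end of the episode, so the sparse reward becomes informative.

    episode: list of (state, goal, action, next_state) tuples.
    reward_fn(next_state, goal): returns 1 if the goal is achieved in next_state, else 0.
    """
    achieved_goal = episode[-1][-1]  # g' = where we ended up at the end of the episode
    relabeled = []
    for state, goal, action, next_state in episode:
        # original transition under goal g (usually reward 0)
        relabeled.append((state, goal, action, reward_fn(next_state, goal), next_state))
        # hindsight transition under the achieved goal g' (reward 1 near the end)
        relabeled.append((state, achieved_goal, action,
                          reward_fn(next_state, achieved_goal), next_state))
    return relabeled
```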
Simulation-based search: given a model Mν and a simulation policy π, for each action a ∈ A simulate K episodes from the current (real) state st:
$\{s_t, a, R_{t+1}^k, S_{t+1}^k, A_{t+1}^k, \ldots, S_T^k\}_{k=1}^K \sim M_\nu, \pi$
Evaluate each action by the mean return of the simulations:
$Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} G_t \;\xrightarrow{P}\; q_\pi(s_t, a)$
Select the real action with the highest value:
$a_t = \arg\max_{a \in A} Q(s_t, a)$
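A sketch of this procedure (my own illustration; the `model.step`, `model.terminal`, and `policy` interfaces are assumptions): each candidate action is evaluated by the mean return of K simulated episodes, and the real action is the argmax.

```python
def mc_action_value(model, policy, state, action, K=100):
    """Estimate Q(s_t, a) = 1/K sum_k G_t by simulating K episodes with the model,
    taking the given action first and following the simulation policy thereafter."""
    returns = []
    for _ in range(K):
        s, a, G = state, action, 0.0
        while not model.terminal(s):
            s, r = model.step(s, a)
            G += r
            a = policy(s)
        returns.append(G)
    return sum(returns) / K

def simulation_based_search(model, policy, state, actions, K=100):
    # Pick the real action with the highest simulated mean return.
    return max(actions, key=lambda a: mc_action_value(model, policy, state, a, K))
```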
1. Internal to the tree: keep track of action values Q not only for the root but also for nodes internal to the tree we are expanding, and (maybe) use ε-greedy(Q) to improve the simulation policy over time.
2. External to the tree: we do not have Q estimates and thus we use a random policy.
(With every action tried infinitely often, the Q estimates converge to the optimal values.)
Using ε-greedy(Q) inside the tree, we will allocate samples more efficiently!
(Figure: the state is inside the tree vs. the state is at the frontier of expansion.)
During unrolling, sample actions based on the UCB score.
Monte-Carlo Tree Search (Kocsis & Szepesvári, 2006). Gradually grow the search tree:
- Iterate Tree-Walk, with building blocks:
  - Select next action (bandit phase)
  - Add a node (grow a leaf of the search tree)
  - Select next action again (random phase, roll-out)
  - Compute instant reward (evaluate)
  - Update information in visited nodes (propagate)
- Returned solution: the path visited most often.
(Figure: explored tree vs. search tree.)
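A compressed sketch of one tree-walk with the building blocks above (my own illustration, not the authors' code): the `env` model interface (`reset`/`step`/`actions`/`terminal`) is assumed, rewards are taken to arrive only during the roll-out, and expansion simply adds all children of the reached leaf.

```python
import math, random

class Node:
    def __init__(self):
        self.children = {}   # action -> Node
        self.N = 0           # visit count
        self.W = 0.0         # total return collected through this node

def ucb_select(node, c=1.41):
    # Bandit phase: pick the child maximizing mean value + exploration bonus.
    def score(action):
        child = node.children[action]
        if child.N == 0:
            return float("inf")
        return child.W / child.N + c * math.sqrt(math.log(node.N) / child.N)
    return max(node.children, key=score)

def rollout(env, state):
    # Random phase: unroll with a random policy until the episode ends.
    G = 0.0
    while not env.terminal(state):
        state, r = env.step(state, random.choice(env.actions(state)))
        G += r
    return G

def tree_walk(root, env):
    """One MCTS iteration: select (bandit phase), expand, random roll-out, propagate."""
    path, node, state = [root], root, env.reset()
    while node.children:                    # 1. selection inside the tree
        a = ucb_select(node)
        state, _ = env.step(state, a)
        node = node.children[a]
        path.append(node)
    for a in env.actions(state):            # 2. expansion: grow a leaf
        node.children[a] = Node()
    G = rollout(env, state)                 # 3. evaluate by roll-out to the end
    for n in path:                          # 4. propagate the return up the visited path
        n.N += 1
        n.W += G
```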
Can we inject prior knowledge into value functions to be estimated and actions to be tried, instead of initializing uniformly?
Bandit based Monte-Carlo Planning, Kocsis and Szepesvari, 2006
Go: the hardest classic board game; a challenge task for AI (John McCarthy); search has failed in Go.
SL policy network pσ: trained by mimicking expert moves (standard supervised learning). A 13-layer policy network trained on 30 million expert moves, reaching an accuracy on a held-out test set of 57.0% using all input features (55.7% using only the raw board position and move history), compared to the state of the art from other research groups of 44.4%.
RL policy network pρ: rewards are provided only at the end of the game, +1 for winning and −1 for losing. The RL policy network won more than 80% of games against the SL policy network.
Value network: evaluates a position s of games played by using the RL policy pρ for both players (in contrast to min-max search). Each training position is sampled from a separate game, played by the RL policy against itself. Trained by regression on state-outcome pairs (s, z) to minimize the mean squared error between the predicted value v(s) and the corresponding outcome z.
MCTS in AlphaGo:
- Selection: select actions within the expanded tree using the tree policy, which combines the average reward collected so far from MC simulations with a prior provided by the SL policy.
- Expansion: when reaching a leaf, play the action with the highest score from the SL policy.
- Simulation/Evaluation: use the rollout policy to reach the end of the game (multiple simulations in parallel using the rollout policy).
- Backup: update visitation counts and recorded rewards for the chosen path inside the tree.
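To make the tree policy concrete, here is a sketch of the selection score used inside the tree (the PUCT-style rule combining the average simulation value with a prior-weighted exploration bonus); the `Edge` layout and the constant `c_puct` are my assumptions for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    P: float        # prior probability from the SL policy network
    N: int = 0      # visit count
    W: float = 0.0  # total value backed up through this edge

def select_action(edges, c_puct=5.0):
    """Tree policy: a = argmax_a [ Q(s,a) + u(s,a) ], where Q = W/N is the average
    value from simulations and u(s,a) is an exploration bonus proportional to the
    prior P(s,a) and decaying with the edge's visit count."""
    total_visits = sum(e.N for e in edges.values())
    def score(a):
        e = edges[a]
        q = e.W / e.N if e.N > 0 else 0.0
        u = c_puct * e.P * math.sqrt(total_visits) / (1 + e.N)
        return q + u
    return max(edges, key=score)
```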
- During self-play, a randomly selected previous iteration of the policy network is used as the opponent (for exploration).
- MCTS produces an improved policy (it acts as a policy improvement operator), and the policy network is trained to mimic this improved policy.
- The evaluation (value) network output is trained to match the game outcome (same as in AlphaGo).
- MCTS always uses value-net evaluations of leaf nodes: no rollouts!
- MCTS tremendously improves the basic policy.
- Training the policy and value function with the same main feature extractor helps.
(Figure: separate policy/value nets vs. joint policy/value nets.)