Natural Policy Gradients, TRPO, PPO
Deep Reinforcement Learning and Control Katerina Fragkiadaki
Carnegie Mellon School of Computer Science CMU 10703
Part of the slides adapted from John Schulman and Joshua Achiam

Stochastic policies
Continuous actions: usually a multivariate Gaussian, $a \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta^2(s))$; the network outputs $\mu_\theta(s)$ and $\sigma_\theta(s)$.
Discrete actions: almost always categorical, $a \sim \mathrm{Cat}(p_\theta(s))$; the network outputs the probability vector $p_\theta(s)$.
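As an illustrative sketch (not part of the original slides), here is how one might sample from these two policy heads given the network outputs; the specific values and shapes below are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_policy_sample(mu, log_std):
    """Continuous actions: a ~ N(mu_theta(s), sigma_theta(s)^2) with a diagonal covariance."""
    std = np.exp(log_std)
    return mu + std * rng.standard_normal(mu.shape)

def categorical_policy_sample(logits):
    """Discrete actions: a ~ Cat(p_theta(s)), where p_theta(s) = softmax(logits)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

# Placeholder network outputs for a single state s (illustrative values only).
a_cont = gaussian_policy_sample(mu=np.array([0.1, -0.3]), log_std=np.array([-0.5, -0.5]))
a_disc = categorical_policy_sample(logits=np.array([1.0, 0.2, -1.0]))
```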
Monte Carlo Policy Gradients (REINFORCE), gradient direction:
$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$
Actor-Critic Policy Gradient:
$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_w(s_t)\right]$$
Update: $\theta_{new} = \theta_{old} + \epsilon \cdot \hat{g}$

[Figure: a Gaussian policy $\pi_\theta$ with mean $\mu_\theta(s)$ and standard deviation $\sigma_\theta(s)$, shown before ($\theta_{old}$) and after ($\theta_{new}$) a gradient update.]

This lecture is all about the stepsize $\epsilon$.
Policy gradients: What is our objective?

$$\hat{g} \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, A(s_t^{(i)}, a_t^{(i)}), \qquad \tau_i \sim \pi_\theta$$

This is the result of differentiating the objective function

$$J_{PG}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, A(s_t^{(i)}, a_t^{(i)}), \qquad \tau_i \sim \pi_\theta.$$

Is this our objective? Not quite: we cannot both maximize over a variable and sample from it. Compare to supervised learning and maximum likelihood estimation (MLE). Imagine we have access to expert actions; then the loss function we want to optimize is

$$J_{SL}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \log \pi_\theta(\tilde{a}_t^{(i)} \mid s_t^{(i)}), \qquad \tau_i \sim \pi^*,$$

which maximizes the probability of expert actions in the training set. Is $J_{PG}$ our analogue of this SL objective? Well, we cannot optimize it too far: our advantage estimates come from samples of $\pi_{\theta_{old}}$. However, this constraint of "do not move too far from $\theta_{old}$" does not appear anywhere in the objective. As a matter of fact, we care about test error, but that is a long story; the short answer is that yes, this is good enough for us to optimize if we regularize (+regularization).
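To make the analogy concrete, here is a minimal sketch (assuming log-probabilities and advantage estimates have already been computed from rollouts of $\pi_{\theta_{old}}$) of the two objectives as they would be fed to an optimizer:

```python
import numpy as np

def pg_surrogate(logp, adv):
    """J_PG(theta): advantage-weighted log-likelihood of the actions we actually took,
    where the advantages (computed under pi_theta_old) are treated as constants."""
    return np.mean(logp * adv)

def mle_objective(logp_expert):
    """J_SL(theta): mean log-likelihood of expert actions (the MLE / behavior-cloning analogue)."""
    return np.mean(logp_expert)
```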
This lecture is all about the stepsize. It is also about writing down an objective that we can maximize, so that the new policy is the result of this objective maximization.
Two problems with the vanilla formulation:

1. It is hard to choose a stepsize $\epsilon$, because the data is collected with policies of previous iterations: a bad policy leads to data collected under that bad policy, and we may not recover (in supervised learning, the data does not depend on the neural network weights).
2. It is not an efficient use of experience: samples are used for a single gradient step (in supervised learning, data can be trivially re-used).

Moreover, gradient descent in parameter space does not take into account the resulting distance in the (output) policy space between $\pi_{\theta_{old}}(s)$ and $\pi_{\theta_{new}}(s)$.
Consider a family of policies with parametrization:
$$\pi_\theta(a) = \begin{cases}\sigma(\theta) & a = 1\\ 1-\sigma(\theta) & a = 2\end{cases}$$
Figure: Small changes in the policy parameters can unexpectedly lead to big changes in the policy.
We will use $\theta_{old}$, $\theta_{new}$ (and $\pi_{\theta_{old}}$, $\pi_{\theta_{new}}$) to denote the values of the parameters and the corresponding policies before and after an update.

The stepsize in gradient descent results from solving the following optimization problem, e.g., using line search.

SGD, Euclidean distance in parameter space:
$$d^* = \arg\max_{\|d\| \le \epsilon} J(\theta + d), \qquad \theta_{new} = \theta_{old} + d^*$$
It is hard to predict the effect of such a step on the parameterized distribution, and therefore hard to pick the threshold $\epsilon$.

Natural gradient descent, KL divergence in distribution space: the stepsize in parameter space is determined by considering the KL divergence between the distributions before and after the update:
$$d^* = \arg\max_{d,\ D_{KL}(\pi_\theta \| \pi_{\theta+d}) \le \epsilon} J(\theta + d)$$
This makes it much easier to pick the distance threshold!
Unconstrained penalized objective:
$$d^* = \arg\max_d\; J(\theta + d) - \lambda\left(D_{KL}\left[\pi_\theta \| \pi_{\theta+d}\right] - \epsilon\right)$$
First order Taylor expansion for the loss and second order for the KL:
$$\approx \arg\max_d\; J(\theta_{old}) + \nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} \cdot d - \tfrac{1}{2}\lambda\left(d^\top \nabla^2_\theta D_{KL}\left[\pi_{\theta_{old}} \| \pi_\theta\right]\big|_{\theta=\theta_{old}}\, d\right) + \lambda\epsilon$$

Second order Taylor expansion of the KL, with $D_{KL}(p_{\theta_{old}} \| p_\theta) = \mathbb{E}_{x\sim p_{\theta_{old}}}\log\left(\frac{P_{\theta_{old}}(x)}{P_\theta(x)}\right)$:
$$D_{KL}(p_{\theta_{old}} \| p_\theta) \approx D_{KL}(p_{\theta_{old}} \| p_{\theta_{old}}) + d^\top \nabla_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}} + \tfrac{1}{2}\, d^\top \nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}}\, d$$
The zeroth order term is zero, and so is the first order term:
$$\nabla_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}} = -\nabla_\theta \mathbb{E}_{x\sim p_{\theta_{old}}}\log P_\theta(x)\big|_{\theta=\theta_{old}} = -\mathbb{E}_{x\sim p_{\theta_{old}}}\nabla_\theta \log P_\theta(x)\big|_{\theta=\theta_{old}} = -\mathbb{E}_{x\sim p_{\theta_{old}}}\frac{1}{P_{\theta_{old}}(x)}\nabla_\theta P_\theta(x)\big|_{\theta=\theta_{old}} = -\int_x P_{\theta_{old}}(x)\frac{1}{P_{\theta_{old}}(x)}\nabla_\theta P_\theta(x)\big|_{\theta=\theta_{old}} = -\int_x \nabla_\theta P_\theta(x)\big|_{\theta=\theta_{old}} = -\nabla_\theta \int_x P_\theta(x)\Big|_{\theta=\theta_{old}} = 0$$
Only the second order term survives. The Hessian of the KL is
$$\nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}} = -\mathbb{E}_{x\sim p_{\theta_{old}}}\nabla^2_\theta \log P_\theta(x)\big|_{\theta=\theta_{old}} = -\mathbb{E}_{x\sim p_{\theta_{old}}}\nabla_\theta\left(\frac{\nabla_\theta P_\theta(x)}{P_\theta(x)}\right)\bigg|_{\theta=\theta_{old}} = -\mathbb{E}_{x\sim p_{\theta_{old}}}\left(\frac{\nabla^2_\theta P_\theta(x)\, P_\theta(x) - \nabla_\theta P_\theta(x)\nabla_\theta P_\theta(x)^\top}{P_\theta(x)^2}\right)\bigg|_{\theta=\theta_{old}}$$
$$= -\mathbb{E}_{x\sim p_{\theta_{old}}}\frac{\nabla^2_\theta P_\theta(x)\big|_{\theta=\theta_{old}}}{P_{\theta_{old}}(x)} + \mathbb{E}_{x\sim p_{\theta_{old}}}\nabla_\theta \log P_\theta(x)\nabla_\theta \log P_\theta(x)^\top\big|_{\theta=\theta_{old}} = \mathbb{E}_{x\sim p_{\theta_{old}}}\nabla_\theta \log P_\theta(x)\nabla_\theta \log P_\theta(x)^\top\big|_{\theta=\theta_{old}}$$
(The first term vanishes by the same argument as before: it equals $-\nabla^2_\theta \int_x P_\theta(x)\big|_{\theta=\theta_{old}} = 0$.)
The Fisher information matrix
$$F(\theta) = \mathbb{E}_{x\sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\nabla_\theta \log p_\theta(x)^\top\right]$$
is exactly equivalent to the Hessian of the KL divergence: $F(\theta_{old}) = \nabla^2_\theta D_{KL}(p_{\theta_{old}} \| p_\theta)\big|_{\theta=\theta_{old}}$. Hence
$$D_{KL}(p_{\theta_{old}} \| p_\theta) \approx \tfrac{1}{2}\, d^\top F(\theta_{old})\, d = \tfrac{1}{2}(\theta - \theta_{old})^\top F(\theta_{old})(\theta - \theta_{old}).$$
Since KL divergence is roughly analogous to a distance measure between distributions, the Fisher information serves as a local distance metric between distributions: it measures how much you change the distribution if you move the parameters a little bit in a given direction.
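A small numeric sanity check of the Fisher/KL-Hessian equivalence, assuming a univariate Gaussian with fixed $\sigma$ and learnable mean $\theta$, for which $F(\theta) = 1/\sigma^2$ in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_old, sigma = 0.5, 2.0
x = rng.normal(theta_old, sigma, size=200_000)   # x ~ p_{theta_old}

# Empirical Fisher: E[(d/dtheta log p_theta(x))^2] at theta = theta_old,
# where d/dtheta log N(x; theta, sigma^2) = (x - theta) / sigma^2.
score = (x - theta_old) / sigma**2
fisher_mc = np.mean(score**2)

# Hessian of KL(p_{theta_old} || p_theta) at theta_old via finite differences.
def kl(theta):
    # Closed form for two Gaussians with equal sigma: (theta_old - theta)^2 / (2 sigma^2)
    return (theta_old - theta) ** 2 / (2 * sigma**2)

eps = 1e-3
kl_hess = (kl(theta_old + eps) - 2 * kl(theta_old) + kl(theta_old - eps)) / eps**2

print(fisher_mc, kl_hess, 1 / sigma**2)   # all three are approximately 0.25
```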
Putting it together, and dropping the terms that do not depend on $d$:
$$d^* = \arg\max_d\; J(\theta + d) - \lambda\left(D_{KL}\left[\pi_\theta \| \pi_{\theta+d}\right] - \epsilon\right) \approx \arg\max_d\; \nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} \cdot d - \tfrac{1}{2}\lambda\, d^\top F(\theta_{old})\, d = \arg\min_d\; -\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} \cdot d + \tfrac{1}{2}\lambda\, d^\top F(\theta_{old})\, d$$
Setting the gradient with respect to $d$ to zero:
$$0 = \frac{\partial}{\partial d}\left(-\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} \cdot d + \tfrac{1}{2}\lambda\, d^\top F(\theta_{old})\, d\right) = -\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} + \lambda F(\theta_{old})\, d \;\Rightarrow\; d = \frac{1}{\lambda} F^{-1}(\theta_{old})\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}}$$
The natural gradient (absorbing the constant $1/\lambda$ into the learning rate):
$$\tilde{\nabla} J(\theta) = F^{-1}(\theta_{old})\nabla_\theta J(\theta), \qquad \theta_{new} = \theta_{old} + \alpha \cdot F^{-1}(\theta_{old})\, \hat{g}$$
To pick the stepsize $\alpha$, use the quadratic KL approximation $D_{KL}(\pi_{\theta_{old}} \| \pi_\theta) \approx \tfrac{1}{2}(\theta - \theta_{old})^\top F(\theta_{old})(\theta - \theta_{old})$ with the natural gradient step $g_N = F^{-1}\hat{g}$: requiring $\tfrac{1}{2}(\alpha g_N)^\top F(\alpha g_N) = \epsilon$ gives
$$\alpha = \sqrt{\frac{2\epsilon}{g_N^\top F\, g_N}}.$$
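A minimal sketch of the resulting update, assuming we already have a gradient estimate `g` and per-sample score vectors from which to estimate $F$; the damping term and the KL threshold are illustrative choices, not part of the slides:

```python
import numpy as np

def natural_gradient_step(theta, g, score_vectors, eps_kl=0.01, damping=1e-3):
    """theta_new = theta + alpha * F^{-1} g, with alpha chosen so that the
    quadratic KL estimate 0.5 * d^T F d matches the trust region eps_kl."""
    # Empirical Fisher: average outer product of per-sample score vectors.
    F = score_vectors.T @ score_vectors / score_vectors.shape[0]
    F += damping * np.eye(F.shape[0])                    # keep F invertible
    nat_g = np.linalg.solve(F, g)                        # F^{-1} g
    alpha = np.sqrt(2 * eps_kl / (nat_g @ F @ nat_g))    # step size from the KL budget
    return theta + alpha * nat_g
```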
The Fisher matrix (and its inverse) is very expensive to compute for a large number of parameters! A Kronecker-factored approximation addresses this: "Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation" (ACKTR). Both methods use samples from the current policy $\pi_k$.
Collecting fresh on-policy rollouts for each gradient step is inefficient. Can we write down an objective that lets us reuse samples from previous policies? Importance sampling lets us do that.
$$J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[R(\tau)\right] = \sum_\tau \pi_\theta(\tau) R(\tau) = \sum_\tau \pi_{\theta_{old}}(\tau)\frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)} R(\tau) = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}}\frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)} R(\tau)$$
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}}\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)} R(\tau)$$
The gradient evaluated at $\theta_{old}$ is unchanged:
$$\nabla_\theta J(\theta)\big|_{\theta=\theta_{old}} = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}}\nabla_\theta \log \pi_\theta(\tau)\big|_{\theta=\theta_{old}}\, R(\tau)$$
Writing the trajectory ratio as a product of per-step action-probability ratios,
$$\frac{\pi_\theta(\tau)}{\pi_{\theta_{old}}(\tau)} = \prod_{t=1}^{T}\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},$$
the objective becomes
$$J(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}}\left[\sum_{t=1}^{T}\left(\prod_{t'=1}^{t}\frac{\pi_\theta(a_{t'} \mid s_{t'})}{\pi_{\theta_{old}}(a_{t'} \mid s_{t'})}\right)\hat{A}_t\right].$$
Now we can use data from the old policy, but the variance has increased by a lot! Those products of ratios can explode or vanish!
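A quick simulation illustrating the variance blow-up, under the toy assumption of i.i.d. lognormal per-step ratios with mean 1:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_traj = 100, 10_000

# Per-step ratios pi_theta(a|s) / pi_theta_old(a|s): lognormal with mean 1 (toy assumption).
sigma = 0.1
ratios = rng.lognormal(mean=-sigma**2 / 2, sigma=sigma, size=(n_traj, T))

cumprod = np.cumprod(ratios, axis=1)               # product of ratios up to each timestep t
print(cumprod[:, 0].std(), cumprod[:, -1].std())   # the spread grows rapidly with the horizon
```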
Instead, use single-step importance ratios and keep the new policy close to the old one with a KL trust region:
$$\text{maximize}_\theta\;\; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\hat{A}_t\right] \quad \text{s.t.}\quad \hat{\mathbb{E}}_t\left[ KL\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta.$$
It is also worth considering using a penalty instead of a constraint:
$$\text{maximize}_\theta\;\; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\hat{A}_t\right] - \beta\,\hat{\mathbb{E}}_t\left[ KL\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$$
Again the KL-penalized problem!
(Trust Region Policy Optimization, Schulman et al., ICML 2015)
$$\text{maximize}_\theta\;\; L_{\pi_{\theta_{old}}}(\pi_\theta) - \beta \cdot KL_{\pi_{\theta_{old}}}(\pi_\theta)$$
Make a linear approximation to $L_{\pi_{\theta_{old}}}$ and a quadratic approximation to the KL term:
$$\text{maximize}_\theta\;\; g\cdot(\theta - \theta_{old}) - \frac{\beta}{2}(\theta - \theta_{old})^\top F\,(\theta - \theta_{old})$$
$$\text{where}\quad g = \frac{\partial}{\partial\theta}L_{\pi_{\theta_{old}}}(\pi_\theta)\bigg|_{\theta=\theta_{old}}, \qquad F = \frac{\partial^2}{\partial\theta^2}KL_{\pi_{\theta_{old}}}(\pi_\theta)\bigg|_{\theta=\theta_{old}}$$
Exactly what we saw with the natural policy gradient! One important detail remains, though.
Due to the quadratic approximation, the KL constraint may be violated! What if we just do a line search to find the best stepsize, making sure that we still
$$\text{maximize}_\theta\;\; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\hat{A}_t\right] \quad \text{s.t.}\quad \hat{\mathbb{E}}_t\left[ KL\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta\,?$$

Algorithm 2: Line Search for TRPO
Compute proposed policy step $\Delta_k = \sqrt{\dfrac{2\delta}{\hat{g}_k^\top \hat{H}_k^{-1}\hat{g}_k}}\;\hat{H}_k^{-1}\hat{g}_k$
for $j = 0, 1, 2, \ldots, L$ do
  Compute proposed update $\theta = \theta_k + \alpha^j \Delta_k$
  if $L_{\theta_k}(\theta) \ge 0$ and $\bar{D}_{KL}(\theta \| \theta_k) \le \delta$ then
    accept the update and set $\theta_{k+1} = \theta_k + \alpha^j \Delta_k$
    break
  end if
end for
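A sketch of the backtracking line search above, assuming callables `surrogate(theta)` (the estimated surrogate improvement $L_{\theta_k}(\theta)$) and `mean_kl(theta)` evaluated on the collected batch; the decay factor is an illustrative choice:

```python
import numpy as np

def backtracking_line_search(theta_k, delta_k, surrogate, mean_kl,
                             kl_limit, decay=0.8, max_iters=10):
    """Shrink the proposed step until the surrogate improves and the KL constraint holds."""
    for j in range(max_iters):
        theta = theta_k + decay**j * delta_k
        if surrogate(theta) >= 0.0 and mean_kl(theta) <= kl_limit:
            return theta              # accept theta_{k+1} = theta_k + decay^j * delta_k
    return theta_k                    # no acceptable step found; keep the old parameters
```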
Algorithm 3: Trust Region Policy Optimization
Input: initial policy parameters $\theta_0$
for $k = 0, 1, 2, \ldots$ do
  Collect set of trajectories $\mathcal{D}_k$ on policy $\pi_k = \pi(\theta_k)$
  Estimate advantages $\hat{A}^{\pi_k}_t$ using any advantage estimation algorithm
  Form sample estimates for the policy gradient $\hat{g}_k$ (using the advantage estimates) and for the KL-divergence Hessian-vector product function $f(v) = \hat{H}_k v$
  Use CG with $n_{cg}$ iterations to obtain $x_k \approx \hat{H}_k^{-1}\hat{g}_k$
  Estimate the proposed step $\Delta_k \approx \sqrt{\dfrac{2\delta}{x_k^\top \hat{H}_k x_k}}\; x_k$
  Perform backtracking line search with exponential decay to obtain the final update $\theta_{k+1} = \theta_k + \alpha^j \Delta_k$
end for

TRPO = NPG + line search
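The step $x_k \approx \hat{H}_k^{-1}\hat{g}_k$ is computed without ever forming $\hat{H}_k$ explicitly, using only Hessian-vector products. A standard conjugate gradient sketch (the `hvp` callable, returning $\hat{H}_k v$, is assumed to be provided):

```python
import numpy as np

def conjugate_gradient(hvp, g, n_iters=10, tol=1e-10):
    """Approximately solve H x = g given only the Hessian-vector product hvp(v) = H v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - H x (x = 0 initially)
    p = g.copy()
    rs_old = r @ r
    for _ in range(n_iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```

The proposed step is then $\Delta_k = \sqrt{2\delta/(x_k^\top \hat{H}_k x_k)}\, x_k$, which is handed to the backtracking line search above.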
TRPO = NPG + line search + monotonic improvement theorem!
Policy objective:
$$J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]$$
The policy objective can be written in terms of the old one:
$$J(\pi_{\theta'}) - J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi_\theta}(s_t, a_t)\right]$$
Equivalently, for succinctness:
$$J(\pi') - J(\pi) = \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi}(s_t, a_t)\right]$$
Proof (Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford 2002):
$$\mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi}(s_t, a_t)\right] = \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t\left(R(s_t, a_t, s_{t+1}) + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)\right)\right]$$
$$= J(\pi') + \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^{t+1} V^{\pi}(s_{t+1}) - \sum_{t=0}^{\infty}\gamma^{t} V^{\pi}(s_t)\right] = J(\pi') + \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=1}^{\infty}\gamma^{t} V^{\pi}(s_t) - \sum_{t=0}^{\infty}\gamma^{t} V^{\pi}(s_t)\right]$$
$$= J(\pi') - \mathbb{E}_{\tau\sim\pi'}\left[V^{\pi}(s_0)\right] = J(\pi') - J(\pi)$$
The last step uses the fact that the initial state distribution is the same for both policies!
Define the discounted state visitation distribution:
$$d^{\pi}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t P(s_t = s \mid \pi)$$
Then
$$J(\pi') - J(\pi) = \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi}(s_t, a_t)\right] = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi'},\,a\sim\pi'}\left[A^{\pi}(s, a)\right] = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi'},\,a\sim\pi}\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)}A^{\pi}(s, a)\right]$$
But how are we supposed to sample states from the policy we are trying to optimize for? Let's use the previous policy to sample them:
$$J(\pi') - J(\pi) \approx \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi},\,a\sim\pi}\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)}A^{\pi}(s, a)\right] = \mathcal{L}_\pi(\pi')$$
It turns out we can bound this approximation error:
$$\left|J(\pi') - \left(J(\pi) + \mathcal{L}_\pi(\pi')\right)\right| \le C\sqrt{\mathbb{E}_{s\sim d^{\pi}}\left[D_{KL}(\pi' \| \pi)[s]\right]}$$
(Constrained Policy Optimization, Achiam et al. 2017)
The surrogate can be estimated with trajectories from the old policy:
$$\mathcal{L}_\pi(\pi') = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi},\,a\sim\pi}\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)}A^{\pi}(s, a)\right] = \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^t\frac{\pi'(a_t \mid s_t)}{\pi(a_t \mid s_t)}A^{\pi}(s_t, a_t)\right]$$
This is something we can optimize using trajectories from the old policy! And we no longer have the product of ratios, so the gradient will have much smaller variance. (Yes, but that is because we have approximated; that's why!) What is the gradient?
$$\nabla_\theta \mathcal{L}_{\theta_k}(\theta)\big|_{\theta=\theta_k} = \mathbb{E}_{\tau\sim\pi_{\theta_k}}\left[\sum_{t=0}^{\infty}\gamma^t\frac{\nabla_\theta\pi_\theta(a_t \mid s_t)\big|_{\theta=\theta_k}}{\pi_{\theta_k}(a_t \mid s_t)}A^{\pi_{\theta_k}}(s_t, a_t)\right] = \mathbb{E}_{\tau\sim\pi_{\theta_k}}\left[\sum_{t=0}^{\infty}\gamma^t\nabla_\theta\log\pi_\theta(a_t \mid s_t)\big|_{\theta=\theta_k}A^{\pi_{\theta_k}}(s_t, a_t)\right]$$
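A sketch of one trajectory's contribution to this gradient, assuming per-step score vectors and advantage estimates are given (names are placeholders; averaging over trajectories would happen outside this function):

```python
import numpy as np

def surrogate_gradient_single_traj(grad_logp, adv, gamma=0.99):
    """One trajectory's contribution to grad_theta L at theta = theta_k:
    sum_t gamma^t * grad_theta log pi_theta(a_t|s_t)|_{theta_k} * A^{pi_k}(s_t, a_t).
    grad_logp: per-step score vectors, shape [T, dim_theta]
    adv:       per-step advantage estimates, shape [T]
    """
    T = grad_logp.shape[0]
    discounts = gamma ** np.arange(T)
    return (discounts[:, None] * adv[:, None] * grad_logp).sum(axis=0)
```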
Compare to the importance sampling objective, which contains the product of per-step ratios:
$$J(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}}\left[\sum_{t=1}^{T}\left(\prod_{t'=1}^{t}\frac{\pi_\theta(a_{t'} \mid s_{t'})}{\pi_{\theta_{old}}(a_{t'} \mid s_{t'})}\right)\hat{A}_t\right]$$
$$\left|J(\pi') - \left(J(\pi) + \mathcal{L}_\pi(\pi')\right)\right| \le C\sqrt{\mathbb{E}_{s\sim d^{\pi}}\left[D_{KL}(\pi' \| \pi)[s]\right]} \;\Rightarrow\; J(\pi') - J(\pi) \ge \mathcal{L}_\pi(\pi') - C\sqrt{\mathbb{E}_{s\sim d^{\pi}}\left[D_{KL}(\pi' \| \pi)[s]\right]}$$
Given policy $\pi$ (from which states and actions are sampled), we want to optimize over policy $\pi'$ to maximize this lower bound. (For guaranteed improvement, having it maximized is not enough: the maximized value needs to be positive or equal to zero.)
Proof of improvement guarantee: Suppose $\pi_{k+1}$ and $\pi_k$ are related by
$$\pi_{k+1} = \arg\max_{\pi'}\; \mathcal{L}_{\pi_k}(\pi') - C\sqrt{\mathbb{E}_{s\sim d^{\pi_k}}\left[D_{KL}(\pi' \| \pi_k)[s]\right]}.$$
$\pi_k$ is a feasible point, and the objective at $\pi_k$ is equal to 0:
$$\mathcal{L}_{\pi_k}(\pi_k) \propto \mathbb{E}_{s\sim d^{\pi_k},\,a\sim\pi_k}\left[A^{\pi_k}(s, a)\right] = 0, \qquad D_{KL}(\pi_k \| \pi_k)[s] = 0$$
$\Rightarrow$ the optimal value is $\ge 0$ $\Rightarrow$ by the performance bound, $J(\pi_{k+1}) - J(\pi_k) \ge 0$.
In practice, TRPO uses the KL as a constraint (a trust region) as opposed to a penalty:
$$\pi_{k+1} = \arg\max_{\pi'}\; \mathcal{L}_{\pi_k}(\pi') \quad \text{s.t.}\quad \mathbb{E}_{s\sim d^{\pi_k}}\left[D_{KL}(\pi' \| \pi_k)[s]\right] \le \delta$$
TRPO = NPG + line search + monotonic improvement theorem!
Can we achieve similar performance without second order information (no Fisher matrix)?
Adaptive KL Penalty
The policy update solves an unconstrained optimization problem:
$$\theta_{k+1} = \arg\max_\theta\; L_{\theta_k}(\theta) - \beta_k \bar{D}_{KL}(\theta \| \theta_k)$$
The penalty coefficient $\beta_k$ changes between iterations to approximately enforce the KL-divergence constraint.
Clipped Objective
New objective function: let $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_k}(a_t \mid s_t)$. Then
$$L^{CLIP}_{\theta_k}(\theta) = \mathbb{E}_{\tau\sim\pi_k}\left[\sum_{t=0}^{T}\min\left(r_t(\theta)\hat{A}^{\pi_k}_t,\; \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}^{\pi_k}_t\right)\right]$$
where $\epsilon$ is a hyperparameter (e.g., $\epsilon = 0.2$). The policy update is $\theta_{k+1} = \arg\max_\theta L^{CLIP}_{\theta_k}(\theta)$.
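A minimal NumPy sketch of $L^{CLIP}$ for a batch of timesteps, assuming log-probabilities under the new and old policies and advantage estimates are given:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """L_CLIP = E_t[ min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) ],
    with r_t = pi_theta(a_t|s_t) / pi_theta_k(a_t|s_t)."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```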
Proximal Policy Optimization Algorithms, Schulman et al. (2017)
PPO with Adaptive KL Penalty:
Input: initial policy parameters $\theta_0$, initial KL penalty $\beta_0$, target KL-divergence $\delta$
for $k = 0, 1, 2, \ldots$ do
  Collect set of partial trajectories $\mathcal{D}_k$ on policy $\pi_k = \pi(\theta_k)$
  Estimate advantages $\hat{A}^{\pi_k}_t$ using any advantage estimation algorithm
  Compute policy update $\theta_{k+1} = \arg\max_\theta L_{\theta_k}(\theta) - \beta_k \bar{D}_{KL}(\theta \| \theta_k)$ by taking K steps of minibatch SGD (via Adam)
  if $\bar{D}_{KL}(\theta_{k+1} \| \theta_k) \ge 1.5\,\delta$ then $\beta_{k+1} = 2\beta_k$
  else if $\bar{D}_{KL}(\theta_{k+1} \| \theta_k) \le \delta/1.5$ then $\beta_{k+1} = \beta_k/2$
  end if
end for
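The adaptive-penalty logic from the loop above, written as a standalone sketch (the 1.5 band and the doubling/halving factors follow the pseudocode):

```python
def update_kl_penalty(beta, mean_kl, target_kl):
    """Adapt the KL penalty coefficient between iterations:
    increase beta if the update moved too far, decrease it if it moved too little."""
    if mean_kl >= 1.5 * target_kl:
        return 2.0 * beta
    elif mean_kl <= target_kl / 1.5:
        return beta / 2.0
    return beta
```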
Don’t use second order approximation for Kl which is expensive, use standard gradient descent
Recall the surrogate objective:
$$L^{IS}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t\right]. \tag{1}$$
Form a lower bound via clipped importance ratios:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right] \tag{2}$$
Proximal Policy Optimization with Clipped Objective
But how does clipping keep the policy close? By making the objective as pessimistic as possible about performance far away from $\theta_k$:
Figure: Various objectives as a function of the interpolation factor $\alpha$ between $\theta_{k+1}$ and $\theta_k$ after one update of PPO-Clip.
Input: initial policy parameters $\theta_0$, clipping threshold $\epsilon$
for $k = 0, 1, 2, \ldots$ do
  Collect set of partial trajectories $\mathcal{D}_k$ on policy $\pi_k = \pi(\theta_k)$
  Estimate advantages $\hat{A}^{\pi_k}_t$ using any advantage estimation algorithm
  Compute policy update $\theta_{k+1} = \arg\max_\theta L^{CLIP}_{\theta_k}(\theta)$ by taking K steps of minibatch SGD (via Adam), where
  $$L^{CLIP}_{\theta_k}(\theta) = \mathbb{E}_{\tau\sim\pi_k}\left[\sum_{t=0}^{T}\min\left(r_t(\theta)\hat{A}^{\pi_k}_t,\; \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}^{\pi_k}_t\right)\right]$$
end for
Clipping prevents the policy from having an incentive to move far away from $\theta_k$. Clipping seems to work at least as well as PPO with the KL penalty, but it is simpler to implement.
Empirical Performance of PPO
Figure: Performance comparison between PPO with clipped objective and various other deep RL methods on a slate of MuJoCo tasks.
Related Readings
- Neurocomputing (2008)
- Trust Region Policy Optimization, Schulman et al., ICML (2015)
- Proximal Policy Optimization Algorithms, Schulman et al. (2017)