
Natural Policy Gradients, TRPO, PPO (CMU 10703, Katerina Fragkiadaki)



  1. Carnegie Mellon School of Computer Science, Deep Reinforcement Learning and Control. Natural Policy Gradients, TRPO, PPO. CMU 10703, Katerina Fragkiadaki.

  2. Part of the slides adapted from John Schulman and Joshua Achiam.

  3. Stochastic policies.
     Continuous actions: usually a multivariate Gaussian with mean $\mu_\theta(s)$ and standard deviation $\sigma_\theta(s)$, i.e. $a \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta^2(s))$.
     Discrete actions: almost always categorical, $a \sim \mathrm{Cat}(p_\theta(s))$.
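As a concrete sketch of these two policy heads (not from the slides; PyTorch, with illustrative sizes `obs_dim`, `act_dim`):

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # illustrative sizes, not from the slides

# Continuous actions: a ~ N(mu_theta(s), sigma_theta(s)^2)
mu_net = nn.Linear(obs_dim, act_dim)          # mu_theta(s)
log_std = nn.Parameter(torch.zeros(act_dim))  # log sigma_theta, state-independent here

s = torch.randn(obs_dim)
gauss = torch.distributions.Normal(mu_net(s), log_std.exp())
a_cont = gauss.sample()

# Discrete actions: a ~ Cat(p_theta(s))
logit_net = nn.Linear(obs_dim, act_dim)       # logits of p_theta(s)
cat = torch.distributions.Categorical(logits=logit_net(s))
a_disc = cat.sample()
```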

  4. Policy Gradients.
     Monte Carlo Policy Gradients (REINFORCE), gradient direction: $\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$
     Actor-Critic Policy Gradient: $\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_w(s_t)\right]$
     1. Collect trajectories for policy $\pi_\theta$
     2. Estimate advantages $\hat{A}$
     3. Compute policy gradient $\hat{g}$
     4. Update policy parameters $\theta_{\mathrm{new}} = \theta + \epsilon \cdot \hat{g}$
     5. GOTO 1
     This lecture is all about the stepsize.
     [Figure: a policy network with outputs $\mu_\theta(s), \sigma_\theta(s)$, before ($\theta_{\mathrm{old}}$) and after ($\theta_{\mathrm{new}}$) the update.]
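A hedged sketch of one pass through steps 1-5 for a categorical policy, using discounted reward-to-go as a crude advantage estimate; `policy`, `optimizer`, and the `env` reset()/step() interface are assumed placeholders, not the lecture's code:

```python
import torch

def reinforce_iteration(policy, optimizer, env, gamma=0.99):
    """One pass through the loop on the slide (rough sketch).

    Assumptions: `policy(s)` returns a torch.distributions.Categorical;
    `env` follows the classic reset()/step() interface; `optimizer` is
    e.g. torch.optim.SGD(params, lr=epsilon), so one step implements
    theta_new = theta + epsilon * g_hat.
    """
    # 1. Collect a trajectory with the current policy pi_theta
    logps, rewards = [], []
    s, done = env.reset(), False
    while not done:
        dist = policy(torch.as_tensor(s, dtype=torch.float32))
        a = dist.sample()
        logps.append(dist.log_prob(a))
        s, r, done, _ = env.step(a.item())
        rewards.append(r)

    # 2. Estimate advantages (here: discounted reward-to-go, no baseline)
    adv, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        adv.append(running)
    adv = torch.tensor(list(reversed(adv)), dtype=torch.float32)

    # 3.-4. Compute the policy gradient and update the parameters
    loss = -(torch.stack(logps) * adv).mean()  # gradient of -loss is g_hat
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # 5. GOTO 1: call this function again with freshly collected data
```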

  5. What is the underlying objective function?
     Policy gradients: $\hat{g} \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, \hat{A}(s_t^{(i)}, a_t^{(i)}), \quad \tau_i \sim \pi_\theta$
     What is our objective? The gradient above results from differentiating the objective function
     $J^{PG}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, \hat{A}(s_t^{(i)}, a_t^{(i)}), \quad \tau_i \sim \pi_\theta$
     Is this our objective? We cannot both maximize over a variable and sample from it. More precisely, we cannot optimize it too far: our advantage estimates come from samples of $\pi_{\theta_{\mathrm{old}}}$. However, this constraint of "cannot move too far from $\theta_{\mathrm{old}}$" does not appear anywhere in the objective.
     Compare to supervised learning and maximum likelihood estimation (MLE). If we had access to expert actions $\tilde{a}$, the loss function we would optimize is
     $J^{SL}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \log \pi_\theta(\tilde{a}_t^{(i)} \mid s_t^{(i)}), \quad \tau_i \sim \pi^*$, plus regularization,
     which maximizes the probability of the expert actions in the training set. Is this our SL objective? Strictly speaking we care about test error, but the short answer is yes: it is good enough to optimize as long as we regularize.
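To make the comparison concrete, a minimal sketch of the two objectives side by side (the names and the batched-distribution `policy` interface are my assumptions):

```python
import torch

def j_pg(policy, states, actions, advantages):
    # Policy-gradient "objective": log-probs of our *own* sampled actions,
    # weighted by advantage estimates. Only trustworthy near the theta_old
    # that generated (states, actions, advantages).
    return (policy(states).log_prob(actions) * advantages).mean()

def j_sl(policy, states, expert_actions):
    # Supervised / MLE objective: log-probs of *expert* actions. Safe to
    # optimize to convergence (with regularization), because the data does
    # not depend on theta.
    return policy(states).log_prob(expert_actions).mean()
```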

  6. Policy Gradients (continued).
     Monte Carlo Policy Gradients (REINFORCE), gradient direction: $\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$
     Actor-Critic Policy Gradient: $\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_w(s_t)\right]$
     1. Collect trajectories for policy $\pi_\theta$
     2. Estimate advantages $\hat{A}$
     3. Compute policy gradient $\hat{g}$
     4. Update policy parameters $\theta_{\mathrm{new}} = \theta + \epsilon \cdot \hat{g}$
     5. GOTO 1
     This lecture is all about the stepsize. It is also about writing down an objective that we can optimize with policy gradients; the procedure 1-5 above will then be the result of maximizing that objective.
     [Figure: the same policy network diagram, before and after the update.]

  7. Policy Gradients: problems with the vanilla formulation.
     (Same REINFORCE / actor-critic gradient directions and update loop 1-5 as on slide 4.)
     Two problems with the vanilla formulation:
     1. It is hard to choose the stepsize $\epsilon$.
     2. It is sample inefficient: we cannot use data collected with the policies of previous iterations.
     [Figure: the same policy network diagram, before and after the update.]

  8. Hard to choose stepsizes.
     (Same REINFORCE / actor-critic gradient directions and update loop 1-5 as on slide 4.)
     Step too big: a bad policy leads to data collected under that bad policy, and we cannot recover. (In supervised learning, the data does not depend on the neural network weights.)
     Step too small: inefficient use of experience. (In supervised learning, data can be trivially re-used.)
     Gradient descent in parameter space does not take into account the resulting distance in the (output) policy space between $\pi_{\theta_{\mathrm{old}}}(s)$ and $\pi_{\theta_{\mathrm{new}}}(s)$.
     [Figure: the same policy network diagram, before and after the update.]

  9. Hard to choose stepsizes (continued).
     (Same gradient directions and update loop 1-5 as on slide 4.)
     Consider a family of policies with parametrization:
     $\pi_\theta(a) = \begin{cases} \sigma(\theta) & a = 1 \\ 1 - \sigma(\theta) & a = 2 \end{cases}$
     Figure: small changes in the policy parameters can unexpectedly lead to big changes in the policy.
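A quick numerical check of the figure's point (my own, not from the slides): the same Euclidean step in $\theta$ can barely move the policy or nearly flip it, depending on where you start:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def policy(theta):
    # pi_theta(a=1) = sigmoid(theta), pi_theta(a=2) = 1 - sigmoid(theta)
    p = sigmoid(theta)
    return np.array([p, 1.0 - p])

# The same parameter step of size 2, taken in two regions of parameter space:
print(policy(6.0), policy(4.0))   # ~[0.998, 0.002] -> ~[0.982, 0.018]: tiny change in the policy
print(policy(1.0), policy(-1.0))  # ~[0.731, 0.269] -> ~[0.269, 0.731]: the policy nearly flips
```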

  10. Notation. We will use the following to denote values of the parameters and the corresponding policies before and after an update: $\theta_{\mathrm{old}} \to \theta_{\mathrm{new}}$, $\pi_{\mathrm{old}} \to \pi_{\mathrm{new}}$, or equivalently $\theta \to \theta'$, $\pi \to \pi'$.

  11. Gradient Descent in Parameter Space.
     The stepsize in gradient descent results from solving the following optimization problem, e.g., using line search:
     $d^* = \arg\max_{\|d\| \le \epsilon} J(\theta + d)$
     SGD update: $\theta_{\mathrm{new}} = \theta_{\mathrm{old}} + d^*$
     The constraint is a Euclidean distance in parameter space; it is hard to predict its effect on the parameterized distribution.
     [Figure: policy network with outputs $\mu_\theta(s), \sigma_\theta(s)$.]
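For intuition, to first order the solution of this Euclidean-constrained problem is simply a step of length $\epsilon$ along the gradient direction; a minimal sketch, with `grad_J` assumed given:

```python
import numpy as np

def euclidean_trust_region_step(theta, grad_J, epsilon):
    # To first order, arg max_{||d|| <= eps} J(theta + d) is a step of
    # length eps along the gradient of J.
    d_star = epsilon * grad_J / (np.linalg.norm(grad_J) + 1e-8)
    return theta + d_star
```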

  12. Gradient Descent in Distribution Space.
     The stepsize in gradient descent results from solving the following optimization problem, e.g., using line search:
     $d^* = \arg\max_{\|d\| \le \epsilon} J(\theta + d)$, with SGD update $\theta_{\mathrm{new}} = \theta_{\mathrm{old}} + d^*$.
     The constraint is a Euclidean distance in parameter space; it is hard to predict its effect on the parameterized distribution, and hence hard to pick the threshold $\epsilon$.
     Natural gradient descent: the stepsize in parameter space is determined by considering the KL divergence between the distributions before and after the update:
     $d^* = \arg\max_{d} J(\theta + d) \quad \text{s.t.} \quad \mathrm{KL}(\pi_\theta \,\|\, \pi_{\theta + d}) \le \epsilon$
     This is a KL divergence constraint in distribution space, where it is much easier to pick the distance threshold.
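A toy illustration of why the KL threshold is easier to pick, assuming 1-D Gaussian policies: the same Euclidean step in the mean corresponds to wildly different KL divergences depending on the current $\sigma$:

```python
import numpy as np

def kl_gauss(mu0, sigma0, mu1, sigma1):
    # KL( N(mu0, sigma0^2) || N(mu1, sigma1^2) )
    return (np.log(sigma1 / sigma0)
            + (sigma0**2 + (mu0 - mu1)**2) / (2 * sigma1**2) - 0.5)

# The same Euclidean step in the mean (d = 0.1) ...
print(kl_gauss(0.0, 10.0, 0.1, 10.0))  # ~5e-5: barely moves the distribution
print(kl_gauss(0.0, 0.01, 0.1, 0.01))  # ~50:   a drastically different distribution
```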

  13. Solving the KL Constrained Problem.
     Unconstrained penalized objective:
     $d^* = \arg\max_{d}\; J(\theta + d) - \lambda\left( D_{KL}\left[\pi_\theta \,\|\, \pi_{\theta + d}\right] - \epsilon \right)$
     First-order Taylor expansion for the loss and second-order for the KL:
     $\approx \arg\max_{d}\; J(\theta_{\mathrm{old}}) + \nabla_\theta J(\theta)\big|_{\theta = \theta_{\mathrm{old}}} \cdot d - \frac{1}{2} \lambda\, d^\top \nabla^2_\theta D_{KL}\left[\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta\right]\big|_{\theta = \theta_{\mathrm{old}}}\, d + \lambda \epsilon$

  14. Taylor expansion of the KL: the first-order term vanishes.
     $D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_\theta) \approx D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_{\theta_{\mathrm{old}}}) + d^\top \nabla_\theta D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_\theta)\big|_{\theta = \theta_{\mathrm{old}}} + \frac{1}{2} d^\top \nabla^2_\theta D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_\theta)\big|_{\theta = \theta_{\mathrm{old}}}\, d$
     where $D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{\theta_{\mathrm{old}}}} \log\frac{P_{\theta_{\mathrm{old}}}(x)}{P_\theta(x)}$. The zeroth-order term is zero, and so is the first-order term:
     $\nabla_\theta D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_\theta)\big|_{\theta = \theta_{\mathrm{old}}} = -\nabla_\theta \mathbb{E}_{x \sim p_{\theta_{\mathrm{old}}}} \log P_\theta(x)\big|_{\theta = \theta_{\mathrm{old}}} = -\mathbb{E}_{x \sim p_{\theta_{\mathrm{old}}}} \nabla_\theta \log P_\theta(x)\big|_{\theta = \theta_{\mathrm{old}}} = -\mathbb{E}_{x \sim p_{\theta_{\mathrm{old}}}} \frac{1}{P_{\theta_{\mathrm{old}}}(x)} \nabla_\theta P_\theta(x)\big|_{\theta = \theta_{\mathrm{old}}} = -\int_x \nabla_\theta P_\theta(x)\big|_{\theta = \theta_{\mathrm{old}}} = -\nabla_\theta \int_x P_\theta(x)\Big|_{\theta = \theta_{\mathrm{old}}} = 0$
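A numerical sanity check (my own) that the first-order term really vanishes, for a Bernoulli distribution with parameter sigmoid($\theta$), using finite differences:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kl_bern(theta_old, theta):
    # KL between Bernoulli(sigmoid(theta_old)) and Bernoulli(sigmoid(theta))
    p, q = sigmoid(theta_old), sigmoid(theta)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta_old, h = 0.7, 1e-5
grad_at_old = (kl_bern(theta_old, theta_old + h) - kl_bern(theta_old, theta_old - h)) / (2 * h)
print(grad_at_old)  # ~0: the first-order term of the Taylor expansion vanishes
```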

  15. Taylor expansion of the KL: the second-order term.
     $\nabla^2_\theta D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_\theta)\big|_{\theta = \theta_{\mathrm{old}}} = -\mathbb{E}_{x \sim p_{\theta_{\mathrm{old}}}} \nabla^2_\theta \log P_\theta(x)\big|_{\theta = \theta_{\mathrm{old}}}$
     $= -\mathbb{E}_{x \sim p_{\theta_{\mathrm{old}}}} \nabla_\theta \left( \frac{\nabla_\theta P_\theta(x)}{P_\theta(x)} \right)\Big|_{\theta = \theta_{\mathrm{old}}}$
     $= -\mathbb{E}_{x \sim p_{\theta_{\mathrm{old}}}} \left( \frac{\nabla^2_\theta P_\theta(x)}{P_\theta(x)} - \frac{\nabla_\theta P_\theta(x)\, \nabla_\theta P_\theta(x)^\top}{P_\theta(x)^2} \right)\Big|_{\theta = \theta_{\mathrm{old}}}$
     $= -\mathbb{E}_{x \sim p_{\theta_{\mathrm{old}}}} \frac{\nabla^2_\theta P_\theta(x)\big|_{\theta = \theta_{\mathrm{old}}}}{P_{\theta_{\mathrm{old}}}(x)} + \mathbb{E}_{x \sim p_{\theta_{\mathrm{old}}}} \nabla_\theta \log P_\theta(x)\, \nabla_\theta \log P_\theta(x)^\top \big|_{\theta = \theta_{\mathrm{old}}}$
     $= \mathbb{E}_{x \sim p_{\theta_{\mathrm{old}}}} \nabla_\theta \log P_\theta(x)\, \nabla_\theta \log P_\theta(x)^\top \big|_{\theta = \theta_{\mathrm{old}}}$
     (the first term vanishes because it equals $\nabla^2_\theta \int_x P_\theta(x)\big|_{\theta = \theta_{\mathrm{old}}} = 0$, by the same argument as for the first-order term).
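Continuing the same Bernoulli example: the second derivative of the KL at $\theta_{\mathrm{old}}$ matches the Fisher information, which for this distribution is $p(1-p)$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kl_bern(theta_old, theta):
    p, q = sigmoid(theta_old), sigmoid(theta)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta_old, h = 0.7, 1e-4
hess = (kl_bern(theta_old, theta_old + h) - 2 * kl_bern(theta_old, theta_old)
        + kl_bern(theta_old, theta_old - h)) / h**2
p = sigmoid(theta_old)
print(hess, p * (1 - p))  # both ~0.22: the Hessian of the KL equals the Fisher information
```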

  16. Fisher Information Matrix.
     $F(\theta) = \mathbb{E}_{x \sim p_\theta}\left[ \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top \right]$
     It is exactly the Hessian of the KL divergence at $\theta_{\mathrm{old}}$: $F(\theta_{\mathrm{old}}) = \nabla^2_\theta D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_\theta)\big|_{\theta = \theta_{\mathrm{old}}}$
     $D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_\theta) \approx D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_{\theta_{\mathrm{old}}}) + d^\top \nabla_\theta D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_\theta)\big|_{\theta = \theta_{\mathrm{old}}} + \frac{1}{2} d^\top \nabla^2_\theta D_{KL}(p_{\theta_{\mathrm{old}}} \,\|\, p_\theta)\big|_{\theta = \theta_{\mathrm{old}}}\, d = \frac{1}{2} d^\top F(\theta_{\mathrm{old}})\, d = \frac{1}{2} (\theta - \theta_{\mathrm{old}})^\top F(\theta_{\mathrm{old}}) (\theta - \theta_{\mathrm{old}})$
     Since the KL divergence is roughly analogous to a distance measure between distributions, the Fisher information serves as a local distance metric between distributions: it tells you how much the distribution changes if you move the parameters a little bit in a given direction.
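In practice the expectation defining $F(\theta)$ is estimated from samples as the average outer product of score vectors; a sketch for the same Bernoulli example, where the Monte Carlo estimate should approach $p(1-p)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

theta = 0.7
p = sigmoid(theta)
x = rng.binomial(1, p, size=200_000)  # samples x ~ p_theta
score = x * (1 - p) + (1 - x) * (-p)  # d/dtheta log p_theta(x)
F_mc = np.mean(score**2)              # Monte Carlo Fisher estimate (1-D outer product)
print(F_mc, p * (1 - p))              # both ~0.22
```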

  17. Solving the KL Constrained Problem (continued).
     Unconstrained penalized objective:
     $d^* = \arg\max_{d}\; J(\theta + d) - \lambda\left( D_{KL}\left[\pi_\theta \,\|\, \pi_{\theta + d}\right] - \epsilon \right)$
     First-order Taylor expansion for the loss and second-order for the KL:
     $\approx \arg\max_{d}\; J(\theta_{\mathrm{old}}) + \nabla_\theta J(\theta)\big|_{\theta = \theta_{\mathrm{old}}} \cdot d - \frac{1}{2} \lambda\, d^\top \nabla^2_\theta D_{KL}\left[\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta\right]\big|_{\theta = \theta_{\mathrm{old}}}\, d + \lambda \epsilon$
     Substituting the Fisher information matrix (and dropping terms that do not depend on $d$):
     $= \arg\max_{d}\; \nabla_\theta J(\theta)\big|_{\theta = \theta_{\mathrm{old}}} \cdot d - \frac{1}{2} \lambda\, d^\top F(\theta_{\mathrm{old}})\, d = \arg\min_{d}\; -\nabla_\theta J(\theta)\big|_{\theta = \theta_{\mathrm{old}}} \cdot d + \frac{1}{2} \lambda\, d^\top F(\theta_{\mathrm{old}})\, d$
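Setting the gradient of this quadratic to zero gives $d \propto F(\theta_{\mathrm{old}})^{-1} \nabla_\theta J$, the natural gradient direction; a minimal sketch that solves the linear system and rescales the step so the quadratic KL estimate hits $\epsilon$ (the damping constant and names are my assumptions):

```python
import numpy as np

def natural_gradient_step(theta, grad_J, F, epsilon):
    # arg min_d  -grad_J . d + (lambda/2) d^T F d   =>   d proportional to F^{-1} grad_J
    d = np.linalg.solve(F + 1e-6 * np.eye(len(theta)), grad_J)  # small damping for stability
    # Rescale so that the quadratic KL estimate 0.5 * d^T F d equals epsilon
    step = np.sqrt(2 * epsilon / (d @ F @ d + 1e-12)) * d
    return theta + step
```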
