Pathwise derivatives, DDPG, Multigoal RL
Deep Reinforcement Learning and Control, Katerina Fragkiadaki
Carnegie Mellon School of Computer Science
Part of the slides on pathwise derivatives adapted from John Schulman.
Computing Gradients of Expectations

We want to compute $\nabla_\theta \mathbb{E}_{a \sim \pi_\theta} R(a, s)$.

When the variable w.r.t. which we are differentiating appears in the distribution, use the likelihood ratio gradient estimator:

$$\nabla_\theta \mathbb{E}_{x \sim p(\cdot|\theta)} F(x) = \mathbb{E}_{x \sim p(\cdot|\theta)} \left[ \nabla_\theta \log p(x|\theta) \, F(x) \right]$$

When the variable w.r.t. which we are differentiating appears inside the expectation, use the pathwise derivative:

$$\nabla_\theta \mathbb{E}_{z \sim \mathcal{N}(0,1)} F(x(\theta), z) = \mathbb{E}_{z \sim \mathcal{N}(0,1)} \left[ \nabla_\theta F(x(\theta), z) \right] = \mathbb{E}_{z \sim \mathcal{N}(0,1)} \left[ \frac{dF(x(\theta), z)}{dx} \frac{dx}{d\theta} \right]$$

Re-parametrization trick: for some distributions $p(x|\theta)$ we can switch from one gradient estimator to the other. Why would we want to do so?
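To see why we might want to switch, here is a minimal NumPy sketch comparing the two estimators on a toy problem (the objective $F(x) = x^2$, the Gaussian family, and the sample size are illustrative choices, not from the slides). Both estimators are unbiased, but the pathwise one typically has much lower variance.

```python
# Toy comparison of the two estimators for d/dtheta E_{x~N(theta,1)}[x^2].
# Ground truth: E[x^2] = theta^2 + 1, so the gradient is 2*theta.
import numpy as np

rng = np.random.default_rng(0)
theta, N = 1.5, 100_000

# Likelihood ratio: grad_theta log N(x | theta, 1) = (x - theta).
x = rng.normal(theta, 1.0, N)
lr_grad = np.mean((x - theta) * x**2)

# Pathwise: reparametrize x = theta + z, so dF/dtheta = dF/dx * dx/dtheta = 2x.
z = rng.normal(0.0, 1.0, N)
pw_grad = np.mean(2 * (theta + z))

print(lr_grad, pw_grad, 2 * theta)  # both approach 3.0; pathwise has far lower variance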
[Computation graph: states $s_0, s_1$, actions $a_t = \pi_\theta(s_t)$, transitions $T(s, a)$, rewards $r_t = \rho(s_t, a_t)$]

Reward and dynamics are known.
Legend: a deterministic node's value is a deterministic function of its input; a stochastic node's value is sampled based on its input (which parametrizes the distribution to sample from).
[Computation graph: $s_0 \to a_0 = \pi_\theta(s_0) \to r_0 = \rho(s_0, a_0)$, all nodes deterministic]

I want to learn $\theta$ to maximize the reward.
Since $a = \pi_\theta(s)$ is deterministic, I can compute the gradient with backpropagation:

$$\nabla_\theta \rho(s, a) = \frac{\partial \rho}{\partial a} \frac{\partial \pi_\theta}{\partial \theta}$$
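As a concrete illustration, a minimal PyTorch sketch of this fully deterministic case; the network sizes and the differentiable toy reward are assumptions for the example, not from the lecture.

```python
# Deterministic policy a = pi_theta(s) with a known, differentiable reward
# rho(s, a): the gradient w.r.t. theta is obtained by plain backpropagation.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def rho(s, a):                      # known differentiable reward (toy choice)
    return -((a - s[..., :2]) ** 2).sum(-1)

s = torch.randn(16, 4)              # batch of states
a = policy(s)                       # deterministic action
loss = -rho(s, a).mean()            # maximize reward = minimize negative reward
loss.backward()                     # d rho / d a  *  d a / d theta
```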
Now make the action node stochastic: $a_0 \sim \pi_\theta(\cdot|s_0)$, $r_0 = \rho(s_0, a_0)$. I still want to learn $\theta$ to maximize the reward. The gradient is

$$\mathbb{E}_a \left[ \nabla_\theta \log \pi_\theta(a|s) \, \rho(s, a) \right]$$
The likelihood ratio estimator works for both continuous and discrete actions.
For a Gaussian policy, the network outputs $\mu_\theta(s)$ and $\sigma_\theta(s)$, and the action is sampled as $a \sim \mathcal{N}(\mu(s, \theta), \Sigma(s, \theta))$. The likelihood ratio gradient is again

$$\mathbb{E}_a \left[ \nabla_\theta \log \pi_\theta(a|s) \, \rho(s, a) \right]$$
If $\sigma^2$ is constant:

$$\nabla_\theta \log \pi_\theta(a|s) = \frac{a - \mu(s; \theta)}{\sigma^2} \frac{\partial \mu(s; \theta)}{\partial \theta}$$
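A hedged PyTorch sketch of the likelihood ratio update in this Gaussian case; the stand-in rewards and shapes are illustrative. Autograd on `log_prob` reproduces the $(a - \mu)/\sigma^2 \cdot \partial\mu/\partial\theta$ formula above.

```python
# Likelihood-ratio gradient for a Gaussian policy with constant sigma.
import torch
import torch.nn as nn

mu_net, sigma = nn.Linear(4, 2), 0.5
s = torch.randn(8, 4)
mu = mu_net(s)
a = (mu + sigma * torch.randn_like(mu)).detach()   # sampled action, no grad path
rho = torch.randn(8)                               # stand-in for observed rewards

log_prob = torch.distributions.Normal(mu, sigma).log_prob(a).sum(-1)
loss = -(log_prob * rho).mean()                    # REINFORCE-style surrogate
loss.backward()                                    # equals (a - mu)/sigma^2 * dmu/dtheta, weighted by rho
```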
Re-parametrizing the Gaussian policy: instead of sampling $a$ directly, sample noise $z \sim \mathcal{N}(0, I)$ and compute the action deterministically from it:

$$a = \mu(s, \theta) + z \odot \sigma(s, \theta) \quad \text{(isotropic)}, \qquad a = \mu(s, \theta) + L z, \;\; \Sigma = L L^\top \quad \text{(general)}$$

Sanity check: $\mathbb{E}(\mu + z\sigma) = \mu$ and $\mathrm{Var}(\mu + z\sigma) = \sigma^2$, so the action distribution is unchanged. The stochasticity has moved into the input node $z$, so $a$ is now a deterministic function of $\theta$ and we can differentiate through it:

$$\frac{da}{d\theta} = \frac{d\mu(s, \theta)}{d\theta} + z \odot \frac{d\sigma(s, \theta)}{d\theta}$$

$$\nabla_\theta \mathbb{E}_z \left[ \rho(a(\theta, z), s) \right] = \mathbb{E}_z \left[ \frac{d\rho(a(\theta, z), s)}{da} \frac{da(\theta, z)}{d\theta} \right]$$

Sample estimate:

$$\nabla_\theta \frac{1}{N} \sum_{i=1}^{N} \rho(a(\theta, z_i), s) = \frac{1}{N} \sum_{i=1}^{N} \left. \frac{d\rho(a(\theta, z), s)}{da} \frac{da(\theta, z)}{d\theta} \right|_{z = z_i}$$

The pathwise derivative uses the derivative of the reward w.r.t. the action!
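In PyTorch this sample estimate is exactly what `rsample()` implements; a minimal sketch, with a toy differentiable reward and arbitrary sizes standing in for the real setup.

```python
# Reparametrized (pathwise) gradient for a Gaussian policy.
# rsample() draws a = mu + z * sigma with z ~ N(0, I), keeping the graph.
import torch
import torch.nn as nn

net = nn.Linear(4, 4)               # outputs [mu, log_sigma] for a 2-D action

def rho(s, a):                      # differentiable reward (toy choice)
    return -((a - 1.0) ** 2).sum(-1)

s = torch.randn(32, 4)
mu, log_sigma = net(s).chunk(2, dim=-1)
dist = torch.distributions.Normal(mu, log_sigma.exp())
a = dist.rsample()                  # a = mu + z * sigma, differentiable in theta
loss = -rho(s, a).mean()            # 1/N sum_i rho(a(theta, z_i), s)
loss.backward()                     # d rho/da * da/dtheta, averaged over z_i
```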
Likelihood ratio gradient estimator: $\mathbb{E}_a \left[ \nabla_\theta \log \pi_\theta(a|s) \, \rho(s, a) \right]$

Pathwise derivative: $\mathbb{E}_z \left[ \frac{d\rho(a(\theta, z), s)}{da} \frac{da(\theta, z)}{d\theta} \right]$
Re-parametrization for discrete actions: consider a variable $y$ following a $K$-way categorical distribution.

Gumbel-max trick: an exact sample can be drawn as

$$a = \arg\max_k \left( \log p_k + \epsilon_k \right), \qquad \epsilon_k = -\log(-\log u_k), \quad u_k \sim \mathcal{U}[0, 1]$$

The $\arg\max$ is not differentiable, so in the backward pass we use the soft (Gumbel-Softmax) distribution with temperature $\tau$:

$$y_k = \frac{\exp\left((\log p_k + \epsilon_k)/\tau\right)}{\sum_{j=1}^{K} \exp\left((\log p_j + \epsilon_j)/\tau\right)}$$

In the forward pass you sample from the parametrized distribution; in the backward pass you differentiate through the soft distribution: for $c \sim G(\log p)$,

$$\frac{dc}{d\theta} = \frac{dG}{dp} \frac{dp}{d\theta}$$

Categorical Reparameterization with Gumbel-Softmax, Jang et al. 2017
For binary neurons: the straight-through sigmoidal estimator samples a hard binary value in the forward pass and uses the sigmoid's gradient in the backward pass.
http://r2rt.com/binary-stochastic-neurons-in-tensorflow.html

For general categorically distributed neurons: sample a hard one-hot value in the forward pass and use the Gumbel-Softmax gradient in the backward pass.
Categorical Reparameterization with Gumbel-Softmax, Jang et al. 2017
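A minimal PyTorch sketch of the straight-through Gumbel-Softmax (the temperature `tau` and shapes are arbitrary choices). `F.gumbel_softmax` packages both passes; the manual version below it makes the forward/backward split explicit.

```python
# Straight-through Gumbel-Softmax: hard one-hot sample in the forward pass,
# gradients of the soft relaxation in the backward pass.
import torch
import torch.nn.functional as F

logits = torch.randn(8, 5, requires_grad=True)          # unnormalized log p, K = 5
tau = 1.0

y_soft = F.gumbel_softmax(logits, tau=tau, hard=False)  # soft relaxation
y_hard = F.gumbel_softmax(logits, tau=tau, hard=True)   # straight-through one-hot

# Equivalent manual straight-through trick:
g = -torch.log(-torch.log(torch.rand_like(logits)))     # Gumbel noise eps_k
y = F.softmax((logits + g) / tau, dim=-1)               # soft sample (backward path)
y_st = F.one_hot(y.argmax(-1), 5).float() + y - y.detach()  # hard forward, soft grads
```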
- Episodic MDP: we want to compute $\nabla_\theta \mathbb{E}[R_T]$.

[Stochastic computation graph: $\theta \to a_1, a_2, \ldots, a_T$; states $s_1, s_2, \ldots, s_T$; all feeding into the return $R_T$]

- Reparameterize: $a_t = \pi(s_t, z_t; \theta)$, where $z_t$ is noise from a fixed distribution. The graph gains noise nodes $z_1, z_2, \ldots, z_T$, and the action nodes become deterministic.
For the pathwise derivative to work, we need the transition dynamics and the reward function to be known:
$$\frac{d}{d\theta} \mathbb{E}[R_T] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{dR_T}{da_t} \frac{da_t}{d\theta} \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{d}{da_t} \mathbb{E}[R_T \mid a_t] \frac{da_t}{d\theta} \right]$$
…or we can replace the unknown inner expectation with a learned Q function, since $\mathbb{E}[R_T \mid s_t, a_t] = Q(s_t, a_t)$:
$$\frac{d}{d\theta} \mathbb{E}[R_T] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{dR_T}{da_t} \frac{da_t}{d\theta} \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{d}{da_t} \mathbb{E}[R_T \mid a_t] \frac{da_t}{d\theta} \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{dQ(s_t, a_t)}{da_t} \frac{da_t}{d\theta} \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{d}{d\theta} Q(s_t, \pi(s_t, z_t; \theta)) \right]$$
Learning Continuous Control Policies by Stochastic Value Gradients, Heess et al., NIPS 2015

- Learn $Q_\phi$ to approximate $Q^{\pi,\gamma}$, and use it to compute gradient estimates.
- Pseudocode:

for iteration = 1, 2, … do
    Execute policy $\pi_\theta$ to collect T timesteps of data
    Update $\pi_\theta$ using $g \propto \nabla_\theta \sum_{t=1}^{T} Q(s_t, \pi(s_t, z_t; \theta))$
    Update $Q_\phi$ using $g \propto \nabla_\phi \sum_{t=1}^{T} (Q_\phi(s_t, a_t) - \hat{Q}_t)^2$, e.g. with TD(λ)
end for
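A hedged PyTorch sketch of one iteration of this loop. The linear networks, sizes, and the fake batch (standing in for collected rollouts and the return estimates $\hat{Q}_t$) are all assumptions for illustration.

```python
# One iteration of the actor-critic loop above: update the policy through the
# learned critic, then fit the critic by regression to return estimates.
import torch
import torch.nn as nn

obs_dim, act_dim, T = 4, 2, 64
policy = nn.Linear(obs_dim + act_dim, act_dim)        # takes (s, z) as input
qf = nn.Linear(obs_dim + act_dim, 1)                  # Q_phi(s, a)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
q_opt = torch.optim.Adam(qf.parameters(), lr=3e-4)

# Fake batch; real code would execute pi_theta for T timesteps.
s = torch.randn(T, obs_dim)
a = torch.randn(T, act_dim)                           # executed actions
q_hat = torch.randn(T)                                # return estimates, e.g. TD(lambda)

# Policy update: g ∝ grad_theta sum_t Q(s_t, pi(s_t, z_t; theta))
z = torch.randn(T, act_dim)
pi_loss = -qf(torch.cat([s, policy(torch.cat([s, z], -1))], -1)).sum()
pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

# Critic update: g ∝ grad_phi sum_t (Q_phi(s_t, a_t) - Q_hat_t)^2
q_loss = ((qf(torch.cat([s, a], -1)).squeeze(-1) - q_hat) ** 2).sum()
q_opt.zero_grad(); q_loss.backward(); q_opt.step()
```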
What if we give up on stochastic actions?
Continuous control with deep reinforcement learning, Lillicrap et al. 2016

With a deterministic policy $a_t = \pi(s_t; \theta)$, the same derivation goes through without the noise variables:

$$\frac{d}{d\theta} \mathbb{E}[R_T] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{dR_T}{da_t} \frac{da_t}{d\theta} \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{d}{da_t} \mathbb{E}[R_T \mid a_t] \frac{da_t}{d\theta} \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{dQ(s_t, a_t)}{da_t} \frac{da_t}{d\theta} \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{d}{d\theta} Q(s_t, \pi(s_t; \theta)) \right]$$

Here the inner expectation $\mathbb{E}[R_T \mid a_t]$ refers to the dynamics after time $t$.
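A minimal sketch of the resulting deterministic actor update (DDPG-style): ascend $\frac{dQ}{da}\frac{da}{d\theta}$ with the critic held fixed for this step. Shapes and the sampled batch are illustrative; a full DDPG also needs a replay buffer, critic regression, and target networks.

```python
# DDPG-style actor update: gradient flows critic -> action -> actor.
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

s = torch.randn(128, obs_dim)                     # batch of states (e.g. from replay)
actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()                             # dQ/da * da/dtheta
actor_opt.step()
```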
[Architecture: an actor network with parameters $\theta^\mu$ maps state $s$ to action $a = \mu(s; \theta^\mu)$; a critic network with parameters $\theta^Q$ maps $(s, a)$ to $Q(s, a)$]

We follow a stochastic behavior policy to collect data. Deep Q-learning for continuous actions → DDPG.

(Where are the other versions? We will see them in the model-based RL lecture.)
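A small sketch of such a behavior policy for data collection: the deterministic actor's output plus exploration noise. Gaussian noise is a simplification; Lillicrap et al. used an Ornstein-Uhlenbeck process. The noise scale and action bounds are assumed hyperparameters.

```python
# Stochastic behavior policy around a deterministic actor.
import torch
import torch.nn as nn

def behavior_action(actor: nn.Module, s: torch.Tensor, noise_scale: float = 0.1):
    with torch.no_grad():
        a = actor(s)                              # deterministic action mu(s)
    # Add exploration noise and clip to the (assumed) action range [-1, 1].
    return (a + noise_scale * torch.randn_like(a)).clamp(-1.0, 1.0)
```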
[Computation graph: states $s_0, s_1, \ldots, s_T$, actions $a = \pi(s, z; \theta)$, rewards $r = R(a, s)$, and learned dynamics $s' = \hat{f}(s, a, \xi; \phi)$]

Re-parametrization trick for both policies and dynamics.
Learning Continuous Control Policies by Stochastic Value Gradients, Heess et al., NIPS 2015
https://www.youtube.com/watch?v=tJBIqkC1wWM&feature=youtu.be
The state representation input can be pixels, or the robot configuration and target locations.
Multigoal RL
So far we have trained one policy/value function per task, e.g., win the game of Tetris, win the game of Go, reach a *particular* location, put the green cube inside the gray bucket, etc.
Universal Value Function Approximators, Schaul et al. 2015
All the methods we have learned so far can be used. At the beginning of an episode, we sample not only a start state but also a goal g, which stays constant throughout the episode. The experience tuples should contain the goal.
What should the goal representation be? (Not an easy question.) It can be explicit, e.g., the location of the gripper, or we can map the goal to an embedding vector, e.g., asymmetric actor-critic, Pinto et al. 2017.
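A tiny sketch of goal conditioning (UVFA-style); the shapes are toy and the linear policy is a placeholder. The goal is simply appended to the state as policy input and stored in each experience tuple.

```python
# Goal-conditioned policy input and experience tuple.
import torch
import torch.nn as nn

obs_dim, goal_dim, act_dim = 6, 3, 2
policy = nn.Linear(obs_dim + goal_dim, act_dim)

g = torch.randn(goal_dim)                 # goal sampled at the episode start
s = torch.randn(obs_dim)
a = policy(torch.cat([s, g]))             # same g for the whole episode
transition = (s, g, a, 0.0, s)            # (s, g, a, r, s'); reward is a placeholder
```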
Hindsight Experience Replay (HER). Main idea: use failed executions under one goal g as successful executions under an alternative goal g′ (which is where we ended up at the end of the episode).

Goal g: our reacher at the end of the episode has missed it, so no reward :-( → store (s, g, a, 0, s′)
Goal g′: our reacher at the end of the episode has reached it, so reward :-) → store (s, g′, a, 1, s′)
Usually, as the additional goal we pick the goal that this episode actually achieved, so the reward becomes non-zero.
HER does not require reward shaping! :-) Reward shaping: instead of using binary rewards, use continuous rewards, e.g., by considering Euclidean distances from the goal configuration. The burden goes from designing the reward to designing the goal encoding… :-(
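A hedged Python sketch of the relabeling step. The tuple layout, the `achieved` field, and the exact-match binary reward are simplifying assumptions; real implementations compare goals up to a tolerance and sample several alternative goals per episode.

```python
# HER-style relabeling: a failed transition under goal g is stored a second
# time under the achieved goal g', where it earns reward 1.
def her_relabel(episode):
    """episode: list of (s, g, a, r, s_next, achieved) transitions."""
    g_prime = episode[-1][5]              # goal actually achieved at episode end
    relabeled = []
    for s, g, a, r, s_next, achieved in episode:
        new_r = 1.0 if achieved == g_prime else 0.0   # exact match; tolerance in practice
        relabeled.append((s, g_prime, a, new_r, s_next, achieved))
    return relabeled                      # store alongside the original tuples
```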