
NPFL122, Lecture 9

Deterministic Policy Gradient, Advanced RL Algorithms

Milan Straka

December 10, 2018

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics


REINFORCE with Baseline

The returns can be arbitrary – better-than-average and worse-than-average returns cannot be recognized from the absolute value of the return.

Hopefully, we can generalize the policy gradient theorem using a baseline $b(s)$ to
$$\nabla_\theta J(\theta) \propto \sum_{s \in S} \mu(s) \sum_{a \in A} \big(q_\pi(s, a) - b(s)\big) \nabla_\theta \pi(a|s; \theta).$$

A good choice for $b(s)$ is $v_\pi(s)$, which can be shown to minimize the variance of the estimator. Such a baseline resembles centering of returns, given that $v_\pi(s) = \mathbb{E}_{a \sim \pi}\, q_\pi(s, a)$. Then, better-than-average returns are positive and worse-than-average returns are negative.

The resulting value is also called an advantage function
$$a_\pi(s, a) \stackrel{\text{def}}{=} q_\pi(s, a) - v_\pi(s).$$

Of course, the baseline can only be approximated. If neural networks are used to estimate $v_\pi(s)$, then some part of the network is usually shared between the policy and the value function estimation, which is trained using the mean square error of the predicted and observed returns.
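To make the update concrete, here is a minimal sketch (not from the slides) of one REINFORCE-with-baseline episode update for a tabular softmax policy; the function name and the tabular setting are illustrative only.

```python
import numpy as np

def reinforce_with_baseline_update(theta, w, episode,
                                   alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    """theta: [S, A] policy logits, w: [S] baseline v(s), episode: list of (s, a, r)."""
    T = len(episode)
    returns = np.zeros(T)
    G = 0.0
    for t in reversed(range(T)):                 # compute the returns G_t
        G = episode[t][2] + gamma * G
        returns[t] = G
    for t, (s, a, _) in enumerate(episode):
        delta = returns[t] - w[s]                # advantage estimate G_t - b(s)
        w[s] += alpha_w * delta                  # baseline trained towards the observed return
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()
        grad_log_pi = -probs                     # ∇_θ log π(a|s) for a softmax policy
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * (gamma ** t) * delta * grad_log_pi
    return theta, w
```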


Parallel Advantage Actor Critic

An alternative to independent workers is to train in a synchronous and centralized way, by having the workers only generate episodes. Such an approach was described in May 2017 by Clemente et al., who named their agent parallel advantage actor-critic (PAAC).

[Figure: PAAC architecture. Workers (Worker 0 to Worker n_w) step the environments and pass states and rewards to a master, whose DNN computes the actions and learns from the collected targets.]

Figure 1 of the paper "Efficient Parallel Methods for Deep Reinforcement Learning" by Alfredo V. Clemente et al.
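A minimal sketch (not from the slides or the paper's code) of the synchronous rollout loop, assuming Gym-like environments and a batched `policy` function; all names are illustrative.

```python
import numpy as np

def paac_rollout(envs, policy, n_steps=5):
    """One synchronous n-step rollout: a single master evaluates the policy for all workers."""
    states = np.stack([env.reset() for env in envs])
    batch = []
    for _ in range(n_steps):
        probs = policy(states)                                    # one forward pass for every environment
        actions = [np.random.choice(len(p), p=p) for p in probs]
        steps = [env.step(a) for env, a in zip(envs, actions)]
        next_states = np.stack([s[0] for s in steps])
        rewards = np.array([s[1] for s in steps])
        dones = np.array([s[2] for s in steps])
        batch.append((states, actions, rewards, dones))
        states = next_states
    # The master then computes n-step returns/advantages from `batch`
    # and performs a single centralized gradient update.
    return batch, states
```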


Continuous Action Space

[Figure: probability density functions of normal distributions with $(\mu=0, \sigma^2=0.2)$, $(\mu=0, \sigma^2=1.0)$, $(\mu=0, \sigma^2=5.0)$ and $(\mu=-2, \sigma^2=0.5)$.]

Figure from section 13.7 of "Reinforcement Learning: An Introduction, Second Edition".

Until now, the actions were discrete. However, many environments naturally accept actions from a continuous space. We now consider actions which come from a range $[a, b]$ for $a, b \in \mathbb{R}$, or more generally from a Cartesian product of several such ranges $\prod_i [a_i, b_i]$.

A simple way to parametrize the action distribution is to choose the actions from a normal distribution. Given a mean $\mu$ and variance $\sigma^2$, the probability density function of $N(\mu, \sigma^2)$ is
$$p(x) \stackrel{\text{def}}{=} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$


Continuous Action Space in Gradient Methods

Utilizing continuous action spaces in gradient-based methods is straightforward. Instead of the softmax distribution we suitably parametrize the action value, usually using the normal distribution. Considering only one real-valued action, we therefore have
$$\pi(a|s; \theta) \stackrel{\text{def}}{=} P\big(a \sim N(\mu(s; \theta), \sigma(s; \theta)^2)\big),$$
where $\mu(s; \theta)$ and $\sigma(s; \theta)$ are function approximations of the mean and standard deviation of the action distribution.

The mean and standard deviation are usually computed from a shared representation, with
  • the mean being computed as a regular regression (i.e., one output neuron without activation);
  • the standard deviation (which must be positive) being computed again as a regression, followed most commonly by either $\exp$ or $\operatorname{softplus}$, where $\operatorname{softplus}(x) \stackrel{\text{def}}{=} \log(1 + e^x)$.


Continuous Action Space in Gradient Methods

During training, we compute $\mu(s; \theta)$ and $\sigma(s; \theta)$ and then sample the action value (clipping it to $[a, b]$ if required). To compute the loss, we utilize the probability density function of the normal distribution (and usually also add the entropy penalty).

    mu = tf.layers.dense(hidden_layer, 1)[:, 0]
    log_sd = tf.layers.dense(hidden_layer, 1)[:, 0]
    sd = tf.exp(log_sd)  # or sd = tf.nn.softplus(log_sd)

    normal_dist = tf.distributions.Normal(mu, sd)

    # Loss computed as -log π(a|s) * return - entropy_regularization * entropy
    loss = - normal_dist.log_prob(self.actions) * self.returns \
           - args.entropy_regularization * normal_dist.entropy()


Deterministic Policy Gradient Theorem

Combining continuous actions and Deep Q Networks is not straightforward. In order to do so, we need a different variant of the policy gradient theorem.

Recall that in the policy gradient theorem,
$$\nabla_\theta J(\theta) \propto \sum_{s \in S} \mu(s) \sum_{a \in A} q_\pi(s, a)\, \nabla_\theta \pi(a|s; \theta).$$

Deterministic Policy Gradient Theorem

Assume that the policy $\pi(s; \theta)$ is deterministic and computes an action $a \in \mathbb{R}$. Then, under several assumptions about continuity, the following holds:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim \mu(s)} \Big[\nabla_\theta \pi(s; \theta)\, \nabla_a q_\pi(s, a)\big|_{a=\pi(s;\theta)}\Big].$$

The theorem was first proven in the paper Deterministic Policy Gradient Algorithms by David Silver et al.


Deterministic Policy Gradient Theorem – Proof

The proof is very similar to the original (stochastic) policy gradient theorem. We assume that $p(s'|s, a)$, $\nabla_a p(s'|s, a)$, $r(s, a)$, $\nabla_a r(s, a)$, $\pi(s; \theta)$ and $\nabla_\theta \pi(s; \theta)$ are continuous in all parameters.

$$\begin{aligned}
\nabla_\theta v_\pi(s) &= \nabla_\theta q_\pi(s, \pi(s; \theta)) \\
&= \nabla_\theta \Big(r(s, \pi(s; \theta)) + \gamma \int_{s'} p(s'|s, \pi(s; \theta))\, v_\pi(s')\, \mathrm{d}s'\Big) \\
&= \nabla_\theta \pi(s; \theta)\, \nabla_a r(s, a)\big|_{a=\pi(s;\theta)} + \gamma \nabla_\theta \int_{s'} p(s'|s, \pi(s; \theta))\, v_\pi(s')\, \mathrm{d}s' \\
&= \nabla_\theta \pi(s; \theta)\, \nabla_a \Big(r(s, a) + \gamma \int_{s'} p(s'|s, a)\, v_\pi(s')\, \mathrm{d}s'\Big)\Big|_{a=\pi(s;\theta)} + \gamma \int_{s'} p(s'|s, \pi(s; \theta))\, \nabla_\theta v_\pi(s')\, \mathrm{d}s' \\
&= \nabla_\theta \pi(s; \theta)\, \nabla_a q_\pi(s, a)\big|_{a=\pi(s;\theta)} + \gamma \int_{s'} p(s'|s, \pi(s; \theta))\, \nabla_\theta v_\pi(s')\, \mathrm{d}s'
\end{aligned}$$

Similarly to the (stochastic) policy gradient theorem, we finish the proof by continually expanding $\nabla_\theta v_\pi(s')$.


Deep Deterministic Policy Gradients

Note that the formulation of the deterministic policy gradient theorem allows an off-policy algorithm, because the loss function no longer depends on actions (similarly to how expected Sarsa is also an off-policy algorithm). We therefore train function approximations for both $\pi(s; \theta)$ and $q(s, a; \theta)$, training $q(s, a; \theta)$ using a deterministic variant of the Bellman equation,
$$q(S_t, A_t; \theta) = \mathbb{E}_{R_{t+1}, S_{t+1}} \big[R_{t+1} + \gamma q(S_{t+1}, \pi(S_{t+1}; \theta))\big],$$
and $\pi(s; \theta)$ according to the deterministic policy gradient theorem.

The algorithm was first described in the paper Continuous Control with Deep Reinforcement Learning by Timothy P. Lillicrap et al. (2015).

The authors utilize a replay buffer, a target network (updated by exponential moving average with $\tau = 0.001$), batch normalization for CNNs, and perform exploration by adding normally distributed noise to the predicted actions. Training is performed by Adam with learning rates of 1e-4 and 1e-3 for the policy and the critic network, respectively.


Deep Deterministic Policy Gradients

Algorithm 1 DDPG algorithm
  Randomly initialize critic network Q(s, a|θ^Q) and actor µ(s|θ^µ) with weights θ^Q and θ^µ.
  Initialize target networks Q′ and µ′ with weights θ^{Q′} ← θ^Q, θ^{µ′} ← θ^µ.
  Initialize replay buffer R.
  for episode = 1, M do
    Initialize a random process N for action exploration
    Receive initial observation state s_1
    for t = 1, T do
      Select action a_t = µ(s_t|θ^µ) + N_t according to the current policy and exploration noise
      Execute action a_t and observe reward r_t and new state s_{t+1}
      Store transition (s_t, a_t, r_t, s_{t+1}) in R
      Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
      Set y_i = r_i + γ Q′(s_{i+1}, µ′(s_{i+1}|θ^{µ′})|θ^{Q′})
      Update the critic by minimizing the loss L = (1/N) Σ_i (y_i − Q(s_i, a_i|θ^Q))²
      Update the actor policy using the sampled policy gradient:
        ∇_{θ^µ} J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=µ(s_i)} ∇_{θ^µ} µ(s|θ^µ)|_{s_i}
      Update the target networks:
        θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′}
        θ^{µ′} ← τ θ^µ + (1 − τ) θ^{µ′}
    end for
  end for

Algorithm 1 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.
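A minimal single-update sketch of DDPG (not from the slides; the lecture's own code uses TensorFlow 1.x, this sketch uses PyTorch for brevity). `actor`, `critic`, `target_actor` and `target_critic` are assumed to be small `torch.nn.Module` networks, the critic takes `(states, actions)`, and the batch tensors are assumed to have matching shapes.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    states, actions, rewards, next_states = batch

    # Critic: regression towards the deterministic Bellman target y = r + γ Q'(s', µ'(s')).
    with torch.no_grad():
        targets = rewards + gamma * target_critic(next_states, target_actor(next_states))
    critic_loss = F.mse_loss(critic(states, actions), targets)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximize Q(s, µ(s)).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Target networks: exponential moving average with τ.
    for net, target in [(critic, target_critic), (actor, target_actor)]:
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```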


Deep Deterministic Policy Gradients

[Figure: normalized reward vs. million steps for Cart Pendulum Swing-up, Cartpole Swing-up, Fixed Reacher, Blockworld, Gripper, Puck Shooting, Monoped Balancing, Moving Gripper and Cheetah.]

Figure 2: Performance curves for a selection of domains using variants of DPG: original DPG algorithm (minibatch NFQCA) with batch normalization (light grey), with target network (dark grey), with target networks and batch normalization (green), with target networks from pixel-only inputs (blue). Target networks are crucial.

Figure 3 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.


Deep Deterministic Policy Gradients

Results using low-dimensional (lowd) version of the environment, pixel representation (pix) and DPG reference (cntrl).

| environment | R_av,lowd | R_best,lowd | R_av,pix | R_best,pix | R_av,cntrl | R_best,cntrl |
|---|---|---|---|---|---|---|
| blockworld1 | 1.156 | 1.511 | 0.466 | 1.299 | −0.080 | 1.260 |
| blockworld3da | 0.340 | 0.705 | 0.889 | 2.225 | −0.139 | 0.658 |
| canada | 0.303 | 1.735 | 0.176 | 0.688 | 0.125 | 1.157 |
| canada2d | 0.400 | 0.978 | −0.285 | 0.119 | −0.045 | 0.701 |
| cart | 0.938 | 1.336 | 1.096 | 1.258 | 0.343 | 1.216 |
| cartpole | 0.844 | 1.115 | 0.482 | 1.138 | 0.244 | 0.755 |
| cartpoleBalance | 0.951 | 1.000 | 0.335 | 0.996 | −0.468 | 0.528 |
| cartpoleParallelDouble | 0.549 | 0.900 | 0.188 | 0.323 | 0.197 | 0.572 |
| cartpoleSerialDouble | 0.272 | 0.719 | 0.195 | 0.642 | 0.143 | 0.701 |
| cartpoleSerialTriple | 0.736 | 0.946 | 0.412 | 0.427 | 0.583 | 0.942 |
| cheetah | 0.903 | 1.206 | 0.457 | 0.792 | −0.008 | 0.425 |
| fixedReacher | 0.849 | 1.021 | 0.693 | 0.981 | 0.259 | 0.927 |
| fixedReacherDouble | 0.924 | 0.996 | 0.872 | 0.943 | 0.290 | 0.995 |
| fixedReacherSingle | 0.954 | 1.000 | 0.827 | 0.995 | 0.620 | 0.999 |
| gripper | 0.655 | 0.972 | 0.406 | 0.790 | 0.461 | 0.816 |
| gripperRandom | 0.618 | 0.937 | 0.082 | 0.791 | 0.557 | 0.808 |
| hardCheetah | 1.311 | 1.990 | 1.204 | 1.431 | −0.031 | 1.411 |
| hopper | 0.676 | 0.936 | 0.112 | 0.924 | 0.078 | 0.917 |
| hyq | 0.416 | 0.722 | 0.234 | 0.672 | 0.198 | 0.618 |
| movingGripper | 0.474 | 0.936 | 0.480 | 0.644 | 0.416 | 0.805 |
| pendulum | 0.946 | 1.021 | 0.663 | 1.055 | 0.099 | 0.951 |
| reacher | 0.720 | 0.987 | 0.194 | 0.878 | 0.231 | 0.953 |
| reacher3daFixedTarget | 0.585 | 0.943 | 0.453 | 0.922 | 0.204 | 0.631 |
| reacher3daRandomTarget | 0.467 | 0.739 | 0.374 | 0.735 | −0.046 | 0.158 |
| reacherSingle | 0.981 | 1.102 | 1.000 | 1.083 | 1.010 | 1.083 |
| walker2d | 0.705 | 1.573 | 0.944 | 1.476 | 0.393 | 1.397 |
| torcs | −393.385 | 1840.036 | −401.911 | 1876.284 | −911.034 | 1961.600 |

Table 1 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.


Natural Policy Gradient

The following approach has been introduced by Kakade (2002).

Using the policy gradient theorem, we are able to compute $\nabla v_\pi$. Normally, we update the parameters using this gradient directly. This choice is justified by the fact that a vector $d$ which maximizes $v_\pi(s; \theta + d)$ under the constraint that $|d|^2$ is bounded by a small constant is exactly the gradient $\nabla v_\pi$.

Normally, the length $|d|^2$ is computed using the Euclidean metric. But in general, any metric could be used. Representing a metric using a positive-definite matrix $G$ (the identity matrix for the Euclidean metric), we can compute the distance as $|d|^2 = \sum_{ij} G_{ij} d_i d_j = d^T G d$. The steepest ascent direction is then given by $G^{-1} \nabla v_\pi$.

Note that when $G$ is the Hessian $H v_\pi$, the above process is exactly Newton's method.
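A tiny illustration (not from the slides) of the steepest ascent direction under a metric $G$: it is obtained by solving $G d = \nabla v$, here with a hand-made positive-definite matrix.

```python
import numpy as np

grad = np.array([1.0, 0.5])                      # some gradient ∇v_π
G = np.array([[2.0, 0.3],                        # positive-definite metric
              [0.3, 1.0]])

euclidean_direction = grad                       # G = I recovers the plain gradient
natural_direction = np.linalg.solve(G, grad)     # G^{-1} ∇v_π, without forming the inverse

print(euclidean_direction, natural_direction)
```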


Natural Policy Gradient

[Figure: (a) Vanilla policy gradient. (b) Natural policy gradient.]

Figure 3 of the paper "Reinforcement learning of motor skills with policy gradients" by Jan Peters et al.


Natural Policy Gradient

A suitable choice for the metric is the Fisher information matrix, defined as
$$F_s(\theta) \stackrel{\text{def}}{=} \mathbb{E}_{\pi(a|s; \theta)} \left[\frac{\partial \log \pi(a|s; \theta)}{\partial \theta_i} \frac{\partial \log \pi(a|s; \theta)}{\partial \theta_j}\right] = \mathbb{E}\big[\nabla_\theta \log \pi(a|s; \theta)\, \nabla_\theta \log \pi(a|s; \theta)^T\big].$$

It can be shown that the Fisher information metric is the only Riemannian metric (up to rescaling) invariant to change of parameters under sufficient statistics.

Recall the Kullback-Leibler distance (or relative entropy) defined as
$$D_{KL}(p \,\|\, q) \stackrel{\text{def}}{=} \sum_i p_i \log \frac{p_i}{q_i} = H(p, q) - H(p).$$

The Fisher information matrix is also the Hessian of the $D_{KL}\big(\pi(a|s; \theta) \,\|\, \pi(a|s; \theta')\big)$:
$$F_s(\theta) = \frac{\partial^2}{\partial \theta'_i\, \partial \theta'_j} D_{KL}\big(\pi(a|s; \theta) \,\|\, \pi(a|s; \theta')\big)\Big|_{\theta' = \theta}.$$
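An illustrative numerical check (not from the slides): for a categorical softmax policy over three actions with logits $\theta$, the Fisher matrix $\mathbb{E}[\nabla \log \pi\, \nabla \log \pi^T]$ coincides with the Hessian of $D_{KL}(\pi_\theta \| \pi_{\theta'})$ at $\theta' = \theta$, compared here via finite differences.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

theta = np.array([0.2, -0.5, 1.0])
p = softmax(theta)

# Fisher matrix E_{a~π}[∇ log π(a) ∇ log π(a)^T]; for softmax, ∇_θ log π(a) = e_a − p.
fisher = np.zeros((3, 3))
for a in range(3):
    grad_log = -p.copy()
    grad_log[a] += 1.0
    fisher += p[a] * np.outer(grad_log, grad_log)

# Hessian of θ' ↦ D_KL(π_θ || π_θ') at θ' = θ, by central finite differences.
eps = 1e-4
hessian = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        def f(di, dj):
            t = theta.copy(); t[i] += di; t[j] += dj
            return kl(p, softmax(t))
        hessian[i, j] = (f(eps, eps) - f(eps, -eps) - f(-eps, eps) + f(-eps, -eps)) / (4 * eps ** 2)

print(np.allclose(fisher, hessian, atol=1e-5))  # True up to finite-difference error
```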


Natural Policy Gradient

Using the metric $F(\theta) = \mathbb{E}_{s \sim \mu_\theta} F_s(\theta)$, we want to update the parameters using $d_F \stackrel{\text{def}}{=} F(\theta)^{-1} \nabla v_\pi$.

An interesting property of using $d_F$ to update the parameters is that
  • updating $\theta$ using $\nabla v_\pi$ will choose an arbitrary better action in state $s$;
  • updating $\theta$ using $F(\theta)^{-1} \nabla v_\pi$ chooses the best action (maximizing expected return), similarly to tabular greedy policy improvement.

However, computing $d_F$ in a straightforward way is too costly.


Truncated Natural Policy Gradient

Duan et al. (2016) in the paper Benchmarking Deep Reinforcement Learning for Continuous Control propose a modification of NPG to compute $d_F$ efficiently.

Following Schulman et al. (2015), they suggest to use the conjugate gradient algorithm, which can solve a system of linear equations $Ax = b$ in an iterative manner, by using $A$ only to compute products $Av$ for a suitable $v$.

Therefore, $d_F$ is found as a solution of
$$F(\theta)\, d_F = \nabla v_\pi,$$
and using only 10 iterations of the algorithm seems to suffice according to the experiments.

Furthermore, Duan et al. suggest to use a specific learning rate suggested by Peters et al. (2008) of
$$\alpha = \sqrt{\frac{\delta_{KL}}{(\nabla v_\pi)^T F(\theta)^{-1} \nabla v_\pi}}.$$
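An illustrative sketch (not from the slides) of the conjugate gradient algorithm: it solves $F x = g$ using only products $Fv$, so $F$ never has to be formed or inverted explicitly.

```python
import numpy as np

def conjugate_gradient(fvp, g, iterations=10, tol=1e-10):
    """Solve F x = g given a function fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()               # residual g - F x (x = 0 initially)
    p = g.copy()               # search direction
    r_dot = r @ r
    for _ in range(iterations):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

# Tiny usage example with an explicit positive-definite F; in TNPG, fvp would instead
# compute the Fisher-vector product from the policy without materializing F.
F = np.array([[2.0, 0.1], [0.1, 1.0]])
g = np.array([1.0, -0.5])
d_F = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ d_F, g))
```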


Trust Region Policy Optimization

Schulman et al. in 2015 wrote an influential paper introducing TRPO as an improved variant of NPG.

Considering two policies $\pi, \tilde\pi$, we can write
$$v_{\tilde\pi} = v_\pi + \mathbb{E}_{s \sim \mu(\tilde\pi)}\, \mathbb{E}_{a \sim \tilde\pi(a|s)}\, a_\pi(a|s),$$
where $a_\pi(a|s)$ is the advantage function $q_\pi(a|s) - v_\pi(s)$ and $\mu(\tilde\pi)$ is the on-policy distribution of the policy $\tilde\pi$.

Analogously to policy improvement, we see that if $a_\pi(a|s) \geq 0$, the policy $\tilde\pi$ performance increases (or stays the same if the advantages are zero everywhere).

However, sampling states $s \sim \mu(\tilde\pi)$ is costly. Therefore, we instead consider
$$L_\pi(\tilde\pi) = v_\pi + \mathbb{E}_{s \sim \mu(\pi)}\, \mathbb{E}_{a \sim \tilde\pi(a|s)}\, a_\pi(a|s).$$


Trust Region Policy Optimization

It can be shown that, for the parametrized $\pi(a|s; \theta)$, $L_\pi(\tilde\pi)$ matches $v_{\tilde\pi}$ to the first order.

Schulman et al. additionally prove that if we denote $\alpha = D_{KL}^{\max}(\pi_{old} \,\|\, \pi_{new}) = \max_s D_{KL}\big(\pi_{old}(\cdot|s) \,\|\, \pi_{new}(\cdot|s)\big)$, then
$$v_{\pi_{new}} \geq L_{\pi_{old}}(\pi_{new}) - \frac{4\varepsilon\gamma}{(1-\gamma)^2}\,\alpha \quad\text{where } \varepsilon = \max_{s,a} |a_\pi(s, a)|.$$

Therefore, TRPO maximizes $L_{\pi_{\theta_{old}}}(\pi_\theta)$ subject to $D_{KL}^{\theta_{old}}(\pi_{\theta_{old}} \,\|\, \pi_\theta) < \delta$, where
  • $D_{KL}^{\theta_{old}}(\pi_{\theta_{old}} \,\|\, \pi_\theta) = \mathbb{E}_{s \sim \mu(\pi_{\theta_{old}})} \big[D_{KL}\big(\pi_{old}(\cdot|s) \,\|\, \pi_{new}(\cdot|s)\big)\big]$ is used instead of $D_{KL}^{\max}$ for performance reasons;
  • $\delta$ is a constant found empirically, as the one implied by the above equation is too small;
  • importance sampling is used to account for sampling actions from $\pi$.


Trust Region Policy Optimization

The parameters are updated using the constrained optimization
$$\text{maximize } L_{\pi_{\theta_{old}}}(\pi_\theta) \ \text{ subject to } \ D_{KL}^{\theta_{old}}(\pi_{\theta_{old}} \,\|\, \pi_\theta) < \delta,$$
computing $d_F = F(\theta)^{-1} \nabla L_{\pi_{\theta_{old}}}(\pi_\theta)$ and utilizing the conjugate gradient algorithm as described earlier for TNPG (note that the algorithm was designed originally for TRPO and only later employed for TNPG).

To guarantee improvement and respect the constraint, a line search is in fact performed. We start with the learning rate of $\sqrt{\delta / (d_F^T F(\theta)\, d_F)}$ and shrink it exponentially until the constraint is satisfied and the objective improves.
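A minimal sketch (not from the slides) of the backtracking line search along the natural gradient direction, shrinking the step until the KL constraint holds and the surrogate objective improves; `surrogate` and `kl_to_old` are illustrative callables.

```python
import numpy as np

def trpo_line_search(theta, d_F, surrogate, kl_to_old, initial_step, delta,
                     shrink=0.5, max_backtracks=10):
    """surrogate(θ) estimates L_{π_old}(π_θ); kl_to_old(θ) estimates D_KL(π_old || π_θ)."""
    old_objective = surrogate(theta)
    step = initial_step
    for _ in range(max_backtracks):
        candidate = theta + step * d_F
        if kl_to_old(candidate) < delta and surrogate(candidate) > old_objective:
            return candidate
        step *= shrink                      # shrink the step exponentially
    return theta                            # no acceptable step found; keep the old parameters
```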


Trust Region Policy Optimization

Figure 1. Illustration of locomotion tasks: (a) Swimmer; (b) Hopper; (c) Walker; (d) Half-Cheetah; (e) Ant; (f) Simple Humanoid; and (g) Full Humanoid.

Figure 1 of the paper "Benchmarking Deep Reinforcement Learning for Continuous Control" by Duan et al.

| Task | Random | REINFORCE | TNPG | RWR | REPS | TRPO | CEM | CMA-ES | DDPG |
|---|---|---|---|---|---|---|---|---|---|
| Cart-Pole Balancing | 77.1 ± 0.0 | 4693.7 ± 14.0 | 3986.4 ± 748.9 | 4861.5 ± 12.3 | 565.6 ± 137.6 | 4869.8 ± 37.6 | 4815.4 ± 4.8 | 2440.4 ± 568.3 | 4634.4 ± 87.8 |
| Inverted Pendulum* | −153.4 ± 0.2 | 13.4 ± 18.0 | 209.7 ± 55.5 | 84.7 ± 13.8 | −113.3 ± 4.6 | 247.2 ± 76.1 | 38.2 ± 25.7 | −40.1 ± 5.7 | 40.0 ± 244.6 |
| Mountain Car | −415.4 ± 0.0 | −67.1 ± 1.0 | −66.5 ± 4.5 | −79.4 ± 1.1 | −275.6 ± 166.3 | −61.7 ± 0.9 | −66.0 ± 2.4 | −85.0 ± 7.7 | −288.4 ± 170.3 |
| Acrobot | −1904.5 ± 1.0 | −508.1 ± 91.0 | −395.8 ± 121.2 | −352.7 ± 35.9 | −1001.5 ± 10.8 | −326.0 ± 24.4 | −436.8 ± 14.7 | −785.6 ± 13.1 | −223.6 ± 5.8 |
| Double Inverted Pendulum* | 149.7 ± 0.1 | 4116.5 ± 65.2 | 4455.4 ± 37.6 | 3614.8 ± 368.1 | 446.7 ± 114.8 | 4412.4 ± 50.4 | 2566.2 ± 178.9 | 1576.1 ± 51.3 | 2863.4 ± 154.0 |
| Swimmer* | −1.7 ± 0.1 | 92.3 ± 0.1 | 96.0 ± 0.2 | 60.7 ± 5.5 | 3.8 ± 3.3 | 96.0 ± 0.2 | 68.8 ± 2.4 | 64.9 ± 1.4 | 85.8 ± 1.8 |
| Hopper | 8.4 ± 0.0 | 714.0 ± 29.3 | 1155.1 ± 57.9 | 553.2 ± 71.0 | 86.7 ± 17.6 | 1183.3 ± 150.0 | 63.1 ± 7.8 | 20.3 ± 14.3 | 267.1 ± 43.5 |
| 2D Walker | −1.7 ± 0.0 | 506.5 ± 78.8 | 1382.6 ± 108.2 | 136.0 ± 15.9 | −37.0 ± 38.1 | 1353.8 ± 85.0 | 84.5 ± 19.2 | 77.1 ± 24.3 | 318.4 ± 181.6 |
| Half-Cheetah | −90.8 ± 0.3 | 1183.1 ± 69.2 | 1729.5 ± 184.6 | 376.1 ± 28.2 | 34.5 ± 38.0 | 1914.0 ± 120.1 | 330.4 ± 274.8 | 441.3 ± 107.6 | 2148.6 ± 702.7 |
| Ant* | 13.4 ± 0.7 | 548.3 ± 55.5 | 706.0 ± 127.7 | 37.6 ± 3.1 | 39.0 ± 9.8 | 730.2 ± 61.3 | 49.2 ± 5.9 | 17.8 ± 15.5 | 326.2 ± 20.8 |
| Simple Humanoid | 41.5 ± 0.2 | 128.1 ± 34.0 | 255.0 ± 24.5 | 93.3 ± 17.4 | 28.3 ± 4.7 | 269.7 ± 40.3 | 60.6 ± 12.9 | 28.7 ± 3.9 | 99.4 ± 28.1 |
| Full Humanoid | 13.2 ± 0.1 | 262.2 ± 10.5 | 288.4 ± 25.2 | 46.7 ± 5.6 | 41.7 ± 6.1 | 287.0 ± 23.4 | 36.9 ± 2.9 | N/A | 119.0 ± 31.2 |

Table 1 of the paper "Benchmarking Deep Reinforcement Learning for Continuous Control" by Duan et al.


Proximal Policy Optimization

A simplification of TRPO which can be implemented using a few lines of code.

Let $r_t(\theta) \stackrel{\text{def}}{=} \dfrac{\pi(A_t|S_t; \theta)}{\pi(A_t|S_t; \theta_{old})}$. PPO maximizes the objective
$$L^{CLIP}(\theta) \stackrel{\text{def}}{=} \mathbb{E}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1 - \varepsilon, 1 + \varepsilon)\hat{A}_t\big)\Big].$$
Such $L^{CLIP}(\theta)$ is a lower (pessimistic) bound.

[Figure: $L^{CLIP}$ as a function of the ratio $r$, clipped at $1+\varepsilon$ for $A > 0$ and at $1-\varepsilon$ for $A < 0$.]

Figure 1 of the paper "Proximal Policy Optimization Algorithms" by Schulman et al.

Proximal Policy Optimization

The advantages $\hat{A}_t$ are additionally estimated using generalized advantage estimation. Instead of the usual
$$\hat{A}_t \stackrel{\text{def}}{=} \sum_{i=0}^{T-t-1} \gamma^i R_{t+1+i} + \gamma^{T-t} V(S_T) - V(S_t),$$
the authors employ
$$\hat{A}_t \stackrel{\text{def}}{=} \sum_{i=0}^{T-t-1} (\gamma\lambda)^i \delta_{t+i}, \quad\text{where } \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t).$$

Algorithm 1 PPO, Actor-Critic Style
  for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do
      Run policy π_{θ_old} in environment for T timesteps
      Compute advantage estimates Â_1, ..., Â_T
    end for
    Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
    θ_old ← θ
  end for

Algorithm 1 of the paper "Proximal Policy Optimization Algorithms" by Schulman et al.
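A minimal sketch (not from the slides) of generalized advantage estimation computed backwards over a single T-step rollout; `values` is assumed to have length T+1 so that `values[T]` bootstraps the tail, and episode termination handling is omitted.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # δ_t
        running = delta + gamma * lam * running                  # Σ_i (γλ)^i δ_{t+i}
        advantages[t] = running
    return advantages
```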


Proximal Policy Optimization

Figure 3 of the paper "Proximal Policy Optimization Algorithms" by Schulman et al.


Soft Actor Critic

The paper Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor by Tuomas Haarnoja et al. introduces a different off-policy algorithm for continuous action space. The general idea is to introduce entropy directly in the value function we want to maximize.
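For reference (not spelled out on this slide, but stated in the cited paper), the maximum entropy objective augments the expected return with the entropy of the policy at every visited state, weighted by a temperature $\alpha$:
$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot\,|\,s_t)\big)\big].$$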


Soft Actor Critic

Algorithm 1 Soft Actor-Critic
  Initialize parameter vectors ψ, ψ̄, θ, φ.
  for each iteration do
    for each environment step do
      a_t ∼ π_φ(a_t|s_t)
      s_{t+1} ∼ p(s_{t+1}|s_t, a_t)
      D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}
    end for
    for each gradient step do
      ψ ← ψ − λ_V ∇̂_ψ J_V(ψ)
      θ_i ← θ_i − λ_Q ∇̂_{θ_i} J_Q(θ_i) for i ∈ {1, 2}
      φ ← φ − λ_π ∇̂_φ J_π(φ)
      ψ̄ ← τψ + (1 − τ)ψ̄
    end for
  end for

Algorithm 1 of the paper "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" by Tuomas Haarnoja et al.


Soft Actor Critic

[Figure: average return vs. million steps on (a) Hopper-v1, (b) Walker2d-v1, (c) HalfCheetah-v1, (d) Ant-v1, (e) Humanoid-v1 and (f) Humanoid (rllab), comparing SAC, DDPG, PPO, SQL and TD3 (concurrent).]

Figure 1 of the paper "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" by Tuomas Haarnoja et al.
