TD3, Monte Carlo Tree Search Milan Straka December 17, 2018 - PowerPoint PPT Presentation

NPFL122, Lecture 10 TD3, Monte Carlo Tree Search Milan Straka December 17, 2018 Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics unless otherwise stated

Deterministic Policy Gradient Theorem
Combining continuous actions and Deep Q Networks is not straightforward. In order to do so, we need a different variant of the policy gradient theorem.
Recall that in the policy gradient theorem,
$$\nabla_\theta J(\theta) \propto \sum_{s \in S} \mu(s) \sum_{a \in A} q_\pi(s, a) \nabla_\theta \pi(a \mid s; \theta).$$
Deterministic Policy Gradient Theorem
Assume that the policy $\pi(s; \theta)$ is deterministic and computes an action $a \in \mathbb{R}$. Then under several assumptions about continuousness, the following holds:
$$\nabla_\theta J(\theta) \propto E_{s \sim \mu(s)} \Big[ \nabla_\theta \pi(s; \theta) \nabla_a q_\pi(s, a) \big|_{a = \pi(s; \theta)} \Big].$$
The theorem was first proven in the paper Deterministic Policy Gradient Algorithms by David Silver et al.
NPFL122, Lecture 10 Refresh TD3 AlphaZero A0-MCTS A0-Network A0-Training A0-Evaluation 2/28
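The chain-rule form of the deterministic policy gradient can be checked numerically on a toy problem. The sketch below is only an illustration under strong assumptions: the linear policy `policy`, the quadratic function `q_value`, and the uniform state distribution are all hypothetical, and `q_value` is treated as a fixed known function rather than the true $q_\pi$ of some environment.

```python
import numpy as np

def policy(s, theta):
    """Deterministic toy policy a = theta * s (hypothetical choice)."""
    return theta * s

def q_value(s, a):
    """Toy action-value function, maximal when a == 2 * s (hypothetical)."""
    return -(a - 2.0 * s) ** 2

def dpg_gradient(states, theta):
    """Estimate E_s[ grad_theta pi(s; theta) * grad_a q(s, a) at a = pi(s; theta) ]."""
    a = policy(states, theta)
    grad_theta_pi = states                  # d(theta * s) / d theta
    grad_a_q = -2.0 * (a - 2.0 * states)    # d q(s, a) / d a
    return np.mean(grad_theta_pi * grad_a_q)

def objective(states, theta):
    """Average value of the actions chosen by the policy."""
    return np.mean(q_value(states, policy(states, theta)))

rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=1000)  # fixed sample standing in for mu(s)
theta = 0.5

analytic = dpg_gradient(states, theta)
eps = 1e-5
numeric = (objective(states, theta + eps) - objective(states, theta - eps)) / (2 * eps)
print(analytic, numeric)
```

Both numbers should agree up to floating-point error, because in this toy setting the objective depends on $\theta$ only through the chosen action.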

Deep Deterministic Policy Gradients
Note that the formulation of the deterministic policy gradient theorem allows an off-policy algorithm, because the loss function no longer depends on actions (similarly to how expected Sarsa is also an off-policy algorithm).
We therefore train function approximation for both $\pi(s; \theta)$ and $q(s, a; \theta)$, training $q(s, a; \theta)$ using a deterministic variant of the Bellman equation:
$$q(S_t, A_t; \theta) = E_{R_{t+1}, S_{t+1}} \big[ R_{t+1} + \gamma q(S_{t+1}, \pi(S_{t+1}; \theta)) \big]$$
and $\pi(s; \theta)$ according to the deterministic policy gradient theorem.
The algorithm was first described in the paper Continuous Control with Deep Reinforcement Learning by Timothy P. Lillicrap et al. (2015).
The authors utilize a replay buffer, a target network (updated by exponential moving average with $\tau = 0.001$), batch normalization for CNNs, and perform exploration by adding a normal-distributed noise to predicted actions. Training is performed by Adam with learning rates of 1e-4 and 1e-3 for the policy and critic network, respectively.
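The exponential moving average used for the target network can be sketched as follows. This is a minimal illustration, assuming the network weights are available as a list of NumPy arrays; the helper name `soft_update` and the toy weights are hypothetical.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    """Return new target weights tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

# Toy weights: the online network is held fixed, the target starts at zero.
online = [np.array([1.0, 2.0]), np.array([[3.0]])]
target = [np.zeros(2), np.zeros((1, 1))]

for _ in range(1000):
    target = soft_update(target, online)

print(target[0])  # the target slowly approaches the online weights
```

With $\tau = 0.001$ the target weights cover only a fraction $1 - 0.999^{1000}$ of the distance to the online weights after 1000 updates, which shows how slowly the learning targets move.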

Deep Deterministic Policy Gradients
Algorithm 1 DDPG algorithm
  Randomly initialize critic network $Q(s, a \mid \theta^Q)$ and actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$.
  Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
  Initialize replay buffer $R$
  for episode = 1, M do
    Initialize a random process $\mathcal{N}$ for action exploration
    Receive initial observation state $s_1$
    for t = 1, T do
      Select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
      Execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
      Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
      Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
      Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
      Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i (y_i - Q(s_i, a_i \mid \theta^Q))^2$
      Update the actor policy using the sampled policy gradient:
        $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q) \big|_{s = s_i, a = \mu(s_i)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \big|_{s_i}$
      Update the target networks:
        $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau) \theta^{Q'}$
        $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau) \theta^{\mu'}$
    end for
  end for
Algorithm 1 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.
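The critic step of the algorithm (computing $y_i$ and the mean squared loss) might look roughly like the sketch below. The linear `critic` and `actor`, their weights, and the four-transition-free toy batch are hypothetical stand-ins for neural networks and replay-buffer samples; as in the pseudocode, no special handling of terminal states is included.

```python
import numpy as np

def critic(s, a, w):
    """Toy linear critic Q(s, a | w) = w[0] * s + w[1] * a (hypothetical)."""
    return w[0] * s + w[1] * a

def actor(s, v):
    """Toy linear actor mu(s | v) = v * s (hypothetical)."""
    return v * s

def ddpg_critic_loss(batch, w, w_target, v_target, gamma=0.99):
    """Return the mean squared critic loss and the targets y_i for a minibatch."""
    s, a, r, s_next = batch
    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})), using the target networks.
    y = r + gamma * critic(s_next, actor(s_next, v_target), w_target)
    loss = np.mean((y - critic(s, a, w)) ** 2)
    return loss, y

# A minibatch of two transitions (s_i, a_i, r_i, s_{i+1}).
batch = (np.array([0.0, 1.0]), np.array([0.5, -0.5]),
         np.array([1.0, 0.0]), np.array([1.0, 2.0]))
w = np.array([1.0, 1.0])         # online critic weights
w_target = np.array([1.0, 1.0])  # target critic weights
v_target = 0.5                   # target actor weight

loss, y = ddpg_critic_loss(batch, w, w_target, v_target)
print(loss, y)
```

In a real implementation the loss would be minimized by a gradient step on the online critic only, while the target weights change solely through the soft updates at the end of the loop.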

Twin Delayed Deep Deterministic Policy Gradient
The paper Addressing Function Approximation Error in Actor-Critic Methods by Scott Fujimoto et al. from February 2018 proposes improvements to DDPG which decrease maximization bias by training two critics and choosing the minimum of their predictions, and which introduce two variance-lowering optimizations: delayed policy updates and target policy smoothing.

TD3 – Maximization Bias
Similarly to Q-learning, the DDPG algorithm suffers from maximization bias. In Q-learning, the maximization bias was caused by the explicit $\max$ operator. For DDPG methods, it can be caused by the gradient descent itself.
Let $\theta_{approx}$ be the parameters maximizing $q_\theta$ and let $\theta_{true}$ be the hypothetical parameters which maximise the true $q_\pi$, and let $\pi_{approx}$ and $\pi_{true}$ denote the corresponding policies.
Because the gradient direction is a local maximizer, for a sufficiently small step size $\alpha < \varepsilon_1$ we have
$$E[q_\theta(s, \pi_{approx})] \ge E[q_\theta(s, \pi_{true})].$$
However, for the real $q_\pi$ and for sufficiently small $\alpha < \varepsilon_2$ it holds that
$$E[q_\pi(s, \pi_{true})] \ge E[q_\pi(s, \pi_{approx})].$$
Therefore, if $E[q_\theta(s, \pi_{true})] \ge E[q_\pi(s, \pi_{true})]$, then for $\alpha < \min(\varepsilon_1, \varepsilon_2)$
$$E[q_\theta(s, \pi_{approx})] \ge E[q_\pi(s, \pi_{approx})].$$
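The final inequality says that the approximate critic tends to overestimate the value of the policy that was optimized against it. A rough simulation of this effect is sketched below. It is a simplification, not the gradient-based argument itself: the gradient step is replaced by an explicit argmax over a grid of actions, and the true values `true_q` and the zero-mean Gaussian critic error are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.linspace(-1.0, 1.0, 21)
true_q = -actions ** 2   # hypothetical true values q_pi(s, a) in a single state

gaps = []
for _ in range(2000):
    # Approximate critic q_theta: the true values plus zero-mean estimation error.
    approx_q = true_q + rng.normal(0.0, 0.5, size=actions.shape)
    best = np.argmax(approx_q)   # action preferred according to the approximate critic
    # Estimated value minus true value of the chosen action.
    gaps.append(approx_q[best] - true_q[best])

mean_gap = float(np.mean(gaps))
print(mean_gap)
```

Although the critic error has zero mean for every individual action, the gap for the selected action is positive on average, because the selection favours actions whose error happens to be large.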

TD3 – Maximization Bias
[Figure 1 of the paper "Addressing Function Approximation Error in Actor-Critic Methods" by Scott Fujimoto et al.: average value against time steps (1e6) for CDQ and DDPG, each together with its true value; panels (a) Hopper-v1 and (b) Walker2d-v1.]
[Figure 2 of the paper "Addressing Function Approximation Error in Actor-Critic Methods" by Scott Fujimoto et al.: average value against time steps (1e6) for DQ-AC and DDQN-AC, each together with its true value; panels (a) Hopper-v1 and (b) Walker2d-v1.]
Analogously to Double DQN, we could compute the learning targets using the current policy and the target critic, i.e., $r + \gamma q_{\theta'}(s', \pi_\theta(s'))$ (instead of using the target policy and the target critic as in DDPG), obtaining the DDQN-AC algorithm. However, the authors found that the policy changes too slowly, so the target and current networks are too similar.
Using the original Double Q-learning, two pairs of actors and critics could be used, with the learning targets computed by the opposite critic, i.e., $r + \gamma q_{\theta'_2}(s', \pi_{\theta_1}(s'))$ for updating $q_{\theta_1}$. The resulting DQ-AC algorithm is slightly better, but still suffers from overestimation.

TD3 – Algorithm
The authors instead suggest employing two critics and one actor. The actor is trained using one of the critics, and both critics are trained using the same target computed using the minimum value of both critics as
$$r + \gamma \min_{i=1,2} q_{\theta'_i}(s', \pi_{\theta'}(s')).$$
Furthermore, the authors suggest two additional improvements for variance reduction.
For obtaining higher quality target values, the authors propose to train the critics more often. Therefore, critics are updated each step, but the actor and the target networks are updated only every $d$-th step ($d = 2$ is used in the paper).
To explicitly model that similar actions should lead to similar results, a small random noise is added to performed actions when computing the target value:
$$r + \gamma \min_{i=1,2} q_{\theta'_i}(s', \pi_{\theta'}(s') + \varepsilon) \quad \text{for } \varepsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c).$$
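The smoothed, clipped target can be sketched in a few lines. Everything named here is illustrative: `toy_actor`, `toy_q1`, and `toy_q2` stand in for the target networks, and the defaults `sigma=0.2` and `c=0.5` are assumed values that do not appear in the text above, with `sigma` interpreted as a standard deviation.

```python
import numpy as np

def td3_target(r, s_next, target_actor, target_critics, rng,
               gamma=0.99, sigma=0.2, c=0.5):
    """Return r + gamma * min_i q_i(s', pi'(s') + eps) with clipped Gaussian eps."""
    a_next = target_actor(s_next)
    # Target policy smoothing: perturb the target action by clipped noise.
    eps = np.clip(rng.normal(0.0, sigma, size=np.shape(a_next)), -c, c)
    q_values = [q(s_next, a_next + eps) for q in target_critics]
    # Both critics share this target, built from the smaller of the two estimates.
    return r + gamma * np.minimum(q_values[0], q_values[1])

def toy_actor(s):
    return 0.5 * s

def toy_q1(s, a):
    return s + a

def toy_q2(s, a):
    return s + 2.0 * a

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.0])
next_states = np.array([1.0, -1.0])
y = td3_target(rewards, next_states, toy_actor, [toy_q1, toy_q2], rng)
print(y)
```

Setting `sigma=0.0` switches the smoothing off and leaves only the minimum over the two critics, which makes the function easy to check by hand.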

TD3 – Algorithm
Algorithm 1 TD3
  Initialize critic networks $Q_{\theta_1}$, $Q_{\theta_2}$, and actor network $\pi_\phi$ with random parameters $\theta_1$, $\theta_2$, $\phi$
  Initialize target networks $\theta'_1 \leftarrow \theta_1$, $\theta'_2 \leftarrow \theta_2$, $\phi' \leftarrow \phi$
  Initialize replay buffer $B$
  for t = 1 to T do
    Select action with exploration noise $a \sim \pi_\phi(s) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma)$ and observe reward $r$ and new state $s'$
    Store transition tuple $(s, a, r, s')$ in $B$
    Sample mini-batch of $N$ transitions $(s, a, r, s')$ from $B$
    $\tilde{a} \leftarrow \pi_{\phi'}(s') + \epsilon$, $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)$
    $y \leftarrow r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tilde{a})$
    Update critics $\theta_i \leftarrow \mathrm{argmin}_{\theta_i} N^{-1} \sum (y - Q_{\theta_i}(s, a))^2$
    if t mod d then
      Update $\phi$ by the deterministic policy gradient:
        $\nabla_\phi J(\phi) = N^{-1} \sum \nabla_a Q_{\theta_1}(s, a) \big|_{a = \pi_\phi(s)} \nabla_\phi \pi_\phi(s)$
      Update target networks:
        $\theta'_i \leftarrow \tau \theta_i + (1 - \tau) \theta'_i$
        $\phi' \leftarrow \tau \phi + (1 - \tau) \phi'$
    end if
  end for
Algorithm 1 of the paper "Addressing Function Approximation Error in Actor-Critic Methods" by Scott Fujimoto et al.
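The delayed updates reduce to a simple schedule inside the training loop. The sketch below only records which steps would trigger which update; the function name `td3_update_schedule` is hypothetical, and reading the condition "if t mod d" as "every $d$-th step" (`t % d == 0`) is an assumption.

```python
def td3_update_schedule(total_steps, d=2):
    """Return the steps at which the critics and the actor (with targets) are updated."""
    critic_steps, actor_steps = [], []
    for t in range(1, total_steps + 1):
        critic_steps.append(t)       # both critics are updated at every step
        if t % d == 0:               # assumed reading of "if t mod d"
            actor_steps.append(t)    # actor and all target networks are updated together
    return critic_steps, actor_steps

critic_steps, actor_steps = td3_update_schedule(10, d=2)
print(len(critic_steps), actor_steps)
```

With $d = 2$ the critics receive two updates for every update of the actor and the target networks.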

TD3 – Algorithm
Hyper-parameter            Ours           DDPG
Critic Learning Rate       10^-3          10^-3
Critic Regularization      None           10^-2 · ||θ||^2
Actor Learning Rate        10^-3          10^-4
Actor Regularization       None           None
Optimizer                  Adam           Adam
Target Update Rate (τ)     5 · 10^-3      10^-3
Batch Size                 100            64
Iterations per time step   1              1
Discount Factor            0.99           0.99
Reward Scaling             1.0            1.0
Normalized Observations    False          True
Gradient Clipping          False          False
Exploration Policy         N(0, 0.1)      OU, θ = 0.15, µ = 0, σ = 0.2
Table 3 of the paper "Addressing Function Approximation Error in Actor-Critic Methods" by Scott Fujimoto et al.
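The two exploration policies in the last row of the table differ in whether the noise is correlated over time. The sketch below contrasts them under stated assumptions: the Ornstein-Uhlenbeck (OU) process is discretized with a simple Euler step, the step size `dt=1.0` is a guess that the table does not give, and the 0.1 in $N(0, 0.1)$ is interpreted as a standard deviation.

```python
import numpy as np

def ou_step(x, rng, theta=0.15, mu=0.0, sigma=0.2, dt=1.0):
    """One Euler step of an Ornstein-Uhlenbeck process (dt is an assumed value)."""
    drift = theta * (mu - x) * dt    # pulls the noise back towards mu
    diffusion = sigma * np.sqrt(dt) * rng.normal(size=np.shape(x))
    return x + drift + diffusion

def gaussian_noise(rng, shape, std=0.1):
    """Independent Gaussian noise, assuming 0.1 denotes the standard deviation."""
    return rng.normal(0.0, std, size=shape)

rng = np.random.default_rng(0)
x = np.zeros(1)
ou_trace, gauss_trace = [], []
for _ in range(5):
    x = ou_step(x, rng)              # each sample depends on the previous one
    ou_trace.append(float(x[0]))
    gauss_trace.append(float(gaussian_noise(rng, (1,))[0]))
print(ou_trace, gauss_trace)
```

Each OU sample starts from the previous one, so consecutive perturbations are correlated, whereas the Gaussian samples are drawn independently at every step.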
