Asynchronous RL
Deep Reinforcement Learning and Control, CMU 10703
Katerina Fragkiadaki, Carnegie Mellon School of Computer Science

The non-stationary data problem for Deep RL
Stable training of a neural network function approximator requires gradient updates computed from uncorrelated data; experience collected online is strongly correlated, which causes the (Q) function approximator to oscillate.
Experience replay: tuples are stored in a buffer, then mixed and sampled from. The resulting sampled batches are more stationary than the ones encountered online (without a buffer), and are used to update the weights of the value approximator.
Asynchronous RL stabilizes training without experience buffers! Many parallel workers each explore a different part of the environment, contributing experience tuples; multiple threads increase the diversity of the data.
This asynchronous scheme applies to SARSA, DQN, and advantage actor-critic methods.
The actor-critic trained in such an asynchronous way is known as A3C. Each worker may have a slightly modified (stale) version of the policy/critic; there is no locking.
The actor-critic trained in the corresponding synchronous way is known as A2C: gradients from the workers are averaged and then the central neural net weights are updated, so all workers have the same actor/critic weights.
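A minimal sketch of an A3C-style worker loop, assuming hypothetical helpers (`global_net` with `get_weights`/`apply_gradients`, a local clone with `act()`, an environment factory, and a `compute_gradients` implementing the actor-critic loss); this is an illustration of the scheme, not the course's reference implementation:

```python
import threading

# Sketch of one A3C worker. All helper names below are illustrative assumptions.
def worker(global_net, make_env, t_max=5):
    env = make_env()                     # each worker gets its own environment copy
    local_net = global_net.clone()
    state = env.reset()
    while True:
        # Sync the local copy; it may already be slightly stale when used (no locking).
        local_net.load_weights(global_net.get_weights())
        rollout = []
        for _ in range(t_max):           # collect a short n-step rollout
            action = local_net.act(state)
            state_next, reward, done = env.step(action)
            rollout.append((state, action, reward))
            state = env.reset() if done else state_next
            if done:
                break
        grads = compute_gradients(local_net, rollout)   # actor-critic loss on rollout
        global_net.apply_gradients(grads)               # A3C: applied without locking

# A3C launches several such workers as threads, e.g.:
# for _ in range(8):
#     threading.Thread(target=worker, args=(global_net, make_env)).start()
```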
A3C: what is the approximation used for the advantage?
Given a rollout $s_1, s_2, s_3, s_4$ with rewards $r_1, r_2, r_3$, form n-step returns bootstrapped with the critic $V(\cdot; \theta'_v)$:

$$R_3 = r_3 + \gamma V(s_4; \theta'_v), \qquad R_2 = r_2 + \gamma r_3 + \gamma^2 V(s_4; \theta'_v)$$

and the corresponding advantages:

$$A_3 = R_3 - V(s_3; \theta'_v), \qquad A_2 = R_2 - V(s_2; \theta'_v)$$
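A minimal sketch of computing these n-step returns and advantages (the array layout, with `values` holding one extra bootstrap entry, is an assumption):

```python
# rewards = [r1, ..., rT]; values = [V(s1), ..., V(s_{T+1})], where the
# last entry bootstraps from the state reached after the rollout.
def nstep_advantages(rewards, values, gamma=0.99):
    T = len(rewards)
    returns, advantages = [0.0] * T, [0.0] * T
    R = values[T]                       # bootstrap: V(s_{T+1}; theta'_v)
    for t in reversed(range(T)):        # R_t = r_t + gamma * R_{t+1}
        R = rewards[t] + gamma * R
        returns[t] = R
        advantages[t] = R - values[t]   # A_t = R_t - V(s_t; theta'_v)
    return returns, advantages
```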
Part of the slides borrowed from Xi Chen, Pieter Abbeel, John Schulman.
$$\max_\theta J(\theta) = \max_\theta \mathbb{E}\left[R(\tau) \mid \pi_\theta, \mu_0(s_0)\right]$$

where $\tau$ is a trajectory and $\mu_0(s_0)$ is the initial state distribution.
Writing the trajectory reward as a sum of per-state rewards:

$$\max_\theta J(\theta) = \max_\theta \mathbb{E}\left[R(\tau) \mid \pi_\theta, \mu_0(s_0)\right] = \max_\theta \mathbb{E}\left[\sum_{t=0}^{T} R(s_t) \,\Big|\, \pi_\theta, \mu_0(s_0)\right]$$
Evolutionary methods
$$\max_\theta J(\theta) = \max_\theta \mathbb{E}\left[R(\tau) \mid \pi_\theta, \mu_0(s_0)\right]$$
Evolutionary methods use no information regarding the structure of the reward: the objective is treated as a black box.
General algorithm (see the sketch below): Initialize a population of parameter vectors (genotypes).
1. Make random perturbations (mutations) to each parameter vector.
2. Evaluate each perturbed parameter vector (fitness).
3. Keep a perturbed vector if the result improves (selection).
4. GOTO 1.
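A minimal sketch of this loop, assuming a hypothetical black-box `fitness` function over parameter vectors (e.g., the return of one rollout):

```python
import numpy as np

# Generic evolutionary loop: mutate, evaluate, select.
def evolve(fitness, dim, pop_size=50, sigma=0.1, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    population = [rng.normal(size=dim) for _ in range(pop_size)]
    scores = [fitness(p) for p in population]
    for _ in range(iters):
        for i in range(pop_size):
            mutant = population[i] + sigma * rng.normal(size=dim)  # mutation
            score = fitness(mutant)                                # evaluation
            if score > scores[i]:                                  # selection
                population[i], scores[i] = mutant, score
    return population[int(np.argmax(scores))]
```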
Biologically plausible…
CEM (Cross-Entropy Method):
Initialize $\mu \in \mathbb{R}^d$, $\sigma \in \mathbb{R}^d_{>0}$
for iteration = 1, 2, …
    Sample $n$ parameters $\theta_i \sim N(\mu, \mathrm{diag}(\sigma^2))$
    For each $\theta_i$, perform one rollout to get return $R(\tau_i)$
    Select the top k% of the $\theta_i$, fit a new diagonal Gaussian to those samples, and update $\mu, \sigma$
endfor
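A compact sketch of CEM, assuming a hypothetical `rollout_return(theta)` that runs one episode with policy parameters `theta` and returns its total reward:

```python
import numpy as np

def cem(rollout_return, dim, n=100, top_frac=0.2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        # theta_i ~ N(mu, diag(sigma^2)), sampled via reparameterization
        thetas = mu + sigma * rng.normal(size=(n, dim))
        returns = np.array([rollout_return(t) for t in thetas])
        elite = thetas[np.argsort(returns)[-int(n * top_frac):]]   # top k%
        mu, sigma = elite.mean(axis=0), elite.std(axis=0)          # refit Gaussian
    return mu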
Let's consider our parameters to be sampled from a multivariate isotropic Gaussian. We will evolve this Gaussian towards samples that have the highest fitness.
[NIPS 2013], $\mu \in \mathbb{R}^{22}$
These methods work embarrassingly well in low-dimensional problems.
We are sampling in both cases…
PG: considers a distribution over actions.

$$\max_\theta \; J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[R(\tau)\right]$$

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[R(\tau)\right] = \nabla_\theta \sum_\tau P_\theta(\tau) R(\tau) = \sum_\tau \nabla_\theta P_\theta(\tau) R(\tau) = \sum_\tau P_\theta(\tau) \frac{\nabla_\theta P_\theta(\tau)}{P_\theta(\tau)} R(\tau) = \sum_\tau P_\theta(\tau) \nabla_\theta \log P_\theta(\tau) R(\tau) = \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[\nabla_\theta \log P_\theta(\tau) R(\tau)\right]$$

Sample estimate: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log P_\theta(\tau^{(i)}) R(\tau^{(i)})$
ES: considers a distribution over policy parameters.

$$\max_\mu \; U(\mu) = \mathbb{E}_{\theta \sim P_\mu(\theta)}\left[F(\theta)\right]$$

$$\nabla_\mu U(\mu) = \nabla_\mu \mathbb{E}_{\theta \sim P_\mu(\theta)}\left[F(\theta)\right] = \nabla_\mu \int P_\mu(\theta) F(\theta)\, d\theta = \int \nabla_\mu P_\mu(\theta) F(\theta)\, d\theta = \int P_\mu(\theta) \frac{\nabla_\mu P_\mu(\theta)}{P_\mu(\theta)} F(\theta)\, d\theta = \int P_\mu(\theta) \nabla_\mu \log P_\mu(\theta) F(\theta)\, d\theta = \mathbb{E}_{\theta \sim P_\mu(\theta)}\left[\nabla_\mu \log P_\mu(\theta) F(\theta)\right]$$

Sample estimate: $\nabla_\mu U(\mu) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\mu \log P_\mu(\theta^{(i)}) F(\theta^{(i)})$
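A sketch of this score-function estimator for a Gaussian search distribution $N(\mu, \sigma^2 I)$, where $\nabla_\mu \log P_\mu(\theta) = (\theta - \mu)/\sigma^2$ (the `fitness` name is an assumption):

```python
import numpy as np

# Score-function (ES) gradient estimate for N(mu, sigma^2 I).
def es_gradient(fitness, mu, sigma=0.1, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(mu)
    for _ in range(n_samples):
        theta = mu + sigma * rng.normal(size=mu.shape)   # theta_i ~ N(mu, sigma^2 I)
        grad += fitness(theta) * (theta - mu) / sigma**2 # F(theta_i) * grad log P
    return grad / n_samples
```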
Only the policy terms depend on $\theta$; the dynamics terms drop out:

$$\nabla_\theta \log P(\tau^{(i)}; \theta) = \nabla_\theta \log \prod_{t=0}^{T} \underbrace{P(s^{(i)}_{t+1} \mid s^{(i)}_t, a^{(i)}_t)}_{\text{dynamics}} \cdot \underbrace{\pi_\theta(a^{(i)}_t \mid s^{(i)}_t)}_{\text{policy}} = \nabla_\theta \sum_{t=0}^{T} \Big[\log P(s^{(i)}_{t+1} \mid s^{(i)}_t, a^{(i)}_t) + \log \pi_\theta(a^{(i)}_t \mid s^{(i)}_t)\Big] = \nabla_\theta \sum_{t=0}^{T} \log \pi_\theta(a^{(i)}_t \mid s^{(i)}_t) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a^{(i)}_t \mid s^{(i)}_t)$$
Sample estimate:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log P_\theta(\tau^{(i)}) R(\tau^{(i)})$$

which, using the decomposition above, becomes

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a^{(i)}_t \mid s^{(i)}_t)\, R(s^{(i)}_t, a^{(i)}_t)$$
Suppose $\theta \sim P_\mu(\theta)$ is a Gaussian distribution with mean $\mu$ and covariance matrix $\sigma^2 I$. Then

$$\log P_\mu(\theta) = -\frac{\|\theta - \mu\|^2}{2\sigma^2} + \text{const}, \qquad \nabla_\mu \log P_\mu(\theta) = \frac{\theta - \mu}{\sigma^2}$$

If we draw two parameter samples $\theta_1, \theta_2$ and obtain two trajectories $\tau_1, \tau_2$:

$$\mathbb{E}_{\theta \sim P_\mu(\theta)}\left[\nabla_\mu \log P_\mu(\theta)\, R(\tau)\right] \approx \frac{1}{2}\left[R(\tau_1)\frac{\theta_1 - \mu}{\sigma^2} + R(\tau_2)\frac{\theta_2 - \mu}{\sigma^2}\right]$$
Imagine we have access to random vectors $\epsilon \sim \mathcal{N}(0, I)$. The reparameterized samples

$$\theta_1 = \mu + \sigma \epsilon_1, \qquad \theta_2 = \mu + \sigma \epsilon_2, \qquad \epsilon_1, \epsilon_2 \sim \mathcal{N}(0, I)$$

have the desired mean and variance, and the estimate becomes

$$\mathbb{E}_{\theta \sim P_\mu(\theta)}\left[\nabla_\mu \log P_\mu(\theta)\, R(\tau)\right] \approx \frac{1}{2}\left[R(\tau_1)\frac{\theta_1 - \mu}{\sigma^2} + R(\tau_2)\frac{\theta_2 - \mu}{\sigma^2}\right] = \frac{1}{2\sigma}\left[R(\tau_1)\epsilon_1 + R(\tau_2)\epsilon_2\right]$$
Antithetic sampling:
- Sample a pair of policies with mirrored noise: $\theta_+ = \mu + \epsilon$, $\theta_- = \mu - \epsilon$
- Get a pair of rollouts $(\tau_+, \tau_-)$ from the environment
- SPSA: finite differences with a random direction

$$\nabla_\mu \mathbb{E}\left[R(\tau)\right] \approx \frac{1}{2}\left[R(\tau_+)\frac{\theta_+ - \mu}{\sigma^2} + R(\tau_-)\frac{\theta_- - \mu}{\sigma^2}\right] = \frac{1}{2}\left[R(\tau_+)\frac{\epsilon}{\sigma^2} - R(\tau_-)\frac{\epsilon}{\sigma^2}\right] = \frac{\epsilon}{2\sigma^2}\left[R(\tau_+) - R(\tau_-)\right]$$

vs finite differences: a classic finite-difference scheme perturbs one coordinate at a time, while this estimate differences two returns along a single random direction.
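Under the same assumptions (a hypothetical `rollout_return`, and reparameterized samples $\theta_\pm = \mu \pm \sigma\epsilon$ as on the earlier slide), the antithetic estimator might look like:

```python
import numpy as np

# Antithetic (mirrored-noise) ES gradient estimate: evaluate mu + sigma*eps
# and mu - sigma*eps, then difference the returns along eps.
def antithetic_gradient(rollout_return, mu, sigma=0.1, n_pairs=50, seed=0):
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(mu)
    for _ in range(n_pairs):
        eps = rng.normal(size=mu.shape)
        r_plus = rollout_return(mu + sigma * eps)    # R(tau+)
        r_minus = rollout_return(mu - sigma * eps)   # R(tau-)
        grad += (r_plus - r_minus) * eps / (2 * sigma)
    return grad / n_pairs
```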
Policy gradients add noise $\epsilon$ in the actions ($\epsilon$-greedy exploration)! Evolution strategies add noise $\xi$ in the policy (neural network) parameters!
Open question: should we take policy gradients at the action level or at the parameter level?
Important to remember: both are doing finite differences / random search.
Neural net architectures that work well with (stochastic) gradient descent
[Figure: six workers (Worker 1 … Worker 6) connected in an ALL-REDUCE pattern.]
Used in asynchronous RL! With a standard all-reduce, each worker sends big gradient vectors.
What needs to be sent? $\theta$ and $R(\tau)$? But $\theta$ is big!
In ES, however, $\theta = \mu + \epsilon$ with $\mu$ the same for all workers, so each worker only needs to communicate its return and the seed of the random number generator that produced its $\epsilon$!
Each worker broadcasts tiny scalars. [Salimans, Ho, Chen, Sutskever, 2017]
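A single-process sketch of this seed-sharing trick (all helper names are assumptions; a real implementation runs the workers on separate machines and only ships seeds and scalar returns over the network):

```python
import numpy as np

# Sketch of one distributed-ES update in the style of Salimans et al. (2017).
def distributed_es_step(rollout_return, mu, sigma=0.1, lr=0.01, n_workers=8):
    master = np.random.default_rng()
    seeds = master.integers(0, 2**31, size=n_workers)  # broadcast once to all workers
    returns = []
    for seed in seeds:                                 # in reality: separate machines
        eps = np.random.default_rng(seed).normal(size=mu.shape)
        returns.append(rollout_return(mu + sigma * eps))  # worker sends ONE scalar back
    # Every worker can reconstruct all eps_i locally from the shared seeds:
    grad = np.zeros_like(mu)
    for seed, ret in zip(seeds, returns):
        eps = np.random.default_rng(seed).normal(size=mu.shape)
        grad += ret * eps / sigma                      # score-function term F(theta)*eps/sigma
    return mu + lr * grad / n_workers                  # gradient ascent on the mean
```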
Monte Carlo Policy Gradients (REINFORCE), gradient direction:

$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right], \qquad L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$

How would we implement this in TF or PyTorch? Equivalently (see the sketch below):

$$L^{IS}_{\theta_{old}}(\theta) = \mathbb{E}_{\tau \sim \theta_{old}}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right]$$
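One way this might look in PyTorch, as a minimal sketch (tensor names, shapes, and the surrounding training loop are assumptions):

```python
import torch

# Both surrogates, given log-probs under the current policy, log-probs recorded
# at data-collection time, and advantage estimates. The surrogate is maximized,
# so we return its negative for use with a minimizing optimizer (SGD/Adam).
def pg_loss(logp, advantages):
    # L^PG(theta) = E_t[ log pi_theta(a_t|s_t) * A_t ]
    return -(logp * advantages).mean()

def is_loss(logp, logp_old, advantages):
    # L^IS(theta) = E_t[ (pi_theta / pi_theta_old) * A_t ]; the ratio is
    # exp(logp - logp_old). Its gradient matches pg_loss at theta = theta_old.
    ratio = torch.exp(logp - logp_old.detach())
    return -(ratio * advantages).mean()
```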
Derivation of the likelihood ratio gradient from importance sampling:

$$J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[R(\tau)\right] = \sum_\tau P(\tau \mid \theta) R(\tau) = \sum_\tau P(\tau \mid \theta_{old}) \frac{P(\tau \mid \theta)}{P(\tau \mid \theta_{old})} R(\tau) = \mathbb{E}_{\tau \sim \theta_{old}}\left[\frac{P(\tau \mid \theta)}{P(\tau \mid \theta_{old})} R(\tau)\right]$$

$$\nabla_\theta J(\theta)\Big|_{\theta_{old}} = \mathbb{E}_{\tau \sim \theta_{old}}\left[\frac{\nabla_\theta P(\tau \mid \theta)\big|_{\theta_{old}}}{P(\tau \mid \theta_{old})} R(\tau)\right] = \mathbb{E}_{\tau \sim \theta_{old}}\left[\nabla_\theta \log P(\tau \mid \theta)\big|_{\theta_{old}}\, R(\tau)\right]$$

So differentiating $\mathbb{E}_{\tau \sim \theta_{old}}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right]$ at $\theta = \theta_{old}$ recovers the policy gradient.
Next: instead of leaving the step size to some heuristically chosen learning-rate schedule, make choosing it part of the algorithm!
Step sizes are hard to choose in RL:
- Input data is nonstationary due to the changing policy: the observation and reward distributions change.
- A bad step is more damaging than in supervised learning, since it affects the visitation distribution: a bad policy means data is then collected under that bad policy, and we may not be able to recover (in SL, the data does not depend on the network weights).
- Experience is not used efficiently (in SL, data can be trivially re-used).
Policy gradients:

$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right], \qquad \theta_{new} = \theta_{old} + \epsilon \cdot \hat{g}$$

We can differentiate the following loss:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$
SGD: take a step in direction $d$, where $d$ stays within a ball of radius $\epsilon$; the ball is measured in Euclidean distance in parameter space.
Imagine we are trying to minimize the negative log-likelihood and want to update the parameters of the distribution. The same distance in parameter space can result in very different distances in distribution space. Ideally, we want to take steps based on distance in distribution space: that is far more robust and needs much less tweaking of the step size. So what metric shall we use?
Instead, we will use the Kullback-Leibler divergence (KL) to measure the distance between the distributions before and after the update. This yields an unconstrained penalized objective, which we approximate with a first-order Taylor expansion of the loss and a second-order expansion of the KL, as sketched below.
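In standard notation (a reconstruction; the slide's exact symbols are an assumption), the penalized objective and its local expansion around $\theta_{old}$ are:

$$\max_\theta \; L_{\theta_{old}}(\theta) - \beta\, \mathrm{KL}\left[\pi_{\theta_{old}} \,\|\, \pi_\theta\right]$$

$$L_{\theta_{old}}(\theta) \approx g^\top (\theta - \theta_{old}), \qquad \mathrm{KL}\left[\pi_{\theta_{old}} \,\|\, \pi_\theta\right] \approx \tfrac{1}{2} (\theta - \theta_{old})^\top F\, (\theta - \theta_{old})$$

where $g$ is the policy gradient at $\theta_{old}$ and $F$ is the Fisher information matrix.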
The natural gradient (vs. Newton's method):
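Maximizing the expansion above gives the natural gradient update (a standard reconstruction, assuming the usual definitions):

$$\theta_{new} = \theta_{old} + \alpha\, F^{-1} g, \qquad F = \mathbb{E}_{\pi_{\theta_{old}}}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^\top\right]$$

Newton's method has the same form but preconditions with the Hessian of the loss instead of the Fisher matrix $F$.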