  1. Distributed RL Richard Liaw, Eric Liang

  2. Common Computational Patterns for RL. How can we better utilize our computational resources to accelerate RL progress? (Diagram: the original pattern alternates Simulation and Optimization; the batch-optimization pattern runs many Simulations before each Optimization step.)

  3. History of large scale distributed RL
     - 2013: DQN, "Playing Atari with Deep Reinforcement Learning" (Mnih 2013)
     - 2015: GORILA, "Massively Parallel Methods for Deep Reinforcement Learning" (Nair 2015)
     - 2016: A3C, "Asynchronous Methods for Deep Reinforcement Learning" (Mnih 2016)
     - 2018: Ape-X, "Distributed Prioritized Experience Replay" (Horgan 2018)
     - 2018: IMPALA, "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" (Espeholt 2018)
     - ?: what comes next?

  4. 2013/2015: DQN
     for i in range(T):
         s, a, s_1, r = evaluate()
         replay.store((s, a, s_1, r))
         minibatch = replay.sample()
         q_network.update(minibatch)
         if should_update_target():
             target_net.sync_with(q_network)   # copy the online weights into the target network
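     Expanded into a runnable form, the loop might look like the following sketch. It is a minimal illustration, not the authors' code: a tabular Q-function on a toy chain MDP stands in for the deep network, and the environment, replay buffer, and sync schedule are hypothetical stand-ins.

        import random
        from collections import deque
        import numpy as np

        N_STATES, N_ACTIONS = 5, 2
        GAMMA, LR, EPS = 0.9, 0.1, 0.1

        def env_step(s, a):
            # Toy chain MDP: action 1 moves right, action 0 resets; reward at the far end.
            s1 = min(s + 1, N_STATES - 1) if a == 1 else 0
            return s1, (1.0 if s1 == N_STATES - 1 else 0.0)

        replay = deque(maxlen=10_000)            # experience replay buffer
        q_net = np.zeros((N_STATES, N_ACTIONS))  # "online" Q-function
        target_net = q_net.copy()                # periodically synced target

        s = 0
        for t in range(5000):
            # evaluate(): epsilon-greedy action selection in the environment
            a = random.randrange(N_ACTIONS) if random.random() < EPS else int(np.argmax(q_net[s]))
            s1, r = env_step(s, a)
            replay.append((s, a, s1, r))         # replay.store(...)

            # sample a minibatch and move the online estimate toward the target
            for bs, ba, bs1, br in random.sample(list(replay), min(32, len(replay))):
                td_target = br + GAMMA * np.max(target_net[bs1])
                q_net[bs, ba] += LR * (td_target - q_net[bs, ba])

            if t % 200 == 0:                     # should_update_target()
                target_net = q_net.copy()        # sync the target with the online net
            s = 0 if s1 == N_STATES - 1 else s1

        print("greedy action per state:", np.argmax(q_net, axis=1))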

  5. 2015: General Reinforcement Learning Architecture (GORILA)

  6. GORILA Performance

  7. 2016: Asynchronous Advantage Actor Critic (A3C)
     # Each worker:
     while True:
         sync_weights_from_master()
         samples = [collect_sample_from_env() for _ in range(5)]
         grad = compute_grad(samples)
         async_send_grad_to_master(grad)   # sends gradients back to the master
     Each worker has a different exploration policy -> more diverse samples!
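     The asynchronous pattern can be illustrated with plain Python threads. This is a hedged sketch, not the A3C implementation: the linear "policy", the stand-in gradient, and the update rule are placeholders; only the sync / compute / send structure mirrors the pseudocode above.

        import threading
        import numpy as np

        master_weights = np.zeros(4)             # shared parameters held by the master
        lock = threading.Lock()

        def compute_grad(weights, rollout):
            # Stand-in gradient: pull the weights toward the rollout statistics.
            return rollout.mean(axis=0) - weights

        def worker(seed):
            rng = np.random.default_rng(seed)    # different exploration per worker
            for _ in range(200):
                with lock:
                    weights = master_weights.copy()          # sync_weights_from_master()
                rollout = rng.normal(loc=1.0, size=(5, 4))   # roughly 5 env steps per update
                grad = compute_grad(weights, rollout)
                with lock:                                   # async_send_grad_to_master()
                    master_weights[:] = master_weights + 0.05 * grad

        threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print("master weights after asynchronous updates:", master_weights)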

  8. A3C Performance. Changes relative to GORILA: 1. Faster updates. 2. Removes the replay buffer. 3. Moves to actor-critic (from Q-learning).

  9. Distributed Prioritized Experience Replay (Ape-X). A3C doesn’t scale very well… Ape-X: 1. Distributed DQN/DDPG. 2. Reintroduces replay. 3. Distributed prioritization: unlike Prioritized DQN, initial priorities are not set to “max TD”.
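     The distributed-prioritization idea can be sketched as follows. This is an illustration only, not the Ape-X implementation: the replay buffer class and the stand-in TD-error computation are hypothetical, but the structure shows actors attaching initial |TD error| priorities to their batches before shipping them to the shared replay.

        import numpy as np

        class PrioritizedReplay:
            def __init__(self):
                self.items, self.priorities = [], []

            def add_batch(self, batch, priorities):   # priorities supplied by the actor
                self.items.extend(batch)
                self.priorities.extend(priorities)

            def sample(self, k, alpha=0.6):
                p = np.asarray(self.priorities) ** alpha
                p /= p.sum()
                idx = np.random.choice(len(self.items), size=k, p=p)
                return [self.items[i] for i in idx]

        def actor_generate(n, gamma=0.99):
            # Pretend actor: returns transitions plus their initial TD-error priorities,
            # computed locally rather than defaulting to "max priority".
            batch, priorities = [], []
            for _ in range(n):
                r, q_sa, q_next = np.random.rand(3)        # stand-ins for real estimates
                td_error = abs(r + gamma * q_next - q_sa)  # priority = |TD error|
                batch.append((r, q_sa, q_next))
                priorities.append(td_error + 1e-6)
            return batch, priorities

        replay = PrioritizedReplay()
        for _ in range(4):                                 # several actors contribute batches
            batch, priorities = actor_generate(100)
            replay.add_batch(batch, priorities)
        print(len(replay.sample(32)), "transitions sampled by priority")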

  10. Ape-X Performance

  11. Importance Weighted Actor-Learner Architectures (IMPALA) Motivated by progress in distributed deep learning!

  12. How to correct for Policy Lag? Importance Sampling! Given an actor-critic model: 1. Apply importance-sampling to policy gradient 2. Apply importance sampling to critic update
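     A simplified one-step sketch of that correction is below; the full V-trace targets in Espeholt et al. (2018) use products of truncated ratios over multiple steps, so treat this as an illustration of the idea rather than the exact algorithm. All quantities here are random stand-ins.

        import numpy as np

        rng = np.random.default_rng(0)
        T = 8
        pi_a = rng.uniform(0.1, 1.0, T)     # pi(a_t | s_t) under the learner's current policy
        mu_a = rng.uniform(0.1, 1.0, T)     # mu(a_t | s_t) under the actor's (lagging) policy
        rewards = rng.normal(size=T)
        values = rng.normal(size=T + 1)     # critic estimates V(s_0), ..., V(s_T)
        gamma, rho_bar = 0.99, 1.0

        rho = np.minimum(pi_a / mu_a, rho_bar)   # truncated importance ratios

        # Importance-weighted one-step TD errors: used to correct both the
        # critic targets and the advantage signal in the policy gradient.
        td = rho * (rewards + gamma * values[1:] - values[:-1])

        policy_grad_weight = td                  # scales each grad log pi(a_t | s_t) term
        critic_target = values[:-1] + td         # corrected value-function targets
        print(policy_grad_weight.round(3), critic_target.round(3))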

  13. IMPALA Performance

  14. Other interesting distributed architectures

  15. AlphaZero Each model trained on 64 GPUs and 19 parameter servers!

  16. Evolution Strategies

  17. RLlib: Abstractions for Distributed Reinforcement Learning (ICML'18). Eric Liang*, Richard Liaw*, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica. http://rllib.io

  18. RL research scales with compute. (Figures courtesy of NVIDIA and OpenAI.)

  19. How do we leverage this hardware? (Figures: (a) supervised learning, (b) reinforcement learning.) Can we build scalable abstractions for RL?

  20. Systems for RL today
     • Many implementations (7000+ repos on GitHub!) – how general are they (and do they scale)?
       PPO: multiprocessing, MPI
       AlphaZero: custom systems
       Evolution Strategies: Redis
       IMPALA: Distributed TensorFlow
       A3C: shared memory, multiprocessing, TF
     • Huge variety of algorithms and distributed systems used to implement them, but little reuse of components

  21. Challenges to reuse: 1. Wide range of physical execution strategies for one "algorithm". (Diagram: strategies span synchronous vs. asynchronous, CPU vs. GPU, single-node vs. cluster, send experiences vs. send gradients, multiprocessing / MPI / param-server.)

  22. Challenges to reuse: 2. Tight coupling with deep learning frameworks. Different parallelism paradigms: Distributed TensorFlow vs. TensorFlow + MPI?

  23. Challenges to reuse: 3. Large variety of algorithms with different structures.

  24. We need abstractions for RL. Good abstractions decompose RL algorithms into reusable components. Goals:
     – Code reuse across deep learning frameworks
     – Scalable execution of algorithms
     – Easily compare and reproduce algorithms

  25. Structure of RL computations. (Diagram: the agent's policy maps state (observation) s_i to action a_{i+1}; the environment returns the next state and reward r_i.)

  26. Structure of RL computations. (Diagram: the agent splits into policy evaluation, which runs the policy (state -> action) against the environment to produce a trajectory X: s_0, (s_1, r_1), ..., (s_n, r_n), and policy improvement, e.g., SGD, which updates the policy from those trajectories.)
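     As a concrete illustration of this loop, a minimal rollout might look like the sketch below. It assumes the classic OpenAI Gym API (before gym 0.26 changed the reset/step signatures), and the random policy is a placeholder for a learned state -> action mapping.

        import gym

        env = gym.make("CartPole-v1")

        def policy(state):
            return env.action_space.sample()   # stand-in for a learned state -> action map

        trajectory = []                        # X: s_0, (s_1, r_1), ..., (s_n, r_n)
        state, done = env.reset(), False       # classic Gym API: reset() returns obs
        while not done:
            action = policy(state)
            next_state, reward, done, info = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        print("collected a trajectory of length", len(trajectory))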

  27. Many RL loop decompositions. (Diagram: Async DQN (Mnih et al., 2016): actor-learners pull weights from a parameter server, roll out, and send gradients back: θ <- sync(); X <- rollout(); dθ <- grad(L, X); sync(dθ). Ape-X DQN (Horgan et al., 2018): many actors roll out into a shared replay buffer, and a central learner samples from it: X <- replay(); apply(grad(L, X)).)

  28. Common components. (Diagram: Async DQN (Mnih et al., 2016) and Ape-X DQN (Horgan et al., 2018) are built from the same pieces: a policy π_θ(o_t), a trajectory postprocessor ρ_θ(X), and a loss L(θ,X), wired together by actors, learners, parameter servers, and replay buffers.)

  30. Structural differences
     Async DQN (Mnih et al., 2016): asynchronous optimization, replicated workers, single machine.
     Ape-X DQN (Horgan et al., 2018): central learner, data queues between components, large replay buffers, scales to clusters.
     ...and this is just one family! Population-Based Training (Jaderberg et al., 2017) adds nested parallel computations and control decisions based on intermediate results.
     ➝ No existing system can effectively meet all the varied demands of RL workloads.

  31. Requirements for a new system. Goal: capture a broad range of RL workloads with high performance and substantial code reuse.
     1. Support stateful computations - e.g., simulators, neural nets, replay buffers; big data frameworks, e.g., Spark, are typically stateless
     2. Support asynchrony - difficult to express in MPI, esp. nested parallelism
     3. Allow easy composition of (distributed) components

  32. Ray System Substrate
     • RLlib builds on Ray to provide higher-level RL abstractions
     • Hierarchical parallel task model with stateful workers – flexible enough to capture a broad range of RL workloads (vs. specialized systems)
     (Diagram: the hierarchical task model covers the space of execution strategies: synchronous vs. asynchronous, CPU vs. GPU, single-node vs. cluster, send experiences vs. send gradients, multiprocessing / MPI / param-server.)

  33. Hierarchical Parallel Task Model
     1. Create Python class instances in the cluster (stateful workers)
     2. Schedule short-running tasks onto workers
     Challenge: high performance (1e6+ tasks/s, ~200us task overhead).
     (Diagram: a top-level worker process delegates tasks such as "collect experiences", "do model-based rollouts", and "run K steps of training" to sub-worker and sub-sub-worker processes, which exchange weight shards and allreduce their gradients through the Ray object store across the Ray cluster.)
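     A small Ray sketch of this pattern follows. The Ray calls (ray.init, @ray.remote, .remote(), ray.get) are the real API; the Coordinator and RolloutWorker classes and their rollout logic are hypothetical stand-ins for illustration.

        import ray

        ray.init()

        @ray.remote
        class RolloutWorker:
            # Stateful sub-worker: keeps its environment/model state between tasks.
            def __init__(self, worker_id):
                self.worker_id = worker_id
                self.steps = 0

            def collect_experiences(self, n):
                self.steps += n               # short-running task executed on this worker
                return {"worker": self.worker_id, "samples": n}

        @ray.remote
        class Coordinator:
            # Top-level worker that owns a pool of sub-workers.
            def __init__(self, num_workers):
                self.workers = [RolloutWorker.remote(i) for i in range(num_workers)]

            def training_step(self):
                # Schedule tasks onto the sub-workers and gather their results.
                futures = [w.collect_experiences.remote(100) for w in self.workers]
                return ray.get(futures)

        coordinator = Coordinator.remote(4)
        print(ray.get(coordinator.training_step.remote()))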

  34. A unifying system enables RL abstractions. (Diagram: on top of the hierarchical task model sit two abstractions: the Policy Optimizer abstraction (SyncSamples, SyncReplay, AsyncGradients, AsyncSamples, MultiGPU, ...) and the Policy Graph abstraction {π_θ, ρ_θ, L(θ,X)}; example policy graphs: {Q-func, n-step, Q-loss}, {LSTM, adv. calc, PG loss}.)
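     To make the two abstractions concrete, here is an illustrative sketch. The interface names are paraphrased from the RLlib paper rather than copied from the library, and the tiny policy, worker, and gradient logic are toy stand-ins; only the division of labor (policy graph = {π_θ, ρ_θ, L}; policy optimizer = how sampling and updates are distributed) mirrors the slide.

        import numpy as np

        class PGPolicyGraph:
            def __init__(self, n_actions=2):
                self.theta = np.zeros(n_actions)

            def compute_actions(self, obs_batch):                  # pi_theta(o_t)
                return [int(np.argmax(self.theta)) for _ in obs_batch]

            def postprocess_trajectory(self, batch):               # rho_theta(X), e.g. return calc
                returns = sum(r for _, _, r in batch)
                return [(o, a, returns) for o, a, r in batch]

            def grad(self, batch):                                 # gradient of L(theta, X)
                g = np.zeros_like(self.theta)
                for _, a, adv in batch:
                    g[a] += adv                                    # toy score-function term
                return g

            def apply_gradients(self, g, lr=0.01):
                self.theta += lr * g

        class RolloutWorker:
            def __init__(self, policy):
                self.policy = policy

            def sample(self, n=5):                                 # pretend environment rollout
                obs = [0] * n
                acts = self.policy.compute_actions(obs)
                return [(o, a, np.random.rand()) for o, a in zip(obs, acts)]

        class SyncSamplesOptimizer:
            # Simplest optimizer: synchronously gather samples, then one central update.
            def __init__(self, policy, workers):
                self.policy, self.workers = policy, workers

            def step(self):
                batches = [self.policy.postprocess_trajectory(w.sample()) for w in self.workers]
                batch = [x for b in batches for x in b]
                self.policy.apply_gradients(self.policy.grad(batch))

        policy = PGPolicyGraph()
        optimizer = SyncSamplesOptimizer(policy, [RolloutWorker(policy) for _ in range(2)])
        for _ in range(10):
            optimizer.step()
        print("theta after 10 sync steps:", policy.theta)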

  35. RLlib Abstractions in Action. (Table: policy graphs crossed with policy optimizers (SyncSamples, SyncReplay, AsyncGradients, AsyncSamples, MultiGPU, ...). The {Q-func, n-step, Q-loss} graph yields DQN (2015), Async DQN (2016), and Ape-X (2018); the {LSTM, adv. calc, PG loss} graph yields Policy Gradient (2000), then A2C/A3C (2016) by adding an actor-critic loss and GAE, PPO (2017) and a GPU-optimized PPO by adding a clipped objective, and IMPALA (2018) by adding V-trace.)

  36. RLlib Reference Algorithms
     • High-throughput architectures
       – Distributed Prioritized Experience Replay (Ape-X)
       – Importance Weighted Actor-Learner Architecture (IMPALA)
     • Gradient-based
       – Advantage Actor-Critic (A2C, A3C)
       – Deep Deterministic Policy Gradients (DDPG)
       – Deep Q Networks (DQN, Rainbow)
       – Policy Gradients
       – Proximal Policy Optimization (PPO)
     • Derivative-free (community contributions)
       – Augmented Random Search (ARS)
       – Evolution Strategies

  37. RLlib Reference Algorithms: 1 GPU + 64 vCPUs (large single machine)

  38. Scale your algorithms with RLlib. Beyond being a "collection of algorithms", RLlib's abstractions let you easily implement and scale new algorithms (multi-agent, novel losses, architectures, etc.).

  39. Code example: training PPO
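     The original listing on this slide was an image; a script of the same shape might look like the sketch below. It is hedged: ray.init, tune.run, and the "PPO" trainable are real Ray/RLlib entry points, but the exact config and stop keys shown reflect the RLlib API of roughly this era and can differ across Ray versions.

        import ray
        from ray import tune

        ray.init()
        tune.run(
            "PPO",                                   # RLlib's registered PPO trainable
            stop={"episode_reward_mean": 190},       # stop once CartPole is roughly solved
            config={
                "env": "CartPole-v0",
                "num_workers": 4,                    # parallel rollout workers
                "train_batch_size": 4000,
            },
        )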

  40. Code example: multi-agent RL
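     Again, the original listing was an image; the sketch below shows the general shape of a multi-agent setup. Treat it as an assumption-laden illustration: the "multiagent" config block with "policies" and "policy_mapping_fn" follows the era's RLlib API from memory, and "multi_agent_cartpole" stands in for an environment you would have to register yourself.

        import ray
        from ray import tune
        from gym.spaces import Box, Discrete

        obs_space = Box(low=-10.0, high=10.0, shape=(4,))
        act_space = Discrete(2)

        config = {
            "env": "multi_agent_cartpole",           # assumed: a registered multi-agent env
            "multiagent": {
                # Two independently trained policies sharing the same spaces.
                "policies": {
                    "ppo_policy_0": (None, obs_space, act_space, {}),
                    "ppo_policy_1": (None, obs_space, act_space, {}),
                },
                # Route each (integer) agent id to one of the policies.
                "policy_mapping_fn": lambda agent_id: f"ppo_policy_{agent_id % 2}",
            },
        }

        ray.init()
        tune.run("PPO", stop={"training_iteration": 20}, config=config)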

  41. Code example: hyperparam tuning

  42. Code example: hyperparam tuning
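     A sketch of what the tuning example might look like: tune.grid_search is the real Tune API, and "lr" / "num_sgd_iter" are standard PPO config keys in RLlib, but the exact keys and values here are illustrative assumptions rather than the slide's original code.

        import ray
        from ray import tune

        ray.init()
        tune.run(
            "PPO",
            stop={"training_iteration": 50},
            config={
                "env": "CartPole-v0",
                "num_workers": 2,
                # Tune expands each grid_search into one trial per combination.
                "lr": tune.grid_search([5e-5, 5e-4, 5e-3]),
                "num_sgd_iter": tune.grid_search([10, 30]),
            },
        )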

  43. Summary: Ray and RLlib address the challenges of providing scalable abstractions for reinforcement learning. RLlib is open source and available at http://rllib.io. Thanks!
