Distributed RL
Richard Liaw, Eric Liang
Common Computational Patterns for RL

(Figure: RL alternates between Simulation and Optimization phases; the original pattern runs them as batch optimization.)

How can we better utilize our computational resources to accelerate RL progress?
History of large-scale distributed RL

2013: DQN - Playing Atari with Deep Reinforcement Learning (Mnih 2013)
2015: GORILA - Massively Parallel Methods for Deep Reinforcement Learning (Nair 2015)
2016: A3C - Asynchronous Methods for Deep Reinforcement Learning (Mnih 2016)
2018: Ape-X - Distributed Prioritized Experience Replay (Horgan 2018)
2018: IMPALA - IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures (Espeholt 2018)
?
2013/2015: DQN
    for i in range(T):
        s, a, s_1, r = evaluate()
        replay.store((s, a, s_1, r))
        minibatch = replay.sample()
        q_network.update(minibatch)
        if should_update_target():
            target_net.sync_with(q_network)  # periodically copy online weights into the target net
2015: General Reinforcement Learning Architecture (GORILA)
GORILA Performance
2016: Asynchronous Advantage Actor Critic (A3C)
    # Each worker:
    while True:
        sync_weights_from_master()
        samples = [collect_sample_from_env() for _ in range(5)]
        grad = compute_grad(samples)
        async_send_grad_to_master(grad)  # sends gradients back to the master

Each worker has a different exploration policy -> more diverse samples!
A3C Performance
Changes relative to GORILA:
1. Faster updates
2. Removes the replay buffer
3. Moves to Actor-Critic (from Q-learning)
Distributed Prioritized Experience Replay (Ape-X)
A3C doesn't scale very well... Ape-X:
1. Distributed DQN/DDPG
2. Reintroduces the replay buffer
3. Distributed prioritization: unlike Prioritized DQN, initial priorities are not set to "max TD" (a sketch of this follows below)
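To make point 3 concrete, here is a minimal, self-contained sketch of the idea (the buffer class and all names are illustrative, not Ape-X's actual implementation): actors compute |TD error| locally and attach it as the initial priority, instead of inserting with the buffer-wide max priority as single-process Prioritized DQN does.

    import numpy as np

    class ToyPrioritizedReplay:
        """Proportional prioritized replay (no capacity eviction, for brevity)."""

        def __init__(self, alpha=0.6):
            self.alpha = alpha
            self.items = []
            self.prios = []

        def add(self, transitions, td_errors, eps=1e-3):
            # Ape-X idea: the actor ships locally computed TD errors with the
            # batch, so new samples get meaningful priorities immediately.
            for t, td in zip(transitions, td_errors):
                self.items.append(t)
                self.prios.append((abs(td) + eps) ** self.alpha)

        def sample(self, k):
            p = np.asarray(self.prios)
            p = p / p.sum()
            idx = np.random.choice(len(self.items), size=k, p=p)
            return [self.items[i] for i in idx]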
Ape-X Performance
Importance Weighted Actor-Learner Architectures (IMPALA)
Motivated by progress in distributed deep learning!
How to correct for policy lag? Importance sampling!

Given an actor-critic model:
1. Apply importance sampling to the policy gradient
2. Apply importance sampling to the critic update (see the V-trace sketch below)
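As a hedged sketch of point 2, here is the V-trace correction from the IMPALA paper in miniature (function and argument names are illustrative): truncated importance ratios correct the critic's value targets for the lag between the actors' behaviour policy mu and the learner's target policy pi.

    import numpy as np

    def vtrace_targets(rewards, values, bootstrap_value, is_ratios,
                       gamma=0.99, rho_bar=1.0, c_bar=1.0):
        """V-trace value targets v_s for one trajectory.

        is_ratios[t] = pi(a_t|s_t) / mu(a_t|s_t): target over behaviour policy.
        """
        rhos = np.minimum(rho_bar, is_ratios)  # truncated weights for the TD terms
        cs = np.minimum(c_bar, is_ratios)      # truncated "trace cutting" weights
        values_ext = np.append(values, bootstrap_value)
        deltas = rhos * (rewards + gamma * values_ext[1:] - values_ext[:-1])
        # Backward recursion: v_s - V(s_s) = delta_s + gamma*c_s*(v_{s+1} - V(s_{s+1}))
        acc = 0.0
        vs_minus_v = np.zeros(len(rewards))
        for t in reversed(range(len(rewards))):
            acc = deltas[t] + gamma * cs[t] * acc
            vs_minus_v[t] = acc
        return values + vs_minus_v

The policy gradient is then weighted by the same truncated ratio, with the V-trace targets supplying the advantage estimates.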
IMPALA Performance
Other interesting distributed architectures
AlphaZero: each model trained on 64 GPUs and 19 parameter servers!
Evolution Strategies
RLlib: Abstractions for Distributed Reinforcement Learning (ICML'18)
Eric Liang*, Richard Liaw*, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica
RL research scales with compute

(Figures courtesy OpenAI and NVIDIA.)

How do we leverage this hardware? Scalable abstractions for RL?
(Figure: (a) Supervised Learning vs. (b) Reinforcement Learning.)
Systems for RL today
- Many implementations (7000+ repos on GitHub!) - how general are they (and do they scale)?
  - PPO: multiprocessing, MPI
  - AlphaZero: custom systems
  - Evolution Strategies: Redis
  - IMPALA: Distributed TensorFlow
  - A3C: shared memory, multiprocessing, TF
- Huge variety of algorithms and distributed systems used to implement them, but little reuse of components
Challenges to reuse
1. Wide range of physical execution strategies for one "algorithm":
   single-node vs. cluster, GPU vs. CPU, synchronous vs. asynchronous, send gradients vs. send experiences, MPI vs. multiprocessing vs. param-server
Challenges to reuse
2. Tight coupling with deep learning frameworks
   - Different parallelism paradigms: Distributed TensorFlow vs. TensorFlow + MPI?
Challenges to reuse
3. Large variety of algorithms with different structures
We need abstractions for RL
Good abstractions decompose RL algorithms into reusable components. Goals:
- Code reuse across deep learning frameworks
- Scalable execution of algorithms
- Easily compare and reproduce algorithms
Structure of RL computations
(Figure: the agent-environment loop. The agent's policy maps state s_i (observation) to action a_{i+1}; the environment returns the next state and reward r_i.)
Structure of RL computations
(Figure: the loop split into two components: policy evaluation (state → action), which produces a trajectory X: s0, (s1, r1), ..., (sn, rn), and policy improvement (e.g., SGD), which updates the policy; a toy sketch follows below.)
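A minimal, self-contained toy of this evaluation/improvement loop (everything here is illustrative: a trivial environment and a hill-climbing "improvement" step rather than SGD):

    import random

    def env_step(state, action):
        # Toy environment: reward 1 for action 1, otherwise 0.
        return state + 1, float(action == 1)

    def evaluate(theta, horizon=10):
        # Policy evaluation: roll out the policy to get trajectory X.
        state, traj = 0, []
        for _ in range(horizon):
            action = 1 if random.random() < theta else 0  # stochastic policy
            state, reward = env_step(state, action)
            traj.append((state, reward))
        return traj

    def improve(theta, traj, lr=0.01):
        # Policy improvement: nudge theta toward higher return.
        ret = sum(r for _, r in traj)
        return min(1.0, max(0.0, theta + lr * (ret - 5.0)))

    theta = 0.5
    for _ in range(200):
        theta = improve(theta, evaluate(theta))
    print(theta)  # drifts toward 1.0, the reward-maximizing policy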
Many RL loop decompositions
(Figure: two decompositions. Async DQN (Mnih et al; 2016): replicated actor-learners exchange gradients with a parameter server. Ape-X DQN (Horgan et al; 2018): many actors feed a replay buffer read by a central learner.)

Async DQN (per actor-learner):
    X <- rollout()
    dθ <- grad(L, X)
    sync(dθ)

Ape-X (actors):
    θ <- sync()
    rollout()

Ape-X (learner):
    X <- replay()
    apply(grad(L, X))
Common components
(Figure: the same two architectures annotated with their shared components: Policy πθ(ot), Trajectory postprocessor ρθ(X), and Loss L(θ,X).)
Structural differences
Async DQN (Mnih et al; 2016):
- Asynchronous optimization
- Replicated workers
- Single machine

Ape-X DQN (Horgan et al; 2018):
- Central learner
- Data queues between components
- Large replay buffers
- Scales to clusters

+ Population-Based Training (Jaderberg et al; 2017):
- Nested parallel computations
- Control decisions based on intermediate results

...and this is just one family! ➝ No existing system can effectively meet all the varied demands of RL workloads.
Requirements for a new system
Goal: capture a broad range of RL workloads with high performance and substantial code reuse.

1. Support stateful computations
   - e.g., simulators, neural nets, replay buffers
   - big data frameworks, e.g., Spark, are typically stateless
2. Support asynchrony
   - difficult to express in MPI, especially nested parallelism
3. Allow easy composition of (distributed) components (a small Ray sketch covering all three follows below)
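Here is a minimal Ray sketch touching all three requirements (the buffer and rollout stand-ins are illustrative; the ray calls are the standard task/actor primitives):

    import ray

    ray.init()

    @ray.remote
    class ReplayBuffer:  # 1. stateful computation living in the cluster
        def __init__(self):
            self.storage = []

        def add(self, batch):
            self.storage.extend(batch)

        def size(self):
            return len(self.storage)

    @ray.remote
    def collect_experiences(n):  # 2. asynchronous task
        return list(range(n))    # stand-in for real rollouts

    buffer = ReplayBuffer.remote()
    futures = [collect_experiences.remote(10) for _ in range(4)]  # async fan-out
    for batch in ray.get(futures):   # 3. components compose via futures
        buffer.add.remote(batch)
    print(ray.get(buffer.size.remote()))  # -> 40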
Ray System Substrate
- RLlib builds on Ray to provide higher-level RL abstractions
- Hierarchical parallel task model with stateful workers
  - flexible enough to capture a broad range of RL workloads (vs. specialized systems): single-node or cluster, GPU or CPU, synchronous or asynchronous, send gradients or send experiences, MPI / multiprocessing / param-server
Hierarchical Parallel Task Model

1. Create Python class instances in the cluster (stateful workers)
2. Schedule short-running tasks onto workers
   - Challenge: high performance: 1e6+ tasks/s, ~200us task overhead

(Figure: a top-level worker (Python process) tells sub-workers to "run K steps of training", "collect experiences", or "allreduce your gradients"; sub-workers spawn sub-sub-worker processes to "do model-based rollouts", and workers exchange weight shards through the Ray object store. A sketch of this nesting follows below.)
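A hedged sketch of this nesting using plain Ray tasks (all task names and payloads are illustrative): the top-level task fans out to sub-workers, which themselves launch nested tasks.

    import ray

    ray.init()

    @ray.remote
    def model_based_rollout(seed):  # sub-sub-worker task
        return [seed, seed + 1]     # stand-in for a simulated trajectory

    @ray.remote
    def collect_experiences(worker_id):  # sub-worker task that nests further tasks
        rollouts = ray.get([model_based_rollout.remote(worker_id * 10 + i)
                            for i in range(2)])
        return [step for r in rollouts for step in r]

    @ray.remote
    def run_training_step(num_workers):  # top-level task
        batches = ray.get([collect_experiences.remote(i)
                           for i in range(num_workers)])
        return sum(len(b) for b in batches)  # stand-in for a gradient update

    print(ray.get(run_training_step.remote(3)))  # -> 12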
Unifying system enables RL Abstractions
Policy Optimizer Abstraction: SyncSamples, AsyncSamples, AsyncGradients, SyncReplay, MultiGPU, ...

Policy Graph Abstraction: {πθ, ρθ, L(θ,X)} (a sketch follows below)
Examples: {Q-func, n-step, Q-loss}, {LSTM, adv. calc, PG loss}
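As a hedged illustration of the policy graph abstraction (method names mirror the {πθ, ρθ, L} decomposition above and are illustrative, not necessarily the exact RLlib API):

    class PolicyGraph:
        """One algorithm's logic, independent of how it is distributed."""

        def compute_actions(self, obs_batch):
            # pi_theta(o_t): map a batch of observations to actions.
            raise NotImplementedError

        def postprocess_trajectory(self, sample_batch):
            # rho_theta(X): e.g., advantage calculation or n-step returns.
            return sample_batch

        def loss(self, sample_batch):
            # L(theta, X): scalar loss the policy optimizer minimizes.
            raise NotImplementedError

A policy optimizer (SyncSamples, AsyncGradients, SyncReplay, ...) then decides where and when these methods execute across the cluster.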
RLlib Abstractions in Action
(Figure: mixing and matching policy optimizers (SyncSamples, AsyncSamples, AsyncGradients, SyncReplay, MultiGPU, ...) with policy graphs ({Q-func, n-step, Q-loss}, {LSTM, adv. calc, PG loss}) reproduces DQN (2015), Async DQN (2016), Ape-X (2018), Policy Gradient (2000), A2C (2016; +actor-critic loss, GAE), A3C (2016), PPO (2017; +clipped obj.), GPU-optimized PPO, and IMPALA (2018; +V-trace).)
RLlib Reference Algorithms
- High-throughput architectures
  - Distributed Prioritized Experience Replay (Ape-X)
  - Importance Weighted Actor-Learner Architecture (IMPALA)
- Gradient-based
  - Advantage Actor-Critic (A2C, A3C)
  - Deep Deterministic Policy Gradients (DDPG)
  - Deep Q Networks (DQN, Rainbow)
  - Policy Gradients
  - Proximal Policy Optimization (PPO)
- Derivative-free
  - Augmented Random Search (ARS)
  - Evolution Strategies
- Community contributions
RLlib Reference Algorithms
(Figure: performance with 1 GPU + 64 vCPUs, a large single machine.)
Scale your algorithms with RLlib
Beyond a "collection of algorithms", RLlib's abstractions let you easily implement and scale new algorithms (multi-agent, novel losses, architectures, etc.)
Code example: training PPO
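The slide showed a code screenshot; here is a minimal sketch in the spirit of RLlib's Python API at the time (circa Ray 0.6; the PPOAgent class was later renamed PPOTrainer, so exact names vary by version):

    import ray
    from ray.rllib.agents.ppo import PPOAgent

    ray.init()

    # 8 workers collect samples in parallel; the driver runs SGD.
    agent = PPOAgent(env="CartPole-v0", config={"num_workers": 8})
    for i in range(10):
        result = agent.train()  # one round of sampling + optimization
        print(i, result["episode_reward_mean"])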
Code example: multi-agent RL
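Again, the slide showed a screenshot; as a hedged sketch, RLlib's multi-agent support centers on a MultiAgentEnv whose observations, rewards, and dones are dicts keyed by agent id (the toy rock-paper-scissors env below is illustrative; action/observation space declarations are omitted for brevity):

    from ray.rllib.env.multi_agent_env import MultiAgentEnv

    class RockPaperScissors(MultiAgentEnv):
        """Two players act simultaneously each step; actions are 0, 1, 2."""

        def reset(self):
            self.steps = 0
            return {"player_0": 0, "player_1": 0}  # obs dict keyed by agent id

        def step(self, action_dict):
            self.steps += 1
            a0, a1 = action_dict["player_0"], action_dict["player_1"]
            if a0 == a1:
                r0 = r1 = 0.0
            elif (a0 - a1) % 3 == 1:
                r0, r1 = 1.0, -1.0
            else:
                r0, r1 = -1.0, 1.0
            obs = {"player_0": a1, "player_1": a0}  # observe opponent's move
            rewards = {"player_0": r0, "player_1": r1}
            dones = {"__all__": self.steps >= 10}   # "__all__" ends the episode
            return obs, rewards, dones, {}

Each agent id can then be mapped to its own policy (or a shared one) in the trainer's multi-agent config.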
Code example: hyperparam tuning
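The slides showed Tune screenshots; here is a minimal sketch using Tune's grid search (experiment name, stopping criterion, and values are illustrative; tune.run_experiments was the entry point of that era):

    import ray
    from ray import tune

    ray.init()

    tune.run_experiments({
        "ppo_lr_sweep": {
            "run": "PPO",
            "env": "CartPole-v0",
            "stop": {"episode_reward_mean": 195},
            "config": {
                "num_workers": 4,
                # one trial per learning rate
                "lr": tune.grid_search([0.01, 0.001, 0.0001]),
            },
        },
    })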
Summary: Ray and RLlib address the challenges of providing scalable abstractions for reinforcement learning.

RLlib is open source and available at http://rllib.io. Thanks!
Ray distributed execution engine
- Ray provides task-parallel and actor APIs built on dynamic task graphs (sketch below)
- These APIs are used to build distributed applications, libraries, and systems

(Figure: the stack, from Ray's execution model (dynamic task graphs) up through the Ray programming model (task parallelism, actors) to applications such as numerical computation and third-party simulators.)
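A minimal sketch of the two APIs (standard Ray primitives; the function and class are illustrative):

    import ray

    ray.init()

    @ray.remote
    def square(x):  # task-parallel API: stateless remote function
        return x * x

    @ray.remote
    class Counter:  # actor API: stateful remote class
        def __init__(self):
            self.n = 0

        def increment(self):
            self.n += 1
            return self.n

    # Remote calls return futures; together they form a dynamic task graph.
    print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

    counter = Counter.remote()
    print(ray.get([counter.increment.remote() for _ in range(3)]))  # [1, 2, 3]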
Ray distributed scheduler
- Faster than Python multiprocessing on a single node
- Competitive with MPI in many workloads

(Figure: cluster architecture: a driver and workers on each node, each node with a local scheduler and a shared-memory object store, coordinated by a global scheduler.)