
SLIDE 1

Distributed RL

Richard Liaw, Eric Liang

SLIDE 2

Common Computational Patterns for RL

(Figure: computational patterns. "Original": alternating Simulation and Optimization; "Batch Optimization": multiple Simulations feeding each Optimization step.)

How can we better utilize our computational resources to accelerate RL progress?

SLIDE 3

History of large scale distributed RL

2013: DQN
Playing Atari with Deep Reinforcement Learning (Mnih 2013)

2015: GORILA
Massively Parallel Methods for Deep Reinforcement Learning (Nair 2015)

2016: A3C
Asynchronous Methods for Deep Reinforcement Learning (Mnih 2016)

2018: Ape-X
Distributed Prioritized Experience Replay (Horgan 2018)

2018: IMPALA
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures (Espeholt 2018)

?

SLIDE 4

2013/2015: DQN

for i in range(T):
    s, a, s_1, r = evaluate()
    replay.store((s, a, s_1, r))
    minibatch = replay.sample()
    q_network.update(minibatch)
    if should_update_target():
        q_network.sync_with(target_net)

SLIDE 5

2015: General Reinforcement Learning Architecture (GORILA)

SLIDE 6

GORILA Performance

SLIDE 7

2016: Asynchronous Advantage Actor Critic (A3C)

Sends gradients back:

# Each worker:
while True:
    sync_weights_from_master()
    samples = []
    for i in range(5):
        samples.append(collect_sample_from_env())
    grad = compute_grad(samples)
    async_send_grad_to_master(grad)

Each worker has different exploration -> more diverse samples!

SLIDE 8

A3C Performance

Changes to GORILA:

  • 1. Faster updates
  • 2. Removes the replay buffer
  • 3. Moves to Actor-Critic (from Q-learning)

SLIDE 9

Distributed Prioritized Experience Replay (Ape-X)

A3C doesn't scale very well...

Ape-X:
1. Distributed DQN/DDPG
2. Reintroduces the replay buffer
3. Distributed prioritization: unlike Prioritized DQN, initial priorities are not set to "max TD"
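
A rough sketch of point 3 (not code from the slides; local_q, target_q, env, and replay are hypothetical stand-ins): each Ape-X actor evaluates its local network copy on the transitions it generates and ships the resulting TD error along with the experience, so the replay buffer can assign a meaningful initial priority.

import numpy as np

def actor_step(local_q, target_q, env, replay, state, gamma=0.99):
    # Act with the actor's local copy of the Q-network (exploration omitted).
    action = int(np.argmax(local_q(state)))
    next_state, reward, done, _ = env.step(action)

    # Compute the TD error locally, so the transition reaches the replay
    # buffer with a meaningful initial priority rather than a "max TD" default.
    target = reward + (0.0 if done else gamma * float(np.max(target_q(next_state))))
    td_error = target - float(local_q(state)[action])

    replay.add((state, action, reward, next_state, done), priority=abs(td_error))
    return next_state, done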

SLIDE 10

Ape-X Performance

SLIDE 11

Importance Weighted Actor-Learner Architectures (IMPALA)

Motivated by progress in distributed deep learning!

SLIDE 12

How to correct for Policy Lag? Importance Sampling!

Given an actor-critic model:

  • 1. Apply importance sampling to the policy gradient
  • 2. Apply importance sampling to the critic update
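
To sketch the idea (an illustration, not the slides' code): the behaviour policy mu that generated the samples lags behind the learner's policy pi, so each term is weighted by the likelihood ratio pi(a|s)/mu(a|s); IMPALA's V-trace truncates these ratios to keep the variance bounded. The log-probability arrays below are hypothetical inputs.

import numpy as np

def truncated_is_weights(target_logp, behaviour_logp, rho_bar=1.0):
    # rho_t = pi(a_t | s_t) / mu(a_t | s_t), truncated at rho_bar
    # (the V-trace-style cap that keeps the correction from exploding).
    rhos = np.exp(np.asarray(target_logp) - np.asarray(behaviour_logp))
    return np.minimum(rhos, rho_bar)

# The weights then scale both updates, roughly:
#   policy gradient ~ rho_t * advantage_t * grad log pi(a_t | s_t)
#   critic update   ~ rho_t * (r_t + gamma * V(s_{t+1}) - V(s_t))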
SLIDE 13

IMPALA Performance

SLIDE 14

Other interesting distributed architectures

SLIDE 15

AlphaZero

Each model trained on 64 GPUs and 19 parameter servers!

SLIDE 16

Evolution Strategies

SLIDE 17

RLlib: Abstractions for Distributed Reinforcement Learning (ICML'18)


Eric Liang*, Richard Liaw*, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica

SLIDE 18

RL research scales with compute

(Figures courtesy OpenAI and NVIDIA.)

SLIDE 19

How do we leverage this hardware? Do we have scalable abstractions for RL?

(Figure: (a) Supervised Learning vs. (b) Reinforcement Learning system diagrams.)

SLIDE 20

Systems for RL today

  • Many implementations (7000+ repos on GitHub!)

– how general are they (and do they scale)?

PPO: multiprocessing, MPI
AlphaZero: custom systems
Evolution Strategies: Redis
IMPALA: Distributed TensorFlow
A3C: shared memory, multiprocessing, TF

  • Huge variety of algorithms and distributed systems used to implement them, but little reuse of components

SLIDE 21

Challenges to reuse

  • 1. Wide range of physical execution strategies for one "algorithm"

(Figure: the choices include single-node vs. cluster, GPU vs. CPU, synchronous vs. asynchronous, sending gradients vs. sending experiences, and MPI vs. multiprocessing vs. parameter server.)

SLIDE 22

Challenges to reuse

  • 2. Tight coupling with deep learning frameworks


Different parallelism paradigms:
– Distributed TensorFlow vs. TensorFlow + MPI?

SLIDE 23

Challenges to reuse

  • 3. Large variety of algorithms with different structures

SLIDE 24

We need abstractions for RL

Good abstractions decompose RL algorithms into reusable components.

Goals:
– Code reuse across deep learning frameworks
– Scalable execution of algorithms
– Easily compare and reproduce algorithms

SLIDE 25

Structure of RL computations

(Figure: the agent-environment loop. The policy maps state si (observation) to action ai+1; the environment returns reward ri.)

SLIDE 26

Structure of RL computations

(Figure: the same agent-environment loop, decomposed. Policy evaluation (state → action) produces a trajectory X: s0, (s1, r1), …, (sn, rn); policy improvement (e.g., SGD) uses X to update the policy.)
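
To make the decomposition concrete, a minimal sketch (illustrative only; compute_action and update are hypothetical method names):

def train_iteration(policy, env):
    # Policy evaluation: state -> action, producing trajectory X = s0, (s1, r1), ..., (sn, rn).
    trajectory, state, done = [], env.reset(), False
    while not done:
        action = policy.compute_action(state)
        state, reward, done, _ = env.step(action)
        trajectory.append((state, reward))
    # Policy improvement: e.g. one SGD step on a loss computed over the trajectory.
    policy.update(trajectory)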

SLIDE 27

Many RL loop decompositions

(Figure: two decompositions of the RL loop.)

Async DQN (Mnih et al; 2016): actor-learners push gradients to a parameter server:

    X <- rollout()
    dθ <- grad(L, X)
    sync(dθ)

Ape-X DQN (Horgan et al; 2018): actors send experiences to replay; a central learner samples from it:

    θ <- sync(); rollout()            # actor
    X <- replay(); apply(grad(L, X))  # learner

SLIDE 28

Common components

(Figure: Async DQN (Mnih et al; 2016) and Ape-X DQN (Horgan et al; 2018) side by side, annotated with the components they share: the policy πθ(ot), the trajectory postprocessor ρθ(X), and the loss L(θ,X).)
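
These three shared pieces are what RLlib bundles into one per-algorithm object (the policy graph abstraction later in the talk). A minimal sketch of such an interface, with hypothetical names rather than RLlib's actual classes:

class PolicyComponents:
    """One bundle per algorithm: pi_theta, rho_theta, and L(theta, X)."""

    def compute_actions(self, obs_batch):
        # Policy pi_theta(o_t): map a batch of observations to actions.
        raise NotImplementedError

    def postprocess_trajectory(self, trajectory):
        # rho_theta(X): e.g. n-step returns or advantage estimation.
        raise NotImplementedError

    def loss(self, postprocessed_batch):
        # L(theta, X): the scalar objective an optimizer minimizes.
        raise NotImplementedError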


SLIDE 30

Structural differences


Async DQN (Mnih et al; 2016)

  • Asynchronous optimization
  • Replicated workers
  • Single machine

Ape-X DQN (Horgan et al; 2018)

  • Central learner
  • Data queues between components
  • Large replay buffers
  • Scales to clusters

+ Population-Based Training (Jaderberg et al; 2017)

  • Nested parallel computations
  • Control decisions based on intermediate results

...and this is just one family! ➝ No existing system can effectively meet all the varied demands of RL workloads.

SLIDE 31

Requirements for a new system

Goal: Capture a broad range of RL workloads with high performance and substantial code reuse

  • 1. Support stateful computations
    – e.g., simulators, neural nets, replay buffers
    – big data frameworks, e.g., Spark, are typically stateless
  • 2. Support asynchrony
    – difficult to express in MPI, esp. nested parallelism
  • 3. Allow easy composition of (distributed) components

SLIDE 32

Ray System Substrate


Hierarchical Task Model

  • RLlib builds on Ray to provide higher-level RL abstractions
  • Hierarchical parallel task model with stateful workers

– flexible enough to capture a broad range of RL workloads (vs. specialized systems): single-node or cluster, GPU or CPU, synchronous or asynchronous, sending gradients or experiences, MPI / multiprocessing / parameter server

SLIDE 33


Hierarchical Parallel Task Model

  • 1. Create Python class instances in the cluster (stateful workers)
  • 2. Schedule short-running tasks onto workers

– Challenge: high performance: 1e6+ tasks/s, ~200 µs task overhead

(Figure: a Ray cluster. A top-level worker (a Python process) issues tasks such as "run K steps of training" to sub-workers (processes), which issue tasks such as "collect experiences", "do model-based rollouts", or "allreduce your gradients" to sub-sub-worker processes; weight shards are exchanged through the Ray object store.)
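
A small example of this model using Ray's public task/actor API (the worker logic itself is a hypothetical stand-in): a Python class becomes a stateful worker somewhere in the cluster, and short-running tasks are scheduled onto it as remote method calls.

import ray

ray.init()

@ray.remote
class RolloutWorker:
    """Stateful worker: a simulator and policy weights would live here."""

    def __init__(self):
        self.weights = None

    def set_weights(self, weights):
        self.weights = weights

    def collect_experiences(self, num_steps):
        # Hypothetical short-running task returning simulated experience.
        return [("obs", "action", "reward")] * num_steps

# Create stateful workers in the cluster, then schedule short tasks onto them.
workers = [RolloutWorker.remote() for _ in range(4)]
batches = ray.get([w.collect_experiences.remote(100) for w in workers])
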
SLIDE 34

Unifying system enables RL Abstractions

Policy Optimizer Abstraction: SyncSamples, AsyncSamples, AsyncGradients, SyncReplay, MultiGPU, ...

Policy Graph Abstraction: {πθ, ρθ, L(θ,X)}
Examples: {Q-func, n-step, Q-loss}; {LSTM, adv. calc, PG loss}

(Both sit on the hierarchical task model: single-node or cluster, GPU or CPU, synchronous or asynchronous, sending gradients or experiences.)
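
As an illustration of how the two abstractions compose (self-contained, hypothetical stand-ins, not RLlib's actual classes): the same policy object can be paired with interchangeable policy optimizers, and the optimizer alone decides how samples and gradients move around.

class StubWorker:
    def sample(self):
        return [("obs", "act", 1.0)]               # pretend experience batch
    def compute_gradients(self, weights):
        return [0.0]                               # pretend gradient

class StubPolicy:
    def get_weights(self):
        return [0.0]
    def update(self, batch):
        pass                                       # SGD step over collected samples
    def apply_gradients(self, grad):
        pass                                       # apply a remotely computed gradient

class SyncSamplesOptimizer:
    """Collect batches from every worker, then take one local update step."""
    def __init__(self, policy, workers):
        self.policy, self.workers = policy, workers
    def step(self):
        batch = [x for w in self.workers for x in w.sample()]
        self.policy.update(batch)

class AsyncGradientsOptimizer:
    """Workers compute gradients; the learner applies them as they arrive."""
    def __init__(self, policy, workers):
        self.policy, self.workers = policy, workers
    def step(self):
        for w in self.workers:                     # asynchronous in a real system
            self.policy.apply_gradients(w.compute_gradients(self.policy.get_weights()))

# The same policy works with either optimizer:
SyncSamplesOptimizer(StubPolicy(), [StubWorker()] * 4).step()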

SLIDE 35

RLlib Abstractions in Action

(Figure: algorithms expressed as combinations of Policy Graphs and Policy Optimizers. Policy Optimizers: SyncSamples, AsyncSamples, AsyncGradients, SyncReplay, MultiGPU, ... Policy Graphs: {Q-func, n-step, Q-loss} yields DQN (2015), Async DQN (2016), and Ape-X (2018); {LSTM, adv. calc, PG loss} yields Policy Gradient (2000), A3C (2016), A2C (2016), PPO (2017, +clipped obj.), a GPU-optimized PPO, and IMPALA (2018, +V-trace), with annotations like "+actor-critic loss, GAE" marking the changes between them.)

SLIDE 36

RLlib Reference Algorithms

  • High-throughput architectures
– Distributed Prioritized Experience Replay (Ape-X)
– Importance Weighted Actor-Learner Architecture (IMPALA)

  • Gradient-based
– Advantage Actor-Critic (A2C, A3C)
– Deep Deterministic Policy Gradients (DDPG)
– Deep Q Networks (DQN, Rainbow)
– Policy Gradients
– Proximal Policy Optimization (PPO)

  • Derivative-free
– Augmented Random Search (ARS)
– Evolution Strategies

Community Contributions

SLIDE 37

RLlib Reference Algorithms

(Performance figure: 1 GPU + 64 vCPUs, a large single machine.)

SLIDE 38

Scale your algorithms with RLlib

  • Beyond a "collection of algorithms", RLlib's abstractions let you easily implement and scale new algorithms (multi-agent, novel losses, architectures, etc.)

SLIDE 39

Code example: training PPO
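
A sketch in the spirit of this slide, using RLlib's Python API (class and config names vary across RLlib versions, so treat this as an approximation rather than the slide's exact code):

import ray
from ray.rllib.agents.ppo import PPOTrainer  # named PPOAgent in older RLlib releases

ray.init()

# Train PPO on CartPole; num_workers controls how many rollout workers
# Ray starts for parallel experience collection.
trainer = PPOTrainer(env="CartPole-v0", config={"num_workers": 4})
for _ in range(10):
    result = trainer.train()                  # one training iteration
    print(result["episode_reward_mean"])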

SLIDE 40

Code example: multi-agent RL
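
A sketch of the shape of RLlib's multi-agent configuration (the "multi_cartpole" env name and integer agent ids are hypothetical, and the exact multiagent schema differs across RLlib versions):

from ray import tune

# Two PPO policies share one multi-agent environment; policy_mapping_fn
# decides which agent id trains against which policy.
tune.run(
    "PPO",
    config={
        "env": "multi_cartpole",  # hypothetical registered multi-agent env
        "multiagent": {
            "policies": {"ppo_policy_0", "ppo_policy_1"},
            "policy_mapping_fn": lambda agent_id: "ppo_policy_%d" % (agent_id % 2),
        },
    },
)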

SLIDE 41

Code example: hyperparam tuning
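
A minimal Tune sketch in the same spirit (the specific learning rates are illustrative):

from ray import tune

# Grid-search over learning rates: Tune launches one PPO trial per value
# and schedules the trials across the cluster.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",
        "lr": tune.grid_search([0.01, 0.001, 0.0001]),
    },
)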


SLIDE 43

RLlib is open source and available at http://rllib.io

Thanks!

Summary: Ray and RLlib address the challenges of providing scalable abstractions for reinforcement learning.

SLIDE 44

Ray distributed execution engine


  • Ray provides Task parallel and Actor APIs built on dynamic task graphs
  • These APIs are used to build distributed applications, libraries and systems

(Figure: the Ray stack. Applications such as numerical computation and third-party simulators sit on the Ray programming model (task parallelism, actors), which runs on the Ray execution model (dynamic task graphs).)
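
For reference, the two programming-model primitives mentioned above, using Ray's public API (the function and class bodies are toy examples):

import ray

ray.init()

@ray.remote
def square(x):
    # Task parallelism: a stateless remote function.
    return x * x

@ray.remote
class Counter:
    # Actor: a stateful remote object.
    def __init__(self):
        self.n = 0
    def increment(self):
        self.n += 1
        return self.n

futures = [square.remote(i) for i in range(4)]   # tasks extend the dynamic task graph
counter = Counter.remote()
print(ray.get(futures), ray.get(counter.increment.remote()))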

SLIDE 45

Ray distributed scheduler

  • Faster than Python multiprocessing on a single node
  • Competitive with MPI in many workloads

(Figure: the Ray architecture. Each node runs a driver or workers, an object store, and a local scheduler; the local schedulers coordinate through a global scheduler.)