SLIDE 1

CULE*: GPU ACCELERATED RL
* CUDA Learning Environment

Steven Dalton, Iuri Frosio, Jared Hoberock, Jason Clemons

March 26-29, 2018 | Silicon Valley

[Figure: reinforcement learning over time]

slide-2
SLIDE 2

2

REINFORCEMENT LEARNING

A successful approach…

  • Board games
  • Video games
  • Robotics
  • Finance
  • Automotive
  • ML training – L2L (learn-to-learn)
SLIDE 3

REINFORCEMENT LEARNING

A successful approach… calling for more investigation

  • New RL algorithms
    – Development
    – Debugging / testing
    – Benchmarking
  • Alternative approaches
    – Evolutionary strategies
    – Imitation learning
SLIDE 4

REINFORCEMENT LEARNING

ALE (Arcade Learning Environment)

  • Diverse set of tasks
  • Established benchmark – the MNIST of RL?

SLIDE 5

CULE

CUDA Learning Environment

[Diagram: LEARNING ALGO ↔ CuLE, frame production / consumption rate > 10K frames/s]

Democratize RL: more frames for less money

SLIDE 6

AGENDA

  • RL training: CPU, GPU
  • Limitations
  • CuLE
  • Performance
  • Analysis and new scenarios

SLIDE 7

RL TRAINING

The OpenAI ATARI interface

https://github.com/openai/atari-py (OpenAI gym)
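A minimal sketch of this interface, using the classic (pre-2021) gym step/reset API that atari-py backed at the time; the environment id and the random policy are only for illustration:

    import gym

    # Classic gym Atari loop: reset() returns the first frame,
    # step() returns (observation, reward, done, info).
    env = gym.make('PongNoFrameskip-v4')
    observation = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()                  # random policy
        observation, reward, done, info = env.step(action)  # one emulator step
    env.close()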

SLIDE 8

RL TRAINING

CPU-only training

[Timeline: DQN, A3C, …]

Mnih V. et al., Human-level control through deep reinforcement learning, Nature, 2015
Mnih V. et al., Asynchronous Methods for Deep Reinforcement Learning, ICML, 2016

SLIDE 9

RL TRAINING

Hybrid CPU-GPU training

[Timeline: DQN, A3C, … → GA3C, …]

Babaeizadeh M. et al., Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR, 2017

[Chart: PPS speedup of GA3C over A3C, log scale 1–10,000]
  Small DNN:            4x
  Large DNN, stride 4: 11x
  Large DNN, stride 3: 12x
  Large DNN, stride 2: 20x
  Large DNN, stride 1: 45x

SLIDE 10

RL TRAINING

Clusters

[Timeline: DQN, A3C, … → GA3C, … → Cluster: ES (GA), A3C, A2C, IMPALA, …]

Espeholt L. et al., IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, 2018

SLIDE 11

RL TRAINING

DGX-1

[Timeline: DQN, A3C, … → GA3C, … → Cluster: ES (GA), A3C, A2C, IMPALA, … → DGX-1: policy gradient, Q-value, …]

Stooke A., Abbeel P., Accelerated Methods for Deep Reinforcement Learning, 2018

SLIDE 12

RL TRAINING

Limitations

[Timeline: DQN, A3C, … → GA3C, …]

SLIDE 13

RL TRAINING

Limitations

Stooke A., Abbeel P., Accelerated Methods for Deep Reinforcement Learning, 2018

[Timeline: Cluster: ES (GA), A3C, A2C, IMPALA, … → DGX-1: policy gradient, Q-value, …]

$$$ (scaling out is expensive)

SLIDE 14

CULE

CUDA Learning Environment

[Diagram: LEARNING ALGO ↔ CuLE, frame production / consumption rate > 10K frames/s]

Democratize RL: more frames for less money

SLIDE 15

AGENDA

  • RL training: CPU, GPU
  • Limitations
  • CuLE
  • Performance
  • Analysis and new scenarios

SLIDE 16

RL TRAINING (CPU SIMULATION)

Standard training scenario

[Diagram: training loop data flow: states, rewards; actions (, weights); updates]

SLIDE 17

RL TRAINING (CPU SIMULATION)

Standard training scenario

[Diagram: training loop data flow: states, rewards; actions (, weights); updates]

SLIDE 18

RL TRAINING (CPU SIMULATION)

Standard training scenario

[Diagram: training loop data flow: states, rewards; actions (, weights); updates]

Limited bandwidth

SLIDE 19

RL TRAINING (CPU SIMULATION)

Standard training scenario

[Diagram: training loop data flow: states, rewards; actions (, weights); updates]

Limited bandwidth
Limited number of CPUs, low frames / second
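One way to get a feel for the transfer bottleneck in this diagram is to time a CPU-to-GPU copy of a batch of observations; a hypothetical PyTorch snippet (the batch shape and iteration count are arbitrary assumptions):

    import time
    import torch

    frames = torch.zeros(256, 4, 84, 84)       # a CPU batch of stacked 84x84 frames
    if torch.cuda.is_available():
        start = time.time()
        for _ in range(100):
            frames_gpu = frames.cuda()          # the CPU -> GPU hop in the diagram
        torch.cuda.synchronize()                # wait for the copies to finish
        print('mean copy time: %.5f s' % ((time.time() - start) / 100))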

SLIDE 20

RL TRAINING (CULE)

Porting ATARI to the GPU

[Diagram: training loop data flow: states, rewards; actions (, weights); updates]

SLIDE 21

RL TRAINING (GPU)

1-to-1 mapping of ALEs to threads

ALE: the ATARI simulator
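CuLE's kernels are not shown in this deck, but the one-emulator-per-thread idea can be sketched with numba.cuda; everything below (the toy state update standing in for the Atari emulator, the names, the sizes) is an illustrative assumption, not CuLE's implementation:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def step_envs(states, actions, rewards):
        # One GPU thread advances one emulator instance.
        i = cuda.grid(1)
        if i < states.shape[0]:
            states[i] += actions[i]                     # toy stand-in for an emulator step
            rewards[i] = 1.0 if states[i] > 0 else 0.0  # toy reward

    n_envs = 4096
    states = cuda.to_device(np.zeros(n_envs, dtype=np.float32))
    actions = cuda.to_device(np.random.randint(-1, 2, n_envs).astype(np.float32))
    rewards = cuda.device_array(n_envs, dtype=np.float32)

    threads = 256
    blocks = (n_envs + threads - 1) // threads
    step_envs[blocks, threads](states, actions, rewards)  # all environments step in parallel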

SLIDE 22

AGENDA

  • RL training: CPU, GPU
  • Limitations
  • CuLE
  • Performance
  • Analysis and new scenarios

SLIDE 23

GYM COMPATIBLE (MOSTLY)

AtariPy:

    for agent in range(agents):
        action.cpu()                       # transfer the action to the CPU
        observation, reward, done, info = env.step(action.numpy())  # execute
        observation.cuda()                 # transfer the observation back to the GPU
        reward.cuda()

CuLE:

    # parallel call to all agents
    observations, rewards, dones, infos = env.step(actions)  # execute
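To make the CuLE column concrete, here is a hypothetical, self-contained mock of a CuLE-like batched environment in PyTorch; MockBatchedEnv and all shapes are invented for illustration, but the single step(actions) call over batched device tensors mirrors the interface sketched above:

    import torch

    class MockBatchedEnv:
        # Stand-in for a CuLE-like environment: one step() advances every agent.
        def __init__(self, num_envs, num_actions, device):
            self.num_envs, self.num_actions, self.device = num_envs, num_actions, device

        def step(self, actions):
            # Everything stays on one device; no per-agent CPU round trips.
            observations = torch.rand(self.num_envs, 4, 84, 84, device=self.device)
            rewards = torch.zeros(self.num_envs, device=self.device)
            dones = torch.zeros(self.num_envs, dtype=torch.bool, device=self.device)
            return observations, rewards, dones, {}

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    env = MockBatchedEnv(num_envs=4096, num_actions=4, device=device)
    actions = torch.randint(0, env.num_actions, (env.num_envs,), device=device)
    observations, rewards, dones, infos = env.step(actions)  # one parallel call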

SLIDE 24

FRAMES PER SECOND

Breakout, inference only (no training)

[Chart: frames per second and GPU occupancy for 1, 1024, 4096, and 32768 environments]

SLIDE 25

GYM COMPATIBLE (MOSTLY)

AtariPy:

    for agent in range(agents):
        action.cpu()                       # transfer the action to the CPU
        observation, reward, done, info = env.step(action.numpy())  # execute
        observation.cuda()                 # transfer the observation back to the GPU
        reward.cuda()
    train()

CuLE:

    # parallel call to all agents
    observations, rewards, dones, infos = env.step(actions)  # execute
    train()

SLIDE 26

REINFORCEMENT LEARNING

Breakout – A2C (preliminary result)

SLIDE 27

AGENDA

  • RL training: CPU, GPU
  • Limitations
  • CuLE
  • Performance
  • Analysis and new scenarios

SLIDE 28

TRADE-OFF

Same amount of time: CuLE vs. non-CuLE

Bandwidth vs. Latency

[Chart: frames vs. agents over the same wall-clock time]
  Traditional approach: 10 ~ 100 agents, updates 1-6 (small, frequent)
  CuLE: 1,000 ~ 100,000 agents, update 1 (large, infrequent)
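A back-of-the-envelope reading of this chart (all numbers below are illustrative assumptions, not measurements from the deck): for a fixed total frame rate, more agents means more frames per update but fewer updates per second:

    # Illustrative numbers only; fps_total and rollout_steps are assumptions.
    def updates_and_frames(num_agents, fps_total, rollout_steps=5):
        frames_per_update = num_agents * rollout_steps
        updates_per_second = fps_total / frames_per_update
        return frames_per_update, updates_per_second

    # Traditional CPU setup: few agents, small but frequent updates.
    print(updates_and_frames(num_agents=16, fps_total=2_000))     # (80, 25.0)
    # CuLE-style setup: thousands of agents, large but infrequent updates.
    print(updates_and_frames(num_agents=4096, fps_total=40_000))  # (20480, ~1.95)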

SLIDE 29

TRADE-OFF

Same amount of time: CuLE vs. non-CuLE

Bandwidth vs. Latency

[Chart: frames vs. agents over the same wall-clock time]
  Traditional approach: 10 ~ 100 agents, updates 1-6 (small, frequent)
  CuLE: 1,000 ~ 100,000 agents, updates 1-3 (large, infrequent)

SLIDE 30

TRADE-OFF

Same amount of time: CuLE vs. non-CuLE

Bandwidth vs. Latency

[Chart: frames vs. agents over the same wall-clock time]
  Traditional approach: 10 ~ 100 agents, updates 1-6 (small, frequent)
  CuLE: 1,000 ~ 100,000 agents, updates 1-3 (large, infrequent)

SLIDE 31

GYM COMPATIBLE (MOSTLY)

AtariPy / CuLE:

    while True:
        action.cpu()                       # transfer the action to the CPU
        observation, reward, done, info = env.step(action.numpy())  # execute
        cpu_state = cule.get_state()       # get the emulator state
        train()

CuLE:

    # seed: cule::set_state(gpuState, cpuState)
    env.seed(cpu_state, first_agent=0, last_agent=100)
    # parallel call to all agents
    observations, rewards, dones, infos = env.step(actions)  # execute
    # …

SLIDE 32

SEEDING

Same amount of time: CuLE vs. non-CuLE

Bandwidth vs. Latency

[Chart: frames vs. agents over the same wall-clock time]
  Traditional approach: 10 ~ 100 agents, seeds 1-6
  CuLE: 1,000 ~ 100,000 agents, update 1

SLIDE 33

CONCLUSION

CuLE

  • More frames for less money (democratizing RL)
  • New scenarios
    – How to use large batches?
    – Seeding from the CPU, ES, …
  • Soon released on https://github.com/NVlabs/

SLIDE 34

March 26-29, 2018 | Silicon Valley

THANK YOU

CULE (CUDA LEARNING ENVIRONMENT), SOON RELEASED ON HTTPS://GITHUB.COM/NVLABS/

SLIDE 35

March 26-29, 2018 | Silicon Valley

SLIDE 36

MOTIVATION

Democratizing RL research

[Diagram: CuLE runs from DGX and cluster down to a K80 on Colab (a Jupyter-like environment)]

SLIDE 37

ASYNCHRONOUS UPDATES

GA3C-like updates

[Diagram: asynchronous updates across rollouts R0, R1, R2, R4]

SLIDE 38

EXPERIENCE TRADE-OFF

Bandwidth vs. Latency

[Diagram: two timelines compared]
  High experience volume, low updates per second
  vs.
  Low experience volume, high updates per second

SLIDE 39

GYM COMPATIBLE (MOSTLY)

AtariPy:

    action.cpu()                       # transfer the action to the CPU
    observation, reward, done, info = env.step(action.numpy())  # execute
    observation.cuda()                 # transfer the observation back to the GPU
    reward.cuda()

CuLE:

    observation, reward, done, info = env.step(action)

Note: the `cule::set_state(gpuState, cpuState)` variable contains memory references.

SLIDE 40

SYNCHRONOUS

On-GPU updates

[Diagram: synchronous updates across rollouts R0, R1, R2, R4]

SLIDE 41

RL TRAINING (GPU)

Standard training scenario

[Diagram: CuLE vs. AtariPy]