
CuLE: GPU-Accelerated RL (CuLE: CUDA Learning Environment for Reinforcement Learning)



  1. March 26-29, 2018 | Silicon Valley. CuLE: GPU-Accelerated RL (*CUDA Learning Environment for Reinforcement Learning). Steven Dalton, Iuri Frosio, Jared Hoberock, Jason Clemons.

  2. REINFORCEMENT LEARNING. A successful approach… • Board games • Video games • Robotics • Finance • Automotive • ML training – L2L (learn-to-learn) • …

  3. REINFORCEMENT LEARNING. A successful approach calling for more investigation: • New RL algorithms (development • debugging / testing • benchmarking) • Alternative approaches (evolutionary strategies • imitation learning • …)

  4. REINFORCEMENT LEARNING. ALE (Arcade Learning Environment): • diverse set of tasks • established benchmark – the MNIST of RL?

  5. CULE: CUDA Learning Environment. [Diagram: learning algorithm ↔ CuLE.] Frame production / consumption rate > 10K / s. Democratize RL: more frames for less money.

  6. AGENDA: RL training (CPU, GPU) • Limitations • CuLE • Performance analysis and new scenarios

  7. RL TRAINING. The OpenAI Atari interface: https://github.com/openai/atari-py (OpenAI Gym).
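
     The interface named above is the standard single-environment Gym step loop. A minimal sketch, assuming the classic 4-tuple step() API and an Atari environment id as provided by gym / atari-py around the time of the talk (2018):

       import gym  # 2018-era gym + atari-py, with the 4-tuple step() API

       env = gym.make("BreakoutNoFrameskip-v4")   # one Atari emulator, simulated on the CPU
       observation = env.reset()
       done = False
       while not done:
           action = env.action_space.sample()      # a learned policy would choose the action here
           observation, reward, done, info = env.step(action)   # one CPU emulator step per call
       env.close()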

  8. RL TRAINING. CPU-only training: DQN, A3C, … Mnih V. et al., Human-level control through deep reinforcement learning, Nature, 2015; Mnih V. et al., Asynchronous Methods for Deep Reinforcement Learning, ICML 2016.

  9. RL TRAINING. Hybrid CPU/GPU training: DQN, A3C, GA3C, … [Chart: PPS for A3C vs. GA3C, from a small DNN to large DNNs with stride 4, 3, 2, 1; GA3C speedups of 4x, 11x, 12x, 20x, 45x.] Babaeizadeh M. et al., Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR, 2017.

  10. RL TRAINING. Clusters: ES (GA), A3C, A2C, IMPALA, … [Diagram: from single-machine DQN / GA3C / A3C to cluster-scale training.] Espeholt L. et al., IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, 2018.

  11. RL TRAINING. DGX-1: policy gradient, Q-value, … [Diagram: from single-machine (DQN, GA3C, A3C, …) and cluster (ES (GA), A3C, A2C, IMPALA, …) methods to a single DGX-1.] Stooke A., Abbeel P., Accelerated Methods for Deep Reinforcement Learning, 2018.

  12. RL TRAINING. Limitations of single-machine training (DQN, GA3C, A3C, …): training time.

  13. RL TRAINING. Limitations of cluster (ES (GA), A3C, A2C, IMPALA, …) and DGX-1 (policy gradient, Q-value, …) training: cost ($$$). Stooke A., Abbeel P., Accelerated Methods for Deep Reinforcement Learning, 2018.

  14. CULE: CUDA Learning Environment. [Diagram: learning algorithm ↔ CuLE.] Frame production / consumption rate > 10K / s. Democratize RL: more frames for less money.

  15. AGENDA: RL training (CPU, GPU) • Limitations • CuLE • Performance analysis and new scenarios

  16. RL TRAINING (CPU SIMULATION). Standard training scenario. [Diagram: CPU environments receive actions (and weights) and return states and rewards; the learner produces updates.]

  17. RL TRAINING (CPU SIMULATION). Standard training scenario. [Same diagram as slide 16, next animation step.]

  18. RL TRAINING (CPU SIMULATION). Standard training scenario. [Same diagram, highlighting the limited bandwidth of the CPU–GPU transfer of states and rewards.]

  19. RL TRAINING (CPU SIMULATION). Standard training scenario. [Same diagram, highlighting the limited number of CPUs and the resulting low frames / second, in addition to the limited bandwidth.]

  20. RL TRAINING (CULE). Porting Atari to the GPU. [Diagram: the same loop with the Atari environments moved to the GPU; actions (and weights), states, rewards, updates.]

  21. RL TRAINING (GPU). 1-to-1 mapping of ALEs to threads: each ALE (Atari simulator) instance runs on its own GPU thread.
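
     To make the mapping concrete, here is a minimal sketch of what a batched, GPU-resident environment looks like from the caller's side. BatchedAtariEnv and its methods are hypothetical stand-ins for a CuLE-style environment, not the actual CuLE API; the sketch needs PyTorch and a CUDA-capable GPU.

       import torch

       class BatchedAtariEnv:
           """Conceptual model of CuLE execution: N emulator instances live in GPU memory,
           and a single step() call advances all of them, one emulator per CUDA thread."""

           def __init__(self, num_envs, device="cuda"):
               self.num_envs = num_envs
               self.device = device
               # Placeholder for per-environment emulator state (RAM, registers, frame buffer).
               self.ram = torch.zeros(num_envs, 128, dtype=torch.uint8, device=device)

           def reset(self):
               # Stand-in observations: a stack of 4 grayscale 84x84 frames per environment.
               return torch.zeros(self.num_envs, 4, 84, 84, device=self.device)

           def step(self, actions):
               # In CuLE this is a single kernel launch; thread i advances emulator i with actions[i].
               observations = torch.zeros(self.num_envs, 4, 84, 84, device=self.device)
               rewards = torch.zeros(self.num_envs, device=self.device)
               dones = torch.zeros(self.num_envs, dtype=torch.bool, device=self.device)
               return observations, rewards, dones, [{}] * self.num_envs

       envs = BatchedAtariEnv(num_envs=4096)
       actions = torch.randint(0, 4, (4096,), device="cuda")   # one action per emulator
       obs, rew, done, info = envs.step(actions)               # all 4096 environments advance in parallel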

  22. AGENDA: RL training (CPU, GPU) • Limitations • CuLE • Performance analysis and new scenarios

  23. GYM COMPATIBLE (MOSTLY)
      AtariPy:
          for agent in range(0, agents):
              action.cpu()                                                  # transfer action to CPU
              observation, reward, done, info = env.step(action.numpy())   # execute one environment
              observation.cuda()                                            # transfer observation back to the GPU
              reward.cuda()                                                 # transfer reward back to the GPU
      CuLE:
          # one parallel call to all agents
          observations, rewards, dones, infos = env.step(actions)          # execute all environments

  24. FRAMES PER SECOND. Breakout, inference only (no training). [Chart: frames per second and GPU occupancy for 1, 1024, 4096, and 32768 environments.]

  25. GYM COMPATIBLE (MOSTLY)
      AtariPy:
          for agent in range(0, agents):
              action.cpu()                                                  # transfer action to CPU
              observation, reward, done, info = env.step(action.numpy())   # execute one environment
              observation.cuda()                                            # transfer observation back to the GPU
              reward.cuda()
          train()
      CuLE:
          # one parallel call to all agents
          observations, rewards, dones, infos = env.step(actions)          # execute all environments
          train()
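
     To show how the single batched step() call feeds the learner, here is a minimal, hedged sketch of a policy-gradient-style loop in PyTorch. It reuses the hypothetical BatchedAtariEnv from the sketch after slide 21; the network, loss, and hyperparameters are placeholders, not the A2C setup behind the Breakout result on the next slide.

       import torch
       import torch.nn as nn

       num_envs, num_actions = 1024, 4
       envs = BatchedAtariEnv(num_envs)              # hypothetical batched env from the earlier sketch
       policy = nn.Sequential(nn.Flatten(), nn.Linear(4 * 84 * 84, num_actions)).cuda()
       optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

       observations = envs.reset()                   # (num_envs, 4, 84, 84), already resident on the GPU
       for step in range(1000):
           logits = policy(observations)             # one forward pass covers every agent
           actions = torch.distributions.Categorical(logits=logits).sample()
           observations, rewards, dones, infos = envs.step(actions)   # all environments advance on the GPU
           # Placeholder policy-gradient update; a real A2C would use value estimates and multi-step returns.
           log_probs = torch.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
           loss = -(log_probs * rewards).mean()
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()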

  26. REINFORCEMENT LEARNING. Breakout – A2C (preliminary result).

  27. AGENDA: RL training (CPU, GPU) • Limitations • CuLE • Performance analysis and new scenarios

  28. TRADE-OFF. Same amount of time: CuLE vs. non-CuLE. [Diagram: CuLE (1,000 ~ 100,000 agents) gathers many frames and completes update 1, while the traditional approach (10 ~ 100 agents) completes updates 1–6 in the same time. Bandwidth vs. latency.]

  29. TRADE-OFF. Same amount of time: CuLE vs. non-CuLE. [Same diagram, next animation step: CuLE has completed updates 1–3.]

  30. TRADE-OFF. Same amount of time: CuLE vs. non-CuLE. [Same diagram, final animation step. Bandwidth vs. latency.]
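
     The bandwidth-vs-latency trade-off can be put in rough numbers. The throughputs and rollout length below are assumptions chosen only to illustrate the arithmetic, not measurements from the talk:

       def updates_per_second(num_agents, frames_per_second, rollout_length):
           """Each update consumes rollout_length frames from every agent."""
           frames_per_update = num_agents * rollout_length
           return frames_per_second / frames_per_update, frames_per_update

       # Traditional CPU setup: few agents -> many small updates (low latency per update).
       cpu_ups, cpu_fpu = updates_per_second(num_agents=16, frames_per_second=2_000, rollout_length=5)

       # CuLE-style setup: thousands of GPU agents -> fewer but much larger updates (high bandwidth).
       gpu_ups, gpu_fpu = updates_per_second(num_agents=4_096, frames_per_second=40_000, rollout_length=5)

       print(f"CPU:  {cpu_ups:.1f} updates/s, {cpu_fpu} frames per update")   # 25.0 updates/s, 80 frames
       print(f"CuLE: {gpu_ups:.1f} updates/s, {gpu_fpu} frames per update")   # ~2.0 updates/s, 20480 frames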

  31. GYM COMPATIBLE (MOSTLY)
      AtariPy / CuLE:
          while True:                                                       # loop over time
              action.cpu()                                                  # transfer action to CPU
              observation, reward, done, info = env.step(action.numpy())   # execute on the CPU
              cpu_state = cule.get_state()                                  # capture the CPU emulator state
              train()
      CuLE:
          # seed GPU agents from the CPU state (cule::set_state(gpuState, cpuState))
          env.seed(cpu_state, first_agent=0, last_agent=100)
          # one parallel call to all agents
          observations, rewards, dones, infos = env.step(actions)          # execute on the GPU
          # …

  32. SEEDING. Same amount of time: CuLE vs. non-CuLE. [Diagram: the traditional approach (10 ~ 100 agents) produces seeds 1–6, while CuLE (1,000 ~ 100,000 agents) is seeded from them and completes update 1 in the same time. Bandwidth vs. latency.]

  33. CONCLUSION. CuLE: more frames for less money (democratizing RL). New scenarios: how to use large batches? Seeding from the CPU, ES, … Soon released on https://github.com/NVlabs/

  34. March 26-29, 2018 | Silicon Valley. THANK YOU. CuLE (CUDA Learning Environment), soon released at https://github.com/NVlabs/

  35. March 26-29, 2018 | Silicon Valley

  36. MOTIVATION. Democratizing RL research. [Diagram: Colab (Jupyter-like environment) + CuLE on a K80 vs. a cluster / DGX.]

  37. ASYNCHRONOUS UPDATES. GA3C-like updates. [Timeline diagram: rollouts R0–R4 produce asynchronous network updates.]

  38. EXPERIENCE TRADE-OFF. Bandwidth vs. latency: low experience volume with high updates per second vs. high experience volume with low updates per second. [Timeline comparison diagram.]

  39. GYM COMPATIBLE (MOSTLY)
      AtariPy:
          action.cpu()                                                  # transfer action to CPU
          observation, reward, done, info = env.step(action.numpy())   # execute
          observation.cuda()                                            # transfer back to the GPU
          reward.cuda()
      CuLE (the returned variables hold references to CuLE GPU memory; no transfers needed):
          observation, reward, done, info = env.step(action)

  40. SYNCHRONOUS UPDATES. On-GPU updates. [Timeline diagram: rollouts R0–R4 feed synchronous network updates.]

  41. RL TRAINING (GPU). Standard training scenario: AtariPy vs. CuLE. [Comparison diagram.]
