SLIDE 1

CULE*: GPU ACCELERATED RL
* CUDA Learning Environment

Steven Dalton, Iuri Frosio, Jared Hoberock, Jason Clemons

March 26-29, 2018 | Silicon Valley

[Figure: reinforcement learning over time]

slide-2
SLIDE 2

2

REINFORCEMENT LEARNING

A successful approach…

  • Board games
  • Video games
  • Robotics
  • Finance
  • Automotive
  • ML training – L2L (learn-to-learn)
SLIDE 3

REINFORCEMENT LEARNING

A successful approach… calling for more investigation

  • New RL algorithms
    – Development
    – Debugging / testing
    – Benchmarking
  • Alternative approaches
    – Evolutionary strategies
    – Imitation learning
SLIDE 4

REINFORCEMENT LEARNING

ALE (Arcade Learning Environment)

  • Diverse set of tasks
  • Established benchmark – the MNIST of RL?

SLIDE 5

CULE

CUDA Learning Environment

[Diagram: LEARNING ALGO ↔ CuLE, frame production / consumption rate > 10K frames/s]

Democratize RL: more frames for less money

SLIDE 6

AGENDA

  • RL training: CPU, GPU
  • Limitations
  • CuLE
  • Performance
  • Analysis and new scenarios

SLIDE 7

RL TRAINING

The OpenAI ATARI interface

https://github.com/openai/atari-py (OpenAI gym)
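A minimal sketch of this interface, using the classic (pre-2021) gym step/reset API that atari-py backed at the time; the environment id and the random policy are only for illustration:

    import gym

    # Classic gym Atari loop: reset() returns the first frame,
    # step() returns (observation, reward, done, info).
    env = gym.make('PongNoFrameskip-v4')
    observation = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()                  # random policy
        observation, reward, done, info = env.step(action)  # one emulator step
    env.close()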

SLIDE 8

RL TRAINING

CPU-only training

[Timeline: DQN, A3C, …]

Mnih V. et al., Human-level control through deep reinforcement learning, Nature, 2015
Mnih V. et al., Asynchronous Methods for Deep Reinforcement Learning, ICML, 2016

SLIDE 9

RL TRAINING

Hybrid CPU-GPU training

[Timeline: DQN, A3C, … → GA3C, …]

Babaeizadeh M. et al., Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR, 2017

[Chart: PPS speedup of GA3C over A3C, log scale 1–10,000]
  Small DNN:            4x
  Large DNN, stride 4: 11x
  Large DNN, stride 3: 12x
  Large DNN, stride 2: 20x
  Large DNN, stride 1: 45x

SLIDE 10

RL TRAINING

Clusters

[Timeline: DQN, A3C, … → GA3C, … → Cluster: ES (GA), A3C, A2C, IMPALA, …]

Espeholt L. et al., IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, 2018

SLIDE 11

RL TRAINING

DGX-1

[Timeline: DQN, A3C, … → GA3C, … → Cluster: ES (GA), A3C, A2C, IMPALA, … → DGX-1: policy gradient, Q-value, …]

Stooke A., Abbeel P., Accelerated Methods for Deep Reinforcement Learning, 2018

SLIDE 12

RL TRAINING

Limitations

[Timeline: DQN, A3C, … → GA3C, …]

SLIDE 13

RL TRAINING

Limitations

Stooke A., Abbeel P., Accelerated Methods for Deep Reinforcement Learning, 2018

[Timeline: Cluster: ES (GA), A3C, A2C, IMPALA, … → DGX-1: policy gradient, Q-value, …]

$$$ (scaling out is expensive)

SLIDE 14

CULE

CUDA Learning Environment

[Diagram: LEARNING ALGO ↔ CuLE, frame production / consumption rate > 10K frames/s]

Democratize RL: more frames for less money

SLIDE 15

AGENDA

  • RL training: CPU, GPU
  • Limitations
  • CuLE
  • Performance
  • Analysis and new scenarios

SLIDE 16

RL TRAINING (CPU SIMULATION)

Standard training scenario

[Diagram: training loop data flow: states, rewards; actions (, weights); updates]

SLIDE 17

RL TRAINING (CPU SIMULATION)

Standard training scenario

[Diagram: training loop data flow: states, rewards; actions (, weights); updates]

SLIDE 18

RL TRAINING (CPU SIMULATION)

Standard training scenario

[Diagram: training loop data flow: states, rewards; actions (, weights); updates]

Limited bandwidth

SLIDE 19

RL TRAINING (CPU SIMULATION)

Standard training scenario

[Diagram: training loop data flow: states, rewards; actions (, weights); updates]

Limited bandwidth
Limited number of CPUs, low frames / second
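One way to get a feel for the transfer bottleneck in this diagram is to time a CPU-to-GPU copy of a batch of observations; a hypothetical PyTorch snippet (the batch shape and iteration count are arbitrary assumptions):

    import time
    import torch

    frames = torch.zeros(256, 4, 84, 84)       # a CPU batch of stacked 84x84 frames
    if torch.cuda.is_available():
        start = time.time()
        for _ in range(100):
            frames_gpu = frames.cuda()          # the CPU -> GPU hop in the diagram
        torch.cuda.synchronize()                # wait for the copies to finish
        print('mean copy time: %.5f s' % ((time.time() - start) / 100))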

SLIDE 20

RL TRAINING (CULE)

Porting ATARI to the GPU

[Diagram: training loop data flow: states, rewards; actions (, weights); updates]

SLIDE 21

RL TRAINING (GPU)

1-to-1 mapping of ALEs to threads

ALE: the ATARI simulator
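CuLE's kernels are not shown in this deck, but the one-emulator-per-thread idea can be sketched with numba.cuda; everything below (the toy state update standing in for the Atari emulator, the names, the sizes) is an illustrative assumption, not CuLE's implementation:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def step_envs(states, actions, rewards):
        # One GPU thread advances one emulator instance.
        i = cuda.grid(1)
        if i < states.shape[0]:
            states[i] += actions[i]                     # toy stand-in for an emulator step
            rewards[i] = 1.0 if states[i] > 0 else 0.0  # toy reward

    n_envs = 4096
    states = cuda.to_device(np.zeros(n_envs, dtype=np.float32))
    actions = cuda.to_device(np.random.randint(-1, 2, n_envs).astype(np.float32))
    rewards = cuda.device_array(n_envs, dtype=np.float32)

    threads = 256
    blocks = (n_envs + threads - 1) // threads
    step_envs[blocks, threads](states, actions, rewards)  # all environments step in parallel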

SLIDE 22

AGENDA

  • RL training: CPU, GPU
  • Limitations
  • CuLE
  • Performance
  • Analysis and new scenarios

SLIDE 23

GYM COMPATIBLE (MOSTLY)

AtariPy:

    for agent in range(agents):
        action.cpu()                       # transfer the action to the CPU
        observation, reward, done, info = env.step(action.numpy())  # execute
        observation.cuda()                 # transfer the observation back to the GPU
        reward.cuda()

CuLE:

    # parallel call to all agents
    observations, rewards, dones, infos = env.step(actions)  # execute
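To make the CuLE column concrete, here is a hypothetical, self-contained mock of a CuLE-like batched environment in PyTorch; MockBatchedEnv and all shapes are invented for illustration, but the single step(actions) call over batched device tensors mirrors the interface sketched above:

    import torch

    class MockBatchedEnv:
        # Stand-in for a CuLE-like environment: one step() advances every agent.
        def __init__(self, num_envs, num_actions, device):
            self.num_envs, self.num_actions, self.device = num_envs, num_actions, device

        def step(self, actions):
            # Everything stays on one device; no per-agent CPU round trips.
            observations = torch.rand(self.num_envs, 4, 84, 84, device=self.device)
            rewards = torch.zeros(self.num_envs, device=self.device)
            dones = torch.zeros(self.num_envs, dtype=torch.bool, device=self.device)
            return observations, rewards, dones, {}

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    env = MockBatchedEnv(num_envs=4096, num_actions=4, device=device)
    actions = torch.randint(0, env.num_actions, (env.num_envs,), device=device)
    observations, rewards, dones, infos = env.step(actions)  # one parallel call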

SLIDE 24

FRAMES PER SECOND

Breakout, inference only (no training)

[Chart: frames per second and GPU occupancy for 1, 1024, 4096, and 32768 environments]

SLIDE 25

GYM COMPATIBLE (MOSTLY)

AtariPy:

    for agent in range(agents):
        action.cpu()                       # transfer the action to the CPU
        observation, reward, done, info = env.step(action.numpy())  # execute
        observation.cuda()                 # transfer the observation back to the GPU
        reward.cuda()
    train()

CuLE:

    # parallel call to all agents
    observations, rewards, dones, infos = env.step(actions)  # execute
    train()

SLIDE 26

REINFORCEMENT LEARNING

Breakout – A2C (preliminary result)

SLIDE 27

AGENDA

  • RL training: CPU, GPU
  • Limitations
  • CuLE
  • Performance
  • Analysis and new scenarios

SLIDE 28

TRADE-OFF

Same amount of time: CuLE vs. non-CuLE

Bandwidth vs. Latency

[Chart: frames vs. agents over the same wall-clock time]
  Traditional approach: 10 ~ 100 agents, updates 1-6 (small, frequent)
  CuLE: 1,000 ~ 100,000 agents, update 1 (large, infrequent)
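A back-of-the-envelope reading of this chart (all numbers below are illustrative assumptions, not measurements from the deck): for a fixed total frame rate, more agents means more frames per update but fewer updates per second:

    # Illustrative numbers only; fps_total and rollout_steps are assumptions.
    def updates_and_frames(num_agents, fps_total, rollout_steps=5):
        frames_per_update = num_agents * rollout_steps
        updates_per_second = fps_total / frames_per_update
        return frames_per_update, updates_per_second

    # Traditional CPU setup: few agents, small but frequent updates.
    print(updates_and_frames(num_agents=16, fps_total=2_000))     # (80, 25.0)
    # CuLE-style setup: thousands of agents, large but infrequent updates.
    print(updates_and_frames(num_agents=4096, fps_total=40_000))  # (20480, ~1.95)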

SLIDE 29

TRADE-OFF

Same amount of time: CuLE vs. non-CuLE

Bandwidth vs. Latency

[Chart: frames vs. agents over the same wall-clock time]
  Traditional approach: 10 ~ 100 agents, updates 1-6 (small, frequent)
  CuLE: 1,000 ~ 100,000 agents, updates 1-3 (large, infrequent)

SLIDE 30

TRADE-OFF

Same amount of time: CuLE vs. non-CuLE

Bandwidth vs. Latency

[Chart: frames vs. agents over the same wall-clock time]
  Traditional approach: 10 ~ 100 agents, updates 1-6 (small, frequent)
  CuLE: 1,000 ~ 100,000 agents, updates 1-3 (large, infrequent)

SLIDE 31

GYM COMPATIBLE (MOSTLY)

AtariPy / CuLE:

    while True:
        action.cpu()                       # transfer the action to the CPU
        observation, reward, done, info = env.step(action.numpy())  # execute
        cpu_state = cule.get_state()       # get the emulator state
        train()

CuLE:

    # seed: cule::set_state(gpuState, cpuState)
    env.seed(cpu_state, first_agent=0, last_agent=100)
    # parallel call to all agents
    observations, rewards, dones, infos = env.step(actions)  # execute
    # …

SLIDE 32

SEEDING

Same amount of time: CuLE vs. non-CuLE

Bandwidth vs. Latency

[Chart: frames vs. agents over the same wall-clock time]
  Traditional approach: 10 ~ 100 agents, seeds 1-6
  CuLE: 1,000 ~ 100,000 agents, update 1

SLIDE 33

CONCLUSION

CuLE

  • More frames for less money (democratizing RL)
  • New scenarios
    – How to use large batches?
    – Seeding from the CPU, ES, …
  • Soon released on https://github.com/NVlabs/

SLIDE 34

March 26-29, 2018 | Silicon Valley

THANK YOU

CULE (CUDA LEARNING ENVIRONMENT), SOON RELEASED ON HTTPS://GITHUB.COM/NVLABS/

SLIDE 35

March 26-29, 2018 | Silicon Valley

SLIDE 36

MOTIVATION

Democratizing RL research

[Diagram: CuLE runs from DGX and cluster down to a K80 on Colab (a Jupyter-like environment)]

SLIDE 37

ASYNCHRONOUS UPDATES

GA3C-like updates

[Diagram: asynchronous updates across rollouts R0, R1, R2, R4]

SLIDE 38

EXPERIENCE TRADE-OFF

Bandwidth vs. Latency

[Diagram: two timelines compared]
  High experience volume, low updates per second
  vs.
  Low experience volume, high updates per second

SLIDE 39

GYM COMPATIBLE (MOSTLY)

AtariPy:

    action.cpu()                       # transfer the action to the CPU
    observation, reward, done, info = env.step(action.numpy())  # execute
    observation.cuda()                 # transfer the observation back to the GPU
    reward.cuda()

CuLE:

    observation, reward, done, info = env.step(action)

Note: the `cule::set_state(gpuState, cpuState)` variable contains memory references.

SLIDE 40

SYNCHRONOUS

On-GPU updates

[Diagram: synchronous updates across rollouts R0, R1, R2, R4]

SLIDE 41

RL TRAINING (GPU)

Standard training scenario

[Diagram: CuLE vs. AtariPy]