GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING
M. Babaeizadeh†,‡, I. Frosio‡, S. Tyree‡, J. Clemons‡, J. Kautz‡
†University of Illinois at Urbana-Champaign, USA  ‡NVIDIA, USA
An ICLR 2017 paper and a GitHub project
[Image from www.33rdsquare.com]
Reinforcement learning elements:
✓ Environment
✓ Agent
✓ Observable state St
✓ Reward Rt
✓ Action at
✓ Policy: at = π(St)
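The elements above fit together in a simple interaction loop. A minimal sketch, with a toy stand-in environment and a random stand-in policy (both hypothetical, not the paper's Atari setup):

```python
import random

class Environment:
    """Toy environment: the state is a step counter; reward is +1 for action 1."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.state >= 5
        return self.state, reward, done

def policy(state):
    # at = π(St): here just a random choice over two actions.
    return random.choice([0, 1])

env = Environment()
state, total_reward, done = 0, 0.0, False
while not done:
    action = policy(state)                   # agent picks at = π(St)
    state, reward, done = env.step(action)   # environment returns St+1, Rt
    total_reward += reward
```

In the actual paper the environment is an Atari emulator and the policy is a deep network, but the loop structure is the same.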
[Figure: a sequence of rewards R0, R1, R2, R3, R4 collected along an episode]
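A plausible reading of the reward-sequence figure: the rewards along a rollout are combined into a discounted return, computed backwards from the last reward. A sketch, assuming the standard discount formulation (gamma = 0.99 is an assumed value, not from the slide):

```python
def discounted_return(rewards, gamma=0.99):
    """Fold R0...Rn into G = R0 + gamma*R1 + gamma^2*R2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g = discounted_return([1.0, 0.0, 0.0, 0.0, 2.0])
```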
[Diagram: A3C — 16 agents (Agent 1, Agent 2, ... Agent 16) each send their state and reward (St, Rt) to the master model and receive an action at = π(St); rewards R0...R4 accumulate along each agent's rollout.]
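The layout above can be sketched with threads: several agents share one master model, each querying it for actions and sending back updates. This is a hedged stand-in (trivial model, no real gradients), only meant to show the asynchronous structure:

```python
import threading, random

class MasterModel:
    def __init__(self):
        self.lock = threading.Lock()
        self.updates = 0

    def predict(self, state):
        return random.choice([0, 1])   # at = π(St), stand-in policy

    def apply_update(self):
        with self.lock:                # asynchronous updates serialize here
            self.updates += 1

def agent(model, steps=10):
    state = 0
    for _ in range(steps):
        action = model.predict(state)  # query shared model
        state += 1
        model.apply_update()           # send update back (Δπ in the real algorithm)

model = MasterModel()
threads = [threading.Thread(target=agent, args=(model,)) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In real A3C each `apply_update` would apply a policy/value gradient computed from the agent's rollout; here it only counts updates.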
Batch size
[Diagram: A3C update step — each of the 16 agents (Agent 1, Agent 2, ... Agent 16) sends its state and reward (St, Rt) to the master model and receives an action at = π(St); the accumulated rewards R0...R4 yield a gradient Δπ(·) that updates the master policy to π'(·).]
[Figure: batching analogy — long runs of identical items (pear, pear, ...; fig, fig, ...; strawberry, strawberry, ...; empty, empty, ...) illustrate how supervised learning feeds the GPU large homogeneous batches of data and labels at 100% utilization/occupancy, whereas RL interleaves status, reward, and action.]
[Image: El Capitan big wall, Yosemite Valley]
[Diagram: GA3C inference path — agents (Agent 1, Agent 2, ... Agent N) push their states St into a prediction queue; predictor threads drain the queued states {St} into a batch, run one batched forward pass on the master model, and return the resulting actions {at} to the agents.]
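The prediction queue can be sketched as follows: agents enqueue states, and a predictor thread drains the queue into a batch before making a single batched model call. The "model" here is a stand-in (it just doubles each state), not GA3C's actual GPU forward pass:

```python
import queue, threading

prediction_queue = queue.Queue()
results = {}

def model_predict(batch):
    # Stand-in for a batched GPU forward pass over {St}.
    return [s * 2 for s in batch]

def predictor(batch_size=4, n_items=8):
    served = 0
    while served < n_items:
        batch, ids = [], []
        # Block for the first queued state, then greedily fill the batch.
        agent_id, state = prediction_queue.get()
        batch.append(state); ids.append(agent_id)
        while len(batch) < batch_size and not prediction_queue.empty():
            agent_id, state = prediction_queue.get()
            batch.append(state); ids.append(agent_id)
        # Route each action {at} back to the agent that sent the state.
        for agent_id, action in zip(ids, model_predict(batch)):
            results[agent_id] = action
        served += len(batch)

t = threading.Thread(target=predictor)
t.start()
for agent_id in range(8):
    prediction_queue.put((agent_id, agent_id))   # agent sends St
t.join()
```

Batching many agents' states into one forward pass is what lets the GPU stay busy instead of serving one tiny inference per agent.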
[Diagram: GA3C training path — agents (Agent 1, Agent 2, ... Agent N) push experience tuples (St, Rt), with rewards R0...R4 accumulated along each rollout, into a training queue; trainer threads batch the queued experiences {St, Rt} and apply the resulting gradient Δπ(·) to the master model.]
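The training queue mirrors the prediction queue: agents enqueue (state, reward) experiences, and a trainer thread consumes them in batches, applying one update per batch to the shared model. Again a hedged stand-in (the model and its update rule are placeholders):

```python
import queue, threading

training_queue = queue.Queue()

class MasterModel:
    def __init__(self):
        self.lock = threading.Lock()
        self.updates = 0

    def train_on_batch(self, batch):
        with self.lock:        # Δπ(·) applied to the shared weights
            self.updates += 1

def trainer(model, batch_size=4, n_batches=4):
    for _ in range(n_batches):
        # Blocking get: wait until a full batch of experiences is available.
        batch = [training_queue.get() for _ in range(batch_size)]
        model.train_on_batch(batch)

model = MasterModel()
t = threading.Thread(target=trainer, args=(model,))
t.start()
for step in range(16):
    training_queue.put((step, 1.0))   # agent sends (St, Rt)
t.join()
```

As with prediction, batching experiences amortizes the cost of each GPU backward pass across many agents.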
[Diagram: full GA3C architecture — N agents feed both a prediction queue (states {St} in, actions {at} out via predictor threads) and a training queue (experiences {St, Rt} in, gradient Δπ(·) applied to the master model via trainer threads).]
Results on ATARI Boxing and ATARI Pong. NP = number of predictors, NT = number of trainers, NA = number of agents, TPS = trainings per second.
A3C targets ATARI; other methods target robotics:
- Timothy P. Lillicrap et al., Continuous control with deep reinforcement learning, International Conference on Learning Representations, 2016.
- Sergey Levine et al., End-to-end training of deep visuomotor policies, Journal of Machine Learning Research, 17:1-40, 2016.
Small DNN: Conv 16 8×8 filters, stride 4 → Conv 32 4×4 filters, stride 2 → FC 256.
Large DNN: Conv 32 8×8 filters, stride 1/2/3/4 → Conv 32 4×4 filters, stride 2 → Conv 64 4×4 filters, stride 2 → FC 256.
[Chart: predictions per second (PPS, log scale from 1 to 10000) for A3C vs GA3C; GA3C speedups of 4x (Small DNN), 11x (Large DNN, stride 4), 12x (stride 3), 20x (stride 2), and 45x (stride 1).]
[Chart: CPU and GPU utilization (%) for the Small DNN and the Large DNN with strides 4, 3, 2, and 1.]
GitHub: https://github.com/NVlabs/GA3C
ICLR 2017: Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU.
A potential issue: a large time lag between when training data is generated (under an older policy) and when the corresponding network update is applied.
[Chart: trainings per second, training queue size, and prediction queue size over time.]