slide-1
SLIDE 1
  • M. Babaeizadeh†,‡, I. Frosio‡, S. Tyree‡, J. Clemons‡, J. Kautz‡

†University of Illinois at Urbana-Champaign, USA  ‡NVIDIA, USA

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

An ICLR 2017 paper and a GitHub project


slide-4
SLIDE 4


slide-5
SLIDE 5


Learning to accomplish a task

Image from www.33rdsquare.com

slide-6
SLIDE 6

Definitions

  • Environment
  • Agent
  • Observable status St
  • Reward Rt
  • Action at
  • Policy at = π(St)

The RL agent observes the status St and reward Rt from the environment and responds with the action at = π(St).
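The interaction loop behind these definitions can be sketched in a few lines. The environment and policy below are toy stand-ins (both are assumptions, not from the slides), used only to make the St → at → Rt cycle concrete:

```python
def toy_env_step(state, action):
    """Hypothetical environment: reward 1 when the action matches the state's parity."""
    reward = 1.0 if action == state % 2 else 0.0
    return state + 1, reward                       # next status S_{t+1} and reward R_t

def policy(state):
    """A trivial deterministic policy a_t = pi(S_t)."""
    return state % 2

def run_episode(num_steps=5):
    """Agent-environment loop: observe S_t, act a_t = pi(S_t), collect R_t."""
    state, total_reward = 0, 0.0
    for _ in range(num_steps):
        action = policy(state)                     # a_t = pi(S_t)
        state, reward = toy_env_step(state, action)
        total_reward += reward
    return total_reward
```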

slide-7
SLIDE 7

Definitions

A deep RL agent implements the policy with a deep neural network: it observes St, Rt and outputs at = π(St).

slide-8
SLIDE 8

Definitions

Training the deep RL agent: the rewards collected along a rollout (R0, R1, R2, R3, R4 in the diagram) are used to compute an update Δπ(∙) of the policy network.
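The rollout returns R0…R4 in the diagram can be computed with a backward scan, R_t = r_t + γ·R_{t+1}. A minimal sketch (the discount γ and the bootstrap value are assumptions, not stated on the slide):

```python
def discounted_returns(rewards, gamma=0.99, bootstrap=0.0):
    """Scan the rollout backward: R_t = r_t + gamma * R_{t+1}."""
    returns = []
    running = bootstrap                 # value estimate after the last step
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()                   # restore chronological order R_0..R_{T-1}
    return returns
```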

slide-9
SLIDE 9


slide-10
SLIDE 10

Asynchronous Advantage Actor-Critic (Mnih et al., arXiv:1602.01783, 2016)

A master model is shared by agents 1, 2, …, 16. Each agent observes St, Rt from its own environment instance, collects rollout rewards R0…R4, and acts with at = π(St); it sends an update Δπ(∙) to the master model and receives the refreshed policy π'(∙).
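The asynchronous update pattern can be sketched with plain threads. This is a toy illustration, not the paper's implementation: a scalar weight stands in for the DNN parameters, each delta stands in for a Δπ(∙) computed by a worker, and a lock replaces the lock-free updates used in practice:

```python
import threading

master = {"w": 0.0}                     # stand-in for the master model's parameters
lock = threading.Lock()

def worker(deltas):
    """Each worker asynchronously applies its updates to the shared master model."""
    for delta in deltas:                # delta plays the role of Delta-pi
        with lock:                      # toy substitute for lock-free updates
            master["w"] += delta        # master absorbs the update; the worker
                                        # continues from the refreshed parameters

def run_async(num_workers=4, steps=10):
    threads = [threading.Thread(target=worker, args=([0.1] * steps,))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return master["w"]
```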

slide-11
SLIDE 11


slide-12
SLIDE 12


The GPU

slide-13
SLIDE 13

LOW OCCUPANCY (33%)

slide-14
SLIDE 14

HIGH OCCUPANCY (100%)

Occupancy increases with the batch size.

slide-15
SLIDE 15


HIGH OCCUPANCY (100%), LOW UTILIZATION (40%)


slide-16
SLIDE 16


HIGH OCCUPANCY (100%), HIGH UTILIZATION (100%)


slide-17
SLIDE 17


BANDWIDTH LIMITED


slide-18
SLIDE 18

A3C + the GPU

The A3C diagram again: a master model and agents 1…16, each observing St, Rt, collecting rollout rewards R0…R4, and acting with at = π(St); updates Δπ(∙) flow to the master and the refreshed policy π'(∙) flows back.

slide-19
SLIDE 19

MAPPING DEEP PROBLEMS TO A GPU

Regression, classification, …: batches of data and labels (pear, pear, pear, …; empty, empty, …; fig, fig, …; strawberry, strawberry, …) map naturally onto the GPU at 100% utilization / occupancy.

Reinforcement learning: status and reward in, action out — how does this map onto the GPU?

slide-20
SLIDE 20

A3C

Master model with agents 1, 2, …, 16: each agent observes St, Rt, collects rollout rewards R0…R4, and acts with at = π(St).


slide-26
SLIDE 26

GPU-based A3C: GA3C

Image: El Capitan big wall, Yosemite Valley

slide-27
SLIDE 27

GA3C (INFERENCE)

Agents 1, 2, …, N put their states St on a shared prediction queue. Predictors drain a batch {St}, run one batched forward pass of the master model on the GPU, and return the resulting actions {at} to the agents.
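The GA3C predictor loop described above can be sketched with standard-library queues. This is a hedged sketch, not the NVlabs implementation: `model` is a hypothetical callable standing in for the batched GPU forward pass, and the shutdown sentinel is assumed to arrive only after all agents have stopped:

```python
import queue
import threading

def predictor_loop(prediction_queue, model, agent_channels, max_batch=32):
    """Drain states from the shared queue, run one batched model call, scatter actions."""
    while True:
        agent_id, state = prediction_queue.get()      # block for the first state
        if agent_id is None:                          # sentinel: shut down
            return
        ids, states = [agent_id], [state]
        while len(states) < max_batch:                # drain more without blocking
            try:
                agent_id, state = prediction_queue.get_nowait()
            except queue.Empty:
                break
            ids.append(agent_id)
            states.append(state)
        actions = model(states)                       # one batched "GPU" call
        for agent_id, action in zip(ids, actions):    # route each action back
            agent_channels[agent_id].put(action)
```

Batching in the predictor is what turns many small per-agent forward passes into one large call that keeps GPU occupancy high.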

slide-28
SLIDE 28

GA3C (TRAINING)

Agents 1, 2, …, N put their experiences St, Rt (states with rollout rewards) on a shared training queue. Trainers drain a batch {St, Rt} and submit one batched update Δπ(∙) to the master model on the GPU.
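The trainer side can be sketched the same way. Again a hedged illustration, not the actual implementation: `apply_update` is a hypothetical stand-in for the batched training step, and the sentinel-based shutdown is an assumption:

```python
import queue

def trainer_loop(training_queue, apply_update, min_batch=4):
    """Accumulate (state, return) experiences and submit batched updates."""
    batch = []
    while True:
        item = training_queue.get()
        if item is None:                  # sentinel: stop training
            return
        batch.append(item)                # item is one (St, Rt) experience
        if len(batch) >= min_batch:       # batch size trades GPU utilization
            apply_update(batch)           # against policy lag
            batch = []
```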

slide-29
SLIDE 29

GA3C

The full architecture combines both halves against a single master model on the GPU: agents 1…N feed states St into the prediction queue, where predictors batch {St} and return actions {at}; the same agents feed experiences into the training queue, where trainers batch {St, Rt} and apply updates Δπ(∙).


slide-31
SLIDE 31

GPU-based A3C: GA3C — learn how to balance

Image: El Capitan big wall, Yosemite Valley

slide-32
SLIDE 32

GA3C: PREDICTIONS PER SECOND (PPS)

PPS counts the inference throughput of the architecture: states St batched through the prediction queue by the predictors and answered with actions {at} by the master model.

slide-33
SLIDE 33

GA3C: TRAININGS PER SECOND (TPS)

TPS counts the training throughput: experiences {St, Rt} batched through the training queue by the trainers into updates Δπ(∙) for the master model.

slide-34
SLIDE 34

AUTOMATIC SCHEDULING

Balancing the system at run time (shown on ATARI Boxing and ATARI Pong).

NP = # predictors, NT = # trainers, NA = # agents, TPS = trainings per second
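The run-time balancing idea can be sketched as a simple hill-climbing loop: periodically perturb NP or NT and keep the change only if the measured TPS improves. This is a hedged sketch of the idea, not the GA3C scheduler itself; `measure_tps`, the step count, and the ±1 perturbation rule are all assumptions:

```python
import random

def tune(measure_tps, n_pred=2, n_train=2, steps=30, seed=0):
    """Keep a candidate (NP, NT) configuration only if the measured TPS improves."""
    rng = random.Random(seed)
    best = (n_pred, n_train)
    best_tps = measure_tps(*best)
    for _ in range(steps):
        cand = list(best)
        idx = rng.randrange(2)                     # perturb either NP or NT
        cand[idx] = max(1, cand[idx] + rng.choice((-1, 1)))
        tps = measure_tps(*cand)
        if tps > best_tps:                         # keep only improving configs
            best, best_tps = tuple(cand), tps
    return best
```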

slide-35
SLIDE 35


THE ADVANTAGE OF SPEED

More frames = faster convergence

slide-36
SLIDE 36

LARGER DNNS

For real-world applications (e.g. robotics, automotive), networks larger than the usual A3C ATARI model are needed:

  • Timothy P. Lillicrap et al., Continuous control with deep reinforcement learning, International Conference on Learning Representations, 2016.
  • S. Levine et al., End-to-end training of deep visuomotor policies, Journal of Machine Learning Research, 17:1-40, 2016.

A3C (ATARI) network: Conv 16 8x8 filters, stride 4; Conv 32 4x4 filters, stride 2; FC 256.
Larger network: Conv 32 8x8 filters, stride 1, 2, 3, or 4; Conv 32 4x4 filters, stride 2; Conv 64 4x4 filters, stride 2; FC 256. How fast can we train it?

slide-37
SLIDE 37

GA3C VS. A3C*: PREDICTIONS PER SECOND

* Our TensorFlow implementation on a CPU

GA3C speedup over A3C in PPS:

  • Small DNN: 4x
  • Large DNN, stride 4: 11x
  • Large DNN, stride 3: 12x
  • Large DNN, stride 2: 20x
  • Large DNN, stride 1: 45x

slide-38
SLIDE 38

CPU & GPU UTILIZATION IN GA3C

For larger DNNs

[Bar chart: CPU % and GPU % utilization for the small DNN and the large DNN at strides 4, 3, 2, 1]

slide-39
SLIDE 39

GA3C POLICY LAG

Asynchronous playing and training: while an experience waits in the training queue, the master model keeps being updated, so the network that generated an action (DNN A in the diagram) can differ from the network that later trains on that experience (DNN B).

slide-40
SLIDE 40

STABILITY AND CONVERGENCE SPEED

Reducing policy lag through a minimum training batch size

slide-41
SLIDE 41

GPU-based A3C: GA3C (45x faster)

Balancing computational resources, speed and stability.

Image: El Capitan big wall, Yosemite Valley

slide-42
SLIDE 42

RESOURCES

THEORY

  • M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl).

CODING

  • GA3C, a GPU implementation of A3C (open source at https://github.com/NVlabs/GA3C): a general architecture to generate and consume training data.

slide-43
SLIDE 43

QUESTIONS

  • Policy lag
  • Multiple GPUs
  • Why TensorFlow
  • Replay memory
  • …

ATARI 2600

Github: https://github.com/NVlabs/GA3C
ICLR 2017: Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU.

slide-44
SLIDE 44

BACKUP SLIDES

slide-45
SLIDE 45


POLICY LAG IN GA3C

Potentially large time lag between training data generation and network update

slide-46
SLIDE 46


SCORES

slide-47
SLIDE 47


Training a larger DNN

slide-48
SLIDE 48

BALANCING COMPUTATIONAL RESOURCES

The actors: CPU, PCI-E & GPU

Monitored metrics: trainings per second, training queue size, prediction queue size.