GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING
M. Babaeizadeh†,‡, I. Frosio‡, S. Tyree‡, J. Clemons‡, J. Kautz‡
†University of Illinois at Urbana-Champaign, USA  ‡NVIDIA, USA
An ICLR 2017 paper and a GitHub project
[Image from www.33rdsquare.com]
Reinforcement learning elements:
✓ Environment
✓ Agent
✓ Observable state St
✓ Reward Rt
✓ Action at
✓ Policy: at = π(St)
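The elements above fit together in a simple interaction loop. A minimal sketch, with a toy stand-in environment and a random stand-in policy (both hypothetical, not the paper's Atari setup):

```python
import random

class Environment:
    """Toy environment: the state is a step counter; reward is +1 for action 1."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.state >= 5
        return self.state, reward, done

def policy(state):
    # at = π(St): here just a random choice over two actions.
    return random.choice([0, 1])

env = Environment()
state, total_reward, done = 0, 0.0, False
while not done:
    action = policy(state)                   # agent picks at = π(St)
    state, reward, done = env.step(action)   # environment returns St+1, Rt
    total_reward += reward
```

In the actual paper the environment is an Atari emulator and the policy is a deep network, but the loop structure is the same.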
[Figure: a sequence of rewards R0, R1, R2, R3, R4 collected along an episode]
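A plausible reading of the reward-sequence figure: the rewards along a rollout are combined into a discounted return, computed backwards from the last reward. A sketch, assuming the standard discount formulation (gamma = 0.99 is an assumed value, not from the slide):

```python
def discounted_return(rewards, gamma=0.99):
    """Fold R0...Rn into G = R0 + gamma*R1 + gamma^2*R2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g = discounted_return([1.0, 0.0, 0.0, 0.0, 2.0])
```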
[Diagram: A3C — 16 agents (Agent 1, Agent 2, ... Agent 16) each send their state and reward (St, Rt) to the master model and receive an action at = π(St); rewards R0...R4 accumulate along each agent's rollout.]
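The layout above can be sketched with threads: several agents share one master model, each querying it for actions and sending back updates. This is a hedged stand-in (trivial model, no real gradients), only meant to show the asynchronous structure:

```python
import threading, random

class MasterModel:
    def __init__(self):
        self.lock = threading.Lock()
        self.updates = 0

    def predict(self, state):
        return random.choice([0, 1])   # at = π(St), stand-in policy

    def apply_update(self):
        with self.lock:                # asynchronous updates serialize here
            self.updates += 1

def agent(model, steps=10):
    state = 0
    for _ in range(steps):
        action = model.predict(state)  # query shared model
        state += 1
        model.apply_update()           # send update back (Δπ in the real algorithm)

model = MasterModel()
threads = [threading.Thread(target=agent, args=(model,)) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In real A3C each `apply_update` would apply a policy/value gradient computed from the agent's rollout; here it only counts updates.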
Batch size
[Diagram: A3C update step — each of the 16 agents (Agent 1, Agent 2, ... Agent 16) sends its state and reward (St, Rt) to the master model and receives an action at = π(St); the accumulated rewards R0...R4 yield a gradient Δπ(·) that updates the master policy to π'(·).]
[Figure: batching analogy — long runs of identical items (pear, pear, ...; fig, fig, ...; strawberry, strawberry, ...; empty, empty, ...) illustrate how supervised learning feeds the GPU large homogeneous batches of data and labels at 100% utilization/occupancy, whereas RL interleaves status, reward, and action.]
[Image: El Capitan big wall, Yosemite Valley]
[Diagram: GA3C inference path — agents (Agent 1, Agent 2, ... Agent N) push their states St into a prediction queue; predictor threads drain the queued states {St} into a batch, run one batched forward pass on the master model, and return the resulting actions {at} to the agents.]
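The prediction queue can be sketched as follows: agents enqueue states, and a predictor thread drains the queue into a batch before making a single batched model call. The "model" here is a stand-in (it just doubles each state), not GA3C's actual GPU forward pass:

```python
import queue, threading

prediction_queue = queue.Queue()
results = {}

def model_predict(batch):
    # Stand-in for a batched GPU forward pass over {St}.
    return [s * 2 for s in batch]

def predictor(batch_size=4, n_items=8):
    served = 0
    while served < n_items:
        batch, ids = [], []
        # Block for the first queued state, then greedily fill the batch.
        agent_id, state = prediction_queue.get()
        batch.append(state); ids.append(agent_id)
        while len(batch) < batch_size and not prediction_queue.empty():
            agent_id, state = prediction_queue.get()
            batch.append(state); ids.append(agent_id)
        # Route each action {at} back to the agent that sent the state.
        for agent_id, action in zip(ids, model_predict(batch)):
            results[agent_id] = action
        served += len(batch)

t = threading.Thread(target=predictor)
t.start()
for agent_id in range(8):
    prediction_queue.put((agent_id, agent_id))   # agent sends St
t.join()
```

Batching many agents' states into one forward pass is what lets the GPU stay busy instead of serving one tiny inference per agent.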
[Diagram: GA3C training path — agents (Agent 1, Agent 2, ... Agent N) push experience tuples (St, Rt), with rewards R0...R4 accumulated along each rollout, into a training queue; trainer threads batch the queued experiences {St, Rt} and apply the resulting gradient Δπ(·) to the master model.]
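The training queue mirrors the prediction queue: agents enqueue (state, reward) experiences, and a trainer thread consumes them in batches, applying one update per batch to the shared model. Again a hedged stand-in (the model and its update rule are placeholders):

```python
import queue, threading

training_queue = queue.Queue()

class MasterModel:
    def __init__(self):
        self.lock = threading.Lock()
        self.updates = 0

    def train_on_batch(self, batch):
        with self.lock:        # Δπ(·) applied to the shared weights
            self.updates += 1

def trainer(model, batch_size=4, n_batches=4):
    for _ in range(n_batches):
        # Blocking get: wait until a full batch of experiences is available.
        batch = [training_queue.get() for _ in range(batch_size)]
        model.train_on_batch(batch)

model = MasterModel()
t = threading.Thread(target=trainer, args=(model,))
t.start()
for step in range(16):
    training_queue.put((step, 1.0))   # agent sends (St, Rt)
t.join()
```

As with prediction, batching experiences amortizes the cost of each GPU backward pass across many agents.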
[Diagram: full GA3C architecture — N agents feed both a prediction queue (states {St} in, actions {at} out via predictor threads) and a training queue (experiences {St, Rt} in, gradient Δπ(·) applied to the master model via trainer threads).]
Results on ATARI Boxing and ATARI Pong. NP = number of predictors, NT = number of trainers, NA = number of agents, TPS = trainings per second.
A3C targets ATARI; other methods target robotics:
- Timothy P. Lillicrap et al., Continuous control with deep reinforcement learning, International Conference on Learning Representations, 2016.
- Sergey Levine et al., End-to-end training of deep visuomotor policies, Journal of Machine Learning Research, 17:1-40, 2016.
Small DNN: Conv 16 8×8 filters, stride 4 → Conv 32 4×4 filters, stride 2 → FC 256.
Large DNN: Conv 32 8×8 filters, stride 1/2/3/4 → Conv 32 4×4 filters, stride 2 → Conv 64 4×4 filters, stride 2 → FC 256.
[Chart: predictions per second (PPS, log scale from 1 to 10000) for A3C vs GA3C; GA3C speedups of 4x (Small DNN), 11x (Large DNN, stride 4), 12x (stride 3), 20x (stride 2), and 45x (stride 1).]
[Chart: CPU and GPU utilization (%) for the Small DNN and the Large DNN with strides 4, 3, 2, and 1.]
GitHub: https://github.com/NVlabs/GA3C
ICLR 2017: Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU.
A potential issue: a large time lag between when training data is generated (under an older policy) and when the corresponding network update is applied.
[Chart: trainings per second, training queue size, and prediction queue size over time.]