

SLIDE 1

Distributed Meta Optimization of Reinforcement Learning Agents

Greg Heinrich, Iuri Frosio - GTC San Jose, March 2019

SLIDE 2

AGENDA

Contents

  • Introduction to Reinforcement Learning
  • Introduction to Metaoptimization (on distributed systems) / MagLev
  • Metaoptimization and Reinforcement Learning (on distributed systems)
  • HyperTrick
  • Results
  • Conclusion

SLIDE 3

GPU-Based A3C for Deep Reinforcement Learning (RL) keywords: GPU, A3C, RL

  • M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017

(available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl). Open source implementation: https://github.com/NVlabs/GA3C.

SLIDE 4

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Learning to accomplish a task

Image from www.33rdsquare.com

SLIDE 5

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

[Diagram: the RL agent observes the environment state St and a reward Rt, and responds with an action at = π(St).]

  ✓ Environment
  ✓ Agent
  ✓ Observable status St
  ✓ Reward Rt
  ✓ Action at
  ✓ Policy at = π(St)

SLIDE 6

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

[Diagram: the same loop with a deep neural network as the agent: the deep RL agent maps the observed state St and reward Rt to an action at = π(St).]

SLIDE 7

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

[Diagram: along a rollout the agent collects the rewards R0 … R4 and uses them to compute a policy update Δπ(∙).]

SLIDE 8

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Objective: maximize the expected discounted reward. The value of a state is the expected discounted reward collected from that state onward:

  V(s) = E[ Rt + δ Rt+1 + δ² Rt+2 + … | St = s ]

The discount factor δ determines how short- or far-sighted the agent is: 0 < δ < 1, usually 0.99.
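To make this concrete, here is a minimal Python sketch (ours, not from the talk) that computes the discounted return of a reward sequence with δ = 0.99:

    # Discounted return: R = r_0 + delta*r_1 + delta^2*r_2 + ...
    def discounted_return(rewards, delta=0.99):
        total = 0.0
        for r in reversed(rewards):  # accumulate from the last reward backward
            total = r + delta * total
        return total

    # A reward of 1 arriving two steps in the future is worth 0.99**2 = 0.9801 now.
    print(discounted_return([0.0, 0.0, 1.0]))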

SLIDE 9

SLIDE 10

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Asynchronous Advantage Actor-Critic (Mnih et al., arXiv:1602.01783, 2016)

[Diagram: Agents 1, 2, …, 16 each interact with their own environment copy, collecting St, Rt and reward sequences R0 … R4 under at = π(St); each agent sends policy updates Δπ(∙) to the master model and receives the updated policy π’(∙) back.]

SLIDE 11

SLIDE 12

MAPPING DEEP PROBLEMS TO A GPU

REGRESSION, CLASSIFICATION, …: large batches of data and labels ("pear, pear, pear, …", "empty, empty, …", "fig, fig, fig, …", "strawberry, strawberry, …") keep the GPU at 100% utilization / occupancy.

REINFORCEMENT LEARNING: the workload is a loop of status and reward in, action out; how do we keep the GPU fully utilized?

SLIDE 13

A3C

[Diagram: Agents 1, 2, …, 16 each roll out up to t_max steps, collecting St, Rt and reward sequences R0 … R4 while acting with at = π(St).]

SLIDE 14

A3C

[Diagram: the agents keep interacting with their environment copies, accumulating experience.]

SLIDE 15

A3C

[Diagram: after t_max steps, an agent computes a policy update Δπ(∙) from its rollout and sends it to the master model.]

SLIDE 16

A3C

[Diagram: the master model applies Δπ(∙) and returns the updated policy π’(∙) to the agent, which resumes playing after t_max steps.]

SLIDE 17

GA3C (INFERENCE)

[Diagram: Agents 1 … N push their states {St} into a shared prediction queue; predictor threads batch the states, run inference on the GPU master model, and return the actions {at} to the agents.]

SLIDE 18

GA3C (TRAINING)

[Diagram: Agents 1 … N push experiences {St, Rt} (with reward sequences R0 … R4) into a shared training queue; trainer threads batch them and compute the policy update Δπ(∙) on the GPU master model.]

SLIDE 19

GA3C

[Diagram: the full GA3C architecture combines both paths: a prediction queue feeding predictor threads for batched inference, and a training queue feeding trainer threads for batched updates Δπ(∙) of the single GPU-resident master model.]
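The real implementation is in the GA3C repository linked above; the following self-contained Python sketch (our toy version, with a random policy standing in for the DNN) shows only the queue pattern that decouples the agents from batched inference and training:

    import queue
    import random
    import threading

    prediction_q = queue.Queue()   # agents -> predictors: states awaiting an action
    training_q = queue.Queue()     # agents -> trainers: (state, reward) experiences

    def predictor():
        # Drain a batch of pending states, then answer each waiting agent.
        while True:
            batch = [prediction_q.get()]
            while not prediction_q.empty() and len(batch) < 32:
                batch.append(prediction_q.get())
            for reply_q, state in batch:            # a real predictor would run one
                reply_q.put(random.choice([0, 1]))  # batched DNN inference here

    def trainer():
        # Drain a batch of experiences; GA3C would compute Δπ on the GPU here.
        while True:
            batch = [training_q.get()]
            while not training_q.empty() and len(batch) < 32:
                batch.append(training_q.get())

    def agent(n_steps=100):
        reply_q = queue.Queue()
        state = 0
        for _ in range(n_steps):
            prediction_q.put((reply_q, state))
            action = reply_q.get()                  # block until an action arrives
            state, reward = state + action, random.random()  # toy environment step
            training_q.put((state, reward))

    threading.Thread(target=predictor, daemon=True).start()
    threading.Thread(target=trainer, daemon=True).start()
    agents = [threading.Thread(target=agent) for _ in range(4)]
    for t in agents:
        t.start()
    for t in agents:
        t.join()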

SLIDE 20

CPU & GPU UTILIZATION IN GA3C

The CPU runs the environment simulation; the GPU runs inference and training. For larger DNNs the system becomes bandwidth-limited and does not scale to multiple GPUs.

SLIDE 21

Role of t_max

t_max = 4 [play to the end: Monte Carlo]:

  • No variance (the collected rewards are real)
  • High bias (we played only once)
  • One update every t_max frames
SLIDE 22

Role of t_max

t_max = 2 [bootstrap from the value network]:

  • High variance (noisy value network)
  • Low bias (unbiased net, many agents)
  • More updates per second

[Figure: the rollout is cut after 2 steps; from there the value network estimates the remaining return, approximately 2.5.]

t_max affects bias, variance, and computational cost (number of updates per second, batch size).
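Written out (the standard n-step return used by A3C-style methods, reconstructed here rather than copied from the slide), the update target for t_max = n is

  R_t = \sum_{k=0}^{n-1} \delta^k r_{t+k} + \delta^{n} V(S_{t+n})

When n reaches the end of the episode, the V(∙) term vanishes (Monte Carlo: no value-network variance, but high bias from a single rollout); with a small n, the noisy V(∙) estimate adds variance while allowing more updates per second.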

SLIDE 23

Other parameters and stability

Hyperparameter search in 2015: The search for the optimal learning rate:

https://arxiv.org/pdf/1602.01783.pdf

SLIDE 24

GA3C on distributed systems

  • RL is unstable: use metaoptimization to search for optimal hyperparameters
  • E.g., the learning rate may affect the stability and speed of convergence
  • The discount factor δ affects the final aim (a short- or far-sighted agent)
  • The t_max factor affects the computational cost and stability* of GA3C
  • GA3C does not scale to multiple GPUs (bandwidth-limited), but we can run parallel instances of GA3C on a distributed system

* See G. Heinrich, I. Frosio, Metaoptimization on a Distributed System for Deep Reinforcement Learning, https://arxiv.org/abs/1902.02725.

SLIDE 25

AGENDA

Contents

  • Introduction to Reinforcement Learning
  • Introduction to Metaoptimization (on distributed systems) / MagLev
  • Metaoptimization and Reinforcement Learning (on distributed systems)
  • HyperTrick
  • Results
  • Conclusion

SLIDE 26

META OPTIMIZATION

Tuning a GA3C agent is as easy as flying a Concorde. [Image: Concorde cockpit; source: Christian Kath]

  • Topology parameters
    ▪ Number of layers and their width
    ▪ Choice of activations
  • Training parameters
    ▪ Learning rate
    ▪ Reward decay rate (δ)
    ▪ Back-propagation window size (t_max)
    ▪ Choice of optimizer
    ▪ Number of training episodes
  • Data parameters
    ▪ Environment model

Exhaustive search is intractable.

SLIDE 27

META OPTIMIZATION

How does a standard optimization algorithm fare? Example: Tree of Parzen Estimators (TPE).

  • Two parameters, one metric to minimize.
  • Optimization trade-offs:
    ▪ Exploitation vs. exploration.
    ▪ Wall time vs. resource efficiency.
  • Optimization packages start with a random search.
  • Tens of experiments are needed before historical records can be leveraged; warm starts are needed to cut down complexity over time.
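For readers who want to try this, a minimal two-parameter TPE run with the open-source hyperopt package (the talk does not say which TPE implementation was used, and the quadratic objective below is a stand-in for a real training run):

    from hyperopt import fmin, hp, tpe

    space = {
        "x": hp.uniform("x", -5.0, 5.0),
        "y": hp.uniform("y", -5.0, 5.0),
    }

    def objective(params):
        # Stand-in metric to minimize; a real run would train and score an agent.
        return (params["x"] - 1.0) ** 2 + (params["y"] + 2.0) ** 2

    # TPE starts with random search, then exploits the history of past trials.
    best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
    print(best)  # close to {"x": 1.0, "y": -2.0}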

SLIDE 28

META OPTIMIZATION

Metric variance: the need for diversity.

  • Non-determinism makes individual experiments inconclusive.
  • A change can only be considered an improvement if it works under a variety of conditions.

Meta optimization should be part of data scientists’ daily routine.
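One concrete way to act on this (our illustration, not from the talk): score every candidate configuration under several seeds, and trust the mean only when the improvement clears the spread:

    import random
    import statistics

    def evaluate(config, seed):
        # Stand-in for a full, noisy training run.
        random.seed(seed)
        return config["lr"] * 100 + random.gauss(0.0, 5.0)

    scores = [evaluate({"lr": 0.1}, seed) for seed in range(5)]
    print(statistics.mean(scores), statistics.stdev(scores))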

SLIDE 29

META OPTIMIZATION

Complex pipelines: the complexity of evaluating models.

  • Evaluation cannot be reduced to a single Python function or Docker container.
  • Meta optimization must be independent of task scheduling.

SLIDE 30

META OPTIMIZATION

Project MagLev: machine learning platform architecture.

  • A scalable platform for traceable machine learning workflows.
  • Self-documented experiments.
  • Services can be used in isolation, or combined for maximum traceability.

SLIDE 31

META OPTIMIZATION

MagLev experiment tracking: the knowledge base.

  • Experiment data is fully connected.
  • Objects are searched through their relationships with others.
  • No information silos.

Example records and their links:

  Model: id 6a-e7-59, job_id ae-45-4a, param_set_id 45-4a-26, dataset_id ac-73-fc, creator bob, created 2018-09-28
  Workflow: id fb-d5-7a, desc my_v0.1.1, spec ..., project DormRoom, creator bob, created 2018-09-28
  Job: id ae-45-4a, scm SHA:3487c, workflow_id fb-d5-7a, creator bob, created 2018-09-28
  ExperimentSet: id 03-ad-24, config yml, project DormRoom, creator bob, created 2018-09-28
  ParamSet: id 45-4a-26, exp_id 78-3f-58
  Metric: id 69-bd-4c, name xentropy, value 0.01, param_set_id 45-4a-26, job_id ae-45-4a, model_id 6a-e7-59, dataset_id ac-73-fc, creator bob, created 2018-09-28
  Dataset: id ac-73-fc, vdisk zzz://..., project DormRoom, creator bob, created 2018-09-28

SLIDE 32

META OPTIMIZATION

Main SDK features (typical setup):

  • All common parameter types.
  • Early-termination methods.
  • Standard and custom parameter-picking methods.

SLIDE 33

AGENDA

Contents

  • Introduction to Reinforcement Learning
  • Introduction to Metaoptimization (on distributed systems) / MagLev
  • Metaoptimization and Reinforcement Learning (on distributed systems)
  • HyperTrick
  • Results
  • Conclusion

SLIDE 34

META OPTIMIZATION + GA3C

Hyperparameters and a preview of results:

  • Learning rate: log-uniform distribution over the [1e-5, 1e-2] interval.
  • t_max: quantized (q=1) log-uniform distribution over the [2, 100] interval.
  • δ: one of {0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999}.
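Expressed in hyperopt-style notation (our mapping; the talk does not name the package used to express these distributions), the same search space reads:

    import math
    from hyperopt import hp

    space = {
        # log-uniform over [1e-5, 1e-2]
        "learning_rate": hp.loguniform("learning_rate", math.log(1e-5), math.log(1e-2)),
        # quantized (q=1) log-uniform over [2, 100]
        "t_max": hp.qloguniform("t_max", math.log(2), math.log(100), 1),
        # categorical choice for the discount factor
        "delta": hp.choice("delta", [0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999]),
    }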
SLIDE 35

AGENDA

Contents

  • Introduction to Reinforcement Learning
  • Introduction to Metaoptimization (on distributed systems) / MagLev
  • Metaoptimization and Reinforcement Learning (on distributed systems)
  • HyperTrick
  • Results
  • Conclusion

SLIDE 36

HYPERTRICK

Early termination without compromise.

Successive Halving (SH) [1]: terminate ½ of the workers every N·2^P units of work (P is a phase index). Requires synchronization between workers. Assumes relative performance over time is constant.

[1] https://arxiv.org/abs/1502.07943
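A minimal synchronous sketch of the SH schedule (our toy code; higher metric is better, and the toy metric below is a stand-in for training an agent for `budget` units of work):

    import random

    def successive_halving(configs, evaluate, base_units=1):
        # Phase P: give each survivor base_units * 2**P units of work,
        # then terminate the bottom half.
        phase = 0
        while len(configs) > 1:
            budget = base_units * 2 ** phase
            ranked = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
            configs = ranked[: len(ranked) // 2]
            phase += 1
        return configs[0]

    best = successive_halving(
        configs=[{"lr": 10 ** random.uniform(-5, -2)} for _ in range(16)],
        evaluate=lambda c, budget: -abs(c["lr"] - 1e-3),  # toy metric
    )
    print(best)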

SLIDE 37

HYPERTRICK

Early termination without compromise.

Successive Halving (SH) [1]: terminate ½ of the workers every N·2^P units of work (P is a phase index). Requires synchronization between workers. Assumes relative performance over time is constant.

HyperBand [2]: run several instances of SH in parallel with different values of N. Does not assume constant relative performance, but still requires synchronization.

[1] https://arxiv.org/abs/1502.07943
[2] http://arxiv.org/abs/1603.06560

SLIDE 38

HYPERTRICK

Early termination without compromise.

Successive Halving (SH) [1]: terminate ½ of the workers every N·2^P units of work (P is a phase index). Requires synchronization between workers. Assumes relative performance over time is constant.

HyperBand [2]: run several instances of SH in parallel with different values of N. Does not assume constant relative performance, but still requires synchronization.

HyperTrick [3]. Parameters:

  • N: total number of workers.
  • r: eviction rate per phase P.

Worker:

    while maglev.should_continue():
        work()
        maglev.report_metrics()

MagLev:

  • Let the earliest N(1-√r)(1-r)^P workers run.
  • Others are terminated if in the bottom √r quantile.
  • The expected number of workers at the end of phase P is N(1-r)^P.

[1] https://arxiv.org/abs/1502.07943
[2] http://arxiv.org/abs/1603.06560
[3] https://arxiv.org/abs/1902.02725
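A sketch of the eviction rule as we read it from this slide (see the HyperTrick paper [3] for the authoritative formulation; the function name and arguments here are our own):

    import math

    def should_continue(arrival_rank, metric_quantile, N, r, phase):
        # The earliest arrivals at a phase boundary get a free pass; later
        # arrivals continue only if they escape the bottom sqrt(r) quantile
        # of reported metrics. No synchronization is needed: each worker is
        # judged individually when it reaches the boundary.
        free_pass = N * (1 - math.sqrt(r)) * (1 - r) ** phase
        if arrival_rank < free_pass:
            return True
        return metric_quantile > math.sqrt(r)

    # With N=16 and r=0.5, roughly 8 workers are expected to survive phase 0:
    # about 4.7 by arriving early, the rest by scoring well enough.
    print(should_continue(arrival_rank=10, metric_quantile=0.8, N=16, r=0.5, phase=0))

Note the self-consistency: a fraction (1-√r) survives by the free pass and √r·(1-√r) by metric, so (1-√r)(1+√r) = 1-r of the workers survive each phase, matching the N(1-r)^P expectation above.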

SLIDE 39

HYPERTRICK vs. SUCCESSIVE HALVING

16 workers on 6 nodes running up to 4 iterations.

[Timelines: Successive Halving must support preemption (workers are context-switched when they outnumber nodes); HyperTrick needs no context switches and achieves a shorter wall time.]

SLIDE 40

AGENDA

Contents

  • Introduction to Reinforcement Learning
  • Introduction to Metaoptimization (on distributed systems) / MagLev
  • Metaoptimization and Reinforcement Learning (on distributed systems)
  • HyperTrick
  • Results
  • Conclusion

SLIDE 41

PONG

Videos of trained agents: δ=0.9 (short-sighted) vs. δ=0.995 (far-sighted).

SLIDE 42

HYPERTRICK

Terminate underperformers.

[Figure: training timelines for Boxing, Centipede, Pacman, and Pong; the unpainted area represents saved resources.]

SLIDE 43

HYPERTRICK

Comparison against HyperBand.

[Figures: cluster occupancy (Pong); timelines (Pong); best score vs. wall time (Pong).]

SLIDE 44

META OPTIMIZATION

Experimental comparison of HyperTrick vs. HyperBand.

SLIDE 45

META OPTIMIZATION + GA3C

Conclusion / take-aways:

  • RL is unstable, so meta optimization is useful.
  • HyperTrick is a new algorithm for metaoptimization.
  • GA3C + HyperTrick + MagLev is effective.
  • Our paper: https://arxiv.org/abs/1902.02725
  • GA3C: https://github.com/NVlabs/GA3C
  • MagLev info: Yehia Khoja (ykhoja@nvidia.com)

Related talks:

  S9649, Wed 9:00am: NVIDIA's AI Infrastructure for Self-Driving Cars (Clement Farabet)
  S9613, Wed 10:00am: Deep Active Learning (Adam Lesnikowski)
  S9911, Wed 2:00pm: Determinism In Deep Learning (Duncan Riach)
  S9630, Thu 2:00pm: Scaling Up DL for Autonomous Driving (Jose Alvarez)
  S9987, Thu 9:00am: MagLev: production-grade AI platform... (Divya Vavili)