

SLIDE 1

Distributed Meta Optimization of Reinforcement Learning Agents

Greg Heinrich, Iuri Frosio - GTC San Jose, March 2019

SLIDE 2

AGENDA

Contents

  • Introduction to Reinforcement Learning
  • Introduction to Metaoptimization (on distributed systems) / MagLev
  • Metaoptimization and Reinforcement Learning (on distributed systems)
  • HyperTrick
  • Results
  • Conclusion

SLIDE 3

GPU-Based A3C for Deep Reinforcement Learning (RL) keywords: GPU, A3C, RL

  • M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017

(available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl). Open source implementation: https://github.com/NVlabs/GA3C.

SLIDE 4

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Learning to accomplish a task

Image from www.33rdsquare.com

SLIDE 5

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

[Diagram: the RL agent observes the environment state St and a reward Rt, and responds with an action at = π(St).]

  ✓ Environment
  ✓ Agent
  ✓ Observable status St
  ✓ Reward Rt
  ✓ Action at
  ✓ Policy at = π(St)

SLIDE 6

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

[Diagram: the same loop with a deep neural network as the agent: the deep RL agent maps the observed state St and reward Rt to an action at = π(St).]

SLIDE 7

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

[Diagram: along a rollout the agent collects the rewards R0 … R4 and uses them to compute a policy update Δπ(∙).]

SLIDE 8

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Objective: maximize the expected discounted reward. The value of a state is the expected discounted reward collected from that state onward:

  V(s) = E[ Rt + δ Rt+1 + δ² Rt+2 + … | St = s ]

The discount factor δ determines how short- or far-sighted the agent is: 0 < δ < 1, usually 0.99.
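To make this concrete, here is a minimal Python sketch (ours, not from the talk) that computes the discounted return of a reward sequence with δ = 0.99:

    # Discounted return: R = r_0 + delta*r_1 + delta^2*r_2 + ...
    def discounted_return(rewards, delta=0.99):
        total = 0.0
        for r in reversed(rewards):  # accumulate from the last reward backward
            total = r + delta * total
        return total

    # A reward of 1 arriving two steps in the future is worth 0.99**2 = 0.9801 now.
    print(discounted_return([0.0, 0.0, 1.0]))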

SLIDE 9

SLIDE 10

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Asynchronous Advantage Actor-Critic (Mnih et al., arXiv:1602.01783, 2016)

[Diagram: Agents 1, 2, …, 16 each interact with their own environment copy, collecting St, Rt and reward sequences R0 … R4 under at = π(St); each agent sends policy updates Δπ(∙) to the master model and receives the updated policy π’(∙) back.]

SLIDE 11

SLIDE 12

MAPPING DEEP PROBLEMS TO A GPU

REGRESSION, CLASSIFICATION, …: large batches of data and labels ("pear, pear, pear, …", "empty, empty, …", "fig, fig, fig, …", "strawberry, strawberry, …") keep the GPU at 100% utilization / occupancy.

REINFORCEMENT LEARNING: the workload is a loop of status and reward in, action out; how do we keep the GPU fully utilized?

SLIDE 13

A3C

[Diagram: Agents 1, 2, …, 16 each roll out up to t_max steps, collecting St, Rt and reward sequences R0 … R4 while acting with at = π(St).]

SLIDE 14

A3C

[Diagram: the agents keep interacting with their environment copies, accumulating experience.]

SLIDE 15

A3C

[Diagram: after t_max steps, an agent computes a policy update Δπ(∙) from its rollout and sends it to the master model.]

SLIDE 16

A3C

[Diagram: the master model applies Δπ(∙) and returns the updated policy π’(∙) to the agent, which resumes playing after t_max steps.]

SLIDE 17

GA3C (INFERENCE)

[Diagram: Agents 1 … N push their states {St} into a shared prediction queue; predictor threads batch the states, run inference on the GPU master model, and return the actions {at} to the agents.]

SLIDE 18

GA3C (TRAINING)

[Diagram: Agents 1 … N push experiences {St, Rt} (with reward sequences R0 … R4) into a shared training queue; trainer threads batch them and compute the policy update Δπ(∙) on the GPU master model.]

SLIDE 19

GA3C

[Diagram: the full GA3C architecture combines both paths: a prediction queue feeding predictor threads for batched inference, and a training queue feeding trainer threads for batched updates Δπ(∙) of the single GPU-resident master model.]
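The real implementation is in the GA3C repository linked above; the following self-contained Python sketch (our toy version, with a random policy standing in for the DNN) shows only the queue pattern that decouples the agents from batched inference and training:

    import queue
    import random
    import threading

    prediction_q = queue.Queue()   # agents -> predictors: states awaiting an action
    training_q = queue.Queue()     # agents -> trainers: (state, reward) experiences

    def predictor():
        # Drain a batch of pending states, then answer each waiting agent.
        while True:
            batch = [prediction_q.get()]
            while not prediction_q.empty() and len(batch) < 32:
                batch.append(prediction_q.get())
            for reply_q, state in batch:            # a real predictor would run one
                reply_q.put(random.choice([0, 1]))  # batched DNN inference here

    def trainer():
        # Drain a batch of experiences; GA3C would compute Δπ on the GPU here.
        while True:
            batch = [training_q.get()]
            while not training_q.empty() and len(batch) < 32:
                batch.append(training_q.get())

    def agent(n_steps=100):
        reply_q = queue.Queue()
        state = 0
        for _ in range(n_steps):
            prediction_q.put((reply_q, state))
            action = reply_q.get()                  # block until an action arrives
            state, reward = state + action, random.random()  # toy environment step
            training_q.put((state, reward))

    threading.Thread(target=predictor, daemon=True).start()
    threading.Thread(target=trainer, daemon=True).start()
    agents = [threading.Thread(target=agent) for _ in range(4)]
    for t in agents:
        t.start()
    for t in agents:
        t.join()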

SLIDE 20

CPU & GPU UTILIZATION IN GA3C

The CPU runs the environment simulation; the GPU runs inference and training. For larger DNNs the system becomes bandwidth-limited and does not scale to multiple GPUs.

SLIDE 21

Role of t_max

t_max = 4 [play to the end: Monte Carlo]:

  • No variance (the collected rewards are real)
  • High bias (we played only once)
  • One update every t_max frames
SLIDE 22

Role of t_max

t_max = 2 [bootstrap from the value network]:

  • High variance (noisy value network)
  • Low bias (unbiased net, many agents)
  • More updates per second

[Figure: the rollout is cut after 2 steps; from there the value network estimates the remaining return, approximately 2.5.]

t_max affects bias, variance, and computational cost (number of updates per second, batch size).
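Written out (the standard n-step return used by A3C-style methods, reconstructed here rather than copied from the slide), the update target for t_max = n is

  R_t = \sum_{k=0}^{n-1} \delta^k r_{t+k} + \delta^{n} V(S_{t+n})

When n reaches the end of the episode, the V(∙) term vanishes (Monte Carlo: no value-network variance, but high bias from a single rollout); with a small n, the noisy V(∙) estimate adds variance while allowing more updates per second.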

SLIDE 23

Other parameters and stability

Hyperparameter search in 2015: The search for the optimal learning rate:

https://arxiv.org/pdf/1602.01783.pdf

SLIDE 24

GA3C on distributed systems

  • RL is unstable: use metaoptimization to search for optimal hyperparameters
  • E.g., the learning rate may affect the stability and speed of convergence
  • The discount factor δ affects the final aim (a short- or far-sighted agent)
  • The t_max factor affects the computational cost and stability* of GA3C
  • GA3C does not scale to multiple GPUs (bandwidth-limited), but we can run parallel instances of GA3C on a distributed system

* See G. Heinrich, I. Frosio, Metaoptimization on a Distributed System for Deep Reinforcement Learning, https://arxiv.org/abs/1902.02725.

SLIDE 25

AGENDA

Contents

  • Introduction to Reinforcement Learning
  • Introduction to Metaoptimization (on distributed systems) / MagLev
  • Metaoptimization and Reinforcement Learning (on distributed systems)
  • HyperTrick
  • Results
  • Conclusion

SLIDE 26

META OPTIMIZATION

Tuning a GA3C agent is as easy as flying a Concorde. [Image: Concorde cockpit; source: Christian Kath]

  • Topology parameters
    ▪ Number of layers and their width
    ▪ Choice of activations
  • Training parameters
    ▪ Learning rate
    ▪ Reward decay rate (δ)
    ▪ Back-propagation window size (t_max)
    ▪ Choice of optimizer
    ▪ Number of training episodes
  • Data parameters
    ▪ Environment model

Exhaustive search is intractable.

SLIDE 27

META OPTIMIZATION

How does a standard optimization algorithm fare? Example: Tree of Parzen Estimators (TPE).

  • Two parameters, one metric to minimize.
  • Optimization trade-offs:
    ▪ Exploitation vs. exploration.
    ▪ Wall time vs. resource efficiency.
  • Optimization packages start with a random search.
  • Tens of experiments are needed before historical records can be leveraged; warm starts are needed to cut down complexity over time.
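For readers who want to try this, a minimal two-parameter TPE run with the open-source hyperopt package (the talk does not say which TPE implementation was used, and the quadratic objective below is a stand-in for a real training run):

    from hyperopt import fmin, hp, tpe

    space = {
        "x": hp.uniform("x", -5.0, 5.0),
        "y": hp.uniform("y", -5.0, 5.0),
    }

    def objective(params):
        # Stand-in metric to minimize; a real run would train and score an agent.
        return (params["x"] - 1.0) ** 2 + (params["y"] + 2.0) ** 2

    # TPE starts with random search, then exploits the history of past trials.
    best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
    print(best)  # close to {"x": 1.0, "y": -2.0}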

SLIDE 28

META OPTIMIZATION

Metric variance: the need for diversity.

  • Non-determinism makes individual experiments inconclusive.
  • A change can only be considered an improvement if it works under a variety of conditions.

Meta optimization should be part of data scientists’ daily routine.
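One concrete way to act on this (our illustration, not from the talk): score every candidate configuration under several seeds, and trust the mean only when the improvement clears the spread:

    import random
    import statistics

    def evaluate(config, seed):
        # Stand-in for a full, noisy training run.
        random.seed(seed)
        return config["lr"] * 100 + random.gauss(0.0, 5.0)

    scores = [evaluate({"lr": 0.1}, seed) for seed in range(5)]
    print(statistics.mean(scores), statistics.stdev(scores))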

SLIDE 29

META OPTIMIZATION

Complex pipelines: the complexity of evaluating models.

  • Evaluation cannot be reduced to a single Python function or Docker container.
  • Meta optimization must be independent of task scheduling.

SLIDE 30

META OPTIMIZATION

Project MagLev: machine learning platform architecture.

  • A scalable platform for traceable machine learning workflows.
  • Self-documented experiments.
  • Services can be used in isolation, or combined for maximum traceability.

SLIDE 31

META OPTIMIZATION

MagLev experiment tracking: the knowledge base.

  • Experiment data is fully connected.
  • Objects are searched through their relationships with others.
  • No information silos.

Example records and their links:

  Model: id 6a-e7-59, job_id ae-45-4a, param_set_id 45-4a-26, dataset_id ac-73-fc, creator bob, created 2018-09-28
  Workflow: id fb-d5-7a, desc my_v0.1.1, spec ..., project DormRoom, creator bob, created 2018-09-28
  Job: id ae-45-4a, scm SHA:3487c, workflow_id fb-d5-7a, creator bob, created 2018-09-28
  ExperimentSet: id 03-ad-24, config yml, project DormRoom, creator bob, created 2018-09-28
  ParamSet: id 45-4a-26, exp_id 78-3f-58
  Metric: id 69-bd-4c, name xentropy, value 0.01, param_set_id 45-4a-26, job_id ae-45-4a, model_id 6a-e7-59, dataset_id ac-73-fc, creator bob, created 2018-09-28
  Dataset: id ac-73-fc, vdisk zzz://..., project DormRoom, creator bob, created 2018-09-28

SLIDE 32

META OPTIMIZATION

Main SDK features (typical setup):

  • All common parameter types.
  • Early-termination methods.
  • Standard and custom parameter-picking methods.

SLIDE 33

AGENDA

Contents

  • Introduction to Reinforcement Learning
  • Introduction to Metaoptimization (on distributed systems) / MagLev
  • Metaoptimization and Reinforcement Learning (on distributed systems)
  • HyperTrick
  • Results
  • Conclusion

SLIDE 34

META OPTIMIZATION + GA3C

Hyperparameters and a preview of results:

  • Learning rate: log-uniform distribution over the [1e-5, 1e-2] interval.
  • t_max: quantized (q=1) log-uniform distribution over the [2, 100] interval.
  • δ: one of {0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999}.
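Expressed in hyperopt-style notation (our mapping; the talk does not name the package used to express these distributions), the same search space reads:

    import math
    from hyperopt import hp

    space = {
        # log-uniform over [1e-5, 1e-2]
        "learning_rate": hp.loguniform("learning_rate", math.log(1e-5), math.log(1e-2)),
        # quantized (q=1) log-uniform over [2, 100]
        "t_max": hp.qloguniform("t_max", math.log(2), math.log(100), 1),
        # categorical choice for the discount factor
        "delta": hp.choice("delta", [0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999]),
    }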
SLIDE 35

AGENDA

Contents

  • Introduction to Reinforcement Learning
  • Introduction to Metaoptimization (on distributed systems) / MagLev
  • Metaoptimization and Reinforcement Learning (on distributed systems)
  • HyperTrick
  • Results
  • Conclusion

SLIDE 36

HYPERTRICK

Early termination without compromise.

Successive Halving (SH) [1]: terminate ½ of the workers every N·2^P units of work (P is a phase index). Requires synchronization between workers. Assumes relative performance over time is constant.

[1] https://arxiv.org/abs/1502.07943
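A minimal synchronous sketch of the SH schedule (our toy code; higher metric is better, and the toy metric below is a stand-in for training an agent for `budget` units of work):

    import random

    def successive_halving(configs, evaluate, base_units=1):
        # Phase P: give each survivor base_units * 2**P units of work,
        # then terminate the bottom half.
        phase = 0
        while len(configs) > 1:
            budget = base_units * 2 ** phase
            ranked = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
            configs = ranked[: len(ranked) // 2]
            phase += 1
        return configs[0]

    best = successive_halving(
        configs=[{"lr": 10 ** random.uniform(-5, -2)} for _ in range(16)],
        evaluate=lambda c, budget: -abs(c["lr"] - 1e-3),  # toy metric
    )
    print(best)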

SLIDE 37

HYPERTRICK

Early termination without compromise.

Successive Halving (SH) [1]: terminate ½ of the workers every N·2^P units of work (P is a phase index). Requires synchronization between workers. Assumes relative performance over time is constant.

HyperBand [2]: run several instances of SH in parallel with different values of N. Does not assume constant relative performance, but still requires synchronization.

[1] https://arxiv.org/abs/1502.07943
[2] http://arxiv.org/abs/1603.06560

SLIDE 38

HYPERTRICK

Early termination without compromise.

Successive Halving (SH) [1]: terminate ½ of the workers every N·2^P units of work (P is a phase index). Requires synchronization between workers. Assumes relative performance over time is constant.

HyperBand [2]: run several instances of SH in parallel with different values of N. Does not assume constant relative performance, but still requires synchronization.

HyperTrick [3]. Parameters:

  • N: total number of workers.
  • r: eviction rate per phase P.

Worker:

    while maglev.should_continue():
        work()
        maglev.report_metrics()

MagLev:

  • Let the earliest N(1-√r)(1-r)^P workers run.
  • Others are terminated if in the bottom √r quantile.
  • The expected number of workers at the end of phase P is N(1-r)^P.

[1] https://arxiv.org/abs/1502.07943
[2] http://arxiv.org/abs/1603.06560
[3] https://arxiv.org/abs/1902.02725
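A sketch of the eviction rule as we read it from this slide (see the HyperTrick paper [3] for the authoritative formulation; the function name and arguments here are our own):

    import math

    def should_continue(arrival_rank, metric_quantile, N, r, phase):
        # The earliest arrivals at a phase boundary get a free pass; later
        # arrivals continue only if they escape the bottom sqrt(r) quantile
        # of reported metrics. No synchronization is needed: each worker is
        # judged individually when it reaches the boundary.
        free_pass = N * (1 - math.sqrt(r)) * (1 - r) ** phase
        if arrival_rank < free_pass:
            return True
        return metric_quantile > math.sqrt(r)

    # With N=16 and r=0.5, roughly 8 workers are expected to survive phase 0:
    # about 4.7 by arriving early, the rest by scoring well enough.
    print(should_continue(arrival_rank=10, metric_quantile=0.8, N=16, r=0.5, phase=0))

Note the self-consistency: a fraction (1-√r) survives by the free pass and √r·(1-√r) by metric, so (1-√r)(1+√r) = 1-r of the workers survive each phase, matching the N(1-r)^P expectation above.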

SLIDE 39

HYPERTRICK vs. SUCCESSIVE HALVING

16 workers on 6 nodes running up to 4 iterations.

[Timelines: Successive Halving must support preemption (workers are context-switched when they outnumber nodes); HyperTrick needs no context switches and achieves a shorter wall time.]

SLIDE 40

AGENDA

Contents

  • Introduction to Reinforcement Learning
  • Introduction to Metaoptimization (on distributed systems) / MagLev
  • Metaoptimization and Reinforcement Learning (on distributed systems)
  • HyperTrick
  • Results
  • Conclusion

SLIDE 41

PONG

Videos of trained agents: δ=0.9 (short-sighted) vs. δ=0.995 (far-sighted).

SLIDE 42

HYPERTRICK

Terminate underperformers.

[Figure: training timelines for Boxing, Centipede, Pacman, and Pong; the unpainted area represents saved resources.]

SLIDE 43

HYPERTRICK

Comparison against HyperBand.

[Figures: cluster occupancy (Pong); timelines (Pong); best score vs. wall time (Pong).]

SLIDE 44

META OPTIMIZATION

Experimental comparison of HyperTrick vs. HyperBand.

SLIDE 45

META OPTIMIZATION + GA3C

Conclusion / take-aways:

  • RL is unstable, so meta optimization is useful.
  • HyperTrick is a new algorithm for metaoptimization.
  • GA3C + HyperTrick + MagLev is effective.
  • Our paper: https://arxiv.org/abs/1902.02725
  • GA3C: https://github.com/NVlabs/GA3C
  • MagLev info: Yehia Khoja (ykhoja@nvidia.com)

Related talks:

  S9649, Wed 9:00am: NVIDIA's AI Infrastructure for Self-Driving Cars (Clement Farabet)
  S9613, Wed 10:00am: Deep Active Learning (Adam Lesnikowski)
  S9911, Wed 2:00pm: Determinism In Deep Learning (Duncan Riach)
  S9630, Thu 2:00pm: Scaling Up DL for Autonomous Driving (Jose Alvarez)
  S9987, Thu 9:00am: MagLev: production-grade AI platform... (Divya Vavili)