Learning to Optimize as Policy Learning
Yisong Yue

Policy Learning (Reinforcement & Imitation)
- Agent interacts with the environment/world: observe state/context s_t, take an action, environment returns s_{t+1}
- Goal: Find "Optimal" Policy
- Imitation Learning: optimize imitation loss
- Reinforcement Learning: optimize environmental reward
Learning-based Approach for Sequential Decision Making

Basic Formulation
- Policy: π: s → P(a)   (typically a neural net)
- Roll-out: τ = (s_0, a_0, s_1, a_1, s_2, ...)   (aka trace or trajectory)
- Objective: ∑_t r(s_t, a_t)
- Transition function: P(s'|s, a)
[Figure: agent/environment loop; the agent observes state/context s_t, acts, and the environment/world returns s_{t+1}]
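As a minimal concrete reading of this formulation, the sketch below rolls out a policy and accumulates the objective. The `env` and `policy` interfaces are hypothetical placeholders, not from any of the cited papers.

```python
# Minimal sketch of the basic formulation (hypothetical interfaces).
def rollout(env, policy, horizon):
    """Roll out `policy` in `env`; return the trace tau and total reward."""
    s = env.reset()                  # initial state/context s_0
    trace, total_reward = [], 0.0
    for _ in range(horizon):
        a = policy(s)                # a ~ pi(s)
        s_next, r = env.step(a)      # transition via P(s'|s, a)
        trace.append((s, a))         # tau = (s_0, a_0, s_1, a_1, ...)
        total_reward += r            # objective: sum_t r(s_t, a_t)
        s = s_next
    return trace, total_reward
```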
Optimization as Sequential Decision Making
- Many Solvers are Sequential
  - Tree-Search
  - Greedy
  - Gradient Descent
- Can view the solver as an "agent" or "policy"
  - State = intermediate solution
  - Find a state with high reward (i.e., a good solution)
  - Learn better local decision making
- This talk:
  - Formalize learning problems
  - Builds upon modern RL/IL
  - Theoretical analysis/guarantees
  - Interesting algorithms
Example #1: Learning to Search (Discrete)
- Integer Program → Tree-Search (Branch and Bound)
- State = partial search tree (need to featurize)
- Action = variable selection or branching
- Sparse reward: only at feasible solutions
- Properties:
  - Deterministic state transitions
  - Massive state space
  - Sparse rewards
- (See the sketch below.)
[He et al., 2014] [Khalil et al., 2016] [Song et al., arXiv]
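The sketch below shows where a learned policy slots into branch and bound: each branching decision is an action taken on a featurized partial search tree. The node methods, `featurize`, and the bounding logic are hypothetical placeholders, not the implementation from the cited papers.

```python
# Sketch: branch and bound (maximization) with a learned branching policy.
# Nodes are assumed comparable (ordered by bound) for best-first search.
import heapq

def branch_and_bound(root, policy, featurize):
    best, frontier = None, [root]
    while frontier:
        node = heapq.heappop(frontier)               # node selection
        if node.is_feasible():                       # sparse reward happens here
            if best is None or node.objective() > best.objective():
                best = node
            continue
        bound = best.objective() if best else float("-inf")
        if node.upper_bound() <= bound:
            continue                                 # prune
        var = policy(featurize(node))                # learned variable selection
        for child in node.branch_on(var):            # branching = action
            heapq.heappush(frontier, child)
    return best
```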
Example #2: Learning Greedy Algorithms (Discrete)
- Contextual Submodular Maximization: argmax_{Ψ: |Ψ| ≤ k} F(Ψ)
- Greedy Sequential Selection: Ψ ← Ψ ⊕ argmax_a F(Ψ ⊕ a)
- Train policy to mimic greedy: π: s → a
  - State s = (context, Ψ); F is not available at test time
- Properties:
  - Deterministic state transitions
  - Massive state space
  - Dense rewards
  - Note: not learning the submodular function F itself (see the sketch below)
[Figure: given the context/environment, select trajectories from a dictionary of trajectories]
Learning Policies for Contextual Submodular Prediction, S. Ross, R. Zhou, Y. Yue, D. Dey, J.A. Bagnell, ICML 2013
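A minimal sketch of the greedy roll-in and the (state, action) labels it produces for imitation. `F`, `candidates`, and the state encoding are hypothetical placeholders; note that F is queried only at training time, matching the slide.

```python
# Sketch: greedy selection for contextual submodular maximization, recording
# (state, action) pairs so a policy pi: s -> a can be trained to mimic it.
def greedy_with_labels(x, F, candidates, k):
    selected, data = [], []
    for _ in range(k):
        remaining = [a for a in candidates if a not in selected]
        a_star = max(remaining, key=lambda a: F(selected + [a]))  # argmax_a F(Psi (+) a)
        data.append(((x, tuple(selected)), a_star))  # state s = (x, Psi), label a*
        selected.append(a_star)                      # Psi <- Psi (+) a*
    return selected, data
```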
Example #3: Iterative Amortized Inference (Continuous)
- Gradient-descent-style updates: useful for accelerating variational inference
- State = description of the problem & current point
- Action = next point
- Properties:
  - (Mostly) deterministic state transitions
  - Continuous state space
  - Dense rewards
- Simplest case: one-shot inference
  - "Variational Autoencoders" [Kingma & Welling, ICLR 2014]
- (See the sketch below.)
Iterative Amortized Inference, Joe Marino, Yisong Yue, Stephan Mandt, ICML 2018
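A minimal sketch of the gradient-descent-style update, assuming a hypothetical `update_net` and `elbo`; the actual models in Marino et al. (ICML 2018) differ in detail.

```python
# Sketch: iterative amortized inference. A learned update network refines
# variational parameters `lam` using the current ELBO gradient, in place of
# a fixed gradient-descent step rule.
import torch

def iterate_inference(update_net, elbo, lam, x, n_steps):
    for _ in range(n_steps):
        lam = lam.detach().requires_grad_(True)
        loss = -elbo(lam, x)                  # negative ELBO for observation x
        grad, = torch.autograd.grad(loss, lam)
        lam = lam + update_net(lam, grad)     # learned update (the "action")
        # a fixed optimizer would instead do: lam = lam - step_size * grad
    return lam.detach()
```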
Optimization as Sequential Decision Making
Learning to Search
- Discrete Optimization (Tree Search), Sparse Rewards
- Learning to Search via Retrospective Imitation [arXiv]
- Co-training for Policy Learning [UAI 2019]
Contextual Submodular Maximization
- Discrete Optimization (Greedy), Dense Rewards
- Learning Policies for Contextual Submodular Prediction [ICML 2013]
Learning to Infer
- Continuous Optimization (Gradient-style), Dense Rewards
- Iterative Amortized Inference [ICML 2018]
- A General Method for Amortizing Variational Filtering [NeurIPS 2018]
(Pictured collaborators: Stephane Ross, Joe Marino, Jialin Song)
Learning to Optimize for Tree Search
- Idea #1: Treat as Standard RL
- Randomly explore for high rewards
- Very hard exploration problem!
- Issues: massive state space & sparse rewards
Learning to Optimize for Tree Search
- Idea #2: Treat as Standard IL
- Convert to Supervised Learning
- Assume access to solved instances
- Training Data ("demonstration data"): D* = {(s, a)}
- Basic IL: argmin_{π∈Π} L_{D*}(π) ≡ E_{(s,a)~D*} ℓ(a, π(s))
- Behavioral Cloning (see the sketch below)
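Behavioral cloning is exactly this supervised loop. The sketch below assumes a hypothetical `policy_net`, loss, and demo format; it is not tied to any particular paper's setup.

```python
# Sketch: behavioral cloning = supervised learning on demonstrations D*.
import torch

def behavioral_cloning(policy_net, loss_fn, demos, n_epochs, lr=1e-3):
    opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
    for _ in range(n_epochs):
        for s, a in demos:                    # (state features, demonstrated action)
            opt.zero_grad()
            loss = loss_fn(policy_net(s), a)  # l(a, pi(s))
            loss.backward()
            opt.step()
    return policy_net
```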
Challenges w/ Imitation Learning
- Issues with Behavioral Cloning
  - We minimize L_{D*} ... what are the implications?
  - If π makes a mistake early, is the subsequent state distribution still ≈ D*? (No: the distribution drifts.)
  - Some extensions to Interactive IL [He et al., NeurIPS 2014]; our approach is also interactive IL
- Demonstrations are not available on large problems
  - How to (formally) bootstrap from smaller problems? Our approach gives one solution
  - Bridging the gap between IL & RL
Retrospective Imitation
- Given:
  - A family of distributions of search problems
  - The family is parameterized by size/difficulty (levels k = 1, ..., K)
  - Solved instances on the smallest/easiest level ("demonstrations")
- Goal: an interactive IL approach that
  - Can scale up from the smallest/easiest instances
  - Has formal guarantees
- Connections to curriculum & transfer learning
Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
Retrospective Imitation
- Two-Stage Algorithm
  - Core Algorithm
    - Fixed problem difficulty
    - Reductions to supervised learning
    - Interactive IL w/ sparse environmental rewards
  - Full Algorithm w/ Scaling Up
    - Uses the Core Algorithm as a subroutine
Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
Supervised Learning Reduction
[Figure 1: a visualization of retrospective imitation learning, depicting the components of Algorithm 1 on expert traces vs. roll-out traces: (1) initial learning; (2) policy roll-out (optional exploration); (3) retrospective oracle (Algorithm 2); (4) policy update with further learning.]
Retrospective Imitation (Core Algorithm)
[Flowchart: derive environmental feedback from each roll-out, then repeat.]
Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
Retrospective Imitation (Full Algorithm)
[Flowchart: initialize k = 1 and a base solver (Gurobi/SCIP/CPLEX); obtain instances & demonstrations at difficulty k; run the Core Algorithm; use the trained policy to bootstrap difficulty k+1; set k = k+1 and repeat.]
Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
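A sketch of that scaling loop under assumed interfaces; `sample_instances`, `base_solver_demos`, `core_algorithm`, and `collect_traces` are hypothetical stand-ins for the paper's components.

```python
# Sketch: full algorithm = core algorithm + scaling up over difficulty levels.
def retrospective_imitation_full(K, sample_instances, base_solver_demos,
                                 core_algorithm):
    # base-solver demonstrations only at the smallest difficulty k = 1
    demos = base_solver_demos(sample_instances(1))
    policy = None
    for k in range(1, K + 1):
        policy = core_algorithm(policy, sample_instances(k), demos)
        if k < K:
            # the trained policy bootstraps demonstrations at difficulty k + 1
            demos = policy.collect_traces(sample_instances(k + 1))
    return policy
```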
Core Algorithm
- Does this converge?
- Converges to what?
Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
Imitation Learning Tutorial (ICML 2018)
Yisong Yue (yyue@caltech.edu, @YisongYue, yisongyue.com)
Hoang M. Le (hmle@caltech.edu, @HoangMinhLe, hoangle.info)
https://sites.google.com/view/icml2018-imitation-learning/
Issues w/ Distribution Drift & Imitation Signal
- Demonstrations from the initial solver: D* = {(s, a)}
- Supervised learning: argmin_{π∈Π} L_{D*}(π) ≡ E_{(s,a)~D*} ℓ(a, π(s))
  - a is the "correct" decision in state s; the minimization itself is an oracle call to TensorFlow/PyTorch/etc.
- But: which input states? Correct relative to what? If π achieves low error on D*, so what?
Interactive Imitation Learning (Core Alg)
- First popularized by [Daume et al., 2009] [Ross et al., 2011]
- Basic idea (repeat with i = i + 1):
  - Train π_i = argmin_{π∈Π} L_{D_{i−1}}(π)   (supervised learning)
  - Roll out π_i on instances, collect traces τ
  - Demonstrator converts τ into per-state feedback D̂_i
  - Data aggregation: D_i = D̂_i ∪ D_{i−1}
- (See the sketch below.)
Search-Based Structured Prediction, Daume, Langford, Marcu, Machine Learning Journal 2009
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross, Gordon, Bagnell, AISTATS 2011
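The same loop as a sketch; `train`, `rollout`, and `per_state_feedback` are hypothetical stand-ins for the supervised learner, the environment interaction, and the demonstrator.

```python
# Sketch: interactive IL with data aggregation (DAgger-style).
def interactive_il(D0, instances, n_iters, train, rollout, per_state_feedback):
    D = list(D0)                               # initial demonstrations
    policy = train(D)
    for _ in range(n_iters):
        traces = [rollout(policy, inst) for inst in instances]
        D_hat = per_state_feedback(traces)     # demonstrator labels visited states
        D = D_hat + D                          # aggregation: D_i = D_hat_i U D_{i-1}
        policy = train(D)                      # reduction to supervised learning
    return policy
```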
Learns to Correct its Own Mistakes
- Convergence guarantees:
  - (1/T) ∑_{i=1}^{T} L_{D_i}(π_i) → min_{π∈Π} (1/T) ∑_{i=1}^{T} L_{D_i}(π)
  - Follow-the-Leader argument
  - Also studied in [He et al., NeurIPS 2014]
- Requires defining "correct" → the Retrospective Oracle
Policy Rollout → Retrospective Oracle Feedback
- Retrospective oracle: identifies the best solution found by the policy rollout, and the path of nodes (red) leading to it
- Feedback: (red > white) for all (red, white) node pairs in the trajectory
- (See the sketch below.)
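A sketch of how such preference feedback could be extracted from a rollout's search tree. The tree/node interfaces are hypothetical, and it assumes the rollout reached at least one feasible solution (the sparse reward).

```python
# Sketch: retrospective oracle. Find the best solution the rollout reached,
# mark the path to it ("red" nodes), and emit (red > white) preference pairs.
def retrospective_feedback(search_tree):
    best_leaf = max(search_tree.feasible_leaves(),
                    key=lambda leaf: leaf.objective())
    on_path, node = set(), best_leaf
    while node is not None:                   # trace back the "red" path
        on_path.add(node)
        node = node.parent
    feedback = []
    for node in on_path:
        for sibling in node.siblings():
            if sibling not in on_path:        # (red > white) preference pair
                feedback.append((node, sibling))
    return feedback
```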
Core Algorithm Summary
- Sequence of learning reductions
- Leverages the Retrospective Oracle to define "correct"
  - Relies on sparse environmental rewards
- Converges to a near-optimal policy in the class
- Offloads computational challenges to a Supervised Learning Oracle
- For supervised learning error ε:
  Expected Search Length = T* / (1 − 2ε),
  where T* is the optimal search length (typically the # of integer variables)
Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
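For concreteness, a hypothetical instance assuming the T*/(1 − 2ε) form above: with T* = 100 (say, roughly 100 integer variables) and supervised error ε = 0.1, the expected search length is 100 / (1 − 0.2) = 125 nodes, and it approaches the optimal T* as ε → 0.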
Guarantees for Full Algorithm
- Run π_k on problems of difficulty k+1
  - This yields initial demonstrations for the harder problem instances
- Suppose we could have run the external solver on the harder instances
- Suppose the search trace includes the feasible solution found by the external solver
- Then the learned policy is as good as using the original external solver!
  - (though it might take longer to converge)
Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
[Plot: our approach vs. base solvers (Gurobi, SCIP, CPLEX, etc.) across problem sizes; initial demonstrations only at the smallest size! More in the paper.]
Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
Comparisons w/ Conventional IL
Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
Retrospective Imitation
- Two-Stage Algorithm
- Leverages a Supervised Learning Oracle
  - Initial demonstrations on small problems
- Exploits sparse environmental reward
  - The "Retrospective Oracle"
- Iteratively scales up to harder problems
Co-Training for Policy Learning (Multiple Views)
- Example: Minimum Vertex Cover
  - Graph View [Khalil et al., 2017] vs. Integer Program (Branch & Bound) View [He et al., 2014]
- Example: Different Types of Integer Programs
  - ILP vs. QCQP
Co-Training [Blum & Mitchell, 1998]
- Many learning problems have different sources of information
- Webpage classification: words vs. hyperlinks
  - e.g., a page containing "My Advisor: Prof. Avrim Blum"
  - x1 = link info; x2 = text info; x = (x1, x2) = link info & text info
(Example taken from Nina Balcan's slides)
What's Different about Policy Co-Training?
- Sequential decisions vs. 1-shot decisions
- (Sparse) environmental feedback
  - Can collect more "labels"
- Different action spaces (not always applicable)
  - Graph vs. Branch-and-Bound
Co-training for Policy Learning, Jialin Song, Ravi Lanka, et al., UAI 2019
Intuition
- An MVC instance can be attacked through two views: a graph view with policy π_1 (e.g., [1]) and a branch-and-bound view with policy π_2 (e.g., [2, 3])
- Whichever policy does better on an instance provides a demonstration for the other view
[1] "Learning Combinatorial Optimization Algorithms over Graphs" [Khalil et al., 2017]
[2] "Learning to Search in Branch and Bound Algorithms" [He et al., 2014]
[3] "Learning to Search via Retrospective Imitation" [Song et al., arXiv]
Theoretical Insight
- Different representations differ in hardness
- Goal: quantify the improvement
- Ω: all problems; Ω_1: representation 1 easier; Ω_2: representation 2 easier
Co-training for Policy Learning, Jialin Song, Ravi Lanka, et al., UAI 2019
(Towards) a Theory of Policy Co-Training
- Two MDP "views": M_1 & M_2
- Trajectory/rollout maps: f_{1→2}(τ_1) ⟹ τ_2 (and vice versa)
  - Realizing τ_1 on M_1 ⟺ realizing τ_2 on M_2
- Question: when does having two views/policies help?
  - Policy Improvement (next slide); builds upon [Kang et al., ICML 2018]
  - Optimality Gap for Shared Action Spaces (in paper); builds upon [Dasgupta et al., NeurIPS 2002]
Policy Improvement Bound

η(π′_1) ≥ η_{π_1}(π′_1) − 2γ(α_Ω ε_Ω + 4 β_{Ω_2} ε_{Ω_2}) / (1 − γ)² + δ_{Ω_2}

- η_{π_1}(π′_1): surrogate objective for the new policy π′_1 (either RL or IL), approximated by sampling from Ω; standard for policy gradient
- γ: discount factor
- α_Ω: KL divergence of π_1 vs. π′_1 on Ω; ε_Ω, ε_{Ω_2}: 1-step suboptimality terms
- β_{Ω_2}: JS divergence of π_2 vs. π_1 on Ω_2
- δ_{Ω_2}: performance gap of π_2 over π_1 on Ω_2, i.e., η(π_2 | M~Ω_2) − η(π_1 | M~Ω_2)
- Ω: all instances; Ω_1: π_1 better; Ω_2: π_2 better
- Want to maximize η(π′_1)
Builds upon theoretical results from [Kang et al., ICML 2018]
Policy Improvement Bound (Summary)

η(π′_1) ≥ η_{π_1}(π′_1) − 2γ(α_Ω ε_Ω + 4 β_{Ω_2} ε_{Ω_2}) / (1 − γ)² + δ_{Ω_2}

- Minimizing β_{Ω_2} → low disagreement between π_2 and π_1
- Maximizing δ_{Ω_2} → high performance gap of π_2 over π_1 on some MDPs
CoPiEr Algorithm (Co-training for Policy Learning)
- Sample M ~ Ω, seen as two views M_1 and M_2
- Rollout: run π_1 on M_1 → τ_1; run π_2 on M_2 → τ_2
- Exchange (only showing one version):
  - If π_1 better: τ′_2 = f_{1→2}(τ_1), τ′_1 = ∅
  - If π_2 better: τ′_1 = f_{2→1}(τ_2), τ′_2 = ∅
- Update (only showing one view):
  - Augmented objective: η̃(π_i) = η(π_i) − λ L(π_i, τ′_i)
  - Take a gradient step
- (See the sketch below.)
Co-training for Policy Learning, Jialin Song, Ravi Lanka, et al., UAI 2019
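One CoPiEr round as a sketch; the instance/policy interfaces, `J` (rollout score), and the cross-view maps are hypothetical placeholders. The key design choice: only the better-performing view's trajectory is exchanged, so each policy is pulled toward demonstrations it could not yet produce itself.

```python
# Sketch: one round of CoPiEr on a sampled instance M ~ Omega.
def copier_round(policy1, policy2, M, J, f_1to2, f_2to1, lam):
    tau1 = policy1.run(M.view1)               # rollout in view 1
    tau2 = policy2.run(M.view2)               # rollout in view 2
    # Exchange: the better rollout becomes a demonstration for the other view.
    if J(tau1) >= J(tau2):
        tau1_demo, tau2_demo = None, f_1to2(tau1)
    else:
        tau1_demo, tau2_demo = f_2to1(tau2), None
    # Update on the augmented objective J~(pi_i) = J(pi_i) - lam * L(pi_i, tau'_i).
    if tau1_demo is not None:
        policy1.gradient_step(demo=tau1_demo, imitation_weight=lam)
    if tau2_demo is not None:
        policy2.gradient_step(demo=tau2_demo, imitation_weight=lam)
    return policy1, policy2
```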
[Plot: CoPiEr vs. baselines without co-training, including RL [Khalil et al., 2017] and IL [He et al., 2014]. CoPiEr Final outperforms the individual views and is strong vs. Gurobi.]
Ongoing: Integration with ENav
(with Ravi Lanka, Hiro Ono, Olivier Toupet)

Ongoing: Additive Manufacturing
- Planning for 3D inkjet droplet printing
(with Jialin Song, Stephanie Ding)
Experiment: Setup
- Two structures: square and cross
- Two parameters determine the # of integer variables:
  - Grid size of each layer
  - # of receding-horizon control steps
- We implement the learning-to-search framework with SCIP, an open-source integer program solver
Iterative Amortized Inference (for Deep Probabilistic Models)
[Plot: iterative amortized inference improves over one-shot inference baselines.]
Related to "Learning to Learn" [Andrychowicz et al., 2016]
Iterative Amortized Inference, Joe Marino et al., ICML 2018
A General Framework for Amortizing Variational Filtering, Joe Marino et al., NeurIPS 2018
Ongoing: Amortized Planning
(with Yujia Huang, Sophie Dai, Hao Liu)
[Slide lists the learned dynamics model, the planning objective, and the optimization procedure.]
- Baseline: gradient-based planning
- Can use (offline) training to accelerate (online) planning
Learning to Optimize as Policy Learning
- Optimization as Sequential Decision Making
- Formulate New Learning Problems
  - Builds upon RL/IL
- Interesting Algorithms
- Theoretical Analysis/Guidance
- Good Empirical Performance
[Figure: agent/environment loop; the agent observes state/context s_t and the environment/world returns s_{t+1}]
Collaborators: Jialin Song, Ravi Lanka, Joe Marino, Stephane Ross, Aadyot Bhatnagar, Albert Zhao, Milan Cvitkovic, Robin Zhou, Debadeepta Dey, Stephan Mandt, Hiro Ono, Drew Bagnell, Olivier Toupet, Uduak Inyang-Udoh, Sandipan Mishra, Yujia Huang, Sophie Dai, Hao Liu

References:
- Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
- Co-Training for Policy Learning, Jialin Song, Ravi Lanka, et al., UAI 2019
- Learning Policies for Contextual Submodular Prediction, Stephane Ross et al., ICML 2013
- Iterative Amortized Inference, Joe Marino et al., ICML 2018
- A General Framework for Amortizing Variational Filtering, Joe Marino et al., NeurIPS 2018

Code:
- https://github.com/ravi-lanka-4/CoPiEr
- https://github.com/joelouismarino/iterative_inference