Learning near-optimal hyperparameters with minimal overhead



  1. Learning near-optimal hyperparameters with minimal overhead. Gellért Weisz, András György, Csaba Szepesvári. Workshop on Automated Algorithm Design (TTIC 2019), August 7, 2019. 1 / 22

  2. Introduction
     Problem: find good parameter settings (configurations) for general-purpose solvers.
     ◮ No structure assumed over the parameter space.
     Zillions of practical algorithms ⇔ little theory. We want theoretical guarantees on the runtime of
     ◮ the chosen configuration; and
     ◮ the configuration process.
     Goal: find a near-optimal configuration solving a 1 − δ fraction of the problems in the least expected time.
     ◮ A δ fraction of instances may be hopelessly hard; we don't want to solve those.
     2 / 22

  3. Problem formulation
     Given: n configurations, a distribution Γ of problem instances.
     [Figure: pdf of the runtime of configuration i, capped at the point where the tail probability equals δ; the expected capped runtime is R_δ(i).]
     Runtime of the optimal configuration: OPT_δ = min_i R_δ(i).
     Configuration i is (ε, δ)-optimal if R_δ(i) ≤ (1 + ε) OPT_{δ/2}.
     Note that OPT_δ ≤ OPT_{δ/2} ≤ OPT_0 – the gaps can be large!
     3 / 22
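A minimal sketch (not from the talk) of how R_δ(i) and OPT_δ can be estimated from observed runtimes; the helper name capped_mean and the synthetic runtime matrix are illustrative assumptions.

```python
import numpy as np

def capped_mean(samples: np.ndarray, delta: float) -> float:
    """Estimate R_delta: mean runtime capped at the (1 - delta)-quantile."""
    cap = np.quantile(samples, 1.0 - delta)      # tail probability above cap is delta
    return float(np.minimum(samples, cap).mean())

rng = np.random.default_rng(0)
# runtimes[i, j]: runtime of configuration i on sampled instance j (heavy-tailed)
runtimes = rng.pareto(2.0, size=(5, 10_000)) + 1.0

delta = 0.1
R_delta = np.array([capped_mean(r, delta) for r in runtimes])
OPT_delta = R_delta.min()                        # OPT_delta = min_i R_delta(i)
print(R_delta.round(3), OPT_delta.round(3))
```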

  4. Previous work (before ICML ’19) 4 / 22

  5. Structured Procrastination (Kleinberg et al., 2017)
     Relaxed goal: find i with R_δ(i) ≤ (1 + ε) OPT_0.
     Worst-case lower bound: the runtime must be at least Ω( n · OPT_0 / (ε² δ) ).
     With probability 1 − ζ, returns an (ε, δ)-optimal configuration in worst-case time
     O( (n · OPT_0 / (ε² δ)) · log( n log κ̄ / (ζ ε² δ) ) )
     ◮ κ̄: absolute upper bound on runtimes
     Can we remove κ̄? Can we improve the runtime when the problem is easier?
     5 / 22

  6. LEAPSANDBOUNDS (Weisz et al., 2018)
     1. Guess a value θ of OPT, starting from a low value.
     2. Test whether R_δ(i) ≤ θ for some configuration i:
     ◮ For each i, run b = Õ( 1 / (δε²) ) instances with instance-wise timeout τ = 4θ / (3δ); abort if the empirical average exceeds θ.
     3. Return the configuration with the smallest mean among the successful configurations. If no test succeeded, double θ and continue from Step 2.
     [Figure: average runtime budget and its use across different configurations and phases.]
     6 / 22
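A compact sketch of the phase loop above, under stated assumptions: run_on_instance(i, timeout) is a hypothetical runner that executes configuration i on a fresh random instance and returns the elapsed time (at most timeout), and the sample count b hides the Õ log factor behind an arbitrary constant.

```python
import math

def leaps_and_bounds(n, eps, delta, run_on_instance, theta=1.0):
    """Phase loop: guess theta, test every configuration, double theta on failure."""
    while True:
        tau = 4.0 * theta / (3.0 * delta)         # instance-wise timeout for this phase
        b = math.ceil(5.0 / (delta * eps * eps))  # b = O~(1/(delta*eps^2)); constant illustrative
        means = {}
        for i in range(n):
            total, aborted = 0.0, False
            for _ in range(b):
                total += min(run_on_instance(i, tau), tau)
                if total > b * theta:             # the mean over b runs must exceed theta: abort i
                    aborted = True
                    break
            if not aborted:
                means[i] = total / b
        if means:                                 # some configuration passed the test
            return min(means, key=means.get)
        theta *= 2.0                              # no success in this phase: double the guess
```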

  7. Why does this work?
     W.h.p., for any configuration i:
     ◮ if the runs complete within θ average runtime:
     (i) τ is above the δ-quantile for configuration i (intuition: by Markov's inequality, the tail probability is at most θ/τ = 3δ/4 < δ);
     (ii) the empirical mean R̄_i is ε-close to R_τ(i) = E[ X(i, J) ∧ τ ], J ∼ Γ;
     ◮ otherwise, R_δ(i) > θ, hence we can safely abandon i for this phase.
     Thus, if R̄_i < θ for some configuration i, then for i* = argmin_i R̄_i, we have R_δ(i*) ≤ (1 + ε) OPT_0 w.h.p.
     7 / 22
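A quick numeric check (illustrative, not from the talk) of claim (i): for any runtime distribution whose τ-capped mean is at most θ, Markov's inequality keeps the tail probability at τ = 4θ/(3δ) below 3δ/4 < δ. The lognormal distribution here is an arbitrary stand-in for heavy-tailed runtimes.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=2.0, size=1_000_000)  # heavy-tailed "runtimes"

delta = 0.2
theta = 1.01 * x.mean()            # any theta that upper-bounds the (capped) mean
tau = 4.0 * theta / (3.0 * delta)

capped_mean = np.minimum(x, tau).mean()
tail = (x >= tau).mean()           # empirical P(X >= tau)
assert capped_mean <= theta
print(f"tail={tail:.4f}  markov bound={capped_mean / tau:.4f}  3*delta/4={0.75 * delta}")
```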

  8. Guarantees
     Theorem. With high probability,
     (i) the algorithm finds an (ε, δ)-optimal configuration;
     (ii) the worst-case runtime is O( (n · OPT_0 / (ε² δ)) · log( n log(OPT_0) / ζ ) ).
     Improvement: empirical Bernstein stopping. Stop testing a configuration i when the confidence intervals already indicate that (a) i is not optimal with the given timeout; or (b) i is already estimated with ε accuracy.

     Runtime:
     O( OPT_0 · Σ_{i=1}^n max{ σ²_{i,k} / (ε² R_{τ_k}(i)²), 1/(εδ), 1/δ } · ( log( n log(OPT_0) / ζ ) + log( 1 / (ε R_{τ_k}(i)) ) ) )
     ◮ σ²_{i,k}: variance of the τ_k-capped runtime of configuration i in phase k
     Huge improvement if the variances are small: σ²_{i,k} / R² ≪ 1/δ.
     8 / 22
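An illustrative sketch of an empirical Bernstein stopping check in the spirit of the improvement above, assuming the standard Audibert-style confidence radius (a variance term plus a range term); the function names and the exact constants in the stopping condition are assumptions, not the paper's procedure.

```python
import math

def eb_radius(var: float, range_bound: float, t: int, zeta: float) -> float:
    """Empirical Bernstein confidence radius after t samples in [0, range_bound]."""
    log_term = math.log(3.0 / zeta)
    return math.sqrt(2.0 * var * log_term / t) + 3.0 * range_bound * log_term / t

def can_stop(samples: list, tau: float, eps: float, zeta: float) -> bool:
    """Stop once the capped mean is pinned down to relative accuracy eps."""
    t = len(samples)
    mean = sum(samples) / t
    var = sum((s - mean) ** 2 for s in samples) / t
    return eb_radius(var, tau, t, zeta) <= eps * mean
```

When the variance is small, the square-root term is negligible and the radius shrinks at roughly a 1/t rate rather than the range-driven 1/√t rate, which is where the improvement over the worst-case bound comes from.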

  9. Experiments
     Configuring the minisat SAT solver (Sorensson and Een, 2005).
     1K configurations, 20K nontrivial problem instances.
     Comparison with Structured Procrastination by Kleinberg et al. (2017).
     Code and data (83 CPU-years' worth, on year-2018 CPUs): https://github.com/deepmind/leaps-and-bounds
     [Figure: mean runtime below the δ-quantile (s) for each configuration, sorted, for δ ∈ {0, 0.05, 0.1, 0.2, 0.3, 0.5}.]
     9 / 22

  10. Results
      ε = 0.2, δ = 0.2, ζ = 0.1.
      Instead of doubling, θ := 1.25 θ is used.
      Runs can be stopped and resumed (i.e., 'continue' running on an instance).
      [Figure: total time spent running each configuration (s), LeapsAndBounds vs. Structured Procrastination; configurations sorted by mean runtime below the 0.2-quantile.]
      10 / 22
