Scalable Bandit Methods for Hyper-parameter Tuning


slide-1
SLIDE 1

Scalable Bandit Methods for Hyper-parameter Tuning

Kirthevasan Kandasamy Carnegie Mellon University Guest Lecture - Scalable Machine Learning for Big Data Biology University of Pittsburgh, Pittsburgh, PA November 3, 2017

slide-2
SLIDE 2

Hyper-parameter Tuning

Neural Network

(Diagram: hyper-parameters → neural network → cross-validation accuracy.)

  • Train NN using given hyper-parameters
  • Compute accuracy on validation set

1/40

slide-3
SLIDE 3

Black-box Optimisation

Expensive Blackbox Function

1/40

slide-4
SLIDE 4

Maximum Likelihood estimation in Astrophysics

(Diagram: parameters such as the Hubble constant and baryonic density are fed to a cosmological simulator; its output is compared against observations to compute a likelihood score.)

1/40

slide-5
SLIDE 5

Black-box Optimisation

Expensive Blackbox Function

Other Examples:

  • Pre-clinical Drug Discovery
  • Optimal policy in Autonomous Driving
  • Synthetic gene design

1/40


slide-9
SLIDE 9

Black-box Optimisation

f : X → R is a black-box function that is accessible only via noisy evaluations. Let x⋆ = argmaxx f (x).

(Figure: f with its maximiser x⋆ and optimum f(x⋆).)

Simple regret after n evaluations: Sn = f(x⋆) − max_{t=1,...,n} f(xt).

2/40
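The definition above can be checked with a small sketch; the quadratic f and the queried points are illustrative stand-ins, chosen so the optimum is known and the regret can be computed exactly.

```python
# Simple regret S_n = f(x*) - max_{t=1..n} f(x_t): the gap between the
# optimum and the best point queried so far. Here f's optimum is known
# (f(x*) = 0 at x* = 0.3); in a real black-box problem it is not.
def f(x):
    return -(x - 0.3) ** 2

queried = [0.0, 0.5, 0.25, 0.31]             # x_1, ..., x_n
simple_regret = 0.0 - max(f(x) for x in queried)
# The best query so far is x = 0.31, so the regret is (0.31 - 0.3)^2.
```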

slide-10
SLIDE 10

Outline

◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)

◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces

3/40



slide-13
SLIDE 13

Gaussian (Normal) distribution N(µ, σ²)

◮ A probability distribution for real-valued random variables.
◮ The mean µ and variance σ² completely characterise the distribution.
◮ For samples X1, . . . , Xn, let µ̂ = (1/n) Σi Xi be the sample mean. Then µ̂ ± 1.96 σ/√n is a 95% confidence interval for µ.
◮ Can draw samples (e.g. in Matlab: mu + sigma * randn()).

4/40
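The same sampling and confidence-interval computation, sketched in Python rather than Matlab (the values of mu, sigma and n are illustrative):

```python
import math
import random

random.seed(0)
mu, sigma, n = 5.0, 2.0, 1000

# The Matlab one-liner mu + sigma * randn() becomes:
samples = [mu + sigma * random.gauss(0.0, 1.0) for _ in range(n)]

# Sample mean, and the 95% interval mu_hat +/- 1.96 * sigma / sqrt(n)
mu_hat = sum(samples) / n
half_width = 1.96 * sigma / math.sqrt(n)
interval = (mu_hat - half_width, mu_hat + half_width)
# The interval contains the true mu roughly 95% of the time.
```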

slide-14
SLIDE 14

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R.

(Figure sequence: sample functions before any observations; the prior GP; observations; the posterior GP given those observations.)

5/40

slide-19
SLIDE 19

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R. (Figure: the posterior GP given observations.)

Completely characterised by a mean function µ : X → R and a covariance kernel κ : X × X → R. After t observations, f(x) ∼ N(µt(x), σt²(x)).

5/40
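A minimal sketch of the posterior computation on this slide, assuming a squared-exponential kernel and zero prior mean; the lengthscale and the observed data are illustrative stand-ins.

```python
import numpy as np

def sq_exp(a, b, ell=0.3):
    # Squared-exponential kernel kappa(x, x') = exp(-(x - x')^2 / (2 ell^2))
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-3):
    # Posterior mean mu_t(x) and std-dev sigma_t(x) of a zero-mean GP
    K = sq_exp(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_s = sq_exp(X_query, X_obs)
    mu = K_s @ np.linalg.solve(K, y_obs)
    cov = sq_exp(X_query, X_query) - K_s @ np.linalg.solve(K, K_s.T)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))

X_obs = np.array([0.1, 0.4, 0.8])
y_obs = np.sin(2 * np.pi * X_obs)
mu, sigma = gp_posterior(X_obs, y_obs, X_obs)
# At the observed points the posterior mean nearly interpolates y_obs
# and the posterior std-dev is small, as in the figures.
```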


slide-24
SLIDE 24

Algorithm 1: Upper Confidence Bounds in GP Bandits

Model f ∼ GP(0, κ). Gaussian Process Upper Confidence Bound (GP-UCB)

(Srinivas et al. 2010)

1) Construct the posterior GP.
2) ϕt = µt−1 + βt^{1/2} σt−1 is an upper confidence bound (UCB) for f.
3) Choose xt = argmaxx ϕt(x).
4) Evaluate f at xt.

6/40

slide-25
SLIDE 25

GP-UCB

xt = argmaxx µt−1(x) + βt^{1/2} σt−1(x)

◮ µt−1: Exploitation
◮ σt−1: Exploration
◮ βt controls the trade-off; βt ≍ log t.

7/40
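The trade-off can be seen in a toy sketch; the posterior means and standard deviations below are hypothetical stand-ins for µt−1 and σt−1 on a grid of candidate points.

```python
import math

xs    = [0.0, 0.25, 0.5, 0.75, 1.0]   # candidate points
mu    = [0.2, 0.8, 0.5, 0.1, 0.3]     # posterior means mu_{t-1}(x)
sigma = [0.4, 0.1, 0.3, 0.6, 0.2]     # posterior std-devs sigma_{t-1}(x)

t = 10
beta_t = math.log(t)                  # beta_t ~ log t

# UCB acquisition phi_t(x) = mu_{t-1}(x) + sqrt(beta_t) * sigma_{t-1}(x)
phi = [m + math.sqrt(beta_t) * s for m, s in zip(mu, sigma)]
x_t = xs[phi.index(max(phi))]
# The rule picks x_t = 0.75: a modest mean but the largest uncertainty,
# so the UCB favours exploring it over exploiting the best mean at 0.25.
```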

slide-26
SLIDE 26

GP-UCB

(Srinivas et al. 2010)

(Figure sequence: GP-UCB illustrated at iterations t = 1, . . . , 7, 11 and 25.)

8/40


slide-40
SLIDE 40

Algorithm 2: Thompson Sampling in GP Bandits

Model f ∼ GP(0, κ). Thompson Sampling (TS)

(Thompson, 1933)

1) Construct the posterior GP.
2) Draw a sample g from the posterior.
3) Choose xt = argmaxx g(x).
4) Evaluate f at xt.

9/40
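Steps 2-3 can be sketched by discretising X to a grid, where a posterior function sample is just a draw from a multivariate normal; the mean vector and covariance matrix below are illustrative stand-ins for the posterior GP.

```python
import numpy as np

rng = np.random.default_rng(0)

xs = np.linspace(0.0, 1.0, 6)                  # a grid over X
mean = np.sin(2 * np.pi * xs)                  # stand-in posterior mean
cov = 0.05 * np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / 0.2 ** 2)

# 2) Draw one function sample g from the posterior, restricted to the grid.
g = rng.multivariate_normal(mean, cov)
# 3) Choose x_t = argmax_x g(x); 4) the next step would evaluate f there.
x_t = xs[np.argmax(g)]
```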

slide-41
SLIDE 41

Thompson Sampling (TS) in GPs

(Thompson, 1933)

(Figure sequence: TS illustrated at iterations t = 1, . . . , 7, 11 and 25.)

10/40


slide-53
SLIDE 53

Bandits in the Bayesian Paradigm

Theory: Both UCB and TS will eventually find the optimum under appropriate smoothness assumptions on f. That is,

Sn = f(x⋆) − max_{t=1,...,n} f(xt) → 0, as n → ∞.

Other criteria for selecting xt:

◮ Expected improvement (Jones et al. 1998)
◮ Probability of improvement (Kushner et al. 1964)
◮ Predictive entropy search (Hernández-Lobato et al. 2014)
◮ . . . and a few more.

Other Bayesian models for f:

◮ Neural networks (Snoek et al. 2015)
◮ Random forests (Hutter 2009)

11/40


slide-55
SLIDE 55

Outline

◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)

◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces

(N.B: Part II is a shameless plug for my research.)

12/40


slide-57
SLIDE 57

Part 2.1: Multi-fidelity Bandits

Motivating question: What if we have cheap approximations to f ?

1. Hyper-parameter tuning: train & validate with a subset of the data, and/or stop early before convergence. E.g. bandwidth (ℓ) selection in kernel density estimation.
2. Computational astrophysics: cosmological simulations and numerical computations with less granularity.
3. Autonomous driving: simulation vs. real-world experiment.

13/40


slide-59
SLIDE 59

Multi-fidelity Methods

For specific applications:

◮ Industrial design (Forrester et al. 2007)
◮ Hyper-parameter tuning (Agarwal et al. 2011, Klein et al. 2015, Li et al. 2016)
◮ Active learning (Zhang & Chaudhuri 2015)
◮ Robotics (Cutler et al. 2014)

Multi-fidelity bandits & optimisation (Huang et al. 2006, Forrester et al. 2007, March & Wilcox 2012, Poloczek et al. 2016)

. . . with theoretical guarantees (Kandasamy et al. NIPS 2016a&b, Kandasamy et al. ICML 2017)

14/40


slide-62
SLIDE 62

Multi-fidelity Bandits for Hyper-parameter tuning

• Use an arbitrary amount of data?
• Iterative algorithms: use an arbitrary number of iterations?

E.g. Train an ML model with N• data and T• iterations, but use N < N• data and T < T• iterations to approximate the cross-validation performance at (N•, T•).

Approximations come from a continuous 2D “fidelity space” of (N, T) values.

15/40


slide-67
SLIDE 67

Multi-fidelity Bandits

(Kandasamy et al. ICML 2017)

(Figure: the fidelity space Z and domain X, showing g(z, x), f(x) = g(z•, x) and the maximiser x⋆.)

Z ← all (N, T) values. X ← all hyper-parameter values.

g : Z × X → R. g([N, T], x) ← CV accuracy when training with N data for T iterations at hyper-parameter x.

Denote f(x) = g(z•, x), where z• = [N•, T•] ∈ Z.

End goal: Find x⋆ = argmaxx f(x).

A cost function λ : Z → R+, e.g. λ(z) = λ(N, T) = O(N²T).

16/40


slide-76
SLIDE 76

Algorithm: BOCA

(Kandasamy et al. ICML 2017)

Model g ∼ GP(0, κ) and compute the posterior GP:
mean µt−1 : Z × X → R, std-dev σt−1 : Z × X → R+.

(1) xt ← maximise an upper confidence bound for f(x) = g(z•, x):

xt = argmax_{x∈X} µt−1(z•, x) + βt^{1/2} σt−1(z•, x)

(2) Zt ≈ {z•} ∪ { z : σt−1(z, xt) ≥ γ(z) }, where γ(z) = (λ(z)/λ(z•))^q ξ(z).

(3) zt = argmin_{z∈Zt} λ(z) (the cheapest z in Zt).

17/40
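Steps (2)-(3) can be sketched with hypothetical numbers; lam, sigma and gamma below stand in for the cost λ(z), the posterior std-dev σt−1(z, xt) and the threshold γ(z) over a few candidate fidelities, and are not values from the paper.

```python
# A hedged sketch of BOCA's fidelity-selection step.
z_star = 1.0                      # the target fidelity z*
candidates = [0.25, 0.5, 0.75, z_star]

lam = {0.25: 1.0, 0.5: 4.0, 0.75: 9.0, z_star: 16.0}       # cost lambda(z)
sigma = {0.25: 0.30, 0.5: 0.12, 0.75: 0.02, z_star: 0.05}  # sigma_{t-1}(z, x_t)
gamma = {0.25: 0.10, 0.5: 0.10, 0.75: 0.10, z_star: 0.0}   # threshold gamma(z)

# Z_t: the target fidelity plus any z whose posterior is still uncertain.
Z_t = [z for z in candidates if z == z_star or sigma[z] >= gamma[z]]
z_t = min(Z_t, key=lam.get)  # evaluate at the cheapest qualifying fidelity
# Here the cheapest still-uncertain fidelity (z = 0.25) is queried, so
# expensive evaluations at z* are deferred until cheap ones are informative.
```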


slide-78
SLIDE 78

Theoretical Results for BOCA

(Figures: a “good” setting, where g(z, x) at cheaper fidelities z is informative about f, and a “bad” setting, where it is not.)

Theorem (informal): BOCA achieves better simple regret than GP-UCB. The improvements are larger in the “good” setting than in the “bad” setting.

18/40

slide-79
SLIDE 79

Experiment: SVM with 20 News Groups

Tune two hyper-parameters for the SVM. The dataset has N• = 15K points and we use T• = 100 iterations, but we can choose N ∈ [5K, 15K] or T ∈ [20, 100] (a 2D fidelity space).

(Results plot.)

19/40


slide-81
SLIDE 81

Experiment: Cosmological inference on Type Ia supernovae data

Estimate the Hubble constant, dark matter fraction & dark energy fraction by maximising the likelihood on N• = 192 data points. This requires numerical integration on a grid of size G• = 10⁶. Approximate with N ∈ [50, 192] or G ∈ [10², 10⁶] (a 2D fidelity space).

(Results plot.)

20/40


slide-86
SLIDE 86

Hyper-band: A multi-fidelity method with incremental resource allocation

(Li et al. 2016)

E.g: Training a neural network with gradient descent for several iterations: if the CV error is bad after the early iterations, it will likely be bad at the end.

Successive Halving (with finite X):

1. Allocate a small resource R to each x ∈ X, e.g. train all hyper-parameter configurations for 100 iterations.
2. Drop the half of the x’s that are performing worst.
3. Repeat steps 1 & 2 until one arm is left.

Can be extended to infinite X. Does not fall within the GP/Bayesian framework.

21/40
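The three steps above can be sketched over a finite set of configurations; score(x, r) is a hypothetical stand-in for "validation accuracy of configuration x after r resource units".

```python
# A minimal sketch of Successive Halving.
def score(x, r):
    return x * min(r, 50)  # toy proxy: better x's pull ahead as r grows

def successive_halving(configs, resource=100):
    while len(configs) > 1:
        # 1) allocate a small resource to every surviving configuration
        scores = {x: score(x, resource) for x in configs}
        # 2) drop the worse-performing half
        configs = sorted(configs, key=scores.get, reverse=True)
        configs = configs[: max(1, len(configs) // 2)]
        # 3) repeat until one arm is left
    return configs[0]

best = successive_halving([0.1, 0.3, 0.7, 0.9])
# With this toy score, the strongest configuration (0.9) survives.
```

Hyper-band runs this subroutine with several initial resource levels, hedging against the early scores being misleading.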


slide-88
SLIDE 88

Hyper-band (cont’d)

When compared to Bayesian methods,

◮ Pro: Incremental resource allocation (no need to retrain all models from the beginning).
◮ Con: Cannot use correlation between arms (e.g. if x1 has a large CV accuracy, then x2 close to x1 is also likely to do well).

Experiments: (results plots)

22/40

slide-89
SLIDE 89

Outline

◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)

◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces

23/40


slide-92
SLIDE 92

Part 2.2: Parallelising arm pulls

(Figures: sequential evaluations with one worker; parallel evaluations with M workers, asynchronous; parallel evaluations with M workers, synchronous.)

24/40


slide-94
SLIDE 94

Why parallelisation?

◮ Computational experiments: infrastructure with 100s to 1000s of CPUs or GPUs.

Prior work: (Ginsbourger et al. 2011, Janusevskis et al. 2012, Wang et al. 2016, González et al. 2015, Desautels et al. 2014, Contal et al. 2013, Shah and Ghahramani 2015, Kathuria et al. 2016, Wang et al. 2017, Wu and Frazier 2016, Hernandez-Lobato et al. 2017)

Shortcomings of prior work:

◮ Asynchronicity
◮ Theoretical guarantees
◮ Computational & conceptual simplicity

25/40


slide-99
SLIDE 99

Review: Sequential Thompson Sampling in GP Bandits

Thompson Sampling (TS)

(Thompson, 1933)

1) Construct the posterior GP.
2) Draw a sample g from the posterior.
3) Choose xt = argmaxx g(x).
4) Evaluate f at xt.

26/40


slide-102
SLIDE 102

Parallelised Thompson Sampling

(Kandasamy et al. Arxiv 2017)

Asynchronous: asyTS
At any given time,
1. (x′, y′) ← wait for a worker to finish.
2. Compute the posterior GP.
3. Draw a sample g ∼ GP.
4. Re-deploy the worker at argmax g.

Synchronous: synTS
At any given time,
1. {(x′m, y′m)} for m = 1, . . . , M ← wait for all workers to finish.
2. Compute the posterior GP.
3. Draw M samples gm ∼ GP, ∀m.
4. Re-deploy worker m at argmax gm, ∀m.

Variants in prior work: (Osband et al. 2016, Israelsen et al. 2016, Hernandez-Lobato et al. 2017)

27/40
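The asynchronous loop above can be sketched as an event-driven simulation with M workers; thompson_choice is a hypothetical stand-in for "argmax of a posterior sample", and the finish times are random placeholders rather than real evaluation costs.

```python
import heapq
import random

random.seed(0)

def thompson_choice(completed):
    # Stand-in for: fit posterior on `completed`, sample g, return argmax g.
    return random.random()

def f(x):
    return -(x - 0.5) ** 2  # toy black-box

M, budget = 3, 10
completed = []                                      # (x, y) pairs so far
events = [(random.random(), m) for m in range(M)]   # (finish_time, worker)
heapq.heapify(events)

for _ in range(budget):
    t_now, worker = heapq.heappop(events)           # 1) wait for a worker
    x = thompson_choice(completed)                  # 2-3) posterior sample
    completed.append((x, f(x)))                     # evaluate f at argmax g
    heapq.heappush(events, (t_now + random.random(), worker))  # 4) redeploy
```

synTS differs only in popping all M events at once before redeploying; no worker ever idles in the asynchronous variant.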


slide-106
SLIDE 106

Theoretical Results for TS: number of evaluations

Sequential TS, SE kernel (Russo & van Roy 2014):

E[Sn] ≲ √( vol(X) log(n) / n )

Theorem: synTS & asyTS, SE kernel (Kandasamy et al. Arxiv 2017):

E[Sn] ≲ (M √log(M)) / n + √( vol(X) log(n + M) / n )

where n ← # completed arm pulls by all workers.

Why is this interesting?

• A sequential algorithm can use information from all previous rounds to determine where to evaluate next.
• A parallel algorithm could be missing up to M − 1 results at any given time.

But randomisation helps!

28/40

slide-107
SLIDE 107

(Figures: sequential evaluations with one worker; parallel evaluations with M workers, asynchronous; parallel evaluations with M workers, synchronous.)

29/40


slide-110
SLIDE 110

Theoretical Results: Simple regret with time

(Figures: simple regret vs. wall-clock time, asynchronous and synchronous.)

Theorem (informal) (Kandasamy et al. Arxiv 2017): If evaluation times are all the same, asyTS ≈ synTS. Otherwise, the bounds for asyTS are better than those for synTS; the more variability in evaluation times, the bigger the difference.

• Bounded tail decay: constant factor
• Sub-Gaussian tail decay: √log(M) factor
• Sub-exponential tail decay: log(M) factor

30/40


slide-113
SLIDE 113

Experiment: Branin-2D M = 4

Evaluation time sampled from a uniform distribution

Methods compared: synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS.

(Results plot.)

31/40

slide-114
SLIDE 114

Experiment: Hartmann-18D M = 25

Evaluation time sampled from an exponential distribution

Methods compared: synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS.

(Results plot.)

32/40

slide-115
SLIDE 115

Experiment: Model Selection in Cifar10 M = 4

Tune the number of filters in the range (32, 256) for each layer of a 6-layer CNN. Time taken for an evaluation: 4 to 16 minutes.

Methods compared: synTS, asyRAND, asyHUCB, asyTS, asyEI, synHUCB.

(Results plot.)

33/40

slide-116
SLIDE 116

Parallelised Thompson Sampling in Neural Networks

(Hernandez-Lobato et al. 2017)

34/40


slide-118
SLIDE 118

Outline

◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)

◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces

35/40


slide-121
SLIDE 121

Part 2.3: Optimisation in High Dimensional Input Spaces

E.g. Tuning a machine learning model with several hyper-parameters.

At each time step, GP-UCB must fit the posterior GP and maximise ϕt = µt−1 + βt^{1/2} σt−1. This creates two difficulties:

1. Statistical difficulty: estimating a high-dimensional GP.
2. Computational difficulty: maximising a high-dimensional acquisition (e.g. the upper confidence bound ϕt).

36/40


slide-123
SLIDE 123

Additive Models for High Dimensional BO

(Kandasamy et al. ICML 2015)

E.g. f(x{1,...,10}) = f(1)(x{1,3,9}) + f(2)(x{2,4,8}) + f(3)(x{5,6,10}).

◮ Better statistical properties: sample complexity improves from exponential in d to linear in d.
◮ The Add-GP-UCB algorithm is computationally tractable even for large d.
◮ Better bias-variance trade-off in practice: the algorithm does well even if f is not additive.

37/40
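The computational gain from additivity can be sketched directly: with hypothetical group functions standing in for f(1), f(2), f(3) on disjoint coordinate groups (0-indexed versions of the groups on the slide), an additive acquisition is maximised group by group over 3³ candidates each, rather than jointly over 3¹⁰.

```python
import itertools

# Illustrative stand-ins for the additive components; each takes only
# the coordinates in its group.
groups = {(0, 2, 8): lambda v: -sum((vi - 0.5) ** 2 for vi in v),
          (1, 3, 7): lambda v: sum(v),
          (4, 5, 9): lambda v: -abs(v[0] - v[1])}

grid = [0.0, 0.5, 1.0]
x = [0.0] * 10
for dims, f_j in groups.items():
    # Each group is optimised independently: 3^3 candidates per group
    # instead of 3^10 for a joint grid search.
    best = max(itertools.product(grid, repeat=len(dims)), key=f_j)
    for d, v in zip(dims, best):
        x[d] = v
# x now maximises the additive surrogate coordinate-group by group.
```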

slide-124
SLIDE 124

Experiment: Viola & Jones Cascade classifier

Tune 22 hyper-parameters of the V&J classifier.

(Results plot.)

38/40


slide-126
SLIDE 126

Summary

◮ Bandits are a framework for studying exploration vs. exploitation trade-offs when optimising black-box functions.

◮ Several applications: hyper-parameter tuning, materials synthesis, scientific experiments, etc.

◮ Several algorithms: UCB, TS, EI, etc.

◮ Multi-fidelity bandits: use cheap approximations to an expensive experiment to speed up optimisation.

◮ Parallelised TS: a simple and intuitive way to deal with multiple workers.

◮ High-dimensional optimisation: additive models have favourable statistical and computational properties.

39/40

slide-127
SLIDE 127

Akshay, Barnabás, Gautam, Jeff, Junier

Thank You

Slides: www.cs.cmu.edu/~kkandasa/talks/pitt-hptune-slides.pdf