
Scalable Bandit Methods for Hyper-parameter Tuning
Kirthevasan Kandasamy, Carnegie Mellon University
Guest Lecture: Scalable Machine Learning for Big Data Biology, University of Pittsburgh, Pittsburgh, PA, November 3, 2017


  1. Outline
◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)
◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces
(N.B.: Part II is a shameless plug for my research.)

  3. Part 2.1: Multi-fidelity Bandits
Motivating question: what if we have cheap approximations to f?
1. Hyper-parameter tuning: train & validate with a subset of the data, and/or stop early before convergence. E.g. bandwidth (ℓ) selection in kernel density estimation.
2. Computational astrophysics: cosmological simulations and numerical computations with less granularity.
3. Autonomous driving: simulation vs. real-world experiment.

  5. Multi-fidelity Methods
For specific applications:
◮ Industrial design (Forrester et al. 2007)
◮ Hyper-parameter tuning (Agarwal et al. 2011, Klein et al. 2015, Li et al. 2016)
◮ Active learning (Zhang & Chaudhuri 2015)
◮ Robotics (Cutler et al. 2014)
Multi-fidelity bandits & optimisation (Huang et al. 2006, Forrester et al. 2007, March & Wilcox 2012, Poloczek et al. 2016)
. . . with theoretical guarantees (Kandasamy et al. NIPS 2016a&b, Kandasamy et al. ICML 2017)

  8. Multi-fidelity Bandits for Hyper-parameter Tuning
- Use an arbitrary amount of data?
- Iterative algorithms: use an arbitrary number of iterations?
E.g. train an ML model with N• data and T• iterations, but use N < N• data and T < T• iterations to approximate the cross-validation performance at (N•, T•).
Approximations come from a continuous 2D "fidelity space" of (N, T) values.

  13. Multi-fidelity Bandits (Kandasamy et al. ICML 2017)
A fidelity space Z and a domain X:
Z ← all (N, T) values; X ← all hyper-parameter values.
g : Z × X → R, where g([N, T], x) ← CV accuracy when training with N data for T iterations at hyper-parameter x.
Denote f(x) = g(z•, x), where z• = [N•, T•] ∈ Z.
End goal: find x⋆ = argmax_x f(x).
A cost function λ : Z → R+, e.g. λ(z) = λ(N, T) = O(N²T).

  22. Algorithm: BOCA (Kandasamy et al. ICML 2017)
Model g ∼ GP(0, κ) and compute the posterior GP:
mean µ_{t−1} : Z × X → R,  std-dev σ_{t−1} : Z × X → R+.
(1) x_t ← maximise the upper confidence bound for f(x) = g(z•, x):
    x_t = argmax_{x ∈ X}  µ_{t−1}(z•, x) + β_t^{1/2} σ_{t−1}(z•, x)
(2) Z_t ≈ {z•} ∪ { z : σ_{t−1}(z, x_t) ≥ γ(z) },  where γ(z) = ξ(z) (λ(z)/λ(z•))^q.
(3) z_t = argmin_{z ∈ Z_t} λ(z)   (the cheapest z in Z_t).
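The per-iteration selection above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the posterior mean/std-dev, the cost function, and the constants (β_t, q, the ξ(z) term, which is set to a constant here) are all hypothetical stand-ins.

```python
import numpy as np

# Discretised toy spaces: hyper-parameter domain X and fidelity space Z.
X = np.linspace(0.0, 1.0, 50)
Z = np.linspace(0.1, 1.0, 10)
z_star = Z[-1]                           # z_star: the target ("full") fidelity

def mu(z, x):                            # stand-in posterior mean mu_{t-1}(z, x)
    return -(x - 0.6) ** 2 + 0.1 * z

def sigma(z, x):                         # stand-in posterior std-dev sigma_{t-1}(z, x)
    return 0.2 + 0.1 * (1.0 - z)

def cost(z):                             # cost function lambda(z)
    return z ** 2

beta_t = 2.0                             # confidence parameter beta_t

# (1) x_t maximises the UCB of f(x) = g(z_star, x).
ucb = mu(z_star, X) + np.sqrt(beta_t) * sigma(z_star, X)
x_t = X[np.argmax(ucb)]

# (2) candidate fidelities Z_t: z_star plus every z whose posterior is still
# uncertain at x_t, i.e. sigma(z, x_t) >= gamma(z) with
# gamma(z) = xi(z) * (lambda(z)/lambda(z_star))^q  (xi taken constant here).
q = 0.5
gamma = 0.25 * (cost(Z) / cost(z_star)) ** q
candidates = Z[(sigma(Z, x_t) >= gamma) | (Z == z_star)]

# (3) evaluate at the cheapest candidate fidelity.
z_t = candidates[np.argmin(cost(candidates))]
print(x_t, z_t)
```

With these stand-ins the rule picks x_t near the UCB maximiser and a cheap fidelity, which is exactly the intended behaviour: query cheap approximations while their uncertainty is still informative, and fall back to z• otherwise.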

  24. Theoretical Results for BOCA
[Figure: posterior surfaces g(z, x) over Z × X, with f(x) = g(z•, x) and optimum x⋆, in a "good" setting (left) and a "bad" setting (right).]
Theorem (informal): BOCA achieves better simple regret than GP-UCB. The improvement is larger in the "good" setting than in the "bad" setting.

  25. Experiment: SVM with 20 Newsgroups
Tune two hyper-parameters for the SVM. The dataset has N• = 15K points and we use T• = 100 iterations, but we can choose N ∈ [5K, 15K] or T ∈ [20, 100] (a 2D fidelity space).
[Figure: results; vertical axis 0.89–0.915, horizontal axis 500–2000.]

  27. Experiment: Cosmological Inference on Type Ia Supernovae Data
Estimate the Hubble constant, dark matter fraction & dark energy fraction by maximising the likelihood on N• = 192 data points. This requires numerical integration on a grid of size G• = 10⁶. Approximate with N ∈ [50, 192] or G ∈ [10², 10⁶] (a 2D fidelity space).
[Figure: results; vertical axis 0.02–0.1, horizontal axis 1000–3500.]

  32. Hyper-band: A multi-fidelity method with incremental resource allocation (Li et al. 2016)
E.g. training a neural network with gradient descent for several iterations: if the CV error is bad after the early iterations, it will likely be bad at the end.
Successive Halving (with finite X):
1. Allocate a small resource R to each x ∈ X, e.g. train all hyper-parameter configurations for 100 iterations.
2. Drop the half of the x's that are performing worst.
3. Repeat steps 1 & 2 until one arm is left.
Can be extended to infinite X. Does not fall within the GP/Bayesian framework.
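The three steps of Successive Halving can be sketched as follows. This is a simplified illustration of the elimination loop only, not Li et al.'s implementation; `evaluate` is a hypothetical stand-in for training a configuration `x` with a given resource and returning its CV score.

```python
import math

def evaluate(x, resource):
    # Toy stand-in for CV accuracy: improves as more resource is spent.
    return x - 1.0 / resource

def successive_halving(arms, base_resource=100):
    resource = base_resource
    while len(arms) > 1:
        # 1. allocate the current resource to every surviving arm
        scores = {x: evaluate(x, resource) for x in arms}
        # 2. drop the worst-performing half
        survivors = sorted(arms, key=scores.get, reverse=True)
        arms = survivors[: math.ceil(len(arms) / 2)]
        # 3. increase the resource and repeat until one arm is left
        resource *= 2
    return arms[0]

best = successive_halving([0.1, 0.3, 0.5, 0.7, 0.9])
print(best)
```

Because bad arms are discarded after cheap partial runs, most of the budget is spent on the few promising configurations.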

  34. Hyper-band (cont'd)
Compared to Bayesian methods:
◮ Pro: incremental resource allocation (no need to retrain all models from the beginning).
◮ Con: cannot exploit correlation between arms (e.g. if x₁ has high CV accuracy, then an x₂ close to x₁ is also likely to do well).
Experiments: [figure omitted]

  35. Outline
◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)
◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces

  38. Part 2.2: Parallelising Arm Pulls
Sequential evaluations with one worker.
Parallel evaluations with M workers (asynchronous).
Parallel evaluations with M workers (synchronous).

  40. Why parallelisation?
◮ Computational experiments: infrastructure with 100s–1000s of CPUs or GPUs.
Prior work: (Ginsbourger et al. 2011, Janusevskis et al. 2012, Wang et al. 2016, González et al. 2015, Desautels et al. 2014, Contal et al. 2013, Shah and Ghahramani 2015, Kathuria et al. 2016, Wang et al. 2017, Wu and Frazier 2016, Hernandez-Lobato et al. 2017)
Shortcomings: prior methods typically lack one or more of
◮ asynchronicity
◮ theoretical guarantees
◮ computational & conceptual simplicity

  45. Review: Sequential Thompson Sampling in GP Bandits
Thompson Sampling (TS) (Thompson, 1933):
1) Construct the posterior GP.
2) Draw a sample g from the posterior.
3) Choose x_t = argmax_x g(x).
4) Evaluate f at x_t.
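One round of these four steps can be sketched on a discretised 1-D domain. This is a toy illustration with a hard-coded squared-exponential kernel and made-up observations, not a tuned GP implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def kernel(a, b, ls=0.2):
    # Squared-exponential kernel on 1-D inputs.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

grid = np.linspace(0.0, 1.0, 100)        # discretised domain X
X_obs = np.array([0.2, 0.5, 0.8])        # points evaluated so far
y_obs = np.array([0.1, 0.6, 0.3])        # noisy observations of f
noise = 1e-4

# 1) posterior GP given the observations
K = kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
K_s = kernel(grid, X_obs)
mu = K_s @ np.linalg.solve(K, y_obs)
cov = kernel(grid, grid) - K_s @ np.linalg.solve(K, K_s.T)

# 2) draw one sample g from the posterior (small jitter for stability)
g = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(grid)))

# 3) choose x_t = argmax_x g(x); step 4 would evaluate f at x_t
x_t = grid[np.argmax(g)]
print(x_t)
```

Each round draws a fresh sample, so TS explores where the posterior is uncertain and exploits where the posterior mean is high.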

  46. Parallelised Thompson Sampling (Kandasamy et al. Arxiv 2017) Asynchronous: asyTS At any given time, 1. ( x ′ , y ′ ) ← Wait for a worker to finish. 2. Compute posterior GP . 3. Draw a sample g ∼ GP . 4. Re-deploy worker at argmax g . 27/40
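The asynchronous loop structure can be sketched with a thread pool. This only illustrates the control flow; the GP posterior and Thompson draw (steps 2–3) are replaced by a hypothetical stand-in, and `evaluate_f` is a toy objective.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

random.seed(0)

def evaluate_f(x):
    # Stand-in for an expensive experiment (e.g. training a model).
    return -(x - 0.3) ** 2

def thompson_sample_argmax(history):
    # A real asyTS step would fit a GP to `history` and maximise a posterior
    # draw; this stand-in just perturbs the best point seen so far.
    best_x = max(history, key=lambda p: p[1])[0]
    return min(1.0, max(0.0, best_x + random.gauss(0.0, 0.1)))

M, budget = 4, 12                        # M workers, total evaluation budget
history = []                             # completed (x, y) pairs
with ThreadPoolExecutor(max_workers=M) as pool:
    pending = set()
    for _ in range(M):                   # deploy all M workers initially
        x = random.random()
        fut = pool.submit(evaluate_f, x)
        fut.x = x                        # remember which arm this worker pulls
        pending.add(fut)
    submitted = M
    while pending:
        # 1. wait for ANY one worker to finish: no synchronisation barrier
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for fut in done:
            history.append((fut.x, fut.result()))
            # 2-4. re-deploy the freed worker at the argmax of a fresh sample
            if submitted < budget:
                x = thompson_sample_argmax(history)
                nf = pool.submit(evaluate_f, x)
                nf.x = x
                pending.add(nf)
                submitted += 1

best_x = max(history, key=lambda p: p[1])[0]
print(len(history), best_x)
```

The key point is in step 1: the loop resumes as soon as any single worker finishes, so fast evaluations are never blocked waiting for slow ones, unlike the synchronous variant.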
