  1. Scaling Bayesian Optimization in High Dimensions
  Stefanie Jegelka, MIT
  BayesOpt Workshop 2017
  joint work with Zi Wang, Chengtao Li, Clement Gehring (MIT) and Pushmeet Kohli (DeepMind)

  2. Bayesian Optimization with GPs
  Gaussian process: f ∼ GP(µ, k), with closed-form expressions for the posterior mean µ and variance (uncertainty) σ².
  BO: sequentially build a model of f. For t = 1, …, T:
  • select new query point(s) x via an acquisition function: arg max_{x ∈ X} α_t(x)
  • observe f(x)
  • update the model & repeat
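
A minimal sketch of this loop, assuming a toy 1-D objective and a simple UCB-style acquisition on a candidate grid (both placeholders, not the acquisition functions discussed later):

```python
# A minimal BO loop: fit a GP, maximize an acquisition function over candidates,
# query the objective, repeat. Objective and UCB weight are illustrative only.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

f = lambda x: -np.sin(3 * x) - x ** 2 + 0.7 * x        # toy objective (placeholder)
candidates = np.linspace(-2, 2, 500)[:, None]          # grid stand-in for arg max over X

X = np.array([[-1.0], [1.0]])                          # initial design
y = f(X).ravel()
for t in range(20):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-4).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    acq = mu + 2.0 * sigma                             # UCB-style acquisition alpha_t(x)
    x_next = candidates[np.argmax(acq)]                # select new query point
    X = np.vstack([X, x_next])                         # observe f(x) and update the model
    y = np.append(y, f(x_next[None, :]).ravel())
print("best observed value:", y.max())
```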

  3. Challenges in high dimensions: statistical & computational complexity
  • estimating & optimizing the acquisition function → new, sample-efficient acquisition function (ICML 2017)
  • function estimation in high dimensions → learn input structure (ICML 2017)
  • many observations (data points): huge matrix in the GP → multiple random partitions (BayesOpt 2017)
  • parallelization

  4. (Predictive) Entropy Search
  New query point: arg max_{x ∈ X} α_t(x), where α_t(x) = I({x, y}; x* | D_t).
  ES: α_t(x) = H(p(x* | D_t)) − E[H(p(x* | D_t ∪ {x, y}))]
  Using I(a; b) = H(a) − H(a | b) = H(b) − H(b | a):
  PES: α_t(x) = H(p(y | D_t, x)) − E[H(p(y | D_t, x, x*))]
  If x* is high-dimensional, α_t(x) is costly to estimate!
  (figure: observed data, location of the global optimum x*, point to query)
  (Hennig & Schuler, 2012; Hernández-Lobato, Hoffman & Ghahramani, 2014)

  5. Max-value Entropy Search
  Input space: α_t(x) = I({x, y}; x* | D_t), with x* d-dimensional.
  Output space: α_t(x) = I({x, y}; y* | D_t), with y* 1-dimensional: d → 1 dimensions!
  Closed form: α_t(x) ≈ (1/K) Σ_{y* ∈ Y*} [ γ_{y*}(x) ψ(γ_{y*}(x)) / (2 Ψ(γ_{y*}(x))) − log Ψ(γ_{y*}(x)) ],
  where γ_{y*}(x) = (y* − µ_t(x)) / σ_t(x), and ψ, Ψ are the standard normal pdf and cdf.
  This is an expectation over p(y* | D_t). How to sample y*?
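
A minimal sketch of this closed-form estimate, assuming the GP posterior mean, standard deviation, and a set of sampled max-values are already available (all inputs are illustrative):

```python
# Closed-form MES estimate, averaged over K sampled max-values y*.
import numpy as np
from scipy.stats import norm

def mes_acquisition(mu, sigma, y_star):
    mu = np.asarray(mu)[:, None]                        # shape (n_points, 1)
    sigma = np.asarray(sigma)[:, None]
    gamma = (np.asarray(y_star)[None, :] - mu) / sigma  # gamma_{y*}(x)
    cdf = np.clip(norm.cdf(gamma), 1e-12, 1.0)          # guard against log(0)
    # gamma * pdf / (2 * cdf) - log(cdf), averaged over the K sampled y* values
    return np.mean(gamma * norm.pdf(gamma) / (2 * cdf) - np.log(cdf), axis=1)

# example: alpha = mes_acquisition(mu, sigma, y_star_samples); query np.argmax(alpha)
```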

  6. Sampling y*: Idea 1
  p(f(x)) is a 1-D Gaussian.
  Fisher–Tippett–Gnedenko theorem: the maximum of a set of i.i.d. Gaussian variables is asymptotically described by a Gumbel distribution.
  • sample representative points x
  • approximate the max-value over the representative points by a Gumbel distribution
  (figure: GP posterior over f(x))
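
A hedged sketch of this idea: fit a Gumbel distribution to the CDF of the maximum of the posterior marginals at representative points, then sample from it. The quantile-matching recipe below is a common way to implement the Gumbel approximation; details may differ from the original code, and all parameters are illustrative.

```python
import numpy as np
from scipy.stats import norm

def sample_y_star_gumbel(mu, sigma, n_samples=10, rng=np.random):
    mu, sigma = np.asarray(mu), np.asarray(sigma)

    # Pr(max_i f(x_i) < y) for independent Gaussian marginals at the representative points
    def cdf_max(y):
        return np.prod(norm.cdf((y - mu) / sigma))

    # bisection for the y where the max-CDF reaches quantile q
    def quantile(q, lo, hi, iters=60):
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if cdf_max(mid) < q else (lo, mid)
        return 0.5 * (lo + hi)

    lo, hi = float(np.min(mu - 5 * sigma)), float(np.max(mu + 5 * sigma))
    y25, y50, y75 = (quantile(q, lo, hi) for q in (0.25, 0.5, 0.75))

    # match Gumbel(a, b) quantiles, then sample y* via the inverse CDF
    b = (y75 - y25) / (np.log(np.log(4.0)) - np.log(np.log(4.0 / 3.0)))
    a = y50 + b * np.log(np.log(2.0))
    return a - b * np.log(-np.log(rng.rand(n_samples)))
```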

  7. Sampling y*: Idea 2
  Draw functions from the GP posterior and maximize each. How?
  Neal 1994: a GP is equivalent to an infinite 1-layer neural network with Gaussian weights.
  • approximate the GP as a finite neural network with random weights (random features) & sample the posterior weights
  • maximize the network output for each sample
  (figure: sampled functions from the GP posterior f(x))
  (Hernández-Lobato, Hoffman & Ghahramani, 2014)
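
A hedged sketch of this idea for an RBF kernel: approximate the GP with random Fourier features, sample posterior weights of the resulting Bayesian linear model, and take each sampled function's maximum over candidate points as a draw of y*. Lengthscale, noise variance, and feature count are illustrative.

```python
import numpy as np

def sample_y_star_rff(X, y, candidates, lengthscale=1.0, noise_var=0.01,
                      n_features=500, n_samples=10, rng=np.random):
    d = X.shape[1]
    # random features phi(x) = sqrt(2/m) * cos(W x + b) approximate the RBF kernel
    W = rng.randn(n_features, d) / lengthscale
    b = rng.rand(n_features) * 2 * np.pi
    phi = lambda Z: np.sqrt(2.0 / n_features) * np.cos(Z @ W.T + b)

    # Bayesian linear regression on the features: Gaussian posterior over the weights
    Phi = phi(X)
    Sigma = np.linalg.inv(Phi.T @ Phi / noise_var + np.eye(n_features))
    mean = Sigma @ Phi.T @ y / noise_var

    # draw weight samples, evaluate each sampled function, keep its maximum as y*
    L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(n_features))
    thetas = mean[:, None] + L @ rng.randn(n_features, n_samples)
    return (phi(candidates) @ thetas).max(axis=0)
```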

  8. Max-value Entropy Search (recap)
  Input space: α_t(x) = I({x, y}; x* | D_t), with x* d-dimensional.
  Output space: α_t(x) = I({x, y}; y* | D_t): d → 1 dimensions, and we can sample y* for the expectation over p(y* | D_t).
  Does it work?

  9. Empirically: is the max-value enough? Is it sample-efficient?
  (figure: simple regret vs. iteration, PES with 1/10/100 samples of x* vs. MES-G with 1/10/100 samples of y*)

  10. Empirically: faster than PES
  (bar chart: runtime per iteration in seconds for PES, MES-NN, and MES-Gumbel with 1/10/100 samples; PES takes up to 15.24 s per iteration, while MES-Gumbel stays around 0.1 s)

  11. Connections & Theory
  Zoo of acquisition functions: EI (Mockus, 1974), PI (Kushner, 1964), GP-UCB (Auer, 2002; Srinivas et al., 2010), GP-MI (Contal et al., 2014), ES (Hennig & Schuler, 2012), PES (Hernández-Lobato et al., 2014), EST (Wang et al., 2016), GLASSES (González et al., 2016), SMAC (Hutter et al., 2010), ROAR (Hutter et al., 2010), … and MES.
  Lemma (Wang & Jegelka, 2017). With a specific, adaptive parameter setting, the following acquisition functions are equivalent:
  • MES with a single sample of y* per step
  • UCB (upper confidence bound, Srinivas et al., 2010)
  • PI (probability of improvement, Kushner, 1964)
  Theorem: regret bound (Wang & Jegelka, 2017). With probability 1 − δ, within T_0 = O(T log δ) iterations:
  f* − max_{t ∈ [1, T_0]} f(x_t) = O( √( (log T)^{d+2} / T ) )

  12. Gaussian Processes in high dimensions
  • estimating a nonlinear function in high input dimensions: statistically challenging
  • optimizing a nonconvex acquisition function in high dimensions: computationally challenging
  • many observations: huge matrices, computationally challenging

  13. Additive Gaussian Processes
  f(x) = Σ_{m ∈ [M]} f_m(x^{A_m}), e.g. f_0(x^{A_0}) + f_1(x^{A_1}) + f_2(x^{A_2})
  • lower-complexity component functions → statistical efficiency
  • optimize the acquisition function block-wise → computational efficiency
  What is the partition? (Hastie & Tibshirani, 1990; Kandasamy et al., 2015)
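
A minimal sketch of the corresponding additive kernel, one RBF kernel per coordinate group A_m, summed; the partition used below is illustrative, not a learned one:

```python
import numpy as np

def rbf(X, Y, lengthscale):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def additive_kernel(X, Y, groups, lengthscales):
    # k(x, x') = sum_m k_m(x^{A_m}, x'^{A_m})
    return sum(rbf(X[:, g], Y[:, g], l) for g, l in zip(groups, lengthscales))

# example: a 5-dimensional input split into blocks {0, 1}, {2}, {3, 4}
groups = [[0, 1], [2], [3, 4]]
X = np.random.rand(8, 5)
K = additive_kernel(X, X, groups, [1.0, 0.5, 2.0])        # (8, 8) kernel matrix
```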

  14. Structural Kernel Learning
  f = f_0 + f_1 + f_2, with an assignment vector z = [0 1 0 0 1 1 1 0 2] mapping each input dimension to one of the components f_m(x^{A_m}). Learn the assignment!
  Key idea: Dirichlet prior; z_j ∼ Multi(θ). Integrate out θ and sample from the posterior p(z | D_n; α) via Gibbs sampling (easy updates).
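
A hedged sketch of one Gibbs sweep over the assignment z. Here `log_marginal_likelihood(z)` is a hypothetical helper that scores an additive GP with partition z on the current data; the collapsed Dirichlet-multinomial prior term is the standard one, but details may differ from the authors' implementation.

```python
import numpy as np

def gibbs_sweep(z, n_groups, alpha, log_marginal_likelihood, rng=np.random):
    z = np.array(z)
    for j in range(len(z)):                               # resample each dimension in turn
        log_post = np.empty(n_groups)
        for g in range(n_groups):
            z_try = z.copy()
            z_try[j] = g
            n_g = np.sum(np.delete(z_try, j) == g)        # dims already in group g
            log_prior = np.log(n_g + alpha / n_groups)    # collapsed Dirichlet prior term
            log_post[g] = log_prior + log_marginal_likelihood(z_try)
        p = np.exp(log_post - log_post.max())
        z[j] = rng.choice(n_groups, p=p / p.sum())
    return z
```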

  15. Empirical Results
  (figures: simple regret r_t vs. iteration on a 50-dimensional synthetic function and on a robot pushing task with D = 50, comparing No Partition, Fully Partitioned, Heuristics, the True partition, and SKL)

  16. Curious connections
  • crossover in evolutionary algorithms recombines coordinates of observed good points
  • BO with an additive GP does something similar: query points that maximize the estimated acquisition function mix coordinates of observed good points, but with a learned coordinate partition instead of a completely random one
  (figures: observations and the estimated acquisition function)

  17. Gaussian Processes in high dimensions
  • estimating nonlinear functions in high input dimensions: statistically challenging
  • optimizing a nonconvex acquisition function in high dimensions: computationally challenging
  • many observations: huge matrix inversions, computationally challenging
  GP posterior: µ(x) = k_n(x)^T (K_n + τ² I)^{-1} y_t,  σ²(x) = k(x, x) − k_n(x)^T (K_n + τ² I)^{-1} k_n(x)
  (figures: posterior mean ± 3σ with the full kernel vs. a low-rank approximation)
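
A minimal sketch of these posterior formulas for an RBF kernel with k(x, x) = 1; `tau2` is the observation-noise variance:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, X_star, tau2=1e-2, lengthscale=1.0):
    Kn = rbf(X, X, lengthscale) + tau2 * np.eye(len(X))        # K_n + tau^2 I
    kx = rbf(X, X_star, lengthscale)                            # k_n(x) for each test point
    mu = kx.T @ np.linalg.solve(Kn, y)                          # mu(x)
    var = 1.0 - np.sum(kx * np.linalg.solve(Kn, kx), axis=0)    # sigma^2(x)
    return mu, var
```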

  18. Ensemble Bayesian Optimization
  In each iteration:
  • partition the data via a Mondrian process
  • fit a GP in each part: structure learning + tile coding; synchronize
  • select query points in parallel & filter
  Parallelization across parts; a distribution over partitions, with a new draw in each iteration.
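
A hedged, simplified sketch of the partition step: recursively split the data with random axis-aligned cuts, choosing the cut dimension in proportion to its side length (as a Mondrian process does), so that a separate GP can then be fit in each part. The stopping rule and budget are simplified relative to a true Mondrian process.

```python
import numpy as np

def random_partition(X, idx=None, max_leaf_size=200, rng=np.random):
    idx = np.arange(len(X)) if idx is None else idx
    if len(idx) <= max_leaf_size:
        return [idx]                                         # leaf: one GP per part
    widths = X[idx].max(axis=0) - X[idx].min(axis=0)
    if widths.sum() == 0:
        return [idx]
    dim = rng.choice(len(widths), p=widths / widths.sum())   # cut dimension ∝ side length
    cut = rng.uniform(X[idx, dim].min(), X[idx, dim].max())
    left, right = idx[X[idx, dim] <= cut], idx[X[idx, dim] > cut]
    if len(left) == 0 or len(right) == 0:
        return [idx]
    return (random_partition(X, left, max_leaf_size, rng) +
            random_partition(X, right, max_leaf_size, rng))

# example: parts = random_partition(np.random.rand(10000, 20)); fit one GP per part
```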

  19. Does it scale?
  (figure: Gibbs sampling time in minutes vs. observation size up to 50k, SKL vs. EBO)
  We stopped SKL after 2 hours; EBO average runtime = 61 seconds.

  20. Variances
  (figures: posterior mean ± 3σ vs. ground truth for 100, 1000, and 5000 observations)

  21. Empirical Results
  (figure: regret vs. time in minutes for BO-SVI, PBO, BO-Add-SVI, and EBO)
  (Hensman et al., 2013; Wang et al., 2017)

  22. Summary: GP-BO in high dimensions
  Challenge: high dimensions, many observations → statistical & computational efficiency
  • Max-value Entropy Search: a sample-efficient, effective acquisition function (Wang, Jegelka, ICML 2017)
  • Many dimensions: learning structured kernels (Wang, Li, Jegelka, Kohli, ICML 2017)
  • Many observations & dimensions & parallelization: Ensemble Bayesian Optimization (Wang, Gehring, Kohli, Jegelka, BayesOpt 2017)


  23. References
  • Zi Wang, Stefanie Jegelka. Max-value Entropy Search for Efficient Bayesian Optimization. ICML 2017.
  • Zi Wang, Chengtao Li, Stefanie Jegelka, Pushmeet Kohli. Batched High-dimensional Bayesian Optimization via Structural Kernel Learning. ICML 2017.
  • Zi Wang, Clement Gehring, Pushmeet Kohli, Stefanie Jegelka. Batched Large-scale Bayesian Optimization in High-dimensional Spaces. BayesOpt 2017.
