SLIDE 1

Scaling Bayesian Optimization in High Dimensions

Stefanie Jegelka, MIT
 BayesOpt Workshop 2017


joint work with Zi Wang, Chengtao Li, Clement Gehring (MIT) 
 and Pushmeet Kohli (DeepMind)

SLIDE 2

Bayesian Optimization with GPs

BO: sequentially build a model of f
for t = 1, …, T:

  • select new query point(s) x
  • observe f(x)
  • update model & repeat


[Figure: GP posterior over f(x): mean μ with ±σ confidence band]

Gaussian process: closed form expressions for posterior mean and 
 variance (uncertainty)

f ∼ GP(µ, k)

selection criterion: acquisition function

argmax_{x ∈ X} α_t(x)

SLIDE 3

Challenges in high dimensions

statistical & computational complexity:

  • estimating & optimizing the acquisition function
    → new, sample-efficient acquisition function (ICML 2017)
  • function estimation in high dimensions
    → learn input structure (ICML 2017)
  • many observations (data points): huge matrices in the GP
  • parallelization
    → multiple random partitions (BayesOpt 2017)

SLIDE 4

(Predictive) Entropy Search
(Hennig & Schuler, 2012; Hernández-Lobato, Hoffman & Ghahramani, 2014)

α_t(x) = I({x, y}; x∗ | D_t), where x∗ is the location of the global optimum of f

  ES:  α_t(x) = H(p(x∗ | D_t)) − E[H(p(x∗ | D_t ∪ {x, y}))]
  PES: α_t(x) = H(p(y | D_t, x)) − E[H(p(y | D_t, x, x∗))]

(both use the symmetry of mutual information: I(a; b) = H(a) − H(a|b) = H(b) − H(b|a))

new query point: argmax_{x ∈ X} α_t(x)
but x∗ is high-dimensional: α_t(x) is costly to estimate!

SLIDE 5

Max-value Entropy Search

key idea: replace the d-dimensional optimizer x∗ (input space X) by the 1-dimensional max-value y∗ (output space): d → 1 dimensions!

α_t(x) = I({x, y}; y∗ | D_t)
       ≈ (1/K) Σ_{y∗ ∈ Y∗} [ γ_{y∗}(x) ψ(γ_{y∗}(x)) / (2 Ψ(γ_{y∗}(x))) − log Ψ(γ_{y∗}(x)) ]

where γ_{y∗}(x) = (y∗ − μ_t(x)) / σ_t(x), and ψ, Ψ are the standard normal pdf and cdf

  • closed-form!
  • expectation over p(y∗ | D_t): how to sample y∗?
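The sum above is cheap to evaluate; a sketch, assuming the posterior mean mu, standard deviation sigma, and max-value samples y_star are already in hand (illustrative, not the authors' code):

    # MES acquisition: average the closed-form term over K sampled max-values.
    # gamma = (y* - mu(x)) / sigma(x); psi/Psi = standard normal pdf/cdf.
    import numpy as np
    from scipy.stats import norm

    def mes_acquisition(mu, sigma, y_star):
        mu = np.asarray(mu)[:, None]                 # shape (n_points, 1)
        sigma = np.asarray(sigma)[:, None]
        gamma = (np.asarray(y_star)[None, :] - mu) / np.maximum(sigma, 1e-12)
        Psi = np.clip(norm.cdf(gamma), 1e-12, None)  # guard against log(0)
        term = gamma * norm.pdf(gamma) / (2.0 * Psi) - np.log(Psi)
        return term.mean(axis=1)                     # average over the K samples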

SLIDE 6

Sampling y*: Idea 1

at any fixed x, p(f(x)) is a 1D Gaussian

  • sample representative points
  • approximate the max-value of the representative points by a Gumbel distribution

Fisher–Tippett–Gnedenko Theorem: the maximum of a set of i.i.d. Gaussian variables is asymptotically described by a Gumbel distribution.

[Figure: Gaussian marginals p(f(x)) at representative points, f(x) vs. x]
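A sketch of this approximation: match two quantiles of the exact max-CDF P(max < y) = ∏_i Ψ((y − μ_i)/σ_i) over the representative points to fix the Gumbel parameters (the 25%/75% quantile choice and the ±6σ search bracket are illustrative assumptions):

    # Fit a Gumbel distribution to the max of independent Gaussians, then sample it.
    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import brentq

    def sample_y_star(mu, sigma, n_samples, rng):
        # exact CDF of max_i f(x_i) for independent Gaussian marginals
        cdf_max = lambda y: np.exp(np.sum(norm.logcdf((y - mu) / sigma)))
        lo, hi = (mu - 6 * sigma).min(), (mu + 6 * sigma).max()
        # invert the max-CDF at two probabilities (quantile matching)
        q25 = brentq(lambda y: cdf_max(y) - 0.25, lo, hi)
        q75 = brentq(lambda y: cdf_max(y) - 0.75, lo, hi)
        # Gumbel CDF: exp(-exp(-(y - a)/b)); its p-quantile is a - b*log(-log p)
        c25, c75 = np.log(-np.log(0.25)), np.log(-np.log(0.75))
        b = (q25 - q75) / (c75 - c25)
        a = q25 + b * c25
        # inverse-CDF sampling from the fitted Gumbel
        return a - b * np.log(-np.log(rng.uniform(size=n_samples)))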

SLIDE 7

Sampling y*: Idea 2

draw functions from the GP posterior and maximize each. How?

  • approximate the GP as a finite neural network (random features) & sample posterior weights
  • maximize the network output for each sample

(Hernández-Lobato, Hoffman & Ghahramani, 2014)

Neal 1994: a GP is an infinite 1-layer neural network with Gaussian weights.

[Figure: posterior function samples, output f(x) vs. input x; network diagram x → random weights → f(x)]
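A sketch of this construction with random Fourier features for a squared-exponential kernel; the feature count, lengthscale, noise level, and grid search over [0, 1]^d are illustrative assumptions:

    # One max-value sample y*: approximate the GP by Bayesian linear regression
    # on m random Fourier features, sample the weight posterior, and maximize
    # the resulting (finite) network output.
    import numpy as np

    def rff_max_sample(X, y, m=500, lengthscale=1.0, tau=0.1,
                       n_candidates=5000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        W = rng.normal(scale=1.0 / lengthscale, size=(m, d))    # random weights
        b = rng.uniform(0, 2 * np.pi, size=m)
        phi = lambda Z: np.sqrt(2.0 / m) * np.cos(Z @ W.T + b)  # feature map
        Phi = phi(X)
        # weight posterior: N(A^{-1} Phi^T y / tau^2, A^{-1}), A = Phi^T Phi / tau^2 + I
        A = Phi.T @ Phi / tau**2 + np.eye(m)
        L = np.linalg.cholesky(A)
        mean = np.linalg.solve(L.T, np.linalg.solve(L, Phi.T @ y)) / tau**2
        theta = mean + np.linalg.solve(L.T, rng.normal(size=m))  # one posterior draw
        # maximize the sampled function f(x) = phi(x)^T theta
        # (grid search over [0,1]^d as a stand-in for a proper optimizer)
        cand = rng.uniform(0, 1, size=(n_candidates, d))
        return (phi(cand) @ theta).max()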

SLIDE 8

Max-value Entropy Search

α_t(x) = I({x, y}; y∗ | D_t)

recall: the optimizer x∗ lives in the d-dimensional input space X, the max-value y∗ in the 1-dimensional output space: d → 1 dimensions!

expectation over p(y∗ | D_t): and now we can sample y∗!

Does it work?

SLIDE 9

Empirically: max-value enough? sample-efficiency?

[Plot: simple regret vs. iteration (1–200) for PES (sampling x∗) and MES-G (sampling y∗), each with 1, 10, and 100 samples]

SLIDE 10

Empirically: faster than PES

Runtime per iteration (s):

  samples       1      10     100
  PES           0.12   5.85   15.24
  MES-NN        0.09   0.67   1.61
  MES-Gumbel    0.09   0.13   0.20

SLIDE 11

Connections & Theory

zoo of acquisition functions: EI (Mockus, 1974), PI (Kushner, 1964), GP-UCB (Auer, 2002; Srinivas et al., 2010), GP-MI (Contal et al., 2014), ES (Hennig & Schuler, 2012), PES (Hernández-Lobato et al., 2014), EST (Wang et al., 2016), GLASSES (González et al., 2016), SMAC (Hutter et al., 2010), ROAR (Hutter et al., 2010), … and now MES

Lemma (Wang & Jegelka, 2017). The following acquisition functions are equivalent:

  • MES with a single sample of y∗ per step
  • UCB (upper confidence bound; Srinivas et al., 2010) with a specific, adaptive parameter setting
  • PI (probability of improvement; Kushner, 1964) with a specific, adaptive parameter setting

Theorem: regret bound (Wang & Jegelka, 2017). With probability 1 − δ, within T′ = O(T log(1/δ)) iterations:

f∗ − max_{t ∈ [1, T′]} f(x_t) = O( √( (log T)^{d+2} / T ) )

SLIDE 12

Gaussian Processes in high dimensions

  • estimating a nonlinear function in high input dimensions: statistically challenging
  • optimizing a nonconvex acquisition function in high dimensions: computationally challenging
  • many observations (huge matrices in the GP): computationally challenging

SLIDE 13

Additive Gaussian Processes

f(x) = Σ_{m ∈ [M]} f_m(x^{A_m})
(Hastie & Tibshirani, 1990; Kandasamy et al., 2015)

  • lower-complexity component functions → statistical efficiency
  • optimize the acquisition function block-wise → computational efficiency

[Figure: decomposition f = f_0(x^{A_0}) + f_1(x^{A_1}) + f_2(x^{A_2})]
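Concretely, an additive kernel is a sum of kernels, each acting only on its own coordinate block; a sketch with RBF components (here the partition is given, not yet learned):

    # Additive RBF kernel: k(x, x') = sum_m k_m(x^{A_m}, x'^{A_m}).
    import numpy as np

    def additive_rbf(X1, X2, groups, lengthscale=1.0):
        """groups: list of index lists A_m partitioning the input dimensions."""
        K = np.zeros((X1.shape[0], X2.shape[0]))
        for A in groups:
            diff = X1[:, None, A] - X2[None, :, A]   # pairwise differences on block A
            K += np.exp(-0.5 * np.sum(diff**2, axis=-1) / lengthscale**2)
        return K

    # e.g. 5 input dimensions decomposed as {0, 2} ∪ {1} ∪ {3, 4}:
    # K = additive_rbf(X, X, groups=[[0, 2], [1], [3, 4]])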

What is the partition?

SLIDE 14

Structural Kernel Learning

f = f_0 + f_1 + f_2, with the decomposition encoded by an assignment vector

z = [0 1 0 0 1 1 1 0 2]

Learn the assignment! Key idea: Dirichlet prior on z:
z_j ∼ Multi(θ), θ ∼ Dir(α), and integrate out θ

posterior p(z | D_n; α) via Gibbs sampling: easy updates
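A sketch of one such Gibbs sweep, reusing the additive_rbf kernel sketched above: the evidence term is the additive-GP marginal likelihood and the prior weight is the collapsed Dirichlet-multinomial count (illustrative, not the paper's implementation):

    # Resample each dimension's group assignment z_j proportional to
    # (marginal likelihood of the data) x (Dirichlet-multinomial prior weight).
    import numpy as np

    def log_evidence(X, y, z, M, tau=0.1):
        groups = [np.flatnonzero(z == m) for m in range(M)]
        K = additive_rbf(X, X, [list(g) for g in groups if g.size > 0])
        K += tau**2 * np.eye(len(y))
        L = np.linalg.cholesky(K)
        a = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return -0.5 * y @ a - np.log(np.diag(L)).sum()   # log N(y; 0, K) + const

    def gibbs_sweep(X, y, z, M, alpha=1.0, rng=None):
        rng = rng or np.random.default_rng()
        for j in range(len(z)):
            counts = np.bincount(np.delete(z, j), minlength=M)
            logp = np.empty(M)
            for m in range(M):
                z[j] = m
                logp[m] = log_evidence(X, y, z, M) + np.log(counts[m] + alpha)
            p = np.exp(logp - logp.max())
            z[j] = rng.choice(M, p=p / p.sum())
        return z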

SLIDE 15

Empirical Results

[Plots: simple regret r_t vs. iteration t on a robot pushing task and on a 50-dimensional synthetic function; methods: True partition, No Partition, Fully Partitioned, SKL, and heuristics]

SLIDE 16

Curious connections

  • crossover in evolutionary algorithms: new candidates recombine coordinates of observed good points, with a completely random coordinate partition
  • BO with an additive GP: query points likewise recombine coordinates of observed good points, but with a learned coordinate partition

[Figure: 3 observations and the estimated acquisition function; recombined coordinates such as (0.5, 0.1, 0.3), (0.9, 0.8, 0.5), (0.5, 0.8, 0.3)]

SLIDE 17

Gaussian Processes in high dimensions

  • estimating nonlinear functions in high input dimensions: statistically challenging
  • optimizing a nonconvex acquisition function in high dimensions: computationally challenging
  • many observations (huge matrix inversions): computationally challenging

μ(x) = k_n(x)ᵀ(K_n + τ²I)⁻¹ y_n
σ²(x) = k(x, x) − k_n(x)ᵀ(K_n + τ²I)⁻¹ k_n(x)

[Figures: posterior mean μ, ±3σ band, and true f, for the full kernel vs. a low-rank approximation]
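These two expressions in code, using a Cholesky factorization rather than an explicit inverse (a standard sketch, not the talk's implementation; kernel is any cross-covariance function, e.g. the additive one above):

    # GP posterior mean and variance at test points Xs:
    #   mu(x)     = k_n(x)^T (K_n + tau^2 I)^{-1} y_n
    #   sigma2(x) = k(x, x) - k_n(x)^T (K_n + tau^2 I)^{-1} k_n(x)
    import numpy as np

    def gp_posterior(kernel, X, y, Xs, tau=0.1):
        Kn = kernel(X, X) + tau**2 * np.eye(len(y))
        ks = kernel(X, Xs)                  # shape (n, n_test)
        L = np.linalg.cholesky(Kn)
        mu = ks.T @ np.linalg.solve(L.T, np.linalg.solve(L, y))
        V = np.linalg.solve(L, ks)          # so ks^T Kn^{-1} ks = V^T V
        var = np.diag(kernel(Xs, Xs)) - np.sum(V**2, axis=0)
        return mu, var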

SLIDE 18

Ensemble Bayesian Optimization

in each iteration:

  • partition the data via a Mondrian process (a distribution over partitions: a new draw in each iteration)
  • fit a GP in each part: structure learning + tile coding; synchronize across parts
  • select query points in parallel & filter (parallelization across parts)
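A heavily simplified sketch of one such iteration: random axis-aligned cuts stand in for the Mondrian-process draw, and a placeholder proposal stands in for the per-part model fitting and acquisition optimization:

    # One EBO-style iteration: draw a fresh partition, handle parts in parallel.
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def random_partition(X, depth, rng):
        """Label points by `depth` random axis-aligned cuts (up to 2**depth parts)."""
        labels = np.zeros(len(X), dtype=int)
        for _ in range(depth):
            dim = rng.integers(X.shape[1])
            cut = rng.uniform(X[:, dim].min(), X[:, dim].max())
            labels = 2 * labels + (X[:, dim] > cut)
        return labels

    def propose_in_part(X_part, y_part, rng):
        # placeholder: fit a local GP and maximize a local acquisition here;
        # a random point in the part's bounding box stands in for that step
        return rng.uniform(X_part.min(axis=0), X_part.max(axis=0))

    def ebo_iteration(X, y, depth=2, seed=0):
        rng = np.random.default_rng(seed)
        labels = random_partition(X, depth, rng)   # new draw every iteration
        parts = np.unique(labels)
        with ThreadPoolExecutor() as pool:         # parallel across parts
            futures = [pool.submit(propose_in_part, X[labels == m],
                                   y[labels == m], np.random.default_rng(seed + 1 + m))
                       for m in parts]
            return np.array([f.result() for f in futures])  # then filter & evaluate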

SLIDE 19

Does it scale?

[Plot: Gibbs sampling time (minutes) vs. observation size (10k–50k) for SKL and EBO]

We stopped SKL after 2 hours; EBO's average runtime was 61 seconds.

SLIDE 20

Variances

[Figures: posterior bands over f(x) vs. x; panels: Ground Truth, 5000 Observations, 1000 Observations, 100 Observations, plus four additional panels at 5000 Observations]

SLIDE 21

Empirical Results

[Plot: regret vs. time (minutes) for BO-SVI, BO-Add-SVI, PBO, and EBO]
(Hensman et al., 2013; Wang et al., 2017)

SLIDE 22

Summary: GP-BO in high dimensions

Challenge: high dimensions, many observations; we need statistical & computational efficiency

  • Max-value Entropy Search: a sample-efficient, effective acquisition function (Wang & Jegelka, ICML 2017)
  • many dimensions: learning structured kernels (Wang, Li, Jegelka & Kohli, ICML 2017)
  • many observations, many dimensions & parallelization: Ensemble Bayesian Optimization (Wang, Gehring, Kohli & Jegelka, BayesOpt 2017)

SLIDE 23

References

  • Zi Wang, Stefanie Jegelka. Max-value Entropy Search for Efficient Bayesian Optimization. ICML 2017.
  • Zi Wang, Chengtao Li, Stefanie Jegelka, Pushmeet Kohli. Batched High-dimensional Bayesian Optimization via Structural Kernel Learning. ICML 2017.
  • Zi Wang, Clement Gehring, Pushmeet Kohli, Stefanie Jegelka. Batched Large-scale Bayesian Optimization in High-dimensional Spaces. BayesOpt 2017.