
SLIDE 1

Bayesian optimization and Information-based Approaches

José Miguel Hernández-Lobato, joint work with Michael A. Gelbart, Matt W. Hoffman, Ryan P. Adams and Zoubin Ghahramani. April 31, 2015. (50% of these slides were made by Matt.)

SLIDE 2

Bayesian optimization

We are interested in solving black-box optimization problems of the form

x⋆ = arg max_{x ∈ X} f(x),

where black-box means:

  • We may only be able to observe the function value, no gradients.
  • Our observations may be corrupted by noise.

[Diagram: the black box f receives query inputs xt and returns noisy outputs Yt.]

  • One requirement on the noisy outputs: E[Yt | xt] = f(xt).

Black-box queries are very expensive (time, economic cost, etc.).

1/21

SLIDE 3

Example (A/B testing)

Users visit our website which has different configurations (A and B) and we want to find the best configuration (possibly online).

Example (Hyperparameter tuning)

We have some algorithm which relies on hyperparameters which we want to optimize with respect to performance.

Example (Design of new molecules)

We want to find molecular compounds with optimal chemical properties: more efficient solar panels, batteries, drugs, etc...

2/21

SLIDE 4

Bayesian black-box optimization

Bayesian optimization in a nutshell:

1. Get an initial sample.
2. Construct a posterior model.
3. Select the exploration strategy...
4. ...and optimize it.
5. Sample new data; update the model.
6. Repeat!

3/21

SLIDE 11

Two primary questions to answer are:

  • What is the model?
  • What is the exploration strategy, given the model?

4/21
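The nutshell loop on the previous slides can be sketched in code. This is a minimal illustration, not the implementation behind these slides: a deliberately simple nearest-neighbour surrogate with a distance-based exploration bonus stands in for the posterior model and acquisition function discussed in the following slides.

```python
import numpy as np

def bayes_opt_loop(f, lo, hi, n_init=3, n_iter=25, kappa=1.0, seed=0):
    """Minimal Bayesian-optimization-style loop on a 1-D interval [lo, hi].

    Surrogate: nearest-neighbour prediction; the 'uncertainty' is the
    distance to the closest evaluated point (a crude stand-in for a GP).
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, size=n_init)          # 1. get an initial sample
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        cand = np.linspace(lo, hi, 200)           # candidate grid
        d = np.abs(cand[:, None] - X[None, :])    # distances to the data
        mu = y[np.argmin(d, axis=1)]              # 2. surrogate prediction
        bonus = d.min(axis=1)                     #    exploration bonus
        acq = mu + kappa * bonus                  # 3. exploration strategy
        x_next = cand[np.argmax(acq)]             # 4. ...and optimize it
        X = np.append(X, x_next)                  # 5. sample new data;
        y = np.append(y, f(x_next))               #    update the model
    return X[np.argmax(y)], y.max()               # 6. repeat, return best

x_best, y_best = bayes_opt_loop(lambda x: -(x - 0.3) ** 2, 0.0, 1.0)
```

With a real GP surrogate, only the `mu`/`bonus` lines change; the outer loop is the same.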

SLIDE 12

Modeling

We want a model that can both make predictions and maintain a measure of uncertainty over those predictions. Gaussian processes provide a flexible prior for modeling continuous functions of this form. Bayesian neural networks are an alternative when the data size is large.

Snoek et al. [2015] 5/21

SLIDE 13

Modeling: Gaussian processes

A Gaussian process f ∼ GP(m, k) defines a distribution over functions such that any finite collection of evaluations f(x1), …, f(xt) is jointly Gaussian:

(f(x1), …, f(xt))ᵀ ∼ N( (m(x1), …, m(xt))ᵀ, K ),  with Kij = k(xi, xj).

If the observations y are the result of Gaussian noise on f, then:

  • p(y1:t, f(x1:t)) is jointly Gaussian.
  • Conditioning can be done in closed form.
  • The result is a tractable GP posterior distribution.

Rasmussen and Williams [2006] 6/21
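The closed-form conditioning above can be sketched in a few lines. This is a minimal illustration, assuming a squared-exponential kernel and a zero mean function (both illustrative choices, not prescribed by the slide):

```python
import numpy as np

def rbf(A, B, ell=0.2, amp=1.0):
    """Squared-exponential kernel k(a, b) = amp * exp(-(a - b)^2 / (2 ell^2))."""
    return amp * np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Closed-form GP posterior mean and variance at test points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))   # train covariance + noise
    Ks = rbf(X, Xs)                          # train/test cross-covariance
    Kss = rbf(Xs, Xs)                        # test covariance
    alpha = np.linalg.solve(K, y)
    mean = Ks.T @ alpha
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mean, var

X = np.array([0.1, 0.5, 0.9])
y = np.sin(2 * np.pi * X)
mean, var = gp_posterior(X, y, np.array([0.5, 0.95]))
```

At an observed input the posterior mean recovers the observation and the variance collapses toward the noise level; away from the data the variance grows, which is exactly the uncertainty the acquisition functions below exploit.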

SLIDE 14

The exploration strategy: expected improvement

The exploration strategy must explicitly trade off exploration against exploitation: it should map the model and a query point to an expected future value. The result is an acquisition function. A common approach is to maximize the Expected Improvement (EI):

αt(x) = E[ max(0, f(x) − f(x+)) | Dt ],  (EI)

where Dt denotes the observations collected so far and x+ is the best value found so far. Intuitively, EI selects the point which gives us the most improvement over our current best solution, in the next round.

Mockus et al. [1978], Jones et al. [1998]

7/21
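Under a Gaussian predictive distribution the EI expectation has a well-known closed form, EI(x) = σ (z Φ(z) + φ(z)) with z = (μ − f(x+)) / σ, where Φ and φ are the standard normal CDF and pdf (Jones et al. [1998]). A small sketch, assuming predictive moments mu and sigma from whatever model is used:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for a Gaussian predictive N(mu, sigma^2), maximization.

    EI = sigma * (z * Phi(z) + phi(z)),  z = (mu - f_best) / sigma.
    """
    if sigma <= 0.0:
        return max(0.0, mu - f_best)      # deterministic prediction
    z = (mu - f_best) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # normal pdf
    return sigma * (z * Phi + phi)
```

Note that EI is always non-negative, and it vanishes only where the model is both confident and no better than the incumbent.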

SLIDE 15

The exploration strategy: Entropy Search

Entropy search (ES) maximizes the expected reduction in entropy:

Villemonteix et al. [2009], Hennig and Schuler [2012]

αt(x) = H[x⋆ | Dt] − E_y[ H[x⋆ | Dt ∪ {(x, y)}] | Dt, x ],  (ES)

where x⋆ is the unknown global optimizer.

[Figure: a GP posterior over f (top) and the induced density over the location of the optimum x⋆ (bottom).]

8/21

SLIDE 16

Predictive Entropy Search

The ES acquisition function is equal to the mutual information I(y; x⋆) = I(x⋆; y). We can therefore swap y and x⋆ and rewrite the acquisition as

αt(x) = H[y | Dt, x] − E_x⋆[ H[y | Dt, x, x⋆] | Dt, x ],  (PES)

which we call Predictive Entropy Search.

Hernández-Lobato et al. [2014]

Approximating the PES acquisition function can be done in two steps:

1. Sampling from the distribution over global maximizers x⋆.
2. Estimating the predictive entropy for y conditioned on x⋆.

9/21

SLIDE 17

1: sampling the location of the optimum x⋆

To sample x⋆ we need only sample f̃ ∼ p(f | Dt) and return arg max_x f̃(x).

[Figure: posterior samples f̃ (top) and the induced density of arg max_x f̃(x) (bottom).]

However, f̃ is an infinite-dimensional object! Instead we use the finite-dimensional approximation f̃(·) ≈ φ(·)ᵀθ, where φ(x) = √(2α/m) cos(Wx + b). Bochner's theorem shows that when m → ∞ the approximation is exact.

Bochner [1959] 10/21
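The random-feature construction above can be sketched as follows. This is a minimal illustration assuming a squared-exponential kernel with amplitude α and lengthscale ℓ, and, for simplicity, a draw of θ from the N(0, I) prior rather than from the posterior (conditioning θ on the data would give a posterior sample instead):

```python
import numpy as np

rng = np.random.default_rng(0)
m, ell, alpha = 2000, 0.2, 1.0        # number of features, lengthscale, amplitude

# Random features for the squared-exponential kernel (Bochner's theorem):
# phi(x) = sqrt(2 alpha / m) cos(W x + b),  W ~ N(0, 1/ell^2),  b ~ U[0, 2 pi]
W = rng.normal(0.0, 1.0 / ell, size=m)
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

def phi(x):
    """Feature map; rows are inputs, columns the m random features."""
    return np.sqrt(2.0 * alpha / m) * np.cos(np.outer(x, W) + b)

theta = rng.normal(size=m)             # theta ~ N(0, I) => approximate GP draw
grid = np.linspace(0.0, 1.0, 500)
f_tilde = phi(grid) @ theta            # finite-basis sample f_tilde = phi^T theta
x_star = grid[np.argmax(f_tilde)]      # sampled location of the optimum
```

The inner product φ(x)ᵀφ(x′) approximates k(x, x′), with error shrinking like 1/√m, which is the sense in which the m → ∞ limit is exact.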

SLIDE 18

2: Approximating the distribution of y given x⋆

Instead of conditioning on x⋆ directly, we use the following simplified constraints:

  A. x⋆ is a local maximum: ∇f(x⋆) = 0, upper[∇²f(x⋆)] = 0, and d = diag[∇²f(x⋆)] < 0.
  B. x⋆ is better than the past evaluations: f(x⋆) > max_t f(xt).
  C. x⋆ is better than the candidate point: f(x⋆) > f(x).

  • We can incorporate the equality constraints on the gradient and the Hessian analytically.
  • To deal with the inequality constraints we use the method of expectation propagation (EP).
  • The result is a Gaussian approximation to p(y | Dt, x, x⋆) for which we can easily calculate the entropy.

11/21

SLIDE 19

Accuracy of the PES approximation

The following compares a fine-grained rejection sampling (RS) scheme, used to compute the ground-truth objective, with ES and PES.

[Figure: the RS (ground truth), ES, and PES acquisition functions evaluated on the same posterior.]

We see that PES provides a much better approximation.

12/21

SLIDE 20

Results on simulated data

Here we show results where the objective function is sampled from a known GP prior.

[Figure: log10 median immediate regret (IR) vs. number of function evaluations, comparing EI, ES, and PES.]
13/21

SLIDE 21

Results on more realistic tasks

[Figure: log10 median immediate regret (IR) vs. number of function evaluations for EI, ES, PES, and PES−NB on the Branin, Cosines, Hartmann, NNet, Hydrogen, Portfolio, Walker A, and Walker B cost functions.]

14/21

SLIDE 22

Bayesian optimization with unknown constraints

A cookie company wants to create a low-calorie cookie that is just as tasty as the original. This is a constrained optimization problem over the parameterized space of cookie recipes. More generally, we want to solve

max_x f(x)  s.t.  c1(x) ≥ 0, …, cK(x) ≥ 0,

where f and c1, …, cK are unknown and return noisy values.

15/21

SLIDE 23

Predictive entropy search with unknown constraints

The PESC acquisition function is

αt(x) = H[y | Dt, x] − E_x⋆[ H[y | Dt, x, x⋆] | Dt, x ].  (PESC)

Hernández-Lobato et al. [2015]

An approximation is obtained in two steps (as in PES):

1. Sampling from the distribution over global maximizers x⋆: sample f̃ ∼ p(f | Dt), c̃1 ∼ p(c1 | Dt), …, c̃K ∼ p(cK | Dt) and solve

arg max_x f̃(x)  s.t.  c̃1(x) ≥ 0, …, c̃K(x) ≥ 0.

2. Estimating the predictive entropy for y conditioned on x⋆:

p(y | Dt, x, x⋆) ∝ ∫ δ[y0 − f(x)] ∏_{k=1}^{K} δ[yk − ck(x)] × ∏_{x′} ( ∏_{k=1}^{K} Θ[ck(x′)] Θ[f(x⋆) − f(x′)] + 1 − ∏_{k=1}^{K} Θ[ck(x′)] ) × ∏_{k=1}^{K} Θ[ck(x⋆)] p(f, c1, …, cK | Dt) df dc1 … dcK,

which is approximated with a product of univariate Gaussians using EP.

16/21
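On a finite candidate grid, step 1 above reduces to masking out infeasible points and maximizing the sampled objective over the rest. A sketch with hypothetical stand-in samples f_s and c_s, in place of real random-feature posterior draws:

```python
import numpy as np

def constrained_argmax(f_tilde, c_tildes, grid):
    """Solve arg max f_tilde(x) s.t. c_k(x) >= 0 for all k, over a grid."""
    feasible = np.ones(len(grid), dtype=bool)
    for c in c_tildes:
        feasible &= c(grid) >= 0.0        # intersect feasible regions
    if not feasible.any():
        return None                       # no feasible point on the grid
    vals = np.where(feasible, f_tilde(grid), -np.inf)
    return grid[np.argmax(vals)]

# Hypothetical sampled functions for illustration only:
grid = np.linspace(0.0, 1.0, 1001)
f_s = lambda x: -(x - 0.9) ** 2           # unconstrained optimum at x = 0.9
c_s = lambda x: 0.5 - x                   # feasible region roughly x <= 0.5
x_star = constrained_argmax(f_s, [c_s], grid)   # best feasible point, near 0.5
```

The constraint pulls the sampled optimizer away from the unconstrained argmax, which is exactly why x⋆ must be sampled jointly from the objective and constraint posteriors.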

SLIDE 24

Experimental results

Optimizing a neural network's validation error on MNIST, constrained to make predictions in under 2 ms. [Figure: log10 objective value vs. number of function evaluations for EIC and PESC.]

Optimizing the effective sample size of HMC on logistic regression, constrained to pass convergence diagnostics. [Figure: −log10 effective sample size vs. number of function evaluations for EIC and PESC.]

Baseline: expected improvement with constraints (EIC):

αt(x) = E[ max(0, f(x) − f(x+)) | Dt ] ∏_{k=1}^{K} p(ck(x) ≥ 0).
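The EIC baseline is cheap to compute: it is closed-form EI multiplied by the feasibility probability of each constraint. A sketch assuming independent Gaussian predictive marginals, with moments (mu_f, sigma_f) for f and (mu_k, sigma_k) for each ck (names are illustrative):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def eic(mu_f, sigma_f, f_best, constraint_moments):
    """EIC(x) = EI(x) * prod_k P(c_k(x) >= 0), each factor Gaussian.

    constraint_moments: list of (mu_k, sigma_k) predictive moments for c_k.
    """
    z = (mu_f - f_best) / sigma_f
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    ei = sigma_f * (z * norm_cdf(z) + phi)      # closed-form EI
    p_feas = 1.0
    for mu_c, sigma_c in constraint_moments:
        p_feas *= norm_cdf(mu_c / sigma_c)      # P(c_k(x) >= 0)
    return ei * p_feas
```

A constraint the model is confident is violated drives the product, and hence the acquisition, to zero, regardless of how promising the objective looks there.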

17/21

SLIDE 25

PESC in a decoupled evaluation setting

The PESC acquisition function is additive across f and c1, . . . , cK.

[Figure: marginal posterior distributions for the objective and the constraint, with the RS and PESC acquisition functions computed separately for each.]

18/21

SLIDE 26

Thanks!

Thank you for your attention!

19/21

SLIDE 27

References I

  • S. Bochner. Lectures on Fourier Integrals. Number 42. Princeton University Press, 1959.
  • P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. JMLR, 13, 2012.
  • J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems 27. Curran Associates, Inc., 2014.
  • J. M. Hernández-Lobato, M. A. Gelbart, M. W. Hoffman, R. P. Adams, and Z. Ghahramani. Predictive entropy search for Bayesian optimization with unknown constraints. arXiv:1502.05312, 2015.
  • D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

20/21

SLIDE 28

References II

  • J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2, 1978.
  • C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Ali, R. P. Adams, et al. Scalable Bayesian optimization using deep neural networks. arXiv preprint arXiv:1502.05700, 2015.
  • J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44(4):509–534, 2009.

21/21