 
Bayesian optimization and Information-based Approaches
José Miguel Hernández-Lobato
joint work with Michael A. Gelbart, Matt W. Hoffman, Ryan P. Adams and Zoubin Ghahramani
April 31, 2015
(50% of these slides have been made by Matt)
Bayesian optimization
We are interested in solving black-box optimization problems of the form
$x^\star = \arg\max_{x \in \mathcal{X}} f(x)$
where black-box means:
• We may only be able to observe the function value, no gradients.
• Our observations may be corrupted by noise.
[Diagram: query inputs $x_t$ → black-box $f$ → noisy outputs $Y_t$]
• One requirement on the noisy outputs: $\mathbb{E}[Y_t \mid x_t] = f(x_t)$.
Black-box queries are very expensive (time, economic cost, etc.).
1/21
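The requirement on the noisy outputs can be illustrated with a minimal sketch. It assumes additive zero-mean Gaussian noise, which is only one way to satisfy $\mathbb{E}[Y_t \mid x_t] = f(x_t)$; the names `make_noisy_oracle`, `query` and `noise_std` are hypothetical and not from the slides.

```python
import random

def make_noisy_oracle(f, noise_std=0.1, seed=0):
    """Wrap f as a black box that returns noisy values (assumed Gaussian noise)."""
    rng = random.Random(seed)

    def query(x):
        # Zero-mean noise, so the expected output given x equals f(x).
        return f(x) + rng.gauss(0.0, noise_std)

    return query

# Usage: averaging many repeated queries at one input approaches f(x).
query = make_noisy_oracle(lambda x: x * x)
average = sum(query(2.0) for _ in range(20000)) / 20000
```

In practice each query is expensive, so this kind of averaging is exactly what Bayesian optimization tries to avoid.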
Example (A/B testing): Users visit our website, which has different configurations (A and B), and we want to find the best configuration (possibly online).
Example (Hyperparameter tuning): We have an algorithm that relies on hyperparameters, which we want to optimize with respect to performance.
Example (Design of new molecules): We want to find molecular compounds with optimal chemical properties: more efficient solar panels, batteries, drugs, etc.
2/21
Bayesian black-box optimization
Bayesian optimization in a nutshell:
1 Get initial sample.
2 Construct a posterior model.
3 Select the exploration strategy...
4 ...and optimize it.
5 Sample new data; update model.
6 Repeat!
3/21
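The six steps can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: a one-dimensional input in [0, 1], a zero-mean GP with a squared-exponential kernel, expected improvement as the exploration strategy, and a grid search to optimize it. All function names and parameter values (`rbf`, `solve`, `gp_posterior`, `bayes_opt`, `length`, `noise_var`) are hypothetical.

```python
import math
import random

def rbf(a, b, length=0.2):
    """Squared-exponential kernel; kernel and length scale are assumptions."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= fac * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def gp_posterior(xs, ys, x, noise_var=1e-4):
    """Mean and variance of a zero-mean GP posterior at x."""
    K = [[rbf(a, b) + (noise_var if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x) for a in xs]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, ys)))
    var = rbf(x, x) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, max(var, 1e-12)

def expected_improvement(mean, var, best):
    """Closed-form EI for a Gaussian posterior at one point."""
    s = math.sqrt(var)
    z = (mean - best) / s
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + s * pdf

def bayes_opt(f, n_init=3, n_iter=10, seed=0):
    rng = random.Random(seed)
    grid = [i / 100 for i in range(101)]
    xs = [rng.random() for _ in range(n_init)]      # 1: initial sample
    ys = [f(x) for x in xs]
    for _ in range(n_iter):                         # 6: repeat
        best = max(ys)
        # 2-4: posterior model, exploration strategy, and its maximization
        x_next = max(grid, key=lambda x: expected_improvement(
            *gp_posterior(xs, ys, x), best))
        xs.append(x_next)                           # 5: sample new data
        ys.append(f(x_next))
    i = max(range(len(ys)), key=ys.__getitem__)
    return xs[i], ys[i]
```

Calling `bayes_opt(lambda x: -(x - 0.3) ** 2)` returns the best observed input and value. The model is refit from scratch at every candidate point, which keeps the sketch short but would be wasteful in real use.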
Two primary questions to answer are:
• What is the model?
• What is the exploration strategy given the model?
4/21
Modeling
We want a model that can both make predictions and maintain a measure of uncertainty over those predictions.
Gaussian processes provide a flexible prior for modeling continuous functions of this form.
Bayesian neural networks are an alternative when the data size is large.
Snoek et al. [2015]
5/21
Modeling: Gaussian processes
A Gaussian process $f \sim \mathcal{GP}(m, k)$ defines a distribution over functions such that any finite collection of evaluations at $x_{1:t}$ is Normally distributed,
$$
\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_t) \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} m(x_1) \\ \vdots \\ m(x_t) \end{bmatrix},
\begin{bmatrix}
k(x_1, x_1) & \cdots & k(x_1, x_t) \\
\vdots & \ddots & \vdots \\
k(x_t, x_1) & \cdots & k(x_t, x_t)
\end{bmatrix}
\right)
$$
If the observations $y$ are the result of Normal noise on $f$, then
• $P(y_{1:t}, f(x_{1:t}))$ is jointly Gaussian.
• Conditioning can be done in closed form.
• The result is a tractable GP posterior distribution.
Rasmussen and Williams [2006]
6/21
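For reference, the closed-form conditioning can be written out as follows. This is the standard GP regression result, stated under the assumption of i.i.d. Normal noise with variance $\sigma^2$; the symbols $\sigma^2$, $K$, $\mathbf{k}_t(x)$, $\mathbf{m}$, $\mu_t$ and $\sigma_t^2$ are notation introduced here, not taken from the slides.

```latex
% K is the t x t matrix with entries k(x_i, x_j),
% k_t(x) is the vector with entries k(x_i, x),
% m is the vector with entries m(x_i), and y = y_{1:t}.
\begin{aligned}
\mu_t(x)      &= m(x) + \mathbf{k}_t(x)^\top \left(K + \sigma^2 I\right)^{-1} (\mathbf{y} - \mathbf{m}), \\
\sigma_t^2(x) &= k(x, x) - \mathbf{k}_t(x)^\top \left(K + \sigma^2 I\right)^{-1} \mathbf{k}_t(x), \\
f(x) \mid \mathcal{D}_t &\sim \mathcal{N}\left(\mu_t(x), \sigma_t^2(x)\right).
\end{aligned}
```

The posterior mean gives the prediction and the posterior variance gives the uncertainty measure that the acquisition functions rely on.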
The exploration strategy: expected improvement
The exploration strategy must explicitly trade off between exploration and exploitation. It should map the model and a query point to an expected future value. The result is an acquisition function.
Common approach: maximize the Expected Improvement (EI):
$$\alpha_t(x) = \mathbb{E}_{f(x)}\left[\max\left(0, f(x) - f(x^+)\right) \mid \mathcal{D}_t\right]$$
• $\mathcal{D}_t$: the observations.
• $x^+$: the best point found so far.
Intuitively, EI selects the point that is expected to give the most improvement over our current best solution in the next round.
Mockus et al. [1978], Jones et al. [1998]
7/21
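When the posterior at $x$ is Gaussian, the EI expectation has a well-known closed form. The sketch below assumes $f(x) \mid \mathcal{D}_t \sim \mathcal{N}(\mu_t(x), \sigma_t^2(x))$ with $\sigma_t(x) > 0$ and treats $f(x^+)$ as a known constant; $\mu_t$, $\sigma_t$, $z$, $\Phi$ and $\varphi$ are notation introduced here.

```latex
% Phi and varphi are the standard Normal CDF and PDF.
\begin{aligned}
z &= \frac{\mu_t(x) - f(x^+)}{\sigma_t(x)}, \\
\alpha_t(x) &= \left(\mu_t(x) - f(x^+)\right) \Phi(z) + \sigma_t(x)\, \varphi(z).
\end{aligned}
```

The first term rewards points whose predicted mean exceeds the current best (exploitation) and the second rewards points with large posterior uncertainty (exploration).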
The exploration strategy: Entropy Search
Entropy search (ES) maximizes the expected reduction in entropy:
$$\alpha_t(x) = H\left[x^\star \mid \mathcal{D}_t\right] - \mathbb{E}_y\left[ H\left[x^\star \mid \mathcal{D}_t \cup \{(x, y)\}\right] \mid \mathcal{D}_t, x \right] \quad \text{(ES)}$$
where $x^\star$ is the unknown global optimizer.
[Figure: four panels plotted over the input range 0 to 1; one axis is labeled "Density".]
Villemonteix et al. [2009], Hennig and Schuler [2012]
8/21
Predictive Entropy Search
The ES acquisition function is equal to the mutual information $I(y; x^\star) = I(x^\star; y)$. Because mutual information is symmetric, we can swap $y$ and $x^\star$ and rewrite the acquisition as
$$\alpha_t(x) = H\left[y \mid \mathcal{D}_t, x\right] - \mathbb{E}_{x^\star}\left[ H\left[y \mid \mathcal{D}_t, x, x^\star\right] \mid \mathcal{D}_t, x \right] \quad \text{(PES)}$$
which we call Predictive Entropy Search.
Approximating the PES acquisition function can be done in two steps:
1 Sampling from the distribution over global maximizers $x^\star$.
2 Estimating the predictive entropy for $y$ conditioned on $x^\star$.
Hernández-Lobato et al. [2014]
9/21
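Once both steps are available, the PES expression can be estimated by averaging over samples of $x^\star$. The sketch below assumes that the predictive distribution of $y$ and each conditioned distribution are Gaussian, so each entropy depends only on a variance; the variances are taken as given inputs. The names `gaussian_entropy`, `pes_estimate`, `pred_var` and `cond_vars` are hypothetical.

```python
import math

def gaussian_entropy(var):
    """Differential entropy of a one-dimensional Gaussian with variance var."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def pes_estimate(pred_var, cond_vars):
    """Monte Carlo estimate of the PES acquisition at one query point.

    pred_var:  variance of y given the data and the query point.
    cond_vars: variances of y after additionally conditioning on each
               sampled maximizer (one entry per sample).
    """
    avg_cond = sum(gaussian_entropy(v) for v in cond_vars) / len(cond_vars)
    return gaussian_entropy(pred_var) - avg_cond

# Usage: conditioning that shrinks the variance yields a positive value.
value = pes_estimate(1.0, [0.25, 0.5, 1.0])
```

If conditioning on the sampled maximizers leaves the variance unchanged, the estimate is zero, meaning the query is judged uninformative about $x^\star$.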
1: Sampling the location of the optimum $x^\star$
To sample $x^\star$ we need only sample $\tilde{f} \sim p(f \mid \mathcal{D}_t)$ and return $\arg\max_x \tilde{f}(x)$.
[Figure: two panels plotted over the input range 0 to 1; one axis is labeled "Density".]
However, $\tilde{f}$ is an infinite-dimensional object!
Instead we use $\tilde{f}(\cdot) \approx \phi(\cdot)^\top \theta$, where $\phi(x) = \sqrt{2\alpha/m}\, \cos(Wx + b)$.
Bochner's theorem shows that when $m \to \infty$ the approximation is exact.
Bochner [1959]
10/21
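The feature map can be sketched in a few lines. The slides do not say which kernel is used, so this example assumes a one-dimensional squared-exponential kernel $k(x, x') = \alpha \exp(-(x - x')^2 / (2\ell^2))$, for which the frequencies $W$ are drawn from a Normal with standard deviation $1/\ell$ and the phases $b$ uniformly from $[0, 2\pi]$. For brevity, $\theta$ is drawn from its prior $\mathcal{N}(0, I)$ rather than from the posterior given $\mathcal{D}_t$, which would additionally require Bayesian linear regression on the features. The names `make_features` and `sample_argmax` are hypothetical.

```python
import math
import random

def make_features(m, alpha=1.0, length=0.2, seed=0):
    """Return phi(x) = sqrt(2 * alpha / m) * cos(W x + b) with m random features."""
    rng = random.Random(seed)
    W = [rng.gauss(0.0, 1.0 / length) for _ in range(m)]      # spectral samples
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(m)]   # random phases
    scale = math.sqrt(2.0 * alpha / m)

    def phi(x):
        return [scale * math.cos(w * x + bi) for w, bi in zip(W, b)]

    return phi

def sample_argmax(phi, m, grid, seed=1):
    """Draw theta ~ N(0, I) (a PRIOR sample) and maximize phi(x)^T theta on a grid."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(m)]
    return max(grid, key=lambda x: sum(p * t for p, t in zip(phi(x), theta)))

# Usage: the inner product of features approximates the kernel value.
phi = make_features(m=5000)
approx = sum(p * q for p, q in zip(phi(0.3), phi(0.4)))
exact = math.exp(-0.5 * ((0.3 - 0.4) / 0.2) ** 2)
```

Because the sampled function is a finite linear combination of cosines, it can be evaluated and maximized anywhere, which is what makes sampling $x^\star$ practical.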
2: Approximating the distribution of $y$ given $x^\star$
Instead of conditioning on $x^\star$, we use the following simplified constraints:
• $\nabla f(x^\star) = 0$
• $d = \mathrm{diag}[\nabla^2 f(x^\star)] < 0$
• $\mathrm{upper}[\nabla^2 f(x^\star)] = 0$
• $f(x^\star) > \max_t f(x_t)$
• $f(x^\star) > f(x)$
[Diagram labels: A, B, C]
• We can incorporate the equality constraints on the gradient and the Hessian analytically.
• To deal with the inequality constraints we use expectation propagation (EP).
• The result is a Gaussian approximation to $p(y \mid \mathcal{D}_t, x, x^\star)$, for which we can easily calculate the entropy.
11/21
Accuracy of the PES approximation
The following compares ES and PES against a ground-truth objective computed with a fine-grained rejection sampling (RS) scheme.
[Figure: three panels titled "RS Acquisition Function", "ES Acquisition Function", and "PES Acquisition Function".]
We see that PES provides a much better approximation to the RS ground truth than ES.
12/21
Results on simulated data
Here we show results where the objective function is sampled from a known GP prior.
[Figure: Log10 median IR versus number of function evaluations (0 to 50) for the methods EI, ES, and PES.]
13/21