Bayesian optimization and Information-based Approaches
José Miguel Hernández-Lobato, joint work with Michael A. Gelbart, Matt W. Hoffman, Ryan P. Adams and Zoubin Ghahramani. April 2015. (50% of these slides have been made by Matt)
We are interested in solving black-box optimization problems of the form

x∗ = arg max_{x ∈ X} f(x),

where black-box means:
- we can only query f: given input xt, the black box returns a noisy output yt;
- queries are very expensive (time, economic cost, etc.).
Example (AB testing)
Users visit our website, which has different configurations (A and B), and we want to find the best configuration (possibly online).
Example (Hyperparameter tuning)
We have an algorithm that relies on hyperparameters, which we want to optimize with respect to performance.
Example (Design of new molecules)
We want to find molecular compounds with optimal chemical properties: more efficient solar panels, batteries, drugs, etc...
Bayesian optimization in a nutshell:
1 Get initial sample.
2 Construct a posterior model.
3 Select the exploration strategy (an acquisition function) . . .
4 . . . and optimize it.
5 Sample new data; update model.
6 Repeat!
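The loop above can be sketched end to end. This is a minimal illustrative sketch, not the implementation behind these slides: it assumes a 1-D objective on [0, 1], an RBF Gaussian-process model, Expected Improvement as the exploration strategy, and a grid search in step 4; all function names are made up.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ell=0.2):
    # Squared-exponential kernel between two 1-D point sets (assumed kernel).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    # GP posterior mean and variance at test points Xs given data (X, y).
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    alpha = np.linalg.solve(K, y)
    mu = Ks.T @ alpha
    var = np.diag(rbf(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks))
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    # Closed-form EI for a Gaussian predictive distribution.
    s = np.sqrt(var)
    z = (mu - best) / s
    Phi = 0.5 * (1 + np.vectorize(erf)(z / sqrt(2)))
    phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return s * (z * Phi + phi)

def bayes_opt(f, n_init=3, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 1, n_init)                     # 1. initial sample
    y = np.array([f(x) for x in X])
    grid = np.linspace(0, 1, 200)
    for _ in range(n_iter):
        mu, var = gp_posterior(X, y, grid)            # 2. posterior model
        acq = expected_improvement(mu, var, y.max())  # 3. acquisition
        x_next = grid[np.argmax(acq)]                 # 4. optimize it (grid)
        X = np.append(X, x_next)                      # 5. sample new data;
        y = np.append(y, f(x_next))                   #    update model
    return X[np.argmax(y)], y.max()                   # 6. repeat

x_best, y_best = bayes_opt(lambda x: -(x - 0.7) ** 2)
```

After a handful of evaluations the returned x_best lands close to the true maximizer 0.7.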
Two primary questions to answer are: (1) which model to use for the unknown function f, and (2) which exploration strategy (acquisition function) to use.
We want a model that can both make predictions and maintain a measure of uncertainty over those predictions. Gaussian processes provide a flexible prior for modeling continuous functions of this form. Bayesian neural networks are an alternative when the data size is large.
Snoek et al. [2015]
A Gaussian process f ∼ GP(m, k) defines a distribution over functions such that any finite collection of evaluations at x1, . . . , xt is jointly Gaussian:

(f(x1), . . . , f(xt))ᵀ ∼ N( (m(x1), . . . , m(xt))ᵀ, K ), where Kij = k(xi, xj).

If the observations y are the result of Normal noise on f, then the posterior p(f | Dt) is again a Gaussian process, with mean and covariance available in closed form.
Rasmussen and Williams [2006]
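A small numerical check of the formulas above, under illustrative assumptions (a squared-exponential kernel, zero mean function, made-up data): conditioning the joint Gaussian on (nearly) noise-free observations gives a posterior mean that interpolates them, with essentially zero posterior variance at observed inputs.

```python
import numpy as np

def k(a, b, ell=0.3):
    # Squared-exponential kernel (assumed for illustration).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

X = np.array([0.1, 0.5, 0.9])    # training inputs
y = np.array([1.0, -0.5, 0.3])   # noise-free observations
Xs = np.array([0.1, 0.3])        # test inputs

K = k(X, X) + 1e-10 * np.eye(3)  # tiny jitter for numerical stability
mu = k(X, Xs).T @ np.linalg.solve(K, y)                        # posterior mean
cov = k(Xs, Xs) - k(X, Xs).T @ np.linalg.solve(K, k(X, Xs))    # posterior cov

print(mu[0])      # posterior mean at an observed input reproduces y[0]
print(cov[0, 0])  # essentially no uncertainty left there
```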
The exploration strategy must explicitly trade off between exploration and exploitation. It should map the model and a query point to an expected future value; the result is an acquisition function. A common approach is to maximize the Expected Improvement (EI):

αt(x) = E_{f(x)}[ max(0, f(x) − f(x+)) ],

where f(x+) is the best value so far. Intuitively, EI selects the point which gives us the most improvement in expectation.

Mockus et al. [1978], Jones et al. [1998]
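When the predictive distribution of f(x) is Gaussian with mean μ and standard deviation s, the EI expectation above has the closed form s · (zΦ(z) + φ(z)) with z = (μ − f(x+))/s. The sketch below checks this against a direct Monte Carlo estimate; the numbers are arbitrary illustrations.

```python
import numpy as np
from math import erf, sqrt, pi, exp

mu, s, best = 0.3, 0.5, 0.2          # illustrative predictive mean/std, f(x+)
z = (mu - best) / s
Phi = 0.5 * (1 + erf(z / sqrt(2)))   # standard Normal CDF
phi = exp(-0.5 * z ** 2) / sqrt(2 * pi)  # standard Normal PDF
ei_closed = s * (z * Phi + phi)      # closed-form EI

# Monte Carlo estimate of E[max(0, f(x) - f(x+))] under the same Gaussian.
rng = np.random.default_rng(0)
samples = rng.normal(mu, s, 1_000_000)
ei_mc = np.maximum(0.0, samples - best).mean()

print(ei_closed, ei_mc)  # the two estimates agree closely
```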
Entropy search (ES) maximizes the expected reduction in entropy of the distribution over the global optimizer:

αt(x) = H[ p(x⋆ | Dt) ] − E_y[ H[ p(x⋆ | Dt ∪ {(x, y)}) ] ],

where x⋆ is the unknown global optimizer.

Villemonteix et al. [2009], Hennig and Schuler [2012]
[Figure: GP posterior over f on [0, 1] with observed data, and the induced distribution p(x⋆ | Dt) over the location of the maximizer.]
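One way to picture p(x⋆ | Dt) is to draw functions from the GP posterior on a grid and histogram their argmax; the entropy of that histogram is the quantity ES tries to reduce. The kernel, data and grid below are illustrative assumptions.

```python
import numpy as np

def k(a, b, ell=0.2):
    # Squared-exponential kernel (assumed for illustration).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
X = np.array([0.2, 0.5, 0.8])    # made-up observed inputs
y = np.array([0.0, 1.0, -0.5])   # made-up observed values
grid = np.linspace(0, 1, 100)

K = k(X, X) + 1e-8 * np.eye(3)
mu = k(X, grid).T @ np.linalg.solve(K, y)
cov = k(grid, grid) - k(X, grid).T @ np.linalg.solve(K, k(X, grid))

# Sample posterior functions and histogram where each attains its maximum.
draws = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(100), size=2000)
counts = np.bincount(draws.argmax(axis=1), minlength=100)
p_star = counts / counts.sum()

mask = p_star > 0
entropy = -(p_star[mask] * np.log(p_star[mask])).sum()  # H[p(x⋆ | Dt)]
print(entropy)
```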
The ES acquisition function is the mutual information I(y; x⋆) = I(x⋆; y). We can therefore swap y and x⋆ and rewrite the acquisition as

αt(x) = H[ p(y | Dt, x) ] − E_{x⋆}[ H[ p(y | Dt, x, x⋆) ] ],

which we call Predictive Entropy Search (PES).

Hernández-Lobato et al. [2014]
Approximating the PES acquisition function can be done in two steps:
1 Sampling from the distribution over global maximizers x⋆.
2 Estimating the predictive entropy for y conditioned on x⋆.
To sample x⋆ we need only draw f̃ ∼ p(f | Dt) and return arg max_x f̃(x).

But f is an infinite-dimensional object! Instead we use the finite-basis approximation f̃(·) ≈ φ(·)ᵀθ, where φ(x) = √(2/m) cos(Wx + b), with the rows of W drawn from the kernel's spectral density and b ∼ Uniform[0, 2π].

Bochner's theorem shows that when m → ∞ the approximation is exact.

Bochner [1959]
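The random-features construction above can be sketched numerically: for a stationary kernel, φ(x)ᵀφ(x′) approximates k(x, x′) when m is large. Shown for the 1-D RBF kernel; the lengthscale, m and test points are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
ell, m = 0.5, 20_000
# For the RBF kernel with lengthscale ell, the spectral density is N(0, 1/ell^2).
W = rng.normal(0.0, 1.0 / ell, size=m)
b = rng.uniform(0.0, 2 * np.pi, size=m)

def phi(x):
    # Random Fourier feature map: phi(x) = sqrt(2/m) cos(Wx + b).
    return np.sqrt(2.0 / m) * np.cos(W * x + b)

x1, x2 = 0.3, 0.7
k_true = np.exp(-0.5 * (x1 - x2) ** 2 / ell ** 2)  # exact RBF kernel value
k_approx = phi(x1) @ phi(x2)                        # feature-space approximation
print(k_true, k_approx)  # close for large m
```

A posterior sample f̃(x) = φ(x)ᵀθ then only requires sampling the finite-dimensional weight vector θ, whose posterior is Gaussian.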
Instead of conditioning exactly on x⋆, we use the following simplified constraints:

A x⋆ is a local maximum: ∇f(x⋆) = 0, upper[∇²f(x⋆)] = 0, and diag[∇²f(x⋆)] < 0.
B x⋆ improves on past observations: f(x⋆) > max_t f(x_t).
C x⋆ improves on the candidate point: f(x⋆) > f(x).

Constraint A is imposed by conditioning the GP on the gradient and Hessian analytically. Constraints B and C are handled with expectation propagation (EP). The result is a Gaussian approximation to p(y | Dt, x, x⋆), for which we can easily calculate the entropy.
The following compares a fine-grained rejection sampling (RS) scheme, used to compute the ground-truth acquisition function, with ES and PES.
[Figure: RS Acquisition Function (ground truth), ES Acquisition Function, and PES Acquisition Function on a toy 1-D problem.]
We see that PES provides a much better approximation.
Here we show results where the objective function is sampled from a known GP prior.
[Figure: log10 median IR vs. number of function evaluations (10–50) for the methods ES and PES.]
[Figure: log10 median IR vs. number of function evaluations for the methods ES, PES and PES-NB on the Branin, Cosines, Hartmann, NNet, Hydrogen, Portfolio, Walker A and Walker B cost functions.]
A cookie company wants to create a low-calorie cookie that is just as tasty as the original. This is a constrained optimization problem over the parameterized space of cookie recipes. More generally, we want to solve

max_x f(x) s.t. c1(x) ≥ 0, . . . , cK(x) ≥ 0,

where f and c1, . . . , cK are unknown and return noisy values.
The PESC acquisition function is

αt(x) = H[ p(y | Dt, x) ] − E_{x⋆}[ H[ p(y | Dt, x, x⋆) ] ], (PESC)

where x⋆ is now the solution of the constrained problem.

Hernández-Lobato et al. [2015]
An approximation is obtained in two steps (as in PES):

1 Sampling from the distribution over global maximizers x⋆: sample f̃ ∼ p(f | Dt), c̃1 ∼ p(c1 | Dt), . . . , c̃K ∼ p(cK | Dt) and solve

arg max_x f̃(x) s.t. c̃1(x) ≥ 0, . . . , c̃K(x) ≥ 0.

2 Estimating the predictive entropy for y conditioned on x⋆:

p(y | Dt, x, x⋆) ∝ ∫ p(f, c1, . . . , cK | Dt) δ[yf − f(x)] ∏_{k=1}^K δ[yk − ck(x)] × ∏_{k=1}^K Θ[ck(x⋆)] × ∏_{x′} ( 1 − Θ[f(x′) − f(x⋆)] ∏_{k=1}^K Θ[ck(x′)] ) df dc1 · · · dcK,

where Θ is the Heaviside step function: the factors enforce that x⋆ is feasible and that no feasible x′ improves on it. This is approximated with a product of univariate Gaussians using EP.
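Step 1 above can be sketched on a grid: draw one sample each of the objective and the constraints from their GPs and take the feasible argmax as a sample of x⋆. The sketch below assumes a single constraint, an RBF kernel, and samples from the prior (no data) purely for illustration.

```python
import numpy as np

def k(a, b, ell=0.2):
    # Squared-exponential kernel (assumed for illustration).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(1)
grid = np.linspace(0, 1, 200)
K = k(grid, grid) + 1e-8 * np.eye(200)

f_tilde = rng.multivariate_normal(np.zeros(200), K)  # sample of the objective f
c_tilde = rng.multivariate_normal(np.zeros(200), K)  # sample of constraint c1

feasible = c_tilde >= 0                              # where the sampled c1 holds
if feasible.any():
    # Constrained argmax of the sampled objective: one draw of x⋆.
    x_star = grid[feasible][f_tilde[feasible].argmax()]
else:
    x_star = None  # in practice, resample f_tilde and c_tilde
print(x_star)
```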
Optimizing a neural network validation error on MNIST when constrained to make predictions in under 2ms.
[Figure: log10 objective value vs. number of function evaluations (10–50) for EIC and PESC.]
Optimizing the effective sample size, constrained to pass convergence diagnostics.

[Figure: −log10 effective sample size vs. number of function evaluations (20–100) for EIC and PESC.]
Baseline: expected improvement with constraints (EIC):

αt(x) = EI(x) × ∏_{k=1}^K p(ck(x) ≥ 0).
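The EIC baseline above multiplies EI by the probability that each constraint is satisfied. A minimal numerical sketch, assuming Gaussian predictive marginals for the objective and two constraints with made-up means and standard deviations:

```python
import numpy as np
from math import erf, sqrt, pi, exp

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# EI of the objective at a candidate x (Gaussian predictive; values illustrative).
mu_f, s_f, best = 0.4, 0.3, 0.2
z = (mu_f - best) / s_f
ei = s_f * (z * normal_cdf(z) + exp(-0.5 * z ** 2) / sqrt(2 * pi))

# P(ck(x) >= 0) for two constraints with Gaussian marginals N(mk, sk^2):
# P(ck >= 0) = Phi(mk / sk).
constraint_means = [0.5, -0.1]
constraint_stds = [0.2, 0.3]
p_feasible = np.prod([normal_cdf(m / s)
                      for m, s in zip(constraint_means, constraint_stds)])

eic = ei * p_feasible
print(ei, p_feasible, eic)  # eic <= ei, since p_feasible <= 1
```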
The PESC acquisition function is additive across f and c1, . . . , cK.
[Figure: marginal posterior distributions for the objective and the constraint on two toy 1-D problems, together with the corresponding RS (ground truth) and PESC acquisition functions.]
References

S. Bochner. Lectures on Fourier Integrals. Princeton University Press, 1959.

P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. JMLR, 13, 2012.

J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In NIPS. Curran Associates, Inc., 2014.

J. M. Hernández-Lobato, M. A. Gelbart, M. W. Hoffman, R. P. Adams, and Z. Ghahramani. Predictive entropy search for Bayesian optimization with unknown constraints. In ICML, 2015.

D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13, 1998.

J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2, 1978.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

J. Snoek et al. Scalable Bayesian optimization using deep neural networks. arXiv preprint arXiv:1502.05700, 2015.

J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 2009.