

SLIDE 1

Bayesian optimisation

Gilles Louppe
April 11, 2016

SLIDE 2

Problem statement

x∗ = arg max_x f(x)

Constraints:

  • f is a black box for which no closed form is known; gradients df/dx are not available.
  • f is expensive to evaluate;
  • (optional) observations yi of f are noisy, e.g., yi = f(xi) + ϵi because of Poisson fluctuations.

Goal: find x∗, while minimizing the number of evaluations f (x).
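As a running example for the code sketches below, a hypothetical black box of this kind (the specific function and noise level are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, noise=0.1):
    """Hypothetical expensive black box: the optimiser only sees noisy
    evaluations y = f(x) + eps; no closed form, no gradients."""
    return -np.sin(3 * x) - x**2 + 0.7 * x + noise * rng.standard_normal()
```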


SLIDE 3

Disclaimer

If you do not have these constraints, there is certainly a better optimisation algorithm than Bayesian optimisation (e.g., L-BFGS-B, Powell’s method (as in Minuit), etc.).
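If gradients are available and evaluations are cheap, a standard optimiser is the better tool; a minimal sketch with SciPy (the smooth quadratic objective is an illustrative assumption):

```python
import numpy as np
from scipy.optimize import minimize

def g(x):
    # Cheap, smooth toy objective: classic optimisers handle this directly.
    return (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2

res = minimize(g, x0=np.zeros(2), method="L-BFGS-B")  # method="Powell" also works
print(res.x, res.fun)
```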


SLIDE 4

Bayesian optimisation

for t = 1 : T,

  1. Given observations (xi, yi) for i = 1 : t, build a probabilistic model for the objective f.
     Integrate out all possible true functions, using Gaussian process regression.

  2. Optimise a cheap utility function u based on the posterior distribution for sampling the next point:
     x_{t+1} = arg max_x u(x)
     Exploit uncertainty to balance exploration against exploitation.

  3. Sample the next observation y_{t+1} at x_{t+1}.
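A minimal sketch of this loop, reusing the hypothetical black box f from the problem statement; the scikit-learn GP surrogate, the Matern kernel, the UCB utility, and the grid search over x are all illustrative choices, not prescribed by the slides:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

X_grid = np.linspace(-2, 2, 500).reshape(-1, 1)     # candidate points
X, y = [[0.0]], [f(0.0)]                            # initial observation

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=0.1**2)

for t in range(10):
    gp.fit(np.asarray(X), np.asarray(y))            # 1. model the objective
    mu, sigma = gp.predict(X_grid, return_std=True)
    u = mu + 1.96 * sigma                           # 2. cheap utility (UCB)
    x_next = X_grid[np.argmax(u)]                   #    x_{t+1} = arg max_x u(x)
    X.append(x_next)
    y.append(f(x_next[0]))                          # 3. sample y_{t+1}

x_best = np.asarray(X)[np.argmax(y)]                # incumbent after T steps
```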


SLIDE 5

Where shall we sample next?

[Figure: the true (unknown) function f(x) over x ∈ [−2, 2], with the observations collected so far.]


SLIDE 6

Build a probabilistic model for the objective function

[Figure: the same data with the GP posterior mean µGP(x) and its confidence interval (CI), alongside the true (unknown) f(x).]

This gives a posterior distribution over functions that could have generated the observed data.
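In code, this posterior is exactly what the fitted surrogate returns; continuing the loop sketch above, the plotted mean and CI band come out of a single predict call:

```python
mu, sigma = gp.predict(X_grid, return_std=True)          # posterior mean and std
ci_low, ci_high = mu - 1.96 * sigma, mu + 1.96 * sigma   # ~95% credible band
```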


SLIDE 7

Acquisition functions

Acquisition functions u(x) specify which sample x should be tried next:

  • Upper confidence bound: UCB(x) = µGP(x) + κ σGP(x);
  • Probability of improvement: PI(x) = P(f(x) ≥ f(x^+_t) + κ);
  • Expected improvement: EI(x) = E[max(f(x) − f(x^+_t), 0)];
  • ... and many others,

where x^+_t is the best point observed so far.

In most cases, acquisition functions provide knobs (e.g., κ) for controlling the exploration-exploitation trade-off:

  • Search in regions where µGP(x) is high (exploitation);
  • Probe regions where uncertainty σGP(x) is high (exploration).
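Under a Gaussian posterior, PI and EI have standard closed forms (the slides do not spell these out; the expressions below are the usual ones):

```python
import numpy as np
from scipy.stats import norm

def acquisitions(mu, sigma, y_best, kappa=1.96):
    """mu, sigma: GP posterior mean/std at candidate points; y_best = f(x^+_t)."""
    sigma = np.maximum(sigma, 1e-12)               # avoid division by zero at observed points
    ucb = mu + kappa * sigma
    pi = norm.cdf((mu - y_best - kappa) / sigma)   # P(f(x) >= f(x^+_t) + kappa)
    z = (mu - y_best) / sigma
    ei = (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)  # E[max(f(x) - f(x^+_t), 0)]
    return ucb, pi, ei
```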


SLIDE 8

Plugging everything together (t = 0)

[Figure: GP posterior mean µGP(x) with CI, acquisition u(x), observations, and the true (unknown) f(x); best point so far x^+_t = 0.1000.]

x_{t+1} = arg max_x UCB(x)
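Continuing the sketches above, the query point is just the maximiser of the chosen acquisition over the candidates (grid search is an illustrative choice; gradient-based restarts are common in practice):

```python
ucb, pi, ei = acquisitions(mu, sigma, y_best=np.max(y))
x_next = X_grid[np.argmax(ucb)]   # x_{t+1} = arg max_x UCB(x)
```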


SLIDE 9

... and repeat until convergence (t = 1)

[Figure: updated GP posterior µGP(x) with CI and acquisition u(x) at t = 1; best point so far x^+_t = 0.1000.]


SLIDE 10

... and repeat until convergence (t = 2)

[Figure: updated GP posterior µGP(x) with CI and acquisition u(x) at t = 2; best point so far x^+_t = 0.1000.]


SLIDE 11

... and repeat until convergence (t = 3)

[Figure: updated GP posterior µGP(x) with CI and acquisition u(x) at t = 3; best point so far x^+_t = 0.1000.]


SLIDE 12

... and repeat until convergence (t = 4)

[Figure: updated GP posterior µGP(x) with CI and acquisition u(x) at t = 4; best point so far x^+_t = 0.1000.]


SLIDE 13

... and repeat until convergence (t = 5)

[Figure: updated GP posterior µGP(x) with CI and acquisition u(x) at t = 5; best point so far x^+_t = 0.2858.]


SLIDE 14

What is Bayesian about Bayesian optimisation?

  • The Bayesian strategy treats the unknown objective function as a random function and places a prior over it.
  • The prior captures our beliefs about the behaviour of the function. It is here defined by a Gaussian process whose covariance function captures assumptions about the smoothness of the objective.
  • Function evaluations are treated as data. They are used to update the prior to form the posterior distribution over the objective function.
  • The posterior distribution, in turn, is used to construct an acquisition function for querying the next point.
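To make the prior/posterior distinction concrete, one can draw sample functions from the GP before and after conditioning on data; a sketch with scikit-learn (the kernel and the two made-up observations are illustrative assumptions):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

X_grid = np.linspace(-2, 2, 200).reshape(-1, 1)
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5, length_scale=0.5))

prior_draws = gp.sample_y(X_grid, n_samples=3, random_state=0)   # beliefs before data

X_obs = np.array([[-1.0], [0.5]])            # made-up evaluations of the objective
y_obs = np.array([0.3, -0.2])
gp.fit(X_obs, y_obs)                         # data updates the prior...
posterior_draws = gp.sample_y(X_grid, n_samples=3, random_state=0)  # ...into the posterior
```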


SLIDE 15

Limitations

  • Bayesian optimisation has parameters itself!
      • choice of the acquisition function;
      • choice of the kernel (i.e., design of the prior);
      • parameter warping;
      • initialisation scheme.
  • Gaussian processes usually do not scale well to many observations or to high-dimensional data.
    Sequential model-based optimisation provides a direct and effective alternative (i.e., replace GPs by a tree-based model), as sketched below.
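A sketch of the same loop step with a tree-based surrogate in place of the GP; ExtraTreesRegressor and the spread-across-trees uncertainty proxy are illustrative choices (libraries such as SMAC or scikit-optimize implement this idea more carefully):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def tree_mu_sigma(model, X):
    # Crude posterior proxy: mean and spread of the individual trees' predictions.
    per_tree = np.stack([tree.predict(X) for tree in model.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)

X_grid = np.linspace(-2, 2, 500).reshape(-1, 1)
X_obs = np.array([[-1.5], [-0.5], [0.3], [1.2]])   # made-up observations
y_obs = np.array([0.2, 0.8, 1.1, -0.4])

model = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=3, random_state=0)
model.fit(X_obs, y_obs)
mu, sigma = tree_mu_sigma(model, X_grid)
x_next = X_grid[np.argmax(mu + 1.96 * sigma)]      # same UCB rule as before
```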


SLIDE 16

Applications

  • Bayesian optimisation has been used in many scientific fields, including robotics, machine learning, and the life sciences.
  • Use cases for high-energy physics?
      • optimisation of simulation parameters in event generators;
      • optimisation of compiler flags to maximise execution speed;
      • optimisation of hyper-parameters in machine learning for HEP;
      • ... let’s discuss further ideas?


SLIDE 17

Software

  • Python:
      • Spearmint: https://github.com/JasperSnoek/spearmint
      • GPyOpt: https://github.com/SheffieldML/GPyOpt
      • RoBO: https://github.com/automl/RoBO
      • scikit-optimize: https://github.com/MechCoder/scikit-optimize (work in progress)
  • C++:
      • MOE: https://github.com/yelp/MOE

Check also this GitHub repo for a vanilla implementation reproducing these slides.
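As a usage example, a sketch with scikit-optimize's gp_minimize, reusing the hypothetical f from the problem statement (note that skopt minimises, hence the negation; the API shown is the released one and may differ from the work-in-progress state mentioned above):

```python
from skopt import gp_minimize

# skopt minimises, so negate the hypothetical black box f to maximise it.
res = gp_minimize(lambda x: -f(x[0]),
                  dimensions=[(-2.0, 2.0)],  # bounds of the search space
                  acq_func="EI",             # expected improvement
                  n_calls=15,
                  random_state=0)
print(res.x[0], -res.fun)                    # best x and best f(x)
```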


SLIDE 18

Summary

  • Bayesian optimisation provides a principled approach for optimising an expensive function f;
  • Often very effective, provided it is itself properly configured;
  • Hot topic in machine learning research. Expect quick improvements!

