

SLIDE 1

BayesOpt: hot topics and current challenges

Javier González. Masterclass, 7 February 2017, @Lancaster University

SLIDE 2

Agenda of the day

◮ 9:00-11:00, Introduction to Bayesian Optimization:
  ◮ What is BayesOpt and why does it work?
  ◮ Relevant things to know.
◮ 11:30-13:00, Connections, extensions and applications:
  ◮ Extensions to multi-task problems, constrained domains, early stopping, high dimensions.
  ◮ Connections to multi-armed bandits and ABC.
  ◮ An application in genetics.
◮ 14:00-16:00, GPyOpt LAB!: Bring your own problem!
◮ 16:30-17:30, Hot topics and current challenges:
  ◮ Parallelization.
  ◮ Non-myopic methods.
  ◮ Interactive Bayesian Optimization.

SLIDE 3

Section III: Hot topics and challenges

◮ Parallel Bayesian Optimization.
◮ Non-myopic methods.
◮ Interactive Bayesian Optimization.

SLIDE 4

Scalable BO: Parallel/batch BO

Avoiding the bottleneck of evaluating f

◮ Cost of f(x_n) = cost of {f(x_{n,1}), . . . , f(x_{n,n_b})}.
◮ Many cores available, simultaneous lab experiments, etc.

SLIDE 5

Considerations when designing a batch

◮ Available pairs {(x_i, y_i)}_{i=1}^n are augmented with the evaluations of f on B_t^{n_b} = {x_{t,1}, . . . , x_{t,n_b}}.
◮ Goal: design B_1^{n_b}, . . . , B_m^{n_b}.

Notation:

◮ I_n: represents the available data set D_n and the GP structure when n data points are available (I_{t,k} in the batch context).
◮ α(x; I_n): generic acquisition function given I_n.

SLIDE 6

Optimal greedy batch design

Sequential policy: maximize α(x; I_{t,0}).

Greedy batch policy, 1st element of the t-th batch: maximize α(x; I_{t,0}).

SLIDE 7

Optimal greedy batch design

Sequential policy: maximize α(x; I_{t,1}).

Greedy batch policy, 2nd element of the t-th batch: maximize

∫ α(x; I_{t,1}) p(y_{t,1} | x_{t,1}, I_{t,0}) p(x_{t,1} | I_{t,0}) dx_{t,1} dy_{t,1}

◮ p(y_{t,1} | x_{t,1}, I_{t,0}): predictive distribution of the GP.
◮ p(x_{t,1} | I_{t,0}) = δ(x_{t,1} − arg max_{x ∈ X} α(x; I_{t,0})).

SLIDE 8

Optimal greedy batch design

Sequential policy: maximize α(x; I_{t,k−1}).

Greedy batch policy, k-th element of the t-th batch: maximize

∫ α(x; I_{t,k−1}) ∏_{j=1}^{k−1} p(y_{t,j} | x_{t,j}, I_{t,j−1}) p(x_{t,j} | I_{t,j−1}) dx_{t,j} dy_{t,j}

◮ p(y_{t,j} | x_{t,j}, I_{t,j−1}): predictive distribution of the GP.
◮ p(x_{t,j} | I_{t,j−1}) = δ(x_{t,j} − arg max_{x ∈ X} α(x; I_{t,j−1})).

SLIDE 9

Available approaches

[Azimi et al., 2010; Desautels et al., 2012; Chevalier et al., 2013; Contal et al. 2013]

◮ Exploratory approaches: reduction in system uncertainty.
◮ Generate 'fake' observations of f using p(y_{t,j} | x_j, I_{t,j−1}).
◮ Simultaneously optimize the elements of the batch using the joint distribution of y_{t,1}, . . . , y_{t,n_b}.

Bottleneck: all these methods require iteratively updating p(y_{t,j} | x_j, I_{t,j−1}) to model the interaction between the elements of the batch: O(n³).

How can batches be designed while reducing this cost? Local penalization.
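The 'fake observations' idea can be sketched with a toy 1D GP in NumPy. The RBF kernel, its lengthscale, and the rule of imputing the predictive mean (a 'kriging believer'-style heuristic) are illustrative assumptions, not the exact methods cited above; the point is the O(n³) solve repeated at every batch step.

```python
import numpy as np

def rbf(A, B, ell=0.2):
    # squared-exponential kernel on 1D inputs
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP predictive mean/variance; the linear solve is the O(n^3) bottleneck."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xs, X)
    mu = Ks @ np.linalg.solve(K, y)          # O(n^3)
    V = np.linalg.solve(K, Ks.T)
    var = 1.0 - np.sum(Ks * V.T, axis=1)
    return mu, var

def fantasy_batch(X, y, grid, n_b):
    """Pick n_b batch points at maximal posterior variance, each time
    adding a 'fake' observation at the predictive mean and refitting."""
    Xf, yf = X.copy(), y.copy()
    batch = []
    for _ in range(n_b):
        mu, var = gp_posterior(Xf, yf, grid)
        k = int(np.argmax(var))
        batch.append(float(grid[k]))
        Xf = np.append(Xf, grid[k])
        yf = np.append(yf, mu[k])            # fantasy observation
    return batch
```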

SLIDE 10

Goal: eliminate the marginalization step

"To develop a heuristic approximating the 'optimal batch design strategy' at lower computational cost, while incorporating information about global properties of f from the GP model into the batch design."

Lipschitz continuity: |f(x_1) − f(x_2)| ≤ L ‖x_1 − x_2‖_p.

SLIDE 11

Interpretation of the Lipschitz continuity of f

M = max_{x ∈ X} f(x) and B_{r_{x_j}}(x_j) = {x ∈ X : ‖x − x_j‖ ≤ r_{x_j}}, where r_{x_j} = (M − f(x_j)) / L.

[Figure: true function, samples, exclusion cones and active regions.]

x_M ∉ B_{r_{x_j}}(x_j); otherwise, the Lipschitz condition is violated.

SLIDE 12

Probabilistic version of B_{r_x}(x)

We can do this because f(x) ∼ GP(µ(x), k(x, x′)).

◮ r_{x_j} is Gaussian with µ(r_{x_j}) = (M − µ(x_j)) / L and σ²(r_{x_j}) = σ²(x_j) / L².

Local penalizers:

φ(x; x_j) = p(x ∉ B_{r_{x_j}}(x_j)) = p(r_{x_j} < ‖x − x_j‖) = 0.5 erfc(−z), where z = (1 / √(2σ_n²(x_j))) (L ‖x_j − x‖ − M + µ_n(x_j)).

◮ Reflects the size of the 'Lipschitz' exclusion areas.
◮ Approaches 1 when x is far from x_j and decreases otherwise.
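In code, the penalizer is essentially a one-liner. A minimal NumPy/SciPy sketch (the argument names mu_j, sigma2_j for the GP posterior at x_j are illustrative):

```python
import numpy as np
from scipy.special import erfc

def local_penalizer(x, xj, mu_j, sigma2_j, L, M):
    """phi(x; xj) = p(x not in B_{r_xj}(xj)) = 0.5 * erfc(-z)."""
    z = (L * np.linalg.norm(x - xj) - M + mu_j) / np.sqrt(2.0 * sigma2_j)
    return 0.5 * erfc(-z)
```

Far from x_j the penalizer tends to 1 (no penalty); close to x_j it shrinks, carving out the exclusion cone.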
SLIDE 13

Idea to collect the batches

Without explicitly using the model.

Optimal batch: maximization-marginalization

∫ α(x; I_{t,k−1}) ∏_{j=1}^{k−1} p(y_{t,j} | x_{t,j}, I_{t,j−1}) p(x_{t,j} | I_{t,j−1}) dx_{t,j} dy_{t,j}

Proposal: maximization-penalization. Use the φ(x; x_j) to penalize the acquisition and predict the expected change in α(x; I_{t,k−1}).

SLIDE 14

Local penalization strategy

[González, Dai, Hennig, Lawrence, 2016]

[Figure: batch construction in 1D. Panels: 1st batch element, α(x); 2nd batch element, α(x)φ₁(x); 3rd batch element, α(x)φ₁(x)φ₂(x).]

The maximization-penalization strategy selects x_{t,k} as

x_{t,k} = arg max_{x ∈ X} { g(α(x; I_{t,0})) ∏_{j=1}^{k−1} φ(x; x_{t,j}) },

where g is a transformation of α(x; I_{t,0}) to make it always positive.
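A greedy selection loop over a discretized 1D domain might look as follows; the softplus choice for g and the grid in place of a continuous optimizer are illustrative assumptions of this sketch:

```python
import numpy as np
from scipy.special import erfc

def softplus(a):
    # g: transformation that makes the acquisition strictly positive
    return np.log1p(np.exp(a))

def penalizer(X, xj, mu_j, s2_j, L, M):
    z = (L * np.abs(X - xj) - M + mu_j) / np.sqrt(2.0 * s2_j)
    return 0.5 * erfc(-z)

def select_batch(X, alpha, mu, s2, L, M, n_b):
    """Maximization-penalization: pick argmax, penalize around it, repeat."""
    weights = softplus(alpha)
    batch = []
    for _ in range(n_b):
        k = int(np.argmax(weights))
        batch.append(float(X[k]))
        # down-weight the acquisition around the chosen point
        weights = weights * penalizer(X, X[k], mu[k], s2[k], L, M)
    return batch
```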


SLIDE 16

Example for L = 50

L controls the exploration-exploitation balance within the batch.

SLIDE 17

Example for L = 100

L controls the exploration-exploitation balance within the batch.

SLIDE 18

Example for L = 150

L controls the exploration-exploitation balance within the batch.

SLIDE 19

Example for L = 250

L controls the exploration-exploitation balance within the batch.

SLIDE 20

Finding a unique Lipschitz constant

Let f : X → R be an L-Lipschitz continuous function defined on a compact subset X ⊆ R^D. Then

L_p = max_{x ∈ X} ‖∇f(x)‖_p

is a valid Lipschitz constant. The gradient of f at x* is distributed as a multivariate Gaussian:

∇f(x*) | X, y, x* ∼ N(µ_∇(x*), Σ²_∇(x*))

We choose: L̂ = max_{x ∈ X} ‖µ_∇(x*)‖.
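The last step can be sketched with finite differences on a grid; here mean_fn stands in for the GP posterior mean µ (using the true function, an assumption for illustration):

```python
import numpy as np

def estimate_lipschitz(mean_fn, lo, hi, n=2001):
    """L-hat = max over a grid of |d mu(x)/dx|, via finite differences."""
    x = np.linspace(lo, hi, n)
    grad = np.gradient(mean_fn(x), x)
    return float(np.max(np.abs(grad)))
```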

SLIDE 21

Experiments: Sobol function

Best (average) result for a given time budget.

SLIDE 22

2D experiment with ‘large domain’

Comparison in terms of wall-clock time.

[Figure: best found value vs. wall-clock time (seconds) for EI, UCB, Rand-EI, Rand-UCB, SM-UCB, B-UCB, PE-UCB, Pred-EI, Pred-UCB, qEI, LP-EI and LP-UCB.]

SLIDE 23

Myopia of optimisation techniques

◮ Most global optimisation techniques are myopic, considering no more than a single step into the future.
◮ Relieving this myopia requires solving the multi-step lookahead problem.

Figure: Two evaluations; if the first evaluation is made myopically, the second must be sub-optimal.

SLIDE 24

Non-myopic thinking

Thinking non-myopically is important: it is a way of integrating into our decisions the information about the (limited) resources available to solve a given problem.

SLIDE 25

Acquisition function: expected loss

[Osborne, 2010]

Loss of evaluating f at x*, assuming it returns y*:

λ(y*) = y* if y* ≤ η; η if y* > η,

where η = min{y_0}, the current best found value. The expected loss is:

Λ_1(x* | I_0) = E[min(y*, η)] = ∫ λ(y*) p(y* | x*, I_0) dy*

I_0 is the current information: the data D, the hyperparameters θ and the likelihood type.
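Since p(y*|x*, I_0) is Gaussian for a GP, say N(µ, σ²), the one-step expected loss has a closed form: with u = (η − µ)/σ, Λ_1 = µΦ(u) − σφ(u) + η(1 − Φ(u)). A small sketch, checked against Monte Carlo:

```python
import numpy as np
from scipy.stats import norm

def expected_loss(mu, sigma, eta):
    """Closed form of E[min(y, eta)] for y ~ N(mu, sigma^2)."""
    u = (eta - mu) / sigma
    return mu * norm.cdf(u) - sigma * norm.pdf(u) + eta * (1.0 - norm.cdf(u))
```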

SLIDE 26

The expected loss (improvement) is myopic

◮ Selects the next evaluation as if it were the last one.
◮ The remaining available budget is not taken into account when deciding where to evaluate.

How can we take into account the effect of future evaluations in the decision?

SLIDE 27

Expected loss with n steps ahead

Intractable even for a handful of steps ahead:

Λ_n(x* | I_0) = ∫ λ(y_n) ∏_{j=1}^{n} p(y_j | x_j, I_{j−1}) p(x_j | I_{j−1}) dy* . . . dy_n dx_2 . . . dx_n

◮ p(y_j | x_j, I_{j−1}): predictive distribution of the GP.
◮ p(x_j | I_{j−1}): optimisation step.

SLIDE 28

Relieving the myopia of Bayesian optimisation

We present... GLASSES! Global optimisation with Look-Ahead through Stochastic Simulation and Expected-loss Search

SLIDE 29

GLASSES

Rendering the approximation sparse

Idea: jointly model the epistemic uncertainty about the steps ahead by defining a point process over them.

Γ_n(x* | I_0) = ∫ λ(y_n) p(y | X, I_0, x*) p(X | I_0, x*) dy dX
SLIDE 30

GLASSES

Technical details

Selecting a good p(X | I_0, x*) is complicated.

◮ Replace the integration over p(X | I_0, x*) by conditioning on an oracle predictor F_n(x*) of the n future locations.
◮ y = (y*, . . . , y_n)^T: Gaussian outputs of f at F_n(x*).
◮ Λ_n(x* | I_0, F_n(x*)) = Γ_n(x* | I_0, F_n(x*)) = E[min(y, η)].
◮ E[min(y, η)] is computed using Expectation Propagation.
SLIDE 31

GLASSES: predicting the steps ahead

Oracle based on a batch BO method [Gonzalez et al., AISTATS’2016]

Can be interpreted as the MAP of a determinantal point process.

SLIDE 32

GLASSES: interpretation of the loss

Automatic balance between exploration and exploitation.

SLIDE 33

Results in a benchmark of objectives

GLASSES is overall the best method.

SLIDE 34

Interactive Bayesian optimization

[González et al., 2016]

Key question: what if it is easier to compare two points in the domain than to obtain a single output value for each one?

Preferential returns.

SLIDE 35

Interactive Bayesian optimization

[González et al., 2016]

To find x_min = arg min_{x ∈ X} g(x), where g is not directly accessible.

Queries to g can only be made in pairs of points, or duels, [x, x′] ∈ X × X, from which binary feedback {0, 1} is obtained.

Useful when modeling human preferences.

SLIDE 36

Modelling preferences

The model of choice is a Bernoulli probability function:

p(y = 1 | [x, x′]) = π_f([x, x′]) and p(y = 0 | [x, x′]) = π_f([x′, x]),

where π : R × R → [0, 1] is a link function. A natural choice for π_f is the logistic function

π_f([x, x′]) = σ(f([x, x′])) = 1 / (1 + e^{−f([x, x′])})

for f([x, x′]) = g(x′) − g(x).
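A minimal sketch of the link, with a toy objective g standing in for the inaccessible function (an assumption for illustration):

```python
import numpy as np

def pref(g, x, xp):
    """pi_f([x, x']) = sigma(g(x') - g(x)): probability that x wins the duel."""
    return 1.0 / (1.0 + np.exp(-(g(xp) - g(x))))
```

By construction pref(g, x, xp) + pref(g, xp, x) = 1, and the point with the lower value of g wins with probability above 1/2.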

SLIDE 37

Elements of the problem

[Figure: objective function with its global minimum; Copeland and soft-Copeland functions; preference function over X × X.]

Key concepts:

◮ Preference function: π_f([x′, x]).
◮ Soft-Copeland score: C(x) = Vol(X)^{−1} ∫_X π_f([x, x′]) dx′.
◮ Condorcet winner: the point with maximal soft-Copeland score.
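On a grid, the soft-Copeland score is a row average of the preference matrix. A NumPy sketch with the logistic preference computed from a toy g (in practice π_f comes from the GP classification model):

```python
import numpy as np

def soft_copeland(g, X_grid):
    """C(x) ~ average over x' of pi_f([x, x']) on a 1D grid."""
    gx = g(X_grid)
    # P[i, j] = sigma(g(x_j) - g(x_i)) = pi_f([x_i, x_j])
    P = 1.0 / (1.0 + np.exp(-(gx[None, :] - gx[:, None])))
    return P.mean(axis=1)
```

The Condorcet winner is then the grid point with the maximal score, which coincides with the minimizer of g.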

SLIDE 38

Idea

◮ Model the preference with a Gaussian process for classification.
◮ Select the new duel that maximizes the Copeland score in expectation.

SLIDE 39

Copeland's expected improvement (CEI)

Acquisition for duels:

α_CEI([x, x′]; D, θ) = E[max(0, c − c⋆)] = π_{f,j}([x, x′]) (c⋆_{j,x} − c⋆_j)_+ + π_{f,j}([x′, x]) (c⋆_{j,x′} − c⋆_j)_+

◮ c⋆_j is the value of the Condorcet winner at iteration j.
◮ c⋆_{j,x} is the value of the estimated Condorcet winner resulting from augmenting D_j with {[x, x′], 1}.

SLIDE 40

Results

[Figure: g(x*) vs. number of iterations for BOPPER, random and sparring on Forrester (1D), and for BOPPER and random on Six-Hump Camel (2D).]

Modeling correlations with the Gaussian process helps!

SLIDE 41

Questions?