Preferential Bayesian Optimization
Javier González, Zhenwen Dai, Andreas Damianou, Neil D. Lawrence
@ICML 2017, Sydney, Australia
My Colleagues
Javier González, Andreas Damianou, Neil D. Lawrence
Motivation
◮ Bayesian Optimization aims at searching for the global minimum of an expensive function g:
x_min = arg min_{x ∈ X} g(x).
◮ What if the function g is not directly measurable?
Preference vs. Rating
◮ The objective functions of many tasks are difficult to summarize precisely into a single value.
◮ For humans, comparison is almost always easier than rating.
◮ This observation has been exploited in A/B testing.
BO via Preference
◮ Beyond a single A/B test.
◮ The goal is to optimize a system by tuning its configuration, e.g., the font size or background color of a website.
◮ The objective, such as customer experience, is not directly measurable.
◮ Instead, we can compare the objective under two different configurations.
◮ The task is to search for the best configuration by iteratively suggesting pairs of configurations and observing the outcomes of the comparisons.
Problem Definition
◮ To find the minimum of a latent function g(x), x ∈ X.
◮ We observe only whether g(x) < g(x′) or not, for a duel [x, x′] ∈ X × X.
◮ The outcomes are binary: true or false.
◮ The outcomes are stochastic.
Preference Function
◮ In this work, the outcome distribution is assumed to be Bernoulli:
p(y ∈ {0, 1} | [x, x′]) = π^y (1 − π)^(1−y),  π = σ(g(x′) − g(x)).
◮ π is referred to as a preference function. ◮ A Preferential Bayesian optimization algorithm will propose a sequence of duels that helps efficiently localize the minimum of a latent function g(x).
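As a minimal sketch of the Bernoulli duel model above (not code from the paper), the following simulates one comparison; the quadratic `g` and the `duel` helper are illustrative stand-ins:

```python
import math
import random

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def duel(g, x, x_prime, rng):
    """Simulate one duel outcome y ~ Bernoulli(pi), where
    pi = sigma(g(x') - g(x)) is the probability that x beats x'
    (i.e., that g(x) < g(x'))."""
    pi = sigmoid(g(x_prime) - g(x))
    return 1 if rng.random() < pi else 0

# Toy latent function with its minimum at x = 0.5 (not from the paper).
g = lambda x: (x - 0.5) ** 2
```

Because π > 0.5 whenever g(x) < g(x′), the minimizer tends to win duels, but any single outcome remains stochastic.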
[Figure: the objective function g(x) with its global minimum marked, and a heat map of the preference function π over duels (x, x′).]
A Surrogate Model
◮ The preference function is not observable.
◮ We only observe a few comparisons.
◮ We need a surrogate model to guide the search.
◮ Two choices:
◮ a surrogate model for the latent function (like in standard BO). [Brochu, 2010, Guo et al., 2010] ◮ a surrogate model for the preference function
[Figure: the preference function (left) and the expectation of y⋆, i.e. the posterior mean of σ(f⋆) (right).]
A Surrogate Model of the Preference Function
◮ We propose to build a surrogate model for the preference function.
◮ Pros: easy to model; Gaussian process binary classification is used:
p(y⋆ = 1 | D, [x⋆, x′⋆], θ) = ∫ σ(f⋆) p(f⋆ | D, [x⋆, x′⋆], θ) df⋆.
◮ Pros: flexible latent function (e.g., non-stationarity).
◮ Cons: the minimum of the latent function is not directly accessible.
[Figure: the preference function (left) and the expectation of y⋆, i.e. the posterior mean of σ(f⋆) (right).]
Who is the winner (the minimum)?
◮ The minimum beats all other locations on average.
◮ Extending an idea from dueling bandits [Zoghi et al., 2015], we define the soft-Copeland score as the average winning probability,
C(x) = Vol(X)^(−1) ∫_X π_f([x, x′]) dx′.
◮ The optimum of g(x) can then be estimated as the maximizer of C, called the Condorcet winner:
x_c = arg max_{x ∈ X} C(x).
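The integral defining C(x) can be approximated on a grid. The sketch below (an illustrative assumption, not the paper's implementation) uses the preference function induced by a toy quadratic on X = [0, 1]:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_copeland(pi_f, x, grid):
    """Average winning probability of x over X, approximated on a grid:
    C(x) ~ (1/|grid|) * sum over x' of pi_f(x, x')."""
    return sum(pi_f(x, xp) for xp in grid) / len(grid)

# Toy latent function (not from the paper) and its induced preference function.
g = lambda x: (x - 0.5) ** 2
pi_f = lambda x, xp: sigmoid(g(xp) - g(x))

grid = [i / 100 for i in range(101)]
# The Condorcet winner is the maximizer of C over the grid.
x_c = max(grid, key=lambda x: soft_copeland(pi_f, x, grid))
```

Since C(x) is decreasing in g(x), the Condorcet winner coincides with the minimizer of g in this noise-free setting.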
[Figure: the objective function with its global minimum; the Copeland and soft-Copeland score functions over x; the preference function heat map.]
The current estimation of minimum
◮ We only have a surrogate model of the preference function.
◮ Estimate the soft-Copeland score from the surrogate model to obtain an approximate Condorcet winner.
◮ Note that the approximate Condorcet winner may not be the optimum of g(x).
Acquisition Function
◮ Existing acquisition functions are not applicable:
◮ they are designed to work with a surrogate model of the objective function.
◮ In PBO, the surrogate model does not directly represent the latent objective function.
◮ We need a new acquisition function for duels!
[Figure: the expectation of y⋆, i.e. the posterior mean of σ(f⋆), over duels (x, x′).]
Pure Exploration Acquisition Function (PBO-PE)
◮ The common pure-exploration acquisition function, i.e. V[y⋆], does not work.
◮ We propose a pure-exploration acquisition function as the variance (uncertainty) of the "winning" probability of a duel:
V[σ(f⋆)] = ∫ (σ(f⋆) − E[σ(f⋆)])² p(f⋆ | D, [x, x′]) df⋆.
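This integral has no closed form, but a Monte Carlo estimate is straightforward. The sketch below assumes (as an illustration, not the paper's code) that the GP posterior of f⋆ at a duel is Gaussian with mean `mu` and variance `var`:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def acq_pe(mu, var, n_mc=4000, seed=0):
    """Monte Carlo estimate of V[sigma(f*)] for f* ~ N(mu, var),
    a stand-in for the GP posterior of f at a duel [x, x']."""
    rng = random.Random(seed)
    s = [sigmoid(mu + math.sqrt(var) * rng.gauss(0.0, 1.0)) for _ in range(n_mc)]
    mean = sum(s) / n_mc
    return sum((si - mean) ** 2 for si in s) / n_mc
```

Unlike V[y⋆], this quantity shrinks as the posterior over f⋆ concentrates, which is exactly what an explorative acquisition should reward.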
[Figure: the variance of y⋆ (left) versus the variance of σ(f⋆) (right) over duels (x, x′).]
Acquisition Function: PBO-DTS
To select the next duel [x_next, x′_next]:
1. Draw a sample from the surrogate model.
2. Take x_next as the maximizer of the soft-Copeland score of the sample.
3. Take x′_next as the maximizer of the PBO-PE acquisition with x_next fixed.
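The three steps above can be sketched on a 1-D grid. Here `sample_f` stands in for one posterior sample of the latent f at a duel and `var_sigma` for the PBO-PE score; both would come from the GP surrogate in practice (the toy functions below are assumptions for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dts_next_duel(sample_f, var_sigma, grid):
    """Sketch of dueling-Thompson duel selection on a 1-D grid.
    sample_f(x, xp): one posterior sample of f at the duel [x, xp].
    var_sigma(x, xp): the PBO-PE score V[sigma(f*)] at the duel."""
    # Steps 1-2: soft-Copeland score of the drawn sample, maximized over x.
    def copeland(x):
        return sum(sigmoid(sample_f(x, xp)) for xp in grid) / len(grid)
    x_next = max(grid, key=copeland)
    # Step 3: the second arm maximizes pure exploration given x_next.
    xp_next = max(grid, key=lambda xp: var_sigma(x_next, xp))
    return x_next, xp_next
```

The first arm exploits the sampled model (Thompson sampling), while the second arm is chosen purely to reduce uncertainty about the duel's outcome.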
[Figure: a sample of σ(f⋆); the soft-Copeland function of the sample; the variance of σ(f⋆).]
Experiment: Forrester Function
◮ Synthetic 1D function: Forrester.
◮ Observations are drawn with probability
p(y = 1) = 1 / (1 + exp(g(x) − g(x′))).
◮ g(x_c) shows the value at the location that each algorithm believes is the minimum.
◮ Each curve is the average of 20 trials.
IBO: [Brochu, 2010] SPARRING: [Ailon et al., 2014]
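A sketch of this experimental setup (an illustration, not the paper's code) combines the standard Forrester function with the stated observation probability:

```python
import math
import random

def forrester(x):
    """The standard Forrester function on [0, 1]:
    g(x) = (6x - 2)^2 * sin(12x - 4), with its global minimum near x = 0.757."""
    return (6.0 * x - 2.0) ** 2 * math.sin(12.0 * x - 4.0)

def duel_outcome(x, xp, rng):
    """Draw y = 1 with probability 1 / (1 + exp(g(x) - g(x'))),
    matching the observation model stated above."""
    p = 1.0 / (1.0 + math.exp(forrester(x) - forrester(xp)))
    return 1 if rng.random() < p else 0
```

An algorithm only ever sees such binary outcomes; g(x_c) in the plots is evaluated on the true function for reporting purposes.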
[Figure: g(x_c) versus number of iterations on Forrester, comparing PBO-PE, PBO-DTS, PBO-CEI, RANDOM, IBO, and SPARRING.]
Experiments: More (2D) Functions
[Figure: g(x_c) versus number of iterations on Forrester (PBO-PE, PBO-DTS, PBO-CEI, RANDOM, IBO, SPARRING) and on Six-Hump Camel, Goldstein, and Levy (PBO-PE, PBO-DTS, RANDOM, IBO).]
Summary
◮ Address Bayesian optimization with preferential returns.
◮ Propose to build a surrogate model for the preference function.
◮ Propose a few efficient acquisition functions.
◮ Show the performance on synthetic functions.
Questions?
Exploration & Exploitation
The two ingredients in an acquisition function: Exploration & Exploitation.
Exploration in PBO
◮ To understand exploration in PBO, we design a pure-exploration acquisition function.
◮ Exploration in standard BO can be viewed as the action of reducing the uncertainty of a surrogate model.
◮ A purely explorative acquisition function:
V[y⋆] = ∫ (y⋆ − E[y⋆])² p(y⋆ | D, x⋆) dy⋆.
◮ Can we extend this idea to PBO?
A Straightforward Choice
◮ A straightforward extension from standard BO:
V[y⋆] = Σ_{y⋆ ∈ {0,1}} (y⋆ − E[y⋆])² p(y⋆ | D, [x⋆, x′⋆]) = E[y⋆](1 − E[y⋆]).
◮ The maximum variance is always where E[y⋆] = 0.5!
◮ The variance may not reduce with observations!
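The failure mode is easy to verify numerically: V[y⋆] depends only on E[y⋆] and is pinned at 0.25 wherever E[y⋆] = 0.5, regardless of how many duels have been observed. A tiny check:

```python
def bernoulli_variance(p):
    """V[y*] = E[y*](1 - E[y*]) for a Bernoulli outcome with mean p."""
    return p * (1.0 - p)

# The maximizer over any grid of means is always p = 0.5, so an acquisition
# based on V[y*] keeps proposing duels that look maximally uncertain forever.
ps = [i / 100 for i in range(101)]
p_max = max(ps, key=bernoulli_variance)
```

This is why the PBO-PE acquisition targets V[σ(f⋆)], which does shrink as the posterior over f⋆ concentrates.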
[Figure: the expectation of y⋆ (posterior mean of σ(f⋆)) and the variance of y⋆ over duels (x, x′).]
Dueling-Thompson Sampling (DTS)
◮ To balance exploration and exploitation, we borrow the idea of Thompson sampling: draw a sample from the surrogate model.
◮ Compute the soft-Copeland score on the drawn sample.
◮ The value x_next that maximizes the soft-Copeland score of the sample gives a good balance between exploration and exploitation.
◮ Take it as the first element of the next duel.
[Figure: 100 sampled Copeland functions at #duels = 10, 30, and 150.]
Aleatoric Uncertainty & Epistemic Uncertainty
◮ The uncertainty of y⋆ comes from two sources: the aleatoric uncertainty in σ(f⋆) and the epistemic uncertainty in p(f⋆ | D, [x⋆, x′⋆], θ):
p(y⋆ = 1 | D, [x⋆, x′⋆], θ) = ∫ σ(f⋆) p(f⋆ | D, [x⋆, x′⋆], θ) df⋆.
◮ Aleatoric Uncertainty: the stochasticity of the underlying process ◮ Epistemic Uncertainty: the uncertainty due to limited observations ◮ Exploration should focus on epistemic uncertainty.
Multi-armed Bandits on 2D
[Figure: g(x_c) versus number of iterations on Six-Hump Camel, comparing PBO-DTS and SPARRING.]