Preferential Bayesian Optimization
Javier González, Zhenwen Dai, Andreas Damianou, Neil D. Lawrence
@ICML 2017, Sydney, Australia
My Colleagues
Javier González, Andreas Damianou, Neil D. Lawrence
Motivation
◮ Bayesian Optimization aims at searching for the global minimum of an expensive function g:
x_min = arg min_{x ∈ X} g(x).
◮ What if the function g is not directly measurable?
Preference vs. Rating
◮ The objective functions of many tasks are difficult to summarize precisely into a single value.
◮ For humans, comparison is almost always easier than rating.
◮ This observation has been exploited in A/B testing.
BO via Preference
◮ Beyond a single A/B test.
◮ The goal is to optimize a system by tuning its configuration, e.g., the font size or background color of a website.
◮ The objective, such as customer experience, is not directly measurable.
◮ Instead, we can compare the objective under two different configurations.
◮ The task is to search for the best configuration by iteratively suggesting pairs of configurations and observing the outcomes of the comparisons.
Problem Definition
◮ To find the minimum of a latent function g(x), x ∈ X.
◮ We observe only whether g(x) < g(x′) or not, for a duel [x, x′] ∈ X × X.
◮ The outcomes are binary: true or false.
◮ The outcomes are stochastic.
Preference Function
◮ In this work, the outcome distribution is assumed to be Bernoulli:
p(y ∈ {0, 1} | [x, x′]) = π^y (1 − π)^(1−y),  π = σ(g(x′) − g(x)).
◮ π is referred to as a preference function. ◮ A Preferential Bayesian optimization algorithm will propose a sequence of duels that helps efficiently localize the minimum of a latent function g(x).
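As a minimal sketch of the Bernoulli duel model above (not code from the paper), the following simulates one comparison; the quadratic `g` and the `duel` helper are illustrative stand-ins:

```python
import math
import random

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def duel(g, x, x_prime, rng):
    """Simulate one duel outcome y ~ Bernoulli(pi), where
    pi = sigma(g(x') - g(x)) is the probability that x beats x'
    (i.e., that g(x) < g(x'))."""
    pi = sigmoid(g(x_prime) - g(x))
    return 1 if rng.random() < pi else 0

# Toy latent function with its minimum at x = 0.5 (not from the paper).
g = lambda x: (x - 0.5) ** 2
```

Because π > 0.5 whenever g(x) < g(x′), the minimizer tends to win duels, but any single outcome remains stochastic.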
[Figure: the objective function g(x) with its global minimum marked, and a heat map of the preference function π over duels (x, x′).]
A Surrogate Model
◮ The preference function is not observable.
◮ We only observe a few comparisons.
◮ We need a surrogate model to guide the search.
◮ Two choices:
◮ a surrogate model for the latent function (like in standard BO). [Brochu, 2010, Guo et al., 2010] ◮ a surrogate model for the preference function
[Figure: the preference function (left) and the expectation of y⋆, i.e. the posterior mean of σ(f⋆) (right).]
A Surrogate Model of the Preference Function
◮ We propose to build a surrogate model for the preference function.
◮ Pros: easy to model; Gaussian process binary classification is used:
p(y⋆ = 1 | D, [x⋆, x′⋆], θ) = ∫ σ(f⋆) p(f⋆ | D, [x⋆, x′⋆], θ) df⋆.
◮ Pros: flexible latent function (e.g., non-stationarity).
◮ Cons: the minimum of the latent function is not directly accessible.
[Figure: the preference function (left) and the expectation of y⋆, i.e. the posterior mean of σ(f⋆) (right).]
Who is the winner (the minimum)?
◮ The minimum beats all other locations on average.
◮ Extending an idea from dueling bandits [Zoghi et al., 2015], we define the soft-Copeland score as the average winning probability,
C(x) = Vol(X)^(−1) ∫_X π_f([x, x′]) dx′.
◮ The optimum of g(x) can then be estimated as the maximizer of C, called the Condorcet winner:
x_c = arg max_{x ∈ X} C(x).
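The integral defining C(x) can be approximated on a grid. The sketch below (an illustrative assumption, not the paper's implementation) uses the preference function induced by a toy quadratic on X = [0, 1]:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_copeland(pi_f, x, grid):
    """Average winning probability of x over X, approximated on a grid:
    C(x) ~ (1/|grid|) * sum over x' of pi_f(x, x')."""
    return sum(pi_f(x, xp) for xp in grid) / len(grid)

# Toy latent function (not from the paper) and its induced preference function.
g = lambda x: (x - 0.5) ** 2
pi_f = lambda x, xp: sigmoid(g(xp) - g(x))

grid = [i / 100 for i in range(101)]
# The Condorcet winner is the maximizer of C over the grid.
x_c = max(grid, key=lambda x: soft_copeland(pi_f, x, grid))
```

Since C(x) is decreasing in g(x), the Condorcet winner coincides with the minimizer of g in this noise-free setting.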
[Figure: the objective function with its global minimum; the Copeland and soft-Copeland score functions over x; the preference function heat map.]
The current estimation of minimum
◮ We only have a surrogate model of the preference function.
◮ Estimate the soft-Copeland score from the surrogate model to obtain an approximate Condorcet winner.
◮ Note that the approximate Condorcet winner may not be the optimum of g(x).
Acquisition Function
◮ Existing acquisition functions are not applicable:
◮ they are designed to work with a surrogate model of the objective function.
◮ In PBO, the surrogate model does not directly represent the latent objective function.
◮ We need a new acquisition function for duels!
[Figure: the expectation of y⋆, i.e. the posterior mean of σ(f⋆), over duels (x, x′).]
Pure Exploration Acquisition Function (PBO-PE)
◮ The common pure-exploration acquisition function, i.e. V[y⋆], does not work.
◮ We propose a pure-exploration acquisition function as the variance (uncertainty) of the "winning" probability of a duel:
V[σ(f⋆)] = ∫ (σ(f⋆) − E[σ(f⋆)])² p(f⋆ | D, [x, x′]) df⋆.
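This integral has no closed form, but a Monte Carlo estimate is straightforward. The sketch below assumes (as an illustration, not the paper's code) that the GP posterior of f⋆ at a duel is Gaussian with mean `mu` and variance `var`:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def acq_pe(mu, var, n_mc=4000, seed=0):
    """Monte Carlo estimate of V[sigma(f*)] for f* ~ N(mu, var),
    a stand-in for the GP posterior of f at a duel [x, x']."""
    rng = random.Random(seed)
    s = [sigmoid(mu + math.sqrt(var) * rng.gauss(0.0, 1.0)) for _ in range(n_mc)]
    mean = sum(s) / n_mc
    return sum((si - mean) ** 2 for si in s) / n_mc
```

Unlike V[y⋆], this quantity shrinks as the posterior over f⋆ concentrates, which is exactly what an explorative acquisition should reward.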
[Figure: the variance of y⋆ (left) versus the variance of σ(f⋆) (right) over duels (x, x′).]
Acquisition Function: PBO-DTS
To select the next duel [x_next, x′_next]:
1. Draw a sample from the surrogate model.
2. Take x_next as the maximizer of the soft-Copeland score of the sample.
3. Take x′_next as the maximizer of the PBO-PE acquisition with x_next fixed.
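The three steps above can be sketched on a 1-D grid. Here `sample_f` stands in for one posterior sample of the latent f at a duel and `var_sigma` for the PBO-PE score; both would come from the GP surrogate in practice (the toy functions below are assumptions for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dts_next_duel(sample_f, var_sigma, grid):
    """Sketch of dueling-Thompson duel selection on a 1-D grid.
    sample_f(x, xp): one posterior sample of f at the duel [x, xp].
    var_sigma(x, xp): the PBO-PE score V[sigma(f*)] at the duel."""
    # Steps 1-2: soft-Copeland score of the drawn sample, maximized over x.
    def copeland(x):
        return sum(sigmoid(sample_f(x, xp)) for xp in grid) / len(grid)
    x_next = max(grid, key=copeland)
    # Step 3: the second arm maximizes pure exploration given x_next.
    xp_next = max(grid, key=lambda xp: var_sigma(x_next, xp))
    return x_next, xp_next
```

The first arm exploits the sampled model (Thompson sampling), while the second arm is chosen purely to reduce uncertainty about the duel's outcome.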
[Figure: a sample of σ(f⋆); the soft-Copeland function of the sample; the variance of σ(f⋆).]
Experiment: Forrester Function
◮ Synthetic 1D function: Forrester.
◮ Observations are drawn with probability
p(y = 1) = 1 / (1 + exp(g(x) − g(x′))).
◮ g(x_c) shows the value at the location that each algorithm believes is the minimum.
◮ Each curve is the average of 20 trials.
IBO: [Brochu, 2010] SPARRING: [Ailon et al., 2014]
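A sketch of this experimental setup (an illustration, not the paper's code) combines the standard Forrester function with the stated observation probability:

```python
import math
import random

def forrester(x):
    """The standard Forrester function on [0, 1]:
    g(x) = (6x - 2)^2 * sin(12x - 4), with its global minimum near x = 0.757."""
    return (6.0 * x - 2.0) ** 2 * math.sin(12.0 * x - 4.0)

def duel_outcome(x, xp, rng):
    """Draw y = 1 with probability 1 / (1 + exp(g(x) - g(x'))),
    matching the observation model stated above."""
    p = 1.0 / (1.0 + math.exp(forrester(x) - forrester(xp)))
    return 1 if rng.random() < p else 0
```

An algorithm only ever sees such binary outcomes; g(x_c) in the plots is evaluated on the true function for reporting purposes.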
[Figure: g(x_c) versus number of iterations on Forrester, comparing PBO-PE, PBO-DTS, PBO-CEI, RANDOM, IBO, and SPARRING.]
Experiments: More (2D) Functions
[Figure: g(x_c) versus number of iterations on Forrester (PBO-PE, PBO-DTS, PBO-CEI, RANDOM, IBO, SPARRING) and on Six-Hump Camel, Goldstein, and Levy (PBO-PE, PBO-DTS, RANDOM, IBO).]
Summary
◮ Address Bayesian optimization with preferential returns.
◮ Propose to build a surrogate model for the preference function.
◮ Propose a few efficient acquisition functions.
◮ Show the performance on synthetic functions.
Questions?
Exploration & Exploitation
The two ingredients in an acquisition function: Exploration & Exploitation.
Exploration in PBO
◮ To understand exploration in PBO, we design a pure-exploration acquisition function.
◮ Exploration in standard BO can be viewed as the action of reducing the uncertainty of a surrogate model.
◮ A purely explorative acquisition function:
V[y⋆] = ∫ (y⋆ − E[y⋆])² p(y⋆ | D, x⋆) dy⋆.
◮ Can we extend this idea to PBO?
A Straightforward Choice
◮ A straightforward extension from standard BO:
V[y⋆] = Σ_{y⋆ ∈ {0,1}} (y⋆ − E[y⋆])² p(y⋆ | D, [x⋆, x′⋆]) = E[y⋆](1 − E[y⋆]).
◮ The maximum variance is always where E[y⋆] = 0.5!
◮ The variance may not reduce with observations!
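The failure mode is easy to verify numerically: V[y⋆] depends only on E[y⋆] and is pinned at 0.25 wherever E[y⋆] = 0.5, regardless of how many duels have been observed. A tiny check:

```python
def bernoulli_variance(p):
    """V[y*] = E[y*](1 - E[y*]) for a Bernoulli outcome with mean p."""
    return p * (1.0 - p)

# The maximizer over any grid of means is always p = 0.5, so an acquisition
# based on V[y*] keeps proposing duels that look maximally uncertain forever.
ps = [i / 100 for i in range(101)]
p_max = max(ps, key=bernoulli_variance)
```

This is why the PBO-PE acquisition targets V[σ(f⋆)], which does shrink as the posterior over f⋆ concentrates.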
[Figure: the expectation of y⋆ (posterior mean of σ(f⋆)) and the variance of y⋆ over duels (x, x′).]
Dueling-Thompson Sampling (DTS)
◮ To balance exploration and exploitation, we borrow the idea of Thompson sampling: draw a sample from the surrogate model.
◮ Compute the soft-Copeland score on the drawn sample.
◮ The value x_next that maximizes the soft-Copeland score of the sample gives a good balance between exploration and exploitation.
◮ Take it as the first element of the next duel.
[Figure: 100 sampled Copeland functions at #duels = 10, 30, and 150.]
Aleatoric Uncertainty & Epistemic Uncertainty
◮ The uncertainty of y⋆ comes from two sources: the aleatoric uncertainty in σ(f⋆) and the epistemic uncertainty in p(f⋆ | D, [x⋆, x′⋆], θ):
p(y⋆ = 1 | D, [x⋆, x′⋆], θ) = ∫ σ(f⋆) p(f⋆ | D, [x⋆, x′⋆], θ) df⋆.
◮ Aleatoric Uncertainty: the stochasticity of the underlying process ◮ Epistemic Uncertainty: the uncertainty due to limited observations ◮ Exploration should focus on epistemic uncertainty.
Multi-armed Bandits on 2D
[Figure: g(x_c) versus number of iterations on Six-Hump Camel, comparing PBO-DTS and SPARRING.]