Scaling Bayesian Optimization in High Dimensions
Stefanie Jegelka, MIT BayesOpt Workshop 2017
joint work with Zi Wang, Chengtao Li, Clement Gehring (MIT) and Pushmeet Kohli (DeepMind)
Bayesian Optimization with GPs
Gaussian process prior $f \sim \mathcal{GP}(\mu, k)$: closed-form expressions for the posterior mean and variance (uncertainty).
Selection criterion (acquisition function): $x_{t+1} = \arg\max_{x \in \mathcal{X}} \alpha_t(x)$
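A minimal sketch of this loop in Python (all names illustrative, not from the talk): fit a model to the data so far, maximize $\alpha_t$ over a candidate set, query $f$, and repeat.

```python
import numpy as np

def bo_loop(f, acquisition, d, n_iter=50, n_candidates=2000, rng=None):
    """Generic BO loop: x_{t+1} = argmax_{x in X} alpha_t(x) over X = [0,1]^d.
    `acquisition(candidates, X_obs, y_obs)` is any acquisition function
    (EI, UCB, MES, ...) scoring candidate points given the data so far."""
    rng = np.random.default_rng() if rng is None else rng
    X_obs = rng.uniform(size=(1, d))          # one random initial query
    y_obs = np.array([f(X_obs[0])])
    for _ in range(n_iter):
        candidates = rng.uniform(size=(n_candidates, d))
        scores = acquisition(candidates, X_obs, y_obs)
        x_next = candidates[np.argmax(scores)]
        X_obs = np.vstack([X_obs, x_next])    # query f and grow the data set
        y_obs = np.append(y_obs, f(x_next))
    return X_obs[np.argmax(y_obs)], y_obs.max()
```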
Entropy Search (ES) and Predictive Entropy Search (PES) maximize the information gained about the maximizer $x^*$ (Hennig & Schuler, 2012; Hernández-Lobato, Hoffman & Ghahramani, 2014):

$\alpha_t(x) = H(p(x^* \mid D_t)) - \mathbb{E}\left[H(p(x^* \mid D_t \cup \{x, y\}))\right]$  (ES)
$\alpha_t(x) = H(p(y \mid D_t, x)) - \mathbb{E}\left[H(p(y \mid D_t, x, x^*))\right]$  (PES)

The two forms are equivalent by the symmetry of mutual information: $I(a; b) = H(a) - H(a \mid b) = H(b) - H(b \mid a)$.
PES as mutual information: $\alpha_t(x) = I(\{x, y\};\, x^* \mid D_t)$
Max-value Entropy Search (MES): condition on the 1-dimensional maximum value $y^*$ (output space) instead of the $d$-dimensional maximizer $x^*$ (input space), reducing $d$ dimensions to 1:

$\alpha_t(x) = I(\{x, y\};\, y^* \mid D_t) \approx \frac{1}{K} \sum_{y^* \in Y^*} \left[ \frac{\gamma_{y^*}(x)\,\psi(\gamma_{y^*}(x))}{2\,\Psi(\gamma_{y^*}(x))} - \log \Psi(\gamma_{y^*}(x)) \right]$

where $\gamma_{y^*}(x) = \frac{y^* - \mu_t(x)}{\sigma_t(x)}$, $\psi$ and $\Psi$ are the standard normal PDF and CDF, and $Y^*$ is a set of $K$ sampled max-values.
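A sketch of evaluating this acquisition from the GP posterior, assuming the max-value samples $Y^*$ have already been drawn (see the two sampling approaches below); `mu` and `sigma` are the posterior mean and standard deviation at the candidate points:

```python
import numpy as np
from scipy.stats import norm

def mes_acquisition(mu, sigma, y_star_samples):
    """MES acquisition (Wang & Jegelka, 2017): average over sampled y* of
    gamma * pdf(gamma) / (2 * cdf(gamma)) - log cdf(gamma),
    with gamma = (y* - mu(x)) / sigma(x)."""
    mu, sigma = np.asarray(mu), np.asarray(sigma)
    gamma = (np.asarray(y_star_samples)[None, :] - mu[:, None]) / sigma[:, None]
    cdf = np.clip(norm.cdf(gamma), 1e-12, None)   # guard against log(0)
    return (gamma * norm.pdf(gamma) / (2 * cdf) - np.log(cdf)).mean(axis=1)
```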
How to sample $y^*$? Approach 1: a Gumbel approximation. Fisher-Tippett-Gnedenko Theorem: the maximum of a set of i.i.d. Gaussian variables is asymptotically described by a Gumbel distribution.
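A sketch of this estimator under stated assumptions: approximate $\Pr[\max_x f(x) < y] \approx \prod_i \Psi((y-\mu_i)/\sigma_i)$ over a discretization (an independence approximation), fit Gumbel parameters by quantile matching, and draw $y^*$ by inverse-CDF sampling. Here `mu`, `sigma` are posterior means and standard deviations on the discretization; the quartile choice is one reasonable fitting recipe.

```python
import numpy as np
from scipy.stats import norm

def sample_y_star_gumbel(mu, sigma, n_samples=10):
    """Sample max-values y* via a Gumbel approximation (MES-Gumbel sketch)."""
    def cdf_max(y):  # independence approximation to Pr[max f < y]
        return np.prod(norm.cdf((y - mu) / sigma))
    def quantile(r):  # invert the monotone cdf_max by bisection
        lo, hi = mu.min() - 5 * sigma.max(), mu.max() + 5 * sigma.max()
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if cdf_max(mid) < r else (lo, mid)
        return 0.5 * (lo + hi)
    y25, y50, y75 = quantile(0.25), quantile(0.5), quantile(0.75)
    # Gumbel(a, b) has CDF exp(-exp(-(y - a)/b)); match the quartiles.
    b = (y75 - y25) / (np.log(np.log(4.0)) - np.log(np.log(4.0 / 3.0)))
    a = y50 + b * np.log(np.log(2.0))
    r = np.random.uniform(size=n_samples)
    return a - b * np.log(-np.log(r))  # inverse-CDF Gumbel samples
```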
Approach 2: sample functions from the GP posterior and maximize them. Approximate the kernel with random features and sample the posterior weights (Hernández-Lobato, Hoffman & Ghahramani, 2014). Neal (1994): a GP is equivalent to an infinite one-layer neural network with Gaussian weights, so a finite random-feature expansion $f(x) \approx \phi(x)^\top w$ with Gaussian weights $w$ yields an approximate posterior sample of the whole function.
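A sketch of this construction for an RBF kernel (parameter choices illustrative): with random Fourier features, the GP reduces to Bayesian linear regression on the features, so one posterior weight sample gives an entire function, which can then be maximized to obtain a $y^*$ (or $x^*$) sample.

```python
import numpy as np

def sample_posterior_function(X_obs, y_obs, d, n_features=500,
                              lengthscale=0.1, noise_var=0.01, rng=None):
    """Approximate GP-posterior function sample via random Fourier features."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.normal(0.0, 1.0 / lengthscale, size=(n_features, d))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    def phi(X):  # random cosine features approximating the RBF kernel
        return np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)
    # Bayesian linear regression on the features: prior w ~ N(0, I),
    # observations y = phi(X) w + noise.
    Phi = phi(X_obs)                                   # (n, m)
    A = Phi.T @ Phi + noise_var * np.eye(n_features)   # scaled posterior precision
    w_mean = np.linalg.solve(A, Phi.T @ y_obs)
    w_cov = noise_var * np.linalg.inv(A)
    w = rng.multivariate_normal(w_mean, w_cov)         # one posterior weight sample
    return lambda X: phi(X) @ w                        # a sampled function
```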
[Plot: Simple Regret vs. Iteration (1-200)]

Runtime per iteration (s):

             1 sample   10 samples   100 samples
PES          0.12       5.85         15.24
MES-NN       0.09       0.67         1.61
MES-Gumbel   0.09       0.13         0.2
A zoo of acquisition functions: EI (Mockus, 1974), PI (Kushner, 1964), GP-UCB (Auer, 2002; Srinivas et al., 2010), GP-MI (Contal et al., 2014), ES (Hennig & Schuler, 2012), PES (Hernández-Lobato et al., 2014), EST (Wang et al., 2016), GLASSES (González et al., 2016), SMAC (Hutter et al., 2010), ROAR (Hutter et al., 2010), … and MES (Wang & Jegelka, 2017).
Theory: MES with a sampled $y^*$ corresponds to GP-UCB/EST with a specific, adaptive parameter setting. With probability $1-\delta$, within $T' = O(T \log \tfrac{1}{\delta})$ iterations the simple regret satisfies

$f^* - \max_{t \in [1, T']} f(x_t) = O\!\left(\sqrt{\frac{(\log T)^{d+2}}{T}}\right)$
Scaling to high dimensions: assume additive structure, $f(x) = \sum_{m \in [M]} f_m(x^{A_m})$, where the index sets $A_m$ partition the input dimensions (Hastie & Tibshirani, 1990; Kandasamy et al., 2015).
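Under this assumption the kernel itself becomes additive, $k(x, x') = \sum_m k_m(x^{A_m}, x'^{A_m})$, and each component only sees a low-dimensional slice of the input. A minimal sketch with RBF components (group structure and lengthscale illustrative):

```python
import numpy as np

def additive_rbf_kernel(X1, X2, groups, lengthscale=0.5):
    """Additive kernel: sum of RBF kernels, each restricted to one group
    of dimensions, e.g. groups = [[0, 1], [2], [3, 4]]."""
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for A in groups:
        d2 = ((X1[:, None, A] - X2[None, :, A]) ** 2).sum(-1)
        K += np.exp(-0.5 * d2 / lengthscale**2)
    return K
```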
Learn the decomposition from data: assign each input dimension $j$ to a group via $z_j \sim \mathrm{Multi}(\theta)$ (e.g., $z = (0,1,0,0,1,1,1,0,2)$ assigns nine dimensions to three groups), and integrate out the assignments by Gibbs sampling, as sketched below.
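A simplified sketch of one Gibbs sweep over the assignments; `log_evidence` is a hypothetical helper returning the log marginal likelihood of the additive GP for a given decomposition, and the count term comes from $z_j \sim \mathrm{Multi}(\theta)$ with a Dirichlet prior on $\theta$.

```python
import numpy as np

def gibbs_sweep(z, X, y, n_groups, log_evidence, alpha=1.0, rng=None):
    """One Gibbs sweep over dimension-to-group assignments z (int array)."""
    rng = np.random.default_rng() if rng is None else rng
    for j in range(len(z)):
        counts = np.bincount(np.delete(z, j), minlength=n_groups)
        logp = np.empty(n_groups)
        for m in range(n_groups):
            z[j] = m  # tentatively place dimension j in group m
            groups = [np.flatnonzero(z == g) for g in range(n_groups)]
            # p(z_j = m | rest) ∝ (count_m + alpha) * p(y | X, decomposition)
            logp[m] = np.log(counts[m] + alpha) + log_evidence(groups, X, y)
        p = np.exp(logp - logp.max())
        z[j] = rng.choice(n_groups, p=p / p.sum())
    return z
```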
[Plot: regret r_t vs. iteration t]
Robot pushing task: [Plot: Simple Regret vs. Iteration]
Synthetic function, 50 dimensions
[Figure: estimated acquisition function after 3 observations; panels (c), (d)]
$\mu(x) = k_n(x)^\top (K_n + \tau^2 I)^{-1} y_n$
$\sigma^2(x) = k(x, x) - k_n(x)^\top (K_n + \tau^2 I)^{-1} k_n(x)$
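These two formulas in code (a standard Cholesky-based sketch; `kernel` can be any covariance function, e.g. the additive RBF kernel above):

```python
import numpy as np

def gp_posterior(X_obs, y_obs, X_query, kernel, noise_var):
    """Closed-form GP posterior mean and variance at query points."""
    K = kernel(X_obs, X_obs) + noise_var * np.eye(len(X_obs))  # K_n + tau^2 I
    k_star = kernel(X_obs, X_query)                            # k_n(x), shape (n, q)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))    # (K_n + tau^2 I)^{-1} y_n
    v = np.linalg.solve(L, k_star)
    mu = k_star.T @ alpha
    var = np.diag(kernel(X_query, X_query)) - (v ** 2).sum(axis=0)
    return mu, var
```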
[Figure: GP posterior mean µ with ±3σ bands against the true f: full kernel vs. low-rank approximation]
[Plot: Gibbs sampling time (minutes) vs. observation size (10k-50k)]
We stopped SKL after 2 hours; EBO's average runtime was 61 seconds.
[Figures: posterior reconstructions of f(x) — Ground Truth vs. 5000, 1000, and 100 Observations; a second row of four panels, each with 5000 Observations]
[Plot: Regret vs. Time (minutes) for BO-SVI, BO-Add-SVI, PBO, EBO]
(Hensman et al., 2013; Wang et al., 2017)
(Wang, Jegelka, ICML 2017)
(Wang, Li, Jegelka, Kohli, ICML 2017)
(Wang, Gehring, Kohli, Jegelka, BayesOpt 2017)