SLIDE 1

Scaling Bayesian Optimization in High Dimensions

Stefanie Jegelka, MIT
 BayesOpt Workshop 2017


joint work with Zi Wang, Chengtao Li, Clement Gehring (MIT) 
 and Pushmeet Kohli (DeepMind)

SLIDE 2

Bayesian Optimization with GPs

BO: sequentially build a model of f
for t = 1, …, T:

  • select new query point(s) x
  • observe f(x)
  • update model & repeat


[Figure: GP posterior over f(x): mean μ with ±σ confidence band]

Gaussian process: closed form expressions for posterior mean and 
 variance (uncertainty)

f ∼ GP(µ, k)

selection criterion: acquisition function

argmax_{x ∈ X} α_t(x)

SLIDE 3

Challenges in high dimensions

statistical & computational complexity:

  • estimating & optimizing the acquisition function
    → new, sample-efficient acquisition function (ICML 2017)
  • function estimation in high dimensions
    → learn input structure (ICML 2017)
  • many observations (data points): huge matrices in the GP
  • parallelization
    → multiple random partitions (BayesOpt 2017)

SLIDE 4

(Predictive) Entropy Search
(Hennig & Schuler, 2012; Hernández-Lobato, Hoffman & Ghahramani, 2014)

α_t(x) = I({x, y}; x∗ | D_t), where x∗ is the location of the global optimum of f

  ES:  α_t(x) = H(p(x∗ | D_t)) − E[H(p(x∗ | D_t ∪ {x, y}))]
  PES: α_t(x) = H(p(y | D_t, x)) − E[H(p(y | D_t, x, x∗))]

(both use the symmetry of mutual information: I(a; b) = H(a) − H(a|b) = H(b) − H(b|a))

new query point: argmax_{x ∈ X} α_t(x)
but x∗ is high-dimensional: α_t(x) is costly to estimate!

SLIDE 5

Max-value Entropy Search

key idea: replace the d-dimensional optimizer x∗ (input space X) by the 1-dimensional max-value y∗ (output space): d → 1 dimensions!

α_t(x) = I({x, y}; y∗ | D_t)
       ≈ (1/K) Σ_{y∗ ∈ Y∗} [ γ_{y∗}(x) ψ(γ_{y∗}(x)) / (2 Ψ(γ_{y∗}(x))) − log Ψ(γ_{y∗}(x)) ]

where γ_{y∗}(x) = (y∗ − μ_t(x)) / σ_t(x), and ψ, Ψ are the standard normal pdf and cdf

  • closed-form!
  • expectation over p(y∗ | D_t): how to sample y∗?
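The sum above is cheap to evaluate; a sketch, assuming the posterior mean mu, standard deviation sigma, and max-value samples y_star are already in hand (illustrative, not the authors' code):

    # MES acquisition: average the closed-form term over K sampled max-values.
    # gamma = (y* - mu(x)) / sigma(x); psi/Psi = standard normal pdf/cdf.
    import numpy as np
    from scipy.stats import norm

    def mes_acquisition(mu, sigma, y_star):
        mu = np.asarray(mu)[:, None]                 # shape (n_points, 1)
        sigma = np.asarray(sigma)[:, None]
        gamma = (np.asarray(y_star)[None, :] - mu) / np.maximum(sigma, 1e-12)
        Psi = np.clip(norm.cdf(gamma), 1e-12, None)  # guard against log(0)
        term = gamma * norm.pdf(gamma) / (2.0 * Psi) - np.log(Psi)
        return term.mean(axis=1)                     # average over the K samples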

SLIDE 6

Sampling y*: Idea 1

at any fixed x, p(f(x)) is a 1D Gaussian

  • sample representative points
  • approximate the max-value of the representative points by a Gumbel distribution

Fisher–Tippett–Gnedenko Theorem: the maximum of a set of i.i.d. Gaussian variables is asymptotically described by a Gumbel distribution.

[Figure: Gaussian marginals p(f(x)) at representative points, f(x) vs. x]
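A sketch of this approximation: match two quantiles of the exact max-CDF P(max < y) = ∏_i Ψ((y − μ_i)/σ_i) over the representative points to fix the Gumbel parameters (the 25%/75% quantile choice and the ±6σ search bracket are illustrative assumptions):

    # Fit a Gumbel distribution to the max of independent Gaussians, then sample it.
    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import brentq

    def sample_y_star(mu, sigma, n_samples, rng):
        # exact CDF of max_i f(x_i) for independent Gaussian marginals
        cdf_max = lambda y: np.exp(np.sum(norm.logcdf((y - mu) / sigma)))
        lo, hi = (mu - 6 * sigma).min(), (mu + 6 * sigma).max()
        # invert the max-CDF at two probabilities (quantile matching)
        q25 = brentq(lambda y: cdf_max(y) - 0.25, lo, hi)
        q75 = brentq(lambda y: cdf_max(y) - 0.75, lo, hi)
        # Gumbel CDF: exp(-exp(-(y - a)/b)); its p-quantile is a - b*log(-log p)
        c25, c75 = np.log(-np.log(0.25)), np.log(-np.log(0.75))
        b = (q25 - q75) / (c75 - c25)
        a = q25 + b * c25
        # inverse-CDF sampling from the fitted Gumbel
        return a - b * np.log(-np.log(rng.uniform(size=n_samples)))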

SLIDE 7

Sampling y*: Idea 2

draw functions from the GP posterior and maximize each. How?

  • approximate the GP as a finite neural network (random features) & sample posterior weights
  • maximize the network output for each sample

(Hernández-Lobato, Hoffman & Ghahramani, 2014)

Neal 1994: a GP is an infinite 1-layer neural network with Gaussian weights.

[Figure: posterior function samples, output f(x) vs. input x; network diagram x → random weights → f(x)]
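A sketch of this construction with random Fourier features for a squared-exponential kernel; the feature count, lengthscale, noise level, and grid search over [0, 1]^d are illustrative assumptions:

    # One max-value sample y*: approximate the GP by Bayesian linear regression
    # on m random Fourier features, sample the weight posterior, and maximize
    # the resulting (finite) network output.
    import numpy as np

    def rff_max_sample(X, y, m=500, lengthscale=1.0, tau=0.1,
                       n_candidates=5000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        W = rng.normal(scale=1.0 / lengthscale, size=(m, d))    # random weights
        b = rng.uniform(0, 2 * np.pi, size=m)
        phi = lambda Z: np.sqrt(2.0 / m) * np.cos(Z @ W.T + b)  # feature map
        Phi = phi(X)
        # weight posterior: N(A^{-1} Phi^T y / tau^2, A^{-1}), A = Phi^T Phi / tau^2 + I
        A = Phi.T @ Phi / tau**2 + np.eye(m)
        L = np.linalg.cholesky(A)
        mean = np.linalg.solve(L.T, np.linalg.solve(L, Phi.T @ y)) / tau**2
        theta = mean + np.linalg.solve(L.T, rng.normal(size=m))  # one posterior draw
        # maximize the sampled function f(x) = phi(x)^T theta
        # (grid search over [0,1]^d as a stand-in for a proper optimizer)
        cand = rng.uniform(0, 1, size=(n_candidates, d))
        return (phi(cand) @ theta).max()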

SLIDE 8

Max-value Entropy Search

α_t(x) = I({x, y}; y∗ | D_t)

recall: the optimizer x∗ lives in the d-dimensional input space X, the max-value y∗ in the 1-dimensional output space: d → 1 dimensions!

expectation over p(y∗ | D_t): and now we can sample y∗!

Does it work?

SLIDE 9

Empirically: max-value enough? sample-efficiency?

[Plot: simple regret vs. iteration (1–200) for PES (sampling x∗) and MES-G (sampling y∗), each with 1, 10, and 100 samples]

SLIDE 10

Empirically: faster than PES

Runtime per iteration (s):

  samples       1      10     100
  PES           0.12   5.85   15.24
  MES-NN        0.09   0.67   1.61
  MES-Gumbel    0.09   0.13   0.20

SLIDE 11

Connections & Theory

zoo of acquisition functions: EI (Mockus, 1974), PI (Kushner, 1964), GP-UCB (Auer, 2002; Srinivas et al., 2010), GP-MI (Contal et al., 2014), ES (Hennig & Schuler, 2012), PES (Hernández-Lobato et al., 2014), EST (Wang et al., 2016), GLASSES (González et al., 2016), SMAC (Hutter et al., 2010), ROAR (Hutter et al., 2010), … and now MES

Lemma (Wang & Jegelka, 2017). The following acquisition functions are equivalent:

  • MES with a single sample of y∗ per step
  • UCB (upper confidence bound; Srinivas et al., 2010) with a specific, adaptive parameter setting
  • PI (probability of improvement; Kushner, 1964) with a specific, adaptive parameter setting

Theorem: regret bound (Wang & Jegelka, 2017). With probability 1 − δ, within T′ = O(T log(1/δ)) iterations:

f∗ − max_{t ∈ [1, T′]} f(x_t) = O( √( (log T)^{d+2} / T ) )

SLIDE 12

Gaussian Processes in high dimensions

  • estimating a nonlinear function in high input dimensions: statistically challenging
  • optimizing a nonconvex acquisition function in high dimensions: computationally challenging
  • many observations (huge matrices in the GP): computationally challenging

SLIDE 13

Additive Gaussian Processes

f(x) = Σ_{m ∈ [M]} f_m(x^{A_m})
(Hastie & Tibshirani, 1990; Kandasamy et al., 2015)

  • lower-complexity component functions → statistical efficiency
  • optimize the acquisition function block-wise → computational efficiency

[Figure: decomposition f = f_0(x^{A_0}) + f_1(x^{A_1}) + f_2(x^{A_2})]
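Concretely, an additive kernel is a sum of kernels, each acting only on its own coordinate block; a sketch with RBF components (here the partition is given, not yet learned):

    # Additive RBF kernel: k(x, x') = sum_m k_m(x^{A_m}, x'^{A_m}).
    import numpy as np

    def additive_rbf(X1, X2, groups, lengthscale=1.0):
        """groups: list of index lists A_m partitioning the input dimensions."""
        K = np.zeros((X1.shape[0], X2.shape[0]))
        for A in groups:
            diff = X1[:, None, A] - X2[None, :, A]   # pairwise differences on block A
            K += np.exp(-0.5 * np.sum(diff**2, axis=-1) / lengthscale**2)
        return K

    # e.g. 5 input dimensions decomposed as {0, 2} ∪ {1} ∪ {3, 4}:
    # K = additive_rbf(X, X, groups=[[0, 2], [1], [3, 4]])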

What is the partition?

SLIDE 14

Structural Kernel Learning

f = f_0 + f_1 + f_2, with the decomposition encoded by an assignment vector

z = [0 1 0 0 1 1 1 0 2]

Learn the assignment! Key idea: Dirichlet prior on z:
z_j ∼ Multi(θ), θ ∼ Dir(α), and integrate out θ

posterior p(z | D_n; α) via Gibbs sampling: easy updates
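A sketch of one such Gibbs sweep, reusing the additive_rbf kernel sketched above: the evidence term is the additive-GP marginal likelihood and the prior weight is the collapsed Dirichlet-multinomial count (illustrative, not the paper's implementation):

    # Resample each dimension's group assignment z_j proportional to
    # (marginal likelihood of the data) x (Dirichlet-multinomial prior weight).
    import numpy as np

    def log_evidence(X, y, z, M, tau=0.1):
        groups = [np.flatnonzero(z == m) for m in range(M)]
        K = additive_rbf(X, X, [list(g) for g in groups if g.size > 0])
        K += tau**2 * np.eye(len(y))
        L = np.linalg.cholesky(K)
        a = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return -0.5 * y @ a - np.log(np.diag(L)).sum()   # log N(y; 0, K) + const

    def gibbs_sweep(X, y, z, M, alpha=1.0, rng=None):
        rng = rng or np.random.default_rng()
        for j in range(len(z)):
            counts = np.bincount(np.delete(z, j), minlength=M)
            logp = np.empty(M)
            for m in range(M):
                z[j] = m
                logp[m] = log_evidence(X, y, z, M) + np.log(counts[m] + alpha)
            p = np.exp(logp - logp.max())
            z[j] = rng.choice(M, p=p / p.sum())
        return z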

SLIDE 15

Empirical Results

[Plots: simple regret r_t vs. iteration t on a robot pushing task and on a 50-dimensional synthetic function; methods: True partition, No Partition, Fully Partitioned, SKL, and heuristics]

SLIDE 16

Curious connections

  • crossover in evolutionary algorithms: new candidates recombine coordinates of observed good points, with a completely random coordinate partition
  • BO with an additive GP: query points likewise recombine coordinates of observed good points, but with a learned coordinate partition

[Figure: 3 observations and the estimated acquisition function; recombined coordinates such as (0.5, 0.1, 0.3), (0.9, 0.8, 0.5), (0.5, 0.8, 0.3)]

SLIDE 17

Gaussian Processes in high dimensions

  • estimating nonlinear functions in high input dimensions: statistically challenging
  • optimizing a nonconvex acquisition function in high dimensions: computationally challenging
  • many observations (huge matrix inversions): computationally challenging

μ(x) = k_n(x)ᵀ(K_n + τ²I)⁻¹ y_n
σ²(x) = k(x, x) − k_n(x)ᵀ(K_n + τ²I)⁻¹ k_n(x)

[Figures: posterior mean μ, ±3σ band, and true f, for the full kernel vs. a low-rank approximation]
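These two expressions in code, using a Cholesky factorization rather than an explicit inverse (a standard sketch, not the talk's implementation; kernel is any cross-covariance function, e.g. the additive one above):

    # GP posterior mean and variance at test points Xs:
    #   mu(x)     = k_n(x)^T (K_n + tau^2 I)^{-1} y_n
    #   sigma2(x) = k(x, x) - k_n(x)^T (K_n + tau^2 I)^{-1} k_n(x)
    import numpy as np

    def gp_posterior(kernel, X, y, Xs, tau=0.1):
        Kn = kernel(X, X) + tau**2 * np.eye(len(y))
        ks = kernel(X, Xs)                  # shape (n, n_test)
        L = np.linalg.cholesky(Kn)
        mu = ks.T @ np.linalg.solve(L.T, np.linalg.solve(L, y))
        V = np.linalg.solve(L, ks)          # so ks^T Kn^{-1} ks = V^T V
        var = np.diag(kernel(Xs, Xs)) - np.sum(V**2, axis=0)
        return mu, var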

SLIDE 18

Ensemble Bayesian Optimization

in each iteration:

  • partition the data via a Mondrian process (a distribution over partitions: a new draw in each iteration)
  • fit a GP in each part: structure learning + tile coding; synchronize across parts
  • select query points in parallel & filter (parallelization across parts)
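A heavily simplified sketch of one such iteration: random axis-aligned cuts stand in for the Mondrian-process draw, and a placeholder proposal stands in for the per-part model fitting and acquisition optimization:

    # One EBO-style iteration: draw a fresh partition, handle parts in parallel.
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def random_partition(X, depth, rng):
        """Label points by `depth` random axis-aligned cuts (up to 2**depth parts)."""
        labels = np.zeros(len(X), dtype=int)
        for _ in range(depth):
            dim = rng.integers(X.shape[1])
            cut = rng.uniform(X[:, dim].min(), X[:, dim].max())
            labels = 2 * labels + (X[:, dim] > cut)
        return labels

    def propose_in_part(X_part, y_part, rng):
        # placeholder: fit a local GP and maximize a local acquisition here;
        # a random point in the part's bounding box stands in for that step
        return rng.uniform(X_part.min(axis=0), X_part.max(axis=0))

    def ebo_iteration(X, y, depth=2, seed=0):
        rng = np.random.default_rng(seed)
        labels = random_partition(X, depth, rng)   # new draw every iteration
        parts = np.unique(labels)
        with ThreadPoolExecutor() as pool:         # parallel across parts
            futures = [pool.submit(propose_in_part, X[labels == m],
                                   y[labels == m], np.random.default_rng(seed + 1 + m))
                       for m in parts]
            return np.array([f.result() for f in futures])  # then filter & evaluate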

SLIDE 19

Does it scale?

[Plot: Gibbs sampling time (minutes) vs. observation size (10k–50k) for SKL and EBO]

We stopped SKL after 2 hours; EBO's average runtime was 61 seconds.

SLIDE 20

Variances

[Figures: posterior bands over f(x) vs. x; panels: Ground Truth, 5000 Observations, 1000 Observations, 100 Observations, plus four additional panels at 5000 Observations]

SLIDE 21

Empirical Results

[Plot: regret vs. time (minutes) for BO-SVI, BO-Add-SVI, PBO, and EBO]
(Hensman et al., 2013; Wang et al., 2017)

SLIDE 22

Summary: GP-BO in high dimensions

Challenge: high dimensions, many observations; we need statistical & computational efficiency

  • Max-value Entropy Search: a sample-efficient, effective acquisition function (Wang & Jegelka, ICML 2017)
  • many dimensions: learning structured kernels (Wang, Li, Jegelka & Kohli, ICML 2017)
  • many observations, many dimensions & parallelization: Ensemble Bayesian Optimization (Wang, Gehring, Kohli & Jegelka, BayesOpt 2017)

SLIDE 23

References

  • Zi Wang, Stefanie Jegelka. Max-value Entropy Search for Efficient Bayesian Optimization. ICML 2017.
  • Zi Wang, Chengtao Li, Stefanie Jegelka, Pushmeet Kohli. Batched High-dimensional Bayesian Optimization via Structural Kernel Learning. ICML 2017.
  • Zi Wang, Clement Gehring, Pushmeet Kohli, Stefanie Jegelka. Batched Large-scale Bayesian Optimization in High-dimensional Spaces. BayesOpt 2017.