SLIDE 1

High Dimensional Bayesian Optimisation and Bandits via Additive Models

Kirthevasan Kandasamy, Jeff Schneider, Barnabás Póczos

ICML ’15, July 8, 2015

SLIDE 2

Bandits & Optimisation

Maximum Likelihood inference in Computational Astrophysics.

[Diagram: cosmological simulator producing an observation]

E.g., parameters such as the Hubble constant and the baryonic density.



SLIDE 5

Bandits & Optimisation

Expensive Blackbox Function.

Examples: hyper-parameter tuning in ML; optimal control strategy in Robotics.

SLIDE 6

Bandits & Optimisation

f : [0, 1]^D → ℝ is an expensive, black-box, nonconvex function. Let x* = argmax_x f(x).

[Figure: a 1-D function f with its maximiser x* and value f(x*) marked]


SLIDE 8

Bandits & Optimisation

f : [0, 1]^D → ℝ is an expensive, black-box, nonconvex function. Let x* = argmax_x f(x).

Optimisation ≅ Minimise Simple Regret:
S_T = f(x*) − max_{t=1,…,T} f(x_t).

SLIDE 9

Bandits & Optimisation

f : [0, 1]^D → ℝ is an expensive, black-box, nonconvex function. Let x* = argmax_x f(x).

Bandits ≅ Minimise Cumulative Regret:
R_T = Σ_{t=1}^{T} ( f(x*) − f(x_t) ).
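Both regret notions are straightforward to compute once the query history is known; below is a minimal sketch (the helper name `regrets` and the NumPy setup are illustrative, not code from the talk):

```python
import numpy as np

def regrets(f_opt, f_queried):
    """Simple and cumulative regret of a query sequence.

    f_opt     : f(x*), the optimum value (unknown in practice).
    f_queried : values f(x_t) for t = 1, ..., T.
    """
    f_queried = np.asarray(f_queried, dtype=float)
    simple = f_opt - np.max(f_queried)        # S_T: gap to the best query so far
    cumulative = np.sum(f_opt - f_queried)    # R_T: summed per-step gaps
    return simple, cumulative
```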


SLIDE 11

Gaussian Process (Bayesian) Optimisation

Model f ∼ GP(0, κ).

[Figure: samples from the GP prior over a 1-D domain]

SLIDE 12

Gaussian Process (Bayesian) Optimisation

Model f ∼ GP(0, κ). Obtain posterior GP.

[Figure: GP posterior mean and confidence band after a few observations]
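The posterior here follows the standard zero-mean GP regression formulas; a minimal Cholesky-based sketch (with an assumed noise level, not code from the talk):

```python
import numpy as np

def gp_posterior(K, k_star, k_ss_diag, y, noise=1e-3):
    """Posterior mean and std at test points for a zero-mean GP.

    K         : (n, n) kernel matrix of the observed points.
    k_star    : (n, m) cross-kernel between observed and test points.
    k_ss_diag : (m,) prior variances at the test points.
    y         : (n,) observed values.
    """
    L = np.linalg.cholesky(K + noise * np.eye(len(y)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = k_star.T @ alpha                       # posterior mean
    v = np.linalg.solve(L, k_star)
    var = k_ss_diag - np.sum(v * v, axis=0)     # posterior variance
    return mu, np.sqrt(np.maximum(var, 0.0))
```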

SLIDE 13

Gaussian Process (Bayesian) Optimisation

Model f ∼ GP(0, κ).

Maximise acquisition function ϕ_t: x_t = argmax_x ϕ_t(x).

GP-UCB: ϕ_t(x) = µ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x) (Srinivas et al. 2010)

[Figure: acquisition ϕ_t over [0, 1] with maximiser x_t = 0.828]

SLIDE 14

Gaussian Process (Bayesian) Optimisation

Other choices of ϕ_t: Expected Improvement (GP-EI), Thompson Sampling, etc.
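As a concrete illustration of the GP-UCB rule, here is a minimal sketch that scores a grid of candidates (the placeholder posterior arrays stand in for a real GP posterior, e.g. from `gp_posterior` above):

```python
import numpy as np

def gp_ucb(mu, sigma, beta_t):
    """GP-UCB score: phi_t(x) = mu_{t-1}(x) + sqrt(beta_t) * sigma_{t-1}(x)."""
    return mu + np.sqrt(beta_t) * sigma

# 1-D toy usage: pick the next query from a dense grid of candidates.
grid = np.linspace(0.0, 1.0, 1001)
mu = np.sin(2 * np.pi * grid)        # placeholder posterior mean
sigma = 0.5 * np.ones_like(grid)     # placeholder posterior std
x_next = grid[np.argmax(gp_ucb(mu, sigma, beta_t=2.0))]
```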

SLIDE 15

Scaling to Higher Dimensions

Two Key Challenges:

◮ Statistical Difficulty: nonparametric sample complexity is exponential in D.

◮ Computational Difficulty: optimising ϕ_t to within ζ accuracy requires O(ζ^{−D}) effort.

SLIDE 16

Scaling to Higher Dimensions

Existing Work:

◮ (Chen et al. 2012): f depends on a small number of variables. Find the variables and then run GP-UCB.

◮ (Wang et al. 2013): f varies along a lower dimensional subspace. GP-EI on a random subspace.

◮ (Djolonga et al. 2013): f varies along a lower dimensional subspace. Find the subspace and then run GP-UCB.

SLIDE 17

Scaling to Higher Dimensions

Existing Work (Chen et al. 2012, Wang et al. 2013, Djolonga et al. 2013):

◮ Assumes f varies only along a low dimensional subspace.
◮ Performs BO on a low dimensional subspace.
◮ Assumption too strong in realistic settings.

SLIDE 18

Additive Functions

Structural assumption:
f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + ⋯ + f^{(M)}(x^{(M)}),
where x^{(j)} ∈ X^{(j)} = [0, 1]^d, d ≪ D, and x^{(i)} ∩ x^{(j)} = ∅.

SLIDE 19

Additive Functions

E.g., for D = 10:
f(x^{{1,…,10}}) = f^{(1)}(x^{{1,3,9}}) + f^{(2)}(x^{{2,4,8}}) + f^{(3)}(x^{{5,6,10}}).

Call {X^{(j)}}_{j=1}^{M} = {(1, 3, 9), (2, 4, 8), (5, 6, 10)} the “decomposition”.


SLIDE 21

Additive Functions

Structural assumption:
f(x) = f^{(1)}(x^{(1)}) + ⋯ + f^{(M)}(x^{(M)}), with x^{(j)} ∈ X^{(j)} = [0, 1]^d, d ≪ D, x^{(i)} ∩ x^{(j)} = ∅.

Assume each f^{(j)} ∼ GP(0, κ^{(j)}). Then f ∼ GP(0, κ) where
κ(x, x′) = κ^{(1)}(x^{(1)}, x′^{(1)}) + ⋯ + κ^{(M)}(x^{(M)}, x′^{(M)}).

Given (X, Y) = {(x_i, y_i)}_{i=1}^{T} and a test point x_†,
f^{(j)}(x_†^{(j)}) | X, Y ∼ N(µ^{(j)}, σ^{(j)2}).
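An additive kernel of this form is easy to assemble from per-group kernels; a minimal NumPy sketch under assumed SE hyper-parameters A and h (shared across groups here purely for brevity):

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.5):
    """Squared-exponential kernel: A * exp(-||x - x'||^2 / (2 h^2))."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return A * np.exp(-sq / (2.0 * h**2))

def additive_kernel(X1, X2, groups, A=1.0, h=0.5):
    """kappa(x, x') = sum_j kappa^(j)(x^(j), x'^(j)) over disjoint coordinate groups."""
    return sum(se_kernel(X1[:, g], X2[:, g], A, h) for g in groups)

# The running example (0-indexed): D = 10, M = 3 groups of size d = 3.
groups = [[0, 2, 8], [1, 3, 7], [4, 5, 9]]
X = np.random.rand(5, 10)
K = additive_kernel(X, X, groups)   # (5, 5) additive kernel matrix
```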

SLIDE 22

Outline

1. GP-UCB
2. The Add-GP-UCB algorithm
   ◮ Bounds on S_T: exponential in D → linear in D.
   ◮ An easy-to-optimise acquisition function.
   ◮ Performs well even when f is not additive.
3. Experiments
4. Conclusion & some open questions


SLIDE 24

GP-UCB

x_t = argmax_{x∈X} µ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x)

Squared Exponential (SE) Kernel: κ(x, x′) = A exp( −‖x − x′‖² / (2h²) ).

Theorem (Srinivas et al. 2010): Let f ∼ GP(0, κ). Then w.h.p.,
S_T ∈ O( √( D^D (log T)^D / T ) ).



SLIDE 27

GP-UCB on additive κ

Suppose f ∼ GP(0, κ) where κ(x, x′) = κ^{(1)}(x^{(1)}, x′^{(1)}) + ⋯ + κ^{(M)}(x^{(M)}, x′^{(M)}), with each κ^{(j)} an SE kernel.

Can be shown: S_T ∈ O( √( D² d^d (log T)^d / T ) ).

But ϕ_t = µ_{t−1} + β_t^{1/2} σ_{t−1} is still D-dimensional!



SLIDE 30

Add-GP-UCB

ϕ_t(x) = Σ_{j=1}^{M} ϕ^{(j)}_t(x^{(j)}), where ϕ^{(j)}_t(x^{(j)}) = µ^{(j)}_{t−1}(x^{(j)}) + β_t^{1/2} σ^{(j)}_{t−1}(x^{(j)}).

Maximise each ϕ^{(j)}_t separately. Requires only O(poly(D) ζ^{−d}) effort (vs O(ζ^{−D}) for GP-UCB).

Theorem: Let f^{(j)} ∼ GP(0, κ^{(j)}) and f = Σ_j f^{(j)}. Then w.h.p.,
S_T ∈ O( √( D² d^d (log T)^d / T ) ).
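Because ϕ_t decouples across groups, each ϕ^{(j)}_t can be maximised over its own d-dimensional cube and the results concatenated; a minimal sketch (random search stands in for the DiRect optimiser used in the talk's experiments):

```python
import numpy as np

def maximise_add_acq(phi_js, groups, D, n_candidates=1000, seed=0):
    """Maximise each group acquisition phi^(j) separately and assemble x_t.

    phi_js : list of callables; phi_js[j] maps an (n, d_j) array of points
             in [0, 1]^{d_j} to n acquisition values.
    groups : list of index lists, the decomposition {X^(j)}.
    """
    rng = np.random.default_rng(seed)
    x_t = np.empty(D)
    for phi_j, g in zip(phi_js, groups):
        cand = rng.random((n_candidates, len(g)))    # candidates in [0, 1]^d
        x_t[g] = cand[np.argmax(phi_j(cand))]        # best candidate for group j
    return x_t
```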

SLIDE 31

Summary of Theoretical Results (for SE Kernel)

◮ GP-UCB with no assumption on f: S_T ∈ O( D^{D/2} (log T)^{D/2} T^{−1/2} ).
◮ GP-UCB on additive f: S_T ∈ O( D T^{−1/2} ); maximising ϕ_t takes O(ζ^{−D}) effort.
◮ Add-GP-UCB on additive f: S_T ∈ O( D T^{−1/2} ); maximising ϕ_t takes O(poly(D) ζ^{−d}) effort.

SLIDE 32

Add-GP-UCB: f(x^{{1,2}}) = f^{(1)}(x^{{1}}) + f^{(2)}(x^{{2}})

[Figure: the two 1-D components f^{(1)}(x^{{1}}) and f^{(2)}(x^{{2}})]




SLIDE 36

Add-GP-UCB: f(x^{{1,2}}) = f^{(1)}(x^{{1}}) + f^{(2)}(x^{{2}})

Maximise the two group acquisitions ϕ̃^{(1)}, ϕ̃^{(2)} separately:
x_t^{(1)} = 0.869, x_t^{(2)} = 0.141, giving x_t = (0.869, 0.141).

[Figure: ϕ̃^{(1)}(x^{{1}}) and ϕ̃^{(2)}(x^{{2}}) with their maximisers marked]




SLIDE 40

Additive modeling in non-additive settings

◮ Additive models are common in high dimensional regression. E.g.: backfitting, MARS, COSSO, RODEO, SpAM, etc.
  f(x^{{1,…,D}}) = f(x^{{1}}) + f(x^{{2}}) + ⋯ + f(x^{{D}}).

◮ Additive models are statistically simpler ⟹ worse bias, but much better variance in the low sample regime.

◮ In BO applications queries are expensive, so we usually cannot afford many queries.

◮ Observation: Add-GP-UCB does well even when f is not additive.
  ◮ Better bias/variance trade-off in high dimensional regression.
  ◮ Easy-to-maximise acquisition function.

SLIDE 41

Unknown Kernel/Decomposition in Practice

Learn the kernel hyper-parameters and the decomposition {X^{(j)}} by periodically maximising the GP marginal likelihood.
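One simple way to realise this is to draw candidate decompositions and keep the one with the highest marginal likelihood; a minimal sketch (the `marginal_likelihood` callable is a stand-in for a GP library fit, not an API from the paper's code):

```python
import numpy as np

def random_decomposition(D, d, rng):
    """Random partition of {0, ..., D-1} into groups of size <= d."""
    perm = rng.permutation(D)
    return [perm[i:i + d].tolist() for i in range(0, D, d)]

def choose_decomposition(X, y, d, n_tries, marginal_likelihood, seed=0):
    """Pick, among random candidates, the decomposition maximising the
    GP marginal likelihood of the data seen so far."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    cands = [random_decomposition(D, d, rng) for _ in range(n_tries)]
    return max(cands, key=lambda groups: marginal_likelihood(X, y, groups))
```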

SLIDE 42

Experiments

[Plot: simple regret vs. number of queries]

Add-∗: knows the true decomposition. Add-d/M: uses M groups of size ≤ d.

Use 1000 DiRect evaluations to maximise the acquisition function. DiRect: Dividing Rectangles (Jones et al. 1993).
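For reference, recent SciPy versions ship a DIRECT implementation, so maximising a hypothetical acquisition `phi` under a fixed evaluation budget might look like the sketch below; `direct` minimises, hence the negation:

```python
import numpy as np
from scipy.optimize import direct

# Hypothetical 2-D acquisition function; any callable of an ndarray works.
def phi(x):
    return -np.sum((x - 0.3) ** 2)

# Budget of ~1000 function evaluations, as in the experiments.
res = direct(lambda x: -phi(x), bounds=[(0.0, 1.0), (0.0, 1.0)], maxfun=1000)
x_next = res.x
```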

SLIDE 43

Experiments

[Plot: simple regret vs. number of queries]

Same setup, but using 4000 DiRect evaluations to maximise the acquisition function.

SLIDE 44

SDSS Luminous Red Galaxies

[Diagram: cosmological simulator producing an observation; e.g., Hubble constant, baryonic density]

◮ Task: find maximum likelihood cosmological parameters.
◮ 20 dimensions, but only 9 parameters are relevant.
◮ Each query takes 2-5 seconds.
◮ Use 500 DiRect evaluations to maximise the acquisition function.

SLIDE 45

SDSS Luminous Red Galaxies

[Plot: log likelihood vs. number of queries]

REMBO: (Wang et al. 2013).

SLIDE 46

Viola & Jones Face Detection

A cascade of 22 weak classifiers. An image is classified negative if its score falls below the threshold at any stage.

◮ Task: find optimal threshold values on a training set of 1000 images.
◮ 22 dimensions.
◮ Each query takes 30-40 seconds.
◮ Use 1000 DiRect evaluations to maximise the acquisition function.

SLIDE 47

Viola & Jones Face Detection

[Plot: classification accuracy vs. number of queries]




SLIDE 51

Summary

◮ Additive assumption improves regret: exponential in D → linear in D.
◮ Acquisition function is easy to maximise.
◮ Even when f is not additive, Add-GP-UCB does well in practice.
◮ Similar results hold for Matérn kernels and in the bandit setting.

Some open questions:

◮ How to choose (d, M)?
◮ Can we generalise to other acquisition functions?

Code available: github.com/kirthevasank/add-gp-bandits

Jeff’s Talk: Friday 2pm @ Van Gogh. Thank You.