

slide-1
SLIDE 1

Bandit Optimisation with Approximations

Kirthevasan Kandasamy Carnegie Mellon University École Polytechnique, Paris April 27, 2017 Slides:

www.cs.cmu.edu/∼kkandasa/misc/ecole-slides.pdf

slide-2
SLIDE 2

Slides are up on my website: www.cs.cmu.edu/∼kkandasa

slide-3
SLIDE 3

Bandit Optimisation

Neural Network: hyper-parameters → cross-validation accuracy

  • Train NN using given hyper-parameters
  • Compute accuracy on validation set

1/26
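Seen this way, the whole train-and-validate loop collapses into one expensive black-box map from hyper-parameters to a score. A minimal sketch of that view (the closed-form surface below is a hypothetical stand-in for an actual NN training run):

```python
import numpy as np

def cross_val_accuracy(hyperparams):
    """Black-box view: hyper-parameters in, validation accuracy out.
    A cheap synthetic surface standing in for an expensive NN run."""
    lr, reg = hyperparams
    # Peak at lr = 1e-2, reg = 0.1 (hypothetical optimum).
    return float(np.exp(-(np.log10(lr) + 2.0) ** 2 - (reg - 0.1) ** 2))

acc_good = cross_val_accuracy((1e-2, 0.1))   # at the toy optimum
acc_bad = cross_val_accuracy((1e-1, 0.5))    # away from it
```

The optimiser only ever sees (hyper-parameters → accuracy) pairs; it never looks inside the training loop.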

slide-4
SLIDE 4

Bandit Optimisation

Expensive Blackbox Function

1/26

slide-5
SLIDE 5

Bandit Optimisation

Expensive Blackbox Function

Other Examples:

  • ML estimation in Astrophysics
  • Optimal policy in Autonomous Driving
  • Synthetic gene design

1/26


slide-7
SLIDE 7

Bandit Optimisation

f : X → R is an expensive, black-box, noisy function.

x f(x)

2/26


slide-9
SLIDE 9

Bandit Optimisation

f : X → R is an expensive, black-box, noisy function. Let x⋆ = argmaxx f (x).

x f(x) x∗

f(x∗)

2/26

slide-10
SLIDE 10

Bandit Optimisation

f : X → R is an expensive, black-box, noisy function. Let x⋆ = argmaxx f (x).

x f(x) x∗

f(x∗)

Simple Regret after n evaluations: Sn = f (x⋆) − max_{t=1,...,n} f (xt).

2/26

slide-11
SLIDE 11

Bandit Optimisation

f : X → R is an expensive, black-box, noisy function. Let x⋆ = argmaxx f (x).

x f(x) x∗

f(x∗)

Cumulative Regret after n evaluations: Rn = Σ_{t=1}^{n} ( f (x⋆) − f (xt) ).

2/26
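Both notions of regret are simple functions of the sequence of observed values; a small sketch with hypothetical numbers:

```python
def simple_regret(f_star, f_values):
    # S_n = f(x*) - max_t f(x_t): gap of the best point found so far.
    return f_star - max(f_values)

def cumulative_regret(f_star, f_values):
    # R_n = sum_t (f(x*) - f(x_t)): total shortfall over all queries.
    return sum(f_star - v for v in f_values)

f_star = 1.0
vals = [0.2, 0.7, 0.9]                 # f(x_1), f(x_2), f(x_3), hypothetical
S3 = simple_regret(f_star, vals)       # 1.0 - 0.9 = 0.1
R3 = cumulative_regret(f_star, vals)   # 0.8 + 0.3 + 0.1 = 1.2
```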


slide-13
SLIDE 13

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R. Mean µ : X → R, Covariance kernel κ : X 2 → R.

3/26

slide-14
SLIDE 14

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R. Mean µ : X → R, Covariance kernel κ : X 2 → R. Functions with no observations

x f(x)

3/26

slide-15
SLIDE 15

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R. Mean µ : X → R, Covariance kernel κ : X 2 → R. Prior GP

x f(x)

3/26

slide-16
SLIDE 16

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R. Mean µ : X → R, Covariance kernel κ : X 2 → R. Observations

x f(x)

3/26

slide-17
SLIDE 17

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R. Mean µ : X → R, Covariance kernel κ : X 2 → R. Posterior GP given observations

x f(x)

3/26

slide-18
SLIDE 18

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R. Mean µ : X → R, Covariance kernel κ : X 2 → R. Posterior GP given observations

x f(x)

After t observations, f (x) ∼ N( µt(x), σt²(x) ).

3/26
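The posterior µt(x), σt²(x) has the standard closed form for GP regression with observation noise. A self-contained 1-D sketch with an SE kernel (all numbers hypothetical, not the authors' code):

```python
import numpy as np

def se_kernel(a, b, h=0.5):
    # Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 h^2)).
    a, b = np.asarray(a), np.asarray(b)
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h ** 2))

def gp_posterior(x_obs, y_obs, x_query, noise=1e-3, h=0.5):
    """Posterior N(mu_t(x), sigma_t^2(x)) after observing (x_obs, y_obs)."""
    K = se_kernel(x_obs, x_obs, h) + noise * np.eye(len(x_obs))
    k_star = se_kernel(x_query, x_obs, h)
    K_inv = np.linalg.inv(K)
    mu = k_star @ K_inv @ np.asarray(y_obs)
    # Prior variance 1 minus the variance explained by the observations.
    var = 1.0 - np.einsum('ij,jk,ik->i', k_star, K_inv, k_star)
    return mu, np.maximum(var, 0.0)

x_obs, y_obs = np.array([0.1, 0.5, 0.9]), np.array([0.2, 1.0, 0.3])
mu, var = gp_posterior(x_obs, y_obs, np.array([0.5]))
```

At an observed point with small noise, the posterior mean is pinned near the observation and the posterior variance is near zero.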

slide-19
SLIDE 19

Gaussian Process Bandit (Bayesian) Optimisation

Model f ∼ GP(0, κ). Gaussian Process Upper Confidence Bound (GP-UCB)

(Srinivas et al. 2010).

x f(x)

4/26


slide-21
SLIDE 21

Gaussian Process Bandit (Bayesian) Optimisation

Model f ∼ GP(0, κ). Gaussian Process Upper Confidence Bound (GP-UCB)

(Srinivas et al. 2010).

ϕt = µt−1 + βt^{1/2} σt−1

x f(x)

Construct upper conf. bound: ϕt(x) = µt−1(x) + βt^{1/2} σt−1(x).

4/26

slide-22
SLIDE 22

Gaussian Process Bandit (Bayesian) Optimisation

Model f ∼ GP(0, κ). Gaussian Process Upper Confidence Bound (GP-UCB)

(Srinivas et al. 2010).

ϕt = µt−1 + βt^{1/2} σt−1

xt x f(x)

Maximise the upper confidence bound.

4/26

slide-23
SLIDE 23

GP-UCB

xt = argmax_x µt−1(x) + βt^{1/2} σt−1(x)

◮ µt−1: Exploitation
◮ σt−1: Exploration
◮ βt controls the tradeoff; βt ≍ log t.

5/26
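On a finite candidate grid, the acquisition is just the posterior mean plus an exploration bonus; a sketch (the posterior arrays here are hypothetical, standing in for µt−1 and σt−1 computed elsewhere):

```python
import numpy as np

def gp_ucb_choice(mu, sigma, beta_t):
    """Pick argmax_x mu_{t-1}(x) + beta_t^{1/2} sigma_{t-1}(x) over a grid."""
    phi = mu + np.sqrt(beta_t) * sigma
    return int(np.argmax(phi)), phi

mu = np.array([0.9, 0.5, 0.2])      # exploitation term (hypothetical)
sigma = np.array([0.0, 0.1, 0.8])   # exploration term (hypothetical)
idx, phi = gp_ucb_choice(mu, sigma, beta_t=4.0)  # bonus = 2 * sigma
```

Here the third point wins despite its low mean: the exploration bonus dominates where uncertainty is high.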

slide-24
SLIDE 24

GP-UCB

xt = argmax_x µt−1(x) + βt^{1/2} σt−1(x)

◮ µt−1: Exploitation
◮ σt−1: Exploration
◮ βt controls the tradeoff; βt ≍ log t.

GP-UCB (Srinivas et al. 2010): w.h.p. Sn = f (x⋆) − max_{t=1,...,n} f (xt) ≲ √( Ψn(X) / n ).

Ψn(X) ← Maximum Information Gain.

5/26

slide-25
SLIDE 25

GP-UCB

xt = argmax_x µt−1(x) + βt^{1/2} σt−1(x)

◮ µt−1: Exploitation
◮ σt−1: Exploration
◮ βt controls the tradeoff; βt ≍ log t.

GP-UCB (Srinivas et al. 2010): w.h.p. Sn = f (x⋆) − max_{t=1,...,n} f (xt) ≲ √( Ψn(X) / n ).

Ψn(X) ← Maximum Information Gain. When X ⊂ R^d, SE kernel: Ψn(X) ≍ d^d log(n)^d · vol(X). Matérn kernel: Ψn(X) ≍ n^{1−1/d²} · vol(X).

5/26
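Putting the pieces together, a toy end-to-end GP-UCB loop on a 1-D grid (a sketch only, not the authors' implementation; the target function, kernel bandwidth and βt schedule are all illustrative choices):

```python
import numpy as np

def kern(a, b, h=0.2):
    # Squared-exponential kernel on 1-D inputs.
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h ** 2))

def gp_ucb(f, grid, n_rounds, noise=1e-4, seed=0):
    """Toy GP-UCB loop: posterior update + UCB maximisation, beta_t ~ log t."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for t in range(1, n_rounds + 1):
        if not X:                      # prior: mean 0, std-dev 1
            mu, sig = np.zeros(len(grid)), np.ones(len(grid))
        else:
            xo, yo = np.array(X), np.array(y)
            Ki = np.linalg.inv(kern(xo, xo) + noise * np.eye(len(xo)))
            ks = kern(grid, xo)
            mu = ks @ Ki @ yo
            sig = np.sqrt(np.maximum(
                1.0 - np.einsum('ij,jk,ik->i', ks, Ki, ks), 0.0))
        beta_t = 2.0 * np.log(t + 1.0)
        xt = grid[int(np.argmax(mu + np.sqrt(beta_t) * sig))]
        X.append(xt)
        y.append(f(xt) + 1e-3 * rng.standard_normal())  # noisy evaluation
    return max(y)

f = lambda x: -(x - 0.6) ** 2          # hypothetical target, maximiser at 0.6
best = gp_ucb(f, np.linspace(0.0, 1.0, 101), n_rounds=20)
```

Early rounds are dominated by the σ term (the loop spreads queries out); later rounds concentrate near the best observed region.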

slide-26
SLIDE 26

GP-UCB

(Srinivas et al. 2010)

x f(x)

6/26

slide-27
SLIDE 27

GP-UCB

(Srinivas et al. 2010)

t = 1 x f(x)

6/26

slide-28
SLIDE 28

GP-UCB

(Srinivas et al. 2010)

t = 2 x f(x)

6/26

slide-29
SLIDE 29

GP-UCB

(Srinivas et al. 2010)

t = 3 x f(x)

6/26

slide-30
SLIDE 30

GP-UCB

(Srinivas et al. 2010)

t = 4 x f(x)

6/26

slide-31
SLIDE 31

GP-UCB

(Srinivas et al. 2010)

t = 5 x f(x)

6/26

slide-32
SLIDE 32

GP-UCB

(Srinivas et al. 2010)

t = 6 x f(x)

6/26

slide-33
SLIDE 33

GP-UCB

(Srinivas et al. 2010)

t = 7 x f(x)

6/26

slide-34
SLIDE 34

GP-UCB

(Srinivas et al. 2010)

t = 11 x f(x)

6/26

slide-35
SLIDE 35

GP-UCB

(Srinivas et al. 2010)

t = 25 x f(x)

6/26

slide-36
SLIDE 36

What if we have cheap approximations to f ?

7/26

slide-37
SLIDE 37

What if we have cheap approximations to f ?

  • 1. Hyper-parameter tuning: Train & validate with a subset of the data, and/or early stopping before convergence. E.g. bandwidth (ℓ) selection in kernel density estimation.

7/26

slide-38
SLIDE 38

What if we have cheap approximations to f ?

  • 1. Hyper-parameter tuning: Train & validate with a subset of the data, and/or early stopping before convergence. E.g. bandwidth (ℓ) selection in kernel density estimation.
  • 2. Autonomous driving: simulation vs real-world experiment.
  • 3. Computational astrophysics: cosmological simulations and numerical computations with less granularity.

7/26
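The "subset of the data" idea can be phrased as a cheap fidelity of the same black-box; a toy sketch (the location-estimation objective and the dataset are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)   # hypothetical dataset

def objective(theta, n_data):
    """Fit quality of a location estimate theta, scored against the mean of
    the first n_data points. Smaller n_data = cheaper, rougher approximation."""
    subset = data[:n_data]
    return -(theta - subset.mean()) ** 2

full = objective(2.0, len(data))   # high fidelity: all the data
cheap = objective(2.0, 300)        # low fidelity: a subset of the data
```

The two fidelities agree up to the sampling error of the subset mean, which is exactly what makes the cheap one useful for guiding search.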

slide-39
SLIDE 39

Prior work in Multi-fidelity Methods

For specific applications,

◮ Industrial design (Forrester et al. 2007)
◮ Hyper-parameter tuning (Agarwal et al. 2011, Klein et al. 2015, Li et al. 2016)
◮ Active learning (Zhang & Chaudhuri 2015)
◮ Robotics (Cutler et al. 2014)

Multi-fidelity optimisation (Huang et al. 2006, Forrester et al. 2007, March & Wilcox 2012, Poloczek et al. 2016)

8/26

slide-40
SLIDE 40

Multi-fidelity GP Bandit Optimisation

  • 1. A finite number of approximations (Kandasamy et al. NIPS 2016b)
  • Formalism and challenges
  • Algorithm
  • Theoretical results & proof sketches
  • Experiments
  • 2. A continuous spectrum of approximations (Kandasamy et al. Arxiv 2017)
  • Formalism
  • Algorithm
  • Theoretical results
  • Experiments

9/26

slide-41
SLIDE 41

Multi-fidelity GP Bandit Optimisation

  • 1. A finite number of approximations (Kandasamy et al. NIPS 2016b)
  • Formalism and challenges
  • Algorithm
  • Theoretical results & proof sketches
  • Experiments
  • 2. A continuous spectrum of approximations (Kandasamy et al. Arxiv 2017)
  • Formalism
  • Algorithm
  • Theoretical results
  • Experiments

Extends beyond GPs.

9/26

slide-42
SLIDE 42

Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation)

(Kandasamy et al. NIPS 2016b)

x⋆ f (2) = f

◮ Optimise f = f (2); x⋆ = argmax_x f (2)(x).
◮ But ...

10/26

slide-43
SLIDE 43

Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation)

(Kandasamy et al. NIPS 2016b)

x⋆ f (1) f (2)

◮ Optimise f = f (2); x⋆ = argmax_x f (2)(x).
◮ But ... we have an approximation f (1) to f (2).
◮ f (1) costs λ(1), f (2) costs λ(2), with λ(1) < λ(2). “Cost” could be computation time, money, etc.

10/26

slide-44
SLIDE 44

Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation)

(Kandasamy et al. NIPS 2016b)

x⋆ f (1) f (2)

◮ Optimise f = f (2); x⋆ = argmax_x f (2)(x).
◮ But ... we have an approximation f (1) to f (2).
◮ f (1) costs λ(1), f (2) costs λ(2), with λ(1) < λ(2). “Cost” could be computation time, money, etc.
◮ f (1), f (2) ∼ GP(0, κ).
◮ ‖f (2) − f (1)‖∞ ≤ ζ(1), where ζ(1) is known.

10/26

slide-45
SLIDE 45

Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation)

(Kandasamy et al. NIPS 2016b)

x⋆ f (1) f (2)

At time t: Determine the point xt ∈ X and fidelity mt ∈ {1, 2} for querying.

11/26

slide-46
SLIDE 46

Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation)

(Kandasamy et al. NIPS 2016b)

x⋆ f (1) f (2)

At time t: Determine the point xt ∈ X and fidelity mt ∈ {1, 2} for querying. End Goal: Maximise f (2). Don’t care for maximum of f (1).

11/26

slide-47
SLIDE 47

Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation)

(Kandasamy et al. NIPS 2016b)

x⋆ f (1) f (2)

At time t: Determine the point xt ∈ X and fidelity mt ∈ {1, 2} for querying. End Goal: Maximise f (2). Don’t care for maximum of f (1). Simple Regret: S(Λ) = f (2)(x⋆) − max_{t : mt=2} f (2)(xt).

S(Λ) = +∞ if we haven’t queried f (2) yet.

11/26

slide-48
SLIDE 48

Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation)

(Kandasamy et al. NIPS 2016b)

x⋆ f (1) f (2)

At time t: Determine the point xt ∈ X and fidelity mt ∈ {1, 2} for querying. End Goal: Maximise f (2). Don’t care for maximum of f (1). Simple Regret: S(Λ) = f (2)(x⋆) − max_{t : mt=2} f (2)(xt).

S(Λ) = +∞ if we haven’t queried f (2) yet. → But use f (1) to guide the search for x⋆ at f (2).

11/26
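The multi-fidelity simple regret only counts top-fidelity queries; a direct sketch of the definition (the query log below is hypothetical):

```python
import math

def mf_simple_regret(f2_star, queries):
    """queries: list of (fidelity m_t, f^(2) value at x_t).
    Only fidelity-2 queries count; +inf before the first one."""
    top = [v for m, v in queries if m == 2]
    return f2_star - max(top) if top else math.inf

S = mf_simple_regret(1.0, [(1, 0.4), (1, 0.8), (2, 0.7)])   # hypothetical run
S_empty = mf_simple_regret(1.0, [(1, 0.9)])                  # no f^(2) queries yet
```

Note that the good fidelity-1 query (0.8) does not reduce the regret; only the fidelity-2 query at 0.7 does.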

slide-49
SLIDE 49

Challenges

x⋆ f (2) = f

11/26

slide-50
SLIDE 50

Challenges

x⋆

+ζ(1) −ζ(1)

f (2)

11/26

slide-51
SLIDE 51

Challenges

x⋆ f (1) f (2)

11/26

slide-52
SLIDE 52

Challenges

x⋆ f (1) f (2)

◮ f (1) is not just a noisy version of f (2).

11/26

slide-53
SLIDE 53

Challenges

x⋆ x(1) f (1) f (2)

◮ f (1) is not just a noisy version of f (2).
◮ Cannot just maximise f (1): x(1) is suboptimal for f (2).

11/26


slide-57
SLIDE 57

Challenges

x⋆ x(1) f (1) f (2)

◮ f (1) is not just a noisy version of f (2).
◮ Cannot just maximise f (1): x(1) is suboptimal for f (2).
◮ Need to explore f (2) sufficiently well around the high-valued regions of f (1) – but only within a not-too-large region.

11/26


slide-59
SLIDE 59

Challenges

x⋆ x(1) f (1) f (2)

◮ f (1) is not just a noisy version of f (2).
◮ Cannot just maximise f (1): x(1) is suboptimal for f (2).
◮ Need to explore f (2) sufficiently well around the high-valued regions of f (1) – but only within a not-too-large region.

Key Message: We will explore X using f (1) and use f (2) mostly in a promising region Xα.

11/26

slide-60
SLIDE 60

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

Multi-fidelity Gaussian Process Upper Confidence Bound

x⋆ f (1) f (2)

12/26

slide-61
SLIDE 61

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

Multi-fidelity Gaussian Process Upper Confidence Bound

x⋆ f (1) f (2) ◮ Construct Upper Confidence Bound ϕt for f (2).

Choose point xt = argmaxx∈X ϕt(x).

12/26

slide-62
SLIDE 62

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

Multi-fidelity Gaussian Process Upper Confidence Bound

x⋆ xt t = 14 f (1) f (2)

◮ Construct Upper Confidence Bound ϕt for f (2). Choose point xt = argmax_{x∈X} ϕt(x).

ϕ^(1)_t(x) = µ^(1)_{t−1}(x) + β_t^{1/2} σ^(1)_{t−1}(x) + ζ^(1)
ϕ^(2)_t(x) = µ^(2)_{t−1}(x) + β_t^{1/2} σ^(2)_{t−1}(x)
ϕ_t(x) = min{ ϕ^(1)_t(x), ϕ^(2)_t(x) }

12/26

slide-63
SLIDE 63

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

Multi-fidelity Gaussian Process Upper Confidence Bound

x⋆ xt t = 14 f (1) f (2), threshold γ^(1) → mt = 2

◮ Construct Upper Confidence Bound ϕt for f (2). Choose point xt = argmax_{x∈X} ϕt(x).

ϕ^(1)_t(x) = µ^(1)_{t−1}(x) + β_t^{1/2} σ^(1)_{t−1}(x) + ζ^(1)
ϕ^(2)_t(x) = µ^(2)_{t−1}(x) + β_t^{1/2} σ^(2)_{t−1}(x)
ϕ_t(x) = min{ ϕ^(1)_t(x), ϕ^(2)_t(x) }

◮ Choose fidelity mt = 1 if β_t^{1/2} σ^(1)_{t−1}(xt) > γ^(1), and mt = 2 otherwise.

12/26
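The two rules on this slide — take the min of the two upper bounds, then use a variance test to pick the fidelity — can be written down directly on a candidate grid (a sketch with hypothetical posterior arrays, not the authors' code):

```python
import numpy as np

def mf_gp_ucb_step(mu1, sig1, mu2, sig2, beta_t, zeta1, gamma1):
    """One MF-GP-UCB decision over a candidate grid.
    (mu_m, sig_m): posterior for f^(m); zeta1 bounds |f^(2) - f^(1)|."""
    phi1 = mu1 + np.sqrt(beta_t) * sig1 + zeta1   # bound on f^(2) via f^(1)
    phi2 = mu2 + np.sqrt(beta_t) * sig2           # direct bound on f^(2)
    i = int(np.argmax(np.minimum(phi1, phi2)))    # x_t
    # Stay at the cheap fidelity while f^(1) is still uncertain at x_t.
    m = 1 if np.sqrt(beta_t) * sig1[i] > gamma1 else 2
    return i, m

mu1, sig1 = np.array([0.5, 0.9]), np.array([0.01, 0.30])   # hypothetical
mu2, sig2 = np.array([0.4, 0.8]), np.array([0.50, 0.60])   # posteriors
i, m = mf_gp_ucb_step(mu1, sig1, mu2, sig2, 1.0, 0.1, 0.1)
```

In this toy instance the second point is chosen, and because σ^(1) there is still above γ^(1) the cheap fidelity is queried first.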

slide-64
SLIDE 64

Theoretical Results for MF-GP-UCB

GP-UCB

(Srinivas et al. 2010)

w.h.p. S(Λ) = f (2)(x⋆) − max_{t : mt=2} f (2)(xt) ≲ √( Ψ_{nΛ}(X) / nΛ )

nΛ = ⌊Λ/λ^(2)⌋. Ψ_{nΛ}(A) = Maximum Information Gain → scales with vol(A).

13/26


slide-66
SLIDE 66

Theoretical Results for MF-GP-UCB

GP-UCB

(Srinivas et al. 2010)

w.h.p. S(Λ) = f (2)(x⋆) − max_{t : mt=2} f (2)(xt) ≲ √( Ψ_{nΛ}(X) / nΛ )

nΛ = ⌊Λ/λ^(2)⌋. Ψ_{nΛ}(A) = Maximum Information Gain → scales with vol(A).

MF-GP-UCB (Kandasamy et al. NIPS 2016b)

w.h.p. ∀α > 0, S(Λ) ≲ √( Ψ_{nΛ}(Xα) / nΛ ) + √( Ψ_{nΛ}(Xα^c) / nΛ^{2−α} )

Xα = {x : f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1)}. Good approximation ⇒ vol(Xα) ≪ vol(X) ⇒ Ψ_{nΛ}(Xα) ≪ Ψ_{nΛ}(X).

13/26

slide-67
SLIDE 67

Proof Sketches

λ^(2) (expensive) > λ^(1) (cheap)

MF-GP-UCB (Kandasamy et al. NIPS 2016b)

w.h.p. S(Λ) ≲ √( Ψ_{nΛ}(Xα) / nΛ ) + √( Ψ_{nΛ}(Xα^c) / nΛ^{2−α} )

Xα = {x : f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1)}. Good approximation ⇒ vol(Xα) ≪ vol(X) ⇒ Ψ_{nΛ}(Xα) ≪ Ψ_{nΛ}(X).

14/26

slide-68
SLIDE 68

Proof Sketches

λ^(2) (expensive) > λ^(1) (cheap)

MF-GP-UCB (Kandasamy et al. NIPS 2016b)

w.h.p. S(Λ) ≲ √( Ψ_{nΛ}(Xα) / nΛ ) + √( Ψ_{nΛ}(Xα^c) / nΛ^{2−α} )

Xα = {x : f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1)}. Good approximation ⇒ vol(Xα) ≪ vol(X) ⇒ Ψ_{nΛ}(Xα) ≪ Ψ_{nΛ}(X).

Number of (random) queries after capital Λ ← N; nΛ = Λ/λ^(2) ≤ N ≤ Λ/λ^(1).

14/26

slide-69
SLIDE 69

Proof Sketches

λ^(2) (expensive) > λ^(1) (cheap)

MF-GP-UCB (Kandasamy et al. NIPS 2016b)

w.h.p. S(Λ) ≲ √( Ψ_{nΛ}(Xα) / nΛ ) + √( Ψ_{nΛ}(Xα^c) / nΛ^{2−α} )

Xα = {x : f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1)}. Good approximation ⇒ vol(Xα) ≪ vol(X) ⇒ Ψ_{nΛ}(Xα) ≪ Ψ_{nΛ}(X).

Number of (random) queries after capital Λ ← N; nΛ = Λ/λ^(2) ≤ N ≤ Λ/λ^(1). But we show N ∈ O(nΛ).

14/26

slide-70
SLIDE 70

Proof Sketches

λ^(2) (expensive) > λ^(1) (cheap)

MF-GP-UCB (Kandasamy et al. NIPS 2016b)

w.h.p. S(Λ) ≲ √( Ψ_{nΛ}(Xα) / nΛ ) + √( Ψ_{nΛ}(Xα^c) / nΛ^{2−α} )

Xα = {x : f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1)}. Good approximation ⇒ vol(Xα) ≪ vol(X) ⇒ Ψ_{nΛ}(Xα) ≪ Ψ_{nΛ}(X).

Number of (random) queries after capital Λ ← N; nΛ = Λ/λ^(2) ≤ N ≤ Λ/λ^(1). But we show N ∈ O(nΛ).

N = T^(1)_N(Xα) + T^(1)_N(Xα^c) + T^(2)_N(Xα) + T^(2)_N(Xα^c)

14/26

slide-71
SLIDE 71

Proof Sketches

λ^(2) (expensive) > λ^(1) (cheap)

MF-GP-UCB (Kandasamy et al. NIPS 2016b)

w.h.p. S(Λ) ≲ √( Ψ_{nΛ}(Xα) / nΛ ) + √( Ψ_{nΛ}(Xα^c) / nΛ^{2−α} )

Xα = {x : f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1)}. Good approximation ⇒ vol(Xα) ≪ vol(X) ⇒ Ψ_{nΛ}(Xα) ≪ Ψ_{nΛ}(X).

Number of (random) queries after capital Λ ← N; nΛ = Λ/λ^(2) ≤ N ≤ Λ/λ^(1). But we show N ∈ O(nΛ).

N = T^(1)_N(Xα) + T^(1)_N(Xα^c) + T^(2)_N(Xα) + T^(2)_N(Xα^c), where T^(1)_N(Xα) ≲ polylog(N) and T^(1)_N(Xα^c) is sublinear in N.

14/26

slide-72
SLIDE 72

T^(2)_N(Xα^c) ≤ N^α for all α > 0

λ^(2) (expensive) > λ^(1) (cheap)

For x ∈ Xα, f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1); so f (1) is small in Xα^c.

x⋆ xt t = 50 f (1) f (2)

15/26

slide-73
SLIDE 73

T^(2)_N(Xα^c) ≤ N^α for all α > 0

λ^(2) (expensive) > λ^(1) (cheap)

For x ∈ Xα, f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1); so f (1) is small in Xα^c.

x⋆ xt t = 50 f (1) f (2)

ϕ^(1)_t(x) = µ^(1)_{t−1}(x) + β_t^{1/2} σ^(1)_{t−1}(x) + ζ^(1),   ϕ^(2)_t(x) = µ^(2)_{t−1}(x) + β_t^{1/2} σ^(2)_{t−1}(x)
ϕ_t(x) = min{ ϕ^(1)_t(x), ϕ^(2)_t(x) },   xt = argmax_{x∈X} ϕ_t(x) → [1]
Choose fidelity mt = 1 if β_t^{1/2} σ^(1)_{t−1}(xt) > γ^(1), mt = 2 if β_t^{1/2} σ^(1)_{t−1}(xt) ≤ γ^(1) → [2]

15/26

slide-74
SLIDE 74

T^(2)_N(Xα^c) ≤ N^α for all α > 0

λ^(2) (expensive) > λ^(1) (cheap)

For x ∈ Xα, f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1); so f (1) is small in Xα^c.

x⋆ xt t = 50 f (1) f (2)

ϕ^(1)_t(x) = µ^(1)_{t−1}(x) + β_t^{1/2} σ^(1)_{t−1}(x) + ζ^(1),   ϕ^(2)_t(x) = µ^(2)_{t−1}(x) + β_t^{1/2} σ^(2)_{t−1}(x)
ϕ_t(x) = min{ ϕ^(1)_t(x), ϕ^(2)_t(x) },   xt = argmax_{x∈X} ϕ_t(x) → [1]
Choose fidelity mt = 1 if β_t^{1/2} σ^(1)_{t−1}(xt) > γ^(1), mt = 2 if β_t^{1/2} σ^(1)_{t−1}(xt) ≤ γ^(1) → [2]

Argument: If xt ∈ Xα^c in [1], then mt = 2 is unlikely in [2].

15/26

slide-75
SLIDE 75

T^(2)_N(Xα^c) ≤ N^α for all α > 0

λ^(2) (expensive) > λ^(1) (cheap)

For x ∈ Xα, f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1); so f (1) is small in Xα^c.

x⋆ xt t = 50 f (1) f (2)

ϕ^(1)_t(x) = µ^(1)_{t−1}(x) + β_t^{1/2} σ^(1)_{t−1}(x) + ζ^(1),   ϕ^(2)_t(x) = µ^(2)_{t−1}(x) + β_t^{1/2} σ^(2)_{t−1}(x)
ϕ_t(x) = min{ ϕ^(1)_t(x), ϕ^(2)_t(x) },   xt = argmax_{x∈X} ϕ_t(x) → [1]
Choose fidelity mt = 1 if β_t^{1/2} σ^(1)_{t−1}(xt) > γ^(1), mt = 2 if β_t^{1/2} σ^(1)_{t−1}(xt) ≤ γ^(1) → [2]

Argument: If xt ∈ Xα^c in [1], then mt = 2 is unlikely in [2].

mt = 2 ⇒ σ^(1)_{t−1}(xt) is small ⇒ several f (1) queries near xt

15/26

slide-76
SLIDE 76

T^(2)_N(Xα^c) ≤ N^α for all α > 0

λ^(2) (expensive) > λ^(1) (cheap)

For x ∈ Xα, f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1); so f (1) is small in Xα^c.

x⋆ xt t = 50 f (1) f (2)

ϕ^(1)_t(x) = µ^(1)_{t−1}(x) + β_t^{1/2} σ^(1)_{t−1}(x) + ζ^(1),   ϕ^(2)_t(x) = µ^(2)_{t−1}(x) + β_t^{1/2} σ^(2)_{t−1}(x)
ϕ_t(x) = min{ ϕ^(1)_t(x), ϕ^(2)_t(x) },   xt = argmax_{x∈X} ϕ_t(x) → [1]
Choose fidelity mt = 1 if β_t^{1/2} σ^(1)_{t−1}(xt) > γ^(1), mt = 2 if β_t^{1/2} σ^(1)_{t−1}(xt) ≤ γ^(1) → [2]

Argument: If xt ∈ Xα^c in [1], then mt = 2 is unlikely in [2].

mt = 2 ⇒ σ^(1)_{t−1}(xt) is small ⇒ several f (1) queries near xt ⇒ µ^(1)_{t−1}(xt) ≈ f (1)(xt) ⇒ ϕ^(1)_t(xt) is small ⇒ ...

15/26

slide-77
SLIDE 77

T^(2)_N(Xα^c) ≤ N^α for all α > 0

λ^(2) (expensive) > λ^(1) (cheap)

For x ∈ Xα, f (2)(x⋆) − f (1)(x) ≤ Cα ζ^(1); so f (1) is small in Xα^c.

x⋆ xt t = 50 f (1) f (2)

ϕ^(1)_t(x) = µ^(1)_{t−1}(x) + β_t^{1/2} σ^(1)_{t−1}(x) + ζ^(1),   ϕ^(2)_t(x) = µ^(2)_{t−1}(x) + β_t^{1/2} σ^(2)_{t−1}(x)
ϕ_t(x) = min{ ϕ^(1)_t(x), ϕ^(2)_t(x) },   xt = argmax_{x∈X} ϕ_t(x) → [1]
Choose fidelity mt = 1 if β_t^{1/2} σ^(1)_{t−1}(xt) > γ^(1), mt = 2 if β_t^{1/2} σ^(1)_{t−1}(xt) ≤ γ^(1) → [2]

Argument: If xt ∈ Xα^c in [1], then mt = 2 is unlikely in [2].

mt = 2 ⇒ σ^(1)_{t−1}(xt) is small ⇒ several f (1) queries near xt ⇒ µ^(1)_{t−1}(xt) ≈ f (1)(xt) ⇒ ϕ^(1)_t(xt) is small ⇒ xt won’t be the arg-max.

15/26

slide-78
SLIDE 78

MF-GP-UCB with multiple approximations

16/26

slide-79
SLIDE 79

MF-GP-UCB with multiple approximations

Things work out.

16/26

slide-80
SLIDE 80

Experiment: Viola & Jones Face Detection

22 Threshold values for each cascade. (d = 22) Fidelities with dataset sizes (300, 3000). (M = 2)

[figure: simple regret vs. spent capital]

17/26

slide-81
SLIDE 81

Experiment: Cosmological Maximum Likelihood Inference

◮ Type Ia Supernovae Data
◮ Maximum likelihood inference for 3 cosmological parameters:
  ◮ Hubble Constant H0
  ◮ Dark Energy Fraction ΩΛ
  ◮ Dark Matter Fraction ΩM
◮ Likelihood: Robertson–Walker metric (Robertson 1936); requires numerical integration for each point in the dataset.

18/26

slide-82
SLIDE 82

Experiment: Cosmological Maximum Likelihood Inference

3 cosmological parameters. (d = 3) Fidelities: integration on grids of size (102, 104, 106). (M = 3)

[figure: maximum log-likelihood found vs. spent capital]

19/26

slide-83
SLIDE 83

MF-GP-UCB Synthetic Experiment: Hartmann-3D

d = 3, M = 3

[figure: query frequencies for Hartmann-3D — number of queries at each fidelity m = 1, 2, 3 vs. f (3)(x) value]

19/26

slide-84
SLIDE 84

Multi-fidelity Optimisation with Continuous Approximations

20/26

slide-85
SLIDE 85

Multi-fidelity Optimisation with Continuous Approximations

  • Use an arbitrary amount of data?
  • Iterative algorithms: use arbitrary number of iterations?

20/26

slide-86
SLIDE 86

Multi-fidelity Optimisation with Continuous Approximations

  • Use an arbitrary amount of data?
  • Iterative algorithms: use arbitrary number of iterations?

E.g. Train an ML model with N• data and T• iterations.

20/26

slide-87
SLIDE 87

Multi-fidelity Optimisation with Continuous Approximations

  • Use an arbitrary amount of data?
  • Iterative algorithms: use arbitrary number of iterations?

E.g. Train an ML model with N• data and T• iterations. But use N < N• data and T < T• iterations to approximate cross validation performance. Approximations from a continuous 2D “fidelity space” (N, T).

20/26
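The cost of a query typically varies smoothly over such a 2-D fidelity space; a toy cost model (the N²T scaling and constants are hypothetical, just to make the trade-off concrete):

```python
def training_cost(N, T, c=1e-9):
    """Hypothetical cost model: lambda(N, T) proportional to N^2 * T."""
    return c * (N ** 2) * T

full = training_cost(10_000, 100)   # the target fidelity (N_full, T_full)
low = training_cost(1_000, 20)      # a cheap point in the fidelity space
ratio = full / low
```

Under this model the cheap fidelity is 500x cheaper, which is what makes spending most queries there attractive.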

slide-88
SLIDE 88

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

X Z

A fidelity space Z ⊂ Rp and domain X ⊂ Rd.

21/26

slide-89
SLIDE 89

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

X

g(z, x)

Z

A fidelity space Z ⊂ Rp and domain X ⊂ Rd. g : Z × X → R.

21/26

slide-90
SLIDE 90

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

X

g(z, x) f(x) z•

Z

A fidelity space Z ⊂ Rp and domain X ⊂ Rd. g : Z × X → R. We wish to optimise f (x) = g(z•, x) where z• ∈ Z.

21/26

slide-91
SLIDE 91

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

X

g(z, x) f(x) z•

Z

A fidelity space Z ⊂ Rp and domain X ⊂ Rd. g : Z × X → R. We wish to optimise f (x) = g(z•, x) where z• ∈ Z.

previous e.g.: Z = all (N, T) values, z• = [N•, T•].

21/26

slide-92
SLIDE 92

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

X

g(z, x) f(x) z•

Z

A fidelity space Z ⊂ Rp and domain X ⊂ Rd. g : Z × X → R. We wish to optimise f (x) = g(z•, x) where z• ∈ Z.

previous e.g.: Z = all (N, T) values, z• = [N•, T•].

A cost function, λ : Z → R+.

e.g.: λ(z) = λ(N, T) = O(N2T)

21/26

slide-93
SLIDE 93

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

x⋆

X

g(z, x) f(x) z•

Z

A fidelity space Z ⊂ Rp and domain X ⊂ Rd. g : Z × X → R. We wish to optimise f (x) = g(z•, x) where z• ∈ Z.

previous e.g.: Z = all (N, T) values, z• = [N•, T•].

A cost function, λ : Z → R+.

e.g.: λ(z) = λ(N, T) = O(N2T)

x⋆ = argmaxx f (x).

21/26

slide-94
SLIDE 94

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

x⋆

X

g(z, x) f(x) z•

Z

A fidelity space Z ⊂ Rp and domain X ⊂ Rd. g : Z × X → R. We wish to optimise f (x) = g(z•, x) where z• ∈ Z.

previous e.g.: Z = all (N, T) values, z• = [N•, T•].

A cost function, λ : Z → R+.

e.g.: λ(z) = λ(N, T) = O(N2T)

x⋆ = argmax_x f (x). Simple Regret: S(Λ) = f (x⋆) − max_{t : zt=z•} f (xt).

21/26

slide-95
SLIDE 95

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

g ∼ GP(0, κ),

X

g(z, x) f(x) z•

Z

22/26

slide-96
SLIDE 96

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

g ∼ GP(0, κ), κ : (Z × X)2 → R.

X

g(z, x) f(x) z•

Z

22/26

slide-97
SLIDE 97

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

g ∼ GP(0, κ), κ : (Z × X)2 → R. κ([z, x], [z′, x′]) = κX (x, x′) · κZ(z, z′)

X

g(z, x) f(x) z•

Z

22/26
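The product form κ([z, x], [z′, x′]) = κX(x, x′) · κZ(z, z′) is easy to evaluate directly; a 1-D sketch with SE factors (bandwidths are illustrative):

```python
import math

def se(a, b, h):
    # 1-D squared-exponential factor with bandwidth h.
    return math.exp(-((a - b) ** 2) / (2.0 * h ** 2))

def kappa(z, x, z2, x2, h_z=0.5, h_x=0.2):
    # kappa([z, x], [z', x']) = kappa_X(x, x') * kappa_Z(z, z')
    return se(x, x2, h_x) * se(z, z2, h_z)

same = kappa(1.0, 0.3, 1.0, 0.3)     # identical (z, x) pairs
far_z = kappa(0.0, 0.3, 1.0, 0.3)    # same x, distant fidelities
```

A large κZ bandwidth h_z means observations at cheap fidelities remain informative about g(z•, ·); a small one decouples the fidelities.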

slide-98
SLIDE 98

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

g ∼ GP(0, κ), κ : (Z × X)2 → R. κ([z, x], [z′, x′]) = κX (x, x′) · κZ(z, z′)

X

g(z, x) f(x) z•

Z SE kernel:

h = 0.05 h = 0.5

22/26

slide-99
SLIDE 99

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

g ∼ GP(0, κ), κ : (Z × X)2 → R. κ([z, x], [z′, x′]) = κX (x, x′) · κZ(z, z′)

X

g(z, x) f(x) z•

Z

Information Gap ξ : Z → R measures the price (in information) for querying at z ≠ z•.

SE kernel: [figure: ξ for h = 0.05 and h = 0.5]

22/26

slide-100
SLIDE 100

Multi-fidelity Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

g ∼ GP(0, κ), κ : (Z × X)2 → R. κ([z, x], [z′, x′]) = κX (x, x′) · κZ(z, z′)

X

g(z, x) f(x) z•

Z

Information Gap ξ : Z → R measures the price (in information) for querying at z ≠ z•.

SE kernel: ξ(z) grows with ‖z − z•‖/h. [figure: ξ for h = 0.05 and h = 0.5]

22/26

slide-101
SLIDE 101

BOCA: Bayesian Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

At time t we have t − 1 previous evaluations {(zi, xi, yi)}_{i=1}^{t−1}.

X Z

z•

22/26

slide-102
SLIDE 102

BOCA: Bayesian Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

At time t we have t − 1 previous evaluations {(zi, xi, yi)}_{i=1}^{t−1}.

Construct posterior GP for g: mean µt−1 : Z × X → R, std-dev σt−1 : Z × X → R+.

X Z

z•

22/26

slide-103
SLIDE 103

BOCA: Bayesian Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

At time t we have t − 1 previous evaluations {(zi, xi, yi)}_{i=1}^{t−1}.

Construct posterior GP for g: mean µt−1 : Z × X → R, std-dev σt−1 : Z × X → R+.

X Z

z•

xt ← maximise upper confidence bound for f: xt = argmax_{x∈X} µt−1(z•, x) + β_t^{1/2} σt−1(z•, x)

22/26

slide-104
SLIDE 104

BOCA: Bayesian Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

At time t we have t − 1 previous evaluations {(zi, xi, yi)}_{i=1}^{t−1}.

Construct posterior GP for g: mean µt−1 : Z × X → R, std-dev σt−1 : Z × X → R+.

X Z

z•

xt ← maximise upper confidence bound for f: xt = argmax_{x∈X} µt−1(z•, x) + β_t^{1/2} σt−1(z•, x)

Zt ≈ {z•} ∪ { z : σt−1(z, xt) ≥ γ(z) }. zt = argmin_{z∈Zt} λ(z) (cheapest z in Zt).

22/26

slide-105
SLIDE 105

BOCA: Bayesian Optimisation with Continuous Approximations

(Kandasamy et al. Arxiv 2017)

At time t we have t − 1 previous evaluations {(zi, xi, yi)}_{i=1}^{t−1}.

Construct posterior GP for g: mean µt−1 : Z × X → R, std-dev σt−1 : Z × X → R+.

X Z

z•

xt ← maximise upper confidence bound for f: xt = argmax_{x∈X} µt−1(z•, x) + β_t^{1/2} σt−1(z•, x)

Zt ≈ {z•} ∪ { z : σt−1(z, xt) ≥ γ(z) }, where γ(z) = ( λ(z)/λ(z•) )^q ξ(z). zt = argmin_{z∈Zt} λ(z) (cheapest z in Zt).

22/26
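BOCA's fidelity rule — keep every z whose posterior uncertainty at xt still exceeds its threshold γ(z), then query the cheapest — can be sketched directly (candidate fidelities, cost function and the γ surrogate below are all hypothetical):

```python
def boca_fidelity(z_candidates, sigma_at_xt, cost, z_star, gamma):
    """Z_t: z_star plus every z whose posterior std-dev at x_t is at
    least gamma(z); query the cheapest member of Z_t."""
    Zt = [z for z, s in zip(z_candidates, sigma_at_xt) if s >= gamma(z)]
    Zt.append(z_star)                 # z_star is always a candidate
    return min(Zt, key=cost)

zt = boca_fidelity(
    z_candidates=[0.2, 0.5, 0.8],
    sigma_at_xt=[0.30, 0.12, 0.02],       # sigma_{t-1}(z, x_t), hypothetical
    cost=lambda z: z,                      # lambda(z): cheaper at small z
    z_star=1.0,
    gamma=lambda z: 0.05 + 0.1 * z,        # stand-in for (lambda ratio)^q * xi(z)
)
```

Here the two cheap fidelities are still uncertain at xt, so the cheapest of them is queried rather than z•.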

slide-106
SLIDE 106

Theoretical Results for BOCA

GP-UCB

(Srinivas et al. 2010)

w.h.p. S(Λ) ≲ √( Ψ_{nΛ}(X) / nΛ ), where nΛ = ⌊Λ/λ(z•)⌋. Ψ_{nΛ}(A) = Maximum Information Gain → scales with vol(A).

23/26

slide-107
SLIDE 107

Theoretical Results for BOCA

GP-UCB

(Srinivas et al. 2010)

w.h.p. S(Λ) ≲ √( Ψ_{nΛ}(X) / nΛ ), where nΛ = ⌊Λ/λ(z•)⌋. Ψ_{nΛ}(A) = Maximum Information Gain → scales with vol(A).

BOCA (Kandasamy et al. Arxiv 2017)

w.h.p. ∀α > 0, S(Λ) ≲ √( Ψ_{nΛ}(Xα) / nΛ ) + √( Ψ_{nΛ}(X) / nΛ^{2−α} )

Xα = { x : f (x⋆) − f (x) ≤ Cα/h }

23/26
slide-108
SLIDE 108

Theoretical Results for BOCA

GP-UCB

(Srinivas et al. 2010)

w.h.p. S(Λ) ≲ √( Ψ_{nΛ}(X) / nΛ ), where nΛ = ⌊Λ/λ(z•)⌋. Ψ_{nΛ}(A) = Maximum Information Gain → scales with vol(A).

BOCA (Kandasamy et al. Arxiv 2017)

w.h.p. ∀α > 0, S(Λ) ≲ √( Ψ_{nΛ}(Xα) / nΛ ) + √( Ψ_{nΛ}(X) / nΛ^{2−α} )

Xα = { x : f (x⋆) − f (x) ≤ Cα/h }

If h is large, vol(Xα) ≪ vol(X), Ψ_{nΛ}(Xα) ≪ Ψ_{nΛ}(X).

23/26

slide-109
SLIDE 109

Experiment: SVM with 20 News Groups

Tune two hyper-parameters for the SVM. (d = 2) Dataset has N• = 15K data and use T• = 100 iterations. But can choose N ∈ [5K, 15K] or T ∈ [20, 100]. (p = 2)

24/26

slide-110
SLIDE 110

Experiment: SVM with 20 News Groups

Tune two hyper-parameters for the SVM. (d = 2) Dataset has N• = 15K data and use T• = 100 iterations. But can choose N ∈ [5K, 15K] or T ∈ [20, 100]. (p = 2)

[figure: cross-validation accuracy vs. spent capital]

24/26

slide-111
SLIDE 111

Summary

Multi-fidelity K-armed bandits

(Kandasamy et al. NIPS 2016a)

◮ An algorithm MF-UCB and an upper bound on the regret. ◮ An almost matching lower bound.

25/26

slide-112
SLIDE 112

Summary

Multi-fidelity K-armed bandits

(Kandasamy et al. NIPS 2016a)

◮ An algorithm MF-UCB and an upper bound on the regret. ◮ An almost matching lower bound.

Key takeaways

(Kandasamy et al. NIPS 2016a, Kandasamy et al. NIPS 2016b, Kandasamy et al. Arxiv 2017)

◮ Upper confidence bound strategy ◮ Choose higher fidelity only after controlling

uncertainty/variance at lower fidelities.

◮ Explore the entire space using cheap low fidelities and reserve

expensive higher fidelities for promising candidates.

25/26

slide-113
SLIDE 113

Jeff Schneider Barnabas Poczos Junier Oliva Gautam Dasarathy

Thank you.

Code for MF-GP-UCB: https://github.com/kirthevasank/mf-gp-ucb

26/26

slide-114
SLIDE 114

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

x⋆ f (1) f (2)


slide-116
SLIDE 116

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

x⋆ f (1) f (2)

µ^(1)_{t−1} + β_t^{1/2} σ^(1)_{t−1}

t = 6

slide-117
SLIDE 117

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

x⋆ f (1) f (2)

ϕ^(1)_t = µ^(1)_{t−1} + β_t^{1/2} σ^(1)_{t−1} + ζ^(1)

t = 6

slide-118
SLIDE 118

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

x⋆ f (1) f (2) ϕ^(1)_t ϕ^(2)_t

t = 6

slide-119
SLIDE 119

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

x⋆ f (1) f (2) ϕ^(1)_t ϕ^(2)_t ϕt

t = 6

slide-120
SLIDE 120

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

x⋆ xt t = 6 ϕ^(1)_t ϕ^(2)_t ϕt f (1) f (2)

slide-121
SLIDE 121

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

x⋆ xt t = 6 ϕ^(1)_t ϕ^(2)_t ϕt f (1) f (2)

β_t^{1/2} σ^(1)_{t−1}(x) vs γ^(1): mt = 1

slide-122
SLIDE 122

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

x⋆ xt t = 10 f (1) f (2)

γ(1)

mt = 2

slide-123
SLIDE 123

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

x⋆ xt t = 11 f (1) f (2)

γ(1)

mt = 2

slide-124
SLIDE 124

MF-GP-UCB

(Kandasamy et al. NIPS 2016b)

x⋆ xt t = 14 f (1) f (2)

γ(1)

mt = 2