SLIDE 1

Parallelised Bayesian Optimisation via Thompson Sampling

Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, Barnabás Póczos

AISTATS 2018

SLIDE 2

Black-box Optimisation

Expensive Black-box Function

Examples:

  • Hyper-parameter Tuning
  • ML estimation in Astrophysics
  • Optimal policy in Autonomous Driving

SLIDES 3–6

Black-box Optimisation

f : X → R is an expensive, black-box, noisy function. Let x⋆ = argmax_x f(x).

[Figure: f plotted against x, with the maximiser x⋆ and the value f(x⋆) marked.]

Simple regret after n evaluations: SR(n) = f(x⋆) − max_{t=1,…,n} f(xt).
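The simple-regret bookkeeping above is easy to make concrete. A minimal Python sketch; the quadratic objective and the query points are illustrative, not from the talk:

```python
# Simple regret after n evaluations: SR(n) = f(x_star) - max_t f(x_t).
# Illustrative objective; in practice f is an expensive black box.
def f(x):
    return -(x - 0.3) ** 2  # maximised at x_star = 0.3


def simple_regret(f_opt, evaluated_xs):
    """f_opt = f(x_star); evaluated_xs = points queried so far."""
    return f_opt - max(f(x) for x in evaluated_xs)


# Regret is non-increasing as more points are evaluated.
r1 = simple_regret(f(0.3), [0.0])
r2 = simple_regret(f(0.3), [0.0, 0.25])
```

Note that SR(n) uses the best point found so far, so adding evaluations can only shrink it.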

SLIDES 7–12

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R.

[Figures: functions with no observations; samples from the prior GP; observations; the posterior GP given those observations.]

After t observations, f(x) ∼ N( µt(x), σt²(x) ).
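The posterior update on these slides can be sketched in a few lines of numpy. The squared-exponential kernel, unit prior variance, lengthscale, and toy data below are assumptions for illustration, not the talk's settings:

```python
import numpy as np

# Squared-exponential kernel with unit prior variance; the lengthscale ls
# is illustrative.
def se_kernel(a, b, ls=0.5):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)


def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean mu_t(x) and variance sigma_t^2(x) after t observations."""
    K = se_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = se_kernel(x_test, x_train)
    Kinv = np.linalg.inv(K)  # fine for a sketch; use a Cholesky solve in practice
    mu = Ks @ Kinv @ y_train
    var = 1.0 - np.einsum('ij,jk,ik->i', Ks, Kinv, Ks)  # prior kappa(x, x) = 1
    return mu, var


x_tr = np.array([0.2, 0.8])
y_tr = np.array([0.5, -0.3])
mu, var = gp_posterior(x_tr, y_tr, x_tr)  # posterior at the observed points
# mu is close to y_tr and var is nearly 0: the posterior pins down f there.
```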

SLIDES 13–17

Gaussian Process Bandit (Bayesian) Optimisation

Model f ∼ GP(0, κ). Several criteria exist for picking the next point: GP-UCB (Srinivas et al. 2010), GP-EI (Mockus & Mockus, 1991).

UCB acquisition: ϕt = µt−1 + √βt σt−1.

1) Compute posterior GP. 2) Construct acquisition ϕt. 3) Choose xt = argmax_x ϕt(x). 4) Evaluate f at xt.
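Steps 1)–4) can be sketched on a discretised domain. The posterior mean and standard deviation below are hand-picked stand-ins; a real implementation would compute them from the posterior GP:

```python
import numpy as np

# One step of GP-UCB on a discretised domain: given the posterior mean mu
# and std sigma on a grid, maximise phi_t = mu + sqrt(beta_t) * sigma.
def ucb_next_point(grid, mu, sigma, beta_t=4.0):
    phi = mu + np.sqrt(beta_t) * sigma
    return grid[int(np.argmax(phi))]


grid = np.linspace(0.0, 1.0, 5)
mu = np.array([0.0, 0.5, 0.4, 0.1, 0.0])
sigma = np.array([0.1, 0.0, 0.1, 0.1, 1.0])  # high uncertainty at x = 1.0
x_next = ucb_next_point(grid, mu, sigma)     # exploration picks the uncertain point
```

With beta_t = 0 the rule reduces to pure exploitation (maximise the posterior mean); larger beta_t trades that off against exploring uncertain regions.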

SLIDES 18–21

This work: Parallel Evaluations

Sequential evaluations with one worker: the jth job has feedback from all previous j − 1 evaluations.

Parallel evaluations with M workers (Asynchronous): the jth job is missing feedback from exactly M − 1 evaluations.

Parallel evaluations with M workers (Synchronous): the jth job is missing feedback from ≤ M − 1 evaluations.

SLIDES 22–25

Challenges in parallel BO: encouraging diversity

Direct application of UCB in the synchronous setting, with acquisition ϕt = µt−1 + √βt σt−1 . . .

  • First worker: maximise the acquisition, xt1 = argmax ϕt(x).
  • Second worker: the acquisition is the same, so xt2 = xt1.
  • Hence xt1 = xt2 = · · · = xtM.

Direct application of popular (deterministic) strategies, e.g. GP-UCB, GP-EI, etc., does not work. We need to “encourage diversity”.

SLIDES 26–28

Challenges in parallel BO: encouraging diversity

◮ Add hallucinated observations. (Ginsbourger et al. 2011, Janusevskis et al. 2012)

◮ Optimise an acquisition over X^M (e.g. an M-product UCB). (Wang et al. 2016, Wu & Frazier 2017)

◮ Resort to heuristics, which typically require additional hyper-parameters and/or computational routines. (Contal et al. 2013, Gonzalez et al. 2015, Shah & Ghahramani 2015, Wang et al. 2017, Wang et al. 2018)

Our Approach: Based on Thompson sampling (Thompson, 1933).

◮ Conceptually simple: does not require explicit diversity strategies.
◮ Handles asynchronicity.
◮ Comes with theoretical guarantees.

SLIDES 29–34

GP Optimisation with Thompson Sampling (Thompson, 1933)

1) Construct posterior GP. 2) Draw a sample g from the posterior. 3) Choose xt = argmax_x g(x). 4) Evaluate f at xt.

Take-home message: In parallel settings, direct application of the sequential TS algorithm works. The inherent randomness adds sufficient diversity when managing M workers.
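One TS step, sketched on a discretised domain where the posterior GP is just a multivariate normal over the grid values. The means and covariances below are illustrative; the second batch of draws shows the inherent randomness that supplies diversity:

```python
import numpy as np

rng = np.random.default_rng(0)

# One Thompson-sampling step: draw a function g from the posterior
# (a multivariate normal over grid values) and propose x_t = argmax_x g(x).
def ts_next_point(grid, mu, cov, rng):
    g = rng.multivariate_normal(mu, cov)  # sample g ~ posterior GP
    return grid[int(np.argmax(g))]


grid = np.linspace(0.0, 1.0, 4)
mu = np.array([0.0, 1.0, 0.2, 0.0])

# Tight posterior: every draw proposes essentially the same point.
picks_certain = {ts_next_point(grid, mu, 0.01 * np.eye(4), rng) for _ in range(20)}

# Wide posterior: randomness alone spreads proposals across the domain,
# which is exactly the diversity that parallel TS relies on.
picks_diverse = {ts_next_point(grid, mu, 0.5 * np.eye(4), rng) for _ in range(50)}
```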

SLIDES 35–37

Parallelised Thompson Sampling

Asynchronous: asyTS. At any given time,
  1. (x′, y′) ← Wait for any one worker to finish.
  2. Compute posterior GP.
  3. Draw a sample g ∼ GP.
  4. Re-deploy the worker at argmax g.

Synchronous: synTS. At any given time,
  1. {(x′m, y′m)}, m = 1, …, M ← Wait for all workers to finish.
  2. Compute posterior GP.
  3. Draw M samples gm ∼ GP, ∀m.
  4. Re-deploy worker m at argmax gm, ∀m.

Parallel TS in prior work: (Osband et al. 2016, Israelsen et al. 2016, Hernandez-Lobato et al. 2017)
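The asyTS loop above can be sketched with a priority queue of worker finish times. The GP machinery (steps 2–3) is stubbed by a random proposal here, so only the scheduling logic is real; the objective and evaluation-time model are illustrative:

```python
import heapq
import random

random.seed(0)


def draw_sample_max(data):
    # Stub for "compute posterior GP, draw g ~ GP, return argmax g":
    # a random proposal keeps the scheduling logic runnable.
    return random.random()


def asy_ts(n_workers, n_evals, eval_time, f):
    data = []                                   # observations (x, y) so far
    clock = [(eval_time(), random.random()) for _ in range(n_workers)]
    heapq.heapify(clock)                        # (finish_time, x) of in-flight jobs
    while len(data) < n_evals:
        t, x = heapq.heappop(clock)             # 1) wait for any worker to finish
        data.append((x, f(x)))                  #    receive (x', y')
        x_next = draw_sample_max(data)          # 2-3) posterior + TS draw (stubbed)
        heapq.heappush(clock, (t + eval_time(), x_next))  # 4) re-deploy that worker
    return data


data = asy_ts(n_workers=4, n_evals=10,
              eval_time=lambda: random.expovariate(1.0),
              f=lambda x: -(x - 0.5) ** 2)
```

synTS differs only in waiting for all M finish times before drawing M fresh samples and re-deploying every worker at once.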

SLIDES 38–39

Simple Regret in Parallel Settings

Simple regret after n evaluations: SR(n) = f(x⋆) − max_{t=1,…,n} f(xt), where n ← # completed evaluations by all workers.

Simple regret with time as a resource: SR′(T) = f(x⋆) − max_{t=1,…,N} f(xt), where N ← # completed evaluations by all workers in time T (possibly random).

SLIDES 40–43

Theoretical Results for SR(n)

Several results exist for sequential Thompson sampling (Agrawal et al. 2012, Kaufmann et al. 2012, Russo & van Roy 2016).

seqTS (Russo & van Roy 2014):   E[SR(n)] ≲ √( Ψn log(n) / n )

Ψn ← maximum information gain (Srinivas et al. 2010). For a GP with SE kernel in d dimensions, Ψn(X) ≍ d^d (log n)^d.

Theorem: synTS (Kandasamy et al. 2018):   E[SR(n)] ≲ √( M log(M) / n ) + √( Ψn log(n + M) / n )

Theorem: asyTS (Kandasamy et al. 2018):   E[SR(n)] ≲ M polylog(M) / n + √( C Ψn log(n) / n )

SLIDE 44

Experiment: Park1-4D, M = 10

Comparison in terms of number of evaluations.

[Figure: simple regret vs number of evaluations for seqTS, synTS, and asyTS.]

SLIDES 45–47

Theoretical Results for SR′(T)

Model evaluation time as an independent random variable:

◮ Uniform: unif(a, b) (bounded)
◮ Half-normal: HN(τ²) (sub-Gaussian)
◮ Exponential: exp(λ) (sub-exponential)

Theorem: TS with M parallel workers (Kandasamy et al. 2018). If evaluation times are all the same, synTS ≈ asyTS. When there is high variability in evaluation times, asyTS is much better than synTS:

  • Uniform: constant factor
  • Half-normal: √log(M) factor
  • Exponential: log(M) factor
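The intuition behind this theorem can be checked numerically: within a fixed time budget, a synchronous scheme advances at the pace of the slowest worker in each batch, while an asynchronous scheme never waits. A small simulation with exponential evaluation times; the budget and worker count are illustrative:

```python
import heapq
import random

random.seed(1)


def sync_completed(M, T, draw):
    """Evaluations completed by time T when all M workers move in lockstep."""
    t, done = 0.0, 0
    while True:
        batch = max(draw() for _ in range(M))  # a batch takes its slowest worker
        if t + batch > T:
            return done
        t += batch
        done += M


def async_completed(M, T, draw):
    """Evaluations completed by time T when each worker re-deploys immediately."""
    finish = [draw() for _ in range(M)]
    heapq.heapify(finish)
    done = 0
    while finish[0] <= T:                      # earliest finisher is re-deployed
        t = heapq.heappop(finish)
        done += 1
        heapq.heappush(finish, t + draw())
    return done


draw = lambda: random.expovariate(1.0)         # mean evaluation time 1
n_sync = sync_completed(25, 100.0, draw)
n_async = async_completed(25, 100.0, draw)     # asynchronous completes far more
```

For exponential times, the expected batch duration grows like log(M), matching the log(M) factor in the theorem.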

SLIDE 48

Experiment: Hartmann-18D, M = 25

Evaluation time sampled from an exponential distribution.

[Figure: simple regret vs time for synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, and asyTS.]

Additional synthetic and real experiments in the paper/poster.

SLIDES 49–50

Summary

◮ synTS, asyTS: direct application of TS to the synchronous and asynchronous parallel settings.

◮ Take-aways: Theory
  • Both perform essentially the same as seqTS in terms of the number of evaluations.
  • When we factor in time as a resource, asyTS performs best.

◮ Take-aways: Practice
  • Conceptually simple, and scales better with the number of workers than other methods.

Thank you

Poster #49, Session 3 (Tuesday evening).

Code: github.com/kirthevasank/gp-parallel-ts

SLIDE 51

Appendix

SLIDES 52–54

Experiment: Branin-2D, M = 4

Evaluation time sampled from a uniform distribution.

[Figure: simple regret vs time for synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, and asyTS.]

SLIDE 55

Experiment: Hartmann-6D, M = 12

Evaluation time sampled from a half-normal distribution.

[Figure: simple regret vs time for the same ten methods.]

SLIDE 56

Experiment: Hartmann-18D, M = 25

Evaluation time sampled from an exponential distribution.

[Figure: simple regret vs time for the same ten methods, as on Slide 48.]

SLIDE 57

Experiment: Currin-Exponential-14D, M = 35

Evaluation time sampled from a Pareto-3 distribution.

[Figure: simple regret vs time for the same ten methods.]

SLIDE 58

Experiment: Model Selection on Cifar10, M = 4

Tune the number of filters in the range (32, 256) for each layer of a 6-layer CNN. Time taken for one evaluation: 4–16 minutes.

[Figure: validation accuracy vs time for synTS, synHUCB, asyRAND, asyHUCB, asyEI, and asyTS.]