Scalable Bandit Methods for Hyper-parameter Tuning


slide-1
SLIDE 1

Scalable Bandit Methods for Hyper-parameter Tuning

Kirthevasan Kandasamy Carnegie Mellon University Guest Lecture - Scalable Machine Learning for Big Data Biology University of Pittsburgh, Pittsburgh, PA November 3, 2017

slide-2
SLIDE 2

Hyper-parameter Tuning

Neural Network

(Diagram: hyper-parameters → neural network → cross-validation accuracy.)

  • Train NN using given hyper-parameters
  • Compute accuracy on validation set

1/40

slide-3
SLIDE 3

Black-box Optimisation

Expensive Blackbox Function

1/40

slide-4
SLIDE 4

Maximum Likelihood estimation in Astrophysics

(Diagram: parameters such as the Hubble constant and baryonic density are fed to a cosmological simulator; its output is compared against observations to compute a likelihood score.)

1/40

slide-5
SLIDE 5

Black-box Optimisation

Expensive Blackbox Function

Other Examples:

  • Pre-clinical Drug Discovery
  • Optimal policy in Autonomous Driving
  • Synthetic gene design

1/40


slide-9
SLIDE 9

Black-box Optimisation

f : X → R is a black-box function that is accessible only via noisy evaluations. Let x⋆ = argmaxx f (x).

(Figure: f with its maximiser x⋆ and optimum f(x⋆).)

Simple regret after n evaluations: Sn = f(x⋆) − max_{t=1,...,n} f(xt).

2/40
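The definition above can be checked with a small sketch; the quadratic f and the queried points are illustrative stand-ins, chosen so the optimum is known and the regret can be computed exactly.

```python
# Simple regret S_n = f(x*) - max_{t=1..n} f(x_t): the gap between the
# optimum and the best point queried so far. Here f's optimum is known
# (f(x*) = 0 at x* = 0.3); in a real black-box problem it is not.
def f(x):
    return -(x - 0.3) ** 2

queried = [0.0, 0.5, 0.25, 0.31]             # x_1, ..., x_n
simple_regret = 0.0 - max(f(x) for x in queried)
# The best query so far is x = 0.31, so the regret is (0.31 - 0.3)^2.
```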

slide-10
SLIDE 10

Outline

◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)

◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces

3/40



slide-13
SLIDE 13

Gaussian (Normal) distribution N(µ, σ²)

◮ A probability distribution for real-valued random variables.
◮ The mean µ and variance σ² completely characterise the distribution.
◮ For samples X1, . . . , Xn, let µ̂ = (1/n) Σi Xi be the sample mean. Then µ̂ ± 1.96 σ/√n is a 95% confidence interval for µ.
◮ Can draw samples (e.g. in Matlab: mu + sigma * randn()).

4/40
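The same sampling and confidence-interval computation, sketched in Python rather than Matlab (the values of mu, sigma and n are illustrative):

```python
import math
import random

random.seed(0)
mu, sigma, n = 5.0, 2.0, 1000

# The Matlab one-liner mu + sigma * randn() becomes:
samples = [mu + sigma * random.gauss(0.0, 1.0) for _ in range(n)]

# Sample mean, and the 95% interval mu_hat +/- 1.96 * sigma / sqrt(n)
mu_hat = sum(samples) / n
half_width = 1.96 * sigma / math.sqrt(n)
interval = (mu_hat - half_width, mu_hat + half_width)
# The interval contains the true mu roughly 95% of the time.
```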

slide-14
SLIDE 14

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R.

(Figure sequence: sample functions before any observations; the prior GP; observations; the posterior GP given those observations.)

5/40

slide-19
SLIDE 19

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R. (Figure: the posterior GP given observations.)

Completely characterised by a mean function µ : X → R and a covariance kernel κ : X × X → R. After t observations, f(x) ∼ N(µt(x), σt²(x)).

5/40
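A minimal sketch of the posterior computation on this slide, assuming a squared-exponential kernel and zero prior mean; the lengthscale and the observed data are illustrative stand-ins.

```python
import numpy as np

def sq_exp(a, b, ell=0.3):
    # Squared-exponential kernel kappa(x, x') = exp(-(x - x')^2 / (2 ell^2))
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-3):
    # Posterior mean mu_t(x) and std-dev sigma_t(x) of a zero-mean GP
    K = sq_exp(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_s = sq_exp(X_query, X_obs)
    mu = K_s @ np.linalg.solve(K, y_obs)
    cov = sq_exp(X_query, X_query) - K_s @ np.linalg.solve(K, K_s.T)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))

X_obs = np.array([0.1, 0.4, 0.8])
y_obs = np.sin(2 * np.pi * X_obs)
mu, sigma = gp_posterior(X_obs, y_obs, X_obs)
# At the observed points the posterior mean nearly interpolates y_obs
# and the posterior std-dev is small, as in the figures.
```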


slide-24
SLIDE 24

Algorithm 1: Upper Confidence Bounds in GP Bandits

Model f ∼ GP(0, κ). Gaussian Process Upper Confidence Bound (GP-UCB)

(Srinivas et al. 2010)

1) Construct the posterior GP.
2) ϕt = µt−1 + βt^{1/2} σt−1 is an upper confidence bound (UCB) for f.
3) Choose xt = argmaxx ϕt(x).
4) Evaluate f at xt.

6/40

slide-25
SLIDE 25

GP-UCB

xt = argmaxx µt−1(x) + βt^{1/2} σt−1(x)

◮ µt−1: Exploitation
◮ σt−1: Exploration
◮ βt controls the trade-off; βt ≍ log t.

7/40
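The trade-off can be seen in a toy sketch; the posterior means and standard deviations below are hypothetical stand-ins for µt−1 and σt−1 on a grid of candidate points.

```python
import math

xs    = [0.0, 0.25, 0.5, 0.75, 1.0]   # candidate points
mu    = [0.2, 0.8, 0.5, 0.1, 0.3]     # posterior means mu_{t-1}(x)
sigma = [0.4, 0.1, 0.3, 0.6, 0.2]     # posterior std-devs sigma_{t-1}(x)

t = 10
beta_t = math.log(t)                  # beta_t ~ log t

# UCB acquisition phi_t(x) = mu_{t-1}(x) + sqrt(beta_t) * sigma_{t-1}(x)
phi = [m + math.sqrt(beta_t) * s for m, s in zip(mu, sigma)]
x_t = xs[phi.index(max(phi))]
# The rule picks x_t = 0.75: a modest mean but the largest uncertainty,
# so the UCB favours exploring it over exploiting the best mean at 0.25.
```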

slide-26
SLIDE 26

GP-UCB

(Srinivas et al. 2010)

(Figure sequence: GP-UCB illustrated at iterations t = 1, . . . , 7, 11 and 25.)

8/40


slide-40
SLIDE 40

Algorithm 2: Thompson Sampling in GP Bandits

Model f ∼ GP(0, κ). Thompson Sampling (TS)

(Thompson, 1933)

1) Construct the posterior GP.
2) Draw a sample g from the posterior.
3) Choose xt = argmaxx g(x).
4) Evaluate f at xt.

9/40
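Steps 2-3 can be sketched by discretising X to a grid, where a posterior function sample is just a draw from a multivariate normal; the mean vector and covariance matrix below are illustrative stand-ins for the posterior GP.

```python
import numpy as np

rng = np.random.default_rng(0)

xs = np.linspace(0.0, 1.0, 6)                  # a grid over X
mean = np.sin(2 * np.pi * xs)                  # stand-in posterior mean
cov = 0.05 * np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / 0.2 ** 2)

# 2) Draw one function sample g from the posterior, restricted to the grid.
g = rng.multivariate_normal(mean, cov)
# 3) Choose x_t = argmax_x g(x); 4) the next step would evaluate f there.
x_t = xs[np.argmax(g)]
```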

slide-41
SLIDE 41

Thompson Sampling (TS) in GPs

(Thompson, 1933)

(Figure sequence: TS illustrated at iterations t = 1, . . . , 7, 11 and 25.)

10/40


slide-53
SLIDE 53

Bandits in the Bayesian Paradigm

Theory: Both UCB and TS will eventually find the optimum under appropriate smoothness assumptions on f. That is,

Sn = f(x⋆) − max_{t=1,...,n} f(xt) → 0, as n → ∞.

Other criteria for selecting xt:

◮ Expected improvement (Jones et al. 1998)
◮ Probability of improvement (Kushner et al. 1964)
◮ Predictive entropy search (Hernández-Lobato et al. 2014)
◮ . . . and a few more.

Other Bayesian models for f:

◮ Neural networks (Snoek et al. 2015)
◮ Random forests (Hutter 2009)

11/40


slide-55
SLIDE 55

Outline

◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)

◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces

(N.B: Part II is a shameless plug for my research.)

12/40


slide-57
SLIDE 57

Part 2.1: Multi-fidelity Bandits

Motivating question: What if we have cheap approximations to f ?

1. Hyper-parameter tuning: train & validate with a subset of the data, and/or stop early before convergence. E.g. bandwidth (ℓ) selection in kernel density estimation.
2. Computational astrophysics: cosmological simulations and numerical computations with less granularity.
3. Autonomous driving: simulation vs. real-world experiment.

13/40


slide-59
SLIDE 59

Multi-fidelity Methods

For specific applications:

◮ Industrial design (Forrester et al. 2007)
◮ Hyper-parameter tuning (Agarwal et al. 2011, Klein et al. 2015, Li et al. 2016)
◮ Active learning (Zhang & Chaudhuri 2015)
◮ Robotics (Cutler et al. 2014)

Multi-fidelity bandits & optimisation (Huang et al. 2006, Forrester et al. 2007, March & Wilcox 2012, Poloczek et al. 2016)

. . . with theoretical guarantees (Kandasamy et al. NIPS 2016a&b, Kandasamy et al. ICML 2017)

14/40


slide-62
SLIDE 62

Multi-fidelity Bandits for Hyper-parameter tuning

• Use an arbitrary amount of data?
• Iterative algorithms: use an arbitrary number of iterations?

E.g. Train an ML model with N• data and T• iterations, but use N < N• data and T < T• iterations to approximate the cross-validation performance at (N•, T•).

Approximations come from a continuous 2D “fidelity space” of (N, T) values.

15/40


slide-67
SLIDE 67

Multi-fidelity Bandits

(Kandasamy et al. ICML 2017)

(Figure: the fidelity space Z and domain X, showing g(z, x), f(x) = g(z•, x) and the maximiser x⋆.)

Z ← all (N, T) values. X ← all hyper-parameter values.

g : Z × X → R. g([N, T], x) ← CV accuracy when training with N data for T iterations at hyper-parameter x.

Denote f(x) = g(z•, x), where z• = [N•, T•] ∈ Z.

End goal: Find x⋆ = argmaxx f(x).

A cost function λ : Z → R+, e.g. λ(z) = λ(N, T) = O(N²T).

16/40


slide-76
SLIDE 76

Algorithm: BOCA

(Kandasamy et al. ICML 2017)

Model g ∼ GP(0, κ) and compute the posterior GP:
mean µt−1 : Z × X → R, std-dev σt−1 : Z × X → R+.

(1) xt ← maximise an upper confidence bound for f(x) = g(z•, x):

xt = argmax_{x∈X} µt−1(z•, x) + βt^{1/2} σt−1(z•, x)

(2) Zt ≈ {z•} ∪ { z : σt−1(z, xt) ≥ γ(z) }, where γ(z) = (λ(z)/λ(z•))^q ξ(z).

(3) zt = argmin_{z∈Zt} λ(z) (the cheapest z in Zt).

17/40
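Steps (2)-(3) can be sketched with hypothetical numbers; lam, sigma and gamma below stand in for the cost λ(z), the posterior std-dev σt−1(z, xt) and the threshold γ(z) over a few candidate fidelities, and are not values from the paper.

```python
# A hedged sketch of BOCA's fidelity-selection step.
z_star = 1.0                      # the target fidelity z*
candidates = [0.25, 0.5, 0.75, z_star]

lam = {0.25: 1.0, 0.5: 4.0, 0.75: 9.0, z_star: 16.0}       # cost lambda(z)
sigma = {0.25: 0.30, 0.5: 0.12, 0.75: 0.02, z_star: 0.05}  # sigma_{t-1}(z, x_t)
gamma = {0.25: 0.10, 0.5: 0.10, 0.75: 0.10, z_star: 0.0}   # threshold gamma(z)

# Z_t: the target fidelity plus any z whose posterior is still uncertain.
Z_t = [z for z in candidates if z == z_star or sigma[z] >= gamma[z]]
z_t = min(Z_t, key=lam.get)  # evaluate at the cheapest qualifying fidelity
# Here the cheapest still-uncertain fidelity (z = 0.25) is queried, so
# expensive evaluations at z* are deferred until cheap ones are informative.
```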


slide-78
SLIDE 78

Theoretical Results for BOCA

(Figures: a “good” setting, where g(z, x) at cheaper fidelities z is informative about f, and a “bad” setting, where it is not.)

Theorem (informal): BOCA achieves better simple regret than GP-UCB. The improvements are larger in the “good” setting than in the “bad” setting.

18/40

slide-79
SLIDE 79

Experiment: SVM with 20 News Groups

Tune two hyper-parameters for the SVM. The dataset has N• = 15K points and we use T• = 100 iterations, but we can choose N ∈ [5K, 15K] or T ∈ [20, 100] (a 2D fidelity space).

(Results plot.)

19/40


slide-81
SLIDE 81

Experiment: Cosmological inference on Type Ia supernovae data

Estimate the Hubble constant, dark matter fraction & dark energy fraction by maximising the likelihood on N• = 192 data points. This requires numerical integration on a grid of size G• = 10⁶. Approximate with N ∈ [50, 192] or G ∈ [10², 10⁶] (a 2D fidelity space).

(Results plot.)

20/40


slide-86
SLIDE 86

Hyper-band: A multi-fidelity method with incremental resource allocation

(Li et al. 2016)

E.g: Training a neural network with gradient descent for several iterations: if the CV error is bad after the early iterations, it will likely be bad at the end.

Successive Halving (with finite X):

1. Allocate a small resource R to each x ∈ X, e.g. train all hyper-parameter configurations for 100 iterations.
2. Drop the half of the x’s that are performing worst.
3. Repeat steps 1 & 2 until one arm is left.

Can be extended to infinite X. Does not fall within the GP/Bayesian framework.

21/40
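The three steps above can be sketched over a finite set of configurations; score(x, r) is a hypothetical stand-in for "validation accuracy of configuration x after r resource units".

```python
# A minimal sketch of Successive Halving.
def score(x, r):
    return x * min(r, 50)  # toy proxy: better x's pull ahead as r grows

def successive_halving(configs, resource=100):
    while len(configs) > 1:
        # 1) allocate a small resource to every surviving configuration
        scores = {x: score(x, resource) for x in configs}
        # 2) drop the worse-performing half
        configs = sorted(configs, key=scores.get, reverse=True)
        configs = configs[: max(1, len(configs) // 2)]
        # 3) repeat until one arm is left
    return configs[0]

best = successive_halving([0.1, 0.3, 0.7, 0.9])
# With this toy score, the strongest configuration (0.9) survives.
```

Hyper-band runs this subroutine with several initial resource levels, hedging against the early scores being misleading.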


slide-88
SLIDE 88

Hyper-band (cont’d)

When compared to Bayesian methods,

◮ Pro: Incremental resource allocation (no need to retrain all models from the beginning).
◮ Con: Cannot use correlation between arms (e.g. if x1 has a large CV accuracy, then x2 close to x1 is also likely to do well).

Experiments: (results plots)

22/40

slide-89
SLIDE 89

Outline

◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)

◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces

23/40


slide-92
SLIDE 92

Part 2.2: Parallelising arm pulls

(Figures: sequential evaluations with one worker; parallel evaluations with M workers, asynchronous; parallel evaluations with M workers, synchronous.)

24/40


slide-94
SLIDE 94

Why parallelisation?

◮ Computational experiments: infrastructure with 100s to 1000s of CPUs or GPUs.

Prior work: (Ginsbourger et al. 2011, Janusevskis et al. 2012, Wang et al. 2016, González et al. 2015, Desautels et al. 2014, Contal et al. 2013, Shah and Ghahramani 2015, Kathuria et al. 2016, Wang et al. 2017, Wu and Frazier 2016, Hernandez-Lobato et al. 2017)

Shortcomings of prior work:

◮ Asynchronicity
◮ Theoretical guarantees
◮ Computational & conceptual simplicity

25/40


slide-99
SLIDE 99

Review: Sequential Thompson Sampling in GP Bandits

Thompson Sampling (TS)

(Thompson, 1933)

1) Construct the posterior GP.
2) Draw a sample g from the posterior.
3) Choose xt = argmaxx g(x).
4) Evaluate f at xt.

26/40


slide-102
SLIDE 102

Parallelised Thompson Sampling

(Kandasamy et al. Arxiv 2017)

Asynchronous: asyTS
At any given time,
1. (x′, y′) ← wait for a worker to finish.
2. Compute the posterior GP.
3. Draw a sample g ∼ GP.
4. Re-deploy the worker at argmax g.

Synchronous: synTS
At any given time,
1. {(x′m, y′m)} for m = 1, . . . , M ← wait for all workers to finish.
2. Compute the posterior GP.
3. Draw M samples gm ∼ GP, ∀m.
4. Re-deploy worker m at argmax gm, ∀m.

Variants in prior work: (Osband et al. 2016, Israelsen et al. 2016, Hernandez-Lobato et al. 2017)

27/40
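The asynchronous loop above can be sketched as an event-driven simulation with M workers; thompson_choice is a hypothetical stand-in for "argmax of a posterior sample", and the finish times are random placeholders rather than real evaluation costs.

```python
import heapq
import random

random.seed(0)

def thompson_choice(completed):
    # Stand-in for: fit posterior on `completed`, sample g, return argmax g.
    return random.random()

def f(x):
    return -(x - 0.5) ** 2  # toy black-box

M, budget = 3, 10
completed = []                                      # (x, y) pairs so far
events = [(random.random(), m) for m in range(M)]   # (finish_time, worker)
heapq.heapify(events)

for _ in range(budget):
    t_now, worker = heapq.heappop(events)           # 1) wait for a worker
    x = thompson_choice(completed)                  # 2-3) posterior sample
    completed.append((x, f(x)))                     # evaluate f at argmax g
    heapq.heappush(events, (t_now + random.random(), worker))  # 4) redeploy
```

synTS differs only in popping all M events at once before redeploying; no worker ever idles in the asynchronous variant.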


slide-106
SLIDE 106

Theoretical Results for TS: number of evaluations

Sequential TS, SE kernel (Russo & van Roy 2014):

E[Sn] ≲ √( vol(X) log(n) / n )

Theorem: synTS & asyTS, SE kernel (Kandasamy et al. Arxiv 2017):

E[Sn] ≲ (M √log(M)) / n + √( vol(X) log(n + M) / n )

where n ← # completed arm pulls by all workers.

Why is this interesting?

• A sequential algorithm can use information from all previous rounds to determine where to evaluate next.
• A parallel algorithm could be missing up to M − 1 results at any given time.

But randomisation helps!

28/40

slide-107
SLIDE 107

(Figures: sequential evaluations with one worker; parallel evaluations with M workers, asynchronous; parallel evaluations with M workers, synchronous.)

29/40


slide-110
SLIDE 110

Theoretical Results: Simple regret with time

(Figures: simple regret vs. wall-clock time, asynchronous and synchronous.)

Theorem (informal) (Kandasamy et al. Arxiv 2017): If evaluation times are all the same, asyTS ≈ synTS. Otherwise, the bounds for asyTS are better than those for synTS; the more variability in evaluation times, the bigger the difference.

• Bounded tail decay: constant factor
• Sub-Gaussian tail decay: √log(M) factor
• Sub-exponential tail decay: log(M) factor

30/40


slide-113
SLIDE 113

Experiment: Branin-2D M = 4

Evaluation time sampled from a uniform distribution

Methods compared: synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS.

(Results plot.)

31/40

slide-114
SLIDE 114

Experiment: Hartmann-18D M = 25

Evaluation time sampled from an exponential distribution

Methods compared: synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS.

(Results plot.)

32/40

slide-115
SLIDE 115

Experiment: Model Selection in Cifar10 M = 4

Tune the number of filters in the range (32, 256) for each layer of a 6-layer CNN. Time taken for an evaluation: 4 to 16 minutes.

Methods compared: synTS, asyRAND, asyHUCB, asyTS, asyEI, synHUCB.

(Results plot.)

33/40

slide-116
SLIDE 116

Parallelised Thompson Sampling in Neural Networks

(Hernandez-Lobato et al. 2017)

34/40


slide-118
SLIDE 118

Outline

◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)

◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces

35/40


slide-121
SLIDE 121

Part 2.3: Optimisation in High Dimensional Input Spaces

E.g. Tuning a machine learning model with several hyper-parameters.

At each time step, GP-UCB must fit the posterior GP and maximise ϕt = µt−1 + βt^{1/2} σt−1. This creates two difficulties:

1. Statistical difficulty: estimating a high-dimensional GP.
2. Computational difficulty: maximising a high-dimensional acquisition (e.g. the upper confidence bound ϕt).

36/40


slide-123
SLIDE 123

Additive Models for High Dimensional BO

(Kandasamy et al. ICML 2015)

E.g. f(x{1,...,10}) = f(1)(x{1,3,9}) + f(2)(x{2,4,8}) + f(3)(x{5,6,10}).

◮ Better statistical properties: sample complexity improves from exponential in d to linear in d.
◮ The Add-GP-UCB algorithm is computationally tractable even for large d.
◮ Better bias-variance trade-off in practice: the algorithm does well even if f is not additive.

37/40
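The computational gain from additivity can be sketched directly: with hypothetical group functions standing in for f(1), f(2), f(3) on disjoint coordinate groups (0-indexed versions of the groups on the slide), an additive acquisition is maximised group by group over 3³ candidates each, rather than jointly over 3¹⁰.

```python
import itertools

# Illustrative stand-ins for the additive components; each takes only
# the coordinates in its group.
groups = {(0, 2, 8): lambda v: -sum((vi - 0.5) ** 2 for vi in v),
          (1, 3, 7): lambda v: sum(v),
          (4, 5, 9): lambda v: -abs(v[0] - v[1])}

grid = [0.0, 0.5, 1.0]
x = [0.0] * 10
for dims, f_j in groups.items():
    # Each group is optimised independently: 3^3 candidates per group
    # instead of 3^10 for a joint grid search.
    best = max(itertools.product(grid, repeat=len(dims)), key=f_j)
    for d, v in zip(dims, best):
        x[d] = v
# x now maximises the additive surrogate coordinate-group by group.
```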

slide-124
SLIDE 124

Experiment: Viola & Jones Cascade classifier

Tune 22 hyper-parameters of the V&J classifier.

(Results plot.)

38/40


slide-126
SLIDE 126

Summary

◮ Bandits are a framework for studying exploration vs. exploitation trade-offs when optimising black-box functions.

◮ Several applications: hyper-parameter tuning, materials synthesis, scientific experiments, etc.

◮ Several algorithms: UCB, TS, EI, etc.

◮ Multi-fidelity bandits: use cheap approximations to an expensive experiment to speed up optimisation.

◮ Parallelised TS: a simple and intuitive way to deal with multiple workers.

◮ High-dimensional optimisation: additive models have favourable statistical and computational properties.

39/40

slide-127
SLIDE 127

Akshay, Barnabás, Gautam, Jeff, Junier

Thank You

Slides: www.cs.cmu.edu/~kkandasa/talks/pitt-hptune-slides.pdf