Stochastic Bandits

Kirthevasan Kandasamy
Carnegie Mellon University

University of Moratuwa, Sri Lanka
August 17, 2017

Slides: www.cs.cmu.edu/~kkandasa/misc/mora-slides.pdf


Slides are up on my webpage:

www.cs.cmu.edu/~kkandasa

On-line advertising

You are given a pool of 250 ads. Task:

◮ You can display one ad at a time (say, for 10^6 rounds).
◮ You wish to maximise the cumulative number of clicks, i.e. identify ads with the highest click-through rate and display them most of the time.
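The ad-selection task above can be simulated in a few lines. The sketch below is illustrative: the CTR values, round counts, and the naive explore-then-commit baseline are assumptions for demonstration, not the UCB/TS methods discussed later.

```python
import random

# Toy version of the ad-selection task: K ads, each with an unknown
# click-through rate (CTR); show one ad per round, count the clicks.
random.seed(0)
ctrs = [0.02, 0.05, 0.01, 0.04]   # true CTRs, unknown to the algorithm (made up)
K, n = len(ctrs), 20000

clicks = [0] * K                  # clicks observed per ad
shows = [0] * K                   # times each ad was shown
total_clicks = 0
for t in range(n):
    if t < 4000:
        a = t % K                 # crude uniform exploration phase
    else:
        # commit to the empirically best ad (a naive baseline policy)
        a = max(range(K), key=lambda i: clicks[i] / max(shows[i], 1))
    shows[a] += 1
    if random.random() < ctrs[a]:  # Bernoulli click with the ad's CTR
        clicks[a] += 1
        total_clicks += 1
```

A good bandit algorithm interleaves exploration and exploitation instead of separating them into fixed phases, which is exactly what the algorithms below do.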

The Stochastic Multi-armed Bandit (Robbins, 1952)

◮ You are given K arms, X = {1, . . . , K}.
◮ At every round you “play/pull” an arm.
◮ When you play arm x_t ∈ X in round t you receive a stochastic reward y_t, where E[y_t] = f(x_t).
◮ Goal: Maximise the cumulative sum of expected rewards,

  E[ Σ_{t=1}^n y_t ] = Σ_{t=1}^n f(x_t).

◮ Goal: An algorithm (policy/strategy) which achieves “small” cumulative regret,

  R_n = Σ_{t=1}^n f(x⋆) − Σ_{t=1}^n f(x_t) = Σ_{t=1}^n ( f(x⋆) − f(x_t) ),

  where x⋆ = argmax_{x∈X} f(x).
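To make the definition concrete, here is the cumulative regret of a uniformly random policy on a toy 3-armed Gaussian bandit (the arm means and horizon are arbitrary choices for illustration):

```python
import random

random.seed(1)
f = [0.1, 0.5, 0.9]                  # expected rewards f(x); x* is arm 2
f_star = max(f)
n = 1000
regret = 0.0
for t in range(n):
    x = random.randrange(len(f))     # uniform-random policy
    y = random.gauss(f[x], 1.0)      # stochastic reward, E[y] = f(x)
    regret += f_star - f[x]          # regret counts expected, not realised, reward
```

The random policy's regret grows linearly (roughly 0.4 per round here); a policy that learns to favour arm 2 would make it grow sublinearly.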

Smooth Bandits

f : X → R is a black-box function that is accessible only via noisy evaluations. X is a metric space, e.g. R^d. Let x⋆ = argmax_x f(x).

[Figure: a 1-d function f with its maximiser x⋆ and maximum f(x⋆) marked.]

Cumulative regret after n evaluations: R_n = Σ_{t=1}^n ( f(x⋆) − f(x_t) ).

Simple regret after n evaluations: S_n = f(x⋆) − max_{t=1,...,n} f(x_t).

Applications

◮ Cosmological inference: a cosmological simulator maps parameters (e.g. Hubble constant, baryonic density) to a likelihood score for the observations; each likelihood computation is one evaluation.
◮ Hyper-parameter tuning for neural networks: hyper-parameters → cross-validation accuracy. Train the NN using the given hyper-parameters, then compute accuracy on a validation set.
◮ Other expensive black-box functions: pre-clinical drug discovery, optimal policies in autonomous driving, synthetic gene design.

Recap

Types of arms (domain X):

1. K-armed bandit: X is a finite set.
2. Smooth bandit: X is a metric space (e.g. R^d).
3. “Smooth K-armed” bandit: X is finite, but there is additional structure.

f : X → R. On playing x ∈ X you observe f(x) + ε, with Eε = 0.

Two notions of regret:

1. Cumulative regret, R_n = Σ_{t=1}^n ( f(x⋆) − f(x_t) ).
2. Simple regret, S_n = f(x⋆) − max_{t=1,...,n} f(x_t).

Other formalisms: contextual bandits, adversarial bandits, duelling bandits, linear bandits, best-arm identification and several more.

N.B: Pulling/playing an arm = experiment = function evaluation.
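Both regret notions can be computed directly from one run's trace; a minimal example with a made-up reward function over a finite domain and a made-up arm sequence:

```python
f = {1: 0.2, 2: 0.7, 3: 1.0}            # f over a finite domain (toy values)
f_star = max(f.values())                 # f(x*) = 1.0
trace = [1, 2, 2, 3, 2]                  # arms played over n = 5 rounds

R_n = sum(f_star - f[x] for x in trace)          # cumulative regret
S_n = f_star - max(f[x] for x in trace)          # simple regret
```

Once the best arm (3) has been played once, S_n drops to zero, while R_n keeps accumulating whenever a suboptimal arm is played.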

Outline

◮ Part I: Stochastic bandits (cont’d)
  1. Gaussian processes for smooth bandits
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)
◮ Digression: SL2College Research Collaboration Program
◮ Part II: My research
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising arm pulls

Gaussian (Normal) distribution N(µ, σ²)

◮ A probability distribution for real-valued random variables.
◮ The mean µ and variance σ² completely characterise the distribution.
◮ For samples X_1, . . . , X_n, let µ̂ = (1/n) Σ_i X_i be the sample mean. Then µ̂ ± 1.96 σ/√n is a 95% confidence interval for µ.
◮ Can draw samples (e.g. in Matlab: mu + sigma * randn()).
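The confidence-interval claim is easy to check numerically (Python here rather than the slide's Matlab; the particular µ, σ and sample size are arbitrary):

```python
import math
import random

random.seed(2)
mu, sigma, n = 3.0, 2.0, 10000
xs = [random.gauss(mu, sigma) for _ in range(n)]

mu_hat = sum(xs) / n                       # sample mean
half_width = 1.96 * sigma / math.sqrt(n)   # 1.96 * sigma / sqrt(n)
lo, hi = mu_hat - half_width, mu_hat + half_width
```

Repeating this over many seeds, the interval [lo, hi] covers the true µ about 95% of the time.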

Gaussian Processes (GP)

GP(µ, κ): a distribution over functions from X to R, completely characterised by a mean function µ : X → R and a covariance kernel κ : X × X → R.

[Figures: samples from the prior GP; a set of observations; the posterior GP given those observations.]

After t observations, f(x) ∼ N( µ_t(x), σ_t²(x) ).
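A minimal, self-contained sketch of this posterior computation, with a squared-exponential (SE) kernel and a small hand-rolled linear solver instead of a GP library. The bandwidth, observation-noise variance η² and test points are illustrative assumptions.

```python
import math

def se_kernel(a, b, bandwidth=0.5):
    # Squared-exponential kernel on 1-d inputs.
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(X, y, x, eta2=1e-4):
    # Posterior mean/variance of f(x) given observations (X, y), prior mean 0.
    K = [[se_kernel(a, b) + (eta2 if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    k = [se_kernel(a, x) for a in X]
    alpha = solve(K, y)                            # K^{-1} y
    v = solve(K, k)                                # K^{-1} k
    mean = sum(ki * ai for ki, ai in zip(k, alpha))
    var = se_kernel(x, x) - sum(ki * vi for ki, vi in zip(k, v))
    return mean, var

X_obs, y_obs = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
m_near, v_near = gp_posterior(X_obs, y_obs, 1.0)   # at an observed point
m_far, v_far = gp_posterior(X_obs, y_obs, 5.0)     # far from all observations
```

Near an observation the posterior mean tracks the data and σ_t² collapses; far away the posterior reverts to the prior (mean 0, variance κ(x, x)). That shrinking/expanding uncertainty is exactly what the algorithms below exploit.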

Algorithm 1: Upper Confidence Bounds in GP Bandits

Model f ∼ GP(0, κ). Gaussian Process Upper Confidence Bound (GP-UCB) (Srinivas et al. 2010).

Construct an upper confidence bound, ϕ_t(x) = µ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x), and maximise it:

  x_t = argmax_x µ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x).

◮ µ_{t−1}: Exploitation
◮ σ_{t−1}: Exploration
◮ β_t controls the tradeoff; β_t ≍ log t.

GP-UCB, κ an SE kernel (Srinivas et al. 2010): w.h.p.

  S_n = f(x⋆) − max_{t=1,...,n} f(x_t) ≲ √( vol(X) log(n)^d / n ).
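The same rule can be sketched in the finite-armed case, where µ_{t−1} is a sample mean and σ_{t−1} shrinks as 1/√(#pulls) under a unit-variance noise assumption. The arm means, horizon and the choice β_t = 2 log t are illustrative, not prescribed by the slides.

```python
import math
import random

# UCB rule x_t = argmax_x mu_{t-1}(x) + beta_t^{1/2} sigma_{t-1}(x),
# specialised to K independent Gaussian arms with unit noise.
random.seed(3)
f = [0.2, 0.4, 0.9, 0.5]             # true means (illustrative); x* is arm 2
K, n = len(f), 3000
counts, sums = [0] * K, [0.0] * K

for t in range(1, n + 1):
    if t <= K:
        x = t - 1                     # pull each arm once to initialise
    else:
        beta = 2.0 * math.log(t)      # beta_t ~ log t
        x = max(range(K), key=lambda i:
                sums[i] / counts[i] + math.sqrt(beta / counts[i]))
    y = random.gauss(f[x], 1.0)       # stochastic reward, E[y] = f(x)
    counts[x] += 1
    sums[x] += y
```

Suboptimal arms stop being pulled once their shrinking confidence bonus can no longer cover the gap to the best arm, so almost all pulls concentrate on x⋆.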

GP-UCB (Srinivas et al. 2010)

[Figures: GP-UCB simulation on a 1-d function at t = 1, 2, 3, 4, 5, 6, 7, 11, 25.]

Algorithm 2: Thompson Sampling in GP Bandits

Model f ∼ GP(0, κ). Thompson Sampling (TS) (Thompson, 1933).

Draw a sample g from the posterior. Choose x_t = argmax_x g(x).
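For finite arms with Gaussian rewards the same idea has a closed form: put a N(0, 1) prior on each arm's mean, sample one plausible mean per arm from its posterior, and play the argmax. This is a finite-arm analogue of sampling a whole function g from the GP posterior; the priors and arm means below are illustrative assumptions.

```python
import math
import random

random.seed(4)
f = [0.1, 0.8, 0.3]                    # true means, unknown (illustrative)
K, n = len(f), 2000
counts, sums = [0] * K, [0.0] * K

for t in range(n):
    sampled = []
    for i in range(K):
        # Conjugate posterior of arm i's mean under unit noise: N(m, s2).
        s2 = 1.0 / (1.0 + counts[i])
        m = s2 * sums[i]
        sampled.append(random.gauss(m, math.sqrt(s2)))
    x = max(range(K), key=lambda i: sampled[i])   # play the sampled argmax
    y = random.gauss(f[x], 1.0)
    counts[x] += 1
    sums[x] += y
```

Arms are played in proportion to the posterior probability that they are optimal, so exploration happens automatically through the randomness of the samples.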

Thompson Sampling (TS) in GPs (Thompson, 1933)

[Figures: TS simulation on a 1-d function at t = 1, 2, 3, 4, 5, 6, 7, 11, 25.]


SL2College

www.sl2college.org

SL2College Research Collaboration Program - Ashwin de Silva

www.sl2college.org/research-collab
research-collab@sl2college.org

SL2College Research Collaboration Program

How it works: We have a pool of doctoral/post-doctoral/professorial mentors (all Sri Lankan). We connect Sri Lankan undergrads to mentors, who guide the students on a research project. Aim: publish a paper (at a good venue) within a 9-15 month time frame.

Application Process

◮ Fill out the application form on our webpage, www.sl2college.org/research-collab, mentioning your areas of interest and preferred mentors.
◮ . . . and email your CV to research-collab@sl2college.org.
◮ If we decide to proceed, we ask you to submit a ∼1-page research statement: your research interests & future plans, and why you are interested in working with the aforesaid mentor.
◮ We send your CV & statement to the mentor. If he/she is interested, we initiate a collaboration.
◮ You report to us once every 3 months.

SL2College Research Collaboration Team

Ashwin, Nuwan, Rajitha, Umashanthi, Kirthevasan

www.sl2college.org/research-collab
research-collab@sl2college.org


Part 2.1: Multi-fidelity Bandits

Motivating question: What if we have cheap approximations to f?

1. Computational astrophysics and other scientific experiments: simulations and numerical computations with less granularity (e.g. the cosmological-simulator likelihood computation above).
2. Hyper-parameter tuning: train & validate with a subset of the data.
3. Robotics & autonomous driving: computer simulation vs real-world experiment.

Multi-fidelity Methods

For specific applications:

◮ Industrial design (Forrester et al. 2007)
◮ Hyper-parameter tuning (Agarwal et al. 2011, Klein et al. 2015, Li et al. 2016)
◮ Active learning (Zhang & Chaudhuri 2015)
◮ Robotics (Cutler et al. 2014)

Multi-fidelity bandits & optimisation (Huang et al. 2006, Forrester et al. 2007, March & Wilcox 2012, Poloczek et al. 2016)

. . . with theoretical guarantees (Kandasamy et al. NIPS 2016a&b, Kandasamy et al. ICML 2017)

Multi-fidelity Bandits (Kandasamy et al. ICML 2017)

A fidelity space Z and domain X.
  Z ← all granularity values; X ← space of cosmological parameters.

g : Z × X → R.
  g(z, x) ← likelihood score when performing integrations on a grid of size z at cosmological parameters x.

Denote f(x) = g(z•, x), where z• ∈ Z.
  z• = highest grid size.

End Goal: Find x⋆ = argmax_x f(x).

A cost function λ : Z → R+, e.g. λ(z) = O(z^p).

Multi-fidelity Simple Regret (Kandasamy et al. ICML 2017)

End Goal: Find x⋆ = argmax_x f(x).

Simple regret after capital Λ:

  S(Λ) = f(x⋆) − max_{t : z_t = z•} f(x_t).

Λ ← amount of a resource spent, e.g. computation time or money.

There is no reward for pulling an arm at low fidelities, but cheap evaluations at z ≠ z• can be used to speed up the search for x⋆.

Algorithm: BOCA (Kandasamy et al. ICML 2017)

Model g ∼ GP(0, κ) and compute the posterior GP:
  mean µ_{t−1} : Z × X → R,
  std-dev σ_{t−1} : Z × X → R+.

(1) x_t ← maximise the upper confidence bound for f(x) = g(z•, x):

  x_t = argmax_{x∈X} µ_{t−1}(z•, x) + β_t^{1/2} σ_{t−1}(z•, x).

(2) Z_t ≈ {z•} ∪ { z : σ_{t−1}(z, x_t) ≥ γ(z) }, where γ(z) = √( λ(z)/λ(z•) ) ξ(z).

(3) z_t = argmin_{z∈Z_t} λ(z) (the cheapest z in Z_t).
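Steps (2)-(3) can be sketched as a small routine. The σ, γ and λ functions below are illustrative stand-ins (in BOCA they come from the posterior GP, the information gap ξ, and the true cost), not the paper's implementation:

```python
def choose_fidelity(fidelities, z_star, sigma, gamma, cost):
    # Among fidelities whose posterior std-dev at x_t exceeds the threshold
    # gamma(z), pick the cheapest; z_star is always a candidate, so we fall
    # back to the target fidelity when nothing cheaper is informative enough.
    candidates = [z for z in fidelities if sigma(z) >= gamma(z)]
    candidates.append(z_star)
    return min(candidates, key=cost)

fidelities = [0.25, 0.5, 1.0]                     # z_star = 1.0 (target fidelity)

# Early on: still uncertain at low fidelities, so the cheapest one is chosen.
z_early = choose_fidelity(
    fidelities, 1.0,
    sigma=lambda z: 0.8 * (1.0 - z) + 0.1,        # toy posterior std-dev
    gamma=lambda z: 0.3,                          # toy constant threshold
    cost=lambda z: z ** 3)                        # lambda(z) = z^p with p = 3

# Later: low fidelities are resolved, so the target fidelity is queried.
z_late = choose_fidelity(
    fidelities, 1.0,
    sigma=lambda z: 0.05,
    gamma=lambda z: 0.3,
    cost=lambda z: z ** 3)
```

This captures the intended behaviour: spend cheaply while low fidelities are still informative, and graduate to z• once they are not.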

Theoretical Results for BOCA

[Figures: a “good” setting, where g(z, ·) approximates f well (large h_Z), and a “bad” setting with poor approximations (small h_Z).]

E.g.: For SE kernels, the bandwidth h_Z controls smoothness.

GP-UCB, SE kernel (Srinivas et al. 2010): w.h.p.

  S(Λ) ≲ √( vol(X) / Λ ).

BOCA, SE kernel (Kandasamy et al. ICML 2017): w.h.p., ∀α > 0,

  S(Λ) ≲ √( vol(X_α) / Λ ) + √( vol(X) / Λ^{2−α} ),

where X_α = { x : f(x⋆) − f(x) ≤ Cα / h_Z }.

If h_Z is large (good approximations), vol(X_α) ≪ vol(X), and BOCA is much better than GP-UCB.

N.B: Constants and polylog terms dropped.

Experiment: Cosmological inference on Type-1a supernovae data

Estimate the Hubble constant, dark matter fraction & dark energy fraction by maximising the likelihood on N• = 192 data points. This requires numerical integration on a grid of size G• = 10^6. Approximate with N ∈ [50, 192] or G ∈ [10², 10⁶] (a 2-d fidelity space).

[Figure: results on the cosmological inference task.]


Part 2.2: Parallelising Arm Pulls

◮ Sequential arm pulls with one worker.
◮ Parallel arm pulls with M workers (asynchronous).
◮ Parallel arm pulls with M workers (synchronous).

Why parallelisation?

◮ Computational experiments: infrastructure with hundreds to thousands of CPUs or GPUs.
◮ Drug discovery: high-throughput screening.

Prior work: (Ginsbourger et al. 2011, Janusevskis et al. 2012, Wang et al. 2016, González et al. 2015, Desautels et al. 2014, Contal et al. 2013, Shah and Ghahramani 2015, Kathuria et al. 2016, Wang et al. 2017, Wu and Frazier 2016, Hernandez-Lobato et al. 2017)

Shortcomings of prior work: little support for asynchronicity, few theoretical guarantees, and methods that are not computationally & conceptually simple.

Review: Sequential Thompson Sampling in GP Bandits

Draw a sample g from the posterior. Choose x_t = argmax_x g(x).

Parallelised Thompson Sampling (Kandasamy et al. Arxiv 2017)

Asynchronous: asyTS. At any given time:
1. (x′, y′) ← wait for a worker to finish.
2. Compute the posterior GP.
3. Draw a sample g ∼ GP.
4. Re-deploy the worker at argmax g.

Synchronous: synTS. At any given time:
1. {(x′_m, y′_m)}_{m=1}^M ← wait for all workers to finish.
2. Compute the posterior GP.
3. Draw M samples g_m ∼ GP, ∀m.
4. Re-deploy worker m at argmax g_m, ∀m.
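The asyTS loop above can be sketched as an event-driven simulation on a toy finite-armed problem. The arm means, the exponential evaluation-time model, and the conjugate Gaussian posterior are illustrative assumptions; the real algorithm samples a whole function from a GP posterior.

```python
import heapq
import math
import random

random.seed(5)
f = [0.2, 0.9, 0.4]                      # true means (illustrative)
K, M, budget = len(f), 4, 1000
counts, sums = [0] * K, [0.0] * K

def ts_choice():
    # Sample a plausible mean per arm from its N(m, s2) posterior; play argmax.
    best, arm = -float("inf"), 0
    for i in range(K):
        s2 = 1.0 / (1.0 + counts[i])
        g = random.gauss(s2 * sums[i], math.sqrt(s2))
        if g > best:
            best, arm = g, i
    return arm

# (finish_time, arm) events; evaluation times are random, so workers finish
# out of order -- exactly the asynchronous setting.
events = [(random.expovariate(1.0), ts_choice()) for _ in range(M)]
heapq.heapify(events)
completed = 0
while completed < budget:
    t_now, x = heapq.heappop(events)         # wait for one worker to finish
    counts[x] += 1
    sums[x] += random.gauss(f[x], 1.0)       # observe its reward, update posterior
    completed += 1
    # Re-deploy that worker immediately, ignoring the M-1 in-flight results.
    heapq.heappush(events, (t_now + random.expovariate(1.0), ts_choice()))
```

Each redeployment uses the posterior as of the just-finished result, with up to M − 1 evaluations still in flight; the randomness of TS is what keeps the redeployed workers from piling onto the same point.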

Theoretical Results: number of evaluations

Sequential TS, SE kernel (Russo & van Roy 2014):

  E[S_n] ≲ √( vol(X) log(n)^d / n ).

Theorem: synTS & asyTS, SE kernel (Kandasamy et al. Arxiv 2017):

  E[S_n] ≲ √( M log(M)^{2d} / n ) + √( vol(X) log(n)^d / n ),

where n ← # completed arm pulls by all workers.

Why is this interesting?

◮ A sequential algorithm can make use of information from all previous rounds to determine where to evaluate next.
◮ A parallel algorithm could be missing up to M − 1 results at any given time. But randomisation helps!

Theoretical Results: Simple regret with time

Theorem (informal) (Kandasamy et al. Arxiv 2017): If evaluation times are all equal, asyTS ≈ synTS. Otherwise, the bounds for asyTS are strictly better than for synTS; the more variable the evaluation times, the bigger the difference.

◮ Bounded tail decay: constant factor.
◮ Sub-Gaussian tail decay: √log(M) factor.
◮ Sub-exponential tail decay: log(M) factor.

Experiment: Currin-Exponential-14D, M = 35

Evaluation times sampled from a Pareto-3 distribution.

[Figure: simple regret SR′(T) vs time units T for synRAND, synBUCB, synUCBPE, synTS, asyRAND, asyUCB, asyEI, asyHUCB, asyHTS, asyTS.]

Experiment: Hyper-parameter tuning on Cifar10, M = 4

Tune the number of filters in the range (32, 256) for each layer of a 6-layer CNN. Time taken for one evaluation: 4-16 minutes.

[Figure: validation accuracy vs time for synBUCB, synTS, asyRAND, asyEI, asyHUCB, asyTS.]

Summary

◮ Bandits are a framework for studying exploration vs exploitation trade-offs when optimising black-box functions.
◮ Smooth bandit formulations are more common in practical applications.
◮ Several algorithms: UCB, TS, index-based policies, ε-greedy, etc.
◮ Multi-fidelity bandits: allow us to use cheap approximations to an expensive experiment to quickly find the optimum.
◮ Parallelised TS: a simple and intuitive way to deal with multiple workers.

Akshay, Barnabás, Gautam, Jeff, Junier

Thank You

Slides: www.cs.cmu.edu/~kkandasa/misc/mora-slides.pdf