Stochastic Bandits
Kirthevasan Kandasamy, Carnegie Mellon University
University of Moratuwa, Sri Lanka, August 17, 2017
Slides: www.cs.cmu.edu/~kkandasa/misc/mora-slides.pdf
Slides are also up on my webpage: www.cs.cmu.edu/~kkandasa
On-line advertising
You are given a pool of 250 ads.
Task:
◮ You can display one ad at a time (say, $10^6$ times in total).
◮ You wish to maximise the cumulative number of clicks, i.e. identify the ads with the highest click-through rate and display them most of the time.
The Stochastic Multi-armed Bandit
(Robbins, 1952)
◮ You are given $K$ arms, $\mathcal{X} = \{1, \dots, K\}$.
◮ At every round you "play/pull" an arm.
◮ When you play arm $x_t \in \mathcal{X}$ in round $t$, you receive a stochastic reward $y_t$, where $\mathbb{E}[y_t] = f(x_t)$.
◮ Goal: Maximise the cumulative sum of expected rewards,
$$\mathbb{E}\Big[ \sum_{t=1}^{n} y_t \Big] = \sum_{t=1}^{n} f(x_t).$$
◮ Goal: An algorithm (policy/strategy) which achieves "small" cumulative regret,
$$R_n = \sum_{t=1}^{n} f(x^\star) - \sum_{t=1}^{n} f(x_t) = \sum_{t=1}^{n} \big( f(x^\star) - f(x_t) \big),$$
where $x^\star = \operatorname{argmax}_{x \in \mathcal{X}} f(x)$.
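As a rough illustration of this setting, the sketch below simulates a K-armed bandit with Bernoulli rewards and accumulates the cumulative regret $R_n$. The arm means, the two toy policies and all function names are illustrative assumptions and do not come from the slides.

```python
import random

def run_bandit(policy, means, n, seed=0):
    """Play `policy` for n rounds on Bernoulli arms with the given means.

    Returns (total reward, cumulative regret R_n).
    """
    rng = random.Random(seed)
    f_star = max(means)                  # f(x*), expected reward of the best arm
    total_reward, regret = 0.0, 0.0
    for t in range(n):
        x_t = policy(t, rng)             # arm played in round t
        y_t = 1.0 if rng.random() < means[x_t] else 0.0   # E[y_t] = f(x_t)
        total_reward += y_t
        regret += f_star - means[x_t]    # f(x*) - f(x_t)
    return total_reward, regret

means = [0.1, 0.5, 0.8]                  # hypothetical click-through rates

def uniform_policy(t, rng):
    return rng.randrange(len(means))     # ignores all feedback

def oracle_policy(t, rng):
    return means.index(max(means))       # always plays x*; unavailable in practice

_, regret_uniform = run_bandit(uniform_policy, means, n=1000)
_, regret_oracle = run_bandit(oracle_policy, means, n=1000)
```

The oracle has zero regret by construction, while the uniform policy's regret grows linearly in n; a useful bandit algorithm should sit between the two, with regret growing sublinearly.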
Smooth Bandits
$f : \mathcal{X} \to \mathbb{R}$ is a black-box function that is accessible only via noisy evaluations. $\mathcal{X}$ is a metric space, e.g. $\mathbb{R}^d$.
Let $x^\star = \operatorname{argmax}_{x} f(x)$.
[Figure: $f(x)$ plotted against $x$, with the maximiser $x^\star$ and the value $f(x^\star)$ marked.]
Cumulative regret after $n$ evaluations: $R_n = \sum_{t=1}^{n} \big( f(x^\star) - f(x_t) \big)$.
Simple regret after $n$ evaluations: $S_n = f(x^\star) - \max_{t=1,\dots,n} f(x_t)$.
Applications
◮ Cosmological simulator: given an observation and cosmological parameters (e.g. Hubble constant, baryonic density), a likelihood computation returns a likelihood score.
◮ Neural network: given hyper-parameters, train the NN using those hyper-parameters and compute accuracy on the validation set; the output is the cross-validation accuracy.
◮ Expensive black-box function. Other examples:
  - Pre-clinical drug discovery
  - Optimal policy in autonomous driving
  - Synthetic gene design
Recap
Types of arms (domain $\mathcal{X}$)
1. K-armed bandit: $\mathcal{X}$ is a finite set.
2. Smooth bandit: $\mathcal{X}$ is a metric space (e.g. $\mathbb{R}^d$).
3. "Smooth K-armed" bandits: $\mathcal{X}$ is finite, but there is additional structure.
$f : \mathcal{X} \to \mathbb{R}$. On playing $x \in \mathcal{X}$ you observe $f(x) + \varepsilon$, where $\mathbb{E}[\varepsilon] = 0$.
Two notions of regret
1. Cumulative regret: $R_n = \sum_{t=1}^{n} \big( f(x^\star) - f(x_t) \big)$.
2. Simple regret: $S_n = f(x^\star) - \max_{t=1,\dots,n} f(x_t)$.
Other formalisms: contextual bandit, adversarial bandit, duelling bandit, linear bandit, best arm identification and several more ...
N.B.: Pulling/playing an arm = experiment = function evaluation.
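The two notions of regret are linked by a short inequality, worked out here from the definitions in the recap (it is not stated on the slides): the best of the $n$ evaluated points is at least as good as their average, so any bound on cumulative regret also yields a bound on simple regret.

```latex
S_n = f(x^\star) - \max_{t=1,\dots,n} f(x_t)
    \;\le\; f(x^\star) - \frac{1}{n} \sum_{t=1}^{n} f(x_t)
    \;=\; \frac{1}{n} \sum_{t=1}^{n} \big( f(x^\star) - f(x_t) \big)
    \;=\; \frac{R_n}{n}.
```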
Outline
◮ Part I: Stochastic bandits (cont'd)
  1. Gaussian processes for smooth bandits
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)
◮ Digression: SL2College Research Collaboration Program
◮ Part II: My research
  1. Multi-fidelity bandit: cheap approximations to an expensive experiment
  2. Parallelising arm pulls
Gaussian (Normal) distribution $N(\mu, \sigma^2)$
◮ A probability distribution for real-valued random variables.
◮ The mean $\mu$ and variance $\sigma^2$ completely characterise the distribution.
◮ For samples $X_1, \dots, X_n$, let $\hat{\mu} = \frac{1}{n} \sum_i X_i$ be the sample mean. Then $\hat{\mu} \pm 1.96 \, \frac{\sigma}{\sqrt{n}}$ is a 95% confidence interval for $\mu$.
◮ Can draw samples (e.g. in Matlab: mu + sigma * randn()).
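A minimal Python sketch of the sampling and confidence-interval recipe above, assuming the standard deviation $\sigma$ is known. The values of `mu`, `sigma` and the sample size are arbitrary choices for illustration; `rng.gauss(mu, sigma)` plays the role of the Matlab expression `mu + sigma * randn()`.

```python
import math
import random

def mean_confidence_interval(samples, sigma, z=1.96):
    """95% confidence interval for the mean when sigma is known."""
    n = len(samples)
    mu_hat = sum(samples) / n                 # sample mean
    half_width = z * sigma / math.sqrt(n)     # 1.96 * sigma / sqrt(n)
    return mu_hat - half_width, mu_hat + half_width

rng = random.Random(0)
mu, sigma = 2.0, 0.5                          # illustrative true parameters
samples = [rng.gauss(mu, sigma) for _ in range(1000)]
lo, hi = mean_confidence_interval(samples, sigma)
```

The width of the interval shrinks like $1/\sqrt{n}$, which is the same mechanism that lets confidence bounds tighten around frequently played arms.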
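Gaussian Processes (GP)
$\mathrm{GP}(\mu, \kappa)$: A distribution over functions from $\mathcal{X}$ to $\mathbb{R}$. Completely characterised by a mean function $\mu : \mathcal{X} \to \mathbb{R}$ and a covariance kernel $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.
[Figure sequence, $f(x)$ against $x$: functions with no observations; prior GP; observations; posterior GP given observations.]
After $t$ observations, $f(x) \sim N\big( \mu_t(x), \sigma_t^2(x) \big)$.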
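The posterior mean $\mu_t$ and standard deviation $\sigma_t$ have a closed form. The sketch below computes them with NumPy for a zero-mean GP with a squared-exponential (SE) kernel on one-dimensional inputs. The bandwidth, noise variance, unit prior variance and the toy observations are assumptions made for illustration; the slides do not specify them.

```python
import numpy as np

def se_kernel(a, b, bandwidth):
    """SE kernel matrix kappa(a_i, b_j) for 1-D inputs, with prior variance 1."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / bandwidth) ** 2)

def gp_posterior(X, y, Xq, bandwidth=0.2, noise_var=1e-4):
    """Posterior mean mu_t and std-dev sigma_t at query points Xq, f ~ GP(0, kappa)."""
    K = se_kernel(X, X, bandwidth) + noise_var * np.eye(len(X))
    Kq = se_kernel(X, Xq, bandwidth)              # shape (t, len(Xq))
    L = np.linalg.cholesky(K)                     # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Kq.T @ alpha                             # posterior mean
    V = np.linalg.solve(L, Kq)
    var = 1.0 - np.sum(V ** 2, axis=0)            # kappa(x, x) = 1 for this kernel
    return mu, np.sqrt(np.maximum(var, 0.0))      # clip round-off below zero

X = np.array([0.1, 0.5, 0.9])                     # toy evaluation points
y = np.sin(6 * X)                                 # toy observations
Xq = np.linspace(0.0, 1.0, 101)
mu, sd = gp_posterior(X, y, Xq)
```

Near the observed points the standard deviation collapses towards the noise level, and far from them it returns to the prior value of 1.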
Algorithm 1: Upper Confidence Bounds in GP Bandits
Model $f \sim \mathrm{GP}(0, \kappa)$.
Gaussian Process Upper Confidence Bound (GP-UCB) (Srinivas et al. 2010).
Construct the upper confidence bound: $\varphi_t(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$.
Maximise the upper confidence bound to obtain $x_t$.
[Figure: $f(x)$ against $x$, with the curve $\varphi_t = \mu_{t-1} + \beta_t^{1/2} \sigma_{t-1}$ and the point $x_t$ marked.]
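GP-UCB
$$x_t = \operatorname{argmax}_{x} \; \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$$
◮ $\mu_{t-1}$: Exploitation
◮ $\sigma_{t-1}$: Exploration
◮ $\beta_t$ controls the tradeoff. $\beta_t \asymp \log t$.
GP-UCB, $\kappa$ is an SE kernel (Srinivas et al. 2010): w.h.p.
$$S_n = f(x^\star) - \max_{t=1,\dots,n} f(x_t) \lesssim \sqrt{\frac{\log(n)^d \, \mathrm{vol}(\mathcal{X})}{n}}.$$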
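The GP-UCB rule can be sketched on a finite grid as follows. This is an illustration under stated assumptions, not the implementation of Srinivas et al.: the grid, the SE bandwidth, the noise level, the test function and the schedule `beta_t = 2 log(t + 1)` (chosen only so that $\beta_t$ grows like $\log t$) are all hypothetical.

```python
import numpy as np

def gp_posterior(X, y, Xq, bandwidth, noise_var):
    """Posterior mean and std-dev on Xq under GP(0, SE kernel with unit variance)."""
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / bandwidth) ** 2)
    K = k(X, X) + noise_var * np.eye(len(X))
    Kq = k(X, Xq)
    mu = Kq.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Kq * np.linalg.solve(K, Kq), axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def gp_ucb(f, grid, n, bandwidth=0.1, noise_sd=0.03, seed=0):
    """Run n rounds of GP-UCB over a finite grid; return the points and observations."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for t in range(1, n + 1):
        if X:
            mu, sd = gp_posterior(np.array(X), np.array(y), grid,
                                  bandwidth, noise_sd ** 2)
        else:
            mu, sd = np.zeros(len(grid)), np.ones(len(grid))   # prior GP(0, kappa)
        beta_t = 2.0 * np.log(t + 1)              # assumed schedule, grows like log t
        ucb = mu + np.sqrt(beta_t) * sd           # phi_t(x)
        x_t = grid[np.argmax(ucb)]                # maximise the upper confidence bound
        X.append(x_t)
        y.append(f(x_t) + noise_sd * rng.standard_normal())
    return np.array(X), np.array(y)

f = lambda x: np.exp(-((x - 0.7) ** 2) / 0.02)    # toy objective with maximum at 0.7
grid = np.linspace(0.0, 1.0, 201)
X, y = gp_ucb(f, grid, n=30)
simple_regret = f(grid).max() - f(X).max()        # S_n restricted to the grid
```

Early rounds are dominated by the $\sigma_{t-1}$ term, which spreads evaluations across the domain; later rounds concentrate where $\mu_{t-1}$ is large.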
GP-UCB
(Srinivas et al. 2010)
[Figure sequence: GP-UCB on $f(x)$ at t = 1, 2, 3, 4, 5, 6, 7, 11 and 25.]
Algorithm 2: Thompson Sampling in GP Bandits
Model $f \sim \mathrm{GP}(0, \kappa)$.
Thompson Sampling (TS) (Thompson, 1933).
Draw a sample $g$ from the posterior. Choose $x_t = \operatorname{argmax}_{x} g(x)$.
[Figure: $f(x)$ against $x$, with $x_t$ marked.]
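One round of Thompson sampling can be sketched on a finite grid: the GP posterior restricted to the grid is a multivariate Gaussian, so a sample $g$ is a single draw from it. The grid, kernel bandwidth, noise level and test function below are illustrative assumptions.

```python
import numpy as np

def se_kernel(a, b, bandwidth=0.1):
    """SE kernel matrix for 1-D inputs, with prior variance 1."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / bandwidth) ** 2)

def thompson_step(X, y, grid, rng, noise_var=1e-3):
    """Draw g from the GP posterior restricted to `grid`; return argmax_x g(x)."""
    mu = np.zeros(len(grid))
    cov = se_kernel(grid, grid)                   # prior GP(0, kappa)
    if len(X) > 0:
        K = se_kernel(X, X) + noise_var * np.eye(len(X))
        Kq = se_kernel(X, grid)
        mu = Kq.T @ np.linalg.solve(K, y)         # posterior mean on the grid
        cov = cov - Kq.T @ np.linalg.solve(K, Kq) # posterior covariance on the grid
    # Default SVD-based sampler; tiny negative eigenvalues from round-off are ignored.
    g = rng.multivariate_normal(mu, cov, check_valid="ignore")
    return grid[np.argmax(g)]

rng = np.random.default_rng(0)
f = lambda x: np.exp(-((x - 0.7) ** 2) / 0.02)    # toy objective
grid = np.linspace(0.0, 1.0, 101)
X, y = np.empty(0), np.empty(0)
for t in range(25):
    x_t = thompson_step(X, y, grid, rng)
    y_t = f(x_t) + 0.03 * rng.standard_normal()   # noisy evaluation
    X, y = np.append(X, x_t), np.append(y, y_t)
```

Unlike GP-UCB there is no $\beta_t$ to tune: exploration comes from the randomness of the posterior sample itself.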
Thompson Sampling (TS) in GPs
(Thompson, 1933)
[Figure sequence: Thompson sampling on $f(x)$ at t = 1, 2, 3, 4, 5, 6, 7, 11 and 25.]
SL2College
www.sl2college.org
SL2College Research Collaboration Program - Ashwin de Silva
www.sl2college.org/research-collab
research-collab@sl2college.org
How it works
We have a pool of doctoral/post-doctoral/professorial mentors (all Sri Lankan). We connect Sri Lankan undergrads to mentors, who will guide the students on a research project.
Aim: Publish a paper (at a good venue) within a 9-15 month time frame.
Application Process
◮ Fill out the application form on our webpage, www.sl2college.org/research-collab, and mention your areas of interest and preferred mentors.
◮ Email your CV to research-collab@sl2college.org.
◮ If we decide to proceed, we ask you to submit a ~1 page research statement covering
  - your research interests & future plans
  - why you are interested in working with your preferred mentor.
◮ We send your CV & statement to the mentor. If he/she is interested, we initiate a collaboration.
◮ You report to us once every 3 months.
SL2College Research Collaboration Team
Ashwin Nuwan Rajitha Umashanthi Kirthevasan
www.sl2college.org/research-collab research-collab@sl2college.org
Part 2.1: Multi-fidelity Bandits
Motivating question: What if we have cheap approximations to $f$?
1. Computational astrophysics and other scientific experiments: simulations and numerical computations with less granularity.
   [Figure: a cosmological simulator takes an observation and parameters (e.g. Hubble constant, baryonic density); a likelihood computation returns a likelihood score.]
2. Hyper-parameter tuning: Train & validate with a subset of the data.
3. Robotics & autonomous driving: computer simulation vs real-world experiment.
Multi-fidelity Methods
For specific applications,
◮ Industrial design (Forrester et al. 2007)
◮ Hyper-parameter tuning (Agarwal et al. 2011, Klein et al. 2015, Li et al. 2016)
◮ Active learning (Zhang & Chaudhuri 2015)
◮ Robotics (Cutler et al. 2014)
Multi-fidelity bandits & optimisation (Huang et al. 2006, Forrester et al. 2007, March & Wilcox 2012, Poloczek et al. 2016)
... with theoretical guarantees (Kandasamy et al. NIPS 2016a&b, Kandasamy et al. ICML 2017)
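Multi-fidelity Bandits
(Kandasamy et al. ICML 2017)
[Figure: $g(z, x)$ over the fidelity space $\mathcal{Z}$ and domain $\mathcal{X}$, with the slice $f(x)$ at $z_\bullet$ and the maximiser $x^\star$ marked.]
A fidelity space $\mathcal{Z}$ and domain $\mathcal{X}$:
  $\mathcal{Z}$ ← all granularity values
  $\mathcal{X}$ ← space of cosmological parameters
$g : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}$.
  $g(z, x)$ ← likelihood score when performing integrations on a grid of size $z$ at cosmological parameters $x$.
Denote $f(x) = g(z_\bullet, x)$ where $z_\bullet \in \mathcal{Z}$.
  $z_\bullet$ = highest grid size.
End goal: Find $x^\star = \operatorname{argmax}_{x} f(x)$.
A cost function, $\lambda : \mathcal{Z} \to \mathbb{R}_+$.
  $\lambda(z) = O(z^p)$ (say).
[Figure: the cost $\lambda(z)$ against $z \in \mathcal{Z}$, with $z_\bullet$ marked.]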
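Multi-fidelity Simple Regret
(Kandasamy et al. ICML 2017)
[Figure: $g(z, x)$ over $\mathcal{Z} \times \mathcal{X}$ with $f(x)$ at $z_\bullet$ and $x^\star$ marked, alongside the cost $\lambda(z)$ against $z$.]
End goal: Find $x^\star = \operatorname{argmax}_{x} f(x)$.
Simple regret after capital $\Lambda$:
$$S(\Lambda) = f(x^\star) - \max_{t \,:\, z_t = z_\bullet} f(x_t).$$
$\Lambda$ ← amount of a resource spent, e.g. computation time or money.
No reward for pulling an arm at low fidelities, but use cheap evaluations at $z \neq z_\bullet$ to speed up the search for $x^\star$.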
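To make the definition concrete, the sketch below computes $S(\Lambda)$ from a query history: only queries at the top fidelity $z_\bullet$ count towards the maximum, and only queries that fit within the capital $\Lambda$ are considered. The history format, the toy function and the convention of returning infinity when no top-fidelity query has been made are assumptions for illustration.

```python
def mf_simple_regret(history, f, f_star, z_top, capital):
    """Simple regret S(capital) for a history of (z_t, x_t, cost_t) in query order."""
    spent, best = 0.0, float("-inf")
    for z, x, cost in history:
        if spent + cost > capital:       # this query does not fit in the budget
            break
        spent += cost
        if z == z_top:                   # low-fidelity queries earn no reward
            best = max(best, f(x))
    # Assumed convention: infinite regret until z_top has been queried once.
    return float("inf") if best == float("-inf") else f_star - best

f = lambda x: -(x - 0.5) ** 2            # toy objective, maximum 0 at x = 0.5
history = [(0.1, 0.4, 1.0),              # cheap query at a low fidelity
           (1.0, 0.3, 10.0),             # expensive query at z_top = 1.0
           (0.1, 0.5, 1.0),
           (1.0, 0.5, 10.0)]

regret_small = mf_simple_regret(history, f, 0.0, z_top=1.0, capital=5.0)
regret_mid = mf_simple_regret(history, f, 0.0, z_top=1.0, capital=12.0)
regret_full = mf_simple_regret(history, f, 0.0, z_top=1.0, capital=22.0)
```

With a capital of 5 only the first cheap query is affordable, so no top-fidelity value exists yet; larger budgets admit the expensive queries and the regret drops.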
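Algorithm: BOCA
(Kandasamy et al. ICML 2017)
Model $g \sim \mathrm{GP}(0, \kappa)$ and compute the posterior GP:
  mean $\mu_{t-1} : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}$
  std-dev $\sigma_{t-1} : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}_+$
(1) $x_t$ ← maximise the upper confidence bound for $f(x) = g(z_\bullet, x)$:
$$x_t = \operatorname{argmax}_{x \in \mathcal{X}} \; \mu_{t-1}(z_\bullet, x) + \beta_t^{1/2} \sigma_{t-1}(z_\bullet, x)$$
(2) $$\mathcal{Z}_t \approx \{z_\bullet\} \cup \Big\{ z : \sigma_{t-1}(z, x_t) \ge \gamma(z) = \Big( \frac{\lambda(z)}{\lambda(z_\bullet)} \Big)^{q} \xi(z) \Big\}$$
(3) $z_t = \operatorname{argmin}_{z \in \mathcal{Z}_t} \lambda(z)$ (cheapest $z$ in $\mathcal{Z}_t$)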
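Steps (2) and (3) can be sketched for a finite set of candidate fidelities. This follows only the simplified rule written on the slide (the slide itself marks $\mathcal{Z}_t$ as approximate), and the posterior std-dev `sigma`, the cost `cost`, the function `xi` and the exponent `q` below are toy stand-ins supplied by the caller, not the quantities used in the paper.

```python
def choose_fidelity(z_candidates, z_top, sigma, cost, xi, q, x_t):
    """Pick the cheapest fidelity whose posterior std-dev at x_t is still large.

    sigma(z, x): posterior std-dev, cost(z): lambda(z), xi(z): as on the slide.
    """
    def gamma(z):
        # Threshold gamma(z) = (lambda(z) / lambda(z_top))^q * xi(z)
        return (cost(z) / cost(z_top)) ** q * xi(z)

    # Step (2): z_top is always allowed; add every z with sigma >= gamma.
    Z_t = [z_top] + [z for z in z_candidates if sigma(z, x_t) >= gamma(z)]
    # Step (3): the cheapest fidelity in Z_t.
    return min(Z_t, key=cost)

# Toy stand-ins, chosen only to exercise the rule.
cost = lambda z: z ** 2
xi = lambda z: 1.0 - z
candidates = [0.25, 0.5, 0.75]

z_uncertain = choose_fidelity(candidates, 1.0, lambda z, x: 0.5, cost, xi, 0.25, x_t=0.3)
z_confident = choose_fidelity(candidates, 1.0, lambda z, x: 0.1, cost, xi, 0.25, x_t=0.3)
```

While the posterior at $x_t$ is still uncertain, the cheapest informative fidelity is chosen; once every low fidelity is well understood at $x_t$, the rule falls back to $z_\bullet$.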
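Theoretical Results for BOCA
[Figure: two examples of $g(z, x)$ over $\mathcal{Z} \times \mathcal{X}$, each with $f(x)$ at $z_\bullet$ and $x^\star$ marked. "Good": large $h_{\mathcal{Z}}$. "Bad": small $h_{\mathcal{Z}}$.]
E.g.: For SE kernels, the bandwidth $h_{\mathcal{Z}}$ controls smoothness.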
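Theoretical Results for BOCA
GP-UCB, SE kernel (Srinivas et al. 2010): w.h.p.
$$S(\Lambda) \lesssim \sqrt{\frac{\mathrm{vol}(\mathcal{X})}{\Lambda}}$$
BOCA, SE kernel (Kandasamy et al. ICML 2017): w.h.p., $\forall \alpha > 0$,
$$S(\Lambda) \lesssim \sqrt{\frac{\mathrm{vol}(\mathcal{X}_\alpha)}{\Lambda}} + \sqrt{\frac{\mathrm{vol}(\mathcal{X})}{\Lambda^{2-\alpha}}}, \qquad \mathcal{X}_\alpha = \Big\{ x ;\; f(x^\star) - f(x) \lesssim C_\alpha \frac{1}{h_{\mathcal{Z}}} \Big\}$$
If $h_{\mathcal{Z}}$ is large (good approximations), $\mathrm{vol}(\mathcal{X}_\alpha) \ll \mathrm{vol}(\mathcal{X})$, and BOCA is much better than GP-UCB.
N.B.: Dropping constants and polylog terms.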
Experiment: Cosmological inference on Type Ia supernovae data
Estimate the Hubble constant, dark matter fraction & dark energy fraction by maximising the likelihood on $N_\bullet = 192$ data points. This requires numerical integration on a grid of size $G_\bullet = 10^6$. Approximate with $N \in [50, 192]$ or $G \in [10^2, 10^6]$ (2D fidelity space).
[Figure: results plot; only axis tick values survive in the source.]
Part 2.2: Parallelising arm pulls
[Figures: sequential arm pulls with one worker; parallel arm pulls with M workers (asynchronous); parallel arm pulls with M workers (synchronous).]
Why parallelisation?
◮ Computational experiments: infrastructure with 100s to 1000s of CPUs or GPUs.
◮ Drug discovery: High-throughput screening
Prior work: (Ginsbourger et al. 2011, Janusevskis et al. 2012, Wang et al. 2016, González et al. 2015, Desautels et al. 2014, Contal et al. 2013, Shah and Ghahramani 2015, Kathuria et al. 2016, Wang et al. 2017, Wu and Frazier 2016, Hernandez-Lobato et al. 2017)
Shortcomings (properties that prior methods often lack)
◮ Asynchronicity
◮ Theoretical guarantees
◮ Computationally & conceptually simple
Review: Sequential Thompson Sampling in GP Bandits
Draw a sample $g$ from the posterior. Choose $x_t = \operatorname{argmax}_{x} g(x)$.
[Figure: $f(x)$ against $x$, with $x_t$ marked.]
Parallelised Thompson Sampling
(Kandasamy et al. Arxiv 2017)
Asynchronous: asyTS
At any given time,
1. $(x', y')$ ← Wait for a worker to finish.
2. Compute posterior GP.
3. Draw a sample $g \sim \mathrm{GP}$.
4. Re-deploy worker at $\operatorname{argmax} g$.
Synchronous: synTS
At any given time,
1. $\{(x'_m, y'_m)\}_{m=1}^{M}$ ← Wait for all workers to finish.
2. Compute posterior GP.
3. Draw $M$ samples $g_m \sim \mathrm{GP}$, $\forall m$.
4. Re-deploy worker $m$ at $\operatorname{argmax} g_m$, $\forall m$.
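The asynchronous loop above can be sketched as an event-driven simulation with a priority queue of finish times. Only the scheduling is shown: steps 2 to 4 (posterior, sample, argmax) are hidden behind a hypothetical `propose(data, rng)` callback, and the toy `propose`, `evaluate` and `eval_time` functions are placeholders rather than a GP-based implementation.

```python
import heapq
import random

def asy_ts(propose, evaluate, eval_time, M, n, seed=0):
    """Asynchronous scheduling with M workers until n evaluations complete.

    propose(data, rng) stands in for: compute posterior, draw g, return argmax g.
    Returns the completed (x, y) pairs and the simulated finish time.
    """
    rng = random.Random(seed)
    data = []                                    # completed (x, y) pairs
    running = []                                 # heap of (finish_time, worker, x)
    for m in range(M):                           # deploy every worker once
        heapq.heappush(running, (eval_time(rng), m, propose(data, rng)))
    now = 0.0
    while len(data) < n:
        now, m, x = heapq.heappop(running)       # 1. wait for a worker to finish
        data.append((x, evaluate(x, rng)))
        x_new = propose(data, rng)               # 2-4. uses all completed results
        heapq.heappush(running, (now + eval_time(rng), m, x_new))
    return data, now

# Toy placeholders, for illustration only.
propose = lambda data, rng: rng.random()
evaluate = lambda x, rng: -(x - 0.3) ** 2 + 0.01 * rng.gauss(0.0, 1.0)
eval_time = lambda rng: rng.paretovariate(3)     # heavy-tailed evaluation times

data, finish_time = asy_ts(propose, evaluate, eval_time, M=4, n=20)
```

A synchronous variant would instead pop all M jobs before proposing the next batch, so each round lasts as long as its slowest evaluation.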
Theoretical Results: number of evaluations
Sequential TS, SE kernel (Russo & van Roy 2014):
$$\mathbb{E}[S_n] \lesssim \sqrt{\frac{\mathrm{vol}(\mathcal{X}) \log(n)^d}{n}}$$
Theorem: synTS & asyTS, SE kernel (Kandasamy et al. Arxiv 2017):
$$\mathbb{E}[S_n] \lesssim \frac{M \log(M)^{2d}}{n} + \sqrt{\frac{\mathrm{vol}(\mathcal{X}) \log(n)^d}{n}}$$
$n$ ← # completed arm pulls by all workers.
Why is this interesting?
- A sequential algorithm can make use of information from all previous rounds to determine where to evaluate next.
- A parallel algorithm could be missing up to $M - 1$ results at any given time.
But randomisation helps!
Theoretical Results: Simple regret with time
[Figures: asynchronous and synchronous schedules.]
Theorem (Informal) (Kandasamy et al. Arxiv 2017)
If evaluation times are the same, asyTS ≈ synTS. Otherwise, the bounds for asyTS are strictly better than those for synTS. The more variability in evaluation times, the bigger the difference.
- Bounded tail decay: constant factor
- Sub-gaussian tail decay: $\sqrt{\log(M)}$ factor
- Sub-exponential tail decay: $\log(M)$ factor
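Part of the intuition can be checked with a small simulation that counts how many evaluations finish within a time budget under each schedule. It illustrates throughput only, not the regret bounds themselves; the Pareto evaluation times, the number of workers and the budget are assumed values, and the same job durations are fed to both schedules.

```python
import random

def completed_async(times_per_worker, T):
    """Evaluations finished by time T when each worker runs its jobs back to back."""
    total = 0
    for times in times_per_worker:
        clock = 0.0
        for duration in times:
            clock += duration
            if clock > T:
                break
            total += 1
    return total

def completed_sync(times_per_worker, T):
    """Evaluations finished by time T when every batch waits for its slowest job."""
    M = len(times_per_worker)
    clock, total = 0.0, 0
    for batch in zip(*times_per_worker):
        clock += max(batch)                      # batch ends with the slowest worker
        if clock > T:
            break
        total += M
    return total

rng = random.Random(0)
M, T = 4, 50.0
times = [[rng.paretovariate(3) for _ in range(200)] for _ in range(M)]
n_async = completed_async(times, T)
n_sync = completed_sync(times, T)

constant = [[1.0] * 20 for _ in range(M)]        # identical evaluation times
n_async_const = completed_async(constant, 10.0)
n_sync_const = completed_sync(constant, 10.0)
```

With identical evaluation times the two schedules complete the same number of evaluations; with variable times the asynchronous schedule never completes fewer, because no worker idles while waiting for the slowest job in a batch.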
Experiment: Currin-Exponential-14D, M = 35
Evaluation time sampled from a Pareto-3 distribution.
[Figure: simple regret SR′(T) against time units (T). Methods compared: synRAND, synBUCB, synUCBPE, synTS, asyRAND, asyUCB, asyEI, asyHUCB, asyHTS, asyTS.]
Experiment: Hyper-parameter tuning in Cifar10, M = 4
Tune the # filters in the range (32, 256) for each layer in a 6-layer CNN. Time taken for an evaluation: 4 - 16 minutes.
[Figure: validation accuracy against time (s). Methods compared: synBUCB, synTS, asyRAND, asyEI, asyHUCB, asyTS.]
Summary
◮ Bandits are a framework for studying exploration vs exploitation trade-offs when optimising black-box functions.
◮ Smooth bandit formulations are more common in practical applications.
◮ Several algorithms: UCB, TS, index-based policies, ε-greedy etc.
◮ Multi-fidelity Bandits: Allow us to use cheap approximations to an expensive experiment to quickly find the optimum.
◮ Parallelised TS: Simple and intuitive way to deal with multiple workers.