SLIDE 1

On the Complexity of Best Arm Identification in Multi-Armed Bandit Models

Aurélien Garivier

Institut de Mathématiques de Toulouse

Information Theory, Learning and Big Data, Simons Institute, Berkeley, March 2015

SLIDE 2

Simple Multi-Armed Bandit Model

Roadmap

1. Simple Multi-Armed Bandit Model
2. Complexity of Best Arm Identification
   - Lower bounds on the complexities
   - Gaussian Feedback
   - Binary Feedback

SLIDE 3

Simple Multi-Armed Bandit Model

The (stochastic) Multi-Armed Bandit Model

Environment: K arms with parameters θ = (θ₁, ..., θ_K) such that, for any choice of arm a_t ∈ {1, ..., K} at time t, one receives the reward X_t = X_{a_t, t}, where X_{a,s} ∼ ν_a for all 1 ≤ a ≤ K and s ≥ 1, and the (X_{a,s})_{a,s} are independent.

Reward distributions: ν_a ∈ F_a, a parametric family or not (canonical exponential family, general bounded rewards).

Example: Bernoulli rewards, θ ∈ [0, 1]^K and ν_a = B(θ_a).

Strategy: the agent's actions follow a dynamic strategy π = (π₁, π₂, ...) such that A_t = π_t(X₁, ..., X_{t−1}).
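
A minimal sketch of this interaction protocol in Python, for the Bernoulli example (the class name and parameter values are illustrative, not from the talk):

```python
import numpy as np

class BernoulliBandit:
    """Environment: K arms; pulling arm a at any time returns X ~ B(theta_a),
    independently of everything else."""
    def __init__(self, theta, seed=0):
        self.theta = np.asarray(theta)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        return self.rng.binomial(1, self.theta[a])

env = BernoulliBandit(theta=[0.3, 0.5, 0.7])
print([env.pull(a) for a in (0, 1, 2)])  # one independent sample per arm
```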

SLIDE 4

Simple Multi-Armed Bandit Model

Real challenges

- Randomized clinical trials: the original motivation, since the 1930's; dynamic strategies can save resources
- Recommender systems: advertisement, website optimization, news, blog posts, ...
- Computer experiments: large systems can be simulated in order to optimize some criterion over a set of parameters, but the simulation cost may be high, so that only few parameter choices are possible
- Games and planning (tree-structured options)

SLIDE 5

Simple Multi-Armed Bandit Model

Performance Evaluation: Cumulated Regret

Cumulated reward: $S_T = \sum_{t=1}^{T} X_t$.

Goal: choose π so as to maximize

$$\mathbb{E}[S_T] = \sum_{t=1}^{T} \sum_{a=1}^{K} \mathbb{E}\Big[\mathbb{E}\big[X_t \mathbb{1}\{A_t = a\} \,\big|\, X_1, \dots, X_{t-1}\big]\Big] = \sum_{a=1}^{K} \mu_a\, \mathbb{E}\big[N_a^\pi(T)\big],$$

where $N_a^\pi(T) = \sum_{t \le T} \mathbb{1}\{A_t = a\}$ is the number of draws of arm a up to time T, and µ_a = E(ν_a).

Regret minimization: maximizing $\mathbb{E}[S_T]$ is equivalent to minimizing

$$R_T = T\mu^* - \mathbb{E}[S_T] = \sum_{a \,:\, \mu_a < \mu^*} (\mu^* - \mu_a)\, \mathbb{E}\big[N_a^\pi(T)\big],$$

where µ* = max{µ_a : 1 ≤ a ≤ K}.
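
As a quick illustration of this decomposition, the sketch below evaluates R_T from arm means and draw counts; all numbers are made up for the example:

```python
import numpy as np

# R_T = sum over suboptimal arms of (mu* - mu_a) * E[N_a(T)]
mu = np.array([0.5, 0.6, 0.8])        # arm means (illustrative)
counts = np.array([120, 230, 650])    # N_a(T) after T = 1000 draws (illustrative)
mu_star = mu.max()

regret = np.sum((mu_star - mu) * counts)  # the best arm contributes 0
print(regret)  # 0.3*120 + 0.2*230 = 82.0
```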

SLIDE 6

Simple Multi-Armed Bandit Model

Upper Confidence Bound Strategies

UCB [Lai&Robbins '85; Agrawal '95; Auer&al '02]

Construct an upper confidence bound for the expected reward of each arm:

$$U_a(t) = \underbrace{\frac{S_a(t)}{N_a(t)}}_{\text{estimated reward}} + \underbrace{\sqrt{\frac{\log(t)}{2 N_a(t)}}}_{\text{exploration bonus}}$$

Choose the arm with the highest UCB. This is an index strategy [Gittins '79]; its behavior is easily interpretable and intuitively appealing. Listen to Robert Nowak's talk tomorrow!
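
A minimal sketch of this index policy for Bernoulli arms; the initialization, tie-breaking and random generator below are implementation choices of mine, not from the talk:

```python
import numpy as np

def ucb(means, T, seed=0):
    """Pull the arm maximizing S_a(t)/N_a(t) + sqrt(log(t) / (2 N_a(t)))."""
    rng = np.random.default_rng(seed)
    K = len(means)
    sums, counts = np.zeros(K), np.zeros(K)
    for a in range(K):                      # initialization: pull each arm once
        sums[a] += rng.binomial(1, means[a])
        counts[a] += 1
    for t in range(K, T):
        index = sums / counts + np.sqrt(np.log(t) / (2 * counts))
        a = int(np.argmax(index))           # arm with the highest UCB
        sums[a] += rng.binomial(1, means[a])
        counts[a] += 1
    return counts

print(ucb([0.5, 0.6, 0.8], T=10_000))  # draws concentrate on the best arm
```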

SLIDE 7

Simple Multi-Armed Bandit Model

Optimality?

Generalization of [Lai&Robbins ’85]

Theorem [Burnetas and Katehakis, ’96]

If π is a uniformly efficient strategy, then for any θ ∈ [0, 1]^K,

$$\liminf_{T \to \infty} \frac{\mathbb{E}[N_a(T)]}{\log(T)} \ge \frac{1}{K_{\inf}(\nu_a, \mu^*)}, \quad \text{where } K_{\inf}(\nu_a, \mu^*) = \inf\big\{ KL(\nu_a, \nu') : \nu' \in \mathcal{F}_a,\ E(\nu') \ge \mu^* \big\}.$$

Idea: change of distribution.

[Figure: the simplex of distributions with vertices δ₀, δ_{1/2}, δ₁, showing ν_a and the distribution ν* achieving K_inf(ν_a, µ*).]

SLIDE 8

Simple Multi-Armed Bandit Model

Reaching Optimality: Empirical Likelihood

The KL-UCB Algorithm [AoS 2013, joint work with O. Cappé, O-A. Maillard, R. Munos, G. Stoltz]

Parameters: an operator Π_F : M₁(S) → F; a non-decreasing function f : N → R.
Initialization: pull each arm of {1, ..., K} once.
for t = K to T − 1 do
    compute for each arm a the quantity
    $$U_a(t) = \sup\Big\{ E(\nu) \,:\, \nu \in \mathcal{F} \text{ and } KL\big(\Pi_{\mathcal{F}}(\hat\nu_a(t)), \nu\big) \le \frac{f(t)}{N_a(t)} \Big\}$$
    pick an arm $A_{t+1} \in \operatorname*{arg\,max}_{a \in \{1, \dots, K\}} U_a(t)$
end for
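
For Bernoulli rewards the index takes a simple form: U_a(t) is the largest q ≥ p̂_a(t) with N_a(t) · kl(p̂_a(t), q) ≤ f(t), and since q ↦ kl(p̂, q) is increasing on [p̂, 1] it can be found by bisection. A sketch, assuming f(t) = log(t) + log log(t) as in the theorem on the next slide:

```python
import numpy as np

def kl_bern(p, q, eps=1e-12):
    """kl divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps); q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def klucb_index(p_hat, n, t, iters=50):
    """U_a(t) = sup{ q : n * kl(p_hat, q) <= f(t) }, by bisection on [p_hat, 1]."""
    level = (np.log(t) + np.log(max(np.log(t), 1e-12))) / n   # f(t) / N_a(t)
    lo, hi = p_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if kl_bern(p_hat, mid) <= level else (lo, mid)
    return lo

print(klucb_index(p_hat=0.4, n=50, t=1000))  # ~0.69: upper confidence bound on the mean
```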

SLIDE 9

Simple Multi-Armed Bandit Model

Regret bound

Theorem: Assume that F is the set of finitely supported probability distributions over S = [0, 1], that µ_a > 0 for all arms a and that µ* < 1. There exists a constant M(ν_a, µ*) > 0, depending only on ν_a and µ*, such that, with the choice f(t) = log(t) + log(log(t)) for t ≥ 2, for all T ≥ 3:

$$\mathbb{E}[N_a(T)] \le \frac{\log(T)}{K_{\inf}(\nu_a, \mu^\star)} + \frac{36}{(\mu^\star)^4}\,(\log T)^{4/5}\log(\log T) + \left(\frac{72}{(\mu^\star)^4} + \frac{2\mu^\star(1-\mu^\star)}{K_{\inf}(\nu_a, \mu^\star)^2}\right)(\log T)^{4/5}$$
$$\qquad + \frac{(1-\mu^\star)^2\, M(\nu_a, \mu^\star)}{2(\mu^\star)^2}\,(\log T)^{2/5} + \log(\log T)\left(\frac{1}{K_{\inf}(\nu_a, \mu^\star)} + \frac{2\mu^\star(1-\mu^\star)}{K_{\inf}(\nu_a, \mu^\star)^2}\right) + 4.$$
SLIDE 11

Complexity of Best Arm Identification

Roadmap

1. Simple Multi-Armed Bandit Model
2. Complexity of Best Arm Identification
   - Lower bounds on the complexities
   - Gaussian Feedback
   - Binary Feedback

SLIDE 12

Complexity of Best Arm Identification

Best Arm Identification Strategies

A two-armed bandit model is a pair ν = (ν₁, ν₂) of probability distributions ('arms') with respective means µ₁ and µ₂; a* = argmax_a µ_a is the (unknown) best arm.

A strategy consists of:
- a sampling rule (A_t)_{t∈N}, where A_t ∈ {1, 2} is the arm chosen at time t (based on past observations); a sample Z_t ∼ ν_{A_t} is observed
- a stopping rule τ indicating when the agent stops sampling the arms
- a recommendation rule â_τ ∈ {1, 2} indicating which arm the agent thinks is best (at the end of the interaction)

In classical A/B testing, the sampling rule A_t is uniform on {1, 2} and the stopping rule τ = t is fixed in advance.

SLIDE 13

Complexity of Best Arm Identification

Best Arm Identification

Joint work with Emilie Kaufmann and Olivier Cappé (Telecom ParisTech).

Goal: design a strategy A = ((A_t), τ, â_τ) such that:

- Fixed-budget setting: τ = t is fixed in advance; make p_t(ν) := P_ν(â_t ≠ a*) as small as possible.
- Fixed-confidence setting: guarantee P_ν(â_τ ≠ a*) ≤ δ; make E_ν[τ] as small as possible.

See also: [Mannor&Tsitsiklis '04], [Even-Dar&al '06], [Audibert&al '10], [Bubeck&al '11, '13], [Kalyanakrishnan&al '12], [Karnin&al '13], [Jamieson&al '14], ...

SLIDE 14

Complexity of Best Arm Identification

Two possible goals

Goal: design a strategy A = ((A_t), τ, â_τ) such that:

- Fixed-budget setting: τ = t is fixed in advance; make p_t(ν) := P_ν(â_t ≠ a*) as small as possible.
- Fixed-confidence setting: guarantee P_ν(â_τ ≠ a*) ≤ δ; make E_ν[τ] as small as possible.

In the particular case of uniform sampling:

- Fixed-budget setting: classical test of (µ₁ > µ₂) against (µ₁ < µ₂) based on t samples.
- Fixed-confidence setting: sequential test of (µ₁ > µ₂) against (µ₁ < µ₂) with probability of error uniformly bounded by δ.

[Siegmund '85]: sequential tests can save samples!

SLIDE 15

Complexity of Best Arm Identification

The complexities of best-arm identification

For a class M of bandit models, an algorithm A = ((A_t), τ, â_τ) is...

- consistent on M (fixed-budget setting) if ∀ν ∈ M, p_t(ν) = P_ν(â_t ≠ a*) → 0 as t → ∞;
- δ-PAC on M (fixed-confidence setting) if ∀ν ∈ M, P_ν(â_τ ≠ a*) ≤ δ.

From the literature:

- p_t(ν) ≃ exp(−t / (C·H(ν))) [Audibert&al '10], [Bubeck&al '11], [Bubeck&al '13], ...
- E_ν[τ] ≃ C′·H′(ν)·log(1/δ) [Mannor&Tsitsiklis '04], [Even-Dar&al '06], [Kalyanakrishnan&al '12], ...

⟹ two complexities:

$$\kappa_B(\nu) = \inf_{\mathcal{A} \text{ consistent}} \left( \limsup_{t \to \infty} -\frac{1}{t} \log p_t(\nu) \right)^{-1}, \qquad \kappa_C(\nu) = \inf_{\mathcal{A}\ \delta\text{-PAC}} \limsup_{\delta \to 0} \frac{\mathbb{E}_\nu[\tau]}{\log(1/\delta)}.$$

For a probability of error ≤ δ, the fixed-budget setting needs a budget t ≃ κ_B(ν) log(1/δ), while the fixed-confidence setting needs E_ν[τ] ≃ κ_C(ν) log(1/δ).

SLIDE 16

Complexity of Best Arm Identification Lower bounds on the complexities

Changes of distribution

Theorem (how to use, and hide, the change of distribution): Let ν and ν′ be two bandit models with K arms such that for all a, the distributions ν_a and ν′_a are mutually absolutely continuous. For any almost-surely finite stopping time σ with respect to (F_t),

$$\sum_{a=1}^{K} \mathbb{E}_\nu[N_a(\sigma)]\, KL(\nu_a, \nu'_a) \;\ge\; \sup_{\mathcal{E} \in \mathcal{F}_\sigma} \mathrm{kl}\big(\mathbb{P}_\nu(\mathcal{E}), \mathbb{P}_{\nu'}(\mathcal{E})\big),$$

where kl(x, y) = x log(x/y) + (1 − x) log((1 − x)/(1 − y)).

Useful remark: for all δ ∈ (0, 1), kl(δ, 1 − δ) ≥ log(1/(2.4 δ)).
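
A quick numerical check of the remark (kl below is a direct transcription of the formula above):

```python
import numpy as np

def kl(x, y):
    """kl(x, y) = x log(x/y) + (1 - x) log((1 - x)/(1 - y))."""
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

# kl(delta, 1 - delta) >= log(1 / (2.4 delta)) on a few values of delta:
for delta in (0.1, 0.05, 0.01, 0.001):
    print(delta, kl(delta, 1 - delta) >= np.log(1 / (2.4 * delta)))  # True
```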

SLIDE 17

Complexity of Best Arm Identification Lower bounds on the complexities

General lower bounds

Theorem 1: Let M be a class of two-armed bandit models that are continuously parametrized by their means, and let ν = (ν₁, ν₂) ∈ M.

- Fixed-budget setting: any consistent algorithm satisfies
  $$\limsup_{t \to \infty} -\frac{1}{t} \log p_t(\nu) \le K^*(\nu_1, \nu_2), \quad \text{where } K^*(\nu_1, \nu_2) = KL(\nu^*, \nu_1) = KL(\nu^*, \nu_2).$$
- Fixed-confidence setting: any δ-PAC algorithm satisfies
  $$\mathbb{E}_\nu[\tau] \ge \frac{1}{K_*(\nu_1, \nu_2)} \log\frac{1}{2.4\,\delta}, \quad \text{where } K_*(\nu_1, \nu_2) = KL(\nu_1, \nu_*) = KL(\nu_2, \nu_*).$$

Thus κ_B(ν) ≥ 1/K*(ν₁, ν₂) and κ_C(ν) ≥ 1/K_*(ν₁, ν₂).

SLIDE 18

Complexity of Best Arm Identification Gaussian Feedback

Gaussian Rewards: Fixed-Budget Setting

For fixed (known) values σ₁, σ₂, we consider Gaussian bandit models

$$\mathcal{M} = \big\{ \nu = \big(\mathcal{N}(\mu_1, \sigma_1^2), \mathcal{N}(\mu_2, \sigma_2^2)\big) : (\mu_1, \mu_2) \in \mathbb{R}^2,\ \mu_1 \ne \mu_2 \big\}.$$

Theorem 1 gives: κ_B(ν) ≥ 2(σ₁ + σ₂)²/(µ₁ − µ₂)².

A strategy allocating t₁ = ⌈σ₁ t/(σ₁ + σ₂)⌉ samples to arm 1 and t₂ = t − t₁ samples to arm 2, and recommending the empirical best arm, satisfies

$$\liminf_{t \to \infty} -\frac{1}{t} \log p_t(\nu) \ge \frac{(\mu_1 - \mu_2)^2}{2(\sigma_1 + \sigma_2)^2}.$$

Hence κ_B(ν) = 2(σ₁ + σ₂)²/(µ₁ − µ₂)².
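
A Monte Carlo sketch of this static allocation (parameters are illustrative); note that exp(−t(µ₁ − µ₂)²/(2(σ₁ + σ₂)²)) is only the asymptotic exponential rate, so the simulated error probability differs from it by a subexponential factor:

```python
import numpy as np

def static_allocation_error(mu1, mu2, s1, s2, t, runs=100_000, seed=1):
    """Give t1 = round(s1/(s1+s2) * t) samples to arm 1, t - t1 to arm 2,
    recommend the empirical best arm; estimate p_t(nu) by simulation."""
    rng = np.random.default_rng(seed)
    t1 = int(round(s1 / (s1 + s2) * t))
    t2 = t - t1
    m1 = rng.normal(mu1, s1 / np.sqrt(t1), size=runs)  # empirical means, arm 1
    m2 = rng.normal(mu2, s2 / np.sqrt(t2), size=runs)  # empirical means, arm 2
    best = 1 if mu1 > mu2 else 2
    return np.mean(np.where(m1 > m2, 1, 2) != best)

p = static_allocation_error(mu1=0.5, mu2=0.0, s1=1.0, s2=2.0, t=200)
print(p, np.exp(-200 * 0.5**2 / (2 * (1.0 + 2.0)**2)))  # error vs asymptotic rate
```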

SLIDE 19

Complexity of Best Arm Identification Gaussian Feedback

Gaussian Rewards: Fixed-confidence setting

The α-Elimination algorithm with exploration rate β(t, δ):

➜ chooses A_t in order to keep a proportion N₁(t)/t ≃ α
➜ if µ̂_a(t) is the empirical mean of the rewards obtained from arm a up to time t, and σ_t²(α) = σ₁²/⌈αt⌉ + σ₂²/(t − ⌈αt⌉), stops at
  $$\tau = \inf\Big\{ t \in \mathbb{N} : |\hat\mu_1(t) - \hat\mu_2(t)| > \sqrt{2\,\sigma_t^2(\alpha)\,\beta(t, \delta)} \Big\}$$
➜ recommends the empirical best arm â_τ = argmax_a µ̂_a(τ)

[Figure: a sample run showing µ̂₁(t) − µ̂₂(t) and the stopping threshold, t = 1 to 1000.]
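
A sketch of α-Elimination for Gaussian rewards, using the exploration rate β(t, δ) = log(t/δ) + 2 log log(6t) from the next slide; the way the sampling schedule tracks N₁(t)/t ≃ α is my own choice:

```python
import numpy as np

def alpha_elimination(mu, sigma, delta, t_max=100_000, seed=2):
    """Keep N1(t)/t ~ alpha = sigma1/(sigma1+sigma2); stop when
    |mu1_hat - mu2_hat| > sqrt(2 sigma_t^2(alpha) beta(t, delta))."""
    rng = np.random.default_rng(seed)
    alpha = sigma[0] / (sigma[0] + sigma[1])
    sums, counts = np.zeros(2), np.zeros(2, dtype=int)
    for t in range(1, t_max + 1):
        a = 0 if counts[0] < alpha * t else 1      # track the target proportion
        sums[a] += rng.normal(mu[a], sigma[a])
        counts[a] += 1
        if min(counts) == 0:
            continue
        var_t = sigma[0]**2 / counts[0] + sigma[1]**2 / counts[1]  # sigma_t^2(alpha)
        beta = np.log(t / delta) + 2 * np.log(np.log(6 * t))
        if abs(sums[0] / counts[0] - sums[1] / counts[1]) > np.sqrt(2 * var_t * beta):
            break
    return t, 1 + int(np.argmax(sums / counts))    # stopping time, recommendation

print(alpha_elimination(mu=[0.5, 0.0], sigma=[1.0, 1.0], delta=0.05))
```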

SLIDE 20

Complexity of Best Arm Identification Gaussian Feedback

Gaussian Rewards: Fixed-confidence setting

From Theorem 1:

$$\mathbb{E}_\nu[\tau] \ge \frac{2(\sigma_1 + \sigma_2)^2}{(\mu_1 - \mu_2)^2} \log\frac{1}{2.4\,\delta}.$$

σ₁/(σ₁ + σ₂)-Elimination with β(t, δ) = log(t/δ) + 2 log log(6t) is δ-PAC and, for every ε > 0,

$$\mathbb{E}_\nu[\tau] \le (1 + \epsilon)\, \frac{2(\sigma_1 + \sigma_2)^2}{(\mu_1 - \mu_2)^2} \log\frac{1}{2.4\,\delta} + o_{\epsilon,\, \delta \to 0}\big(\log(1/\delta)\big).$$

Hence κ_C(ν) = 2(σ₁ + σ₂)²/(µ₁ − µ₂)².

SLIDE 21

Complexity of Best Arm Identification Gaussian Feedback

Gaussian Rewards: Conclusion

For any two fixed values of σ₁ and σ₂:

$$\kappa_B(\nu) = \kappa_C(\nu) = \frac{2(\sigma_1 + \sigma_2)^2}{(\mu_1 - \mu_2)^2}.$$

If the variances are equal (σ₁ = σ₂ = σ):

$$\kappa_B(\nu) = \kappa_C(\nu) = \frac{8\sigma^2}{(\mu_1 - \mu_2)^2}.$$

- Uniform sampling is optimal only when σ₁ = σ₂.
- 1/2-Elimination is δ-PAC for a smaller exploration rate β(t, δ) ≃ log(log(t)/δ).

SLIDE 22

Complexity of Best Arm Identification Binary Feedback

Binary Rewards: Lower Bounds

M = { ν = (B(µ₁), B(µ₂)) : (µ₁, µ₂) ∈ (0, 1)², µ₁ ≠ µ₂ }; shorthand: K(µ, µ′) = KL(B(µ), B(µ′)).

- Fixed-budget setting: any consistent algorithm satisfies
  $$\limsup_{t \to \infty} -\frac{1}{t} \log p_t(\nu) \le K^*(\mu_1, \mu_2) \quad \text{(Chernoff information)}.$$
- Fixed-confidence setting: any δ-PAC algorithm satisfies
  $$\mathbb{E}_\nu[\tau] \ge \frac{1}{K_*(\mu_1, \mu_2)} \log\frac{1}{2.4\,\delta}.$$

Moreover, K*(µ₁, µ₂) > K_*(µ₁, µ₂).
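
Both quantities can be computed by bisection: for K*, the map m ↦ K(m, µ₁) − K(m, µ₂) is monotone on (µ₁, µ₂) and vanishes at the Chernoff point; for K_*, the same holds for m ↦ K(µ₁, m) − K(µ₂, m). A sketch:

```python
import numpy as np

def kl_b(p, q):
    """K(p, q) = KL(B(p), B(q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def zero(g, lo, hi, iters=60):
    """Zero of an increasing function g on [lo, hi], by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return (lo + hi) / 2

def K_star(mu1, mu2):        # Chernoff information: K(m, mu1) = K(m, mu2)
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    m = zero(lambda x: kl_b(x, lo) - kl_b(x, hi), lo, hi)
    return kl_b(m, mu1)

def K_lower_star(mu1, mu2):  # K_*: K(mu1, m) = K(mu2, m)
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    m = zero(lambda y: kl_b(lo, y) - kl_b(hi, y), lo, hi)
    return kl_b(mu1, m)

print(K_star(0.4, 0.6), K_lower_star(0.4, 0.6))  # K* > K_*
```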

SLIDE 23

Complexity of Best Arm Identification Binary Feedback

Binary Rewards: Uniform Sampling

Any algorithm:
- fixed-budget (consistent): p_t(ν) ≃ e^{−K*(µ₁, µ₂) t}
- fixed-confidence (δ-PAC): E_ν[τ]/log(1/δ) ≃ 1/K_*(µ₁, µ₂)

Algorithms using uniform sampling:
- fixed-budget: p_t(ν) ≃ e^{−[(K(µ, µ₁) + K(µ, µ₂))/2] t}, with µ = f(µ₁, µ₂)
- fixed-confidence: E_ν[τ]/log(1/δ) ≃ 2/(K(µ₁, µ) + K(µ₂, µ)), with µ = (µ₁ + µ₂)/2

Remark: within each setting, the two quantities appear to be close to one another ⟹ for binary rewards, uniform sampling is close to optimal.

SLIDE 25

Complexity of Best Arm Identification Binary Feedback

Binary Rewards: Fixed-Budget Setting

In fact, κ_B(ν) = 1/K*(µ₁, µ₂). The algorithm using uniform sampling and recommending the empirical best arm is very close to optimal.

SLIDE 26

Complexity of Best Arm Identification Binary Feedback

Binary Rewards: Fixed-Confidence Setting

δ-PAC algorithms using uniform sampling satisfy

$$\frac{\mathbb{E}_\nu[\tau]}{\log(1/\delta)} \ge \frac{1}{I_*(\nu)}, \quad \text{with } I_*(\nu) = \frac{K\big(\mu_1, \frac{\mu_1+\mu_2}{2}\big) + K\big(\mu_2, \frac{\mu_1+\mu_2}{2}\big)}{2}.$$

The algorithm using uniform sampling and the stopping rule

$$\tau = \inf\Big\{ t \in 2\mathbb{N}^* : |\hat\mu_1(t) - \hat\mu_2(t)| > \sqrt{\tfrac{2}{t} \log\big(\tfrac{\log(t) + 1}{\delta}\big)} \Big\}$$

is δ-PAC but not optimal: E[τ]/log(1/δ) ≃ 2/(µ₁ − µ₂)² > 1/I_*(ν).

A better stopping rule, NOT based on the difference of empirical means:

$$\tau = \inf\Big\{ t \in 2\mathbb{N}^* : t\, I_*\big(\hat\mu_1(t), \hat\mu_2(t)\big) > \log\big(\tfrac{\log(t) + 1}{\delta}\big) \Big\}.$$
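
A sketch comparing the two stopping rules on simulated Bernoulli arms, with uniform sampling (one draw of each arm per round; the parameter values are illustrative). For means far from 1/2 the I_*-based rule stops markedly earlier, as the comparison 2/(µ₁ − µ₂)² > 1/I_*(ν) above predicts:

```python
import numpy as np

def kl_b(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps); q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def I_star(x, y):
    """I_*(x, y) = ( K(x, (x+y)/2) + K(y, (x+y)/2) ) / 2."""
    m = (x + y) / 2
    return (kl_b(x, m) + kl_b(y, m)) / 2

def ab_test(mu1, mu2, delta, rule, t_max=1_000_000, seed=3):
    """Uniform sampling; 'gap' uses |mu1_hat - mu2_hat|, 'llr' uses t * I_*."""
    rng = np.random.default_rng(seed)
    s1 = s2 = 0
    for n in range(1, t_max // 2 + 1):
        s1 += rng.binomial(1, mu1)
        s2 += rng.binomial(1, mu2)
        t = 2 * n                                   # t in 2N*
        m1, m2 = s1 / n, s2 / n
        threshold = np.log((np.log(t) + 1) / delta)
        if rule == "gap":
            stop = abs(m1 - m2) > np.sqrt(2 * threshold / t)
        else:
            stop = t * I_star(m1, m2) > threshold
        if stop:
            return t, 1 if m1 > m2 else 2
    return t_max, 1 if s1 > s2 else 2

for rule in ("gap", "llr"):
    print(rule, ab_test(0.15, 0.05, delta=0.01, rule=rule))  # 'llr' stops earlier
```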

SLIDE 27

Complexity of Best Arm Identification Binary Feedback

Binary Rewards: Conclusion

Regarding the complexities:

$$\kappa_B(\nu) = \frac{1}{K^*(\mu_1, \mu_2)}, \qquad \kappa_C(\nu) \ge \frac{1}{K_*(\mu_1, \mu_2)} > \frac{1}{K^*(\mu_1, \mu_2)},$$

thus κ_C(ν) > κ_B(ν).

Regarding the algorithms:

- There is not much to gain by departing from uniform sampling.
- In the fixed-confidence setting, a sequential test based on the difference of the empirical means is no longer optimal.

SLIDE 28

Complexity of Best Arm Identification Binary Feedback

Conclusion

➜ the complexities κ_B(ν) and κ_C(ν) are not always equal (and feature different informational quantities)
➜ strategies using random stopping do not necessarily lead to a saving in terms of the number of samples used
➜ for Bernoulli distributions, and for Gaussian distributions with similar variances, strategies using uniform sampling are (almost) optimal
➜ generalization to identifying the m best arms among K arms

SLIDE 29

Complexity of Best Arm Identification Binary Feedback

Elements of Bibliography (see references therein!)

1. [Lai&Robbins '85] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
2. [Agrawal '95] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054-1078, 1995.
3. [Auer&al '02] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.
4. [Even-Dar&al '06] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. JMLR, 7:1079-1105, 2006.
5. [Audibert&al '09] J-Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19), 2009.
6. [Filippi&al '10] S. Filippi, O. Cappé, and A. Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In Allerton Conf. on Communication, Control, and Computing, Monticello, US, 2010.
7. [Cappé&al '13] O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516-1541, 2013.
8. [Abbasi-Yadkori&al '11] Y. Abbasi-Yadkori, D. Pál, and Cs. Szepesvári. Online least squares estimation with self-normalized processes: an application to bandit problems. arXiv:1102.2670, 2011.
9. [Bubeck&Cesa-Bianchi '12] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
10. [Jamieson&al '14] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil' UCB: an optimal exploration algorithm for multi-armed bandits. COLT 2014, pp. 423-439.
11. [Kaufmann&al '15] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best arm identification in multi-armed bandit models. arXiv:1407.4443.