On the Complexity of Best Arm Identification in Multi-Armed Bandit Models


  1. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models
     Aurélien Garivier, Institut de Mathématiques de Toulouse
     Information Theory, Learning and Big Data, Simons Institute, Berkeley, March 2015

  2. Simple Multi-Armed Bandit Model: Roadmap
     1 Simple Multi-Armed Bandit Model
     2 Complexity of Best Arm Identification
       Lower bounds on the complexities
       Gaussian Feedback
       Binary Feedback

  3. Simple Multi-Armed Bandit Model: The (stochastic) Multi-Armed Bandit Model
     Environment: K arms with parameters θ = (θ_1, ..., θ_K) such that, for any possible choice of arm a_t ∈ {1, ..., K} at time t, one receives the reward X_t = X_{a_t, t}, where, for any 1 ≤ a ≤ K and s ≥ 1, X_{a,s} ∼ ν_a, and the (X_{a,s})_{a,s} are independent.
     Reward distributions: ν_a ∈ F_a, a parametric family or not: canonical exponential family, general bounded rewards.
     Example: Bernoulli rewards, θ ∈ [0, 1]^K, ν_a = B(θ_a).
     Strategy: the agent's actions follow a dynamical strategy π = (π_1, π_2, ...) such that A_t = π_t(X_1, ..., X_{t-1}).
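To make the model concrete, here is a minimal simulation sketch (not part of the talk; the class and function names, such as BernoulliBandit and run, are illustrative):

```python
import numpy as np

class BernoulliBandit:
    """K arms with parameters theta = (theta_1, ..., theta_K): pulling arm a
    yields a reward distributed as B(theta_a), independently of the past."""
    def __init__(self, theta, seed=0):
        self.theta = np.asarray(theta, dtype=float)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # one sample X_{a,s} ~ nu_a = B(theta_a)
        return float(self.rng.random() < self.theta[a])

def run(strategy, env, T):
    """A dynamical strategy pi chooses A_t from past observations only."""
    arms, rewards = [], []
    for t in range(T):
        a = strategy.choose(t, arms, rewards)  # A_t = pi_t(X_1, ..., X_{t-1})
        x = env.pull(a)
        arms.append(a)
        rewards.append(x)
    return np.array(arms), np.array(rewards)
```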

  4. Simple Multi-Armed Bandit Model: Real challenges
     Randomized clinical trials: the original motivation since the 1930s; dynamic strategies can save resources.
     Recommender systems: advertisement, website optimization, news, blog posts, ...
     Computer experiments: large systems can be simulated in order to optimize some criterion over a set of parameters, but the simulation cost may be high, so that only few parameter choices are possible.
     Games and planning (tree-structured options).

  5. Simple Multi-Armed Bandit Model: Performance Evaluation, Cumulated Regret
     Cumulated reward: S_T = Σ_{t=1}^T X_t.
     Goal: choose π so as to maximize
       E[S_T] = E[ Σ_{t=1}^T Σ_{a=1}^K E[ X_t 1{A_t = a} | X_1, ..., X_{t-1} ] ] = Σ_{a=1}^K µ_a E[ N_a^π(T) ],
     where N_a^π(T) = Σ_{t ≤ T} 1{A_t = a} is the number of draws of arm a up to time T, and µ_a = E(ν_a).
     Regret minimization: maximizing E[S_T] is equivalent to minimizing
       R_T = T µ∗ − E[S_T] = Σ_{a : µ_a < µ∗} (µ∗ − µ_a) E[ N_a^π(T) ],
     where µ∗ = max{ µ_a : 1 ≤ a ≤ K }.
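A hedged sketch of how this decomposition can be used to measure a strategy's regret from the draw counts N_a(T) (the function name is illustrative):

```python
import numpy as np

def regret_from_counts(arms, theta):
    """R_T = sum over suboptimal arms of (mu* - mu_a) * N_a(T),
    computed from the sequence of arms pulled in one run."""
    theta = np.asarray(theta, dtype=float)
    counts = np.bincount(np.asarray(arms), minlength=len(theta))  # N_a(T)
    return float(np.sum((theta.max() - theta) * counts))
```

Combined with the simulation sketch above, averaging this quantity over independent runs estimates E[R_T].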

  6. Simple Multi-Armed Bandit Model: Upper Confidence Bound Strategies
     UCB [Lai&Robbins '85; Agrawal '95; Auer&al '02]
     Construct an upper confidence bound for the expected reward of each arm:
       S_a(t)/N_a(t) + sqrt( log(t) / (2 N_a(t)) )
     (estimated reward + exploration bonus), and choose the arm with the highest UCB.
     It is an index strategy [Gittins '79]. Its behavior is easily interpretable and intuitively appealing.
     Listen to Robert Nowak's talk tomorrow!
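A minimal sketch of this index policy, matching the formula above; the interface follows the illustrative run function from the earlier sketch:

```python
import numpy as np

class UCB:
    """Index strategy: estimated reward S_a(t)/N_a(t) plus exploration
    bonus sqrt(log(t) / (2 N_a(t)))."""
    def __init__(self, K):
        self.K = K
        self.sums = np.zeros(K)    # S_a(t)
        self.counts = np.zeros(K)  # N_a(t)

    def choose(self, t, arms, rewards):
        if arms:                           # record the previous observation
            self.sums[arms[-1]] += rewards[-1]
            self.counts[arms[-1]] += 1
        if t < self.K:                     # initialization: pull each arm once
            return t
        index = self.sums / self.counts + np.sqrt(np.log(t) / (2 * self.counts))
        return int(np.argmax(index))
```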

  7. Simple Multi-Armed Bandit Model: Optimality?
     Generalization of [Lai&Robbins '85].
     Theorem [Burnetas and Katehakis '96]: if π is a uniformly efficient strategy, then for any θ ∈ [0, 1]^K,
       lim inf_{T→∞} E[N_a(T)] / log(T) ≥ 1 / K_inf(ν_a, µ∗),
     where K_inf(ν_a, µ∗) = inf{ KL(ν_a, ν') : ν' ∈ F_a, E(ν') ≥ µ∗ }.
     Idea: change of distribution.
     (Figure on the slide: the set of distributions with vertices δ_0 and δ_1, showing ν_a and an alternative ν∗ with mean µ∗ at KL distance K_inf(ν_a, µ∗).)
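For Bernoulli arms (F_a the Bernoulli family) with θ_a < µ∗, the quantity K_inf(ν_a, µ∗) reduces to the binary relative entropy kl(θ_a, µ∗), so the lower bound is easy to evaluate numerically. The snippet below is an illustrative check, not from the talk:

```python
import math

def kl_bernoulli(p, q):
    """kl(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    eps = 1e-15
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Asymptotic lower bound on E[N_a(T)] for a suboptimal Bernoulli arm:
theta_a, mu_star, T = 0.4, 0.5, 10_000
print(math.log(T) / kl_bernoulli(theta_a, mu_star))  # roughly 457 draws
```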

  8. Simple Multi-Armed Bandit Model: Reaching Optimality, Empirical Likelihood
     The KL-UCB Algorithm (AoS 2013), joint work with O. Cappé, O-A. Maillard, R. Munos, G. Stoltz.
     Parameters: an operator Π_F : M_1(S) → F; a non-decreasing function f : N → R.
     Initialization: pull each arm of {1, ..., K} once.
     for t = K to T − 1 do
       compute for each arm a the quantity
         U_a(t) = sup{ E(ν) : ν ∈ F and KL( Π_F(ν̂_a(t)), ν ) ≤ f(t) / N_a(t) }
       pick an arm A_{t+1} ∈ argmax_{a ∈ {1,...,K}} U_a(t)
     end for
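In the Bernoulli case the optimization defining U_a(t) is one-dimensional and can be solved by bisection, since q ↦ kl(p̂, q) is increasing on [p̂, 1]. A minimal sketch (function names are illustrative, with f(t) = log t + log log t as on the slide, so t ≥ 2 is assumed):

```python
import math

def kl_bernoulli(p, q):
    eps = 1e-15
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(p_hat, n_a, t, iters=50):
    """U_a(t) = sup{ q >= p_hat : kl(p_hat, q) <= f(t) / N_a(t) }."""
    level = (math.log(t) + math.log(math.log(t))) / n_a   # f(t) / N_a(t)
    lo, hi = p_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

# Example: arm with empirical mean 0.3 after 20 draws, at time t = 100
print(klucb_index(0.3, 20, 100))  # an upper confidence bound above 0.3
```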

  9. Simple Multi-Armed Bandit Model: Regret bound
     Theorem: assume that F is the set of finitely supported probability distributions over S = [0, 1], that µ_a > 0 for all arms a, and that µ∗ < 1. There exists a constant M(ν_a, µ∗) > 0, depending only on ν_a and µ∗, such that, with the choice f(t) = log(t) + log log(t) for t ≥ 2, for all T ≥ 3,
       E[N_a(T)] ≤ log(T) / K_inf(ν_a, µ∗) + remainder terms of order (log T)^{4/5} log log(T),
     where the remainder terms are explicit and depend only on µ∗, K_inf(ν_a, µ∗), and M(ν_a, µ∗).


  11. Complexity of Best Arm Identification: Roadmap
     1 Simple Multi-Armed Bandit Model
     2 Complexity of Best Arm Identification
       Lower bounds on the complexities
       Gaussian Feedback
       Binary Feedback

  12. Complexity of Best Arm Identification: Best Arm Identification Strategies
     A two-armed bandit model is a pair ν = (ν_1, ν_2) of probability distributions ("arms") with respective means µ_1 and µ_2; a∗ = argmax_a µ_a is the (unknown) best arm.
     A strategy consists of:
     a sampling rule (A_t)_{t ∈ N}, where A_t ∈ {1, 2} is the arm chosen at time t (based on past observations); a sample Z_t ∼ ν_{A_t} is observed;
     a stopping rule τ indicating when the agent stops sampling the arms;
     a recommendation rule â_τ ∈ {1, 2} indicating which arm the agent believes is best (at the end of the interaction).
     In classical A/B testing, the sampling rule A_t is uniform on {1, 2} and the stopping rule τ = t is fixed in advance.
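The classical A/B-testing strategy mentioned above fits this template: uniform sampling, a stopping rule fixed in advance, and recommendation of the empirically best arm. A minimal sketch (names are illustrative):

```python
import numpy as np

def ab_test(nu, t_max, seed=0):
    """Sampling rule: alternate the two arms. Stopping rule: tau = t_max,
    fixed in advance. Recommendation rule: the larger empirical mean."""
    rng = np.random.default_rng(seed)
    sums, counts = np.zeros(2), np.zeros(2)
    for t in range(t_max):
        a = t % 2                                 # A_t: uniform sampling
        sums[a] += float(rng.random() < nu[a])    # Z_t ~ nu_{A_t} (Bernoulli arms)
        counts[a] += 1
    return int(np.argmax(sums / counts))          # hat{a}_tau
```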

  13. Complexity of Best Arm Identification: Best Arm Identification
     Joint work with Emilie Kaufmann and Olivier Cappé (Telecom ParisTech).
     Goal: design a strategy A = ((A_t), τ, â_τ) such that:
     Fixed-budget setting: τ = t is fixed, and p_t(ν) := P_ν(â_t ≠ a∗) is as small as possible.
     Fixed-confidence setting: P_ν(â_τ ≠ a∗) ≤ δ, and E_ν[τ] is as small as possible.
     See also: [Mannor&Tsitsiklis '04], [Even-Dar&al. '06], [Audibert&al. '10], [Bubeck&al. '11,'13], [Kalyanakrishnan&al. '12], [Karnin&al. '13], [Jamieson&al. '14], ...

  14. Complexity of Best Arm Identification: Two possible goals
     Goal: design a strategy A = ((A_t), τ, â_τ) such that:
     Fixed-budget setting: τ = t is fixed, and p_t(ν) := P_ν(â_t ≠ a∗) is as small as possible.
     Fixed-confidence setting: P_ν(â_τ ≠ a∗) ≤ δ, and E_ν[τ] is as small as possible.
     In the particular case of uniform sampling:
     Fixed-budget setting: a classical test of (µ_1 > µ_2) against (µ_1 < µ_2) based on t samples.
     Fixed-confidence setting: a sequential test of (µ_1 > µ_2) against (µ_1 < µ_2) with probability of error uniformly bounded by δ.
     [Siegmund '85]: sequential tests can save samples!
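To illustrate Siegmund's point, here is a hedged sketch of a δ-correct sequential test with uniform sampling of two Bernoulli arms; the Hoeffding-style anytime confidence radius and its constants are illustrative, not the tightest possible. It stops as soon as the empirical means are well separated, so easy problems need far fewer samples than a fixed-budget test calibrated for the worst case:

```python
import math
import numpy as np

def sequential_test(nu, delta, seed=0, n_max=10**6):
    """Uniform sampling of two Bernoulli arms; stops when the empirical gap
    exceeds twice an anytime confidence radius (union bound over time)."""
    rng = np.random.default_rng(seed)
    sums = np.zeros(2)
    for n in range(1, n_max + 1):
        sums += (rng.random(2) < np.asarray(nu))      # one sample from each arm
        eps = math.sqrt(math.log(4 * n * (n + 1) / delta) / (2 * n))
        gap = (sums[0] - sums[1]) / n
        if abs(gap) > 2 * eps:                        # stopping rule tau
            return (0 if gap > 0 else 1), 2 * n       # hat{a}_tau, samples used
    return int(np.argmax(sums)), 2 * n_max
```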

  15. Complexity of Best Arm Identification: The complexities of best-arm identification
     For a class M of bandit models, an algorithm A = ((A_t), τ, â_τ) is...
     consistent on M (fixed-budget setting) if, for all ν ∈ M, p_t(ν) = P_ν(â_t ≠ a∗) → 0 as t → ∞;
     δ-PAC on M (fixed-confidence setting) if, for all ν ∈ M, P_ν(â_τ ≠ a∗) ≤ δ.
     From the literature:
     fixed budget: p_t(ν) ≃ exp( −t / (C H(ν)) )   [Audibert&al. '10], [Bubeck&al. '11], [Bubeck&al. '13], ...
     fixed confidence: E_ν[τ] ≃ C' H'(ν) log(1/δ)   [Mannor&Tsitsiklis '04], [Even-Dar&al. '06], [Kalyanakrishnan&al. '12], ...
     ⇒ two complexities:
       κ_B(ν) = inf_{A consistent} ( lim sup_{t→∞} −(1/t) log p_t(ν) )^{−1}
       κ_C(ν) = inf_{A δ-PAC} lim sup_{δ→0} E_ν[τ] / log(1/δ)
     For a probability of error ≤ δ: budget t ≃ κ_B(ν) log(1/δ) in the fixed-budget setting, and E_ν[τ] ≃ κ_C(ν) log(1/δ) in the fixed-confidence setting.
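As an illustration of the fixed-budget definition (not from the talk), one can estimate p_t(ν) by Monte Carlo for a particular consistent strategy, here uniform sampling with t/2 samples per arm, and look at the empirical exponent −(1/t) log p_t(ν); since κ_B is an infimum over strategies, the inverse of this exponent only upper-bounds κ_B(ν):

```python
import math
import numpy as np

def error_probability(nu, t, runs=100_000, seed=0):
    """Monte Carlo estimate of p_t(nu) = P(hat{a}_t != a*) for uniform sampling."""
    rng = np.random.default_rng(seed)
    n = t // 2                                          # t/2 samples per arm
    means = rng.binomial(n, nu, size=(runs, 2)) / n     # empirical means
    return float(np.mean(np.argmax(means, axis=1) != int(np.argmax(nu))))

nu = [0.55, 0.45]
for t in (100, 200, 400, 800):
    p = error_probability(nu, t)
    print(t, p, -math.log(max(p, 1e-12)) / t)           # empirical exponent
```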

  16. Complexity of Best Arm Identification: Lower bounds on the complexities
     Changes of distribution.
     Theorem (how to use, and hide, the change of distribution): let ν and ν' be two bandit models with K arms such that, for all a, the distributions ν_a and ν'_a are mutually absolutely continuous. For any almost-surely finite stopping time σ with respect to (F_t),
       Σ_{a=1}^K E_ν[N_a(σ)] KL(ν_a, ν'_a) ≥ sup_{E ∈ F_σ} kl( P_ν(E), P_ν'(E) ),
     where kl(x, y) = x log(x/y) + (1 − x) log((1 − x)/(1 − y)).
     Useful remark: for all δ ∈ [0, 1], kl(δ, 1 − δ) ≥ log( 1 / (2.4 δ) ).
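A hedged illustration of how this theorem is typically applied (the specific alternative model below is an illustrative choice): for a δ-PAC algorithm on two Bernoulli arms with θ_1 > θ_2, take E = {â_τ = 1} and an alternative ν' that modifies only arm 2 so that it becomes the best arm. Then P_ν(E) ≥ 1 − δ and P_ν'(E) ≤ δ, so the theorem gives E_ν[N_2(τ)] · kl(θ_2, θ'_2) ≥ kl(δ, 1 − δ):

```python
import math

def kl_bernoulli(p, q):
    eps = 1e-15
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta1, theta2, delta = 0.6, 0.4, 0.05
theta2_alt = theta1 + 0.01              # illustrative alternative in which arm 2 is best
rhs = kl_bernoulli(delta, 1 - delta)    # >= log(1 / (2.4 delta)) by the remark above
print("E[N_2(tau)] >=", rhs / kl_bernoulli(theta2, theta2_alt))
```

Optimizing over the choice of the alternative model tightens this lower bound on the expected number of draws of the suboptimal arm.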
