Best Arm Identification in Multi-Armed Bandits




1. Best Arm Identification in Multi-Armed Bandits
Jean-Yves Audibert (1,2), Sébastien Bubeck (3), Rémi Munos (3)
1 Univ. Paris Est, Imagine; 2 CNRS/ENS/INRIA, Willow project; 3 INRIA Lille, SequeL team
Outline: Framework, Lower Bound, Algorithms, Experiments, Conclusion

2. Best arm identification task
Parameters available to the forecaster: the number of rounds n and the number of arms K.
Parameters unknown to the forecaster: the reward distributions (over [0, 1]) ν_1, ..., ν_K of the arms. We assume that there is a unique arm i* with maximal mean.
For each round t = 1, 2, ..., n:
(1) The forecaster chooses an arm I_t ∈ {1, ..., K}.
(2) The environment draws the reward Y_t from ν_{I_t} (independently from the past given I_t).
At the end of the n rounds the forecaster outputs a recommendation J_n ∈ {1, ..., K}.
Goal: find the best arm, i.e., the arm with maximal mean.
Regret: e_n = P(J_n ≠ i*).
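As a concrete reading of this protocol, here is a minimal simulation sketch (not from the talk). It assumes Bernoulli reward distributions, and the forecaster interface (choose_arm, observe, recommend) is a hypothetical convention introduced only for these sketches.

```python
import random

def run_best_arm_identification(forecaster, means, n, seed=0):
    """Fixed-budget protocol: n pulls, then a single recommendation J_n.

    `means` are the Bernoulli parameters of the K arms (unknown to the forecaster).
    `forecaster` is any object with the hypothetical methods choose_arm(t),
    observe(arm, reward) and recommend().
    """
    rng = random.Random(seed)
    for t in range(1, n + 1):
        i = forecaster.choose_arm(t)                        # I_t in {0, ..., K-1}
        reward = 1.0 if rng.random() < means[i] else 0.0    # Y_t drawn from nu_{I_t}
        forecaster.observe(i, reward)
    j_n = forecaster.recommend()                            # recommendation J_n
    i_star = max(range(len(means)), key=lambda i: means[i])
    return j_n == i_star   # success indicator; its expectation is 1 - e_n
```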

3. Motivating examples
Clinical trials for cosmetic products: during the test phase, several formulæ for a cream are tested sequentially, and after a finite time one is chosen for commercialization.
Channel allocation for mobile phone communications: cellphones can explore the set of channels to find the best one to operate on. Each evaluation of a channel is noisy, and there is a limited number of evaluations before the communication starts on the chosen channel.

4. Summary of the talk
Let µ_i be the mean of ν_i, and ∆_i = µ_{i*} − µ_i the suboptimality gap of arm i.
Main theoretical result: it requires on the order of H = Σ_{i ≠ i*} 1/∆_i² rounds to find the best arm. (This result is well known for K = 2.)
We present two new forecasters, Successive Rejects (SR) and Adaptive UCB-E (Upper Confidence Bound Exploration). SR is parameter-free and has optimal guarantees (up to a logarithmic factor). Adaptive UCB-E has no theoretical guarantees, but it experimentally outperforms SR.
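For concreteness, here is a small helper (my own illustration, not from the talk) that computes the gaps ∆_i and the hardness measure H from a vector of means.

```python
def complexity_H(means):
    """Return the gaps Delta_i = mu_{i*} - mu_i and H = sum_{i != i*} 1 / Delta_i^2.

    Assumes a unique arm with maximal mean, as in the framework slide.
    """
    i_star = max(range(len(means)), key=lambda i: means[i])
    gaps = [means[i_star] - m for m in means]
    H = sum(1.0 / g ** 2 for i, g in enumerate(gaps) if i != i_star)
    return gaps, H

# Hypothetical example (these means are not taken from the talk):
gaps, H = complexity_H([0.5, 0.45, 0.45, 0.40])
print(H)   # 1/0.05^2 + 1/0.05^2 + 1/0.10^2 = 900.0
```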

5. Lower Bound
Theorem. Let ν_1, ..., ν_K be Bernoulli distributions with parameters in [1/3, 2/3]. There exists a numerical constant c > 0 such that, for any forecaster, up to a permutation of the arms,
e_n ≥ exp( −c (1 + o(1)) n log(K) / H ).
Informally, any algorithm requires at least (on the order of) H / log(K) rounds to find the best arm.

6. Lower Bound
Theorem. Let ν_1, ..., ν_K be Bernoulli distributions with parameters in [1/3, 2/3]. There exists a numerical constant c > 0 such that, for any forecaster, up to a permutation of the arms,
e_n ≥ exp( −c (1 + K log(K)/√n) n log(K) / H ).
Informally, any algorithm requires at least (on the order of) H / log(K) rounds to find the best arm.

7. Uniform strategy
For each i ∈ {1, ..., K}, select arm i during ⌊n/K⌋ rounds. Let J_n ∈ argmax_{i ∈ {1,...,K}} X̂_{i, ⌊n/K⌋}.
Theorem. The uniform strategy satisfies e_n ≤ 2K exp( −n min_i ∆_i² / (2K) ). For any (δ_1, ..., δ_K) with min_i δ_i ≤ 1/2, there exist distributions such that ∆_1 = δ_1, ..., ∆_K = δ_K and
e_n ≥ (1/2) exp( −8 n min_i ∆_i² / K ).
Informally, the uniform strategy finds the best arm with (on the order of) K / min_i ∆_i² rounds. For large K, this can be significantly larger than H = Σ_{i ≠ i*} 1/∆_i².
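A minimal sketch of the uniform strategy, assuming Bernoulli rewards and n ≥ K (my own illustration):

```python
import random

def uniform_strategy(means, n, seed=0):
    """Pull each arm floor(n/K) times, then recommend the arm with the highest
    empirical mean. Assumes n >= K so that every arm is pulled at least once."""
    rng = random.Random(seed)
    K = len(means)
    pulls = n // K                                          # floor(n / K) rounds per arm
    emp_means = []
    for i in range(K):
        total = sum(1.0 if rng.random() < means[i] else 0.0 for _ in range(pulls))
        emp_means.append(total / pulls)
    return max(range(K), key=lambda i: emp_means[i])        # J_n
```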

8. UCB-E
Draw each arm once. For each round t = K+1, K+2, ..., n:
Draw I_t ∈ argmax_{i ∈ {1,...,K}} ( X̂_{i, T_i(t−1)} + √( (n/H) / (2 T_i(t−1)) ) ),
where T_i(t−1) is the number of times arm i has been pulled up to time t−1.
Let J_n ∈ argmax_{i ∈ {1,...,K}} X̂_{i, T_i(n)}.
Theorem. UCB-E satisfies e_n ≤ n exp( −n / (50 H) ).
UCB-E finds the best arm with (on the order of) H rounds, but it requires the knowledge of H = Σ_{i ≠ i*} 1/∆_i².
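A sketch of UCB-E as stated on this slide, assuming Bernoulli rewards, the exploration bonus √((n/H)/(2 T_i(t−1))) reconstructed above, and H given to the algorithm. This is an illustration written for this transcript, not the authors' code.

```python
import math
import random

def ucb_e(means, n, H, seed=0):
    """UCB-E: after one pull per arm, repeatedly pull the arm maximizing
    empirical mean + sqrt((n/H) / (2 * T_i)). Requires knowledge of H."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K            # T_i(t-1): number of pulls of arm i so far
    sums = [0.0] * K            # cumulative reward of arm i

    def pull(i):
        r = 1.0 if rng.random() < means[i] else 0.0
        counts[i] += 1
        sums[i] += r

    for i in range(K):          # draw each arm once
        pull(i)
    for t in range(K + 1, n + 1):
        ucb = [sums[i] / counts[i] + math.sqrt((n / H) / (2 * counts[i])) for i in range(K)]
        pull(max(range(K), key=lambda i: ucb[i]))
    return max(range(K), key=lambda i: sums[i] / counts[i])   # J_n: best empirical mean
```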

9. Successive Rejects (SR)
Let loḡ(K) = 1/2 + Σ_{i=2}^{K} 1/i, A_1 = {1, ..., K}, n_0 = 0 and n_k = ⌈ (1/loḡ(K)) (n − K)/(K + 1 − k) ⌉ for k ∈ {1, ..., K−1}.
For each phase k = 1, 2, ..., K−1:
(1) For each i ∈ A_k, select arm i during n_k − n_{k−1} rounds.
(2) Let A_{k+1} = A_k \ argmin_{i ∈ A_k} X̂_{i, n_k}, where X̂_{i,s} denotes the empirical mean of arm i after s pulls.
Let J_n be the unique element of A_K.
Motivation for choosing n_k: consider µ_1 > µ_2 = ··· = µ_M ≫ µ_{M+1} = ··· = µ_K. Target: draw each of the M best arms n/M times. SR: the M best arms are drawn more than n_{K−M+1} ≈ (1/loḡ(K)) n/M times.
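The phase lengths n_k and the elimination rule translate directly into code. The sketch below is my own illustration under the reconstruction above (Bernoulli rewards, n > K, ties in the elimination step broken arbitrarily):

```python
import math
import random

def successive_rejects(means, n, seed=0):
    """Successive Rejects: K-1 phases; after each phase, discard the empirically
    worst surviving arm. The last surviving arm is the recommendation J_n.
    Assumes n > K so that every phase length n_k is at least 1."""
    rng = random.Random(seed)
    K = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))    # log-bar(K)

    def n_k(k):                                              # phase budgets, k = 1..K-1
        return math.ceil((n - K) / (log_bar * (K + 1 - k)))

    active = list(range(K))                                  # A_1 = {1, ..., K}
    counts = [0] * K
    sums = [0.0] * K
    prev = 0                                                 # n_0 = 0
    for k in range(1, K):
        for i in active:
            for _ in range(n_k(k) - prev):                   # n_k - n_{k-1} pulls of arm i
                r = 1.0 if rng.random() < means[i] else 0.0
                counts[i] += 1
                sums[i] += r
        prev = n_k(k)
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)                                 # A_{k+1} = A_k \ {worst}
    return active[0]                                         # J_n
```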

10. Successive Rejects (SR)
(Same algorithm as on the previous slide.)
Theorem. SR satisfies e_n ≤ K exp( −n / (4 H loḡ(K)) ).

11. UCB-E
Parameter: exploration constant c > 0.
Draw each arm once. For each round t = 1, 2, ..., n:
Draw I_t ∈ argmax_{i ∈ {1,...,K}} ( X̂_{i, T_i(t−1)} + √( c n / (H T_i(t−1)) ) ),
where T_i(t−1) is the number of times arm i has been pulled up to time t−1.
Let J_n ∈ argmax_{i ∈ {1,...,K}} X̂_{i, T_i(n)}.

12. Adaptive UCB-E
Parameter: exploration constant c > 0.
For each round t = 1, 2, ..., n:
(1) Compute an (under)estimate Ĥ_t of H.
(2) Draw I_t ∈ argmax_{i ∈ {1,...,K}} ( X̂_{i, T_i(t−1)} + √( c n / (Ĥ_t T_i(t−1)) ) ).
Let J_n ∈ argmax_{i ∈ {1,...,K}} X̂_{i, T_i(n)}.
Overestimating H ⇒ low exploration of the arms ⇒ the optimal arm may be missed ⇒ all ∆_i badly estimated.
Underestimating H ⇒ higher exploration ⇒ not focusing enough on the best arms ⇒ bad estimation of H = Σ_{i ≠ i*} 1/∆_i².
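The slide does not spell out how Ĥ_t is computed, so the sketch below uses a crude plug-in estimate built from the empirical gaps as a hypothetical stand-in for the estimator used in the paper; everything else follows the UCB-E step above (Bernoulli rewards, one initial pull per arm to avoid division by zero).

```python
import math
import random

def adaptive_ucb_e(means, n, c=1.0, seed=0):
    """Adaptive UCB-E sketch: replace H in the UCB-E bonus by an estimate H_hat_t.
    The plug-in estimate below (empirical gaps, floored at 1e-3) is a hypothetical
    stand-in, not the construction from the paper."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K

    def pull(i):
        r = 1.0 if rng.random() < means[i] else 0.0
        counts[i] += 1
        sums[i] += r

    for i in range(K):                                       # one initial pull per arm
        pull(i)
    for t in range(K + 1, n + 1):
        emp = [sums[i] / counts[i] for i in range(K)]
        best = max(emp)
        # (1) crude estimate of H from the empirical gaps
        gaps = [max(best - m, 1e-3) for m in emp]
        H_hat = sum(1.0 / g ** 2 for i, g in enumerate(gaps) if emp[i] < best) or 1.0
        # (2) UCB-E step with H replaced by H_hat
        ucb = [emp[i] + math.sqrt(c * n / (H_hat * counts[i])) for i in range(K)]
        pull(max(range(K), key=lambda i: ucb[i]))
    return max(range(K), key=lambda i: sums[i] / counts[i])  # J_n
```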

13. Experiments with Bernoulli distributions
Experiment 5: arithmetic progression, K = 15, µ_i = 0.5 − 0.025 i for i ∈ {1, ..., 15}.
Experiment 7: three groups of bad arms, K = 30, µ_1 = 0.5, µ_2:6 = 0.45, µ_7:20 = 0.43, µ_21:30 = 0.38.
[Figures: probability of error of the 14 tested strategies, for Experiment 5 with n = 4000 and Experiment 7 with n = 6000. Legend: 1: Unif, 2–4: HR, 5: SR, 6–9: UCB-E, 10–14: Ad UCB-E.]
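The two configurations above can be reproduced with a short Monte Carlo sketch. It reuses the uniform_strategy and successive_rejects functions from the earlier sketches, so the error estimates it prints are simulation-based and will not match the plotted numbers exactly.

```python
def experiment_means(name):
    """Bernoulli means for the two experiments described on this slide."""
    if name == "exp5":                                        # arithmetic progression, K = 15
        return [0.5 - 0.025 * i for i in range(1, 16)]
    if name == "exp7":                                        # three groups of bad arms, K = 30
        return [0.5] + [0.45] * 5 + [0.43] * 14 + [0.38] * 10
    raise ValueError(name)

def error_probability(strategy, means, n, runs=1000):
    """Monte Carlo estimate of e_n = P(J_n != i*) for a strategy(means, n, seed)."""
    i_star = max(range(len(means)), key=lambda i: means[i])
    errors = sum(strategy(means, n, seed=s) != i_star for s in range(runs))
    return errors / runs

# Example usage (estimates only, not the talk's figures):
# print(error_probability(uniform_strategy, experiment_means("exp7"), n=6000))
# print(error_probability(successive_rejects, experiment_means("exp7"), n=6000))
```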

14. Conclusion
It requires at least H / log(K) rounds to find the best arm, with H = Σ_{i ≠ i*} 1/∆_i².
UCB-E requires only (of order of) H log(n) rounds, but it needs the knowledge of H to tune its parameter.
SR is a parameter-free algorithm that requires less than (of order of) H log(2K) rounds to find the best arm.
Adaptive UCB-E has no theoretical guarantees, but it experimentally outperforms SR.
