
Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models

  1. Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models
     Qing Zhao, Department of Electrical and Computer Engineering, University of California, Davis, CA 95616.
     Supported by NSF, ARO. Talk at UMD, October 2011.

  2. Multi-Armed Bandit
     ◮ N arms and a single player.
     ◮ Select one arm to play at each time.
     ◮ i.i.d. reward with unknown mean θ_i.
     ◮ Maximize the long-run reward.
     Exploitation vs. Exploration:
     ◮ Exploitation: play the arm with the largest sample mean.
     ◮ Exploration: play an arm to learn its reward statistics.
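To make the setup concrete, here is a minimal Python sketch of the model for Bernoulli rewards (the class name and interface are our own illustration, not from the talk):

```python
import random

class BernoulliBandit:
    """N arms; playing arm i yields an i.i.d. Bernoulli reward
    with unknown mean theta_i."""

    def __init__(self, thetas):
        self.thetas = list(thetas)   # true means, hidden from the player

    def play(self, arm):
        # reward 1 with probability theta_arm, else 0
        return 1 if random.random() < self.thetas[arm] else 0
```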

  3. Clinical Trial (Thompson'33)
     Two treatments with unknown effectiveness: which one to give each incoming patient?

  4. Dynamic Spectrum Access under Unknown Model
     [Figure: occupancy of channels 1 through N over time slots 0, 1, 2, 3, ..., T; white gaps mark spectrum opportunities.]
     ◮ N independent channels.
     ◮ Choose K channels to sense/access in each slot.
     ◮ Accessing an idle channel results in a unit reward.
     ◮ Channel occupancy: i.i.d. Bernoulli with unknown mean θ_i.

  5. Other Applications of MAB
     ◮ Web search
     ◮ Internet advertising/investment
     ◮ Queueing and scheduling
     ◮ Multi-agent systems
     [Figure: a multi-queue scheduling system with arrival rates λ_1 and λ_2.]

  6. Non-Bayesian Formulation
     Performance measure: regret.
     ◮ Θ ≜ (θ_1, ..., θ_N): unknown reward means (θ^(1) ≥ θ^(2) ≥ ... ≥ θ^(N) denote them in decreasing order).
     ◮ θ^(1)·T: maximum total reward by time T if Θ were known.
     ◮ V^π_T(Θ): total reward of policy π by time T.
     ◮ Regret (cost of learning):
       R^π_T(Θ) ≜ θ^(1)·T − V^π_T(Θ) = Σ_{i=2}^{N} (θ^(1) − θ^(i)) · E[time spent on θ^(i)].
     Objective: minimize the growth rate of R^π_T(Θ) with T.
     Sublinear regret ⟹ maximum average reward θ^(1).
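The regret decomposition above translates directly into code. A small sketch (function and argument names are illustrative):

```python
def regret(thetas, expected_plays):
    """R_T = sum over suboptimal arms of (theta^(1) - theta^(i)) times
    the expected time spent on arm i by the horizon T."""
    theta_best = max(thetas)
    return sum((theta_best - th) * n
               for th, n in zip(thetas, expected_plays))

# e.g. with true means [0.9, 0.8, 0.5], a policy that spends 30 and 12
# expected plays on the two suboptimal arms incurs regret
# 0.1 * 30 + 0.4 * 12 = 7.8
print(regret([0.9, 0.8, 0.5], [958, 30, 12]))
```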

  7. Classic Results
     ◮ Lai & Robbins '85:
       R*_T(Θ) ∼ [ Σ_{i>1} (θ^(1) − θ^(i)) / I(θ^(i), θ^(1)) ] · log T  as T → ∞,
       where I(θ^(i), θ^(1)) is the KL distance between the reward distributions.
       ✷ Optimal policies explicitly constructed for Gaussian, Bernoulli, Poisson, and Laplacian distributions.
     ◮ Agrawal '95:
       ✷ Order-optimal index policies explicitly constructed for Gaussian, Bernoulli, Poisson, Laplacian, and exponential distributions.
     ◮ Auer, Cesa-Bianchi & Fischer '02:
       ✷ Order-optimal index policies for distributions with finite support.
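For Bernoulli arms, the coefficient of log T in the Lai-Robbins bound is straightforward to evaluate numerically; a sketch (the example means are illustrative):

```python
from math import log

def kl_bernoulli(p, q):
    """KL distance I(p, q) between Bernoulli(p) and Bernoulli(q),
    assuming 0 < p, q < 1."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def lai_robbins_coefficient(thetas):
    """Coefficient of log T in the asymptotic lower bound on regret."""
    best = max(thetas)
    return sum((best - th) / kl_bernoulli(th, best)
               for th in thetas if th < best)

print(lai_robbins_coefficient([0.9, 0.8, 0.5]))  # ≈ 3.04
```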

  8. Classic Policies
     Key statistics:
     ◮ Sample mean θ̄_i(t) (exploitation);
     ◮ Number of plays τ_i(t) (exploration).
     In the classic policies, θ̄_i(t) and τ_i(t) are combined into a single index for arm selection at each t.
     UCB policy (Auer et al. '02): index_i(t) = θ̄_i(t) + √(2 log t / τ_i(t)).
     ◮ Such a fixed form is difficult to adapt to different reward models. (A sketch of the UCB policy follows.)
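As a reference point, a minimal sketch of the UCB policy with the index above, reusing the BernoulliBandit sketch from earlier (the initialization round of one play per arm is standard; the other details are our own):

```python
import math

def ucb(bandit, n_arms, horizon):
    """UCB index policy: play the arm maximizing
    sample mean + sqrt(2 log t / number of plays)."""
    means = [0.0] * n_arms   # sample means theta_bar_i(t) (exploitation)
    plays = [0] * n_arms     # play counts  tau_i(t)       (exploration)
    for i in range(n_arms):  # initialization: play each arm once
        means[i] = bandit.play(i)
        plays[i] = 1
    for t in range(n_arms + 1, horizon + 1):
        arm = max(range(n_arms),
                  key=lambda i: means[i]
                  + math.sqrt(2 * math.log(t) / plays[i]))
        r = bandit.play(arm)
        plays[arm] += 1
        means[arm] += (r - means[arm]) / plays[arm]   # running-mean update
    return plays

# e.g.: plays = ucb(BernoulliBandit([0.9, 0.8, 0.5]), 3, 10000)
```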

  9. Limitations of the Classic Policies
     ✷ Reward distributions limited to finite support or specific cases;
     ✷ A single player (equivalently, centralized multiple players);
     ✷ i.i.d. or rested Markov reward over successive plays of each arm.

  10. Recent Results
     Recent results address the limitations above: policies with a tunable parameter capable of handling
     ✷ a more general class of reward distributions (including heavy-tailed);
     ✷ decentralized MAB with partial reward observations;
     ✷ restless Markovian reward models.

  11. General Reward Distributions

  12. DSEE: Deterministic Sequencing of Exploration and Exploitation
     ◮ Time is partitioned into interleaving exploration and exploitation sequences (a sketch follows below).
     [Figure: timeline from t = 1 to T with exploration slots interleaved among exploitation slots.]
     ✷ Exploration: play all arms in round-robin.
     ✷ Exploitation: play the arm with the largest sample mean.
     ◮ A tunable parameter: the cardinality of the exploration sequence,
     ✷ which can be adjusted according to the "hardness" of the reward distributions.
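A minimal sketch of DSEE under these rules, assuming a logarithmic exploration cardinality (appropriate for the light-tailed case on the next slides); the exact schedule and the weight w are our illustrative choices, not necessarily the paper's:

```python
import math

def dsee(bandit, n_arms, horizon, w=1.0):
    """DSEE sketch: slot t is an exploration slot whenever the number of
    exploration slots used so far falls below w * log t; otherwise it is
    an exploitation slot. w is the tunable parameter."""
    means = [0.0] * n_arms   # sample mean of each arm
    plays = [0] * n_arms     # number of plays of each arm
    n_explore = 0            # exploration slots used so far
    rr = 0                   # round-robin pointer for exploration
    for t in range(1, horizon + 1):
        if n_explore < w * math.log(t + 1):   # log(t + 1) avoids log 1 = 0
            arm = rr % n_arms                 # explore: round-robin
            rr += 1
            n_explore += 1
        else:                                 # exploit: largest sample mean
            arm = max(range(n_arms), key=lambda i: means[i])
        r = bandit.play(arm)
        plays[arm] += 1
        means[arm] += (r - means[arm]) / plays[arm]   # running-mean update
    return plays
```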

  13. The Optimal Cardinality of Exploration
     The cardinality of the exploration sequence:
     ✷ is a lower bound on the regret order;
     ✷ should be the minimum x such that the regret incurred in exploitation is no larger than x.
     Should it be O(log T) or O(√T)?
     [Figure: the two candidate cardinalities, logarithmic and square-root, plotted over time T = 0 to 500.]

  14. Performance of DSEE
     When the moment generating functions of {f_i(x)} are properly bounded around 0:
     ◮ ∃ ζ > 0, u_0 > 0 s.t. ∀ u with |u| ≤ u_0: E[exp((X − θ)u)] ≤ exp(ζu²/2).
     ◮ DSEE achieves the optimal regret order O(log T).
     ◮ DSEE achieves a regret order arbitrarily close to logarithmic without any knowledge of the reward distributions.
     [Figure: moment generating function G(u), u ∈ [−0.4, 0.4], for chi-square, Gaussian (large and small variance), and uniform distributions.]
     When {f_i(x)} are heavy-tailed distributions:
     ◮ the moments of {f_i(x)} exist only up to the p-th order;
     ◮ DSEE achieves regret order O(T^{1/p}).
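In the dsee() sketch above, the heavy-tailed case would correspond to replacing the logarithmic exploration budget w · log t with a polynomial one such as w · t^(1/p), matching the O(T^{1/p}) regret order; this is our illustrative reading of the tunable-cardinality idea, not necessarily the paper's exact schedule.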

  15. Basic Idea in Regret Analysis
     Convergence rate of the sample mean X̄_s of s observations:
     ◮ Chernoff-Hoeffding bound ('63): for distributions with finite support [a, b],
       Pr(|X̄_s − θ| ≥ δ) ≤ 2 exp(−2δ²s/(b − a)²).
     ◮ Chernoff-Hoeffding-Agrawal bound ('95): for distributions with bounded MGF around 0,
       Pr(|X̄_s − θ| ≥ δ) ≤ 2 exp(−cδ²s), ∀ δ ∈ [0, ζu_0], c ∈ (0, 1/(2ζ)].
     ◮ Chow's bound ('75): for distributions having the p-th (p > 1) moment,
       Pr(|X̄_s − θ| > ε) = o(s^{1−p}).
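A quick empirical check of the first bound for a Bernoulli arm (support [a, b] = [0, 1]); all parameter values are illustrative:

```python
import math
import random

def hoeffding_check(theta=0.6, s=200, delta=0.1, trials=20000):
    """Estimate Pr(|sample mean - theta| >= delta) for Bernoulli(theta)
    from s observations, and compare to the Chernoff-Hoeffding bound
    2 * exp(-2 * delta^2 * s) for support [0, 1]."""
    hits = 0
    for _ in range(trials):
        mean = sum(random.random() < theta for _ in range(s)) / s
        if abs(mean - theta) >= delta:
            hits += 1
    print(f"empirical {hits / trials:.4f} <= bound "
          f"{2 * math.exp(-2 * delta**2 * s):.4f}")

hoeffding_check()   # the empirical tail probability sits well below 0.0366
```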

  16. Decentralized Bandit with Multiple Players

  17. Distributed Spectrum Sharing
     [Figure: occupancy of channels 1 through N over time slots 0, 1, 2, 3, ..., T; white gaps mark spectrum opportunities.]
     ◮ N channels, M (M < N) distributed secondary users (no information exchange).
     ◮ Primary occupancy of channel i: i.i.d. Bernoulli with unknown mean θ_i.
     ◮ Users accessing the same channel collide, and no one receives a reward.
     ◮ Objective: a decentralized policy for optimal network-level performance. (A sketch of the collision model follows.)
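The collision rule can be made concrete with a small sketch; we treat θ_i as the probability that channel i is idle in a slot, matching the unit-reward-for-idle-access convention of slide 4, and all names are illustrative:

```python
import random

def collision_round(choices, thetas):
    """One slot of the collision model: each user u has picked channel
    choices[u]; u earns a unit reward only if that channel is idle AND
    no other user picked the same channel."""
    idle = [random.random() < th for th in thetas]   # channel states
    rewards = [0] * len(choices)
    for u, ch in enumerate(choices):
        alone = sum(c == ch for c in choices) == 1
        if alone and idle[ch]:
            rewards[u] = 1
    return rewards

# e.g. 3 users on 5 channels: users 0 and 1 collide on channel 0 and get
# nothing; user 2 earns a reward iff channel 2 is idle this slot
print(collision_round([0, 0, 2], [0.9, 0.8, 0.7, 0.6, 0.5]))
```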

