Delay and Cooperation in Nonstochastic Bandits
  1. Delay and Cooperation in Nonstochastic Bandits. Nicolò Cesa-Bianchi, Università degli Studi di Milano. Joint work with: Claudio Gentile and Alberto Minora (Varese), Yishay Mansour (Tel-Aviv).

  2. Themes: learning with partial and delayed feedback; distributed online learning; the trade-off between quality and quantity of feedback information.

  3–6. The nonstochastic bandit problem: a sequential decision problem with $K$ actions and an unknown deterministic assignment of losses to actions,
$$\ell_t = \big(\ell_t(1), \dots, \ell_t(K)\big) \in [0,1]^K \qquad \text{for } t = 1, 2, \dots$$
For $t = 1, 2, \dots$:
1. the player picks an action $I_t$ (possibly using randomization) and incurs the loss $\ell_t(I_t)$;
2. the player gets partial information: only $\ell_t(I_t)$ is revealed.
[Slide figure: a row of $K$ hidden losses; after the play, only the chosen action's loss (e.g. 0.3) is uncovered.]
Applications: ad placement, recommender systems, online auctions, ...

  7. Regret. The regret of a randomized agent playing $I_1, I_2, \dots$ is
$$R_T \stackrel{\text{def}}{=} \mathbb{E}\left[\sum_{t=1}^T \ell_t(I_t)\right] - \min_{i=1,\dots,K} \sum_{t=1}^T \ell_t(i)$$
and we want $R_T = o(T)$. Lower bound: $R_T \gtrsim \sqrt{KT}$.
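To make the quantity concrete, here is a minimal Python sketch (mine, not from the talk) computing the realized regret of a play sequence against the best fixed action in hindsight; the array layout is an assumption, and note that $R_T$ on the slide also takes an expectation over the agent's randomization, while this computes one realization.

```python
import numpy as np

def empirical_regret(losses: np.ndarray, plays: np.ndarray) -> float:
    """Realized regret of plays I_1..I_T against the best fixed action.

    losses: (T, K) array with losses[t, i] = loss of action i at round t.
    plays:  length-T integer array of chosen actions I_t.
    """
    T = losses.shape[0]
    incurred = losses[np.arange(T), plays].sum()   # total loss of played actions
    best_fixed = losses.sum(axis=0).min()          # best single action in hindsight
    return float(incurred - best_fixed)
```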

  8–9. The Exp3 algorithm [Auer et al., 2002]. Agent's strategy:
$$P_t(I_t = i) \propto \exp\left(-\eta \sum_{s=1}^{t-1} \widehat{\ell}_s(i)\right), \qquad i = 1, \dots, K$$
with the importance-weighted loss estimate
$$\widehat{\ell}_t(i) = \begin{cases} \dfrac{\ell_t(i)}{P_t\big(\ell_t(i) \text{ observed}\big)} & \text{if } I_t = i \\[1ex] 0 & \text{otherwise.} \end{cases}$$
Only one component of $\widehat{\ell}_t$ is non-zero. Properties of the importance weighting estimator: unbiasedness, $\mathbb{E}_t\big[\widehat{\ell}_t(i)\big] = \ell_t(i)$; variance control, $\mathbb{E}_t\big[\widehat{\ell}_t(i)^2\big] \le 1 / P_t\big(\ell_t(i) \text{ observed}\big)$.
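A minimal sketch of Exp3 with this estimator (my own illustration; in the bandit setting $P_t(\ell_t(i) \text{ observed}) = P_t(I_t = i)$, so the update divides the observed loss by the sampling probability of the played arm):

```python
import numpy as np

def exp3(losses: np.ndarray, eta: float, seed: int = 0) -> float:
    """Exp3: exponential weights over cumulative importance-weighted loss estimates."""
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    L_hat = np.zeros(K)                           # cumulative estimates sum_s hat_ell_s(i)
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (L_hat - L_hat.min()))  # shift by the min for numerical stability
        p = w / w.sum()                           # P_t(I_t = i)
        i = rng.choice(K, p=p)                    # play I_t
        total += losses[t, i]                     # only this entry is revealed to the player
        L_hat[i] += losses[t, i] / p[i]           # importance-weighted update
    return total
```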

  10–11. Regret bounds. Matching the lower bound up to logarithmic factors:
$$R_T \le \frac{\ln K}{\eta} + \frac{\eta}{2}\, \mathbb{E}\left[\sum_{t=1}^T \sum_{i=1}^K P_t(I_t = i)\, \mathbb{E}_t\big[\widehat{\ell}_t(i)^2\big]\right] \le \frac{\ln K}{\eta} + \frac{\eta}{2}\, \mathbb{E}\left[\sum_{t=1}^T \sum_{i=1}^K \frac{P_t(I_t = i)}{P_t\big(\ell_t(i) \text{ is observed}\big)}\right] = \frac{\ln K}{\eta} + \frac{\eta}{2} KT = \sqrt{KT \ln K}$$
The full information (experts) setting: the agent observes the whole loss vector $\ell_t$ after each play, so $P_t\big(\ell_t(i) \text{ is observed}\big) = 1$ and $R_T \lesssim \sqrt{T \ln K}$.
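The last equality hides the tuning of $\eta$; filling in the step (my reconstruction, with the constant the slide suppresses): the bound $\frac{\ln K}{\eta} + \frac{\eta}{2} KT$ is minimized by balancing the two terms,
$$\eta = \sqrt{\frac{2 \ln K}{KT}} \qquad \text{giving} \qquad R_T \le \sqrt{2\, KT \ln K}\,.$$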

  12. Learning with delayed losses. At the end of each round $t > d$ the agent pays $\ell_t(I_t)$ and observes $\ell_{t-d}(I_{t-d})$. Upper bound [Neu et al., 2010; Joulani et al., 2013]:
$$R_T \lesssim \sqrt{(d+1)\, KT}$$
Proof (by reduction): run $d+1$ instances of a bandit algorithm for the standard (no delay) setting in parallel; at each time step $t = (d+1) r + s$, use instance $s+1$ for the current play — see the sketch below. Since each instance plays only once every $d+1$ rounds, its delayed feedback arrives before its next play, so each instance faces an undelayed problem of length about $T/(d+1)$. Lower bound:
$$\max\left\{\sqrt{KT},\; \sqrt{(d+1)\, T \ln K}\right\} = \Omega\big(\sqrt{(d+K)\, T}\big)$$
where the first term is the bandit lower bound and the second the delayed-experts lower bound.
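A sketch of the reduction, under an assumed base-learner interface with act() and update() methods (the interface is mine, not from the talk):

```python
from collections import deque
import numpy as np

def delayed_play_by_reduction(learners, losses: np.ndarray, d: int) -> float:
    """Round-robin over d+1 no-delay bandit learners; instance s handles the
    rounds t with t % (d+1) == s. Each learner exposes act() and update(a, loss)."""
    T = losses.shape[0]
    pending = deque()                  # (round, instance, action) awaiting feedback
    total = 0.0
    for t in range(T):
        s = t % (d + 1)
        a = learners[s].act()
        total += losses[t, a]
        pending.append((t, s, a))
        while pending and pending[0][0] <= t - d:   # feedback for round t - d arrives now
            t0, s0, a0 = pending.popleft()
            learners[s0].update(a0, losses[t0, a0])
    return total
```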

  13–14. Simpler and better solution: delayed importance sampling estimates. Run Exp3 and make an importance-weighted update whenever a loss becomes available:
$$\widehat{\ell}_t(i) = \begin{cases} \dfrac{\ell_{t-d}(i)}{P_{t-d}\big(\ell_{t-d}(i) \text{ observed}\big)} & \text{if } I_{t-d} = i \\[1ex] 0 & \text{otherwise.} \end{cases}$$
Regret bound:
$$R_T \lesssim d + \sqrt{(d+K)\, T \ln K}$$
matching the lower bound up to logarithmic factors. (A minimal sketch of the update rule follows.)
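My reconstruction of the idea above as code: the only change from plain Exp3 is that the update at round t uses the action and sampling distribution from round t − d.

```python
import numpy as np

def delayed_exp3(losses: np.ndarray, eta: float, d: int, seed: int = 0) -> float:
    """Exp3 updated with d-round-old importance-weighted loss estimates."""
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    L_hat = np.zeros(K)
    history = {}                                  # round -> (action, distribution)
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (L_hat - L_hat.min()))
        p = w / w.sum()
        i = rng.choice(K, p=p)
        total += losses[t, i]
        history[t] = (i, p)
        if t >= d:                                # loss from round t - d is now available
            j, q = history.pop(t - d)
            L_hat[j] += losses[t - d, j] / q[j]   # divide by P_{t-d}(I_{t-d} = j)
    return total
```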

  15. Properties of the delayed loss estimate. Recall the key step in the Exp3 analysis (a.k.a. "bandit magic"):
$$\sum_{i=1}^K \frac{P_t(I_t = i)}{P_t\big(\ell_t(i) \text{ is observed}\big)} = K$$
(in the bandit setting $\ell_t(i)$ is observed exactly when $I_t = i$, so each ratio equals $1$). For the delayed loss estimate we have
$$\sum_{i=1}^K \frac{P_t(I_t = i)}{P_{t-d}\big(\ell_{t-d}(i) \text{ is observed}\big)} \le eK \qquad \text{for } \eta \le \frac{1}{eK(d+1)}$$

  16. Cooperation with delay. $N$ agents sit on the vertices of an unknown communication graph $G = (V, E)$. The agents cooperate to solve a common bandit problem, each running an instance of the same bandit algorithm. [Slide figure: an example communication graph on ten numbered vertices.]

  17. Some related work: cooperative nonstochastic bandits without delays [Awerbuch and Kleinberg, 2008]; cooperative stochastic bandits on dynamic P2P networks [Szorenyi et al., 2013]; stochastic bandits that compete for shared resources (cognitive radio networks); distributed gradient descent.

  18–19. The communication protocol with fixed delay d. For each $t = 1, \dots, T$, each agent $v \in V$ does the following:
1. Plays an action $I_t(v)$ drawn according to its private distribution $p_t(v)$, observing the loss $\ell_t\big(I_t(v)\big)$ (the loss vector is the same for all agents).
2. Sends to its neighbors the message $m_t(v) = \big\langle t,\, v,\, I_t(v),\, \ell_t(I_t(v)),\, p_t(v) \big\rangle$.
3. Receives messages from its neighbors, forwarding those that are not older than d.
An agent thus receives a message from another agent with a delay equal to the shortest-path distance between them, and a message sent by agent v at time t is received by all agents whose shortest-path distance from v is at most d. (A sketch of the forwarding step follows.)
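A sketch of the message and the forwarding rule in step 3 (field and function names are my own; a real implementation would also de-duplicate messages rather than relay every copy):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Message:
    t: int                    # round at which the message originated
    v: int                    # originating agent
    action: int               # I_t(v)
    loss: float               # ell_t(I_t(v))
    dist: Tuple[float, ...]   # p_t(v), the sampling distribution used by v

def forward(graph: Dict[int, List[int]], inbox: Dict[int, List[Message]],
            t: int, d: int) -> Dict[int, List[Message]]:
    """Each agent relays every received message that is not older than d rounds
    to all of its neighbors. graph[v] lists the neighbors of v."""
    outbox: Dict[int, List[Message]] = {v: [] for v in graph}
    for v, msgs in inbox.items():
        for m in msgs:
            if t - m.t <= d:                 # messages older than d are dropped
                for u in graph[v]:
                    outbox[u].append(m)
    return outbox
```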

  20. Average welfare regret.
$$R_T^{\text{coop}} = \frac{1}{N} \sum_{v \in V} \mathbb{E}\left[\sum_{t=1}^T \ell_t\big(I_t(v)\big)\right] - \min_{i=1,\dots,K} \sum_{t=1}^T \ell_t(i)$$
Remarks: clearly $R_T^{\text{coop}} \lesssim \sqrt{TK \ln K}$ when the agents run vanilla Exp3 with no cooperation; by using the other agents' plays, each agent may estimate $\ell_t$ better (thus learning at nearly the full-information rate); in general, d trades off the quality of the feedback information against its quantity.

  21–22. Cooperative delayed loss estimator. Each agent v uses the messages received from the other agents in order to estimate $\ell_t$ better:
$$\widehat{\ell}_t(i, v) = \begin{cases} \dfrac{\ell_{t-d}(i) \times \mathbb{1}\{B_{t-d}(i, v)\}}{P_{t-d}\big(B_{t-d}(i, v)\big)} & \text{if } t > d \\[1ex] 0 & \text{otherwise} \end{cases}$$
where $B_{t-d}(i, v)$ is the event that some agent in the d-neighborhood of v played action i at time $t - d$. Now $\widehat{\ell}_t(\cdot, v)$ may have many non-zero components (a better estimate); see the sketch below.
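A sketch of the estimator for a single (i, v) pair. Assuming the agents draw their actions independently given the history, $P_{t-d}\big(B_{t-d}(i, v)\big)$ can be computed from the forwarded distributions of the agents in v's d-neighborhood; this factorization is my assumption for the sketch, not something the slides state.

```python
import math
from typing import List, Sequence

def prob_some_neighbor_plays(i: int, dists: List[Sequence[float]]) -> float:
    """P(B(i, v)): probability that at least one agent in the d-neighborhood of v
    plays arm i, assuming independent draws: 1 - prod_u (1 - p_u(i))."""
    return 1.0 - math.prod(1.0 - p[i] for p in dists)

def cooperative_estimate(loss_i: float, observed: bool, p_B: float) -> float:
    """hat_ell_t(i, v) = ell_{t-d}(i) * 1{B_{t-d}(i, v)} / P_{t-d}(B_{t-d}(i, v))."""
    return loss_i / p_B if observed else 0.0
```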
