Multi-Player Bandits Revisited: Decentralized Multi-Player Multi-Arm Bandits

  1. Multi-Player Bandits Revisited: Decentralized Multi-Player Multi-Arm Bandits
     Lilian Besson, PhD Student, joint work with Émilie Kaufmann
     Team SCEE, IETR, CentraleSupélec, Rennes & Team SequeL, CRIStAL, Inria, Lille
     ALT Conference, 08-04-2018

  2. 1. Introduction and motivation: 1.a. Objective
     Motivation: We control some communicating devices that want to use a wireless access point.
     - Insert them in a crowded wireless network,
     - with a protocol slotted in both time and frequency.
     Goal: Maintain a good Quality of Service,
     - with no centralized control, as it costs network overhead.
     How? Devices can choose a different radio channel at each time
     ↪ learn the best one with a sequential algorithm!

  3. 2. Our model: 3 different feedback levels: 2.a. Our communication model
     - $K$ radio channels (e.g., 10).
     - Discrete and synchronized time $t \ge 1$.
     Dynamic device = dynamic radio reconfiguration:
     - It decides each time the channel it uses to send each packet.
     - It can implement a simple decision algorithm.

  4. 2.b. With or without sensing
     Our model:
     - “Easy” case: $M \le K$ devices always communicate and try to access the network, independently, without centralized supervision,
     - Background traffic is i.i.d.
     Two variants: with or without sensing
     1. With sensing: the device first senses for the presence of Primary Users that have strict priority (background traffic), then uses Ack to detect collisions.
     2. Without sensing: same background traffic, but the device cannot sense, so only Ack is used.

  5. 2.c. Background traffic, and rewards
     i.i.d. background traffic:
     - $K$ channels, modeled as Bernoulli (0/1) distributions of mean $\mu_k$ = background traffic from Primary Users, bothering the dynamic devices,
     - $M$ devices, each uses channel $A^j(t) \in \{1, \dots, K\}$ at time $t$.
     Rewards:
     $$ r^j(t) := Y_{A^j(t),t} \times \mathbb{1}(C^j(t)) = \mathbb{1}(\text{uplink \& Ack}) $$
     - with sensing information $\forall k,\ Y_{k,t} \overset{\text{iid}}{\sim} \mathrm{Bern}(\mu_k) \in \{0, 1\}$,
     - collision indicator for device $j$: $C^j(t) = \mathbb{1}(\text{alone on arm } A^j(t))$, so $C^j(t) = 1$ means that device $j$ did not collide.
     ↪ $r^j(t)$ is a combined binary reward, but not from two Bernoulli distributions!
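
The reward model above can be sketched in a few lines of Python. This is only an illustration using the standard library: the helper name `step_rewards`, its arguments, and the 0-based channel indices are assumptions made here, not part of the slides. It draws $Y_{k,t}$ for every channel and gives device $j$ a reward of 1 only when $Y_{A^j(t),t} = 1$ and the device is alone on its channel.

```python
import random
from collections import Counter

def step_rewards(mu, choices, rng=random):
    """One time step of the model (hypothetical helper, not from the slides).

    mu      -- list of channel means mu_k, i.e. P(Y_{k,t} = 1)
    choices -- channel A^j(t) chosen by each device j (0-based indices)
    Returns (Y, alone, rewards).
    """
    # Y_{k,t} ~ Bern(mu_k), drawn independently for every channel.
    Y = [1 if rng.random() < m else 0 for m in mu]
    # C^j(t) = 1 when device j is the only one on its channel.
    load = Counter(choices)
    alone = [1 if load[k] == 1 else 0 for k in choices]
    # r^j(t) = Y_{A^j(t),t} * C^j(t).
    rewards = [Y[k] * c for k, c in zip(choices, alone)]
    return Y, alone, rewards

# Three devices; the first two collide on channel 0.
Y, alone, rewards = step_rewards([0.9, 0.5, 0.1], [0, 0, 2], random.Random(0))
```

Whatever the draw of `Y`, the two colliding devices get a reward of 0, and the third device gets `Y[2]`.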

  9. 2.d. Different feedback levels
     3 feedback levels for the reward $r^j(t) := Y_{A^j(t),t} \times \mathbb{1}(C^j(t))$:
     1. “Full feedback”: observe both $Y_{A^j(t),t}$ and $C^j(t)$ separately,
        ↪ Not realistic enough, we don’t focus on it.
     2. “Sensing”: first observe $Y_{A^j(t),t}$, then $C^j(t)$ only if $Y_{A^j(t),t} \neq 0$,
        ↪ Models licensed protocols (e.g. ZigBee), our main focus.
     3. “No sensing”: observe only the combined $Y_{A^j(t),t} \times \mathbb{1}(C^j(t))$,
        ↪ Unlicensed protocols (e.g. LoRaWAN), harder to analyze!
     But all consider the same instantaneous reward $r^j(t)$.
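
To make the difference between the three feedback levels concrete, here is a small Python sketch of what one device gets to see in a single time step. The function name `observe`, the level labels, and the use of `None` for "not observed" are choices made for this illustration, not notation from the slides.

```python
def observe(level, y, alone):
    """What a device sees in one time step (illustrative helper).

    level -- "full", "sensing" or "no_sensing" (labels chosen here)
    y     -- Y_{A^j(t),t} in {0, 1}
    alone -- C^j(t) in {0, 1}, equal to 1 when the device is alone on its arm
    Returns a dict of observed quantities; None marks "not observed".
    """
    reward = y * alone  # same instantaneous reward r^j(t) in all three cases
    if level == "full":
        return {"Y": y, "C": alone, "reward": reward}
    if level == "sensing":
        # C^j(t) is revealed only when Y != 0.
        return {"Y": y, "C": alone if y != 0 else None, "reward": reward}
    if level == "no_sensing":
        # Only the product is seen: a 0 does not say which factor was 0.
        return {"Y": None, "C": None, "reward": reward}
    raise ValueError(f"unknown feedback level: {level!r}")
```

Under `"no_sensing"`, the calls `observe("no_sensing", 0, 1)` and `observe("no_sensing", 1, 0)` return the same dictionary, which is one way to see why this case is harder to analyze.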

  10. 2.e. Goal
      Goal: Minimize the packet loss ratio (= maximize the number of received Ack) in a finite-space discrete-time Decision Making Problem.
      Solution? Multi-Armed Bandit algorithms, decentralized and used independently by each dynamic device.

  11. 2.f. Centralized regret
      A measure of success:
      - Not the network throughput or collision probability,
      - We study the centralized (expected) regret:
      $$ R_T(\boldsymbol{\mu}, M, \rho) := \Big( \sum_{k=1}^{M} \mu_k^* \Big) T - \mathbb{E}_{\mu}\Big[ \sum_{t=1}^{T} \sum_{j=1}^{M} r^j(t) \Big]. $$
      Notation: $\mu_k^*$ is the mean of the $k$-best arm ($k$-th largest in $\boldsymbol{\mu}$): $\mu_1^* := \max \boldsymbol{\mu}$, $\mu_2^* := \max \boldsymbol{\mu} \setminus \{\mu_1^*\}$, etc.
      Ref: [Lai & Robbins, 1985], [Liu & Zhao, 2009], [Anandkumar et al, 2010]
      Two directions of analysis:
      - How good can a decentralized algorithm be in this setting?
        ↪ Lower Bound on the regret, for any algorithm!
      - How good is my decentralized algorithm in this setting?
        ↪ Upper Bound on the regret, for one algorithm!
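
As a rough illustration of the regret definition, the Python sketch below compares the total reward collected by all devices with the benchmark $(\sum_{k=1}^{M} \mu_k^*)\, T$. The function name `centralized_regret` and its arguments are assumptions made here. The slide's definition uses an expectation, so `total_reward` is meant to be an average over many runs (or a single realized total, which only estimates the expectation).

```python
def centralized_regret(mu, M, total_reward, T):
    """(sum of the M largest means) * T minus the total collected reward.

    mu           -- list of channel means mu_k
    M            -- number of devices (M <= len(mu))
    total_reward -- sum over t and j of r^j(t), ideally averaged over runs
    T            -- time horizon
    """
    best_M = sorted(mu, reverse=True)[:M]  # mu*_1 >= ... >= mu*_M
    return sum(best_M) * T - total_reward

mu = [0.1, 0.5, 0.9]
# Two devices that always sit alone on the two best channels collect,
# on average, (0.9 + 0.5) * T, so their expected regret is zero.
T = 100
zero = centralized_regret(mu, 2, (0.9 + 0.5) * T, T)
```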

  13. 3. Lower bound
      1. Decomposition of the regret in 3 terms,
      2. Asymptotic lower bound on one term,
      3. And for the regret.

  14. 3.a. Lower bound on the regret
      Decomposition of the regret: for any algorithm, decentralized or not, we have
      $$ R_T(\boldsymbol{\mu}, M, \rho) = \sum_{k \in M\text{-worst}} (\mu_M^* - \mu_k)\, \mathbb{E}_{\mu}[T_k(T)] + \sum_{k \in M\text{-best}} (\mu_k - \mu_M^*)\, (T - \mathbb{E}_{\mu}[T_k(T)]) + \sum_{k=1}^{K} \mu_k\, \mathbb{E}_{\mu}[C_k(T)]. $$
      Notations for an arm $k \in \{1, \dots, K\}$:
      - $T_k^j(T) := \sum_{t=1}^{T} \mathbb{1}(A^j(t) = k)$ counts selections by the player $j \in \{1, \dots, M\}$,
      - $T_k(T) := \sum_{j=1}^{M} T_k^j(T)$ counts selections by all $M$ players,
      - $C_k(T) := \sum_{t=1}^{T} \mathbb{1}(\exists j_1 \neq j_2,\ A^{j_1}(t) = k = A^{j_2}(t))$ counts collisions.
      Small regret can be attained if…
      1. Devices can quickly identify the bad arms $M$-worst, and not play them too much (number of sub-optimal selections),
      2. Devices can quickly identify the best arms, and most surely play them (number of optimal non-selections),
      3. Devices can use orthogonal channels (number of collisions).
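
The three terms of the decomposition can be checked numerically on a fixed table of channel choices. The Python sketch below is only an illustration: the function names and the 0-based indices are made up here, and rewards are replaced by their means so that the check is deterministic. One assumption deserves care: in this sketch `C_k` counts the devices involved in a collision on arm `k` (two devices colliding once add 2), which is the reading under which the three terms sum exactly to the regret.

```python
from collections import Counter

def regret_terms(mu, M, actions):
    """Three terms of the decomposition for a fixed action table.

    actions[t][j] is the 0-based channel played by device j at step t.
    Returns (suboptimal, missed_optimal, collision).
    """
    T, K = len(actions), len(mu)
    order = sorted(range(K), key=lambda k: mu[k], reverse=True)
    best, worst = order[:M], order[M:]
    mu_star_M = mu[order[M - 1]]  # M-th largest mean

    T_k = [0] * K  # selections of arm k by all devices
    C_k = [0] * K  # devices involved in a collision on arm k (assumed reading)
    for row in actions:
        for k, n in Counter(row).items():
            T_k[k] += n
            if n > 1:
                C_k[k] += n

    a = sum((mu_star_M - mu[k]) * T_k[k] for k in worst)
    b = sum((mu[k] - mu_star_M) * (T - T_k[k]) for k in best)
    c = sum(mu[k] * C_k[k] for k in range(K))
    return a, b, c

def direct_regret(mu, M, actions):
    """Regret from its definition, with rewards replaced by their means."""
    gained = 0.0
    for row in actions:
        load = Counter(row)
        gained += sum(mu[k] for k in row if load[k] == 1)
    return sum(sorted(mu, reverse=True)[:M]) * len(actions) - gained

mu = [0.1, 0.5, 0.9]
actions = [[2, 2], [1, 2], [0, 1], [2, 1]]  # M = 2 devices, T = 4 steps
terms = regret_terms(mu, 2, actions)
```

On this small table, `sum(terms)` and `direct_regret(mu, 2, actions)` agree up to floating-point rounding: one sub-optimal selection of channel 0 and one collision of two devices on channel 2 account for all of the regret.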
