Multi-Player Bandits Revisited
Decentralized Multi-Player Multi-Arm Bandits
Lilian Besson (PhD student), advised by Christophe Moy and Émilie Kaufmann
Team SCEE, IETR, CentraleSupélec, Rennes & Team SequeL, CRIStAL, Inria, Lille
SequeL Seminar - 22 December 2017

1. Introduction and motivation
1.a. Objective

Motivation
- We control some communicating devices that want to reach an access point.
- They are inserted in a crowded wireless network,
- with a protocol slotted in both time and frequency.

Goal
- Maintain a good Quality of Service,
- with no centralized control, as it costs network overhead.

How?
- Devices can choose a different radio channel at each time step
  → learn the best one with a sequential algorithm!

Lilian Besson (CentraleSupélec & Inria) Multi-Player Bandits Revisited SequeL Seminar - 22/12/17 2 / 41

1. Introduction and motivation
1.b. Outline and references

Outline
2 Our model: 3 different feedback levels
3 Regret lower bound
5 Two new multi-player decentralized algorithms
6 Upper bounds on regret for MCTopM
7 Experimental results

This is based on our latest article: "Multi-Player Bandits Models Revisited", Besson & Kaufmann, arXiv:1711.02317.

2. Our model: 3 different feedback levels
2.a. Our model

Our model
- K radio channels (e.g., 10), known.
- Discrete and synchronized time t ≥ 1. Every time frame t is:

Figure 1: Protocol in time and frequency, with an Acknowledgement.

Dynamic device = dynamic radio reconfiguration
- It decides, at each time, the channel it uses to send each packet.
- It can implement a simple decision algorithm.

2. Our model: 3 different feedback levels
2.b. With or without sensing

Our model: "easy" case
- M ≤ K devices always communicate and try to access the network, independently, without centralized supervision.
- Background traffic is i.i.d.

Two variants: with or without sensing
1 With sensing: the device first senses for the presence of Primary Users (background traffic), then uses the Ack to detect collisions.
  This models the "classical" Opportunistic Spectrum Access problem. Not exactly suited for the Internet of Things, but it can model ZigBee, and it can be analyzed mathematically...
2 Without sensing: same background traffic, but the device cannot sense, so only the Ack is used.
  More suited for "IoT" networks like LoRa or SigFox. (Harder to analyze mathematically.)

2. Our model: 3 different feedback levels
2.c. Background traffic, and rewards

i.i.d. background traffic
- K channels, modeled as Bernoulli (0/1) distributions of mean $\mu_k$ = background traffic from Primary Users, bothering the dynamic devices.
- M devices, each using channel $A^j(t) \in \{1, \dots, K\}$ at time t.

Rewards
$r^j(t) := Y_{A^j(t),t} \times \mathbb{1}(\overline{C^j(t)}) = \mathbb{1}(\text{uplink \& Ack})$
- with sensing information: $\forall k,\ Y_{k,t} \overset{\text{iid}}{\sim} \mathrm{Bern}(\mu_k) \in \{0, 1\}$,
- no collision for device j: $\overline{C^j(t)} = \mathbb{1}(\text{alone on arm } A^j(t))$.
→ A combined binary reward, but not from two Bernoulli!
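The reward model above can be sketched for one time slot as follows. This is a minimal illustration, not code from the talk: the function name `play_round`, the 0-based channel indices and the reading "Y = 1 is a good slot" (implied by the product form of the reward) are assumptions.

```python
import random
from collections import Counter

def play_round(mu, choices, rng):
    """Simulate one time slot t of the reward model (illustrative sketch).

    mu[k]      : mean of the Bernoulli draw Y_{k,t} on channel k (0-based here)
    choices[j] : channel A^j(t) picked by device j
    Returns (Y, alone, rewards), one reward r^j(t) per device.
    """
    # One i.i.d. draw Y_{k,t} ~ Bern(mu_k) per channel.
    Y = [1 if rng.random() < m else 0 for m in mu]
    # Number of devices on each channel in this slot.
    load = Counter(choices)
    # alone[j] = 1 if device j is the only one on its channel (no collision).
    alone = [1 if load[a] == 1 else 0 for a in choices]
    # r^j(t) = Y_{A^j(t),t} * 1(no collision for device j).
    rewards = [Y[a] * alone[j] for j, a in enumerate(choices)]
    return Y, alone, rewards

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical instance: K = 3 channels, M = 3 devices, two of them collide.
    print(play_round([0.2, 0.5, 0.9], [2, 2, 1], rng))
```

Two devices that pick the same channel both get reward 0 in that slot, whatever the value of Y on that channel.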

2. Our model: 3 different feedback levels
2.d. Different feedback levels

3 feedback levels
$r^j(t) := Y_{A^j(t),t} \times \mathbb{1}(\overline{C^j(t)})$
1 "Full feedback": observe both $Y_{A^j(t),t}$ and $C^j(t)$ separately,
  → Not realistic enough, we don't focus on it.
2 "Sensing": first observe $Y_{A^j(t),t}$, then $C^j(t)$ only if $Y_{A^j(t),t} \neq 0$,
  → Models licensed protocols (e.g., ZigBee), our main focus.
3 "No sensing": observe only the combined $Y_{A^j(t),t} \times \mathbb{1}(\overline{C^j(t)})$,
  → Unlicensed protocols (e.g., LoRaWAN), harder to analyze!
But all consider the same instantaneous reward $r^j(t)$.
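To make the difference between the three levels concrete, here is a small sketch of what one device gets to see in one slot. The function `observe`, the level names and the use of `None` for "not observed" are illustrative assumptions, not notation from the talk.

```python
def observe(level, y, alone):
    """What a device observes in one slot, for each feedback level (sketch).

    y     : draw Y_{A^j(t),t} on the chosen channel (0 or 1)
    alone : 1 if the device had no collision, else 0
    None marks a quantity that the device does not get to see.
    """
    reward = y * alone  # same instantaneous reward r^j(t) at every level
    if level == "full":
        # Both components are observed separately.
        return {"Y": y, "alone": alone, "reward": reward}
    if level == "sensing":
        # Sense first; the Ack (collision information) only comes if Y != 0.
        return {"Y": y, "alone": alone if y != 0 else None, "reward": reward}
    if level == "no_sensing":
        # Only the product is observed, through the Ack.
        return {"Y": None, "alone": None, "reward": reward}
    raise ValueError(f"unknown feedback level: {level!r}")

if __name__ == "__main__":
    for level in ("full", "sensing", "no_sensing"):
        print(level, observe(level, 0, 1), observe(level, 1, 0))
```

Without sensing, a missing Ack looks the same whether it came from background traffic (`y = 0`) or from a collision (`alone = 0`), which is one way to see why that case is harder to analyze.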

2. Our model: 3 different feedback levels
2.e. Goal

Problem
- Goal: minimize the packet loss ratio (= maximize the number of received Ack) in a finite-space discrete-time Decision Making Problem.

Solution?
- Multi-Armed Bandit algorithms, decentralized and used independently by each dynamic device.

Decentralized reinforcement learning optimization!
- Max transmission rate ≡ max cumulated rewards: $\max_{\text{algorithm } A} \sum_{t=1}^{T} \sum_{j=1}^{M} r^j(t)$.
- Each player wants to maximize its cumulated reward,
- with no central control, and no exchange of information.
- Only possible if each player converges to one of the M best arms, orthogonally (without collisions).
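The last point can be checked by brute force on a tiny instance. The sketch below assumes a fixed (static) choice of channels and computes the expected total reward per slot; the helper names and the example means are hypothetical.

```python
from collections import Counter
from itertools import product

def expected_total_reward(mu, choices):
    """Expected sum of rewards in one slot for a fixed choice of channels.

    A device only earns mu[a] (in expectation) if it is alone on channel a.
    """
    load = Counter(choices)
    return sum(mu[a] for a in choices if load[a] == 1)

def best_assignment(mu, M):
    """Exhaustive search over all K^M static assignments (tiny K and M only)."""
    K = len(mu)
    return max(product(range(K), repeat=M),
               key=lambda c: expected_total_reward(mu, c))

if __name__ == "__main__":
    mu = [0.1, 0.5, 0.9]   # hypothetical channel means
    best = best_assignment(mu, M=2)
    print(best, expected_total_reward(mu, best))
```

On this instance the search returns the two channels with the largest means, with one device on each, which matches the "M best arms, orthogonally" statement.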

2. Our model: 3 different feedback levels
2.f. Centralized regret

A measure of success
- Not the network throughput or collision probability,
- we study the centralized (expected) regret:
$R_T(\boldsymbol{\mu}, M, \rho) := \left( \sum_{k=1}^{M} \mu_k^* \right) T - \mathbb{E}_{\mu} \left[ \sum_{t=1}^{T} \sum_{j=1}^{M} r^j(t) \right]$

Two directions of analysis
- Clearly $R_T = O(T)$, but we want a sub-linear regret, as small as possible!
- How good can a decentralized algorithm be in this setting?
  → Lower bound on regret, for any algorithm!
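As a rough illustration of the definition, the sketch below estimates the centralized regret by Monte Carlo for a deliberately naive baseline in which every device picks a channel uniformly at random at each slot. It reads $\mu_k^*$ as the k-th largest mean, which is an assumption here; the baseline policy, the function names and the parameters are also illustrative and are not one of the algorithms of the talk.

```python
import random
from collections import Counter

def centralized_regret(mu, M, mean_total_reward, T):
    """(sum of the M largest means) * T minus the mean total reward."""
    best_per_slot = sum(sorted(mu, reverse=True)[:M])
    return best_per_slot * T - mean_total_reward

def total_reward_uniform_players(mu, M, T, rng):
    """Total reward of M devices choosing channels uniformly at random."""
    K = len(mu)
    total = 0
    for _ in range(T):
        choices = [rng.randrange(K) for _ in range(M)]
        load = Counter(choices)
        for a in choices:
            # Y is only drawn for channels used by a single device; this has
            # the same distribution as drawing Y on every channel first.
            if load[a] == 1 and rng.random() < mu[a]:
                total += 1
    return total

def estimate_regret(mu, M, T, runs=200, seed=0):
    """Monte Carlo estimate of R_T for the uniform baseline."""
    rng = random.Random(seed)
    mean_total = sum(total_reward_uniform_players(mu, M, T, rng)
                     for _ in range(runs)) / runs
    return centralized_regret(mu, M, mean_total, T)

if __name__ == "__main__":
    mu = [0.1, 0.5, 0.9]   # hypothetical channel means
    for T in (100, 200, 400):
        print(T, estimate_regret(mu, M=2, T=T))
```

For this baseline the estimate grows roughly in proportion to T, the linear behaviour that a good algorithm is meant to avoid.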
