Multi-Player Bandits Revisited
Decentralized Multi-Player Multi-Arm Bandits Lilian Besson Advised by Christophe Moy Émilie Kaufmann
PhD Student Team SCEE, IETR, CentraleSupélec, Rennes & Team SequeL, CRIStAL, Inria, Lille
1.a. Objective
Lilian Besson (CentraleSupélec & Inria) Multi-Player Bandits Revisited SequeL Seminar - 22/12/17 2 / 42
1.b. Outline and references
1 Introduction
2 Our model: 3 different feedback levels
3 Decomposition and lower bound on regret
4 Quick reminder on single-player MAB algorithms
5 Two new multi-player decentralized algorithms
6 Upper bounds on regret for MCTopM
7 Experimental results
8 A heuristic (Selfish), and disappointing results
9 Conclusion
2.a. Our model
2.b. With or without sensing
1 With sensing: Device first senses for presence of Primary
2 Without sensing: same background traffic, but cannot sense,
2.c. Background traffic, and rewards

i.i.d. background traffic: the availability of channel k at time t is Y_{k,t} ~ Bernoulli(\mu_k), i.i.d. over time.
2.d. Different feedback levels

1 "Full feedback": observe both Y_{A^j(t),t} and C^j(t) separately,
2 "Sensing": first observe Y_{A^j(t),t}, then C^j(t) only if Y_{A^j(t),t} ≠ 0,
3 "No sensing": observe only the joint Y_{A^j(t),t} × 1(C^j(t)).
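The three feedback levels can be made concrete with a small simulation of one time step. This is a hypothetical sketch (the function name `draw_round` is ours, not from the talk), and it reads the indicator 1(C^j(t)) in the "no sensing" case as "no collision", so that the joint observation is exactly the received reward:

```python
import random

def draw_round(mu, arms, rng=random):
    """One time step of the model: mu[k] is the mean availability of
    channel k, arms[j] is the arm chosen by player j."""
    Y = [1 if rng.random() < m else 0 for m in mu]          # channel availabilities
    counts = {a: arms.count(a) for a in set(arms)}
    C = [counts[a] > 1 for a in arms]                       # collision indicators
    full = [(Y[a], C[j]) for j, a in enumerate(arms)]       # "full feedback"
    sensing = [(Y[a], C[j] if Y[a] else None)               # "sensing": C^j(t) seen
               for j, a in enumerate(arms)]                 # only when Y != 0
    no_sensing = [Y[a] * (0 if C[j] else 1)                 # "no sensing": only the
                  for j, a in enumerate(arms)]              # joint product (reward)
    return full, sensing, no_sensing
```

For example, two players colliding on a free channel both observe reward 0 in the "no sensing" case, while "full feedback" would show them Y = 1 and C = True separately.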
2.e. Goal

Design a decentralized algorithm A maximizing the expected total reward \mathbb{E}\Big[\sum_{t=1}^{T}\sum_{j=1}^{M} r^j_A(t)\Big].
2.f. Centralized regret

R_T(\mu, M, \rho) := \mathbb{E}_\mu\Big[\sum_{t=1}^{T}\sum_{j=1}^{M} \big(\mu_j^* - r^j(t)\big)\Big] = \Big(\sum_{k=1}^{M} \mu_k^*\Big)\, T \;-\; \mathbb{E}_\mu\Big[\sum_{t=1}^{T}\sum_{j=1}^{M} r^j(t)\Big].
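As a sketch, the empirical counterpart of the centralized regret can be computed from a simulation trace. This assumes a `(T, M)` array of collected rewards and uses NumPy for convenience (the helper name is ours):

```python
import numpy as np

def centralized_regret(mu, rewards):
    """Empirical centralized regret R_t for t = 1..T.
    mu: the K arm means; rewards: (T, M) array with rewards[t, j] = r^j(t).
    The centralized oracle earns the sum of the M best means at each step."""
    mu = np.asarray(mu, dtype=float)
    T, M = rewards.shape
    best_M_sum = np.sort(mu)[-M:].sum()                 # sum of the M best means
    return best_M_sum * np.arange(1, T + 1) - np.cumsum(rewards.sum(axis=1))
```

Note that the empirical quantity can be negative on short runs (lucky draws); only its expectation is nonnegative.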
1 Decomposition of regret in 3 terms, 2 Asymptotic lower bound of one term, 3 And for regret, 4 Sketch of proof, 5 Illustration.
3.a. Lower bound on regret

Decomposition of the regret in 3 terms:

R_T(\mu, M, \rho) = \sum_{k \in M\text{-worst}} (\mu_M^* - \mu_k)\, \mathbb{E}_\mu[T_k(T)] \;+\; \sum_{k \in M\text{-best}} (\mu_k - \mu_M^*)\, \big(T - \mathbb{E}_\mu[T_k(T)]\big) \;+\; \sum_{k=1}^{K} \mu_k\, \mathbb{E}_\mu[\mathcal{C}_k(T)].

Small regret can be obtained if:
1 Devices can quickly identify the bad arms M-worst, and not play them too often (small first term),
2 Devices can quickly identify the best arms, and most surely play them (small second term),
3 Devices can use orthogonal channels (small number of collisions, small third term).
3.a. Lower bound on regret

Lower bound on the pulls of a suboptimal arm k ∈ M-worst, for any uniformly efficient decentralized policy:

\liminf_{T \to +\infty} \frac{\mathbb{E}_\mu[T_k(T)]}{\log T} \;\geq\; \frac{1}{\mathrm{kl}(\mu_k, \mu_M^*)}.
3.a. Lower bound on regret

\liminf_{T \to +\infty} \frac{R_T(\mu, M, \rho)}{\log T} \;\geq\; M \sum_{k \in M\text{-worst}} \frac{\mu_M^* - \mu_k}{\mathrm{kl}(\mu_k, \mu_M^*)},

where kl(x, y) := x log(x/y) + (1 − x) log((1 − x)/(1 − y)) is the binary Kullback-Leibler divergence.

Ref: [Anantharam et al, 1987]
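The constant in front of log(T) is easy to evaluate numerically. The sketch below (helper names are ours) reproduces the constants quoted on the next figure for M = 6 players and 9 Bernoulli arms of means 0.1, ..., 0.9:

```python
import math

def kl(x, y, eps=1e-12):
    """Binary Kullback-Leibler divergence kl(x, y) between Bernoulli means."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def lower_bound_constant(mu, M):
    """Constant in front of log(T) in the decentralized lower bound:
    M * sum over the K - M worst arms of (mu*_M - mu_k) / kl(mu_k, mu*_M)."""
    mu_sorted = sorted(mu, reverse=True)
    mu_star_M = mu_sorted[M - 1]          # M-th best mean, mu*_M
    worst = mu_sorted[M:]                 # the K - M worst arms
    return M * sum((mu_star_M - m) / kl(m, mu_star_M) for m in worst)
```

On the 9-arm problem below this gives about 48.8 (and 48.8 / 6 ≈ 8.14 for the centralized bound), matching the curves on the next slide.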
Figure: M = 6 players (6 × RhoRand-KLUCB) on 9 arms [B(0.1), B(0.2), B(0.3), B(0.4)*, B(0.5)*, B(0.6)*, B(0.7)*, B(0.8)*, B(0.9)*], horizon T = 10000. Cumulative centralized regret, averaged over 1000 runs, decomposed into the three terms: (a) pulls of the 3 suboptimal arms (lower-bounded), (b) non-pulls of the 6 optimal arms, (c) weighted count of collisions; compared to our lower bound 48.8 log(t), Anandkumar et al.'s lower bound 15 log(t), and the centralized lower bound 8.14 log(t).
3.c. Sketch of the proof
Change-of-distribution argument to lower bound \mathbb{E}_\mu[T_k(T)], the expected number of pulls of a suboptimal arm k.
Ref: [Garivier et al, 2016]
1 Index-based MAB deterministic policies, 2 Upper Confidence Bound algorithm: UCB1, 3 Kullback-Leibler UCB algorithm: kl-UCB.
4.a. Upper Confidence Bound algorithm: UCB1

1 For the first K steps (t = 1, . . . , K), try each channel once.
2 Then for the next steps t > K: compute the index g_k(t) := \widehat{\mu}_k(t) + \sqrt{\frac{\log t}{2\, T_k(t)}} (empirical mean plus exploration bonus), and play the arm of highest index, A(t) = \arg\max_k g_k(t).

References: [Lai & Robbins, 1985], [Auer et al, 2002], [Bubeck & Cesa-Bianchi, 2012]
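A minimal Python sketch of these two steps (function names are ours; the bonus sqrt(log(t) / (2 T_k(t))) is one standard choice of exploration constant, others appear in the literature):

```python
import math

def ucb1_index(sum_rewards, pulls, t):
    """UCB1 index g_k(t): empirical mean plus exploration bonus.
    An unpulled arm gets an infinite index, forcing the K first pulls."""
    if pulls == 0:
        return float("inf")
    return sum_rewards / pulls + math.sqrt(math.log(t) / (2 * pulls))

def ucb1_choose(sums, pulls, t):
    """Play the arm with the largest index (ties broken by lowest index k)."""
    indices = [ucb1_index(s, n, t) for s, n in zip(sums, pulls)]
    return max(range(len(indices)), key=indices.__getitem__)
```

For instance, an arm with empirical mean 1.0 pulled once keeps a much larger index at t = 10 than an arm with mean 0.0 pulled five times, so it is played next.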
4.b. Kullback-Leibler UCB algorithm: kl-UCB

1 For the first K steps (t = 1, . . . , K), try each channel once.
2 Then for the next steps t > K: compute the index g_k(t) := \sup_{q \in [a, b]} \Big\{ q : \mathrm{kl}\big(\widehat{\mu}_k(t), q\big) \leq \frac{\log t}{T_k(t)} \Big\}, and play the arm of highest index, A(t) = \arg\max_k g_k(t).

References: [Garivier & Cappé, 2011], [Cappé, Garivier, Maillard, Munos & Stoltz, 2013]
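Since q ↦ kl(μ̂, q) is increasing on [μ̂, 1], the supremum can be computed by bisection. A sketch for Bernoulli arms (names are ours; the exploration budget is taken as log(t), with an optional c log log t refinement):

```python
import math

def kl_bern(x, y, eps=1e-12):
    """Binary KL divergence between Bernoulli means x and y."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def klucb_index(mean, pulls, t, c=0.0, iters=50):
    """kl-UCB index: largest q >= mean such that
    pulls * kl(mean, q) <= log(t) + c * log(log(t)),
    found by bisection (kl(mean, .) is increasing on [mean, 1])."""
    if pulls == 0:
        return 1.0
    budget = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pulls * kl_bern(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

As expected, the index shrinks toward the empirical mean as the arm is pulled more, and stays well inside [0, 1] (unlike the UCB1 index, which can exceed 1).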
1 Common building blocks of previous algorithms, 2 First proposal: RandTopM, 3 Second proposal: MCTopM, 4 Algorithm and illustration.
5.a. State-of-the-art MP algorithms

Two common building blocks:
1 a MAB policy to learn the best arms (use sensing Y_{A^j(t),t}),
2 an orthogonalization scheme to avoid collisions (use C^j(t)).

Refs: [Anandkumar et al, 2011], [Avner & Mannor, 2015], [Shamir et al, 2016]
5.b. RandTopM algorithm

1 Let A^j(1) ∼ U({1, . . . , K}) and C^j(1) = False
2 for t = 0, . . . , T − 1 do
3   Compute the indices g^j_k(t) and the set M̂^j(t) of the M arms with largest indices
4   if A^j(t) ∉ M̂^j(t) then
5     A^j(t+1) ∼ U( M̂^j(t) ∩ { k : g^j_k(t − 1) ≤ g^j_{A^j(t)}(t − 1) } )
6   else if C^j(t) then
7     A^j(t+1) ∼ U( M̂^j(t) )
8   else
9     A^j(t+1) = A^j(t)
10  Play A^j(t + 1), observe the feedback, update g^j_k(t + 1) and set C^j(t + 1)
11 end
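The decision rule can be sketched in Python as follows. This is a loose transcription under our own naming, not the authors' reference implementation; `indices_prev` and `indices` hold the values g^j_k at t − 1 and t:

```python
import random

def top_M_set(indices, M):
    """Arms with the M largest indices: the estimated best arms M-hat^j(t)."""
    order = sorted(range(len(indices)), key=lambda k: indices[k], reverse=True)
    return set(order[:M])

def randtopm_next_arm(arm, collided, indices_prev, indices, M, rng=random):
    """One RandTopM-style transition for a single player (sketch)."""
    best = top_M_set(indices, M)
    if arm not in best:
        # the arm left the estimated top-M: move to a top-M arm whose
        # previous index was not above the current arm's previous index
        candidates = [k for k in best if indices_prev[k] <= indices_prev[arm]]
        return rng.choice(candidates or sorted(best))
    if collided:
        return rng.choice(sorted(best))   # collision: re-sample inside top-M
    return arm                            # no collision, still top-M: stay
```

Restricting to arms previously ranked below the current one limits how many players chase the same newly promoted arm.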
5.c. MCTopM algorithm

State machine for player j: starting not fixed (s^j(t) = False), a player becomes fixed (s^j(t) = True) after playing an arm of M̂^j(t) without collision. Transitions: (1) no collision, A^j(t) ∈ M̂^j(t): sit; (2) collision C^j(t) while not fixed, A^j(t) ∈ M̂^j(t): re-sample in M̂^j(t); (3) and (5) A^j(t) ∉ M̂^j(t): move; (4) A^j(t) ∈ M̂^j(t) while fixed: stay.

1 Let A^j(1) ∼ U({1, . . . , K}), C^j(1) = False and s^j(1) = False
2 for t = 0, . . . , T − 1 do
3   if A^j(t) ∉ M̂^j(t) then   (transitions (3) and (5))
4     A^j(t+1) ∼ U( M̂^j(t) ∩ { k : g^j_k(t − 1) ≤ g^j_{A^j(t)}(t − 1) } ) and set s^j(t+1) = False
5   else if C^j(t) and not s^j(t) then   (transition (2): musical chairs)
6     A^j(t+1) ∼ U( M̂^j(t) ) and set s^j(t+1) = False
7   else   (transitions (1) and (4): sit on this arm)
8     A^j(t+1) = A^j(t) and set s^j(t+1) = True
9   Play A^j(t + 1), observe the feedback, update g^j_k(t + 1) and set C^j(t + 1)
10 end
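The five transitions can be sketched as a small state machine, again a loose Python transcription under our own naming, with `fixed` playing the role of s^j(t):

```python
import random

def mctopm_step(arm, fixed, collided, indices_prev, indices, M, rng=random):
    """One MCTopM-style transition for a player; returns (next_arm, next_fixed)."""
    order = sorted(range(len(indices)), key=lambda k: indices[k], reverse=True)
    best = set(order[:M])                     # estimated top-M set M-hat^j(t)
    if arm not in best:                       # transitions (3) and (5): move
        candidates = [k for k in best
                      if indices_prev[k] <= indices_prev[arm]]
        return rng.choice(candidates or sorted(best)), False
    if collided and not fixed:                # transition (2): musical chairs
        return rng.choice(sorted(best)), False
    return arm, True                          # transitions (1) and (4): sit
```

The key difference with RandTopM is the chair: a fixed player ignores collisions and only moves when its arm leaves the estimated top-M set, which is what makes the number of collisions controllable in the analysis.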
1 Theorem, 2 Remarks, 3 Idea of the proof.
6.a. Theorem for MCTopM with kl-UCB
6.b. Sketch of the proof

1 Bound the expected number of collisions by M times the expected number of collisions of players in the non-sitted state,
2 Bound the expected number of transitions of type (3) and (5), which require the indices to cross: g^j_k(t − 1) ≤ g^j_{k′}(t − 1) and g^j_k(t) > g^j_{k′}(t),
3 Bound the expected length of a sequence in the non-sitted state,
4 So most of the time (O(T − log T) steps), players are sitted, and no collision occurs.
1 Illustration of regret for a single problem and M = K, 2 Regret for uniformly sampled problems and M < K, 3 Logarithmic number of collisions, 4 Logarithmic number of arm switches, 5 Fairness?
Figure: M = 9 players on 9 arms [B(0.1)*, B(0.2)*, B(0.3)*, B(0.4)*, B(0.5)*, B(0.6)*, B(0.7)*, B(0.8)*, B(0.9)*], horizon T = 10000. Cumulative centralized regret, averaged over 200 runs, for 9 × RandTopM-KLUCB, 9 × MCTopM-KLUCB, 9 × Selfish-KLUCB and 9 × RhoRand-KLUCB; with M = K, all three logarithmic lower bounds are 0 log(t).
Figure: M = 6 players, 9 arms (Bayesian MAB, Bernoulli with means uniformly sampled on [0, 1]), horizon T = 5000. Cumulative centralized regret, averaged over 500 runs, for 6 × RandTopM-KLUCB, 6 × MCTopM-KLUCB, 6 × Selfish-KLUCB and 6 × RhoRand-KLUCB.
Figure: M = 6 players, 9 arms (Bayesian MAB, Bernoulli with means on [0, 1]), T = 5000. Cumulated number of collisions on all arms, averaged over 500 runs, for 6 × RandTopM-KLUCB, 6 × MCTopM-KLUCB, 6 × Selfish-KLUCB and 6 × RhoRand-KLUCB.
Figure: M = 6 players, 9 arms (Bayesian MAB, Bernoulli with means on [0, 1]), T = 5000. Total cumulated number of switches (changes of arms), averaged over 500 runs, for 6 × RandTopM-KLUCB, 6 × MCTopM-KLUCB, 6 × Selfish-KLUCB and 6 × RhoRand-KLUCB.
Figure: M = 6 players, 9 arms (Bayesian MAB, Bernoulli with means on [0, 1]), T = 5000. Centralized measure of fairness (standard deviation of the cumulative rewards), averaged over 500 runs, for 6 × RandTopM-KLUCB, 6 × MCTopM-KLUCB, 6 × Selfish-KLUCB and 6 × RhoRand-KLUCB.
1 Just a heuristic, 2 Problems with Selfish, 3 Illustration of failure cases.
8.a. Problems with Selfish
Reference: [Bonnefoi & Besson et al, 2017]
Figure: histograms of the final regret R_T at T = 5000, over 1000 repetitions, for different multi-player bandit algorithms with M = 2 players on 3 arms [B(0.1), B(0.5)*, B(0.9)*]: 2 × RandTopM-KLUCB, 2 × Selfish-KLUCB, 2 × MCTopM-KLUCB and 2 × RhoRand-KLUCB. Selfish-KLUCB shows rare runs with very large regret (failure cases).
9.a. Sum-up
9.b. Future work
9.c. Thanks!