Multi-Player Bandits Revisited
Decentralized Multi-Player Multi-Arm Bandits Lilian Besson Advised by Christophe Moy Émilie Kaufmann
PhD Student Team SCEE, IETR, CentraleSupélec, Rennes & Team SequeL, CRIStAL, Inria, Lille
1.a. Objective
Lilian Besson (CentraleSupélec & Inria) Multi-Player Bandits Revisited SequeL Seminar - 22/12/17 2 / 42
1.b. Outline and references
1 Introduction
2 Our model: 3 different feedback levels
3 Decomposition and lower bound on regret
4 Quick reminder on single-player MAB algorithms
5 Two new multi-player decentralized algorithms
6 Upper bounds on regret for MCTopM
7 Experimental results
8 A heuristic (Selfish), and disappointing results
9 Conclusion
2.a. Our model
2.b. With or without sensing
1 With sensing: Device first senses for presence of Primary
2 Without sensing: same background traffic, but cannot sense,
2.c. Background traffic, and rewards

i.i.d. background traffic: the availability of channel k at time t is Y_{k,t} ~ Bernoulli(\mu_k), i.i.d. over time.
2.d. Different feedback levels

1 "Full feedback": observe both Y_{A^j(t),t} and C^j(t) separately,
2 "Sensing": first observe Y_{A^j(t),t}, then C^j(t) only if Y_{A^j(t),t} ≠ 0,
3 "No sensing": observe only the joint Y_{A^j(t),t} × 1(C^j(t)).
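The three feedback levels can be made concrete with a small simulation of one time step. This is a hypothetical sketch (the function name `draw_round` is ours, not from the talk), and it reads the indicator 1(C^j(t)) in the "no sensing" case as "no collision", so that the joint observation is exactly the received reward:

```python
import random

def draw_round(mu, arms, rng=random):
    """One time step of the model: mu[k] is the mean availability of
    channel k, arms[j] is the arm chosen by player j."""
    Y = [1 if rng.random() < m else 0 for m in mu]          # channel availabilities
    counts = {a: arms.count(a) for a in set(arms)}
    C = [counts[a] > 1 for a in arms]                       # collision indicators
    full = [(Y[a], C[j]) for j, a in enumerate(arms)]       # "full feedback"
    sensing = [(Y[a], C[j] if Y[a] else None)               # "sensing": C^j(t) seen
               for j, a in enumerate(arms)]                 # only when Y != 0
    no_sensing = [Y[a] * (0 if C[j] else 1)                 # "no sensing": only the
                  for j, a in enumerate(arms)]              # joint product (reward)
    return full, sensing, no_sensing
```

For example, two players colliding on a free channel both observe reward 0 in the "no sensing" case, while "full feedback" would show them Y = 1 and C = True separately.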
2.e. Goal

Design a decentralized algorithm A maximizing the expected total reward \mathbb{E}\Big[\sum_{t=1}^{T}\sum_{j=1}^{M} r^j_A(t)\Big].
2.f. Centralized regret

R_T(\mu, M, \rho) := \mathbb{E}_\mu\Big[\sum_{t=1}^{T}\sum_{j=1}^{M} \big(\mu_j^* - r^j(t)\big)\Big] = \Big(\sum_{k=1}^{M} \mu_k^*\Big)\, T \;-\; \mathbb{E}_\mu\Big[\sum_{t=1}^{T}\sum_{j=1}^{M} r^j(t)\Big].
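As a sketch, the empirical counterpart of the centralized regret can be computed from a simulation trace. This assumes a `(T, M)` array of collected rewards and uses NumPy for convenience (the helper name is ours):

```python
import numpy as np

def centralized_regret(mu, rewards):
    """Empirical centralized regret R_t for t = 1..T.
    mu: the K arm means; rewards: (T, M) array with rewards[t, j] = r^j(t).
    The centralized oracle earns the sum of the M best means at each step."""
    mu = np.asarray(mu, dtype=float)
    T, M = rewards.shape
    best_M_sum = np.sort(mu)[-M:].sum()                 # sum of the M best means
    return best_M_sum * np.arange(1, T + 1) - np.cumsum(rewards.sum(axis=1))
```

Note that the empirical quantity can be negative on short runs (lucky draws); only its expectation is nonnegative.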
1 Decomposition of regret in 3 terms, 2 Asymptotic lower bound of one term, 3 And for regret, 4 Sketch of proof, 5 Illustration.
3.a. Lower bound on regret

Decomposition of the regret in 3 terms:

R_T(\mu, M, \rho) = \sum_{k \in M\text{-worst}} (\mu_M^* - \mu_k)\, \mathbb{E}_\mu[T_k(T)] \;+\; \sum_{k \in M\text{-best}} (\mu_k - \mu_M^*)\, \big(T - \mathbb{E}_\mu[T_k(T)]\big) \;+\; \sum_{k=1}^{K} \mu_k\, \mathbb{E}_\mu[\mathcal{C}_k(T)].

Small regret can be obtained if:
1 Devices can quickly identify the bad arms M-worst, and not play them too often (small first term),
2 Devices can quickly identify the best arms, and most surely play them (small second term),
3 Devices can use orthogonal channels (small number of collisions, small third term).
3.a. Lower bound on regret

Lower bound on the pulls of a suboptimal arm k ∈ M-worst, for any uniformly efficient decentralized policy:

\liminf_{T \to +\infty} \frac{\mathbb{E}_\mu[T_k(T)]}{\log T} \;\geq\; \frac{1}{\mathrm{kl}(\mu_k, \mu_M^*)}.
3.a. Lower bound on regret

\liminf_{T \to +\infty} \frac{R_T(\mu, M, \rho)}{\log T} \;\geq\; M \sum_{k \in M\text{-worst}} \frac{\mu_M^* - \mu_k}{\mathrm{kl}(\mu_k, \mu_M^*)},

where kl(x, y) := x log(x/y) + (1 − x) log((1 − x)/(1 − y)) is the binary Kullback-Leibler divergence.

Ref: [Anantharam et al, 1987]
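The constant in front of log(T) is easy to evaluate numerically. The sketch below (helper names are ours) reproduces the constants quoted on the next figure for M = 6 players and 9 Bernoulli arms of means 0.1, ..., 0.9:

```python
import math

def kl(x, y, eps=1e-12):
    """Binary Kullback-Leibler divergence kl(x, y) between Bernoulli means."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def lower_bound_constant(mu, M):
    """Constant in front of log(T) in the decentralized lower bound:
    M * sum over the K - M worst arms of (mu*_M - mu_k) / kl(mu_k, mu*_M)."""
    mu_sorted = sorted(mu, reverse=True)
    mu_star_M = mu_sorted[M - 1]          # M-th best mean, mu*_M
    worst = mu_sorted[M:]                 # the K - M worst arms
    return M * sum((mu_star_M - m) / kl(m, mu_star_M) for m in worst)
```

On the 9-arm problem below this gives about 48.8 (and 48.8 / 6 ≈ 8.14 for the centralized bound), matching the curves on the next slide.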
Figure: M = 6 players (6 × RhoRand-KLUCB) on 9 arms [B(0.1), B(0.2), B(0.3), B(0.4)*, B(0.5)*, B(0.6)*, B(0.7)*, B(0.8)*, B(0.9)*], horizon T = 10000. Cumulative centralized regret, averaged over 1000 runs, decomposed into the three terms: (a) pulls of the 3 suboptimal arms (lower-bounded), (b) non-pulls of the 6 optimal arms, (c) weighted count of collisions; compared to our lower bound 48.8 log(t), Anandkumar et al.'s lower bound 15 log(t), and the centralized lower bound 8.14 log(t).
3.c. Sketch of the proof
Change-of-distribution argument to lower bound \mathbb{E}_\mu[T_k(T)], the expected number of pulls of a suboptimal arm k.
Ref: [Garivier et al, 2016]
1 Index-based MAB deterministic policies, 2 Upper Confidence Bound algorithm: UCB1, 3 Kullback-Leibler UCB algorithm: kl-UCB.
4.a. Upper Confidence Bound algorithm: UCB1

1 For the first K steps (t = 1, . . . , K), try each channel once.
2 Then for the next steps t > K: compute the index g_k(t) := \widehat{\mu}_k(t) + \sqrt{\frac{\log t}{2\, T_k(t)}} (empirical mean plus exploration bonus), and play the arm of highest index, A(t) = \arg\max_k g_k(t).

References: [Lai & Robbins, 1985], [Auer et al, 2002], [Bubeck & Cesa-Bianchi, 2012]
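A minimal Python sketch of these two steps (function names are ours; the bonus sqrt(log(t) / (2 T_k(t))) is one standard choice of exploration constant, others appear in the literature):

```python
import math

def ucb1_index(sum_rewards, pulls, t):
    """UCB1 index g_k(t): empirical mean plus exploration bonus.
    An unpulled arm gets an infinite index, forcing the K first pulls."""
    if pulls == 0:
        return float("inf")
    return sum_rewards / pulls + math.sqrt(math.log(t) / (2 * pulls))

def ucb1_choose(sums, pulls, t):
    """Play the arm with the largest index (ties broken by lowest index k)."""
    indices = [ucb1_index(s, n, t) for s, n in zip(sums, pulls)]
    return max(range(len(indices)), key=indices.__getitem__)
```

For instance, an arm with empirical mean 1.0 pulled once keeps a much larger index at t = 10 than an arm with mean 0.0 pulled five times, so it is played next.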
4.b. Kullback-Leibler UCB algorithm: kl-UCB

1 For the first K steps (t = 1, . . . , K), try each channel once.
2 Then for the next steps t > K: compute the index g_k(t) := \sup_{q \in [a, b]} \Big\{ q : \mathrm{kl}\big(\widehat{\mu}_k(t), q\big) \leq \frac{\log t}{T_k(t)} \Big\}, and play the arm of highest index, A(t) = \arg\max_k g_k(t).

References: [Garivier & Cappé, 2011], [Cappé, Garivier, Maillard, Munos & Stoltz, 2013]
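Since q ↦ kl(μ̂, q) is increasing on [μ̂, 1], the supremum can be computed by bisection. A sketch for Bernoulli arms (names are ours; the exploration budget is taken as log(t), with an optional c log log t refinement):

```python
import math

def kl_bern(x, y, eps=1e-12):
    """Binary KL divergence between Bernoulli means x and y."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def klucb_index(mean, pulls, t, c=0.0, iters=50):
    """kl-UCB index: largest q >= mean such that
    pulls * kl(mean, q) <= log(t) + c * log(log(t)),
    found by bisection (kl(mean, .) is increasing on [mean, 1])."""
    if pulls == 0:
        return 1.0
    budget = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pulls * kl_bern(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

As expected, the index shrinks toward the empirical mean as the arm is pulled more, and stays well inside [0, 1] (unlike the UCB1 index, which can exceed 1).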
1 Common building blocks of previous algorithms, 2 First proposal: RandTopM, 3 Second proposal: MCTopM, 4 Algorithm and illustration.
5.a. State-of-the-art MP algorithms

Two common building blocks:
1 a MAB policy to learn the best arms (use sensing Y_{A^j(t),t}),
2 an orthogonalization scheme to avoid collisions (use C^j(t)).

Refs: [Anandkumar et al, 2011], [Avner & Mannor, 2015], [Shamir et al, 2016]
5.b. RandTopM algorithm

1 Let A^j(1) ∼ U({1, . . . , K}) and C^j(1) = False
2 for t = 0, . . . , T − 1 do
3   Compute the indices g^j_k(t) and the set M̂^j(t) of the M arms with largest indices
4   if A^j(t) ∉ M̂^j(t) then
5     A^j(t+1) ∼ U( M̂^j(t) ∩ { k : g^j_k(t − 1) ≤ g^j_{A^j(t)}(t − 1) } )
6   else if C^j(t) then
7     A^j(t+1) ∼ U( M̂^j(t) )
8   else
9     A^j(t+1) = A^j(t)
10  Play A^j(t + 1), observe the feedback, update g^j_k(t + 1) and set C^j(t + 1)
11 end
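The decision rule can be sketched in Python as follows. This is a loose transcription under our own naming, not the authors' reference implementation; `indices_prev` and `indices` hold the values g^j_k at t − 1 and t:

```python
import random

def top_M_set(indices, M):
    """Arms with the M largest indices: the estimated best arms M-hat^j(t)."""
    order = sorted(range(len(indices)), key=lambda k: indices[k], reverse=True)
    return set(order[:M])

def randtopm_next_arm(arm, collided, indices_prev, indices, M, rng=random):
    """One RandTopM-style transition for a single player (sketch)."""
    best = top_M_set(indices, M)
    if arm not in best:
        # the arm left the estimated top-M: move to a top-M arm whose
        # previous index was not above the current arm's previous index
        candidates = [k for k in best if indices_prev[k] <= indices_prev[arm]]
        return rng.choice(candidates or sorted(best))
    if collided:
        return rng.choice(sorted(best))   # collision: re-sample inside top-M
    return arm                            # no collision, still top-M: stay
```

Restricting to arms previously ranked below the current one limits how many players chase the same newly promoted arm.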
5.c. MCTopM algorithm

State machine for player j: starting not fixed (s^j(t) = False), a player becomes fixed (s^j(t) = True) after playing an arm of M̂^j(t) without collision. Transitions: (1) no collision, A^j(t) ∈ M̂^j(t): sit; (2) collision C^j(t) while not fixed, A^j(t) ∈ M̂^j(t): re-sample in M̂^j(t); (3) and (5) A^j(t) ∉ M̂^j(t): move; (4) A^j(t) ∈ M̂^j(t) while fixed: stay.

1 Let A^j(1) ∼ U({1, . . . , K}), C^j(1) = False and s^j(1) = False
2 for t = 0, . . . , T − 1 do
3   if A^j(t) ∉ M̂^j(t) then   (transitions (3) and (5))
4     A^j(t+1) ∼ U( M̂^j(t) ∩ { k : g^j_k(t − 1) ≤ g^j_{A^j(t)}(t − 1) } ) and set s^j(t+1) = False
5   else if C^j(t) and not s^j(t) then   (transition (2): musical chairs)
6     A^j(t+1) ∼ U( M̂^j(t) ) and set s^j(t+1) = False
7   else   (transitions (1) and (4): sit on this arm)
8     A^j(t+1) = A^j(t) and set s^j(t+1) = True
9   Play A^j(t + 1), observe the feedback, update g^j_k(t + 1) and set C^j(t + 1)
10 end
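The five transitions can be sketched as a small state machine, again a loose Python transcription under our own naming, with `fixed` playing the role of s^j(t):

```python
import random

def mctopm_step(arm, fixed, collided, indices_prev, indices, M, rng=random):
    """One MCTopM-style transition for a player; returns (next_arm, next_fixed)."""
    order = sorted(range(len(indices)), key=lambda k: indices[k], reverse=True)
    best = set(order[:M])                     # estimated top-M set M-hat^j(t)
    if arm not in best:                       # transitions (3) and (5): move
        candidates = [k for k in best
                      if indices_prev[k] <= indices_prev[arm]]
        return rng.choice(candidates or sorted(best)), False
    if collided and not fixed:                # transition (2): musical chairs
        return rng.choice(sorted(best)), False
    return arm, True                          # transitions (1) and (4): sit
```

The key difference with RandTopM is the chair: a fixed player ignores collisions and only moves when its arm leaves the estimated top-M set, which is what makes the number of collisions controllable in the analysis.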
1 Theorem, 2 Remarks, 3 Idea of the proof.
6.a. Theorem for MCTopM with kl-UCB
6.b. Sketch of the proof

1 Bound the expected number of collisions by M times the expected number of collisions of players in the non-sitted state,
2 Bound the expected number of transitions of type (3) and (5), which require the indices to cross: g^j_k(t − 1) ≤ g^j_{k′}(t − 1) and g^j_k(t) > g^j_{k′}(t),
3 Bound the expected length of a sequence in the non-sitted state,
4 So most of the time (O(T − log T) steps), players are sitted, and no collision occurs.
1 Illustration of regret for a single problem and M = K, 2 Regret for uniformly sampled problems and M < K, 3 Logarithmic number of collisions, 4 Logarithmic number of arm switches, 5 Fairness?
Figure: M = 9 players on 9 arms [B(0.1)*, B(0.2)*, B(0.3)*, B(0.4)*, B(0.5)*, B(0.6)*, B(0.7)*, B(0.8)*, B(0.9)*], horizon T = 10000. Cumulative centralized regret, averaged over 200 runs, for 9 × RandTopM-KLUCB, 9 × MCTopM-KLUCB, 9 × Selfish-KLUCB and 9 × RhoRand-KLUCB; with M = K, all three logarithmic lower bounds are 0 log(t).
Figure: M = 6 players, 9 arms (Bayesian MAB, Bernoulli with means uniformly sampled on [0, 1]), horizon T = 5000. Cumulative centralized regret, averaged over 500 runs, for 6 × RandTopM-KLUCB, 6 × MCTopM-KLUCB, 6 × Selfish-KLUCB and 6 × RhoRand-KLUCB.
Figure: M = 6 players, 9 arms (Bayesian MAB, Bernoulli with means on [0, 1]), T = 5000. Cumulated number of collisions on all arms, averaged over 500 runs, for 6 × RandTopM-KLUCB, 6 × MCTopM-KLUCB, 6 × Selfish-KLUCB and 6 × RhoRand-KLUCB.
Figure: M = 6 players, 9 arms (Bayesian MAB, Bernoulli with means on [0, 1]), T = 5000. Total cumulated number of switches (changes of arms), averaged over 500 runs, for 6 × RandTopM-KLUCB, 6 × MCTopM-KLUCB, 6 × Selfish-KLUCB and 6 × RhoRand-KLUCB.
Figure: M = 6 players, 9 arms (Bayesian MAB, Bernoulli with means on [0, 1]), T = 5000. Centralized measure of fairness (standard deviation of the cumulative rewards), averaged over 500 runs, for 6 × RandTopM-KLUCB, 6 × MCTopM-KLUCB, 6 × Selfish-KLUCB and 6 × RhoRand-KLUCB.
1 Just a heuristic, 2 Problems with Selfish, 3 Illustration of failure cases.
8.a. Problems with Selfish
Reference: [Bonnefoi & Besson et al, 2017]
Figure: histograms of the final regret R_T at T = 5000, over 1000 repetitions, for different multi-player bandit algorithms with M = 2 players on 3 arms [B(0.1), B(0.5)*, B(0.9)*]: 2 × RandTopM-KLUCB, 2 × Selfish-KLUCB, 2 × MCTopM-KLUCB and 2 × RhoRand-KLUCB. Selfish-KLUCB shows rare runs with very large regret (failure cases).
9.a. Sum-up
9.b. Future work
9.c. Thanks!