MVA-RL Course
The Multi-Arm Bandit Framework
- A. LAZARIC (SequeL Team @INRIA-Lille)
ENS Cachan - Master 2 MVA
SequeL – INRIA Lille
Mathematical Tools
Proposition (Chernoff-Hoeffding inequality)
Let $X_1, \dots, X_n$ be independent random variables with $X_i \in [a_i, b_i]$ and $\mathbb{E}[X_i] = \mu_i$. Then, for any $\epsilon > 0$,
$$P\Big(\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i \ge \epsilon\Big) \le \exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big).$$
Proof.
$$P\Big(\sum_{i=1}^n (X_i - \mu_i) \ge \epsilon\Big) = P\big(e^{s\sum_{i=1}^n (X_i - \mu_i)} \ge e^{s\epsilon}\big) \le e^{-s\epsilon}\,\mathbb{E}\big[e^{s\sum_{i=1}^n (X_i - \mu_i)}\big] \quad \text{(Markov inequality)}$$
$$= e^{-s\epsilon}\prod_{i=1}^n \mathbb{E}\big[e^{s(X_i - \mu_i)}\big] \quad \text{(independent random variables)}$$
$$\le e^{-s\epsilon}\prod_{i=1}^n e^{s^2(b_i - a_i)^2/8} = e^{-s\epsilon + s^2\sum_{i=1}^n (b_i - a_i)^2/8} \quad \text{(Hoeffding's lemma)}.$$
If we choose $s = 4\epsilon / \sum_{i=1}^n (b_i - a_i)^2$, the result follows. A similar argument holds for $P\big(\sum_{i=1}^n (X_i - \mu_i) \le -\epsilon\big)$.
Corollary. Let $X_1, \dots, X_n$ be i.i.d. random variables bounded in $[0, 1]$ with mean $\mu$, and let $\hat\mu_n = \frac{1}{n}\sum_{t=1}^n X_t$. Then, for any $\epsilon > 0$,
$$P(\hat\mu_n - \mu \ge \epsilon) \le e^{-2n\epsilon^2},$$
or, equivalently, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$$\hat\mu_n \le \mu + \sqrt{\frac{\log(1/\delta)}{2n}}.$$
The same bounds hold for the lower deviations; this is the confidence interval used throughout the lecture.
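A quick numerical sanity check (a minimal sketch assuming NumPy is available; the parameter values are arbitrary): compare the empirical tail probability of the sample mean of Bernoulli rewards with the Chernoff-Hoeffding bound $e^{-2n\epsilon^2}$.

```python
import numpy as np

# Monte Carlo check of the Chernoff-Hoeffding bound for Bernoulli(mu) rewards in [0, 1].
rng = np.random.default_rng(0)
mu, n, eps, runs = 0.5, 100, 0.1, 20_000

# Empirical frequency of the deviation {mu_hat - mu >= eps}.
sample_means = rng.binomial(1, mu, size=(runs, n)).mean(axis=1)
empirical_tail = np.mean(sample_means - mu >= eps)

# Chernoff-Hoeffding upper bound exp(-2 n eps^2).
bound = np.exp(-2 * n * eps**2)

print(f"empirical P(mu_hat - mu >= {eps}) = {empirical_tail:.4f}")
print(f"Hoeffding bound                   = {bound:.4f}")
```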
The General Multi-arm Bandit Problem
At each round $t = 1, \dots, n$:
◮ At the same time, the environment chooses a vector of rewards $\{X_{i,t}\}_{i=1}^N$ and the learner chooses an arm $I_t$
◮ The learner receives the reward $X_{I_t,t}$
◮ The environment does not reveal the rewards of the other arms
(The interaction loop is sketched in Python below.)
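A minimal sketch of this interaction loop. The environment representation (`env_rewards`) and the `RandomLearner` placeholder are illustrative choices, not part of the lecture.

```python
import random

class RandomLearner:
    """Placeholder learner: picks an arm uniformly at random."""
    def __init__(self, n_arms):
        self.n_arms = n_arms

    def choose(self, t):
        return random.randrange(self.n_arms)

    def update(self, arm, reward):
        pass  # a real algorithm would update its statistics here

def run_bandit(env_rewards, learner):
    """env_rewards[t][i] is the reward of arm i at round t (hidden from the learner)."""
    total = 0.0
    for t, rewards in enumerate(env_rewards):
        arm = learner.choose(t)     # the learner chooses an arm I_t
        reward = rewards[arm]       # ...and only observes X_{I_t, t}
        learner.update(arm, reward)
        total += reward
    return total

# Example: 2 arms, 5 rounds of (arbitrary) rewards.
print(run_bandit([[0.1, 0.9]] * 5, RandomLearner(2)))
```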
The goal is to minimize the regret with respect to the best fixed arm,
$$R_n = \max_{i=1,\dots,N} \mathbb{E}\Big[\sum_{t=1}^n X_{i,t}\Big] - \mathbb{E}\Big[\sum_{t=1}^n X_{I_t,t}\Big],$$
where the expectation accounts for any randomness in the rewards $X$ or in the algorithm.
Examples:
◮ Packet routing
◮ Clinical trials
◮ Web advertising
◮ Computer games
◮ Resource mining
◮ ...
The Stochastic Multi-arm Bandit Problem
Definition
◮ Each arm $i$ is characterized by a distribution $\nu_i$ bounded in $[0, 1]$ and with mean $\mu_i$
◮ The rewards are i.i.d.: $X_{i,t} \sim \nu_i$
Notation:
◮ Number of times arm $i$ has been pulled after $n$ rounds: $T_{i,n} = \sum_{t=1}^n \mathbb{1}\{I_t = i\}$
◮ Regret:
$$R_n = \max_{i=1,\dots,N} \mathbb{E}\Big[\sum_{t=1}^n X_{i,t}\Big] - \mathbb{E}\Big[\sum_{t=1}^n X_{I_t,t}\Big] = \max_{i=1,\dots,N}(n\mu_i) - \sum_{i=1}^N \mathbb{E}[T_{i,n}]\,\mu_i = \sum_{i=1}^N \mathbb{E}[T_{i,n}]\,\Delta_i,$$
where $\mu^* = \max_i \mu_i$ and $\Delta_i = \mu^* - \mu_i$ is the gap of arm $i$.
The Stochastic Multi-arm Bandit Problem
◮ If the best possible world is correct ⇒ no regret ◮ If the best possible world is wrong ⇒ the reduction in the
Oct 29th, 2013 - 19/94
[Figure: histograms of the observed rewards of four arms, after 100, 200, 50, and 20 pulls respectively.]
Optimism in face of uncertainty.
[Figure: the same four reward histograms as above.]
[Figure: estimated rewards of arms 1-4 (pulled 10, 73, 3, and 23 times respectively).]
The Upper Confidence Bound (UCB) strategy: at each round $t$
◮ Compute the score $B_{i,s,t}$ of each arm $i$ (with $s = T_{i,t-1}$ pulls so far)
◮ Pull the arm $I_t = \arg\max_{i=1,\dots,N} B_{i,T_{i,t-1},t}$
◮ Update the number of pulls: $T_{I_t,t} = T_{I_t,t-1} + 1$
The score is an upper confidence bound on the mean, obtained from the Chernoff-Hoeffding inequality:
$$B_{i,s,t} = \hat\mu_{i,s} + \sqrt{\frac{\log(1/\delta)}{2s}},$$
where $\hat\mu_{i,s}$ is the empirical mean of arm $i$ after $s$ pulls and $\delta$ is a confidence parameter.

Theorem
For rewards in $[0, 1]$ and $\delta = 1/n^2$, UCB satisfies $\mathbb{E}[T_{i,n}] \le \frac{4\log n}{\Delta_i^2} + N + 1$ for every suboptimal arm, and hence
$$R_n \le \sum_{i:\Delta_i > 0}\Big(\frac{4\log n}{\Delta_i} + (N+1)\Delta_i\Big).$$
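A minimal UCB sketch in Python (the two Bernoulli arms, the horizon, and $\delta = 1/n^2$ are illustrative choices consistent with the analysis above):

```python
import math
import random

def ucb(arms, n, delta):
    """arms[i]() samples a reward in [0, 1]; score B_i = mu_hat_i + sqrt(log(1/delta) / (2 s))."""
    N = len(arms)
    counts = [0] * N    # T_{i,t}: number of pulls of each arm
    sums = [0.0] * N    # cumulative reward of each arm
    total = 0.0
    for t in range(n):
        if t < N:
            i = t       # pull each arm once to initialize the estimates
        else:
            i = max(range(N), key=lambda a: sums[a] / counts[a]
                    + math.sqrt(math.log(1 / delta) / (2 * counts[a])))
        x = arms[i]()
        counts[i] += 1
        sums[i] += x
        total += x
    return total, counts

# Example: two Bernoulli arms with means 0.5 and 0.6, delta = 1/n^2 as in the analysis.
n = 10_000
reward, counts = ucb([lambda: float(random.random() < 0.5),
                      lambda: float(random.random() < 0.6)], n, delta=1 / n**2)
print(reward, counts)   # the better arm should receive most of the pulls
```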
Remarks on the bound:
◮ It is problem dependent: it scales with the inverse gaps $1/\Delta_i$, so it deteriorates when some suboptimal arm is very close to the optimal one ($\Delta_i \to 0$)
◮ On the other hand, the regret caused by arm $i$ over $n$ rounds can never exceed $n\Delta_i$; balancing the two regimes over the worst-case value of $\Delta$ gives a distribution-free bound of order $\sqrt{Nn\log n}$
The confidence parameter trades off exploration and exploitation. The algorithm should explore
◮ Enough: so as to understand which arm is the best
◮ Not too much: so as to keep the regret as small as possible
and the level $1-\delta$ tunes this trade-off:
◮ Big $1-\delta$: high level of exploration
◮ Small $1-\delta$: high level of exploitation
Let's dig into the (one-and-a-half-page!) proof. Define the (high-probability) event [statistics]
$$\mathcal{E} = \Big\{\forall i = 1,\dots,N,\ \forall s = 1,\dots,n:\ \big|\hat\mu_{i,s} - \mu_i\big| \le \sqrt{\tfrac{\log(1/\delta)}{2s}}\Big\},$$
which, by Chernoff-Hoeffding and a union bound, holds with probability at least $1 - nN\delta$.

At time $t$ we pull arm $i$ [algorithm]:
$$B_{i,T_{i,t-1},t} \ge B_{i^*,T_{i^*,t-1},t} \iff \hat\mu_{i,T_{i,t-1}} + \sqrt{\tfrac{\log(1/\delta)}{2T_{i,t-1}}} \ge \hat\mu_{i^*,T_{i^*,t-1}} + \sqrt{\tfrac{\log(1/\delta)}{2T_{i^*,t-1}}}.$$
On the event $\mathcal{E}$ we have [math]
$$\mu_i + 2\sqrt{\tfrac{\log(1/\delta)}{2T_{i,t-1}}} \ge \mu_{i^*}.$$

Assume $t$ is the last time $i$ is pulled, so that $T_{i,n} = T_{i,t-1} + 1$; thus
$$\mu_i + 2\sqrt{\tfrac{\log(1/\delta)}{2(T_{i,n}-1)}} \ge \mu_{i^*}.$$
Reordering [math]
$$T_{i,n} \le \frac{2\log(1/\delta)}{\Delta_i^2} + 1$$
under the event $\mathcal{E}$, and thus with probability at least $1 - nN\delta$. Moving to the expectation [statistics]
$$\mathbb{E}[T_{i,n}] = \mathbb{E}[T_{i,n}\mathbb{1}_{\mathcal{E}}] + \mathbb{E}[T_{i,n}\mathbb{1}_{\mathcal{E}^C}] \le \frac{2\log(1/\delta)}{\Delta_i^2} + 1 + n(nN\delta).$$
Trading off the two terms with $\delta = 1/n^2$, we obtain
$$B_{i,s,t} = \hat\mu_{i,s} + \sqrt{\frac{\log n}{s}} \qquad\text{and}\qquad \mathbb{E}[T_{i,n}] \le \frac{4\log n}{\Delta_i^2} + N + 1,$$
and the regret bound follows from $R_n = \sum_i \Delta_i\,\mathbb{E}[T_{i,n}]$.
UCB values (for the $\delta = 1/n$ version of the algorithm):
$$B_{i,s} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log n}{2s}}.$$
Theory:
◮ $\rho < 0.5$: polynomial regret w.r.t. $n$
◮ $\rho > 0.5$: logarithmic regret w.r.t. $n$
Practice: $\rho = 0.2$ is often the best choice.
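A small experiment sketch illustrating the effect of $\rho$ (the two Bernoulli arms, the horizon and the seed are arbitrary assumptions, not part of the lecture):

```python
import math
import random

def ucb_rho_regret(means, n, rho, seed=0):
    """Run UCB with B_{i,s} = mu_hat_{i,s} + rho * sqrt(log n / (2 s)) on Bernoulli arms; return the pseudo-regret."""
    rng = random.Random(seed)
    N = len(means)
    counts, sums = [0] * N, [0.0] * N
    regret, best = 0.0, max(means)
    for t in range(n):
        if t < N:
            i = t
        else:
            i = max(range(N), key=lambda a: sums[a] / counts[a]
                    + rho * math.sqrt(math.log(n) / (2 * counts[a])))
        x = 1.0 if rng.random() < means[i] else 0.0
        counts[i] += 1
        sums[i] += x
        regret += best - means[i]
    return regret

for rho in (0.2, 0.5, 1.0):
    print(rho, ucb_rho_regret([0.5, 0.6], n=20_000, rho=rho))
```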
An anytime variant replaces the horizon-dependent $\log n$ with $\log t$:
$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log t}{2s}},$$
so that the algorithm does not need to know $n$ in advance.
Theorem (asymptotic lower bound)
For any uniformly good strategy,
$$\liminf_{n\to\infty}\frac{R_n}{\log n} \ge \sum_{i:\Delta_i>0}\frac{\Delta_i}{\mathrm{KL}(\nu_i, \nu_{i^*})},$$
so the logarithmic dependence of the regret on $n$ cannot be improved.
The Non-Stochastic Multi-arm Bandit Problem
Definition
◮ Arms have no fixed distribution
◮ The rewards $X_{i,t} \in [0, 1]$ are arbitrarily chosen by the environment (adversarial setting)
The regret is measured against the best fixed arm in hindsight:
$$R_n = \max_{i=1,\dots,N}\sum_{t=1}^n X_{i,t} - \mathbb{E}\Big[\sum_{t=1}^n X_{I_t,t}\Big],$$
where the expectation is over the learner's internal randomization.
The exponential-weight forecaster (full-information version). Initialize the weights $w_{i,0} = 1$. At each round $t$:
◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)
$$\hat p_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}$$
◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Observe the rewards $\{X_{i,t}\}_{i=1}^N$ and receive the reward $X_{I_t,t}$
◮ Update
$$w_{i,t} = w_{i,t-1}\exp(\eta\,X_{i,t})$$
In the bandit setting the learner only observes $X_{I_t,t}$, so the update cannot use the rewards of the other arms directly. They are replaced by the importance-weighted estimates
$$\tilde X_{i,t} = \frac{X_{i,t}}{\hat p_{i,t}}\,\mathbb{1}\{I_t = i\},$$
which are unbiased: $\mathbb{E}_{I_t\sim\hat p_t}[\tilde X_{i,t}] = \hat p_{i,t}\,\frac{X_{i,t}}{\hat p_{i,t}} = X_{i,t}$.
Exp3: Exponential-weight algorithm for Exploration and Exploitation. Initialize the weights $w_{i,0} = 1$. At each round $t$:
◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)
$$\hat p_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}$$
◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Receive the reward $X_{I_t,t}$ (only)
◮ Update
$$w_{i,t} = w_{i,t-1}\exp\big(\eta\,\tilde X_{i,t}\big)$$
(A Python sketch follows below.)
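A minimal Exp3 sketch (log-weights are used to avoid numerical overflow; the learning rate $\eta$ and the toy Bernoulli environment are arbitrary illustrative choices):

```python
import numpy as np

def exp3(reward_fn, n_arms, horizon, eta, seed=0):
    """Exp3: exponential weights on importance-weighted reward estimates (bandit feedback)."""
    rng = np.random.default_rng(seed)
    log_w = np.zeros(n_arms)            # log-weights; all arms start with weight 1
    total = 0.0
    for t in range(horizon):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                    # p_{i,t} proportional to w_{i,t-1}
        arm = rng.choice(n_arms, p=p)   # I_t ~ p_t
        x = reward_fn(arm, t)           # only X_{I_t,t} is observed
        total += x
        log_w[arm] += eta * x / p[arm]  # w_{i,t} = w_{i,t-1} * exp(eta * X_tilde_{i,t})
    return total

# Toy run: arm 1 is best (Bernoulli 0.7 vs 0.4).
env_rng = np.random.default_rng(1)
print(exp3(lambda a, t: float(env_rng.random() < (0.4, 0.7)[a]), n_arms=2, horizon=5000, eta=0.05))
```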
◮ Exp3 has a small regret in expectation
◮ Exp3 might have large deviations with high probability (i.e., the importance-weighted estimates $X_{i,t}/\hat p_{i,t}$ have a very large variance whenever $\hat p_{i,t}$ is small)
Fix: add some extra uniform exploration. Initialize the weights $w_{i,0} = 1$. At each round $t$:
◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)
$$\hat p_{i,t} = (1-\gamma)\frac{w_{i,t-1}}{W_{t-1}} + \frac{\gamma}{N}$$
◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Receive the reward $X_{I_t,t}$
◮ Update
$$w_{i,t} = w_{i,t-1}\exp\big(\eta\,\tilde X_{i,t}\big)$$
Theorem
For suitable choices of $\eta$ and $\gamma$, the expected regret satisfies
$$\max_{i=1,\dots,N}\sum_{t=1}^n X_{i,t} - \mathbb{E}\Big[\sum_{t=1}^n X_{I_t,t}\Big] = O\big(\sqrt{nN\log N}\big).$$
A further fix for high-probability guarantees: bias the reward estimates by $\beta$. Initialize the weights $w_{i,0} = 1$. At each round $t$:
◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)
$$\hat p_{i,t} = (1-\gamma)\frac{w_{i,t-1}}{W_{t-1}} + \frac{\gamma}{N}$$
◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Receive the reward $X_{I_t,t}$
◮ Compute
$$\tilde X_{i,t} = \frac{X_{i,t}\,\mathbb{1}\{I_t=i\} + \beta}{\hat p_{i,t}}$$
◮ Update
$$w_{i,t} = w_{i,t-1}\exp\big(\eta\,\tilde X_{i,t}\big)$$
Theorem
For suitable choices of $\eta$, $\gamma$ and $\beta$, with probability at least $1-\delta$ the regret is of order $\sqrt{nN\log(N/\delta)}$, and a bound of the same order holds in expectation.
Connections to Game Theory
Example (a two-player zero-sum game); each cell reports the pair (reward of Red, reward of Blue):

          A          B          C
   1   30, -30    -10, 10    20, -20
   2   10, -10    20, -20   -20, 20

Nash equilibrium: a set of strategies is a Nash equilibrium if no player can do better by unilaterally changing his strategy. Here:
◮ Red: take action 1 with prob. 4/7 and action 2 with prob. 3/7
◮ Blue: take action A with prob. 0, action B with prob. 4/7, and action C with prob. 3/7
Value of the game: $V = 20/7$ (reward of Red at the equilibrium).
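The equilibrium above can be checked numerically. A sketch using linear programming (it assumes SciPy is available; the payoff matrix is the one reconstructed above):

```python
import numpy as np
from scipy.optimize import linprog

# Payoff matrix of the row player (Red); rows = actions 1, 2; columns = actions A, B, C.
A = np.array([[30.0, -10.0, 20.0],
              [10.0,  20.0, -20.0]])
n_rows, n_cols = A.shape

# Variables (p_1, ..., p_{n_rows}, v): maximize v  <=>  minimize -v.
c = np.zeros(n_rows + 1)
c[-1] = -1.0
# For every column j:  v - sum_i p_i * A[i, j] <= 0  (v is a lower bound on Red's payoff).
A_ub = np.hstack([-A.T, np.ones((n_cols, 1))])
b_ub = np.zeros(n_cols)
# The p_i form a probability distribution.
A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
b_eq = np.array([1.0])
bounds = [(0, 1)] * n_rows + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("maximin strategy of Red:", res.x[:n_rows])          # ~ [4/7, 3/7]
print("value of the game:", res.x[-1], "(20/7 =", 20 / 7, ")")
```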
The repeated game: at each round $t$
◮ Row player computes a mixed strategy $\hat p_t = (\hat p_{1,t}, \dots, \hat p_{N,t})$
◮ Column player computes a mixed strategy $\hat q_t = (\hat q_{1,t}, \dots, \hat q_{M,t})$
◮ Row player selects action $I_t \in \{1, \dots, N\}$
◮ Column player selects action $J_t \in \{1, \dots, M\}$
◮ Row player suffers the loss $\ell(I_t, J_t)$
◮ Column player suffers $-\ell(I_t, J_t)$

Value of the game:
$$V = \max_q \min_p \bar\ell(p, q), \qquad \text{with } \bar\ell(p, q) = \sum_{i=1}^N\sum_{j=1}^M p_i\,q_j\,\ell(i, j).$$
A (row-player) strategy is Hannan consistent if its average loss is asymptotically no larger than that of the best fixed action in hindsight:
$$\limsup_{n\to\infty}\Big[\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) - \min_{i=1,\dots,N}\frac{1}{n}\sum_{t=1}^n \ell(i, J_t)\Big] \le 0 \quad \text{a.s.},$$
and symmetrically for the column player (with a maximum over $j = 1, \dots, M$).
Theorem
If both players follow a Hannan-consistent strategy, then the average loss of the row player converges to the value of the game:
$$\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) \xrightarrow[n\to\infty]{} V.$$
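A small self-play sketch illustrating the theorem: both players run an exponential-weights (Hedge-style) strategy with a fixed learning rate, which is only approximately Hannan consistent but good enough over this horizon. The $2\times 2$ loss matrix (matching pennies, value $V = 0.5$), the learning rate and the horizon are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

# Loss of the row player; the column player receives the opposite. Value of the game: 0.5.
L = np.array([[0.0, 1.0],
              [1.0, 0.0]])

rng = np.random.default_rng(0)
n, eta = 20_000, 0.05
cum_row = np.zeros(2)          # cumulative loss of each row action against the observed J_t
cum_col = np.zeros(2)          # cumulative loss inflicted by each column action on the observed I_t
avg_loss = 0.0
for t in range(1, n + 1):
    p = softmax(-eta * cum_row)            # row player: prefer actions with small cumulative loss
    q = softmax(eta * cum_col)             # column player: prefer actions that hurt the row player
    i, j = rng.choice(2, p=p), rng.choice(2, p=q)
    avg_loss += (L[i, j] - avg_loss) / t   # running average of the incurred loss
    cum_row += L[:, j]                     # full-information feedback for both players
    cum_col += L[i, :]

print("average loss:", avg_loss, "   value of the game: 0.5")
```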
Theorem
Moreover, the empirical frequencies of play, $\hat p_{i,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{1}\{I_t = i\}$ and $\hat q_{j,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{1}\{J_t = j\}$, converge to the set of Nash equilibria of the game.
Proof idea. Since $\bar\ell(p, J_t)$ is linear over the simplex, the minimum is attained at one of the corners [math]:
$$\min_{i=1,\dots,N}\frac{1}{n}\sum_{t=1}^n \ell(i, J_t) = \min_p \frac{1}{n}\sum_{t=1}^n \bar\ell(p, J_t).$$
We consider the empirical probability of the column player's play [def]:
$$\hat q_{j,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{1}\{J_t = j\}.$$
Elaborating on it [math]:
$$\min_p \frac{1}{n}\sum_{t=1}^n \bar\ell(p, J_t) = \min_p \sum_{j=1}^M \hat q_{j,n}\,\bar\ell(p, j) = \min_p \bar\ell(p, \hat q_n) \le \max_q\min_p \bar\ell(p, q) = V.$$
Proof idea (continued). By the definition of a Hannan-consistent strategy [def]:
$$\limsup_{n\to\infty}\Big[\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) - \min_{i=1,\dots,N}\frac{1}{n}\sum_{t=1}^n \ell(i, J_t)\Big] \le 0.$$
Combining with the previous step,
$$\limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) \le V.$$
Doing the same for the other player [zero-sum game],
$$\liminf_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) \ge V,$$
and the two inequalities give the convergence to $V$.
How do we obtain a Hannan-consistent strategy? By regret minimization: any strategy whose average regret
$$\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) - \min_{i=1,\dots,N}\frac{1}{n}\sum_{t=1}^n \ell(i, J_t)$$
vanishes almost surely is Hannan consistent.
In many situations, this has to be achieved with very little information:
◮ Players do not know the payoff matrix
◮ Players do not observe the loss of the other player
◮ Players do not even observe the action of the other player
Each player only observes its own loss $\ell(I_t, J_t)$: this is exactly the adversarial bandit setting studied above.
Beyond external regret. The external regret compares the incurred loss to that of the best fixed action,
$$\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) - \min_{i=1,\dots,N}\frac{1}{n}\sum_{t=1}^n \ell(i, J_t),$$
while the internal regret compares it to the best replacement of one action by another:
$$R^{\mathrm{int}}_n = \max_{i,j=1,\dots,N}\frac{1}{n}\sum_{t=1}^n \mathbb{1}\{I_t = i\}\,\big(\ell(i, J_t) - \ell(j, J_t)\big).$$
Theorem
If all players follow a strategy with vanishing internal regret, the empirical distribution of the joint plays converges to the set of correlated equilibria of the game.
Applications:
◮ Checkers / Chess / Go
◮ Poker
◮ Bargaining
◮ Monitoring
◮ Patrolling
◮ ...
Other Stochastic Multi-arm Bandit Problems
In some problems the goal is not to accumulate reward but to identify the best arm within a fixed exploration budget:
◮ Find the best shortest path in a limited number of days
◮ Maximize the confidence about the best treatment after a limited number of trials
◮ Discover the best advertisements after a training phase
◮ ...
If the gaps were known, the number of samples needed to discriminate arm $i$ from the best one would scale with $1/\Delta_i^2$, suggesting an (oracle) allocation proportional to it:
$$\frac{T_{i,n}}{n} \approx \frac{1/\Delta_i^2}{\sum_{j=1}^N 1/\Delta_j^2}.$$
The Successive Rejects strategy:
◮ Divide the budget into $N-1$ phases. Define $\overline{\log}(N) = \frac{1}{2} + \sum_{i=2}^N \frac{1}{i}$ and $n_k = \big\lceil \frac{n-N}{\overline{\log}(N)\,(N+1-k)} \big\rceil$ (with $n_0 = 0$)
◮ Set of active arms $A_k$ at phase $k$ ($A_1 = \{1, \dots, N\}$)
◮ For each phase $k = 1, \dots, N-1$:
  ◮ For each arm $i \in A_k$, pull arm $i$ for $n_k - n_{k-1}$ rounds
  ◮ Remove the worst arm: $A_{k+1} = A_k \setminus \arg\min_{i\in A_k} \hat\mu_{i,n_k}$
◮ Return the only remaining arm $J_n = A_N$
(A Python sketch follows below.)
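A minimal Successive Rejects sketch (the Bernoulli arms and the budget are arbitrary illustrative choices; the phase lengths $n_k$ follow the definition above):

```python
import math
import random

def successive_rejects(arms, budget):
    """arms[i]() samples a reward of arm i; returns the index of the guessed best arm."""
    N = len(arms)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, N + 1))
    def n_k(k):
        return math.ceil((budget - N) / (log_bar * (N + 1 - k)))
    active = list(range(N))
    counts = [0] * N
    sums = [0.0] * N
    prev = 0
    for k in range(1, N):
        for i in active:
            for _ in range(n_k(k) - prev):    # each active arm is pulled n_k - n_{k-1} times
                sums[i] += arms[i]()
                counts[i] += 1
        prev = n_k(k)
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)                  # discard the empirically worst arm
    return active[0]

# Example: four Bernoulli arms; arm 3 (mean 0.7) should be returned most of the time.
means = [0.3, 0.4, 0.5, 0.7]
arms = [(lambda m: (lambda: float(random.random() < m)))(m) for m in means]
print(successive_rejects(arms, budget=2000))
```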
Theorem
The probability that Successive Rejects returns a wrong arm satisfies
$$P(J_n \ne i^*) \le \frac{N(N-1)}{2}\exp\Big(-\frac{n-N}{\overline{\log}(N)\,H_2}\Big), \qquad H_2 = \max_{i\ge 2}\ i\,\Delta_{(i)}^{-2},$$
where $\Delta_{(2)} \le \dots \le \Delta_{(N)}$ are the gaps sorted in increasing order.
A UCB-like strategy for best-arm identification:
◮ Define an exploration parameter $a$
◮ Compute the scores $B_{i,s} = \hat\mu_{i,s} + \sqrt{a/s}$
◮ Select $I_t = \arg\max_{i} B_{i,T_{i,t-1}}$
◮ At the end, return the empirically best arm $J_n = \arg\max_i \hat\mu_{i,T_{i,n}}$
(A Python sketch follows below.)
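A minimal sketch of this UCB-style identification strategy (the arms, the budget, and the way $H_1$ is computed over the suboptimal arms are illustrative assumptions):

```python
import math
import random

def ucb_best_arm(arms, budget, a):
    """Score B_{i,s} = mu_hat_{i,s} + sqrt(a / s); return the empirically best arm at the end."""
    N = len(arms)
    counts, sums = [0] * N, [0.0] * N
    for t in range(budget):
        if t < N:
            i = t    # initialize every arm with one pull
        else:
            i = max(range(N), key=lambda j: sums[j] / counts[j] + math.sqrt(a / counts[j]))
        sums[i] += arms[i]()
        counts[i] += 1
    return max(range(N), key=lambda j: sums[j] / counts[j])

# Exploration parameter set as in the theorem below: a = (25/36) (n - N) / H1.
means = [0.3, 0.4, 0.5, 0.7]
gaps = [max(means) - m for m in means if m < max(means)]
H1 = sum(1 / d**2 for d in gaps)
budget = 2000
a = 25 / 36 * (budget - len(means)) / H1
arms = [(lambda m: (lambda: float(random.random() < m)))(m) for m in means]
print(ucb_best_arm(arms, budget, a))
```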
Theorem
If the exploration parameter is set to $a = \frac{25}{36}\,\frac{n-N}{H_1}$, with $H_1 = \sum_{i=1}^N 1/\Delta_i^2$, then the probability of error satisfies
$$P(J_n \ne i^*) \le 2nN\exp\Big(-\frac{2a}{25}\Big) = 2nN\exp\Big(-\frac{n-N}{18\,H_1}\Big).$$
A different pure-exploration objective: estimating all the arms uniformly well.
◮ $N$ production lines
◮ The test of the performance of a line is expensive
◮ We want an accurate estimation of the performance of each line
Setting: each arm $i$ has mean $\mu_i$ and variance $\sigma_i^2$; if it is pulled $T_{i,n}$ times, the error of its estimate is $\mathbb{E}\big[(\hat\mu_{i,T_{i,n}} - \mu_i)^2\big] = \sigma_i^2 / T_{i,n}$, and the quality of an allocation $(T_{1,n},\dots,T_{N,n})$ is measured by the worst error over the arms,
$$L_n = \max_{i=1,\dots,N}\frac{\sigma_i^2}{T_{i,n}}.$$
The optimal static allocation equalizes the errors across arms:
$$(T^*_{1,n},\dots,T^*_{N,n}) = \arg\min_{(T_{1,n},\dots,T_{N,n})} L_n \quad\Rightarrow\quad T^*_{i,n} = \frac{\sigma_i^2}{\sum_{j=1}^N \sigma_j^2}\,n, \qquad L^*_n = \frac{\sum_{i=1}^N \sigma_i^2}{n}.$$
Each arm should therefore be pulled proportionally to its variance, $\lambda_i = \sigma_i^2 / \sum_{j=1}^N \sigma_j^2$. In practice the variances are unknown and must be estimated while allocating the pulls: an exploration-exploitation problem once again.
A UCB-style allocation on the estimated variances: at each round $t$
◮ Estimate
$$\hat\sigma^2_{i,T_{i,t-1}} = \frac{1}{T_{i,t-1}}\sum_{s=1}^{T_{i,t-1}}\big(X_{i,s} - \hat\mu_{i,T_{i,t-1}}\big)^2$$
◮ Compute
$$B_{i,t} = \frac{1}{T_{i,t-1}}\Big(\hat\sigma^2_{i,T_{i,t-1}} + 5\sqrt{\frac{\log(1/\delta)}{2\,T_{i,t-1}}}\Big)$$
◮ Pull the arm $I_t = \arg\max_i B_{i,t}$
(A Python sketch follows below.)
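A minimal sketch of this variance-based allocation (the exact form of the confidence bonus follows the score above; the arms, the budget and $\delta$ are illustrative assumptions):

```python
import math
import random

def variance_ucb_allocation(arms, budget, delta):
    """Pull the arm with the largest optimistic estimate of sigma_i^2 / T_i; return the pull counts."""
    N = len(arms)
    samples = [[arms[i](), arms[i]()] for i in range(N)]   # two pulls per arm so the variance is defined

    def score(i):
        s = samples[i]
        T = len(s)
        mean = sum(s) / T
        var = sum((x - mean) ** 2 for x in s) / T
        return (var + 5 * math.sqrt(math.log(1 / delta) / (2 * T))) / T

    for _ in range(budget - 2 * N):
        i = max(range(N), key=score)
        samples[i].append(arms[i]())
    return [len(s) for s in samples]

# Example: three bounded arms with very different variances; higher-variance arms get more pulls.
arms = [lambda: random.uniform(0.0, 0.2),
        lambda: random.uniform(0.0, 1.0),
        lambda: random.uniform(0.0, 2.0)]
print(variance_ucb_allocation(arms, budget=3000, delta=0.01))
```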
Theorem
The loss of this allocation approaches the optimal static loss: the excess loss $L_n - L^*_n$ decreases as $\tilde O(n^{-3/2})$, with constants depending on the smallest variance $\min_i \sigma_i^2$.
Contact: Alessandro Lazaric, alessandro.lazaric@inria.fr, sequel.lille.inria.fr