MVA-RL Course
The Exploration-Exploitation Dilemma
A. LAZARIC (SequeL Team @INRIA-Lille)
ENS Cachan - Master 2 MVA
SequeL – INRIA Lille
The generic Q-learning scheme with exploration:

For i = 1, . . . , n
  While (episode not terminated)
    3.1 Take action at according to a suitable exploration policy
    3.2 Observe next state xt+1 and reward rt
    3.3 Compute the temporal difference δt (e.g., Q-learning)
    3.4 Update the Q-function: Q(xt, at) ← Q(xt, at) + α(xt, at)δt
    3.5 Set t = t + 1
  EndWhile
EndFor
With a purely greedy (exploitation-only) choice of actions:

For i = 1, . . . , n
  While (episode not terminated)
    3.1 Take action at = arg maxa Q(xt, a)
    3.2 Observe next state xt+1 and reward rt
    3.3 Compute the temporal difference δt (e.g., Q-learning)
    3.4 Update the Q-function: Q(xt, at) ← Q(xt, at) + α(xt, at)δt
    3.5 Set t = t + 1
  EndWhile
EndFor
With a purely random (exploration-only) choice of actions:

For i = 1, . . . , n
  While (episode not terminated)
    3.1 Take action at ∼ U(A)
    3.2 Observe next state xt+1 and reward rt
    3.3 Compute the temporal difference δt (e.g., Q-learning)
    3.4 Update the Q-function: Q(xt, at) ← Q(xt, at) + α(xt, at)δt
    3.5 Set t = t + 1
  EndWhile
EndFor

Neither extreme works on its own; a standard compromise mixing the two is sketched below.
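A minimal ε-greedy sketch of the loop above (my own illustration, not from the slides; `env` is a hypothetical episodic environment with reset/step methods):

import numpy as np

def egreedy_q_learning(env, n_episodes, n_states, n_actions,
                       alpha=0.1, gamma=0.95, eps=0.1):
    """epsilon-greedy Q-learning: explore with probability eps, exploit otherwise."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            # exploration policy: uniform action with prob. eps, greedy otherwise
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[x]))
            x_next, r, done = env.step(a)
            # temporal difference delta_t (Q-learning)
            delta = r + gamma * np.max(Q[x_next]) - Q[x, a]
            # update the Q-function: Q <- Q + alpha * delta
            Q[x, a] += alpha * delta
            x = x_next
    return Q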
Mathematical Tools

Proposition (Chernoff-Hoeffding Inequality)
Let X1, . . . , Xn be independent random variables with Xi ∈ [ai, bi] and E[Xi] = µi. Then for any ε > 0,
P( Σ_{i=1}^n (Xi − µi) ≥ ε ) ≤ exp( −2ε² / Σ_{i=1}^n (bi − ai)² ).
Proof.
P( Σ_{i=1}^n (Xi − µi) ≥ ε ) = P( exp(s Σ_{i=1}^n (Xi − µi)) ≥ exp(sε) )
  ≤ e^{−sε} E[ exp(s Σ_{i=1}^n (Xi − µi)) ]        (Markov inequality)
  = e^{−sε} Π_{i=1}^n E[ exp(s(Xi − µi)) ]          (independent random variables)
  ≤ e^{−sε} Π_{i=1}^n exp(s²(bi − ai)²/8)           (Hoeffding inequality)
  = exp( −sε + s² Σ_{i=1}^n (bi − ai)²/8 ).
If we choose s = 4ε / Σ_{i=1}^n (bi − ai)², the result follows. Similar arguments hold for P( Σ_{i=1}^n (Xi − µi) ≤ −ε ).
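As a quick numerical sanity check of the proposition (my own addition, not in the slides), one can compare the empirical tail probability with the bound by simulation:

import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 10.0, 100_000

# X_i ~ Bernoulli(0.5): a_i = 0, b_i = 1, so sum_i (b_i - a_i)^2 = n
samples = rng.binomial(1, 0.5, size=(trials, n))
deviations = samples.sum(axis=1) - n * 0.5

empirical = np.mean(deviations >= eps)     # P( sum_i (X_i - mu_i) >= eps )
bound = np.exp(-2 * eps**2 / n)            # Chernoff-Hoeffding bound

print(f"empirical tail: {empirical:.4f} <= bound: {bound:.4f}")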
Corollary
If X1, . . . , Xn are i.i.d. with Xt ∈ [0, 1] and mean µ, then the empirical mean µ̂n = (1/n) Σ_{t=1}^n Xt satisfies
P( |µ̂n − µ| ≥ ε ) ≤ 2 exp( −2nε² ),
equivalently, with probability at least 1 − δ,
|µ̂n − µ| ≤ √( log(2/δ) / (2n) ).
Definition (Markov decision process)
An MDP is a tuple M = (X, A, p, r) where
- X is the state space,
- A is the action space,
- p(y|x, a) is the transition probability,
- r(x, a, y) is the reward of transition (x, a, y).
The multi-armed bandit notation:
- i = 1, . . . , K: set of possible actions (arms)
- t = 1, . . . , n: time
- It: action selected at time t
- Xi,t: reward for action i at time t
The interaction protocol:
3.1 Take action at
3.2 Observe next state xt+1 and reward rt
3.3 Set t = t + 1
More precisely, at each round t:
- At the same time
  - The environment chooses a vector of rewards {Xi,t}_{i=1,...,K}
  - The learner chooses an arm It
- The learner receives a reward XIt,t
- The environment does not reveal the rewards of the other arms
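This protocol can be written as a small simulation; a sketch (my own illustration, using Bernoulli arms as an example of the reward distributions introduced below):

import numpy as np

class BernoulliBandit:
    """Environment side of the protocol: draws the whole reward vector,
    reveals only the reward of the pulled arm."""
    def __init__(self, means, seed=0):
        self.means = np.asarray(means)
        self.rng = np.random.default_rng(seed)

    def round(self, arm):
        rewards = self.rng.binomial(1, self.means)  # {X_{i,t}} for all arms
        return rewards[arm]                         # only X_{I_t,t} is revealed

bandit = BernoulliBandit([0.3, 0.5, 0.7])
x = bandit.round(arm=1)   # the learner pulls I_t = 1 and observes X_{I_t,t}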
The performance of the learner is measured by the regret w.r.t. the best fixed arm:
Rn = max_{i=1,...,K} E[ Σ_{t=1}^n Xi,t ] − E[ Σ_{t=1}^n XIt,t ]
(the expectation is w.r.t. the randomness in the rewards X or in the algorithm).
Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner
⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration

Problem 2: Whenever the learner pulls a bad arm, it suffers some regret
⇒ the learner should reduce the regret by repeatedly pulling the best arm ⇒ exploitation

Challenge: The learner should solve the exploration-exploitation dilemma!
Examples of applications:
- Packet routing
- Clinical trials
- Web advertising
- Computer games
- Resource mining
- ...
Definition (stochastic multi-armed bandit)
- Each arm i has a distribution νi bounded in [0, 1] with mean µi
- The rewards are i.i.d. Xi,t ∼ νi (as in the MDP model)
- Number of times arm i has been pulled after n rounds:
  Ti,n = Σ_{t=1}^n I{It = i}
- Regret:
  Rn = max_{i=1,...,K} (nµi) − E[ Σ_{i=1}^K µi Ti,n ] = Σ_{i=1}^K Δi E[Ti,n]
- Gap: Δi = µi* − µi, where i* is the optimal arm
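The decomposition Rn = Σi Δi E[Ti,n] says the regret is fully determined by how often suboptimal arms are pulled; a small helper illustrating it (my own sketch):

import numpy as np

def regret(means, pull_counts):
    """Regret decomposition: R_n = sum_i Delta_i * T_{i,n}."""
    means = np.asarray(means, dtype=float)
    gaps = means.max() - means          # Delta_i = mu_{i*} - mu_i
    return float(np.dot(gaps, pull_counts))

print(regret([0.3, 0.5, 0.7], [10, 73, 917]))   # only suboptimal arms contribute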
The optimism in face of uncertainty principle: act as if the environment were the best plausible one given the observations so far.
- If the best possible world is correct ⇒ no regret
- If the best possible world is wrong ⇒ the reduction in the uncertainty about the pulled arm is fast, so the mistake is not repeated too often
[Figure: estimated mean rewards with confidence intervals for 4 arms, with the number of pulls in parentheses (10, 73, 3, 23); axes: Arms vs. Reward]
The upper-confidence-bound (UCB) strategy: at each round t
- Compute the score of each arm i (with s = Ti,t−1 pulls so far)
  Bi,s,t = µ̂i,s + √( log(1/δ) / (2s) )
- Pull arm It = arg max_{i=1,...,K} Bi,s,t
- Update the number of pulls TIt,t = TIt,t−1 + 1 and the other statistics
In compact form:
- Compute the score of each arm i
  Bi,t = µ̂i,Ti,t−1 + √( log(1/δ) / (2Ti,t−1) )
- Pull arm It = arg max_{i=1,...,K} Bi,t
- Update the number of pulls TIt,t = TIt,t−1 + 1 and the empirical mean µ̂It,TIt,t
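A compact implementation of this rule, reusing the BernoulliBandit environment sketched earlier (my own code; it assumes rewards in [0, 1] and a fixed confidence δ):

import numpy as np

def ucb(bandit, K, n, delta):
    counts = np.zeros(K)                 # T_{i,t}
    means = np.zeros(K)                  # hat{mu}_{i,T_{i,t}}
    for t in range(n):
        if t < K:
            arm = t                      # initialize: pull each arm once
        else:
            bonus = np.sqrt(np.log(1.0 / delta) / (2.0 * counts))
            arm = int(np.argmax(means + bonus))        # arg max_i B_{i,t}
        x = bandit.round(arm)
        counts[arm] += 1
        means[arm] += (x - means[arm]) / counts[arm]   # incremental mean update
    return counts, means

# the analysis below suggests setting delta = 1/n**2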
Theorem
Running UCB with δ = 1/n² guarantees, for each suboptimal arm i,
E[Ti,n] ≤ log n / Δi² + 1 + K,
and thus a regret
Rn = Σ_{i=1}^K Δi E[Ti,n] ≤ Σ_{i: Δi > 0} ( log n / Δi + (1 + K) Δi ).
How much should the learner explore?
- Enough: so as to understand which arm is the best
- Not too much: so as to keep the regret as small as possible

The confidence level 1 − δ tunes this trade-off:
- Big 1 − δ: high level of exploration
- Small 1 − δ: high level of exploitation
Let's dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]
E = { ∀i = 1, . . . , K, ∀s = 1, . . . , n : |µ̂i,s − µi| ≤ √( log(1/δ) / (2s) ) },
which by Chernoff-Hoeffding and a union bound over arms and steps holds with probability at least 1 − nKδ.

At time t we pull arm i [algorithm]
µ̂i,Ti,t−1 + √( log(1/δ) / (2Ti,t−1) ) ≥ µ̂i*,Ti*,t−1 + √( log(1/δ) / (2Ti*,t−1) ).

On the event E we have [math]
µi + 2√( log(1/δ) / (2Ti,t−1) ) ≥ µi*.

Assume t is the last time i is pulled, then Ti,n = Ti,t−1 + 1, thus
µi + 2√( log(1/δ) / (2(Ti,n − 1)) ) ≥ µi*.

Reordering [math]
Ti,n ≤ log(1/δ) / (2Δi²) + 1
under event E, and thus with probability 1 − nKδ.

Moving to the expectation [statistics]
E[Ti,n] = E[Ti,n I_E] + E[Ti,n I_{E^C}] ≤ log(1/δ) / (2Δi²) + 1 + n(nKδ).

Trading off the two terms with δ = 1/n², we obtain
Bi,t = µ̂i,Ti,t−1 + √( log n / Ti,t−1 )   and   E[Ti,n] ≤ log n / Δi² + 1 + K.
UCB values (for the δ = 1/n algorithm) with an exploration parameter ρ:
Bi,s = µ̂i,s + ρ √( log n / (2s) )

Theory:
- ρ < 0.5: polynomial regret w.r.t. n
- ρ > 0.5: logarithmic regret w.r.t. n

Practice: ρ = 0.2 is often the best choice.
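A one-line scoring function with ρ exposed (my own sketch):

import numpy as np

def ucb_rho_score(mean_hat, pulls, horizon, rho=0.2):
    """B_{i,s} = mu_hat + rho * sqrt(log(n) / (2s)).
    Theory asks rho > 0.5 for logarithmic regret; in practice the slides
    report rho = 0.2 often works best."""
    return mean_hat + rho * np.sqrt(np.log(horizon) / (2.0 * pulls))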
UCB-V
Idea: use empirical Bernstein bounds for more accurate confidence intervals.

Algorithm:
- Compute the score of each arm i
  Bi,t = µ̂i,Ti,t + √( σ̂²i,Ti,t log t / Ti,t ) + 8 log t / (3Ti,t)
- Pull arm It = arg max_{i=1,...,K} Bi,t
- Update the number of pulls TIt,t, the empirical mean µ̂i,Ti,t and the empirical variance σ̂²i,Ti,t

Regret: Rn ≤ O( (σ²/Δ) log n ), improving on the O( (1/Δ) log n ) bound of UCB when the variances are small.
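A sketch of this variance-aware score (my own code, following the formula above):

import numpy as np

def ucbv_score(mean_hat, var_hat, pulls, t):
    """Empirical-Bernstein score: low-variance arms get a smaller bonus."""
    log_t = np.log(max(t, 2))
    return (mean_hat
            + np.sqrt(var_hat * log_t / pulls)   # variance-dependent term
            + 8.0 * log_t / (3.0 * pulls))       # correction for the range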
KL-UCB
Idea: use even tighter confidence intervals based on the Kullback–Leibler divergence
d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)).

Algorithm: compute the score of each arm i (a convex optimization problem)
Bi,t = max{ q ∈ [0, 1] : Ti,t d(µ̂i,Ti,t, q) ≤ log t }
and pull It = arg max_{i=1,...,K} Bi,t.

Regret:
E[Ti,n] ≤ log n / d(µi, µ*) + C1 log(log(n)) + C2(ε) / n^{β(ε)},   where d(µi, µ*) > 2Δi²,
so the bound is never worse than the one of UCB.
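Since s·d(µ̂, q) is increasing in q ≥ µ̂, the score can be computed by bisection; a sketch (my own implementation of the rule above):

import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_score(mean_hat, pulls, t, iters=30):
    """Largest q in [mean_hat, 1] with pulls * d(mean_hat, q) <= log(t)."""
    budget = np.log(max(t, 2)) / pulls
    lo, hi = mean_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_bernoulli(mean_hat, mid) <= budget:
            lo = mid      # mid still feasible: move the lower end up
        else:
            hi = mid
    return lo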
Thompson sampling
Idea: use a Bayesian approach to estimate the means {µi}i.

Algorithm: assuming Bernoulli arms and a Beta prior on the mean,
- Compute the posterior Di,t = Beta(Si,t + 1, Fi,t + 1), where Si,t and Fi,t are the numbers of successes and failures of arm i so far
- Draw a mean sample µ̃i,t ∼ Di,t
- Pull arm It = arg max_{i=1,...,K} µ̃i,t
- If XIt,t = 1 update SIt,t+1 = SIt,t + 1, else update FIt,t+1 = FIt,t + 1

Regret: lim_{n→∞} Rn / log(n) = Σ_{i=1}^K Δi / d(µi, µ*)
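The Beta-Bernoulli version is a few lines of code; a sketch reusing the BernoulliBandit environment from earlier (my own illustration):

import numpy as np

def thompson_sampling(bandit, K, n, seed=0):
    rng = np.random.default_rng(seed)
    S, F = np.zeros(K), np.zeros(K)        # successes / failures per arm
    for t in range(n):
        theta = rng.beta(S + 1, F + 1)     # draw mu_tilde_i ~ Beta(S_i+1, F_i+1)
        arm = int(np.argmax(theta))        # I_t = arg max_i mu_tilde_i
        if bandit.round(arm) == 1:
            S[arm] += 1
        else:
            F[arm] += 1
    return S, F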
Theorem (asymptotic lower bound)
For any (uniformly consistent) strategy,
lim inf_{n→∞} Rn / log(n) ≥ Σ_{i: Δi > 0} Δi / d(µi, µ*),
so Thompson sampling (and KL-UCB) match the best achievable asymptotic regret.
Example (news recommendation):
- Different users may have different preferences
- Different news may have different characteristics
- The set of available news may change over time
- We want to minimise the regret w.r.t. the best news for each user
Limitations of MAB:
- Arms are independent
- Each single arm has to be tested at least once
- Regret scales linearly with K

Linear bandit approach:
- Embed arms in R^d (each arm a is mapped to a feature vector φa ∈ R^d)
- The reward varies linearly with the arm
  E[r(a)] = φa⊤ θ*,
  where θ* ∈ R^d is unknown.

Remark: if d = |A| and φa = ea, then it coincides with MAB.
Protocol: at each round t the learner chooses an arm at and receives a reward rt(at) with E[rt(at)] = φat⊤ θ*.
The MAB approach: the value of an arm is estimated by µ̂i,t.
Exploiting the linear assumption:
- Estimate θ* using regularized least squares
  θ̂n = arg min_θ Σ_{t=1}^n ( φat⊤ θ − rt(at) )² + λ ||θ||²
- Closed-form solution
  An = Σ_{t=1}^n φat φat⊤ + λI,   bn = Σ_{t=1}^n φat rt(at)   ⇒   θ̂n = An⁻¹ bn
- Estimate of the value of arm a
  r̂n(a) = φa⊤ θ̂n
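The closed form is a ridge regression; a sketch (my own illustration, with Phi the n×d matrix whose rows are the feature vectors of the selected arms):

import numpy as np

def rls_estimate(Phi, rewards, lam=1.0):
    """theta_hat = A^{-1} b with A = Phi^T Phi + lam*I and b = Phi^T rewards."""
    d = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(d)
    b = Phi.T @ rewards
    theta_hat = np.linalg.solve(A, b)
    return theta_hat, A        # A is also needed for the confidence widths

# estimated value of an arm: r_hat(a) = phi_a @ theta_hat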
The MAB approach: construct confidence intervals on each arm separately.
Exploiting the linear assumption:
- The estimate r̂n(a) may be accurate when "similar" arms have been selected (even if Tn(a) = 0!)
- Confidence intervals
  | r̂n(a) − E[r(a)] | ≤ αn √( φa⊤ An⁻¹ φa )
- Tuning of the confidence interval (with B a bound on ||θ*|| and L a bound on the features; up to constants)
  αn = B√λ + √( 2 log(1/δ) + d log(1 + nL/λ) )

Remark: the confidence interval reduces to the MAB one when all arms are orthogonal (φa = ea).
The MAB approach – UCB: pull arm It = arg max_i ( µ̂i,t + √( log(1/δ) / (2Ti,t) ) ).
Exploiting the linear assumption:
- At each time step t select arm
  at = arg max_{a∈A} ( φa⊤ θ̂t + αt √( φa⊤ At⁻¹ φa ) )
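The selection rule combines the RLS estimate with a matrix-weighted norm; a sketch building on rls_estimate above (my own code):

import numpy as np

def linucb_select(features, theta_hat, A, alpha):
    """a_t = arg max_a  phi_a . theta_hat + alpha * sqrt(phi_a . A^{-1} phi_a)."""
    A_inv = np.linalg.inv(A)
    scores = [phi @ theta_hat + alpha * np.sqrt(phi @ A_inv @ phi)
              for phi in features]     # one optimistic score per arm
    return int(np.argmax(scores))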
The MAB approach – UCB: regret O(K log(n)/Δ), or O(√(Kn log n)) without gap assumptions.
Exploiting the linear assumption:
- Regret bound
  Rn = O( d √n log n )
(the dimension d replaces the number of arms K).
The MAB approach – TS:
- Compute a posterior over µi
- Draw a sample µ̃i from the posterior
- Select arm It = arg max_i µ̃i

Exploiting the linear assumption: maintain a posterior over θ* and select at = arg max_a φa⊤ θ̃t with θ̃t drawn from the posterior.
- Regret bound
  Rn = O( d^{3/2} √n log n )
Limitations of MAB:
- The value of an arm is fixed
- No side-information / context is used

Contextual linear bandit approach:
- Finite arms
- Define a context x ∈ X
- The reward varies linearly with the context
  E[r(x, a)] = φx⊤ θ*a

Extensions:
- Embed arms in R^d as well and let E[r(x, a)] = φx,a⊤ θ*
- Let the arm set change over time: At
Protocol (news recommendation):
- User xt arrives and a set of news At is provided
- The user xt together with a news a ∈ At are described by a feature vector φxt,a
- The learner chooses a news at ∈ At and receives a reward rt(xt, at)
- The regret is measured w.r.t. the best news for each user, a*t = arg max_{a∈At} E[rt(xt, a)]:
  Rn = Σ_{t=1}^n ( E[rt(xt, a*t)] − E[rt(xt, at)] )
Estimation:
- Let Ta = {t : at = a}
- Construct the design matrix Da of all the contexts observed when arm a was selected
- Construct the reward vector ca of all the rewards observed when arm a was selected
- Estimate θa as
  θ̂a = (Da⊤ Da + I)⁻¹ Da⊤ ca
- Chernoff-Hoeffding in this case becomes
  | φx,a⊤ θ̂a − E[r(x, a)] | ≤ α √( φx,a⊤ (Da⊤ Da + I)⁻¹ φx,a )
- and the UCB strategy is
  at = arg max_{a∈At} ( φx,a⊤ θ̂a + α √( φx,a⊤ (Da⊤ Da + I)⁻¹ φx,a ) )
How should such an algorithm be evaluated?
- Online evaluation: too expensive
- Offline evaluation: how to use the logged data?
Offline evaluation relies on two assumptions:
- Assumption 1: contexts and rewards are i.i.d. from a fixed distribution
- Assumption 2: the logging strategy is random (e.g., actions chosen uniformly at random)
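Under these assumptions a new policy can be evaluated by replay: keep only the logged rounds where the policy agrees with the logged action. A sketch of this estimator (my own illustration; `policy` is a hypothetical function mapping a context to an action):

def replay_evaluate(policy, logged_data):
    """logged_data: iterable of (context, action, reward) tuples collected
    under a uniformly random logging policy. Returns the average reward of
    the evaluated policy on the matching rounds."""
    total, matched = 0.0, 0
    for context, action, reward in logged_data:
        if policy(context) == action:   # keep only rounds where actions agree
            total += reward
            matched += 1
    return total / max(matched, 1)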
Other Stochastic Multi-arm Bandit Problems

Best-arm identification examples:
- Find the best shortest path in a limited number of days
- Maximize the confidence about the best treatment after a limited number of trials
- Discover the best advertisements after a training phase
- ...
The difficulty of the problem is characterized by the gaps:
- The (oracle) proportion of pulls that arm i should receive is (1/Δi²) / Σ_{j=1}^N (1/Δj²)
- The complexity of the problem is H1 = Σ_{i=1}^N 1/Δi²
The successive rejects algorithm:
- Divide the budget n into N − 1 phases. Define n0 = 0 and
  nk = ⌈ (1/loḡ(N)) (n − N) / (N + 1 − k) ⌉,   with   loḡ(N) = 1/2 + Σ_{i=2}^N 1/i
- Set of active arms Ak at phase k (A1 = {1, . . . , N})
- For each phase k = 1, . . . , N − 1:
  - For each arm i ∈ Ak, pull arm i for nk − nk−1 rounds
  - Remove the worst arm: Ak+1 = Ak \ arg min_{i∈Ak} µ̂i,nk
- Return the only remaining arm Jn = AN
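A sketch of this phased elimination (my own code, reusing the BernoulliBandit environment and the reconstructed phase lengths):

import numpy as np

def successive_rejects(bandit, N, n):
    log_bar = 0.5 + sum(1.0 / i for i in range(2, N + 1))
    n_k = lambda k: int(np.ceil((n - N) / (log_bar * (N + 1 - k))))
    active = list(range(N))
    sums, counts = np.zeros(N), np.zeros(N)
    prev = 0
    for k in range(1, N):                  # phases k = 1, ..., N-1
        for i in active:                   # pull each active arm n_k - n_{k-1} times
            for _ in range(n_k(k) - prev):
                sums[i] += bandit.round(i)
                counts[i] += 1
        prev = n_k(k)
        worst = min(active, key=lambda i: sums[i] / max(counts[i], 1))
        active.remove(worst)               # reject the empirically worst arm
    return active[0]                       # J_n, the recommended arm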
Theorem
The successive rejects algorithm returns a wrong arm with probability at most
P(Jn ≠ i*) ≤ (N(N − 1)/2) exp( − (n − N) / (loḡ(N) H2) ),   with   H2 = max_i i Δ⁻²(i),
where Δ(i) denotes the i-th smallest gap.
The UCB-E algorithm:
- Define an exploration parameter a
- Compute the score Bi,s = µ̂i,s + √(a/s)
- Select It = arg max_{i=1,...,N} Bi,s
- At the end return Jn = arg max_i µ̂i,Ti,n
Theorem
With the exploration parameter set to a = (25/36) (n − N)/H1, where H1 = Σ_{i=1}^N 1/Δi²,
P(Jn ≠ i*) ≤ 2nN exp( − (n − N) / (18 H1) ).
Example (active estimation):
- N production lines
- The test of the performance of a line is expensive
- We want an accurate estimation of the performance of each line
- For each arm i, the mean µi is estimated by µ̂i,Ti,n; if rewards are i.i.d., its expected quadratic error is
  Li,n = E[ (µ̂i,Ti,n − µi)² ] = σi² / Ti,n
- The global loss is Ln = max_{i=1,...,N} Li,n
The optimal static allocation (which requires knowing the variances σi²):
(T*1,n, . . . , T*N,n) = arg min_{(T1,n,...,TN,n)} Ln
which gives
T*i,n = ( σi² / Σ_{j=1}^N σj² ) n
and the optimal loss
L*n = ( Σ_{i=1}^N σi² ) / n.
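The oracle allocation is a one-liner (my own sketch):

import numpy as np

def optimal_allocation(variances, n):
    """T*_{i,n} = n * sigma_i^2 / sum_j sigma_j^2: noisier arms get more pulls."""
    var = np.asarray(variances, dtype=float)
    return n * var / var.sum()

print(optimal_allocation([1.0, 4.0, 0.25], n=100))   # -> [19.05 76.19  4.76]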
In other words, the optimal proportion of pulls to arm i is λi = σi² / Σ_{j=1}^N σj².
The adaptive allocation algorithm: at each round t
- Estimate the empirical variance
  σ̂²i,Ti,t−1 = (1/Ti,t−1) Σ_{s=1}^{Ti,t−1} ( Xs,i − µ̂i,Ti,t−1 )²
- Compute the (optimistic) score
  Bi,t = (1/Ti,t−1) ( σ̂²i,Ti,t−1 + 5 √( log(1/δ) / (2Ti,t−1) ) )
- Pull the arm It = arg max_i Bi,t
Theorem
The loss of the adaptive allocation approaches the optimal static loss: the excess loss Ln − L*n vanishes with n, at a rate governed by λmin = min_i λi (the smallest optimal proportion).
Back to reinforcement learning:

For i = 1, . . . , n
  While (episode not terminated)
    3.1 Take action at according to a suitable exploration policy
    3.2 Observe next state xt+1 and reward rt
    3.3 Compute the temporal difference δt (e.g., Q-learning)
    3.4 Update the Q-function: Q(xt, at) ← Q(xt, at) + α(xt, at)δt
    3.5 Set t = t + 1
  EndWhile
EndFor
In MAB the regret is measured against the best fixed arm,
Rn = max_{i=1,...,K} E[ Σ_{t=1}^n Xi,t ] − E[ Σ_{t=1}^n XIt,t ].
In RL the comparator is the optimal policy π*:
Rn = E[ Σ_{t=1}^n rt(x*t, π*(x*t)) ] − E[ Σ_{t=1}^n rt ],
where the optimal trajectory follows x*t ∼ p(· | x*t−1, π*(x*t−1)).
- A policy π is defined as π : X → A
- The long-term average reward of a policy is
  ρπ(M) = lim_{n→∞} E[ (1/n) Σ_{t=1}^n rt ]
- The optimal policy is
  π*(M) = arg max_π ρπ(M)   ⇒   ρ*(M) = ρ^{π*(M)}(M)
- Exploration-exploitation dilemma:
  - Explore the environment to estimate its parameters
  - Exploit the estimates to collect reward
[Figure: learning curve of the per-step reward over time; the regret after n steps is the gap accumulated between the optimal level ρ* and the learning curve]
[Figure: the space of MDPs, showing the true MDP M, the estimated MDP M̂t and a high-confidence region around it; optimism in face of uncertainty selects the optimistic MDP M̃t maximizing ρ*(·) over the region and executes π*(M̃t), whose gain ρ*(M̃t) upper bounds ρ*(M)]
UCRL2

Initialize episode k:
- Nk(x, a): number of visits to (x, a) before episode k
- Rk(x, a) = Σ_t rt I{xt = x, at = a}: cumulative reward observed in (x, a)
- Empirical estimates
  r̂k(x, a) = Rk(x, a) / Nk(x, a),   p̂k(x′|x, a) = Pk(x, a, x′) / Nk(x, a)

Compute the optimistic policy:
- Set of plausible MDPs
  Mk = { M̃ : |r̃(x, a) − r̂k(x, a)| ≤ Br(x, a),  ||p̃(·|x, a) − p̂k(·|x, a)||1 ≤ Bp(x, a) }
- π̃k = arg max_π max_{M̃∈Mk} ρ(π; M̃)

Execute π̃k until at least one state-action counter is doubled.
Set of plausible MDPs Mk = {M̃}: confidence intervals built using Chernoff bounds,
Br(x, a) ≈ √( log(1/δ) / Nk(x, a) );   Bp(x, a) ≈ √( X log(1/δ) / Nk(x, a) ).

Computation of the optimistic optimal policy:
π̃k = arg max_π max_{M̃∈Mk} ρπ(M̃)
Planning in average reward MDPs
- The optimal Bellman equation: optimal gain ρ* and bias u*
  u*(x) + ρ* = max_a [ r(x, a) + Σ_{x′} p(x′|x, a) u*(x′) ]
- Value iteration
  vn(x) = max_a [ r(x, a) + Σ_{x′} p(x′|x, a) vn−1(x′) ]
- Guarantees for the greedy policy
  πn(x) = arg max_a [ r(x, a) + Σ_{x′} p(x′|x, a) vn−1(x′) ]
Planning in optimistic average reward MDPs
- The optimal Bellman equation: optimistic gain ρ̃ and bias ũ
  ũ(x) + ρ̃ = max_a max_{r̃(x,a)} max_{p̃(·|x,a)} [ r̃(x, a) + Σ_{x′} p̃(x′|x, a) ũ(x′) ]
- Extended value iteration
  vn(x) = max_a max_{r̃(x,a)} max_{p̃(·|x,a)} [ r̃(x, a) + Σ_{x′} p̃(x′|x, a) vn−1(x′) ]
  Since the maximum over the rewards is attained at r̃+(x, a) = r̂(x, a) + Br(x, a), this reduces to
  vn(x) = max_a [ r̃+(x, a) + max_{p̃(·|x,a)} Σ_{x′} p̃(x′|x, a) vn−1(x′) ]
- The inner maximization is an LP problem: within the budget ||p̃(·|x, a) − p̂(·|x, a)||1 ≤ Bp(x, a), assign the highest probability to the states x′ with the highest vn−1(x′)
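The inner LP has a greedy solution: add mass to the best state and remove it from the worst ones. A sketch (my own implementation of the step above):

import numpy as np

def optimistic_transition(p_hat, v, beta):
    """max over p_tilde of p_tilde . v  subject to  ||p_tilde - p_hat||_1 <= beta."""
    p = p_hat.copy()
    best = int(np.argmax(v))
    p[best] = min(1.0, p_hat[best] + beta / 2.0)   # put extra mass on the best state
    excess = p.sum() - 1.0
    for s in np.argsort(v):                        # drain mass from the worst states
        if excess <= 0:
            break
        removed = min(p[s], excess) if s != best else 0.0
        p[s] -= removed
        excess -= removed
    return p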
Theorem
UCRL2 run over n steps in an MDP with diameter D, X states and A actions suffers a regret
Rn = O( D X √(A n) ),
where the diameter D = max_{x,x′} min_π E[ τπ(x → x′) ] is the expected time to move between any two states under the best policy.
Posterior sampling (PSRL)

Initialize episode k (same statistics as in UCRL2).

Compute a random policy:
- M̃k = { r̃k, p̃k } with r̃k, p̃k sampled from their posteriors
- π̃k = arg max_π ρπ(M̃k)

Execute π̃k until at least one state-action counter is doubled.
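A sketch of the sampling step for a finite MDP (my own choice of conjugate priors: Dirichlet over transitions, Gaussian over mean rewards; the slides only say "sampled from their posteriors"):

import numpy as np

def sample_mdp(trans_counts, rew_sum, rew_count, rng):
    """Draw one MDP (r_tilde, p_tilde) from the posterior, per state-action pair."""
    X, A, _ = trans_counts.shape
    p = np.zeros((X, A, X))
    r = np.zeros((X, A))
    for x in range(X):
        for a in range(A):
            p[x, a] = rng.dirichlet(trans_counts[x, a] + 1.0)  # Dirichlet(1) prior
            m = rew_count[x, a]
            r[x, a] = rng.normal(rew_sum[x, a] / max(m, 1.0),
                                 1.0 / np.sqrt(m + 1.0))       # Gaussian posterior mean
    return r, p

# pi_k is then the average-reward optimal policy of (r, p),
# executed until a state-action counter doubles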
Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr