On adaptive regret bounds for non-stochastic bandits
Gergely Neu
INRIA Lille, SequeL team
Universitat Pompeu Fabra, Barcelona
Outline
- Online learning and bandits
- Adaptive bounds in online learning
- Adaptive bounds for bandits
*Opinion alert!
The non-stochastic bandit problem. For each round $t = 1, 2, \dots, T$:
- the environment picks a loss vector $\ell_t \in [0,1]^N$,
- the learner picks an arm $I_t \in \{1, \dots, N\}$,
- the learner suffers and observes only the loss $\ell_{t,I_t}$.
Need to explore!
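The interaction protocol above can be sketched as follows; the uniformly random learner is only a placeholder (it explores but never exploits), and all names in this sketch are illustrative assumptions, not the talk's algorithm:

```python
import random

# Minimal sketch of the bandit protocol: T rounds, N arms, and only the
# played arm's loss is ever revealed to the learner.
def run_bandit(losses, policy):
    """losses: T x N matrix with entries in [0,1]; policy(t, history) -> arm."""
    history = []
    for t, loss_vector in enumerate(losses):
        arm = policy(t, history)
        # bandit feedback: the learner sees loss_vector[arm] and nothing else
        history.append((arm, loss_vector[arm]))
    return history

random.seed(0)
T, N = 100, 5
losses = [[random.random() for _ in range(N)] for _ in range(T)]
history = run_bandit(losses, lambda t, h: random.randrange(N))
```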
Regret against arm $i$:
$R_{T,i} = \mathbb{E}\left[ \sum_{t=1}^{T} \ell_{t,I_t} - \sum_{t=1}^{T} \ell_{t,i} \right]$
Worst-case (total) regret:
$R_T = R_{T,i^*} = \max_i R_{T,i}$
Minimax rates:
Full information: $R_T = \Theta(\sqrt{T \log N})$
Bandit: $R_T = \Theta(\sqrt{TN})$
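A hedged illustration of the regret definitions above, computing $R_{T,i}$ and $R_T$ for a toy loss sequence; the helper names are assumptions of this sketch:

```python
# R_{T,i} compares the learner's realized cumulative loss with the cumulative
# loss of arm i; R_T takes the worst (largest) such gap over all arms.
def regret_per_arm(losses, played):
    """losses: T x N loss matrix; played: length-T list of arm indices."""
    T, N = len(losses), len(losses[0])
    learner_loss = sum(losses[t][played[t]] for t in range(T))
    L = [sum(losses[t][i] for t in range(T)) for i in range(N)]  # L_{T,i}
    return [learner_loss - L[i] for i in range(N)]               # R_{T,i}

def total_regret(losses, played):
    return max(regret_per_arm(losses, played))                   # R_T = max_i R_{T,i}

losses = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
played = [1, 1, 0]   # learner's cumulative loss = 1 + 1 + 1 = 3
# arm 0 total = 1, arm 1 total = 2  ->  R_{T,0} = 2, R_{T,1} = 1
print(total_regret(losses, played))  # 2.0
```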
Adaptive bounds in online learning (full information):
minimax: $R_T = O(\sqrt{T \log N})$
first-order ($L_{T,i} = \sum_t \ell_{t,i}$): $R_T = O\left(\sqrt{L_{T,i^*} \log N}\right)$ (Cesa-Bianchi, Mansour and Stoltz, 2005, "with a little cheating")
second-order ($Q_{T,i} = \sum_t \ell_{t,i}^2$): $R_T = O\left(\sqrt{Q_{T,i^*} \log N}\right)$ (Cesa-Bianchi, Mansour and Stoltz, 2005)
variance ($V_{T,i} = \sum_t (\ell_{t,i} - \bar\ell_i)^2$): $R_T = O\left(\sqrt{V_{T,i^*} \log N}\right)$ (Hazan and Kale, 2010)
The bandit counterparts:
minimax: $R_T = O(\sqrt{TN \log N})$ (Auer et al., 2002, + some hacking)
first-order: (it's complicated)
variance: $R_T = \tilde{O}\left(N^2 \sqrt{V_T}\right)$ (Hazan and Kale, 2011)
Why "it's complicated": known first-order bandit bounds take the form
$R_T = O\left(\sqrt{N \Lambda_{T,i^*} \log N}\right)$ with $\Lambda_{T,i} = \sum_t \lambda_{t,i}$.
Known results (up to log factors):
- Stoltz (2005): $O\left(N \sqrt{L_{T,i^*}}\right)$
- Allenberg, Auer, Győrfi and Ottucsák (2006): $O\left(\sqrt{N L_{T,i^*}}\right)$
- Rakhlin and Sridharan (2013): $O\left(N^{3/2} \sqrt{L_{T,i^*}}\right)$
Problem: no real insight from the analyses!
For every round $t = 1, 2, \dots, T$:
- compute the importance-weighted estimates $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}} \mathbb{I}\{I_t = i\}$,
- feed $\hat\ell_{t,i}$ into a black-box online learning algorithm to compute $p_{t+1}$.
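One way to instantiate this reduction is to use exponential weights as the black box, which recovers an EXP3-style algorithm. A minimal sketch, with an illustrative learning-rate choice:

```python
import math, random

# Exponential weights fed importance-weighted loss estimates (EXP3-style).
def exp3(losses, eta, rng):
    N = len(losses[0])
    Lhat = [0.0] * N                      # cumulative estimates \hat L_{t,i}
    total_loss = 0.0
    for loss_vector in losses:
        w = [math.exp(-eta * Lhat[i]) for i in range(N)]
        Z = sum(w)
        p = [wi / Z for wi in w]          # p_{t,i}
        arm = rng.choices(range(N), weights=p)[0]
        total_loss += loss_vector[arm]
        # importance-weighted estimate: \hat\ell_{t,i} = (\ell/p) * I{I_t = i}
        Lhat[arm] += loss_vector[arm] / p[arm]
    return total_loss, Lhat

rng = random.Random(42)
T, N = 500, 3
losses = [[0.1, 0.5, 0.9] for _ in range(T)]   # arm 0 is always best
total, Lhat = exp3(losses, eta=math.sqrt(math.log(N) / (T * N)), rng=rng)
```

With these losses the algorithm should concentrate on arm 0, so the average loss per round ends up well below the uniform-play value of 0.5.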
For exponential weights ($\eta$: "learning rate"):
$R_T \le \frac{\log N}{\eta} + \eta\, \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} \hat\ell_{t,i}^2 \right]
\le \frac{\log N}{\eta} + \eta\, \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{i=1}^{N} \hat\ell_{t,i} \right]
= \frac{\log N}{\eta} + \eta \sum_{i=1}^{N} L_{T,i}$,
using $p_{t,i} \hat\ell_{t,i} \le 1$ and $\mathbb{E}[\hat\ell_{t,i}] = \ell_{t,i}$. For appropriate $\eta$ this yields $R_T = O(\sqrt{TN \log N})$, since $\sum_i L_{T,i} \le TN$.
It's all because $\mathbb{E}[\hat L_{T,i}] = L_{T,i}$!!!
Idea: try to enforce $\hat L_{T,i} = O(L_{T,i^*})$.
Need optimistic estimates!
For every round $t = 1, 2, \dots, T$:
- replace $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}} \mathbb{I}\{I_t = i\}$ by $\tilde\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \mathbb{I}\{I_t = i\}$ — "Implicit exploration" (Kocák, N, Valko and Munos, 2015),
- feed $\tilde\ell_{t,i}$ into a black-box online learning algorithm to compute $p_{t+1}$.
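A small numerical sketch of the IX estimate above: dividing by $p_{t,i} + \gamma$ instead of $p_{t,i}$ makes the estimate optimistic (biased downward), since $\mathbb{E}[\tilde\ell_{t,i}] = \ell_{t,i} \cdot \frac{p_{t,i}}{p_{t,i}+\gamma} \le \ell_{t,i}$. The function names are illustrative:

```python
# IX estimate: \tilde\ell_{t,i} = \ell_{t,i} / (p_{t,i} + gamma) * I{I_t = i}
def ix_estimate(loss, p, drawn, gamma):
    return [(loss[i] / (p[i] + gamma)) if i == drawn else 0.0
            for i in range(len(p))]

def expected_estimate(loss, p, gamma):
    # exact expectation over the random draw I_t ~ p
    return [p[i] * loss[i] / (p[i] + gamma) for i in range(len(p))]

loss, p, gamma = [0.8, 0.2], [0.5, 0.5], 0.1
exp_est = expected_estimate(loss, p, gamma)
print(exp_est)  # both entries lie strictly below the true losses 0.8 and 0.2
```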
Follow the Perturbed Leader (Kalai and Vempala, 2005; Poland, 2005):
For every round $t = 1, 2, \dots, T$:
- draw perturbations $Z_{t,i}$,
- play $I_t = \arg\min_i \left( \tilde L_{t-1,i} - Z_{t,i} \right)$,
- compute $\tilde\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \mathbb{I}\{I_t = i\}$ ("implicit exploration", Kocák, N, Valko and Munos, 2015),
- update $\tilde L_{t,i} = \tilde L_{t-1,i} + \tilde\ell_{t,i}$.
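A rough sketch of this loop, assuming exponential perturbations. FPL does not expose $p_{t,i}$ directly, so this sketch approximates it by naive Monte Carlo resampling of the perturbed leader (the actual analysis relies on more refined estimators for this step); all parameter choices are illustrative:

```python
import random

# FPL with IX estimates: follow the perturbed leader, then estimate the
# played arm's loss with the implicit-exploration correction.
def fpl_ix(losses, eta, gamma, rng, n_samples=200):
    N = len(losses[0])
    Ltilde = [0.0] * N                     # \tilde L_{t,i}
    total_loss = 0.0
    for loss_vector in losses:
        def leader():
            # fresh exponential perturbations Z_{t,i} (scale 1/eta), then argmin
            return min(range(N), key=lambda i: Ltilde[i] - rng.expovariate(eta))
        arm = leader()
        total_loss += loss_vector[arm]
        # crude Monte Carlo approximation of p_{t,arm}
        p_arm = sum(leader() == arm for _ in range(n_samples)) / n_samples
        Ltilde[arm] += loss_vector[arm] / (p_arm + gamma)   # IX estimate
    return total_loss, Ltilde

rng = random.Random(1)
losses = [[0.0, 1.0] for _ in range(200)]   # arm 0 is always better
total, Ltilde = fpl_ix(losses, eta=0.1, gamma=0.05, rng=rng)
```

On this easy instance the bad arm's estimate grows quickly and it stops being drawn, so the total loss stays far below the 200 rounds played.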
$\tilde\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \mathbb{I}\{I_t = i\}$
True losses → unbiased estimates → unbiased estimates + uniform exploration → IX estimates.
Lemma (N, 2015a): Assume that $Z_{t,i} \le C$ for all $t$ and $i$. Then, for any $T$ and $i$,
$\tilde L_{T,i} \le L_{T,i} + \frac{\log T + C + 1}{\gamma}$.
All perturbations are nicely bounded with high probability ⇒ bad arms are suppressed!
[Figure: the per-arm estimates $\tilde L_{t,1}, \dots, \tilde L_{t,6}$, with only arms within $C$ of the smallest estimate reachable by the perturbations.]
Bad arms are no longer drawn!
$\tilde L_{t,i}$ stops growing if $i$ is bad.
With IX estimates ($\eta$: "learning rate"):
$R_T \le \frac{\log N}{\eta} + \eta\, \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} \tilde\ell_{t,i}^2 \right]
\le \frac{\log N}{\eta} + \eta\, \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{i=1}^{N} \tilde\ell_{t,i} \right]
= \frac{\log N}{\eta} + \eta \sum_{i=1}^{N} \tilde L_{T,i}$.
By the Lemma (and the suppression of bad arms):
$\le \frac{\log N}{\eta} + \eta N L_{T,i^*} + N(C + \log T)$,
so for appropriate $\eta, \gamma, C$: $R_T = O\left(\sqrt{N L_{T,i^*}}\right)$.
Parameters:
$\eta \approx \sqrt{\frac{\log N + 1}{N L_{T,i^*}}}$ (requires knowing $L_{T,i^*}$ in advance)
$\eta_t \approx \sqrt{\frac{\log N + 1}{N \left( 1 + \min_i \tilde L_{t-1,i} \right)}}$ (adaptive version)
Arguments also extend to combinatorial semi-bandits!
The table, revisited:
minimax — Full information: $R_T = \Theta(\sqrt{T \log N})$; Bandit: $R_T = O(\sqrt{TN \log N})$ (Auer et al., 2002, + some hacking)
first-order ($L_{T,i} = \sum_t \ell_{t,i}$) — Full information: $R_T = O(\sqrt{L_{T,i^*} \log N})$ (Cesa-Bianchi, Mansour and Stoltz, 2005, with a little cheating); Bandit: previously "(it's complicated)", now $R_T = O\left(\sqrt{N L_{T,i^*}}\right)$
second-order ($Q_{T,i} = \sum_t \ell_{t,i}^2$) — Full information: $R_T = O(\sqrt{Q_{T,i^*} \log N})$ (Cesa-Bianchi, Mansour and Stoltz, 2005); Bandit: ?
variance ($V_{T,i} = \sum_t (\ell_{t,i} - \bar\ell_i)^2$) — Full information: $R_T = O(\sqrt{V_{T,i^*} \log N})$ (Hazan and Kale, 2010); Bandit: $R_T = \tilde{O}(N^2 \sqrt{V_T})$ (Hazan and Kale, 2011)
What about these? (second-order and variance bounds for bandits)
A key tool for adaptive bounds in full-info: PROD (Cesa-Bianchi, Mansour and Stoltz, 2005):
$p_{t,i} \propto \prod_{s=1}^{t-1} \left( 1 - \eta \ell_{s,i} \right)$
Used for proving the second-order bounds above. But does it work for bandits?
With bandit estimates:
$p_{t,i} \propto \prod_{s=1}^{t-1} \left( 1 - \eta \tilde\ell_{s,i} \right)
\iff
p_{t,i} \propto e^{-\eta \sum_{s=1}^{t-1} b_{s,i}}$
with $b_{t,i} = -\frac{1}{\eta} \log\left( 1 - \eta \tilde\ell_{t,i} \right)$.
That is, PROD is EXP3 with a pessimistic estimate: $b_{t,i} \ge \tilde\ell_{t,i}$ — the opposite of the implicit-exploration estimate $\tilde\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \mathbb{I}\{I_t = i\}$, which is optimistic.
Fix: flip the sign and run
$p_{t,i} \propto e^{-\eta \sum_{s=1}^{t-1} b_{s,i}}$
with $b_{t,i} = \frac{1}{\eta} \log\left( 1 + \eta \tilde\ell_{t,i} \right)$.
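The sign-flip above can be checked numerically: $-\frac{1}{\eta}\log(1 - \eta x) \ge x \ge \frac{1}{\eta}\log(1 + \eta x)$ for $x \ge 0$, and the PROD weights indeed rewrite as exponential weights over the transformed estimates. A small self-contained check, with illustrative values:

```python
import math

eta = 0.1
# Pessimistic vs. optimistic transforms of a loss estimate x >= 0
# (bandit loss estimates can exceed 1, hence the value 3.0 below).
for x in [0.0, 0.5, 1.0, 3.0]:
    pessimistic = -math.log(1 - eta * x) / eta
    optimistic = math.log(1 + eta * x) / eta
    assert optimistic <= x <= pessimistic

# PROD weights = exponential weights over the transformed estimates
estimates = [0.2, 1.4, 0.7]
prod_weight = math.prod(1 - eta * l for l in estimates)
exp_weight = math.exp(-eta * sum(-math.log(1 - eta * l) / eta
                                 for l in estimates))
assert abs(prod_weight - exp_weight) < 1e-12
```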
True losses vs. IX estimates: $\tilde\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \mathbb{I}\{I_t = i\}$
Smoothness conflicts with the need to explore!
+ also key for high-probability bounds! (NIPS 2015)
Combinatorial semi-bandits. For every round $t = 1, 2, \dots, T$:
- the learner picks $V_t \in \mathcal{S} \subseteq \{0,1\}^d$,
- suffers loss $V_t^\top \ell_t$,
- observes the chosen components $V_{t,i} \ell_{t,i}$.
Decision set: $\mathcal{S} = \{ v_k \}_{k=1}^{K} \subseteq \{0,1\}^d$ with $\| v_k \|_1 \le m$.
$R_T = \max_{v \in \mathcal{S}} \mathbb{E}\left[ \sum_{t=1}^{T} (V_t - v)^\top \ell_t \right]$
Minimax: $R_T = \Theta(\sqrt{mdT})$
Previous best: $R_T = O\left( m \sqrt{dT \log d} \right)$
New: $R_T = O\left( m \sqrt{d L_{T,*} \log d} \right)$
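A concrete toy instance of such a decision set, assuming $\mathcal{S}$ is the family of all $m$-element subsets of $d$ items (one valid choice satisfying $\|v\|_1 \le m$); all names here are illustrative:

```python
from itertools import combinations

# Decision set: incidence vectors of all m-subsets of d items, so ||v||_1 = m.
def m_sets(d, m):
    S = []
    for idx in combinations(range(d), m):
        v = [0] * d
        for i in idx:
            v[i] = 1
        S.append(v)
    return S

def semi_bandit_feedback(v, loss_vector):
    # the learner suffers v^T loss and observes each chosen component v_i * l_i
    suffered = sum(vi * li for vi, li in zip(v, loss_vector))
    observed = [vi * li for vi, li in zip(v, loss_vector)]
    return suffered, observed

S = m_sets(4, 2)
print(len(S))  # 6 incidence vectors: C(4, 2) = 6
loss = [0.1, 0.9, 0.4, 0.0]
suffered, observed = semi_bandit_feedback(S[0], loss)  # S[0] picks items {0, 1}
```

Unlike the full bandit setting, the learner here sees the loss of every chosen component, not just the scalar total, which is what the semi-bandit arguments exploit.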