Linear Bandits: From Theory to Applications
Claire Vernade
DeepMind – Foundations Team. Credits: Csaba Szepesvári and Tor Lattimore for their blog.
Sequential Decision Making

Real World Sequential Decision Making
Setting: stochastic linear bandits. At each round t, the learner receives an action set A_t ⊂ ℝ^d, chooses an action A_t ∈ A_t, and observes the reward X_t = ⟨A_t, θ∗⟩ + η_t, where θ∗ ∈ ℝ^d is an unknown parameter and η_t is noise. Performance is measured by the regret

R_n = E[ Σ_{t=1}^n ( max_{a∈A_t} ⟨a, θ∗⟩ − X_t ) ].
Regularized least squares. The estimate of θ∗ after t rounds is

θ̂_t = V_t^{−1} Σ_{s=1}^t A_s X_s,  where V_t = λI + Σ_{s=1}^t A_s A_s^⊤.

Confidence set: an ellipsoid centered at the current estimate,

C_t = { θ ∈ ℝ^d : ‖θ − θ̂_{t−1}‖_{V_{t−1}} ≤ β_t },

where ‖x‖_A^2 = x^⊤Ax for a positive definite matrix A.
Optimism: LinUCB. Play the action with the largest optimistic value,

A_t = argmax_{a∈A_t} max_{θ∈C_t} ⟨a, θ⟩.

The inner maximum has a closed form over the ellipsoid C_t:

U_t(a) := max_{θ∈C_t} ⟨a, θ⟩ = ⟨a, θ̂_{t−1}⟩ + β_t ‖a‖_{V_{t−1}^{−1}}.
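As a concrete illustration, here is a minimal LinUCB simulation in NumPy. All numerical choices (dimension, horizon, arm set, noise level, and the constant confidence width beta) are illustrative assumptions; in practice β_t grows slowly with t, as given by the self-normalized concentration theorem.

```python
import numpy as np

# Minimal LinUCB sketch: fixed finite action set, Gaussian noise, and a
# constant confidence width beta for simplicity (assumptions, not the
# slides' exact experimental setup).
rng = np.random.default_rng(0)
d, n, lam, beta = 3, 500, 1.0, 1.0
theta_star = np.array([0.8, 0.1, -0.3])          # hypothetical unknown parameter
actions = rng.normal(size=(10, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)  # unit-norm arms

V = lam * np.eye(d)                               # V_0 = lambda * I
b = np.zeros(d)                                   # b_t = sum_s A_s X_s
best = max(a @ theta_star for a in actions)
regret = 0.0

for t in range(n):
    theta_hat = np.linalg.solve(V, b)             # ridge estimate theta_hat_{t-1}
    V_inv = np.linalg.inv(V)
    # U_t(a) = <a, theta_hat> + beta * ||a||_{V^{-1}}
    widths = np.sqrt(np.einsum('id,de,ie->i', actions, V_inv, actions))
    A_t = actions[np.argmax(actions @ theta_hat + beta * widths)]
    X_t = A_t @ theta_star + 0.1 * rng.normal()   # noisy linear reward
    V += np.outer(A_t, A_t)
    b += A_t * X_t
    regret += best - A_t @ theta_star

print(f"cumulative regret after {n} rounds: {regret:.1f}")
```

The cumulative regret stays far below the linear growth a non-adaptive policy would incur, matching the sublinear guarantee derived next.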
Regret decomposition. Let A_t∗ = argmax_{a∈A_t} ⟨a, θ∗⟩ and write

R_n = Σ_{t=1}^n ⟨A_t∗ − A_t, θ∗⟩ =: Σ_{t=1}^n r_t.

On the event that θ∗ ∈ C_t for all t, optimism gives

⟨A_t∗, θ∗⟩ ≤ U_t(A_t∗) ≤ U_t(A_t) = ⟨A_t, θ̃_t⟩

for some θ̃_t ∈ C_t, hence by Cauchy–Schwarz

r_t ≤ ⟨A_t, θ̃_t − θ∗⟩ ≤ ‖θ̃_t − θ∗‖_{V_{t−1}} ‖A_t‖_{V_{t−1}^{−1}} ≤ 2β_t ‖A_t‖_{V_{t−1}^{−1}}.
Lemma (elliptical potential). Let V_t = λI + Σ_{s=1}^t x_s x_s^⊤, t ∈ [n], and L ≥ max_t ‖x_t‖_2. Then

Σ_{t=1}^n min( 1, ‖x_t‖_{V_{t−1}^{−1}}^2 ) ≤ 2 log( det V_n / det V_0 ) ≤ 2d log( 1 + nL²/(dλ) ).
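The lemma is easy to verify numerically. The sketch below draws random Gaussian feature vectors (an arbitrary choice for illustration) and checks both inequalities of the chain:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam = 5, 2000, 1.0
xs = rng.normal(size=(n, d))
L = np.linalg.norm(xs, axis=1).max()

V = lam * np.eye(d)                               # V_0 = lambda * I
lhs = 0.0
for x in xs:
    lhs += min(1.0, x @ np.linalg.solve(V, x))    # min(1, ||x_t||^2_{V_{t-1}^{-1}})
    V += np.outer(x, x)

_, logdet_Vn = np.linalg.slogdet(V)
middle = 2 * (logdet_Vn - d * np.log(lam))        # 2 log(det V_n / det V_0)
rhs = 2 * d * np.log(1 + n * L**2 / (d * lam))
print(lhs <= middle <= rhs)                       # prints True: both inequalities hold
```

The log-determinant middle term is often much tighter than the worst-case right-hand side; the bound degrades gracefully when the features concentrate in a low-dimensional subspace.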
Theorem (self-normalized concentration; Abbasi-Yadkori, Pál and Szepesvári [3]). Assumptions: ‖θ∗‖ ≤ S, and let (A_s)_s, (η_s)_s be such that for any 1 ≤ s ≤ t, η_s | F_{s−1} ∼ subG(1), where F_s = σ(A_1, η_1, . . . , A_{s−1}, η_{s−1}, A_s). Then, with probability at least 1 − δ, for all t,

‖θ̂_t − θ∗‖_{V_t} ≤ √λ S + √( 2 log(1/δ) + d log( 1 + tL²/(dλ) ) ) =: β_{t+1}.
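The resulting confidence width is cheap to evaluate. A small helper (the function name and default constants are illustrative, not from the slides) shows that it grows only logarithmically in t:

```python
import numpy as np

def beta(t, d, delta, S=1.0, L=1.0, lam=1.0):
    """Width sqrt(lam)*S + sqrt(2 log(1/delta) + d log(1 + t L^2 / (d lam)))."""
    return np.sqrt(lam) * S + np.sqrt(
        2 * np.log(1 / delta) + d * np.log(1 + t * L**2 / (d * lam))
    )

# the width grows only logarithmically with the number of rounds t
for t in (10, 1000, 100000):
    print(t, round(beta(t, d=5, delta=0.05), 3))
```

Because β_t enters the regret bound multiplicatively, this slow growth is what keeps the final regret at √n up to logarithmic factors.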
Regret bound. Combining the decomposition with the confidence bound and the elliptical potential lemma (using r_t ≤ 2 for bounded rewards and R_n ≤ √(n Σ_t r_t²) by Cauchy–Schwarz), with probability at least 1 − δ,

R_n ≤ 2β_n √( 2nd log( 1 + nL²/(dλ) ) ) = Õ( d√n ).
Application: delayed feedback. In many applications (e.g. conversions in online advertising), the reward of the action played at round s arrives only after a random delay D_s. Fix a window m and count only rewards received within m rounds: the resulting censored observation has mean τ_m ⟨A_s, θ∗⟩, where τ_m = P(D ≤ m).

Estimator: use only the rounds whose observation window has closed,

θ̂_t = V_{t−m}^{−1} b_{t−m},  with b_t = Σ_{s=1}^t A_s X̃_s,

where X̃_s is the censored reward attributed to round s.

Bias decomposition:

θ̂_t − τ_m θ∗ = ( θ̂_t − θ̂_{t+m} ) + ( θ̂_{t+m} − τ_m θ∗ ),

so ‖θ̂_t − τ_m θ∗‖ ≤ ‖θ̂_t − θ̂_{t+m}‖ + ‖θ̂_{t+m} − τ_m θ∗‖: the first term accounts for the at most m rewards still in flight, the second is controlled by the self-normalized concentration bound.

DeLinUCB: play A_t = argmax_{a∈A_t} max_{θ∈C_t} ⟨a, θ⟩, with the confidence width inflated to absorb the missing rewards:

U_t(a) = ⟨a, θ̂_t⟩ + β_t ‖a‖_{V_{t−1}^{−1}} + m ‖a‖_{V_{t−1}^{−2}}.
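A sketch of the censored estimator, under an assumed geometric delay distribution (all constants are illustrative): least squares on the rewards observed within the window estimates τ_m θ∗, so dividing by an estimate of τ_m recovers θ∗.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, m, lam = 3, 400, 5, 1.0
theta_star = np.array([0.6, -0.2, 0.3])          # hypothetical unknown parameter

# actions, rewards, and random delays; a reward is seen iff its delay <= m
A = rng.normal(size=(n, d))
delays = rng.geometric(p=0.4, size=n) - 1        # assumed delay distribution
X = A @ theta_star + 0.1 * rng.normal(size=n)
observed = delays <= m                            # censoring at window m
tau_m = observed.mean()                           # empirical P(D <= m)

# estimator using only the rounds whose window has closed (s <= t - m)
t = n
V = lam * np.eye(d) + A[: t - m].T @ A[: t - m]
b = A[: t - m].T @ (X[: t - m] * observed[: t - m])   # censored rewards
theta_hat = np.linalg.solve(V, b)

# theta_hat estimates tau_m * theta_star, so rescaling recovers theta_star
print(np.round(theta_hat / tau_m, 2))
```

This rescaling step is only for intuition: the optimistic rule above works directly with the biased estimate, since the censoring factor τ_m scales all arms equally.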
Regret of DeLinUCB. Repeating the LinUCB analysis with the inflated confidence width shows that delays only cost an additive term: the regret matches the Õ(d√n) rate of LinUCB, plus an additional term scaling with the window length m (and only logarithmically with n).
[Figure: cumulative regret R(T) over rounds t ≤ 3000, comparing WaiLinUCB and DeLinUCB on three simulated scenarios.]
References

[1] Yasin Abbasi-Yadkori. Forced-exploration based algorithms for playing in bandits with large action sets. PhD thesis, University of Alberta, 2009.
[2] Yasin Abbasi-Yadkori, András Antos, and Csaba Szepesvári. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, 2009.
[3] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pages 2312–2320, 2011.
[4] Naoki Abe and Philip M. Long. Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the International Conference on Machine Learning (ICML), 1999.
[5] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
[6] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
[7] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, volume 15, pages 208–214, 2011.
[8] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the Conference on Learning Theory (COLT), pages 355–366, 2008.
[9] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems (NIPS), pages 586–594, 2010.
[10] Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122. IEEE, 2010.
[11] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
[12] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
[13] Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning (ICML), pages 46–54, 2014.