SLIDE 1

Linear Bandits: From Theory to Applications

Claire Vernade

DeepMind – Foundations Team. Credits: Csaba Szepesvári and Tor Lattimore for their blog.

SLIDE 2

Sequential Decision Making

SLIDE 3

Real World Sequential Decision Making

SLIDE 4

Table of contents

  • 1. Linear Bandits
  • 2. Real-World Setting: Delayed Feedback

SLIDE 5

Linear Bandits

SLIDE 6

Linear Bandits

  • 1. In round $t$, observe the action set $\mathcal{A}_t \subset \mathbb{R}^d$.
  • 2. The learner chooses $A_t \in \mathcal{A}_t$ and receives $X_t$, satisfying
$$\mathbb{E}[X_t \mid \mathcal{A}_1, A_1, \dots, \mathcal{A}_t, A_t] = \langle A_t, \theta_* \rangle =: f_{\theta_*}(A_t) \quad \text{for some unknown } \theta_*.$$
  • 3. Light-tailed noise: $X_t - \langle A_t, \theta_* \rangle = \eta_t \sim \mathcal{N}(0, 1)$.

Goal: keep the regret
$$R_n = \mathbb{E}\left[\sum_{t=1}^{n} \left(\max_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle - X_t\right)\right]$$
small.
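
To make the interaction protocol concrete, here is a minimal simulation sketch in Python (the Gaussian action sets, the dimensions, and the uniform placeholder policy are illustrative assumptions, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 5, 10, 1000
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)  # unknown unit-norm parameter

regret = 0.0
for t in range(n):
    # 1. Environment reveals a fresh action set A_t of K vectors in R^d.
    actions = rng.normal(size=(K, d))
    # 2. Learner picks A_t (uniformly here; LinUCB replaces this step).
    a_t = actions[rng.integers(K)]
    # 3. Reward is <A_t, theta_*> plus N(0, 1) noise.
    x_t = a_t @ theta_star + rng.normal()
    # Pseudo-regret of the round: best mean reward minus the chosen one.
    regret += (actions @ theta_star).max() - a_t @ theta_star
print(f"pseudo-regret of uniform play over {n} rounds: {regret:.1f}")
```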

SLIDE 7

Real-World Setting

Typical setting: a user, represented by a feature vector $u_t$, arrives, and we have a finite set of (correlated) actions $(a_1, \dots, a_K)$. Some function $\Phi$ joins these vectors pairwise to create a contextualized action set: for all $i \in [K]$, $\Phi(u_t, a_i) = a_{t,i} \in \mathbb{R}^d$ and $\mathcal{A}_t = \{a_{t,1}, \dots, a_{t,K}\}$. No assumption is needed on the joining function $\Phi$, since the bandit takes over the decision step once the contextualized action set is formed. It is therefore equivalent to assume $\mathcal{A}_t \sim \Pi(\mathbb{R}^d)$ for some arbitrary distribution, or $\mathcal{A}_1, \dots, \mathcal{A}_n$ fixed arbitrarily by the environment.
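
For illustration only, since the slides leave $\Phi$ unspecified: one common joining function is the flattened outer product, sketched below (all names here are hypothetical).

```python
import numpy as np

def phi(u: np.ndarray, a: np.ndarray) -> np.ndarray:
    """One possible joining function: the flattened outer product u a^T,
    so <phi(u, a), theta> can capture user-action interactions.
    (Illustrative choice; the model does not prescribe Phi.)"""
    return np.outer(u, a).ravel()

u_t = np.array([1.0, 0.5])                                  # user features
raw_actions = [np.array([0.2, 0.8, 0.1]), np.array([0.9, 0.3, 0.4])]
A_t = [phi(u_t, a) for a in raw_actions]                    # set in R^{2*3}
```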

SLIDE 8

Toolbox of the optimist

Say the reward in round $t$ is $X_t$ and the action in round $t$ is $A_t \in \mathbb{R}^d$:
$$X_t = \langle A_t, \theta_* \rangle + \eta_t.$$
We want to estimate $\theta_*$ with the regularized least-squares estimator:
$$\hat\theta_t = V_t^{-1} \sum_{s=1}^{t} A_s X_s, \qquad V_0 = \lambda I, \qquad V_t = V_0 + \sum_{s=1}^{t} A_s A_s^\top.$$

Choice of confidence regions (ellipsoids) $C_t$:
$$C_t := \left\{ \theta \in \mathbb{R}^d : \|\theta - \hat\theta_{t-1}\|^2_{V_{t-1}} \le \beta_t \right\},$$
where, for $A$ positive definite, $\|x\|^2_A = x^\top A x$.
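
As a sketch, the estimator admits a simple online implementation; a minimal version under the definitions above (solving a linear system instead of forming $V_t^{-1}$ explicitly is a numerical choice of mine):

```python
import numpy as np

class RegularizedLeastSquares:
    """Online ridge regression: V_t = lam*I + sum_s A_s A_s^T, b_t = sum_s A_s X_s."""

    def __init__(self, d: int, lam: float = 1.0):
        self.V = lam * np.eye(d)   # V_0 = lambda * I
        self.b = np.zeros(d)

    def update(self, a: np.ndarray, x: float) -> None:
        self.V += np.outer(a, a)   # rank-one Gram-matrix update
        self.b += x * a

    def estimate(self) -> np.ndarray:
        # theta_hat_t = V_t^{-1} b_t; solve() avoids inverting V explicitly.
        return np.linalg.solve(self.V, self.b)
```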

SLIDE 9

LinUCB

“Choose the best action in the best environment amongst the plausible ones.”

Choose $C_t$ with suitable $(\beta_t)_t$ and let
$$A_t = \operatorname*{argmax}_{a \in \mathcal{A}_t} \max_{\theta \in C_t} \langle a, \theta \rangle.$$
Or, more concretely, for each action $a \in \mathcal{A}_t$, compute the “optimistic index”
$$U_t(a) = \max_{\theta \in C_t} \langle a, \theta \rangle.$$
Since we are maximizing a linear function over a closed convex set, the solution is explicit:
$$A_t = \operatorname*{argmax}_a U_t(a) = \operatorname*{argmax}_a \langle a, \hat\theta_{t-1} \rangle + \sqrt{\beta_t}\, \|a\|_{V_{t-1}^{-1}}.$$
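
A compact sketch of the resulting selection rule (assuming an `actions` matrix with one action per row; the calibration of $\beta_t$ is the subject of the confidence-ellipsoid slide below):

```python
import numpy as np

def linucb_action(actions: np.ndarray, V: np.ndarray,
                  theta_hat: np.ndarray, beta: float) -> int:
    """Index of argmax_a <a, theta_hat> + sqrt(beta) * ||a||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    # ||a||_{V^{-1}}^2 = a^T V^{-1} a, computed for each row of `actions`.
    widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    ucb = actions @ theta_hat + np.sqrt(beta) * widths
    return int(np.argmax(ucb))
```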

SLIDE 10

Optimism in the Face of Uncertainty Principle

SLIDE 11

Regret Bound

Assumptions:

  • 1. Bounded scalar mean reward: $|\langle a, \theta_* \rangle| \le 1$ for any $a \in \cup_t \mathcal{A}_t$.
  • 2. Bounded actions: for any $a \in \cup_t \mathcal{A}_t$, $\|a\|_2 \le L$.
  • 3. Honest confidence intervals: there exists $\delta \in (0,1)$ such that, with probability $1-\delta$, for all $t \in [n]$, $\theta_* \in C_t$ for some choice of $(\beta_t)_{t \le n}$.

Theorem (LinUCB Regret)
Let the conditions listed above hold. Then, with probability $1-\delta$, the regret of LinUCB satisfies
$$\hat R_n \le \sqrt{8 d n \beta_n \log\left(\frac{d\lambda + nL^2}{d\lambda}\right)}.$$

SLIDE 12

Proof

Jensen’s inequality shows that
$$\hat R_n = \sum_{t=1}^n \langle A_t^* - A_t, \theta_* \rangle =: \sum_{t=1}^n r_t \le \sqrt{n \sum_{t=1}^n r_t^2},$$
where $A_t^* := \operatorname*{argmax}_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle$. Let $\tilde\theta_t$ be the vector that realizes the maximum over the ellipsoid: $\tilde\theta_t \in C_t$ such that $\langle A_t, \tilde\theta_t \rangle = U_t(A_t)$. From the definition of LinUCB,
$$\langle A_t^*, \theta_* \rangle \le U_t(A_t^*) \le U_t(A_t) = \langle A_t, \tilde\theta_t \rangle.$$
Then, by Cauchy–Schwarz,
$$r_t \le \langle A_t, \tilde\theta_t - \theta_* \rangle \le \|A_t\|_{V_{t-1}^{-1}} \|\tilde\theta_t - \theta_*\|_{V_{t-1}} \le 2 \|A_t\|_{V_{t-1}^{-1}} \sqrt{\beta_t}.$$

SLIDE 13

Elliptical Potential Lemma

So we now have a new upper bound:
$$\hat R_n = \sum_{t=1}^n r_t \le \sqrt{n \sum_{t=1}^n r_t^2} \le 2 \sqrt{n \beta_n \sum_{t=1}^n \left(1 \wedge \|A_t\|^2_{V_{t-1}^{-1}}\right)}.$$

Lemma (Abbasi-Yadkori et al. (2011))
Let $x_1, \dots, x_n \in \mathbb{R}^d$, $V_t = V_0 + \sum_{s=1}^t x_s x_s^\top$ for $t \in [n]$, and $L \ge \max_t \|x_t\|_2$. Then,
$$\sum_{t=1}^n \left(1 \wedge \|x_t\|^2_{V_{t-1}^{-1}}\right) \le 2 \log\left(\frac{\det V_n}{\det V_0}\right) \le 2 d \log\left(\frac{\operatorname{trace}(V_0) + nL^2}{d \det^{1/d}(V_0)}\right).$$
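
The lemma is easy to sanity-check numerically; a throwaway sketch with $V_0 = I$ and random unit-norm directions (both assumptions mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 5, 2000, 1.0
V = np.eye(d)                          # V_0 = I
V0_det = np.linalg.det(V)
lhs = 0.0
for _ in range(n):
    x = rng.normal(size=d)
    x *= L / np.linalg.norm(x)         # enforce ||x_t||_2 <= L
    lhs += min(1.0, x @ np.linalg.solve(V, x))   # 1 ∧ ||x_t||^2_{V_{t-1}^{-1}}
    V += np.outer(x, x)                # V_t = V_{t-1} + x_t x_t^T
rhs = 2 * np.log(np.linalg.det(V) / V0_det)
print(lhs <= rhs, lhs, rhs)            # the elliptical potential bound holds
```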

SLIDE 14

Confidence Ellipsoids

Assumptions: $\|\theta_*\| \le S$, and let $(A_s)_s$, $(\eta_s)_s$ be such that for any $1 \le s \le t$, $\eta_s \mid \mathcal{F}_{s-1} \sim \operatorname{subG}(1)$, where $\mathcal{F}_s = \sigma(A_1, \eta_1, \dots, A_{s-1}, \eta_{s-1}, A_s)$.

Fix $\delta \in (0,1)$. Let
$$\sqrt{\beta_{t+1}} = \sqrt{\lambda}\, S + \sqrt{2 \log\frac{1}{\delta} + \log\frac{\det V_t(\lambda)}{\lambda^d}} \le \sqrt{\lambda}\, S + \sqrt{2 \log\frac{1}{\delta} + d \log\left(\frac{\lambda d + nL^2}{\lambda d}\right)},$$
and
$$C_{t+1} = \left\{ \theta \in \mathbb{R}^d : \|\hat\theta_t - \theta\|_{V_t(\lambda)} \le \sqrt{\beta_{t+1}} \right\}.$$

Theorem
$C_{t+1}$ is a confidence set for $\theta_*$ at level $1-\delta$: $\mathbb{P}(\theta_* \in C_{t+1}) \ge 1-\delta$.

Proof: see Chapter 20 of Bandit Algorithms (www.banditalgs.com).
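
For completeness, a sketch of the radius computation under these assumptions (the det-based form, which is the tighter of the two displayed above; the function name is mine):

```python
import numpy as np

def sqrt_beta(V: np.ndarray, lam: float, S: float, delta: float) -> float:
    """Radius sqrt(beta_{t+1}) = sqrt(lam)*S
       + sqrt(2*log(1/delta) + log(det V_t(lam) / lam^d))."""
    d = V.shape[0]
    # log det(V) - d*log(lam), via slogdet for numerical stability
    _, logdet = np.linalg.slogdet(V)
    return np.sqrt(lam) * S + np.sqrt(
        2 * np.log(1 / delta) + logdet - d * np.log(lam))
```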

SLIDE 15

History

  • Abe and Long [4] introduced stochastic linear bandits into the machine learning literature.
  • Auer [6] was the first to consider optimism for linear bandits (LinRel, SupLinRel). Main restriction: $|\mathcal{A}_t| < +\infty$.
  • Confidence ellipsoids: Dani et al. [8] (ConfidenceBall2), Rusmevichientong and Tsitsiklis [11] (Uncertainty Ellipsoid Policy), Abbasi-Yadkori et al. [3] (OFUL).
  • The name LinUCB comes from Chu et al. [7].
  • Alternative routes:
    • Explore-then-commit for action sets with smooth boundary: Abbasi-Yadkori [1], Abbasi-Yadkori et al. [2], Rusmevichientong and Tsitsiklis [11].
    • Phased elimination
    • Thompson sampling

SLIDE 16

Summary

Theorem (LinUCB Regret)
Let the conditions listed above hold. Then, with probability $1-\delta$, the regret of LinUCB satisfies
$$\hat R_n \le \sqrt{8 d n \beta_n \log\left(\frac{\operatorname{trace}(V_0) + nL^2}{d \det^{1/d}(V_0)}\right)} = O(d\sqrt{n}).$$

Linear bandits are an elegant model of the exploration-exploitation dilemma when actions are correlated. The main ingredients of the regret analysis are:

  • bounding the instantaneous regret using the definition of optimism;
  • a maximal concentration inequality holding for a randomized, sequential design;
  • the Elliptical Potential Lemma.

SLIDE 17

Real-World Setting: Delayed Feedback

SLIDE 18

In a real-world application, rewards are delayed ...

SLIDE 19

In a real-world application, rewards are delayed ... and censored.

SLIDE 20

Delayed Linear Bandits

Modified setting: at round $t \ge 1$,

  • receive the contextualized action set $\mathcal{A}_t = \{a_1, \dots, a_K\}$ and choose an action $A_t \in \mathcal{A}_t$;
  • two random variables are generated but not observed: $X_t \sim \mathcal{B}(\theta^\top A_t)$ and $D_t \sim \mathcal{D}(\tau)$;
  • at time $t + D_t$, the reward $X_t$ of action $A_t$ is disclosed ...
  • ... unless $D_t > m$: if the delay is too long, the reward is discarded.

New parameter: $0 < m < T$ is the cut-off time of the system. If the delay is longer, the reward is never received. The delay distribution $\mathcal{D}(\tau)$ characterizes the proportion of converting actions: $\tau_m = \mathbb{P}(D_t \le m)$.
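
A sketch of this feedback mechanism (the geometric delay distribution matches the later simulations; the queue layout and the names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 250                        # cut-off: feedback delayed beyond m rounds is lost
mean_delay = 100               # assumed geometric delays with E[D_t] = 100
pending = {}                   # disclosure round -> list of (action, reward)

def play_round(t: int, a_t: np.ndarray, theta: np.ndarray) -> None:
    """Generate (X_t, D_t); schedule disclosure at t + D_t unless censored."""
    x_t = rng.binomial(1, float(np.clip(a_t @ theta, 0.0, 1.0)))  # X_t ~ B(<a,theta>)
    d_t = rng.geometric(1.0 / mean_delay)                         # D_t ~ D(tau)
    if d_t <= m:                                                  # else: discarded
        pending.setdefault(t + d_t, []).append((a_t, x_t))

def observe(t: int):
    """Feedback disclosed to the learner at round t."""
    return pending.pop(t, [])
```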

SLIDE 21

A new estimator

We now have:
$$V_t = \sum_{s=1}^{t-1} A_s A_s^\top, \qquad \tilde b_t = \sum_{s=1}^{t-1} A_s X_s \mathbb{1}\{D_s \le m\},$$
where $\tilde b_t$ contains additional non-identically distributed samples:
$$\tilde b_t = \sum_{s=1}^{t-m} A_s X_s \mathbb{1}\{D_s \le m\} + \sum_{s=t-m+1}^{t-1} A_s X_s \mathbb{1}\{D_s \le t - s\}.$$

The “conditionally biased” least-squares estimator includes every received feedback:
$$\hat\theta^b_t = V_t^{-1} \tilde b_t.$$

Baseline: use the previous estimator but discard the last $m$ steps:
$$\hat\theta^{\mathrm{disc}}_t = V_{t-m}^{-1} b_{t-m}, \qquad \text{with } \mathbb{E}[\hat\theta^{\mathrm{disc}}_t \mid \mathcal{F}_t] \approx \tau_m \theta.$$
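
A sketch of both estimators from a log of past rounds (assuming each record stores the round $s$, action $A_s$, reward $X_s$, and realized delay $D_s$; the $\lambda I$ term is added for invertibility and is my assumption):

```python
import numpy as np

def estimators(log, t: int, m: int, d: int, lam: float = 1.0):
    """Biased estimator theta_b (every feedback received by round t) and the
    discarding baseline theta_disc (only rounds s <= t - m)."""
    V_t = lam * np.eye(d)
    V_tm = lam * np.eye(d)
    b_tilde = np.zeros(d)
    b_disc = np.zeros(d)
    for s, a, x, delay in log:               # one record per past round s < t
        if s >= t:
            continue
        V_t += np.outer(a, a)
        if delay <= min(m, t - s):           # disclosed by t and not censored
            b_tilde += x * a
        if s <= t - m:
            V_tm += np.outer(a, a)
            if delay <= m:
                b_disc += x * a
    return np.linalg.solve(V_t, b_tilde), np.linalg.solve(V_tm, b_disc)
```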

SLIDE 22

Confidence interval and the D-LinUCB policy

We remark that
$$\hat\theta^b_t - \tau_m\theta = \underbrace{\hat\theta^b_t - \hat\theta^{\mathrm{disc}}_{t+m}}_{\text{finite bias}} + \underbrace{\hat\theta^{\mathrm{disc}}_{t+m} - \tau_m\theta}_{\text{same as before}}.$$

For the new $C_t$, we have new optimistic indices:
$$A_t = \operatorname*{argmax}_{a \in \mathcal{A}_t} \max_{\theta \in C_t} \langle a, \theta \rangle.$$
But now the solution has an extra (vanishing) bias term:
$$A_t = \operatorname*{argmax}_a \langle a, \hat\theta_t \rangle + \sqrt{\beta_t}\, \|a\|_{V_{t-1}^{-1}} + m\, \|a\|_{V_{t-1}^{-2}}.$$

D-LinUCB: an easy, straightforward, harmless modification of LinUCB, with regret guarantees in the delayed-feedback setting.
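
The selection rule then differs from the LinUCB sketch above by a single extra term (same hedges apply; note that $\|a\|_{V^{-2}}$ uses the squared inverse matrix):

```python
import numpy as np

def d_linucb_action(actions: np.ndarray, V: np.ndarray,
                    theta_hat: np.ndarray, beta: float, m: int) -> int:
    """argmax_a <a, theta_hat> + sqrt(beta)*||a||_{V^{-1}} + m*||a||_{V^{-2}}."""
    V_inv = np.linalg.inv(V)
    w1 = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    V_inv2 = V_inv @ V_inv                 # for the bias width ||a||_{V^{-2}}
    w2 = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv2, actions))
    return int(np.argmax(actions @ theta_hat + np.sqrt(beta) * w1 + m * w2))
```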

SLIDE 23

Regret bound

Theorem (D-LinUCB Regret)
Under the same conditions as before, with $V_0 = \lambda I$, with probability $1-\delta$ the regret of D-LinUCB satisfies
$$\hat R_n \le \tau_m^{-1} \sqrt{8 d n \beta_n \log\left(\frac{\operatorname{trace}(V_0) + nL^2}{d \det^{1/d}(V_0)}\right)} + \frac{d\, m\, \tau_m^{-1}}{\lambda - 1} \log\left(1 + \frac{n}{d(\lambda - 1)}\right).$$

SLIDE 24

Simulations

We fix $n = 3000$ and generate geometric delays with $\mathbb{E}[D_t] = 100$. In a real setting, this would correspond to an experiment lasting 3 hours, with average delays of 6 minutes. Then we let the cut-off vary, $m \in \{250, 500, 1000\}$, i.e., waiting times of 15 min, 30 min, and 1 h, respectively.

Figure 1: Comparison of the simulated behaviors of D-LinUCB and (waiting) LinUCB. [Three panels plot the regret R(T) against the round t ≤ 3000, one per cut-off m = 250, 500, 1000.]

SLIDE 25

Conclusions

  • Linear bandits are a powerful and well-understood way of solving the exploration-exploitation trade-off in a metric space;
  • the techniques have been extended to generalized linear models by Filippi et al. [9] and to kernel regression by Valko et al. [12, 13];
  • yet, including constraints and external sources of noise in real-world applications is challenging;
  • some use cases challenge the bandit model assumptions...
  • ... and then it's time to open the box of MDPs (e.g. UCRL and KL-UCRL, Auer et al. [5], Filippi et al. [10]).

Thanks!

SLIDE 27

References i

References

[1] Yasin Abbasi-Yadkori. Forced-exploration based algorithms for playing in bandits with large action sets. PhD thesis, University of Alberta, 2009.
[2] Yasin Abbasi-Yadkori, András Antos, and Csaba Szepesvári. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, 2009.
[3] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pages 2312–2320, 2011.
[4] Naoki Abe and Philip M. Long. Associative reinforcement learning using linear probabilistic concepts. In ICML, pages 3–11, 1999.
[5] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
[6] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
[7] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, volume 15, pages 208–214, 2011.

SLIDE 28

References ii

[8] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the Conference on Learning Theory (COLT), pages 355–366, 2008.
[9] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems (NIPS), pages 586–594, 2010.
[10] Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122. IEEE, 2010.
[11] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
[12] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
[13] Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning (ICML), pages 46–54, 2014.