SLIDE 1

Linear Bandits: From Theory to Applications

Claire Vernade

DeepMind – Foundations Team. Credits: Csaba Szepesvári and Tor Lattimore for their blog.

SLIDE 2

Sequential Decision Making

SLIDE 3

Real World Sequential Decision Making

SLIDE 4

Table of contents

  • 1. Linear Bandits
  • 2. Real-World Setting: Delayed Feedback

SLIDE 5

Linear Bandits

SLIDE 6

Linear Bandits

  • 1. In round $t$, observe the action set $\mathcal{A}_t \subset \mathbb{R}^d$.
  • 2. The learner chooses $A_t \in \mathcal{A}_t$ and receives $X_t$, satisfying
$$\mathbb{E}[X_t \mid \mathcal{A}_1, A_1, \dots, \mathcal{A}_t, A_t] = \langle A_t, \theta_* \rangle =: f_{\theta_*}(A_t) \quad \text{for some unknown } \theta_*.$$
  • 3. Light-tailed noise: $X_t - \langle A_t, \theta_* \rangle = \eta_t \sim \mathcal{N}(0, 1)$.

Goal: keep the regret
$$R_n = \mathbb{E}\left[\sum_{t=1}^{n} \left(\max_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle - X_t\right)\right]$$
small.
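
To make the interaction protocol concrete, here is a minimal simulation sketch in Python (the Gaussian action sets, the dimensions, and the uniform placeholder policy are illustrative assumptions, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 5, 10, 1000
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)  # unknown unit-norm parameter

regret = 0.0
for t in range(n):
    # 1. Environment reveals a fresh action set A_t of K vectors in R^d.
    actions = rng.normal(size=(K, d))
    # 2. Learner picks A_t (uniformly here; LinUCB replaces this step).
    a_t = actions[rng.integers(K)]
    # 3. Reward is <A_t, theta_*> plus N(0, 1) noise.
    x_t = a_t @ theta_star + rng.normal()
    # Pseudo-regret of the round: best mean reward minus the chosen one.
    regret += (actions @ theta_star).max() - a_t @ theta_star
print(f"pseudo-regret of uniform play over {n} rounds: {regret:.1f}")
```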

SLIDE 7

Real-World Setting

Typical setting: a user, represented by a feature vector $u_t$, arrives, and we have a finite set of (correlated) actions $(a_1, \dots, a_K)$. Some function $\Phi$ joins these vectors pairwise to create a contextualized action set: for all $i \in [K]$, $\Phi(u_t, a_i) = a_{t,i} \in \mathbb{R}^d$ and $\mathcal{A}_t = \{a_{t,1}, \dots, a_{t,K}\}$. No assumption is needed on the joining function $\Phi$, since the bandit takes over the decision step once the contextualized action set is formed. It is therefore equivalent to assume $\mathcal{A}_t \sim \Pi(\mathbb{R}^d)$ for some arbitrary distribution, or $\mathcal{A}_1, \dots, \mathcal{A}_n$ fixed arbitrarily by the environment.
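
For illustration only, since the slides leave $\Phi$ unspecified: one common joining function is the flattened outer product, sketched below (all names here are hypothetical).

```python
import numpy as np

def phi(u: np.ndarray, a: np.ndarray) -> np.ndarray:
    """One possible joining function: the flattened outer product u a^T,
    so <phi(u, a), theta> can capture user-action interactions.
    (Illustrative choice; the model does not prescribe Phi.)"""
    return np.outer(u, a).ravel()

u_t = np.array([1.0, 0.5])                                  # user features
raw_actions = [np.array([0.2, 0.8, 0.1]), np.array([0.9, 0.3, 0.4])]
A_t = [phi(u_t, a) for a in raw_actions]                    # set in R^{2*3}
```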

SLIDE 8

Toolbox of the optimist

Say the reward in round $t$ is $X_t$ and the action in round $t$ is $A_t \in \mathbb{R}^d$:
$$X_t = \langle A_t, \theta_* \rangle + \eta_t.$$
We want to estimate $\theta_*$ with the regularized least-squares estimator:
$$\hat\theta_t = V_t^{-1} \sum_{s=1}^{t} A_s X_s, \qquad V_0 = \lambda I, \qquad V_t = V_0 + \sum_{s=1}^{t} A_s A_s^\top.$$

Choice of confidence regions (ellipsoids) $C_t$:
$$C_t := \left\{ \theta \in \mathbb{R}^d : \|\theta - \hat\theta_{t-1}\|^2_{V_{t-1}} \le \beta_t \right\},$$
where, for $A$ positive definite, $\|x\|^2_A = x^\top A x$.
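
As a sketch, the estimator admits a simple online implementation; a minimal version under the definitions above (solving a linear system instead of forming $V_t^{-1}$ explicitly is a numerical choice of mine):

```python
import numpy as np

class RegularizedLeastSquares:
    """Online ridge regression: V_t = lam*I + sum_s A_s A_s^T, b_t = sum_s A_s X_s."""

    def __init__(self, d: int, lam: float = 1.0):
        self.V = lam * np.eye(d)   # V_0 = lambda * I
        self.b = np.zeros(d)

    def update(self, a: np.ndarray, x: float) -> None:
        self.V += np.outer(a, a)   # rank-one Gram-matrix update
        self.b += x * a

    def estimate(self) -> np.ndarray:
        # theta_hat_t = V_t^{-1} b_t; solve() avoids inverting V explicitly.
        return np.linalg.solve(self.V, self.b)
```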

SLIDE 9

LinUCB

“Choose the best action in the best environment amongst the plausible ones.”

Choose $C_t$ with suitable $(\beta_t)_t$ and let
$$A_t = \operatorname*{argmax}_{a \in \mathcal{A}_t} \max_{\theta \in C_t} \langle a, \theta \rangle.$$
Or, more concretely, for each action $a \in \mathcal{A}_t$, compute the “optimistic index”
$$U_t(a) = \max_{\theta \in C_t} \langle a, \theta \rangle.$$
Since we are maximizing a linear function over a closed convex set, the solution is explicit:
$$A_t = \operatorname*{argmax}_a U_t(a) = \operatorname*{argmax}_a \langle a, \hat\theta_{t-1} \rangle + \sqrt{\beta_t}\, \|a\|_{V_{t-1}^{-1}}.$$
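
A compact sketch of the resulting selection rule (assuming an `actions` matrix with one action per row; the calibration of $\beta_t$ is the subject of the confidence-ellipsoid slide below):

```python
import numpy as np

def linucb_action(actions: np.ndarray, V: np.ndarray,
                  theta_hat: np.ndarray, beta: float) -> int:
    """Index of argmax_a <a, theta_hat> + sqrt(beta) * ||a||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    # ||a||_{V^{-1}}^2 = a^T V^{-1} a, computed for each row of `actions`.
    widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    ucb = actions @ theta_hat + np.sqrt(beta) * widths
    return int(np.argmax(ucb))
```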

SLIDE 10

Optimism in the Face of Uncertainty Principle

SLIDE 11

Regret Bound

Assumptions:

  • 1. Bounded scalar mean reward: $|\langle a, \theta_* \rangle| \le 1$ for any $a \in \cup_t \mathcal{A}_t$.
  • 2. Bounded actions: for any $a \in \cup_t \mathcal{A}_t$, $\|a\|_2 \le L$.
  • 3. Honest confidence intervals: there exists $\delta \in (0,1)$ such that, with probability $1-\delta$, for all $t \in [n]$, $\theta_* \in C_t$ for some choice of $(\beta_t)_{t \le n}$.

Theorem (LinUCB Regret)
Let the conditions listed above hold. Then, with probability $1-\delta$, the regret of LinUCB satisfies
$$\hat R_n \le \sqrt{8 d n \beta_n \log\left(\frac{d\lambda + nL^2}{d\lambda}\right)}.$$

SLIDE 12

Proof

Jensen’s inequality shows that
$$\hat R_n = \sum_{t=1}^n \langle A_t^* - A_t, \theta_* \rangle =: \sum_{t=1}^n r_t \le \sqrt{n \sum_{t=1}^n r_t^2},$$
where $A_t^* := \operatorname*{argmax}_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle$. Let $\tilde\theta_t$ be the vector that realizes the maximum over the ellipsoid: $\tilde\theta_t \in C_t$ such that $\langle A_t, \tilde\theta_t \rangle = U_t(A_t)$. From the definition of LinUCB,
$$\langle A_t^*, \theta_* \rangle \le U_t(A_t^*) \le U_t(A_t) = \langle A_t, \tilde\theta_t \rangle.$$
Then, by Cauchy–Schwarz,
$$r_t \le \langle A_t, \tilde\theta_t - \theta_* \rangle \le \|A_t\|_{V_{t-1}^{-1}} \|\tilde\theta_t - \theta_*\|_{V_{t-1}} \le 2 \|A_t\|_{V_{t-1}^{-1}} \sqrt{\beta_t}.$$

SLIDE 13

Elliptical Potential Lemma

So we now have a new upper bound:
$$\hat R_n = \sum_{t=1}^n r_t \le \sqrt{n \sum_{t=1}^n r_t^2} \le 2 \sqrt{n \beta_n \sum_{t=1}^n \left(1 \wedge \|A_t\|^2_{V_{t-1}^{-1}}\right)}.$$

Lemma (Abbasi-Yadkori et al. (2011))
Let $x_1, \dots, x_n \in \mathbb{R}^d$, $V_t = V_0 + \sum_{s=1}^t x_s x_s^\top$ for $t \in [n]$, and $L \ge \max_t \|x_t\|_2$. Then,
$$\sum_{t=1}^n \left(1 \wedge \|x_t\|^2_{V_{t-1}^{-1}}\right) \le 2 \log\left(\frac{\det V_n}{\det V_0}\right) \le 2 d \log\left(\frac{\operatorname{trace}(V_0) + nL^2}{d \det^{1/d}(V_0)}\right).$$
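
The lemma is easy to sanity-check numerically; a throwaway sketch with $V_0 = I$ and random unit-norm directions (both assumptions mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 5, 2000, 1.0
V = np.eye(d)                          # V_0 = I
V0_det = np.linalg.det(V)
lhs = 0.0
for _ in range(n):
    x = rng.normal(size=d)
    x *= L / np.linalg.norm(x)         # enforce ||x_t||_2 <= L
    lhs += min(1.0, x @ np.linalg.solve(V, x))   # 1 ∧ ||x_t||^2_{V_{t-1}^{-1}}
    V += np.outer(x, x)                # V_t = V_{t-1} + x_t x_t^T
rhs = 2 * np.log(np.linalg.det(V) / V0_det)
print(lhs <= rhs, lhs, rhs)            # the elliptical potential bound holds
```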

SLIDE 14

Confidence Ellipsoids

Assumptions: $\|\theta_*\| \le S$, and let $(A_s)_s$, $(\eta_s)_s$ be such that for any $1 \le s \le t$, $\eta_s \mid \mathcal{F}_{s-1} \sim \operatorname{subG}(1)$, where $\mathcal{F}_s = \sigma(A_1, \eta_1, \dots, A_{s-1}, \eta_{s-1}, A_s)$.

Fix $\delta \in (0,1)$. Let
$$\sqrt{\beta_{t+1}} = \sqrt{\lambda}\, S + \sqrt{2 \log\frac{1}{\delta} + \log\frac{\det V_t(\lambda)}{\lambda^d}} \le \sqrt{\lambda}\, S + \sqrt{2 \log\frac{1}{\delta} + d \log\left(\frac{\lambda d + nL^2}{\lambda d}\right)},$$
and
$$C_{t+1} = \left\{ \theta \in \mathbb{R}^d : \|\hat\theta_t - \theta\|_{V_t(\lambda)} \le \sqrt{\beta_{t+1}} \right\}.$$

Theorem
$C_{t+1}$ is a confidence set for $\theta_*$ at level $1-\delta$: $\mathbb{P}(\theta_* \in C_{t+1}) \ge 1-\delta$.

Proof: see Chapter 20 of Bandit Algorithms (www.banditalgs.com).
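
For completeness, a sketch of the radius computation under these assumptions (the det-based form, which is the tighter of the two displayed above; the function name is mine):

```python
import numpy as np

def sqrt_beta(V: np.ndarray, lam: float, S: float, delta: float) -> float:
    """Radius sqrt(beta_{t+1}) = sqrt(lam)*S
       + sqrt(2*log(1/delta) + log(det V_t(lam) / lam^d))."""
    d = V.shape[0]
    # log det(V) - d*log(lam), via slogdet for numerical stability
    _, logdet = np.linalg.slogdet(V)
    return np.sqrt(lam) * S + np.sqrt(
        2 * np.log(1 / delta) + logdet - d * np.log(lam))
```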

SLIDE 15

History

  • Abe and Long [4] introduced stochastic linear bandits into the machine learning literature.
  • Auer [6] was the first to consider optimism for linear bandits (LinRel, SupLinRel). Main restriction: $|\mathcal{A}_t| < +\infty$.
  • Confidence ellipsoids: Dani et al. [8] (ConfidenceBall2), Rusmevichientong and Tsitsiklis [11] (Uncertainty Ellipsoid Policy), Abbasi-Yadkori et al. [3] (OFUL).
  • The name LinUCB comes from Chu et al. [7].
  • Alternative routes:
    • Explore-then-commit for action sets with smooth boundary: Abbasi-Yadkori [1], Abbasi-Yadkori et al. [2], Rusmevichientong and Tsitsiklis [11].
    • Phased elimination
    • Thompson sampling

SLIDE 16

Summary

Theorem (LinUCB Regret)
Let the conditions listed above hold. Then, with probability $1-\delta$, the regret of LinUCB satisfies
$$\hat R_n \le \sqrt{8 d n \beta_n \log\left(\frac{\operatorname{trace}(V_0) + nL^2}{d \det^{1/d}(V_0)}\right)} = O(d\sqrt{n}).$$

Linear bandits are an elegant model of the exploration-exploitation dilemma when actions are correlated. The main ingredients of the regret analysis are:

  • bounding the instantaneous regret using the definition of optimism;
  • a maximal concentration inequality holding for a randomized, sequential design;
  • the Elliptical Potential Lemma.

SLIDE 17

Real-World Setting: Delayed Feedback

SLIDE 18

In a real-world application, rewards are delayed ...

SLIDE 19

In a real-world application, rewards are delayed ... and censored.

SLIDE 20

Delayed Linear Bandits

Modified setting: at round $t \ge 1$,

  • receive the contextualized action set $\mathcal{A}_t = \{a_1, \dots, a_K\}$ and choose an action $A_t \in \mathcal{A}_t$;
  • two random variables are generated but not observed: $X_t \sim \mathcal{B}(\theta^\top A_t)$ and $D_t \sim \mathcal{D}(\tau)$;
  • at time $t + D_t$, the reward $X_t$ of action $A_t$ is disclosed ...
  • ... unless $D_t > m$: if the delay is too long, the reward is discarded.

New parameter: $0 < m < T$ is the cut-off time of the system. If the delay is longer, the reward is never received. The delay distribution $\mathcal{D}(\tau)$ characterizes the proportion of converting actions: $\tau_m = \mathbb{P}(D_t \le m)$.
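
A sketch of this feedback mechanism (the geometric delay distribution matches the later simulations; the queue layout and the names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 250                        # cut-off: feedback delayed beyond m rounds is lost
mean_delay = 100               # assumed geometric delays with E[D_t] = 100
pending = {}                   # disclosure round -> list of (action, reward)

def play_round(t: int, a_t: np.ndarray, theta: np.ndarray) -> None:
    """Generate (X_t, D_t); schedule disclosure at t + D_t unless censored."""
    x_t = rng.binomial(1, float(np.clip(a_t @ theta, 0.0, 1.0)))  # X_t ~ B(<a,theta>)
    d_t = rng.geometric(1.0 / mean_delay)                         # D_t ~ D(tau)
    if d_t <= m:                                                  # else: discarded
        pending.setdefault(t + d_t, []).append((a_t, x_t))

def observe(t: int):
    """Feedback disclosed to the learner at round t."""
    return pending.pop(t, [])
```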

SLIDE 21

A new estimator

We now have:
$$V_t = \sum_{s=1}^{t-1} A_s A_s^\top, \qquad \tilde b_t = \sum_{s=1}^{t-1} A_s X_s \mathbb{1}\{D_s \le m\},$$
where $\tilde b_t$ contains additional non-identically distributed samples:
$$\tilde b_t = \sum_{s=1}^{t-m} A_s X_s \mathbb{1}\{D_s \le m\} + \sum_{s=t-m+1}^{t-1} A_s X_s \mathbb{1}\{D_s \le t - s\}.$$

The “conditionally biased” least-squares estimator includes every received feedback:
$$\hat\theta^b_t = V_t^{-1} \tilde b_t.$$

Baseline: use the previous estimator but discard the last $m$ steps:
$$\hat\theta^{\mathrm{disc}}_t = V_{t-m}^{-1} b_{t-m}, \qquad \text{with } \mathbb{E}[\hat\theta^{\mathrm{disc}}_t \mid \mathcal{F}_t] \approx \tau_m \theta.$$
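
A sketch of both estimators from a log of past rounds (assuming each record stores the round $s$, action $A_s$, reward $X_s$, and realized delay $D_s$; the $\lambda I$ term is added for invertibility and is my assumption):

```python
import numpy as np

def estimators(log, t: int, m: int, d: int, lam: float = 1.0):
    """Biased estimator theta_b (every feedback received by round t) and the
    discarding baseline theta_disc (only rounds s <= t - m)."""
    V_t = lam * np.eye(d)
    V_tm = lam * np.eye(d)
    b_tilde = np.zeros(d)
    b_disc = np.zeros(d)
    for s, a, x, delay in log:               # one record per past round s < t
        if s >= t:
            continue
        V_t += np.outer(a, a)
        if delay <= min(m, t - s):           # disclosed by t and not censored
            b_tilde += x * a
        if s <= t - m:
            V_tm += np.outer(a, a)
            if delay <= m:
                b_disc += x * a
    return np.linalg.solve(V_t, b_tilde), np.linalg.solve(V_tm, b_disc)
```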

SLIDE 22

Confidence interval and the D-LinUCB policy

We remark that
$$\hat\theta^b_t - \tau_m\theta = \underbrace{\hat\theta^b_t - \hat\theta^{\mathrm{disc}}_{t+m}}_{\text{finite bias}} + \underbrace{\hat\theta^{\mathrm{disc}}_{t+m} - \tau_m\theta}_{\text{same as before}}.$$

For the new $C_t$, we have new optimistic indices:
$$A_t = \operatorname*{argmax}_{a \in \mathcal{A}_t} \max_{\theta \in C_t} \langle a, \theta \rangle.$$
But now the solution has an extra (vanishing) bias term:
$$A_t = \operatorname*{argmax}_a \langle a, \hat\theta_t \rangle + \sqrt{\beta_t}\, \|a\|_{V_{t-1}^{-1}} + m\, \|a\|_{V_{t-1}^{-2}}.$$

D-LinUCB: an easy, straightforward, harmless modification of LinUCB, with regret guarantees in the delayed-feedback setting.
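
The selection rule then differs from the LinUCB sketch above by a single extra term (same hedges apply; note that $\|a\|_{V^{-2}}$ uses the squared inverse matrix):

```python
import numpy as np

def d_linucb_action(actions: np.ndarray, V: np.ndarray,
                    theta_hat: np.ndarray, beta: float, m: int) -> int:
    """argmax_a <a, theta_hat> + sqrt(beta)*||a||_{V^{-1}} + m*||a||_{V^{-2}}."""
    V_inv = np.linalg.inv(V)
    w1 = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    V_inv2 = V_inv @ V_inv                 # for the bias width ||a||_{V^{-2}}
    w2 = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv2, actions))
    return int(np.argmax(actions @ theta_hat + np.sqrt(beta) * w1 + m * w2))
```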

SLIDE 23

Regret bound

Theorem (D-LinUCB Regret)
Under the same conditions as before, with $V_0 = \lambda I$, with probability $1-\delta$ the regret of D-LinUCB satisfies
$$\hat R_n \le \tau_m^{-1} \sqrt{8 d n \beta_n \log\left(\frac{\operatorname{trace}(V_0) + nL^2}{d \det^{1/d}(V_0)}\right)} + \frac{d\, m\, \tau_m^{-1}}{\lambda - 1} \log\left(1 + \frac{n}{d(\lambda - 1)}\right).$$

SLIDE 24

Simulations

We fix $n = 3000$ and generate geometric delays with $\mathbb{E}[D_t] = 100$. In a real setting, this would correspond to an experiment lasting 3 hours, with average delays of 6 minutes. Then we let the cut-off vary, $m \in \{250, 500, 1000\}$, i.e., waiting times of 15 min, 30 min, and 1 h, respectively.

Figure 1: Comparison of the simulated behaviors of D-LinUCB and (waiting) LinUCB. [Three panels plot the regret R(T) against the round t ≤ 3000, one per cut-off m = 250, 500, 1000.]

SLIDE 25

Conclusions

  • Linear bandits are a powerful and well-understood way of solving the exploration-exploitation trade-off in a metric space;
  • the techniques have been extended to generalized linear models by Filippi et al. [9] and to kernel regression by Valko et al. [12, 13];
  • yet, including constraints and external sources of noise in real-world applications is challenging;
  • some use cases challenge the bandit model assumptions...
  • ... and then it's time to open the box of MDPs (e.g. UCRL and KL-UCRL, Auer et al. [5], Filippi et al. [10]).

Thanks!

SLIDE 27

References i

References

[1] Yasin Abbasi-Yadkori. Forced-exploration based algorithms for playing in bandits with large action sets. PhD thesis, University of Alberta, 2009.
[2] Yasin Abbasi-Yadkori, András Antos, and Csaba Szepesvári. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, 2009.
[3] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pages 2312–2320, 2011.
[4] Naoki Abe and Philip M. Long. Associative reinforcement learning using linear probabilistic concepts. In ICML, pages 3–11, 1999.
[5] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
[6] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
[7] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, volume 15, pages 208–214, 2011.

SLIDE 28

References ii

[8] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the Conference on Learning Theory (COLT), pages 355–366, 2008.
[9] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems (NIPS), pages 586–594, 2010.
[10] Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122. IEEE, 2010.
[11] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
[12] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
[13] Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning (ICML), pages 46–54, 2014.