Linear Bandits, Dávid Pál, Google, New York & Department of Computing Science, University of Alberta. PowerPoint PPT Presentation.



SLIDE 1

Linear Bandits

Dávid Pál

Google, New York & Department of Computing Science, University of Alberta
dpal@google.com

November 2, 2011

joint work with Yasin Abbasi-Yadkori and Csaba Szepesvári

SLIDE 2

Linear Bandits

In round t = 1, 2, . . .

◮ Choose an action Xt from a set Dt ⊂ ℝ^d.
◮ Receive a reward ⟨Xt, θ∗⟩ + random noise.
◮ Weights θ∗ are unknown but fixed.
◮ Goal: Maximize total reward.

SLIDE 3

Motivation

◮ exploration & exploitation with side information
◮ action = arm = ad = feature vector
◮ reward = click

SLIDE 4

Outline

◮ Formal model & Regret
◮ Algorithm: Optimism in the Face of Uncertainty principle
◮ Confidence sets for Least Squares
◮ Sparse models: Online-to-Confidence-Set Conversion

SLIDE 5

Formal model

Unknown but fixed weight vector θ∗ ∈ ℝ^d. In round t = 1, 2, . . .

◮ Receive Dt ⊂ ℝ^d
◮ Choose an action Xt ∈ Dt
◮ Receive a reward

Yt = ⟨Xt, θ∗⟩ + ηt

Noise is conditionally R-sub-Gaussian, i.e. ∀γ ∈ ℝ

E[exp(γηt) | X1:t, η1:t−1] ≤ exp(γ²R²/2).
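The protocol on this slide can be sketched as a small simulation. This is my own toy code, not the authors': the decision sets, the noise scale, and the naive policy are illustrative assumptions; Gaussian noise with standard deviation R is indeed R-sub-Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
d, R = 3, 0.1
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)   # unknown but fixed weight vector

def play_round(choose):
    """One round: receive D_t, choose X_t in D_t, receive noisy reward Y_t."""
    D_t = rng.normal(size=(5, d))          # decision set: 5 random actions (an assumption)
    X_t = choose(D_t)
    eta_t = rng.normal(scale=R)            # R-sub-Gaussian noise
    Y_t = X_t @ theta_star + eta_t
    return X_t, Y_t

# A naive policy that picks the first action, just to exercise the protocol.
X, Y = play_round(lambda D: D[0])
print(Y, X @ theta_star)                   # reward = expected reward + small noise
```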
SLIDE 6

Sub-Gaussianity

Definition
Random variable Z is R-sub-Gaussian for some R ≥ 0 if ∀γ ∈ ℝ

E[exp(γZ)] ≤ exp(γ²R²/2).

The condition implies that

◮ E[Z] = 0
◮ Var[Z] ≤ R²

Examples:

◮ Zero-mean bounded in an interval of length 2R (Hoeffding-Azuma)
◮ Zero-mean Gaussian with variance ≤ R²
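The first example can be checked numerically. This is my own sanity check (not from the talk): a zero-mean variable bounded in [−R, R] should satisfy the moment-generating-function bound for every γ, per Hoeffding's lemma.

```python
import numpy as np

rng = np.random.default_rng(1)
R = 1.0
Z = rng.uniform(-R, R, size=200_000)   # zero-mean, bounded in an interval of length 2R

for g in [-3.0, -1.0, 0.5, 2.0]:
    mgf = np.mean(np.exp(g * Z))       # Monte-Carlo estimate of E[exp(g Z)]
    bound = np.exp(g * g * R * R / 2)  # the sub-Gaussian bound exp(g^2 R^2 / 2)
    print(f"g={g:+.1f}  E[exp(gZ)] ~ {mgf:.4f}  bound={bound:.4f}")
    assert mgf <= bound
```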

SLIDE 7

Regret

◮ If we knew θ∗, then in round t we'd choose action

X∗_t = argmax_{x ∈ Dt} ⟨x, θ∗⟩

◮ Regret is our reward in n rounds relative to X∗_t:

Regret_n = Σ_{t=1}^n ⟨X∗_t, θ∗⟩ − Σ_{t=1}^n ⟨Xt, θ∗⟩

◮ We want Regret_n / n → 0 as n → ∞
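The definition can be exercised on a toy problem of my own making: accumulate the per-round gap between the best value ⟨X∗_t, θ∗⟩ and the value of the played action. A uniformly random policy keeps a constant per-round gap, so its average regret does not vanish.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 1000
theta_star = rng.normal(size=d)

regret = 0.0
for t in range(n):
    D_t = rng.normal(size=(10, d))           # decision set in round t (an assumption)
    values = D_t @ theta_star                # <x, theta_star> for each x in D_t
    best = values.max()                      # value of X*_t = argmax <x, theta_star>
    played = values[rng.integers(10)]        # a (bad) uniformly random policy
    regret += best - played                  # Regret_n accumulates the gaps

print(regret / n)   # bounded away from 0: random play never drives Regret_n / n to 0
```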

SLIDE 8

Optimism in the Face of Uncertainty Principle

◮ Maintain a confidence set Ct ⊆ ℝ^d such that θ∗ ∈ Ct with high probability.
◮ In round t, choose

(Xt, θ̃t) = argmax_{(x, θ) ∈ Dt × Ct−1} ⟨x, θ⟩

θ̃t is an "optimistic" estimate of θ∗
◮ UCB algorithm is a special case.
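One OFU step can be sketched concretely (my own code). For an ellipsoidal confidence set C = {θ : ‖θ − θ̂‖_V ≤ β}, the inner maximum has a closed form, max_{θ ∈ C} ⟨x, θ⟩ = ⟨x, θ̂⟩ + β‖x‖_{V⁻¹}, so the joint argmax over Dt × C reduces to maximizing this optimistic index over the actions; the numbers below are illustrative.

```python
import numpy as np

def ofu_action(D_t, theta_hat, V, beta):
    """Index of the action maximizing <x, theta_hat> + beta * ||x||_{V^-1}."""
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum("ij,jk,ik->i", D_t, V_inv, D_t))  # ||x||_{V^-1} per row
    return int(np.argmax(D_t @ theta_hat + beta * widths))

# Tiny example: the second action looks worse on average but is far more
# uncertain, so optimism prefers it.
theta_hat = np.array([1.0, 0.0])
V = np.diag([100.0, 0.01])        # dimension 0 well explored, dimension 1 not
D_t = np.array([[1.0, 0.0],
                [0.0, 1.0]])
print(ofu_action(D_t, theta_hat, V, beta=2.0))   # -> 1 (the uncertain action)
```

With β = 0 the same call degenerates to the greedy choice, action 0.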

SLIDE 9

Least Squares

◮ Data (X1, Y1), . . . , (Xn, Yn) such that Yt ≈ ⟨Xt, θ∗⟩
◮ Stack them into matrices: X1:n is n × d and Y1:n is n × 1
◮ Least squares estimate:

θ̂n = (X1:nᵀ X1:n + λI)⁻¹ X1:nᵀ Y1:n

◮ Let Vn = X1:nᵀ X1:n + λI

Theorem
If ‖θ∗‖₂ ≤ S, then with probability at least 1 − δ, for all t, θ∗ lies in

Ct = { θ : ‖θ̂t − θ‖_{Vt} ≤ R √(2 ln(det(Vt)^{1/2} / (δ det(λI)^{1/2}))) + S√λ }

where ‖v‖_A = √(vᵀAv) is the matrix A-norm.
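The estimate and the confidence radius on this slide are easy to compute. The sketch below is mine; the data-generating choices (design, noise scale, S, δ) are illustrative assumptions, and the log-determinants are taken via `slogdet` for numerical stability.

```python
import numpy as np

def confidence_set(X, Y, lam, R, S, delta):
    """Regularized LS estimate, V_n, and the radius
    beta = R * sqrt(2 ln(det(V)^{1/2} / (delta det(lam I)^{1/2}))) + S sqrt(lam)."""
    n, d = X.shape
    V = X.T @ X + lam * np.eye(d)
    theta_hat = np.linalg.solve(V, X.T @ Y)
    half_logdet = 0.5 * (np.linalg.slogdet(V)[1] - d * np.log(lam))
    beta = R * np.sqrt(2 * (half_logdet + np.log(1 / delta))) + S * np.sqrt(lam)
    return theta_hat, V, beta

rng = np.random.default_rng(3)
d, n, R, lam = 3, 500, 0.1, 1.0
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)
X = rng.normal(size=(n, d))
Y = X @ theta_star + rng.normal(scale=R, size=n)

theta_hat, V, beta = confidence_set(X, Y, lam, R, S=1.0, delta=0.01)
err = theta_hat - theta_star
print(np.sqrt(err @ V @ err) <= beta)   # theta_star in C_n, w.p. >= 1 - delta
```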

SLIDE 10

Confidence Set Ct

[Figure: ellipsoid Ct centered at the least squares estimate θ̂t, with θ∗ inside and θ̃t+1 on the boundary]

◮ Least squares solution θ̂t is the center of Ct
◮ θ∗ lies somewhere in Ct w.h.p.
◮ The next optimistic estimate θ̃t+1 lies on the boundary of Ct

SLIDE 11

Comparison with Previous Confidence Sets

◮ Our bound:

‖θ̂t − θ∗‖_{Vt} ≤ R √(2 ln(det(Vt)^{1/2} / (δ det(λI)^{1/2}))) + S√λ

◮ [Dani et al. (2008)] If ‖θ∗‖₂, ‖Xt‖₂ ≤ 1 then for a specific λ

‖θ̂t − θ∗‖_{Vt} ≤ R max{ √(128 d ln(t) ln(t²/δ)), (8/3) ln(t²/δ) }

◮ [Rusmevichientong and Tsitsiklis (2010)] If ‖Xt‖₂ ≤ 1

‖θ̂t − θ∗‖_{Vt} ≤ 2Rκ √(ln t) √(d ln t + ln(t²/δ)) + S√λ

where κ = 3 + 2 ln((1 + λd)/λ). Our bound doesn't depend on t.

SLIDE 12

Regret of the Bandit Algorithm

Theorem ([Dani et al. (2008)])
If ‖θ∗‖₂ ≤ 1 and the Dt's are subsets of the unit ℓ₂-ball, then with probability at least 1 − δ,

Regret_n ≤ O(R d √n · polylog(n, d, 1/δ))

We get the same result with a smaller polylog(n, d, 1/δ) factor.
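Putting the previous slides together gives an end-to-end sketch of the optimistic algorithm with the self-normalized confidence set. This is my own toy implementation, not the authors' code; the action sets, constants, and horizon are illustrative, and average regret should shrink as n grows.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, R, lam, S, delta = 2, 2000, 0.1, 1.0, 1.0, 0.01
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)

V = lam * np.eye(d)                 # V_t = lam I + sum_s X_s X_s^T
XtY = np.zeros(d)                   # sum_s Y_s X_s
logdet_lamI = d * np.log(lam)
regret = 0.0
for t in range(1, n + 1):
    D_t = rng.normal(size=(20, d))
    D_t /= np.linalg.norm(D_t, axis=1, keepdims=True)   # unit-norm actions
    theta_hat = np.linalg.solve(V, XtY)                 # regularized least squares
    beta = R * np.sqrt(2 * (0.5 * (np.linalg.slogdet(V)[1] - logdet_lamI)
                            + np.log(1 / delta))) + S * np.sqrt(lam)
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum("ij,jk,ik->i", D_t, V_inv, D_t))
    x = D_t[np.argmax(D_t @ theta_hat + beta * widths)]  # optimistic action
    y = x @ theta_star + rng.normal(scale=R)             # noisy reward
    V += np.outer(x, x)
    XtY += y * x
    regret += (D_t @ theta_star).max() - x @ theta_star

print(regret / n)   # sublinear Regret_n, so Regret_n / n -> 0
```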

SLIDE 13

Sparse Bandits

What if θ∗ is sparse?

◮ Not a good idea to use least squares.
◮ Better to use e.g. L1-regularization.
◮ How do we construct confidence sets?

Our new technique: Online-to-Confidence-Set Conversion

◮ Similar to Online-to-Batch Conversion, but very different
◮ We start with an online prediction algorithm.

SLIDE 14

Online Prediction Algorithms

In round t

◮ Receive Xt ∈ ℝ^d
◮ Predict Ŷt ∈ ℝ
◮ Receive correct label Yt ∈ ℝ
◮ Suffer loss (Yt − Ŷt)²

No assumptions whatsoever on (X1, Y1), (X2, Y2), . . .

There are heaps of algorithms of this structure:

◮ online gradient descent [Zinkevich (2003)]
◮ online least-squares [Azoury and Warmuth (2001), Vovk (2001)]
◮ exponentiated gradient [Kivinen and Warmuth (1997)]
◮ online LASSO (??)
◮ SeqSEW [Gerchinovitz (2011), Dalalyan and Tsybakov (2007)]
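The first algorithm on the list fits this protocol in a few lines. The sketch below is mine: online gradient descent on the squared loss with a decaying step size of my choosing, run on synthetic data (the protocol itself needs no assumptions on the data), and it also tracks the cumulative loss of a fixed linear predictor for comparison.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 3, 2000
theta = rng.normal(size=d); theta /= np.linalg.norm(theta)  # a fixed comparator

w = np.zeros(d)
total_loss, total_loss_theta = 0.0, 0.0
for t in range(1, n + 1):
    X_t = rng.normal(size=d) / np.sqrt(d)        # receive X_t
    Y_t = X_t @ theta + rng.normal(scale=0.1)    # receive label Y_t (after predicting)
    y_hat = w @ X_t                              # predict
    total_loss += (Y_t - y_hat) ** 2             # suffer squared loss
    total_loss_theta += (Y_t - X_t @ theta) ** 2
    grad = 2 * (y_hat - Y_t) * X_t               # gradient of the squared loss at w
    w -= (0.5 / np.sqrt(t)) * grad               # OGD step, step size 0.5/sqrt(t)

rho_n = total_loss - total_loss_theta            # regret vs the linear predictor theta
print(rho_n, rho_n / n)                          # grows sublinearly in n
```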

SLIDE 15

Online Prediction Algorithms, cont'd

◮ Regret with respect to a linear predictor θ ∈ ℝ^d

ρn(θ) = Σ_{t=1}^n (Yt − Ŷt)² − Σ_{t=1}^n (Yt − ⟨Xt, θ⟩)²

◮ Prediction algorithms come with "regret bounds" Bn:

∀n ρn(θ) ≤ Bn

◮ Bn depends on n, d, θ and possibly X1, X2, . . . , Xn and Y1, Y2, . . . , Yn
◮ Typically, Bn = O(√n) or Bn = O(log n)

SLIDE 16

Online-to-Confidence-Set Conversion

◮ Data (X1, Y1), . . . , (Xn, Yn) where Yt = ⟨Xt, θ∗⟩ + ηt and ηt is conditionally R-sub-Gaussian.
◮ Predictions Ŷ1, Ŷ2, . . . , Ŷn
◮ Regret bound ρn(θ∗) ≤ Bn

Theorem (Conversion)
With probability at least 1 − δ, for all n, θ∗ lies in

Cn = { θ ∈ ℝ^d : Σ_{t=1}^n (Ŷt − ⟨Xt, θ⟩)² ≤ 1 + 2Bn + 32R² ln((R√8 + √(1 + Bn)) / δ) }
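The radius of the converted set is a simple function of Bn, R, and δ, so membership in Cn is a one-line check. The helpers below are my own names, just to make the theorem's set concrete; note the radius grows only linearly in Bn, so a small (e.g. sparse-friendly) regret bound yields a tight set.

```python
import numpy as np

def conversion_radius(B_n, R, delta):
    """Right-hand side of the conversion theorem's inequality."""
    return 1 + 2 * B_n + 32 * R**2 * np.log((R * np.sqrt(8) + np.sqrt(1 + B_n)) / delta)

def in_confidence_set(theta, X, y_hat, B_n, R, delta):
    """Is theta in C_n, i.e. sum_t (y_hat_t - <X_t, theta>)^2 within the radius?"""
    return bool(np.sum((y_hat - X @ theta) ** 2) <= conversion_radius(B_n, R, delta))

for B_n in [10.0, 100.0, 1000.0]:
    print(B_n, conversion_radius(B_n, R=0.1, delta=0.01))
```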

SLIDE 17

Optimistic Algorithm with Conversion

Theorem
If |⟨x, θ∗⟩| ≤ 1 for all x ∈ Dt and all t, then with probability at least 1 − δ, for all n, the regret of the Optimistic Algorithm is

Regret_n ≤ O(√(d n Bn) · polylog(n, d, 1/δ, Bn))
SLIDE 18

Bandits combined with SeqSEW

Theorem ([Gerchinovitz (2011)])
If ‖θ‖∞ ≤ 1 and ‖θ‖₀ ≤ p then the SeqSEW algorithm has regret bound ρn(θ) ≤ Bn = O(p log(nd)).

Suppose ‖θ∗‖₂ ≤ 1 and ‖θ∗‖₀ ≤ p. Via the conversion, the Optimistic Algorithm has regret

O(R √(p d n) · polylog(n, d, 1/δ))

which is better than O(R d √n · polylog(n, d, 1/δ)).

SLIDE 19

Open problems

◮ Confidence sets for batch algorithms, e.g. offline LASSO.
◮ Adaptive bandit algorithm that doesn't need p upfront.

SLIDE 20

Questions? Read papers at http://david.palenica.com/

SLIDE 21

References

Katy S. Azoury and Manfred K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43:211–246, 2001.

Arnak S. Dalalyan and Alexandre B. Tsybakov. Aggregation by exponential weighting and sharp oracle inequalities. In Proceedings of the 20th Annual Conference on Learning Theory, pages 97–111, 2007.

Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Rocco Servedio and Tong Zhang, editors, Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 355–366, 2008.

Sébastien Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. In Proceedings of the 24th Annual Conference on Learning Theory (COLT 2011), 2011.

Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, January 1997.

Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

Vladimir Vovk. Competitive on-line statistics. International Statistical Review, 69:213–248, 2001.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.