An Improved Regret Bound for Thompson Sampling in the Gaussian - - PowerPoint PPT Presentation

an improved regret bound for thompson sampling in the
SMART_READER_LITE
LIVE PREVIEW

An Improved Regret Bound for Thompson Sampling in the Gaussian - - PowerPoint PPT Presentation

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting Cem Kalkanl, Ayfer Ozg ur Stanford University ISIT, June 2020 An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 1


slide-1
SLIDE 1

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting

Cem Kalkanlı, Ayfer ¨ Ozg¨ ur

Stanford University

ISIT, June 2020

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 1 / 13

slide-2
SLIDE 2

The Gaussian Linear Bandit Problem

Compact action set U: ||u||2 ≤ c for any u ∈ U Reward at time t: Yut = θTut + ηt where θ ∼ N(µ, K), θ ∈ Rd ηt ∼ N(0, σ2), ηt ∈ R Optimal action and reward: u∗ = arg max

u∈U

θTu Yu∗,t = θTu∗ + ηt

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 2 / 13

slide-3
SLIDE 3

A Policy and the Performance Criterion

Past t − 1 observations: Ht−1 = {u1, Yu1, ..., ut−1, Yut−1}, H0 = ∅ A policy π = (π1, π2, π3, ...): P(ut ∈ ·| Ht−1) = πt(Ht−1)(·). The performance criterion for the policy π, the Bayesian regret: R(T, π) =

T

  • t=1

E[Yu∗,t − Yut]

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 3 / 13

slide-4
SLIDE 4

Posterior of θ

Claim: θ| Ht ∼ N(µt, Kt) for any non negative integer t where µt = E[θ|Ht] Kt = E[(θ − E[θ|Ht])(θ − E[θ|Ht])T|Ht] Assume θ| Ht−1 ∼ N(µt−1, Kt−1) θ is independent of ut given Ht−1. (θ, Yut) is a Gaussian random vector given {Ht−1, ut}. Result: θ| Ht ∼ N(µt, Kt)

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 4 / 13

slide-5
SLIDE 5

Thompson Sampling

Proposed by Thompson (1933) Posterior Matching: P(ut ∈ B| Ht−1) = P(u∗ ∈ B| Ht−1) Significant empirical performance in online service, display advertising, and online revenue management

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 5 / 13

slide-6
SLIDE 6

Thompson Sampling For The Gaussian Linear Bandit

Implementation: Select ut

1

Sample ˆ θt ∼ N(µt−1, Kt−1)

2

ut = arg maxu∈U ˆ θT

t u

Compute the posterior of θ given Ht: µt ← E[θ|Ht] Kt ← E[(θ − E[θ|Ht])(θ − E[θ|Ht])T|Ht] Keywords: Thompson sampling: πTS The Bayesian regret of Thompson sampling: R(T, πTS)

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 6 / 13

slide-7
SLIDE 7

Prior Work

Lower bound: R(T, π) √ T for any policy π in a certain Gaussian linear bandit setting (Rusmevichientong & Tsitsiklis, 2010) Thompson sampling:

1

R(T, πTS) log(T) √ T (Russo & Van Roy, 2014)

2

R(T, πTS) √ T when |U| < ∞ (Russo & Van Roy, 2016)

3

R(T, πTS)

  • T log(T) when θ and U are bounded, not

including the Gaussian linear bandit (Dong & Van Roy, 2018)

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 7 / 13

slide-8
SLIDE 8

Main Result

Theorem The Bayesian regret of Thompson sampling in the Gaussian linear bandit setup: R(T, πTS) ≤ d

  • T(σ2 + c2 Tr(K)) log(1 + T

d ). Within

  • log(T) of optimality compared with the lower bound
  • f Ω(

√ T) (Rusmevichientong and Tsitsiklis, 2010) Improves the state-of-the-art upper bound by an order of

  • log(T) for the case of an action set with infinitely many

elements (Previous bound: O(log(T) √ T) by Russo and Van Roy (2014)) Same T dependency as the bound given by Dong & Van Roy (2018) even though θ here has unbounded support unlike the

  • ne in 2018

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 8 / 13

slide-9
SLIDE 9

Cauchy–Schwarz Type Inequality

Proposition Let X1 and X2 be arbitrary i.i.d., Rm valued random variables and f1, f2 measurable maps such that f1, f2 : Rm → Rd with E[||f1(X1)||2

2], E[||f2(X1)||2 2] < ∞, then

| E[f1(X1)Tf2(X1)]| ≤

  • d E[(f1(X1)Tf2(X2))2].

Reduces to Cauchy-Schwarz inequality when d = 1 Similar statement when d > 1

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 9 / 13

slide-10
SLIDE 10

Single Step Regret

Lemma Let G > 0 such that G ≥ Tr(K), then E[Yu∗,1 − Yu1] ≤

  • d(σ2 + c2G) E[log(1 +

uT

1 Ku1

σ2 + c2G )]. I(θ; u1, Yu1) = I(θ; u1) + I(θ; Yu1|u1) = Eu∼u1[I(θ; Yu)] θ and Yu are jointly Gaussian random variables: Eu∼u1[I(θ; Yu)] = E[log(1 + uT

1 Ku1

σ2 )] Similar to the information ratio concept used by Russo and Van Roy (2016), Dong and Van Roy (2018) Instead of a discrete entropy term, maybe use the mutual information as is

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 10 / 13

slide-11
SLIDE 11

Proof of the Lemma

1 Use of the earlier proposition:

E[Yu∗,1 − Yu1] = E[(θ − µ)Tu∗] ≤

  • d E[((θ − µ)Tu1)2]

=

  • d E[uT

1 Ku1]

2 uT

1 Ku1 ≤ σ2 + c2 Tr(K) ≤ σ2 + c2G and x ≤ log(1 + x) for

any x ∈ [0, 1]: uT

1 Ku1 = (σ2+c2G) uT 1 Ku1

σ2 + c2G ≤ (σ2+c2G) log(1+ uT

1 Ku1

σ2 + c2G )

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 11 / 13

slide-12
SLIDE 12

An Overview of the Main Theorem’s Proof

1 Use the lemma:

E[Yu∗,t − Yut|Ht−1] ≤

  • d(σ2 + c2 Tr(K)) E[log(1 +

uT

t Kt−1ut

σ2 + c2 Tr(K))|Ht−1] Jensen’s Inequality  

  • E[Yu∗,t − Yut] ≤
  • d(σ2 + c2 Tr(K)) E[log(1 +

uT

t Kt−1ut

σ2 + c2 Tr(K))]

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 12 / 13

slide-13
SLIDE 13

An Overview of the Main Theorem’s Proof cont.

2 Overall bound on the Bayesian regret:

T

  • t=1

E[Yu∗,t − Yut] ≤

  • Td(σ2 + c2 Tr(K)) E[

T

  • t=1

log(1 + uT

t Kt−1ut

σ2 + c2 Tr(K))]

3 Show that T

t=1 log(1 + uT

t Kt−1ut

σ2+c2 Tr(K)) ≤ d log(1 + T d ):

1 + uT

t Kt−1ut

σ2 + c2 Tr(K) ≤ 1 + uT

t (K −1 + 1 σ2+c2 Tr(K)

t−1

i=1 uiuT i )−1ut

σ2 + c2 Tr(K) = det(K −1 +

1 σ2+c2 Tr(K)

t

i=1 uiuT i )

det(K −1 +

1 σ2+c2 Tr(K)

t−1

i=1 uiuT i )

An Improved Regret Bound for Thompson Sampling in the Gaussian Linear Bandit Setting 13 / 13