SLIDE 1

Toward Better Use of Data in Contextual and Linear Bandits

Nima Hamidi and Mohsen Bayati

Stanford University

October 2, 2020

References: arXiv 2002.05152 & arXiv 2006.06790

SLIDE 2

Overview

1. Motivation
2. Confidence-based Policies
3. Sieved-Greedy

SLIDE 3

How to Test New Medical Interventions?

A hospital wants to reduce post-discharge complications:

– Use one of two newly designed telehealth programs (A or B)

Must select one of A or B for each patient. An A/B test or Randomized Controlled Trial (RCT) has a high opportunity cost.

– In healthcare, experimentation is costly or unethical¹

[Image: telehealth programs A and B]

1Sibbald, Bonnie. 1998. Understanding controlled trials: Why are randomized controlled trials important?, British Medical Journal (Clinical Research Ed.) 316(201).

  • N. Hamidi, M. Bayati (Stanford University)

Better Use of Data in Linear Bandits October 2, 2020 3 / 53

slide-4
SLIDE 4

Beyond Healthcare

“Today, Microsoft and several other leading companies, including Amazon, Booking.com, Facebook, and Google, each conduct more than 10,000 online controlled experiments annually, with many tests engaging millions of users.”

Kohavi and Thomke, Harvard Business Review, 2017

SLIDE 6

Multi-armed Bandit Experiments

SLIDE 7

Example (Google Analytics)²

A/B testing

– Website configurations A and B with conversion rates 4% and 5%, respectively

Using Thompson sampling instead of A/B testing, the same experiment can be run with 78.5% less data → 97.5 conversions saved (on average)

²Source: Google Analytics Support Page

SLIDE 8

Stochastic Linear Bandit Problem

Let $\Theta_\star \in \mathbb{R}^d$ be fixed (and unknown). At time $t$, the action set $\mathcal{A}_t \subseteq \mathbb{R}^d$ is revealed to a policy $\pi$. The policy chooses $A_t \in \mathcal{A}_t$ and observes a reward $r_t = \langle \Theta_\star, A_t \rangle + \varepsilon_t$, where, conditional on the history, $\varepsilon_t$ has zero mean.
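To make the setup concrete, here is a minimal simulation sketch of this interaction (the class name, the Gaussian action draws, and the Gaussian noise model are illustrative assumptions, not part of the talk):

```python
import numpy as np

class LinearBanditEnv:
    """Stochastic linear bandit: r_t = <theta_star, a_t> + zero-mean noise."""

    def __init__(self, theta_star, noise_sd=1.0, seed=0):
        self.theta_star = np.asarray(theta_star, dtype=float)
        self.noise_sd = noise_sd
        self.rng = np.random.default_rng(seed)

    def action_set(self, k=10):
        # Draw k candidate actions; in general A_t may change every round.
        d = len(self.theta_star)
        return self.rng.normal(size=(k, d))

    def reward(self, action):
        # Noise is zero-mean conditional on the history (Gaussian here for simplicity).
        return float(action @ self.theta_star + self.rng.normal(0.0, self.noise_sd))
```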

SLIDE 10

Evaluation Metric

The objective is to improve using past experiences. The cumulative regret is defined as

$\mathrm{Regret}(T, \Theta_\star, \pi) := \mathbb{E}\!\left[\,\sum_{t=1}^{T}\Big(\sup_{A \in \mathcal{A}_t}\langle \Theta_\star, A\rangle - \langle \Theta_\star, A_t\rangle\Big)\,\Big|\, \Theta_\star\right].$

In the Bayesian setting, the Bayesian regret is given by

$\mathrm{BayesRegret}(T, \pi) := \mathbb{E}_{\Theta_\star \sim \mathcal{P}}\big[\mathrm{Regret}(T, \Theta_\star, \pi)\big].$

SLIDE 14

Special Cases

• Standard multi-armed bandit problem
• k-armed contextual bandit problem
• Dynamic pricing with demand covariates:

$\text{Expected Demand} = \alpha + \beta p + \langle \Gamma, X\rangle, \qquad \text{Expected Revenue} = \alpha p + \beta p^2 + \langle \Gamma, X\rangle p$

This can be mapped to a linear bandit by setting $\mathcal{A} = \{(p, p^2, pX) \mid p \in [p_{\min}, p_{\max}]\}$ and $\Theta_\star = (\alpha, \beta, \Gamma)$.
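A small sketch of this reduction (the price grid, its bounds, and the covariate values below are illustrative assumptions):

```python
import numpy as np

def pricing_action_set(x, p_min=1.0, p_max=10.0, n_prices=50):
    """Map a price grid to linear-bandit actions (p, p^2, p*x).

    Expected revenue is then <theta_star, a> with theta_star = (alpha, beta, gamma).
    """
    prices = np.linspace(p_min, p_max, n_prices)
    return np.stack([np.concatenate(([p, p * p], p * x)) for p in prices])

# Example: 3-dimensional demand covariates observed this round.
x_t = np.array([0.2, -1.0, 0.5])
actions = pricing_action_set(x_t)   # shape (50, 5); one row per candidate price
```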

SLIDE 16

Related Literature

UCB/OFUL: Auer, Cesa-Bianchi, and Fischer 2002; Dani, Hayes, and Kakade 2008; Rusmevichientong and Tsitsiklis 2010; Abbasi-Yadkori, Pál, and Szepesvári 2011
Thompson sampling: Agrawal and Goyal 2013; Russo and Van Roy 2014, 2016; Abeille and Lazaric 2017
ε-Greedy and variants: Langford and Zhang 2008; Goldenshluger and Zeevi 2013

Learning and earning in operations: Carvalho and Puterman '05, Araman and Caldentey '09, Besbes and Zeevi '09-'11, Harrison et al. '12, den Boer and Zwart '14-'16, Keskin and Zeevi '14-'16, Gur et al. '14, Johnson et al. '15, Chen et al. '15, Cohen et al. '16, Bayati and Bastani '15, Kallus and Udell '16, Javanmard and Nazerzadeh '16, Javanmard '17, Elmachtoub et al. '17, Ban and Keskin '17, Cheung et al. '18, Bastani et al. '19, and many more!

SLIDE 17

Algorithms

SLIDE 18

Greedy

At time t = 1, 2, …, T:
• Using the set of observations $\mathcal{H}_{t-1} := \{(A_1, r_1), \dots, (A_{t-1}, r_{t-1})\}$, construct an estimate $\hat\Theta_{t-1}$ of $\Theta_\star$.
• Choose the action $A \in \mathcal{A}_t$ with the largest $\langle A, \hat\Theta_{t-1}\rangle$.

[Diagram: estimate $\Theta_\star$ from the history $\mathcal{H}$ → greedy decision → observe reward → update history.]

SLIDE 19

Greedy

The ridge estimator is used to obtain $\hat\Theta_t$ (for a fixed $\lambda$):

$V_t := \lambda I + \sum_{i=1}^{t} A_i A_i^\top \in \mathbb{R}^{d\times d}, \qquad (1)$

and

$\hat\Theta_t := V_t^{-1} \sum_{i=1}^{t} A_i r_i \in \mathbb{R}^d. \qquad (2)$
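Eqs. (1)-(2) transcribe directly to NumPy; a sketch that recomputes the estimate from the full history (an incremental rank-one update would be cheaper in practice):

```python
import numpy as np

def ridge_estimate(actions, rewards, lam=1.0):
    """Compute V_t (Eq. 1) and the ridge estimate of theta_star (Eq. 2)."""
    A = np.asarray(actions)           # shape (t, d): one pulled action per row
    r = np.asarray(rewards)           # shape (t,)
    d = A.shape[1]
    V = lam * np.eye(d) + A.T @ A     # V_t = lambda*I + sum_i A_i A_i^T
    theta_hat = np.linalg.solve(V, A.T @ r)   # V_t^{-1} sum_i A_i r_i
    return V, theta_hat
```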

SLIDE 21

Greedy

Algorithm 1 Greedy algorithm
1: for t = 1 to T do
2:    Pull $A_t := \arg\max_{A \in \mathcal{A}_t} \langle A, \hat\Theta_{t-1}\rangle$
3:    Observe the reward $r_t$
4:    Compute $V_t = \lambda I + \sum_{i=1}^{t} A_i A_i^\top$
5:    Compute $\hat\Theta_t = V_t^{-1} \sum_{i=1}^{t} A_i r_i$
6: end for

Greedy makes wrong decisions due to over- or under-estimating the true rewards. Over-estimation is automatically corrected: an over-estimated arm keeps getting pulled, and the resulting data pull its estimate back down. Under-estimation, however, can persist (the arm is never pulled again) and can cause linear regret.
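For concreteness, a runnable sketch of the Greedy loop against the toy environment sketched earlier (horizon and regularization are illustrative):

```python
import numpy as np

def greedy(env, T=1000, lam=1.0):
    """Greedy linear bandit: always pull argmax <A, theta_hat>."""
    d = len(env.theta_star)
    V = lam * np.eye(d)                  # V_0 = lambda * I
    b = np.zeros(d)                      # running sum of A_i * r_i
    theta_hat = np.zeros(d)
    total_reward = 0.0
    for t in range(T):
        actions = env.action_set()       # A_t, one candidate action per row
        a = actions[np.argmax(actions @ theta_hat)]
        r = env.reward(a)
        V += np.outer(a, a)              # rank-one update of V_t
        b += a * r
        theta_hat = np.linalg.solve(V, b)
        total_reward += r
    return total_reward
```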

SLIDE 23

Greedy

[Figure: arms A1-A5 with confidence intervals; the arm chosen by Greedy is marked.]

SLIDE 26

Optimism in the Face of Uncertainty (OFU) Algorithm

Key idea: be optimistic when estimating the reward of actions. For $\rho > 0$, define the confidence set $\mathcal{C}_{t-1}(\rho)$ to be

$\mathcal{C}_{t-1}(\rho) := \{\Theta \mid \|\Theta - \hat\Theta_{t-1}\|_{V_{t-1}} \le \rho\}, \quad \text{where } \|X\|_{V_{t-1}}^2 = X^\top V_{t-1} X \in \mathbb{R}_+.$

Theorem (Informal; Abbasi-Yadkori, Pál, and Szepesvári 2011)

Letting $\rho := O(\sqrt{d})$, we have $\Theta_\star \in \mathcal{C}_{t-1}(\rho)$ with high probability.

SLIDE 28

Optimism in the Face of Uncertainty (OFU) Algorithm

Algorithm 2 OFUL algorithm
1: for t = 1 to T do
2:    Pull $A_t := \arg\max_{A \in \mathcal{A}_t} \sup_{\Theta \in \mathcal{C}_{t-1}(\rho)} \langle A, \Theta\rangle$
3:    Observe the reward $r_t$
4:    Compute $V_t = \lambda I + \sum_{i=1}^{t} A_i A_i^\top$
5:    Compute $\hat\Theta_t = V_t^{-1} \sum_{i=1}^{t} A_i r_i$
6: end for

It can be shown that

$\sup_{\Theta \in \mathcal{C}_{t-1}(\rho)} \langle A, \Theta\rangle = \langle A, \hat\Theta_{t-1}\rangle + \rho \|A\|_{V_{t-1}^{-1}}.$
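The closed form reduces OFUL's action selection to one line per arm; a sketch (here ρ is passed as a fixed constant, whereas the theorem above scales it like √d):

```python
import numpy as np

def oful_choose(actions, theta_hat, V, rho=1.0):
    """Pick the action maximizing <A, theta_hat> + rho * ||A||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    # Row-wise quadratic form a^T V^{-1} a gives the squared confidence width.
    widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    ucb = actions @ theta_hat + rho * widths
    return actions[np.argmax(ucb)]
```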

SLIDE 29

Optimism in the Face of Uncertainty (OFU) Algorithm

[Figure: arms A1-A5 with confidence intervals; OFUL picks the arm with the highest upper confidence bound.]

SLIDE 31

Linear Thompson Sampling (LinTS) Algorithm

Key idea: use randomization to address under-estimation. LinTS samples from the posterior distribution of $\Theta_\star$.

Algorithm 3 LinTS algorithm
1: for t = 1 to T do
2:    Sample $\tilde\Theta_{t-1} \sim \mathbb{P}(\Theta_\star \mid \mathcal{H}_{t-1})$
3:    Pull $A_t := \arg\max_{A \in \mathcal{A}_t} \langle A, \tilde\Theta_{t-1}\rangle$
4:    Observe the reward $r_t$
5:    Update $\mathcal{H}_t \leftarrow \mathcal{H}_{t-1} \cup \{(A_t, r_t)\}$
6: end for

SLIDE 32

Linear Thompson Sampling (LinTS) Algorithm

Under normality, LinTS becomes:

Algorithm 4 LinTS algorithm under normality
1: for t = 1 to T do
2:    Sample $\tilde\Theta_{t-1} \sim \mathcal{N}(\hat\Theta_{t-1}, V_{t-1}^{-1})$
3:    Pull $A_t := \arg\max_{A \in \mathcal{A}_t} \langle A, \tilde\Theta_{t-1}\rangle$
4:    Observe the reward $r_t$
5:    Compute $V_t = \lambda I + \sum_{i=1}^{t} A_i A_i^\top$
6:    Compute $\hat\Theta_t = V_t^{-1} \sum_{i=1}^{t} A_i r_i$
7: end for
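A sketch of the sampling step of Algorithm 4 (unit noise variance assumed; the inflated variant of a later slide corresponds to passing β > 1):

```python
import numpy as np

def lints_choose(actions, theta_hat, V, rng, beta=1.0):
    """Sample theta_tilde ~ N(theta_hat, beta^2 * V^{-1}) and act greedily on it."""
    cov = beta**2 * np.linalg.inv(V)
    theta_tilde = rng.multivariate_normal(theta_hat, cov)
    return actions[np.argmax(actions @ theta_tilde)]
```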
SLIDE 33

Linear Thompson Sampling (LinTS) Algorithm

[Figure: arms A1-A5 with confidence intervals; LinTS picks the maximizer under a posterior sample.]

SLIDE 34

Why Is LinTS Popular?

Empirical superiority:

$d = 120$, $\Theta_\star \sim \mathcal{N}(0, I_d)$, $k = 10$, $X \sim \mathcal{N}(0, I_{12})$; each $\mathcal{A}_t$ contains $X$ as a block.³

[Plot: cumulative regret vs. round (up to 10,000) for Greedy, OFUL, and TS.]

³This is the 10-armed contextual bandit with 12-dimensional covariates.

SLIDE 37

Comparison of Regret Bounds

Theorem (Abbasi-Yadkori, Pál, and Szepesvári 2011)

Under some conditions, the regret of OFUL is bounded by $\mathrm{Regret}(T, \Theta_\star, \pi_{\mathrm{OFUL}}) \le O(d\sqrt{T})$.

Theorem (Russo and Van Roy 2014)

Under minor assumptions, the Bayesian regret of LinTS is bounded by $\mathrm{BayesRegret}(T, \pi_{\mathrm{LinTS}}) \le O(d\sqrt{T})$.

Theorem (Dani, Hayes, and Kakade 2008)

There is a Bayesian linear bandit problem that satisfies $\inf_\pi \mathrm{BayesRegret}(T, \pi) \ge \Omega(d\sqrt{T})$.

SLIDE 38

A Worst-Case Regret Bound for LinTS

Question: can one prove a similar worst-case regret bound for LinTS? The only known results require inflating the posterior variance.

Algorithm 5 Inflated LinTS algorithm under normality
1: for t = 1 to T do
2:    Sample $\tilde\Theta_{t-1} \sim \mathcal{N}(\hat\Theta_{t-1}, \beta^2 V_{t-1}^{-1})$
3:    Pull $A_t := \arg\max_{A \in \mathcal{A}_t} \langle A, \tilde\Theta_{t-1}\rangle$
4:    Update $V_t$ and $\hat\Theta_t$
5: end for

SLIDE 39

Inflated Linear Thompson Sampling (LinTS) Algorithm

[Figure: arms A1-A5 with confidence intervals; inflated LinTS samples from a widened posterior.]

SLIDE 40

A Worst-Case Regret Bound for LinTS

Theorem (Agrawal and Goyal 2013; Abeille and Lazaric 2017)

If $\beta \propto \sqrt{d}$, then $\mathrm{Regret}(T, \Theta_\star, \pi_{\mathrm{LinTS}}) \le O(d\sqrt{dT})$.

This result is off from the optimal rate by a $\sqrt{d}$ factor.

SLIDE 41

Empirical Performance of Inflated LinTS

Unfortunately, the inflated variant of LinTS performs poorly in practice:

[Plot: cumulative regret vs. round (up to 10,000) for Greedy, OFUL, TS, and Freq-TS (the inflated variant).]

SLIDE 43

Bayesian Analyses are Brittle

We prove that the inflation is necessary for LinTS to work.

Theorem

There exists a linear bandit problem such that for $T \le \exp(\Omega(d))$, we have $\mathrm{BayesRegret}(T, \pi_{\mathrm{LinTS}}) = \Omega(T)$.

The counter-example satisfies the following properties:
• $\Theta_\star \sim \mathcal{N}(0, I_d)$,
• LinTS uses the right prior,
• LinTS assumes the noises are standard normal,
• $r_t = \langle \Theta_\star, A_t\rangle$ (i.e., noiseless data!).

SLIDE 44

Some Remarks on LinTS

• Under some assumptions on the action set, we can prove that LinTS with only a logarithmic inflation works.
• Bastani, Simchi-Levi, and Zhu (2019) showed that, in the dynamic-pricing case, when only the prior mean is unknown but the prior variance is known, the regret loss can be at most a constant.
• Jin, Xu, Shi, Xiao, and Gu (2020) showed that a variant of TS achieves the optimal regret bound in the standard k-armed bandit setting.
SLIDE 45

Improving OFUL

SLIDE 46

Weaker Optimism and Improving OFUL

Despite its name, OFUL is very pessimistic: it assumes that $\langle A, \Theta_\star\rangle$ is as small as the confidence set allows for all arms. In practice, this may not be the case, and one may be able to use the data more efficiently. This is challenging, since even choosing the second most optimistic action can lead to linear regret.

SLIDE 48

Optimism Baseline

Let $L_t(A)$ be the lower confidence bound for action $A$, defined as

$L_t(A) := \langle A, \hat\Theta_{t-1}\rangle - \rho \|A\|_{V_{t-1}^{-1}}.$

A simple observation:

Lemma

The following inequality holds with high probability:

$\sup_{A \in \mathcal{A}_t} \langle A, \Theta_\star\rangle \ \ge\ \sup_{A \in \mathcal{A}_t} L_t(A).$

Then, define the baseline as $B_t := \sup_{A \in \mathcal{A}_t} L_t(A)$.

SLIDE 51

Sieved-Greedy

Let $L_t(A)$ and $U_t(A)$ be the confidence bounds for action $A$ given by

$U_t(A) := \langle A, \hat\Theta_{t-1}\rangle + \rho \|A\|_{V_{t-1}^{-1}}, \qquad L_t(A) := \langle A, \hat\Theta_{t-1}\rangle - \rho \|A\|_{V_{t-1}^{-1}}.$

For a fixed sieving rate $\alpha \in (0, 1]$, define

$\mathcal{A}'_t := \Big\{ A \in \mathcal{A}_t : U_t(A) - B_t \ \ge\ \alpha \big( \sup_{A' \in \mathcal{A}_t} U_t(A') - B_t \big) \Big\}.$

Sieved-Greedy (SG) then chooses

$A_t \in \arg\max_{A \in \mathcal{A}'_t} \langle A, \hat\Theta_{t-1}\rangle.$
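A sketch of one SG round following the three displays above (ρ and α fixed as constants for illustration):

```python
import numpy as np

def sieved_greedy_choose(actions, theta_hat, V, rho=1.0, alpha=0.5):
    """Keep arms whose UCB clears a fraction alpha of the best optimistic gap,
    then act greedily among the survivors."""
    V_inv = np.linalg.inv(V)
    widths = rho * np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    means = actions @ theta_hat
    ucb, lcb = means + widths, means - widths
    baseline = lcb.max()                        # B_t = sup_A L_t(A)
    keep = (ucb - baseline) >= alpha * (ucb.max() - baseline)   # the sieve A'_t
    idx = np.flatnonzero(keep)                  # nonempty: the top-UCB arm always survives
    return actions[idx[np.argmax(means[idx])]]  # greedy among survivors
```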

SLIDE 58
Sieved-Greedy

[Figure: arms A1-A5 with confidence intervals, the baseline $B_t$, $\sup_{A \in \mathcal{A}_t} U_t(A)$, and the sieving threshold (0.5); the choices of OFUL, SG(0.5), and Greedy are marked.]

SLIDE 59

Sieved-Greedy

Two instances of Sieved-Greedy:
• For α = 1, Sieved-Greedy is equivalent to OFUL.
• For α = 0, Sieved-Greedy is the same as Greedy.

SLIDE 60

Sieved-Greedy

Theorem

For $\alpha > 0$, the regret of Sieved-Greedy is bounded by

$\mathrm{Regret}(T, \Theta_\star, \pi_{\mathrm{SG}(\alpha)}) \ \le\ O\!\left(\frac{d\sqrt{T}}{\alpha}\right).$
SLIDE 61

Simulations

[Plot: cumulative regret vs. round (up to 10,000) for Greedy, OFUL, TS, and SG(0.5).]

SLIDE 63

A General Regret Bound

By a worth function, we mean a function $\mathcal{M}_t$ that maps each $A \in \mathcal{A}_t$ to $\mathbb{R}$ such that

$|\mathcal{M}_t(A) - \langle A, \hat\Theta_{t-1}\rangle| \ \le\ \rho \|A\|_{V_{t-1}^{-1}}$

with probability at least $1 - \frac{1}{T^2}$.

Next, define Randomized OFUL (ROFUL) to be:

Algorithm 6 ROFUL algorithm
1: for t = 1 to T do
2:    Pull $A_t := \arg\max_{A \in \mathcal{A}_t} \mathcal{M}_t(A)$
3:    Observe the reward $r_t$
4:    Compute $V_t = \lambda I + \sum_{i=1}^{t} A_i A_i^\top$
5:    Compute $\hat\Theta_t = V_t^{-1} \sum_{i=1}^{t} A_i r_i$
6: end for
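ROFUL is a template: each policy above is recovered by plugging in its worth function. A sketch with the worth function passed as a callable (the function names and the toy environment are illustrative assumptions):

```python
import numpy as np

def roful(env, worth_fn, T=1000, lam=1.0):
    """Generic ROFUL loop; worth_fn(actions, theta_hat, V) returns per-arm scores."""
    d = len(env.theta_star)
    V, b = lam * np.eye(d), np.zeros(d)
    theta_hat = np.zeros(d)
    total_reward = 0.0
    for t in range(T):
        actions = env.action_set()
        a = actions[np.argmax(worth_fn(actions, theta_hat, V))]
        r = env.reward(a)
        V += np.outer(a, a)
        b += a * r
        theta_hat = np.linalg.solve(V, b)
        total_reward += r
    return total_reward

# Greedy as a worth function; OFUL- or LinTS-style scores plug in the same way.
greedy_worth = lambda actions, theta_hat, V: actions @ theta_hat
```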
SLIDE 65

A General Regret Bound

Examples of worth functions:
• Greedy: $\mathcal{M}_t(A) = \langle A, \hat\Theta_{t-1}\rangle$
• OFUL: $\mathcal{M}_t(A) = \langle A, \hat\Theta_{t-1}\rangle + \rho \|A\|_{V_{t-1}^{-1}}$
• LinTS: $\mathcal{M}_t(A) = \langle A, \tilde\Theta_{t-1}\rangle$
• Sieved-Greedy:

$\mathcal{M}_t(A) = \begin{cases} U_t(A) & \text{if } A = A_t, \\ L_t(A) & \text{otherwise.} \end{cases}$
SLIDE 67

Weaker Optimism and Regret Bound

Definition (Optimism in expectation)

We say a worth function $\mathcal{M}_t$ is optimistic in expectation (OIE) if

$\mathbb{E}\Big[\big(\sup_{A \in \mathcal{A}_t} \mathcal{M}_t(A) - B_t\big)^2\Big] \ \ge\ p\, \mathbb{E}\Big[\big(\sup_{A \in \mathcal{A}_t} \langle A, \Theta_\star\rangle - B_t\big)^2\Big].$

Theorem

For a sequence of OIE worth functions $(\mathcal{M}_t)_{t=1}^{T}$, we have

$\mathrm{Regret}(T, \pi_{\mathrm{ROFUL}}) \ \le\ O\!\left(\rho \sqrt{\frac{dT}{p}}\right).$
SLIDE 68

Conclusion

• Proved that LinTS without inflation can incur linear regret.
• Provided a general regret bound for confidence-based policies.
• Introduced a new algorithm that is less pessimistic than OFUL while enjoying similar regret bounds.

SLIDE 69

References I

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem". In: Machine Learning 47.2-3 (2002), pp. 235-256.

Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. "Stochastic Linear Optimization under Bandit Feedback". In: COLT. 2008.

John Langford and Tong Zhang. "The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information". In: Advances in Neural Information Processing Systems 20. Ed. by J. C. Platt et al. Curran Associates, Inc., 2008, pp. 817-824.

Paat Rusmevichientong and John N. Tsitsiklis. "Linearly parameterized bandits". In: Mathematics of Operations Research 35.2 (2010), pp. 395-411.

SLIDE 70

References II

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. "Improved algorithms for linear stochastic bandits". In: Advances in Neural Information Processing Systems. 2011, pp. 2312-2320.

Shipra Agrawal and Navin Goyal. "Thompson Sampling for Contextual Bandits with Linear Payoffs". In: ICML (3). 2013, pp. 127-135.

Alexander Goldenshluger and Assaf Zeevi. "A linear response bandit problem". In: Stochastic Systems 3.1 (2013), pp. 230-261.

Daniel Russo and Benjamin Van Roy. "Learning to Optimize via Posterior Sampling". In: Mathematics of Operations Research 39.4 (2014), pp. 1221-1243. doi: 10.1287/moor.2014.0650.

Daniel Russo and Benjamin Van Roy. "An information-theoretic analysis of Thompson sampling". In: The Journal of Machine Learning Research 17.1 (2016), pp. 2442-2471.

SLIDE 71

References III

Marc Abeille and Alessandro Lazaric. "Linear Thompson sampling revisited". In: Electronic Journal of Statistics 11.2 (2017), pp. 5165-5197.
SLIDE 72

Thank you!

Any questions?

SLIDE 74

A General Regret Bound

Definition (Strong optimism)

We say a worth function $\mathcal{M}_t$ is optimistic if

$\sup_{A \in \mathcal{A}_t} \mathcal{M}_t(A) \ \ge\ \sup_{A \in \mathcal{A}_t} \langle A, \Theta_\star\rangle \qquad (3)$

with probability at least $p$.

Theorem

Let $(\mathcal{M}_t)_{t=1}^{T}$ be a sequence of optimistic worth functions. Then the regret of ROFUL with this worth function is bounded by

$\mathrm{Regret}(T, \pi_{\mathrm{ROFUL}}) \ \le\ O\!\left(\rho \sqrt{\frac{dT}{p}}\right).$
SLIDE 78

A Sufficient Condition for Optimism

Recall that the worth function for LinTS is given by $\mathcal{M}_t(A) = \langle A, \tilde\Theta_t\rangle$. We can decompose it as

$\mathcal{M}_t(A) = \langle A, \tilde\Theta_t - \hat\Theta_{t-1}\rangle + \langle A, \hat\Theta_{t-1} - \Theta_\star\rangle + \langle A, \Theta_\star\rangle.$

Hence, we have

$\sup_{A \in \mathcal{A}_t} \mathcal{M}_t(A) - \sup_{A \in \mathcal{A}_t} \langle A, \Theta_\star\rangle \ \ge\ \mathcal{M}_t(A_t^\star) - \langle A_t^\star, \Theta_\star\rangle = \underbrace{\langle A_t^\star, \tilde\Theta_t - \hat\Theta_{t-1}\rangle}_{\text{compensation term}} + \underbrace{\langle A_t^\star, \hat\Theta_{t-1} - \Theta_\star\rangle}_{\text{error term}}.$

SLIDE 81

A Sufficient Condition for Optimism

Define:
• Error vector $E := \Theta_\star - \hat\Theta_{t-1}$
• Compensator vector $C := \tilde\Theta_t - \hat\Theta_{t-1}$

The optimism assumption holds if, with probability $p$,

$\langle A_t^\star, C\rangle \ \ge\ \langle A_t^\star, E\rangle.$

SLIDE 83

Omniscient Adversary and LinTS

An adversary chooses $\mathcal{A}_t$ at time $t$. The adversary is omniscient if he knows $\hat\Theta_{t-1}$ and $\Theta_\star$. He chooses $A = -c\,\hat\Theta_{t-1} + E$ so that $\langle A, \Theta_\star\rangle > 0$ and

$\langle A, \hat\Theta_{t-1}\rangle \ <\ -\frac{1}{2} \cdot \frac{\|A\|_{V_{t-1}^{-1}} \cdot \|E\|_{V_{t-1}}}{\sqrt{d}} \ \ll\ 0.$

[Figure: the vectors $\Theta_\star$, $\hat\Theta$, and $A^\star$ around the origin O.]

SLIDE 84

Omniscient Adversary and LinTS

The adversary sets $\mathcal{A}_t = \{0, A\}$. LinTS chooses $A$ if and only if

$\langle A, \tilde\Theta_t\rangle = \langle A, \tilde\Theta_t - \hat\Theta_{t-1}\rangle + \langle A, \hat\Theta_{t-1}\rangle \ >\ 0.$

This requires

$\langle A, C\rangle \ >\ \frac{1}{2} \cdot \frac{\|A\|_{V_{t-1}^{-1}} \cdot \|E\|_{V_{t-1}}}{\sqrt{d}}, \qquad \text{where } \langle A, C\rangle \sim \mathcal{N}\big(0, \|A\|_{V_{t-1}^{-1}}^2\big).$

[Figure: the vectors $\Theta_\star$, $\hat\Theta$, $\tilde\Theta$, and $A^\star$ around the origin O.]
SLIDE 85

Omniscient Adversary and LinTS

Next, we have $\mathbb{P}(\langle A, \tilde\Theta_t\rangle > 0) \le \exp(-\Omega(d))$! So LinTS chooses the optimal arm $A$ with probability exponentially small in $d$. When $A_t = 0$, the reward contains no new information about $\Theta_\star$, so the adversary can reveal the same action set in the subsequent rounds, and the regret grows linearly.

SLIDE 86

Bayesian Analyses are Brittle

The key point was the adversary's knowledge of $E$. This requirement can be relaxed by slightly modifying the noise distribution: reducing the noise variance reveals information about $E$.

SLIDE 87

Why is LinTS Popular?

Computational efficiency: when $\mathcal{A}_t$ is a polytope, LinTS solves an LP, while OFUL becomes an NP-hard problem!

Photo credit: Russo and Van Roy 2014
