An Asymptotically Optimal Bandit Algorithm for Bounded Support Models - PowerPoint PPT Presentation



SLIDE 1

An Asymptotically Optimal Bandit Algorithm for Bounded Support Models

Junya Honda and Akimichi Takemura The University of Tokyo

COLT 2010

SLIDE 2

Outline

  • Introduction
  • DMED policy

– Proof of the optimality
– Efficient computation

  • Simulation results
  • Conclusion
SLIDE 3

Outline

  • Introduction
  • DMED policy

– Proof of the optimality
– Efficient computation

  • Simulation results
  • Conclusion
SLIDE 4

Multiarmed bandit problem

  • Model of a gambler playing a slot machine with multiple arms
  • Example of a dilemma between exploration and exploitation
  • K-armed stochastic bandit problem
– Burnetas and Katehakis derived an asymptotic bound of the regret
  • Model of reward distributions with support in [0,1]
– UCB policies by Auer et al. are widely used in practice
– No policy achieving the bound has been known
– We propose the DMED policy, which achieves the bound

SLIDE 5

Notation

  • A: family of distributions with support in [0,1]
  • Fi ∈ A: probability distribution of arm i = 1, · · · , K
  • µi: expectation of arm i (E(F): expectation of distribution F ∈ A)
  • µ∗: maximum expectation of the arms
  • Ti(n): number of times that arm i has been pulled through the first n rounds

Goal: minimize the regret

  Σ_{i: µi<µ∗} (µ∗ − µi) Ti(n)

by reducing each Ti(n) for suboptimal arms i.
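The regret above is a weighted sum of pull counts; a minimal sketch (the function and argument names are illustrative, not from the slides):

```python
def regret(mus, pulls):
    """Regret after n rounds: sum over suboptimal arms i of (mu* - mu_i) * T_i(n)."""
    mu_star = max(mus)
    return sum((mu_star - mu) * t for mu, t in zip(mus, pulls))

# three arms with means 0.9, 0.5, 0.3 pulled 80, 15, 5 times
r = regret([0.9, 0.5, 0.3], [80, 15, 5])  # ≈ 0.4*15 + 0.6*5 = 9.0
```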

SLIDE 6

Asymptotic bound

Burnetas and Katehakis (1996)

  • Under any policy satisfying a mild condition (consistency),

  EF[Ti(n)] ≥ ( 1 / Dmin(Fi, µ∗) − o(1) ) log n

for all F = (F1, · · · , FK) ∈ A^K and all suboptimal arms i, where

  Dmin(F, µ) = min_{H∈A: E(H)≥µ} D(F||H)

and D(F||H) = EF[ log (dF/dH) ] is the Kullback-Leibler divergence.

SLIDE 7

Visualization of Dmin(F, µ) = min_{H∈A: E(H)≥µ} D(F||H)

[Figure: the set A of distributions, the subset {H ∈ A : E(H) ≥ µ} bounded by the level set E(H) = µ, and the point F; Dmin(F, µ) is the divergence from F to the nearest H with E(H) ≥ µ.]

SLIDE 8

Outline

  • Introduction
  • DMED policy

– Proof of the optimality
– Efficient computation

  • Simulation results
  • Conclusion
SLIDE 9

DMED policy

  • Deterministic Minimum Empirical Divergence policy

For each loop, DMED chooses the arms to pull as follows:

  • 1. For each arm i, check the condition

  Ti(n) Dmin( ˆFi(n), ˆµ∗(n)) ≤ log n

  ( ˆµ∗(n): the maximum sample mean at the n-th round; ˆFi(n): the empirical distribution of arm i at the n-th round)

  (The condition is always true for the currently best arm)

  • 2. Pull all of the arms for which the condition is true
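For Bernoulli rewards, Dmin( ˆFi, µ) reduces to the binary KL divergence kl( ˆµi, µ), so the loop above can be sketched in a few lines (a minimal illustration under that Bernoulli assumption; `dmed_bernoulli` and its parameters are hypothetical names, not the authors' code):

```python
import math
import random

def kl(p, q):
    """Binary KL divergence; equals Dmin for Bernoulli empirical distributions."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def dmed_bernoulli(means, horizon, seed=0):
    """Each loop pulls every listed arm, then relists the arms whose
    condition T_i(n) * Dmin(Fhat_i, mu*_hat) <= log n holds."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    n = 0
    current = list(range(K))  # the first loop pulls every arm once
    while n < horizon:
        for i in current:
            if n >= horizon:
                break
            reward = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            sums[i] += reward
            n += 1
        mu_hat = [sums[i] / counts[i] for i in range(K)]
        mu_star = max(mu_hat)
        # DMED condition; always true for the empirically best arm (kl = 0)
        current = [i for i in range(K)
                   if counts[i] * kl(mu_hat[i], mu_star) <= math.log(n)]
    return counts

counts = dmed_bernoulli([0.9, 0.5], horizon=2000)
```

With means 0.9 and 0.5, the suboptimal arm should be pulled only on the order of log n times, so almost all of the 2000 pulls go to the better arm.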

SLIDE 10

Main theorem

Under the DMED policy, for every suboptimal arm i,

  EF[Ti(n)] ≤ ( 1 / Dmin(Fi, µ∗) + o(1) ) log n

Asymptotic bound:

  EF[Ti(n)] ≥ ( 1 / Dmin(Fi, µ∗) − o(1) ) log n

⇒ DMED is asymptotically optimal.

SLIDE 11

Intuitive interpretation (1)

  • Assume K = 2 and consider the event ˆµ1(n) < ˆµ2(n) = ˆµ∗(n)
  • How likely is arm 1 actually the best?
  • Since T2(n) ≫ T1(n) under this event, µ2 ≈ ˆµ2 is far more likely than µ1 ≈ ˆµ1
  • How likely is the hypothesis µ1 ≥ ˆµ2 (= ˆµ∗)?

SLIDE 12

Intuitive interpretation (2)

  • By Sanov’s theorem in large deviation theory,

  P[empirical distribution of T1(n) samples from F1 comes close to ˆF1] ≈ exp(−T1(n) D( ˆF1||F1))
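This exponential rate can be checked numerically in the Bernoulli case, where the Chernoff-Sanov bound reads P[ˆµ ≥ q] ≤ exp(−t · kl(q, p)) for t samples from Bernoulli(p) (a standalone illustration; the constants below are arbitrary choices, not from the slides):

```python
import math
import random

def kl(p, q):
    """Binary KL divergence kl(p, q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# For t samples from Bernoulli(p), the chance that the empirical mean
# reaches q > p decays like exp(-t * kl(q, p)).
p, q, t, trials = 0.3, 0.6, 20, 100_000
rng = random.Random(1)
hits = sum(
    sum(rng.random() < p for _ in range(t)) / t >= q
    for _ in range(trials)
)
mc = hits / trials               # Monte Carlo estimate of the tail probability
bound = math.exp(-t * kl(q, p))  # large-deviation bound, about 0.021
```

The Monte Carlo estimate stays below the bound, as it must; the bound is tight only at the exponential scale.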

SLIDE 13
Intuitive interpretation (2)

  • By Sanov’s theorem in large deviation theory,

  P[empirical distribution of T1(n) samples from F1 comes close to ˆF1] ≈ exp(−T1(n) D( ˆF1||F1))

  • Maximum likelihood of the hypothesis µ1 ≥ ˆµ∗ is

  max_{H∈A: E(H)≥ˆµ∗} exp(−T1(n) D( ˆF1||H))
    = exp( −T1(n) min_{H∈A: E(H)≥ˆµ∗} D( ˆF1||H) )
    = exp( −T1(n) Dmin( ˆF1, ˆµ∗) )

SLIDE 14
Intuitive interpretation (3)

  • Maximum likelihood that arm i is actually the best: exp(−Ti(n) Dmin( ˆFi, ˆµ∗))
  • In the DMED policy, arm i is pulled when this likelihood is at least 1/n:

  exp(−Ti(n) Dmin( ˆFi, ˆµ∗)) ≥ 1/n  ⟺  Ti(n) Dmin( ˆFi, ˆµ∗) ≤ log n

  • Arm i is pulled if
    • the maximum likelihood is large, or
    • the round number n is large

SLIDE 15

Outline

  • Introduction
  • DMED policy

– Proof of the optimality
– Efficient computation

  • Simulation results
  • Conclusion
SLIDE 16

Proof of the optimality

  • Assume K = 2 and µ2 < µ1 = µ∗ (arm 1 is the best)
  • Jn: the arm pulled at the n-th round; {Jn = 2}: “arm 2 is pulled at the n-th round”
  • Two events are essential for the proof:
    An: the estimators ˆFi(n), ˆµi(n) are already close to Fi, µi
    Bn: ˆµ2(n) ≈ µ2, but ˆµ1(n) < µ2 (< µ1) (arm 1 seems inferior)
  • Decomposition:

  T2(N) = Σ_{n=1}^N ( I[{Jn = 2} ∩ An] + I[{Jn = 2} ∩ Bn] + I[{Jn = 2} ∩ An^c ∩ Bn^c] )

SLIDE 17

Proof of the optimality

  • Assume K = 2 and µ2 < µ1 = µ∗ (arm 1 is the best)
  • Two events are essential for the proof:
    An: the estimators ˆFi(n), ˆµi(n) are already close to Fi, µi
    Bn: ˆµ2(n) ≈ µ2, but ˆµ1(n) < µ2 (< µ1) (arm 1 seems inferior)
  • Each term of the decomposition is bounded:

  T2(N) = Σ_{n=1}^N ( I[{Jn = 2} ∩ An] + I[{Jn = 2} ∩ Bn] + I[{Jn = 2} ∩ An^c ∩ Bn^c] )
        ≈ log N / Dmin(F2, µ1) + O(1) + O(1)

SLIDE 18

After the convergence

  • Arm 2 is pulled when T2(n) Dmin( ˆF2(n), ˆµ∗(n)) ≤ log n
  • On the event An, Dmin( ˆF2(n), ˆµ∗(n)) ≈ Dmin(F2, µ∗) holds because Dmin(F, µ) is continuous
  • If An is true, arm 2 is pulled only while T2(n) ≈ log n / Dmin(F2, µ∗) or less, so

  Σ_{n=1}^N I[{Jn = 2} ∩ An] ≈ log N / Dmin(F2, µ∗)

SLIDE 19

Before the convergence (1)

  • Bn: ˆµ2 ≈ µ2 and ˆµ1 < µ2 (< µ1)
  • We will show

  E[ Σ_{n=1}^N I[{Jn = 2} ∩ Bn] ] ≤ E[ Σ_{n=1}^N I[Bn] ] = O(1)

[Figure: F1 and the level set {H ∈ A : E(H) = µ2} in A.]

SLIDE 20

Before the convergence (1)

  • Bn: ˆµ2 ≈ µ2 and ˆµ1 < µ2 (< µ1)
  • We will show E[ Σ_{n=1}^N I[Bn] ] = O(1)
  • Focus on ˆF1(n) of the event Bn
  • A is compact (w.r.t. the Lévy distance)

[Figure: F1 and the level set {H ∈ A : E(H) = µ2} in A.]
SLIDE 22

Before the convergence (1)

  • Bn: ˆµ2 ≈ µ2 and ˆµ1 < µ2 (< µ1)
  • We will show E[ Σ_{n=1}^N I[Bn] ] = O(1)
  • Focus on ˆF1(n) of the event Bn
  • A is compact (w.r.t. the Lévy distance)
  • It is sufficient to show

  E[ Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢}] ] = O(1)

  for an arbitrary ball 𝒢 with center G ∈ A s.t. E(G) ≤ µ2

SLIDE 23

Before the convergence (1)

  • Bn: ˆµ2 ≈ µ2 and ˆµ1 < µ2 (< µ1)
  • We will show E[ Σ_{n=1}^N I[Bn] ] = O(1)
  • Focus on ˆF1(n) of the event Bn
  • A is compact (w.r.t. the Lévy distance)
  • It is sufficient to show

  E[ Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢}] ] = O(1)

  for an arbitrary ball 𝒢 with center G ∈ A s.t. E(G) ≤ µ2

  • Take the summation over the finitely many balls covering the compact set A

SLIDE 24

Before the convergence (2)

  • Bn: ˆµ2 ≈ µ2 and ˆµ1 < µ2 (< µ1)
  • We will show E[ Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢}] ] = O(1)
  • Decompose by the number of pulls T1(n):

  E[ Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢}] ] = Σ_{t=1}^N E[ Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢} ∩ {T1(n) = t}] ]

SLIDE 25

Before the convergence (3)

  • We will show

  Σ_{t=1}^N E[ Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢} ∩ {T1(n) = t}] ] = O(1)

  • Each term is bounded as

  E[ Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢} ∩ {T1(n) = t}] ]
    ≤ PF1[{ ˆF1(n) ∈ 𝒢} ∩ {T1(n) = t}] × max Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢} ∩ {T1(n) = t}]
    ≤ exp( −t (Dmin(G, µ1) − Dmin(G, µ2)) )

SLIDE 26

Before the convergence (4)

[Figure: F1, G, and the level sets E(H) = µ1, E(H) = µ2 in A, illustrating Dmin(G, µ1) and Dmin(G, µ2).]

  E[ Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢} ∩ {T1(n) = t}] ] ≤ exp( −t (Dmin(G, µ1) − Dmin(G, µ2)) )

SLIDE 27

Before the convergence (4)

[Figure: F1, G, and the level sets E(H) = µ1, E(H) = µ2 in A, illustrating Dmin(G, µ1) and Dmin(G, µ2).]

  • Since E(G) ≤ µ2 < µ1, there is a constant C > 0 with Dmin(G, µ1) − Dmin(G, µ2) ≥ C, so

  E[ Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢} ∩ {T1(n) = t}] ] ≤ exp( −t (Dmin(G, µ1) − Dmin(G, µ2)) ) ≤ exp(−t C)

SLIDE 28
Before the convergence (4)

  • By taking the summation over t,

  Σ_{t=1}^N E[ Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢} ∩ {T1(n) = t}] ] ≤ Σ_{t=1}^∞ exp(−t C) = 1/(e^C − 1) < ∞

  ⇒ E[ Σ_{n=1}^N I[Bn ∩ { ˆF1(n) ∈ 𝒢}] ] = O(1)

SLIDE 29

Outline

  • Introduction
  • DMED policy

– Proof of the optimality
– Efficient computation

  • Simulation results
  • Conclusion
SLIDE 30

Computation of Dmin

  • Dmin( ˆFi(n), ˆµ∗(n)) has to be computed at each round
  • Dmin(F, µ) ≡ min_{H∈A: E(H)≥µ} D(F||H) is represented as

  Dmin(F, µ) = max_{0≤ν≤1/(1−µ)} EF[ log(1 − (X − µ)ν) ]

– a univariate convex optimization problem
– efficiently computable by e.g. Newton’s method
– ν∗_{n−1}, the optimal solution for the (n−1)-st round, is a good approximation of the current ν∗_n
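The univariate representation can be taken as Dmin(F, µ) = max_{0≤ν≤1/(1−µ)} EF[log(1 − (X − µ)ν)], the scalar dual stated in the paper; a minimal sketch using bisection on the derivative of this concave objective instead of Newton's method (`dmin` is an illustrative name, not the authors' implementation):

```python
import math

def dmin(samples, mu, iters=60):
    """Dmin(Fhat, mu) for an empirical distribution of samples in [0, 1]:
    maximize the concave objective mean(log(1 - (x - mu)*nu)) over
    nu in [0, 1/(1 - mu)] by bisecting on its decreasing derivative."""
    t = len(samples)
    if mu <= sum(samples) / t:
        return 0.0  # Fhat already satisfies the constraint E(H) >= mu
    lo, hi = 0.0, 1.0 / (1.0 - mu) - 1e-9  # feasible range of nu

    def deriv(nu):
        return sum((mu - x) / (1.0 - (x - mu) * nu) for x in samples) / t

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if deriv(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    nu = 0.5 * (lo + hi)
    return sum(math.log(1.0 - (x - mu) * nu) for x in samples) / t

# Sanity check: for 0/1 samples, Dmin reduces to the binary KL divergence,
# so dmin over 3 ones and 7 zeros at mu = 0.5 should be close to kl(0.3, 0.5)
val = dmin([1.0] * 3 + [0.0] * 7, 0.5)
```

The warm start mentioned on the slide corresponds to replacing the fixed bracket [0, 1/(1−µ)] with a small interval around the previous round's ν∗.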

SLIDE 31

Outline

  • Introduction
  • DMED policy

– Proof of the optimality
– Efficient computation

  • Simulation results
  • Conclusion
SLIDE 32

Simulation 1

  • K = 5 arms with beta distributions and other simple distributions on [0,1] as the reward models F1, · · · , F5

SLIDE 33

Simulation result 1

  • The asymptotic slope of the regret is always larger than or equal to that of the “Asymptotic bound”
  • DMED seems to be achieving the asymptotic bound

[Figure: regret vs. number of plays (log scale, 1 to 100000) for UCB2, UCB-tuned, DMED, and the asymptotic bound.]

SLIDE 34

Simulation 2

  • An example where the best arm is hard to distinguish (arm 2 seems to be the best with high probability)

SLIDE 35

Simulation result 2

  • DMED distinguishes the best arm quickly

[Figure: regret vs. number of plays (log scale, 1 to 100000) for UCB2, UCB-tuned, DMED, and the asymptotic bound.]

SLIDE 36

Outline

  • Introduction
  • DMED policy

– Proof of the optimality
– Efficient computation

  • Simulation results
  • Conclusion
SLIDE 37

Conclusion

  • Proposed the DMED policy and proved its asymptotic optimality.
  • Showed that the minimization of the KL divergence is solvable efficiently by a convex optimization technique.
  • Confirmed by simulations that DMED achieves regret near the asymptotic bound in finite time.

Thank you!