An Asymptotically Optimal Bandit Algorithm for Bounded Support Models
Junya Honda and Akimichi Takemura (The University of Tokyo)
COLT 2010
Outline
- Introduction
- DMED policy
– Proof of the optimality
– Efficient computation
- Simulation results
- Conclusion
Multiarmed bandit problem
- Model of a gambler playing a slot machine with multiple arms
- A classic example of the dilemma between exploration and exploitation
- K-armed stochastic bandit problem
– Burnetas and Katehakis derived an asymptotic lower bound on the regret
- Model: reward distributions with support in [0,1]
– UCB policies by Auer et al. are widely used in practice
– No policy achieving the bound has been known
– We propose the DMED policy, which achieves the bound
Notation
- A: family of distributions with support in [0,1]
- F_i ∈ A: reward distribution of arm i, i = 1, …, K
- μ_i = E(F_i): expectation of arm i (E(F): expectation of a distribution F ∈ A)
- μ* = max_i μ_i: maximum expectation over the arms
- T_i(n): number of times arm i has been pulled through the first n rounds
- Goal: minimize the regret Σ_{i: μ_i < μ*} (μ* − μ_i) T_i(n) by keeping each T_i(n) small for every suboptimal arm i
Asymptotic bound (Burnetas and Katehakis, 1996)
- Under any policy satisfying a mild condition (consistency), for all F = (F_1, …, F_K) ∈ A^K and every suboptimal arm i,

  E_F[T_i(n)] ≥ (1/D_min(F_i, μ*) − o(1)) log n

  where D_min(F, μ) = min_{H ∈ A: E(H) ≥ μ} D(F‖H) and D(F‖H) = E_F[log (dF/dH)] is the Kullback–Leibler divergence
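For Bernoulli reward distributions the inner minimization over H can be solved in closed form: D_min(F, μ) reduces to the binary KL divergence d(p, μ) between the means, since an optimal H puts all its mass on {0, 1}. A minimal sketch of the resulting lower-bound coefficient, with example means of our own choosing (not from the talk):

```python
import math

def bin_kl(p, q):
    """Binary KL divergence d(p, q); equals D_min(Bernoulli(p), q) for q > p."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

# Hypothetical two-arm instance: mu* = 0.6 (best arm), mu2 = 0.5 (suboptimal).
mu_star, mu2 = 0.6, 0.5
coeff = 1.0 / bin_kl(mu2, mu_star)
# Any consistent policy must pull arm 2 at least about coeff * log(n) times.
```

The coefficient grows as the gap shrinks: distinguishing close means requires more exploration.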
Visualization of D_min(F, μ) = min_{H ∈ A: E(H) ≥ μ} D(F‖H)
- (Figure: the family A drawn as a large set, the constrained subset {H ∈ A : E(H) ≥ μ} cut off by the line E(H) = μ, and D_min(F, μ) as the divergence from F to the nearest point of that subset.)
DMED policy
- Deterministic Minimum Empirical Divergence policy
- μ̂*(n): maximum sample mean at the n-th round; F̂_i(n): empirical distribution of arm i at the n-th round
- In each loop, DMED chooses the arms to pull as follows:
– 1. For each arm i, check the condition T_i(n) D_min(F̂_i(n), μ̂*(n)) ≤ log n (the condition is always true for the currently best arm)
– 2. Pull all arms for which the condition is true
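The selection step above can be sketched for Bernoulli rewards, where D_min reduces to the binary KL divergence; the function and variable names below are ours, a minimal sketch rather than the authors' implementation:

```python
import math

def bin_kl(p, q):
    """Binary KL divergence d(p, q); D_min for Bernoulli empirical distributions."""
    eps = 1e-12  # clamp away from 0/1 to keep log() finite
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def dmed_pull_list(counts, means, n):
    """Arms to pull this loop: all i with T_i(n) * D_min(F_hat_i, mu_hat*) <= log n."""
    mu_star_hat = max(means)
    return [i for i, (t, m) in enumerate(zip(counts, means))
            if t * bin_kl(m, mu_star_hat) <= math.log(n)]
```

Note how the currently best empirical arm always passes the test (its divergence is zero), and a badly performing arm drops out once its pull count times its divergence exceeds log n.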
Main theorem
- Under the DMED policy, for every suboptimal arm i,

  E_F[T_i(n)] ≤ (1/D_min(F_i, μ*) + o(1)) log n

- Asymptotic bound: E_F[T_i(n)] ≥ (1/D_min(F_i, μ*) − o(1)) log n
- Hence DMED is asymptotically optimal
Intuitive interpretation (1)
- Assume K = 2 and consider the event μ̂_1(n) < μ̂_2(n) = μ̂*(n)
- How likely is it that arm 1 is actually the best?
- When T_1(n) is much smaller than T_2(n), the estimate μ̂_2 ≈ μ_2 is far more reliable than μ̂_1 ≈ μ_1
- So the question becomes: how likely is the hypothesis μ_1 ≥ μ̂_2?
Intuitive interpretation (2)
- By Sanov's theorem in large deviation theory, with T_1(n) samples from F_1,

  P[empirical distribution from F_1 comes close to F̂_1] ≈ exp(−T_1(n) D(F̂_1‖F_1))

- The maximum likelihood of the hypothesis μ_1 ≥ μ̂* is therefore

  max_{H ∈ A: E(H) ≥ μ̂*} exp(−T_1(n) D(F̂_1‖H)) = exp(−T_1(n) min_{H ∈ A: E(H) ≥ μ̂*} D(F̂_1‖H)) = exp(−T_1(n) D_min(F̂_1, μ̂*))

- In the DMED policy, arm i is pulled when exp(−T_i(n) D_min(F̂_i, μ̂*)) ≥ 1/n, i.e. when T_i(n) D_min(F̂_i, μ̂*) ≤ log n
- Thus arm i is pulled if the maximum likelihood that arm i is actually the best is large, or the round number n is large
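The exponential decay rate invoked above can be checked numerically in the Bernoulli case, where the Sanov rate for the event {sample mean ≥ q} is the binary KL divergence d(q, p). The following is our own illustration (the values of p, q, t are not from the talk), comparing the exact binomial tail with the Chernoff–Hoeffding bound exp(−t d(q, p)):

```python
import math

def bin_kl(q, p):
    """Binary KL divergence d(q, p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def tail(t, p, q):
    """Exact P[sample mean of t Bernoulli(p) draws >= q]."""
    k0 = math.ceil(q * t)
    return sum(math.comb(t, k) * p**k * (1 - p)**(t - k) for k in range(k0, t + 1))

p, q = 0.3, 0.5   # true mean p, hypothesized threshold q > p
for t in (10, 50, 200):
    exact = tail(t, p, q)
    chernoff = math.exp(-t * bin_kl(q, p))
    # exact <= chernoff for every t, and -log(exact)/t approaches d(q, p) as t grows
```

So the likelihood that a poorly sampled arm is secretly good decays exponentially in its pull count, at exactly the rate D_min measures.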
Proof of the optimality
- Assume K = 2 and μ_2 < μ_1 = μ* (arm 1 is the best); let J_n denote the arm pulled at the n-th round
- Two events are essential for the proof:
– A_n: the estimators F̂_i(n), μ̂_i(n) are already close to F_i, μ_i
– B_n: μ̂_2(n) ≈ μ_2, but μ̂_1(n) < μ_2 (< μ_1) (arm 1 seems inferior)
- Decompose the number of pulls of arm 2:

  T_2(N) = Σ_{n=1}^N ( I[{J_n = 2} ∩ A_n] + I[{J_n = 2} ∩ B_n] + I[{J_n = 2} ∩ A_n^c ∩ B_n^c] )

- These three terms are bounded by (log N)/D_min(F_2, μ*), O(1), and O(1), respectively
After the convergence
- Arm 2 is pulled when T_2(n) D_min(F̂_2(n), μ̂*(n)) ≤ log n
- On the event A_n, D_min(F̂_2(n), μ̂*(n)) ≈ D_min(F_2, μ*) holds because D_min(F, μ) is continuous
- Hence, while A_n is true, arm 2 is pulled only while T_2(n) ≲ (log n)/D_min(F_2, μ*), so

  Σ_{n=1}^N I[{J_n = 2} ∩ A_n] ≲ (log N)/D_min(F_2, μ*)
Before the convergence (1)
- B_n: μ̂_2 ≈ μ_2 and μ̂_1 < μ_2 (< μ_1)
- We will show E[Σ_{n=1}^N I[{J_n = 2} ∩ B_n]] = O(1)
- Focus on the empirical distribution F̂_1(n) on the event B_n
- A is compact (w.r.t. the Lévy distance), so it can be covered by finitely many balls; taking the summation over these finitely many balls, it is sufficient to show, for an arbitrary G ∈ A such that E(G) ≤ μ_2,

  E[Σ_{n=1}^N I[B_n ∩ {F̂_1(n) ∈ G}]] = O(1)

  (here G denotes, by slight abuse of notation, a small ball with center G)
Before the convergence (2)
- B_n: μ̂_2 ≈ μ_2 and μ̂_1 < μ_2 (< μ_1)
- To show E[Σ_{n=1}^N I[B_n ∩ {F̂_1(n) ∈ G}]] = O(1), decompose according to the number of pulls T_1(n) = t:

  E[Σ_{n=1}^N I[B_n ∩ {F̂_1(n) ∈ G}]] ≤ Σ_{t=1}^∞ E[Σ_{n=1}^N I[B_n ∩ {F̂_1(n) ∈ G} ∩ {T_1(n) = t}]]
Before the convergence (3)
- We will show Σ_{t=1}^∞ E[Σ_{n=1}^N I[B_n ∩ {F̂_1(n) ∈ G} ∩ {T_1(n) = t}]] = O(1)
- Each summand is bounded by (probability) × (maximum count):

  E[Σ_{n=1}^N I[B_n ∩ {F̂_1(n) ∈ G} ∩ {T_1(n) = t}]]
  ≤ P_{F_1}[{F̂_1(n) ∈ G} ∩ {T_1(n) = t}] × max Σ_{n=1}^N I[B_n ∩ {F̂_1(n) ∈ G} ∩ {T_1(n) = t}]
  ≤ exp(−t (D_min(G, μ_1) − D_min(G, μ_2)))

- The probability is about exp(−t D(G‖F_1)) ≤ exp(−t D_min(G, μ_1)) by Sanov's theorem (F_1 is feasible in the minimization since E(F_1) = μ_1), while the count is at most about exp(t D_min(G, μ_2)): on B_n with F̂_1(n) ∈ G, D_min(F̂_1(n), μ̂*(n)) ≈ D_min(G, μ_2), so arm 1 is pulled again once log n exceeds t D_min(G, μ_2)
Before the convergence (4)
- Since E(G) ≤ μ_2 < μ_1, the constant C = D_min(G, μ_1) − D_min(G, μ_2) is strictly positive, and

  E[Σ_{n=1}^N I[B_n ∩ {F̂_1(n) ∈ G} ∩ {T_1(n) = t}]] ≤ exp(−t C)

- By taking the summation over t,

  E[Σ_{n=1}^N I[B_n ∩ {F̂_1(n) ∈ G}]] ≤ Σ_{t=1}^∞ exp(−t C) = 1/(e^C − 1) = O(1)
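The positivity of C can be sanity-checked in the Bernoulli case, where D_min(G, μ) reduces to the binary KL divergence d(g, μ), which is strictly increasing in μ beyond g. The values of g, mu1, mu2 below are our own hypothetical instance:

```python
import math

def bin_kl(p, q):
    """Binary KL divergence d(p, q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Hypothetical instance with E(G) = g <= mu2 < mu1:
g, mu2, mu1 = 0.4, 0.5, 0.7
C = bin_kl(g, mu1) - bin_kl(g, mu2)   # strictly positive since d(g, .) increases past g
bound = 1.0 / (math.exp(C) - 1.0)     # geometric-series bound on the O(1) term
```

A larger gap between μ_1 and μ_2 gives a larger C, hence a smaller constant contribution from the "before convergence" phase.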
Computation of D_min(F̂_i(n), μ̂*(n))
- D_min(F̂_i(n), μ̂*(n)) has to be computed at each round, where D_min(F, μ) ≡ min_{H ∈ A: E(H) ≥ μ} D(F‖H)
- D_min is represented as a univariate convex optimization problem over a scalar dual variable ν:

  D_min(F, μ) = max_{0 ≤ ν ≤ 1/(1−μ)} E_F[log(1 − (X − μ)ν)]

– efficiently computable by e.g. Newton's method
– the optimal solution ν*_{n−1} for the (n−1)-st round is a good approximation of the current ν*_n (warm start)
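A sketch of this computation under the dual representation D_min(F, μ) = max_{0 ≤ ν ≤ 1/(1−μ)} E_F[log(1 − (X − μ)ν)]. The clipping to the feasible interval, the iteration cap, and the function name are our implementation choices; nu0 plays the role of the warm start ν*_{n−1}:

```python
import math

def dmin_dual(samples, mu, nu0=0.0, iters=50):
    """Approximate D_min(F_hat, mu) for the empirical distribution of `samples`
    by maximizing the concave dual f(nu) = mean(log(1 - (x - mu) * nu))
    with Newton steps clipped to [0, 1/(1 - mu)). Returns (D_min, nu*)."""
    g = [x - mu for x in samples]
    nu_max = (1.0 - 1e-9) / (1.0 - mu)   # stay strictly inside the domain
    nu = min(max(nu0, 0.0), nu_max)      # warm start, e.g. last round's nu*
    t = len(g)
    for _ in range(iters):
        r = [gi / (1.0 - gi * nu) for gi in g]
        grad = -sum(r) / t               # f'(nu)
        hess = -sum(ri * ri for ri in r) / t   # f''(nu) <= 0: f is concave
        if abs(hess) < 1e-14:
            break
        nu = min(max(nu - grad / hess, 0.0), nu_max)
    return sum(math.log(1.0 - gi * nu) for gi in g) / t, nu
```

Passing the previous round's ν* as nu0 typically makes Newton converge in a step or two, since adding one sample barely moves the empirical distribution.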
Simulation 1
- K = 5 arms F_1, …, F_5; beta distributions and other simple distributions on [0,1]

Simulation result 1
- The asymptotic slope of the regret is always larger than or equal to that of the asymptotic bound
- DMED seems to achieve the asymptotic bound
- (Figure: regret vs. number of plays on a log scale up to 100000 plays, for DMED, UCB-tuned, UCB2, and the asymptotic bound.)
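The deck's experiments are not reproducible from the slides alone; the following is our own minimal experiment in the same spirit, running DMED on Bernoulli arms (for which D_min is the binary KL divergence). The arm means, horizon, and seed are our choices, not the talk's:

```python
import math
import random

def bin_kl(p, q):
    """Binary KL divergence; equals D_min for Bernoulli empirical distributions."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def run_dmed(mus, horizon, seed=0):
    """Run DMED on Bernoulli arms; return the pull count of each arm."""
    rng = random.Random(seed)
    k = len(mus)
    counts, sums = [0] * k, [0.0] * k
    n = 0
    pull_list = list(range(k))            # initialization: pull every arm once
    while n < horizon:
        for i in pull_list:
            if n >= horizon:
                break
            counts[i] += 1
            sums[i] += 1.0 if rng.random() < mus[i] else 0.0
            n += 1
        means = [s / c for s, c in zip(sums, counts)]
        mu_star = max(means)
        # keep every arm whose hypothesis "actually the best" is still plausible
        pull_list = [i for i in range(k)
                     if counts[i] * bin_kl(means[i], mu_star) <= math.log(n)]
    return counts

mus = [0.6, 0.5, 0.4]
counts = run_dmed(mus, 10000)
regret = sum((max(mus) - m) * c for m, c in zip(mus, counts))
```

With these means, the suboptimal arms should receive only on the order of (log n)/D_min pulls each, so nearly all of the budget goes to the best arm.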
Simulation 2
- An example where the best arm is hard to distinguish (arm 2 seems to be the best with high probability early on)

Simulation result 2
- DMED distinguishes the best arm quickly
- (Figure: regret vs. number of plays for DMED, UCB-tuned, UCB2, and the asymptotic bound.)
Conclusion
- Proposed the DMED policy and proved its asymptotic optimality.
- Showed that the minimization of the KL divergence is efficiently solvable by a convex optimization technique.
- Confirmed by simulations that DMED achieves the asymptotic bound on the regret.