SLIDE 1

Introduction to Bandits

Rémi Munos

SequeL project: Sequential Learning, INRIA Lille - Nord Europe, http://researchers.lille.inria.fr/~munos/

ThRaSH’2012, Lille, May 2nd, 2012

SLIDE 2

Introduction

Multi-armed bandit: a simple mathematical model for decision-making under uncertainty. It illustrates the exploration-exploitation tradeoff that appears in any optimization problem where information is missing. Applications:

  • Clinical trials (Thompson, 1933)
  • Ads placement on webpages
  • Computation of Nash equilibria (traffic or communication networks, agent simulation, poker, ...)

  • Game-playing computers (Go, Urban Rivals, ...)
  • Packet routing, itinerary selection, ...
  • Stochastic optimization under finite numerical budget, ...
SLIDE 3

A few references on bandits (2005-2011)

[Abbasi-Yadkori, 2009] [Abernethy, Hazan, Rakhlin, 2008] [Abernethy, Bartlett, Rakhlin, Tewari, 2008] [Abernethy, Agarwal, Bartlett, Rakhlin, 2009] [Audibert, Bubeck, 2010] [Audibert, Munos, Szepesvári, 2009] [Audibert, Bubeck, Lugosi, 2011] [Auer, Ortner, Szepesvári, 2007] [Auer, Ortner, 2010] [Awerbuch, Kleinberg, 2008] [Bartlett, Hazan, Rakhlin, 2007] [Bartlett, Dani, Hayes, Kakade, Rakhlin, Tewari, 2008] [Bartlett, Tewari, 2009] [Ben-David, Pal, Shalev-Shwartz, 2009] [Blum, Mansour, 2007] [Bubeck, 2010] [Bubeck, Munos, 2010] [Bubeck, Munos, Stoltz, 2009] [Bubeck, Munos, Stoltz, Szepesvári, 2008] [Cesa-Bianchi, Lugosi, 2006] [Cesa-Bianchi, Lugosi, 2009] [Chakrabarti, Kumar, Radlinski, Upfal, 2008] [Chu, Li, Reyzin, Schapire, 2011] [Coquelin, Munos, 2007] [Dani, Hayes, Kakade, 2008] [Dorard, Glowacka, Shawe-Taylor, 2009] [Filippi, 2010] [Filippi, Cappé, Garivier, Szepesvári, 2010] [Flaxman, Kalai, McMahan, 2005] [Garivier, Cappé, 2011] [Grünewälder, Audibert, Opper, Shawe-Taylor, 2010] [Guha, Munagala, Shi, 2007] [Hazan, Agarwal, Kale, 2006] [Hazan, Kale, 2009] [Hazan, Megiddo, 2007] [Honda, Takemura, 2010] [Jaksch, Ortner, Auer, 2010] [Kakade, Shalev-Shwartz, Tewari, 2008] [Kakade, Kalai, 2005] [Kale, Reyzin, Schapire, 2010] [Kanade, McMahan, Bryan, 2009] [Kleinberg, 2005] [Kleinberg, Slivkins, 2010] [Kleinberg, Niculescu-Mizil, Sharma, 2008] [Kleinberg, Slivkins, Upfal, 2008] [Kocsis, Szepesvári, 2006] [Langford, Zhang, 2007] [Lazaric, Munos, 2009] [Li, Chu, Langford, Schapire, 2010] [Li, Chu, Langford, Wang, 2011] [Lu, Pál, Pál, 2010] [Maillard, 2011] [Maillard, Munos, 2010] [Maillard, Munos, Stoltz, 2011] [McMahan, Streeter, 2009] [Narayanan, Rakhlin, 2010] [Ortner, 2008] [Pandey, Agarwal, Chakrabarti, Josifovski, 2007] [Poland, 2008] [Radlinski, Kleinberg, Joachims, 2008] [Rakhlin, Sridharan, Tewari, 2010] [Rigollet, Zeevi, 2010] [Rusmevichientong, Tsitsiklis, 2010] [Shalev-Shwartz, 2007] [Slivkins, Upfal, 2008] [Slivkins, 2011] [Srinivas, Krause, Kakade, Seeger, 2010] [Stoltz, 2005] [Sundaram, 2005] [Wang, Kulkarni, Poor, 2005] [Wang, Audibert, Munos, 2008]

SLIDE 4

Outline of this tutorial

Introduction to bandits

  • The stochastic bandit: UCB
  • The adversarial bandit: EXP3

Populations of bandits

  • Computation of equilibria in games. Application to Poker
  • Hierarchical bandits. MCTS and application to Go.

Bandits in general spaces

  • Lipschitz optimization
  • X-armed bandits
  • Application to planning in MDPs
SLIDE 5

The stochastic multi-armed bandit problem

Setting:

  • A set of K arms, defined by distributions νk (with support in [0, 1]), whose laws are unknown,
  • At each time t, choose an arm kt and receive a reward xt ∼ νkt, drawn i.i.d.,
  • Goal: find an arm selection policy so as to maximize the expected sum of rewards.

Exploration-exploitation tradeoff:

  • Explore: learn about the environment
  • Exploit: act optimally according to our current beliefs
SLIDE 6

The regret

Definitions:

  • Let μk = E[νk] be the expected value of arm k,
  • Let μ∗ = maxk μk be the best expected value,
  • The cumulative expected regret:

$$R_n \overset{\text{def}}{=} \sum_{t=1}^{n} (\mu^* - \mu_{k_t}) = \sum_{k=1}^{K} (\mu^* - \mu_k) \sum_{t=1}^{n} \mathbf{1}\{k_t = k\} = \sum_{k=1}^{K} \Delta_k n_k,$$

where Δk := μ∗ − μk, and nk is the number of times arm k has been pulled up to time n.

Goal: find an arm selection policy so as to minimize Rn.

SLIDE 7

Proposed solutions

This is an old problem! [Robbins, 1952] Perhaps surprisingly, it is not fully solved yet! Many strategies have been proposed:

  • ε-greedy exploration: choose the apparently best action with probability 1 − ε, or a random action with probability ε,
  • Bayesian exploration: assign a prior to the arm distributions and select an arm according to the posterior distributions (Gittins index, Thompson strategy, ...),
  • Softmax exploration: choose arm k with probability ∝ exp(β X̄k) (e.g., the EXP3 algorithm),
  • Follow the perturbed leader: choose the best perturbed arm,
  • Optimistic exploration: select the arm with the highest upper bound.
SLIDE 8

The UCB algorithm

Upper Confidence Bound algorithm [Auer, Cesa-Bianchi, Fischer, 2002]: at each time n, select the arm k with the highest B-value:

$$B_{k,n_k,n} \overset{\text{def}}{=} \underbrace{\frac{1}{n_k}\sum_{s=1}^{n_k} x_{k,s}}_{\bar X_{k,n_k}} + \underbrace{\sqrt{\frac{3\log n}{2 n_k}}}_{c_{n_k,n}},$$

with:

  • nk the number of times arm k has been pulled up to time n,
  • xk,s the s-th reward received when pulling arm k.

Note that:

  • This is the sum of an exploitation term (the empirical mean X̄k,nk) and an exploration term.
  • cnk,n is a confidence-interval term, so Bk,nk,n is a UCB on μk.
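
To make the rule concrete, here is a minimal sketch of UCB in Python. The Bernoulli arms are an assumption used only to simulate rewards; the algorithm itself sees nothing but the received rewards:

```python
import math
import random

def ucb(means, n_rounds=10000):
    """Sketch of the UCB rule above: pull the arm maximizing
    B_k = (empirical mean of arm k) + sqrt(3 log(n) / (2 n_k)).
    `means` are Bernoulli parameters standing in for the unknown nu_k."""
    K = len(means)
    counts = [0] * K    # n_k: number of pulls of arm k
    sums = [0.0] * K    # cumulative reward of arm k

    for n in range(1, n_rounds + 1):
        if n <= K:
            k = n - 1   # initialization: pull each arm once
        else:
            # exploitation term + exploration term c_{n_k,n}
            k = max(range(K),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(3 * math.log(n) / (2 * counts[i])))
        reward = 1.0 if random.random() < means[k] else 0.0
        counts[k] += 1
        sums[k] += reward
    return counts

# With means [0.5, 0.6], the suboptimal arm 0 should be pulled only
# O(log(n) / Delta^2) times (Proposition 1 below).
print(ucb([0.5, 0.6]))
```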
SLIDE 9

Intuition of the UCB algorithm

Idea:

  • "Optimism in the face of uncertainty" principle,
  • Select the arm with the highest upper bound (on the true value of the arm, given what has been observed so far),
  • The B-values Bk,s,t are UCBs on μk. Indeed:

$$\mathbb{P}\Big(\bar X_{k,s} - \mu_k \ge \sqrt{\tfrac{3\log t}{2s}}\Big) \le \frac{1}{t^3}, \qquad \mathbb{P}\Big(\bar X_{k,s} - \mu_k \le -\sqrt{\tfrac{3\log t}{2s}}\Big) \le \frac{1}{t^3}.$$

Reminder (Chernoff-Hoeffding inequality):

$$\mathbb{P}(\bar X_{k,s} - \mu_k \ge \varepsilon) \le e^{-2s\varepsilon^2}, \qquad \mathbb{P}(\bar X_{k,s} - \mu_k \le -\varepsilon) \le e^{-2s\varepsilon^2}.$$

SLIDE 10

Regret bound for UCB

Proposition 1.

Each sub-optimal arm k is visited, in expectation, at most

$$\mathbb{E}[n_k(n)] \le \frac{6\log n}{\Delta_k^2} + 1 + \frac{\pi^2}{3}$$

times (where Δk := μ∗ − μk > 0). Thus the expected regret is bounded by:

$$\mathbb{E} R_n = \sum_k \mathbb{E}[n_k]\,\Delta_k \le 6 \sum_{k:\Delta_k>0} \frac{\log n}{\Delta_k} + K\Big(1 + \frac{\pi^2}{3}\Big).$$

SLIDE 11

Intuition of the proof

Let k be a sub-optimal arm, and k∗ be an optimal arm. At time n, if arm k is selected, this means that Bk,nk,n ≥ Bk∗,nk∗,n, i.e.

$$\bar X_{k,n_k} + \sqrt{\frac{3\log n}{2 n_k}} \ \ge\ \bar X_{k^*,n_{k^*}} + \sqrt{\frac{3\log n}{2 n_{k^*}}},$$

which implies, with high probability,

$$\mu_k + 2\sqrt{\frac{3\log n}{2 n_k}} \ge \mu^*, \qquad \text{i.e.} \qquad n_k \le \frac{6\log n}{\Delta_k^2}.$$

Thus, if nk > 6 log(n)/Δk², there is only a small probability that arm k is selected.

SLIDE 12

Proof of Proposition 1

Write u = 6 log(n)/Δk² + 1. We have:

$$n_k(n) \le u + \sum_{t=u+1}^{n} \mathbf{1}\{k_t = k;\ n_k(t) > u\} \le u + \sum_{t=u+1}^{n}\Big[\sum_{s=u+1}^{t} \mathbf{1}\{\hat X_{k,s} - \mu_k \ge c_{t,s}\} + \sum_{s^*=1}^{t} \mathbf{1}\{\hat X_{k^*,s^*} - \mu^* \le -c_{t,s^*}\}\Big]$$

Now, taking the expectation of both sides,

$$\mathbb{E}[n_k(n)] \le u + \sum_{t=u+1}^{n}\Big[\sum_{s=u+1}^{t} \mathbb{P}\big(\hat X_{k,s} - \mu_k \ge c_{t,s}\big) + \sum_{s^*=1}^{t} \mathbb{P}\big(\hat X_{k^*,s^*} - \mu^* \le -c_{t,s^*}\big)\Big] \le u + \sum_{t=u+1}^{n}\Big[\sum_{s=u+1}^{t} t^{-3} + \sum_{s^*=1}^{t} t^{-3}\Big] \le \frac{6\log n}{\Delta_k^2} + 1 + \frac{\pi^2}{3}.$$

SLIDE 13

Variants of UCB

[Audibert et al., 2008]

  • UCB with variance estimate: define the UCB

$$B_{k,n_k,n} \overset{\text{def}}{=} \bar X_{k,n_k} + \sqrt{\frac{2 V_{k,n_k}\log(1.2\,n)}{n_k}} + \frac{3\log(1.2\,n)}{n_k}.$$

Then the expected regret is bounded by:

$$\mathbb{E} R_n \le 10\Big(\sum_{k:\Delta_k>0} \frac{\sigma_k^2}{\Delta_k} + 2\Big)\log n.$$

  • PAC-UCB: let β > 0 and define the UCB

$$B_{k,n_k} \overset{\text{def}}{=} \bar X_{k,n_k} + \sqrt{\frac{\log(K n_k (n_k+1)\beta^{-1})}{n_k}}.$$

Then, with probability 1 − β, the regret is bounded by a constant:

$$R_n \le 6\log(K\beta^{-1}) \sum_{k:\Delta_k>0} \frac{1}{\Delta_k}.$$

SLIDE 14

Upper and Lower bounds

UCB:

  • Distribution-dependent: $\mathbb{E} R_n = O\big(\sum_{k:\Delta_k>0} \frac{1}{\Delta_k}\log n\big)$
  • Distribution-independent: $\mathbb{E} R_n = O(\sqrt{Kn\log n})$.

Lower bounds:

  • Distribution-dependent [Lai and Robbins, 1985]: $\mathbb{E} R_n = \Omega\big(\sum_{k:\Delta_k>0} \frac{\Delta_k}{KL(\nu_k\|\nu^*)}\log n\big)$
  • Distribution-independent [Cesa-Bianchi and Lugosi, 2006]: $\inf_{\text{Algo}}\ \sup_{\text{Problem}} R_n = \Omega(\sqrt{nK})$.

Recent improvements in upper bounds achieve the optimal rates:

  • MOSS [Audibert & Bubeck, 2009]
  • KL-UCB [Garivier & Cappé, 2011], [Maillard et al., 2011]

SLIDE 15

The adversarial bandit

The rewards are no longer i.i.d., but arbitrary! At each time t, simultaneously:

  • The adversary assigns a reward xk,t ∈ [0, 1] to each arm k ∈ {1, . . . , K},
  • The player chooses an arm kt.

The player receives the corresponding reward xkt,t. His goal is to maximize the sum of rewards. Can we hope to do almost as well as the best (constant) arm?

Time          1    2    3    4    5    6    7    8   ...
Arm pulled    1    2    1    1    2    1    1    1
Reward arm 1  1    0.7  0.9  1    1    1    0.8  1
Reward arm 2  0.9  1    0.4  0.6  (remaining entries lost in extraction)

Reward obtained: 6.1. Arm 1: 7.4, Arm 2: 2.9. Regret w.r.t. the best constant strategy: 7.4 − 6.1 = 1.3.

SLIDE 16

Notion of regret

Define the regret:

$$R_n = \max_{k\in\{1,\dots,K\}} \sum_{t=1}^{n} x_{k,t} - \sum_{t=1}^{n} x_{k_t,t}.$$

  • Performance is assessed in terms of the best constant strategy.
  • Can we have $\sup_{\text{rewards}} \mathbb{E} R_n = o(n)$?
  • If the policy of the player is deterministic, there exists a reward sequence such that the performance is arbitrarily poor → internal randomization is needed.

SLIDE 17

EXP3 algorithm

EXP3 algorithm (Explore-Exploit using Exponential weights) [Auer et al., 2002]:

  • η > 0 and β > 0 are two parameters of the algorithm.
  • Initialize w1(k) = 1 for all k = 1, . . . , K.
  • At each round t = 1, . . . , n, the player selects arm kt ∼ pt(·), where

$$p_t(k) = (1-\beta)\,\frac{w_t(k)}{\sum_{i=1}^{K} w_t(i)} + \frac{\beta}{K}, \qquad w_t(k) = e^{\eta \sum_{s=1}^{t-1} \tilde x_s(k)}, \qquad \tilde x_s(k) = \frac{x_s(k)}{p_s(k)}\,\mathbf{1}\{k_s = k\}.$$
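
A direct transcription of these updates, as a sketch; the `reward_fn` argument is a stand-in for the adversary, which is not part of the algorithm itself:

```python
import math
import random

def exp3(reward_fn, K, n_rounds, eta, beta):
    """Sketch of EXP3 as defined above. reward_fn(t, k) returns the
    adversary's reward x_t(k) in [0, 1]; only the pulled arm's reward
    is observed, and it is importance-weighted by 1/p_t(k)."""
    cum_est = [0.0] * K  # running sums of the estimates x~_s(k)
    for t in range(n_rounds):
        # (for very long runs, subtract max(cum_est) before exponentiating
        # to avoid overflow)
        w = [math.exp(eta * c) for c in cum_est]
        total = sum(w)
        p = [(1 - beta) * wi / total + beta / K for wi in w]
        k = random.choices(range(K), weights=p)[0]  # k_t ~ p_t
        x = reward_fn(t, k)
        cum_est[k] += x / p[k]  # unbiased estimate of x_t(k)
    return cum_est

# Tuned parameters from Proposition 2 below:
# eta = sqrt(log(K) / ((e - 1) * n * K)), beta = eta * K.
```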

SLIDE 18

Performance of EXP3

Proposition 2.

Let β = ηK ≤ 1. We have

$$\mathbb{E} R_n \le \frac{\log K}{\eta} + (e-1)\,\eta n K.$$

Thus, by choosing $\eta = \sqrt{\frac{\log K}{(e-1)nK}}$, it comes

$$\sup_{\text{rewards}} \mathbb{E} R_n \le 2.63\,\sqrt{nK\log K}.$$

Properties:

  • If all rewards are revealed to the learner (full information), a similar algorithm achieves [Cesa-Bianchi and Lugosi, 2006]

$$\sup_{\text{rewards}} \mathbb{E} R_n = O(\sqrt{n\log K}).$$

SLIDE 19

Proof of Proposition 2 [part 1]

Write $W_t = \sum_{k=1}^{K} w_t(k)$. Notice that

$$\mathbb{E}_{k_s\sim p_s}[\tilde x_s(k)] = \sum_{i=1}^{K} p_s(i)\,\frac{x_s(k)}{p_s(k)}\,\mathbf{1}\{i=k\} = x_s(k), \qquad \mathbb{E}_{k_s\sim p_s}[\tilde x_s(k_s)] = \sum_{i=1}^{K} p_s(i)\,\frac{x_s(i)}{p_s(i)} \le K.$$

We thus have

$$\frac{W_{t+1}}{W_t} = \sum_{k=1}^{K} \frac{w_t(k)\,e^{\eta\tilde x_t(k)}}{W_t} = \sum_{k=1}^{K} \frac{p_t(k)-\beta/K}{1-\beta}\,e^{\eta\tilde x_t(k)} \le \sum_{k=1}^{K} \frac{p_t(k)-\beta/K}{1-\beta}\big(1 + \eta\tilde x_t(k) + (e-2)\,\eta^2\tilde x_t(k)^2\big),$$

since $\eta\tilde x_t(k) \le \eta K/\beta = 1$, and $e^x \le 1 + x + (e-2)x^2$ for $x \le 1$.

SLIDE 20

Proof of Proposition 2 [part 2]

Thus

$$\frac{W_{t+1}}{W_t} \le 1 + \frac{1}{1-\beta}\sum_{k=1}^{K} p_t(k)\big(\eta\tilde x_t(k) + (e-2)\,\eta^2\tilde x_t(k)^2\big),$$

$$\log\frac{W_{t+1}}{W_t} \le \frac{1}{1-\beta}\sum_{k=1}^{K} p_t(k)\big(\eta\tilde x_t(k) + (e-2)\,\eta^2\tilde x_t(k)^2\big),$$

$$\log\frac{W_{n+1}}{W_1} \le \frac{1}{1-\beta}\sum_{t=1}^{n}\sum_{k=1}^{K} p_t(k)\big(\eta\tilde x_t(k) + (e-2)\,\eta^2\tilde x_t(k)^2\big).$$

But we also have

$$\log\frac{W_{n+1}}{W_1} = \log\Big(\sum_{k=1}^{K} e^{\eta\sum_{t=1}^{n}\tilde x_t(k)}\Big) - \log K \ \ge\ \eta\sum_{t=1}^{n}\tilde x_t(k) - \log K, \qquad \text{for any } k = 1,\dots,K.$$

SLIDE 21

Proof of Proposition 2 [part 3]

Taking the expectation w.r.t. the internal randomization of the algorithm, we get, for all k:

$$\mathbb{E}\Big[(1-\beta)\sum_{t=1}^{n}\tilde x_t(k) - \sum_{t=1}^{n}\sum_{i=1}^{K} p_t(i)\,\tilde x_t(i)\Big] \le (1-\beta)\,\frac{\log K}{\eta} + (e-2)\,\eta\,\mathbb{E}\Big[\sum_{t=1}^{n}\sum_{k=1}^{K} p_t(k)\,\tilde x_t(k)^2\Big]$$

$$\mathbb{E}\Big[\sum_{t=1}^{n} x_t(k) - \sum_{t=1}^{n} x_t(k_t)\Big] \le \beta n + \frac{\log K}{\eta} + (e-2)\,\eta n K$$

$$\mathbb{E}[R_n(k)] \le \frac{\log K}{\eta} + (e-1)\,\eta n K.$$

SLIDE 22

In summary...

Distribution-dependent bounds:

  • UCB: $\mathbb{E} R_n = O\big(\sum_k \frac{1}{\Delta_k}\log n\big)$
  • Lower bound: $\mathbb{E} R_n = \Omega\big(\sum_k \frac{\Delta_k}{KL(\nu_k,\nu^*)}\log n\big)$

Distribution-independent bounds:

  • UCB: $\sup_{\text{distributions}} \mathbb{E} R_n = O(\sqrt{Kn\log n})$
  • EXP3: $\sup_{\text{rewards}} \mathbb{E} R_n = O(\sqrt{Kn\log K})$
  • Lower bound: $\sup_{\text{rewards}} \mathbb{E} R_n = \Omega(\sqrt{Kn})$

Remark: the optimal rate $O(\sqrt{Kn})$ is achieved by INF [Audibert and Bubeck, 2010].

SLIDE 23

K-armed bandit, with K = 4

At each round t, select a tap. Optimize the quality of the n selected beers.

SLIDE 24

Bandit with a large number of arms

Goal: optimize the quality of the beer you drink before you get drunk...

SLIDE 25

Bandits with a large set of arms

Typically, the number of arms is larger than the number of rounds.

  • Bandit = a simple tool for rapidly selecting the best action.
  • Bandit = a building block from which one can build more complex problems.
  • We now consider a population of bandits: adversarial bandits, collaborative bandits.

SLIDE 26

Game between bandits

Consider a 2-player zero-sum repeated game: A and B simultaneously play action 1 or 2, and receive the reward (for A):

A \ B    1     2
  1      2    −1
  2     −1     1

(A likes consensus, B likes conflicts.) Now, let A and B be bandit algorithms aiming at minimizing their regret, i.e., for player A:

$$R_n(A) \overset{\text{def}}{=} \max_{a\in\{1,2\}} \sum_{t=1}^{n} r_A(a, B_t) - \sum_{t=1}^{n} r_A(A_t, B_t).$$

What happens?

SLIDE 27

Nash equilibrium

Nash equilibrium: a (mixed) strategy for both players such that no player has an incentive to change his own strategy unilaterally.

A \ B    1     2
  1      2    −1
  2     −1     1

Here: A plays 1 with probability pA = 1/2, B plays 1 with probability pB = 1/4.

[Figure: the expected reward rA of player A as a function of his strategy, for B = 1 and for B = 2.]
SLIDE 28

Regret minimization → Nash equilibrium

Define the regret of A:

$$R_n(A) \overset{\text{def}}{=} \max_{a\in\{1,2\}} \sum_{t=1}^{n} r_A(a, B_t) - \sum_{t=1}^{n} r_A(A_t, B_t),$$

and that of B accordingly.

Proposition 3.

If both players perform a (Hannan-)consistent regret-minimization strategy (i.e., Rn(A)/n → 0 and Rn(B)/n → 0), then the empirical frequencies of the actions chosen by both players converge to a Nash equilibrium. (Remember that EXP3 is consistent!) Note that in general games, we have convergence towards the set of correlated equilibria [Foster and Vohra, 1997].
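
As an illustration of Proposition 3, one can let two EXP3 players play the repeated game against each other and watch the empirical action frequencies. A sketch, using the payoff matrix reconstructed above; the rescaling of rewards to [0, 1] is our assumption, since EXP3 expects bounded nonnegative rewards:

```python
import math
import random

class Exp3:
    """EXP3 player; rewards passed to update() must lie in [0, 1]."""
    def __init__(self, K, eta, beta):
        self.K, self.eta, self.beta = K, eta, beta
        self.cum = [0.0] * K          # importance-weighted reward sums
        self.p = [1.0 / K] * K
    def act(self):
        m = max(self.cum)             # shift for numerical stability
        w = [math.exp(self.eta * (c - m)) for c in self.cum]
        total = sum(w)
        self.p = [(1 - self.beta) * wi / total + self.beta / self.K for wi in w]
        return random.choices(range(self.K), weights=self.p)[0]
    def update(self, k, x):
        self.cum[k] += x / self.p[k]

R_A = [[2, -1], [-1, 1]]   # A's payoff as reconstructed above; B receives -r_A
n = 200_000
eta = math.sqrt(math.log(2) / ((math.e - 1) * n * 2))
A, B = Exp3(2, eta, 2 * eta), Exp3(2, eta, 2 * eta)
count_a1 = count_b1 = 0
for _ in range(n):
    a, b = A.act(), B.act()
    r = R_A[a][b]
    A.update(a, (r + 1) / 3)   # rescale A's reward from [-1, 2] to [0, 1]
    B.update(b, (2 - r) / 3)   # rescale B's reward -r from [-2, 1] to [0, 1]
    count_a1 += (a == 0)
    count_b1 += (b == 0)
print(count_a1 / n, count_b1 / n)  # empirical frequencies of action "1"
```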

SLIDE 29

Sketch of proof:

Write $p_A^n \overset{\text{def}}{=} \frac1n\sum_{t=1}^{n}\mathbf{1}\{A_t=1\}$ and $p_B^n \overset{\text{def}}{=} \frac1n\sum_{t=1}^{n}\mathbf{1}\{B_t=1\}$, and $r_A(p,q) \overset{\text{def}}{=} \mathbb{E}\,r_A(A\sim p, B\sim q)$.

Regret minimization, $R_n(A)/n \to 0$, means that ∀ε > 0, for n large enough:

$$\max_{a\in\{1,2\}} \frac1n\sum_{t=1}^{n} r_A(a, B_t) - \frac1n\sum_{t=1}^{n} r_A(A_t, B_t) \le \varepsilon$$

$$\max_{a\in\{1,2\}} r_A(a, p_B^n) - r_A(p_A^n, p_B^n) \le \varepsilon$$

$$r_A(p, p_B^n) - r_A(p_A^n, p_B^n) \le \varepsilon, \qquad \text{for all } p \in [0, 1].$$

Now, using $R_n(B)/n \to 0$, we deduce that:

$$r_A(p, p_B^n) - \varepsilon \le r_A(p_A^n, p_B^n) \le r_A(p_A^n, q) + \varepsilon, \qquad \forall p, q \in [0, 1].$$

Thus the empirical frequencies of the actions played by both players are arbitrarily close to a Nash strategy.

SLIDE 30

Texas Hold’em Poker

  • In the 2-player Poker game, the Nash equilibrium is interesting (zero-sum game).
  • A policy maps an information set (my cards + board + pot) to probabilities over decisions (check, raise, fold).
  • The space of policies is huge!

Idea: approximate the Nash equilibrium by using bandit algorithms assigned to each information set.

  • This provides the world's best 2-player pot-limit Texas Hold'em Poker program [Zinkevich et al., 2007].

SLIDE 31

Hierarchy of bandits

We now consider another way of combining bandits. In a hierarchy of bandits, the reward obtained when pulling an arm is itself the return of another bandit in the hierarchy. Applications to:

  • tree search,
  • optimization,
  • planning
SLIDE 32

Historical motivation for this problem

MCTS in Crazy Stone (Rémi Coulom, 2005). Idea: use bandits at each node.

SLIDE 33

Hierarchical bandit algorithm

Run an Upper Confidence Bound (UCB) algorithm at each node: for a child j of node i,

$$B_j \overset{\text{def}}{=} \bar X_{j,n_j} + \sqrt{\frac{2\log n_i}{n_j}}.$$

Intuition:

  • Explore first the most promising branches,
  • The average converges to the max.

Related algorithms:

  • Adaptive Multistage Sampling (AMS) [Chang, Fu, Hu, Marcus, 2005]
  • UCB applied to Trees (UCT) [Kocsis and Szepesvári, 2006]
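
In code, the tree policy at a node reduces to an argmax over the children's B-values. A minimal sketch; the Node fields are our assumptions, not a fixed API:

```python
import math

class Node:
    def __init__(self):
        self.children = []    # child nodes j
        self.visits = 0       # n_i: visits through this node
        self.value_sum = 0.0  # sum of returns propagated through this node

def uct_select(node):
    """Tree policy of UCT (sketch): descend to the child maximizing
    B_j = X_bar_j + sqrt(2 log(n_i) / n_j), visiting unvisited children first."""
    def b_value(child):
        if child.visits == 0:
            return float("inf")
        return (child.value_sum / child.visits
                + math.sqrt(2 * math.log(node.visits) / child.visits))
    return max(node.children, key=b_value)
```

A full MCTS loop would apply `uct_select` from the root down to a leaf, expand it, run a Monte-Carlo rollout, and propagate the return back along the followed path.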

SLIDE 34

The MoGo program

[Gelly, Wang, Munos, Teytaud, 2006] + many others. Features:

  • Explore-exploit with UCT
  • Monte-Carlo evaluation
  • Asymmetric tree expansion
  • Anytime algorithm
  • Use of features

Among the world's best programs!

SLIDE 35

No finite-time guarantee for UCT

Problem: at each node, the rewards are not i.i.d. Consider the tree below: the left branches seem better than the right branches, so they are explored for a very long time before the optimal leaf is eventually reached. The expected regret is disastrous:

$$\mathbb{E} R_n = \Omega\big(\underbrace{\exp(\exp(\dots\exp(}_{D \text{ times}}1)\dots))\big) + O(\log n).$$

See [Coquelin and Munos, 2007].

[Figure: a binary tree of depth D with rewards (D−1)/D, (D−2)/D, (D−3)/D, ..., 1/D along the left branches, and 1 at the optimal leaf.]

SLIDE 36

Bandits in general spaces

Outline:

  • Optimization of deterministic Lipschitz functions
  • X-armed bandits in general spaces: HOO
  • Application to planning in MDPs:
    – Deterministic environments
    – Open-loop planning in stochastic environments
    – Closed-loop planning in MDPs
SLIDE 37

Online optimization of a deterministic Lipschitz function

Problem: find online the maximum of f : X → ℝ, assumed to be Lipschitz: |f(x) − f(y)| ≤ ℓ(x, y).

  • At each time step t, select xt ∈ X,
  • Observe f(xt),
  • Goal: find an exploration policy so as to maximize the sum of rewards.

Define the cumulative regret

$$R_n = \sum_{t=1}^{n} \big[f^* - f(x_t)\big], \qquad \text{where } f^* = \sup_{x\in X} f(x).$$

SLIDE 38

Example in 1d

[Figure: the function f, its maximum f∗, and the evaluation f(xt) at the sampled point xt.]

The Lipschitz property → the evaluation of f at xt provides a first upper bound on f.

SLIDE 39

Example in 1d (continued)

A new point → a refined upper bound on f.

SLIDE 40

Example in 1d (continued)

Question: where should one sample the next point? Answer: select the point with highest upper bound! “Optimism in the face of (partial observation) uncertainty”
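
A sketch of this optimistic loop in one dimension, assuming ℓ(x, y) = L|x − y| and a grid discretization of the domain (a simplification of the continuous argmax):

```python
def optimistic_lipschitz_max(f, L, n_evals, lo=0.0, hi=1.0, grid=2001):
    """Optimistic optimization of an L-Lipschitz f on [lo, hi] (sketch):
    repeatedly evaluate the grid point with the highest upper bound
    U(x) = min_t [ f(x_t) + L * |x - x_t| ]."""
    xs = [lo + (hi - lo) * i / (grid - 1) for i in range(grid)]
    U = [float("inf")] * grid       # current upper bound at each grid point
    best_x, best_f = None, float("-inf")
    x = xs[grid // 2]               # arbitrary first query
    for _ in range(n_evals):
        fx = f(x)
        if fx > best_f:
            best_x, best_f = x, fx
        # each evaluation tightens the Lipschitz upper bound everywhere
        U = [min(u, fx + L * abs(xi - x)) for u, xi in zip(U, xs)]
        x = xs[max(range(grid), key=U.__getitem__)]  # optimism: highest U
    return best_x, best_f

# Example: a 1-Lipschitz function with maximum at x = 0.3
print(optimistic_lipschitz_max(lambda x: 1 - abs(x - 0.3), L=1.0, n_evals=20))
```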

SLIDE 41

Lipschitz optimization with noisy evaluations

f is still Lipschitz, but now the evaluation of f at xt returns a noisy evaluation rt of f(xt), i.e. such that E[rt | xt] = f(xt).

SLIDE 42

Where should one sample next?

How can one define a high-probability upper bound at any point x?

SLIDE 43

UCB in a given domain

[Figure: a domain Xi containing several evaluated points xt with noisy rewards rt.]

For a fixed domain Xi ∋ x containing ni points {xt} ⊂ Xi, the sum $\sum_{t=1}^{n_i} (r_t - f(x_t))$ is a martingale. Thus, by Azuma's inequality, with probability 1 − δ,

$$\frac{1}{n_i}\sum_{t=1}^{n_i} r_t + \sqrt{\frac{\log 1/\delta}{2 n_i}} \ \ge\ \frac{1}{n_i}\sum_{t=1}^{n_i} f(x_t) \ \ge\ f(x) - \mathrm{diam}(X_i),$$

since f is Lipschitz (where diam(Xi) = sup_{x,y∈Xi} ℓ(x, y)).

SLIDE 44

High probability upper bound

With probability 1 − δ,

$$\frac{1}{n_i}\sum_{t=1}^{n_i} r_t + \sqrt{\frac{\log 1/\delta}{2 n_i}} + \mathrm{diam}(X_i) \ \ge\ \sup_{x\in X_i} f(x).$$

There is a tradeoff between the size of the confidence interval and the diameter. By considering several domains we can derive a tighter upper bound.

[Figure: the upper bound over Xi, composed of the empirical mean, the confidence width √(log(1/δ)/2ni), and the diameter diam(Xi).]

SLIDE 45

A hierarchical decomposition

Use a tree of partitions at all scales:

$$B_i(t) \overset{\text{def}}{=} \min\Big[\hat\mu_i(t) + \sqrt{\frac{2\log t}{t_i}} + \mathrm{diam}(i),\ \max_{j\in\mathcal{C}(i)} B_j(t)\Big]$$

SLIDE 46

X-armed bandits

More generally, let X be a space equipped with a semi-metric ℓ(x, y), and let f be a function such that

$$f(x^*) - f(x) \le \ell(x, x^*), \qquad \text{where } f(x^*) = \sup_{x\in X} f(x).$$

X-armed bandit problem: at each round t, choose a point (arm) xt ∈ X and receive a reward rt, an independent sample drawn from a distribution ν(xt) with mean f(xt). Goal: minimize the regret

$$R_n \overset{\text{def}}{=} \sum_{t=1}^{n} f(x^*) - r_t.$$

SLIDE 47

Hierarchical Optimistic Optimization (HOO)

[Bubeck et al., 2011]: consider a tree of partitions of X, where each node i corresponds to a subdomain Xi.

HOO algorithm: let Tt denote the set of expanded nodes at round t.

  • T1 = {root} (the whole space X),
  • At round t, select a leaf it of Tt by following, from the root, the path that maximizes the B-values,
  • Tt+1 = Tt ∪ {it},
  • Select xt ∈ Xit (arbitrarily),
  • Observe the reward rt ∼ ν(xt) and update the B-values:

$$B_i \overset{\text{def}}{=} \min\Big[\bar X_{i,n_i} + \sqrt{\frac{2\log n}{n_i}} + \mathrm{diam}(i),\ \max_{j\in\mathcal{C}(i)} B_j\Big]$$

[Figure: a node (h, i) with children B-values B_{h+1,2i−1} and B_{h+1,2i}; the followed path, the selected node, and the pulled point xt.]
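
A compact (and deliberately unoptimized) sketch of HOO on X = [0, 1] with dyadic splits; the cell diameters diam(i) = ν·ρ^depth are an assumption standing in for the semi-metric ℓ:

```python
import math
import random

class Cell:
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.n = 0            # n_i: number of pulls in this cell
        self.sum = 0.0        # cumulated rewards
        self.children = None

def b_value(cell, t, nu=1.0, rho=0.5):
    """B_i = min[ X_bar_i + sqrt(2 log t / n_i) + diam(i), max_j B_j ]."""
    if cell.n == 0:
        return float("inf")
    u = (cell.sum / cell.n + math.sqrt(2 * math.log(t) / cell.n)
         + nu * rho ** cell.depth)
    if cell.children:
        u = min(u, max(b_value(c, t) for c in cell.children))
    return u

def hoo(reward, n_rounds):
    """reward(x) returns a sample from nu(x) with mean f(x), in [0, 1]."""
    root = Cell(0.0, 1.0, 0)
    for t in range(1, n_rounds + 1):
        path, node = [root], root
        while node.children:                   # follow the max-B path
            node = max(node.children, key=lambda c: b_value(c, t))
            path.append(node)
        x = random.uniform(node.lo, node.hi)   # arbitrary point of the cell
        r = reward(x)
        mid = (node.lo + node.hi) / 2          # expand the selected leaf
        node.children = [Cell(node.lo, mid, node.depth + 1),
                         Cell(mid, node.hi, node.depth + 1)]
        for cell in path:                      # update stats along the path
            cell.n += 1
            cell.sum += r
    return root
```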

SLIDE 48

Properties of HOO

Properties:

  • For any domain Xi ∋ x∗, the corresponding B-value Bi is a (high-probability) upper bound on f(x∗).
  • We don't really care whether, for sub-optimal domains Xi, the value Bi is an upper bound on sup_{x∈Xi} f(x) or not.
  • The tree grows in an asymmetric way, leaving the sub-optimal branches mostly unexplored,
  • Essentially, only the optimal branch is explored.
SLIDE 49

Example in 1d

rt ∼ B(f(xt)), a Bernoulli distribution with parameter f(xt). [Figures: the resulting trees at time n = 1000 and at n = 10000.]

SLIDE 50

Analysis of HOO

Let d be the near-optimality dimension of f in X, i.e., the smallest d such that the set of ε-optimal points

$$X_\varepsilon \overset{\text{def}}{=} \{x \in X,\ f(x) \ge f^* - \varepsilon\}$$

can be covered by O(ε^{−d}) balls of radius ε. Then

$$\mathbb{E} R_n = O\big(n^{\frac{d+1}{d+2}}\big).$$

(Similar to the Zooming algorithm of [Kleinberg, Slivkins, Upfal, 2008], but with a weaker assumption on f and ℓ, and no sampling oracle required.)

SLIDE 51

Example 1:

Assume the function is locally peaky around its maximum: f(x∗) − f(x) = Θ(‖x∗ − x‖).

It takes O(ε⁰) = O(1) balls of radius ε to cover Xε. Thus d = 0 and the regret is √n.

SLIDE 52

Example 2:

Assume the function is locally quadratic around its maximum: f(x∗) − f(x) = Θ(‖x∗ − x‖^α), with α = 2.

  • For ℓ(x, y) = ‖x − y‖, it takes O(ε^{−D/2}) balls of radius ε to cover Xε (of size O(ε^{D/2})). Thus d = D/2 and the regret is n^{(D+2)/(D+4)}.
  • For ℓ(x, y) = ‖x − y‖², it takes O(ε⁰) = O(1) ℓ-balls of radius ε to cover Xε. Thus d = 0 and the regret is √n.
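
Spelling out the covering count behind this example (a worked version of the claim above):

```latex
% f locally quadratic: X_eps is, up to constants, a Euclidean ball of
% radius eps^{1/2} around x^*.
X_\varepsilon = \{x : f(x) \ge f^* - \varepsilon\}
            \subseteq B\big(x^*,\, c\,\varepsilon^{1/2}\big)

% With \ell(x,y) = \|x-y\|: covering this ball with balls of radius
% \varepsilon costs
N(\varepsilon) = \Theta\Big(\big(\varepsilon^{1/2}/\varepsilon\big)^{D}\Big)
             = \Theta\big(\varepsilon^{-D/2}\big)
\quad\Rightarrow\quad d = \tfrac{D}{2},\qquad
\mathbb{E}R_n = O\big(n^{\frac{d+1}{d+2}}\big) = O\big(n^{\frac{D+2}{D+4}}\big).

% With \ell(x,y) = \|x-y\|^2: an \ell-ball of radius \varepsilon is a
% Euclidean ball of radius \varepsilon^{1/2}, so N(\varepsilon) = O(1),
% hence d = 0 and \mathbb{E}R_n = O(\sqrt{n}).
```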

SLIDE 53

Known smoothness around the maximum

Consider X = [0, 1]^D. Assume that f has a finite number of global maxima and is locally α-smooth around each maximum x∗, i.e., f(x∗) − f(x) = Θ(‖x∗ − x‖^α). Then, by choosing ℓ(x, y) = ‖x − y‖^α, Xε is covered by O(1) balls of "radius" ε. Thus the near-optimality dimension is d = 0, and the regret of HOO is ERn = O(√n), i.e., the rate of growth is independent of the ambient dimension.

SLIDE 54

Conclusions on bandits in general spaces

The near-optimality dimension may be seen as an excess order of smoothness of f (around its maxima) compared to what is known:

  • If the smoothness order of the function is known, then the regret of the HOO algorithm is O(√n),
  • If the smoothness is underestimated, for example f is α-smooth but we only use ℓ(x, y) = ‖x − y‖^β with β < α, then the near-optimality dimension is d = D(1/β − 1/α) and the regret is O(n^{(d+1)/(d+2)}),
  • If the smoothness is overestimated, the weak-Lipschitz assumption is violated, thus there is no guarantee (e.g., UCT).

SLIDE 55

Applications

  • Online supervised learning: at time t, HOO selects ht ∈ H. The environment chooses (xt, yt) ∼ P. The resulting loss ℓ(ht(xt), yt) is a noisy evaluation of E_{(x,y)∼P}[ℓ(h(x), y)]. HOO generates a sequence of hypotheses (ht) whose cumulative performance is close to that of the best hypothesis h∗ ∈ H.
  • Policy optimization for MDPs or POMDPs: consider a class of parameterized policies πα. At time t, HOO selects αt and a trajectory is generated using παt. The sum of rewards obtained is a noisy evaluation of the value function V^{παt}. Thus HOO performs almost as well as if it used the best parameter α∗.

SLIDE 56

Application to planning in MDPs

Setting:

  • Assume we have a generative model of an MDP,
  • The state space is large: there is no way to represent the value function,
  • Search for the best policy, given a computational budget (e.g., number of calls to the model),
  • Example: from the current state st, search for the best possible immediate action at, play this action, observe the next state st+1, and repeat.

Works:

  • Optimistic planning in deterministic systems
  • Open-loop optimistic planning
SLIDE 57

Planning in deterministic systems

Controlled deterministic system with discounted rewards: st+1 = f(st, at), where at ∈ A. The goal is to maximize $\sum_{t\ge 0} \gamma^t r(s_t, a_t)$.

Online planning:

  • From the current state st, return the best possible immediate action at, computed using a given computational budget (e.g., CPU time, number of calls to the model),
  • Play at in the real world, and repeat from the next state st+1.

Given n calls to a generative model, return an action at(n). Simple regret:

$$r_n \overset{\text{def}}{=} \max_{a\in A} Q^*(s_t, a) - Q^*(s_t, a_t(n)).$$

SLIDE 58

Look-ahead tree for planning in deterministic systems

From the current state, build the look-ahead tree:

  • Root of the tree = current state st,
  • Search space X = set of paths (infinite sequences of actions),
  • Value of any path x: $f(x) = \sum_{t\ge 0} \gamma^t r_t$,
  • Metric: $\ell(x, y) = \frac{\gamma^{h(x,y)}}{1-\gamma}$, where h(x, y) is the length of the common prefix of paths x and y,
  • Property: f is Lipschitz w.r.t. ℓ,
  • Use optimistic search to explore the tree with a budget of n resources.

[Figure: a look-ahead tree branching on actions 1 and 2 from the initial state, with two paths x and y.]
SLIDE 59

Optimistic exploration

(The HOO algorithm in the deterministic setting.)

  • For any node i of depth d, define the B-value:

$$B_i \overset{\text{def}}{=} \sum_{t=0}^{d-1} \gamma^t r_t + \frac{\gamma^d}{1-\gamma} \ \ge\ v_i,$$

  • At each round n, expand the node with the highest B-value,
  • Observe the reward, update the B-values,
  • Repeat until no more resources are available,
  • Return the maximizing action.

[Figure: the expanded nodes concentrate along the optimal path; a node i.]
SLIDE 60

Analysis of the regret

[Hren and Munos, 2008] Define β such that the proportion of ε-optimal paths is O(ε^β) (this is related to the near-optimality dimension). Let κ := Kγ^β ∈ [1, K].

  • If κ > 1, then

$$r_n = O\big(n^{-\frac{\log 1/\gamma}{\log \kappa}}\big),$$

whereas for uniform planning $r_n = O\big(n^{-\frac{\log 1/\gamma}{\log K}}\big)$.

  • If κ = 1, then we obtain the exponential rate

$$r_n = O\big(\gamma^{\frac{(1-\gamma)\beta}{c}\,n}\big),$$

where c is such that the proportion of ε-optimal paths is bounded by cε^β.

SLIDE 61

Open Loop Optimistic Planning

Setting:

  • Rewards are stochastic but depend on the sequence of actions (and not on the resulting states),
  • Goal: find the sequence of actions that maximizes the expected discounted sum of rewards,
  • Search space: open-loop policies (sequences of actions).

[Bubeck and Munos, 2010] The OLOP algorithm has expected simple regret

$$\mathbb{E}\, r_n = \begin{cases} \tilde O\big(n^{-\frac{\log 1/\gamma}{\log \kappa}}\big) & \text{if } \gamma\sqrt{\kappa} > 1,\\[4pt] \tilde O\big(n^{-\frac{1}{2}}\big) & \text{if } \gamma\sqrt{\kappa} \le 1.\end{cases}$$

Remarks:

  • For γ√κ > 1, this is the same rate as for deterministic systems!
  • This is not a consequence of HOO.
SLIDE 62

Possible extensions

Applications of hierarchical bandits:

  • Planning in MDPs when the number of next states is finite [Buşoniu et al., 2011],
  • Planning in POMDPs when the number of observations is finite,
  • Combining planning with function approximation: local ADP methods,
  • Many applications in MCTS (Monte-Carlo Tree Search): see Teytaud, Chaslot, Bouzy, Cazenave, and many others.

SLIDE 63

Conclusion

Bandit theory = how fast one can learn or compute the best possible decisions. Many different settings:

  • Stochastic versus adversarial environments
  • Full information versus partial information
  • Bandits with many arms: unstructured or structured sets of arms
  • Dependent versus independent rewards
  • Stationary versus non-stationary reward processes (Markov, restless bandits)
  • Cumulative versus final regret
SLIDE 64

Extensions

We have seen applications in:

  • Computation of equilibria in games
  • Optimization
  • Planning and learning in MDPs

but also:

  • Online learning
  • Active learning
  • Active Monte-Carlo

Basically, any sequential decision-making problem where some information is initially missing...

SLIDE 65

Thank you!

SLIDE 66

Related references

  • J.-Y. Audibert, R. Munos, and Cs. Szepesvári, Tuning bandit algorithms in stochastic environments, ALT, 2007.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, 2002.
  • S. Bubeck and R. Munos, Open Loop Optimistic Planning, COLT, 2010.
  • S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári, Online Optimization in X-armed Bandits, NIPS, 2008. Long version: X-armed Bandits, JMLR, 2011.
  • L. Buşoniu, R. Munos, B. De Schutter, and R. Babuška, Optimistic Planning for Sparsely Stochastic Systems, ADPRL, 2011.
  • P.-A. Coquelin and R. Munos, Bandit Algorithms for Tree Search, UAI, 2007.
  • S. Gelly, Y. Wang, R. Munos, and O. Teytaud, Modification of UCT with Patterns in Monte-Carlo Go, RR INRIA, 2006.

SLIDE 67

Related references (cont'd)

  • J.-F. Hren and R. Munos, Optimistic planning in deterministic systems, EWRL, 2008.
  • M. Kearns, Y. Mansour, and A. Ng, A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes, Machine Learning, 2002.
  • R. Kleinberg, A. Slivkins, and E. Upfal, Multi-Armed Bandits in Metric Spaces, ACM Symposium on Theory of Computing, 2008.
  • L. Kocsis and Cs. Szepesvári, Bandit based Monte-Carlo Planning, ECML, 2006.
  • T. L. Lai and H. Robbins, Asymptotically Efficient Adaptive Allocation Rules, Advances in Applied Mathematics, 1985.