

SLIDE 1

MVA-RL Course

The Multi-Arm Bandit Framework

  • A. LAZARIC (SequeL Team @INRIA-Lille)

ENS Cachan - Master 2 MVA

SequeL – INRIA Lille

SLIDE 2

In This Lecture

SLIDE 3

In This Lecture

Question: which route should we take?

Problem: each day we obtain only limited feedback: the traveling time of the chosen route.

Result: if we do not repeatedly try different options, we cannot learn.

Solution: trade off between optimization and learning.

SLIDE 4

Mathematical Tools

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 5

Mathematical Tools

Concentration Inequalities

Proposition (Chernoff-Hoeffding Inequality)

Let $X_i \in [a_i, b_i]$ be $n$ independent r.v. with mean $\mu_i = \mathbb{E}[X_i]$. Then

$$\mathbb{P}\left(\Big|\sum_{i=1}^n (X_i - \mu_i)\Big| \ge \epsilon\right) \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$

SLIDE 6

Mathematical Tools

Concentration Inequalities

Proof.

$$\mathbb{P}\left(\sum_{i=1}^n (X_i - \mu_i) \ge \epsilon\right) = \mathbb{P}\left(e^{s\sum_{i=1}^n (X_i - \mu_i)} \ge e^{s\epsilon}\right)$$
$$\le e^{-s\epsilon}\,\mathbb{E}\left[e^{s\sum_{i=1}^n (X_i - \mu_i)}\right] \quad \text{(Markov inequality)}$$
$$= e^{-s\epsilon}\prod_{i=1}^n \mathbb{E}\left[e^{s(X_i - \mu_i)}\right] \quad \text{(independent random variables)}$$
$$\le e^{-s\epsilon}\prod_{i=1}^n e^{s^2 (b_i - a_i)^2/8} \quad \text{(Hoeffding inequality)}$$
$$= e^{-s\epsilon + s^2 \sum_{i=1}^n (b_i - a_i)^2/8}$$

If we choose $s = 4\epsilon / \sum_{i=1}^n (b_i - a_i)^2$, the result follows. A similar argument holds for $\mathbb{P}\left(\sum_{i=1}^n (X_i - \mu_i) \le -\epsilon\right)$.

SLIDE 7

Mathematical Tools

Concentration Inequalities

Finite sample guarantee:

$$\mathbb{P}\Bigg(\underbrace{\Big|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\Big|}_{\text{deviation}} > \underbrace{\epsilon}_{\text{accuracy}}\Bigg) \le \underbrace{2\exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right)}_{\text{confidence}}$$

SLIDE 8

Mathematical Tools

Concentration Inequalities

Finite sample guarantee:

$$\mathbb{P}\left(\Big|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\Big| > (b - a)\sqrt{\frac{\log 2/\delta}{2n}}\right) \le \delta$$

SLIDE 9

Mathematical Tools

Concentration Inequalities

Finite sample guarantee:

$$\mathbb{P}\left(\Big|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\Big| > \epsilon\right) \le \delta \quad \text{if} \quad n \ge \frac{(b - a)^2 \log 2/\delta}{2\epsilon^2}.$$
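As a quick sanity check of this sample-size rule, here is a minimal Python sketch; the function name and the Bernoulli test distribution are my own choices for illustration.

```python
import numpy as np

def hoeffding_sample_size(b_minus_a, eps, delta):
    """Smallest n guaranteeing P(|empirical mean - E[X]| > eps) <= delta."""
    return int(np.ceil(b_minus_a**2 * np.log(2.0 / delta) / (2.0 * eps**2)))

n = hoeffding_sample_size(1.0, eps=0.05, delta=0.05)  # rewards in [0, 1]

# Empirical check on Bernoulli(0.3) samples: the observed failure rate
# should be (well) below delta, since the Hoeffding bound is not tight.
rng = np.random.default_rng(0)
fails = sum(abs(rng.binomial(1, 0.3, n).mean() - 0.3) > 0.05
            for _ in range(5_000))
print(n, fails / 5_000)
```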

SLIDE 10

The General Multi-arm Bandit Problem

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 11

The General Multi-arm Bandit Problem

The Multi–armed Bandit Game

The learner has i = 1, . . . , N arms (options, experts, ...). At each round t = 1, . . . , n, at the same time:

◮ The environment chooses a vector of rewards $\{X_{i,t}\}_{i=1}^N$
◮ The learner chooses an arm $I_t$
◮ The learner receives a reward $X_{I_t,t}$
◮ The environment does not reveal the rewards of the other arms
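This protocol translates directly into a simulation loop. A minimal sketch, assuming a `policy` object with `select`/`update` methods (an interface of my own, not from the lecture):

```python
def play_bandit(policy, reward_fn, n_rounds):
    """Bandit protocol: at each round only the chosen arm's reward is revealed."""
    total = 0.0
    for t in range(n_rounds):
        arm = policy.select(t)        # learner picks I_t
        x = reward_fn(arm, t)         # environment generates X_{I_t,t}
        policy.update(arm, x)         # learner observes only this reward
        total += x
    return total
```

The concrete policies sketched later in these notes (UCB, Thompson sampling, Exp3) all fit this interface.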

SLIDE 12

The General Multi-arm Bandit Problem

The Multi–armed Bandit Game (cont’d)

The regret:

$$R_n(\mathcal{A}) = \max_{i=1,\dots,N}\mathbb{E}\left[\sum_{t=1}^n X_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right]$$

The expectation summarizes any possible source of randomness (either in X or in the algorithm).

SLIDE 13

The General Multi-arm Bandit Problem

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration

Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm ⇒ exploitation

Challenge: The learner should solve two opposite problems at once: the exploration–exploitation dilemma!

SLIDE 14

The General Multi-arm Bandit Problem

The Multi–armed Bandit Game (cont’d)

Examples

◮ Packet routing
◮ Clinical trials
◮ Web advertising
◮ Computer games
◮ Resource mining
◮ ...

SLIDE 15

The Stochastic Multi-arm Bandit Problem

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 16

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem

Definition

The environment is stochastic:

◮ Each arm has a distribution $\nu_i$ bounded in [0, 1] and characterized by an expected value $\mu_i$
◮ The rewards are i.i.d.: $X_{i,t} \sim \nu_i$

SLIDE 17

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds:

$$T_{i,n} = \sum_{t=1}^n \mathbb{I}\{I_t = i\}$$

◮ Regret:

$$R_n(\mathcal{A}) = \max_{i=1,\dots,N}\mathbb{E}\left[\sum_{t=1}^n X_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] = \max_{i=1,\dots,N}(n\mu_i) - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] = \max_{i=1,\dots,N}(n\mu_i) - \sum_{i=1}^N \mathbb{E}[T_{i,n}]\,\mu_i = n\mu_{i^*} - \sum_{i=1}^N \mathbb{E}[T_{i,n}]\,\mu_i$$

SLIDE 18

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem (cont’d)

$$R_n(\mathcal{A}) = \sum_{i \ne i^*} \mathbb{E}[T_{i,n}]\,\Delta_i, \qquad \Delta_i = \mu_{i^*} - \mu_i$$

⇒ we only need to study the expected number of pulls of the suboptimal arms.
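To make the decomposition concrete, a small worked example (the numbers below are my own, not from the lecture):

```latex
\[
N = 2,\quad \mu_1 = 0.6,\ \mu_2 = 0.5 \;\Rightarrow\; i^* = 1,\ \Delta_2 = 0.1.
\]
\[
\text{If } \mathbb{E}[T_{2,n}] = 200 \text{ over } n = 10^4 \text{ rounds, then }
R_n(\mathcal{A}) = \mathbb{E}[T_{2,n}]\,\Delta_2 = 200 \times 0.1 = 20,
\]
```

i.e., the regret is driven entirely by how often the 0.1-worse arm is pulled.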

SLIDE 19

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem (cont’d)

Optimism in Face of Uncertainty Learning (OFUL)

Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm.

Why it works:

◮ If the best possible world is correct ⇒ no regret
◮ If the best possible world is wrong ⇒ the reduction in the uncertainty is maximized

SLIDE 20

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem (cont’d)

[Figure: empirical reward distributions of four arms after 100, 200, 50, and 20 pulls]

SLIDE 21

The Stochastic Multi-arm Bandit Problem

The Stochastic Multi–armed Bandit Problem (cont’d)

Optimism in face of uncertainty

[Figure: the same four empirical reward distributions, illustrating the optimistic choice]

SLIDE 22

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm

The idea

[Figure: four arms with pull counts 10, 73, 3, and 23, with estimated rewards and confidence intervals]

SLIDE 23

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm

Show time!

SLIDE 24

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

At each round t = 1, . . . , n:

◮ Compute the score of each arm i: $B_i$ = (optimistic score of arm i)
◮ Pull arm $I_t = \arg\max_{i=1,\dots,N} B_{i,s,t}$
◮ Update the number of pulls: $T_{I_t,t} = T_{I_t,t-1} + 1$

SLIDE 25

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ):

$B_i$ = (optimistic score of arm i)
$B_{i,s,t}$ = (optimistic score of arm i if pulled s times up to round t)
$B_{i,s,t}$ = knowledge + optimism (uncertainty)

$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log 1/\delta}{2s}}$$

Optimism in face of uncertainty:
Current knowledge: average reward $\hat\mu_{i,s}$
Current uncertainty: number of pulls s
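Putting the pieces together, a minimal Python sketch of UCB. The class layout is mine, and it uses the anytime choice δ = 1/t discussed a few slides below; it plugs into the protocol loop sketched earlier.

```python
import numpy as np

class UCB:
    """UCB index: B_{i,s,t} = mu_hat_{i,s} + rho * sqrt(log(1/delta) / (2s)),
    with the anytime choice delta = 1/t."""

    def __init__(self, n_arms, rho=1.0):
        self.rho = rho
        self.counts = np.zeros(n_arms)   # T_{i,t}
        self.sums = np.zeros(n_arms)     # cumulative reward per arm

    def select(self, t):
        if np.any(self.counts == 0):     # pull every arm once first
            return int(np.argmin(self.counts))
        mu_hat = self.sums / self.counts
        bonus = self.rho * np.sqrt(np.log(t + 1) / (2 * self.counts))
        return int(np.argmax(mu_hat + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

# Two Bernoulli arms with means 0.5 and 0.6: pulls concentrate on arm 1
rng = np.random.default_rng(0)
means = np.array([0.5, 0.6])
policy = UCB(n_arms=2, rho=1.0)
for t in range(10_000):
    arm = policy.select(t)
    policy.update(arm, float(rng.binomial(1, means[arm])))
print(policy.counts)
```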

SLIDE 26

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Do you remember Chernoff-Hoeffding?

Theorem

Let $X_1, \dots, X_n$ be i.i.d. samples from a distribution bounded in [a, b]; then for any δ ∈ (0, 1)

$$\mathbb{P}\left(\Big|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\Big| > (b - a)\sqrt{\frac{\log 2/\delta}{2n}}\right) \le \delta$$

SLIDE 27

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

After s pulls of arm i:

$$\mathbb{P}\left(\mathbb{E}[X_i] \le \frac{1}{s}\sum_{t=1}^s X_{i,t} + \sqrt{\frac{\log 1/\delta}{2s}}\right) \ge 1 - \delta$$

$$\mathbb{P}\left(\mu_i \le \hat\mu_{i,s} + \sqrt{\frac{\log 1/\delta}{2s}}\right) \ge 1 - \delta$$

⇒ UCB uses an upper confidence bound on the expectation.

SLIDE 28

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Theorem

For any set of N arms with distributions bounded in [0, b], if δ = 1/t, then UCB(ρ) with ρ > 1 achieves a regret

$$R_n(\mathcal{A}) \le \sum_{i \ne i^*}\left[\frac{4b^2}{\Delta_i}\,\rho\log(n) + \Delta_i\left(\frac{3}{2} + \frac{1}{2(\rho - 1)}\right)\right]$$

SLIDE 29

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Let N = 2 with $i^* = 1$:

$$R_n(\mathcal{A}) \le O\left(\frac{1}{\Delta}\,\rho\log(n)\right)$$

Remark 1: the cumulative regret slowly increases as log(n).
Remark 2: the smaller the gap, the bigger the regret... why?

SLIDE 30

The Stochastic Multi-arm Bandit Problem

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Show time (again)!

SLIDE 31

The Stochastic Multi-arm Bandit Problem

The Worst–case Performance

Remark: the regret bound is distribution–dependent:

$$R_n(\mathcal{A}; \Delta) \le O\left(\frac{1}{\Delta}\,\rho\log(n)\right)$$

Meaning: the algorithm is able to adapt to the specific problem at hand!

Worst–case performance: what is the distribution which leads to the worst possible performance of UCB? What is the distribution–free performance of UCB?

$$R_n(\mathcal{A}) = \sup_{\Delta} R_n(\mathcal{A}; \Delta)$$

SLIDE 32

The Stochastic Multi-arm Bandit Problem

The Worst–case Performance

Problem: it seems that if ∆ → 0 then the regret tends to infinity...

... nonsense, because the regret is defined as $R_n(\mathcal{A}; \Delta) = \mathbb{E}[T_{2,n}]\,\Delta$, so if ∆ is small, the regret is also small...

In fact

$$R_n(\mathcal{A}; \Delta) = \min\left\{O\left(\frac{1}{\Delta}\,\rho\log(n)\right),\ \mathbb{E}[T_{2,n}]\,\Delta\right\}$$

SLIDE 33

The Stochastic Multi-arm Bandit Problem

The Worst–case Performance

Then

$$R_n(\mathcal{A}) = \sup_{\Delta} R_n(\mathcal{A}; \Delta) = \sup_{\Delta}\min\left\{O\left(\frac{1}{\Delta}\,\rho\log(n)\right),\ n\Delta\right\} \approx \sqrt{n} \quad \text{for } \Delta = \sqrt{1/n}$$

SLIDE 34

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB

Remark: UCB is an anytime algorithm with δ = 1/t:

$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log t}{2s}}$$

Remark: if the time horizon n is known, then the optimal choice is δ = 1/n:

$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log n}{2s}}$$

SLIDE 35

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

Intuition: UCB should pull the suboptimal arms

◮ Enough: so as to understand which arm is the best
◮ Not too much: so as to keep the regret as small as possible

The confidence 1 − δ has the following impact (similar for ρ):

◮ Big 1 − δ: high level of exploration
◮ Small 1 − δ: high level of exploitation

Solution: depending on the time horizon, we can tune the trade-off between exploration and exploitation.

SLIDE 36

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

Let’s dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

$$E = \left\{\forall i, s:\ \big|\hat\mu_{i,s} - \mu_i\big| \le \sqrt{\frac{\log 1/\delta}{2s}}\right\}$$

By Chernoff-Hoeffding, $\mathbb{P}[E] \ge 1 - nN\delta$.

At time t we pull arm i [algorithm]:

$$B_{i,T_{i,t-1}} \ge B_{i^*,T_{i^*,t-1}}$$

$$\hat\mu_{i,T_{i,t-1}} + \sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}} \ge \hat\mu_{i^*,T_{i^*,t-1}} + \sqrt{\frac{\log 1/\delta}{2T_{i^*,t-1}}}$$

On the event E we have [math]:

$$\mu_i + 2\sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}} \ge \mu_{i^*}$$

SLIDE 37

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

Assume t is the last time i is pulled, so $T_{i,n} = T_{i,t-1} + 1$; thus

$$\mu_i + 2\sqrt{\frac{\log 1/\delta}{2(T_{i,n} - 1)}} \ge \mu_{i^*}$$

Reordering [math]:

$$T_{i,n} \le \frac{\log 1/\delta}{2\Delta_i^2} + 1$$

under the event E, and thus with probability at least 1 − nNδ.

Moving to the expectation [statistics]:

$$\mathbb{E}[T_{i,n}] = \mathbb{E}[T_{i,n}\mathbb{I}_E] + \mathbb{E}[T_{i,n}\mathbb{I}_{E^C}] \le \frac{\log 1/\delta}{2\Delta_i^2} + 1 + n(nN\delta)$$

Trading off the two terms with δ = 1/n², we obtain

$$\hat\mu_{i,T_{i,t-1}} + \sqrt{\frac{2\log n}{2T_{i,t-1}}}$$

and ...

SLIDE 38

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

Trading off the two terms with δ = 1/n², we obtain

$$\hat\mu_{i,T_{i,t-1}} + \sqrt{\frac{2\log n}{2T_{i,t-1}}}$$

and

$$\mathbb{E}[T_{i,n}] \le \frac{\log n}{\Delta_i^2} + 1 + N$$

SLIDE 39

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

Multi–armed Bandit: the same for δ = 1/t and δ = 1/n...
... almost (i.e., in expectation)

SLIDE 40

The Stochastic Multi-arm Bandit Problem

Tuning the confidence δ of UCB (cont’d)

The value–at–risk of the regret for UCB-anytime

SLIDE 41

The Stochastic Multi-arm Bandit Problem

Tuning the ρ of UCB (cont’d)

UCB values (for the δ = 1/n algorithm):

$$B_{i,s} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log n}{2s}}$$

Theory:
◮ ρ < 0.5: polynomial regret w.r.t. n
◮ ρ > 0.5: logarithmic regret w.r.t. n

Practice: ρ = 0.2 is often the best choice.

SLIDE 42

The Stochastic Multi-arm Bandit Problem

Improvements over UCB: UCB-V

Idea: use Bernstein bounds with the empirical variance.

Algorithm:

$$B_{i,s,t} = \hat\mu_{i,s} + \sqrt{\frac{\log t}{2s}} \qquad\Rightarrow\qquad R_n \le O\left(\frac{1}{\Delta}\log n\right)$$

$$B^V_{i,s,t} = \hat\mu_{i,s} + \sqrt{\frac{\hat\sigma^2_{i,s}\log t}{s}} + \frac{8\log t}{3s} \qquad\Rightarrow\qquad R_n \le O\left(\frac{\sigma^2}{\Delta}\log n\right)$$

SLIDE 43

The Stochastic Multi-arm Bandit Problem

Improvements over UCB: KL-UCB

Idea: use Kullback–Leibler bounds, which are tighter than other bounds.

Algorithm: still index–based, but a bit more complicated.

$$R_n \le O\left(\frac{1}{\Delta}\log n\right) \qquad\Rightarrow\qquad R_n \le O\left(\frac{1}{KL(\nu_i, \nu_{i^*})}\log n\right)$$

SLIDE 44

The Stochastic Multi-arm Bandit Problem

Improvements over UCB: Thompson strategy

Idea: keep a distribution over the possible values of $\mu_i$.

Algorithm: Bayesian approach; compute the posterior distributions given the samples.
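The slide does not fix a reward model; a minimal sketch for the standard Bernoulli instantiation with Beta priors (the class name and test setup are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

class ThompsonBernoulli:
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors."""

    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # 1 + number of observed successes
        self.beta = np.ones(n_arms)   # 1 + number of observed failures

    def select(self, t):
        # Draw one plausible mean per arm from its posterior, play the best
        return int(np.argmax(rng.beta(self.alpha, self.beta)))

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

means = [0.5, 0.6]
policy = ThompsonBernoulli(2)
for t in range(10_000):
    arm = policy.select(t)
    policy.update(arm, rng.binomial(1, means[arm]))
print(policy.alpha + policy.beta - 2)  # pull counts; most go to arm 1
```

Randomizing over posterior draws, rather than adding a deterministic bonus, is what produces the exploration here.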

SLIDE 45

The Stochastic Multi-arm Bandit Problem

Back to UCB: the Lower Bound

Theorem

For any stochastic bandit $\{\nu_i\}$, any algorithm $\mathcal{A}$ has a regret

$$\lim_{n\to\infty}\frac{R_n}{\log n} \ge \sum_{i \ne i^*}\frac{\Delta_i}{\inf_{\nu} KL(\nu_i, \nu)}$$

Problem: this is just asymptotic.
Open Question: what is the finite-time lower bound?

SLIDE 46

The Non-Stochastic Multi-arm Bandit Problem

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 47

The Non-Stochastic Multi-arm Bandit Problem

The Non–Stochastic Multi–armed Bandit Problem

Definition

The environment is adversarial:

◮ Arms have no fixed distribution
◮ The rewards $X_{i,t}$ are arbitrarily chosen by the environment

SLIDE 48

The Non-Stochastic Multi-arm Bandit Problem

The Non–Stochastic Multi–armed Bandit Problem (cont’d)

The (non–stochastic bandit) regret:

$$R_n(\mathcal{A}) = \max_{i=1,\dots,N}\mathbb{E}\left[\sum_{t=1}^n X_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] = \max_{i=1,\dots,N}\sum_{t=1}^n X_{i,t} - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right]$$

SLIDE 49

The Non-Stochastic Multi-arm Bandit Problem

The Exponentially Weighted Average Forecaster

Initialize the weights $w_{i,0} = 1$. At each round:

◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)

$$\hat p_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}$$

◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Observe the rewards $\{X_{i,t}\}$
◮ Receive a reward $X_{I_t,t}$
◮ Update: $w_{i,t} = w_{i,t-1}\exp(\eta X_{i,t})$

SLIDE 50

The Non-Stochastic Multi-arm Bandit Problem

The Non–Stochastic Multi–armed Bandit Problem (cont’d)

Problem: we only observe the reward of the specific arm chosen at time t! (i.e., only $X_{I_t,t}$ is observed)

SLIDE 51

The Non-Stochastic Multi-arm Bandit Problem

The Exponentially Weighted Average Forecaster

Initialize the weights $w_{i,0} = 1$. At each round:

◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)

$$\hat p_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}$$

◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Observe the rewards $\{X_{i,t}\}$
◮ Receive a reward $X_{I_t,t}$
◮ Update: $w_{i,t} = w_{i,t-1}\exp(\eta X_{i,t})$ ⇒ this update is not possible

SLIDE 52

The Non-Stochastic Multi-arm Bandit Problem

The Non–Stochastic Multi–armed Bandit Problem (cont’d)

We use the importance weight trick:

$$\hat X_{i,t} = \begin{cases}\dfrac{X_{i,t}}{\hat p_{i,t}} & \text{if } i = I_t\\[4pt] 0 & \text{otherwise}\end{cases}$$

Why it is a good idea:

$$\mathbb{E}\big[\hat X_{i,t}\big] = \frac{X_{i,t}}{\hat p_{i,t}}\,\hat p_{i,t} + 0\,(1 - \hat p_{i,t}) = X_{i,t}$$

$\hat X_{i,t}$ is an unbiased estimator of $X_{i,t}$.

SLIDE 53

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Exp3: Exponential-weight algorithm for Exploration and Exploitation.

Initialize the weights $w_{i,0} = 1$. At each round:

◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)

$$\hat p_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}$$

◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Receive a reward $X_{I_t,t}$
◮ Update: $w_{i,t} = w_{i,t-1}\exp\big(\eta \hat X_{i,t}\big)$

SLIDE 54

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Question: is this enough? Is this algorithm actually exploring enough?

Answer: more or less...

◮ Exp3 has a small regret in expectation
◮ Exp3 might have large deviations with high probability (i.e., from time to time it may concentrate $\hat p_t$ on the wrong arm for too long and then incur a large regret)

SLIDE 55

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Fix: add some extra uniform exploration.

Initialize the weights $w_{i,0} = 1$. At each round:

◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)

$$\hat p_{i,t} = (1 - \gamma)\frac{w_{i,t-1}}{W_{t-1}} + \frac{\gamma}{N}$$

◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Receive a reward $X_{I_t,t}$
◮ Update: $w_{i,t} = w_{i,t-1}\exp\big(\eta \hat X_{i,t}\big)$
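A minimal Python sketch of this exploration-augmented Exp3; the class interface matches the earlier protocol sketch, and the toy reward sequence and parameter values are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

class Exp3:
    """Exp3 with uniform exploration: p_i = (1 - gamma) w_i / W + gamma / N."""

    def __init__(self, n_arms, gamma=0.07):
        self.gamma = gamma
        self.eta = gamma                # theorem below suggests eta = gamma
        self.log_w = np.zeros(n_arms)   # log-weights, for numerical stability
        self.p = np.full(n_arms, 1.0 / n_arms)

    def select(self, t):
        w = np.exp(self.log_w - self.log_w.max())
        self.p = (1 - self.gamma) * w / w.sum() + self.gamma / len(w)
        return int(rng.choice(len(w), p=self.p))

    def update(self, arm, reward):
        x_hat = reward / self.p[arm]          # importance-weighted estimate
        self.log_w[arm] += self.eta * x_hat   # only the pulled arm's weight moves

# Toy run with rewards in [0, 1]: arm 1 pays more often than arm 0
policy = Exp3(n_arms=2, gamma=0.07)
picks = np.zeros(2)
for t in range(20_000):
    arm = policy.select(t)
    x = float(t % 2 == 0) if arm == 0 else float(t % 3 != 0)
    policy.update(arm, x)
    picks[arm] += 1
print(picks / picks.sum())  # should favor arm 1 (average reward 2/3 vs 1/2)
```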

SLIDE 56

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Theorem

If Exp3 is run with γ = η, then it achieves a regret

$$R_n(\mathcal{A}) = \max_{i=1,\dots,N}\sum_{t=1}^n X_{i,t} - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] \le (e - 1)\gamma G_{\max} + \frac{N\log N}{\gamma}$$

with $G_{\max} = \max_{i=1,\dots,N}\sum_{t=1}^n X_{i,t}$.

SLIDE 57

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Theorem

If Exp3 is run with

$$\gamma = \eta = \sqrt{\frac{N\log N}{(e - 1)n}}$$

then it achieves a regret $R_n(\mathcal{A}) \le O\big(\sqrt{nN\log N}\big)$.

SLIDE 58

The Non-Stochastic Multi-arm Bandit Problem

The Exp3 Algorithm

Comparison with online learning:

$$R_n(\text{Exp3}) \le O\big(\sqrt{nN\log N}\big) \qquad R_n(\text{EWA}) \le O\big(\sqrt{n\log N}\big)$$

Intuition: in online learning at each round we obtain N feedbacks, while in bandits we receive only 1 feedback.

SLIDE 59

The Non-Stochastic Multi-arm Bandit Problem

The Improved-Exp3 Algorithm

Initialize the weights $w_{i,0} = 1$. At each round:

◮ Compute (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)

$$\hat p_{i,t} = (1 - \gamma)\frac{w_{i,t-1}}{W_{t-1}} + \frac{\gamma}{N}$$

◮ Choose the arm at random: $I_t \sim \hat p_t$
◮ Receive a reward $X_{I_t,t}$
◮ Compute

$$\tilde X_{i,t} = \hat X_{i,t} + \frac{\beta}{\hat p_{i,t}}$$

◮ Update: $w_{i,t} = w_{i,t-1}\exp\big(\eta \tilde X_{i,t}\big)$

SLIDE 60

The Non-Stochastic Multi-arm Bandit Problem

The Improved-Exp3 Algorithm

Theorem

If Improved-Exp3 is run with parameters in the ranges

$$\gamma \le \frac{1}{2},\qquad 0 \le \eta \le \frac{\gamma}{2N},\qquad \sqrt{\frac{1}{nN}\log\frac{N}{\delta}} \le \beta \le 1$$

then it achieves a regret

$$R^{HP}_n(\mathcal{A}) \le n\big(\gamma + \eta(1 + \beta)N\big) + \frac{\log N}{\eta} + 2nN\beta$$

with probability at least 1 − δ.

SLIDE 61

The Non-Stochastic Multi-arm Bandit Problem

The Improved-Exp3 Algorithm

Theorem

If Improved-Exp3 is run with

$$\beta = \sqrt{\frac{1}{nN}\log\frac{N}{\delta}},\qquad \gamma = \frac{4N\beta}{3 + \beta},\qquad \eta = \frac{\gamma}{2N}$$

then it achieves a regret

$$R^{HP}_n(\mathcal{A}) \le \frac{11}{2}\sqrt{nN\log(N/\delta)} + \frac{\log N}{2}$$

with probability at least 1 − δ.

SLIDE 62

Connections to Game Theory

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 63

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

A two–player zero–sum game (Red chooses a row, Blue chooses a column; each cell lists Red's payoff, Blue's payoff):

        A          B          C
1    30, −30   −10, 10    20, −20
2    10, −10    20, −20   −20, 20

Nash equilibrium: A set of strategies is a Nash equilibrium if no player can do better by unilaterally changing his strategy.

Red: take action 1 with prob. 4/7 and action 2 with prob. 3/7
Blue: take action A with prob. 0, action B with prob. 4/7, and action C with prob. 3/7

Value of the game: V = 20/7 (reward of Red at the equilibrium)

SLIDE 64

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

At each round t:

◮ Row player computes a mixed strategy $\hat p_t = (\hat p_{1,t}, \dots, \hat p_{N,t})$
◮ Column player computes a mixed strategy $\hat q_t = (\hat q_{1,t}, \dots, \hat q_{M,t})$
◮ Row player selects action $I_t \in \{1, \dots, N\}$
◮ Column player selects action $J_t \in \{1, \dots, M\}$
◮ Row player suffers $\ell(I_t, J_t)$
◮ Column player suffers $-\ell(I_t, J_t)$

Value of the game:

$$V = \max_q \min_p \bar\ell(p, q) \quad \text{with} \quad \bar\ell(p, q) = \sum_{i=1}^N\sum_{j=1}^M p_i q_j \,\ell(i, j)$$

SLIDE 65

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Question: what if the two players are both bandit algorithms (e.g., Exp3)?

Row player: a bandit algorithm is able to minimize

$$R_n(\text{row}) = \sum_{t=1}^n \ell_{I_t,J_t} - \min_{i=1,\dots,N}\sum_{t=1}^n \ell_{i,J_t}$$

Column player: a bandit algorithm is able to minimize

$$R_n(\text{col}) = \sum_{t=1}^n \ell_{I_t,J_t} - \min_{j=1,\dots,M}\sum_{t=1}^n \ell_{I_t,j}$$

SLIDE 66

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Theorem

If both the row and the column player play according to a Hannan-consistent strategy, then

$$\limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) = V$$
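As a concrete illustration of this theorem, a small self-play experiment on the payoff matrix of the earlier example: two Exp3-style players, each observing only its own payoff for the joint play. The rescaling of payoffs to [0, 1] and all parameter values are my own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Red's payoffs from the slide-63 example; Blue's payoff is the negation
PAYOFF = np.array([[30.0, -10.0, 20.0],
                   [10.0,  20.0, -20.0]])

def sample(log_w, gamma):
    """Exp3 distribution: normalized weights mixed with uniform exploration."""
    w = np.exp(log_w - log_w.max())
    p = (1 - gamma) * w / w.sum() + gamma / len(w)
    return int(rng.choice(len(w), p=p)), p

n, gamma = 200_000, 0.02
lw_red, lw_blue = np.zeros(2), np.zeros(3)
counts_red, counts_blue = np.zeros(2), np.zeros(3)
total = 0.0
for t in range(n):
    i, p_red = sample(lw_red, gamma)
    j, p_blue = sample(lw_blue, gamma)
    g = PAYOFF[i, j]
    # importance-weighted updates on payoffs rescaled from [-30, 30] to [0, 1]
    lw_red[i] += gamma * ((g + 30) / 60) / p_red[i]
    lw_blue[j] += gamma * ((30 - g) / 60) / p_blue[j]
    counts_red[i] += 1
    counts_blue[j] += 1
    total += g

print(counts_red / n)   # should approach (4/7, 3/7)
print(counts_blue / n)  # should approach (0, 4/7, 3/7)
print(total / n)        # should approach the value V = 20/7 ≈ 2.86
```

Note that neither player ever sees the payoff matrix, the opponent's loss, or the opponent's action, matching the generality remarks below.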

SLIDE 67

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Theorem

The empirical distributions of play

$$\hat p_{i,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{I}\{I_t = i\} \qquad \hat q_{j,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{I}\{J_t = j\}$$

induce a product distribution $\hat p_n \times \hat q_n$ which converges to the set of Nash equilibria $p \times q$.

SLIDE 68

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Proof idea. Since $\bar\ell(p, J_t)$ is linear over the simplex, the minimum is at one of the corners [math]:

$$\min_{i=1,\dots,N}\frac{1}{n}\sum_{t=1}^n \ell(i, J_t) = \min_p \frac{1}{n}\sum_{t=1}^n \bar\ell(p, J_t)$$

We consider the empirical probability of the column player [def]:

$$\hat q_{j,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{I}\{J_t = j\}$$

Elaborating on it [math]:

$$\min_p \frac{1}{n}\sum_{t=1}^n \bar\ell(p, J_t) = \min_p \sum_{j=1}^M \hat q_{j,n}\,\bar\ell(p, j) = \min_p \bar\ell(p, \hat q_n) \le \max_q\min_p \bar\ell(p, q) = V$$

SLIDE 69

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Proof idea. By the definition of a Hannan-consistent strategy [def]:

$$\limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) = \min_{i=1,\dots,N}\frac{1}{n}\sum_{t=1}^n \ell(i, J_t)$$

Then

$$\limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) \le V$$

If we do the same for the other player [zero–sum game]:

$$\limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) \ge V$$

SLIDE 70

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Question: how fast do they converge to the Nash equilibrium?

Answer: it depends on the specific algorithm. For EWA(η), we know that

$$\sum_{t=1}^n \ell(I_t, J_t) - \min_{i=1,\dots,N}\sum_{t=1}^n \ell(i, J_t) \le \frac{\log N}{\eta} + \frac{n\eta}{8} + \sqrt{\frac{n}{2}\log\frac{1}{\delta}}$$

SLIDE 71

Connections to Game Theory

Repeated Two–Player Zero–Sum Games

Generality of the results:

◮ Players do not know the payoff matrix
◮ Players do not observe the loss of the other player
◮ Players do not even observe the action of the other player

SLIDE 72

Connections to Game Theory

Internal Regret and Correlated Equilibria

External (expected) regret:

$$R_n = \sum_{t=1}^n \bar\ell(\hat p_t, y_t) - \min_{i=1,\dots,N}\sum_{t=1}^n \ell(i, y_t) = \max_{i=1,\dots,N}\sum_{t=1}^n\sum_{j=1}^N \hat p_{j,t}\big(\ell(j, y_t) - \ell(i, y_t)\big)$$

Internal (expected) regret:

$$R^I_n = \max_{i,j=1,\dots,N}\sum_{t=1}^n \hat p_{i,t}\big(\ell(i, y_t) - \ell(j, y_t)\big)$$

SLIDE 73

Connections to Game Theory

Internal Regret and Correlated Equilibria

Internal (expected) regret:

$$R^I_n = \max_{i,j=1,\dots,N}\sum_{t=1}^n \hat p_{i,t}\big(\ell(i, y_t) - \ell(j, y_t)\big)$$

Intuition: an algorithm has small internal regret if, for each pair of experts (i, j), the learner does not regret not having followed expert j each time it followed expert i.

SLIDE 74

Connections to Game Theory

Internal Regret and Correlated Equilibria

Theorem

Given a K–person game with a set of correlated equilibria C, if all the players are internal–regret minimizers, then the distance between the empirical distribution of plays and the set of correlated equilibria C converges to 0.

SLIDE 75

Connections to Game Theory

Nash Equilibria in Extensive Form Games

A powerful model for sequential games:

◮ Checkers / Chess / Go
◮ Poker
◮ Bargaining
◮ Monitoring
◮ Patrolling
◮ ...

SLIDE 76

Connections to Game Theory

Nash Equilibria in Extensive Form Games

SLIDE 77

Connections to Game Theory

Nash Equilibria in Extensive Form Games

SLIDE 78

Connections to Game Theory

Nash Equilibria in Extensive Form Games

No details about the algorithm... but...

Theorem

If player k selects actions according to the counterfactual regret minimization algorithm, then it achieves a regret

$$R_{k,T} \le \frac{\#\text{states}\,\sqrt{\#\text{actions}}}{\sqrt{T}}$$

Theorem

In a two–player zero–sum extensive form game, counterfactual regret minimization achieves a 2ε-Nash equilibrium, with

$$\epsilon \le \frac{\#\text{states}\,\sqrt{\#\text{actions}}}{\sqrt{T}}$$

SLIDE 79

Other Stochastic Multi-arm Bandit Problems

Outline

Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

SLIDE 80

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Motivating Examples

◮ Find the best shortest path in a limited number of days
◮ Maximize the confidence about the best treatment after a finite number of patients
◮ Discover the best advertisements after a training phase
◮ ...

SLIDE 81

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Objective: given a fixed budget n, return the best arm $i^* = \arg\max_i \mu_i$ at the end of the experiment.

Measure of performance: the probability of error $\mathbb{P}[J_n \ne i^*]$:

$$\mathbb{P}[J_n \ne i^*] \le \sum_{i=1}^N \exp\left(-T_{i,n}\Delta_i^2\right)$$

Algorithm idea: mimic the behavior of the optimal strategy

$$T_{i,n} = \frac{1/\Delta_i^2}{\sum_{j=1}^N 1/\Delta_j^2}\, n$$

SLIDE 82

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define (with $\overline{\log}(N) = \frac{1}{2} + \sum_{i=2}^N 1/i$)

$$n_k = \left\lceil\frac{1}{\overline{\log}(N)}\,\frac{n - N}{N + 1 - k}\right\rceil$$

◮ Set of active arms $A_k$ at phase k ($A_1 = \{1, \dots, N\}$)
◮ For each phase k = 1, . . . , N − 1 (see the sketch after this list):
  ◮ For each arm $i \in A_k$, pull arm i for $n_k - n_{k-1}$ rounds
  ◮ Remove the worst arm: $A_{k+1} = A_k \setminus \arg\min_{i\in A_k}\hat\mu_{i,n_k}$
◮ Return the only remaining arm $J_n = A_N$
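A minimal Python sketch of Successive Reject; the Bernoulli arms and function layout are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def successive_reject(means, budget):
    """Successive Reject: eliminate the worst empirical arm phase by phase."""
    N = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, N + 1))
    n_k = lambda k: int(np.ceil((budget - N) / (log_bar * (N + 1 - k))))
    active = list(range(N))
    pulls, sums = np.zeros(N), np.zeros(N)
    prev = 0
    for k in range(1, N):
        for i in active:
            for _ in range(n_k(k) - prev):     # pull each active arm
                sums[i] += rng.binomial(1, means[i])
                pulls[i] += 1
        prev = n_k(k)
        worst = min(active, key=lambda i: sums[i] / pulls[i])
        active.remove(worst)                    # reject the worst arm
    return active[0]                            # the surviving arm J_n

print(successive_reject([0.3, 0.4, 0.5, 0.7], budget=4000))  # likely arm 3
```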

SLIDE 83

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

Theorem

The Successive Reject algorithm has a probability of error

$$\mathbb{P}[J_n \ne i^*] \le \frac{N(N - 1)}{2}\exp\left(-\frac{n - N}{\overline{\log}(N)\,H_2}\right)$$

with $H_2 = \max_{i=1,\dots,N} i\,\Delta_{(i)}^{-2}$.

SLIDE 84

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

◮ Define an exploration parameter a
◮ Compute

$$B_{i,s} = \hat\mu_{i,s} + \sqrt{\frac{a}{s}}$$

◮ Select $I_t = \arg\max_i B_{i,T_{i,t-1}}$
◮ At the end return $J_n = \arg\max_i \hat\mu_{i,T_{i,n}}$

SLIDE 85

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

Theorem

The UCB-E algorithm with $a = \frac{25}{36}\,\frac{n - N}{H_1}$ has a probability of error

$$\mathbb{P}[J_n \ne i^*] \le 2nN\exp\left(-\frac{2a}{25}\right)$$

with $H_1 = \sum_{i=1}^N 1/\Delta_i^2$.

SLIDE 86

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

SLIDE 87

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Motivating Examples

◮ N production lines
◮ The test of the performance of a line is expensive
◮ We want an accurate estimation of the performance of each production line

SLIDE 88

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means $\hat\mu_{i,t}$ which is as accurate as possible for all the arms.

Notice: given an arm with mean $\mu_i$ and variance $\sigma_i^2$, if it is pulled $T_{i,n}$ times, then

$$L_{i,n} = \mathbb{E}\big[(\hat\mu_{i,T_{i,n}} - \mu_i)^2\big] = \frac{\sigma_i^2}{T_{i,n}} \qquad L_n = \max_i L_{i,n}$$

SLIDE 89

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Problem: which numbers of pulls $(T_{1,n}, \dots, T_{N,n})$ (such that $\sum_i T_{i,n} = n$) minimize the loss?

$$(T^*_{1,n}, \dots, T^*_{N,n}) = \arg\min_{(T_{1,n},\dots,T_{N,n})} L_n$$

Answer:

$$T^*_{i,n} = \frac{\sigma_i^2}{\sum_{j=1}^N \sigma_j^2}\,n \qquad L^*_n = \frac{\sum_{i=1}^N \sigma_i^2}{n} = \frac{\Sigma}{n}$$

SLIDE 90

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means $\hat\mu_{i,t}$ which is as accurate as possible for all the arms.

Measure of performance: the regret on the quadratic error

$$R_n(\mathcal{A}) = L_n(\mathcal{A}) - \frac{\sum_{i=1}^N \sigma_i^2}{n}$$

Algorithm idea: mimic the behavior of the optimal strategy

$$T_{i,n} = \frac{\sigma_i^2}{\sum_{j=1}^N \sigma_j^2}\,n = \lambda_i n$$

SLIDE 91

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

A UCB–based strategy. At each time step t = 1, . . . , n:

◮ Estimate

$$\hat\sigma^2_{i,T_{i,t-1}} = \frac{1}{T_{i,t-1}}\sum_{s=1}^{T_{i,t-1}} X_{i,s}^2 - \hat\mu^2_{i,T_{i,t-1}}$$

◮ Compute

$$B_{i,t} = \frac{1}{T_{i,t-1}}\left(\hat\sigma^2_{i,T_{i,t-1}} + 5\sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}}\right)$$

◮ Pull arm $I_t = \arg\max_i B_{i,t}$
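A minimal Python sketch of this variance-driven allocation; the Gaussian test arms, function name, and δ value are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def active_allocation(sample_fns, n, delta=0.01):
    """Pull the arm with the largest optimistic index
    B_{i,t} = (var_hat_i + 5 sqrt(log(1/delta) / (2 T_i))) / T_i."""
    N = len(sample_fns)
    xs = [[fn()] for fn in sample_fns]   # one initial pull per arm
    for t in range(N, n):
        idx = []
        for i in range(N):
            s = len(xs[i])
            var_hat = np.mean(np.square(xs[i])) - np.mean(xs[i]) ** 2
            idx.append((var_hat + 5 * np.sqrt(np.log(1 / delta) / (2 * s))) / s)
        i = int(np.argmax(idx))
        xs[i].append(sample_fns[i]())
    return [float(np.mean(x)) for x in xs], [len(x) for x in xs]

# Two Gaussian arms with variances 1 and 4: pulls should track the variances
fns = [lambda: rng.normal(0.0, 1.0), lambda: rng.normal(0.0, 2.0)]
means, counts = active_allocation(fns, n=5_000)
print(counts)  # roughly proportional to (1, 4), mimicking T*_{i,n}
```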

SLIDE 92

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Theorem

The UCB–based algorithm achieves a regret

$$R_n(\mathcal{A}) \le \frac{98\log(n)}{n^{3/2}\lambda_{\min}^{5/2}} + O\left(\frac{\log n}{n^2}\right)$$


SLIDE 94

Other Stochastic Multi-arm Bandit Problems

Reinforcement Learning

Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr