

slide-1
SLIDE 1

MVA-RL Course

The Exploration-Exploitation Dilemma

  • A. LAZARIC (SequeL Team @INRIA-Lille)

ENS Cachan - Master 2 MVA

SequeL – INRIA Lille

slide-2
SLIDE 2

The Exploration-Exploitation Dilemma

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 2/95

slide-3
SLIDE 3

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 2/95

slide-4
SLIDE 4

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at according to a suitable exploration policy
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor
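A minimal sketch of this loop in Python, assuming an ε-greedy exploration policy and a toy tabular environment with a `reset()`/`step()` interface (the environment API and hyperparameters are illustrative assumptions, not part of the slides):

```python
import numpy as np

def q_learning(env, n_episodes, n_states, n_actions,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            # exploration policy: epsilon-greedy over the current Q estimates
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[x]))
            x_next, r, done = env.step(a)
            # temporal difference for Q-learning
            delta = r + gamma * np.max(Q[x_next]) - Q[x, a]
            Q[x, a] += alpha * delta
            x = x_next
    return Q
```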

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 3/95

slide-5
SLIDE 5

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at = arg maxa Q(xt, a)
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 4/95

slide-6
SLIDE 6

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at = arg maxa Q(xt, a)
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor

⇒ no convergence

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 4/95

slide-7
SLIDE 7

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at ∼ U(A)
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 5/95

slide-8
SLIDE 8

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at ∼ U(A)
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor

⇒ very poor rewards

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 5/95

slide-9
SLIDE 9

The Exploration-Exploitation Dilemma

Tools Contextual Linear Bandit Stochastic Multi-Armed Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 6/95

slide-10
SLIDE 10

Mathematical Tools

Concentration Inequalities

Proposition (Chernoff-Hoeffding Inequality)

Let Xi ∈ [ai, bi] be n independent random variables with means µi = E[Xi]. Then

P( | Σ_{i=1..n} (Xi − µi) | ≥ ε ) ≤ 2 exp( −2ε² / Σ_{i=1..n} (bi − ai)² ).
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 7/95

slide-11
SLIDE 11

Mathematical Tools

Concentration Inequalities

Proof.

P( Σ_{i=1..n} (Xi − µi) ≥ ε )
  = P( exp( s Σ_{i=1..n} (Xi − µi) ) ≥ exp(sε) )
  ≤ e^{−sε} E[ exp( s Σ_{i=1..n} (Xi − µi) ) ]            (Markov inequality)
  = e^{−sε} Π_{i=1..n} E[ exp( s (Xi − µi) ) ]            (independent random variables)
  ≤ e^{−sε} Π_{i=1..n} exp( s² (bi − ai)² / 8 )           (Hoeffding's lemma)
  = exp( −sε + s² Σ_{i=1..n} (bi − ai)² / 8 ).

If we choose s = 4ε / Σ_{i=1..n} (bi − ai)², the result follows.
A similar argument holds for P( Σ_{i=1..n} (Xi − µi) ≤ −ε ).
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 8/95

slide-12
SLIDE 12

Mathematical Tools

Concentration Inequalities

Finite-sample guarantee:

P( | (1/n) Σ_{t=1..n} Xt − E[X1] | > ε ) ≤ 2 exp( −2nε² / (b − a)² )

where the left-hand side measures the deviation, ε is the accuracy, and the right-hand side is the confidence.
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 9/95

slide-13
SLIDE 13

Mathematical Tools

Concentration Inequalities

Finite-sample guarantee:

P( | (1/n) Σ_{t=1..n} Xt − E[X1] | > (b − a) √( log(2/δ) / (2n) ) ) ≤ δ
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 10/95

slide-14
SLIDE 14

Mathematical Tools

Concentration Inequalities

Finite-sample guarantee:

P( | (1/n) Σ_{t=1..n} Xt − E[X1] | > ε ) ≤ δ    if    n ≥ (b − a)² log(2/δ) / (2ε²).
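A quick numeric check of this sample-size condition (the values of b − a, ε, and δ below are illustrative assumptions):

```python
import math

b_minus_a, eps, delta = 1.0, 0.05, 0.01
# n >= (b-a)^2 log(2/delta) / (2 eps^2) guarantees accuracy eps with confidence 1 - delta
n_min = (b_minus_a ** 2) * math.log(2 / delta) / (2 * eps ** 2)
print(math.ceil(n_min))  # ~1060 samples for |error| <= 0.05 with probability 0.99
```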

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 11/95

slide-15
SLIDE 15

Mathematical Tools

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 12/95

slide-16
SLIDE 16

Mathematical Tools

Reducing RL down to Multi-Armed Bandit

Definition (Markov decision process)

A Markov decision process is defined as a tuple M = (X, A, p, r):

◮ X is the state space
◮ A is the action space
◮ p(y|x, a) is the transition probability
◮ r(x, a, y) is the reward of transition (x, a, y)

⇒ r(a) is the reward of action a

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 13/95

slide-17
SLIDE 17

Mathematical Tools

Notice For coherence with the bandit literature we use the notation

◮ i = 1, . . . , K: set of possible actions (arms)
◮ t = 1, . . . , n: time
◮ It: action selected at time t
◮ Xi,t: reward of action i at time t

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 14/95

slide-18
SLIDE 18

Mathematical Tools

Learning the Optimal Policy

Objective: learn the optimal policy π∗ as efficiently as possible

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 15/95

slide-19
SLIDE 19

Mathematical Tools

Learning the Optimal Policy

Objective: learn the optimal policy π∗ as efficiently as possible

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at
  3.2 Observe next state xt+1 and reward rt
  3.3 Set t = t + 1

EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 15/95

slide-20
SLIDE 20

Mathematical Tools

The Multi–armed Bandit Protocol

The learner has i = 1, . . . , K arms (actions) At each round t = 1, . . . , n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 16/95

slide-21
SLIDE 21

Mathematical Tools

The Multi–armed Bandit Protocol

The learner has i = 1, . . . , K arms (actions) At each round t = 1, . . . , n

◮ At the same time

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 16/95

slide-22
SLIDE 22

Mathematical Tools

The Multi–armed Bandit Protocol

The learner has i = 1, . . . , K arms (actions) At each round t = 1, . . . , n

◮ At the same time

◮ The environment chooses a vector of rewards {Xi,t}_{i=1..K}

◮ The learner chooses an arm It

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 16/95

slide-23
SLIDE 23

Mathematical Tools

The Multi–armed Bandit Protocol

The learner has i = 1, . . . , K arms (actions) At each round t = 1, . . . , n

◮ At the same time

◮ The environment chooses a vector of rewards {Xi,t}_{i=1..K}

◮ The learner chooses an arm It

◮ The learner receives a reward XIt,t

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 16/95

slide-24
SLIDE 24

Mathematical Tools

The Multi–armed Bandit Protocol

The learner has i = 1, . . . , K arms (actions) At each round t = 1, . . . , n

◮ At the same time

◮ The environment chooses a vector of rewards {Xi,t}_{i=1..K}

◮ The learner chooses an arm It

◮ The learner receives a reward XIt,t
◮ The environment does not reveal the rewards of the other arms

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 16/95

slide-25
SLIDE 25

Mathematical Tools

The Multi–armed Bandit Game (cont’d)

The regret

Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 17/95

slide-26
SLIDE 26

Mathematical Tools

The Multi–armed Bandit Game (cont’d)

The regret

Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

The expectation summarizes any possible source of randomness (either in X or in the algorithm).

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 17/95

slide-27
SLIDE 27

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-28
SLIDE 28

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-29
SLIDE 29

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms Problem 2: Whenever the learner pulls a bad arm, it suffers some regret

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-30
SLIDE 30

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-31
SLIDE 31

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm Challenge: The learner should solve two opposite problems!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-32
SLIDE 32

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm Challenge: The learner should solve two opposite problems!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-33
SLIDE 33

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm ⇒ exploitation Challenge: The learner should solve two opposite problems!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-34
SLIDE 34

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm ⇒ exploitation Challenge: The learner should solve the exploration-exploitation dilemma!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-35
SLIDE 35

Mathematical Tools

The Multi–armed Bandit Game (cont’d)

Examples

◮ Packet routing
◮ Clinical trials
◮ Web advertising
◮ Computer games
◮ Resource mining
◮ ...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 19/95

slide-36
SLIDE 36

Mathematical Tools

The Stochastic Multi–armed Bandit Problem

Definition

The environment is stochastic

◮ Each arm has a distribution νi bounded in [0, 1] and

characterized by an expected value µi

◮ The rewards are i.i.d. Xi,t ∼ νi (as in the MDP model)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 20/95

slide-37
SLIDE 37

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-38
SLIDE 38

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-39
SLIDE 39

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = max_{i=1,...,K} (n µi) − E[ Σ_{t=1..n} XIt,t ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-40
SLIDE 40

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = max_{i=1,...,K} (n µi) − Σ_{i=1..K} E[Ti,n] µi

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-41
SLIDE 41

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = n µi∗ − Σ_{i=1..K} E[Ti,n] µi

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-42
SLIDE 42

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = Σ_{i≠i∗} E[Ti,n] (µi∗ − µi)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-43
SLIDE 43

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = Σ_{i≠i∗} E[Ti,n] ∆i

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-44
SLIDE 44

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = Σ_{i≠i∗} E[Ti,n] ∆i

◮ Gap ∆i = µi∗ − µi

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-45
SLIDE 45

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Rn(A) = Σ_{i≠i∗} E[Ti,n] ∆i    ⇒    we only need to study the expected number of pulls of the suboptimal arms

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 22/95

slide-46
SLIDE 46

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Optimism in Face of Uncertainty Learning (OFUL) Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm.

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 23/95

slide-47
SLIDE 47

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Optimism in Face of Uncertainty Learning (OFUL) Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm. Why it works:

◮ If the best possible world is correct ⇒ no regret
◮ If the best possible world is wrong ⇒ the reduction in the uncertainty is maximized

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 23/95

slide-48
SLIDE 48

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm

The idea

[Figure: reward estimates and confidence intervals for 4 arms pulled 10, 73, 3, and 23 times; x-axis: arms, y-axis: reward]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 24/95

slide-49
SLIDE 49

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm

Show time!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 25/95

slide-50
SLIDE 50

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

At each round t = 1, . . . , n

◮ Compute the score of each arm i

  Bi = (optimistic score of arm i)

◮ Pull arm

  It = arg max_{i=1,...,K} Bi

◮ Update the number of pulls TIt,t = TIt,t−1 + 1 and the other statistics

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 26/95

slide-51
SLIDE 51

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ) Bi = (optimistic score of arm i)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 27/95

slide-52
SLIDE 52

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ) Bi,s,t = (optimistic score of arm i if pulled s times up to round t)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 27/95

slide-53
SLIDE 53

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ): Bi,s,t = (optimistic score of arm i if pulled s times up to round t)

Optimism in face of uncertainty:
Current knowledge: average reward µ̂i,s
Current uncertainty: number of pulls s

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 27/95

slide-54
SLIDE 54

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ): Bi,s,t = knowledge + optimism bonus (driven by the uncertainty)

Optimism in face of uncertainty:
Current knowledge: average reward µ̂i,s
Current uncertainty: number of pulls s

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 27/95

slide-55
SLIDE 55

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ):

  Bi,s,t = µ̂i,s + ρ √( log(1/δ) / (2s) )

Optimism in face of uncertainty:
Current knowledge: average reward µ̂i,s
Current uncertainty: number of pulls s

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 27/95

slide-56
SLIDE 56

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

At each round t = 1, . . . , n

◮ Compute the score of each arm i

  Bi,t = µ̂i,Ti,t + ρ √( log(t) / (2 Ti,t) )

◮ Pull arm

  It = arg max_{i=1,...,K} Bi,t

◮ Update the number of pulls TIt,t = TIt,t−1 + 1 and µ̂i,Ti,t
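A minimal sketch of this UCB loop for Bernoulli arms (the `means` vector and the initialization round are illustrative assumptions):

```python
import numpy as np

def ucb(means, n, rho=1.0):
    K = len(means)
    pulls = np.zeros(K, dtype=int)
    mu_hat = np.zeros(K)
    for t in range(1, n + 1):
        if t <= K:                                   # pull each arm once to initialize
            i = t - 1
        else:
            bonus = rho * np.sqrt(np.log(t) / (2 * pulls))
            i = int(np.argmax(mu_hat + bonus))       # optimistic score B_{i,t}
        x = float(np.random.rand() < means[i])       # Bernoulli reward
        pulls[i] += 1
        mu_hat[i] += (x - mu_hat[i]) / pulls[i]      # incremental mean update
    return pulls, mu_hat

pulls, mu_hat = ucb([0.4, 0.5, 0.7], n=10_000)
```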

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 28/95

slide-57
SLIDE 57

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Theorem

Let X1, . . . , Xn be i.i.d. samples from a distribution bounded in [a, b]. Then for any δ ∈ (0, 1)

P( | (1/n) Σ_{t=1..n} Xt − E[X1] | > (b − a) √( log(2/δ) / (2n) ) ) ≤ δ
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 29/95

slide-58
SLIDE 58

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

After s pulls of arm i,

P( E[Xi] ≤ (1/s) Σ_{t=1..s} Xi,t + √( log(1/δ) / (2s) ) ) ≥ 1 − δ
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 30/95

slide-59
SLIDE 59

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

After s pulls of arm i,

P( µi ≤ µ̂i,s + √( log(1/δ) / (2s) ) ) ≥ 1 − δ
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 30/95

slide-60
SLIDE 60

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

After s pulls of arm i,

P( µi ≤ µ̂i,s + √( log(1/δ) / (2s) ) ) ≥ 1 − δ

⇒ UCB uses an upper confidence bound on the expectation

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 30/95

slide-61
SLIDE 61

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Theorem

For any set of K arms with distributions bounded in [0, b], if δ = 1/t, then UCB(ρ) with ρ > 1 achieves a regret

Rn(A) ≤ Σ_{i≠i∗} [ (4b² / ∆i) ρ log(n) + ∆i ( 3/2 + 1 / (2(ρ − 1)) ) ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 31/95

slide-62
SLIDE 62

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Let K = 2 with i∗ = 1. Then

Rn(A) ≤ O( (ρ log(n)) / ∆ )

Remark 1: the cumulative regret slowly increases as log(n)
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 32/95

slide-63
SLIDE 63

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Let K = 2 with i∗ = 1. Then

Rn(A) ≤ O( (ρ log(n)) / ∆ )

Remark 1: the cumulative regret slowly increases as log(n)

Remark 2: the smaller the gap, the bigger the regret... why?

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 32/95

slide-64
SLIDE 64

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Show time (again)!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 33/95

slide-65
SLIDE 65

Mathematical Tools

The Worst–case Performance

Remark: the regret bound is distribution–dependent

Rn(A; ∆) ≤ O( (ρ log(n)) / ∆ )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 34/95

slide-66
SLIDE 66

Mathematical Tools

The Worst–case Performance

Remark: the regret bound is distribution–dependent

Rn(A; ∆) ≤ O( (ρ log(n)) / ∆ )

Meaning: the algorithm is able to adapt to the specific problem at hand!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 34/95

slide-67
SLIDE 67

Mathematical Tools

The Worst–case Performance

Remark: the regret bound is distribution–dependent

Rn(A; ∆) ≤ O( (ρ log(n)) / ∆ )

Meaning: the algorithm is able to adapt to the specific problem at hand!

Worst-case performance: what is the distribution which leads to the worst possible performance of UCB? What is the distribution–free performance of UCB?

Rn(A) = sup_∆ Rn(A; ∆)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 34/95

slide-68
SLIDE 68

Mathematical Tools

The Worst–case Performance

Problem: it seems like if ∆ → 0 then the regret tends to infinity...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 35/95

slide-69
SLIDE 69

Mathematical Tools

The Worst–case Performance

Problem: it seems like if ∆ → 0 then the regret tends to infinity...
... nonsense, because the regret is defined as Rn(A; ∆) = E[T2,n] ∆

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 35/95

slide-70
SLIDE 70

Mathematical Tools

The Worst–case Performance

Problem: it seems like if ∆ → 0 then the regret tends to infinity...
... nonsense, because the regret is defined as Rn(A; ∆) = E[T2,n] ∆, so if ∆ is small, the regret is also small...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 35/95

slide-71
SLIDE 71

Mathematical Tools

The Worst–case Performance

Problem: it seems like if ∆ → 0 then the regret tends to infinity...
... nonsense, because the regret is defined as Rn(A; ∆) = E[T2,n] ∆, so if ∆ is small, the regret is also small...

In fact

Rn(A; ∆) = min{ O( (ρ log(n)) / ∆ ), E[T2,n] ∆ }
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 35/95

slide-72
SLIDE 72

Mathematical Tools

The Worst–case Performance

Then

Rn(A) = sup_∆ Rn(A; ∆) = sup_∆ min{ O( (ρ log(n)) / ∆ ), n∆ } ≈ √n    for ∆ = √(1/n)
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 36/95

slide-73
SLIDE 73

Mathematical Tools

Tuning the confidence δ of UCB

Remark: UCB is an anytime algorithm (δ = 1/t)

  Bi,s,t = µ̂i,s + ρ √( log(t) / (2s) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 37/95

slide-74
SLIDE 74

Mathematical Tools

Tuning the confidence δ of UCB

Remark: UCB is an anytime algorithm (δ = 1/t)

  Bi,s,t = µ̂i,s + ρ √( log(t) / (2s) )

Remark: if the time horizon n is known, then the optimal choice is δ = 1/n

  Bi,s,t = µ̂i,s + ρ √( log(n) / (2s) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 37/95

slide-75
SLIDE 75

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

Intuition: UCB should pull the suboptimal arms

◮ Enough: so as to understand which arm is the best ◮ Not too much: so as to keep the regret as small as possible

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 38/95

slide-76
SLIDE 76

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

Intuition: UCB should pull the suboptimal arms

◮ Enough: so as to understand which arm is the best ◮ Not too much: so as to keep the regret as small as possible

The confidence 1 − δ has the following impact (similar for ρ)

◮ Big 1 − δ: high level of exploration ◮ Small 1 − δ: high level of exploitation

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 38/95

slide-77
SLIDE 77

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

Intuition: UCB should pull the suboptimal arms

◮ Enough: so as to understand which arm is the best ◮ Not too much: so as to keep the regret as small as possible

The confidence 1 − δ has the following impact (similar for ρ)

◮ Big 1 − δ: high level of exploration ◮ Small 1 − δ: high level of exploitation

Solution: depending on the time horizon, we can tune how to trade-off between exploration and exploitation

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 38/95

slide-78
SLIDE 78

Mathematical Tools

UCB Proof

Let's dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

  E = { ∀i, s : | µ̂i,s − µi | ≤ √( log(1/δ) / (2s) ) }

By Chernoff-Hoeffding, P[E] ≥ 1 − nKδ.
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 39/95

slide-79
SLIDE 79

Mathematical Tools

UCB Proof

Let's dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

  E = { ∀i, s : | µ̂i,s − µi | ≤ √( log(1/δ) / (2s) ) }

By Chernoff-Hoeffding, P[E] ≥ 1 − nKδ.

At time t we pull arm i [algorithm]

  Bi,Ti,t−1 ≥ Bi∗,Ti∗,t−1

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 39/95

slide-80
SLIDE 80

Mathematical Tools

UCB Proof

Let's dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

  E = { ∀i, s : | µ̂i,s − µi | ≤ √( log(1/δ) / (2s) ) }

By Chernoff-Hoeffding, P[E] ≥ 1 − nKδ.

At time t we pull arm i [algorithm]

  µ̂i,Ti,t−1 + √( log(1/δ) / (2 Ti,t−1) ) ≥ µ̂i∗,Ti∗,t−1 + √( log(1/δ) / (2 Ti∗,t−1) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 39/95

slide-81
SLIDE 81

Mathematical Tools

UCB Proof

Let's dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

  E = { ∀i, s : | µ̂i,s − µi | ≤ √( log(1/δ) / (2s) ) }

By Chernoff-Hoeffding, P[E] ≥ 1 − nKδ.

At time t we pull arm i [algorithm]

  µ̂i,Ti,t−1 + √( log(1/δ) / (2 Ti,t−1) ) ≥ µ̂i∗,Ti∗,t−1 + √( log(1/δ) / (2 Ti∗,t−1) )

On the event E we have [math]

  µi + 2 √( log(1/δ) / (2 Ti,t−1) ) ≥ µi∗

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 39/95

slide-82
SLIDE 82

Mathematical Tools

UCB Proof (cont’d)

Assume t is the last time arm i is pulled; then Ti,n = Ti,t−1 + 1, thus

  µi + 2 √( log(1/δ) / (2 (Ti,n − 1)) ) ≥ µi∗

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 40/95

slide-83
SLIDE 83

Mathematical Tools

UCB Proof (cont’d)

Assume t is the last time arm i is pulled; then Ti,n = Ti,t−1 + 1, thus

  µi + 2 √( log(1/δ) / (2 (Ti,n − 1)) ) ≥ µi∗

Reordering [math]

  Ti,n ≤ log(1/δ) / (2∆i²) + 1

under event E, and thus with probability 1 − nKδ.

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 40/95

slide-84
SLIDE 84

Mathematical Tools

UCB Proof (cont’d)

Assume t is the last time arm i is pulled; then Ti,n = Ti,t−1 + 1, thus

  µi + 2 √( log(1/δ) / (2 (Ti,n − 1)) ) ≥ µi∗

Reordering [math]

  Ti,n ≤ log(1/δ) / (2∆i²) + 1

under event E, and thus with probability 1 − nKδ.

Moving to the expectation [statistics]

  E[Ti,n] = E[Ti,n I_E] + E[Ti,n I_{E^C}]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 40/95

slide-85
SLIDE 85

Mathematical Tools

UCB Proof (cont’d)

Assume t is the last time arm i is pulled; then Ti,n = Ti,t−1 + 1, thus

  µi + 2 √( log(1/δ) / (2 (Ti,n − 1)) ) ≥ µi∗

Reordering [math]

  Ti,n ≤ log(1/δ) / (2∆i²) + 1

under event E, and thus with probability 1 − nKδ.

Moving to the expectation [statistics]

  E[Ti,n] ≤ log(1/δ) / (2∆i²) + 1 + n (nKδ)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 40/95

slide-86
SLIDE 86

Mathematical Tools

UCB Proof (cont’d)

Assume t is the last time arm i is pulled; then Ti,n = Ti,t−1 + 1, thus

  µi + 2 √( log(1/δ) / (2 (Ti,n − 1)) ) ≥ µi∗

Reordering [math]

  Ti,n ≤ log(1/δ) / (2∆i²) + 1

under event E, and thus with probability 1 − nKδ.

Moving to the expectation [statistics]

  E[Ti,n] ≤ log(1/δ) / (2∆i²) + 1 + n (nKδ)

Trading off the two terms with δ = 1/n², we obtain the score µ̂i,Ti,t−1 + √( 2 log(n) / (2 Ti,t−1) ) and ...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 40/95

slide-87
SLIDE 87

Mathematical Tools

UCB Proof (cont’d)

Trading off the two terms with δ = 1/n², we obtain the score

  µ̂i,Ti,t−1 + √( 2 log(n) / (2 Ti,t−1) )

and

  E[Ti,n] ≤ log(n) / ∆i² + 1 + K

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 41/95

slide-88
SLIDE 88

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

Multi–armed Bandit: the same for δ = 1/t and δ = 1/n...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 42/95

slide-89
SLIDE 89

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

Multi–armed Bandit: the same for δ = 1/t and δ = 1/n... ... almost (i.e., in expectation)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 42/95

slide-90
SLIDE 90

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

The value–at–risk of the regret for UCB-anytime

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 43/95

slide-91
SLIDE 91

Mathematical Tools

Tuning the ρ of UCB (cont’d)

UCB values (for the δ = 1/n algorithm)

  Bi,s = µ̂i,s + ρ √( log(n) / (2s) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 44/95

slide-92
SLIDE 92

Mathematical Tools

Tuning the ρ of UCB (cont’d)

UCB values (for the δ = 1/n algorithm)

  Bi,s = µ̂i,s + ρ √( log(n) / (2s) )

Theory:
◮ ρ < 0.5: polynomial regret w.r.t. n
◮ ρ > 0.5: logarithmic regret w.r.t. n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 44/95

slide-93
SLIDE 93

Mathematical Tools

Tuning the ρ of UCB (cont’d)

UCB values (for the δ = 1/n algorithm)

  Bi,s = µ̂i,s + ρ √( log(n) / (2s) )

Theory:
◮ ρ < 0.5: polynomial regret w.r.t. n
◮ ρ > 0.5: logarithmic regret w.r.t. n

Practice: ρ = 0.2 is often the best choice

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 44/95

slide-94
SLIDE 94

Mathematical Tools

Tuning the ρ of UCB (cont’d)

UCB values (for the δ = 1/n algorithm)

  Bi,s = µ̂i,s + ρ √( log(n) / (2s) )

Theory:
◮ ρ < 0.5: polynomial regret w.r.t. n
◮ ρ > 0.5: logarithmic regret w.r.t. n

Practice: ρ = 0.2 is often the best choice

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 44/95

slide-95
SLIDE 95

Mathematical Tools

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate c.i.

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 45/95

slide-96
SLIDE 96

Mathematical Tools

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate c.i. Algorithm

◮ Compute the score of each arm i

  Bi,t = µ̂i,Ti,t + ρ √( log(t) / (2 Ti,t) )

◮ Pull arm

  It = arg max_{i=1,...,K} Bi,t

◮ Update the number of pulls TIt,t and µ̂i,Ti,t

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 45/95

slide-97
SLIDE 97

Mathematical Tools

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate c.i. Algorithm

◮ Compute the score of each arm i

  Bi,t = µ̂i,Ti,t + √( σ̂²i,Ti,t log(t) / Ti,t ) + 8 log(t) / (3 Ti,t)

◮ Pull arm

  It = arg max_{i=1,...,K} Bi,t

◮ Update the number of pulls TIt,t, µ̂i,Ti,t and σ̂²i,Ti,t

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 45/95

slide-98
SLIDE 98

Mathematical Tools

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate c.i. Algorithm

◮ Compute the score of each arm i

  Bi,t = µ̂i,Ti,t + √( σ̂²i,Ti,t log(t) / Ti,t ) + 8 log(t) / (3 Ti,t)

◮ Pull arm

  It = arg max_{i=1,...,K} Bi,t

◮ Update the number of pulls TIt,t, µ̂i,Ti,t and σ̂²i,Ti,t

Regret: Rn ≤ O( (1/∆) log(n) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 45/95

slide-99
SLIDE 99

Mathematical Tools

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate c.i. Algorithm

◮ Compute the score of each arm i

Bi,t = ˆ µi,Ti,t +

σ2

i,Ti,t log t

Ti,t + 8 log t 3Ti,t

◮ Pull arm

It = arg max

i=1,...,K Bi,t

◮ Update the number of pulls TIt,t, ˆ

µi,Ti,t and ˆ σ2

i,Ti,t

Regret Rn ≤ O σ2 ∆ log n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 45/95

slide-100
SLIDE 100

Mathematical Tools

Improvements: KL-UCB

Idea: use even tighter c.i. based on the Kullback–Leibler divergence

  d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 46/95

slide-101
SLIDE 101

Mathematical Tools

Improvements: KL-UCB

Idea: use even tighter c.i. based on the Kullback–Leibler divergence

  d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

Algorithm: compute the score of each arm i (convex optimization)

  Bi,t = max{ q ∈ [0, 1] : Ti,t d( µ̂i,Ti,t , q ) ≤ log(t) + c log(log(t)) }
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 46/95

slide-102
SLIDE 102

Mathematical Tools

Improvements: KL-UCB

Idea: use even tighter c.i. based on the Kullback–Leibler divergence

  d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

Algorithm: compute the score of each arm i (convex optimization)

  Bi,t = max{ q ∈ [0, 1] : Ti,t d( µ̂i,Ti,t , q ) ≤ log(t) + c log(log(t)) }

Regret: pulls to suboptimal arms

  E[Ti,n] ≤ (1 + ε) log(n) / d(µi, µ∗) + C1 log(log(n)) + C2(ε) / n^β(ε)

where d(µi, µ∗) > 2∆i²
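A sketch of how this index can be computed by bisection for Bernoulli arms (the constant c and the tolerance are assumptions; the KL divergence d(p, q) is increasing in q for q ≥ p, so bisection applies):

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, pulls, t, c=0.0, tol=1e-6):
    # largest q in [mu_hat, 1] such that pulls * d(mu_hat, q) <= log(t) + c log(log(t))
    level = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```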

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 46/95

slide-103
SLIDE 103

Mathematical Tools

Improvements: Thompson strategy

Idea: Use a Bayesian approach to estimate the means {µi}i

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 47/95

slide-104
SLIDE 104

Mathematical Tools

Improvements: Thompson strategy

Idea: use a Bayesian approach to estimate the means {µi}i

Algorithm: assuming Bernoulli arms and a Beta prior on the means

◮ Compute Di,t = Beta(Si,t + 1, Fi,t + 1)

◮ Draw a mean sample µ̃i,t ∼ Di,t

◮ Pull arm It = arg max_i µ̃i,t

◮ If XIt,t = 1 update SIt,t+1 = SIt,t + 1, else update FIt,t+1 = FIt,t + 1

Regret:

  lim_{n→∞} Rn / log(n) = Σ_{i=1..K} ∆i / d(µi, µ∗)
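A minimal sketch of this Thompson strategy for Bernoulli arms with Beta(1, 1) priors (the `means` vector is an illustrative assumption):

```python
import numpy as np

def thompson(means, n):
    K = len(means)
    S = np.zeros(K)  # successes per arm
    F = np.zeros(K)  # failures per arm
    for _ in range(n):
        theta = np.random.beta(S + 1, F + 1)        # one posterior sample per arm
        i = int(np.argmax(theta))                   # pull the arm with the highest sample
        x = float(np.random.rand() < means[i])      # Bernoulli reward
        S[i] += x
        F[i] += 1 - x
    return S, F
```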

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 47/95

slide-105
SLIDE 105

Mathematical Tools

The Lower Bound

Theorem

For any stochastic bandit {νi}, any algorithm A has a regret such that

  lim_{n→∞} Rn / log(n) ≥ Σ_{i≠i∗} ∆i / inf_ν KL(νi, ν)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 48/95

slide-106
SLIDE 106

Mathematical Tools

The Lower Bound

Theorem

For any stochastic bandit {νi}, any algorithm A has a regret such that

  lim_{n→∞} Rn / log(n) ≥ Σ_{i≠i∗} ∆i / inf_ν KL(νi, ν)

Problem: this is just asymptotic

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 48/95

slide-107
SLIDE 107

Mathematical Tools

The Lower Bound

Theorem

For any stochastic bandit {νi}, any algorithm A has a regret such that

  lim_{n→∞} Rn / log(n) ≥ Σ_{i≠i∗} ∆i / inf_ν KL(νi, ν)

Problem: this is just asymptotic

Open question: what is the finite-time lower bound?

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 48/95

slide-108
SLIDE 108

Mathematical Tools

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 49/95

slide-109
SLIDE 109

Mathematical Tools

The Contextual Linear Bandit Problem

Motivating Example: news recommendation

◮ Different users may have different preferences ◮ Different news may have different characteristics ◮ The set of available news may change over time ◮ We want to minimise the regret w.r.t. the best news for each

user

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 50/95

slide-110
SLIDE 110

Mathematical Tools

The Linear Bandit Problem

Limitations of MAB:

◮ Arms are independent ◮ Each single arm has to be tested at least once ◮ Regret scales linearly with K

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 51/95

slide-111
SLIDE 111

Mathematical Tools

The Linear Bandit Problem

Limitations of MAB:

◮ Arms are independent ◮ Each single arm has to be tested at least once ◮ Regret scales linearly with K

Linear bandit approach:

◮ Embed arms in R^d (each arm a is mapped to a feature vector φa ∈ R^d)

◮ The reward varies linearly with the arm

  E[r(a)] = φa⊤ θ∗

where θ∗ ∈ R^d is unknown.

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 51/95

slide-112
SLIDE 112

Mathematical Tools

The Linear Bandit Problem

Limitations of MAB:

◮ Arms are independent ◮ Each single arm has to be tested at least once ◮ Regret scales linearly with K

Linear bandit approach:

◮ Embed arms in R^d (each arm a is mapped to a feature vector φa ∈ R^d)

◮ The reward varies linearly with the arm

  E[r(a)] = φa⊤ θ∗

where θ∗ ∈ R^d is unknown.

Remark: if d = A and φa = ea, then it coincides with MAB

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 51/95

slide-113
SLIDE 113

Mathematical Tools

The Linear Bandit Problem

The problem: at each time t = 1, . . . , n

◮ The learner chooses an arm at and receives a reward r(at)

The optimal arm: a∗ = arg max_{a∈A} E[r(a)] = arg max_{a∈A} φa⊤ θ∗

The regret:

  Rn = E[ Σ_{t=1..n} rt(a∗) ] − E[ Σ_{t=1..n} rt(at) ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 52/95

slide-114
SLIDE 114

Mathematical Tools

The Linear Bandit Problem

The MAB approach: the value of an arm is estimated by µ̂i,t

Exploiting the linear assumption:

◮ Estimate θ∗ using regularized least squares

  θ̂n = arg min_θ Σ_{t=1..n} ( φat⊤ θ − rt(at) )² + λ ‖θ‖²₂

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 53/95

slide-115
SLIDE 115

Mathematical Tools

The Linear Bandit Problem

The MAB approach: the value of an arm is estimated by µ̂i,t

Exploiting the linear assumption:

◮ Estimate θ∗ using regularized least squares

  θ̂n = arg min_θ Σ_{t=1..n} ( φat⊤ θ − rt(at) )² + λ ‖θ‖²₂

◮ Closed-form solution

  An = Σ_{t=1..n} φat φat⊤ + λ I,    bn = Σ_{t=1..n} φat rt(at)    ⇒    θ̂n = An⁻¹ bn

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 53/95

slide-116
SLIDE 116

Mathematical Tools

The Linear Bandit Problem

The MAB approach: the value of an arm is estimated by µ̂i,t

Exploiting the linear assumption:

◮ Estimate θ∗ using regularized least squares

  θ̂n = arg min_θ Σ_{t=1..n} ( φat⊤ θ − rt(at) )² + λ ‖θ‖²₂

◮ Closed-form solution

  An = Σ_{t=1..n} φat φat⊤ + λ I,    bn = Σ_{t=1..n} φat rt(at)    ⇒    θ̂n = An⁻¹ bn

◮ Estimate of the value of arm a

  r̂n(a) = φa⊤ θ̂n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 53/95

slide-117
SLIDE 117

Mathematical Tools

The Linear Bandit Problem

The MAB approach: construct confidence intervals √( log(1/δ) / Ti,n )

Exploiting the linear assumption:

◮ The estimate r̂n(a) of an arm may be accurate when “similar” arms have been selected (even if Tn(a) = 0!)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 54/95

slide-118
SLIDE 118

Mathematical Tools

The Linear Bandit Problem

The MAB approach: construct confidence intervals √( log(1/δ) / Ti,n )

Exploiting the linear assumption:

◮ The estimate r̂n(a) of an arm may be accurate when “similar” arms have been selected (even if Tn(a) = 0!)

◮ Confidence intervals

  | r(a) − r̂n(a) | ≤ αn √( φa⊤ An⁻¹ φa )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 54/95

slide-119
SLIDE 119

Mathematical Tools

The Linear Bandit Problem

The MAB approach: construct confidence intervals √( log(1/δ) / Ti,n )

Exploiting the linear assumption:

◮ The estimate r̂n(a) of an arm may be accurate when “similar” arms have been selected (even if Tn(a) = 0!)

◮ Confidence intervals

  | r(a) − r̂n(a) | ≤ αn √( φa⊤ An⁻¹ φa )

◮ Tuning of the confidence interval

  αn = B √( d log( (1 + nL/λ) / δ ) ) + λ^{1/2} ‖θ∗‖₂

Remark: the confidence interval reduces to the MAB one when all arms are orthogonal
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 54/95

slide-120
SLIDE 120

Mathematical Tools

The Linear Bandit Problem

The MAB approach – UCB: pull arm It = arg max_i µ̂i,t + √( log(1/δ) / Ti,t )

Exploiting the linear assumption:

◮ At each time step t, select arm

  at = arg max_{a∈A} φa⊤ θ̂t + αt √( φa⊤ At⁻¹ φa )
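A minimal sketch of this optimistic selection rule with a shared ridge-regression estimate (the arm-feature matrix `Phi`, the schedule for `alpha`, and the rank-one inverse update are assumptions of the sketch, not prescribed by the slides):

```python
import numpy as np

def linucb_select(Phi, A_inv, b, alpha):
    """Phi: (K, d) arm features; A_inv: inverse of the regularized design matrix; b: sum of phi * reward."""
    theta_hat = A_inv @ b
    widths = np.sqrt(np.einsum('kd,de,ke->k', Phi, A_inv, Phi))   # sqrt(phi^T A^-1 phi) per arm
    return int(np.argmax(Phi @ theta_hat + alpha * widths))

def linucb_update(A_inv, b, phi, r):
    # Sherman-Morrison rank-one update of A^-1 after observing (phi, r)
    Av = A_inv @ phi
    A_inv = A_inv - np.outer(Av, Av) / (1.0 + phi @ Av)
    return A_inv, b + r * phi
```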

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 55/95

slide-121
SLIDE 121

Mathematical Tools

The Linear Bandit Problem

The MAB approach – UCB: regret O( K log(n)/∆ ) or O( √( K n log(K) ) )

Exploiting the linear assumption:

◮ Regret bound

  Rn = O( d log(n) √n )
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 56/95

slide-122
SLIDE 122

Mathematical Tools

The Linear Bandit Problem

The MAB approach – TS:

◮ Compute a posterior over µi
◮ Draw a sample µ̃i from the posterior
◮ Select arm It = arg max_i µ̃i

Exploiting the linear assumption:

◮ Regret bound

  Rn = O( d log(n) √n )
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 57/95

slide-123
SLIDE 123

Mathematical Tools

The Contextual Linear Bandit Problem

Limitations of MAB:

◮ The value of an arm is fixed ◮ No side-information / context is used

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 58/95

slide-124
SLIDE 124

Mathematical Tools

The Contextual Linear Bandit Problem

Limitations of MAB:

◮ The value of an arm is fixed ◮ No side-information / context is used

Contextual linear bandit approach:

◮ Finite arms
◮ Define a context x ∈ X
◮ The reward varies linearly with the context

  E[r(x, a)] = φx⊤ θ∗a

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 58/95

slide-125
SLIDE 125

Mathematical Tools

The Contextual Linear Bandit Problem

Limitations of MAB:

◮ The value of an arm is fixed ◮ No side-information / context is used

Contextual linear bandit approach:

◮ Finite arms
◮ Define a context x ∈ X
◮ The reward varies linearly with the context

  E[r(x, a)] = φx⊤ θ∗a

Extensions:

◮ Embed arms in R^d and E[r(x, a)] = φx,a⊤ θ∗a
◮ Let the arm set change over time: At

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 58/95

slide-126
SLIDE 126

Mathematical Tools

The Contextual Linear Bandit Problem

The problem: at each time t = 1, . . . , n

◮ User xt arrives and a set of news At is provided
◮ The user xt together with a news item a ∈ At is described by a feature vector φxt,a
◮ The learner chooses a news item at ∈ At and receives a reward rt(xt, at)

The optimal news: at each time t = 1, . . . , n, the optimal news item is

  a∗t = arg max_{a∈At} E[ rt(xt, a) ]

The regret:

  Rn = E[ Σ_{t=1..n} rt(xt, a∗t) ] − E[ Σ_{t=1..n} rt(xt, at) ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 59/95

slide-127
SLIDE 127

Mathematical Tools

The Contextual Linear Bandit Problem

The linear regression estimate:

◮ Ta = {t : at = a}
◮ Construct the design matrix Da ∈ R^{|Ta|×d} of all the contexts observed when action a has been taken
◮ Construct the reward vector ca ∈ R^{|Ta|} of all the rewards observed when action a has been taken
◮ Estimate θa as

  θ̂a = (Da⊤ Da + I)⁻¹ Da⊤ ca

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 60/95

slide-128
SLIDE 128

Mathematical Tools

The Contextual Linear Bandit Problem

Optimism in face of uncertainty: the LinUCB algorithm

◮ Chernoff-Hoeffding in this case becomes

  | φx,a⊤ θ̂a − r(x, a) | ≤ α √( φx,a⊤ (Da⊤ Da + I)⁻¹ φx,a )

◮ and the UCB strategy is

  at = arg max_{a∈At} φx,a⊤ θ̂a + α √( φx,a⊤ (Da⊤ Da + I)⁻¹ φx,a )
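A minimal sketch of this per-arm ("disjoint") LinUCB rule, keeping one ridge-regression estimate θ̂a per action; the class layout, dimensions, and the value of α are assumptions of the sketch:

```python
import numpy as np

class DisjointLinUCB:
    def __init__(self, n_actions, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_actions)]     # D_a^T D_a + I
        self.b = [np.zeros(d) for _ in range(n_actions)]   # D_a^T c_a

    def select(self, phi, available):
        """phi[a]: feature vector of (context, action a); available: candidate actions."""
        best, best_score = None, -np.inf
        for a in available:
            A_inv = np.linalg.inv(self.A[a])
            theta_a = A_inv @ self.b[a]
            score = phi[a] @ theta_a + self.alpha * np.sqrt(phi[a] @ A_inv @ phi[a])
            if score > best_score:
                best, best_score = a, score
        return best

    def update(self, a, phi_a, reward):
        self.A[a] += np.outer(phi_a, phi_a)
        self.b[a] += reward * phi_a
```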

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 61/95

slide-129
SLIDE 129

Mathematical Tools

The Contextual Linear Bandit Problem

The evaluation problem

◮ Online evaluation: too expensive ◮ Offline evaluation: how to use the logged data?

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 62/95

slide-130
SLIDE 130

Mathematical Tools

The Contextual Linear Bandit Problem

Evaluation from logged data

◮ Assumption 1: contexts and rewards are i.i.d. from a stationary distribution, (x1, . . . , xK, r1, . . . , rK) ∼ D

◮ Assumption 2: the logging strategy is random

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 63/95

slide-131
SLIDE 131

Mathematical Tools

The Contextual Linear Bandit Problem

Evaluation from logged data: given a bandit strategy π, a desired number of samples T, and an (infinite) stream of logged data

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 64/95

slide-132
SLIDE 132

Mathematical Tools

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 65/95

slide-133
SLIDE 133

Mathematical Tools

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 66/95

slide-134
SLIDE 134

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Motivating Examples

◮ Find the best shortest path in a limited number of days ◮ Maximize the confidence about the best treatment after a

finite number of patients

◮ Discover the best advertisements after a training phase ◮ ...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 67/95

slide-135
SLIDE 135

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Objective: given a fixed budget n, return the best arm i∗ = arg maxi µi at the end of the experiment

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 68/95

slide-136
SLIDE 136

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Objective: given a fixed budget n, return the best arm i∗ = arg max_i µi at the end of the experiment

Measure of performance: the probability of error

  P[ Jn ≠ i∗ ] ≤ Σ_{i=1..N} exp( − Ti,n ∆i² )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 68/95

slide-137
SLIDE 137

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Objective: given a fixed budget n, return the best arm i∗ = arg max_i µi at the end of the experiment

Measure of performance: the probability of error

  P[ Jn ≠ i∗ ] ≤ Σ_{i=1..N} exp( − Ti,n ∆i² )

Algorithm idea: mimic the behavior of the optimal static allocation

  Ti,n = ( (1/∆i²) / Σ_{j=1..N} (1/∆j²) ) · n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 68/95

slide-138
SLIDE 138

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define ( loḡ(N) = 1/2 + Σ_{i=2..N} 1/i )

  nk = (1 / loḡ(N)) · (n − N) / (N + 1 − k)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 69/95

slide-139
SLIDE 139

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define ( loḡ(N) = 1/2 + Σ_{i=2..N} 1/i )

  nk = (1 / loḡ(N)) · (n − N) / (N + 1 − k)

◮ Set of active arms Ak at phase k (A1 = {1, . . . , N})

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 69/95

slide-140
SLIDE 140

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define ( loḡ(N) = 1/2 + Σ_{i=2..N} 1/i )

  nk = (1 / loḡ(N)) · (n − N) / (N + 1 − k)

◮ Set of active arms Ak at phase k (A1 = {1, . . . , N})
◮ For each phase k = 1, . . . , N − 1

  ◮ For each arm i ∈ Ak, pull arm i for nk − nk−1 rounds

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 69/95

slide-141
SLIDE 141

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define ( loḡ(N) = 1/2 + Σ_{i=2..N} 1/i )

  nk = (1 / loḡ(N)) · (n − N) / (N + 1 − k)

◮ Set of active arms Ak at phase k (A1 = {1, . . . , N})
◮ For each phase k = 1, . . . , N − 1

  ◮ For each arm i ∈ Ak, pull arm i for nk − nk−1 rounds
  ◮ Remove the empirically worst arm: Ak+1 = Ak \ { arg min_{i∈Ak} µ̂i,nk }

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 69/95

slide-142
SLIDE 142

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define ( loḡ(N) = 1/2 + Σ_{i=2..N} 1/i )

  nk = (1 / loḡ(N)) · (n − N) / (N + 1 − k)

◮ Set of active arms Ak at phase k (A1 = {1, . . . , N})
◮ For each phase k = 1, . . . , N − 1

  ◮ For each arm i ∈ Ak, pull arm i for nk − nk−1 rounds
  ◮ Remove the empirically worst arm: Ak+1 = Ak \ { arg min_{i∈Ak} µ̂i,nk }

◮ Return the only remaining arm, Jn = AN
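A minimal sketch of Successive Rejects with the phase lengths defined above (the Bernoulli reward model inside the loop and the ceiling on nk are assumptions of the sketch):

```python
import numpy as np

def successive_rejects(means, n):
    N = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, N + 1))
    nk = lambda k: int(np.ceil((n - N) / (log_bar * (N + 1 - k))))
    active = list(range(N))
    pulls, sums = np.zeros(N), np.zeros(N)
    prev = 0
    for k in range(1, N):
        for i in active:
            for _ in range(nk(k) - prev):            # pull each active arm n_k - n_{k-1} times
                x = float(np.random.rand() < means[i])
                pulls[i] += 1; sums[i] += x
        prev = nk(k)
        worst = min(active, key=lambda i: sums[i] / pulls[i])
        active.remove(worst)                          # discard the empirically worst arm
    return active[0]                                  # J_n
```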

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 69/95

slide-143
SLIDE 143

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

Theorem

The Successive Rejects algorithm has a probability of error

  P[ Jn ≠ i∗ ] ≤ ( N(N − 1) / 2 ) exp( − (n − N) / ( loḡ(N) H2 ) )

with H2 = max_{i=1,...,N} i ∆_{(i)}⁻².

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 70/95

slide-144
SLIDE 144

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

◮ Define an exploration parameter a
◮ Compute

  Bi,s = µ̂i,s + √( a / s )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 71/95

slide-145
SLIDE 145

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

◮ Define an exploration parameter a
◮ Compute

  Bi,s = µ̂i,s + √( a / s )

◮ Select

  It = arg max_i Bi,s

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 71/95

slide-146
SLIDE 146

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

◮ Define an exploration parameter a
◮ Compute

  Bi,s = µ̂i,s + √( a / s )

◮ Select

  It = arg max_i Bi,s

◮ At the end, return

  Jn = arg max_i µ̂i,Ti,n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 71/95

slide-147
SLIDE 147

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

Theorem

The UCB-E algorithm with a = (25/36) · (n − N) / H1 has a probability of error

  P[ Jn ≠ i∗ ] ≤ 2 n N exp( − 2a / 25 )

with H1 = Σ_{i=1..N} 1/∆i².

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 72/95

slide-148
SLIDE 148

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 73/95

slide-149
SLIDE 149

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Motivating Examples

◮ N production lines ◮ The test of the performance of a line is expensive ◮ We want an accurate estimation of the performance of each

production line

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 74/95

slide-150
SLIDE 150

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means µ̂i,n which is as accurate as possible for all the arms

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 75/95

slide-151
SLIDE 151

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means µ̂i,n which is as accurate as possible for all the arms

Notice: given an arm with mean µi and variance σi², if it is pulled Ti,n times, then

  Li,n = E[ ( µ̂i,Ti,n − µi )² ] = σi² / Ti,n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 75/95

slide-152
SLIDE 152

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means µ̂i,n which is as accurate as possible for all the arms

Notice: given an arm with mean µi and variance σi², if it is pulled Ti,n times, then

  Li,n = E[ ( µ̂i,Ti,n − µi )² ] = σi² / Ti,n

  Ln = max_i Li,n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 75/95

slide-153
SLIDE 153

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Problem: what is the allocation of pulls (T1,n, . . . , TN,n) (such that Σ_i Ti,n = n) which minimizes the loss?

  (T∗1,n, . . . , T∗N,n) = arg min_{(T1,n,...,TN,n)} Ln

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 76/95

slide-154
SLIDE 154

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Problem: what is the allocation of pulls (T1,n, . . . , TN,n) (such that Σ_i Ti,n = n) which minimizes the loss?

  (T∗1,n, . . . , T∗N,n) = arg min_{(T1,n,...,TN,n)} Ln

Answer:

  T∗i,n = ( σi² / Σ_{j=1..N} σj² ) · n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 76/95

slide-155
SLIDE 155

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Problem: what is the allocation of pulls (T1,n, . . . , TN,n) (such that Σ_i Ti,n = n) which minimizes the loss?

  (T∗1,n, . . . , T∗N,n) = arg min_{(T1,n,...,TN,n)} Ln

Answer:

  T∗i,n = ( σi² / Σ_{j=1..N} σj² ) · n

  L∗n = ( Σ_{i=1..N} σi² ) / n = Σ / n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 76/95

slide-156
SLIDE 156

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return the an estimate of the means ˆ µi,t which is as accurate as possible for all the arms

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 77/95

slide-157
SLIDE 157

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means µ̂i,n which is as accurate as possible for all the arms

Measure of performance: the regret on the quadratic error

  Rn(A) = max_i Li,n(A) − ( Σ_{i=1..N} σi² ) / n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 77/95

slide-158
SLIDE 158

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means µ̂i,n which is as accurate as possible for all the arms

Measure of performance: the regret on the quadratic error

  Rn(A) = max_i Li,n(A) − ( Σ_{i=1..N} σi² ) / n

Algorithm idea: mimic the behavior of the optimal allocation

  Ti,n = ( σi² / Σ_{j=1..N} σj² ) · n = λi n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 77/95

slide-159
SLIDE 159

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

A UCB–based strategy. At each time step t = 1, . . . , n

◮ Estimate

  σ̂²i,Ti,t−1 = (1 / Ti,t−1) Σ_{s=1..Ti,t−1} X²s,i − µ̂²i,Ti,t−1

◮ Compute

  Bi,t = (1 / Ti,t−1) ( σ̂²i,Ti,t−1 + 5 √( log(1/δ) / (2 Ti,t−1) ) )

◮ Pull arm

  It = arg max_i Bi,t
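A minimal sketch of this variance-driven allocation (pulling each arm twice for initialization and representing arms as sampling callables are assumptions of the sketch):

```python
import numpy as np

def active_allocation(samplers, n, delta=0.1):
    """samplers: list of zero-argument callables returning one sample of each arm."""
    K = len(samplers)
    obs = [[samplers[i](), samplers[i]()] for i in range(K)]   # two pulls per arm to start
    for _ in range(2 * K, n):
        B = []
        for i in range(K):
            s = len(obs[i])
            var_hat = np.var(obs[i])                            # empirical variance sigma_hat^2
            B.append((var_hat + 5 * np.sqrt(np.log(1 / delta) / (2 * s))) / s)
        i = int(np.argmax(B))                                   # pull the arm with the largest score
        obs[i].append(samplers[i]())
    return [np.mean(o) for o in obs], [len(o) for o in obs]
```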

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 78/95

slide-160
SLIDE 160

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Theorem

The UCB–based algorithm achieves a regret

  Rn(A) ≤ 98 log(n) / ( n^{3/2} λ_min^{5/2} ) + O( log(n) / n² )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 79/95

slide-161
SLIDE 161

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Theorem

The UCB–based algorithm achieves a regret

  Rn(A) ≤ 98 log(n) / ( n^{3/2} λ_min^{5/2} ) + O( log(n) / n² )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 79/95

slide-162
SLIDE 162

Other Stochastic Multi-arm Bandit Problems

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems Bonus: Reinforcement Learning

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 80/95

slide-163
SLIDE 163

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at according to a suitable exploration policy
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 81/95

slide-164
SLIDE 164

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

The regret in MAB

  Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 82/95

slide-165
SLIDE 165

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

The regret in MAB

  Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

⇒ Rn(A) = max_π E[ Σ_{t=1..n} r(xt, π(xt)) ] − E[ Σ_{t=1..n} r(xt, at) ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 82/95

slide-166
SLIDE 166

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

The regret in MAB

  Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

⇒ Rn(A) = max_π E[ Σ_{t=1..n} r(xt, π(xt)) ] − E[ Σ_{t=1..n} r(xt, at) ]

⇒ not correct: actions influence the state as well!
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 82/95

slide-167
SLIDE 167

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

The regret in MAB

  Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

⇒ Rn(A) = max_π E[ Σ_{t=1..n} r(xt, π(xt)) ] − E[ Σ_{t=1..n} r(xt, at) ]

⇒ not correct: actions influence the state as well!

The regret in RL

  Rn(A) = max_π E[ Σ_{t=1..n} r(x∗t, π(x∗t)) ] − E[ Σ_{t=1..n} r(xt, at) ],    x∗t ∼ p( · | x∗t−1, π∗(x∗t−1) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 82/95

slide-168
SLIDE 168

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

Idea: can we adapt UCB (that already works in MAB, contextual bandit) here?

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 83/95

slide-169
SLIDE 169

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

Idea: can we adapt UCB (that already works in MAB, contextual bandit) here? Yes!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 83/95

slide-170
SLIDE 170

Other Stochastic Multi-arm Bandit Problems

Exploration-Exploitation in RL

◮ A policy π is defined as π : X → A

◮ The long-term average reward of a policy is

  ρπ(M) = lim_{n→∞} E[ (1/n) Σ_{t=1..n} rt ]

◮ Optimal policy

  π∗(M) = arg max_π ρπ(M)    ⇒    ρ∗(M) = ρ^{π∗(M)}(M)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 84/95

slide-171
SLIDE 171

Other Stochastic Multi-arm Bandit Problems

Exploration-Exploitation in RL

◮ A policy π is defined as π : X → A

◮ The long-term average reward of a policy is

  ρπ(M) = lim_{n→∞} E[ (1/n) Σ_{t=1..n} rt ]

◮ Optimal policy

  π∗(M) = arg max_π ρπ(M)    ⇒    ρ∗(M) = ρ^{π∗(M)}(M)

◮ Exploration-exploitation dilemma

  ◮ Explore the environment to estimate its parameters
  ◮ Exploit the estimates to collect reward

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 84/95

slide-172
SLIDE 172

Other Stochastic Multi-arm Bandit Problems

Exploration-Exploitation in RL

[Figure: learning curve of the per-step reward over steps, converging to ρ∗; the gap to ρ∗ accumulates into the regret]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 85/95

slide-173
SLIDE 173

Other Stochastic Multi-arm Bandit Problems

Exploration-Exploitation in RL

[Figure: learning curve of the per-step reward over steps, converging to ρ∗; the gap to ρ∗ accumulates into the regret]

Cumulative regret

  Rn = n ρ∗ − Σ_{t=1..n} rt

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 85/95

slide-174
SLIDE 174

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

[Figure: space of MDPs — estimated MDP M̂t, true MDP M∗, and optimistic MDP M̃t inside the high-confidence set Mt, with gains ρ∗(M̃t), ρ∗(M), ρ∗; optimism in face of uncertainty ⇒ play π∗(M̃t)]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 86/95

slide-175
SLIDE 175

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

[Figure: space of MDPs — estimated MDP M̂t, true MDP M∗, and optimistic MDP M̃t inside the high-confidence set Mt, with gains ρ∗(M̃t), ρ∗(M), ρ∗; optimism in face of uncertainty ⇒ play π∗(M̃t)]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 86/95

slide-176
SLIDE 176

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

[Figure: at a later time t′, the high-confidence set Mt′ has shrunk around the true MDP M∗; the optimistic MDP M̃t′ and its policy π∗(M̃t′) move closer to the true optimum ρ∗(M)]
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 86/95

slide-177
SLIDE 177

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

[Figure: after n steps, the high-confidence set Mn concentrates around the true MDP M∗, so ρ∗(M̃n) approaches ρ∗(M) and π∗(M̃n) approaches the optimal policy]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 86/95

slide-178
SLIDE 178

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 87/95

slide-179
SLIDE 179

Other Stochastic Multi-arm Bandit Problems

The UCRL2 Algorithm

Initialize episode k

  • 1. Current time tk
  • 2. Let Nk(x, a) = |{τ < tk : xτ = x, aτ = a}|
  • 3. Let Rk(x, a) = Σ_{t<tk} rt I{xt = x, at = a}
  • 4. Let Pk(x, a, x′) = |{τ < tk : xτ = x, aτ = a, xτ+1 = x′}|
  • 5. Compute r̂k(x, a) = Rk(x, a) / Nk(x, a),    p̂k(x, a, x′) = Pk(x, a, x′) / Nk(x, a)

Compute optimistic policy

  • 1. Let Mk = { M̃ : | r̃(x, a) − r̂k(x, a) | ≤ Br(x, a);  ‖ p̃(·|x, a) − p̂k(·|x, a) ‖₁ ≤ Bp(x, a) }
  • 2. Compute π̃k = arg max_π max_{M̃∈Mk} ρ(π; M̃)

Execute π̃k until at least one state-action counter is doubled

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 88/95

slide-180
SLIDE 180

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

Set of plausible MDPs Mk = {M̃}: confidence intervals built using Chernoff bounds

  Br(x, a) ≈ √( log(XA/δ) / Nk(x, a) );    Bp(x, a) ≈ √( X log(XA/δ) / Nk(x, a) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 89/95

slide-181
SLIDE 181

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

Set of plausible MDPs Mk = {M̃}: confidence intervals built using Chernoff bounds

  Br(x, a) ≈ √( log(XA/δ) / Nk(x, a) );    Bp(x, a) ≈ √( X log(XA/δ) / Nk(x, a) )

Computation of the optimistic optimal policy π̃k:

  π̃k = arg max_π max_{M̃∈Mk} ρπ(M̃)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 89/95

slide-182
SLIDE 182

Other Stochastic Multi-arm Bandit Problems

The Extended Value Iteration Algorithm

Planning in average-reward MDPs

◮ The optimal Bellman equation: optimal gain ρ∗ and bias u∗

  u∗(x) + ρ∗ = max_a [ r(x, a) + Σ_{x′} p(x′|x, a) u∗(x′) ]

◮ Value iteration (given v0)

  vn(x) = max_a [ r(x, a) + Σ_{x′} p(x′|x, a) vn−1(x′) ]

  until span(vn − vn−1) ≤ ε

◮ Guarantees of the greedy policy

  πn(x) = arg max_a [ r(x, a) + Σ_{x′} p(x′|x, a) vn−1(x′) ]    ⇒    |gπn − g∗| ≤ ε
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 90/95

slide-183
SLIDE 183

Other Stochastic Multi-arm Bandit Problems

The Extended Value Iteration Algorithm

Planning in optimistic average-reward MDPs

◮ The optimal Bellman equation: optimal gain ρ̃ and bias ũ

  ũ(x) + ρ̃ = max_a max_{r̃(x,a)} max_{p̃(·|x,a)} [ r̃(x, a) + Σ_{x′} p̃(x′|x, a) ũ(x′) ]

◮ Value iteration (given v0)

  vn(x) = max_a max_{r̃(x,a)} max_{p̃(·|x,a)} [ r̃(x, a) + Σ_{x′} p̃(x′|x, a) vn−1(x′) ]

        = max_a max_{p̃(·|x,a)} [ r̃⁺(x, a) + Σ_{x′} p̃(x′|x, a) vn−1(x′) ]      ( r̃⁺ = r̂ + Br, the largest plausible reward )

        = max_a [ r̃⁺(x, a) + max_{p̃(·|x,a)} Σ_{x′} p̃(x′|x, a) vn−1(x′) ]      (simple LP)

◮ LP problem: assign the largest possible probability mass of p̃(·|x, a), within the L1 ball of radius Bp(x, a) around p̂(·|x, a), to the next states x′ with the highest vn−1(x′)
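A small sketch of that inner maximization (shift at most Bp/2 of probability mass onto the best next state and remove it from the worst ones); the implementation details are assumptions, not prescribed by the slides:

```python
import numpy as np

def optimistic_transition(p_hat, v, beta):
    """Return the optimistic p_tilde within L1 distance beta of p_hat, maximizing p_tilde . v."""
    p = p_hat.copy()
    best = int(np.argmax(v))
    p[best] = min(1.0, p_hat[best] + beta / 2.0)    # add mass to the best next state
    for s in np.argsort(v):                         # remove mass from the worst states first
        if s == best:
            continue
        excess = max(p.sum() - 1.0, 0.0)
        if excess <= 0:
            break
        p[s] = max(0.0, p[s] - excess)
    return p
```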

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 91/95

slide-184
SLIDE 184

Other Stochastic Multi-arm Bandit Problems

The Regret

Theorem

UCRL2 run over n steps in an MDP with diameter D, X states, and A actions suffers a regret

  Rn = O( D X √(A n) )

where the diameter is D = max_{x,x′} min_π E[ Tπ(x, x′) ].
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 92/95

slide-185
SLIDE 185

Other Stochastic Multi-arm Bandit Problems

Posterior Sampling for Reinforcement Learning (PSRL)

Initialize episode k

  • 1. Current time tk
  • 2. Let Nk(x, a) = |{τ < tk : xτ = x, aτ = a}|
  • 3. Compute the posterior over r(x, a) and p(·|x, a)

Compute random policy

  • 1. Let M̃k = { r̃k, p̃k } with r̃k, p̃k sampled from their posteriors
  • 2. Compute the optimal policy π̃k = arg max_π ρπ(M̃k)

Execute π̃k until at least one state-action counter is doubled

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 93/95

slide-186
SLIDE 186

Other Stochastic Multi-arm Bandit Problems

Bibliography I

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 94/95

slide-187
SLIDE 187

Other Stochastic Multi-arm Bandit Problems

Reinforcement Learning

Alessandro Lazaric alessandro.lazaric@inria.fr sequel.lille.inria.fr