

SLIDE 1

Improved Bandit Algorithms

Reduced Variance Payoff Estimation in Adversarial Bandit Problems

Levente Kocsis, Csaba Szepesvári

Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary. E-mail: szcsaba@sztaki.hu

ECML-2005 Workshop on Reinforcement Learning in Non-stationary Environments, Porto, 2005

SLIDE 2

Outline

1. Universal Prediction with Expert Advice: Prediction with Expert Advice; Some Previous Results; Issues
2. Improved Payoff-Estimation: Generalized Exp3; Theoretical Results; Likelihood-ratio Based Payoff Estimation; Control-variates
3. Experimental Results: Dynamic Pricing; Dynamic Pricing – Tracking Experiments; Experiments with Poker
4. Conclusions


SLIDE 7

Prediction with Expert Advice: Motivation

Motivation: Opponent Modelling in Poker
• Knowing your opponent gives rise to substantial performance gains
• Opponent types
• Which of the available types is your opponent?


SLIDE 10–14

Prediction with Expert Advice (Bandit Problems)

[Diagram, built up over slides 10–14: Experts #1–#3, the Player, and the Adversary. In round t the Player selects $I_t$, the Adversary selects $Y_t$, and the Player receives the payoff $R_t = g(I_t, Y_t)$.]

SLIDE 15

Non-stationarity?

Any powerful adversary is allowed – even non-stationary ones: the adversary's "strategy" can change with time!

SLIDE 16

Setting the goal

• Goal #1: Maximize total reward. Very ambitious!! (too ambitious!)
• Goal #2: Minimize the loss relative to the total reward of the best single expert (the regret)!
• Goal #2.1: ... uniformly over all adversaries of interest.

Consequences:
⇒ There should be a single good expert for each of the adversaries of interest
⇒ We might need a large number of experts
⇒ Bounds should scale well with the number of experts¹

Formally: $G_{i,n} = \sum_{t=1}^{n} g(i, Y_t)$, $\hat{G}_n = \sum_{t=1}^{n} g(I_t, Y_t)$; we want $R_n = \max_i G_{i,n} - \hat{G}_n \to \min$.

¹ An alternative is to consider tracking.
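To make the regret definition concrete, here is a small Python sketch (not part of the original slides) that computes $R_n$ from a full payoff matrix; the full matrix is assumed here purely for illustration – in the bandit setting only the chosen entries are ever observed.

```python
import numpy as np

def regret(payoffs, chosen):
    """R_n = max_i G_{i,n} - G_hat_n, per the definition above.
    payoffs[t, i] = g(i, Y_t); chosen[t] = I_t (the expert played in round t)."""
    G = payoffs.sum(axis=0)                                # G_{i,n} for every expert
    G_hat = payoffs[np.arange(len(chosen)), chosen].sum()  # realized total payoff
    return G.max() - G_hat
```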


SLIDE 26

Results for Adversarial Bandit Problems

Theorem. For any time horizon n, the expected total regret of the Exp3 algorithm is at most $2\sqrt{2\, n N \ln N}$.

P. Auer et al.: "The nonstochastic multi-armed bandit problem", SIAM Journal on Computing, 32:48–77, 2002.

Stationary environment²: $R_n = O(\ln n)$.

² T.L. Lai and H. Robbins: "Asymptotically efficient adaptive allocation rules", Advances in Applied Mathematics, 6:4–22, 1985.

SLIDE 27

The Exp3 Algorithm

Parameters: η – learning rate, γ – exploration rate.
Initialization: $w_0 = (1, \ldots, 1)^T$.
For each round t = 1, 2, ...
(1) select $I_t \in \{1, \ldots, N\}$ randomly according to $p_{i,t} = (1-\gamma)\, \frac{w_{i,t-1}}{\sum_{k=1}^{N} w_{k,t-1}} + \frac{\gamma}{N}$;
(2) observe $g_t = g(I_t, Y_t)$;
(3) compute the feedbacks $g'_t(i, Y_t)$, i = 1, ..., N: $g'_t(i, Y_t) = \mathbb{I}(I_t = i)\, g(I_t, Y_t)/p_{i,t}$;
(4) compute $w_{i,t} = w_{i,t-1}\, e^{\eta g'_t(i, Y_t)}$.
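A minimal, runnable Python sketch of the loop above (not from the original slides; the payoff callback `g` and the default η, γ values are illustrative assumptions):

```python
import numpy as np

def exp3(g, n_rounds, n_experts, eta=0.05, gamma=0.05, rng=None):
    """Exp3 sketch: g(i, t) returns the payoff of expert i in round t (assumed
    bounded in [0, 1]); only the chosen expert's payoff is observed."""
    rng = np.random.default_rng() if rng is None else rng
    log_w = np.zeros(n_experts)          # log-weights, w_0 = (1, ..., 1)
    total = 0.0
    for t in range(n_rounds):
        w = np.exp(log_w - log_w.max())  # normalize in log-space for stability
        p = (1 - gamma) * w / w.sum() + gamma / n_experts
        i = rng.choice(n_experts, p=p)   # (1) sample I_t
        payoff = g(i, t)                 # (2) observe g_t = g(I_t, Y_t)
        log_w[i] += eta * payoff / p[i]  # (3)+(4) importance-weighted update
        total += payoff
    return total
```

Keeping log-weights avoids the numerical overflow that the literal $w_{i,t} = w_{i,t-1} e^{\eta g'_t}$ recursion can cause over long runs.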


SLIDE 34

Interpretation

Fully observable case: g(i, Y_t) is known for each round. Exponentially weighted predictors:³
$w_{i,n} = \exp\left(\eta \sum_{t=1}^{n} g(i, Y_t)\right)$, γ = 0.
Here $g'_t(i, Y_t)$ is an unbiased estimate of the reward $g(i, Y_t)$, so
$\sum_{t=1}^{n} g(i, Y_t) \sim \sum_{t=1}^{n} g'_t(i, Y_t)$.

³ Weighted majority – Littlestone & Warmuth (1994); aggregating strategies – Vovk (1990); Freund & Schapire (1997, 1999).


SLIDE 37

Exp3: Unbiasedness of Payoff Estimates

$$\mathbb{E}[g'_t(i, Y_t) \mid \mathcal{F}_t] = \sum_{j=1}^{N} \mathbb{E}\big[\mathbb{I}(I_t = i)\, g(I_t, Y_t)/p_{i,t} \mid \mathcal{F}_t, I_t = j\big]\, \mathbb{P}(I_t = j \mid \mathcal{F}_t) = \frac{g(i, Y_t)}{p_{i,t}}\, \mathbb{P}(I_t = i \mid \mathcal{F}_t) = g(i, Y_t).$$
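A quick numerical sanity check of this identity (again not from the slides; the probabilities and payoffs are made-up values): averaging the importance-weighted feedback over many draws of $I_t$ recovers $g(i, Y_t)$ for every arm.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.3, 0.5])    # selection probabilities p_{i,t} (illustrative)
g = np.array([0.4, 0.7, 0.1])    # this round's true payoffs g(i, Y_t) (illustrative)

n, est = 200_000, np.zeros(3)
for _ in range(n):
    i_t = rng.choice(3, p=p)     # draw I_t
    est[i_t] += g[i_t] / p[i_t]  # g'_t(i, Y_t) is nonzero only for i = I_t
print(est / n)                   # ~ [0.4, 0.7, 0.1]: unbiased for every arm
```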


SLIDE 41

Example: Dynamic Pricing, single product

• Player: a vendor who wants to sell a product
• Adversary: a new customer in each round
• Problem: select the "right" price (high enough to make a profit, low enough to cut a deal!)
• Expert i suggests price $p_{1,i}$
• Highest price the customer is willing to accept: $p_2$ (unknown! never revealed!)
• Payoff: $g_i = p_{1,i}\, \mathbb{I}(p_{1,i} \le p_2)$; the vendor learns only $g_{I_t}$!
• $y = (p_2, p_{1,1}, \ldots, p_{1,N})$, so $g(i, y) = p_{1,i}\, \mathbb{I}(p_{1,i} \le p_2)$.
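A tiny sketch of this payoff rule (the function name and example prices are ours, for illustration only):

```python
def pricing_payoffs(prices, p2):
    """Single-product dynamic pricing: expert i earns its quoted price p_{1,i}
    if the customer accepts it (p_{1,i} <= p2), and nothing otherwise."""
    return [p1 if p1 <= p2 else 0.0 for p1 in prices]

# Customer secretly accepts up to 1.05; only the chosen expert's payoff is revealed.
print(pricing_payoffs([0.9, 1.0, 1.02, 1.1], p2=1.05))  # [0.9, 1.0, 1.02, 0.0]
```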


SLIDE 49

Some issues

• Practical performance is often very poor – yet we would like to use bandit algorithms in Poker for opponent modelling!
• The bound scales with N like $\sqrt{N \ln N}$; in the fully observable case it is $\sqrt{\ln N}$.
• Possible remedy(?): allow the best expert to change with time⁴
• More information/structure is often available – why not exploit it?

⁴ M.K. Warmuth, M. Herbster: "Tracking the best expert", Machine Learning, 32:151–178, 1998.


SLIDE 56

Side information

Bandit Problems with Side Information
1. The Player receives information $C_t$ about the environment state
2. The Player selects expert $I_t$
3. Expert $I_t$ plays against the adversary, knowing $C_t$
4. The Player receives the payoff $g(I_t, Y_t)$


SLIDE 60

The Key Observation

Hypothesis: In Exp3, any unbiased estimate of the immediate payoffs will do. Estimates with smaller variance should lead to more efficient algorithms.


SLIDE 63

Example: Dynamic Pricing, multiple products

• Side information: the cost of the product, v
• Payoff: $p_1\, \mathbb{I}(p_1 \le p_2) + (1-\alpha)\, v\, \mathbb{I}(p_1 > p_2)$
• $y = (v, p_2, p_{1,1}, \ldots, p_{1,N})$, so $g(i, y) = p_{1,i}\, \mathbb{I}(p_{1,i} \le p_2) + (1-\alpha)\, v\, \mathbb{I}(p_{1,i} > p_2)$
• Hypothesis: one should be able to reduce the payoff variance given knowledge of v
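The corresponding payoff rule as a sketch (function name and example numbers are ours): a sale earns the quoted price, while a failed sale leaves the vendor holding the product, worth $(1-\alpha)v$.

```python
def multi_product_payoffs(prices, p2, v, alpha):
    """Multi-product dynamic pricing: a sale at price p1 <= p2 earns p1; on a
    failed sale the vendor keeps the product, worth (1 - alpha) * v."""
    return [p1 if p1 <= p2 else (1 - alpha) * v for p1 in prices]

print(multi_product_payoffs([0.9, 1.0, 1.2], p2=1.05, v=1.0, alpha=0.3))
# [0.9, 1.0, 0.7]: the third expert overprices; only salvage value remains
```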


SLIDE 68

The Exp3G Algorithm

Parameters: real numbers 0 < η, γ < 1.
Initialization: $w_0 = (1, \ldots, 1)^T$.
For each round t = 1, 2, ...
(1) observe $C_t$, select $I_t \in \{1, \ldots, N\}$ according to $p_{i,t} = (1-\gamma)\, \frac{w_{i,t-1}}{\sum_{k=1}^{N} w_{k,t-1}} + \frac{\gamma}{N}$;
(2) observe $g_t = g(I_t, Y_t)$;
(3) based on $g_t$ and $C_t$, compute the feedbacks $g'_t(i, Y_t)$, i = 1, ..., N;
(4) compute $w_{i,t} = w_{i,t-1}\, e^{\eta g'_t(i, Y_t)}$.
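Since Exp3G differs from Exp3 only in how step (3) builds the feedbacks, a sketch can take the estimator as a plug-in. The callback signature below is our own choice, not from the paper, and the defaults are illustrative:

```python
import numpy as np

def exp3g(observe_context, play, feedback, n_rounds, n_experts,
          eta=0.05, gamma=0.05, rng=None):
    """Exp3G sketch: identical to Exp3 except that step (3) is a pluggable,
    unbiased payoff estimator feedback(j, chosen, payoff, p, ctx)."""
    rng = np.random.default_rng() if rng is None else rng
    log_w = np.zeros(n_experts)
    for t in range(n_rounds):
        ctx = observe_context(t)                  # (1) observe C_t
        w = np.exp(log_w - log_w.max())
        p = (1 - gamma) * w / w.sum() + gamma / n_experts
        i = rng.choice(n_experts, p=p)
        payoff = play(i, ctx, t)                  # (2) observe g_t
        for j in range(n_experts):                # (3) estimated feedbacks
            log_w[j] += eta * feedback(j, i, payoff, p, ctx)  # (4) update
    return log_w
```

Plain Exp3 is recovered with `feedback(j, i, payoff, p, ctx) = payoff / p[i] if j == i else 0.0`; the LExp and CExp3 estimators sketched later slot into the same hook.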

SLIDE 69

Exp3G: Expected Regret

Assumptions:
(A1) $\mathbb{E}[g(i, Y_t) \mid C_t, I^{t-1}, Y^{t-1}] \le 1$;
(A2) $\mathbb{E}[g'_t(i, Y_t) \mid C_t, I^{t-1}, Y^{t-1}] = \mathbb{E}[g(i, Y_t) \mid C_t, I^{t-1}, Y^{t-1}]$;
(A3) $\mathrm{Var}[g'_t(i, Y_t) \mid C_t, I^{t-1}, Y^{t-1}] \le \sigma^2$;
(A4) $|g'_t(i, Y_t)| \le B$.

Theorem. Consider algorithm Exp3G and assume A1–A4. Then for γ = 0 and a suitable η = η_n, with n sufficiently large,
$$\max_i \mathbb{E}\Big[\sum_{t=1}^{n} g(i, Y_t)\Big] - \mathbb{E}\Big[\sum_{t=1}^{n} g(I_t, Y_t)\Big] \le \sqrt{(1+\sigma^2)\, n \ln N}.$$

SLIDE 70

Exp3G: PAC-bounds on Regret

Theorem. Consider algorithm Exp3G. Assume A1–A4 and, further, that $|g(i, Y_t)| \le 1$. Then, for any δ > 0 and a suitable η = η_n with n sufficiently large, the following bound on the regret of Exp3G holds with probability at least 1 − δ:
$$\max_i G_{i,n} - \hat{G}_n \le n^{1/2}\left[\big((1+\sigma^2)\ln N\big)^{1/2} + \big(2+\sqrt{2}\,\sigma\big)\Big(\ln\frac{N+1}{\delta}\Big)^{1/2} + \frac{2(B+1)}{3}\,\ln\frac{N+1}{\delta}\right].$$


SLIDE 72

Likelihood-ratio-Based Payoff Estimation

Assumptions:
• Explicit model of the game being played (actions).
• Experts have probabilistic action-selection strategies.
• The probability of any action (or action sequence) can be queried for any expert.

Payoff estimate (single-stage games):
$$g'_t(i, Y_t) = \frac{\pi_i(A_t \mid C_t)}{\pi_{I_t}(A_t \mid C_t)}\, g(I_t, Y_t).$$
This estimate is unbiased. The extension to multi-stage games is straightforward (use the chain rule to show unbiasedness).


SLIDE 76

LExp: LR-based Payoff Estimation # 1

Simple LR-based estimate:
$$g'_t(i, Y_t) = \frac{\pi_i(A_t \mid C_t)}{\pi_{I_t}(A_t \mid C_t)}\, g(I_t, Y_t).$$
Insight: large likelihood ratios are likely to yield large variance.
Idea: when $\pi_i(A_t \mid C_t)/\pi_{I_t}(A_t \mid C_t)$ is big, set the estimate to 0 and compensate for the bias this introduces.

SLIDE 77

LExp: LR-based Payoff Estimation # 2

Simple LR-based estimate:
$$g'_t(i, Y_t) = \frac{\pi_i(A_t \mid C_t)}{\pi_{I_t}(A_t \mid C_t)}\, g(I_t, Y_t).$$
Let $\phi_t(I_t, A_t, i)$ be such that $\phi_t(I_t, A_t, i) = 1$ denotes the event that the likelihood ratio is big:
$$\phi_t(k, a, i) = \mathbb{I}\left(\frac{\pi_i(a \mid C_t)}{\pi_k(a \mid C_t)} > \frac{p_{k,t}}{p_{i,t}}\right).$$
Modified LR-based estimate:
$$g'_t(i, Y_t) = \frac{1 - \phi_t(I_t, A_t, i)}{\sum_{j=1}^{N} p_{j,t}\,\big(1 - \phi_t(j, A_t, i)\big)}\, \frac{\pi_i(A_t \mid C_t)}{\pi_{I_t}(A_t \mid C_t)}\, g(I_t, Y_t).$$
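A one-round sketch of the modified estimator, under our own encoding (an assumption, not from the paper): `pi[k, a]` holds expert k's probability of action a given the current context $C_t$, and every expert is assumed to give each action positive probability.

```python
import numpy as np

def lexp_feedback(pi, p, chosen, action, payoff):
    """Truncated, renormalized LR estimates g'_t(i, .) for all experts i.
    pi[k, a]: expert k's action probabilities given C_t; p: selection probs."""
    N = len(p)
    lr = pi[:, action] / pi[chosen, action]                 # pi_i / pi_{I_t}
    est = np.zeros(N)
    for i in range(N):
        phi = (pi[i, action] / pi[:, action]) > (p / p[i])  # phi_t(k, A_t, i)
        if phi[chosen]:
            continue                       # LR too big: the estimate is set to 0
        denom = np.sum(p * (1 - phi))      # renormalizer compensates the bias
        est[i] = lr[i] * payoff / denom
    return est
```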


SLIDE 81

CExp3: Control-variates # 1

Motivation: in dynamic pricing, the product $C_t$ controls to a large extent the distribution of the actual payoffs $g(i, Y_t)$ – hence also their variance.
Idea: consider the payoffs compensated for $C_t$ instead of the original payoffs:
$$g_c(i, Y_t) = g(i, Y_t) - r(C_t), \qquad g'_{c,t}(i, Y_t) = g'_t(i, Y_t) - r(C_t).$$
Here $r(C_t)$ is the mean payoff when seeing $C_t$.
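As a sketch in the plug-in form used for Exp3G above, with `r` an assumed estimate of the context-conditional mean payoff (e.g. a running average per context; both names are ours):

```python
def cexp3_feedback(j, chosen, payoff, p, ctx, r):
    """Control-variate feedback sketch: subtract the baseline r(C_t) from the
    importance-weighted estimate. Since E[g'_t(i)] = g(i, Y_t), we get
    E[g'_t(i) - r(C_t)] = g(i, Y_t) - r(C_t): unbiased for the compensated payoff."""
    base = payoff / p[chosen] if j == chosen else 0.0
    return base - r(ctx)
```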


SLIDE 83

CExp3: Control-variates # 2

Effect: $\mathrm{Var}[g'_t(i, Y_t) \mid Y^{t-1}, I^{t-1}]$ is reduced. The previous analysis can be repeated to show that this is beneficial – regret bounds for the compressed-range payoffs carry over to the regret defined with the unmodified payoffs.
Intuitive explanation: the correction compresses the range of the payoffs.


SLIDE 86

Dynamic pricing

Experts: [diagram; experts E1–E5 suggest prices around 0.9v, v, 1.1v, and v + 0.02, randomized with width b]

Customers: $p_2 = 1.1v + \frac{\beta - 50}{500}$, where $\beta \sim B(100, 0.5)$.

SLIDE 87

Results: Almost deterministic experts

[Plot: average regret vs. iteration (500–5000) for EXP3, CEXP3, and LEXP; b = 0.05.]

SLIDE 88

Dynamic pricing: Statistics # 1 (b = 0.05)

E[g'_t(i,Y_t)]    i=1    i=2    i=3    i=4    i=5
Exp3              0.388  0.399  0.401  0.371  0.396
CExp3             0.390  0.399  0.398  0.371  0.398
LExp              0.390  0.402  0.400  0.368  0.399
E[g(i,Y_t)]       0.390  0.400  0.399  0.371  0.399

Var[g'_t(i,Y_t)]  i=1    i=2    i=3    i=4    i=5
Exp3              1.782  1.435  1.427  2.097  1.573
CExp3             0.467  0.476  0.473  0.332  0.472
LExp              0.739  0.788  0.500  1.671  0.688
Var[g(i,Y_t)]     0.143  0.148  0.145  0.129  0.144

SLIDE 89

Results: Heavily randomized experts

[Plot: average regret vs. iteration (500–5000) for EXP3, CEXP3, and LEXP; b = 0.3.]

SLIDE 90

Dynamic pricing: Statistics # 2 (b = 0.3)

E[g'_t(i,Y_t)]    i=1    i=2    i=3    i=4    i=5
Exp3              0.338  0.351  0.347  0.385  0.383
CExp3             0.343  0.354  0.348  0.381  0.382
LExp              0.343  0.356  0.351  0.383  0.384
E[g(i,Y_t)]       0.343  0.356  0.350  0.383  0.384

Var[g'_t(i,Y_t)]  i=1    i=2    i=3    i=4    i=5
Exp3              2.107  1.929  2.014  1.046  1.169
CExp3             0.735  0.724  0.726  0.744  0.745
LExp              0.856  0.573  0.651  0.475  0.412
Var[g(i,Y_t)]     0.153  0.151  0.150  0.141  0.143


SLIDE 92

Experiments with Tracking Algorithms

Goal: minimize the regret against the sequence of best experts, where the frequency of expert changes is upper bounded.
Problem: Exp3 and its variants can run into problems (weights converge too fast – too slow a response at changepoints).
Warmuth & Herbster: "Tracking the Best Expert" (Machine Learning, 1998); Fixed-Share Algorithm:
$$w_{i,t} = \alpha\, \frac{\sum_{j=1}^{N} w_{j,t-1}\, e^{\eta g'_t(j, Y_t)}}{N} + (1-\alpha)\, w_{i,t-1}\, e^{\eta g'_t(i, Y_t)}.$$
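A sketch of this fixed-share step with the estimated feedbacks plugged in (the default α is an illustrative choice):

```python
import numpy as np

def fixed_share_update(w, g_hat, eta=0.05, alpha=0.01):
    """Fixed-share weight update sketch: after the exponential update, a
    fraction alpha of the total mass is shared equally among all experts,
    so no weight ever collapses and changepoints can be tracked."""
    v = w * np.exp(eta * g_hat)     # exponential update with the feedbacks g'_t
    return alpha * v.sum() / len(v) + (1 - alpha) * v
```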


SLIDE 97

Parameters of Experiments

round   m    k    E[g(1,Y_t)]  E[g(2,Y_t)]  E[g(3,Y_t)]  E[g(4,Y_t)]  E[g(5,Y_t)]
–       100  1.1  0.3906       0.4000       0.3992       0.3714       0.3989
5000    10   0.1  0.3767       0.3641       0.3716       0.371        0.3760
10000   20   0.9  0.3607       0.3604       0.3605       0.3615       0.3606
15000   100  1.0  0.3785       0.3766       0.3806       0.3687       0.3822

b = 0.05, $p_2 = kv + \frac{\beta - m/2}{m}$, $\beta \sim B(m, 0.5)$.

SLIDE 98

No exploration, fixed-share update-rule

[Plot: cumulative regret vs. iteration (up to 15000) for fixed-share EXP3, fixed-share CEXP3, and fixed-share LEXP, all with gamma = 0.]

SLIDE 99

Results for CExp3 variants

[Plot: cumulative regret vs. iteration (up to 15000) for CEXP3, restart CEXP3, fixed-share CEXP3, and fixed-share CEXP3 (gamma = 0).]

SLIDE 100

Expert-selection Frequencies (CExp3)

[Four plots of choice probability vs. iteration (up to 15000) for experts 1–5, under CEXP3, fixed-share CEXP3 (gamma = 0), restart CEXP3, and fixed-share CEXP3.]


SLIDE 102

Opponent Modelling in Poker

• Omaha Hi-Lo
• McRaise: Monte-Carlo sampling to approximate action-values
• Opponent model: weights card configurations in the face of opponents' previous actions
• Opponent models: random, greedy, smooth, mcr, spsa, humanoid
• Algorithms: Exp3, aCExp3, cCExp3, LExp


SLIDE 107

Opponent Modelling in Poker: Regret

[Plot: average regret (sb/h) vs. iteration (500–5000) for EXP3, aCEXP3, cCEXP3, and LEXP.]

SLIDE 108

Poker: Choosing the best expert

[Plot: probability of choosing the spsa expert vs. iteration (up to 5000) under EXP3, aCEXP3, cCEXP3, and LEXP.]

SLIDE 109

Conclusions

• Regret-minimization framework ∼ non-stationary environments
• Bandit algorithms: the role of payoff-estimate variance; algorithms for reducing the variance
• Results: substantial improvements ... though in poker still far from being useful (so far)


SLIDE 115

Future Work

• Non-oblivious (adapting) opponents
• Use $C_t$ in expert selection
• External regret vs. internal regret

SLIDE 118

Questions?

More information: www.sztaki.hu/~szcsaba