SLIDE 1

Announcements

➢ HW 1 is due now

SLIDE 2

CS6501: Topics in Learning and Game Theory (Fall 2019)

Adversarial Multi-Armed Bandits

Instructor: Haifeng Xu

SLIDE 3

Outline

➢ The Adversarial Multi-armed Bandit Problem
➢ A Basic Algorithm: Exp3
➢ Regret Analysis of Exp3

SLIDE 4

Recap: Online Learning So Far

Setup: $T$ rounds; the following occurs at round $t$:
1. Learner picks a distribution $p_t$ over actions $[n]$
2. Adversary picks cost vector $c_t \in [0,1]^n$
3. Action $i_t \sim p_t$ is chosen and the learner incurs cost $c_t(i_t)$
4. Learner observes $c_t$ (for use in future time steps)

Performance is typically measured by regret:
$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

The multiplicative weight update algorithm has regret $O(\sqrt{T \ln n})$.
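For concreteness, here is a minimal Python sketch of this full-information protocol with the multiplicative weight update; the adversary callback `costs_fn` and all names are illustrative, not from the slides.

```python
import numpy as np

def mwu(T, n, costs_fn, eps, seed=0):
    """Multiplicative weights in the full-information setting.

    costs_fn(t, p) must return the cost vector c_t in [0,1]^n; it may
    depend on the learner's distribution p_t but not on the sampled action.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(n)                      # weights w_1(i) = 1
    exp_cost, cum_costs = 0.0, np.zeros(n)
    for t in range(T):
        p = w / w.sum()                 # p_t(i) = w_t(i) / W_t
        c = costs_fn(t, p)              # the FULL cost vector is revealed
        i = rng.choice(n, p=p)          # action i_t ~ p_t (cost c[i] is incurred)
        exp_cost += p @ c               # track sum_i c_t(i) p_t(i), as in R_T
        cum_costs += c
        w *= 1.0 - eps * c              # update every action's weight
    return exp_cost - cum_costs.min()   # regret R_T against the best fixed action
```

With $\varepsilon \approx \sqrt{\ln n / T}$ this matches the $O(\sqrt{T \ln n})$ rate quoted above.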

SLIDE 5

Recap: Online Learning So Far

Convergence to equilibrium
➢ In repeated zero-sum games, if both players use a no-regret learning algorithm, their average strategies converge to a Nash equilibrium (NE)
➢ In general games, the average strategy converges to a coarse correlated equilibrium (CCE)

Swap regret – a "stronger" regret concept with better convergence
➢ Def: each action $i$ has a chance to deviate to another action $\delta(i)$
➢ There is a general reduction converting any learning algorithm with regret $R$ into one with swap regret $nR$
➢ In repeated general games, if both players use a no-swap-regret learning algorithm, their average strategy converges to a correlated equilibrium (CE)

SLIDE 6

This Lecture: Address Partial Feedback

➢ In online learning, the whole cost vector $c_t$ is observed by the learner, even though she takes only a single action $i_t$
  • Realistic in some applications, e.g., stock investment
➢ In many cases, we only see the reward of the action we take
  • For example: slot machines, a.k.a. multi-armed bandits
SLIDE 7

Other Applications with Partial Feedback

➢ Online advertisement placement or web ranking
  • Action: an ad placement or a ranking of web pages
  • Cannot see the feedback for untaken actions
SLIDE 8

Other Applications with Partial Feedback

➢ Online advertisement placement or web ranking
  • Action: an ad placement or a ranking of web pages
  • Cannot see the feedback for untaken actions

➢ Recommendation systems:
  • Action = recommended option (e.g., a restaurant)
  • Do not know other options' feedback
SLIDE 9

Other Applications with Partial Feedback

➢ Online advertisement placement or web ranking
  • Action: an ad placement or a ranking of web pages
  • Cannot see the feedback for untaken actions

➢ Recommendation systems:
  • Action = recommended option (e.g., a restaurant)
  • Do not know other options' feedback

➢ Clinical trials
  • Action = a treatment
  • Don't know what would happen for treatments not chosen

➢ Playing strategic games
  • Cannot observe opponents' strategies; only know the payoff of the taken action
  • E.g., poker games, competition in markets
SLIDE 10

Adversarial Multi-Armed Bandits (MAB)

➢ Very much like online learning, except with partial feedback
  • The name "bandit" is inspired by slot machines

➢ Model: at each time step $t = 1, \cdots, T$, the following occurs in order:
1. Learner picks a distribution $p_t$ over arms $[n]$
2. Adversary picks cost vector $c_t \in [0,1]^n$
3. Arm $i_t \sim p_t$ is chosen and the learner incurs cost $c_t(i_t)$
4. Learner only observes $c_t(i_t)$ (for use in future time steps)

➢ Though we cannot observe $c_t$, the adversary still picks $c_t$ before $i_t$ is sampled

Q: since the learner does not observe $c_t(i)$ for $i \neq i_t$, can the adversary arbitrarily modify these $c_t(i)$'s after $i_t$ has been selected?

No, because that would make $c_t$ depend on the sampled $i_t$, which is not allowed.
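To make the feedback restriction concrete, a one-round sketch of this protocol (the `adversary` callback is an illustrative assumption):

```python
import numpy as np

def bandit_round(p, adversary, rng):
    """One round of the adversarial bandit protocol."""
    c = adversary(p)             # adversary commits to c_t before i_t is sampled
    i = rng.choice(len(p), p=p)  # arm i_t ~ p_t
    return i, c[i]               # learner sees ONLY c_t(i_t), never the full c_t
```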

SLIDE 11

Outline

➢ The Adversarial Multi-armed Bandit Problem
➢ A Basic Algorithm: Exp3
➢ Regret Analysis of Exp3

SLIDE 12

Recall the algorithm for the full-information setting:

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $w_t(i)/W_t$
2. Observe cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot (1 - \varepsilon\, c_t(i))$

SLIDE 13

Recall the algorithm for the full-information setting, now with an exponential update:

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $w_t(i)/W_t$
2. Observe cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon\, c_t(i)}$

Recall $1 - \varepsilon \approx e^{-\varepsilon}$ for small $\varepsilon$

SLIDE 14

➢ In this lecture we will use this exponential-weight variant, and prove its regret bound en route
➢ Also called Exponential Weight Update (EWU)

Recall the algorithm for the full-information setting:

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $w_t(i)/W_t$
2. Observe cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon\, c_t(i)}$

Recall $1 - \varepsilon \approx e^{-\varepsilon}$ for small $\varepsilon$
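For later reference (the upper bound is exactly what Fact 2 of the regret analysis uses), the standard two-sided Taylor bounds behind this approximation can be written as:

```latex
1 - \varepsilon \;\le\; e^{-\varepsilon} \;\le\; 1 - \varepsilon + \frac{\varepsilon^2}{2}
\qquad \text{for all } \varepsilon \ge 0 .
```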

SLIDE 15

Basic idea of Exp3

➢ Want to use EWU, but we do not know the vector $c_t$
➢ Well, we really only have $c_t(i_t)$ – what can we do? → try to estimate $c_t$!

Estimate $\hat{c}_t = (0, \cdots, 0, c_t(i_t), 0, \cdots, 0)$? Too optimistic.
Estimate $\hat{c}_t = (0, \cdots, 0, c_t(i_t)/p_t(i_t), 0, \cdots, 0)$ instead.

Recall the algorithm for the full-information setting:

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $w_t(i)/W_t$
2. Observe cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon\, c_t(i)}$

SLIDE 16

Exp3: a Basic Algorithm for Adversarial MAB

➢ That is, the weight is updated only for the pulled arm
  • Because we really don't know how good the other arms are at time $t$
  • But $i_t$ is penalized more heavily now
  • Attention: $c_t(i_t)/p_t(i_t)$ may be extremely large if $p_t(i_t)$ is small
➢ Called Exp3: the Exponential-weight algorithm for Exploration and Exploitation

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $p_t(i) = w_t(i)/W_t$
2. Observe cost $c_t(i_t)$ of the pulled arm
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon\, \hat{c}_t(i)}$, where $\hat{c}_t = (0, \cdots, 0, c_t(i_t)/p_t(i_t), 0, \cdots, 0)$
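A minimal Python sketch of the boxed Exp3 loop (the `adversary` callback and the seeding are illustrative assumptions, not part of the algorithm):

```python
import numpy as np

def exp3(T, n, adversary, eps, seed=0):
    """Exp3 with the importance-weighted cost estimate c_hat."""
    rng = np.random.default_rng(seed)
    w = np.ones(n)                    # w_1(i) = 1
    for t in range(T):
        p = w / w.sum()               # p_t(i) = w_t(i) / W_t
        c = adversary(p)              # c_t is fixed before the arm is drawn
        i = rng.choice(n, p=p)        # pull arm i_t ~ p_t, incur cost c[i]
        c_hat = np.zeros(n)
        c_hat[i] = c[i] / p[i]        # only the pulled coordinate is nonzero
        w *= np.exp(-eps * c_hat)     # so only the pulled arm's weight changes
    return w
```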

SLIDE 17

A Closer Look at the Estimator $\hat{c}_t$

➢ $\hat{c}_t$ is random – it depends on the randomly sampled $i_t \sim p_t$
➢ $\hat{c}_t$ is an unbiased estimator of $c_t$, i.e., $\mathbb{E}_{i_t \sim p_t}[\hat{c}_t] = c_t$
  • Because given $p_t$, for any $i$ we have
$$\mathbb{E}_{i_t \sim p_t}[\hat{c}_t(i)] = \mathbb{P}(i_t = i) \cdot \frac{c_t(i)}{p_t(i)} + \mathbb{P}(i_t \neq i) \cdot 0 = p_t(i) \cdot \frac{c_t(i)}{p_t(i)} = c_t(i)$$
➢ This is exactly the reason for our choice of $\hat{c}_t$
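A quick Monte Carlo sanity check of this unbiasedness claim (throwaway simulation code; the particular $p_t$ and $c_t$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
p = np.array([0.1, 0.2, 0.3, 0.4])   # a fixed distribution p_t
c = rng.uniform(size=n)              # a fixed cost vector c_t

est, trials = np.zeros(n), 200_000
for _ in range(trials):
    i = rng.choice(n, p=p)           # i_t ~ p_t
    est[i] += c[i] / p[i]            # accumulate the estimator c_hat_t
print(est / trials)                  # empirical E[c_hat_t]: close to c
print(c)
```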

SLIDE 18

Regret

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

Some key differences from online learning:

➢ $R_T$ is random (even though it already takes an expectation over $i_t \sim p_t$)
  • Because the distribution $p_t$ itself is random; it depends on the sampled $i_1, \cdots, i_{t-1}$
  • That is, if we run the same algorithm multiple times, we will get different $R_T$ values even when facing the same adversary!

SLIDE 19

Regret

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

Some key differences from online learning:

➢ $R_T$ is random (even though it already takes an expectation over $i_t \sim p_t$)
  • Because the distribution $p_t$ itself is random; it depends on the sampled $i_1, \cdots, i_{t-1}$
  • That is, if we run the same algorithm multiple times, we will get different $R_T$ values even when facing the same adversary!

Illustration: $w_1(i) = 1\ \forall i$ → round 1, pull arm 1 → $w_2(i) = 1\ \forall i \neq 1$, $w_2(1) < 1$ → round 2 → $\cdots$

SLIDE 20

Regret

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

Some key differences from online learning:

➢ $R_T$ is random (even though it already takes an expectation over $i_t \sim p_t$)
  • Because the distribution $p_t$ itself is random; it depends on the sampled $i_1, \cdots, i_{t-1}$
  • That is, if we run the same algorithm multiple times, we will get different $R_T$ values even when facing the same adversary!

Illustration: $w_1(i) = 1\ \forall i$ → round 1, pull arm 2 → $w_2(i) = 1\ \forall i \neq 2$, $w_2(2) < 1$ → round 2 → $\cdots$

SLIDE 21

Regret

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

Some key differences from online learning:

➢ $R_T$ is random (even though it already takes an expectation over $i_t \sim p_t$)
  • Because the distribution $p_t$ itself is random; it depends on the sampled $i_1, \cdots, i_{t-1}$
  • That is, if we run the same algorithm multiple times, we will get different $R_T$ values even when facing the same adversary
➢ The cost vector $c_t$ is also random, as it generally depends on $p_t$
  • The adversary maps the distribution $p_t$ to a cost vector $c_t$
➢ This is not the case in online learning
  • If we run the same algorithm multiple times, we obtain the same $R_T$ value when facing the same adversary

SLIDE 22

Regret

➢ Therefore, in principle, we have to upper bound $\mathbb{E}(R_T)$, where the expectation is over the randomness of arm sampling

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

$$\mathbb{E}(R_T) = \mathbb{E}\Big[\sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)\Big] = \sum_{i\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(i)\, p_t(i)] - \mathbb{E}\Big[\min_{k\in[n]} \sum_{t\in[T]} c_t(k)\Big]$$

by linearity of expectation

SLIDE 23

Regret

➢ Therefore, in principle, we have to upper bound $\mathbb{E}(R_T)$, where the expectation is over the randomness of arm sampling

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

$$\mathbb{E}(R_T) = \sum_{i\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(i)\, p_t(i)] - \mathbb{E}\Big[\min_{k\in[n]} \sum_{t\in[T]} c_t(k)\Big] \geq \sum_{i\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(i)\, p_t(i)] - \min_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(k)]$$

because $\min_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(k)] \geq \mathbb{E}\big[\min_{k\in[n]} \sum_{t\in[T]} c_t(k)\big]$ (proof: homework exercise)
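The inequality in the last step is deferred to homework; a quick simulation that illustrates it, using i.i.d. uniform costs purely as a stand-in adversary (an assumption for the demo):

```python
import numpy as np

rng = np.random.default_rng(2)
runs, T, n = 10_000, 50, 3
# totals[r, k] = sum_t c_t(k) in run r
totals = rng.uniform(size=(runs, T, n)).sum(axis=1)
print(totals.mean(axis=0).min())  # min_k E[ sum_t c_t(k) ]
print(totals.min(axis=1).mean())  # E[ min_k sum_t c_t(k) ]  -- never larger
```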

SLIDE 24

Regret

➢ Therefore, in principle, we have to upper bound $\mathbb{E}(R_T)$, where the expectation is over the randomness of arm sampling

$$R_T = \sum_{i\in[n]} \sum_{t\in[T]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$

$$\mathbb{E}(R_T) = \sum_{i\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(i)\, p_t(i)] - \mathbb{E}\Big[\min_{k\in[n]} \sum_{t\in[T]} c_t(k)\Big] \geq \sum_{i\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(i)\, p_t(i)] - \min_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(k)]$$

The last expression is called the pseudo-regret $\bar{R}_T$

➢ A good regret guarantee implies a good pseudo-regret guarantee, but not the reverse

SLIDE 25

Bounding regret turns out to be challenging

➢ Exp3 is not sufficient to guarantee small regret
➢ Next, we instead prove that Exp3 has small pseudo-regret
  • As is typical in many works
➢ A slight modification of Exp3 can be proved to have small regret

SLIDE 26

Outline

➢ The Adversarial Multi-armed Bandit Problem
➢ A Basic Algorithm: Exp3
➢ Regret Analysis of Exp3

SLIDE 27

High-level idea of the proof

➢ Pretend to be in the full-information setting with cost equal to the estimate $\hat{c}_t$
➢ Relate $\hat{c}_t$ to $c_t$, since we know $\hat{c}_t$ is an unbiased estimator of $c_t$

Theorem. The pseudo-regret of Exp3 is $O(\sqrt{nT \ln n})$.
SLIDE 28

Imitate a Full-Info Setting with Cost $\hat{c}_t$

➢ Recall the regret bound for the full-information setting:
$$R_T^{\text{full}} \leq \frac{\ln n}{\varepsilon} + \varepsilon\, T$$
➢ This holds for any cost vector, thus also for $\hat{c}_t$
➢ But... one issue is that $\hat{c}_t(i_t)$ may be greater than 1
➢ Not a big issue – the same analysis yields the following bound:
$$R_T^{\text{full}} \leq \frac{\ln n}{\varepsilon} + \varepsilon \max_{i} \sum_{t\in[T]} \hat{c}_t(i)^2$$

Real issue: $\hat{c}_t(i)$ may be so large that we cannot bound $R_T^{\text{full}}$

SLIDE 29

Imitate a Full-Info Setting with Cost $\hat{c}_t$

A regret bound as follows turns out to work for our proof:
$$R_T^{\text{full}} \leq \frac{\ln n}{\varepsilon} + \varepsilon \sum_{t} \sum_{i} p_t(i)\, \hat{c}_t(i)^2$$

➢ That is, instead of taking $\max_i$, the bound here averages over $i$
➢ Why is this more useful?
  • The $p_t(i)$ term will help cancel the $p_t(i)$ denominator in $\hat{c}_t(i) = c_t(i)/p_t(i)$
  • This turns out to be enough to bound the regret

SLIDE 30

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Parameter: $\varepsilon$
Initialize weights $w_1(i) = 1$ for all $i = 1, \cdots, n$
For $t = 1, \cdots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick arm $i$ with probability $p_t(i) = w_t(i)/W_t$
2. Observe cost vector $\hat{c}_t \geq 0$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot e^{-\varepsilon\, \hat{c}_t(i)}$

Note: this yields a bound of $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} T$ when $\hat{c}_t \in [0,1]^n$

SLIDE 31

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Proof: similar technique – carefully bound a certain quantity
➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

SLIDE 32

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Proof: similar technique – carefully bound a certain quantity
➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Why this term?
➢ It tracks the weight decrease (will be clear on the next slide)
➢ Algebraic reasons: $e^{-\varepsilon} \approx 1 - \varepsilon + \varepsilon^2/2$, which gives rise to both the term $p_t(i)\,\hat{c}_t(i)$ and the term $p_t(i)\,\hat{c}_t(i)^2$

SLIDE 33

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 1. $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)} = W_{t+1}/W_t$, where $W_t = \sum_i w_t(i)$.
  • The term $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$ is the rate at which $W_t$ decreases
  • Formal proof: HW exercise

SLIDE 34

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 1. $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)} = W_{t+1}/W_t$, where $W_t = \sum_i w_t(i)$.
  • The term $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$ is the rate at which $W_t$ decreases
  • Formal proof: HW exercise
  • Corollary. $\sum_t \log\big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\big) = \log W_{T+1} - \log n$
  • Telescoping sum and $W_1 = n$
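Spelled out, the telescoping behind the corollary combines Fact 1 with $W_1 = n$:

```latex
\sum_{t=1}^{T} \log\Big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\Big)
= \sum_{t=1}^{T} \log\frac{W_{t+1}}{W_t}
= \log W_{T+1} - \log W_1
= \log W_{T+1} - \log n .
```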

SLIDE 35

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 2. $\sum_t \log\big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\big) \leq -\varepsilon \sum_{t,i} p_t(i)\, \hat{c}_t(i) + \frac{\varepsilon^2}{2} \sum_{t,i} p_t(i)\, \hat{c}_t(i)^2$.

Follows from algebraic calculation

SLIDE 36

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 2. $\sum_t \log\big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\big) \leq -\varepsilon \sum_{t,i} p_t(i)\, \hat{c}_t(i) + \frac{\varepsilon^2}{2} \sum_{t,i} p_t(i)\, \hat{c}_t(i)^2$.

$$\sum_t \log\Big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\Big) \leq \sum_t \log\Big(\sum_{i\in[n]} p_t(i)\,\Big[1 - \varepsilon \hat{c}_t(i) + \frac{\varepsilon^2}{2}\hat{c}_t(i)^2\Big]\Big)$$

By $e^{-x} \leq 1 - x + x^2/2$ for $x \geq 0$

Follows from algebraic calculation

SLIDE 37

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 2. $\sum_t \log\big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\big) \leq -\varepsilon \sum_{t,i} p_t(i)\, \hat{c}_t(i) + \frac{\varepsilon^2}{2} \sum_{t,i} p_t(i)\, \hat{c}_t(i)^2$.

$$\sum_t \log\Big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\Big) \leq \sum_t \log\Big(\sum_{i\in[n]} p_t(i)\,\Big[1 - \varepsilon \hat{c}_t(i) + \frac{\varepsilon^2}{2}\hat{c}_t(i)^2\Big]\Big)$$
$$= \sum_t \log\Big(1 - \sum_{i\in[n]} p_t(i)\,\varepsilon \hat{c}_t(i) + \sum_{i\in[n]} p_t(i)\,\frac{\varepsilon^2}{2}\hat{c}_t(i)^2\Big)$$

Since $\sum_{i\in[n]} p_t(i) = 1$

Follows from algebraic calculation

SLIDE 38

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$

Fact 2. $\sum_t \log\big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\big) \leq -\varepsilon \sum_{t,i} p_t(i)\, \hat{c}_t(i) + \frac{\varepsilon^2}{2} \sum_{t,i} p_t(i)\, \hat{c}_t(i)^2$.

$$\sum_t \log\Big(\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}\Big) \leq \sum_t \log\Big(\sum_{i\in[n]} p_t(i)\,\Big[1 - \varepsilon \hat{c}_t(i) + \frac{\varepsilon^2}{2}\hat{c}_t(i)^2\Big]\Big)$$
$$= \sum_t \log\Big(1 - \sum_{i\in[n]} p_t(i)\,\varepsilon \hat{c}_t(i) + \sum_{i\in[n]} p_t(i)\,\frac{\varepsilon^2}{2}\hat{c}_t(i)^2\Big)$$
$$\leq -\varepsilon \sum_{t,i} p_t(i)\, \hat{c}_t(i) + \frac{\varepsilon^2}{2} \sum_{t,i} p_t(i)\, \hat{c}_t(i)^2$$

Since $\log(1+x) \leq x$ for any $x > -1$

Follows from algebraic calculation
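A quick numerical check of the two elementary inequalities used in this chain (throwaway code; the grids are arbitrary):

```python
import numpy as np

x = np.linspace(0.0, 10.0, 1_000)                # entries eps * c_hat_t(i) are >= 0
assert np.all(np.exp(-x) <= 1.0 - x + x**2 / 2)  # e^{-x} <= 1 - x + x^2/2 for x >= 0

y = np.linspace(-0.99, 10.0, 1_000)              # log(1 + y) needs y > -1
assert np.all(np.log1p(y) <= y + 1e-12)          # log(1 + y) <= y
print("both inequalities hold on the test grids")
```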

SLIDE 39

Step 1: Tighter Regret for Full-Info Case

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

➢ Consider the quantity $\sum_{i\in[n]} p_t(i)\, e^{-\varepsilon \hat{c}_t(i)}$
➢ Combining the two facts yields the lemma
  • HW exercise

SLIDE 40

Step 2: Relate $\hat{c}_t$ to Pseudo-Regret

Recall the pseudo-regret definition:
$$\bar{R}_T = \sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t] - \min_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(k)] = \max_{k\in[n]} \Big( \sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t] - \sum_{t\in[T]} \mathbb{E}[c_t(k)] \Big) = \max_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)]$$

The term $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)]$ is the pseudo-regret from action $k$.

SLIDE 41

Step 2: Relate $\hat{c}_t$ to Pseudo-Regret

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ That is, the expected pseudo-regret from $k$ w.r.t. the true cost $c_t$ equals that w.r.t. the estimated cost $\hat{c}_t$

Recall the pseudo-regret definition:
$$\bar{R}_T = \sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t] - \min_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t(k)] = \max_{k\in[n]} \sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)]$$

(pseudo-regret from action $k$)

SLIDE 42

Step 2: Relate $\hat{c}_t$ to Pseudo-Regret

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ Proof:
$$\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)] = \mathbb{E}\big[\,\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k) \mid p_t]\,\big]$$

Because the randomness of $\hat{c}_t$ comes from:
1. the randomness of $i_t \sim p_t$
2. the randomness of $p_t$ itself, which depends on $i_1, \cdots, i_{t-1}$
SLIDE 43

Step 2: Relate $\hat{c}_t$ to Pseudo-Regret

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ Proof:
$$\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)] = \mathbb{E}\big[\,\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k) \mid p_t]\,\big] = \mathbb{E}\big[\,\mathbb{E}[c_t \cdot p_t - c_t(k) \mid p_t]\,\big]$$

Because conditioned on $p_t$, $\hat{c}_t$ is an unbiased estimator of $c_t$

SLIDE 44

Step 2: Relate $\hat{c}_t$ to Pseudo-Regret

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ Proof:
$$\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)] = \mathbb{E}\big[\,\mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k) \mid p_t]\,\big] = \mathbb{E}\big[\,\mathbb{E}[c_t \cdot p_t - c_t(k) \mid p_t]\,\big] = \mathbb{E}[c_t \cdot p_t - c_t(k)]$$

SLIDE 45

Step 3: Derive Pseudo-Regret Bounds

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ For any $k$, we have
$$\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \mathbb{E}\Big[\sum_{t\in[T]} \hat{c}_t \cdot p_t - \hat{c}_t(k)\Big] \leq \mathbb{E}\Big[\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2\Big]$$

By Lemma 2 (first equality) and Lemma 1 (the inequality)

SLIDE 46

Step 3: Derive Pseudo-Regret Bounds

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ For any $k$, we have
$$\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \mathbb{E}\Big[\sum_{t\in[T]} \hat{c}_t \cdot p_t - \hat{c}_t(k)\Big] \leq \mathbb{E}\Big[\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2\Big]$$
$$= \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\,\mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2 \,\Big|\, p_t\Big]\Big]$$

By conditional expectation

SLIDE 47

Step 3: Derive Pseudo-Regret Bounds

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ For any $k$, we have
$$\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \mathbb{E}\Big[\sum_{t\in[T]} \hat{c}_t \cdot p_t - \hat{c}_t(k)\Big] \leq \mathbb{E}\Big[\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2\Big]$$
$$= \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\,\mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2 \,\Big|\, p_t\Big]\Big] = \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \mathbb{E}\big[\hat{c}_t(i)^2 \mid p_t\big]\Big]$$

By linearity of expectation

SLIDE 48

Step 3: Derive Pseudo-Regret Bounds

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ For any $k$, we have
$$\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \mathbb{E}\Big[\sum_{t\in[T]} \hat{c}_t \cdot p_t - \hat{c}_t(k)\Big] \leq \mathbb{E}\Big[\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2\Big]$$
$$= \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\,\mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2 \,\Big|\, p_t\Big]\Big] = \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \mathbb{E}\big[\hat{c}_t(i)^2 \mid p_t\big]\Big]$$

Observe $\mathbb{E}\big[\hat{c}_t(i)^2 \mid p_t\big] = 0 \cdot (1 - p_t(i)) + \Big(\frac{c_t(i)}{p_t(i)}\Big)^2 \cdot p_t(i) = \frac{c_t(i)^2}{p_t(i)}$


SLIDE 50

Step 3: Derive Pseudo-Regret Bounds

Lemma 1. The regret of the following algorithm is at most $\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2$ for any cost vectors $\hat{c}_t \geq 0$.

Lemma 2. $\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \sum_{t\in[T]} \mathbb{E}[\hat{c}_t \cdot p_t - \hat{c}_t(k)]$

➢ For any $k$, we have
$$\sum_{t\in[T]} \mathbb{E}[c_t \cdot p_t - c_t(k)] = \mathbb{E}\Big[\sum_{t\in[T]} \hat{c}_t \cdot p_t - \hat{c}_t(k)\Big] \leq \mathbb{E}\Big[\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2} \sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2\Big]$$
$$= \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\,\mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \hat{c}_t(i)^2 \,\Big|\, p_t\Big]\Big] = \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\sum_t \sum_i p_t(i)\, \mathbb{E}\big[\hat{c}_t(i)^2 \mid p_t\big]\Big]$$
$$= \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, \mathbb{E}\Big[\sum_t \sum_i c_t(i)^2\Big] \leq \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\, nT$$

Picking $\varepsilon = \sqrt{\frac{2 \ln n}{nT}}$ yields a pseudo-regret bound of $O(\sqrt{nT \ln n})$
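The tuning step numerically, as a tiny helper (illustrative; it just minimizes the displayed bound over $\varepsilon$):

```python
import math

def exp3_tuned_bound(n, T):
    """Minimize ln(n)/eps + (eps/2) * n * T over eps > 0."""
    eps = math.sqrt(2 * math.log(n) / (n * T))    # eps* = sqrt(2 ln n / (nT))
    bound = math.log(n) / eps + eps * n * T / 2   # equals sqrt(2 nT ln n)
    return eps, bound

print(exp3_tuned_bound(n=10, T=100_000))          # bound ~ sqrt(2 * nT * ln n)
```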

SLIDE 51

Summary of the Proof

➢ A tighter regret bound for the full-information setting
➢ Treat the (realized) estimate $\hat{c}_t$ as the cost in the full-information setting
➢ Expected pseudo-regret w.r.t. $c_t$ equals expected pseudo-regret w.r.t. $\hat{c}_t$
➢ Upper bound the pseudo-regret by taking expectation over the $\hat{c}_t$'s

SLIDE 52

The True Regret and Beyond

➢ Exp3 does not guarantee good true regret, still because $c_t(i)/p_t(i)$ may be too large
  • Pseudo-regret "smooths out" $p_t(i)$ by taking expectations first
➢ To obtain good true regret, we need to modify Exp3 by adding some uniform exploration so that $p_t(i)$ is never too small (see the sketch after this list)
  • More intricate analysis, but it yields the same regret bound $O(\sqrt{nT \ln n})$
➢ In addition to adversarial feedback, a "nicer" setting is when the cost of each arm is drawn from a fixed but unknown distribution
  • Called stochastic multi-armed bandits
  • Naturally, Exp3 and its regret bound $O(\sqrt{nT \ln n})$ still apply
  • But a better algorithm called Upper Confidence Bound (UCB) yields a much better regret bound of $O(n \ln T)$
  • Different analysis techniques
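As a rough illustration of the uniform-exploration fix mentioned in the second bullet above, here is a hedged sketch in the spirit of Exp3 with explicit exploration; the mixing parameter `gamma` and its tuning are assumptions here, not the tuned constants from the literature.

```python
import numpy as np

def exp3_with_exploration(T, n, adversary, eps, gamma, seed=0):
    """Exp3 mixed with gamma-uniform exploration so that p_t(i) >= gamma / n."""
    rng = np.random.default_rng(seed)
    w = np.ones(n)
    for t in range(T):
        q = w / w.sum()
        p = (1 - gamma) * q + gamma / n   # floor every probability at gamma/n
        c = adversary(p)
        i = rng.choice(n, p=p)
        c_hat = np.zeros(n)
        c_hat[i] = c[i] / p[i]            # now bounded by n / gamma
        w *= np.exp(-eps * c_hat)
    return w
```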
SLIDE 53

Thank You

Haifeng Xu
University of Virginia
hx4ad@virginia.edu