Announcements
• HW 1 is due now
CS6501: Topics in Learning and Game Theory (Fall 2019)
Adversarial Multi-Armed Bandits
Instructor: Haifeng Xu
Outline
• The Adversarial Multi-armed Bandit Problem
• A Basic Algorithm: Exp3
• Regret Analysis of Exp3
Setup: T rounds; the following occurs at round t:
1. Learner picks a distribution p_t over actions [n]
2. Adversary picks cost vector c_t ∈ [0,1]^n
3. Action i_t ∼ p_t is chosen and learner incurs cost c_t(i_t)
4. Learner observes c_t (for use in future time steps)

Performance is typically measured by regret:
R_T = Σ_{i∈[n]} Σ_{t∈[T]} p_t(i)·c_t(i) − min_{k∈[n]} Σ_{t∈[T]} c_t(k)

The multiplicative weight update algorithm has regret O(√(T ln n)).
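As a concrete reference point, here is a minimal Python sketch of this multiplicative/exponential-weight learner in the full-information setting; the random cost sequence and all names are illustrative assumptions, not from the slides:

    import numpy as np

    def mwu_regret(costs, eps):
        # costs: T x n array with entries in [0, 1]; eps: step size
        T, n = costs.shape
        w = np.ones(n)                       # w_1(i) = 1 for all i
        expected_cost = 0.0
        for t in range(T):
            p = w / w.sum()                  # p_t(i) = w_t(i) / W_t
            expected_cost += p @ costs[t]    # sum_i p_t(i) * c_t(i)
            w = w * np.exp(-eps * costs[t])  # exponential-weight update
        return expected_cost - costs.sum(axis=0).min()  # regret vs. best fixed action

    rng = np.random.default_rng(0)
    T, n = 10_000, 10
    costs = rng.random((T, n))               # made-up i.i.d. cost sequence
    eps = np.sqrt(2 * np.log(n) / T)         # tuned step size
    print(mwu_regret(costs, eps))            # should be well below the O(sqrt(T ln n)) bound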
Convergence to equilibrium
• In repeated zero-sum games, if both players use a no-regret learning algorithm, their average strategy converges to an NE
• In general games, the average strategy converges to a CCE

Swap regret: a "stronger" regret concept and better convergence
• Def: each action i has a chance to deviate to another action s(i)
• There is a general reduction, converting any learning algorithm with regret R to one with swap regret nR
• In repeated general games, if both players use a no-swap-regret learning algorithm, their average strategy converges to a CE
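To make the zero-sum claim concrete, here is a small self-contained simulation (matching pennies, with made-up step size and initial weights) in which both players run the exponential-weight learner; the time-averaged strategies approach the unique NE (1/2, 1/2):

    import numpy as np

    # Matching pennies: row player's payoff matrix; the game is zero-sum.
    A = np.array([[1.0, -1.0],
                  [-1.0, 1.0]])

    eps, T = 0.05, 20_000
    w_row = np.array([1.0, 3.0])       # asymmetric starts so the dynamics are non-trivial
    w_col = np.array([2.0, 1.0])
    avg_row, avg_col = np.zeros(2), np.zeros(2)

    for t in range(T):
        p = w_row / w_row.sum()        # row's mixed strategy p_t
        q = w_col / w_col.sum()        # column's mixed strategy q_t
        avg_row += p / T
        avg_col += q / T
        loss_row = -(A @ q)            # row maximizes A, so its loss is -(A q)
        loss_col = A.T @ p             # column's loss is +A^T p (zero-sum)
        w_row *= np.exp(-eps * loss_row)
        w_col *= np.exp(-eps * loss_col)
        w_row /= w_row.sum()           # renormalize to avoid overflow (no effect on p)
        w_col /= w_col.sum()

    print(avg_row, avg_col)            # both approx [0.5, 0.5], the NE of matching pennies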
• In online learning, the whole cost vector c_t can be observed by the learner, even though she takes only a single action i_t
• In many cases, we only see the reward of the action we take
• Online advertisement placement or web ranking
• Recommendation systems
• Clinical trials
• Playing strategic games
In each case, the learner observes feedback only for the taken action.
• Very much like online learning, except partial feedback
• Model: at each time step t = 1, ⋯, T, the following occurs in order:
1. Learner picks a distribution p_t over arms [n]
2. Adversary picks cost vector c_t ∈ [0,1]^n
3. Arm i_t ∼ p_t is chosen and learner incurs cost c_t(i_t)
4. Learner only observes c_t(i_t) (for use in future time steps)
• Though we cannot observe c_t, the adversary still picks c_t before i_t is sampled

Q: Since the learner does not observe c_t(i) for i ≠ i_t, can the adversary arbitrarily modify these c_t(i)'s after i_t has been selected?
A: No, because this would make c_t depend on the sampled i_t, which is not allowed.
Outline
• The Adversarial Multi-armed Bandit Problem
• A Basic Algorithm: Exp3
• Regret Analysis of Exp3
Recall the algorithm for the full information setting:

Parameter: ε
Initialize weights w_1(i) = 1, ∀i = 1, ⋯, n
For t = 1, ⋯, T:
1. Let W_t = Σ_{i∈[n]} w_t(i); pick arm i with probability p_t(i) = w_t(i)/W_t
2. Observe cost vector c_t ∈ [0,1]^n
3. For all i ∈ [n], update w_{t+1}(i) = w_t(i) · e^{−ε·c_t(i)} (instead of the multiplicative update w_t(i) · (1 − ε·c_t(i)); recall 1 − ε ≈ e^{−ε} for small ε)

• In this lecture we will use this exponential-weight variant, and prove its regret bound en route
• Also called Exponential Weight Update (EWU)
Basic idea of Exp3
• Want to use EWU, but do not know the full vector c_t
• Well, we really only have c_t(i_t); what can we do? Try to estimate c_t!

Estimate ĉ_t = (0, ⋯, 0, c_t(i_t), 0, ⋯, 0)? Too optimistic: in expectation coordinate i equals p_t(i)·c_t(i), which underestimates the true cost.
Estimate ĉ_t = (0, ⋯, 0, c_t(i_t)/p_t(i_t), 0, ⋯, 0)? This importance-weighted version is the estimator Exp3 uses, as formalized next.
Exp3: a Basic Algorithm for Adversarial MAB

Parameter: ε
Initialize weights w_1(i) = 1, ∀i = 1, ⋯, n
For t = 1, ⋯, T:
1. Let W_t = Σ_{i∈[n]} w_t(i); pick arm i with probability p_t(i) = w_t(i)/W_t
2. Observe the cost c_t(i_t) of the pulled arm i_t
3. For all i ∈ [n], update w_{t+1}(i) = w_t(i) · e^{−ε·ĉ_t(i)}, where ĉ_t = (0, ⋯, 0, c_t(i_t)/p_t(i_t), 0, ⋯, 0) with the nonzero entry at position i_t

• That is, the weight is updated only for the pulled arm
• Called Exp3: Exponential-weight algorithm for Exploration and Exploitation
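A minimal Python sketch of Exp3 as stated above; the cost oracle cost_fn and all variable names are illustrative assumptions, not from the slides:

    import numpy as np

    def exp3(cost_fn, n, T, eps, seed=0):
        # Exp3: exponential weights with importance-weighted cost estimates
        rng = np.random.default_rng(seed)
        w = np.ones(n)                          # w_1(i) = 1
        total_cost = 0.0
        for t in range(T):
            p = w / w.sum()                     # p_t(i) = w_t(i) / W_t
            i_t = rng.choice(n, p=p)            # sample arm i_t ~ p_t
            c = cost_fn(t, i_t)                 # bandit feedback: only c_t(i_t)
            total_cost += c
            c_hat = np.zeros(n)
            c_hat[i_t] = c / p[i_t]             # importance-weighted estimate
            w *= np.exp(-eps * c_hat)           # only the pulled arm's weight changes
        return total_cost

    # Illustrative usage: an oblivious adversary with fixed per-arm costs
    costs = np.linspace(0.2, 0.8, 5)            # arm 0 is best (assumed example)
    n, T = 5, 50_000
    eps = np.sqrt(2 * np.log(n) / (n * T))      # tuning from the analysis later
    print(exp3(lambda t, i: costs[i], n, T, eps) - T * costs.min())  # O(sqrt(nT ln n))

Note the update multiplies every weight by e^{−ε·ĉ_t(i)}, but since ĉ_t(i) = 0 for i ≠ i_t, only the pulled arm's weight actually changes.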
• ĉ_t is random: it depends on the randomly sampled i_t ∼ p_t
• ĉ_t is an unbiased estimator of c_t, i.e., E_{i_t∼p_t}[ĉ_t] = c_t:
E_{i_t∼p_t}[ĉ_t(i)] = P(i_t = i) · c_t(i)/p_t(i) + P(i_t ≠ i) · 0 = p_t(i) · c_t(i)/p_t(i) = c_t(i)
• This is exactly the reason for our choice of ĉ_t
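A quick Monte Carlo check of this unbiasedness claim (the specific p_t and c_t values are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    p = np.array([0.5, 0.3, 0.2])               # p_t: arm-sampling distribution
    c = np.array([0.9, 0.4, 0.7])               # c_t: true (hidden) cost vector

    samples = rng.choice(3, size=1_000_000, p=p)
    c_hat_mean = np.zeros(3)
    for i in range(3):
        # c_hat(i) = c(i)/p(i) when i_t = i, else 0; average over draws of i_t
        c_hat_mean[i] = np.where(samples == i, c[i] / p[i], 0.0).mean()

    print(c_hat_mean)                            # approx [0.9, 0.4, 0.7] = c_t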
Some key differences from online learning

R_T = Σ_{i∈[n]} Σ_{t∈[T]} p_t(i)·c_t(i) − min_{k∈[n]} Σ_{t∈[T]} c_t(k)

• R_T is random (even though it already takes an expectation over i_t ∼ p_t): two runs can produce different R_T values even when facing the same adversary. For example, starting from w_1(i) = 1 ∀i, if arm 1 is pulled in round 1 then w_2(1) < 1 while w_2(i) = 1 for all i ≠ 1; if instead arm 2 is pulled, then w_2(2) < 1 while w_2(i) = 1 for all i ≠ 2, so the two runs already differ at round 2.
• Cost vector c_t is also random, as it generally depends on p_t.
• Neither is the case in online learning, where the learner obtains the same R_T value when facing the same adversary.
• Therefore, in principle, we have to upper bound E[R_T], where the expectation is over the randomness of arm sampling:

E[R_T] = E[ Σ_{i∈[n]} Σ_{t∈[T]} p_t(i)·c_t(i) − min_{k∈[n]} Σ_{t∈[T]} c_t(k) ]
       = Σ_{i∈[n]} Σ_{t∈[T]} E[p_t(i)·c_t(i)] − E[ min_{k∈[n]} Σ_{t∈[T]} c_t(k) ]    (by linearity of expectation)
       ≥ Σ_{i∈[n]} Σ_{t∈[T]} E[p_t(i)·c_t(i)] − min_{k∈[n]} Σ_{t∈[T]} E[c_t(k)]

because min_{k∈[n]} Σ_{t∈[T]} E[c_t(k)] ≥ E[ min_{k∈[n]} Σ_{t∈[T]} c_t(k) ] (proof: homework exercise).

The last quantity is called the pseudo-regret, denoted R̄_T.
• A good regret guarantee implies a good pseudo-regret guarantee, but not the reverse.
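A tiny numeric illustration of why that inequality can be strict (a made-up two-arm example, not from the slides): with two i.i.d. fair-coin cost sequences, each arm's expected total cost is T/2, yet the expected minimum is noticeably smaller.

    import numpy as np

    rng = np.random.default_rng(2)
    T, n, trials = 100, 2, 10_000
    # c_t(k) ~ Bernoulli(1/2) i.i.d. for each arm: both arms have expected total T/2
    totals = rng.integers(0, 2, size=(trials, T, n)).sum(axis=1)
    print(totals.mean(axis=0))          # approx [50, 50]: min_k E[...] is about 50
    print(totals.min(axis=1).mean())    # E[min_k ...] is visibly below 50

This gap is also why the pseudo-regret can be strictly smaller than the expected regret.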
Bounding regret turns out to be challenging
• Exp3 is not sufficient to guarantee small regret
• Next, we instead prove that Exp3 has small pseudo-regret
• A slight modification of Exp3 can be proved to have small regret
Outline
• The Adversarial Multi-armed Bandit Problem
• A Basic Algorithm: Exp3
• Regret Analysis of Exp3
High-level idea of the proof
• Pretend to be in the full information setting, with cost equal to the estimated ĉ_t
• Relate ĉ_t to c_t, since we know ĉ_t is an unbiased estimator of c_t
28
ØRecall regret bound for full information setting
𝑆3
[\]] ≤ ln 𝑜
𝜗 + 𝜗𝑈
ØThis holds for any cost vector, thus also O
𝑑$
ØBut…one issue is that O
𝑑$(𝑗$) may be greater than 1
ØNot a big issue – the same analysis yields the following bound
𝑆3
[\]] ≤ _` - L + 𝜗 max 6
∑$∈ 3 O 𝑑$ 𝑗
c
Real Issue: O 𝑑$ 𝑗 may be too large that we cannot bound 𝑆3
[\]]
A regret bound of the following form turns out to work for our proof:
R_T^full ≤ (ln n)/ε + ε · Σ_t Σ_i p_t(i)·ĉ_t(i)²
• That is, instead of a max over i, the bound here averages over i (weighted by p_t)
• Why more useful? Since ĉ_t(i) = c_t(i)/p_t(i), the weight p_t(i) cancels one factor of 1/p_t(i); in expectation, p_t(i)·ĉ_t(i)² becomes c_t(i)² ≤ 1, as we will see.
Lemma 1. For any cost vectors ĉ_t ≥ 0, the regret of the following algorithm is at most
(ln n)/ε + (ε/2) · Σ_t Σ_i p_t(i)·ĉ_t(i)².

Parameter: ε
Initialize weights w_1(i) = 1, ∀i = 1, ⋯, n
For t = 1, ⋯, T:
1. Let W_t = Σ_{i∈[n]} w_t(i); pick arm i with probability p_t(i) = w_t(i)/W_t
2. Observe cost vector ĉ_t ≥ 0
3. For all i ∈ [n], update w_{t+1}(i) = w_t(i) · e^{−ε·ĉ_t(i)}

Note: this yields the bound (ln n)/ε + (ε/2)·T when ĉ_t ∈ [0,1]^n.
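As a sanity check, one can optimize this note's bound over ε; a short LaTeX rendering of the standard calculation (not spelled out on the slide):

    \[
    \min_{\varepsilon>0}\left(\frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\,T\right)
    = \sqrt{2\,T\ln n},
    \quad \text{attained at } \varepsilon = \sqrt{\tfrac{2\ln n}{T}},
    \]

which recovers the O(√(T ln n)) full-information regret quoted earlier.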
Proof of Lemma 1: similar technique as before; we carefully bound a certain quantity.
• Consider the quantity Σ_{i∈[n]} p_t(i)·e^{−ε·ĉ_t(i)}

Why this term?
• It tracks the weight decrease (made precise in Fact 1 below)
• Algebraic reasons: e^{−x} ≈ 1 − x + x²/2, which will give rise to both the term p_t(i)·ĉ_t(i) and the term p_t(i)·ĉ_t(i)²

Fact 1. Σ_{i∈[n]} p_t(i)·e^{−ε·ĉ_t(i)} = W_{t+1}/W_t, where W_t = Σ_i w_t(i).

Consequently,
Σ_t log Σ_{i∈[n]} p_t(i)·e^{−ε·ĉ_t(i)} = Σ_t log(W_{t+1}/W_t) = log W_{T+1} − log W_1,   where W_1 = n.

Fact 2. Σ_t log Σ_{i∈[n]} p_t(i)·e^{−ε·ĉ_t(i)} ≤ −ε·Σ_{t,i} p_t(i)·ĉ_t(i) + (ε²/2)·Σ_{t,i} p_t(i)·ĉ_t(i)².

Fact 2 follows from algebraic calculation:
Σ_t log Σ_{i∈[n]} p_t(i)·e^{−ε·ĉ_t(i)}
  ≤ Σ_t log Σ_{i∈[n]} p_t(i)·[1 − ε·ĉ_t(i) + (ε²/2)·ĉ_t(i)²]      (by e^{−x} ≤ 1 − x + x²/2 for x ≥ 0)
  = Σ_t log( 1 − ε·Σ_{i∈[n]} p_t(i)·ĉ_t(i) + (ε²/2)·Σ_{i∈[n]} p_t(i)·ĉ_t(i)² )      (since Σ_{i∈[n]} p_t(i) = 1)
  ≤ −ε·Σ_{t,i} p_t(i)·ĉ_t(i) + (ε²/2)·Σ_{t,i} p_t(i)·ĉ_t(i)²      (since log(1 + x) ≤ x for any x > −1)

• Combining the two facts yields the lemma: for any arm k, W_{T+1} ≥ w_{T+1}(k) = e^{−ε·Σ_t ĉ_t(k)}, so Fact 1 gives −ε·Σ_t ĉ_t(k) − ln n ≤ log W_{T+1} − log W_1. Chaining this with Fact 2 and dividing by ε gives
Σ_{t,i} p_t(i)·ĉ_t(i) − Σ_t ĉ_t(k) ≤ (ln n)/ε + (ε/2)·Σ_{t,i} p_t(i)·ĉ_t(i)². ∎
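The inequality e^{−x} ≤ 1 − x + x²/2 for x ≥ 0 invoked in Fact 2 is elementary; a short LaTeX proof sketch, added here for completeness:

    % For x >= 0, let f(x) = 1 - x + x^2/2 - e^{-x}.
    \[
    f(0) = 0, \qquad
    f'(x) = -1 + x + e^{-x} \ge 0 \quad (\text{since } e^{-x} \ge 1 - x),
    \]
    \[
    \text{so } f(x) \ge 0, \text{ i.e., } e^{-x} \le 1 - x + \tfrac{x^2}{2}
    \quad \text{for all } x \ge 0.
    \]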
Recall the pseudo-regret definition:
R̄_T = Σ_{t∈[T]} E[c_t · p_t] − min_{k∈[n]} Σ_{t∈[T]} E[c_t(k)]
    = max_{k∈[n]} ( Σ_{t∈[T]} E[c_t · p_t] − Σ_{t∈[T]} E[c_t(k)] )
    = max_{k∈[n]} Σ_{t∈[T]} E[c_t · p_t − c_t(k)]      (the pseudo-regret from action k)

Lemma 2. Σ_{t∈[T]} E[c_t · p_t − c_t(k)] = Σ_{t∈[T]} E[ĉ_t · p_t − ĉ_t(k)].
• That is, the expected pseudo-regret from k w.r.t. the true cost c_t equals that w.r.t. the estimated cost ĉ_t.
Proof of Lemma 2:
E[ĉ_t · p_t − ĉ_t(k)] = E[ E[ĉ_t · p_t − ĉ_t(k) | p_t] ]      (the randomness of ĉ_t comes from p_t and from the sampled arm i_t ∼ p_t)
                      = E[ E[c_t · p_t − c_t(k) | p_t] ]      (conditioned on p_t, ĉ_t is an unbiased estimator of c_t)
                      = E[c_t · p_t − c_t(k)] ∎
For any k, we have

Σ_{t∈[T]} E[c_t · p_t − c_t(k)]
  = E[ Σ_{t∈[T]} (ĉ_t · p_t − ĉ_t(k)) ]                              (by Lemma 2)
  ≤ E[ (ln n)/ε + (ε/2)·Σ_t Σ_i p_t(i)·ĉ_t(i)² ]                     (by Lemma 1, applied to the realized ĉ_t ≥ 0)
  = (ln n)/ε + (ε/2)·E[ E[ Σ_t Σ_i p_t(i)·ĉ_t(i)² | p_t ] ]          (by conditional expectation)
  = (ln n)/ε + (ε/2)·E[ Σ_t Σ_i p_t(i)·E[ĉ_t(i)² | p_t] ]            (by linearity of expectation)
  = (ln n)/ε + (ε/2)·E[ Σ_t Σ_i c_t(i)² ]
  ≤ (ln n)/ε + (ε/2)·nT                                              (since each c_t(i)² ≤ 1)

where the second-to-last step observes
E[ĉ_t(i)² | p_t] = 0·(1 − p_t(i)) + (c_t(i)/p_t(i))²·p_t(i) = c_t(i)²/p_t(i),   so   p_t(i)·E[ĉ_t(i)² | p_t] = c_t(i)².

Picking ε = √(2 ln n/(nT)) yields the pseudo-regret bound O(√(nT ln n)).
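Spelling out the final tuning step (a standard AM-GM calculation; the slide states only the result):

    \[
    \frac{\ln n}{\varepsilon} + \frac{\varepsilon}{2}\,nT
    \;\ge\; 2\sqrt{\frac{\ln n}{\varepsilon}\cdot\frac{\varepsilon}{2}\,nT}
    \;=\; \sqrt{2\,nT\ln n},
    \qquad \text{with equality at } \varepsilon = \sqrt{\tfrac{2\ln n}{nT}},
    \]

so the bound is minimized at the stated ε and equals √(2nT ln n) = O(√(nT ln n)).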
Proof recap:
• A tighter regret bound for the full information setting (Lemma 1)
• Treat the (realized) estimated ĉ_t as the cost for the full information setting
• The expected pseudo-regret w.r.t. c_t equals the expected pseudo-regret w.r.t. ĉ_t (Lemma 2)
• Upper bound the pseudo-regret by taking the expectation over the ĉ_t's
• Exp3 does not guarantee good true regret, still because ĉ_t(i) = c_t(i)/p_t(i) may be too large
• To obtain good true regret, one needs to modify Exp3 by adding some uniform exploration so that p_t(i) is never too small (a sketch follows below)
• In addition to adversarial feedback, a "nicer" setting is when the cost of each arm is drawn from a fixed but unknown distribution, where a much better regret bound O(n ln T) is achievable
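A minimal sketch of such a modification, mixing the exponential-weight distribution with uniform exploration; the parameter gamma and this exact variant are illustrative assumptions (the slides do not specify the modification):

    import numpy as np

    def exp3_with_exploration(cost_fn, n, T, eps, gamma, seed=0):
        # Exp3 variant: mix in uniform exploration so that p_t(i) >= gamma / n
        rng = np.random.default_rng(seed)
        w = np.ones(n)
        total_cost = 0.0
        for t in range(T):
            q = w / w.sum()
            p = (1 - gamma) * q + gamma / n     # p_t(i) is never below gamma/n
            i_t = rng.choice(n, p=p)
            c = cost_fn(t, i_t)
            total_cost += c
            c_hat = np.zeros(n)
            c_hat[i_t] = c / p[i_t]             # estimate now stays bounded by n/gamma
            w *= np.exp(-eps * c_hat)
        return total_cost

Capping p_t(i) from below caps the estimate ĉ_t(i) at n/gamma, which is exactly the quantity that was "too large" for the true-regret analysis of plain Exp3.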
Haifeng Xu
University of Virginia
hx4ad@virginia.edu