Announcements

• HW 1 draft is slightly updated; see the website for more info
• Minbiao's office hour has been changed to Thursday 1-2pm starting this week, at Rice Hall 442

CS6501: Topics in Learning and Game Theory (Fall 2019)

Lecture: MW Updates and Convergence to Equilibria

Instructor: Haifeng Xu
Outline

• Regret Proof of MW Update
• Convergence to Minimax Equilibrium
• Convergence to Coarse Correlated Equilibrium
Recap: The Online Learning Problem

At each time step t = 1, ⋯, T, the following occurs in order:
1. Learner picks a distribution p_t over actions [n]
2. Adversary picks a cost vector c_t ∈ [0,1]^n
3. Action i_t ∼ p_t is chosen and the learner incurs cost c_t(i_t)
4. Learner observes c_t (for use in future time steps)

• Learner's goal: pick the distribution sequence p_1, ⋯, p_T to minimize the expected cost 𝔼[∑_{t∈[T]} c_t(i_t)]
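In code, one pass of this protocol looks as follows (a minimal sketch; the uniform learner and the random adversary are placeholders of ours, not part of the lecture's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 100, 4
total_cost = 0.0
history = []                      # costs observed so far
for t in range(T):
    p_t = np.ones(n) / n          # 1. learner picks a distribution (placeholder: uniform)
    c_t = rng.random(n)           # 2. adversary picks a cost vector in [0,1]^n
    i_t = rng.choice(n, p=p_t)    # 3. action i_t ~ p_t is drawn, cost is incurred
    total_cost += c_t[i_t]
    history.append(c_t)           # 4. learner observes c_t for future rounds
```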
Regret

• Regret measures how much the learner regrets, had he known the cost vectors c_1, ⋯, c_T in hindsight
• Formally,

    R_T = 𝔼_{i_t ∼ p_t}[ ∑_{t∈[T]} c_t(i_t) ] − min_{i∈[n]} ∑_{t∈[T]} c_t(i)

• The benchmark min_{i∈[n]} ∑_t c_t(i) is the learner's total cost had he known c_1, ⋯, c_T and been allowed to take the best single action across all rounds; this benchmark is the one most commonly used

Regret is an appropriate performance measure for online algorithms.

An algorithm has no regret if R_T / T → 0 as T → ∞, i.e., R_T = o(T).
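The definition translates directly into a helper that computes expected regret from a cost matrix and a sequence of distributions (an illustrative sketch; the array conventions are ours):

```python
import numpy as np

def expected_regret(costs, dists):
    """Expected regret of playing distribution dists[t] at round t.

    costs: T x n array, costs[t][i] = c_t(i)
    dists: T x n array, dists[t][i] = p_t(i)
    """
    expected_cost = np.einsum('ti,ti->', dists, costs)  # E[ sum_t c_t(i_t) ]
    best_fixed = costs.sum(axis=0).min()                # min_i sum_t c_t(i)
    return expected_cost - best_fixed
```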
The MW Update Algorithm

Theorem. The MW update algorithm with ε = √(ln n / T) achieves regret at most 2·√(T·ln n) for the previously described online learning problem.

• Last lecture: the dependence on both T and ln n is necessary
• Next, we prove the theorem

Parameter: ε
Initialize weights w_1(i) = 1, ∀ i = 1, ⋯, n
For t = 1, ⋯, T:
  1. Let W_t = ∑_{i∈[n]} w_t(i); pick action i with probability w_t(i) / W_t
  2. Observe the cost vector c_t ∈ [0,1]^n
  3. For all i ∈ [n], update w_{t+1}(i) = w_t(i) · (1 − ε · c_t(i))
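The algorithm translates almost line-for-line into code. A minimal sketch (the function name, random adversary, and experiment parameters are ours, not the lecture's):

```python
import numpy as np

def mw_update(costs, eps, rng=None):
    """Run the MW update on a T x n array of cost vectors in [0,1].

    Returns the sampled actions and the total expected cost.
    A minimal sketch of the algorithm on the slide, not optimized.
    """
    rng = rng or np.random.default_rng(0)
    T, n = costs.shape
    w = np.ones(n)                      # w_1(i) = 1 for all i
    actions, expected_cost = [], 0.0
    for t in range(T):
        p = w / w.sum()                 # play i with prob w_t(i)/W_t
        actions.append(rng.choice(n, p=p))
        expected_cost += p @ costs[t]   # expected cost this round
        w *= 1.0 - eps * costs[t]       # w_{t+1}(i) = w_t(i)(1 - eps c_t(i))
    return actions, expected_cost

# Example: T = 10000 rounds, n = 10 actions, random costs
T, n = 10_000, 10
rng = np.random.default_rng(1)
costs = rng.random((T, n))
eps = np.sqrt(np.log(n) / T)            # the eps from the theorem
_, total = mw_update(costs, eps, rng)
best_fixed = costs.sum(axis=0).min()    # benchmark: best single action
print(total - best_fixed, "vs bound", 2 * np.sqrt(T * np.log(n)))
```

On random costs the realized gap is typically far below 2·√(T·ln n); the bound is for worst-case adversarial costs.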
Proof Idea

• Proof idea: bound how fast the total weights decrease
• The decrease of the weights relates to the expected cost at each round. Under the update rule w_{t+1}(i) = w_t(i)·(1 − ε·c_t(i)), the expected cost at round t is

    c̄_t = ∑_{i∈[n]} p_t(i)·c_t(i) = ∑_{i∈[n]} w_t(i)·c_t(i) / W_t

  and the total weight drops by exactly

    W_t − W_{t+1} = ∑_{i∈[n]} ε·w_t(i)·c_t(i) = ε·W_t·c̄_t
Proof Step 1: How Fast Do Total Weights Decrease?

Lemma 1. W_{t+1} ≤ W_t·e^{−ε·c̄_t}, where W_t = ∑_{i∈[n]} w_t(i) is the total weight at time t and c̄_t = ∑_{i∈[n]} p_t(i)·c_t(i) = ∑_{i∈[n]} w_t(i)·c_t(i) / W_t is the expected cost at time t.

Proof. Almost immediate from the update rule w_{t+1}(i) = w_t(i)·(1 − ε·c_t(i)):

    W_{t+1} = ∑_{i∈[n]} w_{t+1}(i)
            = ∑_{i∈[n]} w_t(i)·(1 − ε·c_t(i))
            = W_t − ε·∑_{i∈[n]} w_t(i)·c_t(i)
            = W_t − ε·W_t·c̄_t
            = W_t·(1 − ε·c̄_t)
            ≤ W_t·e^{−ε·c̄_t}       since 1 − x ≤ e^{−x}, ∀ x ≥ 0
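A quick numerical sanity check of the exact identity W_{t+1} = W_t·(1 − ε·c̄_t) and of Lemma 1's bound, on arbitrary made-up weights and costs (a throwaway check, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 5, 0.1
w = rng.random(n) + 0.5          # arbitrary positive weights w_t
c = rng.random(n)                # arbitrary costs c_t in [0,1]
W = w.sum()
cbar = (w @ c) / W               # expected cost at this round
w_next = w * (1 - eps * c)       # MW update
assert np.isclose(w_next.sum(), W * (1 - eps * cbar))   # exact identity
assert w_next.sum() <= W * np.exp(-eps * cbar)          # Lemma 1
```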
Proof Step 1: How Fast Do Total Weights Decrease?

Lemma 1. W_{t+1} ≤ W_t·e^{−ε·c̄_t}, where W_t = ∑_{i∈[n]} w_t(i) is the total weight at time t and c̄_t is the expected cost at time t.

Corollary 1. W_{T+1} ≤ n·e^{−ε·∑_{t=1}^T c̄_t}.

Proof. Apply Lemma 1 repeatedly:

    W_{T+1} ≤ W_T·e^{−ε·c̄_T}
            ≤ [W_{T−1}·e^{−ε·c̄_{T−1}}]·e^{−ε·c̄_T}
            = W_{T−1}·e^{−ε·(c̄_T + c̄_{T−1})}
            ⋯
            = W_1·e^{−ε·∑_{t=1}^T c̄_t}
            = n·e^{−ε·∑_{t=1}^T c̄_t}
Proof Step 2: How Large Must the Total Weight Remain?

Lemma 2. W_{T+1} ≥ e^{−T·ε²}·e^{−ε·∑_{t=1}^T c_t(i)} for any action i.

Proof.

    W_{T+1} ≥ w_{T+1}(i)
            = w_1(i)·(1 − ε·c_1(i))·(1 − ε·c_2(i)) ⋯ (1 − ε·c_T(i))     by the MW update rule
            ≥ ∏_{t=1}^T e^{−ε·c_t(i) − ε²·c_t(i)²}                       by the fact 1 − x ≥ e^{−x−x²} for x ∈ [0, 1/2]
            ≥ e^{−T·ε²}·e^{−ε·∑_{t=1}^T c_t(i)}                          relax c_t(i)² to 1
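The elementary fact used in the middle step, 1 − x ≥ e^{−x−x²} for x ∈ [0, 1/2] (here x = ε·c_t(i) ≤ ε, which is small for the parameter range we care about), can be checked numerically:

```python
import numpy as np

x = np.linspace(0.0, 0.5, 1001)
assert np.all(1 - x >= np.exp(-x - x**2))   # holds on [0, 1/2]
```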
Putting It Together

• Combining Corollary 1 and Lemma 2, for any action i we have

    e^{−T·ε²}·e^{−ε·∑_{t=1}^T c_t(i)} ≤ W_{T+1} ≤ n·e^{−ε·∑_{t=1}^T c̄_t}

• Taking ln on both sides:

    −T·ε² − ε·∑_{t=1}^T c_t(i) ≤ ln n − ε·∑_{t=1}^T c̄_t

• Rearranging terms:

    ∑_{t=1}^T c̄_t − ∑_{t=1}^T c_t(i) ≤ (ln n)/ε + T·ε

• Taking ε = √(ln n / T), we have

    ∑_{t=1}^T c̄_t − min_{i∈[n]} ∑_{t=1}^T c_t(i) ≤ 2·√(T·ln n)
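As a concrete instance of the final bound (numbers are illustrative): with T = 10,000 rounds and n = 10 actions, the theorem's parameter is ε = √(ln 10 / 10,000) ≈ 0.015, and the regret bound is 2·√(10,000·ln 10) ≈ 303.5, i.e., an average of only about 0.03 extra cost per round relative to the best fixed action in hindsight.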
Remarks

• Some descriptions of MW use the update w_{t+1}(i) = w_t(i)·e^{−ε·c_t(i)}. The analysis is similar, due to the fact that e^{−ε} ≈ 1 − ε for small ε ∈ [0,1]
• The same algorithm also works for costs c_t ∈ [−ρ, ρ]^n (still using the update rule w_{t+1}(i) = w_t(i)·(1 − ε·c_t(i))). The analysis is the same
• MW update is a very powerful technique: it can also be used to solve, e.g., LPs, semidefinite programs, Set Cover, boosting, etc.
• Next, we apply it to repeated games, where the "cost vector" will be generated by other players
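For reference, the exponential-update variant from the first bullet changes a single line relative to the earlier sketch (an illustrative fragment, not the lecture's canonical code):

```python
import numpy as np

def hedge_step(w, c, eps):
    """One round of the exponential-weights variant of MW.

    w: current weights w_t; c: observed cost vector c_t in [0,1]^n.
    Returns (p_t, w_{t+1}).
    """
    p = w / w.sum()
    return p, w * np.exp(-eps * c)   # vs. w * (1 - eps * c) in the linear form
```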
Outline

• Regret Proof of MW Update
• Convergence to Minimax Equilibrium
• Convergence to Coarse Correlated Equilibrium
Online Learning in Repeated Games

• Think about how you play rock-paper-scissors repeatedly
• In reality, we play like online learning: observe how the game has gone so far and respond, possibly with some randomness

Online learning is a natural way to play repeated games. A repeated game is the same game played for many rounds.
Repeated Zero-Sum Games with No-Regret Players

Basic setup:
• A zero-sum game with payoff matrix U ∈ ℝ^{m×n}
• The row player maximizes utility and has actions [m] = {1, ⋯, m}; the column player minimizes utility and has actions [n] = {1, ⋯, n}
• The game is played repeatedly for T rounds
• Each player uses an online learning algorithm to pick a mixed strategy at each round
Repeated Zero-Sum Games with No-Regret Players

• From the row player's perspective, the following occurs in order at round t:
  1. Row player picks a mixed strategy x_t over actions [m]
  2. Column player picks a mixed strategy y_t over actions [n]
  3. Row player plays action i_t ∼ x_t and receives utility ∑_{j∈[n]} y_t(j)·U(i_t, j)
  4. Row player observes y_t, and hence the utility of every action (for use in future rounds)
• The column player has a symmetric perspective, but will think of U(i, j) as his cost
• Difference from online learning: the utility/cost vector is determined by the opponent, instead of being arbitrarily chosen
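One round of this interaction, in code (a hypothetical sketch; the payoff matrix and the two mixed strategies are made-up examples):

```python
import numpy as np

U = np.array([[0., -1., 1.],       # example: rock-paper-scissors payoffs
              [1., 0., -1.],
              [-1., 1., 0.]])
x_t = np.array([0.5, 0.3, 0.2])    # row's mixed strategy over [m]
y_t = np.array([0.2, 0.3, 0.5])    # column's mixed strategy over [n]

row_utilities = U @ y_t        # utility of each row action i: sum_j y_t(j) U(i,j)
col_costs = x_t @ U            # cost of each column action j: sum_i x_t(i) U(i,j)
round_value = x_t @ U @ y_t    # expected utility U(x_t, y_t) this round
```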
Repeated Zero-Sum Games with No-Regret Players

• Expected total utility of the row player: ∑_{t=1}^T U(x_t, y_t), where U(x, y) = ∑_{i,j} x(i)·y(j)·U(i, j)
• Regret of the row player:

    R_T^row = max_{i∈[m]} ∑_{t=1}^T U(i, y_t) − ∑_{t=1}^T U(x_t, y_t)

• Regret of the column player:

    R_T^col = ∑_{t=1}^T U(x_t, y_t) − min_{j∈[n]} ∑_{t=1}^T U(x_t, j)
Next, we give another proof of the minimax theorem, using the fact that no-regret algorithms exist (e.g., MW update).
Proof of the Minimax Theorem via No-Regret Learning

• Assume both players use no-regret learning algorithms
• For the row player, we have

    R_T^row = max_{i∈[m]} ∑_{t=1}^T U(i, y_t) − ∑_{t=1}^T U(x_t, y_t)

  which is equivalent to

    (1/T)·∑_{t=1}^T U(x_t, y_t) + R_T^row / T = (1/T)·max_{i∈[m]} ∑_{t=1}^T U(i, y_t)
                                              = max_{i∈[m]} U(i, ȳ_T)          where ȳ_T = ∑_t y_t / T
                                              ≥ min_{y∈Δ_n} max_{i∈[m]} U(i, y)
Proof of the Minimax Theorem via No-Regret Learning

• Assume both players use no-regret learning algorithms
• For the row player, we have

    (1/T)·∑_{t=1}^T U(x_t, y_t) + R_T^row / T ≥ min_{y∈Δ_n} max_{i∈[m]} U(i, y)

• Similarly, for the column player,

    R_T^col = ∑_{t=1}^T U(x_t, y_t) − min_{j∈[n]} ∑_{t=1}^T U(x_t, j)

  implies

    (1/T)·∑_{t=1}^T U(x_t, y_t) − R_T^col / T ≤ max_{x∈Δ_m} min_{j∈[n]} U(x, j)

• Let T → ∞: no regret implies R_T^row / T and R_T^col / T tend to 0. We obtain

    min_{y∈Δ_n} max_{i∈[m]} U(i, y) ≤ max_{x∈Δ_m} min_{j∈[n]} U(x, j)
Proof of the Minimax Theorem via No-Regret Learning

• Assume both players use no-regret learning algorithms

    (1/T)·∑_{t=1}^T U(x_t, y_t) + R_T^row / T ≥ min_{y∈Δ_n} max_{i∈[m]} U(i, y)
    (1/T)·∑_{t=1}^T U(x_t, y_t) − R_T^col / T ≤ max_{x∈Δ_m} min_{j∈[n]} U(x, j)

    ⇒ min_{y∈Δ_n} max_{i∈[m]} U(i, y) ≤ max_{x∈Δ_m} min_{j∈[n]} U(x, j)

• Recall that min-max ≥ max-min also holds, because moving second will never be worse for the row player; together, the two inequalities give the minimax theorem

Corollary. (1/T)·∑_{t=1}^T U(x_t, y_t) converges to the game value.
Convergence to Nash Equilibrium

• Recall that (x*, y*) is an NE if and only if x* is a maximin strategy and y* is a minimax strategy
• From the previous derivations:

Theorem. Suppose both players use no-regret learning algorithms, with strategy sequences {x_t} and {y_t}. Then (1/T)·∑_{t=1}^T U(x_t, y_t) converges to the game value, and (∑_t x_t / T, ∑_t y_t / T) converges to an NE of the game.

Proof sketch. We showed

    (1/T)·∑_{t=1}^T U(x_t, y_t) + R_T^row / T = max_{i∈[m]} U(i, ȳ_T) ≥ min_{y∈Δ_n} max_{i∈[m]} U(i, y)

• As T → ∞, the "≥" becomes "=", so ȳ_T = ∑_t y_t / T solves the min-max problem
• Similarly, x̄_T = ∑_t x_t / T solves the max-min problem
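The theorem can be watched in action. The sketch below (our own toy experiment, not from the lecture) runs MW for both players of rock-paper-scissors and prints the average strategies, which approach the unique NE (1/3, 1/3, 1/3); payoffs in [−1, 1] are rescaled to costs in [0, 1] so the earlier analysis applies.

```python
import numpy as np

U = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])   # row's payoff, in [-1, 1]
m, n = U.shape
T = 20_000
eps = np.sqrt(np.log(max(m, n)) / T)

w_row, w_col = np.ones(m), np.ones(n)
x_sum, y_sum = np.zeros(m), np.zeros(n)
for t in range(T):
    x, y = w_row / w_row.sum(), w_col / w_col.sum()
    x_sum += x
    y_sum += y
    row_cost = (1.0 - U @ y) / 2.0   # row maximizes utility -> low utility = high cost
    col_cost = (1.0 + x @ U) / 2.0   # column minimizes row's utility
    w_row *= 1.0 - eps * row_cost    # MW update for each player
    w_col *= 1.0 - eps * col_cost

print("avg row strategy:", x_sum / T)   # -> approx [1/3, 1/3, 1/3]
print("avg col strategy:", y_sum / T)
```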
Convergence Rate

• If both players use no-regret algorithms with regret R_T = O(√T), then (1/T)·∑_{t=1}^T U(x_t, y_t) converges to the game value at rate R_T / T = O(1/√T)
• This convergence rate can be improved to O(1/T) by careful regularization [NIPS'15 best paper]
No-Regret Learning in Practice

• Convergence of no-regret learning to NE is the key framework for designing the AI agent that beats top humans in Texas hold'em poker: "Safe and Nested Subgame Solving for Imperfect-Information Games" [NeurIPS'17 best paper]

Exciting research is happening at this intersection of Learning & Game Theory.
Outline

• Regret Proof of MW Update
• Convergence to Minimax Equilibrium
• Convergence to Coarse Correlated Equilibrium
General Games and Coarse Correlated Equilibrium

• n players, denoted by the set [n] = {1, ⋯, n}
• Player i takes action a_i ∈ A_i; an action profile is a = (a_1, ⋯, a_n) ∈ A = A_1 × ⋯ × A_n
• Each player's utility u_i(a) depends on the outcome of the game, i.e., on the action profile a
• A coarse correlated equilibrium is an action recommendation policy:

Definition. A recommendation policy π (a distribution over A) is a coarse correlated equilibrium (CCE) if

    ∑_{a∈A} u_i(a)·π(a) ≥ ∑_{a∈A} u_i(a_i′, a_{−i})·π(a),   ∀ a_i′ ∈ A_i, ∀ i ∈ [n].

That is, for any player i, following π's recommendations is better than opting out of the recommendations and "acting on his own".
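For small games, the CCE condition can be checked by brute force; a toy sketch (all names and data conventions are ours):

```python
def is_cce(pi, utils, tol=1e-9):
    """Check the CCE condition for a recommendation policy.

    pi:    dict mapping action profile (tuple) -> probability
    utils: list of dicts, utils[i][profile] = u_i(profile)
    """
    n = len(utils)
    action_sets = [sorted({a[i] for a in pi}) for i in range(n)]
    for i in range(n):
        # expected utility of following pi's recommendation
        follow = sum(utils[i][a] * p for a, p in pi.items())
        for dev in action_sets[i]:       # each fixed deviation a_i'
            deviate = sum(utils[i][a[:i] + (dev,) + a[i+1:]] * p
                          for a, p in pi.items())
            if deviate > follow + tol:
                return False
    return True
```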
Repeated Games with No-Regret Players

• The game is played repeatedly for T rounds
• Each player uses an online learning algorithm to select a mixed strategy at each round t
• From any player i's perspective, the following occurs in order at round t:
  1. Player i picks a mixed strategy x_i^t ∈ Δ_{|A_i|} over actions in A_i
  2. All other players pick mixed strategies, jointly denoted x_{−i}^t
  3. Player i receives expected utility u_i(x_i^t, x_{−i}^t) = 𝔼_{a ∼ (x_i^t, x_{−i}^t)} u_i(a)
  4. Player i observes x_{−i}^t (for future use)
Repeated Games with No-Regret Players

• Expected total utility of player i equals ∑_{t=1}^T u_i(x_i^t, x_{−i}^t)
• Regret of player i is

    R_i^T = max_{a_i∈A_i} ∑_{t=1}^T u_i(a_i, x_{−i}^t) − ∑_{t=1}^T u_i(x_i^t, x_{−i}^t)
Convergence to Coarse Correlated Equilibrium

Theorem. Suppose each player uses a no-regret learning algorithm, with strategy sequence {x_i^t}_{t∈[T]} for player i. The following recommendation policy π_T converges to a CCE:

    π_T(a) = (1/T)·∑_{t∈[T]} ∏_{i∈[n]} x_i^t(a_i),   ∀ a ∈ A.

Remarks:
• In mixed strategy profile (x_1^t, x_2^t, ⋯, x_n^t), the probability of profile a is ∏_{i∈[n]} x_i^t(a_i)
• π_T(a) is simply the average of ∏_{i∈[n]} x_i^t(a_i) over the T rounds
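Constructing π_T from the played strategy sequences is a direct translation of the formula; a toy sketch for small games (names and data layout are ours):

```python
from itertools import product
from math import prod

def empirical_cce(strategies):
    """Build pi_T(a) = (1/T) * sum_t prod_i x_i^t(a_i).

    strategies: list over rounds t; each round is a list of n sequences,
    one mixed strategy per player (strategies[t][i][a_i] = x_i^t(a_i)).
    Returns a dict: action profile -> probability.
    """
    T, n = len(strategies), len(strategies[0])
    sizes = [len(strategies[0][i]) for i in range(n)]
    return {a: sum(prod(x[i][a[i]] for i in range(n)) for x in strategies) / T
            for a in product(*(range(s) for s in sizes))}
```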
Convergence to Coarse Correlated Equilibrium

• Player i's expected utility from π_T is

    ∑_{a∈A} [ (1/T)·∑_t ∏_{i∈[n]} x_i^t(a_i) ]·u_i(a) = (1/T)·∑_t ∑_{a∈A} [ ∏_{i∈[n]} x_i^t(a_i) ]·u_i(a)
                                                      = (1/T)·∑_t u_i(x_i^t, x_{−i}^t)

  i.e., exactly his average per-round expected utility from playing the learning algorithm
Convergence to Coarse Correlated Equilibrium

Proof:
• The CCE condition for π_T requires, for every player i,

    (1/T)·∑_t u_i(x_i^t, x_{−i}^t) ≥ (1/T)·∑_t u_i(a_i, x_{−i}^t),   ∀ a_i ∈ A_i   (1)

• The regret of player i is

    R_i^T = max_{a_i∈A_i} ∑_{t=1}^T u_i(a_i, x_{−i}^t) − ∑_{t=1}^T u_i(x_i^t, x_{−i}^t)   (2)

• Dividing Equation (2) by T and letting T → ∞ yields Condition (1), since R_i^T / T tends to 0 by the definition of no regret
Next lecture:
• Study a stronger regret notion called "swap regret", which uses a stronger benchmark
• Show that any game with no-swap-regret players converges to a correlated equilibrium
• Prove that any no-regret algorithm can be converted to a no-swap-regret algorithm, with a slightly worse regret guarantee

Haifeng Xu
University of Virginia
hx4ad@virginia.edu