Announcements

• HW 1 draft is slightly updated; see the website for more info
• Minbiao's office hour has been changed to Thursday 1-2pm starting this week, at Rice Hall 442

CS6501: Topics in Learning and Game Theory (Fall 2019)

Lecture: MW Updates and Convergence to Equilibria

Instructor: Haifeng Xu
Outline

• Regret Proof of MW Update
• Convergence to Minimax Equilibrium
• Convergence to Coarse Correlated Equilibrium
Recap: The Online Learning Problem

At each time step t = 1, ⋯, T, the following occurs in order:
1. Learner picks a distribution p_t over actions [n]
2. Adversary picks a cost vector c_t ∈ [0,1]^n
3. Action i_t ∼ p_t is chosen and the learner incurs cost c_t(i_t)
4. Learner observes c_t (for use in future time steps)

• Learner's goal: pick the distribution sequence p_1, ⋯, p_T to minimize the expected cost 𝔼[∑_{t∈[T]} c_t(i_t)]
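In code, one pass of this protocol looks as follows (a minimal sketch; the uniform learner and the random adversary are placeholders of ours, not part of the lecture's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 100, 4
total_cost = 0.0
history = []                      # costs observed so far
for t in range(T):
    p_t = np.ones(n) / n          # 1. learner picks a distribution (placeholder: uniform)
    c_t = rng.random(n)           # 2. adversary picks a cost vector in [0,1]^n
    i_t = rng.choice(n, p=p_t)    # 3. action i_t ~ p_t is drawn, cost is incurred
    total_cost += c_t[i_t]
    history.append(c_t)           # 4. learner observes c_t for future rounds
```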
Regret

• Regret measures how much the learner regrets, had he known the cost vectors c_1, ⋯, c_T in hindsight
• Formally,

    R_T = 𝔼_{i_t ∼ p_t}[ ∑_{t∈[T]} c_t(i_t) ] − min_{i∈[n]} ∑_{t∈[T]} c_t(i)

• The benchmark min_{i∈[n]} ∑_t c_t(i) is the learner's total cost had he known c_1, ⋯, c_T and been allowed to take the best single action across all rounds; this benchmark is the one most commonly used

Regret is an appropriate performance measure for online algorithms.

An algorithm has no regret if R_T / T → 0 as T → ∞, i.e., R_T = o(T).
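The definition translates directly into a helper that computes expected regret from a cost matrix and a sequence of distributions (an illustrative sketch; the array conventions are ours):

```python
import numpy as np

def expected_regret(costs, dists):
    """Expected regret of playing distribution dists[t] at round t.

    costs: T x n array, costs[t][i] = c_t(i)
    dists: T x n array, dists[t][i] = p_t(i)
    """
    expected_cost = np.einsum('ti,ti->', dists, costs)  # E[ sum_t c_t(i_t) ]
    best_fixed = costs.sum(axis=0).min()                # min_i sum_t c_t(i)
    return expected_cost - best_fixed
```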
The MW Update Algorithm

Theorem. The MW update algorithm with ε = √(ln n / T) achieves regret at most 2·√(T·ln n) for the previously described online learning problem.

• Last lecture: the dependence on both T and ln n is necessary
• Next, we prove the theorem

Parameter: ε
Initialize weights w_1(i) = 1, ∀ i = 1, ⋯, n
For t = 1, ⋯, T:
  1. Let W_t = ∑_{i∈[n]} w_t(i); pick action i with probability w_t(i) / W_t
  2. Observe the cost vector c_t ∈ [0,1]^n
  3. For all i ∈ [n], update w_{t+1}(i) = w_t(i) · (1 − ε · c_t(i))
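The algorithm translates almost line-for-line into code. A minimal sketch (the function name, random adversary, and experiment parameters are ours, not the lecture's):

```python
import numpy as np

def mw_update(costs, eps, rng=None):
    """Run the MW update on a T x n array of cost vectors in [0,1].

    Returns the sampled actions and the total expected cost.
    A minimal sketch of the algorithm on the slide, not optimized.
    """
    rng = rng or np.random.default_rng(0)
    T, n = costs.shape
    w = np.ones(n)                      # w_1(i) = 1 for all i
    actions, expected_cost = [], 0.0
    for t in range(T):
        p = w / w.sum()                 # play i with prob w_t(i)/W_t
        actions.append(rng.choice(n, p=p))
        expected_cost += p @ costs[t]   # expected cost this round
        w *= 1.0 - eps * costs[t]       # w_{t+1}(i) = w_t(i)(1 - eps c_t(i))
    return actions, expected_cost

# Example: T = 10000 rounds, n = 10 actions, random costs
T, n = 10_000, 10
rng = np.random.default_rng(1)
costs = rng.random((T, n))
eps = np.sqrt(np.log(n) / T)            # the eps from the theorem
_, total = mw_update(costs, eps, rng)
best_fixed = costs.sum(axis=0).min()    # benchmark: best single action
print(total - best_fixed, "vs bound", 2 * np.sqrt(T * np.log(n)))
```

On random costs the realized gap is typically far below 2·√(T·ln n); the bound is for worst-case adversarial costs.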
Proof Idea

• Proof idea: bound how fast the total weights decrease
• The decrease of the weights relates to the expected cost at each round. Under the update rule w_{t+1}(i) = w_t(i)·(1 − ε·c_t(i)), the expected cost at round t is

    c̄_t = ∑_{i∈[n]} p_t(i)·c_t(i) = ∑_{i∈[n]} w_t(i)·c_t(i) / W_t

  and the total weight drops by exactly

    W_t − W_{t+1} = ∑_{i∈[n]} ε·w_t(i)·c_t(i) = ε·W_t·c̄_t
Proof Step 1: How Fast Do Total Weights Decrease?

Lemma 1. W_{t+1} ≤ W_t·e^{−ε·c̄_t}, where W_t = ∑_{i∈[n]} w_t(i) is the total weight at time t and c̄_t = ∑_{i∈[n]} p_t(i)·c_t(i) = ∑_{i∈[n]} w_t(i)·c_t(i) / W_t is the expected cost at time t.

Proof. Almost immediate from the update rule w_{t+1}(i) = w_t(i)·(1 − ε·c_t(i)):

    W_{t+1} = ∑_{i∈[n]} w_{t+1}(i)
            = ∑_{i∈[n]} w_t(i)·(1 − ε·c_t(i))
            = W_t − ε·∑_{i∈[n]} w_t(i)·c_t(i)
            = W_t − ε·W_t·c̄_t
            = W_t·(1 − ε·c̄_t)
            ≤ W_t·e^{−ε·c̄_t}       since 1 − x ≤ e^{−x}, ∀ x ≥ 0
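A quick numerical sanity check of the exact identity W_{t+1} = W_t·(1 − ε·c̄_t) and of Lemma 1's bound, on arbitrary made-up weights and costs (a throwaway check, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 5, 0.1
w = rng.random(n) + 0.5          # arbitrary positive weights w_t
c = rng.random(n)                # arbitrary costs c_t in [0,1]
W = w.sum()
cbar = (w @ c) / W               # expected cost at this round
w_next = w * (1 - eps * c)       # MW update
assert np.isclose(w_next.sum(), W * (1 - eps * cbar))   # exact identity
assert w_next.sum() <= W * np.exp(-eps * cbar)          # Lemma 1
```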
Proof Step 1: How Fast Do Total Weights Decrease?

Lemma 1. W_{t+1} ≤ W_t·e^{−ε·c̄_t}, where W_t = ∑_{i∈[n]} w_t(i) is the total weight at time t and c̄_t is the expected cost at time t.

Corollary 1. W_{T+1} ≤ n·e^{−ε·∑_{t=1}^T c̄_t}.

Proof. Apply Lemma 1 repeatedly:

    W_{T+1} ≤ W_T·e^{−ε·c̄_T}
            ≤ [W_{T−1}·e^{−ε·c̄_{T−1}}]·e^{−ε·c̄_T}
            = W_{T−1}·e^{−ε·(c̄_T + c̄_{T−1})}
            ⋯
            = W_1·e^{−ε·∑_{t=1}^T c̄_t}
            = n·e^{−ε·∑_{t=1}^T c̄_t}
Proof Step 2: How Large Must the Total Weight Remain?

Lemma 2. W_{T+1} ≥ e^{−T·ε²}·e^{−ε·∑_{t=1}^T c_t(i)} for any action i.

Proof.

    W_{T+1} ≥ w_{T+1}(i)
            = w_1(i)·(1 − ε·c_1(i))·(1 − ε·c_2(i)) ⋯ (1 − ε·c_T(i))     by the MW update rule
            ≥ ∏_{t=1}^T e^{−ε·c_t(i) − ε²·c_t(i)²}                       by the fact 1 − x ≥ e^{−x−x²} for x ∈ [0, 1/2]
            ≥ e^{−T·ε²}·e^{−ε·∑_{t=1}^T c_t(i)}                          relax c_t(i)² to 1
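The elementary fact used in the middle step, 1 − x ≥ e^{−x−x²} for x ∈ [0, 1/2] (here x = ε·c_t(i) ≤ ε, which is small for the parameter range we care about), can be checked numerically:

```python
import numpy as np

x = np.linspace(0.0, 0.5, 1001)
assert np.all(1 - x >= np.exp(-x - x**2))   # holds on [0, 1/2]
```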
Putting It Together

• Combining Corollary 1 and Lemma 2, for any action i we have

    e^{−T·ε²}·e^{−ε·∑_{t=1}^T c_t(i)} ≤ W_{T+1} ≤ n·e^{−ε·∑_{t=1}^T c̄_t}

• Taking ln on both sides:

    −T·ε² − ε·∑_{t=1}^T c_t(i) ≤ ln n − ε·∑_{t=1}^T c̄_t

• Rearranging terms:

    ∑_{t=1}^T c̄_t − ∑_{t=1}^T c_t(i) ≤ (ln n)/ε + T·ε

• Taking ε = √(ln n / T), we have

    ∑_{t=1}^T c̄_t − min_{i∈[n]} ∑_{t=1}^T c_t(i) ≤ 2·√(T·ln n)
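As a concrete instance of the final bound (numbers are illustrative): with T = 10,000 rounds and n = 10 actions, the theorem's parameter is ε = √(ln 10 / 10,000) ≈ 0.015, and the regret bound is 2·√(10,000·ln 10) ≈ 303.5, i.e., an average of only about 0.03 extra cost per round relative to the best fixed action in hindsight.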
Remarks

• Some descriptions of MW use the update w_{t+1}(i) = w_t(i)·e^{−ε·c_t(i)}. The analysis is similar, due to the fact that e^{−ε} ≈ 1 − ε for small ε ∈ [0,1]
• The same algorithm also works for costs c_t ∈ [−ρ, ρ]^n (still using the update rule w_{t+1}(i) = w_t(i)·(1 − ε·c_t(i))). The analysis is the same
• MW update is a very powerful technique: it can also be used to solve, e.g., LPs, semidefinite programs, Set Cover, boosting, etc.
• Next, we apply it to repeated games, where the "cost vector" will be generated by other players
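For reference, the exponential-update variant from the first bullet changes a single line relative to the earlier sketch (an illustrative fragment, not the lecture's canonical code):

```python
import numpy as np

def hedge_step(w, c, eps):
    """One round of the exponential-weights variant of MW.

    w: current weights w_t; c: observed cost vector c_t in [0,1]^n.
    Returns (p_t, w_{t+1}).
    """
    p = w / w.sum()
    return p, w * np.exp(-eps * c)   # vs. w * (1 - eps * c) in the linear form
```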
Outline

• Regret Proof of MW Update
• Convergence to Minimax Equilibrium
• Convergence to Coarse Correlated Equilibrium
Online Learning in Repeated Games

• Think about how you play rock-paper-scissors repeatedly
• In reality, we play like online learning: observe how the game has gone so far and respond, possibly with some randomness

Online learning is a natural way to play repeated games. A repeated game is the same game played for many rounds.
Repeated Zero-Sum Games with No-Regret Players

Basic setup:
• A zero-sum game with payoff matrix U ∈ ℝ^{m×n}
• The row player maximizes utility and has actions [m] = {1, ⋯, m}; the column player minimizes utility and has actions [n] = {1, ⋯, n}
• The game is played repeatedly for T rounds
• Each player uses an online learning algorithm to pick a mixed strategy at each round
Repeated Zero-Sum Games with No-Regret Players

• From the row player's perspective, the following occurs in order at round t:
  1. Row player picks a mixed strategy x_t over actions [m]
  2. Column player picks a mixed strategy y_t over actions [n]
  3. Row player plays action i_t ∼ x_t and receives utility ∑_{j∈[n]} y_t(j)·U(i_t, j)
  4. Row player observes y_t, and hence the utility of every action (for use in future rounds)
• The column player has a symmetric perspective, but will think of U(i, j) as his cost
• Difference from online learning: the utility/cost vector is determined by the opponent, instead of being arbitrarily chosen
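One round of this interaction, in code (a hypothetical sketch; the payoff matrix and the two mixed strategies are made-up examples):

```python
import numpy as np

U = np.array([[0., -1., 1.],       # example: rock-paper-scissors payoffs
              [1., 0., -1.],
              [-1., 1., 0.]])
x_t = np.array([0.5, 0.3, 0.2])    # row's mixed strategy over [m]
y_t = np.array([0.2, 0.3, 0.5])    # column's mixed strategy over [n]

row_utilities = U @ y_t        # utility of each row action i: sum_j y_t(j) U(i,j)
col_costs = x_t @ U            # cost of each column action j: sum_i x_t(i) U(i,j)
round_value = x_t @ U @ y_t    # expected utility U(x_t, y_t) this round
```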
Repeated Zero-Sum Games with No-Regret Players

• Expected total utility of the row player: ∑_{t=1}^T U(x_t, y_t), where U(x, y) = ∑_{i,j} x(i)·y(j)·U(i, j)
• Regret of the row player:

    R_T^row = max_{i∈[m]} ∑_{t=1}^T U(i, y_t) − ∑_{t=1}^T U(x_t, y_t)

• Regret of the column player:

    R_T^col = ∑_{t=1}^T U(x_t, y_t) − min_{j∈[n]} ∑_{t=1}^T U(x_t, j)
Next, we give another proof of the minimax theorem, using the fact that no-regret algorithms exist (e.g., MW update).
Proof of the Minimax Theorem via No-Regret Learning

• Assume both players use no-regret learning algorithms
• For the row player, we have

    R_T^row = max_{i∈[m]} ∑_{t=1}^T U(i, y_t) − ∑_{t=1}^T U(x_t, y_t)

  which is equivalent to

    (1/T)·∑_{t=1}^T U(x_t, y_t) + R_T^row / T = (1/T)·max_{i∈[m]} ∑_{t=1}^T U(i, y_t)
                                              = max_{i∈[m]} U(i, ȳ_T)          where ȳ_T = ∑_t y_t / T
                                              ≥ min_{y∈Δ_n} max_{i∈[m]} U(i, y)
Proof of the Minimax Theorem via No-Regret Learning

• Assume both players use no-regret learning algorithms
• For the row player, we have

    (1/T)·∑_{t=1}^T U(x_t, y_t) + R_T^row / T ≥ min_{y∈Δ_n} max_{i∈[m]} U(i, y)

• Similarly, for the column player,

    R_T^col = ∑_{t=1}^T U(x_t, y_t) − min_{j∈[n]} ∑_{t=1}^T U(x_t, j)

  implies

    (1/T)·∑_{t=1}^T U(x_t, y_t) − R_T^col / T ≤ max_{x∈Δ_m} min_{j∈[n]} U(x, j)

• Let T → ∞: no regret implies R_T^row / T and R_T^col / T tend to 0. We obtain

    min_{y∈Δ_n} max_{i∈[m]} U(i, y) ≤ max_{x∈Δ_m} min_{j∈[n]} U(x, j)
Proof of the Minimax Theorem via No-Regret Learning

• Assume both players use no-regret learning algorithms

    (1/T)·∑_{t=1}^T U(x_t, y_t) + R_T^row / T ≥ min_{y∈Δ_n} max_{i∈[m]} U(i, y)
    (1/T)·∑_{t=1}^T U(x_t, y_t) − R_T^col / T ≤ max_{x∈Δ_m} min_{j∈[n]} U(x, j)

    ⇒ min_{y∈Δ_n} max_{i∈[m]} U(i, y) ≤ max_{x∈Δ_m} min_{j∈[n]} U(x, j)

• Recall that min-max ≥ max-min also holds, because moving second will never be worse for the row player; together, the two inequalities give the minimax theorem

Corollary. (1/T)·∑_{t=1}^T U(x_t, y_t) converges to the game value.
Convergence to Nash Equilibrium

• Recall that (x*, y*) is an NE if and only if x* is a maximin strategy and y* is a minimax strategy
• From the previous derivations:

Theorem. Suppose both players use no-regret learning algorithms, with strategy sequences {x_t} and {y_t}. Then (1/T)·∑_{t=1}^T U(x_t, y_t) converges to the game value, and (∑_t x_t / T, ∑_t y_t / T) converges to an NE of the game.

Proof sketch. We showed

    (1/T)·∑_{t=1}^T U(x_t, y_t) + R_T^row / T = max_{i∈[m]} U(i, ȳ_T) ≥ min_{y∈Δ_n} max_{i∈[m]} U(i, y)

• As T → ∞, the "≥" becomes "=", so ȳ_T = ∑_t y_t / T solves the min-max problem
• Similarly, x̄_T = ∑_t x_t / T solves the max-min problem
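The theorem can be watched in action. The sketch below (our own toy experiment, not from the lecture) runs MW for both players of rock-paper-scissors and prints the average strategies, which approach the unique NE (1/3, 1/3, 1/3); payoffs in [−1, 1] are rescaled to costs in [0, 1] so the earlier analysis applies.

```python
import numpy as np

U = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])   # row's payoff, in [-1, 1]
m, n = U.shape
T = 20_000
eps = np.sqrt(np.log(max(m, n)) / T)

w_row, w_col = np.ones(m), np.ones(n)
x_sum, y_sum = np.zeros(m), np.zeros(n)
for t in range(T):
    x, y = w_row / w_row.sum(), w_col / w_col.sum()
    x_sum += x
    y_sum += y
    row_cost = (1.0 - U @ y) / 2.0   # row maximizes utility -> low utility = high cost
    col_cost = (1.0 + x @ U) / 2.0   # column minimizes row's utility
    w_row *= 1.0 - eps * row_cost    # MW update for each player
    w_col *= 1.0 - eps * col_cost

print("avg row strategy:", x_sum / T)   # -> approx [1/3, 1/3, 1/3]
print("avg col strategy:", y_sum / T)
```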
Convergence Rate

• If both players use no-regret algorithms with regret R_T = O(√T), then (1/T)·∑_{t=1}^T U(x_t, y_t) converges to the game value at rate R_T / T = O(1/√T)
• This convergence rate can be improved to O(1/T) by careful regularization [NIPS'15 best paper]
No-Regret Learning in Practice

• Convergence of no-regret learning to NE is the key framework for designing the AI agent that beats top humans in Texas hold'em poker: "Safe and Nested Subgame Solving for Imperfect-Information Games" [NeurIPS'17 best paper]

Exciting research is happening at this intersection of Learning & Game Theory.
Outline

• Regret Proof of MW Update
• Convergence to Minimax Equilibrium
• Convergence to Coarse Correlated Equilibrium
General Games and Coarse Correlated Equilibrium

• n players, denoted by the set [n] = {1, ⋯, n}
• Player i takes action a_i ∈ A_i; an action profile is a = (a_1, ⋯, a_n) ∈ A = A_1 × ⋯ × A_n
• Each player's utility u_i(a) depends on the outcome of the game, i.e., on the action profile a
• A coarse correlated equilibrium is an action recommendation policy:

Definition. A recommendation policy π (a distribution over A) is a coarse correlated equilibrium (CCE) if

    ∑_{a∈A} u_i(a)·π(a) ≥ ∑_{a∈A} u_i(a_i′, a_{−i})·π(a),   ∀ a_i′ ∈ A_i, ∀ i ∈ [n].

That is, for any player i, following π's recommendations is better than opting out of the recommendations and "acting on his own".
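For small games, the CCE condition can be checked by brute force; a toy sketch (all names and data conventions are ours):

```python
def is_cce(pi, utils, tol=1e-9):
    """Check the CCE condition for a recommendation policy.

    pi:    dict mapping action profile (tuple) -> probability
    utils: list of dicts, utils[i][profile] = u_i(profile)
    """
    n = len(utils)
    action_sets = [sorted({a[i] for a in pi}) for i in range(n)]
    for i in range(n):
        # expected utility of following pi's recommendation
        follow = sum(utils[i][a] * p for a, p in pi.items())
        for dev in action_sets[i]:       # each fixed deviation a_i'
            deviate = sum(utils[i][a[:i] + (dev,) + a[i+1:]] * p
                          for a, p in pi.items())
            if deviate > follow + tol:
                return False
    return True
```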
Repeated Games with No-Regret Players

• The game is played repeatedly for T rounds
• Each player uses an online learning algorithm to select a mixed strategy at each round t
• From any player i's perspective, the following occurs in order at round t:
  1. Player i picks a mixed strategy x_i^t ∈ Δ_{|A_i|} over actions in A_i
  2. All other players pick mixed strategies, jointly denoted x_{−i}^t
  3. Player i receives expected utility u_i(x_i^t, x_{−i}^t) = 𝔼_{a ∼ (x_i^t, x_{−i}^t)} u_i(a)
  4. Player i observes x_{−i}^t (for future use)
Repeated Games with No-Regret Players

• Expected total utility of player i equals ∑_{t=1}^T u_i(x_i^t, x_{−i}^t)
• Regret of player i is

    R_i^T = max_{a_i∈A_i} ∑_{t=1}^T u_i(a_i, x_{−i}^t) − ∑_{t=1}^T u_i(x_i^t, x_{−i}^t)
Convergence to Coarse Correlated Equilibrium

Theorem. Suppose each player uses a no-regret learning algorithm, with strategy sequence {x_i^t}_{t∈[T]} for player i. The following recommendation policy π_T converges to a CCE:

    π_T(a) = (1/T)·∑_{t∈[T]} ∏_{i∈[n]} x_i^t(a_i),   ∀ a ∈ A.

Remarks:
• In mixed strategy profile (x_1^t, x_2^t, ⋯, x_n^t), the probability of profile a is ∏_{i∈[n]} x_i^t(a_i)
• π_T(a) is simply the average of ∏_{i∈[n]} x_i^t(a_i) over the T rounds
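Constructing π_T from the played strategy sequences is a direct translation of the formula; a toy sketch for small games (names and data layout are ours):

```python
from itertools import product
from math import prod

def empirical_cce(strategies):
    """Build pi_T(a) = (1/T) * sum_t prod_i x_i^t(a_i).

    strategies: list over rounds t; each round is a list of n sequences,
    one mixed strategy per player (strategies[t][i][a_i] = x_i^t(a_i)).
    Returns a dict: action profile -> probability.
    """
    T, n = len(strategies), len(strategies[0])
    sizes = [len(strategies[0][i]) for i in range(n)]
    return {a: sum(prod(x[i][a[i]] for i in range(n)) for x in strategies) / T
            for a in product(*(range(s) for s in sizes))}
```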
Convergence to Coarse Correlated Equilibrium

• Player i's expected utility from π_T is

    ∑_{a∈A} [ (1/T)·∑_t ∏_{i∈[n]} x_i^t(a_i) ]·u_i(a) = (1/T)·∑_t ∑_{a∈A} [ ∏_{i∈[n]} x_i^t(a_i) ]·u_i(a)
                                                      = (1/T)·∑_t u_i(x_i^t, x_{−i}^t)

  i.e., exactly his average per-round expected utility from playing the learning algorithm
Convergence to Coarse Correlated Equilibrium

Proof:
• The CCE condition for π_T requires, for every player i,

    (1/T)·∑_t u_i(x_i^t, x_{−i}^t) ≥ (1/T)·∑_t u_i(a_i, x_{−i}^t),   ∀ a_i ∈ A_i   (1)

• The regret of player i is

    R_i^T = max_{a_i∈A_i} ∑_{t=1}^T u_i(a_i, x_{−i}^t) − ∑_{t=1}^T u_i(x_i^t, x_{−i}^t)   (2)

• Dividing Equation (2) by T and letting T → ∞ yields Condition (1), since R_i^T / T tends to 0 by the definition of no regret
Next lecture:
• Study a stronger regret notion called "swap regret", which uses a stronger benchmark
• Show that any game with no-swap-regret players converges to a correlated equilibrium
• Prove that any no-regret algorithm can be converted to a no-swap-regret algorithm, with a slightly worse regret guarantee

Haifeng Xu
University of Virginia
hx4ad@virginia.edu