Slide 1
Announcements

• HW 1 draft is slightly updated; see the website for more info
• Minbiao's office hour has changed to Thursday 1-2pm starting this week, at Rice Hall 442

Slide 2

CS6501: Topics in Learning and Game Theory (Fall 2019)

MW Updates and Implications

Instructor: Haifeng Xu

Slide 3

Outline

• Regret Proof of MW Update
• Convergence to Minimax Equilibrium
• Convergence to Coarse Correlated Equilibrium

Slide 4

Recap: the Model of Online Learning

At each time step $t = 1, \dots, T$, the following occurs in order:

1. Learner picks a distribution $p_t$ over actions $[n]$
2. Adversary picks a cost vector $c_t \in [0,1]^n$
3. Action $i_t \sim p_t$ is chosen and the learner incurs cost $c_t(i_t)$
4. Learner observes $c_t$ (for use in future time steps)

• Learner's goal: pick the distribution sequence $p_1, \dots, p_T$ to minimize the expected cost $\mathbb{E}\big[\sum_{t\in[T]} c_t(i_t)\big]$
  • Expectation is over the randomness of the actions
Slide 5

Measure Algorithms via Regret

• Regret: how much the learner regrets, had he known the cost vectors $c_1, \dots, c_T$ in hindsight
• Formally,

$R_T = \mathbb{E}_{i_t \sim p_t}\big[\sum_{t\in[T]} c_t(i_t)\big] - \min_{i\in[n]} \sum_{t\in[T]} c_t(i)$

• The benchmark $\min_{i\in[n]} \sum_t c_t(i)$ is the learner's total cost had he known $c_1, \dots, c_T$ and been allowed to take the best single action across all rounds
  • Other benchmarks can be used, but $\min_{i\in[n]} \sum_t c_t(i)$ is the most common

Regret is an appropriate performance measure for online algorithms:
  • It measures exactly the loss due to not knowing the data in advance

An algorithm has no regret if $R_T/T \to 0$ as $T \to \infty$, i.e., $R_T = o(T)$.
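To make the definition concrete, here is a minimal Python sketch (my illustration, not from the slides; the array layout is an assumption) that computes the expected regret of a distribution sequence against the best fixed action in hindsight:

```python
import numpy as np

def regret(dists: np.ndarray, costs: np.ndarray) -> float:
    """Expected regret R_T of playing dists[t] against cost vector costs[t].

    dists: (T, n) array, each row a probability distribution over n actions
    costs: (T, n) array with entries in [0, 1]
    """
    expected_cost = float(np.sum(dists * costs))   # E[ sum_t c_t(i_t) ]
    best_fixed = costs.sum(axis=0).min()           # min_i sum_t c_t(i)
    return expected_cost - best_fixed
```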

Slide 6

The Multiplicative Weight Update Algorithm

Parameter: $\epsilon$
Initialize weights $w_1(i) = 1$, $\forall i = 1, \dots, n$
For $t = 1, \dots, T$:
1. Let $W_t = \sum_{i\in[n]} w_t(i)$; pick action $i$ with probability $w_t(i)/W_t$
2. Observe the cost vector $c_t \in [0,1]^n$
3. For all $i \in [n]$, update $w_{t+1}(i) = w_t(i) \cdot (1 - \epsilon \cdot c_t(i))$

• Theorem. The MW Update algorithm achieves regret at most $O(\sqrt{T \ln n})$ for the previously described online learning problem.
• Last lecture: both the $\sqrt{T}$ and $\ln n$ terms are necessary
• Next, we prove the theorem
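The algorithm translates almost line for line into code. Below is a self-contained Python sketch (my own; `mw_update` and the T x n cost-matrix interface are assumptions, not the lecture's):

```python
import numpy as np

def mw_update(costs: np.ndarray, eps: float, seed: int = 0):
    """Run Multiplicative Weights on a T x n matrix of costs in [0, 1].

    Returns the played distributions p_t and the sampled actions i_t.
    """
    rng = np.random.default_rng(seed)
    T, n = costs.shape
    w = np.ones(n)                        # w_1(i) = 1 for all i
    dists, actions = [], []
    for t in range(T):
        p = w / w.sum()                   # p_t(i) = w_t(i) / W_t
        dists.append(p)
        actions.append(int(rng.choice(n, p=p)))
        w = w * (1 - eps * costs[t])      # w_{t+1}(i) = w_t(i) (1 - eps * c_t(i))
    return np.array(dists), np.array(actions)
```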

Slide 7

Intuition of the Proof

(Recall the MW Update algorithm and its update rule $w_{t+1}(i) = w_t(i) \cdot (1 - \epsilon \cdot c_t(i))$ from the previous slide.)

• Proof idea: bound how fast the total weight decreases
• The decrease of the weights relates to the expected cost at each round
  • The expected cost at round $t$ is $\bar{c}_t = \sum_{i\in[n]} p_t(i) \cdot c_t(i) = \frac{\sum_{i\in[n]} w_t(i)\, c_t(i)}{W_t}$
  • It is proportional to the decrease of the total weight at round $t$, which is $\sum_{i\in[n]} \epsilon \cdot w_t(i)\, c_t(i) = \epsilon\, W_t\, \bar{c}_t$

Slide 8

Proof Step 1: How Fast do Total Weights Decrease?

Lemma 1. $W_{t+1} \le W_t \cdot e^{-\epsilon \bar{c}_t}$, where $W_t = \sum_{i\in[n]} w_t(i)$ is the total weight at time $t$ and $\bar{c}_t = \sum_{i\in[n]} p_t(i)\, c_t(i) = \frac{\sum_{i\in[n]} w_t(i)\, c_t(i)}{W_t}$ is the expected loss at time $t$.

Proof
• Almost immediate from the update rule $w_{t+1}(i) = w_t(i) \cdot (1 - \epsilon \cdot c_t(i))$:

$W_{t+1} = \sum_{i\in[n]} w_{t+1}(i)$
$\quad = \sum_{i\in[n]} w_t(i) \cdot (1 - \epsilon \cdot c_t(i))$
$\quad = W_t - \epsilon \sum_{i\in[n]} w_t(i)\, c_t(i)$
$\quad = W_t (1 - \epsilon \bar{c}_t)$
$\quad \le W_t \cdot e^{-\epsilon \bar{c}_t}$, since $1 - x \le e^{-x}$ for all $x \ge 0$  ∎

Slide 9

Proof Step 1: How Fast do Total Weights Decrease?

Recall Lemma 1: $W_{t+1} \le W_t \cdot e^{-\epsilon \bar{c}_t}$, where $W_t = \sum_{i\in[n]} w_t(i)$ is the total weight at time $t$ and $\bar{c}_t = \sum_{i\in[n]} p_t(i)\, c_t(i)$ is the expected loss at time $t$.

Corollary 1. $W_{T+1} \le n \cdot e^{-\epsilon \sum_{t=1}^T \bar{c}_t}$.

Applying Lemma 1 repeatedly:

$W_{T+1} \le W_T \cdot e^{-\epsilon \bar{c}_T}$
$\quad \le \big[W_{T-1} \cdot e^{-\epsilon \bar{c}_{T-1}}\big] \cdot e^{-\epsilon \bar{c}_T}$
$\quad = W_{T-1} \cdot e^{-\epsilon (\bar{c}_T + \bar{c}_{T-1})}$
$\quad \;\;\vdots$
$\quad = W_1 \cdot e^{-\epsilon \sum_{t=1}^T \bar{c}_t}$
$\quad = n \cdot e^{-\epsilon \sum_{t=1}^T \bar{c}_t}$

Slide 10

Proof Step 2: Lower Bounding $W_{T+1}$

Lemma 2. $W_{T+1} \ge e^{-T\epsilon^2} \cdot e^{-\epsilon \sum_{t=1}^T c_t(i)}$ for any action $i$.

$W_{T+1} \ge w_{T+1}(i) = w_1(i)\,\big(1 - \epsilon c_1(i)\big)\big(1 - \epsilon c_2(i)\big) \cdots \big(1 - \epsilon c_T(i)\big)$  (by the MW update rule)
$\quad \ge \prod_{t=1}^T e^{-\epsilon c_t(i) - \epsilon^2 [c_t(i)]^2}$  (by the fact $1 - x \ge e^{-x - x^2}$, valid for $x \in [0, 1/2]$)
$\quad \ge e^{-T\epsilon^2} \cdot e^{-\epsilon \sum_{t=1}^T c_t(i)}$  (relaxing $[c_t(i)]^2$ to 1)

Slide 11

Proof Step 3: Combining the Two Lemmas

Recall Corollary 1: $W_{T+1} \le n \cdot e^{-\epsilon \sum_{t=1}^T \bar{c}_t}$, and Lemma 2: $W_{T+1} \ge e^{-T\epsilon^2} \cdot e^{-\epsilon \sum_{t=1}^T c_t(i)}$ for any action $i$.

• Therefore, for any $i$ we have

$e^{-T\epsilon^2} \cdot e^{-\epsilon \sum_{t=1}^T c_t(i)} \le n \cdot e^{-\epsilon \sum_{t=1}^T \bar{c}_t}$
$\Leftrightarrow\; -T\epsilon^2 - \epsilon \sum_{t=1}^T c_t(i) \le \ln n - \epsilon \sum_{t=1}^T \bar{c}_t$  (take "ln" on both sides)
$\Leftrightarrow\; \sum_{t=1}^T \bar{c}_t - \sum_{t=1}^T c_t(i) \le \frac{\ln n}{\epsilon} + T\epsilon$  (rearrange terms)

Taking $\epsilon = \sqrt{\ln n / T}$, we have

$\sum_{t=1}^T \bar{c}_t - \min_{i} \sum_{t=1}^T c_t(i) \le 2\sqrt{T \ln n}$
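As a quick sanity check of the bound (my own sketch, not part of the lecture; the i.i.d. uniform "adversary" is just for illustration), one can run MW with the $\epsilon$ above and compare the realized expected regret to $2\sqrt{T \ln n}$:

```python
import numpy as np

T, n = 10_000, 10
rng = np.random.default_rng(0)
costs = rng.random((T, n))           # cost vectors c_t in [0, 1]^n
eps = np.sqrt(np.log(n) / T)         # epsilon chosen as in the proof

w = np.ones(n)
expected_cost = 0.0                  # accumulates sum_t cbar_t
for t in range(T):
    p = w / w.sum()
    expected_cost += p @ costs[t]
    w *= 1 - eps * costs[t]

regret = expected_cost - costs.sum(axis=0).min()
print(regret, "<=", 2 * np.sqrt(T * np.log(n)))
```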

Slide 12

Remarks

• Some descriptions of MW use the update $w_{t+1}(i) = w_t(i) \cdot e^{-\epsilon \cdot c_t(i)}$. The analysis is similar, thanks to the fact that $e^{-\epsilon} \approx 1 - \epsilon$ for small $\epsilon \in [0,1]$ (a sketch of this variant follows below)
• The same algorithm also works for $c_t \in [-\rho, \rho]^n$ (still using the update rule $w_{t+1}(i) = w_t(i) \cdot (1 - \epsilon \cdot c_t(i))$). The analysis is the same
• MW Update is a very powerful technique: it can also be used to solve, e.g., LPs, semidefinite programs, Set Cover, Boosting, etc.
  • Because it works for arbitrary cost vectors
  • Next, we show how it can be used to compute equilibria of games, where the "cost vector" will be generated by the other players
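For completeness, a one-step sketch (my illustration; `hedge_step` is a hypothetical helper) of the exponential-update variant mentioned in the first remark, often called Hedge:

```python
import numpy as np

def hedge_step(w: np.ndarray, c: np.ndarray, eps: float) -> np.ndarray:
    """One round of the exponential variant: w_{t+1}(i) = w_t(i) * exp(-eps * c_t(i))."""
    return w * np.exp(-eps * c)
```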

Slide 13

Outline

• Regret Proof of MW Update
• Convergence to Minimax Equilibrium
• Convergence to Coarse Correlated Equilibrium

Slide 14

Online Learning: A Natural Way to Play Repeated Games

Repeated game: the same game played for many rounds

• Think about how you play rock-paper-scissors repeatedly
• In reality, we play like online learning
  • You try to analyze the past patterns, then decide which action to respond with, possibly with some randomness
  • This is basically online learning!

Slide 15

Repeated Zero-Sum Games with No-Regret Players

Basic Setup:

ØA zero-sum game with payoff matrix 𝑉 ∈ ℝp×/ ØRow player maximizes utility and has actions 𝑛 = {1, ⋯ , 𝑛}

  • Column player thus minimizes utility

ØThe game is played repeatedly for 𝑈 rounds ØEach player uses an online learning algorithm to pick a mixed

strategy at each round

Slide 16

Repeated Zero-Sum Games with No-Regret Players

• From the row player's perspective, the following occurs in order at round $t$:
  • She picks a mixed strategy $x_t \in \Delta_m$ over actions in $[m]$
  • Her opponent, the column player, picks a mixed strategy $y_t \in \Delta_n$
  • Action $i_t \sim x_t$ is chosen, and the row player receives utility $U(i_t, y_t) = \sum_{k\in[n]} y_t(k) \cdot U(i_t, k)$
  • The row player learns $y_t$ (for future use)
• The column player has a symmetric perspective, but will think of $U(i, k)$ as his cost

Difference from online learning: the utility/cost vector is determined by the opponent, instead of being arbitrarily chosen

Slide 17

Repeated Zero-Sum Games with No-Regret Players

• Expected total utility of the row player: $\sum_{t=1}^T U(x_t, y_t)$
  • Note: $U(x_t, y_t) = \sum_{i,k} U(i,k)\, x_t(i)\, y_t(k) = x_t^\top U\, y_t$
• Regret of the row player: $R_T^{\mathrm{row}} = \max_{i\in[m]} \sum_{t=1}^T U(i, y_t) - \sum_{t=1}^T U(x_t, y_t)$
• Regret of the column player: $R_T^{\mathrm{col}} = \sum_{t=1}^T U(x_t, y_t) - \min_{k\in[n]} \sum_{t=1}^T U(x_t, k)$

Slides 18-21

From No Regret to Minimax Theorem

Next, we give another proof of the minimax theorem, using the fact that no-regret algorithms exist (e.g., MW Update).

• Assume both players use no-regret learning algorithms
• For the row player, we have

$R_T^{\mathrm{row}} = \max_{i\in[m]} \sum_{t=1}^T U(i, y_t) - \sum_{t=1}^T U(x_t, y_t)$

or equivalently

$\frac{1}{T}\sum_{t=1}^T U(x_t, y_t) + \frac{R_T^{\mathrm{row}}}{T} = \frac{1}{T} \max_{i\in[m]} \sum_{t=1}^T U(i, y_t) = \max_{i\in[m]} U\Big(i, \frac{\sum_t y_t}{T}\Big) \ge \min_{y\in\Delta_n} \max_{i\in[m]} U(i, y)$

(the second equality uses linearity of $U(i, \cdot)$)

• Similarly, for the column player,

$R_T^{\mathrm{col}} = \sum_{t=1}^T U(x_t, y_t) - \min_{k\in[n]} \sum_{t=1}^T U(x_t, k)$

implies

$\frac{1}{T}\sum_{t=1}^T U(x_t, y_t) - \frac{R_T^{\mathrm{col}}}{T} = \min_{k\in[n]} U\Big(\frac{\sum_t x_t}{T}, k\Big) \le \max_{x\in\Delta_m} \min_{k\in[n]} U(x, k)$

• Let $T \to \infty$: no regret means $\frac{R_T^{\mathrm{row}}}{T}$ and $\frac{R_T^{\mathrm{col}}}{T}$ tend to 0, so we obtain

$\min_{y\in\Delta_n} \max_{i\in[m]} U(i, y) \le \max_{x\in\Delta_m} \min_{k\in[n]} U(x, k)$

• Recall that min-max $\ge$ max-min also holds, because moving second will never be worse for the row player; combining the two inequalities yields the minimax theorem

Corollary. $\frac{1}{T}\sum_{t=1}^T U(x_t, y_t)$ converges to the game value.

Slide 22

Convergence to Nash Equilibrium

• Recall that $(x^*, y^*)$ is a NE if and only if $x^*$ is the maximin strategy and $y^*$ is the minimax strategy
• From the previous derivations,

$\frac{1}{T}\sum_{t=1}^T U(x_t, y_t) + \frac{R_T^{\mathrm{row}}}{T} = \max_{i\in[m]} U\Big(i, \frac{\sum_t y_t}{T}\Big) \ge \min_{y\in\Delta_n} \max_{i\in[m]} U(i, y)$

• As $T \to \infty$, "$\ge$" becomes "$=$", so $\frac{\sum_t y_t}{T}$ solves the min-max problem
• Similarly, $\frac{\sum_t x_t}{T}$ solves the max-min problem

Theorem. Suppose both players use no-regret learning algorithms with strategy sequences $\{x_t\}$ and $\{y_t\}$. Then $\frac{1}{T}\sum_{t=1}^T U(x_t, y_t)$ converges to the game value, and $\Big(\frac{\sum_{t=1}^T x_t}{T}, \frac{\sum_{t=1}^T y_t}{T}\Big)$ converges to a NE of the game.
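To see the theorem in action, here is a self-contained simulation (my own sketch, not from the slides) in which both players run MW on rock-paper-scissors; the row player's costs are her negated utilities, rescaled into $[0,1]$. The average strategies approach the uniform NE and the average payoff approaches the game value 0:

```python
import numpy as np

U = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])       # rock-paper-scissors; row player maximizes

T = 20_000
m, n = U.shape
eps = np.sqrt(np.log(max(m, n)) / T)
wx, wy = np.ones(m), np.ones(n)
x_sum, y_sum, payoff_sum = np.zeros(m), np.zeros(n), 0.0

for t in range(T):
    x, y = wx / wx.sum(), wy / wy.sum()
    x_sum += x; y_sum += y
    payoff_sum += x @ U @ y
    wx *= 1 - eps * (1 - U @ y) / 2    # row cost: -U(i, y_t), rescaled to [0, 1]
    wy *= 1 - eps * (1 + x @ U) / 2    # column cost: U(x_t, k), rescaled to [0, 1]

print(x_sum / T, y_sum / T, payoff_sum / T)   # ~uniform, ~uniform, ~0
```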

Slide 23

Remarks

• If both players use no-regret algorithms with regret $O(\sqrt{T})$, then $\frac{1}{T}\sum_{t=1}^T U(x_t, y_t)$ converges to the game value at rate $\frac{R_T}{T} = O\big(\frac{1}{\sqrt{T}}\big)$
• This convergence rate can be improved to $O\big(\frac{1}{T}\big)$ by careful regularization of the no-regret algorithm
  • More reading: "Fast Convergence of Regularized Learning in Games" [NIPS'15 best paper]
  • Intuition: our no-regret algorithm assumes adversarial feedback, but the other player is not really an adversary: he uses another no-regret algorithm
  • This can be exploited to improve the learning rate
Slides 24-25

Remarks

• Convergence of no-regret learning to NE is the key framework for designing the AI agents that beat top humans in Texas hold'em poker
  • Plus many other game-solving techniques and engineering work
  • More reading: "Safe and Nested Subgame Solving for Imperfect-Information Games" [NeurIPS'17 best paper]


Exciting research is happening at this intersection of Learning and Game Theory

Slide 26

Outline

• Regret Proof of MW Update
• Convergence to Minimax Equilibrium
• Convergence to Coarse Correlated Equilibrium

Slide 27

Recap: Normal-Form Games and CCE

• $n$ players, denoted by the set $[n] = \{1, \dots, n\}$
• Player $i$ takes action $a_i \in A_i$
• Each player's utility depends on the outcome of the game, i.e., an action profile $a = (a_1, \dots, a_n)$
  • Player $i$ receives payoff $u_i(a)$ for any outcome $a \in \prod_{i=1}^n A_i$
• A coarse correlated equilibrium is an action recommendation policy:

A recommendation policy $\pi$ is a coarse correlated equilibrium (CCE) if
$\sum_{a\in A} u_i(a) \cdot \pi(a) \;\ge\; \sum_{a\in A} u_i(a_i', a_{-i}) \cdot \pi(a), \quad \forall a_i' \in A_i,\; \forall i \in [n].$

That is, for any player $i$, following $\pi$'s recommendations is better than opting out of the recommendations and "acting on his own".
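The CCE condition is easy to check numerically. Below is an illustrative Python sketch (my own; the tensor encoding of the game is an assumption) that tests whether a distribution over joint action profiles satisfies the condition:

```python
import numpy as np

def is_cce(payoffs: list, pi: np.ndarray, tol: float = 1e-9) -> bool:
    """Check the CCE condition.

    payoffs: list of n arrays; payoffs[i][a_1, ..., a_n] = u_i(a)
    pi: array of the same shape, a distribution over joint action profiles
    """
    for i, u_i in enumerate(payoffs):
        on_policy = float((u_i * pi).sum())       # sum_a u_i(a) pi(a)
        marginal = pi.sum(axis=i)                 # pi's marginal over everyone but i
        for a_dev in range(u_i.shape[i]):
            # utility of ignoring recommendations and always playing a_dev
            dev = float((np.take(u_i, a_dev, axis=i) * marginal).sum())
            if dev > on_policy + tol:
                return False
    return True
```

For example, the uniform distribution over all nine joint actions of rock-paper-scissors passes this check.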

Slide 28

Repeated Games with No-Regret Players

• The game is played repeatedly for $T$ rounds
• Each player uses an online learning algorithm to select a mixed strategy at each round $t$
• From any player $i$'s perspective, the following occurs in order at round $t$:
  • Player $i$ picks a mixed strategy $x_i^t \in \Delta_{|A_i|}$ over actions in $A_i$
  • Every other player $k \ne i$ picks a mixed strategy $x_k^t \in \Delta_{|A_k|}$
  • Player $i$ receives expected utility $U_i(x_i^t, x_{-i}^t) = \mathbb{E}_{a \sim (x_i^t, x_{-i}^t)}\,[u_i(a)]$
  • Player $i$ learns $x_{-i}^t$ (for future use)

Slide 29

Repeated Games with No-Regret Players

• Expected total utility of player $i$ equals $\sum_{t=1}^T U_i(x_i^t, x_{-i}^t)$
• Regret of player $i$ is

$R_i^T = \max_{a_i \in A_i} \sum_{t=1}^T U_i(a_i, x_{-i}^t) - \sum_{t=1}^T U_i(x_i^t, x_{-i}^t)$

Slides 30-32

From No Regret to CCE

Theorem. Suppose all players use no-regret learning algorithms, with strategy sequence $\{x_i^t\}_{t\in[T]}$ for player $i$. Then the following recommendation policy $\pi^T$ converges to a CCE:

$\pi^T(a) = \frac{1}{T} \sum_t \prod_{i\in[n]} x_i^t(a_i), \quad \forall a \in A.$

Remarks:
• In the mixed strategy profile $(x_1^t, x_2^t, \dots, x_n^t)$, the probability of outcome $a$ is $\prod_{i\in[n]} x_i^t(a_i)$
• $\pi^T(a)$ is simply the average of $\prod_{i\in[n]} x_i^t(a_i)$ over the $T$ rounds
• Player $i$'s expected utility under $\pi^T$ is

$\sum_{a\in A} \Big[\frac{1}{T}\sum_t \prod_{j\in[n]} x_j^t(a_j)\Big] \cdot u_i(a) = \frac{1}{T}\sum_t \sum_{a\in A} \Big[\prod_{j\in[n]} x_j^t(a_j)\Big] \cdot u_i(a) = \frac{1}{T}\sum_t U_i(x_i^t, x_{-i}^t)$



Proof:
• The CCE condition requires, for every player $i$,

$\frac{1}{T}\sum_t U_i(x_i^t, x_{-i}^t) \;\ge\; \frac{1}{T}\sum_t U_i(a_i, x_{-i}^t), \quad \forall a_i \in A_i \quad (1)$

• The regret of player $i$ is

$R_i^T = \max_{a_i \in A_i} \sum_{t=1}^T U_i(a_i, x_{-i}^t) - \sum_{t=1}^T U_i(x_i^t, x_{-i}^t) \quad (2)$

• Dividing Equation (2) by $T$ and letting $T \to \infty$ yields Condition (1), since $R_i^T / T$ tends to 0 by the definition of no regret ∎

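The policy $\pi^T$ from the theorem is straightforward to assemble from the played strategies. A minimal sketch (my own illustration; `build_cce_policy` is a hypothetical helper for the two-player case, where each round's joint distribution is the outer product of the players' mixed strategies):

```python
import numpy as np

def build_cce_policy(xs: np.ndarray, ys: np.ndarray) -> np.ndarray:
    """Average joint-play distribution pi^T for two players.

    xs: (T, m) array of player 1's mixed strategies x_1^t
    ys: (T, n) array of player 2's mixed strategies x_2^t
    Returns an (m, n) array with pi^T[a1, a2] = (1/T) sum_t x_1^t(a1) * x_2^t(a2).
    """
    T = xs.shape[0]
    return sum(np.outer(x, y) for x, y in zip(xs, ys)) / T
```

Feeding the strategy sequences from the earlier zero-sum simulation into this function (and the resulting policy into a check like `is_cce` above) illustrates the theorem empirically.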

Slide 33

Next lecture:

• Study a stronger regret notion called "swap regret"; it uses a stronger benchmark
• Show that any game with no-swap-regret players will converge to a correlated equilibrium
• Prove that any no-regret algorithm can be converted into a no-swap-regret algorithm, with a slightly worse regret guarantee

Slide 34

Thank You

Haifeng Xu

University of Virginia
hx4ad@virginia.edu