Generalised weakened fictitious play and random belief learning



  1. Generalised weakened fictitious play and random belief learning. David S. Leslie, 12 April 2010. Collaborators: Sean Collins, Claudio Mezzetti, Archie Chapman.

  2. Overview • Learning in games • Stochastic approximation • Generalised weakened fictitious play – Random belief learning – Oblivious learners

  3. Normal form games • Players $i = 1, \ldots, N$ • Action sets $A^i$ • Reward functions $r^i : A^1 \times \cdots \times A^N \to \mathbb{R}$

  4. Mixed strategies • Mixed strategies $\pi^i \in \Delta^i$ • Joint mixed strategy $\pi = (\pi^1, \ldots, \pi^N)$ • Reward function extended so that $r^i(\pi) = E_\pi[r^i(a)]$

  5. Best responses Assume the other players use mixed strategy $\pi^{-i}$. Player $i$ should choose a mixed strategy in the best response set $b^i(\pi^{-i}) = \operatorname{argmax}_{\tilde{\pi}^i \in \Delta^i} r^i(\tilde{\pi}^i, \pi^{-i})$

  6. Best responses Assume the other players use mixed strategy $\pi^{-i}$. Player $i$ should choose a mixed strategy in the best response set $b^i(\pi^{-i}) = \operatorname{argmax}_{\tilde{\pi}^i \in \Delta^i} r^i(\tilde{\pi}^i, \pi^{-i})$ A Nash equilibrium is a fixed point of the best response map: $\pi^i \in b^i(\pi^{-i})$ for all $i$

  7. A problem with Nash Consider the game $\begin{pmatrix} (2,0) & (0,1) \\ (0,2) & (1,0) \end{pmatrix}$ with unique Nash equilibrium $\pi^1 = (2/3, 1/3)$, $\pi^2 = (1/3, 2/3)$

  8. A problem with Nash Consider the game $\begin{pmatrix} (2,0) & (0,1) \\ (0,2) & (1,0) \end{pmatrix}$ with unique Nash equilibrium $\pi^1 = (2/3, 1/3)$, $\pi^2 = (1/3, 2/3)$ • $r^i(a^i, \pi^{-i}) = 2/3$ for each $i$, $a^i$ • How does Player 1 know to use $\pi^1 = (2/3, 1/3)$? • Player 2 to use $\pi^2 = (1/3, 2/3)$?
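
A quick numerical check of the indifference claim above, as a minimal sketch (the NumPy payoff-array layout is an illustrative assumption):

```python
import numpy as np

# Bimatrix game from slide 7: entry (row, col) holds (Player 1 reward, Player 2 reward).
R1 = np.array([[2.0, 0.0],
               [0.0, 1.0]])   # Player 1's rewards
R2 = np.array([[0.0, 1.0],
               [2.0, 0.0]])   # Player 2's rewards

pi1 = np.array([2/3, 1/3])    # Player 1's equilibrium strategy
pi2 = np.array([1/3, 2/3])    # Player 2's equilibrium strategy

# Expected reward of each pure action against the opponent's equilibrium strategy.
print(R1 @ pi2)     # [0.667 0.667] -- Player 1 is indifferent between its actions
print(R2.T @ pi1)   # [0.667 0.667] -- Player 2 is indifferent between its actions
```

Both players are indifferent between their pure actions at the equilibrium, which is exactly why nothing in the payoffs alone tells a player to use those particular mixtures.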

  9. Learning in games • Attempts to justify equilibrium play as the end point of a learning process • Generally assumes pretty stupid players! • Related to evolutionary game theory

  10. Multi-armed bandits At time $n$, choose action $a_n$, and receive reward $R_n$

  11. Multi-armed bandits Estimate after time $n$ of the expected reward for action $a \in A$ is: $Q_n(a) = \frac{\sum_{m \le n : a_m = a} R_m}{\kappa_n(a)}$ where $\kappa_n(a) = \sum_{m=1}^{n} I\{a_m = a\}$

  12. Multi-armed bandits If $a_n \neq a$, then $\kappa_n(a) = \kappa_{n-1}(a)$ and: $Q_n(a) = \frac{\left(\sum_{m=1}^{n-1} I\{a_m = a\} R_m\right) + 0}{\kappa_{n-1}(a)} = Q_{n-1}(a)$

  13. Multi-armed bandits If $a_n = a$, $Q_n(a) = \frac{\left(\sum_{m=1}^{n-1} I\{a_m = a\} R_m\right) + R_n}{\kappa_n(a)} = \left(1 - \frac{1}{\kappa_n(a)}\right) Q_{n-1}(a) + \frac{1}{\kappa_n(a)} R_n$

  14. Multi-armed bandits Update estimates using $Q_n(a) = \begin{cases} Q_{n-1}(a) + \frac{1}{\kappa_n(a)}\{R_n - Q_{n-1}(a)\} & \text{if } a_n = a \\ Q_{n-1}(a) & \text{if } a_n \neq a \end{cases}$ At time $n+1$ use $Q_n$ to choose an action $a_{n+1}$
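
A minimal sketch of this incremental update in code (the $\epsilon$-greedy action selection is an illustrative assumption; the slides do not commit to a particular choice rule):

```python
import random
from collections import defaultdict

def bandit(actions, reward, steps=1000, eps=0.1):
    """Sample-average estimates Q_n(a), maintained incrementally via counts kappa_n(a)."""
    Q = defaultdict(float)     # Q_n(a): running average reward for action a
    kappa = defaultdict(int)   # kappa_n(a): number of times a has been played
    for n in range(steps):
        # Choose a_{n+1} using Q_n: here epsilon-greedy (an assumption).
        if random.random() < eps or not Q:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[x])
        R = reward(a)          # receive reward R_n
        kappa[a] += 1
        # Q_n(a) = Q_{n-1}(a) + (1 / kappa_n(a)) * (R_n - Q_{n-1}(a)); other actions unchanged.
        Q[a] += (R - Q[a]) / kappa[a]
    return dict(Q)
```

For example, `bandit(['left', 'right'], lambda a: random.gauss(1.0 if a == 'left' else 0.5, 1.0))` gives estimates near the two arms' true means, with the better arm sampled most often.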

  15. Fictitious play At iteration $n+1$, player $i$: • forms beliefs $\sigma^{-i}_n \in \Delta^{-i}$ about the other players' strategies • chooses an action in $b^i(\sigma^{-i}_n)$

  16. Belief formation The beliefs about player $j$ are simply the MLE: $\sigma^j_n(a^j) = \frac{\kappa^j_n(a^j)}{n}$ where $\kappa^j_n(a^j) = \sum_{m=1}^{n} I\{a^j_m = a^j\}$

  17. Belief formation The beliefs about player $j$ are simply the MLE: $\sigma^j_n(a^j) = \frac{\kappa^j_n(a^j)}{n}$ where $\kappa^j_n(a^j) = \sum_{m=1}^{n} I\{a^j_m = a^j\}$ Recursive update: $\sigma^j_{n+1}(a^j) = \frac{\kappa^j_{n+1}(a^j)}{n+1} = \frac{\kappa^j_n(a^j) + I\{a^j_{n+1} = a^j\}}{n+1}$

  18. Belief formation The beliefs about player $j$ are simply the MLE: $\sigma^j_n(a^j) = \frac{\kappa^j_n(a^j)}{n}$ where $\kappa^j_n(a^j) = \sum_{m=1}^{n} I\{a^j_m = a^j\}$ Recursive update: $\sigma^j_{n+1}(a^j) = \frac{\kappa^j_{n+1}(a^j)}{n+1} = \frac{\kappa^j_n(a^j) + I\{a^j_{n+1} = a^j\}}{n+1} = \frac{n}{n+1}\,\sigma^j_n(a^j) + \frac{I\{a^j_{n+1} = a^j\}}{n+1}$

  19. Belief formation The beliefs about player $j$ are simply the MLE: $\sigma^j_n(a^j) = \frac{\kappa^j_n(a^j)}{n}$ where $\kappa^j_n(a^j) = \sum_{m=1}^{n} I\{a^j_m = a^j\}$ Recursive update: $\sigma^j_{n+1}(a^j) = \left(1 - \frac{1}{n+1}\right) \sigma^j_n(a^j) + \frac{1}{n+1}\, I\{a^j_{n+1} = a^j\}$

  20. Belief formation The beliefs about player $j$ are simply the MLE: $\sigma^j_n(a^j) = \frac{\kappa^j_n(a^j)}{n}$ where $\kappa^j_n(a^j) = \sum_{m=1}^{n} I\{a^j_m = a^j\}$ Recursive update: $\sigma^j_{n+1} = \left(1 - \frac{1}{n+1}\right) \sigma^j_n + \frac{1}{n+1}\, e_{a^j_{n+1}}$

  21. Belief formation The beliefs about player $j$ are simply the MLE: $\sigma^j_n(a^j) = \frac{\kappa^j_n(a^j)}{n}$ where $\kappa^j_n(a^j) = \sum_{m=1}^{n} I\{a^j_m = a^j\}$ Recursive update: $\sigma^j_{n+1} = \left(1 - \frac{1}{n+1}\right) \sigma^j_n + \frac{1}{n+1}\, e_{a^j_{n+1}}$ In terms of best responses: $\sigma^j_{n+1} \in \left(1 - \frac{1}{n+1}\right) \sigma^j_n + \frac{1}{n+1}\, b^j(\sigma^{-j}_n)$

  22. Belief formation The beliefs about player $j$ are simply the MLE: $\sigma^j_n(a^j) = \frac{\kappa^j_n(a^j)}{n}$ where $\kappa^j_n(a^j) = \sum_{m=1}^{n} I\{a^j_m = a^j\}$ Recursive update: $\sigma^j_{n+1} = \left(1 - \frac{1}{n+1}\right) \sigma^j_n + \frac{1}{n+1}\, e_{a^j_{n+1}}$ In terms of best responses: $\sigma_{n+1} \in \left(1 - \frac{1}{n+1}\right) \sigma_n + \frac{1}{n+1}\, b(\sigma_n)$
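
A minimal sketch of this recursion for the 2×2 game of slide 7, with uniform initial beliefs and ties broken by the first maximiser (both choices are assumptions made only for illustration):

```python
import numpy as np

R1 = np.array([[2.0, 0.0], [0.0, 1.0]])   # Player 1's rewards (game from slide 7)
R2 = np.array([[0.0, 1.0], [2.0, 0.0]])   # Player 2's rewards

def fictitious_play(steps=100000):
    sigma1 = np.array([0.5, 0.5])   # belief held about Player 1 (assumed uniform start)
    sigma2 = np.array([0.5, 0.5])   # belief held about Player 2
    for n in range(1, steps + 1):
        # Each player plays a best response to its belief about the other.
        a1 = int(np.argmax(R1 @ sigma2))
        a2 = int(np.argmax(R2.T @ sigma1))
        # sigma_{n+1} = (1 - 1/(n+1)) sigma_n + (1/(n+1)) e_{a_{n+1}}
        sigma1 += (np.eye(2)[a1] - sigma1) / (n + 1)
        sigma2 += (np.eye(2)[a2] - sigma2) / (n + 1)
    return sigma1, sigma2

print(fictitious_play())   # beliefs approach the unique equilibrium (2/3, 1/3), (1/3, 2/3)
```

Whether the beliefs actually reach the equilibrium is exactly the convergence question addressed by the stochastic approximation argument that follows.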

  23. Stochastic approximation

  24. Stochastic approximation $\theta_{n+1} \in \theta_n + \alpha_{n+1} \{F(\theta_n) + M_{n+1}\}$

  25. Stochastic approximation $\theta_{n+1} \in \theta_n + \alpha_{n+1} \{F(\theta_n) + M_{n+1}\}$ • $F : \Theta \to \Theta$ is a (bounded u.s.c.) set-valued map • $\alpha_n \to 0$, $\sum_n \alpha_n = \infty$ • For any $T > 0$, $\lim_{n \to \infty} \sup_{k > n : \sum_{i=n}^{k-1} \alpha_{i+1} \le T} \left\| \sum_{i=n}^{k-1} \alpha_{i+1} M_{i+1} \right\| = 0$ The last condition is implied by: $\sum_n (\alpha_n)^2 < \infty$, $E[M_{n+1} \mid \theta_n] \to 0$, and $\mathrm{Var}[M_{n+1} \mid \theta_n] < C$ almost surely.

  26. Stochastic approximation $\theta_{n+1} \in \theta_n + \alpha_{n+1} \{F(\theta_n) + M_{n+1}\}$, i.e. $\frac{\theta_{n+1} - \theta_n}{\alpha_{n+1}} \in F(\theta_n) + M_{n+1}$ ↓ $\frac{d}{dt}\theta \in F(\theta)$, a differential inclusion (Benaïm, Hofbauer and Sorin, 2005)

  27. Stochastic approximation $\theta_{n+1} \in \theta_n + \alpha_{n+1} \{F(\theta_n) + M_{n+1}\}$ In fictitious play: $\sigma_{n+1} \in \sigma_n + \frac{1}{n+1} \{b(\sigma_n) - \sigma_n\}$ ↓ $\frac{d}{dt}\sigma \in b(\sigma) - \sigma$, the best response differential inclusion. Hence $\sigma_n$ converges to the set of Nash equilibria in zero-sum games, potential games, and generic $2 \times m$ games.

  28. Generalised weakened fictitious play

  29. Weakened fictitious play • Van der Genugten (2000) showed that the convergence rate of fictitious play can be improved if players use $\epsilon_n$-best responses (for 2-player zero-sum games, and a very specific choice of $\epsilon_n$) • $\pi \in b^{\epsilon_n}(\sigma_n) \Rightarrow \pi \in b(\sigma_n) + M_{n+1}$ where $M_n \to 0$ as $\epsilon_n \to 0$ (by continuity properties of $b$ and boundedness of $r$) • For general games and general $\epsilon_n \to 0$ this fits into the stochastic approximation framework

  30. Generalised weakened fictitious play Theorem: Any process such that $\sigma_{n+1} \in \sigma_n + \alpha_{n+1} \{b^{\epsilon_n}(\sigma_n) - \sigma_n + M_{n+1}\}$ where • $\epsilon_n \to 0$ as $n \to \infty$ • $\alpha_n \to 0$ as $n \to \infty$ • $\lim_{n \to \infty} \sup_{k > n : \sum_{i=n}^{k-1} \alpha_{i+1} \le T} \left\| \sum_{i=n}^{k-1} \alpha_{i+1} M_{i+1} \right\| = 0$ converges to the set of Nash equilibria for zero-sum games, potential games and generic $2 \times m$ games.
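
The key new object in the theorem is the $\epsilon_n$-best response set $b^{\epsilon_n}$. A minimal sketch, restricted to pure actions for simplicity (this particular construction is an illustrative assumption):

```python
import numpy as np

def eps_best_responses(R, sigma_opp, eps):
    """Pure actions whose expected reward against sigma_opp is within eps of optimal."""
    values = R @ sigma_opp                           # expected reward of each pure action
    return np.flatnonzero(values >= values.max() - eps)
```

Any mixture over the actions returned here earns within $\epsilon$ of the best achievable reward against `sigma_opp`, so playing such a mixture is one way to realise the $b^{\epsilon_n}(\sigma_n)$ term in the theorem.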

  31. Recency • For classical fictitious play $\alpha_n = \frac{1}{n}$, $\epsilon_n \equiv 0$ and $M_n \equiv 0$ • For any $\alpha_n \to 0$ the conditions are met (since $M_n \equiv 0$) • How about $\alpha_n = \frac{1}{\sqrt{n}}$, or even $\alpha_n = \frac{1}{\log n}$?

  32. Recency [Figure: Belief that Player 1 plays Heads over 200 plays of the two-player matching pennies game under classical fictitious play (top), under a modified fictitious play with $\alpha_n = \frac{1}{\sqrt{n}}$ (middle), and with $\alpha_n = \frac{1}{\log n}$ (bottom).]
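
A sketch of the kind of simulation behind this figure, assuming the standard ±1 matching pennies payoffs and uniform initial beliefs (both assumptions, since the slide does not give them):

```python
import numpy as np

# Matching pennies: Player 1 wins on a match, Player 2 wins on a mismatch.
R1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
R2 = -R1

def recency_fp(alpha, steps=200):
    """Fictitious play with a general step size alpha(n); returns the belief that Player 1 plays Heads."""
    sigma1, sigma2 = np.array([0.5, 0.5]), np.array([0.5, 0.5])
    trace = []
    for n in range(1, steps + 1):
        a1 = int(np.argmax(R1 @ sigma2))
        a2 = int(np.argmax(R2.T @ sigma1))
        # sigma_{n+1} = (1 - alpha_{n+1}) sigma_n + alpha_{n+1} e_{a_{n+1}}
        sigma1 += alpha(n) * (np.eye(2)[a1] - sigma1)
        sigma2 += alpha(n) * (np.eye(2)[a2] - sigma2)
        trace.append(sigma1[0])
    return trace

classical = recency_fp(lambda n: 1 / (n + 1))
sqrt_step = recency_fp(lambda n: 1 / np.sqrt(n + 1))
log_step  = recency_fp(lambda n: 1 / np.log(n + 2))   # shifted so the step stays below 1
```

Larger step sizes weight recent play more heavily, which is the recency effect the three panels compare.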

  33. Stochastic fictitious play In fictitious play, players always choose pure actions ⇒ strategies never converge to mixed strategies (beliefs do, but played strategies do not)

  34. Stochastic fictitious play Instead consider smooth best responses: $\beta^i_\tau(\sigma^{-i}) = \operatorname{argmax}_{\pi^i \in \Delta^i} \left\{ r^i(\pi^i, \sigma^{-i}) + \tau v(\pi^i) \right\}$ For example $\beta^i_\tau(\sigma^{-i})(a^i) = \frac{\exp\{r^i(a^i, \sigma^{-i})/\tau\}}{\sum_{a \in A^i} \exp\{r^i(a, \sigma^{-i})/\tau\}}$

  35. Stochastic fictitious play Instead consider smooth best responses: $\beta^i_\tau(\sigma^{-i}) = \operatorname{argmax}_{\pi^i \in \Delta^i} \left\{ r^i(\pi^i, \sigma^{-i}) + \tau v(\pi^i) \right\}$ For example $\beta^i_\tau(\sigma^{-i})(a^i) = \frac{\exp\{r^i(a^i, \sigma^{-i})/\tau\}}{\sum_{a \in A^i} \exp\{r^i(a, \sigma^{-i})/\tau\}}$ Strategies evolve according to $\sigma_{n+1} = \sigma_n + \frac{1}{n+1} \{\beta_\tau(\sigma_n) + M_{n+1} - \sigma_n\}$ where $E[M_{n+1} \mid \sigma_n] = 0$
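
A minimal sketch of the logit smooth best response given above (the payoff-matrix representation is an assumption made for illustration):

```python
import numpy as np

def logit_best_response(R, sigma_opp, tau):
    """Logit (Boltzmann) smooth best response beta^i_tau to the opponent's mixed strategy."""
    values = R @ sigma_opp                    # r^i(a^i, sigma^{-i}) for each pure action a^i
    z = (values - values.max()) / tau         # subtract the max for numerical stability
    weights = np.exp(z)
    return weights / weights.sum()            # a full-support mixed strategy
```

In stochastic fictitious play the action actually played is sampled from this distribution, and $M_{n+1}$ is the resulting zero-mean sampling noise.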

  36. Convergence $\sigma_{n+1} = \sigma_n + \frac{1}{n+1} \{\beta_\tau(\sigma_n) - \sigma_n + M_{n+1}\}$

  37. Convergence $\sigma_{n+1} = \sigma_n + \frac{1}{n+1} \{\beta_\tau(\sigma_n) - \sigma_n + M_{n+1}\} \in \sigma_n + \frac{1}{n+1} \{b^\epsilon(\sigma_n) - \sigma_n + M_{n+1}\}$

  38. Convergence $\sigma_{n+1} = \sigma_n + \frac{1}{n+1} \{\beta_\tau(\sigma_n) - \sigma_n + M_{n+1}\} \in \sigma_n + \frac{1}{n+1} \{b^\epsilon(\sigma_n) - \sigma_n + M_{n+1}\}$ But we can now consider the effect of using the smooth best response $\beta_{\tau_n}$ with $\tau_n \to 0$... it means that $\epsilon_n \to 0$, resulting in a GWFP!

  39. Random belief learning

  40. Random beliefs (Friedman and Mezzetti 2005) Best response ‘assumes’ complete confidence in: • knowledge of the reward functions • beliefs σ about opponent strategy

  41. Random beliefs (Friedman and Mezzetti 2005) Best response ‘assumes’ complete confidence in: • knowledge of the reward functions • beliefs σ about opponent strategy

  42. Random beliefs (Friedman and Mezzetti 2005) Best response ‘assumes’ complete confidence in: • knowledge of the reward functions • beliefs $\sigma$ about opponent strategy Uncertainty in the beliefs $\sigma_n$ ↔ a distribution on belief space

  43. Belief distributions • The belief about player $j$ is that $\pi^j \sim \mu^j$ • $E_{\mu^j}[\pi^j] = \sigma^j$, the focus of $\mu^j$.
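
As a concrete illustration (the Dirichlet family is an assumption here, not something the slides prescribe), a belief distribution with focus $\sigma^j$ can be sampled as follows; the concentration parameter controls how tightly the sampled strategies cluster around the focus:

```python
import numpy as np

def sample_random_belief(focus, concentration, rng=np.random.default_rng()):
    """Draw an opponent strategy pi^j ~ mu^j whose mean (the focus) is the given vector.

    Uses a Dirichlet(concentration * focus) distribution, whose mean equals the focus.
    """
    return rng.dirichlet(concentration * np.asarray(focus))

# Example: beliefs with focus (1/3, 2/3); a larger concentration means less uncertainty.
print(sample_random_belief([1/3, 2/3], concentration=10.0))
```

A random-belief learner would then best respond to a strategy drawn from $\mu^j$ rather than to the focus $\sigma^j$ itself.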
