  1. Beliefs and Learning in Repeated Games
  Florin Constantin and Ivo Parashkevov
  March 15, 2006

  2. Context
  • 2-player discounted repeated games [can be extended to n-player]
  • want to provably learn equilibrium play, as quickly as possible and with as little information as possible

  3. Rational (Bayesian) Learning
  • use beliefs about opponents' strategies to guide prediction of future play
  • play Best Response to beliefs
  • update beliefs based on actual play
  • learning = repeatedly update beliefs until play converges to equilibrium

  4. Belief Learning vs. Bayesian Learning
  • Behavior Strategy: history → distribution over opponent's play in the next period. Example: h_opp = (C, C, D) → Pr_{t=4}(C) = 2/3, Pr_{t=4}(D) = 1/3
  • Belief Learning - prediction rule as behavior strategy: associate probabilities with future play of opponents based on play history. Best Respond to the prediction rule
  • Bayesian Learning - Best Respond to beliefs
  • Belief Learning as Bayesian Learning: Best Respond to the belief that puts probability 1 on the behavior strategy predicted by the prediction rule
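
A minimal sketch of the empirical-frequency prediction rule used in the example above (Python; the function name and the string encoding of actions are illustrative, not from the slides):

```python
from collections import Counter

def empirical_forecast(opp_history, actions):
    """Forecast the opponent's next action by its empirical frequency so far."""
    counts = Counter(opp_history)
    total = len(opp_history)
    if total == 0:
        # No observations yet: fall back to a uniform forecast.
        return {a: 1.0 / len(actions) for a in actions}
    return {a: counts[a] / total for a in actions}

# The slide's example: h_opp = (C, C, D) gives Pr_{t=4}(C) = 2/3, Pr_{t=4}(D) = 1/3.
print(empirical_forecast(["C", "C", "D"], ["C", "D"]))
```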

  5. Belief Learning vs. Bayesian Learning II
  • Bayesian Learning as Belief Learning: For any belief B of player 1 over player 2's behavior strategies, there exists an equivalent belief assigning probability 1 to a particular behavior strategy (called the reduced form of B). Prediction rule: predict the reduced form

  6. Fictitious Play
  • P(opponent plays s at time t) = t/(t+k) · (freq of s up to time t) + k/(t+k) · prior(s)
  • Assumptions
    – myopia
  • if it converges, it converges to NE
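
The forecast formula, sketched together with a myopic best response (the payoff-dictionary representation and the uniform-prior example are assumptions for illustration; the payoffs match the Prisoner's Dilemma used later):

```python
def fictitious_play_forecast(opp_history, prior, k):
    """P_t(s) = t/(t+k) * freq(s up to t) + k/(t+k) * prior(s),
    where t = len(opp_history) and k is the weight on the prior."""
    t = len(opp_history)
    forecast = {}
    for s, p in prior.items():
        freq = opp_history.count(s) / t if t > 0 else 0.0
        forecast[s] = (t / (t + k)) * freq + (k / (t + k)) * p
    return forecast

def myopic_best_response(forecast, payoff):
    """Myopic best response: maximize expected stage payoff against the forecast.
    payoff[(own, opp)] is the player's own stage payoff."""
    own_actions = {own for own, _ in payoff}
    expected = {own: sum(q * payoff[(own, opp)] for opp, q in forecast.items())
                for own in own_actions}
    return max(expected, key=expected.get)

# Example with a uniform prior of weight k = 2 over {C, D}.
f = fictitious_play_forecast(["C", "C", "D"], {"C": 0.5, "D": 0.5}, k=2)
print(f, myopic_best_response(f, {("C", "C"): 1, ("C", "D"): -1,
                                  ("D", "C"): 2, ("D", "D"): 0}))
```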

  7. Calibrated Learning
  • use forecasts; if
    – every player plays Best Response to forecasts
    – forecasts are calibrated
    then learning converges to Correlated Equilibrium
  • history is the correlating device (umpire)
  • Assumptions
    – stationary tie-breaking rule
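
A rough sketch of an empirical calibration check in the spirit of this slide: group periods by the forecast issued and compare each forecast with the realized frequency in its group. Binning forecasts to a grid is an arbitrary choice, and calibration is really an asymptotic property; this is only an illustration.

```python
from collections import defaultdict

def worst_calibration_gap(forecasts, outcomes, grid=0.1):
    """Largest absolute gap between a (rounded) forecast value and the
    realized frequency of the event among the periods where it was issued."""
    groups = defaultdict(list)
    for p, hit in zip(forecasts, outcomes):
        groups[round(p / grid) * grid].append(hit)
    return max(abs(p - sum(hits) / len(hits)) for p, hits in groups.items())

# Forecasts of "opponent plays C" and 0/1 indicators of whether C was played.
print(worst_calibration_gap([0.7, 0.7, 0.7, 0.3, 0.3], [1, 1, 0, 0, 1]))
```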

  8. Problematic assumptions in papers so far
  • myopia
    – ignores strategic considerations about the future - cannot experiment for long-run benefit
    – can only implement NE of the repeated game that consist of NE in the stage games (e.g. no trigger strategies)
  • observable rewards
  • common prior

  9. Kalai & Lehrer - Rational Learning
  • Setting:
    – n-player infinitely repeated discounted games
    – subjective rationality - best responding to beliefs
    – learning is through Bayesian updating of individual prior
    – encode beliefs as behavior strategies
  • Main result: if individual beliefs are compatible with actual play then best response to beliefs leads to accurate prediction of future play. Play converges to Nash equilibrium play.
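
A minimal sketch of the Bayesian updating step over a finite set of candidate opponent strategies (the finite candidate set and the dictionary encoding are simplifications; Kalai & Lehrer allow general beliefs over behavior strategies):

```python
def posterior_over_strategies(prior, candidates, opp_history):
    """Bayesian update of a prior over candidate opponent behavior strategies.

    prior:      dict name -> prior probability.
    candidates: dict name -> function(history_so_far) -> dict action -> prob.
    opp_history: list of the opponent's observed actions.
    """
    weights = dict(prior)
    for t, action in enumerate(opp_history):
        for name, strategy in candidates.items():
            # Likelihood contribution: the probability this candidate
            # assigned to the action actually observed at time t.
            weights[name] *= strategy(opp_history[:t]).get(action, 0.0)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observed play had zero prior probability "
                         "(absolute continuity is violated)")
    return {name: w / total for name, w in weights.items()}

# Two candidates: 'always C' and 'always D', with equal prior weight.
candidates = {"always C": lambda h: {"C": 1.0, "D": 0.0},
              "always D": lambda h: {"C": 0.0, "D": 1.0}}
print(posterior_over_strategies({"always C": 0.5, "always D": 0.5},
                                candidates, ["C", "C"]))
```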

  10. Assumptions
  • Perfect monitoring - observe actions of other players
  • Independence of other players' actions and beliefs
  • No longer assume common prior or myopia
  • Opponents not assumed to be rational
  • Knowledge of own payoff matrix

  11. Some Notation
  • n finite sets Σ_1, Σ_2, ..., Σ_n of actions
  • H_t - set of histories of length t. H = ∪_t H_t is the set of all finite histories.
  • behavior strategy of player i is a function f_i : H → Δ(Σ_i), where Δ(Σ_i) denotes the set of probability distributions over Σ_i
  • µ_f is the probability distribution over the set of infinite play paths induced by the strategy vector f

  12. Absolute Continuity and Grain of Truth Assumptions
  What does it mean to have "beliefs compatible with actual play"?
  • Do not assign zero probability to events that can occur in the play of the game.
  • "Grain of Truth" - beliefs about opponent's play assign a (small) positive probability to the strategy actually chosen.
    – Sufficient, but stronger than needed.
  • Absolute Continuity - measure µ_f is absolutely continuous w.r.t. µ_g (µ_f ≪ µ_g) if µ_f(A) > 0 ⇒ µ_g(A) > 0 for all sets A ⊆ Σ^∞.
  • Main result requires: actual µ ≪ belief µ̃_i

  13. Prisoner's Dilemma Example

            D      C
       D   0,0    2,-1
       C  -1,2    1,1

  • Consider strategies
    – g_∞: grim trigger
    – g_t: use grim trigger until time t, then defect forever
  • P1 assigns probs (β_0, β_1, ..., β_∞) to P2 playing (g_0, g_1, ..., g_∞), β_t > 0. P2 assigns probs (α_0, α_1, ..., α_∞) to P1 playing (g_0, g_1, ..., g_∞), α_t > 0.
  • According to own learning parameters, P1 chooses g_{t_1} and P2 chooses g_{t_2} (see the sketch below).
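
A small simulation sketch of the g_t strategies (function names and the string encoding of actions are hypothetical); it reproduces the path of play described on the next slide: joint cooperation until min(t_1, t_2), then defection forever.

```python
def g(cutoff):
    """The slide's g_t: play grim trigger until period `cutoff`, then defect
    forever.  cutoff=float('inf') is the plain grim trigger g_infinity."""
    def strategy(t, opp_history):
        if t >= cutoff or "D" in opp_history:
            return "D"
        return "C"
    return strategy

def play(strat1, strat2, periods):
    """Simulate the repeated game; each player conditions on the opponent's past actions."""
    h1, h2 = [], []
    for t in range(periods):
        a1, a2 = strat1(t, h2), strat2(t, h1)
        h1.append(a1)
        h2.append(a2)
    return list(zip(h1, h2))

# t1 = 3, t2 = 5: joint cooperation up to period min(t1, t2), defection afterwards.
print(play(g(3), g(5), periods=7))
```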

  14. Prisoner's Dilemma Example
  • all events with positive probability in the game (C until time t < min(t_1, t_2), D after min(t_1, t_2), etc.) are assigned positive probability by players' beliefs: beliefs are compatible with actual play.
  • learning must occur - if t_1 < t_2 then P2 will assign prob 1 to P1 playing g_{t_1} from time t_1 + 1 on. So P2 knows that P1 will defect forever.
  • P1 will not know P2's strategy - will only know that t_2 > t_1, but will be able to predict that P2 will defect forever as well - future play is learned only on the play path.

  15. Prisoner's Dilemma Example
  What if t_1 = t_2 = ∞?
  • after time t, P1 knows P2 did not play g_0, ..., g_t and assigns probabilities (β_{t+1}, ..., β_∞) / Σ_{r=t+1}^∞ β_r to (g_{t+1}, ..., g_∞). Since β_∞ > 0, β_∞ / Σ_{r=t+1}^∞ β_r → 1 as t → ∞.
  • P1 becomes more and more confident that P2 is playing g_∞, but never knows for sure.
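
A numerical illustration of this convergence under one assumed prior (the geometric weights β_r = (1/2)^(r+2) for finite r and β_∞ = 1/2 are an arbitrary choice, not from the slides; they sum to 1):

```python
beta_inf = 0.5
beta = lambda r: 0.5 ** (r + 2)   # prior weight on g_r for finite r

def weight_on_grim(t, horizon=1000):
    """P1's posterior weight on g_infinity after ruling out g_0, ..., g_t:
    beta_inf / (beta_inf + sum of beta_r over finite r > t)."""
    tail = sum(beta(r) for r in range(t + 1, horizon))
    return beta_inf / (beta_inf + tail)

for t in (0, 5, 20):
    print(t, weight_on_grim(t))   # rises toward 1 but never reaches it
```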

  16. Definitions
  • Let ε > 0 and µ, µ̃ be two probability measures. µ is ε-close to µ̃ if ∃ a set Q such that
    – µ(Q) > 1 − ε and µ̃(Q) > 1 − ε
    – ∀ A ⊆ Q, (1 − ε) µ̃(A) ≤ µ(A) ≤ (1 + ε) µ̃(A)
  • f plays ε-like g if µ_f is ε-close to µ_g.
  • Let f be a strategy, t ≥ 0 and h a history up to time t. The induced strategy f_h(·) is defined by f_h(h′) = f(concat(h, h′)) for all finite histories h′.
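
On a finite set of play paths, the ε-closeness condition could be checked as sketched below. Taking Q to be the whole space and verifying the bounds on every singleton is a sufficient (stronger) test than the definition, since both bounds add up over disjoint paths.

```python
def eps_close(mu, mu_tilde, eps):
    """Check eps-closeness on a finite set of play paths, with Q = whole space.

    mu, mu_tilde: dicts mapping each path to its probability."""
    return all((1 - eps) * mu_tilde[z] <= mu[z] <= (1 + eps) * mu_tilde[z]
               for z in mu)

print(eps_close({"CC": 0.5, "CD": 0.5}, {"CC": 0.52, "CD": 0.48}, eps=0.1))  # True
```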

  17. Theorem 1
  Let f be the strategy that is actually chosen and f^i be the beliefs of player i. Assume f is absolutely continuous with respect to f^i. Then ∀ ε > 0, for almost every play path z, ∃ a time T(z, ε) such that ∀ t ≥ T(z, ε), f_{z(t)} plays ε-like f^i_{z(t)}.
  If players maximize payoff then they will eventually be playing a subjective ε-equilibrium:
  • each player plays a Best Response to own beliefs
  • these beliefs will never be contradicted (by more than ε) by actual play
  Interpretation?

  18. Theorem 2
  Let f be the strategy that is actually chosen and f^1, ..., f^n be the beliefs of players 1, ..., n. Assume
  • f is absolutely continuous with respect to each f^i
  • each player plays a Best Response to its beliefs.
  Then ∀ ε > 0, for almost every play path z, ∃ a time T(z, ε) such that for all t ≥ T(z, ε) there exists an ε-Nash Equilibrium f̄ of the repeated game such that f_{z(t)} plays ε-like f̄.

  19. Comments
  • Theorem 1 does not assume anything about players' strategies.
  • Convergence of beliefs with reality occurs only on the actual play path. Players do not learn what their opponents will do in response to actions that will not be taken.
  • If players are best responding (Theorem 2), then convergence is to NE play in the repeated game, not to repeated play of a single stage NE.
  • Convergence is to an equilibrium play, not to an equilibrium. We are not learning Nash strategies, but we can learn to play as if we knew them.

  20. So what?
  • If
    – the assumptions are met
    – all other players play Best Response to their beliefs
    can you do better?

  21. Beliefs in Repeated Games - Nachbar 2005
  Main Result: For a large class of repeated games, beliefs cannot simultaneously satisfy:
  • learnability
  • consistency
  • CSP (a diversity-of-beliefs condition)

  22. Learnability - informally
  Player 1 learns to predict the path of play generated by σ_2 if her one-period-ahead forecasts along the path of play eventually become almost as accurate as if she knew σ_2.

  23. Learnability - formally
  Fix a belief β_2 of player 1 about player 2's strategy.
  Player 1 learns to predict the path of play generated by behavior strategy σ = (σ_1, σ_2) iff
  • ∀ finite history h, µ_{(σ_1, σ_2)}(h*) > 0 ⇒ µ_{(σ_1, β_2)}(h*) > 0, where h* = the set of all paths of play starting with h
  • ∀ ε > 0 and for almost all paths of play z, ∃ T(ε, z) such that for any time t > T(ε, z) and any action a_2 of player 2,
    |(the prob that σ_2(h) assigns to a_2) − (the prob that σ_{β_2}(h) assigns to a_2)| < ε,
    where h = the first t stages of z and σ_{β_2} = the reduced form of β_2 (a finite-horizon check is sketched below).
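
A finite-horizon sketch of the forecast condition; with only finitely many periods it can only approximate the "for all t > T" quantifier, and the strategy representations are assumptions for illustration.

```python
def forecast_errors(sigma2, sigma_beta2, path, opp_actions):
    """Per-period gap between the true one-step-ahead forecast sigma2(h) and
    the belief's reduced-form forecast sigma_beta2(h) along one play path.

    sigma2, sigma_beta2: functions history -> dict action -> probability.
    path: the realized sequence of action profiles (used only as histories)."""
    errors = []
    for t in range(len(path)):
        h = path[:t]
        true_f, belief_f = sigma2(h), sigma_beta2(h)
        errors.append(max(abs(true_f[a] - belief_f[a]) for a in opp_actions))
    return errors

def learned_by(errors, eps):
    """First period T such that every later error in this finite record is
    below eps (only an approximation of the definition's asymptotic clause)."""
    bad = [t for t, e in enumerate(errors) if e >= eps]
    return bad[-1] + 1 if bad else 0
```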

  24. CSP
  Two conditions: CS and P, both addressing the richness of Σ̂. All restrictions are only on the path of play!
  • CS - (Weak) Caution and Symmetry
    – s_1 is a simple variant of s_2 if s_1 can be generated from s_2 by a uniform relabeling of actions (a toy relabeling sketch appears after slide 25)
    – Weak Caution means: if I believe you could play the pure strategy s_1, then I also believe you could play all simple variants of s_1. Strong caution would mean Σ̂_i = Σ_i
    – Symmetry means: if I believe that you can play s_1, then you believe I can play all simple variants of s_1.

  25.
    – Symmetry is motivated by the necessity to have equally powerful strategy-generating machines.
  • P
    – if a behavior strategy σ_2 is in Σ̂_2, then at least one pure strategy that coarsely approximates σ_2 is in Σ̂_2 as well.
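
A toy sketch of generating simple variants by uniform relabeling, with a pure strategy represented as a finite table from own-action histories to actions; this representation, and applying the relabeling to both the prescribed action and the histories it conditions on, are assumptions for illustration.

```python
from itertools import permutations

def simple_variants(strategy_table, actions):
    """Strategies obtained from a pure strategy by one uniform relabeling of
    the action set.  strategy_table: dict tuple-of-own-past-actions -> action."""
    variants = []
    for perm in permutations(actions):
        relabel = dict(zip(actions, perm))
        variants.append({tuple(relabel[a] for a in hist): relabel[action]
                         for hist, action in strategy_table.items()})
    return variants

# A tiny strategy table: start with C, then repeat your own last action.
table = {(): "C", ("C",): "C", ("D",): "D"}
print(simple_variants(table, ["C", "D"]))
```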
