  1. Efficient Learning Equilibrium
     R. Brafman and M. Tennenholtz
     Presented by Neal Gupta, CS 286r, March 20, 2006

  2. Efficient Learning Equilibrium
     • Infinitely repeated stage games M with single-stage matrix G
     • Individual Rationality
     • Efficiency
       – Unilateral deviation is irrational after a polynomial number of stages
       – Without deviation, after a polynomial number of steps the expected payoff is within ε of a Nash equilibrium (hence also within ε of the minimax payoffs)
     • A set of algorithms that meets the above conditions with respect to a specific class of games is considered to be an ELE

  3. Motivation
     • Get objective convergence rather than convergence in beliefs
       – Works with (relatively) patient agents
     • Exploit the richness of results from the Folk Theorems

  4. Assumptions about Agents
     • Agents care about average reward, as well as how quickly this is reached
     • No discounting
     • Agents have NO PRIOR about the payoffs of the game
     • Agents may or may not be able to observe payoffs
       – Perfect monitoring (main results proved)
       – Weak imperfect monitoring
         ∗ observe the other player's actions, but not payoffs
       – Strict imperfect monitoring

  5. Formal Definition of Learning Equilibrium
     • Define U_i(M, σ_1, σ_2, T), given repeated game M, to be the expected average reward for player i after T periods when the players choose the strategy (policy) profile σ = (σ_1, σ_2) (a simulation sketch follows this slide)
     • Then let U_i(M, σ_1, σ_2) denote the average reward in the limit: U_i(M, σ_1, σ_2) = liminf_{T→∞} U_i(M, σ_1, σ_2, T)
     • A strategy (policy) profile σ = (σ_1, σ_2) is a learning equilibrium (LE) if, for all repeated games M, neither player can benefit from a unilateral deviation:
       ∀σ'_1: U_1(M, σ'_1, σ_2) ≤ U_1(M, σ_1, σ_2), and
       ∀σ'_2: U_2(M, σ_1, σ'_2) ≤ U_2(M, σ_1, σ_2)
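A minimal sketch (not from the paper) of estimating U_i(M, σ_1, σ_2, T) by simulation. The stage matrix, the stationary mixed strategies, and the helper name avg_reward are illustrative assumptions; general policies may condition on history.

```python
import numpy as np

def avg_reward(payoffs, sigma1, sigma2, T, n_runs=1000, seed=0):
    """Monte Carlo estimate of U_i(M, sigma_1, sigma_2, T) for i = 1, 2:
    the expected average reward after T periods of the repeated game whose
    stage matrix is `payoffs`, where payoffs[a1, a2, i] is player (i+1)'s payoff.
    sigma1, sigma2 are stationary mixed strategies -- a simplifying assumption."""
    rng = np.random.default_rng(seed)
    k1, k2, _ = payoffs.shape
    totals = np.zeros(2)
    for _ in range(n_runs):
        a1 = rng.choice(k1, size=T, p=sigma1)   # player 1's actions over T periods
        a2 = rng.choice(k2, size=T, p=sigma2)   # player 2's actions over T periods
        totals += payoffs[a1, a2].mean(axis=0)  # average stage payoff of this run
    return totals / n_runs

# Example: the 2x2 game M_1 used later on slide 12, with player 2 always playing Right.
G = np.array([[[6, 0], [0, 100]],
              [[5, -100], [1, 500]]])
print(avg_reward(G, sigma1=[0.5, 0.5], sigma2=[0.0, 1.0], T=50))
```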

  6. Efficient Learning Equilibrium
     • Add requirements on the speed of convergence to the basic definition of learning equilibrium
     • ∀ε > 0 and 0 < δ < 1, there exists some T = poly(1/ε, 1/δ, k) such that ∀t ≥ T: if player 1 switches from strategy σ_1 to σ'_1 in iteration l, then U_1(M, σ'_1, σ_2, l + t) ≤ U_1(M, σ_1, σ_2, l + t) + ε, with probability of failure bounded above by δ

  7. Efficient Learning Equilibrium (cont'd)
     • To provide an IR constraint, the authors require that the utilities are at least those of some Nash equilibrium (which in turn are at least the minimax payoffs)
     • Note the limit is dropped: if there exists a policy that is beneficial in polynomial time but disastrous in the long term, we do not have an ELE
       – Toy example: the opponent has a trigger strategy with exponential delay

  8. Theorem 1 (BT, 7)
     • There exists an ELE for any perfect monitoring setting
     • Three stages (if no deviation occurs; see the sketch after this slide)
       – Exploration
       – Offline computation of an equilibrium
       – Play of that equilibrium
     • The proof demonstrates that deviation can be punished during the exploration stage
       – No subgame perfection is demonstrated
       – The folk theorems suggest more general results (Pareto-ELE)
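A schematic sketch of the three-stage structure, not the authors' pseudocode. The helper names (act, observe, compute_nash, minimax_punish) and the deterministic-payoff assumption in the exploration loop are mine.

```python
import random

def ele_protocol(my_id, k1, k2, act, observe, compute_nash, minimax_punish):
    """Three-stage ELE behaviour under perfect monitoring (schematic).
    act(a) plays one of this agent's actions; observe() returns the last joint
    action and both payoffs (perfect monitoring).  compute_nash and
    minimax_punish are placeholders for an equilibrium solver and the
    punishment routine sketched on the next slides."""
    learned = {}

    # Stage 1: exploration -- walk the joint action space in a fixed, commonly
    # known order (assumes deterministic stage payoffs, so one visit per entry).
    for a1 in range(k1):
        for a2 in range(k2):
            act(a1 if my_id == 1 else a2)
            joint, rewards = observe()
            if joint != (a1, a2):                 # the other agent left the schedule
                return minimax_punish(learned, my_id)
            learned[(a1, a2)] = rewards

    # Stage 2: offline computation of a Nash equilibrium of the learned game.
    sigma1, sigma2 = compute_nash(learned, k1, k2)
    my_sigma = sigma1 if my_id == 1 else sigma2

    # Stage 3: play the computed equilibrium from then on.
    while True:
        a = random.choices(range(len(my_sigma)), weights=my_sigma)[0]
        act(a)
        observe()
```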

  9. Theorem 1 (BT, 7) cont'd
     • If player 2 deviates during exploration, player 1 minimaxes player 2 using the payoffs learned so far (player 1 may continue exploring to learn the payoffs needed to minimax player 2; an LP sketch for the maximin strategy follows this slide)
       – Note that if player 2 starts exploring again, he/she might do better, but will never do better than the maximin value of the game
     • Lemma 1: Chernoff bounds handle the case where the maximin strategy requires randomizing
     • Lemma 2: Let R_max be the maximal payoff of the game; then there exists some z polynomial in R_max, k, 1/ε, and 1/δ such that if player 1 punishes player 2 as prescribed for z steps, then either player 1 learns a new entry of the game matrix or drives player 2 to the desired minimax payoff with high probability
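The punishment step relies on the punisher's maximin strategy against the deviator. Below is the standard linear-programming construction for it, not code from the paper; the use of scipy and the function name minimax_against are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_against(B):
    """Punisher's maximin LP: choose a mixed strategy x over the punisher's
    actions (rows) that minimises the deviator's best-response payoff, where
    B[i, j] is the deviator's payoff when the punisher plays i and the deviator
    plays j.  Returns (punishment strategy x, minimax value v)."""
    B = np.asarray(B, dtype=float)
    k1, k2 = B.shape
    # Variables: x_0 .. x_{k1-1}, v.  Minimise v.
    c = np.zeros(k1 + 1)
    c[-1] = 1.0
    # For every deviator action j: sum_i B[i, j] * x_i - v <= 0
    A_ub = np.hstack([B.T, -np.ones((k2, 1))])
    b_ub = np.zeros(k2)
    # x must be a probability distribution.
    A_eq = np.hstack([np.ones((1, k1)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * k1 + [(None, None)]      # v may be negative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:k1], res.x[-1]

# Example with player 2's payoffs from the M_1 game on slide 12 (rows = player 1's actions).
x, v = minimax_against([[0, 100], [-100, 500]])
print(x, v)   # punishing strategy and the value player 2 is held down to
```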

  10. Theorem 1 (BT, 7) cont'd
      • Given that k is an upper bound on the number of actions for each player, player 1 can learn at most k² − 1 new entries after a deviation
      • Thus, the probabilistic minimax value can be reached in a polynomial number of moves
      • With a second Chernoff bound, the authors conclude that the actual payoff will be within ε of the expected minimax value with probability 1 − δ, with only a polynomial (linear) increase in the number of trials as 1/δ and 1/ε grow (a sample-size helper follows this slide)
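A small helper, not from the paper, for the kind of Chernoff/Hoeffding argument invoked here: how many i.i.d. trials suffice for the empirical average of rewards in [0, R_max] to lie within ε of its expectation with probability at least 1 − δ.

```python
import math

def hoeffding_trials(eps, delta, r_max):
    """Number of i.i.d. trials n such that the empirical mean of rewards in
    [0, r_max] deviates from its expectation by more than eps with probability
    at most delta (two-sided Hoeffding bound):
        2 * exp(-2 * n * eps**2 / r_max**2) <= delta."""
    return math.ceil((r_max ** 2) / (2 * eps ** 2) * math.log(2 / delta))

# Example: rewards bounded by 500 (the largest entry in M_1 on slide 12).
print(hoeffding_trials(eps=1.0, delta=0.05, r_max=500))
```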

  11. Weaknesses
      • Proofs are restricted to the 2-player setting
      • The trigger strategies used are very far from subgame perfection
      • Agents that care about average payoffs are a significant deviation from discounting agents
      • How to choose among multiple equilibria??
      • The exhibited learning algorithms seem naïve
        – Explore the entire state space, then simply compute an equilibrium

  12. Theorem 2 (BT, 7)
      • An ELE does not always exist in the imperfect monitoring setting.

              M_1  =  [ (6, 0)      (0, 100) ]
                      [ (5, -100)   (1, 500) ]

              M'_1 =  [ (6, 1)      (0, 9)   ]
                      [ (5, 11)     (1, 10)  ]

      • In both M_1 and M'_1, player 1 has the same payoffs. Both games have a unique Nash equilibrium, and the equilibrium of M_1 must be played for an ELE.
      • Player 2 benefits if he plays as if he were in M'_1

  13. Theorem 2 (BT, 7) cont'd
      • This contradicts the definition of ELE: player 2 immediately and permanently benefits from a unilateral deviation
        – In this example it seems that player 2 must know player 1's payoffs, but not vice versa
      • (Matrices M_1 and M'_1 as on slide 12)
      • Player 2 could simply pretend that playing Right is a dominant strategy (see the check below)
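A small check, using the matrices as reconstructed on slide 12 (so the exact numbers are an assumption), of the asymmetry the slide points to: Right is strictly dominant for player 2 in M_1 but not in M'_1, while player 1's payoffs coincide in both games, so under imperfect monitoring player 1 alone cannot tell them apart.

```python
import numpy as np

# payoffs[a1, a2] = (player 1 payoff, player 2 payoff); entries as reconstructed on slide 12
M1 = np.array([[[6, 0], [0, 100]],
               [[5, -100], [1, 500]]])
M1p = np.array([[[6, 1], [0, 9]],
                [[5, 11], [1, 10]]])

def right_dominant_for_p2(game):
    """True if column 1 (Right) strictly dominates column 0 (Left) for player 2."""
    p2 = game[:, :, 1]
    return bool(np.all(p2[:, 1] > p2[:, 0]))

print(np.array_equal(M1[:, :, 0], M1p[:, :, 0]))              # player 1's payoffs coincide -> True
print(right_dominant_for_p2(M1), right_dominant_for_p2(M1p))  # True, False
```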

  14. Theorem 3 (BT, 9)
      • There exists an ELE for the class of common-interest games under strict imperfect monitoring
        – Agents know their own action and payoff, but neither the action nor the payoff of the opponent
      • Proof outline: proceed as in Thm. 1, but explore by independently randomizing over actions until both agents are confident that all actions have been seen (see the sketch below)
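A sketch of the randomized-exploration idea with an illustrative, coupon-collector style stopping rule; the concrete bound, the helper names (act, my_payoff), and the assumption of a unique maximal payoff for the coordination step are mine, not the paper's.

```python
import math
import random

def exploration_length(k, delta):
    """Illustrative bound: number of rounds T such that, if both agents draw
    uniformly from k actions each round, every one of the k*k joint actions
    occurs at least once with probability >= 1 - delta:
        P(some joint action missed) <= k**2 * (1 - 1/k**2)**T <= delta."""
    p_miss = 1.0 - 1.0 / (k * k)
    return math.ceil(math.log(delta / (k * k)) / math.log(p_miss))

def explore_common_interest(k, delta, act, my_payoff):
    """Randomize independently and remember the best payoff seen and the own
    action that produced it; in a common-interest game with a unique maximum,
    both agents coordinate by replaying their own action from that outcome."""
    best = (-math.inf, None)
    for _ in range(exploration_length(k, delta)):
        a = random.randrange(k)
        r = my_payoff(act(a))          # strict imperfect monitoring: own payoff only
        best = max(best, (r, a))
    return best[1]                      # action to repeat in the play phase
```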

  15. Theorem 3 (BT, 9) cont'd
      • If both agents play the action that led to the highest reward they saw, they are guaranteed to coordinate
      • Concerns
        – Common interest implies the players DO know the opponent's payoffs
        – If players don't know the number of actions of the opponent, how do they decide when to stop exploring?
        – If players do know the number of actions of the opponent, why not use the result of Thm. 1 directly?

  16. Pareto ELE
      • Exploits repeated-game strategies to allow a wider range of payoffs
        – Differs in that it allows side payments
        – Now within ε of an economically efficient outcome rather than of a NE
      • Given the efficient joint action (P_1(G), P_2(G)), with value PV_i(M) for agent i, we now require U_1(M, σ_1, σ_2, t) + U_2(M, σ_1, σ_2, t) ≥ PV_1(M) + PV_2(M) − ε (a sketch of computing the efficient joint action follows this slide)
      • Same condition as before: with probability 1 − δ, the gain from deviation is less than ε after polynomial time
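A minimal sketch, assuming the efficient joint action is the one maximising the sum of the two players' payoffs (the natural notion when side payments are allowed); the function name and data layout are illustrative.

```python
import numpy as np

def pareto_joint_action(payoffs):
    """Return the joint action maximising the sum of the two players' payoffs,
    together with the per-player values PV_1, PV_2 at that joint action.
    payoffs: array of shape (k1, k2, 2)."""
    payoffs = np.asarray(payoffs, dtype=float)
    total = payoffs.sum(axis=2)                    # utilitarian welfare of each joint action
    a1, a2 = np.unravel_index(np.argmax(total), total.shape)
    pv1, pv2 = payoffs[a1, a2]
    return (a1, a2), (pv1, pv2)

# Example with the M_1 matrix from slide 12: the welfare-maximising cell is (Down, Right).
print(pareto_joint_action([[[6, 0], [0, 100]],
                           [[5, -100], [1, 500]]]))
```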

  17. Theorem 4
      • There exists a Pareto ELE for any perfect monitoring setting
      • Proof outline: proceed as in the exploration phase of the regular ELE
      • Pay a player a side payment if she receives less than her probabilistic maximin value (one possible transfer rule is sketched below)
      • By Pareto optimality, both players now exceed their maximin values
      • Use the same punishment approaches as before
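One way to realise the side payment mentioned here; the paper may specify the transfer differently, so treat this as an assumption. It relies on the property argued on this slide: the Pareto payoff sum is at least the sum of the two maximin values.

```python
def side_payment(u1, u2, maximin1, maximin2):
    """Transfer from the better-off player so that neither falls below her
    probabilistic maximin value; assumes u1 + u2 >= maximin1 + maximin2."""
    t = max(0.0, maximin1 - u1) - max(0.0, maximin2 - u2)
    return u1 + t, u2 - t

# Example: player 1 falls short of her maximin by 2, player 2 covers it.
print(side_payment(u1=3.0, u2=9.0, maximin1=5.0, maximin2=4.0))   # (5.0, 7.0)
```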

  18. Stochastic Games
      • Players observe payoffs and new states, and must build a model of the probabilistic transitions (see the sketch below)
      • Nash equilibrium results for average payoffs are hard to prove, so the authors work in the Pareto-ELE setting with side payments
      • Ergodicity assumption: every state is reachable from every other state
        – Combined with the finite number of states, this implies we can expect to explore the entire game matrix in finite time
      • Results are polynomial in 1/δ, 1/ε, and T_mix
        – T_mix denotes the ε-return mixing time
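A minimal sketch of the model-building step in the first bullet: maximum-likelihood (frequency) estimates of the transition probabilities from observed experience. The data layout is an assumption, not the paper's notation.

```python
from collections import Counter, defaultdict

def estimate_transition_model(trajectory):
    """Build an empirical transition model of the stochastic game.
    `trajectory` is a list of (state, joint_action, next_state) tuples; the
    result maps (state, joint_action) to estimated next-state probabilities."""
    counts = defaultdict(Counter)
    for s, a, s_next in trajectory:
        counts[(s, a)][s_next] += 1
    model = {}
    for key, ctr in counts.items():
        total = sum(ctr.values())
        model[key] = {s_next: n / total for s_next, n in ctr.items()}
    return model

# Example with a toy two-state trajectory (illustrative data only).
traj = [("s0", ("a", "b"), "s1"), ("s1", ("a", "a"), "s0"), ("s0", ("a", "b"), "s0")]
print(estimate_transition_model(traj))
```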

  19. ε-Mixing Time
      • Informally, the time it takes for the expected average reward to approach the infinite-horizon average reward, for all states s
      • T_mix is the minimum t such that ∀s ∈ S: U(s, σ_1, σ_2, t) > U(s, σ_1, σ_2) − ε
      • How long is this, given a "reasonable" transition function? (see the estimation sketch below)
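A small sketch of checking the definition directly. The evaluation routines U_t and U_inf (e.g. by simulation or by solving the induced Markov chain) are placeholders, i.e. assumptions of this sketch.

```python
def epsilon_mixing_time(states, U_t, U_inf, eps, t_max=100000):
    """Smallest t such that, for every start state s, the expected t-step average
    reward U_t(s, t) is within eps of the long-run average U_inf(s)."""
    for t in range(1, t_max + 1):
        if all(U_t(s, t) > U_inf(s) - eps for s in states):
            return t
    raise ValueError("no t <= t_max satisfied the epsilon-return mixing condition")
```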

  20. Theorem 6 (BT, 13)
      • A Pareto-ELE in a stochastic game exists if (1) the agents have perfect monitoring and (2) T_mix is known
      • The proof is similar to the previous ones; it requires the E^3 approach and results from "Learning to Coordinate Efficiently" (BT, [5])

  21. Extensions
      • Move towards credible threats, if not full SPE
        – Automated agents can implement unrealistic threats
      • More results in the case of imperfect monitoring
        – May require probabilistic reasoning or conditional priors rather than just learning the entire game matrix
      • A model of Pareto-ELE based on cycling rather than side payments?

  22. Conclusions
      • Pros:
        – We get objective convergence, not convergence in beliefs
        – Punishment is relatively quick, even without discounting
      • Cons:
        – NO priors!
        – Discounting seems a more realistic model of behavior
        – A hard time horizon for punishment may be required; otherwise agents will try to delay costly punishment forever
        – Trigger strategies are far from SPE
