  1. Efficient Learning Equilibrium
     R. Brafman and M. Tennenholtz
     Presented by Neal Gupta, CS 286r, March 20, 2006

  2. Efficient Learning Equilibrium
     • Infinitely repeated stage games M with single-stage matrix G
     • Individual Rationality
     • Efficiency
       – Unilateral deviation is irrational after a polynomial number of stages
       – Without deviation, after a polynomial number of steps the expected payoff is within ε of a Nash equilibrium (hence also within ε of the minimax payoffs)
     • A set of algorithms that meets the above conditions with respect to a specific class of games is considered to be an ELE

  3. Motivation
     • Get objective convergence rather than convergence in beliefs
       – Works with (relatively) patient agents
     • Exploit the richness of results from the Folk Theorems

  4. Assumptions about Agents
     • Agents care about average reward, as well as how quickly this is reached
     • No discounting
     • Agents have NO PRIOR about the payoffs of the game
     • Agents may or may not be able to observe payoffs
       – Perfect monitoring (main results proved)
       – Weak imperfect monitoring
         ∗ observe the other player's actions, but not payoffs
       – Strict imperfect monitoring

  5. Formal Definition of Learning Equilibrium
     • Define U_i(M, σ_1, σ_2, T), given repeated game M, to be the expected average reward for player i after T periods when the players choose the strategy (policy) profile σ = (σ_1, σ_2) (a simulation sketch follows this slide)
     • Then let U_i(M, σ_1, σ_2) denote the average reward in the limit: U_i(M, σ_1, σ_2) = liminf_{T→∞} U_i(M, σ_1, σ_2, T)
     • A strategy (policy) profile σ = (σ_1, σ_2) is a learning equilibrium (LE) if, for all repeated games M, neither player can benefit from a unilateral deviation:
       ∀σ'_1: U_1(M, σ'_1, σ_2) ≤ U_1(M, σ_1, σ_2), and
       ∀σ'_2: U_2(M, σ_1, σ'_2) ≤ U_2(M, σ_1, σ_2)
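A minimal sketch (not from the paper) of estimating U_i(M, σ_1, σ_2, T) by simulation. The stage matrix, the stationary mixed strategies, and the helper name avg_reward are illustrative assumptions; general policies may condition on history.

```python
import numpy as np

def avg_reward(payoffs, sigma1, sigma2, T, n_runs=1000, seed=0):
    """Monte Carlo estimate of U_i(M, sigma_1, sigma_2, T) for i = 1, 2:
    the expected average reward after T periods of the repeated game whose
    stage matrix is `payoffs`, where payoffs[a1, a2, i] is player (i+1)'s payoff.
    sigma1, sigma2 are stationary mixed strategies -- a simplifying assumption."""
    rng = np.random.default_rng(seed)
    k1, k2, _ = payoffs.shape
    totals = np.zeros(2)
    for _ in range(n_runs):
        a1 = rng.choice(k1, size=T, p=sigma1)   # player 1's actions over T periods
        a2 = rng.choice(k2, size=T, p=sigma2)   # player 2's actions over T periods
        totals += payoffs[a1, a2].mean(axis=0)  # average stage payoff of this run
    return totals / n_runs

# Example: the 2x2 game M_1 used later on slide 12, with player 2 always playing Right.
G = np.array([[[6, 0], [0, 100]],
              [[5, -100], [1, 500]]])
print(avg_reward(G, sigma1=[0.5, 0.5], sigma2=[0.0, 1.0], T=50))
```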

  6. Efficient Learning Equilibrium
     • Add requirements on the speed of convergence to the basic definition of learning equilibrium
     • ∀ε > 0 and 0 < δ < 1, there exists some T = poly(1/ε, 1/δ, k) such that ∀t ≥ T: if player 1 switches from strategy σ_1 to σ'_1 in iteration l, then U_1(M, σ'_1, σ_2, l + t) ≤ U_1(M, σ_1, σ_2, l + t) + ε, with probability of failure bounded above by δ

  7. Efficient Learning Equilibrium (cont'd)
     • To provide an IR constraint, the authors require that the utilities are at least those of some Nash equilibrium (which in turn are at least the minimax payoffs)
     • Note the limit is dropped: if there exists a policy that is beneficial in polynomial time but disastrous in the long term, we do not have an ELE
       – Toy example: the opponent has a trigger strategy with exponential delay

  8. Theorem 1 (BT, 7)
     • There exists an ELE for any perfect monitoring setting
     • Three stages (if no deviation occurs; see the sketch after this slide)
       – Exploration
       – Offline computation of an equilibrium
       – Play of that equilibrium
     • The proof demonstrates that deviation can be punished during the exploration stage
       – No subgame perfection is demonstrated
       – The folk theorems suggest more general results (Pareto-ELE)
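A schematic sketch of the three-stage structure, not the authors' pseudocode. The helper names (act, observe, compute_nash, minimax_punish) and the deterministic-payoff assumption in the exploration loop are mine.

```python
import random

def ele_protocol(my_id, k1, k2, act, observe, compute_nash, minimax_punish):
    """Three-stage ELE behaviour under perfect monitoring (schematic).
    act(a) plays one of this agent's actions; observe() returns the last joint
    action and both payoffs (perfect monitoring).  compute_nash and
    minimax_punish are placeholders for an equilibrium solver and the
    punishment routine sketched on the next slides."""
    learned = {}

    # Stage 1: exploration -- walk the joint action space in a fixed, commonly
    # known order (assumes deterministic stage payoffs, so one visit per entry).
    for a1 in range(k1):
        for a2 in range(k2):
            act(a1 if my_id == 1 else a2)
            joint, rewards = observe()
            if joint != (a1, a2):                 # the other agent left the schedule
                return minimax_punish(learned, my_id)
            learned[(a1, a2)] = rewards

    # Stage 2: offline computation of a Nash equilibrium of the learned game.
    sigma1, sigma2 = compute_nash(learned, k1, k2)
    my_sigma = sigma1 if my_id == 1 else sigma2

    # Stage 3: play the computed equilibrium from then on.
    while True:
        a = random.choices(range(len(my_sigma)), weights=my_sigma)[0]
        act(a)
        observe()
```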

  9. Theorem 1 (BT, 7) cont'd
     • If player 2 deviates during exploration, player 1 minimaxes player 2 using the payoffs learned so far (player 1 may continue exploring to learn the payoffs needed to minimax player 2; an LP sketch for the maximin strategy follows this slide)
       – Note that if player 2 starts exploring again, he/she might do better, but will never do better than the maximin value of the game
     • Lemma 1: Chernoff bounds handle the case where the maximin strategy requires randomizing
     • Lemma 2: Let R_max be the maximal payoff of the game; then there exists some z polynomial in R_max, k, 1/ε, and 1/δ such that if player 1 punishes player 2 as prescribed for z steps, then either player 1 learns a new entry of the game matrix or drives player 2 to the desired minimax payoff with high probability
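The punishment step relies on the punisher's maximin strategy against the deviator. Below is the standard linear-programming construction for it, not code from the paper; the use of scipy and the function name minimax_against are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_against(B):
    """Punisher's maximin LP: choose a mixed strategy x over the punisher's
    actions (rows) that minimises the deviator's best-response payoff, where
    B[i, j] is the deviator's payoff when the punisher plays i and the deviator
    plays j.  Returns (punishment strategy x, minimax value v)."""
    B = np.asarray(B, dtype=float)
    k1, k2 = B.shape
    # Variables: x_0 .. x_{k1-1}, v.  Minimise v.
    c = np.zeros(k1 + 1)
    c[-1] = 1.0
    # For every deviator action j: sum_i B[i, j] * x_i - v <= 0
    A_ub = np.hstack([B.T, -np.ones((k2, 1))])
    b_ub = np.zeros(k2)
    # x must be a probability distribution.
    A_eq = np.hstack([np.ones((1, k1)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * k1 + [(None, None)]      # v may be negative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:k1], res.x[-1]

# Example with player 2's payoffs from the M_1 game on slide 12 (rows = player 1's actions).
x, v = minimax_against([[0, 100], [-100, 500]])
print(x, v)   # punishing strategy and the value player 2 is held down to
```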

  10. Theorem 1 (BT, 7) cont'd
      • Given that k is an upper bound on the number of actions for each player, player 1 can learn at most k² − 1 new entries after a deviation
      • Thus, the probabilistic minimax value can be reached in a polynomial number of moves
      • With a second Chernoff bound, the authors conclude that the actual payoff will be within ε of the expected minimax value with probability 1 − δ, with only a polynomial (linear) increase in the number of trials as 1/δ and 1/ε grow (a sample-size helper follows this slide)
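A small helper, not from the paper, for the kind of Chernoff/Hoeffding argument invoked here: how many i.i.d. trials suffice for the empirical average of rewards in [0, R_max] to lie within ε of its expectation with probability at least 1 − δ.

```python
import math

def hoeffding_trials(eps, delta, r_max):
    """Number of i.i.d. trials n such that the empirical mean of rewards in
    [0, r_max] deviates from its expectation by more than eps with probability
    at most delta (two-sided Hoeffding bound):
        2 * exp(-2 * n * eps**2 / r_max**2) <= delta."""
    return math.ceil((r_max ** 2) / (2 * eps ** 2) * math.log(2 / delta))

# Example: rewards bounded by 500 (the largest entry in M_1 on slide 12).
print(hoeffding_trials(eps=1.0, delta=0.05, r_max=500))
```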

  11. Weaknesses
      • Proofs are restricted to the 2-player setting
      • The trigger strategies used are very far from subgame perfection
      • Agents that care about average payoffs are a significant deviation from discounting agents
      • How to choose among multiple equilibria??
      • The exhibited learning algorithms seem naïve
        – Explore the entire state space, then simply compute an equilibrium

  12. Theorem 2 (BT, 7)
      • An ELE does not always exist in the imperfect monitoring setting.

              M_1  =  [ (6, 0)      (0, 100) ]
                      [ (5, -100)   (1, 500) ]

              M'_1 =  [ (6, 1)      (0, 9)   ]
                      [ (5, 11)     (1, 10)  ]

      • In both M_1 and M'_1, player 1 has the same payoffs. Both games have a unique Nash equilibrium, and the equilibrium of M_1 must be played for an ELE.
      • Player 2 benefits if he plays as if he were in M'_1

  13. Theorem 2 (BT, 7) cont'd
      • This contradicts the definition of ELE: player 2 immediately and permanently benefits from a unilateral deviation
        – In this example it seems that player 2 must know player 1's payoffs, but not vice versa
      • (Matrices M_1 and M'_1 as on slide 12)
      • Player 2 could simply pretend that playing Right is a dominant strategy (see the check below)
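A small check, using the matrices as reconstructed on slide 12 (so the exact numbers are an assumption), of the asymmetry the slide points to: Right is strictly dominant for player 2 in M_1 but not in M'_1, while player 1's payoffs coincide in both games, so under imperfect monitoring player 1 alone cannot tell them apart.

```python
import numpy as np

# payoffs[a1, a2] = (player 1 payoff, player 2 payoff); entries as reconstructed on slide 12
M1 = np.array([[[6, 0], [0, 100]],
               [[5, -100], [1, 500]]])
M1p = np.array([[[6, 1], [0, 9]],
                [[5, 11], [1, 10]]])

def right_dominant_for_p2(game):
    """True if column 1 (Right) strictly dominates column 0 (Left) for player 2."""
    p2 = game[:, :, 1]
    return bool(np.all(p2[:, 1] > p2[:, 0]))

print(np.array_equal(M1[:, :, 0], M1p[:, :, 0]))              # player 1's payoffs coincide -> True
print(right_dominant_for_p2(M1), right_dominant_for_p2(M1p))  # True, False
```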

  14. Theorem 3 (BT, 9)
      • There exists an ELE for the class of common-interest games under strict imperfect monitoring
        – Agents know their own action and payoff, but neither the action nor the payoff of the opponent
      • Proof outline: proceed as in Thm. 1, but explore by independently randomizing over actions until both agents are confident that all actions have been seen (see the sketch below)
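A sketch of the randomized-exploration idea with an illustrative, coupon-collector style stopping rule; the concrete bound, the helper names (act, my_payoff), and the assumption of a unique maximal payoff for the coordination step are mine, not the paper's.

```python
import math
import random

def exploration_length(k, delta):
    """Illustrative bound: number of rounds T such that, if both agents draw
    uniformly from k actions each round, every one of the k*k joint actions
    occurs at least once with probability >= 1 - delta:
        P(some joint action missed) <= k**2 * (1 - 1/k**2)**T <= delta."""
    p_miss = 1.0 - 1.0 / (k * k)
    return math.ceil(math.log(delta / (k * k)) / math.log(p_miss))

def explore_common_interest(k, delta, act, my_payoff):
    """Randomize independently and remember the best payoff seen and the own
    action that produced it; in a common-interest game with a unique maximum,
    both agents coordinate by replaying their own action from that outcome."""
    best = (-math.inf, None)
    for _ in range(exploration_length(k, delta)):
        a = random.randrange(k)
        r = my_payoff(act(a))          # strict imperfect monitoring: own payoff only
        best = max(best, (r, a))
    return best[1]                      # action to repeat in the play phase
```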

  15. Theorem 3 (BT, 9) cont'd
      • If both agents play the action that led to the highest reward they saw, they are guaranteed to coordinate
      • Concerns
        – Common interest implies the players DO know the opponent's payoffs
        – If players don't know the number of actions of the opponent, how do they decide when to stop exploring?
        – If players do know the number of actions of the opponent, why not use the result of Thm. 1 directly?

  16. Pareto ELE
      • Exploits repeated-game strategies to allow a wider range of payoffs
        – Differs in that it allows side payments
        – Now within ε of an economically efficient outcome rather than of a NE
      • Given the efficient joint action (P_1(G), P_2(G)), with value PV_i(M) for agent i, we now require U_1(M, σ_1, σ_2, t) + U_2(M, σ_1, σ_2, t) ≥ PV_1(M) + PV_2(M) − ε (a sketch of computing the efficient joint action follows this slide)
      • Same condition as before: with probability 1 − δ, the gain from deviation is less than ε after polynomial time
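A minimal sketch, assuming the efficient joint action is the one maximising the sum of the two players' payoffs (the natural notion when side payments are allowed); the function name and data layout are illustrative.

```python
import numpy as np

def pareto_joint_action(payoffs):
    """Return the joint action maximising the sum of the two players' payoffs,
    together with the per-player values PV_1, PV_2 at that joint action.
    payoffs: array of shape (k1, k2, 2)."""
    payoffs = np.asarray(payoffs, dtype=float)
    total = payoffs.sum(axis=2)                    # utilitarian welfare of each joint action
    a1, a2 = np.unravel_index(np.argmax(total), total.shape)
    pv1, pv2 = payoffs[a1, a2]
    return (a1, a2), (pv1, pv2)

# Example with the M_1 matrix from slide 12: the welfare-maximising cell is (Down, Right).
print(pareto_joint_action([[[6, 0], [0, 100]],
                           [[5, -100], [1, 500]]]))
```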

  17. Theorem 4
      • There exists a Pareto ELE for any perfect monitoring setting
      • Proof outline: proceed as in the exploration phase of the regular ELE
      • Pay a player a side payment if she receives less than her probabilistic maximin value (one possible transfer rule is sketched below)
      • By Pareto optimality, both players now exceed their maximin values
      • Use the same punishment approaches as before
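One way to realise the side payment mentioned here; the paper may specify the transfer differently, so treat this as an assumption. It relies on the property argued on this slide: the Pareto payoff sum is at least the sum of the two maximin values.

```python
def side_payment(u1, u2, maximin1, maximin2):
    """Transfer from the better-off player so that neither falls below her
    probabilistic maximin value; assumes u1 + u2 >= maximin1 + maximin2."""
    t = max(0.0, maximin1 - u1) - max(0.0, maximin2 - u2)
    return u1 + t, u2 - t

# Example: player 1 falls short of her maximin by 2, player 2 covers it.
print(side_payment(u1=3.0, u2=9.0, maximin1=5.0, maximin2=4.0))   # (5.0, 7.0)
```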

  18. Stochastic Games
      • Players observe payoffs and new states, and must build a model of the probabilistic transitions (see the sketch below)
      • Nash equilibrium results for average payoffs are hard to prove, so the authors work in the Pareto-ELE setting with side payments
      • Ergodicity assumption: every state is reachable from every other state
        – Combined with the finite number of states, this implies we can expect to explore the entire game matrix in finite time
      • Results are polynomial in 1/δ, 1/ε, and T_mix
        – T_mix denotes the ε-return mixing time
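A minimal sketch of the model-building step in the first bullet: maximum-likelihood (frequency) estimates of the transition probabilities from observed experience. The data layout is an assumption, not the paper's notation.

```python
from collections import Counter, defaultdict

def estimate_transition_model(trajectory):
    """Build an empirical transition model of the stochastic game.
    `trajectory` is a list of (state, joint_action, next_state) tuples; the
    result maps (state, joint_action) to estimated next-state probabilities."""
    counts = defaultdict(Counter)
    for s, a, s_next in trajectory:
        counts[(s, a)][s_next] += 1
    model = {}
    for key, ctr in counts.items():
        total = sum(ctr.values())
        model[key] = {s_next: n / total for s_next, n in ctr.items()}
    return model

# Example with a toy two-state trajectory (illustrative data only).
traj = [("s0", ("a", "b"), "s1"), ("s1", ("a", "a"), "s0"), ("s0", ("a", "b"), "s0")]
print(estimate_transition_model(traj))
```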

  19. ε-Mixing Time
      • Informally, the time it takes for the expected average reward to approach the infinite-horizon average reward, for all states s
      • T_mix is the minimum t such that ∀s ∈ S: U(s, σ_1, σ_2, t) > U(s, σ_1, σ_2) − ε
      • How long is this, given a "reasonable" transition function? (see the estimation sketch below)
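A small sketch of checking the definition directly. The evaluation routines U_t and U_inf (e.g. by simulation or by solving the induced Markov chain) are placeholders, i.e. assumptions of this sketch.

```python
def epsilon_mixing_time(states, U_t, U_inf, eps, t_max=100000):
    """Smallest t such that, for every start state s, the expected t-step average
    reward U_t(s, t) is within eps of the long-run average U_inf(s)."""
    for t in range(1, t_max + 1):
        if all(U_t(s, t) > U_inf(s) - eps for s in states):
            return t
    raise ValueError("no t <= t_max satisfied the epsilon-return mixing condition")
```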

  20. Theorem 6 (BT, 13)
      • A Pareto-ELE in a stochastic game exists if (1) the agents have perfect monitoring and (2) T_mix is known
      • The proof is similar to the previous ones; it requires the E^3 approach and results from "Learning to Coordinate Efficiently" (BT, [5])

  21. Extensions
      • Move towards credible threats, if not full SPE
        – Automated agents can implement unrealistic threats
      • More results in the case of imperfect monitoring
        – May require probabilistic reasoning or conditional priors rather than just learning the entire game matrix
      • A model of Pareto-ELE based on cycling rather than side payments?

  22. Conclusions
      • Pros:
        – We get objective convergence, not convergence in beliefs
        – Punishment is relatively quick, even without discounting
      • Cons:
        – NO priors!
        – Discounting seems a more realistic model of behavior
        – A hard time horizon for punishment may be required; otherwise agents will try to delay costly punishment forever
        – Trigger strategies are far from SPE
