 
Multi-agent learning: Gradient Dynamics
Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.
Last modified on March 1st, 2012 at 10:07. Slide 1
Gradient dynamics: motivation
• Every player “identifies itself” with a single mixed strategy.
• Like in fictitious play, players project each other on a mixed strategy.
• CKR is in order. CKR (common knowledge of rationality, cf. Hargreaves Heap & Varoufakis, 2004) implies that players know everything. In this case, however, players even know the mixed strategies of their opponent. (Hence, $q_{-i} = s_{-i}$, for all $i$.)
  – Fictitious play assesses strategies, and plays a best response to an assessment.
  – Gradient dynamics does not assess, and it does not play a best response.
• With gradient dynamics, players don’t actually (need to) play to learn. Rather, players gradually adapt their strategy through hill-climbing in the payoff space.
Plan for today
1. Two-player, two-action, general sum games with real payoffs.
2. Dynamics of (mixed) strategies in such games. Examples:
   (a) Coordination game
   (b) Prisoners’ dilemma
   (c) Other examples
3. IGA: Infinitesimal Gradient Ascent. Singh, Kearns and Mansour (2000).
   – Convergence of IGA.
4. IGA-WoLF: Win or Learn Fast. Bowling and Veloso (2001, 2002).
   – Convergence of IGA-WoLF.
   – Analysis of the proof of convergence of IGA-WoLF.
Two-player, two-action, general sum games with real payoffs
In its most general form, a two-player, two-action, general sum game in normal form with real-valued payoffs can be represented by
\[
M = \begin{array}{c|cc}
  & L & R \\ \hline
T & r_{11},\, c_{11} & r_{12},\, c_{12} \\
B & r_{21},\, c_{21} & r_{22},\, c_{22}
\end{array}
\]
Row plays mixed $(\alpha, 1-\alpha)$. Column plays mixed $(\beta, 1-\beta)$. Expected payoff:
\[
\begin{aligned}
V_r(\alpha,\beta) &= \alpha[\beta r_{11} + (1-\beta) r_{12}] + (1-\alpha)[\beta r_{21} + (1-\beta) r_{22}] \\
                  &= u\,\alpha\beta + \alpha(r_{12} - r_{22}) + \beta(r_{21} - r_{22}) + r_{22}, \\
V_c(\alpha,\beta) &= \beta[\alpha c_{11} + (1-\alpha) c_{21}] + (1-\beta)[\alpha c_{12} + (1-\alpha) c_{22}] \\
                  &= u'\,\alpha\beta + \alpha(c_{12} - c_{22}) + \beta(c_{21} - c_{22}) + c_{22},
\end{aligned}
\]
where $u = (r_{11} - r_{12}) - (r_{21} - r_{22})$ and $u' = (c_{11} - c_{21}) - (c_{12} - c_{22})$.
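The expansion of the expected payoffs can be checked numerically. The sketch below is an illustration only: the function names and the nested-list payoff representation (`m[0][0]` for the T/L entry, `m[1][1]` for the B/R entry) are assumptions made here, not part of the slides. It compares the direct expectation with the expanded bilinear form, in which $\alpha$ carries the coefficient $c_{12} - c_{22}$ and $\beta$ the coefficient $c_{21} - c_{22}$ for the column player.

```python
def expected_payoff(m, alpha, beta):
    """Direct expectation: Row plays T with prob. alpha, Column plays L with prob. beta."""
    return (alpha * (beta * m[0][0] + (1 - beta) * m[0][1])
            + (1 - alpha) * (beta * m[1][0] + (1 - beta) * m[1][1]))


def bilinear_row(r, alpha, beta):
    """Expanded form of V_r with u = (r11 - r12) - (r21 - r22)."""
    u = (r[0][0] - r[0][1]) - (r[1][0] - r[1][1])
    return (u * alpha * beta
            + alpha * (r[0][1] - r[1][1])
            + beta * (r[1][0] - r[1][1])
            + r[1][1])


def bilinear_col(c, alpha, beta):
    """Expanded form of V_c with u' = (c11 - c21) - (c12 - c22)."""
    u_prime = (c[0][0] - c[1][0]) - (c[0][1] - c[1][1])
    return (u_prime * alpha * beta
            + alpha * (c[0][1] - c[1][1])
            + beta * (c[1][0] - c[1][1])
            + c[1][1])


# Stag hunt payoffs, used here only as sample data.
R = [[5, 0], [3, 2]]
C = [[5, 3], [0, 2]]
```

Evaluating both forms at an arbitrary interior point, for example $(\alpha, \beta) = (0.3, 0.7)$, should give the same value up to floating-point error.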
Example: payoffs for Player 1 and Player 2 in Stag Hunt
Player 1 may only move “back – front”; Player 2 may only move “left – right”.
Gradient of expected payoff
Gradient:
\[
\frac{\partial V_r(\alpha,\beta)}{\partial\alpha} = \beta u + (r_{12} - r_{22}), \qquad
\frac{\partial V_c(\alpha,\beta)}{\partial\beta} = \alpha u' + (c_{21} - c_{22})
\]
As an affine dynamical system:
\[
\begin{pmatrix} \partial V_r/\partial\alpha \\ \partial V_c/\partial\beta \end{pmatrix}
= \begin{pmatrix} 0 & u \\ u' & 0 \end{pmatrix}
  \begin{pmatrix} \alpha \\ \beta \end{pmatrix}
+ \begin{pmatrix} r_{12} - r_{22} \\ c_{21} - c_{22} \end{pmatrix}
\]
Stationary point (both partial derivatives zero, assuming $u \neq 0$ and $u' \neq 0$):
\[
(\alpha^*, \beta^*) = \left( \frac{c_{22} - c_{21}}{u'},\; \frac{r_{22} - r_{12}}{u} \right)
\]
Remarks:
• There is at most one stationary point.
• If a stationary point exists, it may lie outside $[0,1]^2$.
• If there is a stationary point inside $[0,1]^2$, it is a non-strict Nash equilibrium.
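A minimal sketch of the gradient and the stationary point follows. The function names and the nested-list payoff layout are assumptions for illustration. The stationary point is obtained by setting both partial derivatives to zero, which gives $\alpha^* = (c_{22} - c_{21})/u'$ and $\beta^* = (r_{22} - r_{12})/u$.

```python
def coefficients(r, c):
    """Return (u, u') as defined for the row and column payoff matrices."""
    u = (r[0][0] - r[0][1]) - (r[1][0] - r[1][1])
    u_prime = (c[0][0] - c[1][0]) - (c[0][1] - c[1][1])
    return u, u_prime


def gradient(r, c, alpha, beta):
    """(dV_r/d alpha, dV_c/d beta) at the strategy pair (alpha, beta)."""
    u, u_prime = coefficients(r, c)
    return (beta * u + (r[0][1] - r[1][1]),
            alpha * u_prime + (c[1][0] - c[1][1]))


def stationary_point(r, c):
    """Point where both gradient components vanish, or None if u or u' is zero."""
    u, u_prime = coefficients(r, c)
    if u == 0 or u_prime == 0:
        return None
    return ((c[1][1] - c[1][0]) / u_prime, (r[1][1] - r[0][1]) / u)


# Sample data: Game of Chicken and Prisoners' Dilemma payoffs.
CHICKEN_R = [[0, -1], [1, -3]]
CHICKEN_C = [[0, 1], [-1, -3]]
PD_R = [[3, 0], [5, 1]]
PD_C = [[3, 5], [0, 1]]
```

For the Prisoners' Dilemma payoffs the stationary point falls outside $[0,1]^2$, which matches the second remark above.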
Gradient dynamics: Coordination game
• Symmetric, but not zero sum:
\[
\begin{array}{c|cc}
  & L & R \\ \hline
T & 1, 1 & 0, 0 \\
B & 0, 0 & 1, 1
\end{array}
\]
• Gradient:
\[
\begin{pmatrix} 2\beta - 1 \\ 2\alpha - 1 \end{pmatrix}
\]
• Stationary at $(1/2, 1/2)$. The multiplicative matrix has real eigenvalues:
\[
\begin{pmatrix} 0 & 2 \\ 2 & 0 \end{pmatrix}
\]
Saddle point (figure)
Gradient dynamics: Stag hunt
• Symmetric, but not zero sum:
\[
\begin{array}{c|cc}
  & L & R \\ \hline
T & 5, 5 & 0, 3 \\
B & 3, 0 & 2, 2
\end{array}
\]
• Gradient:
\[
\begin{pmatrix} 4\beta - 2 \\ 4\alpha - 2 \end{pmatrix}
\]
• Stationary at $(1/2, 1/2)$. The multiplicative matrix has real eigenvalues:
\[
\begin{pmatrix} 0 & 4 \\ 4 & 0 \end{pmatrix}
\]
Gradient dynamics: Prisoners’ Dilemma
• Symmetric, but not zero sum:
\[
\begin{array}{c|cc}
  & L & R \\ \hline
T & 3, 3 & 0, 5 \\
B & 5, 0 & 1, 1
\end{array}
\]
• Gradient:
\[
\begin{pmatrix} -\beta - 1 \\ -\alpha - 1 \end{pmatrix}
\]
• Stationary at $(-1, -1)$. The multiplicative matrix has real eigenvalues:
\[
\begin{pmatrix} 0 & -1 \\ -1 & 0 \end{pmatrix}
\]
Gradient dynamics: Game of Chicken
• Symmetric, but not zero sum:
\[
\begin{array}{c|cc}
  & L & R \\ \hline
T & 0, 0 & -1, 1 \\
B & 1, -1 & -3, -3
\end{array}
\]
• Gradient:
\[
\begin{pmatrix} -3\beta + 2 \\ -3\alpha + 2 \end{pmatrix}
\]
• Stationary at $(2/3, 2/3)$. The multiplicative matrix has real eigenvalues:
\[
\begin{pmatrix} 0 & -3 \\ -3 & 0 \end{pmatrix}
\]
Gradient dynamics: Battle of the Sexes
• Symmetric, but not zero sum:
\[
\begin{array}{c|cc}
  & L & R \\ \hline
T & 0, 0 & 2, 3 \\
B & 3, 2 & 1, 1
\end{array}
\]
• Gradient:
\[
\begin{pmatrix} -4\beta + 1 \\ -4\alpha + 1 \end{pmatrix}
\]
• Stationary at $(1/4, 1/4)$. The multiplicative matrix has real eigenvalues:
\[
\begin{pmatrix} 0 & -4 \\ -4 & 0 \end{pmatrix}
\]
Gradient dynamics: Matching pennies
• Symmetric, zero sum:
\[
\begin{array}{c|cc}
  & L & R \\ \hline
T & 1, -1 & -1, 1 \\
B & -1, 1 & 1, -1
\end{array}
\]
• Gradient:
\[
\begin{pmatrix} 4\beta - 2 \\ -4\alpha + 2 \end{pmatrix}
\]
• Stationary at $(1/2, 1/2)$. The multiplicative matrix has imaginary eigenvalues:
\[
\begin{pmatrix} 0 & 4 \\ -4 & 0 \end{pmatrix}
\]
Gradient dynamics of expected payoff
Discrete dynamics with step size $\eta$:
\[
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}_{t+1}
= \begin{pmatrix} \alpha \\ \beta \end{pmatrix}_{t}
+ \eta \begin{pmatrix} \partial V_r/\partial\alpha \\ \partial V_c/\partial\beta \end{pmatrix}_{t}
\]
• Because $\alpha, \beta \in [0,1]$, the dynamics must be confined to $[0,1]^2$.
• Suppose the state $(\alpha, \beta)$ is on the boundary of the probability space $[0,1]^2$, and the gradient vector points outwards. Intuition: one of the players has an incentive to improve, but cannot improve further.
• To maintain dynamics within $[0,1]^2$, the gradient is projected back on to $[0,1]^2$. Intuition: if one of the players has an incentive to improve, but cannot improve, then he will not improve.
• If nonzero, the projected gradient is parallel to the (closest part of the) boundary.
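One possible reading of the projection rule is sketched below. It assumes that "projecting the gradient" means zeroing any component that points out of $[0,1]$ when the corresponding coordinate already sits on the boundary, followed by a clamp to absorb overshoot from a finite step. The function names are hypothetical, and the clamp is a simplification added here rather than something the slides specify.

```python
def project_component(x, g):
    """Zero a gradient component that points out of [0, 1] while x is on the boundary."""
    if (x <= 0.0 and g < 0.0) or (x >= 1.0 and g > 0.0):
        return 0.0
    return g


def clamp(x):
    """Keep a probability inside [0, 1]."""
    return min(1.0, max(0.0, x))


def projected_step(alpha, beta, grad_alpha, grad_beta, eta):
    """One discrete gradient step with step size eta, confined to [0, 1]^2."""
    g_a = project_component(alpha, grad_alpha)
    g_b = project_component(beta, grad_beta)
    # The clamp handles a finite step that would overshoot the boundary.
    return clamp(alpha + eta * g_a), clamp(beta + eta * g_b)
```

With the row player already at $\alpha = 1$ and a positive gradient component, only $\beta$ moves, so the effective movement is parallel to the boundary, as the last bullet states.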
Infinitesimal Gradient Ascent: IGA (Singh et al., 2000)
IGA: Discrete dynamics with step size $\eta \to 0$:
\[
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}_{t+1}
= \begin{pmatrix} \alpha \\ \beta \end{pmatrix}_{t}
+ \eta \left(
  \begin{pmatrix} 0 & u \\ u' & 0 \end{pmatrix}
  \begin{pmatrix} \alpha \\ \beta \end{pmatrix}_{t}
  + \begin{pmatrix} r_{12} - r_{22} \\ c_{21} - c_{22} \end{pmatrix}
\right)
\]
Theorem (Singh, Kearns and Mansour, 2000). If players follow IGA, where $\eta \to 0$, then their strategies will converge to a Nash equilibrium. If not, then at least their average payoffs will converge to the expected payoffs of a Nash equilibrium.
The proof is based on a qualitative result in the theory of differential equations. The behaviour of an affine differential map is determined by the multiplicative matrix $U$:
1. $U$ is not invertible.
2. $U$ is invertible, and its eigenvalues $\lambda$ (with $Ux = \lambda x$) are real.
3. $U$ is invertible, and its eigenvalues $\lambda$ (with $Ux = \lambda x$) are imaginary.
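For the specific matrix $U$ with zero diagonal and off-diagonal entries $u$ and $u'$, the characteristic equation is $\lambda^2 = u u'$, so the three cases can be told apart from the sign of the product $u u'$. The sketch below applies that observation; the function name and the returned labels are assumptions for illustration.

```python
import math


def classify_u(u, u_prime):
    """Classify U = [[0, u], [u', 0]] via lambda^2 = u * u'.

    Returns a label and |lambda|, the common magnitude of the two eigenvalues.
    """
    product = u * u_prime
    if product == 0:
        # det(U) = -u * u' = 0, so U is singular.
        return "not invertible", 0.0
    magnitude = math.sqrt(abs(product))
    if product > 0:
        return "invertible, real eigenvalues", magnitude
    return "invertible, imaginary eigenvalues", magnitude
```

Under this classification the coordination game ($u = u' = 2$) falls in case 2 with eigenvalues $\pm 2$, and matching pennies ($u = 4$, $u' = -4$) falls in case 3 with eigenvalues $\pm 4i$.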
Convergence of IGA (Singh et al., 2000)
Proof outline. There are two main cases:
1. There is no stationary point, or the stationary point lies outside $[0,1]^2$. Then there is movement everywhere in $[0,1]^2$. Since movement is caused by an affine differential map, the flow is in one direction, hence gets stuck somewhere at the boundary.
2. There is a stationary point inside $[0,1]^2$.
   (a) The stationary point is an attractor. Then it attracts movement, which then becomes stationary.
   (b) The stationary point is a repellor. Then it repels movement towards the boundary.
   (c) Both (2a) and (2b): saddle point.
   (d) None of the above. Then plain IGA does not converge.
In three out of four cases, the dynamics ends, hence ends in Nash. Cases (2a) and (2b) actually do not occur in isolation.
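The saddle-point case can be illustrated with a small simulation. This is a sketch under stated assumptions: it uses a small fixed step size instead of the limit $\eta \to 0$, a plain clamp to stay inside $[0,1]^2$, and hypothetical names (`run_iga`, `b_r`, `b_c` for the constant terms $r_{12} - r_{22}$ and $c_{21} - c_{22}$). The sample data are the coordination game coefficients $u = u' = 2$ with constant terms $-1$.

```python
def run_iga(u, u_prime, b_r, b_c, alpha, beta, eta=0.01, steps=2000):
    """Iterate the affine gradient dynamics with a fixed step size, clamped to [0, 1]^2."""
    for _ in range(steps):
        # Both gradient components are computed before either strategy is updated.
        g_alpha = beta * u + b_r
        g_beta = alpha * u_prime + b_c
        alpha = min(1.0, max(0.0, alpha + eta * g_alpha))
        beta = min(1.0, max(0.0, beta + eta * g_beta))
    return alpha, beta


# Coordination game: gradient is (2*beta - 1, 2*alpha - 1).
upper = run_iga(2, 2, -1, -1, alpha=0.6, beta=0.7)
lower = run_iga(2, 2, -1, -1, alpha=0.4, beta=0.3)
```

Starting on either side of the saddle at $(1/2, 1/2)$, the trajectory is pushed to a corner of the unit square: `upper` ends at $(1, 1)$ and `lower` at $(0, 0)$, the two pure Nash equilibria of the coordination game.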
IGA-WoLF (Bowling et al., 2001)
Bowling and Veloso modify IGA so as to ensure convergence in Case 2d. Idea: Win or Learn Fast (WoLF). To this end, IGA-WoLF uses a variable step:
\[
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}_{t+1}
= \begin{pmatrix} \alpha \\ \beta \end{pmatrix}_{t}
+ \eta \begin{pmatrix} l^r_t \cdot \partial V_r/\partial\alpha \\ l^c_t \cdot \partial V_c/\partial\beta \end{pmatrix}_{t}
\]
where $l^{\{r,c\}}_t \in \{ l_{\min}, l_{\max} \}$, all positive. (Bowling et al. use $[l_{\min}, l_{\max}]$.)
\[
l^r_t =_{\text{Def}}
\begin{cases}
l_{\min} & \text{if } V_r(\alpha_t, \beta_t) > V_r(\alpha^e, \beta_t) \quad \text{(Winning)} \\
l_{\max} & \text{otherwise} \quad \text{(Losing)}
\end{cases}
\]
\[
l^c_t =_{\text{Def}}
\begin{cases}
l_{\min} & \text{if } V_c(\alpha_t, \beta_t) > V_c(\alpha_t, \beta^e) \quad \text{(Winning)} \\
l_{\max} & \text{otherwise} \quad \text{(Losing)}
\end{cases}
\]
where $\alpha^e$ is a row strategy belonging to some NE, chosen by row player. Similarly for $\beta^e$ and column player. Thus, $(\alpha^e, \beta^e)$ need not be Nash.
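The variable step can be sketched as follows. This is an illustration, not the authors' implementation: the function names, the nested-list payoff layout, the finite step size, and the plain clamp used to stay inside $[0,1]^2$ are all assumptions made here. Each player compares its current expected payoff with the payoff its chosen equilibrium strategy would earn against the opponent's current strategy, and picks $l_{\min}$ when winning and $l_{\max}$ otherwise.

```python
def expected_payoff(m, alpha, beta):
    """Expected payoff when Row plays T with prob. alpha and Column plays L with prob. beta."""
    return (alpha * (beta * m[0][0] + (1 - beta) * m[0][1])
            + (1 - alpha) * (beta * m[1][0] + (1 - beta) * m[1][1]))


def wolf_rate(v_current, v_equilibrium, l_min, l_max):
    """Learn slowly (l_min) when winning, fast (l_max) when losing."""
    return l_min if v_current > v_equilibrium else l_max


def iga_wolf_step(r, c, alpha, beta, alpha_e, beta_e, eta, l_min, l_max):
    """One IGA-WoLF update with a finite step size eta (a stand-in for eta -> 0)."""
    u = (r[0][0] - r[0][1]) - (r[1][0] - r[1][1])
    u_prime = (c[0][0] - c[1][0]) - (c[0][1] - c[1][1])
    g_alpha = beta * u + (r[0][1] - r[1][1])
    g_beta = alpha * u_prime + (c[1][0] - c[1][1])
    # Row compares against alpha_e, Column against beta_e, each holding the opponent fixed.
    l_r = wolf_rate(expected_payoff(r, alpha, beta),
                    expected_payoff(r, alpha_e, beta), l_min, l_max)
    l_c = wolf_rate(expected_payoff(c, alpha, beta),
                    expected_payoff(c, alpha, beta_e), l_min, l_max)
    new_alpha = min(1.0, max(0.0, alpha + eta * l_r * g_alpha))
    new_beta = min(1.0, max(0.0, beta + eta * l_c * g_beta))
    return new_alpha, new_beta


# Matching pennies, with (1/2, 1/2) as the equilibrium strategies.
MP_R = [[1, -1], [-1, 1]]
MP_C = [[-1, 1], [1, -1]]
```

At $(\alpha, \beta) = (0.75, 0.75)$ in matching pennies the row player is winning and the column player is losing, so with `l_min=1`, `l_max=4` and `eta=0.01` the row strategy moves by $0.01$ while the column strategy moves by $0.04$.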