

SLIDE 1

Multi-agent learning: Gradient Dynamics

Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands. Last modified on March 1st, 2012.

SLIDE 2

Gradient dynamics: motivation

  • Every player “identifies itself” with a single mixed strategy.
  • As in fictitious play, players project each other onto a mixed strategy.
  • CKR is in order. CKR (common knowledge of rationality, cf. Hargreaves Heap & Varoufakis, 2004) implies that players know everything. In this case, however, players even know the mixed strategies of their opponent. (Hence, q−i = s−i, for all i.)
    – Fictitious play assesses strategies, and plays a best response to that assessment.
    – Gradient dynamics does not assess, and it does not play a best response.
  • With gradient dynamics, players don’t actually (need to) play to learn. Rather, players gradually adapt their strategy through hill-climbing in the payoff space.


SLIDE 3

Plan for today

  1. Two-player, two-action, general sum games with real payoffs.
  2. Dynamics of (mixed) strategies in such games. Examples:
     (a) Coordination game
     (b) Prisoners’ dilemma
     (c) Other examples
  3. IGA: Infinitesimal Gradient Ascent. Singh, Kearns and Mansour (2000).
     – Convergence of IGA.
  4. IGA-WoLF: Win or Learn Fast. Bowling and Veloso (2001, 2002).
     – Convergence of IGA-WoLF.
     – Analysis of the proof of convergence of IGA-WoLF.


SLIDE 4

Two-player, two-action, general sum games with real payoffs

In its most general form, a two-player, two-action, general sum game in normal form with real-valued payoffs can be represented by

             L          R
  M =  T  r11, c11   r12, c12
       B  r21, c21   r22, c22

Row plays the mixed strategy (α, 1 − α); Column plays the mixed strategy (β, 1 − β). Expected payoffs:

  Vr(α, β) = α[βr11 + (1 − β)r12] + (1 − α)[βr21 + (1 − β)r22]
           = uαβ + α(r12 − r22) + β(r21 − r22) + r22,

  Vc(α, β) = β[αc11 + (1 − α)c21] + (1 − β)[αc12 + (1 − α)c22]
           = u′αβ + α(c12 − c22) + β(c21 − c22) + c22,

where u = (r11 − r12) − (r21 − r22) and u′ = (c11 − c21) − (c12 − c22).
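A minimal Python sketch (not from the slides) of these formulas; the Prisoners’ Dilemma payoffs below are just an example choice:

```python
import numpy as np

# Example payoffs (Prisoners' Dilemma; any 2x2 real payoffs work).
# Rows/columns are indexed as (T, B) x (L, R).
R = np.array([[3.0, 0.0],
              [5.0, 1.0]])   # row player's payoffs: r11 r12 / r21 r22
C = np.array([[3.0, 5.0],
              [0.0, 1.0]])   # column player's payoffs: c11 c12 / c21 c22

u  = (R[0, 0] - R[0, 1]) - (R[1, 0] - R[1, 1])   # u  = (r11 - r12) - (r21 - r22)
u_ = (C[0, 0] - C[1, 0]) - (C[0, 1] - C[1, 1])   # u' = (c11 - c21) - (c12 - c22)

def V_r(alpha, beta):
    """Row's expected payoff when Row plays (alpha, 1-alpha), Col plays (beta, 1-beta)."""
    return u * alpha * beta + alpha * (R[0, 1] - R[1, 1]) + beta * (R[1, 0] - R[1, 1]) + R[1, 1]

def V_c(alpha, beta):
    """Column's expected payoff for the same strategy pair."""
    return u_ * alpha * beta + alpha * (C[0, 1] - C[1, 1]) + beta * (C[1, 0] - C[1, 1]) + C[1, 1]

# Sanity check against the direct bilinear form p^T M q:
p, q = np.array([0.3, 0.7]), np.array([0.6, 0.4])
assert np.isclose(V_r(0.3, 0.6), p @ R @ q)
assert np.isclose(V_c(0.3, 0.6), p @ C @ q)
```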


SLIDE 5

Example: payoffs for Player 1 and Player 2 in Stag Hunt

Player 1 may only move “back – front”; Player 2 may only move “left – right”.


SLIDE 6

Gradient of expected payoff

Gradient:

  ∂Vr(α, β)/∂α = βu + (r12 − r22)
  ∂Vc(α, β)/∂β = αu′ + (c21 − c22)

As an affine dynamical system:

  [ ∂Vr/∂α ]   [ 0   u ] [ α ]   [ r12 − r22 ]
  [ ∂Vc/∂β ] = [ u′  0 ] [ β ] + [ c21 − c22 ]

Stationary point:

  (α∗, β∗) = ( (c22 − c21) / u′ , (r22 − r12) / u )

Remarks:

  • There is at most one stationary point.
  • If a stationary point exists, it may lie outside [0, 1]².
  • If there is a stationary point inside [0, 1]², it is a non-strict Nash equilibrium.

A small code sketch of the gradient and the stationary point follows.
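Building on the sketch after Slide 4 (same R, C, u, u_):

```python
def gradient(alpha, beta):
    """(dV_r/d_alpha, dV_c/d_beta) at the strategy pair (alpha, beta)."""
    dV_r = beta * u + (R[0, 1] - R[1, 1])     # beta*u   + (r12 - r22)
    dV_c = alpha * u_ + (C[1, 0] - C[1, 1])   # alpha*u' + (c21 - c22)
    return np.array([dV_r, dV_c])

def stationary_point():
    """(alpha*, beta*), or None when U = [[0, u], [u', 0]] is singular."""
    if u == 0.0 or u_ == 0.0:
        return None
    return np.array([(C[1, 1] - C[1, 0]) / u_, (R[1, 1] - R[0, 1]) / u])

print(stationary_point())  # Prisoners' Dilemma above: [-1. -1.], outside [0, 1]^2
```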


SLIDE 7

Gradient dynamics: Coordination game

  • Symmetric, but not zero sum:

         L       R
    T  1, 1    0, 0
    B  0, 0    1, 1

  • Gradient:

      ∂Vr/∂α = 2β − 1
      ∂Vc/∂β = 2α − 1

  • Stationary at (1/2, 1/2). The multiplicative matrix

      U = [ 0  2 ]
          [ 2  0 ]

    has real eigenvalues (±2).


SLIDE 8

Saddle point


SLIDE 9

Gradient dynamics: Stag hunt

  • Symmetric, but not zero sum:

         L       R
    T  5, 5    0, 3
    B  3, 0    2, 2

  • Gradient:

      ∂Vr/∂α = 4β − 2
      ∂Vc/∂β = 4α − 2

  • Stationary at (1/2, 1/2). The multiplicative matrix

      U = [ 0  4 ]
          [ 4  0 ]

    has real eigenvalues (±4).


SLIDE 10

Gradient dynamics: Prisoners’ Dilemma

  • Symmetric, but not zero sum:

         L       R
    T  3, 3    0, 5
    B  5, 0    1, 1

  • Gradient:

      ∂Vr/∂α = −β − 1
      ∂Vc/∂β = −α − 1

  • Stationary at (−1, −1), outside [0, 1]². The multiplicative matrix

      U = [  0  −1 ]
          [ −1   0 ]

    has real eigenvalues (±1).


SLIDE 11

Gradient dynamics: Game of Chicken

  • Symmetric, but not zero sum:

          L         R
    T  0, 0     −1, 1
    B  1, −1   −3, −3

  • Gradient:

      ∂Vr/∂α = −3β + 2
      ∂Vc/∂β = −3α + 2

  • Stationary at (2/3, 2/3). The multiplicative matrix

      U = [  0  −3 ]
          [ −3   0 ]

    has real eigenvalues (±3).


SLIDE 12

Gradient dynamics: Battle of the Sexes

  • Symmetric, but not zero sum:

         L       R
    T  0, 0    2, 3
    B  3, 2    1, 1

  • Gradient:

      ∂Vr/∂α = −4β + 1
      ∂Vc/∂β = −4α + 1

  • Stationary at (1/4, 1/4). The multiplicative matrix

      U = [  0  −4 ]
          [ −4   0 ]

    has real eigenvalues (±4).


SLIDE 13

Gradient dynamics: Matching pennies

  • Zero sum (and, unlike the previous examples, not symmetric):

          L        R
    T  1, −1    −1, 1
    B  −1, 1    1, −1

  • Gradient:

      ∂Vr/∂α = 4β − 2
      ∂Vc/∂β = −4α + 2

  • Stationary at (1/2, 1/2). The multiplicative matrix

      U = [  0  4 ]
          [ −4  0 ]

    has purely imaginary eigenvalues (±4i).


SLIDE 14

Gradient dynamics of expected payoff

Discrete dynamics with step size η:

  [ α ]        [ α ]       [ ∂Vr/∂α ]
  [ β ]t+1  =  [ β ]t + η  [ ∂Vc/∂β ]t

  • Because α, β ∈ [0, 1], the dynamics must be confined to [0, 1]².
  • Suppose the state (α, β) is on the boundary of the probability space [0, 1]², and the gradient vector points outwards. Intuition: one of the players has an incentive to improve, but cannot improve further.
  • To maintain the dynamics within [0, 1]², the gradient is projected back onto [0, 1]². Intuition: if one of the players has an incentive to improve, but cannot improve, then he will not improve.
  • If nonzero, the projected gradient is parallel to the (closest part of the) boundary.

A sketch of one such projected step appears below.
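A minimal sketch, reusing gradient() from the Slide 6 sketch and assuming the projection is implemented by clipping the updated point back to the unit square (for box constraints this is exactly the Euclidean projection, and it leaves any residual motion parallel to the nearest boundary):

```python
def projected_step(alpha, beta, eta=0.01):
    """One discrete gradient step, projected back onto [0, 1]^2."""
    dV_r, dV_c = gradient(alpha, beta)
    alpha_new = min(max(alpha + eta * dV_r, 0.0), 1.0)
    beta_new  = min(max(beta  + eta * dV_c, 0.0), 1.0)
    return alpha_new, beta_new
```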


SLIDE 15

Infinitesimal Gradient Ascent: IGA (Singh et al., 2000)

IGA: discrete dynamics with step size η → 0:

  [ α ]        [ α ]        (  [ 0   u ] [ α ]    [ r12 − r22 ]  )
  [ β ]t+1  =  [ β ]t + η · (  [ u′  0 ] [ β ]t + [ c21 − c22 ]  )

  • Theorem (Singh, Kearns and Mansour, 2000). If players follow IGA, where η → 0, then their strategies will converge to a Nash equilibrium. If not, then at least their average payoffs will converge to the expected payoffs of a Nash equilibrium.

  • The proof is based on a qualitative result from the theory of differential equations: the behaviour of an affine differential map is determined by its multiplicative matrix U. Three cases:

    1. U is not invertible.
    2. U is invertible, and its eigenvalues (Ux = λx) are real.
    3. U is invertible, and its eigenvalues are purely imaginary.

The case distinction is sketched in code below.
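A sketch of the case distinction, with u and u_ as in the Slide 4 sketch. The eigenvalues of U = [[0, u], [u′, 0]] are ±√(uu′), so their character is decided by the sign of uu′:

```python
U = np.array([[0.0, u],
              [u_,  0.0]])          # eigenvalues: +/- sqrt(u * u')

if np.isclose(u * u_, 0.0):
    case = "1: U is not invertible"
elif u * u_ > 0:
    case = "2: U invertible, real eigenvalues"
else:
    case = "3: U invertible, purely imaginary eigenvalues"
print(case, np.linalg.eigvals(U))
```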


SLIDE 16

Convergence of IGA (Singh et al., 2000)

Proof outline. There are two main cases:

  1. There is no stationary point, or the stationary point lies outside [0, 1]². Then there is movement everywhere in [0, 1]². Since the movement is caused by an affine differential map, the flow is in one direction, hence gets stuck somewhere at the boundary.

  2. There is a stationary point inside [0, 1]².
     (a) The stationary point is an attractor. Then it attracts movement, which then becomes stationary.
     (b) The stationary point is a repellor. Then it repels movement towards the boundary.
     (c) Both (2a) and (2b): saddle point.
     (d) None of the above. Then plain IGA does not converge.

In three out of four cases the dynamics ends, hence ends in a Nash equilibrium. Cases (2a) and (2b) actually do not occur in isolation.


SLIDE 17

IGA-WoLF (Bowling et al., 2001)

Bowling and Veloso modify IGA so as to ensure convergence in Case 2d. Idea: Win or Learn Fast (WoLF). To this end, IGA-WoLF uses a variable step:

  [ α ]        [ α ]        [ l^r_t · ∂Vr/∂α ]
  [ β ]t+1  =  [ β ]t + η · [ l^c_t · ∂Vc/∂β ]t

where l^r_t, l^c_t ∈ {lmin, lmax}, all positive. (Bowling et al. use the interval [lmin, lmax].)

  l^r_t =Def  lmin  if Vr(αt, βt) > Vr(αe, βt)   (winning)
              lmax  otherwise                    (losing)

  l^c_t =Def  lmin  if Vc(αt, βt) > Vc(αt, βe)   (winning)
              lmax  otherwise                    (losing)

where αe is a row strategy belonging to some Nash equilibrium, chosen by the row player; similarly for βe and the column player. Thus, (αe, βe) itself need not be a Nash equilibrium. A sketch of one such step in code follows.
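A minimal sketch of one IGA-WoLF step, reusing V_r, V_c and gradient() from the earlier sketches. The values of l_min and l_max below are hypothetical; (alpha_e, beta_e) are the players’ chosen equilibrium strategies:

```python
def wolf_step(alpha, beta, alpha_e, beta_e, eta=0.01, l_min=0.5, l_max=2.0):
    """One variable-learning-rate step: learn slowly when winning, fast when losing."""
    dV_r, dV_c = gradient(alpha, beta)
    l_r = l_min if V_r(alpha, beta) > V_r(alpha_e, beta) else l_max   # row winning?
    l_c = l_min if V_c(alpha, beta) > V_c(alpha, beta_e) else l_max   # col winning?
    alpha_new = min(max(alpha + eta * l_r * dV_r, 0.0), 1.0)
    beta_new  = min(max(beta  + eta * l_c * dV_c, 0.0), 1.0)
    return alpha_new, beta_new
```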


SLIDE 18

Case 2d: revolution around Nash equilibrium

  • Lemma. If the learning rates lr and lc remain constant, then the trajectory of the strategy pair is an elliptical orbit around the center (α∗, β∗), with the vertical (β) axis and the horizontal (α) axis in the ratio

      √( lc|u′| / (lr|u|) ) : 1

  • Remarks:
    – For ellipses with center (α∗, β∗) there are four possibilities, depending on whether √( lc|u′| / (lr|u|) ) > 1 or < 1 and on the size of the axes:
      1. Lies flat, axes < 1.
      2. Is standing, axes < 1.
      3. Lies flat, axes > 1.
      4. Is standing, axes > 1.
    – Bowling et al. do not prove this result but refer to Singh et al., who in turn refer to a work on differential equations by Reinhard (1987).

A numerical check of this orbit follows.
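A numerical check of the lemma (a sketch, using matching pennies where u = 4, u′ = −4 and the center is (1/2, 1/2)): with constant learning rates the quantity E = lc·|u′|·(α − α∗)² + lr·|u|·(β − β∗)² is conserved along the flow, which is exactly the equation of an ellipse with the axis ratio above. Forward-Euler integration drifts slightly, so E is only approximately constant for small steps:

```python
u_mp, u_mp_ = 4.0, -4.0          # matching pennies: u = 4, u' = -4
l_r, l_c = 0.5, 2.0              # hypothetical constant learning rates
a, b = 0.6, 0.55                 # start near the center (1/2, 1/2)

def E(a, b):
    return l_c * abs(u_mp_) * (a - 0.5) ** 2 + l_r * abs(u_mp) * (b - 0.5) ** 2

E0 = E(a, b)
for _ in range(100_000):         # integrate the flow with a tiny Euler step
    da = l_r * (u_mp * b - 2.0)  # l_r * dV_r/d_alpha = l_r * (4*beta - 2)
    db = l_c * (u_mp_ * a + 2.0) # l_c * dV_c/d_beta  = l_c * (-4*alpha + 2)
    a, b = a + 1e-5 * da, b + 1e-5 * db
print(E0, E(a, b))               # nearly equal: the orbit stays on the ellipse
```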


SLIDE 19

Case 2d: revolution around Nash equilibrium

  • Lemma. A player is “winning” if and only if that player’s strategy is moving away from the center.

  • Proof. When play revolves around the center, there can be only one equilibrium; hence (αe, βe) is that equilibrium: (αe, βe) = (α∗, β∗). Consider the row player, who wins if and only if

      Vr(αt, βt) − Vr(αe, βt) > 0.

    Since Vr is affine in α for fixed β, simplifying yields

      (α − αe) · ∂Vr/∂α > 0.

    Thus, the row player “wins” iff either α > αe and α increases, or else α < αe and α decreases, i.e., iff α is moving away from the center. (A numerical check of this simplification follows the corollary.)

  • Corollary. The learning rate is constant throughout any one quadrant.
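A quick numerical check of the simplification step, reusing V_r and gradient() from the earlier sketches; the identity holds because Vr is affine in α:

```python
a, b, a_e = 0.7, 0.3, 0.5
lhs = V_r(a, b) - V_r(a_e, b)
rhs = (a - a_e) * gradient(a, b)[0]   # (alpha - alpha_e) * dV_r/d_alpha
assert np.isclose(lhs, rhs)           # "winning" = "moving away from alpha_e"
```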


SLIDE 20

Case 2d: revolution around Nash equilibrium

  • Lemma. Let C be the center. For every initial strategy pair (α, β) that is sufficiently close to C, the lmin-lmax dynamics will bring that pair to C.

  • Proof. Let (α, β) be a strategy pair. According to the previous lemma, its trajectory forms an ellipse with center C. If (α, β) is sufficiently close to C, this ellipse lies entirely within [0, 1]² and the trajectory is not disrupted. There are two cases.

    1. The strategy pair moves clockwise.
       (a) We must ensure that the learning parameters are set such that the ellipse that forms the trajectory “is standing” when (α, β) is in Q1 and Q3.
       (b) Similarly, we must ensure that the learning parameters are such that the ellipse “lies flat” when (α, β) is in Q2 and Q4.
    2. The strategy pair moves counter-clockwise. Similar reasoning.


SLIDE 21

Trajectory in different quadrants


SLIDE 22

Trajectory in different quadrants


SLIDE 23

Trajectory in different quadrants


SLIDE 24

Compound trajectory


SLIDE 25

Trajectory in different quadrants

  • Claim. The learning parameters l^r_t and l^c_t alternate in such a way that the ellipse that forms the trajectory in clockwise movement “is standing” when (α, β) is in Q1 or Q3 of the ellipse, and “lies flat” otherwise.

  • Proof. Suppose movement is clockwise.
    1. Suppose (α, β) is in Q1 (upper right) of the ellipse. Then row “wins” and col “loses”. Hence, horizontal velocity < vertical velocity, and the ellipse “is standing”.
    2. Suppose (α, β) is in Q2 (lower right) of the ellipse. Then row “loses” and col “wins”. Hence, horizontal velocity > vertical velocity, and the ellipse “lies flat”.
    Clearly, the reasoning is similar when the strategy pair (α, β) is in the other two quadrants, or when movement is counter-clockwise.

A small simulation of the resulting inward spiral follows.
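A self-contained simulation sketch on matching pennies (u = 4, u′ = −4, unique Nash equilibrium at (1/2, 1/2)). By the lemma of Slide 19, “winning” can be tested as “moving away from the center”; with lmin < lmax (hypothetical values below) the quarter-ellipses shrink and the strategies spiral inwards:

```python
l_min, l_max = 0.5, 2.0
eta = 1e-4
a, b = 0.9, 0.5                          # initial strategy pair
for _ in range(1_000_000):
    dV_r = 4.0 * b - 2.0                 # dV_r/d_alpha
    dV_c = -4.0 * a + 2.0                # dV_c/d_beta
    l_r = l_min if (a - 0.5) * dV_r > 0 else l_max   # row winning iff moving away
    l_c = l_min if (b - 0.5) * dV_c > 0 else l_max   # col winning iff moving away
    a = min(max(a + eta * l_r * dV_r, 0.0), 1.0)
    b = min(max(b + eta * l_c * dV_c, 0.0), 1.0)
print(a, b)                              # approaches (0.5, 0.5)
```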


SLIDE 26

Why not utilise Singh et al.’s result on empirical frequencies?

  • Theorem (Singh, Kearns and Mansour, 2000). If players follow IGA, where η → 0, then their strategies will converge to a Nash equilibrium. If not, then at least their average payoffs will converge to the expected payoffs of a Nash equilibrium.

  • Idea: use the average payoffs to correct the gradient.
  • So the gradient points slightly more in the direction of the average payoffs.
  • At least, this works empirically in NetLogo :-) One possible reading is sketched below.
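The slide only states the idea, so the following is a loudly hypothetical sketch of one possible reading, reusing gradient() from the Slide 6 sketch: blend the IGA gradient with a small pull towards the running average strategy (whose payoffs, by the theorem, converge to Nash payoffs). The blending weight mix is an invented parameter:

```python
def corrected_step(alpha, beta, avg_alpha, avg_beta, eta=0.01, mix=0.1):
    """One gradient step, nudged towards the historical average strategy (hypothetical)."""
    g = gradient(alpha, beta)
    pull = np.array([avg_alpha - alpha, avg_beta - beta])
    d = (1.0 - mix) * g + mix * pull       # mostly gradient, slightly corrected
    return (min(max(alpha + eta * d[0], 0.0), 1.0),
            min(max(beta  + eta * d[1], 0.0), 1.0))
```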


SLIDE 27

Literature

  • Original work on gradient dynamics in general-sum games:

    Singh, Kearns, and Mansour (2000). “Nash convergence of gradient dynamics in general-sum games”. In: Proc. of the Sixteenth Conf. on Uncertainty in Artificial Intelligence, pp. 541-548.

  • Today’s presentation was mainly based on this conference publication:

    Bowling and Veloso (2001). “Convergence of Gradient Dynamics with a Variable Learning Rate”. In: Proc. of the Eighteenth Int. Conf. on Machine Learning (ICML), pp. 27-34, June 2001.

  • The conference publication was elaborated (and accepted) as a journal article:

    Bowling and Veloso (2002). “Multiagent Learning Using a Variable Learning Rate”. In: Artificial Intelligence 136, pp. 215-250, 2002.


SLIDE 28

What next?

Bayesian play:

  • With fictitious play or gradient dynamics, the behaviour of opponents is modelled by a single mixed strategy.
  • With Bayesian play, opponents are modelled by a probability distribution over a (possibly binned) set of mixed strategies.
