Multi-agent learning

Gradient ascent

Gerard Vreeswijk, Intelligent Software Systems, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.

Saturday 16th May, 2020

Gradient ascent: idea

■ Every opponent is identified with a (possibly mixed) strategy.
■ Players can observe the (possibly mixed) strategy of their opponent.
■ After observation, every player changes its strategy a tiny bit in the right direction.
■ Comparison with fictitious play:
  • Like in fictitious play, opponents are modelled through a mixed strategy.
  • In fictitious play, players learn the projected opponent strategy and play a best response to it.
  • In gradient ascent, players do not project a mixed strategy, and do not play a best response.

Plan for today

1. Two-player, two-action, general sum games with real payoffs.
2. Dynamics of (mixed) strategies in such games. Examples:
   (a) Coordination game
   (b) Prisoners' dilemma
   (c) Stag hunt
   (d) Other examples
3. IGA: Infinitesimal Gradient Ascent. Singh, Kearns and Mansour (2000).
   ■ Convergence of IGA.
4. IGA-WoLF: Win or Learn Fast. Bowling and Veloso (2001, 2002).
   ■ Convergence of IGA-WoLF + analysis of the proof of convergence.

Part 1: Payoffs of general 2x2 games in normal form

Two-player, two-action, general sum games

In its most general form, a two-player, two-action game in normal form with real-valued payoffs can be represented by

                  L           R
    M  =   T   r11, c11    r12, c12
           B   r21, c21    r22, c22

■ Row plays mixed (α, 1 − α). Column plays mixed (β, 1 − β).
■ Expected payoffs:

  u1(α, β) = α[βr11 + (1 − β)r12] + (1 − α)[βr21 + (1 − β)r22]
           = uαβ + α(r12 − r22) + β(r21 − r22) + r22,

  u2(α, β) = β[αc11 + (1 − α)c21] + (1 − β)[αc12 + (1 − α)c22]
           = u′αβ + α(c21 − c22) + β(c12 − c22) + c22,

  where u = (r11 − r12) − (r21 − r22) and u′ = (c11 − c21) − (c12 − c22).
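
These closed forms are easy to sanity-check numerically. Below is a minimal sketch (not part of the slides; variable names are mine) that compares the direct expectation with the uαβ + … form, using the Prisoners' Dilemma payoffs that appear later in the lecture.

```python
# Minimal sketch (not from the slides): check the closed forms for u1 and u2.
# Payoff entries r[i][j], c[i][j] follow the matrix M above; the numbers are the
# Prisoners' Dilemma payoffs (3,3 / 0,5 / 5,0 / 1,1) used later in the lecture.
r = [[3, 0], [5, 1]]
c = [[3, 5], [0, 1]]

u  = (r[0][0] - r[0][1]) - (r[1][0] - r[1][1])   # u  = (r11 - r12) - (r21 - r22)
up = (c[0][0] - c[1][0]) - (c[0][1] - c[1][1])   # u' = (c11 - c21) - (c12 - c22)

def u1_direct(a, b):
    return a*(b*r[0][0] + (1-b)*r[0][1]) + (1-a)*(b*r[1][0] + (1-b)*r[1][1])

def u1_closed(a, b):
    return u*a*b + a*(r[0][1] - r[1][1]) + b*(r[1][0] - r[1][1]) + r[1][1]

def u2_direct(a, b):
    return b*(a*c[0][0] + (1-a)*c[1][0]) + (1-b)*(a*c[0][1] + (1-a)*c[1][1])

def u2_closed(a, b):
    return up*a*b + a*(c[1][0] - c[1][1]) + b*(c[0][1] - c[1][1]) + c[1][1]

for a, b in [(0.2, 0.7), (0.5, 0.5), (1.0, 0.0)]:
    assert abs(u1_direct(a, b) - u1_closed(a, b)) < 1e-12
    assert abs(u2_direct(a, b) - u2_closed(a, b)) < 1e-12
print("closed forms agree with the direct expectations")
```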

Gradient of expected payoff

Gradient:

  ∂u1(α, β)/∂α = βu + (r12 − r22)
  ∂u2(α, β)/∂β = αu′ + (c21 − c22)

As an affine map:

  ( ∂u1/∂α , ∂u2/∂β ) = U·(α, β) + C,

  where U = [ 0  u ; u′  0 ] (rows (0, u) and (u′, 0)) and C = ( r12 − r22 , c21 − c22 ).

Stationary point:

  (α∗, β∗) = ( (c22 − c21)/u′ , (r22 − r12)/u )

Remarks:

■ There is at most one stationary point.
■ If a stationary point exists, it may lie outside [0, 1]².
■ If there is a stationary point inside [0, 1]², it is a weak (i.e., non-strict) Nash equilibrium.
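
A small sketch (mine, not from the slides; the helper names are made up) that assembles U and C from an arbitrary payoff matrix and computes the gradient and the stationary point; the Stag hunt payoffs from a later slide serve as a check.

```python
import numpy as np

# Sketch (not from the slides): gradient, affine map (U, C) and stationary point
# of a general 2x2 game. The helper names are illustrative only.
def affine_map(r, c):
    r, c = np.asarray(r, float), np.asarray(c, float)
    u  = (r[0, 0] - r[0, 1]) - (r[1, 0] - r[1, 1])
    up = (c[0, 0] - c[1, 0]) - (c[0, 1] - c[1, 1])
    U  = np.array([[0.0, u], [up, 0.0]])
    C  = np.array([r[0, 1] - r[1, 1], c[1, 0] - c[1, 1]])
    return U, C

def gradient(r, c, a, b):
    U, C = affine_map(r, c)
    return U @ np.array([a, b]) + C               # (du1/d_alpha, du2/d_beta)

def stationary_point(r, c):
    U, C = affine_map(r, c)
    if np.isclose(np.linalg.det(U), 0.0):         # u = 0 or u' = 0: no stationary point
        return None
    return -np.linalg.solve(U, C)                 # ((c22 - c21)/u', (r22 - r12)/u)

# Check on the Stag hunt payoffs used on a later slide (5,5 / 0,3 / 3,0 / 2,2):
r_sh, c_sh = [[5, 0], [3, 2]], [[5, 3], [0, 2]]
print(gradient(r_sh, c_sh, 0.5, 0.5))             # [0. 0.]  -- stationary
print(stationary_point(r_sh, c_sh))               # [0.5 0.5]
```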

Example: payoffs in Stag Hunt (r = 4, t = 3, s = 1, p = 3)

[Figure: payoffs in Stag Hunt.] Player 1 may only move "back – front"; Player 2 may only move "left – right".

Part 2: IGA

Gradient ascent

Affine differential map:

  (α, β)t+1 = (α, β)t + η · ( ∂u1/∂α , ∂u2/∂β )t

■ Because α, β ∈ [0, 1], the dynamics must be confined to [0, 1]².
■ Suppose the state (α, β) is on the boundary of the probability space [0, 1]² and the gradient vector points outwards. Intuition: one of the players has an incentive to improve, but cannot improve further.
■ To maintain the dynamics within [0, 1]², the gradient is projected back onto [0, 1]². Intuition: if one of the players has an incentive to improve, but cannot improve, then he will not improve.
■ If nonzero, the projected gradient is parallel to the (closest) boundary of [0, 1]².
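
As a rough illustration (mine, not from the slides), one such update step can be written as follows. For the box [0, 1]², projecting the updated point back into the square amounts to clipping each coordinate, which for small η matches the description above: on the boundary, the outward component of the gradient is dropped.

```python
import numpy as np

# Sketch (not from the slides): one projected gradient-ascent step for a 2x2 game.
def ga_step(r, c, a, b, eta=0.01):
    r, c = np.asarray(r, float), np.asarray(c, float)
    u  = (r[0, 0] - r[0, 1]) - (r[1, 0] - r[1, 1])
    up = (c[0, 0] - c[1, 0]) - (c[0, 1] - c[1, 1])
    da = u  * b + (r[0, 1] - r[1, 1])             # du1/d_alpha
    db = up * a + (c[1, 0] - c[1, 1])             # du2/d_beta
    return (min(1.0, max(0.0, a + eta * da)),     # clip = project onto [0,1]^2
            min(1.0, max(0.0, b + eta * db)))

# Prisoners' Dilemma (payoffs from the slide further on): both gradients are always
# negative, so the trajectory slides to the corner (0, 0), i.e. mutual defection.
a, b = 0.9, 0.9
for _ in range(2000):
    a, b = ga_step([[3, 0], [5, 1]], [[3, 5], [0, 1]], a, b)
print(a, b)   # -> 0.0 0.0
```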

Infinitesimal Gradient Ascent: IGA (Singh et al., 2000)

Affine differential map:

  (α, β)t+1 = (α, β)t + η · ( U·(α, β)t + C ),

  where U = [ 0  u ; u′  0 ] and C = ( r12 − r22 , c21 − c22 ).

Theorem (Singh, Kearns and Mansour, 2000). If players follow IGA, where η → 0, their average payoffs will converge to the (expected) payoffs of a NE. If their strategies converge, they will converge to that same NE.

The proof is based on a qualitative result in the theory of differential equations, which says that the behaviour of an affine differential map is determined by the multiplicative matrix U:

1. If U is invertible and its eigenvalues λ (solutions of Ux = λx, equivalently of Det[U − λI] = 0) are real, there is a stationary point, which is a saddle point.
2. If U is invertible and its eigenvalues λ are imaginary, there is a stationary point, which is a centric point (a center).
3. If U is not invertible (iff u = 0 or u′ = 0), there is no stationary point.
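
A compact sketch (mine, not from the slides; names are illustrative) of this case analysis: compute u and u′, check invertibility of U, and use the sign of u·u′ — the eigenvalues solve λ² − uu′ = 0 — to distinguish the saddle case from the centric case.

```python
import numpy as np

# Sketch (not from the slides): classify the stationary point of the IGA dynamics
# of a 2x2 game via U, following cases 1-3 above.
def classify(r, c):
    r, c = np.asarray(r, float), np.asarray(c, float)
    u  = (r[0, 0] - r[0, 1]) - (r[1, 0] - r[1, 1])
    up = (c[0, 0] - c[1, 0]) - (c[0, 1] - c[1, 1])
    if u == 0 or up == 0:                          # case 3: U is not invertible
        return "no stationary point"
    a_star = (c[1, 1] - c[1, 0]) / up              # (c22 - c21) / u'
    b_star = (r[1, 1] - r[0, 1]) / u               # (r22 - r12) / u
    # Eigenvalues of U = [[0, u], [u', 0]] solve lambda^2 - u*u' = 0:
    kind = "saddle point" if u * up > 0 else "centric point"   # cases 1 and 2
    where = "inside" if 0 <= a_star <= 1 and 0 <= b_star <= 1 else "outside"
    return f"{kind} at ({a_star:g}, {b_star:g}), {where} [0,1]^2"

# Coordination game from the slides below:
print(classify([[1, 0], [0, 1]], [[1, 0], [0, 1]]))
#   -> saddle point at (0.5, 0.5), inside [0,1]^2
```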

Saddle point

[Figure illustrating a saddle point.]

Gradient ascent: Coordination game

■ Symmetric, but not zero sum:

           L       R
    T     1, 1    0, 0
    B     0, 0    1, 1

■ Gradient: ( 2β − 1, 2α − 1 ).
■ Stationary point at (1/2, 1/2).
■ Matrix U = [ 0  2 ; 2  0 ] has real eigenvalues: λ² − 4 = 0. Saddle point inside [0, 1]².

Gradient ascent: Prisoners' Dilemma

■ Symmetric, but not zero sum:

           L       R
    T     3, 3    0, 5
    B     5, 0    1, 1

■ Gradient: ( −β − 1, −α − 1 ).
■ Stationary point at (−1, −1).
■ Matrix U = [ 0  −1 ; −1  0 ] has real eigenvalues: λ² − 1 = 0. Saddle point outside [0, 1]².

Gradient ascent: Stag hunt

■ Symmetric, but not zero sum:

           L       R
    T     5, 5    0, 3
    B     3, 0    2, 2

■ Gradient: ( 4β − 2, 4α − 2 ).
■ Stationary point at (1/2, 1/2).
■ Matrix U = [ 0  4 ; 4  0 ] has real eigenvalues: λ² − 16 = 0. Saddle point inside [0, 1]².

Gradient ascent: Game of Chicken

■ Symmetric, but not zero sum:

            L        R
    T     0, 0    −1, 1
    B     1, −1   −3, −3

■ Gradient: ( −3β + 2, −3α + 2 ).
■ Stationary point at (2/3, 2/3).
■ Matrix U = [ 0  −3 ; −3  0 ] has real eigenvalues: λ² − 9 = 0. Saddle point inside [0, 1]².

Gradient ascent: Battle of the Sexes

■ Symmetric, but not zero sum:

           L       R
    T     0, 0    2, 3
    B     3, 2    1, 1

■ Gradient: ( −4β + 1, −4α + 1 ).
■ Stationary point at (1/4, 1/4).
■ Matrix U = [ 0  −4 ; −4  0 ] has real eigenvalues: λ² − 16 = 0. Saddle point inside [0, 1]².

Gradient ascent: Matching pennies

■ Zero sum:

            L        R
    T     1, −1    −1, 1
    B     −1, 1    1, −1

■ Gradient: ( 4β − 2, −4α + 2 ).
■ Stationary point at (1/2, 1/2).
■ Matrix U = [ 0  4 ; −4  0 ] has imaginary eigenvalues: λ² + 16 = 0. Centric point inside [0, 1]².

Gradient ascent: another game with a centric point

■ Neither symmetric nor zero sum:

            L        R
    T     −2, 2    1, 1
    B     3, −3    −2, 1

■ Gradient: ( −8β + 3, 5α − 4 ).
■ Stationary point at (4/5, 3/8).
■ Matrix U = [ 0  −8 ; 5  0 ] has imaginary eigenvalues: λ² + 40 = 0. Centric point inside [0, 1]².
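
As a usage sketch (mine, not from the slides), the same computation can be run over all of the example games above; the stationary points and saddle/centric labels printed should match the values on the preceding slides.

```python
import numpy as np

# Sketch (not from the slides): recompute u, u', the stationary point and the type
# of eigenvalues for each example game above, to compare with the slides.
# (None of these games has u = 0 or u' = 0.)
games = {
    "Coordination":         ([[1, 0], [0, 1]],   [[1, 0], [0, 1]]),
    "Prisoners' Dilemma":   ([[3, 0], [5, 1]],   [[3, 5], [0, 1]]),
    "Stag hunt":            ([[5, 0], [3, 2]],   [[5, 3], [0, 2]]),
    "Game of Chicken":      ([[0, -1], [1, -3]], [[0, 1], [-1, -3]]),
    "Battle of the Sexes":  ([[0, 2], [3, 1]],   [[0, 3], [2, 1]]),
    "Matching pennies":     ([[1, -1], [-1, 1]], [[-1, 1], [1, -1]]),
    "Other centric game":   ([[-2, 1], [3, -2]], [[2, 1], [-3, 1]]),
}
for name, (r, c) in games.items():
    r, c = np.asarray(r, float), np.asarray(c, float)
    u  = (r[0, 0] - r[0, 1]) - (r[1, 0] - r[1, 1])
    up = (c[0, 0] - c[1, 0]) - (c[0, 1] - c[1, 1])
    a_star = (c[1, 1] - c[1, 0]) / up
    b_star = (r[1, 1] - r[0, 1]) / u
    kind = "saddle" if u * up > 0 else "centric"
    print(f"{name:20} ({a_star:5.2f}, {b_star:5.2f})  {kind}")
```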

Convergence of IGA (Singh et al., 2000)

Proof outline. There are two main cases:

1. There is no stationary point, or the stationary point lies outside [0, 1]². Then there is movement everywhere in [0, 1]². Since the movement is caused by an affine differential map, the flow is in one direction, hence gets stuck somewhere at the boundary.
2. There is a stationary point inside [0, 1]².
   (a) The stationary point is an attractor. Then it attracts movement, which then becomes stationary.
   (b) The stationary point is a repellor. Then it repels movement towards the boundary.
   (c) Both (2a) and (2b): saddle point.
   (d) None of the above. Then strategies do not converge.

In three out of four cases, the dynamics ends, hence ends in Nash.
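
Case 2d is exactly what happens in Matching pennies. A rough simulation sketch (mine, not from the slides): with a small but finite η the discretised trajectory keeps circling the centre (1/2, 1/2) (and slowly drifts outwards, an artefact of the finite step size), so the strategies do not converge, yet the time-averaged payoff of the row player does approach the equilibrium value 0.

```python
import numpy as np

# Sketch (not from the slides): IGA with a small finite step size on Matching pennies.
# Strategies keep circling (1/2, 1/2) and never settle; the time-averaged payoff of
# the row player approaches the Nash-equilibrium value 0.
r = np.array([[1., -1.], [-1., 1.]])
c = -r                                            # zero sum
u  = (r[0, 0] - r[0, 1]) - (r[1, 0] - r[1, 1])    # = 4
up = (c[0, 0] - c[1, 0]) - (c[0, 1] - c[1, 1])    # = -4
a, b, eta = 0.9, 0.2, 1e-3
total, steps = 0.0, 200_000
for _ in range(steps):
    da = u  * b + (r[0, 1] - r[1, 1])             # du1/d_alpha = 4*beta - 2
    db = up * a + (c[1, 0] - c[1, 1])             # du2/d_beta  = -4*alpha + 2
    a = min(1.0, max(0.0, a + eta * da))
    b = min(1.0, max(0.0, b + eta * db))
    total += u * a * b + a * (r[0, 1] - r[1, 1]) + b * (r[1, 0] - r[1, 1]) + r[1, 1]
print("final strategies:", round(a, 2), round(b, 2))   # not (0.5, 0.5)
print("average payoff  :", round(total / steps, 3))    # close to 0
```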

Part 3: IGA-WoLF

IGA-WoLF (Bowling et al., 2001)

Bowling and Veloso modify IGA so as to ensure convergence in Case 2d. Idea: Win or Learn Fast (WoLF). To this end, IGA-WoLF uses a variable learning rate:

  (α, β)t+1 = (α, β)t + η · ( l¹t · ∂u1/∂α , l²t · ∂u2/∂β )t

where l¹t, l²t ∈ {lmin, lmax}, both positive, and

  l¹t =Def  lmin  if u1(αt, βt) > u1(αe, βt)   (winning)
            lmax  otherwise                    (losing)

  l²t =Def  lmin  if u2(αt, βt) > u2(αt, βe)   (winning)
            lmax  otherwise                    (losing)

where αe is a row strategy belonging to some NE, chosen by the row player. Similarly for βe and the column player. So (αe, βe) need not be Nash!
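
A minimal sketch (mine, not from the slides; the names and the concrete lmin, lmax values are arbitrary) of one IGA-WoLF update: each player compares its current expected payoff with what its chosen equilibrium strategy would earn against the opponent's current strategy, and learns slowly when winning, fast when losing.

```python
# Minimal sketch (not from the slides): one IGA-WoLF update for the row and column
# players. u1, u2 are the expected-payoff functions from Part 1; grad_a and grad_b
# compute du1/d_alpha and du2/d_beta; alpha_e, beta_e are the chosen NE strategies.
L_MIN, L_MAX = 1.0, 4.0      # any 0 < l_min < l_max will do; these values are arbitrary

def wolf_step(u1, u2, grad_a, grad_b, a, b, alpha_e, beta_e, eta):
    l1 = L_MIN if u1(a, b) > u1(alpha_e, b) else L_MAX   # row: winning -> learn slowly
    l2 = L_MIN if u2(a, b) > u2(a, beta_e) else L_MAX    # column: winning -> learn slowly
    a_new = min(1.0, max(0.0, a + eta * l1 * grad_a(a, b)))
    b_new = min(1.0, max(0.0, b + eta * l2 * grad_b(a, b)))
    return a_new, b_new
```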

slide-125
SLIDE 125

Case 2d: revolution around Nash equilibrium

Author: Gerard Vreeswijk. Slides last modified on May 16th, 2020 at 11:33 Multi-agent learning: Gradient ascent, slide 22

Lemma 1. With fixed l1 and l2, the trajectory of the strategy pair (α, β) is an elliptic orbit around (α∗, β∗) with axes

  • l2|u|/l1|u′|
  • ,

1

slide-126
SLIDE 126

Case 2d: revolution around Nash equilibrium

Author: Gerard Vreeswijk. Slides last modified on May 16th, 2020 at 11:33 Multi-agent learning: Gradient ascent, slide 22

Lemma 1. With fixed l1 and l2, the trajectory of the strategy pair (α, β) is an elliptic orbit around (α∗, β∗) with axes

  • l2|u|/l1|u′|
  • ,

1

  • Remarks:
slide-127
SLIDE 127

Case 2d: revolution around Nash equilibrium

Author: Gerard Vreeswijk. Slides last modified on May 16th, 2020 at 11:33 Multi-agent learning: Gradient ascent, slide 22

Lemma 1. With fixed l1 and l2, the trajectory of the strategy pair (α, β) is an elliptic orbit around (α∗, β∗) with axes

  • l2|u|/l1|u′|
  • ,

1

  • Remarks:

■ For ellipses with center (α∗, β∗) there are four possibilities, depending

  • n u, u′, and whether
  • l2|u|/l1|u′| > 1 or < 1.
slide-128
SLIDE 128

Case 2d: revolution around Nash equilibrium

Author: Gerard Vreeswijk. Slides last modified on May 16th, 2020 at 11:33 Multi-agent learning: Gradient ascent, slide 22

Lemma 1. With fixed l1 and l2, the trajectory of the strategy pair (α, β) is an elliptic orbit around (α∗, β∗) with axes

  • l2|u|/l1|u′|
  • ,

1

  • Remarks:

■ For ellipses with center (α∗, β∗) there are four possibilities, depending

  • n u, u′, and whether
  • l2|u|/l1|u′| > 1 or < 1.
  • 1. Lies flat and axes < 1.
slide-129
SLIDE 129

Case 2d: revolution around Nash equilibrium

Author: Gerard Vreeswijk. Slides last modified on May 16th, 2020 at 11:33 Multi-agent learning: Gradient ascent, slide 22

Lemma 1. With fixed l1 and l2, the trajectory of the strategy pair (α, β) is an elliptic orbit around (α∗, β∗) with axes

  • l2|u|/l1|u′|
  • ,

1

  • Remarks:

■ For ellipses with center (α∗, β∗) there are four possibilities, depending

  • n u, u′, and whether
  • l2|u|/l1|u′| > 1 or < 1.
  • 1. Lies flat and axes < 1.
  • 2. Stands and axes < 1.
slide-130
SLIDE 130

Case 2d: revolution around Nash equilibrium

Author: Gerard Vreeswijk. Slides last modified on May 16th, 2020 at 11:33 Multi-agent learning: Gradient ascent, slide 22

Lemma 1. With fixed l1 and l2, the trajectory of the strategy pair (α, β) is an elliptic orbit around (α∗, β∗) with axes

  • l2|u|/l1|u′|
  • ,

1

  • Remarks:

■ For ellipses with center (α∗, β∗) there are four possibilities, depending

  • n u, u′, and whether
  • l2|u|/l1|u′| > 1 or < 1.
  • 1. Lies flat and axes < 1.
  • 2. Stands and axes < 1.
  • 3. Lies flat and axes > 1.
slide-131
SLIDE 131

Case 2d: revolution around Nash equilibrium

Author: Gerard Vreeswijk. Slides last modified on May 16th, 2020 at 11:33 Multi-agent learning: Gradient ascent, slide 22

Lemma 1. With fixed l1 and l2, the trajectory of the strategy pair (α, β) is an elliptic orbit around (α∗, β∗) with axes

  • l2|u|/l1|u′|
  • ,

1

  • Remarks:

■ For ellipses with center (α∗, β∗) there are four possibilities, depending

  • n u, u′, and whether
  • l2|u|/l1|u′| > 1 or < 1.
  • 1. Lies flat and axes < 1.
  • 2. Stands and axes < 1.
  • 3. Lies flat and axes > 1.
  • 4. Stands and axes > 1.
slide-132
SLIDE 132

Case 2d: revolution around Nash equilibrium

Lemma 1. With fixed ℓ1 and ℓ2, the trajectory of the strategy pair (α, β) is an elliptic orbit around (α∗, β∗) whose axes are in the ratio

  √( ℓ2|u| / (ℓ1|u′|) )  :  1.

Remarks:

■ For ellipses with center (α∗, β∗) there are four possibilities, depending on u, u′, and on whether √( ℓ2|u| / (ℓ1|u′|) ) > 1 or < 1:

  1. Lies flat, axes < 1.
  2. Stands, axes < 1.
  3. Lies flat, axes > 1.
  4. Stands, axes > 1.

■ Bowling et al. do not prove this result but refer to Singh et al., who in turn refer to a work on differential equations by Reinhard (1987).
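
To see why the orbit is elliptic, here is a minimal numerical sketch (not from the slides or the cited papers; all numbers are made up for illustration). It integrates the linearised gradient dynamics around the center in the form dx/dt = a·y, dy/dt = b·x with a·b < 0, where a and b stand for the learning-rate-scaled slopes (ℓ times u, respectively u′) of the two update rules. The quantity |b|·x² + |a|·y² is conserved along the flow, which is precisely the level set of an ellipse with axes ratio √(|a|/|b|).

```python
import numpy as np

# Linearised gradient-ascent dynamics around the center (alpha*, beta*),
# written in centered coordinates x = alpha - alpha*, y = beta - beta*:
#   dx/dt = a * y      (a stands for the row player's rate times its slope)
#   dy/dt = b * x      (b stands for the column player's rate times its slope)
# In case 2d we have a * b < 0, and |b|*x^2 + |a|*y^2 is conserved: an ellipse.

a, b = 0.8, -0.2                  # made-up coefficients with a * b < 0
dt, steps = 1e-3, 20_000
z = np.array([0.10, 0.05])        # initial (x, y), close to the center

ellipse_level = lambda x, y: abs(b) * x**2 + abs(a) * y**2

print("level at start:", ellipse_level(*z))
for _ in range(steps):
    x, y = z
    z = z + dt * np.array([a * y, b * x])    # forward-Euler step (small dt)
print("level at end:  ", ellipse_level(*z))  # ~ unchanged: the orbit traces the ellipse
print("axes ratio sqrt(|a|/|b|):", np.sqrt(abs(a) / abs(b)))
```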

Case 2d: revolution around Nash equilibrium

Lemma 2. A player is “winning” if and only if that player’s strategy is moving away from the center.

Proof. When play revolves around the center, there can be only one equilibrium, so the center coincides with the Nash equilibrium (αe, βe).

Consider the row player, who wins if and only if u1(αt, βt) − u1(αe, βt) > 0. Simplifying, and using βu − (r12 − r22) = ∂u1/∂α, this yields

  (α − αe) · ∂u1/∂α > 0.

Thus the row player “wins” iff either α > αe and α increases, or else α < αe and α decreases.

Corollary. The learning rate is constant throughout any one quadrant.
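
Because u1 is linear in α, the “simplifying” step is just the identity u1(α, β) − u1(αe, β) = (α − αe) · ∂u1/∂α. Below is a quick numerical check of that identity (an illustration only; the payoff matrix and the sign convention ∂u1/∂α = βu + (r12 − r22) are assumptions of this sketch, not taken from the slides).

```python
import numpy as np

# Row player's expected payoff in a 2x2 game, with
# alpha = P(row plays the first action), beta = P(column plays the first action).
R = np.array([[3.0, 0.0],
              [5.0, 1.0]])                      # made-up payoffs r11 r12 / r21 r22
u = R[0, 0] - R[0, 1] - R[1, 0] + R[1, 1]       # u = r11 - r12 - r21 + r22

def u1(alpha, beta):
    return np.array([alpha, 1 - alpha]) @ R @ np.array([beta, 1 - beta])

def du1_dalpha(beta):
    return beta * u + (R[0, 1] - R[1, 1])       # slope of u1 in alpha (linear)

rng = np.random.default_rng(0)
alpha, alpha_e, beta = rng.random(3)
lhs = u1(alpha, beta) - u1(alpha_e, beta)
rhs = (alpha - alpha_e) * du1_dalpha(beta)
print(abs(lhs - rhs) < 1e-12)                   # True: the payoff gain factors
                                                # through (alpha - alpha_e) * gradient
```
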
Case 2d: revolution around Nash equilibrium

Lemma 3. Let C be the center. For every initial strategy pair (α, β) that is sufficiently close to C, the ℓmin / ℓmax dynamics will bring that pair to C.

Proof. Let (α, β) be a strategy pair. According to Lemma 1, its trajectory forms an ellipse with center C. If (α, β) is sufficiently close to C, this ellipse lies entirely within [0, 1]² and the trajectory is not disrupted. There are two cases.

1. The strategy pair moves clockwise.

   (a) We then have to ensure that the learning parameters are set such that the ellipse that forms the trajectory “stands” when (α, β) is in Q1 and Q3.

   (b) Similarly, we have to ensure that the learning parameters are set such that the ellipse “lies flat” when (α, β) is in Q2 and Q4.

2. The strategy pair moves counter-clockwise. Similar reasoning.

(First a suggestive picture, then the rest of the proof.)
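
The following small simulation (a sketch for illustration, not from the slides) shows the ℓmin / ℓmax mechanism at work. It integrates the same linearised dynamics as before, but now each player uses ℓmax while losing and ℓmin while winning, with “winning” determined as in Lemma 2 (the player’s own coordinate is moving away from the center). The slopes u, u′ and the two rates are made-up values; the distance to the center shrinks by roughly a factor √(ℓmin/ℓmax) per quadrant crossed, so the orbit spirals inward.

```python
import numpy as np

# WoLF-style ("Win or Learn Fast") variant of the centered dynamics:
#   dx/dt = l_row * u  * y,   dy/dt = l_col * u' * x,   with u * u' < 0,
# where each player uses l_max while losing and l_min while winning
# (winning = moving away from the center, as in Lemma 2).

u, u_prime = 1.0, -1.0            # made-up slopes with u * u' < 0
l_min, l_max = 0.1, 1.0           # made-up learning rates
dt, T = 1e-3, 60.0

z = np.array([0.20, 0.00])        # start close to the center
print("initial distance to center:", np.linalg.norm(z))

for _ in range(int(T / dt)):
    x, y = z
    l_row = l_min if x * (u * y) > 0 else l_max        # row winning -> learn slowly
    l_col = l_min if y * (u_prime * x) > 0 else l_max  # col winning -> learn slowly
    z = z + dt * np.array([l_row * u * y, l_col * u_prime * x])

print("final distance to center:  ", np.linalg.norm(z))  # much smaller: inward spiral
```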

Trajectory in different quadrants (clockwise)

[Three figure slides: the elliptic trajectory of (α, β) in the successive quadrants, drawn for the two learning rates ℓmin and ℓmax.]

Compound trajectory

[Figure slide: the compound trajectory obtained by joining the quadrant arcs, again labelled with ℓmin and ℓmax.]

Trajectory in different quadrants (clockwise)

Claim. The learning parameters ℓmin / ℓmax alternate in such a way that the ellipse that forms the trajectory in clockwise movement “stands” when (α, β) is in Q1 and Q3 of the ellipse, and “lies flat” otherwise.

Proof. Suppose movement is clockwise.

1. Suppose (α, β) is in Q1 (upper right) of the ellipse. Then row “wins” and col “loses”. By Lemma 2, horizontal velocity < vertical velocity, and the ellipse “stands”.

2. Suppose (α, β) is in Q2 (lower right) of the ellipse. Then row “loses” and col “wins”. By Lemma 2, horizontal velocity > vertical velocity, and the ellipse “lies flat”.

The reasoning is similar when the strategy pair (α, β) is in the other two quadrants, or when movement is counter-clockwise.
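
The case analysis can also be tabulated mechanically (again just an illustration). For clockwise motion, the sign pattern of the velocity in each quadrant tells us who is moving away from the center; the winner then learns with ℓmin and the loser with ℓmax, which fixes whether the local ellipse stands or lies flat.

```python
# Tabulate the claim for clockwise motion around the center.
# A quadrant is given by the signs (sx, sy) of (alpha - alpha*, beta - beta*);
# on a clockwise orbit the velocity there has sign pattern (sy, -sx).
QUADRANTS = {"Q1 (upper right)": (+1, +1), "Q2 (lower right)": (+1, -1),
             "Q3 (lower left)":  (-1, -1), "Q4 (upper left)":  (-1, +1)}

for name, (sx, sy) in QUADRANTS.items():
    vx, vy = sy, -sx                    # clockwise direction of motion
    row_wins = sx * vx > 0              # alpha moving away from the center
    col_wins = sy * vy > 0              # beta moving away from the center
    horizontal = "slow (l_min)" if row_wins else "fast (l_max)"
    vertical   = "slow (l_min)" if col_wins else "fast (l_max)"
    shape = "stands" if row_wins else "lies flat"   # slow horizontally => tall ellipse
    print(f"{name}: row {'wins' if row_wins else 'loses'}, "
          f"col {'wins' if col_wins else 'loses'}, "
          f"horizontal {horizontal}, vertical {vertical} -> ellipse {shape}")
```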

Part 4: Another solution


Why not utilise Singh et al.’s result on empirical frequencies?

■ Theorem (Singh, Kearns and Mansour, 2000). If players follow IGA with η → 0, then their strategies converge to a Nash equilibrium; if not, then at least their average payoffs converge to the expected payoffs of a Nash equilibrium.

■ Idea: use the average payoffs to correct the gradient.

■ So the gradient points slightly more in the direction of the average payoffs.

■ At least, this works empirically.
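
The second half of the theorem is easy to observe numerically. The sketch below (an illustration, not from the slides) runs IGA with a small, fixed step size — so the theorem’s η → 0 limit is only approximated — on matching pennies, whose unique Nash equilibrium is (1/2, 1/2) with value 0 for the row player. The strategies keep circling the equilibrium rather than converging, yet the running average of the row player’s expected payoff settles near the equilibrium value.

```python
import numpy as np

# Plain IGA on matching pennies: the strategies orbit the Nash equilibrium
# (1/2, 1/2) instead of converging, but the row player's average payoff
# approaches the equilibrium value 0.
R = np.array([[1.0, -1.0],
              [-1.0, 1.0]])     # row player's payoffs; column's payoffs are -R
C = -R

clip = lambda p: min(1.0, max(0.0, p))   # keep probabilities inside [0, 1]

alpha, beta = 0.60, 0.55        # initial mixed strategies, near the equilibrium
eta, steps = 1e-3, 50_000
payoffs = []

for _ in range(steps):
    p = np.array([alpha, 1 - alpha])
    q = np.array([beta, 1 - beta])
    payoffs.append(p @ R @ q)                  # row's expected payoff this round
    grad_row = np.array([1.0, -1.0]) @ R @ q   # d V_row / d alpha
    grad_col = p @ C @ np.array([1.0, -1.0])   # d V_col / d beta
    alpha, beta = clip(alpha + eta * grad_row), clip(beta + eta * grad_col)

print("final strategies (still circling):", round(alpha, 3), round(beta, 3))
print("average payoff (close to 0):", round(float(np.mean(payoffs)), 4))
```
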
Literature

■ Original work on gradient ascent in general-sum games:

  Singh, Kearns, and Mansour (2000). “Nash Convergence of Gradient Dynamics in General-Sum Games”. In: Proc. of the Sixteenth Conf. on Uncertainty in Artificial Intelligence (UAI), pp. 541-548.

■ Today’s presentation was mainly based on this conference publication:

  Bowling and Veloso (2001). “Convergence of Gradient Dynamics with a Variable Learning Rate”. In: Proc. of the Eighteenth Int. Conf. on Machine Learning (ICML), pp. 27-34, June 2001.

  The conference publication was elaborated and published as a journal article:

  Bowling and Veloso (2002). “Multiagent Learning Using a Variable Learning Rate”. In: Artificial Intelligence 136, pp. 215-250, 2002.

What next?

■ With fictitious play, or gradient ascent, opponents are modelled by a single mixed strategy.

■ With Bayesian play, opponents are modelled by a probability distribution over all opponent strategies,

  Π_{j≠i} ∆(X_j)^H .

  • ∆(A) denotes the set of all probability distributions over A.
  • B^A denotes the set of all functions from A to B.
  • Π_{j≠i} A_j denotes the Cartesian product of {A_j}_{j≠i}. In the case of a finite product, this can be written as Π_{j≠i} A_j = A_1 × A_2 × · · · × A_{i−1} × A_{i+1} × · · · × A_n.