

SLIDE 1

Multi-agent learning: Fictitious Play

Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands. Last modified on February 27th, 2012 at 18:35.

SLIDE 2

Fictitious play: motivation

  • Rather than considering your own payoffs, monitor the behaviour of your opponent(s), and respond optimally.
  • The behaviour of an opponent is projected onto a single mixed strategy.
  • Brown (1951): proposed as an explanation for Nash equilibrium play. In terms of current use, the name is a bit of a misnomer, since play actually occurs (Berger, 2005).
  • One of the most important, if not the most important, representatives of a follower strategy.

SLIDE 3

Plan for today

Part I. Best reply strategy
  • 1. Pure fictitious play.
  • 2. Results that connect pure fictitious play to Nash equilibria.

Part II. Extensions and approximations of fictitious play
  • 1. Smoothed fictitious play.
  • 2. Exponentiated regret matching.
  • 3. No-regret property of smoothed fictitious play (Fudenberg et al., 1995).
  • 4. Convergence of better-reply strategies when players have limited memory and are inert [tend to stick to their current strategy] (Young, 1998).

Literature:
  • Shoham et al. (2009): Multi-agent Systems. Ch. 7: “Learning and Teaching”.
  • H. Young (2004): Strategic Learning and its Limits, Oxford UP.
  • D. Fudenberg and D.K. Levine (1998): The Theory of Learning in Games, MIT Press.

SLIDE 4

Part I: Pure fictitious play

SLIDE 5

Repeated Coordination Game

Players receive payoff p > 0 iff they coordinate. This game possesses three Nash equilibria, viz. (0, 0), (0.5, 0.5), and (1, 1).

Beliefs are counts of the opponent's past actions (L, R), shown after the round's play; row 0 gives the initial counts. An asterisk marks a round in which the player's belief counts are tied, so that it is indifferent and breaks the tie arbitrarily.

Round | A's action | B's action | A's beliefs | B's beliefs
0     |            |            | (0.0, 0.0)  | (0.0, 0.0)
1     | L*         | R*         | (0.0, 1.0)  | (1.0, 0.0)
2     | R          | L          | (1.0, 1.0)  | (1.0, 1.0)
3     | L*         | R*         | (1.0, 2.0)  | (2.0, 1.0)
4     | R          | L          | (2.0, 2.0)  | (2.0, 2.0)
5     | R*         | R*         | (2.0, 3.0)  | (2.0, 3.0)
6     | R          | R          | (2.0, 4.0)  | (2.0, 4.0)
7     | R          | R          | (2.0, 5.0)  | (2.0, 5.0)
…     | …          | …          | …           | …
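The belief updating in this table is mechanical enough to simulate. Below is a minimal sketch of two-player pure fictitious play (not part of the slides; the payoff convention, the priors, and the uniformly random tie-breaking are my assumptions — the slides leave tie-breaking unspecified):

```python
import numpy as np

def best_reply(payoffs, opp_counts, rng):
    """Best response against the empirical mixed strategy of the opponent.

    payoffs[i, j] = this player's payoff for own action i against opponent action j.
    Ties are broken uniformly at random (an assumption)."""
    total = opp_counts.sum()
    belief = opp_counts / total if total > 0 else np.full(len(opp_counts), 1 / len(opp_counts))
    expected = payoffs @ belief                     # expected payoff of each own action
    best = np.flatnonzero(np.isclose(expected, expected.max()))
    return int(rng.choice(best))

def fictitious_play(payoff_a, payoff_b, rounds, prior_a=None, prior_b=None, seed=0):
    """Pure fictitious play for two players; payoff_x[i, j] is indexed
    (A's action i, B's action j). Returns the list of joint actions played."""
    rng = np.random.default_rng(seed)
    n_a, n_b = payoff_a.shape
    counts_a = np.zeros(n_b) if prior_a is None else np.asarray(prior_a, float)  # A's counts of B's actions
    counts_b = np.zeros(n_a) if prior_b is None else np.asarray(prior_b, float)  # B's counts of A's actions
    history = []
    for _ in range(rounds):
        a = best_reply(payoff_a, counts_a, rng)
        b = best_reply(payoff_b.T, counts_b, rng)   # transpose: B's own action must index the rows
        counts_a[b] += 1
        counts_b[a] += 1
        history.append((a, b))
    return history

# The coordination game of this slide: payoff p iff both choose the same action
# (actions 0 = L, 1 = R). One possible run; the tie-breaks are random.
p = 1.0
coord = np.array([[p, 0.0], [0.0, p]])
print(fictitious_play(coord, coord, rounds=8))
```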

SLIDE 6

Steady states are pure (but possibly weak) Nash equilibria

Definition (Steady state). An action profile a is a steady state (or absorbing state) of fictitious play if, whenever a is played at round t, it is inevitably also played at round t + 1.

  • Theorem. If a pure strategy profile is a steady state of fictitious play, then it is a (possibly weak) Nash equilibrium of the stage game.
  • Proof. Suppose a = (a1, . . . , an) is a steady state. Consequently, i's opponent model converges to a−i, for all i. By definition of fictitious play, i plays best responses to a−i, i.e.,

    ∀i : ai ∈ BR(a−i).

  The latter is precisely the definition of a Nash equilibrium. Still, the resulting Nash equilibrium is usually strict: at a weak equilibrium the process is likely to drift away, because alternative best responses exist.

SLIDE 7

Pure strict Nash equilibria are steady states

  • Theorem. If a pure strategy profile is a strict Nash equilibrium of a stage game, then it is a steady state of fictitious play in the repeated game. (Notice the use of terminology: “pure strategy profile” for Nash equilibria; “action profile” for steady states.)
  • Proof. Suppose a is a pure strict Nash equilibrium and ai is played at round t, for all i. Because a is strict, ai is the unique best response to a−i. Because this argument holds for each i, action profile a will be played again in round t + 1.
  • Summary of the two theorems:

    Pure strict Nash ⇒ Steady state ⇒ Pure Nash.

  But what if pure Nash equilibria do not exist?

SLIDE 8

Repeated game of Matching Pennies

Zero-sum game. A's goal is to have the pennies matched; B maintains the opposite. Beliefs are counts of the opponent's past actions (H, T); the run starts from the non-zero initial counts in row 0.

Round | A's action | B's action | A's beliefs | B's beliefs
0     |            |            | (1.5, 2.0)  | (2.0, 1.5)
1     | T          | T          | (1.5, 3.0)  | (2.0, 2.5)
2     | T          | H          | (2.5, 3.0)  | (2.0, 3.5)
3     | T          | H          | (3.5, 3.0)  | (2.0, 4.5)
4     | H          | H          | (4.5, 3.0)  | (3.0, 4.5)
5     | H          | H          | (5.5, 3.0)  | (4.0, 4.5)
6     | H          | H          | (6.5, 3.0)  | (5.0, 4.5)
7     | H          | T          | (6.5, 4.0)  | (6.0, 4.5)
8     | H          | T          | (6.5, 5.0)  | (7.0, 4.5)
…     | …          | …          | …           | …
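For reference, this run can be reproduced with the fictitious_play sketch from SLIDE 5, assuming payoffs ±1 and feeding the counts of row 0 in as priors:

```python
# Matching Pennies with the sketch from SLIDE 5 (actions 0 = H, 1 = T; payoffs ±1 assumed).
# A wins when the pennies match; B wins when they differ.
import numpy as np
mp_a = np.array([[1.0, -1.0], [-1.0, 1.0]])
history = fictitious_play(mp_a, -mp_a, rounds=8, prior_a=(1.5, 2.0), prior_b=(2.0, 1.5))
print(history)   # (T,T), (T,H), (T,H), (H,H), ... : the table's run, up to tie-breaking
```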

SLIDE 9

Convergent empirical distribution of strategies

  • Theorem. If the empirical distribution of each player's strategies converges in fictitious play, then it converges to a Nash equilibrium.
  • Proof. Same as before. If the empirical distributions converge to q, then i's opponent model converges to q−i, for all i. By definition of fictitious play, qi ∈ BR(q−i). Because of convergence, all such (mixed) best replies remain the same. By definition we have a Nash equilibrium.
  • Remarks:
  • 1. The qi may be mixed.
  • 2. It actually suffices that the q−i converge asymptotically to the actual distribution (Fudenberg & Levine, 1998).
  • 3. If the empirical distributions converge (hence converge to a Nash equilibrium), the actually played responses per stage need not be Nash equilibria of the stage game.

SLIDE 10

Empirical distributions converge to Nash ⇏ stage play is Nash

Repeated Coordination Game. Players receive payoff p > 0 iff they coordinate.

Round | A's action | B's action | A's beliefs | B's beliefs
0     |            |            | (0.5, 1.0)  | (1.0, 0.5)
1     | B          | A          | (1.5, 1.0)  | (1.0, 1.5)
2     | A          | B          | (1.5, 2.0)  | (2.0, 1.5)
3     | B          | A          | (2.5, 2.0)  | (2.0, 2.5)
4     | A          | B          | (2.5, 3.0)  | (3.0, 2.5)
…     | …          | …          | …           | …

  • This game possesses three equilibria, viz. (0, 0), (0.5, 0.5), and (1, 1), with expected payoffs p, p/2, and p (i.e. 1, 0.5, and 1 for p = 1), respectively.
  • The empirical distribution of play converges to (0.5, 0.5), yet the realised payoff is 0 rather than p/2: the players miscoordinate in every single round.

SLIDE 11

Empirical distribution of play need not converge

Rock-paper-scissors where the winner receives payoff p > 0, and both players receive zero otherwise.

  • Rock-paper-scissors with these payoffs is known as the Shapley game.
  • The Shapley game possesses one equilibrium, viz. (1/3, 1/3, 1/3), with expected payoff p/3.

Beliefs are counts over (Rock, Paper, Scissors):

Round | A's action | B's action | A's beliefs     | B's beliefs
0     |            |            | (0.0, 0.0, 0.5) | (0.0, 0.5, 0.0)
1     | Rock       | Scissors   | (0.0, 0.0, 1.5) | (1.0, 0.5, 0.0)
2     | Rock       | Paper      | (0.0, 1.0, 1.5) | (2.0, 0.5, 0.0)
3     | Rock       | Paper      | (0.0, 2.0, 1.5) | (3.0, 0.5, 0.0)
4     | Scissors   | Paper      | (0.0, 3.0, 1.5) | (3.0, 0.5, 1.0)
5     | Scissors   | Paper      | (0.0, 4.0, 1.5) | (3.0, 0.5, 2.0)
…     | …          | …          | …               | …
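Fed the Shapley payoffs and the priors of row 0, the sketch from SLIDE 5 exhibits exactly this non-convergence (a usage sketch; p = 1 assumed):

```python
# The Shapley game with the sketch from SLIDE 5 (actions 0 = Rock, 1 = Paper, 2 = Scissors).
import numpy as np
shapley = np.array([[0.0, 0.0, 1.0],    # Rock beats Scissors
                    [1.0, 0.0, 0.0],    # Paper beats Rock
                    [0.0, 1.0, 0.0]])   # Scissors beats Paper
history = fictitious_play(shapley, shapley.T, rounds=300,
                          prior_a=(0.0, 0.0, 0.5), prior_b=(0.0, 0.5, 0.0))
# The empirical frequencies keep cycling in ever-longer runs and never settle.
```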

SLIDE 12

Repeated Shapley Game: Phase Diagram

[Figure: phase diagram over the simplex with vertices Rock, Paper and Scissors; the empirical play cycles between the vertices and does not converge.]

SLIDE 13

Part II: Extensions and approximations of fictitious play

SLIDE 14

Proposed extensions to fictitious play

Build forecasts not on the complete history, but on:

  • Recent data, say the m most recent rounds.
  • Discounted data, say with discount factor γ.
  • Perturbed data, say with error ǫ on individual observations.
  • Random samples of historical data, say random samples of size m.

Respond not necessarily with best replies, but:

  • ǫ-greedily.
  • Perturbed throughout, with small random shocks.
  • Randomly, proportional to expected payoff.

(A sketch of the first two forecast variants follows below.)
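Only the formation of the belief counts changes in the first list. A sketch of the first two variants (function and parameter names are mine):

```python
import numpy as np

def windowed_counts(opp_history, n_actions, m):
    """Belief counts from only the m most recent opponent actions."""
    counts = np.zeros(n_actions)
    for action in opp_history[-m:]:
        counts[action] += 1
    return counts

def discounted_counts(opp_history, n_actions, gamma):
    """Belief counts in which an observation that is t rounds old carries weight gamma**t."""
    counts = np.zeros(n_actions)
    weight = 1.0
    for action in reversed(opp_history):
        counts[action] += weight
        weight *= gamma
    return counts
```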

SLIDE 15

Framework for predictive learning (like fictitious play)

A forecasting rule for player i is a function that maps a history to a probability distribution over the opponents' actions in the next round: fi : H → ∆(X−i).

A response rule for player i is a function that maps a history to a probability distribution over i's own actions in the next round: gi : H → ∆(Xi).

A predictive learning rule for player i is the combination of a forecasting rule and a response rule, typically written (fi, gi).

  • This framework can be attributed to J.S. Jordan (1993).
  • Forecasting and response functions are deterministic.
  • Reinforcement and regret learning do not fit this framework: they are not concerned with prediction.
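In code, a predictive learning rule is simply a pair of functions with these signatures (a sketch; the type names are mine):

```python
from typing import Callable, Sequence, Tuple

History = Sequence[Tuple[int, ...]]                          # one joint action per round
Distribution = Sequence[float]                               # probabilities, one per action

ForecastRule = Callable[[History], Distribution]             # f_i : H -> Delta(X_-i)
ResponseRule = Callable[[History], Distribution]             # g_i : H -> Delta(X_i)
PredictiveLearningRule = Tuple[ForecastRule, ResponseRule]   # (f_i, g_i)
```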

SLIDE 16

Forecasting and response rules for fictitious play

Let ht ∈ Ht be a history of play up to and including round t, and let φjt =Def the empirical distribution of j's actions up to and including round t. Then the fictitious play forecasting rule is given by

    fi(ht) =Def ∏j≠i φjt.

Let fi be a fictitious play forecasting rule. Then gi is said to be a fictitious play response rule if all its values are best responses to the values of fi.

Remarks:

  • 1. Player i attributes a mixed strategy φjt to player j. This strategy reflects the number of times each action has been played by j.
  • 2. The mixed strategies of different opponents are assumed to be independent.
  • 3. Both (1) and (2) are simplifying assumptions.

SLIDE 17

Smoothed fictitious play

Notation:
  p−i : strategy profile of the opponents as predicted by fi in round t.
  ui(xi, p−i) : expected utility of action xi, given p−i.
  qi : strategy of player i in round t + 1, i.e., gi(ht).

Task: define qi, given p−i and ui(xi, p−i). Idea: respond randomly, but (somehow) proportionally to expected payoff. Elaborations of this idea:

a) Strictly proportional:

    qi(xi | p−i) =Def ui(xi, p−i) / ∑x′i∈Xi ui(x′i, p−i).

b) Through what is called mixed logit:

    qi(xi | p−i) =Def e^(ui(xi, p−i)/γi) / ∑x′i∈Xi e^(ui(x′i, p−i)/γi).
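Both elaborations are a few lines of code. A sketch (names are mine; subtracting the maximum is a standard numerical safeguard, not part of the slide):

```python
import numpy as np

def proportional_response(expected_payoffs):
    """(a) Probabilities strictly proportional to expected payoffs (assumed positive)."""
    u = np.asarray(expected_payoffs, float)
    return u / u.sum()

def logit_response(expected_payoffs, gamma):
    """(b) Mixed logit (quantal response) with temperature gamma > 0."""
    u = np.asarray(expected_payoffs, float) / gamma
    w = np.exp(u - u.max())        # shift by the max to avoid overflow; result unchanged
    return w / w.sum()

# The limiting behaviour of the next slide: large gamma spreads evenly,
# small gamma concentrates on the best reply.
print(logit_response([1.0, 0.9, 0.2], gamma=10.0))   # close to uniform
print(logit_response([1.0, 0.9, 0.2], gamma=0.01))   # nearly all mass on action 0
```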

SLIDE 18

Mixed logit, or quantal response function

  • Let d1 + · · · + dn = 1 and dj ≥ 0. Define

    logit(di) =Def e^(di/γ) / ∑j e^(dj/γ), where γ > 0.

  • The logit function can be seen as a soft maximum on n variables:

    γ ↓ 0 : logit “shares” 1 among all maximal di
    γ = 1 : logit weights the di exponentially (proportional to e^di)
    γ → ∞ : logit “spreads” 1 evenly among all di

Mixed logit can be justified in different ways:

a) On the basis of information and entropy arguments.
b) By assuming the dj are i.i.d. extreme value (a.k.a. log-Weibull) distributed.

Anderson et al. (1992): Discrete Choice Theory of Product Differentiation. Sec. 2.6.1: “Derivation of the Logit”.

SLIDE 19

Evenly (γ → ∞) → mixed logit → best response only (γ ↓ 0)

[Figure: logit response curves for decreasing γ, interpolating between the uniform distribution and the best response.]

As the figure shows, mixed logit respects best replies, but leaves room for experimentation.

SLIDE 20

Digression: Coding theory and entropy

This digression tries to answer the following question: why does play according to a diversified strategy yield more information than play according to a strategy in which only a few options are played?

  • To send 8 different binary-encoded messages costs 3 bits. The encoded messages are 000, 001, . . . , 111.
  • To encode 16 different messages, we would need log2 16 = 4 bits.
  • To encode 20 different messages, we would need ⌈log2 20⌉ = ⌈4.32⌉ = 5 bits.

If some messages are sent more frequently than others, it pays off to search for a code in which the messages that occur more frequently are represented by short code words (at the expense of the messages that are sent less frequently, which must then be represented by the remaining, longer code words).

SLIDE 21

Coding theory and entropy (continued)

Example. Suppose persons A and B work on a dark terrain. They are separated, and can only communicate in Morse code through a flashlight. A and B have agreed to send only the following messages:

  m1  Yes
  m2  No
  m3  All well?
  m4  Shall I come over?

A possible encoding could be

  Code 1:  m1 → 00, m2 → 01, m3 → 10, m4 → 11

SLIDE 22

Coding theory and entropy (continued)

Another encoding could be

  Code 2:  m1 → 0, m2 → 10, m3 → 110, m4 → 111

To prevent ambiguity, no code word may be a prefix of another code word. A useless encoding would be

  Code 3:  m1 → 0, m2 → 1, m3 → 00, m4 → 01

Under Code 3, the sequence 0101 may mean different things, such as m1, m2, m1, m2, or m1, m2, m4. (There are still other possibilities.)

  • The objective is to search for an efficient encoding, i.e., an encoding that minimises the expected number of bits per message.
  • If the relative frequency of the messages is known, we can compute, for every code, the expected number of bits per message, and hence its efficiency.

SLIDE 23

Coding theory and entropy [end of digression]

The following would be a plausible probability distribution:

  m1  Yes                 1/2
  m2  No                  1/4
  m3  All well?           1/8
  m4  Shall I come over?  1/8

For Code 2, E[number of bits] = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75. For Code 1, the expected number of bits is 2.0. Therefore, Code 2 is more efficient than Code 1.

Theorem (Noiseless Coding Theorem, Shannon). p1 log2(1/p1) + · · · + pn log2(1/pn) is a lower bound for the expected number of bits in an encoding of n messages with expected occurrence (p1, . . . , pn). This number is called the entropy of (p1, . . . , pn). Alternatively, entropy is −[p1 log2(p1) + · · · + pn log2(pn)].

The entropy of this distribution equals 1.75, which Code 2 attains. Therefore, Code 2 is optimal.
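The slide's arithmetic can be checked in a few lines (a sketch using the codes and the distribution above):

```python
import math

p = {"m1": 1/2, "m2": 1/4, "m3": 1/8, "m4": 1/8}
code1 = {"m1": "00", "m2": "01", "m3": "10", "m4": "11"}
code2 = {"m1": "0", "m2": "10", "m3": "110", "m4": "111"}

def expected_bits(code):
    """Expected code-word length under the message distribution p."""
    return sum(p[m] * len(word) for m, word in code.items())

entropy = sum(q * math.log2(1 / q) for q in p.values())
print(expected_bits(code1), expected_bits(code2), entropy)   # 2.0 1.75 1.75
```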

SLIDE 24

Smoothed fictitious play (Fudenberg & Levine, 1995)

Smoothed fictitious play is a generalisation of mixed logit. Let wi : ∆i → R be a function that “grades” i's probability distributions (over actions) under the following conditions:

  • 1. Grading is smooth (wi is infinitely often differentiable).
  • 2. Grading is strictly concave (a bump), in such a manner that ∇wi(qi) → ∞ (the grading becomes steep) whenever qi approaches the boundary of ∆i (whenever distributions become extremely uneven).

Let

    Ui(qi, p−i) =Def ui(qi, p−i) + γi · wi(qi).

Let fi be fictitious play forecasting and let gi correspond to a best response based on Ui. Then (fi, gi) is called smoothed fictitious play with smoothing function wi and smoothing parameter γi.
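A standard choice of smoothing function, consistent with the two conditions above though not spelled out on the slide, is the entropy wi(qi) = −∑xi qi(xi) ln qi(xi). Maximising Ui(qi, p−i) over the simplex with the constraint ∑xi qi(xi) = 1 (a short Lagrange-multiplier computation) gives, for each action xi,

    ui(xi, p−i) − γi (ln qi(xi) + 1) − μ = 0,  hence  qi(xi) ∝ e^(ui(xi, p−i)/γi),

which is exactly the mixed-logit response of SLIDE 17. In this sense smoothed fictitious play generalises mixed logit.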

SLIDE 25

Smoothed fictitious play limits regret

Theorem (Fudenberg & Levine, 1995). Let G be a finite game and let ǫ > 0. If a given player uses smoothed fictitious play with a sufficiently small smoothing parameter, then with probability one his regrets are bounded above by ǫ.

– Young does not reproduce the proof of Fudenberg et al., but shows that in this case ǫ-regret can be derived from a later and more general result of Hart and Mas-Colell (2001).
– This later result identifies a large family of rules that eliminate regret, based on an extension of Blackwell's approachability theorem. (Roughly, Blackwell's approachability theorem generalises maxmin reasoning to vector-valued payoffs.)

Fudenberg & Levine (1995). “Consistency and cautious fictitious play,” Journal of Economic Dynamics and Control, Vol. 19(5-7), pp. 1065-1089.
Hart & Mas-Colell (2001). “A General Class of Adaptive Strategies,” Journal of Economic Theory, Vol. 98(1), pp. 26-54.

SLIDE 26

Smoothed fictitious play converges to ǫ-CCE

  • Definition. A coarse correlated equilibrium (CCE) is a probability distribution on strategy profiles, q ∈ ∆(X), such that no player can opt out (to gain expected utility) before q is made known. In a coarse correlated ǫ-equilibrium (ǫ-CCE), no player can opt out to gain more than ǫ in expectation.

Theorem (Fudenberg & Levine, 1995). Let G be a finite game and let ǫ > 0. If all players use smoothed fictitious play with sufficiently small smoothing parameters, then with probability one empirical play will converge to the set of coarse correlated ǫ-equilibria.

Summary of the two theorems: smoothed fictitious play limits regret and converges to ǫ-CCE. There is another learning method with no regret and convergence to zero-CCE . . .

SLIDE 27

There are more coarse correlated equilibria than correlated equilibria than Nash equilibria

Simple coordination game:

              Other: Left | Other: Right
  You: Left   (1, 1)      | (0, 0)
  You: Right  (0, 0)      | (1, 1)

[Figure: the equilibrium sets of this game. In this picture, CCE = CE.]

SLIDE 28

Exponentiated regret matching

Let
  j : an action, where 1 ≤ j ≤ k;
  ūit : i's realised average payoff up to and including round t;
  φ−it : the realised joint empirical distribution of i's opponents;
  ūi(j, φ−it) : i's hypothetical average payoff for playing action j against φ−it;
  r̄it : player i's regret vector in round t, with components r̄it(j) = ūi(j, φ−it) − ūit.

Exponentiated regret matching (PY, p. 59) is defined as

    qi(t+1)(j) ∝ [r̄it(j)]+^a, where a > 0.

(For a = 1 we have ordinary regret matching.) An extended theorem on regret matching (Mas-Colell et al., 2001) ensures that individual players have no regret with probability one, and that the empirical distribution of play converges to the set of coarse correlated equilibria (PY, p. 60).
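A sketch of the response rule in isolation, assuming the regret vector r̄it has already been computed; the uniform fallback when no regret is positive is my assumption (conventions differ):

```python
import numpy as np

def exponentiated_regret_matching(regret_vector, a):
    """Play each action with probability proportional to its positive regret raised to a."""
    r_plus = np.maximum(np.asarray(regret_vector, float), 0.0) ** a
    total = r_plus.sum()
    if total == 0.0:               # no positive regret; fallback is an assumption: mix uniformly
        return np.full(len(r_plus), 1 / len(r_plus))
    return r_plus / total

print(exponentiated_regret_matching([0.4, -0.1, 0.2], a=1.0))  # ordinary regret matching: [2/3, 0, 1/3]
print(exponentiated_regret_matching([0.4, -0.1, 0.2], a=8.0))  # nearly all mass on the best action
```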

SLIDE 29

FP vs. Smoothed FP vs. Exponentiated RM

Fictitious play
  • Plays best responses.
  • Does depend on past play of the opponent(s).
  • Puts zero probability on sub-optimal responses.

Smoothed fictitious play
  • Plays sub-optimal responses, e.g., softmax-proportionally to their estimated payoffs.
  • Does depend on past play of the opponent(s).
  • Puts non-zero probability on sub-optimal responses.
  • Approaches fictitious play when γi ↓ 0 (PY, p. 84).

Exponentiated regret matching
  • Plays sub-optimal responses, i.e., proportionally to a power of positive regret.
  • Does depend on own past payoffs.
  • Puts non-zero probability on sub-optimal responses.
  • Approaches fictitious play when the exponent a → ∞ (PY, p. 84).

SLIDE 30

FP vs. Smoothed FP vs. Exponentiated RM

                                               | FP     | Smoothed FP                              | Exponentiated RM
Depends on past play of opponents              | √      | √                                        | −
Depends on own past payoffs                    | −      | −                                        | √
Puts zero probability on sub-optimal responses | √      | −                                        | −
Best response                                  | always | when smoothing parameter γi ↓ 0          | when exponent a → ∞
Individual no-regret                           | −      | within ǫ > 0, almost always (PY, p. 82)  | exact, almost always (PY, p. 60)
Collective convergence to coarse correlated    | −      | within ǫ > 0, almost always (PY, p. 83)  | exact, almost always (PY, p. 60)
equilibria                                     |        |                                          |

SLIDE 31

Part III: Finite memory and inertia

SLIDE 32

Finite memory: motivation

  • In their basic version, most learning rules rely on the entire history of play.
  • People, as well as computers, have finite memory. (On the other hand, for average or discounted payoffs this is no real problem.)
  • Nevertheless: experiences in the distant past are apt to be less relevant than more recent ones.
  • Idea: let players have a finite memory of m rounds.

SLIDE 33

Inertia: motivation

  • When players' strategies are constantly re-evaluated, discontinuities in behaviour are likely to occur. Example: the asymmetric coordination game.
  • Discontinuities in behaviour are less likely to lead to equilibria of any sort.
  • Idea: let players play the same action as in the previous round with probability λ.

SLIDE 34

Weakly acyclic games

  • Game G with action space X.
  • Better-reply graph G′ = (V, E), where V = X and

    E = { (x, y) | for some i : y−i = x−i and ui(yi, y−i) > ui(xi, y−i) }.

  • For all x ∈ X: x is a sink iff x is a (pure) Nash equilibrium.
  • G is said to be weakly acyclic under better replies if every node is connected to a sink.
  • Weakly acyclic under better replies ⇒ ∃ Nash equilibrium.

Example (read as a 3 × 3 bimatrix, row player's payoff first):

  (1, 1) (2, 4) (4, 2)
  (1, 1) (4, 2) (2, 4)
  (3, 3) (1, 1) (1, 1)
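Sinks of the better-reply graph can be found mechanically. A sketch for the example above (the 3 × 3 reading of the flattened matrix is my assumption):

```python
import numpy as np
from itertools import product

# The bimatrix of this slide, split into row-player and column-player payoffs.
R = np.array([[1, 2, 4], [1, 4, 2], [3, 1, 1]])   # row player
C = np.array([[1, 4, 2], [1, 2, 4], [3, 1, 1]])   # column player

edges = []
for i, j in product(range(3), range(3)):
    edges += [((i, j), (i2, j)) for i2 in range(3) if R[i2, j] > R[i, j]]   # row player improves
    edges += [((i, j), (i, j2)) for j2 in range(3) if C[i, j2] > C[i, j]]   # column player improves

sinks = set(product(range(3), range(3))) - {x for x, _ in edges}
print(sinks)   # {(2, 0)}: the profile with payoffs (3, 3), the unique sink / pure Nash equilibrium
```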

SLIDE 35

Examples of weakly acyclic games

Coordination games. Two-person games with identical action sets for both players, where the best responses form the diagonal of the joint action space.

Potential games (Monderer and Shapley, 1996). There is a function ρ : X → R, called the potential, such that for every player i and all action profiles x, y ∈ X:

    y−i = x−i ⇒ ui(yi, y−i) − ui(xi, x−i) = ρ(y) − ρ(x).

Example: congestion games.

Why potential games are weakly acyclic: the potential strictly increases along every better-reply path ⇒ paths cannot cycle ⇒ in finite graphs, every better-reply path must end (in a sink).

SLIDE 36

Weakly acyclic games under finite memory and inertia

  • Theorem. Let G be a finite weakly acyclic n-person game. Every better-reply process with finite memory and inertia converges to a pure Nash equilibrium of G.
  • Proof (outline).
  • 1. Let the state space Z be X^m (the last m joint actions).
  • 2. A state x̄ ∈ X^m is called homogeneous if it consists of m identical action profiles x; such a state is denoted by x̄. Let Z∗ =Def { homogeneous states }.
  • 3. It will be shown below that the process hits Z∗ infinitely often.
  • 4. It will be shown below that the overall probability to play any action is bounded away from zero.
  • 5. It can easily be seen that the absorbing states are exactly Z∗ ∩ Pure Nash, i.e., the homogeneous states built from pure Nash equilibria.
  • 6. It will be shown that, due to weak acyclicity, inertia, and (4), the process eventually lands in an absorbing state which, by (5), is a repeated pure Nash equilibrium.

(A simulation sketch of such a process follows below.)
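The following minimal simulation is one concrete instance of the class of processes the theorem covers (all parameter choices and the specific better-reply rule — a uniformly random strict better reply against a profile sampled from memory — are my assumptions):

```python
import numpy as np

def better_reply_process(R, C, m=3, lam=0.5, rounds=2000, seed=0):
    """Better-reply play with memory m and inertia lam on a bimatrix game (R, C).

    Each round, each player independently repeats its previous action with
    probability lam (inertia); otherwise it plays a uniformly random strict
    better reply against a profile drawn from its memory of the last m
    rounds, keeping the current action if no better reply exists."""
    rng = np.random.default_rng(seed)
    n_r, n_c = R.shape
    memory = [(int(rng.integers(n_r)), int(rng.integers(n_c)))]

    def move(current, my_payoffs):
        if rng.random() < lam:
            return current                                   # inertia
        better = np.flatnonzero(my_payoffs > my_payoffs[current])
        return int(rng.choice(better)) if len(better) else current

    for _ in range(rounds):
        i, j = memory[-1]
        ref_i, ref_j = memory[rng.integers(len(memory))]     # sample a remembered profile
        memory.append((move(i, R[:, ref_j]), move(j, C[ref_i, :])))
        memory = memory[-m:]
    return memory[-1]

# The weakly acyclic example from SLIDE 34: the process typically settles in
# the pure Nash equilibrium with payoffs (3, 3), i.e. indices (2, 0).
R = np.array([[1, 2, 4], [1, 4, 2], [3, 1, 1]])
C = np.array([[1, 4, 2], [1, 2, 4], [3, 1, 1]])
print(better_reply_process(R, C))
```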

SLIDE 37

First claim: the process hits Z∗ infinitely often

Let inertia be determined by λ > 0. Then Pr(all players repeat their previous action in a given round) ≥ λ^n, hence Pr(all players repeat their previous action during m subsequent rounds) ≥ λ^(nm). If all players repeat their previous action during m subsequent rounds, the process arrives at a homogeneous state. Hence, for all t,

    Pr(the process is in a homogeneous state at round t + m) ≥ λ^(nm).

Infinitely many disjoint histories of length m occur, hence infinitely many independent events “homogeneous at t + m” occur. Apply the (second) Borel–Cantelli lemma: if {En}n are independent events and ∑∞n=1 Pr(En) is unbounded, then Pr(an infinite number of the En occur) = 1.

SLIDE 38

Second claim: every action is played with probability at least γ > 0

A better-reply learning method maps states (finite histories) to strategies (probability distributions over actions), γi : Z → ∆(Xi), and possesses the following important properties: i) it is deterministic as a map from states to mixed strategies, and ii) every action is played with positive probability.

  • 1. Hence, let γi = inf{ γi(z)(xi) | z ∈ Z, xi ∈ Xi }. Since Z and Xi are finite, the “inf” is a “min”, and γi > 0.
  • 2. Similarly, let γ = inf{ γi | 1 ≤ i ≤ n }. Since there are finitely many players, the “inf” is a “min”, and γ > 0.

SLIDE 39

Final claim: the overall probability to reach a sink from Z∗ is positive

Suppose the process is in homogeneous state x̄.

  • 1. If x is pure Nash, we are done, because response functions are deterministic better replies: no player has a better reply, so x is repeated.
  • 2. If x is not pure Nash, there must be an edge x → y in the better-reply graph. Suppose this edge concerns player i, who switches to action yi. We know that yi is played with probability at least γ, irrespective of player and state.

Further probabilities:

  • All other players j ≠ i keep playing the same action: probability ≥ λ^(n−1).
  • Edge x → y is actually traversed: probability ≥ γλ^(n−1).
  • Profile y is maintained for another m − 1 rounds, so as to arrive at the homogeneous state ȳ: probability ≥ λ^(n(m−1)).
  • Hence, to travel from x̄ to ȳ: probability ≥ γλ^(n−1) · λ^(n(m−1)) = γλ^(nm−1).
  • The image x̄(1), . . . , x̄(l) of a better-reply path x(1), . . . , x(l) is followed to a sink with probability ≥ (γλ^(nm−1))^L, where L is the length of a longest better-reply path.

Since Z∗ is encountered infinitely often, the result follows.

SLIDE 40

Summary

  • With fictitious play, the behaviour of opponents is modelled by (or represented by, or projected onto) a single mixed strategy.
  • Fictitious play ignores sub-optimal actions.
  • There is a family of so-called better-reply learning rules that i) play sub-optimal actions, and ii) can be brought arbitrarily close to fictitious play.
  • In weakly acyclic n-person games, every better-reply process with finite memory and inertia converges to a pure Nash equilibrium.

SLIDE 41

What next?

Bayesian play:

  • With fictitious play, the behaviour of opponents is modelled by a single mixed strategy.
  • With Bayesian play, opponents are modelled by a probability distribution over a set of mixed strategies.

Gradient dynamics:

  • As in fictitious play, players model (or assess) each other through mixed strategies.
  • Strategies are not played, only maintained.
  • Due to CKR (common knowledge of rationality, cf. Hargreaves Heap & Varoufakis, 2004), all models of mixed strategies are correct (i.e., q−i = s−i, for all i).
  • Players gradually adapt their mixed strategies through hill-climbing in the payoff space.