Multi-agent learning: Fictitious Play

Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.


Slide 1. Multi-agent learning: Fictitious Play

Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands. Slides last processed on Tuesday 2nd March, 2010 at 13:53h.

Slide 2. Fictitious play: motivation

  • Rather than considering your own payoffs, monitor the behaviour of your opponent(s), and respond optimally.
  • The behaviour of an opponent is projected on a single mixed strategy.
  • Brown (1951): explanation for Nash equilibrium play. In terms of current use, the name is a bit of a misnomer, since play actually occurs (Berger, 2005).
  • One of the most important, if not the most important, representatives of a follower strategy.

Slide 3. Plan for today

Part I. Best reply strategy

  1. Pure fictitious play.
  2. Results that connect pure fictitious play to Nash equilibria.

Part II. Extensions and approximations of fictitious play

  1. Smoothed fictitious play.
  2. Exponential regret matching.
  3. No-regret property of smoothed fictitious play (Fudenberg et al., 1995).
  4. Convergence of better-reply strategies when players have limited memory and are inert [tend to stick to their current strategy] (Peyton Young, XXX).

Literature:

  • Shoham et al. (2009): Multi-agent Systems. Ch. 7: “Learning and Teaching”.
  • H. Peyton Young (2004): Strategic Learning and its Limits, Oxford UP.
  • D. Fudenberg and D.K. Levine (1998): The Theory of Learning in Games, MIT Press.

Slide 4. Part I: Pure fictitious play

Slide 5. Repeated Coordination Game

Players receive payoff p > 0 iff they coordinate. This game possesses three Nash equilibria, viz. (0, 0), (0.5, 0.5), and (1, 1).

  Round   A’s action   B’s action   A’s beliefs   B’s beliefs
    0.       —            —         (0.0, 0.0)    (0.0, 0.0)
  * 1.       A            B         (0.0, 1.0)    (1.0, 0.0)
    2.       B            A         (1.0, 1.0)    (1.0, 1.0)
  * 3.       A            B         (1.0, 2.0)    (2.0, 1.0)
    4.       B            A         (2.0, 2.0)    (2.0, 2.0)
  * 5.       B            B         (2.0, 3.0)    (2.0, 3.0)
    6.       B            B         (2.0, 4.0)    (2.0, 4.0)
    7.       B            B         (2.0, 5.0)    (2.0, 5.0)
   ...      ...          ...           ...           ...

(An asterisk marks a round in which beliefs were tied, so that best replies were chosen by tie-breaking. Beliefs are counts of the opponent playing (A, B).)
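The dynamics in the table can be reproduced with a small simulation. The sketch below is illustrative: function names are ours, beliefs are action counts as in the table, and ties are assumed to be broken by a fair coin flip.

```python
import random

def best_reply(beliefs):
    """Best reply in the pure coordination game: play the action the
    opponent is believed to play most often; break ties at random."""
    a_count, b_count = beliefs
    if a_count > b_count:
        return "A"
    if b_count > a_count:
        return "B"
    return random.choice(["A", "B"])

def fictitious_play(rounds, seed=0):
    random.seed(seed)
    beliefs_about_B = [0.0, 0.0]   # A's counts of B playing (A, B)
    beliefs_about_A = [0.0, 0.0]   # B's counts of A playing (A, B)
    history = []
    for _ in range(rounds):
        a_act = best_reply(beliefs_about_B)
        b_act = best_reply(beliefs_about_A)
        beliefs_about_B["AB".index(b_act)] += 1.0   # A observes B's action
        beliefs_about_A["AB".index(a_act)] += 1.0   # B observes A's action
        history.append((a_act, b_act))
    return history

history = fictitious_play(20)
print(history[-3:])
```

As in the table, play may miscoordinate while beliefs are tied, but the first coordinated round is a steady state: from then on both players best-reply with the coordinated action forever.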

Slide 6. Steady states are pure (but possibly weak) Nash equilibria

Definition (Steady state). An action profile a is a steady state (or absorbing state) of fictitious play if, whenever a is played at round t, it is also played at round t + 1.

  • Theorem. If a pure strategy profile is a steady state of fictitious play, then it is a (possibly weak) Nash equilibrium in the stage game.

  • Proof. Suppose s is a steady state of fictitious play. Consequently, i’s opponent model converges to s−i, for all i. If s were not Nash, one of the players would eventually deviate from si, which would contradict the assumption that s is a steady state.*

  • In practice, the resulting Nash equilibrium is often strict, because a weak equilibrium is unlikely to maintain the process in a steady state.

*A proof ad absurdum is not a preferred route, but sometimes it is more intuitive.

Slide 7. Pure strict Nash equilibria are steady states

  • Theorem. If a pure strategy profile is a strict Nash equilibrium of a stage game, then it is a steady state of fictitious play in the repeated game.

Notice the use of terminology: “pure strategy profile” for Nash equilibria; “action profile” for steady states.

  • Proof. Suppose s is a pure strict Nash equilibrium. Because s is pure, each si is deterministic (not a mix). Suppose s is played at round t. Because s is Nash, a best response to s−i is action si. (There might be others!) Because s is a strict equilibrium, si is the unique best response to s−i. Because this argument holds for each i, action profile s will be played in round t + 1 again.

  • Summary of the two theorems: pure strict Nash ⇒ steady state ⇒ pure Nash. But what if pure Nash equilibria do not exist?

Slide 8. Repeated game of Matching Pennies

Zero-sum game. A’s goal is to have the pennies matched.

  Round   A’s action   B’s action   A’s beliefs   B’s beliefs
    0.       —            —         (1.5, 2.0)    (2.0, 1.5)
    1.       T            T         (1.5, 3.0)    (2.0, 2.5)
    2.       T            H         (2.5, 3.0)    (2.0, 3.5)
    3.       T            H         (3.5, 3.0)    (2.0, 4.5)
    4.       H            H         (4.5, 3.0)    (3.0, 4.5)
    5.       H            H         (5.5, 3.0)    (4.0, 4.5)
    6.       H            H         (6.5, 3.0)    (5.0, 4.5)
    7.       H            T         (6.5, 4.0)    (6.0, 4.5)
    8.       H            T         (6.5, 5.0)    (7.0, 4.5)
   ...      ...          ...           ...           ...

(Beliefs are weights on the opponent playing (H, T).)

Slide 9. Convergent empirical distribution of strategies

  • Theorem. If the empirical distribution of each player’s strategies converges in fictitious play, then it converges to a Nash equilibrium.

  • Proof. Same as before. If the empirical distributions converge to s, then i’s opponent model converges to s−i, for all i. If s were not Nash, one of the players would eventually deviate from si, which would contradict the convergence of the empirical distribution.

Remarks:

  1. The si may be mixed.
  2. It actually suffices that the s−i converge asymptotically to the actual distribution.
  3. Even if empirical distributions converge (hence, converge to a Nash equilibrium), the actually played responses per stage need not be Nash equilibria of the stage game.
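The Matching Pennies game of Slide 8 illustrates both the theorem and remark 3: stage play keeps cycling, yet the empirical frequencies tend to the mixed Nash equilibrium. A sketch (the matcher/mismatcher roles, action encoding, and tie-breaking are our assumptions; the fractional initial belief weights are those of the Slide 8 table):

```python
def matching_pennies_fp(rounds):
    # Beliefs are weights on the opponent playing (H, T).
    beliefs_about_B = [1.5, 2.0]   # A's model of B
    beliefs_about_A = [2.0, 1.5]   # B's model of A
    heads_by_A = 0
    for _ in range(rounds):
        # A (the matcher) plays B's believed-more-likely action;
        # B (the mismatcher) plays the opposite of A's believed-more-likely action.
        a = 0 if beliefs_about_B[0] >= beliefs_about_B[1] else 1   # 0 = H, 1 = T
        b = 1 if beliefs_about_A[0] >= beliefs_about_A[1] else 0
        beliefs_about_B[b] += 1.0
        beliefs_about_A[a] += 1.0
        heads_by_A += (a == 0)
    return heads_by_A / rounds

print(matching_pennies_fp(10000))   # close to 0.5, the mixed Nash frequency
```

No single stage profile is ever a Nash equilibrium of Matching Pennies in pure actions, yet the empirical frequency of Heads approaches 0.5 for both players.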

Slide 10. Empirical distributions converge to Nash ⇏ stage Nash

Repeated Coordination Game. Players receive payoff p > 0 iff they coordinate.

  Round   A’s action   B’s action   A’s beliefs   B’s beliefs
    0.       —            —         (0.5, 1.0)    (1.0, 0.5)
    1.       B            A         (1.5, 1.0)    (1.0, 1.5)
    2.       A            B         (1.5, 2.0)    (2.0, 1.5)
    3.       B            A         (2.5, 2.0)    (2.0, 2.5)
    4.       A            B         (2.5, 3.0)    (3.0, 2.5)
   ...      ...          ...           ...           ...

  • This game possesses three equilibria, viz. (0, 0), (0.5, 0.5), and (1, 1), with expected payoffs p, p/2, and p, respectively.
  • The empirical distribution of play converges to (0.5, 0.5), but with realised payoff 0 rather than p/2.

Slide 11. Empirical distribution of play does not need to converge

Rock-paper-scissors. The winner receives payoff p > 0; otherwise the payoff is zero.

  • Rock-paper-scissors with these payoffs is known as the Shapley game.
  • The Shapley game possesses one equilibrium, viz. (1/3, 1/3, 1/3), with expected payoff p/3.

  Round   A’s action   B’s action   A’s beliefs       B’s beliefs
    0.       —            —        (0.0, 0.0, 0.5)   (0.0, 0.5, 0.0)
    1.     Rock        Scissors    (0.0, 0.0, 1.5)   (1.0, 0.5, 0.0)
    2.     Rock        Paper       (0.0, 1.0, 1.5)   (2.0, 0.5, 0.0)
    3.     Rock        Paper       (0.0, 2.0, 1.5)   (3.0, 0.5, 0.0)
    4.     Scissors    Paper       (0.0, 3.0, 1.5)   (3.0, 0.5, 1.0)
    5.     Scissors    Paper       (0.0, 4.0, 1.5)   (3.0, 0.5, 2.0)
   ...      ...          ...            ...               ...

(Beliefs are weights on the opponent playing (Rock, Paper, Scissors).)
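The non-convergence can be checked by simulation. A sketch (the action encoding and the tie-breaking rule are our assumptions; the initial belief weights are those of the table): the runs of repeated joint actions grow longer and longer, so play keeps cycling through the action space and the stage actions never settle.

```python
def shapley_fp(rounds):
    # Actions 0 = Rock, 1 = Paper, 2 = Scissors; beats[a] is the action a beats.
    beats = {0: 2, 1: 0, 2: 1}

    def best_reply(beliefs):
        # Expected payoff of action a is proportional to the belief weight
        # of the action it beats; ties are broken by the lowest index.
        return max(range(3), key=lambda a: (beliefs[beats[a]], -a))

    beliefs_about_B = [0.0, 0.0, 0.5]   # weights on (Rock, Paper, Scissors)
    beliefs_about_A = [0.0, 0.5, 0.0]
    runs, prev, run_len = [], None, 0   # lengths of maximal runs of one joint action
    for _ in range(rounds):
        a = best_reply(beliefs_about_B)
        b = best_reply(beliefs_about_A)
        beliefs_about_B[b] += 1.0
        beliefs_about_A[a] += 1.0
        if (a, b) == prev:
            run_len += 1
        else:
            if prev is not None:
                runs.append(run_len)
            prev, run_len = (a, b), 1
    runs.append(run_len)
    return runs

runs = shapley_fp(5000)
print(runs[:6], "...", runs[-3:])   # run lengths keep growing: no convergence
```

The first rounds of the simulation reproduce the table above (Rock/Scissors, then Rock/Paper, etc.).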

Slide 12. Repeated Shapley Game: Phase Diagram

[Phase diagram of beliefs in the repeated Shapley game; the three corners of the simplex are Rock, Paper, and Scissors.]

Slide 13. Part II: Extensions and approximations of fictitious play

Slide 14. Proposed extensions to fictitious play

Build forecasts, not on the complete history, but on:

  • Recent data, say the m most recent rounds.
  • Discounted data, say with discount factor γ.
  • Perturbed data, say with error ε on individual observations.
  • Random samples of historical data, say random samples of size m.

Give not necessarily best responses, but respond:

  • ε-greedily.
  • Perturbed throughout, with small random shocks.
  • Randomly, proportional to expected payoff.

Slide 15. Framework for predictive learning (like fictitious play)

A forecasting rule for player i is a function that maps a history to a probability distribution over the opponents’ actions in the next round: f_i : H → ∆(X−i).

A response rule for player i is a function that maps a history to a probability distribution over i’s own actions in the next round: g_i : H → ∆(X_i).

A predictive learning rule for player i is the combination of a forecasting rule and a response rule, typically written (f_i, g_i).

  • This framework can be attributed to J.S. Jordan (1993).
  • Forecasting and response functions are deterministic.
  • Reinforcement and regret learning do not fit this framework: they are not involved with prediction.
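The framework can be rendered as a pair of function types, with fictitious play as one instance. A minimal sketch (type names and helper functions are ours, not the slides'):

```python
from typing import Callable, Dict, List, Tuple

# A history is a list of joint actions; a distribution maps actions to
# probabilities.
History = List[Tuple[str, str]]
Dist = Dict[str, float]
ForecastingRule = Callable[[History], Dist]   # f_i : H -> Delta(X_-i)
ResponseRule = Callable[[History], Dist]      # g_i : H -> Delta(X_i)

def fp_forecast(history: History, opponent: int = 1) -> Dist:
    """Fictitious-play forecast: the empirical distribution of the
    opponent's past actions."""
    if not history:
        return {}
    counts: Dict[str, float] = {}
    for joint in history:
        counts[joint[opponent]] = counts.get(joint[opponent], 0.0) + 1.0
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def fp_response(history: History, payoff, actions) -> Dist:
    """Fictitious-play response: all mass on a best reply to the forecast."""
    forecast = fp_forecast(history)
    expected = lambda a: sum(p * payoff(a, b) for b, p in forecast.items())
    return {max(actions, key=expected): 1.0}

# Coordination game: payoff 1 iff the actions coincide.
coordinate = lambda a, b: 1.0 if a == b else 0.0
hist: History = [("A", "B"), ("B", "B"), ("B", "B")]
print(fp_forecast(hist))                            # {'B': 1.0}
print(fp_response(hist, coordinate, ["A", "B"]))    # {'B': 1.0}
```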

Slide 16. Forecasting and response rules for fictitious play

Let h^t ∈ H^t be a history of play up to and including round t, and let φ_j^t =def the empirical distribution of j’s actions up to and including round t. Then the fictitious play forecasting rule is given by

  f_i(h^t) =def ∏_{j ≠ i} φ_j^t.

Let f_i be a fictitious play forecasting rule. Then g_i is said to be a fictitious play response rule if all of its values are best responses to the values of f_i.

Remarks:

  1. Player i attributes a mixed strategy φ_j^t to player j. This strategy reflects the number of times each action has been played by j.
  2. The mixed strategies are assumed to be independent.
  3. Both (1) and (2) are simplifying assumptions.

Slide 17. Smoothed fictitious play

Notation:

  p−i : strategy profile of the opponents as predicted by f_i in round t.
  u_i(x_i, p−i) : expected utility of action x_i, given p−i.
  q_i : strategy profile of player i in round t + 1, i.e., g_i.

Task: define q_i given p−i and u_i(x_i, p−i). Idea: respond randomly, but (somehow) proportional to expected payoff. Elaborations of this idea:

a) Strictly proportional:

  q_i(x_i | p−i) =def u_i(x_i, p−i) / ∑_{x′_i ∈ X_i} u_i(x′_i, p−i).

b) Through what is called mixed logit:

  q_i(x_i | p−i) =def e^{u_i(x_i, p−i)/γ_i} / ∑_{x′_i ∈ X_i} e^{u_i(x′_i, p−i)/γ_i}.
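Both elaborations can be sketched in a few lines (function names are ours; utilities for rule (a) are assumed non-negative, since strict proportionality is otherwise undefined):

```python
import math

def proportional_response(utilities):
    """Rule (a): probabilities strictly proportional to expected utility.
    `utilities` maps each own action x_i to u_i(x_i, p_-i) >= 0."""
    total = sum(utilities.values())
    return {a: u / total for a, u in utilities.items()}

def logit_response(utilities, gamma):
    """Rule (b), mixed logit: probabilities proportional to exp(u / gamma)."""
    m = max(utilities.values())   # subtract the max for numerical stability
    weights = {a: math.exp((u - m) / gamma) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

u = {"A": 1.0, "B": 3.0}
print(proportional_response(u))         # {'A': 0.25, 'B': 0.75}
print(logit_response(u, gamma=0.1))     # almost all mass on B (soft maximum)
print(logit_response(u, gamma=100.0))   # almost uniform
```

The two printed logit distributions preview the limit behaviour of γ discussed on the next slide.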

Slide 18. Mixed logit, or quantal response function

  • Let d_1 + · · · + d_n = 1 with d_j ≥ 0, and define

  logit(d_i) =def e^{d_i/γ} / ∑_j e^{d_j/γ},  where γ > 0.

  • The logit function can be seen as a soft maximum on n variables:
    γ ↓ 0 : logit “shares” 1 among all maximal d_i;
    γ = 1 : logit is strictly proportional;
    γ → ∞ : logit “spreads” 1 evenly among all d_i.

  • Mixed logit can be justified in different ways:
    a) on the basis of information and entropy arguments;
    b) by assuming the d_j are i.i.d. extreme value distributed. This distribution arises as the limit of the maximum of n independent random variables (each exponentially distributed).

Slide 19. Digression: Coding theory and entropy

In this digression we try to answer the following question: why does play according to a diversified strategy yield more information than play according to a strategy where only a few options are played?

  • To send 8 different binary encoded messages costs 3 bits; the encoded messages are 000, 001, . . . , 111.
  • To encode 16 different messages, we need log2 16 = 4 bits.
  • To encode 20 different messages, we need ⌈log2 20⌉ = ⌈4.32⌉ = 5 bits.

If some messages are sent more frequently than others, it pays off to search for a code in which messages that occur more frequently are represented by short code words (at the expense of messages that are sent less frequently, which must then be represented by the remaining, longer code words).

Slide 20. Coding theory and entropy (continued)

Example. Suppose persons A and B work on a dark terrain. They are separated, and can only communicate in Morse code through a flashlight. A and B have agreed to send only the following messages:

  m1  Yes
  m2  No
  m3  All well?
  m4  Shall I come to you?

A possible encoding could be Code 1:

  m1 → 00    m2 → 01    m3 → 10    m4 → 11

Slide 21. Coding theory and entropy (continued)

Another encoding could be Code 2:

  m1 → 0    m2 → 10    m3 → 110    m4 → 111

To prevent ambiguity, no code word may be a prefix of another code word. A useless coding would be Code 3:

  m1 → 0    m2 → 1    m3 → 00    m4 → 01

Under Code 3, the sequence 0101 may mean different things, such as m1, m2, m1, m2, or m1, m2, m4. (There are still other possibilities.)

  • The objective is to search for an efficient encoding, i.e., an encoding that minimises the expected number of bits per message.
  • If the relative frequency of messages is known, we can compute for every code the expected number of bits per message, and hence its efficiency.

Slide 22. Coding theory and entropy [end of digression]

The following would be a plausible probability distribution:

  m1  Yes                   1/2
  m2  No                    1/4
  m3  All well?             1/8
  m4  Shall I come to you?  1/8

For Code 2, E[number of bits] = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75. For Code 1, the expected number of bits is 2.0. Therefore, Code 2 is more efficient than Code 1.

Theorem (Noiseless Coding Theorem, Shannon). p1 log2(1/p1) + · · · + pn log2(1/pn) is a lower bound for the expected number of bits in an encoding of n messages with expected occurrence (p1, . . . , pn). This number is called the entropy of (p1, . . . , pn). Alternatively, entropy is −[p1 log2(p1) + · · · + pn log2(pn)].

The entropy of this distribution is 1.75, which Code 2 attains. Therefore, Code 2 is optimal.
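The computation on this slide is easy to check mechanically. A small sketch (the probabilities and codes are those of the digression; Code 2's one-bit codeword for m1 is implied by the slide's own calculation):

```python
import math

# Message probabilities and the two codes discussed in the digression.
probs = {"m1": 0.5, "m2": 0.25, "m3": 0.125, "m4": 0.125}
code1 = {"m1": "00", "m2": "01", "m3": "10", "m4": "11"}
code2 = {"m1": "0", "m2": "10", "m3": "110", "m4": "111"}

def expected_bits(code, probs):
    """Expected number of bits per message under a given code."""
    return sum(probs[m] * len(word) for m, word in code.items())

def entropy(probs):
    """Shannon entropy: a lower bound on the expected bits per message
    of any unambiguous (prefix-free) code."""
    return -sum(p * math.log2(p) for p in probs.values())

print(expected_bits(code1, probs))   # 2.0
print(expected_bits(code2, probs))   # 1.75
print(entropy(probs))                # 1.75: Code 2 attains the bound
```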

Slide 23. Smoothed fictitious play (Fudenberg & Levine, 1995)

Smoothed fictitious play is a generalisation of mixed logit. Let w_i : ∆_i → R be a function that “grades” i’s probability distributions (over actions), under the following conditions:

  1. Grading is smooth (w_i is infinitely often differentiable).
  2. Grading is strictly concave (a bump), in such a manner that ∇w_i(q_i) → ∞ (steep) whenever q_i approaches the boundary of ∆_i (whenever distributions become extremely uneven).

Let

  U_i(q_i, p−i) =def u_i(q_i, p−i) + γ_i · w_i(q_i).

Let f_i be fictitious play forecasting and let g_i correspond to a best response based on U_i. Then (f_i, g_i) is called smoothed fictitious play with smoothing function w_i and smoothing parameter γ_i.

Slide 24. Smoothed fictitious play limits regret

Theorem (Fudenberg & Levine, 1995). Let G be a finite game and let ε > 0. If a given player uses smoothed fictitious play with a sufficiently small smoothing parameter, then with probability one his regrets are bounded above by ε.

  – Peyton Young does not reproduce the proof of Fudenberg et al., but shows that in this case ε-regret can be derived from a later and more general result of Hart and Mas-Colell (2001).
  – This later result identifies a large family of rules that eliminate regret, based on an extension of Blackwell’s approachability theorem.
  – Roughly, Blackwell’s approachability theorem generalises maxmin reasoning to vector-valued payoffs.

Fudenberg & Levine (1995). “Consistency and cautious fictitious play,” Journal of Economic Dynamics and Control, Vol. 19(5-7), pp. 1065-1089.
Hart & Mas-Colell (2001). “A General Class of Adaptive Strategies,” Journal of Economic Theory, Vol. 98(1), pp. 26-54.

Slide 25. Smoothed fictitious play converges to ε-CCE

  • Definition. A coarse correlated equilibrium (CCE) is a probability distribution on strategy profiles, q ∈ ∆(X), such that no player can opt out and gain more in expected payoff. A coarse correlated ε-equilibrium (ε-CCE) is a probability distribution on strategy profiles such that no player can opt out and gain more in expectation than ε.

  • Theorem (Fudenberg & Levine, 1995). Let G be a finite game and let ε > 0. If all players use smoothed fictitious play with sufficiently small smoothing parameters, then with probability one empirical play will converge to the set of coarse correlated ε-equilibria.

Summary of the two theorems: smoothed fictitious play limits regret and converges to ε-CCE. There is another learning method with no regret and convergence to zero-CCE . . .

Slide 26. Exponentiated regret matching

Notation (j ranges over i’s actions, 1 ≤ j ≤ k):

  ū_i^t : i’s realised average payoff up to and including round t.
  φ_{−i}^t : the realised joint empirical distribution of i’s opponents.
  ū_i(j, φ_{−i}^t) : i’s hypothetical average payoff for playing action j against φ_{−i}^t.
  r̄_i^t(j) : player i’s regret for action j in round t, i.e., ū_i(j, φ_{−i}^t) − ū_i^t.

Exponentiated regret matching (PY, p. 59) is defined as

  q_i^{t+1}(j) ∝ [ r̄_i^t(j) ]_+^a,  where a > 0.

(For a = 1 we have ordinary regret matching.) An extended theorem on regret matching (Hart & Mas-Colell, 2001) ensures that individual players have no regret with probability one, and that the empirical distribution of play converges to the set of coarse correlated equilibria (PY, p. 60).
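The rule can be sketched directly from the definition (the function name and the uniform fallback when no regret is positive are our assumptions; the slides do not specify that case):

```python
def exponentiated_regret_matching(regrets, a):
    """Map a regret vector to a mixed strategy: probabilities proportional
    to the a-th power of the positive part of each regret.  Falls back to
    uniform play if no regret is positive (an assumption of this sketch)."""
    positive = [max(r, 0.0) ** a for r in regrets]
    total = sum(positive)
    k = len(regrets)
    if total == 0.0:
        return [1.0 / k] * k
    return [w / total for w in positive]

regrets = [0.3, -0.1, 0.1]
print(exponentiated_regret_matching(regrets, a=1))    # ordinary regret matching
print(exponentiated_regret_matching(regrets, a=10))   # mass concentrates on max regret
```

As the exponent a grows, the rule approaches a best reply to the regret vector, mirroring the comparison on the next slides.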

Slide 27. FP vs. Smoothed FP vs. Exponentiated RM

                                        FP    Smoothed FP                Exponentiated RM
  Depends on past play of opponents     √     √                          −
  Depends on own past payoffs           −     −                          √
  Puts zero probability on
    sub-optimal responses               √     −                          −
  Best response                         √     when γ_i ↓ 0               when a → ∞
  Individual no regret                  −     within ε > 0, almost       exact, almost always
                                              always (PY, p. 82)         (PY, p. 60)
  Collective convergence to coarse      −     within ε > 0, almost       exact, almost always
    correlated equilibria                     always (PY, p. 83)         (PY, p. 60)

Slide 28. FP vs. Smoothed FP vs. Exponentiated RM

Fictitious play: plays best responses.

  • Does depend on past play of opponent(s).
  • Puts zero probability on sub-optimal responses.

Smoothed fictitious play: plays sub-optimal responses, e.g., softmax-proportional to their estimated payoffs.

  • Does depend on past play of opponent(s).
  • Puts non-zero probability on sub-optimal responses.
  • Approaches fictitious play when the smoothing parameter γ_i ↓ 0.

Exponentiated regret matching: plays sub-optimally, proportional to a power of positive regret.

  • Does depend on own past payoffs.
  • Puts non-zero probability on sub-optimal responses.
  • Approaches fictitious play when the exponent a → ∞.

Slide 29. Part III: Finite memory and inertia

Slide 30. Finite memory: motivation

  • In their basic version, most learning rules rely on the entire history of play.
  • People, as well as computers, have a finite memory. (On the other hand, for average or discounted payoffs this is no real problem.)
  • Nevertheless: experiences in the distant past are apt to be less relevant than more recent ones.
  • Idea: let players have a finite memory of m rounds.

Slide 31. Inertia: motivation

  • When players’ strategies are constantly re-evaluated, discontinuities in behaviour are likely to occur. Example: the asymmetric coordination game.
  • Discontinuities in behaviour are less likely to lead to equilibria of any sort.
  • Idea: let players play the same action as in the previous round with probability λ.

Slide 32. Weakly acyclic games

  • Game G with action space X.
  • Better-reply graph G′ = (V, E), where V = X and

  E = { (x, y) | for some i : y−i = x−i and u_i(y_i, y−i) > u_i(x_i, y−i) }.

  • For all x ∈ X: x is a sink iff x is a Nash equilibrium.
  • G is said to be weakly acyclic under better replies if every node is connected to a sink.
  • WAuBR ⇒ ∃ Nash equilibrium.

Example payoff matrix:

  (1, 1) (2, 4) (4, 2)
  (1, 1) (4, 2) (2, 4)
  (3, 3) (1, 1) (1, 1)
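The better-reply graph and its sinks are easy to compute for small games. A sketch on a 2×2 pure coordination game (our own example, not the bimatrix above; function names are ours):

```python
from itertools import product

def better_reply_graph(payoffs, n_actions):
    """Edges x -> y where one player switches to a strictly better reply,
    the other player's action staying fixed.  `payoffs[x]` is the pair of
    payoffs (row player, column player) at joint action x = (r, c)."""
    edges = {x: [] for x in payoffs}
    for (r, c) in payoffs:
        for r2 in range(n_actions):          # row player deviates
            if payoffs[(r2, c)][0] > payoffs[(r, c)][0]:
                edges[(r, c)].append((r2, c))
        for c2 in range(n_actions):          # column player deviates
            if payoffs[(r, c2)][1] > payoffs[(r, c)][1]:
                edges[(r, c)].append((r, c2))
    return edges

def reaches_sink(x, edges, sinks, seen=frozenset()):
    """True iff some better-reply path from x ends in a sink."""
    if x in sinks:
        return True
    return any(y not in seen and reaches_sink(y, edges, sinks, seen | {x})
               for y in edges[x])

# Pure coordination game: payoff 1 for both iff the actions coincide.
coord = {x: ((1, 1) if x[0] == x[1] else (0, 0))
         for x in product(range(2), repeat=2)}
edges = better_reply_graph(coord, 2)
sinks = {x for x, ys in edges.items() if not ys}   # sinks = pure Nash equilibria

print(sorted(sinks))                                      # [(0, 0), (1, 1)]
print(all(reaches_sink(x, edges, sinks) for x in coord))  # True: weakly acyclic
```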

Slide 33. Examples of weakly acyclic games

Coordination games: two-person games with identical actions for all players, where the best responses are formed by the diagonal of the joint action space.

Potential games (Monderer and Shapley, 1996): there is a function ρ : X → R, called the potential, such that for every player i and all action profiles x, y ∈ X:

  y−i = x−i ⇒ u_i(y_i, y−i) − u_i(x_i, y−i) = ρ(y) − ρ(x).

Hence: the potential function increases along every better-reply path ⇒ paths cannot cycle ⇒ in finite graphs, every path must end (in a sink).

Slide 34. Weakly acyclic games under finite memory and inertia

  • Theorem. Let G be a finite weakly acyclic n-person game. Every better-reply process with finite memory and inertia converges to a pure Nash equilibrium of G.

  • Proof (outline).

  1. Let the state space Z be X^m.
  2. A state is called homogeneous if it consists of identical action profiles x. Such a state is denoted by x̄. Z∗ =def { homogeneous states }.
  3. Due to inertia, the process hits Z∗ infinitely many times.
  4. In a moment, it will be shown that the overall probability to play any action is bounded away from zero.
  5. In a moment, it will be shown that the set of absorbing states is identical to the set of homogeneous states that consist of pure Nash equilibria.
  6. Due to weak acyclicity, inertia, and (4), the process eventually lands in an absorbing state which, due to (5), is a repeated pure Nash equilibrium.

Slide 35. First claim: process hits Z∗ infinitely many times

Let inertia be determined by λ > 0. Then Pr(all players play their previous action) = λ^n, and hence Pr(all players play their previous action during m subsequent rounds) = λ^{nm}. If all players play their previous action during m subsequent rounds, the process arrives at a homogeneous state (and conversely). Hence, for all t,

  Pr(the process arrives at a homogeneous state in round t + m) ≥ λ^{nm}.

Recall the converse Borel–Cantelli lemma: if {E_n}_n are independent events and ∑_{n=1}^∞ Pr(E_n) is unbounded, then Pr(an infinite number of the E_n occur) = 1. This lemma applies if an infinite selection of disjoint histories is considered. Since histories have length ≤ m, this is always possible.

Slide 36. Second claim: the overall probability to play any action > 0

A better-reply learning method maps states (finite histories) to strategies (probability distributions on actions), γ_i : Z → ∆(X_i), and possesses the following important properties:

  i) It is deterministic (as a function of the state).
  ii) Every action is played with positive probability.

  1. Define γ_i = inf{ γ_i(z)(x_i) | z ∈ Z, x_i ∈ X_i }. Since Z and X_i are finite, the “inf” is a “min”, and γ_i > 0.
  2. Similarly, define γ = inf{ γ_i | 1 ≤ i ≤ n }. Since there are finitely many players, the “inf” is a “min”, and γ > 0.

Slide 37. Final claim: overall probability to reach a sink from Z∗ > 0

Suppose the process is in the homogeneous state x̄.

  1. If x is pure Nash, we are done, because response functions are deterministic better replies.
  2. If x is not pure Nash, there must be an edge x → y in the better-reply graph. Suppose this edge concerns player i, i.e., y differs from x only in i’s action y_i. We know that y_i is played with probability at least γ, irrespective of player and state.

Further probabilities:

  • All other players j ≠ i keep playing the same action: λ^{n−1}.
  • Edge x → y is actually traversed: γλ^{n−1}.
  • Profile y is maintained for another m − 1 rounds, so as to arrive at state ȳ: λ^{n(m−1)}.
  • To traverse from x̄ to ȳ: γλ^{n−1} · λ^{n(m−1)} = γλ^{nm−1}.
  • The image x̄(1), . . . , x̄(l) of a better-reply path x(1), . . . , x(l) is followed to a sink: probability ≥ (γλ^{nm−1})^L, where L is the length of a longest better-reply path.

Since Z∗ is encountered infinitely often, the result follows.
Slide 38. Summary

  • With fictitious play, the behaviour of opponents is modelled by (or represented by, or projected on) a mixed strategy.
  • Fictitious play ignores sub-optimal actions.
  • There is a family of so-called better-reply learning rules that i) play sub-optimal actions, and ii) can be brought arbitrarily close to fictitious play.
  • In weakly acyclic n-person games, every better-reply process with finite memory and inertia converges to a pure Nash equilibrium.

Slide 39. What next?

Bayesian play:

  • With fictitious play, the behaviour of opponents is modelled by a single mixed strategy.
  • With Bayesian play, opponents are modelled by a probability distribution over (a possibly confined set of) mixed strategies.

Gradient dynamics:

  • Like fictitious play, players model (or assess) each other through mixed strategies.
  • Due to CKR (common knowledge of rationality; cf. Hargreaves Heap & Varoufakis, 2004), all models of mixed strategies are correct (i.e., q−i = s−i, for all i).
  • Mixed strategies are actually not played.
  • Players gradually adapt their mixed strategies through hill-climbing in the payoff space.
