SLIDE 1

Beliefs and Learning in Repeated Games

Florin Constantin and Ivo Parashkevov March 15, 2006

SLIDE 2

Context

  • 2-player discounted repeated games [can be extended to n-player]

  • want to provably learn equilibrium play, as quickly as possible and with as little info as possible

SLIDE 3

Rational (Bayesian) Learning

  • use beliefs about opponents’ strategies to guide prediction of future play

  • play a Best Response to beliefs

  • update beliefs based on actual play

  • learning = repeatedly update beliefs until convergence to equilibrium
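The loop above can be sketched in code. Everything concrete below (the two candidate opponent types, their action probabilities, and the payoff table) is an illustrative assumption added here, not from the slides:

```python
# Sketch of the rational (Bayesian) learning loop for one player in a
# repeated 2-action game. The opponent types and payoffs are illustrative.

def best_response(payoff, forecast):
    """Myopic best response to a forecast over the opponent's actions."""
    actions = list(payoff)
    return max(actions, key=lambda a: sum(p * payoff[a][b] for b, p in forecast.items()))

def bayes_update(belief, strategies, observed):
    """Posterior over candidate opponent strategies after one observed action."""
    post = {s: belief[s] * strategies[s].get(observed, 0.0) for s in belief}
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}

# Toy belief: the opponent is either a mostly-cooperating or mostly-defecting type.
strategies = {"mostlyC": {"C": 0.9, "D": 0.1}, "mostlyD": {"C": 0.1, "D": 0.9}}
belief = {"mostlyC": 0.5, "mostlyD": 0.5}
payoff = {"C": {"C": 1, "D": -1}, "D": {"C": 2, "D": 0}}   # row player's payoffs

for observed in ["C", "C", "C"]:               # opponent keeps cooperating
    forecast = {a: sum(belief[s] * strategies[s].get(a, 0.0) for s in belief)
                for a in ["C", "D"]}
    action = best_response(payoff, forecast)   # play Best Response to beliefs
    belief = bayes_update(belief, strategies, observed)

print(round(belief["mostlyC"], 4))   # posterior on the cooperative type ≈ 0.9986
```

After three observed cooperations, the posterior concentrates on the cooperative type, illustrating the "update beliefs until convergence" step.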

SLIDE 4

Belief Learning vs. Bayesian Learning

  • Behavior Strategy: history → distribution over opponent’s play in next period.
    Example: h_opp = (C, C, D) → Pr_t=4(C) = 2/3, Pr_t=4(D) = 1/3

  • Belief Learning - prediction rule as behavior strategy: associate probabilities with future play of opponents based on play history. Best Respond to the prediction rule

  • Bayesian Learning - Best Respond to beliefs

  • Belief Learning as Bayesian Learning: Best Respond to the belief that puts probability 1 on the behavior strategy predicted by the prediction rule
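The empirical-frequency example above is easy to make concrete; this small sketch (added here for illustration) computes the forecast from h_opp = (C, C, D):

```python
from collections import Counter
from fractions import Fraction

# Empirical-frequency prediction rule from the slide's example:
# after observing the history, forecast the next action by relative frequency.
def frequency_forecast(history):
    counts = Counter(history)
    t = len(history)
    return {a: Fraction(counts[a], t) for a in counts}

forecast = frequency_forecast(["C", "C", "D"])
print(forecast["C"], forecast["D"])   # 2/3 1/3
```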

SLIDE 5

Belief Learning vs. Bayesian Learning II

  • Bayesian Learning as Belief Learning: For any belief B of player 1 over player 2’s behavior strategies, there exists an equivalent belief assigning probability 1 to a particular behavior strategy (called the reduced form of B). Prediction rule: predict the reduced form

SLIDE 6

Fictitious Play

  • P(opponent plays s at time t) = t/(t+k) · (freq of s up to time t) + k/(t+k) · prior(s)

  • Assumptions
    – myopia

  • if it converges, it converges to NE
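The forecast formula above can be checked directly; the uniform prior and weight k = 2 in this sketch are illustrative assumptions:

```python
from fractions import Fraction

# Fictitious-play forecast with a prior of weight k, following the formula:
# P(s at t) = t/(t+k) * freq(s up to t) + k/(t+k) * prior(s)
def fp_forecast(history, prior, k):
    t = len(history)
    freq = {s: Fraction(history.count(s), t) if t else Fraction(0) for s in prior}
    return {s: Fraction(t, t + k) * freq[s] + Fraction(k, t + k) * prior[s]
            for s in prior}

# Uniform prior over {C, D} with weight k = 2, after observing (C, C, D):
p = fp_forecast(["C", "C", "D"], {"C": Fraction(1, 2), "D": Fraction(1, 2)}, 2)
print(p["C"])   # 3/5 · 2/3 + 2/5 · 1/2 = 3/5
```

Note that as t grows, the prior term’s weight k/(t+k) vanishes and the forecast approaches the raw empirical frequency.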

SLIDE 7

Calibrated Learning

  • use forecasts; if
    – every player plays a Best Response to forecasts
    – forecasts are calibrated
    then learning converges to Correlated Equilibrium

  • history is the correlating device (umpire)

  • Assumptions
    – stationary tie-breaking rule

SLIDE 8

Problematic assumptions in papers so far

  • myopia
    – ignores strategic considerations about the future - cannot experiment for long-run benefit
    – can only implement those NE of the repeated game that consist of stage-game NE (e.g. no trigger strategies)

  • observable rewards

  • common prior

SLIDE 9

Kalai & Lehrer - Rational Learning

  • Setting:
    – n-player infinitely repeated discounted games
    – subjective rationality - best responding to beliefs
    – learning is through Bayesian updating of an individual prior
    – encode beliefs as behavior strategies

  • Main result: if individual beliefs are compatible with actual play, then best response to beliefs leads to accurate prediction of future play. Play converges to Nash equilibrium play.

SLIDE 10

Assumptions

  • Perfect monitoring - observe actions of other players

  • Independence of other players’ actions and beliefs

  • No longer assume common prior or myopia

  • Opponents not assumed to be rational

  • Knowledge of own payoff matrix

SLIDE 11

Some Notation

  • n finite sets Σ1, Σ2, ..., Σn of actions

  • Ht - set of histories of length t. H = ∪_t Ht is the set of all finite histories.

  • a behavior strategy of player i is a function fi: H → ∆(Σi), where ∆(Σi) denotes the set of probability distributions over Σi

  • µf is the probability distribution over the set of infinite play paths induced by the strategy vector f

SLIDE 12

Absolute Continuity and Grain of Truth Assumptions

What does it mean to have "beliefs compatible with actual play"?

  • Do not assign zero probability to events that can occur in the play of the game.

  • "Grain of Truth" - beliefs about the opponent’s play assign a (small) positive probability to the strategy actually chosen.
    – Sufficient, but stronger than needed.

  • Absolute Continuity - measure µf is absolutely continuous w.r.t. µg (µf ≪ µg) if µf(A) > 0 ⇒ µg(A) > 0 for all sets A ⊆ Σ∞.

  • Main result requires: actual µf ≪ belief µ̃i

SLIDE 13

Prisoner’s Dilemma Example

        D      C
  D    0,0    2,-1
  C   -1,2    1,1

  • Consider strategies
    – g∞: grim trigger
    – gt: use grim trigger until time t, then defect forever

  • P1 assigns probs (β0, β1, ..., β∞) to P2 playing (g0, g1, ..., g∞), βt > 0. P2 assigns probs (α0, α1, ..., α∞) to P1 playing (g0, g1, ..., g∞), αt > 0.

  • According to own learning parameters, P1 chooses gt1 and P2 chooses gt2.
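A minimal simulation (added here as an illustration, not from the slides) makes the gt strategies concrete; encoding g∞ as t_switch = None and truncating to a finite horizon are assumptions of this sketch:

```python
# g(t_switch) plays grim trigger until period t_switch, then defects forever;
# t_switch=None stands for g_inf (pure grim trigger).
def g(t_switch, period, opponent_defected):
    if opponent_defected:                      # grim-trigger punishment phase
        return "D"
    if t_switch is not None and period >= t_switch:
        return "D"                             # scheduled switch to defection
    return "C"

def play_path(t1, t2, horizon):
    """Path of play when P1 uses g(t1) and P2 uses g(t2)."""
    path, defected1, defected2 = [], False, False
    for period in range(horizon):
        a1 = g(t1, period, defected2)
        a2 = g(t2, period, defected1)
        path.append((a1, a2))
        defected1 = defected1 or a1 == "D"
        defected2 = defected2 or a2 == "D"
    return path

# With t1 = 3 < t2 = 5: mutual C before period 3, P1 defects first at period 3,
# then mutual defection from period 4 on.
print(play_path(3, 5, 6))
```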

SLIDE 14

Prisoner’s Dilemma Example

  • all events with positive probability in the game (C until time t < min(t1, t2), D after min(t1, t2), etc.) are assigned positive probability by players’ beliefs: beliefs are compatible with actual play.

  • learning must occur - if t1 < t2 then P2 will assign prob 1 to P1 playing gt1 from time t1+1 on. So P2 knows that P1 will defect forever.

  • P1 will not know P2’s strategy - she will only know that t2 > t1, but she will be able to predict that P2 will defect forever as well - future play is learned only on the play path.

SLIDE 15

Prisoner’s Dilemma Example

What if t1 = t2 = ∞?

  • after time t, P1 knows P2 did not play g0, ..., gt and assigns probabilities (β_{t+1}, ..., β_∞) / Σ_{r=t+1..∞} β_r to (g_{t+1}, ..., g_∞). Since β_∞ > 0,

      β_∞ / Σ_{r=t+1..∞} β_r → 1 as t → ∞

  • P1 becomes more and more confident that P2 is playing g∞, but never knows for sure.
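A numeric illustration of the limit above; the concrete prior (β_∞ = 1/2 and β_t = 2^-(t+2) for finite t, which sums to 1) is an assumption added here, not from the slides:

```python
# Posterior probability P1 places on g_inf after ruling out g_0, ..., g_t.
def posterior_on_g_inf(t):
    beta_inf = 0.5
    # approximate the finite tail sum_{r > t} beta_r = sum_{r > t} 2^-(r+2)
    tail = sum(2 ** -(r + 2) for r in range(t + 1, t + 200))
    return beta_inf / (beta_inf + tail)

for t in [0, 5, 20]:
    print(t, posterior_on_g_inf(t))
# the posterior on g_inf rises toward 1 but never reaches it
```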

SLIDE 16

Definitions

  • Let ε > 0 and µ, µ̃ two probability measures. µ is ε-close to µ̃ if ∃ a set Q such that
    – µ(Q) > 1 − ε and µ̃(Q) > 1 − ε
    – ∀A ⊆ Q, (1−ε)µ̃(A) ≤ µ(A) ≤ (1+ε)µ̃(A)

  • f plays ε-like g if µf is ε-close to µg.

  • Let f be a strategy, t ≥ 0 and h a history up to time t. The induced strategy fh is defined by fh(h′) = f(concat(h, h′)) for all finite histories h′.
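For finite distributions the ε-closeness definition can be checked directly. In this sketch (an illustration added here) it is enough to check the ratio condition on singletons, since the inequalities are preserved when summing over any A ⊆ Q:

```python
# Check epsilon-closeness of two finite probability distributions (dicts).
def eps_close(mu, mu_tilde, eps):
    keys = set(mu) | set(mu_tilde)
    # Q = points where the ratio condition holds for the singleton {x};
    # it then holds for every A subseteq Q by summation.
    Q = {x for x in keys
         if (1 - eps) * mu_tilde.get(x, 0) <= mu.get(x, 0) <= (1 + eps) * mu_tilde.get(x, 0)}
    return (sum(mu.get(x, 0) for x in Q) > 1 - eps
            and sum(mu_tilde.get(x, 0) for x in Q) > 1 - eps)

print(eps_close({"a": 0.5, "b": 0.5}, {"a": 0.52, "b": 0.48}, 0.1))   # True
```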

SLIDE 17

Theorem 1

Let f be the strategy vector that is actually chosen and fi be the beliefs of player i. Assume µf is absolutely continuous with respect to µfi. Then ∀ε > 0, for almost every play path z, ∃ a time T(z, ε) such that ∀t ≥ T(z, ε), f_z(t) plays ε-like fi_z(t).

If players maximize payoff then they will eventually be playing a subjective ε-equilibrium:

  • each player plays a Best Response to own beliefs

  • these beliefs are ε-never going to be contradicted by actual play

Interpretation?

SLIDE 18

Theorem 2

Let f be the strategy vector that is actually chosen and f1, ..., fn be the beliefs of players 1, ..., n. Assume

  • µf is absolutely continuous with respect to each µfi

  • each player plays a Best Response to its beliefs.

Then ∀ε > 0, for almost every play path z, ∃ a time T(z, ε) such that for all t ≥ T(z, ε) there exists an ε-Nash Equilibrium f̄ of the repeated game such that f_z(t) plays ε-like f̄.

SLIDE 19

Comments

  • Theorem 1 does not assume anything about players’ strategies.

  • Convergence of beliefs with reality occurs only on the actual play path. Players do not learn what their opponents would do in response to actions that will not be taken.

  • If players are best responding (Theorem 2), then convergence is to NE play in the repeated game, not to repeated play of a single stage NE.

  • Convergence is to an equilibrium play, not to an equilibrium. We are not learning Nash strategies, but we can learn to play as if we knew them.

SLIDE 20
So what?

  • If
    – assumptions are met
    – all other players play a Best Response to their beliefs
    can you do better?

SLIDE 21

Beliefs in Repeated Games - Nachbar 2005

Main Result

For a large class of repeated games, beliefs cannot simultaneously satisfy:

  • learnability

  • consistency

  • CSP (a diversity-of-beliefs condition)

SLIDE 22

Learnability - informally

Player 1 learns to predict the path of play generated by σ2 if her one-period-ahead forecasts along the path of play eventually become almost as accurate as if she knew σ2.

SLIDE 23

Learnability - formally

Fix a belief β2 of player 1 about player 2’s strategy. Player 1 learns to predict the path of play generated by behavior strategy σ = (σ1, σ2) iff

  • ∀ finite history h,
    µ(σ1,σ2)(h∗) > 0 ⇒ µ(σ1,β2)(h∗) > 0
    where h∗ = the set of all paths of play starting with h

  • ∀ε > 0 and for almost all paths of play z, ∃ T(ε, z) such that for any time t > T(ε, z) and any action a2 of player 2,
    |(the prob that σ2(h) assigns to a2) − (the prob that σ^β2(h) assigns to a2)| < ε
    where h = the first t stages of z and σ^β2 = the reduced form of β2.

SLIDE 24

CSP

Two conditions: CS and P, both addressing the richness of Σ̂. All restrictions are only on the path of play!

  • CS - (Weak) Caution and Symmetry
    – s1 is a simple variant of s2 if s1 can be generated from s2 by a uniform relabeling of actions
    – Weak Caution means: if I believe you could play the pure strategy s1, then I also believe you could play all simple variants of s1. Strong caution would mean Ŝi = Si
    – Symmetry means: if I believe that you can play s1, then you believe I can play all simple variants of s1.

23

SLIDE 25

    – Symmetry is motivated by the necessity of equally powerful strategy-generating machines.

  • P
    – if a behavior strategy σ2 is in Σ̂2, then at least one pure strategy that coarsely approximates σ2 is in Σ̂2 as well.

SLIDE 26

Consistency

Fix ε ≥ 0 and beliefs β1 and β2. Σ̂ ⊂ Σ is ε-consistent iff player 1 has a uniform ε-Best Response (to belief β2) in Σ̂1 and player 2 has a uniform ε-Best Response (to belief β1) in Σ̂2.

SLIDE 27

MM and No Weak Dominance

No Weak Dominance (NWD): no player has a (weakly) dominant action.

MM: Mi < mi for each player i - true e.g. for matching pennies, rock-paper-scissors, ..., where

  m1 = min_{α2∈∆(A2)} max_{α1∈∆(A1)} u1(α1, α2)

is player 1’s minmax payoff and

  M1 = max_{a1∈A1} min_{a2∈A2} u1(a1, a2)

is player 1’s pure-action maxmin payoff.
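As a sketch (added here, not from the slides), one can check MM for matching pennies numerically. Since the inner max over mixed α1 is attained at a pure action, a grid over the column mixture α2 suffices for a 2x2 game:

```python
# Row player's payoffs u1(a1, a2) in matching pennies.
u1 = {("H", "H"): 1, ("H", "T"): -1, ("T", "H"): -1, ("T", "T"): 1}
A = ["H", "T"]

# pure-action maxmin: M1 = max_{a1} min_{a2} u1(a1, a2)
M1 = max(min(u1[a1, a2] for a2 in A) for a1 in A)

# mixed minmax: m1 = min_{alpha2} max_{a1} E[u1(a1, alpha2)],
# approximated by a fine grid over alpha2 = (p, 1-p).
m1 = min(max(p * u1[a1, "H"] + (1 - p) * u1[a1, "T"] for a1 in A)
         for p in (i / 1000 for i in range(1001)))

print(M1, m1)   # -1 0.0 : MM holds since M1 < m1
```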

SLIDE 28

Main result - a bit more formally

If NWD or MM holds for the stage game, then for any δ > 0 there exists εδ > 0 such that for any Σ̂ ⊂ Σ and for any beliefs, if Σ̂ satisfies (pure weak) learnability & CSP, then Σ̂ is not εδ-consistent.

Interpretation: if beliefs have supports that are learnable and sufficiently diverse, then the beliefs are inconsistent.

SLIDE 29

Discussion

  • if you are able to learn, does that mean you already had some knowledge about the problem?

  • is consistency necessary for convergence or belief learning?

  • the impossibility result is not about convergence and learning per se - but what is it about?

  • how useful/natural/hard to work with are these conditions?
