Efficient Learning Equilibrium - R. Brafman and M. Tennenholtz



SLIDE 1

Efficient Learning Equilibrium

R. Brafman and M. Tennenholtz. Presented by Neal Gupta, CS 286r, March 20, 2006.

SLIDE 2

Efficient Learning Equilibrium

  • Infinitely repeated stage games M with single-stage matrix G
  • Individual Rationality
    – Unilateral deviation is irrational after a polynomial number of stages
  • Efficiency
    – Without deviation, after a polynomial number of steps the expected payoff will be within ǫ of a Nash equilibrium (hence also within ǫ of the minimax payoffs)
  • A set of algorithms that meets the above conditions with respect to a specific class of games is considered to be an ELE

SLIDE 3

Motivation

  • Get objective convergence rather than convergence in beliefs
    – Works with (relatively) patient agents
  • Exploit the richness of results from Folk Theorems

SLIDE 4

Assumptions about Agents

  • Agents care about average reward, as well as how quickly it is reached
  • No discounting
  • Agents have NO PRIOR about the payoffs of the game
  • Agents may or may not be able to observe payoffs
    – Perfect monitoring (main results proved)
    – Weak imperfect monitoring
      ∗ observe the other player's actions, but not payoffs
    – Strict imperfect monitoring

SLIDE 5

Formal definition of learning equilibrium

  • Define Ui(M, σ1, σ2, T), given repeated game M, to be the expected average reward of player i after T periods when the players use the strategy (policy) profile σ = (σ1, σ2).
  • Then let Ui(M, σ1, σ2) denote the average reward in the limit: Ui(M, σ1, σ2) = lim inf_{T→∞} Ui(M, σ1, σ2, T).
  • A strategy profile σ = (σ1, σ2) is a learning equilibrium (LE) if, for all repeated games M, neither player can benefit from a unilateral deviation:
    ∀σ′1, U1(M, σ′1, σ2) ≤ U1(M, σ1, σ2), and
    ∀σ′2, U2(M, σ1, σ′2) ≤ U2(M, σ1, σ2).
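A minimal sketch of the Ui notation for stationary mixed strategies in a 2x2 stage game; the payoff numbers and helper names below are illustrative, not taken from the paper:

import numpy as np

def avg_reward(G1, G2, sigma1, sigma2, T, rng=np.random.default_rng(0)):
    """Empirical estimate of (U1(M, sigma1, sigma2, T), U2(M, sigma1, sigma2, T))
    for the repeated game whose stage payoff matrices are G1 (row player) and G2."""
    total = np.zeros(2)
    for _ in range(T):
        a1 = rng.choice(len(sigma1), p=sigma1)   # player 1 samples an action
        a2 = rng.choice(len(sigma2), p=sigma2)   # player 2 samples an action
        total += (G1[a1, a2], G2[a1, a2])
    return total / T

# Illustrative coordination-style stage game (payoffs made up for the example)
G1 = np.array([[1.0, 0.0], [0.0, 1.0]])
G2 = np.array([[1.0, 0.0], [0.0, 1.0]])
print(avg_reward(G1, G2, sigma1=[0.5, 0.5], sigma2=[0.5, 0.5], T=10_000))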

SLIDE 6

Efficient Learning Equilibrium

  • Adds requirements on the speed of convergence to the basic definition of learning equilibrium

  • ∀ǫ > 0 and 0 < δ < 1, there exists T = poly(1/ǫ, 1/δ, k) s.t. ∀t ≥ T, if player 1 switches from strategy σ1 to σ′1 at iteration l, then
    U1(M, σ′1, σ2, l + t) ≤ U1(M, σ1, σ2, l + t) + ǫ,
    with probability of failure bounded above by δ.
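A hedged sketch of how this condition could be tested empirically once payoff trajectories under the prescribed profile and under a deviation have been simulated; the function names and the Monte Carlo framing are assumptions of the sketch, not the paper's construction:

import numpy as np

def deviation_ok(payoffs_eq, payoffs_dev, l, t, eps):
    """One simulated run: player 1's average reward up to step l + t under the
    deviation must not exceed the average under the prescribed profile by more
    than eps."""
    horizon = l + t
    return np.mean(payoffs_dev[:horizon]) <= np.mean(payoffs_eq[:horizon]) + eps

def ele_condition_holds(runs_eq, runs_dev, l, t, eps, delta):
    """Empirical version of the ELE requirement: the per-run check may fail,
    but only on a fraction of at most delta of independent runs."""
    fails = sum(not deviation_ok(pe, pd, l, t, eps)
                for pe, pd in zip(runs_eq, runs_dev))
    return fails / len(runs_eq) <= delta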

SLIDE 7

Efficient Learning Equilibrium (cont’d)

  • To provide an IR constraint, the authors require that the utilities be at least those of some Nash equilibrium (which are in turn at least the minimax payoffs)

  • Note that the limit is dropped: a policy that is beneficial over a polynomial horizon but disastrous in the long term still prevents an ELE
    – Toy example: the opponent has a trigger strategy whose punishment is delayed exponentially long

SLIDE 8

Theorem 1 (BT, 7)

  • There exists an ELE for any perfect monitoring setting

  • Three stages (if no deviation), sketched in the code below
    – Exploration
    – Offline computation of equilibrium
    – Play of the equilibrium

  • The proof demonstrates that deviation can be punished in the exploration stage
    – No subgame perfection demonstrated
    – Folk theorems provide more general results, motivating Pareto-ELE
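A minimal sketch of the three-stage structure for a two-player perfect-monitoring setting. The equilibrium computation is a brute-force stand-in that only finds pure equilibria, the payoff matrices are passed in rather than recorded during exploration, and all names are illustrative rather than the paper's construction (the punishment behaviour on deviation is on the next slide):

import itertools
import numpy as np

def explore(k):
    """Stage 1: enumerate all k*k joint actions, so that under perfect
    monitoring both players come to know both payoff matrices."""
    return list(itertools.product(range(k), range(k)))

def compute_equilibrium(G1, G2):
    """Stage 2: offline computation of an equilibrium of the learned game.
    Brute-force search for a pure Nash equilibrium (a stand-in only)."""
    k = G1.shape[0]
    for a1 in range(k):
        for a2 in range(k):
            if G1[a1, a2] >= G1[:, a2].max() and G2[a1, a2] >= G2[a1, :].max():
                return (a1, a2)
    return None   # no pure equilibrium; the real construction also covers mixed ones

def ele_play(G1, G2, horizon):
    """Stage 3: after exploring, play the computed equilibrium for the rest of the game."""
    k = G1.shape[0]
    history = explore(k)
    eq = compute_equilibrium(G1, G2)
    history += [eq] * (horizon - len(history))
    return history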

SLIDE 9

Theorem 1 (BT, 7) cont’d

  • If player 2 deviates during exploration, player 1 minimaxes player 2 using the available payoffs (player 1 may continue exploring to learn payoffs in order to minimax player 2)
    – Note that if player 2 starts exploring again, he/she might do better, but will never be able to do better than the maximin of the game

  • Lemma 1: Chernoff bounds are used when the maximin strategy requires randomizing

  • Lemma 2: Let Rmax be the maximal payoff; then there exists a z polynomial in Rmax, k, 1/ǫ and 1/δ s.t. if player 1 punishes player 2 as prescribed for z steps, then either player 1 will learn a new entry or will reach the desired minimax payoff with high probability.

SLIDE 10

Theorem 1 (BT, 7) cont’d

  • Given that k is an upper bound on the number of actions for each player, player 1 can only learn k² − 1 new entries after a deviation.
  • Thus, the probabilistic minimax can be reached in a polynomial number of moves.

  • With a second Chernoff bound, the authors conclude that the actual payoff will be within ǫ of the expected minimax value with probability 1 − δ, with only a polynomial (linear) increase in the number of trials as 1/δ and 1/ǫ grow.

SLIDE 11

Weaknesses

  • Proofs are restricted to the 2-player setting
  • The trigger strategies used are very far from subgame perfection
  • Agents that care about average payoffs are a significant deviation from discounting agents
  • How to choose among multiple equilibria?
  • The exhibited learning algorithms seem naïve
    – Explore the entire state space, then simply compute an equilibrium

SLIDE 12

Theorem 2 (BT, 7)

  • An ELE does not always exist in the imperfect monitoring setting.

    M1 (rows: player 1's actions; columns: player 2's actions):
      6, ·       0, 100
      5, −100    1, 500

    M′1:
      6, 9       0, 1
      5, 11      1, 10

  • In both M1 and M′1, player 1 has the same payoffs. Both games have a unique Nash equilibrium, and the one in M1 must be played for the strategies to be an ELE.
  • Player 2 benefits if he plays as if he is in M′1.

SLIDE 13
  • Contradicts the definition of ELE: player 2 immediately and permanently benefits from a unilateral deviation
    – In the example it seems that player 2 must know player 1's payoffs, but not vice versa

    M1:
      6, ·       0, 100
      5, −100    1, 500

    M′1:
      6, 9       0, 1
      5, 11      1, 10

  • Player 2 could just pretend that playing Right is a dominant strategy
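A small check, using only the fully listed M′1 payoffs above, that one of player 2's actions strictly dominates there and that the resulting unique equilibrium yields (6, 9). The column ordering follows the order in which the payoffs are listed above and is not necessarily the slide's Left/Right labelling:

import numpy as np

# M'1 from the slide: rows = player 1's actions, columns = player 2's actions
G1p = np.array([[6, 0],
                [5, 1]])      # player 1's payoffs (identical in M1 and M'1)
G2p = np.array([[9, 1],
                [11, 10]])    # player 2's payoffs in M'1

def dominant_column(G2):
    """Index of a strictly dominant action for the column player, or None."""
    for c in range(G2.shape[1]):
        others = [j for j in range(G2.shape[1]) if j != c]
        if all((G2[:, c] > G2[:, j]).all() for j in others):
            return c
    return None

c = dominant_column(G2p)                          # player 2's dominant action in M'1
r = int(np.argmax(G1p[:, c]))                     # player 1's best response to it
print(c, r, (int(G1p[r, c]), int(G2p[r, c])))     # 0 0 (6, 9): the unique NE of M'1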

SLIDE 14

Theorem 3 (BT, 9)

  • There exists an ELE for the class of common-interest games under strict imperfect monitoring
    – Agents know their own action and payoffs, but neither the action nor the payoff of the opponent

  • Proof outline: proceed as in Thm. 1, but explore by independently randomizing over actions until both agents are confident that all actions have been seen (a sketch of this exploration phase follows below).
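A hedged sketch of that exploration phase for a common-interest stage game: both agents randomize uniformly and independently, each remembering only its own best action, and the stopping rule is a standard coupon-collector/union-bound calculation rather than the paper's exact one:

import math
import random

def exploration_rounds(k, delta):
    """Rounds of independent uniform play after which all k*k joint actions
    have appeared with probability at least 1 - delta (union bound over the
    k*k cells, each missed in one round with probability 1 - 1/k^2)."""
    cells = k * k
    return math.ceil(math.log(cells / delta) / -math.log(1 - 1 / cells))

def explore_common_interest(payoff, k, delta, rng=random.Random(0)):
    """Centralised simulation: each agent tracks the best reward it has seen
    and its own action that produced it; under strict imperfect monitoring it
    never observes the other agent's action."""
    best1 = best2 = (-math.inf, None)
    for _ in range(exploration_rounds(k, delta)):
        a1, a2 = rng.randrange(k), rng.randrange(k)
        r = payoff[a1][a2]               # common interest: both receive the same reward
        best1 = max(best1, (r, a1))
        best2 = max(best2, (r, a2))
    return best1[1], best2[1]            # each agent then repeats its best own action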

SLIDE 15

Theorem 3 (BT, 9) cont’d

  • If both agents play the action that led to the highest reward they saw, they are guaranteed to coordinate

  • Concerns
    – Common interest implies players DO know the opponent's payoffs.
    – If players don't know the number of actions of the opponent, how do they decide when to stop?
    – If players do know the number of actions of the opponent, why not use the direct result from Thm. 1?

SLIDE 16

Pareto ELE

  • Exploits repeated-game strategies to allow a wider range of payoffs
    – Differs in that it allows side payments
    – Now within ǫ of an economically efficient outcome rather than of a NE
  • Given the efficient joint actions (P1(G), P2(G)), with value PVi(M) for agent i, we now require (checked in the sketch below)
    U1(M, σ1, σ2, t) + U2(M, σ1, σ2, t) ≥ PV1(M) + PV2(M) − ǫ.

  • Same condition as before: with probability 1 − δ, the gain from deviation after polynomial time is less than ǫ
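The efficiency requirement above reduces to a one-line check on the (estimated) average rewards; the numbers in this example are illustrative only:

def pareto_ele_ok(u1_t, u2_t, pv1, pv2, eps):
    """Pareto-ELE efficiency condition: realized joint average reward at time t
    must be within eps of the value of the efficient joint action profile."""
    return u1_t + u2_t >= pv1 + pv2 - eps

# Illustrative values: efficient joint value 8 + 5, realized averages 7.6 + 5.2
print(pareto_ele_ok(7.6, 5.2, 8.0, 5.0, eps=0.5))   # True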

SLIDE 17

Theorem 4

  • There exists a Pareto ELE for any perfect monitoring setting
  • Proof outline: proceed as in the regular ELE during exploration
  • Pay a player if she receives less than her probabilistic maximin value (a sketch of this side-payment step follows below)
  • By the definition of Pareto optimality, both players now exceed their maximin value
  • Use the same punishment approaches as before
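A hedged sketch of the side-payment step: after play, transfer just enough so that a player who ended below her (estimated) probabilistic maximin value is topped up. The function, its arguments, and the assumption that the efficient surplus can cover the shortfall are illustrative, not the paper's exact construction:

def side_payment(u1, u2, maximin1, maximin2):
    """Return adjusted rewards (u1', u2') after a transfer that tops up a
    player who fell below her probabilistic maximin value; assumes joint play
    was efficient enough for the other player to cover the shortfall."""
    transfer = 0.0
    if u1 < maximin1:
        transfer = maximin1 - u1         # player 2 pays player 1
    elif u2 < maximin2:
        transfer = -(maximin2 - u2)      # player 1 pays player 2
    return u1 + transfer, u2 - transfer

# Illustrative values: efficient play gave (2.0, 9.0) but player 1's maximin is 3.0
print(side_payment(2.0, 9.0, maximin1=3.0, maximin2=1.0))   # (3.0, 8.0)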

SLIDE 18

Stochastic games

  • Players observe payoffs and new states, and must build a model of the probabilistic transitions.
  • Nash equilibrium results for average payoffs are hard to prove, so the authors work in the Pareto-ELE setting with side payments
  • Ergodicity assumption: every state is reachable from every other state (see the sketch below)
    – Combined with a finite number of states, this implies we can expect to explore the entire game matrix in finite time

  • Results are polynomial in 1/δ, 1/ǫ, and Tmix
    – Tmix denotes the ǫ-return mixing time
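Operationally, the ergodicity assumption is a reachability condition on the state graph; a small sketch of checking it, where the dict-of-successor-sets representation is an assumption of the sketch:

def ergodic(succ):
    """succ maps each state to the set of states reachable in one step under
    some joint action.  Returns True iff every state can reach every other
    state, i.e. the reachability part of the ergodicity assumption holds."""
    states = set(succ)

    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            s = stack.pop()
            for t in succ[s]:
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    return all(reachable(s) == states for s in states)

# Illustrative 3-state cycle: every state eventually reaches every other
print(ergodic({0: {1}, 1: {2}, 2: {0}}))   # True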

SLIDE 19

ǫ-Mixing Time

  • Informally, the time it takes for the expected average reward to approach the infinite-horizon reward, for all states s
  • Tmix is the minimum t s.t. ∀s ∈ S,
    U(s, σ1, σ2, t) > U(s, σ1, σ2) − ǫ
    (estimated in the sketch below)
  • How long is this given a "reasonable" transition function?
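A hedged sketch of estimating Tmix for a fixed pair of stationary policies, representing the induced process by its state-transition matrix P and expected one-step reward vector r; using the average at a horizon cap t_max as a stand-in for the infinite-horizon average is an assumption of the sketch:

import numpy as np

def eps_mixing_time(P, r, eps, t_max=10_000):
    """Smallest t such that, from every start state, the expected t-step
    average reward is within eps of the long-run average (approximated by the
    average at t_max).  P: n x n transition matrix, r: length-n reward vector."""
    n = len(r)
    cum = np.zeros(n)       # cumulative expected reward per start state
    dist = np.eye(n)        # row s = state distribution at the current step when starting from s
    averages = []
    for t in range(1, t_max + 1):
        cum += dist @ r
        averages.append(cum / t)
        dist = dist @ P
    u_limit = averages[-1]  # proxy for the infinite-horizon average from each state
    for t, u_t in enumerate(averages, start=1):
        if np.all(u_t > u_limit - eps):
            return t
    return None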

SLIDE 20

Theorem 6 (BT, 13)

  • A Pareto-ELE in a stochastic game exists if (1) the agents have perfect monitoring and (2) Tmix is known.
  • The proof is similar to the previous ones; it requires the E3 approach and results from Learning to Coordinate Efficiently (BT, [5]).

SLIDE 21

Extensions

  • Move towards credible threats, if not SPE
    – Automated agents can implement unrealistic threats
  • More results in the case of imperfect monitoring
    – May require probabilistic reasoning or conditional priors rather than just learning the entire game matrix
  • A model of Pareto-ELE based on cycling rather than side payments?

SLIDE 22

Conclusions

  • Pros:
    – We get objective convergence, not convergence in beliefs
    – Punishment is relatively quick, if not discounted
  • Cons:
    – NO priors!
    – Discounting seems a more realistic model of behavior
    – A hard time horizon for punishment may be required, otherwise agents will try to delay costly punishment forever
    – Trigger strategies are far from SPE
