SLIDE 1

Nash Q-Learning for General-Sum Stochastic Games

Hu & Wellman
March 6th, 2006
CS286r
Presented by Ilan Lobel

SLIDE 2

Outline

• Stochastic Games and Markov Perfect Equilibria
• Bellman’s Operator as a Contraction Mapping
• Stochastic Approximation of a Contraction Mapping
• Application to Zero-Sum Markov Games
• Minimax-Q Learning
• Theory of Nash-Q Learning
• Empirical Testing of Nash-Q Learning

SLIDE 3

How do we model games that evolve over time?

• Stochastic Games!
• Current Game = State
• Ingredients:
– Agents (N)
– States (S)
– Payoffs (R)
– Transition Probabilities (P)
– Discount Factor (δ)

SLIDE 4

Example of a Stochastic Game

[Figure: two stage games. The first, with actions A–D, has payoff pairs (1,2), (3,4), (5,6), (7,8); the second, with actions A–E, has payoff pairs (0,0) and (10,10). The game moves between states with 30% probability when (B,D) is played and with 50% probability when (A,C) or (A,D) is played.]

δ = 0.9

SLIDE 5

Markov Game is a Generalization of…

Repeated Games → (add states) → Markov Games

SLIDE 6

Markov Game is a Generalization of…

Repeated Games → (add states) → Markov Games
MDP → (add agents) → Markov Games

SLIDE 7

Markov Perfect Equilibrium (MPE)

• A strategy maps states into randomized actions:
– πi : S → Δ(A)
• No agent has an incentive to unilaterally change her policy.
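A compact way to state that condition (the slide gives it only in words): writing v^i(s | π) for agent i’s expected discounted payoff from state s under the joint strategy π, the profile (π1*, …, πN*) is an MPE if, for every agent i and every state s,

```latex
v^i\big(s \mid \pi_i^*, \pi_{-i}^*\big) \;\ge\; v^i\big(s \mid \pi_i, \pi_{-i}^*\big)
\qquad \text{for every Markov strategy } \pi_i .
```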

SLIDE 8

Cons & Pros of MPEs

• Cons:
– Can’t implement everything described by the Folk Theorems (i.e., no trigger strategies).
• Pros:
– MPEs always exist in finite Markov Games (Fink, 64).
– Easier to “search for”.

SLIDE 9

Learning in Stochastic Games

• Learning is especially important in Markov Games because MPEs are hard to compute.
• Do we know:
– Our own payoffs?
– Others’ rewards?
– Transition probabilities?
– Others’ strategies?

SLIDE 10

Learning in Stochastic Games

• Adapted from Reinforcement Learning:
– Minimax-Q Learning (zero-sum games)
– Nash-Q Learning
– CE-Q Learning

SLIDE 11

Zero-Sum Stochastic Games

• Nice properties:
– All equilibria have the same value.
– Any equilibrium strategy of player 1 against any equilibrium strategy of player 2 produces an MPE.
– It has a Bellman-type equation.

SLIDE 12

Bellman’s Equation in DP

• Bellman Operator: T
• Bellman’s Equation rewritten:
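The equations themselves are images that did not survive extraction; the standard forms for a discounted MDP, using the ingredients (R, P, δ) from slide 3, are:

```latex
% Bellman operator for a discounted MDP:
(TV)(s) \;=\; \max_{a \in A} \Big[ R(s,a) \;+\; \delta \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Big]

% Bellman's equation, rewritten as a fixed-point condition:
V^* \;=\; T V^*
```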

SLIDE 13

Contraction Mapping

• The Bellman Operator has the contraction property.
• Bellman’s Equation is a direct consequence of the contraction.
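The inequality itself is missing from the extracted slide; it is the standard sup-norm contraction, with the discount factor δ as the modulus:

```latex
\| TV - TW \|_\infty \;\le\; \delta\, \| V - W \|_\infty \qquad \text{for all } V, W .
```

By the Banach fixed-point theorem, T therefore has a unique fixed point, and that fixed point is exactly the V* of Bellman’s Equation.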

SLIDE 14

The Shapley Operator for Zero-Sum Stochastic Games

• The Shapley Operator is a contraction mapping. (Shapley, 53)
• Hence, it also has a fixed point, which is an MPE.
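The operator’s definition is another lost image; its standard form replaces the max in the Bellman operator with the minimax value (“val”) of a stage game:

```latex
% Shapley operator: val is the minimax value of the zero-sum matrix
% game, over actions (a^1, a^2), given in brackets.
(TV)(s) \;=\; \operatorname{val}_{a^1,\,a^2} \Big[ R(s, a^1, a^2)
  \;+\; \delta \sum_{s' \in S} P(s' \mid s, a^1, a^2)\, V(s') \Big]
```

The minimax strategies of these bracketed stage games, evaluated at the fixed point V*, form the MPE.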

SLIDE 15

Value Iteration for Zero-Sum Stochastic Games

• Direct consequence of contraction.
• Converges to the fixed point of the operator.
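The iteration itself is missing from the extracted slide; it is just repeated application of the Shapley operator, and the contraction property gives geometric convergence:

```latex
V_{k+1} \;=\; T V_k, \qquad
\| V_k - V^* \|_\infty \;\le\; \delta^k\, \| V_0 - V^* \|_\infty .
```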

SLIDE 16

Q-Learning

• Another consequence of a contraction mapping: Q-Learning converges!
• Q-Learning can be described as an approximation of value iteration:
– Value iteration with noise.
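For reference (the update is not spelled out on the extracted slide), the single-agent Q-learning update that the later slides generalize is:

```latex
% After observing the transition (s_t, a_t, r_t, s_{t+1}):
Q_{t+1}(s_t, a_t) \;=\; (1 - \alpha_t)\, Q_t(s_t, a_t)
  \;+\; \alpha_t \Big[ r_t + \delta \max_{a'} Q_t(s_{t+1}, a') \Big]
```

The bracketed sample is a noisy but unbiased estimate of the Bellman operator at (s_t, a_t), which is precisely the “value iteration with noise” view.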

SLIDE 17

Q-Learning Convergence

• Q-Learning is called a Stochastic Iterative Approximation of Bellman’s operator:
– Learning rate of 1/t.
– Noise is zero-mean and has bounded variance.
• It converges if all state-action pairs are visited infinitely often.

(Neuro-Dynamic Programming – Bertsekas & Tsitsiklis)

SLIDE 18

Minimax-Q Learning Algorithm For Zero-Sum Stochastic Games

• Initialize your Q0(s, a1, a2) for all states and actions.
• Update rule: sketched in code below.
• Player 1 then chooses action u1 in the next stage sk+1.
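The update-rule equation is an image that did not survive extraction. Below is a minimal Python sketch of one minimax-Q step, assuming the learner observes both players’ actions and player 1’s reward, and computing the stage-game value by linear programming with scipy; the function names and the dict-of-matrices representation of Q are illustrative choices, not Littman’s original code.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Minimax value of the zero-sum matrix game M (row player maximizes).

    LP: maximize v subject to pi^T M[:, j] >= v for every column j,
    with pi a probability distribution over rows.
    """
    n_rows, n_cols = M.shape
    # Decision variables: [pi_1, ..., pi_n, v]; linprog minimizes, so use -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # v - pi^T M[:, j] <= 0 for every opponent column j.
    A_ub = np.hstack([-M.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # The mixed strategy pi must sum to one.
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_rows + [(None, None)]  # pi >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_rows]

def minimax_q_update(Q, s, a1, a2, r, s_next, alpha, delta=0.9):
    """One minimax-Q step. Q maps each state to a numpy payoff matrix
    indexed by (player-1 action, player-2 action); r is player 1's reward."""
    v_next, _ = matrix_game_value(Q[s_next])
    Q[s][a1, a2] = (1 - alpha) * Q[s][a1, a2] + alpha * (r + delta * v_next)
```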

SLIDE 19

Minimax-Q Learning

• It’s a Stochastic Iterative Approximation of the Shapley Operator.
• It converges to a Nash Equilibrium if all state-action-action triplets are visited infinitely often. (Littman, 96)

SLIDE 20

Can we extend it to General-Sum Stochastic Games?

• Yes & No.
• Nash-Q Learning is such an extension.
• However, it has much worse computational and theoretical properties.

SLIDE 21

Nash-Q Learning Algorithm

• Initialize Q0j(s, a1, a2) for all states and actions, and for every agent.
– You must simulate everyone’s Q-factors.
• Update rule: sketched below.
• Choose the randomized action generated by the Nash operator.
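The update rule is again a lost image; in the two-player notation of this slide it mirrors the minimax-Q update, with the minimax value replaced by agent j’s payoff under a Nash equilibrium of the stage game built from the current Q-factors:

```latex
% Nash-Q update for agent j after observing (s, a^1, a^2, r^j, s'):
Q^j_{t+1}(s, a^1, a^2) \;=\; (1 - \alpha_t)\, Q^j_t(s, a^1, a^2)
  \;+\; \alpha_t \Big[ r^j + \delta\, \mathrm{Nash}^j\big( Q^1_t(s', \cdot),\, Q^2_t(s', \cdot) \big) \Big]
```

Here Nash^j(·) denotes agent j’s expected payoff in the Nash equilibrium selected for the stage game whose payoffs are the current Q-factors at s'.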

SLIDE 22

The Nash Operator and The Principle of Optimality

• The Nash Operator finds the Nash of a stage game.
• Find the Nash of the stage game with Q-factors as your payoffs.

[Figure annotation: stage-game payoff = current reward + payoffs for the rest of the Markov Game.]

SLIDE 23

The Nash Operator

• Unknown complexity even for 2 players.
• In comparison, the minimax operator can be solved in polynomial time (there’s a linear programming formulation; see the sketch after this list).
• For convergence, all players must break ties in favor of the same Nash Equilibrium.
• Why not go model-based if computation is so expensive?
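For completeness (the slide only asserts that the formulation exists), the LP behind the minimax operator: the row player’s security level in a stage game with payoff matrix Q(s, ·, ·) solves

```latex
\begin{aligned}
\max_{\pi,\; v} \quad & v \\
\text{s.t.} \quad & \sum_{a^1} \pi(a^1)\, Q(s, a^1, a^2) \;\ge\; v
  \qquad \text{for every opponent action } a^2, \\
& \sum_{a^1} \pi(a^1) = 1, \qquad \pi(a^1) \ge 0 .
\end{aligned}
```

No comparable formulation is known for the Nash operator, which is what makes its complexity an open question here.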

SLIDE 24

Convergence Results

• If every stage game encountered during learning has a global optimum, Nash-Q converges.
• If every stage game encountered during learning has a saddle point, Nash-Q converges.
• Both of these are VERY strong assumptions (stated precisely below).
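Paraphrasing Hu & Wellman’s two conditions (my wording, not the slide’s): a global optimum is a joint strategy that simultaneously maximizes every agent’s payoff, and a saddle point is a Nash equilibrium in which a deviation by the other agents can only raise, never lower, an agent’s payoff:

```latex
% Global optimum:
v^j(\sigma) \;\ge\; v^j(\hat{\sigma}) \qquad \text{for all profiles } \hat{\sigma} \text{ and all agents } j.

% Saddle point (with \sigma a Nash equilibrium):
v^j(\sigma^j, \hat{\sigma}^{-j}) \;\ge\; v^j(\sigma) \qquad \text{for all } \hat{\sigma}^{-j} \text{ and all agents } j.
```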

SLIDE 25

Convergence Result Analysis

• The global optimum assumption implies full cooperation between agents.
• The saddle point assumption implies no cooperation between agents.
• Are these equivalent to DP Q-Learning and minimax-Q Learning, respectively?

SLIDE 26

Empirical Testing: The Grid-world

[Figure: grid-world World 1 and some of its Nash Equilibria.]

SLIDE 27

Empirical Testing: Nash Equilibria

[Figure: World 2 and all of its Nash Equilibria, labeled (97%), (3%), and (3%).]

SLIDE 28

Empirical Performance

• In very small and simple games, Nash-Q learning often converged even though theory did not predict so.
• In particular, if all Nash Equilibria have the same value, Nash-Q did better than expected.

SLIDE 29

Conclusions

• Nash-Q is a nice step forward:
– It can be used for any Markov Game.
– It uses the Principle of Optimality in a smart way.
• But there is still a long way to go:
– Convergence results are weak.
– There are no computational complexity results.