SLIDE 1
Nash Q-Learning for General-Sum Stochastic Games
Hu & Wellman
CS286r, March 6th, 2006
Presented by Ilan Lobel

Outline:
– Stochastic Games and Markov Perfect Equilibria
– Bellman’s Operator as a Contraction Mapping
– Stochastic …
SLIDE 2
SLIDE 3
How do we model games that evolve over time?
Stochastic Games! Current Game = State
Ingredients:
– Agents (N)
– States (S)
– Payoffs (R)
– Transition Probabilities (P)
– Discount Factor (δ)
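As a concrete sketch (my own illustration, not from the slides; names and array shapes are assumptions), these ingredients map directly onto a data structure:

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class StochasticGame:
        """Container for the slide's ingredients (two-agent case)."""
        n_agents: int      # N
        n_states: int      # |S|
        R: np.ndarray      # R[j, s, a1, a2]: payoff to agent j
        P: np.ndarray      # P[s, a1, a2, s']: transition probabilities
        delta: float       # discount factor δ in (0, 1)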
SLIDE 4
Example of a Stochastic Game
[Figure: example two-player game with actions labeled A–E and payoff pairs (1,2), (3,4), (5,6), (7,8), (0,0), and (10,10)]
– Move with 30% probability when (B,D)
– Move with 50% probability when (A,C) or (A,D)
δ = 0.9
SLIDE 6
Markov Game is a Generalization of…
– Repeated Games → (add states) → Markov Games
– MDP → (add agents) → Markov Games
SLIDE 7
Markov Perfect Equilibrium (MPE)
Strategy maps states into randomized actions:
– πi: S → Δ(A)
No agent has an incentive to unilaterally change her policy.
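In symbols (a standard statement, not on the slide): writing V^i(s; π) for agent i’s expected discounted payoff from state s under profile π, the profile π* is an MPE when

    V^{i}\big(s;\ \pi^{i*}, \pi^{-i*}\big) \;\ge\; V^{i}\big(s;\ \pi^{i}, \pi^{-i*}\big)
    \qquad \forall i,\ \forall s \in S,\ \forall \pi^{i} : S \to \Delta(A).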
SLIDE 8
Cons & Pros of MPEs
Cons:
– Can’t implement everything described by the Folk Theorems (i.e., no trigger strategies)
Pros:
– MPEs always exist in finite Markov Games (Fink, 64)
– Easier to “search for”
SLIDE 9
Learning in Stochastic Games
Learning is especially important in Markov Games because MPEs are hard to compute.
Do we know:
– Our own payoffs?
– Others’ rewards?
– Transition probabilities?
– Others’ strategies?
SLIDE 10
Learning in Stochastic Games
Adapted from Reinforcement Learning:
– Minimax-Q Learning (zero-sum games)
– Nash-Q Learning
– CE-Q Learning
SLIDE 11
Zero-Sum Stochastic Games
Nice properties:
– All equilibria have the same value.
– Any equilibrium strategy of player 1 against any equilibrium strategy of player 2 produces an MPE.
– It has a Bellman-type equation.
SLIDE 12
Bellman’s Equation in DP
Bellman Operator: T
Bellman’s Equation, rewritten:
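The formulas on this slide did not survive extraction; a standard reconstruction, using the ingredients (R, P, δ) from slide 3:

    (TV)(s) \;=\; \max_{a}\Big[\, R(s,a) + \delta \sum_{s'} P(s' \mid s, a)\, V(s') \,\Big],
    \qquad V^{*} = T V^{*}.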
SLIDE 13
Contraction Mapping
The Bellman Operator has the contraction property, shown below.
Bellman’s Equation is a direct consequence of the contraction.
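The inequality itself is missing from the extraction; the standard sup-norm form:

    \| TV - TW \|_{\infty} \;\le\; \delta\, \| V - W \|_{\infty} \qquad \forall V, W.

By the Banach fixed-point theorem, T then has a unique fixed point V* = TV*, which is exactly Bellman’s equation.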
SLIDE 14
The Shapley Operator for Zero-Sum Stochastic Games
The Shapley Operator is a contraction mapping. (Shapley, 53)
Hence, it also has a fixed point, which is an MPE:
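A reconstruction of the operator (the formula is again missing from the extracted slide): for a two-player zero-sum game in which player 1 receives R(s, a¹, a²),

    (TV)(s) \;=\; \operatorname{val}_{a^{1},\, a^{2}}\Big[\, R(s, a^{1}, a^{2})
      + \delta \sum_{s'} P(s' \mid s, a^{1}, a^{2})\, V(s') \,\Big],

where val[·] is the minimax value of the resulting matrix game. The fixed point V* = TV* gives the value of the stochastic game, and the per-state minimax strategies form an MPE.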
SLIDE 15
Value Iteration for Zero-Sum Stochastic Games
Direct consequence of contraction.
Converges to the fixed point of the operator.
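A minimal sketch of this procedure (my own illustration, not from the slides), using scipy’s LP solver for the inner matrix-game value:

    import numpy as np
    from scipy.optimize import linprog

    def matrix_game_value(M):
        """Minimax value of the zero-sum matrix game M (row player maximizes)."""
        m, n = M.shape
        # Variables: row player's mixed strategy x (length m) and the value v.
        c = np.zeros(m + 1)
        c[-1] = -1.0                                  # maximize v <=> minimize -v
        A_ub = np.hstack([-M.T, np.ones((n, 1))])     # v - (x^T M)_j <= 0 for all j
        b_ub = np.zeros(n)
        A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # sum(x) = 1
        b_eq = np.ones(1)
        bounds = [(0, None)] * m + [(None, None)]     # x >= 0, v free
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[-1]

    def shapley_value_iteration(R, P, delta, iters=200):
        """R[s]: |A1| x |A2| reward matrix; P[s]: |A1| x |A2| x |S| transitions."""
        V = np.zeros(len(R))
        for _ in range(iters):
            # R[s] + delta * P[s] @ V is the stage game the Shapley operator evaluates.
            V = np.array([matrix_game_value(R[s] + delta * P[s] @ V)
                          for s in range(len(R))])
        return V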
SLIDE 16
Q-Learning
Another consequence of a contraction mapping:
– Q-Learning converges!
Q-Learning can be described as an approximation of value iteration:
– Value iteration with noise.
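For concreteness, the tabular update this refers to (a standard sketch; variable names are mine):

    import numpy as np

    def q_learning_step(Q, s, a, r, s_next, t, delta=0.9):
        """One tabular Q-learning step: a noisy, sampled Bellman backup.

        Q has shape (n_states, n_actions); t is the 1-based visit count,
        giving the 1/t learning rate mentioned on the next slide."""
        alpha = 1.0 / t
        target = r + delta * Q[s_next].max()
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target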
SLIDE 17
Q-Learning Convergence
Q-Learning is called a Stochastic Iterative Approximation of Bellman’s operator:
– Learning rate of 1/t.
– Noise is zero-mean and has bounded variance.
It converges if all state-action pairs are visited infinitely often.
(Neuro-Dynamic Programming – Bertsekas, Tsitsiklis)
SLIDE 18
Minimax-Q Learning Algorithm For Zero-Sum Stochastic Games
Initialize your Q0(s, a1, a2) for all states and actions.
Update rule, after which Player 1 chooses its action u1 in the next stage sk+1:
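The formula itself is missing from the extracted slide; the standard Minimax-Q rule (Littman) it presumably showed, after observing reward r1 and next state sk+1:

    Q_{t+1}(s, a^{1}, a^{2}) \;=\; (1 - \alpha_t)\, Q_t(s, a^{1}, a^{2})
      + \alpha_t \Big[\, r^{1} + \delta \operatorname{val}\big[\, Q_t(s_{k+1}, \cdot\,, \cdot) \,\big] \Big].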
SLIDE 19
Minimax-Q Learning
It’s a Stochastic Iterative Approximation of the Shapley Operator.
It converges to a Nash Equilibrium if all state-action-action triplets are visited infinitely often. (Littman, 96)
SLIDE 20
Can we extend it to General-Sum Stochastic Games?
Yes & No.
Nash-Q Learning is such an extension.
However, it has much worse computational and theoretical properties.
SLIDE 21
Nash-Q Learning Algorithm
Initialize Q0j(s, a1, a2) for all states and actions, and for every agent j.
– You must simulate everyone’s Q-factors.
Update rule (then choose the randomized action generated by the Nash operator):
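The slide’s formula is missing; reconstructed from Hu & Wellman’s paper, the update for each simulated agent j is

    Q^{j}_{t+1}(s, a^{1}, a^{2}) \;=\; (1 - \alpha_t)\, Q^{j}_t(s, a^{1}, a^{2})
      + \alpha_t \Big[\, r^{j} + \delta\, \mathrm{Nash}^{j}\big( Q^{1}_t(s'), \dots, Q^{N}_t(s') \big) \Big],

where Nash^j(·) is agent j’s payoff at a selected Nash equilibrium of the stage game whose payoff matrices are the current Q-factors at the next state s'.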
SLIDE 22
The Nash Operator and The Principle of Optimality
The Nash Operator finds the Nash equilibrium of a stage game.
Find the Nash of the stage game with the Q-factors as your payoffs:
Q-factor = Current Reward + Payoffs for the Rest of the Markov Game
SLIDE 23
The Nash Operator
Unknown complexity even for 2 players.
In comparison, the minimax operator can be solved in polynomial time (there’s a linear programming formulation).
For convergence, all players must break ties in favor of the same Nash Equilibrium.
Why not go model-based if computation is so expensive?
SLIDE 24
Convergence Results
If every stage game encountered during learning has a global optimum, Nash-Q converges.
If every stage game encountered during learning has a saddle point, Nash-Q converges.
Both of these are VERY strong assumptions.
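For reference (paraphrasing Hu & Wellman’s definitions, not the slide): with stage-game payoffs r^j, a joint strategy σ* is a global optimum if every agent’s payoff is simultaneously maximized there, and a saddle point if it is a Nash equilibrium at which deviations by the other agents can only (weakly) help agent j:

    \text{Global optimum:}\quad r^{j}(\sigma^{*}) \ge r^{j}(\sigma) \quad \forall j,\ \forall \sigma
    \qquad
    \text{Saddle point:}\quad r^{j}\big(\sigma^{j*}, \sigma^{-j}\big) \ge r^{j}(\sigma^{*}) \quad \forall j,\ \forall \sigma^{-j}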
SLIDE 25
Convergence Result Analysis
The global optimum assumption implies full cooperation between agents.
The saddle point assumption implies no cooperation between agents.
Are these equivalent to DP Q-Learning and Minimax-Q Learning, respectively?
SLIDE 26
Empirical Testing: The Grid-world
WORLD 1: Some Nash Equilibria
[Figure: grid-world diagram showing equilibrium paths]
SLIDE 27
Empirical Testing: Nash Equilibria
WORLD 2: All Nash Equilibria
[Figure: equilibrium paths, labeled 97%, 3%, and 3%]
SLIDE 28
Empirical Performance
In very small and simple games, Nash-Q learning often converged even when the theory did not predict it would.
In particular, when all Nash Equilibria have the same value, Nash-Q did better than expected.
SLIDE 29