SLIDE 1

Two-Timescale Algorithms for Learning Nash Equilibria in General-Sum Stochastic Games

H.L. Prasad†, Prashanth L.A.♯ and Shalabh Bhatnagar♯

†Streamoid Technologies, Inc.    ♯Indian Institute of Science

SLIDE 2

Multi-agent RL setting

[Diagram: agents 1, 2, …, N interacting with a common environment]

At each step, the agents jointly choose action a = ⟨a1, a2, …, aN⟩; the environment returns reward r = ⟨r1, r2, …, rN⟩ and next state y.

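A minimal sketch of this interaction protocol (the environment dynamics and reward rule below are invented placeholders, not from the paper):

```python
import numpy as np

# Toy multi-agent interaction loop. The dynamics and rewards are placeholders.
rng = np.random.default_rng(0)
N, num_states, num_actions = 3, 5, 2

def env_step(state, joint_action):
    """Return per-agent rewards r = <r1,...,rN> and the next state y."""
    y = (state + sum(joint_action)) % num_states                 # toy dynamics
    r = [float(ai == joint_action[0]) for ai in joint_action]    # toy rewards
    return r, y

state = 0
for t in range(3):
    a = [int(rng.integers(num_actions)) for _ in range(N)]       # a = <a1,...,aN>
    r, state = env_step(state, a)                                # r = <r1,...,rN>, next state y
    print(f"t={t} a={a} r={r} y={state}")
```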

SLIDE 3

Problem area

[Diagram: nested problem classes]

  • Stochastic games: (N, S, A, p, r, β), N agents
  • Markov decision processes: (S, A, p, r, β), single agent
  • Normal-form games: (N, A, r), N agents
  • Markov chains: (S, p)

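As one concrete reading of these tuples, here is a sketch of a container for a finite general-sum stochastic game; the field layout and the uniform action count are assumptions, and the other classes fall out as special cases:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StochasticGame:
    """Container for the tuple (N, S, A, p, r, beta); illustrative layout only."""
    N: int              # number of agents
    num_states: int     # |S|
    num_actions: int    # |A^i(x)|, taken to be the same for every agent and state
    p: np.ndarray       # p[x, a1, ..., aN, y]: transition probabilities
    r: np.ndarray       # r[i, x, a1, ..., aN]: reward to agent i
    beta: float         # discount factor in (0, 1)

# N = 1 recovers an MDP (S, A, p, r, beta); a single state recovers a
# normal-form game (N, A, r); fixing the policy leaves a Markov chain (S, p).
```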

SLIDE 4

Problem area (revisited)

[Diagram: stochastic games and normal-form games, each split into zero-sum and general-sum cases; the target here is the general-sum stochastic-game cell]

Design objective: an online algorithm that converges to a Nash equilibrium1

1If NE is a useful objective for learning in games, then we have a strong contribution!

SLIDE 5

A General Optimization Problem

SLIDE 6

Value function

$$v_\pi(s) = E\Big[\, \sum_{t} \beta^t \sum_{a \in A(s_t)} r(s_t, a)\, \pi(s_t, a) \,\Big|\, s_0 = s \Big]$$

(Here $r$ is the reward and $\pi$ the policy.)

A stationary Markov strategy $\pi^* = \langle \pi^{1*}, \pi^{2*}, \ldots, \pi^{N*} \rangle$ is said to be Nash if

$$v^i_{\pi^*}(s) \ge v^i_{\pi^i, \pi^{-i*}}(s), \quad \forall \pi^i,\ \forall i,\ \forall s \in S$$

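When the model is known and finite, $v_\pi$ can be computed exactly by solving the linear system $v = r_\pi + \beta P_\pi v$. A minimal sketch for a toy two-agent game (the shapes and random model are assumptions for illustration):

```python
import numpy as np

# Exact policy evaluation for agent 0 under a fixed joint policy pi.
rng = np.random.default_rng(1)
S, A, beta = 4, 2, 0.9
p = rng.dirichlet(np.ones(S), size=(S, A, A))   # p[x, a1, a2, :] = next-state distribution
r = rng.standard_normal((2, S, A, A))           # r[i, x, a1, a2]
pi = np.full((2, S, A), 1.0 / A)                # pi[i, x, a^i]: uniform policies

# Marginalize rewards and transitions over the joint policy.
w = np.einsum('xa,xb->xab', pi[0], pi[1])       # w[x, a1, a2] = pi1(x, a1) pi2(x, a2)
r_pi = np.einsum('xab,xab->x', w, r[0])         # expected one-step reward to agent 0
P_pi = np.einsum('xab,xaby->xy', w, p)          # state-transition matrix under pi

# v_pi solves (I - beta * P_pi) v = r_pi, matching v_pi(s) = E[sum_t beta^t r | s0 = s].
v0 = np.linalg.solve(np.eye(S) - beta * P_pi, r_pi)
print(v0)
```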

SLIDE 7

Dynamic Programming Idea

$$v^i_{\pi^*}(x) = \max_{\pi^i(x) \in \Delta(A^i(x))} E_{\pi^i(x)}\big[\, Q^i_{\pi^{-i*}}(x, a^i) \,\big]$$

The left-hand side is the optimal (Nash) value; the right-hand side is the marginal value after fixing $a^i \sim \pi^i$. The Q-value is given by

$$Q^i_{\pi^{-i}}(x, a^i) = E_{\pi^{-i}(x)}\Big[\, r^i(x, a) + \beta \sum_{y \in U(x)} p(y \mid x, a)\, v^i(y) \,\Big]$$

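A sketch of the Q-value computation for one agent of a toy two-agent game (all quantities below are invented for illustration):

```python
import numpy as np

# Q^0_{pi^{-0}}(x, a1): Q-values for agent 0 given a value estimate v and the
# opponent's fixed policy pi2.
rng = np.random.default_rng(1)
S, A, beta = 4, 2, 0.9
p = rng.dirichlet(np.ones(S), size=(S, A, A))   # p[x, a1, a2, :]
r = rng.standard_normal((2, S, A, A))           # r[i, x, a1, a2]
pi2 = np.full((S, A), 1.0 / A)                  # opponent policy pi^{-0}(x, .)
v = np.zeros(S)                                 # current value estimate for agent 0

# Inner bracket: r^0(x, a) + beta * sum_y p(y|x, a) v(y), for every joint action a.
target = r[0] + beta * np.einsum('xaby,y->xab', p, v)
# Average out the opponent's action a2 under pi2 to obtain Q[x, a1].
Q = np.einsum('xab,xb->xa', target, pi2)

# For a fixed Q, the max over distributions pi^0(x) of E[Q(x, a1)] is Q[x, :].max().
print(Q.max(axis=1))
```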

SLIDE 8

Optimization problem in informal terms

Need to solve:

$$v^i_{\pi^*}(x) = \max_{\pi^i(x) \in \Delta(A^i(x))} E_{\pi^i(x)}\big[\, Q^i_{\pi^{-i*}}(x, a^i) \,\big] \qquad (1)$$

Formulation:

  • Objective: minimize the Bellman error $v^i(x) - E_{\pi^i} Q^i_{\pi^{-i}}(x, a^i)$ in every state, for every agent
  • Constraint 1: ensure each policy $\pi^i$ is a probability distribution
  • Constraint 2: $Q^i_{\pi^{-i}}(x, a^i) \le v^i_\pi(x)$, a proxy for the max in (1)

SLIDE 9

Optimization problem in formal terms

$$\min_{v, \pi}\ f(v, \pi) = \sum_{i=1}^{N} \sum_{x \in S} \Big( v^i(x) - E_{\pi^i} Q^i_{\pi^{-i}}(x, a^i) \Big)$$

subject to

$$\pi^i(x, a^i) \ge 0, \quad \forall a^i \in A^i(x),\ x \in S,\ i = 1, 2, \ldots, N,$$

$$\sum_{a^i \in A^i(x)} \pi^i(x, a^i) = 1, \quad \forall x \in S,\ i = 1, 2, \ldots, N,$$

$$Q^i_{\pi^{-i}}(x, a^i) \le v^i(x), \quad \forall a^i \in A^i(x),\ x \in S,\ i = 1, 2, \ldots, N.$$

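A sketch of how the objective $f(v, \pi)$ and the constraint residuals could be evaluated numerically for a toy two-agent game (shapes and helper names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, beta = 4, 2, 0.9
p = rng.dirichlet(np.ones(S), size=(S, A, A))   # p[x, a1, a2, :]
r = rng.standard_normal((2, S, A, A))           # r[i, x, a1, a2]
pi = np.full((2, S, A), 1.0 / A)                # candidate policies; rows sum to 1
v = rng.standard_normal((2, S))                 # candidate value variables

def q_values(i):
    """Q^i_{pi^{-i}}(x, a^i), with agent i's own action moved to axis 1."""
    ri = r[i] if i == 0 else np.transpose(r[i], (0, 2, 1))
    pp = p if i == 0 else np.transpose(p, (0, 2, 1, 3))
    target = ri + beta * np.einsum('xaby,y->xab', pp, v[i])
    return np.einsum('xab,xb->xa', target, pi[1 - i])

Q = np.stack([q_values(0), q_values(1)])        # Q[i, x, a^i]

# Objective: f(v, pi) = sum_i sum_x ( v^i(x) - E_{pi^i} Q^i(x, a^i) ).
f = sum((v[i] - (pi[i] * Q[i]).sum(axis=1)).sum() for i in range(2))

# Constraints: pi >= 0, each pi^i(x, .) sums to 1, and Q^i(x, a^i) <= v^i(x).
feasible = (pi >= 0).all() and np.allclose(pi.sum(axis=2), 1.0) \
           and (Q <= v[:, :, None] + 1e-12).all()
print(f, feasible)
```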

SLIDE 10

Solution approach

  • Usual approach: apply the KKT conditions to solve the general optimization problem
  • Caveat: this imposes a tricky linear-independence requirement
  • Alternative: use a simpler set of SG-SP conditions

SLIDE 11

A sufficient condition

SG-SP point: A point $(v^*, \pi^*)$ is said to be an SG-SP point if it is feasible and, for all $x \in S$ and $i \in \{1, 2, \ldots, N\}$,

$$\pi^{i*}(x, a^i)\, g^i_{x,a^i}(v^{i*}, \pi^{-i*}(x)) = 0, \quad \forall a^i \in A^i(x),$$

where $g^i_{x,a^i}(v^i, \pi^{-i}(x)) := Q^i_{\pi^{-i}}(x, a^i) - v^i(x)$.

Nash ⇔ SG-SP: a strategy $\pi^*$ is Nash if and only if $(v^*, \pi^*)$ is an SG-SP point.

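A sketch of the SG-SP test for a single agent and state, assuming $v$, $\pi$ and $Q$ are given (the function and tolerances are illustrative):

```python
import numpy as np

def is_sgsp(v, pi, Q, tol=1e-8):
    """Check the SG-SP condition for one agent in one state (illustrative).
    v: scalar v^i(x); pi: distribution pi^i(x, .); Q: vector Q^i_{pi^{-i}}(x, .)."""
    g = Q - v                                       # g^i_{x,a} = Q^i(x, a) - v^i(x)
    feasible = (pi >= -tol).all() and abs(pi.sum() - 1) < tol and (g <= tol).all()
    complementary = np.all(np.abs(pi * g) < tol)    # pi^i(x,a) g^i_{x,a} = 0 for all a
    return feasible and complementary

# Example: mixing only over actions whose Q-value attains v satisfies SG-SP.
Q = np.array([1.0, 1.0, 0.3])
print(is_sgsp(v=1.0, pi=np.array([0.5, 0.5, 0.0]), Q=Q))   # True
print(is_sgsp(v=1.0, pi=np.array([0.2, 0.3, 0.5]), Q=Q))   # False: mass where g < 0
```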

SLIDE 12

An Online Algorithm: ON-SGSP

SLIDE 13

ON-SGSP’s decentralized online learning model

[Diagram: agents 1, 2, …, N each run ON-SGSP locally; agent i sends its action ai to the environment and observes the reward r and next state y]

SLIDE 14

ON-SGSP - operational flow

[Diagram: policy evaluation produces the value $v^i_\pi$, policy improvement produces the policy $\pi^i$, each feeding the other]

  • Policy evaluation: estimate the value function using temporal difference (TD) learning
  • Policy improvement: perform gradient descent on the policy along a descent direction
  • The descent direction ensures convergence to a global minimum of the optimization problem (see the sketch below)
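A schematic of the two-timescale pattern behind this flow, with TD(0) on the faster step size and a policy step on the slower one; this illustrates the coupling only, not the paper's exact ON-SGSP update rules:

```python
import numpy as np

# Two-timescale loop for one agent: the value estimate moves on the faster
# step size b_n, the policy on the slower a_n, with a_n / b_n -> 0 so the
# evaluation effectively sees a quasi-static policy.
rng = np.random.default_rng(3)
S, A, beta = 4, 2, 0.9
v = np.zeros(S)
pi = np.full((S, A), 1.0 / A)

def step_sizes(n):
    return 1.0 / (n + 1), 1.0 / (n + 1) ** 0.6      # a_n (slow), b_n (fast)

x = 0
for n in range(10000):
    a = rng.choice(A, p=pi[x])
    r_n = float(a == x % A)                          # toy reward
    y = int(rng.integers(S))                         # toy transition
    a_n, b_n = step_sizes(n)

    # Faster timescale: TD(0) policy evaluation.
    delta = r_n + beta * v[y] - v[x]
    v[x] += b_n * delta

    # Slower timescale: move the policy along a TD-error-based direction,
    # then project back onto the simplex (illustrative improvement step).
    pi[x, a] += a_n * delta
    pi[x] = np.clip(pi[x], 1e-6, None)
    pi[x] /= pi[x].sum()
    x = y
print(v, pi, sep="\n")
```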

SLIDE 15

More on the descent direction

Descend along

$$\pi^i(x, a^i)\; g^i_{x,a^i}(v^i, \pi^{-i}) \times \overline{\mathrm{sgn}}\!\left(\frac{\partial f(v, \pi)}{\partial \pi^i}\right)$$

  • TD-learning for policy evaluation
  • From Lagrange multiplier and slack-variable theory
  • The solution tracks an ODE whose limit is an SG-SP point

1$\overline{\mathrm{sgn}}$ is a continuous version of sgn
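A sketch of this descent direction for one state, using tanh as the continuous sgn surrogate (one common choice; the paper's exact $\overline{\mathrm{sgn}}$ and the gradient values below are assumptions):

```python
import numpy as np

def sgn_bar(z, eps=1e-2):
    """Continuous surrogate for sgn; tanh is one common choice, the paper's
    exact form may differ."""
    return np.tanh(z / eps)

# Illustrative descent direction for agent i in state x:
# direction_a = pi^i(x, a) * g^i_{x,a}(v^i, pi^{-i}) * sgn_bar(df/dpi^i(x, a))
pi_x = np.array([0.5, 0.3, 0.2])      # pi^i(x, .)
g = np.array([-0.2, 0.1, -0.4])       # g = Q - v, as on the SG-SP slide
grad_f = np.array([0.3, -0.5, 0.1])   # partial of f w.r.t. pi^i(x, .) (assumed given)

direction = pi_x * g * sgn_bar(grad_f)
pi_x_new = pi_x - 0.1 * direction     # small step; a simplex projection would follow
print(direction, pi_x_new)
```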

SLIDE 16

Experiments

SLIDE 17

A single state non-generic 2-player game

Payoff matrix (Player 1 picks the row, Player 2 the column; entries are r1, r2):

            a1      a2      a3
    a1     1, 0    0, 1    1, 0
    a2     0, 1    1, 0    1, 0
    a3     0, 1    0, 1    1, 1

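Both limit points reported on the next slide, ((0.5, 0.5, 0), (0.5, 0.5, 0)) and ((0, 0, 1), (0, 0, 1)), can be verified to be Nash equilibria of this bimatrix game with a short best-response test (the helper below is illustrative):

```python
import numpy as np

R1 = np.array([[1, 0, 1],
               [0, 1, 1],
               [0, 0, 1]], dtype=float)   # payoffs to Player 1
R2 = np.array([[0, 1, 0],
               [1, 0, 0],
               [1, 1, 1]], dtype=float)   # payoffs to Player 2

def is_nash(x, y, tol=1e-9):
    """True if neither player can gain by deviating unilaterally from (x, y)."""
    u1, u2 = x @ R1 @ y, x @ R2 @ y
    return (R1 @ y).max() <= u1 + tol and (x @ R2).max() <= u2 + tol

print(is_nash(np.array([.5, .5, 0.]), np.array([.5, .5, 0.])))   # True
print(is_nash(np.array([0., 0., 1.]), np.array([0., 0., 1.])))   # True
```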

SLIDE 18

A single state non-generic 2-player game

Results from 100 simulation runs:

                                              NashQ    FFQ (Friend Q)    ON-SGSP
  Oscillate or converge to non-Nash strategy   95%          40%             0%
  Converge to (0.5, 0.5, 0)                     2%           0%            99%
  Converge to (0, 0, 1)                         3%          60%             1%

SLIDE 19

Stick-Together Game

Figure: Stick Together Game for M = 3

For M = 30, STG has 810,000 (= 900 × 900) states!

SLIDE 20

Results for STG with M = 30

[Plot: average distance dn vs. number of iterations (up to 5 × 10^7), comparing FFQ, NashQ and ON-SGSP]

ON-SGSP takes the agents to within a 4 × 4 grid, while NashQ/FFQ only reach an 8 × 8 grid. Moreover, FFQ/NashQ have higher per-iteration complexity than ON-SGSP.

SLIDE 21

Thank You!
