Two-Timescale Algorithms for Learning Nash Equilibria in General-Sum Stochastic Games

  1. Two-Timescale Algorithms for Learning Nash Equilibria in General-Sum Stochastic Games
     H.L. Prasad†, Prashanth L.A.♯ and Shalabh Bhatnagar♯
     † Streamoid Technologies, Inc    ♯ Indian Institute of Science

  2. Multi-agent RL setting
     [Figure: N agents interacting with a common environment. The agents jointly choose the action $a = (a^1, a^2, \ldots, a^N)$; the environment returns the reward vector $r = (r^1, r^2, \ldots, r^N)$ and the next state $y$ to all agents.]

  3. Problem area
     Markov Chains: (S, p)
     Markov Decision Processes: (S, A, p, r, β), single agent
     Normal-form Games: (N, A, r), N agents
     Stochastic Games: (N, S, A, p, r, β), N agents

  4. Problem area (revisited)
     Zero-sum normal-form games, zero-sum stochastic games and general-sum normal-form games are well studied; general-sum stochastic games are the target here.
     Design objective: an online algorithm for general-sum stochastic games that converges to a Nash equilibrium [1].
     [1] If NE is a useful objective for learning in games, then we have a strong contribution!

  5. A General Optimization Problem

  6. Value function
     $v_\pi(s) = E\left[ \sum_{t} \beta^t \sum_{a \in A(s_t)} r(s_t, a)\, \pi(s_t, a) \,\middle|\, s_0 = s \right]$
     (value function on the left; $r$ is the reward, $\pi$ the policy)
     A stationary Markov strategy $\pi^* = (\pi^{1*}, \pi^{2*}, \ldots, \pi^{N*})$ is said to be Nash if
     $v^i_{\pi^*}(s) \ge v^i_{(\pi^i, \pi^{-i*})}(s), \quad \forall \pi^i,\ \forall i,\ \forall s \in S$
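
     As a concrete, purely illustrative complement to the definition above, the following sketch evaluates $v_\pi$ for one agent when the joint policy is held fixed, by solving the induced linear system. The arrays `P`, `R`, `pi` and the discount `beta` are hypothetical placeholders, not data from the talk.

```python
import numpy as np

def policy_value(P, R, pi, beta):
    """Evaluate v_pi for one agent under a fixed stationary joint policy.

    P[s, a, y] : probability of moving to state y from s under joint action a
    R[s, a]    : this agent's expected reward in state s under joint action a
    pi[s, a]   : probability that joint action a is played in state s
    beta       : discount factor in (0, 1)

    Solves (I - beta * P_pi) v = r_pi, the linear system behind
    v_pi(s) = E[ sum_t beta^t r(s_t, a_t) | s_0 = s ].
    """
    S = P.shape[0]
    P_pi = np.einsum('sa,say->sy', pi, P)   # state-to-state transitions under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - beta * P_pi, r_pi)

# Tiny hypothetical example: 2 states, 2 joint actions
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])
print(policy_value(P, R, pi, beta=0.9))
```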

  7. Dynamic Programming Idea
     $v^i_{\pi^*}(x) = \max_{\pi^i(x) \in \Delta(A^i(x))} E_{\pi^i(x)}\left[ Q^i_{\pi^{-i*}}(x, a^i) \right]$
     (left side: optimal (Nash) value; right side: marginal value after fixing $a^i \sim \pi^i$)
     where the Q-value is given by
     $Q^i_{\pi^{-i}}(x, a^i) = E_{\pi^{-i}(x)}\left[ r^i(x, a) + \beta \sum_{y \in U(x)} p(y \mid x, a)\, v^i(y) \right]$
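
     The Q-value above can be computed directly once the model and the other agents' policy are known. A minimal sketch for the 2-agent case, under assumed array conventions (`P`, `R1`, `v1`, `pi2` are hypothetical names):

```python
import numpy as np

def q_value_agent1(x, a1, P, R1, v1, pi2, beta):
    """Q^1_{pi^{-1}}(x, a1) for agent 1 in a 2-agent game:
    expectation over agent 2's policy of the one-step reward plus
    the discounted value of the next state.

    P[x, a1, a2, y] : transition probabilities under joint action (a1, a2)
    R1[x, a1, a2]   : agent 1's reward
    v1[y]           : agent 1's current value estimates
    pi2[x, a2]      : agent 2's policy in state x
    """
    q = 0.0
    for a2, prob in enumerate(pi2[x]):
        backup = R1[x, a1, a2] + beta * P[x, a1, a2] @ v1
        q += prob * backup
    return q
```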

  8. Optimization problem - informal terms
     Need to solve:
       $v^i_{\pi^*}(x) = \max_{\pi^i(x) \in \Delta(A^i(x))} E_{\pi^i(x)}\left[ Q^i_{\pi^{-i*}}(x, a^i) \right]$   (1)
     Formulation:
       Objective:    minimize the Bellman error $v^i(x) - E_{\pi^i} Q^i_{\pi^{-i}}(x, a^i)$ in every state, for every agent
       Constraint 1: ensure each policy $\pi^i$ is a probability distribution
       Constraint 2: $Q^i_{\pi^{-i}}(x, a^i) \le v^i_\pi(x)$, a proxy for the max in (1)

  9. Optimization problem in formal terms
     $\min_{v, \pi}\; f(v, \pi) = \sum_{i=1}^{N} \sum_{x \in S} \left[ v^i(x) - E_{\pi^i} Q^i_{\pi^{-i}}(x, a^i) \right]$
     subject to
       $\pi^i(x, a^i) \ge 0, \quad \forall a^i \in A^i(x),\ x \in S,\ i = 1, 2, \ldots, N,$
       $\sum_{a^i \in A^i(x)} \pi^i(x, a^i) = 1, \quad \forall x \in S,\ i = 1, 2, \ldots, N,$
       $Q^i_{\pi^{-i}}(x, a^i) \le v^i(x), \quad \forall a^i \in A^i(x),\ x \in S,\ i = 1, 2, \ldots, N.$
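
     To make the program concrete, here is a minimal sketch (assumed array layout, 2-agent case, hypothetical names) of how one could evaluate the objective $f(v, \pi)$ and check the constraints at a candidate point, given precomputed Q-values:

```python
import numpy as np

def objective_and_feasibility(v, pi, Q):
    """f(v, pi) and constraint checks for a 2-agent stochastic game.

    v[i]  : array of shape (S,)      - value estimates of agent i
    pi[i] : array of shape (S, A_i)  - policy of agent i
    Q[i]  : array of shape (S, A_i)  - Q^i_{pi^{-i}}(x, a^i)
    """
    f = 0.0
    feasible = True
    for i in range(2):
        bellman_error = v[i] - np.sum(pi[i] * Q[i], axis=1)   # per state
        f += bellman_error.sum()
        # Constraint 1: each pi^i(x, .) is a probability distribution
        feasible &= np.all(pi[i] >= 0) and np.allclose(pi[i].sum(axis=1), 1.0)
        # Constraint 2: Q^i(x, a^i) <= v^i(x) for every action
        feasible &= np.all(Q[i] <= v[i][:, None] + 1e-12)
    return f, feasible
```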

  10. Solution approach
     Usual approach: apply the KKT conditions to solve the general optimization problem.
     Caveat: this imposes a tricky linear independence requirement.
     Alternative: use a simpler set of SG-SP conditions.

  11. A sufficient condition
     SG-SP Point: A point $(v^*, \pi^*)$ is said to be an SG-SP point if it is feasible and, for all $x \in S$ and $i \in \{1, 2, \ldots, N\}$,
       $\pi^{i*}(x, a^i)\; g^i_{x, a^i}(v^{i*}, \pi^{-i*}(x)) = 0, \quad \forall a^i \in A^i(x),$
     where $g^i_{x, a^i}(v^i, \pi^{-i}(x)) := Q^i_{\pi^{-i}}(x, a^i) - v^i(x)$.
     Nash ⇔ SG-SP: A strategy $\pi^*$ is Nash if and only if $(v^*, \pi^*)$ is an SG-SP point.
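
     Given the same assumed array layout as in the earlier sketches, the SG-SP complementarity condition can be checked mechanically; feasibility still has to be verified separately (e.g. with the previous sketch). This is only an illustrative check, not part of the algorithm itself.

```python
import numpy as np

def satisfies_sgsp_complementarity(v, pi, Q, tol=1e-8):
    """Check pi^i(x, a^i) * g^i_{x,a^i}(v^i, pi^{-i}) = 0 for every
    state, agent and action, where g^i_{x,a^i} = Q^i(x, a^i) - v^i(x).

    v[i] has shape (S,), pi[i] and Q[i] have shape (S, A_i) (2-agent case).
    """
    for i in range(2):
        g = Q[i] - v[i][:, None]           # g^i_{x, a^i}
        if np.any(np.abs(pi[i] * g) > tol):
            return False
    return True
```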

  12. An Online Algorithm: ON-SGSP

  13. ON-SGSP's decentralized online learning model
     [Figure: each of the N agents runs its own copy of ON-SGSP; every agent sends its action $a^i$ to the environment and observes the reward and next state $(r, y)$.]

  14. ON-SGSP - operational flow
     Loop: policy $\pi^i$ → policy evaluation → value $v_{\pi^i}$ → policy improvement → updated policy $\pi^i$.
     Policy evaluation: estimate the value function using temporal-difference (TD) learning.
     Policy improvement: perform gradient descent on the policy along a particular descent direction.
     This descent direction ensures convergence to a global minimum of the optimization problem; a schematic sketch of the two-timescale loop follows below.
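
     The following is only a schematic sketch of this two-timescale loop for one agent, not the exact ON-SGSP update rules from the paper: a fast TD(0) step for policy evaluation and a slow descent step on the policy using $g^i = Q^i - v^i$ (the full descent direction, including the $\overline{\mathrm{sgn}}$ factor, is shown on the next slide). All array names and step-size constants are hypothetical.

```python
import numpy as np

def two_timescale_step(i, x, a, r, y, v, pi, Q_est, n,
                       beta=0.9, c_fast=1.0, c_slow=0.1):
    """One schematic two-timescale update for agent i after observing
    (state x, joint action a, reward vector r, next state y).

    Fast timescale : TD(0) update of the value estimate v[i] (policy evaluation).
    Slow timescale : small descent step on pi[i] along pi^i(x, .) * g^i(x, .),
                     with g^i = Q^i - v^i (policy improvement).
    Step sizes are chosen so the value estimate tracks the current policy
    while the policy itself moves slowly.
    """
    alpha = c_fast / (n + 1) ** 0.6     # faster step size
    b = c_slow / (n + 1)                # slower step size

    # Policy evaluation (fast): TD(0) on agent i's value function
    delta = r[i] + beta * v[i][y] - v[i][x]
    v[i][x] += alpha * delta

    # Policy improvement (slow): descend along pi^i(x, .) * g^i(x, .)
    g = Q_est[i][x] - v[i][x]           # g^i_{x, a^i} for all a^i
    pi_x = pi[i][x] - b * pi[i][x] * g
    pi_x = np.maximum(pi_x, 1e-12)      # crude projection back onto the simplex
    pi[i][x] = pi_x / pi_x.sum()
    return v, pi
```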

  15. More on the descent direction
     Descend along
       $\pi^i(x, a^i)\; g^i_{x,a^i}(v^i, \pi^{-i}) \times \overline{\mathrm{sgn}}\!\left( \frac{\partial f(v, \pi)}{\partial \pi^i} \right)$
     The factor $g^i_{x,a^i}(v^i, \pi^{-i})$ is obtained from TD-learning for policy evaluation; the $\overline{\mathrm{sgn}}(\cdot)$ factor comes from Lagrange multiplier and slack variable theory [1].
     The solution tracks an ODE whose limit is an SG-SP point.
     [1] $\overline{\mathrm{sgn}}$ is a continuous version of sgn.
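
     The slide does not give the exact form of $\overline{\mathrm{sgn}}$; a common continuous surrogate is a clipped linear ramp (or a tanh), as in this purely illustrative sketch:

```python
import numpy as np

def sgn_bar(x, tau=0.01):
    """A continuous surrogate for the sign function: a piecewise-linear ramp
    that equals sgn(x) outside [-tau, tau] and interpolates linearly inside.
    (The exact form used by ON-SGSP is not shown on the slide; this is only
    an illustrative stand-in.)
    """
    return np.clip(x / tau, -1.0, 1.0)

print(sgn_bar(np.array([-1.0, -0.005, 0.0, 0.005, 1.0])))
# -> [-1.  -0.5  0.   0.5  1. ]
```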

  16. Experiments

  17. A single state non-generic 2-player game
     Payoff matrix (rows: Player 1's actions, columns: Player 2's actions; each entry is (reward to Player 1, reward to Player 2)):

              a1      a2      a3
       a1    1, 0    0, 1    1, 0
       a2    0, 1    1, 0    1, 0
       a3    0, 1    0, 1    1, 1
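
     As a quick sanity check (not part of the talk's experiments, which concern learned strategies), the payoff matrix can be encoded and its pure-strategy Nash equilibria enumerated:

```python
import numpy as np

# Payoff matrices from the slide: rows are Player 1's actions, columns Player 2's.
R1 = np.array([[1, 0, 1],
               [0, 1, 1],
               [0, 0, 1]])
R2 = np.array([[0, 1, 0],
               [1, 0, 0],
               [1, 1, 1]])

def pure_nash_equilibria(R1, R2):
    """(i, j) is a pure Nash equilibrium if neither player can improve
    by a unilateral deviation."""
    eqs = []
    for i in range(R1.shape[0]):
        for j in range(R1.shape[1]):
            if R1[i, j] >= R1[:, j].max() and R2[i, j] >= R2[i, :].max():
                eqs.append((i, j))
    return eqs

print(pure_nash_equilibria(R1, R2))   # [(2, 2)], i.e. the profile (a3, a3)
```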
