SLIDE 1

Rational Learning of Mixed Equilibria in Stochastic Games∗

Michael Bowling UAI Workshop: Beyond MDPs June 30, 2000

∗Joint work with Manuela Veloso

SLIDE 2

Overview

  • Stochastic Game Framework
  • Existing Techniques ... and Their Shortcomings

  • A New Algorithm
  • Experimental Results
SLIDE 3

Stochastic Game Framework

Stochastic Games

  • Multiple State
  • Multiple Agent

MDPs

  • Single Agent
  • Multiple State

Matrix Games

  • Single State
  • Multiple Agent
SLIDE 4

Markov Decision Processes

A Markov decision process (MDP) is a tuple (S, A, T, R), where,

  • S is the set of states,
  • A is the set of actions,
  • T is a transition function S × A × S → [0, 1],
  • R is a reward function S × A → ℜ.

[Figure: from state s, taking action a yields reward R(s, a) and moves to state s′ with probability T(s, a, s′).]
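As a concrete reading of this tuple, here is a minimal sketch (the class name MDP, the array layout, and NumPy itself are my choices, not from the slides) showing the representation and one Bellman backup, the single-agent computation the later algorithms generalize.

```python
import numpy as np

class MDP:
    """Illustrative container: T has shape |S| x |A| x |S'|, R has shape |S| x |A|."""
    def __init__(self, T, R, gamma=0.9):
        self.T, self.R, self.gamma = T, R, gamma

def bellman_backup(mdp, V):
    """One sweep of value iteration:
    V(s) <- max_a [ R(s, a) + gamma * sum_s' T(s, a, s') V(s') ]."""
    Q = mdp.R + mdp.gamma * (mdp.T @ V)   # expected one-step return, shape |S| x |A|
    return Q.max(axis=1)
```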

SLIDE 5

Matrix Games

A matrix game is a tuple (n, A1...n, R1...n), where,

  • n is the number of players,
  • Ai is the set of actions available to player i

– A is the joint action space A1 × . . . × An,

  • Ri is player i’s payoff function A → ℜ.

[Figure: the game is represented by one payoff matrix per player, R1 and R2, whose rows and columns are indexed by the players’ actions (a1, a2) and whose entries are R1(a) and R2(a) for each joint action a.]

SLIDE 6

Matrix Games – Example: Rock-Paper-Scissors

  • Two players. Each simultaneously picks an action:

Rock, Paper, or Scissors.

  • The rules:

Rock beats Scissors
Scissors beats Paper
Paper beats Rock

  • Represent game as two matrices, one for each player:

(Rows and columns are ordered Rock, Paper, Scissors.)

R1 = [  0  −1   1 ]
     [  1   0  −1 ]
     [ −1   1   0 ]

R2 = −R1 = [  0   1  −1 ]
           [ −1   0   1 ]
           [  1  −1   0 ]

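For concreteness, the same matrices in NumPy; this is a sketch, and the helper expected_payoff is illustrative rather than anything from the slides.

```python
import numpy as np

# Rock-Paper-Scissors payoffs; rows and columns are ordered Rock, Paper, Scissors.
R1 = np.array([[ 0, -1,  1],
               [ 1,  0, -1],
               [-1,  1,  0]])
R2 = -R1  # zero-sum game

def expected_payoff(R, sigma_row, sigma_col):
    """Expected payoff to the row player when both players use mixed strategies."""
    return sigma_row @ R @ sigma_col

uniform = np.ones(3) / 3
print(expected_payoff(R1, uniform, uniform))  # 0.0 -- uniform mixing makes the game fair
```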
SLIDE 7

Matrix Games – Best Response

  • No optimal opponent-independent strategy.
  • Mixed (i.e. stochastic) strategies do not help.
  • Opponent-dependent strategies:

Definition 1 For a game, define the best-response function for player i, BRi(σ−i), to be the set of all, possibly mixed, strategies that are optimal given the other player(s) play the possibly mixed joint strategy σ−i.
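A small sketch of this definition for matrix games (the function name and the example opponent strategy are illustrative): against a fixed mixed strategy σ−i, the best-response set is every mixture over the pure actions of maximal expected payoff.

```python
import numpy as np

R1 = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])  # Rock-Paper-Scissors, as above

def best_response_actions(R_i, sigma_opponent):
    """Pure actions of player i that are optimal against the fixed mixed strategy
    sigma_opponent; every mixture over these actions lies in BR_i(sigma_opponent)."""
    expected = R_i @ sigma_opponent   # expected payoff of each pure action
    return np.flatnonzero(np.isclose(expected, expected.max()))

print(best_response_actions(R1, np.array([0.8, 0.1, 0.1])))  # [1]: Paper beats a mostly-Rock opponent
print(best_response_actions(R1, np.ones(3) / 3))             # [0 1 2]: every action ties vs. uniform
```

Against the uniform strategy every action, and hence the uniform mixture itself, is a best response, which is the equilibrium described on the next slide.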

SLIDE 8

Matrix Games – Equilibria

  • Best-response equilibrium [Nash, 1950],

Definition 2 A Nash equilibrium is a collection of strategies (possibly mixed) for all players, σi, with σi ∈ BRi(σ−i).

  • An equilibrium in Rock-Paper-Scissors consists of both players randomizing evenly among all their actions.

SLIDE 9

Stochastic Game Framework

Stochastic Games

  • Multiple State
  • Multiple Agent

MDPs

  • Single Agent
  • Multiple State

Matrix Games

  • Single State
  • Multiple Agent
SLIDE 10

Stochastic Games

A stochastic game is a tuple (n, S, A1...n, T, R1...n), where,

  • n is the number of agents,
  • S is the set of states,
  • Ai is the set of actions available to agent i,

– A is the joint action space A1 × . . . × An,

  • T is the transition function S × A × S → [0, 1],
  • Ri is the reward function for the ith agent S × A → ℜ.

[Figure: from state s, the joint action a = (a1, a2) yields each agent i the reward Ri(s, a), given by agent i’s payoff matrix at s, and moves to state s′ with probability T(s, a, s′).]

SLIDE 11

Stochastic Games – Example

[Figure: grid-world soccer field with goal A at one end and goal B at the other.]

  • Players: Two
  • States: Players’ positions and possession of the ball (780 states).
  • Actions: N, S, E, W, Hold (5 actions).
  • Transitions:

– Actions are selected simultaneously but executed in a random order.
– If a player moves into the other player’s square, the stationary player gets possession of the ball.

  • Rewards: Reward is only received when the ball is moved into one of the goals.

[Littman, 1994]

SLIDE 12

Solving Stochastic Games

Matrix Game Solver + MDP Solver = Stochastic Game Solver

MG   MDP     Game Theory                  RL
LP   TD(0)   Shapley                      Minimax-Q
LP   TD(1)   Pollatschek and Avi-Itzhak   –
LP   TD(λ)   Van der Wal                  –
QP   TD(0)   –                            Hu and Wellman
FP   TD(0)   Fictitious Play              JALs / Opponent-Modeling

LP: linear programming, QP: quadratic programming, FP: fictitious play

SLIDE 13

Minimax-Q

  1. Initialize Q(s ∈ S, a ∈ A) arbitrarily.
  2. Repeat,

(a) From state s select action ai that solves the matrix game [Q(s, a)]a∈A, with some exploration.
(b) Observing joint action a, reward r, and next state s′,

    Q(s, a) ← (1 − α)Q(s, a) + α(r + γV(s′)),

    where V(s) = Value([Q(s, a)]a∈A).

[Littman, 1994]

  • In zero-sum games, learns an equilibrium almost independent of the actions selected by the opponent.
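A sketch of the step that distinguishes Minimax-Q from ordinary Q-learning, the per-state zero-sum matrix-game solve, assuming SciPy's linprog is available; the names matrix_game_value and minimax_q_update are illustrative, not Littman's code.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q_s):
    """Maximin value and mixed strategy for the row player of the zero-sum matrix
    game Q_s (rows = own actions, columns = opponent actions)."""
    n, m = Q_s.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0                                       # variables [pi_1..pi_n, v]; maximize v
    A_ub = np.hstack([-Q_s.T, np.ones((m, 1))])        # v - sum_a pi(a) Q_s[a, o] <= 0 for every o
    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)   # probabilities sum to 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.x[-1], res.x[:n]

def minimax_q_update(Q, s, a_joint, r, s_next, alpha=0.1, gamma=0.9):
    """One backup; Q[s] is the matrix of action values over joint actions at state s."""
    v_next, _ = matrix_game_value(Q[s_next])
    Q[s][a_joint] = (1 - alpha) * Q[s][a_joint] + alpha * (r + gamma * v_next)
```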

SLIDE 14

Joint-Action Learners

  1. Initialize Q(s ∈ S, a ∈ A) arbitrarily.
  2. Repeat,

(a) From state s select action ai that maximizes

    Σa−i [C(s, a−i) / n(s)] Q(s, ai, a−i)

(b) Observing other agents’ actions a−i, reward r, and next state s′,

    C(s, a−i) ← C(s, a−i) + 1
    n(s) ← n(s) + 1
    Q(s, ai, a−i) ← (1 − α)Q(s, ai, a−i) + α(r + γV(s′))

    where V(s) = maxai Σa−i [C(s, a−i) / n(s)] Q(s, ai, a−i).

[Claus & Boutilier, 1998; Uther & Veloso, 1997]
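A sketch of the bookkeeping for a single joint-action learner in the two-player case; the class name, the tabular layout, and the greedy choose method are my own framing of the update above.

```python
import numpy as np

class JAL:
    """Tabular joint-action learner for one agent (two players for brevity)."""
    def __init__(self, n_states, n_my_actions, n_opp_actions, alpha=0.1, gamma=0.9):
        self.Q = np.zeros((n_states, n_my_actions, n_opp_actions))
        self.C = np.zeros((n_states, n_opp_actions))  # counts of the opponent's actions per state
        self.n = np.zeros(n_states)                   # visit counts per state
        self.alpha, self.gamma = alpha, gamma

    def expected_Q(self, s):
        """Q(s, a_i) averaged over the empirical model of the opponent."""
        freq = self.C[s] / max(self.n[s], 1)
        return self.Q[s] @ freq

    def choose(self, s):
        return int(self.expected_Q(s).argmax())       # greedy; exploration would be layered on top

    def update(self, s, a_i, a_opp, r, s_next):
        self.C[s, a_opp] += 1
        self.n[s] += 1
        V_next = self.expected_Q(s_next).max()
        self.Q[s, a_i, a_opp] = ((1 - self.alpha) * self.Q[s, a_i, a_opp]
                                 + self.alpha * (r + self.gamma * V_next))
```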

SLIDE 15

Joint-Action Learners

  • Finds equilibrium (when playing another JAL) in:

– Fully collaborative games [Claus & Boutilier, 1998],
– Iterated dominance solvable games [Fudenberg & Levine, 1998],
– Fully competitive games [Uther & Veloso, 1997].

  • Plays deterministically (i.e. cannot play mixed policies).
SLIDE 16

Problems with Existing Algorithms

  • Minimax-Q

– Converges to an equilibrium, independent of the opponent’s actions.
– Will not converge to a best-response unless the opponent also plays the equilibrium solution.
  ∗ Consider a player that almost always plays Rock.

  • Q-Learning, JALs, etc.

– Always seeks to maximize reward.
– Does not converge to stationary policies if the opponent is also learning.
  ∗ Cannot play mixed strategies.

SLIDE 17

Properties

Property 1 (Rational) If the other players’ strategies converge to stationary strategies, then the player will converge to a strategy that is optimal given their strategies.

Property 2 (Convergent) Given that the other players are following behaviors from a class of behaviors, B, all the players will converge to stationary strategies.

Algorithm    Rational   Convergent
Minimax-Q    No         Yes
JAL          Yes        No

  • If all players are rational and they converge to stationary strategies, they must have converged to an equilibrium.

  • If all players are both rational and convergent, then they are guaranteed to converge to an equilibrium.

SLIDE 18

A New Algorithm – Policy Hill-Climbing

  1. Let α and δ be learning rates. Initialize,

     Q(s, a) ← 0,   π(s, a) ← 1/|Ai|.

  2. Repeat,

(a) From state s select action a according to mixed strategy π(s) with some exploration.
(b) Observing reward r and next state s′,

    Q(s, a) ← (1 − α)Q(s, a) + α(r + γ maxa′ Q(s′, a′)).

(c) Update π(s, a) and constrain it to a legal probability distribution,

    π(s, a) ← π(s, a) + δ               if a = argmaxa′ Q(s, a′)
    π(s, a) ← π(s, a) − δ/(|Ai| − 1)    otherwise.

  • PHC is rational, but still not convergent.
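A sketch of PHC as one agent's learner; class and attribute names are illustrative, and a simple clip-and-renormalize stands in for the slide's "constrain it to a legal probability distribution" step.

```python
import numpy as np

class PHC:
    """Policy hill-climbing: Q-learning plus a small step of the policy toward greedy."""
    def __init__(self, n_states, n_actions, alpha=0.1, delta=0.01, gamma=0.9):
        self.Q = np.zeros((n_states, n_actions))
        self.pi = np.full((n_states, n_actions), 1.0 / n_actions)
        self.alpha, self.delta, self.gamma = alpha, delta, gamma

    def choose(self, s, rng=np.random):
        return rng.choice(self.pi.shape[1], p=self.pi[s])   # sample from the mixed strategy

    def update(self, s, a, r, s_next):
        # (b) Ordinary Q-learning backup.
        self.Q[s, a] = ((1 - self.alpha) * self.Q[s, a]
                        + self.alpha * (r + self.gamma * self.Q[s_next].max()))
        # (c) Move pi(s) a step of size delta toward the greedy action, then renormalize.
        n_actions = self.pi.shape[1]
        step = np.full(n_actions, -self.delta / (n_actions - 1))
        step[self.Q[s].argmax()] = self.delta
        self.pi[s] = np.clip(self.pi[s] + step, 0.0, 1.0)
        self.pi[s] /= self.pi[s].sum()
```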
SLIDE 19

A New Algorithm – Adjusted Policy Hill-Climbing

  • APHC preserves rationality, while encouraging convergence.

– Makes a change only to the algorithm’s learning rate.
– “Learn faster while losing, slower while winning.”

  1. Let α, δl > δw be learning rates. Initialize,

     Q(s, a) ← 0,   π(s, a) ← 1/|Ai|.

  2. Repeat,

(a,b) Same as PHC.
(c) Maintain a running estimate of the average policy, π̄.
(d) Update π(s, a) and constrain it to a legal probability distribution,

    π(s, a) ← π(s, a) + δ               if a = argmaxa′ Q(s, a′)
    π(s, a) ← π(s, a) − δ/(|Ai| − 1)    otherwise,

    where δ = δw if Σa′ π(s, a′)Q(s, a′) > Σa′ π̄(s, a′)Q(s, a′), and δ = δl otherwise.
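Continuing the PHC sketch above, an adjusted version that keeps a running average policy π̄ and switches between δw and δl depending on whether the current policy is doing better than the average against the current Q values; the names are illustrative, not the authors' code.

```python
import numpy as np

class APHC(PHC):
    """Adjusted PHC: learn fast (delta_l) while losing, slowly (delta_w) while winning."""
    def __init__(self, n_states, n_actions, alpha=0.1, delta_w=0.01, delta_l=0.04, gamma=0.9):
        super().__init__(n_states, n_actions, alpha, delta_w, gamma)
        self.delta_w, self.delta_l = delta_w, delta_l
        self.pi_bar = np.full((n_states, n_actions), 1.0 / n_actions)  # average policy estimate
        self.visits = np.zeros(n_states)

    def update(self, s, a, r, s_next):
        # (c) Running estimate of the average policy at s.
        self.visits[s] += 1
        self.pi_bar[s] += (self.pi[s] - self.pi_bar[s]) / self.visits[s]
        # (d) Winning if the current policy scores better against Q than the average policy does.
        winning = self.pi[s] @ self.Q[s] > self.pi_bar[s] @ self.Q[s]
        self.delta = self.delta_w if winning else self.delta_l
        super().update(s, a, r, s_next)   # PHC backup and hill-climbing step with the chosen delta
```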

SLIDE 20

Results – Rock-Paper-Scissors

[Figure: trajectories of the players’ mixed policies, plotted as Pr(Rock) vs. Pr(Paper) for Player 1 and Player 2, under PHC (left) and APHC (right).]

SLIDE 21

Results – Soccer

[Figure: grid-world soccer field with goals A and B, and a bar chart of % games won (scale 10–50) for the matchups M-M, APHC-APHC, APHC-APHC(x2), PHC-PHC(L), and PHC-PHC(W).]

SLIDE 22

Discussion

  • Why convergence?

– Non-stationary policies are hard to evaluate.
– Complications with assigning delayed reward.

  • Why rationality?

– Multiple equilibria.
– Opponent may not be playing optimally.

  • What’s next?

– More experimental results on more interesting problems.
– Family of learning algorithms.
– Theoretical analysis of convergence.
– Learning in the presence of agents with limitations.

http://www.cs.cmu.edu/~mhb/publications/