  1. MARKOV GAMES: A framework for multi-agent reinforcement learning. Presented by Shen (Sean) Chen

  2. Review on MDPs
  ■ An MDP is defined by a set of states, S, and a set of actions, A.
  ■ Transition function T: S × A → PD(S), where PD(S) denotes the discrete probability distributions over the set S.
  ■ Reward function R: S × A → ℝ, which specifies the agent's task.
  ■ Objective: find a policy mapping the agent's interaction history to a current choice of action so as to maximize the expected sum of discounted rewards, E[ Σ_{j=0}^∞ γ^j r_{t+j} ].
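As a concrete (purely hypothetical) illustration of this tuple, the pieces of a tiny MDP could be written out in Python roughly as follows; the states, actions, probabilities, and rewards are invented for the sketch and do not come from the paper or the slides.

```python
# A minimal sketch of the MDP tuple <S, A, T, R> as plain Python data.
from typing import Dict, Tuple

S = ["s0", "s1"]          # set of states S
A = ["left", "right"]     # set of actions A

# T: S x A -> PD(S), a discrete probability distribution over next states
T: Dict[Tuple[str, str], Dict[str, float]] = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s1": 1.0},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s1": 1.0},
}

# R: S x A -> reals, the reward function that specifies the agent's task
R: Dict[Tuple[str, str], float] = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 0.0, ("s1", "right"): 0.0,
}

gamma = 0.9  # discount factor in the objective E[ sum_j gamma^j * r_{t+j} ]
```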

  3. Markov Games
  ■ A Markov game is defined by a set of states S and a collection of action sets, A_1, A_2, ..., A_k, one for each agent in the environment.
  ■ State transitions are controlled by the current state and one action from each agent: T: S × A_1 × A_2 × ⋯ × A_k → PD(S).
  ■ Each agent i has its own reward function: R_i: S × A_1 × A_2 × ⋯ × A_k → ℝ.
  ■ Objective: each agent i finds a policy that maximizes E[ Σ_{j=0}^∞ γ^j r_{i,t+j} ], where r_{i,t+j} is the reward received j steps into the future by agent i.
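For the two-player, zero-sum special case that the rest of the talk focuses on, the same components can be sketched with one joint-action transition function and a single reward that agent A maximizes and the opponent O minimizes. Everything here (state names, the particular transitions, the reward) is made up for illustration.

```python
# A minimal two-player, zero-sum Markov game sketch with joint actions.
from typing import Dict

S = ["s0", "s1"]
A_actions = ["a0", "a1"]          # agent A's action set
O_actions = ["o0", "o1"]          # opponent O's action set

def T(s: str, a: str, o: str) -> Dict[str, float]:
    """T: S x A x O -> PD(S); deterministic here for brevity."""
    return {"s1": 1.0} if (a, o) == ("a0", "o1") else {s: 1.0}

def R(s: str, a: str, o: str) -> float:
    """Reward to agent A; the opponent receives -R(s, a, o) in the zero-sum case."""
    return 1.0 if (s, a, o) == ("s0", "a1", "o0") else 0.0
```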

  4. MDPs vs. Markov Games
  ■ MDP:
  – Assumes stationarity in the environment
  – Learns deterministic policies, hence agents are not adaptive
  ■ Markov games:
  – An extension of game theory to MDP-like environments
  – Include multiple adaptive agents with interacting or competing goals
  – The minimax strategy allows the agent to converge to a fixed strategy that is guaranteed to be 'safe', in that it does as well as possible against the worst possible opponent

  5. Optimal Policy – Matrix Games
  ■ Every two-player, simultaneous-move, zero-sum game has a Nash equilibrium
  ■ Suppose we have two agents: A (the agent) and O (the opponent)
  ■ Value V = E(π_A*, π_O*), where V is from the perspective of A
  ■ E(π_A*, π_O) ≥ V for any opponent policy π_O
  ■ E(π_A, π_O*) ≤ V for any agent policy π_A

  6. Optimal Policy – Matrix Games
  ■ The agent's policy is a probability distribution over its actions
  ■ The optimal agent's minimum expected reward should be as large as possible
  ■ Imagine a policy that is guaranteed an expected score of V no matter which action the opponent chooses
  ■ For π to be optimal, we must identify the largest V for which there is some value of π that makes the constraints hold, using linear programming
  ■ Objective: V = max_{π ∈ PD(A)} min_{o ∈ O} Σ_{a ∈ A} R_{o,a} π_a
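The linear program in the last bullet can be written down directly. Below is a minimal sketch, assuming numpy and scipy are available; matrix_game_value is a helper name chosen for this example, not something from the paper.

```python
# Choose a mixed policy pi over the agent's actions that maximizes the
# worst-case expected reward V, i.e. V = max_pi min_o sum_a R[o, a] * pi[a].
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(R: np.ndarray):
    """R[o, a] = reward to the agent when it plays a and the opponent plays o."""
    n_o, n_a = R.shape
    # Decision variables x = [pi_1, ..., pi_{n_a}, V]; linprog minimizes, so use -V.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o:  V - sum_a R[o, a] * pi[a] <= 0
    A_ub = np.hstack([-R, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # The probabilities pi must sum to one.
    A_eq = np.zeros((1, n_a + 1))
    A_eq[0, :n_a] = 1.0
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_a], res.x[-1]

# Example: rock-paper-scissors style payoffs give pi = (1/3, 1/3, 1/3) and V = 0.
pi, V = matrix_game_value(np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]]))
```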

  7. Optimal Policy – MDPs
  ■ Method: value iteration
  ■ Quality of a state-action pair: the total expected discounted reward attained by the non-stationary policy that takes action a at state s and continues optimally thereafter:
    Q(s, a) = R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V(s')
  – The immediate reward plus the discounted value of all succeeding states, weighted by their likelihood
  ■ Value of a state: the total expected discounted reward attained by the optimal policy starting from state s, i.e. the quality of the best action for that state:
    V(s) = max_{a ∈ A} Q(s, a)
  ■ Knowing Q is enough to specify an optimal policy, because the action with the highest Q-value can be chosen in each state
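A self-contained value-iteration sketch for the MDP case; the small randomly generated MDP, the sizes, and the iteration cap are arbitrary choices for the example.

```python
# Value iteration: Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') V(s'),
# V(s) = max_a Q(s,a), repeated until convergence.
import numpy as np

n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # T[s, a, s'] sums to 1 over s'
R = rng.uniform(-1, 1, size=(n_s, n_a))            # R[s, a]

V = np.zeros(n_s)
for _ in range(1000):
    Q = R + gamma * T @ V          # Q[s, a] = R[s, a] + gamma * sum_s' T[s,a,s'] V[s']
    V_new = Q.max(axis=1)          # V[s] = max_a Q[s, a]
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)          # the greedy action in each state is optimal
```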

  8. Optimal Policy – Markov Games
  ■ Redefine V(s): the expected reward for the optimal policy starting from state s:
    V(s) = max_{π ∈ PD(A)} min_{o ∈ O} Σ_{a ∈ A} Q(s, a, o) π_a
  ■ For games with alternating turns, i.e. when an optimal deterministic policy exists, V(s) need not be computed by LP:
    V(s) = max_{a ∈ A} min_{o ∈ O} Q(s, a, o)
  ■ Q(s, a, o): the expected reward for taking action a when the opponent chooses o from state s and continuing optimally thereafter:
    Q(s, a, o) = R(s, a, o) + γ Σ_{s' ∈ S} T(s, a, o, s') V(s')
  ■ An analogous value iteration algorithm can be shown to converge to the correct values
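The Markov-game analogue replaces the max over actions with the value of the matrix game Q(s, ·, ·), obtained by the same LP as in the earlier sketch. The randomly generated game below is again purely illustrative.

```python
# Value iteration for a two-player zero-sum Markov game: V(s) is the value of
# the matrix game Q(s, ., .), solved by linear programming at every state.
import numpy as np
from scipy.optimize import linprog

def game_value(M: np.ndarray) -> float:
    """Value of the zero-sum matrix game M[o, a] (payoff to the maximizer)."""
    n_o, n_a = M.shape
    c = np.zeros(n_a + 1); c[-1] = -1.0                    # maximize V
    A_ub = np.hstack([-M, np.ones((n_o, 1))])              # V <= sum_a M[o,a] pi[a]
    A_eq = np.zeros((1, n_a + 1)); A_eq[0, :n_a] = 1.0     # pi is a distribution
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n_o), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * n_a + [(None, None)])
    return float(res.x[-1])

n_s, n_a, n_o, gamma = 3, 2, 2, 0.9
rng = np.random.default_rng(1)
T = rng.dirichlet(np.ones(n_s), size=(n_s, n_a, n_o))      # T[s, a, o, s']
R = rng.uniform(-1, 1, size=(n_s, n_a, n_o))               # R[s, a, o]

V = np.zeros(n_s)
for _ in range(1000):
    Q = R + gamma * T @ V                                   # Q[s, a, o]
    V_new = np.array([game_value(Q[s].T) for s in range(n_s)])  # LP over pi at each s
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
```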

  9. Optimal Policy – Learning Process
  ■ Minimax-Q: an alternative to the traditional value iteration method, based on the update Q(s, a) := r + γ V(s') (see the sketch below)
  – The updates are performed asynchronously, without the use of the transition function T
  – The probability of each update is precisely T
  – The rule converges to the correct values of Q and V if
  ■ every action is tried in every state infinitely often, and
  ■ the new estimates are blended with previous ones using a slowly enough decaying exponentially weighted average
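One asynchronous minimax-Q update might look like the sketch below, assuming a Q-table indexed by (s, a, o) as on slide 8 and the game_value LP helper from the previous sketch; the blending with the learning rate alpha is the exponentially weighted average referred to above, and the default decay constant is the one quoted later in the experiments.

```python
# One minimax-Q update on an observed transition (s, a, o, r, s').
def minimax_q_update(Q, V, s, a, o, r, s_next, alpha, gamma=0.9, decay=0.9999954):
    # Blend the sampled target r + gamma * V[s'] into the old estimate.
    Q[s, a, o] = (1.0 - alpha) * Q[s, a, o] + alpha * (r + gamma * V[s_next])
    # Refresh V(s) by re-solving the matrix game Q(s, ., .) (transposed to [o, a]).
    V[s] = game_value(Q[s].T)
    # The learning rate decays slowly toward zero over training.
    return alpha * decay
```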

  10. Experiments
  ■ The minimax-Q learning algorithm is evaluated on a simple two-player, zero-sum Markov game modelled after the game of soccer
  ■ This is a well-studied specialization in which there are only two agents and they have diametrically opposed goals

  11. Experiments – Soccer Game
  ■ Actions: N, S, E, W, and stand
  ■ The two players' moves are executed in random order
  ■ The circle represents the ball
  ■ Goals: left for A, right for B
  ■ Possession of the ball is randomly initialized when the game is reset
  ■ Discount factor: 0.9
  ■ To do better than breaking even against an unknown defender, an offensive agent must use a probabilistic policy
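A rough skeleton of the environment, under the assumption that the grid is the 5 × 4 board from Littman's paper; the collision/steal and goal rules are simplified here and the starting squares are arbitrary.

```python
import random

# Five actions per player, as on the slide.
MOVES = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0), "stand": (0, 0)}
WIDTH, HEIGHT = 5, 4   # board size assumed from the original paper

def reset():
    """Place the players (arbitrary squares here) and randomize possession."""
    return {"A": (3, 2), "B": (1, 1), "ball": random.choice(["A", "B"])}

def step(state, action_a, action_b):
    """Execute the two chosen moves in random order; return (state, reward_to_A, done)."""
    for player, action in random.sample([("A", action_a), ("B", action_b)], 2):
        other = "B" if player == "A" else "A"
        x, y = state[player]
        dx, dy = MOVES[action]
        nx = min(max(x + dx, 0), WIDTH - 1)
        ny = min(max(y + dy, 0), HEIGHT - 1)
        if (nx, ny) == state[other]:
            # Simplified collision rule: the move is blocked, and a ball-carrying
            # mover loses possession to the stationary player.
            if state["ball"] == player:
                state["ball"] = other
            continue
        state[player] = (nx, ny)
        # Simplified scoring: the ball carrier reaching its goal column ends the
        # game (goal sides as on the slide: left for A, right for B).
        if state["ball"] == player and nx == (0 if player == "A" else WIDTH - 1):
            return state, (1 if player == "A" else -1), True
    return state, 0, False
```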

  12. Experiments – Training and Testing
  Four different policies were learnt.
  Using minimax-Q (explor = 0.2, decay = 10^(log₁₀(0.01)/10⁶) = 0.9999954, checked in the sketch below):
  ■ MR: minimax-Q trained against a uniformly random opponent
  ■ MM: minimax-Q trained against minimax-Q (separate Q- and V-tables)
  Using Q-learning (the 'max' operator used in place of minimax; the Q-table does not track the opponent's actions):
  ■ QR: Q trained against a uniformly random opponent
  ■ QQ: Q trained against Q (separate Q-tables)
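The decay constant can be checked with a couple of lines of Python: it is chosen so that a learning rate multiplied by it on every step falls to 1% of its initial value after one million training steps.

```python
import math

explor = 0.2
decay = 10 ** (math.log10(0.01) / 10**6)    # = 0.9999954...
alpha = 1.0
for _ in range(10**6):                      # one million training steps
    alpha *= decay
print(round(decay, 7), round(alpha, 4))     # 0.9999954  0.01
```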

  13. Experiments – Training and Testing
  Three ways of evaluating the resulting policies:
  ■ First, each policy was run head-to-head against a random policy for 100,000 steps
  – To emulate the discount factor, every step had a 0.1 probability of being declared a draw
  – Wins and losses against the random opponent were tabulated
  ■ Second, head-to-head competition with a hand-built policy
  – The hand-built policy was deterministic and had simple rules for scoring and blocking
  ■ Third, Q-learning was used to train a 'challenger' opponent for each of MR, MM, QR, and QQ
  – The training procedure followed that of QR: the 'champion' policy was held fixed while the challenger was trained against it
  – The resulting policies were evaluated against their respective champions

  14. Experiments – Results

  15. Discussion and Questions
  ■ Why is it that in games such as checkers, backgammon, and Go, "the minimax operator (in minimax-Q) can be implemented extremely efficiently"?
  ■ Does the optimal strategy/policy always need to be mixed? Can it be pure, e.g. π = (0, 1, 0)? How would you design a Markov game in which only pure strategies would be sufficient?
  ■ What if we have two separate sets of rewards for the agents, rather than a zero-sum setting?
  ■ Will the minimax/maximin strategy work for an n-player game?
