MARKOV GAMES: A framework for multi-agent reinforcement learning


SLIDE 1

MARKOV GAMES

A framework for multi-agent reinforcement learning
Shen (Sean) Chen

SLIDE 2

Review of MDPs

■ An MDP is defined by a set of states, S, and a set of actions, A.
■ Transition function $T: S \times A \to PD(S)$, where PD(S) is the set of discrete probability distributions over S.
■ Reward function $R: S \times A \to \mathbb{R}$, which specifies the agent’s task.
■ Objective: find a policy mapping the interaction history to a current choice of action so as to maximize the expected sum of discounted rewards, $E\{\sum_{j=0}^{\infty} \gamma^{j} r_{t+j}\}$ (a tabular representation is sketched below).
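To make the tuple concrete, the sketch below stores an MDP as tabular arrays; the class name and array layout are illustrative choices, not something from the slides.

```python
import numpy as np

# Illustrative tabular MDP: T[s, a, s'] is the transition probability,
# R[s, a] the immediate reward, and gamma the discount factor.
class TabularMDP:
    def __init__(self, T, R, gamma=0.9):
        assert np.allclose(T.sum(axis=2), 1.0), "each T[s, a, :] must sum to 1"
        self.T, self.R, self.gamma = T, R, gamma
        self.num_states, self.num_actions = R.shape
```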

SLIDE 3

Markov Games

■ A Markov game is defined by a set of states, S, and a collection of action sets, $A_1, A_2, \ldots, A_k$, one for each agent in the environment.
■ State transitions are controlled by the current state and one action from each agent: $T: S \times A_1 \times A_2 \times \cdots \times A_k \to PD(S)$.
■ Each agent i has its own reward function: $R_i: S \times A_1 \times A_2 \times \cdots \times A_k \to \mathbb{R}$.
■ Objective: each agent i finds a policy that maximizes $E\{\sum_{j=0}^{\infty} \gamma^{j} r_{i,t+j}\}$, where $r_{i,t+j}$ is the reward received j steps into the future by agent i (the two-player zero-sum case is sketched below).
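For the two-player zero-sum specialization used in the rest of the slides, the same tabular layout extends with one extra action axis; again, the class below is a hypothetical sketch rather than anything from the paper.

```python
import numpy as np

# Illustrative two-player zero-sum Markov game: T[s, a, o, s'] is the transition
# probability given the agent's action a and the opponent's action o, and
# R[s, a, o] is the agent's reward (the opponent receives -R[s, a, o]).
class TabularMarkovGame:
    def __init__(self, T, R, gamma=0.9):
        assert np.allclose(T.sum(axis=3), 1.0), "each T[s, a, o, :] must sum to 1"
        self.T, self.R, self.gamma = T, R, gamma
        self.num_states, self.num_agent_actions, self.num_opponent_actions = R.shape
```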

SLIDE 4

MDPs vs. Markov Games

■ MDP:
– Assumes stationarity in the environment
– Learns deterministic policies, hence the agent is not adaptive
■ Markov Games:
– An extension of game theory to MDP-like environments
– Include multiple adaptive agents with interacting or competing goals
– The minimax strategy lets the agent converge to a fixed strategy that is guaranteed to be ‘safe’, in that it does as well as possible against the worst possible opponent

SLIDE 5

Optimal Policy – Matrix Games

■ Every two-player, simultaneous-move, zero-sum game has a Nash equilibrium.
■ Suppose we have two agents: A (the agent) and O (the opponent).
■ The value of the game is $V = E[\pi_A^{*}, \pi_O^{*}]$, where V is from the perspective of A and $\pi_A^{*}, \pi_O^{*}$ are the two players’ optimal policies.
■ $E[\pi_A^{*}, \pi_O] \ge V$ for any opponent policy $\pi_O$: playing its optimal policy, A is guaranteed at least V.
■ $E[\pi_A, \pi_O^{*}] \le V$ for any agent policy $\pi_A$: against the opponent’s optimal policy, A can achieve at most V.

SLIDE 6

Optimal Policy – Matrix Games

■ The agent’s policy is a probability distribution over actions.
■ The optimal agent’s minimum expected reward should be as large as possible.
■ Imagine a policy that is guaranteed an expected score of V no matter which action the opponent chooses.
■ For π to be optimal, we must identify the largest V for which some value of π makes the constraints hold; this can be done with linear programming (see the sketch below).
■ Objective: $V = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} R_{a,o}\, \pi_a$
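This objective can be handed directly to an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog; the helper name and the payoff-matrix layout (rows indexed by the agent’s actions, columns by the opponent’s) are my own assumptions, not from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(R):
    """Solve V = max_{pi in PD(A)} min_{o in O} sum_a R[a, o] * pi[a] by LP.

    R has shape (|A|, |O|); returns (value, pi). Hypothetical helper.
    """
    n_a, n_o = R.shape
    # Variables: pi[0..n_a-1] followed by the game value v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0                                  # linprog minimizes, so minimize -v
    # For every opponent action o: v - sum_a R[a, o] * pi[a] <= 0
    A_ub = np.hstack([-R.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # The pi entries must form a probability distribution.
    A_eq = np.ones((1, n_a + 1))
    A_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n_a + [(None, None)]      # v itself is unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]

# Example: rock-paper-scissors has value 0 and the uniform policy is optimal.
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(solve_matrix_game(rps))                     # ~ (0.0, [1/3, 1/3, 1/3])
```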

SLIDE 7
SLIDE 8
SLIDE 9

Optimal Policy – MDPs

■ Method: value iteration.
■ Quality of a state-action pair: the total expected discounted reward attained by the non-stationary policy that takes action a in state s and behaves optimally afterwards, i.e. the immediate reward plus the discounted value of all succeeding states weighted by their likelihood:
$Q(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V(s')$
■ Value of a state: the total expected discounted reward attained by the optimal policy starting from state s, i.e. the quality of the best action available in that state:
$V(s) = \max_{a \in A} Q(s, a)$
■ Knowing Q is enough to specify an optimal policy, because the action with the highest Q-value can be chosen in each state (a value-iteration sketch follows below).
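A compact value-iteration loop for the tabular MDP sketched earlier (illustrative only; the array layout follows the TabularMDP sketch above, not anything prescribed by the slides):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, iters=1000, tol=1e-8):
    """Tabular value iteration with T[s, a, s'] and R[s, a] as assumed earlier."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
        Q = R + gamma * (T @ V)        # matmul contracts the trailing s' axis
        V_new = Q.max(axis=1)          # V(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    Q = R + gamma * (T @ V)            # final Q-table; argmax per state gives the policy
    return V, Q
```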

SLIDE 10

Optimal Policy – Markov Games

■ Redefine V(s) as the expected reward for the optimal policy starting from state s:
$V(s) = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} Q(s, a, o)\, \pi_a$
■ For games with alternating turns, the optimal policy is deterministic, so V(s) need not be computed by linear programming:
$V(s) = \max_{a \in A} \min_{o \in O} Q(s, a, o)$
■ Q(s, a, o) is the expected reward for taking action a when the opponent chooses o from state s and continuing optimally thereafter:
$Q(s, a, o) = R(s, a, o) + \gamma \sum_{s' \in S} T(s, a, o, s')\, V(s')$
■ The analogous value iteration algorithm (sketched below) can be shown to converge to the correct values.
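Combining the two previous sketches gives a minimal minimax value iteration for the zero-sum game: the max over actions is replaced by one small linear program per state, using the hypothetical solve_matrix_game helper from above.

```python
import numpy as np

def minimax_value_iteration(T, R, gamma=0.9, iters=200):
    """Value iteration for a two-player zero-sum Markov game (illustrative sketch).

    T[s, a, o, s'] and R[s, a, o] follow the tabular layout assumed earlier.
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    for _ in range(iters):
        # Q(s, a, o) = R(s, a, o) + gamma * sum_s' T(s, a, o, s') * V(s')
        Q = R + gamma * (T @ V)
        # V(s) = max_{pi in PD(A)} min_o sum_a Q(s, a, o) * pi_a  (one LP per state)
        V = np.array([solve_matrix_game(Q[s])[0] for s in range(n_states)])
    return V, Q
```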

SLIDE 11

Optimal Policy – Learning Process

■ Minimax-Q: an alternative to the traditional value iteration method, based on the sampled update $Q(s, a) := r + \gamma\, V(s')$ (see the sketch below)
– The updates are performed asynchronously, without using the transition function T
– The probability of each update occurring is precisely T
– The rule converges to the correct values of Q and V if:
■ every action is tried in every state infinitely often, and
■ the new estimates are blended with previous ones using a slowly enough decaying exponentially weighted average
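One asynchronous update step can then be written as a single function; blending the sampled target with the old estimate via the learning rate alpha is the exponentially weighted average referred to above. The opponent’s action o is kept in the Q-table as on the earlier slides, and solve_matrix_game is the hypothetical LP helper sketched before.

```python
def minimax_q_update(Q, V, pi, s, a, o, r, s_next, alpha, gamma=0.9):
    """One minimax-Q update on sampled experience (s, a, o, r, s_next); a sketch.

    Q[s, a, o], V[s], and pi[s] are NumPy tables; alpha is decayed over time.
    """
    # Blend the sampled target r + gamma * V(s') with the previous estimate.
    Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * (r + gamma * V[s_next])
    # Re-solve the matrix game at s to refresh the state's value and policy.
    V[s], pi[s] = solve_matrix_game(Q[s])
    return Q, V, pi
```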

SLIDE 12

Experiments

■ A minimax-Q learning algorithm is demonstrated on a simple two-player zero-sum Markov game modelled after the game of soccer.
■ This is a well-studied specialization in which there are only two agents and they have diametrically opposed goals.

SLIDE 13

Experiments – Soccer Game

■ Actions: N, S, E, W, stand
■ The two agents’ moves are executed in random order
■ The circle represents the ball
■ Goals: A on the left, B on the right
■ Possession of the ball is randomly assigned when the game is reset
■ Discount factor: 0.9
■ To do better than breaking even against an unknown defender, an offensive agent must use a probabilistic policy

SLIDE 14

Experiments – Training and Testing

Four different policies were learnt.

Using minimax-Q (explor = 0.2, decay = $10^{(\log_{10} 0.01)/10^{6}}$ = 0.9999954; reproduced in the snippet below):
■ MR: minimax-Q trained against a uniformly random opponent
■ MM: minimax-Q trained against minimax-Q (separate Q- and V-tables)

Using Q-learning (the ‘max’ operator is used in place of minimax, and the Q-table does not track the opponent’s actions):
■ QR: Q-learning trained against a uniformly random opponent
■ QQ: Q-learning trained against Q-learning (separate Q- and V-tables)
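For reference, the quoted decay factor can be reproduced numerically; under this reading of the formula it multiplies the learning rate each step so that the rate falls from 1.0 to 0.01 over 10^6 training steps.

```python
import math

# Learning-rate decay per step: the rate reaches 0.01 after 10**6 multiplications.
decay = 10 ** (math.log10(0.01) / 10**6)
print(decay)   # ~0.9999954
```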

SLIDE 15

Experiments – Training and Testing

Three ways of evaluating the resulting policies:
■ First, each policy was run head-to-head against a random policy for 100,000 steps.
– To emulate the discount factor, every step had a 0.1 probability of being declared a draw.
– Wins and losses against the random opponent were tabulated.
■ Second, head-to-head competition against a hand-built policy.
– The hand-built policy was deterministic, with simple rules for scoring and blocking.
■ Third, Q-learning was used to train a ‘challenger’ opponent for each of MR, MM, QR, and QQ.
– The training procedure followed that of QR: the ‘champion’ policy was held fixed while the challenger was trained against it.
– The resulting challengers were evaluated against their respective champions.

SLIDE 16

Experiments – Results

SLIDE 17

Discussion and Questions

■ Why is it that in games such as checkers, backgammon, and Go, “the minimax operator (in minimax-Q) can be implemented extremely efficiently”?
■ Does the optimal strategy/policy always need to be mixed? Can it be pure, e.g. $\pi = (0, 1, 0)$? How would you design a Markov game in which pure strategies alone would be sufficient?
■ What if the two agents have their own reward functions, rather than a zero-sum setting?
■ Will the minimax/maximin strategy work for an n-player game?