SLIDE 1

Cooperation and Communication in Multiagent Deep Reinforcement Learning

Matthew Hausknecht

Nov 28, 2016 Advisor: Peter Stone

SLIDE 2

Motivation

Intelligent decision making is at the heart of AI.

SLIDE 3

Thesis Question

How can the power of Deep Neural Networks be leveraged to extend Reinforcement Learning towards domains featuring partial observability, continuous parameterized action spaces, and sparse rewards?

How can Deep Reinforcement Learning agents learn to cooperate in a multiagent setting?

SLIDE 4

Contributions

  • Half Field Offense Environment
  • Deep RL in parameterized action space
  • Multiagent Deep RL
  • Deep Recurrent Q-Network (DRQN)
  • Curriculum learning in HFO

SLIDE 5

Outline

  • 1. Background
  • 2. Deep Reinforcement Learning
  • 3. Multiagent Architectures
  • 4. Communication

SLIDE 6

Markov Decision Process

[Diagram: agent–environment loop with action a_t, state s_t, reward r_t]

Formalizes the interaction between the agent and the environment.
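For reference, the standard MDP tuple the diagram depicts (standard notation, not spelled out on the slide):

M = (S, A, P, R, \gamma), \qquad P(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s,\, a_t = a), \qquad r_t = R(s_t, a_t)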

SLIDE 7

Half Field Offense

Cooperative multiagent soccer domain built on the libraries used by the RoboCup competition. Objective: learn a goal-scoring policy for the offense agents. Features continuous actions, partial observability, and opportunities for multiagent coordination.

SLIDE 8

Half Field Offense

SLIDE 9

SLIDE 10

SLIDE 11

State and Action Spaces

58 continuous state features encoding distances and angles to points of interest.

Parameterized-continuous action space:
 Dash(direction, power)
 Turn(direction)
 Tackle(direction)
 Kick(direction, power)

Choose one discrete action plus its continuous parameters every timestep (see the sketch below).
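A minimal sketch of how such a parameterized action might be represented; the names and types here are illustrative, not the actual HFO bindings:

```python
from dataclasses import dataclass

# Illustrative encoding of the parameterized action space above. Each
# timestep the agent picks one discrete action type plus that action's
# continuous parameters.
@dataclass
class ParameterizedAction:
    name: str      # one of: "Dash", "Turn", "Tackle", "Kick"
    params: tuple  # that action's continuous parameters

# Which continuous parameters each discrete action expects.
ACTION_PARAMS = {
    "Dash":   ("direction", "power"),
    "Turn":   ("direction",),
    "Tackle": ("direction",),
    "Kick":   ("direction", "power"),
}

a = ParameterizedAction("Kick", (30.0, 80.0))  # kick at 30 degrees, 80 power
assert len(a.params) == len(ACTION_PARAMS[a.name])
```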

SLIDE 12

Learning in HFO is difficult

SLIDE 13

Reinforcement Learning

Reinforcement Learning provides a general framework for sequential decision making. Objective: learn a policy that maximizes the discounted sum of future rewards. A deterministic policy π is a mapping from states to actions: for each encountered state, which action is best to perform?
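Written out in standard notation (γ ∈ [0, 1) is the discount factor), the objective is to find the policy maximizing expected discounted return:

\pi^{*} = \arg\max_{\pi}\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, \pi \right]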

SLIDE 14

Q-Value Function

Estimates the expected return from a given state-action pair. Answers the question: “How good is action a from state s?” The optimal Q-value function yields an optimal policy.

Q^\pi(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s, a \,\right]

SLIDE 15

Deep Neural Network

Parametric model with stacked layers of representation. Powerful, general-purpose function approximator. Parameters θ optimized via backpropagation.

SLIDE 16

Deep Reinforcement Learning

Neural network used to approximate the Q-value function and policy π. Replay memory: a queue of recent experience tuples (s, a, r, s′) seen by the agent. Updates to the network are done on experience sampled randomly from the replay memory.
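A minimal replay-memory sketch matching that description; the capacity and batch size are illustrative choices, not taken from the slides:

```python
import random
from collections import deque

# Bounded queue of (s, a, r, s') tuples with uniform random sampling.
class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experience falls out

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```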

SLIDE 17

Deep Deterministic Policy Gradients

Model-free deep actor-critic architecture [Lillicrap ’15]. The actor learns a policy π; the critic learns to estimate Q-values. The actor outputs 4 actions + 6 parameters; a_t is the highest-activated of the 4 actions, together with its associated parameter(s).

[Diagram: actor and critic networks, each a stack of fully connected ReLU layers of sizes 1024, 512, 256, 128; the actor maps the state to 4 actions + 6 parameters, the critic outputs a Q-value]
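A PyTorch sketch of the two networks using the layer sizes from the diagram. The slide does not say how the critic consumes the action, so concatenating it with the state at the input is an assumption here:

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS, NUM_PARAMS = 58, 4, 6  # from the HFO slides

# Actor: state -> 4 discrete-action activations + 6 continuous parameters.
actor = nn.Sequential(
    nn.Linear(STATE_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, NUM_ACTIONS + NUM_PARAMS),
)

# Critic: (state, action vector) -> scalar Q-value.
critic = nn.Sequential(
    nn.Linear(STATE_DIM + NUM_ACTIONS + NUM_PARAMS, 1024), nn.ReLU(),
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

s = torch.randn(32, STATE_DIM)         # batch of states
a = actor(s)                           # action activations + parameters
q = critic(torch.cat([s, a], dim=1))   # Q(s, a)
```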

SLIDE 18

Training

Critic trained using temporal difference, given experience (s_t, a_t, r_t, s_{t+1}):

y = r_t + \gamma\, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q)

Actor trained via the critic's gradients:

\nabla_{\theta^\mu} \mu(s) = \nabla_a Q(s, a \mid \theta^Q)\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)

[Diagram: actor (θ^μ: state → 4 actions + 6 parameters) and critic (θ^Q: state, action → Q-value); the critic passes ∇_a Q(s, a) back to the actor]
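A hedged sketch of one training step implementing the two updates, reusing the `actor`/`critic` nets from the previous sketch; target networks, exploration noise, and other DDPG details are omitted:

```python
import torch
import torch.nn.functional as F

def ddpg_step(actor, critic, actor_opt, critic_opt, s, a, r, s2, gamma=0.99):
    # Critic: regress Q(s, a) toward y = r + gamma * Q(s', mu(s')).
    with torch.no_grad():
        y = r + gamma * critic(torch.cat([s2, actor(s2)], dim=1))
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = F.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend Q(s, mu(s)) -- autograd composes grad_a Q(s, a) with
    # grad_theta mu(s), exactly the chain rule in the equation above.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```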

SLIDE 19

Reward Signal

r_t = -d(\text{agent}, \text{ball}) + I_{\text{kick}} - 3\, d(\text{ball}, \text{goal}) + 5\, I_{\text{goal}}

The first two terms reward going to the ball; the last two reward kicking to the goal.

With only goal-scoring reward, agent never learns to approach the ball or dribble.
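A literal transcription of the shaped reward as a sketch; positions are (x, y) tuples and the indicators are booleans from the environment. The thesis may define the distance terms as per-step changes in distance rather than raw distances:

```python
import math

def dist(a, b):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def reward(agent_pos, ball_pos, goal_pos, kicked, scored):
    return (-dist(agent_pos, ball_pos)         # go to ball
            + (1.0 if kicked else 0.0)         # I_kick
            - 3.0 * dist(ball_pos, goal_pos)   # kick to goal
            + (5.0 if scored else 0.0))        # 5 * I_goal
```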

SLIDE 20

Results

SLIDE 21

Results

Agent              Scoring Percent   Avg. Steps to Goal
DDPG1              1.00              108.0
DDPG2              .99               107.1
DDPG3              .98               104.8
DDPG4              .96               112.3
Helios’ Champion   .96               72.0
DDPG5              .94               119.1
DDPG6              .84               113.2
SARSA              .81               70.7
DDPG7              .80               118.2

[Deep Reinforcement Learning in Parameterized Action Space, Hausknecht and Stone, in ICLR ‘16]

SLIDE 22

Offense versus keeper

The automated Helios goalkeeper, independently created by the Helios RoboCup team, is quite effective at stopping shots. DDPG fails to reliably score against it.

SLIDE 23

SLIDE 24

Better value estimates

Q-Learning is known to overestimate Q-values [Hasselt ’16]. Several remedies exist, but they don’t always extend to an actor-critic framework. We will show that mixing off-policy updates with on-policy Monte-Carlo updates yields quicker, more stable learning.

SLIDE 25

Q-Learning Spectrum

Q-Learning is a bootstrap, off-policy method:

Q(s_t, a_t \mid \theta) = r_{t+1} + \gamma \max_a Q(s_{t+1}, a \mid \theta)

N-step Q-learning [Watkins ’89]:

Q(s_t, a_t \mid \theta) = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n \max_a Q(s_{t+n}, a \mid \theta)

Monte-Carlo updates are on-policy and non-bootstrap:

Q(s_t, a_t \mid \theta) = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^T r_T

SLIDE 26

                On-Policy        Off-Policy
Bootstrap       SARSA            Q-Learning
No-Bootstrap    On-policy MC     Off-policy MC

n-step-return methods lie between the bootstrap and no-bootstrap extremes. Bootstrap methods: high bias, low variance. Non-bootstrap methods: low bias, high variance.

SLIDE 27

On-Policy Monte Carlo

On-policy Monte-Carlo updates make sense near the beginning of learning, since the bootstrap estimate max_a Q(s_{t+1}, a \mid \theta) is nearly always wrong. After Q-values are refined, off-policy bootstrap updates utilize experience samples more efficiently. A middle path is to mix both update types:

y = \beta\, y_{\text{on-policy-MC}} + (1 - \beta)\, y_{\text{1-step-Q-learning}}
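A sketch of computing the blended target over one stored episode. `q(s, a)` and `mu(s)` are hypothetical stand-ins for the critic and actor; in the actor-critic setting the 1-step target bootstraps from Q(s′, μ(s′)) rather than a max over actions:

```python
def mixed_targets(q, mu, rewards, next_states, beta, gamma=0.99):
    # On-policy Monte-Carlo targets: full discounted reward-to-go.
    mc, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        mc.append(g)
    mc.reverse()

    # Blend each MC target with the 1-step bootstrap target.
    return [
        beta * mc[t]
        + (1 - beta) * (rewards[t] + gamma * q(next_states[t], mu(next_states[t])))
        for t in range(len(rewards))
    ]
```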

SLIDE 28

Experiments

Trained on the 1v1 task. We evaluated 5 different β values: 0, 0.2, 0.5, 0.8, 1.0.

y = \beta\, y_{\text{on-policy-MC}} + (1 - \beta)\, y_{\text{1-step-Q-learning}}

SLIDE 29

β = 0

SLIDE 30

β = 0.2

SLIDE 31

β = 0.5

SLIDE 32

β = 0.8

SLIDE 33

β = 1.0

SLIDE 34

SLIDE 35

Off-Policy Monte Carlo

For 1v1 experiments, a middle ground between on-policy and off-policy updates works best. Purely off-policy updates can’t learn; purely on-policy updates take far too long to learn. [On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning, Hausknecht and Stone, Deep RL Workshop ’16]

SLIDE 36

Thesis Question

How can the power of Deep Neural Networks be leveraged to extend Reinforcement Learning towards domains featuring partial observability, continuous parameterized action spaces, and sparse rewards?

Novel extension of DDPG to a parameterized-continuous action space. Method for efficiently mixing on-policy and off-policy update targets.

SLIDE 37

Outline

  • 1. Background
  • 2. Deep Reinforcement Learning
  • 3. Multiagent Architectures
  • 4. Communication

SLIDE 38

Deep Multiagent RL

Can multiple Deep RL agents cooperate to achieve a shared goal? We examine several architectures:

Centralized: a single controller for multiple agents
Parameter sharing: layers shared between agents
Memory sharing: a shared replay memory

SLIDE 39

Centralized

Both agents are controlled by a single actor-critic. State and action spaces are concatenated; learning takes place in the higher-dimensional joint state-action space.

[Diagram: one actor-critic (the same 1024-512-256-128 ReLU stacks) over the concatenated states of both agents, outputting 4 actions + 6 parameters per agent; the critic outputs a joint Q-value]

SLIDE 40

SLIDE 41

SLIDE 42

Parameter Sharing

Weights are shared between layers of the actor networks, and separately between the critic networks. This reduces the total number of parameters and encourages both agents to participate, even though 2v0 is solvable by a single agent.

[Diagram: two actors and two critics; the large 1024- and 512-unit ReLU layers are shared across the actors and across the critics, while each agent keeps its own 256- and 128-unit layers and output heads]

SLIDE 43

SLIDE 44

SLIDE 45

SLIDE 46

Memory Sharing

Both agents add experiences to a shared replay memory, and both perform updates from it. The agents' parameters are not shared (see the sketch below).

[Diagram: Agent-0 and Agent-1 both connected to a Shared Replay Queue]
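A minimal sketch of the idea, assuming a hypothetical `Agent` class and environment step function: one buffer, two independently parameterized learners:

```python
import random
from collections import deque

class Agent:
    """Stand-in for an independently parameterized actor-critic."""
    def act(self, obs): return 0       # placeholder policy
    def update(self, batch): pass      # placeholder learning step

shared = deque(maxlen=100_000)          # ONE replay buffer
agents = [Agent(), Agent()]             # parameters NOT shared

def step(observations, env_step):
    actions = [ag.act(o) for ag, o in zip(agents, observations)]
    next_obs, rewards = env_step(actions)
    for o, a, r, o2 in zip(observations, actions, rewards, next_obs):
        shared.append((o, a, r, o2))    # both agents add experience
    for ag in agents:                   # both update from the shared memory
        if len(shared) >= 32:
            ag.update(random.sample(shared, 32))
    return next_obs
```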

SLIDE 47

Memory Sharing

Memory-sharing agents add experience to, and update from, a shared replay memory.

SLIDE 48

SLIDE 49

Multiagent Architectures

The centralized controller ends up utilizing only a single agent. Sharing parameters and memories encourages policy similarity, which can help all agents learn the task. Memory sharing results in the smallest performance gap between agents.

SLIDE 50

Outline

  • 1. Background
  • 2. Deep Reinforcement Learning
  • 3. Multiagent Architectures
  • 4. Communication

SLIDE 51

Symbiosis in Nature

Crocodile and Egyptian Plover Clownfish and anemone

SLIDE 52

Communication

In human society, cooperation can be achieved far faster than in nature, through communication.

How can learning agents use communication to achieve cooperation?

SLIDE 53
Desire a learned communication protocol that can:

  • 1. Identify task-relevant information
  • 2. Communicate meaningful information to the teammate
  • 3. Remain stable enough that the teammate can trust the meaning of messages

SLIDE 54

Related Work

  • Multiagent Cooperation and Competition with Deep Reinforcement Learning; Tampuu et al., 2015
  • Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks; Foerster et al., 2016
  • Learning to Communicate with Deep Multi-Agent Reinforcement Learning; Foerster et al., 2016

SLIDE 55

Baseline Communication Architecture

Continuous communication actions: messages are real values. No meaning attached to messages; no pre-defined communication protocol. Incoming messages are appended to the state. Messages are updated in the direction of higher Q-values.

[Diagram: the DDPG actor-critic (1024-512-256-128 ReLU stacks), with the actor taking state + incoming comm as input and outputting actions, parameters, and an outgoing comm message]

SLIDE 56

Teammate Comm Gradients

Same as the baseline, except communication gradients are exchanged with the teammate. This allows the teammate to directly alter communicated messages in the direction of higher reward.

SLIDE 57

Baseline Comm Arch

[Diagram: Agent 0 and Agent 1 actor-critic pairs across timesteps T=0 and T=1; each actor receives state + incoming comm and outputs an action + outgoing comm; messages flow between agents across timesteps]

SLIDE 58

Teammate Comm Gradients

[Diagram: same as the previous slide, with communication gradients additionally flowing back to the teammate's actor]

SLIDE 59

Guess My Number Task

Each agent is assigned a secret number. Goal: get the teammate to send a message close to your secret number. Reward is maximized when the teammate's message equals your secret number.

SLIDE 60

Baseline

SLIDE 61

Teammate Comm Grad

SLIDE 62

Blind Soccer

The blind agent can hear but cannot see; the sighted agent can see but cannot move. Goal: the sighted agent must use communication to help the blind agent locate and approach the ball. Agents communicate using messages.

SLIDE 63

SLIDE 64

Baseline

Baseline architecture begins to solve the task, but the protocol is not stable enough and performance crashes.

SLIDE 65

Teammate Comm Gradients

Fails to ground messages in the state of the environment: agents fabricate idealized messages that don't reflect reality. Example: the blind agent wants the ball to be directly ahead, so it alters the sighted agent's messages to say this, regardless of the ball's actual location.

SLIDE 66

Teammate Comm Gradients

SLIDE 67

Grounded Semantic Network

GSN learns to extract information from the sighted agent's observations that is useful for predicting the blind agent's rewards. Intuition: we can use observed rewards to guide the learning of a communication protocol.

[Diagram: the sighted agent's observation o(1) feeds a message encoder (ReLU layers, parameters θm) producing message m(1); m(1) and the blind agent's action a(2) feed a reward model (parameters θr) predicting reward r(2)]

SLIDE 68

Grounded Semantic Network

Maps the sighted agent's observation o(1) and the blind teammate's action a(2) to the blind teammate's reward r(2):

r^{(2)} = \text{GSN}(o^{(1)}, a^{(2)})

[Architecture diagram repeated from the previous slide]

SLIDE 69

Grounded Semantic Network

A message encoder M and a reward model R. The activations of layer m(1) form the message. Intuition: m(1) will contain any salient aspects of o(1) that are relevant for predicting reward.

[Architecture diagram repeated]

SLIDE 70

Grounded Semantic Network

Training minimizes a supervised loss on the predicted reward. Evaluation requires only the observation o(1) to generate the message m(1). The GSN is trained in parallel with the agent, using a learning rate 10x smaller than the agent's for stability.

[Architecture diagram repeated]
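A PyTorch sketch of the GSN under stated assumptions: the layer sizes are a guess at the diagram, MSG_DIM = 4 matches the 4D messages analyzed later, and the concrete learning rates are illustrative (the slide fixes only the 10x ratio):

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, MSG_DIM = 58, 10, 4  # illustrative sizes

# Message encoder M: sighted observation o1 -> message m1.
encoder = nn.Sequential(
    nn.Linear(OBS_DIM, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, MSG_DIM),
)

# Reward model R: (m1, blind action a2) -> predicted blind-agent reward r2.
reward_model = nn.Sequential(
    nn.Linear(MSG_DIM + ACT_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

# Trained in parallel with the agent, 10x smaller learning rate.
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(reward_model.parameters()),
    lr=1e-4)  # vs. an assumed 1e-3 for the agent

def gsn_step(o1, a2, r2):
    m1 = encoder(o1)
    pred = reward_model(torch.cat([m1, a2], dim=1))
    loss = nn.functional.mse_loss(pred, r2)  # supervised loss vs. observed r2
    opt.zero_grad(); loss.backward(); opt.step()
    return m1.detach()  # at evaluation time only o1 is needed to emit m1
```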

SLIDE 71

GSN

SLIDE 72

SLIDE 73

Is communication really helping?

SLIDE 74

SLIDE 75

t-SNE Analysis

2D t-SNE projection of the 4D messages sent by the sighted agent. Messages that are similar in 4D space are close in the 2D projection. Each dot is colored according to whether the blind agent dashed or turned. The content of messages strongly influences the actions of the blind agent.

SLIDE 76
SLIDE 77
Desire a learned communication protocol that can:

  • 1. Identify task-relevant information
  • 2. Communicate meaningful information to the teammate
  • 3. Remain stable enough that the teammate can trust the meaning of messages

GSN fulfills these criteria. For more info see: [Grounded Semantic Networks for Learning Shared Communication Protocols, NIPS Deep RL Workshop ’16]

SLIDE 78

Communication Conclusions

Communication can help cooperation, and it is possible to learn stable and informative communication protocols. Teammate Communication Gradients is best in domains where reward is tied directly to the content of the messages. GSN is ideal in domains where communication must be used as a means to achieve other objectives in the environment.

SLIDE 79

Thesis Question

How can Deep Reinforcement Learning agents learn to cooperate in a multiagent setting? Showed that sharing parameters and replay memories can help multiple agents learn to perform a task. Demonstrated that communication can help agents cooperate in a domain featuring asymmetric information.

SLIDE 80

Future Work

Teammate modeling: could a learned model of the teammate be used for planning or better cooperation? Embodied imitation learning: how can an agent learn from a teacher without directly observing the teacher's states or actions? Adversarial multiagent learning: how to communicate in the presence of an adversary?

SLIDE 81

Contributions

  • Extended Deep RL algorithms to parameterized-continuous action space.
  • Demonstrated that mixing bootstrap and Monte Carlo returns yields better learning performance.
  • Introduced and analyzed parameter- and memory-sharing multiagent architectures.
  • Introduced communication architectures and demonstrated that learned communication could help cooperation.
  • Open source contributions: HFO, all learning agents

SLIDE 82

Thanks!

SLIDE 83

Partially Observable MDP (POMDP)

[Diagram: agent–environment loop with action a_t, observation o_t, reward r_t]

Observations provide noisy or incomplete information. Memory may help to learn a better policy.

SLIDE 84

Atari Environment

[Diagram: agent–environment loop with action a_t, observation o_t, reward r_t]

Resolution: 160×210×3. 18 discrete actions. Reward is the change in game score.

SLIDE 85

Atari: MDP or POMDP?

Depends on the number of game screens used in the state representation. Many games are partially observable given a single frame.

SLIDE 86

Deep Q-Network (DQN)

Neural network estimates Q-values Q(s, a) for all 18 actions, accepting the last 4 screens as input. Learns via temporal difference:

Q(s \mid \theta) = (Q_{s,a_1}, \ldots, Q_{s,a_n})

L_i(\theta) = \mathbb{E}_{s,a,r,s' \sim D}\!\left[ \big( Q(s_t \mid \theta) - y_i \big)^2 \right]

y_i = r_t + \gamma \max_a Q(s_{t+1} \mid \theta)

[Diagram: Convolution 1 → Convolution 2 → Convolution 3 → Fully Connected → Fully Connected → Q-Values]

SLIDE 87

Flickering Atari

How well does DQN perform on POMDPs? Induce partial observability by stochastically obscuring the game screen; the game state must be inferred from past observations.

o_t = \begin{cases} s_t & \text{with probability } \tfrac{1}{2} \\ \langle 0, \ldots, 0 \rangle & \text{otherwise} \end{cases}

SLIDE 88

DQN Pong

[Side by side: true game screen vs. observed game screen]

SLIDE 89

DQN Flickering Pong

[Side by side: true game screen vs. observed game screen]

SLIDE 90

Deep Recurrent Q-Network

Uses a Long Short-Term Memory (LSTM) to selectively remember past game screens. Architecture identical to DQN except:

  • 1. Replaces the first fully connected layer with an LSTM
  • 2. Takes a single frame as input each timestep

Trained end-to-end using BPTT over the last 10 timesteps (see the sketch below).

[Diagram: Convolution 1 → Convolution 2 → Convolution 3 → LSTM → Fully Connected → Q-Values]
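A PyTorch sketch of that shape. The conv and LSTM sizes follow the common DQN/DRQN configuration and are assumptions here, not read off the slide:

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, num_actions=18):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),   # single frame input
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, 512, batch_first=True)
        self.q = nn.Linear(512, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84) -- one frame per timestep,
        # unrolled ~10 steps for BPTT during training.
        b, t = frames.shape[:2]
        z = self.convs(frames.reshape(b * t, *frames.shape[2:]))
        z, hidden = self.lstm(z.reshape(b, t, -1), hidden)
        return self.q(z), hidden  # Q-values per timestep

q_values, h = DRQN()(torch.randn(4, 10, 1, 84, 84))
```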

SLIDE 91

DRQN Flickering Pong

[Side by side: true game screen vs. observed game screen]

SLIDE 92

SLIDE 93

LSTM infers velocity

SLIDE 94

DRQN Frostbite

SLIDE 95

SLIDE 96

Extensions

DRQN has been extended in several ways:

  • Addressable memory: Control of Memory, Active Perception, and Action in Minecraft; Oh et al., ICML ’16
  • Continuous action space: Memory Based Control with Recurrent Neural Networks; Heess et al., 2016

[Deep Recurrent Q-Learning for Partially Observable MDPs, Hausknecht and Stone, 2015; arXiv]

SLIDE 97

Bounded Action Space

HFO’s continuous parameters are bounded:

 Dash(direction, power)
 Turn(direction)
 Tackle(direction)
 Kick(direction, power)

Direction lies in [-180, 180]; power lies in [0, 100]. Exceeding these ranges results in no action. If DDPG is unaware of the bounds, it will invariably exceed them.

SLIDE 98

Bounded DDPG

We examine 3 approaches for bounding DDPG's action space:

  • 1. Squashing Function
  • 2. Zero Gradients
  • 3. Invert Gradients

SLIDE 99

Squashing Function

  • 1. Use Tanh non-linearity to bound parameter output
  • 2. Rescale into desired range
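A small sketch of one standard way to do this: tanh bounds the raw output to (-1, 1), and a linear rescale maps it into [p_min, p_max]:

```python
import math

def squash(raw_output, p_min, p_max):
    t = math.tanh(raw_output)                        # in (-1, 1)
    return p_min + (t + 1.0) / 2.0 * (p_max - p_min)

direction = squash(0.7, -180.0, 180.0)  # bounded direction parameter
power = squash(-2.3, 0.0, 100.0)        # bounded power parameter
```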

SLIDE 100

Squashing Function

SLIDE 101

Zeroing Gradients

Each continuous parameter has a range [p_min, p_max]. Let p denote the current value of the parameter and ∇p the suggested gradient. Then:

\nabla p = \begin{cases} \nabla p & \text{if } p_{\min} < p < p_{\max} \\ 0 & \text{otherwise} \end{cases}

SLIDE 102

Zeroing Gradients

SLIDE 103

Inverting Gradients

For each parameter:

\nabla p = \nabla p \cdot \begin{cases} (p_{\max} - p) / (p_{\max} - p_{\min}) & \text{if } \nabla p \text{ suggests increasing } p \\ (p - p_{\min}) / (p_{\max} - p_{\min}) & \text{otherwise} \end{cases}

Allows parameters to approach the bounds of their ranges without exceeding them; parameters don't get “stuck” or saturate (see the sketch below).
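A sketch of the rule for one parameter. `grad` is the critic's suggested gradient dQ/dp; treating positive `grad` as "increase p" is a sign convention assumed here:

```python
def invert_gradient(grad, p, p_min, p_max):
    if grad > 0:  # pushing p up: shrink as p nears p_max
        return grad * (p_max - p) / (p_max - p_min)
    else:         # pushing p down: shrink as p nears p_min
        return grad * (p - p_min) / (p_max - p_min)

# A parameter near its upper bound barely moves up but can move down freely:
print(invert_gradient(+1.0, 179.0, -180.0, 180.0))  # ~0.003
print(invert_gradient(-1.0, 179.0, -180.0, 180.0))  # ~-0.997
```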

SLIDE 104

Inverting Gradients

SLIDE 105

2v1

  • Can agents learn cooperative behaviors like passing and cross kicks?
  • Hypothesis: cross kicks can help achieve more reliable scoring in the 2v1 setting. Can sharing architectures learn such behaviors?

SLIDE 106

2v1

SLIDE 107

2v1

SLIDE 108

SLIDE 109

2v1

  • Both memory sharing and parameter sharing result in reasonably high goal percentage
  • Agents do not learn passes or cross kicks and instead rely on individual attacks

SLIDE 110

Curriculum Learning

  • Motivation: it’s difficult to design unbiased reward functions for complex tasks.
  • Easier to break a complex task into many subtasks, learn each subtask, and then use the skills to address the complex task.
  • Given: a complex target task with a sparse reward function, and a curriculum of tasks with non-sparse rewards.
  • Goal: learn how to perform all tasks in the curriculum, including the target task.

SLIDE 111

State Embed Architecture

Each task in the curriculum is represented as an embedding vector. The task embedding vector is concatenated with the agent's observation (see the sketch below).

[Diagram: actor network (1024-512-256-128 ReLU) whose state input is concatenated with a task embedding looked up from W_emb by task index i]
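A minimal sketch of the concatenation step; the dimensions here are illustrative, not from the slides:

```python
import torch
import torch.nn as nn

NUM_TASKS, EMB_DIM, OBS_DIM = 3, 8, 58  # e.g. move-to-ball, kick-to-goal, score

task_embedding = nn.Embedding(NUM_TASKS, EMB_DIM)   # W_emb, learned
obs = torch.randn(32, OBS_DIM)                       # batch of observations
task_id = torch.full((32,), 2, dtype=torch.long)     # current task index i

# Concatenate the task embedding onto the observation before the actor.
actor_input = torch.cat([obs, task_embedding(task_id)], dim=1)  # (32, 66)
```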

SLIDE 112

Weight Embed Architecture

Each task in the curriculum is represented as an embedding vector. The weight-embedding architecture conditions the activations of a particular layer on the task embedding, allowing a single network to learn many tasks and act uniquely in each.

[Diagram: actor network as before, with a 128-unit task embedding (via W_emb, W_enc, W_dec) modulating the activations of one hidden layer]

SLIDE 113

Curriculum

  • Target task: Score on Goal
  • Curriculum: Move to Ball, Kick to Goal
  • Each task in the curriculum corresponds to one skill in the target task.
  • Tasks are represented using an embedding vector.

SLIDE 114

Curriculum Ordering

  • The order of training tasks has an impact on ultimate performance.
  • Random curriculum: presents a random task from the curriculum at each episode.
  • Sequential curriculum: easiest tasks presented first, then harder tasks; each task is trained until convergence.

SLIDE 115

Random Curriculum, No Embedding

SLIDE 116

Seq Curriculum, No Embedding

SLIDE 117

Random Curriculum, State Embedding

SLIDE 118

Seq Curriculum, State Embedding

SLIDE 119

Random Curriculum, Weight Embedding

SLIDE 120

Seq Curriculum, Weight Embedding

SLIDE 121

Curriculum Learning

  • Agents can reuse learned skills to perform the soccer task, which features a sparse goal reward
  • Agents must continue to revisit all training tasks or they will forget previous skills
  • Ablation experiments show that all tasks are necessary for the soccer curriculum
