Cooperation and Communication in Multiagent Deep Reinforcement Learning
Matthew Hausknecht
Nov 28, 2016 Advisor: Peter Stone
1
Intelligent decision making is at the heart of AI.
2
How can the power of Deep Neural Networks be leveraged to extend Reinforcement Learning towards domains featuring partial observability, continuous parameterized action spaces, and sparse rewards?
3
How can Deep Reinforcement Learning agents learn to cooperate in a multiagent setting?
4
5
[Diagram: agent-environment loop with action $a_t$, state $s_t$, reward $r_t$.] Formalizes the interaction between the agent and the environment.
6
Half Field Offense (HFO): a cooperative multiagent soccer domain built on the libraries used by the RoboCup competition. Objective: learn a goal-scoring policy for the offense agents. Features continuous actions, partial observability, and sparse rewards.
7
8
9
10
58 continuous state features encoding distances and angles to points of interest.
Parameterized-continuous action space:
Dash(direction, power)
Turn(direction)
Tackle(direction)
Kick(direction, power)
Choose one discrete action plus its parameters every timestep.
11
12
Reinforcement Learning provides a general framework for sequential decision making. Objective: learn a policy that maximizes the discounted sum of future rewards. A deterministic policy π is a mapping from states to actions: for each encountered state, which action is best to perform.
13
Estimates the expected return from a given state-action pair, answering the question: "How good is action a from state s?" The optimal Q-value function yields an optimal policy.
$Q^\pi(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s, a \,\right]$
14
Parametric model with stacked layers of representation. Powerful, general-purpose function approximator. Parameters θ optimized via backpropagation. [Diagram: input → layers with parameters θ → output.]
15
Neural network used to approximate the Q-value function and policy π. Replay memory: a queue of recent experience tuples (s, a, r, s') seen by the agent. Network updates are performed on experience sampled randomly from the replay memory. [Diagram: state → network → Q-value / action.]
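As a concrete illustration of the replay-memory idea above, here is a minimal Python sketch; the class and method names are illustrative, not taken from the thesis code.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size queue of (s, a, r, s') experience tuples."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Sampling uniformly at random breaks the temporal correlation
        # between consecutive timesteps.
        return random.sample(self.buffer, batch_size)
```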
16
Model-free deep actor-critic architecture [Lillicrap '15]. The Actor learns a policy π; the Critic learns to estimate Q-values. The Actor outputs 4 discrete-action activations + 6 parameters; $a_t$ = the highest-activated of the 4 actions plus its associated parameter(s).
[Architecture diagram: the Actor maps the State through fully connected ReLU layers of 1024, 512, 256, and 128 units to 4 actions + 6 parameters; the Critic maps the State through the same layer sizes to a Q-value.]
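A minimal PyTorch sketch of such an actor, assuming the 58-feature HFO state and the layer sizes from the diagram; the framework, class name, and head layout are illustrative assumptions rather than the thesis implementation.

```python
import torch.nn as nn

class ParamActor(nn.Module):
    """Maps the 58-feature state to 4 discrete-action activations plus
    6 continuous parameters (2 for Dash, 1 for Turn, 1 for Tackle, 2 for Kick)."""
    def __init__(self, state_dim=58):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.action_head = nn.Linear(128, 4)  # Dash, Turn, Tackle, Kick
        self.param_head = nn.Linear(128, 6)   # continuous parameters for those actions

    def forward(self, state):
        h = self.body(state)
        return self.action_head(h), self.param_head(h)
```

At execution time, the agent takes the discrete action with the highest activation together with its associated parameter(s), as described above.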
17
Critic trained using the temporal-difference target; given experience $(s_t, a_t, r_t, s_{t+1})$:
$y = r_t + \gamma\, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q)$
Actor trained via the Critic's gradients with respect to the action, $\nabla_a Q(s, a)$:
$\nabla_{\theta^\mu} \mu(s) = \nabla_a Q(s, a \mid \theta^Q)\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)$
[Diagram: Actor (parameters $\theta^\mu$) outputs 4 actions + 6 parameters; Critic (parameters $\theta^Q$) outputs the Q-value.]
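The two updates above can also be written compactly as losses; the following is a sketch under illustrative assumptions: `actor(s)` returns a single action tensor, `critic(s, a)` returns Q(s, a), and target networks and other DDPG details are omitted.

```python
import torch

def ddpg_update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    """One actor-critic update on a batch of (s, a, r, s') experience."""
    s, a, r, s_next = batch

    # Critic: regress Q(s, a) toward the TD target y = r + gamma * Q(s', mu(s')).
    with torch.no_grad():
        y = r + gamma * critic(s_next, actor(s_next))
    critic_loss = torch.mean((critic(s, a) - y) ** 2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow the critic's gradient w.r.t. the action by
    # maximizing Q(s, mu(s)), i.e. minimizing its negation.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```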
18
$r_t = -d(\text{Agent}, \text{Ball}) + I_{\text{kick}} - 3\, d(\text{Ball}, \text{Goal}) + 5\, I_{\text{Goal}}$ [Go to ball; kick to goal.]
With only goal-scoring reward, agent never learns to approach the ball or dribble.
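Transcribing the slide's shaped reward literally into a small helper (argument names are illustrative; d denotes a distance and the indicator arguments are booleans):

```python
def shaped_reward(d_agent_ball, kicked, d_ball_goal, scored):
    """Hand-crafted reward: approach the ball, touch it, move it toward
    the goal, and receive a large bonus for scoring."""
    return (-d_agent_ball
            + (1.0 if kicked else 0.0)
            - 3.0 * d_ball_goal
            + (5.0 if scored else 0.0))
```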
19
20
Agent             Scoring Percent   Avg. Steps to Goal
DDPG1             1.00              108.0
DDPG2             .99               107.1
DDPG3             .98               104.8
DDPG4             .96               112.3
Helios' Champion  .96               72.0
DDPG5             .94               119.1
DDPG6             .84               113.2
SARSA             .81               70.7
DDPG7             .80               118.2
[Deep Reinforcement Learning in Parameterized Action Space, Hausknecht and Stone, in ICLR ‘16]
21
The automated Helios goalkeeper is quite effective at stopping shots. It was independently created by the Helios RoboCup team. DDPG fails to reliably score against this keeper.
22
23
Q-Learning is known to overestimate Q-values [Hasselt '16]. Several remedies have been proposed, but they don't always extend to an actor-critic framework. We will show that mixing off-policy updates with on-policy Monte-Carlo updates yields quicker, more stable learning.
24
Q-Learning is a bootstrap, off-policy method:
$Q(s_t, a_t \mid \theta) = r_{t+1} + \gamma \max_a Q(s_{t+1}, a \mid \theta)$
N-step Q-learning [Watkins '89]:
$Q(s_t, a_t \mid \theta) = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n \max_a Q(s_{t+n}, a \mid \theta)$
On-policy Monte-Carlo updates are on-policy and non-bootstrap:
$Q(s_t, a_t \mid \theta) = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T} r_T$
25
[Spectrum diagram: non-bootstrap Monte-Carlo returns have low bias but high variance; bootstrap one-step returns have high bias but low variance; n-step-return methods lie in between.]
26
On-policy Monte-Carlo updates make sense near the beginning of learning, since $\max_a Q(s_{t+1}, a \mid \theta)$ is nearly always wrong. After Q-values are refined, off-policy bootstrap updates utilize experience samples more efficiently. A middle path is to mix both update types:
$y = \beta\, y_{\text{on-policy-MC}} + (1 - \beta)\, y_{\text{1-step-Q-learning}}$
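A small sketch of computing the mixed target; `rewards` is assumed to hold the rewards observed from time t until the end of the episode, and `q_next_max` is the bootstrap value max_a Q(s_{t+1}, a).

```python
def mixed_target(rewards, q_next_max, beta, gamma=0.99):
    """Blend the on-policy Monte-Carlo return with the 1-step Q-learning
    target: y = beta * y_MC + (1 - beta) * y_Q."""
    # Full discounted Monte-Carlo return from time t (no bootstrapping).
    y_mc = sum((gamma ** i) * r for i, r in enumerate(rewards))
    # One-step bootstrap target.
    y_q = rewards[0] + gamma * q_next_max
    return beta * y_mc + (1.0 - beta) * y_q
```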
27
Trained on the 1v1 task. We evaluated 5 different β values (0, .2, .5, .8, 1) in $y = \beta\, y_{\text{on-policy-MC}} + (1 - \beta)\, y_{\text{1-step-Q-learning}}$.
28
29
30
31
32
33
34
For 1v1 experiments, a middle ground between on-policy and off-policy updates works best. Purely off-policy updates can't learn; purely on-policy updates take far too long to learn. [On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning, Hausknecht and Stone, DeepRL '16]
35
How can the power of Deep Neural Networks be leveraged to extend Reinforcement Learning towards domains featuring partial observability, continuous parameterized action spaces, and sparse rewards? Novel extension of DDPG to a parameterized-continuous action space. Method for efficiently mixing on-policy and off-policy update targets.
36
37
Can multiple Deep RL agents cooperate to achieve a shared goal? We examine several architectures:
Centralized: a single controller for multiple agents
Parameter sharing: layers shared between agents
Memory sharing: a shared replay memory
38
Both agents are controlled by a single actor-critic. State and action spaces are concatenated, so learning takes place in the higher-dimensional joint state-action space.
[Architecture diagram: a centralized Actor and Critic operate on the concatenated states of both agents; the Actor outputs 4 actions + 6 parameters for each agent through ReLU layers of 1024, 512, 256, and 128 units; the Critic outputs a single Q-value.]
39
40
41
Weights are shared between layers of the Actor networks, and separately between the Critic networks. This reduces the total number of parameters and encourages both agents to participate, even though 2v0 is solvable by a single agent.
[Architecture diagram: the two Actors share their lower layers (1024 and 512 units) and keep separate upper layers (256 and 128 units) and output heads; the two Critics share their lower layers in the same way.]
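One way to realize this kind of sharing, sketched in PyTorch: both agents' actors hold the same trunk module, so gradients from either agent update the shared weights. The split between shared and per-agent layers is taken loosely from the diagram above and is illustrative.

```python
import torch.nn as nn

# Lower layers shared by both agents' actors (the critics would be tied analogously).
shared_trunk = nn.Sequential(
    nn.Linear(58, 1024), nn.ReLU(),
    nn.Linear(1024, 512), nn.ReLU(),
)

class SharedActor(nn.Module):
    def __init__(self, trunk):
        super().__init__()
        self.trunk = trunk                     # same module object => shared weights
        self.head = nn.Sequential(             # per-agent upper layers and outputs
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 10),                # 4 action activations + 6 parameters
        )

    def forward(self, state):
        return self.head(self.trunk(state))

actor0 = SharedActor(shared_trunk)
actor1 = SharedActor(shared_trunk)  # updates from either agent move the shared trunk
```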
42
43
44
45
Both agents add experiences to a shared memory. Both agents perform updates from the shared memory. Parameters of agents are not shared.
46
[Diagram: Agent-0 and Agent-1 both connected to a shared replay queue.] Memory-sharing agents add experience to, and update from, a shared replay memory.
47
48
The centralized controller utilizes only a single agent. Sharing parameters and memories encourages policy similarity, which can help all agents learn the task. Memory sharing results in the smallest performance gap between agents.
49
50
Examples of cooperation in nature: crocodile and Egyptian plover; clownfish and anemone.
51
In human society, cooperation can be achieved far faster than in nature, through communication.
How can learning agents use communication to achieve cooperation?
52
meaning of messages
53
Multiagent Cooperation and Competition with Deep Reinforcement Learning; Tampuu et al., 2015
Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks; Foerster et al., 2016
Learning to Communicate with Deep Multi-Agent Reinforcement Learning; Foerster et al., 2016
54
Continuous communication values. No meaning attached to messages; no pre-defined communication protocol. Incoming messages are appended to the state. Messages are updated in the direction given by the critic's gradients, like the other actor outputs.
[Architecture diagram: the Actor receives State + incoming Comm and outputs Actions, Parameters, and outgoing Comm; the Critic outputs a Q-value; both use ReLU layers of 1024, 512, 256, and 128 units.]
55
Same as the baseline, except communication gradients are exchanged with the teammate. This allows the teammate to directly alter the communicated messages in the direction of higher reward.
56
[Diagram: Agent 0 and Agent 1 actor-critic pairs across timesteps T=0 and T=1, connected through state, comm, and action.]
57
58
Each agent is assigned a secret number. Goal: have your teammate send a message close to your secret number. Reward: maximal when the teammate's message equals your secret number.
59
60
61
Blind agent: can hear but cannot see.
Sighted agent: can see but cannot move.
Goal: the sighted agent must use communication to help the blind agent locate and approach the ball.
Rewards:
Agents communicate using messages.
62
63
64
Baseline architecture begins to solve the task, but the protocol is not stable enough and performance crashes.
65
Fails to ground messages in the state of the environment: agents fabricate idealized messages that don't reflect reality. Example: the blind agent wants the ball to be directly ahead, so it alters the sighted agent's messages to say this, regardless of the actual location of the ball.
66
GSN learns to extract information from the sighted agent's observations by predicting the blind agent's rewards. Intuition: we can use observed rewards to guide the learning of a communication protocol.
[GSN diagram: ReLU layers map the message m(1) and the blind agent's action a(2) to the predicted reward r(2); θ_m denotes the message-encoder parameters and θ_r the reward-model parameters.]
67
Maps the sighted agent's observation o(1) and the blind agent's action a(2) to the blind teammate's reward r(2): $r^{(2)} = \text{GSN}(o^{(1)}, a^{(2)})$
68
The GSN consists of a message encoder M and a reward model R. The activations of layer m(1) form the message. Intuition: m(1) will contain any salient aspects of o(1) that are relevant for predicting reward.
69
Training minimizes a supervised loss between predicted and observed rewards; evaluation requires only the observation o(1). The GSN is trained in parallel with the agent, using a learning rate 10x smaller than the agent's for stability.
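A sketch of the encoder/reward-model split described above; the 4-dimensional message matches the message size mentioned later, but the layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GSN(nn.Module):
    """Message encoder M followed by a reward model R that predicts the
    blind teammate's reward from the message and the teammate's action."""
    def __init__(self, obs_dim, act_dim, msg_dim=4):
        super().__init__()
        # M: sighted agent's observation o(1) -> message m(1).
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, msg_dim),
        )
        # R: (m(1), blind agent's action a(2)) -> predicted reward r(2).
        self.reward_model = nn.Sequential(
            nn.Linear(msg_dim + act_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs_sighted, act_blind):
        m = self.encoder(obs_sighted)  # the message that gets communicated
        r_hat = self.reward_model(torch.cat([m, act_blind], dim=-1))
        return m, r_hat

# Training minimizes (r_hat - observed_reward)^2; at evaluation time only the
# encoder is needed to produce a message from the sighted agent's observation.
```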
70
71
72
74
2D t-SNE projection of the 4D messages sent by the sighted agent: similar messages in 4D space are close in the 2D projection. Each dot is colored according to whether the blind agent dashed or turned. The content of the messages strongly influences the actions of the blind agent.
75
3. Remain stable enough that the teammate can trust the meaning of messages
77
GSN fulfills these criteria. For more info see: [Grounded Semantic Networks for Learning Shared Communication Protocols], NIPS Deep RL Workshop '16.
Communication can help cooperation. It is possible to learn stable and informative communication protocols. Teammate communication gradients work best in domains where reward is tied directly to the content of the messages. GSN is ideal in domains in which communication needs to be used as a way to achieve some other objective.
78
How can Deep Reinforcement Learning agents learn to cooperate in a multiagent setting? Showed that sharing parameters and replay memories can help multiple agents learn to perform a task. Demonstrated communication can help agents cooperate in a domain featuring asymmetric information.
79
Teammate modeling: Could such a model be used for planning or better cooperation? Embodied Imitation Learning: How can an agent learn from a teacher without directly observing the states or actions of the teacher? Adversarial multiagent learning: How to communicate in the presence of an adversary?
80
Extended DDPG to a parameterized-continuous action space.
Showed that mixing on-policy and off-policy returns yields better learning performance.
Examined centralized, parameter-sharing, and memory-sharing multiagent architectures.
Demonstrated that learned communication could help cooperation.
81
[Backup-slide diagrams: the parameterized-action Actor-Critic architecture, and the DRQN architecture: Convolution 1, Convolution 2, Convolution 3, LSTM, Fully Connected, Q-Values.]
82
[Diagram: agent-environment loop with action $a_t$, observation $o_t$, reward $r_t$.] Observations provide noisy or incomplete information; memory may help to learn a better policy.
83
[Diagram: Atari agent-environment loop with action $a_t$, observation $o_t$, reward $r_t$.] Resolution 160x210x3; 18 discrete actions; reward is the change in game score.
84
Partial observability depends on the number of game screens used in the state representation; many games are partially observable given only a single frame.
85
Neural network estimates Q-values Q(s, a) for all 18 actions, accepts the last 4 screens as input, and learns via temporal difference:
$Q(s \mid \theta) = (Q_{s,a_1}, \dots, Q_{s,a_n})$
$L_i(\theta) = \mathbb{E}_{s,a,r,s' \sim D}\big[\, (Q(s_t \mid \theta) - y_i)^2 \,\big]$
$y_i = r_t + \gamma \max(Q(s_{t+1} \mid \theta))$
[Architecture diagram: Convolution 1, Convolution 2, Convolution 3, Fully Connected, Fully Connected, Q-Values.]
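The loss above, written out as an illustrative PyTorch function; `q_net(s)` is assumed to return one row of Q-values per state, and target-network details are omitted.

```python
import torch

def dqn_loss(q_net, batch, gamma=0.99):
    """Temporal-difference loss for a batch of (s, a, r, s') transitions."""
    s, a, r, s_next = batch                                 # a: LongTensor of action indices
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s_t, a_t)
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=1).values     # r_t + gamma * max_a Q(s_{t+1}, a)
    return torch.mean((q_sa - y) ** 2)
```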
86
How well does DQN perform on POMDPs? Induce partial observability by stochastically obscuring the game screen:
$o_t = \begin{cases} s_t & \text{with } p = \tfrac{1}{2} \\ \langle 0, \dots, 0 \rangle & \text{otherwise} \end{cases}$
The game state must be inferred from past observations.
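The obscuring rule is simple enough to state directly in code (a sketch; the real benchmark applies it to Atari frames):

```python
import numpy as np

def flicker(screen, p=0.5, rng=np.random):
    """With probability p return the true game screen; otherwise return a
    fully obscured (all-zero) observation."""
    return screen if rng.random() < p else np.zeros_like(screen)
```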
87
[Figure: true game screen vs. observed game screen under flickering.]
88
89
Uses a Long Short-Term Memory (LSTM) to selectively remember past game screens. Architecture identical to DQN except that an LSTM replaces the first fully connected layer and the network receives a single frame per timestep. Trained end-to-end using BPTT over the last 10 timesteps.
[Architecture diagram: Convolution 1, Convolution 2, Convolution 3, LSTM, Fully Connected, Q-Values.]
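An illustrative PyTorch sketch of this architecture, assuming the standard DQN convolution sizes and an 84x84 grayscale input; the thesis implementation details may differ.

```python
import torch.nn as nn

class DRQN(nn.Module):
    """DQN's convolutional stack with an LSTM in place of the first fully
    connected layer; processes one frame per timestep."""
    def __init__(self, num_actions=18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, 512, batch_first=True)
        self.q_head = nn.Linear(512, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84).  The LSTM integrates information
        # across timesteps, supplying the memory a single frame lacks.
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, 1, 84, 84)).reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return self.q_head(out), hidden
```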
90
91
92
93
94
95
DRQN has been extended in several ways:
Control of Memory, Active Perception, and Action in Minecraft; Oh et al., ICML '16
Memory-based Control with Recurrent Neural Networks; Heess et al., 2016
[Deep Recurrent Q-Learning for Partially Observable MDPs, Hausknecht and Stone, 2015; arXiv]
96
HFO’s continuous parameters are bounded Dash(direction, power) Turn(direction) Tackle(direction) Kick(direction, power) Direction in [-180,180], Power in [0, 100] Exceeding these ranges results in no action If DDPG is unaware of the bounds, it will invariably exceed them
97
We examine 3 approaches for bounding DDPG's action space:
98
99
100
Each continuous parameter has a range $[p_{min}, p_{max}]$. Let $p$ denote the current value of the parameter and $\nabla_p$ the suggested gradient. Then:
$\nabla_p = \begin{cases} \nabla_p & \text{if } p_{min} < p < p_{max} \\ 0 & \text{otherwise} \end{cases}$
101
102
$\nabla_p = \nabla_p \cdot \begin{cases} (p_{max} - p)/(p_{max} - p_{min}) & \text{if } \nabla_p \text{ suggests increasing } p \\ (p - p_{min})/(p_{max} - p_{min}) & \text{otherwise} \end{cases}$
Applied to each parameter, this allows parameters to approach the bounds of their ranges without exceeding them, and parameters don't get "stuck" or saturate.
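A direct sketch of the inverting-gradients rule, assuming the sign convention that a positive gradient means "increase p":

```python
def invert_gradient(grad, p, p_min, p_max):
    """Scale the gradient down as p approaches its bound; if p has already
    passed the bound, the scaling factor becomes negative, inverting the
    gradient and pushing p back into range."""
    span = p_max - p_min
    if grad > 0:   # gradient suggests increasing p
        return grad * (p_max - p) / span
    else:          # gradient suggests decreasing p
        return grad * (p - p_min) / span
```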
103
104
Can the agents learn passing and cross kicks? Such behaviors are needed for reliable scoring in the 2v1 setting. Can sharing architectures learn them?
105
106
107
108
Learned policies achieve a reasonably high goal percentage, but agents do not learn to pass and instead rely on individual attacks.
109
It is hard to design reward functions for complex tasks. Curriculum learning: learn each subtask, and then use the skills to address the complex task. Instead of a hand-designed reward function, use a curriculum of tasks with non-sparse reward, including the target task.
110
Each task in the curriculum is represented as an embedding vector. The task embedding vector is concatenated with the agent's state input.
[Architecture diagram: the task embedding for task i (via W_emb) is concatenated with the State input to the actor's ReLU layers (1024, 512, 256, 128), which output 4 actions + 6 parameters.]
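A sketch of the concatenation variant described above; the number of tasks and the embedding size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TaskConditionedActor(nn.Module):
    """A learned embedding for the current task is appended to the state
    before the actor's first layer."""
    def __init__(self, state_dim=58, num_tasks=5, emb_dim=8):
        super().__init__()
        self.task_embedding = nn.Embedding(num_tasks, emb_dim)
        self.body = nn.Sequential(
            nn.Linear(state_dim + emb_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 10),     # 4 action activations + 6 parameters
        )

    def forward(self, state, task_id):
        emb = self.task_embedding(task_id)  # task_id: LongTensor of task indices
        return self.body(torch.cat([state, emb], dim=-1))
```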
111
Each task in the curriculum is represented as an embedding vector. The weight embedding architecture conditions the activations of a particular layer on the task embedding, allowing a single network to learn many tasks and act uniquely in each task.
[Architecture diagram: weight embedding variant; the task embedding (W_emb, with encoder and decoder weights W_enc and W_dec) conditions a 128-unit layer of the actor network, which outputs 4 actions + 6 parameters from the State input.]
112
the target task.
113
ultimate performance.
One option samples a task from the curriculum at each episode; another trains on easier tasks first, then harder tasks, with each task trained until convergence.
114
115
116
117
118
119
120
soccer task which features a sparse goal reward
they will forget previous skills
necessary for the soccer curriculum
121