YunQi 2050 - DRL Session: Communication in Multi-agent Reinforcement Learning


SLIDE 1

Communication in Multi-agent Reinforcement Learning

Ying Wen, Department of Computer Science, University College London; MediaGamma Ltd.
ying.wen@cs.ucl.ac.uk
30 May 2018

YunQi 2050 - DRL Session

SLIDE 2

Multi-agent in Real-World

Transportation Networks, Economies, Markets, Human Teams, Games, Communication Networks

SLIDE 3
Agenda

  • Generalizing Reinforcement Learning
    § Single-Agent Reinforcement Learning
    § Multi-agent Reinforcement Learning (MARL)
  • Challenges in MARL
    § Non-stationary Environment
    § Model-Free Learning
    § Increasing Number of Agents, even Millions
  • Communication and Learning
  • Implicit Communication
  • Dynamic Interaction

SLIDE 4

Reinforcement Learning

The agent takes an action a_t; the environment returns a reward r_{t+1} and the next state s_{t+1}.

Optimal policy: a = π*(s), chosen to maximise the long-term reward Σ_t r_t.
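Stated a little more formally (a standard formulation of the objective on this slide, not copied from the deck), the optimal policy maximises the expected discounted return:

```latex
\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \right],
\qquad a_t = \pi^*(s_t), \quad \gamma \in [0, 1)
```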

SLIDE 5

Multi-Agent System

  • A multi-agent system is a collection of multiple autonomous (intelligent) agents, each acting towards its own objectives while all interacting in a shared environment, able to communicate and possibly coordinate their actions.

SLIDE 6

Types of Agent Systems

Single-Agent vs. Multi-Agent

Multi-Agent systems:
  • Cooperative: single shared utility
  • Competitive: multiple different utilities

SLIDE 7

Multi-agent Reinforcement Learning

Agents 1, 2 and 3 each take an action a_t in the shared environment; each receives its own reward r_{t+1} and the next state s_{t+1}.

SLIDE 8

Challenges in MARL

  • 1. Non-stationary Environment
    § Need for communication
  • 2. Model Free - Agent Awareness
    § Intent / Opponent Modelling
  • 3. Increasing Number of Agents
    § Approximation of other agents
    § Dynamics of agents
SLIDE 9

Multi-Agent Perspective

  • 1. Micro Perspective, the agent design problem: How should agents act to carry out their tasks? Optimal Policy.
  • 2. Macro Perspective, the society design problem: How should agents interact to carry out their tasks? Dynamic Interaction.

SLIDE 10

MARL with Communication

Agent 1 and Agent 2 each take an action a_t in the environment and receive reward r_{t+1} and state s_{t+1}; in addition, they exchange messages. How to cooperate? -> with Communication.

SLIDE 11

MARL with Communication - Example

In a football game, Agent 1 and Agent 2 each act and receive reward r_{t+1} and state s_{t+1} from the game; they also exchange messages ("Pass me!" / "Yes"). How to cooperate? -> with Communication.

SLIDE 12

Bi-directionally Coordinated Network

  • Bi-directional recurrent networks
  • Means of communication
  • Connect each individual agent's policy and Q networks
  • Multi-agent deterministic actor-critic (see the sketch below)
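A simplified sketch of the idea, assuming a PyTorch implementation (layer names and sizes are illustrative assumptions, not the actual BiCNet code): a bidirectional RNN runs over the sequence of agents, so each agent's action is computed from hidden states flowing in both directions.

```python
import torch
import torch.nn as nn

class BiCommPolicy(nn.Module):
    """Simplified sketch of a bi-directionally coordinated policy: a
    bidirectional GRU runs over the agent dimension, so each agent's action
    depends on hidden states from both directions. Illustrative only; the
    actual BiCNet also has a coordinated critic (Q network)."""

    def __init__(self, obs_dim, hidden_dim, act_dim):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden_dim, act_dim)

    def forward(self, obs):                    # obs: (batch, n_agents, obs_dim)
        hidden, _ = self.rnn(obs)              # (batch, n_agents, 2 * hidden_dim)
        return torch.tanh(self.head(hidden))   # one continuous action per agent

# Example: BiCommPolicy(obs_dim=32, hidden_dim=64, act_dim=4)(torch.randn(8, 5, 32))
```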

SLIDE 13

How It Works

  • High Q-value steps are aggregated in the same area.

SLIDE 14

Emergent Human-level Coordination

  • Hit and Run tactics
  • Focus fire without overkill
  • ……

[Figure: "focus fire" in combat, 15 Marines (ours) vs. 16 Marines (enemy), shown over four time steps.]
[Figure: Hit and Run tactics in combat, 3 Marines (ours) vs. 1 Zealot (enemy), shown over four time steps.]

SLIDE 15

Emergent Human-level Coordination - Video

SLIDE 16

MARL with Implicit Communication

In a football game, Agent 1 and Agent 2 each take actions a_t and receive rewards r_{t+1} and states s_{t+1}, but there is no explicit message channel. Intent Inference (Implicit Communication): how can an agent learn alongside unknown agents? -> Agent Awareness.

SLIDE 17

Implicit Intent Inference in MARL

[Figure: the implicit intent inference network unrolled over time steps t-1, t, t+1; legend: state, observation, action, implicit intent, history action trajectory.]

Implicit Intent Inference Network to Learn the Intent Embedding
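A rough sketch of the idea, assuming a PyTorch implementation (names and sizes are illustrative assumptions, not the talk's actual architecture): a recurrent encoder turns the opponent's recent action trajectory into an intent embedding, which the agent's own policy then takes as an extra input.

```python
import torch
import torch.nn as nn

class IntentEncoder(nn.Module):
    """Illustrative sketch: encode an opponent's recent action trajectory into
    a latent intent embedding. The policy is then conditioned on this
    embedding alongside the agent's own observation."""

    def __init__(self, act_dim, intent_dim):
        super().__init__()
        self.rnn = nn.GRU(act_dim, intent_dim, batch_first=True)

    def forward(self, opponent_actions):        # (batch, horizon, act_dim)
        _, last_hidden = self.rnn(opponent_actions)
        return last_hidden.squeeze(0)           # (batch, intent_dim) intent embedding

# The policy input is then [own_observation, intent_embedding].
```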

SLIDE 18

Implicit Intent Inference in MARL

[Keep Away Game: adversary, agent, landmark ("Stop it").]

SLIDE 19

Mean Field MARL

  • When the number of agents becomes thousands, or even millions
  • Mean action approximation (see the equation below)

Agent 1, Agent 2, ……, Agent N
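The core approximation (written here in the standard notation of the Mean Field MARL paper [3]; the exact symbols on the slide may differ) replaces the joint action in each agent's Q-function with the mean action of its neighbours:

```latex
Q^{j}(s, \mathbf{a}) \;\approx\; Q^{j}\!\left(s,\, a^{j},\, \bar{a}^{j}\right),
\qquad
\bar{a}^{j} \;=\; \frac{1}{|\mathcal{N}(j)|} \sum_{k \in \mathcal{N}(j)} a^{k}
```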

SLIDE 20

Mean Field MARL – Real-time Bidding

  • Mean Field Equilibrium learning in real-time bidding
  • High volume and high liquidity
  • Second Price Auction: the winner only pays the second highest price (sketch below)
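For concreteness, a minimal sketch of the second-price payment rule (a generic illustration, not the speaker's bidding code):

```python
def second_price_auction(bids):
    """Single-slot second-price auction: the highest bidder wins but pays the
    second-highest bid. `bids` maps bidder id -> bid price."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price_paid = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price_paid

# Example: second_price_auction({"a": 2.0, "b": 3.5, "c": 1.2}) returns ("b", 2.0)
```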

SLIDE 21

Multi-Agent Perspective

  • 1. Micro Perspective, the agent design problem: How should agents act to carry out their tasks? Optimal Policy.
  • 2. Macro Perspective, the society design problem: How should agents interact to carry out their tasks? Dynamic Interaction.

SLIDE 22

Population Dynamics in Million-agent RL

  • A major topic of population dynamics is the cycling of predator and prey populations
  • The Lotka-Volterra model is used to model this cycling (see below)
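For reference, the classical Lotka-Volterra equations (standard textbook form, not copied from the slide), with prey population x and predator population y:

```latex
\frac{dx}{dt} = \alpha x - \beta x y,
\qquad
\frac{dy}{dt} = \delta x y - \gamma y
```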

SLIDE 23

Population Dynamics in Million-agent RL

[Figure: grid world at timestep t and timestep t+1; legend: predator, prey, obstacle, health, ID, Group 1, Group 2.]

  • Predators hunt the prey so as to survive starvation
  • Each predator has its own health bar and eyesight view
  • Predators can form a group to hunt, and the population is scaled to 1 million

SLIDE 24

Population Dynamics in Million-agent RL

[Figure: a shared Q-network maps (Obs, ID embedding) pairs to Q-values; agents act in the environment and store transitions (st, at, rt, st+1) in an experience buffer that is used to update the network.]
  • The action space: {move forward, backward, left, right, rotate left, rotate right, stand still, join a group, and leave a group} (a sketch of the shared Q-network follows below)
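A rough sketch of such a shared Q-network, assuming PyTorch (the layer sizes, names, and the 16-dimensional ID embedding are illustrative assumptions, not the talk's actual implementation):

```python
import torch
import torch.nn as nn

N_ACTIONS = 9  # forward, backward, left, right, rotate left/right, stand, join, leave

class SharedQNet(nn.Module):
    """Sketch of the shared Q-network in the diagram: every agent uses the
    same network, distinguished only by a learned ID embedding concatenated
    to its observation."""

    def __init__(self, obs_dim, n_agents, id_dim=16, hidden=128):
        super().__init__()
        self.id_embedding = nn.Embedding(n_agents, id_dim)
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim + id_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, N_ACTIONS),
        )

    def forward(self, obs, agent_id):   # obs: (batch, obs_dim), agent_id: (batch,)
        x = torch.cat([obs, self.id_embedding(agent_id)], dim=-1)
        return self.mlp(x)              # (batch, N_ACTIONS) Q-values
```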

SLIDE 25

Population Dynamics in Million-agent RL

Tiger-sheep-rabbit: Grouping
The Dynamics of the Artificial Population

SLIDE 26

References

[1] Peng, Peng*, Ying Wen*, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. "Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games."
[2] Wen, Ying, Hui Chen, and Jun Wang. "Implicit Intent Inference with Action Trajectories in Multi-agent Reinforcement Learning."
[3] Yang, Yaodong, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. "Mean Field Multi-Agent Reinforcement Learning."
[4] Wen, Ying, and Jun Wang. "A Mean Field Approximation for Real Time Bidding with Budget Constraints."
[5] Yang, Yaodong, Lantao Yu, Yiwei Bai, Ying Wen, Jun Wang, Weinan Zhang, and Yong Yu. "A Study of AI Population Dynamics with Million-agent Reinforcement Learning."

SLIDE 27

Thank You!

Ying Wen ying.wen@cs.ucl.ac.uk