SLIDE 1
D++: Structural Credit Assignment in Tightly Coupled Multiagent Domains
Aida Rahmattalabi, Jen Jen Chung, Kagan Tumer
Autonomous Agents and Distributed Intelligence Lab, OSU Robotics
SLIDE 2
Problem Definition
[Diagram: team → team performance]
SLIDE 3
Loosely Coupled vs Tightly Coupled Agents
Loose coupling:
- Task consists of many single-robot tasks
- Each robot uses/requires little knowledge of the other robots to accomplish the task

Tight coupling:
- Multiple robots are required to achieve the task
- Mutual dependence of the robots on each other's performance
- The objective function is inherently non-smooth
SLIDE 4
Learning is Challenging in Tightly Coupled Tasks:
SLIDE 5
Learning is Challenging in Tightly Coupled Tasks:
The probability of SUFFICIENT agents,
SLIDE 6
Learning is Challenging in Tightly Coupled Tasks:
The probability of SUFFICIENT agents, picking the RIGHT ACTION
SLIDE 7
Learning is Challenging in Tightly Coupled Tasks:
The probability of SUFFICIENT agents, picking the RIGHT ACTION, at the RIGHT TIME
SLIDE 8
Learning is Challenging in Tightly Coupled Tasks:
The probability of SUFFICIENT agents, picking the RIGHT ACTION, at the RIGHT TIME is LOW
SLIDE 9
Learning is Challenging in Tightly Coupled Tasks:
The probability of SUFFICIENT agents, picking the RIGHT ACTION, at the RIGHT TIME is LOW. How can we devise agent-specific evaluation functions to reward the stepping-stone actions?
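As a rough illustration with assumed numbers (not from the slides): if each of n agents independently picks the right action with probability p, the chance that at least k of them do so simultaneously is

Pr[X >= k] = \sum_{m=k}^{n} \binom{n}{m} p^m (1 - p)^{n-m}

For n = 12, p = 0.1, and k = 3 this is only about 0.11, and that is before also requiring the right timing.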
SLIDE 10
Difference Evaluation Function (Agogino and Tumer, 2004)
- Measures an individual agent's contribution to the global team performance
- Removes an agent and replaces it with a "counterfactual" agent
D_i(z) = G(z) - G(z_{-i})

G(z): global system performance ("the world with me")
G(z_{-i}): global system performance excluding the effects of agent i ("the world without me")
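A minimal sketch of this computation in Python, assuming the joint state is a plain list of per-agent states and G is a callable (illustration only, not the authors' code):

```python
def difference_reward(G, joint_state, i):
    """Difference evaluation D_i(z) = G(z) - G(z_{-i}).

    G           -- callable scoring a joint state (global performance)
    joint_state -- list of per-agent states/actions
    i           -- index of the agent being evaluated

    The counterfactual here simply removes agent i; other choices
    (e.g., replacing it with a fixed default agent) are possible.
    """
    without_i = joint_state[:i] + joint_state[i + 1:]
    return G(joint_state) - G(without_i)
```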
SLIDE 11
D++: An Extension to Difference Reward (D)
- The reward function evaluates the performance of a "super agent"
- It introduces "counterfactual" agents
- Provides agents with a stronger feedback signal
- Rewards the stepping stones that lead to achieving the system objective
D_i^{++}(n) = [ G(z_{+n,i}) - G(z) ] / n

G(z_{+n,i}): global system performance where n counterfactual "copies of me" are present
G(z): global system performance without them
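Continuing the same sketch (same assumed list-of-agents representation; dividing by n gives the per-copy gain):

```python
def dpp_reward(G, joint_state, i, n):
    """D++ evaluation for agent i with n counterfactual copies:
    D++_i(n) = (G(z_{+n,i}) - G(z)) / n.

    Appends n copies of agent i's state/action to the joint state,
    so the numerator measures what a "super agent" standing in for
    n extra agents like i adds to the global performance.
    """
    with_copies = joint_state + [joint_state[i]] * n
    return (G(with_copies) - G(joint_state)) / n
```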
SLIDE 12
Example:
SLIDE 13
D++: An Extension to Difference Reward (D)
- How many “counterfactual” agents should be added?
SLIDE 14
D++: An Extension to Difference Reward (D)
- How many “counterfactual” agents should be added?
Search over different numbers of counterfactual agents until a non-zero reward is reached
SLIDE 15
D++: An Extension to Difference Reward (D)
- How many “counterfactual” agents should be added?
Search over different numbers of counterfactual agents until a non-zero reward is reached
- What if a sufficient number of agents is already available? Is D++ enough?
SLIDE 16
D++: An Extension to Difference Reward (D)
- How many “counterfactual” agents should be added?
Search over different numbers of counterfactual agents until a non-zero reward is reached
- What if a sufficient number of agents is already available? Is D++ enough?
Calculate both D and D++ and choose the higher of the two
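Putting the two rules together, a sketch of the resulting credit assignment (reusing the difference_reward and dpp_reward sketches from the earlier slides; this mirrors the slide's description, not necessarily the authors' exact algorithm):

```python
def dpp_credit(G, joint_state, i, n_agents):
    """Credit for agent i: search over counterfactual team sizes
    until D++ gives a non-zero reward, then return the larger of
    that value and the plain difference reward D.
    """
    d = difference_reward(G, joint_state, i)
    for n in range(1, n_agents):               # grow the counterfactual team
        dpp = dpp_reward(G, joint_state, i, n)
        if dpp != 0:                            # first stepping-stone signal found
            return max(d, dpp)
    return d                                    # no D++ signal; fall back to D
```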
SLIDE 17
Cooperative CoEvolutionary Algorithm (CCEA)
- Train NN policy weights via a cooperative coevolutionary algorithm (CCEA):
1. Initialize M populations of k NNs
2. Mutate each to create M populations of 2k NNs
3. Randomly select one NN from each population to create team T_i
4. Assess team performance and assign fitness to team members (the credit-assignment step)
5. Retain the k best-performing NNs of each population, then repeat from step 2
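A skeleton of this loop in Python (make_nn, mutate, and evaluate_team are assumed helpers; this sketches the procedure above, not the authors' implementation):

```python
import random

def ccea(M, k, generations, make_nn, mutate, evaluate_team):
    """Cooperative coevolution: one population of k candidate NNs per
    agent slot. evaluate_team(team) returns one fitness per team
    member -- this is where G, D, or D++ plugs in as credit assignment.
    """
    # Initialize M populations of k candidate networks.
    pops = [[make_nn() for _ in range(k)] for _ in range(M)]
    for _ in range(generations):
        # Mutate each population of k NNs into a population of 2k NNs.
        pops = [pop + [mutate(nn) for nn in pop] for pop in pops]
        # Randomly team up one NN from each population per rollout.
        for pop in pops:
            random.shuffle(pop)
        fitness = [[0.0] * (2 * k) for _ in range(M)]
        for t in range(2 * k):
            team = [pop[t] for pop in pops]
            for m, f in enumerate(evaluate_team(team)):
                fitness[m][t] = f
        # Retain the k best-performing NNs of each population.
        pops = [[nn for _, nn in
                 sorted(zip(fit, pop), key=lambda p: -p[0])[:k]]
                for fit, pop in zip(fitness, pops)]
    return pops
```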
SLIDE 18
Domain: Multi-robot Exploration
- Neural-network controllers
  - NN state vector s = [s_1, s_2], computed over the quadrants q around robot i:

    s_{1,q,i} = \sum_{j \in I_q} V_j / d(L_j, L_i) ,   s_{2,q,i} = \sum_{i' \in N_q} 1 / d(L_{i'}, L_i)

    (I_q: POIs in quadrant q, with values V_j; N_q: other robots in quadrant q; d(.,.): distance between locations L)
  - Control actions: [dx, dy]
- Team observation reward:

    G = \sum_i \sum_j \sum_k V_i N^1_{i,j} N^2_{i,k} / ( (d_{i,j} + d_{i,k}) / 2 )

    (N_{i,j}: indicator that robot j observes POI i, for the two simultaneous observers j and k; d_{i,j}: their distance)
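A minimal Python sketch of these quantities, assuming a four-quadrant sensor layout, tuple-based positions, and an observation radius and coupling level of two (all illustration choices, not taken from the slides):

```python
import math

def dist(a, b):
    """Euclidean distance, floored to avoid division by zero."""
    return max(math.hypot(a[0] - b[0], a[1] - b[1]), 1e-3)

def quadrant(dx, dy):
    """Index (0-3) of the quadrant a relative offset falls in."""
    return (dx < 0) * 2 + (dy < 0)

def state_vector(me, others, pois):
    """NN inputs s = [s1, s2] from the equations above.

    me     -- (x, y) of the evaluated robot
    others -- (x, y) positions of the other robots
    pois   -- (value, (x, y)) points of interest
    """
    s1 = [0.0] * 4  # per-quadrant POI density: sum_j V_j / d(L_j, L_i)
    s2 = [0.0] * 4  # per-quadrant robot density: sum_i' 1 / d(L_i', L_i)
    for v, loc in pois:
        s1[quadrant(loc[0] - me[0], loc[1] - me[1])] += v / dist(loc, me)
    for loc in others:
        s2[quadrant(loc[0] - me[0], loc[1] - me[1])] += 1.0 / dist(loc, me)
    return s1 + s2  # 8 inputs; the NN maps these to the motion [dx, dy]

def team_reward(pois, robots, obs_radius=4.0, coupling=2):
    """Tightly coupled observation reward: a POI scores only when at
    least `coupling` robots observe it simultaneously, and its value
    is discounted by the observers' mean distance.
    """
    G = 0.0
    for v, loc in pois:
        d = sorted(dist(r, loc) for r in robots)[:coupling]
        if len(d) == coupling and d[-1] <= obs_radius:
            G += v / (sum(d) / coupling)
    return G
```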
SLIDE 19
Experiments:
Number of robots | Number of POIs | Type          | Required observations
12               | 10             | Homogeneous   | 3
12               | 10             | Homogeneous   | 6
9                | 15             | Heterogeneous | [1,1,1]
9                | 15             | Heterogeneous | [3,1,1]
SLIDE 20
Homogeneous Agents: Number of observations = 3
SLIDE 21
Homogeneous Agents: Number of observations = 3
SLIDE 22
Homogeneous Agents: Learned Policies of D++ learners
SLIDE 23
Homogeneous Agents: Learned Policies of D++ learners
SLIDE 24
Homogeneous Agents: Learned Policies of D++ learners
SLIDE 25
Homogeneous Agents: Learned Policies of D++ learners
SLIDE 26
Homogeneous Agents: Number of observations = 6
SLIDE 27
Homogeneous Agents: Number of observations = 6
SLIDE 28
Heterogeneous Agents: Number of observations = [1, 1, 1]
[Plot: team performance G(z) vs. calls to G, comparing G, D, and D++ learners]
SLIDE 29
Heterogeneous Agents: Learned Policies of D++ learners
[Plot: learned robot trajectories of D++ learners in the X-Y plane]
SLIDE 30
Heterogeneous Agents: Number of observations = [3, 1, 1]
[Plot: team performance G(z) vs. calls to G, comparing G, D, and D++ learners]
SLIDE 31
Conclusion
- D++ is a new reward structure for tightly coupled multiagent domains
- D++ outperforms both G and D
  - It rewards the stepping-stone actions required for long-term success
- Robot heterogeneity and tighter coupling challenge G and D learners
  - D++ learners can still learn high-reward policies
SLIDE 32