SLIDE 1

Online Planning for Decentralized Stochastic Control with Partial History Sharing

Kaiqing Zhang, Erik Miehling, and Tamer Başar
Coordinated Science Lab, UIUC

American Control Conference, Philadelphia, PA, July 11, 2019

SLIDE 2

Decentralized Stochastic Control

  • Asymmetric information → no single agent has knowledge of all previous events
  • Control of a dynamic system by multiple agents, each possessing different information
  • Also termed Dec-POMDPs in the learning/CS community

[Application examples pictured: Robotics, Smart Grid, Unmanned Aerial Vehicles, MOBA Video Games]

  • Dynamic programming techniques quickly become computationally intractable

SLIDE 3

Decentralized Stochastic Control with Partial History Sharing

  • In practice, agents may have some information (history) in common
  • Agents may observe each other’s actions, e.g., fleet control [Gerla et al., ’14]
  • Agents may share some common observations, e.g., cooperative robot navigation [Lowe et al., ’18; Zhang et al., ’18]
  • This common information can be used to reduce the policy search space
  • Decentralized POMDP → centralized POMDP [Nayyar et al., ’13; Mahajan & Mannan, ’13]
  • Sufficient information: belief over the system state and local information of each agent

SLIDE 4

Related Work

  • Common information approach + dynamic programming decomposition (requires model to be known) [Nayyar et al., ’13]
  • Common information-based reformulation is a generalization of occupancy-state MDPs [Dibangoye et al., ’16, ’18] for Dec-POMDPs
  • Model-free/sampling-based planning heuristics for Dec-POMDPs:
  • Dec-POMDP → non-observable MDP, solved by heuristic tree search [Oliehoek et al., ’14]
  • Monte-Carlo sampling + policy iteration/expectation-maximization [Wu et al., ’10, ’13]
  • Monte-Carlo tree search for special Dec-POMDPs [Amato et al., ’13; Best et al., ’18]
  • These require a centralized coordinator [Amato et al., ’13; Oliehoek et al., ’14; Dibangoye et al., ’18] or communication [Oliehoek et al., ’12]

SLIDE 5

Our Contribution

  • Development of a tractable online + decentralized planning algorithm for decentralized stochastic control with partial history sharing
  • Does not require an explicit model representation, only a generative model (black-box simulator)
  • Does not require explicit communication among agents
  • Possesses provable convergence guarantees
  • The proposed algorithm unifies some recently developed Dec-POMDP solvers

SLIDE 6

Decentralized Stochastic Control with Partial History Sharing — Model

  • Consider a dynamical system consisting of n agents where, at each time, agent i has
  • Local memory/information, m_i
  • Local action, u_i
  • Local observations, y_i
  • Common information
  • Let c denote the common information and z = (z_1, . . . , z_n) the common information increment; z is the subset of each agent’s new local information that becomes common
  • common info. increment ← z_i = P_Z^i(m_i, u_i, y_i)
  • updated local memory ← m_i′ = P_L^i(m_i, u_i, y_i, z_i)
  • updated common info. ← c′ = (c, z)
  • Dynamics (via a generative model G; a sketch follows)
  • State: (x′, y_1, . . . , y_n, r) ∼ G(x, u_1, . . . , u_n)
  • Information: local memories and the common information evolve via the protocol maps P_Z^i and P_L^i above
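To make the black-box simulator assumption concrete, here is a minimal Python sketch of the generative-model interface the planner relies on. The class name, the toy dynamics, and the reward rule are all illustrative inventions; only the signature (x, u_1, ..., u_n) → (x′, y_1, ..., y_n, r) mirrors G as used in Algorithm 1 (appendix).

import random

class GenerativeModel:
    """Illustrative black-box simulator G: maps a state and joint action to
    a sampled next state, local observations, and a common reward. Only the
    signature mirrors Algorithm 1; the dynamics here are a made-up toy."""

    def __init__(self, n_agents: int, rng: random.Random):
        self.n = n_agents
        self.rng = rng

    def step(self, x: int, actions: list):
        # Toy dynamics: next state is a noisy function of state and actions.
        x_next = (x + sum(actions) + self.rng.choice([-1, 0, 1])) % 10
        # Each agent receives its own noisy local observation of the state.
        obs = [(x_next + self.rng.choice([0, 1])) % 10 for _ in range(self.n)]
        r = 1.0 if x_next == 0 else 0.0  # common reward shared by all agents
        return x_next, obs, r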
SLIDE 7

Examples of Partial History Sharing

  • Delayed sharing (d-step): observations and actions older than d steps are common to all agents; each agent’s local memory holds its most recent d observations and actions (a sketch of the 1-step case follows this list)
  • Control sharing: all past actions are common; each agent’s local memory holds its local observation history
  • Some additional examples are periodic sharing, delayed state sharing, and others (see [Nayyar et al., ’13])
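As a concrete, hypothetical instance of the sharing-protocol maps, the Python sketch below specializes P_Z and P_L to 1-step delayed sharing under one plausible timing convention: the observation currently held in local memory, together with the action just taken, becomes common one step later, and only the newest observation stays private.

def P_Z(m_i, u_i, y_i):
    # Common-information increment under 1-step delayed sharing: the previous
    # observation (held in local memory) and the action just taken.
    return (m_i, u_i)

def P_L(m_i, u_i, y_i, z):
    # Updated local memory: with a 1-step delay, only the newest local
    # observation remains private.
    return y_i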

SLIDE 8

Decentralized Stochastic Control with Partial History Sharing — Model (cont’d)

  • The goal is to find a joint control policy g = (g_1, . . . , g_n), consisting of local control policies g_i, such that the total expected discounted reward, E[Σ_t β^t r_t], is maximized
  • Note that all agents possess the same goal, i.e., they have a common reward function

SLIDE 9

Common Information Approach

  • Consider a coordinator that has access to the common information
  • The coordinator solves for prescriptions γ_i : m_i → u_i that map each agent’s local information to a local action (see the lookup-table sketch below)
  • The coordinator’s problem is a POMDP with modified state, action, and observation processes [Nayyar et al., ’13]
  • state: (x, m_1, . . . , m_n), the system state together with the joint local memory
  • action: the joint prescription γ = (γ_1, . . . , γ_n)
  • observation: the common information increment z
  • Define the virtual history h as the sequence of past joint prescriptions and common information increments — the information state of the coordinator’s POMDP is the common information based belief over (x, m_1, . . . , m_n) given h
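Since a prescription is just a function from local memory values to local actions, it can be represented as an explicit lookup table. The sketch below (illustrative names, toy finite spaces) enumerates one agent's prescription space; the size of this space is what makes the coordinator's action space so large.

from itertools import product

def all_prescriptions(memory_values, actions):
    # A prescription assigns one action to every possible local memory value,
    # so the prescription space is the set of all such lookup tables.
    for choice in product(actions, repeat=len(memory_values)):
        yield dict(zip(memory_values, choice))

# With 2 memory values and 2 actions there are 2^2 = 4 prescriptions:
assert len(list(all_prescriptions([0, 1], ["a", "b"]))) == 4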

SLIDE 10

Common Information Approach

  • Given a virtual history h, the coordinator determines a joint prescription γ = ψ(h) using a coordination strategy ψ
  • The coordinator’s objective is to find a coordination strategy ψ to maximize the total expected discounted reward E[Σ_t β^t r_t]
  • A dynamic programming decomposition to solve for the optimal prescriptions exists [Nayyar et al., ’13]
  • Note: since the coordinator’s information is in common between the agents, each agent can (in principle) perform this computation

SLIDE 11

Challenges with the Common Information Approach

  • The decision variables of the coordinator are functions (prescriptions), not primitive actions
  • Under finite action and observation spaces, the space of these functions is also finite, but very large: an agent with local memory space M_i and action space U_i has |U_i|^|M_i| candidate prescriptions
  • Examples:
  • 1-step delayed sharing: local memory is the most recent observation, so each agent has |U_i|^|Y_i| prescriptions at every step
  • Control sharing: local memory is the entire local observation history, so the prescription space grows exponentially over time
SLIDE 12

Decentralized Online Planning

  • Inspired by the single-agent Partially Observable Monte-Carlo Planning (POMCP) algorithm of [Silver & Veness, ’10]
  • Sampling-based approach helps to alleviate the computational challenges associated with the large state space

Key Ideas of the Algorithm:

  • Solve the coordinator’s POMDP via online tree search
  • Nodes of each search tree are virtual histories, with edges consisting of joint prescriptions and new common information
  • Agents possess a common random seed to avoid the need to communicate, an idea used before [Bernstein et al., ’09; Oliehoek et al., ’09; Arabneydi & Mahajan, ’15]; see the sketch below
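A minimal sketch of the common-random-seed idea: if every agent seeds an identical RNG with the same publicly agreed value (the constant below is hypothetical), all agents draw identical simulation randomness in the same order, so their search trees coincide without any communication.

import random

COMMON_SEED = 12345  # hypothetical value agreed upon by all agents offline

rng_agent_1 = random.Random(COMMON_SEED)
rng_agent_2 = random.Random(COMMON_SEED)

# Both agents observe the exact same stream of "random" numbers, so any
# planning computation driven by this stream is replicated identically.
assert [rng_agent_1.random() for _ in range(5)] == \
       [rng_agent_2.random() for _ in range(5)]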

SLIDE 13

Decentralized Online Planning

  • Each agent constructs a set of search trees that is identical across all agents
  • Search trees are constructed iteratively by running simulations from the current history for each agent (tree)

SLIDE 14

Search Stage

  • Each simulation begins by sampling a state and joint local memory from the common information based belief at the current history node
  • The search tree is expanded by either rollout or selection via UCB1 [Auer et al., ’02], as sketched below, where
  • N(h): number of previous simulation visits to the virtual history h
  • V(hγ): estimated value of choosing prescription γ in virtual history h
  • Successive simulations build out the search trees until a stopping condition (e.g., timeout) is met
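A sketch of the UCB1 selection rule, matching the argmax in SIMULATE of Algorithm 1 (appendix): the score trades off the estimated value V(hγ) against an exploration bonus that shrinks with the visit count N(hγ). The dictionary-based node representation and the parameter name rho are assumptions for illustration.

import math

def ucb1_select(children, rho=1.0):
    # children: dict mapping each joint prescription gamma (any hashable
    # representation, e.g., a tuple) to (N, V), its visit count and estimated
    # value at this history node.
    n_h = sum(n for n, _ in children.values())  # total visits N(h)

    def score(stats):
        n, v = stats
        if n == 0:
            return math.inf  # untried prescriptions are selected first
        return v + rho * math.sqrt(math.log(n_h) / n)

    return max(children, key=lambda g: score(children[g]))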

SLIDE 15

Belief Update

  • A joint prescription is selected, actions are realized, and new common information is revealed to the agents
  • The belief at a given history node h is approximated by a set of K particles, denoted by the set B(h)
  • The belief is updated by the following procedure (sketched in code below), repeated K times:
  • Draw a particle uniformly from B(h)
  • Generate actions from the selected prescription
  • Call the generative model to construct a sample of new common information and updated local memories
  • If the sampled common information matches the true common information, add the particle to the updated belief set
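The rejection-sampling particle update just described, as a short Python sketch. It assumes the hypothetical GenerativeModel and sharing maps P_Z, P_L sketched earlier, and represents a particle as (x, (m_1, ..., m_n)); per the slide, K candidates are generated and only those whose simulated common information matches the truly observed increment are kept.

def update_belief(B, gamma, z_true, model, rng, K):
    # B: current particle set B(h); gamma: selected joint prescription (one
    # lookup table per agent); z_true: the observed common-info increment.
    B_new = []
    for _ in range(K):
        x, mem = rng.choice(B)                           # draw uniformly
        u = [gamma[i][mem[i]] for i in range(len(mem))]  # realize actions
        x2, y, r = model.step(x, u)                      # sample simulator
        z = tuple(P_Z(mem[i], u[i], y[i]) for i in range(len(mem)))
        if z == z_true:                                  # keep matches only
            mem2 = tuple(P_L(mem[i], u[i], y[i], z) for i in range(len(mem)))
            B_new.append((x2, mem2))
    return B_new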
SLIDE 16

Convergence

  • Due to the common source of randomness, the planning procedure is identical and decoupled across all agents
  • Convergence of the decentralized online planning algorithm can be characterized by that of the (single-agent) POMCP alg. [Silver & Veness, ’10]

Experiments:

  • Applied to a novel security setting (collaborative intrusion response)
  • Agents choose defense actions in response to security alert information
  • Actions and alerts are copied to a centralized database with delay (1-step delayed sharing)

[Plot of experimental results omitted]

SLIDE 17

Existing Algorithms: MAA*

  • Heuristic tree-search algorithm from [Szer et al., ’05]
  • Dec-POMDP → non-observable MDP (NOMDP) [Oliehoek et al., ’14]
  • Can be viewed as the designer’s approach from [Nayyar et al., ’13]

Comparison (MAA* vs. our algorithm):

  • Centralized problem: NOMDP vs. POMDP
  • Common information: empty vs. general
  • Local memory: local observation history vs. general
  • State: system state + joint local observation history vs. system state + joint local memory
  • History: joint policies vs. joint prescriptions + common information
  • Sufficient statistic: belief over system state + joint local observation history vs. belief over system state and joint local memory
SLIDE 18

Existing Algorithms: Occupancy-state MDPs

  • Dec-POMDP → occupancy-state MDP [Dibangoye et al., ’16]
  • The occupancy state (belief over state + joint local histories) is a sufficient statistic for optimal planning

Comparison (occupancy-state MDP approach vs. our algorithm):

  • Centralized problem: NOMDP vs. POMDP
  • Common information: empty vs. general
  • Local memory: local histories (observations + actions) vs. general
  • State: system state + joint local histories vs. system state + joint local memory
  • History: joint policies vs. joint prescriptions + common information
  • Sufficient statistic: belief over system state + joint local histories vs. belief over system state and joint local memory

SLIDE 19

Concluding Remarks and Future Work

  • Proposed an online tree-search based planning algorithm for decentralized stochastic control with partial history sharing
  • Represents an initial attempt, and highlights the difficulties in developing tractable online planning for decentralized stochastic control under partial observability

Remaining challenges and future work:

  • The branching factors of the search trees are very large: can we focus on a subset of prescriptions during simulation?
  • Development of the notion of probabilistic common information: can we relax the binary membership of common information to a probability that given information is common?

Thank you!

SLIDE 20

Funding Acknowledgements

US Office of Naval Research (ONR) MURI grant N00014-16-1-2710
US Army Research Office (ARO) grant W911NF-16-1-0485

SLIDE 21

Additional Slides

SLIDE 22

Algorithm 1: Decentralized Online Planning with Partial History Sharing (Agent i)

function SEARCH(h)
    repeat
        if h = ∅ then
            (x, m_1, . . . , m_n) ∼ B_0
        else
            (x, m_1, . . . , m_n) ∼ B(h)
        end if
        SIMULATE(x, m_1, . . . , m_n, h, 0)
    until STOPPINGCONDITION()
    return argmax_{δ ∈ Γ} V(hδ)
end function

function ROLLOUT(x, m_1, . . . , m_n, h, d)
    if β^d < ε then return 0 end if
    γ = (γ_1, . . . , γ_n) ∼ (Γ^1_rollout(h), . . . , Γ^n_rollout(h))
    (u_1, . . . , u_n) ← (γ_1(m_1), . . . , γ_n(m_n))
    (x′, y_1, . . . , y_n, r) ∼ G(x, u_1, . . . , u_n)
    z_i ← P^i_Z(m_i, u_i, y_i) and share z_i with all other agents
    h′ ← hγz
    (m′_1, . . . , m′_n) ← (P^1_L(m_1, u_1, y_1, z_1), . . . , P^n_L(m_n, u_n, y_n, z_n))
    R ← r + β · ROLLOUT(x′, m′_1, . . . , m′_n, h′, d + 1)
    return R
end function

function SIMULATE(x, m_1, . . . , m_n, h, d)
    if β^d < ε then return 0 end if
    if h ∉ T^i then
        for all γ ∈ Γ do
            T^i(hγ) ← (N_0(hγ), V_0(hγ))
        end for
        return ROLLOUT(x, m_1, . . . , m_n, h, d)
    end if
    γ = (γ_1, . . . , γ_n) ∈ argmax_{(δ_1, . . . , δ_n) ∈ Γ^1 × · · · × Γ^n} [ V(hδ) + ρ √( log N(h) / N(hδ) ) ]
    (u_1, . . . , u_n) ← (γ_1(m_1), . . . , γ_n(m_n))
    (x′, y_1, . . . , y_n, r) ∼ G(x, u_1, . . . , u_n)
    z_i ← P^i_Z(m_i, u_i, y_i) and share z_i with all other agents
    h′ ← hγz
    (m′_1, . . . , m′_n) ← (P^1_L(m_1, u_1, y_1, z_1), . . . , P^n_L(m_n, u_n, y_n, z_n))
    N(h) ← N(h) + 1
    R ← r + β · SIMULATE(x′, m′_1, . . . , m′_n, h′, d + 1)
    N(hγ) ← N(hγ) + 1
    V(hγ) ← V(hγ) + (R − V(hγ)) / N(hγ)
    return R
end function
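For readers who prefer an executable form, below is a compact and necessarily simplified Python rendering of Algorithm 1 for one agent. It reuses the hypothetical GenerativeModel, P_Z, and P_L sketched earlier; prescriptions are passed as a list Gamma of joint lookup tables and referenced by index so that history nodes stay hashable, a uniform random rollout policy stands in for Γ_rollout, and a fixed simulation budget stands in for STOPPINGCONDITION().

import math
import random

class Planner:
    def __init__(self, model, Gamma, B0, beta=0.95, eps=1e-3, rho=1.0,
                 seed=12345):
        self.model, self.Gamma, self.B0 = model, Gamma, B0
        self.beta, self.eps, self.rho = beta, eps, rho
        self.rng = random.Random(seed)   # common seed shared by all agents
        self.tree = set()                # expanded history nodes
        self.N, self.V = {}, {}          # visit counts / value estimates

    def search(self, h, B, n_sims=500):
        for _ in range(n_sims):          # simulation budget = stopping cond.
            x, mem = self.rng.choice(B if h else self.B0)
            self.simulate(x, mem, h, 0)
        return max(range(len(self.Gamma)),
                   key=lambda g: self.V.get((h, g), -math.inf))

    def step(self, x, mem, gamma):
        # One simulated step: actions from the prescription, then the
        # generative model, then the sharing-protocol maps.
        u = [gamma[i][mem[i]] for i in range(len(mem))]
        x2, y, r = self.model.step(x, u)
        z = tuple(P_Z(mem[i], u[i], y[i]) for i in range(len(mem)))
        mem2 = tuple(P_L(mem[i], u[i], y[i], z) for i in range(len(mem)))
        return x2, mem2, z, r

    def rollout(self, x, mem, h, d):
        if self.beta ** d < self.eps:
            return 0.0
        g = self.rng.randrange(len(self.Gamma))   # uniform rollout policy
        x2, mem2, z, r = self.step(x, mem, self.Gamma[g])
        return r + self.beta * self.rollout(x2, mem2, h + ((g, z),), d + 1)

    def simulate(self, x, mem, h, d):
        if self.beta ** d < self.eps:
            return 0.0
        if h not in self.tree:                    # expand a new node
            self.tree.add(h)
            for g in range(len(self.Gamma)):
                self.N[(h, g)], self.V[(h, g)] = 0, 0.0
            return self.rollout(x, mem, h, d)
        n_h = sum(self.N[(h, g)] for g in range(len(self.Gamma))) + 1
        g = max(range(len(self.Gamma)),           # UCB1 prescription choice
                key=lambda g: math.inf if self.N[(h, g)] == 0 else
                self.V[(h, g)]
                + self.rho * math.sqrt(math.log(n_h) / self.N[(h, g)]))
        x2, mem2, z, r = self.step(x, mem, self.Gamma[g])
        R = r + self.beta * self.simulate(x2, mem2, h + ((g, z),), d + 1)
        self.N[(h, g)] += 1                       # incremental mean update
        self.V[(h, g)] += (R - self.V[(h, g)]) / self.N[(h, g)]
        return R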