SLIDE 1

Online Planning for Decentralized Stochastic Control with Partial History Sharing

Kaiqing Zhang, Erik Miehling, and Tamer Başar
Coordinated Science Lab, UIUC

American Control Conference, Philadelphia, PA, July 11, 2019

SLIDE 2

Decentralized Stochastic Control

  • Asymmetric information → no single agent has knowledge of all previous events
  • Control of a dynamic system by multiple agents, each possessing different information
  • Also termed Dec-POMDPs in the learning/CS community

[Application examples pictured: Robotics, Smart Grid, Unmanned Aerial Vehicles, MOBA Video Games]

  • Dynamic programming techniques quickly become computationally intractable

SLIDE 3

Decentralized Stochastic Control with Partial History Sharing

  • In practice, agents may have some information (history) in common
  • Agents may observe each other’s actions, e.g., fleet control [Gerla et al., ’14]
  • Agents may share some common observations, e.g., cooperative robot navigation [Lowe et al., ’18; Zhang et al., ’18]
  • This common information can be used to reduce the policy search space
  • Decentralized POMDP → centralized POMDP [Nayyar et al., ’13; Mahajan & Mannan, ’13]
  • Sufficient information: belief over the system state and local information of each agent

SLIDE 4

Related Work

  • Common information approach + dynamic programming decomposition (requires model to be known) [Nayyar et al., ’13]
  • Common information-based reformulation is a generalization of occupancy-state MDPs [Dibangoye et al., ’16, ’18] for Dec-POMDPs
  • Model-free/sampling-based planning heuristics for Dec-POMDPs:
  • Dec-POMDP → non-observable MDP, solved by heuristic tree search [Oliehoek et al., ’14]
  • Monte-Carlo sampling + policy iteration/expectation-maximization [Wu et al., ’10, ’13]
  • Monte-Carlo tree search for special Dec-POMDPs [Amato et al., ’13; Best et al., ’18]
  • These require a centralized coordinator [Amato et al., ’13; Oliehoek et al., ’14; Dibangoye et al., ’18] or communication [Oliehoek et al., ’12]

SLIDE 5

Our Contribution

  • Development of a tractable online + decentralized planning algorithm for decentralized stochastic control with partial history sharing
  • Does not require an explicit model representation, only a generative model (black-box simulator)
  • Does not require explicit communication among agents
  • Possesses provable convergence guarantees
  • The proposed algorithm unifies some recently developed Dec-POMDP solvers

SLIDE 6

Decentralized Stochastic Control with Partial History Sharing — Model

  • Consider a dynamical system consisting of n agents where, at each time, agent i has
  • Local memory/information, m_i
  • Local action, u_i
  • Local observations, y_i
  • Common information
  • Let c denote the common information and z = (z_1, . . . , z_n) the common information increment; z is the subset of each agent’s new local information that becomes common
  • common info. increment ← z_i = P_Z^i(m_i, u_i, y_i)
  • updated local memory ← m_i′ = P_L^i(m_i, u_i, y_i, z_i)
  • updated common info. ← c′ = (c, z)
  • Dynamics (via a generative model G; a sketch follows)
  • State: (x′, y_1, . . . , y_n, r) ∼ G(x, u_1, . . . , u_n)
  • Information: local memories and the common information evolve via the protocol maps P_Z^i and P_L^i above
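To make the black-box simulator assumption concrete, here is a minimal Python sketch of the generative-model interface the planner relies on. The class name, the toy dynamics, and the reward rule are all illustrative inventions; only the signature (x, u_1, ..., u_n) → (x′, y_1, ..., y_n, r) mirrors G as used in Algorithm 1 (appendix).

import random

class GenerativeModel:
    """Illustrative black-box simulator G: maps a state and joint action to
    a sampled next state, local observations, and a common reward. Only the
    signature mirrors Algorithm 1; the dynamics here are a made-up toy."""

    def __init__(self, n_agents: int, rng: random.Random):
        self.n = n_agents
        self.rng = rng

    def step(self, x: int, actions: list):
        # Toy dynamics: next state is a noisy function of state and actions.
        x_next = (x + sum(actions) + self.rng.choice([-1, 0, 1])) % 10
        # Each agent receives its own noisy local observation of the state.
        obs = [(x_next + self.rng.choice([0, 1])) % 10 for _ in range(self.n)]
        r = 1.0 if x_next == 0 else 0.0  # common reward shared by all agents
        return x_next, obs, r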
SLIDE 7

Examples of Partial History Sharing

  • Delayed sharing (d-step): observations and actions older than d steps are common to all agents; each agent’s local memory holds its most recent d observations and actions (a sketch of the 1-step case follows this list)
  • Control sharing: all past actions are common; each agent’s local memory holds its local observation history
  • Some additional examples are periodic sharing, delayed state sharing, and others (see [Nayyar et al., ’13])
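As a concrete, hypothetical instance of the sharing-protocol maps, the Python sketch below specializes P_Z and P_L to 1-step delayed sharing under one plausible timing convention: the observation currently held in local memory, together with the action just taken, becomes common one step later, and only the newest observation stays private.

def P_Z(m_i, u_i, y_i):
    # Common-information increment under 1-step delayed sharing: the previous
    # observation (held in local memory) and the action just taken.
    return (m_i, u_i)

def P_L(m_i, u_i, y_i, z):
    # Updated local memory: with a 1-step delay, only the newest local
    # observation remains private.
    return y_i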

SLIDE 8

Decentralized Stochastic Control with Partial History Sharing — Model (cont’d)

  • The goal is to find a joint control policy g = (g_1, . . . , g_n), consisting of local control policies g_i, such that the total expected discounted reward, E[Σ_t β^t r_t], is maximized
  • Note that all agents possess the same goal, i.e., they have a common reward function

SLIDE 9

Common Information Approach

  • Consider a coordinator that has access to the common information
  • The coordinator solves for prescriptions γ_i : m_i → u_i that map each agent’s local information to a local action (see the lookup-table sketch below)
  • The coordinator’s problem is a POMDP with modified state, action, and observation processes [Nayyar et al., ’13]
  • state: (x, m_1, . . . , m_n), the system state together with the joint local memory
  • action: the joint prescription γ = (γ_1, . . . , γ_n)
  • observation: the common information increment z
  • Define the virtual history h as the sequence of past joint prescriptions and common information increments — the information state of the coordinator’s POMDP is the common information based belief over (x, m_1, . . . , m_n) given h
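Since a prescription is just a function from local memory values to local actions, it can be represented as an explicit lookup table. The sketch below (illustrative names, toy finite spaces) enumerates one agent's prescription space; the size of this space is what makes the coordinator's action space so large.

from itertools import product

def all_prescriptions(memory_values, actions):
    # A prescription assigns one action to every possible local memory value,
    # so the prescription space is the set of all such lookup tables.
    for choice in product(actions, repeat=len(memory_values)):
        yield dict(zip(memory_values, choice))

# With 2 memory values and 2 actions there are 2^2 = 4 prescriptions:
assert len(list(all_prescriptions([0, 1], ["a", "b"]))) == 4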

SLIDE 10

Common Information Approach

  • Given a virtual history h, the coordinator determines a joint prescription γ = ψ(h) using a coordination strategy ψ
  • The coordinator’s objective is to find a coordination strategy ψ to maximize the total expected discounted reward E[Σ_t β^t r_t]
  • A dynamic programming decomposition to solve for the optimal prescriptions exists [Nayyar et al., ’13]
  • Note: since the coordinator’s information is in common between the agents, each agent can (in principle) perform this computation

SLIDE 11

Challenges with the Common Information Approach

  • The decision variables of the coordinator are functions (prescriptions), not primitive actions
  • Under finite action and observation spaces, the space of these functions is also finite, but very large: an agent with local memory space M_i and action space U_i has |U_i|^|M_i| candidate prescriptions
  • Examples:
  • 1-step delayed sharing: local memory is the most recent observation, so each agent has |U_i|^|Y_i| prescriptions at every step
  • Control sharing: local memory is the entire local observation history, so the prescription space grows exponentially over time
SLIDE 12

Decentralized Online Planning

  • Inspired by the single-agent Partially Observable Monte-Carlo Planning (POMCP) algorithm of [Silver & Veness, ’10]
  • Sampling-based approach helps to alleviate the computational challenges associated with the large state space

Key Ideas of the Algorithm:

  • Solve the coordinator’s POMDP via online tree search
  • Nodes of each search tree are virtual histories, with edges consisting of joint prescriptions and new common information
  • Agents possess a common random seed to avoid the need to communicate, an idea used before [Bernstein et al., ’09; Oliehoek et al., ’09; Arabneydi & Mahajan, ’15]; see the sketch below
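A minimal sketch of the common-random-seed idea: if every agent seeds an identical RNG with the same publicly agreed value (the constant below is hypothetical), all agents draw identical simulation randomness in the same order, so their search trees coincide without any communication.

import random

COMMON_SEED = 12345  # hypothetical value agreed upon by all agents offline

rng_agent_1 = random.Random(COMMON_SEED)
rng_agent_2 = random.Random(COMMON_SEED)

# Both agents observe the exact same stream of "random" numbers, so any
# planning computation driven by this stream is replicated identically.
assert [rng_agent_1.random() for _ in range(5)] == \
       [rng_agent_2.random() for _ in range(5)]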

SLIDE 13

Decentralized Online Planning

  • Each agent constructs a set of search trees that is identical across all agents
  • Search trees are constructed iteratively by running simulations from the current history for each agent (tree)

SLIDE 14

Search Stage

  • Each simulation begins by sampling a state and joint local memory from the common information based belief at the current history node
  • The search tree is expanded by either rollout or selection via UCB1 [Auer et al., ’02], as sketched below, where
  • N(h): number of previous simulation visits to the virtual history h
  • V(hγ): estimated value of choosing prescription γ in virtual history h
  • Successive simulations build out the search trees until a stopping condition (e.g., timeout) is met
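A sketch of the UCB1 selection rule, matching the argmax in SIMULATE of Algorithm 1 (appendix): the score trades off the estimated value V(hγ) against an exploration bonus that shrinks with the visit count N(hγ). The dictionary-based node representation and the parameter name rho are assumptions for illustration.

import math

def ucb1_select(children, rho=1.0):
    # children: dict mapping each joint prescription gamma (any hashable
    # representation, e.g., a tuple) to (N, V), its visit count and estimated
    # value at this history node.
    n_h = sum(n for n, _ in children.values())  # total visits N(h)

    def score(stats):
        n, v = stats
        if n == 0:
            return math.inf  # untried prescriptions are selected first
        return v + rho * math.sqrt(math.log(n_h) / n)

    return max(children, key=lambda g: score(children[g]))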

SLIDE 15

Belief Update

  • A joint prescription is selected, actions are realized, and new common information is revealed to the agents
  • The belief at a given history node h is approximated by a set of K particles, denoted by the set B(h)
  • The belief is updated by the following procedure (sketched in code below), repeated K times:
  • Draw a particle uniformly from B(h)
  • Generate actions from the selected prescription
  • Call the generative model to construct a sample of new common information and updated local memories
  • If the sampled common information matches the true common information, add the particle to the updated belief set
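The rejection-sampling particle update just described, as a short Python sketch. It assumes the hypothetical GenerativeModel and sharing maps P_Z, P_L sketched earlier, and represents a particle as (x, (m_1, ..., m_n)); per the slide, K candidates are generated and only those whose simulated common information matches the truly observed increment are kept.

def update_belief(B, gamma, z_true, model, rng, K):
    # B: current particle set B(h); gamma: selected joint prescription (one
    # lookup table per agent); z_true: the observed common-info increment.
    B_new = []
    for _ in range(K):
        x, mem = rng.choice(B)                           # draw uniformly
        u = [gamma[i][mem[i]] for i in range(len(mem))]  # realize actions
        x2, y, r = model.step(x, u)                      # sample simulator
        z = tuple(P_Z(mem[i], u[i], y[i]) for i in range(len(mem)))
        if z == z_true:                                  # keep matches only
            mem2 = tuple(P_L(mem[i], u[i], y[i], z) for i in range(len(mem)))
            B_new.append((x2, mem2))
    return B_new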
SLIDE 16

Convergence

  • Due to the common source of randomness, the planning procedure is identical and decoupled across all agents
  • Convergence of the decentralized online planning algorithm can be characterized by that of the (single-agent) POMCP alg. [Silver & Veness, ’10]

Experiments:

  • Applied to a novel security setting (collaborative intrusion response)
  • Agents choose defense actions in response to security alert information
  • Actions and alerts are copied to a centralized database with delay (1-step delayed sharing)

[Plot of experimental results omitted]

SLIDE 17

Existing Algorithms: MAA*

  • Heuristic tree-search algorithm from [Szer et al., ’05]
  • Dec-POMDP → non-observable MDP (NOMDP) [Oliehoek et al., ’14]
  • Can be viewed as the designer’s approach from [Nayyar et al., ’13]

Comparison (MAA* vs. our algorithm):

  • Centralized problem: NOMDP vs. POMDP
  • Common information: empty vs. general
  • Local memory: local observation history vs. general
  • State: system state + joint local observation history vs. system state + joint local memory
  • History: joint policies vs. joint prescriptions + common information
  • Sufficient statistic: belief over system state + joint local observation history vs. belief over system state and joint local memory
SLIDE 18

Existing Algorithms: Occupancy-state MDPs

  • Dec-POMDP → occupancy-state MDP [Dibangoye et al., ’16]
  • The occupancy state (belief over state + joint local histories) is a sufficient statistic for optimal planning

Comparison (occupancy-state MDP approach vs. our algorithm):

  • Centralized problem: NOMDP vs. POMDP
  • Common information: empty vs. general
  • Local memory: local histories (observations + actions) vs. general
  • State: system state + joint local histories vs. system state + joint local memory
  • History: joint policies vs. joint prescriptions + common information
  • Sufficient statistic: belief over system state + joint local histories vs. belief over system state and joint local memory

SLIDE 19

Concluding Remarks and Future Work

  • Proposed an online tree-search based planning algorithm for decentralized stochastic control with partial history sharing
  • Represents an initial attempt, and highlights the difficulties in developing tractable online planning for decentralized stochastic control under partial observability

Remaining challenges and future work:

  • The branching factors of the search trees are very large: can we focus on a subset of prescriptions during simulation?
  • Development of the notion of probabilistic common information: can we relax the binary membership of common information to a probability that given information is common?

Thank you!

SLIDE 20

Funding Acknowledgements

US Office of Naval Research (ONR) MURI grant N00014-16-1-2710
US Army Research Office (ARO) grant W911NF-16-1-0485

SLIDE 21

Additional Slides

SLIDE 22

Algorithm 1: Decentralized Online Planning with Partial History Sharing (Agent i)

function SEARCH(h)
    repeat
        if h = ∅ then
            (x, m_1, . . . , m_n) ∼ B_0
        else
            (x, m_1, . . . , m_n) ∼ B(h)
        end if
        SIMULATE(x, m_1, . . . , m_n, h, 0)
    until STOPPINGCONDITION()
    return argmax_{δ ∈ Γ} V(hδ)
end function

function ROLLOUT(x, m_1, . . . , m_n, h, d)
    if β^d < ε then return 0 end if
    γ = (γ_1, . . . , γ_n) ∼ (Γ^1_rollout(h), . . . , Γ^n_rollout(h))
    (u_1, . . . , u_n) ← (γ_1(m_1), . . . , γ_n(m_n))
    (x′, y_1, . . . , y_n, r) ∼ G(x, u_1, . . . , u_n)
    z_i ← P^i_Z(m_i, u_i, y_i) and share z_i with all other agents
    h′ ← hγz
    (m′_1, . . . , m′_n) ← (P^1_L(m_1, u_1, y_1, z_1), . . . , P^n_L(m_n, u_n, y_n, z_n))
    R ← r + β · ROLLOUT(x′, m′_1, . . . , m′_n, h′, d + 1)
    return R
end function

function SIMULATE(x, m_1, . . . , m_n, h, d)
    if β^d < ε then return 0 end if
    if h ∉ T^i then
        for all γ ∈ Γ do
            T^i(hγ) ← (N_0(hγ), V_0(hγ))
        end for
        return ROLLOUT(x, m_1, . . . , m_n, h, d)
    end if
    γ = (γ_1, . . . , γ_n) ∈ argmax_{(δ_1, . . . , δ_n) ∈ Γ^1 × · · · × Γ^n} [ V(hδ) + ρ √( log N(h) / N(hδ) ) ]
    (u_1, . . . , u_n) ← (γ_1(m_1), . . . , γ_n(m_n))
    (x′, y_1, . . . , y_n, r) ∼ G(x, u_1, . . . , u_n)
    z_i ← P^i_Z(m_i, u_i, y_i) and share z_i with all other agents
    h′ ← hγz
    (m′_1, . . . , m′_n) ← (P^1_L(m_1, u_1, y_1, z_1), . . . , P^n_L(m_n, u_n, y_n, z_n))
    N(h) ← N(h) + 1
    R ← r + β · SIMULATE(x′, m′_1, . . . , m′_n, h′, d + 1)
    N(hγ) ← N(hγ) + 1
    V(hγ) ← V(hγ) + (R − V(hγ)) / N(hγ)
    return R
end function
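For readers who prefer an executable form, below is a compact and necessarily simplified Python rendering of Algorithm 1 for one agent. It reuses the hypothetical GenerativeModel, P_Z, and P_L sketched earlier; prescriptions are passed as a list Gamma of joint lookup tables and referenced by index so that history nodes stay hashable, a uniform random rollout policy stands in for Γ_rollout, and a fixed simulation budget stands in for STOPPINGCONDITION().

import math
import random

class Planner:
    def __init__(self, model, Gamma, B0, beta=0.95, eps=1e-3, rho=1.0,
                 seed=12345):
        self.model, self.Gamma, self.B0 = model, Gamma, B0
        self.beta, self.eps, self.rho = beta, eps, rho
        self.rng = random.Random(seed)   # common seed shared by all agents
        self.tree = set()                # expanded history nodes
        self.N, self.V = {}, {}          # visit counts / value estimates

    def search(self, h, B, n_sims=500):
        for _ in range(n_sims):          # simulation budget = stopping cond.
            x, mem = self.rng.choice(B if h else self.B0)
            self.simulate(x, mem, h, 0)
        return max(range(len(self.Gamma)),
                   key=lambda g: self.V.get((h, g), -math.inf))

    def step(self, x, mem, gamma):
        # One simulated step: actions from the prescription, then the
        # generative model, then the sharing-protocol maps.
        u = [gamma[i][mem[i]] for i in range(len(mem))]
        x2, y, r = self.model.step(x, u)
        z = tuple(P_Z(mem[i], u[i], y[i]) for i in range(len(mem)))
        mem2 = tuple(P_L(mem[i], u[i], y[i], z) for i in range(len(mem)))
        return x2, mem2, z, r

    def rollout(self, x, mem, h, d):
        if self.beta ** d < self.eps:
            return 0.0
        g = self.rng.randrange(len(self.Gamma))   # uniform rollout policy
        x2, mem2, z, r = self.step(x, mem, self.Gamma[g])
        return r + self.beta * self.rollout(x2, mem2, h + ((g, z),), d + 1)

    def simulate(self, x, mem, h, d):
        if self.beta ** d < self.eps:
            return 0.0
        if h not in self.tree:                    # expand a new node
            self.tree.add(h)
            for g in range(len(self.Gamma)):
                self.N[(h, g)], self.V[(h, g)] = 0, 0.0
            return self.rollout(x, mem, h, d)
        n_h = sum(self.N[(h, g)] for g in range(len(self.Gamma))) + 1
        g = max(range(len(self.Gamma)),           # UCB1 prescription choice
                key=lambda g: math.inf if self.N[(h, g)] == 0 else
                self.V[(h, g)]
                + self.rho * math.sqrt(math.log(n_h) / self.N[(h, g)]))
        x2, mem2, z, r = self.step(x, mem, self.Gamma[g])
        R = r + self.beta * self.simulate(x2, mem2, h + ((g, z),), d + 1)
        self.N[(h, g)] += 1                       # incremental mean update
        self.V[(h, g)] += (R - self.V[(h, g)]) / self.N[(h, g)]
        return R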