
Multiagent Planning under Uncertainty for Cooperative Agents

Shlomo Zilberstein, School of Computer Science, University of Massachusetts Amherst

IFAAMAS Summer School on Autonomous Agents and Multi-Agent Systems, Beijing, China, August 2013

2

Multiagent Planning

! Challenge: How to achieve intelligent coordination of a group of decision makers in spite of stochasticity and partial observability?
! Objective: Develop effective decision-theoretic planning methods to address the uncertainty about the domain, the outcome of actions, and the knowledge, beliefs and intentions of the other agents.

3

Problem Characteristics

! A group of decision makers or agents interact in a stochastic environment
! Each “episode” involves a sequence of decisions over a finite or infinite horizon
! The change in the environment is determined stochastically by the current state and the set of actions taken by the agents
! Each decision maker obtains different partial observations of the overall situation
! Decision makers have the same objective

4

Sample Applications

! Space exploration rovers [Zilberstein et al. 02]
! Multi-access broadcast channels [Ooi and Wornell 96]
! Decentralized detection of hazardous weather events [Kumar and Zilberstein 09]
! Mobile robot navigation [Emery-Montemerlo et al. 05; Spaan and Melo 08]

5

Outline

! Planning with Markov decision processes
! Decentralized partially-observable MDPs
! Complexity results
! Solving finite-horizon decentralized POMDPs
! Solving infinite-horizon decentralized POMDPs
! Scalability with respect to the number of agents
! Conclusion

6

Planning under Uncertainty

! An agent interacts with the environment over some extended period of time
! Utility function depends on the sequence of decisions and their outcomes
! A rational agent should choose an action that maximizes its expected utility

[Figure: agent-environment loop — the agent sends action a to the world and observes state s and a reward]


7

Markov Decision Process

! A Markov decision process is a tuple 〈S, A, P, R〉, where
! S is a finite set of domain states, with initial state s0
! A is a finite set of actions
! P(s′ | s, a) is a state transition function
! R(s), R(s, a), or R(s, a, s′) is a reward function
! The Markov assumption: $P(s_t \mid s_{t-1}, s_{t-2}, \ldots, s_1, a) = P(s_t \mid s_{t-1}, a)$

8

A Simple Grid Environment

Russell & Norvig, 2003

9

Example: An Optimal Policy

[Figure: 4×3 grid with terminal states +1 and −1; state utilities range from .388 to .912]

Actions succeed with probability 0.8 and move sideways with probability 0.1 (the agent remains in the same position when there is a wall). Actions incur a small cost.

10

Policies for Different R(s)

11

Policies and Utilities of States

! A policy π is a mapping from states to actions.
! An optimal policy π* maximizes the expected reward:

$\pi^* = \arg\max_\pi E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi\right]$

! The utility of a state:

$U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi, s_0 = s\right]$

12

The Bellman Equation

! Optimal policy defined by:

$\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, U(s')$

! Can be solved using dynamic programming [Bellman, 1957]:

$U(s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, U(s')$


13

Value Iteration

Bellman, 1957

repeat
    U ← U′
    for each state s do
        U′[s] ← R[s] + γ max_a Σ_{s′} P(s′ | s, a) U(s′)
    end
until CloseEnough(U, U′)
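A runnable sketch of this loop (a tiny hypothetical MDP with random transitions; `P`, `R`, and `gamma` mirror the slide's notation, and the termination test uses the less conservative condition from the next slide):

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)

# P[a, s, s'] = P(s' | s, a); R[s] = reward of state s (placeholder values)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = np.array([0.0, 0.0, 1.0])

def value_iteration(P, R, gamma, eps=1e-6):
    U = np.zeros(len(R))
    while True:
        # Bellman update: U'(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')
        U_new = R + gamma * np.max(P @ U, axis=0)
        if np.max(np.abs(U_new - U)) < eps * (1 - gamma) / (2 * gamma):
            return U_new  # the "CloseEnough" test
        U = U_new

U = value_iteration(P, R, gamma)
print(U, np.argmax(P @ U, axis=0))  # utilities and a greedy policy
```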

14

Convergence of VI

An initial error bound: $\|U - U^*\| \le 2R_{\max}/(1-\gamma)$, based on the contraction property $\|BU_i - BU_i'\| \le \gamma \|U_i - U_i'\|$.

How many iterations are needed to reach a max-norm error of ε?
$\gamma^N \cdot 2R_{\max}/(1-\gamma) \le \varepsilon \;\Rightarrow\; N = \big\lceil \log\big(2R_{\max}/(\varepsilon(1-\gamma))\big) / \log(1/\gamma) \big\rceil$

A less conservative termination condition:
if $\|U' - U\| < \varepsilon(1-\gamma)/2\gamma$ then $\|U' - U^*\| < \varepsilon$
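A worked instance of the iteration bound (with assumed values for γ, Rmax, and ε):

```python
import math

gamma, R_max, eps = 0.95, 1.0, 1e-3
# N = ceil( log(2*Rmax / (eps*(1-gamma))) / log(1/gamma) )
N = math.ceil(math.log(2 * R_max / (eps * (1 - gamma))) / math.log(1 / gamma))
print(N)  # 207 iterations suffice for a max-norm error below eps
```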

15

Policy Loss

The error bound on the utility of each state may not be the most important factor. What the agent cares about is how well it does based on a given policy / utility function. Note that the policy loss can approach zero long before the utility estimates converge.

if $\|U_i - U^*\| < \varepsilon$ then $\|U^{\pi_i} - U^*\| < 2\varepsilon\gamma/(1-\gamma)$

16

Policy Iteration

Howard, 1960

repeat
    π ← π′
    U ← ValueDetermination(π)
    for each state s do
        π′[s] ← argmax_a Σ_{s′} P(s′ | s, a) U(s′)
    end
until π = π′

17

Value Determination

Can be implemented using value iteration:

$U'(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U(s')$

or by solving a set of n linear equations:

$U(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U(s')$
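A sketch of the second option: with π fixed, U = R + γ P_π U is linear, so one linear solve U = (I − γ P_π)⁻¹ R suffices (tiny hypothetical MDP):

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(1)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # P[a, s, s'] = P(s' | s, a)
R = np.array([0.0, 0.0, 1.0])
pi = np.array([0, 1, 0])                   # an arbitrary fixed policy

P_pi = P[pi, np.arange(n_states), :]       # row s is P(. | s, pi(s))
U = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
print(U)                                   # utility of every state under pi
```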

18

Stochastic Shortest-Path Problems

! Given a start state, the objective is to minimize the expected cost of reaching a goal state.
! S: a finite set of states
! A(i), i ∈ S: a finite set of actions available in state i
! Pij(a): probability of state j after action a in state i
! Ci(a): expected cost of taking action a in state i


19

MDPs and State-Space Search

Hansen & Zilberstein, AAAI 1998, AIJ 2001

! MDPs present a state-space search problem in which transitions are stochastic.
! Because state transitions are stochastic, it is impossible to bound the number of actions needed to reach the goal (indefinite horizon).
! Search algorithms like A* can handle the deterministic version of this problem.
! But neither A* nor AO* can solve indefinite-horizon problems.

20

Advantages of Search

! Can find optimal solutions without evaluating all problem states.
! Can take advantage of domain knowledge to reduce search effort.
! Can benefit from a large body of existing work in AI on how to search in real-time and how to trade off solution quality for search effort.

21

Solving MDPs Using Search

[Figure: start state; solution graph (all states reachable by the optimal solution) ⊆ explicit graph (states evaluated during search) ⊆ implicit graph (all states)]

Given a start state, heuristic search can find an optimal solution without evaluating all states.

22

Possible Solution Structures

A solution can be a simple path, an acyclic graph, or a cyclic graph.

23

DP and Heuristic Search

Heuristic search = starting state + forward expansion of solution + admissible heuristic + dynamic programming (DP)

Solution structure:   Sequence     Branches              Loops
Dynamic programming:  Forward DP   Backwards induction   Policy (value) iteration
Heuristic search:     A*           AO*                   LAO*

24

AO*

Nilsson 1971; Martelli & Montanari 1973

! Initialize the partial solution graph to the start state.
! Repeat until a complete solution is found:
! Expand some nonterminal state on the fringe of the best partial solution graph.
! Use backwards induction to update the costs of all ancestor states of the expanded state and possibly change the selected action.


25

LAO*

Hansen and Zilberstein, AAAI 1998

! Like AO*, performs dynamic programming on the set of states that includes the expanded state and all of its ancestors.
! But LAO* must use either policy iteration or value iteration instead of backwards induction.
! Convergence to exact state costs of value iteration is asymptotic, but it is generally more efficient than policy iteration for large problems.

26

Heuristic Evaluation Function

! h(i) is a heuristic estimate of the minimal-cost solution for every non-terminal tip state.
! h(i) is admissible if h(i) ≤ f*(i).
! An admissible heuristic estimate f(i) for any state in the explicit graph is defined as follows:

$f(i) = \begin{cases} 0 & \text{if } i \text{ is a goal state} \\ h(i) & \text{if } i \text{ is a non-terminal tip state} \\ \min_{a \in A(i)} \left[ c_i(a) + \sum_{j \in S} p_{ij}(a)\, f(j) \right] & \text{otherwise} \end{cases}$

27

Theoretical Properties

! Theorem 1: Using an admissible heuristic, LAO* converges to an optimal solution without (necessarily) expanding/evaluating all states.
! Theorem 2: If h2(i) is a more informative heuristic than h1(i) (i.e., h1(i) ≤ h2(i) ≤ f*(i)), LAO* using h2(i) expands a subset of the worst-case set of states expanded using h1(i).

28

Imperfect Observations

Partially observable MDP adds the following:
! O – a finite set of observations
! P(o | s′, a) – observation function: the probability that o is observed after taking action a resulting in a transition to state s′
! A discrete probability distribution over starting states (the initial belief state): b0

[Figure: agent-world loop with actions, observations, and rewards]

Example: Hallway

29

Minimize the number of steps to the starred square for a given start-state distribution.
States: grid cells with orientation
Actions: turn (left, right, or around), move forward, stay
Transitions: noisy
Observations: red lines
Goal: red star location

Example: Tiger Game

30

States: tiger left, tiger right
Actions: listen, open left, open right
Transitions: listening only provides info; opening a door resets the problem
Observations: noisy indications of the tiger’s location
Goal: get the treasure


Belief States

31

! A belief state is a probability distribution over the state of the system that can summarize the knowledge of the agent at a given point:

$b(s_t) = \Pr(s_t = s \mid s_0, a_1, o_1, a_2, o_2, \ldots, a_{t-1}, o_{t-1})$

Bayesian Updating of Beliefs

32

! Suppose that the agent takes action a in belief state b(s) and receives observation o
! What is the new belief state b′(s′)?
! Can use Bayes’ rule as follows:

$b'(s') = \Pr(s' \mid b, a, o) = \frac{O(o \mid s', a) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)}$

Bayesian Updating

! The probability of the observation can be computed by summing over all possible s′:

33

$\Pr(o \mid a, b) = \sum_{s'} \Pr(o \mid a, s', b) \Pr(s' \mid a, b) = \sum_{s'} O(o \mid s', a) \Pr(s' \mid a, b) = \sum_{s'} O(o \mid s', a) \sum_{s} T(s, a, s')\, b(s)$

Belief State Transition Model

34

! We can now define a new belief-state MDP with the following transition model:

$\tau(b, a, b') = \Pr(b' \mid a, b) = \sum_o \Pr(b' \mid o, a, b) \Pr(o \mid a, b) = \sum_o \Pr(b' \mid o, a, b) \sum_{s'} O(o \mid s', a) \sum_s T(s, a, s')\, b(s)$

! And the following reward function:

$\rho(b) = \sum_s b(s)\, R(s)$

Belief Update Example

35

Solution:
! Belief update ⇒ the agent maintains a belief over the tiger’s location and updates it
! Policy computation ⇒ the agent computes the optimal action for each belief

Starting from Pr(TR) = 0.5: after listening and hearing a growl on the left (L, GL), Pr(TR) = 0.15; after listening and hearing a growl on the right (L, GR), Pr(TR) = 0.85.

36

Performance Criteria and Utility

! In the infinite-horizon case: performance criterion = expected discounted reward over an infinite horizon

Utility function: $E\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\middle|\, b\right]$

! b is the a priori state probability distribution
! γ is the discount factor


37

Policy Representation

! A policy π is a rule for selecting actions
! For MDPs this can simply be a mapping from states (of the underlying system) to actions
! For POMDPs this is not possible, because the system state is only partially observable
! Thus, a policy must map from a decision state to actions. This decision state can be defined by:
! The history of the process (problem: grows exponentially; not suitable for infinite-horizon problems)
! A probability distribution over states (continuous state space)
! The memory of a finite-state controller

38

Finite-Memory Policies

! We want a discrete representation with a finite number of states!
! A finite-state controller maps H*, the set of all possible histories, into a finite number of memory states.
! Unlike a belief state, a memory state is not a sufficient statistic, but as the number of memory states is finite, the policy representation becomes easier.

39

Finite-Memory Policies cont.

! There are several possible representations for finite-memory policies:
  • Memoryless policies
  • Finite-length memory policies
  • Finite state controllers
! We will focus on finite state controllers, the most general approach
! Controllers can be deterministic or stochastic
! Need to explore the space of possible finite-state controllers as potential policies

40

Policy Evaluation for (PO)MDPs

! Utility function:

$U^\pi(b) = E\left[\sum_{t=0}^{\infty} \gamma^t R(b_t, \pi(b_t))\right]$

! For completely observable MDPs a policy determines a Markov chain in which each state corresponds to a state of the MDP.
! Then the utility of each state can be determined by solving a system of |S| linear equations:

$U^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s))\, U^\pi(s'), \quad \forall s \in S$
41

Policy Evaluation for POMDPs cont.

! For infinite-horizon POMDPs, the possibility of policy evaluation depends on how a policy is represented:
! As a mapping from belief space to action space: no general method is known
! As a finite-state controller: policy evaluation becomes straightforward, because a finite-memory policy for POMDPs also determines a finite Markov chain
! This is the most important advantage of a finite-memory representation of a policy for POMDPs

42

Policy Evaluation for POMDPs cont.

[Figure] Illustration of (a) a finite-state controller and (b) the corresponding Markov chain for a two-state, two-action, two-observation POMDP.


43

Policy Evaluation for POMDPs cont.

! We allow the finite-state controller to visit an infinite number of belief states.
! In this way, the finite-state controller determines a Markov chain in which each state corresponds to a combination of a memory state qi and a system state sj.
! Thus, the size of the Markov chain is |Q||S|:

$\Pr([s', q'] \mid [s, q]) = P(s' \mid s, \alpha(q)) \sum_{o : \tau(q, o) = q'} \Pr(o \mid s', \alpha(q))$
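A sketch of this evaluation: build the |Q||S| chain and solve one linear system (hypothetical small model; `alpha[q]` is the node's action and `tau[q, o]` the node transition):

```python
import numpy as np

nQ, nS, nA, nO, gamma = 2, 2, 2, 2, 0.95
rng = np.random.default_rng(2)
T = rng.random((nA, nS, nS)); T /= T.sum(axis=2, keepdims=True)  # T(s,a,s')
O = rng.random((nA, nS, nO)); O /= O.sum(axis=2, keepdims=True)  # O(o|s',a)
R = rng.random((nS, nA))
alpha = np.array([0, 1])                 # deterministic action per node
tau = np.array([[0, 1], [1, 0]])         # node transition per observation

n = nQ * nS
P = np.zeros((n, n))                     # chain over pairs [q, s]
r = np.zeros(n)
for q in range(nQ):
    for s in range(nS):
        a = alpha[q]
        r[q * nS + s] = R[s, a]
        for s2 in range(nS):
            for o in range(nO):
                q2 = tau[q, o]
                P[q * nS + s, q2 * nS + s2] += T[a, s, s2] * O[a, s2, o]

V = np.linalg.solve(np.eye(n) - gamma * P, r)  # V(q,s) for every chain state
print(V.reshape(nQ, nS))
```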

44

Dynamic Programming for POMDPs

! The dynamic programming algorithm value iteration computes a new utility function Un based on a given utility function Un−1. For the completely observable MDP over belief space, this dynamic programming update is defined via the Bellman optimality equation:

$U_n(b) = \max_{a \in A} \left[ R(b, a) + \gamma \sum_{o \in O} \Pr(o \mid b, a)\, U_{n-1}(T(b, a, o)) \right], \quad \forall b \in B$

! We must find a way of representing the utility function for all belief states in a way that dynamic programming is possible (we need a finite number of states).

45

Utility-Function Representation

[Figure: a utility function for a two-state POMDP, shown as a set of linear vectors over belief space]

! The utility of each belief state:

$U(b) = \max_{\omega \in \Omega} \sum_{s \in S} b(s)\, \omega(s)$

! Thus, the upper surface of the vectors represents the utility function

46

Pruning Dominated Vectors

[Figure: a non-minimal representation of a piecewise-linear and convex utility function]

! Vectors 2, 3 and 5 are said to be dominated vectors because there is no belief state for which their utility is not dominated by the utility of some other vector ⇒ we can remove them
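A small sketch of the vector representation: evaluating U(b) as the upper surface, plus the cheapest pruning test, pointwise dominance (removing every dominated vector, as on the slide, requires a linear program per vector):

```python
import numpy as np

vectors = np.array([[0.0, 1.0],
                    [0.4, 0.6],   # dominated by the upper surface of the others,
                    [1.0, 0.0]])  # but not by any single vector pointwise

def utility(b, vectors):
    return np.max(vectors @ b)    # U(b) = max_w sum_s b(s) w(s)

def prune_pointwise(vectors):
    keep = []
    for i, w in enumerate(vectors):
        dominated = any(np.all(v >= w) and np.any(v > w)
                        for j, v in enumerate(vectors) if j != i)
        if not dominated:
            keep.append(w)
    return np.array(keep)

b = np.array([0.3, 0.7])
print(utility(b, vectors))        # 0.7, from vector [0, 1]
print(prune_pointwise(vectors))   # the middle vector survives this weaker test
```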

47

Algorithm: Value Iteration (Sketch)

! Iteratively apply the dynamic programming update to the utility function.
! Interleave vector generation with vector pruning.
! Improve the process by using incremental pruning (prune the resulting vectors as early as possible).
! As in the completely observable case, value iteration iteratively improves the utility function and eventually extracts a policy from it.

48

Value Iteration for POMDPs cont.

! A greedy policy with respect to a utility function U is defined as follows:

$\pi(b) = \arg\max_{a \in A} \left[ R(b, a) + \gamma \sum_{o \in O} P(o \mid b, a)\, U(T(b, a, o)) \right]$

! For infinite-horizon problems, value iteration converges to an ε-optimal utility function after a finite number of iterations.
! If a utility function is ε-optimal, a greedy policy with respect to it is ε-optimal.
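In the vector representation, each vector can be tagged with the action of the conditional plan that generated it, so the greedy action at b is simply the action of the maximizing vector. A sketch with hypothetical tiger-like numbers:

```python
import numpy as np

# states: [tiger-left, tiger-right]; hypothetical vectors and their plan actions
vectors = np.array([[ 9.0, -10.0],   # open-right: good iff tiger is on the left
                    [-10.0,  9.0],   # open-left:  good iff tiger is on the right
                    [  2.0,  2.0]])  # listen:     safe everywhere
actions = ["open-right", "open-left", "listen"]

def greedy_action(b):
    return actions[int(np.argmax(vectors @ b))]

print(greedy_action(np.array([0.5, 0.5])))    # -> 'listen' at a uniform belief
print(greedy_action(np.array([0.95, 0.05])))  # -> 'open-right' when nearly sure
```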


49

Outline

! Planning with Markov decision processes
! Decentralized partially-observable MDPs
! Complexity results
! Solving finite-horizon decentralized POMDPs
! Solving infinite-horizon decentralized POMDPs
! Scalability with respect to the number of agents
! Conclusion

50

Decentralized POMDP

! Multiple agents control a global Markov process
! Same reward function, different observation functions

[Figure: agents 1 and 2 send actions a1, a2 to the world and receive observations o1, o2 and a shared reward r]

51

DEC-POMDPs

! A two-agent DEC-POMDP is defined by a tuple 〈S, A1, A2, P, R, Ω1, Ω2, O〉, where
! S is a finite set of domain states, with initial state s0
! A1, A2 are finite action sets
! P(s, a1, a2, s′) is a state transition function
! R(s, a1, a2) is a reward function
! Ω1, Ω2 are finite observation sets
! O(a1, a2, s′, o1, o2) is an observation function
! Straightforward generalization to n agents
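A minimal container for this tuple (a sketch; the array shapes are just one reasonable convention, filled with uniform placeholders rather than a real domain):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DecPOMDP2:
    P: np.ndarray   # P[s, a1, a2, s']        transition function
    R: np.ndarray   # R[s, a1, a2]            shared reward
    O: np.ndarray   # O[a1, a2, s', o1, o2]   joint observation function
    b0: np.ndarray  # initial state distribution

# e.g., a 2-state, 2-action, 2-observation problem with placeholder values
S, A, Om = 2, 2, 2
model = DecPOMDP2(P=np.full((S, A, A, S), 1 / S),
                  R=np.zeros((S, A, A)),
                  O=np.full((A, A, S, Om, Om), 1 / Om**2),
                  b0=np.array([0.5, 0.5]))
```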

Relationship between Models

52

POSG = Partially Observable Stochastic Game
DEC-POMDP-COM = DEC-POMDP with Communication
I-POMDP = Interactive POMDP
MTDP = Multiagent Team Decision Problem

[Figure: containment diagram relating MDP, POMDP, DEC-MDP, DEC-POMDP, DEC-POMDP-COM, MTDP, POSG, and (finitely nested) I-POMDP]

Example: Multiagent Tiger

Nair, Tambe, Yokoo, Pynadath & Marsella, 2003

53

! Two agents try to locate the tiger and get the treasure
! Each agent may open one of the doors or listen
! Listening provides a noisy observation
! Large penalty for opening the door leading to the tiger; large reward for cooperation (choosing the same action) and for getting the treasure.

54

Example: Broadcast Channel

Hansen, Bernstein & Zilberstein, 2004

States: who has a message to send?
Actions: send or don’t send
Reward: +1 for successful broadcast; 0 if collision or channel not used
Observations: was there a collision? (noisy)


55

Example: Meeting on a Grid

Bernstein, Hansen & Zilberstein, 2005

States: grid cell pairs
Actions: ↑, ↓, ←, →
Transitions: noisy
Goal: meet quickly
Observations: red lines

56

Example: Cooperative Box-Pushing

Seuken and Zilberstein, 2007

Goal: push as many boxes as possible to the goal area; the larger box has a higher reward but requires two agents to move it.

Example: Sensor Network

Nair et al. 2005; Kumar and Zilberstein, 2009

57

! Multiple sensors track a moving target
! To localize the target, two neighboring sensors must track it at the same time

58

Solving DEC-POMDPs

! Each agent’s behavior is described by a local policy δi
! A policy can be represented as a mapping from:
! Local observation sequences to actions; or
! Local memory states to actions
! Actions can be selected deterministically or stochastically
! Goal is to maximize expected reward over a finite horizon or a discounted infinite horizon

Policy Trees

! Optimal policy trees for the multiagent tiger problem with horizon 3. The policy trees of both agents are the same.
! Each node is labeled with an action and each edge is labeled with an observation.
! Suitable for finite-horizon problems.

59

[Figure: identical depth-3 policy trees for agents 1 and 2 — listen (L) at the root and after most observations, opening right (OR) or left (OL) only after hearing the tiger (hl/hr) on the same side twice]

Finite-State Controllers

! Optimal four-node deterministic controllers for the multiagent tiger problem.
! Each node is labeled with an action and each transition is labeled with an observation.
! Suitable for infinite-horizon problems.

60

[Figure: four-node controllers for agents 1 and 2 — listen (L) nodes with hl/hr transitions, reaching an open-right (OR) or open-left (OL) node after consistent observations]


61

Early Work on Decentralized Decision Making and DEC-POMDPs

! Team theory [Marschak 55; Tsitsiklis & Papadimitriou 82]
! Incorporating dynamics [Witsenhausen 71]
! Communication strategies [Varaiya & Walrand 78; Xuan et al. 01; Pynadath & Tambe 02]
! Approximation algorithms [Peshkin et al. 00; Guestrin et al. 01; Nair et al. 03; Emery-Montemerlo et al. 04]
! First exact DP algorithm [Hansen et al. 04]
! First policy iteration algorithm [Bernstein et al. 05]
! Many recent exact and approximate DEC-POMDP algorithms

62

Some Fundamental Questions

! Are DEC-POMDPs significantly harder to solve than POMDPs? Why?
! What features of the problem domain affect the complexity and how?
! Is optimal dynamic programming possible?
! Can dynamic programming be made practical?
! Is it beneficial to treat communication as a separate type of action?
! How to use the structure and locality of agent interaction in order to develop more scalable algorithms?

63

Outline

! Planning with Markov decision processes
! Decentralized partially-observable MDPs
! Complexity results
! Solving finite-horizon decentralized POMDPs
! Solving infinite-horizon decentralized POMDPs
! Scalability with respect to the number of agents
! Conclusion

64

Previous Complexity Results

Finite horizon:
  MDP — P-complete (if T < |S|) [Papadimitriou & Tsitsiklis 87]
  POMDP — PSPACE-complete (if T < |S|) [Papadimitriou & Tsitsiklis 87]

Infinite-horizon discounted:
  MDP — P-complete [Papadimitriou & Tsitsiklis 87]
  POMDP — undecidable [Madani et al. 99]

65

Complexity of DEC-POMDPs

Bernstein, Givan, Immerman & Zilberstein, UAI 2000, MOR 2002

! The complexity of finite-horizon DEC-POMDPs had been hard to establish.
! A static version of the problem, where a single set of decisions is made in response to a single set of observations, is NP-hard [Tsitsiklis and Athans, 1985]
! In 2000, it was shown that a finite-horizon DEC-POMDP with two agents is NEXP-complete
! Provably hard, since P ≠ NEXP
! Probably doubly exponential, as opposed to POMDPs, which are probably exponential

66

DEC-POMDP ∈ NEXP

! Guess a joint policy
! The joint policy together with the DEC-POMDP yields an exponentially bigger Markov process with states being observation sequences
! The value of the joint policy can be determined using dynamic programming on this large Markov process


67

DEC-POMDP is NEXP-hard

Bernstein, Givan, Immerman, and Zilberstein, 2002

! Reduction from TILING
! Given n, L, H, V, is there a consistent tiling?

68

DEC-POMDP is NEXP-hard cont.

∃ policy with expected reward 0 ↔ ∃ consistent tiling

69

DEC-POMDP is NEXP-hard cont.

! The naive approach has a state for every pair of tile positions (exponential in the size of the instance!)
! Luckily, we need only remember info about the relationship between the positions
! Generate positions bit-by-bit, and only remember key information:
! Are they equal?
! Are they horizontally adjacent?
! Are they vertically adjacent?

Q.E.D.

70

Thus, DEC-POMDPs are Hard!

! They are provably intractable
! They are probably doubly exponential (unlike POMDPs)
! The problem remains NEXP-hard even when the state is jointly observable
! Hardness comes from decentralized operation, not from having hidden state
! But these are worst-case results! Could real-world problems be easier in practice?

71

What Features of the Domain Affect the Complexity and How?

! Factored state spaces (structured domains)
! Independent transitions (IT)
! Independent observations (IO)
! Structured reward function (SR)
! Goal-oriented objectives (GO)
! Degree of observability (partial, full, jointly full)
! Degree and structure of interaction
! Degree of information sharing and communication

72

Complexity of Sub-Classes

Goldman & Zilberstein, JAIR 2004

Finite-horizon complexity of sub-classes (the original table distinguishes general DEC-MDPs; DEC-MDPs with independent observations and transitions (IO & IT); and goal-oriented variants with |G| = 1 or |G| > 1, under certain conditions and with information sharing): complexities range from NEXP-complete for the general problem, to NP-complete with IO & IT, down to P-complete for the most restricted goal-oriented cases.


73

Outline

! Planning with Markov decision processes
! Decentralized partially-observable MDPs
! Complexity results
! Solving finite-horizon decentralized POMDPs
! Solving infinite-horizon decentralized POMDPs
! Scalability with respect to the number of agents
! Conclusion

JESP: First DP Algorithm

Nair, Tambe, Yokoo, Pynadath & Marsella, IJCAI 2003

! JESP: Joint Equilibrium-based Search for Policies
! Complexity: exponential
! Result: only locally optimal solutions

74

75

Exact DP for POMDPs

! The key to solving POMDPs is that they can be viewed as belief-state MDPs [Smallwood & Sondik 73]
! We’ll start with some important concepts:

[Figure: a policy tree, a linear value function over belief space, and a belief state (s1: 0.25, s2: 0.40, s3: 0.35)]

76

Dynamic Programming

[Figure: initial value vectors for one-step policies a1 and a2 over beliefs about s1 and s2]

77

Dynamic Programming

[Figure: all eight two-step policy trees produced by an exhaustive backup]

78

Dynamic Programming

[Figure: the four policy trees that survive pruning of dominated trees]

79

Dynamic Programming

[Figure: the resulting piecewise-linear value function over beliefs about s1 and s2]

80

Is Exact DP Possible for DEC-POMDPs?

! It is not as clear how to define and maintain a belief state for a DEC-POMDP
! To have a belief about the world, agents need to have beliefs about other agents as well
! The first exact DP algorithm for finite-horizon DEC-POMDPs uses the notion of a generalized belief state
! The algorithm also applies to competitive situations modeled as POSGs

Generalized Belief State

A generalized belief state captures the uncertainty of one agent with respect to the state of the world as well as the policies of the other agents.

81

82

Generalizing Dynamic Programming

! Build policy trees as in the single-agent case
! Generalize the pruning rule:

What to prune | Space for pruning
Normal-form game: strategy | Δ(strategies of other agents)
POMDP: policy tree | Δ(states)
POSG / DEC-POMDP: policy tree | Δ(states × policy trees of other agents)

83

Strategy Elimination

! Any finite-horizon DEC-POMDP can be converted to a normal-form game
! But the number of strategies is doubly exponential in the horizon length!

[Figure: m × n payoff matrix with joint-reward entries (R¹ᵢⱼ, R²ᵢⱼ)]

84

A Better Way to Do Elimination

Hansen, Bernstein & Zilberstein, AAAI 2004

! We can use dynamic programming to eliminate dominated strategies without first converting to normal form
! Pruning a subtree eliminates the set of trees containing it

[Figure: pruning one depth-1 subtree eliminates every larger tree that contains it]


85

Dynamic Programming

[Figure: initial one-step policies a1 and a2 for each of the two agents]

86

Dynamic Programming

[Figure: exhaustive backup — all two-step policy trees for both agents]

87

Dynamic Programming

[Figure: pruning removes trees that are dominated for every generalized belief state]

88

Dynamic Programming

[Figure: pruning one agent’s trees enables further pruning for the other agent]

89

Dynamic Programming

[Figure: alternating pruning continues]

90

Dynamic Programming

[Figure: the joint policy trees that remain after iterated elimination]

91

Dynamic Programming

92

First Exact DP for DEC-POMDPs

Hansen, Bernstein & Zilberstein, AAAI 2004

! Theorem: DP performs iterated elimination of dominated strategies in the normal form of the POSG.
! Corollary: DP can be used to find an optimal joint policy in a DEC-POMDP.
# The algorithm is complete & optimal
# Complexity is doubly exponential

Example: Multiagent Tiger Problem

Optimal policies with horizon = 3, value = 5.19

[Figure: identical depth-3 policy trees for agents 1 and 2, as shown earlier]

93 94

Empirical Results

! Policy trees remaining after each iteration with the MABC problem:
! DP prunes a significant number of trees

Iteration | Brute force | DP algorithm
1 | 2 | 2
2 | 8 | 6
3 | 128 | 20
4 | 32,768 | 300
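The brute-force column follows the recurrence n_t = |A| · n_{t−1}^{|O|} for the number of depth-t policy trees per agent; with 2 actions and 2 observations (MABC) this reproduces 2, 8, 128, 32768:

```python
A, O = 2, 2          # actions and observations per agent in MABC
n = A                # depth-1 trees are single actions
for t in range(1, 5):
    print(t, n)      # 1 2 / 2 8 / 3 128 / 4 32768
    n = A * n ** O   # pick a root action and one subtree per observation
```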

95

Alternative: Heuristic Search

Szer, Charpillet & Zilberstein, UAI 2005

! Perform forward best-first search in the space of joint policies
! Take advantage of a known start-state distribution
! Take advantage of domain-independent heuristics for pruning

96

The MAA* Algorithm

Szer, Charpillet & Zilberstein, UAI 2005

! Theorem: MAA* is complete and optimal
! Main advantage: significant reduction in memory requirements over the dynamic programming approach


Scaling Up Heuristic Search

Spaan, Oliehoek, and Amato, IJCAI 2011; Oliehoek et al., JAIR 2013

! Problem with MAA*: the number of children of a node is doubly exponential in the node's depth
! Basic idea: avoid the full expansion of each node by incrementally generating the children only when a child might have a higher heuristic value
! Introduce a more memory-efficient representation for heuristic functions
! Yields a speedup over the state of the art, allowing for optimal solutions over longer horizons

97

Scaling Up Heuristic Search

Spaan, Oliehoek, and Amato, IJCAI 2011

98 99

Memory-Bounded DP (MBDP)

Seuken & Zilberstein, IJCAI 2007

! Combining the two approaches:
! The DP algorithm is a bottom-up approach
! The search operates top-down
! The DP step can only eliminate a policy tree if it is dominated for every belief state
! But only a small subset of the belief space is actually reachable
! Furthermore, the combined approach allows the algorithm to focus on a small subset of joint policies that appear best

100

Memory-Bounded DP Cont.

101

The MBDP Algorithm

102

Generating “Good” Belief States

! MDP heuristic − obtained by solving the corresponding fully observable multiagent MDP
! Infinite-horizon heuristic − obtained by solving the corresponding infinite-horizon DEC-POMDP
! Random policy heuristic − can augment another heuristic by adding random exploration
! Heuristic portfolio − maintain a set of belief states generated by a set of different heuristics
! Recursive MBDP


103

Performance of MBDP

104

Maximum Horizon Comparison

! Comparison of the maximum horizon each algorithm can handle before it runs out of time or memory
! MBDP can solve problems with much larger horizons

105

MBDP Parameter Tuning

The best parameter settings and solution values for the tiger problem with horizon 20 for given time limits.

106

How Many Trees are Needed?

In the problems we tested, maintaining a small number of policy trees is sufficient for finding near-optimal solutions.

MBDP Successors

! Improved MBDP (IMBDP) [Seuken and Zilberstein, UAI 2007]
! MBDP with Observation Compression (MBDP-OC) [Carlin and Zilberstein, AAMAS 2008]
! Point-Based Incremental Pruning (PBIP) [Dibangoye, Mouaddib, and Chaib-draa, AAMAS 2009]
! PBIP with Incremental Policy Generation (PBIP-IPG) [Amato, Dibangoye, and Zilberstein, ICAPS 2009]
! Constraint-Based Dynamic Programming (CBDP) [Kumar and Zilberstein, AAMAS 2009]
! Point-Based Backup for Decentralized POMDPs [Kumar and Zilberstein, AAMAS 2010]
! Point-Based Policy Generation (PBPG) [Wu, Zilberstein, and Chen, AAMAS 2010]

107

Key Ideas Behind These Algorithms

! Perform search in a reduced policy space
! Exact algorithms perform only lossless pruning
! Approximate algorithms rely on more aggressive pruning
! MBDP represents an exponential-size policy in linear space O(maxTrees × T)
! The resulting policy is an acyclic finite-state controller.

108


Point-Based Policy Generation

Wu, Zilberstein & Chen, AAMAS 2010

! A better way to perform the bottom-up dynamic programming step is to construct the best joint policy for each belief state only, avoiding full or partial backups.
! The PBPG algorithm does that by alternating between the agents and optimizing the policy parameters of one agent while keeping the policies of the other agents fixed, until a Nash equilibrium (or a maximum number of iterations) is reached.
! Consequently, many more policies can be kept as building blocks for the next iteration.

109

Point-Based Policy Generation

Wu, Zilberstein & Chen, AAMAS 2010

110

PBPG

! Performance of state-of-the-art algorithms for finite-horizon DEC-POMDPs
! Online algorithms offer a viable alternative [Wu, Zilberstein & Chen, 09]

111

Trial-Based DP for DEC-POMDP

Wu, Zilberstein & Chen, AAMAS 2010

112

Performance of TBDP

Wu, Zilberstein & Chen, AAMAS 2010

113 114

Outline

! Planning with Markov decision processes
! Decentralized partially-observable MDPs
! Complexity results
! Solving finite-horizon decentralized POMDPs
! Solving infinite-horizon decentralized POMDPs
! Scalability with respect to the number of agents
! Conclusion


115

Infinite-Horizon DEC-POMDPs

! Unclear how to define a compact belief state without fixing the policies of the other agents
! Value iteration does not generalize to the infinite-horizon case
! Can generalize policy iteration for POMDPs [Hansen 98; Poupart & Boutilier 04]
! Basic idea: represent local policies using (deterministic/stochastic) finite-state controllers and define a set of controller transformations that guarantee improvement & convergence

Example: Multiagent Tiger Problem

! Deterministic controller for each agent
! 3 nodes, 3 actions, 2 observations
! Parameters: one action per node & one node transition per observation

[Figure: three-node deterministic controllers for agents 1 and 2 — listen (L) nodes with hl/hr transitions leading to an open-right (OR) or open-left (OL) node]

116

Example: Multiagent Tiger Problem

! Stochastic controller for each agent
! 2 nodes, 3 actions, 2 observations
! Parameters: P(ai | qi), P(qi′ | qi, oi)

117

[Figure: two-node stochastic controllers for agents 1 and 2, mixing listen (L) with open-right (OR) / open-left (OL); transition probabilities such as 0.875 and 0.125 label the edges]

118

Policies as Controllers

! A finite-state controller represents each policy
! Fixed memory
! Randomness is used to offset memory limitations
! Action selection: ψ : Qi → ΔAi
! Transitions: η : Qi × Ai × Oi → ΔQi
! The value of a two-agent joint controller is given by the Bellman equation:

$V(q_1, q_2, s) = \sum_{a_1, a_2} P(a_1 \mid q_1) P(a_2 \mid q_2) \Big[ R(s, a_1, a_2) + \gamma \sum_{s'} P(s' \mid s, a_1, a_2) \sum_{o_1, o_2} O(o_1, o_2 \mid s', a_1, a_2) \sum_{q_1', q_2'} P(q_1' \mid q_1, a_1, o_1) P(q_2' \mid q_2, a_2, o_2)\, V(q_1', q_2', s') \Big]$
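Since this equation is linear in V, a joint controller can be evaluated exactly with one linear solve over all (q1, q2, s) triples. A sketch with random placeholder parameters (both agents share the same controller here only for brevity):

```python
import numpy as np

nQ, nS, nA, nO, gamma = 2, 2, 2, 2, 0.95
rng = np.random.default_rng(3)

def dist(*shape):                      # random conditional distributions
    d = rng.random(shape)
    return d / d.sum(axis=-1, keepdims=True)

T = dist(nS, nA, nA, nS)               # P(s' | s, a1, a2)
Ob = dist(nA, nA, nS, nO * nO)         # O(o1,o2 | s', a1, a2), obs pair flattened
R = rng.random((nS, nA, nA))
act = dist(nQ, nA)                     # P(a | q)
trans = dist(nQ, nA, nO, nQ)           # P(q' | q, a, o)

idx = lambda q1, q2, s: (q1 * nQ + q2) * nS + s
n = nQ * nQ * nS
M, r = np.zeros((n, n)), np.zeros(n)
for q1, q2, s in np.ndindex(nQ, nQ, nS):
    i = idx(q1, q2, s)
    for a1, a2 in np.ndindex(nA, nA):
        w = act[q1, a1] * act[q2, a2]
        r[i] += w * R[s, a1, a2]
        for s2, o1, o2 in np.ndindex(nS, nO, nO):
            p = w * T[s, a1, a2, s2] * Ob[a1, a2, s2, o1 * nO + o2]
            for p1, p2 in np.ndindex(nQ, nQ):
                M[i, idx(p1, p2, s2)] += p * trans[q1, a1, o1, p1] * trans[q2, a2, o2, p2]

V = np.linalg.solve(np.eye(n) - gamma * M, r)  # V(q1, q2, s) for all triples
print(V.reshape(nQ, nQ, nS))
```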

119

Finding Optimal Controllers

! How can we search the space of possible joint controllers?
! How do we set the parameters of the controllers to maximize value?
! Deterministic controllers – can use traditional search methods such as BFS or B&B
! Stochastic controllers – present a continuous optimization problem
! Key question: how to best use a limited amount of memory to optimize value?

120

Independent Joint Controllers

! The local controller for agent i is defined by a conditional distribution P(ai, qi′ | qi, oi)
! An independent joint controller is expressed by: Πi P(ai, qi′ | qi, oi)
! Can be represented as a dynamic Bayes net

[Figure: two-slice DBN with state s, controller nodes q1, q2, actions a1, a2, and observations o1, o2]


121

Correlated Joint Controllers

Bernstein, Hansen & Zilberstein, IJCAI 2005, JAIR 2009

A correlation device, [Qc, ψ], is a set of nodes and a stochastic state transition function.

• Joint controller: $\sum_{q_c} P(q_c' \mid q_c) \prod_i P(a_i, q_i' \mid q_i, o_i, q_c)$
• A shared source of randomness affecting decisions and memory-state updates
• Random bits for the correlation device can be determined prior to execution time

[Figure: the joint-controller DBN extended with correlation-device nodes qc, qc′]
122

The Utility of Correlation

! Two agents, each with actions A and B
! Restricted to memoryless, open-loop policies
! Best policy is 1/2 AA and 1/2 BB

[Figure: two-state example — in s1 only joint action AA yields +R, in s2 only BB yields +R; all other joint actions yield 0]

123

Exhaustive Backups

[Figure: an exhaustive backup grows each agent’s controller with nodes for every action and every deterministic transition rule over observation pairs]

! Add a node for every possible action and deterministic transition rule
! Repeated backups converge to optimality, but lead to very large controllers

124

Value-Preserving Transformations

! A value-preserving transformation changes the joint controller without sacrificing value
! Formally, there must exist mappings fi : Qi → ΔRi for each agent i and fc : Qc → ΔRc such that for all $s \in S$, $\vec{q} \in \vec{Q}$, and $q_c \in Q_c$:

$V(s, \vec{q}, q_c) \le \sum_{\vec{r}, r_c} P(\vec{r} \mid \vec{q})\, P(r_c \mid q_c)\, V(s, \vec{r}, r_c)$

125

Bounded Policy Iteration Algorithm

Bernstein, Hansen & Zilberstein, IJCAI 2005, JAIR 2009

Repeat:
1) Evaluate the controller
2) Perform an exhaustive backup
3) Perform value-preserving transformations
Until the controller is ε-optimal for all states

Theorem: For any ε, bounded policy iteration returns a joint controller that is ε-optimal for all initial states in a finite number of iterations.

126

Exhaustive Backup Experiment

Started with single-node controllers and no correlation device. Alternated between exhaustive backups and controller reductions until out of memory.

Controller sizes and value from a start state:

Problem | Iterations | Sizes (no reductions) | Sizes (with reductions) | Value
Recycling Robots | 3 | (2187, 2187) | (24, 24) | 25.6
Broadcast Channel | 2 | (128, 128) | (10, 12) | 4.5
Meeting on a Grid | 2 | (3125, 3125) | (80, 80) | 3.7


127

Useful Transformations

! Controller reductions: shrink the controller without sacrificing value
! Bounded dynamic programming updates: increase value while keeping the size fixed
! Both can be done using polynomial-size linear programs
! Generalize ideas from the POMDP literature, particularly the BPI algorithm [Poupart & Boutilier 03]

128

Controller Reduction

! For some node qi, find a convex combination of nodes in Qi \ {qi} that dominates qi for all states and nodes of the other controllers; merge qi into the convex combination by changing transition probabilities
! Corresponding linear program:

Variables: ε, $P(\hat{q}_i)$
Objective: maximize ε
Constraints: $\forall s \in S,\ q_{-i} \in Q_{-i},\ q_c \in Q_c$:
$V(s, q_i, q_{-i}, q_c) + \varepsilon \le \sum_{\hat{q}_i} P(\hat{q}_i)\, V(s, \hat{q}_i, q_{-i}, q_c)$

! Theorem: A controller reduction is a value-preserving transformation.

129

Bounded DP Update

! For some node qi, find better parameters assuming that the old parameters will be used from the second step onwards; the new parameters must yield value at least as high for all states and nodes of the other controllers
! Corresponding linear program:

Variables: ε, $P(a_i, q_i' \mid q_i, o_i, q_c)$
Objective: maximize ε
Constraints: $\forall s \in S,\ q_{-i} \in Q_{-i},\ q_c \in Q_c$:
$V(s, \vec{q}, q_c) + \varepsilon \le \sum_{\vec{a}} P(\vec{a} \mid \vec{q}, q_c) \Big[ R(s, \vec{a}) + \gamma \sum_{s', \vec{o}, \vec{q}', q_c'} P(\vec{q}' \mid \vec{q}, \vec{a}, \vec{o}, q_c)\, P(s', \vec{o} \mid s, \vec{a})\, P(q_c' \mid q_c)\, V(s', \vec{q}', q_c') \Big]$

! Theorem: A bounded DP update is a value-preserving transformation.

130

Modifying the Correlation Device

! Both transformations can be applied to the correlation device
! Slightly different linear programs to solve
! Can think of the correlation device as another agent
! Implementation questions:
! What to use for an initial joint controller?
! Which transformations to perform?
! In what order to choose nodes to remove or improve?

131

Decentralized BPI Summary

! DEC-BPI finds better and much more compact solutions than exhaustive backups
! A larger correlation device tends to lead to higher values on average
! Larger local controllers tend to yield higher average values, up to a point
! But bounded DP is limited by improving one controller at a time
! The linear program (one-step lookahead) results in local optimality and tends to “get stuck”

132

Nonlinear Optimization Approach

Amato, Bernstein & Zilberstein, UAI 2007, JAAMAS 2010

! Basic idea: model the problem as a nonlinear program (NLP)
! Consider node values (as well as controller parameters) as variables
! The NLP can take advantage of an initial state distribution when it is given
! Improvement and evaluation all in one step (equivalent to an infinite lookahead)
! Additional constraints maintain valid values


133

NLP Representation

Variables: $x(\vec{q}, \vec{a}) = P(\vec{a} \mid \vec{q})$, $y(\vec{q}, \vec{a}, \vec{o}, \vec{q}\,') = P(\vec{q}\,' \mid \vec{q}, \vec{a}, \vec{o})$, $z(\vec{q}, s) = V(\vec{q}, s)$

Objective: maximize $\sum_s b_0(s)\, z(\vec{q}_0, s)$

Constraints: $\forall s \in S,\ \vec{q} \in \vec{Q}$:
$z(\vec{q}, s) = \sum_{\vec{a}} x(\vec{q}, \vec{a}) \Big[ R(s, \vec{a}) + \gamma \sum_{s'} P(s' \mid s, \vec{a}) \sum_{\vec{o}} O(\vec{o} \mid s', \vec{a}) \sum_{\vec{q}\,'} y(\vec{q}, \vec{a}, \vec{o}, \vec{q}\,')\, z(\vec{q}\,', s') \Big]$

Additional linear constraints:
! ensure controllers are independent
! all probabilities sum to 1 and are non-negative

134

Independence Constraints

! Independence constraints guarantee that action selection and controller transition probabilities for each agent depend only on local information
! Action selection independence:
! Controller transition independence:

135

Probability Constraints

! Probability constraints guarantee that action selection probabilities and controller transition probabilities are non-negative and that they add up to 1:

(Superscript f’s represent arbitrary fixed values)

136

Optimality

Theorem: An optimal solution of the NLP results in optimal stochastic controllers for the given size and initial state distribution.

! Advantages of the NLP approach:
! Efficient policy representation with fixed memory
! The NLP represents the optimal policy for a given size
! Takes advantage of a known start state
! Easy to implement using off-the-shelf solvers
! Limitations:
! Difficult to solve optimally

137

Adding a Correlation Device

! The NLP approach can be extended to include a correlation device, using the following formulation:
! A new variable w(c, c′) represents the transition function of the correlation device; action selection and controller transitions depend on the new shared signal.

138

Comparison of NLP & DEC-BPI

Amato, Bernstein & Zilberstein, UAI 2007, JAAMAS 2010

! Used the freely available nonlinear constrained optimization solver “filter” on the NEOS server (http://www-neos.mcs.anl.gov/neos/)
! The solver guarantees a locally optimal solution
! Used 10 random initial controllers for a range of controller sizes
! Compared NLP with DEC-BPI, with and without a small (2-node) correlation device


139

Results: Broadcast Channel

Amato, Bernstein & Zilberstein, UAI 2007

! A simple two-agent networking problem (2 agents, 4 states, 2 actions, 5 observations)
! Average quality over 10 trials:
! Average run time:

140

Results: Multi-Agent Tiger

Amato, Bernstein & Zilberstein, JAAMAS 2010

! A two-agent version of a “well-known” POMDP benchmark [Nair et al. 03] (2 states, 3 actions, 2 observations)
! Average quality of various controller sizes using NLP methods with and without a 2-node correlation device, and BFS

141

Results: Meeting in a Grid

Amato, Bernstein & Zilberstein, JAAMAS 2010

! A two-agent domain with 16 states, 5 actions, 2 observations
! Average quality of various controller sizes using NLP methods and DEC-BPI, with and without a 2-node correlation device, and BFS

142

Results: Recycling Robots

Amato, Bernstein & Zilberstein, JAAMAS 2010

! A two-agent extension of the problem domain introduced in [Sutton & Barto 98] (4 states, 3 actions, 2 observations)
! Average quality of various controller sizes using NLP methods and DEC-BPI, with and without a 2-node correlation device, and BFS

Results: Box Pushing

Amato, Bernstein & Zilberstein, JAAMAS 2010

143

Values and running times (in seconds) for each controller size using NLP methods and DEC-BPI, with and without a 2-node correlation device, and BFS. An “x” indicates that the approach was not able to solve the problem.

144

The Value of Correlation

A more detailed comparison of the NLP approach with and without a correlation device shows that:

! Correlation helps to produce significantly higher values
! Even though correlation increases runtime, it helps produce better values within a given amount of time


145

NLP Approach Summary

! The NLP defines the optimal fixed-size stochastic controller
! The approach shows consistent improvement over DEC-BPI using an off-the-shelf locally optimal solver
! A small correlation device can have significant benefits
! Better performance may be obtained by exploiting the structure of the NLP

146

Outline

! Planning with Markov decision processes
! Decentralized partially-observable MDPs
! Complexity results
! Solving finite-horizon decentralized POMDPs
! Solving infinite-horizon decentralized POMDPs
! Scalability with respect to the number of agents
! Conclusion

Exploiting the Locality of Interaction

! In practical settings that involve many agents, each agent often interacts with a small number of “neighboring” agents (e.g., firefighting, sensor networks)
! Algorithms designed to exploit this property include LID-JESP [Nair et al., AAAI 05], SPIDER [Varakantham et al., AAMAS 07], and FANS [Marecki et al., AAMAS 08]
! FANS uses FSCs for policy representation and:
! Exploits FSCs for dynamic programming in policy evaluation and heuristic computations, providing significant speedups
! Introduces novel heuristics to automatically vary the FSC size across agents
! Performs policy search that exploits the locality of agent interactions

147 148

Constraint-Based DP

Kumar & Zilberstein, AAMAS 2009

! Model the domain as a Networked Distributed POMDP (ND-POMDP)—a restricted class of DEC-POMDPs characterized by a decomposable reward function.
! CBDP uses point-based dynamic programming (similar to MBDP).
! CBDP uses constraint network algorithms to improve the efficiency of key steps:
! Computation of the heuristic function
! Belief sampling using the heuristic function
! Finding the best joint policy for a particular belief

149

Results: Sensors Tracking Target

Kumar & Zilberstein, AAMAS 2009

! CBDP provides orders of magnitude of speedup over FANS
! Provides better solution quality for all test instances
! Provides strong theoretical guarantees on time and space complexity, enhancing scalability:
! Linear complexity in the planning horizon length
! Linear in the number of agents, which is necessary to solve large realistic problems
! Exponential only in a small parameter that depends on the level of interaction among the agents

[Figure: sensor network configuration with target locations loc1 and loc2]

Sample Results

! A 7-agent configuration with 4 actions per agent. Two adjacent agents are required to track a target.
! Graphs show the solution quality (left) and time (right) of our approach (CBDP) compared with the best existing method (FANS).
! FANS is not scalable beyond horizon 7. CBDP has linear complexity in the horizon, and it provides better solution quality in less time.

[Figure: solution quality and time (sec, log scale) vs. horizon for CBDP and FANS]


New Scalable Approach

Kumar, Zilberstein, and Toussaint, IJCAI 2011

! Extend an approach [Toussaint and Storkey, ICML 06] that maps planning under uncertainty (POMDP) problems into probabilistic inference
! Characterize general constraints on the interaction graph that facilitate scalable planning
! Introduce an efficient algorithm to solve such models using probabilistic inference
! Identify a number of existing models with such constraints

151

Value Factorization

! θ = the parameters of an agent
! Factored state space: s = (s1, . . . , sM)
! Example: consider four agents such that V = V12 + V23 + V34

152

Existing Models Satisfy VF

! Each agent/state variable can participate in multiple value factors
! Worst-case complexity is NEXP-complete
! TI-DEC-MDP, ND-POMDP, and TD-POMDP satisfy value factorization

153

Computational Advantages

! Applicability:
! In models that satisfy VF, inference in the EM framework can be done independently in each value factor
! Smaller value factors ⇒ efficient inference
! Planning is no longer exponential; it is linear in the number of factors
! Implementation:
! Distributed planning
! Efficient implementation using message passing
! Parallel computation of messages

154

Planning by Inference

! Recasts planning as likelihood maximization in a DBN mixture with a binary reward variable r: P(r = 1 | s, a1, a2) ∝ R(s, a1, a2)

155

[Figure: the DBN mixture]

Exploiting the VF Property

! Exploit the additive nature of the value function for scalability
! The outer mixture simulates the VF property
! Each Vf(θf, sf) is evaluated using a time-dependent mixture
! Theorem: Maximizing the likelihood of observing the variable r = 1 optimizes the joint policy

156


The Expectation-Maximization Algorithm

! Observed data: r = 1; every other variable is hidden
! Use the EM algorithm to maximize the likelihood
! Implemented using message passing on the VF graph
! Example: 3 factors {Ag1, Ag2}, {Ag2, Ag3} and {Ag3, Ag4}

157

Properties of the EM Algorithm

! Scalability:
! The µ message requires independent inference in each factor
! Agents/state variables can be involved in multiple factors – can model complex systems via simpler interactions
! Distributed planning via message passing
! Complexity:
! Linear in the number of factors, exponential in the number of agents/state variables in a factor
! Generality:
! No additional assumptions (such as TOI) required – a general optimization recipe for models with the VF property
! Local optima?

158

Experiments

! ND-POMDP domains involving target tracking in sensor networks with imperfect sensing
! Multiple targets; limited sensors with battery
! Penalty of −1 per sensor for miscoordination or recharging the battery; positive reward (+80) per target scanned simultaneously by two adjacent sensors

159

Comparisons with NLP Approach (5P Domain)

160

Scalability on Larger Benchmarks

! 15-agent and 20-agent domains, internal states = 5

161

Summary of the EM Approach

! Value factorization (VF) facilitates scalability
! Several existing weakly coupled models satisfy VF
! An EM algorithm can solve models with this property and yield good-quality solutions
! Scalability: the E-step decomposes according to value factors; smaller factors lead to efficient inference
! Can be easily implemented using message passing among the agents
! Future work: explore techniques for even faster inference, and establish better error bounds.

162


What’s Next?

! Need to address two drawbacks of state-of-the-art solvers:
! Scalability is still an issue with 100s of agents
! Reliance on complete knowledge of the model
! Key ideas:
! Extend the reduction to the maximum-likelihood problem, which can be solved using EM.
! Introduce a model-free version of this approach that employs Monte Carlo EM (MCEM).
! Improved sampling strategies and error bounds

163

Example: Traffic Control

Wu, Zilberstein & Jennings, IJCAI 2013

Traffic flows on an n × n grid from one side to the other. Each intersection is controlled by an agent that only allows traffic to flow vertically or horizontally at each time step. When a path is clear, all awaiting traffic for that path propagates through, with a +1 reward per unit of traffic.

164

[Figure: four agents controlling intersections with vertical and horizontal queues; arrival rate 0.5; links marked clear or blocked]

165

Outline

! Models for decentralized decision making
! Complexity results
! Solving finite-horizon DEC-POMDPs
! Solving infinite-horizon DEC-POMDPs
! Scalability beyond two agents
! Conclusion

166

Back to Fundamental Questions

! Are DEC-POMDPs significantly harder to solve than POMDPs? Why?
! What features of the problem domain affect the complexity and how?
! Is optimal dynamic programming possible?
! Can dynamic programming be made practical?
! Is it beneficial to treat communication as a separate type of action?
! How to use the structure and locality of agent interaction in order to develop more scalable algorithms?

Resources

! Lots of online resources: tutorials, workshops, publications, problem descriptions, algorithms, software, algorithm testing, visualization tools, downloads, announcements.

$ http://rbr.cs.umass.edu/decpomdp/
$ http://teamcore.usc.edu/projects/dpomdp/
$ http://masplan.org
$ http://thinc.cs.uga.edu/pomdp/

167

rbr.cs.umass.edu/decpomdp

Shlomo Zilberstein’s Group (Chris Amato)

168

slide-29
SLIDE 29

teamcore.usc.edu/projects/dpomdp

Milind Tambe’s Group @USC

169

masplan.org

Matthijs Spaan @Delft

170

thinc.cs.uga.edu/pomdp

Prashant Doshi @UGA

! Online portal for problem solving using POMDPs
! Contains implementations of key algorithms for POMDPs, Dec-POMDPs, and I-POMDPs
! For Dec-POMDPs:
! Exact methods: BFS, DICEPS, GMAA*, GMAA*-ICE
! Approximate methods: JESP, DP-JESP
! Offers interactive visualization of the policies and downloading of the results

171

Acknowledgments

Thanks to my former students Martin Allen, Christopher Amato, Daniel Bernstein, Alan Carlin, Eric Hansen, Akshat Kumar, Marek Petrik, Sven Seuken, Siddharth Srivastava, Feng Wu, and Xiaojian Wu; my former postdocs Claudia Goldman and William Yeoh; and my colleagues François Charpillet, Prashant Doshi, Victor Lesser, Frans Oliehoek, and Matthijs Spaan for many fruitful discussions and some of the materials included in the tutorial.

Support for this work has been provided in part by NSF, AFOSR, and NASA.

172

Questions?

Additional Information:
Resource-Bounded Reasoning Lab
University of Massachusetts, Amherst
http://rbr.cs.umass.edu

173

References

! C. Amato, D. S. Bernstein, and S. Zilberstein. Optimizing memory-bounded controllers for decentralized POMDPs. In Proc. of Uncertainty in Artificial Intelligence, 1–8, 2007.
! C. Amato, D. S. Bernstein, and S. Zilberstein. Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Autonomous Agents and Multi-Agent Systems, 21(3):293–320, 2010.
! C. Amato, J. Dibangoye, and S. Zilberstein. Incremental policy generation for finite-horizon DEC-POMDPs. In Proc. of Int'l Conf. on Automated Planning and Scheduling, 2–9, 2009.
! R. Becker, S. Zilberstein, V. Lesser, and C. V. Goldman. Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research, 22:423–455, 2004.
! R. Bellman. Dynamic Programming. Princeton University Press, 1957.
! D. S. Bernstein, C. Amato, E. A. Hansen, and S. Zilberstein. Policy iteration for decentralized control of Markov decision processes. Journal of Artificial Intelligence Research, 34:89–132, 2009.
! D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.

174


! D. S. Bernstein, E. A. Hansen, and S. Zilberstein. Bounded policy iteration for decentralized POMDPs. In Proc. of Int'l Joint Conf. on Artificial Intelligence, 1287–1292, 2005.
! C. Boutilier. Planning, learning and coordination in multiagent decision processes. In Theoretical Aspects of Rationality and Knowledge, 1996.
! A. Carlin and S. Zilberstein. Value-based observation compression for DEC-POMDPs. In Proc. of Int'l Conf. on Autonomous Agents and Multi Agent Systems, 501–508, 2008.
! A. Carlin and S. Zilberstein. Decentralized monitoring of anytime decision making. In Proc. of Int'l Conf. on Autonomous Agents and Multiagent Systems, 157–164, 2011.
! A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In Proc. of the National Conf. on Artificial Intelligence, 1994.
! J. S. Dibangoye, A.-I. Mouaddib, and B. Chaib-draa. Point-based incremental pruning heuristic for solving finite-horizon DEC-POMDPs. In Proc. of Int'l Joint Conf. on Autonomous Agents and Multi Agent Systems, 2009.
! E. Durfee and S. Zilberstein. Multiagent planning, control, and execution. In G. Weiss (Ed.), Multiagent Systems, Second Edition, 485–546, MIT Press, 2013.
! R. Emery-Montemerlo, G. Gordon, J. Schneider, and S. Thrun. Approximate solutions for partially observable stochastic games with common payoffs. In Proc. of Int'l Joint Conf. on Autonomous Agents and Multi Agent Systems, 2004.
! R. Emery-Montemerlo, G. Gordon, J. Schneider, and S. Thrun. Game theoretic control for robot teams. In Proc. of IEEE Int'l Conf. on Robotics and Automation, 1175–1181, 2005.
! P. J. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.

175

! C. V. Goldman and S. Zilberstein. Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research, 22:143–174, 2004.
! E. A. Hansen. Solving POMDPs by searching in policy space. In Proc. of Uncertainty in Artificial Intelligence, 1998.
! E. A. Hansen, D. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In Proc. of the National Conf. on Artificial Intelligence, 2004.
! E. A. Hansen and S. Zilberstein. Heuristic search in cyclic AND/OR graphs. In Proc. of the National Conf. on Artificial Intelligence, 412–418, 1998.
! E. A. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129(1-2):35–62, 2001.
! R. A. Howard. Dynamic Programming and Markov Processes. MIT Press and John Wiley & Sons, Inc., 1960.
! A. Kumar and S. Zilberstein. Constraint-based dynamic programming for decentralized POMDPs with structured interactions. In Proc. of Int'l Joint Conf. on Autonomous Agents and Multi Agent Systems, 561–568, 2009.
! A. Kumar and S. Zilberstein. Point-based backup for decentralized POMDPs: Complexity and new algorithms. In Proc. of Int'l Conf. on Autonomous Agents and Multiagent Systems, 1315–1322, 2010.
! A. Kumar, S. Zilberstein, and M. Toussaint. Scalable multiagent planning using probabilistic inference. In Proc. of Int'l Joint Conf. on Artificial Intelligence, 2140–2146, 2011.
! O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In Proc. of the National Conf. on Artificial Intelligence, 1999.

176

! J. Marecki, T. Gupta, P. Varakantham, M. Tambe, and M. Yokoo. Not all agents are equal: Scaling up distributed POMDPs for agent networks. In Proc. of Int'l Joint Conf. on Autonomous Agents and Multi Agent Systems, 2008.
! J. Marschak. Elements for a theory of teams. Management Science, 1(2):127–137, 1955.
! R. Nair, M. Tambe, M. Yokoo, D. Pynadath, and S. Marsella. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proc. of Int'l Joint Conf. on Artificial Intelligence, 2003.
! R. Nair, P. Varakantham, M. Tambe, and M. Yokoo. Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. In Proc. of the National Conf. on Artificial Intelligence, 2005.
! F. A. Oliehoek, M. T. J. Spaan, and N. Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32:289–353, 2008.
! F. A. Oliehoek, M. T. J. Spaan, S. Whiteson, and N. Vlassis. Exploiting locality of interaction in factored Dec-POMDPs. In Proc. of Int'l Joint Conf. on Autonomous Agents and Multi-Agent Systems, 517–524, 2008.
! F. A. Oliehoek, M. T. J. Spaan, C. Amato, and S. Whiteson. Incremental clustering and expansion for faster optimal planning in decentralized POMDPs. Journal of Artificial Intelligence Research, 46:449–509, 2013.
! J. M. Ooi and G. W. Wornell. Decentralized control of a multiple access broadcast channel: Performance bounds. In Proc. of the 35th Conf. on Decision and Control, 1996.
! C. H. Papadimitriou and J. N. Tsitsiklis. On the complexity of designing distributed protocols. Information and Control, 53(3):211–218, 1982.
! C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.

177

! L. Peshkin, K.-E. Kim, N. Meuleau, and L. P. Kaelbling. Learning to cooperate via policy search. In Proc. of Uncertainty in Artificial Intelligence, 2000.
! M. Petrik and S. Zilberstein. A bilinear programming approach for multiagent planning. Journal of Artificial Intelligence Research, 35:235–274, 2009.
! P. Poupart and C. Boutilier. Bounded finite state controllers. In Advances in Neural Information Processing Systems 16, MIT Press, 2004.
! D. V. Pynadath and M. Tambe. The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 16:389–423, 2002.
! S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, 2003.
! S. Seuken and S. Zilberstein. Improved memory-bounded dynamic programming for decentralized POMDPs. In Proc. of Uncertainty in Artificial Intelligence, 2007.
! S. Seuken and S. Zilberstein. Memory-bounded dynamic programming for DEC-POMDPs. In Proc. of Int'l Joint Conf. on Artificial Intelligence, 2009–2015, 2007.
! E. J. Sondik. The optimal control of partially observable Markov processes. PhD thesis, Stanford University, 1971.
! M. T. J. Spaan and F. S. Melo. Interaction-driven Markov games for decentralized multiagent planning under uncertainty. In Proc. of Int'l Joint Conf. on Autonomous Agents and Multi Agent Systems, 525–532, 2008.
! M. T. J. Spaan and F. A. Oliehoek. The MultiAgent Decision Process toolbox: Software for decision-theoretic planning in multiagent systems. In AAMAS Workshop on Multi-agent Sequential Decision Making in Uncertain Domains, 2008.

178

! M. T. J. Spaan, F. A. Oliehoek, and C. Amato. Scaling up optimal heuristic search in Dec-POMDPs via incremental expansion. In Proc. of Int'l Joint Conf. on Artificial Intelligence, 2027–2032, 2011.
! D. Szer, F. Charpillet, and S. Zilberstein. MAA*: A heuristic search algorithm for solving decentralized POMDPs. In Proc. of Uncertainty in Artificial Intelligence, 2005.
! J. Tsitsiklis and M. Athans. On the complexity of decentralized decision making and detection problems. IEEE Transactions on Automatic Control, 30(5):440–446, 1985.
! P. Varaiya and J. Walrand. On delayed sharing patterns. IEEE Transactions on Automatic Control, 23(3):443–445, 1978.
! P. Varakantham, J. Marecki, Y. Yabu, M. Tambe, and M. Yokoo. Letting loose a SPIDER on a network of POMDPs: Generating quality guaranteed policies. In Proc. of Int'l Joint Conf. on Autonomous Agents and Multi Agent Systems, 2007.
! H. Witsenhausen. Separation of estimation and control for discrete time systems. Proceedings of the IEEE, 59(11):1557–1566, 1971.
! F. Wu, S. Zilberstein, and X. Chen. Multi-agent online planning with communication. In Proc. of Int'l Conf. on Automated Planning and Scheduling, 321–329, 2009.
! F. Wu, S. Zilberstein, and X. Chen. Point-based policy generation for decentralized POMDPs. In Proc. of Int'l Conf. on Autonomous Agents and Multiagent Systems, 1307–1314, 2010.
! F. Wu, S. Zilberstein, and X. Chen. Online planning for multi-agent systems with bounded communication. Artificial Intelligence, 175(2):487–511, 2011.
! F. Wu, S. Zilberstein, and N. R. Jennings. Monte-Carlo expectation maximization for decentralized POMDPs. In Proc. of Int'l Joint Conf. on Artificial Intelligence, 2013.

179

! P. Xuan, V. Lesser, and S. Zilberstein. Communication decisions in multi-agent cooperation: Model and experiments. In Proc. of Int'l Conf. on Autonomous Agents, 2001.
! W. Yeoh, A. Kumar, and S. Zilberstein. Automated generation of interaction graphs for value-factored DEC-POMDPs. In Proc. of Int'l Joint Conf. on Artificial Intelligence, 2013.
! S. Zilberstein, R. Washington, D. Bernstein, and A.-I. Mouaddib. Decision-theoretic control of planetary rovers. In Plan-Based Control of Robotic Agents, volume 2466 of LNAI, 270–289, Springer, 2002.

180