Computational Approaches for Stochastic Shortest Path on Succinct MDPs

Krishnendu Chatterjee¹, Hongfei Fu², Amir Goharshady¹, Nastaran Okati³

¹IST Austria   ²Shanghai Jiao Tong University   ³Ferdowsi University of Mashhad

IJCAI 2018

Succinct MDPs

A succinct MDP is an MDP described implicitly by:

  • a set of variables,
  • a set of rules that describe how the variables can be updated,
  • a target set, consisting of valuations of the variables.

At every time step, a rule is non-deterministically chosen to update the variables. This process continues until the target set is reached. We can think of a succinct MDP as a probabilistic program of the following form:

  while φ do Q1 □ . . . □ Qk od

where □ denotes non-deterministic choice and each Qi is a sequence of assignments to variables.

Example:

  while x ≥ 1 do
    x := x + r  □  x := x − 1
  od
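To make the semantics concrete, here is a minimal Python sketch (ours, not from the slides) that executes the example program, with the non-deterministic choice resolved by an explicit policy. The slide does not specify the distribution of r, so purely for illustration we assume r is uniform on {−1, +1}; the names `run` and `policy` are ours.

```python
import random

# Executes the example succinct MDP
#   while x >= 1 do  x := x + r  []  x := x - 1  od
# with non-determinism resolved by a policy mapping the current valuation
# to a rule index. Assumption (not on the slide): r is uniform on {-1, +1}.
def run(x, policy, rng):
    steps = 0
    while x >= 1:
        if policy(x) == 0:
            x = x + rng.choice([-1, 1])  # rule Q1: x := x + r
        else:
            x = x - 1                    # rule Q2: x := x - 1
        steps += 1
    return steps

# A policy that always picks Q2 terminates in exactly x0 steps.
print(run(5, lambda x: 1, random.Random(0)))  # -> 5
```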


Another Example

  while x ≥ 1 do
    if (0.4) { x := x + 1; reward = 1 } else { x := x − 1 }
    □
    if (0.3) { x := x + 1; reward = 1 } else { x := x − 1 }
  od

Figure: Gambler’s Ruin as a Succinct MDP


Stochastic Shortest Path

Fix an initial valuation v for the program variables. Let σ be a policy that, at any point in time, given the history of the program, chooses one of the Qi’s to be executed.

We define R∞(v, σ) as the expected sum of rewards collected by the program before termination, if the program starts with the valuation v and follows the policy σ. We define infval(v) = inf_σ R∞(v, σ) and supval(v) = sup_σ R∞(v, σ), where the inf and sup are taken over all policies that guarantee finite expected termination time. We are looking for methods to obtain upper and lower bounds for both infval and supval.
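For concrete programs these quantities can be approximated empirically. Below is a sketch (ours, not part of the slides) that Monte-Carlo-estimates R∞(v, σ) for the Gambler’s Ruin example under the memoryless policy σ that always picks the first rule (the one with probability 0.4); the name `estimate_R` is ours.

```python
import random

# Monte Carlo estimate of R_inf(x0, sigma) for Gambler's Ruin, where sigma
# always executes the first rule: with prob 0.4, x := x + 1 and reward 1;
# otherwise x := x - 1 with reward 0.
def estimate_R(x0, runs=20000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        x, reward = x0, 0
        while x >= 1:
            if rng.random() < 0.4:
                x += 1
                reward += 1
            else:
                x -= 1
        total += reward
    return total / runs

# The walk has drift -0.2, so E[steps] = x0 / 0.2 and E[reward] = 0.4 * x0 / 0.2 = 2 * x0.
print(estimate_R(3))  # close to 6
```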


LUPFs and LLPFs

We focus on supval; the approach for infval is similar.

Linear Upper Potential Function (LUPF): Let X be the set of program variables. A function h : R^X → R is an LUPF if it satisfies the following conditions:

  1. h is linear,
  2. the value of h at terminating valuations is bounded between two fixed constants K and K′,
  3. for every Qi and every valuation v that satisfies the loop guard: h(v) ≥ Eu(h(Qi(v, u))) + Eu(R(u, Qi)),
  4. there is a fixed constant M such that, at each step of the program, the value of h changes by at most M.

Linear Lower Potential Function (LLPF): An LLPF h is a function that satisfies the above conditions, except that condition 3 is changed to: for every v that satisfies the loop guard, there exists a Qi such that h(v) ≤ Eu(h(Qi(v, u))) + Eu(R(u, Qi)).
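For a fixed candidate h, conditions like 3 are plain linear inequalities that can be checked mechanically. As an illustration (ours, not from the slides), the following verifies condition 3 for the candidate h(x) = 2x on the Gambler’s Ruin example over a sample of valuations, using exact rational arithmetic to avoid floating-point noise at the equality boundary.

```python
from fractions import Fraction as F

def h(x):                  # candidate linear potential function h(x) = 2x
    return 2 * x

def post_expectation(x, p):
    # One rule of Gambler's Ruin: with probability p, x := x + 1 and
    # reward 1; otherwise x := x - 1 with reward 0.
    return p * (h(x + 1) + 1) + (1 - p) * h(x - 1)

# Condition 3: h(v) >= E[h(Q_i(v))] + E[reward] for every rule Q_i and
# every valuation satisfying the loop guard x >= 1 (sampled here).
for x in range(1, 200):
    for p in (F(2, 5), F(3, 10)):
        assert h(x) >= post_expectation(x, p)
print("condition 3 holds for h(x) = 2x on the sampled valuations")
```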


Theorem. If h is an LUPF, then supval(v) ≤ h(v) − K for all valuations v ∈ R^X that satisfy the loop guard.

Theorem. If h is an LLPF, then supval(v) ≥ h(v) − K′ for all valuations v ∈ R^X that satisfy the loop guard.

Sketch of Proof. Construct a stochastic process based on h, show that it forms a supermartingale, and then apply the optional stopping theorem (OST).


Synthesizing LUPFs

  while x ≥ 1 do
    if (0.4) { x := x + 1; reward = 1 } else { x := x − 1 }
    □
    if (0.3) { x := x + 1; reward = 1 } else { x := x − 1 }
  od

Let h : R → R be an LUPF for this example. We have:

  (1) ∃ λ1, λ2 ∈ R ∀ x ∈ R: h(x) = λ1·x + λ2
  (2) ∃ K, K′ ∈ R ∀ x ∈ [1, 2): K ≤ h(x) ≤ K′
  (3) ∀ x ∈ [1, ∞): h(x) ≥ 0.4 · (1 + h(x+1)) + 0.6 · h(x−1)
      and h(x) ≥ 0.3 · (1 + h(x+1)) + 0.7 · h(x−1)
  (4) ∃ M ∈ [0, ∞) ∀ x ∈ [1, ∞): |h(x) − h(x−1)| ≤ M and |h(x) − h(x+1)| ≤ M
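As a sanity check on this reduction (our sketch, not the slides’ tool): substituting h(x) = λ1·x + λ2 into the two instances of condition (3) makes the x-dependent and λ2 terms cancel, leaving one lower bound on λ1 per rule; the minimum feasible λ1 is their maximum.

```python
from fractions import Fraction as F

def lambda1_lower_bound(p):
    # Condition (3) for a rule that, with probability p, does x := x + 1
    # with reward 1 and otherwise x := x - 1:
    #   l1*x + l2 >= p*(1 + l1*(x+1) + l2) + (1-p)*(l1*(x-1) + l2)
    # The l1*x and l2 terms cancel, leaving 0 >= p + l1*(2p - 1),
    # i.e. l1 >= p / (1 - 2p) when p < 1/2.
    return p / (1 - 2 * p)

lam1 = max(lambda1_lower_bound(F(2, 5)), lambda1_lower_bound(F(3, 10)))
print(lam1)  # -> 2, matching the LP solution lambda_1 = 2
```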


By applying Farkas’ Lemma and solving the resulting LP with the goal of minimizing λ1, we get λ1 = M = 2, λ2 = K = 0, and K′ = 4. Therefore, we have supval(x0) ≤ 2x0 for all initial valuations x0 that satisfy the loop guard. Hence, in this case, the problem was solved by a reduction to linear programming.

Synthesizing LLPFs

  while x ≥ 1 do
    if (0.4) { x := x + 1; reward = 1 } else { x := x − 1 }
    □
    if (0.3) { x := x + 1; reward = 1 } else { x := x − 1 }
  od

This case is a bit more complicated. If h is an LLPF, we must have exactly the same conditions as before, except that condition 3 changes to:

  (3′) ∀ x ∈ [1, ∞): h(x) ≤ 0.4 · (1 + h(x+1)) + 0.6 · h(x−1)
       or h(x) ≤ 0.3 · (1 + h(x+1)) + 0.7 · h(x−1),

which is equivalent to λ1 ≤ 2, and hence we get supval(x0) ≥ 2x0. So, our previous bound is tight.
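The tightness claim can likewise be checked mechanically: for h(x) = 2x, the first rule satisfies its inequality in (3′) with equality, so the disjunction holds at every valuation. A small sketch (ours, not from the slides):

```python
from fractions import Fraction as F

def h(x):                  # candidate h(x) = 2x, i.e. lambda_1 = 2, lambda_2 = 0
    return 2 * x

def post(x, p):
    # E[h(Q(x))] + E[reward] for a rule with up-probability p.
    return p * (h(x + 1) + 1) + (1 - p) * h(x - 1)

# Condition (3'): for every x >= 1, SOME rule Q_i satisfies h(x) <= E[...].
for x in range(1, 200):
    assert h(x) <= post(x, F(2, 5)) or h(x) <= post(x, F(3, 10))
print("(3') holds for h(x) = 2x, so 2x is also a lower bound")
```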


Theorem (Motzkin). Let A ∈ R^{m×n}, B ∈ R^{k×n}, b ∈ R^m and c ∈ R^k. Assume that {x ∈ R^n | Ax ≤ b} ≠ ∅. Then {x ∈ R^n | Ax ≤ b} ∩ {x ∈ R^n | Bx < c} = ∅ iff there exist y ∈ R^m and z ∈ R^k such that y, z ≥ 0, 1^T·z > 0, A^T·y + B^T·z = 0, and b^T·y + c^T·z ≤ 0.

By applying Motzkin’s theorem, we can reduce the conditions to a Quadratic Programming problem.


Complexity

Using Farkas’ Lemma, we reduced the problem of finding the best LUPF to linear programming. Using Motzkin’s Theorem, we reduced the problem of finding the best LLPF to quadratic programming. Both reductions are polynomial in the size of the succinct MDP (i.e., the length of the code). Note that the MDPs might have infinitely many states (e.g., if the variables can hold integers) or even uncountably many states (e.g., if the variables hold real values).


Experimental Results

  MDP                   Parameters           Upper bound      Lower bound   Time
  Gambler’s Ruin        p1 = 0.4, p2 = 0.3   2x               2x            153 ms
  2D Robot Planning     p = 0.4              5x − 5y          5x − 5y       251 ms
  Multi-robot Planning  p1 = 0.4, p2 = 0.4   2.5x − 2.5y + 5  2.5x − 2.5y   758 ms
  Mini-roulette         —                    11x              11x           320 ms
  American Roulette     —                    24x              24x           425 ms