

SLIDE 1

Multigrid methods for zero-sum two player stochastic games with mean reward

Sylvie Detournay and Marianne Akian

INRIA Saclay and CMAP, École Polytechnique (France)

15th Copper Mountain Conference on Multigrid Methods

27 March - 1 April, 2011

Sylvie Detournay (INRIA and CMAP) MG for zero-sum stochastic games Copper 2011 1 / 22

SLIDE 2

DP for zero-sum stochastic games with mean reward

Dynamic programming equation of zero-sum two-player stochastic games with mean reward

ρ + v(x) = max_{α ∈ A(x)} min_{β ∈ B(x,α)} { Σ_{y ∈ X} P(y|x, α, β) v(y) + r(x, α, β) },  ∀x ∈ X  (DP)

where:
- X is the state space
- ρ is the mean reward of the game (a nonlinear eigenvalue)
- v(x) is the bias, or relative value, of the game starting at x ∈ X
- α, β are the actions of the first player MAX and the second player MIN
- r(x, α, β) is the reward paid by MIN to MAX
- P(y|x, α, β) is the transition probability from x to y given the actions α, β
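As a concrete illustration, the (DP) operator can be evaluated by direct tensor arithmetic. The sketch below uses NumPy on a hypothetical 2-state game with two actions per player; the arrays P and r are invented for illustration, not taken from the slides.

```python
import numpy as np

# Sketch of the dynamic programming (Shapley) operator
#   F(v; x) = max_a min_b [ sum_y P(y|x,a,b) v(y) + r(x,a,b) ]
# on a hypothetical 2-state game with two actions per player.
P = np.zeros((2, 2, 2, 2))            # P[x, a, b, y] = transition probability
P[0] = [[[0.5, 0.5], [1.0, 0.0]],
        [[0.0, 1.0], [0.5, 0.5]]]
P[1] = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.5, 0.5], [1.0, 0.0]]]
r = np.array([[[1.0, -1.0], [2.0, 0.0]],   # r[x, a, b] = payment from MIN to MAX
              [[0.0, 3.0], [-2.0, 1.0]]])

def F(v):
    """Expectation over y, then MIN over b, then MAX over a."""
    Q = P @ v + r                      # Q[x, a, b] = sum_y P[x,a,b,y] v(y) + r[x,a,b]
    return Q.min(axis=2).max(axis=1)

print(F(np.zeros(2)))   # -> [0. 0.]
```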


SLIDE 3

DP for zero-sum stochastic games with mean reward

Value of the game with mean reward starting at x ∈ X

ρ(x) = sup_{(α_k)_{k≥0}} inf_{(β_k)_{k≥0}} lim sup_{N→∞} (1/N) E[ Σ_{k=0}^{N} r(X_k, α_k, β_k) ]

where α_k = α_k(X_k, α_{k−1}, β_{k−1}, ...) and β_k = β_k(X_k, α_k, α_{k−1}, β_{k−1}, ...) are strategies, and the state process (X_k) satisfies

P(X_{k+1} = y | X_k = x, α_k = α, β_k = β) = P(y|x, α, β).
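The mean reward defined above can be estimated by simulating one long trajectory once feedback strategies are fixed. A minimal Monte-Carlo sketch on an invented 2-state game where both players have a single action (so the feedback strategies are trivial); all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state game; both players have a single action 0.
P = {(0, 0, 0): [0.9, 0.1],          # P[(x, a, b)] = distribution of X_{k+1}
     (1, 0, 0): [0.2, 0.8]}
r = {(0, 0, 0): 1.0, (1, 0, 0): 3.0}
alpha = lambda x: 0                   # MAX feedback strategy (trivial here)
beta = lambda x, a: 0                 # MIN feedback strategy (trivial here)

def mean_reward(x0, N=50_000):
    """Estimate (1/N) E[ sum_{k<N} r(X_k, a_k, b_k) ] along one trajectory."""
    x, total = x0, 0.0
    for _ in range(N):
        a = alpha(x)
        b = beta(x, a)
        total += r[(x, a, b)]
        x = int(rng.choice(2, p=P[(x, a, b)]))
    return total / N

print(mean_reward(0))   # close to the stationary average (2/3)*1 + (1/3)*3 = 5/3
```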


SLIDE 4

A deterministic zero-sum game

Deterministic zero-sum two-player game

The circles (resp. squares) represent the nodes at which Max (resp. Min) can play.

[Figure: game graph with Max nodes 1, 2, 3 (circles) and Min nodes 1’, 2’, 3’, 4’ (squares); edge weights from the set {2, −1, 5, −2, −3, 7, 1, 6, 11, 9, −5}.]

Values in the (DP) equation:
- X = {Max nodes}
- A(x) = {Min nodes accessible from x}
- B(x, α) = {Max nodes accessible from α}
- r(x, α, β) = weight(x, α) + weight(α, β)
- y = β


SLIDE 5

A deterministic zero-sum game

[Figure: game graph with Max nodes 1, 2, 3 (circles) and Min nodes 1’, 2’, 3’, 4’ (squares); edge weights from the set {2, −1, 5, −2, −3, 7, 1, 6, 11, 9, −5}.]

If Max initially moves to 2′




SLIDE 8

A deterministic zero-sum game

[Figure: game graph with Max nodes 1, 2, 3 (circles) and Min nodes 1’, 2’, 3’, 4’ (squares); edge weights from the set {2, −1, 5, −2, −3, 7, 1, 6, 11, 9, −5}.]

If Max initially moves to 2′, he eventually loses 5 per turn.


SLIDE 9

A deterministic zero-sum game

[Figure: game graph with Max nodes 1, 2, 3 (circles) and Min nodes 1’, 2’, 3’, 4’ (squares); edge weights from the set {2, −1, 5, −2, −3, 7, 1, 6, 11, 9, −5}.]

But if Max initially moves to 1′




SLIDE 12

A deterministic zero-sum game

[Figure: game graph with Max nodes 1, 2, 3 (circles) and Min nodes 1’, 2’, 3’, 4’ (squares); edge weights from the set {2, −1, 5, −2, −3, 7, 1, 6, 11, 9, −5}.]

But if Max initially moves to 1′, he eventually loses only (1 + 0 + 2 + 3)/2 = 3 per turn.


SLIDE 13

DP for zero-sum stochastic games

Optimal strategies and dynamic programming

ρ(x) = sup_{(α_k)_{k≥0}} inf_{(β_k)_{k≥0}} lim sup_{N→∞} (1/N) E[ Σ_{k=0}^{N} r(X_k, α_k, β_k) ],  x ∈ X

For feedback strategies α_k = ᾱ(X_k), β_k = β̄(X_k, ᾱ(X_k)), define the matrix P^{ᾱ,β̄} by

P^{ᾱ,β̄}_{xy} := P(y | x, ᾱ(x), β̄(x, ᾱ(x))).

If P^{ᾱ,β̄} is irreducible for all ᾱ and β̄, then ρ(x) ≡ ρ is the unique solution of

ρ + v(x) = max_{α ∈ A(x)} min_{β ∈ B(x,α)} { Σ_{y ∈ X} P(y|x, α, β) v(y) + r(x, α, β) },  x ∈ X,  (DP)

and the ᾱ, β̄ given by the (DP) equation are optimal feedback strategies for both players.


SLIDE 14

DP for zero-sum stochastic games

Dynamic programming equation of zero-sum two-player stochastic differential games

Isaacs PDE (diffusion problems):

−ρ + H(x, ∂v/∂x_i, ∂²v/∂x_i∂x_j) = 0,  x ∈ X  (I)

where

H(x, p, K) = max_{α ∈ A(x)} min_{β ∈ B(x,α)} [ p · f(x, α, β) + (1/2) tr(σ(x, α, β) σ^T(x, α, β) K) + r(x, α, β) ].

Discretization of (I) with monotone schemes yields (DP).


SLIDE 15

DP for zero-sum stochastic games

Motivation

- Solve dynamic programming equations arising from the discretization of Isaacs equations: for example, long-term diffusion problems, risk-sensitive problems (finance), singular perturbations of Isaacs equations, ...
- Solve large-scale zero-sum stochastic games (with discrete state space): for example, problems arising from the web, problems in the verification of programs in computer science, ...
- Extend this equation to the general case, that is, without the irreducibility assumption.

→ Use the policy iteration algorithm combined with multigrid methods to solve the dynamic programming equation.


SLIDE 16

DP for zero-sum stochastic games

Dynamic programming for multichain games

In general, the value of the game is a solution of the dynamic programming equation

ρ(x) (t + 1) + v(x) = F(ρ t + v; x),  x ∈ X,  t large enough,

where F is the dynamic programming operator

F(v; x) := max_{α ∈ A(x)} min_{β ∈ B(x,α)} { Σ_{y ∈ X} P(y|x, α, β) v(y) + r(x, α, β) }.

({ρ t + v, t large} is an invariant half-line.)
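When the game is unichain and aperiodic, the slope ρ of this invariant half-line can be recovered by relative value iteration: iterate v ← F(v) and subtract a normalizing offset, which converges to ρ. A sketch on an invented one-player (MIN only) example; all numbers are illustrative.

```python
import numpy as np

# Relative value iteration: iterate v <- F(v) - F(v)[0]; the subtracted
# offset converges to the mean reward rho for a unichain aperiodic game.
# Invented one-player data: state 0 has two MIN actions, state 1 has one.
P0 = np.array([[0.9, 0.1], [0.5, 0.5]])   # rows: MIN actions at state 0
r0 = np.array([1.0, 2.0])
P1 = np.array([[0.2, 0.8]])
r1 = np.array([3.0])

def F(v):
    q0 = (r0 + P0 @ v).min()   # MIN minimizes at state 0
    q1 = (r1 + P1 @ v).min()
    return np.array([q0, q1])

v = np.zeros(2)
rho = 0.0
for _ in range(300):
    w = F(v)
    rho, v = w[0], w - w[0]    # normalize so that v[0] = 0
print(rho)   # -> approximately 5/3
```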


SLIDE 17

DP for zero-sum stochastic games

This is equivalent to solving, for x ∈ X, the system

ρ(x) = max_{α ∈ A(x)} min_{β ∈ B(x,α)} Σ_{y ∈ X} P(y|x, α, β) ρ(y)

ρ(x) + v(x) = max_{α ∈ Aρ(x)} min_{β ∈ Bρ(x,α)} { Σ_{y ∈ X} P(y|x, α, β) v(y) + r(x, α, β) }

with Aρ(x) := argmax_{α ∈ A(x)} min_{β ∈ B(x,α)} Σ_{y ∈ X} P(y|x, α, β) ρ(y) and Bρ(x, α) := argmin_{β ∈ B(x,α)} Σ_{y ∈ X} P(y|x, α, β) ρ(y).

For a one-player game:

ρ(x) = min_{β ∈ B(x)} Σ_{y ∈ X} P(y|x, β) ρ(y)

ρ(x) + v(x) = min_{β ∈ Bρ(x)} { Σ_{y ∈ X} P(y|x, β) v(y) + r(x, β) }

with Bρ(x) = argmin_{β ∈ B(x)} Σ_{y ∈ X} P(y|x, β) ρ(y).


SLIDE 18

Policy iteration (PI) algorithm

Multichain Policy Iteration Algorithm for one player (Denardo, Fox, 67)

Start with β̄⁰ : x ↦ β̄⁰(x).

1. Calculate the value and bias (ρ^{k+1}, v^{k+1}) for the policy β̄^k, solution of

ρ^{k+1} = P^{β̄^k} ρ^{k+1}  and  ρ^{k+1} + v^{k+1} = P^{β̄^k} v^{k+1} + r^{β̄^k}

2. Improve the policy: find β̄^{k+1} optimal for (ρ^{k+1}, v^{k+1}):

β̄^{k+1}(x) ∈ argmin_{β ∈ B_{ρ^{k+1}}(x)} { Σ_{y ∈ X} P(y|x, β) v^{k+1}(y) + r(x, β) },  x ∈ X,

with Bρ(x) = argmin_{β ∈ B(x)} Σ_{y ∈ X} P(y|x, β) ρ(y).
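Step 1 reduces to plain linear algebra once the policy is fixed and, in the simplest case, its chain is irreducible: ρ^{k+1} is then constant and equals π^T r for the stationary distribution π, and the bias solves a nonsingular reduced system. A sketch under that irreducibility assumption; the matrix P and reward r are invented.

```python
import numpy as np

# Policy evaluation for a fixed irreducible policy:
#   rho = pi^T r  (pi the stationary distribution),
#   (I - P) v = r - rho  with the normalization v[0] = 0.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 3.0])
n = len(r)

# stationary distribution: pi^T P = pi^T, sum(pi) = 1
A = (np.eye(n) - P).T
A[-1, :] = 1.0                          # replace last equation by normalization
pi = np.linalg.solve(A, np.eye(n)[-1])  # rhs = (0, ..., 0, 1)
rho = pi @ r

# bias: drop row/column 0 of (I - P), since v[0] = 0
B = (np.eye(n) - P)[1:, 1:]
v = np.zeros(n)
v[1:] = np.linalg.solve(B, (r - rho)[1:])
print(rho, v)   # -> approximately 1.667 and [0. 6.667]
```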


SLIDE 19

Policy iteration (PI) algorithm

It is easy to show that ρ^{k+1} ≤ ρ^k; if ρ^{k+1} = ρ^k, the iteration is degenerate: v^{k+1} is defined only up to Ker(I − P^{β̄^k}), whose dimension equals the number of ergodic classes of P^{β̄^k} (≥ 1).

→ PI may cycle when there are multiple ergodic classes. To avoid this:
- optimal strategies are improved in a conservative way (β̄^{k+1}(x) = β̄^k(x) if it is still optimal);
- v^{k+1} is fixed at one point of each ergodic class of P^{β̄^k}.

⇒ when ρ^{k+1} = ρ^k, v^{k+1}(x) = v^k(x) on each ergodic class of P^{β̄^k}
⇒ (ρ^k, v^k)_{k≥1} is non-increasing in the lexicographic order: ρ^{k+1} ≤ ρ^k, and if ρ^{k+1} = ρ^k then v^{k+1} ≤ v^k
⇒ PI stops after finitely many iterations when the action sets are finite.

Remark: PI ≈ Newton's algorithm in the case of a unique solution v.


SLIDE 20

Policy iteration (PI) algorithm

Policy Iteration for unichain games (Hoffman and Karp, 66) and multichain games (Cochet-Terrasson, Gaubert, 06)

Start with ᾱ⁰.

1. Calculate the value and bias (ρ^{n+1}, v^{n+1}) for the policy ᾱ^n, solution of

ρ^{n+1}(x) = min_{β ∈ B(x, ᾱ^n)} Σ_{y ∈ X} P(y|x, ᾱ^n, β) ρ^{n+1}(y)

ρ^{n+1}(x) + v^{n+1}(x) = min_{β ∈ B_{ρ^{n+1}}(x, ᾱ^n)} { Σ_{y ∈ X} P(y|x, ᾱ^n, β) v^{n+1}(y) + r(x, ᾱ^n, β) }

for x ∈ X, using the Denardo and Fox algorithm.

2. Improve the policy: find ᾱ^{n+1} optimal for (ρ^{n+1}, v^{n+1}).

[Diagram: nested loops. External PI over MAX policies ᾱ^n; internal PI over MIN policies β^{n,k}; multigrid (MG) solves for (ρ^{n,k}, v^{n,k,l}) at the innermost level.]
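Assuming the unichain case for simplicity (so no lexicographic bookkeeping is needed), the nested scheme can be sketched as an outer improvement loop for MAX around an inner one-player solver for MIN. The inner solver here is relative value iteration rather than the Denardo and Fox algorithm, and the 2-state game data is invented.

```python
import numpy as np

# Invented 2-state game, 2 actions per player; every row of P[x, a, b, :]
# is bounded away from 0, so each policy pair gives an irreducible
# aperiodic chain (unichain case).
P = np.array([
    [[[0.8, 0.2], [0.3, 0.7]],
     [[0.5, 0.5], [0.6, 0.4]]],
    [[[0.4, 0.6], [0.7, 0.3]],
     [[0.2, 0.8], [0.5, 0.5]]],
])
r = np.array([
    [[2.0, 0.0], [1.0, 3.0]],
    [[-1.0, 2.0], [0.5, 1.0]],
])

def solve_min_game(alpha, iters=500):
    """Relative value iteration for the one-player MIN game where MAX plays alpha."""
    v, rho = np.zeros(2), 0.0
    for _ in range(iters):
        w = np.array([(r[x, alpha[x]] + P[x, alpha[x]] @ v).min() for x in range(2)])
        rho, v = w[0], w - w[0]          # normalize so that v[0] = 0
    return rho, v

alpha = np.array([0, 0])                  # initial MAX policy
for _ in range(20):                       # external policy iteration (MAX)
    rho, v = solve_min_game(alpha)        # internal one-player solve (MIN)
    Q = (P @ v + r).min(axis=2)           # Q[x, a] = min_b [ E[v] + r ]
    new_alpha = Q.argmax(axis=1)          # greedy improvement for MAX
    if np.array_equal(new_alpha, alpha):
        break
    alpha = new_alpha

print(rho, alpha)   # -> approximately 0.643 (= 9/14), alpha = [1 1]
```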


SLIDE 21

Policy iteration (PI) algorithm

Same difficulties as in Denardo and Fox arise in step 1 (calculating ρ^{n+1}, v^{n+1}): if the set of solutions v^{n+1} has dimension > 1, PI may cycle. To avoid this:
- optimal strategies are improved in a conservative way;
- if ρ^{n+1} = ρ^n, step 1 amounts to solving v^{n+1} = G(v^{n+1}), where G is the dynamic programming operator of a one-player game and v^n ≤ G(v^n); we choose v^{n+1} such that v^{n+1}(x) = v^n(x) for x ∈ C, where C is the set of critical nodes of G as defined in (Akian, Gaubert 2003).

⇒ (ρ^n, v^n)_{n≥1} is non-decreasing in the lexicographic order
⇒ PI stops after finitely many iterations when the action sets are finite.


SLIDE 22

Policy Iteration Algorithm and AMG - (AMGπ)

AMG for a linear system Ax = b

Setup phase:
- construct "grids" based on the entries of the matrix A
- define the interpolation I, with (I)_{ij} ≈ a_{ij} / (some factor), and the restriction R = I^T

Solving phase (two grids):
- v ← apply ν₁ relaxations on the fine level to v
- v ← v + I w, where w is the solution of R A I w = R(b − A v) on the coarse grid
- v ← apply ν₂ relaxations on the fine level to v

Applied recursively, this yields V-cycles, W-cycles, ...
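The two-grid solving phase above can be sketched on a 1D Poisson model problem. For simplicity the interpolation below is the geometric piecewise-linear one rather than an operator built from the entries of A as in Ruge-Stüben AMG; the solving phase itself is the same.

```python
import numpy as np

# Two-grid cycle for A x = b on a 1D Poisson model problem (n interior points).
n = 31
h = 1.0 / (n + 1)
A = (2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
b = np.ones(n)

nc = (n - 1) // 2                 # coarse unknowns sit at fine points 1, 3, ...
I = np.zeros((n, nc))             # piecewise-linear interpolation
for j in range(nc):
    I[2*j, j] += 0.5              # left fine neighbour
    I[2*j + 1, j] = 1.0           # coarse point itself
    I[2*j + 2, j] += 0.5          # right fine neighbour
R = I.T                           # restriction = transpose of interpolation
Ac = R @ A @ I                    # Galerkin coarse-grid operator R A I

def relax(v, nu):
    """nu sweeps of weighted Jacobi smoothing."""
    D = np.diag(A)
    for _ in range(nu):
        v = v + (2.0 / 3.0) * (b - A @ v) / D
    return v

v = np.zeros(n)
for _ in range(20):               # repeated two-grid cycles
    v = relax(v, nu=1)            # nu1 pre-smoothing relaxations
    w = np.linalg.solve(Ac, R @ (b - A @ v))
    v = v + I @ w                 # coarse-grid correction
    v = relax(v, nu=1)            # nu2 post-smoothing relaxations

print(np.linalg.norm(b - A @ v))  # residual norm, tiny after a few cycles
```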


SLIDE 23

Policy Iteration Algorithm and AMG - (AMGπ)

AMGπ for two player games with mean reward

Recall that at each iteration of PI we need to solve

ρ = P ρ  and  ρ + v = P v + r,  ρ, v, r ∈ R^X,  P ∈ R^{X×X},

with v fixed at one point of each ergodic class. Assume the graph of P has only one ergodic class E and one transient class t:

P = ( P_t  P_{tE} )
    (  0    P_E  )

Then we have to solve:

1. ρ_E + v_E = P_E v_E + r_E, with ρ_E(x) ≡ ρ_E for x ∈ E, where P_E is an irreducible markovian matrix (row sums = 1);

2. ρ_t = P_t ρ_t + P_{tE} ρ_E and ρ_t + v_t = P_t v_t + P_{tE} v_E + r_t, where P_t is an irreducible strictly submarkovian matrix (at least one row sum < 1) → AMG.
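Once (ρ_E, v_E) are known on the ergodic class, step 2 is just two nonsingular linear solves with the matrix I − P_t. A small sketch with invented 2×2 blocks, chosen so that the full matrix P is stochastic:

```python
import numpy as np

# Transient part, given (rho_E, v_E) on the ergodic class:
#   (I - Pt) rho_t = PtE rho_E
#   (I - Pt) v_t   = PtE v_E + r_t - rho_t
Pt  = np.array([[0.3, 0.2],
                [0.1, 0.4]])          # strictly submarkovian transient block
PtE = np.array([[0.5, 0.0],
                [0.2, 0.3]])          # coupling to the ergodic class
rt  = np.array([1.0, 2.0])
rho_E = 1.5                            # assumed already computed on E
v_E = np.array([0.0, 0.4])             # assumed already computed on E

n = len(rt)
M = np.eye(n) - Pt                     # nonsingular since Pt is strictly submarkovian
rho_t = np.linalg.solve(M, PtE @ (rho_E * np.ones(2)))
v_t = np.linalg.solve(M, PtE @ v_E + rt - rho_t)
print(rho_t, v_t)   # -> [1.5 1.5] and [-0.44 0.96]
```

With a single ergodic class, ρ_t is constant and equals ρ_E everywhere, as the output confirms.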


SLIDE 24

Policy Iteration Algorithm and AMG - (AMGπ)

Several ways to do step 1, using the stationary probability of an irreducible Markov chain, π_E^T P_E = π_E^T:

- compute π_E by a direct solver or by AMG for Markov chains, then ρ_E = π_E^T r_E
- solve ρ_E + v_E = P_E v_E + r_E with (v_E)_{min index} = 0 → direct solver or adapted AMG
- solve v_E = P_E v_E + r_E − ρ_E → adapted AMG, with an improvement of ρ at each iteration
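The first option can be sketched with plain power iteration standing in for the direct solver or the Markov-chain AMG cycle (which accelerates exactly this computation); the chain P_E and reward r_E below are invented.

```python
import numpy as np

# Stationary probability of an irreducible chain by power iteration,
# then rho_E = pi^T r_E.
P_E = np.array([[0.9, 0.1],
                [0.2, 0.8]])
r_E = np.array([1.0, 3.0])

pi = np.full(2, 0.5)           # any initial probability vector
for _ in range(200):
    pi = pi @ P_E              # pi^T <- pi^T P_E; stays a probability vector
rho_E = pi @ r_E
print(pi, rho_E)   # -> approximately [2/3, 1/3] and 5/3
```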


SLIDE 25

Numerical tests for AMGπ

Example on an Isaacs equation (pursuit game)

Solve the stationary Isaacs equation

−ρ + ε Δv + max_{α ∈ A} (α · ∇v) + min_{β ∈ B} (β · ∇v) + ‖x‖²/2 = 0  on (−1/2, 1/2)²

with ε = 0.5, Neumann boundary conditions, and

A := {(a₁, a₂) | aᵢ = ±1 or 0},  B := {(0, 0), (1, 2), (2, 1)}.

For a 129 × 129 grid: ρ = 0.194.

[Figure: the bias v on the 129 × 129 grid, with values roughly between −0.015 and 0.02.]


SLIDE 26

Numerical tests for AMGπ

Optimal strategies

[Figures: the optimal strategies α (left) and β (right) over the domain.]


SLIDE 27

Numerical tests for AMGπ

Numerical results

LU solver (SuperLU library)

257×257 points grid:
n  k  r∞         time
1  4  4.54e−08   24s
2  3  5.87e−09   43s
3  1  6.97e−11   50s

513×513 points grid:
n  k  r∞         time
1  4  2.27e−08   154s
2  2  3.27e−09   231s
3  1  4.78e−11   269s

Adapted AMG (Ruge and Stuben algorithm)

257×257 points grid:
n  k  r∞         time
1  4  4.54e−08   22s
2  3  5.87e−09   41s
3  1  6.97e−11   47s

513×513 points grid:
n  k  r∞         time
1  4  2.27e−08   112s
2  2  3.27e−09   169s
3  1  4.78e−11   198s

Using V(1,1)-cycles (symmetric Gauss-Seidel smoother); number of V-cycles ≈ 7.

n = current iteration for MAX, k = number of iterations for MIN.


SLIDE 28

Numerical tests for AMGπ

Conclusion

We have proposed an algorithm for multichain games combining multigrid methods with policy iteration.

Can we adapt multigrid methods to solve more general discrete games?

Can we try new multigrid methods for Markov chains? (Horton (94), De Sterck et al. (08))

Thank you!
