Multigrid methods for two player zero-sum stochastic games

SLIDE 1

Discounted Stochastic Games Stochastic Games with mean payoff

Multigrid methods for two player zero-sum stochastic games

Sylvie Detournay

INRIA Saclay and CMAP, École Polytechnique

PhD thesis defense

September 25, 2012

Sylvie Detournay (INRIA and CMAP) Zero-sum two player stochastic games 25 septembre, 2012 1 / 53

SLIDE 2

Outline

Zero-sum two player stochastic game with discounted payoff

    Dynamic programming equations
    Policy iteration and multigrid: AMGπ
    Numerical results

Zero-sum two player stochastic game with mean payoff

    Unichain case
        Dynamic programming equations
        Policy iteration and multigrid: AMGπ
        Numerical results

    Multichain case
        Dynamic programming equations
        Policy iteration for multichain
        Numerical results

Conclusions

SLIDE 3

Dynamic programming equation of zero-sum two-player stochastic games

\[
v(x) = \max_{a \in A(x)} \min_{b \in B(x,a)} \Big[ \sum_{y \in X} \gamma P(y|x,a,b)\, v(y) + r(x,a,b) \Big] \quad \forall x \in X \qquad \text{(DP)}
\]

X: state space
v(x): the value of the game starting at x ∈ X
a, b: actions of the 1st and 2nd players, MAX and MIN
r(x, a, b): reward paid by MIN to MAX
P(y|x, a, b): transition probability from x to y given the actions a, b
γ < 1: discount factor
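As a rough illustration of the (DP) operator, the sketch below applies it to a small random game and iterates it to a fixed point. It assumes that every state has the same number of actions, so that P and r fit in dense arrays `P[x, a, b, y]` and `r[x, a, b]`; the function names and the random data are hypothetical and not taken from the talk.

```python
import numpy as np

def shapley_operator(v, P, r, gamma):
    """F(v)(x) = max_a min_b [ gamma * sum_y P[x,a,b,y] * v[y] + r[x,a,b] ]."""
    q = gamma * (P @ v) + r            # q[x, a, b]
    return q.min(axis=2).max(axis=1)   # min over b, then max over a

def value_iteration(P, r, gamma, tol=1e-10, max_iter=10_000):
    """Iterate v <- F(v) until successive iterates differ by less than tol."""
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        v_new = shapley_operator(v, P, r, gamma)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Hypothetical random game: 4 states, 2 actions for MAX, 3 actions for MIN.
rng = np.random.default_rng(0)
P = rng.random((4, 2, 3, 4))
P /= P.sum(axis=-1, keepdims=True)     # each P[x, a, b, :] is a probability vector
r = rng.random((4, 2, 3))
gamma = 0.9
v = value_iteration(P, r, gamma)
```

Since γ < 1 makes the operator a contraction in the sup norm, this fixed-point iteration converges, but only at the linear rate γ, which is one reason to look at policy iteration instead.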

SLIDE 4

Value of the game starting in x

\[
v(x) = \max_{(a_k)_{k \ge 0}} \min_{(b_k)_{k \ge 0}} E\Big[ \sum_{k=0}^{\infty} \gamma^k r(X_k, a_k, b_k) \Big]
\]

where

\[
a_k = a_k(X_k, b_{k-1}, a_{k-1}, X_{k-1}, \dots), \qquad b_k = b_k(X_k, a_k, \dots)
\]

are strategies and the state process $X_k$ satisfies

\[
P(X_{k+1} = y \mid X_k = x, a_k = a, b_k = b) = P(y|x,a,b)
\]

SLIDE 5

Deterministic zero-sum two-player game

[Figure: directed game graph with weighted edges; nodes 1, 2, 3 and 1′, 2′, 3′, 4′]

Circles: MAX plays
Squares: MIN plays
Weights on the edges: payments made by MIN to MAX

SLIDE 9

If MAX initially moves to 2′, he eventually loses 5 per turn.

SLIDE 13

But if MAX initially moves to 1′, he eventually loses only (1 + 0 + 2 + 3)/2 = 3 per turn.

SLIDE 14

Feedback strategies or policy

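\[
v(x) = \max_{(a_k)_{k \ge 0}} \min_{(b_k)_{k \ge 0}} E\Big[ \sum_{k=0}^{\infty} \gamma^k r(X_k, a_k, b_k) \Big]
\]

For $\alpha : x \mapsto \alpha(x) \in A(x)$ and $\beta : (x, a) \mapsto \beta(x, a) \in B(x, a)$, the strategies

\[
a_k = \alpha(X_k), \qquad b_k = \beta(X_k, a_k)
\]

are such that $X_k$ is a Markov chain with transition matrix $P^{\alpha,\beta}$, where

\[
P^{\alpha,\beta}_{xy} := P(y \mid x, \alpha(x), \beta(x, \alpha(x))), \quad x, y \in X.
\]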

SLIDE 15

Dynamic programming operator and optimal policy

\[
v(x) = \max_{a \in A(x)} \min_{b \in B(x,a)} \underbrace{\Big[ \sum_{y \in X} \gamma P(y|x,a,b)\, v(y) + r(x,a,b) \Big]}_{F(v;(x,a),b)} =: F(v; x)
\]

α: policy maximizing the (DP) equation for MAX
β: policy minimizing F(v; (x, a), b) for MIN

The dynamic programming operator F is monotone and additively sub-homogeneous (F(λ + v) ≤ λ + F(v), λ ≥ 0).

Method to solve the (DP) equations: policy iteration algorithm [Howard, 60 (1-player game)], [Denardo, 67 (2-player game)]

SLIDE 16

Dynamic programming equation of zero-sum two-player stochastic differential games

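Isaacs PDE (or Hamilton-Jacobi-Bellman for one player):

\[
-\lambda v(x) + H\Big(x, \frac{\partial v}{\partial x_i}, \frac{\partial^2 v}{\partial x_i \partial x_j}\Big) = 0, \quad x \in X \qquad \text{(I)}
\]

where

\[
H(x, p, K) = \max_{a \in A(x)} \min_{b \in B(x,a)} \Big[ p \cdot f(x,a,b) + \tfrac{1}{2} \operatorname{tr}\big(\sigma(x,a,b)\, \sigma^T(x,a,b)\, K\big) + r(x,a,b) \Big]
\]

Discretization of (I) with monotone schemes yields (DP).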

SLIDE 17

Motivation

Solve dynamic programming equations arising from the discretization of Isaacs equations or other DP equations of diffusions (e.g. variational inequalities)

    applications: pursuit-evasion games, finance, ...

Solve large-scale zero-sum stochastic games (with discrete state space)

    for example, problems arising from the web, problems in verification of programs in computer science, ...

→ Use the policy iteration algorithm, with the linear systems involved solved using AMG

SLIDE 18

Policy Iteration (PI) Algorithm for games

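\[
v(x) = \max_{a \in A(x)} \underbrace{\min_{b \in B(x,a)} \Big[ \sum_{y \in X} \gamma P(y|x,a,b)\, v(y) + r(x,a,b) \Big]}_{F(v;x,a)}
\]

Start with $\alpha_0 : x \mapsto \alpha_0(x) \in A(x)$, apply successively:

1 The value $v^{k+1}$ of policy $\alpha_k$ is the solution of

\[
v^{k+1}(x) = F(v^{k+1}; x, \alpha_k(x)) \quad \forall x \in X.
\]

2 Improve the policy: select $\alpha_{k+1}$ optimal for $v^{k+1}$:

\[
\alpha_{k+1}(x) \in \operatorname*{argmax}_{a \in A(x)} F(v^{k+1}; x, a) \quad \forall x \in X.
\]

Until $\alpha_{k+1}(x) = \alpha_k(x)$ ∀x ∈ X.

Step 1 is solved by PI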

SLIDE 19

Policy Iteration (PI) for 1-player games (Howard, 60)

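Start with $\beta_{k,0}$, apply successively:

1 The value $v^{k,s+1}$ of policy $\beta_{k,s}$ is the solution of

\[
v^{k,s+1} = \gamma P^{\alpha_k,\beta_{k,s}} v^{k,s+1} + r^{\alpha_k,\beta_{k,s}}
\]

where $P^{\alpha,\beta}_{xy} := P(y \mid x, \alpha(x), \beta(x, \alpha(x)))$

2 Improve the policy: find $\beta_{k,s+1}$ optimal for $v^{k,s+1}$

Until $\beta_{k,s+1} = \beta_{k,s}$.

Nested iterations:
    PIext: $\alpha_0, \dots, \alpha_k$
    PIint (for each $\alpha_k$): $\beta_{0,0}, \dots, \beta_{0,s}, \dots$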
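The nested structure (external PI on the policies of MAX, internal PI on the policies of MIN) can be sketched as follows. This is only an illustration under assumptions: dense arrays `P[x, a, b, y]` and `r[x, a, b]` with the same number of actions at every state, a dense direct solver in place of AMG, and a small tolerance `tol` in the improvement tests. The names are hypothetical.

```python
import numpy as np

def policy_iteration(P, r, gamma, tol=1e-12):
    n, nA, nB, _ = P.shape
    x = np.arange(n)
    alpha = np.zeros(n, dtype=int)             # alpha_0(x)
    beta = np.zeros((n, nA), dtype=int)        # beta(x, a), kept as a warm start
    while True:                                # external loop: policies of MAX
        while True:                            # internal loop: policies of MIN
            b = beta[x, alpha]
            P_ab = P[x, alpha, b, :]           # transition matrix P^{alpha,beta}
            r_ab = r[x, alpha, b]
            v = np.linalg.solve(np.eye(n) - gamma * P_ab, r_ab)
            q = gamma * (P @ v) + r            # q[x, a, b]
            q_alpha = q[x, alpha, :]           # MIN's options under alpha
            improve = q_alpha.min(axis=1) < q_alpha[x, b] - tol
            if not improve.any():
                break
            beta[x[improve], alpha[improve]] = q_alpha.argmin(axis=1)[improve]
        F = q.min(axis=2)                      # F(v; x, a)
        improve = F.max(axis=1) > F[x, alpha] + tol
        if not improve.any():
            return v, alpha, beta
        alpha = np.where(improve, F.argmax(axis=1), alpha)

# Hypothetical random game: 4 states, 2 actions for MAX, 3 actions for MIN.
rng = np.random.default_rng(0)
P = rng.random((4, 2, 3, 4))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((4, 2, 3))
gamma = 0.9
v, alpha, beta = policy_iteration(P, r, gamma)
```

On exit, v satisfies the (DP) equation up to the tolerance, and the only linear algebra involved is the solve with the matrix I − γP^{α,β}, which is the step where a multigrid solver would be substituted.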

SLIDE 20

$(v^k)_{k \ge 1}$ is nondecreasing (MAX player)
$(v^{k,s})_{s \ge 1}$ is nonincreasing (MIN player)
PI stops after a finite time when the sets of actions are finite
Internal loop (1-player game): PI ≈ Newton algorithm where differentials are replaced by superdifferentials of the (DP) operator
External loop (2-player game): PI ≈ Newton algorithm where the (DP) operator is approximated from below by piecewise affine and concave maps
→ expect superlinear convergence in good cases

SLIDE 21

MultiGrids for a linear system Av = b

The PDE is discretized on a regular grid with n nodes (= finest grid)
Define a coarse grid with fewer nodes by taking the even nodes
Solving phase (two grids):
    v ← apply $\nu_1$ relaxations on the fine level to v
    v ← v + Iw, where w is the solution of RAIw = R(b − Av) (on the coarse grid)
    v ← apply $\nu_2$ relaxations on the fine level to v

e.g. relaxation, Jacobi: $v \leftarrow D^{-1}(b - (L + U)v)$ with A = D + L + U

When applied recursively → V-cycle, W-cycle, ...
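The two-grid solving phase above can be sketched on a 1D Poisson matrix. The choices here are assumptions made for illustration only: linear interpolation I with coarse nodes at the even fine nodes, restriction R = I^T, an exact coarse solve, and a damped Jacobi relaxation with ω = 2/3 (the undamped Jacobi step written above is a poor smoother for this model problem).

```python
import numpy as np

def poisson_1d(n):
    """Matrix of -u'' on n interior nodes of (0, 1), Dirichlet boundary conditions."""
    h = 1.0 / (n + 1)
    return (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def interpolation(n_coarse):
    """Linear interpolation from n_coarse coarse nodes to 2 * n_coarse + 1 fine nodes."""
    I = np.zeros((2 * n_coarse + 1, n_coarse))
    for j in range(n_coarse):
        i = 2 * j + 1                  # fine index of coarse node j (even node, 1-based)
        I[i, j] = 1.0
        I[i - 1, j] = 0.5
        I[i + 1, j] = 0.5
    return I

def jacobi(A, b, v, sweeps, omega=2.0 / 3.0):
    """Damped Jacobi relaxation: v <- v + omega * D^{-1} (b - A v)."""
    D = np.diag(A)
    for _ in range(sweeps):
        v = v + omega * (b - A @ v) / D
    return v

def two_grid(A, b, v, I, nu1=1, nu2=1):
    R = I.T                                             # restriction R = I^T
    v = jacobi(A, b, v, nu1)                            # pre-relaxation
    w = np.linalg.solve(R @ A @ I, R @ (b - A @ v))     # coarse problem RAIw = R(b - Av)
    v = v + I @ w                                       # coarse-grid correction
    return jacobi(A, b, v, nu2)                         # post-relaxation

n_coarse = 31
I = interpolation(n_coarse)
A = poisson_1d(2 * n_coarse + 1)
b = np.ones(A.shape[0])
v = np.zeros_like(b)
r0 = np.linalg.norm(b - A @ v)
for _ in range(10):
    v = two_grid(A, b, v, I)
r10 = np.linalg.norm(b - A @ v)
```

Replacing the exact coarse solve by a recursive call to the same routine on the coarse matrix RAI gives a V-cycle.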

SLIDE 22

AMG for a linear system Av = b

Setup phase:
    construct "grids" based on the elements of the matrix A
    define interpolation $(I)_{ij} \approx A_{ij} / (\text{some factor})$, restriction $R = I^T$
Solving phase (two grids):
    v ← apply $\nu_1$ relaxations on the fine level to v
    v ← v + Iw, where w is the solution of RAIw = R(b − Av) (on the coarse grid)
    v ← apply $\nu_2$ relaxations on the fine level to v

e.g. relaxation, Jacobi: $v \leftarrow D^{-1}(b - (L + U)v)$ with A = D + L + U

When applied recursively → V-cycle, W-cycle, ...

SLIDE 23

AMGπ

Combine PI for two-player games and AMG: apply AMG to v = γPv + r in the internal loop of PI → AMGπ

Nested iterations:
    PIext: $\alpha_0, \dots, \alpha_k$
    PIint: $\beta_{0,0}, \dots, \beta_{0,s}, \dots$
    AMG: $v^{0,0,0}, \dots, v^{0,0,m}, \dots$

Previous works in stochastic control (one-player games):
    MG + PI in [Hoppe, 86-87], [Akian, 88-90]
    AMG + learning methods [Ziv and Shimkin, 05]
→ two-player games never considered

SLIDE 24

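Example on an Isaacs equation

Dynamic programming equation:

\[
\begin{cases}
\Delta v(x) + \|\nabla v(x)\|_2 - 0.5\, \|\nabla v(x)\|_2^2 + f(x) = 0, & x \in X \\
v(x) = g(x), & x \in \partial X
\end{cases}
\]

where

\[
\|\nabla v(x)\|_2 = \max_{\|a\|_2 \le 1} (a \cdot \nabla v(x)), \qquad
-\frac{\|\nabla v(x)\|_2^2}{2} = \min_{b} \Big( b \cdot \nabla v(x) + \frac{\|b\|_2^2}{2} \Big)
\]

with v(x1, x2) = sin(x1) × sin(x2) on X = [0, 1] × [0, 1]

[Figure: surface plot of v on [0, 1] × [0, 1]]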

SLIDE 25

AMGπ versus PI with LU

[Figure: log-log plot of CPU time (seconds) versus number of discretization nodes, for PI with LU (UMFPACK) and for AMGπ]

For the 100 problems of finest discretization: slope ≈ 1.04 for AMGπ, slope ≈ 1.85 for PI with LU.

About 6 linear systems solved for each problem, with sizes from $5^2$ to $1500^2$.

SLIDE 26

Variational inequalities problem (VI)

Optimal stopping time for the first player:

\[
\begin{cases}
\max\big\{ 0.5\, \Delta v(x) - 0.5\, \|\nabla v(x)\|_2^2 + f(x),\; \phi(x) - v(x) \big\} = 0, & x \in X \\
v(x) = u(x), & x \in \partial X
\end{cases}
\]

MAX chooses between play or stop (♯A(x) = 2) and receives φ when he stops
MIN leads to the term $\|\nabla v\|_2^2$

with φ = 0 and solution v on X = [0, 1] × [0, 1] given by

[Figure: surface plot of the solution v on [0, 1] × [0, 1]]

SLIDE 27

VI with 129 × 129 points grid

[Figure frames: iterates at iterations = 100, 200, 300, 400, 500 and 600]

SLIDE 33

VI with 129 × 129 points grid

Iteration 700, reached in ≈ 8148 seconds: slow convergence

The number of policy iterations is bounded by ♯{possible policies}, which is exponential in ♯X
    [Friedmann, 09]: example of a parity game
    [Fearnley, 10]: for MDPs

Like Newton → improve with a good initial guess? → FMG

SLIDE 34

Full Multilevel AMGπ

Define the problem on each coarse grid $X_l := \{1, \dots, n_l\}$ on level l

[Diagram labels: "Interpolation of strategies and value", "AMG Policy Iterations"]

Interpolation of the value v and of the strategies α, β
Stopping criterion for AMGπ: $\|r\|_{L_2} < c h^2$ with c = 0.1 and $h = 1/n_l$

SLIDE 35

Full Multilevel AMGπ

X = [0, 1] × [0, 1], 1025 nodes in each direction
n_l = number of nodes in each direction (coarse grids)

n_l     MAX policy         Number of MIN        ‖r‖_L2       ‖e‖_L2       CPU time (s)
        iteration index    policy iterations
3       1                  1                    2.17e−1      1.53e−1      << 1
3       2                  1                    1.14e−2      3.30e−2      << 1
5       1                  2                    8.26e−5      1.71e−2      << 1
9       1                  2                    1.06e−3      7.99e−3      << 1
9       2                  1                    5.41e−4      8.15e−3      << 1
9       3                  1                    5.49e−5      8.30e−3      << 1
. . .
513     1                  1                    4.04e−9      1.33e−4      2.62
1025    1                  1                    1.90e−9      6.63e−5      11.7
1025    2                  1                    5.83e−10     6.62e−5      21.1

SLIDE 36

Mean payoff of the game starting at x ∈ X

\[
\eta(x) = \sup_{(a_k)_{k \ge 0}} \inf_{(b_k)_{k \ge 0}} \limsup_{N \to \infty} \frac{1}{N} E\Big[ \sum_{k=0}^{N} r(X_k, a_k, b_k) \Big]
\]

where

\[
a_k = a_k(X_k, b_{k-1}, a_{k-1}, X_{k-1}, \dots), \qquad b_k = b_k(X_k, a_k, \dots)
\]

are strategies and the state process $X_k$ satisfies

\[
P(X_{k+1} = y \mid X_k = x, a_k = a, b_k = b) = P(y|x,a,b)
\]

SLIDE 37

Optimal strategies and dynamic programming

If there exist a constant ρ ∈ R and $v \in \mathbb{R}^n$ such that

\[
\rho + v(x) = \max_{a \in A(x)} \min_{b \in B(x,a)} \Big[ \sum_{y \in X} P(y|x,a,b)\, v(y) + r(x,a,b) \Big], \quad x \in X, \qquad \text{(DP)}
\]

then η(x) = ρ for x ∈ X and v is called the relative value. Moreover, α, β given by the (DP) equations are optimal feedback strategies for both players.

Such ρ and v exist, for instance, when the matrices $P^{\alpha,\beta}$ are irreducible for all α and β.

SLIDE 38

Policy Iteration for games

(Hoffman and Karp, 66)

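\[
\rho + v(x) = \max_{a \in A(x)} \underbrace{\min_{b \in B(x,a)} \Big[ \sum_{y \in X} P(y|x,a,b)\, v(y) + r(x,a,b) \Big]}_{F(v;x,a)}
\]

Start with $\alpha_0 : x \mapsto \alpha_0(x)$

1 Calculate the value and bias $(\rho^{k+1}, v^{k+1})$ for policy $\alpha_k$, solution of

\[
\rho^{k+1} + v^{k+1}(x) = F(v^{k+1}; x, \alpha_k(x)), \quad x \in X
\]

Solved with PI for the one-player game (1PG)

2 Improve the policy: $\alpha_{k+1}$ for $v^{k+1}$

Nested iterations:
    PIext: $\alpha_0, \dots, \alpha_k$
    PIint: $\beta_{0,0}, \dots, \beta_{0,s}, \dots$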

SLIDE 39

At each internal iteration of PI: ρ + v = Pv + r, with P an irreducible Markov matrix (row sums = 1):

    using the stationary probability π of an irreducible Markov chain:
        $\pi^T P = \pi^T$,  $\rho = \pi^T r$,  $v = Pv + r - \rho$
    → direct solver

    or a linear solver iterating on ρ and v alternately:
        $\rho = \nu(Pv + r - v)$,  $v = Pv + r - \rho$,  $\mu v = 0$
    with $\nu, \mu \in \mathbb{R}^n_+$ probability vectors
    → adapted AMG
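The direct route through the stationary probability can be sketched on a dense matrix as follows. The normalization of the bias by π^T v = 0 (that is, taking µ = π) is a choice made here for convenience, and the function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi^T P = pi^T, sum(pi) = 1, for an irreducible stochastic matrix P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])   # balance equations + normalization
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return pi

def mean_payoff_and_bias(P, r):
    """Return (rho, v) with rho + v = P v + r and pi^T v = 0."""
    n = P.shape[0]
    pi = stationary_distribution(P)
    rho = pi @ r
    # I - P is singular; adding the rank-one term 1 pi^T makes the system
    # nonsingular and selects the solution with pi^T v = 0.
    v = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), r - rho)
    return rho, v

P = np.array([[0.5, 0.5], [0.2, 0.8]])
r = np.array([1.0, 0.0])
rho, v = mean_payoff_and_bias(P, r)
```

For this two-state chain the stationary probability is π = (2/7, 5/7), so ρ = 2/7.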

SLIDE 40

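Denote by $\mathbb{R}^{n \times n}_+ := \{A \in \mathbb{R}^{n \times n} \mid a_{ij} \ge 0, \text{ for } 1 \le i, j \le n\}$.

Theorem

Assume that $P \in \mathbb{R}^{n \times n}_+$ is an irreducible stochastic matrix. Let A = I − P and decompose A = M − N such that $M \in \mathbb{R}^{n \times n}_+$ is invertible and $S = M^{-1}N \in \mathbb{R}^{n \times n}_+$. Consider the iterates

\[
v^{k+1} = (I - \mathbf{1}\mu)\big(S v^k + M^{-1}(r - \rho^k \mathbf{1})\big), \qquad \rho^{k+1} = \nu\,(r - A v^{k+1}),
\]

where µ, ν are probability vectors. Then the iterates converge to a solution if $\rho\big((I - \mathbf{1}\nu) N M^{-1}\big) < 1$, where ρ(·) applied to a matrix denotes its spectral radius.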
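A small numerical sketch of these iterates is given below. The splitting M = I, N = P and the uniform probability vectors µ, ν are choices made only for this example, and the helper names are hypothetical; `theorem_quantity` simply evaluates the spectral radius that appears in the convergence condition.

```python
import numpy as np

def splitting_iteration(P, r, M, mu, nu, iters=200):
    """Alternate updates of the bias v and the mean payoff rho for A = I - P = M - N."""
    n = P.shape[0]
    one = np.ones(n)
    A = np.eye(n) - P
    N = M - A
    M_inv = np.linalg.inv(M)
    S = M_inv @ N
    proj = np.eye(n) - np.outer(one, mu)       # I - 1 mu, enforces mu v = 0
    v, rho = np.zeros(n), 0.0
    for _ in range(iters):
        v = proj @ (S @ v + M_inv @ (r - rho * one))
        rho = nu @ (r - A @ v)
    return rho, v

def theorem_quantity(P, M, nu):
    """Spectral radius of (I - 1 nu) N M^{-1}."""
    n = P.shape[0]
    N = M - (np.eye(n) - P)
    T = (np.eye(n) - np.outer(np.ones(n), nu)) @ N @ np.linalg.inv(M)
    return np.max(np.abs(np.linalg.eigvals(T)))

P = np.array([[0.5, 0.5], [0.2, 0.8]])
r = np.array([1.0, 0.0])
M = np.eye(2)                                  # assumed splitting: M = I, N = P
mu = nu = np.array([0.5, 0.5])                 # assumed uniform probability vectors
radius = theorem_quantity(P, M, nu)
rho, v = splitting_iteration(P, r, M, mu, nu)
```

For this example the spectral radius is 0.3, so the condition holds, and the iterates approach ρ = 2/7 with a bias normalized by µv = 0.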

SLIDE 41

Example on a pursuit-evasion game

Solve the stationary Isaacs equation on $X = [-1/2, 1/2]^2$:

\[
-\rho + \varepsilon \Delta v(x) + \max_{a \in A} (a \cdot \nabla v(x)) + \min_{b \in B} (b \cdot \nabla v(x)) + \|x\|_2^2 = 0
\]

with ε = 0.5 and Neumann boundary conditions.

$x = x_E - x_P$ with $x_E$ = position of the evader (King), $x_P$ = position of the pursuer (Horse)
Actions for the King: $A := \{(a_1, a_2) \mid a_i = \pm 1 \text{ or } 0\}$
Actions for the Horse: B := {(0, 0), (1, 2), (2, 1)}.

For a 129 × 129 grid: ρ = 0.194

[Figure: surface plot of the bias v on X]

SLIDE 42

Optimal strategies

SLIDE 43

Numerical results

PI & LU solver (SuperLU library, using the stationary probability)

    257 × 257 points grid              513 × 513 points grid
    k   s   ‖r‖∞        time           k   s   ‖r‖∞        time
    1   4   4.54e−08    24 s           1   4   2.27e−08    154 s
    2   3   5.87e−09    43 s           2   2   3.27e−09    231 s
    3   1   6.97e−11    50 s           3   1   4.78e−11    269 s

PI & adapted AMG (Ruge and Stüben algorithm, computing ρ)

    257 × 257 points grid              513 × 513 points grid
    k   s   ‖r‖∞        time           k   s   ‖r‖∞        time
    1   4   4.54e−08    22 s           1   4   2.27e−08    112 s
    2   3   5.87e−09    41 s           2   2   3.27e−09    169 s
    3   1   6.97e−11    47 s           3   1   4.78e−11    198 s

using V(1, 1)-cycles (symmetric GS smoother), number of V-cycles ≈ 7

k = current iteration for MAX, s = number of iterations for MIN

SLIDE 44

Application: Perron eigenvector and eigenvalue

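Assume $A \in \mathbb{R}^{n \times n}_+$ is irreducible. The Perron eigenvector v and eigenvalue ρ are the solution of

\[
Av = \rho v, \quad \rho > 0, \quad v(i) > 0 \;\; \forall i
\]

Set v = exp(w), $w \in \mathbb{R}^n$; then we have to solve

\[
\log \rho + w = F(w), \qquad F_i(v) = \sup_{u \in \mathcal{A}_i} \Big( u\, v - \sum_{j \in [n],\, A_{ij} \neq 0} \log\Big(\frac{u_j}{A_{ij}}\Big) u_j \Big), \quad v \in \mathbb{R}^n, \; i \in [n]
\]

where

\[
\mathcal{A}_i = \big\{ u \in \mathbb{R}^n_+ \mid u \text{ probability row-vector and } u \ll A_{i\cdot} \big\}.
\]

Applying this to $A = P^T$ gives the stationary probability of an irreducible Markov chain; we tested PI with adapted AMG versus the MAA of [DeSterck, 08].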
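The supremum defining F_i has the closed form F_i(w) = log Σ_j A_ij exp(w_j), attained at u_j proportional to A_ij exp(w_j). The sketch below checks this numerically on a small positive matrix chosen arbitrarily for illustration, together with the equation log ρ + w = F(w); the function names are hypothetical.

```python
import numpy as np

def F(w, A):
    """F_i(w) = log sum_j A_ij exp(w_j), the closed form of the supremum."""
    return np.log(A @ np.exp(w))

def entropy_objective(u, w, a_row):
    """u.w - sum_j u_j log(u_j / A_ij) for a probability vector u with u << a_row."""
    s = u > 0
    return u @ w - np.sum(u[s] * np.log(u[s] / a_row[s]))

A = np.array([[2.0, 1.0], [1.0, 3.0]])          # arbitrary irreducible nonnegative matrix
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
rho = eigvals[k].real                            # Perron eigenvalue
v = np.abs(eigvecs[:, k].real)                   # Perron eigenvector, made positive
w = np.log(v)

u_star = A[0] * np.exp(w)                        # maximizer for row 0
u_star /= u_star.sum()
value_at_u_star = entropy_objective(u_star, w, A[0])
value_at_uniform = entropy_objective(np.array([0.5, 0.5]), w, A[0])
```

Each probability vector u plays the role of a policy here, which is what makes policy iteration applicable to the eigenproblem.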

SLIDE 46

Variant of Richman game

\[
f(v; x) = \frac{1}{2} \Big( \max_{y : (x,y) \in E} \big(r(x, y) + v(y)\big) + \min_{y : (x,y) \in E} \big(r(x, y) + v(y)\big) \Big)
\]

[Figure: graph of the game on nodes 1, 2, 3]

MAX and MIN flip a coin to decide who makes the move. MIN pays r to MAX.
X = {1, 2, 3}
E = {(1, 1), (1, 2), (1, 3), (2, 2), (3, 3)}
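The operator f can be sketched directly on this graph. The edge weights used below (−1 on the loop at node 1, 0 on every other arc) are an assumption read off the equations ρ + v = ... shown on the following slides, since the figure itself did not survive; the function name is hypothetical.

```python
def richman_operator(v, edges, r):
    """f(v; x) = 1/2 * (max + min) of r(x, y) + v(y) over the arcs (x, y) leaving x."""
    out = {}
    for x in v:
        vals = [r.get((x, y), 0.0) + v[y] for (x0, y) in edges if x0 == x]
        out[x] = 0.5 * (max(vals) + min(vals))
    return out

X = [1, 2, 3]
E = [(1, 1), (1, 2), (1, 3), (2, 2), (3, 3)]
r = {(1, 1): -1.0}                 # assumed weights: -1 on the loop at 1, 0 elsewhere
v = {1: 0.0, 2: -1.0, 3: 1.0}
fv = richman_operator(v, E, r)
```

With these assumed weights, the vector v = (0, −1, 1) is left unchanged by f, so it solves ρ + v = f(v) with ρ = 0.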

SLIDE 48

Application of PI algorithm

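\[
\rho + v = \begin{pmatrix} \tfrac{1}{2} \big( \max(v(1) - 1, v(2), v(3)) + \min(v(1) - 1, v(2), v(3)) \big) \\ v(2) \\ v(3) \end{pmatrix}
\]

[Figure: graph of the game on nodes 1, 2, 3]

\[
\rho + v = \begin{pmatrix} \tfrac{1}{2} \big( \min(v(1) - 1, v(2), v(3)) + v(2) \big) \\ v(2) \\ v(3) \end{pmatrix}
\]

\[
v^{(1)} = \begin{pmatrix} -1 \\ 1 \end{pmatrix}, \quad \rho = 0
\]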

SLIDE 50

Application of PI algorithm

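\[
\rho + v = \begin{pmatrix} \tfrac{1}{2} \big( \max(v(1) - 1, v(2), v(3)) + \min(v(1) - 1, v(2), v(3)) \big) \\ v(2) \\ v(3) \end{pmatrix}
\]

[Figure: graph of the game on nodes 1, 2, 3]

\[
\rho + v = \begin{pmatrix} \tfrac{1}{2} \big( \min(v(1) - 1, v(2), v(3)) + v(3) \big) \\ v(2) \\ v(3) \end{pmatrix}
\]

\[
v^{(2)} = \begin{pmatrix} -1 \\ 1 \end{pmatrix}, \quad \rho = 0
\]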

SLIDE 51

Application of PI algorithm

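\[
\rho + v = \begin{pmatrix} \tfrac{1}{2} \big( \max(v(1) - 1, v(2), v(3)) + \min(v(1) - 1, v(2), v(3)) \big) \\ v(2) \\ v(3) \end{pmatrix}
\]

[Figure: graph of the game on nodes 1, 2, 3]

$\alpha_3 = \alpha_1$ → the algorithm cycles!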

SLIDE 52

Dynamic programming for multichain games

Assume X := {1, ..., n} and A(x), B(x, a) are finite sets. In general, the value η of the game is a solution of the dynamic programming equation:

\[
\eta(x)\,(t + 1) + v(x) = F(\eta t + v; x), \quad x \in X, \; t \text{ large enough}
\]

for some $v \in \mathbb{R}^n$, where F is the dynamic programming operator:

\[
F(v; x) := \max_{a \in A(x)} \min_{b \in B(x,a)} \Big[ \sum_{y \in X} P(y|x,a,b)\, v(y) + r(x,a,b) \Big].
\]

({ηt + v, t large} is an invariant half-line.)

[Kohlberg, 80]

SLIDE 53

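This is equivalent to solving the following system for x ∈ X:

\[
\begin{cases}
\eta(x) = \max_{a \in A(x)} \min_{b \in B(x,a)} \sum_{y \in X} P(y|x,a,b)\, \eta(y) \\
\eta(x) + v(x) = \max_{a \in A_\eta(x)} \min_{b \in B_\eta(x,a)} \Big[ \sum_{y \in X} P(y|x,a,b)\, v(y) + r(x,a,b) \Big]
\end{cases}
\]

with

\[
A_\eta(x) := \operatorname*{argmax}_{a \in A(x)} \Big\{ \min_{b \in B(x,a)} \sum_{y \in X} P(y|x,a,b)\, \eta(y) \Big\}
\quad \text{and} \quad
B_\eta(x, a) := \operatorname*{argmin}_{b \in B(x,a)} \Big\{ \sum_{y \in X} P(y|x,a,b)\, \eta(y) \Big\}.
\]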

SLIDE 54

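DP for 1 player stochastic game with mean payoff

\[
\begin{cases}
\eta(x) = \min_{b \in B(x)} \sum_{y \in X} P(y|x,b)\, \eta(y) \\
\eta(x) + v(x) = \min_{b \in B_\eta(x)} \Big[ \sum_{y \in X} P(y|x,b)\, v(y) + r(x,b) \Big]
\end{cases}
\]

where x ∈ X and

\[
B_\eta(x) = \operatorname*{argmin}_{b \in B(x)} \Big\{ \sum_{y \in X} P(y|x,b)\, \eta(y) \Big\}.
\]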

SLIDE 55

Multichain Policy Iteration for 1PG

(Howard, 60 and Denardo, Fox, 67)

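Start with $\beta_0 : x \mapsto \beta_0(x)$, apply successively:

1 Calculate the value and bias $(\eta^{s+1}, v^{s+1})$ for policy $\beta_s$, solution of

\[
\eta^{s+1} = P^{\beta_s} \eta^{s+1} \quad \text{and} \quad \eta^{s+1} + v^{s+1} = P^{\beta_s} v^{s+1} + r^{\beta_s}
\]

2 Improve the policy: select $\beta_{s+1}$ optimal for $(\eta^{s+1}, v^{s+1})$:

\[
\beta_{s+1}(x) \in \operatorname*{argmin}_{b \in B_{\eta^{s+1}}(x)} \Big\{ \sum_{y \in X} P(y|x,b)\, v^{s+1}(y) + r(x,b) \Big\}
\]

until $\beta_{s+1}(x) = \beta_s(x)$ ∀x ∈ X.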

SLIDE 56

Degenerate iteration

It is easy to show that $\eta^{s+1} \le \eta^s$

If $\eta^{s+1} = \eta^s$ → degenerate iteration

$v^{s+1}$ is defined up to $\operatorname{Ker}(I - P^{\beta_s})$, whose dimension is the number of final classes of $P^{\beta_s}$.
→ PI may cycle when there are multiple final classes

To avoid this:
    Strategies are improved in a conservative way ($\beta_{s+1}(x) = \beta_s(x)$ if it is optimal)
    $v^{s+1}$ is fixed at one point of each final class of $P^{\beta_s}$

⇒ when $\eta^{s+1} = \eta^s$, $v^{s+1}(x) = v^s(x)$ on each final class of $P^{\beta_s}$
⇒ $(\eta^s, v^s)_{s \ge 1}$ is nonincreasing in a lexicographical order: $\eta^{s+1} \le \eta^s$ and, if $\eta^{s+1} = \eta^s$, $v^{s+1} \le v^s$
⇒ PI stops after a finite time when the sets of actions are finite
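A conservative improvement step can be sketched as follows, for a one-player (MIN) problem where `q[x, b]` holds the quantity to be minimized at state x for action b. The tolerance `tol` is an assumption added for floating-point arithmetic, and the array layout and function name are hypothetical.

```python
import numpy as np

def conservative_improvement(q, beta, tol=1e-9):
    """Keep beta[x] unless another action is better by more than tol."""
    x = np.arange(q.shape[0])
    best = q.argmin(axis=1)
    improve = q[x, best] < q[x, beta] - tol
    return np.where(improve, best, beta)

q = np.array([[1.0, 1.0],     # state 0: both actions are optimal
              [2.0, 1.0]])    # state 1: action 1 is strictly better
beta = np.array([1, 0])
new_beta = conservative_improvement(q, beta)
```

At state 0 the current action is kept even though `argmin` would return action 0, which is the behaviour that rules out switching between equally good policies.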

SLIDE 57

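DP for 2 player stochastic game with mean payoff

\[
\begin{cases}
\eta(x) = \max_{a \in A(x)} \hat{F}(\eta; x, a) \\
\eta(x) + v(x) = \max_{a \in A_\eta(x)} \acute{F}_\eta(v; x, a)
\end{cases}
\]

where x ∈ X and:

\[
\hat{F}(\eta; x, a) := \min_{b \in B(x,a)} \sum_{y \in X} P(y|x,a,b)\, \eta(y)
\]

\[
\acute{F}_\eta(v; x, a) := \min_{b \in B_\eta(x,a)} \Big[ \sum_{y \in X} P(y|x,a,b)\, v(y) + r(x,a,b) \Big]
\]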

SLIDE 58

Multichain Policy Iteration for 2PG

(Cochet-Terrasson, Gaubert, 06)

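Start with $\alpha_0 : x \mapsto \alpha_0(x)$, apply successively:

1 Calculate the value and bias $(\eta^{k+1}, v^{k+1})$ for policy $\alpha_k$, solution of

\[
\begin{cases}
\eta(x) = \hat{F}(\eta; x, \alpha_k(x)) \\
\eta(x) + v(x) = \acute{F}_\eta(v; x, \alpha_k(x))
\end{cases}
\]

Use PI for the one-player multichain game (Denardo and Fox, D&F)

2 Improve the policy $\alpha_k$ in a conservative way.

until $\alpha_{k+1}(x) = \alpha_k(x)$ ∀x ∈ X.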

SLIDE 59

As in D&F, if $\eta^{k+1} = \eta^k$, the set of solutions $v^{k+1}$ may be of dimension > 1 → PI may cycle

If $\eta^{k+1} = \eta^k$, then define

\[
\bar{g}(v; x) := \acute{F}_{\eta^{k+1}}(v; x, \alpha_{k+1}(x)) - \eta^{k+1}(x),
\]

the DP operator of a one-player game.

Compute the critical graph of $\bar{g}$ as defined in (Akian, Gaubert 2003), using a v′ such that $\bar{g}(v') = v'$, for instance $v' = v^{k+1}$.

Solve

\[
\begin{cases}
v^{k+1}(x) = \bar{g}(v^{k+1}; x), & x \in N^{k+1} \\
v^{k+1}(x) = v^k(x), & x \in C^{k+1}
\end{cases}
\]

where $N^k := X \setminus C^k$.

SLIDE 60

Theorem

$(\eta^k, v^k, C^k)_{k \ge 1}$ is nondecreasing in a "lexicographical order": $\eta^k \le \eta^{k+1}$ and, if $\eta^k = \eta^{k+1}$, $v^k \le v^{k+1}$ and $C^k \supset C^{k+1}$

PI stops after a finite time when the sets of actions are finite

SLIDE 61

Solve Step 1: η = Pη and η + v = Pv + r

Assume P has two final classes and one transient class:

\[
P = \begin{pmatrix} P_{11} & P_{12} & P_{13} \\ 0 & P_{22} & 0 \\ 0 & 0 & P_{33} \end{pmatrix}
\]

then we have to solve:

1 For the final classes I = 2, 3:

\[
\eta_I + v_I = P_{II} v_I + r_I, \quad v_I(0) = 0, \quad \eta_I(x) \equiv \eta_I, \; x \in I
\]

with $P_{II}$ an irreducible Markov matrix (row sums = 1)

2 For the transient class 1:

\[
\eta_1 = P_{11} \eta_1 + P_{12} \eta_2 + P_{13} \eta_3, \qquad
\eta_1 + v_1 = P_{11} v_1 + P_{12} v_2 + P_{13} v_3 + r_1
\]

with $P_{11}$ an irreducible strictly sub-Markov matrix (one row sum < 1)

→ LU, AMG, etc.
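The class-by-class solve can be sketched with dense direct solvers as follows. It assumes the decomposition into transient states and final classes is already known and passed in as index lists; the normalization of the bias at the first state of each final class stands in for v_I(0) = 0. The function names and the 4-state example are hypothetical.

```python
import numpy as np

def solve_final_class(P_II, r_I):
    """Solve eta_I + v_I = P_II v_I + r_I with v_I = 0 at the first state of the class."""
    m = P_II.shape[0]
    K = np.zeros((m + 1, m + 1))
    K[:m, :m] = np.eye(m) - P_II
    K[:m, m] = 1.0                  # column for the scalar unknown eta_I
    K[m, 0] = 1.0                   # normalization row: v_I[0] = 0
    sol = np.linalg.solve(K, np.append(r_I, 0.0))
    return sol[m], sol[:m]          # eta_I, v_I

def solve_multichain(P, r, transient, final_classes):
    n = P.shape[0]
    eta, v = np.zeros(n), np.zeros(n)
    for I in final_classes:         # step 1: each final class separately
        eta_I, v_I = solve_final_class(P[np.ix_(I, I)], r[I])
        eta[I] = eta_I
        v[I] = v_I
    T = transient                   # step 2: transient states
    F = [x for I in final_classes for x in I]
    M = np.eye(len(T)) - P[np.ix_(T, T)]
    eta[T] = np.linalg.solve(M, P[np.ix_(T, F)] @ eta[F])
    v[T] = np.linalg.solve(M, P[np.ix_(T, F)] @ v[F] + r[T] - eta[T])
    return eta, v

# Hypothetical chain: state 0 is transient, {1, 2} and {3} are final classes.
P = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.0, 0.5,  0.5, 0.0],
              [0.0, 0.2,  0.8, 0.0],
              [0.0, 0.0,  0.0, 1.0]])
r = np.array([3.0, 1.0, 0.0, 0.0])
eta, v = solve_multichain(P, r, transient=[0], final_classes=[[1, 2], [3]])
```

For this chain the solve gives η = (1/7, 2/7, 2/7, 0) and v = (40/7, 0, −10/7, 0); the two linear solves on the transient block share the same matrix I − P_11.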

SLIDE 62

Richman game on random sparse graphs

[Figure: frequency versus number of nodes (1000 to 50000)]

10 arcs per node, 500 random graphs per dimension, > 10% strongly degenerate iterations.

SLIDE 63

[Figure: two plots of the number of policy iterations versus the number of nodes (10000 to 50000)]

Maximum, average and minimum number of policy iterations among 500 tests.
Left: external PI (1st player). Right: total internal PI (2nd player).
Instance with $n = 10^6$: 12 external PI and 90 total internal PI.

SLIDE 64

Example on a pursuit-evasion game

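Set $x = x_E - x_P$ with $x_E$ = position of the evader and $x_P$ = position of the pursuer.

Solve the stationary Isaacs equation on $X = [-1/2, 1/2]^2$:

\[
\begin{cases}
\max_{a \in A(x)} (a \cdot \nabla \eta(x)) + \min_{b \in B(x)} (b \cdot \nabla \eta(x)) = 0, & x \in X \\
-\eta(x) + \max_{a \in A_\eta(x)} (a \cdot \nabla v(x)) + \min_{b \in B_\eta(x)} (b \cdot \nabla v(x)) + \|x\|_2^2 = 0, & x \in X
\end{cases}
\]

with natural boundary conditions (keeping x in the domain).

Actions for the Mouse:

\[
A(x) := \begin{cases} \{(0, 0)\} & \text{if } x \in B((0, 0); 0.1) \\ \{(a_1, a_2) \mid a_i = \pm 1 \text{ or } 0\} & \text{otherwise} \end{cases}
\]

Actions for the Cat: $B(x) := \{(b_1, b_2) \mid b_i \in \{0, \bar{b}, -\bar{b}\}\}$, $\bar{b}$ constant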

SLIDE 65

257 × 257 grid.

SLIDE 66

[Figure: surface plots of v and η on X for $\bar{b} = 0.999$, $\bar{b} = 1$ and $\bar{b} = 1.001$; panel annotations: v = 0, η = 0.492, $\eta \approx \|x\|_2^2$, η = 0]

SLIDE 67

b̄        Cat policy         Number of mouse       Infinite norm     CPU time     Strongly
         iteration index    policy iterations     of residual       (s)
0.999    1                  2                     1.25e−06          2.59e+01
         2                  1                     9.93e−12          3.95e+01
         3                  1                     5.68e−14          7.35e+02
1        1                  2                     1.25e−06          2.60e+01
         2                  1                     3.39e−21          3.84e+01
1.001    1                  2                     1.25e−06          2.59e+01
         2                  1                     1.96e−14          6.51e+02

257 × 257 grid.

SLIDE 68

PIGAMES library

Implementation: PIGAMES (C library), by Detournay.
AMG, LU solver + decomposition into classes to solve linear systems.
Double precision arithmetic.
In the double precision implementation, improvement tests are done up to some given threshold (which should not be too small if the matrices are ill-conditioned).
Single processor: Intel(R) Xeon(R) W3540, 2.93 GHz, with 8 GB of RAM

SLIDE 69

Conclusions and Perspectives

We have proposed algorithms combining AMG with PI for discounted stochastic games and unichain stochastic games with mean reward.
AMG is not efficient for strongly nonsymmetric matrices → difficult to apply to general games.
A full multilevel scheme can make policy iteration faster and more efficient.
We have introduced a PI algorithm for multichain games and shown that degenerate iterations often occur.
The termination proof of PI assumes exact arithmetic.

SLIDE 70

Conclusions and Perspectives

Find an AMG for strongly nonsymmetric systems to solve more general discrete games.
Prove the convergence of an ε-approximate policy iteration algorithm.
Estimate the number of iterations as a function of the conditioning or of the stationary probability of $P^{\alpha,\beta}$?

Akian, M. and Detournay, S. (2012), Multigrid methods for two-player zero-sum stochastic games. Numerical Linear Algebra with Applications.
Akian, M., Cochet-Terrasson, J., Detournay, S. and Gaubert, S. (2012), Policy iteration algorithm for zero-sum multichain stochastic games with mean payoff and perfect information. Preprint, arXiv:1208.0446.

Thank you!
