Multigrid methods for zero-sum two-player stochastic games with mean reward

Sylvie Detournay and Marianne Akian, INRIA Saclay and CMAP, École Polytechnique (France)
15th Copper Mountain Conference on Multigrid Methods, 27 March - 1 April, 2011


  1. Multigrid methods for zero-sum two-player stochastic games with mean reward. Sylvie Detournay and Marianne Akian, INRIA Saclay and CMAP, École Polytechnique (France). 15th Copper Mountain Conference on Multigrid Methods, 27 March - 1 April, 2011. Sylvie Detournay (INRIA and CMAP), MG for zero-sum stochastic games, Copper 2011.

  2. DP for zero-sum stochastic games with mean reward

Dynamic programming equation of zero-sum two-player stochastic games with mean reward:

    ρ + v(x) = max_{α ∈ A(x)} min_{β ∈ B(x,α)} [ Σ_{y ∈ X} P(y | x, α, β) v(y) + r(x, α, β) ],   ∀x ∈ X   (DP)

where
  - X is the state space;
  - ρ is the mean reward of the game, a nonlinear eigenvalue;
  - v(x) is the bias, or relative value, of the game starting at x ∈ X;
  - α, β are the actions of the first player MAX and the second player MIN;
  - r(x, α, β) is the reward paid by MIN to MAX;
  - P(y | x, α, β) is the transition probability from x to y given the actions α, β.
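The right-hand side of (DP) can be evaluated directly. Below is a minimal Python sketch, not from the slides: it assumes for simplicity that the action sets A(x) and B(x, α) are the same at every state, so that `P[a][b]` is a full transition matrix for the action pair (a, b); all numerical data are invented for illustration.

```python
import numpy as np

def dp_operator(v, P, r):
    """F(v)(x) = max_a min_b [ sum_y P(y|x,a,b) v(y) + r(x,a,b) ].

    P[a][b] is an n x n stochastic matrix, r[a][b] an n-vector;
    both are indexed by MAX's action a and MIN's action b.
    """
    n = len(v)
    Fv = np.empty(n)
    for x in range(n):
        best = -np.inf
        for a in range(len(P)):                       # MAX maximizes
            worst = min(P[a][b][x] @ v + r[a][b][x]   # MIN minimizes
                        for b in range(len(P[a])))
            best = max(best, worst)
        Fv[x] = best
    return Fv

# Toy data: two states, two actions per player.
P = [[np.array([[1.0, 0.0], [0.0, 1.0]]),
      np.array([[0.5, 0.5], [0.5, 0.5]])],
     [np.array([[0.0, 1.0], [1.0, 0.0]]),
      np.array([[1.0, 0.0], [0.5, 0.5]])]]
r = [[np.array([1.0, 0.0]), np.array([2.0, -1.0])],
     [np.array([0.0, 3.0]), np.array([-1.0, 1.0])]]

print(dp_operator(np.zeros(2), P, r))   # with v = 0 this is max_a min_b r
```

With v = 0 the operator reduces to the one-shot game value max_a min_b r(x, a, b) at each state.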

  3. DP for zero-sum stochastic games with mean reward

Value of the game with mean reward starting at x ∈ X:

    ρ(x) = sup_{(α_k)_{k≥0}} inf_{(β_k)_{k≥0}} limsup_{N→∞} (1/N) E[ Σ_{k=0}^{N} r(X_k, α_k, β_k) ]

where
    α_k = α_k(X_k, α_{k-1}, β_{k-1}, ...),
    β_k = β_k(X_k, α_k, α_{k-1}, β_{k-1}, ...)
are strategies, and the state process X_k satisfies

    P(X_{k+1} = y | X_k = x, α_k = α, β_k = β) = P(y | x, α, β).

  4. A deterministic zero-sum game

Deterministic zero-sum two-player game. The circles (resp. squares) represent the nodes at which Max (resp. Min) can play.

[Figure: a weighted directed graph on Max nodes (circles) and Min nodes 1', 2', 3', 4' (squares).]

Values in the (DP) equation:
  - X = { Max nodes };
  - A(x) = { Min nodes accessible from x };
  - B(x, α) = { Max nodes accessible from α };
  - r(x, α, β) = weight(x, α) + weight(α, β);
  - y = β.


  8. A deterministic zero-sum game

[Figure: the example graph, highlighting the cycle reached from 2'.]

If Max initially moves to 2', he eventually loses 5 per turn.


  12. A deterministic zero-sum game

[Figure: the example graph, highlighting the cycle reached from 1'.]

But if Max initially moves to 1', he eventually loses only (1 + 0 + 2 + 3) / 2 = 3 per turn.
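The arithmetic behind the comparison of the two moves is just a cycle mean: total weight collected along the cycle divided by the number of turns, where one turn is a Max move followed by a Min move. A tiny sketch, using the weights quoted on the slide:

```python
def cycle_mean(weights, turns):
    """Mean payoff per turn of a cycle: sum of edge weights / number of turns."""
    return sum(weights) / turns

# Cycle reached via 1': edge weights 1, 0, 2, 3 over 2 turns (slide 12).
via_1 = cycle_mean([1, 0, 2, 3], 2)
print(via_1)   # 3.0, versus a loss of 5 per turn via 2', so 1' is better for Max
```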

  13. DP for zero-sum stochastic games

Optimal strategies and dynamic programming:

    ρ(x) = sup_{(α_k)_{k≥0}} inf_{(β_k)_{k≥0}} limsup_{N→∞} (1/N) E[ Σ_{k=0}^{N} r(X_k, α_k, β_k) ],   x ∈ X.

For feedback strategies α_k = ᾱ(X_k), β_k = β̄(X_k, ᾱ(X_k)), define the matrix P^{ᾱ,β̄} by

    P^{ᾱ,β̄}_{xy} := P(y | x, ᾱ(x), β̄(x, ᾱ(x))).

If P^{ᾱ,β̄} is irreducible for all ᾱ and β̄, then ρ(x) ≡ ρ is the unique solution of

    ρ + v(x) = max_{α ∈ A(x)} min_{β ∈ B(x,α)} [ Σ_{y ∈ X} P(y | x, α, β) v(y) + r(x, α, β) ],   x ∈ X,   (DP)

and ᾱ, β̄ given by the (DP) equation are optimal feedback strategies for both players.
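Fixing feedback strategies turns the game into a Markov chain, and the theorem's hypothesis is that every such chain is irreducible. The sketch below, with invented data and the simplification that β̄ is tabulated per state, builds the matrix P^{ᾱ,β̄} and checks irreducibility as strong connectivity of the positive-probability graph:

```python
import numpy as np

def chain_matrix(P, alpha_bar, beta_bar):
    """Row x of the chain is P(. | x, alpha_bar(x), beta_bar(x))."""
    n = len(alpha_bar)
    return np.array([P[alpha_bar[x]][beta_bar[x]][x] for x in range(n)])

def is_irreducible(M):
    """True iff every state reaches every other along edges with M[x,y] > 0."""
    n = M.shape[0]
    A = (M > 0).astype(int)
    R = np.minimum(A + np.eye(n, dtype=int), 1)
    for _ in range(n):                 # boolean transitive closure
        R = np.minimum(R @ R, 1)
    return bool((R > 0).all())

# Toy data: one action pair; the two states swap deterministically.
P = [[np.array([[0.0, 1.0], [1.0, 0.0]])]]
M = chain_matrix(P, alpha_bar=[0, 0], beta_bar=[0, 0])
print(is_irreducible(M))   # True: the two states form a single cycle
```

A chain with an absorbing state, e.g. rows [1, 0] and [0.5, 0.5], would fail the check, which is exactly the multichain situation treated later in the talk.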

  14. DP for zero-sum stochastic games

Dynamic programming equation of zero-sum two-player stochastic differential games: the Isaacs PDE (diffusion problems)

    −ρ + H(x, ∂v/∂x, ∂²v/(∂x_i ∂x_j)) = 0,   x ∈ X,   (I)

where

    H(x, p, K) = max_{α ∈ A(x)} min_{β ∈ B(x,α)} [ p · f(x, α, β) + (1/2) tr(σ(x, α, β) σᵀ(x, α, β) K) + r(x, α, β) ].

Discretization of (I) with monotone schemes yields (DP).

  15. DP for zero-sum stochastic games

Motivation:
  - Solve dynamic programming equations arising from the discretization of Isaacs equations: for example, long-term diffusion problems, risk-sensitive problems (finance), singular perturbations of Isaacs equations, ...
  - Solve large-scale zero-sum stochastic games (with discrete state space): for example, problems arising from the web, problems in verification of programs in computer science, ...
  - Extend this equation to the general case, that is, without the irreducibility assumption.

→ Use the policy iteration algorithm combined with multigrid methods to solve the dynamic programming equation.

  16. DP for zero-sum stochastic games

Dynamic programming for multichain games. In general, the value of the game is a solution of the dynamic programming equation

    (t + 1) ρ(x) + v(x) = F(tρ + v; x),   x ∈ X,   t large enough,

where F is the dynamic programming operator:

    F(v; x) := max_{α ∈ A(x)} min_{β ∈ B(x,α)} [ Σ_{y ∈ X} P(y | x, α, β) v(y) + r(x, α, β) ].

({ tρ + v, t large } is an invariant half-line.)
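The invariant half-line can be observed numerically: iterating F from v = 0, the iterates behave like tρ + v for large t, so the increments F^{t+1}(0) − F^t(0) flatten out to the mean reward ρ. A sketch on a one-player (Min only) irreducible chain, with invented data:

```python
import numpy as np

# Min's two actions: transition matrices and reward vectors (illustrative).
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.4, 0.6]])]
r = [np.array([2.0, 1.0]), np.array([1.5, 0.5])]

def F(v):
    # One-player operator: F(v)(x) = min_b [ sum_y P_b(x,y) v(y) + r_b(x) ]
    return np.min([P[b] @ v + r[b] for b in range(2)], axis=0)

v = np.zeros(2)
for t in range(200):
    v = F(v)

# After many iterations the increment is nearly constant in x,
# and that constant is the mean reward rho.
print(F(v) - v)
```

Value iteration alone only recovers ρ this way under aperiodicity and unichain assumptions; the multichain case is what the system on the next slide and the modified policy iteration are for.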

  17. DP for zero-sum stochastic games

This is equivalent to solving the system, for x ∈ X:

    ρ(x) = max_{α ∈ A(x)} min_{β ∈ B(x,α)} Σ_{y ∈ X} P(y | x, α, β) ρ(y)

    ρ(x) + v(x) = max_{α ∈ A_ρ(x)} min_{β ∈ B_ρ(x,α)} [ Σ_{y ∈ X} P(y | x, α, β) v(y) + r(x, α, β) ]

with A_ρ(x) := argmax_{α ∈ A(x)} [ min_{β ∈ B(x,α)} Σ_{y ∈ X} P(y | x, α, β) ρ(y) ]
and  B_ρ(x, α) := argmin_{β ∈ B(x,α)} Σ_{y ∈ X} P(y | x, α, β) ρ(y).

For a one-player game:

    ρ(x) = min_{β ∈ B(x)} Σ_{y ∈ X} P(y | x, β) ρ(y)

    ρ(x) + v(x) = min_{β ∈ B_ρ(x)} [ Σ_{y ∈ X} P(y | x, β) v(y) + r(x, β) ]

with B_ρ(x) := argmin_{β ∈ B(x)} Σ_{y ∈ X} P(y | x, β) ρ(y).

  18. Policy iteration (PI) algorithm

Multichain policy iteration algorithm for one player (Denardo, Fox, 67). Start with β̄⁰ : x ↦ β̄⁰(x).

  1. Compute the value and bias (ρ^{k+1}, v^{k+1}) for the policy β̄^k, solution of

        ρ^{k+1} = P_{β̄^k} ρ^{k+1}   and   ρ^{k+1} + v^{k+1} = P_{β̄^k} v^{k+1} + r_{β̄^k}.

  2. Improve the policy: find β̄^{k+1} optimal for (ρ^{k+1}, v^{k+1}):

        β̄^{k+1}(x) ∈ argmin_{β ∈ B_{ρ^{k+1}}(x)} [ Σ_{y ∈ X} P(y | x, β) v^{k+1}(y) + r(x, β) ],   x ∈ X,

     with B_ρ(x) := argmin_{β ∈ B(x)} Σ_{y ∈ X} P(y | x, β) ρ(y).
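A minimal sketch of this one-player algorithm, not the authors' code, restricted to the unichain case: there ρ is a scalar, the restriction to B_{ρ^{k+1}} in the improvement step is trivial, and the evaluation step reduces to one linear system in (ρ, v) with the normalization v(0) = 0. The names `evaluate` and `improve` and all data are invented for illustration.

```python
import numpy as np

P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # Min's action 0
     np.array([[0.5, 0.5], [0.4, 0.6]])]   # Min's action 1
r = [np.array([2.0, 1.0]), np.array([1.5, 0.5])]
n, m = 2, 2                                # states, actions

def evaluate(beta):
    """Solve rho + v(x) = sum_y P_beta(x,y) v(y) + r_beta(x), v(0) = 0."""
    Pb = np.array([P[beta[x]][x] for x in range(n)])
    rb = np.array([r[beta[x]][x] for x in range(n)])
    A = np.zeros((n + 1, n + 1))
    A[:n, 0] = 1.0                  # coefficient of rho
    A[:n, 1:] = np.eye(n) - Pb      # coefficient of v
    A[n, 1] = 1.0                   # normalization v(0) = 0
    b = np.concatenate([rb, [0.0]])
    sol = np.linalg.solve(A, b)
    return sol[0], sol[1:]

def improve(v):
    """Greedy Min policy with respect to the bias v."""
    vals = np.array([[P[b][x] @ v + r[b][x] for b in range(m)]
                     for x in range(n)])
    return [int(b) for b in np.argmin(vals, axis=1)]

beta = [0, 0]
while True:
    rho, v = evaluate(beta)
    new_beta = improve(v)
    if new_beta == beta:            # policy stable: (rho, v) solves (DP)
        break
    beta = new_beta
print(rho, beta)
```

In the multichain setting this simple loop can cycle, which is exactly what the conservative improvement and the per-ergodic-class normalization on the next slide are designed to prevent.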

  19. Policy iteration (PI) algorithm

It is easy to show that ρ^{k+1} ≤ ρ^k. If ρ^{k+1} = ρ^k, the iteration may degenerate:
v^{k+1} is defined only up to Ker(I − P_{β̄^k}), whose dimension equals the number of ergodic classes of P_{β̄^k}, which is ≥ 1.
→ PI may cycle when there are multiple ergodic classes.

To avoid this:
  - optimal strategies are improved in a conservative way (β̄^{k+1}(x) = β̄^k(x) if it is still optimal);
  - v^{k+1} is fixed at one point of each ergodic class of P_{β̄^k}.

Then:
  ⇒ when ρ^{k+1} = ρ^k, v^{k+1}(x) = v^k(x) on each ergodic class of P_{β̄^k};
  ⇒ (ρ^k, v^k)_{k≥1} is non-increasing in the lexicographic order: ρ^{k+1} ≤ ρ^k, and if ρ^{k+1} = ρ^k then v^{k+1} ≤ v^k;
  ⇒ PI stops after finitely many steps when the sets of actions are finite.

Remark: PI ≈ Newton's algorithm in the case with a unique solution v.
