Monte Carlo Tree Search guided by Symbolic Advice for MDPs




1. Monte Carlo Tree Search guided by Symbolic Advice for MDPs. Damien Busatto-Gaston, Debraj Chakraborty and Jean-François Raskin, Université Libre de Bruxelles. September 16, 2020, HIGHLIGHTS 2020.

2. Markov Decision Process
[Figure: an example MDP with states $s_0$, $s_1$, $s_2$, actions $a_1, \dots, a_4$, and stochastic transitions labelled with probability weights and rewards.]
Path of length 2: $s_0 \to s_1 \to s_2$, each step taken by playing an action.
Finite-horizon total reward (horizon $H$):
$$\mathrm{Val}(s_0) = \sup_{\sigma : \mathrm{Paths} \to A} \mathbb{E}[\mathrm{Reward}(p)]$$
where $p$ is a random variable over $\mathrm{Paths}_H(s_0, \sigma)$.
This links with the infinite-horizon average reward for $H$ large enough.
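To make the objective concrete, here is a minimal Python sketch of computing $\mathrm{Val}$ by backward induction over the unfolding. The MDP encoded below is a small stand-in, not the exact one from the slide's figure; all states, actions, probabilities and rewards are illustrative.

```python
# Illustrative tabular MDP (not the exact one from the figure):
# mdp[state][action] = list of (probability, successor, reward) triples.
mdp = {
    "s0": {"a1": [(0.5, "s1", 1.0), (0.5, "s2", 0.0)],
           "a2": [(1.0, "s2", 2.0)]},
    "s1": {"a3": [(1.0, "s2", 3.0)]},
    "s2": {"a4": [(0.5, "s0", 0.0), (0.5, "s2", 1.0)]},
}

def value(state, horizon):
    """Val(state): optimal expected total reward over `horizon` steps,
    computed by backward induction over the unfolded MDP."""
    if horizon == 0:
        return 0.0
    return max(
        sum(p * (r + value(nxt, horizon - 1)) for p, nxt, r in outcomes)
        for outcomes in mdp[state].values()
    )

print(value("s0", 2))  # optimal 2-step expected total reward from s0
```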

3. Monte Carlo tree search (MCTS)
[Figure: a search tree rooted at $s_0$, alternating action and state nodes, with value estimates $v_1, v_2, \dots$ attached to the root actions.]
Iterative construction of a sparse tree with value estimates:
selection of a new node → simulation → update of the estimates.
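The loop below is a compact, self-contained sketch of that construction over the illustrative `mdp` from the earlier snippet. For brevity the tree is descended with uniformly random actions, where UCT would instead use the UCB1 rule shown on the next slide; all helper names are ours, not the paper's.

```python
import random

class Node:
    """A tree node; children are keyed by (action, successor state)."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}
        self.visits, self.total = 0, 0.0

def sample(state, action):
    # Draw one stochastic outcome of playing `action` in `state`.
    outcomes = mdp[state][action]
    _, nxt, r = random.choices(outcomes, weights=[o[0] for o in outcomes])[0]
    return nxt, r

def rollout(state, steps):
    # Simulation phase: uniformly random actions from `state`.
    total = 0.0
    for _ in range(steps):
        state, r = sample(state, random.choice(list(mdp[state])))
        total += r
    return total

def mcts(root_state, horizon, iterations=2000):
    root = Node(root_state)
    for _ in range(iterations):
        node, reward, depth = root, 0.0, 0
        while depth < horizon:                      # selection / expansion
            action = random.choice(list(mdp[node.state]))
            nxt, r = sample(node.state, action)
            reward, depth = reward + r, depth + 1
            fresh = (action, nxt) not in node.children
            if fresh:
                node.children[(action, nxt)] = Node(nxt, node)
            node = node.children[(action, nxt)]
            if fresh:
                break                               # stop at the new node
        reward += rollout(node.state, horizon - depth)   # simulation
        while node:                                 # update of the estimates
            node.visits, node.total = node.visits + 1, node.total + reward
            node = node.parent
    best = max(root.children,
               key=lambda k: root.children[k].total / root.children[k].visits)
    return best[0]   # action with the best estimated value at the root

print(mcts("s0", horizon=4))
```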

4. Monte Carlo tree search (MCTS)
With UCT (Kocsis & Szepesvári, 2006) as the selection strategy, after a given number of iterations $n$, MCTS outputs the best action:
- the probability of choosing a suboptimal action converges to zero;
- $v_i$ converges to the real value of $a_i$ at a speed of $(\log n)/n$.
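A sketch of that selection rule on the tree from the previous snippet. The exploration constant $c = \sqrt{2}$ is the textbook default, an assumption on our part; UCT tunes it to the reward range.

```python
import math

def ucb1_index(node, action, c=math.sqrt(2)):
    """UCB1 index of `action` at `node`: empirical mean return plus an
    exploration bonus; untried actions get infinite priority."""
    stats = [ch for (a, _), ch in node.children.items() if a == action]
    n = sum(ch.visits for ch in stats)
    if n == 0:
        return float("inf")
    mean = sum(ch.total for ch in stats) / n
    return mean + c * math.sqrt(math.log(node.visits) / n)

def uct_action(node):
    # UCT descends the tree via the action maximizing the UCB1 index.
    return max(mdp[node.state], key=lambda a: ucb1_index(node, a))
```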

5. Symbolic advice

6. Symbolic advice
[Figure: the unfolding of the MDP up to horizon $H$, with each leaf marked ✓ or ✗ according to whether its path satisfies the advice.]
An advice is a subset of $\mathrm{Paths}_H(s_0)$, defined symbolically as a logical formula $\varphi$ (a reachability or safety property, an LTL formula over finite traces, a regular expression, ...).
$\varphi$ defines a pruning of the unfolded MDP.
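Concretely, an advice can be viewed as a predicate over length-$H$ paths. The sketch below enumerates the paths of the illustrative `mdp`'s unfolding and prunes them with a simple safety advice; the state name `s_bad` is hypothetical, used only for illustration.

```python
def paths(state, horizon):
    """Enumerate Paths_H(state) of the unfolding as [s0, a0, s1, ..., sH]."""
    if horizon == 0:
        yield [state]
        return
    for action, outcomes in mdp[state].items():
        for _, nxt, _ in outcomes:
            for rest in paths(nxt, horizon - 1):
                yield [state, action] + rest

def safety_advice(path, bad=frozenset({"s_bad"})):
    # A safety advice: "never visit a bad state" (states sit at even indices).
    return all(s not in bad for s in path[0::2])

# The advice phi defines a pruning of the unfolded MDP:
pruned = [p for p in paths("s0", 2) if safety_advice(p)]
```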

7. Symbolic advice
[Figure: the pruned unfolding; every action that is kept keeps all of its stochastic successors.]
Strongly enforceable advice: an advice that the controller can enforce if the MDP is seen as a game, i.e. one that does not partially prune stochastic transitions.
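The sketch below spells out that "no partial pruning" condition by brute-force enumeration, reusing `paths` from the previous snippet; an actual implementation would check this symbolically rather than by listing paths.

```python
def partially_prunes(advice, state, horizon):
    """Return True if the advice keeps a prefix ending in (s, a) but drops
    some stochastic successor of (s, a), i.e. the pruning cuts a stochastic
    transition only partially and so cannot be enforced by the controller."""
    kept = [p for p in paths(state, horizon) if advice(p)]
    prefixes = {tuple(p[:2 * i + 2]) for p in kept for i in range(horizon)}
    for pre in prefixes:
        s, a = pre[-2], pre[-1]            # prefix ends with state s, action a
        for _, nxt, _ in mdp[s][a]:
            if not any(tuple(p[:len(pre) + 1]) == pre + (nxt,) for p in kept):
                return True                # (s, a) kept but successor pruned
    return False
```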

8. Boolean solvers
The advice $\varphi$ can be encoded as a Boolean formula $\psi$.
QBF solver: a first action $a_0$ is compatible with $\varphi$ iff
$$\forall s_1 \exists a_1 \forall s_2 \dots, \quad s_0\, a_0\, s_1\, a_1\, s_2 \dots \models \psi.$$
This gives an inductive way of constructing paths that satisfy the strongly enforceable advice $\varphi$.
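As a stand-in for the QBF encoding, the recursion below unrolls the $\forall\exists$ alternation directly on the unfolding. It is exponential and purely illustrative; the paper's approach hands the Boolean encoding $\psi$ to a QBF solver instead.

```python
def compatible(state, action, advice, horizon, prefix=None):
    """Is `action` compatible with the advice from `state`? Mirrors the
    alternation: for all successors there exists a next action such that,
    for all successors ..., the completed path satisfies the advice."""
    prefix = (prefix or [state]) + [action]
    for _, nxt, _ in mdp[state][action]:           # forall successors
        ext = prefix + [nxt]
        if horizon == 1:
            if not advice(ext):                    # path complete: check it
                return False
        elif not any(compatible(nxt, a, advice, horizon - 1, ext)
                     for a in mdp[nxt]):           # exists a next action
            return False
    return True
```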

Weighted sampling: simulation of safe paths according to $\psi$, using weighted SAT sampling (Chakraborty, Fremont, Meel, Seshia, & Vardi, 2014).
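The snippet below fakes that step with plain rejection sampling so the running example stays self-contained: it draws random rollouts and keeps the first one satisfying the advice. The actual approach samples satisfying assignments of the Boolean encoding, weighted by the path probabilities, which avoids the wasted draws; the helper name and the `tries` bound are ours.

```python
import random

def simulate_safe(state, advice, horizon, tries=1000):
    """Rejection-sampling stand-in for weighted SAT sampling: return a
    random length-`horizon` path satisfying the advice, plus its reward."""
    for _ in range(tries):
        path, s, reward = [state], state, 0.0
        for _ in range(horizon):
            a = random.choice(list(mdp[s]))
            s, r = sample(s, a)
            path += [a, s]
            reward += r
        if advice(path):
            return path, reward
    return None   # give up; callers can fall back to an unconstrained rollout
```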

9. MCTS under advice

10. MCTS under advice
[Figure: the unfolding pruned by the advice, with ✓/✗ marks on the leaves.]
Select actions in the unfolding pruned by a selection advice $\varphi$.
Simulation is restricted according to a simulation advice $\psi$.
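Putting the pieces together, here is one conceptual selection-plus-simulation step composed from the earlier sketches; this is our simplification, not the paper's exact pseudocode. The selection advice filters the actions ranked by UCB1, and the simulation advice constrains the rollout.

```python
def guided_step(node, selection_advice, simulation_advice, horizon):
    """One MCTS step under advice (sketch): prune, select, simulate."""
    allowed = [a for a in mdp[node.state]
               if compatible(node.state, a, selection_advice, horizon)]
    # Defensive fallback if the advice prunes everything at this node.
    action = max(allowed or list(mdp[node.state]),
                 key=lambda a: ucb1_index(node, a))
    nxt, reward = sample(node.state, action)
    safe = simulate_safe(nxt, simulation_advice, horizon - 1)
    reward += safe[1] if safe else rollout(nxt, horizon - 1)
    return action, reward   # back-propagated exactly as in plain MCTS
```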

11. MCTS under advice: convergence properties
With UCT (Kocsis & Szepesvári, 2006) as the selection strategy, the probability of choosing a suboptimal action converges to zero, and $v_i$ converges to the real value of $a_i$ at a speed of $(\log n)/n$.
These convergence properties are maintained:
- for every simulation advice;
- for every selection advice that is strongly enforceable and satisfies an optimality assumption: it does not prune all optimal actions.

12. Experimental results

13. Experimental results
Figure: 9 × 21 maze, 4 random ghosts.

| Algorithm                | % of win | % of loss | % of no result¹ | % of food eaten |
|--------------------------|----------|-----------|-----------------|-----------------|
| MCTS                     | 17       | 59        | 24              | 67              |
| MCTS + selection advice  | 25       | 54        | 21              | 71              |
| MCTS + simulation advice | 71       | 29        | 0               | 88              |
| MCTS + both advice       | 85       | 15        | 0               | 94              |
| Human                    | 44       | 56        | 0               | 75              |

¹ after 300 steps

14. Future work
- Compiler from LTL to symbolic advice.
- Study interactions with reinforcement learning techniques (and neural networks).
- Weighted advice.

Thank You
