
Monte-Carlo Tree Search
Michèle Sebag, TAO: Thème Apprentissage & Optimisation
Acknowledgments: Olivier Teytaud, Sylvain Gelly, Philippe Rolet, Romaric Gaudel
CP 2012
Foreword. Disclaimer 1: There is no shortage of tree-based ...


  1. Monte-Carlo Tree Search (Kocsis & Szepesvári, 06)
Gradually grow the search tree: iterate Tree-Walk.
Building blocks:
◮ Select next action: bandit phase
◮ Add a node: grow a leaf of the search tree
◮ Select next action bis: random phase, roll-out
◮ Compute instant reward: evaluate
◮ Update information in visited nodes: propagate
Returned solution: the path visited most often.
[Figure: search tree with bandit-based phase, new node, random phase, explored tree.]


  4. MCTS Algorithm: Main
Input: number N of tree-walks
Initialize search tree T ← initial state
For i = 1 to N: TreeWalk(T, initial state)
Return the most visited child node of the root node

  5. MCTS Algorithm, ctd: Tree walk
Input: search tree T, state s
Output: reward r
If s is not a leaf node
  Select a* = argmax { μ̂(s, a) : tr(s, a) ∈ T }
  r ← TreeWalk(T, tr(s, a*))
Else
  A_s = { admissible actions not yet visited in s }
  Select a* in A_s
  Add tr(s, a*) as a child node of s
  r ← RandomWalk(tr(s, a*))
End If
Update n_s, n_{s,a*} and μ̂_{s,a*}
Return r

  6. MCTS Algorithm, ctd: Random walk
Input: search tree T, state u
Output: reward r
A_rnd ← {}   // store the set of actions visited in the random phase
While u is not a final state
  Uniformly select an admissible action a for u
  A_rnd ← A_rnd ∪ {a}
  u ← tr(u, a)
EndWhile
r = Evaluate(u)   // reward vector of the tree-walk
Return r
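The pseudocode on slides 4-6 can be gathered into a single sketch. The following Python is a minimal, hypothetical rendering, not the authors' implementation: the problem interface (admissible_actions, transition tr, is_final, evaluate) is assumed, and the bandit rule uses the UCB form of slide 22 with an arbitrary exploration constant.

```python
# Minimal MCTS sketch mirroring slides 4-6. The problem interface
# (admissible_actions, transition, is_final, evaluate) is an assumption.
import math
import random
from collections import defaultdict

class MCTS:
    def __init__(self, admissible_actions, transition, is_final, evaluate, c_e=1.0):
        self.actions = admissible_actions   # state -> list of actions
        self.tr = transition                # (state, action) -> next state
        self.is_final = is_final            # state -> bool
        self.evaluate = evaluate            # final state -> reward
        self.c_e = c_e                      # exploration constant
        self.children = defaultdict(dict)   # state -> {action: child state}
        self.n = defaultdict(int)           # visit count per (state, action)
        self.mu = defaultdict(float)        # empirical mean reward per (state, action)
        self.n_s = defaultdict(int)         # visit count per state

    def search(self, root, n_tree_walks):
        for _ in range(n_tree_walks):
            self.tree_walk(root)
        # Returned solution: the most visited child of the root.
        return max(self.children[root], key=lambda a: self.n[(root, a)])

    def tree_walk(self, s):
        if self.is_final(s):
            return self.evaluate(s)
        unvisited = [a for a in self.actions(s) if a not in self.children[s]]
        if unvisited:                       # grow a leaf, then roll out
            a = random.choice(unvisited)
            self.children[s][a] = self.tr(s, a)
            r = self.random_walk(self.children[s][a])
        else:                               # bandit phase: UCB over existing children
            a = max(self.children[s], key=lambda b: self.mu[(s, b)]
                    + self.c_e * math.sqrt(math.log(self.n_s[s]) / self.n[(s, b)]))
            r = self.tree_walk(self.children[s][a])
        self.n_s[s] += 1                    # propagate: update the visited indicators
        self.n[(s, a)] += 1
        self.mu[(s, a)] += (r - self.mu[(s, a)]) / self.n[(s, a)]
        return r

    def random_walk(self, u):
        while not self.is_final(u):         # roll-out with uniform random moves
            u = self.tr(u, random.choice(self.actions(u)))
        return self.evaluate(u)

# Usage (with a concrete problem interface):
#   mcts = MCTS(actions, tr, is_final, evaluate)
#   best_action = mcts.search(root_state, 10_000)
```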

  7. Monte-Carlo Tree Search: properties of interest
◮ Consistency: Pr(finding the optimal path) → 1 as the number of tree-walks goes to infinity
◮ Speed of convergence: can be exponentially slow (Coquelin & Munos 07)

  8. Comparative results
2012, MoGoTW: used for physiological measurements of human players
2012, MoGoTW: 7 wins out of 12 games against professional players and 9 wins out of 12 games against 6D players
2011, MoGoTW: 20 wins out of 20 games in 7×7 with minimal computer komi
2011, MoGoTW: first win against a pro (6D), H2, 13×13
2011, MoGoTW: first win against a pro (9P), H2.5, 13×13
2011, MoGoTW: first win against a pro in Blind Go, 9×9
2010, MoGoTW: gold medal in TAAI, all categories (19×19, 13×13, 9×9)
2009, MoGo: win against a pro (5P), 9×9 (black)
2009, MoGoTW: win against a pro (5P), 9×9 (black)
2008, MoGo: win against a pro (5P), 9×9 (white)
2007, MoGo: win against a pro (5P), 9×9 (blitz)
2009, MoGo: win against a pro (8P), 19×19, H9
2009, MoGo: win against a pro (1P), 19×19, H6
2008, MoGo: win against a pro (9P), 19×19, H7

  9. Overview
◮ Motivations
◮ Monte-Carlo Tree Search
◮ Multi-Armed Bandits
◮ Random phase
◮ Evaluation and Propagation
◮ Advanced MCTS
◮ Rapid Action Value Estimate
◮ Improving the rollout policy
◮ Using prior knowledge
◮ Parallelization
◮ Open problems
◮ MCTS and 1-player games
◮ MCTS and CP
◮ Optimization in expectation
◮ Conclusion and perspectives

  10. Action selection as a Multi-Armed Bandit problem (Lai & Robbins 85)
In a casino, one wants to maximize one's gains while playing (lifelong learning).
Exploration vs Exploitation dilemma:
◮ Play the best arm so far? Exploitation.
◮ But there might exist better arms... Exploration.

  11. The multi-armed bandit (MAB) problem
◮ K arms
◮ Each arm i gives reward 1 with probability μ_i, 0 otherwise
◮ Let μ* = max{ μ_1, ..., μ_K }, with Δ_i = μ* − μ_i
◮ At each time t, one selects an arm i*_t and gets a reward r_t
  n_{i,t} = Σ_{u=1}^{t} 1[i*_u = i]   (number of times arm i has been selected)
  μ̂_{i,t} = (1 / n_{i,t}) Σ_{u : i*_u = i} r_u   (average reward of arm i)
Goal: maximize Σ_{u=1}^{t} r_u, equivalently
Minimize Regret(t) = Σ_{u=1}^{t} (μ* − r_u) = t μ* − Σ_{i=1}^{K} n_{i,t} μ̂_{i,t} ≈ Σ_{i=1}^{K} n_{i,t} Δ_i

  12. The simplest approach: ε-greedy selection
At each time t:
◮ With probability 1 − ε, select the arm with best empirical reward: i*_t = argmax{ μ̂_{1,t}, ..., μ̂_{K,t} }
◮ Otherwise, select i*_t uniformly in {1, ..., K}
Regret(t) > ε t (1/K) Σ_i Δ_i   (linear in t)
Optimal regret rate: log(t)   (Lai & Robbins 85)
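A small sketch of ε-greedy selection on a toy Bernoulli bandit. The arm probabilities, the value of ε, and the incremental-mean bookkeeping are illustrative assumptions, not values from the slides.

```python
# epsilon-greedy arm selection (slide 12) on a toy 3-armed Bernoulli bandit.
import random

def epsilon_greedy(mu_hat, counts, eps=0.1):
    """Exploit the best empirical mean w.p. 1 - eps, otherwise explore uniformly."""
    if random.random() < eps or not any(counts):
        return random.randrange(len(mu_hat))
    return max(range(len(mu_hat)), key=lambda i: mu_hat[i])

p = [0.2, 0.5, 0.7]                       # toy arm success probabilities
mu_hat, counts = [0.0] * 3, [0] * 3
for t in range(10_000):
    i = epsilon_greedy(mu_hat, counts)
    r = 1 if random.random() < p[i] else 0
    counts[i] += 1
    mu_hat[i] += (r - mu_hat[i]) / counts[i]   # incremental empirical mean
print(counts, [round(m, 2) for m in mu_hat])   # exploration never stops: linear regret
```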

  13. Upper Confidence Bound (Auer et al. 2002)
Select i*_t = argmax_i { μ̂_{i,t} + √( C log(Σ_j n_{j,t}) / n_{i,t} ) }
[Figure: successive decisions alternating between Arm A and Arm B as their confidence intervals shrink.]
Decision rule: optimism in the face of the unknown!

  14. Upper Confidence Bound, continued
UCB achieves the optimal regret rate log(t).
Select i*_t = argmax_i { μ̂_{i,t} + c_e √( log(Σ_j n_{j,t}) / n_{i,t} ) }
Extensions and variants:
◮ Tuning c_e controls the exploration/exploitation trade-off
◮ UCB-tuned: take into account the standard deviation σ̂_{i,t} of μ̂_i:
  Select i*_t = argmax_i { μ̂_{i,t} + √( c_e (log(Σ_j n_{j,t}) / n_{i,t}) · min{ 1/4, σ̂²_{i,t} + c_e √( log(Σ_j n_{j,t}) / n_{i,t} ) } ) }
◮ Many-armed bandit strategies
◮ Extension of UCB to trees: UCT (Kocsis & Szepesvári, 06)
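For comparison with ε-greedy, here is a sketch of plain UCB selection (slides 13-14). The exploration constant c_e = 2 and the toy Bernoulli environment are arbitrary choices, not from the slides; the UCB-tuned variance term is omitted.

```python
# Plain UCB arm selection on a toy 3-armed Bernoulli bandit.
import math
import random

def ucb_select(mu_hat, counts, c_e=2.0):
    """argmax_i  mu_hat[i] + sqrt(c_e * log(total pulls) / counts[i])."""
    total = sum(counts)
    for i, n_i in enumerate(counts):
        if n_i == 0:                      # pull every arm once before using the bound
            return i
    return max(range(len(mu_hat)),
               key=lambda i: mu_hat[i] + math.sqrt(c_e * math.log(total) / counts[i]))

p = [0.2, 0.5, 0.7]
mu_hat, counts = [0.0] * 3, [0] * 3
for t in range(10_000):
    i = ucb_select(mu_hat, counts)
    r = 1 if random.random() < p[i] else 0
    counts[i] += 1
    mu_hat[i] += (r - mu_hat[i]) / counts[i]
print(counts)                             # most pulls concentrate on the best arm
```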

  15. Monte-Carlo Tree Search: Random phase
Recap: gradually grow the search tree by iterating Tree-Walks (bandit phase, add a node, random phase/roll-out, evaluate, propagate); the returned solution is the path visited most often.

  16. Random phase: roll-out policy
Monte-Carlo-based (Brügmann 93):
1. Until the goban is filled, add a stone (black or white, in turn) at a uniformly selected empty position
2. Compute r = Win(black)
3. The outcome of the tree-walk is r

  17. Random phase: roll-out policy, continued
Improvements?
◮ Put stones randomly in the neighborhood of a previous stone
◮ Put stones matching patterns (prior knowledge)
◮ Put stones optimizing a value function (Silver et al. 07)

  18. Evaluation and Propagation
The tree-walk returns an evaluation r = win(black).
Propagate: for each node (s, a) on the tree-walk,
  n_{s,a} ← n_{s,a} + 1
  μ̂_{s,a} ← μ̂_{s,a} + (1 / n_{s,a}) (r − μ̂_{s,a})

  19. Evaluation and Propagation, continued
The tree-walk returns an evaluation r = win(black); propagate as above.
Variants (Kocsis & Szepesvári, 06):
  μ̂_{s,a} ← min{ μ̂_x : x child of (s, a) }  if (s, a) is a black node
  μ̂_{s,a} ← max{ μ̂_x : x child of (s, a) }  if (s, a) is a white node
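A tiny sketch of the propagation step above (basic incremental-mean variant only; the min/max back-up variant is not shown). The dictionary-based bookkeeping and names are illustrative.

```python
# Incremental update of visit counts and empirical means along the visited path.
def propagate(path, r, n, mu):
    """path: list of (state, action) pairs visited during the tree-walk.
    n, mu: dicts of visit counts and empirical mean rewards per (state, action)."""
    for s, a in path:
        n[(s, a)] = n.get((s, a), 0) + 1
        mu[(s, a)] = mu.get((s, a), 0.0) + (r - mu.get((s, a), 0.0)) / n[(s, a)]

n, mu = {}, {}
propagate([("root", "a1"), ("s1", "a2")], r=1.0, n=n, mu=mu)
print(n, mu)
```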

  20. Dilemma
◮ Smarter roll-out policy → more computationally expensive → fewer tree-walks within a given budget
◮ Frugal roll-out → more tree-walks → more confident evaluations

  21. Overview: Motivations; Monte-Carlo Tree Search; Multi-Armed Bandits; Random phase; Evaluation and Propagation; Advanced MCTS; Rapid Action Value Estimate; Improving the rollout policy; Using prior knowledge; Parallelization; Open problems; MCTS and 1-player games; MCTS and CP; Optimization in expectation; Conclusion and perspectives

  22. Action selection revisited
Select a* = argmax_a { μ̂_{s,a} + c_e √( log(n_s) / n_{s,a} ) }
◮ Asymptotically optimal
◮ But visits the tree infinitely often! (Being purely greedy is excluded: it is not consistent.)
Frugal and consistent (Berthier et al. 2010), see the sketch after this slide:
  Select a* = argmax_a ( Nb_win(s, a) + 1 ) / ( Nb_loss(s, a) + 2 )
Further directions:
◮ Optimizing the action selection rule (Maes et al., 11)
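A sketch of the frugal rule above, read literally from the slide; the statistics dictionary and names are illustrative.

```python
# Frugal, consistent action selection: argmax of (wins + 1) / (losses + 2).
def frugal_score(n_win, n_loss):
    return (n_win + 1) / (n_loss + 2)

def select_action(stats):
    """stats: dict action -> (n_win, n_loss)."""
    return max(stats, key=lambda a: frugal_score(*stats[a]))

print(select_action({"a": (3, 1), "b": (10, 8), "c": (0, 0)}))   # -> "a"
```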

  23. Controlling the branching factor
What if there are many arms? The selection rule degenerates into pure exploration.
◮ Continuous heuristics: use a small exploration constant c_e
◮ Discrete heuristics: Progressive Widening (Coulom 06; Rolet et al. 09)
  Limit the number of considered actions to ⌊ n(s)^{1/b} ⌋ (usually b = 2 or 4); see the sketch after this slide.
  [Figure: number of considered actions vs number of iterations (staircase).]
  Introduce a new action when ⌊ (n(s) + 1)^{1/b} ⌋ > ⌊ n(s)^{1/b} ⌋ (which one? See RAVE, below).
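A sketch of the progressive-widening test, assuming the b-th-root reading of the slide's formula; b = 2 and the names are illustrative.

```python
# Progressive widening: allow at most floor(n(s) ** (1/b)) actions in a node.
import math

def allowed_actions(n_s, b=2):
    """Maximum number of distinct actions considered after n_s visits."""
    return math.floor(n_s ** (1.0 / b))

def should_widen(n_s, b=2):
    """True when the (n_s + 1)-th visit unlocks one more action."""
    return allowed_actions(n_s + 1, b) > allowed_actions(n_s, b)

# With b = 2 the k-th action is unlocked around k**2 visits (1, 4, 9, 16, ...).
for n in range(20):
    if should_widen(n):
        print(f"after visit {n + 1}: consider {allowed_actions(n + 1)} actions")
```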

  24. RAVE: Rapid Action Value Estimate (Gelly & Silver 07)
Motivation:
◮ It takes some time to decrease the variance of μ̂_{s,a}
◮ Can one generalize across the tree?
RAVE(s, a) = average{ μ̂(s', a) : s parent of s' }, maintained both locally (RAVE_ℓ) and globally over the whole tree (RAVE_g).
[Figure: occurrences of action a in the subtrees below s.]

  25. Rapid Action Value Estimate, 2
Using RAVE for action selection: in the action selection rule, replace μ̂_{s,a} by
  α μ̂_{s,a} + (1 − α) ( β RAVE_ℓ(s, a) + (1 − β) RAVE_g(s, a) )
  with α = n_{s,a} / (n_{s,a} + c_1) and β = n_{parent(s)} / (n_{parent(s)} + c_2)   (see the sketch after this slide)
Using RAVE with Progressive Widening:
◮ PW: introduce a new action when ⌊ (n(s) + 1)^{1/b} ⌋ > ⌊ n(s)^{1/b} ⌋
◮ Select promising actions: it takes time to recover from bad ones
◮ Select the argmax of RAVE_ℓ(parent(s)).
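A sketch of the blended score above. The constants c1, c2 and the α/β pairing follow the reconstruction of the slide's formula and are assumptions; all names are illustrative.

```python
# RAVE-blended action score: mix the node's empirical mean with local and global RAVE.
def rave_score(mu_sa, n_sa, rave_local, rave_global, n_parent, c1=50.0, c2=50.0):
    alpha = n_sa / (n_sa + c1)            # weight shifts to mu_hat as n_sa grows
    beta = n_parent / (n_parent + c2)     # weight shifts to local RAVE as the parent is visited
    return alpha * mu_sa + (1 - alpha) * (beta * rave_local + (1 - beta) * rave_global)

# Early on the score is dominated by RAVE, later by the direct estimate mu_hat.
print(rave_score(mu_sa=0.9, n_sa=2,   rave_local=0.4, rave_global=0.5, n_parent=10))
print(rave_score(mu_sa=0.9, n_sa=500, rave_local=0.4, rave_global=0.5, n_parent=10))
```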

  26. A limit of RAVE
◮ RAVE brings information from the bottom of the tree to the top
◮ Sometimes harmful: B2 is the only good move for White, but B2 only makes sense as the first move (not in the subtrees) ⇒ RAVE rejects B2.

  27. Improving the roll-out policy π
◮ π_0: put stones uniformly in empty positions
◮ π_random: put stones uniformly in the neighborhood of a previous stone
◮ π_MoGo: put stones matching patterns (prior knowledge)
◮ π_RLGO: put stones optimizing a value function (Silver et al. 07)
Beware! (Gelly & Silver 07): π better than π' does not imply MCTS(π) better than MCTS(π').

  28. Improving the roll-out policy π, continued
[Figure: evaluation error on 200 test cases, π_RLGO against π_random and π_RLGO against π_MoGo.]

  29. Interpretation
What matters: being biased is more harmful than being weak.
Introducing a stronger but biased roll-out policy π is detrimental: if there exist situations where you (wrongly) think you are in good shape, then you go there, and you end up in bad shape.

  30. Using prior knowledge
Assume a value function Q_prior(s, a). Then, when action a is first considered in state s, initialize (see the sketch after this slide):
  n_{s,a} = n_prior(s, a)   (equivalent experience, reflecting the confidence in the prior)
  μ̂_{s,a} = Q_prior(s, a)
The best of both worlds:
◮ speeds up the discovery of good moves
◮ does not prevent identifying their weaknesses.
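A small sketch of the prior-based initialization above; Q_prior and n_prior stand for whatever heuristic is available and are placeholders, not functions defined in the slides.

```python
# Seed a newly expanded node with a prior value and an equivalent experience count.
def init_node(s, a, n, mu, q_prior, n_prior):
    """Initialize (s, a) as if it had been visited n_prior times with mean q_prior."""
    n[(s, a)] = n_prior(s, a)
    mu[(s, a)] = q_prior(s, a)

n, mu = {}, {}
init_node("root", "a1", n, mu,
          q_prior=lambda s, a: 0.6,    # e.g. a pattern-based estimate
          n_prior=lambda s, a: 20)     # confidence expressed as pseudo-visits
print(n, mu)
# Subsequent updates use the usual incremental mean, so real evidence
# gradually overrides the prior.
```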

  31. Overview: Motivations; Monte-Carlo Tree Search; Multi-Armed Bandits; Random phase; Evaluation and Propagation; Advanced MCTS; Rapid Action Value Estimate; Improving the rollout policy; Using prior knowledge; Parallelization; Open problems; MCTS and 1-player games; MCTS and CP; Optimization in expectation; Conclusion and perspectives

  32. Parallelization, 1: distributing the roll-outs
[Figure: roll-outs distributed over computational nodes 1 to k.]
Distributing the roll-outs over different computational nodes does not work.

  33. Parallelization, 2: with shared memory
[Figure: computational nodes 1 to k sharing one tree.]
◮ Launch tree-walks in parallel on the same MCTS tree
◮ (Micro-)lock the indicators during each tree-walk update
◮ Use virtual updates to enforce the diversity of the tree-walks.
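A rough sketch of the shared-memory scheme above, not the authors' code: each in-flight walk temporarily counts as an unrewarded visit on the nodes it traverses ("virtual updates"), so concurrent walks spread over the tree. All names and the locking granularity are assumptions.

```python
import threading

lock = threading.Lock()
n, mu, virtual = {}, {}, {}          # real visits, empirical means, in-flight visits

def begin_walk(path):
    with lock:                       # micro-lock while touching the indicators
        for key in path:
            virtual[key] = virtual.get(key, 0) + 1

def end_walk(path, r):
    with lock:
        for key in path:
            virtual[key] -= 1        # retract the virtual update
            n[key] = n.get(key, 0) + 1
            mu[key] = mu.get(key, 0.0) + (r - mu.get(key, 0.0)) / n[key]

def selection_stats(key):
    """Counts/means as seen by the bandit rule: virtual visits count as reward 0."""
    n_eff = n.get(key, 0) + virtual.get(key, 0)
    mu_eff = (mu.get(key, 0.0) * n.get(key, 0)) / n_eff if n_eff else 0.0
    return n_eff, mu_eff

path = [("root", "a1"), ("s1", "a2")]
begin_walk(path)
print(selection_stats(("root", "a1")))   # (1, 0.0): discouraged while in flight
end_walk(path, r=1.0)
print(selection_stats(("root", "a1")))   # (1, 1.0) after the real update
```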

  34. Parallelization, 3: without shared memory
[Figure: one tree per computational node.]
◮ Launch one MCTS per computational node
◮ k times per second (e.g. k = 3):
  ◮ select the nodes with a sufficient number of simulations (> 0.05 × total number of simulations)
  ◮ aggregate their indicators across nodes
Good news: parallelization with and without shared memory can be combined.

  35. It works! 32 cores against N cores:
  N cores   Winning rate on 9×9   Winning rate on 19×19
  1         75.8 ± 2.5            95.1 ± 1.4
  2         66.3 ± 2.8            82.4 ± 2.7
  4         62.6 ± 2.9            73.5 ± 3.4
  8         59.6 ± 2.9            63.1 ± 4.2
  16        52 ± 3.0              63 ± 5.6
  32        48.9 ± 3.0            48 ± 10
Then:
◮ Try a bigger machine, and win against top professional players!
◮ Not so simple... there are diminishing returns.

  36. Increasing the number N of tree-walks: 2N tree-walks against N
  N         Winning rate on 9×9   Winning rate on 19×19
  1,000     71.1 ± 0.1            90.5 ± 0.3
  4,000     68.7 ± 0.2            84.5 ± 0.3
  16,000    66.5 ± 0.9            80.2 ± 0.4
  256,000   61 ± 0.2              58.5 ± 1.7

  37. The limits of parallelization (R. Coulom)
Improvement in terms of performance against humans
  ≪ improvement in terms of performance against computers
  ≪ improvement in terms of self-play.

  38. Overview: Motivations; Monte-Carlo Tree Search; Multi-Armed Bandits; Random phase; Evaluation and Propagation; Advanced MCTS; Rapid Action Value Estimate; Improving the rollout policy; Using prior knowledge; Parallelization; Open problems; MCTS and 1-player games; MCTS and CP; Optimization in expectation; Conclusion and perspectives

  39-46. Failure: Semeai
[Figure slides: a semeai (capturing race) position, shown over several boards.]

  47. Failure: Semeai. Why does it fail?
◮ The first simulation gives 50%
◮ The following simulations give 100% or 0%
◮ But MCTS tries other moves: it does not see that all moves on the black side are equivalent.

  48. Implication 1: MCTS does not detect invariance → too short-sighted, and parallelization does not help.

  49. Implication 2: MCTS does not build abstractions → too short-sighted, and parallelization does not help.

  50. Overview: Motivations; Monte-Carlo Tree Search; Multi-Armed Bandits; Random phase; Evaluation and Propagation; Advanced MCTS; Rapid Action Value Estimate; Improving the rollout policy; Using prior knowledge; Parallelization; Open problems; MCTS and 1-player games; MCTS and CP; Optimization in expectation; Conclusion and perspectives

  51. MCTS for one-player games
◮ The MineSweeper problem
◮ Combining CSP and MCTS

  52-55. Motivation (MineSweeper example, built up over four slides)
◮ All locations have the same probability of death, 1/3
◮ Are all moves then equivalent? NO!
◮ Top, Bottom: win with probability 2/3
◮ MYOPIC approaches LOSE.

  56. MineSweeper: state of the art
◮ Markov Decision Process: exact but very expensive; 4×4 is solved
◮ Single Point Strategy (SPS): local solver
◮ CSP:
  ◮ each unknown location j is a variable x[j]
  ◮ each visible location gives a constraint, e.g. loc(15) = 4 → x[04] + x[05] + x[06] + x[14] + x[16] + x[24] + x[25] + x[26] = 4
  ◮ find all N solutions
  ◮ P(mine in j) = (number of solutions with a mine in j) / N
  ◮ play j with minimal P(mine in j).
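A sketch of the CSP idea above: enumerate the mine assignments consistent with the visible counts and play the cell with the lowest mine probability. Brute force is used here for a toy frontier; a real CSP solver would propagate constraints instead. The board layout is an invented example.

```python
# Mine probabilities by enumerating assignments consistent with the constraints.
from itertools import product

def mine_probabilities(unknown, constraints):
    """unknown: list of cell ids; constraints: list of (cells, required_count).
    Assumes the constraints are satisfiable (at least one solution)."""
    counts = {j: 0 for j in unknown}
    n_solutions = 0
    for bits in product((0, 1), repeat=len(unknown)):
        assign = dict(zip(unknown, bits))
        if all(sum(assign[c] for c in cells) == k for cells, k in constraints):
            n_solutions += 1
            for j in unknown:
                counts[j] += assign[j]
    return {j: counts[j] / n_solutions for j in unknown}

# Toy frontier: a visible "1" touching {a, b} and a visible "1" touching {a, b, c}.
probs = mine_probabilities(["a", "b", "c"], [(["a", "b"], 1), (["a", "b", "c"], 1)])
print(probs)   # {'a': 0.5, 'b': 0.5, 'c': 0.0}: cell c is the safe move
```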

  57. Constraint Satisfaction for MineSweeper
State of the art:
◮ 80% success, beginner (9×9, 10 mines)
◮ 45% success, intermediate (16×16, 40 mines)
◮ 34% success, expert (30×40, 99 mines)
PROS: very fast.
CONS: not optimal; beware of the first move (opening book).

  58. Upper Confidence Tree for MineSweeper (Couëtoux & Teytaud 11)
◮ Cannot compete with CSP in terms of speed
◮ But consistent (finds the optimal solution if given enough time)
Lesson learned: the initial move matters, and UCT improves on CSP:
◮ 3×3, 7 mines
◮ optimal winning rate: 25%
◮ optimal winning rate with a uniform initial move: 17/72
◮ UCT improves on CSP by 1/72.

  59. UCT for MineSweeper: another example
◮ 5×5, 15 mines
◮ GnoMine rule (the first move gets a 0)
◮ if the first move is the center, the optimal winning rate is 100%
◮ UCT finds it; CSP does not.

  60. The best of both worlds
CSP: fast, but suboptimal (myopic).
UCT: needs a generative model; asymptotically optimal.
Hybrid: UCT with a generative model based on CSP.

  61. UCT needs a generative model
Given a state and an action, simulate the possible transitions.
[Figure: initial state, play top left, probabilistic transitions.]
Simulating transitions:
◮ by rejection (draw mines and check consistency): SLOW
◮ using CSP: FAST.

  62. The algorithm: Belief State Sampler UCT (BSSUCT)
◮ One node created per simulation/tree-walk
◮ Progressive widening
◮ Evaluation by Monte-Carlo simulation
◮ Action selection: UCB-tuned (with variance)
◮ Monte-Carlo moves (see the sketch after this slide):
  ◮ if possible, Single Point Strategy (can propose riskless moves, if any)
  ◮ otherwise, a move with null probability of mine (CSP-based)
  ◮ otherwise, with probability 0.7, the move with minimal probability of mine (CSP-based)
  ◮ otherwise, draw a hidden state compatible with the current observation (CSP-based) and play a safe move.
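A sketch of the Monte-Carlo move cascade described above. The stubs sps_move, csp_zero_prob_move, csp_min_prob_move and csp_sample_safe_move are hypothetical stand-ins for the SPS and CSP-based routines the slide refers to.

```python
# Priority cascade for roll-out moves in the Belief State Sampler UCT sketch.
import random

def mc_move(obs, sps_move, csp_zero_prob_move, csp_min_prob_move, csp_sample_safe_move):
    move = sps_move(obs)                      # 1. riskless move from Single Point Strategy
    if move is not None:
        return move
    move = csp_zero_prob_move(obs)            # 2. a cell with zero mine probability
    if move is not None:
        return move
    if random.random() < 0.7:                 # 3. w.p. 0.7, minimal mine probability
        return csp_min_prob_move(obs)
    return csp_sample_safe_move(obs)          # 4. sample a hidden state, play a safe move

# Trivial stubs, just to exercise the cascade:
print(mc_move(obs=None,
              sps_move=lambda o: None,
              csp_zero_prob_move=lambda o: None,
              csp_min_prob_move=lambda o: (0, 0),
              csp_sample_safe_move=lambda o: (1, 1)))
```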

  63. The results
◮ BSSUCT: Belief State Sampler UCT
◮ CSP-PGMS: CSP + initial moves in the corners
[Figure: comparative results of BSSUCT and CSP-PGMS.]

  64. Partial conclusion
Given a myopic solver, it can be combined with MCTS/UCT, yielding significant (though costly) improvements.

  65. Overview: Motivations; Monte-Carlo Tree Search; Multi-Armed Bandits; Random phase; Evaluation and Propagation; Advanced MCTS; Rapid Action Value Estimate; Improving the rollout policy; Using prior knowledge; Parallelization; Open problems; MCTS and 1-player games; MCTS and CP; Optimization in expectation; Conclusion and perspectives

  66. Active Learning: position of the problem
Supervised learning, the setting:
◮ target hypothesis h*
◮ training set E = { (x_i, y_i), i = 1..n }
◮ learn h_n from E
Criteria:
◮ Consistency: h_n → h* when n → ∞
◮ Sample complexity: the number n_ε of examples needed to reach the target with precision ε, i.e. such that ||h_n − h*|| < ε.

  67. Active Learning: definition
Passive learning: i.i.d. examples E = { (x_i, y_i), i = 1..n }.
Active learning: x_{n+1} is selected depending on { (x_i, y_i), i = 1..n }.
In the best case, the improvement is exponential.
[Figure.]

  68. A motivating application: Numerical Engineering
◮ Large codes
◮ Computationally heavy (∼ days)
◮ Not fool-proof
Example: Inertial Confinement Fusion (ICF).

  69. Goal
Simplified models:
◮ approximate answers
◮ ... for a fraction of the computational cost
◮ speed up the design cycle
◮ optimal design
More is Different.

  70. Active Learning as a Game (Ph. Rolet, 2010)
Notation: E training data set; A machine learning algorithm; Z set of instances; T time horizon; Err generalization error.
Optimization problem: find the sampling strategy σ : E ↦ Z minimizing the expected generalization error,
  F* = argmin_σ  E_{h ∼ A(E, σ, T)} [ Err(h, σ, T) ]
Bottlenecks:
◮ combinatorial optimization problem
◮ the generalization error is unknown.

  71. Where is the game?
◮ Wanted: a good strategy to find, as accurately as possible, the true target concept.
◮ If this is a game, you play it only once!
◮ But you can train...
Training game: iterate
◮ Draw a possible goal (a fake target concept h*); use it as the oracle
◮ Try a policy: a sequence of instances E_{h*,T} = { (x_1, h*(x_1)), ..., (x_T, h*(x_T)) }
◮ Evaluate: learn h from E_{h*,T}; Reward = ||h − h*||.
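A hedged sketch of the training game above. Everything concrete here is a toy assumption: draw_fake_target, sampling_policy, learn and distance are stand-ins for whatever hypothesis space, sampling strategy and learner are actually used, and the distance is an empirical disagreement rate standing in for ||h − h*||.

```python
# Training game: evaluate a sampling policy against randomly drawn fake targets.
import random

def training_game(draw_fake_target, sampling_policy, learn, distance, T, n_episodes):
    rewards = []
    for _ in range(n_episodes):
        h_star = draw_fake_target()            # fake target concept, used as the oracle
        examples = []
        for _ in range(T):
            x = sampling_policy(examples)      # choose the next instance from past labels
            examples.append((x, h_star(x)))    # query the (fake) oracle
        h = learn(examples)
        rewards.append(distance(h, h_star))    # evaluation of this episode
    return rewards

# Toy instantiation: learn a threshold classifier on [0, 1] from noiseless labels.
def draw_fake_target():
    t = random.random()
    return lambda x: int(x >= t)

def sampling_policy(examples):                 # uniform (passive) sampling baseline
    return random.random()

def learn(examples):
    ones = [x for x, y in examples if y == 1]
    t_hat = min(ones) if ones else 1.0
    return lambda x: int(x >= t_hat)

def distance(h, h_star, grid=1000):            # empirical disagreement rate
    return sum(h(i / grid) != h_star(i / grid) for i in range(grid)) / grid

print(sum(training_game(draw_fake_target, sampling_policy, learn, distance,
                        T=20, n_episodes=50)) / 50)
```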
