  1. MCTS Extensions 2/15/17

  2. The Monte Carlo Tree Search Algorithm

  3. MCTS Pseudocode

     for i = 1 : rollouts
         node = root
         init path containing root
         # selection
         while all children expanded and node not terminal:
             node = UCB_sample(node)
             add node to path
         # expansion
         if node not terminal:
             node = expand(random unexpanded child of node)
             add node to path
         # simulation
         outcome = random_playout(node's state)
         # backpropagation
         for each node in the path:
             update node's value and visits
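A runnable Python sketch of this loop. This is a minimal illustration, not the course's reference implementation: the Node class and the state interface (legal_moves(), apply(), is_terminal(), utility()) are assumptions made for the example.

import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.untried = list(state.legal_moves())  # moves not yet expanded
        self.visits = 0
        self.value = 0.0                          # running mean of outcomes

def ucb_sample(node, c=5.0):
    # Weight each child by its mean value plus an exploration bonus,
    # then sample in proportion to the weights (as on the slides below).
    weights = [ch.value + c * math.sqrt(math.log(node.visits) / ch.visits)
               for ch in node.children]
    weights = [max(w, 1e-6) for w in weights]  # guard: keep weights positive
    return random.choices(node.children, weights=weights)[0]

def random_playout(state):
    # Default policy: play uniformly at random to a terminal state.
    while not state.is_terminal():
        state = state.apply(random.choice(state.legal_moves()))
    return state.utility()

def mcts(root, rollouts):
    for _ in range(rollouts):
        node, path = root, [root]
        # selection
        while not node.untried and not node.state.is_terminal():
            node = ucb_sample(node)
            path.append(node)
        # expansion
        if not node.state.is_terminal():
            move = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.state.apply(move), parent=node)
            node.children.append(child)
            node = child
            path.append(node)
        # simulation
        outcome = random_playout(node.state)
        # backpropagation
        for n in path:
            n.visits += 1
            n.value += (outcome - n.value) / n.visits
    # one common choice: return the most-visited child (see slide 10)
    return max(root.children, key=lambda ch: ch.visits)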

  4. Selection Expansion Simulation Backpropagation
     [tree diagram: the first rollout traced through the four phases, with each node's visit count and value]

  5. Selection Expansion Simulation Backpropagation
     [tree diagram: the tree after a second rollout, with updated visit counts and values]

  6. Selection Expansion Simulation Backpropagation
     [tree diagram: the tree after a third rollout, with updated visit counts and values]

  7. Selection Expansion Simulation Backpropagation
     With C = 5.0 and each of the three children visited once, the sampling weights are w_i = v_i + 5 * sqrt(ln(3) / n_i), giving weights = [7.24, 5.24, 6.24] and distribution = [.39, .28, .33].
     [tree diagram with each node's visit count and value]

  8. Selection Expansion Simulation Backpropagation
     After another rollout the weights become w_i = v_i + 5 * sqrt(ln(4) / n_i), giving weights = [7.89, 5.89, 6.45] and distribution = [.39, .29, .32].
     [tree diagram with each node's visit count and value]

  9. Exercise: construct the UCB distribution
     Parent: 19 visits, value .45. Children: (5 visits, value .6), (3 visits, value .5), (8 visits, value .75), (2 visits, value 0).
     Answer (with C = 2): weights = [2.13, 2.48, 1.96, 2.43], probs = [0.24, 0.28, 0.22, 0.27].
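Checking the exercise arithmetic in Python (a sketch; the exploration constant C = 2 is inferred from the answer, since it reproduces all four weights):

import math

C = 2.0
parent_visits = 19
children = [(5, 0.60), (3, 0.50), (8, 0.75), (2, 0.00)]  # (visits, value)

weights = [v + C * math.sqrt(math.log(parent_visits) / n) for n, v in children]
probs = [w / sum(weights) for w in weights]

print([round(w, 2) for w in weights])  # [2.13, 2.48, 1.96, 2.43]
print([round(p, 2) for p in probs])    # [0.24, 0.28, 0.22, 0.27]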

  10. How do we pick a move?
      MCTS builds a tree, with visits and values for each node. How can we use this to pick a move?
      • Pick the highest-value move.
      • Pick the most-visited move.
      • Can we do both?
        • Use some weighted combination.
        • Keep simulating until they agree.
      [two example trees in which the highest-value child and the most-visited child differ]
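A short sketch of these selection rules, reusing the Node fields and mcts() from the earlier example (the until-they-agree loop is illustrative and can in principle run for a long time):

def best_by_value(root):
    return max(root.children, key=lambda ch: ch.value)

def best_by_visits(root):
    return max(root.children, key=lambda ch: ch.visits)

def best_move(root, extra_rollouts=100):
    # Keep simulating until the two criteria agree on the same child.
    while best_by_value(root) is not best_by_visits(root):
        mcts(root, extra_rollouts)
    return best_by_visits(root)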

  11. Generalizing MCTS Beyond UCT
      The tree policy returns a child node in the explored region of the tree. UCT uses a tree policy that draws samples according to UCB.
      The default policy returns a value estimate for a newly expanded node. UCT uses a default policy that completes a uniform random playout.

  12. Alternative tree policies
      Requirement: the tree policy needs to trade off exploration and exploitation.
      • Epsilon-greedy: pick a uniform random child with probability ε and the best child with probability (1 − ε). (Sketched below.)
        • We'll see this again soon.
      • Use UCB, but seed the tree with initial values:
        • from previous runs,
        • using a heuristic.
      • Other ideas?
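A minimal epsilon-greedy tree policy, as a drop-in replacement for UCB_sample in the pseudocode above (using the Node fields assumed in the earlier sketch):

import random

def epsilon_greedy_sample(node, epsilon=0.1):
    # Explore: uniform random child with probability epsilon.
    if random.random() < epsilon:
        return random.choice(node.children)
    # Exploit: otherwise the child with the highest mean value.
    return max(node.children, key=lambda ch: ch.value)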

  13. Alternative default policies
      Requirement: the default policy needs to run quickly and return a value estimate.
      • Use the board evaluation heuristic from bounded minimax.
      • Run multiple random rollouts for each expanded node. (Sketched below.)
      • Other ideas?
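A sketch of the multiple-rollouts option, averaging several random playouts into one value estimate (random_playout is from the pseudocode sketch above; k is an illustrative parameter):

def averaged_playout(state, k=5):
    # Average k independent random playouts for a lower-variance estimate,
    # at the cost of k times the simulation work per expansion.
    return sum(random_playout(state) for _ in range(k)) / k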

  14. Exercise: extend MCTS to these games
      • How can MCTS handle non-zero-sum games?
      • How can MCTS handle games with randomness?

  15. Non-Zero-Sum Games
      Key idea: store a value tuple with the average utility for each player.
      • Each node now stores visits, children, and one value for each player.
      • The agent who's making a decision computes UCB weights using only their component of the value tuple.
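A sketch of the value-tuple bookkeeping (the class and the player_to_move() accessor are illustrative assumptions, extending the Node sketch above):

import math

class MultiPlayerNode:
    def __init__(self, state, num_players, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.values = [0.0] * num_players  # one running mean per player

    def update(self, outcome):
        # outcome is a utility tuple with one entry per player
        self.visits += 1
        for p, u in enumerate(outcome):
            self.values[p] += (u - self.values[p]) / self.visits

def ucb_weight(parent, child, c=5.0):
    # The deciding agent uses only their own component of the value tuple.
    p = parent.state.player_to_move()  # assumed state accessor
    return child.values[p] + c * math.sqrt(math.log(parent.visits) / child.visits)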

  16. Randomness in the Environment
      This is what Monte Carlo simulations were made for!
      • Whenever we hit a move-by-nature in the game tree, sample from nature's move distribution.
      • We still need to track value and visits for the nature node, so that the parent can make its choices.
      [diagram: a nature node N whose branches are sampled with probabilities .4 and .6]
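One way a nature node might fit into the selection step (is_nature and move_probs are assumed fields, not from the slides; ucb_sample is from the earlier sketch):

import random

def select_child(node):
    if node.is_nature:
        # Move-by-nature: sample a child from nature's move distribution
        # rather than from UCB weights. The nature node's value and visits
        # are still updated during backpropagation, so its parent's UCB
        # weights remain well defined.
        return random.choices(node.children, weights=node.move_probs)[0]
    return ucb_sample(node)  # ordinary decision node: UCB as before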
