Lecture 33: Reinforcement Learning for Two-Player Games

  1. Lecture 33: Reinforcement Learning for Two-Player Games. Mark Hasegawa-Johnson, 4/2020. CC-BY 4.0: you may remix or redistribute if you cite the source. [Figure: snapshot of a gnugo game, http://www.gnu.org/software/gnugo/]

  2. Outline
     • Review: minimax and alpha-beta
     • Move ordering: policy network
     • Evaluation function: value network
     • Training the value network
     • Exact training: endgames
     • Stochastic training: Monte Carlo tree search
     • Case study: AlphaGo

  3. Minimax games
     Let $s$ be the state of the game: a complete specification of the board, and a statement about whose turn it is.
     • If it’s the turn of the MAX player, and if $C(s)$ are the children of $s$ (the set of states reachable in one move), then the value of the board is $V(s) = \max_{s' \in C(s)} V(s')$.
     • If it’s MIN’s turn, then $V(s) = \min_{s' \in C(s)} V(s')$.
     [Figure: example game tree with numeric leaf values]
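The recursive definition of the board value on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-list tree encoding and the function name `minimax` are hypothetical choices, not taken from the lecture, and the example tree is made up rather than copied from the slide's figure.

```python
from typing import List, Union

# Hypothetical encoding (an assumption, not from the slides):
# a leaf is a number, an internal node is a list of child subtrees.
Tree = Union[float, List["Tree"]]


def minimax(state: Tree, is_max: bool) -> float:
    """Return the value of `state`, assuming MAX and MIN alternate every level."""
    if not isinstance(state, list):
        # Terminal state: its value is given directly.
        return state
    # Children are evaluated with the other player to move.
    values = [minimax(child, not is_max) for child in state]
    return max(values) if is_max else min(values)


# Made-up two-level tree: the root has two children, each with two leaves.
tree = [[3, 5], [2, 9]]
print(minimax(tree, is_max=True))   # MAX at root: max(min(3, 5), min(2, 9)) = 3
print(minimax(tree, is_max=False))  # MIN at root: min(max(3, 5), max(2, 9)) = 5
```

Every leaf is visited exactly once here, which is what makes exhaustive minimax expensive on deep trees.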

  7. Minimax complexity
     • $b$ = branching factor
     • $d$ = search depth
     • Complexity = $O\{b^d\}$
     [Figure: example game tree with minimax values filled in]

  8. Alpha-Beta Pruning
     Each node has two internal meta-parameters, initialized from its parent:
     • $\alpha$ = highest value that MAX knows how to force MIN to accept
     • $\beta$ = lowest value that MIN knows how to force MAX to accept
     • $\alpha \le \beta$
     • Initial values: $\alpha = -\infty$, $\beta = \infty$

  10. Alpha-Beta Pruning
      If $s$ is a MAX node, then for each child $s' \in C(s)$:
      • If you realize that $V(s') > \beta(s)$, then prune all remaining children of $s$: MIN will never let us reach this node.
      • Otherwise, if $V(s') > \alpha(s)$, then set $\alpha(s) = V(s')$. MIN might still choose $s$ (because $V(s') \le \beta(s)$), and then MAX can choose $s'$.

  12. Alpha-Beta Pruning
      If $s$ is a MIN node, then for each child $s' \in C(s)$:
      • If you realize that $V(s') < \alpha(s)$, then prune all remaining children of $s$: MAX will never let us reach this node.
      • Otherwise, if $V(s') < \beta(s)$, then set $\beta(s) = V(s')$. MAX might still choose $s$ (because $V(s') \ge \alpha(s)$), and then MIN can choose $s'$.
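The MAX-node and MIN-node rules can be combined into one recursive routine. The sketch below is a minimal illustration, not the lecture's own code: the nested-list tree encoding, the function name `alphabeta`, the `visited` list, and the example leaf values are all assumptions made for the example. It uses the strict inequalities stated on the slides, and each child starts from its parent's current bounds.

```python
import math


def alphabeta(state, is_max, alpha=-math.inf, beta=math.inf, visited=None):
    """Alpha-beta search on a nested-list tree (leaf = number, node = list).

    `visited`, if given, collects the leaves that were actually evaluated.
    The value returned at the root is exact; at interior nodes it may only
    be a bound when pruning has occurred.
    """
    if not isinstance(state, list):
        if visited is not None:
            visited.append(state)
        return state
    for child in state:
        # The child inherits the current bounds of its parent.
        v = alphabeta(child, not is_max, alpha, beta, visited)
        if is_max:
            if v > beta:
                return v          # prune: MIN will never let us reach this node
            if v > alpha:
                alpha = v         # MAX now knows how to force at least v
        else:
            if v < alpha:
                return v          # prune: MAX will never let us reach this node
            if v < beta:
                beta = v          # MIN now knows how to force at most v
    return alpha if is_max else beta


# Hypothetical tree: MIN to move at the root, three MAX children.
tree = [[6, 4, 3], [8, 4, 9], [7, 3, 6]]
seen = []
print(alphabeta(tree, is_max=False, visited=seen))  # 6
print(seen)                                         # [6, 4, 3, 8, 7]
```

In this run the first MAX child establishes an upper bound of 6 for the root, so the second and third MAX children are abandoned as soon as a leaf above 6 appears, and only 5 of the 9 leaves are evaluated.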

  17. Optimum node ordering
      Imagine you had an oracle who could tell you which node to evaluate first. Which one should you evaluate first?
      • Children of MAX nodes: evaluate the highest-value child first.
      • Children of MIN nodes: evaluate the lowest-value child first.
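To see why the oracle's ordering helps, the following sketch counts how many leaves alpha-beta evaluates before and after reordering. It is an illustration under assumptions: the "oracle" is simulated by running full minimax (which defeats the purpose in practice, as the next slides point out), and the names `oracle_order` and `leaves_visited`, the nested-list encoding, and the example tree are all hypothetical.

```python
import math


def minimax(state, is_max):
    """Exact value by exhaustive search; plays the role of the oracle."""
    if not isinstance(state, list):
        return state
    values = [minimax(c, not is_max) for c in state]
    return max(values) if is_max else min(values)


def oracle_order(state, is_max):
    """Reorder children best-first: highest value first under MAX nodes,
    lowest value first under MIN nodes."""
    if not isinstance(state, list):
        return state
    children = [oracle_order(c, not is_max) for c in state]
    return sorted(children, key=lambda c: minimax(c, not is_max), reverse=is_max)


def leaves_visited(state, is_max, alpha=-math.inf, beta=math.inf):
    """Run alpha-beta and return (value, number of leaves evaluated)."""
    if not isinstance(state, list):
        return state, 1
    count = 0
    for child in state:
        v, n = leaves_visited(child, not is_max, alpha, beta)
        count += n
        if is_max:
            if v > beta:
                return v, count   # prune remaining children
            alpha = max(alpha, v)
        else:
            if v < alpha:
                return v, count   # prune remaining children
            beta = min(beta, v)
    return (alpha if is_max else beta), count


# Made-up tree, MAX to move at the root, children deliberately in a poor order.
tree = [[2, 9], [5, 3]]
print(leaves_visited(tree, True))                      # (3, 4): no pruning
print(leaves_visited(oracle_order(tree, True), True))  # (3, 3): one leaf pruned
```

The value at the root is the same in both runs; only the amount of work changes.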

  18. Complexity of alpha-beta
      If nodes are optimally ordered, then for each node $s$ (with branching factor $b$), we evaluate:
      • The $b$ children of its first child.
      • The first child of each of its other $b - 1$ children.
      Total complexity: $2b - 1 = O\{b\}$ per two levels.
      • With $d$ levels, total complexity $= (2b - 1)^{d/2} = O\{b^{d/2}\}$.
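A quick numeric comparison shows how much the exponent halving matters. The sketch below simply evaluates the two expressions from the slides, the exhaustive minimax count (branching factor raised to the depth) and the optimally ordered alpha-beta estimate; the function name `leaf_counts` and the sample values are assumptions chosen for illustration, and the depth is restricted to even numbers so the "per two levels" argument applies directly.

```python
def leaf_counts(branching: int, depth: int) -> tuple:
    """Return (minimax count, alpha-beta estimate) using the slide's formulas.

    minimax:    branching ** depth
    alpha-beta: (2 * branching - 1) ** (depth / 2), assuming optimal ordering
    """
    if depth % 2 != 0:
        raise ValueError("this sketch assumes an even search depth")
    full = branching ** depth
    pruned = (2 * branching - 1) ** (depth // 2)
    return full, pruned


for depth in (2, 4, 6):
    print(depth, leaf_counts(3, depth))
# 2 (9, 5)
# 4 (81, 25)
# 6 (729, 125)
```

With a branching factor of 3, the estimate grows by a factor of 5 every two levels instead of 9, and the gap widens quickly as the depth increases.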

  19. Optimal node ordering???!!!
      How on Earth can we decide which child to evaluate first?
      • “Children of MAX nodes: evaluate the highest-value child first.”
      But if we knew which one had the highest value, we wouldn’t need to search the tree! We would already know the optimal move!

  20. Outline
      • Review: minimax and alpha-beta
      • Move ordering: policy network
      • Evaluation function: value network
      • Training the value network
      • Exact training: endgames
      • Stochastic training: Monte Carlo tree search
      • Case study: AlphaGo

  21. Optimal node ordering???!!!
      • If we knew which child had the highest value, we wouldn’t need to search the tree! We would already know the optimal move!
      • Solution: train a policy network, $\pi(s, a)$
