

slide-1
SLIDE 1

Lecture 33 – Reinforcement Learning for Two-Player Games

Mark Hasegawa-Johnson, 4/2020. CC-BY 4.0: you may remix or redistribute if you cite the source.

Snapshot of a gnugo game, http://www.gnu.org/software/gnugo/

slide-2
SLIDE 2

Outline

  • Review: minimax and alpha-beta
  • Move ordering: policy network
  • Evaluation function: value network
  • Training the value network
  • Exact training: endgames
  • Stochastic training: Monte Carlo tree search
  • Case study: AlphaGo
slide-3
SLIDE 3

Minimax games

Let $s$ be the state of the game: a complete specification of the board, and a statement about whose turn it is.

  • If it's the turn of the MAX player, and if $D(s)$ is the set of children of $s$ (the states reachable in one move), then the value of the board is $V(s) = \max_{s' \in D(s)} V(s')$
  • If it's MIN's turn, then $V(s) = \min_{s' \in D(s)} V(s')$

[Figure: an example game tree, with numeric leaf values at the bottom level.]
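For concreteness, here is a minimal recursive sketch of these two equations in Python. The helpers children(s) and utility(s) are hypothetical stand-ins for a real game implementation: children(s) returns $D(s)$ (empty at the end of the game), and utility(s) returns the value of a finished game.

    def minimax(s, max_to_move, children, utility):
        """V(s): max over children on MAX's turn, min on MIN's turn."""
        kids = children(s)            # D(s), the states reachable in one move
        if not kids:                  # end of game: use the actual outcome
            return utility(s)
        values = [minimax(c, not max_to_move, children, utility) for c in kids]
        return max(values) if max_to_move else min(values)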

slide-4
SLIDE 4

Minimax games

Let $s$ be the state of the game: a complete specification of the board, and a statement about whose turn it is.

  • If it's the turn of the MAX player, and if $D(s)$ is the set of children of $s$ (the states reachable in one move), then the value of the board is $V(s) = \max_{s' \in D(s)} V(s')$
  • If it's MIN's turn, then $V(s) = \min_{s' \in D(s)} V(s')$

[Figure: the same game tree; the bottom level of internal-node values has now been filled in.]

slide-6
SLIDE 6

Minimax games

Let $s$ be the state of the game: a complete specification of the board, and a statement about whose turn it is.

  • If it's the turn of the MAX player, and if $D(s)$ is the set of children of $s$ (the states reachable in one move), then the value of the board is $V(s) = \max_{s' \in D(s)} V(s')$
  • If it's MIN's turn, then $V(s) = \min_{s' \in D(s)} V(s')$

[Figure: the same game tree, with minimax values now propagated all the way up to the root.]

slide-7
SLIDE 7

Minimax complexity

$b$ = branching factor; $d$ = search depth. Complexity $= O\{b^d\}$.

[Figure: the fully evaluated game tree from the previous slide.]

slide-8
SLIDE 8

Alpha-Beta Pruning

Each node has two internal meta-parameters, initialized from its parent:

  • $\alpha$ = highest value that MAX knows how to force MIN to accept
  • $\beta$ = lowest value that MIN knows how to force MAX to accept
  • $\alpha \le \beta$
  • Initial values: $\alpha = -\infty$, $\beta = \infty$

[Figure: the example game tree; the root is initialized with $\alpha = -\infty$, $\beta = \infty$.]

slide-9
SLIDE 9

Alpha-Beta Pruning

Each node has two internal meta-parameters, initialized from its parent:

  • $\alpha$ = highest value that MAX knows how to force MIN to accept
  • $\beta$ = lowest value that MIN knows how to force MAX to accept
  • $\alpha \le \beta$
  • Initial values: $\alpha = -\infty$, $\beta = \infty$

[Figure: $\alpha = -\infty$, $\beta = \infty$ are copied from the root down to its descendants.]

slide-10
SLIDE 10

Alpha-Beta Pruning

If $s$ is a MAX node, then:

  • For each child $s' \in D(s)$:
  • If you realize that $V(s') > \beta(s)$, then prune all remaining children of $s$: MIN will never let us reach this node.
  • Otherwise, if $V(s') > \alpha(s)$, then set $\alpha(s) = V(s')$. MIN might still choose $s$ (because $V(s') \le \beta(s)$), and then MAX can choose $s'$.

[Figure: the first child has been evaluated to 6; all $\alpha$, $\beta$ values are still at their initial settings.]

slide-11
SLIDE 11

Alpha-Beta Pruning

If $s$ is a MAX node, then:

  • For each child $s' \in D(s)$:
  • If you realize that $V(s') > \beta(s)$, then prune all remaining children of $s$: MIN will never let us reach this node.
  • Otherwise, if $V(s') > \alpha(s)$, then set $\alpha(s) = V(s')$. MIN might still choose $s$ (because $V(s') \le \beta(s)$), and then MAX can choose $s'$.

[Figure: after evaluating the child of value 6, that MAX node's $\alpha$ is updated to 6.]

slide-12
SLIDE 12

Alpha-Beta Pruning

If $s$ is a MIN node, then:

  • For each child $s' \in D(s)$:
  • If you realize that $V(s') < \alpha(s)$, then prune all remaining children of $s$: MAX will never let us reach this node.
  • Otherwise, if $V(s') < \beta(s)$, then set $\beta(s) = V(s')$. MAX might still choose $s$ (because $V(s') \ge \alpha(s)$), and then MIN can choose $s'$.

[Figure: the MIN node above has updated its $\beta$ to 6.]

slide-13
SLIDE 13

Alpha-Beta Pruning

If $s$ is a MAX node, then:

  • For each child $s' \in D(s)$:
  • If you realize that $V(s') > \beta(s)$, then prune all remaining children of $s$: MIN will never let us reach this node.
  • Otherwise, if $V(s') > \alpha(s)$, then set $\alpha(s) = V(s')$. MIN might still choose $s$ (because $V(s') \le \beta(s)$), and then MAX can choose $s'$.

[Figure: a deeper MAX node is now known to have value $\ge 9 > \beta = 6$, so its remaining children are pruned (marked X).]

slide-14
SLIDE 14

Alpha-Beta Pruning

If $s$ is a MAX node, then:

  • For each child $s' \in D(s)$:
  • If you realize that $V(s') > \beta(s)$, then prune all remaining children of $s$: MIN will never let us reach this node.
  • Otherwise, if $V(s') > \alpha(s)$, then set $\alpha(s) = V(s')$. MIN might still choose $s$ (because $V(s') \le \beta(s)$), and then MAX can choose $s'$.

[Figure: a second node is likewise pruned (X) once its value is known to exceed $\beta = 6$.]

slide-15
SLIDE 15

Alpha-Beta Pruning

If $s$ is a MAX node, then:

  • For each child $s' \in D(s)$:
  • If you realize that $V(s') > \beta(s)$, then prune all remaining children of $s$: MIN will never let us reach this node.
  • Otherwise, if $V(s') > \alpha(s)$, then set $\alpha(s) = V(s')$. MIN might still choose $s$ (because $V(s') \le \beta(s)$), and then MAX can choose $s'$.

[Figure: with its first subtree fully resolved, the root updates its $\alpha$ to 6.]

slide-16
SLIDE 16

Alpha-Beta Pruning

If $s$ is a MIN node, then:

  • For each child $s' \in D(s)$:
  • If you realize that $V(s') < \alpha(s)$, then prune all remaining children of $s$: MAX will never let us reach this node.
  • Otherwise, if $V(s') < \beta(s)$, then set $\beta(s) = V(s')$. MAX might still choose $s$ (because $V(s') \ge \alpha(s)$), and then MIN can choose $s'$.

[Figure: the remaining subtrees inherit $\alpha = 6$; each MIN node is abandoned (X) as soon as it finds a child of value less than 6.]
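The MAX-node and MIN-node rules above translate almost line-for-line into code. Below is a minimal sketch, again with hypothetical children(s) and utility(s) helpers; the root call uses the initial values $\alpha = -\infty$, $\beta = \infty$ from the earlier slide.

    def alphabeta(s, alpha, beta, max_to_move, children, utility):
        """Minimax value of state s, with alpha-beta pruning.
        alpha = highest value MAX knows how to force MIN to accept.
        beta  = lowest value MIN knows how to force MAX to accept."""
        kids = children(s)
        if not kids:                      # terminal state
            return utility(s)
        if max_to_move:
            v = float('-inf')
            for c in kids:
                v = max(v, alphabeta(c, alpha, beta, False, children, utility))
                if v > beta:              # MIN will never let us reach s
                    return v              # prune remaining children
                alpha = max(alpha, v)     # MIN might still choose s
            return v
        else:
            v = float('inf')
            for c in kids:
                v = min(v, alphabeta(c, alpha, beta, True, children, utility))
                if v < alpha:             # MAX will never let us reach s
                    return v              # prune remaining children
                beta = min(beta, v)       # MAX might still choose s
            return v

    # usage: alphabeta(root, float('-inf'), float('inf'), True, children, utility)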

slide-17
SLIDE 17

Optimum node ordering

Imagine you had an oracle that could tell you which node to evaluate first. Which one should you evaluate first?

  • Children of MAX nodes: evaluate the highest-value child first.
  • Children of MIN nodes: evaluate the lowest-value child first.

[Figure: with this ordering, alpha-beta evaluates only a small fraction of the tree; pruned nodes are marked X.]

slide-18
SLIDE 18

Complexity of alpha-beta

If nodes are optimally ordered, then for each node $s$, we evaluate:

  • The $b$ children of its first child.
  • The first child of each of its other $b - 1$ children. Total complexity: $2b - 1 = O\{b\}$ per two levels.
  • With $d$ levels, total complexity $= (2b - 1)^{d/2} = O\{b^{d/2}\}$.

[Figure: the game tree; only the nodes marked "Evaluated" are visited.]
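To see what this buys us, here is a quick numeric check with an assumed (hypothetical) branching factor and depth:

    b, d = 30, 8                               # assumed branching factor and depth
    minimax_cost = b ** d                      # O{b^d}: 656,100,000,000 nodes
    alphabeta_cost = (2 * b - 1) ** (d // 2)   # O{b^(d/2)}: 12,117,361 nodes
    print(minimax_cost, alphabeta_cost)        # pruning saves a factor of ~54,000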

slide-19
SLIDE 19

Optimal node ordering???!!!

How on Earth can we decide which child to evaluate first?

  • "Children of MAX nodes: evaluate the highest-value child first." But if we knew which one had the highest value, we wouldn't need to search the tree! We would already know the optimal move!

[Figure: the optimally ordered alpha-beta search from the previous slide.]

slide-20
SLIDE 20

Outline

  • Review: minimax and alpha-beta
  • Move ordering: policy network
  • Evaluation function: value network
  • Training the value network
  • Exact training: endgames
  • Stochastic training: Monte Carlo tree search
  • Case study: AlphaGo
slide-21
SLIDE 21

Optimal node ordering???!!!

  • If we knew which child had the highest value, we wouldn't need to search the tree! We would already know the optimal move!
  • Solution: train a policy network, $\rho(s,a)$.
slide-22
SLIDE 22

Policy networks for two-player games

For example, the game of Go:

  • $s$ (state) is a vector of $19 \times 19 = 361$ positions, each of which is $1$ = black (MAX), $-1$ = white (MIN), or $0$ = empty.
  • $a$ (action) is the next move: the position on the board at which to place the next stone.

  • A neural net estimates $\rho_{MAX}(s,a)$ and $\rho_{MIN}(s,a)$, the probability that action $a$ is the best next move for MAX/MIN: $\rho_{MAX}(s,a) = \frac{\exp f_{MAX}(s,a)}{\sum_{a'} \exp f_{MAX}(s,a')}$

Snapshot of a gnugo game, http://www.gnu.org/software/gnugo/
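A minimal sketch of this normalization, assuming a hypothetical array `scores` of raw network outputs $f_{MAX}(s,a)$ for all 361 board positions, and a boolean mask `legal` marking the empty points:

    import numpy as np

    def policy(scores, legal):
        """Softmax over raw scores, restricted to legal moves:
        rho(s,a) = exp(f(s,a)) / sum_a' exp(f(s,a'))."""
        f = np.where(legal, scores, -np.inf)   # illegal moves get probability 0
        f -= f[legal].max()                    # shift for numerical stability
        e = np.exp(f)
        return e / e.sum()

    scores = np.random.randn(361)              # stand-in for network outputs
    legal = np.ones(361, dtype=bool)
    rho_max = policy(scores, legal)            # a distribution over the 361 moves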

slide-23
SLIDE 23

Optimal node ordering using a policy network

How on Earth can we decide which child to evaluate first?

  • Children of MAX nodes: evaluate first the child with the highest value of $\rho_{MAX}(s,a)$ (the probability that this child will be evaluated to have the highest value).
  • Children of MIN nodes: evaluate first the child with the highest value of $\rho_{MIN}(s,a)$ (the probability that this child will be evaluated to have the lowest value).

[Figure: the example game tree.]
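In code, this ordering is just a sort by the appropriate policy network. A sketch, where `rho` is whichever of $\rho_{MAX}$ or $\rho_{MIN}$ belongs to the player who moves at $s$:

    def ordered_moves(s, moves, rho):
        """Evaluate the most promising child first: sort the legal moves
        by the mover's policy probability, highest first."""
        return sorted(moves, key=lambda a: rho(s, a), reverse=True)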

slide-24
SLIDE 24

Hidden advantage: reduce the branching factor

  • The policy network can be used to order the moves, as on the previous slide.
  • The policy network can also be used to reduce the branching factor, from $b = 361$ (the complete branching factor in Go) to $b \approx 4$ or 5: just choose the 4 or 5 moves with the highest $\rho(s,a)$, as in the sketch below.
  • Russell & Norvig call this "heuristic minimax." It's not guaranteed to work, but it usually works.

Snapshot of a gnugo game, http://www.gnu.org/software/gnugo/ [Figure: a board position with the top candidate moves marked "?".]
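The same idea as a hard cut rather than a reordering, sketched with a hypothetical `rho`:

    def pruned_moves(s, moves, rho, k=5):
        """Heuristic minimax: search only the k moves the policy network
        likes best, shrinking the branching factor from len(moves) to k."""
        return sorted(moves, key=lambda a: rho(s, a), reverse=True)[:k]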

slide-25
SLIDE 25

Training the policy network

  • But how can we train $\rho(s,a)$?
  • Answer: Actor-Critic reinforcement learning! $V^*(s) = \sum_a \rho_{MAX}(s,a) R^*(s,a)$
  • Train $R^*(s,a)$ using deep Q-learning (play the game many times, gaining a reward each time you win).
  • Train $\rho_{MAX}(s,a)$ to maximize $V^*(s)$.
  • Train $\rho_{MIN}(s,a)$ to minimize $V^*(s)$.

Snapshot of a gnugo game, http://www.gnu.org/software/gnugo/
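A highly simplified, tabular sketch of this actor-critic loop, assuming small enumerable state and action sets indexed by dictionaries of arrays (a real system would use neural networks for both $R$ and $\rho$):

    import numpy as np

    def V_star(R, rho, s):
        """V*(s) = sum_a rho_MAX(s,a) R*(s,a)."""
        return float(np.dot(rho[s], R[s]))

    def critic_update(R, s, a, win, lr=0.1):
        """Q-learning-style update: move R(s,a) toward the observed reward."""
        R[s][a] += lr * (win - R[s][a])

    def actor_update(rho, R, s, lr=0.1, maximize=True):
        """Shift the policy toward high-R moves (MAX) or low-R moves (MIN)."""
        logits = np.log(rho[s] + 1e-12) + (lr if maximize else -lr) * R[s]
        e = np.exp(logits - logits.max())
        rho[s] = e / e.sum()               # renormalize to a valid distribution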

slide-26
SLIDE 26

Outline

  • Review: minimax and alpha-beta
  • Move ordering: policy network
  • Evaluation function: value network
  • Training the value network
  • Exact training: endgames
  • Stochastic training: Monte Carlo tree search
  • Case study: AlphaGo
slide-27
SLIDE 27

Complexity of alpha-beta

If nodes are optimally ordered, then with $d$ levels, total complexity $= (2b - 1)^{d/2} = O\{b^{d/2}\}$. ...but wait... A game of Go has up to 361 moves, each of which can take any of the available 361 points. $O\{361^{361/2}\}$ is very large...

[Figure: the pruned game tree from the earlier slides.]

slide-28
SLIDE 28

Limited-horizon game search

Instead of searching to the end of the game, we choose a depth $d$ that's within our computational resources. Then, at depth $d$, we call the value network $V^*(s)$ to estimate the probability that MAX wins from that position.

[Figure: the same pruned search tree, but the numbers at the horizon are not ends of the game: they are the outputs of the value network, $V^*(s)$, at those game positions.]
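In code, this is the alpha-beta sketch from earlier with two changes: a depth counter, and the value network at the horizon in place of the true game outcome. A sketch, assuming a hypothetical V_star(s) callable:

    def h_alphabeta(s, depth, alpha, beta, max_to_move, children, V_star):
        """Depth-limited alpha-beta: back up V*(s) at the horizon."""
        kids = children(s)
        if depth == 0 or not kids:         # horizon reached (or game over)
            return V_star(s)               # value network, not a game result
        if max_to_move:
            v = float('-inf')
            for c in kids:
                v = max(v, h_alphabeta(c, depth - 1, alpha, beta, False, children, V_star))
                if v > beta:
                    return v
                alpha = max(alpha, v)
            return v
        v = float('inf')
        for c in kids:
            v = min(v, h_alphabeta(c, depth - 1, alpha, beta, True, children, V_star))
            if v < alpha:
                return v
            beta = min(beta, v)
        return v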

slide-29
SLIDE 29

Training the value network

  • But how can we train $V^*(s)$?
  • Answer: Actor-Critic reinforcement learning! $V^*(s) = \sum_a \rho_{MAX}(s,a) R^*(s,a)$
  • Train $R^*(s,a)$ using deep Q-learning (play the game many times, gaining a reward each time you win).
  • Train $\rho_{MAX}(s,a)$ to maximize $V^*(s)$.
  • Train $\rho_{MIN}(s,a)$ to minimize $V^*(s)$.

Snapshot of a gnugo game, http://www.gnu.org/software/gnugo/

slide-30
SLIDE 30

Outline

  • Review: minimax and alpha-beta
  • Move ordering: policy network
  • Evaluation function: value network
  • Training the value network
  • Exact training: endgames
  • Stochastic training: Monte Carlo tree search
  • Case study: AlphaGo
slide-31
SLIDE 31

Endgames

  • π‘‰βˆ— 𝑑 can be exact when the game is near

its end. This situation is called β€œendgame.”

  • For example, in chess, if there are only

three pieces left, then there are just under 64+ = 2-. possible board positions. With two bytes to encode the value of each, that’s half a megabyte.

  • Thus we can bypass the neural net, in

favor of a lookup table.
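The arithmetic behind that claim:

    positions = 64 ** 3            # three pieces, each on one of 64 squares
    assert positions == 2 ** 18    # 262,144 board positions
    table_size = 2 * positions     # two bytes per position
    print(table_size)              # 524,288 bytes: half a megabyte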

slide-32
SLIDE 32

How to create an endgame table

  • Of the $2^{18}$ possible board positions, find all the terminal states (white checkmate, black checkmate, or draw).
  • Iterate minimax backward from the set of terminal states until you know the result for each of the $2^{18}$ board positions.
  • Computation is limited not by the search depth, but by the limited number of board positions in the table.
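A simplified retrograde-analysis sketch of this procedure. It assumes an acyclic game, with hypothetical helpers terminal_value(s) (the outcome at a terminal state, else None), children(s), and max_to_move(s); real chess tablebases must additionally handle repetitions and draw rules.

    def build_endgame_table(states, children, terminal_value, max_to_move):
        """Iterate minimax backward from the terminal states until every
        position in `states` has a known value."""
        V = {s: v for s in states
             if (v := terminal_value(s)) is not None}   # seed with terminal states
        changed = True
        while changed:
            changed = False
            for s in states:
                if s in V:
                    continue
                kids = children(s)
                if kids and all(c in V for c in kids):  # all successors resolved
                    vals = [V[c] for c in kids]
                    V[s] = max(vals) if max_to_move(s) else min(vals)
                    changed = True
        return V                                        # the lookup table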

slide-33
SLIDE 33

Outline

  • Review: minimax and alpha-beta
  • Move ordering: policy network
  • Evaluation function: value network
  • Training the value network
  • Exact training: endgames
  • Stochastic training: Monte Carlo tree search
  • Case study: AlphaGo
slide-34
SLIDE 34

Monte Carlo Tree Search

Suppose $s$ is too complicated for an endgame search. We still need to estimate its value and policy. How?

  • Selection: Run minimax forward a few steps, then use the value network to estimate the values of the nodes at the end of the tree. Select one of those nodes (call it $s$).
  • Expansion: Minimax one step further using action $a$.
  • Simulation: Play a random game, starting from node $(s,a)$. At each step, choose a move at random from the current policy network.
  • Backpropagation: Set $R_{local}(s,a)$ equal to the average win frequency of the random games starting from $(s,a)$.

After your training dataset gets large enough, re-train $R^*(s,a)$ with $R_{local}(s,a)$ as its target.
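A minimal sketch of the simulation and backpropagation steps, with hypothetical game helpers legal_moves(s), step(s, a), game_over(s), and max_wins(s), and the current policy network `rho`:

    import random
    from collections import defaultdict

    N = defaultdict(int)     # visit count of each (s, a)
    W = defaultdict(float)   # total wins observed after (s, a)

    def rollout(s, a, rho, legal_moves, step, game_over, max_wins):
        """Simulation: play one random game from (s, a), sampling each
        move from the current policy network."""
        s = step(s, a)
        while not game_over(s):
            moves = legal_moves(s)
            a2 = random.choices(moves, weights=[rho(s, m) for m in moves])[0]
            s = step(s, a2)
        return 1.0 if max_wins(s) else 0.0

    def backpropagate(s, a, win):
        """Update the running average win frequency R_local(s, a)."""
        N[(s, a)] += 1
        W[(s, a)] += win

    def R_local(s, a):
        return W[(s, a)] / N[(s, a)] if N[(s, a)] else 0.5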

slide-35
SLIDE 35

Monte Carlo Tree Search

Steps in Monte Carlo Tree Search. By Rmoss92 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=88889583

slide-36
SLIDE 36

Exploration vs. Exploitation

  • In order to gain information about the win probability of node $(s,a)$, you need to put some randomness into the game.
  • Exploration strategies from reinforcement learning, like epsilon-greedy, work well (see the sketch below).
  • AlphaGo used this strategy:
  • From a large database of human-vs-human games, train the initial "supervised learning" policy network, $\rho_{SL}(s,a)$.
  • From the same database, train another policy network that's the same, but with too few trainable parameters, hence less accurate. Call this the "rollout network," $\rho_{rollout}(s,a)$.
  • Use $\rho_{rollout}(s,a)$ to play games (its low accuracy adds randomness), and use its results to improve $\rho_{SL}(s,a)$.
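For concreteness, a sketch of the epsilon-greedy variant mentioned above (AlphaGo itself used the rollout-network trick instead):

    import random

    def epsilon_greedy_move(s, moves, rho, eps=0.1):
        """With probability eps, explore with a uniformly random move;
        otherwise exploit the policy network's favorite move."""
        if random.random() < eps:
            return random.choice(moves)
        return max(moves, key=lambda a: rho(s, a))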

slide-37
SLIDE 37

Outline

  • Review: minimax and alpha-beta
  • Move ordering: policy network
  • Evaluation function: value network
  • Training the value network
  • Exact training: endgames
  • Stochastic training: Monte Carlo tree search
  • Case study: AlphaGo
slide-38
SLIDE 38

AlphaGo video by Nature magazine (8 minutes, 2016)

slide-39
SLIDE 39

AlphaGo

  • D. Silver et al., "Mastering the Game of Go with Deep Neural Networks and Tree Search," Nature 529, January 2016.

slide-40
SLIDE 40

Conclusions

  • Review: minimax and alpha-beta
  • Complexity: $(2b - 1)^{d/2} = O\{b^{d/2}\}$ with depth $d$ and branching factor $b$, if the children of each node are ordered just right (MAX: largest first; MIN: smallest first)
  • Move ordering: policy network
  • Can be used to order the children with no loss of accuracy; can also limit the set of moves evaluated, with some loss of accuracy
  • Evaluation function: value network
  • Estimates the value of each board position in limited-horizon search
  • Exact value: endgames
  • Minimax search backward from a set of known terminal positions
  • Stochastic training: Monte Carlo tree search
  • Choose a policy that balances exploration vs. exploitation, play games at random, and use the data to estimate win frequencies