Lecture 33: Reinforcement Learning for Two-Player Games
Mark Hasegawa-Johnson, 4/2020 CC-BY 4.0: you may remix or redistribute if you cite the source
Snapshot of a gnugo game, http://www.gnu.org/software/gnugo/
Outline
Review: minimax
Let t be the state of the game: a complete specification of the board, and a statement about whose turn it is. If D(t) is the set of children of t (the states reachable in one move), then the value of the board is

$$V(t) = \max_{t' \in D(t)} V(t') \quad \text{if it is MAX's turn at } t$$

$$V(t) = \min_{t' \in D(t)} V(t') \quad \text{if it is MIN's turn at } t$$

[Figure: a game tree whose leaf values are 6, 3, 8, 4, 9, 2, 5, 2, 7, 6, 9, 1, 3, 1, 5, 4, 7, 1, 4, 6, 7, 3, 6, 8, 3, 4, 4.]
Applying these two equations bottom-up fills in the rest of the tree.

[Figure: the same tree over several slides, as minimax values propagate upward: the values 7, 5, 9, 5, 3, 7, 8, 6, 9 appear at the level above the leaves, and the root receives the value 6.]
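As a concrete sketch (not from the original slides), here is that recursion in Python, with the game tree encoded as nested lists so the example is self-contained; `minimax` and the tree encoding are illustrative choices:

```python
def minimax(t, max_to_move=True):
    """V(t) for a game tree encoded as nested lists.

    An internal node is a list of children; a leaf is its numeric score.
    """
    if not isinstance(t, list):                 # leaf: the score of the position
        return t
    values = [minimax(c, not max_to_move) for c in t]
    return max(values) if max_to_move else min(values)

# A two-level example using the first nine leaf values from the slides:
# MAX moves at the root, MIN at its three children.
print(minimax([[6, 3, 8], [4, 9, 2], [5, 2, 7]]))  # min of each triple: 3, 2, 2; max -> 3
```

The recursion visits every leaf, which is exactly the cost that alpha-beta pruning, next, avoids.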
Alpha-Beta Pruning

Each node has two internal meta-parameters, initialized from its parent:
α(t) = the value of the best alternative that MAX already knows how to force MIN to accept.
β(t) = the value of the best alternative that MIN already knows how to force MAX to accept.
At the root, α = −∞ and β = +∞.

[Figure: the example tree, with α = −∞ and β = +∞ marked at the root.]
[Figure: as the search descends, each child initializes its own α = −∞ and β = +∞ from its parent.]
At a MAX node t, after evaluating any child t′:
If V(t′) > β(t), prune all remaining children of t: MIN will never let us reach this node.
Otherwise, if V(t′) > α(t), set α(t) = V(t′): MIN might still choose t (because V(t′) ≤ β(t)), and then MAX can choose t′.
At a MIN node t, after evaluating any child t′:
If V(t′) < α(t), prune all remaining children of t: MAX will never let us reach this node.
Otherwise, if V(t′) < β(t), set β(t) = V(t′): MAX might still choose t (because V(t′) ≥ α(t)), and then MIN can choose t′.
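A sketch of both rules in one recursion (again illustrative, using the same nested-list trees as the minimax sketch above); α and β are passed down from the parent exactly as the slides describe:

```python
import math

def alphabeta(t, max_to_move=True, alpha=-math.inf, beta=math.inf):
    """V(t) with alpha-beta pruning; t is a nested-list game tree as above."""
    if not isinstance(t, list):                  # leaf: the score of the position
        return t
    if max_to_move:
        v = -math.inf
        for child in t:                          # children inherit alpha and beta
            v = max(v, alphabeta(child, False, alpha, beta))
            if v > beta:                         # MIN will never let us reach t
                break                            # prune the remaining children
            alpha = max(alpha, v)                # best value MAX can force so far
        return v
    else:
        v = math.inf
        for child in t:
            v = min(v, alphabeta(child, True, alpha, beta))
            if v < alpha:                        # MAX will never let us reach t
                break
            beta = min(beta, v)                  # best value MIN can force so far
        return v

assert alphabeta([[6, 3, 8], [4, 9, 2], [5, 2, 7]]) == 3
```

The answer is the same as plain minimax; the savings come from the `break` statements, which skip subtrees that cannot change the result.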
[Figure: the alpha-beta trace on the example tree, animated over many slides. After the first subtree is evaluated, the root knows its value is ≥ 6 and sets α = 6; each time a child's value shows that MIN (or MAX) would never allow play to reach a node, that node's remaining children are pruned (marked X). On the final slide, only the leaves marked "Evaluated" were ever examined; every other branch was pruned.]
For example, the game of Go:
The state is a list of 19 × 19 = 361 board positions, each of which is +1 = black (MAX), −1 = white (MIN), or 0 = empty.
An action is the choice of an empty position on the board to place the next stone.
A policy network estimates π_MAX(t, a) or π_MIN(t, a), the probability that action a is the best move for MAX or MIN, as a softmax over the network's scores:

$$\pi(t,a) = \frac{e^{\mathrm{score}(t,a)}}{\sum_{a'} e^{\mathrm{score}(t,a')}}$$
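A minimal sketch of that softmax (illustrative; `scores` is a hypothetical array of network outputs, one per legal move):

```python
import numpy as np

def policy(scores):
    """Softmax: turn one score per legal move into a probability distribution."""
    z = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return z / z.sum()

# Hypothetical scores for four legal moves:
print(policy(np.array([2.0, 1.0, 0.5, -1.0])))  # sums to 1; highest score -> highest probability
```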
How on Earth can we decide which child to evaluate first?
At a MAX node, evaluate first the child with the highest value of π_MAX(t, a) (= the probability that this node will be evaluated to have the highest value).
At a MIN node, evaluate first the child with the highest value of π_MIN(t, a) (= the probability that this node will be evaluated to have the lowest value).
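In code, that heuristic is just a sort before the alpha-beta loop visits the children. A sketch, where `prob` stands in for a hypothetical pre-trained policy network's output:

```python
def ordered_moves(moves, prob):
    """Sort legal moves, most promising first, so alpha-beta prunes more.

    `prob` maps a move to the (hypothetical) policy network's probability
    that the move is best for the player to move.
    """
    return sorted(moves, key=prob, reverse=True)

# Toy example with made-up policy probabilities for four moves:
p = {'a': 0.1, 'b': 0.6, 'c': 0.25, 'd': 0.05}
print(ordered_moves(list(p), p.get))   # ['b', 'c', 'a', 'd']
```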
[Figure: the example tree again, with question marks over the root's children: the policy network chooses which child to evaluate first, and the alpha-beta trace is repeated, ending with the same pruned (X) and "Evaluated" markings.]
These leaves are not the end of the game! They are actually the outputs of the value network, V̂(t), at these game positions.
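A sketch of how that works: a depth-limited search (pruning omitted for brevity) that scores frontier positions with a trained value network instead of playing every game to its end. Here `children(t)` and `value_network(t)` are assumed helpers, not code from the lecture:

```python
def search_value(t, depth, max_to_move=True):
    """Depth-limited minimax that bottoms out in a learned value estimate.

    `children(t)` (legal successor states) and `value_network(t)`
    (a trained estimate of V(t)) are assumed helpers.
    """
    if depth == 0 or not children(t):        # frontier, or the game is over
        return value_network(t)              # V-hat(t): learned position value
    values = [search_value(c, depth - 1, not max_to_move) for c in children(t)]
    return max(values) if max_to_move else min(values)
```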
Steps in Monte Carlo Tree Search. By Rmoss92 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=88889583
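The figure shows the four steps of Monte Carlo tree search: selection, expansion, simulation, and backpropagation. Here is a compact illustrative sketch of one iteration, assuming hypothetical helpers `children(t)` (successor states) and `random_playout(t)` (a fast random game, returning 1 if MAX wins, else 0):

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.kids, self.wins, self.visits = [], 0, 0

def ucb(node, c=1.4):
    """Upper confidence bound: win frequency plus an exploration bonus."""
    return (node.wins / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts_iteration(root):
    # 1. Selection: descend, always taking the child with the highest UCB,
    #    until reaching a node with an unexpanded or unvisited child.
    node = root
    while node.kids and all(k.visits > 0 for k in node.kids):
        node = max(node.kids, key=ucb)
    # 2. Expansion: create children for the legal moves (assumed helper).
    if not node.kids:
        node.kids = [Node(s, parent=node) for s in children(node.state)]
    unvisited = [k for k in node.kids if k.visits == 0]
    if unvisited:
        node = random.choice(unvisited)
    # 3. Simulation: fast random playout from this state (assumed helper).
    result = random_playout(node.state)
    # 4. Backpropagation: update win counts on the path back to the root.
    while node is not None:
        node.visits += 1
        node.wins += result
        node = node.parent
```

After many iterations, each node's wins/visits ratio estimates its win frequency, and the root's most-visited child is played. (A full implementation would also credit wins from the perspective of the player to move at each node.)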
AlphaGo combined these ideas:
From a database of human expert games, it learned a "supervised learning" policy network, π_SL(t, a).
It also learned a much faster network with too few trainable parameters, hence less accurate. Call this the "rollout network," π_rollout(t, a).
It then played many games against itself, and used the results to improve π_SL(t, a).
Summary
Alpha-beta pruning is most effective when the children of each node are ordered just right (MAX: largest first, MIN: smallest first); a policy network learns to propose that ordering.
A fast rollout network reduces the cost of the moves evaluated, with some loss of accuracy.
Monte Carlo tree search plays many fast rollouts and uses the data to estimate win frequency.