Mastering the game of Go with deep neural networks and tree search


SLIDE 1

Mastering the game of Go with deep neural networks and tree search

David Silver et al. from Google DeepMind

Article overview by Ilya Kuzovkin

Reinforcement Learning Seminar, University of Tartu, 2016

SLIDE 2

THE GAME OF GO

SLIDE 3

BOARD

SLIDE 4

BOARD STONES

SLIDE 5

BOARD STONES GROUPS

SLIDE 6

BOARD STONES LIBERTIES GROUPS

SLIDE 7

BOARD STONES LIBERTIES CAPTURE GROUPS

SLIDE 8

BOARD STONES LIBERTIES CAPTURE KO GROUPS

SLIDE 9

BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES

SLIDE 10

BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES

SLIDE 11

BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES

SLIDE 12

BOARD STONES LIBERTIES CAPTURE TWO EYES KO GROUPS EXAMPLES

SLIDE 13

BOARD STONES LIBERTIES CAPTURE TWO EYES FINAL COUNT KO GROUPS EXAMPLES

SLIDE 14

TRAINING

SLIDE 15

TRAINING THE BUILDING BLOCKS

Supervised policy network pσ(a|s)
Reinforcement policy network pρ(a|s)
Rollout policy network pπ(a|s)
Value network vθ(s)
Tree policy network pτ(a|s)

[Diagram: supervised learning (classification), reinforcement learning, supervised learning (regression)]

SLIDE 16

Supervised policy network pσ(a|s)

SLIDE 17

Supervised policy network pσ(a|s)

19 x 19 x 48 input
1 convolutional layer 5x5 with k=192 filters, ReLU
11 convolutional layers 3x3 with k=192 filters, ReLU
1 convolutional layer 1x1, ReLU
Softmax

SLIDE 18

Supervised policy network pσ(a|s)

19 x 19 x 48 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; softmax

  • 29.4M positions from games between 6 and 9 dan players

SLIDE 19

Supervised policy network pσ(a|s)

19 x 19 x 48 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; softmax

  • 29.4M positions from games between 6 and 9 dan players
  • stochastic gradient ascent
  • learning rate α = 0.003, halved every 80M steps
  • batch size m = 16
  • 3 weeks on 50 GPUs to make 340M steps

SLIDE 20

Supervised policy network pσ(a|s)

19 x 19 x 48 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; softmax

  • 29.4M positions from games between 6 and 9 dan players
  • stochastic gradient ascent
  • learning rate α = 0.003, halved every 80M steps
  • batch size m = 16
  • 3 weeks on 50 GPUs to make 340M steps
  • Augmented: 8 reflections/rotations
  • Test set (1M positions) accuracy: 57.0%
  • 3 ms to select an action
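As a concrete reference, here is a minimal sketch of the 13-layer convolutional stack described above, assuming PyTorch. The padding values are my assumption (chosen so the 19x19 board resolution is preserved), and the final ReLU before the softmax follows the slide text.

    import torch
    import torch.nn as nn

    # Minimal sketch of the SL policy network described on the slide
    # (assumed PyTorch; padding is an assumption, not stated on the slide).
    class SLPolicyNet(nn.Module):
        def __init__(self, in_planes=48, k=192):
            super().__init__()
            layers = [nn.Conv2d(in_planes, k, 5, padding=2), nn.ReLU()]
            for _ in range(11):
                layers += [nn.Conv2d(k, k, 3, padding=1), nn.ReLU()]
            layers += [nn.Conv2d(k, 1, 1), nn.ReLU()]  # 1x1 conv, ReLU
            self.body = nn.Sequential(*layers)

        def forward(self, x):                  # x: (batch, 48, 19, 19)
            logits = self.body(x).flatten(1)   # (batch, 361)
            return torch.softmax(logits, dim=1)  # distribution over points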
SLIDE 21

19 X 19 X 48 INPUT

SLIDE 22

19 X 19 X 48 INPUT

SLIDE 23

19 X 19 X 48 INPUT

SLIDE 24

19 X 19 X 48 INPUT

SLIDE 25

Rollout policy pπ(a|s)

  • Supervised, on the same data as pσ(a|s)
  • Less accurate: 24.2% (vs. 57.0%)
  • Faster: 2 μs per action (1500× faster)
  • Just a linear model with softmax

SLIDE 26

Rollout policy pπ(a|s)

  • Supervised, on the same data as pσ(a|s)
  • Less accurate: 24.2% (vs. 57.0%)
  • Faster: 2 μs per action (1500× faster)
  • Just a linear model with softmax

SLIDE 27

Rollout policy pπ(a|s) Tree policy pτ(a|s)

  • Supervised, on the same data as pσ(a|s)
  • Less accurate: 24.2% (vs. 57.0%)
  • Faster: 2 μs per action (1500× faster)
  • Just a linear model with softmax

SLIDE 28

Rollout policy pπ(a|s)

  • Supervised, on the same data as pσ(a|s)
  • Less accurate: 24.2% (vs. 57.0%)
  • Faster: 2 μs per action (1500× faster)
  • Just a linear model with softmax

Tree policy pτ(a|s)

  • “similar to the rollout policy but with more features”
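Since the rollout policy is described as just a linear model with a softmax, a minimal sketch might look like the following; the per-move feature matrix is a placeholder assumption (the real rollout policy uses small local pattern features not detailed on the slide).

    import numpy as np

    # Minimal sketch of a linear softmax rollout policy. `features` stands in
    # for the hand-crafted per-move pattern features (an assumption here).
    def rollout_policy(weights, features, legal):
        # weights: (d,); features: (361, d); legal: (361,) boolean mask
        logits = features @ weights
        logits[~legal] = -np.inf               # forbid illegal moves
        exp = np.exp(logits - logits[legal].max())
        return exp / exp.sum()                 # probabilities over 361 points

The linearity is the point: one dot product per candidate move is what makes the 2 μs action selection possible.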

SLIDE 29

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

SLIDE 30

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions

SLIDE 31

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions
  • Play a game until the end, get the reward zt = ±r(sT) = ±1
SLIDE 32

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions
  • Play a game until the end, get the reward zt = ±r(sT) = ±1
  • Set zit = zt and play the same game again, this time updating the network parameters at each time step t

SLIDE 33

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions
  • Play a game until the end, get the reward zt = ±r(sT) = ±1
  • Set zit = zt and play the same game again, this time updating the network parameters at each time step t
  • Baseline: v(sit) = 0 “on the first pass through the training pipeline”, v(sit) = vθ(sit) “on the second pass”

SLIDE 34

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions
  • Play a game until the end, get the reward zt = ±r(sT) = ±1
  • Set zit = zt and play the same game again, this time updating the network parameters at each time step t
  • Baseline: v(sit) = 0 “on the first pass through the training pipeline”, v(sit) = vθ(sit) “on the second pass”
  • batch size n = 128 games
  • 10,000 batches
  • One day on 50 GPUs

SLIDE 35

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions
  • Play a game until the end, get the reward zt = ±r(sT) = ±1
  • Set zit = zt and play the same game again, this time updating the network parameters at each time step t
  • Baseline: v(sit) = 0 “on the first pass through the training pipeline”, v(sit) = vθ(sit) “on the second pass”
  • batch size n = 128 games
  • 10,000 batches
  • One day on 50 GPUs
  • 80% wins against the Supervised Network
  • 85% wins against Pachi (no search yet!)
  • 3 ms to select an action
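The update being paraphrased here is the REINFORCE policy gradient with the baseline v(sit). A minimal sketch for one finished game, assuming PyTorch; all names are hypothetical:

    import torch

    # Minimal sketch of the REINFORCE update for one self-play game (assumed
    # PyTorch). `policy` is pρ; `states` (T, 48, 19, 19) and `actions` (T,)
    # replay the game; `z` is the final reward (+1/-1); `baseline` is v(s_t):
    # 0 on the first pass, vθ(s_t) on the second. Names are hypothetical.
    def reinforce_update(policy, optimizer, states, actions, z, baseline):
        probs = policy(states)                                   # (T, 361)
        logp = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
        loss = -((z - baseline) * logp).mean()   # minimizing -J ascends J
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()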

SLIDE 36

Value network vθ(s)

SLIDE 37

Value network vθ(s)

19 x 19 x 49 input
1 convolutional layer 5x5 with k=192 filters, ReLU
11 convolutional layers 3x3 with k=192 filters, ReLU
1 convolutional layer 1x1, ReLU
Fully connected layer with 256 ReLU units
Fully connected layer with 1 tanh unit
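The convolutional body is the same stack as the policy network (with a 49-plane input); only the head changes. A minimal sketch of that head, assuming PyTorch:

    import torch
    import torch.nn as nn

    # Minimal sketch of the value-network head (assumed PyTorch). It replaces
    # the policy network's softmax with fully connected layers and a tanh, so
    # the output is a single score in [-1, 1] rather than a move distribution.
    class ValueHead(nn.Module):
        def __init__(self, k=192):
            super().__init__()
            self.conv = nn.Conv2d(k, 1, 1)       # 1x1 convolution, ReLU
            self.fc1 = nn.Linear(19 * 19, 256)   # 256 ReLU units
            self.fc2 = nn.Linear(256, 1)         # 1 tanh unit

        def forward(self, x):                    # x: (batch, 192, 19, 19)
            h = torch.relu(self.conv(x)).flatten(1)
            h = torch.relu(self.fc1(h))
            return torch.tanh(self.fc2(h)).squeeze(1)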

SLIDE 38

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)

SLIDE 39

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)
  • Stochastic gradient descent to minimize MSE

SLIDE 40

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)
  • Stochastic gradient descent to minimize MSE
  • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:

SLIDE 41

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)
  • Stochastic gradient descent to minimize MSE
  • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  • choose a random time step u
  • sample moves t = 1…u-1 from the SL policy
  • make a random move u
  • sample t = u+1…T from the RL policy and get the game outcome z
  • add the pair (su, zu) to the training set

SLIDE 42

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)
  • Stochastic gradient descent to minimize MSE
  • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  • choose a random time step u
  • sample moves t = 1…u-1 from the SL policy
  • make a random move u
  • sample t = u+1…T from the RL policy and get the game outcome z
  • add the pair (su, zu) to the training set
  • One week on 50 GPUs to train on 50M batches of size m = 32

SLIDE 43

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)
  • MSE on the test set: 0.234
  • Close to the MC estimate from the RL policy, but 15,000× faster
  • Stochastic gradient descent to minimize MSE
  • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  • choose a random time step u
  • sample moves t = 1…u-1 from the SL policy
  • make a random move u
  • sample t = u+1…T from the RL policy and get the game outcome z
  • add the pair (su, zu) to the training set
  • One week on 50 GPUs to train on 50M batches of size m = 32
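A minimal sketch of that data-generation recipe; play_move, sl_policy, rl_policy, random_legal_move, is_terminal and game_outcome are hypothetical helpers standing in for the game engine and the trained networks.

    import random

    # Minimal sketch of generating one (s_u, z_u) training pair per game.
    # All helper functions are hypothetical placeholders.
    def generate_value_sample(initial_state, max_moves=450):
        s = initial_state
        u = random.randint(1, max_moves)        # choose a random time step u
        for t in range(1, u):                   # moves 1..u-1 from SL policy
            s = play_move(s, sl_policy(s))
        s = play_move(s, random_legal_move(s))  # one uniformly random move at u
        s_u = s
        while not is_terminal(s):               # moves u+1..T from RL policy
            s = play_move(s, rl_policy(s))
        z_u = game_outcome(s)  # +1/-1 from the view of the player to move at u
        return s_u, z_u                         # one pair per unique game

Taking a single position per game keeps the training pairs uncorrelated, which is why each pair must come from a unique self-play game.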

SLIDE 44

SLIDE 45

PLAYING

SLIDE 46

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

SLIDE 47

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Each node s has edges (s, a) for all legal actions and stores statistics:

  • Prior
  • Number of evaluations
  • Number of rollouts
  • MC value estimate
  • Rollout value estimate
  • Combined mean action value
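In code form, the per-edge record might look like the following sketch; the field names follow the paper's notation, but the layout itself is an assumption, not the actual implementation.

    from dataclasses import dataclass

    # Minimal sketch of the per-edge statistics listed above.
    @dataclass
    class EdgeStats:
        P: float = 0.0   # prior probability from the policy network
        Nv: int = 0      # number of value-network evaluations
        Nr: int = 0      # number of rollouts
        Wv: float = 0.0  # accumulated network evaluations (MC value estimate)
        Wr: float = 0.0  # accumulated rollout outcomes (rollout value estimate)
        Q: float = 0.0   # combined mean action value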

SLIDE 48

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Each node s has edges (s, a) for all legal actions and stores statistics: prior, number of evaluations, number of rollouts, MC value estimate, rollout value estimate, combined mean action value

Simulation starts at the root and stops at time L, when a leaf (unexplored state) is found.

SLIDE 49

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Each node s has edges (s, a) for all legal actions and stores statistics: prior, number of evaluations, number of rollouts, MC value estimate, rollout value estimate, combined mean action value

Simulation starts at the root and stops at time L, when a leaf (unexplored state) is found. Position sL is added to the evaluation queue.

SLIDE 50

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Each node s has edges (s, a) for all legal actions and stores statistics: prior, number of evaluations, number of rollouts, MC value estimate, rollout value estimate, combined mean action value

Simulation starts at the root and stops at time L, when a leaf (unexplored state) is found. Position sL is added to the evaluation queue.

A bunch of nodes has been selected for evaluation…

SLIDE 51

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

SLIDE 52

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Node s is evaluated using the value network to obtain vθ(s).

SLIDE 53

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Node s is evaluated using the value network to obtain vθ(s), and using rollout simulation with policy pπ until the end of each simulated game, to get the final game score.

SLIDE 54

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Node s is evaluated using the value network to obtain vθ(s), and using rollout simulation with policy pπ until the end of each simulated game, to get the final game score.

Each leaf is evaluated; we are ready to propagate updates

SLIDE 55

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

SLIDE 56

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Statistics along the paths of each simulation are updated during the backward pass through steps t < L

SLIDE 57

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Statistics along the paths of each simulation are updated during the backward pass through steps t < L; visit counts are updated as well

SLIDE 58

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Statistics along the paths of each simulation are updated during the backward pass through steps t < L; visit counts are updated as well. Finally, the overall evaluation of each visited state-action edge is updated

SLIDE 59

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Statistics along the paths of each simulation are updated during the backward pass through steps t < L; visit counts are updated as well. Finally, the overall evaluation of each visited state-action edge is updated

Current tree is updated
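The slides leave the combining rule implicit; in the paper, the combined mean action value mixes the two estimates with a weight λ (λ = 0.5 in the final system): Q(s, a) = (1 - λ) Wv/Nv + λ Wr/Nr. A minimal sketch, using the EdgeStats record from earlier:

    # Minimal sketch of the combined mean action value from the paper:
    # Q(s, a) = (1 - lam) * Wv/Nv + lam * Wr/Nr, with lam = 0.5.
    def combined_q(edge, lam=0.5):
        v = edge.Wv / edge.Nv if edge.Nv else 0.0  # mean network estimate
        r = edge.Wr / edge.Nr if edge.Nr else 0.0  # mean rollout estimate
        return (1 - lam) * v + lam * r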

SLIDE 60

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

SLIDE 61

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Once an edge (s, a) has been visited enough (nthr) times, it is included into the tree with its successor state s’

SLIDE 62

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Once an edge (s, a) has been visited enough (nthr) times, it is included into the tree with its successor state s’

It is initialized using the tree policy to pτ(a|s’) and later updated with the SL policy pσ(a|s’)

SLIDE 63

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Tree is expanded, fully updated and ready for the next move!

Once an edge (s, a) has been visited enough (nthr) times, it is included into the tree with its successor state s’

It is initialized using the tree policy to pτ(a|s’) and later updated with the SL policy pσ(a|s’)
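A minimal sketch of this expansion step; the tree container, its methods, the tree_policy helper, and the threshold default are all hypothetical, and EdgeStats is the record sketched earlier.

    # Minimal sketch of node expansion. `tree`, its methods, `tree_policy`
    # (pτ) and the nthr default are hypothetical. Priors start from the fast
    # tree policy and are later overwritten by the asynchronously computed
    # SL policy output pσ(a|s').
    def maybe_expand(tree, s, a, nthr=40):
        edge = tree.edges[(s, a)]
        if edge.Nr > nthr and (s, a) not in tree.children:
            s2 = tree.apply_move(s, a)               # successor state s'
            tree.children[(s, a)] = s2
            for b in tree.legal_moves(s2):           # temporary priors from pτ
                tree.edges[(s2, b)] = EdgeStats(P=tree_policy(s2, b))
            tree.enqueue_sl_policy(s2)               # pσ(a|s') refines P later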

SLIDE 64

SLIDE 65

WINNING

https://www.youtube.com/watch?v=oRvlyEpOQ-8

SLIDE 66

https://www.youtube.com/watch?v=oRvlyEpOQ-8