
Mastering the game of Go with deep neural networks and tree search - PowerPoint PPT Presentation



  1. David Silver et al., Google DeepMind: "Mastering the game of Go with deep neural networks and tree search". Article overview by Ilya Kuzovkin, Reinforcement Learning Seminar, University of Tartu, 2016.

  2. The game of Go

  3. Board

  4. Board, stones

  5. Board, stones, groups

  6. Board, liberties, stones, groups

  7. Board, liberties, capture, stones, groups

  8. Board, liberties, capture, ko, stones, groups

  9. Board, liberties, examples, capture, ko, stones, groups

  10. Board, liberties, examples, capture, ko, stones, groups

  11. Board, liberties, examples, capture, ko, stones, groups

  12. Board, liberties, examples, capture, ko, stones, groups, two eyes

  13. Final count, board, liberties, examples, capture, ko, stones, groups, two eyes (a small liberty-counting sketch follows below)
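The rules build-up on slides 3-13 (stones form groups, groups have liberties, a group with no liberties is captured) maps directly onto a small flood-fill computation. Below is a minimal Python sketch of liberty counting, written for illustration and not taken from the slides; the board encoding (0 empty, 1 black, 2 white) is an assumption.

```python
# Minimal sketch (not from the slides): group and liberty counting on a Go board.
# Assumed encoding: board[y][x] is 0 for empty, 1 for black, 2 for white.

def group_and_liberties(board, x, y):
    """Return (group_stones, liberties) for the group containing (x, y)."""
    size = len(board)
    color = board[y][x]
    assert color != 0, "no stone at this point"
    group, liberties, frontier = set(), set(), [(x, y)]
    while frontier:
        cx, cy = frontier.pop()
        if (cx, cy) in group:
            continue
        group.add((cx, cy))
        for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            if 0 <= nx < size and 0 <= ny < size:
                if board[ny][nx] == 0:
                    liberties.add((nx, ny))        # empty neighbour = liberty
                elif board[ny][nx] == color:
                    frontier.append((nx, ny))      # same-colour stone joins the group
    return group, liberties

# A group whose liberty set is empty is captured and removed from the board.
```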

  14. Training

  15. Training: the building blocks. Two supervised problems (classification and regression) and one reinforcement learning problem:
  • Supervised policy network p_σ(a|s) (supervised classification)
  • Reinforcement policy network p_ρ(a|s) (reinforcement learning)
  • Value network v_θ(s) (supervised regression)
  • Rollout policy network p_π(a|s)
  • Tree policy network p_τ(a|s)

  16. Supervised policy network p_σ(a|s)

  17. Supervised policy network p_σ(a|s). Architecture, from output to input:
  • Softmax
  • 1 convolutional layer 1x1, ReLU
  • 11 convolutional layers 3x3 with k=192 filters, ReLU
  • 1 convolutional layer 5x5 with k=192 filters, ReLU
  • 19 x 19 x 48 input

  18. Supervised policy network p_σ(a|s), same architecture; adds the training data:
  • 29.4M positions from games between 6 and 9 dan players

  19. Supervised policy network p_σ(a|s), same architecture and data; adds the optimization details:
  • stochastic gradient ascent
  • learning rate α = 0.003, halved every 80M steps
  • batch size m = 16
  • 3 weeks on 50 GPUs to make 340M steps

  20. Supervised policy network p_σ(a|s), same architecture, data and optimization; adds the results (a code sketch of the tower follows below):
  • data augmented with 8 reflections/rotations
  • test set (1M positions) accuracy: 57.0%
  • 3 ms to select an action
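To make slides 17-20 concrete, here is a minimal PyTorch sketch of the convolutional tower they describe. The slides do not state the padding or the number of output planes of the final 1x1 convolution, so those are assumptions for illustration (the ReLU before the softmax is simply copied from the slide), not the paper's exact layer specification.

```python
# Minimal PyTorch sketch of the SL policy tower from slides 17-20.
# Assumptions (not stated on the slides): "same" padding everywhere and a
# single output plane for the final 1x1 convolution before the softmax.
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    def __init__(self, in_planes=48, k=192):
        super().__init__()
        layers = [nn.Conv2d(in_planes, k, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):                                    # 11 hidden 3x3 conv layers
            layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(k, 1, kernel_size=1), nn.ReLU()]  # final 1x1 conv (ReLU as on the slide)
        self.tower = nn.Sequential(*layers)

    def forward(self, x):                        # x: (batch, 48, 19, 19)
        logits = self.tower(x).flatten(1)        # (batch, 361)
        return torch.softmax(logits, dim=1)      # distribution over board points

probs = SLPolicyNet()(torch.zeros(16, 48, 19, 19))   # batch size m = 16, as on slide 19
```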

  21-24. 19 x 19 x 48 input (four slides illustrating the input feature planes; a format sketch follows below)
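Each of the 48 input planes on slides 21-24 is a 19 x 19 grid of features describing the position. The slides do not enumerate the planes, so the sketch below encodes only three obviously recoverable ones (current player's stones, opponent's stones, empty points) purely to illustrate the tensor format.

```python
# Illustrative sketch of stacking 19x19 feature planes, as in the 19 x 19 x 48 input
# on slides 21-24. Only three simple planes are shown; the real 48 planes include
# richer features (liberty counts, capture sizes, etc.) not listed on these slides.
import numpy as np

def encode_planes(board, player):
    """board: (19, 19) ints with 0 empty, 1 black, 2 white; player: 1 or 2."""
    opponent = 3 - player
    planes = np.stack([
        (board == player).astype(np.float32),     # plane 0: current player's stones
        (board == opponent).astype(np.float32),   # plane 1: opponent's stones
        (board == 0).astype(np.float32),          # plane 2: empty points
    ])
    return planes                                 # shape (3, 19, 19); the paper uses 48 planes

x = encode_planes(np.zeros((19, 19), dtype=int), player=1)
```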

  25. Rollout policy p_π(a|s):
  • supervised, same data as p_σ(a|s)
  • less accurate: 24.2% (vs. 57.0%)
  • faster: 2 μs per action (about 1500 times faster)
  • just a linear model with softmax

  26. Rollout policy p_π(a|s) (same content as slide 25)

  27. Rollout policy p_π(a|s) (as on slide 25); tree policy p_τ(a|s)

  28. Rollout policy p_π(a|s) (as on slide 25); tree policy p_τ(a|s), sketched below:
  • "similar to the rollout policy but with more features"
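Slides 25-28 describe the rollout policy as "just a linear model with softmax" over handcrafted pattern features. A minimal numpy sketch of that idea follows; `extract_features` and the candidate-move interface are hypothetical placeholders, since the slides do not list the actual features.

```python
# Minimal sketch of a linear softmax policy, as on slides 25-28.
# `extract_features` is a hypothetical placeholder for the handcrafted local
# pattern features the rollout policy uses; its details are not on the slides.
import numpy as np

def rollout_policy(weights, candidate_moves, state, extract_features):
    """weights: (n_features,); returns one probability per candidate move."""
    scores = np.array([weights @ extract_features(state, move) for move in candidate_moves])
    scores -= scores.max()                      # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()                      # softmax over the candidate moves

# Being linear, one evaluation costs microseconds rather than a full network
# forward pass, which is what makes it usable for fast rollouts.
```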

  29. Reinforcement policy network p_ρ(a|s):
  • same architecture as the supervised policy network
  • weights are initialized with ρ = σ

  30. Reinforcement policy network p_ρ(a|s), same architecture, ρ initialized with σ; adds:
  • self-play: current network vs. a randomized pool of previous versions

  31. Reinforcement policy network p_ρ(a|s); adds:
  • play a game until the end, get the reward z_t = ±r(s_T) = ±1

  32. Reinforcement policy network p_ρ(a|s); adds:
  • set z_t^i = z_t and play the same game again, this time updating the network parameters at each time step t

  33. Reinforcement policy network p_ρ(a|s); adds the baseline v(s_t^i):
  • v(s_t^i) = 0 "on the first pass through the training pipeline"
  • v(s_t^i) = v_θ(s_t^i) "on the second pass"

  34. Reinforcement policy network p_ρ(a|s); adds the training setup:
  • batch size n = 128 games
  • 10,000 batches
  • one day on 50 GPUs

  35. Reinforcement policy network p_ρ(a|s); adds the results (an update sketch follows below):
  • 80% wins against the supervised network
  • 85% wins against Pachi (no search yet!)
  • 3 ms to select an action
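Slides 31-34 describe a REINFORCE-style update: replay each self-play game and nudge log p_ρ(a_t|s_t) in proportion to z_t − v(s_t), averaged over a batch of n games. The sketch below shows one such update in PyTorch; the `policy` module and the way games are stored (per-game `states`, `actions`, outcome `z`, baselines `v(s_t)`) are assumptions for illustration, not the paper's implementation.

```python
# Sketch of the REINFORCE update from slides 31-34 (one batch of n self-play games).
# `policy` is assumed to be an nn.Module mapping a (48, 19, 19) state batch to
# move probabilities; `games` is a hypothetical list of (states, actions, z, baselines).
import torch

def reinforce_update(policy, optimizer, games):
    optimizer.zero_grad()
    n = len(games)                                   # batch of n games (n = 128 on slide 34)
    loss = 0.0
    for states, actions, z, baselines in games:      # one full self-play game
        probs = policy(states)                       # (T, 361) move probabilities
        logp = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
        # z is +1/-1 from the game outcome; baselines is v(s_t): 0 on the first
        # pass through the pipeline, v_theta(s_t) on the second (slide 33).
        loss = loss - (logp * (z - baselines)).sum() / n
    loss.backward()                                  # gradient ascent on the expected outcome
    optimizer.step()
```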

  36. Value network v_θ(s)

  37. Value network v_θ(s). Architecture, from output to input:
  • fully connected layer, 1 tanh unit
  • fully connected layer, 256 ReLU units
  • 1 convolutional layer 1x1, ReLU
  • 11 convolutional layers 3x3 with k=192 filters, ReLU
  • 1 convolutional layer 5x5 with k=192 filters, ReLU
  • 19 x 19 x 49 input

  38. Value network v_θ(s), same architecture; adds:
  • evaluate the value of position s under policy p: v^p(s) = E[z_t | s_t = s, a_{t...T} ~ p]
  • double approximation: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s)

  39. Value network v_θ(s); adds:
  • stochastic gradient descent to minimize the MSE between v_θ(s) and the outcome z

  40. Value network v_θ(s); adds:
  • train on 30M state-outcome (s, z) pairs, each from a unique game generated by self-play

  41. Value network v_θ(s); adds how each self-play pair is generated:
  ‣ choose a random time step u
  ‣ sample moves t = 1…u-1 from the SL policy
  ‣ make a random move u
  ‣ sample t = u+1…T from the RL policy and get the game outcome z
  ‣ add the pair (s_u, z_u) to the training set

  42. Value network v_θ(s); adds the training setup (a data-generation sketch follows below):
  • one week on 50 GPUs to train on 50M batches of size m = 32
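Slide 41's recipe for generating the value network's training pairs translates almost line for line into code. Below is a hedged Python sketch; `Game`, `sl_policy`, `rl_policy`, `sample_move`, and `max_len` are hypothetical stand-ins, since the slides do not specify the Go engine or sampling interface.

```python
# Sketch of the per-game data generation for the value network (slide 41).
# `Game`, `sl_policy`, `rl_policy`, `sample_move`, and `max_len` are hypothetical
# stand-ins; the slides do not specify the engine API.
import random

def make_value_training_pair(Game, sl_policy, rl_policy, sample_move, max_len=450):
    game = Game()
    u = random.randint(1, max_len)                 # choose a random time step u
    for t in range(1, u):                          # moves 1 .. u-1 from the SL policy
        game.play(sample_move(sl_policy, game.state()))
    game.play(random.choice(game.legal_moves()))   # one uniformly random move at step u
    s = game.state()                               # the position recorded for training
    while not game.over():                         # moves u+1 .. T from the RL policy
        game.play(sample_move(rl_policy, game.state()))
    z = game.outcome()                             # +1 / -1 from the finished game
    return s, z                                    # one (s, z) pair per game, 30M games in total
```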
