Mastering the game of Go with deep neural networks and tree search


SLIDE 1

Mastering the game of Go with deep neural networks and tree search

David Silver et al. from Google DeepMind

Article overview by Ilya Kuzovkin

Reinforcement Learning Seminar, University of Tartu, 2016

SLIDE 2

THE GAME OF GO

SLIDE 3

BOARD

SLIDE 4

BOARD STONES

SLIDE 5

BOARD STONES GROUPS

SLIDE 6

BOARD STONES LIBERTIES GROUPS

SLIDE 7

BOARD STONES LIBERTIES CAPTURE GROUPS

SLIDE 8

BOARD STONES LIBERTIES CAPTURE KO GROUPS

SLIDE 9

BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES

SLIDE 10

BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES

SLIDE 11

BOARD STONES LIBERTIES CAPTURE KO GROUPS EXAMPLES

SLIDE 12

BOARD STONES LIBERTIES CAPTURE TWO EYES KO GROUPS EXAMPLES

SLIDE 13

BOARD STONES LIBERTIES CAPTURE TWO EYES FINAL COUNT KO GROUPS EXAMPLES

SLIDE 14

TRAINING

SLIDE 15

TRAINING THE BUILDING BLOCKS

Supervised policy network pσ(a|s)
Reinforcement policy network pρ(a|s)
Rollout policy network pπ(a|s)
Value network vθ(s)
Tree policy network pτ(a|s)

[Diagram: supervised learning (classification), reinforcement learning, supervised learning (regression)]

SLIDE 16

Supervised policy network pσ(a|s)

SLIDE 17

Supervised policy network pσ(a|s)

19 x 19 x 48 input
1 convolutional layer 5x5 with k=192 filters, ReLU
11 convolutional layers 3x3 with k=192 filters, ReLU
1 convolutional layer 1x1, ReLU
Softmax

SLIDE 18

Supervised policy network pσ(a|s)

19 x 19 x 48 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; softmax

  • 29.4M positions from games between 6 and 9 dan players

SLIDE 19

Supervised policy network pσ(a|s)

19 x 19 x 48 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; softmax

  • 29.4M positions from games between 6 and 9 dan players
  • stochastic gradient ascent
  • learning rate α = 0.003, halved every 80M steps
  • batch size m = 16
  • 3 weeks on 50 GPUs to make 340M steps

SLIDE 20

Supervised policy network pσ(a|s)

19 x 19 x 48 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; softmax

  • 29.4M positions from games between 6 and 9 dan players
  • stochastic gradient ascent
  • learning rate α = 0.003, halved every 80M steps
  • batch size m = 16
  • 3 weeks on 50 GPUs to make 340M steps
  • Augmented: 8 reflections/rotations
  • Test set (1M positions) accuracy: 57.0%
  • 3 ms to select an action
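As a concrete reference, here is a minimal sketch of the 13-layer convolutional stack described above, assuming PyTorch. The padding values are my assumption (chosen so the 19x19 board resolution is preserved), and the final ReLU before the softmax follows the slide text.

    import torch
    import torch.nn as nn

    # Minimal sketch of the SL policy network described on the slide
    # (assumed PyTorch; padding is an assumption, not stated on the slide).
    class SLPolicyNet(nn.Module):
        def __init__(self, in_planes=48, k=192):
            super().__init__()
            layers = [nn.Conv2d(in_planes, k, 5, padding=2), nn.ReLU()]
            for _ in range(11):
                layers += [nn.Conv2d(k, k, 3, padding=1), nn.ReLU()]
            layers += [nn.Conv2d(k, 1, 1), nn.ReLU()]  # 1x1 conv, ReLU
            self.body = nn.Sequential(*layers)

        def forward(self, x):                  # x: (batch, 48, 19, 19)
            logits = self.body(x).flatten(1)   # (batch, 361)
            return torch.softmax(logits, dim=1)  # distribution over points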
SLIDE 21

19 X 19 X 48 INPUT

SLIDE 22

19 X 19 X 48 INPUT

SLIDE 23

19 X 19 X 48 INPUT

SLIDE 24

19 X 19 X 48 INPUT

SLIDE 25

Rollout policy pπ(a|s)

  • Supervised, on the same data as pσ(a|s)
  • Less accurate: 24.2% (vs. 57.0%)
  • Faster: 2 μs per action (1500× faster)
  • Just a linear model with softmax

SLIDE 26

Rollout policy pπ(a|s)

  • Supervised, on the same data as pσ(a|s)
  • Less accurate: 24.2% (vs. 57.0%)
  • Faster: 2 μs per action (1500× faster)
  • Just a linear model with softmax

SLIDE 27

Rollout policy pπ(a|s) Tree policy pτ(a|s)

  • Supervised, on the same data as pσ(a|s)
  • Less accurate: 24.2% (vs. 57.0%)
  • Faster: 2 μs per action (1500× faster)
  • Just a linear model with softmax

SLIDE 28

Rollout policy pπ(a|s)

  • Supervised, on the same data as pσ(a|s)
  • Less accurate: 24.2% (vs. 57.0%)
  • Faster: 2 μs per action (1500× faster)
  • Just a linear model with softmax

Tree policy pτ(a|s)

  • “similar to the rollout policy but with more features”
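Since the rollout policy is described as just a linear model with a softmax, a minimal sketch might look like the following; the per-move feature matrix is a placeholder assumption (the real rollout policy uses small local pattern features not detailed on the slide).

    import numpy as np

    # Minimal sketch of a linear softmax rollout policy. `features` stands in
    # for the hand-crafted per-move pattern features (an assumption here).
    def rollout_policy(weights, features, legal):
        # weights: (d,); features: (361, d); legal: (361,) boolean mask
        logits = features @ weights
        logits[~legal] = -np.inf               # forbid illegal moves
        exp = np.exp(logits - logits[legal].max())
        return exp / exp.sum()                 # probabilities over 361 points

The linearity is the point: one dot product per candidate move is what makes the 2 μs action selection possible.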

SLIDE 29

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

SLIDE 30

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions

SLIDE 31

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions
  • Play a game until the end, get the reward zt = ±r(sT) = ±1
SLIDE 32

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions
  • Play a game until the end, get the reward zt = ±r(sT) = ±1
  • Set zit = zt and play the same game again, this time updating the network parameters at each time step t

SLIDE 33

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions
  • Play a game until the end, get the reward zt = ±r(sT) = ±1
  • Set zit = zt and play the same game again, this time updating the network parameters at each time step t
  • Baseline: v(sit) = 0 “on the first pass through the training pipeline”, v(sit) = vθ(sit) “on the second pass”

SLIDE 34

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions
  • Play a game until the end, get the reward zt = ±r(sT) = ±1
  • Set zit = zt and play the same game again, this time updating the network parameters at each time step t
  • Baseline: v(sit) = 0 “on the first pass through the training pipeline”, v(sit) = vθ(sit) “on the second pass”
  • batch size n = 128 games
  • 10,000 batches
  • One day on 50 GPUs

SLIDE 35

Reinforcement policy network pρ(a|s)

Same architecture; weights are initialized with ρ = σ

  • Self-play: current network vs. a randomized pool of previous versions
  • Play a game until the end, get the reward zt = ±r(sT) = ±1
  • Set zit = zt and play the same game again, this time updating the network parameters at each time step t
  • Baseline: v(sit) = 0 “on the first pass through the training pipeline”, v(sit) = vθ(sit) “on the second pass”
  • batch size n = 128 games
  • 10,000 batches
  • One day on 50 GPUs
  • 80% wins against the Supervised Network
  • 85% wins against Pachi (no search yet!)
  • 3 ms to select an action
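The update being paraphrased here is the REINFORCE policy gradient with the baseline v(sit). A minimal sketch for one finished game, assuming PyTorch; all names are hypothetical:

    import torch

    # Minimal sketch of the REINFORCE update for one self-play game (assumed
    # PyTorch). `policy` is pρ; `states` (T, 48, 19, 19) and `actions` (T,)
    # replay the game; `z` is the final reward (+1/-1); `baseline` is v(s_t):
    # 0 on the first pass, vθ(s_t) on the second. Names are hypothetical.
    def reinforce_update(policy, optimizer, states, actions, z, baseline):
        probs = policy(states)                                   # (T, 361)
        logp = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
        loss = -((z - baseline) * logp).mean()   # minimizing -J ascends J
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()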

SLIDE 36

Value network vθ(s)

SLIDE 37

Value network vθ(s)

19 x 19 x 49 input
1 convolutional layer 5x5 with k=192 filters, ReLU
11 convolutional layers 3x3 with k=192 filters, ReLU
1 convolutional layer 1x1, ReLU
Fully connected layer with 256 ReLU units
Fully connected layer with 1 tanh unit
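The convolutional body is the same stack as the policy network (with a 49-plane input); only the head changes. A minimal sketch of that head, assuming PyTorch:

    import torch
    import torch.nn as nn

    # Minimal sketch of the value-network head (assumed PyTorch). It replaces
    # the policy network's softmax with fully connected layers and a tanh, so
    # the output is a single score in [-1, 1] rather than a move distribution.
    class ValueHead(nn.Module):
        def __init__(self, k=192):
            super().__init__()
            self.conv = nn.Conv2d(k, 1, 1)       # 1x1 convolution, ReLU
            self.fc1 = nn.Linear(19 * 19, 256)   # 256 ReLU units
            self.fc2 = nn.Linear(256, 1)         # 1 tanh unit

        def forward(self, x):                    # x: (batch, 192, 19, 19)
            h = torch.relu(self.conv(x)).flatten(1)
            h = torch.relu(self.fc1(h))
            return torch.tanh(self.fc2(h)).squeeze(1)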

SLIDE 38

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)

SLIDE 39

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)
  • Stochastic gradient descent to minimize MSE

SLIDE 40

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)
  • Stochastic gradient descent to minimize MSE
  • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:

SLIDE 41

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)
  • Stochastic gradient descent to minimize MSE
  • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  • choose a random time step u
  • sample moves t = 1…u-1 from the SL policy
  • make a random move u
  • sample t = u+1…T from the RL policy and get the game outcome z
  • add the pair (su, zu) to the training set

SLIDE 42

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)
  • Stochastic gradient descent to minimize MSE
  • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  • choose a random time step u
  • sample moves t = 1…u-1 from the SL policy
  • make a random move u
  • sample t = u+1…T from the RL policy and get the game outcome z
  • add the pair (su, zu) to the training set
  • One week on 50 GPUs to train on 50M batches of size m = 32

SLIDE 43

Value network vθ(s)

19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer with 256 ReLU units; fully connected layer with 1 tanh unit

  • Evaluate the value of the position s under policy p: vp(s) = E[zt | st = s, at…T ∼ p]
  • Double approximation: vθ(s) ≈ vpρ(s) ≈ v*(s)
  • MSE on the test set: 0.234
  • Close to the MC estimate from the RL policy, but 15,000× faster
  • Stochastic gradient descent to minimize MSE
  • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  • choose a random time step u
  • sample moves t = 1…u-1 from the SL policy
  • make a random move u
  • sample t = u+1…T from the RL policy and get the game outcome z
  • add the pair (su, zu) to the training set
  • One week on 50 GPUs to train on 50M batches of size m = 32
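A minimal sketch of that data-generation recipe; play_move, sl_policy, rl_policy, random_legal_move, is_terminal and game_outcome are hypothetical helpers standing in for the game engine and the trained networks.

    import random

    # Minimal sketch of generating one (s_u, z_u) training pair per game.
    # All helper functions are hypothetical placeholders.
    def generate_value_sample(initial_state, max_moves=450):
        s = initial_state
        u = random.randint(1, max_moves)        # choose a random time step u
        for t in range(1, u):                   # moves 1..u-1 from SL policy
            s = play_move(s, sl_policy(s))
        s = play_move(s, random_legal_move(s))  # one uniformly random move at u
        s_u = s
        while not is_terminal(s):               # moves u+1..T from RL policy
            s = play_move(s, rl_policy(s))
        z_u = game_outcome(s)  # +1/-1 from the view of the player to move at u
        return s_u, z_u                         # one pair per unique game

Taking a single position per game keeps the training pairs uncorrelated, which is why each pair must come from a unique self-play game.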

SLIDE 44

SLIDE 45

PLAYING

SLIDE 46

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

SLIDE 47

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Each node s has edges (s, a) for all legal actions and stores statistics:

  • Prior
  • Number of evaluations
  • Number of rollouts
  • MC value estimate
  • Rollout value estimate
  • Combined mean action value
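In code form, the per-edge record might look like the following sketch; the field names follow the paper's notation, but the layout itself is an assumption, not the actual implementation.

    from dataclasses import dataclass

    # Minimal sketch of the per-edge statistics listed above.
    @dataclass
    class EdgeStats:
        P: float = 0.0   # prior probability from the policy network
        Nv: int = 0      # number of value-network evaluations
        Nr: int = 0      # number of rollouts
        Wv: float = 0.0  # accumulated network evaluations (MC value estimate)
        Wr: float = 0.0  # accumulated rollout outcomes (rollout value estimate)
        Q: float = 0.0   # combined mean action value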

SLIDE 48

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Each node s has edges (s, a) for all legal actions and stores statistics: prior, number of evaluations, number of rollouts, MC value estimate, rollout value estimate, combined mean action value

Simulation starts at the root and stops at time L, when a leaf (unexplored state) is found.

SLIDE 49

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Each node s has edges (s, a) for all legal actions and stores statistics: prior, number of evaluations, number of rollouts, MC value estimate, rollout value estimate, combined mean action value

Simulation starts at the root and stops at time L, when a leaf (unexplored state) is found. Position sL is added to the evaluation queue.

SLIDE 50

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Each node s has edges (s, a) for all legal actions and stores statistics: prior, number of evaluations, number of rollouts, MC value estimate, rollout value estimate, combined mean action value

Simulation starts at the root and stops at time L, when a leaf (unexplored state) is found. Position sL is added to the evaluation queue.

A bunch of nodes has been selected for evaluation…

SLIDE 51

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

SLIDE 52

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Node s is evaluated using the value network to obtain vθ(s).

SLIDE 53

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Node s is evaluated using the value network to obtain vθ(s), and using rollout simulation with policy pπ until the end of each simulated game, to get the final game score.

SLIDE 54

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Node s is evaluated using the value network to obtain vθ(s), and using rollout simulation with policy pπ until the end of each simulated game, to get the final game score.

Each leaf is evaluated; we are ready to propagate updates

SLIDE 55

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

SLIDE 56

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Statistics along the paths of each simulation are updated during the backward pass through steps t < L

SLIDE 57

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Statistics along the paths of each simulation are updated during the backward pass through steps t < L; visit counts are updated as well

SLIDE 58

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Statistics along the paths of each simulation are updated during the backward pass through steps t < L; visit counts are updated as well. Finally, the overall evaluation of each visited state-action edge is updated

SLIDE 59

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Statistics along the paths of each simulation are updated during the backward pass through steps t < L; visit counts are updated as well. Finally, the overall evaluation of each visited state-action edge is updated

Current tree is updated
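The slides leave the combining rule implicit; in the paper, the combined mean action value mixes the two estimates with a weight λ (λ = 0.5 in the final system): Q(s, a) = (1 - λ) Wv/Nv + λ Wr/Nr. A minimal sketch, using the EdgeStats record from earlier:

    # Minimal sketch of the combined mean action value from the paper:
    # Q(s, a) = (1 - lam) * Wv/Nv + lam * Wr/Nr, with lam = 0.5.
    def combined_q(edge, lam=0.5):
        v = edge.Wv / edge.Nv if edge.Nv else 0.0  # mean network estimate
        r = edge.Wr / edge.Nr if edge.Nr else 0.0  # mean rollout estimate
        return (1 - lam) * v + lam * r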

SLIDE 60

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

SLIDE 61

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Once an edge (s, a) has been visited enough (nthr) times, it is included into the tree with its successor state s’

SLIDE 62

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Once an edge (s, a) has been visited enough (nthr) times, it is included into the tree with its successor state s’

It is initialized using the tree policy to pτ(a|s’) and later updated with the SL policy pσ(a|s’)

SLIDE 63

APV-MCTS ASYNCHRONOUS POLICY AND VALUE MCTS

Tree is expanded, fully updated and ready for the next move!

Once an edge (s, a) has been visited enough (nthr) times, it is included into the tree with its successor state s’

It is initialized using the tree policy to pτ(a|s’) and later updated with the SL policy pσ(a|s’)
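A minimal sketch of this expansion step; the tree container, its methods, the tree_policy helper, and the threshold default are all hypothetical, and EdgeStats is the record sketched earlier.

    # Minimal sketch of node expansion. `tree`, its methods, `tree_policy`
    # (pτ) and the nthr default are hypothetical. Priors start from the fast
    # tree policy and are later overwritten by the asynchronously computed
    # SL policy output pσ(a|s').
    def maybe_expand(tree, s, a, nthr=40):
        edge = tree.edges[(s, a)]
        if edge.Nr > nthr and (s, a) not in tree.children:
            s2 = tree.apply_move(s, a)               # successor state s'
            tree.children[(s, a)] = s2
            for b in tree.legal_moves(s2):           # temporary priors from pτ
                tree.edges[(s2, b)] = EdgeStats(P=tree_policy(s2, b))
            tree.enqueue_sl_policy(s2)               # pσ(a|s') refines P later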

SLIDE 64

SLIDE 65

WINNING

https://www.youtube.com/watch?v=oRvlyEpOQ-8

SLIDE 66

https://www.youtube.com/watch?v=oRvlyEpOQ-8