  1. Deep Reinforcement Learning [Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature 2016] CS 486/686 University of Waterloo Lecture 21: July 12, 2017

  2. Outline • AlphaGo – Supervised Learning of Policy Networks – Reinforcement Learning of Policy Networks – Reinforcement Learning of Value Networks – Searching with Policy and Value Networks CS486/686 Lecture Slides (c) 2017 P. Poupart

  3. Game of Go • (Simplified) rules: – Two players (black and white) – Players alternate placing a stone of their color on a vacant intersection – Connected stones without any liberty (i.e., no adjacent vacant intersection) are captured and removed from the board – Winner: the player who controls the largest number of intersections at the end of the game
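The capture rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the board encoding (a square list of lists with "." for vacant, "B" and "W" for stones) and the function names `group_and_liberties` and `remove_captured` are assumptions made for the example, and rules such as ko and suicide are ignored.

```python
def group_and_liberties(board, row, col):
    """Return (stones, liberties) for the connected group containing (row, col)."""
    color = board[row][col]
    size = len(board)
    stones, liberties = set(), set()
    stack = [(row, col)]
    while stack:
        r, c = stack.pop()
        if (r, c) in stones:
            continue
        stones.add((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < size and 0 <= nc < size:
                if board[nr][nc] == ".":
                    liberties.add((nr, nc))   # adjacent vacant intersection
                elif board[nr][nc] == color:
                    stack.append((nr, nc))    # same-colored neighbor joins the group
    return stones, liberties


def remove_captured(board, color):
    """Remove every group of `color` with no liberties; return the number of stones removed."""
    captured = set()
    for r in range(len(board)):
        for c in range(len(board)):
            if board[r][c] == color and (r, c) not in captured:
                stones, liberties = group_and_liberties(board, r, c)
                if not liberties:
                    captured |= stones
    for r, c in captured:
        board[r][c] = "."
    return len(captured)


# A white stone surrounded on all four sides by black stones is captured.
board = [list(row) for row in [".B.", "BWB", ".B."]]
removed = remove_captured(board, "W")
```

In this 3x3 example the single white stone has no adjacent vacant intersection, so it is removed, while every black stone still touches a vacant corner.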

  4. Computer Go • Techniques: deep RL and Monte Carlo Tree Search • Oct 2015: AlphaGo defeats Fan Hui (2-dan) • March 2016: AlphaGo defeats Lee Sedol (9-dan)

  5. Winning Strategy • Four steps: 1. Supervised Learning of Policy Networks 2. Reinforcement Learning of Policy Networks 3. Reinforcement Learning of Value Networks 4. Searching with Policy and Value Networks

  6. Policy Network • Train policy network to imitate Go experts based on a database of 30 million board configurations from the KGS Go Server. • Policy network: – Input: state (board configuration) – Output: distribution over actions (intersection on which the next stone will be placed)

  7. Supervised Learning of the Policy Network • Let $w$ be the weights of the policy network $\pi_w$ • Training: – Data: suppose $a$ is optimal in $s$ – Objective: maximize $\log \pi_w(a \mid s)$ – Gradient: $\nabla w = \frac{\partial \log \pi_w(a \mid s)}{\partial w}$ – Weight update: $w \leftarrow w + \alpha \nabla w$
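The supervised update (gradient ascent on the log-probability of the expert's action) can be sketched as follows. This is only an illustration under stated assumptions: a tiny linear-softmax policy stands in for the deep network, and the names `policy`, `grad_log_policy`, `supervised_update`, the toy state, and the step size `alpha=0.5` are all made up for the example.

```python
import math


def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(w, s):
    """Distribution over actions for a linear-softmax policy: logit[b] = w[b] . s"""
    return softmax([sum(wi * si for wi, si in zip(row, s)) for row in w])


def grad_log_policy(w, s, a):
    """Gradient of log pi_w(a|s) w.r.t. w: row b is (1[b == a] - pi_w(b|s)) * s."""
    probs = policy(w, s)
    return [[((1.0 if b == a else 0.0) - probs[b]) * si for si in s]
            for b in range(len(w))]


def supervised_update(w, s, a, alpha=0.5):
    """One step of w <- w + alpha * d log pi_w(a|s) / dw."""
    g = grad_log_policy(w, s, a)
    return [[wij + alpha * gij for wij, gij in zip(w_row, g_row)]
            for w_row, g_row in zip(w, g)]


# Toy data: one state (2 features), 3 possible actions, the expert plays action 2.
w = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
s = [1.0, 0.5]
expert_action = 2
before = policy(w, s)[expert_action]
for _ in range(20):
    w = supervised_update(w, s, expert_action)
after = policy(w, s)[expert_action]
```

With zero initial weights the policy starts uniform, and repeated updates raise the probability assigned to the expert's action.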

  8. Reinforcement Learning of the Policy Network • How can we update a policy network based on reinforcements instead of the optimal action? • Let $G_t = \sum_{n \ge 0} \gamma^n r_{t+n}$ be the discounted sum of rewards in a trajectory that starts in $s_t$ at time $t$ by executing $a_t$ • Gradient (with respect to the policy weights $w$): $\nabla w = \frac{\partial \log \pi_w(a_t \mid s_t)}{\partial w} \gamma^t G_t$ – Intuition: rescale the supervised learning gradient by $G_t$ – Formally: see derivation in [Sutton and Barto, Reinforcement learning, Chapter 13] • Weight update: $w \leftarrow w + \alpha \nabla w$

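  9. Reinforcement Learning of the Policy Network • In computer Go, the program repeatedly plays games against its former self. • For each game, let $G_T = 1$ if the game is won and $G_T = -1$ if it is lost • For each pair $(s_t, a_t)$ at turn $t$ of the game, compute – Gradient: $\nabla w = \frac{\partial \log \pi_w(a_t \mid s_t)}{\partial w} \gamma^t G_T$ – Weight update: $w \leftarrow w + \alpha \nabla w$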
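The per-game policy-gradient loop can be sketched on a toy problem. Everything here is an assumption made for illustration: a one-player game (win if action 1 is chosen on a majority of turns) stands in for self-play, a state-independent softmax over two logits stands in for the policy network, and the names `play_game` and `reinforce_update` are hypothetical. The gradients for all turns are evaluated at the weights that generated the game and applied together.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def play_game(theta, rng, turns=3):
    """Sample one toy game; return the actions taken and the outcome (+1 win, -1 loss)."""
    probs = softmax(theta)
    actions = [0 if rng.random() < probs[0] else 1 for _ in range(turns)]
    outcome = 1.0 if sum(actions) * 2 > turns else -1.0
    return actions, outcome


def reinforce_update(theta, actions, outcome, alpha=0.1, gamma=1.0):
    """Add alpha * gamma^t * outcome * d log pi(a_t) / d theta for every turn t."""
    probs = softmax(theta)               # the policy that generated this game
    new_theta = list(theta)
    for t, a in enumerate(actions):
        for b in range(len(theta)):
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            new_theta[b] += alpha * (gamma ** t) * outcome * grad_log
    return new_theta


rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(200):
    actions, outcome = play_game(theta, rng)
    theta = reinforce_update(theta, actions, outcome)
p_win_move = softmax(theta)[1]           # probability of the action that tends to win
```

Winning games reinforce the actions that were played and losing games discourage them, so the probability of the winning action ends up above its initial value of 0.5.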

  10. Value Network • Predict $V(s)$ (i.e., who will win the game) in each state $s$ with a value network – Input: state $s$ (board configuration) – Output: expected discounted sum of rewards $V(s)$

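  11. Reinforcement Learning of Value Networks • Let $v$ be the weights of the value network $V_v$ • Training: – Data: $(s, G)$ where $G = 1$ if the game is won and $G = -1$ if it is lost – Objective: minimize $\frac{1}{2} (V_v(s) - G)^2$ – Gradient: $\nabla v = \frac{\partial V_v(s)}{\partial v} (V_v(s) - G)$ – Weight update: $v \leftarrow v - \alpha \nabla v$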
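The value-network update is ordinary gradient descent on a squared error between the predicted value and the game outcome. A minimal sketch, assuming a linear value function in place of the deep network; the names `value` and `value_update`, the two toy states, and `alpha=0.1` are illustrative only.

```python
def value(v, s):
    """Linear stand-in for the value network: V_v(s) = v . s"""
    return sum(vi * si for vi, si in zip(v, s))


def value_update(v, s, g, alpha=0.1):
    """One step of v <- v - alpha * (V_v(s) - G) * dV_v(s)/dv, where dV_v(s)/dv = s."""
    error = value(v, s) - g
    return [vi - alpha * error * si for vi, si in zip(v, s)]


# Toy data: (state features, outcome G) with G = +1 for a win and -1 for a loss.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
v = [0.0, 0.0]
for _ in range(100):
    for s, g in data:
        v = value_update(v, s, g)
```

After training, the predicted value of the first state approaches +1 and that of the second approaches -1.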

  12. Searching with Policy and Value Networks • AlphaGo combines policy and value networks into a Monte Carlo Tree Search algorithm • Idea: construct a search tree – Node: state $s$ – Edge: action $a$

  13. Search Tree • At each edge store $Q(s,a)$, $\pi_w(a \mid s)$, $N(s,a)$ • where $N(s,a)$ is the visit count of $(s,a)$ • Figure: sample trajectory

  14. Simulation • At each node, select the edge $a^*$ that maximizes $Q(s,a) + u(s,a)$, i.e., $a^* = \arg\max_a \left[ Q(s,a) + u(s,a) \right]$ • where $u(s,a)$ is an exploration bonus
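The edge statistics and the selection rule can be sketched as a small data structure. This is an illustration under assumptions, not AlphaGo's exact rule: the bonus is taken to be `c * prior / (1 + visits)`, where the constant `c` and the names `Edge`, `exploration_bonus`, `select_action`, and `backup` are hypothetical, and Q is kept as a plain running mean of simulation results.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    q: float          # Q(s, a): mean value of simulations through this edge
    prior: float      # policy network probability of action a in state s
    visits: int = 0   # N(s, a): visit count


def exploration_bonus(edge, c=1.0):
    """Assumed bonus: large for likely but rarely visited moves, decaying with visits."""
    return c * edge.prior / (1 + edge.visits)


def select_action(edges, c=1.0):
    """Pick the action maximizing Q(s, a) + u(s, a) among the edges of one node."""
    return max(edges, key=lambda a: edges[a].q + exploration_bonus(edges[a], c))


def backup(edge, result):
    """Fold one simulation result into the running mean Q and bump the visit count."""
    edge.visits += 1
    edge.q += (result - edge.q) / edge.visits


edges = {
    "A": Edge(q=0.5, prior=0.2, visits=10),
    "B": Edge(q=0.3, prior=0.6, visits=0),
}
first_choice = select_action(edges)   # B: its unvisited bonus outweighs A's higher Q
backup(edges["B"], -1.0)              # the simulation through B ends in a loss
second_choice = select_action(edges)  # A: B's Q dropped and its bonus shrank
```

The bonus steers early simulations toward moves the policy network favors, and the visit count in the denominator shifts weight to the observed Q values as an edge is explored.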

  15. Competition
