Deep Reinforcement Learning
[Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature 2016]
CS 486/686, University of Waterloo
Lecture 21: July 12, 2017

Outline
• AlphaGo
  – Supervised Learning of Policy Networks
  – Reinforcement Learning of Policy Networks
  – Reinforcement Learning of Value Networks
  – Searching with Policy and Value Networks
CS486/686 Lecture Slides (c) 2017 P. Poupart

Game of Go
• (Simplified) rules:
  – Two players (black and white)
  – Players take turns placing a stone of their color on a vacant intersection.
  – Connected stones without any liberty (i.e., no adjacent vacant intersection) are captured and removed from the board.
  – Winner: the player who controls the largest number of intersections at the end of the game
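
The capture rule can be sketched as a flood fill over connected stones of one color. This is a minimal illustration, not code from the lecture: the board representation (a dict from `(row, col)` to `'B'` or `'W'`) and the function names `group_and_liberties` and `remove_captured` are assumptions chosen for the example.

```python
from typing import Dict, Set, Tuple

Point = Tuple[int, int]


def group_and_liberties(board: Dict[Point, str], start: Point, size: int):
    """Flood-fill the connected group containing `start`.

    Returns (stones, liberties): the group's stones and its adjacent vacant points.
    """
    color = board[start]
    stones: Set[Point] = {start}
    liberties: Set[Point] = set()
    frontier = [start]
    while frontier:
        r, c = frontier.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < size and 0 <= nc < size):
                continue  # off the board
            neighbor = (nr, nc)
            if neighbor not in board:
                liberties.add(neighbor)  # vacant adjacent intersection
            elif board[neighbor] == color and neighbor not in stones:
                stones.add(neighbor)
                frontier.append(neighbor)
    return stones, liberties


def remove_captured(board: Dict[Point, str], size: int, color: str) -> int:
    """Remove every group of `color` with no liberty; return the number of stones removed."""
    removed = 0
    for point in list(board):
        if point in board and board[point] == color:
            stones, liberties = group_and_liberties(board, point, size)
            if not liberties:
                for stone in stones:
                    del board[stone]
                removed += len(stones)
    return removed
```

For example, a white stone in the corner with black stones on both adjacent intersections has no liberty, so `remove_captured(board, size, 'W')` removes it.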

Computer Go
• Techniques: deep RL combined with Monte Carlo Tree Search
• Oct 2015: AlphaGo defeats Fan Hui (2-dan)
• March 2016: AlphaGo defeats Lee Sedol (9-dan)

Winning Strategy
• Four steps:
  1. Supervised Learning of Policy Networks
  2. Reinforcement Learning of Policy Networks
  3. Reinforcement Learning of Value Networks
  4. Searching with Policy and Value Networks

Policy Network
• Train a policy network to imitate Go experts, based on a database of 30 million board configurations from the KGS Go Server.
• Policy network:
  – Input: state (board configuration)
  – Output: distribution over actions (the intersection on which the next stone will be placed)

Supervised Learning of the Policy Network
• Let $w$ be the weights of the policy network
• Training:
  – Data: suppose $a^*$ is optimal in $s$
  – Objective: maximize $\log \pi_w(a^* \mid s)$
  – Gradient: $\nabla w = \frac{\partial \log \pi_w(a^* \mid s)}{\partial w}$
  – Weight update: $w \leftarrow w + \alpha \nabla w$
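
A minimal sketch of one supervised update, assuming a linear-softmax policy in place of AlphaGo's deep convolutional network. The functions `softmax`, `policy` and `supervised_step`, and the feature vector `s`, are hypothetical names for this example. For this policy the gradient of $\log \pi_w(a^* \mid s)$ with respect to the weight row of action $a$ is $(\mathbb{1}[a = a^*] - \pi_w(a \mid s))\, s$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(w, s):
    """pi_w(. | s) for a linear-softmax policy: logit[a] = sum_i w[a][i] * s[i]."""
    return softmax([sum(wi * si for wi, si in zip(row, s)) for row in w])


def supervised_step(w, s, a_star, alpha):
    """One update w <- w + alpha * d log pi_w(a_star | s) / dw (modifies w in place)."""
    probs = policy(w, s)
    for a, row in enumerate(w):
        coeff = (1.0 if a == a_star else 0.0) - probs[a]
        for i, si in enumerate(s):
            row[i] += alpha * coeff * si
    return w
```

Repeating `supervised_step` on (state, expert move) pairs raises the probability the policy assigns to the expert's move in those states.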

Reinforcement Learning of the Policy Network
• How can we update a policy network based on reinforcements instead of the optimal action?
• Let $G = \sum_t \gamma^t r_t$ be the discounted sum of rewards in a trajectory that starts in $s$ by executing $a$.
• Gradient: $\nabla w = \frac{\partial \log \pi_w(a \mid s)}{\partial w} G$
  – Intuition: rescale the supervised learning gradient by $G$
  – Formally: see the derivation in [Sutton and Barto, Reinforcement Learning, Chapter 13]
• Weight update: $w \leftarrow w + \alpha \nabla w$
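
The rescaling intuition can be checked in a simplified one-step setting. This is a sketch, not the full derivation in Sutton and Barto: it assumes a single decision in state $s$ and a return $G(s, a)$ that does not depend on $w$.

```latex
\begin{align*}
J(w) &= \mathbb{E}_{a \sim \pi_w(\cdot \mid s)}\big[G(s,a)\big]
      = \sum_a \pi_w(a \mid s)\, G(s,a) \\
\frac{\partial J(w)}{\partial w}
     &= \sum_a \frac{\partial \pi_w(a \mid s)}{\partial w}\, G(s,a) \\
     &= \sum_a \pi_w(a \mid s)\,
        \frac{\partial \log \pi_w(a \mid s)}{\partial w}\, G(s,a) \\
     &= \mathbb{E}_{a \sim \pi_w(\cdot \mid s)}\!\left[
        \frac{\partial \log \pi_w(a \mid s)}{\partial w}\, G(s,a)\right]
\end{align*}
```

The third line uses $\partial \pi / \partial w = \pi \, \partial \log \pi / \partial w$. Under these assumptions, sampling $a$ from the policy and scaling the log-probability gradient by the observed return gives an unbiased estimate of the gradient of the expected return.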

Reinforcement Learning of the Policy Network
• In computer Go, the program repeatedly plays games against its former self.
• For each game, let $G = 1$ for a win and $G = -1$ for a loss
• For each $(s_t, a_t)$ at turn $t$ of the game, compute
  – Gradient: $\nabla w = \frac{\partial \log \pi_w(a_t \mid s_t)}{\partial w} G$
  – Weight update: $w \leftarrow w + \alpha \nabla w$
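
A minimal sketch of the per-game update, again assuming a linear-softmax policy rather than a deep network. The names `action_probs` and `reinforce_game_update` are hypothetical. The `trajectory` is assumed to hold only the learner's own (state features, action) pairs, and `outcome` is the game result from the learner's perspective (+1 for a win, -1 for a loss).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probs(w, s):
    """pi_w(. | s) for a linear-softmax policy with one weight row per action."""
    return softmax([sum(wi * si for wi, si in zip(row, s)) for row in w])


def reinforce_game_update(w, trajectory, outcome, alpha):
    """For every turn, apply w <- w + alpha * outcome * d log pi_w(a_t | s_t) / dw."""
    for s, a_t in trajectory:
        probs = action_probs(w, s)
        for a, row in enumerate(w):
            coeff = (1.0 if a == a_t else 0.0) - probs[a]
            for i, si in enumerate(s):
                row[i] += alpha * outcome * coeff * si
    return w
```

Moves played in a won game become more likely and moves played in a lost game become less likely, which is the supervised update rescaled by the outcome.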

Value Network
• Predict $V(s)$ (i.e., who will win the game) in each state $s$ with a value network
  – Input: state $s$ (board configuration)
  – Output: expected discounted sum of rewards $V(s)$

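Reinforcement Learning of Value Networks
• Let $v$ be the weights of the value network
• Training:
  – Data: $(s, G)$ where $G = 1$ for a win and $G = -1$ for a loss
  – Objective: minimize $\frac{1}{2}(V_v(s) - G)^2$
  – Gradient: $\nabla v = \frac{\partial V_v(s)}{\partial v}(V_v(s) - G)$
  – Weight update: $v \leftarrow v - \alpha \nabla v$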
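
A minimal sketch of one value-network update on the squared error $\frac{1}{2}(V_v(s) - G)^2$, assuming a toy model $V_v(s) = \tanh(v \cdot s)$ so that predictions stay in $[-1, 1]$. The model and the names `value` and `value_step` are assumptions for this example, not the network used in AlphaGo.

```python
import math


def value(v, s):
    """Toy value model: V_v(s) = tanh(v . s), bounded in [-1, 1]."""
    return math.tanh(sum(vi * si for vi, si in zip(v, s)))


def value_step(v, s, G, alpha):
    """One gradient-descent step on 0.5 * (V_v(s) - G)^2; returns the new weights."""
    pred = value(v, s)
    error = pred - G             # V_v(s) - G
    slope = 1.0 - pred * pred    # derivative of tanh at v . s
    # dV/dv_i = slope * s_i, so the gradient of the loss is error * slope * s_i
    return [vi - alpha * error * slope * si for vi, si in zip(v, s)]
```

Note the minus sign in the update: the value network minimizes an error, whereas the policy updates maximize an objective.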

Searching with Policy and Value Networks
• AlphaGo combines policy and value networks into a Monte Carlo Tree Search algorithm
• Idea: construct a search tree
  – Node: state $s$
  – Edge: action $a$
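
One way to represent such a tree is sketched below. The slide only names nodes and edges, so everything else is an assumption. The per-edge statistics (prior from the policy network, visit count, mean value) and the selection score follow the general scheme described in the AlphaGo paper. The class and function names, the constant `c_puct = 1.0`, and the `+ 1` under the square root (added here so that priors break ties before any visit) are choices made for this example.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Hashable


@dataclass
class Edge:
    """Statistics for taking one action from a node."""
    prior: float               # P(s, a), e.g. from the policy network
    visits: int = 0            # N(s, a)
    total_value: float = 0.0   # sum of evaluations backed up through this edge

    @property
    def mean_value(self) -> float:
        """Q(s, a): average backed-up evaluation, 0 if never visited."""
        return self.total_value / self.visits if self.visits else 0.0


@dataclass
class Node:
    """A search-tree node: a state plus one Edge per candidate action."""
    state: Hashable
    edges: Dict[Hashable, Edge] = field(default_factory=dict)


def select_action(node: Node, c_puct: float = 1.0):
    """Pick the action maximizing Q(s, a) plus a prior-weighted exploration bonus."""
    total = sum(e.visits for e in node.edges.values())

    def score(item):
        _, edge = item
        bonus = c_puct * edge.prior * math.sqrt(total + 1) / (1 + edge.visits)
        return edge.mean_value + bonus

    return max(node.edges.items(), key=score)[0]


def backup(edge: Edge, leaf_value: float) -> None:
    """Record one evaluation (e.g. from the value network) on a traversed edge."""
    edge.visits += 1
    edge.total_value += leaf_value
```

In this sketch the policy network steers the search toward promising moves through `prior`, and evaluations such as the value network's prediction enter through `backup`, gradually replacing the prior as visit counts grow.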
