SLIDE 1

Deep Reinforcement Learning

[Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature 2016]

CS 486/686 University of Waterloo Lecture 21: July 12, 2017

SLIDE 2

CS486/686 Lecture Slides (c) 2017 P. Poupart

Outline

  • AlphaGo

    – Supervised Learning of Policy Networks
    – Reinforcement Learning of Policy Networks
    – Reinforcement Learning of Value Networks
    – Searching with Policy and Value Networks

SLIDE 3

Game of Go

  • (simplified) rules:

    – Two players (black and white)
    – Players alternate placing a stone of their color on a vacant intersection
    – Connected stones without any liberty (i.e., no adjacent vacant intersection) are captured and removed from the board
    – Winner: the player who controls the largest number of intersections at the end of the game
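The capture rule can be illustrated with a small flood fill over connected stones. This is only a sketch and not part of the lecture: the board encoding (a list of strings using 'B', 'W' and '.') and the names `group_and_liberties` and `is_captured` are assumptions made for the example.

```python
def group_and_liberties(board, row, col):
    """Return (stones, liberties) for the group containing board[row][col].

    board is a list of equal-length strings: 'B', 'W', or '.' (vacant).
    This encoding is an assumption made for the example.
    """
    color = board[row][col]
    if color == ".":
        raise ValueError("no stone at the given intersection")
    stones, liberties = set(), set()
    stack = [(row, col)]
    while stack:
        r, c = stack.pop()
        if (r, c) in stones:
            continue
        stones.add((r, c))
        # Only the four orthogonal neighbours count as adjacent.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < len(board) and 0 <= nc < len(board[nr]):
                if board[nr][nc] == ".":
                    liberties.add((nr, nc))
                elif board[nr][nc] == color:
                    stack.append((nr, nc))
    return stones, liberties


def is_captured(board, row, col):
    """A group is captured when it has no liberty left."""
    _, liberties = group_and_liberties(board, row, col)
    return len(liberties) == 0
```

For example, a single black stone surrounded on all four sides by white stones has no liberties, so `is_captured` returns `True` for it.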

SLIDE 4

Computer Go

  • Oct 2015: AlphaGo defeats Fan Hui (2-dan)
  • March 2016: AlphaGo defeats Lee Sedol (9-dan)

[Figure labels: Monte Carlo Tree Search, Deep RL]

SLIDE 5

Winning Strategy

  • Four steps:

    1. Supervised Learning of Policy Networks
    2. Reinforcement Learning of Policy Networks
    3. Reinforcement Learning of Value Networks
    4. Searching with Policy and Value Networks

SLIDE 6

Policy Network

  • Train policy network to imitate Go experts based on a database of 30 million board configurations from the KGS Go Server.

  • Policy network: $\pi(a \mid s)$

    – Input: state $s$ (board configuration)
    – Output: distribution over actions $a$ (intersection on which the next stone will be placed)
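To make the output concrete, the sketch below turns per-intersection scores into a probability distribution over moves. Everything here is assumed for illustration: the scores stand in for network outputs, the softmax normalization and the restriction to vacant intersections are choices made for the example (real legality checks in Go involve more than vacancy), and `move_distribution` is a hypothetical name.

```python
import math


def move_distribution(scores, board):
    """Map hypothetical per-intersection scores to a distribution over moves.

    scores[r][c] is a real-valued score; board[r][c] is '.', 'B' or 'W'.
    Only vacant intersections receive probability mass (a simplification).
    """
    vacant = [(r, c)
              for r, row in enumerate(board)
              for c, cell in enumerate(row)
              if cell == "."]
    if not vacant:
        return {}
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores[r][c] for r, c in vacant)
    exps = {(r, c): math.exp(scores[r][c] - top) for r, c in vacant}
    total = sum(exps.values())
    return {move: e / total for move, e in exps.items()}
```

The returned dictionary maps each candidate intersection `(row, col)` to its probability, and the probabilities sum to one.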

SLIDE 7

Supervised Learning of the Policy Network

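  • Let $w$ be the weights of the policy network

  • Training:

    – Data: suppose $a$ is optimal in $s$
    – Objective: maximize $\log \pi_w(a \mid s)$
    – Gradient: $\nabla_w = \frac{\partial \log \pi_w(a \mid s)}{\partial w}$
    – Weight update: $w \leftarrow w + \alpha \nabla_w$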
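The update can be tried out on a model small enough to differentiate by hand. The sketch below assumes a linear softmax policy, which is not the deep network used by AlphaGo; for that model the gradient of the log-probability with respect to row k of the weights is (1[k = a] − π(k | s)) · s. The function names and the learning rate are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def policy(w, s):
    """pi_w(. | s) for a linear softmax policy; w[k] is the weight row of action k."""
    return softmax([sum(w_kj * s_j for w_kj, s_j in zip(w_k, s)) for w_k in w])


def grad_log_policy(w, s, a):
    """Gradient of log pi_w(a | s) with respect to w (same shape as w)."""
    probs = policy(w, s)
    return [[((1.0 if k == a else 0.0) - probs[k]) * s_j for s_j in s]
            for k in range(len(w))]


def supervised_step(w, s, a, alpha=0.1):
    """One gradient ascent step: w <- w + alpha * grad log pi_w(a | s)."""
    grad = grad_log_policy(w, s, a)
    return [[w_kj + alpha * g_kj for w_kj, g_kj in zip(w_k, g_k)]
            for w_k, g_k in zip(w, grad)]
```

Starting from all-zero weights, a single call to `supervised_step` raises the probability of the expert action `a` in state `s` above the uniform value.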

SLIDE 8

Reinforcement Learning of the Policy Network

  • How can we update a policy network based on reinforcements instead of the optimal action?

  • Let $G = \sum_{t} \gamma^t r_t$ be the discounted sum of rewards in a trajectory that starts in $s$ by executing $a$.

  • Gradient: $\nabla_w = \frac{\partial \log \pi_w(a \mid s)}{\partial w} G$

    – Intuition: rescale the supervised learning gradient by $G$
    – Formally: see derivation in [Sutton and Barto, Reinforcement learning, Chapter 13]

  • Weight update: $w \leftarrow w + \alpha \nabla_w$

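A minimal sketch of this rescaled update follows. It assumes a softmax policy over action preferences (logits) for one fixed state, so the log-probability gradient with respect to logit k is 1[k = a] − π(k). The names `discounted_return` and `reinforce_step` and the default learning rate are hypothetical.

```python
import math


def discounted_return(rewards, gamma):
    """G = sum_t gamma^t * r_t, accumulated from the end of the trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def reinforce_step(logits, a, G, alpha=0.1):
    """w <- w + alpha * G * grad log pi_w(a) for a softmax over logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [w_k + alpha * G * ((1.0 if k == a else 0.0) - probs[k])
            for k, w_k in enumerate(logits)]
```

With a positive return the chosen action becomes more likely, and with a negative return it becomes less likely, which is the rescaling intuition stated on the slide.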
SLIDE 9

Reinforcement Learning of the Policy Network

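  • In computer Go, the program repeatedly plays games against its former self.

  • For each game: $G = +1$ for a win, $G = -1$ for a loss

  • For each $(s, a)$ of a turn of the game, compute

    – Gradient: $\nabla_w = \frac{\partial \log \pi_w(a \mid s)}{\partial w} G$
    – Weight update: $w \leftarrow w + \alpha \nabla_w$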
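The self-play loop can be sketched on a toy game that is far simpler than Go. In the assumed game both sides pick a number in {0, 1, 2} and the higher number wins, so each game has a single turn; ties give a return of 0, which is an artifact of the toy game. The opponent is a frozen copy of the learner that is refreshed periodically. All names and hyperparameters below are hypothetical.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def self_play(num_games=3000, refresh_every=100, alpha=0.1, seed=0):
    """Train a stateless softmax policy against snapshots of itself."""
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]      # logits of the current policy
    former = list(w)         # frozen copy playing the opponent
    for game in range(num_games):
        if game % refresh_every == 0:
            former = list(w)  # the opponent becomes a snapshot of the learner
        probs = softmax(w)
        a = sample(probs, rng)
        b = sample(softmax(former), rng)
        G = (a > b) - (a < b)  # +1 win, -1 loss, 0 tie
        # Single-turn game: one (s, a) pair to update per game.
        for k in range(len(w)):
            grad_k = (1.0 if k == a else 0.0) - probs[k]
            w[k] += alpha * grad_k * G
    return softmax(w)
```

In this toy game the action 2 never loses, so the learned distribution should shift toward it as training proceeds.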

SLIDE 10

Value Network

  • Predict $V(s)$ (i.e., who will win the game) in each state $s$ with a value network

    – Input: state $s$ (board configuration)
    – Output: expected discounted sum of rewards $V(s)$

SLIDE 11

Reinforcement Learning of Value Networks

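  • Let $v$ be the weights of the value network

  • Training:

    – Data: $(s, G)$ where $G = +1$ for a win and $G = -1$ for a loss
    – Objective: minimize $\frac{1}{2}(V_v(s) - G)^2$
    – Gradient: $\nabla_v = \frac{\partial V_v(s)}{\partial v}(V_v(s) - G)$
    – Weight update: $v \leftarrow v - \alpha \nabla_v$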
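The regression update is easy to check on a linear value function, for which the derivative of V with respect to the weights is simply the state feature vector. The linear model is an assumption made for this sketch (it is not the deep network used by AlphaGo), and the names `value` and `value_step` are hypothetical.

```python
def value(v, s):
    """Linear value estimate V_v(s) = v . s (an assumed toy model)."""
    return sum(v_j * s_j for v_j, s_j in zip(v, s))


def value_step(v, s, G, alpha=0.1):
    """One gradient descent step on 0.5 * (V_v(s) - G)^2.

    For the linear model dV/dv = s, so the gradient is (V_v(s) - G) * s.
    """
    error = value(v, s) - G
    return [v_j - alpha * error * s_j for v_j, s_j in zip(v, s)]
```

Repeating `value_step` on the same pair `(s, G)` moves the prediction toward `G`, provided the step size is small enough for the chosen features.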

SLIDE 12

Searching with Policy and Value Networks

  • AlphaGo combines policy and value networks into a Monte Carlo Tree Search algorithm

  • Idea: construct a search tree

    – Node: state $s$
    – Edge: action $a$

SLIDE 13

Search Tree

  • At each edge store $Q(s, a)$, $\pi_w(a \mid s)$, $N(s, a)$

  • Where $N(s, a)$ is the visit count of $(s, a)$

[Figure: sample trajectory]
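One possible way to hold the per-edge statistics is sketched below. The class and field names are hypothetical, and the `backup` method, which keeps Q as the running mean of the evaluations passed through the edge, is an assumption based on common Monte Carlo Tree Search practice rather than something stated on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    prior: float        # pi_w(a | s) from the policy network
    visits: int = 0     # N(s, a)
    q: float = 0.0      # Q(s, a)

    def backup(self, evaluation):
        """Record one simulation through this edge (running-mean update)."""
        self.visits += 1
        self.q += (evaluation - self.q) / self.visits


@dataclass
class Node:
    # Maps each action a available in this state to its Edge statistics.
    edges: dict = field(default_factory=dict)
```

A node for state s would then hold one `Edge` per action, for example `Node(edges={"a1": Edge(prior=0.7), "a2": Edge(prior=0.3)})`.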

SLIDE 14

Simulation

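  • At each node, select the edge $a^*$ that maximizes $a^* = \arg\max_a Q(s, a) + u(s, a)$

  • where $u(s, a) \propto \frac{\pi_w(a \mid s)}{1 + N(s, a)}$ is an exploration bonus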
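The selection rule can be sketched in a few lines. The proportionality constant `c` and the representation of edges as `(Q, prior, N)` tuples are assumptions made for the example, and `select_action` is a hypothetical name.

```python
def select_action(edges, c=1.0):
    """Return argmax_a Q(s, a) + u(s, a), with u = c * prior / (1 + N).

    edges maps each action to a tuple (Q, prior, N); c is an assumed constant.
    """
    def score(item):
        _, (q, prior, visits) = item
        return q + c * prior / (1 + visits)

    return max(edges.items(), key=score)[0]
```

Because the bonus shrinks as the visit count grows, an edge with a high prior is tried first, and edges with better estimated values take over once it has been visited often.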

SLIDE 15

Competition