A Deep Journey of Playing Games with RL

NSE Seminar
Kim Hammar (kimham@kth.se)
January 31, 2020


Why Games

[Figure: a collage of AI/ML icons (feed-forward network, agent–environment loop, convolutional network) next to game icons (a chess game tree of depth 3 and branching factor 3, shogi boards, video games)]

Why combine the two?

▸ AI & games have a long history (Turing '50, Minsky '60)
▸ Games are simple to evaluate, reproducible, and controllable, with a quick feedback loop
▸ Games are a common benchmark for the research community

1997: Deep Blue¹ vs. Kasparov

¹ Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu. "Deep Blue". Artificial Intelligence 134.1–2 (Jan. 2002), pp. 57–83. doi: 10.1016/S0004-3702(01)00129-1.

1992: Tesauro's TD-Gammon²

² Gerald Tesauro. "TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play". Neural Computation 6.2 (Mar. 1994), pp. 215–219. doi: 10.1162/neco.1994.6.2.215.

1959: Arthur Samuel's Checkers Player³

³ A. L. Samuel. "Some Studies in Machine Learning Using the Game of Checkers". IBM Journal of Research and Development 3.3 (July 1959), pp. 210–229. doi: 10.1147/rd.33.0210.


Papers in Focus Today

▸ AlphaGo⁴ (Nature 2016, 6.5k citations)
▸ AlphaGo Zero⁵ (Nature 2017, 2.5k citations)
▸ AlphaZero⁶ (Science 2018, 400 citations)

⁴ David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
⁵ David Silver et al. "Mastering the game of Go without human knowledge". Nature 550 (Oct. 2017), pp. 354–359. doi: 10.1038/nature24270.
⁶ David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". Science 362.6419 (2018), pp. 1140–1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.

The Reinforcement Learning Problem

▸ Notation: policy π, state s, reward r, action a
▸ Agent's goal: maximize the return R_t = ∑_{k=0}^∞ γ^k r_{t+k+1}, where 0 ≤ γ ≤ 1
▸ RL's goal: find the optimal policy π* = arg max_π E[R ∣ π]

[Figure: agent–environment loop: the agent in state s_t takes action a_t; the environment returns reward r_{t+1} and next state s_{t+1}]
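
To make the return concrete, here is a minimal Python sketch (purely illustrative) that computes the discounted return of a finite reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum_{k>=0} gamma^k * r_{t+k+1}, for a finite episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Rewards 1, 0, 1 with gamma = 0.9: R = 1 + 0.9*0 + 0.81*1 = 1.81
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))
```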


RL Examples: Elevator (Crites & Barto '95⁷)

[Figure: an elevator agent (neural network) observes the elevator position, receives a reward r_{t+1} ∈ ℝ, and selects a_{t+1} ∈ {up, down, wait, stop at floor 1, …, n}]

⁷ Robert H. Crites and Andrew G. Barto. "Improving Elevator Performance Using Reinforcement Learning". In: Proceedings of the 8th International Conference on Neural Information Processing Systems (NIPS'95). Denver, Colorado: MIT Press, 1995, pp. 1017–1023.

RL Examples: Atari (Mnih et al. '15⁸)

[Figure: a DQN agent maps screen-frame observations ∈ ℝ^{4×84×84} to action values Q(s, a_1), …, Q(s, a_18); the environment returns a reward ∈ ℝ]

⁸ Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". Nature 518.7540 (Feb. 2015), pp. 529–533. doi: 10.1038/nature14236.
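
As a rough sketch of what such an agent looks like in code (layer sizes follow the paper's description, but treat the details as illustrative, not a faithful reproduction):

```python
import torch
import torch.nn as nn

# Q-network mapping 4 stacked 84x84 frames to 18 action values, DQN-style
q_net = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
    nn.Linear(512, 18),                    # Q(s, a_1), ..., Q(s, a_18)
)

obs = torch.zeros(1, 4, 84, 84)            # one stacked observation
greedy_action = q_net(obs).argmax(dim=1)   # act greedily w.r.t. Q
```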


slide-14
SLIDE 14

Intro AlphaGo AlphaGo Zero AlphaZero Summary

How to Act Optimally? (Bellman 57’10)

  • ptimal(st) = max

π

E[

k=1

γk−1rt+k∣st] = max

π

E[rt+1

k=2

γk−1rt+k∣st] = max

at E[rt+1 + max π

E[

k=2

γk−1rt+k∣st+1]∣st] = max

at E[rt+1 + γ max π

E[

k=2

γk−2rt+k∣st+1]∣st] = max

at E[rt+1 + γ max π

E[

k=2

γk−2rt+k∣st+1]∣st] = max

at E[rt+1 + γoptimal(st+1)∣st]

⁹ Richard Bellman. Dynamic Programming. Dover Publications, 1957. isbn: 9780486428093.
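
This recursion is exactly what value iteration implements. A toy sketch on a made-up 3-state, 2-action MDP (the transition and reward tables are invented for illustration):

```python
import numpy as np

# P[a][s][s'] transition probabilities, R[a][s] rewards for taking a in s.
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
gamma, V = 0.9, np.zeros(3)

for _ in range(100):
    # One Bellman backup per state: V(s) <- max_a E[r + gamma * V(s')]
    V = np.max(R + gamma * (P @ V), axis=0)
print(V)  # converges to the optimal state values
```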


Reinforcement Learning: An Overview

Deep Reinforcement Learning
▸ [Figure: features (x_1, …, x_n) pass through a deep model θ to a prediction ŷ, trained with the loss gradient ∇_θ L(y, ŷ)]
▸ Algorithms: DQN, DDPG, Double-DQN

Direct Policy Search
▸ [Figure: search directly in a policy space π_1, π_2, π_3, π_4 over weights w]
▸ Algorithms: REINFORCE, Evolutionary Search, DPG, Actor-Critic

Value-Based Methods
▸ Algorithms: Value Iteration, Monte-Carlo, TD(0), Q-Learning, Sarsa

Heuristic Search
▸ Selection, expansion, simulation (a rollout game using a model of the environment M and a policy π), and backpropagation of the result in the tree
▸ Algorithms: MCTS, Minimax Search

Mathematical Foundations
▸ Markov Chains, Markov Reward Processes, Markov Decision Processes

AlphaGo '2016¹⁰

¹⁰ David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.


The Game of Go

▸ The world's oldest board game: 3,000 years old, over 40M players worldwide
▸ To win: capture the most territory on the board
▸ Surrounded stones/areas are captured and removed
▸ Why is it so hard for computers? About 10^170 unique states and an average branching factor of about 250
▸ High branching factor, large board (19 × 19), positions that are hard to evaluate, etc.

[Figure: board diagrams illustrating the capture of surrounded stones]


Game Trees

▸ How do you program a computer to play a board game?
▸ Simplest approach: (1) build a game tree; (2) assume the opponent thinks like you; (3) look ahead and evaluate each move (see the sketch below)
▸ Requires knowledge of the game rules and an evaluation function

[Figure: a game tree of depth 3 with branching factor 3 (plies 1–3)]
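
A minimal sketch of this look-ahead idea, with the game tree given explicitly as nested lists (the leaf values are invented):

```python
def minimax(node, maximizing=True):
    """Plain minimax over an explicit game tree: interior nodes are lists of
    children, leaves are evaluation scores from the maximizing player's view."""
    if not isinstance(node, list):          # leaf: static evaluation
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# A depth-3 tree with branching factor 3, as in the figure above
tree = [[[3, 5, 2], [1, 4, 6], [0, 7, 2]],
        [[5, 1, 3], [9, 2, 4], [6, 0, 8]],
        [[2, 2, 2], [4, 1, 5], [3, 6, 1]]]
print(minimax(tree))  # value of the root under optimal play by both sides
```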

Search + Go = ...

Some Numbers

▸ Atoms in the universe: ≈ 10^80
▸ Unique states: Go ≈ 10^170, Chess ≈ 10^47
▸ Game-tree complexity: Go ≈ 10^360, Chess ≈ 10^123
▸ Average branching factor: Go ≈ 250, Chess ≈ 35
▸ Board size (positions): Go 361, Chess 64

[Plot: the number of states 250^depth reaches billions within a few plies]
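
A quick sanity check of what these branching factors imply for exhaustive look-ahead:

```python
# Nodes in a full game tree after d plies, with the branching factors above
for d in (2, 4, 6):
    print(f"depth {d}: chess ~ 35**{d} = {35 ** d:.1e}, "
          f"go ~ 250**{d} = {250 ** d:.1e}")
# At depth 6: chess ~ 1.8e9 nodes, Go ~ 2.4e14; brute force is hopeless.
```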


Monte-Carlo Tree Search

▸ Selection: a selection function is applied recursively from the root until a leaf node is reached
▸ Expansion: one or more nodes in the search tree are created
▸ Simulation: a rollout game is played to a terminal state using a model of the environment M and a policy π
▸ Backpropagation: the result of the rollout is backpropagated in the tree
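
A compact single-agent sketch of the four phases. The game interface (moves, apply_, terminal, reward) is assumed, not a real API, and a two-player version would additionally negate values at alternating levels:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    """Selection score (UCT): exploit mean value, explore rarely-tried nodes."""
    if node.visits == 0:
        return float("inf")
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts(root_state, moves, apply_, terminal, reward, n_sims=1000):
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        # 1. Selection: descend through the tree via the UCB score
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: add children of the reached leaf (if non-terminal)
        if not terminal(node.state):
            node.children = [Node(apply_(node.state, m), node)
                             for m in moves(node.state)]
            node = random.choice(node.children)
        # 3. Simulation: random rollout from the leaf to a terminal state
        state = node.state
        while not terminal(state):
            state = apply_(state, random.choice(moves(state)))
        # 4. Backpropagation: push the rollout result up to the root
        result = reward(state)
        while node is not None:
            node.visits += 1
            node.value += result
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state  # most-visited move
```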


AlphaGo's Approach

▸ Brute-force search does not work (at least not until hardware has improved a lot)
▸ Human Go professionals rely on a small search guided by intuition and experience
▸ AlphaGo's approach: complement MCTS with "artificial intuition"
▸ The artificial intuition is provided by two neural networks: a value network and a policy network

Computer Go as an RL Problem

[Figure: the AlphaGo agent observes board features ∈ ℝ^{K×19×19}, receives a reward ∈ [−1, 1], and selects the move a_{t+1}]


AlphaGo Training Pipeline (1/2)

From a dataset D_1 of human expert moves, two policy networks are trained by supervised classification (a training-step sketch follows below):

▸ Rollout policy network p_π(a ∣ s): a fast, shallower network, trained to minimize L(predicted move, expert move)
▸ SL policy network p_σ(a ∣ s): a deeper convolutional network, trained to minimize L(predicted move, expert move)
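
A minimal PyTorch sketch of one such supervised step; the layer sizes are illustrative and the batch is random stand-in data:

```python
import torch
import torch.nn as nn

K, MOVES = 48, 19 * 19          # input feature planes and board points
policy_net = nn.Sequential(     # conv stack ending in logits over moves
    nn.Conv2d(K, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(192, 192, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(192, 1, kernel_size=1),
    nn.Flatten(),
)
opt = torch.optim.SGD(policy_net.parameters(), lr=0.003)
loss_fn = nn.CrossEntropyLoss()                # L(predicted move, expert move)

boards = torch.randn(32, K, 19, 19)            # stand-in batch of positions
expert_moves = torch.randint(0, MOVES, (32,))  # stand-in expert labels
loss = loss_fn(policy_net(boards), expert_moves)
opt.zero_grad(); loss.backward(); opt.step()
```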



Intro AlphaGo AlphaGo Zero AlphaZero Summary

AlphaGo Training Pipeline (2/2)

C O N V C O N V C O N V C O N V C O N V

Reinforcement Learning Policy Network pρ (a∣s) Initialize with pσ weights PolicyGradient J(pρ) = Epρ [∑∞

t=0 rt]

ρ ← ρ + α∇ρJ(pρ) Self Play

C O N V C O N V C O N V C O N V C O N V

Supervised Value Network vθ(s′) Classification minvθL(predicted outcome,actual outcome)

Self Play Dataset D2
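
A sketch of one REINFORCE-style update from a finished self-play game, assuming a policy_net like the one above; the handling of per-player outcome signs is simplified:

```python
import torch

def reinforce_update(policy_net, opt, states, actions, outcome):
    """One REINFORCE step: states (T, K, 19, 19), actions (T,) move indices,
    outcome z in {+1, -1} for the player whose moves these are."""
    log_probs = torch.log_softmax(policy_net(states), dim=1)
    chosen = log_probs[torch.arange(len(actions)), actions]
    loss = -outcome * chosen.sum()   # ascend E[z * sum_t log p(a_t | s_t)]
    opt.zero_grad(); loss.backward(); opt.step()
```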


Guided Search Using the Policy Network

[Figure: the policy network p_{σ/ρ}(a ∣ s) is evaluated at the nodes of the search tree to prioritize which moves to explore]

Depth-Limited Search Using the Value Network

[Figure: the value network v_θ(s′) evaluates leaf positions so the search can be truncated instead of rolled out to the end of the game]


AlphaGo Prediction Pipeline

1. Input position: the current game state s
2. ML predictions: the policy network p_{σ/ρ}(a ∣ s) outputs an action distribution and the value network v_θ(s′) outputs value estimates
3. Artificial intuition by NNs: the predictions highlight high-value actions
4. Guided look-ahead search (self-play): MCTS (selection, expansion, simulation, backpropagation) steered by the network outputs
5. Selected move: the action with the best search statistics is played

AlphaGo vs. Lee Sedol

▸ In March 2016, AlphaGo won against Lee Sedol 4–1
▸ Lee Sedol was an 18-time world champion prior to the match
▸ Two famous moves: Move 37 by AlphaGo and Move 78 by Sedol

AlphaGo: Key Takeaways

1. Different AI techniques can be complementary:
   ▸ Supervised learning
   ▸ Reinforcement learning
   ▸ Search
   ▸ Rules and domain knowledge
   ▸ Whatever it takes to win!
2. Self-play
3. Vast computation is still required for training and inference
   ▸ AlphaGo used 1200 CPUs and 176 GPUs

AlphaGo Zero '2017¹¹

¹¹ David Silver et al. "Mastering the game of Go without human knowledge". Nature 550 (Oct. 2017), pp. 354–359. doi: 10.1038/nature24270.

AlphaGo Zero

▸ AlphaGo Zero is the successor to AlphaGo: simpler and stronger
▸ AlphaGo Zero beats AlphaGo 100–0 in evaluation matches
▸ AlphaGo Zero starts from zero domain knowledge
▸ It uses a single neural network (compared to 4 NNs in AlphaGo)
▸ It learns by self-play only (no supervised learning as in AlphaGo)

AlphaGo Zero Neural Network Architecture

[Figure: a multi-headed ResNet f_θ(s) = (p, v) over the input state features: the policy head outputs P[a_1 ∣ s], …, P[a_n ∣ s] and the value head outputs v(s) ≈ P[win ∣ s]]
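
A minimal PyTorch sketch of such a two-headed network (far fewer and simpler blocks than the paper's residual tower, and batch normalization is omitted):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
    def forward(self, x):                  # residual skip connection
        return torch.relu(x + self.conv2(torch.relu(self.conv1(x))))

class PolicyValueNet(nn.Module):
    def __init__(self, in_planes=17, ch=64, moves=19 * 19 + 1):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(in_planes, ch, 3, padding=1),
                                   nn.ReLU(), ResBlock(ch), ResBlock(ch))
        self.policy = nn.Sequential(nn.Conv2d(ch, 2, 1), nn.Flatten(),
                                    nn.Linear(2 * 19 * 19, moves))
        self.value = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Flatten(),
                                   nn.Linear(19 * 19, 1), nn.Tanh())
    def forward(self, s):
        h = self.trunk(s)
        return self.policy(h), self.value(h)  # (move logits, v in [-1, 1])
```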



AlphaGo Zero Self-Play Training Algorithm

▸ Self-play: from each state s_t, run MCTS to obtain a search policy π_t and sample the move a_t ∼ π_t; when the game ends, record the outcome z
▸ Training data: the network f_θ(s_t) = (p_t, v_t) is fit to the search results by gradient descent: θ′ = θ − α ∇_θ L((p_t, v_t), (π_t, z))
▸ The loss combines a mean-squared error (MSE) on the outcome with a cross-entropy loss on the search policy, plus L2 regularization (sketched below):
  L(f_θ(s_t), (π_t, z)) = (z − v_t)² − π_t^⊤ log p_t + c ∥θ∥²
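
The same loss written out as a sketch, with logits p, value v, search policy pi, outcome z, and params the parameter list:

```python
import torch

def agz_loss(p_logits, v, pi, z, params, c=1e-4):
    """L = (z - v)^2 - pi^T log p + c * ||theta||^2 (batched means)."""
    mse = (z - v.squeeze(-1)).pow(2).mean()                        # value term
    ce = -(pi * torch.log_softmax(p_logits, dim=1)).sum(1).mean()  # policy term
    l2 = c * sum(w.pow(2).sum() for w in params)                   # regularizer
    return mse + ce + l2
```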



AlphaGo Zero Self-Play Training Pipeline

1. Self-play: the current network f_θ = (p, v) plays games against itself, producing a dataset D(π, z)
2. Learning: update θ_{t+1} = θ_t − α ∇_θ L(f_θ, D) on the self-play data
3. Repeat: the improved network drives the next round of self-play (see the sketch below)
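
Putting the pieces together, the outer loop looks roughly like this. Game, run_mcts, and play_move are assumed helpers (not a real API), agz_loss is the sketch from the previous slide, and batching/shapes are elided:

```python
def train(net, opt, n_iterations, games_per_iter=100):
    for _ in range(n_iterations):
        data = []
        for _ in range(games_per_iter):              # 1. self-play
            game, trajectory = Game(), []
            while not game.over():
                pi = run_mcts(net, game.state())     # search guided by f_theta
                trajectory.append((game.state(), pi))
                play_move(game, pi)                  # sample a_t ~ pi_t
            z = game.outcome()                       # +1/-1, first player's view
            data += [(s, pi, z * (-1) ** t)          # flip sign for player 2
                     for t, (s, pi) in enumerate(trajectory)]
        for s, pi, z in data:                        # 2. learning
            p_logits, v = net(s)
            loss = agz_loss(p_logits, v, pi, z, list(net.parameters()))
            opt.zero_grad(); loss.backward(); opt.step()
        # 3. repeat: the stronger net drives the next self-play round
```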


AlphaGo Zero: Key Takeaways

1. A simpler system can be more powerful than a complex one (AlphaGo Zero vs. AlphaGo)
2. Neural networks can be combined like LEGO blocks
3. ResNets work better than traditional ConvNets here
4. Self-play

AlphaZero '2018¹²

¹² David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". Science 362.6419 (2018), pp. 1140–1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.

AlphaZero

▸ AlphaGo Zero reaches a superhuman level at Go without any domain knowledge...
▸ Since the algorithm is not specific to Go, can it play other games?
▸ AlphaZero extends AlphaGo Zero to play not only Go but also chess and shogi
▸ The same algorithm achieves superhuman performance in all three games

AlphaZero

[Figure: Go, shogi, or chess input features feed the same multi-headed ResNet f_θ(s) = (p, v) with a policy head and a value head]

AlphaZero is Much Simpler than AlphaGo, Yet More Powerful

[Figure: AlphaGo → AlphaGo Zero → AlphaZero, increasing in both generality and performance]

AlphaZero: Key Takeaways

1. By playing three games at a superhuman level, AlphaZero is one step closer to general AI
2. Massive compute power is still required for training
   ▸ AlphaZero used 5000 TPUs
3. Self-play

Next Challenge...

Presentation Summary

▸ Sometimes a simpler system can be more powerful than a complex one
▸ Universal research principle: strive for generality and simplicity (Occam's Razor)
▸ Self-play: no human bias, learn from first principles
▸ Deep RL is still in its infancy; much more is to be expected in the next few years
▸ Open challenges: sample efficiency and data efficiency
  ▸ Yes, AlphaGo can learn to play Go, but only after hundreds of game-years of experience; a human reaches a decent level of play in a couple of hours
  ▸ How can we make reinforcement learning more efficient? Model-based learning is a research area receiving increasing attention

References

▸ DQN¹³
▸ AlphaGo¹⁴
▸ AlphaGo Zero¹⁵
▸ AlphaZero¹⁶
▸ AlphaStar¹⁷

¹³ Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". Nature 518.7540 (Feb. 2015), pp. 529–533. doi: 10.1038/nature14236.
¹⁴ David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
¹⁵ David Silver et al. "Mastering the game of Go without human knowledge". Nature 550 (Oct. 2017), pp. 354–359. doi: 10.1038/nature24270.
¹⁶ David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". Science 362.6419 (2018), pp. 1140–1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
¹⁷ Oriol Vinyals et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning". Nature 575 (Nov. 2019). doi: 10.1038/s41586-019-1724-z.

Thanks to Rolf Stadler for reviewing and discussing drafts of this presentation.