Deep Reinforcement Learning

Philipp Koehn 18 April 2019

Reinforcement Learning

  • Sequence of actions

– moves in chess
– driving controls in car

  • Uncertainty

– moves by opponent
– random outcomes (e.g., dice rolls, impact of decisions)

  • Reward delayed

– chess: win/loss at end of game
– Pacman: points scored throughout game

  • Challenge: find optimal policy for actions

Deep Learning

  • Mapping input to output through multiple layers
  • Weight matrices and activation functions
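
To make these two bullets concrete, here is a toy two-layer network in plain NumPy (layer sizes and the ReLU activation are arbitrary choices, not from the lecture):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(16, 8))     # weight matrix, input -> hidden
    W2 = rng.normal(size=(8, 4))      # weight matrix, hidden -> output

    def forward(x):
        h = relu(x @ W1)              # layer 1: weights, then activation
        return h @ W2                 # layer 2: output scores

    print(forward(rng.normal(size=(16,))))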

AlphaGo

Book

  • Lecture based on the book

Deep Learning and the Game of Go by Pumperla and Ferguson, 2019

  • Hands-on introduction to game playing and neural networks

  • Lots of Python code

go

Go

  • Board game with white and black stones
  • Stones may be placed anywhere
  • If an opponent's stones are surrounded, you can capture them
  • Ultimately: you need to claim territory
  • Player with most territory and captured stones wins

Go Board

  • Starting board: the standard board is 19x19, but the game can also be played on 9x9 or 13x13

Move 1

  • First move: white

Move 2

  • Second move: black

Move 3

  • Third move: white

Move 7

  • Situation after 7 moves, black’s turn

Move 8

  • Move by black surrounds the white stone in the middle

Capture

  • White stone in middle is captured

Final State

  • Any further moves will not change outcome

Final State with Territory Marked

  • Total score: number of empty points in territory + number of captured stones

Why is Go Hard for Computers?

  • Many moves possible

– 19x19 board
– 361 possible moves initially
– games may last 300 moves
⇒ Huge branching factor in search space

  • Hard to evaluate board positions

– control of board most important
– number of captured stones less relevant

game playing

Game Tree

  • Recall: game tree to consider all possible moves

Alpha-Beta Search

  • Explore game tree depth-first
  • Exploration stops at win or loss
  • Backtrack to other paths, note best/worst outcome
  • Ignore paths with worse outcomes
  • This does not work for a game tree with about 361^300 states
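
For reference, a minimal alpha-beta sketch in Python; the game interface (children, is_terminal, value) is passed in as assumed helper functions:

    import math

    def alphabeta(state, depth, alpha, beta, maximizing,
                  children, is_terminal, value):
        # value() scores a terminal or depth-limited state for the maximizer
        if depth == 0 or is_terminal(state):
            return value(state)
        if maximizing:
            best = -math.inf
            for child in children(state):
                best = max(best, alphabeta(child, depth - 1, alpha, beta,
                                           False, children, is_terminal, value))
                alpha = max(alpha, best)
                if alpha >= beta:          # prune: opponent avoids this branch
                    break
            return best
        best = math.inf
        for child in children(state):
            best = min(best, alphabeta(child, depth - 1, alpha, beta,
                                       True, children, is_terminal, value))
            beta = min(beta, best)
            if alpha >= beta:
                break
        return best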

Evaluation Function for States

  • Explore game tree up to some specified maximum depth
  • Evaluate leaf states

– informed by knowledge of game
– e.g., chess: pawn count, control of board

  • This does not work either, due to

– high branching factor
– difficulty of defining evaluation function

monte carlo tree search

Monte Carlo Tree Search

[diagram: game tree; one random roll-out reaches a win, recorded as statistics 1/0 along the path]

  • Explore depth-first randomly ("roll-out"), record win on all states along path

Monte Carlo Tree Search

[diagram: game tree after a second roll-out ending in a loss; node statistics 1/0 and 0/1, root 1/1]

  • Pick existing node as starting point, execute another roll-out, record loss

Monte Carlo Tree Search

[diagram: game tree after a third roll-out ending in a win; statistics updated along each path]

  • Pick existing node as starting point, execute another roll-out

Monte Carlo Tree Search

[diagram: game tree after a fourth roll-out ending in a loss; statistics updated along each path]

  • Pick existing node as starting point, execute another roll-out

Monte Carlo Tree Search

[diagram: game tree after a fifth roll-out; paths with high win percentage are visited more often]

  • Increasingly, prefer to explore paths with high win percentage

Monte Carlo Tree Search

  • Which node to pick?

w + c √(log N / n)

– N: total number of roll-outs
– n: number of roll-outs for this node in the game tree
– w: winning percentage
– c: hyperparameter to balance exploration

(a Python sketch of this selection rule appears below)

  • This is an inference algorithm

– execute, say, 10,000 roll-outs
– pick initial action with best win percentage w
– can be improved by following rules based on well-known local shapes
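
A minimal Python sketch of the node-selection rule; representing children as (win percentage, visit count) pairs and the value c = 1.4 are illustrative assumptions, not from the lecture:

    import math

    def ucb_score(w, n, N, c=1.4):
        # w: win percentage, n: visits to this node,
        # N: total number of roll-outs, c: exploration constant
        if n == 0:
            return math.inf            # always try unvisited nodes first
        return w + c * math.sqrt(math.log(N) / n)

    def pick_child(children, N, c=1.4):
        # children: list of (win_percentage, visit_count) pairs
        return max(range(len(children)),
                   key=lambda i: ucb_score(children[i][0], children[i][1], N, c))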

action prediction with neural networks

Learning Moves

  • We would like to learn the actions of a game-playing agent
  • Input state: board position
  • Output action: optimal move

Learning Moves

[diagram: 5x5 board position and the corresponding move, each encoded as a matrix with 1s marking positions]

  • Machine learning problem
  • Input: 5x5 matrix
  • Output: 5x5 matrix

Neural Networks

[diagram: the same 5x5 board matrix as network input]

  • First idea: feed-forward neural network

– encode board position in n × n sized vector
– encode correct move in n × n sized vector
– add some hidden layers

  • Many parameters

– input and output vectors have dimension 361 (19x19 board)
– if hidden layers have same size → 361x361 weights for each

  • Does not generalize well

– same patterns on various locations of the board
– has to learn moves for each location
– consider everything moved one position to the right
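
For concreteness, a sketch of this first idea in Keras (optimizer and loss are assumptions; this is not the lecture's code):

    from tensorflow import keras

    n = 19 * 19                                      # 361 board points

    model = keras.Sequential([
        keras.Input(shape=(n,)),                     # flattened board encoding
        keras.layers.Dense(n, activation="relu"),    # hidden layer: 361x361 weights
        keras.layers.Dense(n, activation="softmax")  # probability per move
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")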

Convolutional Neural Networks

[diagram: a 3x3 convolutional kernel applied to regions of the board matrix, each region producing a single value]

  • Convolutional kernel: here maps 3x3 matrix to 1x1 value
  • Applied to all 3x3 regions of the original matrix
  • Learns local features

Move Prediction with CNNs

[diagram: board → convolutional layer → convolutional layer → flatten → feed-forward layer → move probabilities]

  • May use multiple convolutional kernels (of same size)

→ learn different local features

  • Resulting values may be added or maximum value selected (max-pooling)
  • May have several convolutional neural network layers
  • Final layer: softmax prediction of move
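
A sketch of such a network in Keras; the number of kernels and the layer sizes are illustrative assumptions:

    from tensorflow import keras

    board_size = 19
    model = keras.Sequential([
        keras.Input(shape=(board_size, board_size, 1)),
        keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        keras.layers.Flatten(),
        # final layer: softmax prediction of the move
        keras.layers.Dense(board_size * board_size, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")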

Human Game Play Data

  • Game records

– sequence of moves
– winning player

  • Convert into training data for move prediction

– one move at a time
– prediction +1 for move if winner
– prediction −1 for move if loser

  • Learn winning moves, avoid losing moves
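
A sketch of this conversion; the record format and the game-specific helpers passed in are assumptions:

    def game_to_training_data(moves, winner, encode, initial_board, apply_move):
        # moves: ordered list of (player, move) pairs; winner: winning player id
        # encode, initial_board, apply_move: game-specific helpers (assumed)
        examples = []
        board = initial_board()
        for player, move in moves:
            target = 1.0 if player == winner else -1.0   # +1 winner, -1 loser
            examples.append((encode(board, player), move, target))
            board = apply_move(board, player, move)
        return examples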

Playing Go with Neural Move Predictor

  • Greedy search
  • Make prediction at each turn
  • Select move with highest probability

reinforcement learning

Self-Play

  • Previously: learn policy from human play data
  • Now: learn policy from self-play
  • Need to have an agent that plays reasonably well to start

→ learn initial policy from human play data

  • Greedy move selection with same policy will result in the same game each time

– stochastic moves: move predicted with 80% confidence → select it 80% of the time
– may have to clip probabilities that are too certain (e.g., 99.9% to 80%)
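
A minimal sketch of stochastic move selection with clipping (the 0.8 cap mirrors the slide's example; the rest is an assumption):

    import numpy as np

    def sample_move(probs, cap=0.8, rng=np.random.default_rng()):
        # probs: move probabilities from the policy network
        p = np.minimum(probs, cap)    # clip over-confident predictions
        p = p / p.sum()               # renormalize to a distribution
        return rng.choice(len(p), p=p)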

Experience from Self-Play

  • Self-play will generate self-play data ("experience")

– sequence of moves
– winner at the end

  • Can be used as training data to improve model

– first train model on human play data
– then, run 1 epoch over self-play data

Policy Search

[diagram: board → convolutional layer → convolutional layer → flatten → feed-forward layer → move prediction]

  • Reminder: policy informs which action to take in each state
  • Learning move predictor = learning policy

Q Learning

[diagram: board → convolutional layer → convolutional layer → flatten → feed-forward layer → utility]

  • Learn utility value for each state = likelihood of winning
  • Training on game play data, utility=1 for win, 0 for loss
  • Game play with utility predictor

– consider all possible actions
– compute utility value for resulting state
– choose action with maximum utility outcome
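
A sketch of game play with a utility predictor; legal_moves, apply_move, and the trained utility function are passed in as assumed helpers:

    def choose_action(state, legal_moves, apply_move, utility):
        # utility: trained network mapping a state to a win-probability estimate
        best_move, best_value = None, float("-inf")
        for move in legal_moves(state):
            value = utility(apply_move(state, move))   # utility of resulting state
            if value > best_value:
                best_move, best_value = move, value
        return best_move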

Q Learning

[diagram: convolutional layers for the board, combined through several feed-forward layers → utility]

  • Alternative architecture
  • Explicitly modeling the last move

actor-critic learning

Credit Assignment Problem

  • Go game lasts many moves (say, 300 moves)

– some of the moves are good
– some of the moves are bad
– some of the moves make no difference

  • We want to learn from the moves that made a difference

– before the move: low chance of winning
– the move is played
– at the end → win

Consider Win Probability

[plot: win probability over the moves of a game, rising from 0.5 toward 1; steep rises mark important moves, flat stretches unimportant moves]

  • Moves that pushed towards win matter more

Consider Win Probability

[plot: win probability over moves; moves that lift the win probability from below 0.5 to above it mark very important moves]

  • Especially important moves: change from losing position to winning position

Advantage

[plot: win probability over moves, with the advantage of a move shown as the gap between the final outcome and the current win probability]

  • Compute utility of state V(s). Definition of advantage: A = R − V(s)

Actor-Critic Learning

  • Combination of policy learning and Q learning

– actor: move predictor (as in policy learning) s → a
– critic: value of state (as in Q learning) V(s)

  • We use this setup to influence how much to boost good moves

– advantage A = R − V(s)
– good moves when advantage is high
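
A sketch of computing per-move advantages for a finished game, using the definition A = R − V(s) above:

    def advantages(values, final_reward):
        # values: the critic's V(s) for each state visited during the game
        # final_reward: R, e.g., 1 for a win and -1 for a loss
        return [final_reward - v for v in values]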

Policy Learning with Advantage

  • Before: predict win
  • Now: predict advantage
[diagram: board position as input matrix; the network outputs an advantage score, e.g., 0.8]

Architecture of Actor-Critic Model

[diagram: board → convolutional layers → flatten → shared feed-forward layer, branching into two heads: move prediction (actor) and utility (critic)]

  • Game play data with advantage scores for each move
  • Training of actor and critic similar

⇒ Share components, train them jointly

  • Multi-task learning helps regularization
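
A sketch of such a two-headed model in Keras (layer sizes, activations, and losses are assumptions):

    from tensorflow import keras

    board_size = 19
    n = board_size * board_size

    inputs = keras.Input(shape=(board_size, board_size, 1))
    x = keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu")(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(256, activation="relu")(x)       # shared trunk

    policy = keras.layers.Dense(n, activation="softmax", name="actor")(x)
    value = keras.layers.Dense(1, activation="tanh", name="critic")(x)

    model = keras.Model(inputs=inputs, outputs=[policy, value])
    model.compile(optimizer="adam",
                  loss={"actor": "categorical_crossentropy", "critic": "mse"})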

alpha go

Overview

[diagram: human game records → supervised learning → fast policy network and strong policy network; strong policy network → self-play reinforcement learning → stronger policy network → value network; all used during inference]

Encoding the Board

  • We encoded each board position with an integer (+1=white, –1=black, 0=blank)
  • AlphaGo uses a 48-dimensional vector that encodes knowledge about the game

– 3 booleans for stone color
– 1 boolean for legal and fundamentally sensible move
– 8 booleans to record how far back stone was placed
– 8 booleans to encode liberty
– 8 booleans to encode liberty after move
– 8 booleans to encode capture size
– 8 booleans to encode how many of your own stones will be placed in jeopardy because of move
– 2 booleans for ladder detection
– 3 booleans for technical values

  • Note: ladder, liberty, and capture are basic concepts of the game

Policy and Value Networks

  • Policy network: s → a
  • Value network: s → V(s)
  • These networks are trained as previously described
  • Fairly deep networks

– 13 layers for policy network
– 16 layers for value network

Monte Carlo Tree Search

  • Inference uses a refined version of Monte Carlo Tree Search (MCTS)
  • Roll-out guided by fast policy network (greedy search)
  • When visiting a node with some unexplored children ("leaf")

→ use probability distribution from strong policy network for stochastic choice

  • Combine roll-out statistics with prediction from value network

MCTS with Value Network

  • Estimate value of a leaf node l in the game tree where a roll-out started as

V(l) = 1/2 value(l) + 1/2 roll-out(l)

– value(l) is prediction from value network
– roll-out(l) is win percentage from Monte Carlo Tree Search

  • This is used to compute Q values for any state-action pair, given its leaf nodes l_i

Q(s,a) = ∑_i V(l_i) / N(s,a)

  • Combine with the prediction of the strong policy network P(s,a)

a′ = argmax_a [ Q(s,a) + P(s,a) / (1 + N(s,a)) ]
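
These formulas transcribe directly into Python; the dictionary-based bookkeeping is an assumption:

    def leaf_value(value_net, rollout_win, leaf):
        # V(l) = 1/2 value(l) + 1/2 roll-out(l)
        return 0.5 * value_net(leaf) + 0.5 * rollout_win(leaf)

    def select_action(actions, Q, P, N):
        # a' = argmax_a [ Q(s,a) + P(s,a) / (1 + N(s,a)) ]
        return max(actions, key=lambda a: Q[a] + P[a] / (1 + N[a]))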

alpha go zero

Less and More

  • Less

– no pre-training with human game play data
– no hand-crafted features in board encoding
– no Monte Carlo roll-outs

  • More

– 80 convolutional layers
– tree search also used in self-play

Improved Tree Search

  • Tree search adds one node in each iteration (not full roll-out)
  • When exploring a new node

– compute its Q value
– compute action prediction probability distribution
– pass Q value back up through search tree

  • Each node in search tree keeps record of

– P: prediction for action leading to this node
– Q: average of all terminal Q values from visits passing through node
– N: number of visits of parent
– n: number of visits of node

  • Score of node (c is a hyperparameter to be optimized)

Q + c P √N / (1 + n)
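
The same score as a small Python function (the default c is a placeholder; the slide leaves it to be optimized):

    import math

    def node_score(Q, P, N, n, c=1.0):
        # Q: average value, P: action prior from the network,
        # N: parent visits, n: node visits, c: exploration hyperparameter
        return Q + c * P * math.sqrt(N) / (1 + n)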

Inference and Training

  • Inference

– choose action from most visited branch
– visit count is impacted by both action prediction and success in tree search
→ more reliable than win statistics or raw action prediction

  • Training

– predict visit count

and more...
