

SLIDE 1

Deep RL

Robert Platt Northeastern University

SLIDE 2

Q-learning

[Diagram: the Q-learning loop. The World emits a state; the Q-function maps (state, action) to values; argmax over those values picks the action sent back to the World; the Q-function is improved with the update rule.]

SLIDE 3

Q-learning

[Diagram: the same Q-learning loop as the previous slide, with the Q-function and update rule called out.]

SLIDE 4

Deep Q-learning (DQN)

[Diagram: the same loop, but the Q-function is now a neural network whose outputs are the values of the different possible discrete actions.]

SLIDE 5

Deep Q-learning (DQN)

[Diagram: the same loop with a neural-network Q-function.]

But, why would we want to do this?

SLIDE 6

Where does “state” come from?

[Diagram: the agent takes actions a and perceives states s and rewards r from the environment.]

Earlier, we dodged this question: “it’s part of the MDP problem statement.” But that’s a cop-out. How do we get state? Typically we can’t use “raw” sensor data as the state with a tabular Q-function – it’s too big (e.g., Pacman has something like 2^(num pellets) + … states).

SLIDE 7

Where does “state” come from?

[Diagram: the agent takes actions a and perceives states s and rewards r from the environment.]

Earlier, we dodged this question: “it’s part of the MDP problem statement.” But that’s a cop-out. How do we get state? Typically we can’t use “raw” sensor data as the state with a tabular Q-function – it’s too big (e.g., Pacman has something like 2^(num pellets) + … states).

Is it possible to do RL WITHOUT hand-coding states?

SLIDE 8

DQN

SLIDE 9

Instead of a state, we have an image – in practice, it could be a history of the k most recent images stacked as a single k-channel image. Hopefully this new image representation is Markov – in some domains, it might not be!

DQN

SLIDE 10

Q-function

[Network diagram: stack of images → Conv 1 → Conv 2 → FC 1 → Output]

DQN

SLIDE 11

Q-function

[Network diagram: stack of images → Conv 1 → Conv 2 → FC 1 → Output]

DQN

SLIDE 12

Q-function

[Network diagram: stack of images → Conv 1 → Conv 2 → FC 1 → Output. The number of output nodes equals the number of actions.]

DQN
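To make the picture concrete, here is a minimal PyTorch sketch of this kind of Q-network. The specific layer sizes and the 84x84, k-frame input are assumptions in the spirit of the original Atari DQN, not taken from the slides.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of k images to one Q-value per discrete action."""
    def __init__(self, k_frames: int, num_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(k_frames, 16, kernel_size=8, stride=4),  # Conv 1
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),        # Conv 2
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256),   # FC 1 (9x9 feature maps for 84x84 inputs)
            nn.ReLU(),
            nn.Linear(256, num_actions),  # Output: one Q-value per action
        )

    def forward(self, x):
        return self.fc(self.conv(x))

# e.g. 4 stacked 84x84 frames and 6 actions -> a [1, 6] tensor of Q-values
q_net = QNetwork(k_frames=4, num_actions=6)
q_values = q_net(torch.zeros(1, 4, 84, 84))
```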

SLIDE 13

Q-function updates in DQN

Here’s the standard Q-learning update equation:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

SLIDE 14

Here’s the standard Q-learning update equation:

Q-function updates in DQN

SLIDE 15

Here’s the standard Q-learning update equation:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

Rewriting:

Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]

Q-function updates in DQN

SLIDE 16

Here’s the standard Q-learning update equation:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

Rewriting:

Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]

Let’s call r + γ max_a' Q(s',a') the “target”.

This equation adjusts Q(s,a) in the direction of the target.

Q-function updates in DQN

SLIDE 17

Here’s the standard Q-learning update equation, rewritten with the “target” called out. We’re going to accomplish this same thing in a different way using neural networks...

This equation adjusts Q(s,a) in the direction of the target.

Q-function updates in DQN

SLIDE 18

Use this loss function:

L(w) = ½ [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

Q-function updates in DQN

SLIDE 19

Use this loss function (shown above). Notice that Q is now parameterized by the weights, w.

Q-function updates in DQN

SLIDE 20

Use this loss function (shown above). I’m including the bias terms in the weights, w.

Q-function updates in DQN

SLIDE 21

Use this loss function; the quantity r + γ max_a' Q(s',a'; w) inside the squared error is the “target”.

Q-function updates in DQN

SLIDE 22

Use this loss function, with the target called out (as above).

Question

What’s this called?

SLIDE 23

Use this loss function, with the target called out (as above).

Q-function updates in DQN

We’re going to optimize this loss function using the following gradient:

∇_w L(w) = − [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ] ∇_w Q(s,a; w)

SLIDE 24

Use this loss function, with the target called out (as above).

Think-pair-share

We’re going to optimize this loss function using the following gradient:

∇_w L(w) = − [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ] ∇_w Q(s,a; w)

What’s wrong with this?

SLIDE 25

Use this loss function, with the target called out (as above).

Q-function updates in DQN

We’re going to optimize this loss function using the following gradient:

∇_w L(w) = − [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ] ∇_w Q(s,a; w)

What’s wrong with this?

We call this the semi-gradient rather than the gradient – the target also depends on w, but we treat it as a constant when differentiating. Semi-gradient descent still converges, and this is often more convenient.
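For concreteness, here is one way the semi-gradient shows up in code (a PyTorch sketch; q_net and the batch tensors are assumed names): the target is computed under torch.no_grad(), so backpropagating the loss differentiates only through Q(s,a;w) and yields the semi-gradient above rather than the full gradient.

```python
import torch
import torch.nn.functional as F

def semi_gradient_loss(q_net, s, a, r, s_next, done, gamma=0.99):
    """Squared TD-error loss for a batch of transitions.

    The target r + gamma * max_a' Q(s',a'; w) is treated as a constant
    (no_grad), so autograd produces the semi-gradient, not the full gradient.
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; w)
    with torch.no_grad():
        q_next = q_net(s_next).max(dim=1).values            # max_a' Q(s', a'; w)
        target = r + gamma * (1.0 - done) * q_next           # no bootstrap at terminal states
    return F.mse_loss(q_sa, target)
```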

SLIDE 26

“Barebones” DQN

Initialize Q(s,a;w) with random weights
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        [one semi-gradient step on the loss – see “Where” below]
        s ← s'
    Until s is terminal

Where:

w ← w + α [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ] ∇_w Q(s,a; w)

SLIDE 27

Initialize Q(s,a;w) with random weights
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        [one semi-gradient step on the loss – see “Where” below]
        s ← s'
    Until s is terminal

Where:

w ← w + α [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ] ∇_w Q(s,a; w)

This is all that changed relative to standard Q-learning.

“Barebones” DQN
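A rough Python sketch of the barebones loop (assumptions: env follows the classic Gym reset/step interface, q_net is a network like the one sketched earlier, and semi_gradient_loss is the loss sketch from above; there is no replay buffer or target network yet):

```python
import random
import torch

def barebones_dqn(env, q_net, num_episodes, epsilon=0.1, lr=1e-3, gamma=0.99):
    """Act epsilon-greedily and take one semi-gradient step per transition."""
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action from the current Q-network
            if random.random() < epsilon:
                a = env.action_space.sample()
            else:
                with torch.no_grad():
                    a = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)).argmax(1).item()
            s_next, r, done, _ = env.step(a)
            # one semi-gradient step on this single transition
            loss = semi_gradient_loss(
                q_net,
                torch.as_tensor(s, dtype=torch.float32).unsqueeze(0),
                torch.as_tensor([a]),
                torch.as_tensor([r], dtype=torch.float32),
                torch.as_tensor(s_next, dtype=torch.float32).unsqueeze(0),
                torch.as_tensor([float(done)]),
                gamma,
            )
            opt.zero_grad()
            loss.backward()
            opt.step()
            s = s_next
```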

SLIDE 28

Example: 4x4 frozen lake env

Get to the goal (G). Don’t fall in a hole (H). Demo!

SLIDE 29

Think-pair-share

Suppose the “barebones” DQN algorithm with this DQN network experiences the following transition: which weights in the network could be updated on this iteration?

SLIDE 30

Experience replay

Deep learning typically assumes independent, identically distributed (IID) training data

SLIDE 31

Experience replay

[the “barebones” DQN training loop from before]

Deep learning typically assumes independent, identically distributed (IID) training data. But is this true in the deep RL scenario?

SLIDE 32

Experience replay

[the “barebones” DQN training loop from before]

Deep learning typically assumes independent, identically distributed (IID) training data. But is this true in the deep RL scenario?

Our solution: buffer experiences and then “replay” them during training.

SLIDE 33

Experience replay

Initialize Q(s,a;w) with random weights; initialize replay buffer D
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Add this experience (s, a, r, s') to the replay buffer D
        If mod(step, trainfreq) == 0:          (train every trainfreq steps)
            Sample batch B from D
            Take one step of gradient descent on the loss w.r.t. the batch B
        s ← s'
    Until s is terminal

SLIDE 34

Experience replay

[same training loop with a replay buffer as on the previous slide]

Where the batch loss is the average squared TD error over the batch B:

L(w) = (1/|B|) Σ_{(s,a,r,s') ∈ B} ½ [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

SLIDE 35

Experience replay

[same training loop with a replay buffer as on the previous slide]

Where the batch loss is the average squared TD error over the batch B:

L(w) = (1/|B|) Σ_{(s,a,r,s') ∈ B} ½ [ r + γ max_a' Q(s',a'; w) − Q(s,a; w) ]²

Buffers like this are pretty common in DL
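A minimal sketch of such a buffer (names are assumptions; a deque gives the usual drop-the-oldest behavior when the buffer is full):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of transitions, sampled uniformly at random."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experience is evicted when full

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)    # list of transitions -> batched components
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```

In the loop above, every transition is add()ed, and every trainfreq steps a batch is sample()d for one gradient step.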

SLIDE 36

Think-pair-share

[same training loop with a replay buffer as above]

What do you think are the tradeoffs between:
– a large replay buffer vs. a small replay buffer?
– a large batch size vs. a small batch size?

SLIDE 37

With target network

Initialize Q(s,a;w) with random weights; initialize the target network weights w⁻ ← w; initialize replay buffer D
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Add this experience (s, a, r, s') to the replay buffer D
        If mod(step, trainfreq) == 0:
            Sample batch B from D and take one step of gradient descent on the loss over B
        If mod(step, copyfreq) == 0:
            Copy the online weights into the target network: w⁻ ← w
        s ← s'
    Until s is terminal

Where the target in the loss is now computed from the target network: r + γ max_a' Q(s',a'; w⁻)

Target network helps stabilize DL – why?
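A hedged sketch of the two pieces the target network adds (assumed names, not the lecture's exact code): the target in the loss comes from a frozen copy Q(·,·; w⁻), and every copyfreq steps the online weights are copied into that copy.

```python
import copy
import torch
import torch.nn.functional as F

def make_target_network(q_net):
    """Create the frozen copy Q(., .; w-) used to compute targets."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def loss_with_target_net(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; w)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values          # max_a' Q(s', a'; w-)
        target = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q_sa, target)

def sync_target_network(q_net, target_net):
    """Every copyfreq steps: w- <- w."""
    target_net.load_state_dict(q_net.state_dict())
```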

SLIDE 38

Example: 4x4 frozen lake env

Get to the goal (G). Don’t fall in a hole (H). Demo!

SLIDE 39

Comparison: replay vs no replay

(Avg final score achieved)

SLIDE 40

Double DQN

Recall the problem of maximization bias: using the same Q-estimates both to select the maximizing action and to evaluate it tends to overestimate action values.

SLIDE 41

Double DQN

Recall the problem of maximization bias. Our solution from the TD lecture was double Q-learning: keep two Q-functions and use one to select the maximizing action and the other to evaluate it. Can we adapt this to the DQN setting?

SLIDE 42

Double DQN

[same training loop with replay buffer and target network as before]

Where the target now uses the online network to select the action and the target network to evaluate it:

r + γ Q(s', argmax_a' Q(s',a'; w); w⁻)
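A sketch of how only the target changes for Double DQN (assumed names, building on the target-network sketch above): the online network w selects the argmax action, and the target network w⁻ evaluates it.

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """r + gamma * Q(s', argmax_a' Q(s', a'; w); w-)"""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)          # select with the online net
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)    # evaluate with the target net
        return r + gamma * (1.0 - done) * q_eval
```

Plain DQN uses target_net(s_next).max(dim=1).values, so the same network both selects and evaluates the maximizing action; splitting those two roles is what reduces the maximization bias.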

SLIDE 43

Think-pair-share

[same training loop, with the Double DQN target defined as on the previous slide]

  • 1. In what sense is this double Q-learning?
  • 2. What are the pros/cons vs. the earlier version of double-Q?
  • 3. Why not convert the original double-Q algorithm into a deep version?

SLIDE 44

Double DQN

SLIDE 45

Double DQN

SLIDE 46

Prioritized Replay Buffer

[same training loop with a replay buffer as before]

Previously, the batch B was sampled from D uniformly at random. Can we do better by sampling the batch intelligently?

SLIDE 47

Prioritized Replay Buffer

– Left action transitions to state 1 w/ zero reward
– Far right state gets reward of 1

SLIDE 48

Question

– Left action transitions to state 1 w/ zero reward
– Far right state gets reward of 1

Why is the sampling method particularly important in this domain?

SLIDE 49

Prioritized Replay Buffer

[Plot: number of updates needed to learn the true value function, as a function of replay buffer size. A larger replay buffer corresponds to larger values of n in cliffworld. The black line selects minibatches randomly; the blue line greedily selects the transitions that minimize the loss over the entire buffer.]

– Left action transitions to state 1 w/ zero reward
– Far right state gets reward of 1

SLIDE 50

Prioritized Replay Buffer

[Plot: number of updates needed to learn the true value function, as a function of replay buffer size. A larger replay buffer corresponds to larger values of n in cliffworld. The black line selects minibatches randomly; the blue line greedily selects the transitions that minimize the loss over the entire buffer.]

– Left action transitions to state 1 w/ zero reward
– Far right state gets reward of 1

Minimizing the loss over the entire buffer is impractical. Can we achieve something similar online?

SLIDE 51

Question

Idea: sample the elements of the minibatch by drawing samples with probability P(i) ∝ p_i, where p_i denotes the priority of sample i – simplest case: p_i = |δ_i| + ε (this is “proportional” sampling).

Problem: since we’re changing the distribution of updates performed, this is off policy – we need to weight the sampled updates…

Question: qualitatively, how should we re-weight experiences? – e.g., how should we re-weight an experience that prioritized replay does not sample often?

SLIDE 52

Prioritized Replay Buffer

Idea: sample the elements of the minibatch by drawing samples with probability P(i) ∝ p_i, where p_i denotes the priority of sample i – simplest case: p_i = |δ_i| + ε (this is “proportional” sampling).

Problem: since we’re changing the distribution of updates performed, this is off policy – we need to weight the sampled updates.

SLIDE 53

Prioritized Replay Buffer

Idea: sample the elements of the minibatch by drawing samples with probability P(i) ∝ p_i, where p_i denotes the priority of sample i – simplest case: p_i = |δ_i| + ε (this is “proportional” sampling).

Problem: since we’re changing the distribution of updates performed, this is off policy – we need to weight the sampled updates.

Why is epsilon needed?
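A simplified NumPy sketch of proportional prioritized sampling (assumptions: priorities live in a flat array rather than a sum-tree, and the importance weights shown are the standard (1/(N·P(i)))^β correction, one way to answer the re-weighting question above):

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, eps=1e-3, beta=0.4):
    """Proportional prioritized sampling with importance weights.

    priority     p_i = |delta_i| + eps   (eps keeps zero-TD-error samples alive)
    probability  P_i = p_i / sum_k p_k
    weight       w_i = (1 / (N * P_i))**beta, normalized so the max weight is 1
    """
    priorities = np.abs(np.asarray(td_errors)) + eps
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(probs), size=batch_size, p=probs)
    weights = (1.0 / (len(probs) * probs[idx])) ** beta
    weights /= weights.max()      # rarely-sampled experiences get the largest weights
    return idx, weights
```

The returned weights multiply each sampled transition's loss term, partially undoing the skew that prioritization introduces into the update distribution.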

SLIDE 54

Prioritized Replay Buffer

– Left action transitions to state 1 w/ zero reward
– Far right state gets reward of 1

The prioritized buffer is not as good as the oracle, but it is better than uniform sampling...

SLIDE 55

Prioritized Replay Buffer

– averaged results over 57 Atari games

SLIDE 56

Dueling networks for Q-learning

Recall architecture of Q-network:

SLIDE 57

Dueling networks for Q-learning

This is a more common way of drawing it:

[Diagram: stack of images → CONV layers → fully connected layers → Q-values]

SLIDE 58

Dueling networks for Q-learning

This is a more common way of drawing it:

[Diagram: stack of images → CONV layers → fully connected layers → Q-values]

We’re going to express the Q-function using a new network architecture.

SLIDE 59

Dueling networks for Q-learning

Decompose Q as: Q(s,a) = V(s) + A(s,a)

where A is the advantage function.

SLIDE 60

Think-pair-share

Decompose Q as: Q(s,a) = V(s) + A(s,a)

where A is the advantage function.

  • 1. Why might this decomposition be better?
  • 2. Is A always positive, negative, or either? Why?

SLIDE 61

Intuition

SLIDE 62

Dueling networks for Q-learning

Notice that the decomposition of Q into V and A is not unique, given Q targets only. Therefore, constrain the advantage stream, e.g.:

Q(s,a) = V(s) + ( A(s,a) − max_a' A(s,a') )

SLIDE 63

Question

Notice that the decomposition of Q into V and A is not unique, given Q targets only. Therefore, constrain the advantage stream, e.g.:

Q(s,a) = V(s) + ( A(s,a) − max_a' A(s,a') )

Why does this help?

SLIDE 64

Dueling networks for Q-learning

Notice that the decomposition of Q into V and A is not unique, given Q targets only. Actually, the constraint used in practice subtracts the mean advantage:

Q(s,a) = V(s) + ( A(s,a) − (1/|A|) Σ_a' A(s,a') )
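A minimal PyTorch sketch of the dueling head (layer sizes are assumptions): a shared trunk feeds separate V and A streams, and the mean advantage is subtracted so the decomposition is identifiable.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')"""
    def __init__(self, feature_dim, num_actions, hidden=128):
        super().__init__()
        self.value = nn.Sequential(                 # V(s): one number per state
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(             # A(s, a): one number per action
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_actions))

    def forward(self, features):
        v = self.value(features)                    # [B, 1]
        a = self.advantage(features)                # [B, num_actions]
        return v + a - a.mean(dim=1, keepdim=True)  # broadcast V across actions
```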

SLIDE 65

Dueling networks for Q-learning

Action set: left, right, up, down, no-op (with an arbitrary number of no-op actions).
SE: squared error relative to the true value function.
Compare dueling networks with single-stream networks (all networks are three-layer MLPs).
Increasing the number of actions above corresponds to increasing the number of no-op actions.
Conclusion: dueling networks can help a lot for large numbers of actions.

SLIDE 66

Dueling networks for Q-learning

Change in avg rewards for 57 ALE domains versus DQN w/ a single-stream network.

SLIDE 67

Asynchronous methods

Idea: run multiple RL agents in parallel
– all agents run against their own environments and Q fn
– periodically, all agents synch w/ a global Q fn

Instantiations of the idea:
– asynchronous Q-learning
– asynchronous SARSA
– asynchronous advantage actor critic (A3C)

SLIDE 68

Asynchronous Q-learning

[Diagram: each learner accumulates gradients against shared Q-functions; periodically, a batch of weight updates is applied and the target network is updated.]

SLIDE 69

Asynchronous Q-learning

Why does this approach help?

SLIDE 70

Asynchronous Q-learning

Why does this approach help? It helps decorrelate training data:
– standard DQN relies on the replay buffer and the target network to decorrelate data
– asynchronous methods accomplish the same thing by having multiple learners
– this makes it feasible to use on-policy methods like SARSA (why?)
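A schematic sketch of one asynchronous Q-learning worker (a rough sketch under several assumptions: a classic Gym-style env, a shared global_net/target_net/optimizer guarded by a lock, and Python threads as the parallelism mechanism; real implementations differ, e.g. by using lock-free updates):

```python
import threading
import torch

def async_q_worker(make_env, global_net, target_net, opt, lock, num_steps,
                   sync_every=5, copy_every=1000, gamma=0.99, epsilon=0.1):
    """One learner thread: acts in its own environment, accumulates semi-gradients
    for the shared network, and applies them as a batch every sync_every steps."""
    env = make_env()                                 # each learner gets its own environment
    s = env.reset()
    for t in range(1, num_steps + 1):
        s_t = torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():                        # epsilon-greedy action from the shared net
            a = env.action_space.sample() if torch.rand(1).item() < epsilon \
                else global_net(s_t).argmax(1).item()
        s_next, r, done, _ = env.step(a)
        with lock:                                   # serialize access to the shared weights
            q_sa = global_net(s_t)[0, a]
            with torch.no_grad():
                s_n = torch.as_tensor(s_next, dtype=torch.float32).unsqueeze(0)
                target = r + gamma * (0.0 if done else target_net(s_n).max().item())
            ((target - q_sa) ** 2).backward()        # gradients accumulate across steps
            if t % sync_every == 0 or done:
                opt.step()                           # apply the accumulated batch of updates
                opt.zero_grad()
            if t % copy_every == 0:
                target_net.load_state_dict(global_net.state_dict())
        s = env.reset() if done else s_next

# Launch several workers that all share global_net, target_net, opt, and the lock:
# threads = [threading.Thread(target=async_q_worker,
#                             args=(make_env, global_net, target_net, opt, lock, 100000))
#            for _ in range(8)]
```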

SLIDE 71

Asynchronous Q-learning

Different numbers of learners versus wall clock time

SLIDE 72

Asynchronous Q-learning

Different numbers of learners versus number of SGD steps across all threads – speedup is not just due to greater computational efficiency

SLIDE 73

Combine all these ideas!