SLIDE 1

Randomized Prior Functions for Deep Reinforcement Learning

Ian Osband, John Aslanides, Albin Cassirer

bit.ly/rpf_nips @ianosband

SLIDES 2-14

Reinforcement Learning

Data & Estimation = Supervised Learning
+ partial feedback = Multi-armed Bandit
+ delayed consequences = Reinforcement Learning

  • “Sequential decision making under uncertainty.”
  • Three necessary building blocks:
  • 1. Generalization
  • 2. Exploration vs Exploitation
  • 3. Credit assignment
  • As a field, we are pretty good at combining any 2 of these 3.


… but we need practical solutions that combine them all.

We need effective uncertainty estimates for Deep RL.

SLIDES 15-34

Estimating uncertainty in deep RL

  • Dropout sampling: “Dropout sample ≈ posterior sample” (Gal & Ghahramani, 2015). But the dropout rate does not concentrate with the data; even “concrete” dropout does not necessarily give the right rate.
  • Variational inference: apply VI to the Bellman error as if it were an i.i.d. supervised loss. Bellman error: Q(s, a) = r + γ max_α Q(s′, α). Uncertainty in Q ➡ correlated TD loss; VI on an i.i.d. model does not propagate uncertainty.
  • Distributional RL: model the Q-value as a distribution rather than a point estimate. This distribution != posterior uncertainty (aleatoric vs epistemic) … it’s not the right thing for exploration.
  • Count-based density: estimate “visit counts” to states and add an exploration bonus. The “density model” has nothing to do with the actual task, and with generalization, a state “visit count” != uncertainty.
  • Bootstrap ensemble: train an ensemble on noisy data - a classic statistical procedure! But there is no explicit “prior” mechanism for “intrinsic motivation”: if you’ve never seen a reward, why would the agent explore?
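
To make the correlated-loss point concrete, here is a minimal numpy sketch (illustrative names, not code from the paper) of the TD target r + γ max_a' Q(s', a') that the Bellman error above defines. Because the regression target is built from Q itself, errors in Q leak into the targets across transitions, so the fit is not an i.i.d. supervised-learning problem.

import numpy as np

def td_target(reward, next_q_values, done, gamma=0.99):
    # Bootstrapped target r + gamma * max_a' Q(s', a'); zero bootstrap at terminal states.
    return reward + gamma * (1.0 - done) * np.max(next_q_values, axis=-1)

# Toy batch: 3 transitions, 2 actions each.
rewards = np.array([0.0, -0.1, 1.0])
next_q  = np.array([[0.2, 0.5], [0.0, 0.1], [0.0, 0.0]])
dones   = np.array([0.0, 0.0, 1.0])
print(td_target(rewards, next_q, dones))  # the values the Q-network regresses towards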

SLIDES 35-47

Randomized prior functions

  • Key idea: add a random, untrainable “prior” function to each member of the ensemble.
  • Visualize the effect in 1D regression (see the code sketch after this list):
  • Training data (x, y): black points.
  • Prior function p(x): blue line.
  • Trainable function f(x): dotted line.
  • Output prediction Q(x): red line.

Exact Bayes posterior for linear functions!
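
The last line can be made concrete in the linear / fixed-feature case. The following is a minimal numpy sketch under assumed noise scale sigma and prior precision lam (all names and the feature map are illustrative; this is not the paper's code): each ensemble member draws a fixed random prior p_k, fits a trainable part f_k to noise-perturbed targets, and predicts Q_k(x) = f_k(x) + p_k(x). For a Gaussian linear model this procedure yields exact posterior samples, which is the sense of “exact Bayes posterior for linear functions”.

import numpy as np

rng = np.random.default_rng(0)

# Toy 1D data (x, y): the "black points" on the slide.
x = rng.uniform(-1.0, 1.0, size=20)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(20)

def features(x):
    # Fixed polynomial feature map, so the "linear" model can bend a little.
    return np.stack([np.ones_like(x), x, x ** 2, x ** 3], axis=-1)

Phi = features(x)
n_features = Phi.shape[1]
sigma, lam = 0.1, 1.0        # assumed observation-noise scale and prior precision

ensemble = []
for k in range(10):          # K = 10 ensemble members
    # Untrainable random prior p_k(x): weights drawn once, never updated.
    prior_w = rng.standard_normal(n_features) / np.sqrt(lam)
    # Perturb the targets and subtract the prior's predictions ...
    y_tilde = y + sigma * rng.standard_normal(y.shape) - Phi @ prior_w
    # ... then fit the trainable part f_k(x) by ridge regression.
    A = Phi.T @ Phi + lam * sigma ** 2 * np.eye(n_features)
    train_w = np.linalg.solve(A, Phi.T @ y_tilde)
    ensemble.append((train_w, prior_w))

# Member k predicts Q_k(x) = f_k(x) + p_k(x); the spread over k is the uncertainty.
x_test = np.linspace(-1.5, 1.5, 7)
Q = np.stack([features(x_test) @ (tw + pw) for tw, pw in ensemble])
print(Q.mean(axis=0))   # ensemble mean prediction
print(Q.std(axis=0))    # grows away from the training data

The spread of the K predictions collapses near the training points and is dominated by the random priors far away from them, which is the uncertainty signal an agent can use for exploration.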

SLIDES 48-59

“Deep Sea” Exploration

  • Stylized “chain” domain testing “deep exploration” (a toy sketch of the environment follows below):
  • State = N x N grid, observations are 1-hot.
  • Start in the top-left cell; fall one row each step.
  • Actions {0, 1} map to left/right in each cell.
  • “Left” has reward 0; “right” has reward -0.1/N …
  • … but if you make it to the bottom-right cell you get +1.
  • Only one policy (out of more than 2^N) has positive return.

  • ε-greedy / Boltzmann / policy gradient are useless here.

  • Algorithms with deep exploration can learn fast! [1]


[1] “Deep Exploration via Randomized Value Functions”
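
Below is a toy implementation of the environment as stated on the slides, a sketch only: class and variable names are mine, and the per-cell action mapping is simplified to a single global “action 1 = right” convention.

import numpy as np

class DeepSea:
    """Toy sketch of the "Deep Sea" chain domain described above."""

    def __init__(self, size=10):
        self.N = size
        self.reset()

    def reset(self):
        self.row, self.col = 0, 0                    # start in the top-left cell
        return self._obs()

    def _obs(self):
        obs = np.zeros((self.N, self.N), dtype=np.float32)
        obs[self.row, self.col] = 1.0                # 1-hot observation on the N x N grid
        return obs.ravel()

    def step(self, action):
        go_right = (action == 1)
        reward = -0.1 / self.N if go_right else 0.0  # small cost for moving right
        self.col = min(self.col + 1, self.N - 1) if go_right else max(self.col - 1, 0)
        self.row += 1                                # fall one row every step
        done = (self.row == self.N)
        if done and self.col == self.N - 1:
            reward += 1.0                            # reached the bottom-right cell: +1
        obs = np.zeros(self.N * self.N, dtype=np.float32) if done else self._obs()
        return obs, reward, done

# Only the always-right policy earns a positive return:
env, ret = DeepSea(size=10), 0.0
obs = env.reset()
for _ in range(env.N):
    obs, r, done = env.step(1)
    ret += r
print(ret)   # 1.0 - 0.1 = 0.9; every other policy returns at most 0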

SLIDES 60-68

Visualize BootDQN+prior exploration

  • Compare DQN+ε-greedy vs BootDQN+prior.
  • Define the ensemble average: V̄(s) = (1/K) Σ_{k=1..K} max_α Q_k(s, α).
  • Heat map shows the estimated value of each state.
  • Red line shows the exploration path taken by the agent.
  • DQN+ε-greedy gets stuck on the left, gives up.
  • BootDQN+prior hopes something is out there, keeps exploring potentially-rewarding states … learns fast!
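
Here is a small numpy sketch of the heat-map quantity, together with BootDQN-style action selection (sample one ensemble member per episode and act greedily with it, as in bootstrapped DQN). The random Q-values and names are placeholders, not the paper's code:

import numpy as np

rng = np.random.default_rng(0)
K, N, num_actions = 10, 10, 2
q_ensemble = rng.standard_normal((K, N * N, num_actions))  # stand-in for K trained Q-nets

# Heat-map quantity from the slide: V_bar(s) = (1/K) * sum_k max_a Q_k(s, a).
v_bar = q_ensemble.max(axis=-1).mean(axis=0)
heat_map = v_bar.reshape(N, N)                             # one value per Deep Sea cell

# BootDQN-style exploration: sample one member at the start of an episode
# and act greedily with respect to it until the episode ends.
k = rng.integers(K)
def act(state_index):
    return int(q_ensemble[k, state_index].argmax())

print(heat_map.shape, act(0))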
SLIDE 69

[Video covering the whole slide]

SLIDES 70-73

Come visit our poster!

Ian Osband, John Aslanides, Albin Cassirer

bit.ly/rpf_nips @ianosband

Blog post: bit.ly/rpf_nips
Montezuma’s Revenge!
Demo code: bit.ly/rpf_nips