Randomized Prior Functions for Deep Reinforcement Learning
Ian Osband, John Aslanides, Albin Cassirer
bit.ly/rpf_nips @ianosband
Reinforcement Learning

Data & Estimation = Supervised Learning
+ partial feedback = Multi-armed Bandit
+ delayed consequences = Reinforcement Learning

… but we need practical solutions that combine them all.
We need effective uncertainty estimates for Deep RL.
Dropout sampling
“Dropout sample ≈ posterior sample” (Gal & Ghahramani, 2015).
But the dropout rate does not concentrate with the data, and even “concrete” dropout does not necessarily recover the right rate.
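To see why a fixed rate cannot concentrate, consider a toy Monte-Carlo-dropout sketch (the single linear layer and every name here are illustrative, not from any referenced implementation): the spread of dropout samples depends only on the weights and the rate p, never on how much data produced those weights.

```python
import numpy as np

# MC dropout on a fixed "trained" linear layer: sample random masks at
# test time and look at the spread of the outputs. That spread is set by
# the weights and the dropout rate p alone, so it stays the same whether
# the weights were fit on 10 data points or 10 million.
rng = np.random.default_rng(0)
W = rng.standard_normal(100)   # stand-in for trained weights
x = rng.standard_normal(100)   # a fixed test input
p = 0.5                        # dropout rate

samples = [(W * (rng.random(100) > p)) @ x / (1 - p) for _ in range(1000)]
print(np.std(samples))         # does not shrink as the dataset grows
```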
Variational inference
Apply VI to the Bellman error as if it were an i.i.d. supervised loss:
    Q(s, a) = r + γ max_α Q(s′, α)
But uncertainty in Q induces a correlated TD loss, and VI on an i.i.d. model does not propagate that uncertainty.
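To make the correlation concrete: supervised labels are fixed, but the TD target below is built from the very Q estimate being trained, so errors in Q feed straight back into its own “labels”. A minimal sketch with hypothetical names:

```python
import numpy as np

# TD regression targets depend on the current Q estimate at the next
# state. They are therefore not i.i.d. labels around a fixed truth:
# uncertainty in Q is correlated across the targets it generates.
def td_targets(q, transitions, gamma=0.99):
    """q: state -> array of action values; transitions: (s, a, r, s_next)."""
    return np.array([r + gamma * np.max(q(s_next))
                     for (_, _, r, s_next) in transitions])
```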
Distributional RL
Models the Q-value as a distribution rather than a point estimate.
But this distribution != posterior uncertainty: it is aleatoric rather than epistemic, so it’s not the right thing for exploration.
Count-based density
Estimate “visit counts” for each state and add an exploration bonus.
But the “density model” has nothing to do with the actual task, and with generalization a state “visit count” != uncertainty.
Bootstrap ensemble
Train an ensemble on noisy data: a classic statistical procedure!
But there is no explicit “prior” mechanism for “intrinsic motivation”: if you’ve never seen a reward, why would the agent explore?
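For reference, a minimal sketch of the classic bootstrap, with polynomial regression standing in for a network (the whole setup is illustrative): each member fits its own resample of the data, so members disagree where data is scarce, but nothing in the procedure expresses prior beliefs about unseen rewards.

```python
import numpy as np

# Classic statistical bootstrap: every ensemble member fits its own
# resample (with replacement) of the data. Disagreement across members
# reflects the data alone; no prior enters anywhere, so with zero
# observed reward all members happily agree on zero.
rng = np.random.default_rng(0)

def bootstrap_ensemble(x, y, n_members=10, degree=3):
    fits = []
    for _ in range(n_members):
        idx = rng.integers(0, len(x), size=len(x))  # resample with replacement
        fits.append(np.polyfit(x[idx], y[idx], degree))
    return fits
```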
Randomized prior functions
Key idea: add a random, fixed “prior” function to each member of the ensemble.
Exact Bayes posterior for linear functions!
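Concretely, the paper trains each member k as a network f_θk added to a fixed, randomly initialized prior network p_k, acting through Q_k(s, a) = f_θk(s, a) + β·p_k(s, a); only f_θk receives gradients. A minimal numpy sketch, assuming small MLPs (the helper names and the β value are illustrative, not the paper’s demo code):

```python
import numpy as np

# One ensemble member with a randomized prior function:
#     Q_k(x) = f_k(x) + beta * p_k(x)
# f_k is trainable; p_k is a fixed random network that is never updated.

def init_mlp(rng, sizes):
    """Random MLP weights for layer sizes, e.g. [inputs, hidden, actions]."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

class PriorMember:
    def __init__(self, rng, sizes, beta=3.0):
        self.f = init_mlp(rng, sizes)  # trainable network
        self.p = init_mlp(rng, sizes)  # fixed random prior, never trained
        self.beta = beta

    def q(self, x):
        return mlp(self.f, x) + self.beta * mlp(self.p, x)
```

In the linear-Gaussian case, fitting f to noise-perturbed targets and adding such a random prior yields exact posterior samples, which is the claim above.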
[1] “Deep Exploration via Randomized Value Functions”
Ensemble average of the greedy values:
    (1/K) Σ_{k=1}^{K} max_α Q_k(s, α)
Exploring potentially-rewarding states… the agent learns fast!
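A hypothetical usage sketch of that average, reusing the PriorMember class from the snippet above (ensemble size, network sizes, and batch are arbitrary):

```python
import numpy as np

# (1/K) sum_k max_a Q_k(s, a): average greedy value across the ensemble,
# evaluated for a small batch of states. Assumes PriorMember from above.
rng = np.random.default_rng(1)
members = [PriorMember(rng, [4, 32, 2]) for _ in range(10)]  # K = 10, 2 actions
states = rng.standard_normal((5, 4))                         # batch of 5 states
avg_greedy = np.mean([m.q(states).max(axis=1) for m in members], axis=0)
print(avg_greedy)  # one averaged value per state
```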
Ian Osband, John Aslanides, Albin Cassirer

Blog post: bit.ly/rpf_nips
Montezuma’s Revenge!
Demo code: bit.ly/rpf_nips