Randomized Prior Functions for Deep Reinforcement Learning



  1. Randomized Prior Functions for Deep Reinforcement Learning. Ian Osband, John Aslanides, Albin Cassirer. (bit.ly/rpf_nips, @ianosband)

  2. Reinforcement Learning

  3. Data & Estimation = Supervised Learning

  4. + partial feedback = Multi-armed Bandit

  5. + delayed consequences = Reinforcement Learning

  6. • “Sequential decision making under uncertainty.”

  7. (same as slide 6)

  8. • Three necessary building blocks:

  9. 1. Generalization

  10. 2. Exploration vs Exploitation

  11. 3. Credit assignment

  12. (same as slide 11)

  13. • As a field, we are pretty good at combining any 2 of these 3 … but we need practical solutions that combine them all.

  14. We need effective uncertainty estimates for Deep RL.

  15. Estimating uncertainty in deep RL

  16. Dropout sampling

  17. “Dropout sample ≈ posterior sample” (Gal & Ghahramani 2015)

  18. Dropout rate does not concentrate with the data.

  19. Even “concrete” dropout not necessarily right rate.
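
The dropout critique is easiest to see in code: with a fixed dropout rate, the spread of Monte Carlo dropout samples is set by the rate and the architecture, so it need not shrink as more data arrives. A minimal sketch, assuming a hypothetical PyTorch regression network and rate (not from the talk):

```python
import torch
import torch.nn as nn

# Hypothetical network with dropout; the 0.1 rate is an illustrative choice.
net = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Dropout(p=0.1),          # rate is fixed a priori, not fit to the data
    nn.Linear(64, 1),
)

def mc_dropout_samples(x, n_samples=100):
    """Draw 'posterior' samples by keeping dropout active at prediction time."""
    net.train()                 # keep dropout stochastic
    with torch.no_grad():
        return torch.stack([net(x) for _ in range(n_samples)])

x = torch.zeros(1, 1)
samples = mc_dropout_samples(x)
# This spread is governed by the dropout rate and the weights, so seeing 10x
# more data does not force it toward zero: the estimate need not concentrate.
print(samples.mean().item(), samples.std().item())
```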

  20. Variational inference

  21. Apply VI to Bellman error as if it was an i.i.d. supervised loss.

  22. Bellman error: Q(s, a) = r + γ max_α Q(s′, α). Uncertainty in Q ➡ correlated TD loss. VI on i.i.d. model does not propagate uncertainty.
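
A short sketch of why the Bellman error is not an i.i.d. regression problem: the targets are built from the same uncertain Q, so errors in Q(s′, ·) enter every TD loss term. The names here (td_targets, q_net) are illustrative, not from the talk:

```python
import torch
import torch.nn as nn

def td_targets(q_net, rewards, next_states, gamma=0.99):
    """Bellman targets: r + gamma * max_a Q(s', a)."""
    with torch.no_grad():
        next_q = q_net(next_states)                  # [batch, num_actions]
        return rewards + gamma * next_q.max(dim=1).values

# Toy usage: the targets depend on q_net itself, so the per-transition
# regression errors are correlated; fitting a variational posterior as if each
# (input, target) pair were an independent supervised label does not propagate
# uncertainty from Q(s', .) back through the target.
q_net = nn.Linear(4, 2)                              # hypothetical 4-dim state, 2 actions
targets = td_targets(q_net, torch.zeros(8), torch.randn(8, 4))
```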

  23. Distributional RL

  24. Models Q-value as a distribution, rather than point estimate.

  25. This distribution != posterior uncertainty.

  26. Aleatoric vs Epistemic … it’s not the right thing for exploration.
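
One way to see the aleatoric/epistemic point: a distributional head (e.g. a C51-style categorical over return atoms) outputs a distribution over returns, but that distribution describes the randomness of the return itself. It would look the same after ten transitions or ten million, so its spread is not the exploration signal. A toy illustration; the support and probabilities below are made up:

```python
import numpy as np

# Hypothetical categorical return distribution on fixed atoms (C51-style head).
atoms = np.linspace(-10.0, 10.0, 51)
probs = np.exp(-0.5 * (atoms - 2.0) ** 2)    # some learned distribution over returns
probs /= probs.sum()

mean = (probs * atoms).sum()
std = np.sqrt((probs * (atoms - mean) ** 2).sum())

# This spread is aleatoric: it models the stochasticity of the return for a
# fixed (s, a). Nothing here reflects how little data the agent has seen, so
# it is not the posterior (epistemic) uncertainty needed to drive exploration.
print(f"return mean {mean:.2f}, return std {std:.2f}")
```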

  27. Count-based density
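
The talk's own proposal, from the paper in the title, addresses these shortcomings by training an ensemble in which each member is a trainable network added to a fixed, randomly initialized "prior" network that is never updated. A minimal sketch of that idea, assuming a simple fully connected architecture and prior scale beta (illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=50):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class RandomizedPriorQ(nn.Module):
    """One ensemble member: Q(s) = trainable(s) + beta * prior(s).

    The prior network is randomly initialized and frozen; only the trainable
    network receives gradients, so each member keeps a different, persistent
    source of prior disagreement where data is scarce.
    """
    def __init__(self, state_dim, num_actions, beta=3.0):
        super().__init__()
        self.trainable = mlp(state_dim, num_actions)
        self.prior = mlp(state_dim, num_actions)
        for p in self.prior.parameters():
            p.requires_grad_(False)
        self.beta = beta

    def forward(self, state):
        return self.trainable(state) + self.beta * self.prior(state)

# An ensemble of such members, each trained on (optionally bootstrapped) data,
# gives approximate posterior samples of Q for deep exploration.
ensemble = [RandomizedPriorQ(state_dim=4, num_actions=2) for _ in range(10)]
```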
