 
Randomized Prior Functions for Deep Reinforcement Learning
Ian Osband, John Aslanides, Albin Cassirer
bit.ly/rpf_nips @ianosband
Reinforcement Learning

Data & Estimation = Supervised Learning
+ partial feedback = Multi-armed Bandit
+ delayed consequences = Reinforcement Learning

• “Sequential decision making under uncertainty.”
• Three necessary building blocks:
  1. Generalization
  2. Exploration vs Exploitation
  3. Credit assignment
• As a field, we are pretty good at combining any 2 of these 3.
  … but we need practical solutions that combine them all.

We need effective uncertainty estimates for Deep RL
Estimating uncertainty in deep RL

Dropout sampling
• “Dropout sample ≈ posterior sample” (Gal & Ghahramani, 2015)
• Dropout rate does not concentrate with the data.
• Even “concrete” dropout does not necessarily find the right rate.

Variational inference
• Apply VI to Bellman error as if it were an i.i.d. supervised loss.
• Bellman error: $Q(s, a) = r + \gamma \max_{\alpha} Q(s', \alpha)$
• Uncertainty in Q ➡ correlated TD loss.
• VI on i.i.d. model does not propagate uncertainty.

Distributional RL
• Models Q-value as a distribution, rather than a point estimate.
• This distribution != posterior uncertainty.
• Aleatoric vs Epistemic … it’s not the right thing for exploration.

Count-based density
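
To make the "does not concentrate" and "Aleatoric vs Epistemic" points concrete, here is a minimal toy sketch and not the method from the talk. It assumes a single arm with Gaussian reward noise and a one-weight "network" whose weight is simply set to the sample mean. The function name `uncertainty_summary` and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np


def uncertainty_summary(n, p_drop=0.5, true_mean=1.0, noise_std=1.0, seed=0):
    """Compare three notions of spread after observing n noisy rewards."""
    rng = np.random.default_rng(seed)
    rewards = true_mean + noise_std * rng.standard_normal(n)
    mean_hat = rewards.mean()

    # Aleatoric: spread of the reward distribution itself.
    # This is what a distributional model captures; it does not shrink with n.
    aleatoric_std = rewards.std(ddof=1)

    # Epistemic: standard error of the estimated mean.
    # This is what exploration needs; it shrinks like 1 / sqrt(n).
    epistemic_std = aleatoric_std / np.sqrt(n)

    # Fixed-rate dropout on one weight w with inverted scaling:
    #   sample = w * mask / (1 - p),  mask ~ Bernoulli(1 - p)
    # The std of such samples is |w| * sqrt(p / (1 - p)): set by p, not by n.
    # Assumption: w is set to the sample mean rather than trained under dropout.
    dropout_std = abs(mean_hat) * np.sqrt(p_drop / (1.0 - p_drop))

    return epistemic_std, aleatoric_std, dropout_std


for n in (100, 10_000):
    epistemic, aleatoric, dropout = uncertainty_summary(n)
    print(f"n={n}: epistemic={epistemic:.3f} "
          f"aleatoric={aleatoric:.3f} dropout={dropout:.3f}")
```

Under these assumptions only the epistemic term shrinks as more data arrives. The aleatoric spread and the fixed-rate dropout spread stay roughly constant, which is why neither is a reliable signal for exploration in this toy setting.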