How I Learned To Stop Worrying And Love Offline RL
An Optimistic Perspective on Offline Reinforcement Learning
Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi
What makes Deep Learning Successful?
○ Expressive function approximators
○ Powerful learning algorithms
○ Large and Diverse Datasets
How to make Deep RL similarly successful?
○ Expressive function approximators
○ Good learning algorithms (e.g., actor-critic, approximate DP)
○ Interactive Environments + Active Data Collection
RL for Real-World: RL with Large Datasets
○ Robotics (e.g., RoboNet)
○ Recommender Systems
○ Self-Driving Cars
[1] Dasari, Ebert, Tian, Nair, Bucher, Schmeckpeper, ... Finn. RoboNet: Large-Scale Multi-Robot Learning. [2] Yu, Xian, Chen, Liu, Liao, Madhavan, Darrell. BDD100K: A Large-scale Diverse Driving Video Database.
Offline RL: A Data-Driven RL Paradigm
Image Source: Data-Driven Deep Reinforcement Learning, BAIR Blog. https://bair.berkeley.edu/blog/2019/12/05/bear/
Offline RL can help:
○ … logged data.
○ … the basis of exploitation alone on common datasets.
○ … datasets.
But .. Offline RL is Hard!
○ NO new corrective feedback!
○ Requires Counterfactual Generalization
○ Bootstrapping (learning a guess from a guess) + Function Approximation + Fully Off-Policy
Standard RL fails in the Offline setting ..
Can standard off-policy RL succeed in the offline setting?
Offline RL on Atari 2600
○ Train 5 DQN (Nature) agents on each Atari game for 200 million frames (standard protocol), using sticky actions (stochasticity).
○ Save all of the tuples of (observation, action, reward, next observation) encountered during training → the DQN-replay dataset(s).
○ Train off-policy agents using the DQN-replay dataset(s) without any further environment interaction.
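The logged-data setup above can be sketched as a fixed dataset of transitions that is populated while the behavior agent trains, then frozen and only sampled from. The class name and interface below are hypothetical illustrations, not the paper's released code.

```python
import numpy as np

class LoggedReplayDataset:
    """Fixed dataset of logged (obs, action, reward, next_obs, done) tuples.

    Minimal sketch of the DQN-replay idea; name and API are hypothetical.
    """

    def __init__(self):
        self.transitions = []

    def add(self, obs, action, reward, next_obs, done):
        # Called while the behavior (online DQN) agent trains; once logging
        # ends, the dataset is frozen for offline training.
        self.transitions.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size, rng):
        # Offline agents draw minibatches uniformly at random, with no
        # further environment interaction.
        idx = rng.integers(len(self.transitions), size=batch_size)
        return [self.transitions[i] for i in idx]
```

An offline learner then simply calls `sample` in its training loop instead of stepping an environment.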
Does Offline DQN work?
Let's try recent off-policy algorithms!
QR-DQN: Distributional RL uses Z(s, a), a distribution over returns, instead of the Q-function.
[Diagram: a shared neural network maps the state to K return quantiles Z(1/K), Z(2/K), ..., Z(K/K) per action.]
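QR-DQN fits those K quantiles of the return distribution with a quantile-regression (Huber) loss. A numpy sketch of that loss, for one state-action pair, is below; the `kappa` threshold and the unscaled Huber term are simplifying assumptions of this sketch.

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile-regression Huber loss in the style of QR-DQN (sketch).

    pred_quantiles: (K,) predicted return quantiles Z(1/K) ... Z(K/K).
    target_samples: (M,) samples from the bootstrapped target distribution.
    """
    K = len(pred_quantiles)
    taus = (np.arange(K) + 0.5) / K  # quantile midpoints tau_i
    # Pairwise TD errors u[i, j] = target_j - pred_i.
    u = target_samples[None, :] - pred_quantiles[:, None]
    # Huber penalty: quadratic near zero, linear in the tails.
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric quantile weighting |tau - 1{u < 0}|.
    weight = np.abs(taus[:, None] - (u < 0).astype(float))
    return (weight * huber).mean()
```

The asymmetric weight is what pushes each head toward its own quantile rather than the mean.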
Does Offline QR-DQN work?
Offline DQN (Nature) vs Offline C51
Average online scores of C51 and DQN (Nature) agents trained offline on the DQN-replay dataset for the same number of gradient steps as online DQN. The horizontal line shows the performance of fully-trained DQN.
Developing Robust Offline RL algorithms
➢ Emphasis on Generalization
○ Given a fixed dataset, generalize to unseen states during evaluation.
➢ Ensemble of Q-estimates:
○ Ensembling, Dropout widely used for improving generalization.
Ensemble-DQN: Train multiple (linear) Q-heads with different random initialization.
[Diagram: a shared neural network feeds K linear heads Q1, Q2, ..., QK, each mapping actions to return estimates.]
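The head structure above can be sketched in a few lines: K independently initialized linear heads on top of shared features, with the mean used as the ensemble's Q-estimate. Function and argument names here are hypothetical illustrations.

```python
import numpy as np

def ensemble_q_values(head_weights, features):
    """Evaluate K linear Q-heads on shared features (Ensemble-DQN sketch).

    head_weights: list of K (num_features, num_actions) arrays, each drawn
    from a different random initialization (and trained separately).
    features: (num_features,) output of the shared neural network.
    Returns the per-head estimates and their mean.
    """
    per_head = np.stack([features @ w for w in head_weights])  # (K, num_actions)
    return per_head, per_head.mean(axis=0)
```

During training each head minimizes its own TD error; the averaging only enters at evaluation time.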
Does Offline Ensemble-DQN work?
Developing Robust Offline RL algorithms
➢ Emphasis on Generalization
○ Given a fixed dataset, generalize to unseen states during evaluation.
➢ Q-learning as constraint satisfaction
Random Ensemble Mixture (REM): Minimize TD error on a random (per minibatch) convex combination of multiple Q-estimates, ∑i αi Qi.
[Diagram: a shared neural network feeds K heads Q1, Q2, ..., QK, which are mixed with random convex weights αi into ∑i αi Qi.]
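The REM update above can be sketched as follows: per minibatch, draw random convex weights α, mix the K heads of both the online and target networks with the same α, and compute a standard Q-learning TD error on the mixture. This is a numpy sketch with hypothetical names, not the paper's implementation.

```python
import numpy as np

def rem_td_error(q_heads, q_heads_next, actions, rewards, dones, rng, gamma=0.99):
    """TD error on a random convex combination of K Q-heads (REM sketch).

    q_heads: (K, B, num_actions) online-network Q-estimates.
    q_heads_next: (K, B, num_actions) target-network Q-estimates.
    """
    K = q_heads.shape[0]
    alpha = rng.random(K)
    alpha /= alpha.sum()  # random convex combination, redrawn per minibatch
    q = np.tensordot(alpha, q_heads, axes=1)            # (B, num_actions)
    q_next = np.tensordot(alpha, q_heads_next, axes=1)  # (B, num_actions)
    # Standard Q-learning target on the mixed estimate.
    target = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
    chosen = q[np.arange(len(actions)), actions]
    return target - chosen
```

Because α is redrawn every minibatch, the heads are trained on infinitely many ensemble mixtures rather than K fixed estimates.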
REM vs QR-DQN
[Diagram: REM mixes K Q-heads Q1, ..., QK from a shared network into ∑i αi Qi over actions and returns; QR-DQN predicts K return quantiles Z(1/K), Z(2/K), ..., Z(K/K) from a shared network.]
Offline Stochastic Atari Results
Scores averaged over 5 runs of offline agents trained using DQN replay data across 60 Atari games for 5X gradient steps. Offline REM surpasses gains from online C51 and offline QR-DQN.
Offline REM vs. Baselines
Reviewers asked: Does Online REM work?
Average normalized scores of online agents trained for 200 million game frames. Multi-network REM with 4 Q-functions performs comparably to QR-DQN.
Key Factor in Success: Offline Dataset Size
Randomly subsample N% of frames from 200 million frames for offline training. Divergence with 1% of data for prolonged training!
Key Factor in Success: Offline Dataset Composition
Subsample first 10% of total frames (20 million) for offline training -- much lower quality data.
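The two ablations above differ only in how the logged frames are subsampled: uniformly at random across training (size ablation) versus only the earliest frames (composition ablation, dominated by low-quality early behavior). A tiny sketch, with hypothetical function names:

```python
import numpy as np

def subsample_random(num_frames, fraction, rng):
    """Size ablation: keep a random `fraction` of the logged frames."""
    k = int(num_frames * fraction)
    return rng.choice(num_frames, size=k, replace=False)

def subsample_first(num_frames, fraction):
    """Composition ablation: keep only the earliest `fraction` of frames,
    i.e., data from the behavior agent's weakest phase."""
    return np.arange(int(num_frames * fraction))
```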
Choice of Algorithm: Offline Continuous Control
Offline agents trained using the full experience replay of DDPG on MuJoCo environments.
Offline RL: Stability / Overfitting
More gradient updates eventually degrade performance :(
Average online scores of offline agents trained on 5 games using logged DQN replay data for 5X gradient steps compared to online DQN.
Offline RL for Robotics
Future Work
"The potential for off-policy learning remains tantalizing, the best way to achieve it still a mystery." - Sutton & Barto
Offline RL: Future Work
○ Subsampling DQN-replay datasets (e.g., first / last k million frames)
○ Currently, online evaluation is used for early stopping. "True" offline RL requires offline policy evaluation.
TL;DR
○ Off-policy RL agents trained on sufficiently large and diverse datasets perform quite well in the offline setting.
○ Isolating exploitation from exploration
○ Developing sample-efficient and stable algorithms
○ Pretrain RL agents on logged data