SLIDE 1

How I Learned To Stop Worrying And Love Offline RL

An Optimistic Perspective on Offline Reinforcement Learning

Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi

SLIDE 2

What makes Deep Learning Successful?

Expressive function approximators

SLIDE 3

What makes Deep Learning Successful?

Expressive function approximators
Powerful learning algorithms

SLIDE 4

What makes Deep Learning Successful?

Expressive function approximators
Powerful learning algorithms
Large and Diverse Datasets

SLIDE 5

How to make Deep RL similarly successful?

Expressive function approximators
Good learning algorithms, e.g., actor-critic, approx. DP

SLIDE 6

How to make Deep RL similarly successful?

Expressive function approximators
Good learning algorithms, e.g., actor-critic, approx. DP
Large and Diverse Datasets

SLIDE 7

How to make Deep RL similarly successful?

Expressive function approximators
Good learning algorithms, e.g., actor-critic, approx. DP
Interactive Environments / Active Data Collection

SLIDE 8

RL for Real-World: RL with Large Datasets

[1] Dasari, Ebert, Tian, Nair, Bucher, Schmeckpeper, ..., Finn. RoboNet: Large-Scale Multi-Robot Learning.
[2] Yu, Xian, Chen, Liu, Liao, Madhavan, Darrell. BDD100K: A Large-scale Diverse Driving Video Database.

Robotics (RoboNet)

SLIDE 9

RL for Real-World: RL with Large Datasets

Robotics (RoboNet)
Recommender Systems

SLIDES 10-11

RL for Real-World: RL with Large Datasets

Robotics (RoboNet)
Recommender Systems
Self-Driving Cars

SLIDE 12

Offline RL: A Data-Driven RL Paradigm

Image Source: Data-Driven Deep Reinforcement Learning, BAIR Blog. https://bair.berkeley.edu/blog/2019/12/05/bear/

SLIDE 13

Offline RL: A Data-Driven RL Paradigm

Offline RL can help:

  • Pretrain agents on existing logged data.

SLIDE 14

Offline RL: A Data-Driven RL Paradigm

Offline RL can help:

  • Pretrain agents on existing logged data.
  • Evaluate RL algorithms on the basis of exploitation alone on common datasets.

SLIDE 15

Offline RL: A Data-Driven RL Paradigm

Offline RL can help:

  • Pretrain agents on existing logged data.
  • Evaluate RL algorithms on the basis of exploitation alone on common datasets.
  • Deliver real-world impact.
SLIDE 16

But .. Offline RL is Hard!

NO new corrective feedback!

SLIDE 17

But .. Offline RL is Hard!

Requires Counterfactual Generalization

SLIDE 18

But .. Offline RL is Hard!

Bootstrapping (learning a guess from a guess)
Function Approximation
Fully Off-Policy

SLIDES 19-21

Standard RL fails in the Offline setting ..

SLIDE 22

Can standard off-policy RL succeed in the Offline setting?

SLIDE 23

Offline RL on Atari 2600

Train 5 DQN (Nature) agents on each Atari game for 200 million frames (the standard protocol), using sticky actions for stochasticity.
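Sticky actions inject stochasticity by having the emulator repeat the agent's previous action with some probability (0.25 in the standard protocol). Below is a minimal illustrative sketch, assuming a Gym-style `step`/`reset` API; the wrapper name and structure are mine, not the paper's code.

```python
import random

class StickyActionWrapper:
    """Repeat the previous action with probability p, making otherwise
    deterministic Atari emulators stochastic. Illustrative sketch only."""

    def __init__(self, env, p=0.25):
        self.env = env
        self.p = p
        self.prev_action = 0

    def reset(self):
        self.prev_action = 0
        return self.env.reset()

    def step(self, action):
        # With probability p, ignore the chosen action and repeat the last one.
        if random.random() < self.p:
            action = self.prev_action
        self.prev_action = action
        return self.env.step(action)
```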

SLIDE 24

Offline RL on Atari 2600

Save all tuples of (observation, action, next observation, reward) encountered during training to the DQN-replay dataset(s).
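As a rough sketch of this logging protocol: every transition the online agent sees is appended to a dataset, rather than overwritten as in a standard sliding-window replay buffer. The names `LoggedReplayDataset` and `run_and_log`, and the 4-tuple Gym step API, are assumptions for illustration; the released DQN-replay dataset is actually stored as checkpointed replay buffers per game.

```python
class LoggedReplayDataset:
    """Append-only log of all transitions seen during online training,
    unlike a standard sliding-window replay buffer (illustrative sketch)."""

    def __init__(self):
        self.transitions = []

    def add(self, obs, action, next_obs, reward, done):
        self.transitions.append((obs, action, next_obs, reward, done))

def run_and_log(env, agent, dataset, num_frames):
    """Run an online agent (e.g., epsilon-greedy DQN) and log everything."""
    obs = env.reset()
    for _ in range(num_frames):
        action = agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        dataset.add(obs, action, next_obs, reward, done)
        obs = env.reset() if done else next_obs
```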

SLIDE 25

Offline RL on Atari 2600

Train off-policy agents using the DQN-replay dataset(s) without any further environment interaction.
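A minimal sketch of one such offline update, reusing the logged dataset above; `q_net` and `target_net` are hypothetical PyTorch Q-networks, and this is plain DQN trained on fixed data rather than the paper's exact training code.

```python
import random

import numpy as np
import torch
import torch.nn.functional as F

def offline_dqn_step(q_net, target_net, optimizer, dataset,
                     batch_size=32, gamma=0.99):
    """One offline DQN gradient step: sample a minibatch from the fixed
    dataset; the environment is never touched."""
    batch = random.sample(dataset.transitions, batch_size)
    obs, actions, next_obs, rewards, dones = [
        torch.as_tensor(np.stack(x), dtype=torch.float32)
        for x in zip(*batch)]
    actions = actions.long()

    # Standard Q-learning target, computed purely from logged data.
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    q = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```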

SLIDE 26

Does Offline DQN work?

SLIDE 27

Let's try recent off-policy algorithms!

Distributional RL uses Z(s, a), a distribution over returns, instead of the Q-function.

[Diagram: QR-DQN. A shared neural network maps each action to K return quantiles Z(1/K), Z(2/K), ..., Z(K/K).]
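To make the diagram concrete, here is a minimal PyTorch sketch of a QR-DQN-style output head; the parameter names `feature_dim`, `num_actions`, and `K` are illustrative, and the quantile-regression loss itself is omitted.

```python
import torch.nn as nn

class QRDQNHead(nn.Module):
    """Shared features -> K return quantiles Z(1/K), ..., Z(K/K) per action."""

    def __init__(self, feature_dim, num_actions, K=200):
        super().__init__()
        self.num_actions, self.K = num_actions, K
        self.fc = nn.Linear(feature_dim, num_actions * K)

    def forward(self, features):
        # Quantile estimates of the return distribution Z(s, a).
        return self.fc(features).view(-1, self.num_actions, self.K)

    def q_values(self, features):
        # Q(s, a) is recovered as the mean of the quantile estimates.
        return self.forward(features).mean(dim=2)
```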

SLIDE 28

Does Offline QR-DQN work?

SLIDE 29

Does Offline DQN work?

SLIDE 30

Offline DQN (Nature) vs Offline C51

Average online scores of C51 and DQN (Nature) agents trained offline on the DQN-replay dataset for the same number of gradient steps as online DQN. The horizontal line shows the performance of fully trained DQN.

SLIDE 31

Developing Robust Offline RL algorithms

➢ Emphasis on Generalization
  ○ Given a fixed dataset, generalize to unseen states during evaluation.

SLIDE 32

Developing Robust Offline RL algorithms

➢ Emphasis on Generalization
  ○ Given a fixed dataset, generalize to unseen states during evaluation.
➢ Ensemble of Q-estimates
  ○ Ensembling and Dropout are widely used for improving generalization.

SLIDE 33

Ensemble-DQN: Train multiple (linear) Q-heads with different random initializations.

[Diagram: Ensemble-DQN. A shared neural network feeds K Q-heads Q1, Q2, ..., QK, each predicting returns for all actions.]
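A minimal PyTorch sketch of this multi-head architecture; the class and parameter names are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class EnsembleDQNHead(nn.Module):
    """K independently initialized linear Q-heads over shared features.
    Each head is trained with its own TD error against its own target head."""

    def __init__(self, feature_dim, num_actions, K=4):
        super().__init__()
        # Different random initializations make the heads disagree on
        # states poorly covered by the dataset.
        self.heads = nn.ModuleList(
            nn.Linear(feature_dim, num_actions) for _ in range(K))

    def forward(self, features):
        # Shape: (batch, K, num_actions).
        return torch.stack([head(features) for head in self.heads], dim=1)
```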

SLIDE 34

Does Offline Ensemble-DQN work?

SLIDE 35

Does Offline DQN work?

SLIDE 36

Developing Robust Offline RL algorithms

➢ Emphasis on Generalization
  ○ Given a fixed dataset, generalize to unseen states during evaluation.
➢ Q-learning as constraint satisfaction

SLIDE 37

Random Ensemble Mixture (REM): Minimize the TD error on a random (per minibatch) convex combination of multiple Q-estimates.

[Diagram: REM. A shared neural network feeds K Q-heads Q1, Q2, ..., QK, which are mixed into a single estimate ∑i αi Qi with random convex weights αi.]
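A sketch of the REM update, under the assumption that the online and target networks output `(batch, K, num_actions)` tensors as in the Ensemble-DQN sketch above; names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def rem_loss(q_heads, next_target_heads, actions, rewards, dones, gamma=0.99):
    """q_heads: online Q-estimates at the current observations;
    next_target_heads: target-network estimates at the next observations;
    both shaped (batch, K, num_actions)."""
    K = q_heads.shape[1]

    # One random convex combination per minibatch: uniform samples,
    # renormalized so that alpha >= 0 and alpha.sum() == 1.
    alpha = torch.rand(K)
    alpha = (alpha / alpha.sum()).view(1, K, 1)

    # The same mixture is applied to the online and target heads.
    q_mix = (alpha * q_heads).sum(dim=1)               # (batch, num_actions)
    next_mix = (alpha * next_target_heads).sum(dim=1)  # (batch, num_actions)

    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * next_mix.max(dim=1).values

    q = q_mix.gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.smooth_l1_loss(q, targets)
```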

SLIDE 38

REM vs QR-DQN

[Diagram: REM (random convex combination ∑i αi Qi of Q-heads Q1, ..., QK) side by side with QR-DQN (K return quantiles Z(1/K), ..., Z(K/K)), both built on a shared neural network.]

SLIDE 39

Offline Stochastic Atari Results

Scores averaged over 5 runs of offline agents trained using the DQN-replay data across 60 Atari games for 5x gradient steps. Offline REM surpasses gains from online C51 and offline QR-DQN.

SLIDE 40

Offline REM vs. Baselines

SLIDE 41

Reviewers asked: Does Online REM work?

Average normalized scores of online agents trained for 200 million game frames. Multi-network REM with 4 Q-functions performs comparably to QR-DQN.

SLIDE 42

Key Factor in Success: Offline Dataset Size

Randomly subsample N% of frames from the 200 million frames for offline training. Divergence with 1% of the data under prolonged training!

SLIDE 43

Key Factor in Success: Offline Dataset Composition

Subsample the first 10% of total frames (20 million) for offline training -- much lower quality data.

SLIDE 44

Choice of Algorithm: Offline Continuous Control

Offline agents trained using the full experience replay of DDPG on MuJoCo environments.

SLIDE 45

Offline RL: Stability / Overfitting

More gradient updates eventually degrade performance :(

Average online scores of offline agents trained on 5 games using logged DQN-replay data for 5x gradient steps, compared to online DQN.

SLIDE 46

Offline RL for Robotics

SLIDE 47

Future Work

"The potential for off-policy learning remains tantalizing, the best way to achieve it still a mystery." - Sutton & Barto

SLIDE 48

Offline RL: Future Work

  • Rigorous characterization of the role of generalization in offline RL

SLIDE 49

Offline RL: Future Work

  • Rigorous characterization of the role of generalization in offline RL
  • Benchmarking with various data collection strategies
    ○ Subsampling DQN-replay datasets (e.g., first / last k million frames)

SLIDE 50

Offline RL: Future Work

  • Rigorous characterization of the role of generalization in offline RL
  • Benchmarking with various data collection strategies
    ○ Subsampling DQN-replay datasets (e.g., first / last k million frames)
  • Offline Evaluation / Hyperparameter Tuning
    ○ Currently, online evaluation is used for early stopping. "True" offline RL requires offline policy evaluation.

SLIDE 51

Offline RL: Future Work

  • Rigorous characterization of the role of generalization in offline RL
  • Benchmarking with various data collection strategies
    ○ Subsampling DQN-replay datasets (e.g., first / last k million frames)
  • Offline Evaluation / Hyperparameter Tuning
    ○ Currently, online evaluation is used for early stopping. "True" offline RL requires offline policy evaluation.
  • Model-based RL approaches

SLIDE 52

TL;DR

  • Robust RL algorithms (e.g., REM, QR-DQN), trained on sufficiently large and diverse datasets, perform quite well in the offline setting.
  • Offline RL provides a standardized setup for:
    ○ Isolating exploitation from exploration
    ○ Developing sample-efficient and stable algorithms
    ○ Pretraining RL agents on logged data

SLIDE 53

For code, the DQN-replay dataset(s), and a previous version of the paper, refer to offline-rl.github.io

Thank you!