Offline Reinforcement Learning
CS 285, Instructor: Aviral Kumar, UC Berkeley
What have we covered so far?
- Exploration:
- Strategies to discover high-reward states, diverse skills, etc.
- How hard is exploration?
#Samples ≥ Ω( (|S||A| / (1 − γ)^3) · log(|S||A| / δ) )   ("Super Large!")
This is how many samples we need even in the "best" case to learn an optimal Q-function.
- Even if we are ready to collect so many samples, it may be
dangerous in practice: imagine a random policy on an autonomous car or a robot!
Azar, Munos, Kappen. On the Sample Complexity of RL with a Generative Model. ICML 2012 and many others…
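To get a feel for the scale of the bound above, here is a rough back-of-the-envelope calculation (a sketch: the problem sizes below are made up for illustration, and constants and exact log factors in the bound are ignored):

```python
import math

# Hypothetical problem sizes, chosen only for illustration.
num_states = 10 ** 6     # |S|
num_actions = 10         # |A|
gamma = 0.99             # discount factor
delta = 0.1              # allowed failure probability

# Ignoring constants: |S||A| / (1 - gamma)^3 * log(|S||A| / delta)
samples = (num_states * num_actions) / (1 - gamma) ** 3 \
          * math.log(num_states * num_actions / delta)

print(f"roughly {samples:.1e} samples")   # on the order of 10^14 -- "super large"
```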
Can we apply standard RL in the real-world?
- RL is fundamentally an “active” learning paradigm: the agent needs
to collect its own dataset to learn meaningful policies
- This can be unsafe or expensive in real world problems!
Generalization?
Gottesman, Johansson, Komorowski, Faisal, Sontag, Doshi-Velez. Guidelines for RL in Healthcare. Nature Medicine, 2019. Kumar, Gupta, Levine. DisCor: Corrective Feedback in RL via Distribution Correction, NeurIPS 2020.
Iterated data collection can cause poor generalization!
Offline (Batch) Reinforcement Learning
Learn from a previously collected static dataset. Why is offline RL promising?
- Large static datasets of meaningful behaviours already exist
- Large datasets are at the core of successes in vision and NLP
Lange, Gabel, Riedmiller. Batch Reinforcement Learning. 2012. Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
Applications of Offline RL
Kalashnikov et al. QT-Opt: Scalable Deep RL for Vision-Based Robotic Manipulation. CoRL 2018. Jaques et al. Way Off-Policy Batch Reinforcement Learning for Dialog. EMNLP 2020. Guez et al. Adaptive Treatment of Epilepsy via Batch-Mode Reinforcement Learning. AAAI 2008. Kendall et al. Learning to Drive in a Day. ICRA 2019. Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
How well can offline RL perform?
Can do as well as the dataset! Can do better than the dataset!
Stitching: offline RL can combine pieces of different suboptimal trajectories into a better one.
One can show that Q-learning recovers the optimal policy even from random data.
(Figure: contrast with supervised learning, e.g., "dog vs. cat" classification, which can at best match its labels.)
Fu, Kumar, Nachum, Tucker, Levine. D4RL: Datasets for Deep Data-Driven RL. arXiv 2020.
Formalism and Notation
- Dataset construction:
- Several trajectories:
D = {τ_1, …, τ_N},   τ_i = { (s_t^i, a_t^i, r_t^i, s'_t^i) }_{t=1}^{H}
Reward is known (logged with each transition).
- Approximate “distribution” of states in the dataset: D(s)
- Approximate distribution of actions at a given state in the dataset: D(a|s)
- Standard RL notation from before: Qπ(s, a), V π(s), dπ(s), etc.
- Will use notation for the behavior policy, πβ(a|s) = D(a|s)
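For a small, discrete dataset, these quantities can be estimated by simple counting. A minimal sketch (the `transitions` list is a made-up toy dataset; with continuous states and actions, π_β is usually fit with a parametric model such as behavior cloning instead):

```python
from collections import Counter

# Hypothetical toy dataset: (s, a, r, s') tuples pooled from all trajectories in D.
transitions = [(0, 1, 0.0, 1), (0, 1, 0.0, 2), (0, 0, 1.0, 3), (1, 1, 0.0, 0)]
n = len(transitions)

state_counts = Counter(s for (s, a, r, s2) in transitions)
state_action_counts = Counter((s, a) for (s, a, r, s2) in transitions)

def D_s(s):
    """Empirical state distribution D(s)."""
    return state_counts[s] / n

def pi_beta(a, s):
    """Empirical behavior policy pi_beta(a|s) = D(a|s)."""
    return state_action_counts[(s, a)] / state_counts[s]

print(D_s(0), pi_beta(1, 0))   # 0.75, 0.666...
```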
Part 1: Classic Offline RL Algorithms and Challenges With Offline RL
Part 2: Deep RL Algorithms to Address These Challenges
Part 3: Related Problems, Evaluation Protocols, Applications
Part 1: Classic Algorithms and Challenges With Offline RL
A Generic Off-Policy RL Algorithm
- 1. Collect data using the current policy.
- 2. Store this data in a replay buffer.
- 3. Use the replay buffer to make updates to the policy and the Q-function.
- 4. Continue from step 1.
DQN and Actor-critic algorithms both follow a similar skeleton, but with different design choices.
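In code, the skeleton looks roughly as follows. This is a toy sketch: the environment, the uniformly random behavior policy, and the tabular Q-update below are stand-ins, not any particular algorithm's implementation.

```python
import random

class ToyEnv:
    """A stand-in environment: two states, two actions, random dynamics."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = random.randint(0, 1)
        reward = 1.0 if (self.s == 1 and a == 1) else 0.0
        done = random.random() < 0.1
        return self.s, reward, done

def policy(s):
    return random.randint(0, 1)              # placeholder: acts uniformly at random

def update(q_table, batch, gamma=0.9, lr=0.1):
    # Step 3: one tabular Q-learning update per sampled transition.
    for (s, a, r, s2, done) in batch:
        target = r + (0.0 if done else gamma * max(q_table[(s2, b)] for b in (0, 1)))
        q_table[(s, a)] += lr * (target - q_table[(s, a)])

env = ToyEnv()
q_table = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
replay_buffer = []

for iteration in range(50):
    s = env.reset()
    for _ in range(20):                       # step 1: collect data with the current policy
        a = policy(s)
        s2, r, done = env.step(a)
        replay_buffer.append((s, a, r, s2, done))   # step 2: store it in the replay buffer
        s = env.reset() if done else s2
    for _ in range(10):                       # step 3: update from replayed data
        update(q_table, random.sample(replay_buffer, min(32, len(replay_buffer))))
    # Step 4: continue from step 1. In offline RL, steps 1-2 are removed and the
    # buffer is replaced by a fixed dataset D.

print(q_table)
```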
Can such off-policy RL algorithms be used?
Off-Policy RL Algorithms can be applied, in principle
“Off-policy” buffer from past iterates of the learned policy (the usual online off-policy setting) vs. “off-policy” buffer from some unknown policies (the offline setting)
We will discuss some classical algorithms based on this idea next
Lagoudakis, Parr. Least Squares Policy Iteration. JMLR 2003. Ernst et al. Tree-Based Batch Mode Reinforcement Learning. JMLR 2005. Gordon G. J. Stable Function Approximation in Dynamic Programming. ICML 1995, and many more…
Classic Batch Q-Learning Algorithms
Lagoudakis, Parr. Least Squares Policy Iteration. JMLR 2003. Ernst et al. Tree-Based Batch Mode Reinforcement Learning. JMLR 2005. Riedmiller. Neural Fitted Q-Iteration. ECML 2005. Gordon G. J. Stable Function Approximation in Dynamic Programming. ICML 1995. Antos, Szepesvari, Munos. Fitted Q-Iteration in Continuous Action-Space MDPs. NeurIPS 2007.
- 1. Compute target values using the current Q-function.
- 2. Train the Q-function by minimizing TD error with respect to the target values from step 1.
Linear Q-functions
Q(s, a) = w^T φ(s, a)
w^T φ(s, a) ≈ r + γ max_{a'} w^T φ(s', a')
This can be solved in several ways: (1) find a fixed point of the above equation, or (2) minimize the gap between the two sides of the equation.
Least-Squares Temporal Difference Q-Learning (LSTD-Q)
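A minimal sketch of the fixed-point approach (option 1), in the spirit of LSTD-Q / least-squares policy iteration; the feature map and the four transitions below are invented purely for illustration:

```python
import numpy as np

gamma = 0.9
actions = [0, 1]

def phi(s, a):
    """Hypothetical feature map phi(s, a): one block of [1, s] per action."""
    f = np.zeros(4)
    f[2 * a: 2 * a + 2] = [1.0, s]
    return f

# Hypothetical batch of transitions (s, a, r, s').
batch = [(0.0, 1, 1.0, 0.5), (0.5, 0, 0.0, 1.0), (1.0, 1, 2.0, 0.0), (0.5, 1, 1.0, 1.0)]

w = np.zeros(4)
for _ in range(20):                                # re-solve against the current greedy policy
    A = np.zeros((4, 4))
    b = np.zeros(4)
    for s, a, r, s2 in batch:
        a2 = max(actions, key=lambda x: w @ phi(s2, x))        # greedy next action
        A += np.outer(phi(s, a), phi(s, a) - gamma * phi(s2, a2))
        b += phi(s, a) * r
    # Fixed point of  w^T phi(s, a) = r + gamma * w^T phi(s', a')  in the least-squares sense.
    w = np.linalg.lstsq(A, b, rcond=None)[0]

print("Q(s=0.5, a):", [float(w @ phi(0.5, a)) for a in actions])
```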
Classic Batch RL Algorithms based on IS
- Doubly-robust estimators
- High-confidence bounds on the return estimate
- Variance reduction techniques
- Precup. Eligibility Traces for Off-Policy Policy Evaluation. CSD Faculty Publication Series, 2000.
Precup, Sutton, Dasgupta. Off-Policy TD Learning with Function Approximation. ICML 2001. Peshkin and Shelton. Learning from Scarce Experience. 2002. Thomas, Theocharous, Ghavamzadeh. High Confidence Off-Policy Evaluation. AAAI 2015. Thomas, Theocharous, Ghavamzadeh. High Confidence Off-Policy Improvement. ICML 2015. Thomas, Brunskill. Magical Policy Search: Data Efficient RL with Guarantees of Global Optimality. EWRL 2016. Jiang and Li. Doubly-Robust Off-Policy Value Estimation for Reinforcement Learning. ICML 2016.
Modern Offline RL: A Simple Experiment
Collect expert data and run actor-critic algorithms on this data
Learning diverges
“Policy unlearning”
Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy RL via Bootstrapping Error Reduction, NeurIPS 2019. Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
Not a classical overfitting issue!
(Plot: "how well it thinks it does" vs. "how well it does"; performance does not improve with more data.)
So, why do RL algorithms fail, even though imitation learning would work in this setting (e.g., in Lecture 2)?
Let’s see how the Q-function is updated
Q(s, a) ← r(s, a) + γ max_{a'} Q(s', a')
Training objective:  E_{s,a,s' ∼ D} [ (Q(s, a) − (r(s, a) + γ max_{a'} Q(s', a')))^2 ]
Where does the action a’ for the target value come from?
From max_{a'} Q(s', a'): the action that maximizes the current Q-function, which may never appear in the dataset.
Which actions does the Q- function train on?
s, a ∼ D
Q-learning queries Q-values at unseen actions when computing targets, but those values are never trained on.
(Plot: Q-values on the data vs. Q-values at other actions.)
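A few lines of code make the mismatch concrete: the regression targets query the Q-function at argmax actions, while the loss only ever touches actions that appear in the dataset. Everything here (the random Q-table, the tiny dataset) is a made-up toy example:

```python
import numpy as np

gamma = 0.99
Q = np.random.randn(10, 5)                        # toy Q-table: 10 states x 5 actions

# Hypothetical offline dataset where only action 0 was ever taken.
dataset = [(3, 0, 1.0, 4), (4, 0, 0.0, 5), (5, 0, 1.0, 6)]

for s, a, r, s2 in dataset:
    target_action = int(Q[s2].argmax())           # may be an action never seen in D!
    target = r + gamma * Q[s2, target_action]     # backup uses an untrained Q-value...
    td_error = Q[s, a] - target                   # ...but training only ever touches Q[s, 0]
    print(f"s'={s2}: target uses action {target_action}, dataset only contains action 0")
```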
Why are erroneous backups a big deal?
- This phenomenon also happens in online RL settings, where the Q-function
is erroneously optimistic
- But Boltzmann or epsilon-greedy exploration with this over-optimistic Q-function (generally) leads to "error correction": the agent tries the over-valued action, observes its true outcome, and the value gets corrected.
- Error correction is not strictly guaranteed with online data collection when using deep neural nets, but it mostly works fine in practice (tricks: replay buffers, distribution correction, etc.).
πexplore(a|s) ∝ exp(Q(s, a))
- But this primary mechanism for error correction, i.e., exploration, is impossible in offline RL, since we have no access to the environment.
Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy RL via Bootstrapping Error Reduction, NeurIPS 2019. Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020. Kumar, Gupta, Levine. DisCor: Corrective-Feedback in RL via Distribution Correction. NeurIPS 2020. Kumar, Gupta. Does On-Policy Data Collection Fix Errors in Off-Policy Reinforcement Learning?, BAIR blog.
Distributional Shift in Offline RL
- Distribution shift between the behavior policy (the policy
that collected the data) and the policy during learning
Backup (queries actions from the learned policy):
Q(s, a) ← r(s, a) + γ max_{a'} Q(s', a')     or     Q(s, a) ← r(s, a) + γ E_{a'∼π(a'|s')}[Q(s', a')],   with π(a|s) ≠ π_β(a|s)
Training (only sees actions from the behavior policy):
E_{s,a ∼ d^{π_β}(s,a)} [ (Q(s, a) − B̄Q(s, a))^2 ],   with a ∼ π_β(a|s)
Offline Q-Learning algorithms can overestimate the value of unseen actions and can thus be falsely optimistic
Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy RL via Bootstrapping Error Reduction, NeurIPS 2019. Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
Error Compounds in RL (Additional Slide)
Janner, Fu, Zhang, Levine. When to Trust Your Model: Model-Based Policy Optimization. NeurIPS 2019. Ross, Gordon, Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. AISTATS 2011 Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
Recent work has also shown counterexamples indicating that we cannot do better in general. (Figure: a typical cartoon showing "error compounding" in RL.)
Error compounding over the horizon magnifies a small error into a big one.
Part 2: Deep RL Algorithms to Address Distribution Shift
Addressing Distribution Shift via Pessimism
Q(s, a) ← r(s, a) + γ E_{a'∼π_φ(a'|s')}[Q(s', a')]
π_φ := argmax_φ  E_{a∼π_φ(a|s)}[Q(s, a)]   s.t.   D(π_φ(a|s), π_β(a|s)) ≤ ε
“Policy Constraint”
Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
Out-of-distribution action values are no longer used for the backup
E_{a'∼π_φ(a'|s')}[Q(s', a')]
Hence, all Q-values used for the backup are also Q-values that get trained, leading to better learning.
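As one concrete (hypothetical) instantiation, the constraint is often implemented as a Lagrangian penalty on the actor loss, e.g., a KL term toward an estimated behavior policy. In the sketch below, `pi`, `pi_beta`, and `q_net` are placeholder PyTorch modules, not any specific paper's architecture:

```python
import torch
import torch.distributions as D

def constrained_actor_loss(pi, pi_beta, q_net, states, alpha=1.0):
    """Lagrangian form of:  max_pi E_{a~pi}[Q(s, a)]  s.t.  KL(pi, pi_beta) <= eps.

    `pi(states)` and `pi_beta(states)` are assumed to return torch Normal distributions;
    `q_net(s, a)` returns Q(s, a). All three are hypothetical stand-ins.
    """
    dist = pi(states)
    actions = dist.rsample()                             # reparameterized sample a ~ pi(.|s)
    q_values = q_net(states, actions).squeeze(-1)        # E_{a~pi}[Q(s, a)] term
    kl = D.kl_divergence(dist, pi_beta(states)).sum(-1)  # KL term, summed over action dims
    return (-q_values + alpha * kl).mean()               # minimize => maximize Q, penalize KL
```

In practice, α is either fixed or adapted via dual gradient descent so that the constraint is satisfied on average.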
Different Types of Policy Constraints
Several Ways of Implementing Them:
- Support matching (Kumar et al. 2019, Laroche et al. 2019, Wu et al. 2019)
- Distribution matching (Peng et al. 2019, Fujimoto et al. 2019, Jaques et al. 2019)
- State-marginal constraints (Nachum & Dai 2020)
- Implicit /closed-form distribution constraints (Peng et al. 2019, Nair et al. 2020, Wang et al. 2020)
- D(π_φ, π_β) = MMD(π_φ, π_β)
- D(π_φ, π_β) = D_KL(π_φ, π_β)
- D(π_φ, π_β) = D(d^{π_φ}(s, a), d^{π_β}(s, a))
Different types of constraints lead to different solutions, giving rise to a wide range of offline RL algorithms (a small MMD sketch appears below).
π_φ := argmax_φ  E_{a∼π_φ(a|s)}[Q(s, a)]   s.t.   D(π_φ(a|s), π_β(a|s)) ≤ ε
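For instance, the MMD option can be estimated directly from sampled actions (this is how support-matching methods such as BEAR operationalize it); the RBF kernel and bandwidth below are illustrative choices, not the exact ones from any particular paper:

```python
import torch

def mmd_squared(x, y, sigma=1.0):
    """Sample-based MMD^2 between two sets of actions (n x d tensors), RBF kernel."""
    def rbf(a, b):
        d = ((a.unsqueeze(1) - b.unsqueeze(0)) ** 2).sum(-1)   # pairwise squared distances
        return torch.exp(-d / (2 * sigma ** 2))
    return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()

# Toy usage: actions sampled from the learned policy vs. actions from the dataset.
pi_actions = torch.randn(8, 2)
data_actions = torch.randn(8, 2) + 0.5
print(mmd_squared(pi_actions, data_actions).item())
```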
Which constraint should I use?
- Kumar. Data-Driven Deep Reinforcement Learning. BAIR blog, December 2019.
Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
- Technically, support constraints are less restrictive
- Imagine a case where the behavior policy takes all actions uniformly.
- Constraining to the behavior policy via distribution-matching may lead to highly
stochastic policies that are not optimal.
- However, constraining only the support still keeps the policy on in-distribution actions, while leaving it free to optimize the RL objective within that support.
Before answering this question, let's see how using a policy constraint affects the optimal solution.
max_π  E_π[ Σ_t γ^t r(s_t, a_t) ]  −  α D(π(a|s), π_β(a|s))
Adding pessimism alters the best attainable performance. Thus we want the constraint to be as unrestrictive as possible, while still preventing the "badness" (out-of-distribution actions).
Which constraint should I use?
Support constraints are better in theory, but there is not much difference in practice; it often depends on how well the policy constraint methods can be tuned.
Policy Constraint Methods, Empirically
Wu, Tucker, Nachum. Behavior Regularized Offline Reinforcement Learning. arXiv 2019. Fu, Kumar, Nachum, Tucker, Levine. D4RL: Datasets for Deep Data-Driven RL. arXiv 2020.
(Plot: behavior cloning vs. naive off-policy RL vs. policy constraint methods: BCQ, BEAR, and BRAC (with KL).)
Policy constraint methods are better than BC, and different choices of D matter.
How do these methods perform on harder tasks?
Dataset collected from a mixture of random and “mediocre” policies
Are policy constraint methods sufficient?
Require estimation of the behavior policy
π_φ := argmax_φ  E_{a∼π_φ(a|s)}[Q(s, a)]   s.t.   D(π_φ(a|s), π_β(a|s)) ≤ ε,   with π_β estimated from data
Nair, Dalal, Gupta, Levine. Accelerating Online RL with Offline Datasets. arXiv 2020. Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline RL. NeurIPS 2020. Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020. Ghasemipour, Schurrmanns, Gu. EmaQ: Expected Max Q-Learning. arXiv 2020.
- Often tend to be too conservative: if we know that all actions at some state have 0 reward, we do not need to constrain the policy there, since we cannot do worse.
- If the behavior policy is estimated incorrectly (e.g., when it does not fit the function class), policy constraint methods can fail dramatically (e.g., on AntMaze).
Can we do better?
Let’s revisit the motivating example
(and take a slightly different perspective on the problem)
(Plot: "how well it thinks it does" vs. "how well it does".)
Can we directly tackle false over-estimation, instead of indirect fixes that avoid out-of-distribution actions? Not all out-of-distribution actions are bad: they are only harmful if they affect the policy (i.e., when their values are over-estimated).
Can we devise methods that learn lower-bounds on the policy value/ performance?
Yes! Two ways: model-based and model-free
A Framework for Conservative Model-Based RL
Janner, Fu, Zhang, Levine. When to Trust Your Model? Model-Based Policy Optimization. NeurIPS 2019. Yu, Thomas, Yu, Ermon, Zou, Levine, Finn, Ma. MOPO: Model-based Offline Policy Optimization. NeurIPS 2020. Kidambi, Rajeswaran, Netrapalli, Joachims. MOReL: Model-Based Offline Reinforcement Learning. NeurIPS 2020.
This is the new bit!
- 1. Learn a dynamics model P(s'|s, a) from the offline data.
- 2. Learn a conservative/"pessimistic" estimate of the reward function.
- 3. Perform policy optimization (e.g., via planning or Dyna) with the learned model and the pessimistic reward function.
(Figure annotations: "keep unaltered reward" vs. "make rewards pessimistic".)
Model-Based Offline RL Methods
Yu, Thomas, Yu, Ermon, Zou, Levine, Finn, Ma. MOPO: Model-based Offline Policy Optimization. NeurIPS 2020. Kidambi, Rajeswaran, Netrapalli, Joachims. MOReL: Model-Based Offline Reinforcement Learning. NeurIPS 2020.
- MOPO (Yu et al. 2020): penalize the reward with an uncertainty estimate derived from the covariance of an ensemble of dynamics models.
- MOReL (Kidambi et al. 2020): set r̃(s, a) = −R_max wherever the disagreement in an ensemble of dynamics models is large (i.e., in "unknown" regions).
- Both then perform MBPO-style (Dyna) planning / policy optimization with the learned model.
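A sketch of the common ingredient in both methods: penalizing (or truncating) rewards where an ensemble of learned dynamics models disagrees. The stub ensemble and the specific uncertainty measure below are placeholders; MOPO and MOReL each use their own, more careful formulations:

```python
import numpy as np

def pessimistic_reward(r, s, a, ensemble, lam=1.0, r_max=1.0, threshold=0.5):
    """Penalize the reward where an ensemble of dynamics models disagrees.

    `ensemble` is a hypothetical list of models, each with .predict(s, a) -> next-state mean.
    """
    preds = np.stack([m.predict(s, a) for m in ensemble])   # (num_models, state_dim)
    disagreement = preds.std(axis=0).max()                   # one simple uncertainty proxy

    r_mopo = r - lam * disagreement                          # MOPO-style reward penalty
    r_morel = r if disagreement < threshold else -r_max      # MOReL-style "unknown region"
    return r_mopo, r_morel

class StubModel:                                             # stand-in for a learned model
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
    def predict(self, s, a):
        return s + 0.1 * self.rng.standard_normal(len(s))

print(pessimistic_reward(1.0, np.zeros(3), np.zeros(2), [StubModel(i) for i in range(5)]))
```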
Model-Based Offline RL, Empirically
Yu, Thomas, Yu, Ermon, Zou, Levine, Finn, Ma. MOPO: Model-based Offline Policy Optimization. NeurIPS 2020.
- Model-based methods without any form of correction can work well on datasets with "broad" coverage.
- They are generally better than policy constraint methods.
- Conservatism helps in situations with narrow datasets (see MBPO vs. MOPO on med-expert).
Learning Lower-Bounded Q-values
Conservative Q-Learning (CQL) Algorithm: since learned Q-values (our belief about the policy's value) are over-estimated, let's make them provably lower-bound the true value.

Q̂^π_CQL := argmin_Q max_µ  E_{s∼D, a∼µ(a|s)}[Q(s, a)]   ("minimize big Q-values")
            + (1/(2α)) E_{s,a,s'∼D} [ (Q(s, a) − (r(s, a) + γ E_{a'∼π_φ(a'|s')}[Q̄(s', a')]))^2 ]   ("standard Bellman error")

Resulting guarantee:  Q̂^π_CQL(s, a) ≤ Q^π(s, a)   for all s ∈ D and all a
Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline RL. NeurIPS 2020.
CQL-v1
A Tighter Lower Bound
Q̂^π_CQL := argmin_Q max_µ  E_{s∼D, a∼µ(a|s)}[Q(s, a)] − E_{s∼D, a∼D(a|s)}[Q(s, a)]   ("minimize big Q-values", "maximize data Q-values")
            + (1/(2α)) E_{s,a,s'∼D} [ (Q(s, a) − (r(s, a) + γ E_{a'∼π_φ(a'|s')}[Q̄(s', a')]))^2 ]   ("standard Bellman error")

CQL-v2

With the added data-maximization term, the pointwise bound Q̂^π_CQL(s, a) ≤ Q^π(s, a) for all s ∈ D, a need no longer hold, but the value estimate is still a lower bound:
V̂^π_CQL(s) := E_{a∼π_k}[Q̂^π_CQL(s, a)] ≤ V^π(s)   for all s ∈ D
Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline RL. NeurIPS 2020.
Practical CQL Algorithm
Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline RL. NeurIPS 2020.
CQL(H)
Only change on top of standard deep Q-learning.
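For discrete actions, the CQL(H) regularizer is roughly a one-line addition to the standard DQN loss. The sketch below is illustrative: `q_net`, `target_q_net`, and the `batch` layout are placeholders, and the weighting convention (α on the regularizer) may differ from the authors' implementation:

```python
import torch
import torch.nn.functional as F

def cql_h_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    """Standard DQN loss plus the CQL(H) regularizer, for discrete actions.

    `q_net` / `target_q_net` map states to per-action Q-values; `batch` is a dict of
    tensors ("s", "a", "r", "s2", "done") -- all hypothetical stand-ins.
    """
    s, a, r, s2, done = (batch[k] for k in ("s", "a", "r", "s2", "done"))

    q_all = q_net(s)                                       # (B, |A|)
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a) at dataset actions

    with torch.no_grad():                                  # standard target computation
        target = r + gamma * (1 - done) * target_q_net(s2).max(dim=1).values
    bellman_error = F.mse_loss(q_sa, target)

    # CQL(H): push down logsumexp_a Q(s, a), push up Q at dataset actions.
    cql_term = (torch.logsumexp(q_all, dim=1) - q_sa).mean()

    return bellman_error + alpha * cql_term
```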
CQL, Empirically
(Plot: learned policy value minus actual policy value, i.e., how over-optimistic each method is.)
Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline RL. NeurIPS 2020.
(Comparison against policy constraint methods, naive off-policy RL, and behavior cloning.)
On the "stitching" tasks, CQL is the only method to outperform BC.
CQL is better than the other methods overall, though not the best in every single case.
Offline RL Algorithms covered so far
- Policy Constraint Methods:
  - Support constraints
  - Distribution constraints
  - State-marginal constraints
  (Work well, but are conservative and require behavior policy estimation.)
- Learning lower-bounded policy values:
  - Model-based algorithms
  - Direct Q-function penalties (CQL)
  (Generally perform better, since they are less conservative and do not require behavior policy estimation.)
Next, we will cover some related problems, discuss how we should evaluate offline RL methods, and finally, discuss some practical examples.
A Related Problem: Off-Policy Evaluation
Problem statement: rather than returning a good policy, estimate the value of a given policy without running it in the environment.
Given a policy π and a dataset D, estimate V^π(s), e.g., to answer: is V^{π_1}(s) > V^{π_2}(s)?
What can OPE be used for in offline RL? Model selection: deciding which learned policy is good. Why do we need model selection in offline RL? As in supervised learning, excessive training on the same offline dataset can produce poor solutions; if we can rank these solutions using OPE, we can obtain good offline performance.
Irpan, Rao, Bousmalis, Harris, Ibarz, Levine. Off-Policy Evaluation via Off-Policy Classification. NeurIPS 2019. Gottesman, Futoma, Liu, Parbhoo, Celi, Brunskill, Doshi-Velez. Interpretable OPE in RL by Highlighting Influential Transitions. ICML 2020.
A quick glance on some OPE methods
- Importance Sampling (similar to the off-policy policy gradient): a sum of importance-weighted returns over the dataset; high variance (a small sketch appears after this list).
- Marginalized Importance Sampling
(see Nachum et al. 2019 (DualDICE) and Uehera and Jiang, 2019.)
J(π_θ) = E_{s,a ∼ d^π(s,a)}[r(s, a)] = E_{s,a ∼ D} [ (d^π(s, a) / (D(s) D(a|s))) · r(s, a) ]   (estimate this ratio)
- Fitted Q-Evaluation
Q^π(s, a) = r(s, a) + γ E_{a'∼π(a'|s')}[Q^π(s', a')]
A lot of prior work on this! OPE has turned out to be quite challenging with deep network policies.
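As promised above, a minimal sketch of the vanilla per-trajectory importance-sampling estimator (the trajectory format and the policy probability functions are hypothetical):

```python
import numpy as np

def per_trajectory_is(trajectories, pi_e, pi_b, gamma=0.99):
    """Vanilla per-trajectory importance-sampling estimate of J(pi_e).

    `trajectories` is a list of [(s, a, r), ...]; `pi_e(a, s)` and `pi_b(a, s)` return
    action probabilities under the evaluation and behavior policies (all hypothetical).
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(a, s) / pi_b(a, s)   # product of per-step ratios -> high variance
            ret += gamma ** t * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```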
How should we evaluate offline RL methods?
Let's revisit the main motivation for offline RL: use real data collected from various different sources (e.g., human demonstrations, runs of hard-coded policies, etc.) to train good policies. We can train directly on real data, but how do we test the policy? Since testing a policy completely offline is hard (unless we actually run the policy in the real domain), we want benchmarks.
What properties should a benchmark for offline RL have?
- 1. It should be realistic: should mimic what we would see in the real-world
- 2. Should provide a method to compare methods in a standardized way, under the
actual evaluation scheme
Most evaluation so far has used data from RL policies or replay buffers, which tends to be substantially easier than, and different from, "real-world" scenarios. Desirable dataset properties: (1) non-representable behavior policies, (2) narrow distributions, (3) undirected/multi-task behavior, (4) visual perception, (5) human demos.
Fu, Kumar, Nachum, Tucker, Levine. D4RL: Datasets for Deep Data-Driven RL. arXiv 2020.
D4RL benchmark
Standardized Benchmark for Offline RL
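If the benchmark is installed, loading a dataset looks roughly like the following. This assumes the `d4rl` package and its Gym wrappers; the environment name and dictionary keys follow the D4RL repository, but check the current README since the exact API may differ:

```python
import gym
import d4rl  # registers the offline environments with gym (assumed installed)

env = gym.make("halfcheetah-medium-v2")    # one dataset from the D4RL suite
dataset = env.get_dataset()                # dict of numpy arrays

print(dataset["observations"].shape,       # (N, obs_dim)
      dataset["actions"].shape,            # (N, act_dim)
      dataset["rewards"].shape)            # (N,)

# Convenience helper that also provides next_observations for Q-learning methods:
qdata = d4rl.qlearning_dataset(env)
print(sorted(qdata.keys()))
```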
Does Offline RL Work in Practice?
Offline RL for Dialog
Can we learn effective dialog policies that understand the implicit human preferences in dialog via offline RL?
Jaques et al. Way Off-Policy Batch Deep RL of Implicit Human Preferences in Dialog. EMNLP 2020.
Offline RL from Unlabelled Robotic Data
Can we learn effective policies from unlabelled/general-purpose robotic data generated from hardcoded policies via offline RL methods such as CQL?
Singh, Yu, Yang, Zhang, Kumar, Levine. Chaining Behaviors via Model-Free Offline RL. CoRL 2020.
Suggested Readings
- Summary/ Tutorial: Levine, Kumar, Tucker, Fu (2020). Offline Reinforcement Learning:
Tutorial, Survey and Perspectives on Open Problems.
- Datasets/Benchmarks:
- Fu, Kumar, Nachum, Tucker, Levine (2020). D4RL: Datasets for Deep Data-Driven RL.
- Gulcehre et al. (2020). RL Unplugged: Benchmarks for Offline RL.
- Algorithms:
- Classic algorithms and policy constraints: see the tutorial (Levine et al. 2020) and references in prior slides (a lot of work has been done in this area).
- Conservative Q-Learning Algorithms: Kumar, Zhou, Tucker, Levine (2020). Conservative
Q-Learning for Offline RL.
- Model-based algorithms:
- Yu et al. (2020). MOPO: Model-based Offline Policy Optimization.
- Kidambi et al. (2020). MOReL: Model-based Offline Reinforcement Learning.
- Offline RL on Atari: Agarwal et al. (2020). An Optimistic Perspective on Offline RL.
- Several new papers on arXiv and OpenReview, check them out!
- Blog Posts (Summaries):
- Kumar. Data-Driven Deep Reinforcement Learning. BAIR blog, December 2019.
- Agarwal and Norouzi. An Optimistic Perspective on Offline Reinforcement Learning. Google
AI Blog, April 2020.