

SLIDE 1

CS 287 Lecture 20 (Fall 2019) Model-based RL

Pieter Abbeel UC Berkeley EECS

SLIDE 2

Outline

• Model-based RL
• Ensemble Methods
• Model-Ensemble Trust Region Policy Optimization
• Model-based RL via Meta Policy Optimization
• Asynchronous Model-based RL
• Vision-based Model-based RL

SLIDE 3

Reinforcement Learning

[Figure source: Sutton & Barto, 1998]

John Schulman & Pieter Abbeel – OpenAI + UC Berkeley


SLIDE 4

“Algorithm”: Model-Based RL

• For iter = 1, 2, …
  • Collect data under the current policy
  • Learn a dynamics model from past data
  • Improve the policy by using the dynamics model
    • e.g., SVG(k) requires a dynamics model, but one can also run TRPO/A3C in the learned simulator
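To make the loop concrete, here is a minimal sketch in Python; the helpers `collect_rollouts`, `fit_dynamics_model`, and `improve_policy` are illustrative placeholders rather than any particular library's API.

```python
# Minimal sketch of the generic model-based RL loop above.
# All helper functions are placeholders, not a specific library API.

def model_based_rl(env, policy, num_iters=100, rollouts_per_iter=10):
    data = []  # buffer of (s, a, s_next) transitions from the real environment
    for it in range(num_iters):
        # 1. Collect data under the current policy
        data += collect_rollouts(env, policy, num_rollouts=rollouts_per_iter)

        # 2. Learn a dynamics model f(s, a) -> s_next from all past data
        #    (e.g., a neural network fit by supervised regression)
        model = fit_dynamics_model(data)

        # 3. Improve the policy using the learned model, e.g. by running
        #    TRPO/A3C on rollouts simulated inside the model
        policy = improve_policy(policy, model)
    return policy
```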

SLIDE 5

Why Model-Based RL?

• Anticipated data-efficiency
  • Getting a model out of the data might allow for more significant policy updates than just a policy gradient step
• Learning a model
  • Re-usable for other tasks [assuming it is general enough]

SLIDE 6

“Algorithm”: Model-Based RL

• For iter = 1, 2, …
  • Collect data under the current policy
  • Learn a dynamics model from past data
  • Improve the policy by using the dynamics model

Anticipated benefit? Much better sample efficiency.
So why is it not used all the time?
  • Training instability → ME-TRPO
  • Not achieving the same asymptotic performance as model-free methods → MB-MPO

SLIDE 7

Overfitting in Model-based RL

• Standard overfitting (in supervised learning)
  • Neural network performs well on training data, but poorly on test data
  • E.g., on prediction of s_next from (s, a)
• New overfitting challenge in model-based RL
  • Policy optimization tends to exploit regions where insufficient data is available to train the model, leading to catastrophic failures
  • = “model-bias” (Deisenroth & Rasmussen, 2011; Schneider, 1997; Atkeson & Santamaria, 1997)
• Proposed fix: Model-Ensemble Trust Region Policy Optimization (ME-TRPO)

SLIDE 8

Model-Ensemble Trust-Region Policy Optimization

[Kurutach, Clavera, Duan, Tamar, Abbeel, ICLR 2018]
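In outline, ME-TRPO fits an ensemble of dynamics models and improves the policy with TRPO on rollouts imagined under randomly chosen ensemble members, stopping once the policy no longer improves on enough of the models. A minimal sketch, assuming placeholder helpers (`collect_rollouts`, `fit_dynamics_model`, `trpo_update`, etc.); this is not the authors' code:

```python
import random

def me_trpo(env, policy, num_models=5, num_iters=100):
    """Sketch of Model-Ensemble TRPO (Kurutach et al., 2018); helpers are placeholders."""
    data = collect_rollouts(env, policy)  # initial real-world data
    for it in range(num_iters):
        # Fit an ensemble of dynamics models on all real data gathered so far,
        # e.g. with different initializations / bootstrapped subsets of the data.
        models = [fit_dynamics_model(bootstrap(data)) for _ in range(num_models)]

        # Improve the policy on imagined rollouts; each simulated step queries a
        # randomly chosen model, so the policy cannot exploit one model's errors.
        while True:
            imagined = simulate_rollouts(policy, step_model=lambda s, a: random.choice(models)(s, a))
            policy = trpo_update(policy, imagined)

            # Validation: stop once the policy improves on too few of the models.
            still_improving = [policy_improves(policy, m) for m in models]
            if sum(still_improving) / num_models < 0.7:  # threshold is illustrative
                break

        # Collect fresh real-world data with the improved policy.
        data += collect_rollouts(env, policy)
    return policy
```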

SLIDE 9

ME-TRPO Evaluation

• Environments:

[Kurutach, Clavera, Duan, Tamar, Abbeel, ICLR 2018]

SLIDE 10

ME-TRPO Evaluation

• Comparison with state of the art

[Kurutach, Clavera, Duan, Tamar, Abbeel, ICLR 2018]

SLIDE 11

ME-TRPO -- Ablation

TRPO vs. BPTT in standard model-based RL

[Kurutach, Clavera, Duan, Tamar, Abbeel, ICLR 2018]

SLIDE 12

ME-TRPO -- Ablation

Number of learned dynamics models in the ensemble

[Kurutach, Clavera, Duan, Tamar, Abbeel, ICLR 2018]

SLIDE 13

“Algorithm”: Model-Based RL

• For iter = 1, 2, …
  • Collect data under the current policy
  • Learn a dynamics model from past data
  • Improve the policy by using the dynamics model

Anticipated benefit? Much better sample efficiency.
So why is it not used all the time?
  • Training instability → ME-TRPO
  • Not achieving the same asymptotic performance as model-free methods → MB-MPO

SLIDE 14

Model-based RL Asymptotic Performance

• Because the learned (ensemble of) model is imperfect:
  • The resulting policy is good in simulation(s), but not optimal in the real world
• Attempted Fix 1: learn a better dynamics model
  • Such efforts have so far proven insufficient
• Attempted Fix 2: model-based RL via meta-policy optimization (MB-MPO)
• Key idea:
  • Learn an ensemble of models representative of generally how the real world works
  • Learn an ***adaptive policy*** that can quickly adapt to any of the learned models
  • Such an adaptive policy can then quickly adapt to how the real world works

SLIDE 15

Model-Based RL via Meta Policy Optimization (MB-MPO)

• For iter = 1, 2, …
  • Collect data under the current adaptive policies
  • Learn an ENSEMBLE of K simulators from all past data
  • Meta-policy optimization over the ENSEMBLE
    • → new meta-policy π_θ
    • → new adaptive policies

[Clavera*, Rothfuss*, Schulman, Fujita, Asfour, Abbeel, CoRL 2018]

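A minimal sketch of this loop in the spirit of the paper's MAML-style inner/outer update; all helpers (`fit_dynamics_model`, `inner_adapt`, `outer_update`, ...) are illustrative placeholders, not the authors' implementation.

```python
def mb_mpo(env, meta_policy, K=5, num_iters=100, inner_lr=0.01):
    """Sketch of MB-MPO (Clavera et al., 2018); helper names are placeholders."""
    data = collect_rollouts(env, meta_policy)  # initial real-world data
    for it in range(num_iters):
        # Learn an ensemble of K dynamics models (simulators) from all past real data.
        models = [fit_dynamics_model(bootstrap(data)) for _ in range(K)]

        # Meta-policy optimization over the ensemble (MAML-style): train the
        # meta-policy so that one gradient step of adaptation inside model k
        # yields an adapted policy that performs well in model k.
        adapted = []
        for model in models:
            inner_rollouts = simulate_rollouts(model, meta_policy)
            adapted.append(inner_adapt(meta_policy, inner_rollouts, lr=inner_lr))
        meta_policy = outer_update(meta_policy, models, adapted)  # e.g. TRPO/PPO on post-adaptation returns

        # Collect fresh real-world data with the adapted policies (one per model).
        for adapted_policy in adapted:
            data += collect_rollouts(env, adapted_policy)
    return meta_policy
```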

SLIDE 16

Model-Based RL via Meta-Policy Optimization (MB-MPO)

[Clavera*, Rothfuss*, Schulman, Fujita, Asfour, Abbeel, CoRL 2018]

SLIDE 17

MB-MPO Evaluation

[Clavera*, Rothfuss*, Schulman, Fujita, Asfour, Abbeel, CoRL 2018]


SLIDE 18

MB-MPO Evaluation

[Clavera*, Rothfuss*, Schulman, Fujita, Asfour, Abbeel, CoRL 2018]


SLIDE 19

MB-MPO Evaluation

[Clavera*, Rothfuss*, Schulman, Fujita, Asfour, Abbeel, CoRL 2018]


SLIDE 20

MB-MPO Evaluation

• Comparison with state-of-the-art model-free methods

[Clavera*, Rothfuss*, Schulman, Fujita, Asfour, Abbeel, CoRL 2018]


SLIDE 21

MB-MPO Evaluation

• Comparison with state-of-the-art model-based methods

[Clavera*, Rothfuss*, Schulman, Fujita, Asfour, Abbeel, CoRL 2018]


SLIDE 22

SLIDE 23

So are we done?

• No…
  • Not real-time, exacerbated by the need for extensive hyperparameter tuning
  • Limited to short horizons
  • From state (though some results have started to happen from images)


SLIDE 25

[Figure: synchronous model-based RL loop: Collect Data → Data Buffer → Learn Model → Improve Policy → act in the Environment]

SLIDE 26

[Figure: asynchronous framework: Data Collection, Model Learning, and Policy Improvement workers run in parallel, sharing a Data Buffer, Model Parameters, and Policy Parameters while the Data Collection worker interacts with the Environment]
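A minimal sketch of such an asynchronous setup using Python threads and shared state; this illustrates the idea only (initialization, termination, and the actual learners are omitted), and every helper name is a placeholder rather than the authors' implementation.

```python
import threading

class SharedState:
    """State the three workers read and write asynchronously."""
    def __init__(self):
        self.lock = threading.Lock()
        self.data_buffer = []      # real-world transitions
        self.model_params = None   # latest dynamics-model parameters
        self.policy_params = None  # latest policy parameters

def data_collection_worker(env, shared):
    while True:  # termination condition omitted
        with shared.lock:
            policy = make_policy(shared.policy_params)       # placeholder
        rollout = collect_rollout(env, policy)               # act in the real environment
        with shared.lock:
            shared.data_buffer.extend(rollout)

def model_learning_worker(shared):
    while True:
        with shared.lock:
            data = list(shared.data_buffer)
        params = fit_dynamics_model(data)                    # placeholder supervised fit
        with shared.lock:
            shared.model_params = params

def policy_improvement_worker(shared):
    while True:
        with shared.lock:
            model = make_model(shared.model_params)
            policy = make_policy(shared.policy_params)
        new_params = improve_policy(policy, model)           # e.g. TRPO/PPO on imagined rollouts
        with shared.lock:
            shared.policy_params = new_params

# Usage sketch: start the three workers as daemon threads sharing one SharedState.
# shared = SharedState()
# for target, args in [(data_collection_worker, (env, shared)),
#                      (model_learning_worker, (shared,)),
#                      (policy_improvement_worker, (shared,))]:
#     threading.Thread(target=target, args=args, daemon=True).start()
```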

SLIDE 27

Questions to be answered

  • 1. Performance?
  • 2. Effect on policy regularization?
  • 3. Effect on data exploration?
  • 4. Robustness to hyperparameters?
  • 5. Robustness to data collection frequency?

SLIDE 32

Experiments

  • 1. How does the asynch-framework perform?
    • Asynch: ME-TRPO, ME-PPO, MB-MPO
    • Baselines: ME-TRPO, ME-PPO, MB-MPO; TRPO, PPO
    • a. Average Return vs. Time
    • b. Average Return vs. Sample complexity (Timesteps)
SLIDE 33

Performance Comparison: Wall-Clock Time

SLIDE 34

Performance Comparison: Sample Complexity

SLIDE 35

Experiments

  • 1. Performance comparison
  • 2. Are there benefits of being asynchronous other than speed?
    • a. Policy learning regularization
    • b. Exploration in data collection
SLIDE 36

Policy Learning Regularization

[Figure: scheduling of Model Learning, Policy Improvement, and Data Collection under the Synchronous vs. Partially Asynchronous settings]

SLIDE 37

Policy Learning Regularization

SLIDE 38

Improved Exploration for Data Collection

[Figure: scheduling of Model Learning, Policy Improvement, and Data Collection under the Synchronous vs. Partially Asynchronous settings]

SLIDE 39

Improved Exploration for Data Collection

SLIDE 40

Experiments

  • 1. Performance comparison
  • 2. Asynchronous effects
  • 3. Is the asynch-framework robust to data collection frequency?
SLIDE 41

Ablations: Sampling Speed

SLIDE 42

Experiments

  • 1. Performance comparison
  • 2. Asynchronous effects
  • 3. Ablations
  • 4. Does the asynch-framework work in real robotics tasks?
    • a. Reaching a position
    • b. Inserting a unique shape into its matching hole in a box
    • c. Stacking a modular block onto a fixed base
SLIDE 43

Real Robot Tasks: Reaching Position

SLIDE 44
SLIDE 45

Real Robot Tasks: Matching Shape

SLIDE 46
SLIDE 47

Real Robot Tasks: Stacking Lego

SLIDE 48
SLIDE 49

Summary of Asynchronous Model-based RL

  • Problem
    • Need fast and data efficient methods for robotic tasks
  • Contributions
    • General asynchronous model-based framework
    • Wall-clock time speed-up
    • Sample efficiency
    • Effect on policy regularization & data exploration
    • Effective on real robots

SLIDE 50

Outline

• Model-based RL
• Ensemble Methods
• Model-Ensemble Trust Region Policy Optimization
• Model-based RL via Meta Policy Optimization
• Asynchronous Model-based RL
• Vision-based Model-based RL

SLIDES 51-55

World Models

SLIDES 56-57

Embed to Control

SLIDE 58

SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning
Marvin Zhang*, Sharad Vikram*, Laura Smith, Pieter Abbeel, Matthew Johnson, Sergey Levine

• Collect N initial random rollouts
• Learn representation and latent dynamics
• Infer latent dynamics given observed data
• Update policy given latent dynamics
• Collect new data from updated policy
• (Optionally) fine-tune representation

https://goo.gl/AJKocL

SLIDE 59

Deep Spatial Autoencoders

Deep Spatial Autoencoders for Visuomotor Learning, Finn, Tan, Duan, Darrell, Levine, Abbeel, 2016 (https://arxiv.org/abs/1509.06113)
■ Train a deep spatial autoencoder
■ Model-based RL through iLQR in the latent space
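For reference, a minimal sketch of the spatial-softmax feature layer at the heart of a deep spatial autoencoder, which turns convolutional activations into expected 2D feature coordinates usable as a low-dimensional state; this is a generic PyTorch implementation of the idea, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    """Maps a (B, C, H, W) activation map to (B, 2*C) expected (x, y) feature points."""
    def __init__(self, height, width):
        super().__init__()
        # Normalized pixel-coordinate grids in [-1, 1].
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, height),
                                torch.linspace(-1, 1, width), indexing="ij")
        self.register_buffer("xs", xs.reshape(-1))
        self.register_buffer("ys", ys.reshape(-1))

    def forward(self, features):
        b, c, h, w = features.shape
        # Per-channel spatial distribution over pixel locations.
        probs = F.softmax(features.reshape(b, c, h * w), dim=-1)
        x = (probs * self.xs).sum(dim=-1)  # expected x coordinate per channel
        y = (probs * self.ys).sum(dim=-1)  # expected y coordinate per channel
        return torch.cat([x, y], dim=-1)   # low-dimensional state, e.g. for iLQR
```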

SLIDE 60

Robotic Priors / PVEs

PVEs: Position-Velocity Encoders for Unsupervised Learning of Structured State Representations, Rico Jonschkowski, Roland Hafner, Jonathan Scholz, and Martin Riedmiller (https://arxiv.org/pdf/1705.09805.pdf)
■ Learn an embedding without reconstruction

SLIDE 61

Disentangled Representation Learning Agent (DARLA)

DARLA: Improving Zero-Shot Transfer in Reinforcement Learning, Irina Higgins, Arka Pal, Andrei A. Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, Alexander Lerchner (https://arxiv.org/abs/1707.08475)

SLIDE 62

DeepMind Lab Transfer

DARLA vs. DQN baseline
[Figure: train vs. transfer performance of DQN and DARLA]

SLIDE 63

Causal InfoGAN

Learning Plannable Representations with Causal InfoGAN, Thanard Kurutach, Aviv Tamar, Ge Yang, Stuart Russell, Pieter Abbeel (https://arxiv.org/pdf/1807.09341.pdf)

SLIDE 64

PlaNet

Learning Latent Dynamics for Planning from Pixels, Danijar Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, J. Davidson (https://arxiv.org/pdf/1811.04551.pdf)
■ Learn a latent-space dynamics model
■ Multi-step prediction
■ Planning in latent space

SLIDE 65

Visual Foresight

Deep Visual Foresight for Planning Robot Motion, Finn and Levine, ICRA 2017 (http://arxiv.org/abs/1610.00696)
Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control, Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, Sergey Levine (https://arxiv.org/abs/1812.00568, https://bair.berkeley.edu/blog/2018/11/30/visual-rl/)
  • Video prediction + Cross-Entropy Method (CEM) for MPC
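To make the planning step concrete, here is a minimal sketch of sampling-based MPC with the cross-entropy method on top of a learned visual predictor; `predict_video` and `cost_fn` are hypothetical stand-ins for the learned model and the task cost, not the Visual Foresight codebase.

```python
import numpy as np

def cem_mpc_action(predict_video, cost_fn, obs, horizon=10, action_dim=4,
                   pop_size=200, elite_frac=0.1, cem_iters=3):
    """One MPC step: plan an action sequence with CEM, return its first action.

    predict_video(obs, actions) -> predicted future observations (learned model; placeholder)
    cost_fn(predicted_obs)      -> scalar cost of a candidate sequence (task-specific; placeholder)
    """
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    n_elite = max(1, int(pop_size * elite_frac))

    for _ in range(cem_iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = mean + std * np.random.randn(pop_size, horizon, action_dim)
        # Score each sequence by rolling it out through the learned video predictor.
        costs = np.array([cost_fn(predict_video(obs, a)) for a in samples])
        # Refit the Gaussian to the lowest-cost (elite) sequences.
        elite = samples[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6

    return mean[0]  # execute only the first action, then replan at the next step
```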

SLIDE 66

Forward + Inverse Dynamics Models

Learning to Poke by Poking: Experiential Learning of Intuitive Physics, Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, Sergey Levine (https://arxiv.org/abs/1606.07419)
■ Learning a forward model in latent space
■ BUT: couldn’t the latent features always be zero?
■ SOLUTION: require the features from t and t+1 to be sufficient to predict a_t
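A minimal PyTorch-style sketch of that joint forward/inverse objective; the architecture, continuous-action losses, and weighting are illustrative assumptions, not the paper's exact setup (which, among other things, uses discretized actions).

```python
import torch
import torch.nn as nn

class ForwardInverseModel(nn.Module):
    """Latent forward model regularized by an inverse model, so the learned
    features cannot collapse to a constant (e.g., all zeros)."""
    def __init__(self, obs_dim, action_dim, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Forward model: (phi(s_t), a_t) -> phi(s_{t+1})
        self.forward_model = nn.Sequential(nn.Linear(latent_dim + action_dim, 128),
                                           nn.ReLU(), nn.Linear(128, latent_dim))
        # Inverse model: (phi(s_t), phi(s_{t+1})) -> a_t
        self.inverse_model = nn.Sequential(nn.Linear(2 * latent_dim, 128),
                                           nn.ReLU(), nn.Linear(128, action_dim))

    def loss(self, s_t, a_t, s_next, beta=0.5):
        z_t, z_next = self.encoder(s_t), self.encoder(s_next)
        # Forward loss: predict the next latent state.
        z_next_pred = self.forward_model(torch.cat([z_t, a_t], dim=-1))
        forward_loss = ((z_next_pred - z_next) ** 2).mean()
        # Inverse loss: features at t and t+1 must suffice to predict a_t,
        # which rules out the degenerate all-zero embedding.
        a_pred = self.inverse_model(torch.cat([z_t, z_next], dim=-1))
        inverse_loss = ((a_pred - a_t) ** 2).mean()
        return beta * forward_loss + (1 - beta) * inverse_loss
```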


SLIDE 68

Predictron

The Predictron: End-To-End Learning and Planning, David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris (https://arxiv.org/pdf/1612.08810.pdf)

SLIDE 69

Successor Features

Successor Features for Transfer in Reinforcement Learning, André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado van Hasselt, David Silver (https://arxiv.org/abs/1606.05312)

SLIDE 70

Kahn et al.

Composable Action-Conditioned Predictors: Flexible Off-Policy Learning for Robot Navigation, Gregory Kahn*, Adam Villaflor*, Pieter Abbeel, Sergey Levine, CoRL 2018 (https://arxiv.org/pdf/1810.07167.pdf)
Self-supervised Deep Reinforcement Learning with Generalized Computation Graphs for Robot Navigation, Gregory Kahn, Adam Villaflor, Bosen Ding, Pieter Abbeel, Sergey Levine, ICRA 2018 (https://arxiv.org/pdf/1709.10489.pdf)


SLIDE 72

Some Theory References on State Representations

■ From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning: https://jair.org/index.php/jair/article/view/11175
■ Homomorphism: https://www.cse.iitm.ac.in/~ravi/papers/KBCS04.pdf
■ Towards a Unified Theory of State Abstraction for MDPs: https://pdfs.semanticscholar.org/ca9a/2d326b9de48c095a6cb5912e1990d2c5ab46.pdf
■ Model Reduction Techniques for Computing Approximately Optimal Solutions for Markov Decision Processes: https://arxiv.org/abs/1302.1533
■ Adaptive Aggregation Methods for Infinite Horizon Dynamic Programming
■ Transfer via Soft Homomorphisms: http://www.ifaamas.org/Proceedings/aamas09/pdf/01_Full%20Papers/12_67_FP_0798.pdf
■ Near Optimal Behavior via Approximate State Abstraction: https://arxiv.org/abs/1701.04113
■ Using PCA to Efficiently Represent State Spaces: http://irll.eecs.wsu.edu/wp-content/papercite-data/pdf/2015icml-curran.pdf

SLIDE 73

A Separation Principle for Control in the Age of Deep Learning

A Separation Principle for Control in the Age of Deep Learning, Alessandro Achille, Stefano Soatto (https://arxiv.org/abs/1711.03321)
Abstract: We review the problem of defining and inferring a “state” for a control system based on complex, high-dimensional, highly uncertain measurement streams such as videos. Such a state, or representation, should contain all and only the information needed for control, and discount nuisance variability in the data.