SLIDE 1

Benchmarking and Evaluation in Inverse Reinforcement Learning

Peter Henderson Workshop on New Benchmarks, Metrics, and Competitions for Robotic Learning RSS 2018

SLIDE 2

Where do you see the shortcomings in existing benchmarks and evaluation metrics? What are the challenges for learning in robotic perception, planning, and control that are not well covered by existing benchmarks? What are the characteristics new benchmarks should have to allow meaningful repeatable evaluation of approaches in robotics, while steering the community to addressing the open research challenges?

SLIDE 3

What is Inverse Reinforcement Learning?

Given observations from an expert policy, find the reward function for which the policy is optimal. Often involves learning a novice policy either while learning or for evaluation after learning.
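As a toy illustration (not from the talk), here is a minimal sketch of that loop via feature-expectation matching on a five-state chain; the environment, one-hot features, and the greedy novice policy are all illustrative assumptions:

```python
import numpy as np

# Minimal sketch of IRL via feature-expectation matching on a toy chain MDP
# (illustrative only; methods like MaxEnt IRL or AIRL are more involved).
# The reward is linear in one-hot state features, r(s) = w . phi(s); each
# update nudges w so the novice's visitation features approach the expert's.

n_states = 5
gamma = 0.95
phi = np.eye(n_states)  # one-hot state features

def feature_expectations(trajectories):
    """Average discounted feature counts over state trajectories."""
    mu = np.zeros(n_states)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi[s]
    return mu / len(trajectories)

def novice_trajectories(w, n_traj=20, horizon=10):
    """Hypothetical novice: greedily walks toward the highest-reward state."""
    best = int(np.argmax(phi @ w))
    trajs = []
    for _ in range(n_traj):
        s, traj = 0, []
        for _ in range(horizon):
            traj.append(s)
            s = min(s + 1, best) if s < best else max(s - 1, best)
        trajs.append(traj)
    return trajs

expert_trajs = [[0, 1, 2, 3, 4, 4, 4, 4, 4, 4]] * 20  # expert heads to state 4
mu_expert = feature_expectations(expert_trajs)
w = np.zeros(n_states)
for _ in range(50):
    w += 0.1 * (mu_expert - feature_expectations(novice_trajectories(w)))
print("learned reward weights:", np.round(w, 2))
```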

SLIDE 4

What is Inverse Reinforcement Learning?

https://youtu.be/bD-UPoLMoXw
https://youtu.be/CDSdJks-D3U

SLIDE 5

What characteristics might we want to examine for IRL algorithms?

Do the learned reward, and the optimal policy under this reward:

  • Capture the generalizable goal of the expert?
  • Learn to mimic the style or align with the values of the expert?
  • Transfer to its own setting from a variety of different experts?
  • Correlate with performance/evaluation metrics or the true reward?

https://youtu.be/CDSdJks-D3U

SLIDE 6

Sampling of Current Evaluation Environments

  • MuJoCo and other robot simulations: (Ho and Ermon, 2016); (Henderson et al., 2018); (Finn et al., 2016); (Fu et al., 2018)
  • Variations on Sprite, ObjectWorld, GridWorld: (Xu et al., 2018); (Li et al., 2017)
  • 2D Driving: (Majumdar et al., 2017); (Metelli et al., 2017)
  • Other: surgery simulator (Li & Burdick, 2017); PR2 (Finn et al., 2016)
SLIDE 7

Evaluation

Expected Value Difference (common; 100% of papers on the previous slide)

  • Given the optimal policy under the learned reward, what is the difference in true reward value relative to the optimal policy learned with the true reward? (A worked sketch follows this list.)

Difference from Ground Truth Reward (2/9 papers on the previous slide)

Performance Under Transfer (4/9 papers have some notion of transfer)

  • Expert is in a different environment than the agent
  • Transfer of the learned reward function to other settings
  • Generalization ability of the policy from the learned reward vs. the real reward

Distillation of Information (2/9 papers, conservatively)

  • Experts performing many tasks, or many experts with large variations; can one or more goals be distilled?
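To make the Expected Value Difference metric concrete, here is a hedged sketch on a tiny tabular MDP; the chain dynamics, the true reward, and the imperfect learned reward are all illustrative assumptions, not from the talk:

```python
import numpy as np

# Sketch of Expected Value Difference (EVD) on a tiny tabular MDP.
# EVD compares the true value of the truly-optimal policy against the true
# value of the policy that is optimal under the *learned* reward.
# Rewards here depend on the current state only, r(s).

n_states, n_actions, gamma = 5, 2, 0.95
# Deterministic chain: action 0 moves left, action 1 moves right.
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0

def value_iteration(r):
    """Optimal state values and greedy policy for reward vector r."""
    V = np.zeros(n_states)
    for _ in range(500):
        Q = r[:, None] + gamma * (P @ V)  # shape (n_states, n_actions)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

def policy_value(r, pi):
    """Exactly evaluate a deterministic policy pi against reward r."""
    P_pi = P[np.arange(n_states), pi]  # (n_states, n_states)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r)

true_r = np.array([0.0, 0.0, 0.0, 0.0, 1.0])     # goal at the right end
learned_r = np.array([0.0, 1.0, 0.0, 0.0, 0.5])  # spurious bonus at state 1

_, pi_true = value_iteration(true_r)
_, pi_learned = value_iteration(learned_r)
evd = policy_value(true_r, pi_true) - policy_value(true_r, pi_learned)
print("EVD per start state:", np.round(evd, 3))
```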
SLIDE 8

Sampling of Current Evaluation Environments

[Figure: performance in the original setting vs. under policy transfer and reward transfer]

Fu, Justin, Katie Luo, and Sergey Levine. "Learning Robust Rewards with Adversarial Inverse Reinforcement Learning." arXiv preprint arXiv:1710.11248 (2017).

SLIDE 9

What’s missing?

  • Domains and demonstrations are not consistent across papers (we need a benchmark suite)
  • Evaluations are mostly inconsistent
  • Notions of transfer are not consistent
  • A benchmark suite could provide consistent demonstrations and increasingly difficult or entangled notions of reward. It could encompass different notions of transfer and evaluation metrics in simulation settings with a known reward.

SLIDE 10

Learning from Human Demonstrators

Yu, Tianhe, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. "One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning." arXiv preprint arXiv:1802.01557 (2018).

SLIDE 11

Learning from Human Demonstrators

Evaluated as a success rate (with a known goal) for one-shot learning

SLIDE 12

Learning from Human Demonstrators

  • For goal-driven demonstrations, we can measure how well the learned reward correlates with goal achievement (a sketch follows this list).
  • But how can we ensure the true goal is being captured, rather than the exact movement simply being mimicked?
  • What about more ambiguous goals or properties? (e.g., the agent learns the goal but doesn't care how it gets there)
  • There are some challenges in evaluation and fairness of comparison.
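A minimal sketch of the correlation check mentioned in the first bullet; all numbers are hypothetical placeholders for per-trajectory logs:

```python
import numpy as np

# Sketch: does the learned reward track goal achievement? Roll out policies,
# log total learned reward and a binary goal indicator per trajectory
# (the values below are made-up stand-ins), then correlate the two.
learned_reward_per_traj = np.array([3.2, 1.1, 4.8, 0.4, 2.9])
goal_success_per_traj = np.array([1, 0, 1, 0, 1])  # 1 = goal reached

corr = np.corrcoef(learned_reward_per_traj, goal_success_per_traj)[0, 1]
print(f"correlation of learned reward with goal success: {corr:.2f}")
```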

SLIDE 13

Benchmarking IRL

  • We need a suite of benchmarks that lets us checkpoint progress in IRL and begin to capture more complex goals and objectives
  • Increasing domain-similarity distances and complexity of demonstrations
  • Consistent and diverse evaluations to characterize algorithms
  • Checkpoints until we can learn from raw video (e.g., unsupervised watching of soccer players, then learning to play in RoboCup)

SLIDE 14

Benchmarking IRL

  • Key to any benchmark is ease of use, characterization of algorithms, and setting realistic expectations of performance

SLIDE 15

But what about reproducibility?

  • Reproducibility in robotics seems to go hand in hand with consistency and robustness under new conditions

SLIDE 16

But what about reproducibility?

  • Release code and as much else as possible to reproduce the experiments.
  • Provide replicable simulated results and all details (e.g., hyperparameters) needed to run experiments; a minimal sketch of recording such details follows.
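A minimal sketch of what "all details needed to run experiments" might look like in practice; the specific fields and values are illustrative assumptions, not from the talk:

```python
import json
import random
import numpy as np

# Sketch: record every detail needed to re-run an experiment, including the
# random seed, alongside the results (field names/values are illustrative).
config = {
    "algo": "AIRL",
    "env": "HalfCheetah-v2",
    "seed": 42,
    "lr": 3e-4,
    "batch_size": 64,
    "discount": 0.99,
    "n_expert_trajs": 25,
}
random.seed(config["seed"])
np.random.seed(config["seed"])
with open("experiment_config.json", "w") as f:
    json.dump(config, f, indent=2)
```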

SLIDE 17

But what about reproducibility?

An intricate interplay of hyperparameters: for many (if not most) algorithms, hyperparameters can have a profound effect on performance. When testing a baseline, how motivated are we to find the best hyperparameters?

Henderson, Peter, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. "Deep reinforcement learning that matters." arXiv preprint arXiv:1709.06560 (2017).

SLIDE 18

But what about reproducibility?

  • (Cohen et al., 2018) suggest evaluating across ranges of hyperparameters in “Distributed Evaluations: Ending Neural Point Metrics”
  • This may provide some sense of how easy it would be to reproduce the method in a different setting (a grid-evaluation sketch follows)
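A rough sketch of that idea: score an algorithm over a grid of hyperparameters and seeds and report the distribution, not a single tuned point. Here `train_and_eval` is a placeholder standing in for the real training/evaluation pipeline:

```python
import itertools
import numpy as np

def train_and_eval(lr, batch_size, seed):
    """Placeholder: in practice, train the IRL algorithm with these
    hyperparameters and return a final evaluation score."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=np.log10(lr) + batch_size / 256, scale=0.1)

# Evaluate across a range of hyperparameters rather than one tuned point.
lrs = [1e-4, 3e-4, 1e-3]
batch_sizes = [32, 64, 128]
scores = [train_and_eval(lr, bs, seed)
          for lr, bs in itertools.product(lrs, batch_sizes)
          for seed in range(5)]
print(f"grid scores: mean={np.mean(scores):.2f}, std={np.std(scores):.2f}, "
      f"min={np.min(scores):.2f}, max={np.max(scores):.2f}")
```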

SLIDE 19

But what about reproducibility?

  • Because of hyperparameter tuning, we need held-out test sets describing a range of environment settings (see the split sketch after this list)
  • There is unaccounted-for computation when hyperparameters are tuned extensively without a test set
  • Benchmarks grow stale (overfitting); we can swap out demonstrations and environments, or place more emphasis on increasingly difficult versions
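A minimal sketch of such a held-out split; the environment variant names are made up:

```python
import random

# Sketch: hold out some environment variants as a test set so hyperparameter
# tuning cannot overfit the evaluation environments.
env_variants = [f"GridWorld-layout-{i}" for i in range(10)]
random.Random(0).shuffle(env_variants)  # fixed seed for a stable split
tune_envs, test_envs = env_variants[:7], env_variants[7:]
# Tune hyperparameters only on tune_envs; report final numbers on test_envs.
print("tuning envs:", tune_envs)
print("held-out test envs:", test_envs)
```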

SLIDE 20

But what about reproducibility?

  • A benchmark suite should provide a range of data adequate for assessing performance in a variety of environments.
  • Ideally, resources should be pooled in a central benchmark suite. If new settings are needed to demonstrate an algorithm's ability, they should be built on top of the suite.
SLIDE 21

But what about reproducibility?

  • Should cover enough random starts (e.g., random seeds) and increasingly difficult settings (e.g., distance between domains for transfer tasks) to provide an informative spectrum of performance (a seed-aggregation sketch follows the citation below)

Henderson, Peter, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. "Deep reinforcement learning that matters." arXiv preprint arXiv:1709.06560 (2017).
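A small sketch of aggregating over random seeds with a t-based confidence interval rather than reporting a single run; the per-seed returns are hypothetical:

```python
import numpy as np
from scipy import stats

# Sketch: report a mean and confidence interval over seeds, not one run.
final_returns = np.array([1021.3, 987.5, 1103.2, 954.1, 1040.8])  # 5 seeds

mean = final_returns.mean()
sem = stats.sem(final_returns)  # standard error of the mean
ci = stats.t.interval(0.95, df=len(final_returns) - 1, loc=mean, scale=sem)
print(f"mean return {mean:.1f}, 95% CI ({ci[0]:.1f}, {ci[1]:.1f})")
```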

SLIDE 22

But what about reproducibility?

Extended thoughts on reproducibility can be found in:

Henderson, Peter, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. "Deep reinforcement learning that matters." arXiv preprint arXiv:1709.06560 (2017).

And the ICLR 2018 keynote by Joelle Pineau based on that work:

https://www.youtube.com/watch?v=Vh4H0gOwdIg

SLIDE 23

Inverse Reinforcement Learning lacks a commonly used benchmark suite of tasks, demonstrations, and environments. When evaluating new algorithms, we should provide enough benchmarks and experiments to characterize performance across hyperparameters, random initializations, and differing conditions.

SLIDE 24

peter (dot) henderson (at) mail.mcgill.ca

Thank you! Feel free to email questions.