Reproducibility and Replicability in Deep Reinforcement Learning
(and Other Deep Learning Methods)
Peter Henderson Statistical Society of Canada Annual Meeting 2018
Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. "Deep reinforcement learning that matters.” Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). 2018.
Contributors: Riashat Islam, Phil Bachman, Doina Precup, David Meger, Joelle Pineau
“Repeatability (Same team, same experimental setup): A researcher can reliably repeat her own computation. Replicability (Different team, same experimental setup): An independent group can obtain the same result using the author's own artifacts. Reproducibility (Different team, different experimental setup): An independent group can obtain the same result using artifacts which they develop completely independently.”
Plesser, Hans E. "Reproducibility vs. replicability: a brief history of a confused terminology." Frontiers in Neuroinformatics 11 (2018): 76.
“Reproducibility is a minimum necessary condition for a finding to be believable and informative.”
Cacioppo, John T., et al. "Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science." (2015).
“An article about computational science in a scientific publication is not the scholarship itself, it is merely the advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”
Buckheit, Jonathan B., and David L. Donoho. "Wavelab and reproducible research." Wavelets and statistics. Springer New York, 1995. 55-81.
Fermat’s last theorem (1637): “No three positive integers a, b, and c satisfy the equation aⁿ + bⁿ = cⁿ for any integer value of n greater than 2.”
The claimed proof was too long for the margin, so it was not included. The proof could not be reproduced until 1995 (358 years!!)
Image source: https://upload.wikimedia.org/wikipedia/commons/4/47/Diophantus-II-8-Fermat.jpg
https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
Generative Adversarial Networks (GANs): “We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. (…) We did not find evidence that any of the tested algorithms consistently outperforms the original one.”
Lucic, Mario, et al. "Are GANs Created Equal? A Large-Scale Study." arXiv preprint arXiv:1711.10337 (2017).
Neural Language Models (NLMs): “We reevaluate several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrive at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models”
Melis, Gábor, Chris Dyer, and Phil Blunsom. "On the state of the art of evaluation in neural language models." arXiv preprint arXiv:1707.05589 (2017).
Reinforcement Learning (RL): “Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful.”
Henderson, Peter, et al. "Deep reinforcement learning that matters." arXiv preprint arXiv:1709.06560 (2017).
REUSABLE materials (software, datasets, experimental platforms) help us REPRODUCE, REPLICATE, and REPEAT scientific methods to establish the ROBUSTNESS of findings, using fair comparisons and informative evaluation methods.
# of RL papers per year scraped from Google Scholar searches.
Codebases
Hyperparameters
Random seeds
Variable performance across different settings
An intricate interplay of hyperparameters: for many (if not most) algorithms, hyperparameters can have a profound effect on performance. When testing a baseline, how motivated are we to find its best hyperparameters?
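The effect of tuning budget can be sketched with a toy experiment (the performance surface, learning-rate range, and all numbers below are hypothetical, not from the paper): the same algorithm, tuned with unequal random-search budgets, can look like two different algorithms.

```python
import random

def evaluate(lr, seed):
    # Hypothetical stand-in for one RL training run: a noisy
    # performance surface peaked at lr = 3e-4.
    rng = random.Random(seed)
    return 100.0 - 1e7 * (lr - 3e-4) ** 2 + rng.gauss(0.0, 5.0)

def tune(n_configs, master_seed):
    # Random search over log-uniform learning rates with a fixed
    # budget of n_configs trials; returns the best score found.
    rng = random.Random(master_seed)
    best = float("-inf")
    for i in range(n_configs):
        lr = 10.0 ** rng.uniform(-5.0, -2.0)
        best = max(best, evaluate(lr, seed=1000 * master_seed + i))
    return best

# A lightly tuned "baseline" vs. a heavily tuned "new method",
# even though both are the *same* algorithm.
baseline = tune(n_configs=3, master_seed=7)
new_method = tune(n_configs=30, master_seed=7)
print(f"baseline: {baseline:.1f}  new method: {new_method:.1f}")
```

With the same master seed, the first three sampled configurations are shared, so the larger budget can only match or beat the smaller one: any apparent "improvement" here is pure tuning effort.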
[Charts: performance of Alg. 1-4 under different hyperparameter settings]
Video taken from: https://gym.openai.com/envs/HalfCheetah-v1
…and random seeds?
How do we pick n?
[Charts: average returns (10-70) for Alg. 1-4 against a baseline to beat: all n=10 runs; the top-3 results of n=10; and two groups of n=5]
Both are the same TRPO code with the best hyperparameter configuration, but different random seeds!
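One way to make such seed-to-seed variation visible is to report a confidence interval over seeds rather than a single mean. A minimal percentile-bootstrap sketch, using only the standard library (the per-seed return values are made up for illustration):

```python
import random
import statistics

def bootstrap_ci(returns, n_boot=10000, alpha=0.05, seed=0):
    # Percentile bootstrap confidence interval for the mean return:
    # resample the per-seed returns with replacement, recompute the
    # mean each time, and take the central (1 - alpha) interval.
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(returns, k=len(returns)))
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical final returns from 10 seeds of the same algorithm,
# arbitrarily split into two groups of 5.
group_a = [3200.0, 1800.0, 2900.0, 3500.0, 2100.0]
group_b = [4800.0, 4100.0, 5200.0, 3900.0, 4600.0]

ci_a = bootstrap_ci(group_a)
ci_b = bootstrap_ci(group_b)
print(f"group A 95% CI: {ci_a}")
print(f"group B 95% CI: {ci_b}")
```

With only 5 seeds per group, such intervals tend to be wide, which is exactly the point: a single-mean comparison hides how much of the gap could be seed luck.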
Overfitting and generalization to new conditions
(reproducible algorithm performance in varying environments)
What constitutes a fair comparison?
Using different hyperparameters? What about different codebases? Should hyperparameter optimization computation time be included when comparing algorithm sample efficiency?
Commit to releasing all materials necessary for replicability, repeatability, and reproducibility (e.g., code, hyperparameters, tricks, etc.)
Develop new methods for and ensure the use of rigorous evaluation and experimental methodology
(at least use many trials and random seeds as indications of robustness, and better statistical indicators of significant findings)
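As one concrete (if basic) statistical indicator, a permutation test on final returns across seeds needs only the standard library. A sketch under simplifying assumptions, with hypothetical per-seed returns; this is offered as an example of a significance check, not as the specific test used in the paper:

```python
import random
import statistics

def permutation_test(xs, ys, n_perm=5000, seed=0):
    # Two-sided permutation test for a difference in mean return
    # between two sets of per-seed results: shuffle the pooled
    # returns and count how often a random split separates the
    # means at least as much as the observed split does.
    rng = random.Random(seed)
    observed = abs(statistics.fmean(xs) - statistics.fmean(ys))
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(xs)])
                   - statistics.fmean(pooled[len(xs):]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-seed returns for two algorithms (5 seeds each).
alg_a = [3200.0, 1800.0, 2900.0, 3500.0, 2100.0]
alg_b = [4800.0, 4100.0, 5200.0, 3900.0, 4600.0]
print(f"p = {permutation_test(alg_a, alg_b):.4f}")
```

Even a simple test like this forces the comparison to account for the spread across seeds, rather than declaring a winner from two point estimates.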
2nd Workshop on Reproducibility in Machine Learning (ICML 2018) (deadline to submit June 10th) https://sites.google.com/view/icml-reproducibility-workshop/home
Aim to make algorithms more robust to variation in different settings and at least across random seeds.
Better generalization and robustness
Reduce the number of hyperparameters for ease of use by outside communities, robustness, and easier fair comparisons.
(e.g., fewer hyperparameters, AutoML, etc.)
Align rewards with desired behaviors
1st Workshop on Goal Specification (ICML 2018) https://sites.google.com/view/goalsrl/home
Commit to sharing reusable material. Develop a culture of good experimental practice. Be thorough, be fair, and be as critical of the “good” results as the “bad” ones. Contribute to the reproducibility effort! Organize an event, sign up for a challenge, include it in your work and in your course.
Submit your awesome work on reproducibility to the 2nd Workshop on Reproducibility in Machine Learning (ICML 2018) (deadline to submit June 10th)
Thank you! Feel free to email with questions and comments.