Tackling the Reproducibility Problem in Systems Research with Declarative Experiment Specifications
Ivo Jimenez, Carlos Maltzahn (UCSC); Adam Moody, Kathryn Mohror (LLNL); Jay Lofstead (Sandia); Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
The Reproducibility Problem
[Figure: throughput (MB/s) vs. cluster size for the original experiment, next to the question "Reproduced?"]
- Network
- Disks
- BIOS
- OS conf.
- Magic numbers
- Workload
- Jitter
- etc...
Goal: define a methodology so that we don't end up in this situation
Outline
- Re-execution vs. validation
- Declarative Experiment Specification (ESF)
- Case Study
- Benefits & Challenges
Reproducibility Workflow
- 1. Re-execute the experiment
– Recreate the original setup and re-execute the experiments
– A technical task
- 2. Validate the results
– Compare against the original
– A subjective task
- How do we express objective validation criteria?
- What contextual information should be included with results?
[Diagram: the means of the experiment (hardware, OS, libs, code, workload data) produce raw data, observations, and a figure. Experiment goal: show that my algorithm/system/etc. is better than the state of the art.]
Validation Language Syntax

validation
  : 'for' condition ('and' condition)* 'expect' result ('and' result)*
  ;
condition
  : vars ('in' range | ('=' | '<' | '>' | '!=') value)
  ;
result
  : condition
  ;
vars
  : var (',' var)*
  ;
range
  : '[' range_num (',' range_num)* ']'
  ;
range_num
  : NUMBER '-' NUMBER | '*'
  ;
value
  : '*' | NUMBER (',' NUMBER)*
  ;
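The grammar above can be exercised with a small evaluator. Below is a minimal sketch for a single-condition subset of the language; the `parse`/`holds` helpers and the row format are illustrative assumptions, not part of the ESF toolkit.

```python
# Minimal evaluator for a single-condition subset of the validation
# language: "for <var> <op> <num> expect <var> <op> <num>".
import operator
import re

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def parse(stmt):
    """Split a statement into (condition, result) triples of (var, op, value)."""
    m = re.match(r"for (\w+) (<=|>=|!=|[=<>]) (\S+) "
                 r"expect (\w+) (<=|>=|!=|[=<>]) (\S+)", stmt)
    g = m.groups()
    return (g[0], g[1], float(g[2])), (g[3], g[4], float(g[5]))

def holds(stmt, rows):
    """True iff the expectation holds on every row matching the condition."""
    (cv, cop, cval), (rv, rop, rval) = parse(stmt)
    matching = [r for r in rows if OPS[cop](r[cv], cval)]
    return all(OPS[rop](r[rv], rval) for r in matching)

rows = [{"cluster_size": n, "ceph": 58.0} for n in range(2, 29, 2)]
print(holds("for cluster_size <= 24 expect ceph >= 55", rows))  # True
```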
Ceph OSDI ‘06
- Selected the scalability experiment
– Distributed; makes use of all resources
– Main bottlenecks: I/O and network
- Why this experiment?
– Top conference
– 10-year-old experiment
– Ideal reproducibility conditions: access to authors, topic familiarity, same hardware
- Even in an ideal scenario, we still struggle
- Demonstrates which missing info is captured by an ESF!
Ceph OSDI '06 Scalability Experiment

Validation statements:
  for cluster_size <= 24 expect ceph >= 55 mb/s
  for cluster_size <= 24 expect ceph >= (raw * .90)

Schema of experiment output data (final form, after incrementally adding the method and net_saturated variables):

  "independent_variables": [
    { "type": "cluster_size", "values": "2-28" },
    { "type": "method", "values": ["raw", "ceph"] },
    { "type": "net_saturated", "values": ["true", "false"] }
  ],
  "dependent_variable": {
    "type": "throughput",
    "scale": "mb/s"
  }
for cluster_size <= 24 expect ceph >= (raw * .90)
for cluster_size = * and not net_saturated expect ceph >= (raw * .90)
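A minimal sketch of checking the second clause against tabular output; the data layout and function name are illustrative assumptions, not part of the ESF tooling.

```python
# Sketch: check "for cluster_size = * and not net_saturated expect
# ceph >= (raw * .90)" against experiment output.

def validate(samples, saturated):
    """samples: {cluster_size: {"raw": MB/s, "ceph": MB/s}};
    saturated: cluster sizes where the network was saturated (excluded)."""
    return all(s["ceph"] >= 0.90 * s["raw"]
               for n, s in samples.items() if n not in saturated)

samples = {n: {"raw": 60.0, "ceph": 57.0} for n in (2, 6, 10, 14, 18, 22)}
samples[26] = {"raw": 60.0, "ceph": 40.0}  # throughput collapses once the network saturates
print(validate(samples, saturated={26}))   # True: the saturated point is excluded
print(validate(samples, saturated=set()))  # False: 40 < 0.9 * 60
```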
[Figures: per-OSD average throughput (MB/s) vs. cluster size (original), and normalized per-OSD throughput vs. OSD cluster size (reproduced vs. original)]
Benefits & Challenges
Why care about Reproducibility?
- "Good enough" is not an excuse
– We can always improve the state of our practice
– How do we compare hardware/software in a scientific way?
- Experimental cloud infrastructure
– PRObE / CloudLab / Chameleon
– Reproducible, validated experiments would be a significant step toward embodying the scientific method as a core component of these infrastructures
Benefits of ESF-based methodology
- Brings falsifiability to our field
– Statements can be proven false
- Automates validation
– Validation becomes an objective task
Validation Workflow
[Flowchart] Obtain or recreate the means of the experiment. Are there any significant differences between the original and recreated means? If yes, update the means of the experiment; if that is not possible, the original claims cannot be validated. Otherwise, re-run and check the validation clauses against the output. If any validation fails, the claims cannot be validated; if none fail, the original work's findings are corroborated.
Benefits of ESF-based methodology
- Brings falsifiability to our field
– Statements can be proven false
- Automates validation
– Validation becomes an objective task
- Usability
– We all do this anyway, albeit in an ad hoc way
- Integrates into existing infrastructure
Integration with Existing Infrastructure
[Diagram: continuous-integration pipeline. Today: push code; CI pulls and runs unit and integration tests. With ESFs: push code and ESF; CI runs unit tests, integration tests, and validations.]
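Under this integration, the validations stage could run as an ordinary test. A hypothetical sketch follows; the `run_experiment` stand-in and test name are assumptions, and a real pipeline would re-execute the packaged experiment instead of returning canned rows.

```python
# Hypothetical CI "Validations" stage: re-run the experiment, then assert
# the ESF validation clauses over its output, alongside unit/integration tests.

def run_experiment():
    # Stand-in for re-executing the packaged experiment; yields output rows.
    return [{"cluster_size": n, "raw": 60.0, "ceph": 58.0} for n in range(2, 25, 2)]

def test_esf_validations():
    # Clause from the Ceph case study:
    #   for cluster_size <= 24 expect ceph >= (raw * .90)
    for row in run_experiment():
        if row["cluster_size"] <= 24:
            assert row["ceph"] >= 0.90 * row["raw"]

test_esf_validations()  # a CI runner (e.g., pytest) would discover this automatically
```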
Challenges
- Reproduce every time
– Include sanity checks as part of the experiment
– Alternative: corroborate at runtime that the network/disk observes expected behavior
- Reproduce everywhere
– Example: GCC's flags, 10806 combinations
– Alternative: provide an image of the complete software stack (e.g., Linux containers)
Conclusion
ESFs:
- Embody all components of an experiment
- Enable automation of result validation
- Bring us closer to the scientific method

Our ideal future:
– Researchers use ESFs to express a hypothesis
– Toolkits for ESFs produce metadata-rich figures
– Evaluation sections become machine-readable
https://github.com/systemslab/esf
Thanks!
SILT SOSP '11

Experiment goal: "The high random read speed of flash drives means that the CPU budget available for each index operation is relatively limited. This microbenchmark demonstrates that SILT's indexes meet their design goal of computation-efficient indexing."

Schema:
  "independent_variables": [
    { "type": "method", "values": ["raw", "cuckoo", "trie"] },
    { "type": "workload", "values": ["individual", "bulk", "lookup"] }
  ],
  "dependent_variable": {
    "type": "throughput",
    "scale": "bytes/second"
  }

Validations:
  for workload=* expect cuckoo > raw and trie > raw
  for lookup expect cuckoo > trie
  for individual and bulk expect cuckoo > trie
Geneiatakis et al. CCS '12

Experiment goal: "In this section, our goal is to evaluate the performance benefits that can be reaped, by utilizing virtual partitioning to apply otherwise expensive protection mechanisms on the most exposed part of applications. This allows us to strike a balance between the overhead imposed on the application and its exposure to attacks."
Schema:
  "independent_variables": [
    {
      "type": "method",
      "alias": ["technique"],
      "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"]
    }
  ],
  "dependent_variable": {
    "type": "runtime",
    "scale": "s"
  }
Validations:
  expect native < any and
  dta_pin between pin and isr and
  dta_isr between isr and dta
Example 2

Schema (adding a workload variable):
  "independent_variables": [
    { "type": "method", "values": ["native", "pin", "isr", "dta", "dta_pin", "dta_isr"] },
    { "type": "workload", "values": ["ftp", "samba", "ssh"] }
  ],
  "dependent_variable": {
    "type": "runtime",
    "scale": "s"
  }
Validations:
  for workload=* expect native < any and
  dta_pin between pin and isr and
  dta_isr between isr and dta
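The `between` clause reads as a per-workload range check. A small sketch follows; the runtimes and the `validate` helper are illustrative assumptions (all numbers are made up).

```python
# Sketch of evaluating "native < any and dta_pin between pin and isr and
# dta_isr between isr and dta" for every workload.

def between(x, lo, hi):
    return min(lo, hi) <= x <= max(lo, hi)

runs = {
    "ftp":   {"native": 1.0, "pin": 2.0, "isr": 3.0, "dta": 6.0,
              "dta_pin": 2.5, "dta_isr": 4.0},
    "samba": {"native": 1.0, "pin": 2.2, "isr": 3.1, "dta": 6.5,
              "dta_pin": 2.8, "dta_isr": 4.2},
}

def validate(runs):
    return all(
        r["native"] < min(v for k, v in r.items() if k != "native")  # native < any
        and between(r["dta_pin"], r["pin"], r["isr"])
        and between(r["dta_isr"], r["isr"], r["dta"])
        for r in runs.values()
    )

print(validate(runs))  # True for this sample data
```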
Falsifiability in Science
Falsifiability of a statement, hypothesis, or theory is the inherent possibility of proving it false.
- In other words, the ability to specify the conditions under which a statement is false
- Synonymous with testability
- Example:
– Statement: all swans are white
– Falsification: observe one black swan
source: en.wikipedia.org/wiki/Falsifiability
Falsifiability in Systems
- To falsify a claim:
– Describe the means of the experiment
– Provide validation statements over the output data
- Conditional statement:
– If the means are properly recreated,
– then the validation statements should hold
- Go from inert observations to falsifiable statements:
From: "We observe that our system outperforms the alternatives"
To: "Expect 25-30% performance improvement on hardware platform X, on workload Y, when configured like Z"
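The falsifiable form of the claim can be checked mechanically. In the sketch below, the 25-30% window and the platform/workload/configuration tags come from the slide; everything else (names, data shape) is an illustrative assumption.

```python
# A falsifiable claim as an executable check: "expect 25-30% performance
# improvement on platform X, workload Y, configuration Z".

def improvement(baseline, ours):
    return (ours - baseline) / baseline

def claim_holds(run):
    # If the means of the experiment were not recreated, the claim is
    # neither validated nor falsified.
    if (run["platform"], run["workload"], run["config"]) != ("X", "Y", "Z"):
        return None
    return 0.25 <= improvement(run["baseline"], run["ours"]) <= 0.30

print(claim_holds({"platform": "X", "workload": "Y", "config": "Z",
                   "baseline": 100.0, "ours": 128.0}))  # True (28% improvement)
```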
Early Feedback
Creating an ESF helps authors:
- Find meaningful, reproducible baselines
- Create a feedback loop in the author's mind
- Specify exactly what the author means
- Make temporal context explicit