Reproducibility "[An article about computational science in a - PowerPoint PPT Presentation

PRUNE: A Preserving Run Environment for Reproducible Scientific Computing

Reproducibility • "[An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions]" –Jon Claerbout

Verify and Extend • Don’t re-invent the wheel • Stand on the shoulders of giants

Accepted philosophy: Preserve Later • Design, Execute, Observe, Share/Publish, Preserve • Libraries • Hardware • Network • System Administrators • Remote Collaborators • Graduated Students

Proposed philosophy: Preserve First • Preserve Later: Design, Execute, Observe, Share/Publish, Preserve • Preserve First: Design, Preserve, Execute, Share/Publish, Observe, Unpreserve

Differences • Git: User decides when to preserve • Preserve ALL specification changes • Git: Code Commits separate from Code Execution • System Manages ALL computation • Remove unneeded code later on • Preserve First: Design, Preserve, Execute, Share/Publish, Observe, Unpreserve

What to Preserve • Prune Task • arguments: [ file_id1, file_id2 ] • parameters: [ 'in.txt', 'in.dat' ] • command: 'do < in.txt in.dat > out.txt o2.txt' • returns: [ 'out.txt', 'o2.txt' ] • results: [ file_id3, file_id4 ] • environment: envi_id1 • Environment (Virtual Machine / Container): Data, Software, Operating System, Kernel, Hardware
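The fields on this slide can be sketched as a small record. This is only an illustration, not PRUNE's actual data model: the class name `TaskSpec`, the `task_id` method, and the choice of SHA-1 over a JSON encoding are all assumptions made for the example. The `results` field is left out because it is only known after the task has run.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical record mirroring the fields listed on the slide."""
    arguments: List[str]    # ids of the input files
    parameters: List[str]   # names those files take inside the sandbox
    command: str            # command run inside the environment
    returns: List[str]      # names of the output files to collect
    environment: str        # id of the preserved environment

    def task_id(self) -> str:
        """Derive an id from the full specification (an assumed scheme)."""
        encoded = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha1(encoded).hexdigest()


spec = TaskSpec(
    arguments=["file_id1", "file_id2"],
    parameters=["in.txt", "in.dat"],
    command="do < in.txt in.dat > out.txt o2.txt",
    returns=["out.txt", "o2.txt"],
    environment="envi_id1",
)
# An identical specification yields the same id; any change yields a new one.
same = TaskSpec(**asdict(spec))
changed = TaskSpec(**{**asdict(spec), "environment": "envi_id2"})
```

Under this assumed scheme, changing the environment alone is enough to make the task a different task, which is the point of preserving the environment alongside the command and data.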

Overview • Workflow Version #2 • Labels: User space, User interface, PRUNE space, Compute Resources • Legend: Environment (E), File (F), Task (T) • Simulate: T1 (E1), T4 (E2) • Analyze: T2 (E1), T3 (E1), T5 (E2), T6 (E2) • Plot: T7 (E2) • Files: F1 to F9
E1 = envi_add( type='EC2', image='hep.beta' )
E2 = envi_add( type='EC2', image='hep.stable' )
F1 = file_add( filename='./observed.dat' )
T4 = task_add( cmd='simulate > output', returns=['output'], environment=E1 )
T6 = task_add( args=[ T4[0] ], params=['in_data'], cmd='analyze < in_data > out_data', returns=['out_data'], environment=E2 )
T5 = task_add( args=[ F1 ], ...) (remaining arguments the same as above)
T7 = task_add( cmd='plot in1 in2 out1 out2', args=[ T5[0], T6[0] ], params=['in1', 'in2'], returns=['out1', 'out2'], environment=E2 )
export( [ T7[1] ], filename='./plot.jpg' )
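Because each task's `args` names either an added file or an output of another task, the calls on this slide implicitly describe a dependency graph. The sketch below derives one valid execution order from those references. It is not PRUNE's scheduler: the `depends_on` table is transcribed by hand from the `args` on the slide, and the ordering uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each task maps to the files or tasks whose outputs it consumes,
# transcribed by hand from the args on the slide.
depends_on = {
    "T4": set(),          # simulate: no inputs
    "T5": {"F1"},         # analyze the observed data
    "T6": {"T4"},         # analyze the simulated data
    "T7": {"T5", "T6"},   # plot both analyses
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())
```

Several orders are valid here (T4 and T5 are independent, for instance), which is what allows independent tasks to run in parallel.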

Sample code: Merge sort
#!/usr/bin/env python
from prune import client
prune = client.Connect()  # Use SQLite3
###### Import sources stage ######
E1 = prune.env_add(type='EC2', image='ami-b06a98d8')
D1, D2 = prune.file_add('nouns.txt', 'verbs.txt')

Sample code: Merge sort
###### Sort stage ######
D3, = prune.task_add(returns=['output.txt'], env=E1, cmd='sort input.txt > output.txt', args=[D1], params=['input.txt'])
D4, = prune.task_add(returns=['output.txt'], env=E1, cmd='sort input.txt > output.txt', args=[D2], params=['input.txt'])
###### Merge stage ######
D5, = prune.task_add(returns=['merged_out.txt'], env=E1, cmd='sort -m input*.txt > merged_out.txt', args=[D3, D4], params=['input1.txt', 'input2.txt'])
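In the calls above, `args` and `params` are paired by position: each stored file in `args` appears inside the task's working directory under the matching name in `params`, and the files named in `returns` are collected afterwards. The following sketch imitates that staging locally. It is an assumption-laden illustration, not PRUNE's implementation: `run_task` and the in-memory `store` are hypothetical, and a short Python script stands in for the `sort` command so the example does not depend on a particular shell.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical in-memory store: file id -> file contents.
store: Dict[str, bytes] = {
    "D1": b"pear\napple\n",
    "D2": b"run\neat\n",
}


def run_task(argv: List[str], args: List[str], params: List[str],
             returns: List[str]) -> List[bytes]:
    """Stage args under the names in params, run argv, collect returns."""
    with tempfile.TemporaryDirectory() as sandbox:
        for file_id, name in zip(args, params):
            Path(sandbox, name).write_bytes(store[file_id])
        subprocess.run(argv, cwd=sandbox, check=True)
        return [Path(sandbox, name).read_bytes() for name in returns]


# Stand-in for `sort input.txt > output.txt`, written in Python for portability.
SORT = [sys.executable, "-c",
        "lines = sorted(open('input.txt').read().splitlines());"
        "open('output.txt', 'w').write('\\n'.join(lines) + '\\n')"]

(sorted_nouns,) = run_task(SORT, ["D1"], ["input.txt"], ["output.txt"])
(sorted_verbs,) = run_task(SORT, ["D2"], ["input.txt"], ["output.txt"])
```

Both sort tasks use the same sandbox name `input.txt` for different stored files, which is why the command string can be identical in the two calls.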

Sample code: Merge sort
###### Execute the workflow ######
prune.execute(worker_type='local', cores=8)
###### Export ######
prune.export(D5, 'merged.txt')  # Final data
prune.export(D5, 'wf.prune', lineage=2)

Derivation History = Cachable Results
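One way to read this slide: if a result is identified by how it was derived (command, inputs, environment), then a repeated request for the same derivation can be answered from storage instead of being executed again. The sketch below illustrates that idea only. `derivation_key` and `DerivationCache` are hypothetical names, SHA-256 over a JSON record is an assumed keying scheme, and a plain Python callable stands in for running a task.

```python
import hashlib
import json
from typing import Callable, Dict, List


def derivation_key(cmd: str, input_keys: List[str], env: str) -> str:
    """Identify a result by how it is derived, not by where it is stored."""
    record = {"cmd": cmd, "inputs": input_keys, "env": env}
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()


class DerivationCache:
    """Hypothetical cache: derivation key -> previously computed result."""

    def __init__(self) -> None:
        self.results: Dict[str, str] = {}
        self.executions = 0

    def run(self, cmd: str, input_keys: List[str], env: str,
            execute: Callable[[], str]) -> str:
        key = derivation_key(cmd, input_keys, env)
        if key not in self.results:      # only execute unseen derivations
            self.executions += 1
            self.results[key] = execute()
        return self.results[key]


cache = DerivationCache()
first = cache.run("sort input.txt", ["nouns"], "E1", lambda: "apple\npear\n")
again = cache.run("sort input.txt", ["nouns"], "E1", lambda: "never called")
other = cache.run("sort input.txt", ["nouns"], "E2", lambda: "apple\npear\n")
```

The second call returns the stored result without executing anything, while the third call differs only in its environment and is therefore treated as a new derivation.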

Quotas

Scalability • ~12,000 parallel cores • ~3 million tasks • Wall clock overhead – ~1% above native

Thank You! • Sample workflows • http://ccl.cse.nd.edu/software/prune/prune.html – Merge sort – Pairwise comparisons (US Censuses) – High-energy Physics
