
Converting Scripts into Reproducible Workflow Research Objects

Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros (lucas.carvalho@ic.unicamp.br). Baltimore, Maryland, USA, October 23-26, 2016.


  1. Converting Scripts into Reproducible Workflow Research Objects Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros lucas.carvalho@ic.unicamp.br Baltimore, Maryland, USA October 23-26, 2016

  2. Background and Motivation ● Data-intensive experiments – A collection of scripts, programs, and (big) data. (Figure label: Papers.)

  3. Background and Motivation ● Data-intensive experiments – A collection of scripts, programs, and (big) data. – How to understand, reproduce, or reuse the data and models of experiments? (Figure label: Papers.)

  4. Background and Motivation ● Data-intensive experiments – A collection of scripts, programs, and (big) data. – How to understand, reproduce, or reuse the data and models of experiments? – Manual collection and organization of data provenance. (Figure label: Papers.)

  5. Background and Motivation ● Script-based experiments – What are the inputs and outputs? – How can this local program be replaced by a similar web service? – Difficult to understand, reuse, and reproduce. (Figure: example of script code.)

  6. Background and Motivation ● Scientific workflows (Figure: example of a scientific workflow management system.)

  7. Overview (Diagram labels: Understand, Reuse, Create, Reproduce.)

  8. Overview (Diagram labels: Understand + Reuse, Create, Reproduce.)

  9. Overview (Diagram labels: Understand + Reuse, Create, Reproduce. Methodology: Step 1, Step 2, Step 3, Step 4, Step 5.)

  10. Related Work ● Script-language specific. ● Workflow-engine specific. ● A new language is needed. ● Outcome is not an executable workflow. ● Do not collect provenance data of the conversion process.

  11. Two Kinds of Experts ● Scientists: – Domain experts who understand the experiment and the script (sometimes called users). ● Curators: – Scientists who are also familiar with workflow and script programming; or – Computer scientists who are familiar enough with the domain to be able to implement our methodology. – Responsible for authoring, documenting, and publishing workflows and associated resources.

  12. Requirements ● (1) Produce a workflow-like view of the script. ● (2) Create an executable workflow and compare the execution of the workflow and the script. ● (3) Modify the workflow resources. ● (4) Record provenance data. ● (5) Aggregate all resources to support reproducibility and reuse.

  13. Requirements ● (1) Produce a workflow-like view of the script. (Figure: a script-based experiment shown next to an abstract workflow whose activities, Activity 1, Activity 2, ..., Activity n, are connected through ports.)

  14. Requirements ● (2) Create an executable workflow and compare the execution of the workflow and the script. (Figure: script-based experiment and executable workflow.)

  15. Requirements ● (3) Modify the workflow resources. (Figure labels: (a) Local; (b) Algorithm A, Algorithm B.)

  16. Requirements ● (4) Record provenance data. (Figure: provenance graph with nodes Workflow Run, Lucas, Sample, Activity 1, Activity 2, Output 1, Output 2, and “2012-06-01”, linked by wasAssociatedWith, wasStartedAt, used, and wasGeneratedBy.)

  17. Requirements ● (5) Aggregate all resources to support reproducibility and reuse. (Figure labels: Authors, Data, Annotations, Provenance, Scripts, Concrete workflows, Abstract workflows, Papers and Reports.)

  18. Methodology ● Step 1: Generate abstract workflow. ● Step 2: Create an executable workflow. ● Step 3: Refine workflow. ● Step 4: Annotate and check quality. ● Step 5: Bundle resources into a Research Object. (Figure artifacts: Script, Abstract workflow, Concrete workflow.)

  19. Workflow Research Object (WRO) ● Research Objects are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations. ● WROs encapsulate scientific workflows and additional information regarding their context and resources. (Figure: Research Object Model.)

  20. Running Example ● Molecular Dynamics Simulations – Used in many branches of materials science, computational engineering, physics, and chemistry. – Scripts (shell script) and programs (NAMD, VMD, Fortran). – Phases: setup, simulation, and analysis of trajectories. – Inputs: protein structure, simulation parameters, and force field files. – Outputs: trajectories and analysis results.

  21. Step 1: Generate Abstract Workflow (Figure: script code.)

  22. Step 1: Generate Abstract Workflow ● Manually annotate the script. (Figure: script code and annotated script code.)

  23. Step 1: Generate Abstract Workflow ● Manually annotate the script. ● Create a workflow-like view. (Figure: script code, annotated script code, and abstract workflow.)

  24. Step 1: Generate Abstract Workflow ● YesWorkflow (McPhillips et al., 2015) – Annotations mark code blocks and their inputs/outputs. – Annotations are written as code comments. – Tags: @begin, @end, @desc, @in, @out, ... ● Create a workflow-like view. (Figure: annotated script code and abstract workflow.) Reference: T. McPhillips et al., “YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts,” International Journal of Digital Curation, vol. 10, no. 1, pp. 298–313, 2015.

  25. Step 1: Generate Abstract Workflow ● Create a workflow-like view. (Figure: annotated script code and abstract workflow.)
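
As a rough illustration of Step 1, the sketch below reads YesWorkflow-style comment tags from a script and derives an abstract workflow (blocks, ports, and data links). It does not call the YesWorkflow tool itself. The parser handles only flat, non-nested blocks, and the script fragment, block names, and port names are hypothetical.

```python
import re

# Hypothetical shell script fragment annotated with YesWorkflow-style tags.
# Block and port names are invented for illustration.
ANNOTATED_SCRIPT = """\
# @begin setup @desc Prepare the simulation inputs
# @in structure
# @in parameters
# @out psf_file
echo "setup"
# @end setup
# @begin simulation @desc Run the dynamics
# @in psf_file
# @out trajectory
echo "simulate"
# @end simulation
"""

# Matches a comment line that starts with one of the four tags handled here.
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")


def abstract_workflow(script_text):
    """Return {block: {"in": [...], "out": [...]}} from annotation comments."""
    blocks, current = {}, None
    for line in script_text.splitlines():
        match = TAG.match(line.strip())
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            current = name
            blocks[current] = {"in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            blocks[current][tag].append(name)
    return blocks


def data_links(blocks):
    """Link each output port to every block that reads a port of the same name."""
    return sorted(
        (producer, port, consumer)
        for producer, ports in blocks.items()
        for port in ports["out"]
        for consumer, other in blocks.items()
        if port in other["in"]
    )


workflow = abstract_workflow(ANNOTATED_SCRIPT)
links = data_links(workflow)
```

Here `workflow` plays the role of the abstract workflow (activities with input and output ports), and `links` lists the data dependencies that a workflow-like view would draw as edges.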

  26. Step 2: Create an Executable Workflow (Figure: abstract workflow.)

  27. Step 2: Create an Executable Workflow ● Create the implementation of the activities by copying code blocks from the script. (Figure: abstract workflow and executable workflow.)

  28. Step 2: Create an Executable Workflow ● Create the implementation of the activities by copying code blocks from the script. (Figure: abstract workflow and executable workflow.)

  29. Step 2: Create an Executable Workflow ● Create the implementation of the activities by copying code blocks from the script. (Figure: script code, abstract workflow, and executable workflow.)
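
One way to picture "copy code blocks from the script" is to collect the lines enclosed by each annotated block, so that each block becomes the body of one activity. This is a minimal sketch under that assumption, not the authors' tooling. The block names `split` and `psgen` appear in the provenance slide of this deck, but the commands and port names below are placeholders.

```python
# Hypothetical annotated script; the echo commands are placeholders for the
# real code of each block.
SCRIPT = """\
#!/bin/sh
# @begin split @desc Hypothetical first block
# @in sample
# @out parts
echo "splitting sample"
# @end split
# @begin psgen @desc Hypothetical second block
# @in parts
# @out structure
echo "generating structure"
# @end psgen
"""


def extract_activity_code(script_text):
    """Map each @begin/@end block name to the non-comment lines it encloses."""
    activities, current = {}, None
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("# @begin"):
            current = stripped.split()[2]  # tokens: "#", "@begin", <name>, ...
            activities[current] = []
        elif stripped.startswith("# @end"):
            current = None
        elif current is not None and not stripped.startswith("#"):
            activities[current].append(line)
    return activities


activities = extract_activity_code(SCRIPT)
```

Each entry in `activities` could then be pasted into the corresponding activity of a workflow system. How that is done depends on the chosen workflow engine.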

  30. Step 3: Refine Executable Workflow ● Modify resources: – Algorithms – Data sets – Parallelization – Web services – ... (Figure: executable workflow and new workflow version.)

  31. Step 3: Refine Executable Workflow ● Create a new version. ● Modify resources: – Algorithms – Data sets – Parallelization – Web services – ... (Figure: executable workflow and new workflow version.)
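
Refinement can be thought of as swapping the implementation behind an activity while keeping its ports, which yields a new workflow version and leaves the previous one intact. The sketch below illustrates that idea with hypothetical names (`Activity`, `Workflow`, `refine`, `algorithm_a`, `algorithm_b`). It is not tied to any particular workflow management system.

```python
from dataclasses import dataclass, replace
from statistics import mean, median
from typing import Callable, Tuple


@dataclass(frozen=True)
class Activity:
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    run: Callable[[dict], dict]  # implementation behind the ports


@dataclass(frozen=True)
class Workflow:
    version: int
    activities: Tuple[Activity, ...]


def refine(workflow, activity_name, new_run):
    """Return a new workflow version with one activity re-implemented."""
    if activity_name not in {a.name for a in workflow.activities}:
        raise KeyError(f"unknown activity: {activity_name}")
    activities = tuple(
        replace(a, run=new_run) if a.name == activity_name else a
        for a in workflow.activities
    )
    return Workflow(version=workflow.version + 1, activities=activities)


def algorithm_a(data):
    """Hypothetical original implementation."""
    return {"summary": mean(data["values"])}


def algorithm_b(data):
    """Hypothetical replacement, e.g. another algorithm or a remote service."""
    return {"summary": median(data["values"])}


v1 = Workflow(1, (Activity("analysis", ("values",), ("summary",), algorithm_a),))
v2 = refine(v1, "analysis", algorithm_b)
```

Because both dataclasses are frozen, `v1` stays untouched after `refine`, so the original executable workflow and the new version can both be kept and later bundled.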

  32. Steps 2 and 3 ● Record provenance data: execution traces (W3C PROV). (Figure: provenance graph for the executable workflow with nodes Workflow Run, Lucas, Sample, split, psgen, Output 1, Output 2, and “2012-06-01”, linked by wasAssociatedWith, hasSpecification, wasEnactedBy, wasStartedAt, used, and wasGeneratedBy.)

  33. Steps 2 and 3 ● Record provenance data: conversion process (W3C PROV). (Figure: script code, executable workflow, and new workflow version linked by wasDerivedFrom, with wasAssociatedWith links to the Curator.)
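
Provenance of the conversion process can be pictured as a small set of subject-relation-object statements. The sketch below uses plain tuples with relation names taken from W3C PROV (`wasDerivedFrom`, `used`, `wasGeneratedBy`, `wasAssociatedWith`). The identifiers and the exact edges are illustrative assumptions, since the slide diagrams cannot be recovered from this transcript, and no PROV library is used.

```python
# Each statement is (subject, PROV relation, object). Identifiers are
# hypothetical stand-ins for the resources named on the slides.
statements = [
    ("executable_workflow", "wasDerivedFrom", "script"),
    ("workflow_v2", "wasDerivedFrom", "executable_workflow"),
    ("conversion", "used", "script"),
    ("executable_workflow", "wasGeneratedBy", "conversion"),
    ("conversion", "wasAssociatedWith", "curator"),
]


def derivation_chain(entity, statements):
    """Follow wasDerivedFrom links from an entity back to its origin."""
    parents = {s: o for s, relation, o in statements if relation == "wasDerivedFrom"}
    chain = [entity]
    # Stop at the origin, or if a cycle would be entered.
    while chain[-1] in parents and parents[chain[-1]] not in chain:
        chain.append(parents[chain[-1]])
    return chain


def agents_for(activity, statements):
    """Return the agents associated with an activity."""
    return [
        o for s, relation, o in statements
        if s == activity and relation == "wasAssociatedWith"
    ]


chain = derivation_chain("workflow_v2", statements)
responsible = agents_for("conversion", statements)
```

A chain like this lets a reader trace a refined workflow version back to the original script and see who carried out the conversion.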

  34. Step 4: Annotate and Check Quality ● Add annotations describing the workflow. ● Use provenance data to check the quality of the conversion process. ● Run checks to verify the soundness of the workflow.

  35. Step 4: Annotate and Check Quality (Figure: script code and executable workflow.)

  36. Step 4: Annotate and Check Quality (Figure: initial executable workflow and new workflow version.)

  37. Step 4: Annotate and Check Quality ● Common mistakes during the conversion: – failing to clearly identify the main logical processing units in the script; – making a mistake when migrating script code into the corresponding activity; – not providing the correct input files and parameters; – introducing errors in the coding of the workflow itself.
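
One possible quality check, sketched here as an assumption rather than the authors' procedure, is to compare the outputs of a script run with the outputs of the corresponding workflow run. The example compares content digests and reports every output that differs or exists on only one side. File names and contents are hypothetical, and a byte-level comparison may be too strict for simulations that are not deterministic.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content fingerprint of one output."""
    return hashlib.sha256(data).hexdigest()


def compare_outputs(script_outputs, workflow_outputs):
    """Return names whose content differs or that exist on only one side."""
    names = set(script_outputs) | set(workflow_outputs)
    return sorted(
        name
        for name in names
        if name not in script_outputs
        or name not in workflow_outputs
        or digest(script_outputs[name]) != digest(workflow_outputs[name])
    )


# Hypothetical outputs captured from one script run and one workflow run.
script_outputs = {"trajectory.dcd": b"frames", "analysis.dat": b"1.0\n"}
workflow_outputs = {
    "trajectory.dcd": b"frames",
    "analysis.dat": b"1.1\n",
    "log.txt": b"done",
}

mismatches = compare_outputs(script_outputs, workflow_outputs)
```

An empty `mismatches` list would suggest that the conversion preserved the script's behavior for that run. A non-empty list points the curator to the activities worth inspecting for the mistakes listed above.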

  38. Step 5: Bundle Resources into a Research Object (Figure labels: Provenance Data, Attributions, Annotations, Script, Concrete workflow(s), Abstract workflow, Paper.)
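
To make the bundling step concrete, the sketch below packs a few resources and a manifest listing them into a single ZIP archive. The manifest layout (`manifest.json` with an `aggregates` list and a `role` per resource) is a simplification assumed for illustration and should not be read as the Research Object Bundle format. All paths and contents are hypothetical.

```python
import io
import json
import zipfile


def build_bundle(resources, manifest_path="manifest.json"):
    """Pack resources and a manifest describing them into ZIP bytes.

    resources maps an archive path to a (role, content) pair.
    """
    manifest = {
        "aggregates": [
            {"path": path, "role": role}
            for path, (role, _content) in sorted(resources.items())
        ]
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(manifest_path, json.dumps(manifest, indent=2))
        for path, (_role, content) in resources.items():
            archive.writestr(path, content)
    return buffer.getvalue()


# Hypothetical resources gathered in the previous steps.
resources = {
    "script/experiment.sh": ("script", "echo run\n"),
    "workflow/abstract.txt": ("abstract workflow", "setup -> simulation\n"),
    "provenance/conversion.txt": ("provenance", "workflow wasDerivedFrom script\n"),
}

bundle = build_bundle(resources)
```

The resulting bytes can be written to a file and shared. A reader can then open the manifest first to see which scripts, workflows, and provenance records the bundle aggregates.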

  39. Contributions ● A methodology that guides curators in a principled manner to transform scripts into reproducible and reusable WROs. ● This addresses an important issue in the area of script provenance.

  40. Conclusions ● We addressed issues regarding the understanding, reuse, and reproducibility of script-based experiments. ● The methodology was: – elaborated based on requirements; – showcased via a real-world use case from the field of Molecular Dynamics. ● We exploited tools and standards from the scientific community: – Scientific Workflows, YesWorkflow, Research Objects, the W3C PROV recommendations, and the Web Annotation Data Model. ● The bundle is available at http://w3id.org/w2share/s2rwro/

  41. Next Steps ● Evaluation using other case studies; ● Evaluation of the cost-effectiveness of our methodology; ● Extension of YesWorkflow to support the semantic annotation of blocks; ● Implementation of tools.

  42. Acknowledgments ● FAPESP (grant # 2014/23861-4) ● CCES/CEPID (grant # 2013/08293-7) – Center for Computational Engineering & Sciences ● LIS (Laboratory of Information Systems) ● Prof. Munir Skaf and his group from the Institute of Chemistry, Unicamp.

  43. Converting Scripts into Reproducible Workflow Research Objects Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros lucas.carvalho@ic.unicamp.br Baltimore, Maryland, USA October 23-26, 2016
