SLIDE 1

Converting Scripts into Reproducible Workflow Research Objects

Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros

lucas.carvalho@ic.unicamp.br

Baltimore, Maryland, USA, October 23-26, 2016

SLIDE 4

Background and Motivation

  • Data-Intensive Experiments

– Collection of scripts, programs and (big) data

Manual collection and organization of data provenance

Papers

How to understand, reproduce or reuse data and models of experiments?

SLIDE 5

Background and Motivation

  • Script-based experiments

Example of script code.

What are the inputs and outputs? How can this local program be swapped for a similar web service?

Difficult to understand, to reuse, and to reproduce.

SLIDE 6

Background and Motivation

  • Scientific Workflows

Example of a Scientific Workflow Management System.

SLIDE 9

Overview

Create, Understand, Reuse, Reproduce

Methodology: Step 1, Step 2, Step 3, Step 4, Step 5

SLIDE 10

Related Work

Limitations of existing approaches:

  • Script-language specific.
  • Workflow-engine specific.
  • A new language is needed.
  • Outcome is not an executable workflow.
  • Do not collect provenance data of the conversion process.

SLIDE 11

Two Kinds of Experts

  • Scientists:

– Domain experts who understand the experiment and the script (sometimes called users).

  • Curators:

– Scientists who are also familiar with workflow and script programming; or

– Computer scientists who are familiar enough with the domain to be able to implement our methodology.

– Responsible for authoring, documenting and publishing workflows and associated resources.

SLIDE 12

Requirements

  • (1) Produce a workflow-like view of the script.
  • (2) Create an executable workflow and compare the execution of the workflow and the script.
  • (3) Modify the workflow resources.
  • (4) Record provenance data.
  • (5) Aggregate all resources to support reproducibility and reuse.

SLIDE 13

Requirements

  • (1) Produce a workflow-like view of the script.

Figure: a script-based experiment and the corresponding abstract workflow, whose activities (Activity 1, Activity 2, ..., Activity n) are connected through ports.

SLIDE 14

Requirements

  • (2) Create an executable workflow and compare the execution of the workflow and the script.

Figure: script-based experiment and executable workflow.
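One way to compare the two executions is to check whether the script run and the workflow run produced the same output files. The following is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the directory layout and the file names are hypothetical, and byte-for-byte equality is only one possible comparison criterion.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_outputs(script_dir: Path, workflow_dir: Path) -> dict:
    """Map each output file name to True when both runs produced identical bytes."""
    names = {p.name for p in script_dir.iterdir()}
    names |= {p.name for p in workflow_dir.iterdir()}
    report = {}
    for name in sorted(names):
        a, b = script_dir / name, workflow_dir / name
        report[name] = a.is_file() and b.is_file() and digest(a) == digest(b)
    return report


# Hypothetical outputs of one script run and one workflow run.
with tempfile.TemporaryDirectory() as tmp:
    script_out = Path(tmp) / "script_run"
    workflow_out = Path(tmp) / "workflow_run"
    script_out.mkdir()
    workflow_out.mkdir()
    (script_out / "trajectory.dat").write_text("frame 1\n")
    (workflow_out / "trajectory.dat").write_text("frame 1\n")
    (script_out / "analysis.dat").write_text("rmsd 0.12\n")
    (workflow_out / "analysis.dat").write_text("rmsd 0.13\n")
    report = compare_outputs(script_out, workflow_out)
```

Exact equality may be too strict for simulations with non-deterministic results; in that case a tolerance-based comparison of the parsed values would be a more suitable check.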

SLIDE 15

Requirements

  • (3) Modify the workflow resources.

Figure: panels (a) and (b), with the labels Local, Algorithm A and Algorithm B.

SLIDE 16

Requirements

  • (4) Record provenance data.

Figure: provenance graph with the nodes Activity 1, Activity 2, Sample, Output 1, Output 2, Workflow Run and Lucas, connected by the relations used, wasGeneratedBy, wasAssociatedWith and wasStartedAt (“2012-06-01”).

SLIDE 17

Requirements

  • (5) Aggregate all resources to support reproducibility and reuse.

Aggregated resources: abstract workflows, concrete workflows, annotations, papers and reports, provenance, authors, scripts, data.

SLIDE 18

Methodology

Input: Script

  • Step 1: Generate abstract workflow
  • Step 2: Create an executable workflow
  • Step 3: Refine workflow
  • Step 4: Annotate and check quality
  • Step 5: Bundle resources into a Research Object

Intermediate artifacts: abstract workflow, concrete workflow.

SLIDE 19

Workflow Research Object (WRO)

  • Research Objects are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations.
  • WROs encapsulate scientific workflows and additional information regarding their context and resources.

Figure: Research Object Model.

SLIDE 20

Running Example

  • Molecular Dynamics Simulations

– Used in many branches of materials science, computational engineering, physics and chemistry.

– Scripts (shell script), programs (NAMD, VMD, Fortran).

– Phases: set up, simulation and analysis of trajectories.

– Inputs: protein structure, simulation parameters and force field files.

– Outputs: trajectories and analysis results.

SLIDE 24

Step 1: Generate Abstract Workflow

  • Manually annotate the script code with YesWorkflow (McPhillips et al., 2015) to mark code blocks and their inputs/outputs.
  • Annotations are written as code comments.
  • Tags:

– @begin

– @end

– @desc

– @in

– @out

– ...

  • Create a workflow-like view from the annotated script.

Figure: script code, annotated script code and abstract workflow.

Reference: T. McPhillips et al., “YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts,” International Journal of Digital Curation, vol. 10, no. 1, pp. 298–313, 2015.
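To illustrate how comment tags can yield a workflow-like view, the sketch below reads @begin, @end, @in and @out tags from a small annotated script and derives the data links between blocks. It is a simplified stand-in written for this example, not the YesWorkflow tool or its API: the script content, the block and port names, and the helper functions are all hypothetical, and nested blocks are not handled.

```python
import re
from dataclasses import dataclass, field

# Hypothetical shell script annotated with YesWorkflow-style comment tags.
ANNOTATED_SCRIPT = """\
# @begin setup @desc Prepare the system for simulation
# @in structure
# @out prepared_system
echo "setup"
# @end setup
# @begin simulate @desc Run the simulation
# @in prepared_system
# @in parameters
# @out trajectory
echo "simulate"
# @end simulate
"""

# Matches a comment line that starts with one of the handled tags.
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")


@dataclass
class Block:
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def parse_blocks(script: str) -> list:
    """Collect each @begin/@end block with its @in and @out ports."""
    blocks, current = [], None
    for line in script.splitlines():
        match = TAG.match(line.strip())
        if not match:
            continue  # ordinary code line, not an annotation
        tag, value = match.groups()
        if tag == "begin":
            current = Block(value)
        elif tag == "in" and current:
            current.inputs.append(value)
        elif tag == "out" and current:
            current.outputs.append(value)
        elif tag == "end" and current:
            blocks.append(current)
            current = None
    return blocks


def data_links(blocks: list) -> list:
    """Link a producer block to every block that consumes one of its outputs."""
    return [(p.name, port, c.name)
            for p in blocks for port in p.outputs
            for c in blocks if port in c.inputs]


blocks = parse_blocks(ANNOTATED_SCRIPT)
links = data_links(blocks)
```

Here the blocks play the role of activities and the @in/@out names play the role of ports, so `links` is the edge list of the abstract workflow.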

SLIDE 29

Step 2: Create an executable workflow

  • Create the implementation of activities: copy code blocks from the script.

Figure: script code, abstract workflow and executable workflow.

SLIDE 31

Step 3: Refine executable workflow

  • Create a new version by modifying resources:

– Algorithms

– Data Sets

– Parallelization

– Web Services

– ...

Figure: executable workflow and new workflow version.

SLIDE 32

Steps 2 and 3: Record provenance data (execution traces)

Figure: W3C PROV graph for a run of the executable workflow, with the nodes split, psgen, Sample, Output 1, Output 2, Workflow Run and Lucas, connected by the relations used, wasGeneratedBy, wasAssociatedWith, wasStartedAt (“2012-06-01”), wasEnactedBy and hasSpecification.
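An execution trace of this kind can be stored as subject-relation-object statements and queried for lineage. The sketch below is a plain-Python illustration rather than a PROV library: the node and relation labels are taken from the slide, but which node each relation connects is an assumption, since the layout of the original figure is not recoverable.

```python
# Assumed edges; only the labels come from the slide.
TRACE = [
    ("Output 1", "wasGeneratedBy", "split"),
    ("Output 2", "wasGeneratedBy", "split"),
    ("split", "used", "Sample"),
    ("psgen", "used", "Output 1"),
    ("Workflow Run", "wasAssociatedWith", "Lucas"),
    ("Workflow Run", "wasStartedAt", "2012-06-01"),
]


def objects(subject, relation, trace=TRACE):
    """Return every object linked to the subject by the given relation."""
    return [o for s, r, o in trace if s == subject and r == relation]


def lineage(entity, trace=TRACE):
    """Return the entities that the given entity transitively depends on."""
    found = []
    for activity in objects(entity, "wasGeneratedBy", trace):
        for source in objects(activity, "used", trace):
            if source not in found:
                found.append(source)
            for ancestor in lineage(source, trace):
                if ancestor not in found:
                    found.append(ancestor)
    return found


output_sources = lineage("Output 1")
run_agents = objects("Workflow Run", "wasAssociatedWith")
```

The recursion assumes the trace has no cyclic dependencies, which holds for a record of a finished run.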

SLIDE 33

Steps 2 and 3: Record provenance data (conversion process)

Figure: W3C PROV graph linking the script code, the executable workflow and the new workflow version through wasDerivedFrom relations, with the curator attached through wasAssociatedWith.
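Recording wasDerivedFrom links makes it possible to trace any workflow version back to the script it came from. The following is a minimal sketch with assumed content: the mapping below guesses which artifact derives from which, and `origin_chain` is a hypothetical helper name.

```python
# Assumed derivations: each artifact maps to the artifact it was derived from.
DERIVED_FROM = {
    "new workflow version": "executable workflow",
    "executable workflow": "script code",
}


def origin_chain(artifact, derived_from=DERIVED_FROM):
    """Follow wasDerivedFrom links from an artifact back to its origin."""
    chain = [artifact]
    while chain[-1] in derived_from and derived_from[chain[-1]] not in chain:
        chain.append(derived_from[chain[-1]])
    return chain


chain = origin_chain("new workflow version")
```

The second condition in the loop stops the walk if the recorded links ever form a cycle.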

SLIDE 34

Step 4: Annotate and check quality

  • Add annotations describing the workflow.
  • Use provenance data to check the quality of the conversion process.
  • Run checks to verify the soundness of the workflow.
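The slides do not say which soundness checks are run. As an illustration only, the sketch below applies three plausible structural checks to a workflow description: every input port has a source, no port is produced twice, and the activities do not form a cycle. The `check_soundness` function and the dictionary layout are assumptions; `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def check_soundness(activities: dict, workflow_inputs: set) -> list:
    """Return a list of problems; an empty list means all checks passed.

    activities maps an activity name to {"in": [...], "out": [...]}.
    """
    problems = []
    produced = {}
    for name, ports in activities.items():
        for port in ports["out"]:
            if port in produced:
                problems.append(
                    f"port '{port}' is produced by both {produced[port]} and {name}")
            produced[port] = name

    # Build a dependency graph: activity -> activities it depends on.
    graph = {name: set() for name in activities}
    for name, ports in activities.items():
        for port in ports["in"]:
            if port in produced:
                graph[name].add(produced[port])
            elif port not in workflow_inputs:
                problems.append(f"input '{port}' of {name} has no source")

    try:
        tuple(TopologicalSorter(graph).static_order())
    except CycleError:
        problems.append("workflow contains a cycle")
    return problems


# Hypothetical workflow description.
sound = {
    "setup": {"in": ["structure"], "out": ["prepared_system"]},
    "simulate": {"in": ["prepared_system", "parameters"], "out": ["trajectory"]},
}
no_problems = check_soundness(sound, {"structure", "parameters"})
missing_input = check_soundness(sound, {"structure"})
```

A dangling input such as `parameters` in the second call corresponds to one of the typical conversion mistakes: not providing the correct input files and parameters.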

SLIDE 35

Step 4: Annotate and check quality

Figure: script code and executable workflow.

SLIDE 36

Step 4: Annotate and check quality

Figure: initial executable workflow and workflow version.

SLIDE 37

Step 4: Annotate and check quality

  • Common mistakes during the conversion:

– the main logical processing units in the script were not clearly identified;

– a mistake was made when migrating script code into the corresponding activity;

– the correct input files and parameters were not provided;

– the coding of the workflow itself contained errors.

SLIDE 38

Step 5: Bundle Resources into a Research Object

Bundled resources: script, abstract workflow, concrete workflow(s), annotations, paper, provenance data, attributions.
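Bundling can be pictured as packing the resources into one archive together with a manifest that lists what is aggregated. The sketch below is a deliberately simplified illustration: the manifest layout, the `bundle_resources` function, the role labels and the file names are assumptions, and the result does not follow the actual Research Object Bundle format.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def bundle_resources(resources: dict, bundle_path: Path) -> dict:
    """Write resources and a manifest listing them into a ZIP archive.

    resources maps an archive path to a (role, content bytes) pair.
    """
    manifest = {
        "aggregates": [
            {"uri": name, "role": role}
            for name, (role, _) in sorted(resources.items())
        ]
    }
    with zipfile.ZipFile(bundle_path, "w") as archive:
        for name, (_, content) in resources.items():
            archive.writestr(name, content)
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
    return manifest


# Hypothetical resources of one converted experiment.
resources = {
    "script.sh": ("script", b"echo simulate\n"),
    "workflow/abstract.gv": ("abstract workflow", b"digraph {}\n"),
    "provenance/trace.ttl": ("provenance data", b"# trace\n"),
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "experiment.zip"
    bundle_resources(resources, path)
    with zipfile.ZipFile(path) as archive:
        names = sorted(archive.namelist())
        manifest = json.loads(archive.read("manifest.json"))
```

Keeping a role for every aggregated file lets a reader of the bundle tell the script, the workflows and the provenance apart without opening them.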

SLIDE 39

Contributions

  • A methodology that guides curators in a principled manner to transform scripts into reproducible and reusable WROs;
  • This addresses an important issue in the area of script provenance.

SLIDE 40

Conclusions

  • We addressed issues regarding the understanding, reuse and reproducibility of script-based experiments.
  • The methodology was:

– elaborated based on requirements;

– showcased via a real-world use case from the field of Molecular Dynamics.

  • We exploited tools and standards from the scientific community:

– Scientific Workflows, YesWorkflow, Research Objects, the W3C PROV recommendations and the Web Annotation Data Model.

  • The bundle is available at http://w3id.org/w2share/s2rwro/

SLIDE 41

Next Steps

  • Evaluation using other case studies;
  • Evaluation of the cost-effectiveness of our methodology;
  • Extension of YesWorkflow to support the semantic annotation of blocks;
  • Implementation of tools.

SLIDE 42

Acknowledgments

  • FAPESP (grant # 2014/23861-4)
  • CCES/CEPID (grant # 2013/08293-7)

– Center for Computational Engineering & Sciences

  • LIS (Laboratory of Information Systems)
  • Prof. Munir Skaf and his group from the Institute of Chemistry, Unicamp.
