Wings for Pegasus: A Semantic Approach for Creating Very Large Scientific Workflows. Yolanda Gil, Jihie Kim, Varun Ratnakar, Ewa Deelman. USC Information Sciences Institute. www.isi.edu/ikcap/wings, pegasus.isi.edu


SLIDE 1

USC INFORMATION SCIENCES INSTITUTE Yolanda Gil

November 11, 2006

Wings for Pegasus:

A Semantic Approach for Creating Very Large Scientific Workflows

Yolanda Gil, Jihie Kim, Varun Ratnakar, Ewa Deelman
USC Information Sciences Institute
www.isi.edu/ikcap/wings
pegasus.isi.edu

Presentation at “OWL: Experiences and Directions”, Athens, GA, November 10-11, 2006


SLIDE 2

Computing and the Future of Science

SLIDE 3

Sharing Data Collection Instruments: LIGO

(ligo.caltech.edu)

SLIDE 4

Sharing Computing Resources

[Slide from C. Catlett of UC and TeraGrid]

SLIDE 5

Seismic Hazard Model

[Figure labels: Seismicity; Paleoseismology; Local site effects; Geologic structure; Faults; Stress transfer; Crustal motion; Crustal deformation; Seismic velocity structure; Rupture dynamics]

Integrating Diverse Models of Complex Phenomena [Slide from T. Jordan of SCEC]

SLIDE 6

Computational Workflows

Interdependent sets of computations

Dependencies are data flow: output of C1 is input for C2

Computations can be submitted for execution in various remote resources

Input data may be obtained from remote data repositories

New data products may be stored in remote data repositories

[Figure: example workflow for computing a seismic hazard curve]

Components: Hazard Curve Calculator; Field (2000) IMR; Basin-Depth Calculator; UTM Converter (get-Lat-Long-given-UTM); CVM-get-Velocity-at-point; Ruptures

Data products: Hazard curve (SA vs. prob. exc.); SA exc. probs.; Ruptures; Basin-Depth; Lat. Long.; UTM; Velocity

Parameters: Site VS30; Site Basin-Depth-2.5; SA Period; Gaussian Truncation; Std. Dev. Type; PEER-Fault; Gaussian Dist No Truncation; Total Moment Rate; Duration-Year; Fault-Grid-Spacing; Rupture Offset; Mag-Length-sigma; Dip; Rake; Magnitude (min); Magnitude (max); Magnitude (mean)

Task Result: Hazard curve: SA vs. prob. exc.

Other label: rfml
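The idea that dependencies are data flow (the output of C1 is the input for C2) can be sketched in a few lines of Python. This is a minimal illustration, not the Wings or Pegasus representation: the component names and data product names below are hypothetical, loosely borrowed from the figure labels, and the execution order is derived with the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical components; each lists the data it consumes and produces.
components = {
    "utm_converter": {"inputs": {"utm"}, "outputs": {"lat_long"}},
    "basin_depth_calculator": {"inputs": {"lat_long"}, "outputs": {"basin_depth"}},
    "hazard_curve_calculator": {
        "inputs": {"basin_depth", "ruptures"},
        "outputs": {"hazard_curve"},
    },
}


def dataflow_dependencies(components):
    """Map each component to the components whose outputs it consumes."""
    producer = {}
    for name, spec in components.items():
        for data in spec["outputs"]:
            producer[data] = name
    deps = {}
    for name, spec in components.items():
        # Inputs with no producer (e.g. "utm", "ruptures") are external data.
        deps[name] = {producer[d] for d in spec["inputs"] if d in producer}
    return deps


deps = dataflow_dependencies(components)
# static_order() yields every component after all of its predecessors.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Inputs that no component produces stand for data fetched from a repository rather than computed within the workflow.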

SLIDE 7

Pegasus: Planning for Execution in Grids

[Deelman et al SPJ’05; Deelman et al JGC’05; Deelman et al JGC’03]

• Maps from a workflow instance to an executable workflow
• Automatically locates physical locations for both workflow components and data
• Finds appropriate resources to execute the components
• Reuses existing data products where applicable
• Publishes newly derived data products
  • Adds data management nodes to the workflow
  • Supports automated provenance information capture
• Restructures workflows to improve performance
• Provides reliability via re-tries and re-mapping in case of failures

SLIDE 8

Mapping Workflow Instances to Grid Resources in Pegasus

[Figure: a workflow of tasks (nodes a through i) and the final workflow derived from it for the desired results]

Panels: Workflow of tasks; Final Workflow; Desired Results

Key: the original node; input transfer node; registration node; output transfer node; unnecessary nodes
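The mapping shown on this slide combines two ideas: dropping nodes whose results already exist, and adding data management nodes around what remains. The sketch below illustrates both under simplifying assumptions. The job and file names are hypothetical, and the functions are not the Pegasus API, only a small model of the reduction and expansion steps.

```python
def reduce_workflow(jobs, desired, available):
    """Return the jobs still needed to produce `desired`, reusing `available` files."""
    producer = {out: name for name, job in jobs.items() for out in job["outputs"]}
    needed = set()
    frontier = [f for f in desired if f not in available]
    while frontier:
        f = frontier.pop()
        job = producer.get(f)
        if job is None or job in needed:
            continue  # external input, or job already scheduled
        needed.add(job)
        # Only chase inputs that are not already available somewhere.
        frontier.extend(i for i in jobs[job]["inputs"] if i not in available)
    return needed


def add_data_nodes(jobs, needed, desired):
    """Add stage-in, stage-out, and registration nodes around the needed jobs."""
    produced = {o for j in needed for o in jobs[j]["outputs"]}
    consumed = {i for j in needed for i in jobs[j]["inputs"]}
    nodes = sorted(needed)
    nodes += [f"stage_in:{f}" for f in sorted(consumed - produced)]
    for f in sorted(desired & produced):
        nodes += [f"stage_out:{f}", f"register:{f}"]
    return nodes


# Hypothetical three-job chain: a -> b -> c.
jobs = {
    "a": {"inputs": ["f0"], "outputs": ["f1"]},
    "b": {"inputs": ["f1"], "outputs": ["f2"]},
    "c": {"inputs": ["f2"], "outputs": ["f3"]},
}
needed = reduce_workflow(jobs, desired={"f3"}, available={"f2"})
final = add_data_nodes(jobs, needed, desired={"f3"})
print(final)
```

Because `f2` already exists, jobs `a` and `b` become unnecessary, and only `c` remains along with its transfer and registration nodes.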

SLIDE 9

Pegasus Application Domains

Southern California Earthquake Center

  • _ million jobs, ~10TB data per workflow

Pulsar search for gravitational-wave physics (LIGO)

  • Largest ever NSF project
  • ~100,000 tasks per workflow

Galaxy morphology for NVO and NASA in Montage

  • ~50,000 tasks per workflow

Tomography for neural structure reconstruction

High-energy physics – Compact Muon Solenoid

Gene alignment

SLIDE 10

Creating Large Scientific Workflows

• Current approaches: scripts to create thousands of jobs and the dataflow among them
  • Scripts are workflow-specific and costly to create and debug

SLIDE 11

Our Approach to Creating Large-Scale Scientific Workflows

1. Capture the underlying structure of workflows as generic workflow “templates”
2. Automatically create “workflow instances” for given data inputs

[Figure: workflow template (1) and workflow instance (2)]

SLIDE 12

Wings/Pegasus Framework: Creation of Large-Scale Grid Workflows

1. Workflow Template (generic known-to-work recipes)
  • Specifies application components and dataflow among them
  • No data specified, just their type
2. Workflow Instance (data-specific)
  • Specifies data files for a given template
  • Expands parallel data processing steps
  • Logical file names, not physical file replicas
3. Executable Workflow (actual run)
  • Specifies physical locations of data files (may be in data repositories)
  • Assigns hosts/pools for execution of each component
  • Expands workflow to include data movements among execution sites
  • Reduces workflow by reusing previously executed computations
  • Restructures workflow by grouping related executions for efficiency
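The three levels above can be pictured as successive refinements of one data structure. The following is a minimal sketch under stated assumptions: the `Step` class, the `instantiate` and `make_executable` functions, the site name, and the replica catalog dictionary are all hypothetical and only model the distinction between data types, logical file names, and physical locations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    component: str   # application component name
    input_type: str  # template level: a data type, not a concrete file
    per_file: bool   # True if the step runs once per file in a collection


# 1. Workflow template: components and data types only.
template = [Step("simulate", "RuptureVariation", per_file=True)]


def instantiate(template, logical_files):
    """2. Workflow instance: bind logical file names and expand parallel steps."""
    jobs = []
    for step in template:
        groups = [[f] for f in logical_files] if step.per_file else [list(logical_files)]
        for files in groups:
            jobs.append({"component": step.component, "inputs": files})
    return jobs


def make_executable(instance, replica_catalog, site):
    """3. Executable workflow: resolve physical replicas and assign a site."""
    return [
        {**job, "site": site, "inputs": [replica_catalog[f] for f in job["inputs"]]}
        for job in instance
    ]


instance = instantiate(template, ["rv_001", "rv_002"])
catalog = {  # hypothetical logical-to-physical mapping
    "rv_001": "gsiftp://host.example.org/data/rv_001",
    "rv_002": "gsiftp://host.example.org/data/rv_002",
}
executable = make_executable(instance, catalog, site="cluster_a")
print(len(instance), executable[0]["inputs"])
```

The template stays reusable because nothing in it names a file; only the instance and executable levels change from run to run.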

SLIDE 13

Wings: Workflow Instance Generation and Selection

[Diagram labels]

User roles: Student; Seasoned NL Researcher; Algorithm Developer

Example requests:
  • “Show me workflows that prune MT rules”
  • “Run this workflow with the WSJ-04 data set”
  • “Validate this workflow based on the component specs”
  • “Here is a new Rule pruning code, takes in a set of MT rules, is compiled for MPI”

Workflow Creation: Workflow Selection; Workflow Template; Data Selection; Workflow Instance
  • Workflow templates specify complex analysis sequences
  • Workflow instances specify data

Component Specification
  • Specifies data requirements
  • Specifies execution requirements

Resources: Workflow Libraries; Application Components; Data Repositories; Ontologies (OWL)
  • Ontologies: domain terms, component types, workflow products
  • Data repositories: preexisting data collections, workflow execution results

Execution: WINGS; Executable Workflow; Pegasus; DAGMan/Grid

SLIDE 14

Example: A Workflow for Pruning Rules in a Machine Translation System

SLIDE 15

Workflows for Brain Imaging Analysis

(full ontologies and data available at http://vtcpc.isi.edu/provenance)

Workflow template Workflow instance

SLIDE 16

Workflows for Brain Imaging Analysis

(full ontologies and data available at http://vtcpc.isi.edu/provenance)

Workflow template Workflow instance

Template Metadata Propagation Axioms

SLIDE 17

Workflows for Brain Imaging Analysis

(full ontologies and data available at http://vtcpc.isi.edu/provenance)

Workflow template Workflow instance

Template Metadata Propagation Axioms

Metadata of Actual Input Data

SLIDE 18

Workflows for Brain Imaging Analysis

(full ontologies and data available at http://vtcpc.isi.edu/provenance)

Workflow template Workflow instance

Template Metadata Propagation Axioms

Metadata of Actual Input Data

Metadata Attributes Automatically Generated for New Data Products of the Workflow
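One way to picture how metadata for new data products could be derived from the metadata of the actual input data is a per-component rule that copies some attributes and sets others. This is only a sketch: the attribute names, the `align` component, and the rule format are hypothetical, and the axioms in the deck are expressed against OWL ontologies rather than Python dictionaries.

```python
def propagate(input_metadata, rule):
    """Derive output metadata: copy selected attributes, then apply fixed values."""
    output = {k: input_metadata[k] for k in rule["copy"] if k in input_metadata}
    output.update(rule["set"])
    return output


# Hypothetical rule for a hypothetical "align" component.
align_rule = {
    "copy": ["subject", "modality"],  # carried over from the input unchanged
    "set": {"format": "aligned"},     # asserted by the component itself
}

input_metadata = {"subject": "s01", "modality": "MRI", "format": "raw"}
output_metadata = propagate(input_metadata, align_rule)
print(output_metadata)
```

Applying such rules step by step along the dataflow gives every intermediate and final product a description before the workflow has run.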

SLIDE 19

Editing and Creating Workflows with Repetitive Structure

Workflow template Workflow instance Wings Editor

SLIDE 20

A Wings Workflow Template for Seismic Hazard Analysis

Single File File Collection Nested File Collection

Application Component Component Collection

SLIDE 21

Constraints on Workflow Templates

[Figure: graph of constraints attached to CybershakeTemplate]

Nodes: CybershakeTemplate; InputLink_SiteNameFile_to_BoxNameCheck; InputLink_RuptureVars_to_SeisgmogramGen; InputLink_SGTCollforRup_to_SeismogramGen; F-RV; C-RuptVars; CC-RuptureVariations; F-SGT; C-SGT-forRups; CC-SGTs; SiteNameFile; SGTsSiteName; SiteName; N_Rups

Properties: hasLink; hasFile; hasSiteName; hasN_Items; isSameAs

Annotations:
  • Constraints on number of elements in different collections
  • Constraints on files/collections of different workflow components
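The two kinds of constraints named on this slide can be illustrated with a small checker. This sketch assumes one plausible reading of the figure, namely that two collections must have the same number of items and that two site-name values must match. The constraint tuples and the sample data are hypothetical, and the deck states such constraints in OWL, not Python.

```python
def check_constraints(constraints, collections, values):
    """Return a list of human-readable violations (empty if all constraints hold)."""
    problems = []
    for kind, left, right in constraints:
        if kind == "same_size":
            n_left, n_right = len(collections[left]), len(collections[right])
            if n_left != n_right:
                problems.append(f"{left} has {n_left} items but {right} has {n_right}")
        elif kind == "same_value":
            if values[left] != values[right]:
                problems.append(f"{left}={values[left]!r} differs from {right}={values[right]!r}")
    return problems


# Constraint names reuse labels from the figure; the pairing is an assumption.
constraints = [
    ("same_size", "CC-RuptureVariations", "CC-SGTs"),
    ("same_value", "SGTsSiteName", "SiteName"),
]
collections = {
    "CC-RuptureVariations": [["rv1", "rv2"], ["rv3"]],
    "CC-SGTs": [["sgt1"], ["sgt2"]],
}
values = {"SGTsSiteName": "USC", "SiteName": "USC"}

print(check_constraints(constraints, collections, values))
```

Reporting violations as messages rather than raising on the first one mirrors the goal of telling the user about every inconsistency in a template binding.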

SLIDE 22

Workflow Instance Generation with Wings for Seismic Hazard Analysis

Input data: a site and an earthquake forecast model

  • thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed
  • ~110,000 rupture variations to be simulated for a given site

8043 application nodes in the workflow instance generated by Wings

24,135 nodes in the executable workflow generated by Pegasus, including:

  • data stage-in jobs, data stage-out jobs, data registration jobs

Executed on the USC HPCC cluster (1820 nodes w/ dual processors), but only < 144 available

  • Including MPI jobs, each running on hundreds of processors for 25-33 hours
  • Runtime was 1.9 CPU years

Significant contribution toward a more accurate seismic hazard map for SoCal

  • First integration of multiple physics-based models
  • Currently fine-tuning and cross-validating models

SLIDE 23

Wings Reasoners and Scale

Seismic hazard analysis workflow     A full workflow    A sub workflow
Creation time (Wings)                22 min 52 sec      7 min 59 sec
Number of file instances created     117,379            15,888
Number of OWL individuals created    2,001,972          322,473
Mapping time (Pegasus)               15 min             n/a

[Diagram: Wings process flow]

Inputs: Workflow Templates Library; Domain Component Library + Metadata Definitions; Core Component + DataSet Ontologies; Metadata Catalog (Data/File Descriptions)

Reasoners: OWL-DL Reasoner (Jena OWL Micro * Reasoner); No-inference Reasoner (Jena OWL Mem Writer)

Steps: Data Selection for Workflow Instance Creation; Template Data Instantiation; Workflow Node Unrolling for DAX Generation

Artifacts: Workflow Template; Workflow Instance (Compact); Workflow Instance (Expanded); DAX

Legend: Ontology Import; Process Flow
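The scale figures reported on this slide can be turned into rough per-file and per-second rates. The snippet below uses only the numbers from the slide; the derived ratios are back-of-the-envelope arithmetic for illustration, not measurements from the original work.

```python
# Figures copied from the slide: creation time, file instances, OWL individuals.
runs = {
    "full workflow": {"creation_s": 22 * 60 + 52, "files": 117_379, "individuals": 2_001_972},
    "sub workflow": {"creation_s": 7 * 60 + 59, "files": 15_888, "individuals": 322_473},
}

for name, run in runs.items():
    # Average number of OWL individuals created per file instance.
    run["individuals_per_file"] = run["individuals"] / run["files"]
    # Average number of OWL individuals created per second of Wings creation time.
    run["individuals_per_second"] = run["individuals"] / run["creation_s"]
    print(
        f"{name}: {run['individuals_per_file']:.1f} individuals/file, "
        f"{run['individuals_per_second']:.0f} individuals/s"
    )
```

Both runs create on the order of 17 to 20 OWL individuals per file instance, which gives a feel for why the number of triples becomes a concern at this scale.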

SLIDE 24

What We Learned

• Used a subset of OWL-DL
  • subClassOf, equivalentClass, intersectionOf,
• Needed to make assertions about workflow templates
  • Represented as Skolems
• Needed to represent ordered collections of files
  • Thousands of them
  • Used rdf:List
• Currently metadata is propagated outside the reasoner
  • To report inconsistencies appropriately
• Ideally would use rules to represent and propagate metadata
• Performance is an issue
  • Generated the workflow instance in slices to control the number of triples to manage
• Open question for scientific data management
  • Peta-scale data catalogs and storage (“silos”) are not on the web
  • Where should metadata reside? On the silos or on the semantic web?
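Two of the lessons above, ordered collections as RDF lists and generating the instance in slices, can be sketched without any RDF library. The `rdf:first`, `rdf:rest`, and `rdf:nil` terms are standard RDF vocabulary; the node naming scheme, the file names, and the fixed-size chunking are assumptions made for illustration and are not how Wings slices its workflow instances.

```python
from itertools import islice

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def rdf_list_triples(head, items):
    """Yield (subject, predicate, object) triples encoding items as an RDF list."""
    items = list(items)
    for i, item in enumerate(items):
        node = head if i == 0 else f"{head}_{i}"
        is_last = i == len(items) - 1
        rest = RDF + "nil" if is_last else f"{head}_{i + 1}"
        yield (node, RDF + "first", item)  # the element at this position
        yield (node, RDF + "rest", rest)   # link to the remainder of the list


def in_slices(iterable, size):
    """Yield successive lists of at most `size` elements."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk


files = [f"rupture_variation_{i:03d}.dat" for i in range(3)]  # hypothetical names
triples = list(rdf_list_triples("_:files", files))
slices = list(in_slices(triples, 4))
print(len(triples), [len(s) for s in slices])
```

Each list element costs two triples, so a collection of thousands of files grows the graph quickly, which is one reason to bound how many triples are held at a time.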

SLIDE 25

Summary: Creating Workflows with WINGS

Separates analysis spec from data

  • Workflow template as reusable well-defined acceptable analysis process
  • Workflow instance binds template to data for particular analyses

Assists users by finding suitable pre-defined workflow templates

  • Query by component type
  • Query by metadata properties of desired data product

Ensures that the data complies with the component specifications and their constraints within the workflow

Represents data collections (nominal or otherwise) within the workflow specification

Attaches descriptions and metadata to new data products to be created by the workflow execution

Records data provenance (workflow instance) and pedigree (workflow template)

Compact workflow instance is user-friendly and reusable

Expands workflow instance into DAX for Pegasus, which creates the executable workflow
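The two template queries mentioned in this summary, by component type and by metadata properties of the desired data product, can be modeled with a plain dictionary lookup. This is a sketch only: the library entries, type names, and property names are hypothetical, and matching here is exact equality rather than the ontology-based reasoning the deck describes.

```python
# Hypothetical template library entries.
library = [
    {
        "name": "prune_mt_rules",
        "component_types": {"RulePruner"},
        "product": {"type": "MTRuleSet", "pruned": True},
    },
    {
        "name": "seismic_hazard",
        "component_types": {"SeismogramGenerator"},
        "product": {"type": "HazardCurve"},
    },
]


def find_templates(library, component_type=None, product_properties=None):
    """Return names of templates matching a component type and/or product metadata."""
    wanted = product_properties or {}
    matches = []
    for template in library:
        if component_type is not None and component_type not in template["component_types"]:
            continue
        if any(template["product"].get(key) != value for key, value in wanted.items()):
            continue
        matches.append(template["name"])
    return matches


print(find_templates(library, component_type="RulePruner"))
print(find_templates(library, product_properties={"type": "HazardCurve"}))
```

With a class hierarchy in place, the equality tests would be replaced by subsumption checks, so a query for a general component type would also return templates that use its subtypes.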

SLIDE 26

Key Benefits

• Efficient and correct creation of new workflows
  • By retrieving a template and filling in the data
• Framework ensures adherence to scientific methodology
  • Represents widely-accepted analysis methodologies as templates
  • Supports repeatability of experiments/analyses
  • Enables controlled variations
• Ensures better quality of data analysis results
  • Attaches provenance and pedigree information

SLIDE 27

Ongoing and Future Work

• Interactive assistance in creating valid workflow templates
  • Based on CAT (Composition Analysis Tool) [Kim et al 04]
• More sophisticated models of components
• Automatic completion of a workflow’s data conversion and formatting steps through AI planning techniques
• Tracking new versions of components, invalidating data and workflows from old versions
• Workflow template libraries
  • Indexing, retrieval
• Managing collections of workflows as part of an overall analysis activity
  • E.g., parameter sweeping, variants of analysis

SLIDE 28

Additional Pointers

On Wings: www.isi.edu/ikcap/wings. See also:

  • “Wings for Pegasus: A Semantic Approach to Creating Very Large Scientific Workflows”, Yolanda Gil, Varun Ratnakar, Ewa Deelman, Marc Spraragen, and Jihie Kim. Proceedings of the OWL: Experiences and Directions 2006 (OWL-06), Athens, GA, November 10-11, 2006.

  • “Semantic Metadata Generation for Large Scientific Workflows”, Jihie Kim, Yolanda Gil, and Varun Ratnakar. Proceedings of the Fifth International Semantic Web Conference (ISWC-06), Athens, GA, November 5-9, 2006.

  • “Provenance Trails in the Wings/Pegasus System”, Jihie Kim, Ewa Deelman, Yolanda Gil, Gaurang Mehta, and Varun Ratnakar. Forthcoming.

  • “Managing Large-Scale Scientific Workflows in Distributed Environments: Experiences and Challenges”, Ewa Deelman and Yolanda Gil. Proceedings of the Workshop on Scientific Workflows and Business Workflow Standards in e-Science, The Second IEEE International Conference on e-Science and Grid Computing, Amsterdam, The Netherlands, December 4-6, 2006.

On Pegasus: pegasus.isi.edu. See also:

  • “Pegasus: A Framework for Mapping Complex Scientific Workflows onto Distributed Systems”, E. Deelman, G. Singh, M. Su, J. Blythe, Y. Gil, C. Kesselman, J. Kim, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. Laity, J. C. Jacob, D. S. Katz. Scientific Programming, Vol. 13, No. 3, 2005.

  • “Optimizing Grid-Based Workflow Execution”, Gurmeet Singh, Carl Kesselman, Ewa Deelman. Journal of Grid Computing, Vol. 3, No. 3-4, December 2005, pages 201-219.

  • “Mapping Abstract Workflows onto Grid Environments”, Ewa Deelman, Jim Blythe, Yolanda Gil, Carl Kesselman, Gaurang Mehta, Karan Vahi, Kent Blackburn, Albert Lazzarini, Adam Arbree, Richard Cavanaugh, and Scott Koranda. Journal of Grid Computing, Vol. 1, No. 1, 2003.