
Data lineage model for Taverna workflows with lightweight annotation requirements Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble School of Computer Science The University of Manchester, UK IPAW'08 Salt Lake City, Utah, June 2008


  1. Data lineage model for Taverna workflows with lightweight annotation requirements
     Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble
     School of Computer Science, The University of Manchester, UK
     IPAW'08 – Salt Lake City, Utah, June 2008

  2. Context and scope
     Ongoing work on a new provenance component for Taverna
     • myGrid consortium
     Scope:
     • capture raw provenance events
       – data transformations, data transfers
     • store one lineage graph for each dataflow execution
     • query over single or multiple lineage graphs
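
The scope above can be pictured with a small Python sketch. It is only an illustration: the event classes `Xform` and `Xfer`, their fields, and the port names are hypothetical and are not taken from the Taverna provenance component. It records the two kinds of raw events and groups them into one lineage graph per run.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Xform:
    """Hypothetical raw event: a processor derived out_port from in_port."""
    run_id: str
    processor: str
    in_port: str
    out_port: str


@dataclass(frozen=True)
class Xfer:
    """Hypothetical raw event: a value moved along a link between two ports."""
    run_id: str
    from_port: str
    to_port: str


def build_lineage_graphs(events):
    """Return {run_id: {port: set of ports it was derived from}}."""
    graphs = defaultdict(lambda: defaultdict(set))
    for event in events:
        if isinstance(event, Xform):
            graphs[event.run_id][event.out_port].add(event.in_port)
        else:
            graphs[event.run_id][event.to_port].add(event.from_port)
    return graphs


events = [
    Xform("run1", "P1", "P1V_I1", "P1V_O1"),
    Xfer("run1", "P1V_O1", "P2V_I1"),
    Xform("run1", "P2", "P2V_I1", "P2V_O1"),
    Xform("run2", "P1", "P1V_I1", "P1V_O1"),
]
graphs = build_lineage_graphs(events)
```

Queries over a single run read one entry of `graphs`; queries over a collection of runs iterate over several entries.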

  3. Example (Taverna) dataflow
     [Figure: QTL -> genes -> Kegg pathways]

  4. Some user questions on lineage
     • on a single workflow run:
       – find all genes that participate in some pathway p
       – find all pathways derived from Uniprot genes
       – describe the complete derivation of each pathway in which gene g is involved
     • on a collection of runs:
       – find all distinct pathways produced by runs of a dataflow [over a period of time, produced by a member of my group, ...]

  5. Shortcomings of lineage data
     • Granularity
       – risk of returning trivial answers
       – “all outputs depend on all inputs”
     • Semantics
       – Results not expressed in the language of the designer
     • Abstraction level, noise
       – the “latent data model”
       – many processors are irrelevant
       – shims, mundane tasks

  6. The need for selective annotations
     • As long as processors are black boxes, these remain difficult problems
     • Adding annotations to processors is tempting
     Scope of this work: to explore the “gray box” region
     • simple annotations with minimal semantics
     • driving principle: justified by technical benefits
       – precision of query results
       – efficiency of query processing

  7. Test dataflow model
     [Figure: test dataflow. Workflow inputs: configuration, documents. Data labels: merged results, number of duplicates]
     • P1 extract query terms: inputs P1V_I1, P1V_I2; output P1V_O1
     • P2 query prep: input P2V_I1; output P2V_O1
     • P3 query 1: input P3V_I1; output P3V_O1
     • P4 query 2: input P4V_I1; output P4V_O1
     • P5 post-proc: input P5V_I1; output P5V_O1
     • P6 merge results: inputs P6V_I1, P6V_I2; outputs P6V_O1, P6V_O2
     • P7 sort: input P7V_I; output P7V_O

  8. Two main annotation types
     Focusing: processor selection
     • some processors are more interesting than others
     • “boring” annotations
     • query-time user selection of interesting processors
     Precision: fine-grained lineage tracing
     • goal: trace lineage of individual items within a collection

  9. Abstraction by modularization
     [Figure: dataflow with labels Lucene_query, extract diseases from OMIM, NERecognize, shims]

  10. Abstraction by selection
      [Figure: select]

  11. Abstraction by selection
      [Figure: select]

  12. Focusing – processor selection
      P4 is the only interesting processor
      Assume all values atomic
      Query: lineage(P7V_O, {P4})
      Goal:
      • avoid recursive queries on instance tables
      Idea:
      • use recursion on static model to generate a targeted query
      • execute query only once
      [Figure: the test dataflow annotated with values: P1V_I1 = a1, P1V_I2 = a2, P1V_O1 = b, P4V_I1 = b, P2V_I1 = b, P7V_O = g]
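
The query-generation idea on this slide can be sketched in Python as follows. This is an illustration under several assumptions, not the authors' implementation: the wiring in `UPSTREAM` (P1 feeding both P2 and P4, P2 feeding P3, P4 feeding P5, P3 and P5 feeding P6, P6 feeding P7) is read off the figure and may not match it exactly, the `port_values` table is a hypothetical instance table, and the query is assumed to return the input bindings of the selected processors. The recursion touches only the static model; the instance table is hit by a single generated, non-recursive query.

```python
import sqlite3

# Static model (assumed wiring): each port maps to the ports it depends on.
UPSTREAM = {
    "P7V_O": ["P7V_I"], "P7V_I": ["P6V_O1"],
    "P6V_O1": ["P6V_I1", "P6V_I2"],
    "P6V_I1": ["P3V_O1"], "P6V_I2": ["P5V_O1"],
    "P3V_O1": ["P3V_I1"], "P3V_I1": ["P2V_O1"],
    "P2V_O1": ["P2V_I1"], "P2V_I1": ["P1V_O1"],
    "P5V_O1": ["P5V_I1"], "P5V_I1": ["P4V_O1"],
    "P4V_O1": ["P4V_I1"], "P4V_I1": ["P1V_O1"],
    "P1V_O1": ["P1V_I1", "P1V_I2"],
}


def target_ports(port, selected, seen=None):
    """Walk the static model upstream; collect input ports of selected processors."""
    seen = set() if seen is None else seen
    if port in seen:
        return set()
    seen.add(port)
    processor = port.split("V_")[0]
    found = {port} if processor in selected and "V_I" in port else set()
    for upstream_port in UPSTREAM.get(port, []):
        found |= target_ports(upstream_port, selected, seen)
    return found


def generate_query(port, selected):
    """Turn lineage(port, selected) into one flat SQL query plus its parameters."""
    ports = sorted(target_ports(port, selected))
    placeholders = ", ".join("?" for _ in ports)
    sql = ("SELECT port, value FROM port_values "
           f"WHERE run_id = ? AND port IN ({placeholders})")
    return sql, ports


# Hypothetical instance table holding the values shown on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE port_values (run_id TEXT, port TEXT, value TEXT)")
conn.executemany("INSERT INTO port_values VALUES (?, ?, ?)", [
    ("r1", "P1V_I1", "a1"), ("r1", "P1V_I2", "a2"), ("r1", "P1V_O1", "b"),
    ("r1", "P4V_I1", "b"), ("r1", "P2V_I1", "b"), ("r1", "P7V_O", "g"),
])

sql, ports = generate_query("P7V_O", {"P4"})
rows = conn.execute(sql, ["r1", *ports]).fetchall()
```

Because the traversal runs over the workflow definition rather than over stored values, its cost depends on the size of the dataflow, not on the amount of lineage data recorded.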

  13. Precision: elements within collections
      Problem: xform() also applies to list values
      • It may be impossible to trace individual elements
        – “which pathways (out) depend on which genes (in)”?
      Goal: extend the query generation idea just sketched to trace element-level lineage within collections
      Approach: exploit static typing of Taverna processors
      [Figure: two links from P1 to P2]
      • matching types: P1V_O : l(s) = [a, b, c] feeds P2V_I : l(s) = [a, b, c]
      • Taverna resolves mismatches on nesting levels: P1V_O : l(s) = [a, b, c] feeds P2V_I : s, evaluated as (map P2 [a,b,c])
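
A rough Python sketch of the nesting-level mismatch case: when a port declared for a single value receives a list, the processor is mapped over the list, and each output element can be traced back to the input element at the same index. The helper names `depth` and `invoke` and the index-path representation are assumptions made for this illustration, not Taverna API.

```python
def depth(value):
    """Nesting level: 0 for an atomic value, 1 for a list of atoms, and so on."""
    if not isinstance(value, list):
        return 0
    return 1 + (depth(value[0]) if value else 0)


def invoke(proc, declared_depth, value, path=()):
    """Apply proc, iterating implicitly while value is nested deeper than declared.

    Returns (output, lineage); lineage maps the index path of each output
    element to the index path of the input element it was computed from.
    """
    if depth(value) <= declared_depth:
        return proc(value), {path: path}
    outputs, lineage = [], {}
    for i, item in enumerate(value):
        out, item_lineage = invoke(proc, declared_depth, item, path + (i,))
        outputs.append(out)
        lineage.update(item_lineage)
    return outputs, lineage


# P2 expects a single string (depth 0) but receives a list: (map P2 [a, b, c]).
mapped_out, mapped_lineage = invoke(str.upper, 0, ["a", "b", "c"])

# A processor that declares a list input (depth 1) consumes the list whole.
whole_out, whole_lineage = invoke(len, 1, ["a", "b", "c"])
```

In the mapped case the lineage is element-level because the iteration is performed by the engine, so no annotation on the processor is needed.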

  14. Loss of precision in transformations
      “lossless” transformations:
      • PV_I : s = a to PV_O : s = a'
      • PV_I : s = a to PV_O : l(s) = [x, y, z]
      lossy:
      • PV_I : l(s) = [a, b, c] to PV_O : s = x, with x → [a, b, c]
        – possible behaviours: selection of an element, aggregation function f()
        – useful annotation: lineage(PV_O) = f(PV_I)
      • PV_I : l(s) = [a, b, c] to PV_O : l(s) = [x, y], with x → [a, b, c], y → [a, b, c]
      • PV_I : l(s) = [a, b, c] to PV_O : l(s) = [a', b', c']
        – only useful annotation: P is index-preserving: PV_O[i] = PV_I[i]
        – lineage(PV_O[i]) = PV_I[i]
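
The effect of the index-preserving annotation on a list-to-list processor can be sketched as below. The function name `element_lineage` and its signature are hypothetical; the sketch only contrasts the default answer (every output element depends on every input element) with the answer that the annotation allows.

```python
def element_lineage(n_in, n_out, index_preserving=False):
    """Map each output index to the set of input indices it may depend on."""
    if index_preserving:
        if n_in != n_out:
            raise ValueError("index-preserving processors keep the list length")
        # lineage(PV_O[i]) = PV_I[i]
        return {i: {i} for i in range(n_out)}
    # Black box: all outputs depend on all inputs.
    return {i: set(range(n_in)) for i in range(n_out)}


lossy = element_lineage(3, 2)            # [a, b, c] to [x, y]
precise = element_lineage(3, 3, True)    # [a, b, c] to [a', b', c']
```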

  15. Cooperative processors
      – Passive processors do not contribute explicit provenance info
      – Cooperative processors actively feed metadata to the lineage service
      [Figure: two processors, PV_I : l(s) = [a, b, c] to PV_O : s = x, and PV_I : l(s) = [a, b, c] to PV_O : l(s) = [x, y]]
      Static annotations:
      • aggregation f()
      • PV_O[i] = PV_I[i]
      Dynamic annotations:
      • selection: x = PV_I[i]
      • sorting: PV_O = Π(PV_I)
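
As an illustration of dynamic annotations, the two hypothetical processors below return their result together with the metadata a cooperative processor might feed to the lineage service: the permutation Π applied by a sort, or the index i chosen by a selection. The function names and return shapes are assumptions made for this sketch.

```python
def cooperative_sort(values):
    """Sort values and report the permutation: output[k] came from values[perm[k]]."""
    perm = sorted(range(len(values)), key=lambda i: values[i])
    return [values[i] for i in perm], perm


def cooperative_select(values, predicate):
    """Return the first matching element and the index it was taken from."""
    for i, value in enumerate(values):
        if predicate(value):
            return value, i
    raise LookupError("no element satisfies the predicate")


sorted_values, perm = cooperative_sort(["c", "a", "b"])
selected, index = cooperative_select([3, 8, 5], lambda v: v > 4)
```

Unlike a static annotation, this metadata depends on the actual input values, so it has to be produced at run time by the processor itself.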

  16. Other annotations
      [Figure: processor P with inputs PV_I1, PV_I2, PV_I3 and output PV_O]
      • Distinction between configuration and input data
        – PV_I3 is a configuration parameter
        – compare effect of different config. across multiple runs
      • specific functional dependencies: [PV_I1, PV_I2] → PV_O
      • stateless processor
        – execute process ↔ retrieve provenance
      More evaluation needed on these
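
One way to picture the configuration annotation: if PV_I3 is marked as configuration, runs over the same input data can be lined up to compare the effect of different settings. The sketch below is hypothetical throughout (the `CONFIG_PORTS` set, the run records, and the values are invented for illustration).

```python
from collections import defaultdict

# Assumed annotation: PV_I3 carries configuration, the other ports carry data.
CONFIG_PORTS = {"PV_I3"}


def split_bindings(bindings):
    """Separate configuration bindings from input-data bindings."""
    config = {p: v for p, v in bindings.items() if p in CONFIG_PORTS}
    data = {p: v for p, v in bindings.items() if p not in CONFIG_PORTS}
    return config, data


def outputs_by_config(runs):
    """For each distinct input data set, map configuration to output across runs."""
    table = defaultdict(dict)
    for bindings, output in runs:
        config, data = split_bindings(bindings)
        data_key = tuple(sorted(data.items()))
        config_key = tuple(sorted(config.items()))
        table[data_key][config_key] = output
    return dict(table)


runs = [
    ({"PV_I1": "x", "PV_I2": "y", "PV_I3": "strict"}, "o1"),
    ({"PV_I1": "x", "PV_I2": "y", "PV_I3": "lenient"}, "o2"),
]
comparison = outputs_by_config(runs)
```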

  17. Towards a 2 tier provenance model
      Query: “describe the derivation of each pathway through Kegg, in which gene g is involved”
      [Figure: two-tier architecture]
      • semantic tier: semantic overlays, reference ontologies, resource annotations
      • structural tier: lineage service, lineage database (RDB), dataflow topology + annotations, raw lineage events
      • Taverna runtime: dataflow executions over processors P1 to P6

  18. Conclusions
      A data lineage model for Taverna workflows
      • Raw lineage data has shortcomings
      • A few, selected lightweight annotations added in a principled way
        – win-win:
        – helpful to users
        – and enable query optimization
      • Form the base layer in a broader approach to efficient querying of semantic provenance for e-science
      • Ongoing implementation
