Policy-Based Integration of Provenance Metadata (Ashish Gehani)

SLIDE 1

Policy-Based Integration of Provenance Metadata

Ashish Gehani, Dawood Tariq, Basim Baig (SRI International)
Tanu Malik (University of Chicago)

Policy-Based Integration of Provenance Metadata – p. 1/14

SLIDE 2

Background

Traditional science:
  Human hypotheses
  Experimental, theoretical verification

Computational paradigm:
  Automated exploration
  Machine verification

Provenance tracking need:
  Reproducibility (execution context)
  Data sharing (dependencies)
  Scenario analysis (application profiles)

SLIDE 3

Open Provenance Model

Vertex types:
  Artifact
  Process
  Agent

Edge types:
  Used
  WasGeneratedBy
  WasDerivedFrom
  WasControlledBy
  WasTriggeredBy

Domain semantics in annotations
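The vertex and edge types above can be sketched as a small data model. This is a minimal illustration, not the authors' implementation: the class names `Vertex` and `Edge`, the `EDGE_ENDPOINTS` table, and the sample annotation keys are hypothetical. The endpoint types follow the Open Provenance Model convention that an edge points from an effect to its cause.

```python
from dataclasses import dataclass, field
from enum import Enum


class VertexType(Enum):
    ARTIFACT = "Artifact"
    PROCESS = "Process"
    AGENT = "Agent"


# (source type, destination type) for each edge type; edges point from effect to cause.
EDGE_ENDPOINTS = {
    "Used": (VertexType.PROCESS, VertexType.ARTIFACT),
    "WasGeneratedBy": (VertexType.ARTIFACT, VertexType.PROCESS),
    "WasDerivedFrom": (VertexType.ARTIFACT, VertexType.ARTIFACT),
    "WasControlledBy": (VertexType.PROCESS, VertexType.AGENT),
    "WasTriggeredBy": (VertexType.PROCESS, VertexType.PROCESS),
}


@dataclass
class Vertex:
    vtype: VertexType
    # Domain semantics live in free-form key-value annotations.
    annotations: dict = field(default_factory=dict)


@dataclass
class Edge:
    etype: str
    src: Vertex
    dst: Vertex
    annotations: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject edges whose endpoints do not match the edge type.
        expected = EDGE_ENDPOINTS[self.etype]
        if (self.src.vtype, self.dst.vtype) != expected:
            raise ValueError(f"{self.etype} expects {expected[0].value} -> {expected[1].value}")


# Usage: a process that read one file (annotation keys are illustrative only).
proc = Vertex(VertexType.PROCESS, {"name": "makeblastdb"})
data = Vertex(VertexType.ARTIFACT, {"filename": "influenza.faa"})
used = Edge("Used", proc, data)
```

Keeping the type vocabulary small and pushing everything else into annotations is what lets the same graph carry operating system, workflow, and application records.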

SLIDE 4

Integration Need

Genome analysis provenance record

[Figure: genome analysis pipeline whose record combines operating system provenance, workflow provenance, application provenance, and curated provenance. Labeled components: NCBI, TIGR, PDB, Swiss-Prot, GADU Server, Pegasus Planner, PFAM, BLOCKS, BLAST, THMM, 300 Nodes, Globus Node (x3), Comparative Analysis, Data Ingestion, JGI Database.]

SLIDE 5

Integration Issues

Metadata variation:
  Abstraction levels
  Completeness
  Identifiers
  Semantics

Querying requires:
  Record assembly
  Reconciling syntax
  Mapping semantics

SLIDE 6

Provenance Middleware

Support for Provenance Auditing in Distributed Environments

SLIDE 7

Implementing Policy

Annotations are key-value pairs
Filters operate on vertex / edge stream
Arbitrary transformations possible

[Figure: a Policy Filter sits in the stream, receiving putVertex() / putEdge() calls and issuing putVertex() / putEdge() calls downstream.]
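A filter of this kind can be sketched as an object that accepts putVertex() / putEdge() calls, transforms the annotations, and forwards the result. This is a minimal sketch under stated assumptions: only the method names putVertex() and putEdge() come from the slide, while the classes `Sink` and `RenameKeyFilter`, the dictionary representation, and the annotation keys are hypothetical and do not reproduce any real middleware API.

```python
class Sink:
    """Terminal stage (hypothetical) that records whatever reaches it."""

    def __init__(self):
        self.vertices = []
        self.edges = []

    def putVertex(self, vertex):
        self.vertices.append(vertex)

    def putEdge(self, edge):
        self.edges.append(edge)


class RenameKeyFilter:
    """Hypothetical policy filter that reconciles syntax by renaming one annotation key."""

    def __init__(self, downstream, old_key, new_key):
        self.downstream = downstream
        self.old_key = old_key
        self.new_key = new_key

    def _rewrite(self, annotations):
        # Copy so that the caller's dictionary is never modified.
        out = dict(annotations)
        if self.old_key in out:
            out[self.new_key] = out.pop(self.old_key)
        return out

    def putVertex(self, vertex):
        self.downstream.putVertex(self._rewrite(vertex))

    def putEdge(self, edge):
        self.downstream.putEdge(self._rewrite(edge))


# Usage: one source reports "filename", the integrated record expects "path".
sink = Sink()
policy = RenameKeyFilter(sink, "filename", "path")
policy.putVertex({"type": "Artifact", "filename": "influenza.faa"})
policy.putEdge({"type": "Used", "time": "t0"})
```

Because the filter exposes the same two methods as the stage it feeds, filters can be chained, which is one way to read "arbitrary transformations possible".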

SLIDE 8

Sample Workload

BLAST (sequence alignment)

Influenza data:

ftp://ftp.ncbi.nlm.nih.gov/genomes/INFLUENZA/influenza.faa

Database construction:

makeblastdb -in influenza.faa -parse_seqids -hash_index -out outputdb

SLIDE 9

Aggregation Policies

Policy   Version created after
ALL      Every write
SEQ      Each sequence of read() or write()
CA       Cycle avoidance (Harvard)
GF       Graph finesse (Harvard)
OC       open() - close() pair
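To make the difference between policies concrete, the sketch below counts how many artifact versions three of them would create for one file's event trace. It is an illustration under loud assumptions, not the filters measured in the talk: the function names and the trace format are hypothetical, SEQ is interpreted as "one version per maximal run of consecutive writes, where a read ends the run", and OC is interpreted as "one version per matched open() and close()". CA and GF are omitted because the slide does not describe their rules.

```python
from itertools import groupby


def versions_all(events):
    """ALL: a new version after every write."""
    return sum(1 for e in events if e == "write")


def versions_seq(events):
    """SEQ (assumed reading): one version per maximal run of consecutive writes."""
    io = [e for e in events if e in ("read", "write")]
    # groupby collapses each run of identical consecutive operations into one group.
    return sum(1 for op, _ in groupby(io) if op == "write")


def versions_oc(events):
    """OC (assumed reading): one version per matched open()/close() pair."""
    count = 0
    is_open = False
    for e in events:
        if e == "open":
            is_open = True
        elif e == "close" and is_open:
            count += 1
            is_open = False
    return count


# Hypothetical trace of system calls on a single file.
trace = ["open", "write", "write", "read", "write", "close",
         "open", "write", "close"]

counts = {
    "ALL": versions_all(trace),
    "SEQ": versions_seq(trace),
    "OC": versions_oc(trace),
}
```

For this trace the counts are 4 for ALL, 2 for SEQ, and 2 for OC, which shows how a coarser policy trades version detail for fewer artifact vertices.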

SLIDE 10

Policy Size

Size of different aggregation policy filters.

Policy   Lines of code
ALL      38
SEQ      73
CA       63
GF       80
OC       6

SLIDE 11

Artifact Vertices

Effect of aggregation policies over time.

SLIDE 12

Used Edges

Effect of aggregation policies over time.

SLIDE 13

WasGeneratedBy Edges

Effect of aggregation policies over time.

SLIDE 14

Conclusion

Provenance integration policy matters

Substantial impact:
  Runtime: CPU, memory
  Persistent: storage, querying

Acknowledgement: US NSF Grant OCI-0722068
URL: http://spade.csl.sri.com
Code license: GPLv3
Email: ashish.gehani@sri.com

Questions?
