A Provenance Model for Manually Curated Data
James Cheney
Joint work with Peter Buneman, Adriane Chapman, and Stijn Vansummeren
IPAW 2006, May 4, 2006, Chicago, Illinois

Curated databases
- Many scientific databases (especially in bioinformatics) are constructed largely “by hand”, as opposed to by a fixed, automatic process such as a view or workflow.
- We call such DBs (manually) curated.
[Figure: data is copied from source DBs S1, S2, S3 into the curated DB, and inserted from papers and reference books.]

State of practice
- Currently, curators manually add links (e.g. URLs) from copied data to relevant source(s).
- Drawbacks:
  - Time consuming
  - Error prone
  - Danger of link rot (if remote database/Web site changes structure)
  - No support for provenance-based queries
- Can we provide automated support for this process?
- First step: develop a coherent data model for provenance information describing the curation process.

Constraints
This is a highly constrained problem: a good solution should
- be decentralized
- be data model-independent
- require minimal changes to curator practices
- require minimal changes to DB systems
- be robust in the face of changes to DB structure
- scale gracefully to multiple cooperating DBs
- be efficient/scale to large DBs
These are the most important factors for immediate applicability to manually curated data.

Prior work
- Most approaches to provenance consider static data.
- In databases, provenance has been investigated for queries/views of a fixed database.
- In scientific computation, provenance is defined for workflows that construct new data from existing data.
- Prior work does not consider dynamic data that can be updated, copied, or deleted.

Approach
- To simplify matters, we consider only a single dynamic database with several static source databases.
- We also view databases abstractly as mappings from locations (“keys”) to values.
- There are many possible instantiations of this framework:
  - Table names/keys/field names addressing data in an RDBMS
  - XPointers addressing data in XML documents
  - Line/column numbers addressing data in text files
  - (x,y) coordinates addressing data in images
- For concreteness, we’ll deal with paths addressing data in trees.

Update language
We model the curator’s actions in modifying the database as a sequence of “simple” updates:
- Insertion: ins p v means “insert the new location p with value v”
- Deletion: del p means “delete the location p”
- Copy-paste: p := q means “copy the data at q into location p”
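The three updates can be sketched in a few lines of Python, taking the abstract view of a database as a mapping from locations to values. This is only an illustration under stated assumptions: the database is a flat dict keyed by slash-separated paths, interior nodes carry the value None, a copy overwrites whatever was at the target, and the function names and leaf values are hypothetical (the slides give none).

```python
def under(db, p):
    """Return every location at or below path p."""
    return [k for k in db if k == p or k.startswith(p + "/")]

def ins(db, p, v):
    """ins p v: insert the new location p with value v."""
    if p in db:
        raise KeyError(f"location {p!r} already exists")
    db[p] = v

def delete(db, p):
    """del p: delete the location p together with its subtree."""
    doomed = under(db, p)
    if not doomed:
        raise KeyError(f"no such location {p!r}")
    for k in doomed:
        del db[k]

def copy_paste(db, p, q):
    """p := q: copy the data at q (and its subtree) into location p."""
    # Snapshot the source first so overlapping paths cannot interfere.
    src = {k[len(q):]: db[k] for k in under(db, q)}
    if not src:
        raise KeyError(f"no such location {q!r}")
    for k in under(db, p):      # assumption: the copy overwrites the target
        del db[k]
    for suffix, v in src.items():
        db[p + suffix] = v

# Hypothetical starting tree: a (with child d), b, c.
db = {"a": None, "a/d": 1, "b": 2, "c": 3}
ins(db, "a/e", None)
copy_paste(db, "a/e", "c")
copy_paste(db, "b", "a")
delete(db, "a")
print(sorted(db))
```

After these four updates the tree consists of b (with children d and e) and c.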

History
- A history is a sequence of DB versions, together with provenance links indicating where the data in each version “came from”.
- We can refine a history by grouping update operations into transactions.
- Example: the updates ins a/e, a/e := c, b := a, del a grouped into two transactions:
  - transaction 1: ins a/e; a/e := c
  - transaction 2: b := a; del a
[Figure: three versions of a tree. Initially a (child d), b, c; after transaction 1, a (children d, e), b, c; after transaction 2, b (children d, e), c.]
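One way to derive per-transaction provenance links is to remember, for each current location, where its data sat when the transaction began. The sketch below does this for the two example transactions. It is an assumption-laden illustration rather than the authors' implementation: the flat path-keyed dict, the tuple encoding of updates, the function names, the leaf values, and the choice to emit a link for every location (interior nodes included) are all hypothetical.

```python
def under(db, p):
    """Return every location at or below path p."""
    return [k for k in db if k == p or k.startswith(p + "/")]

def run_transaction(db, tid, updates):
    """Apply updates to db in place and return links (tid, from, to).

    from is None for inserted data; to is None for deleted locations.
    """
    before = set(db)
    origin = {p: p for p in db}   # current location -> location at start
    for op, *args in updates:
        if op == "ins":
            p, v = args
            db[p] = v
            origin[p] = None
        elif op == "del":
            (p,) = args
            for k in under(db, p):
                del db[k]
                del origin[k]
        elif op == "copy":        # encodes p := q
            p, q = args
            src = {k[len(q):]: (db[k], origin[k]) for k in under(db, q)}
            for k in under(db, p):
                del db[k]
                del origin[k]
            for suffix, (v, o) in src.items():
                db[p + suffix] = v
                origin[p + suffix] = o
        else:
            raise ValueError(f"unknown update {op!r}")
    links = [(tid, origin[p], p) for p in sorted(db)]
    links += [(tid, p, None) for p in sorted(before - set(db))]
    return links

db = {"a": None, "a/d": 1, "b": 2, "c": 3}   # hypothetical values
prov = []
prov += run_transaction(db, 1, [("ins", "a/e", None), ("copy", "a/e", "c")])
prov += run_transaction(db, 2, [("copy", "b", "a"), ("del", "a")])
for row in prov:
    print(row)
```

Because links describe the net effect of a transaction, the intermediate insertion of a/e leaves no trace: at the end of transaction 1 the data at a/e is recorded as coming from c.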

Provenance data model
The provenance data can be stored as a table Prov(Tid, From, To).
Running example: transaction 1 is ins a/e; a/e := c and transaction 2 is b := a; del a.

Prov
Tid  From  To
1    c     c
1    c     a/e
1    a/d   a/d
2    c     c
2    a/d   b/d
2    a/e   b/e
2    a     NULL
2    a/d   NULL
2    a/e   NULL

Provenance data model
Additional data can be stored in a side table Trans(Tid, Uid, Time, ...).

Trans
Tid  Uid      Time
1    jcheney  Tue Apr 18 10:47 AM
2    jcheney  Tue Apr 18 12:37 PM

What can we do with this information?
- Since Prov and Trans are standard relational tables, we can formulate many provenance queries as relational queries.
- Example: “Data was copied from p to q during transaction t”
    Copied(t, p, q) ← Prov(t, p, q), p ≠ q
- Example: “Data at p was inserted during transaction t”
    Inserted(t, p) ← Prov(t, NULL, p)
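The two rules translate directly into SQL. The sketch below uses Python's built-in sqlite3 module with an in-memory table. The column names src and dst stand in for From and To (which are awkward as SQL identifiers), the rows are illustrative, and the transaction 3 row is a hypothetical insertion added only so that Inserted has something to return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Prov (tid INTEGER, src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO Prov VALUES (?, ?, ?)",
    [
        (1, "c", "c"), (1, "c", "a/e"), (1, "a/d", "a/d"),
        (2, "c", "c"), (2, "a/d", "b/d"), (2, "a/e", "b/e"),
        (2, "a", None), (2, "a/d", None), (2, "a/e", None),
        (3, None, "f"),   # hypothetical: location f inserted in transaction 3
    ],
)

# Copied(t, p, q) <- Prov(t, p, q), p != q
# In SQL, a comparison with NULL is never true, so insertion and
# deletion rows are excluded here without an explicit IS NOT NULL test.
copied = conn.execute(
    "SELECT tid, src, dst FROM Prov WHERE src <> dst ORDER BY tid, dst"
).fetchall()

# Inserted(t, p) <- Prov(t, NULL, p)
inserted = conn.execute(
    "SELECT tid, dst FROM Prov WHERE src IS NULL ORDER BY tid, dst"
).fetchall()

print(copied)
print(inserted)
```

Treating insertions and deletions as outside Copied is an interpretation of the rule; read literally with NULL as an ordinary constant, p ≠ q would also admit those rows.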

A query example
- Example: “Data at l at end of tid was originally inserted during transaction u”
    Q(l, tid, tid) ← Inserted(tid, l).
    Q(l, tid, u) ← Prov(tid, m, l), Q(m, tid − 1, u).
- Query: Q(l, 3, u) ⇒ u = 1
[Figure: three database versions (1, 2, 3) with values at locations l, m, n, o. The value "foo" at l in version 3 is traced back through Prov(3, m, l) and Prov(2, n, m) to Inserted(1, n).]
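The recursive query amounts to walking provenance links backwards one transaction at a time until an insertion is reached. A minimal Python sketch follows; it assumes links are tuples (tid, from, to) with None standing for NULL, that each location has at most one incoming link per transaction, and that the function name is hypothetical.

```python
def first_inserted(prov, loc, tid):
    """Return the transaction in which the data found at loc at the end
    of tid was originally inserted, or None if it cannot be traced."""
    while tid >= 1:
        sources = [src for (t, src, dst) in prov if t == tid and dst == loc]
        if not sources:
            return None            # no link recorded for loc in tid
        src = sources[0]
        if src is None:
            return tid             # Inserted(tid, loc)
        loc, tid = src, tid - 1    # follow Prov(tid, src, loc) backwards
    return None

# Links taken from the figure annotations:
# Inserted(1, n), Prov(2, n, m), Prov(3, m, l).
prov = [(1, None, "n"), (2, "n", "m"), (3, "m", "l")]
print(first_inserted(prov, "l", 3))
```

The loop relies on every surviving location having a link in every transaction, which holds for the full Prov relation but not for a store that omits unchanged data.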

Challenging issues
We believe the following issues are the most important for evaluating a solution (in order of importance):
1. Minimizing the impact of provenance tracking on curation performance
2. Minimizing the space required for storing provenance data
3. Providing efficient & expressive provenance querying facilities
The order reflects that provenance tracking must be performed at every step, whereas provenance queries are relatively rare.

Example: efficient storage
- The provenance relation defined above contains edges for unchanged data (e.g. Prov(1, c, c), Prov(2, c, c)).
- Updates usually modify only a small part of the data, so this is wasteful.
- If we explicitly store only provenance edges that involve changes, such unchanged provenance links can always be inferred.
- For tree-structured data, further optimizations are possible, since the provenance of a child can often be inferred from its parent.
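Both inferences can be illustrated with a small lookup routine. The slides do not spell out the inference rules, so the following is a guess at one reasonable scheme: only changed edges are stored, keyed by (tid, destination); a location with no stored edge inherits from its nearest ancestor that has one; and a location with no stored edge at all is taken to be unchanged. Deletions are left out of the sketch, and all names are hypothetical.

```python
def source_of(stored, tid, loc):
    """Infer where the data at loc came from in transaction tid.

    stored maps (tid, dst) -> src and holds explicit changes only.
    Returns None if the data was inserted in tid.
    """
    parts = loc.split("/")
    # Try loc itself, then successively shorter ancestor paths.
    for i in range(len(parts), 0, -1):
        ancestor = "/".join(parts[:i])
        if (tid, ancestor) in stored:
            src = stored[(tid, ancestor)]
            if src is None:
                return None                  # inserted along with its ancestor
            return src + loc[len(ancestor):]  # same relative position in the source
    return loc                               # no change recorded: inferred identity

# Explicit edges for the running example: a/e := c in transaction 1,
# b := a in transaction 2. Nothing is stored for unchanged data.
stored = {(1, "a/e"): "c", (2, "b"): "a"}

print(source_of(stored, 1, "c"))     # unchanged, inferred
print(source_of(stored, 2, "b/d"))   # inferred from the parent b
```

Here two stored edges stand in for the links of the full relation: identity links such as Prov(1, c, c) and child links such as the one for b/d are recomputed on demand.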

Current & future work
- Have implemented a prototype system along with experimental evaluation
- Proof-of-concept for efficient provenance tracking and storage
- Next steps:
  - Non-intrusive techniques for collecting provenance via user browsing/form submission actions
  - Larger scale experiments with more realistic data
  - Techniques for handling “bulk” queries and updates
  - Integrating with “workflow” provenance techniques
  - Combining/querying provenance records involving multiple databases
