Putting Lipstick on Pig: Enabling Database-style Workflow Provenance
Yael Amsterdamer, Susan B. Davidson, Daniel Deutch, Tova Milo, Julia Stoyanovich, Val Tannen
Presented by Guozhang Wang
DB Lunch, Apr. 23rd, 2012
A Story of How Research
A short time ago, somewhere in the
Motivated by Scientific Workflows
Annotated directed acyclic graph
in new artifacts
Aims to capture causal dependencies
Each process is treated as a “black-box”
On the other side of the Globe …
Motivated by probabilistic databases, data warehousing, ...
K-relations
provenance “token”
Operations:
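q(x,z) :- R(x,_,z), R(_,_,z)
q(x,z) :- R(x,y,_), R(_,y,z)

R (input, each tuple annotated with a provenance token):
a  b  c  |  p
d  b  e  |  r
f  g  e  |  s

q(R) (output, each tuple annotated with a provenance polynomial):
a  c  |  2p^2
a  e  |  pr
d  c  |  pr
d  e  |  2r^2 + rs
f  e  |  2s^2 + rs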
Slide borrowed from Green et al.
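The annotated example above can be reproduced with a small sketch. It assumes provenance polynomials are represented as a Counter from monomials (sorted tuples of tokens) to coefficients. The helper names token, plus, times and evaluate are hypothetical and chosen only for illustration; they are not taken from the slides or from the Lipstick system.

```python
from collections import Counter
from itertools import product

# A polynomial maps a monomial (sorted tuple of tokens) to its coefficient.
def token(name):
    return Counter({(name,): 1})

def plus(a, b):
    """Semiring addition: alternative derivations of the same tuple."""
    out = Counter(a)
    out.update(b)  # Counter.update adds coefficients
    return out

def times(a, b):
    """Semiring multiplication: joint use of tuples in one derivation."""
    out = Counter()
    for (m1, c1), (m2, c2) in product(a.items(), b.items()):
        out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

# Input relation R with one provenance token per tuple.
R = {
    ("a", "b", "c"): token("p"),
    ("d", "b", "e"): token("r"),
    ("f", "g", "e"): token("s"),
}

def evaluate(rel):
    """Evaluate the union of the two rules for q(x,z) over rel."""
    out = {}
    for (t1, k1), (t2, k2) in product(rel.items(), repeat=2):
        # Rule 1: q(x,z) :- R(x,_,z), R(_,_,z)
        if t1[2] == t2[2]:
            key = (t1[0], t1[2])
            out[key] = plus(out.get(key, Counter()), times(k1, k2))
        # Rule 2: q(x,z) :- R(x,y,_), R(_,y,z)
        if t1[1] == t2[1]:
            key = (t1[0], t2[2])
            out[key] = plus(out.get(key, Counter()), times(k1, k2))
    return out

result = evaluate(R)
```

In this representation the entry for (d, e) carries coefficient 2 on the monomial r·r and coefficient 1 on r·s, which corresponds to 2r^2 + rs in the table.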
Data Provenance Researchers Workflow Provenance Researchers
The black-box assumption: each output of
So: replace it with Semirings!
General workflow modules is
However, modules written in Pig Latin is
Let us write a paper, woho!
Data: unordered (nested) bag of tuples
Operators:
ReqModel: { Model }
Inventory: { CarId, Model }
SoldInventory: { CarId, Model, BidId }
CarsByModel: { Model, { CarId } }
SoldByModel: { Model, { CarId, BidId } }
NumCarsByModel: { Model, NumAvail }
NumSoldByModel: { Model, NumSold }
AllInfoByModel: { UserId, BidId, Model, NumA, NumS }
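As a rough illustration of how the nested schemas relate, the sketch below derives CarsByModel and NumCarsByModel from Inventory. The slides give only the schemas, so the rows are invented sample data, and the function names group_by_model and count_by_model are hypothetical stand-ins for a GROUP step and a FOREACH with aggregation.

```python
from collections import defaultdict

# Hypothetical sample rows for Inventory: { CarId, Model }.
inventory = [("C1", "Civic"), ("C2", "Civic"), ("C3", "Accord")]

def group_by_model(rows):
    """GROUP-like step: Model -> nested bag of CarIds (CarsByModel)."""
    groups = defaultdict(list)
    for car_id, model in rows:
        groups[model].append(car_id)
    return dict(groups)

def count_by_model(groups):
    """Aggregation step: Model -> NumAvail (NumCarsByModel)."""
    return {model: len(cars) for model, cars in groups.items()}

cars_by_model = group_by_model(inventory)
num_cars_by_model = count_by_model(cars_by_model)
```

The nested bag produced by grouping is what the aggregation consumes, which is why CarsByModel appears in the schema list between Inventory and NumCarsByModel.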
Provenance node and value nodes
State nodes
FOREACH (projection, no OP)
JOIN
GROUP
FOREACH (aggregation, OP)
COGROUP
FOREACH (UDF Black Box)
Zoom-In vs. Zoom-Out
Coarse-grained vs. Fine-grained
Deletion Propagation
All its in-edges are deleted
It has label • and one of its in-edges is deleted
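The two conditions above can be sketched as a fixpoint computation over a small graph. The sketch rests on several assumptions that the slides do not state: an edge counts as deleted once its source node is deleted, nodes without in-edges are deleted only when explicitly requested, and the label "*" stands in for •. The function name propagate_deletion and the sample graph are hypothetical.

```python
def propagate_deletion(labels, edges, deleted):
    """Return every node deleted after propagating from the initial set.

    labels: node -> label ("*" stands in for the slide's bullet label)
    edges: list of (source, target) pairs
    deleted: nodes deleted up front
    """
    in_edges = {node: [] for node in labels}
    for src, dst in edges:
        in_edges[dst].append(src)

    gone = set(deleted)
    changed = True
    while changed:  # repeat until no further node is deleted
        changed = False
        for node, preds in in_edges.items():
            if node in gone or not preds:
                continue  # source nodes are never deleted implicitly (assumption)
            dead = [p in gone for p in preds]
            # Condition 1: all in-edges deleted.
            # Condition 2: label "*" and at least one in-edge deleted.
            if all(dead) or (labels[node] == "*" and any(dead)):
                gone.add(node)
                changed = True
    return gone

# Hypothetical graph: three tokens, three "*" nodes, one "+" node.
labels = {"p": "token", "r": "token", "s": "token",
          "pr": "*", "rs": "*", "ss": "*", "sum": "+"}
edges = [("p", "pr"), ("r", "pr"), ("r", "rs"), ("s", "rs"),
         ("s", "ss"), ("ss", "sum"), ("rs", "sum")]

after_r = propagate_deletion(labels, edges, {"r"})
after_s = propagate_deletion(labels, edges, {"s"})
```

Deleting r removes both "*" nodes that use it but leaves the "+" node alive, since one of its in-edges survives. Deleting s removes every input of the "+" node, so that node goes as well.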
Lipstick prototype
with the graph written to files
memory.
Benchmark data
size
Overhead increases with execution time
Parallelism helps, up to the number of modules
Increase with graph size
Feasible with various sizes
Queries are answered efficiently, in sub-second time
Data provenance ideas such as Semirings
No second conclusion, sorry ..
The introduction of
applications
back to workflow apps