SLIDE 1 A graph model for data and workflow provenance
Umut Acar, Peter Buneman, James Cheney, Natalia Kwasnikowska, Jan van den Bussche, & Stijn Vansummeren TaPP 2010
SLIDE 2 Provenance in ...
- Databases
- Mainly for (nested) relational model
- Where ("source location")
- Lineage, why ("witnesses")
- How/semiring model
- Relatively formal
- Workflows
- Many different systems
- Many different models
- (converging on OPM?)
- Graphs/DAGs
- Relatively informal
SLIDE 3 Provenance in ...
?????
SLIDE 4 This talk
- Relate database & workflow "styles"
- Develop a common graph formalism
- Need a common, expressive language that
- supports many database queries
- describes some (simple) workflows
SLIDE 5 Previous work
- Dataflow calculus (DFL), based on nested relational calculus (NRC)
- Provenance "run" model by Kwasnikowska & Van den Bussche (DILS 07, IPAW 08)
- "Provenance trace" model for NRC
- by Acar, Ahmed & Cheney ('08)
- Open Provenance Model (bipartite graphs)
- (Moreau et al. 2008-9), used in many WF systems
SLIDE 6 NRC/DFL background
- A very simple, functional language:
- basic functions +, *,... & constants 0,1,2,3...
- variables x,y,z
- pair/record types: (A:e,...,B:e), π_A(e)
- collection (set) types: {e,...}, e ∪ e, {e | x in e'}, ∪e
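Sketch (not from the talk): the constructs above have rough analogues in most general-purpose languages. The Python below is only an illustration; the variable names and the `flatten` helper are made up for this example, and Python itself is an assumption (the talk's implementation is in Haskell).

```python
# Rough Python analogues of the NRC/DFL constructs (illustration only).
record = {"A": 1, "B": 2}                          # record (A:e, B:e)
proj_a = record["A"]                               # field lookup pi_A(e)

singleton = frozenset({3})                         # singleton set {e}
union = frozenset({1, 2}) | singleton              # union e ∪ e
comprehension = frozenset(x + 1 for x in union)    # iteration {e | x in e'}


def flatten(sets):
    """Big union ∪e: merge a set of sets into a single set."""
    return frozenset(x for s in sets for x in s)


flat = flatten(frozenset({frozenset({1, 2}), frozenset({2, 3})}))
```

Frozen sets are used so that sets can be nested inside other sets, as the nested relational model requires.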
SLIDE 13 An example
- Suppose R = {(1,2,3), (4,5,6), (9,8,7)}
sum { x * y | (x,y,z) in R, x < y}
= sum { x * y | (x,y,z) in {(1,2,3), (4,5,6)}}
= sum {1 * 2, 4 * 5}
= sum {2,20}
= 22
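Sketch (not from the talk): the same calculation can be replayed step by step with Python set comprehensions. The intermediate names `selected`, `products` and `total` are made up for this illustration.

```python
# The relation R from the slide, as a set of triples.
R = {(1, 2, 3), (4, 5, 6), (9, 8, 7)}

# Step 1: keep only the tuples with x < y.
selected = {(x, y, z) for (x, y, z) in R if x < y}

# Step 2: compute x * y for each remaining tuple.
# This is a set, so equal products would collapse into one element.
products = {x * y for (x, y, z) in selected}

# Step 3: sum the products.
total = sum(products)
print(total)  # 22
```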
SLIDE 14 Another example
- In DFL, built-in functions / constants can be whole programs & files,
- as in the Provenance Challenge 1 workflow:
let WarpParams := {align_warp(img,hdr) | (img,hdr) in Inputs} in
let Reslices := {reslice(wp) | wp in WarpParams} in
softmean(Reslices)
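Sketch (not from the talk): the shape of this workflow can be mimicked in Python. The three functions below are hypothetical stand-ins that only build tagged tuples; the real align_warp, reslice and softmean are external programs, and the input file names are invented.

```python
def align_warp(img, hdr):
    """Hypothetical stand-in: returns a tagged tuple instead of warp parameters."""
    return ("warp", img, hdr)


def reslice(wp):
    """Hypothetical stand-in: keeps only the image name from the warp tuple."""
    return ("resliced", wp[1])


def softmean(reslices):
    """Hypothetical stand-in: combines all resliced images into one result."""
    return ("atlas", tuple(sorted(r[1] for r in reslices)))


# Invented inputs: a set of (image, header) pairs.
inputs = {("img1", "hdr1"), ("img2", "hdr2")}

# Same structure as the DFL program: two set comprehensions, then softmean.
warp_params = {align_warp(img, hdr) for (img, hdr) in inputs}
reslices = {reslice(wp) for wp in warp_params}
atlas = softmean(reslices)
```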
SLIDE 15
Goal: Define "provenance graphs" for DFL
SLIDE 17 Goal: Define "provenance graphs" for DFL
let WarpParams := {align_warp(img,hdr) | (img,hdr) in Inputs} in
let Reslices := {reslice(wp) | wp in WarpParams} in
softmean(Reslices)
http://www.flickr.com/photos/schneertz/679692806/
SLIDE 18 First step: values
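- Constant node: c
- Record node: <>, with edges labelled A1, ..., An to value nodes v
- Set node: {}, with edges labelled elem to value nodes v
- Copy edge: copy, between two value nodes v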
SLIDE 19 Example value
[Figure: an example value drawn as a graph of <>, {} and constant nodes (1, 2, 3), connected by edges labelled A, B and elem]
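Sketch (not from the talk): one way to turn a nested value into such a graph. The representation (a dict of node labels plus a list of labelled edges) and the sample value are assumptions made for this illustration; the value is not the one on the slide.

```python
from itertools import count


def build_value_graph(value, nodes, edges, ids):
    """Add `value` to the graph and return the id of its root node.

    nodes: dict mapping node id -> label ("<>", "{}", or a constant)
    edges: list of (source id, edge label, target id)
    """
    node_id = next(ids)
    if isinstance(value, dict):                      # record: one edge per field
        nodes[node_id] = "<>"
        for field, sub in sorted(value.items()):
            target = build_value_graph(sub, nodes, edges, ids)
            edges.append((node_id, field, target))
    elif isinstance(value, (set, frozenset)):        # set: one elem edge per element
        nodes[node_id] = "{}"
        for sub in sorted(value, key=repr):
            target = build_value_graph(sub, nodes, edges, ids)
            edges.append((node_id, "elem", target))
    else:                                            # constant
        nodes[node_id] = value
    return node_id


nodes, edges = {}, []
root = build_value_graph({"A": 1, "B": frozenset({2, 3})}, nodes, edges, count())
```

This sketch always builds a tree; it does not share nodes or add copy edges.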
SLIDE 20 Next step: evaluation nodes ("process")
- Constants, primitive functions: node c; node f with edges 1, ..., n to argument expressions e
- Variables & temporary bindings: node x; node let_x with head and body edges to expressions e
SLIDE 21 Pairing
- Record building: node <> with edges A1, ..., An to expressions e
- Field lookup: node π_A with an edge to expression e
SLIDE 22 Conditionals
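- if node with test and then edges to expressions e (then branch taken)
- if node with test and else edges to expressions e (else branch taken)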
Note: Only taken branch is recorded
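Sketch (not from the talk): recording only the taken branch falls out naturally if the branches are passed unevaluated. `eval_if` and its return format are made up for this illustration.

```python
def eval_if(test, then_thunk, else_thunk):
    """Evaluate a conditional; return (value, labels of the recorded edges).

    The branches are zero-argument functions, so the branch that is not
    taken is never run and leaves nothing to record.
    """
    if test:
        return then_thunk(), ["test", "then"]
    return else_thunk(), ["test", "else"]


# The else branch would fail if it were evaluated, but it never is.
value, recorded = eval_if(True, lambda: 1, lambda: 1 / 0)
```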
SLIDE 23 Sets: basic operations
- Empty set: node ∅
- Singleton: node {} with an edge to expression e
- Union: node ∪ with edges 1 and 2 to expressions e
SLIDE 24 Sets: complex
- Flattening: node ∪ with an edge to expression e
- Iteration: node for_x with a head edge to expression e and body edges to copies of the body e
SLIDE 25 Provenance graphs
- are graphs with "both value and evaluation structure"
[Figure: example provenance graphs]
SLIDE 26
A bigger example
[Figure: provenance graph for a larger example]
SLIDE 28 Value structure
[Figure: provenance graph of the larger example; labels shown include {}, <>, C, T, F, 1 and 2]
SLIDE 29 Input values
[Figure: provenance graph of the larger example; labels shown include {}, <>, C, T, F, 1 and 2]
SLIDE 30 Return value
[Figure: provenance graph of the larger example; labels shown include {}, <>, C, T, F, 1 and 2]
SLIDE 32 Expression structure
[Figure: provenance graph of the larger example; labels shown include let R, let S, for x, for y, if, empty, =, +, fst, snd and ∪]
SLIDE 33 Building provenance graphs
- is complicated
- Here we'll use a high-level "graph rewrite rule" formalism
- Mostly because it is nicer to look at than the formal version
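SLIDE 34
[Figure: rewrite rules for constants, primitive functions and let. A constant node c gets the value c; a node f whose arguments 1, ..., n have values v1, ..., vn gets the value f(v1,...,vn); the let_x rule, with head value v and body e containing x, adds copy edges]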
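SLIDE 35
[Figure: rewrite rules for field lookup and record building. π_Ai applied to a record <> with fields A1, ..., An and values v1, ..., vn yields vi via a copy edge; a <> node whose fields have values v1, ..., vn yields a record value <> with edges A1, ..., An]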
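SLIDE 36
[Figure: rewrite rules for conditionals. With test value True, the if node keeps its test and then edges to e1 and the result is linked by a copy edge; with test value False, it keeps its test and else edges to e2]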
SLIDE 37
[Figure: rewrite rules for empty?. Applied to a set {} with elem edges to values v, it yields False; applied to a set {} with no elements, it yields True]
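SLIDE 38
[Figure: rewrite rules for the basic set operations. ∅ yields an empty set value {}; the singleton node applied to a value v yields a set {} with one elem edge to v; ∪ applied to two sets yields a set {} with elem edges to the elements of both]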
SLIDE 39
OK, take a deep breath!
SLIDE 40
[Figure: rewrite rules for iteration and flattening. for_x with head value {v1, ..., vn} and body e gets one body edge per element, with x linked to that element by a copy edge, and a result set {} with one elem edge per body; flattening ∪ applied to a set of sets yields a set {} with elem edges to the elements of the inner sets]
SLIDE 41 An example
[Figure: a for_x node whose head edge points to the set {1, 2} and whose body edge points to the expression x + 1]
SLIDE 43 An example
[Figure: the for_x node now has one copy of the body x + 1 per element; in each copy, x has an edge labelled C, and a new set {} has one elem edge per body copy]
SLIDE 45 An example
[Figure: the constant 1 in each body copy is evaluated to the value 1]
SLIDE 47 An example
[Figure: the two + nodes are evaluated, giving the values 2 and 3]
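Sketch (not from the talk): the iteration above, with one recorded body run per element. `eval_for` and the shape of its trace are made up for this illustration and are much simpler than the graphs on the slides.

```python
def eval_for(head, body):
    """Evaluate {body(x) | x in head}, keeping one record per element.

    Returns the result set and a list of per-element runs, each noting
    which element x was bound to and what the body produced.
    """
    runs = [{"x": v, "result": body(v)} for v in sorted(head)]
    return frozenset(r["result"] for r in runs), runs


# The example from the slides: for x in {1, 2}, compute x + 1.
result, runs = eval_for({1, 2}, lambda x: x + 1)
```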
SLIDE 48 Graphs can "lie" (inconsistency)
[Figure: an operation node with arguments 2 and 2 and recorded result 5]
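Sketch (not from the talk): a single node can be checked by recomputing it from its argument values and comparing with the recorded result. The operator on the slide is not legible in this transcript; + is assumed here, and `OPS` and `locally_consistent` are made-up names.

```python
import operator

# Assumed table of primitive functions; only + and * are included here.
OPS = {"+": operator.add, "*": operator.mul}


def locally_consistent(op, args, recorded):
    """Recompute one operation node and compare with its recorded value."""
    return OPS[op](*args) == recorded


print(locally_consistent("+", (2, 2), 5))  # False: the graph "lies"
print(locally_consistent("+", (2, 2), 4))  # True
```

This only checks one node at a time; it says nothing about whether the whole graph is consistent.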
SLIDE 51 Graphs can "lie" (inconsistency)
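[Figure: examples of inconsistent graphs: an operation node with arguments 2 and 2 and recorded result 5; an if node whose test value is True but which records the else branch, with a copy edge to 2; a for_x node with head, body and elem edges over the values 1, 2, 3 and 4]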
"Locally" but not "globally" consistent
SLIDE 52 Graph queries
- Many possible approaches
- In paper: some Datalog
- Maybe overkill, seems fragile
- In code: some "annotation propagation" traversals
- Seems to handle where, "explanations", "summaries"
SLIDE 56 Explaining
[Figure: provenance graph of the larger example]
Note: Smallest consistent subgraph (NOT transitive closure!)
SLIDE 58 Summarizing
[Figure: provenance graph of the larger example, with a summary showing the labels +, 2, {}, 1 and 1]
SLIDE 59 Graphs are partially "replayable"
- If we change a value node, we can try to "readjust" the graph to recover consistency
- Formalized in (Acar, Ahmed, Cheney 08)
[Figure: a + node with arguments 2 and 2 and result 4]
SLIDE 60 Graphs are partially "replayable"
[Figure: the second argument is changed to 17; the recorded result is still 4]
SLIDE 61 Graphs are partially "replayable"
[Figure: the result is readjusted to 19]
SLIDE 62 Graphs are partially "replayable"
[Figure: the graph also shows an if node with a test edge to False, an else edge and a copy edge to the value 2]
SLIDE 63 Graphs are partially "replayable"
[Figure: the test value is changed to True]
SLIDE 64 Graphs are partially "replayable"
[Figure: the copy edge is removed; the if node has only test and else edges]
Stuck!
????
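Sketch (not from the talk): one reading of the slides above is that a + node can always be recomputed from its new arguments, while an if node can only be replayed if the new test value selects the branch that was recorded. The functions and the `Stuck` exception are made up for this illustration; the formal treatment is in (Acar, Ahmed, Cheney 08).

```python
class Stuck(Exception):
    """Raised when replay needs a branch that the graph did not record."""


def replay_plus(args):
    """Readjust a + node: recompute the result from the current arguments."""
    return sum(args)


def replay_if(new_test, recorded_branch, recorded_value):
    """Replay an if node that recorded only one branch ("then" or "else")."""
    needed = "then" if new_test else "else"
    if needed != recorded_branch:
        raise Stuck(f"need the {needed} branch, only {recorded_branch} was recorded")
    return recorded_value


print(replay_plus((2, 17)))  # 19
```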
SLIDE 65 Implementation in Haskell
- Summarized in paper, full code on request
- roughly 250 LOC for basic evaluator
- another 300 for graphviz translation, basic queries, examples
- Point?
- No claim of efficiency/scalability, but easy to understand and experiment with
- Elucidates some tricky details that pictures hide
- Similar "lightweight modeling" might be valuable for understanding/relating other WF/DB models
SLIDE 66 Related work
- This work synthesizes/rearranges ideas from several previous works & "folklore"
- traces (Acar, Ahmed, Cheney 2008)
- runs (Kwasnikowska, van den Bussche, DILS 2007, IPAW 2008)
- OPM graphs (Moreau et al. IPAW 2008 etc.)
- and many workflow systems
- More can be done to relate DB & workflow models
SLIDE 67 Future work
- This is work in progress
- Next steps:
- Extending to understand/model other workflow features
- Better grasp of "real" queries and features needed
- Implementa(tion|ability)?
- Optimization?
SLIDE 68 Conclusions
- DB & WF provenance have much in common
- We develop a common graph model
- with both intuitive & precise presentations
- Still much to do to relate and integrate DB & WF models
- let alone integrate models at scale in real systems