slide-1
SLIDE 1

A graph model for data and workflow provenance

Umut Acar, Peter Buneman, James Cheney, Natalia Kwasnikowska, Jan van den Bussche, & Stijn Vansummeren TaPP 2010

slide-2
SLIDE 2

Provenance in ...

  • Databases
      • Mainly for (nested) relational model
      • Where-provenance ("source location")
      • Lineage, why ("witnesses")
      • How/semiring model
      • Relatively formal
  • Workflows
      • Many different systems
      • Many different models (converging on OPM?)
      • Graphs/DAGs
      • Relatively informal
slide-4
SLIDE 4

This talk

  • Relate database & workflow "styles"
  • Develop a common graph formalism
  • Need a common, expressive language that
      • supports many database queries
      • describes some (simple) workflows
slide-5
SLIDE 5

Previous work

  • Dataflow calculus (DFL), based on nested relational calculus (NRC)
  • Provenance "run" model by Kwasnikowska & Van den Bussche (DILS 07, IPAW 08)
  • "Provenance trace" model for NRC (Acar, Ahmed & Cheney '08)
  • Open Provenance Model (bipartite graphs) (Moreau et al. 2008-9), used in many WF systems
slide-6
SLIDE 6

NRC/DFL background

  • A very simple, functional language:
      • basic functions +, *, ... & constants 0, 1, 2, 3, ...
      • variables x, y, z
      • pair/record types: (A:e,...,B:e), πA(e)
      • collection (set) types: {e,...}, e ∪ e, {e | x in e'}, ⋃e
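
The constructs listed above are small enough to interpret directly. The following is a minimal sketch, not the authors' implementation: the tuple encoding of expressions, the `PRIMS` table, and the function name `evaluate` are all assumptions made for illustration.

```python
import operator

# Hypothetical tuple encoding of expressions, chosen for this sketch only.
PRIMS = {"+": operator.add, "*": operator.mul}

def evaluate(expr, env):
    """Evaluate an expression in an environment mapping variable names to values."""
    tag = expr[0]
    if tag == "const":        # ("const", 3)
        return expr[1]
    if tag == "var":          # ("var", "x")
        return env[expr[1]]
    if tag == "prim":         # ("prim", "+", e1, e2)
        return PRIMS[expr[1]](evaluate(expr[2], env), evaluate(expr[3], env))
    if tag == "record":       # ("record", {"A": e1, "B": e2})
        # Stored as a frozenset of (field, value) pairs so records can be set elements.
        return frozenset((f, evaluate(e, env)) for f, e in expr[1].items())
    if tag == "proj":         # ("proj", "A", e) plays the role of πA(e)
        return dict(evaluate(expr[2], env))[expr[1]]
    if tag == "set":          # ("set", e1, ..., en) is {e1, ..., en}
        return frozenset(evaluate(e, env) for e in expr[1:])
    if tag == "union":        # ("union", e1, e2) is e1 ∪ e2
        return evaluate(expr[1], env) | evaluate(expr[2], env)
    if tag == "for":          # ("for", "x", source, body) is {body | x in source}
        return frozenset(evaluate(expr[3], {**env, expr[1]: v})
                         for v in evaluate(expr[2], env))
    if tag == "flatten":      # ("flatten", e) is the union of a set of sets
        return frozenset(x for s in evaluate(expr[1], env) for x in s)
    raise ValueError(f"unknown expression tag: {tag!r}")

one_two = ("set", ("const", 1), ("const", 2))
incremented = evaluate(
    ("for", "x", one_two, ("prim", "+", ("var", "x"), ("const", 1))), {})
```

Here `incremented` is the set obtained by adding 1 to each element of {1, 2}. Sets are modelled with `frozenset` so that they can be nested.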
slide-13
SLIDE 13

An example

  • Suppose R = {(1,2,3), (4,5,6), (9,8,7)}

sum { x * y | (x,y,z) in R, x < y}
= sum { x * y | (x,y,z) in {(1,2,3), (4,5,6)}}
= sum {1 * 2, 4 * 5}
= sum {2,20}
= 22
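
The same steps can be replayed with Python set comprehensions. This is only an illustration of the calculation above; the variable names `selected`, `products` and `total` are not from the talk.

```python
R = {(1, 2, 3), (4, 5, 6), (9, 8, 7)}

# Keep only the tuples that satisfy the filter x < y.
selected = {(x, y, z) for (x, y, z) in R if x < y}

# Map each remaining tuple to x * y. The result is a set, so equal
# products from different tuples would collapse into one element.
products = {x * y for (x, y, z) in selected}

total = sum(products)
```

Each intermediate variable corresponds to one line of the derivation: the filtered relation, the set of products, and the final sum.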

slide-14
SLIDE 14

Another example

  • In DFL, built-in functions / constants can be whole programs & files,
  • as in Provenance Challenge 1 workflow:

let WarpParams := {align_warp(img,hdr) | (img,hdr) in Inputs} in
let Reslices := {reslice(wp) | wp in WarpParams} in
softmean(Reslices)
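
To show the shape of this workflow as ordinary code, here is a sketch in which the three tools are replaced by stand-in functions. The stubs only build tagged tuples; they are assumptions for illustration and do not reflect what the real `align_warp`, `reslice` and `softmean` programs compute. The file names in the usage lines are hypothetical.

```python
def align_warp(img, hdr):
    # Stand-in for the external tool: returns a token describing its inputs.
    return ("warp", img, hdr)

def reslice(wp):
    # Stand-in: wraps the warp parameters it was given.
    return ("resliced", wp)

def softmean(reslices):
    # Stand-in: combines all resliced images into one result.
    return ("atlas", frozenset(reslices))

def workflow(inputs):
    """Mirror the two let-bindings and the final call of the DFL program."""
    warp_params = {align_warp(img, hdr) for (img, hdr) in inputs}
    reslices = {reslice(wp) for wp in warp_params}
    return softmean(reslices)

# Hypothetical input file names.
result = workflow({("a1.img", "a1.hdr"), ("a2.img", "a2.hdr")})
```

Because the primitives are opaque functions, the structure of the program (two set comprehensions feeding one aggregate) is all that the sketch captures.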

slide-17
SLIDE 17

Goal: Define "provenance graphs" for DFL

let WarpParams := {align_warp(img,hdr) | (img,hdr) in Inputs} in
let Reslices := {reslice(wp) | wp in WarpParams} in
softmean(Reslices)

http://www.flickr.com/photos/schneertz/679692806/

slide-18
SLIDE 18

First step: values

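[Figure: kinds of value nodes: constants c; record nodes <> with field edges A1 ... An to values v; set nodes {} with elem edges to values v; and copy edges between values]
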
slide-19
SLIDE 19

Example value

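[Figure: an example value drawn as a graph, using constants 1, 2 and 3, record nodes <> with field edges A and B, and a set node {} with elem edges]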

slide-20
SLIDE 20

Next step: evaluation nodes ("process")

[Figure: evaluation nodes for constants (c) and primitive functions (f, with argument edges 1 ... n), and for variables (x) and temporary bindings (let x, with head and body edges)]

slide-21
SLIDE 21

Pairing

[Figure: evaluation nodes for record building (<>, with field edges A1 ... An) and field lookup (πA)]

slide-22
SLIDE 22

Conditionals

[Figure: two if nodes, one with test and then edges, the other with test and else edges]

Note: Only taken branch is recorded
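
A consequence of this note is that the recorded node carries no information about the branch that did not run. A minimal sketch, assuming a dictionary representation of nodes and a hypothetical helper `eval_if` that receives both branches as zero-argument functions:

```python
def eval_if(test, then_thunk, else_thunk):
    """Evaluate a conditional and return (value, node).

    The node records the test outcome and only the branch that was taken;
    the other branch is never evaluated and leaves no trace.
    """
    if test:
        value = then_thunk()
        node = {"op": "if", "test": True, "then": value}
    else:
        value = else_thunk()
        node = {"op": "if", "test": False, "else": value}
    return value, node

# The else branch would fail if it were evaluated, but it never is.
value, node = eval_if(True, lambda: 1, lambda: 1 / 0)
```

After this call, `node` has a "then" entry and no "else" entry.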

slide-23
SLIDE 23

Sets: basic operations

[Figure: evaluation nodes for the empty set (∅), singleton ({}) and union (∪, with argument edges 1 and 2)]

slide-24
SLIDE 24

Sets: complex operations

[Figure: evaluation nodes for flattening and for iteration (for x, with head and body edges)]

slide-25
SLIDE 25

Provenance graphs

  • are graphs with "both value and evaluation structure"

[Figure: a small example provenance graph]

slide-26
SLIDE 26

A bigger example

[Figure: provenance graph for a larger example]
slide-28
SLIDE 28

Value structure

[Figure: the larger example graph with its value structure highlighted: set nodes {}, record nodes <>, constants 1 and 2, booleans T and F, and edges labelled C]

slide-29
SLIDE 29

Input values

[Figure: the larger example graph with its input values highlighted]

slide-30
SLIDE 30

Return value

[Figure: the larger example graph with its return value highlighted]

slide-32
SLIDE 32

Expression structure

[Figure: the larger example graph with its expression (evaluation) nodes highlighted; surviving node labels include let R, let S, for x, for y, if, empty, fst, snd, =, + and U]

slide-33
SLIDE 33

Building provenance graphs

  • is complicated
  • Here we'll use a high-level "graph rewrite rule" formalism
  • Mostly because it is nicer to look at than the formal version

slide-34
SLIDE 34

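[Figure: rewrite rules for constants c, for primitive functions f applied to values v1, ..., vn (yielding f(v1,...,vn)), and for let x bindings with head and body edges, where occurrences of x are linked to the bound value v by copy edges]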

slide-35
SLIDE 35

[Figure: rewrite rules for field lookup πAi, which yields a copy of the field value vi of a record <> with fields A1 ... An, and for record building from values v1, ..., vn]

slide-36
SLIDE 36

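[Figure: rewrite rules for if: when the test is True, the node keeps its test and then edges and its result is a copy of the value of e1; when the test is False, it keeps its test and else edges and its result is a copy of the value of e2]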

slide-37
SLIDE 37

[Figure: rewrite rules for empty?: a set {} with elem edges yields False; a set {} with no elements yields True]

slide-38
SLIDE 38

[Figure: rewrite rules for the basic set operations (empty set ∅, singleton and union), whose results are {} nodes with elem edges to their elements]

slide-39
SLIDE 39

OK, take a deep breath!

slide-40
SLIDE 40

[Figure: rewrite rules for iteration (for x over a set {} with elements v1, ..., vn, with x a copy of the current element in each body) and for flattening a set of sets]

slide-41
SLIDE 41

An example

[Figure: the expression for x over the set {1, 2} with body x + 1, before evaluation]

slide-47
SLIDE 47

An example

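[Figure: the fully evaluated graph: for each element of {1, 2}, x is a copy (C) of the element and a + node adds the constant 1, giving the results 2 and 3 as the elem children of the output set {}]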
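
The end result of this example can be written down as an explicit node-and-edge structure. The sketch below is an assumption-laden illustration, not the formal model: the `Graph` class, the edge directions, and the "value" edge label are hypothetical, while the labels for x, {}, +, elem, head, body and copy follow the figure.

```python
import itertools

class Graph:
    """A tiny labelled graph: integer node ids, labelled directed edges."""
    def __init__(self):
        self.labels = {}              # node id -> label
        self.edges = []               # (source id, edge label, target id)
        self._ids = itertools.count()

    def node(self, label):
        n = next(self._ids)
        self.labels[n] = label
        return n

    def edge(self, src, label, dst):
        self.edges.append((src, label, dst))

def for_plus_one(g, elements):
    """Record the evaluation of {x + 1 | x in elements}; return the result set node."""
    head = g.node("{}")               # the input set
    result = g.node("{}")             # the output set
    for_node = g.node("for x")
    g.edge(for_node, "head", head)
    g.edge(for_node, "value", result)     # hypothetical edge label
    for v in sorted(elements):
        elem = g.node(v)
        g.edge(head, "elem", elem)
        x = g.node(v)                 # value of x in this iteration
        g.edge(x, "copy", elem)       # x is a copy of the current element
        one = g.node(1)
        plus = g.node("+")
        g.edge(plus, "1", x)
        g.edge(plus, "2", one)
        out = g.node(v + 1)
        g.edge(plus, "value", out)    # hypothetical edge label
        g.edge(for_node, "body", plus)    # one body edge per element
        g.edge(result, "elem", out)
    return result

g = Graph()
result = for_plus_one(g, {1, 2})
outputs = sorted(g.labels[d] for (s, l, d) in g.edges if s == result and l == "elem")
```

The graph keeps both the values (set and element nodes) and the evaluation steps (the for x and + nodes) in one structure.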

slide-51
SLIDE 51

Graphs can "lie" (inconsistency)

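[Figure: examples of inconsistent graphs: a + node with inputs 2 and 2 but output 5; an if node whose test is True but whose else branch (a copy of 2) is recorded; and a for x iteration over a set]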
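
A purely local check of the kind these examples violate can be sketched as follows. The dictionary representation of nodes and the function name `inconsistent_nodes` are assumptions for illustration; the check inspects each node in isolation and therefore says nothing about global consistency.

```python
def inconsistent_nodes(nodes):
    """Return the nodes whose recorded output contradicts their own inputs."""
    bad = []
    for node in nodes:
        if node["op"] == "+":
            # A + node is locally consistent when its output is the sum of its inputs.
            if sum(node["inputs"]) != node["output"]:
                bad.append(node)
        elif node["op"] == "if":
            # An if node is locally consistent when the recorded branch
            # is the one selected by the recorded test value.
            taken = "then" if node["test"] else "else"
            if node["branch"] != taken:
                bad.append(node)
    return bad

lying_plus = {"op": "+", "inputs": [2, 2], "output": 5}
lying_if = {"op": "if", "test": True, "branch": "else", "output": 2}
honest_plus = {"op": "+", "inputs": [2, 2], "output": 4}
bad = inconsistent_nodes([lying_plus, lying_if, honest_plus])
```

Only the first two nodes are reported, matching the first two fragments in the figure.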

"Locally" but not "globally" consistent

slide-52
SLIDE 52

Graph queries

  • Many possible approaches
  • In paper: some Datalog
      • Maybe overkill, seems fragile
  • In code: some "annotation propagation" traversals
      • Seems to handle where, "explanations", "summaries"

slide-56
SLIDE 56

Explaining

[Figure: the larger example graph with an explanation subgraph highlighted]

Note: Smallest consistent subgraph (NOT transitive closure!)

slide-58
SLIDE 58

Summarizing

[Figure: the larger example graph reduced to a summary; surviving labels: +, 2, {}, 1, 1]

slide-64
SLIDE 64

Graphs are partially "replayable"

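  • If we change a value node, can try to "readjust" to recover consistency
  • Formalized in (Acar, Ahmed, Cheney 08)

[Figure: a + node with inputs 2 and 2 and output 4. Changing one input to 17 leaves the output 4 stale; readjusting recomputes it as 19. An if node that recorded only its else branch (a copy of 2, with test False) cannot be readjusted once its test becomes True.]

Stuck!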
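
The difference between the two cases can be sketched in a few lines. This is an illustrative assumption, not the formalization cited above: nodes are dictionaries, and `readjust` and `Stuck` are hypothetical names.

```python
class Stuck(Exception):
    """Raised when the graph lacks the information needed to readjust a node."""

def readjust(node):
    """Return a copy of the node whose output is consistent with its inputs."""
    if node["op"] == "+":
        # All inputs are recorded, so the output can simply be recomputed.
        return {**node, "output": sum(node["inputs"])}
    if node["op"] == "if":
        taken = "then" if node["test"] else "else"
        if node["branch"] != taken:
            # Only the branch taken originally was recorded, so the value
            # of the newly selected branch is not available in the graph.
            raise Stuck("the newly selected branch was never recorded")
        return node
    raise ValueError(f"unknown node kind: {node['op']!r}")

# Changing an input of + from 2 to 17 is recoverable.
fixed = readjust({"op": "+", "inputs": [2, 17], "output": 4})

# Flipping the test of an if that recorded only its else branch is not.
try:
    readjust({"op": "if", "test": True, "branch": "else", "output": 2})
    stuck = False
except Stuck:
    stuck = True
```

This is the sense in which graphs are only partially replayable: recomputation works as long as every value it needs was recorded in the original run.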

slide-65
SLIDE 65

Implementation in Haskell

  • Summarized in paper, full code on request
      • roughly 250 LOC for basic evaluator
      • another 300 for Graphviz translation, basic queries, examples
  • Point?
      • No claim of efficiency/scalability, but easy to understand and experiment with
      • Elucidates some tricky details that pictures hide
      • Similar "lightweight modeling" might be valuable for understanding/relating other WF/DB models

slide-66
SLIDE 66

Related work

  • This work synthesizes/rearranges ideas from several previous works & "folklore"
      • traces (Acar, Ahmed, Cheney 2008)
      • runs (Kwasnikowska, van den Bussche, DILS 2007, IPAW 2008)
      • OPM graphs (Moreau et al. IPAW 2008 etc.)
      • and many workflow systems
  • More can be done to relate DB & workflow models

slide-67
SLIDE 67

Future work

  • This is work in progress
  • Next steps:
      • Extending to understand/model other workflow features
      • Better grasp of "real" queries and features needed
      • Implementa(tion|ability)?
      • Optimization?
slide-68
SLIDE 68

Conclusions

  • DB & WF provenance have much in common
  • We develop a common graph model
      • with both intuitive & precise presentations
  • Still much to do to relate and integrate DB & WF models
      • let alone integrate models at scale in real systems