Towards a Model of Provenance and User Views in Scientific Workflows


Shirley Cohen, Sarah Cohen-Boulakia, Susan Davidson, University of Pennsylvania. DILS’06, July.


SLIDE 1

DILS’06, July 22nd

Towards a Model of Provenance and User Views in Scientific Workflows

Shirley Cohen, Sarah Cohen-Boulakia, Susan Davidson
University of Pennsylvania

SLIDE 2

Need for provenance!

[Figure: the CIPRES project (Cyberinfrastructure for Phylogenetic RESearch). DNA sequences from public sources (TGCCGTGTGGC…) enter the biologist's workspace, where bioinformatics protocols (Alignments, ClustalW, PAUPS, Phillips, …, Bootstrap) produce a tree.]

  • Which sequences have been used to produce this tree?
  • How has this tree been generated?
  • Can I throw away some of these data? Which ones are really important to keep?

SLIDE 3

Scientific Analysis

  • Explosion of biological data, which must be analyzed to create knowledge
  • Scientific analysis is complex
      • Reproducing and interpreting results depends on the provenance of the data (how, where, who…)
  • Workflow systems
      • Support scientists in their analysis
      • Trace the data used / generated at each step
      • Are heterogeneous: different graph-based models, different technologies
  • Need a generic model of provenance

SLIDE 4

Provenance

  • Provenance is an increasingly important topic
      • Specialized workshops, survey papers…
  • Models for data provenance exist in the database community
      • E.g. [Buneman et al., 01], [Bhagwat et al., 04], [Widom et al., 06]
  • However, several features of scientific workflows are not addressed
      • Data are derived by chaining and composing analytical tools
      • Steps are black boxes
      • Different views of a given workflow (sub-steps) may be considered
  • A model of provenance for scientific workflows must incorporate these features

SLIDE 5

Outline

  • Motivation
  • Case study: Tree Inference (current section)
  • Model for provenance and user views
  • Querying provenance
  • Conclusion

SLIDE 6

Tree Inference Workflow

[Figure: tree inference workflow. Steps: (S1) Download Sequences (from GenBank), (S2) Create Alignment, (S3) Refine Alignment, (S4) Infer Tree. Data: (G), (O1) raw sequences, (O2) alignment, (O3) edited alignment, (O4) rooted tree, stored in the Tree Repository. Loop: if the rooted tree (O4) is unsatisfactory, repeat the process.]

  • Designed in the context of the CIPRES project
  • Represents how phylogeneticists analyze data
  • Terminology
      • Nodes are step-classes (static)
      • Edges capture the flow of data between step-classes
      • Loops are possible
  • An execution of a workflow generates a partial order of steps (dynamic)
      • Steps are instances of step-classes
      • Each step has input and output data

SLIDE 7

Tree Inference Workflow, cont.

  • A step-class may itself be a workflow
  • Users may zoom in to the boxes
      • Kepler, myGrid…
  • Different user views can be considered
      • Am I allowed to zoom in to S4?

[Figure: the tree inference workflow with (S4) Infer Tree expanded into sub-steps (S4a) Compute Trees, (S4b) Create Consensus Tree, (S4c) Bootstrap Tree, (S4d) Root Tree. Data inside S4: (O3) edited alignment, (O4a) unrooted trees, (O4b) consensus tree, (O4c) bootstrap tree, (O4) rooted tree.]

SLIDE 8

Querying Provenance

  • From what immediate data products did this tree originate?
  • What are all the data products which have been used to produce this tree?
  • What step produced this tree?
  • What sequence of steps produced this tree?

  • Data vs. step provenance
  • Immediate vs. deep provenance

[Figure: the tree inference workflow, steps (S1) to (S4) with data (G), (O1) to (O4), and (S4) Infer Tree expanded into sub-steps (S4a) to (S4d) with data (O4a), (O4b), (O4c).]

SLIDE 9

Outline

  • Motivation
  • Case study: Tree Inference
  • Model for provenance and user views (current section)
  • Querying provenance
  • Conclusion

SLIDE 10

Model of Provenance: Logs

  • A log is a sequence of entries
      • Input(sid, iid, ts): step sid takes iid as input at time ts
      • Output(sid, did, ts): step sid produces did at time ts
  • Immediate provenance
      • All the data and steps directly used to produce did

        ImmProv(did, sid, iid) :- Input(sid, iid, tsi) ∧ Output(sid, did, tso) ∧ tsi ≤ tso

      • ImmDProv and ImmSProv are also defined
  • Each input/output data item is stored!

[Figure: example execution over steps S1, S2 and data I1, I2, D, O1.]

Input
SID  IID  TSI
S1   I1   1
S1   I2   1
S2   D    3

Output
SID  DID  TSO
S1   D    2
S2   O1   4

  • Immediate provenance of O1
      • ImmDProv: D
      • ImmSProv: S2
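
The ImmProv rule can be tried out on the example log with a few lines of Python. This is only a sketch: the tuple layout and the function names are illustrative, and reading ImmDProv and ImmSProv as the data and step projections of ImmProv is an assumption that happens to agree with the example values on the slide.

```python
from typing import NamedTuple


class Input(NamedTuple):
    sid: str  # step identifier
    iid: str  # identifier of the data item consumed
    ts: int   # timestamp of the log entry


class Output(NamedTuple):
    sid: str  # step identifier
    did: str  # identifier of the data item produced
    ts: int   # timestamp of the log entry


# Log entries copied from the Input and Output tables of the example.
INPUTS = [Input("S1", "I1", 1), Input("S1", "I2", 1), Input("S2", "D", 3)]
OUTPUTS = [Output("S1", "D", 2), Output("S2", "O1", 4)]


def imm_prov(did):
    """Pairs (sid, iid) such that ImmProv(did, sid, iid) holds."""
    return {
        (o.sid, i.iid)
        for o in OUTPUTS
        if o.did == did
        for i in INPUTS
        if i.sid == o.sid and i.ts <= o.ts
    }


def imm_d_prov(did):
    """Assumed data projection of ImmProv."""
    return {iid for _, iid in imm_prov(did)}


def imm_s_prov(did):
    """Assumed step projection of ImmProv."""
    return {sid for sid, _ in imm_prov(did)}


print(imm_d_prov("O1"))  # {'D'}
print(imm_s_prov("O1"))  # {'S2'}
```

Data items that no logged step produced, such as I1, simply have empty immediate provenance.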

SLIDE 11

Deep Provenance

  • Recursive definition
  • Deep Data provenance (D):

        DProv(did, iid) :- ImmProv(did, _, iid)
        DProv(did, iid) :- ImmProv(did, _, x) ∧ DProv(x, iid)

  • Deep Step provenance (S):

        SProv(did, sid) :- ImmProv(did, sid, _)
        SProv(did, sid) :- ImmProv(did, _, x) ∧ SProv(x, sid)

[Figure: example execution over steps S1, S2 and data I1, I2, D, O1.]

  • DProv for O1: [{D}, {I1, I2}]
  • SProv for O1: [{S2}, {S1}]
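
The recursion can be unrolled level by level, which reproduces the layered form in which the example lists DProv and SProv. The sketch below assumes the ImmProv facts of the two-step example (S1 reads I1 and I2 and writes D; S2 reads D and writes O1). The function name and the layered return value are illustrative choices; the Datalog rules themselves define flat relations, which correspond to the union of the layers.

```python
# ImmProv(did, sid, iid) facts for the two-step example.
IMM_PROV = {
    ("D", "S1", "I1"),
    ("D", "S1", "I2"),
    ("O1", "S2", "D"),
}


def deep_prov(did, imm_prov=IMM_PROV):
    """Return (data_layers, step_layers) for did, nearest level first."""
    data_layers, step_layers = [], []
    frontier, seen = {did}, {did}
    while frontier:
        # All immediate-provenance facts about the data items of this level.
        facts = [(d, s, i) for (d, s, i) in imm_prov if d in frontier]
        steps = {s for _, s, _ in facts}
        data = {i for _, _, i in facts} - seen  # skip data already reached
        if steps:
            step_layers.append(steps)
        if data:
            data_layers.append(data)
        seen |= data
        frontier = data
    return data_layers, step_layers


d_layers, s_layers = deep_prov("O1")
print([sorted(layer) for layer in d_layers])  # [['D'], ['I1', 'I2']]
print([sorted(layer) for layer in s_layers])  # [['S2'], ['S1']]
```

The `seen` set keeps the traversal finite if an execution feeds a data item back into an earlier step-class, as the workflow's repeat loop allows.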

SLIDE 12

Composition and User Views

  • What is the immediate data provenance of O4?
      • If I can zoom into S4: O4c
      • Otherwise: O3
  • UserView(U): set of the lowest level step-classes that U is entitled to see.
  • Ordering on user views: U2 >u U1
      • U2 is finer than U1 (sees provenance in more detail)

[Figure: the tree inference workflow, steps (S1) to (S4) with data (G), (O1) to (O4), and (S4) Infer Tree expanded into sub-steps (S4a) to (S4d) with data (O4a), (O4b), (O4c); user views U1 and U2 are marked.]

SLIDE 13

User Views

  • What are user views?
      • Level of detail the user wishes to track
      • Permissions given to the user
      • Ability of the user to see / know the sub-steps (distributed computation)
      • Similar to checkpoints in logs
  • Why use user views?
      • Throw away unimportant intermediate results
      • Reduce the amount of work to be redone
      • Storage efficiency

SLIDE 14

Reasoning with User Views

  • Logging occurs at lowest level steps
  • Reasoning uses information from
      • Workflow: step-class containment and user views
      • Cinput(sid, idid, tsi) and Coutput(sid, did, tso), calculated from the log
  • Immediate user-provenance

        ImmUserProv(u, did, sid, idid) :- Cinput(sid, idid, tsi) ∧ Coutput(sid, did, tso) ∧ tsi ≤ tso ∧ userView(u, sid)

      • Related relations: ImmUserDProv, ImmUserSProv
  • User deep provenance is analogously defined

[Figure: workflow over steps S1, S2, S3 with data I1, I2, D, O1, O2; composite step-classes Sc and Scc; user views U1 (black box), U2, U3 (admin).]

CInput
SID  IDID  TSI
S1   I1    1
Sc   I1    1
Scc  I1    1
S3   I2    1
Scc  I2    1
S2   D     3

COutput
SID  DID  TSO
S1   D    2
S2   O1   4
Sc   O1   4
Scc  O1   4
S3   O2   5
Scc  O2   5

  • ImmUserDProv for O1 viewed by U2: {I1}
  • ImmUserDProv for O1 viewed by U3: {D}
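
The ImmUserProv rule can be checked against the CInput and COutput tables with a short Python sketch. The contents of the three user views are an assumption read off the figure labels and the tables (U1 sees only the outer box Scc, U2 sees Sc and S3, U3 sees the lowest-level steps); the slide does not list them explicitly. Function names are illustrative.

```python
# (sid, idid, tsi) rows of the CInput table.
CINPUT = {
    ("S1", "I1", 1), ("Sc", "I1", 1), ("Scc", "I1", 1),
    ("S3", "I2", 1), ("Scc", "I2", 1), ("S2", "D", 3),
}
# (sid, did, tso) rows of the COutput table.
COUTPUT = {
    ("S1", "D", 2), ("S2", "O1", 4), ("Sc", "O1", 4),
    ("Scc", "O1", 4), ("S3", "O2", 5), ("Scc", "O2", 5),
}
# Assumed user views: the step-classes each user is entitled to see.
USER_VIEW = {
    "U1": {"Scc"},             # black box
    "U2": {"Sc", "S3"},
    "U3": {"S1", "S2", "S3"},  # admin
}


def imm_user_prov(u, did):
    """Pairs (sid, idid) such that ImmUserProv(u, did, sid, idid) holds."""
    return {
        (sid, idid)
        for (sid, d, tso) in COUTPUT
        if d == did and sid in USER_VIEW[u]
        for (sid_in, idid, tsi) in CINPUT
        if sid_in == sid and tsi <= tso
    }


def imm_user_d_prov(u, did):
    """Data projection of ImmUserProv."""
    return {idid for _, idid in imm_user_prov(u, did)}


print(imm_user_d_prov("U2", "O1"))  # {'I1'}
print(imm_user_d_prov("U3", "O1"))  # {'D'}
```

Under these assumed views the sketch returns the two values given in the example, and the black-box view U1 reports both I1 and I2 for O1 because it cannot look inside Scc.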

SLIDE 15

Reasoning with User Views (cont.)

  • A finer user view allows
      • more data and steps to be seen
      • more precise reasoning about data provenance
  • Lemma: given a data object did and two user views u1 and u2 such that u1 <u u2 and did is visible in u1, then

        Prov-visible(u1, u1, did) ⊇ Prov-visible(u1, u2, did)

  • Example
      • Prov-visible(U1, U3, O1) = {I1}
      • Prov-visible(U1, U1, O1) = {I1, I2}

[Figure: workflow over steps S1, S2, S3 with data I1, I2, D, O1, O2; composite step-classes Sc and Scc; user views U1 (black box), U2, U3 (admin).]

  • Different granularity levels of provenance
  • Storage efficiency
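
The lemma's example can be replayed in Python under one reading of Prov-visible, which the slide does not spell out: Prov-visible(u1, u2, did) is taken here to be the deep data provenance of did computed with view u2, restricted to the data that view u1 can see (the inputs and outputs of u1's step-classes). That reading, the contents of the user views, and all function names are assumptions; the table rows are those of the CInput/COutput example.

```python
CINPUT = {
    ("S1", "I1", 1), ("Sc", "I1", 1), ("Scc", "I1", 1),
    ("S3", "I2", 1), ("Scc", "I2", 1), ("S2", "D", 3),
}
COUTPUT = {
    ("S1", "D", 2), ("S2", "O1", 4), ("Sc", "O1", 4),
    ("Scc", "O1", 4), ("S3", "O2", 5), ("Scc", "O2", 5),
}
# Assumed user views (not listed explicitly in the source).
USER_VIEW = {"U1": {"Scc"}, "U2": {"Sc", "S3"}, "U3": {"S1", "S2", "S3"}}


def imm_user_d_prov(u, did):
    """Data directly used to produce did, as seen through view u."""
    return {
        idid
        for (sid, d, tso) in COUTPUT
        if d == did and sid in USER_VIEW[u]
        for (sid_in, idid, tsi) in CINPUT
        if sid_in == sid and tsi <= tso
    }


def user_d_prov(u, did):
    """Deep data provenance of did under view u (transitive closure)."""
    found, frontier = set(), {did}
    while frontier:
        reached = set()
        for d in frontier:
            reached |= imm_user_d_prov(u, d)
        frontier = reached - found
        found |= frontier
    return found


def visible(u):
    """Data that are inputs or outputs of a step-class in view u."""
    steps = USER_VIEW[u]
    return ({d for (s, d, _) in CINPUT if s in steps}
            | {d for (s, d, _) in COUTPUT if s in steps})


def prov_visible(u1, u2, did):
    """Assumed reading: provenance under u2, filtered to what u1 sees."""
    return user_d_prov(u2, did) & visible(u1)


print(sorted(prov_visible("U1", "U1", "O1")))  # ['I1', 'I2']
print(sorted(prov_visible("U1", "U3", "O1")))  # ['I1']
```

With this reading the two example values come out as on the slide, and the containment stated by the lemma holds: the black-box view attributes O1 to both inputs, while the finer view shows that only I1 was actually used.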

SLIDE 16

Outline

  • Motivation
  • Tree Inference use case
  • Model for provenance
  • Querying Provenance (current section)
  • Conclusion

SLIDE 17

Querying Provenance

  • From what direct data products did this tree originate?
      • ImmUserDProv(U1, O4): O3
      • ImmUserDProv(U2, O4): O4c
  • What are all the data products which have been used to produce this tree?
      • userDProv(U1, O4): O3, O2, O1, G
      • userDProv(U2, O4): O4c, O4b, O4a, O3, O2, O1, G
  • What sequence of steps produced this tree?
      • userSProv(U1, O4): S4, S3, S2, S1
      • userSProv(U2, O4): S4d, S4c, S4b, S4a, S3, S2, S1

[Figure: the tree inference workflow, steps (S1) to (S4) with data (G), (O1) to (O4), and (S4) Infer Tree expanded into sub-steps (S4a) to (S4d) with data (O4a), (O4b), (O4c); user views U1 and U2 are marked.]
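
These query answers can be reproduced with a small Python sketch of the tree inference workflow. The wiring is an assumption based on the workflow figure: a linear chain in which S1 reads G, S4 as a black box reads O3 and writes O4, and the sub-steps S4a to S4d pass O4a, O4b and O4c along in order. Timestamps and the repeat loop are left out, and the dictionary and function names are illustrative.

```python
# Assumed wiring: step-class -> (input data, output data).
IO = {
    "S1": ({"G"}, {"O1"}),
    "S2": ({"O1"}, {"O2"}),
    "S3": ({"O2"}, {"O3"}),
    "S4": ({"O3"}, {"O4"}),     # Infer Tree seen as a black box
    "S4a": ({"O3"}, {"O4a"}),   # Compute Trees
    "S4b": ({"O4a"}, {"O4b"}),  # Create Consensus Tree
    "S4c": ({"O4b"}, {"O4c"}),  # Bootstrap Tree
    "S4d": ({"O4c"}, {"O4"}),   # Root Tree
}
# Assumed views: U1 cannot zoom into S4, U2 can.
USER_VIEW = {
    "U1": ["S1", "S2", "S3", "S4"],
    "U2": ["S1", "S2", "S3", "S4a", "S4b", "S4c", "S4d"],
}


def user_prov(u, did):
    """Return (data, steps) used to produce did under view u, nearest first."""
    data, steps = [], []
    frontier = [did]
    while frontier:
        next_frontier = []
        for d in frontier:
            for sid in USER_VIEW[u]:
                inputs, outputs = IO[sid]
                if d not in outputs:
                    continue
                if sid not in steps:
                    steps.append(sid)
                for i in sorted(inputs):
                    if i not in data:
                        data.append(i)
                        next_frontier.append(i)
        frontier = next_frontier
    return data, steps


print(user_prov("U1", "O4"))
# (['O3', 'O2', 'O1', 'G'], ['S4', 'S3', 'S2', 'S1'])
print(user_prov("U2", "O4"))
# (['O4c', 'O4b', 'O4a', 'O3', 'O2', 'O1', 'G'],
#  ['S4d', 'S4c', 'S4b', 'S4a', 'S3', 'S2', 'S1'])
```

The first element of each data list is the immediate data provenance (O3 for U1, O4c for U2), so the same traversal answers the immediate and the deep queries.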

SLIDE 18

Conclusion

  • Model of provenance
      • Based on a study of user requirements (Tree Inference Workflow)
      • Uses generic and minimal information
      • Based on careful studies of workflow systems (Kepler, MyGrid, Chimera)
  • Definitions include
      • Data and Step provenance
      • Immediate and Deep provenance
      • User Views
          • Multi-granularity levels of provenance
          • Only visible and necessary data are kept
          • Efficiency in storage
  • Model is rich enough to answer the collected queries

SLIDE 19

Ongoing Work

  • Experiment with the expressiveness of the language
      • Queries over concurrent and partial executions
      • Use an object-oriented data model (JDBC/Oracle)
  • Implement the model (efficiently)
      • Experiment with storage models
      • Collect real scientific logging information
      • Study use within a real workflow system
  • Collaboration with the Kepler group

SLIDE 20

Acknowledgements

  • Kepler Group
      • Shawn Bowers
      • Bertram Ludascher
      • Timothy McPhillips
  • Biologists from the CIPRES project
  • Members from the Database group, University of Pennsylvania
  • This work is supported by NSF grants IIS0513778 and IIS0415810