Language-integrated Provenance Stefan Fehrenbach James Cheney - - PowerPoint PPT Presentation

language integrated provenance
SMART_READER_LITE
LIVE PREVIEW

Language-integrated Provenance Stefan Fehrenbach James Cheney - - PowerPoint PPT Presentation

Language-integrated Provenance Stefan Fehrenbach James Cheney PPDP 2016 A database Agencies oid name based_in phone 1 EdinTours Edinburgh 8740 2489 123 2 Burnss Glasgow 9307 2394 104 ExternalTours oid name destination type


slide-1
SLIDE 1

Language-integrated Provenance

Stefan Fehrenbach James Cheney PPDP 2016

slide-2
SLIDE 2
  • id

name based_in phone 1 EdinTours Edinburgh 8740 2489 123 2 Burns’s Glasgow 9307 2394 104

  • id

name destination type price in £ 3 EdinTours Edinburgh bus 20 4 EdinTours Loch Ness bus 50 5 EdinTours Loch Ness boat 200 6 EdinTours Firth of Forth boat 50 7 Burns’s Islay boat 100 8 Burns’s Mallaig train 40 Agencies ExternalTours

A database

2

slide-3
SLIDE 3
  • id

name based_in phone 1 EdinTours Edinburgh 8740 2489 123 2 Burns’s Glasgow 9307 2394 104

  • id

name destination type price in £ 3 EdinTours Edinburgh bus 20 4 EdinTours Loch Ness bus 50 5 EdinTours Loch Ness boat 200 6 EdinTours Firth of Forth boat 50 7 Burns’s Islay boat 100 8 Burns’s Mallaig train 40 Agencies ExternalTours query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] }

Language-integrated query

3

slide-4
SLIDE 4
  • id

name based_in phone 1 EdinTours Edinburgh 8740 2489 123 2 Burns’s Glasgow 9307 2394 104

  • id

name destination type price in £ 3 EdinTours Edinburgh bus 20 4 EdinTours Loch Ness bus 50 5 EdinTours Loch Ness boat 200 6 EdinTours Firth of Forth boat 50 7 Burns’s Islay boat 100 8 Burns’s Mallaig train 40 Agencies ExternalTours query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } name phone EdinTours 8740 2489 123 EdinTours 8740 2489 123 Burns’s 9307 2394 104

Language-integrated query

4

slide-5
SLIDE 5
  • id

name based_in phone 1 EdinTours Edinburgh 8740 2489 123 2 Burns’s Glasgow 9307 2394 104

  • id

name destination type price in £ 3 EdinTours Edinburgh bus 20 4 EdinTours Loch Ness bus 50 5 EdinTours Loch Ness boat 200 6 EdinTours Firth of Forth boat 50 7 Burns’s Islay boat 100 8 Burns’s Mallaig train 40 Agencies ExternalTours query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } name phone EdinTours 8740 2489 123 EdinTours 8740 2489 123 Burns’s 9307 2394 104

Where-provenance

5

slide-6
SLIDE 6
  • id

name based_in phone 1 EdinTours Edinburgh 8740 2489 123 2 Burns’s Glasgow 9307 2394 104

  • id

name destination type price in £ 3 EdinTours Edinburgh bus 20 4 EdinTours Loch Ness bus 50 5 EdinTours Loch Ness boat 200 6 EdinTours Firth of Forth boat 50 7 Burns’s Islay boat 100 8 Burns’s Mallaig train 40 Agencies ExternalTours query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } name phone EdinTours 8740 2489 123 EdinTours 8740 2489 123 Burns’s 9307 2394 104

Where-provenance

6

slide-7
SLIDE 7
  • id

name based_in phone 1 EdinTours Edinburgh 8740 2489 123 2 Burns’s Glasgow 9307 2394 104

  • id

name destination type price in £ 3 EdinTours Edinburgh bus 20 4 EdinTours Loch Ness bus 50 5 EdinTours Loch Ness boat 200 6 EdinTours Firth of Forth boat 50 7 Burns’s Islay boat 100 8 Burns’s Mallaig train 40 Agencies ExternalTours query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } name phone EdinTours 8740 2489 123 EdinTours 8740 2489 123 Burns’s 9307 2394 104

Where-provenance

7

slide-8
SLIDE 8
  • id

name based_in phone 1 EdinTours Edinburgh 8740 2489 123 2 Burns’s Glasgow 9307 2394 104

  • id

name destination type price in £ 3 EdinTours Edinburgh bus 20 4 EdinTours Loch Ness bus 50 5 EdinTours Loch Ness boat 200 6 EdinTours Firth of Forth boat 50 7 Burns’s Islay boat 100 8 Burns’s Mallaig train 40 Agencies ExternalTours query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } name phone EdinTours 8740 2489 123 EdinTours 8740 2489 123 Burns’s 9307 2394 104

Where-provenance

8

slide-9
SLIDE 9
  • id

name based_in phone 1 EdinTours Edinburgh 8740 2489 123 2 Burns’s Glasgow 9307 2394 104

  • id

name destination type price in £ 3 EdinTours Edinburgh bus 20 4 EdinTours Loch Ness bus 50 5 EdinTours Loch Ness boat 200 6 EdinTours Firth of Forth boat 50 7 Burns’s Islay boat 100 8 Burns’s Mallaig train 40 Agencies ExternalTours query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } name phone EdinTours 8740 2489 123 EdinTours 8740 2489 123 Burns’s 9307 2394 104

Lineage (why-provenance)

9

slide-10
SLIDE 10
  • id

name based_in phone 1 EdinTours Edinburgh 8740 2489 123 2 Burns’s Glasgow 9307 2394 104

  • id

name destination type price in £ 3 EdinTours Edinburgh bus 20 4 EdinTours Loch Ness bus 50 5 EdinTours Loch Ness boat 200 6 EdinTours Firth of Forth boat 50 7 Burns’s Islay boat 100 8 Burns’s Mallaig train 40 Agencies ExternalTours query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } name phone EdinTours 8740 2489 123 EdinTours 8740 2489 123 Burns’s 9307 2394 104

Lineage (why-provenance)

10

slide-11
SLIDE 11

Language-integrated provenance builds on

Language-integrated query

  • LINQ: … Meijer, Beckman, Bierman.

SIGMOD 2006

  • The script-writer’s dream. Cooper.

DBPL 2009

  • Query shredding: … Cheney,

Lindley, Wadler. SIGMOD 2014

  • Effective quotation: … Cheney,

Lindley, Radanne, Wadler. PEPM 2014

Provenance in databases

  • Why and where: … Buneman,

Khanna, Tan. ICDT 2001

  • On the expressiveness of implicit

provenance … Buneman, Cheney,

  • Vansummeren. TODS 2008
  • Perm: … Glavic, Alonso. ICDE 2009
  • Using SQL for efficient generation

and querying … Glavic, Miller,

  • Alonso. Buneman Festschrift 2013

11

slide-12
SLIDE 12

This talk

  • 1. Why?
  • 2. Language-integrated where-provenance in Links
  • 3. Rewriting Links to Links
  • 4. User-defined where-provenance
  • 5. Lineage in Links and its translation to Links
  • 6. Performance

W W

The paper

L

12

slide-13
SLIDE 13

Why?

Easy access to data and its provenance Provenance is not data – it is metadata data without provenance is less than complete provenance on its own is quite useless data with fake provenance is an affront Calculating provenance and propagating it manually is hard

  • r least cumbersome enough to want to automate it

13

slide-14
SLIDE 14

Where-provenance in Links

Mark data carrying provenance metadata with an abstract type: Prov(O) O is a base type Two operations: No constructor! – only the runtime can create provenance-annotated data Print as a comment, because it cannot appear in a program anyway: “EdinTours” #(“Agencies”, “name”, 2)

Γ ⊢ data M : O Γ ⊢ M : Prov(O) Γ ⊢ prov M : (String, String, Int) Γ ⊢ M : Prov(O)

W

14

slide-15
SLIDE 15

query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] }

Language-integrated query in Links

15

slide-16
SLIDE 16

table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } var agencies =

Language-integrated query in Links

16

slide-17
SLIDE 17

table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } var agencies = agencies : [(oid: Int, agencies : [(name: String, agencies : [(based_in: String, agencies : [(phone: String)]

Language-integrated query in Links

17

slide-18
SLIDE 18

table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } var agencies = agencies : [(oid: Int, agencies : [(name: String, agencies : [(based_in: String, agencies : [(phone: String)] [(name = “EdinTours”, #(“ExternalTours”, “name”, 5) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “EdinTours”, #(“ExternalTours”, “name”, 6) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “Burns’s”, #(“ExternalTours”, “name”, 7) [(phone = “9307 2394 104”)] #(“Agencies”, “phone”, 2) : [(name: String, phone: String)]

Language-integrated query in Links

18

slide-19
SLIDE 19

table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } var agencies = agencies : [(oid: Int, agencies : [(name: String, agencies : [(based_in: String, agencies : [(phone: String)] [(name = “EdinTours”, #(“ExternalTours”, “name”, 5) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “EdinTours”, #(“ExternalTours”, “name”, 6) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “Burns’s”, #(“ExternalTours”, “name”, 7) [(phone = “9307 2394 104”)] #(“Agencies”, “phone”, 2) : [(name: Prov(String), phone: Prov(String))]

Where-provenance in LinksW

19

slide-20
SLIDE 20

table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } var agencies = agencies : [(oid: Int, agencies : [(name: String, agencies : [(based_in: String, agencies : [(phone: String)] [(name = “EdinTours”, #(“ExternalTours”, “name”, 5) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “EdinTours”, #(“ExternalTours”, “name”, 6) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “Burns’s”, #(“ExternalTours”, “name”, 7) [(phone = “9307 2394 104”)] #(“Agencies”, “phone”, 2) : [(name: Prov(String), phone: Prov(String))]

Where-provenance in LinksW

20

slide-21
SLIDE 21

table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default query { for (a <-- agencies) for (e <-- externalTours) where (a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } var agencies = agencies : [(oid: Int, agencies : [(name: Prov(String), agencies : [(based_in: String, agencies : [(phone: Prov(String))] [(name = “EdinTours”, #(“ExternalTours”, “name”, 5) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “EdinTours”, #(“ExternalTours”, “name”, 6) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “Burns’s”, #(“ExternalTours”, “name”, 7) [(phone = “9307 2394 104”)] #(“Agencies”, “phone”, 2) : [(name: Prov(String), phone: Prov(String))]

Where-provenance in LinksW

21

slide-22
SLIDE 22

table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default query { for (a <-- agencies) for (e <-- externalTours) where (data a.name == e.name && e.type == “boat”) [(name = e.name, [(phone = a.phone)] } var agencies = agencies : [(oid: Int, agencies : [(name: Prov(String), agencies : [(based_in: String, agencies : [(phone: Prov(String))] [(name = “EdinTours”, #(“ExternalTours”, “name”, 5) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “EdinTours”, #(“ExternalTours”, “name”, 6) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “Burns’s”, #(“ExternalTours”, “name”, 7) [(phone = “9307 2394 104”)] #(“Agencies”, “phone”, 2) : [(name: Prov(String), phone: Prov(String))]

Where-provenance in LinksW

22

slide-23
SLIDE 23

table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default query { for (a <-- agencies) for (e <-- externalTours) where (data a.name == data e.name && data e.type == “boat”) [(name = e.name, [(phone = a.phone)] } var agencies = agencies : [(oid: Int, agencies : [(name: Prov(String), agencies : [(based_in: String, agencies : [(phone: Prov(String))] [(name = “EdinTours”, #(“ExternalTours”, “name”, 5) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “EdinTours”, #(“ExternalTours”, “name”, 6) [(phone = “8740 2489 123”), #(“Agencies”, “phone”, 1) [(name = “Burns’s”, #(“ExternalTours”, “name”, 7) [(phone = “9307 2394 104”)] #(“Agencies”, “phone”, 2) : [(name: Prov(String), phone: Prov(String))]

Where-provenance in LinksW

23

slide-24
SLIDE 24

This talk

  • 1. Why?
  • 2. Language-integrated where-provenance in Links
  • 3. Rewriting Links to Links
  • 4. User-defined where-provenance
  • 5. Lineage in Links and its translation to Links
  • 6. Performance

W W

The paper

L

24

slide-25
SLIDE 25

Links

Prov(O) data M prov M table N with (a: A, …) where a prov default, … for (x <-- T) M update (x <-- T) M (data: O, prov: (String, String, Int)) M.data M.prov Pair of table and view with initial provenance annotations for (x <-- T.2()) M update (x <-- T.1) M

Links

W

25

slide-26
SLIDE 26

table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default

26

slide-27
SLIDE 27

(table “Agencies” (with (oid: Int, (with (name: String, (with (based_in: String, (with (phone: String), fun () { for (a <-- table “Agencies” with …) fo[(oid = a.oid, fo[(name = (data = a.name, fo[(name = (prov = (“Agencies”, “name”, a.oid)), fo[(based_in = a.based_in, fo[(phone = (data = a.phone, fo[(phone = (prov = (“Agencies”, “phone”, a.oid)))] }) table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default

27

slide-28
SLIDE 28

(table “Agencies” (with (oid: Int, (with (name: String, (with (based_in: String, (with (phone: String), fun () { for (a <-- table “Agencies” with …) fo[(oid = a.oid, fo[(name = (data = a.name, fo[(name = (prov = (“Agencies”, “name”, a.oid)), fo[(based_in = a.based_in, fo[(phone = (data = a.phone, fo[(phone = (prov = (“Agencies”, “phone”, a.oid)))] }) table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default

28

slide-29
SLIDE 29

(table “Agencies” (with (oid: Int, (with (name: String, (with (based_in: String, (with (phone: String), fun () { for (a <-- table “Agencies” with …) fo[(oid = a.oid, fo[(name = (data = a.name, fo[(name = (prov = (“Agencies”, “name”, a.oid)), fo[(based_in = a.based_in, fo[(phone = (data = a.phone, fo[(phone = (prov = (“Agencies”, “phone”, a.oid)))] }) table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov default

29

slide-30
SLIDE 30

query { for (a <-- agencies) for (e <-- externalTours) where (data a.name == data e.name && data e.type == “boat”) [(name = e.name, [(phone = a.phone)] }

30

slide-31
SLIDE 31

query { for (a <-- agencies) for (e <-- externalTours) where (data a.name == data e.name && data e.type == “boat”) [(name = e.name, [(phone = a.phone)] } query { for (a <- agencies.2()) for (e <- externalTours.2()) where (a.name.data == e.name.data && e.type.data == “boat”) [(name = e.name, [(phone = a.phone)] }

31

slide-32
SLIDE 32

Links

Prov(O) data M prov M table N with (a: A, …) where a prov default, … for (x <-- T) M update (x <-- T) M (data: O, prov: (String, String, Int)) M.data M.prov Pair of table and view with initial provenance annotations for (x <-- T.2()) M update (x <-- T.1) M

Links

W

32

slide-33
SLIDE 33

This talk

  • 1. Why?
  • 2. Language-integrated where-provenance in Links
  • 3. Rewriting Links to Links
  • 4. User-defined where-provenance
  • 5. Lineage in Links and its translation to Links
  • 6. Performance

W W

The paper

L

33

slide-34
SLIDE 34

User-defined where-provenance

34

table “Agencies” with (oid: Int, with (name: String, with (based_in: String, with (phone: String) where phone prov default, where name prov anyDBFun sig anyDBFun: (_) -> (String, String, Int) fun anyDBFun (r) { (“Answers”, “Life, Universe and everything…”, 42) } sig defaultAgenciesPhone: (_) -> (String, String, Int) fun defaultAgenciesPhone (r) { (“Agencies”, “phone”, r.oid) } fun () { for (a <-- table “Agencies” with …) fo[(oid = a.oid, fo[(name = (data = a.name, fo[(name = (prov = anyDBFun(a)), fo[(based_in = a.based_in, fo[(phone = (data = a.phone, fo[(phone = (prov = defaultAgenciesPhone(a)))] }

slide-35
SLIDE 35

Lineage in Links

  • No special type, lineage keyword triggers query rewriting
  • Query result changes from [A] to [(data: A, prov: [(String, Int)])]
  • Initial annotations on table with a view
  • Add input’s annotations to the body’s annotations
  • Need every function twice, for use within and outwith lineage blocks

35

L

〚for (x <-- N) M〛 = for (y <--〚N’〛) for (z <--〚 M’〛[y.data/x]) [(data = z.data, prov = y.prov ++ z.prov)]

slide-36
SLIDE 36

Performance

  • Slowdown for where-provenance: 1.2-2.8x
  • Slowdown for lineage: 1.3-7.6x
  • Comparable to calculating provenance on the database

36

slide-37
SLIDE 37

Conclusions

Building on language-integrated query technology we can build provenance tracking into the programming language Type-safe handling of provenance-annotated data No need for database plugins – works with any plain SQL database Next steps: richer queries, other forms of provenance, provenance for programming language values, other host languages, …

37

slide-38
SLIDE 38

query { for (a <- fun() { for (a <-- table “Agencies” …) for (a <- fun() { [(oid = a.oid, for (a <- fun() { [(name = (data = a.name, for (a <- fun() { [(name = (prov = (“Agencies”, “name”, a.oid)), for (a <- fun() { [(based_in = a.based_in, for (a <- fun() { [(phone = (data = a.phone, for (a <- fun() { [(phone = (prov = (“Agencies”, “phone”, a.oid)))]}()) for (e <- fun() { for (e <-- table “ExternalTours” …) … }()) where (a.name.data == e.name.data && e.type.data == “boat”) [(name = e.name, [(phone = a.phone)] }

38

slide-39
SLIDE 39

SELECT e.name AS name_data, SELECT ‘ExternalTours’ AS name_prov_1, SELECT ‘name’ AS name_prov_2, SELECT e.oid AS name_prov_3, SELECT a.phone AS phone_data, SELECT ‘Agencies’ AS phone_prov_1, SELECT ‘phone’ AS phone_prov_2, SELECT a.oid AS phone_prov_3 FROM Agencies AS a, FROM ExternalTours AS e WHERE a.name = e.name AND e.type = ‘boat’

39