SLIDE 1

Using PROV-O to represent lineage in statistical processes: a record linkage example

Flavio Rizzolo – Statistics Canada
Guillaume Duffes – Institut National de la Statistique et des Études Économiques
Franck Cotton – Institut National de la Statistique et des Études Économiques

SLIDE 2

SemStats 2019 Lineage metadata for record linkage with PROV-O 2

Contents

- Context and objectives
- Record linkage and lineage metadata
- What is PROV-O?
- PROV-O representations for record linkage lineage
- Conclusions and future work

SLIDE 3

Context and objectives

- Statistical offices need to provide trusted data
- Information on how the data was produced helps to build that trust
- Provenance and lineage metadata are information on:
  - Processes and methods used
  - Actors involved (data providers, owners, publishers, etc.)
  - Relations between data outputs and data sources
- That metadata should:
  - Use a standard model in order to be easily understandable
  - Be accessible and (machine-)usable

SLIDE 4

Context and objectives

- Main goal of the paper: a proof of concept for using the PROV model to represent lineage information on statistical processes
- Record linkage chosen as the example process:
  - Sufficiently complex, but not too much
  - Widely used in statistical production
  - Formal descriptions already available
  - Lineage metadata can be defined at various levels of detail
  - Various software packages exist

SLIDE 5

Context and objectives

This is a very practical work, not groundbreaking research

SLIDE 6

Record linkage and lineage metadata

Record linkage

- Matching of data about real-world entities (people, businesses, products…) coming from different data sources
- Widely used (e.g. data integration), lots of methodological work
- Even a dedicated record linkage process model (Statistics Canada)
- Typical process:

[Figure: typical process. Source A and Source B go through automatic matching (outcomes: Match, Possible, Non-match), followed by expert review (outcomes: Match, Non-match).]
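The automatic matching step in the typical process can be sketched in a few lines of Python. This is only an illustration, not the method used by the authors or by any particular linkage package: the similarity measure (average string similarity from the standard library), the two thresholds, the field names, and the sample records are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds, chosen only for illustration.
MATCH_THRESHOLD = 0.9
NON_MATCH_THRESHOLD = 0.6


def similarity(record_a, record_b, fields):
    """Average string similarity (0.0 to 1.0) over the compared fields."""
    scores = [
        SequenceMatcher(
            None, str(record_a[f]).lower(), str(record_b[f]).lower()
        ).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def classify(score):
    """Automatic matching: 'match', 'non-match', or 'possible' (for expert review)."""
    if score >= MATCH_THRESHOLD:
        return "match"
    if score < NON_MATCH_THRESHOLD:
        return "non-match"
    return "possible"


# Made-up sample records standing in for Source A and Source B.
source_a = [{"id": "a1", "name": "Jane Smith", "city": "Ottawa"}]
source_b = [
    {"id": "b1", "name": "Jane Smith", "city": "Ottawa"},
    {"id": "b2", "name": "Paul Martin", "city": "Paris"},
]

# Compare every pair of records and keep the automatic decision.
results = {
    (a["id"], b["id"]): classify(similarity(a, b, ["name", "city"]))
    for a in source_a
    for b in source_b
}
```

Pairs classified as "possible" are the ones that would be passed on to expert review.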

SLIDE 7

Record linkage and lineage metadata

Lineage model

SLIDE 8

Record linkage and lineage metadata

Types of lineage metadata

Dataset lineage

A dataset is derived from others by record linkage: keep track of sources and transformations applied

Record lineage

Track where the record comes from or which records are its contributors and what integration was applied

Variable lineage

Track how a variable (e.g. linkage key) is derived from variables in source datasets

Data point lineage

Not used for record linkage but heavily used in upstream tasks like data cleansing

SLIDE 9

What is PROV-O?

- W3C recommendation, part of the PROV family (provenance metadata)
- OWL2 expression of the PROV data model
- Simple “starting point” model, plus expanded terms and a qualification mechanism

Starting point terms

[Figure: PROV-O starting point terms. Classes: Entity, Activity, Agent. Properties: used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith, wasAttributedTo, actedOnBehalfOf, startedAtTime and endedAtTime (xsd:dateTime).]
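As a rough illustration of the starting point terms, a high-level record linkage can be described with a handful of statements. The sketch below stores them as plain Python tuples rather than in an RDF library. Every identifier in the `ex:` namespace is hypothetical; only the `prov:` terms come from the PROV-O vocabulary.

```python
# Each statement is a (subject, predicate, object) tuple.
# All "ex:" names are hypothetical and exist only for this example.
triples = [
    ("ex:sourceA", "rdf:type", "prov:Entity"),
    ("ex:sourceB", "rdf:type", "prov:Entity"),
    ("ex:linkedDataset", "rdf:type", "prov:Entity"),
    ("ex:recordLinkage", "rdf:type", "prov:Activity"),
    ("ex:linkageTeam", "rdf:type", "prov:Agent"),
    # The activity used both sources and was carried out by the team.
    ("ex:recordLinkage", "prov:used", "ex:sourceA"),
    ("ex:recordLinkage", "prov:used", "ex:sourceB"),
    ("ex:recordLinkage", "prov:wasAssociatedWith", "ex:linkageTeam"),
    # The output was generated by the activity and derived from the sources.
    ("ex:linkedDataset", "prov:wasGeneratedBy", "ex:recordLinkage"),
    ("ex:linkedDataset", "prov:wasDerivedFrom", "ex:sourceA"),
    ("ex:linkedDataset", "prov:wasDerivedFrom", "ex:sourceB"),
    ("ex:linkedDataset", "prov:wasAttributedTo", "ex:linkageTeam"),
]


def objects(subject, predicate):
    """Return the sorted objects of all statements with this subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


inputs = objects("ex:recordLinkage", "prov:used")
sources = objects("ex:linkedDataset", "prov:wasDerivedFrom")
```

Here `inputs` and `sources` both list the two source datasets, once seen from the activity and once seen from the output entity.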

SLIDE 10

What is PROV-O?

Expanded terms

[Figure: PROV-O expanded terms. Classes: Entity, Collection, Plan, Bundle, Activity, Agent, Person, SoftwareAgent, Organization, Location. Properties: generatedAtTime and invalidatedAtTime (xsd:dateTime), value, hadMember, wasStartedBy, wasEndedBy, wasInvalidatedBy, wasInfluencedBy, wasQuotedFrom, wasRevisionOf, hadPrimarySource, alternateOf, specializationOf, atLocation.]
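Some of the expanded terms lend themselves to finer-grained lineage. The sketch below assumes, for illustration only, that each dataset is modelled as a `prov:Collection` whose records are attached with `prov:hadMember`, and that the linkage software is a `prov:SoftwareAgent` acting on behalf of a `prov:Organization`. All `ex:` identifiers are hypothetical, and the statements are kept as plain Python tuples.

```python
# (subject, predicate, object) statements; "ex:" names are hypothetical.
triples = [
    ("ex:sourceA", "rdf:type", "prov:Collection"),
    ("ex:sourceA", "prov:hadMember", "ex:recordA1"),
    ("ex:sourceA", "prov:hadMember", "ex:recordA2"),
    ("ex:linkedDataset", "rdf:type", "prov:Collection"),
    ("ex:linkedDataset", "prov:hadMember", "ex:linkedRecord1"),
    ("ex:linkageTool", "rdf:type", "prov:SoftwareAgent"),
    ("ex:statOffice", "rdf:type", "prov:Organization"),
    ("ex:linkageTool", "prov:actedOnBehalfOf", "ex:statOffice"),
]


def members_of(collection):
    """Records attached to a dataset through prov:hadMember."""
    return sorted(
        o for s, p, o in triples if s == collection and p == "prov:hadMember"
    )


def instances_of(prov_class):
    """Resources typed with the given PROV-O class."""
    return sorted(o_s for o_s, p, o in triples if p == "rdf:type" and o == prov_class)


records_in_a = members_of("ex:sourceA")
collections = instances_of("prov:Collection")
```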

SLIDE 11

What is PROV-O?

Qualification mechanism
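In PROV-O, the qualification mechanism replaces a direct statement such as `prov:used` with an intermediate resource (here a `prov:Usage` or a `prov:Association`) that can carry extra detail, for instance a role or a plan. The sketch below is an illustration with hypothetical `ex:` identifiers and plain Python tuples. It shows two qualified statements and how the corresponding simple statements can be recovered from them.

```python
# Qualified property -> (property pointing at the influencer, simple property).
# These pairings follow the PROV-O vocabulary.
QUALIFIED = {
    "prov:qualifiedUsage": ("prov:entity", "prov:used"),
    "prov:qualifiedAssociation": ("prov:agent", "prov:wasAssociatedWith"),
}

# All "ex:" names are hypothetical and exist only for this example.
triples = [
    ("ex:recordLinkage", "prov:qualifiedUsage", "ex:usageOfSourceA"),
    ("ex:usageOfSourceA", "rdf:type", "prov:Usage"),
    ("ex:usageOfSourceA", "prov:entity", "ex:sourceA"),
    ("ex:usageOfSourceA", "prov:hadRole", "ex:referenceFile"),
    ("ex:recordLinkage", "prov:qualifiedAssociation", "ex:association1"),
    ("ex:association1", "rdf:type", "prov:Association"),
    ("ex:association1", "prov:agent", "ex:linkageTool"),
    ("ex:association1", "prov:hadPlan", "ex:linkageMethodology"),
]


def unqualified(statements):
    """Derive the simple statements implied by the qualified ones."""
    derived = []
    for subject, predicate, node in statements:
        if predicate not in QUALIFIED:
            continue
        influencer_property, simple_property = QUALIFIED[predicate]
        for s, p, o in statements:
            if s == node and p == influencer_property:
                derived.append((subject, simple_property, o))
    return derived


simple_statements = unqualified(triples)
```

The role (`prov:hadRole`) and the plan (`prov:hadPlan`) are the kind of detail that the simple statements cannot express on their own.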

SLIDE 12

PROV-O representations for record linkage lineage

Simple example: the high-level view

SLIDE 13

PROV-O representations for record linkage lineage

Simple example: the high-level view

SLIDE 14

PROV-O representations for record linkage lineage

Simple example: the high-level view

SLIDE 15

PROV-O representations for record linkage lineage

Simple example: the high-level view

SLIDE 16

PROV-O representations for record linkage lineage

The record linkage process (simplified)

SLIDE 17

PROV-O representations for record linkage lineage

Produce linkage-ready datasets – process

SLIDE 18

PROV-O representations for record linkage lineage

Produce linkage-ready datasets – PROV-O representation

SLIDE 19

PROV-O representations for record linkage lineage

Produce linkage keys – process

SLIDE 20

PROV-O representations for record linkage lineage

Produce linkage keys – PROV-O representation – blocking

SLIDE 21

PROV-O representations for record linkage lineage

Produce linkage keys – PROV-O representation – linking

SLIDE 22

Conclusions and future work

Proof of concept conclusive

- PROV-O can be used to represent the process
- Using PROV-O makes it possible to represent the different levels of lineage metadata coherently
- The “Russian dolls” nature of the PROV-O model implies that metadata can be produced at different levels
- Examples of queries that can be made:
  - List the output datasets produced from a given data source
  - Which dataset(s) does this record come from?
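The two example queries can be sketched over a small set of lineage statements. This is an illustration only: the statements are plain Python tuples, all `ex:` identifiers are hypothetical, and the modelling choice (datasets linked by `prov:wasDerivedFrom`, records attached to datasets by `prov:hadMember`) is an assumption rather than the representation used in the paper. In practice such questions would more likely be asked in SPARQL against an RDF store.

```python
# (subject, predicate, object) statements; "ex:" names are hypothetical.
triples = [
    # Dataset lineage: sources -> linkage-ready datasets -> linked dataset.
    ("ex:cleanA", "prov:wasDerivedFrom", "ex:sourceA"),
    ("ex:cleanB", "prov:wasDerivedFrom", "ex:sourceB"),
    ("ex:linkedDataset", "prov:wasDerivedFrom", "ex:cleanA"),
    ("ex:linkedDataset", "prov:wasDerivedFrom", "ex:cleanB"),
    # Record lineage: membership and record-level derivation.
    ("ex:cleanA", "prov:hadMember", "ex:recordA1"),
    ("ex:cleanB", "prov:hadMember", "ex:recordB7"),
    ("ex:linkedDataset", "prov:hadMember", "ex:linkedRecord1"),
    ("ex:linkedRecord1", "prov:wasDerivedFrom", "ex:recordA1"),
    ("ex:linkedRecord1", "prov:wasDerivedFrom", "ex:recordB7"),
]


def derived_from(source):
    """Query 1: everything derived, directly or transitively, from a source."""
    found, frontier = set(), [source]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == "prov:wasDerivedFrom" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return sorted(found)


def datasets_of(record):
    """Query 2: datasets holding the records this record was derived from."""
    contributors = {
        o for s, p, o in triples if s == record and p == "prov:wasDerivedFrom"
    }
    return sorted(
        s for s, p, o in triples if p == "prov:hadMember" and o in contributors
    )


outputs_from_a = derived_from("ex:sourceA")
origin_of_record = datasets_of("ex:linkedRecord1")
```

Note that `derived_from` returns intermediate datasets as well as final outputs; restricting the result to final outputs would need an extra filter.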

SLIDE 23

Conclusions and future work

Future work

- Continue work on record linkage, in particular on the representation of methodology
- Test how to automate the production of metadata in commonly used software
- Study the possibility of making metadata active (i.e. using it as a specification)
- Adapt to other statistical operations (e.g. data editing, variable derivation...)
- Promote the work in the Official Statistics community

SLIDE 24

Thank you for your attention

Any questions?