NADEEF: A Commodity Data Cleaning System Data analytics, QCRI - - PowerPoint PPT Presentation

nadeef a commodity data cleaning system
SMART_READER_LITE
LIVE PREVIEW

NADEEF: A Commodity Data Cleaning System Data analytics, QCRI - - PowerPoint PPT Presentation

NADEEF: A Commodity Data Cleaning System Data analytics, QCRI Michele Dallachiesa Amr Ebaid Ahmed Eldawy University of Trento Purdue University University of Minnesota Ahmed Elmagarmid Ihab F. Ilyas Mourad Ouzzani Nan Tang 1 2


slide-1
SLIDE 1

NADEEF: A Commodity Data Cleaning System

Ahmed Elmagarmid Ihab F. Ilyas Mourad Ouzzani Nan Tang Michele Dallachiesa

University of Trento

Amr Ebaid

Purdue University

Ahmed Eldawy

University of Minnesota

Data analytics, QCRI

1

slide-2
SLIDE 2

2

slide-3
SLIDE 3

2

Bob should be standardized to Robert

slide-4
SLIDE 4

2

Country code determines a country

slide-5
SLIDE 5

2

slide-6
SLIDE 6

A Motivating Scenario

name street city CC country tel gd

David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M

name street city CC country phn when where

David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands

tran bank 3

slide-7
SLIDE 7

A Motivating Scenario

name street city CC country tel gd

David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M

name street city CC country phn when where

David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands

tran bank

If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands; ETL rules (lookup table) Extended CFDs

3

slide-8
SLIDE 8

A Motivating Scenario

name street city CC country tel gd

David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M

name street city CC country phn when where

David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands

tran bank

If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands; If the same person from different tables has different phones, the phone number from table bank is more reliable; Editing rules (w.r.t. master data)

3

slide-9
SLIDE 9

A Motivating Scenario

name street city CC country tel gd

David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M

name street city CC country phn when where

David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands

tran bank

If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands; If the same person from different tables has different phones, the phone number from table bank is more reliable; A country code (CC) uniquely determines a country CFDs (FDs)

3

slide-10
SLIDE 10

A Motivating Scenario

name street city CC country tel gd

David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M

name street city CC country phn when where

David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands

tran bank

If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands; If the same person from different tables has different phones, the phone number from table bank is more reliable; A country code (CC) uniquely determines a country If two purchases of the same person happened in the Netherlands and the US (East Coast) within 1 hour, these two purchases might be a fraud. Write a special-purpose application

3

slide-11
SLIDE 11

A Motivating Scenario

name street city CC country tel gd

David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M

name street city CC country phn when where

David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands

Challenging to capture multiple types of rules tran bank

If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands; If the same person from different tables has different phones, the phone number from table bank is more reliable; A country code (CC) uniquely determines a country If two purchases of the same person happened in the Netherlands and the US (East Coast) within 1 hour, these two purchases might be a fraud. Write a special-purpose application

3

slide-12
SLIDE 12

The User Perspective

CFD MD Customized rule

These are our data quality rules

4

slide-13
SLIDE 13

The User Perspective

CFD MD Customized rule

Data Cleaning System

These are our data quality rules

4

slide-14
SLIDE 14

The User Perspective

CFD MD Customized rule

Data Cleaning System

These are our data quality rules

Easy to specify and easy to deploy 4

slide-15
SLIDE 15
  • Heterogeneity
  • Interdependency
  • Deployment and

extensibility

  • Metadata management

and user interaction

Challenges

5

slide-16
SLIDE 16
  • Heterogeneity
  • Interdependency
  • Deployment and

extensibility

  • Metadata management

and user interaction

  • Integrity constraints (CFDs, DCs)

ETL rules, customized rules

Challenges

5

slide-17
SLIDE 17
  • Heterogeneity
  • Interdependency
  • Deployment and

extensibility

  • Metadata management

and user interaction

  • Integrity constraints (CFDs, DCs)

ETL rules, customized rules

  • Interaction of various types of rules

Challenges

5

slide-18
SLIDE 18
  • Heterogeneity
  • Interdependency
  • Deployment and

extensibility

  • Metadata management

and user interaction

  • Integrity constraints (CFDs, DCs)

ETL rules, customized rules

  • Interaction of various types of rules
  • Download, compile and run

Extend with new cleaning solutions

Challenges

5

slide-19
SLIDE 19
  • Heterogeneity
  • Interdependency
  • Deployment and

extensibility

  • Metadata management

and user interaction

  • Integrity constraints (CFDs, DCs)

ETL rules, customized rules

  • Interaction of various types of rules
  • Download, compile and run

Extend with new cleaning solutions

  • Dashboard and metadata profiling

Challenges

5

slide-20
SLIDE 20

NADEEF Architecture

Data Loader Rule Collector

Data

6

slide-21
SLIDE 21

NADEEF Architecture

Data Loader Rule Collector

Rule compiler Violation detection Data repairing

Rules Core

Detection and Cleaning Algorithms

Data

6

slide-22
SLIDE 22

NADEEF Architecture

Data Loader Rule Collector Metadata Management

Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing

Rules Core

Detection and Cleaning Algorithms

Metadata

Data Quality Dashboard

Data NADEEF

6

slide-23
SLIDE 23

NADEEF Architecture

Data Loader Rule Collector Metadata Management

Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing

Rules Core

Detection and Cleaning Algorithms

Metadata

Data Quality Dashboard

Data

Users

NADEEF

6

slide-24
SLIDE 24

NADEEF Architecture

Data Loader Rule Collector Metadata Management

Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing

Rules Core

Detection and Cleaning Algorithms

Metadata

Data Quality Dashboard

Data

Users

NADEEF

heterogeneity interdependency metadata management and data custodians extensibility 6

slide-25
SLIDE 25

NADEEF Architecture

Data Loader Rule Collector Metadata Management

Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing

Rules Core

Detection and Cleaning Algorithms

Metadata

Data Quality Dashboard

Data

Users

NADEEF

heterogeneity interdependency metadata management and data custodians Demo at VLDB 2013 extensibility 6

slide-26
SLIDE 26

NADEEF Architecture

Data Loader Rule Collector Metadata Management

Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing

Rules Core

Detection and Cleaning Algorithms

Metadata

Data Quality Dashboard

Data

Users

NADEEF

heterogeneity interdependency metadata management and data custodians Demo at VLDB 2013 extensibility 6

slide-27
SLIDE 27

NADEEF Architecture

Data Loader Rule Collector Metadata Management

Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing

Rules Core

Detection and Cleaning Algorithms

Metadata

Data Quality Dashboard

Data

Users

NADEEF

A commodity data cleaning system heterogeneity interdependency metadata management and data custodians Demo at VLDB 2013 extensibility 6

slide-28
SLIDE 28

Programming Interface

Rules NADEEF

7

slide-29
SLIDE 29

Programming Interface

Rules NADEEF Rule static semantics vio(tuple t) vio(tuple t1, tuple t2) dynamic semantics fix(violation V)

7

slide-30
SLIDE 30

Programming Interface

Rules NADEEF Rule static semantics vio(tuple t) vio(tuple t1, tuple t2) dynamic semantics fix(violation V)

A simple and declarative interface 7

slide-31
SLIDE 31

Sample Rules

(tran) If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands.

8

slide-32
SLIDE 32

Sample Rules

Class Rule1 { } set⟨cell⟩ vio(tuple t) { /*s in table tran */ if (t[CC] = 31 ∧ !(t[country] = Netherlands ∨ t[country] = Holland)) return { t[CC, country]; } return ∅; } set⟨Expression⟩ fix (set⟨cell⟩ V) { return { V.t[country] ← Netherlands; } }

(tran) If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands.

8

slide-33
SLIDE 33

Sample Rules

Class Rule1 { } set⟨cell⟩ vio(tuple t) { /*s in table tran */ if (t[CC] = 31 ∧ !(t[country] = Netherlands ∨ t[country] = Holland)) return { t[CC, country]; } return ∅; } set⟨Expression⟩ fix (set⟨cell⟩ V) { return { V.t[country] ← Netherlands; } }

static semantics: what is wrong

(tran) If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands.

8

slide-34
SLIDE 34

Sample Rules

Class Rule1 { } set⟨cell⟩ vio(tuple t) { /*s in table tran */ if (t[CC] = 31 ∧ !(t[country] = Netherlands ∨ t[country] = Holland)) return { t[CC, country]; } return ∅; } set⟨Expression⟩ fix (set⟨cell⟩ V) { return { V.t[country] ← Netherlands; } }

static semantics: what is wrong dynamic semantics: possible ways to repair

(tran) If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands.

8

slide-35
SLIDE 35

Sample Rules

(tran) If two purchases of the same person happened in the Netherlands and the US (East Coast) within 1 hour, these two purchases might be a fraud.

9

slide-36
SLIDE 36

Sample Rules

Class Rule4 { } set⟨cell⟩ vio(tuple t1, tuple t2) { /* t1, t2 in table tran */ if (t1[name] ≈ t2[name] ∧ t1[tel] = t2[tel] ∧ t1[where] = Netherlands ∧ t2[where] = US ∧ | t1[when] - t2[when] | <= 1 ) return { t1[name, tel, where, when]; t2[name, tel, where, when]; } return ∅; }

(tran) If two purchases of the same person happened in the Netherlands and the US (East Coast) within 1 hour, these two purchases might be a fraud.

9

slide-37
SLIDE 37

Sample Rules

Class Rule4 { } set⟨cell⟩ vio(tuple t1, tuple t2) { /* t1, t2 in table tran */ if (t1[name] ≈ t2[name] ∧ t1[tel] = t2[tel] ∧ t1[where] = Netherlands ∧ t2[where] = US ∧ | t1[when] - t2[when] | <= 1 ) return { t1[name, tel, where, when]; t2[name, tel, where, when]; } return ∅; }

static semantics: what is wrong

(tran) If two purchases of the same person happened in the Netherlands and the US (East Coast) within 1 hour, these two purchases might be a fraud.

9

slide-38
SLIDE 38

NADEEF Extensibility

10

slide-39
SLIDE 39

NADEEF Extensibility

10

slide-40
SLIDE 40

NADEEF Extensibility

10

slide-41
SLIDE 41

NADEEF Extensibility

NADEEF core 10

slide-42
SLIDE 42

NADEEF Extensibility

NADEEF core 10

slide-43
SLIDE 43

NADEEF Extensibility

NADEEF core 10

slide-44
SLIDE 44

Inside NADEEF

11

slide-45
SLIDE 45

name street city CC country tel gd

David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M

name street city CC country phn when where

David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands

tran bank 12

Inside NADEEF

Rule 1 Rule 2 Rule 3 Rule 4

slide-46
SLIDE 46

name street city CC country tel gd

David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M

name street city CC country phn when where

David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands

tran bank 12

UK Netherlands 66700543

Inside NADEEF

Rule 1 Rule 2 Rule 3 Rule 4

slide-47
SLIDE 47

name street city CC country tel gd

David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M

name street city CC country phn when where

David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands

tran bank 12

UK Netherlands 66700543

Inside NADEEF

Rule 1 Rule 2 Rule 3 Rule 4

slide-48
SLIDE 48

name street city CC country tel gd

David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M

name street city CC country phn when where

David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands

tran bank 12

UK Netherlands 66700543

Inside NADEEF

Rule 1 Rule 2 Rule 3 Rule 4

slide-49
SLIDE 49

Violation Detection

13

slide-50
SLIDE 50

Violation Detection

  • Brute force approach (black-boxes)

14

CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...

Violations: (r1, r3), (r2, r3)

slide-51
SLIDE 51

Violation Detection

  • Brute force approach (black-boxes)

14

CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...

Violations: (r1, r3), (r2, r3)

  • Optimized approach (white-boxes, e.g., CC->country)
slide-52
SLIDE 52

Violation Detection

  • Brute force approach (black-boxes)

14

CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...

Violations: (r1, r3), (r2, r3)

CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...

partition

  • Optimized approach (white-boxes, e.g., CC->country)
slide-53
SLIDE 53

Violation Detection

  • Brute force approach (black-boxes)

14

CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...

Violations: (r1, r3), (r2, r3)

CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...

partition

CC country ... r12 44 UK ... r3 44 Netherlands ... r4 31 UK ...

compression

  • Optimized approach (white-boxes, e.g., CC->country)
slide-54
SLIDE 54

Violation Detection

  • Brute force approach (black-boxes)

14

CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...

Violations: (r1, r3), (r2, r3)

CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...

partition

CC country ... r12 44 UK ... r3 44 Netherlands ... r4 31 UK ...

compression

(r12, r3)

  • Optimized approach (white-boxes, e.g., CC->country)
slide-55
SLIDE 55

Data Repairing

15

slide-56
SLIDE 56

Holistic Data Cleaning

16

rule specification data cleaning

slide-57
SLIDE 57

Holistic Data Cleaning

Violations V1: {r4[CC, country]} V2: {t1[name, street, city, tel], r3[name, street, city, phn]} V3: {r1[CC,country], r3[CC, country]} V4: {r2[CC,country], r3[CC, country]} V5: {r1[name, tel, where, when], r3[name, tel, where, when]}

16

rule specification data cleaning

slide-58
SLIDE 58

Holistic Data Cleaning

Violations V1: {r4[CC, country]} V2: {t1[name, street, city, tel], r3[name, street, city, phn]} V3: {r1[CC,country], r3[CC, country]} V4: {r2[CC,country], r3[CC, country]} V5: {r1[name, tel, where, when], r3[name, tel, where, when]} Candidate fixes F1: r4[country]←Netherlands F2: r3[phn] ← t1[tel] F3: r1[country] ← r3[country] F4: r3[country] ← r1[country] F5: r2[country] ← r3[country] F6: r3[country] ← r2[country]

16

rule specification data cleaning

slide-59
SLIDE 59

A Variable-Weighted Max-SAT

Violations Candidate fixes

17

slide-60
SLIDE 60

A Variable-Weighted Max-SAT

Violations Candidate fixes CNF

17

slide-61
SLIDE 61

A Variable-Weighted Max-SAT

Violations Candidate fixes CNF

Variable-weighted MAX-SAT solver

17

slide-62
SLIDE 62

A Variable-Weighted Max-SAT

Violations Candidate fixes CNF

Variable-weighted MAX-SAT solver

Repairs

17

slide-63
SLIDE 63

A Variable-Weighted Max-SAT

Violations Candidate fixes CNF

Variable-weighted MAX-SAT solver

Repairs

variable

17

slide-64
SLIDE 64

A Variable-Weighted Max-SAT

Violations Candidate fixes CNF

Variable-weighted MAX-SAT solver

Repairs

variable inclusive assignment

17

slide-65
SLIDE 65

A Variable-Weighted Max-SAT

Violations Candidate fixes CNF

Variable-weighted MAX-SAT solver

Repairs

variable inclusive assignment exclusive assignment

17

slide-66
SLIDE 66

A Variable-Weighted Max-SAT

Violations Candidate fixes CNF

Variable-weighted MAX-SAT solver

Repairs

variable inclusive assignment exclusive assignment avoid violations

17

slide-67
SLIDE 67

Experimental Study

18

slide-68
SLIDE 68

Effectiveness

0.2 0.4 0.6 0.8 1 2 4 6 8 10

NADEEF HEU VEU

0.2 0.4 0.6 0.8 1 2 4 6 8 10 F-measure F-measure noise% noise%

(a) Hospital dataset (100K, 9 attributes, 10 rules) (b) Bus dataset (160K, 16 attributes, 11 rules)

slide-69
SLIDE 69

Conclusion

  • A generalized programming interface (heterogeneity)
  • Holistic data cleaning (interdependency)
  • An extensible system (extensibility)

20