NADEEF: A Commodity Data Cleaning System
Ahmed Elmagarmid Ihab F. Ilyas Mourad Ouzzani Nan Tang Michele Dallachiesa
University of Trento
Amr Ebaid
Purdue University
Ahmed Eldawy
University of Minnesota
Data analytics, QCRI
1
NADEEF: A Commodity Data Cleaning System Data analytics, QCRI - - PowerPoint PPT Presentation
NADEEF: A Commodity Data Cleaning System Data analytics, QCRI Michele Dallachiesa Amr Ebaid Ahmed Eldawy University of Trento Purdue University University of Minnesota Ahmed Elmagarmid Ihab F. Ilyas Mourad Ouzzani Nan Tang 1 2
Ahmed Elmagarmid Ihab F. Ilyas Mourad Ouzzani Nan Tang Michele Dallachiesa
University of Trento
Amr Ebaid
Purdue University
Ahmed Eldawy
University of Minnesota
1
2
2
Bob should be standardized to Robert
2
Country code determines a country
2
name street city CC country tel gd
David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M
name street city CC country phn when where
David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands
tran bank 3
name street city CC country tel gd
David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M
name street city CC country phn when where
David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands
tran bank
If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands; ETL rules (lookup table) Extended CFDs
3
name street city CC country tel gd
David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M
name street city CC country phn when where
David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands
tran bank
If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands; If the same person from different tables has different phones, the phone number from table bank is more reliable; Editing rules (w.r.t. master data)
3
name street city CC country tel gd
David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M
name street city CC country phn when where
David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands
tran bank
If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands; If the same person from different tables has different phones, the phone number from table bank is more reliable; A country code (CC) uniquely determines a country CFDs (FDs)
3
name street city CC country tel gd
David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M
name street city CC country phn when where
David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands
tran bank
If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands; If the same person from different tables has different phones, the phone number from table bank is more reliable; A country code (CC) uniquely determines a country If two purchases of the same person happened in the Netherlands and the US (East Coast) within 1 hour, these two purchases might be a fraud. Write a special-purpose application
3
name street city CC country tel gd
David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M
name street city CC country phn when where
David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands
Challenging to capture multiple types of rules tran bank
If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands; If the same person from different tables has different phones, the phone number from table bank is more reliable; A country code (CC) uniquely determines a country If two purchases of the same person happened in the Netherlands and the US (East Coast) within 1 hour, these two purchases might be a fraud. Write a special-purpose application
3
CFD MD Customized rule
These are our data quality rules
4
CFD MD Customized rule
Data Cleaning System
These are our data quality rules
4
CFD MD Customized rule
Data Cleaning System
These are our data quality rules
Easy to specify and easy to deploy 4
extensibility
and user interaction
5
extensibility
and user interaction
ETL rules, customized rules
5
extensibility
and user interaction
ETL rules, customized rules
5
extensibility
and user interaction
ETL rules, customized rules
Extend with new cleaning solutions
5
extensibility
and user interaction
ETL rules, customized rules
Extend with new cleaning solutions
5
Data Loader Rule Collector
Data
6
Data Loader Rule Collector
Rule compiler Violation detection Data repairing
Rules Core
Detection and Cleaning Algorithms
Data
6
Data Loader Rule Collector Metadata Management
Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing
Rules Core
Detection and Cleaning Algorithms
Metadata
Data Quality Dashboard
Data NADEEF
6
Data Loader Rule Collector Metadata Management
Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing
Rules Core
Detection and Cleaning Algorithms
Metadata
Data Quality Dashboard
Data
Users
NADEEF
6
Data Loader Rule Collector Metadata Management
Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing
Rules Core
Detection and Cleaning Algorithms
Metadata
Data Quality Dashboard
Data
Users
NADEEF
heterogeneity interdependency metadata management and data custodians extensibility 6
Data Loader Rule Collector Metadata Management
Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing
Rules Core
Detection and Cleaning Algorithms
Metadata
Data Quality Dashboard
Data
Users
NADEEF
heterogeneity interdependency metadata management and data custodians Demo at VLDB 2013 extensibility 6
Data Loader Rule Collector Metadata Management
Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing
Rules Core
Detection and Cleaning Algorithms
Metadata
Data Quality Dashboard
Data
Users
NADEEF
heterogeneity interdependency metadata management and data custodians Demo at VLDB 2013 extensibility 6
Data Loader Rule Collector Metadata Management
Auditing and lineage Indices Probabilistic models Rule compiler Violation detection Data repairing
Rules Core
Detection and Cleaning Algorithms
Metadata
Data Quality Dashboard
Data
Users
NADEEF
A commodity data cleaning system heterogeneity interdependency metadata management and data custodians Demo at VLDB 2013 extensibility 6
Rules NADEEF
7
Rules NADEEF Rule static semantics vio(tuple t) vio(tuple t1, tuple t2) dynamic semantics fix(violation V)
7
Rules NADEEF Rule static semantics vio(tuple t) vio(tuple t1, tuple t2) dynamic semantics fix(violation V)
A simple and declarative interface 7
(tran) If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands.
8
Class Rule1 { } set⟨cell⟩ vio(tuple t) { /*s in table tran */ if (t[CC] = 31 ∧ !(t[country] = Netherlands ∨ t[country] = Holland)) return { t[CC, country]; } return ∅; } set⟨Expression⟩ fix (set⟨cell⟩ V) { return { V.t[country] ← Netherlands; } }
(tran) If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands.
8
Class Rule1 { } set⟨cell⟩ vio(tuple t) { /*s in table tran */ if (t[CC] = 31 ∧ !(t[country] = Netherlands ∨ t[country] = Holland)) return { t[CC, country]; } return ∅; } set⟨Expression⟩ fix (set⟨cell⟩ V) { return { V.t[country] ← Netherlands; } }
static semantics: what is wrong
(tran) If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands.
8
Class Rule1 { } set⟨cell⟩ vio(tuple t) { /*s in table tran */ if (t[CC] = 31 ∧ !(t[country] = Netherlands ∨ t[country] = Holland)) return { t[CC, country]; } return ∅; } set⟨Expression⟩ fix (set⟨cell⟩ V) { return { V.t[country] ← Netherlands; } }
static semantics: what is wrong dynamic semantics: possible ways to repair
(tran) If a customer’s CC is 31, but his/her country is neither Netherlands nor Holland, update the country to Netherlands.
8
(tran) If two purchases of the same person happened in the Netherlands and the US (East Coast) within 1 hour, these two purchases might be a fraud.
9
Class Rule4 { } set⟨cell⟩ vio(tuple t1, tuple t2) { /* t1, t2 in table tran */ if (t1[name] ≈ t2[name] ∧ t1[tel] = t2[tel] ∧ t1[where] = Netherlands ∧ t2[where] = US ∧ | t1[when] - t2[when] | <= 1 ) return { t1[name, tel, where, when]; t2[name, tel, where, when]; } return ∅; }
(tran) If two purchases of the same person happened in the Netherlands and the US (East Coast) within 1 hour, these two purchases might be a fraud.
9
Class Rule4 { } set⟨cell⟩ vio(tuple t1, tuple t2) { /* t1, t2 in table tran */ if (t1[name] ≈ t2[name] ∧ t1[tel] = t2[tel] ∧ t1[where] = Netherlands ∧ t2[where] = US ∧ | t1[when] - t2[when] | <= 1 ) return { t1[name, tel, where, when]; t2[name, tel, where, when]; } return ∅; }
static semantics: what is wrong
(tran) If two purchases of the same person happened in the Netherlands and the US (East Coast) within 1 hour, these two purchases might be a fraud.
9
10
10
10
NADEEF core 10
NADEEF core 10
NADEEF core 10
11
name street city CC country tel gd
David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M
name street city CC country phn when where
David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands
tran bank 12
Rule 1 Rule 2 Rule 3 Rule 4
name street city CC country tel gd
David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M
name street city CC country phn when where
David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands
tran bank 12
UK Netherlands 66700543
Rule 1 Rule 2 Rule 3 Rule 4
name street city CC country tel gd
David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M
name street city CC country phn when where
David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands
tran bank 12
UK Netherlands 66700543
Rule 1 Rule 2 Rule 3 Rule 4
name street city CC country tel gd
David Holywell Oxford 44 UK 66700543 M Paul Ratcliffe Oxford 44 UK 44944631 M
name street city CC country phn when where
David Holywell Oxford 44 UK 66700543 1pm 6/05/2012 Netherlands Paul Ratcliffe Oxford 44 UK 44944631 11am 2/12/2011 Netherlands David Holywell Oxford 44 Netherlands 66700541 6am 6/05/2012 US Paul Market Amsterdam 31 UK 55384922 9am 6/02/2012 Netherlands
tran bank 12
UK Netherlands 66700543
Rule 1 Rule 2 Rule 3 Rule 4
13
14
CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...
Violations: (r1, r3), (r2, r3)
14
CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...
Violations: (r1, r3), (r2, r3)
14
CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...
Violations: (r1, r3), (r2, r3)
CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...
partition
14
CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...
Violations: (r1, r3), (r2, r3)
CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...
partition
CC country ... r12 44 UK ... r3 44 Netherlands ... r4 31 UK ...
compression
14
CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...
Violations: (r1, r3), (r2, r3)
CC country ... r1 44 UK ... r2 44 UK ... r3 44 Netherlands ... r4 31 UK ...
partition
CC country ... r12 44 UK ... r3 44 Netherlands ... r4 31 UK ...
compression
(r12, r3)
15
16
rule specification data cleaning
Violations V1: {r4[CC, country]} V2: {t1[name, street, city, tel], r3[name, street, city, phn]} V3: {r1[CC,country], r3[CC, country]} V4: {r2[CC,country], r3[CC, country]} V5: {r1[name, tel, where, when], r3[name, tel, where, when]}
16
rule specification data cleaning
Violations V1: {r4[CC, country]} V2: {t1[name, street, city, tel], r3[name, street, city, phn]} V3: {r1[CC,country], r3[CC, country]} V4: {r2[CC,country], r3[CC, country]} V5: {r1[name, tel, where, when], r3[name, tel, where, when]} Candidate fixes F1: r4[country]←Netherlands F2: r3[phn] ← t1[tel] F3: r1[country] ← r3[country] F4: r3[country] ← r1[country] F5: r2[country] ← r3[country] F6: r3[country] ← r2[country]
16
rule specification data cleaning
Violations Candidate fixes
17
Violations Candidate fixes CNF
17
Violations Candidate fixes CNF
Variable-weighted MAX-SAT solver
17
Violations Candidate fixes CNF
Variable-weighted MAX-SAT solver
Repairs
17
Violations Candidate fixes CNF
Variable-weighted MAX-SAT solver
Repairs
variable
17
Violations Candidate fixes CNF
Variable-weighted MAX-SAT solver
Repairs
variable inclusive assignment
17
Violations Candidate fixes CNF
Variable-weighted MAX-SAT solver
Repairs
variable inclusive assignment exclusive assignment
17
Violations Candidate fixes CNF
Variable-weighted MAX-SAT solver
Repairs
variable inclusive assignment exclusive assignment avoid violations
17
18
0.2 0.4 0.6 0.8 1 2 4 6 8 10
NADEEF HEU VEU
0.2 0.4 0.6 0.8 1 2 4 6 8 10 F-measure F-measure noise% noise%
(a) Hospital dataset (100K, 9 attributes, 10 rules) (b) Bus dataset (160K, 16 attributes, 11 rules)
20