Data Cleaning
Nan Tang, QCRI
Data is Dirty
incomplete, inconsistent, inaccurate, …
25% companies: flawed data
3+ trillion $: US economy
20%: labor productivity
… …
Data Cleaning Market
Data Explorer
Data Quality 11g
Data Cleaning Problems

name          graduated  affiliation              country    capital    age
Nan Tang      CUHK       QCRI                     Qatari     Doha       33
Xiaokui Xiao  CUHK       NTU                      Singapore  Singapore
Nan Tang      CUHK       University of Edinburgh  UK         Edinburgh  31
Gao Cong      NUS        University of Edinburgh  UK         London     36

typo: Qatar
Duplicates
Consistency
Currency
Completeness

truth discovery
name      affiliation
Nan Tang  QCRI
source2
name      affiliation
Nan Tang  CWI
source3
… … … …

ETL (transformation)
name  full
NTU   Nanyang Technological University
NUS   National University of Singapore

KBs (e.g., Yago): (UK, hasCapital, London)

Heterogeneous sources
Volume Velocity V…
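Three of the problems named on this slide (completeness, duplicates, and consistency against a knowledge base) can be checked mechanically. The following is a minimal sketch, not code from the talk: the function names are hypothetical, the missing cell in the Xiaokui Xiao row is read as the age, and the knowledge base is reduced to the single triple shown on the slide.

```python
from collections import Counter

# Rows copied from the slide; None marks the cell assumed to be missing.
ROWS = [
    {"name": "Nan Tang", "graduated": "CUHK", "affiliation": "QCRI",
     "country": "Qatari", "capital": "Doha", "age": 33},
    {"name": "Xiaokui Xiao", "graduated": "CUHK", "affiliation": "NTU",
     "country": "Singapore", "capital": "Singapore", "age": None},
    {"name": "Nan Tang", "graduated": "CUHK",
     "affiliation": "University of Edinburgh",
     "country": "UK", "capital": "Edinburgh", "age": 31},
    {"name": "Gao Cong", "graduated": "NUS",
     "affiliation": "University of Edinburgh",
     "country": "UK", "capital": "London", "age": 36},
]

# The (UK, hasCapital, London) triple from the slide, as a lookup table.
KB_CAPITALS = {"UK": "London"}


def missing_cells(rows):
    """Completeness: (row index, attribute) for every empty cell."""
    return [(i, attr) for i, row in enumerate(rows)
            for attr, value in row.items() if value is None]


def duplicate_names(rows):
    """Duplicates: names that occur in more than one row."""
    counts = Counter(row["name"] for row in rows)
    return sorted(name for name, n in counts.items() if n > 1)


def kb_conflicts(rows, kb):
    """Consistency: indices of rows whose capital disagrees with the KB."""
    return [i for i, row in enumerate(rows)
            if row["country"] in kb and row["capital"] != kb[row["country"]]]
```

Matching on the name alone only produces duplicate candidates; deciding whether the two Nan Tang rows describe the same person, and which affiliation is current, is the harder part the slide points at.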
Data Cleaning Solutions

Generic system
  NADEEF (SIGMOD 2013)
  Dashboard (VLDB 2013 demo)
  ER (SIGMOD 2014 demo)
  Open source
  Commercialize

Data repairing
  Heuristic methods: equivalence class, set cover, SAT solver
  Confidence values (SIGMOD 2011)
  User guided (VLDB 2010 best paper)

Error detection
  Functional dependency
  CFD (ICDE 2012)
  Denial constraints
  Currency constraints (ICDE 2013)
  Fixing rules (SIGMOD 2014)
  ... ...

Rule discovery
  DCs (QCRI, VLDB 2014, ICDE 2014 demo)
  CFDs
  Unique columns (QCRI, ICDE 2013, VLDB 2014)

Connector labels: input, input
Error Detection

emp
     name  country      capital    city       salary  tax
r1   Nan   China        Beijing    Beijing    50000   1000
r2   Yin   China        Shanghai   Hongkong   40000   1200
r3   Si    Netherlands  Den Hagg   Utrecht    60000   1400
r4   Lei   Netherlands  Amsterdam  Amsterdam  35000   800

cap
     country  capital
s1   China    Beijing
s2   Canada   Ottawa
s3   …        …

FD: [country] -> [capital]
CFD: [country = China] -> [capital = Beijing]
DC: ¬∃ t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax)
MD: (emp[country] = cap[country]) -> (emp[capital] <=> cap[capital])

Inclusion dependency, currency constraint, sequential dependency, aggregation constraint, accuracy constraint, … …
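The FD, CFD, and DC on this slide can each be read as a query that returns the violating tuples. The sketch below evaluates them by brute force over the emp table; the function names are hypothetical and the quadratic pair enumeration is chosen for clarity, not for scale.

```python
from itertools import combinations

# The emp table from the slide, keyed by tuple id.
EMP = {
    "r1": {"name": "Nan", "country": "China", "capital": "Beijing",
           "city": "Beijing", "salary": 50000, "tax": 1000},
    "r2": {"name": "Yin", "country": "China", "capital": "Shanghai",
           "city": "Hongkong", "salary": 40000, "tax": 1200},
    "r3": {"name": "Si", "country": "Netherlands", "capital": "Den Hagg",
           "city": "Utrecht", "salary": 60000, "tax": 1400},
    "r4": {"name": "Lei", "country": "Netherlands", "capital": "Amsterdam",
           "city": "Amsterdam", "salary": 35000, "tax": 800},
}


def fd_violations(table, lhs, rhs):
    """FD lhs -> rhs: pairs of tuples that agree on lhs but differ on rhs."""
    return [(a, b) for a, b in combinations(sorted(table), 2)
            if table[a][lhs] == table[b][lhs]
            and table[a][rhs] != table[b][rhs]]


def cfd_violations(table, lhs, lhs_value, rhs, rhs_value):
    """CFD with constants: single tuples matching lhs_value but not rhs_value."""
    return [t for t in sorted(table)
            if table[t][lhs] == lhs_value and table[t][rhs] != rhs_value]


def dc_violations(table):
    """DC: ordered pairs (t1, t2) with t1.salary > t2.salary and t1.tax < t2.tax."""
    return [(a, b) for a in sorted(table) for b in sorted(table)
            if table[a]["salary"] > table[b]["salary"]
            and table[a]["tax"] < table[b]["tax"]]
```

On this table the FD reports the pairs (r1, r2) and (r3, r4) without saying which side is wrong, the CFD pins the error on r2 alone, and the DC reports (r1, r2), since r1 earns more than r2 but pays less tax.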
Data Repairing
Computing a Consistent Database

D
Consistent: D’, D’’, D’’’, … …
Dg ?
find a D’ such that dist(D, D’) is minimum
Computing a Consistent Database

     name   nationality  capital   areacode  bornAt    salary   tax
r1   Nan    China        Beijing   10        Shenyang  50000    1000
r2   Yan    China        Shanghai  10 a      Hangzhou  40000 a  900 a
r3   Si     China        Beijing   10        Changsha  60000    1400
r4   Miura  China        Tokyo     3 a       Kyoto     35000 a  800 a

FD1: [nationality] -> [capital]
FD2: [areacode] -> [capital]

Beijing Beijing

Equivalence class, vertex cover, SAT solver, … …
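To make the idea of a minimum-distance repair concrete, here is a minimal sketch in the spirit of the equivalence-class heuristic, restricted to FD1 and to the two columns it mentions. It is a simplification and not the published algorithm: tuples that agree on the left-hand side form one group, the group takes its most frequent right-hand-side value, and dist is assumed to be the number of changed cells.

```python
from collections import Counter, defaultdict

# Projection of the slide's table onto the attributes used by FD1.
ROWS = {
    "r1": {"nationality": "China", "capital": "Beijing"},
    "r2": {"nationality": "China", "capital": "Shanghai"},
    "r3": {"nationality": "China", "capital": "Beijing"},
    "r4": {"nationality": "China", "capital": "Tokyo"},
}


def repair_fd(table, lhs, rhs):
    """Return a copy in which every lhs group takes its most frequent rhs value."""
    groups = defaultdict(list)
    for tid, row in table.items():
        groups[row[lhs]].append(tid)
    repaired = {tid: dict(row) for tid, row in table.items()}
    for tids in groups.values():
        # On a tie, most_common keeps the value that was seen first.
        target = Counter(table[t][rhs] for t in tids).most_common(1)[0][0]
        for t in tids:
            repaired[t][rhs] = target
    return repaired


def dist(original, repaired):
    """Assumed distance: the number of cells that differ."""
    return sum(original[t][a] != repaired[t][a]
               for t in original for a in original[t])
```

Here `repair_fd(ROWS, "nationality", "capital")` sets the capital of r2 and r4 to Beijing at a distance of 2. The result is consistent with FD1, but nothing guarantees it is correct: Miura was born in Kyoto, so the wrong cell in r4 may well be the nationality rather than the capital.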
Certain Fixes (VLDB 2010 Best Paper)

     name    country  capital   city      conf
r1   George  China    Beijing   Beijing   SIGMOD
r2   Ian     China    Shanghai  Hongkong  ICDE
r3   Peter   China    Tokyo     Tokyo     ICDE
r4   Mike    Canada   Toronto   Toronto   VLDB

     country  capital
s1   China    Beijing
s2   Canada   Ottawa
s3   Japan    Tokyo

editing rule: ((country, country) -> (capital, capital))
Is r2[country] China? YES. Beijing
Is r1[country] China? Is r3[country] China? Is r4[country] Canada? … … … …
check each tuple: not cheap !!
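The editing rule on this slide only fires once a user has vouched for the country value, which is why every tuple costs one question. The sketch below models that loop under stated assumptions: `confirm` is a hypothetical callback standing in for the user, and the sample user who rejects r3's country is invented for illustration.

```python
# Master data from the slide: country -> capital.
MASTER = {"China": "Beijing", "Canada": "Ottawa", "Japan": "Tokyo"}

ROWS = {
    "r1": {"name": "George", "country": "China", "capital": "Beijing"},
    "r2": {"name": "Ian", "country": "China", "capital": "Shanghai"},
    "r3": {"name": "Peter", "country": "China", "capital": "Tokyo"},
    "r4": {"name": "Mike", "country": "Canada", "capital": "Toronto"},
}


def apply_editing_rule(table, master, confirm):
    """Propose capital fixes from master data for user-confirmed countries.

    Returns (fixes, questions): fixes maps a tuple id to (old, new) capital,
    and questions counts how often the user was asked.
    """
    fixes = {}
    questions = 0
    for tid, row in table.items():
        questions += 1
        if not confirm(tid, "country", row["country"]):
            continue  # unconfirmed evidence, so no certain fix is possible
        expected = master.get(row["country"])
        if expected is not None and row["capital"] != expected:
            fixes[tid] = (row["capital"], expected)
    return fixes, questions


def trusting_all_but_r3(tid, attribute, value):
    """Hypothetical user who vouches for every country value except r3's."""
    return tid != "r3"
```

With this user, r2 is fixed to Beijing and r4 to Ottawa, r3 is left alone, and four questions were asked for four tuples, which is the cost the slide warns about.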
Approach                precision  recall
Heuristic (automated)   +          ++
Certain (user guided)   ++         ++
Fixing rules            ++         +
Data patterns

country   capital
China     Shanghai
China     T
evidence  negative
? (China, Beijing) (Japan, T

name      work mail
Ian       ian@gmail.com
evidence  negative

city      area code
Beijing   110002
evidence  negative
Fixing Rules (SIGMOD 2014)

fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing

country (evidence)  capital (negative)    capital (fact)
China               {Shanghai, Hongkong}  Beijing

     name   nationality  capital   bornAt
r1   Nan    China        Beijing   Shenyang
r2   Yan    China        Shanghai  Hangzhou
r3   Si     China        Beijing   Changsha
r4   Miura  China        Tokyo     Kyoto

Beijing
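A fixing rule such as fR1 combines an evidence pattern, a set of known-wrong values, and the correct value, so it can be applied without asking anyone. The sketch below is one possible encoding; the `FixingRule` class and `apply_rule` function are hypothetical names, and the sample rows use `country` as the attribute name to match the rule, whereas the slide's table labels that column `nationality`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixingRule:
    evidence: dict       # attribute -> value that must match
    attribute: str       # attribute to repair
    negative: frozenset  # values known to be wrong under this evidence
    fact: str            # the correct value


# fR1 from the slide.
FR1 = FixingRule(
    evidence={"country": "China"},
    attribute="capital",
    negative=frozenset({"Shanghai", "Hongkong"}),
    fact="Beijing",
)


def apply_rule(row, rule):
    """Return a repaired copy if the rule fires, otherwise the row itself."""
    matches = all(row.get(a) == v for a, v in rule.evidence.items())
    if matches and row.get(rule.attribute) in rule.negative:
        return {**row, rule.attribute: rule.fact}
    return row


r2 = {"name": "Yan", "country": "China", "capital": "Shanghai"}
r4 = {"name": "Miura", "country": "China", "capital": "Tokyo"}
```

`apply_rule(r2, FR1)` changes the capital to Beijing, while `apply_rule(r4, FR1)` leaves the row untouched because Tokyo is not among the negative patterns. The rule stays silent when it cannot tell which cell is wrong, which is consistent with the high precision and lower recall reported for fixing rules.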
Matching and Repairing (SIGMOD 2011)

     name         nationality   capital         bornAt
r1   Nan (0.9)    China (1.0)   Beijing (1.0)   Shenyang (0.9)
r2   Yan (0.8)    China (1.0)   Beijing (0.5)   Hangzhou (0.9)
r3   Si (0.9)     Canada (1.0)  Toronto (0.5)   Changsha (0.8)
r4   Miura (0.9)  Canada (0.9)  Vancuver (0.5)  Kyoto (1.0)

     country       capital
s1   China (1.0)   Beijing (1.0)
s2   Canada (1.0)  Ottawa (1.0)
s3   Japan (1.0)   Tokyo (1.0)

FD: [nationality] -> [capital]
MD: ((nationality, country) -> (capital, capital))

Ottawa (1.0)
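The confidence values let a repair step trust some cells more than others. The sketch below applies the MD against the master data only when the nationality cell is fully trusted and the capital cell is less trusted than it. The threshold of 1.0, the function name, and the choice to give the repaired cell confidence 1.0 are assumptions made for illustration, not the algorithm of the SIGMOD 2011 paper.

```python
# Master data from the slide: country -> capital (all with confidence 1.0).
MASTER = {"China": "Beijing", "Canada": "Ottawa", "Japan": "Tokyo"}

# Each cell is (value, confidence), copied from the slide.
ROWS = {
    "r1": {"nationality": ("China", 1.0), "capital": ("Beijing", 1.0)},
    "r2": {"nationality": ("China", 1.0), "capital": ("Beijing", 0.5)},
    "r3": {"nationality": ("Canada", 1.0), "capital": ("Toronto", 0.5)},
    "r4": {"nationality": ("Canada", 0.9), "capital": ("Vancuver", 0.5)},
}


def repair_with_master(table, master, threshold=1.0):
    """Return {tuple id: (new capital, confidence)} for trusted matches."""
    changes = {}
    for tid, row in table.items():
        nationality, nat_conf = row["nationality"]
        capital, cap_conf = row["capital"]
        if nat_conf < threshold or nationality not in master:
            continue  # the matching evidence is not trusted enough
        if capital != master[nationality] and cap_conf < nat_conf:
            changes[tid] = (master[nationality], 1.0)
    return changes
```

Under these assumptions only r3 is repaired, to Ottawa (1.0). The r4 row is skipped because its nationality carries confidence 0.9, below the assumed threshold.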
Summary of Data Repairing

Consistent database (heuristic): equivalence class, vertex cover, SAT solver
Improve accuracy: users, reference data, confidence values
Machine learning: Scared, GDR
Automated and dependable: fixing rules
Generic Data Cleaning System
NADEEF (SIGMOD 2013)

Data -> Data Loader
Rules (ETLs, CFDs, MDs, business rules) from data owners and experts -> Rule Collector
Detection and Cleaning Core: Rule Compiler, Violation Detection, Data Repairing
Metadata Management: metadata, auditing and lineage, indices, probabilistic models
Data Quality Dashboard

extensibility, heterogeneity, interdependency, metadata management and data custodians
NADEEF (SIGMOD 2013)
NADEEF Online
NADEEF for Big Data

NADEEF
Volume: Spark
Velocity: Inc Interface
Validity: Annotation
Veracity: Consistency, Accuracy, Currency
Variety: KBs, Web tables
Future Work