Data Cleaning Nan Tang, QCRI Big Data Cleaning Nan Tang, QCRI - - PowerPoint PPT Presentation

data cleaning
SMART_READER_LITE
LIVE PREVIEW

Data Cleaning Nan Tang, QCRI Big Data Cleaning Nan Tang, QCRI - - PowerPoint PPT Presentation

Data Cleaning Nan Tang, QCRI Big Data Cleaning Nan Tang, QCRI Big Data Cleaning Nan Tang, QCRI Data Cleaning? 2 Data is Dirty 2 incomplete inconsistent inaccurate Data is Dirty 2 incomplete 25% companies: flawed data


slide-1
SLIDE 1

Data Cleaning

Nan Tang, QCRI

slide-2
SLIDE 2

Data Cleaning

Nan Tang, QCRI

Big

slide-3
SLIDE 3

Data Cleaning

Nan Tang, QCRI

Big

slide-4
SLIDE 4

2

Data Cleaning?

slide-5
SLIDE 5

Data is Dirty

2

slide-6
SLIDE 6

Data is Dirty

incomplete inconsistent inaccurate …

2

slide-7
SLIDE 7

Data is Dirty

incomplete inconsistent inaccurate …

25% companies: flawed data 3+ trillion $: US economy 20%: labor productivity … …

2

slide-8
SLIDE 8

Data is Dirty

incomplete inconsistent inaccurate …

25% companies: flawed data 3+ trillion $: US economy 20%: labor productivity … …

Big (clean) data: new oil

2

slide-9
SLIDE 9

Data Cleaning Market

3

Data Explorer

Data Quality 11g

slide-10
SLIDE 10

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36

slide-11
SLIDE 11

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36

typo

slide-12
SLIDE 12

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

slide-13
SLIDE 13

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Currency

slide-14
SLIDE 14

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Currency Completeness

slide-15
SLIDE 15

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Currency Completeness

slide-16
SLIDE 16

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Consistency

Currency Completeness

slide-17
SLIDE 17

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Duplicates

Consistency

Currency Completeness

slide-18
SLIDE 18

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Duplicates

Consistency

Currency Completeness name affiliation

Nan Tang QCRI

source2 name affiliation

Nan Tang CWI

source3 … … … …

slide-19
SLIDE 19

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Duplicates

Consistency

Currency Completeness name affiliation

Nan Tang QCRI

source2 name affiliation

Nan Tang CWI

source3 … … … … truth discovery

slide-20
SLIDE 20

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Duplicates

Consistency

Currency Completeness name affiliation

Nan Tang QCRI

source2 name affiliation

Nan Tang CWI

source3 … … … … truth discovery name full

NTU Nanyang Technological University NUS National University of Singapore

slide-21
SLIDE 21

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Duplicates

Consistency

Currency Completeness name affiliation

Nan Tang QCRI

source2 name affiliation

Nan Tang CWI

source3 … … … … truth discovery name full

NTU Nanyang Technological University NUS National University of Singapore

ETL (transformation)

slide-22
SLIDE 22

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Duplicates

Consistency

Currency Completeness name affiliation

Nan Tang QCRI

source2 name affiliation

Nan Tang CWI

source3 … … … … truth discovery name full

NTU Nanyang Technological University NUS National University of Singapore

ETL (transformation)

(UK, hasCapital, London) KBs (e.g., Yago)

slide-23
SLIDE 23

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Duplicates

Consistency

Currency Completeness name affiliation

Nan Tang QCRI

source2 name affiliation

Nan Tang CWI

source3 … … … … truth discovery name full

NTU Nanyang Technological University NUS National University of Singapore

ETL (transformation)

(UK, hasCapital, London) KBs (e.g., Yago) Heterogeneous sources

slide-24
SLIDE 24

Data Cleaning Problems

4

name graduated affiliation country capital age

Nan Tang CUHK QCRI Qatari Doha 33 Xiaokui Xiao CUHK NTU Singapore Singapore Nan Tang CUHK University of Edinburgh UK Edinburgh 31 Gao Cong NUS University of Edinburgh UK London 36 Qatar

typo

Duplicates

Consistency

Currency Completeness name affiliation

Nan Tang QCRI

source2 name affiliation

Nan Tang CWI

source3 … … … … truth discovery name full

NTU Nanyang Technological University NUS National University of Singapore

ETL (transformation)

(UK, hasCapital, London) KBs (e.g., Yago) Heterogeneous sources

Volume Velocity V…

slide-25
SLIDE 25

Data Cleaning Solutions

5

slide-26
SLIDE 26

Data Cleaning Solutions

5

Error detection

Functional dependency

CFD (ICDE 2012)

Denial constraints Currency constraints (ICDE 2013) Fixing rules (SIGMOD 2014) ... ...

slide-27
SLIDE 27

Data Cleaning Solutions

5

Data repairing

Heuristic methods

Equivalence class Set cover Sat solver

Confidence values (SIGMOD 2011) User guided (VLDB 2010 best paper)

input

Error detection

Functional dependency

CFD (ICDE 2012)

Denial constraints Currency constraints (ICDE 2013) Fixing rules (SIGMOD 2014) ... ...

slide-28
SLIDE 28

Data Cleaning Solutions

5

Data repairing

Heuristic methods

Equivalence class Set cover Sat solver

Confidence values (SIGMOD 2011) User guided (VLDB 2010 best paper)

input

Error detection

Functional dependency

CFD (ICDE 2012)

Denial constraints Currency constraints (ICDE 2013) Fixing rules (SIGMOD 2014) ... ...

DCs (QCRI, VLDB 2014, ICDE2014 demo) CFDs

Rule discovery

Unique columns (QCRI, ICDE 2013, VLDB 2014)

input
slide-29
SLIDE 29

Data Cleaning Solutions

5

Generic system NADEEF (SIGMOD 2013)

Dashboard (VLDB 2013 demo) ER (SIGMOD 2014 demo) Open source Commercialize

Data repairing

Heuristic methods

Equivalence class Set cover Sat solver

Confidence values (SIGMOD 2011) User guided (VLDB 2010 best paper)

input

Error detection

Functional dependency

CFD (ICDE 2012)

Denial constraints Currency constraints (ICDE 2013) Fixing rules (SIGMOD 2014) ... ...

DCs (QCRI, VLDB 2014, ICDE2014 demo) CFDs

Rule discovery

Unique columns (QCRI, ICDE 2013, VLDB 2014)

input
slide-30
SLIDE 30

Data Cleaning Solutions

5

Generic system NADEEF (SIGMOD 2013)

Dashboard (VLDB 2013 demo) ER (SIGMOD 2014 demo) Open source Commercialize

Data repairing

Heuristic methods

Equivalence class Set cover Sat solver

Confidence values (SIGMOD 2011) User guided (VLDB 2010 best paper)

input

Error detection

Functional dependency

CFD (ICDE 2012)

Denial constraints Currency constraints (ICDE 2013) Fixing rules (SIGMOD 2014) ... ...

DCs (QCRI, VLDB 2014, ICDE2014 demo) CFDs

Rule discovery

Unique columns (QCRI, ICDE 2013, VLDB 2014)

input
slide-31
SLIDE 31

Error Detection

6

slide-32
SLIDE 32

Error Detection

6

name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800

emp

slide-33
SLIDE 33

Error Detection

6

name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800

FD: [country] -> [capital]

emp

slide-34
SLIDE 34

Error Detection

6

name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800

FD: [country] -> [capital]

emp

slide-35
SLIDE 35

Error Detection

6

name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800

FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing]

emp

slide-36
SLIDE 36

Error Detection

6

name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800

FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing]

emp

slide-37
SLIDE 37

Error Detection

6

name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800

FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] DC: ⌉t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax)

emp

slide-38
SLIDE 38

Error Detection

6

name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800

FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] DC: ⌉t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax)

emp

slide-39
SLIDE 39

Error Detection

6

name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800

FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] DC: ⌉t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax)

emp

country capital s1 China Beijing s2 Canada Ottawa s3 … …

cap

slide-40
SLIDE 40

Error Detection

6

name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800

FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] DC: ⌉t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) MD: (emp[country] = cap[country]) -> (emp[capital] <=> cap[capital])

emp

country capital s1 China Beijing s2 Canada Ottawa s3 … …

cap

slide-41
SLIDE 41

Error Detection

6

name country capital city salary tax r1 Nan China Beijing Beijing 50000 1000 r2 Yin China Shanghai Hongkong 40000 1200 r3 Si Netherlands Den Hagg Utrecht 60000 1400 r4 Lei Netherlands Amsterdam Amsterdam 35000 800

FD: [country] -> [capital] CFD: [country = China] -> [capital = Beijing] DC: ⌉t1, t2 (t1.salary > t2.salary and t1.tax < t2.tax) MD: (emp[country] = cap[country]) -> (emp[capital] <=> cap[capital])

emp

country capital s1 China Beijing s2 Canada Ottawa s3 … …

cap

Inclusion dependency Currency constraint Sequential dependency … … … … Aggregation constraint Accuracy constraint

slide-42
SLIDE 42

Data Repairing

7

Error detection

Functional dependency

CFD (ICDE 2012)

Denial constraints Currency constraints (ICDE 2013) Fixing rules (SIGMOD 2014) ... ...

Data repairing

Heuristic methods

Equivalence class Set cover Sat solver

Confidence values (SIGMOD 2011) User guided (VLDB 2010 best paper)

input

CFDs

Rule discovery

DCs (QCRI, VLDB 2014) Unique columns (QCRI, ICDE 2013, VLDB 2014)

input

Generic system NADEEF (SIGMOD 2013)

Dashboard (VLDB 2013 demo) ER (SIGMOD 2014 demo) Open source Commercialize

slide-43
SLIDE 43

Data Repairing

7

Error detection

Functional dependency

CFD (ICDE 2012)

Denial constraints Currency constraints (ICDE 2013) Fixing rules (SIGMOD 2014) ... ...

Data repairing

Heuristic methods

Equivalence class Set cover Sat solver

Confidence values (SIGMOD 2011) User guided (VLDB 2010 best paper)

input

CFDs

Rule discovery

DCs (QCRI, VLDB 2014) Unique columns (QCRI, ICDE 2013, VLDB 2014)

input

Generic system NADEEF (SIGMOD 2013)

Dashboard (VLDB 2013 demo) ER (SIGMOD 2014 demo) Open source Commercialize

slide-44
SLIDE 44

Automated

slide-45
SLIDE 45

Computing a Consistent Database

9

D

slide-46
SLIDE 46

Computing a Consistent Database

9

D

Dg ?

slide-47
SLIDE 47

Computing a Consistent Database

9

D

Consistent

Dg ?

slide-48
SLIDE 48

Computing a Consistent Database

9

D

Consistent

D’ D’’ … … D’’’ D

Consistent

Dg ?

slide-49
SLIDE 49

Computing a Consistent Database

9

D

Consistent

D’ D’’ … … D’’’ D

Consistent

find a D’ such that dist(D,D’) is minimum

Dg ?

slide-50
SLIDE 50

Computing a Consistent Database

10

name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 r2 Yan China Shanghai 10 a Hangzhou 40000 a 900 a r3 Si China Beijing 10 Changsha 60000 1400 r4 Miura China Tokyo 3 a Kyoto 35000 a 800 a

slide-51
SLIDE 51

Computing a Consistent Database

10

name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 r2 Yan China Shanghai 10 a Hangzhou 40000 a 900 a r3 Si China Beijing 10 Changsha 60000 1400 r4 Miura China Tokyo 3 a Kyoto 35000 a 800 a

FD1: [nationality] -> [capital] FD2: [areacode] -> [capital]

slide-52
SLIDE 52

Computing a Consistent Database

10

name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 r2 Yan China Shanghai 10 a Hangzhou 40000 a 900 a r3 Si China Beijing 10 Changsha 60000 1400 r4 Miura China Tokyo 3 a Kyoto 35000 a 800 a

FD1: [nationality] -> [capital] FD2: [areacode] -> [capital]

slide-53
SLIDE 53

Computing a Consistent Database

10

name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 r2 Yan China Shanghai 10 a Hangzhou 40000 a 900 a r3 Si China Beijing 10 Changsha 60000 1400 r4 Miura China Tokyo 3 a Kyoto 35000 a 800 a

FD1: [nationality] -> [capital] FD2: [areacode] -> [capital]

Beijing Beijing

slide-54
SLIDE 54

Computing a Consistent Database

10

name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 r2 Yan China Shanghai 10 a Hangzhou 40000 a 900 a r3 Si China Beijing 10 Changsha 60000 1400 r4 Miura China Tokyo 3 a Kyoto 35000 a 800 a

FD1: [nationality] -> [capital] FD2: [areacode] -> [capital]

Beijing Beijing

Equivalence class Vertex cover SAT solver … …

slide-55
SLIDE 55

Computing a Consistent Database

10

name nationality capital areacode bornAt salary tax r1 Nan China Beijing 10 Shenyang 50000 1000 r2 Yan China Shanghai 10 a Hangzhou 40000 a 900 a r3 Si China Beijing 10 Changsha 60000 1400 r4 Miura China Tokyo 3 a Kyoto 35000 a 800 a

FD1: [nationality] -> [capital] FD2: [areacode] -> [capital]

Beijing Beijing

Equivalence class Vertex cover SAT solver … …

slide-56
SLIDE 56

User Guided

slide-57
SLIDE 57

Certain Fixes (VLDB 2010 Best Paper)

name country capital city conf r1 George China Beijing Beijing SIGMOD r2 Ian China Shanghai Hongkong ICDE r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB country capital s1 China Beijing s2 Canada Ottawa s3 Japan Tokyo

12

slide-58
SLIDE 58

Certain Fixes (VLDB 2010 Best Paper)

name country capital city conf r1 George China Beijing Beijing SIGMOD r2 Ian China Shanghai Hongkong ICDE r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB country capital s1 China Beijing s2 Canada Ottawa s3 Japan Tokyo

editing rule: ((country, country) -> (capital, capital))

12

slide-59
SLIDE 59

Certain Fixes (VLDB 2010 Best Paper)

name country capital city conf r1 George China Beijing Beijing SIGMOD r2 Ian China Shanghai Hongkong ICDE r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB country capital s1 China Beijing s2 Canada Ottawa s3 Japan Tokyo

editing rule: ((country, country) -> (capital, capital))

12

slide-60
SLIDE 60

Certain Fixes (VLDB 2010 Best Paper)

name country capital city conf r1 George China Beijing Beijing SIGMOD r2 Ian China Shanghai Hongkong ICDE r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB country capital s1 China Beijing s2 Canada Ottawa s3 Japan Tokyo

editing rule: ((country, country) -> (capital, capital)) Is r2[country] China? YES.

12

slide-61
SLIDE 61

Certain Fixes (VLDB 2010 Best Paper)

name country capital city conf r1 George China Beijing Beijing SIGMOD r2 Ian China Shanghai Hongkong ICDE r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB country capital s1 China Beijing s2 Canada Ottawa s3 Japan Tokyo

editing rule: ((country, country) -> (capital, capital)) Is r2[country] China? YES. Beijing

12

slide-62
SLIDE 62

Certain Fixes (VLDB 2010 Best Paper)

name country capital city conf r1 George China Beijing Beijing SIGMOD r2 Ian China Shanghai Hongkong ICDE r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB country capital s1 China Beijing s2 Canada Ottawa s3 Japan Tokyo

editing rule: ((country, country) -> (capital, capital)) Is r2[country] China? YES. Beijing Is r1[country] China? Is r3[country] China? Is r4[country] Canada? … … … …

12

slide-63
SLIDE 63

Certain Fixes (VLDB 2010 Best Paper)

name country capital city conf r1 George China Beijing Beijing SIGMOD r2 Ian China Shanghai Hongkong ICDE r3 Peter China Tokyo Tokyo ICDE r4 Mike Canada Toronto Toronto VLDB country capital s1 China Beijing s2 Canada Ottawa s3 Japan Tokyo

editing rule: ((country, country) -> (capital, capital)) Is r2[country] China? YES. Beijing Is r1[country] China? Is r3[country] China? Is r4[country] Canada? … … … …

check each tuple: not cheap !!

12

slide-64
SLIDE 64

Heuristic

(Automated)

Certain

(User guided)

precision: + recall: ++ precision: ++ recall: ++

13

slide-65
SLIDE 65

Heuristic

(Automated)

Certain

(User guided)

precision: + recall: ++ precision: ++ recall: ++

  • (Automated)

precision: ++ recall: +

Fixing Rules

13

slide-66
SLIDE 66

14

country capital

China Shanghai

Data patterns

slide-67
SLIDE 67

14

country capital

China Shanghai

Data patterns

evidence negative

slide-68
SLIDE 68

14

country capital

China Shanghai

Data patterns

China T

  • kyo

evidence negative

slide-69
SLIDE 69

14

country capital

China Shanghai

Data patterns

China T

  • kyo

evidence negative

?

(China, Beijing) (Japan, T

  • kyo)
slide-70
SLIDE 70

14

country capital

China Shanghai

Data patterns

China T

  • kyo

name work mail

Ian ian@gmail.com

evidence negative

?

(China, Beijing) (Japan, T

  • kyo)
slide-71
SLIDE 71

14

country capital

China Shanghai

Data patterns

China T

  • kyo

name work mail

Ian ian@gmail.com

evidence negative evidence negative

?

(China, Beijing) (Japan, T

  • kyo)
slide-72
SLIDE 72

14

country capital

China Shanghai

Data patterns

China T

  • kyo

name work mail

Ian ian@gmail.com

evidence negative evidence negative

?

(China, Beijing) (Japan, T

  • kyo)

city area code

Beijing 110002

slide-73
SLIDE 73

14

country capital

China Shanghai

Data patterns

China T

  • kyo

name work mail

Ian ian@gmail.com

evidence negative evidence negative evidence negative

?

(China, Beijing) (Japan, T

  • kyo)

city area code

Beijing 110002

slide-74
SLIDE 74
  • Syntax

Fixing Rules (SIGMOD 2014)

15

fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing

slide-75
SLIDE 75
  • Syntax

Fixing Rules (SIGMOD 2014)

15

fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing

country {capital capital China Shanghai Beijing Hongkong

slide-76
SLIDE 76
  • Syntax

Fixing Rules (SIGMOD 2014)

15

fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing

country {capital capital China Shanghai Beijing Hongkong evidence negative

slide-77
SLIDE 77
  • Syntax

Fixing Rules (SIGMOD 2014)

15

fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing

country {capital capital China Shanghai Beijing Hongkong evidence negative fact

slide-78
SLIDE 78

name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto

  • Syntax

Fixing Rules (SIGMOD 2014)

15

fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing

country {capital capital China Shanghai Beijing Hongkong evidence negative fact

slide-79
SLIDE 79

name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto

  • Syntax

Fixing Rules (SIGMOD 2014)

15

fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing

country {capital capital China Shanghai Beijing Hongkong evidence negative fact

slide-80
SLIDE 80

name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto

  • Syntax

Fixing Rules (SIGMOD 2014)

15

fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing

country {capital capital China Shanghai Beijing Hongkong evidence negative fact

Beijing

slide-81
SLIDE 81

name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto

  • Syntax

Fixing Rules (SIGMOD 2014)

15

fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing

country {capital capital China Shanghai Beijing Hongkong evidence negative fact

Beijing

slide-82
SLIDE 82

name nationality capital bornAt r1 Nan China Beijing Shenyang r2 Yan China Shanghai Hangzhou r3 Si China Beijing Changsha r4 Miura China Tokyo Kyoto

  • Syntax

Fixing Rules (SIGMOD 2014)

15

fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing

country {capital capital China Shanghai Beijing Hongkong evidence negative fact

Beijing

slide-83
SLIDE 83

Confidence values Interaction … …

slide-84
SLIDE 84

Matching and Repairing (SIGMOD 2011)

17

name nationality capital bornAt r1 Nan (0.9) China (1.0) Beijing (1.0) Shenyang (0.9) r2 Yan (0.8) China (1.0) Beijing (0.5) Hangzhou (0.9) r3 Si (0.9) Canada (1.0) Toronto (0.5) Changsha (0.8) r4 Miura (0.9) Canada (0.9) Vancuver (0.5) Kyoto (1.0) country capital s1 China (1.0) Beijing (1.0) s2 Canada (1.0) Ottawa (1.0) s3 Japan (1.0) Tokyo (1.0)

FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital))

slide-85
SLIDE 85

Matching and Repairing (SIGMOD 2011)

17

name nationality capital bornAt r1 Nan (0.9) China (1.0) Beijing (1.0) Shenyang (0.9) r2 Yan (0.8) China (1.0) Beijing (0.5) Hangzhou (0.9) r3 Si (0.9) Canada (1.0) Toronto (0.5) Changsha (0.8) r4 Miura (0.9) Canada (0.9) Vancuver (0.5) Kyoto (1.0) country capital s1 China (1.0) Beijing (1.0) s2 Canada (1.0) Ottawa (1.0) s3 Japan (1.0) Tokyo (1.0)

FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital))

slide-86
SLIDE 86

Matching and Repairing (SIGMOD 2011)

17

name nationality capital bornAt r1 Nan (0.9) China (1.0) Beijing (1.0) Shenyang (0.9) r2 Yan (0.8) China (1.0) Beijing (0.5) Hangzhou (0.9) r3 Si (0.9) Canada (1.0) Toronto (0.5) Changsha (0.8) r4 Miura (0.9) Canada (0.9) Vancuver (0.5) Kyoto (1.0) country capital s1 China (1.0) Beijing (1.0) s2 Canada (1.0) Ottawa (1.0) s3 Japan (1.0) Tokyo (1.0)

FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital))

slide-87
SLIDE 87

Matching and Repairing (SIGMOD 2011)

17

name nationality capital bornAt r1 Nan (0.9) China (1.0) Beijing (1.0) Shenyang (0.9) r2 Yan (0.8) China (1.0) Beijing (0.5) Hangzhou (0.9) r3 Si (0.9) Canada (1.0) Toronto (0.5) Changsha (0.8) r4 Miura (0.9) Canada (0.9) Vancuver (0.5) Kyoto (1.0) country capital s1 China (1.0) Beijing (1.0) s2 Canada (1.0) Ottawa (1.0) s3 Japan (1.0) Tokyo (1.0)

Ottawa (1.0) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital))

slide-88
SLIDE 88

Matching and Repairing (SIGMOD 2011)

17

name nationality capital bornAt r1 Nan (0.9) China (1.0) Beijing (1.0) Shenyang (0.9) r2 Yan (0.8) China (1.0) Beijing (0.5) Hangzhou (0.9) r3 Si (0.9) Canada (1.0) Toronto (0.5) Changsha (0.8) r4 Miura (0.9) Canada (0.9) Vancuver (0.5) Kyoto (1.0) country capital s1 China (1.0) Beijing (1.0) s2 Canada (1.0) Ottawa (1.0) s3 Japan (1.0) Tokyo (1.0)

Ottawa (1.0) FD: [nationality] -> [capital] MD: ((nationality, country) -> (capital, capital))

slide-89
SLIDE 89

Summary of Data Repairing

18

Consistent database (heuristic)

Equivalence class Vertex cover Sat solver

slide-90
SLIDE 90

Summary of Data Repairing

18

Consistent database (heuristic)

Equivalence class Vertex cover Sat solver

Users Reference data Confidence value

improve accuracy

slide-91
SLIDE 91

Summary of Data Repairing

18

Consistent database (heuristic)

Equivalence class Vertex cover Sat solver

Users Reference data Confidence value

improve accuracy

Machine learning

Scared GDR

slide-92
SLIDE 92

Summary of Data Repairing

18

Consistent database (heuristic)

Equivalence class Vertex cover Sat solver

Users Reference data Confidence value

improve accuracy

Machine learning

Scared GDR

Automated and Dependable

Fixing rules

slide-93
SLIDE 93

Generic Data Cleaning System

19

Error detection

Functional dependency

CFD (ICDE 2012)

Denial constraints Currency constraints (ICDE 2013) Fixing rules (SIGMOD 2014) ... ...

Data repairing

Heuristic methods

Equivalence class Set cover Sat solver

Confidence values (SIGMOD 2011) User guided (VLDB 2010 best paper)

input

CFDs

Rule discovery

DCs (QCRI, VLDB 2014) Unique columns (QCRI, ICDE 2013, VLDB 2014)

input

Generic system NADEEF (SIGMOD 2013)

Dashboard (VLDB 2013 demo) ER (SIGMOD 2014 demo) Open source Commercialize

slide-94
SLIDE 94

Generic Data Cleaning System

19

Error detection

Functional dependency

CFD (ICDE 2012)

Denial constraints Currency constraints (ICDE 2013) Fixing rules (SIGMOD 2014) ... ...

Data repairing

Heuristic methods

Equivalence class Set cover Sat solver

Confidence values (SIGMOD 2011) User guided (VLDB 2010 best paper)

input

CFDs

Rule discovery

DCs (QCRI, VLDB 2014) Unique columns (QCRI, ICDE 2013, VLDB 2014)

input

Generic system NADEEF (SIGMOD 2013)

Dashboard (VLDB 2013 demo) ER (SIGMOD 2014 demo) Open source Commercialize

slide-95
SLIDE 95

NADEEF (SIGMOD 2013)

20

Rule Collector Data ETLs, CFDs, MDs, Business rules Data Loader

slide-96
SLIDE 96

Metadata

Metadata Management Auditing and Lineage Indices Probabilistic models Data Quality Dashboard

NADEEF

NADEEF (SIGMOD 2013)

20

Rule Collector Data ETLs, CFDs, MDs, Business rules Data Loader

Violation Detection Data Repairing Rule Compiler

Detection and Cleaning Core Rules Data owners Experts

slide-97
SLIDE 97

Metadata

Metadata Management Auditing and Lineage Indices Probabilistic models Data Quality Dashboard

NADEEF

NADEEF (SIGMOD 2013)

20

Rule Collector Data ETLs, CFDs, MDs, Business rules Data Loader

Violation Detection Data Repairing Rule Compiler

Detection and Cleaning Core Rules Data owners Experts

extensibility heterogeneity interdependency metadata management and data custodians

slide-98
SLIDE 98

NADEEF (SIGMOD 2013)

21

slide-99
SLIDE 99

NADEEF Online

22

slide-100
SLIDE 100

NADEEF for Big Data

23

NADEEF

slide-101
SLIDE 101

NADEEF for Big Data

23

NADEEF

Volumn Spark
slide-102
SLIDE 102

NADEEF for Big Data

23

NADEEF

Volumn Spark Velocity Inc Interface
slide-103
SLIDE 103

NADEEF for Big Data

23

NADEEF

Volumn Spark Velocity Inc Interface Validity Annotation
slide-104
SLIDE 104

NADEEF for Big Data

23

NADEEF

Volumn Spark Velocity Inc Interface Validity Annotation

Variety KBs Web tables

slide-105
SLIDE 105

NADEEF for Big Data

23

NADEEF

Volumn Spark Velocity Inc Interface Validity Annotation Veracity Consistency Accuracy Currency

Variety KBs Web tables

slide-106
SLIDE 106
  • Error detection
  • Rule discovery and validation
  • Combining different methods
  • Explain errors to users
  • Summarization
  • Visualization
  • Reliable data repairing
  • Effectively involve users as first-class citizens

Future Work

24