Big Data Cleaning Paolo Papotti EURECOM, France 3rd International - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Big Data Cleaning

Paolo Papotti


EURECOM, France

3rd International KEYSTONE Conference 2017

slide-3
SLIDE 3

up to 26% errors

[Abedjan et al, 2015]

3

slide-4
SLIDE 4

  • 5% measurement errors
  • 7% duplicate devices
  • sensors with up to 30% errors

4

slide-5
SLIDE 5

Is quality of data important?

  • Many decisions are taken after manually scrutinizing the data
      – Military attack
  • But more and more are taken by algorithms
      – Stock trading
      – Credit report/Risk assessment
      – Self-driving cars

5

slide-6
SLIDE 6

But it is expensive!

6

slide-7
SLIDE 7

Data quality facts

“50 people curating products’ data” 


[Chief scientist]

“engineers dedicated to data integration and cleaning”


[CIO]

“Typical duration of an integration project is in terms of years” 


[Former Chief Scientist]

7

slide-8
SLIDE 8

[https://cloud.google.com/dataprep] 8

slide-11
SLIDE 11

Source 1! Source 2! Source 3!

Target!

BEGIN TRANSACTION;
SET CONSTRAINTS ALL DEFERRED;
delete from target.PersonSet;
delete from target.CarSet;
delete from target.MakeSet;
delete from target.CitySet;

-- ----------------------------- TGDS -----------------------------------

create table work.TARGET_VALUES_TGD_v8_v3 AS select distinct null as v3id, rel_v8.cityName as v3name, rel_v8.region as v3region from source.CityRegionSet AS rel_v8; create table work.TARGET_VALUES_TGD_v5_v0v1 AS select distinct null as v0id, rel_v5.personName as v0name, null as v0age, 'SK{T='||'[0.0:'||rel_v5.personName||']'||'-'||'[1.1:'||rel_v5.carModel||']'||'J='||'['||'[0.0:'||rel_v5.personName||']'||'.0.2'||'-'||'[1.1:'||rel_v5.carModel||']'||'.1.3'||'V='||'['||'0.2'||'-'||'1.3'||'}' as v0carId, null as v0cityId, 'SK{T='||'[0.0:'||rel_v5.personName||']'||'-'||'[1.1:'||rel_v5.carModel||']'||'J='||'['||'[0.0:'||rel_v5.personName||']'||'.0.2'||'-'||'[1.1:'||rel_v5.carModel||']'||'.1.3'||'V='||'['||'0.2'||'-'||'1.3'||'}' as v1id, rel_v5.carModel as v1model, null as v1plate, null as v1makeId from source.PersonCarSet2 AS rel_v5; create table work.TARGET_VALUES_TGD_v6_v0v3 AS select distinct null as v0id, rel_v6.personName as v0name, null as v0age, null as v0carId, 'SK{T='||'[0.0:'||rel_v6.personName||']'||'-'||'[2.4:'||rel_v6.cityName||']'||'J='||'['||'[0.0:'||rel_v6.personName||']'||'.0.5'||'-'||'[2.4:'||rel_v6.cityName||']'||'.2.6'||'V='||'['||'0.5'||'-'||'2.6'||'}' as v0cityId, 'SK{T='||'[0.0:'||rel_v6.personName||']'||'-'||'[2.4:'||rel_v6.cityName||']'||'J='||'['||'[0.0:'||rel_v6.personName||']'||'.0.5'||'-'||'[2.4:'||rel_v6.cityName||']'||'.2.6'||'V='||'['||'0.5'||'-'||'2.6'||'}' as v3id, rel_v6.cityName as v3name, null as v3region from source.PersonCitySet AS rel_v6; create table work.TARGET_VALUES_TGD_v7_v1v2 AS select distinct null as v1id, rel_v7.carModel as v1model, null as v1plate, 'SK{T='||'[1.1:'||rel_v7.carModel||']'||'-'||'[3.7:'||rel_v7.makeName||']'||'J='||'['||'[1.1:'||rel_v7.carModel||']'||'.1.8'||'-'||'[3.7:'||rel_v7.makeName||']'||'.3.9'||'V='||'['||'1.8'||'-'||'3.9'||'}' as v1makeId, 
'SK{T='||'[1.1:'||rel_v7.carModel||']'||'-'||'[3.7:'||rel_v7.makeName||']'||'J='||'['||'[1.1:'||rel_v7.carModel||']'||'.1.8'||'-'||'[3.7:'||rel_v7.makeName||']'||'.3.9'||'V='||'['||'1.8'||'-'||'3.9'||'}' as v2id, rel_v7.makeName as v2name from source.CarMakeSet AS rel_v7; create table work.TARGET_VALUES_TGD_v4_v0v1 AS select distinct null as v0id, rel_v4.personName as v0name, rel_v4.age as v0age, 'SK{T='||'[0.0:'||rel_v4.personName||'-'||'0.10:'||rel_v4.age||']'||'-'||'[1.11:'||rel_v4.carPlate||']'||'J='||'['||'[0.0:'||rel_v4.personName||'-'||'0.10:'||rel_v4.age||']'||'.0.2'||'-'||'[1.11:'|| rel_v4.carPlate||']'||'.1.3'||'V='||'['||'0.2'||'-'||'1.3'||'}' as v0carId, null as v0cityId, 'SK{T='||'[0.0:'||rel_v4.personName||'-'||'0.10:'||rel_v4.age||']'||'-'||'[1.11:'||rel_v4.carPlate||']'||'J='||'['||'[0.0:'||rel_v4.personName||'-'||'0.10:'||rel_v4.age||']'||'.0.2'||'-'||'[1.11:'|| rel_v4.carPlate||']'||'.1.3'||'V='||'['||'0.2'||'-'||'1.3'||'}' as v1id, null as v1model, rel_v4.carPlate as v1plate, null as v1makeId from source.PersonCarSet1 AS rel_v4;

-- ---------------------- RESULT OF EXCHANGE ---------------------------

insert into target.PersonSet select cast(work.TARGET_VALUES_TGD_v4_v0v1.v0id as text) as v0id, cast(work.TARGET_VALUES_TGD_v4_v0v1.v0name as text) as v0name, cast(work.TARGET_VALUES_TGD_v4_v0v1.v0age as text) as v0age, cast(work.TARGET_VALUES_TGD_v4_v0v1.v0carId as text) as v0carId, cast(work.TARGET_VALUES_TGD_v4_v0v1.v0cityId as text) as v0cityId from work.TARGET_VALUES_TGD_v4_v0v1 UNION select cast(work.TARGET_VALUES_TGD_v6_v0v3.v0id as text) as v0id, cast(work.TARGET_VALUES_TGD_v6_v0v3.v0name as text) as v0name, cast(work.TARGET_VALUES_TGD_v6_v0v3.v0age as text) as v0age, cast(work.TARGET_VALUES_TGD_v6_v0v3.v0carId as text) as v0carId, cast(work.TARGET_VALUES_TGD_v6_v0v3.v0cityId as text) as v0cityId from work.TARGET_VALUES_TGD_v6_v0v3 UNION select cast(work.TARGET_VALUES_TGD_v5_v0v1.v0id as text) as v0id, cast(work.TARGET_VALUES_TGD_v5_v0v1.v0name as text) as v0name, cast(work.TARGET_VALUES_TGD_v5_v0v1.v0age as text) as v0age, cast(work.TARGET_VALUES_TGD_v5_v0v1.v0carId as text) as v0carId, cast(work.TARGET_VALUES_TGD_v5_v0v1.v0cityId as text) as v0cityId from work.TARGET_VALUES_TGD_v5_v0v1; insert into target.CarSet select cast(work.TARGET_VALUES_TGD_v4_v0v1.v1id as text) as v1id, cast(work.TARGET_VALUES_TGD_v4_v0v1.v1model as text) as v1model, cast(work.TARGET_VALUES_TGD_v4_v0v1.v1plate as text) as v1plate, cast(work.TARGET_VALUES_TGD_v4_v0v1.v1makeId as text) as v1makeId from work.TARGET_VALUES_TGD_v4_v0v1 UNION select cast(work.TARGET_VALUES_TGD_v7_v1v2.v1id as text) as v1id, cast(work.TARGET_VALUES_TGD_v7_v1v2.v1model as text) as v1model, cast(work.TARGET_VALUES_TGD_v7_v1v2.v1plate as text) as v1plate, cast(work.TARGET_VALUES_TGD_v7_v1v2.v1makeId as text) as v1makeId from work.TARGET_VALUES_TGD_v7_v1v2 UNION select cast(work.TARGET_VALUES_TGD_v5_v0v1.v1id as text) as v1id, cast(work.TARGET_VALUES_TGD_v5_v0v1.v1model as text) as v1model, cast(work.TARGET_VALUES_TGD_v5_v0v1.v1plate as text) as v1plate, cast(work.TARGET_VALUES_TGD_v5_v0v1.v1makeId as text) as 
v1makeId from work.TARGET_VALUES_TGD_v5_v0v1; insert into target.MakeSet select cast(work.TARGET_VALUES_TGD_v7_v1v2.v2id as text) as v2id, cast(work.TARGET_VALUES_TGD_v7_v1v2.v2name as text) as v2name from work.TARGET_VALUES_TGD_v7_v1v2; insert into target.CitySet select cast(work.TARGET_VALUES_TGD_v6_v0v3.v3id as text) as v3id, cast(work.TARGET_VALUES_TGD_v6_v0v3.v3name as text) as v3name, cast(work.TARGET_VALUES_TGD_v6_v0v3.v3region as text) as v3region from work.TARGET_VALUES_TGD_v6_v0v3 UNION select cast(work.TARGET_VALUES_TGD_v8_v3.v3id as text) as v3id, cast(work.TARGET_VALUES_TGD_v8_v3.v3name as text) as v3name,

10

slide-12
SLIDE 12

Data Preparation

Extract, Map, Clean

Declarative Approach

  • 1. Formalization: clear notion of desired solution
  • 2. Scalable algorithms: handle large datasets

11

slide-13
SLIDE 13

Data Cleaning

ID   FN    LN          ROLE  ZIP    ST  SAL
105  Anne  Nash        E     85281  NY  110
211  Mark  White       M     15544  NY  80
386  Mark  Lee         E     85281  AZ  75
215  Anna  Smith Nash  E     85283

Up to 25% of business, health, and scientific data is dirty: errors, missing values, duplicates 


[https://www.gartner.com/doc/3169421/magic-quadrant-data-quality-tools]

12

slide-14
SLIDE 14

Data Cleaning

  • One declarative approach based on rules
  • Functional Dependency: zip code identifies state
  • A repair is an updated, consistent instance

15

slide-15
SLIDE 15

Data Cleaning

  • One declarative approach based on rules
  • Functional Dependency: zip code identifies state
  • A repair is an updated, consistent instance
  • An optimal repair is minimal in terms of number of changes between the original dataset and the repair
  • Repaired value in the example: AZ
  • Computing an optimal repair is an NP-hard problem
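
The functional dependency and the repair cost from the bullets above can be illustrated with a minimal Python sketch. It assumes the example tuples are held as dictionaries restricted to the columns needed here; the ST and SAL cells of tuple 215 are not recoverable from the slides and are set to None. The helper names fd_violations, apply_repair and repair_cost are hypothetical and do not come from any tool mentioned in the talk.

```python
from collections import defaultdict

# Example tuples from the slide, restricted to the columns used below.
# ST and SAL of tuple 215 are unknown in the transcript, hence None.
ROWS = [
    {"ID": 105, "ROLE": "E", "ZIP": "85281", "ST": "NY", "SAL": 110},
    {"ID": 211, "ROLE": "M", "ZIP": "15544", "ST": "NY", "SAL": 80},
    {"ID": 386, "ROLE": "E", "ZIP": "85281", "ST": "AZ", "SAL": 75},
    {"ID": 215, "ROLE": "E", "ZIP": "85283", "ST": None, "SAL": None},
]


def fd_violations(rows, lhs, rhs):
    """Return {lhs value: sorted rhs values} for every group breaking lhs -> rhs."""
    groups = defaultdict(set)
    for row in rows:
        # Unknown cells are skipped rather than treated as conflicting values.
        if row[lhs] is not None and row[rhs] is not None:
            groups[row[lhs]].add(row[rhs])
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


def apply_repair(rows, changes):
    """Return a copy of rows with {(ID, attribute): new value} applied."""
    repaired = [dict(row) for row in rows]
    for row in repaired:
        for (row_id, attr), value in changes.items():
            if row["ID"] == row_id:
                row[attr] = value
    return repaired


def repair_cost(original, repaired):
    """Number of cells that differ between the original and the repair."""
    return sum(o[a] != r[a] for o, r in zip(original, repaired) for a in o)


dirty = fd_violations(ROWS, "ZIP", "ST")            # the 85281 group maps to two states
repaired = apply_repair(ROWS, {(105, "ST"): "AZ"})  # the repair shown on the slide
clean = fd_violations(repaired, "ZIP", "ST")        # empty: the FD now holds
cost = repair_cost(ROWS, repaired)                  # a single changed cell
```

For this FD alone, changing the ST of tuple 386 to NY would be an equally small repair, which is one reason a violation usually admits several repairs.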

17

slide-16
SLIDE 16

Data Cleaning

  • Computing an optimal repair is an NP-hard problem
  • Multiple possible ways to repair a violation
  • Domino effect: new violations could be generated by resolving a violation [Xu et al, 2013a]
  • Approximate solution with heuristics

18

slide-17
SLIDE 17

Rule Based Data Cleaning

  • Functional Dependencies [Bohannon et al, 2005], Conditional Functional Dependencies [Cong et al, 2007], Conditional Inclusion Dependencies [Bravo et al, 2007], Matching Dependencies [Bertossi et al, 2011], Editing Rules [Fan et al, 2010], Fixing Rules [Tang, 2014]
  • Each fragment covers a new aspect: axioms, complexity study, heuristic repair algorithm
  • Sequence of repair algorithms: poor repair
  • 0.3 F-measure over real data
  • Piecemeal approach misses evidence!

20

slide-18
SLIDE 18

Denial Constraints (DCs)

∀tα, tβ ∈ R, ¬(tα.ZIP = tβ.ZIP ∧ tα.ST ≠ tβ.ST)

21

slide-19
SLIDE 19

Denial Constraints (DCs)

∀tα, tβ ∈ R, ¬(tα.ZIP = tβ.ZIP ∧ tα.ST ≠ tβ.ST)
∀tα, tβ ∈ R, ¬(tα.ST = tβ.ST ∧ tα.ROLE = “M” ∧ tβ.ROLE = “E” ∧ tα.SAL < tβ.SAL)
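
The two denial constraints can be checked mechanically: a pair of tuples violates a DC when the conjunction inside the negation is true. The following Python sketch assumes the example tuples are stored as dictionaries keyed t1 to t4 (the ST and SAL cells of t4 are unknown in the slides and set to None); the names DCS and find_violations are hypothetical, and the quadratic pairwise scan is only an illustration, not the scalable detection used by the systems cited in the talk.

```python
from itertools import permutations

# Example tuples; t4 has unknown ST and SAL in the transcript, hence None.
ROWS = {
    "t1": {"ID": 105, "ROLE": "E", "ZIP": "85281", "ST": "NY", "SAL": 110},
    "t2": {"ID": 211, "ROLE": "M", "ZIP": "15544", "ST": "NY", "SAL": 80},
    "t3": {"ID": 386, "ROLE": "E", "ZIP": "85281", "ST": "AZ", "SAL": 75},
    "t4": {"ID": 215, "ROLE": "E", "ZIP": "85283", "ST": None, "SAL": None},
}


def body_zip_state(a, b):
    """Body of the first DC: same ZIP but different ST."""
    return a["ZIP"] == b["ZIP"] and a["ST"] != b["ST"]


def body_manager_salary(a, b):
    """Body of the second DC: in one state, a manager earns less than an employee."""
    return (a["ST"] == b["ST"] and a["ROLE"] == "M"
            and b["ROLE"] == "E" and a["SAL"] < b["SAL"])


# Each DC is stored as (attributes it reads, body predicate).
DCS = {
    "zip_state": (("ZIP", "ST"), body_zip_state),
    "manager_salary": (("ST", "ROLE", "SAL"), body_manager_salary),
}


def find_violations(rows, dcs):
    """Return (dc name, tuple id for t_alpha, tuple id for t_beta) triples."""
    found = []
    for name, (attrs, body) in dcs.items():
        for (i, a), (j, b) in permutations(rows.items(), 2):
            # Skip pairs with unknown cells instead of guessing a value.
            if any(a[x] is None or b[x] is None for x in attrs):
                continue
            if body(a, b):
                found.append((name, i, j))
    return found


violations = find_violations(ROWS, DCS)
```

On the example, the first DC is violated by t1 and t3 (reported in both orders, since its body is symmetric) and the second by t2 as the manager and t1 as the employee.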

21

slide-23
SLIDE 23

Denial Constraints (DCs)

repair condition

∀tα, tβ ∈ R, ¬(tα.ZIP = tβ.ZIP ∧ tα.ST ≠ tβ.ST)
∀tα, tβ ∈ R, ¬(tα.ST = tβ.ST ∧ tα.ROLE = “M” ∧ tβ.ROLE = “E” ∧ tα.SAL < tβ.SAL)

21

slide-26
SLIDE 26

Two Steps for Cleaning

  • Detect: identify constraint violations
  • Repair: identify errors and suggest repairs
  • idea: exploit interactions among violations for better repairs

22

slide-27
SLIDE 27

Conflict Hypergraph

    ID   FN    LN          ROLE  ZIP    ST  SAL
t1  105  Anne  Nash        E     85281  NY  110
t2  211  Mark  White       M     15544  NY  80
t3  386  Mark  Lee         E     85281  AZ  75
t4  215  Anna  Smith Nash  E     85283

[Xu et al, 2013a]

23

slide-28
SLIDE 28

Conflict Hypergraph

Tuples: t1 t2 t3 t4

  • e1: t1.ZIP, t1.ST, t3.ZIP, t3.ST

[Xu et al, 2013a]

23

slide-29
SLIDE 29

Conflict Hypergraph

Tuples: t1 t2 t3 t4

  • e2: t1.ROLE, t1.ST, t1.SAL, t2.ROLE, t2.ST, t2.SAL

24

slide-34
SLIDE 34

Conflict Hypergraph

Tuples: t1 t2 t3 t4

  • e1: t1.ZIP, t1.ST, t3.ZIP, t3.ST
  • e2: t1.ROLE, t1.ST, t1.SAL, t2.ROLE, t2.ST, t2.SAL
  • MVC (minimum vertex cover): t1.ST
  • system: t1.ST != t2.ST, t1.ST = t3.ST
  • update and iterate: t1.ST becomes AZ
  • Th: constant factor approx. algorithm
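
The hypergraph steps can be traced with a small Python sketch. It takes the two hyperedges e1 and e2 as read from the slides, picks cells with a simple greedy cover (an assumption made for illustration; it is not claimed to be the algorithm of [Xu et al, 2013a], nor to carry its approximation guarantee), and then looks for a value that satisfies the system for the chosen cell. The function names greedy_vertex_cover and solve_cell are hypothetical.

```python
from collections import Counter

# Hyperedges as read from the slides: each violation links the cells it involves.
HYPEREDGES = {
    "e1": {"t1.ZIP", "t1.ST", "t3.ZIP", "t3.ST"},
    "e2": {"t1.ROLE", "t1.ST", "t1.SAL", "t2.ROLE", "t2.ST", "t2.SAL"},
}


def greedy_vertex_cover(hyperedges):
    """Repeatedly pick the cell that appears in the most uncovered hyperedges."""
    uncovered = dict(hyperedges)
    cover = []
    while uncovered:
        counts = Counter(cell for cells in uncovered.values() for cell in cells)
        # Iterate over sorted names so ties are broken deterministically.
        best = max(sorted(counts), key=lambda cell: counts[cell])
        cover.append(best)
        uncovered = {e: cells for e, cells in uncovered.items() if best not in cells}
    return cover


def solve_cell(candidates, must_equal, must_differ):
    """Return the first candidate value satisfying all (in)equalities, else None."""
    for value in candidates:
        if all(value == v for v in must_equal) and all(value != v for v in must_differ):
            return value
    return None


cover = greedy_vertex_cover(HYPEREDGES)

# System for t1.ST from the slide: t1.ST = t3.ST (AZ) and t1.ST != t2.ST (NY).
new_value = solve_cell(["NY", "AZ"], must_equal=["AZ"], must_differ=["NY"])
```

Because t1.ST is the only cell shared by both hyperedges, a single update to it resolves both violations, which is the interaction among violations that the holistic approach exploits.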

25

slide-35
SLIDE 35

Experimental Results: DCs

  • Nine datasets, 4000 manually annotated tuples

0.54 0.84

[Abedjan et al, 2015]

26

slide-36
SLIDE 36

Cleaning with Denial Constraints

  • Language: axioms, implication testing
  • Semantics: partial order over groups of values
  • Algorithms: constant factor approximation
  • System: scalable, disk-based cleaning tools

27

Users define the rules: model for the background knowledge to be enforced on the data

slide-37
SLIDE 37

Supporting Rules Discovery

  • Large literature on Functional Dependencies [Kivinen and Mannila, 1995]
  • More recent efforts on data quality rules
  • Conditional Functional Dependencies [Chiang and Miller, 2008]
  • Matching Dependencies [Song and Chen, 2009]
  • Denial Constraints [Xu et al, 2013b]
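
To give a feel for what rule discovery involves, here is a deliberately naive Python sketch that enumerates all single-attribute functional dependencies X → Y that hold exactly on a small table. It is an assumption-laden illustration, not any of the cited discovery algorithms; the names holds and discover_unary_fds are hypothetical, and the data are the three complete example tuples restricted to four columns.

```python
from itertools import permutations


def holds(rows, lhs, rhs):
    """True if every lhs value maps to a single rhs value in rows."""
    seen = {}
    for row in rows:
        # setdefault records the first rhs seen for this lhs value.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def discover_unary_fds(rows):
    """Return all (X, Y) attribute pairs such that X -> Y holds exactly."""
    attrs = list(rows[0])
    return [(x, y) for x, y in permutations(attrs, 2) if holds(rows, x, y)]


DIRTY = [
    {"ID": 105, "ROLE": "E", "ZIP": "85281", "ST": "NY"},
    {"ID": 211, "ROLE": "M", "ZIP": "15544", "ST": "NY"},
    {"ID": 386, "ROLE": "E", "ZIP": "85281", "ST": "AZ"},
]
# Same table after the repair from the example (ST of tuple 105 set to AZ).
REPAIRED = [dict(row, ST="AZ") if row["ID"] == 105 else row for row in DIRTY]

fds_dirty = discover_unary_fds(DIRTY)
fds_repaired = discover_unary_fds(REPAIRED)
```

On the dirty table the exact check rejects ZIP → ST because of one erroneous cell, while it accepts the rule on the repaired table. Dependencies such as ID → ST pass in both cases simply because ID is unique, so even this tiny search returns rules a domain expert would still have to filter.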

28

slide-38
SLIDE 38

Discovering DCs

29

slide-40
SLIDE 40

Three Big Data challenges

  • 1. Noise in the data: hard to set parameters
  • 2. Search space is exponential: no trial and error
  • 3. Lots of rules, unfriendly output for domain experts
  • Same problem for other methods in curation: transformations, outlier detection, deduplication

31

slide-42
SLIDE 42

New (ML/PL) tools to the rescue

  • DCs cleaning and mining [Xu et al, 2013a] [Xu et al, 2013b]
  • Temporal rules from noisy data [Abedjan et al, 2015]
  • Interactive discovery with domain experts [He et al, 2016]
  • Synthesizing cleaning programs (UDFs) [Singh et al, 2017]

33

slide-43
SLIDE 43

Program synthesis

  • ML black box: best F-measure, not interpretable
  • Rules: lower F-measure, interpretable

Tuneable trade-off

slide-44
SLIDE 44

Program synthesis

F-measure comparable to decision trees (DTs) of depth 10 and SVM

35

slide-45
SLIDE 45

Research Direction

  • Rules for challenging applications
  • fact checking
  • identification of cyber attacks
  • recognizing credit card frauds

36

slide-46
SLIDE 46

[www.opensources.co] 


881 sources 
 “~200 suggested waiting to be added”

120 organizations, at least two days delay

37

slide-47
SLIDE 47

[http://www.npr.org/2016/08/31/492096565/fact-check-donald-trumps-speech-on-immigration]

38

slide-48
SLIDE 48

Pattern Discovery

ROOT (S (S (NP (DT The) (NN truth)) (VP (VBZ is) (SBAR (SBAR (S (NP (DT the) (JJ central) (NN issue)) (VP (VBZ is) (RB not) (NP (NP (DT the) (NNS needs)) (RB always)) (NP …

39

[Figure: claim checked against a knowledge-base graph. Node and edge labels: US, NYC, North America, isIn, #population 323M, #illegal Immigrants 30M, #illegal Immigrants 11M]

slide-49
SLIDE 49

40

Make it explicable with rules over the KB!

slide-50
SLIDE 50

Conclusions

  • Big challenges in data cleaning
  • No magic: large human involvement
  • New tools for the existing problems
  • New applications for the existing tools

Paolo Papotti
papotti@eurecom.fr

Gdansk, 11/9/2017