


SLIDE 1

UNIVERSITY OF MASSACHUSETTS, AMHERST • College of Information and Computer Sciences

Data X-Ray: A diagnostic tool for data errors

Xiaolan Wang

Xin Luna Dong Alexandra Meliou

SLIDE 2

MANY APPLICATIONS RELY ON DATA

  • Knowledge graph (www.google.com)
  • Social network analytics
  • Shopping systems of retail companies

Data is not perfect! Erroneous data can be extremely costly!

SLIDE 3

KNOWLEDGE VAULT [Dong14]

[Figure: Knowledge Vault pipeline. Labels: Web Sources (TXT, DOM, TBL, ANO), Extraction System (several Extractors), Fusion, prKB.]

  • 3.0 billion extracted triples
  • More than 70% are wrong

SLIDE 4

KNOWLEDGE VAULT [Dong14]

[Figure: Knowledge Vault pipeline. Labels: Web Sources (TXT, DOM, TBL, ANO), Extraction System (several Extractors), Fusion, prKB, Perfect KB.]

Traditional method: identify errors and drop them.

SLIDE 5

KNOWLEDGE VAULT [Dong14]

[Figure: Knowledge Vault pipeline. Labels: Web Sources (TXT, DOM, TBL, ANO), Extraction System (several Extractors), Fusion, prKB, Perfect KB.]

Errors are systematic:

  • Faulty information
  • Bad extraction rules

SLIDE 6

KNOWLEDGE VAULT [Dong14]

[Figure: Knowledge Vault pipeline. Labels: Web Sources (TXT, DOM, TBL, ANO), Extraction System (several Extractors), Fusion, and a series of prKBs.]

The system continues to generate erroneous data.

SLIDE 7

KNOWLEDGE VAULT [Dong14]

[Figure: Knowledge Vault pipeline. Labels: Web Sources (TXT, DOM, TBL, ANO), Extraction System (several Extractors), Fusion, and a series of prKBs.]

Diagnose the root cause of the errors.

SLIDE 8

REAL-WORLD SYSTEMATIC ERRORS

Default Value Error

Feature: (besoccer.com, date_of_birth, 1986_02_18)
# Triples: 630
Error rate: 100%
Context: The date of birth of athletes extracted from besoccer.com is set to the default value 1986_02_18.

SLIDE 9

REAL-WORLD SYSTEMATIC ERRORS

Reconciliation Error

Feature: (Extractor S, obj: Baseball Coach)
# Triples: 674,000
Error rate: 89.3%
Context: All coaches are reconciled to baseball coaches.
E.g., [Bob Barton, profession, Baseball Coach]

SLIDE 10

REAL-WORLD SYSTEMATIC ERRORS

Coreference Error

Feature: (Extractor T, pred: namesakes, obj: the county)
# Triples: 4878
Error rate: 99.8%
E.g., [Salmon P. Chase, namesakes, the county]
Context: "The county was named for Salmon P. Chase, former senator and governor of Ohio"

SLIDE 11

HOW TO DERIVE A DIAGNOSIS?

Label each triple by leveraging existing data cleaning methods [Abiteboul99, Fan08, Kalashnikov06, Rahm00, Raman01]:

Knowledge triple                       Correct?
<Domenico Modugno, DoB, 01/09/1958>    False
<Bert Kaempfert, DoB, 09/01/1961>      False
<The Singing Nun, DoB, 07/12/1963>     False
<Paul Mauriat, DoB, 10/02/1963>        False
<Shocking Blue, DoB, 02/07/1968>       True
<U2, DoB, 05/16/1987>                  True

Q: Can we treat the erroneous triples as a diagnosis?
A: No, for two reasons:

  • There are too many erroneous triples (more than 2B in KV).
  • They stem from a variety of errors.

SLIDE 12

WHAT IS A DIAGNOSIS?

Knowledge triple                       Correct?  Subject      Predicate  Object         Web source       Extractor
<Domenico Modugno, DoB, 01/09/1958>    False     People/D.M.  Bio/DoB    Date/01091958  euromusicxx.com  Extractor 1
<Bert Kaempfert, DoB, 09/01/1961>      False     People/B.K.  Bio/DoB    Date/09011961  euromusicxx.com  Extractor 1
<The Singing Nun, DoB, 07/12/1963>     False     People/TSN   Bio/DoB    Date/07121963  euromusicxx.com  Extractor 1
<Paul Mauriat, DoB, 10/02/1963>        False     People/P.M.  Bio/DoB    Date/10021963  euromusicxx.com  Extractor 1
<Shocking Blue, DoB, 02/07/1968>       True      People/S.B.  Bio/DoB    Date/02071968  wiki.com         Extractor 1
<U2, DoB, 05/16/1987>                  True      People/U2    Bio/DoB    Date/05161987  wiki.com         Extractor 1

Group the erroneous data: dates from the website euromusicxx.com extracted by Extractor 1 are wrong. (Bad extraction rule: a U.S. date-format rule is used to extract dates from a European website.)
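To make the bad extraction rule concrete, here is a minimal Python sketch of how one date string from the table reads under the two conventions. The format constants and the `parse` helper are illustrative names, not part of DataXRay or Knowledge Vault.

```python
from datetime import datetime

US_FORMAT = "%m/%d/%Y"        # month/day/year
EUROPEAN_FORMAT = "%d/%m/%Y"  # day/month/year


def parse(text: str, fmt: str) -> str:
    """Parse a date string with the given format and return an ISO date."""
    return datetime.strptime(text, fmt).date().isoformat()


raw = "01/09/1958"  # value taken from a euromusicxx.com row of the table
as_us = parse(raw, US_FORMAT)
as_european = parse(raw, EUROPEAN_FORMAT)
print(as_us, as_european)
```

Both parses succeed without raising an error, yet they produce different dates. That is why this kind of mistake is silent and repeats for every ambiguous value from the same source.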


SLIDE 13

WHAT IS A DIAGNOSIS?

[Table: the six knowledge triples from slide 12, with their correctness and the feature columns Subject, Predicate, Object, Web source, and Extractor.]

Input 1: Elements, and their correctness

Input 2: Features (combinations of metadata information)

Output (diagnosis): a set of features

Which diagnosis is the best?
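As a rough illustration of how candidate diagnoses can be compared, the following Python sketch counts false and true elements per feature for the six triples in the table. Only the web source and extractor columns are used, and the `("source", ...)` / `("extractor", ...)` feature encoding is an assumption made for this example.

```python
from collections import defaultdict

# (subject, web source, extractor, is_correct) rows transcribed from the table.
ROWS = [
    ("Domenico Modugno", "euromusicxx.com", "Extractor 1", False),
    ("Bert Kaempfert", "euromusicxx.com", "Extractor 1", False),
    ("The Singing Nun", "euromusicxx.com", "Extractor 1", False),
    ("Paul Mauriat", "euromusicxx.com", "Extractor 1", False),
    ("Shocking Blue", "wiki.com", "Extractor 1", True),
    ("U2", "wiki.com", "Extractor 1", True),
]


def error_rates(rows):
    """Return {feature: (false_count, true_count, error_rate)}."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [false, true]
    for _subject, source, extractor, ok in rows:
        for feature in (("source", source), ("extractor", extractor)):
            counts[feature][1 if ok else 0] += 1
    return {f: (neg, pos, neg / (neg + pos)) for f, (neg, pos) in counts.items()}


RATES = error_rates(ROWS)
for feature, (neg, pos, rate) in sorted(RATES.items()):
    print(feature, neg, pos, round(rate, 2))
```

In this toy data, the euromusicxx.com feature and the Extractor 1 feature each contain all four false triples, but only the former contains no true ones.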

SLIDE 14

DATAXRAY: COST MODEL

Cost model:

  • Conciseness: fewer features preferred
  • Specificity: higher error rate preferred
  • Consistency: fewer true elements preferred

Bayesian estimate of causal likelihood:

$$\Pr(F \mid E) = \prod_{f_i \in F} \alpha \, \epsilon_i^{|f_i.E_i^-|} \, (1 - \epsilon_i)^{|f_i.E_i^+|}$$

where:

  • Pr(F | E) is the probability that the feature set F is the cause of the errors, given the observed data items E
  • ε_i is the error rate of feature f_i
  • f_i.E_i^- is the set of false elements in feature f_i
  • f_i.E_i^+ is the set of true elements in feature f_i

Theorem 1: Deriving a diagnosis with minimum cost is NP-complete.
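The likelihood can be turned into an additive cost by taking its negative logarithm. The Python sketch below does this under two assumptions that the slide does not state: α is set to 0.5 purely for illustration, and each ε_i is taken as the empirical error rate of its feature. The function names are hypothetical.

```python
import math

ALPHA = 0.5  # assumed value for illustration; the slide does not give one


def feature_cost(false_count, true_count, alpha=ALPHA):
    """Negative log of one factor alpha * eps^|E-| * (1 - eps)^|E+|.

    Assumes the feature contains at least one element.
    """
    total = false_count + true_count
    eps = false_count / total  # empirical error rate of the feature
    cost = -math.log(alpha)
    if false_count:
        cost -= false_count * math.log(eps)
    if true_count:
        cost -= true_count * math.log(1.0 - eps)
    return cost


def diagnosis_cost(features, alpha=ALPHA):
    """Sum the per-feature costs; features is a list of (false, true) counts."""
    return sum(feature_cost(neg, pos, alpha) for neg, pos in features)


# Two candidate diagnoses for the six example triples:
by_source = diagnosis_cost([(4, 0)])     # euromusicxx.com: 4 false, 0 true
by_extractor = diagnosis_cost([(4, 2)])  # Extractor 1: 4 false, 2 true
print(round(by_source, 3), round(by_extractor, 3))
```

Under these assumptions the source-level feature is cheaper, because the extractor-level feature pays for the two true elements it contains and for its lower error rate.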

SLIDE 15

DATAXRAY: ALGORITHM

Top-down iterative traversal

[Figure: feature hierarchy]

  • Level 0: (all, all, all)
  • Level 1: (all, all, extractor1), (all, wiki, all), (all, euromusicxx, all), (date, all, all)
  • Level 2: (all, wiki, extractor1), (all, euromusicxx, extractor1), (date, all, extractor1), (date, wiki, all), (date, euromusicxx, all)

Traversal steps shown on the slide: Split, Split, then Compare and Merge, leaving (all, all, all), (all, wiki, all), (all, euromusicxx, all); then Compare and Merge again, leaving (all, all, all).

Theorem 2: The DataXRay traversal has complexity linear in the number of features, with an O(# of features) approximation.
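The hierarchy that the traversal walks can be generated mechanically. The Python sketch below only enumerates the levels by specializing one "all" slot at a time; it does not implement the Split, Compare, and Merge decisions. The `DOMAINS` tuple is an assumption read off the labels on the slide.

```python
ALL = "all"

# Assumed value domains per dimension, taken from the labels on the slide.
DOMAINS = (("date",), ("wiki", "euromusicxx"), ("extractor1",))


def children(feature):
    """Specialize exactly one 'all' slot to each value of that dimension."""
    result = []
    for i, slot in enumerate(feature):
        if slot == ALL:
            for value in DOMAINS[i]:
                result.append(feature[:i] + (value,) + feature[i + 1:])
    return result


def levels():
    """Return the hierarchy level by level, starting from the root."""
    level = [(ALL, ALL, ALL)]
    out = []
    while level:
        out.append(level)
        nxt = []
        for feature in level:
            for child in children(feature):
                if child not in nxt:  # a child can have several parents
                    nxt.append(child)
        level = nxt
    return out


LEVELS = levels()
for depth, feats in enumerate(LEVELS):
    print(depth, feats)
```

The first three levels match the ten features listed on the slide (1, 4, and 5 features). The sketch also produces a fourth level with every slot specialized.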

SLIDE 16

EVALUATION (ReVerb ClueWeb Extraction dataset)

DataXRay vs. SetCover [Chvatal79]

Execution time: 0.43 sec (DataXRay) vs. 3 sec (SetCover)

[Figure: bar chart of recall, precision, and F-measure (axis 0.0 to 1.0) for DataXRay+Greedy, DataXRay, Greedy, RedBlue, DataAuditor, and FeatureSelection.]
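For readers unfamiliar with the set cover baseline, here is a minimal Python sketch of the standard greedy heuristic applied to the six example triples. It is a generic textbook version, not the implementation evaluated in the talk, and the feature names and abbreviated element labels are illustrative.

```python
def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset that covers the most uncovered items."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        name, best = max(subsets.items(), key=lambda kv: len(kv[1] & uncovered))
        if not best & uncovered:
            raise ValueError("subsets do not cover the universe")
        chosen.append(name)
        uncovered -= best
    return chosen


# Erroneous elements to cover, and the elements contained in each feature.
ERRORS = {"D.M.", "B.K.", "TSN", "P.M."}
FEATURES = {
    "source=euromusicxx.com": {"D.M.", "B.K.", "TSN", "P.M."},
    "source=wiki.com": {"S.B.", "U2"},
    "extractor=Extractor 1": {"D.M.", "B.K.", "TSN", "P.M.", "S.B.", "U2"},
}
PICKED = greedy_set_cover(ERRORS, FEATURES)
print(PICKED)
```

In this toy input the source feature and the extractor feature cover the same four errors, so the heuristic sees a tie and the dictionary order decides. Plain set cover only counts covered errors and ignores the true elements a feature also contains.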

SLIDE 17

EVALUATION (ReVerb ClueWeb Extraction dataset)

DataXRay vs. RedBlue [Peleg07]

Execution time: 0.43 sec (DataXRay) vs. 4.2 sec (RedBlue)

[Figure: bar chart of recall, precision, and F-measure (axis 0.0 to 1.0) for DataXRay+Greedy, DataXRay, Greedy, RedBlue, DataAuditor, and FeatureSelection.]

Annotation: finer-granularity features preferred.

SLIDE 18

EVALUATION (ReVerb ClueWeb Extraction dataset)

DataXRay vs. FeatureSelection [Tibshirani96, Ng04]

Execution time: 0.43 sec (DataXRay) vs. 5.5 sec (FeatureSelection)

[Figure: bar chart of recall, precision, and F-measure (axis 0.0 to 1.0) for DataXRay+Greedy, DataXRay, Greedy, RedBlue, DataAuditor, and FeatureSelection.]

Annotations:

  • Targets prediction
  • Redundant features
  • Low-error-rate features

SLIDE 19

EVALUATION SUMMARY

  • DataXRay is effective in several real-world scenarios: extraction errors, traffic incidents, …
  • DataXRay is better than alternative algorithms: classification, summarization, and set cover methods.
  • DataXRay is robust under different parameters and settings: different error rates, feature failure, …
  • DataXRay is parallelizable in MapReduce.

SLIDE 20

Takeaways

  • Diagnosis is different from cleaning: it reasons about the root cause of data errors.
  • Defined a good diagnosis: a cost function based on Bayesian analysis (conciseness, specificity, consistency).
  • Designed a scalable algorithm: it leverages the feature hierarchy, and the top-down iterative algorithm is efficient and easy to parallelize.

SLIDE 21

References

[Abiteboul99] S. Abiteboul, S. Cluet, T. Milo, P. Mogilevsky, J. Siméon, and S. Zohar. Tools for data translation and integration. IEEE Data Engineering Bulletin, 22(1):3–8, 1999.

[Carr00] R. D. Carr, S. Doddi, G. Konjevod, and M. Marathe. On the red-blue set cover problem. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 345–353, 2000.

[Chvatal79] V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.

[Dong14] Dong, Xin, et al. "Knowledge vault: A web-scale approach to probabilistic knowledge fusion." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.

[Eckerson02] W. W. Eckerson. Data warehousing special report: Data quality and the bottom line. http://www.adtmag.com/article.asp?id=6321, 2002.

[Fader11] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.

[Fan08] W. Fan, F. Geerts, and X. Jia. A revival of integrity constraints for data cleaning. Proc. VLDB Endow., 1(2):1522–1523, Aug. 2008.

[Golab08] L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. PVLDB, 1(1):376–390, Aug. 2008.

[Golab10] L. Golab, H. J. Karloff, F. Korn, and D. Srivastava. Data auditor: Exploring data quality and semantics using pattern tableaux. PVLDB, 3(2):1641–1644, 2010.

SLIDE 22

References

[Kalashnikov06] D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems, 31(2):716–767, June 2006.

[Ng04] A. Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In ICML, 2004.

[Peleg07] D. Peleg. Approximation algorithms for the label-cover_max and red-blue set cover problems. Journal of Discrete Algorithms, (1):55–64, March 2007.

[Quinlan86] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, Mar. 1986.

[Rahm00] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3–13, 2000.

[Raman01] V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, pages 381–390, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[Sakl12] M. Sakal and L. Raković. Errors in building and using electronic tables: Financial consequences and minimisation techniques. Strategic Management, 17(3):29–35, 2012.

[Samar08] V. Samar and S. Patni. Controlling the information flow in spreadsheets. CoRR, abs/0803.2527, 2008.

SLIDE 23

References

[Tibshirani96] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[ten Cate et al. 2015] ten Cate, Balder, et al. "High-Level Why-Not Explanations using Ontologies." Proceedings of the 34th ACM Symposium on Principles of Database Systems. ACM, 2015.