UNIVERSITY OF MASSACHUSETTS, AMHERST • College of Information and Computer Sciences
Data X-Ray: A diagnostic tool for data errors
Xiaolan Wang, Xin Luna Dong, Alexandra Meliou
Many applications rely on data
Knowledge graph (www.google.com)
Social network analytics
Shopping systems of retail companies
Data is not perfect! Erroneous data can be extremely costly!
[Diagram: web sources (TXT, DOM, TBL, ANO); extraction system (extractors, fusion); prKB]
3.0 billion extracted triples
More than 70% are wrong
[Diagram: web sources (TXT, DOM, TBL, ANO); extraction system (extractors, fusion); prKB; perfect KB]
Traditional method: identify errors and drop them
[Diagram: web sources (TXT, DOM, TBL, ANO); extraction system (extractors, fusion); prKB; perfect KB; annotated with "Faulty information" and "Bad extraction rules"]
[Diagram: web sources (TXT, DOM, TBL, ANO); extraction system (extractors, fusion); multiple prKBs]
(besoccer.com, date_of_birth, 1986_02_18)
# Triples: 630
Error rate: 100%
Context: Dates of birth of athletes extracted from besoccer.com are set to the default value 1986_02_18.
(Extractor S, obj: Baseball Coach)
# Triples: 674,000
Error rate: 89.3%
Context: reconciling all coaches to baseball coaches
E.g., [Bob Barton, profession, Baseball Coach]
(Extractor T, pred: namesakes, obj: the county)
# Triples: 4878
Error rate: 99.8%
E.g., [Salmon P. Chase, namesakes, the county]
Context: "The county was named for Salmon P. Chase, former senator and governor of Ohio"
Knowledge triple                       Correct?
<Domenico Modugno, DoB, 01/09/1958>    False
<Bert Kaempfert, DoB, 09/01/1961>      False
<The Singing Nun, DoB, 07/12/1963>     False
<Paul Mauriat, DoB, 10/02/1963>        False
<Shocking Blue, DoB, 02/07/1968>       True
<U2, DoB, 05/16/1987>                  True
Leveraging existing data cleaning methods [Abiteboul99, Fan08, Kalashnikov06, Rahm00, Raman01]
Knowledge triple                       Correct?  Subject      Predicate  Object         Web source       Extractor
<Domenico Modugno, DoB, 01/09/1958>    False     People/D.M.  Bio/DoB    Date/01091958  euromusicxx.com  Extractor 1
<Bert Kaempfert, DoB, 09/01/1961>      False     People/B.K.  Bio/DoB    Date/09011961  euromusicxx.com  Extractor 1
<The Singing Nun, DoB, 07/12/1963>     False     People/TSN   Bio/DoB    Date/07121963  euromusicxx.com  Extractor 1
<Paul Mauriat, DoB, 10/02/1963>        False     People/P.M.  Bio/DoB    Date/10021963  euromusicxx.com  Extractor 1
<Shocking Blue, DoB, 02/07/1968>       True      People/S.B.  Bio/DoB    Date/02071968  wiki.com         Extractor 1
<U2, DoB, 05/16/1987>                  True      People/U2    Bio/DoB    Date/05161987  wiki.com         Extractor 1
Group the erroneous data: dates from the website euromusicxx.com extracted by Extractor 1 are wrong. (Bad extraction rule: a U.S. date-format rule is used to extract dates from a European website.)
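To make the date-format mismatch concrete, here is a minimal sketch of how one string yields two different dates depending on whether it is read month-first (U.S.) or day-first (European). The helper `parse_date` and the choice of sample string are illustrative assumptions; the slides do not show the raw text on the source page or the extractor's actual rule.

```python
from datetime import date, datetime


def parse_date(text: str, day_first: bool) -> date:
    """Parse a slash-separated date, reading it day-first or month-first."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date()


# Illustrative value only: an ambiguous string in the style of the example table.
raw = "01/09/1958"

us_reading = parse_date(raw, day_first=False)  # month/day/year
eu_reading = parse_date(raw, day_first=True)   # day/month/year

print(us_reading.isoformat(), eu_reading.isoformat())
```

Because both readings parse without error for such values, the mistake is silent at extraction time; only values whose first two fields cannot be swapped, such as 05/16/1987 read day-first, fail to parse.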
Combination of metadata information and its correctness
Theorem 1: Deriving a diagnosis with minimum cost is NP-complete.
Pr(F | E): probability of being the cause of errors under the observation of data items
[Formula: terms labeled "error rate of the feature", "true elements in the feature", and "false elements in the feature"]
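The quantities named above can be illustrated on the six triples from the example table. This is a minimal sketch, assuming a feature is a (type, source, extractor) combination with "all" as a wildcard, and reading the error rate of a feature as the fraction of false elements among the elements it covers. The data encoding and the names `features_of` and `error_rates` are illustrative and not taken from the slides.

```python
from collections import defaultdict
from itertools import product

# Each element: ((type, source, extractor), correct?), encoded from the example table.
ELEMENTS = (
    [(("date", "euromusicxx", "extractor1"), False)] * 4
    + [(("date", "wiki", "extractor1"), True)] * 2
)


def features_of(props):
    """Every feature covering an element: each property is kept or generalized to 'all'."""
    return {
        tuple(p if keep else "all" for p, keep in zip(props, mask))
        for mask in product((True, False), repeat=len(props))
    }


def error_rates(elements):
    """Map each feature to (false elements) / (false + true elements) it covers."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [false count, true count]
    for props, correct in elements:
        for feature in features_of(props):
            counts[feature][1 if correct else 0] += 1
    return {f: false / (false + true) for f, (false, true) in counts.items()}


rates = error_rates(ELEMENTS)
for feature in sorted(rates):
    print(feature, round(rates[feature], 3))
```

On this data, every feature that fixes the source to euromusicxx covers only false elements, while the root feature (all, all, all) mixes four false elements with two true ones.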
[Diagram: top-down traversal of the feature hierarchy]
(all, all, all)
(all, all, extractor1) (all, wiki, all) (all, euromusicxx, all) (date, all, all)
(all, wiki, extractor1) (all, euromusicxx, extractor1) (date, all, extractor1) (date, wiki, all) (date, euromusicxx, all)
Steps shown: split, split, compare, merge, compare, merge
Theorem 2: The DataXRay traversal has linear complexity in the number of features, with an O(# of features) approximation.
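The split and compare steps can be sketched as a level-by-level walk down the feature hierarchy. The version below is a deliberately simplified stand-in, not the authors' algorithm: a fixed error-rate threshold (0.9, an arbitrary assumption) replaces the compare step, a feature that passes is kept instead of being split further, and no guarantee from Theorem 2 is claimed for it. The names `covers`, `split`, and `diagnose` are hypothetical.

```python
# Each element: ((type, source, extractor), correct?), encoded from the example table.
ELEMENTS = (
    [(("date", "euromusicxx", "extractor1"), False)] * 4
    + [(("date", "wiki", "extractor1"), True)] * 2
)


def covers(feature, props):
    """True if every position of the feature is 'all' or equals the given value."""
    return all(f == "all" or f == p for f, p in zip(feature, props))


def split(feature, covered):
    """Children of a feature: specialize one 'all' position to a value seen in the data."""
    kids = set()
    for i, value in enumerate(feature):
        if value == "all":
            for props, _ in covered:
                kids.add(feature[:i] + (props[i],) + feature[i + 1:])
    return sorted(kids)


def diagnose(elements, threshold=0.9):
    """Top-down walk: keep a feature if its error rate is high, otherwise split it."""
    root = ("all",) * len(elements[0][0])
    frontier, seen, diagnosis = [root], {root}, []
    while frontier:
        next_frontier = []
        for feature in frontier:
            # Skip specializations of a feature that was already reported.
            if any(covers(kept, feature) for kept in diagnosis):
                continue
            covered = [e for e in elements if covers(feature, e[0])]
            if not covered:
                continue
            rate = sum(not ok for _, ok in covered) / len(covered)
            if rate >= threshold:
                diagnosis.append(feature)      # report the coarse feature as is
            elif rate > 0:
                for child in split(feature, covered):
                    if child not in seen:      # each feature is visited once
                        seen.add(child)
                        next_frontier.append(child)
        frontier = next_frontier
    return diagnosis


print(diagnose(ELEMENTS))
```

On the example data the walk reports only the coarse feature that fixes the source to euromusicxx; its specializations are skipped because the parent already explains the same false elements.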
[Bar chart: recall, precision, and F-measure (axis 0.0 to 1.0) for DataXRay+Greedy, DataXRay, Greedy, RedBlue, DataAuditor, and FeatureSelection]
DataXRay vs. SetCover [Chvatal79]
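For readers unfamiliar with the set cover baseline, the following is a minimal sketch of the classic greedy heuristic: repeatedly pick the set that covers the most still-uncovered items. Mapping it to error diagnosis (items are false elements, sets are features) is an assumption made here for illustration, and the feature names and element ids are hypothetical; the slides do not describe how the baseline was configured.

```python
def greedy_cover(universe, candidate_sets):
    """Greedy set cover: pick the set with the largest uncovered gain until done."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(candidate_sets, key=lambda name: len(candidate_sets[name] & uncovered))
        gain = candidate_sets[best] & uncovered
        if not gain:  # remaining items cannot be covered by any candidate
            break
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical ids of false elements and the features that cover them.
FALSE_ELEMENTS = {0, 1, 2, 3, 4}
FEATURES = {
    "(all, siteA, all)": {0, 1, 2},
    "(all, all, extractorX)": {2, 3},
    "(all, siteB, all)": {3, 4},
    "(date, siteA, all)": {0},
}

chosen = greedy_cover(FALSE_ELEMENTS, FEATURES)
print(chosen)
```

Note that plain set cover looks only at which false elements a feature covers; it does not account for the true elements covered by the same feature.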
DataXRay vs. RedBlue [Peleg07]
Finer-granularity features preferred
DataXRay vs. FeatureSelection [Tibshirani96, Ng04]
Targets prediction
Redundant features
Low error rate features
Extraction errors, traffic incidents, …
Classification, summarization, set cover methods
Different error rate, feature failure, …
The top-down iterative algorithm is
[Abiteboul99] S. Abiteboul, S. Cluet, T. Milo, P. Mogilevsky, J. Siméon, and S. Zohar. Tools for data translation and integration. IEEE Data Engineering Bulletin, 22(1):3–8, 1999.
[Carr00] R. D. Carr, S. Doddi, G. Konjevod, and M. Marathe. On the red-blue set cover problem. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 345–353, 2000.
[Chvatal79] V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics
[Dong14] X. Dong et al. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.
[Eckerson02] W. W. Eckerson. Data warehousing special report: Data quality and the bottom line. http://www.adtmag.com/article.asp?id=6321, 2002.
[Fader11] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
[Fan08] W. Fan, F. Geerts, and X. Jia. A revival of integrity constraints for data
[Golab08] L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near- […] 2008.
[Golab10] L. Golab, H. J. Karloff, F. Korn, and D. Srivastava. Data auditor: Exploring data quality and semantics using pattern tableaux. PVLDB, 3(2):1641–1644, 2010.
[Kalashnikov06] D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems, 31(2):716–767, June 2006.
[Ng04] A. Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In ICML, 2004.
[Peleg07] D. Peleg. Approximation algorithms for the label-cover_max and red-blue set cover problems. Journal of Discrete Algorithms, (1):55–64, March 2007.
[Quinlan86] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, Mar. 1986.
[Rahm00] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3–13, 2000.
[Raman01] V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, pages 381–390, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[Sakl12] M. Sakal and L. Raković. Errors in building and using electronic tables: Financial consequences and minimisation techniques. Strategic Management, 17(3):29–35, 2012.
[Samar08] V. Samar and S. Patni. Controlling the information flow in spreadsheets. CoRR, abs/0803.2527, 2008.
[Tibshirani96] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal
[ten Cate et al. 2015] B. ten Cate et al. High-Level Why-Not Explanations using Ontologies. In Proceedings of the 34th ACM Symposium on Principles of Database