Data Quality: Where are we on the journey from theory to practice?
Angela Bonifati
University of Lyon 1 Liris – CNRS, France
June 23, 2017
Angela Bonifati Atelier Qualimados@Madics2017 June 23, 2017 1 / 27
1 ‘Dirty Data’ is a Business Problem, Not an IT Problem, Gartner.
§ What data dependencies should we use to detect errors?
§ What repair model do we adopt to fix the errors?
1 Wenfei Fan: Data Quality: From Theory to Practice. SIGMOD Record, 2015.
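One familiar kind of data dependency for error detection is the functional dependency. The following is a minimal sketch, assuming rows are plain Python dicts and using a hypothetical rule zip → city; the function name `fd_violations` and the sample data are illustrative assumptions, not taken from the talk.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on lhs but disagree on rhs."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].append(row)
    violations = {}
    for key, members in groups.items():
        # More than one distinct rhs value under the same lhs value violates lhs -> rhs.
        rhs_values = {tuple(m[a] for a in rhs) for m in members}
        if len(rhs_values) > 1:
            violations[key] = members
    return violations

# Hypothetical sample data: two rows share a zip code but name different cities.
rows = [
    {"zip": "69100", "city": "Villeurbanne"},
    {"zip": "69100", "city": "Lyon"},
    {"zip": "75001", "city": "Paris"},
]
conflicts = fd_violations(rows, ["zip"], ["city"])
```

The sketch only flags conflicting groups; deciding which value to keep is the job of the repair model.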
§ given a finite set Σ ⊆ C defined on a relational schema R, whether there exists a nonempty instance D of R such that D ⊨ Σ.
§ That is, whether the data quality rules in Σ are themselves consistent.
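As a small illustration of rules that are not consistent, assume C is the class of conditional functional dependencies (an assumption; the slide leaves C generic). Both rules below apply to every tuple and force different constants b ≠ c on attribute B, so no nonempty instance can satisfy them together:

```latex
\varphi_1 = \bigl(A \rightarrow B,\ (\_ \,\|\, b)\bigr), \qquad
\varphi_2 = \bigl(A \rightarrow B,\ (\_ \,\|\, c)\bigr), \qquad b \neq c
\;\Longrightarrow\;
\nexists\, D \neq \emptyset \text{ over } R \ :\ D \models \{\varphi_1, \varphi_2\}
```

Any tuple t in such a D would need t[B] = b and t[B] = c at once, which is impossible.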
§ given a finite set Σ ⊆ C and φ ∈ C defined on a relational schema R, whether Σ ⊨ φ.
§ That is, whether data quality rules in Σ can be removed to speed up error detection and repairing.
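For plain functional dependencies, implication can be decided with the textbook attribute-closure test. The sketch below assumes C is the class of functional dependencies and represents each rule as a pair of attribute tuples; the helper names and the sample set `sigma` are illustrative assumptions.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under a list of FDs given as (lhs, rhs) tuples."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, fd):
    """True if the FDs in fds imply fd (rhs lies in the closure of lhs)."""
    lhs, rhs = fd
    return set(rhs) <= closure(lhs, fds)

def remove_redundant(fds):
    """Greedily drop every FD that is implied by the remaining ones."""
    kept = list(fds)
    for fd in list(fds):
        rest = [f for f in kept if f != fd]
        if implies(rest, fd):
            kept = rest
    return kept

# Hypothetical rule set: A -> C follows from A -> B and B -> C.
sigma = [(("A",), ("B",)), (("B",), ("C",)), (("A",), ("C",))]
minimal = remove_redundant(sigma)
```

Here `minimal` keeps only A → B and B → C, so one rule fewer has to be checked against the data.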
§ Given an instance D of R, a set E of entity types, a set X of attributes
§ for all tuples t, t’ in D, and for each entity type e[X], whether t[X] and t’[X] refer to the same real-world entity.
§ rule-based (in this talk), probabilistic, learning-based and distance-based.
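A rule-based approach can be sketched as a set of matching rules, each listing attributes on which two tuples must agree. The rules, attribute names and records below are hypothetical, and the normalization step is a deliberately crude assumption.

```python
from itertools import combinations

def normalize(value):
    """Lowercase and keep only letters and digits, so formatting differences vanish."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

# Hypothetical matching rules: two tuples match if they agree (after
# normalization, on non-empty values) on every attribute of at least one rule.
RULES = [("phone",), ("name", "zip")]

def same_entity(t1, t2, rules=RULES):
    return any(
        all(normalize(t1[a]) != "" and normalize(t1[a]) == normalize(t2[a]) for a in attrs)
        for attrs in rules
    )

def resolve(rows):
    """Compare all pairs of rows and return the index pairs judged to match."""
    return [
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(rows), 2)
        if same_entity(r1, r2)
    ]

rows = [
    {"name": "Ann Lee", "zip": "69100", "phone": "+33 4 72 00 00 01"},
    {"name": "ann lee", "zip": "69100", "phone": ""},
    {"name": "A. Lee", "zip": "69622", "phone": "+33-4-72-00-00-01"},
    {"name": "Bob Roy", "zip": "75001", "phone": "+33 1 40 00 00 02"},
]
matches = resolve(rows)
```

The pairwise comparison is quadratic in the number of tuples, and the sketch does not merge matching pairs into clusters; both are left out to keep the example short.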
§ Syntactical patterns, such as date formatting
§ Semantical patterns, such as location names
§ Statistical outliers
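The three kinds of checks listed above can be sketched with the Python standard library. The date format, the city dictionary and the outlier threshold are assumptions chosen for illustration; the outlier test uses the modified z-score based on the median absolute deviation, which is one option among many.

```python
import re
import statistics

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")   # assumed target format: YYYY-MM-DD
KNOWN_CITIES = {"Lyon", "Paris", "Marseille"}     # hypothetical reference dictionary

def syntactic_errors(values):
    """Values that do not match the expected date format."""
    return [v for v in values if not DATE_PATTERN.fullmatch(v)]

def semantic_errors(values):
    """Values that are not known location names."""
    return [v for v in values if v not in KNOWN_CITIES]

def outliers(values, threshold=3.5):
    """Values whose modified z-score (based on the MAD) exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, nothing can be flagged
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

bad_dates = syntactic_errors(["2017-06-23", "23/06/2017"])
bad_cities = semantic_errors(["Lyon", "Lyonn"])
bad_numbers = outliers([10, 11, 12, 11, 10, 95])
```

Each check flags a different cell here: the slash-formatted date, the misspelled city, and the value 95.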
§ What is the precision and recall of each tool?
§ How many errors in the data sets are detectable by applying all the tools?
§ Is there a strategy to minimize human effort by leveraging the
1 Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad
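To make the first two questions concrete, precision and recall can be computed per tool and for the union of all tools against a set of known errors. The tool names, cell identifiers and ground truth below are made up for illustration and do not come from the cited study.

```python
def precision_recall(detected, truth):
    """Precision and recall of a set of flagged cells against the true errors."""
    true_positives = len(detected & truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical ground truth and tool outputs, identified by cell ids.
truth = {1, 2, 3, 4}
tools = {
    "rules": {1, 2, 9},
    "outliers": {3},
}

per_tool = {name: precision_recall(found, truth) for name, found in tools.items()}
combined = set().union(*tools.values())
combined_scores = precision_recall(combined, truth)
```

In this toy setting the union reaches a recall of 0.75, higher than either tool alone, while error 4 stays undetected by every tool.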
1 https://liris.cnrs.fr/medclean/wiki/doku.php