Automated Data Quality Assurance with Machine Learning and Autoencoders

Automated Data Quality Assurance with Machine Learning and Autoencoders. SDS2019, Bern, 14.06.2019. Martin Müller-Lennert, Senior Data Scientist (martin.mueller-lennert@incubegroup.com); Milica Petrović, Senior Data Scientist.


  1. Automated Data Quality Assurance with Machine Learning and Autoencoders
     SDS2019, Bern, 14.06.2019
     Martin Müller-Lennert, Senior Data Scientist (martin.mueller-lennert@incubegroup.com)
     Milica Petrović, Senior Data Scientist (milica.petrovic@incubegroup.com)
     Digitization - Data - Intelligence

  2. Talk Outline
     1 What’s Wrong with Data Quality?
     2 Error Detection using ML
     3 Demo
     4 Findings and Outlook

  4. Data Quality Today: What’s wrong with it?
     [Diagram labels: Data Sources, Data Quality, End User]

  12. Data Quality Today: Our Take at a Solution
     Data Quality Today
     ▪ Manually coded SQL rules
     ▪ Uni-/bi-variate checks
     Challenges
     ▪ Too much data: set of rules
     ▪ Too few rules: undetected errors
     ▪ Too narrow focus: one-dimensional
     ▪ Too late: new error types detected after occurrence
     Our Take at a Solution
     ▪ Unsupervised autoencoders
     ▪ Model of input data → anomalies easily detected
     ▪ Capture multivariate relationships
     Solutions with Machine Learning
     ▪ Automate: simultaneous error detection and faster process
     ▪ Reusability: tailored ML algorithms reused for fields of similar type
     ▪ Deep dive: discovery of new types of errors based on multivariate relationships

  13. Talk Outline
     1 What’s Wrong with Data Quality?
     2 Error Detection using ML
     3 Demo
     4 Findings and Outlook

  14. Autoencoders for Data Quality: Architecture and Training
     [Diagram labels: INPUT, OUTPUT, INPUT; note: for good data only]
     ▪ Target: reconstruct input
     ▪ Bottleneck: enforced by architecture or regularization
     ▪ Ensures network learns structure of input data

  15. Autoencoders for Data Quality: Architecture and Training
     [Diagram labels: INPUT, OUTPUT, INPUT; note: for good data only]
     Training on imperfect data:
     ▪ Requires large share of good data
     ▪ Limits potency of network: more layers not always better
     ▪ From simple one-layer NN up to VAE with LSTM cells
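
The bottleneck idea from slides 14 and 15 can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a purely linear one-hidden-layer autoencoder trained with plain gradient descent in NumPy, and the toy data (two features where the second is twice the first) as well as all function names and hyperparameters are hypothetical. Records that follow the learned relationship reconstruct well; a record that breaks it gets a large mean squared error.

```python
import numpy as np

def train_autoencoder(X, n_hidden=1, lr=0.05, epochs=1000, seed=0):
    """Train a linear autoencoder (encoder and decoder matrices) by gradient descent."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_features, n_hidden))
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_features))
    for _ in range(epochs):
        Z = X @ W_enc                  # encode into the bottleneck
        err = Z @ W_dec - X            # reconstruction minus input
        grad_dec = Z.T @ err / len(X)
        grad_enc = X.T @ (err @ W_dec.T) / len(X)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec

def reconstruction_error(X, W_enc, W_dec):
    """Mean squared reconstruction error per record."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2, axis=1)

# Hypothetical toy data: good records satisfy second feature = 2 * first feature.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
good = np.column_stack([x, 2 * x]) + rng.normal(scale=0.01, size=(200, 2))
bad = np.array([[1.0, -2.0]])          # violates the relationship

W_enc, W_dec = train_autoencoder(good)
good_errors = reconstruction_error(good, W_enc, W_dec)
bad_errors = reconstruction_error(bad, W_enc, W_dec)
print(good_errors.max(), bad_errors[0])
```

The single hidden unit plays the role of the bottleneck enforced by architecture; a regularized or variational model, as mentioned on the slide, would replace this stand-in in practice.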

  16. Discriminating Good and Bad Data Records: Clustering the Reconstruction Errors
     [Plot labels: Kernel Density Estimate, Mean Squared Error, Individual Data Records]
     Challenge: many data points and potentially extreme class imbalance
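
One possible way to turn a kernel density estimate of the reconstruction errors into a decision threshold is sketched below. The slides do not say how the clusters are separated, so the rule used here (cut at the first density valley to the right of the main mode) is an assumption, as are the function names, the bandwidth, and the toy error values.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def valley_threshold(errors, bandwidth, grid_size=200):
    """Return the first local density minimum to the right of the main mode, or None."""
    density = gaussian_kde(errors, bandwidth)
    lo, hi = min(errors), max(errors)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    dens = [density(x) for x in grid]
    peak = max(range(grid_size), key=dens.__getitem__)
    for i in range(peak + 1, grid_size - 1):
        if dens[i] <= dens[i - 1] and dens[i] < dens[i + 1]:
            return grid[i]
    return None  # density is unimodal on the grid: no valley found

# Hypothetical mean squared errors: many good records, a few bad ones.
errors = [0.010, 0.012, 0.009, 0.011, 0.013, 0.008, 0.010, 0.011,
          0.095, 0.100, 0.105]
threshold = valley_threshold(errors, bandwidth=0.01)
flagged = [e for e in errors if e > threshold]
print(threshold, flagged)
```

With extreme class imbalance the second mode can be very flat, so the bandwidth and the grid resolution would need tuning on real data.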

  18. Discriminating Good and Bad Data Records: Sequence of Autoencoders
     1st iteration:
     ▪ Remove detected anomalies
     ▪ Keep rest of data
     Challenge: magnitude of reconstruction error varies across data error types

  21. Discriminating Good and Bad Data Records: Sequence of Autoencoders
     ▪ Across iterations: increase model complexity
     ▪ Stopping: when threshold separates large chunk of data
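
The iterative scheme on slides 18 to 21 can be sketched as a loop: fit a model, remove the records whose reconstruction error exceeds a threshold, keep the rest, and repeat with a more complex model. The sketch below makes several assumptions that are not in the slides: a PCA-based linear autoencoder stands in for the real networks, "complexity" means the bottleneck size, the per-round thresholds are supplied by hand, and the stopping rule is read as "stop once a round would flag more than a given share of the data". All names and the toy data are hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(X, n_components):
    """Stand-in model: optimal linear autoencoder (PCA) with the given bottleneck size."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    V = Vt[:n_components].T
    def errors(Y):
        centered = Y - mean
        return np.mean((centered - centered @ V @ V.T) ** 2, axis=1)
    return errors

def iterative_cleaning(X, thresholds, max_flag_share=0.2):
    """Flag anomalies over several rounds; round i uses bottleneck size i + 1."""
    keep = np.arange(len(X))
    flagged = []
    for i, threshold in enumerate(thresholds):
        errors = fit_linear_autoencoder(X[keep], n_components=i + 1)(X[keep])
        bad = errors > threshold
        if bad.mean() > max_flag_share:
            break  # the threshold would cut off a large chunk of the data
        flagged.extend(keep[bad].tolist())
        keep = keep[~bad]
    return keep, sorted(flagged)

# Hypothetical data: good records lie on the plane z = x + y.
rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(300, 2))
z = xy.sum(axis=1) + rng.normal(scale=0.01, size=300)
good = np.column_stack([xy, z])
gross = np.array([[0.0, 0.0, 6.0]])    # large error, caught by the simple model
subtle = np.array([[0.5, 0.5, 2.0]])   # small error, needs the richer model
data = np.vstack([good, gross, subtle])

keep, flagged = iterative_cleaning(data, thresholds=[2.0, 0.05])
print(flagged)
```

In this toy setup the first round only catches the gross error, since the simple model's errors on good data are far above the second round's threshold; this mirrors the slide's remark that error magnitudes vary across error types.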

  22. Talk Outline
     1 What’s Wrong with Data Quality?
     2 Error Detection using ML
     3 Demo
     4 Findings and Outlook

  23. Demo: Birth date

  25. Demo: First name

  27. Demo: Revenue

  29. Talk Outline
     1 What’s Wrong with Data Quality?
     2 Error Detection using ML
     3 Demo
     4 Findings and Outlook

  30. Reusability of Pre-Processing and Model Setup
     Type of variable: pre-processing; model
     ▪ Character: one-hot encoding of characters; variational autoencoders with LSTM cells
     ▪ Categorical: one-hot encoding; complete autoencoder with regularization
     ▪ Date: numerical features from digits, normalization; complete autoencoder with regularization
     ▪ Numerical: normalization; undercomplete autoencoder with custom loss
     Generic pipeline per field type → can be reused for other fields of same type
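
The pre-processing steps named on slide 30 can be sketched per field type as below. The slides give no details, so the character set, the fixed string length, the assumed `DD.MM.YYYY` date format, the per-digit scaling, and all function names are hypothetical choices made for illustration.

```python
import string

ALPHABET = string.ascii_lowercase + " -"   # hypothetical character set

def one_hot_chars(text, max_len=12, alphabet=ALPHABET):
    """Encode a string as max_len one-hot rows, one per character position."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    lowered = text.lower()
    rows = []
    for pos in range(max_len):
        row = [0] * (len(alphabet) + 1)    # last slot: padding or unknown character
        ch = lowered[pos] if pos < len(lowered) else None
        row[index.get(ch, len(alphabet))] = 1
        rows.append(row)
    return rows

def date_digit_features(date_str):
    """Turn a date such as '14.06.2019' into eight digit features scaled to [0, 1]."""
    digits = [c for c in date_str if c.isdigit()]
    if len(digits) != 8:
        raise ValueError(f"expected 8 digits, got {date_str!r}")
    return [int(d) / 9 for d in digits]

def min_max_normalize(values):
    """Scale numerical values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

name_rows = one_hot_chars("Ana", max_len=4)
date_row = date_digit_features("14.06.2019")
revenue = min_max_normalize([10, 20, 30])
print(len(name_rows), date_row, revenue)
```

Because each function depends only on the field type, the same encoder can be applied to any other field of that type, which is the reuse argument the slide makes.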

  31. Key Findings from Application to Production Data
     Training
     ▪ Unsupervised learning! Training data quality matters
     ▪ SME feedback necessary: sanity checks during model building
     ▪ High reusability: one-time customization effort
     Performance
     ▪ Replication: ML can automatically replicate rule-based data quality checks
     ▪ Extension: autoencoders can find additional errors
     ▪ Multivariate relationships: detection of interdependencies

  32. Outlook: Future Endeavors
     Model lifecycle: improve models over time from feedback
     ▪ Detected anomalies are errors: correct or leave out
     ▪ Detected anomalies are false positives: increase weight during training
     Batch processing: extend error detection to whole batches of data
     ▪ Detect faulty data sourcing process
     Automated error correction
     ▪ Data remediation using RPA
