Automated Data Quality Assurance with Machine Learning and Autoencoders
SDS2019 Bern, 14.06.2019
Martin Müller-Lennert, Senior Data Scientist, martin.mueller-lennert@incubegroup.com
Milica Petrović, Senior Data Scientist, milica.petrovic@incubegroup.com
Digitization - Data - Intelligence
Talk Outline
1 What’s Wrong with Data Quality?
2 Error Detection using ML
3 Demo
4 Findings and Outlook

1 What’s Wrong with Data Quality?
Data Quality Today: What’s wrong with it?
[Figure: diagram with the labels Data Sources, Data Quality, End User]
Data Quality Today: Our Take at a Solution
Data Quality Today
▪ Manually coded SQL rules
▪ Uni-/bi-variate checks
Challenges
▪ Too much data: set of rules
▪ Too few rules: undetected errors
▪ Too narrow focus: one-dimensional
▪ Too late: new error types detected after occurrence
Solutions with Machine Learning
▪ Unsupervised autoencoders
▪ Model of input data → anomalies easily detected
▪ Capture multivariate relationships
▪ Automate: simultaneous error detection and faster process
▪ Reusability: tailored ML algorithms reused for fields of similar type
▪ Deep dive: discovery of new types of errors based on multivariate relationships
2 Error Detection using ML
Autoencoders for Data Quality: Architecture and Training
[Figure: autoencoder diagram with the labels INPUT, OUTPUT, INPUT, "For good data only"]
▪ Target: reconstruct the input
▪ Bottleneck: enforced by architecture or regularization; ensures the network learns the structure of the input data
▪ Training on imperfect data: requires a large share of good data
▪ Limits potency of network: more layers are not always better
▪ From a simple one-layer NN up to a VAE with LSTM cells
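The bottleneck idea above can be illustrated with a deliberately tiny sketch. It is not the model from the talk: it assumes a linear autoencoder with two input fields and a single bottleneck unit, synthetic data, and plain standard-library Python instead of a deep learning framework. The names `encode`, `reconstruction_error` and `train` are hypothetical.

```python
import random

def encode(x, enc):
    """Compress a 2-field record into a single number (the bottleneck)."""
    return enc[0] * x[0] + enc[1] * x[1]

def reconstruction_error(x, enc, dec):
    """Squared error between a record and its reconstruction."""
    z = encode(x, enc)
    return sum((dec[j] * z - x[j]) ** 2 for j in range(2))

def train(records, epochs=300, lr=0.05, seed=0):
    """Full-batch gradient descent on the mean reconstruction error."""
    rng = random.Random(seed)
    enc = [rng.uniform(-0.1, 0.1) for _ in range(2)]  # encoder weights
    dec = [rng.uniform(-0.1, 0.1) for _ in range(2)]  # decoder weights
    n = len(records)
    for _ in range(epochs):
        grad_enc, grad_dec = [0.0, 0.0], [0.0, 0.0]
        for x in records:
            z = encode(x, enc)
            residual = [dec[j] * z - x[j] for j in range(2)]
            # Error signal flowing back through the decoder to the bottleneck.
            back = residual[0] * dec[0] + residual[1] * dec[1]
            for j in range(2):
                grad_dec[j] += 2 * residual[j] * z / n
                grad_enc[j] += 2 * back * x[j] / n
        enc = [enc[j] - lr * grad_enc[j] for j in range(2)]
        dec = [dec[j] - lr * grad_dec[j] for j in range(2)]
    return enc, dec

# Synthetic "good" records (assumption): two fields that move together,
# plus a small amount of noise.
data_rng = random.Random(1)
good = []
for _ in range(200):
    t, noise = data_rng.gauss(0, 1), data_rng.gauss(0, 0.05)
    good.append([0.6 * t - 0.8 * noise, 0.8 * t + 0.6 * noise])

enc, dec = train(good)
good_errors = [reconstruction_error(x, enc, dec) for x in good]
# A record whose two fields contradict the learned relationship.
bad_error = reconstruction_error([-0.5, 1.0], enc, dec)
print(f"largest error on good data:   {max(good_errors):.4f}")
print(f"error on inconsistent record: {bad_error:.4f}")
```

Because a single bottleneck unit can only carry the shared structure of the two fields, a record that violates that structure cannot be reconstructed well and stands out by its error.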
Discriminating Good and Bad Data Records: Clustering the Reconstruction Errors
[Figure: mean squared error of individual data records, with kernel density estimate]
▪ Challenge: many data points and potentially extreme class imbalance
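One way to turn a kernel density estimate of the reconstruction errors into a cut-off is sketched below. The slides do not state how the threshold is chosen, so the rule used here (first density minimum to the right of the main peak, computed on log-scaled errors) is an assumption, as are the names `gaussian_kde` and `valley_threshold`, the bandwidth and the toy error values.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

def valley_threshold(errors, bandwidth=0.3, grid_size=200):
    """Place the cut at the first density minimum right of the main peak.

    Works on log10 of the (strictly positive) errors so that errors
    spanning several orders of magnitude stay comparable.
    Returns None when the density has no valley, i.e. a single cluster.
    """
    logs = [math.log10(e) for e in errors]
    lo, hi = min(logs), max(logs)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    density = gaussian_kde(logs, bandwidth)
    values = [density(x) for x in grid]
    peak = values.index(max(values))           # bulk of the records
    for i in range(peak, grid_size - 1):
        if values[i + 1] > values[i]:          # density starts rising again
            return 10 ** grid[i]
    return None

# Toy errors (assumption): 20 well-reconstructed records, 2 poorly reconstructed.
good_errors = [0.01 * (1 + 0.1 * k) for k in range(20)]
bad_errors = [1.0, 1.2]
threshold = valley_threshold(good_errors + bad_errors)
flagged = [e for e in good_errors + bad_errors if e > threshold]
print(flagged)
```

The valley rule does not need the two groups to be of similar size, which matters under class imbalance. The direct evaluation costs one kernel sum per grid point, so for many records subsampling or binning the errors first would be a natural refinement.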
Discriminating Good and Bad Data Records: Sequence of Autoencoders
▪ 1st iteration: remove detected anomalies, keep rest of data
▪ Challenge: magnitude of reconstruction error varies across data error types
▪ Across iterations: increase model complexity
▪ Stopping: when threshold separates large chunk of data
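The iteration described above can be sketched as a loop around two pluggable pieces: a scorer and a threshold rule. Everything concrete here is an assumption made for illustration: the function names, the `max_flagged_share` reading of the stopping rule, and in particular the stand-in scorer, which uses the squared distance to the median instead of an autoencoder and ignores the complexity level.

```python
import statistics

def iterative_cleaning(records, fit_and_score, find_threshold,
                       max_iterations=5, max_flagged_share=0.2):
    """Repeatedly fit a model, flag high-error records, and refit on the rest.

    fit_and_score(records, complexity) -> one reconstruction error per record
    find_threshold(errors)             -> errors above this value are flagged
    """
    remaining = list(records)
    removed_per_iteration = []
    # The complexity level grows by one with every iteration.
    for complexity in range(1, max_iterations + 1):
        errors = fit_and_score(remaining, complexity)
        cutoff = find_threshold(errors)
        flagged = [r for r, e in zip(remaining, errors) if e > cutoff]
        # Stop when nothing is flagged, or when the threshold cuts off a
        # large chunk of the data rather than a small anomalous group.
        if not flagged or len(flagged) > max_flagged_share * len(remaining):
            break
        removed_per_iteration.append(flagged)
        remaining = [r for r, e in zip(remaining, errors) if e <= cutoff]
    return remaining, removed_per_iteration

# Hypothetical stand-ins so the loop runs without a neural network.
def median_score(values, complexity):
    """Stand-in for an autoencoder: squared distance to the median.
    The complexity argument is accepted but ignored by this toy scorer."""
    centre = statistics.median(values)
    return [(v - centre) ** 2 for v in values]

def mean_plus_3_std(errors):
    return statistics.mean(errors) + 3 * statistics.pstdev(errors)

good = [95 + 0.2 * k for k in range(51)]        # values from 95 to 105
data = good + [1_000_000.0, 500.0, 520.0]       # one gross, two moderate errors
clean, removed = iterative_cleaning(data, median_score, mean_plus_3_std)
print(removed)
```

In this toy run the gross error dominates the error distribution in the first pass and hides the two moderate errors. They only become visible in the second pass, once the gross error has been removed, which is the motivation for a sequence of models rather than a single one.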
3 Demo
Demo: Birth date
Demo: First name
Demo: Revenue
4 Findings and Outlook
Reusability of Pre-Processing and Model Setup
Type of variable | Pre-processing                                | Model
Character        | One-hot encoding of characters                | Variational autoencoders with LSTM cells
Categorical      | One-hot encoding                              | Complete autoencoder with regularization
Date             | Numerical features from digits; normalization | Complete autoencoder with regularization
Numerical        | Normalization                                 | Undercomplete autoencoder with custom loss
Generic pipeline per field type → can be reused for other fields of same type
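The pre-processing column can be made concrete with a few small helpers, one per field type. This is a sketch under assumptions: the alphabet, the maximum length, the `DD.MM.YYYY` date layout, the digit scaling and all function names are illustrative choices, not details taken from the talk.

```python
import string

ALPHABET = string.ascii_lowercase + " -"   # assumed character set

def one_hot_characters(text, max_len=12, alphabet=ALPHABET):
    """One row per character position, one column per alphabet symbol.
    Unknown characters and padding positions stay all-zero."""
    rows = []
    for pos in range(max_len):
        row = [0] * len(alphabet)
        if pos < len(text):
            idx = alphabet.find(text[pos].lower())
            if idx >= 0:
                row[idx] = 1
        rows.append(row)
    return rows

def one_hot_category(value, categories):
    """Single 1 at the position of the matching category, else all zeros."""
    return [1 if value == c else 0 for c in categories]

def date_digit_features(date_text):
    """Turn a date such as 'DD.MM.YYYY' into its digits, each scaled to [0, 1]."""
    return [int(ch) / 9 for ch in date_text if ch.isdigit()]

def min_max_normalize(values):
    """Rescale a numeric column to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

name_rows = one_hot_characters("Anna", max_len=5)
country = one_hot_category("CH", ["CH", "DE", "FR"])
birth_date = date_digit_features("14.06.2019")
revenue = min_max_normalize([10, 20, 30])
```

Since each helper depends only on the field type and not on the specific field, the same function can be pointed at any other column of that type, which is the reuse the table describes.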
Key Findings from Application to Production Data
Training
▪ Unsupervised learning! Training data quality matters
▪ SME feedback necessary: sanity checks during model building
▪ High reusability: one-time customization effort
Performance
▪ Replication: ML can automatically replicate rule-based data quality checks
▪ Extension: autoencoders can find additional errors
▪ Multivariate relationships: detection of interdependencies
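A simple way to check the replication and extension claims on a given data set is to compare the record IDs flagged by the existing rules with those flagged by the model. The sketch below assumes both are available as collections of IDs; the function name `compare_with_rules` and the example IDs are hypothetical.

```python
def compare_with_rules(rule_flagged, model_flagged):
    """Split flagged record IDs into replicated, missed and additional ones."""
    rule_flagged, model_flagged = set(rule_flagged), set(model_flagged)
    replicated = rule_flagged & model_flagged
    return {
        "replicated": sorted(replicated),
        "missed": sorted(rule_flagged - model_flagged),
        "additional": sorted(model_flagged - rule_flagged),
        # Share of rule-based findings that the model also found.
        "rule_recall": len(replicated) / len(rule_flagged) if rule_flagged else 1.0,
    }

report = compare_with_rules(rule_flagged=[3, 7, 12],
                            model_flagged=[3, 7, 12, 20, 41])
print(report)
```

The "additional" records are only candidates: whether they are genuine new errors or false positives is what the SME feedback has to decide.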
Outlook: Future Endeavors
▪ Model lifecycle: improve models over time from feedback
  ▪ Detected anomalies are errors: correct or leave out
  ▪ Detected anomalies are false positives: increase weight during training
▪ Batch processing: extend error detection to whole batches of data
  ▪ Detect faulty data sourcing process
▪ Automated error correction
  ▪ Data remediation using RPA
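The feedback idea of increasing the weight of false positives during training could look like the following sketch. It assumes per-record sample weights in a weighted mean squared error; the weighting factor and the function names are illustrative assumptions, since the slides describe this only as future work.

```python
def weighted_mse(errors, weights):
    """Weighted mean of per-record reconstruction errors."""
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)

def upweight_false_positives(weights, false_positive_ids, factor=5.0):
    """Return new weights in which confirmed false positives count more."""
    return [w * factor if i in false_positive_ids else w
            for i, w in enumerate(weights)]

errors = [0.1, 0.1, 0.9, 0.1]     # record 2 was flagged but is actually valid
weights = [1.0] * len(errors)
new_weights = upweight_false_positives(weights, {2})

loss_before = weighted_mse(errors, weights)
loss_after = weighted_mse(errors, new_weights)
print(loss_before, loss_after)
```

With the higher weight, the poorly reconstructed but valid record contributes more to the loss, so the next training run has a stronger incentive to learn to reconstruct it.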