Automated Data Quality Assurance with Machine Learning and Autoencoders

SLIDE 1

Digitization - Data - Intelligence

Automated Data Quality Assurance with Machine Learning and Autoencoders

SDS2019 Bern, 14.06.2019

Martin Müller-Lennert Senior Data Scientist martin.mueller-lennert@incubegroup.com Milica Petrović Senior Data Scientist milica.petrovic@incubegroup.com

SLIDE 2

Talk Outline

1 What’s Wrong with Data Quality?
2 Error Detection using ML
3 Demo
4 Findings and Outlook

SLIDE 4

Data Quality Today

What’s wrong with it?

[Diagram: Data Sources → Data Quality → End User]

SLIDE 12

Data Quality Today

Our Take at a Solution

Data Quality Today
▪ Manually coded SQL rules
▪ Uni-/bi-variate checks

Challenges
▪ Too much data: set of rules
▪ Too few rules: undetected errors
▪ Too narrow focus: one-dimensional
▪ Too late: new error types detected after occurrence

Solutions with Machine Learning
▪ Automate: simultaneous error detection & faster process
▪ Reusability: tailored ML algorithms reused for fields of similar type
▪ Deep dive: discovery of new types of errors based on multivariate relationships

Autoencoders
▪ Unsupervised
▪ Model of input data: → Anomalies easily detected
▪ Capture multivariate relationships

SLIDE 13

Talk Outline

1 What’s Wrong with Data Quality?
2 Error Detection using ML
3 Demo
4 Findings and Outlook

SLIDE 14

Autoencoders for Data Quality

Architecture and Training

▪ Target: Reconstruct input
▪ Bottleneck: Enforced by architecture or regularization
▪ Ensures network learns structure of input data

For good data only

[Diagram: autoencoder, INPUT → OUTPUT]
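To make the bottleneck idea concrete, here is a minimal sketch of an autoencoder in plain Python. It is not the speakers' model: the tied-weight linear network (2 inputs, 1 bottleneck unit, 2 outputs), the synthetic records, the learning rate, and the names `train_autoencoder` and `reconstruction_error` are all assumptions chosen for illustration. The good records lie close to a line, so one bottleneck unit is enough to reconstruct them, while a record that breaks this relationship gets a much larger reconstruction error.

```python
import random

def train_autoencoder(records, steps=500, lr=0.05):
    """Fit a tied-weight linear autoencoder: 2 inputs -> 1 bottleneck unit -> 2 outputs."""
    w = [0.5, 0.5]  # arbitrary starting weights (assumption)
    n = len(records)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x in records:
            z = w[0] * x[0] + w[1] * x[1]            # encode: project onto the bottleneck
            r = [x[0] - z * w[0], x[1] - z * w[1]]   # residual: input minus reconstruction
            rw = r[0] * w[0] + r[1] * w[1]
            # Gradient of the squared reconstruction error with respect to w
            grad[0] += -2.0 * (z * r[0] + rw * x[0]) / n
            grad[1] += -2.0 * (z * r[1] + rw * x[1]) / n
        w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]
    return w

def reconstruction_error(w, x):
    """Mean squared error between a record and its reconstruction."""
    z = w[0] * x[0] + w[1] * x[1]
    return ((x[0] - z * w[0]) ** 2 + (x[1] - z * w[1]) ** 2) / 2.0

# Synthetic "good" records: the second field is roughly twice the first.
random.seed(0)
good = []
for _ in range(200):
    t = random.uniform(-1.0, 1.0)
    good.append((t + random.gauss(0.0, 0.01), 2.0 * t + random.gauss(0.0, 0.01)))

w = train_autoencoder(good)
good_errors = [reconstruction_error(w, x) for x in good]
bad_error = reconstruction_error(w, (1.0, -2.0))  # violates the learned relationship

print("largest error on good records:", max(good_errors))
print("error on the inconsistent record:", bad_error)
```

Each field of the inconsistent record is within the normal range on its own; only the relationship between the two fields is wrong, which is the kind of multivariate error a univariate rule would miss.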

SLIDE 15

Autoencoders for Data Quality

Architecture and Training

Training on imperfect data:
▪ Requires large share of good data
▪ Limits potency of network: More layers not always better
▪ From simple one-layer NN up to VAE with LSTM cells

For good data only

[Diagram: autoencoder, INPUT → OUTPUT]

SLIDE 16

Discriminating Good and Bad Data Records

Clustering the Reconstruction Errors

Challenge: Many data points and potentially extreme class imbalance

[Plot: Mean Squared Error per individual data record, with Kernel Density Estimate]
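One way to turn a kernel density estimate of the reconstruction errors into a cut-off is to look for the first valley to the right of the main mode. The sketch below assumes a Gaussian kernel with a hand-picked bandwidth and made-up error values; the names `gaussian_kde` and `valley_threshold` are hypothetical, and the slides do not say how the threshold was actually chosen.

```python
import math

def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` at each grid point."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples) / norm
        for g in grid
    ]

def valley_threshold(errors, bandwidth, grid_step):
    """Return the first local minimum of the density to the right of its main mode."""
    n_points = int(max(errors) / grid_step) + 1
    grid = [i * grid_step for i in range(n_points + 1)]
    density = gaussian_kde(errors, grid, bandwidth)
    mode = max(range(len(grid)), key=lambda i: density[i])
    for i in range(mode, len(grid) - 1):
        if density[i + 1] > density[i]:  # density starts rising again: valley found
            return grid[i]
    return None  # no second bump, so no natural cut-off

# Made-up reconstruction errors: 50 good records and 3 anomalous ones (strong imbalance).
errors = [0.01 + 0.001 * k for k in range(50)] + [0.9, 1.0, 1.2]

threshold = valley_threshold(errors, bandwidth=0.1, grid_step=0.01)
flagged = [e for e in errors if e > threshold]
print("threshold:", threshold)
print("flagged errors:", flagged)
```

The bandwidth matters under extreme class imbalance: if it is too wide, the small bump formed by the anomalies is smoothed away and no valley is found.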

SLIDE 18

Discriminating Good and Bad Data Records

Sequence of Autoencoders

Challenge: Magnitude of reconstruction error varies across data error types

[Diagram, 1st iteration: Remove Detected Anomalies, Keep Rest of Data]

SLIDE 21

Discriminating Good and Bad Data Records

Sequence of Autoencoders

▪ Across iterations: Increase model complexity
▪ Stopping: When threshold separates large chunk of data
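The iteration scheme can be sketched as a loop: fit a model, threshold the reconstruction errors, remove what was flagged, and repeat on the rest. The code below is only an illustration under stated assumptions. `mean_model` is a deliberately trivial stand-in for an autoencoder (it "reconstructs" every record as the training mean). The largest-gap threshold, the `max_fraction` stopping value, and the data are all hypothetical. The `iteration` argument of `make_model` is where a real setup would increase model complexity; the stand-in ignores it.

```python
def mean_model(values):
    """Stand-in for an autoencoder: reconstructs every record as the training mean."""
    center = sum(values) / len(values)
    return lambda v: (v - center) ** 2  # squared reconstruction error

def largest_gap_threshold(errors):
    """Place the threshold in the middle of the largest gap between sorted errors."""
    ordered = sorted(errors)
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, i = max(gaps)
    return (ordered[i] + ordered[i + 1]) / 2.0

def iterative_cleaning(values, make_model, max_fraction=0.15, max_iterations=10):
    """Repeatedly fit, flag, and remove anomalies until the threshold cuts off a large chunk."""
    remaining, anomalies = list(values), []
    for iteration in range(max_iterations):
        if len(remaining) < 2:
            break
        error_of = make_model(remaining, iteration)
        errors = [error_of(v) for v in remaining]
        threshold = largest_gap_threshold(errors)
        flagged = [v for v, e in zip(remaining, errors) if e > threshold]
        if len(flagged) > max_fraction * len(remaining):
            break  # threshold separates a large chunk of data: stop
        anomalies.extend(flagged)
        remaining = [v for v, e in zip(remaining, errors) if e <= threshold]
    return remaining, anomalies

# Made-up field values: ten plausible entries, one extreme error, one moderate error.
data = [10.0, 10.5, 9.5, 10.2, 9.8, 1000.0, 10.1, 9.9, 50.0, 10.3, 9.7, 10.0]

clean, anomalies = iterative_cleaning(data, lambda vals, it: mean_model(vals))
print("anomalies in order of detection:", anomalies)
print("records kept:", len(clean))
```

In the first pass the extreme value distorts the model so much that the moderate error (50.0) has a smaller reconstruction error than the good records. It only becomes detectable once the extreme value has been removed, which is the motivation for running a sequence of models.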

SLIDE 22

Talk Outline

1 What’s Wrong with Data Quality?
2 Error Detection using ML
3 Demo
4 Findings and Outlook

SLIDE 23

Demo

Birth date

SLIDE 25

Demo

First name

SLIDE 27

Demo

Revenue

SLIDE 29

Talk Outline

1 What’s Wrong with Data Quality?
2 Error Detection using ML
3 Demo
4 Findings and Outlook

SLIDE 30

Reusability of Pre-Processing and Model Setup

Type of variable   Pre-processing                                  Model
Character          One-hot encoding of characters                  Variational autoencoders with LSTM cells
Categorical        One-hot encoding                                Complete autoencoder with regularization
Date               Numerical features from digits, normalization   Complete autoencoder with regularization
Numerical          Normalization                                   Undercomplete autoencoder with custom loss

Generic pipeline per field type → can be reused for other fields of same type
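As a rough illustration of the pre-processing column, the sketch below shows one possible encoding per field type. The alphabet, the maximum length, the division of date digits by 9, and all function names are assumptions made for this example; the slides do not specify these details.

```python
import string

# Assumed character set for name-like fields (hypothetical choice).
ALPHABET = string.ascii_lowercase + " -'"

def one_hot_chars(text, max_len=12):
    """Character field: one one-hot row per character, zero-padded to max_len rows."""
    rows = [[1 if ch == a else 0 for a in ALPHABET] for ch in text.lower()[:max_len]]
    while len(rows) < max_len:
        rows.append([0] * len(ALPHABET))
    return rows

def one_hot_category(value, categories):
    """Categorical field: a single one-hot vector over the known categories."""
    return [1 if value == c else 0 for c in categories]

def date_digit_features(date_text):
    """Date field: each digit becomes one numerical feature, scaled to [0, 1]."""
    return [int(ch) / 9.0 for ch in date_text if ch.isdigit()]

def min_max_normalize(values):
    """Numerical field: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

name_matrix = one_hot_chars("Ada", max_len=4)
category_vector = one_hot_category("CH", ["CH", "DE", "AT"])
date_vector = date_digit_features("1985-03-07")
revenue_vector = min_max_normalize([10.0, 20.0, 30.0])
print(len(name_matrix), category_vector, len(date_vector), revenue_vector)
```

Because each function depends only on the field type and not on the specific column, the same function can be applied to any other field of that type.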

SLIDE 31

Key Findings from Application to Production Data

▪ High reusability: One-time customization effort
▪ Replication: ML can automatically replicate rule-based data quality checks
▪ Extension: Autoencoders can find additional errors
▪ SME feedback necessary: Sanity checks during model building
▪ Multivariate relationships: Detection of interdependencies
▪ Unsupervised learning! Training data quality matters

Training Performance

SLIDE 32

▪ Model lifecycle: improve models over time from feedback
  ▪ Detected anomalies are errors: correct or leave out
  ▪ Detected anomalies are false positives: increase weight during training
▪ Automated error correction
  ▪ Data remediation using RPA
▪ Batch processing: extend error detection to whole batches of data
  ▪ Detect faulty data sourcing process
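The feedback step for false positives could be implemented as a weighted reconstruction loss, so that records confirmed as valid count more during retraining. This is a sketch of one possible reading of that bullet. The record ids, the `boost` factor of 5, and the function names are hypothetical.

```python
def training_weights(record_ids, false_positive_ids, boost=5.0):
    """Give records confirmed as false positives a higher weight; all others weight 1."""
    return [boost if rid in false_positive_ids else 1.0 for rid in record_ids]

def weighted_mse(errors, weights):
    """Weighted mean of per-record squared reconstruction errors."""
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)

record_ids = ["a", "b", "c"]
squared_errors = [0.1, 0.1, 0.9]  # record "c" was flagged, but reviewers confirmed it is valid
weights = training_weights(record_ids, false_positive_ids={"c"})

plain_loss = sum(squared_errors) / len(squared_errors)
weighted_loss = weighted_mse(squared_errors, weights)
print("plain loss:", plain_loss)
print("weighted loss:", weighted_loss)
```

With the higher weight, the poorly reconstructed but valid record dominates the loss, so the next training round is pushed to reconstruct it better and is less likely to flag it again.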