Automated Data Quality Assurance with Machine Learning and Autoencoders

Automated Data Quality Assurance with Machine Learning and Autoencoders. SDS2019, Bern, 14.06.2019. Martin Müller-Lennert, Senior Data Scientist (martin.mueller-lennert@incubegroup.com); Milica Petrović, Senior Data Scientist.


SLIDE 1

Digitization - Data - Intelligence

Automated Data Quality Assurance with Machine Learning and Autoencoders

SDS2019 Bern, 14.06.2019

Martin Müller-Lennert, Senior Data Scientist, martin.mueller-lennert@incubegroup.com
Milica Petrović, Senior Data Scientist, milica.petrovic@incubegroup.com

SLIDE 2

Talk Outline

1. What’s Wrong with Data Quality?
2. Error Detection using ML
3. Demo
4. Findings and Outlook

SLIDE 4

Data Quality Today

What’s wrong with it?

[Diagram labels: Data Sources, Data Quality, End User]

SLIDE 12

Data Quality Today

Our Take at a Solution

Data Quality Today
▪ Manually coded SQL rules
▪ Uni-/bi-variate checks

Challenges
▪ Too much data: set of rules
▪ Too few rules: undetected errors
▪ Too narrow focus: one-dimensional
▪ Too late: new error types detected after occurrence

Solutions with Machine Learning
▪ Automate: simultaneous error detection & faster process
▪ Reusability: tailored ML algorithms reused for fields of similar type
▪ Deep dive: discovery of new types of errors based on multivariate relationships

Autoencoders
▪ Unsupervised
▪ Model of input data → anomalies easily detected
▪ Capture multivariate relationships

SLIDE 14

Autoencoders for Data Quality

Architecture and Training

▪ Target: Reconstruct input
▪ Bottleneck: Enforced by architecture or regularization
▪ Ensures network learns structure of input data

For good data only

[Figure: autoencoder diagram, labels INPUT and OUTPUT]
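The reconstruction target and the bottleneck can be illustrated with a very small example. The following is a minimal sketch, not the architecture used in the talk: it assumes a linear autoencoder with a single hidden unit, trained with plain gradient descent in NumPy on synthetic two-field records. The function names, learning rate, and data are all hypothetical.

```python
import numpy as np


def train_linear_autoencoder(X, n_hidden=1, lr=0.1, epochs=2000, seed=0):
    """Fit encoder/decoder weights by gradient descent on the reconstruction MSE."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_features, n_hidden))
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_features))
    for _ in range(epochs):
        Z = X @ W_enc                      # bottleneck code (n_hidden < n_features)
        err = Z @ W_dec - X                # reconstruction minus input
        grad_dec = Z.T @ err / len(X)
        grad_enc = X.T @ (err @ W_dec.T) / len(X)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec


def reconstruction_error(X, W_enc, W_dec):
    """Mean squared reconstruction error per record."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2, axis=1)


# Synthetic "good" records (assumption): the second field is roughly half the first.
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
good = np.column_stack([x1, 0.5 * x1 + rng.normal(scale=0.05, size=500)])
W_enc, W_dec = train_linear_autoencoder(good)

bad = np.array([[1.0, -2.0]])              # violates the relationship between the fields
good_err = reconstruction_error(good, W_enc, W_dec)
bad_err = reconstruction_error(bad, W_enc, W_dec)
print(good_err.mean(), bad_err[0])
```

Because the single hidden unit can only carry the one direction along which good records vary, the record that breaks the relationship is reconstructed poorly and stands out by its error.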

SLIDE 15

Autoencoders for Data Quality

Architecture and Training

▪ Training on imperfect data: Requires large share of good data
▪ Limits potency of network: More layers not always better
▪ From simple one-layer NN up to VAE with LSTM cells

For good data only

[Figure: autoencoder diagram, labels INPUT and OUTPUT]

SLIDE 16

Discriminating Good and Bad Data Records

Clustering the Reconstruction Errors

Challenge: Many data points and potentially extreme class imbalance

[Figure: mean squared error of individual data records, with kernel density estimate]
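One way to turn a kernel density estimate of the reconstruction errors into a good/bad split is to cut at the density minimum between the two modes. The slides do not state the exact rule, so the sketch below is an assumption: it works on log10 errors (which must be strictly positive), uses a hand-written Gaussian KDE in NumPy, and takes the valley between the main mode and the strongest mode to its right. The bandwidth, grid size, and synthetic error values are hypothetical.

```python
import numpy as np


def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def valley_threshold(errors, bandwidth=0.2, n_grid=512):
    """Return the error value at the density minimum between the main mode and
    the strongest mode to its right, or None if there is no second mode."""
    log_err = np.log10(errors)
    grid = np.linspace(log_err.min() - 3 * bandwidth,
                       log_err.max() + 3 * bandwidth, n_grid)
    density = gaussian_kde(log_err, grid, bandwidth)
    peaks = [i for i in range(1, n_grid - 1)
             if density[i] > density[i - 1] and density[i] >= density[i + 1]]
    main = max(peaks, key=lambda i: density[i])
    right = [i for i in peaks if i > main]
    if not right:
        return None
    second = max(right, key=lambda i: density[i])
    valley = main + int(np.argmin(density[main:second + 1]))
    return float(10.0 ** grid[valley])


# Synthetic errors with strong class imbalance (assumption): 980 good, 20 bad records.
log_errors = np.concatenate([np.linspace(-2.4, -1.6, 980),
                             np.linspace(-0.3, 0.3, 20)])
errors = 10.0 ** log_errors
threshold = valley_threshold(errors)
flagged = errors > threshold
print(threshold, int(flagged.sum()))
```

Working on the log scale keeps the small anomaly mode visible even though it holds only 2% of the records.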

SLIDE 18

Discriminating Good and Bad Data Records

Sequence of Autoencoders

Challenge: Magnitude of reconstruction error varies across data error types

[Figure: 1st iteration; remove detected anomalies, keep rest of data]

SLIDE 21

Discriminating Good and Bad Data Records

Sequence of Autoencoders

▪ Across iterations: Increase model complexity
▪ Stopping: When threshold separates large chunk of data
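The iterate, remove, and stop logic can be sketched as a short loop. This is an illustration under stated assumptions, not the authors' implementation: a PCA projection stands in for the autoencoder (the number of components plays the role of model complexity), the threshold is the midpoint of the widest gap between sorted errors, and the 20% stopping share is arbitrary. All names and the synthetic data are hypothetical.

```python
import numpy as np


def pca_reconstruction_error(X, n_components):
    """Stand-in for an autoencoder: project records onto the top principal
    components (the bottleneck) and return the mean squared error per record."""
    mean = X.mean(axis=0)
    centered = X - mean
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:n_components].T
    reconstructed = centered @ basis @ basis.T + mean
    return np.mean((X - reconstructed) ** 2, axis=1)


def largest_gap_threshold(errors):
    """Hypothetical threshold rule: midpoint of the widest gap between sorted errors."""
    ordered = np.sort(errors)
    i = int(np.argmax(np.diff(ordered)))
    return float((ordered[i] + ordered[i + 1]) / 2)


def iterative_detection(X, complexity_schedule=(1, 2), max_flag_share=0.2):
    """Run a sequence of models; after each one, remove the detected anomalies."""
    keep = np.arange(len(X))
    flagged = []
    for n_components in complexity_schedule:        # model complexity grows
        errors = pca_reconstruction_error(X[keep], n_components)
        is_anomaly = errors > largest_gap_threshold(errors)
        if is_anomaly.mean() > max_flag_share:      # threshold cuts off a large chunk: stop
            break
        flagged.extend(keep[is_anomaly].tolist())
        keep = keep[~is_anomaly]                    # keep the rest of the data
    return sorted(flagged), keep


# Synthetic records (assumption): good data lies in a plane, one gross error and
# two subtle errors lie off it.
gx, gy = np.meshgrid(np.linspace(-3, 3, 20), np.linspace(-1, 1, 10))
good = np.column_stack([gx.ravel(), gy.ravel(), np.zeros(gx.size)])
gross = np.array([[0.0, 0.0, 10.0]])
subtle = np.array([[1.0, 0.0, 1.5], [-1.0, 0.2, -1.5]])
X = np.vstack([good, gross, subtle])

flagged, remaining = iterative_detection(X)
print(flagged)
```

In this toy setup the first, simplest model isolates only the gross error, because its reconstruction error dwarfs everything else; the subtle errors separate from the good records once the gross error is removed and the model is allowed more capacity.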

SLIDE 23

Demo

Birth date

SLIDE 25

Demo

First name

SLIDE 27

Demo

Revenue

SLIDE 30

Reusability of Pre-Processing and Model Setup

Type of variable | Pre-processing | Model
Character | One-hot encoding of characters | Variational autoencoders with LSTM cells
Categorical | One-hot encoding | Complete autoencoder with regularization
Date | Numerical features from digits, normalization | Complete autoencoder with regularization
Numerical | Normalization | Undercomplete autoencoder with custom loss

Generic pipeline per field type → can be reused for other fields of same type
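The pre-processing column can be sketched as one small function per field type. The slides give no implementation details, so everything below is an assumption made for illustration: the lowercase Latin alphabet, the fixed maximum length, the extra column for other characters, the digit scaling, and z-score normalization. The function names are hypothetical.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed character set; extend as needed


def one_hot_characters(text, max_len=12, alphabet=ALPHABET):
    """Character field: one row per position, one column per character,
    plus a final column for characters outside the alphabet."""
    encoded = np.zeros((max_len, len(alphabet) + 1))
    for pos, ch in enumerate(text.lower()[:max_len]):
        col = alphabet.find(ch)
        encoded[pos, col if col >= 0 else len(alphabet)] = 1.0
    return encoded


def one_hot_categorical(values, categories):
    """Categorical field: one column per known category; unknown values stay all-zero."""
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        if value in index:
            encoded[row, index[value]] = 1.0
    return encoded


def date_digit_features(date_string):
    """Date field: each digit becomes a numerical feature, scaled to [0, 1]."""
    digits = [int(ch) for ch in date_string if ch in "0123456789"]
    return np.array(digits, dtype=float) / 9.0


def normalize(values):
    """Numerical field: zero mean and unit variance (assumes non-constant values)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


name_matrix = one_hot_characters("Anna")
country_matrix = one_hot_categorical(["CH", "DE", "XX"], ["CH", "DE"])
birth_features = date_digit_features("1985-03-14")
revenue_scaled = normalize([1.0, 2.0, 3.0])
```

Each function depends only on the field type, not on the specific field, which is what makes the same pipeline reusable for, say, first name and last name.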

SLIDE 31

Key Findings from Application to Production Data

▪ High reusability: One-time customization effort
▪ Replication: ML can automatically replicate rule-based data quality checks
▪ Extension: Autoencoders can find additional errors
▪ SME feedback necessary: Sanity checks during model building
▪ Multivariate relationships: Detection of interdependencies
▪ Unsupervised learning! Training data quality matters

Training Performance

SLIDE 32

Outlook

Future Endeavors

▪ Model lifecycle: improve models over time from feedback
▪ Detected anomalies are errors: correct or leave out
▪ Automated error correction
▪ Data remediation using RPA
▪ Batch processing: extend error detection to whole batches of data
▪ Detect faulty data sourcing process
▪ Detected anomalies are false positives: increase weight during training
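For the false-positive case, "increase weight during training" could mean a per-record weight in the reconstruction loss. The slides do not say how the weight is applied, so the following is only one possible reading: a weighted mean of per-record squared errors, with hypothetical names and toy values.

```python
import numpy as np


def weighted_reconstruction_loss(X, X_hat, weights):
    """Weighted mean of the per-record mean squared reconstruction errors."""
    per_record = np.mean((X - X_hat) ** 2, axis=1)
    return float(np.sum(weights * per_record) / np.sum(weights))


X = np.array([[0.0, 0.0], [0.0, 0.0]])
X_hat = np.array([[1.0, 1.0], [2.0, 2.0]])   # second record is reconstructed worse

uniform = weighted_reconstruction_loss(X, X_hat, np.array([1.0, 1.0]))
# Suppose feedback confirmed the second record as a false positive: weight it 3x.
boosted = weighted_reconstruction_loss(X, X_hat, np.array([1.0, 3.0]))
print(uniform, boosted)
```

With the higher weight, the poorly reconstructed but valid record contributes more to the loss, so retraining is pushed toward reconstructing it well and away from flagging it again.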