[PPT] - Building a Fault-Tolerant ETL Pipeline for Claims CAF Internship PowerPoint Presentation

SLIDE 1

Building a Fault-Tolerant ETL Pipeline for Claims CAFé

Internship Presentation - Summer 2018 Kim Hammar

Data Analytics Engineer khamq@allstate.com or kimham@kth.se

August 30, 2018

Kim Hammar (DAE) Claims CAFé August 30, 2018 1 / 24

SLIDE 2

Extract Transform Load

Kim Hammar (DAE) Claims CAFé August 30, 2018 2 / 24

SLIDE 3

Extract Transform Load Corrupt data that cannot be parsed Inconsistent Schema Wrong File paths Unexpected null values Duplicates Code bugs Missing values Inconsistent data types Target database unavailable Permission error Scalability problems

Kim Hammar (DAE) Claims CAFé August 30, 2018 2 / 24

SLIDE 4

Outline

1

Claims CAFé Background

2

Ensuring Data Quality

3

DEMO

4

Conclusion

5

Questions

Kim Hammar (DAE) Claims CAFé August 30, 2018 3 / 24

SLIDE 5

What is Claims CAFé?

Claims CAFé: Claims (C)entral (A)nalytical (F)il(e) claimId policy participant persVeh ... B005 auto John Doe URK389 Audi ... B007 home A.Svensson PTO291 Ford ... B003 life C.Strömbäck RNU999 Volvo ... B004 property G.Åsbrink WEM650 Benz ... B002 health L.Löfven KQO209 Tesla ...                Claims A one stop shop for claims analytics. Optimized for analytical use-cases by saving raw denormalized data in hadoop:

Increase scalability Makes data processing easier: No more slow and complex SQL-Joins

Kim Hammar (DAE) Claims CAFé August 30, 2018 4 / 24

SLIDE 6

What Actually Goes Into Building a Model

Data Science & Analytics Data Engineering

Building models, reports & Visualizations ETL processes, data transformations & data cleaning

Kim Hammar (DAE) Claims CAFé August 30, 2018 5 / 24

SLIDE 7

Now:

complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields Predict claim automation: Deep FNN model

b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆ y

Customer report & dashboards Link Analysis Fraud Detection

Customer1 Customer2 Claim1 Claim2 Claimant Participant 3 10 10 10 4 9 4 5 10 11 1

Classify Image Vehicle Loss Deep CNN Model

      repairable = 0 total loss = 1 . . . no loss = 0      

Kim Hammar (DAE) Claims CAFé August 30, 2018 6 / 24

SLIDE 8

Now:

complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields Predict claim automation: Deep FNN model

b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆ y

Customer report & dashboards Link Analysis Fraud Detection

Customer1 Customer2 Claim1 Claim2 Claimant Participant 3 10 10 10 4 9 4 5 10 11 1

Classify Image Vehicle Loss Deep CNN Model

      repairable = 0 total loss = 1 . . . no loss = 0      

Future:

complex and slow join to pull data transform & clean data create derived fields Claims CAFé

b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆ y Customer1 Customer2 Claim1 Claim2 Claimant Participant 3 10 10 10 4 9 4 5 10 11

      repairable = 0 total loss = 1 . . . no loss = 0      

Kim Hammar (DAE) Claims CAFé August 30, 2018 6 / 24

SLIDE 9

What Goes Into Making CAFé

Coffee Plants Harvest Processing: pulping the cherries, filtering out bad ones, sorting&drying Prepare for consumption: roasting, grinding, brewing Consume: drink, sell

Making Coffee

Raw Data Sources ETL X f () Y Processing: fill in null values, create derived fields, data cleaning Prepare for consumption: flatten, create different views, save to Hive

x f(x) Sekante P f(x0) x0 Q x0 + ε f(x0 + ε) − f(x0) ε ε f(x0 + ε)

Consume: build models, generate business reports

Making Claims CAFé

Kim Hammar (DAE) Claims CAFé August 30, 2018 7 / 24

SLIDE 10

Claims CAFé Data Architecture

Data Sources ADW NextGen

. . .

Allstate Canada Claims MDM Claims MDM Synchronizes and reconciles data from various sources into a single place Claims CAFé Claims CAFé Serves the need of data scientists. Contains data optimized for analytics Derived fields Data cleaning Data restructuring Machine Learning

x f(x) Sekante P f(x0) x0 Q x0 + ε f(x0 + ε) − f(x0) ε ε f(x0 + ε)

Analytics

[−4, −2) [−2, 0) [0, 2) [2, 4) 500 1,000 1,500 2,000 2,500

Reports Number of fraudulent claims during the third quarter: X

Kim Hammar (DAE) Claims CAFé August 30, 2018 8 / 24

SLIDE 11

What Is a Claim From A Data Pespective?

Claim

Claim Details
.

. .

Policy Details

.

. .

Participants Details

.

. .

Vehicle Details

.

. .

...

.

. .

Figure: Some of the data that a single Claim in Claims CAFé contains.

Kim Hammar (DAE) Claims CAFé August 30, 2018 9 / 24

SLIDE 12

Data Quality Assurance

Data Quality is key to the success of Claims CAFé Examples of data quality issues:

Null values in the wrong place Duplicates Missing values Inconsistent data types . . . Data Quality Issues Data Size

Why does my Model not work? John Doe, Data Scientist

Kim Hammar (DAE) Claims CAFé August 30, 2018 10 / 24

SLIDE 13

Data Quality Mechanisms Motivation

Failures are the norm, they are not an exception

Optimistic error estimate: P("CAFé error") = P("Data quality issue")∪P("Hadoop failure")∪P("Network failure")∪ P("ClaimsMDM failure") ∪ P("CAFé code bug") ≈ 1/1000 = 0.001 CAFé stretch goal: near-real-time updates, say we pull data every 15 minutes = ⇒ 96 pulls per day = ⇒ 672 pulls per week failure probability per week: 0.001 × 672 = 0.672 = ⇒ a failure will happen on average every other week We want built-in mechanisms in the CAFé pipeline to detect and deal with errors before they affect end-users: Tests!

Kim Hammar (DAE) Claims CAFé August 30, 2018 11 / 24

SLIDE 14

How to detect data quality issues? Know your data

. . .

Claim 1 . . . Claim 2 . . . Claim 3 . . . Claim 4 . . . Claim 5 . . . How many claims were recorded in Rhode Island 2016-2017? How many null claims exists in Idaho? What is the average number of claim participants? What is the average number of covered claimants in a claim? What is the maximum number of vehicles re- lated to a claim? Kim Hammar (DAE) Claims CAFé August 30, 2018 12 / 24

SLIDE 15

Anomaly Detection for Detecting Potential Data Issues (1/3)

Spikes in the number of null values indicate a data issue.

06/18/18 06/18/18 06/19/18 06/19/18 06/20/18 06/20/18 06/21/18 06/21/18 Date 2000 4000 6000 8000 10000 12000 14000 16000 Null Counts

Indication of a data pull issue

Kim Hammar (DAE) Claims CAFé August 30, 2018 13 / 24

SLIDE 16

Anomaly Detection for Detecting Potential Data Issues (2/3)

04.2017 05.2017 06.2017 07.2017 08.2017 09.2017 10.2017 11.2017 12.2017 01.2018 02.2018 03.2018 6,000 8,000 10,000 12,000 14,000 Date Count Number of claims in Rhode Island over time Why did this drop happen?

Kim Hammar (DAE) Claims CAFé August 30, 2018 14 / 24

SLIDE 17

Anomaly Detection for Detecting Potential Data Issues (3/3)

CoverageClaimants PersVeh ClaimsParty 20 40 2 1 3 1 4 39 Count Comparing a single claim statistics with the average Average cross entire dataset Actual value for this claim 39 Parties in this claim vs average on 3 is this valid data or a bug?

Kim Hammar (DAE) Claims CAFé August 30, 2018 15 / 24

SLIDE 18

Regression Tests Background (1/2)

What is regression testing? Regression tests verify modifications of a program or data. If the modification fail the tests, the program can regress back. I just refactored the CAFé pipeline, how do I know that I didn’t break anything? I just pulled new data into CAFé, how do I know I did not introduce data quality issues? Why do we want regression tests? It increases the confidence in making code and data changes We can detect bugs before they bother end-users We can avoid unnecessary work: If the tests fail we can abort early and save time.

Kim Hammar (DAE) Claims CAFé August 30, 2018 16 / 24

SLIDE 19

Regression Tests Background (2/2)

Naive ETL Pipeline For Updating Claims CAFé:

Load & Transform Claims MDM Claims CAFé

Kim Hammar (DAE) Claims CAFé August 30, 2018 17 / 24

SLIDE 20

Regression Tests Background (2/2)

Naive ETL Pipeline For Updating Claims CAFé:

Load & Transform Claims MDM Claims CAFé

A More Robust ETL Pipeline:

Claims MDM Regression Tests Claims CAFé Tests failed Email maintainers

Missing columns:... Null values:... Anomalies:... . . .

Generate report Abort ETL Load & Transform Tests passed

Kim Hammar (DAE) Claims CAFé August 30, 2018 17 / 24

SLIDE 21

Regression Tests Pipeline for CAFé

Data Sources

Abort Abort Abort Data paths does not exist Invalid schema Data paths exist

CAFé Regression Tests

Verify data paths of sources Verify schema of sources Verify data quality Count null values and compare with snapshot Count columns and compare with snapshot Verdict ¬∃ duplicates ∧ null increase < x ∧ counts increase < y = ⇒ build ∃ duplicates ∧ null increase < x ∧ counts increase < y = ⇒ build & email ∃ duplicates ∧ null increase > x ∧ counts increase > y = ⇒ abort & email Valid schema Email maintainers Save test report Save snapshot Kim Hammar (DAE) Claims CAFé August 30, 2018 18 / 24

SLIDE 22

Regression Tests Reporting

Test Report: 26/7-2018

Passed 79.0% Failed 21.0%

Test Results

valid_data_paths: passed, time: 7s valid_source_schema: passed, time: 19s no_duplicates: failed, time: 26s claim_counts: passed, time: 41s claim_null_counts: passed, time: 38s . . .

c l a i m C

u

n t c l a i m N u l l C

u

n t f i n a n c i a l s C

u

n t f i n a n c i a l s N u l l C

u

n t s u m m a r y C

u

n t s u m m a r y N u l l C

u

n t i d C

u

n t i d N u l l C

u

n t c l a i m O c c u r r e n c e C

u

n t c l a i m O c c u r r e n c e N u l l C

u

n t p

l

i c y C

u

n t p

l

i c y N u l l C

u

n t p e r s V e h C

u

n t p e r s V e h N u l l C

u

n t c l a i m s P a r t y C

u

n t c l a i m s P a r t y N u l l C

u

n t l

c

a t i

n

C

u

n t l

c

a t i

n

N u l l C

u

n t Metric 500 Counts

Snapshot counts

7 / 2 3 / 1 8 8 / 6 / 1 8 8 / 2 / 1 8 9 / 3 / 1 8 9 / 1 7 / 1 8 1 / 1 / 1 8 1 / 1 5 / 1 8 1 / 2 9 / 1 8 Date 20 40 60 Failure Count

Build failures over time

7 / 2 3 / 1 8 8 / 6 / 1 8 8 / 2 / 1 8 9 / 3 / 1 8 9 / 1 7 / 1 8 1 / 1 / 1 8 1 / 1 5 / 1 8 1 / 2 9 / 1 8 Date 20 40 60 80 100 Count

Claims counts over time

claimNullCount claimCount

Kim Hammar (DAE) Claims CAFé August 30, 2018 19 / 24

SLIDE 23

Data Quality Mechanisms: Putting It All Together

Cluster Daily Data Pull Build → Unit tests → Regression Tests Build → Unit tests → Regression Tests Abort Merge Master Branch Update CAFé Push Commit Trigger Jenkins Job Trigger Jenkins Job Kim Hammar (DAE) Claims CAFé August 30, 2018 20 / 24

SLIDE 24

DEMO

Kim Hammar (DAE) Claims CAFé August 30, 2018 21 / 24

SLIDE 25

Conclusion

Fault-Tolerant ETL What An ETL pipeline that can detect and deal with data quality errors and bugs Why Avoid errors in production. Make full use of data. More productive analytics. How Know your data. Use tests. Integrate with CI jobs. Lessons Learned Setting verdicts is hard. Without CI the tests will not be run. Security must be in the design. Release management is important. Trade-offs when choosing architecture.

Kim Hammar (DAE) Claims CAFé August 30, 2018 22 / 24

SLIDE 26

Thank You! Questions?

Kim Hammar (DAE) Claims CAFé August 30, 2018 23 / 24

SLIDE 27

Links

Claims CAFé Schema Repository1 Claims CAFé Schema Confluence Page2 Claims CAFé Data Pipeline and Tests Repository3 Claims CAFé Tests Pipeline Confluence Page4 Claims CAFé Continuous Integration Confluence Page5 Claims CAFé API Documentation Confluence Page6

1https://github.allstate.com/d3-cafe/ClaimsCAFeSchema 2http://conflu.allstate.com/display/DOMF2/Schema 3https://github.allstate.com/d3-cafe/CAFe 4http://conflu.allstate.com/display/DOMF2/Tests 5http://conflu.allstate.com/display/DOMF2/Continuous+Integration 6http://conflu.allstate.com/display/DOMF2/API+Documentation

Kim Hammar (DAE) Claims CAFé August 30, 2018 24 / 24