Building a Fault-Tolerant ETL Pipeline for Claims CAFé
Internship Presentation - Summer 2018 Kim Hammar
Data Analytics Engineer khamq@allstate.com or kimham@kth.se
August 30, 2018
Kim Hammar (DAE) Claims CAFé August 30, 2018 1 / 24
Building a Fault-Tolerant ETL Pipeline for Claims CAF Internship - - PowerPoint PPT Presentation
Building a Fault-Tolerant ETL Pipeline for Claims CAF Internship Presentation - Summer 2018 Kim Hammar Data Analytics Engineer khamq@allstate.com or kimham@kth.se August 30, 2018 Kim Hammar (DAE) Claims CAF August 30, 2018 1 / 24
Internship Presentation - Summer 2018 Kim Hammar
Data Analytics Engineer khamq@allstate.com or kimham@kth.se
August 30, 2018
Kim Hammar (DAE) Claims CAFé August 30, 2018 1 / 24
Extract Transform Load
Kim Hammar (DAE) Claims CAFé August 30, 2018 2 / 24
Extract Transform Load Corrupt data that cannot be parsed Inconsistent Schema Wrong File paths Unexpected null values Duplicates Code bugs Missing values Inconsistent data types Target database unavailable Permission error Scalability problems
Kim Hammar (DAE) Claims CAFé August 30, 2018 2 / 24
1
Claims CAFé Background
2
Ensuring Data Quality
3
DEMO
4
Conclusion
5
Questions
Kim Hammar (DAE) Claims CAFé August 30, 2018 3 / 24
Claims CAFé: Claims (C)entral (A)nalytical (F)il(e) claimId policy participant persVeh ... B005 auto John Doe URK389 Audi ... B007 home A.Svensson PTO291 Ford ... B003 life C.Strömbäck RNU999 Volvo ... B004 property G.Åsbrink WEM650 Benz ... B002 health L.Löfven KQO209 Tesla ... Claims A one stop shop for claims analytics. Optimized for analytical use-cases by saving raw denormalized data in hadoop:
Increase scalability Makes data processing easier: No more slow and complex SQL-Joins
Kim Hammar (DAE) Claims CAFé August 30, 2018 4 / 24
Data Science & Analytics Data Engineering
Building models, reports & Visualizations ETL processes, data transformations & data cleaning
Kim Hammar (DAE) Claims CAFé August 30, 2018 5 / 24
Now:
complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields Predict claim automation: Deep FNN model
b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆ yCustomer report & dashboards Link Analysis Fraud Detection
Customer1 Customer2 Claim1 Claim2 Claimant Participant 3 10 10 10 4 9 4 5 10 11 1Classify Image Vehicle Loss Deep CNN Model
repairable = 0 total loss = 1 . . . no loss = 0
Kim Hammar (DAE) Claims CAFé August 30, 2018 6 / 24
Now:
complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data complex and slow join to pull data transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields transform & clean data create derived fields Predict claim automation: Deep FNN model
b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆ yCustomer report & dashboards Link Analysis Fraud Detection
Customer1 Customer2 Claim1 Claim2 Claimant Participant 3 10 10 10 4 9 4 5 10 11 1Classify Image Vehicle Loss Deep CNN Model
repairable = 0 total loss = 1 . . . no loss = 0
Future:
complex and slow join to pull data transform & clean data create derived fields Claims CAFé
b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆ y Customer1 Customer2 Claim1 Claim2 Claimant Participant 3 10 10 10 4 9 4 5 10 11 repairable = 0 total loss = 1 . . . no loss = 0
Kim Hammar (DAE) Claims CAFé August 30, 2018 6 / 24
Coffee Plants Harvest Processing: pulping the cherries, filtering out bad ones, sorting&drying Prepare for consumption: roasting, grinding, brewing Consume: drink, sell
Making Coffee
Raw Data Sources ETL X f () Y Processing: fill in null values, create derived fields, data cleaning Prepare for consumption: flatten, create different views, save to Hive
x f(x) Sekante P f(x0) x0 Q x0 + ε f(x0 + ε) − f(x0) ε ε f(x0 + ε)Consume: build models, generate business reports
Making Claims CAFé
Kim Hammar (DAE) Claims CAFé August 30, 2018 7 / 24
Data Sources ADW NextGen
Allstate Canada Claims MDM Claims MDM Synchronizes and reconciles data from various sources into a single place Claims CAFé Claims CAFé Serves the need of data scientists. Contains data optimized for analytics Derived fields Data cleaning Data restructuring Machine Learning
x f(x) Sekante P f(x0) x0 Q x0 + ε f(x0 + ε) − f(x0) ε ε f(x0 + ε)
Analytics
[−4, −2) [−2, 0) [0, 2) [2, 4) 500 1,000 1,500 2,000 2,500
Reports Number of fraudulent claims during the third quarter: X
Kim Hammar (DAE) Claims CAFé August 30, 2018 8 / 24
Claim
. .
Policy Details
. .
Participants Details
. .
Vehicle Details
. .
...
. .
Figure: Some of the data that a single Claim in Claims CAFé contains.
Kim Hammar (DAE) Claims CAFé August 30, 2018 9 / 24
Data Quality is key to the success of Claims CAFé Examples of data quality issues:
Null values in the wrong place Duplicates Missing values Inconsistent data types . . . Data Quality Issues Data Size
Why does my Model not work? John Doe, Data Scientist
Kim Hammar (DAE) Claims CAFé August 30, 2018 10 / 24
Failures are the norm, they are not an exception
Optimistic error estimate: P("CAFé error") = P("Data quality issue")∪P("Hadoop failure")∪P("Network failure")∪ P("ClaimsMDM failure") ∪ P("CAFé code bug") ≈ 1/1000 = 0.001 CAFé stretch goal: near-real-time updates, say we pull data every 15 minutes = ⇒ 96 pulls per day = ⇒ 672 pulls per week failure probability per week: 0.001 × 672 = 0.672 = ⇒ a failure will happen on average every other week We want built-in mechanisms in the CAFé pipeline to detect and deal with errors before they affect end-users: Tests!
Kim Hammar (DAE) Claims CAFé August 30, 2018 11 / 24
Claim 1 . . . Claim 2 . . . Claim 3 . . . Claim 4 . . . Claim 5 . . . How many claims were recorded in Rhode Island 2016-2017? How many null claims exists in Idaho? What is the average number of claim par- ticipants? What is the average number of covered claimants in a claim? What is the maximum number of vehicles re- lated to a claim? Kim Hammar (DAE) Claims CAFé August 30, 2018 12 / 24
Spikes in the number of null values indicate a data issue.
06/18/18 06/18/18 06/19/18 06/19/18 06/20/18 06/20/18 06/21/18 06/21/18 Date 2000 4000 6000 8000 10000 12000 14000 16000 Null Counts
Indication of a data pull issue
Kim Hammar (DAE) Claims CAFé August 30, 2018 13 / 24
04.2017 05.2017 06.2017 07.2017 08.2017 09.2017 10.2017 11.2017 12.2017 01.2018 02.2018 03.2018 6,000 8,000 10,000 12,000 14,000 Date Count Number of claims in Rhode Island over time Why did this drop happen?
Kim Hammar (DAE) Claims CAFé August 30, 2018 14 / 24
CoverageClaimants PersVeh ClaimsParty 20 40 2 1 3 1 4 39 Count Comparing a single claim statistics with the average Average cross entire dataset Actual value for this claim 39 Parties in this claim vs average on 3 is this valid data or a bug?
Kim Hammar (DAE) Claims CAFé August 30, 2018 15 / 24
What is regression testing? Regression tests verify modifications of a program or data. If the modification fail the tests, the program can regress back. I just refactored the CAFé pipeline, how do I know that I didn’t break anything? I just pulled new data into CAFé, how do I know I did not introduce data quality issues? Why do we want regression tests? It increases the confidence in making code and data changes We can detect bugs before they bother end-users We can avoid unnecessary work: If the tests fail we can abort early and save time.
Kim Hammar (DAE) Claims CAFé August 30, 2018 16 / 24
Naive ETL Pipeline For Updating Claims CAFé:
Load & Transform Claims MDM Claims CAFé
Kim Hammar (DAE) Claims CAFé August 30, 2018 17 / 24
Naive ETL Pipeline For Updating Claims CAFé:
Load & Transform Claims MDM Claims CAFé
A More Robust ETL Pipeline:
Claims MDM Regression Tests Claims CAFé Tests failed Email maintainers
Missing columns:... Null values:... Anomalies:... . . .
Generate report Abort ETL Load & Transform Tests passed
Kim Hammar (DAE) Claims CAFé August 30, 2018 17 / 24
Data Sources
Abort Abort Abort Data paths does not exist Invalid schema Data paths exist
CAFé Regression Tests
Verify data paths of sources Verify schema of sources Verify data quality Count null values and compare with snapshot Count columns and compare with snapshot Verdict ¬∃ duplicates ∧ null increase < x ∧ counts increase < y = ⇒ build ∃ duplicates ∧ null increase < x ∧ counts increase < y = ⇒ build & email ∃ duplicates ∧ null increase > x ∧ counts increase > y = ⇒ abort & email Valid schema Email maintainers Save test report Save snapshot Kim Hammar (DAE) Claims CAFé August 30, 2018 18 / 24
Test Report: 26/7-2018
Passed 79.0% Failed 21.0%
Test Results
valid_data_paths: passed, time: 7s valid_source_schema: passed, time: 19s no_duplicates: failed, time: 26s claim_counts: passed, time: 41s claim_null_counts: passed, time: 38s . . .
c l a i m C
n t c l a i m N u l l C
n t f i n a n c i a l s C
n t f i n a n c i a l s N u l l C
n t s u m m a r y C
n t s u m m a r y N u l l C
n t i d C
n t i d N u l l C
n t c l a i m O c c u r r e n c e C
n t c l a i m O c c u r r e n c e N u l l C
n t p
i c y C
n t p
i c y N u l l C
n t p e r s V e h C
n t p e r s V e h N u l l C
n t c l a i m s P a r t y C
n t c l a i m s P a r t y N u l l C
n t l
a t i
C
n t l
a t i
N u l l C
n t Metric 500 Counts
Snapshot counts
7 / 2 3 / 1 8 8 / 6 / 1 8 8 / 2 / 1 8 9 / 3 / 1 8 9 / 1 7 / 1 8 1 / 1 / 1 8 1 / 1 5 / 1 8 1 / 2 9 / 1 8 Date 20 40 60 Failure Count
Build failures over time
7 / 2 3 / 1 8 8 / 6 / 1 8 8 / 2 / 1 8 9 / 3 / 1 8 9 / 1 7 / 1 8 1 / 1 / 1 8 1 / 1 5 / 1 8 1 / 2 9 / 1 8 Date 20 40 60 80 100 Count
Claims counts over time
claimNullCount claimCount
Kim Hammar (DAE) Claims CAFé August 30, 2018 19 / 24
Cluster Daily Data Pull Build → Unit tests → Regression Tests Build → Unit tests → Regression Tests Abort Merge Master Branch Update CAFé Push Commit Trigger Jenkins Job Trigger Jenkins Job Kim Hammar (DAE) Claims CAFé August 30, 2018 20 / 24
Kim Hammar (DAE) Claims CAFé August 30, 2018 21 / 24
Fault-Tolerant ETL What An ETL pipeline that can detect and deal with data quality errors and bugs Why Avoid errors in production. Make full use of data. More productive analytics. How Know your data. Use tests. Integrate with CI jobs. Lessons Learned Setting verdicts is hard. Without CI the tests will not be run. Security must be in the design. Release management is important. Trade-offs when choosing architecture.
Kim Hammar (DAE) Claims CAFé August 30, 2018 22 / 24
Kim Hammar (DAE) Claims CAFé August 30, 2018 23 / 24
Claims CAFé Schema Repository1 Claims CAFé Schema Confluence Page2 Claims CAFé Data Pipeline and Tests Repository3 Claims CAFé Tests Pipeline Confluence Page4 Claims CAFé Continuous Integration Confluence Page5 Claims CAFé API Documentation Confluence Page6
1https://github.allstate.com/d3-cafe/ClaimsCAFeSchema 2http://conflu.allstate.com/display/DOMF2/Schema 3https://github.allstate.com/d3-cafe/CAFe 4http://conflu.allstate.com/display/DOMF2/Tests 5http://conflu.allstate.com/display/DOMF2/Continuous+Integration 6http://conflu.allstate.com/display/DOMF2/API+Documentation
Kim Hammar (DAE) Claims CAFé August 30, 2018 24 / 24