
Circuit Breakers to safeguard for Garbage in, Garbage out



  1. Circuit Breakers to safeguard for Garbage in, Garbage out Sandeep Uttamchandani Chief Data Architect & Head of Data Platform Engineering, Small Business & Self Employed Group Intuit

  2. (no text on slide)

  3. Real? OR Data quality issue?

  4. (no text on slide)

  5. Who we serve: Consumers, Small Businesses, Self-Employed

  6. Our mission: Powering Prosperity Around the World

  7. Data Pipeline: Physical View
      Data Sources (User Entered Data, In-product Clickstream, Business, CRM, Operations, Social) → Collect → Store → Analyze → Serve → Insights

  8. Data Pipeline: Physical View
      Data Sources (User Entered Data, In-product Clickstream, Business, CRM, Operations, Social) → Collect → Store ↔ Analyze → Serve → Insights

  9. Data Pipeline: Logical View
      (diagram of a Data Pipeline: Tables feed Jobs, which produce further Tables and a final Table/Dashboard)

  10. Data Pipeline: Logical View Example

  11. Key Reasons for Data Quality Issues
      Data Source Issues
      ● Table inconsistencies
        a. Illegitimate values
        b. Missing values
        c. Duplicate Primary keys
      ● Hard deletes
      ● Bulk inserts
      ● Missing updates to CDC column
      Data Ingestion Issues
      ● Uncoordinated upstream changes
        a. Volume of data
        b. Change in schema
        c. Change in meaning of data
        d. Upgrade of platform
      ● No CDC for large tables leading to delayed availability
      ● Errors in ETL logic
      ● Timezone inconsistencies
      ● Duplicate or null records due to ingestion errors
      Referential Integrity Issues
      ● Data elements have different data types and/or meaning in different sources
      ● Inconsistent data element enums
      ● Heuristic ID Correlation
      ● Uncoordinated schema changes
      ● Dropped updates across data sources
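
The three table inconsistencies named on this slide (illegitimate values, missing values, duplicate primary keys) can be checked mechanically. The following is a minimal sketch, assuming rows arrive as Python dictionaries; the function name `table_inconsistencies` and its parameters are hypothetical and do not come from the talk.

```python
from collections import Counter


def table_inconsistencies(rows, primary_key, allowed_values):
    """Count duplicate keys, missing values and illegitimate values.

    rows           -- list of dicts, one per record (assumed input shape)
    primary_key    -- name of the key column
    allowed_values -- {column: set of legitimate values} (assumed rule format)
    """
    # A primary key that occurs more than once is a duplicate.
    key_counts = Counter(row.get(primary_key) for row in rows)
    duplicate_keys = sorted(
        key for key, count in key_counts.items()
        if key is not None and count > 1
    )

    missing = Counter()
    illegitimate = Counter()
    for row in rows:
        for column, allowed in allowed_values.items():
            value = row.get(column)
            if value is None:
                missing[column] += 1        # missing value
            elif value not in allowed:
                illegitimate[column] += 1   # value outside the legal domain

    return {
        "duplicate_keys": duplicate_keys,
        "missing": dict(missing),
        "illegitimate": dict(illegitimate),
    }
```

A report with any non-empty entry is the kind of signal that the later "Control the Circuit" slides turn into alerts.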

  12. Circuit Breakers - Avoiding Electric Fires (BEFORE vs. AFTER)

  13. Circuit Breakers - Avoiding Services with High Response Time
      Diagram: Client → API Gateway → Orders, Invoices, Offers
      ● BEFORE: Response Time
      ● AFTER: Response Time; certain services unavailable
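
For readers unfamiliar with the service-side pattern this slide borrows from, here is a minimal sketch of a circuit breaker that fails fast after repeated errors instead of letting callers wait on a slow service. The class name, the `failure_threshold` parameter and the manual `reset` are illustrative assumptions; production libraries usually add timeouts and a half-open state.

```python
class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class ServiceCircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        # The circuit opens once enough consecutive calls have failed.
        return self.consecutive_failures >= self.failure_threshold

    def call(self, func, *args, **kwargs):
        if self.is_open:
            # Fail fast: the service is reported unavailable and never invoked.
            raise CircuitOpenError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0  # a success resets the failure count
        return result

    def reset(self):
        """Close the circuit again, e.g. after the service has recovered."""
        self.consecutive_failures = 0
```

The trade-off matches the slide: response time stays bounded because calls are rejected immediately, at the cost of some services being unavailable while the circuit is open.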

  14. Circuit Breaker - Avoiding Insights with Data Quality Issues
      Diagram: pipeline of Tables and Jobs
      ● BEFORE: Time-to-Reliable-Insights UNBOUNDED
      ● AFTER: Team alerted to diagnose & backfill (if possible); Time-to-Reliable-Insights BOUNDED

  15. Circuit Breaker - Avoiding Insights with Data Quality Issues
      Answering the question: Is it Real? OR a Data quality issue?
      ● BEFORE: Hours/Days/Weeks; Time-to-Reliable-Insights UNBOUNDED
      ● AFTER: Minutes; Time-to-Reliable-Insights BOUNDED

  16. Implementing Circuit Breakers for Data Pipelines
      State diagram labels: Circuit Closed, Circuit Open, Data Quality passes, Data Quality fails, Results w/ Warning, Data Quality Alert, Soft Alerts, Hard Alerts
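
One possible reading of this state diagram, sketched in code: the circuit stays closed while checks pass or only raise warnings (soft alerts), opens on a hard alert, and closes again once data quality passes. The class `DataCircuitBreaker`, its method names and the "soft"/"hard" severity strings are assumptions made for illustration, not the speaker's implementation.

```python
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"  # data flows to downstream tables and dashboards
    OPEN = "open"      # data is held back until the issue is resolved


class DataCircuitBreaker:
    def __init__(self):
        self.state = CircuitState.CLOSED
        self.warnings = []

    def record_alert(self, severity, message):
        """Register a data quality alert of severity 'soft' or 'hard'."""
        if severity == "hard":
            self.state = CircuitState.OPEN   # stop the pipeline
        elif severity == "soft":
            self.warnings.append(message)    # warn, but keep data flowing
        else:
            raise ValueError(f"unknown severity: {severity!r}")

    def record_pass(self):
        """A passing check (for example after a backfill) closes the circuit."""
        self.state = CircuitState.CLOSED

    def publish(self, batch, sink):
        """Append batch to sink only while the circuit is closed."""
        if self.state is CircuitState.OPEN:
            return False
        sink.extend(batch)
        return True
```

Holding a batch back rather than dropping it is what allows the team to diagnose and backfill before downstream consumers see the data.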

  17. Implementing Circuit Breakers for Data Pipelines
      1. Track Lineage
      2. Profile Pipeline
      3. Control the Circuit

  18. 1. Track Lineage
      (diagram of a Data Pipeline: Tables feed Jobs, which produce further Tables and a final Table/Dashboard)

  19. 1. Track Lineage
      Pipeline → Job → Script → Executable → Query → <Input Tables, Output Tables> (several of each at every level)

  20. 1. Track Lineage
      Data Pipeline = <Job, Input, Output>, <Job, Input, Output>, ..., <Job, Input, Output>
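
Once a pipeline is reduced to <Job, Input, Output> triples, finding everything affected by a bad table is a graph traversal. The sketch below assumes each triple is a Python tuple `(job, input_tables, output_tables)`; the function name `downstream_tables` and the table names in the usage note are hypothetical.

```python
from collections import deque


def downstream_tables(lineage, table):
    """Return every table that directly or transitively depends on `table`.

    lineage -- iterable of (job, input_tables, output_tables) triples
    """
    # Map each input table to the tables produced from it.
    produced_from = {}
    for _job, inputs, outputs in lineage:
        for source in inputs:
            produced_from.setdefault(source, set()).update(outputs)

    # Breadth-first walk over the table-to-table edges.
    affected = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for target in produced_from.get(current, ()):
            if target not in affected:
                affected.add(target)
                queue.append(target)
    return affected
```

For a lineage of `("job_a", ["raw_orders"], ["clean_orders"])` followed by `("job_b", ["clean_orders", "customers"], ["revenue_dashboard"])`, a problem in `raw_orders` reaches both `clean_orders` and `revenue_dashboard`, which identifies where a circuit would need to open.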

  21. 2. Profile Pipeline
      Operational Profiling (Platform Engineers)
      ● Job Health
      ● Data Fabric Health
      Data Profiling (Data Engineers)
      ● Single Column
      ● Multi Column
      ● Cross-DB Dependencies

  22. 2. Profile Pipeline - Job Health Example

  23. 2. Profile Pipeline - Data Fabric Health
      Data Sources (User Entered Data, In-product Clickstream, Business, CRM, Operations, Social) → Collect → Store ↔ Analyze → Serve → Insights
      Monitoring & Logging API

  24. 2. Profile Pipeline - Data Profiling
      Single Column
      ● Cardinalities
      ● Patterns & Data types
      ● Value distributions
      ● Domain classification
      Multi-Column
      ● Correlations
      ● Association rules
      ● Clustering
      ● Outliers
      ● Summaries & sketches
      Cross-DB dependencies
      ● Unique column combinations
      ● Inclusion dependencies
      ● Functional dependencies
      Reference: Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: a survey. The VLDB Journal 24, 4 (August 2015), 557-581.

  25. 2. Profile Pipeline - Data Profiling
      Single Column
      ● Cardinalities
      ● Patterns & Data types
      ● Value distributions
      ● Domain classification
      Reference: Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: a survey. The VLDB Journal 24, 4 (August 2015), 557-581.

  26. 2. Profile Pipeline - Data Profiling Example
      Single Column
      ● Cardinalities
      ● Patterns & Data types
      ● Value distributions
      ● Domain classification
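
As a rough illustration of single-column profiling, the sketch below computes cardinality, data types, character patterns and the value distribution for one column. Domain classification is left out. The function name `profile_column` and the pattern encoding (letters become "A", digits become "9") are assumptions chosen for this example.

```python
import re
from collections import Counter


def value_pattern(value):
    """Encode a value's shape: letters become 'A', digits become '9'."""
    text = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"\d", "9", text)


def profile_column(values):
    """Return a basic single-column profile for a list of values."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "cardinality": len(set(non_null)),  # number of distinct values
        "data_types": dict(Counter(type(v).__name__ for v in non_null)),
        "patterns": dict(Counter(value_pattern(v) for v in non_null)),
        "value_distribution": dict(Counter(non_null)),
    }
```

Profiles like this, stored per pipeline run, provide the history that later runs are compared against.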

  27. 3. Control the Circuit
      Detecting Issues
      ● Absolute Threshold Rules: If current_state >= threshold_value Then <Alert>
      ● History-based Anomaly Detection: If current_state <differs> history_state Then <Alert>
      Example
      ● Operational Profiling: If job_retries > 3, then fail job
      ● Data Profiling: If col_val > 2 𝝉, then alert
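
Both detection styles on this slide fit in a few lines. The sketch below is illustrative only: the function names are hypothetical, and treating "differs from history" as "more than k sample standard deviations from the historical mean" is an assumption, not something the slide specifies.

```python
from statistics import mean, stdev


def breaches_threshold(current, threshold):
    """Absolute threshold rule: alert when current >= threshold."""
    return current >= threshold


def is_anomalous(current, history, k=2.0):
    """History-based rule (assumed form): alert when `current` lies more
    than k sample standard deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to judge
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != center  # any change from a constant history
    return abs(current - center) > k * spread
```

With integer retry counts, the slide's operational example (`job_retries > 3`) corresponds to `breaches_threshold(job_retries, 4)`.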

  28. 3. Control the Circuit
      Soft Alerts (Low Confidence)
      ● Operational Profiling
        - Anomaly in Job Runtime
        - Anomaly in Job Start time
        - Multiple retry errors in ingestion job
        - ...
      ● Data Profiling
        - Illegal column values
        - Change in column value distribution
        - ...
      Hard Alerts (High Confidence)
      ● Operational Profiling
        - Job Failed event
        - Access denied log events
        - Read/Write errors
        - ...
      ● Data Profiling
        - Schema mismatch with source table
        - Aggregate mismatch for Cross-DB dependencies
        - ...
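
The soft/hard split on this slide can be expressed as a lookup that decides whether the circuit opens. This is a minimal sketch: the event identifiers are hypothetical names for the slide's examples, and the rule that any hard alert opens the circuit while soft alerts only warn is one plausible reading of the deck.

```python
# Hypothetical identifiers for the events listed on the slide.
SOFT_EVENTS = {
    "job_runtime_anomaly",
    "job_start_time_anomaly",
    "ingestion_retry_errors",
    "illegal_column_values",
    "value_distribution_change",
}
HARD_EVENTS = {
    "job_failed",
    "access_denied",
    "read_write_error",
    "schema_mismatch",
    "cross_db_aggregate_mismatch",
}


def control_circuit(events):
    """Classify observed events and decide the circuit state."""
    hard = [event for event in events if event in HARD_EVENTS]
    soft = [event for event in events if event in SOFT_EVENTS]
    unknown = [
        event for event in events
        if event not in HARD_EVENTS and event not in SOFT_EVENTS
    ]
    return {
        # Assumed policy: only high-confidence (hard) alerts open the circuit.
        "circuit": "open" if hard else "closed",
        "hard_alerts": hard,
        "soft_alerts": soft,
        "unclassified": unknown,
    }
```

Keeping unclassified events separate avoids silently treating an unrecognized signal as either safe or fatal.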

  29. Summary: Trade-off Between Data Quality & Availability
      Time-to-Reliable-Insights: UNBOUNDED
      ● Operational Profiling: Hard Events CLOSED, Soft Events CLOSED
      ● Data Profiling: Hard Events CLOSED, Soft Events CLOSED
      Other labels: Data Availability; Data Quality

  30. Summary: Trade-off Between Data Quality & Availability
      Time-to-Reliable-Insights: BOUNDED
      ● Operational Profiling: Hard Events CLOSED, Soft Events CLOSED
      ● Data Profiling: Hard Events CLOSED, Soft Events CLOSED
      Other labels: Availability ∝ Data Quality; Availability; Data Quality

  31. The Rockstar Team driving this work! We are hiring! Come join us! sandeep_uttamchandani@intuit.com
