Circuit Breakers to safeguard for Garbage in, Garbage out
Sandeep Uttamchandani
Chief Data Architect & Head of Data Platform Engineering, Small Business & Self Employed Group, Intuit
2
3
Real? OR Data quality issue?
4
5
Who we serve
Consumers, Small Businesses, Self-Employed
6
Our mission
Powering Prosperity Around the World
7
Data Pipeline: Physical View
Data Sources: User Entered Data, Clickstream, CRM, Social, In-product, Business Operations
Stages: Collect → Store → Analyze → Serve
Output: Insights
8
Data Pipeline: Physical View
Data Sources: User Entered Data, Clickstream, CRM, Social, In-product, Business Operations
Stages: Collect → Store ↔ Analyze → Serve
Output: Insights
9
Data Pipeline: Logical View
[Figure: a data pipeline drawn as jobs reading and writing tables, ending in a table/dashboard]
10
Data Pipeline: Logical View Example
11
Key Reasons for Data Quality Issues
Data Source Issues
- Table inconsistencies
  a. Illegitimate values
  b. Missing values
  c. Duplicate primary keys
- Hard deletes
- Bulk inserts
- Missing updates to CDC column
- Uncoordinated upstream changes
  a. Volume of data
  b. Change in schema
  c. Change in meaning of data
  d. Upgrade of platform
Data Ingestion Issues
- No CDC for large tables, leading to delayed availability
- Errors in ETL logic
- Timezone inconsistencies
- Duplicate or null records due to ingestion errors
Referential Integrity Issues
- Data elements have different data types and/or meaning in different sources
- Inconsistent data element enums
- Heuristic ID correlation
- Uncoordinated schema changes
- Dropped updates across data sources
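The table inconsistencies listed above (illegitimate values, missing values, duplicate primary keys) lend themselves to mechanical checks. The following is a minimal Python sketch, not the speaker's implementation: it assumes rows arrive as dictionaries, and the function name `check_table`, the `invoice_id`/`status`/`amount` columns, and the allowed-value set are all hypothetical.

```python
from collections import Counter

def check_table(rows, primary_key, allowed_values):
    """Return issue name -> indexes of offending rows (hypothetical helper)."""
    issues = {
        "missing_values": [],
        "illegitimate_values": [],
        "duplicate_primary_keys": [],
    }
    key_counts = Counter(row.get(primary_key) for row in rows)
    for i, row in enumerate(rows):
        # Missing value: any column that is None or an empty string.
        if any(v is None or v == "" for v in row.values()):
            issues["missing_values"].append(i)
        # Illegitimate value: a present value outside the column's allowed set.
        for col, allowed in allowed_values.items():
            v = row.get(col)
            if v not in (None, "") and v not in allowed:
                issues["illegitimate_values"].append(i)
                break
        # Duplicate primary key: the same key appears on more than one row.
        if key_counts[row.get(primary_key)] > 1:
            issues["duplicate_primary_keys"].append(i)
    return issues

# Made-up sample data for illustration only.
rows = [
    {"invoice_id": 1, "status": "PAID", "amount": 120.0},
    {"invoice_id": 2, "status": "paid?", "amount": 80.0},  # illegitimate status
    {"invoice_id": 2, "status": "OPEN", "amount": None},   # duplicate key, missing amount
]
report = check_table(rows, "invoice_id", {"status": {"PAID", "OPEN"}})
print(report)
```

Reporting row indexes rather than a single pass/fail flag keeps the output usable for the diagnose-and-backfill step.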
12
Circuit Breakers - Avoiding Electric Fires
BEFORE AFTER
13
Circuit Breakers - Avoiding Services with High Response Time
[Figure, BEFORE and AFTER panels: Client → API Gateway → Orders, Invoices, Offers services, each panel with a Response Time indicator]
AFTER: Certain services unavailable
14
Circuit Breaker - Avoiding Insights with Data Quality Issues
[Figure, BEFORE and AFTER panels: a pipeline of jobs and tables]
BEFORE: Time-to-Reliable-Insights UNBOUNDED
AFTER: Time-to-Reliable-Insights BOUNDED; team alerted to diagnose & backfill (if possible)
15
Circuit Breaker - Avoiding Insights with Data Quality Issues
Answering the question: Is it Real? OR a Data quality issue?
BEFORE: Hours/Days/Weeks (Time-to-Reliable-Insights UNBOUNDED)
AFTER: Minutes (Time-to-Reliable-Insights BOUNDED)
16
Implementing Circuit Breakers for Data Pipelines
[Figure labels: Circuit Closed (Data Quality passes); Circuit Open (Data Quality fails); Data Quality Alert: Soft Alerts (Results w/ Warning), Hard Alerts]
17
Implementing Circuit Breakers for Data Pipelines
1. Track Lineage
2. Profile Pipeline
3. Control the Circuit
18
1. Track Lineage
[Figure: a data pipeline drawn as jobs reading and writing tables, ending in a table/dashboard]
19
1. Track Lineage
Pipeline → Job, Job, ...
Job → Script, Executable, Query, ...
Each Script / Executable / Query: <Input Tables, Output Tables>
20
1. Track Lineage
Data Pipeline:
<Job, Input, Output>
<Job, Input, Output>
...
<Job, Input, Output>
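With lineage recorded as <Job, Input, Output> records, working out which tables a bad table can contaminate, and which jobs would have to be held back, becomes a graph traversal. The sketch below illustrates that idea in Python under stated assumptions: the job and table names are made up, and `downstream_tables` and `jobs_to_halt` are hypothetical helpers, not part of any tool named in the talk.

```python
from collections import defaultdict, deque

# Hypothetical lineage records in the <Job, Input, Output> shape.
LINEAGE = [
    ("ingest_orders", ["src.orders"], ["raw.orders"]),
    ("clean_orders", ["raw.orders"], ["stg.orders"]),
    ("daily_revenue", ["stg.orders", "stg.customers"], ["mart.revenue"]),
]

def downstream_tables(lineage, start_table):
    """Return every table derived, directly or transitively, from start_table."""
    derived = defaultdict(list)  # input table -> tables written by jobs reading it
    for _job, inputs, outputs in lineage:
        for table in inputs:
            derived[table].extend(outputs)
    seen, queue = set(), deque([start_table])
    while queue:  # breadth-first walk over the table graph
        for out in derived[queue.popleft()]:
            if out not in seen:
                seen.add(out)
                queue.append(out)
    return seen

def jobs_to_halt(lineage, bad_table):
    """Return jobs that read the bad table or anything derived from it."""
    tainted = downstream_tables(lineage, bad_table) | {bad_table}
    return [job for job, inputs, _outputs in lineage if tainted & set(inputs)]

print(sorted(downstream_tables(LINEAGE, "raw.orders")))
print(jobs_to_halt(LINEAGE, "raw.orders"))
```

Here a problem in `raw.orders` taints `stg.orders` and `mart.revenue`, so `clean_orders` and `daily_revenue` are the jobs to hold back.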
21
2. Profile Pipeline
Operational Profiling (Platform Engineers)
- Job Health
- Data Fabric Health
Data Profiling (Data Engineers)
- Single Column
- Multi Column
- Cross-DB Dependencies
22
2. Profile Pipeline - Job Health Example
23
2. Profile Pipeline - Data Fabric Health
[Figure: the physical pipeline view (Data Sources → Collect → Store ↔ Analyze → Serve → Insights), annotated with a Monitoring & Logging API]
24
2. Profile Pipeline - Data Profiling
Single Column
- Cardinalities
- Patterns & Data types
- Value distributions
- Domain classification
Multi-Column
- Correlations
- Association rules
- Clustering
- Outliers
- Summaries & sketches
Cross-DB dependencies
- Unique column combinations
- Inclusion dependencies
- Functional dependencies
Reference: Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: a survey. The VLDB Journal 24, 4 (August 2015), 557-581.
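Three of the dependency checks named above (unique column combinations, functional dependencies, inclusion dependencies) can be illustrated on in-memory rows. This is a brute-force Python sketch under stated assumptions, not one of the discovery algorithms from the cited survey; the `customers` and `invoices` tables and all function names are hypothetical.

```python
def is_unique_combination(rows, cols):
    """True if no two rows share the same values for cols."""
    keys = [tuple(row[c] for c in cols) for row in rows]
    return len(keys) == len(set(keys))

def holds_functional_dependency(rows, lhs, rhs):
    """True if each value of column lhs maps to exactly one value of rhs."""
    seen = {}
    for row in rows:
        # setdefault stores the first rhs seen for this lhs and returns it.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

def inclusion_violations(child_rows, child_col, parent_rows, parent_col):
    """Return child values with no match in the parent column."""
    parent_values = {row[parent_col] for row in parent_rows}
    return sorted({row[child_col] for row in child_rows} - parent_values)

# Made-up sample tables for illustration only.
customers = [
    {"customer_id": 1, "zip": "94043", "state": "CA"},
    {"customer_id": 2, "zip": "10001", "state": "NY"},
    {"customer_id": 3, "zip": "94043", "state": "CA"},
]
invoices = [
    {"invoice_id": 10, "customer_id": 1},
    {"invoice_id": 11, "customer_id": 4},  # no customer with id 4
]

print(is_unique_combination(customers, ["customer_id"]))
print(holds_functional_dependency(customers, "zip", "state"))
print(inclusion_violations(invoices, "customer_id", customers, "customer_id"))
```

In a cross-database setting the parent and child rows would come from different stores; the comparison itself stays the same.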
25
2. Profile Pipeline - Data Profiling
Single Column
- Cardinalities
- Patterns & Data types
- Value distributions
- Domain classification
Reference: Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: a survey. The VLDB Journal 24, 4 (August 2015), 557-581.
26
2. Profile Pipeline - Data Profiling Example
Single Column
- Cardinalities
- Patterns & Data types
- Value distributions
- Domain classification
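A single-column profile covering cardinalities, patterns and data types, and value distributions might look like the following Python sketch. It is an illustration under stated assumptions rather than the profiler used in the talk: `profile_column` and `value_pattern` are hypothetical names, the pattern encoding (digit → 9, letter → A) is one common convention, the sample ZIP codes are made up, and domain classification is not attempted.

```python
import re
from collections import Counter

def value_pattern(value):
    """Generalize a value: every digit becomes 9, every ASCII letter becomes A."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)

def profile_column(values, top_n=3):
    """Compute a small single-column profile; None is treated as null."""
    non_null = [v for v in values if v is not None]
    return {
        # Cardinalities
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        # Patterns & data types
        "data_types": Counter(type(v).__name__ for v in non_null),
        "patterns": Counter(value_pattern(v) for v in non_null),
        # Value distribution (most frequent values only)
        "top_values": Counter(non_null).most_common(top_n),
    }

zip_codes = ["94043", "10001", "1000A", None, "94043"]  # made-up sample
profile = profile_column(zip_codes)
print(profile)
```

A stray pattern such as `9999A` among `99999` values is the kind of signal that can feed an "illegal column values" alert.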
27
3. Control the Circuit
Detecting Issues
Absolute Threshold Rules
  If current_state >= threshold_value Then <Alert>
History-based Anomaly Detection
  If current_state <differs> history_state Then <Alert>
Example
- Operational Profiling: If job_retries > 3, then fail job
- Data Profiling: If col_val > 2𝝉, then alert
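Both detection styles above reduce to small predicates. The Python sketch below is illustrative only: it reads the "2𝝉" in the slide as "two standard deviations from the historical mean", which is an assumption, and the function names and the sample history are hypothetical.

```python
from statistics import mean, stdev

def threshold_alert(current_state, threshold_value):
    """Absolute threshold rule: alert when current_state >= threshold_value."""
    return current_state >= threshold_value

def anomaly_alert(current_state, history, num_deviations=2.0):
    """History-based rule: alert when the current value is far from the
    historical mean (assumed reading: more than num_deviations std devs)."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current_state != mu  # constant history: any change is unusual
    return abs(current_state - mu) > num_deviations * sigma

# "If job_retries > 3" equals "job_retries >= 4" for integer retry counts.
fail_job = threshold_alert(current_state=4, threshold_value=4)

daily_row_counts = [100, 102, 98, 101, 99]  # made-up history
alert_high = anomaly_alert(110, daily_row_counts)
alert_normal = anomaly_alert(101, daily_row_counts)
print(fail_job, alert_high, alert_normal)
```

The threshold rule needs no history and suits operational signals such as retry counts. The history-based rule adapts to each column but depends on a representative baseline.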
28
3. Control the Circuit
Operational Profiling
  Soft Alerts (Low Confidence)
  - Anomaly in Job Runtime
  - Anomaly in Job Start time
  - Multiple retry errors in ingestion job
  - ...
  Hard Alerts (High Confidence)
  - Job Failed event
  - Access denied log events
  - Read/Write errors
  - ...
Data Profiling
  Soft Alerts (Low Confidence)
  - Illegal column values
  - Change in column value distribution
  - ...
  Hard Alerts (High Confidence)
  - Schema mismatch with source table
  - Aggregate mismatch for Cross-DB dependencies
  - ...
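Controlling the circuit then comes down to mapping each alert to a severity and choosing a state: hard alerts open the circuit, soft alerts leave it closed and attach a warning to the results. The Python sketch below shows one way to express this. The alert identifiers, the assignment of each alert to soft or hard, and the choice to treat unknown alerts as soft are assumptions made for illustration, not details confirmed by the slides.

```python
from enum import Enum

class Severity(Enum):
    SOFT = "soft"  # low confidence: let data flow, attach a warning
    HARD = "hard"  # high confidence: stop data from flowing downstream

class Circuit(Enum):
    CLOSED = "closed"  # data flows to downstream jobs
    OPEN = "open"      # downstream jobs are held back

# Hypothetical alert identifiers and severity assignments.
ALERT_SEVERITY = {
    "job_failed": Severity.HARD,
    "access_denied": Severity.HARD,
    "read_write_error": Severity.HARD,
    "schema_mismatch": Severity.HARD,
    "aggregate_mismatch": Severity.HARD,
    "job_runtime_anomaly": Severity.SOFT,
    "job_start_time_anomaly": Severity.SOFT,
    "ingestion_retry_errors": Severity.SOFT,
    "illegal_column_values": Severity.SOFT,
    "value_distribution_change": Severity.SOFT,
}

def control_circuit(alerts):
    """Return (circuit_state, alerts_to_report) for one pipeline run."""
    hard = [a for a in alerts
            if ALERT_SEVERITY.get(a, Severity.SOFT) is Severity.HARD]
    if hard:
        return Circuit.OPEN, hard  # alert the team; hold downstream jobs
    return Circuit.CLOSED, list(alerts)  # serve results with warnings, if any

print(control_circuit([]))
print(control_circuit(["job_runtime_anomaly"]))
print(control_circuit(["job_runtime_anomaly", "schema_mismatch"]))
```

Keeping the severity table as data rather than code lets a team move an alert from soft to hard as confidence in it grows.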
29
Summary: Trade-off Between Data Quality & Availability
[Figure: Data Quality vs. Data Availability; grid of Soft Events / Hard Events by Operational Profiling / Data Profiling, each circuit marked CLOSED]
Time-to-Reliable-Insights UNBOUNDED
30
Summary: Trade-off Between Data Quality & Availability
[Figure: Data Quality vs. Data Availability; grid of Soft Events / Hard Events by Operational Profiling / Data Profiling, each circuit marked CLOSED]
Time-to-Reliable-Insights BOUNDED
Availability ∝ Quality
31