Circuit Breakers to safeguard for Garbage in, Garbage out



SLIDE 1

Circuit Breakers to safeguard for Garbage in, Garbage out

Sandeep Uttamchandani
Chief Data Architect & Head of Data Platform Engineering, Small Business & Self Employed Group, Intuit

SLIDE 2

SLIDE 3

Real? OR Data quality issue?

SLIDE 4

SLIDE 5

Who we serve

  • Consumers
  • Small Businesses
  • Self-Employed

SLIDE 6

Our mission

Powering Prosperity Around the World

SLIDE 7

Data Pipeline: Physical View

Data Sources: User Entered Data, Clickstream, CRM, Social, In-product, Business Operations
Stages: Collect, Store, Analyze, Serve
Result: Insights

SLIDE 8

Data Pipeline: Physical View

Data Sources: User Entered Data, Clickstream, CRM, Social, In-product, Business Operations
Stages: Collect, Store ↔ Analyze, Serve
Result: Insights

SLIDE 9

Data Pipeline: Logical View

[Figure: a Data Pipeline drawn as Jobs that read and write Tables, ending in a Table/Dashboard]

SLIDE 10

Data Pipeline: Logical View Example

SLIDE 11

Key Reasons for Data Quality Issues

Data Source Issues
  • Table inconsistencies
      a. Illegitimate values
      b. Missing values
      c. Duplicate primary keys
  • Hard deletes
  • Bulk inserts
  • Missing updates to CDC column
  • Uncoordinated upstream changes
      a. Volume of data
      b. Change in schema
      c. Change in meaning of data
      d. Upgrade of platform

Data Ingestion Issues
  • No CDC for large tables, leading to delayed availability
  • Errors in ETL logic
  • Timezone inconsistencies
  • Duplicate or null records due to ingestion errors

Referential Integrity Issues
  • Data elements have different data types and/or meaning in different sources
  • Inconsistent data element enums
  • Heuristic ID correlation
  • Uncoordinated schema changes
  • Dropped updates across data sources
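
The table inconsistencies named on this slide (illegitimate values, missing values, duplicate primary keys) can be checked mechanically. The following is a minimal sketch in plain Python, not the presenter's implementation. The row layout, the `find_table_inconsistencies` helper, and the `allowed` value sets are all hypothetical names introduced for illustration.

```python
from collections import Counter

def find_table_inconsistencies(rows, primary_key, allowed):
    """Count three kinds of table inconsistency.

    rows: list of dicts, one per record (assumed layout).
    primary_key: name of the key column.
    allowed: dict mapping a column name to its set of legitimate values.
    """
    # Duplicate primary keys: any key value that occurs more than once.
    key_counts = Counter(row.get(primary_key) for row in rows)
    duplicate_keys = sorted(
        key for key, count in key_counts.items() if key is not None and count > 1
    )

    missing = 0
    illegitimate = 0
    for row in rows:
        for column, legal_values in allowed.items():
            value = row.get(column)
            if value is None:
                missing += 1          # missing value
            elif value not in legal_values:
                illegitimate += 1     # value outside the legal domain
    return {
        "duplicate_keys": duplicate_keys,
        "missing_values": missing,
        "illegitimate_values": illegitimate,
    }

# Made-up sample data for the sketch.
rows = [
    {"id": 1, "plan": "free"},
    {"id": 2, "plan": "gold"},  # "gold" is not a legal plan here
    {"id": 2, "plan": None},    # duplicate key and missing value
]
report = find_table_inconsistencies(rows, "id", {"plan": {"free", "paid"}})
```

A report like this could feed the data-profiling checks described later in the deck; how the counts map to alerts is left open here.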

SLIDE 12

Circuit Breakers - Avoiding Electric Fires

BEFORE AFTER

SLIDE 13

Circuit Breakers - Avoiding Services with High Response Time

[Figure, BEFORE vs. AFTER: a Client calls an API Gateway in front of the Orders, Invoices, and Offers services, with Response Time shown for each case. Caption: Certain services unavailable]

SLIDE 14

Circuit Breaker - Avoiding Insights with Data Quality Issues

[Figure: a pipeline of Jobs and Tables]

BEFORE: Time-to-Reliable-Insights UNBOUNDED
AFTER: Time-to-Reliable-Insights BOUNDED. Team alerted to diagnose & backfill (if possible)

SLIDE 15

Circuit Breaker - Avoiding Insights with Data Quality Issues

Answering the question: Is it Real? OR a Data quality issue?

BEFORE: Hours/Days/Weeks (Time-to-Reliable-Insights UNBOUNDED)
AFTER: Minutes (Time-to-Reliable-Insights BOUNDED)

SLIDE 16

Implementing Circuit Breakers for Data Pipelines

Circuit states: Circuit Closed (Data Quality passes), Circuit Open (Data Quality fails)
Alerts: Soft Alerts, Hard Alerts
Outcomes: Results w/ Warning, Data Quality Alert

SLIDE 17

Implementing Circuit Breakers for Data Pipelines

  1. Track Lineage
  2. Profile Pipeline
  3. Control the Circuit

SLIDE 18

1. Track Lineage

[Figure: a Data Pipeline drawn as Jobs that read and write Tables, ending in a Table/Dashboard]

SLIDE 19

1. Track Lineage

[Figure: a Pipeline is made up of Jobs; each Job is made up of Scripts, Executables, and Queries, and is summarized as <Input Tables, Output Tables>]

SLIDE 20

1. Track Lineage

Data Pipeline:
  <Job, Input, Output>
  <Job, Input, Output>
  . . . .
  <Job, Input, Output>
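
The `<Job, Input, Output>` tuples above are enough to answer the question a circuit breaker needs answered: if a table is bad, which tables further down the pipeline are affected? The sketch below shows one way to do this with a breadth-first traversal. The job names, table names, and the `downstream_tables` helper are hypothetical, and the deck does not say how lineage is actually stored or queried.

```python
from collections import defaultdict, deque

# Hypothetical <Job, Input, Output> tuples; all names are made up.
LINEAGE = [
    ("ingest_job", ["src_orders"], ["raw_orders"]),
    ("clean_job", ["raw_orders"], ["clean_orders"]),
    ("report_job", ["clean_orders", "dim_customers"], ["revenue_dashboard"]),
]

def downstream_tables(lineage, table):
    """Return every table derived, directly or indirectly, from `table`."""
    # Build an adjacency map: input table -> tables written from it.
    children = defaultdict(set)
    for _job, inputs, outputs in lineage:
        for source in inputs:
            children[source].update(outputs)

    # Breadth-first traversal over the table-to-table edges.
    seen = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

affected = downstream_tables(LINEAGE, "raw_orders")
```

Here `affected` contains `clean_orders` and `revenue_dashboard`, the tables that would need to be held back or backfilled if `raw_orders` failed a quality check.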

SLIDE 21

2. Profile Pipeline

Operational Profiling (Platform Engineers)
  • Job Health
  • Data Fabric Health

Data Profiling (Data Engineers)
  • Single Column
  • Multi Column
  • Cross-DB Dependencies

SLIDE 22

2. Profile Pipeline - Job Health Example

SLIDE 23

2. Profile Pipeline - Data Fabric Health

[Figure: the physical pipeline view (Data Sources, Collect, Store ↔ Analyze, Serve, Insights) with a Monitoring & Logging API]

SLIDE 24

2. Profile Pipeline - Data Profiling

Single Column
  • Cardinalities
  • Patterns & Data types
  • Value distributions
  • Domain classification

Multi-Column
  • Correlations
  • Association rules
  • Clustering
  • Outliers
  • Summaries & sketches

Cross-DB dependencies
  • Unique column combinations
  • Inclusion dependencies
  • Functional dependencies

Reference: Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: a survey. The VLDB Journal 24, 4 (August 2015), 557-581.

SLIDE 25

2. Profile Pipeline - Data Profiling

Single Column
  • Cardinalities
  • Patterns & Data types
  • Value distributions
  • Domain classification

Reference: Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: a survey. The VLDB Journal 24, 4 (August 2015), 557-581.

SLIDE 26

2. Profile Pipeline - Data Profiling Example

Single Column
  • Cardinalities
  • Patterns & Data types
  • Value distributions
  • Domain classification
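
The example on this slide did not survive extraction, so the following is only an illustrative sketch of what a single-column profile might compute. It covers cardinalities, patterns and data types, and value distributions; domain classification is omitted. The `profile_column` helper and the sample ZIP-code values are hypothetical.

```python
import re
from collections import Counter

def profile_column(values):
    """Compute a small single-column profile; None marks a missing value."""
    present = [v for v in values if v is not None]

    def pattern(value):
        # Generalize a value: every digit becomes 9, every letter becomes A.
        digits_masked = re.sub(r"\d", "9", str(value))
        return re.sub(r"[A-Za-z]", "A", digits_masked)

    return {
        # Cardinalities
        "row_count": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        # Patterns & data types
        "data_types": Counter(type(v).__name__ for v in present),
        "patterns": Counter(pattern(v) for v in present),
        # Value distribution
        "value_distribution": Counter(present),
    }

# Made-up sample column for the sketch.
zip_codes = ["94043", "94043", "10001", None, "9404A"]
profile = profile_column(zip_codes)
```

In this sample the dominant pattern is `99999`, so the single `9999A` value stands out as a candidate illegal column value.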

SLIDE 27

3. Control the Circuit

Detecting Issues

Absolute Threshold Rules
  • Rule: If current_state >= threshold_value Then <Alert>
  • Example (Operational Profiling): If job_retries > 3, then fail job

History-based Anomaly Detection
  • Rule: If current_state <differs> history_state Then <Alert>
  • Example (Data Profiling): If col_val > 2𝝉, then alert
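
The two detection styles on this slide can be sketched as two small predicates. The slide leaves `<differs>` unspecified; the sketch below assumes it means "more than k standard deviations from the historical mean", which is one common choice and not necessarily the presenter's. The function names, the default `k=2.0`, and the sample history are hypothetical.

```python
from statistics import mean, stdev

def absolute_threshold_alert(current_state, threshold_value):
    """Absolute rule as written on the slide: alert when current_state >= threshold_value."""
    return current_state >= threshold_value

def history_based_alert(current_state, history, k=2.0):
    """Alert when current_state is more than k standard deviations from the historical mean.

    Reading "<differs>" as a k-sigma test is an assumption of this sketch.
    """
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current_state != mu  # constant history: any change differs
    return abs(current_state - mu) > k * sigma

# Made-up history of daily row counts for one table.
daily_row_counts = [1000, 1020, 980, 1010, 990]

normal_day = history_based_alert(1005, daily_row_counts)
bad_day = history_based_alert(400, daily_row_counts)
```

An absolute rule needs a hand-picked threshold per metric, while the history-based rule adapts to each table's own behavior at the cost of needing enough history to be meaningful.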

SLIDE 28

3. Control the Circuit

Alert confidence: Soft Alerts (Low Confidence), Hard Alerts (High Confidence)
Profiling type: Operational Profiling, Data Profiling

  • Job Failed event
  • Access denied log events
  • Read/Write errors
  • ...

  • Anomaly in Job Runtime
  • Anomaly in Job Start time
  • Multiple retry errors in ingestion job
  • ...

  • Illegal column values
  • Change in column value distribution
  • ...

  • Schema mismatch with source table
  • Aggregate mismatch for Cross-DB dependencies
  • ...
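
One way to turn soft and hard alerts into a circuit decision is sketched below. The policy is an assumption of this sketch: any hard alert opens the circuit and blocks the downstream job, while soft alerts leave the circuit closed and travel with the results as warnings. The class and function names are hypothetical, and the severity assigned to each sample alert is illustrative only.

```python
from enum import Enum

class Circuit(Enum):
    CLOSED = "closed"  # data flows to downstream jobs
    OPEN = "open"      # downstream jobs are blocked

class Severity(Enum):
    SOFT = "soft"  # low confidence
    HARD = "hard"  # high confidence

def control_circuit(alerts):
    """Decide the circuit state from a list of (Severity, message) pairs.

    Assumed policy: any hard alert opens the circuit; soft alerts keep it closed.
    """
    hard = [message for severity, message in alerts if severity is Severity.HARD]
    soft = [message for severity, message in alerts if severity is Severity.SOFT]
    state = Circuit.OPEN if hard else Circuit.CLOSED
    return state, hard, soft

def run_downstream(job, alerts):
    """Run `job` only when the circuit is closed; otherwise skip it and report why."""
    state, hard, soft = control_circuit(alerts)
    if state is Circuit.OPEN:
        return {"ran": False, "alerts": hard, "warnings": soft}
    return {"ran": True, "result": job(), "warnings": soft}

def refresh():
    """Stand-in for a downstream job such as a dashboard refresh."""
    return "dashboard refreshed"

outcome_soft = run_downstream(refresh, [(Severity.SOFT, "Anomaly in Job Runtime")])
outcome_hard = run_downstream(refresh, [(Severity.HARD, "Schema mismatch with source table")])
```

With only a soft alert the job runs and the warning is attached to its result; with a hard alert the job is skipped, so bad data never reaches the dashboard and the team can diagnose and backfill.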

SLIDE 29

Summary: Trade-off Between Data Quality & Availability

[Figure: Data Quality vs. Data Availability, with circuits for Soft Events and Hard Events under Operational Profiling and Data Profiling, each marked CLOSED]

Time-to-Reliable-Insights UNBOUNDED

SLIDE 30

Summary: Trade-off Between Data Quality & Availability

[Figure: Data Quality vs. Data Availability, with circuits for Soft Events and Hard Events under Operational Profiling and Data Profiling, each marked CLOSED]

Time-to-Reliable-Insights BOUNDED

Availability ∝ Quality

SLIDE 31

The Rockstar Team driving this work!

We are hiring! Come join us! sandeep_uttamchandani@intuit.com