SLIDE 1

Resilient Predictive Data Pipelines

Sid Anand (@r39132) QCon London 2016

SLIDE 2

About Me

[Slide of logos] Work[ed | s] @ …; Maintainer on airbnb/airflow; Co-Chair for …; Report to …

SLIDE 3

Motivation

Why is a Data Pipeline talk in this High Availability Track?

SLIDE 4

Different Types of Data Pipelines

ETL

  • used for : loading data related to business health into a Data Warehouse
    • user-engagement stats (e.g. social networking)
    • product success stats (e.g. e-commerce)
  • audience : Business, BizOps
  • downtime? : 1-3 days

vs.

Predictive

  • used for :
    • building recommendation products (e.g. social networking, shopping)
    • updating fraud prevention endpoints (e.g. security, payments, e-commerce)
  • audience : Customers
  • downtime? : < 1 hour

SLIDE 5

Different Types of Data Pipelines

ETL

  • used for : loading data into a Data Warehouse -> Reports
    • user-engagement stats (e.g. social networking)
    • product success stats (e.g. e-commerce)
  • audience : Business
  • downtime? : 1-3 days

vs.

Predictive

  • used for :
    • building recommendation products (e.g. social networking, shopping)
    • updating fraud prevention endpoints (e.g. security, payments, e-commerce)
  • audience : Customers
  • downtime? : < 1 hour

SLIDE 6

Why Do We Care About Resilience?

[Diagram: records d1-d8 flow from the DB into downstream Search Engines & Recommenders]

SLIDE 7

Why Do We Care About Resilience?

[Diagram: the flow continues; more records are in flight between the DB and the Search Engines & Recommenders]

SLIDE 8

Why Do We Care About Resilience?

[Diagram: most records (d1-d6) have reached the Search Engines & Recommenders]

SLIDE 9

Why Do We Care About Resilience?

[Diagram: as on the previous slide: d1-d6 delivered downstream]

SLIDE 10

Any Take-aways?

[Diagram: d1-d6 delivered to the Search Engines & Recommenders]

  • Bugs happen!
  • Bugs in Predictive Data Pipelines have a large blast radius
  • These bugs can affect customers and a company’s profits & reputation!

SLIDE 11

Design Goals

Desirable Qualities of a Resilient Data Pipeline

SLIDE 12


  • Scalable
  • Available
  • Instrumented, Monitored, & Alert-enabled
  • Quickly Recoverable

Desirable Qualities of a Resilient Data Pipeline

SLIDE 13

  • Scalable
    • Build your pipelines using [infinitely] scalable components
    • The scalability of your system is determined by its least-scalable component
  • Available
  • Instrumented, Monitored, & Alert-enabled
  • Quickly Recoverable

Desirable Qualities of a Resilient Data Pipeline

SLIDE 14

Desirable Qualities of a Resilient Data Pipeline

  • Scalable
    • Build your pipelines using [infinitely] scalable components
    • The scalability of your system is determined by its least-scalable component
  • Available
    • Ditto
  • Instrumented, Monitored, & Alert-enabled
  • Quickly Recoverable
SLIDE 15

Instrumented

Instrumentation must reveal SLA metrics at each stage of the pipeline! What SLA metrics do we care about? Correctness & Timeliness.

  • Correctness
    • No Data Loss
    • No Data Corruption
    • No Data Duplication
    • A Defined Acceptable Staleness of Intermediate Data
  • Timeliness
    • A late result == a useless result
    • Delayed processing of now()’s data may delay the processing of future data

SLIDE 16

Instrumented, Monitored, & Alert-enabled

  • Instrument
    • Instrument Correctness & Timeliness SLA metrics at each stage of the pipeline
  • Monitor
    • Continuously monitor that SLA metrics fall within acceptable bounds (i.e. pre-defined SLAs)
  • Alert
    • Alert when we miss SLAs
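The instrument/monitor/alert loop above can be sketched as a pure function. This is a minimal illustration, not the talk's actual tooling; the metric names and `SLA_BOUNDS` table are invented for the example.

```python
# Illustrative SLA monitor: compare instrumented metrics against pre-defined
# bounds and emit an alert string for every SLA miss. All names are invented.

SLA_BOUNDS = {
    "records_lost": (0, 0),          # correctness: no data loss
    "records_duplicated": (0, 0),    # correctness: no data duplication
    "staleness_minutes": (0, 60),    # defined acceptable staleness
    "end_to_end_minutes": (0, 60),   # timeliness: a late result is useless
}

def check_slas(metrics: dict) -> list:
    """Return alert messages for every metric outside its SLA bounds."""
    alerts = []
    for name, value in metrics.items():
        lo, hi = SLA_BOUNDS[name]
        if not (lo <= value <= hi):
            alerts.append(f"SLA MISS: {name}={value} outside [{lo}, {hi}]")
    return alerts
```

In practice the alerts would feed whatever Ops tooling the team uses; the point is that the checks run continuously at each stage of the pipeline.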
SLIDE 17

Desirable Qualities of a Resilient Data Pipeline

  • Scalable
    • Build your pipelines using [infinitely] scalable components
    • The scalability of your system is determined by its least-scalable component
  • Available
    • Ditto
  • Instrumented, Monitored, & Alert-enabled
  • Quickly Recoverable
SLIDE 18

Quickly Recoverable


  • Bugs happen!
  • Bugs in Predictive Data Pipelines have a large blast radius
  • Optimize for MTTR
SLIDE 19

Implementation

Using AWS to meet Design Goals

SLIDE 20

SQS

Simple Queue Service

SLIDE 21

SQS - Overview

AWS’s low-latency, highly scalable, highly available message queue.

  • Infinitely Scalable Queue (though not FIFO)
  • Low end-to-end latency (generally sub-second)
  • Pull-based

How it Works!
  • Producers publish messages, which can be batched, to an SQS queue
  • Consumers consume messages, which can be batched, from the queue:
    • commit message contents to a data store
    • ACK the messages as a batch
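The receive/ACK contract described above can be modeled in a few lines. This is a toy in-memory sketch for illustration only; real SQS is a distributed service reached through an SDK such as boto3, and `MiniQueue` is an invented name.

```python
import itertools
import time

class MiniQueue:
    """Toy in-memory model of SQS's receive/ack contract (illustrative only)."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}            # handle -> [body, invisible_until]
        self._handles = itertools.count()

    def publish(self, body):
        """Producer side: enqueue a message, immediately visible."""
        self._messages[next(self._handles)] = [body, 0.0]

    def receive(self, now=None):
        """Return (handle, body) of one visible message and start its
        visibility timer, or None if nothing is currently visible."""
        now = time.monotonic() if now is None else now
        for handle, entry in self._messages.items():
            if entry[1] <= now:                       # message is visible
                entry[1] = now + self.visibility_timeout
                return handle, entry[0]
        return None

    def ack(self, handle):
        """Delete the message; called after committing contents to the DB."""
        self._messages.pop(handle, None)
```

A consumer loops receive, commit to the data store, ack. If it crashes (or is slow) before the ack, the visibility timer expires and the message becomes visible again, which is exactly the timeout scenario walked through on the next slides.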

SLIDE 22

SQS - Typical Operation Flow

[Diagram: three producers publish m1-m5 to SQS; three consumers read from the queue and persist to a DB; a visibility timer tracks each in-flight message]

Step 1: A consumer reads a message from SQS. This starts a visibility timer!
SLIDE 23

SQS - Typical Operation Flow

Step 2: Consumer persists message contents to DB

SLIDE 24

SQS - Typical Operation Flow

Step 3: Consumer ACKs message in SQS

SLIDE 25

SQS - Time Out Example

Step 1: A consumer reads a message from SQS

SLIDE 26

SQS - Time Out Example

Step 2: Consumer attempts to persist message contents to DB

SLIDE 27

SQS - Time Out Example

Step 3: A Visibility Timeout occurs & the message becomes visible again.

SLIDE 28

SQS - Time Out Example

Step 4: Another consumer reads and persists the same message

SLIDE 29

SQS - Time Out Example

Step 5: Consumer ACKs message in SQS
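The timeout example shows the same message being persisted twice by different consumers: SQS is at-least-once, so the "persist to DB" step must be idempotent. A minimal sketch, with an in-memory dict standing in for a DB table that has a unique key on the message id (names are illustrative):

```python
# Making the persist step idempotent so SQS redelivery is harmless.
# `db` stands in for a real store with a uniqueness constraint on message_id.

db = {}

def commit(message_id: str, contents: dict) -> bool:
    """Insert at most once per message_id; redeliveries become no-ops.
    Returns True only when this call actually wrote."""
    if message_id in db:          # duplicate delivery: already committed
        return False
    db[message_id] = contents
    return True
```

With this in place, the second consumer's write of m1 is a no-op and the final ACK simply removes the duplicate from the queue.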

SLIDE 30

SQS - Dead Letter Queue

[Diagram: after the redrive rule (2x) is exceeded, SQS moves m1 from the main queue to a DLQ, so the pipeline is never blocked by a poison message]
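The redrive rule above can be sketched as a counter per message. In real SQS this is a `RedrivePolicy` (`maxReceiveCount`) configured on the source queue; the function and data structures here are an invented toy model.

```python
# Toy model of a DLQ redrive rule: after a message has been received (and
# presumably failed) MAX_RECEIVE_COUNT times, park it in the DLQ instead of
# delivering it again. Names are illustrative.

MAX_RECEIVE_COUNT = 2   # the "2x" redrive rule on the slide

def receive_with_redrive(message, receive_counts, dlq):
    """Deliver `message`, or move it to `dlq` once its receive count
    exceeds MAX_RECEIVE_COUNT (returning None in that case)."""
    mid = message["id"]
    receive_counts[mid] = receive_counts.get(mid, 0) + 1
    if receive_counts[mid] > MAX_RECEIVE_COUNT:
        dlq.append(message)      # poison message: park it, unblock the queue
        return None
    return message
```

Messages in the DLQ can then be inspected, fixed, and replayed without ever blocking the main queue.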

SLIDE 31

SNS

Simple Notification Service

SLIDE 32

SNS - Overview

Highly Scalable, Highly Available, Push-based Topic Service.

  • Whereas SQS ensures each message is seen by at least 1 consumer, SNS ensures that each message is seen by every consumer: Reliable Multi-Push
  • Whereas SQS is pull-based, SNS is push-based
  • There is no message retention & there is a finite retry count: No Reliable Message Delivery
  • Can we work around this limitation?

SLIDE 33

SNS + SQS Design Pattern

[Diagram: messages m1, m2 published to SNS topic T1 fan out to SQS queues Q1 & Q2: Reliable Multi-Push from SNS + Reliable Message Delivery from SQS]
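The fan-out pattern can be modeled in a few lines: the topic pushes every message to every subscribed queue, and each queue then retains it until a consumer ACKs. In real code each SQS queue would be subscribed to the SNS topic via boto3; this in-memory `Topic` is an illustrative stand-in.

```python
# Toy model of the SNS + SQS design pattern: SNS gives multi-push (every
# subscriber sees every message), SQS gives retention/reliable delivery.
# Plain lists stand in for the SQS queues.

class Topic:
    """Minimal SNS-like topic: push each published message to every subscriber."""

    def __init__(self):
        self.subscriptions = []

    def subscribe(self, queue):
        self.subscriptions.append(queue)

    def publish(self, message):
        for queue in self.subscriptions:   # reliable multi-push
            queue.append(message)          # queue retains it until consumed

t1, q1, q2 = Topic(), [], []
t1.subscribe(q1)
t1.subscribe(q2)
for m in ("m1", "m2"):
    t1.publish(m)
```

Because each queue holds its own copy, a slow or crashed consumer on Q2 never affects delivery to Q1's consumers, which is the workaround for SNS's lack of retention.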

SLIDE 34

SNS + SQS

[Diagram: producers publish m1, m2 to SNS T1, which fans out to SQS Q1 & Q2; Q1's consumers persist to the DB, Q2's consumers persist to ES]

SLIDE 35

S3 + SNS + SQS Design Pattern

[Diagram: data d1, d2 lands transactionally in S3; notifications m1, m2 flow through SNS T1 and fan out to SQS Q1 & Q2: Reliable Multi-Push plus S3 transactions]

SLIDE 36

Batch Pipeline Architecture

Putting the Pieces Together

SLIDE 37

Architecture

SLIDE 38

Architectural Elements

  • A Schema-aware Data format for all data (Avro)
  • The entire pipeline is built from Highly-Available/Highly-Scalable components: S3, SNS, SQS, ASG, EMR Spark (exception: the DB)
  • The pipeline is never blocked, because we use a DLQ for messages we cannot process
  • We use queue-based auto-scaling to get high on-demand ingest rates
  • We manage everything with Airflow
  • Every stage in the pipeline is idempotent
  • Every stage in the pipeline is instrumented
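To make "schema-aware data format" concrete, here is what an Avro schema for one record type might look like. The record and field names are invented for illustration; the talk does not show its actual schemas.

```python
import json

# A hypothetical Avro schema (Avro schemas are JSON documents). The union
# type with a default on `score` is what lets readers and writers evolve
# independently, one benefit of a schema-aware format over raw JSON/CSV.

USER_EVENT_SCHEMA = json.dumps({
    "type": "record",
    "name": "UserEvent",
    "namespace": "pipeline.events",
    "fields": [
        {"name": "event_id",   "type": "string"},
        {"name": "user_id",    "type": "long"},
        {"name": "event_type", "type": "string"},
        {"name": "score",      "type": ["null", "double"], "default": None},
    ],
})

schema = json.loads(USER_EVENT_SCHEMA)
```

Every stage of the pipeline reads and writes records against schemas like this, so a malformed record is caught at the stage boundary instead of corrupting downstream data.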

SLIDE 39

ASG

Auto Scaling Group

SLIDE 40

ASG - Overview

What is it?
  • A means to automatically scale out/in clusters to handle variable load/traffic
  • A means to keep a cluster/service always up
  • Fulfills AWS’s pay-per-use promise!

When to use it?
  • Feed-processing, web traffic load balancing, zone outage, etc.

SLIDE 41

ASG - Data Pipeline

[Diagram: an Importer ASG of importer instances scales out/in between SQS and the DB]

SLIDE 42

ASG : CPU-based

[Chart: Sent, ACKd/Recvd, and CPU metrics over time]

CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant.

SLIDE 43

ASG : CPU-based

[Chart: Sent, Recv, and CPU metrics showing a premature scale-in]

Premature Scale-in: The CPU drops to noise levels before all messages are consumed. This causes scale-in to occur while the last few messages are still being committed, resulting in a long time-to-drain for the queue!

SLIDE 44

ASG - Queue-based

  • Scale-out: when Visible Messages > 0 (a.k.a. when queue depth > 0); this causes the ASG to grow
  • Scale-in: when Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d); this causes the ASG to shrink
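The two rules above reduce to a pure function of the queue's two counters. In AWS these would be CloudWatch alarms on the queue's `ApproximateNumberOfMessagesVisible` and `ApproximateNumberOfMessagesNotVisible` metrics driving ASG scaling policies; the function name and return values here are illustrative.

```python
# Queue-based auto-scaling decision, per the two rules above. Unlike the
# CPU-based policy, this cannot scale in prematurely: it holds steady while
# the last in-flight messages are still being committed.

def asg_action(visible: int, invisible: int) -> str:
    if visible > 0:        # queue depth > 0: work is waiting -> grow
        return "scale-out"
    if invisible == 0:     # last in-flight message has been ACK'd -> shrink
        return "scale-in"
    return "hold"          # queue drained, but messages still in flight
```

The "hold" branch is the fix for the premature scale-in problem on the previous slide: CPU has already dropped to noise levels there, but `invisible > 0` keeps the instances alive until the queue is truly drained.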

SLIDE 45

Architecture

SLIDE 46

Reliable Hourly Job Scheduling

Workflow Automation & Scheduling

SLIDE 47

Our Needs

Historical Context
  • Our first cut at the pipeline used cron to schedule hourly runs of Spark

Problem
  • We only knew if Spark succeeded. What if a downstream task failed?

We needed something smarter than cron, one that:
  • Reliably managed a graph of tasks (a DAG: Directed Acyclic Graph)
  • Orchestrated hourly runs of that DAG
  • Retried failed tasks
  • Tracked the performance of each run and its tasks
  • Reported on failure/success of runs

SLIDE 48

Airflow

Workflow Automation & Scheduling

SLIDE 49

Airflow - DAG Dashboard


Airflow: It’s easy to manage multiple DAGs

SLIDE 50

Airflow - Authoring DAGs


Airflow: Visualizing a DAG

SLIDE 51


Airflow: Author DAGs in Python! No need to bundle many config files!

Airflow - Authoring DAGs
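The slide shows a DAG file as a screenshot. As a rough sketch of what "DAGs in Python" looked like in 2016-era Airflow (the talk's real DAG is not shown in text, and the task names, commands, and argument values below are invented), a minimal hourly pipeline definition might be:

```python
# Illustrative Airflow DAG, sketched against the 2016-era airbnb/airflow API.
# One Python file replaces cron plus a pile of config: scheduling, retries,
# dependencies, and failure reporting all live here.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,                            # retry failed tasks
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,                # report on failure
}

dag = DAG(
    "hourly_predictive_pipeline",
    default_args=default_args,
    start_date=datetime(2016, 3, 1),
    schedule_interval="@hourly",             # orchestrate hourly runs
)

spark_job = BashOperator(
    task_id="run_spark_job",
    bash_command="spark-submit score_events.py",
    dag=dag,
)

publish = BashOperator(
    task_id="publish_scores",
    bash_command="python publish_scores.py",
    dag=dag,
)

# The dependency cron could never express: publish only runs if Spark succeeds.
spark_job.set_downstream(publish)
```

Because the downstream task is declared in the graph, Airflow knows to skip, retry, or alert on it, which is exactly what the cron-based first cut could not do.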

SLIDE 52

Airflow - Performance Insights


Airflow: Gantt chart view reveals the slowest tasks for a run!

SLIDE 53


Airflow: …And we can easily see performance trends over time

Airflow - Performance Insights

SLIDE 54

Airflow - Alerting


Airflow: …And easy to integrate with Ops tools!

SLIDE 55

Airflow - Monitoring


SLIDE 56


Airflow - Join the Community

With >30 companies, >100 contributors, and >2500 commits, Airflow is growing rapidly! We are looking for more contributors to help support the community! (Disclaimer: I’m a maintainer on the project.)

SLIDE 57

Design Goal Scorecard

Are We Meeting Our Design Goals?

SLIDE 58


  • Scalable
  • Available
  • Instrumented, Monitored, & Alert-enabled
  • Quickly Recoverable

Desirable Qualities of a Resilient Data Pipeline

SLIDE 59


  • Scalable
  • Build using scalable components from AWS
  • SQS, SNS, S3, ASG, EMR Spark
  • Exception = DB (WIP)
  • Available
  • Build using available components from AWS
  • Airflow for reliable job scheduling
  • Instrumented, Monitored, & Alert-enabled
  • Airflow
  • Quickly Recoverable
  • Airflow, DLQs, ASGs, Spark & DB

Desirable Qualities of a Resilient Data Pipeline

SLIDE 60

Questions? (@r39132)
