Building (Better) Data Pipelines using Apache Airflow - Sid Anand



SLIDE 1

Building (Better) Data Pipelines using Apache Airflow

Sid Anand (@r39132) QCon.AI 2018

SLIDE 2

About Me

  • Work[ed | s] @
  • Maintainer of
  • Spare time
  • Co-Chair for

SLIDE 3

Apache Airflow

What is it?

SLIDE 4

Apache Airflow : What is it?

In a nutshell: Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs, or Directed Acyclic Graphs)
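
To make the "Directed Acyclic Graph" idea concrete, here is a minimal sketch in plain Python. It is not Airflow's API: the task names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for a scheduler's dependency resolution.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

def execution_order(dag):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())

def is_acyclic(dag):
    """A workflow is only schedulable if its dependency graph has no cycles."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False

order = execution_order(workflow)
```

The "acyclic" part matters: a cycle such as `a` depending on `b` and `b` depending on `a` has no valid run order, so `is_acyclic` returns `False` for it.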

SLIDE 5

Apache Airflow

UI Walk-Through

SLIDE 6

Apache Airflow : UI Walk-through

SLIDE 7

Airflow - Authoring DAGs

Airflow: Visualizing a DAG

SLIDE 8

Airflow - Authoring DAGs

Airflow: Author DAGs in Python! No need to bundle many XML files!
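
A rough illustration of why authoring in Python helps: dependencies become ordinary expressions, and tasks can be generated in a loop. The `Task` class below is a hypothetical toy, not Airflow's operator class, although Airflow's operators do support a similar `>>` syntax for declaring dependencies. The task names are made up.

```python
class Task:
    """Toy stand-in for a workflow task; NOT Airflow's operator class."""

    def __init__(self, name):
        self.name = name
        self.downstream = []  # tasks that must run after this one

    def __rshift__(self, other):
        # `a >> b` records that b depends on a, and returns b so chains work.
        self.downstream.append(other)
        return other

    def __repr__(self):
        return f"Task({self.name!r})"


# Because the workflow is ordinary Python, tasks can be generated in a loop.
start = Task("start")
done = Task("done")
scorers = [Task(f"score_{region}") for region in ("us", "eu", "apac")]  # hypothetical names

for scorer in scorers:
    start >> scorer >> done

# Flatten the graph into (upstream, downstream) pairs for inspection.
edges = [(t.name, d.name) for t in [start, *scorers] for d in t.downstream]
```

Adding a fourth region here is a one-word change to the tuple, where a static XML definition would need a new task stanza plus its dependency entries.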

SLIDE 9

Airflow - Authoring DAGs

Airflow: The Tree View offers a view of DAG Runs over time!

SLIDE 10

Airflow - Performance Insights

Airflow: Gantt charts reveal the slowest tasks for a run!
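
The information behind a Gantt chart is just a start and end time per task. As a small sketch of how the slowest tasks fall out of that data (the timings below are made up for illustration and do not come from the talk):

```python
from datetime import datetime

# Made-up task timings for one run: (task name, start, end).
run = [
    ("extract", datetime(2018, 4, 10, 1, 0), datetime(2018, 4, 10, 1, 5)),
    ("score", datetime(2018, 4, 10, 1, 5), datetime(2018, 4, 10, 1, 40)),
    ("publish", datetime(2018, 4, 10, 1, 40), datetime(2018, 4, 10, 1, 43)),
]

def slowest_tasks(timings):
    """Return (name, duration in seconds) pairs, longest first."""
    durations = [(name, (end - start).total_seconds()) for name, start, end in timings]
    return sorted(durations, key=lambda pair: pair[1], reverse=True)

ranking = slowest_tasks(run)
```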

SLIDE 11

Airflow - Performance Insights

Airflow: …And we can easily see performance trends over time

SLIDE 12

Apache Airflow

Why use it?

SLIDE 13

Apache Airflow : Why use it?

When would you use a Workflow Scheduler like Airflow?

  • ETL Pipelines
  • Machine Learning Pipelines
  • Predictive Data Pipelines
      • Fraud Detection, Scoring/Ranking, Classification, Recommender System, etc…
  • General Job Scheduling (e.g. Cron)
      • DB Back-ups, Scheduled code/config deployment

SLIDE 14

Apache Airflow : Why use it?

What should a Workflow Scheduler do well?

  • Schedule a graph of dependencies
      • where Workflow = A DAG of Tasks
  • Handle task failures
  • Report / Alert on failures
  • Monitor performance of tasks over time
  • Enforce SLAs
      • E.g. Alerting if time or correctness SLAs are not met
  • Easily scale for growing load
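
Three of the requirements above (handling failures, alerting, and time SLAs) can be sketched in a few lines of plain Python. This is an illustration of the behavior, not how Airflow implements it; `run_with_retries`, its parameters, and the `flaky` task are all hypothetical names.

```python
import time

def run_with_retries(task, retries=2, sla_seconds=None, alert=print):
    """Run `task`, retrying on failure; alert on final failure or a missed SLA."""
    started = time.monotonic()
    for attempt in range(1, retries + 2):  # first try plus `retries` retries
        try:
            result = task()
            break
        except Exception as exc:
            if attempt == retries + 1:
                alert(f"task failed after {attempt} attempts: {exc}")
                raise
    elapsed = time.monotonic() - started
    if sla_seconds is not None and elapsed > sla_seconds:
        alert(f"SLA missed: took {elapsed:.3f}s, allowed {sla_seconds}s")
    return result


# Hypothetical flaky task: fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient error")
    return "ok"

alerts = []
outcome = run_with_retries(flaky, retries=2, sla_seconds=60, alert=alerts.append)
```

Passing `alert` as a callable keeps the sketch testable; in practice it would be whatever sends an email or page.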

SLIDE 15

Apache Airflow : Why use it?

What Does Apache Airflow Add?

  • Configuration-as-code
  • Usability - Stunning UI / UX
  • Centralized configuration
  • Resource Pooling
  • Extensibility
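
Resource pooling means capping how many tasks may use a shared resource at once, no matter how many workers are available. A minimal sketch of that idea with a semaphore follows; the `Pool` class, the slot count, and `query_db` are hypothetical and are not Airflow's implementation.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class Pool:
    """Toy pool with a fixed number of slots; NOT Airflow's implementation."""

    def __init__(self, slots):
        self._slots = threading.BoundedSemaphore(slots)
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0  # highest number of tasks seen running at once

    def run(self, task):
        with self._slots:  # blocks until a slot is free
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return task()
            finally:
                with self._lock:
                    self.running -= 1


def query_db():
    time.sleep(0.05)  # stand-in for work against a shared resource
    return "done"

db_pool = Pool(slots=2)  # hypothetical limit: at most 2 concurrent DB tasks
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(lambda _: db_pool.run(query_db), range(6)))
```

Even with eight worker threads, `db_pool.peak` never exceeds the two slots, which is the property a pool is meant to guarantee.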

SLIDE 16

Use-Case : Message Scoring

Batch Pipeline Architecture

SLIDE 17

Use-Case : Message Scoring

S3 uploads every 15 minutes

Diagram: enterprise A, enterprise B, enterprise C, S3

SLIDE 18

Use-Case : Message Scoring

Airflow kicks off a Spark message scoring job every hour

Diagram: enterprise A, enterprise B, enterprise C, S3
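
With uploads every 15 minutes and a scoring job every hour, each run has four upload batches to cover. The sketch below works out those windows for a given run time. It assumes each hourly run scores exactly the uploads from the preceding hour, which the slides do not state; the function name and the example date are hypothetical.

```python
from datetime import datetime, timedelta

UPLOAD_INTERVAL = timedelta(minutes=15)  # from the slides: S3 uploads every 15 minutes
RUN_INTERVAL = timedelta(hours=1)        # from the slides: scoring job runs every hour

def upload_windows(run_time):
    """Start times of the upload windows in the hour that ends at `run_time`.

    Assumption: each hourly run scores exactly the uploads from the previous hour.
    """
    start = run_time - RUN_INTERVAL
    count = RUN_INTERVAL // UPLOAD_INTERVAL  # 4 windows per run
    return [start + i * UPLOAD_INTERVAL for i in range(count)]

windows = upload_windows(datetime(2018, 4, 10, 9, 0))
```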

SLIDE 19

Use-Case : Message Scoring

Spark job writes scored messages and stats to another S3 bucket

Diagram: enterprise A, enterprise B, enterprise C, S3, S3

SLIDE 20

Use-Case : Message Scoring

This triggers SNS/SQS message events

Diagram: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS

SLIDE 21

Use-Case : Message Scoring

An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages

Diagram: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, Importers (ASG)

SLIDE 22

Use-Case : Message Scoring

The importers rapidly ingest scored messages and aggregate statistics into the DB

Diagram: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, Importers (ASG), DB

SLIDE 23

Use-Case : Message Scoring

Users receive alerts of untrusted emails & can review them in the web app

Diagram: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, Importers (ASG), DB

SLIDE 24

Use-Case : Message Scoring

Airflow manages the entire process

Diagram: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, Importers (ASG), DB

SLIDE 25

Airflow DAG

SLIDE 26

Apache Airflow

Incubating

SLIDE 27

Apache Airflow : Incubating

Timeline

  • Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
  • Max launched it @ Hadoop Summit in Summer 2015
  • On 3/31/2016, Airflow -> Apache Incubator

Today

  • 2400+ Forks
  • 7600+ GitHub Stars
  • 430+ Contributors
  • 150+ companies officially using it!
  • 14 Committers/Maintainers <- We’re growing here

SLIDE 28

Thank You!

SLIDE 29

Apache Airflow

Behind the Scenes

SLIDE 30

Apache Airflow : Behind the Scenes

Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs). It ships with:

  • DAG Scheduler
  • Web application (UI)
  • Powerful CLI
  • Celery Workers!

SLIDE 31

Apache Airflow : Behind the Scenes

  1. A user schedules / manages DAGs using the Airflow UI!
  2. Airflow’s webserver stores scheduling metadata in the metadata DB
  3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
  4. Airflow workers pick up Airflow tasks over Celery

Diagram: Webserver, Scheduler, Worker (x3), Meta DB, Celery / RabbitMQ


SLIDE 34

Apache Airflow : Behind the Scenes

  1. A user schedules / manages DAGs using the Airflow UI!
  2. Airflow’s webserver stores scheduling metadata in the metadata DB
  3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
  4. Airflow workers pick up Airflow tasks from RabbitMQ

Diagram: Webserver, Scheduler, Worker (x3), Meta DB, Celery / RabbitMQ
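
The scheduler-to-worker hand-off in steps 3 and 4 can be sketched with the standard library alone. In this toy version a `queue.Queue` stands in for the Celery / RabbitMQ broker and a dict stands in for the metadata DB. None of it is Airflow's or Celery's API, and the task names and state strings are hypothetical.

```python
import queue
import threading

task_queue = queue.Queue()      # stand-in for the Celery / RabbitMQ broker
metadata_db = {}                # stand-in for the metadata DB: task name -> state
db_lock = threading.Lock()
STOP = object()                 # sentinel telling a worker to shut down

def scheduler(task_names):
    """Record each task as queued, then hand it to the broker."""
    for name in task_names:
        with db_lock:
            metadata_db[name] = "queued"
        task_queue.put(name)

def worker():
    """Pull tasks off the broker until told to stop, recording the outcome."""
    while True:
        name = task_queue.get()
        if name is STOP:
            break
        with db_lock:
            metadata_db[name] = "success"  # a real worker would run the task here

workers = [threading.Thread(target=worker) for _ in range(3)]  # three workers, as in the diagram
for w in workers:
    w.start()
scheduler(["extract", "score", "publish", "notify"])  # hypothetical task names
for _ in workers:
    task_queue.put(STOP)
for w in workers:
    w.join()
```

The scheduler never talks to a worker directly: it only writes state and enqueues work, which is what lets workers be added or removed without changing the scheduler.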

SLIDE 35

Thank You!