Building (Better) Data Pipelines using Apache Airflow
Sid Anand (@r39132) QCon.AI 2018
Apache Airflow: What is it?
Airflow: Visualizing a DAG
Airflow: Author DAGs in Python! No need to bundle many XML files!
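To make the "DAGs in Python" point concrete, here is a minimal sketch of what describing a DAG as ordinary Python data looks like. It deliberately does not use the Airflow API: it relies only on the standard library's graphlib module, and the task names (extract, transform, load, report) are hypothetical, chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task names (not from the talk). Each key maps to the set
# of tasks that must finish before it can start.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def run_order(dependencies):
    """Return one valid execution order for the tasks in the DAG.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    i.e. if the graph is not actually a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(dag)
print(order)
```

Because the graph is plain Python, dependencies can be built with loops, functions, and conditionals rather than hand-maintained XML. In Airflow itself the same idea is expressed with its own DAG and operator classes.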
Airflow: The Tree View offers a view of DAG Runs over time!
Airflow: Gantt charts reveal the slowest tasks for a run!
Airflow: …And we can easily see performance trends over time
Diagram components (built up over the following slides): enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, Importers, DB
S3 uploads every 15 minutes
Airflow kicks off a Spark message scoring job every hour
Spark job writes scored messages and stats to another S3 bucket
This triggers SNS/SQS message events
An Autoscale Group (ASG) of Importers spins up when it detects SQS messages
The importers rapidly ingest scored messages and aggregate statistics into the DB
Users receive alerts of untrusted emails & can review them in the web app
Airflow manages the entire process
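The two cadences on these slides (uploads every 15 minutes, scoring every hour) imply that each scoring run covers several upload batches. The sketch below works through that arithmetic. It is an illustration only: it assumes uploads land exactly on 15-minute boundaries and that a run covers the half-open window from its start to one hour later, neither of which the slides state. The function name and the example date are hypothetical.

```python
from datetime import datetime, timedelta

UPLOAD_INTERVAL = timedelta(minutes=15)  # from the slides: S3 uploads every 15 minutes
SCORING_INTERVAL = timedelta(hours=1)    # from the slides: Spark scoring job every hour


def upload_slots(run_start):
    """List the upload times that fall inside one scoring window.

    Assumption (for illustration): uploads arrive exactly every 15 minutes
    starting at run_start, and the window is [run_start, run_start + 1 hour).
    """
    slots = []
    t = run_start
    while t < run_start + SCORING_INTERVAL:
        slots.append(t)
        t += UPLOAD_INTERVAL
    return slots


# Arbitrary example timestamp, not taken from the talk.
window = upload_slots(datetime(2018, 4, 10, 9, 0))
for slot in window:
    print(slot.isoformat())
```

Under these assumptions each hourly run sees four upload batches, so the scoring job has to be written to process a batch of uploads rather than a single file.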
Diagram components: Webserver, Scheduler, Worker, Worker, Worker, Meta DB, Celery / RabbitMQ
… DAGs using the Airflow UI!
… scheduling metadata in the metadata DB
… schedules and distributes work over Celery / RabbitMQ
… Airflow tasks over Celery
… Airflow tasks from RabbitMQ
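The scheduler-to-worker hand-off in this diagram can be sketched with a plain in-process queue. This is a toy model, not Airflow or Celery code: queue.Queue stands in for Celery / RabbitMQ, threads stand in for worker processes, and the function name run_tasks and the task names are hypothetical. The three workers mirror the three Worker boxes on the slide.

```python
import queue
import threading


def run_tasks(task_names, num_workers=3):
    """Enqueue tasks and let worker threads drain the queue.

    Returns a dict mapping each task name to the id of the worker
    that picked it up.
    """
    broker = queue.Queue()  # stand-in for Celery / RabbitMQ (assumption)
    results = {}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            task = broker.get()
            if task is None:  # sentinel: no more work for this worker
                broker.task_done()
                return
            with lock:
                results[task] = worker_id
            broker.task_done()

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()

    # "Scheduler" role: distribute work by putting it on the queue.
    for name in task_names:
        broker.put(name)
    # One sentinel per worker so every thread exits cleanly.
    for _ in threads:
        broker.put(None)

    broker.join()
    for t in threads:
        t.join()
    return results


assignments = run_tasks(["score", "import", "alert", "report"])
print(sorted(assignments))
```

The point of the queue is decoupling: the scheduler never addresses a specific worker, so workers can be added or removed without changing the scheduling side.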