Building (Better) Data Pipelines using Apache Airflow - Sid Anand



  1. Building (Better) Data Pipelines using Apache Airflow, Sid Anand (@r39132), QCon.AI 2018

  2. About Me Work [ed | s] @ Co-Chair for Maintainer of Spare time

  3. Apache Airflow What is it?

  4. Apache Airflow : What is it? In a : Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs or Directed Acyclic Graphs)

  5. Apache Airflow UI Walk-Through

  6. Apache Airflow : UI Walk-through

  7. Airflow - Authoring DAGs Airflow : Visualizing a DAG

  8. Airflow - Authoring DAGs Airflow : Author DAGs in Python! No need to bundle many XML files!
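
To make "Author DAGs in Python" concrete, here is a minimal sketch of a DAG file. It is not taken from the talk: it assumes Apache Airflow 2.x is installed, and the DAG id, task ids and shell commands are hypothetical placeholders. The Airflow imports sit inside the function so the file can be read without Airflow present. Newer Airflow releases prefer a `schedule` argument over `schedule_interval`, so treat the argument name as version-dependent.

```python
from datetime import datetime, timedelta

# Retry settings applied to every task in the DAG (values are illustrative).
DEFAULT_ARGS = {
    "retries": 2,                         # re-run a failed task up to twice
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}


def build_example_dag():
    """Build a two-task DAG. Assumes Apache Airflow 2.x is installed."""
    # Imported lazily so this module loads even where Airflow is absent.
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    dag = DAG(
        dag_id="example_hourly_pipeline",  # hypothetical name
        start_date=datetime(2018, 1, 1),
        schedule_interval="@hourly",       # Airflow 2.x spelling
        catchup=False,                     # do not backfill past runs
        default_args=DEFAULT_ARGS,
    )

    extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
    load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

    # Dependencies are plain Python: load runs only after extract succeeds.
    extract >> load
    return dag


# In a real DAG file the function would be called at module level so the
# scheduler can discover the DAG object, e.g.:
#     dag = build_example_dag()
```

The whole workflow, including schedule, retries and dependencies, lives in one Python file, which is the contrast with XML bundles that the slide draws.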

  9. Airflow - Authoring DAGs Airflow : The Tree View offers a view of DAG Runs over time!

  10. Airflow - Performance Insights Airflow : Gantt charts reveal the slowest tasks for a run!

  11. Airflow - Performance Insights Airflow : …And we can easily see performance trends over time

  12. Apache Airflow Why use it?

  13. Apache Airflow : Why use it? When would you use a Workflow Scheduler like Airflow?
      • ETL Pipelines
      • Machine Learning Pipelines
      • Predictive Data Pipelines
        • Fraud Detection, Scoring/Ranking, Classification, Recommender System, etc…
      • General Job Scheduling (e.g. Cron)
        • DB Back-ups, Scheduled code/config deployment

  14. Apache Airflow : Why use it? What should a Workflow Scheduler do well?
      • Schedule a graph of dependencies, where Workflow = A DAG of Tasks
      • Handle task failures
      • Report / Alert on failures
      • Monitor performance of tasks over time
      • Enforce SLAs
        • E.g. Alerting if time or correctness SLAs are not met
      • Easily scale for growing load
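
The first three requirements above (dependency ordering, failure handling, alerting) can be sketched in a few lines of standard-library Python. This is not how Airflow is implemented; `run_workflow` and its parameters are hypothetical names for illustration, and the "alert" is just a printed line.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies, max_retries=2):
    """Run callables in dependency order, retrying failures.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of names that must finish first
    Returns (completed, failed) lists of task names.
    """
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    # static_order() raises CycleError if the graph is not acyclic.
    order = TopologicalSorter(graph).static_order()

    completed, failed = [], []
    for name in order:
        # A task whose upstream failed is not run; it is recorded as failed.
        if any(dep in failed for dep in graph[name]):
            failed.append(name)
            continue
        for attempt in range(1 + max_retries):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception as exc:
                # Stand-in for a real alerting channel.
                print(f"ALERT: {name} attempt {attempt + 1} failed: {exc}")
        else:
            # Every attempt raised: give up on this task.
            failed.append(name)
    return completed, failed
```

For example, with tasks `extract`, `score` and `load` chained in that order, a `score` task that fails once and then succeeds still yields all three in `completed`, while a task that fails on every attempt drags its downstream tasks into `failed`.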

  15. Apache Airflow : Why use it? What Does Apache Airflow Add?
      • Configuration-as-code
      • Usability - Stunning UI / UX
      • Centralized configuration
      • Resource Pooling
      • Extensibility

  16. Use-Case : Message Scoring Batch Pipeline Architecture

  17. Use-Case : Message Scoring. S3 uploads every 15 minutes (diagram labels: enterprise A, enterprise B, enterprise C, S3)

  18. Use-Case : Message Scoring. Airflow kicks off a Spark message scoring job every hour (diagram labels: enterprise A, enterprise B, enterprise C, S3)

  19. Use-Case : Message Scoring. Spark job writes scored messages and stats to another S3 bucket (diagram labels: enterprise A, enterprise B, enterprise C, S3, S3)

  20. Use-Case : Message Scoring. This triggers SNS/SQS messages events (diagram labels: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS)

  21. Use-Case : Message Scoring. An Autoscale Group (ASG) of Importers spins up when it detects SQS messages (diagram labels: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, ASG Importers)

  22. Use-Case : Message Scoring. The importers rapidly ingest scored messages and aggregate statistics into the DB (diagram labels: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, ASG Importers, DB)

  23. Use-Case : Message Scoring. Users receive alerts of untrusted emails & can review them in the web app (diagram labels: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, ASG Importers, DB)

  24. Use-Case : Message Scoring. Airflow manages the entire process (diagram labels: enterprise A, enterprise B, enterprise C, S3, S3, SNS, SQS, ASG Importers, DB)
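
The use case pairs 15-minute S3 uploads with an hourly scoring job, so each hourly run has four upload windows to cover. A small sketch of that arithmetic follows. The S3 key layout and the function names are hypothetical; the slides do not say how the bucket is organised.

```python
from datetime import datetime, timedelta

UPLOAD_INTERVAL = timedelta(minutes=15)  # from the slides: uploads every 15 minutes


def upload_windows_for_hour(hour_start):
    """Return the start times of the four 15-minute windows in one hourly run."""
    # Snap to the top of the hour so any timestamp inside the hour works.
    hour_start = hour_start.replace(minute=0, second=0, microsecond=0)
    return [hour_start + i * UPLOAD_INTERVAL for i in range(4)]


def s3_prefixes_for_hour(enterprise, hour_start):
    """Build key prefixes under a hypothetical <enterprise>/YYYY/MM/DD/HHMM/ layout."""
    return [
        f"{enterprise}/{window:%Y/%m/%d/%H%M}/"
        for window in upload_windows_for_hour(hour_start)
    ]
```

Under that assumed layout, a scoring run for the 09:00 hour would read the `0900`, `0915`, `0930` and `0945` prefixes for each enterprise.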

  25. Airflow DAG

  26. Apache Airflow Incubating

  27. Apache Airflow : Incubating
      Timeline
      • Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
      • Max launched it @ Hadoop Summit in Summer 2015
      • On 3/31/2016, Airflow -> Apache Incubator
      Today
      • 2400+ Forks
      • 7600+ GitHub Stars
      • 430+ Contributors
      • 150+ companies officially using it!
      • 14 Committers/Maintainers <- We’re growing here

  28. Thank You!

  29. Apache Airflow Behind the Scenes

  30. Apache Airflow : Behind the Scenes Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs). It ships with a
      • DAG Scheduler
      • Web application (UI)
      • Powerful CLI
      • Celery Workers!

  31. Apache Airflow : Behind the Scenes
      1. A user schedules / manages DAGs using the Airflow UI!
      2. Airflow’s webserver stores scheduling metadata in the metadata DB
      3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
      4. Airflow workers pick up Airflow tasks over Celery
      (diagram labels: Webserver, Scheduler, Meta DB, Celery / RabbitMQ, Worker, Worker, Worker)

  32. Apache Airflow : Behind the Scenes (same four steps and diagram as slide 31)

  33. Apache Airflow : Behind the Scenes (same four steps and diagram as slide 31)

  34. Apache Airflow : Behind the Scenes
      1. A user schedules / manages DAGs using the Airflow UI!
      2. Airflow’s webserver stores scheduling metadata in the metadata DB
      3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
      4. Airflow workers pick up Airflow tasks from RabbitMQ
      (diagram labels: Webserver, Scheduler, Meta DB, Celery / RabbitMQ, Worker, Worker, Worker)
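
The scheduler-to-worker hand-off described on the Behind the Scenes slides can be imitated with the standard library alone. This is an analogy, not Airflow or Celery code: `queue.Queue` stands in for the RabbitMQ broker, threads stand in for Celery workers, and `run_with_workers` is a hypothetical name.

```python
import queue
import threading


def run_with_workers(task_names, num_workers=3):
    """Simulate a scheduler enqueuing tasks and workers picking them up.

    Returns a dict mapping each task name to the worker that ran it.
    """
    broker = queue.Queue()   # stand-in for the message broker
    results = {}
    lock = threading.Lock()  # protects the shared results dict

    def worker(worker_id):
        while True:
            task = broker.get()
            if task is None:         # sentinel: no more work for this worker
                broker.task_done()
                return
            with lock:
                results[task] = f"worker-{worker_id}"  # "run" the task
            broker.task_done()

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()

    # "Scheduler": publish every ready task to the broker.
    for name in task_names:
        broker.put(name)
    # One sentinel per worker so every thread exits cleanly.
    for _ in threads:
        broker.put(None)

    broker.join()            # wait until every queued item is processed
    for t in threads:
        t.join()
    return results
```

The scheduler never calls a worker directly. It only publishes to the queue, which is why workers can be added or removed without changing the scheduling code.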

  35. Thank You!
