Manageable data pipelines with Airflow (and Kubernetes)



SLIDE 1

@higrys, @sprzedwojski GDG DevFest Warsaw 2018

Manageable data pipelines with Airflow

(and Kubernetes)

SLIDE 2

SLIDE 3

Airflow

Airflow is a platform to programmatically author, schedule and monitor workflows.

  • Dynamic/Elegant
  • Extensible
  • Scalable

SLIDE 4

Workflows

Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a

SLIDE 5

Companies using Airflow

(>200 officially)

SLIDE 6

Data Pipeline

https://xkcd.com/2054/

SLIDE 7

Airflow vs. other workflow platforms

  • Programming workflows
      ○ writing code not XML
      ○ versioning as usual
      ○ automated testing as usual
      ○ complex dependencies between tasks
  • Managing workflows
      ○ aggregate logs in one UI
      ○ tracking execution
      ○ re-running, backfilling (run all missed runs)
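Backfilling, as listed above, amounts to enumerating the schedule slots that have not been run yet. A minimal sketch in plain Python, assuming a daily schedule; `missed_runs` and its parameters are hypothetical names for illustration, not part of the Airflow API:

```python
from datetime import date, timedelta


def missed_runs(last_run, today):
    """Return the daily run dates after last_run, up to and including today."""
    days = (today - last_run).days
    return [last_run + timedelta(days=i) for i in range(1, days + 1)]


# Hypothetical dates: the last successful run was on 1 December.
runs = missed_runs(date(2018, 12, 1), date(2018, 12, 4))
print(runs)
```

A scheduler that backfills would then trigger one run per returned date.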

SLIDE 8

Airflow use cases

  • ETL jobs
  • ML pipelines
  • Regular operations:
      ○ Delivering data
      ○ Performing backups
  • ...
SLIDE 9

Core concepts - Directed Acyclic Graph (DAG)

Source: https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/example_dags/example_twitter_README.md
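To make the DAG idea concrete: because the graph has no cycles, the tasks can always be put into an order in which every task runs after its dependencies. The sketch below uses the standard-library `graphlib` module (Python 3.9+), not Airflow itself, and the task names are made up for illustration:

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps a task to the set of tasks it depends on (hypothetical names).
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A valid execution order: every task appears after its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)


def has_cycle(graph):
    """Return True if the graph cannot be ordered, i.e. it is not a DAG."""
    try:
        list(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True
```

A graph such as `{"a": {"b"}, "b": {"a"}}` has no valid order, which is why a workflow definition has to be acyclic.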

SLIDE 10

Core concepts - Operators

Source: https://blog.usejournal.com/testing-in-airflow-part-1-dag-validation-tests-dag-definition-tests-and-unit-tests-2aa94970570c

SLIDE 11

Operator types

  • Action Operators
      ○ Python, Bash, Docker, GCEInstanceStart, ...
  • Sensor Operators
      ○ S3KeySensor, HivePartitionSensor, BigtableTableWaitForReplicationOperator, ...
  • Transfer Operators
      ○ MsSqlToHiveTransfer, RedshiftToS3Transfer, ...

SLIDE 12

Operator and Sensor

class ExampleOperator(BaseOperator):
    def execute(self, context):
        # Do something
        pass

SLIDE 13

Operator and Sensor

class ExampleOperator(BaseOperator):
    def execute(self, context):
        # Do something
        pass

class ExampleSensorOperator(BaseSensorOperator):
    def poke(self, context):
        # Check if the condition occurred
        return True
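The difference between the two classes is who drives them: `execute` runs once, while `poke` is called repeatedly until it returns True. The loop below is a simplified stand-in for that behaviour in plain Python, not Airflow's actual implementation; `run_sensor`, `poke_interval`, and `timeout` are names assumed for this sketch:

```python
import time


def run_sensor(poke, poke_interval=0.01, timeout=1.0):
    """Call poke() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError("sensor condition was not met in time")
        time.sleep(poke_interval)


attempts = []


def file_arrived():
    """Hypothetical condition that becomes true on the third check."""
    attempts.append(1)
    return len(attempts) >= 3


run_sensor(file_arrived)
print(len(attempts))
```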

SLIDE 14

Operator good practices

  • Idempotent
  • Atomic
  • No direct data sharing
      ○ Small portions of data between tasks: XCOMs
      ○ Large amounts of data: S3, GCS, etc.
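One way to read "idempotent" and "atomic" for a task that writes a file: running it twice leaves the same result, and a failure halfway through never leaves a partial file behind. The sketch below shows this with the standard library only; `write_report` is a hypothetical helper, not something from the talk:

```python
import os
import tempfile


def write_report(path, content):
    """Write content to path atomically; re-running yields the same file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        # Rename over the target; atomic when both are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file so no partial output is left behind.
        os.unlink(tmp_path)
        raise
```

Because the target is overwritten rather than appended to, a retry of the same task produces the same file.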

SLIDE 15

Core concepts - Tasks, TaskInstances, DagRuns

Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a

SLIDE 16

Show me the code!

SLIDE 17

Source: https://www.logolynx.com/images/logolynx/0b/0b42e766caee6dcd7355c1c95ddaaa1c.png

SLIDE 18

Source: http://www.faicoach.com/wp-content/uploads/2017/10/cash-burn.jpg

SLIDE 19

The solution

Sources:
https://services.garmin.cn/appsLibraryBusinessServices_v0/rest/apps/9b5dabf3-925b
https://malloc.fi/static/images/slack-memory-management.png
https://i.gifer.com/9GXs.gif

SLIDE 20

Solution components

  • Generic
      ○ BashOperator
      ○ PythonOperator
  • Specific
      ○ EmailOperator

SLIDE 21

The DAG

SLIDE 22

Initialize DAG

dag = DAG(
    dag_id='gcp_spy',
    ...
)

SLIDE 23

Initialize DAG

dag = DAG(
    dag_id='gcp_spy',
    default_args={
        'start_date': utils.dates.days_ago(1),
        'retries': 1
    },
    ...
)

SLIDE 24

Initialize DAG

dag = DAG(
    dag_id='gcp_spy',
    default_args={
        'start_date': utils.dates.days_ago(1),
        'retries': 1
    },
    schedule_interval='0 16 * * *'
)
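The cron expression `0 16 * * *` reads as "minute 0, hour 16, every day", i.e. once a day at 16:00. The sketch below only illustrates that reading in plain Python; `next_daily_run` is a hypothetical helper, it ignores time zones, and it is not how Airflow itself evaluates schedules:

```python
from datetime import datetime, timedelta


def next_daily_run(now, hour=16):
    """Return the next time strictly after `now` that falls on hour:00."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


print(next_daily_run(datetime(2018, 12, 1, 15, 30)))
```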

SLIDE 25

List of instances

bash_task = BashOperator(
    task_id="gcp_service_list_instances_sql",
    ...
)

SLIDE 26

List of instances

bash_task = BashOperator(
    task_id="gcp_service_list_instances_sql",
    bash_command=(
        "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' "
        "| tr '\n' ' '"
    ),
    ...
)

SLIDE 27

List of instances

bash_task = BashOperator(
    task_id="gcp_service_list_instances_sql",
    bash_command=(
        "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' "
        "| tr '\n' ' '"
    ),
    xcom_push=True,
    ...
)

SLIDE 28

List of instances

bash_task = BashOperator(
    task_id="gcp_service_list_instances_sql",
    bash_command=(
        "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' "
        "| tr '\n' ' '"
    ),
    xcom_push=True,
    dag=dag
)
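The shell pipeline in `bash_command` does three things: `tail -n +2` drops the header row, `grep -oE '^[^ ]+'` keeps only the first column, and `tr '\n' ' '` joins the names onto a single line (useful if, as we understand the Airflow 1.10 BashOperator, only the last line of stdout is pushed to XCom). The sketch below mirrors those steps in Python; the sample listing is made up and is not real `gcloud` output, and `instance_names` is a hypothetical helper:

```python
# Made-up table in the general shape of a CLI listing (header + rows).
SAMPLE_LISTING = (
    "NAME      DATABASE_VERSION  LOCATION\n"
    "orders-db MYSQL_5_7         europe-west1-b\n"
    "users-db  POSTGRES_9_6      europe-west1-c\n"
)


def instance_names(listing):
    """Mimic: tail -n +2 | grep -oE '^[^ ]+' | tr '\\n' ' '."""
    rows = listing.splitlines()[1:]      # tail -n +2: skip the header row
    names = []
    for row in rows:
        head = row.split(" ", 1)[0]      # grep: leading run of non-spaces
        if head:                         # rows starting with a space match nothing
            names.append(head)
    # tr: every newline becomes a space, so a trailing space remains
    return "".join(name + " " for name in names)


print(repr(instance_names(SAMPLE_LISTING)))
```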

SLIDE 29

All services

GCP_SERVICES = [
    ('sql', 'Cloud SQL'),
    ('spanner', 'Spanner'),
    ('bigtable', 'BigTable'),
    ('compute', 'Compute Engine'),
]

SLIDE 30

List of instances - all services

????

bash_task = BashOperator(
    task_id="gcp_service_list_instances_sql",
    bash_command=(
        "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' "
        "| tr '\n' ' '"
    ),
    xcom_push=True,
    dag=dag
)

SLIDE 31

List of instances - all services

for gcp_service in GCP_SERVICES:
    bash_task = BashOperator(
        task_id="gcp_service_list_instances_{}".format(gcp_service[0]),
        bash_command=(
            "gcloud {} instances list | tail -n +2 | grep -oE '^[^ ]+' "
            "| tr '\n' ' '"
        ).format(gcp_service[0]),
        xcom_push=True,
        dag=dag
    )
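To see what the loop generates without running Airflow, the same string formatting can be evaluated on its own. In this sketch `task_specs` is a hypothetical dictionary used only for inspection. Note that adjacent string literals are joined before `.format` is called, so the placeholder in the first literal is filled even though `.format` is written after the second one:

```python
GCP_SERVICES = [
    ('sql', 'Cloud SQL'),
    ('spanner', 'Spanner'),
    ('bigtable', 'BigTable'),
    ('compute', 'Compute Engine'),
]

# Map each generated task_id to the shell command it would run.
task_specs = {}
for gcp_service in GCP_SERVICES:
    task_id = "gcp_service_list_instances_{}".format(gcp_service[0])
    command = (
        "gcloud {} instances list | tail -n +2 | grep -oE '^[^ ]+' "
        "| tr '\n' ' '"
    ).format(gcp_service[0])
    task_specs[task_id] = command

for task_id in task_specs:
    print(task_id)
```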

SLIDE 32

Send Slack message

send_slack_msg_task = PythonOperator(
    python_callable=send_slack_msg,
    provide_context=True,
    task_id='send_slack_msg_task',
    dag=dag
)

SLIDE 34

def send_slack_msg(**context):
    for gcp_service in GCP_SERVICES:
        result = context['task_instance'].\
            xcom_pull(task_ids='gcp_service_list_instances_{}'
                      .format(gcp_service[0]))
    ...

SLIDE 36

def send_slack_msg(**context):
    for gcp_service in GCP_SERVICES:
        result = context['task_instance'].\
            xcom_pull(task_ids='gcp_service_list_instances_{}'
                      .format(gcp_service[0]))
    data = ...
    ...

SLIDE 37

def send_slack_msg(**context):
    for gcp_service in GCP_SERVICES:
        result = context['task_instance'].\
            xcom_pull(task_ids='gcp_service_list_instances_{}'
                      .format(gcp_service[0]))
    data = ...
    requests.post(
        url=SLACK_WEBHOOK,
        data=json.dumps(data),
        headers={'Content-type': 'application/json'}
    )
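The slide elides how `data` is built. One possible shape, shown purely as an assumption, is a JSON body with a single `text` field, which Slack incoming webhooks accept. `build_slack_payload` and the sample values are hypothetical, and the sketch deliberately makes no network call:

```python
import json


def build_slack_payload(instances_by_service):
    """Return a JSON body for a Slack incoming webhook, one line per service."""
    lines = []
    for label, names in instances_by_service.items():
        # names is a space-separated string, or None if nothing was pushed
        listed = ", ".join((names or "").split()) or "none"
        lines.append("{}: {}".format(label, listed))
    return json.dumps({"text": "\n".join(lines)})


# Made-up values standing in for what the listing tasks pushed to XCom.
payload = build_slack_payload({
    "Cloud SQL": "orders-db users-db ",
    "Spanner": None,
})
print(payload)
```

Keeping the payload construction in a separate function like this also makes it testable without a webhook.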

SLIDE 38

Prepare email

prepare_email_task = PythonOperator(
    python_callable=prepare_email,
    provide_context=True,
    task_id='prepare_email_task',
    dag=dag
)

SLIDE 40

def prepare_email(**context):
    for gcp_service in GCP_SERVICES:
        result = context['task_instance'].\
            xcom_pull(task_ids='gcp_service_list_instances_{}'
                      .format(gcp_service[0]))
        ...
    html_content = ...
    context['task_instance'].xcom_push(key='email', value=html_content)
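The push and pull calls above can be pictured as writes and reads against a small key-value store, addressed by task id and key. The class below is a toy in-memory stand-in written for this explanation, not Airflow's XCom implementation; the `return_value` default mirrors the key Airflow uses for values a task returns, and the task ids and values are made up:

```python
class XComStore:
    """Toy in-memory stand-in for XCom: values keyed by (task_id, key)."""

    DEFAULT_KEY = "return_value"

    def __init__(self):
        self._values = {}

    def push(self, task_id, value, key=DEFAULT_KEY):
        self._values[(task_id, key)] = value

    def pull(self, task_id, key=DEFAULT_KEY):
        # Returns None when nothing was pushed under that task and key.
        return self._values.get((task_id, key))


store = XComStore()
store.push("gcp_service_list_instances_sql", "orders-db users-db ")
store.push("prepare_email_task", "<p>2 instances</p>", key="email")
print(store.pull("prepare_email_task", key="email"))
```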

SLIDE 42

Send email

send_email_task = EmailOperator(
    task_id='send_email',
    to='szymon.przedwojski@polidea.com',
    subject=INSTANCES_IN_PROJECT_TITLE,
    html_content=...,
    dag=dag
)

SLIDE 43

Send email

send_email_task = EmailOperator(
    task_id='send_email',
    to='szymon.przedwojski@polidea.com',
    subject=INSTANCES_IN_PROJECT_TITLE,
    html_content=(
        "{{ task_instance.xcom_pull(task_ids='prepare_email_task', key='email') }}"
    ),
    dag=dag
)

SLIDE 44

Dependencies

for gcp_service in GCP_SERVICES:
    bash_task = BashOperator(
        ...
    )
    bash_task >> send_slack_msg_task
    bash_task >> prepare_email_task

SLIDE 45

Dependencies

for gcp_service in GCP_SERVICES:
    bash_task = BashOperator(
        ...
    )
    bash_task >> send_slack_msg_task
    bash_task >> prepare_email_task

prepare_email_task >> send_email_task
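The `>>` syntax is ordinary Python operator overloading. The toy class below sketches how such an operator can record a "runs before" relationship; it is an illustration with hypothetical names, not Airflow's BaseOperator:

```python
class Task:
    """Toy task that records which tasks run after it."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # a >> b means: b runs after a
        self.downstream.append(other)
        return other  # returning the right-hand side allows a >> b >> c


list_sql = Task("list_sql")
send_slack = Task("send_slack")
prepare_email = Task("prepare_email")
send_email = Task("send_email")

list_sql >> send_slack
list_sql >> prepare_email >> send_email

print([t.task_id for t in list_sql.downstream])
```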

SLIDE 47

Demo

https://github.com/PolideaInternal/airflow-gcp-spy

SLIDE 48

Complex DAGs

Source: https://speakerdeck.com/pybay/2016-matt-davis-a-practical-introduction-to-airflow?slide=13

SLIDE 49

Complex, Manageable, DAGs

SLIDE 50

Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a

SLIDE 51

Single node

Local Executor

Diagram components: web server, RDBMS, DAGs, scheduler, local executors (multiprocessing)

SLIDE 52

Celery Executor

Controller: web server, RDBMS, DAGs, scheduler
Celery broker: RabbitMQ/Redis/Amazon SQS
Node 1, Node 2: DAGs, worker
Sync files (Chef/Puppet/Ansible/NFS)

SLIDE 53

(Beta): Kubernetes Executor

Controller: web server, RDBMS, DAGs, scheduler
Kubernetes cluster: Kubernetes master; Node 1, Node 2 running pods with DAGs
Package as pods
Sync files:

  • Git Init
  • Persistent Volume
  • Baked-in (future)

SLIDE 54

GCP - Composer

https://github.com/GoogleCloudPlatform/airflow-operator

SLIDE 55

Thank You!

Follow us @ polidea.com/blog

Sources:
https://techflourish.com/images/superman-animated-clipart.gif
https://airflow.apache.org/_images/pin_large.png

SLIDE 56

SLIDE 57

Questions? :)
