Modern day workflow management
BUILDING DATA ENGINEERING PIPELINES IN PYTHON
Oliver Willekens
Data Engineer at Data Minded
What is a workflow?
A workflow:
- Sequence of tasks
- Scheduled at a time or triggered by an event
- Orchestrates data processing pipelines
Cron reads “crontab” files: they tabulate the tasks to be executed at certain times. For example:
# Minutes   Hours   Days   Months   Days of the week   Command
  */15      9-17    *      *        1-3,5              log_my_activity

This entry runs log_my_activity every 15 minutes between 09:00 and 17:59, on Monday through Wednesday and on Friday.
Cron is a dinosaur. Modern workflow managers:
- Luigi (Spotify, 2011, Python-based)
- Azkaban (LinkedIn, 2009, Java-based)
- Airflow (Airbnb, 2015, Python-based)
from datetime import datetime
from airflow import DAG

my_dag = DAG(
    dag_id="publish_logs",
    schedule_interval="* * * * *",
    start_date=datetime(2010, 1, 1)
)
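schedule_interval accepts Airflow presets as well as cron expressions: "@daily" is shorthand for "0 0 * * *". A minimal sketch reusing the imports above (the dag_id is made up for illustration):

# "@daily" preset: run once per day, at midnight.
daily_dag = DAG(
    dag_id="daily_publish",
    schedule_interval="@daily",
    start_date=datetime(2010, 1, 1)
)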
The Airflow task:
- An instance of an Operator class
- Inherits from BaseOperator, so it must implement the execute() method (see the sketch below)
- Performs a specific action (delegation):
  - BashOperator -> runs a bash command/script
  - PythonOperator -> runs a Python script
  - SparkSubmitOperator -> submits a Spark job to a cluster
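A minimal sketch of that contract, assuming Airflow 1.x import paths to match the contrib imports used later; the operator name and greeting parameter are hypothetical:

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class GreetOperator(BaseOperator):
    """Toy operator: logs a greeting when its task instance runs."""

    @apply_defaults
    def __init__(self, greeting="Hello", **kwargs):
        super(GreetOperator, self).__init__(**kwargs)
        self.greeting = greeting

    def execute(self, context):
        # Airflow calls execute() when the task runs; logging,
        # retries and metrics come for free from BaseOperator.
        self.log.info("%s from task %s", self.greeting, self.task_id)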
dag = DAG(…)

task1 = BashOperator(…)
task2 = PythonOperator(…)
task3 = PythonOperator(…)

task1.set_downstream(task2)
task3.set_upstream(task2)

# equivalent, but shorter:
# task1 >> task2
# task3 << task2

# Even clearer:
# task1 >> task2 >> task3
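In recent Airflow versions the bitshift operators also accept lists, which is handy for fan-out; a brief sketch:

# task1 runs first; task2 and task3 both become its downstream tasks.
task1 >> [task2, task3]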
The BashOperator executes bash commands. Airflow adds logging, retry options and metrics over running this yourself.
from airflow.operators.bash_operator import BashOperator

bash_task = BashOperator(
    task_id='greet_world',
    dag=dag,
    bash_command='echo "Hello, world!"'
)
The PythonOperator executes Python callables.
from airflow.operators.python_operator import PythonOperator
from my_library import my_magic_function

python_task = PythonOperator(
    dag=dag,
    task_id='perform_magic',
    python_callable=my_magic_function,
)
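When the callable needs arguments, the PythonOperator can forward them through its op_args and op_kwargs parameters. A small sketch (the function and task names are made up):

def send_greeting(name, greeting):
    print("%s, %s!" % (greeting, name))

greet_task = PythonOperator(
    dag=dag,
    task_id='greet_with_args',
    python_callable=send_greeting,
    op_args=['world'],
    op_kwargs={'greeting': 'Hello'},
)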
Submitting a Spark job with the BashOperator:
spark_master = (
    "spark://"
    "spark_standalone_cluster_ip"
    ":7077"
)

command = (
    "spark-submit "
    "--master {master} "
    "--py-files package1.zip "
    "/path/to/app.py"
).format(master=spark_master)

BashOperator(bash_command=command, …)
SSHOperator
from airflow.contrib.operators.ssh_operator import SSHOperator

task = SSHOperator(
    task_id='ssh_spark_submit',
    dag=dag,
    command=command,
    ssh_conn_id='spark_master_ssh'
)
SparkSubmitOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

spark_task = SparkSubmitOperator(
    task_id='spark_submit_id',
    dag=dag,
    application="/path/to/app.py",
    py_files="package1.zip",
    conn_id='spark_default'
)
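Of the three approaches, the SparkSubmitOperator keeps cluster details in an Airflow connection (conn_id) instead of hard-coding a master URL into a shell command, which makes the DAG easier to move between environments.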
export AIRFLOW_HOME=~/airflow
pip install apache-airflow
airflow initdb

In airflow.cfg:

[core]
# lots of other configuration settings
# …
# The executor class that airflow should use.
# Choices include SequentialExecutor,
# LocalExecutor, CeleryExecutor, DaskExecutor,
# KubernetesExecutor
executor = SequentialExecutor
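The SequentialExecutor runs tasks one at a time, which is fine for development; the LocalExecutor runs tasks in parallel on a single machine, and the Celery, Dask and Kubernetes executors distribute tasks across multiple workers.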
A typical Airflow repository layout:
- dags: place to store the DAGs (configurable)
- tests: unit test the possible deployment, possibly ensure consistency across DAGs
- plugins: store custom operators and hooks
- connections, pools, variables: provide a location for various configuration files you can import into Airflow
from airflow.models import DagBag

def test_dagbag_import():
    """Verify that Airflow will be able to import all DAGs in the repository."""
    dagbag = DagBag()
    number_of_failures = len(dagbag.import_errors)
    assert number_of_failures == 0, \
        "There should be no DAG failures. Got: %s" % dagbag.import_errors
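This import test pairs well with per-DAG consistency checks, the "ensure consistency across DAGs" idea above. A sketch using pytest, assuming the house rule you want to enforce is that every DAG defines a start_date (the rule itself is only an example):

import pytest
from airflow.models import DagBag

dagbag = DagBag()

@pytest.mark.parametrize("dag_id", sorted(dagbag.dags))
def test_dag_has_start_date(dag_id):
    # Enforce the rule on every DAG the scheduler can see.
    dag = dagbag.dags[dag_id]
    assert dag.start_date is not None, \
        "DAG %s is missing a start_date" % dag_id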
In this course, you learned to:
- Define the purpose of components of data platforms
- Write an ingestion pipeline using Singer
- Create and deploy pipelines for big data in Spark
- Configure automated testing using CircleCI
- Manage and deploy a full data pipeline with Airflow
External resources:
- Singer: https://www.singer.io/
- Apache Spark: https://spark.apache.org/
- pytest: https://pytest.org/en/latest/
- flake8: http://flake8.pycqa.org/en/latest/
- CircleCI: https://circleci.com/
- Apache Airflow: https://airflow.apache.org/

DataCamp courses:
- Software engineering: https://www.datacamp.com/courses/software-engineering-for-data-scientists-in-python
- Spark: https://www.datacamp.com/courses/cleaning-data-with-apache-spark-in-python (and other courses)
- Unit testing: link yet to be revealed