Modern day workflow management - BUILDING DATA ENGINEERING PIPELINES IN PYTHON - PowerPoint PPT Presentation



SLIDE 1

Modern day workflow management

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Oliver Willekens

Data Engineer at Data Minded

SLIDE 2

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

What is a workflow?

A workflow:

  • Sequence of tasks
  • Scheduled at a time or triggered by an event
  • Orchestrates data processing pipelines

SLIDE 3

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Scheduling with cron

Cron reads “crontab” files: they tabulate the tasks to be executed at certain times

  • One task per line

*/15 9-17 * * 1-3,5 log_my_activity


SLIDE 10

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Scheduling with cron

# Minutes   Hours   Days   Months   Days of the week   Command
*/15        9-17    *      *        1-3,5              log_my_activity
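The five time fields above gate when the command runs. As a minimal pure-Python sketch of the field semantics (this is not cron itself, and it only handles the syntaxes appearing in this expression: `*`, `*/step`, `a-b` ranges, and comma-separated lists):

```python
def field_matches(field: str, value: int) -> bool:
    """Check whether one crontab field matches a value."""
    for part in field.split(","):
        if part == "*":
            return True                       # wildcard: any value
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:    # step: every n units
                return True
        elif "-" in part:
            lo, hi = map(int, part.split("-"))
            if lo <= value <= hi:             # inclusive range
                return True
        elif int(part) == value:              # single value
            return True
    return False

# '*/15 9-17 * * 1-3,5': every 15 minutes, during hours 9-17, Mon-Wed and Fri
print(field_matches("*/15", 30))   # True: 30 is a multiple of 15
print(field_matches("9-17", 8))    # False: 8 is outside 9-17
print(field_matches("1-3,5", 5))   # True: Friday is in the comma list
```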

Cron is a dinosaur. Modern workflow managers:

  • Luigi (Spotify, 2011, Python-based)
  • Azkaban (LinkedIn, 2009, Java-based)
  • Airflow (Airbnb, 2015, Python-based)

SLIDE 11

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Apache Airflow fulfills modern engineering needs

  • 1. Create and visualize complex workflows,
  • 2. Monitor and log workflows,
  • 3. Scale horizontally.
SLIDE 14

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

The Directed Acyclic Graph (DAG)


SLIDE 16

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

The Directed Acyclic Graph in code

from datetime import datetime

from airflow import DAG

my_dag = DAG(
    dag_id="publish_logs",
    schedule_interval="* * * * *",
    start_date=datetime(2010, 1, 1)
)

SLIDE 17

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Classes of operators

The Airflow task:

  • An instance of an Operator class
  • Inherits from BaseOperator -> must implement the execute() method
  • Performs a specific action (delegation):

  • BashOperator -> run a bash command/script
  • PythonOperator -> run a Python script
  • SparkSubmitOperator -> submit a Spark job to a cluster
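The inheritance contract can be sketched without Airflow installed. Here `BaseOperator` is a hypothetical stand-in for `airflow.models.BaseOperator` (far simpler than the real class), just to show the execute() delegation pattern:

```python
class BaseOperator:
    """Stand-in for airflow.models.BaseOperator (illustration only)."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self, context):
        # Subclasses must override this with their specific action.
        raise NotImplementedError("subclasses must implement execute()")


class EchoOperator(BaseOperator):
    """A toy operator: performs one specific action when executed."""

    def __init__(self, task_id, message):
        super().__init__(task_id)
        self.message = message

    def execute(self, context):
        return f"[{self.task_id}] {self.message}"


task = EchoOperator(task_id="greet", message="Hello")
print(task.execute(context={}))  # [greet] Hello
```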

SLIDE 18

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Expressing dependencies between operators

dag = DAG(…)
task1 = BashOperator(…)
task2 = PythonOperator(…)
task3 = PythonOperator(…)

task1.set_downstream(task2)
task3.set_upstream(task2)

# equivalent, but shorter:
# task1 >> task2
# task3 << task2

# Even clearer:
# task1 >> task2 >> task3
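The bitshift shorthand works because Airflow's tasks overload `__rshift__` and `__lshift__`. A toy reimplementation of the idea (not Airflow's actual code) showing why `task1 >> task2 >> task3` chains:

```python
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def set_downstream(self, other):
        self.downstream.append(other)

    def set_upstream(self, other):
        other.set_downstream(self)

    def __rshift__(self, other):    # task1 >> task2
        self.set_downstream(other)
        return other                # returning `other` enables chaining

    def __lshift__(self, other):    # task1 << task2
        self.set_upstream(other)
        return other


t1, t2, t3 = Task("t1"), Task("t2"), Task("t3")
t1 >> t2 >> t3   # evaluates left to right: (t1 >> t2) returns t2, then t2 >> t3

print([t.task_id for t in t1.downstream])  # ['t2']
print([t.task_id for t in t2.downstream])  # ['t3']
```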

SLIDE 19

Let’s practice!

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

SLIDE 20

Building a data pipeline with Airflow

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Oliver Willekens

Data Engineer at Data Minded

SLIDE 21

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Airflow’s BashOperator

Executes bash commands. Airflow adds logging, retry options and metrics over running this yourself.

from airflow.operators.bash_operator import BashOperator

bash_task = BashOperator(
    task_id='greet_world',
    dag=dag,
    bash_command='echo "Hello, world!"'
)

SLIDE 22

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Airflow’s PythonOperator

Executes Python callables

from airflow.operators.python_operator import PythonOperator
from my_library import my_magic_function

python_task = PythonOperator(
    dag=dag,
    task_id='perform_magic',
    python_callable=my_magic_function,
    op_kwargs={"snowflake": "*", "amount": 42}
)
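At execution time the operator passes those keyword arguments to the callable, in essence `python_callable(**op_kwargs)`. A sketch with a hypothetical stand-in for `my_library.my_magic_function`:

```python
def my_magic_function(snowflake, amount):
    """Hypothetical stand-in for my_library.my_magic_function."""
    return snowflake * amount  # repeat the marker `amount` times

op_kwargs = {"snowflake": "*", "amount": 42}

# What PythonOperator does when the task runs, in essence:
result = my_magic_function(**op_kwargs)
print(result == "*" * 42)  # True: the callable received both kwargs
```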

SLIDE 23

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Running PySpark from Airflow

BashOperator:

spark_master = (
    "spark://"
    "spark_standalone_cluster_ip"
    ":7077"
)

command = (
    "spark-submit "
    "--master {master} "
    "--py-files package1.zip "
    "/path/to/app.py"
).format(master=spark_master)

BashOperator(bash_command=command, …)
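Because the command is assembled from adjacent string literals (which Python concatenates) plus `.format()`, it is easy to verify exactly what BashOperator will hand to bash:

```python
# Same string assembly as above: adjacent literals concatenate.
spark_master = (
    "spark://"
    "spark_standalone_cluster_ip"
    ":7077"
)
command = (
    "spark-submit "
    "--master {master} "
    "--py-files package1.zip "
    "/path/to/app.py"
).format(master=spark_master)

print(command)
# spark-submit --master spark://spark_standalone_cluster_ip:7077 --py-files package1.zip /path/to/app.py
```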

SSHOperator

from airflow.contrib.operators.ssh_operator import SSHOperator

task = SSHOperator(
    task_id='ssh_spark_submit',
    dag=dag,
    command=command,
    ssh_conn_id='spark_master_ssh'
)

SLIDE 24

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Running PySpark from Airflow

SparkSubmitOperator

from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

spark_task = SparkSubmitOperator(
    task_id='spark_submit_id',
    dag=dag,
    application="/path/to/app.py",
    py_files="package1.zip",
    conn_id='spark_default'
)

SSHOperator

from airflow.contrib.operators.ssh_operator import SSHOperator

task = SSHOperator(
    task_id='ssh_spark_submit',
    dag=dag,
    command=command,
    ssh_conn_id='spark_master_ssh'
)

SLIDE 25

Let’s practice!

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

SLIDE 26

Deploying Airflow

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Oliver Willekens

Data Engineer at Data Minded

SLIDE 27

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Installing and configuring Airflow

export AIRFLOW_HOME=~/airflow
pip install apache-airflow
airflow initdb

[core]
# lots of other configuration settings
# …
# The executor class that airflow should use.
# Choices include SequentialExecutor,
# LocalExecutor, CeleryExecutor, DaskExecutor,
# KubernetesExecutor
executor = SequentialExecutor

SLIDE 28

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Setting up for production

  • dags: a place to store the DAGs (configurable)
  • tests: unit tests of the possible deployment, possibly ensuring consistency across DAGs
  • plugins: store custom operators and hooks
  • connections, pools, variables: provide a location for various configuration files you can import into Airflow
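A repository laid out along those lines might look like this (an illustrative sketch; dags and plugins are locations Airflow itself reads, with configurable paths, while the rest are project conventions):

```
airflow-project/
├── dags/          # DAG definition files
├── tests/         # deployment and consistency tests
├── plugins/       # custom operators and hooks
├── connections/   # importable connection definitions
├── pools/         # importable pool definitions
└── variables/     # importable variable definitions
```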

SLIDE 29

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Example Airflow deployment test

from airflow.models import DagBag

def test_dagbag_import():
    """Verify that Airflow will be able to import all DAGs in the repository."""
    dagbag = DagBag()
    number_of_failures = len(dagbag.import_errors)
    assert number_of_failures == 0, \
        "There should be no DAG failures. Got: %s" % dagbag.import_errors

SLIDE 30

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Transferring DAGs and plugins


SLIDE 34

Let’s practice!

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

SLIDE 35

Final thoughts

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Oliver Willekens

Data Engineer at Data Minded

SLIDE 36

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

What you learned

  • Define the purpose of the components of data platforms
  • Write an ingestion pipeline using Singer
  • Create and deploy pipelines for big data in Spark
  • Configure automated testing using CircleCI
  • Manage and deploy a full data pipeline with Airflow

SLIDE 37

BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Additional resources

External resources:

  • Singer: https://www.singer.io/
  • Apache Spark: https://spark.apache.org/
  • Pytest: https://pytest.org/en/latest/
  • Flake8: http://flake8.pycqa.org/en/latest/
  • Circle CI: https://circleci.com/
  • Apache Airflow: https://airflow.apache.org/

DataCamp courses:

  • Software engineering: https://www.datacamp.com/courses/software-engineering-for-data-scientists-in-python
  • Spark: https://www.datacamp.com/courses/cleaning-data-with-apache-spark-in-python (and other courses)
  • Unit testing: link yet to be revealed

SLIDE 38

Congratulations!

BUILDING DATA ENGINEERING PIPELINES IN PYTHON