Data Orchestration with Apache Airflow (PowerPoint presentation)

SLIDE 1

Data Orchestration with Apache Airflow

SLIDE 3

Data driven

“empower the organization to seek more understanding, through data analytics, about their business processes”

SLIDE 5

Apache Airflow

“Airflow is a platform to programmatically author, schedule and monitor workflows.”

SLIDE 18

[Architecture diagram. Labels: Stream Data Sources, Batch Data Sources, Cloud Pub/Sub, Cloud Dataflow, Cloud Storage, BigQuery Friendly Data Lake, BigQuery Data Warehouse, External Data Sink]

SLIDE 20

Cloud Composer

SLIDE 27

Important ETL principles

  • Processes are idempotent and deterministic
  • Reusable, parameterisable components
  • Data partitioning (usually by date)
  • Dealing with alerts and SLAs
  • Globally consistent paths to data
  • Data at rest is immutable

“Data at rest to data at rest”
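The first and third principles (idempotent processes, partitioning by date) can be sketched with a delete-then-insert load per date partition. This is a minimal illustration using Python's built-in sqlite3 module rather than the Postgres setup from the talk; the table staging_customer, its columns, and the helper names load_partition and count_rows are assumptions made up for this example.

```python
import sqlite3


def load_partition(conn, partition_date, rows):
    """Replace one date partition of the (hypothetical) staging_customer table."""
    with conn:  # one transaction: the delete and the inserts commit together
        # Clearing the partition first is what makes a rerun safe.
        conn.execute(
            "DELETE FROM staging_customer WHERE partition_date = ?",
            (partition_date,),
        )
        conn.executemany(
            "INSERT INTO staging_customer (partition_date, customer_id, name) "
            "VALUES (?, ?, ?)",
            [(partition_date, cid, name) for cid, name in rows],
        )


def count_rows(conn):
    """Return the total number of rows across all partitions."""
    return conn.execute("SELECT COUNT(*) FROM staging_customer").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging_customer "
    "(partition_date TEXT, customer_id INTEGER, name TEXT)"
)

rows = [(1, "Alice"), (2, "Bob")]
load_partition(conn, "2018-01-01", rows)
load_partition(conn, "2018-01-01", rows)  # rerun of the same day, no duplicates
load_partition(conn, "2018-01-02", [(3, "Carol")])
```

Because each run only touches its own date partition, a failed or repeated run for one day can be restarted without affecting other days.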

SLIDE 28

extract_customer = PostgresToPostgresOperator(
    src_postgres_conn_id='postgres_oltp',
    dest_postgress_conn_id='postgres_dwh',
    sql='select_customer.sql',
    pg_table='staging.customer',
    parameters={"window_start_date": "{{ ds }}",
                "window_end_date": "{{ tomorrow_ds }}"},
    pool='postgres_dwh')
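In the operator call above, {{ ds }} and {{ tomorrow_ds }} are Airflow template variables: for a daily run they render to the run's date and the following date, both as YYYY-MM-DD strings. The sketch below mimics that rendering in plain Python to show which window a given run would receive. The helper extraction_window is hypothetical and not part of Airflow, and the contents of select_customer.sql are not shown in the slides, so how the query uses the two dates is left open.

```python
from datetime import date, timedelta


def extraction_window(execution_date):
    """Mimic what {{ ds }} and {{ tomorrow_ds }} render to for a daily run."""
    ds = execution_date.isoformat()  # e.g. '2018-01-31'
    tomorrow_ds = (execution_date + timedelta(days=1)).isoformat()
    return {"window_start_date": ds, "window_end_date": tomorrow_ds}


# Window for a run dated 31 January 2018 (the end date rolls over the month).
params = extraction_window(date(2018, 1, 31))
```

Since the window depends only on the run's date, rerunning the task for the same date selects the same window, which is what keeps the extract deterministic.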

SLIDE 33

References

https://github.com/gtoonstra/airflow-deployments
https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Home
https://airflow.incubator.apache.org/
https://gtoonstra.github.io/etl-with-airflow/
https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603
https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b

SLIDE 34

Gerard Toonstra - Data Engineer / Architect - BigData Republic
gerard.toonstra@bigdatarepublic.nl

https://www.bigdatarepublic.nl/

Thank you!