Data Orchestration with Apache Airflow
Data-driven
“empower the organization to seek more understanding, through data analytics, about their business processes”
Apache Airflow
“Airflow is a platform to programmatically author, schedule and monitor workflows.”
[Architecture diagram. Components shown: Stream Data Sources, Cloud Pub/Sub, Cloud Dataflow, Batch Data Sources, Cloud Storage (BigQuery-friendly data lake), BigQuery Data Warehouse, External Data Sink, Cloud Composer]
Important ETL principles
- Processes are idempotent and deterministic
- Reusable, parameterisable components
- Data partitioning (usually by date)
- Dealing with alerts and SLAs
- Globally consistent paths to data
- Data at rest is immutable
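Several of these principles can be illustrated together in a minimal sketch. It is not taken from the slides: the function names, the `ds=YYYY-MM-DD` directory layout, and the JSON file format are all assumptions chosen for illustration. Each run targets exactly one date partition at a path derived from the dataset name and date, and replaces that partition wholesale, so re-running a day gives the same result instead of duplicated rows.

```python
import json
import tempfile
from pathlib import Path


def partition_path(root: Path, dataset: str, ds: str) -> Path:
    """Build the single agreed-upon location for a dataset's daily partition.

    The ds=YYYY-MM-DD layout is an assumption made for this sketch.
    """
    return root / dataset / f"ds={ds}" / "data.json"


def load_partition(root: Path, dataset: str, ds: str, rows: list) -> Path:
    """Write the rows for one date, replacing whatever a previous run left."""
    target = partition_path(root, dataset, ds)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Sort rows and keys so the same input always produces the same bytes.
    payload = json.dumps(sorted(rows, key=lambda r: r["id"]), sort_keys=True)
    # Write to a temporary file first, then swap it in, so a failed run
    # does not leave a half-written partition at the final path.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(payload)
    tmp.replace(target)
    return target


def run_twice_demo() -> bool:
    """Re-run the same date and check that the partition content is unchanged."""
    rows = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}]
    with tempfile.TemporaryDirectory() as tmp_dir:
        root = Path(tmp_dir)
        first = load_partition(root, "customer", "2018-01-31", rows).read_text()
        second = load_partition(root, "customer", "2018-01-31", rows).read_text()
        return first == second
```

Because the path is a pure function of dataset and date, any task that needs the partition can compute where it lives without extra configuration.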
“Data at rest to data at rest”
extract_customer = PostgresToPostgresOperator(
    src_postgres_conn_id='postgres_oltp',
    dest_postgress_conn_id='postgres_dwh',
    sql='select_customer.sql',
    pg_table='staging.customer',
    parameters={"window_start_date": "{{ ds }}",
                "window_end_date": "{{ tomorrow_ds }}"},
    pool='postgres_dwh')
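The two templated parameters define a one-day extraction window. In Airflow, `{{ ds }}` renders as the run's execution date in `YYYY-MM-DD` form and `{{ tomorrow_ds }}` as the following day. The sketch below reproduces that window with the standard library only, without Airflow or Jinja. The slides do not show the contents of `select_customer.sql`, so the query text, the `customer` table, and the `updated_at` column are hypothetical, and the `%(name)s` placeholders assume psycopg2-style named parameters.

```python
from datetime import date, timedelta


def daily_window(ds: str) -> dict:
    """Return the half-open window [ds, ds + 1 day) as ISO date strings."""
    start = date.fromisoformat(ds)
    end = start + timedelta(days=1)
    return {
        "window_start_date": start.isoformat(),
        "window_end_date": end.isoformat(),
    }


# Hypothetical query text: the real select_customer.sql is not shown in the
# slides, and the table and column names here are assumptions.
SELECT_CUSTOMER_SQL = """
SELECT *
FROM customer
WHERE updated_at >= %(window_start_date)s
  AND updated_at <  %(window_end_date)s
"""

# The same values the operator would receive for the 2018-01-31 run.
params = daily_window("2018-01-31")
```

With an inclusive start and an exclusive end, consecutive daily runs cover every row exactly once, and re-running a given date selects the same window again.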
References
- https://github.com/gtoonstra/airflow-deployments
- https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Home
- https://airflow.incubator.apache.org/
- https://gtoonstra.github.io/etl-with-airflow/
- https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603
- https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b