Building Reusable and Trustworthy pipelines
1 — Airflow Summit 2020, @nehiljain
Outline
1. Context
2. Design Requirements
3. Proposed Solution
4. Example Code
▸ Data engineer @ SnapTravel
▸ SnapTravel
▸ M-commerce startup
▸ Data team: 8, Data Sources: 86
▸ Data infrastructure, Data engineering, Analytics engineering
Share
▸ BI pipelines
▸ Community with lessons learnt
▸ feedback
▸ gross_revenue
▸ contribution_margin
▸ number_of_active_users
▸ retention_rate
▸ conversion_rate
▸ number_prs_merged
▸ number_prs_closed_without_merge
▸ number_prs_opened
▸ number_of_commits
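A minimal sketch of how the pull-request metrics above could be derived from raw records. It assumes each record carries a `state` field and a `merged_at` timestamp that is empty for unmerged pull requests; the function name `pr_metrics` and the sample rows are hypothetical, not taken from the talk.

```python
from collections import Counter


def pr_metrics(pull_requests):
    """Count opened, merged, and closed-without-merge pull requests.

    Assumes each record is a dict with `state` ("open" or "closed")
    and `merged_at` (None when the pull request was never merged).
    """
    counts = Counter()
    for pr in pull_requests:
        # Every record represents a pull request that was opened at some point.
        counts["number_prs_opened"] += 1
        if pr.get("merged_at") is not None:
            counts["number_prs_merged"] += 1
        elif pr.get("state") == "closed":
            counts["number_prs_closed_without_merge"] += 1
    names = (
        "number_prs_opened",
        "number_prs_merged",
        "number_prs_closed_without_merge",
    )
    return {name: counts[name] for name in names}


# Hypothetical sample data for illustration only.
sample = [
    {"number": 1, "state": "closed", "merged_at": "2020-06-01T10:00:00Z"},
    {"number": 2, "state": "closed", "merged_at": None},
    {"number": 3, "state": "open", "merged_at": None},
]
metrics = pr_metrics(sample)
```

Keeping the metric definitions in one small function is one way to make them easy to test and to compare against numbers published elsewhere.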
▸ The pipeline failed in production
▸ Shift focus onto issues and comments
▸ GitLab released a new version of its API
▸ I want to analyze other Apache projects too
▸ GitHub produced similar insights and their numbers didn't match mine
▸ Toil
▸ Cannot scale Data Analytics
▸ Data Discovery
▸ Data Trust
▸ Throw over the boundary
▸ Ambiguous ownership
..build tools, infrastructure, frameworks and services — Maxime Beauchemin
▸ Standardization
▸ Data Lineage
▸ Empower non-technical folks
▸ Airflow + Other OSS
▸ Ideally pip install awesome-elt-tool
▸ Low barrier to entry for data analytics
▸ Operational creep
▸ Test the raw data supply
▸ Automated analytics testing
▸ Load once and transform
▸ Reduced complexity
▸ Reduce cost
▸ Speed of delivery
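The "load once and transform" idea can be sketched with an in-memory SQLite database: raw rows are loaded unchanged, and the transformation is a SQL view inside the database. The table name `raw_pull_requests`, the view name `pr_summary`, and the sample rows are assumptions made for illustration; the talk's own stack targets Postgres.

```python
import sqlite3

# Load step: raw rows go into the warehouse as-is, with no reshaping.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_pull_requests (id INTEGER, state TEXT, merged_at TEXT)"
)
conn.executemany(
    "INSERT INTO raw_pull_requests VALUES (?, ?, ?)",
    [
        (1, "closed", "2020-06-01T10:00:00Z"),
        (2, "closed", None),
        (3, "open", None),
    ],
)

# Transform step: runs inside the database, on top of the raw table,
# so it can be changed and re-run without extracting the data again.
conn.execute(
    """
    CREATE VIEW pr_summary AS
    SELECT
        COUNT(*) AS number_prs_opened,
        SUM(CASE WHEN merged_at IS NOT NULL THEN 1 ELSE 0 END) AS number_prs_merged
    FROM raw_pull_requests
    """
)

summary = conn.execute(
    "SELECT number_prs_opened, number_prs_merged FROM pr_summary"
).fetchone()
```

Because the raw table is kept, a new metric only needs a new view, not a new extraction run.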
▸ expect_column_to_exist
▸ expect_table_row_count_to_be_between
▸ expect_table_row_count_to_equal
▸ expect_multicolumn_values_to_be_unique
▸ expect_column_values_to_not_be_null
▸ expect_column_values_to_be_null
▸ expect_column_fancy_statistic_to_be
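To show what expectations of this kind check, here is a plain-Python sketch of three of them over a list of row dicts. These are simplified stand-ins written for illustration; they are not the Great Expectations API, and the sample rows are hypothetical.

```python
def expect_column_to_exist(rows, column):
    """Pass when every row has the column as a key."""
    return all(column in row for row in rows)


def expect_column_values_to_not_be_null(rows, column):
    """Pass when no row has a missing or None value in the column."""
    return all(row.get(column) is not None for row in rows)


def expect_table_row_count_to_be_between(rows, min_value, max_value):
    """Pass when the number of rows lies in [min_value, max_value]."""
    return min_value <= len(rows) <= max_value


# Hypothetical raw rows, as they might arrive from an extractor.
rows = [
    {"id": 1, "state": "closed", "merged_at": "2020-06-01T10:00:00Z"},
    {"id": 2, "state": "open", "merged_at": None},
]

results = {
    "id exists": expect_column_to_exist(rows, "id"),
    "id not null": expect_column_values_to_not_be_null(rows, "id"),
    "merged_at not null": expect_column_values_to_not_be_null(rows, "merged_at"),
    "row count 1..10": expect_table_row_count_to_be_between(rows, 1, 10),
}
# Names of the checks that did not pass, e.g. for a notification.
failed = [name for name, passed in results.items() if not passed]
```

Running such checks on the raw tables before any transformation is one way to test the data supply rather than only the code.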
▸ Profiling
▸ Data Docs <-> Tests
▸ Send notifications automatically
tap-github --config tap_config.json | target-postgres --config target_config.json >> state.json
▸ Standardized communication
▸ Incremental out of the box
▸ Documentation
▸ See your data in under 10 mins
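The standardized communication between a tap and a target is a stream of JSON lines of type SCHEMA, RECORD, and STATE. The toy tap below is a sketch of that message flow and of how a saved state makes a run incremental; the stream name `pull_requests`, the `last_id` bookmark, and the in-memory source are hypothetical, and a real tap would write to stdout rather than a buffer.

```python
import io
import json

# Hypothetical source data standing in for an API response.
SOURCE = [
    {"id": 1, "title": "Fix typo"},
    {"id": 2, "title": "Add sensor"},
    {"id": 3, "title": "Update docs"},
]


def emit(out, message):
    """Write one message as a single JSON line."""
    out.write(json.dumps(message) + "\n")


def run_tap(records, state, out):
    """Emit SCHEMA, then only records newer than the bookmark, then STATE."""
    bookmark = state.get("last_id", 0)
    emit(out, {
        "type": "SCHEMA",
        "stream": "pull_requests",
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "title": {"type": "string"},
            },
        },
        "key_properties": ["id"],
    })
    for record in records:
        if record["id"] > bookmark:
            emit(out, {"type": "RECORD", "stream": "pull_requests", "record": record})
            bookmark = max(bookmark, record["id"])
    # The STATE message carries the bookmark for the next run.
    emit(out, {"type": "STATE", "value": {"last_id": bookmark}})
    return {"last_id": bookmark}


buffer = io.StringIO()
new_state = run_tap(SOURCE, {"last_id": 1}, buffer)
messages = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Feeding `new_state` into the next run skips everything already loaded, which is the incremental behaviour the bullet refers to.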
▸ Modular code
▸ Testing is 1st Class
▸ Data documentation is 1st Class
▸ Open Source, GitLab
▸ Self Hosted
pip3 install meltano
meltano init airflow-analytics-project
# run the remaining commands from inside the project directory
cd airflow-analytics-project
meltano add extractor tap-github
meltano add loader target-postgres
meltano add transformer dbt
meltano add transform tap-github
# add env variables
meltano elt tap-github target-postgres --transform=run --job_id=github-to-postgres
meltano add orchestrator airflow
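When an orchestrator task launches the ELT run, the command line can be assembled from its parts instead of being hard-coded. The helper below is a hypothetical sketch, not part of Meltano or Airflow; it only reuses the `--transform` and `--job_id` flags that appear in the commands above.

```python
import shlex


def build_elt_command(extractor, loader, job_id, transform="run"):
    """Return the argument list for a `meltano elt` invocation."""
    return [
        "meltano",
        "elt",
        extractor,
        loader,
        f"--transform={transform}",
        f"--job_id={job_id}",
    ]


command = build_elt_command("tap-github", "target-postgres", "github-to-postgres")
# A single shell-safe string, e.g. for a task that runs a bash command.
command_line = shlex.join(command)
```

Building the command from parameters makes it straightforward to reuse the same task definition for another extractor or loader.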
▸ Visualisation/BI layer
▸ Analytics code coverage
▸ Singer community
▸ Standardized tooling
▸ ELT >> ETL
▸ Great Expectations (GE) + Singer + dbt, orchestrated by Airflow
▸ Meltano Project
▸ Advanced Data Engineering Patterns with Apache Airflow by Maxime Beauchemin
▸ The Rise of the Data Engineer
▸ The Future of Data Engineering
▸ The Downfall of the Data Engineer
▸ Singer | Open Source ETL
▸ Why we are building an open-source platform for ELT pipelines - Meltano
▸ dbt Docs