
Building Reusable and Trustworthy Pipelines - Airflow Summit 2020


  1. Building Reusable and Trustworthy Pipelines (Airflow Summit 2020, @nehiljain)

  2. Outline
     1. Context
     2. Design Requirements
     3. Proposed Solution
     4. Example Code

  3. Context

  4. Hello! Data engineer @ SnapTravel
     ▸ SnapTravel
     ▸ M-commerce startup
     ▸ Data team: 8, Data Sources: 86
     ▸ Data infrastructure, Data engineering, Analytics engineering
     ▸ Stack: [logos not preserved]

  5. Purpose
     ▸ Share BI pipelines with lessons learnt
     ▸ Community feedback

  6. How is my company doing?
     ▸ gross_revenue
     ▸ contribution_margin
     ▸ number_of_active_users
     ▸ retention_rate
     ▸ conversion_rate

  7. How's my Airflow repo?
     ▸ number_prs_merged
     ▸ number_prs_closed_without_merge
     ▸ number_prs_opened
     ▸ number_of_commits
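
The repository metrics named on this slide can be derived from raw pull-request records. The following is a minimal sketch, assuming each record is a dict with a `state` field and a `merged_at` timestamp. Those field names and the `pr_metrics` helper are illustrative assumptions, not taken from the talk.

```python
def pr_metrics(pull_requests):
    """Count opened, merged, and closed-without-merge pull requests.

    Assumes each record is a dict with hypothetical keys:
    'state' ('open' or 'closed') and 'merged_at' (timestamp string or None).
    """
    metrics = {
        "number_prs_opened": 0,
        "number_prs_merged": 0,
        "number_prs_closed_without_merge": 0,
    }
    for pr in pull_requests:
        # Every record represents a pull request that was opened at some point.
        metrics["number_prs_opened"] += 1
        if pr["state"] == "closed":
            if pr.get("merged_at"):
                metrics["number_prs_merged"] += 1
            else:
                metrics["number_prs_closed_without_merge"] += 1
    return metrics


sample = [
    {"state": "closed", "merged_at": "2020-07-01T00:00:00Z"},
    {"state": "closed", "merged_at": None},
    {"state": "open", "merged_at": None},
]
metrics = pr_metrics(sample)
```

Keeping each metric as a named key mirrors the slide's naming, so a downstream report can refer to the same identifiers.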

  9. Let us consider
     ▸ The pipeline failed in production
     ▸ The focus shifts to issues and comments
     ▸ GitLab released a new version of its API
     ▸ I want to analyze other Apache projects too
     ▸ GitHub produced similar insights and their numbers didn't match mine

  10. Been there, done that?

  11. Classify the problems
      ▸ Toil
      ▸ Cannot scale Data Analytics
      ▸ Data Discovery
      ▸ Data Trust
      ▸ Throw over the boundary
      ▸ Ambiguous ownership

  12. What can we do to solve this?

  13. "... build tools, infrastructure, frameworks and services" (Maxime Beauchemin)

  14. Design Requirements

  16. Single Source of Truth
      ▸ Standardization
      ▸ Data Lineage
      ▸ Empower non-technical folks

  17. Easy to consume
      ▸ Airflow + Other OSS
      ▸ Ideally: pip install awesome-elt-tool
      ▸ Low barrier to entry for data analytics
      ▸ Operational creep

  18. Promote data integrity
      ▸ Test the raw data supply
      ▸ Automated analytics testing

  19. Meta Data Engineering

  21. Proposed Solution

  22. Conceptually

  23. ETL vs ELT
      ▸ Load once and transform
      ▸ Reduced complexity
      ▸ Reduced cost
      ▸ Speed of delivery
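
The "load once and transform" idea can be illustrated with a small sketch: raw rows are loaded unchanged into a database, and each metric is then computed with SQL inside it. This sketch uses an in-memory SQLite database from the Python standard library as a stand-in for a warehouse. The table name, column names, and helper functions are assumptions made for illustration.

```python
import sqlite3


def load_raw(conn, rows):
    """Load: copy source rows into a raw table as they arrive, with no reshaping."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_pull_requests "
        "(id INTEGER, state TEXT, merged_at TEXT)"
    )
    conn.executemany("INSERT INTO raw_pull_requests VALUES (?, ?, ?)", rows)


def run_model(conn, sql):
    """Transform: compute a single-value metric with SQL inside the database."""
    return conn.execute(sql).fetchone()[0]


conn = sqlite3.connect(":memory:")

# The raw data is loaded exactly once.
load_raw(conn, [(1, "closed", "2020-07-01"), (2, "closed", None), (3, "open", None)])

# Several transformations then reuse the same raw table.
number_prs_merged = run_model(
    conn,
    "SELECT COUNT(*) FROM raw_pull_requests "
    "WHERE state = 'closed' AND merged_at IS NOT NULL",
)
number_prs_opened = run_model(conn, "SELECT COUNT(*) FROM raw_pull_requests")
```

Because the raw table is kept as loaded, adding a new metric later only requires a new query, not a new extraction.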

  24. Validate your source data

  25. ▸ expect_column_to_exist
      ▸ expect_table_row_count_to_be_between
      ▸ expect_table_row_count_to_equal
      ▸ expect_multicolumn_values_to_be_unique
      ▸ expect_column_values_to_not_be_null
      ▸ expect_column_values_to_be_null
      ▸ expect_column_fancy_statistic_to_be
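
To make the expectation names above concrete, here is a sketch of what three of them assert, written as plain Python over a list of dicts. The function names are borrowed from the slide. The bodies are stand-ins written for illustration and are not calls into any validation library, and the sample rows are made up.

```python
def expect_column_to_exist(rows, column):
    """Pass when every row carries the given column."""
    return all(column in row for row in rows)


def expect_column_values_to_not_be_null(rows, column):
    """Pass when no row has a missing (None) value in the column."""
    return all(row.get(column) is not None for row in rows)


def expect_table_row_count_to_be_between(rows, min_value, max_value):
    """Pass when the number of rows falls inside the inclusive range."""
    return min_value <= len(rows) <= max_value


# Hypothetical raw rows, as they might arrive from a source system.
rows = [
    {"id": 1, "state": "open"},
    {"id": 2, "state": None},
]

results = {
    "column_exists": expect_column_to_exist(rows, "state"),
    "no_nulls": expect_column_values_to_not_be_null(rows, "state"),
    "row_count_ok": expect_table_row_count_to_be_between(rows, 1, 10),
}
```

Running checks like these on the raw data before any transformation is what lets a failure be traced to the source rather than to the pipeline.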

  26. Why?
      ▸ Profiling
      ▸ Data Docs <-> Tests
      ▸ Send notifications automatically

  27. Extract - Load

  28. Singer - What?

  29. tap-github --config tap_config.json | target-postgres --config target_config.json >> state.json
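
The pipe in the command above works because a tap writes JSON messages to standard output and a target reads them from standard input. The sketch below imitates that exchange in memory using the three Singer message types (SCHEMA, RECORD, STATE). The stream name, the schema, and the `last_id` bookmark are assumptions for illustration, and the functions are toy stand-ins for a real tap and target.

```python
import json


def tap(records, stream="pull_requests"):
    """Yield Singer-style messages as JSON lines: SCHEMA, then RECORDs, then STATE."""
    yield json.dumps({
        "type": "SCHEMA",
        "stream": stream,
        "schema": {"properties": {"id": {"type": "integer"}}},
        "key_properties": ["id"],
    })
    last_id = None
    for record in records:
        yield json.dumps({"type": "RECORD", "stream": stream, "record": record})
        last_id = record["id"]
    # The STATE message carries a bookmark so the next run can resume from it.
    yield json.dumps({"type": "STATE", "value": {"last_id": last_id}})


def target(lines):
    """Consume messages, collect records per stream, and return (loaded, final_state)."""
    loaded, state = {}, None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "SCHEMA":
            loaded.setdefault(message["stream"], [])
        elif message["type"] == "RECORD":
            loaded.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return loaded, state


loaded, state = target(tap([{"id": 1}, {"id": 2}]))
```

The final state returned by the target plays the role of the `state.json` file in the command: it is what makes the next run incremental.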

  30. Singer - Why?
      ▸ Standardized communication
      ▸ Incremental out of the box
      ▸ Documentation
      ▸ See your data in under 10 mins

  32. It's a long list

  33. Transform

  34. DBT - What?

  39. DBT - Why?
      ▸ Modular code
      ▸ Testing is 1st Class
      ▸ Data documentation is 1st Class
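
One way to picture first-class testing: in dbt, a data test is expressed as a query that selects the rows breaking a rule, and the test passes when that query returns nothing. The sketch below reproduces that idea with SQLite from the Python standard library rather than dbt itself. The table, the columns, and the two query templates are assumptions for illustration.

```python
import sqlite3

# Each test is a query that selects the rows violating a rule.
NOT_NULL_TEST = "SELECT * FROM {table} WHERE {column} IS NULL"
UNIQUE_TEST = (
    "SELECT {column} FROM {table} "
    "WHERE {column} IS NOT NULL "
    "GROUP BY {column} HAVING COUNT(*) > 1"
)


def failing_rows(conn, template, table, column):
    """Run one test query and return the rows that violate the rule."""
    return conn.execute(template.format(table=table, column=column)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pull_requests (id INTEGER, author TEXT)")
conn.executemany(
    "INSERT INTO pull_requests VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (2, None)],
)

# An empty result means the test passed; any returned row is a failure.
not_null_failures = failing_rows(conn, NOT_NULL_TEST, "pull_requests", "author")
unique_failures = failing_rows(conn, UNIQUE_TEST, "pull_requests", "id")
```

Returning the offending rows, rather than a bare pass or fail, gives whoever investigates a failure something concrete to look at.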

  40. Great adoption

  41. All together

  42. Meltano
      ▸ Open Source, GitLab
      ▸ Self Hosted

      pip3 install meltano
      meltano init airflow-analytics-project
      meltano add extractor tap-github
      meltano add loader target-postgres
      meltano add transformer dbt
      meltano add transform tap-github
      # add env variables
      meltano elt tap-gitlab target-postgres --transform=run --job_id=gitlab-to-postgres
      meltano add orchestrator airflow
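
Since one of the scenarios raised earlier is extending the analysis to other sources, the `meltano elt` invocation above lends itself to templating. The following sketch builds that command for any extractor and loader pair. The `build_elt_command` helper is hypothetical. The flags are copied from the slide and should be checked against the installed Meltano version.

```python
import shlex


def build_elt_command(extractor, loader, job_id, transform="run"):
    """Return the argument list for a `meltano elt` run (flags as shown on the slide)."""
    return [
        "meltano",
        "elt",
        extractor,
        loader,
        f"--transform={transform}",
        f"--job_id={job_id}",
    ]


# Reproduces the command from the slide; nothing is executed here.
command = build_elt_command("tap-gitlab", "target-postgres", "gitlab-to-postgres")
command_line = shlex.join(command)
print(command_line)
```

Keeping the command as an argument list makes it straightforward to hand to `subprocess.run`, while the joined string form suits a shell-based scheduler task.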

  43. Let's look at the code

  45. A templated approach

  50. Sit back & Relax

  51. Some challenges out there
      ▸ Visualisation/BI layer
      ▸ Analytics code coverage
      ▸ Singer community

  52. Key Takeaways
      ▸ Standardized tooling
      ▸ ELT >> ETL
      ▸ Great Expectations (GE) + Singer + DBT, orchestrated by Airflow

  53. Thanks

  54. Q & A

  55. Resources
      ▸ Meltano Project
      ▸ Advanced Data Engineering Patterns with Apache Airflow by Maxime Beauchemin
      ▸ The Rise of the Data Engineer
      ▸ The Future of Data Engineering
      ▸ Downfall of the data engineer

  56. Resources
      ▸ Singer | Open Source ETL
      ▸ Why we are building an open-source platform for ELT pipelines - Meltano
      ▸ Dbt Docs
