From S3 to BigQuery - How A First-Time Airflow User Successfully Implemented a Data Pipeline


  1. From S3 to BigQuery - How A First-Time Airglow User Successfully Implemented a Data Pipeline Leah Cole (with huge thanks to Emily Darrow!) @leahecole July 13th, 2020

  2. Intro to Leah @leahecole

  3. Today's Story • Prologue • Chapter 1: BigQuery Public Datasets • Chapter 2: Growing Pains • Chapter 3: The Goal • Chapter 4: The DAG • Epilogue • Q&A @leahecole

  4. Prologue @leahecole

  5. Intro to Composer • Note: not all Composer components are depicted in the diagram (others were highlighted in a presentation last week from Rafał Biegacz) @leahecole

  6. Google BigQuery: Google Cloud's enterprise data warehouse for analytics • Fully managed and serverless for maximum agility and scale • Gigabyte to petabyte scale storage and SQL queries • Real-time insights from streaming data • Encrypted, durable, and highly available • Built-in ML for out-of-the-box predictive insights • High-speed, in-memory BI Engine for faster reporting and analysis @leahecole

  7. BigQuery: architecture • Serverless: decoupled storage and compute for maximum flexibility • SQL:2011 compliant • Replicated, distributed storage (99.9999999999% durability) • Highly available cluster compute (Dremel) • Distributed memory shuffle tier • Petabit network • REST API, Web UI, CLI, and client libraries in 7 languages • Streaming ingest • Free bulk loading @leahecole

  8. Chapter 1: BigQuery Public Datasets @leahecole

  9. You need to... (the "Data Science" method) • Discover the dataset and where to access it • Negotiate access to the dataset • Understand the dataset, how it can be joined with your data, and its changes • Load the data into your systems • Update, maintain, and secure your data and database • Manage access and keep the data updated • Link public data with private data • Analyze, visualize, and communicate your results @leahecole

  10. You need to... (what if you only did this?) • Discover the dataset and where to access it • Negotiate access to the dataset • Understand the dataset, how it can be joined with your data, and its changes • Load the data into your systems • Update, maintain, and secure your data and database • Manage access and keep the data updated • Link public data with private data • Analyze, visualize, and communicate your results @leahecole

  11. Current catalog • >180 datasets from many data providers • Onboarded and maintained by Googler(s) with data provider input/guidance • g.co/cloud/marketplace-datasets @leahecole
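
(Editorial aside, not slide content: querying one of these public datasets requires nothing beyond the BigQuery client library; `usa_names` is one of the real datasets in the public catalog. A minimal sketch:)

```python
# Minimal sketch (not from the deck): querying a BigQuery public dataset
# with the official google-cloud-bigquery client library.
from google.cloud import bigquery

client = bigquery.Client()  # uses Application Default Credentials

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.name, row.total)
```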

  12. Chapter 2: Growing Pains @leahecole

  13. Growing Pains in the Public Datasets Program Understand the dataset, how it can be joined with your data, and its changes. Load the data into your systems. @leahecole

  14. Late 2019: Onboarding a New Dataset • New dataset comes in • Temporarily stored • Perform transformations • Ends up in BQ @leahecole

  15. Late 2019: Problems with Current Process • Disparate data sources + formats • Internal/external resource communication • Access control inconsistent • Tooling • Transformations • Manual @leahecole

  16. Chapter 3: The Goal @leahecole

  17. The Goals • Unified, repeatable process • Utilize GCP products designed for this • Hopefully open source process • See process through eyes of first-time Airflow user (Leah + Emily) @leahecole

  18. Early 2020-Present: Proposed solution 1. Clone repo, make branch 2. Add config + transformations (YAML config, custom transformations) 3. Generate DAG + .tf config 4. Create a PR 5. Presubmit checks 6. Human review 7. Deploy @leahecole

  19. Chapter 4: The DAG @leahecole

  20. The DAG Development Process • Shared repo • Shared GCP project • Leah + Emily both owners • Shared notes • Meetings • Pairing as needed • Regular team meetings @leahecole

  21. DAG version 0.0 • Get data from S3, store in GCS • Make target dataset • Put data into BigQuery Problem: • Leftover GCS bucket @leahecole

  22. DAG version 0.1 • Get data from S3, store in GCS • Make target dataset • Put data into BigQuery • Delete staging bucket @leahecole
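
(Sketch added for this transcript, not the speaker's actual code: a minimal Airflow DAG matching the four bullets of version 0.1, built from the stock Amazon and Google provider operators. All bucket, dataset, and table names are placeholders.)

```python
# Hedged sketch of a "version 0.1"-style DAG (placeholder names throughout).
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_gcs import S3ToGCSOperator
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
)
from airflow.providers.google.cloud.operators.gcs import GCSDeleteBucketOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="s3_to_bigquery",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Get data from S3, store in a GCS staging bucket
    s3_to_gcs = S3ToGCSOperator(
        task_id="s3_to_gcs",
        bucket="example-source-bucket",           # placeholder S3 bucket
        prefix="exports/",
        dest_gcs="gs://example-staging-bucket/",  # placeholder GCS bucket
    )

    # Make the target dataset
    create_dataset = BigQueryCreateEmptyDatasetOperator(
        task_id="create_dataset",
        dataset_id="example_dataset",             # placeholder dataset
    )

    # Put data into BigQuery
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="example-staging-bucket",
        source_objects=["exports/*.csv"],
        destination_project_dataset_table="example_dataset.example_table",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )

    # Delete the staging bucket (the fix for version 0.0's leftover bucket)
    delete_staging = GCSDeleteBucketOperator(
        task_id="delete_staging",
        bucket_name="example-staging-bucket",
    )

    s3_to_gcs >> create_dataset >> load_to_bq >> delete_staging
```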

  23. DAG version 1.x - Schema Definition, Resource Creation @leahecole

  24. DAG version 1.x - YAML config @leahecole
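
(The config screenshot did not survive extraction; as a purely hypothetical illustration of the YAML-driven approach, a generator might map config fields onto the operator arguments shown earlier. Every field name below is invented, not the project's actual schema.)

```python
# Hypothetical illustration only: the real YAML schema from the talk is not
# in this transcript, so every field name below is invented.
import yaml

config = yaml.safe_load("""
dataset_id: example_dataset
source:
  s3_bucket: example-source-bucket
  prefix: exports/
destination:
  table: example_table
  source_format: CSV
""")

# A generator could render a DAG file from this dict, e.g. by feeding it to
# a template whose blanks are the operator arguments.
print(config["source"]["s3_bucket"])
```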

  25. DAG version 1.x - Verify @leahecole

  26. DAG version 1.x - Verify @leahecole
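
(The "Verify" slides are screenshots that did not make it into this transcript. One hedged guess at what a verification task could look like in Airflow is a row-count check with BigQueryCheckOperator; this is illustrative, not the speaker's code.)

```python
# Hedged guess at a verification step (not the actual slide content):
# BigQueryCheckOperator fails the task if the query's first row is falsy.
from airflow.providers.google.cloud.operators.bigquery import BigQueryCheckOperator

verify_row_count = BigQueryCheckOperator(
    task_id="verify_row_count",
    sql="SELECT COUNT(*) > 0 FROM `example_dataset.example_table`",
    use_legacy_sql=False,
)
```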

  27. DAG version 1.x - Extract @leahecole

  28. DAG version 1.x - Transform + Load @leahecole
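
(Also a screenshot in the original deck. As a hedged sketch, an in-BigQuery transform can be expressed as a SQL job via BigQueryInsertJobOperator; the query and table names here are placeholders.)

```python
# Hedged sketch of a SQL-based transform step (placeholder query and tables).
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

transform = BigQueryInsertJobOperator(
    task_id="transform",
    configuration={
        "query": {
            "query": """
                SELECT * EXCEPT(raw_field),
                       PARSE_DATE('%Y%m%d', raw_field) AS day
                FROM `example_dataset.example_table`
            """,
            "destinationTable": {
                "projectId": "example-project",
                "datasetId": "example_dataset",
                "tableId": "example_table_clean",
            },
            "useLegacySql": False,
            "writeDisposition": "WRITE_TRUNCATE",
        }
    },
)
```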

  29. Epilogue @leahecole

  30. Lessons Learned • Double check your Composer and Airflow versions • Documentation is extremely important • Changelogs and release notes are extremely important • Transferring data between cloud providers is REALLY easy with Airflow @leahecole

  31. Call to Action • Contribute • Automate • Collaborate @leahecole

  32. Thank you to • Emily Darrow - for their technical legwork with this project • Tim Swast - technical advice + vision, moral support • Shane Glass - technical advice, vision, and presentation content • Rafał Biegacz + the Composer Team - presentation content + tireless engineering work • Seth Hollyman and my Data Analytics DevRel colleagues - presentation content, moral support, and constant inspiration • Moderators, sponsors, and attendees! @leahecole

  33. Q&A with Leah, Tim, and Shane @leahecole
