From S3 to BigQuery - How A First-Time Airflow User Successfully Implemented a Data Pipeline



SLIDE 1

@leahecole

From S3 to BigQuery - How A First-Time Airflow User Successfully Implemented a Data Pipeline

Leah Cole (with huge thanks to Emily Darrow!) July 13th, 2020

SLIDE 2

Intro to Leah

SLIDE 3

Today's Story

  • Prologue
  • Chapter 1: BigQuery Public Datasets
  • Chapter 2: Growing Pains
  • Chapter 3: The Goal
  • Chapter 4: The DAG
  • Epilogue
  • Q&A
SLIDE 4

Prologue

SLIDE 5

Intro to Composer

Note: not all Composer components are depicted in the diagram (others were highlighted in a presentation last week from Rafał Biegacz)

SLIDE 6

Google BigQuery

Google Cloud’s enterprise data warehouse for analytics

  • Real-time insights from streaming data
  • Gigabyte to petabyte scale storage and SQL queries
  • Encrypted, durable, and highly available
  • Fully managed and serverless for maximum agility and scale
  • Built-in ML for out-of-the-box predictive insights
  • High-speed, in-memory BI Engine for faster reporting and analysis

SLIDE 7

BigQuery: architecture

  • Serverless. Decoupled storage and compute for maximum flexibility.

  • SQL:2011 compliant
  • REST API
  • Web UI, CLI
  • Client libraries in 7 languages
  • Streaming ingest
  • Free bulk loading
  • Replicated, distributed storage (99.9999999999% durability)
  • Highly available cluster compute (Dremel)
  • Distributed memory shuffle tier
  • Petabit network

SLIDE 8

Chapter 1: BigQuery Public Datasets

SLIDE 9

The "Data Science" method

You need to...

  • Discover the dataset and where to access it.
  • Negotiate access to the dataset.
  • Understand the dataset, how it can be joined with your data, and its changes.
  • Load the data into your systems.
  • Update, maintain, and secure your data and database.
  • Manage access and keep the data updated.
  • Link public data with private data.
  • Analyze, visualize, and communicate your results.

SLIDE 10

What if you only did this?

You need to...

  • Discover the dataset and where to access it.
  • Negotiate access to the dataset.
  • Understand the dataset, how it can be joined with your data, and its changes.
  • Load the data into your systems.
  • Update, maintain, and secure your data and database.
  • Manage access and keep the data updated.
  • Link public data with private data.
  • Analyze, visualize, and communicate your results.

SLIDE 11

Data providers: Current catalog >180 datasets

Onboarded and maintained by Googler(s) with data provider input/guidance

g.co/cloud/marketplace-datasets

SLIDE 12

Chapter 2: Growing Pains

SLIDE 13

Growing Pains in the Public Datasets Program

  • Understand the dataset, how it can be joined with your data, and its changes.
  • Load the data into your systems.

SLIDE 14

Late 2019: Onboarding a New Dataset

  • New dataset comes in
  • Temporarily stored
  • Perform transformations
  • Ends up in BQ

SLIDE 15

Late 2019: Problems with Current Process

  • Disparate data sources + formats
  • Internal/external resource communication
  • Access control inconsistent
  • Tooling
  • Transformations
  • Manual

SLIDE 16

Chapter 3: The Goal

SLIDE 17

The Goals

  • Unified, repeatable process
  • Utilize GCP products designed for this
  • Hopefully open source process
  • See process through eyes of first-time Airflow user (Leah + Emily)

SLIDE 18

Early 2020-Present: Proposed solution

YAML config, custom transformations

  1. Clone repo, make branch
  2. Add config + transformations
  3. Generate DAG + .tf config
  4. Create a PR
  5. Presubmit checks
  6. Human review
  7. Deploy
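
Steps 2, 3, and 5 above can be sketched in plain Python. This is a minimal illustration, not the speakers' generator: the config keys, the task ids, and the `validate_config` and `generate_task_ids` helpers are all hypothetical, and a dict literal stands in for the parsed YAML file so the sketch needs only the standard library.

```python
# Hypothetical pipeline config. In the proposed workflow this would be a YAML
# file in the repo; a dict literal stands in for the parsed result here.
CONFIG = {
    "dataset_id": "example_public_dataset",
    "source_s3_bucket": "example-source-bucket",
    "staging_gcs_bucket": "example-staging-bucket",
    "tables": ["observations", "stations"],
}

# Keys this sketch assumes every config must define.
REQUIRED_KEYS = ("dataset_id", "source_s3_bucket", "staging_gcs_bucket", "tables")


def validate_config(config):
    """Return a list of problems; an empty list means the config passes."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if not problems and not config["tables"]:
        problems.append("tables must list at least one table")
    return problems


def generate_task_ids(config):
    """Expand a valid config into an ordered list of task ids, one load per table."""
    problems = validate_config(config)
    if problems:
        raise ValueError("; ".join(problems))
    task_ids = ["s3_to_gcs", "create_dataset"]
    task_ids += [f"load_{table}" for table in config["tables"]]
    task_ids.append("delete_staging_bucket")
    return task_ids


for task_id in generate_task_ids(CONFIG):
    print(task_id)
```

Here `validate_config` plays the role of a presubmit check, and the generated task ids are the kind of output a generator might turn into operators in a DAG file.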
SLIDE 19

Chapter 4: The DAG

SLIDE 20

  • Shared repo
  • Shared GCP project
  • Leah + Emily both owners
  • Shared notes
  • Meetings
  • Pairing as needed
  • Regular team meetings

The DAG Development Process

SLIDE 21

DAG version 0.0

  • Get data from S3, store in GCS
  • Make target dataset
  • Put data into BigQuery

Problem:

  • Leftover GCS bucket
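
The three steps and the leftover-bucket problem can be illustrated with a toy, in-memory simulation. This is a sketch only: it does not call Airflow, S3, GCS, or BigQuery, and every function, bucket, and dataset name in it is hypothetical.

```python
def new_cloud():
    """Toy stand-in for cloud state: staged objects and loaded tables."""
    return {"gcs_buckets": {}, "bq_datasets": {}}


def s3_to_gcs(cloud, s3_objects, staging_bucket):
    """Copy the S3 objects into a newly created GCS staging bucket."""
    cloud["gcs_buckets"][staging_bucket] = dict(s3_objects)


def create_dataset(cloud, dataset_id):
    """Create the target dataset if it does not exist yet."""
    cloud["bq_datasets"].setdefault(dataset_id, {})


def gcs_to_bigquery(cloud, staging_bucket, dataset_id, table_id):
    """Load every staged object, in name order, into one table."""
    staged = cloud["gcs_buckets"][staging_bucket]
    rows = [row for name in sorted(staged) for row in staged[name]]
    cloud["bq_datasets"][dataset_id][table_id] = rows


def run_dag_v0_0(s3_objects):
    """Run the three version 0.0 steps in order and return the final state."""
    cloud = new_cloud()
    s3_to_gcs(cloud, s3_objects, "staging-bucket")
    create_dataset(cloud, "target_dataset")
    gcs_to_bigquery(cloud, "staging-bucket", "target_dataset", "data")
    return cloud


cloud = run_dag_v0_0({"part-1.csv": ["a,1", "b,2"], "part-2.csv": ["c,3"]})
# The data arrived, but nothing removed the staging bucket afterwards.
leftover_buckets = sorted(cloud["gcs_buckets"])
print(leftover_buckets)
```

After the run, the table is populated and the staging bucket is still present, which is the problem the slide names.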
SLIDE 22

DAG version 0.1

  • Get data from S3, store in GCS
  • Make target dataset
  • Put data into BigQuery
  • Delete staging bucket
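
Adding the cleanup step makes task ordering matter: the staging bucket can only be deleted after the load has finished. The following is a minimal sketch of that ordering using the standard library's `graphlib` (Python 3.9+) rather than Airflow itself. The dependency edges and task ids are assumptions; the slide lists the steps but not how they are wired together.

```python
from graphlib import TopologicalSorter

# Task id -> set of upstream tasks that must finish first (assumed wiring).
DEPENDENCIES = {
    "s3_to_gcs": set(),
    "create_dataset": set(),
    "gcs_to_bigquery": {"s3_to_gcs", "create_dataset"},
    "delete_staging_bucket": {"gcs_to_bigquery"},
}

# Toy task bodies that only mutate an in-memory state dict (hypothetical names).
TASKS = {
    "s3_to_gcs": lambda s: s["gcs_buckets"].add("staging-bucket"),
    "create_dataset": lambda s: s["bq_datasets"].add("target_dataset"),
    "gcs_to_bigquery": lambda s: s["bq_tables"].add("target_dataset.data"),
    "delete_staging_bucket": lambda s: s["gcs_buckets"].discard("staging-bucket"),
}


def run(dependencies, tasks, state):
    """Run each task once, upstream tasks first, and return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for task_id in order:
        tasks[task_id](state)
    return order


state = {"gcs_buckets": set(), "bq_datasets": set(), "bq_tables": set()}
order = run(DEPENDENCIES, TASKS, state)
print(order)
print(state)
```

In an Airflow DAG file the same edges are declared between operator instances with `>>`, for example `load >> cleanup` (variable names hypothetical). The Google provider package for Airflow includes operators such as `S3ToGCSOperator`, `GCSToBigQueryOperator`, and `GCSDeleteBucketOperator` for steps like these; which operators the speakers used is not shown in this transcript.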

SLIDE 23

DAG version 1.x - Schema Definition, Resource Creation

SLIDE 24

DAG version 1.x - YAML config

SLIDE 25

DAG version 1.x - Verify

SLIDE 26

DAG version 1.x - Verify

SLIDE 27

DAG version 1.x - Extract

SLIDE 28

DAG version 1.x - Transform + Load

SLIDE 29

Epilogue

SLIDE 30

Lessons Learned

  • Double check your Composer and Airflow versions
  • Documentation is extremely important
  • Changelogs and release notes are extremely important
  • Transferring data between cloud providers is REALLY easy with Airflow

SLIDE 31

Call to Action

  • Contribute
  • Automate
  • Collaborate

SLIDE 32

Thank you to

  • Emily Darrow - for their technical legwork with this project
  • Tim Swast - technical advice + vision, moral support
  • Shane Glass - technical advice, vision, and presentation content
  • Rafał Biegacz + the Composer Team - presentation content + tireless engineering work
  • Seth Hollyman and my Data Analytics DevRel colleagues - presentation content, moral support, and constant inspiration
  • Moderators, sponsors, and attendees!
SLIDE 33

Q&A with Leah, Tim, and Shane