Big Data on Google Cloud: Using Cloud Dataflow, BigQuery, and friends




slide-1
SLIDE 1

Big Data on Google Cloud

William Vambenepe, Lead Product Manager for Big Data, Google Cloud Platform @vambenepe / vbp@google.com

Using Cloud Dataflow, BigQuery, and friends to process data “the Cloud way”

slide-2
SLIDE 2

Agenda

  1. Big Data at Google
  2. Managing data through its lifecycle
  3. Google Cloud Dataflow
  4. Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...
  5. Optimizing your time
  6. References and follow-up

slide-3
SLIDE 3

Big Data at Google

slide-4
SLIDE 4

Building on Google’s infrastructure

  • 1.5 million devices activated every day (over a billion devices)
  • 6 billion hours watched every month (10h uploaded every minute)
  • 20 billion pages crawled every day

slide-5
SLIDE 5

Hardware and data center innovation

slide-6
SLIDE 6

Software innovation

[Timeline figure with year marks 2002, 2004, 2006, 2008, 2010, 2012, 2013. Technologies shown: GFS, MapReduce, Big Table, Dremel, Pregel, Flume, Colossus, Spanner, MillWheel. Products shown: Cloud Dataflow, BigQuery.]

slide-7
SLIDE 7

Managing data through its lifecycle

slide-8
SLIDE 8

Data lifecycle

[Diagram; labels in reading order]

  • Stream, Batch
  • Cloud Pub/Sub, Cloud Logs, Google Analytics Premium, Google Cloud Storage, Google App Engine
  • Cloud Dataflow
  • BigQuery Storage (tables), Cloud Storage (files)
  • Cloud Dataflow, BigQuery Analytics (SQL)
  • Real time analytics & alerts

slide-9
SLIDE 9

Data usage organization maturity lifecycle

  1. Descriptive
  2. Exploratory + Descriptive
  3. Predictive + Exploratory + Descriptive
  4. Prescriptive + Predictive + Exploratory + Descriptive

slide-10
SLIDE 10
Supporting organizations with operational ease of use

  • No administration
  • Most powerful tools in the easiest way
  • Constant experimentation with low risk & cost
  • Easy collaboration across teams and organizations
  • Low costs without requiring usage commitments
  • Best performance & virtually unlimited scale
  • Always on

slide-11
SLIDE 11

Google Cloud Dataflow

slide-12
SLIDE 12

Data lifecycle

[Diagram; labels in reading order]

  • Stream, Batch
  • Cloud Pub/Sub, Cloud Logs, Google Analytics Premium, Google Cloud Storage, Google App Engine
  • Cloud Dataflow
  • BigQuery Storage (tables), Cloud Storage (files)
  • Cloud Dataflow, BigQuery Analytics (SQL)
  • Real time analytics & alerts

slide-13
SLIDE 13

What is Cloud Dataflow?

  • Cloud Dataflow is a collection of SDKs for building parallelized data processing pipelines
    ↳ Download from GitHub: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
  • Cloud Dataflow is a managed service for executing parallelized data processing pipelines
    ↳ Use on Google Cloud: https://cloud.google.com/dataflow/

slide-14
SLIDE 14

Cloud Dataflow SDK - Logical Model

Pipeline {
  Who   => Inputs
  What  => Transforms
  Where => Windows
  When  => Watermarks + Triggers
  To    => Outputs
}

Unified programming model for both batch & stream processing.
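To make the logical model more concrete, here is a minimal Python sketch of the Inputs, Transforms, Windows, and Outputs stages. It is not the Dataflow SDK: `fixed_window` and `run_pipeline` are hypothetical names chosen for illustration, the 60-second window size is an assumption, and watermarks and triggers are left out because the sketch reads a bounded input in a single pass.

```python
from collections import defaultdict

def fixed_window(timestamp, size):
    """Return the [start, end) fixed window that contains an event timestamp."""
    start = (timestamp // size) * size
    return (start, start + size)

def run_pipeline(events, window_size):
    """Sum values per key within each event-time window.

    events: iterable of (timestamp_seconds, key, value) tuples (the inputs).
    Returns a dict mapping (window, key) -> summed value (the outputs).
    """
    sums = defaultdict(int)
    for timestamp, key, value in events:               # Inputs
        window = fixed_window(timestamp, window_size)  # Windows
        sums[(window, key)] += value                   # Transform: sum per key
    return dict(sums)                                  # Outputs

# Hypothetical events: (event time in seconds, key, value).
events = [(1, "a", 1), (59, "a", 2), (61, "a", 5), (62, "b", 7)]
result = run_pipeline(events, window_size=60)
for (window, key), total in sorted(result.items()):
    print(window, key, total)
```

The grouping depends only on the timestamps carried by the events, which is why the same logic can in principle serve both a batch file and a replayed stream.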

slide-15
SLIDE 15
Cloud Dataflow Pipeline

  • A Directed Acyclic Graph of data processing transformations
  • Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
  • May include multiple inputs and multiple outputs
  • May encompass many logical MapReduce operations
  • PCollections flow through the pipeline
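The idea of a pipeline as a directed acyclic graph with several inputs and several outputs can be sketched in plain Python. This is an illustration only, not the Dataflow SDK: `run_dag`, the step names, and the use of Python lists in place of PCollections are all assumptions, and the standard-library `graphlib` module (Python 3.9+) supplies the topological ordering.

```python
from graphlib import TopologicalSorter

def run_dag(sources, steps):
    """Execute a DAG of transforms in dependency order.

    sources: dict name -> list of elements (the pipeline inputs).
    steps:   dict name -> (function, [upstream names]); each function takes
             one list per upstream node and returns a new list.
    Returns a dict holding the collection produced at every node.
    """
    graph = {name: set(upstream) for name, (_, upstream) in steps.items()}
    collections = dict(sources)
    for name in TopologicalSorter(graph).static_order():
        if name in collections:   # a source: nothing to compute
            continue
        func, upstream = steps[name]
        collections[name] = func(*(collections[u] for u in upstream))
    return collections

# Hypothetical pipeline: two inputs, two outputs.
sources = {"clicks": [1, 2, 3], "views": [10, 20]}
steps = {
    "merged":  (lambda a, b: a + b, ["clicks", "views"]),
    "doubled": (lambda xs: [x * 2 for x in xs], ["merged"]),
    "large":   (lambda xs: [x for x in xs if x >= 20], ["doubled"]),
    "total":   (lambda xs: [sum(xs)], ["doubled"]),
}
out = run_dag(sources, steps)
print(out["large"], out["total"])
```

Because the graph is declared before anything runs, whatever executes it is free to reorder or combine steps, as long as the dependencies are respected.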

slide-16
SLIDE 16

Life of a Dataflow Pipeline

[Diagram; labels in reading order]

  • Google Cloud Platform Managed Service
  • User Code & SDK
  • Work Manager
  • Deploy & Schedule
  • Progress & Logs
  • Monitoring UI
  • Job Manager
  • Graph optimization
slide-17
SLIDE 17

Worker Lifecycle Management throughout batch execution

  1. Deploy
  2. Schedule & Monitor
  3. Tear Down

slide-18
SLIDE 18

Worker Optimization

100 mins. vs. 65 mins.

slide-19
SLIDE 19

Continuous worker scaling for long-lived streaming pipelines

[Chart over time; load levels shown: 800 RPS, 1,200 RPS, 5,000 RPS, 50 RPS]
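As a rough illustration of scaling workers with load, the sketch below sizes a worker pool from the request rates shown on the slide. The slide does not describe the service's actual scaling policy, so everything here is assumed: the function name `workers_needed`, the capacity of 100 requests per second per worker, and the minimum and maximum bounds are hypothetical.

```python
import math

def workers_needed(rps, per_worker_rps, min_workers=1, max_workers=100):
    """Workers required to keep up with `rps`, clamped to [min_workers, max_workers]."""
    needed = math.ceil(rps / per_worker_rps)
    return max(min_workers, min(max_workers, needed))

# Load levels from the slide; the per-worker capacity is an assumption.
for rps in (800, 1200, 5000, 50):
    print(rps, "RPS ->", workers_needed(rps, per_worker_rps=100), "workers")
```

The point of the sketch is only that the pool follows the load in both directions, growing for the peak and shrinking again when traffic drops.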

slide-20
SLIDE 20
Portability: Cloud Dataflow Runners

The most productive and portable data pipeline SDK.

  • Run the same code in multiple modes using different runners
  • Direct Runner
    • For local, in-memory execution
    • Great for development and unit tests
  • Cloud Dataflow Service Runner
    • Runs on the fully managed Dataflow Service
    • Your code runs distributed across GCE instances
  • Community sourced
    • Spark runner @ github.com/cloudera/spark-dataflow
    • Flink runner coming soon from dataArtisans
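The runner idea, one pipeline definition executed by interchangeable back ends, can be sketched as follows. These classes are hypothetical stand-ins written for this example, not the SDK's runners: `LocalRunner` plays the role of an in-memory runner for tests, and `ThreadPoolRunner` stands in for any runner that spreads work across workers.

```python
from concurrent.futures import ThreadPoolExecutor

# The pipeline definition: an ordered list of per-element transforms.
PIPELINE = [str.strip, str.lower, len]

def apply_all(element, transforms):
    """Apply every transform in order to one element."""
    for transform in transforms:
        element = transform(element)
    return element

class LocalRunner:
    """Runs the pipeline sequentially in memory."""
    def run(self, transforms, data):
        return [apply_all(x, transforms) for x in data]

class ThreadPoolRunner:
    """Runs the same pipeline with elements spread across worker threads."""
    def __init__(self, workers=4):
        self.workers = workers

    def run(self, transforms, data):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # pool.map returns results in input order.
            return list(pool.map(lambda x: apply_all(x, transforms), data))

data = ["  Cloud ", "Dataflow", " SDK"]
local = LocalRunner().run(PIPELINE, data)
parallel = ThreadPoolRunner().run(PIPELINE, data)
print(local, parallel)
```

Keeping the pipeline definition free of any execution detail is what lets a unit test use the local runner while production uses a distributed one.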

slide-21
SLIDE 21

Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...

slide-22
SLIDE 22

Data lifecycle

[Diagram; labels in reading order]

  • Stream, Batch
  • Cloud Pub/Sub, Cloud Logs, Google Analytics Premium, Google Cloud Storage, Google App Engine
  • Cloud Dataflow
  • BigQuery Storage (tables), Cloud Storage (files)
  • Cloud Dataflow, BigQuery Analytics (SQL)
  • Real time analytics & alerts

slide-23
SLIDE 23

Cloud Pub/Sub

  • Many-to-many asynchronous messaging
  • Fast and reliable
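Many-to-many asynchronous messaging can be illustrated with a toy in-memory broker. This is a sketch of the pattern only and does not use the Cloud Pub/Sub API: the `Broker` class, its method names, and the topic and subscription names are hypothetical, and durability, acknowledgements, and retries are ignored.

```python
from collections import defaultdict, deque

class Broker:
    """Toy topic/subscription broker: every subscription gets its own copy."""

    def __init__(self):
        self.subscriptions = defaultdict(dict)  # topic -> {subscription: queue}

    def subscribe(self, topic, subscription):
        self.subscriptions[topic][subscription] = deque()

    def publish(self, topic, message):
        # The publisher does not wait for consumers; messages are queued
        # until each subscriber pulls them.
        for queue in self.subscriptions[topic].values():
            queue.append(message)

    def pull(self, topic, subscription):
        queue = self.subscriptions[topic][subscription]
        return queue.popleft() if queue else None

broker = Broker()
broker.subscribe("events", "billing")
broker.subscribe("events", "audit")
broker.publish("events", "signup:alice")  # first publisher
broker.publish("events", "signup:bob")    # second publisher
print(broker.pull("events", "billing"), broker.pull("events", "audit"))
```

Any number of publishers can write to a topic and any number of subscriptions can read from it, with each subscription consuming at its own pace.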

slide-24
SLIDE 24

BigQuery

  • Ingest data via streaming (100K rows/second/table) or file loader
  • Process interactive SQL queries on TB or PB of data
  • Zero administration; just upload data and send queries
  • Pay for storage and query separately, based on actual usage
  • Non-technical analysts can drive queries on massive datasets using BI tools (e.g. Tableau)
  • Highly available: data replication in multiple geographies
  • Secure and easy collaboration: access to data is controlled using customer-owned ACLs

slide-25
SLIDE 25

Hadoop and Spark

[Diagram; labels shown]

  • Master Node, Work Nodes
  • Name Node (optional), HDFS (optional)
  • Local SSD, PD SSD, PD standard
  • Connectors: GCS Connector, BigQuery Connector
  • bdutil orchestration

slide-26
SLIDE 26

Optimizing your time

slide-27
SLIDE 27

Optimizing Your Time

slide-36
SLIDE 36

References and follow-up

slide-37
SLIDE 37

Getting Started

Cloud Dataflow

  • Service: https://cloud.google.com/dataflow
  • Questions: https://stackoverflow.com/questions/tagged/google-cloud-dataflow
  • SDK: https://github.com/GoogleCloudPlatform/DataflowJavaSDK

BigQuery

  • https://cloud.google.com/bigquery/

Cloud Pub/Sub

  • https://cloud.google.com/pubsub/

Hadoop and Spark

  • https://cloud.google.com/hadoop/

Contact me

  • Twitter: @vambenepe
  • Email: vbp@google.com
slide-38
SLIDE 38

Thank You!

cloud.google.com