SLIDE 1 Big Data on Google Cloud
William Vambenepe, Lead Product Manager for Big Data, Google Cloud Platform @vambenepe / vbp@google.com
Using Cloud Dataflow, BigQuery, and friends to process data “the Cloud way”
SLIDE 2 Agenda
1. Big Data at Google
2. Managing data through its lifecycle
3. Google Cloud Dataflow
4. Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...
5. Optimizing your time
6. References and follow-up
SLIDE 3
Big Data at Google
SLIDE 4
Building on Google’s infrastructure
1.5 million devices activated
every day (over a billion devices)
6 billion hours watched
every month (10h uploaded every minute)
20 billion pages crawled
every day
SLIDE 5
Hardware and data center innovation
SLIDE 6 Software innovation
(Timeline, 2002 to 2013. Systems shown: Spanner, Dremel, MapReduce, Bigtable, Colossus, GFS, MillWheel, Flume, Pregel. Cloud products shown: Cloud Dataflow, BigQuery.)
SLIDE 7
Managing data through its lifecycle
SLIDE 8 Data lifecycle
(Diagram of stream and batch paths. Labels: Cloud Pub/Sub, Cloud Logs, Google Analytics Premium, Google Cloud Storage, Google App Engine; Cloud Dataflow; BigQuery Storage (tables), Cloud Storage (files); Cloud Dataflow, BigQuery Analytics (SQL); Real time analytics & alerts.)
SLIDE 9 Data usage organization maturity lifecycle
1. Descriptive
2. Exploratory + Descriptive
3. Predictive + Exploratory + Descriptive
4. Prescriptive + Predictive + Exploratory + Descriptive
SLIDE 10 Supporting organizations with operational ease of use
- no administration
- most powerful tools in the easiest way
- constant experimentation with low risks & cost
- easy collaboration across teams and organizations
- low costs without requiring usage commitments
- best performance & virtually unlimited scale
- always on
SLIDE 11
Google Cloud Dataflow
SLIDE 12 Data lifecycle
(Diagram of stream and batch paths. Labels: Cloud Pub/Sub, Cloud Logs, Google Analytics Premium, Google Cloud Storage, Google App Engine; Cloud Dataflow; BigQuery Storage (tables), Cloud Storage (files); Cloud Dataflow, BigQuery Analytics (SQL); Real time analytics & alerts.)
SLIDE 13 What is Cloud Dataflow?
- Cloud Dataflow is a collection of SDKs for building parallelized data processing pipelines
↳ Download from GitHub:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK
- Cloud Dataflow is a managed service for executing parallelized data processing pipelines
↳ Use on Google Cloud:
https://cloud.google.com/dataflow/
SLIDE 14 Cloud Dataflow SDK - Logical Model
Pipeline {
  Who   => Inputs
  What  => Transforms
  Where => Windows
  When  => Watermarks + Triggers
  To    => Outputs
}
Unified programming model for both batch & stream processing.
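To make the inputs, transforms, windows, and outputs concrete, here is a minimal sketch in plain Python. It does not use the Dataflow SDK: the `Event` type, the `fixed_window_sums` function, and the 60-second window size are all assumptions made for illustration, and watermarks and triggers are left out entirely.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Event(NamedTuple):
    key: str          # hypothetical grouping key, e.g. a user id
    value: int        # the quantity being aggregated
    timestamp: float  # event time in seconds


def fixed_window_sums(
    events: Iterable[Event], window_size: float
) -> Dict[Tuple[str, float], int]:
    """Sum values per key within fixed, non-overlapping event-time windows."""
    sums: Dict[Tuple[str, float], int] = defaultdict(int)
    for e in events:  # inputs
        # windows: assign each event to the window containing its timestamp
        window_start = (e.timestamp // window_size) * window_size
        # transform: sum per (key, window)
        sums[(e.key, window_start)] += e.value
    return dict(sums)  # outputs


events = [
    Event("a", 1, 5.0),
    Event("a", 2, 59.0),
    Event("b", 7, 61.0),
    Event("a", 4, 65.0),
]
result = fixed_window_sums(events, window_size=60.0)
```

The same function works whether `events` is a finite list (batch) or a generator that yields events as they arrive, which is the intuition behind a unified model. A real streaming system also has to decide when a window is complete, which is what watermarks and triggers address.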
SLIDE 15 Cloud Dataflow Pipeline
- A Directed Acyclic Graph of data processing transformations
- Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
- May include multiple inputs and multiple outputs
- May encompass many logical MapReduce operations
- PCollections flow through the pipeline
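The idea of a pipeline as a directed acyclic graph with multiple inputs and outputs can be sketched with the Python standard library. This is an illustration only, not the Dataflow SDK: the `run_dag` helper and every step name below are hypothetical, and plain lists stand in for PCollections.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple

# Each step maps a name to (upstream step names, function of their outputs).
# All step names and functions here are hypothetical.
Step = Tuple[List[str], Callable[..., list]]


def run_dag(steps: Dict[str, Step]) -> Dict[str, list]:
    """Run every step after its upstream steps, in topological order."""
    order = TopologicalSorter({name: deps for name, (deps, _) in steps.items()})
    results: Dict[str, list] = {}
    for name in order.static_order():  # raises CycleError if not acyclic
        deps, fn = steps[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


steps: Dict[str, Step] = {
    "logs": ([], lambda: [3, 1, 2]),                     # input 1
    "clicks": ([], lambda: [10, 20]),                    # input 2
    "merged": (["logs", "clicks"], lambda a, b: a + b),  # joins both inputs
    "total": (["merged"], lambda xs: [sum(xs)]),         # output 1
    "largest": (["merged"], lambda xs: [max(xs)]),       # output 2
}
results = run_dag(steps)
```

Because the graph is declared before it runs, an execution service is free to inspect and optimize it before scheduling any work.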
SLIDE 16 Life of a Dataflow Pipeline
(Diagram labels: User Code & SDK; Graph; Google Cloud Platform Managed Service; Job Manager; Work Manager; Deploy & Schedule; Progress & Logs; Monitoring UI.)
SLIDE 17 Worker Lifecycle Management throughout batch execution
- Deploy
- Schedule & Monitor
- Tear Down
SLIDE 18 Worker Optimization
100 mins. vs. 65 mins.
SLIDE 19 Continuous worker scaling for long-lived streaming pipelines
(Chart of load over time: 800 RPS, 1,200 RPS, 5,000 RPS, 50 RPS.)
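As a rough illustration of what scaling with load means, the sketch below sizes a worker pool for the load levels shown on the slide. It is not how the Dataflow Service makes scaling decisions: the `workers_needed` function and the capacity of 100 RPS per worker are assumptions chosen only to make the arithmetic visible.

```python
import math
from typing import List


def workers_needed(rps: float, per_worker_rps: float, min_workers: int = 1) -> int:
    """Smallest worker count whose combined capacity covers the load."""
    return max(min_workers, math.ceil(rps / per_worker_rps))


# Load levels from the slide; 100 RPS per worker is an assumed capacity.
loads: List[int] = [800, 1200, 5000, 50]
plan = [workers_needed(rps, per_worker_rps=100) for rps in loads]
```

Under these assumptions a pool fixed at peak size would sit mostly idle at 50 RPS, which is the cost that continuous scaling is meant to avoid.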
SLIDE 20 Portability: Cloud Dataflow Runners
The most productive and portable data pipeline SDK.
- Run the same code in multiple modes using different runners
- Direct Runner
  - For local, in-memory execution
  - Great for development and unit tests
- Cloud Dataflow Service Runner
  - Runs on the fully managed Dataflow Service
  - Your code runs distributed across GCE instances
- Community sourced
  - Spark runner @ github.com/cloudera/spark-dataflow
  - Flink runner coming soon from dataArtisans
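The runner idea, one pipeline definition executed by interchangeable back ends, can be sketched in plain Python. Neither class below is part of the Dataflow SDK: `DirectRunner` here is a toy in-memory executor that borrows the slide's name, and `ThreadPoolRunner` is a hypothetical stand-in for a distributed runner.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Transform = Callable[[int], int]


class DirectRunner:
    """Toy runner: applies each transform in-process, one element at a time."""

    def run(self, transforms: List[Transform], data: List[int]) -> List[int]:
        for t in transforms:
            data = [t(x) for x in data]
        return data


class ThreadPoolRunner:
    """Hypothetical stand-in for a distributed runner: maps elements in parallel."""

    def __init__(self, workers: int = 4) -> None:
        self.workers = workers

    def run(self, transforms: List[Transform], data: List[int]) -> List[int]:
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            for t in transforms:
                data = list(pool.map(t, data))  # map preserves input order
        return data


# The pipeline is defined once and is unaware of where it will run.
pipeline: List[Transform] = [lambda x: x + 1, lambda x: x * 2]
local = DirectRunner().run(pipeline, [1, 2, 3])
parallel = ThreadPoolRunner().run(pipeline, [1, 2, 3])
```

Keeping the pipeline free of execution details is what lets the in-memory runner serve for unit tests while a different runner handles production.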
SLIDE 21
Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...
SLIDE 22 Data lifecycle
(Diagram of stream and batch paths. Labels: Cloud Pub/Sub, Cloud Logs, Google Analytics Premium, Google Cloud Storage, Google App Engine; Cloud Dataflow; BigQuery Storage (tables), Cloud Storage (files); Cloud Dataflow, BigQuery Analytics (SQL); Real time analytics & alerts.)
SLIDE 23 Cloud Pub/Sub
- Many-to-many asynchronous messaging
- Fast and reliable
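Many-to-many asynchronous messaging can be illustrated with a toy in-memory broker. This is not the Cloud Pub/Sub API: the `Broker` class, its methods, and the topic and subscription names are all hypothetical, and the sketch ignores persistence, acknowledgements, and delivery guarantees.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class Broker:
    """Toy broker: each subscription gets its own copy of every message."""

    def __init__(self) -> None:
        self._subs_by_topic: Dict[str, List[str]] = defaultdict(list)
        self._queues: Dict[str, Deque[str]] = {}

    def subscribe(self, topic: str, subscription: str) -> None:
        self._subs_by_topic[topic].append(subscription)
        self._queues[subscription] = deque()

    def publish(self, topic: str, message: str) -> None:
        # The publisher does not wait for any consumer to read the message.
        for sub in self._subs_by_topic[topic]:
            self._queues[sub].append(message)

    def pull(self, subscription: str) -> List[str]:
        # Return and remove everything queued for this subscription.
        queue = self._queues[subscription]
        messages = list(queue)
        queue.clear()
        return messages


broker = Broker()
broker.subscribe("orders", "billing")
broker.subscribe("orders", "analytics")
broker.publish("orders", "order-1")  # from one publisher
broker.publish("orders", "order-2")  # from another publisher
billing = broker.pull("billing")
analytics = broker.pull("analytics")
```

Publishers only know the topic and consumers only know their subscription, so either side can be added or removed without changing the other.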
SLIDE 24 BigQuery
- Ingest data via streaming (100K rows/second/table) or file loader
- Process interactive SQL queries on TB or PB of data
- Zero administration; just upload data and send queries
- Pay for storage and query separately, based on actual usage
- Non-technical analysts can drive queries on massive datasets using BI
tools (e.g. Tableau)
- Highly Available: Data replication in multiple geographies.
- Secure and easy collaboration: access to data is controlled using
customer-owned ACLs
SLIDE 25 Hadoop and Spark
(Diagram: bdutil orchestration of a cluster with a Master Node and Work Nodes. Labels: Name Node (optional), HDFS (optional), Local SSD, PD SSD, PD standard. Connectors: GCS Connector, BigQuery Connector.)
SLIDE 26
Optimizing your time
SLIDE 27-35
Optimizing Your Time
SLIDE 36
References and follow-up
SLIDE 37 Getting Started
Cloud Dataflow
- Service: https://cloud.google.com/dataflow
- Questions: https://stackoverflow.com/questions/tagged/google-cloud-dataflow
- SDK: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
BigQuery
- https://cloud.google.com/bigquery/
Cloud Pub/Sub
- https://cloud.google.com/pubsub/
Hadoop and Spark
- https://cloud.google.com/hadoop/
Contact me
- Twitter: @vambenepe
- email: vbp@google.com
SLIDE 38
Thank You!
cloud.google.com