Google Cloud Dataflow, Cosmin Arad, Senior Software Engineer - PowerPoint PPT Presentation



SLIDE 1

Google Cloud Dataflow

Cosmin Arad, Senior Software Engineer carad@google.com August 7, 2015

SLIDE 2

Agenda

1. Dataflow Overview
2. Dataflow SDK Concepts (Programming Model)
3. Cloud Dataflow Service
4. Demo: Counting Words!
5. Questions and Discussion

SLIDE 3

History of Big Data at Google

Timeline, 2002 to 2013: GFS, MapReduce, BigTable, Dremel, Pregel, Flume, Colossus, Spanner, MillWheel, leading to Cloud Dataflow.

SLIDE 4

Big Data on Google Cloud Platform

• Capture: Pub/Sub, Logs, App Engine, BigQuery streaming
• Store: Cloud Storage (objects), Cloud Datastore (NoSQL), Cloud SQL (MySQL), BigQuery storage, BigTable (structured)
• Process: Dataflow (stream and batch), Hadoop + Spark (on GCE)
• Analyze: BigQuery, larger Hadoop ecosystem, Hadoop + Spark (on GCE)

SLIDE 5

• Cloud Dataflow is a collection of SDKs for building parallelized data processing pipelines.
• Cloud Dataflow is a managed service for executing parallelized data processing pipelines.

What is Cloud Dataflow?

SLIDE 6

Where might you use Cloud Dataflow?

• ETL
• Analysis
• Orchestration

SLIDE 7

Where might you use Cloud Dataflow?

• ETL: Movement, Filtering, Enrichment, Shaping
• Analysis: Reduction, Batch computation, Continuous computation
• Orchestration: Composition, External orchestration, Simulation

SLIDE 8

Dataflow SDK Concepts

(Programming Model)

SLIDE 9
SLIDE 10

Dataflow SDK(s)

• Easily construct parallelized data processing pipelines using an intuitive set of programming abstractions
  ○ Do what the user expects.
  ○ No knobs whenever possible.
  ○ Build for extensibility.
  ○ Unified batch & streaming semantics.
• Google supported and open sourced
  ○ Java 7 (public) @ github.com/GoogleCloudPlatform/DataflowJavaSDK
  ○ Python 2 (in progress)
• Community sourced
  ○ Scala @ github.com/darkjh/scalaflow
  ○ Scala @ github.com/jhlch/scala-dataflow-dsl

SLIDE 11

Dataflow Java SDK Release Process

Release cadence: weekly / monthly

SLIDE 12
Pipeline

• A directed graph of data processing transformations
• Optimized and executed as a unit
• May include multiple inputs and multiple outputs
• May encompass many logical MapReduce or MillWheel operations
• PCollections conceptually flow through the pipeline

SLIDE 13

Runners

• Specify how a pipeline should run
• Direct Runner
  ○ For local, in-memory execution; great for development and unit tests
• Cloud Dataflow Service
  ○ Batch mode: GCE instances poll for work items to execute.
  ○ Streaming mode: GCE instances are set up in a semi-permanent topology.
• Community sourced
  ○ Spark from Cloudera @ github.com/cloudera/spark-dataflow
  ○ Flink from dataArtisans @ github.com/dataArtisans/flink-dataflow
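In practice the runner is chosen through pipeline options on the command line. A hedged sketch of launching a job on the Cloud Dataflow service, assuming a Maven project and Dataflow Java SDK 1.x flag names (`com.example.MyPipeline`, the project ID, and the bucket are placeholders):

```shell
# Run the pipeline on the Cloud Dataflow service (batch mode).
# --runner selects the runner; DirectPipelineRunner would execute locally instead.
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=BlockingDataflowPipelineRunner \
               --project=my-gcp-project \
               --stagingLocation=gs://my-bucket/staging"
```

Swapping `--runner=DirectPipelineRunner` (and dropping the cloud flags) runs the same code locally for development and unit tests.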

SLIDE 14

Example: #HashTag Autocompletion

SLIDE 15

Tweets → Read → ExtractTags → Count → ExpandPrefixes → Top(3) → Write → Predictions

Read: {Go Hawks #Seahawks!, #Seattle works museum pass. Free! Go #PatriotsNation! Having fun at #seaside, … }
ExtractTags: {seahawks, seattle, patriotsnation, lovemypats, ...}
Count: {seahawks->5M, seattle->2M, patriots->9M, ...}
ExpandPrefixes: {d->(deflategate, 10M), d->(denver, 2M), …, sea->(seahawks, 5M), sea->(seaside, 2M), ...}
Top(3): {d->[deflategate, desafiodatransa, djokovic], de->[deflategate, desafiodatransa, dead50], ...}

SLIDE 16

Tweets → Read → ExtractTags → Count → ExpandPrefixes → Top(3) → Write → Predictions

Pipeline p = Pipeline.create();
p.begin()
  .apply(TextIO.Read.from("gs://…"))
  .apply(ParDo.of(new ExtractTags()))
  .apply(Count.perElement())
  .apply(ParDo.of(new ExpandPrefixes()))
  .apply(Top.largestPerKey(3))
  .apply(TextIO.Write.to("gs://…"));
p.run();

SLIDE 17

Pipeline

  • Directed graph of steps operating on data

Pipeline p = Pipeline.create();
p.run();

Dataflow Basics

SLIDE 18

Pipeline

  • Directed graph of steps operating on data

Data

• PCollection
• Immutable collection of same-typed elements that can be encoded
• PCollectionTuple, PCollectionList

Pipeline p = Pipeline.create();
p.begin()
  .apply(TextIO.Read.from("gs://…"))
  .apply(TextIO.Write.to("gs://…"));
p.run();

Dataflow Basics

SLIDE 19

Pipeline

  • Directed graph of steps operating on data

Data

• PCollection
• Immutable collection of same-typed elements that can be encoded
• PCollectionTuple, PCollectionList

Transformation

• Step that operates on data
• Core transforms: ParDo, GroupByKey, Combine, Flatten
• Composite and custom transforms

Pipeline p = Pipeline.create();
p.begin()
  .apply(TextIO.Read.from("gs://…"))
  .apply(ParDo.of(new ExtractTags()))
  .apply(Count.perElement())
  .apply(ParDo.of(new ExpandPrefixes()))
  .apply(Top.largestPerKey(3))
  .apply(TextIO.Write.to("gs://…"));
p.run();

Dataflow Basics

SLIDE 20

Pipeline p = Pipeline.create();
p.begin()
  .apply(TextIO.Read.from("gs://…"))
  .apply(ParDo.of(new ExtractTags()))
  .apply(Count.perElement())
  .apply(ParDo.of(new ExpandPrefixes()))
  .apply(Top.largestPerKey(3))
  .apply(TextIO.Write.to("gs://…"));
p.run();

class ExpandPrefixes … {
  ...
  public void processElement(ProcessContext c) {
    String word = c.element().getKey();
    for (int i = 1; i <= word.length(); i++) {
      String prefix = word.substring(0, i);
      c.output(KV.of(prefix, c.element()));
    }
  }
}

Dataflow Basics

SLIDE 21
PCollections

• A collection of data of type T in a pipeline
• May be either bounded or unbounded in size
• Created by using a PTransform to:
  • Build from a java.util.Collection
  • Read from a backing data store
  • Transform an existing PCollection
• Often contain key-value pairs using KV<K, V>

{..., "NFC Champions #GreenBay", "Green Bay #superbowl!", ... "#GoHawks", ...} → {Seahawks, NFC, Champions, Seattle, ...}


SLIDE 22
• Read from standard Google Cloud Platform data sources
  • GCS, Pub/Sub, BigQuery, Datastore, ...
• Write your own custom source by teaching Dataflow how to read it in parallel
• Write to standard Google Cloud Platform data sinks
  • GCS, BigQuery, Pub/Sub, Datastore, …
• Can use a combination of text, JSON, XML, Avro formatted data

Your Source/Sink Here

Inputs & Outputs

SLIDE 23
• A Coder<T> explains how an element of type T can be written to disk or communicated between machines
• Every PCollection<T> needs a valid coder in case the service decides to communicate those values between machines
• Encoded values are used to compare keys, so encodings need to be deterministic
• Avro coder inference can infer a coder for many basic Java objects

Coders
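The determinism requirement can be illustrated outside the SDK: if the service compares keys by their encoded bytes, the same logical key must always encode to the same byte sequence. A plain-Java sketch (not Dataflow SDK code; UTF-8 string encoding stands in for a Coder):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CoderDemo {
    // Stand-in for Coder<String>.encode: deterministic UTF-8 bytes.
    static byte[] encode(String key) {
        return key.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Grouping by key only works because equal keys encode identically.
        System.out.println(Arrays.equals(encode("seahawks"), encode("seahawks"))); // true
        System.out.println(Arrays.equals(encode("seahawks"), encode("seattle")));  // false
    }
}
```

A non-deterministic encoding (say, one that embedded a timestamp) would scatter values for the same key across groups.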

SLIDE 24
ParDo (“Parallel Do”)

• Processes each element of a PCollection independently using a user-provided DoFn

LowerCase: {Seahawks, NFC, Champions, Seattle, ...} → {seahawks, nfc, champions, seattle, ...}

SLIDE 25
ParDo (“Parallel Do”)

• Processes each element of a PCollection independently using a user-provided DoFn

LowerCase: {Seahawks, NFC, Champions, Seattle, ...} → {seahawks, nfc, champions, seattle, ...}

PCollection<String> tweets = …;
tweets.apply(ParDo.of(
    new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().toLowerCase());
      }
    }));

SLIDE 26
ParDo (“Parallel Do”)

• Processes each element of a PCollection independently using a user-provided DoFn

FilterOutSWords: {Seahawks, NFC, Champions, Seattle, ...} → {NFC, Champions, ...}

SLIDE 27
ParDo (“Parallel Do”)

• Processes each element of a PCollection independently using a user-provided DoFn

ExpandPrefixes: {Seahawks, NFC, Champions, Seattle, ...} → {s, se, sea, seah, seaha, seahaw, seahawk, seahawks, n, nf, nfc, c, ch, cha, cham, champ, champi, champio, champion, champions, s, se, sea, seat, seatt, seattl, seattle, ...}

SLIDE 28
ParDo (“Parallel Do”)

• Processes each element of a PCollection independently using a user-provided DoFn

KeyByFirstLetter: {Seahawks, NFC, Champions, Seattle, ...} → {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}

SLIDE 29
ParDo (“Parallel Do”)

• Processes each element of a PCollection independently using a user-provided DoFn
• Elements are processed in arbitrary ‘bundles’, e.g. “shards”
  • startBundle(), processElement()*, finishBundle()
• Supports arbitrary amounts of parallelization
• Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo → GBK → ParDo

KeyByFirstLetter: {Seahawks, NFC, Champions, Seattle, ...} → {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}

SLIDE 30
GroupByKey

• Takes a PCollection of key-value pairs and gathers up all values with the same key
• Corresponds to the shuffle phase in Hadoop

How do you do a GroupByKey on an unbounded PCollection?

{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...} → {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
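Conceptually, GroupByKey builds a multimap from each key to all of its values. A plain-Java sketch of that semantics (not SDK code):

```java
import java.util.*;

public class GroupByKeyDemo {
    // Conceptual GroupByKey: gather all values that share a key.
    static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, String> kv : pairs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> kvs = List.of(
            Map.entry("S", "Seahawks"), Map.entry("C", "Champions"),
            Map.entry("S", "Seattle"), Map.entry("N", "NFC"));
        System.out.println(groupByKey(kvs));
        // → {S=[Seahawks, Seattle], C=[Champions], N=[NFC]}
    }
}
```

On an unbounded PCollection this picture breaks down, because "all values for a key" never finishes arriving; that is exactly what the windowing on the next slide addresses.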

SLIDE 31
Windows

• Logically divides up or groups the elements of a PCollection into finite windows
  • Fixed windows: hourly, daily, …
  • Sliding windows
  • Sessions
• Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections
• Window.into() can be called at any point in the pipeline and will be applied when needed
• Can be tied to arrival time or custom event time

Diagram: Nighttime | Mid-Day | Nighttime
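Fixed-window assignment itself is simple arithmetic: each timestamped element lands in the window whose start is its timestamp rounded down to a multiple of the window size. A plain-Java sketch of that assignment (not SDK code):

```java
public class FixedWindowDemo {
    // Start of the fixed window containing the timestamp (both in millis).
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        // An event 100 seconds into the second hour falls in the window starting at 01:00:00.
        System.out.println(windowStart(3_700_000L, hour)); // 3600000
    }
}
```

Sliding windows and sessions are more involved (an element can belong to several windows, and session boundaries depend on gaps in the data), but the principle is the same: windowing turns an unbounded stream into finite groups that GroupByKey can complete.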

SLIDE 32

Event Time Skew

Chart: event time vs. processing time, showing the watermark and skew.

SLIDE 33
Triggers

• Determine when to emit elements into an aggregated window
• Provide flexibility for dealing with time skew and data lag
• Example use: deal with late-arriving data (someone was in the woods playing Candy Crush offline)
• Example use: get early results, before all the data in a given window has arrived (want to know # users per hour, with updates every 5 minutes)

SLIDE 34

Late & Speculative Results

PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(
            AfterEach.inOrder(
                Repeatedly.forever(
                    AfterProcessingTime.pastFirstElementInPane()
                        .alignedTo(Duration.standardMinutes(1)))
                    .orFinally(AfterWatermark.pastEndOfWindow()),
                Repeatedly.forever(
                    AfterPane.elementCountAtLeast(1)))
                .orFinally(AfterWatermark.pastEndOfWindow()
                    .plusDelayOf(Duration.standardDays(7))))
        .accumulatingFiredPanes())
    .apply(new Sum());

SLIDE 35
Composite PTransforms

• Define new PTransforms by building up subgraphs of existing transforms
• Some utilities are included in the SDK: Count, RemoveDuplicates, Join, Min, Max, Sum, ...
• You can define your own:
  • modularity, code reuse
  • better monitoring experience

Count as a composite: Pair With Ones → GroupByKey → Sum Values
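Collapsed into plain Java, the Count composite (Pair With Ones, GroupByKey, Sum Values) amounts to the following (a conceptual sketch, not SDK code):

```java
import java.util.*;

public class CountDemo {
    // Pair With Ones + GroupByKey + Sum Values, fused into one pass.
    static Map<String, Long> countPerElement(List<String> elements) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String e : elements) {
            counts.merge(e, 1L, Long::sum); // add this element's "one" to its key's running sum
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countPerElement(List.of("sea", "sea", "nfc")));
        // → {sea=2, nfc=1}
    }
}
```

In the SDK the three stages stay separate in the graph, which is what lets the service optimize them (e.g. combiner lifting) and show them as one named step in the monitoring UI.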

SLIDE 36

Composite PTransforms

SLIDE 37

Cloud Dataflow Service

SLIDE 38

Life of a Pipeline

Diagram: user code & SDK is deployed and scheduled onto the Google Cloud Platform managed service (job manager, work manager), which reports progress and logs back to the monitoring UI.

SLIDE 39
Cloud Dataflow Service Fundamentals

• Pipeline optimization: modular code, efficient execution
• Smart workers: lifecycle management, auto-scaling, and dynamic work rebalancing
• Easy monitoring: Dataflow UI, RESTful API and CLI, integration with Cloud Logging, Cloud Debugger, etc.

SLIDE 40

Graph Optimization

• ParDo fusion (producer-consumer, sibling) with intelligent fusion boundaries
• Combiner lifting, e.g. partial aggregations before reduction
• Flatten unzipping
• Reshard placement
• ...

SLIDE 41

Optimizer: ParallelDo Fusion

Diagram (legend: ParallelDo, GBK = GroupByKey, CombineValues): consumer-producer fusion merges a producing ParallelDo C and its consumer D into a single C+D step; sibling fusion merges ParallelDos C and D that read the same input into one C+D step.

SLIDE 42

Optimizer: Combiner Lifting

Diagram (legend: ParallelDo, GBK = GroupByKey, CombineValues): a CombineValues step B that follows a GroupByKey is split so that partial combining (A+) runs before the GBK on each worker, with the final combine (+B) after it.
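The payoff of combiner lifting can be shown with plain Java (a sketch of the idea, not the optimizer itself): each shard pre-combines locally, and merging the small per-shard partials yields the same totals as combining everything after the shuffle, while far less data crosses the GBK.

```java
import java.util.*;

public class CombinerLiftingDemo {
    // Per-shard partial sums (the lifted "A+" stage before the GBK).
    static Map<String, Long> partialSum(List<Map.Entry<String, Long>> shard) {
        Map<String, Long> partial = new HashMap<>();
        for (Map.Entry<String, Long> kv : shard) {
            partial.merge(kv.getKey(), kv.getValue(), Long::sum);
        }
        return partial;
    }

    // Merge partials into final sums (the "+B" stage after the GBK).
    static Map<String, Long> mergePartials(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> p : partials) {
            p.forEach((k, v) -> total.merge(k, v, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> shard1 = List.of(Map.entry("sea", 2L), Map.entry("nfc", 1L));
        List<Map.Entry<String, Long>> shard2 = List.of(Map.entry("sea", 3L));
        Map<String, Long> total = mergePartials(List.of(partialSum(shard1), partialSum(shard2)));
        System.out.println(total.get("sea")); // 5
    }
}
```

This only works because sum is associative and commutative, which is exactly the contract a Combine function must satisfy.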

SLIDE 43
SLIDE 44
SLIDE 45

Worker Lifecycle Management

Deploy → Schedule & Monitor → Tear Down

SLIDE 46

Worker Pool Auto-Scaling

Chart: the worker pool is sized to the load at 50, 800, 1,200, and 5,000 RPS.

SLIDE 47

Dynamic Work Rebalancing

Chart: 100 mins. vs. 65 mins. total execution time, without vs. with dynamic work rebalancing.

SLIDE 48

Monitoring

SLIDE 49
SLIDE 50

Pipeline management

• Validation
• Pipeline execution graph optimizations
• Dynamic and adaptive sharding of computation stages
• Monitoring UI

Cloud resource management

• Spin up worker VMs
• Set up logging
• Manage exports
• Tear down

Fully-managed Service

SLIDE 51

Ease of use

  • No performance tuning required
  • Highly scalable, performant out of the box
  • Novel techniques to lower e2e execution time
  • Intuitive programming model + Java SDK
  • No dichotomy between batch and streaming processing
  • Integrated with GCP (VMs, GCS, BigQuery, Datastore, …)

Total Cost of Ownership

  • Benefits from GCE’s cost model

Benefits of Dataflow on Google Cloud Platform

SLIDE 52

Optimizing your time: no-ops, no-knobs, zero-config

More time to dig into your data:

• Typical data processing: programming, resource provisioning, performance tuning, monitoring, reliability, deployment & configuration, handling growing scale, utilization improvements
• Data processing with Google Cloud Dataflow: programming

SLIDE 53

Demo: Counting Words!

SLIDE 54
• WordCount code: see the SDK concepts in action
• Running on the Dataflow service
• Monitoring job progress in the Dataflow Monitoring UI
• Looking at worker logs in Cloud Logging
• Using the CLI

Highlights from the live demo...
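The core computation the WordCount demo parallelizes can be written in a few lines of plain Java (a sketch of the logic only, not the Dataflow pipeline; the tokenization regex is an assumption):

```java
import java.util.*;

public class WordCountDemo {
    // Split lines into words and count occurrences, as the demo pipeline does at scale.
    static Map<String, Long> wordCount(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("[^a-z']+")) {
                if (!word.isEmpty()) counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the cat and the hat")));
        // → {and=1, cat=1, hat=1, the=2}
    }
}
```

In the demo pipeline, the tokenization becomes a ParDo and the tally becomes the Count composite, so the service can fuse, shard, and rebalance the work across workers.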

SLIDE 55

Questions and Discussion

SLIDE 56

Getting Started

❯ cloud.google.com/dataflow/getting-started
❯ github.com/GoogleCloudPlatform/DataflowJavaSDK
❯ stackoverflow.com/questions/tagged/google-cloud-dataflow