Dataflow/Apache Beam
A Unified Model for Batch and Streaming Data Processing
Eugene Kirpichov, Google
STREAM 2016
Agenda
1 Google's Data Processing Story
2 Philosophy of the Beam programming model
3 Apache Beam project
1
[Timeline figure, 2002 to 2016: MapReduce, GFS, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow]
MapReduce: (distributed input dataset) → Map (SELECT) → Shuffle (GROUP BY) → Reduce (SELECT) → (distributed output dataset)
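The Map, Shuffle, and Reduce stages can be sketched in memory with plain Java. This is only an illustration of the three stages on a word-count job, not Google's MapReduce API; the class name MapReduceSketch and its method names are assumptions made for this example, and it assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of Map -> Shuffle -> Reduce.
final class MapReduceSketch {
    /** Map (SELECT): turn each input line into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    /** Shuffle (GROUP BY): collect all values that share a key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce (SELECT): collapse each group into a single output value. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("to be or not", "to be");
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(reduce(shuffle(map(input))));
    }
}
```

In a real system each stage runs distributed across many machines; the sketch only shows the data shapes that flow between the stages.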
// Collection of raw events
PCollection<SensorEvent> raw = ...;

// Element-wise extract location/temperature pairs
PCollection<KV<String, Double>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Double>> output =
    input.apply(Mean.<String, Double>perKey());

// Write output
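What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?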
[Figures: data arriving on Tuesday, Wednesday, Thursday; latency; MapReduce over Tuesday and Wednesday; per-user activity (Jose, Lisa, Ingo, Asha, Cheryl, Ari) across the Tuesday/Wednesday boundary; time axis from 1:00 to 14:00]
Historical events → Periodic batch processing → Exact historical model
Continuous updates → Stream processing system → Approximate real-time model
data-processing applications
computations to be performed
persistent flow of elements
Correctness Latency
What Where When How
[Figure: event time vs. processing time, with the watermark]
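The relationship between event time, processing time, and a watermark can be illustrated with a small plain-Java sketch. It assumes a very simple heuristic watermark (the largest event time seen so far minus a fixed skew); this heuristic, the class name WatermarkSketch, and the Event type are assumptions for illustration and are not how any particular runner computes its watermark.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: elements arrive in processing-time order,
// but each carries its own event-time timestamp.
final class WatermarkSketch {
    /** Hypothetical event; only its event-time timestamp matters here. */
    static final class Event {
        final String id;
        final long eventTimeMillis;

        Event(String id, long eventTimeMillis) {
            this.id = id;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    /**
     * Walks the events in arrival (processing-time) order and returns the ids
     * of events whose event time is already behind the watermark when they arrive.
     */
    static List<String> lateEvents(List<Event> arrivalOrder, long maxSkewMillis) {
        List<String> late = new ArrayList<>();
        long watermark = Long.MIN_VALUE;
        for (Event e : arrivalOrder) {
            if (e.eventTimeMillis < watermark) {
                late.add(e.id);
            }
            // Assumed heuristic: the watermark trails the newest event time
            // by maxSkewMillis and never moves backwards.
            watermark = Math.max(watermark, e.eventTimeMillis - maxSkewMillis);
        }
        return late;
    }

    public static void main(String[] args) {
        List<Event> arrivals = List.of(
            new Event("a", 1_000L),
            new Event("b", 5_000L),
            new Event("c", 2_000L),   // arrives after the watermark reached 4000
            new Event("d", 4_500L));
        System.out.println(lateEvents(arrivals, 1_000L)); // prints [c]
    }
}
```

A larger skew makes fewer elements late but delays results; a smaller skew does the opposite. That is the correctness versus latency trade-off in miniature.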
PCollection<KV<String, Integer>> output = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .trigger(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetracting())
    .apply(new Sum());
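What session windowing with a gap duration does to the elements of a single key can be sketched in plain Java. The sketch assumes each element initially occupies a window from its timestamp to timestamp plus gap, and that overlapping windows merge; the class name SessionSketch and its method are hypothetical, and triggers and retractions are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of gap-based session windows for one key.
final class SessionSketch {
    /**
     * Groups event timestamps into sessions. An element starts a new session
     * when it is at least gapMillis after the previous element, i.e. when its
     * window [t, t + gap) no longer overlaps the previous element's window.
     */
    static List<List<Long>> sessions(long[] eventTimesMillis, long gapMillis) {
        long[] sorted = eventTimesMillis.clone();
        Arrays.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        long previous = 0L;
        for (long t : sorted) {
            if (current == null || t - previous >= gapMillis) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(t);
            previous = t;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] times = {100_000L, 0L, 30_000L, 120_000L};
        // With a one-minute gap this prints [[0, 30000], [100000, 120000]]
        System.out.println(sessions(times, 60_000L));
    }
}
```

Sorting first stands in for the out-of-order arrival that a streaming system handles by merging windows as elements show up.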
Cloud Dataflow
○ On your development machine
○ On the Dataflow Service on Google Cloud Platform
○ On third-party environments like Spark or Flink
3
Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
    // Split each line into words
    .apply(FlatMapElements.via(
        line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
    // Drop the empty strings that the split can produce
    .apply(Filter.byPredicate(word -> !word.isEmpty()))
    // Count occurrences of each distinct word
    .apply(Count.perElement())
    // Format each (word, count) pair as a line of text
    .apply(MapElements.via(
        count -> count.getKey() + ": " + count.getValue()))
    .apply(TextIO.Write.to("gs://.../..."));

p.run();
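The same sequence of steps (split, filter, count, format) can be tried locally without any Beam dependency. The following is a minimal sketch using Java streams on an in-memory list; the class name WordCountSketch is an assumption for this example, and the sketch only mirrors the logic of the pipeline, not its distributed execution.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory counterpart of the word-count pipeline.
final class WordCountSketch {
    static List<String> formatCounts(List<String> lines) {
        Map<String, Long> counts = lines.stream()
            // Split each line on runs of characters that are not letters or apostrophes
            .flatMap(line -> Arrays.stream(line.split("[^a-zA-Z']+")))
            // A line that starts with a separator yields a leading empty token
            .filter(word -> !word.isEmpty())
            // Count occurrences per distinct word, sorted by word
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
        return counts.entrySet().stream()
            .map(e -> e.getKey() + ": " + e.getValue())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints [To: 1, be: 2, not: 1, or: 1, to: 1]
        System.out.println(formatCounts(List.of("  To be, or not to be")));
    }
}
```

The leading spaces in the sample line show why the empty-word filter is needed: the split produces an empty first token whenever a line begins with a separator.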
End-user's pipeline
Libraries: transforms, sources/sinks etc.
Language-specific SDK (Java, Python, ...)
Beam model (ParDo, GBK, Windowing…)
Runner
Execution environment
02/01/2016: Enter Apache Incubator
02/25/2016: 1st commit to ASF repository
Early 2016: Design for use cases, begin refactoring
Mid 2016: Slight chaos
Late 2016: Multiple runners execute Beam pipelines
with shared pipeline representation
implementing the model
run language-specific code
[Diagram: Beam Java, Beam Python, Other Languages → Beam Model: Pipeline Construction → Runner A, Runner B, Cloud Dataflow → Beam Model: Fn Runners → Execution]
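The idea of one pipeline representation executed by interchangeable runners can be illustrated with a toy sketch. Everything here (ToyPipeline, ToyRunner, and the two runner instances) is hypothetical and far simpler than Beam's actual pipeline and runner interfaces; it only shows that the pipeline definition is independent of how it is executed.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical toy model: one pipeline description, several ways to run it.
final class RunnerSketch {
    /** Toy pipeline representation: an ordered list of element-wise steps. */
    static final class ToyPipeline {
        final List<Function<String, String>> steps;

        ToyPipeline(List<Function<String, String>> steps) {
            this.steps = steps;
        }

        /** Applies every step, in order, to a single element. */
        String applyAll(String element) {
            String value = element;
            for (Function<String, String> step : steps) {
                value = step.apply(value);
            }
            return value;
        }
    }

    /** Toy runner: decides how the pipeline is executed, not what it computes. */
    interface ToyRunner {
        List<String> run(ToyPipeline pipeline, List<String> input);
    }

    /** Runs elements one after another on the calling thread. */
    static final ToyRunner SEQUENTIAL = (pipeline, input) ->
        input.stream().map(pipeline::applyAll).collect(Collectors.toList());

    /** Runs elements on a parallel stream; results keep the input order. */
    static final ToyRunner PARALLEL = (pipeline, input) ->
        input.parallelStream().map(pipeline::applyAll).collect(Collectors.toList());

    public static void main(String[] args) {
        ToyPipeline pipeline = new ToyPipeline(List.of(String::trim, String::toUpperCase));
        List<String> input = List.of(" beam ", "dataflow ");
        System.out.println(SEQUENTIAL.run(pipeline, input)); // prints [BEAM, DATAFLOW]
        System.out.println(PARALLEL.run(pipeline, input));   // prints [BEAM, DATAFLOW]
    }
}
```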
2004 MapReduce (SELECT / GROUP BY): Library > DSL; abstract away fault tolerance & distribution
2010 FlumeJava: High-level API (typed DAG)
2013 MillWheel: Deterministic stream processing
2015 Dataflow: Unified batch/streaming model; Windowing, Triggers, Retractions
2016 Beam: Portable programming model; language-agnostic runners
Programming model
○ The World Beyond Batch: Streaming 101, Streaming 102
○ The Dataflow Model paper
Cloud Dataflow
○ http://cloud.google.com/dataflow/
Apache Beam
○ https://wiki.apache.org/incubator/BeamProposal
○ http://beam.incubator.apache.org/
Dataflow/Beam vs. Spark