Dataflow/Apache Beam: A Unified Model for Batch and Streaming Data



slide-1
SLIDE 1

Dataflow/Apache Beam

A Unified Model for Batch and Streaming Data Processing

Eugene Kirpichov, Google

STREAM 2016

slide-2
SLIDE 2

Agenda

1 Google’s Data Processing Story
2 Philosophy of the Beam programming model
3 Apache Beam project

slide-3
SLIDE 3

Google’s Data Processing Story

1

slide-4
SLIDE 4

Data Processing @ Google

[Timeline figure, 2002 to 2016: MapReduce, GFS, Bigtable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow]

slide-5
SLIDE 5

MapReduce: SELECT + GROUP BY

(distributed input dataset) → Map (SELECT) → Shuffle (GROUP BY) → Reduce (SELECT) → (distributed output dataset)
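
The three phases on this slide can be sketched on a single machine in plain Java (9 or later). This is an illustration only, not Google's MapReduce API: the class name `MapReduceSketch`, the `"location,temperature"` input format, and the choice of a maximum as the reduce function are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine sketch of Map -> Shuffle -> Reduce.
final class MapReduceSketch {
    /** Map (SELECT): turn one "location,temperature" line into a key/value pair. */
    static Map.Entry<String, Double> map(String line) {
        String[] parts = line.split(",");
        return Map.entry(parts[0].trim(), Double.parseDouble(parts[1].trim()));
    }

    /** Shuffle (GROUP BY): collect all values that share a key. */
    static Map<String, List<Double>> shuffle(List<Map.Entry<String, Double>> pairs) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (Map.Entry<String, Double> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce (SELECT): collapse one group to a single value, here its maximum. */
    static double reduce(List<Double> values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    /** Runs the three phases in order over an in-memory input. */
    static Map<String, Double> run(List<String> input) {
        List<Map.Entry<String, Double>> mapped = new ArrayList<>();
        for (String line : input) {
            mapped.add(map(line));
        }
        Map<String, Double> output = new TreeMap<>();
        for (Map.Entry<String, List<Double>> group : shuffle(mapped).entrySet()) {
            output.put(group.getKey(), reduce(group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // prints {attic=31.0, kitchen=22.5}
        System.out.println(run(List.of("kitchen,20.5", "attic,31.0", "kitchen,22.5")));
    }
}
```

In the real system each phase is distributed across machines; the sketch only shows how the data changes shape between phases.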

slide-6
SLIDE 6

Data Processing @ Google

[Timeline figure, 2002 to 2016: MapReduce, GFS, Bigtable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow]

slide-7
SLIDE 7

FlumeJava Pipelines

  • A Pipeline represents a graph of data processing transformations
  • PCollections flow through the pipeline
  • Optimized and executed as a unit for efficiency

slide-8
SLIDE 8

Example: Computing mean temperature

// Collection of raw events
PCollection<SensorEvent> raw = ...;

// Element-wise extract location/temperature pairs
PCollection<KV<String, Double>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Double>> output =
    input.apply(Mean.<Double>perKey());

// Write output
output.apply(BigtableIO.Write.to(...));

What Where When How

slide-9
SLIDE 9

So, people used FJ to process data...

slide-10
SLIDE 10

...big data...

slide-11
SLIDE 11

...really, really big...

Tuesday Wednesday Thursday

slide-12
SLIDE 12

Batch failure mode #1: Latency

slide-13
SLIDE 13

Batch failure mode #2: Sessions

[Figure: MapReduce over Tuesday and Wednesday; per-user activity for Jose, Lisa, Ingo, Asha, Cheryl, and Ari]
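
One way to read this failure mode: a session that straddles the boundary between two daily batches is counted as two sessions. The sketch below illustrates that reading in plain Java. The class name `SessionSplitSketch`, the timestamps (minutes since Tuesday 00:00), and the 10-minute gap are hypothetical values chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: sessionizing one user's activity, with and without a daily cut.
final class SessionSplitSketch {
    /** Counts sessions in sorted timestamps; a gap larger than gapMinutes starts a new session. */
    static int countSessions(List<Integer> sortedMinutes, int gapMinutes) {
        int sessions = 0;
        Integer previous = null;
        for (int t : sortedMinutes) {
            if (previous == null || t - previous > gapMinutes) {
                sessions++;
            }
            previous = t;
        }
        return sessions;
    }

    /** Daily batch: cut the events at dayBoundary and sessionize each day on its own. */
    static int countSessionsPerDailyBatch(List<Integer> sortedMinutes, int gapMinutes, int dayBoundary) {
        List<Integer> firstDay = new ArrayList<>();
        List<Integer> secondDay = new ArrayList<>();
        for (int t : sortedMinutes) {
            if (t < dayBoundary) {
                firstDay.add(t);
            } else {
                secondDay.add(t);
            }
        }
        return countSessions(firstDay, gapMinutes) + countSessions(secondDay, gapMinutes);
    }

    public static void main(String[] args) {
        // Activity at 23:50, 23:55, 00:02, 00:07; midnight is minute 1440.
        List<Integer> activity = List.of(1430, 1435, 1442, 1447);
        System.out.println(countSessions(activity, 10));                    // prints 1
        System.out.println(countSessionsPerDailyBatch(activity, 10, 1440)); // prints 2
    }
}
```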

slide-14
SLIDE 14

Continuous & Unbounded

[Figure: continuous timeline with hourly marks from 1:00 to 14:00]

slide-15
SLIDE 15

State of the art until recently: Lambda Architecture

  • Historical events → Periodic batch processing → Exact historical model
  • Continuous updates → Stream processing system → Approximate real-time model

slide-16
SLIDE 16

Data Processing @ Google

[Timeline figure, 2002 to 2016: MapReduce, GFS, Bigtable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow]

slide-17
SLIDE 17

MillWheel: Deterministic, low-latency streaming

  • Framework for building low-latency data-processing applications
  • User provides a DAG of computations to be performed
  • System manages state and persistent flow of elements

slide-18
SLIDE 18

Streaming or Batch?

1 + 1 = 2

Correctness Latency

Why not both?

slide-19
SLIDE 19

  • What are you computing?
  • Where in event time?
  • When in processing time?
  • How do refinements relate?

slide-20
SLIDE 20

Where in event time?

What Where When How

  • Windowing divides data into event-time-based finite chunks.
  • Required when doing aggregations over unbounded data.
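
As a minimal sketch of the windowing idea above, the plain-Java example below assigns each element to a fixed window by its event time and sums per window. It does not use the Beam SDK; `FixedWindowSketch`, `Event`, and the one-minute window size are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: fixed event-time windows over an out-of-order stream.
final class FixedWindowSketch {
    /** One element: when it happened (event time) and a value to aggregate. */
    static final class Event {
        final long eventTimeMillis;
        final int value;

        Event(long eventTimeMillis, int value) {
            this.eventTimeMillis = eventTimeMillis;
            this.value = value;
        }
    }

    /** Start of the fixed window that contains the given event time. */
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, windowSizeMillis);
    }

    /** Sums values per event-time window; the arrival order of events does not matter. */
    static Map<Long, Integer> sumPerWindow(List<Event> events, long windowSizeMillis) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (Event e : events) {
            sums.merge(windowStart(e.eventTimeMillis, windowSizeMillis), e.value, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // The third event arrives last but belongs to the first window.
        List<Event> arrivals = List.of(
            new Event(5_000, 3), new Event(65_000, 4), new Event(59_000, 2));
        System.out.println(sumPerWindow(arrivals, 60_000)); // prints {0=5, 60000=4}
    }
}
```

Because the window is derived from the event's own timestamp, each aggregation covers a finite chunk even though the input never ends.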
slide-21
SLIDE 21

What Where When How

When in Processing Time?

  • Triggers control when results are emitted.
  • Triggers are often relative to the watermark.

[Figure: the watermark plotted on Processing Time and Event Time axes]
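
The relationship between a trigger and the watermark can be sketched for a single window in plain Java. This is a simplified illustration under stated assumptions, not Beam's trigger implementation: the class `WatermarkTriggerSketch` and its methods are hypothetical, early firings are left out, and every late element causes its own firing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one window, an on-time firing at the watermark, then late firings.
final class WatermarkTriggerSketch {
    private final long windowEnd;
    private final List<String> emitted = new ArrayList<>();
    private int sum = 0;
    private boolean onTimeFired = false;

    WatermarkTriggerSketch(long windowEnd) {
        this.windowEnd = windowEnd;
    }

    /** An element belonging to this window arrives (in processing-time order). */
    void onElement(int value) {
        sum += value;
        if (onTimeFired) {
            // The watermark already passed the window: this element is late, so fire again.
            emitted.add("late:" + sum);
        }
    }

    /** The watermark, the system's estimate of event-time progress, advances. */
    void onWatermark(long watermark) {
        if (!onTimeFired && watermark >= windowEnd) {
            onTimeFired = true;
            emitted.add("on-time:" + sum);
        }
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        WatermarkTriggerSketch window = new WatermarkTriggerSketch(60);
        window.onElement(3);
        window.onWatermark(30); // window still open, nothing emitted
        window.onElement(4);
        window.onWatermark(60); // watermark reaches the window end
        window.onElement(2);    // arrives after the watermark: late
        System.out.println(window.emitted()); // prints [on-time:7, late:9]
    }
}
```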

slide-22
SLIDE 22

How do refinements relate?

PCollection<KV<String, Integer>> output = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .trigger(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetracting())
    .apply(new Sum());

What Where When How
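
When a trigger fires more than once for a window, each firing produces a pane, and the question is how later panes relate to earlier ones. The plain-Java sketch below contrasts three possible answers for a sum. It is an illustration under assumptions: the class `RefinementModesSketch`, the output strings, and the input (the sum of the new elements in each pane) are invented for the example and are not Beam API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: three ways successive panes of a sum can relate to each other.
final class RefinementModesSketch {
    /** Discarding: each pane reports only the elements that arrived since the last firing. */
    static List<String> discarding(int[] paneSums) {
        List<String> out = new ArrayList<>();
        for (int s : paneSums) {
            out.add("emit " + s);
        }
        return out;
    }

    /** Accumulating: each pane reports the running total and supersedes the previous pane. */
    static List<String> accumulating(int[] paneSums) {
        List<String> out = new ArrayList<>();
        int total = 0;
        for (int s : paneSums) {
            total += s;
            out.add("emit " + total);
        }
        return out;
    }

    /** Accumulating and retracting: the previous total is explicitly withdrawn before the new one. */
    static List<String> accumulatingAndRetracting(int[] paneSums) {
        List<String> out = new ArrayList<>();
        int total = 0;
        boolean first = true;
        for (int s : paneSums) {
            if (!first) {
                out.add("retract " + total);
            }
            total += s;
            out.add("emit " + total);
            first = false;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] panes = {3, 4, 2};
        System.out.println(discarding(panes));                // prints [emit 3, emit 4, emit 2]
        System.out.println(accumulating(panes));              // prints [emit 3, emit 7, emit 9]
        System.out.println(accumulatingAndRetracting(panes)); // prints [emit 3, retract 3, emit 7, retract 7, emit 9]
    }
}
```

Retractions matter most when a downstream step regroups the results, since it then needs to know which earlier value to undo.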

slide-23
SLIDE 23

Data Processing @ Google

[Timeline figure, 2002 to 2016: MapReduce, GFS, Bigtable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow]

slide-24
SLIDE 24

Cloud Dataflow

A fully-managed cloud service and programming model for batch and streaming big data processing.

Google Cloud Dataflow

slide-25
SLIDE 25
slide-26
SLIDE 26

Dataflow SDK

  • Portable API to construct and run a pipeline.
  • Available in Java and Python (alpha)
  • Pipelines can run:
      ○ On your development machine
      ○ On the Dataflow Service on Google Cloud Platform
      ○ On third-party environments like Spark or Flink

slide-27
SLIDE 27

Dataflow ⇒ Apache Beam

3

slide-28
SLIDE 28

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(
     word -> Arrays.asList(word.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(
     count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://.../..."));
p.run();
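
To see what this pipeline computes without a Beam runner, the same steps can be sketched on an in-memory list with `java.util.stream`. This is only a local analogy: the class `WordCountSketch` is hypothetical, and sorting the results alphabetically is a choice made here so the output is predictable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory analogue of the word-count pipeline on the slide.
final class WordCountSketch {
    static List<String> countWords(List<String> lines) {
        Map<String, Long> counts = lines.stream()
            // FlatMapElements: split every line into words
            .flatMap(line -> Arrays.stream(line.split("[^a-zA-Z']+")))
            // Filter: drop the empty strings that split() can produce
            .filter(word -> !word.isEmpty())
            // Count.perElement: count occurrences of each distinct word
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));

        // MapElements: format each (word, count) pair as text
        return counts.entrySet().stream()
            .map(count -> count.getKey() + ": " + count.getValue())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("a b a", " b"))); // prints [a: 2, b: 2]
    }
}
```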

slide-29
SLIDE 29

Apache Beam ecosystem

  • End-user's pipeline
  • Libraries: transforms, sources/sinks etc.
  • Language-specific SDK (Java, Python, ...)
  • Beam model (ParDo, GBK, Windowing…)
  • Runner
  • Execution environment

slide-30
SLIDE 30

Apache Beam ecosystem

  • End-user's pipeline
  • Libraries: transforms, sources/sinks etc.
  • Language-specific SDK (Java, Python, ...)
  • Beam model (ParDo, GBK, Windowing…)
  • Runner
  • Execution environment

slide-31
SLIDE 31

Apache Beam Roadmap

  • 02/01/2016: Enter Apache Incubator
  • 02/25/2016: 1st commit to ASF repository
  • Early 2016: Design for use cases, begin refactoring
  • Mid 2016: Slight chaos
  • Late 2016: Multiple runners execute Beam pipelines

slide-32
SLIDE 32

Runner capability matrix

slide-33
SLIDE 33

Technical Vision: Still more modular

  • Multiple SDKs with shared pipeline representation
  • Language-agnostic runners implementing the model
  • Fn Runners run language-specific code

[Diagram: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Runner A, Runner B, Cloud Dataflow; Beam Model: Fn Runners; Execution (one per runner)]

slide-34
SLIDE 34

Recap: Timeline of ideas

  • 2004 MapReduce (SELECT / GROUP BY): Library > DSL; abstract away fault tolerance & distribution
  • 2010 FlumeJava: High-level API (typed DAG)
  • 2013 MillWheel: Deterministic stream processing
  • 2015 Dataflow: Unified batch/streaming model; Windowing, Triggers, Retractions
  • 2016 Beam: Portable programming model; language-agnostic runners

slide-35
SLIDE 35

Learn More!

  • Programming model: The World Beyond Batch: Streaming 101, Streaming 102; The Dataflow Model paper
  • Cloud Dataflow: http://cloud.google.com/dataflow/
  • Apache Beam: https://wiki.apache.org/incubator/BeamProposal, http://beam.incubator.apache.org/
  • Dataflow/Beam vs. Spark

slide-36
SLIDE 36

Google confidential │ Do not distribute

Thank you
