Eugene Kirpichov <kirpichov@google.com> Staff Software Engineer
A Whirlwind Overview of Apache Beam
Google Cloud Platform 2
(2004) MapReduce: SELECT + GROUPBY
(2008) FlumeJava: High-level API
(2013) MillWheel: Deterministic streaming
(2014) Dataflow: Unified batch/streaming, portable
(2016) Apache Beam: Open ecosystem, community-driven, vendor-independent
Read text files → Split into words → Count → Format → Write text files
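Pipeline p = Pipeline.create(options);

// Read text files.
PCollection<String> lines = p.apply(TextIO.read().from("gs://.../*"));

// Split each line into words, then count occurrences of each word.
PCollection<KV<String, Long>> wordCounts = lines
    .apply(FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split("\\W+"))))
    .apply(Count.perElement());

// Format each count as "word: n" and write text files.
wordCounts
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((KV<String, Long> count) -> count.getKey() + ": " + count.getValue()))
    .apply(TextIO.write().to("gs://.../..."));

p.run();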
Beam PTransforms
ParDo ("map", applies a DoFn)
GroupByKey ("reduce")
Composite
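To make the three kinds of transform concrete, here is a minimal plain-Java analogy. It deliberately avoids the Beam API: the class `PrimitivesSketch` and its methods `parDo`, `groupByKey`, and `countWords` are hypothetical names invented for this sketch, and everything runs in memory over lists, which a real runner would not do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java analogy of the primitive transforms; NOT the Beam API.
final class PrimitivesSketch {
    // "ParDo": apply fn to each element; each call may emit zero or more outputs.
    static <A, B> List<B> parDo(List<A> input, Function<A, List<B>> fn) {
        List<B> out = new ArrayList<>();
        for (A a : input) {
            out.addAll(fn.apply(a));
        }
        return out;
    }

    // "GroupByKey": (K, V) pairs in, one (K, V[]) entry per distinct key out.
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> input) {
        Map<K, List<V>> out = new TreeMap<>();
        for (Map.Entry<K, V> e : input) {
            out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return out;
    }

    // "Composite": a transform assembled from other transforms.
    static Map<String, Long> countWords(List<String> lines) {
        // ParDo step: emit (word, 1) for every word in every line.
        List<Map.Entry<String, Long>> ones = parDo(lines, line -> {
            List<Map.Entry<String, Long>> pairs = new ArrayList<>();
            for (String w : line.split("\\W+")) {
                if (!w.isEmpty()) {
                    pairs.add(Map.entry(w, 1L));
                }
            }
            return pairs;
        });
        // GroupByKey step, followed by a per-key reduction of the grouped values.
        Map<String, Long> counts = new TreeMap<>();
        groupByKey(ones).forEach((word, vs) -> counts.put(word, (long) vs.size()));
        return counts;
    }
}
```

For example, `PrimitivesSketch.countWords(List.of("a b a", "b c"))` returns a map that prints as `{a=2, b=2, c=1}`.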
Pillars of Beam
Unified model
Portability
Ecosystem
Unified Model
Batch doesn't exist
[Diagram labels: T, E, L]
Computes updates
Grows (always expect new data)
Evolves
Growing data is temporal ⇒ all data has timestamps (event time: t_happened)
Dealing with new data
ParDo ⇒ Apply to new data
GroupByKey ⇒ ?
Continuous aggregation
Idea: per-key buffering
[Diagram: GroupByKey takes (K, V) to (K, V[]); inside it, a separate Group step per key Ki takes (Ki, V) to (Ki, V[])]
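The per-key buffering idea can be sketched in a few lines of plain Java. This is an illustration only, not how any Beam runner stores state: the class name `PerKeyBuffer` and its `add` method are hypothetical, and emitting the whole buffer after every element is just one possible policy, chosen here for simplicity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of per-key buffering; NOT the Beam API.
final class PerKeyBuffer<K, V> {
    // One growing buffer of values per key Ki.
    private final Map<K, List<V>> buffers = new HashMap<>();

    // Accept one new (key, value) and return a snapshot of that key's group so far.
    List<V> add(K key, V value) {
        List<V> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(value);
        return List.copyOf(buffer);
    }
}
```

With this sketch, adding `("k1", 1)`, then `("k2", 5)`, then `("k1", 2)` returns the groups `[1]`, `[5]`, and `[1, 2]` in turn: new data for a key only touches that key's buffer, so grouping never has to wait for the input to end.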
[Diagram: for each key Ki, along the axis t (event time), values V arrive at t_in and groups V[] are emitted at t_out]
See: Streams and Tables https://www.infoq.com/presentations/beam-model-stream-table-theory
Continuous aggregation
Idea: temporal windowing
[Diagram: key Ki on an event-time axis, with an element (k, v) at 14:03]
Element counts toward 1 or more windows
Watermark closes old windows
Apply (user-specified) trigger ⇒ drop / add to buffer / emit buffer
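A rough single-key sketch of temporal windowing follows, in plain Java rather than the Beam API. Several things here are assumptions made for illustration: the class `WindowingSketch` and its methods are hypothetical, event time is an integer count of minutes since midnight (so 14:03 is 843), windows are sliding windows of a fixed size and period, and the only trigger modeled is "emit the buffer once the watermark passes the end of the window, and drop anything that arrives for a closed window".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-key illustration of event-time windowing; NOT the Beam API.
final class WindowingSketch {
    private final long size;    // window length, in minutes (assumed unit)
    private final long period;  // distance between window starts, in minutes
    // Window start -> buffered values for that window.
    private final TreeMap<Long, List<String>> buffers = new TreeMap<>();
    private long watermark = Long.MIN_VALUE;

    WindowingSketch(long size, long period) {
        this.size = size;
        this.period = period;
    }

    // Starts of all windows [start, start + size) that contain eventTime.
    List<Long> windowsFor(long eventTime) {
        List<Long> starts = new ArrayList<>();
        long last = Math.floorDiv(eventTime, period) * period;
        for (long s = last; s > eventTime - size; s -= period) {
            starts.add(s);
        }
        return starts;
    }

    // Add the value to the buffer of every open window; drop it for closed ones.
    void add(long eventTime, String value) {
        for (long start : windowsFor(eventTime)) {
            if (start + size <= watermark) {
                continue; // the watermark already closed this window: drop
            }
            buffers.computeIfAbsent(start, s -> new ArrayList<>()).add(value);
        }
    }

    // Advance the watermark and emit the buffer of every window it closes.
    Map<Long, List<String>> advanceWatermark(long newWatermark) {
        watermark = Math.max(watermark, newWatermark);
        Map<Long, List<String>> emitted = new TreeMap<>();
        while (!buffers.isEmpty() && buffers.firstKey() + size <= watermark) {
            Map.Entry<Long, List<String>> closed = buffers.pollFirstEntry();
            emitted.put(closed.getKey(), closed.getValue());
        }
        return emitted;
    }
}
```

With `new WindowingSketch(10, 5)`, an element at minute 843 (14:03) counts toward the two windows starting at 840 and 835. Advancing the watermark to 845 emits the 835 window, and an element that later arrives with event time 836 is dropped because both of its windows are already closed.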
There is no batch / streaming. Only different ways to control aggregation
Portability (vision for 2018)
Code in any supported language (or a mix)
Run on any supported runner
Portable pipeline representation
No vendor lock-in
  Run any language on any runner
No language lock-in
  Users: use all transforms from all languages
  Library authors: will be usable by all languages
Accelerated ecosystem growth
  New runner / new SDK ⇒ access all Beam libraries
Ecosystem
[Diagram labels: User code; IO; SQL; Other libs; Language SDKs; Third-party SDKs; Portable Unified Model; Runners; Powered by Beam]
Community
250 contributors
31 committers (11 orgs)
~5000 PRs
~12,500 commits
25+ IO connectors
5 stable releases
9 runners