A Whirlwind Overview of Apache Beam, Eugene Kirpichov



SLIDE 1

Eugene Kirpichov <kirpichov@google.com> Staff Software Engineer

A Whirlwind Overview of Apache Beam

SLIDE 2

Google Cloud Platform 2

(2004) MapReduce: SELECT + GROUPBY
(2008) FlumeJava: High-level API
(2013) Millwheel: Deterministic streaming
(2014) Dataflow: Unified batch/streaming, Portable
(2016) Apache Beam: Open ecosystem, Community-driven, Vendor-independent

SLIDE 3

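Read text files → Split into words → Count → Format → Write text files

// Classes come from org.apache.beam.sdk (Pipeline, io.TextIO, transforms.*, values.*) and java.util.Arrays.
// options is a PipelineOptions instance created elsewhere.
Pipeline p = Pipeline.create(options);

// Read text files
PCollection<String> lines = p.apply(TextIO.read().from("gs://.../*"));

// Split into words, then count occurrences of each word
PCollection<KV<String, Long>> wordCounts = lines
    .apply(FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split("\\W+"))))
    .apply(Count.perElement());

// Format each count as "word: n" and write text files
wordCounts
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((KV<String, Long> count) -> count.getKey() + ": " + count.getValue()))
    .apply(TextIO.write().to("gs://.../..."));

p.run();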

SLIDE 4

Beam PTransforms

ParDo ("map")
GroupByKey ("reduce")
Composite
[Diagram label: DoFn]
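
As a rough analogy for the transforms named on this slide, the sketch below mimics them with plain Java collections. It is not Beam API: the class PrimitivesSketch and the methods parDo, groupByKey, and countWords are hypothetical names chosen for illustration, and everything runs in memory on finite lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java analogy of the slide's transforms; none of this is Beam API. */
final class PrimitivesSketch {

    /** ParDo analogue ("map"): apply fn to every element; each call may emit zero or more outputs. */
    static <A, B> List<B> parDo(List<A> input, Function<A, List<B>> fn) {
        List<B> output = new ArrayList<>();
        for (A element : input) {
            output.addAll(fn.apply(element));
        }
        return output;
    }

    /** GroupByKey analogue ("reduce"): collect all values that share a key. */
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> input) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> kv : input) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    /** Composite analogue: a transform assembled from the two primitives (a word count). */
    static Map<String, Integer> countWords(List<String> lines) {
        // Step 1 (ParDo): emit a (word, 1) pair for every non-empty word.
        List<Map.Entry<String, Integer>> ones = parDo(lines, line -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : line.split("\\W+")) {
                if (!word.isEmpty()) {
                    out.add(Map.entry(word, 1));
                }
            }
            return out;
        });
        // Step 2 (GroupByKey): gather the 1s per word, then count them.
        Map<String, Integer> counts = new LinkedHashMap<>();
        groupByKey(ones).forEach((word, values) -> counts.put(word, values.size()));
        return counts;
    }

    public static void main(String[] args) {
        // prints {to=2, be=2, or=1, not=1}
        System.out.println(countWords(List.of("to be or", "not to be")));
    }
}
```

The function passed to parDo plays the role the slide labels DoFn: user code invoked once per element.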

SLIDE 5

Pillars of Beam

Unified model Portability Ecosystem

SLIDE 6

Unified Model

Batch doesn't exist

SLIDE 7

[Diagram labels: T, E, L]

Computes updates
Grows (always expect new data)
Evolves
Growing data is temporal ⇒ all data has timestamps (event time: t_happened)

SLIDE 8

Dealing with new data

ParDo ⇒ Apply to new data
GroupByKey ⇒ ?

SLIDE 9

Continuous aggregation

Idea: per-key buffering

GroupByKey: (K, V) → (K, V[])
[Diagram: one Group buffer per key Ki, each turning (Ki, V) into (Ki, V[])]
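
A minimal sketch of the per-key buffering idea, assuming an in-memory map from key to buffer. The class PerKeyBuffer and its methods add and emit are hypothetical names, not part of Beam; the sketch deliberately leaves open the question of when emit should be called.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-key buffer: a continuous stand-in for GroupByKey, not Beam API. */
final class PerKeyBuffer<K, V> {
    // One growing buffer per key Ki.
    private final Map<K, List<V>> buffers = new HashMap<>();

    /** Called for each new (K, V) element as it arrives. */
    void add(K key, V value) {
        buffers.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Emits the buffered V[] for one key and clears it; returns an empty list if nothing is buffered. */
    List<V> emit(K key) {
        List<V> values = buffers.remove(key);
        return values == null ? new ArrayList<>() : values;
    }

    public static void main(String[] args) {
        PerKeyBuffer<String, Integer> buffer = new PerKeyBuffer<>();
        buffer.add("a", 1);
        buffer.add("b", 2);
        buffer.add("a", 3);
        System.out.println(buffer.emit("a")); // prints [1, 3]
        System.out.println(buffer.emit("a")); // prints [] because the buffer was cleared
    }
}
```

On unbounded input the buffers only stay bounded if something decides when to emit and clear them, which is the gap that windowing and triggers fill.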

SLIDE 10

[Diagram: for key Ki on the event-time axis t, inputs t_in: V are grouped into outputs t_out: V[]]

See: Streams and Tables, https://www.infoq.com/presentations/beam-model-stream-table-theory

SLIDE 11

Continuous aggregation

Idea: temporal windowing

[Diagram: key Ki on an event-time axis, with an example element (k, v) at 14:03]

Element counts toward 1 or more windows
Watermark closes old windows
Apply (user-specified) trigger ⇒ drop / add to buffer / emit buffer
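
The windowing steps on this slide can be sketched for a single key as below. This is an illustration under simplifying assumptions, not Beam API: the class WindowedBuffer is a hypothetical name, windows are fixed-size (so each element counts toward exactly one window, the simplest case of "1 or more"), and the trigger is hard-coded as "emit a window once the watermark passes its end, drop anything that arrives later".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical fixed-window buffer for a single key; a sketch, not Beam API. */
final class WindowedBuffer<V> {
    private final long windowSize; // window length, in event-time units
    private final TreeMap<Long, List<V>> open = new TreeMap<>(); // window start -> buffer
    private long watermark = Long.MIN_VALUE;

    WindowedBuffer(long windowSize) {
        this.windowSize = windowSize;
    }

    /** Start of the fixed window that contains the given event time. */
    long windowStart(long eventTime) {
        return Math.floorDiv(eventTime, windowSize) * windowSize;
    }

    /** Adds the element to its window's buffer, or drops it if that window is already closed. */
    boolean add(long eventTime, V value) {
        long start = windowStart(eventTime);
        if (start + windowSize <= watermark) {
            return false; // late data: dropped
        }
        open.computeIfAbsent(start, s -> new ArrayList<>()).add(value);
        return true;
    }

    /** Advances the watermark, then closes and emits every window that ends at or before it. */
    Map<Long, List<V>> advanceWatermark(long newWatermark) {
        watermark = Math.max(watermark, newWatermark);
        Map<Long, List<V>> emitted = new TreeMap<>();
        while (!open.isEmpty() && open.firstKey() + windowSize <= watermark) {
            Map.Entry<Long, List<V>> closed = open.pollFirstEntry();
            emitted.put(closed.getKey(), closed.getValue());
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Event times are minutes since midnight; 14:03 is minute 843. Windows are 60 minutes long.
        WindowedBuffer<String> buffer = new WindowedBuffer<>(60);
        buffer.add(843, "v1");
        buffer.add(850, "v2");
        buffer.add(905, "v3");
        System.out.println(buffer.advanceWatermark(900)); // prints {840=[v1, v2]}
        System.out.println(buffer.add(845, "late"));      // prints false: the 14:00 window is closed
        System.out.println(buffer.advanceWatermark(960)); // prints {900=[v3]}
    }
}
```

A different trigger would change only the decision in add and advanceWatermark: for example, emitting early partial results, or keeping closed windows around to accept late data.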

SLIDE 12

There is no batch / streaming. Only different ways to control aggregation

SLIDE 13

Portability (vision for 2018)

SLIDE 14

Code in any supported language (or a mix)
Run on any supported runner

Portable pipeline representation

SLIDE 15

No vendor lock-in
  Run any language on any runner
No language lock-in
  Users: use all transforms from all languages
  Library authors: will be usable by all languages
Accelerated ecosystem growth
  New runner / new SDK ⇒ access all Beam libraries

SLIDE 16

Ecosystem

SLIDE 17

[Diagram: the Beam ecosystem. Labels: User code; IO; Language SDKs; Runners; SQL; Other libs; Portable Unified Model; Powered by Beam; Third-party SDKs; Community]

SLIDE 18

250 contributors
31 committers (11 orgs)
~5000 PRs
~12,500 commits
25+ IO connectors
5 stable releases
9 runners

SLIDE 19

Thank you!