A Whirlwind Overview of Apache Beam



  1. A Whirlwind Overview of Apache Beam. Eugene Kirpichov <kirpichov@google.com>, Staff Software Engineer

  2. (2004) MapReduce: SELECT + GROUPBY; (2008) FlumeJava: high-level API; (2013) MillWheel: deterministic streaming; (2014) Dataflow: unified batch/streaming, portable; (2016) Apache Beam: open ecosystem, community-driven, vendor-independent. Google Cloud Platform

  3. Pipeline p = Pipeline.create(options);

     // Read text files
     PCollection<String> lines = p.apply(TextIO.read().from("gs://.../*"));

     // Split into words, then count
     PCollection<KV<String, Long>> wordCounts = lines
         .apply(FlatMapElements.into(TypeDescriptors.strings())
             .via((String line) -> Arrays.asList(line.split("\\W+"))))
         .apply(Count.perElement());

     // Format, then write text files
     wordCounts
         .apply(MapElements.into(TypeDescriptors.strings())
             .via((KV<String, Long> count) -> count.getKey() + ": " + count.getValue()))
         .apply(TextIO.write().to("gs://.../..."));

     p.run();

  4. Beam PTransforms: DoFn → ParDo ("map"); GroupByKey ("reduce"); Composite
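
The "map" / "reduce" analogy on this slide can be illustrated with a small plain-Java model. This is only a sketch under stated assumptions: `PrimitivesSketch`, `parDo`, `groupByKey`, and `countWords` are hypothetical names invented for illustration, they operate on in-memory lists, and they are not the Beam API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java model of the two primitive transforms; this is NOT the Beam API.
class PrimitivesSketch {
    // "ParDo" ("map"): apply a per-element function that may emit zero or more outputs.
    static <I, O> List<O> parDo(List<I> input, Function<I, List<O>> doFn) {
        List<O> out = new ArrayList<>();
        for (I element : input) {
            out.addAll(doFn.apply(element));
        }
        return out;
    }

    // "GroupByKey" ("reduce"): collect all values that share a key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> input) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> kv : input) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    // A "composite": word count built only from the two primitives above.
    static Map<String, Long> countWords(List<String> lines) {
        List<Map.Entry<String, Long>> ones = parDo(lines, line -> {
            List<Map.Entry<String, Long>> out = new ArrayList<>();
            for (String word : line.split("\\W+")) {
                if (!word.isEmpty()) {
                    out.add(Map.entry(word, 1L));
                }
            }
            return out;
        });
        Map<String, Long> counts = new TreeMap<>();
        groupByKey(ones).forEach((word, values) -> {
            long sum = 0;
            for (long v : values) {
                sum += v;
            }
            counts.put(word, sum);
        });
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("to be or not to be", "be quick")));
    }
}
```

In this model a composite is simply a method that chains the primitives, which mirrors how the word-count pipeline on the previous slide chains `apply` calls.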

  5. Pillars of Beam: Unified model, Portability, Ecosystem

  6. Unified Model: Batch doesn't exist

  7. E / T / L: Grows / Evolves / Computes updates (always expect new data). Growing data is temporal ⇒ all data has timestamps (event time: t_happened)

  8. Dealing with new data: ParDo ⇒ apply to new data; GroupByKey ⇒ ?

  9. Continuous aggregation. Idea: per-key buffering. GroupByKey: (K, V) → (K, V[]). For each key K_i: (K_i, V) → Group → (K_i, V[])
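
A minimal sketch of the per-key buffering idea, written in plain Java rather than with the Beam API. The class name `PerKeyBufferSketch` and its methods `add` and `snapshot` are hypothetical, chosen only for this illustration; the point is that each key owns a buffer that new (K, V) elements are appended to, and the current (K, V[]) grouping can be read at any time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of per-key buffering; this is NOT the Beam API.
class PerKeyBufferSketch<K, V> {
    private final Map<K, List<V>> buffers = new HashMap<>();

    // A new (K, V) element arrives: append it to that key's buffer.
    void add(K key, V value) {
        buffers.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Read the current (K, V[]) grouping for one key without clearing it.
    List<V> snapshot(K key) {
        return List.copyOf(buffers.getOrDefault(key, List.of()));
    }

    public static void main(String[] args) {
        PerKeyBufferSketch<String, Integer> group = new PerKeyBufferSketch<>();
        group.add("a", 1);
        group.add("b", 10);
        System.out.println(group.snapshot("a"));
        group.add("a", 2); // new data for key "a" arrives later
        System.out.println(group.snapshot("a"));
    }
}
```

Note that in this model the buffers only ever grow and nothing says when a grouping is final; some rule for emitting and discarding buffered values is still needed.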

  10. [Diagram: for key K_i, along the event-time axis t, inputs t_in: V and output t_out: V[]] See: Streams and Tables https://www.infoq.com/presentations/beam-model-stream-table=theory

  11. Continuous aggregation. Idea: temporal windowing. An element, e.g. 14:03: (k, v), counts toward 1 or more event-time windows of its key K_i. T_watermark ⇒ closes old windows. Apply (user-specified) trigger: drop / add to buffer / emit buffer
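
The windowing idea on this slide can be sketched in plain Java. Everything here is an assumption made for illustration: the class `WindowingSketch`, its methods, the use of minutes since midnight as the event-time unit (so 14:03 is 843), and the single hard-coded policy of emitting a window when the watermark passes its end and dropping anything that arrives later. A user-specified trigger could choose differently, and none of this is the Beam API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of event-time windowing with a watermark; NOT the Beam API.
class WindowingSketch {
    // Start times of all windows [start, start + size) that contain eventTime,
    // where a new window begins every `period` units. With period == size these
    // are fixed windows (one result); with period < size they overlap.
    static List<Long> windowStarts(long eventTime, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long latest = eventTime - Math.floorMod(eventTime, period);
        for (long s = latest; s > eventTime - size; s -= period) {
            starts.add(s);
        }
        return starts;
    }

    private final long size;
    // window start -> key -> buffered values (fixed windows only, for brevity)
    private final TreeMap<Long, Map<String, List<String>>> open = new TreeMap<>();
    private long watermark = Long.MIN_VALUE;

    WindowingSketch(long size) {
        this.size = size;
    }

    // Buffer the element in its window; return false if that window has
    // already been closed by the watermark (the element is dropped).
    boolean add(String key, String value, long eventTime) {
        long start = windowStarts(eventTime, size, size).get(0);
        if (start + size <= watermark) {
            return false;
        }
        open.computeIfAbsent(start, s -> new TreeMap<>())
            .computeIfAbsent(key, k -> new ArrayList<>())
            .add(value);
        return true;
    }

    // Advance the watermark; emit and close every window ending at or before it.
    Map<Long, Map<String, List<String>>> advanceWatermark(long newWatermark) {
        watermark = Math.max(watermark, newWatermark);
        Map<Long, Map<String, List<String>>> emitted = new TreeMap<>();
        while (!open.isEmpty() && open.firstKey() + size <= watermark) {
            Map.Entry<Long, Map<String, List<String>>> closed = open.pollFirstEntry();
            emitted.put(closed.getKey(), closed.getValue());
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(windowStarts(843, 10, 5)); // overlapping windows for 14:03
        WindowingSketch w = new WindowingSketch(10);
        w.add("k", "v1", 843);
        w.add("k", "v2", 847);
        w.add("k", "v3", 851);
        System.out.println(w.advanceWatermark(850)); // closes the window starting at 840
        System.out.println(w.add("k", "late", 843)); // that window is closed, so dropped
    }
}
```

With 10-minute windows starting every 5 minutes, the element at 14:03 falls into the windows starting at 14:00 and 13:55, which is the "1 or more windows" case. With fixed 10-minute windows it falls only into the one starting at 14:00.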

  12. There is no batch / streaming. Only different ways to control aggregation.

  13. Portability (vision for 2018)

  14. Code in any supported language (or a mix); portable pipeline representation; run on any supported runner

  15. No vendor lock-in: run any language on any runner. No language lock-in: users can use all transforms from all languages; library authors' transforms will be usable by all languages. Accelerated ecosystem growth: new runner / new SDK ⇒ access to all Beam libraries

  16. Ecosystem

  17. [Layer diagram] Community; user code (powered by Beam); third-party IO, SQL, other libs; language SDKs; portable unified model; runners

  18. 250 contributors; 31 committers (11 orgs); ~5000 PRs; ~12,500 commits; 25+ IO connectors; 5 stable releases; 9 runners

  19. Thank you!
