From Zero to Portability? Apache Beam's Journey to Cross-Language Data Processing
Maximilian Michels
mxm@apache.org, @stadtlegende, maximilianmichels.com
FOSDEM 2019

• What is Beam?
• What does portability mean?
• How do we achieve portability?
• Are we there yet?

What is Beam?
• Apache open-source project
• Parallel/distributed data processing
• Unified programming model for batch/stream processing
• Execution engine of your choice ("Uber API")
• Programming language of your choice

Beam Vision
[Diagram: Write Pipeline (SDKs) -> Translate (Runners) -> Execution Engines]
• Execution engines: Direct, Apache Samza, Apache Flink, Google Cloud Dataflow, Apache Spark, Apache Apex, Apache Gearpump, Apache Nemo (incubating)

The Beam API
[Diagram: a Pipeline in which transforms turn an input (IN) into PCollections and finally an output (OUT)]
1. Pipeline p = Pipeline.create(options)
2. PCollection pCol1 = p.apply(transform).apply(…).…
3. PCollection pCol2 = pCol1.apply(transform)
4. p.run()

Transforms
[Diagram: PCOLLECTION -> TRANSFORM -> PCOLLECTION]
• Transforms can be primitive or composite
• Composite transforms expand to primitive transforms
• There is only a small set of primitive transforms
• Runners can support specialized translation of composite transforms, but don't have to
Primitive transforms: ParDo, GroupByKey, AssignWindows, Flatten

Core Primitive Transforms

ParDo ("Map/Reduce Phase")     GroupByKey ("Shuffle Phase")
input -> output                KV<k,v>… -> KV<k, [v…]>
"to"  -> KV<"to", 1>           KV<"to",  [1,1]>
"be"  -> KV<"be", 1>           KV<"be",  [1,1]>
"or"  -> KV<"or", 1>           KV<"or",  [1]>
"not" -> KV<"not", 1>          KV<"not", [1]>
"to"  -> KV<"to", 1>
"be"  -> KV<"be", 1>
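The two primitives in the table above can be imitated in a few lines of plain Python. This is only a sketch of their semantics on an in-memory list, not Beam code: `par_do` and `group_by_key` are hypothetical helper names, and a real runner executes these steps in parallel across workers.

```python
from collections import defaultdict

def par_do(elements, fn):
    """Element-wise step: fn returns zero or more outputs per input element."""
    for element in elements:
        yield from fn(element)

def group_by_key(pairs):
    """Shuffle step: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

words = ["to", "be", "or", "not", "to", "be"]

# ParDo: "to" -> ("to", 1)
pairs = list(par_do(words, lambda word: [(word, 1)]))

# GroupByKey: ("to", 1), ("to", 1) -> ("to", [1, 1])
grouped = group_by_key(pairs)
print(grouped)  # {'to': [1, 1], 'be': [1, 1], 'or': [1], 'not': [1]}
```

The function passed to `par_do` returns a list so that one input can produce zero, one, or many outputs, as a ParDo can.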

WordCount: Raw

pipeline
    .apply(Create.of("hello", "hello", "fosdem"))
    .apply(ParDo.of(
        new DoFn<String, KV<String, Integer>>() {
          @ProcessElement
          public void processElement(ProcessContext ctx) {
            KV<String, Integer> outputElement = KV.of(ctx.element(), 1);
            ctx.output(outputElement);
          }
        }))
    .apply(GroupByKey.create())
    .apply(ParDo.of(
        new DoFn<KV<String, Iterable<Integer>>, KV<String, Long>>() {
          @ProcessElement
          public void processElement(ProcessContext ctx) {
            long count = 0;
            for (Integer wordCount : ctx.element().getValue()) {
              count += wordCount;
            }
            KV<String, Long> outputElement = KV.of(ctx.element().getKey(), count);
            ctx.output(outputElement);
          }
        }));

Excuse me, that was ugly as hell

WordCount: Composite Transforms

pipeline
    .apply(Create.of("hello", "fellow", "fellow"))
    .apply(MapElements.via(
        new SimpleFunction<String, KV<String, Integer>>() {
          @Override
          public KV<String, Integer> apply(String input) {
            return KV.of(input, 1);
          }
        }))
    .apply(Sum.integersPerKey());

WordCount: More Composite Transforms

pipeline
    .apply(Create.of("hello", "fellow", "fellow"))
    .apply(Count.perElement());
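One way to picture "composite transforms expand to primitive transforms" is a plain-Python composite built from the two primitives. This is a sketch only: `par_do`, `group_by_key`, and `count_per_element` are hypothetical stand-ins, and the expansion Beam actually uses for `Count.perElement()` is not shown in the slides and may differ.

```python
from collections import defaultdict

def par_do(elements, fn):
    """Primitive: apply fn to each element; fn returns a list of outputs."""
    for element in elements:
        yield from fn(element)

def group_by_key(pairs):
    """Primitive: gather the values of each key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return list(grouped.items())

def count_per_element(elements):
    """Composite: expands to ParDo -> GroupByKey -> ParDo."""
    paired = par_do(elements, lambda element: [(element, 1)])
    grouped = group_by_key(paired)
    return list(par_do(grouped, lambda kv: [(kv[0], sum(kv[1]))]))

counts = count_per_element(["hello", "fellow", "fellow"])
print(counts)  # [('hello', 1), ('fellow', 2)]
```

A caller only sees `count_per_element`, while an executor that knows nothing about it can still run the three primitive steps it expands to.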

Python to the Rescue

(p
 | beam.Create(['hello', 'hello', 'fosdem'])
 | beam.Map(lambda word: (word, 1))
 | beam.GroupByKey()
 | beam.Map(lambda kv: (kv[0], sum(kv[1])))
)

Python to the Rescue

(p
 | beam.Create(['hello', 'hello', 'fosdem'])
 | beam.Map(lambda word: (word, 1))
 | beam.CombinePerKey(sum)
)

There is Much More to Beam
• Watermarks
• Flatten/Combine/Partition/CoGroupByKey (Join)
• Side Inputs
• Define your own transforms!
• Multiple Outputs
• IOs / Splittable DoFn
• State
• Windowing
• Timers
• Event Time / Processing Time
• ...

• What is Beam?
• What does portability mean?
• How do we achieve portability?
• Are we there yet?

Portability
• Engine Portability: Runners can translate a Beam pipeline for any of these execution engines
• Language Portability: A Beam pipeline can be generated from any of these languages

Beam Vision
[Diagram: Write Pipeline (SDKs) -> Translate (Runners) -> Execution Engines]

Cross-Engine Portability
1. Set the Runner
   • options.setRunner(FlinkRunner.class)
   • --runner=FlinkRunner
2. Run!
   • p.run()

Portability
• Engine Portability: Runners can translate a Beam pipeline for any of these execution engines
• Language Portability: A Beam pipeline can be generated from any of these languages

Why We Want to Use Other Languages
• Syntax / Expressiveness
• Communities (Yes!)
• Libraries (!)

Beam without Language-Portability
[Diagram: Write Pipeline (SDKs) -> Translate (Runners) -> Execution Engines, annotated "Wait, what?!"]

Beam with Language-Portability
[Diagram: Write Pipeline (SDKs) -> Translate (Runners & language-portability framework) -> Execution Engines]

Language-Portability
[Diagram: Beam SDKs (Java, Python, Go) above the execution engines (Apache Flink, Cloud Dataflow, Apache Spark), each with its own Execution layer]

Language-Portability
[Diagram: Beam SDKs (Java, Python, Go) connected through a shared Pipeline (Runner API) layer to the execution engines (Apache Flink, Cloud Dataflow, Apache Spark), each with its own Execution layer]

Language-Portability
[Diagram: Beam SDKs (Java, Python, Go), a shared Pipeline (Runner API) layer, the execution engines (Apache Flink, Cloud Dataflow, Apache Spark), and a shared Execution (Fn API) layer]

Without Portability
[Diagram, everything marked language-specific: SDK -> Runner -> Backend (e.g. Flink) running Task 1, Task 2, Task 3, … Task N]
All components are tied to a single language

With Portability
[Diagram, components marked language-specific or language-agnostic: SDK -> Job API -> Job Server -> Runner API -> Runner -> Translate -> Backend (e.g. Flink). The backend runs Executable Stages alongside its other tasks (… Task 2, Task N), and each Executable Stage is connected through the Fn API to an SDK Harness.]

Pipeline Fusion
• SDK Harness environment comes at a cost
• Serialization step before and after processing with SDK harness
• User defined functions should be chained and share the same environment
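The cost argument behind fusion can be made concrete with a toy model. This is a sketch under stated assumptions, not Beam's implementation: `pickle` stands in for whatever encoding crosses the runner/harness boundary, and `run_unfused` and `run_fused` are hypothetical names.

```python
import pickle

def run_unfused(element, fns):
    """Every function is its own stage: data is serialized around each step."""
    round_trips = 0
    payload = pickle.dumps(element)          # runner -> harness
    for fn in fns:
        value = fn(pickle.loads(payload))    # harness decodes and processes
        payload = pickle.dumps(value)        # harness -> runner
        round_trips += 1
    return pickle.loads(payload), round_trips

def run_fused(element, fns):
    """All functions are chained in one stage: a single serialization boundary."""
    value = pickle.loads(pickle.dumps(element))  # runner -> harness, once
    for fn in fns:
        value = fn(value)                        # chained in the same environment
    payload = pickle.dumps(value)                # harness -> runner, once
    return pickle.loads(payload), 1

fns = [str.strip, str.lower, lambda word: (word, 1)]
unfused_result, unfused_trips = run_unfused("  Hello ", fns)
fused_result, fused_trips = run_fused("  Hello ", fns)
print(unfused_result, unfused_trips)  # ('hello', 1) 3
print(fused_result, fused_trips)      # ('hello', 1) 1
```

Both variants compute the same result; the fused one crosses the serialization boundary once instead of once per function.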

Flink Executable Stage
[Diagram labels: Flink Executable Stage, Job Bundle Factory, Stage Bundle Factory, Environment Factory, Remote Bundle, SDK Harness; channels between them: Provisioning, Artifact Retrieval, Logging, Progress Report, Input Receivers, State Request]
• SDK Harness runs
  • in a Docker container (repository can be specified)
  • in a dedicated process (process-based execution)
  • directly in the process (only works if SDK and Runner share the same language)

Cross-Language Pipelines
• Java SDK has a rich set of IO connectors, e.g. FileIO, KafkaIO, PubSubIO, JDBC, Cassandra, Redis, ElasticsearchIO, …
• Python SDK has replicated parts of it, i.e. FileIO
• Are we going to replicate all the others?
• Solution: Use cross-language pipelines!

IO connectors listed on the slide:
• Files-Based: Apache HDFS, Amazon S3, Google Cloud Storage, local filesystems; AvroIO, TextIO, TFRecordIO, XmlIO, TikaIO, ParquetIO
• Messaging: Amazon Kinesis, AMQP, Apache Kafka, Google Cloud Pub/Sub, JMS, MQTT
• Databases: Apache Cassandra, Apache Hadoop InputFormat, Apache HBase, Apache Hive (HCatalog), Apache Kudu, Apache Solr, Elasticsearch (v2.x, v5.x, v6.x), Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore, Google Cloud Spanner, JDBC, MongoDB, Redis

Cross-Language Pipelines

p = Pipeline()
(p
 | IoExpansion(io='KafkaIO',
               configuration={
                   'topic': 'fosdem',
                   'offset': 'latest'
               })
 | …
)

Cross-Language via Mixed Environments
[Diagram: the SDK calls an Expansion Service (Expand) and submits through the Job API to the Job Server, which passes the pipeline through the Runner API to the Runner for translation to the Execution Engine (e.g. Flink). The engine runs … SOURCE, MAP, GROUPBYKEY, COUNT, with stages connected through the Fn API to a Java SDK Harness and a Python SDK Harness.]
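How a mixed-language pipeline might be cut into stages can be sketched as follows. This is a deliberately simplified, linear model rather than the runner's actual algorithm: the `Transform` class, the `fuse_into_stages` function, and the assignment of Source to Java and of Map and Count to Python are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Transform:
    name: str
    # Environment that runs the user code; None means the runner executes
    # the step natively (assumed here for GroupByKey).
    environment: Optional[str]

def fuse_into_stages(transforms: List[Transform]) -> List[Tuple[Optional[str], List[str]]]:
    """Fuse consecutive transforms that share an SDK environment into one stage."""
    stages: List[Tuple[Optional[str], List[str]]] = []
    for transform in transforms:
        same_env = bool(stages) and stages[-1][0] == transform.environment
        if same_env and transform.environment is not None:
            stages[-1][1].append(transform.name)   # chain into the current stage
        else:
            stages.append((transform.environment, [transform.name]))
    return stages

pipeline = [
    Transform("Source", "java"),
    Transform("Map", "python"),
    Transform("GroupByKey", None),
    Transform("Count", "python"),
]
stages = fuse_into_stages(pipeline)
for environment, names in stages:
    print(environment, names)
```

In this model each stage with a non-`None` environment would be handed to the SDK harness of that language, so a Java source and Python user functions can coexist in one pipeline.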

Portability
• Language Portability
• Engine Portability
pretty darn close
