SLIDE 1

Apache Beam

A unified programming model for Big Data

SLIDE 2

Who am I?

Jean-Baptiste Onofre
<jbonofre@apache.org> <jbonofre@talend.com>
@jbonofre | http://blog.nanthrax.net
Member of the Apache Software Foundation
Fellow/Software Architect at Talend
PMC on ~20 Apache projects, from system integration & containers (Karaf, Camel, ActiveMQ, Archiva, Aries, ServiceMix, …) to big data (Beam, CarbonData, Falcon, Gearpump, Lens, …)

SLIDE 3

Apache Beam origin

MapReduce → BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel → Google Cloud Dataflow → Apache Beam

SLIDE 4

Beam model: asking the right questions

  • What results are calculated?
  • Where in event time are results calculated?
  • When in processing time are results materialized?
  • How do refinements of results relate?
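As a rough sketch, the four questions map onto SDK primitives as follows; the input PCollection "events" and its key/value types are assumptions for illustration, not taken from the slides:

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.*;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Hedged sketch: the four Beam model questions as pipeline steps.
PCollection<KV<String, Long>> results = events
    // Where in event time? Fixed 1-minute event-time windows.
    .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
        // When in processing time? At the watermark, with early and late firings.
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(5))
        // How do refinements relate? Later panes accumulate earlier ones.
        .accumulatingFiredPanes())
    // What is calculated? A per-key sum.
    .apply(Sum.longsPerKey());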

SLIDE 5

Customizing What / Where / When / How:

  • 1. Classic Batch
  • 2. Windowed Batch
  • 3. Streaming
  • 4. Streaming + Accumulation

SLIDE 6

What is Apache Beam?

  • 1. Unified model (Batch + strEAM): What / Where / When / How
  • 2. SDKs (Java, Python, …) & DSLs (Scala, …)
  • 3. Runners for existing distributed processing backends (Google Dataflow, Spark, Flink, …)
  • 4. IOs: data store Sources / Sinks
SLIDE 7

Apache Beam vision

[Diagram: the Beam Model drives pipeline construction in Beam Java, Beam Python, and other languages; Beam Model Fn Runners then execute pipelines on Apache Flink, Apache Spark, or Cloud Dataflow]

  • 1. End users: who want to write pipelines in a language that's familiar.
  • 2. SDK/DSL writers: who want to make Beam concepts available in new languages.
  • 3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.

SLIDE 8

Apache Beam - SDKs & DSLs

SDKs

API based on the Beam Model:

  • 1. Current: Java, Python
  • 2. Future (possible) SDKs: Go, Ruby, etc.

DSLs

Domain-Specific Languages based on the Beam Model:

  • 1. Current: Scio (Scala API)
  • 2. Future (ideas): Streaming SQL (Calcite), Machine Learning, Complex Event Processing
SLIDE 9

Apache Beam SDK concepts

  • 1. Pipeline - a data processing job as a directed graph of transformations
  • 2. PCollection - the data inside a pipeline
  • 3. PTransform - a transformation step in the pipeline
    • a. IO transforms - read from a Source or write to a Sink
    • b. Core transforms - common transformations provided (ParDo, GroupByKey, …)
    • c. Composite transforms - combine multiple transforms (see the sketch below)
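A minimal, hedged sketch of a composite transform; the class name and filtering logic are illustrative, not from the slides:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Illustrative composite transform: packages two core transforms
// (Filter, Count) behind one reusable PTransform.
public class CountNonEmpty
    extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> input) {
    return input
        .apply("DropEmpty", Filter.by((String s) -> !s.isEmpty()))
        .apply("CountPerElement", Count.perElement());
  }
}

// Usage: PCollection<KV<String, Long>> counts = lines.apply(new CountNonEmpty());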
SLIDE 10

Apache Beam - Pipeline

Data processing pipeline (executed via a Beam runner)

[Diagram: Read PTransform (source) → PTransform → PTransform → Write PTransform (sink)]
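In code, that shape is just a chain of apply() calls; a hedged skeleton where the transform classes and paths are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Skeleton of the pipeline shape shown above: read, transform, write.
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
p.apply(TextIO.read().from("/path/to/input*.txt"))  // Read PTransform (source)
 .apply(new MyFirstTransform())                     // PTransform (placeholder)
 .apply(new MySecondTransform())                    // PTransform (placeholder)
 .apply(TextIO.write().to("/path/to/output"));      // Write PTransform (sink)
p.run();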

SLIDE 11

Apache Beam - PCollection

  • 1. A PCollection is immutable, does not support random access to elements, and belongs to a Pipeline
  • 2. Each element in a PCollection has a Timestamp (commonly set by the IO Source)
  • 3. A Coder supports different data serializations (timestamps and coders are both shown in the sketch below)
  • 4. Bounded (batch) or Unbounded (streaming), depending on the IO Source
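A hedged illustration of points 2 and 3; the input PCollection "lines" and the parseEventTime helper are hypothetical:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollection;

// Assign an event-time Timestamp to each element (parseEventTime is
// a hypothetical helper returning org.joda.time.Instant).
PCollection<String> stamped =
    lines.apply(WithTimestamps.of((String line) -> parseEventTime(line)));

// Declare the Coder used to serialize elements between workers.
stamped.setCoder(StringUtf8Coder.of());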
SLIDE 12

Apache Beam - PTransform

  • 1. PTransforms are operations that transform data
  • 2. They receive one or more PCollections and produce one or more PCollections
  • 3. They must be Serializable
  • 4. They should be thread-compatible (if you create your own threads, you must synchronize them)
  • 5. Idempotency is not required, but recommended (see the DoFn sketch below)
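A minimal, hedged ParDo sketch; the DoFn is Serializable, keeps no shared mutable state, and its word-splitting logic is illustrative:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// A DoFn must be Serializable; avoiding non-serializable fields and
// shared mutable state lets the runner distribute and retry it safely.
static class ExtractWords extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    for (String word : c.element().split("\\s+")) {
      if (!word.isEmpty()) {
        c.output(word); // idempotent per element: safe to re-execute
      }
    }
  }
}

// Usage: PCollection<String> words = lines.apply(ParDo.of(new ExtractWords()));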
SLIDE 13

Apache Beam - IO Transforms

  • 1. IOs read/write data as PCollections (Source/Sink)
  • 2. They support Bounded and/or Unbounded PCollections
  • 3. Extensible API to create custom sources & sinks
  • 4. They deal with timestamps, watermarks, deduplication, and read/write parallelism (a bounded vs. unbounded read is sketched below)
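For illustration, a bounded file read next to an unbounded Kafka read; the server, topic, and path values are placeholders, and exact builder methods vary by Beam version:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Bounded source: a finite, file-based PCollection (batch).
PCollection<String> batch = p.apply(TextIO.read().from("/path/to/input*.txt"));

// Unbounded source: an endless Kafka stream; the IO also assigns
// timestamps and watermarks so windowing works downstream.
PCollection<KafkaRecord<Long, String>> stream = p.apply(
    KafkaIO.<Long, String>read()
        .withBootstrapServers("kafka:9092")
        .withTopic("events")
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class));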
SLIDE 14
Agenda

  • 1. Evolution of the Big Data programming models
  • 2. The Beam approach
  • 3. Apache Beam

SLIDE 15

Apache Beam - Current IOs

Ready: File, Avro, Google Cloud Storage, BigQuery, BigTable, DataStore, HDFS, Elasticsearch, HBase, MQTT, JDBC, Mongo / GridFS, JMS, Kafka, Kinesis

WIP: Hive, Cassandra, Redis, RabbitMQ, …

SLIDE 16

Apache Beam - Pipeline with IO Example

public static void main(String[] args) {
  // Create a pipeline parameterized by command line flags, e.g. --runner
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
  p.apply(KafkaIO.read().withBootstrapServers(servers)
          .withTopics(topics))                            // Read input
   .apply(new YourFancyFn())                              // Do some processing
   .apply(ElasticsearchIO.write().withAddress(esServer)
          .withIndex(index).withType(type));              // Write output
  // Run the pipeline.
  p.run();
}

SLIDE 17

What are you computing?

(What / Where / When / How)

Transform types: Element-Wise, Aggregating, Composite

SLIDE 18

Apache Beam - Programming model in the SDK

Grouping: GroupByKey, Combine -> Reduce, Sum, Count, Min, Max, Mean, …

Element-wise: ParDo, MapElements, FlatMapElements, Filter, WithKeys, Keys, Values

Windowing/Triggers: FixedWindows, GlobalWindows, SlidingWindows, Sessions, AfterWatermark, AfterProcessingTime, AfterPane, … (the sketch below combines all three families)
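A hedged sketch combining the three families above; the input PCollection "events" and the extractUserId helper are hypothetical:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// Element-wise: key each event by a user id (hypothetical extractor).
PCollection<KV<String, String>> keyed = events.apply(
    WithKeys.of((String e) -> extractUserId(e))
        .withKeyType(TypeDescriptors.strings()));

// Windowing: group each key's events into sessions with a 10-minute gap.
PCollection<KV<String, String>> sessioned = keyed.apply(
    Window.<KV<String, String>>into(Sessions.withGapDuration(Duration.standardMinutes(10))));

// Grouping: count events per user within each session window.
PCollection<KV<String, Long>> perSession = sessioned.apply(Count.perKey());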

SLIDE 19

Apache Beam - Example - GDELT Events by location

Pipeline pipeline = Pipeline.create(options);
// Read events from a text file and parse them.
pipeline
  .apply("GDELTFile", TextIO.Read.from(options.getInput()))
  // Extract location from the fields
  .apply("ExtractLocation", ParDo.of(...))
  // Count events per location
  .apply("CountPerLocation", Count.<String>perElement())
  // Reformat KV as a String
  .apply("StringFormat", MapElements.via(...))
  // Write to result files
  .apply("Results", TextIO.Write.to(options.getOutput()));
// Run the batch pipeline.
pipeline.run();

SLIDE 20

Apache Beam - Runners / Execution Engines

  • Runners “translate” the code to a target runtime (the runner itself doesn’t provide the runtime)
  • Many runners are tied to other top-level Apache projects, such as Apache Flink and Apache Spark
  • Runners can therefore run on-premise (e.g. on your local Flink cluster) or in a public cloud (e.g. using Google Cloud Dataproc or Amazon EMR)
  • Apache Beam treats runners as a top-level use case (with APIs, support, etc.) so runners can be developed with minimal friction, for maximum pipeline portability (runner selection is sketched below)
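As a hedged sketch of that portability: the same pipeline chooses its runner from a command-line flag, provided the runner's artifact is on the classpath:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// The same code runs everywhere; only the flag changes, e.g.:
//   --runner=DirectRunner    (local testing)
//   --runner=FlinkRunner     (Apache Flink)
//   --runner=SparkRunner     (Apache Spark)
//   --runner=DataflowRunner  (Google Cloud Dataflow)
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);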

SLIDE 21

Runners

Same code, different runners & runtimes:

  • Google Cloud Dataflow - managed (NoOps)
  • Apache Flink
  • Apache Spark
  • Apache Apex
  • Apache MapReduce
  • Apache Gearpump
  • Apache Beam Direct Runner - local
  • Apache Karaf - local

(some runners still WIP)

SLIDE 22

Apache Beam - Use cases

Apache Beam is a great choice for both batch and stream processing and can handle bounded and unbounded datasets.

  • Batch can focus on ETL/ELT, catch-up processing, daily aggregations, and so on
  • Stream can focus on handling real-time processing on a record-by-record basis

Real use cases:

  • Data processing, both batch and stream
  • Real-time event processing from IoT devices
  • Fraud detection, …

SLIDE 23

Why Apache Beam?

  • 1. Portable - You can use the same code with different runners (agnostic) and backends: on premise, in the cloud, or locally
  • 2. Unified - Same unified model for batch and stream processing
  • 3. Advanced features - Event windowing, triggering, watermarking, lateness, etc.
  • 4. Extensible model and SDK - Extensible API; you can define custom sources to read and write in parallel

SLIDE 24

Growing the Beam Community

  • Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors
  • Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem

SLIDE 25

Learn More!

Apache Beam: http://beam.apache.org
Join the Beam mailing lists!
user-subscribe@beam.apache.org
dev-subscribe@beam.apache.org
Follow @ApacheBeam on Twitter

SLIDE 26

Thank You!