Apache Beam
Unified programming model for Big Data
Who am I?
Jean-Baptiste Onofre <jbonofre@apache.org> <jbonofre@talend.com>
@jbonofre | http://blog.nanthrax.net
Member of the Apache Software Foundation
Fellow/Software Architect at Talend
PMC on ~20 Apache projects, from system integration & container (Karaf, Camel, ActiveMQ, Archiva, Aries, ServiceMix, ...) to big data (Beam, CarbonData, Falcon, Gearpump, Lens, ...)
Apache Beam origin
MapReduce
BigTable Dremel Colossus Flume Megastore Spanner PubSub Millwheel
Apache Beam
Google Cloud Dataflow
Beam model: asking the right questions
Customizing What / Where / When / How:
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
What is Apache Beam?
1. Unified model (Batch + strEAM) What / Where / When / How
Backends (Google Dataflow, Spark, Flink, …)
Apache Beam vision
[Vision diagram] SDKs (Beam Java, Beam Python, other languages) build pipelines against the Beam Model (Pipeline Construction); the Beam Model (Fn Runners) layer hands them to execution engines: Apache Flink, Apache Spark, Cloud Dataflow.
End users: want to write pipelines in a language that's familiar.
SDK writers: want to make Beam concepts available in new languages.
Runner writers: have a distributed processing environment and want to support Beam pipelines.
Apache Beam - SDKs & DSLs
SDKs: API based on the Beam Model
1. Current: Java, Python
2. Future (possible) SDKs: Go, Ruby, etc.
DSLs: Domain-Specific Languages based on the Beam Model
1. Current:
2. Future (ideas):
Apache Beam SDK concepts
1. Pipeline - data processing job as a directed graph of transformations
Apache Beam - Pipeline
Data processing pipeline (executed via a Beam runner)
Read PTransform (source) → PTransform → PTransform → Write PTransform (sink)
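As a minimal sketch of such a graph, assuming the Beam Java SDK is on the classpath (the class name and element values are illustrative, not from the talk):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalPipeline {
  public static void main(String[] args) {
    // Build the pipeline graph: source -> transform; nothing executes yet
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> upper =
        p.apply(Create.of("hello", "beam"))                      // in-memory source
         .apply(MapElements.into(TypeDescriptors.strings())
                           .via((String s) -> s.toUpperCase())); // element-wise transform

    // Execution happens only when the graph is handed to a runner
    p.run().waitUntilFinish();
  }
}
```

`Create.of` stands in here for a real source such as `TextIO` or `KafkaIO`.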
Apache Beam - PCollection
1. A PCollection is immutable, does not support random access to elements, and belongs to a Pipeline
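A short sketch of the immutability point, assuming the Beam Java SDK (names and values are illustrative): applying a transform never mutates its input, it yields a new PCollection in the same Pipeline.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;

public class ImmutablePCollections {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<Integer> numbers = p.apply(Create.of(1, 2, 3, 4));
    // Filter does NOT modify `numbers`; it produces a second, independent
    // PCollection belonging to the same Pipeline.
    PCollection<Integer> evens = numbers.apply(Filter.by((Integer n) -> n % 2 == 0));

    p.run().waitUntilFinish();
  }
}
```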
Apache Beam - PTransform
1. PTransforms are operations that transform PCollections
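A sketch of a custom element-wise transform via ParDo and a DoFn, assuming the Beam Java SDK (the `ExtractWords` class and its inputs are illustrative):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ParDoExample {
  // A DoFn processes one element at a time and may emit zero or more outputs.
  static class ExtractWords extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> out) {
      for (String word : line.split("\\s+")) {
        out.output(word);
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    PCollection<String> words =
        p.apply(Create.of("apache beam", "unified model"))
         .apply(ParDo.of(new ExtractWords())); // input PCollection -> new PCollection
    p.run().waitUntilFinish();
  }
}
```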
Apache Beam - IO Transforms
1. IOs read/write data as PCollections (Source/Sink)
Apache Beam - Current IOs
Ready
File, Avro, Google Cloud Storage, BigQuery, BigTable, DataStore, HDFS, Elasticsearch, HBase, MQTT, JDBC, Mongo / GridFS, JMS, Kafka, Kinesis
WIP
Hive, Cassandra, Redis, RabbitMQ, ...
Apache Beam - Pipeline with IO Example
public static void main(String[] args) {
  // Create a pipeline parameterized by command line flags, e.g. --runner
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

  p.apply(KafkaIO.read().withBootstrapServers(servers)
                        .withTopics(topics))                // Read input
   .apply(new YourFancyFn())                                // Do some processing
   .apply(ElasticsearchIO.write().withAddress(esServer)
                         .withIndex(index).withType(type)); // Write output

  // Run the pipeline.
  p.run();
}
What Where When How
Element-Wise / Aggregating / Composite
Grouping
GroupByKey, Combine → Reduce (Sum, Count, Min, Max, Mean, ...)
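For instance, Count.perElement combines GroupByKey-style grouping with a counting Combine; a sketch assuming the Beam Java SDK (element values are illustrative):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class CountPerElement {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Count.perElement groups identical elements and emits KV<element, count>,
    // e.g. KV("fr", 3), KV("us", 1), KV("de", 1) for the input below.
    PCollection<KV<String, Long>> counts =
        p.apply(Create.of("fr", "us", "fr", "de", "fr"))
         .apply(Count.perElement());
    p.run().waitUntilFinish();
  }
}
```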
Element-wise
ParDo, MapElements, FlatMapElements, Filter, WithKeys, Keys, Values
Windowing/Triggers
FixedWindows, GlobalWindows, SlidingWindows, Sessions, AfterWatermark, AfterProcessingTime, AfterPane, ...
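A sketch of the "Where in event time" question, assuming the Beam Java SDK (element values are illustrative): assign elements to one-minute fixed windows so a downstream aggregation runs per window instead of over the whole collection.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> events = p.apply(Create.of("click", "view", "click"));
    // Window into fixed 1-minute windows; Count then aggregates per window
    // rather than over the entire (possibly unbounded) collection.
    events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Count.perElement());

    p.run().waitUntilFinish();
  }
}
```

Triggers such as AfterWatermark or AfterProcessingTime would be attached to the same `Window` transform to answer "When in processing time".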
Apache Beam - Example - GDELT Events by location
Pipeline pipeline = Pipeline.create(options);
// Read events from a text file and parse them
pipeline
    .apply("GDELTFile", TextIO.Read.from(options.getInput()))
    // Extract location from the fields
    .apply("ExtractLocation", ParDo.of(...))
    // Count events per location
    .apply("CountPerLocation", Count.<String>perElement())
    // Reformat KV as a String
    .apply("StringFormat", MapElements.via(...))
    // Write to result files
    .apply("Results", TextIO.Write.to(options.getOutput()));
// Run the batch pipeline.
pipeline.run();
Apache Beam - Runners / Execution Engines
Runners “translate” the code to a target runtime (the runner itself doesn’t provide the runtime)
Many runners are tied to other top-level Apache projects, such as Apache Flink and Apache Spark
Due to this, runners can run on premise (e.g. on your local Flink cluster) or in a public cloud (e.g. using Google Cloud Dataproc or Amazon EMR)
Apache Beam treats runners as a top-level use case (with APIs, support, etc.) so runners can be developed with minimal friction for maximum pipeline portability
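As a sketch of that portability (each runner name assumes the matching runner artifact, e.g. beam-runners-flink, is on the classpath), the same program targets different runtimes purely through its options:

```java
// The pipeline code does not change between runtimes; only the launch
// flags do, e.g.:
//   --runner=DirectRunner   (local testing)
//   --runner=FlinkRunner
//   --runner=SparkRunner
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelection {
  public static void main(String[] args) {
    // Parse --runner (and any runner-specific flags) from the command line
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);
    // ... apply the same transforms regardless of the chosen runner ...
    p.run().waitUntilFinish();
  }
}
```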
Runners
Google Cloud Dataflow - managed (NoOps)
Apache Flink
Apache Spark
Apache Apex
Apache MapReduce (WIP)
Apache Gearpump (WIP)
Apache Beam Direct Runner - local
Apache Karaf - local
Same code, different runners & runtimes
Apache Beam - Use cases
Apache Beam is a great choice for both batch and stream processing and can handle bounded and unbounded datasets
Batch can focus on ETL/ELT, catch-up processing, daily aggregations, and so on
Stream can focus on handling real-time processing on a record-by-record basis
Real use cases: data processing (both batch and stream), real-time event processing from IoT devices, fraud detection, ...
Why Apache Beam?
Execute pipelines on multiple backends on premise, in the cloud, or locally
IOs read and write data in parallel
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and individuals
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem
Apache Beam http://beam.apache.org Join the Beam mailing lists! user-subscribe@beam.apache.org dev-subscribe@beam.apache.org Follow @ApacheBeam on Twitter