SLIDE 1

Apache Beam

A unified programming model for Big Data

SLIDE 2

Who am I?

Jean-Baptiste Onofre
<jbonofre@apache.org> <jbonofre@talend.com>
@jbonofre | http://blog.nanthrax.net
Member of the Apache Software Foundation
Fellow/Software Architect at Talend
PMC on ~20 Apache projects, from system integration & containers (Karaf, Camel, ActiveMQ, Archiva, Aries, ServiceMix, …) to big data (Beam, CarbonData, Falcon, Gearpump, Lens, …)

SLIDE 3

Apache Beam origin

MapReduce → BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel → Google Cloud Dataflow → Apache Beam

SLIDE 4

Beam model: asking the right questions

  • What results are calculated?
  • Where in event time are results calculated?
  • When in processing time are results materialized?
  • How do refinements of results relate?
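As a rough sketch, the four questions map onto SDK primitives as follows; the input PCollection "events" and its key/value types are assumptions for illustration, not taken from the slides:

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.*;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Hedged sketch: the four Beam model questions as pipeline steps.
PCollection<KV<String, Long>> results = events
    // Where in event time? Fixed 1-minute event-time windows.
    .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
        // When in processing time? At the watermark, with early and late firings.
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(5))
        // How do refinements relate? Later panes accumulate earlier ones.
        .accumulatingFiredPanes())
    // What is calculated? A per-key sum.
    .apply(Sum.longsPerKey());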

SLIDE 5

Customizing What / Where / When / How:

  • 1. Classic Batch
  • 2. Windowed Batch
  • 3. Streaming
  • 4. Streaming + Accumulation

SLIDE 6

What is Apache Beam?

  • 1. Unified model (Batch + strEAM): What / Where / When / How
  • 2. SDKs (Java, Python, …) & DSLs (Scala, …)
  • 3. Runners for existing distributed processing backends (Google Dataflow, Spark, Flink, …)
  • 4. IOs: data store Sources / Sinks
SLIDE 7

Apache Beam vision

[Diagram: the Beam Model drives pipeline construction in Beam Java, Beam Python, and other languages; Beam Model Fn Runners then execute pipelines on Apache Flink, Apache Spark, or Cloud Dataflow]

  • 1. End users: who want to write pipelines in a language that's familiar.
  • 2. SDK/DSL writers: who want to make Beam concepts available in new languages.
  • 3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.

SLIDE 8

Apache Beam - SDKs & DSLs

SDKs

API based on the Beam Model:

  • 1. Current: Java, Python
  • 2. Future (possible) SDKs: Go, Ruby, etc.

DSLs

Domain-Specific Languages based on the Beam Model:

  • 1. Current: Scio (Scala API)
  • 2. Future (ideas): Streaming SQL (Calcite), Machine Learning, Complex Event Processing
SLIDE 9

Apache Beam SDK concepts

  • 1. Pipeline - a data processing job as a directed graph of transformations
  • 2. PCollection - the data inside a pipeline
  • 3. PTransform - a transformation step in the pipeline
    • a. IO transforms - read from a Source or write to a Sink
    • b. Core transforms - common transformations provided (ParDo, GroupByKey, …)
    • c. Composite transforms - combine multiple transforms (see the sketch below)
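A minimal, hedged sketch of a composite transform; the class name and filtering logic are illustrative, not from the slides:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Illustrative composite transform: packages two core transforms
// (Filter, Count) behind one reusable PTransform.
public class CountNonEmpty
    extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> input) {
    return input
        .apply("DropEmpty", Filter.by((String s) -> !s.isEmpty()))
        .apply("CountPerElement", Count.perElement());
  }
}

// Usage: PCollection<KV<String, Long>> counts = lines.apply(new CountNonEmpty());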
SLIDE 10

Apache Beam - Pipeline

Data processing pipeline (executed via a Beam runner)

[Diagram: Read PTransform (source) → PTransform → PTransform → Write PTransform (sink)]
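In code, that shape is just a chain of apply() calls; a hedged skeleton where the transform classes and paths are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Skeleton of the pipeline shape shown above: read, transform, write.
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
p.apply(TextIO.read().from("/path/to/input*.txt"))  // Read PTransform (source)
 .apply(new MyFirstTransform())                     // PTransform (placeholder)
 .apply(new MySecondTransform())                    // PTransform (placeholder)
 .apply(TextIO.write().to("/path/to/output"));      // Write PTransform (sink)
p.run();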

SLIDE 11

Apache Beam - PCollection

  • 1. A PCollection is immutable, does not support random access to elements, and belongs to a Pipeline
  • 2. Each element in a PCollection has a Timestamp (commonly set by the IO Source)
  • 3. A Coder supports different data serializations (timestamps and coders are both shown in the sketch below)
  • 4. Bounded (batch) or Unbounded (streaming), depending on the IO Source
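A hedged illustration of points 2 and 3; the input PCollection "lines" and the parseEventTime helper are hypothetical:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollection;

// Assign an event-time Timestamp to each element (parseEventTime is
// a hypothetical helper returning org.joda.time.Instant).
PCollection<String> stamped =
    lines.apply(WithTimestamps.of((String line) -> parseEventTime(line)));

// Declare the Coder used to serialize elements between workers.
stamped.setCoder(StringUtf8Coder.of());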
SLIDE 12

Apache Beam - PTransform

  • 1. PTransforms are operations that transform data
  • 2. They receive one or more PCollections and produce one or more PCollections
  • 3. They must be Serializable
  • 4. They should be thread-compatible (if you create your own threads, you must synchronize them)
  • 5. Idempotency is not required, but recommended (see the DoFn sketch below)
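A minimal, hedged ParDo sketch; the DoFn is Serializable, keeps no shared mutable state, and its word-splitting logic is illustrative:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// A DoFn must be Serializable; avoiding non-serializable fields and
// shared mutable state lets the runner distribute and retry it safely.
static class ExtractWords extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    for (String word : c.element().split("\\s+")) {
      if (!word.isEmpty()) {
        c.output(word); // idempotent per element: safe to re-execute
      }
    }
  }
}

// Usage: PCollection<String> words = lines.apply(ParDo.of(new ExtractWords()));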
SLIDE 13

Apache Beam - IO Transforms

  • 1. IOs read/write data as PCollections (Source/Sink)
  • 2. They support Bounded and/or Unbounded PCollections
  • 3. Extensible API to create custom sources & sinks
  • 4. They deal with timestamps, watermarks, deduplication, and read/write parallelism (a bounded vs. unbounded read is sketched below)
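For illustration, a bounded file read next to an unbounded Kafka read; the server, topic, and path values are placeholders, and exact builder methods vary by Beam version:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Bounded source: a finite, file-based PCollection (batch).
PCollection<String> batch = p.apply(TextIO.read().from("/path/to/input*.txt"));

// Unbounded source: an endless Kafka stream; the IO also assigns
// timestamps and watermarks so windowing works downstream.
PCollection<KafkaRecord<Long, String>> stream = p.apply(
    KafkaIO.<Long, String>read()
        .withBootstrapServers("kafka:9092")
        .withTopic("events")
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class));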
SLIDE 14
Agenda

  • 1. Evolution of the Big Data programming models
  • 2. The Beam approach
  • 3. Apache Beam

SLIDE 15

Apache Beam - Current IOs

Ready: File, Avro, Google Cloud Storage, BigQuery, BigTable, DataStore, HDFS, Elasticsearch, HBase, MQTT, JDBC, Mongo / GridFS, JMS, Kafka, Kinesis

WIP: Hive, Cassandra, Redis, RabbitMQ, …

SLIDE 16

Apache Beam - Pipeline with IO Example

public static void main(String[] args) {
  // Create a pipeline parameterized by command line flags, e.g. --runner
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
  p.apply(KafkaIO.read().withBootstrapServers(servers)
          .withTopics(topics))                            // Read input
   .apply(new YourFancyFn())                              // Do some processing
   .apply(ElasticsearchIO.write().withAddress(esServer)
          .withIndex(index).withType(type));              // Write output
  // Run the pipeline.
  p.run();
}

SLIDE 17

What are you computing?

(What / Where / When / How)

Transform types: Element-Wise, Aggregating, Composite

SLIDE 18

Apache Beam - Programming model in the SDK

Grouping: GroupByKey, Combine -> Reduce, Sum, Count, Min, Max, Mean, …

Element-wise: ParDo, MapElements, FlatMapElements, Filter, WithKeys, Keys, Values

Windowing/Triggers: FixedWindows, GlobalWindows, SlidingWindows, Sessions, AfterWatermark, AfterProcessingTime, AfterPane, … (the sketch below combines all three families)
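A hedged sketch combining the three families above; the input PCollection "events" and the extractUserId helper are hypothetical:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// Element-wise: key each event by a user id (hypothetical extractor).
PCollection<KV<String, String>> keyed = events.apply(
    WithKeys.of((String e) -> extractUserId(e))
        .withKeyType(TypeDescriptors.strings()));

// Windowing: group each key's events into sessions with a 10-minute gap.
PCollection<KV<String, String>> sessioned = keyed.apply(
    Window.<KV<String, String>>into(Sessions.withGapDuration(Duration.standardMinutes(10))));

// Grouping: count events per user within each session window.
PCollection<KV<String, Long>> perSession = sessioned.apply(Count.perKey());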

SLIDE 19

Apache Beam - Example - GDELT Events by location

Pipeline pipeline = Pipeline.create(options);
// Read events from a text file and parse them.
pipeline
  .apply("GDELTFile", TextIO.Read.from(options.getInput()))
  // Extract location from the fields
  .apply("ExtractLocation", ParDo.of(...))
  // Count events per location
  .apply("CountPerLocation", Count.<String>perElement())
  // Reformat KV as a String
  .apply("StringFormat", MapElements.via(...))
  // Write to result files
  .apply("Results", TextIO.Write.to(options.getOutput()));
// Run the batch pipeline.
pipeline.run();

SLIDE 20

Apache Beam - Runners / Execution Engines

  • Runners “translate” the code to a target runtime (the runner itself doesn’t provide the runtime)
  • Many runners are tied to other top-level Apache projects, such as Apache Flink and Apache Spark
  • Runners can therefore run on-premise (e.g. on your local Flink cluster) or in a public cloud (e.g. using Google Cloud Dataproc or Amazon EMR)
  • Apache Beam treats runners as a top-level use case (with APIs, support, etc.) so runners can be developed with minimal friction, for maximum pipeline portability (runner selection is sketched below)
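As a hedged sketch of that portability: the same pipeline chooses its runner from a command-line flag, provided the runner's artifact is on the classpath:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// The same code runs everywhere; only the flag changes, e.g.:
//   --runner=DirectRunner    (local testing)
//   --runner=FlinkRunner     (Apache Flink)
//   --runner=SparkRunner     (Apache Spark)
//   --runner=DataflowRunner  (Google Cloud Dataflow)
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);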

SLIDE 21

Runners

Same code, different runners & runtimes:

  • Google Cloud Dataflow - managed (NoOps)
  • Apache Flink
  • Apache Spark
  • Apache Apex
  • Apache MapReduce
  • Apache Gearpump
  • Apache Beam Direct Runner - local
  • Apache Karaf - local

(some runners still WIP)

SLIDE 22

Apache Beam - Use cases

Apache Beam is a great choice for both batch and stream processing and can handle bounded and unbounded datasets.

  • Batch can focus on ETL/ELT, catch-up processing, daily aggregations, and so on
  • Stream can focus on handling real-time processing on a record-by-record basis

Real use cases:

  • Data processing, both batch and stream
  • Real-time event processing from IoT devices
  • Fraud detection, …

SLIDE 23

Why Apache Beam?

  • 1. Portable - You can use the same code with different runners (agnostic) and backends: on premise, in the cloud, or locally
  • 2. Unified - Same unified model for batch and stream processing
  • 3. Advanced features - Event windowing, triggering, watermarking, lateness, etc.
  • 4. Extensible model and SDK - Extensible API; you can define custom sources to read and write in parallel

SLIDE 24

Growing the Beam Community

  • Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors
  • Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem

SLIDE 25

Learn More!

Apache Beam: http://beam.apache.org
Join the Beam mailing lists!
user-subscribe@beam.apache.org
dev-subscribe@beam.apache.org
Follow @ApacheBeam on Twitter

SLIDE 26

Thank You!