
Stream Processing with Apache Apex - Thomas Weise, Apache Apex PMC



  1. Stream Processing with Apache Apex. Thomas Weise, Apache Apex PMC Chair (thw@apache.org, @thweise, @atrato_io). October 30, 2017, Dagstuhl Seminar.

  2. Stream Processing with Apache Apex (architecture diagram)
     ● Data Sources: Mobile Devices, Logs, Sensor Data, Social, Databases, CDC (roadmap)
     ● Transform / Analytics: Oper1, Oper2, Oper3
     ● APIs and libraries: Declarative API, SQL, SAMOA, Beam, DAG API, Operator Library
     ● Data Delivery & Storage: real-time visualization, storage, etc.

  3. Why Apex
     ● State Management & Fault tolerance
        ○ End-to-end Exactly-once, Checkpointing and Windowing
        ○ Fine grained recovery, low-latency SLA support
        ○ Queryable state
        ○ Accuracy, Repeatable/Replay
     ● Scalable, high throughput and low latency
        ○ Native Streaming, pipelined processing (data in motion)
        ○ Dynamic scaling and resource allocation, elasticity
     ● Comprehensive library of connectors and transformations
        ○ Accelerate development
        ○ Event time windowing
        ○ High-level and low-level Java API, SQL, Beam Runner
     ● Used by GE (Predix), Capital One, Royal Bank of Canada, Pubmatic, Silver Spring Networks (more at https://apex.apache.org/powered-by-apex.html)

  4. Application Model
     ● A Stream is a sequence of data tuples
     ● An Operator takes one or more input streams, performs computations & emits one or more output streams
        • Custom business logic or built-in operator from Apex library
        • Operator has many instances that run in parallel and each instance is single-threaded
     ● Directed Acyclic Graph (DAG) of operators and streams
     (Diagram: tuples flow through a chain of operators connected by a Filtered Stream, an Enriched Stream, and an Output Stream)
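
The model on this slide can be sketched in a few lines of plain Python. This is not the Apex API (which is Java); the `Operator` class, `connect`, and `run_pipeline` are hypothetical names used only to show tuples flowing through a filter operator, an enrich operator, and an output operator connected by streams.

```python
class Operator:
    """Hypothetical stand-in for an operator with one input and one output port."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn              # returns a tuple to emit, or None to emit nothing
        self.downstream = []      # operators fed by this operator's output stream

    def connect(self, other):
        """Add a stream from this operator's output to another operator's input."""
        self.downstream.append(other)
        return other

    def process(self, tup):
        out = self.fn(tup)
        if out is not None:
            for op in self.downstream:
                op.process(out)


def run_pipeline(readings):
    """Build filter -> enrich -> output and push every reading through it."""
    results = []
    filt = Operator("filter", lambda t: t if t["temp"] > 30 else None)
    enrich = Operator("enrich", lambda t: {**t, "alert": True})
    output = Operator("output", lambda t: results.append(t))  # append returns None
    filt.connect(enrich).connect(output)
    for r in readings:
        filt.process(r)
    return results
```

Calling `run_pipeline` with a list of dictionaries that each carry a `temp` field returns only the readings above the threshold, each with the added `alert` field. The sketch runs in a single thread and ignores partitioning and windowing.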

  5. DAG Translation

  6. Execution Layer
     ● AM requests worker containers from YARN to run physical operators
     ● Worker Containers send data using a pub-sub mechanism
     ● Workers heartbeat to master
     (Diagram: Apex CLI, YARN RM, Apex AM, Worker containers hosting physical operators 1 to 6, Checkpoints written to DFS or other distributed storage)

  7. Operator API (callbacks, with the interface that declares each)
     ● setup, teardown (Component)
     ● activate, deactivate (ActivationListener)
     ● beginWindow, endWindow (Operator)
     ● process (InputPort) or emitTuples (InputOperator)
     ● beforeCheckpoint, checkpointed, committed (CheckpointListener)
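
As a rough illustration of how these callbacks fit together, the following plain-Python sketch drives a toy operator through its lifecycle. The call order shown (setup, activate, then beginWindow / process / endWindow per window, then deactivate, teardown) is an assumption inferred from the callback names on the slide; `SumOperator` and `drive` are hypothetical, and the checkpoint callbacks are left out.

```python
class SumOperator:
    """Hypothetical operator that sums its input and records every callback."""

    def __init__(self):
        self.calls = []   # callback names, in the order they were invoked
        self.total = 0    # operator state

    def setup(self):
        self.calls.append("setup")

    def activate(self):
        self.calls.append("activate")

    def begin_window(self, window_id):
        self.calls.append(f"beginWindow({window_id})")

    def process(self, tup):
        self.total += tup
        self.calls.append("process")

    def end_window(self):
        self.calls.append("endWindow")

    def deactivate(self):
        self.calls.append("deactivate")

    def teardown(self):
        self.calls.append("teardown")


def drive(op, windows):
    """Invoke the callbacks in the assumed order; `windows` is a list of tuple lists."""
    op.setup()
    op.activate()
    for window_id, tuples in enumerate(windows):
        op.begin_window(window_id)
        for t in tuples:
            op.process(t)
        op.end_window()
    op.deactivate()
    op.teardown()
    return op.calls
```

An input operator would have `emitTuples` called inside each window in place of `process`.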

  8. Operator Library
     ● Messaging: Kafka; JMS (ActiveMQ etc.); Kinesis, SQS; Flume, NiFi; MQTT
     ● File Systems: HDFS / Hive; Local File; S3; FTP
     ● NoSQL: Cassandra, HBase; Aerospike, Accumulo; Couchbase, CouchDB; Redis, MongoDB; Geode, Kudu
     ● RDBMS: JDBC; MySQL; Oracle; MemSQL
     ● Other: Elastic Search; Solr; Twitter; WebSocket / HTTP; SMTP
     ● Stateless Transformations
        • Parsers: XML, JSON, CSV, Avro
        • Configurable POJO schema
        • Filter
        • Enrich
        • Map, FlatMap (custom Java function)
        • Script (JavaScript, Jython)
     ● Stateful Transformations
        • Windowing: sliding, tumbling, session
        • Accumulations: sum, merge, join, sort, top n, …
        • Triggering, Watermarks
        • Dimensional Aggregations (with state management for historical data + query)
        • Deduplication

  9. Queryable State
     A set of operators in the library that support real-time queries of operator state.
     (Diagram: Twitter Feed Input Operator, Hashtag Extractor, CountByKey Window, TopN Window, Snapshot Server with Query Input and Result, Pub/Sub Broker, WebSocket, HTTP)
     ● Example: https://github.com/tweise/apex-samples/tree/master/twitter
     ● Pub/Sub server: https://github.com/atrato/pubsub-server
     ● Grafana data source: https://github.com/atrato/apex-grafana-datasource-server
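
The idea of querying operator state while the stream keeps running can be modeled with a small sketch. `TopNState`, `process`, and `query_top` are hypothetical names; the real pipeline on the slide uses library operators, a snapshot server, and a pub/sub broker, none of which is reproduced here.

```python
from collections import Counter


class TopNState:
    """Hypothetical operator state: per-hashtag counts with a top-N query."""

    def __init__(self):
        self.counts = Counter()

    def process(self, hashtag):
        """Update state for one incoming tuple."""
        self.counts[hashtag] += 1

    def query_top(self, n):
        """Answer a query from the current state without stopping the stream."""
        return self.counts.most_common(n)
```

A query can arrive between any two tuples; it reads whatever counts have accumulated so far.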

  10. Application API: DAG API (compositional) and Stream API (declarative)

  11. Fault Tolerance - Checkpointing
     ● Stream is divided into fixed time slices called streaming windows
     ● Checkpoint is performed by Worker Containers at streaming window boundaries
     ● Worker Containers send heartbeats to AM
     ● Recovery is incremental without resetting full DAG
     ● Checkpoints are purged after the corresponding window is committed
     ● AM is also checkpointed
     (Diagram, along a time axis: BeginWindow n ... EndWindow n, bookkeeping & checkpointing done here, BeginWindow n+1 ... EndWindow n+1)
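
A minimal sketch of window-boundary checkpointing and recovery, in plain Python. The class and method names are hypothetical, the checkpoint interval is an arbitrary parameter, and state is kept in memory rather than in distributed storage. The purge rule keeps the checkpoint of the committed window itself, which is one reasonable reading of the slide.

```python
import copy


class CheckpointingCounter:
    """Hypothetical operator that sums tuples and checkpoints at window boundaries."""

    def __init__(self, checkpoint_every=2):
        self.total = 0
        self.checkpoint_every = checkpoint_every
        self.checkpoints = {}   # window_id -> state saved at the end of that window

    def run_window(self, window_id, tuples):
        for t in tuples:
            self.total += t
        # Bookkeeping happens at the boundary, after the last tuple of the window.
        if (window_id + 1) % self.checkpoint_every == 0:
            self.checkpoints[window_id] = copy.deepcopy(self.total)

    def committed(self, window_id):
        """Purge checkpoints taken before the committed window."""
        self.checkpoints = {w: s for w, s in self.checkpoints.items() if w >= window_id}

    def recover(self):
        """Restore the latest checkpoint and return the first window to replay."""
        if not self.checkpoints:
            self.total = 0
            return 0
        last = max(self.checkpoints)
        self.total = copy.deepcopy(self.checkpoints[last])
        return last + 1
```

After a failure, only the windows after the restored checkpoint are replayed into this operator, which is what makes recovery incremental.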

  12. In-Memory PubSub & Recovery
     • Buffer results until committed
     • Backpressure / spillover to disk
     • Ordering, idempotency
     • Independent pipelines (can be used for speculative execution)
     • Downstream Operators reset
     (Diagram: Container 1 on Node 1 with Operator 1 and the Buffer Server; Container 2 on Node 2 with Operator 2)
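
The "buffer results until committed" behavior can be modeled as follows. `BufferServer` here is a hypothetical in-memory class, not the Apex component; it omits backpressure and spillover to disk and only shows replay for a downstream operator that has been reset.

```python
class BufferServer:
    """Hypothetical per-window buffer of an upstream operator's output."""

    def __init__(self):
        self.windows = {}   # window_id -> tuples in the order they were published

    def publish(self, window_id, tup):
        self.windows.setdefault(window_id, []).append(tup)

    def replay_from(self, window_id):
        """Return (window_id, tuple) pairs for all buffered windows >= window_id."""
        return [(w, t)
                for w in sorted(self.windows) if w >= window_id
                for t in self.windows[w]]

    def purge_committed(self, window_id):
        """Drop every window up to and including the committed one."""
        for w in [w for w in self.windows if w <= window_id]:
            del self.windows[w]
```

Because replay preserves the original order within each window, a recovering downstream operator sees the same sequence it saw before the failure.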

  13. Processing Guarantees

  14. End-to-End Exactly-Once
     Exactly-once results = at-least-once + idempotency + operator logic
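
One way to read this formula: the engine may replay a window after recovery (at-least-once), the replayed window carries the same tuples (idempotency), and the output operator decides whether the window was already written (operator logic). The sketch below models that last part; `IdempotentSink` is a hypothetical in-memory sink, and a real store would have to update the data and the window id atomically.

```python
class IdempotentSink:
    """Hypothetical sink that remembers the last window it wrote."""

    def __init__(self):
        self.rows = []
        self.last_window = -1

    def write_window(self, window_id, tuples):
        """Write a window once; return False if it was a replay and got skipped."""
        if window_id <= self.last_window:
            return False
        # In a real store these two updates would need to be one atomic transaction.
        self.rows.extend(tuples)
        self.last_window = window_id
        return True
```

Replaying window 1 after it has been written leaves `rows` unchanged, so the external result looks as if each tuple had been processed exactly once.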

  15. Scaling/Partitioning
     ● Partitioning with Unifiers (diagram: Logical DAG with operators 0, 1, 2, 3; Physical DAG with operator 1 with 3 partitions 1a, 1b, 1c, merged by a unifier U)
     ● NxM Partitioning (Shuffle) (diagram: Logical DAG with operators 0, 1, 2; physical partitions 1a, 1b, 1c and 2a, 2b with unifiers U1 and U2)
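
A small sketch of what a unifier does, assuming a counting operator split into three partitions that receive tuples round-robin. The function names are hypothetical and everything runs in one process; in Apex the partitions would be separate physical operators.

```python
from collections import Counter


def count_partitioned(words, n_partitions=3):
    """Distribute tuples round-robin; each partition keeps its own partial counts."""
    partitions = [Counter() for _ in range(n_partitions)]
    for i, w in enumerate(words):
        partitions[i % n_partitions][w] += 1
    return partitions


def unify(partials):
    """Unifier: merge the partial results of all partitions into one result."""
    merged = Counter()
    for p in partials:
        merged.update(p)   # Counter.update adds counts key by key
    return merged
```

With round-robin distribution the same key can be counted on several partitions, so the downstream operator only sees correct totals after the unifier has merged them.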

  16. Scaling/Partitioning (cont’d)
     ● Parallel Partition (diagram)
     ● Cascading Unifiers (diagram)

  17. Dynamic Scaling
     • Partitioning change while application is running
        ○ Change number of partitions at runtime based on stats
        ○ Determine initial number of partitions dynamically
     • Kafka operators scale according to number of Kafka partitions
        ○ Supports redistribution of state when number of partitions changes
        ○ API for custom scaler or partitioner
     (Diagram: physical plans before and after repartitioning of operators 0, 1 and 2; unifiers not shown)
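
Redistribution of keyed state when the partition count changes might look like the sketch below. It assumes hash partitioning by key and that each key lives in exactly one old partition; `partition_of` and `repartition` are hypothetical helpers, not the Apex partitioner API.

```python
import zlib


def partition_of(key, n):
    """Stable hash partitioning (Python's built-in hash() of str varies between runs)."""
    return zlib.crc32(key.encode("utf-8")) % n


def repartition(old_partitions, new_count):
    """Move every (key, value) of the old partitions to its new owner."""
    new_partitions = [dict() for _ in range(new_count)]
    for state in old_partitions:
        for key, value in state.items():
            new_partitions[partition_of(key, new_count)][key] = value
    return new_partitions
```

No state is lost or duplicated; each key ends up in the partition that will also receive its future tuples.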

  18. Compute Locality
     • By default operators are distributed on different nodes in the cluster
     • Can be collocated on machine, container or thread basis for efficiency
        ○ Default (serialization+IPC)
        ○ HOST (serialization, loopback)
        ○ CONTAINER (in-process queue)
        ○ THREAD (callstack)
     • Host Locality
        ○ Operators can be deployed on specific hosts
     • (Anti-)Affinity
        ○ Ability to express relative deployment without specifying a host

  19. Compute Locality (cont’d)

     Message size (bytes)   Default locality (bytes/s)   CONTAINER_LOCAL (bytes/s)   THREAD_LOCAL (bytes/s)
     64                     59,176,512                   204,748,032                 2,480,432,448
     128                    89,803,904                   395,023,360                 3,662,684,672
     256                    137,019,648                  671,409,664                 5,218,227,968
     512                    156,255,744                  1,255,749,632               4,416,738,304
     1024                   167,139,328                  2,022,868,992               3,423,519,744
     2048                   182,349,824                  3,508,013,056               4,050,688,000
     4096                   255,229,952                  3,732,725,760               3,884,101,632

     Source: https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
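
To make the table easier to compare, the following sketch computes the speedup of each locality setting over the default for every message size. The numbers are copied from the table; `speedups` and `messages_per_second` are hypothetical helper names, and the messages-per-second figure assumes throughput divides evenly into messages of the stated size.

```python
# (message size in bytes, default, CONTAINER_LOCAL, THREAD_LOCAL), all in bytes/s
ROWS = [
    (64, 59_176_512, 204_748_032, 2_480_432_448),
    (128, 89_803_904, 395_023_360, 3_662_684_672),
    (256, 137_019_648, 671_409_664, 5_218_227_968),
    (512, 156_255_744, 1_255_749_632, 4_416_738_304),
    (1024, 167_139_328, 2_022_868_992, 3_423_519_744),
    (2048, 182_349_824, 3_508_013_056, 4_050_688_000),
    (4096, 255_229_952, 3_732_725_760, 3_884_101_632),
]


def speedups(rows):
    """Return {size: (CONTAINER_LOCAL / default, THREAD_LOCAL / default)}."""
    return {size: (c / d, t / d) for size, d, c, t in rows}


def messages_per_second(size, bytes_per_second):
    """Convert a bytes/s figure into messages/s for a given message size."""
    return bytes_per_second / size
```

For 64-byte messages this gives roughly 3.5x for CONTAINER_LOCAL and roughly 42x for THREAD_LOCAL; at 4096 bytes both settings are around 15x the default.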

  20. Recent Additions & Roadmap
     ● Apex runner in Apache Beam
     ● Iterative processing, integrated with Apache SAMOA, opens up ML
     ● Integrated with Apache Calcite, enables SQL
     ● Scalable, incremental state management
     ● User defined control tuples (watermarks, batch control, …)
     ● Apache Kudu connectors
     ● Support for Python
     ● Support for Docker, Mesos and Kubernetes
     ● Enhanced support for Batch Processing
     ● Encrypted Streams

  21. Adoption Challenges for Big Stream Processing: Functionality, Performance, Usability, Operations, Testing

  22. Resources
     • http://apex.apache.org/
     • Powered by Apex - http://apex.apache.org/powered-by-apex.html
     • Learn more - http://apex.apache.org/docs.html
     • Getting involved - http://apex.apache.org/community.html
     • Download - http://apex.apache.org/downloads.html
     • Follow @ApacheApex - https://twitter.com/apacheapex
     • Meetups - https://www.meetup.com/topics/apache-apex/
     • Examples - https://github.com/apache/apex-malhar/tree/master/examples
     • Slideshare - http://www.slideshare.net/ApacheApex/presentations
