
Apache Apex: Next Gen Big Data Analytics (Thomas Weise)

  1. Apache Apex: Next Gen Big Data Analytics
     Thomas Weise <thw@apache.org>, @thweise
     PMC Chair Apache Apex, Architect DataTorrent
     Apache Big Data Europe, Sevilla, Nov 14th 2016

  2. Stream Data Processing
     [Figure: data flows from the data sources through a DAG of operators (Oper1, Oper2, Oper3) to real-time data delivery.]
     • Data Sources: Events, Logs, Sensor Data, Social, Databases, CDC (roadmap)
     • APIs shown: Declarative API, SQL, SAMOA, Beam, DAG API
     • Operator Library
     • Transform / Analytics
     • Real-time Data Delivery: visualization, …

  3. Industries & Use Cases
     • Financial Services: Fraud and risk monitoring; Credit risk assessment; Improve turn around time of trade settlement processes
     • Ad-Tech: Real-time customer facing dashboards on key performance indicators; Click fraud detection; Billing optimization
     • Telecom: Call detail record (CDR) & extended data record (XDR) analysis; Understanding customer behavior AND context; Packaging and selling anonymous customer data
     • Manufacturing: Supply chain planning & optimization; Preventive maintenance; Product quality & defect tracking
     • Energy: Smart meter analytics; Reduce outages & improve resource utilization; Asset & workforce management
     • IoT: Data ingestion and processing; Predictive analytics; Data governance
     Horizontal:
     • Large scale ingest and distribution
     • Enforcing data quality and data governance requirements
     • Real-time ELTA (Extract Load Transform Analyze)
     • Real-time data enrichment with reference data
     • Dimensional computation & aggregation
     • Real-time machine learning model scoring

  4. Apache Apex
     • In-memory, distributed stream processing
       ᵒ Application logic broken into components (operators) that execute distributed in a cluster
       ᵒ Unobtrusive Java API to express (custom) logic
       ᵒ Maintain state and metrics in member variables
       ᵒ Windowing, event-time processing
     • Scalable, high throughput, low latency
       ᵒ Operators can be scaled up or down at runtime according to the load and SLA
       ᵒ Dynamic scaling (elasticity), compute locality
     • Fault tolerance & correctness
       ᵒ Automatically recover from node outages without having to reprocess from beginning
       ᵒ State is preserved, checkpointing, incremental recovery
       ᵒ End-to-end exactly-once
     • Operability
       ᵒ System and application metrics, record/visualize data
       ᵒ Dynamic changes and resource allocation, elasticity

  5. Native Hadoop Integration
     • YARN is the resource manager
     • HDFS for storing persistent state

  6. Application Development Model
     • A Stream is a sequence of data tuples.
     • A typical Operator takes one or more input streams, performs computations, and emits one or more output streams.
       ᵒ Each Operator is your custom business logic in Java, or a built-in operator from our open source library.
       ᵒ An Operator has many instances that run in parallel, and each instance is single-threaded.
     • A Directed Acyclic Graph (DAG) is made up of operators and streams.
     [Figure: Directed Acyclic Graph (DAG) in which Operators pass Tuples along a Filtered Stream, an Enriched Stream, and an Output Stream.]

  7. Development Process
     [Figure: Apex Application reading from Kafka and writing to a Database. Operators: Kafka Input, Parser, Filter, Word Counter, JDBC Output. Streams between them: Lines, Words, Filtered, Counts.]
     • Operators from library or develop for custom logic
     • Connect operators to form application
     • Configure operator properties
     • Configure scaling and other platform attributes
     • Test functionality, performance, iterate
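
     The Parser, Filter, and Word Counter stages named on this slide can be approximated in plain Java to show how data changes from stream to stream. This is only a sketch under stated assumptions: it uses no Apex classes, the class `WordCountSketch`, its method names, and the stop-word filter rule are all hypothetical, and the Kafka input and JDBC output are replaced by an in-memory list and a printed map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java stand-in for the Parser -> Filter -> Word Counter chain.
// No Apex classes are used; every name here is hypothetical.
class WordCountSketch {

    // "Parser": splits each input line into lower-cased words.
    static List<String> parse(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String w : line.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
        }
        return words;
    }

    // "Filter": drops words found in an assumed stop-word set.
    static List<String> filter(List<String> words, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            if (!stopWords.contains(w)) {
                kept.add(w);
            }
        }
        return kept;
    }

    // "Word Counter": accumulates a count per word.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick fox", "the lazy dog and the fox");
        Set<String> stop = new HashSet<>(Arrays.asList("the", "and"));
        System.out.println(count(filter(parse(lines), stop)));
    }
}
```

     In a real application each stage would be a separate operator connected by streams, so the stages could be configured and scaled independently.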

  8. Application Specification
     • DAG API (compositional)
     • Java Stream API (declarative)

  9. Developing Operators

  10. Operator Library
      • Messaging: Kafka; JMS (ActiveMQ, …); Kinesis, SQS; Flume, NiFi
      • NoSQL: Cassandra, HBase; Aerospike, Accumulo; Couchbase/CouchDB; Redis, MongoDB; Geode
      • RDBMS: JDBC; MySQL; Oracle; MemSQL
      • File Systems: HDFS/Hive; NFS; S3
      • Parsers: XML; JSON; CSV; Avro; Parquet
      • Transformations: Filter, Expression, Enrich; Windowing, Aggregation; Join; Dedup
      • Analytics: Dimensional Aggregations (with state management for historical data + query)
      • Protocols: HTTP; FTP; WebSocket; MQTT; SMTP
      • Other: Elastic Search; Script (JavaScript, Python, R); Solr; Twitter

  11. Stateful Processing with Event Time
      Event Stream: (k=A, t=4:00), (k=B, t=5:00), (k=B, t=5:59), (k=A, t=4:30), (k=A, t=5:00)
      State by Processing Time:
      • +30s: (All) : 1 | t=4:00 : 1 | k=A, t=4:00 : 1
      • +60s: (All) : 4 | t=4:00 : 2, t=5:00 : 2 | k=A, t=4:00 : 2, k=B, t=5:00 : 2
      • +90s: (All) : 5 | t=4:00 : 2, t=5:00 : 3 | k=A, t=4:00 : 2, k=A, t=5:00 : 1, k=B, t=5:00 : 2
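
      The state values on this slide can be reproduced with a small plain-Java sketch that counts events per key and per event-time bucket, independent of arrival order. The assumptions are labeled: hourly buckets are inferred from the `t=4:00` and `t=5:00` entries, the class `EventTimeCountSketch` and its methods are hypothetical, and no Apex API is used.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of keyed counting in event-time buckets.
// Class and method names are hypothetical, not Apex API.
class EventTimeCountSketch {
    // Aggregates are kept in a member variable, one entry per dimension combination.
    final Map<String, Integer> state = new TreeMap<>();

    // Assumption: hourly buckets. Truncates "H:MM" to "H:00", e.g. "5:59" -> "5:00".
    static String hourBucket(String eventTime) {
        return eventTime.substring(0, eventTime.indexOf(':')) + ":00";
    }

    // Updates every aggregate the event contributes to, using its event time
    // rather than the time at which it happens to be processed.
    void process(String key, String eventTime) {
        String bucket = hourBucket(eventTime);
        state.merge("(All)", 1, Integer::sum);
        state.merge("t=" + bucket, 1, Integer::sum);
        state.merge("k=" + key + ", t=" + bucket, 1, Integer::sum);
    }

    public static void main(String[] args) {
        EventTimeCountSketch op = new EventTimeCountSketch();
        String[][] events = {
            {"A", "4:00"}, {"B", "5:00"}, {"B", "5:59"}, {"A", "4:30"}, {"A", "5:00"}
        };
        for (String[] e : events) {
            op.process(e[0], e[1]);
        }
        System.out.println(op.state);
    }
}
```

      The late event (k=A, t=4:30) arrives after events stamped 5:00 and 5:59 but still lands in the 4:00 bucket, which is the point of event-time processing.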

  12. Windowing - Apache Beam Model
      • Event-time
      • Session windows
      • Watermarks
      • Accumulation
      • Triggers
      • Keyed or Not Keyed
      • Allowed Lateness
      • Accumulation Mode
      • Merging streams

          ApexStream<String> stream = StreamFactory
              .fromFolder(localFolder)
              .flatMap(new Split())
              .window(new WindowOption.GlobalWindow(),
                  new TriggerOption()
                      .withEarlyFiringsAtEvery(Duration.millis(1000))
                      .accumulatingFiredPanes())
              .countByKey(new ConvertToKeyVal())
              .print();

  13. Fault Tolerance
      • Operator state is checkpointed to persistent store
        ᵒ Automatically performed by engine, no additional coding needed
        ᵒ Asynchronous and distributed
        ᵒ In case of failure operators are restarted from checkpoint state
      • Automatic detection and recovery of failed containers
        ᵒ Heartbeat mechanism
        ᵒ YARN process status notification
      • Buffering to enable replay of data from recovered point
        ᵒ Fast, incremental recovery, spike handling
      • Application master state checkpointed
        ᵒ Snapshot of physical (and logical) plan
        ᵒ Execution layer change log

  14. Checkpointing State
      • Distributed, asynchronous
      • Periodic callbacks
      • No artificial latency
      • Pluggable storage

  15. Buffer Server & Recovery
      • In-memory PubSub
      • Stores results until committed
      • Backpressure / spillover to disk
      • Ordering, idempotency
      • Independent pipelines (can be used for speculative execution)
      • Downstream Operators reset
      [Figure: Operator 1 and Operator 2 in separate containers (Container 1, Container 2) on Node 1 and Node 2, connected through the Buffer Server.]
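
      The "stores results until committed" behaviour can be sketched as a small in-memory buffer keyed by window id. This is a simplified illustration, not the Apex buffer server: the class `BufferServerSketch`, its methods, and the use of `long` window ids and `String` tuples are assumptions, and backpressure and spillover to disk are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory buffer that retains published tuples per window
// until the window is committed, so a restarted subscriber can replay them.
class BufferServerSketch {
    // Sorted by window id so replay preserves window order.
    private final TreeMap<Long, List<String>> windows = new TreeMap<>();

    // Publisher side: append a tuple to the given window.
    void publish(long windowId, String tuple) {
        windows.computeIfAbsent(windowId, id -> new ArrayList<>()).add(tuple);
    }

    // Subscriber side: replay all retained tuples from windowId onward, in order.
    List<String> replayFrom(long windowId) {
        List<String> out = new ArrayList<>();
        for (List<String> tuples : windows.tailMap(windowId, true).values()) {
            out.addAll(tuples);
        }
        return out;
    }

    // Discard windows up to and including the committed one.
    void purgeCommitted(long committedWindowId) {
        windows.headMap(committedWindowId, true).clear();
    }

    public static void main(String[] args) {
        BufferServerSketch buffer = new BufferServerSketch();
        buffer.publish(1, "a");
        buffer.publish(1, "b");
        buffer.publish(2, "c");
        buffer.purgeCommitted(1);                 // window 1 is no longer needed
        System.out.println(buffer.replayFrom(1)); // only window 2 remains
    }
}
```

      Because replay starts at a window boundary and preserves order, a recovered downstream operator sees the same sequence of tuples it would have seen without the failure.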

  16. Recovery Scenario
      Input to the sum operator: … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1
      Operator state (sum) in the four snapshots shown:
      • sum = 0
      • sum = 7
      • sum = 10
      • sum = 7
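
      One way to read the four snapshots: the operator starts at 0, reaches 7 after window 1 (1 + 2 + 4), reaches 10 partway through window 2, then fails and is reset to the checkpointed value 7 before window 2 is replayed. The plain-Java sketch below follows that reading. It assumes BW and EW are begin-window and end-window markers, that the rightmost tuple arrives first, and that a checkpoint is taken at every end of window; the class `SumRecoverySketch` and its methods are hypothetical.

```java
// Hypothetical sum operator with an end-of-window checkpoint and a recovery reset.
class SumRecoverySketch {
    private int sum = 0;        // operator state
    private int checkpoint = 0; // last saved copy of the state

    // Adds one tuple to the running sum.
    void process(int tuple) {
        sum += tuple;
    }

    // Assumption: state is checkpointed at every end-of-window (EW) marker.
    void endWindow() {
        checkpoint = sum;
    }

    // After a failure the state is reset to the last checkpoint.
    void recover() {
        sum = checkpoint;
    }

    int currentSum() {
        return sum;
    }

    public static void main(String[] args) {
        SumRecoverySketch op = new SumRecoverySketch();
        for (int t : new int[] {1, 2, 4}) {
            op.process(t);          // window 1
        }
        op.endWindow();             // sum is 7 and checkpointed
        op.process(3);              // window 2 in flight, sum is 10
        op.recover();               // failure: sum is back to 7
        for (int t : new int[] {3, 1}) {
            op.process(t);          // window 2 replayed from its beginning
        }
        op.endWindow();
        System.out.println(op.currentSum());
    }
}
```

      Replaying the whole of window 2 after the reset is what keeps the tuple 3 from being counted twice.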

  17. Processing Guarantees
      At-least-once
      • On recovery data will be replayed from a previous checkpoint
        ᵒ No messages lost
        ᵒ Default, suitable for most applications
      • Can be used to ensure data is written once to store
        ᵒ Transactions with meta information, Rewinding output, Feedback from external entity, Idempotent operations
      At-most-once
      • On recovery the latest data is made available to operator
        ᵒ Useful in use cases where some data loss is acceptable and latest data is sufficient
      Exactly-once
        ᵒ At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly once behavior

  18. End-to-End Exactly Once
      • Important when writing to external systems
      • Data should not be duplicated or lost in the external system in case of application failures
      • Common external systems
        ᵒ Databases
        ᵒ Files
        ᵒ Message queues
      • Exactly-once = at-least-once + idempotency + consistent state
      • Data duplication must be avoided when data is replayed from checkpoint
        ᵒ Operators implement the logic dependent on the external system
        ᵒ Platform provides checkpointing and repeatable windowing
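
      The formula "at-least-once + idempotency + consistent state" can be illustrated with an output operator that records the last committed window id together with the data, so that a replayed window is detected and skipped. This is a minimal sketch under assumptions: the external store is an in-memory list, the class `ExactlyOnceSinkSketch` and its methods are hypothetical, and a real implementation would write the rows and the window id in one transaction of the external system.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sink that keeps the last committed window id next to the data,
// so windows replayed after recovery are recognized and ignored.
class ExactlyOnceSinkSketch {
    // Stand-in for an external store; rows and window id are assumed to be
    // updated atomically, as they would be inside a database transaction.
    private final List<String> storedRows = new ArrayList<>();
    private long lastCommittedWindow = -1;

    // Writes one window of rows unless that window was already committed.
    // Returns true if the rows were written, false if the window was a duplicate.
    boolean writeWindow(long windowId, List<String> rows) {
        if (windowId <= lastCommittedWindow) {
            return false; // replayed window: already in the store
        }
        storedRows.addAll(rows);
        lastCommittedWindow = windowId;
        return true;
    }

    List<String> rows() {
        return storedRows;
    }
}
```

      The check only works because windows are repeatable: after recovery the same tuples fall into the same window ids, which is the part the slide attributes to the platform.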

  19. Scalability
      • Partitions and Unifier:
        [Figure: Logical Diagram 0 → 1 → 2; Physical Diagram with operator 1 with 3 partitions (1a, 1b, 1c) followed by a Unifier.]
      • NxM Partitions:
        [Figure: Logical DAG 0 → 1 → 2 → 3.]
        [Figure: Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier.]
        [Figure: Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck, with a Unifier in front of each of 2a and 2b.]
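
      The partition and unifier idea can be sketched in plain Java: tuples are routed to partitions by key, each partition computes a partial result, and a unifier merges the partials into one result. The hash-based routing, the class `PartitionUnifierSketch`, and its method names are assumptions made for illustration; they are not the Apex partitioning API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical key-based partitioning with a unifier that merges partial counts.
class PartitionUnifierSketch {

    // Assumption: a key is routed by its hash code, so equal keys share a partition.
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Each partition counts only the keys routed to it.
    static List<Map<String, Integer>> countInPartitions(List<String> keys, int partitionCount) {
        List<Map<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new TreeMap<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, partitionCount)).merge(key, 1, Integer::sum);
        }
        return partitions;
    }

    // The unifier merges the partial results of all partitions.
    static Map<String, Integer> unify(List<Map<String, Integer>> partials) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, count) -> merged.merge(key, count, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(unify(countInPartitions(keys, 3)));
    }
}
```

      A single unifier receives the output of every upstream partition, which is why the slide contrasts a plan with one intermediate unifier against a plan with one unifier per downstream partition.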

  20. Advanced Partitioning
      • Parallel Partition:
        [Figure: Logical DAG 0 → 1 → 2 → 3 → 4.]
        [Figure: Physical DAG with partitions 1a and 1b merged by a Unifier before 2, 3, 4.]
        [Figure: Physical DAG with Parallel Partition: pipelines 1a → 2a → 3a and 1b → 2b → 3b, merged by a Unifier before 4.]
      • Cascading Unifiers:
        [Figure: Logical Plan uopr → dopr.]
        [Figure: Execution Plan, for N = 4; M = 1: uopr1 to uopr4 feed a single unifier in the container of dopr, through one NIC.]
        [Figure: Execution Plan, for N = 4; M = 1, K = 2 with cascading unifiers: intermediate unifiers in separate containers, each with its own NIC, feed the final unifier in front of dopr.]

  21. Dynamic Partitioning
      [Figure: physical plans before and after repartitioning, with partitions 1a-1b, 2a-2d, and 3a-3b. Unifiers not shown.]
      • Partitioning change while application is running
        ᵒ Change number of partitions at runtime based on stats
        ᵒ Determine initial number of partitions dynamically
          • Kafka operators scale according to number of Kafka partitions
        ᵒ Supports re-distribution of state when the number of partitions changes
        ᵒ API for custom scaler or partitioner
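
      Re-distribution of state when the partition count changes can be sketched as rehashing every key of the old partitions into the new set. This is only an illustration under assumptions: keyed counts stand in for operator state, routing by hash code is assumed, and the class `RepartitionSketch` and its method are hypothetical rather than the Apex partitioner API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical redistribution of keyed state across a new number of partitions.
class RepartitionSketch {

    // Moves every (key, count) entry to the partition that owns the key
    // under the new partition count; counts for the same key are merged.
    static List<Map<String, Integer>> repartition(List<Map<String, Integer>> oldPartitions,
                                                  int newCount) {
        if (newCount <= 0) {
            throw new IllegalArgumentException("newCount must be positive");
        }
        List<Map<String, Integer>> result = new ArrayList<>();
        for (int i = 0; i < newCount; i++) {
            result.add(new TreeMap<>());
        }
        for (Map<String, Integer> old : oldPartitions) {
            for (Map.Entry<String, Integer> e : old.entrySet()) {
                // Assumption: keys are routed by hash code.
                int target = Math.floorMod(e.getKey().hashCode(), newCount);
                result.get(target).merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> p0 = new TreeMap<>();
        p0.put("a", 3);
        p0.put("b", 2);
        Map<String, Integer> p1 = new TreeMap<>();
        p1.put("c", 1);
        p1.put("d", 5);
        // Scale from 2 partitions to 4.
        System.out.println(repartition(Arrays.asList(p0, p1), 4));
    }
}
```

      The same routine covers scaling down: passing a smaller count merges the state of several old partitions into one.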
