Apache Apex: Next Gen Big Data Analytics Thomas Weise - PowerPoint PPT Presentation

Apache Apex: Next Gen Big Data Analytics
Thomas Weise <thw@apache.org> @thweise
PMC Chair Apache Apex, Architect DataTorrent
Apache Big Data Europe, Sevilla, Nov 14th 2016

Stream Data Processing
[Diagram: Sources (Events, Logs, Sensor Data, Social, Databases, CDC (roadmap)) -> Data -> Operator DAG (Oper1 -> Oper2 -> Oper3) -> Data Delivery, Transform / Analytics, Real-time visualization, …]
APIs: Declarative SQL, SAMOA, Beam, DAG API; Operator Library

Industries & Use Cases
Financial Services: Fraud and risk monitoring; Credit risk assessment; Improve turnaround time of trade settlement processes
Ad-Tech: Real-time customer facing dashboards on key performance indicators; Click fraud detection; Billing optimization
Telecom: Call detail record (CDR) & extended data record (XDR) analysis; Understanding customer behavior AND context; Packaging and selling anonymous customer data
Manufacturing: Supply chain planning & optimization; Preventive maintenance; Product quality & defect tracking
Energy: Smart meter analytics; Reduce outages & improve resource utilization; Asset & workforce management
IoT: Data ingestion and processing; Predictive analytics; Data governance
HORIZONTAL
• Large scale ingest and distribution
• Real-time ELTA (Extract Load Transform Analyze)
• Dimensional computation & aggregation
• Enforcing data quality and data governance requirements
• Real-time data enrichment with reference data
• Real-time machine learning model scoring

Apache Apex
• In-memory, distributed stream processing
  ᵒ Application logic broken into components (operators) that execute distributed in a cluster
  ᵒ Unobtrusive Java API to express (custom) logic
  ᵒ Maintain state and metrics in member variables
  ᵒ Windowing, event-time processing
• Scalable, high throughput, low latency
  ᵒ Operators can be scaled up or down at runtime according to the load and SLA
  ᵒ Dynamic scaling (elasticity), compute locality
• Fault tolerance & correctness
  ᵒ Automatically recover from node outages without having to reprocess from beginning
  ᵒ State is preserved, checkpointing, incremental recovery
  ᵒ End-to-end exactly-once
• Operability
  ᵒ System and application metrics, record/visualize data
  ᵒ Dynamic changes and resource allocation, elasticity

Native Hadoop Integration
• YARN is the resource manager
• HDFS for storing persistent state

Application Development Model
• A Stream is a sequence of data tuples
• A typical Operator takes one or more input streams, performs computations & emits one or more output streams
  ᵒ Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
  ᵒ An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Diagram: Directed Acyclic Graph (DAG) of Operators connected by streams of tuples (Filtered Stream, Enriched Stream, Output Stream)]

Development Process
[Diagram: Kafka -> Apex Application (Kafka Input -Lines-> Parser -Words-> Filter -Filtered-> Word Counter -Counts-> JDBC Output) -> Database]
• Operators from library or develop for custom logic
• Connect operators to form application
• Configure operator properties
• Configure scaling and other platform attributes
• Test functionality, performance, iterate
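The operator chain in this slide (parse lines into words, filter, count) can be mimicked in plain Java to show how each operator hands tuples to the next one. This is only a sketch and does not use the Apex API: the class name `WordCountPipeline`, the stop-word rule, and the in-memory list standing in for the Kafka input and JDBC output are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Consumer;

// Plain-Java stand-in for the Parser -> Filter -> Word Counter chain.
// Not the Apex API: each "operator" is a Consumer wired to the next one.
final class WordCountPipeline {
    // Hypothetical filter rule: words to drop.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "of");

    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        // Word Counter: accumulates state per word.
        Consumer<String> counter = word -> counts.merge(word, 1, Integer::sum);
        // Filter: forwards only words that are not stop words.
        Consumer<String> filter = word -> {
            if (!STOP_WORDS.contains(word)) {
                counter.accept(word);
            }
        };
        // Parser: splits each line into lower-case words.
        Consumer<String> parser = line ->
            Arrays.stream(line.toLowerCase().split("\\s+"))
                  .filter(w -> !w.isEmpty())
                  .forEach(filter);
        lines.forEach(parser);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick fox", "a quick stream of data")));
    }
}
```

In a real application the input and output ends would be library operators, and the wiring would be declared on the DAG rather than through nested lambdas.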

Application Specification
• DAG API (compositional)
• Java Stream API (declarative)

Developing Operators

Operator Library
Messaging: Kafka; JMS (ActiveMQ, …); Kinesis, SQS; Flume, NiFi
NoSQL: Cassandra, HBase; Aerospike, Accumulo; Couchbase/CouchDB; Redis, MongoDB; Geode
RDBMS: JDBC; MySQL; Oracle; MemSQL
File Systems: HDFS/Hive; NFS; S3
Parsers: XML; JSON; CSV; Avro; Parquet
Transformations: Filter, Expression, Enrich; Windowing, Aggregation; Join; Dedup
Analytics: Dimensional Aggregations (with state management for historical data + query)
Protocols: HTTP; FTP; WebSocket; MQTT; SMTP
Other: Elastic Search; Script (JavaScript, Python, R); Solr; Twitter

Stateful Processing with Event Time
Event Stream (arrival order): k=A t=4:00; k=B t=5:00; k=B t=5:59; k=A t=4:30; k=A t=5:00
Processing Time +30s
• (All) : 1
• t=4:00 : 1
• State: k=A, t=4:00 : 1
Processing Time +60s
• (All) : 4
• t=4:00 : 2; t=5:00 : 2
• State: k=A, t=4:00 : 2; k=B, t=5:00 : 2
Processing Time +90s
• (All) : 5
• t=4:00 : 2; t=5:00 : 3
• State: k=A, t=4:00 : 2; k=A, t=5:00 : 1; k=B, t=5:00 : 2
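The keyed state on this slide can be reproduced with a few lines of plain Java: each event is assigned to the hourly window of its event time, regardless of when it arrives. This is a sketch under assumptions, not Apex code: the class `EventTimeCounter`, the one-hour window size, and the "H:mm" time format are chosen here to match the numbers shown.

```java
import java.util.Map;
import java.util.TreeMap;

// Counts events per (key, hourly event-time window), independent of arrival order.
final class EventTimeCounter {
    private final Map<String, Integer> state = new TreeMap<>();

    // eventTime is "H:mm"; the window is the hour the event falls into.
    void process(String key, String eventTime) {
        String hour = eventTime.substring(0, eventTime.indexOf(':'));
        String window = "k=" + key + ", t=" + hour + ":00";
        state.merge(window, 1, Integer::sum);
    }

    Map<String, Integer> state() {
        return state;
    }

    public static void main(String[] args) {
        EventTimeCounter counter = new EventTimeCounter();
        // Events in arrival order, as listed on the slide.
        String[][] events = {
            {"A", "4:00"}, {"B", "5:00"}, {"B", "5:59"}, {"A", "4:30"}, {"A", "5:00"}
        };
        for (String[] e : events) {
            counter.process(e[0], e[1]);
        }
        counter.state().forEach((w, n) -> System.out.println(w + " : " + n));
    }
}
```

The late event `k=A, t=4:30` arrives after two 5:00 events yet still lands in the 4:00 window, which is the point of keeping state by event time rather than by processing time.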

Windowing - Apache Beam Model
• Event-time
• Session windows
• Watermarks
• Accumulation
• Triggers
• Keyed or Not Keyed
• Allowed Lateness
• Accumulation Mode
• Merging streams

    ApexStream<String> stream = StreamFactory
        .fromFolder(localFolder)
        .flatMap(new Split())
        .window(new WindowOption.GlobalWindow(),
            new TriggerOption()
                .withEarlyFiringsAtEvery(Duration.millis(1000))
                .accumulatingFiredPanes())
        .countByKey(new ConvertToKeyVal())
        .print();

Fault Tolerance
• Operator state is checkpointed to persistent store
  ᵒ Automatically performed by engine, no additional coding needed
  ᵒ Asynchronous and distributed
  ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
  ᵒ Heartbeat mechanism
  ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
  ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
  ᵒ Snapshot of physical (and logical) plan
  ᵒ Execution layer change log

Checkpointing State
• Distributed, asynchronous
• Periodic callbacks
• No artificial latency
• Pluggable storage

Buffer Server & Recovery
• In-memory PubSub
• Stores results until committed
• Backpressure / spillover to disk
• Ordering, idempotency
[Diagram: Node 1, Container 1 (Operator 1 + Buffer Server) -> Node 2, Container 2 (Operator 2)]
Downstream Operators reset
Independent pipelines (can be used for speculative execution)

Recovery Scenario
(Stream shown right to left; BW = begin window, EW = end window. Each frame shows the state of the sum operator.)
… EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 -> sum = 0
… EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 -> sum = 7
… EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 -> sum = 10
… EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 -> sum = 7
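One reading of the four frames on this slide is: the sum starts at 0, reaches 7 at the end of window 1 (where it is checkpointed), climbs to 10 partway through window 2, and falls back to 7 when the operator is restored after a failure. The plain-Java sketch below walks through that sequence. It is an illustration under that assumed reading, not the engine's mechanism; the class `SumRecovery` and its methods are hypothetical names.

```java
// Plain-Java sketch of checkpoint-and-replay for a running sum.
final class SumRecovery {
    private long sum;         // live operator state
    private long checkpoint;  // state saved at the last window boundary

    void process(int tuple) {
        sum += tuple;
    }

    // Called at a window boundary: remember the current state.
    void checkpoint() {
        checkpoint = sum;
    }

    // Called after a failure: discard uncommitted work.
    void restore() {
        sum = checkpoint;
    }

    long sum() {
        return sum;
    }

    public static void main(String[] args) {
        SumRecovery op = new SumRecovery();
        for (int t : new int[] {1, 2, 4}) op.process(t);  // window 1
        op.checkpoint();                                   // sum is 7 here
        op.process(3);                                     // window 2 starts, sum is 10
        op.restore();                                      // failure: back to 7
        for (int t : new int[] {3, 1}) op.process(t);     // window 2 replayed in full
        System.out.println(op.sum());
    }
}
```

Because window 2 is replayed in full from the buffered stream, the tuple 3 is counted once in the final result even though it was processed twice.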

Processing Guarantees
At-least-once
• On recovery data will be replayed from a previous checkpoint
  ᵒ No messages lost
  ᵒ Default, suitable for most applications
• Can be used to ensure data is written once to store
  ᵒ Transactions with meta information, Rewinding output, Feedback from external entity, Idempotent operations
At-most-once
• On recovery the latest data is made available to operator
  ᵒ Useful in use cases where some data loss is acceptable and latest data is sufficient
Exactly-once
  ᵒ At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly once behavior

End-to-End Exactly Once
• Important when writing to external systems
• Data should not be duplicated or lost in the external system in case of application failures
• Common external systems
  ᵒ Databases
  ᵒ Files
  ᵒ Message queues
• Exactly-once = at-least-once + idempotency + consistent state
• Data duplication must be avoided when data is replayed from checkpoint
  ᵒ Operators implement the logic dependent on the external system
  ᵒ Platform provides checkpointing and repeatable windowing
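The formula "at-least-once + idempotency + consistent state" can be illustrated with an output that stores the id of the last window it wrote alongside the data, so that windows replayed after recovery are recognised and skipped. The sketch below uses an in-memory map in place of a database. The class `IdempotentStore` and the window-id scheme are assumptions for illustration, not the library's output operators.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for an external store that records, together with the
// data, the id of the last window written (the "meta information").
final class IdempotentStore {
    private final Map<String, Integer> counts = new HashMap<>();
    private long lastCommittedWindow = -1;

    // Applies a window's output at most once; replays of old windows are skipped.
    boolean writeWindow(long windowId, List<String> keys) {
        if (windowId <= lastCommittedWindow) {
            return false; // already written before the failure
        }
        for (String k : keys) {
            counts.merge(k, 1, Integer::sum);
        }
        lastCommittedWindow = windowId;
        return true;
    }

    int count(String key) {
        return counts.getOrDefault(key, 0);
    }

    public static void main(String[] args) {
        IdempotentStore store = new IdempotentStore();
        store.writeWindow(1, List.of("a", "b"));
        store.writeWindow(2, List.of("a"));
        // Failure and recovery: windows 1 and 2 are replayed, then window 3 arrives.
        store.writeWindow(1, List.of("a", "b"));
        store.writeWindow(2, List.of("a"));
        store.writeWindow(3, List.of("b"));
        System.out.println("a=" + store.count("a") + ", b=" + store.count("b"));
    }
}
```

This only works if a replayed window contains the same tuples as the original, which is why the slide lists repeatable windowing as something the platform has to provide.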

Scalability
NxM Partitions, Unifier
[Diagram: Logical DAG 0 -> 1 -> 2 -> 3]
[Diagram: Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck]
[Diagram: Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier]
[Diagram: Logical Diagram 0 -> 1 -> 2; Physical Diagram with operator 1 with 3 partitions, merged by a Unifier]
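The role of partitions and a unifier can be sketched in plain Java: tuples are routed to one of N partition instances by key hash, each instance counts independently, and a unifier merges the partial results into one stream. This is a simplified model, not the Apex partitioner or unifier interfaces; `PartitionedCount` and hash-modulo routing are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PartitionedCount {
    // Routes each key to one of n partitions by hash, counts per partition,
    // then merges the partial counts the way a unifier would.
    static Map<String, Integer> countWithPartitions(List<String> keys, int n) {
        List<Map<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            partitions.add(new HashMap<>());
        }
        for (String k : keys) {
            int p = Math.floorMod(k.hashCode(), n);
            partitions.get(p).merge(k, 1, Integer::sum);
        }
        // Unifier: combine the outputs of all partitions.
        Map<String, Integer> unified = new HashMap<>();
        for (Map<String, Integer> part : partitions) {
            part.forEach((k, v) -> unified.merge(k, v, Integer::sum));
        }
        return unified;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "a", "c", "b", "a");
        System.out.println(countWithPartitions(keys, 3));
    }
}
```

The merged result is the same for any number of partitions. The bottleneck shown in the diagrams arises when a single unifier sits between two partitioned stages, since all tuples then pass through one instance.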

Advanced Partitioning
Parallel Partition
[Diagram: Logical DAG 0 -> 1 -> 2 -> 3 -> 4]
[Diagram: Physical DAG: 0 -> (1a, 1b) -> Unifier -> 2 -> 3 -> 4]
[Diagram: Physical DAG with Parallel Partition: 0 -> (1a -> 2a -> 3a, 1b -> 2b -> 3b) -> Unifier -> 4]
Cascading Unifiers
[Diagram: Logical Plan uopr -> dopr]
[Diagram: Execution Plan, for N = 4; M = 1: uopr1..uopr4 -> Container NIC -> unifier -> dopr]
[Diagram: Execution Plan, for N = 4; M = 1, K = 2 with cascading unifiers: uopr1..uopr4 -> two intermediate unifiers (one Container NIC each) -> final unifier -> dopr]

Dynamic Partitioning
[Diagram: partition layouts of operators 1, 2 and 3 before and after repartitioning; unifiers not shown]
• Partitioning change while application is running
  ᵒ Change number of partitions at runtime based on stats
  ᵒ Determine initial number of partitions dynamically
    • Kafka operators scale according to number of Kafka partitions
  ᵒ Supports re-distribution of state when the number of partitions changes
  ᵒ API for custom scaler or partitioner
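Re-distribution of state when the partition count changes can be pictured as re-hashing every keyed entry into the new set of partitions. The sketch below shows that idea in plain Java. It is not the Apex partitioner API: the class `Repartitioner`, the per-key counter state, and hash-modulo placement are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Repartitioner {
    // Moves every keyed entry from the old partitions into newCount partitions,
    // placing each key by hash so later tuples for that key find its state.
    static List<Map<String, Integer>> repartition(List<Map<String, Integer>> old, int newCount) {
        List<Map<String, Integer>> result = new ArrayList<>();
        for (int i = 0; i < newCount; i++) {
            result.add(new HashMap<>());
        }
        for (Map<String, Integer> part : old) {
            for (Map.Entry<String, Integer> e : part.entrySet()) {
                int target = Math.floorMod(e.getKey().hashCode(), newCount);
                result.get(target).merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> old = new ArrayList<>();
        old.add(new HashMap<>(Map.of("a", 2, "b", 1)));
        old.add(new HashMap<>(Map.of("c", 5)));
        // Scale from 2 to 4 partitions; total state is preserved.
        System.out.println(repartition(old, 4));
    }
}
```

Scaling down works the same way: entries from several old partitions that hash to the same new partition are merged.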
