Stream Processing with Apache Apex
Thomas Weise
Apache Apex PMC Chair thw@apache.org @thweise @atrato_io
October 30, 2017, Dagstuhl Seminar
Data Sources: Mobile Devices, Logs, Sensor Data, Social, Databases, CDC
Transform / Analytics: Oper1, Oper2, Oper3
Data Delivery & Storage: Real-time visualization, storage, etc.
API and library layers: SQL, Declarative API, DAG API, Operator Library, SAMOA, Beam (part of this stack is marked as roadmap)
○ End-to-end Exactly-once, Checkpointing and Windowing
○ Fine grained recovery, low-latency SLA support
○ Queryable state
○ Accuracy, Repeatable/Replay
○ Native Streaming, pipelined processing (data in motion)
○ Dynamic scaling and resource allocation, elasticity
○ Accelerate development
○ Event time windowing
○ High-level and low level Java API, SQL, Beam Runner
A Stream is a sequence of data tuples.
An Operator takes one or more input streams, performs computations, and emits one or more output streams.
An operator can be a built-in operator from the Apex library.
An operator can have multiple instances that run in parallel; each instance is single-threaded.
Directed Acyclic Graph (DAG) of operators (diagram): Tuple, Output Stream, Filtered Stream, Enriched Stream
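The stream and operator model above can be illustrated with a minimal sketch in plain Python. It does not use the Apex API: the function names, the tuple fields ("user", "amount", "region") and the lookup table are assumptions made up for illustration. Each operator consumes one stream and emits another, mirroring the Filtered Stream and Enriched Stream labels in the diagram.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, object]  # hypothetical tuple type for this sketch


def filter_operator(stream: Iterable[Event],
                    predicate: Callable[[Event], bool]) -> Iterator[Event]:
    """Emit only the tuples that satisfy the predicate (filtered stream)."""
    for tup in stream:
        if predicate(tup):
            yield tup


def enrich_operator(stream: Iterable[Event],
                    regions: Dict[str, str]) -> Iterator[Event]:
    """Emit a copy of each tuple with a looked-up field added (enriched stream)."""
    for tup in stream:
        yield {**tup, "region": regions.get(str(tup["user"]), "unknown")}


def run_pipeline(events: Iterable[Event]) -> List[Event]:
    """Wire the operators into a linear DAG: input -> filter -> enrich."""
    filtered = filter_operator(events, lambda t: int(t["amount"]) > 0)
    enriched = enrich_operator(filtered, {"alice": "EU", "bob": "US"})
    return list(enriched)


events = [
    {"user": "alice", "amount": 10},
    {"user": "bob", "amount": 0},
    {"user": "carol", "amount": 5},
]
result = run_pipeline(events)
```

Generators keep the sketch pipelined: a tuple flows through both operators before the next one is read, which loosely resembles data-in-motion processing, although everything here runs in a single thread and process.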
Execution architecture (diagram): Apex CLI, YARN RM, Apex AM, Workers
Checkpoints: DFS (or other distributed storage)
setup (Component), activate (ActivationListener), beginWindow (Operator), endWindow (Operator), process (InputPort)
emitTuples (InputOperator), teardown (Component), deactivate (ActivationListener), beforeCheckpoint, checkpointed, committed (CheckpointListener)
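One plausible order in which an engine might invoke these callbacks is sketched below in plain Python. This is not the Apex API: the class, the driver function and the `checkpoint_every` parameter are hypothetical, the snake_case method names only mirror the callbacks listed above, and calling `committed` immediately after `checkpointed` is a simplifying assumption.

```python
from typing import List


class RecordingOperator:
    """Hypothetical operator that records every lifecycle call it receives."""

    def __init__(self) -> None:
        self.calls: List[str] = []
        self.total = 0

    def setup(self) -> None:
        self.calls.append("setup")

    def activate(self) -> None:
        self.calls.append("activate")

    def begin_window(self, window_id: int) -> None:
        self.calls.append(f"beginWindow {window_id}")

    def process(self, value: int) -> None:
        self.total += value
        self.calls.append("process")

    def end_window(self, window_id: int) -> None:
        self.calls.append(f"endWindow {window_id}")

    def before_checkpoint(self, window_id: int) -> None:
        self.calls.append(f"beforeCheckpoint {window_id}")

    def checkpointed(self, window_id: int) -> None:
        self.calls.append(f"checkpointed {window_id}")

    def committed(self, window_id: int) -> None:
        self.calls.append(f"committed {window_id}")

    def deactivate(self) -> None:
        self.calls.append("deactivate")

    def teardown(self) -> None:
        self.calls.append("teardown")


def drive(op: RecordingOperator, windows: List[List[int]],
          checkpoint_every: int = 2) -> None:
    """Assumed call order: setup/activate, per-window callbacks, shutdown."""
    op.setup()
    op.activate()
    for window_id, tuples in enumerate(windows, start=1):
        op.begin_window(window_id)
        for value in tuples:
            op.process(value)
        op.end_window(window_id)
        # Checkpoint callbacks fire only at a window boundary.
        if window_id % checkpoint_every == 0:
            op.before_checkpoint(window_id)
            op.checkpointed(window_id)
            op.committed(window_id)  # simplification: no delay before commit
    op.deactivate()
    op.teardown()


op = RecordingOperator()
drive(op, [[1, 2], [3]])
```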
Stateful Transformations
data + query)
RDBMS
NoSQL
Messaging
File Systems
Stateless Transformations
Other
Example pipeline (diagram): Twitter Feed Input Operator, Hashtag Extractor, CountByKey Window, TopN Window, Snapshot Server
Query Input and Result: Pub/Sub Broker (HTTP, WebSocket)
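The core computation of this example (extract hashtags, count by key, keep the top N) can be sketched in a few lines of plain Python. The sketch assumes a single in-memory batch standing in for one window; the regular expression, function names and sample tweets are made up for illustration and are not taken from the Apex example application.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

HASHTAG = re.compile(r"#(\w+)")  # assumed definition of a hashtag


def extract_hashtags(text: str) -> List[str]:
    """Return the lower-cased hashtags found in one tweet."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def count_by_key(tweets: Iterable[str]) -> Counter:
    """Count hashtag occurrences across all tweets in the window."""
    counts: Counter = Counter()
    for tweet in tweets:
        counts.update(extract_hashtags(tweet))
    return counts


def top_n(counts: Counter, n: int) -> List[Tuple[str, int]]:
    """Return the n most frequent hashtags, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


tweets = [
    "Loving #ApacheApex and #streaming",
    "#streaming at scale",
    "No tags here",
    "#Streaming with #apacheapex on #yarn",
]
counts = count_by_key(tweets)
top = top_n(counts, 2)
```

In a streaming setting the counts would be kept per window and the top N recomputed as each window closes; the batch version above only shows the per-window logic.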
called streaming windows
Containers at streaming window boundaries
AM
full DAG
corresponding window is committed
Timeline (diagram): ... BeginWindow n, EndWindow n, BeginWindow n+1, EndWindow n+1 ...
Bookkeeping & checkpointing are done at the window boundary.
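Checkpointing at streaming window boundaries, followed by recovery and replay, can be illustrated with a small plain-Python model. It is a sketch under stated assumptions, not the engine's mechanism: state is checkpointed after every window for brevity, checkpoints live in an in-memory dictionary instead of distributed storage, and the class and function names are hypothetical.

```python
import copy
from typing import Dict, List


class CountingOperator:
    """Hypothetical stateful operator: counts tuples per key."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def process(self, key: str) -> None:
        self.counts[key] = self.counts.get(key, 0) + 1


def process_window(op: CountingOperator, tuples: List[str]) -> None:
    """Everything between BeginWindow and EndWindow for one window."""
    for key in tuples:
        op.process(key)


windows = [["a", "b"], ["a"], ["b", "b"]]
checkpoints: Dict[int, Dict[str, int]] = {}

op = CountingOperator()
for window_id in (0, 1):
    process_window(op, windows[window_id])
    # Window boundary: state is consistent, so snapshot it here.
    checkpoints[window_id] = copy.deepcopy(op.counts)

# Simulated failure in the middle of window 2: one tuple was processed,
# so the in-memory state no longer matches any window boundary.
op.process(windows[2][0])

# Recovery: restore the latest checkpoint and replay the later windows.
last = max(checkpoints)
recovered = CountingOperator()
recovered.counts = copy.deepcopy(checkpoints[last])
for window_id in range(last + 1, len(windows)):
    process_window(recovered, windows[window_id])
```

Because the snapshot is only taken between windows, the restored state never contains a partially processed window, and replaying from the next window reproduces the result of a failure-free run.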
Operator 1: Container 1, Node 1
Operator 2: Container 2, Node 2
Logical DAG: operators 1, 2
Physical DAG with operator 1 with 3 partitions: 1a, 1b, 1c, U, 2
(Further partitioning diagrams: operators 1, 2, 3 with partitions 1a, 1b, 1c, 2a, 2b and unifiers U, U1, U2)
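The physical DAG with operator 1 in 3 partitions can be modeled with a short plain-Python sketch: tuples are routed to a partition by key, each partition computes a partial result, and a unifier (the U in the diagrams) merges the partial results before the downstream operator sees them. Hash-based routing with CRC32 and all function names are assumptions chosen for illustration, not the partitioning scheme of Apex.

```python
import zlib
from collections import Counter
from typing import List


def partition_of(key: str, num_partitions: int) -> int:
    """Route a key to a partition with a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split(tuples: List[str], num_partitions: int) -> List[List[str]]:
    """Distribute the input stream across the partitions of operator 1."""
    parts: List[List[str]] = [[] for _ in range(num_partitions)]
    for key in tuples:
        parts[partition_of(key, num_partitions)].append(key)
    return parts


def count_partition(tuples: List[str]) -> Counter:
    """Work done independently by one partition (1a, 1b or 1c)."""
    return Counter(tuples)


def unify(partials: List[Counter]) -> Counter:
    """Unifier: merge the partial results into a single stream for operator 2."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


stream = ["x", "y", "x", "z", "y", "x"]
parts = split(stream, 3)
unified = unify([count_partition(p) for p in parts])
```

Routing by key means every occurrence of a key lands in the same partition, so the unifier only has to combine disjoint partial counts.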
(Partitioning diagrams: operators 1, 2, 3, 4 with partitions 1a, 1b, 2a, 2b, 3a, 3b and unifiers U)
ᵒ Change the number of partitions at runtime based on stats
ᵒ Determine the initial number of partitions dynamically
ᵒ Supports redistribution of state when the number of partitions changes
ᵒ API for custom scaler or partitioner
(Repartitioning diagram: operators 0, 1, 2 with partitions 0a, 0b, 1a, 1b, 1c, 1d, 2a, 2b. Unifiers not shown.)
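The idea of changing the partition count from runtime stats and redistributing keyed state can be sketched in plain Python. Everything here is a hypothetical model: the throughput-based sizing rule, the CRC32 key routing and the function names are assumptions for illustration, not the Apex scaler or partitioner API.

```python
import zlib
from typing import Dict, List

State = Dict[str, int]  # keyed state held by one partition


def owner(key: str, num_partitions: int) -> int:
    """Partition that owns a key under the assumed hash routing."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def choose_partition_count(tuples_per_sec: int, capacity_per_partition: int) -> int:
    """Hypothetical scaling rule: enough partitions to cover the observed load."""
    return max(1, -(-tuples_per_sec // capacity_per_partition))  # ceiling division


def repartition(states: List[State], new_count: int) -> List[State]:
    """Move every key's state to the partition that owns it after the change."""
    new_states: List[State] = [{} for _ in range(new_count)]
    for state in states:
        for key, value in state.items():
            new_states[owner(key, new_count)][key] = value
    return new_states


# State of a 2-partition operator before scaling.
all_state = {"a": 4, "b": 1, "c": 7, "d": 2}
old_states: List[State] = [{}, {}]
for key, value in all_state.items():
    old_states[owner(key, 2)][key] = value

new_count = choose_partition_count(tuples_per_sec=2500, capacity_per_partition=1000)
new_states = repartition(old_states, new_count)
```

The sketch moves state eagerly in one step; a real system would do this at a consistent point such as a checkpoint so that no tuple is processed against half-moved state.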
ᵒ Operators can be deployed on specific hosts
ᵒ Ability to express relative deployment without specifying a host
(serialization+IPC)
(serialization, loopback)
(in-process queue)
(callstack)
Functionality, Performance, Usability, Testing, Operations