Stream Processing with Apache Apex


slide-1
SLIDE 1

Stream Processing with Apache Apex

Thomas Weise

Apache Apex PMC Chair thw@apache.org @thweise @atrato_io

October 30, 2017, Dagstuhl Seminar

slide-2
SLIDE 2

Stream Processing with Apache Apex

[Figure: platform overview]

Data Sources: Mobile Devices, Logs, Sensor Data, Social, Databases, CDC

Transform / Analytics: Oper1, Oper2, Oper3

Data Delivery & Storage: Real-time visualization, storage, etc.

APIs and libraries: SQL, Declarative API, DAG API, SAMOA, Beam, Operator Library (part of this stack is labeled "roadmap" in the figure)

slide-3
SLIDE 3
Why Apex

  • State Management & Fault tolerance
    ○ End-to-end Exactly-once, Checkpointing and Windowing
    ○ Fine grained recovery, low-latency SLA support
    ○ Queryable state
    ○ Accuracy, Repeatable/Replay
  • Scalable, high throughput and low latency
    ○ Native Streaming, pipelined processing (data in motion)
    ○ Dynamic scaling and resource allocation, elasticity
  • Comprehensive library of connectors and transformations
    ○ Accelerate development
    ○ Event time windowing
    ○ High-level and low level Java API, SQL, Beam Runner
  • Used by GE (Predix), Capital One, Royal Bank of Canada, Pubmatic, Silver Spring Networks (more at https://apex.apache.org/powered-by-apex.html)

slide-4
SLIDE 4

Application Model

A Stream is a sequence of data tuples.

An Operator takes one or more input streams, performs computations & emits one or more output streams.

  • Custom business logic or built-in operator from Apex library
  • Operator has many instances that run in parallel and each instance is single-threaded

Directed Acyclic Graph (DAG) of operators and streams

[Figure: Directed Acyclic Graph (DAG) of Operators connected by streams; labels: Tuple, Output Stream, Filtered Stream, Enriched Stream]
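
The model above can be sketched in a few lines of plain Python. This is only an illustration of tuples, operators and streams forming a DAG; it does not use the Apex Java API, and the names `Operator`, `connect` and `received` are hypothetical.

```python
from typing import Any, Callable, Iterable, List


class Operator:
    """Hypothetical stand-in for an operator with one input and one output port."""

    def __init__(self, name: str, fn: Callable[[Any], Iterable[Any]]):
        self.name = name
        self.fn = fn                    # maps one input tuple to zero or more output tuples
        self.downstream: List["Operator"] = []
        self.received: List[Any] = []   # tuples seen on the input port, kept for inspection

    def connect(self, other: "Operator") -> "Operator":
        """Add a stream from this operator's output to `other`'s input."""
        self.downstream.append(other)
        return other

    def process(self, tup: Any) -> None:
        self.received.append(tup)
        for out in self.fn(tup):
            for op in self.downstream:
                op.process(out)


# A three-operator DAG: filter -> enrich -> sink
filter_op = Operator("filter", lambda t: [t] if t["value"] >= 10 else [])
enrich_op = Operator("enrich", lambda t: [{**t, "level": "high"}])
sink_op = Operator("sink", lambda t: [])
filter_op.connect(enrich_op).connect(sink_op)

for tup in [{"value": 3}, {"value": 12}, {"value": 40}]:
    filter_op.process(tup)

results = sink_op.received
```

Here `results` holds the filtered and enriched tuples that reached the sink. A real engine would run each operator instance in its own thread or container; the sketch calls them in-process.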

slide-5
SLIDE 5

DAG Translation


slide-6
SLIDE 6

Execution Layer

  • Application Master (AM) requests worker containers from YARN to run physical operators
  • Worker Containers send data using a pub-sub mechanism
  • Workers heartbeat to master

[Figure: Apex CLI, YARN RM, Apex AM and Worker containers, with numbered steps 1 to 6; Checkpoints are written to DFS (or other distributed storage)]

slide-7
SLIDE 7

Operator API

  • setup (Component)
  • activate (ActivationListener)
  • beginWindow (Operator)
  • process (InputPort) or emitTuples (InputOperator)
  • endWindow (Operator)
  • beforeCheckpoint, checkpointed, committed (CheckpointListener)
  • deactivate (ActivationListener)
  • teardown (Component)
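
A rough Python sketch of how an engine might drive these callbacks. The class `RecordingOperator` and the driver `run` are hypothetical, the method names are snake_case versions of the callbacks on the slide, and the call order is an assumption based on their names rather than a statement about the Apex engine.

```python
class RecordingOperator:
    """Hypothetical operator that logs each lifecycle callback it receives."""

    def __init__(self):
        self.calls = []       # callback names, in the order they were invoked
        self.emitted = []     # one output value per streaming window
        self._window_sum = 0

    def setup(self):
        self.calls.append("setup")

    def activate(self):
        self.calls.append("activate")

    def begin_window(self, window_id):
        self.calls.append("beginWindow")
        self._window_sum = 0

    def process(self, tup):
        self.calls.append("process")
        self._window_sum += tup

    def end_window(self):
        self.calls.append("endWindow")
        self.emitted.append(self._window_sum)

    def before_checkpoint(self, window_id):
        self.calls.append("beforeCheckpoint")

    def checkpointed(self, window_id):
        self.calls.append("checkpointed")

    def committed(self, window_id):
        self.calls.append("committed")

    def deactivate(self):
        self.calls.append("deactivate")

    def teardown(self):
        self.calls.append("teardown")


def run(op, windows, checkpoint_every=2):
    """Drive `op` through its lifecycle; `windows` is a list of tuple lists."""
    op.setup()
    op.activate()
    for window_id, tuples in enumerate(windows, start=1):
        op.begin_window(window_id)
        for tup in tuples:
            op.process(tup)
        op.end_window()
        if window_id % checkpoint_every == 0:
            # Simplification: a real engine would commit later, once the
            # whole application has checkpointed this window.
            op.before_checkpoint(window_id)
            op.checkpointed(window_id)
            op.committed(window_id)
    op.deactivate()
    op.teardown()


op = RecordingOperator()
run(op, [[1, 2], [3, 4]])
```

After the run, `op.calls` lists the callbacks in invocation order and `op.emitted` holds one sum per streaming window.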

slide-8
SLIDE 8

Operator Library


Stateful Transformations

  • Windowing: sliding, tumbling, session
  • Accumulations: sum, merge, join, sort, top n, …
  • Triggering, Watermarks
  • Dimensional Aggregations (with state management for historical data + query)

  • Deduplication
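
As an illustration of the tumbling-window case with a sum accumulation, here is a minimal Python sketch. It is not the Apex library's windowing operator; the function name `tumbling_window_sums` and the `(event_time, value)` tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_size):
    """Sum values per fixed, non-overlapping event-time window.

    `events` is an iterable of (event_time, value) pairs; the result maps
    each window's start time to the sum of the values that fall into it.
    """
    sums = defaultdict(int)
    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        sums[window_start] += value
    return dict(sums)


# Events may arrive out of order; assignment depends only on event time.
events = [(1, 10), (4, 5), (5, 1), (12, 2), (9, 7)]
sums = tumbling_window_sums(events, window_size=5)
```

The sketch never closes a window. A streaming implementation would use watermarks and triggers, as listed above, to decide when a window's result can be emitted.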

RDBMS

  • JDBC
  • MySQL
  • Oracle
  • MemSQL

NoSQL

  • Cassandra, HBase
  • Aerospike, Accumulo
  • Couchbase, CouchDB
  • Redis, MongoDB
  • Geode, Kudu

Messaging

  • Kafka
  • JMS (ActiveMQ etc.)
  • Kinesis, SQS
  • Flume, NiFi
  • MQTT

File Systems

  • HDFS / Hive
  • Local File
  • S3
  • FTP

Stateless Transformations

  • Parsers: XML, JSON, CSV, Avro
  • Filter
  • Enrich
  • Configurable POJO schema
  • Map, FlatMap (custom Java function)
  • Script (JavaScript, Jython)

Other

  • Elasticsearch
  • Solr
  • Twitter
  • WebSocket / HTTP
  • SMTP
slide-9
SLIDE 9

Queryable State

A set of operators in the library that support real-time queries of operator state.

[Figure: Twitter Feed Input Operator, Hashtag Extractor, CountByKey Window, TopN Window and Snapshot Server; Query Input and Result are exchanged through a Pub/Sub Broker (HTTP, WebSocket)]

  • Example: https://github.com/tweise/apex-samples/tree/master/twitter
  • Pub/Sub server: https://github.com/atrato/pubsub-server
  • Grafana data source: https://github.com/atrato/apex-grafana-datasource-server
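
The idea can be sketched in plain Python: count hashtags per window, keep the top N, and publish a snapshot that queries read without touching the operator's working state. The names `extract_hashtags`, `SnapshotServer` and `process_window` are hypothetical and only loosely mirror the operators in the figure; the linked example is the authoritative version.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the lower-cased hashtags found in one tweet."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


class SnapshotServer:
    """Holds the latest result so that queries never see partial windows."""

    def __init__(self):
        self._snapshot = []

    def update(self, top):
        self._snapshot = list(top)

    def query(self):
        return list(self._snapshot)


def process_window(tweets, n, server):
    """Count hashtags in one window and publish the top `n` as a snapshot."""
    counts = Counter()
    for tweet in tweets:
        counts.update(extract_hashtags(tweet))
    server.update(counts.most_common(n))


server = SnapshotServer()
process_window(["#apex is fast", "#Apex and #beam", "#apex #beam #yarn"], 2, server)
top = server.query()
```

In the deck's setup the query and the result travel through a pub/sub broker; the sketch replaces that with a direct method call.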
slide-10
SLIDE 10

Application API

  • Stream API (declarative)
  • DAG API (compositional)

slide-11
SLIDE 11

Fault Tolerance - Checkpointing

  • Stream is divided into fixed time slices called streaming windows
  • Checkpoint is performed by Worker Containers at streaming window boundaries
  • Worker Containers send heartbeats to AM
  • Recovery is incremental without resetting full DAG
  • Checkpoints are purged after the corresponding window is committed
  • AM is also checkpointed

[Figure: timeline of BeginWindow n, EndWindow n, BeginWindow n+1, EndWindow n+1; bookkeeping and checkpointing are done at the window boundary]
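
A minimal Python sketch of the points above, assuming a single counting operator and a checkpoint after every second streaming window. `CHECKPOINT_INTERVAL`, `process_windows` and the in-memory `checkpoints` dictionary are hypothetical; a real deployment would write checkpoints to distributed storage.

```python
import copy

CHECKPOINT_INTERVAL = 2  # assumed: checkpoint after every 2nd streaming window

# Each inner list is the content of one streaming window (ids 0, 1, 2, ...).
windows = [["a"], ["a", "b"], ["b"], ["a"]]
checkpoints = {}  # window id -> copy of operator state at the end of that window


def process_windows(state, first, stop_before):
    """Apply windows [first, stop_before) to `state`, checkpointing at boundaries."""
    for window_id in range(first, stop_before):
        for key in windows[window_id]:
            state[key] = state.get(key, 0) + 1
        if (window_id + 1) % CHECKPOINT_INTERVAL == 0:
            checkpoints[window_id] = copy.deepcopy(state)
    return state


# First attempt: windows 0..2 are processed, then the container is lost
# together with its in-memory state.
process_windows({}, 0, 3)

# Recovery: restore the latest checkpoint and replay only the later windows.
last = max(checkpoints)
recovered = process_windows(copy.deepcopy(checkpoints[last]), last + 1, len(windows))

# Once a window is committed, older checkpoints are no longer needed.
committed = max(checkpoints)
for window_id in [w for w in checkpoints if w < committed]:
    del checkpoints[window_id]
```

The recovered state equals what an uninterrupted run would have produced, because replay starts exactly at the window after the restored checkpoint.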

slide-12
SLIDE 12

In-Memory PubSub & Recovery

  • Buffer results until committed
  • Backpressure / spillover to disk
  • Ordering, idempotency

[Figure: Operator 1 in Container 1 on Node 1 publishes through a Buffer Server to Operator 2 in Container 2 on Node 2]

  • Downstream Operators reset
  • Independent pipelines (can be used for speculative execution)
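
The buffering behaviour can be illustrated with a small Python class. This is a sketch under the assumption that data is buffered per streaming window and dropped once a window is committed; `BufferServer`, `publish`, `replay` and `purge` are hypothetical names, not the Apex buffer server's interface.

```python
class BufferServer:
    """Keeps published tuples per window until the window is committed."""

    def __init__(self):
        self._windows = {}  # window id -> tuples in publish order

    def publish(self, window_id, tup):
        self._windows.setdefault(window_id, []).append(tup)

    def replay(self, from_window):
        """Yield (window_id, tuple) for every buffered window >= from_window."""
        for window_id in sorted(self._windows):
            if window_id >= from_window:
                for tup in self._windows[window_id]:
                    yield window_id, tup

    def purge(self, committed_window):
        """Drop all windows up to and including the committed one."""
        for window_id in [w for w in self._windows if w <= committed_window]:
            del self._windows[window_id]


server = BufferServer()
for window_id, tuples in [(1, ["x"]), (2, ["y", "z"]), (3, ["w"])]:
    for tup in tuples:
        server.publish(window_id, tup)

# A downstream operator restarts from its checkpoint at window 1,
# so it asks for everything from window 2 onwards, in the original order.
replayed = list(server.replay(from_window=2))

# After window 2 is committed its data can be released.
server.purge(committed_window=2)
remaining = list(server.replay(from_window=1))
```

Because replay preserves publish order, a reset downstream operator sees the same sequence it saw before the failure, which is what makes idempotent reprocessing possible.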

slide-13
SLIDE 13

Processing Guarantees


slide-14
SLIDE 14

End-to-End Exactly-Once

Exactly-once results = at-least-once + idempotency + operator logic
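
A small Python sketch of why this combination works, assuming that a replayed window carries the same tuples as its first delivery. `IdempotentSink` and `NaiveSink` are hypothetical classes written for the comparison; they are not Apex output operators.

```python
class IdempotentSink:
    """Writes one result per window id, so a replayed window overwrites itself."""

    def __init__(self):
        self.store = {}  # window id -> aggregate for that window

    def write_window(self, window_id, tuples):
        self.store[window_id] = sum(tuples)

    def total(self):
        return sum(self.store.values())


class NaiveSink:
    """Adds every delivery to a running total, so replays are counted twice."""

    def __init__(self):
        self.running_total = 0

    def write_window(self, window_id, tuples):
        self.running_total += sum(tuples)

    def total(self):
        return self.running_total


# At-least-once delivery: window 2 is delivered again after a recovery.
deliveries = [(1, [1, 2]), (2, [3, 4]), (2, [3, 4]), (3, [5])]

idempotent, naive = IdempotentSink(), NaiveSink()
for window_id, tuples in deliveries:
    idempotent.write_window(window_id, tuples)
    naive.write_window(window_id, tuples)
```

The idempotent sink ends with the same total as a failure-free run, while the naive sink counts window 2 twice. The "operator logic" part of the equation is the keyed-by-window write itself.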
slide-15
SLIDE 15

Scaling/Partitioning

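Partitioning with Unifiers:

[Figure: logical DAG with operators 1 and 2; physical DAG with operator 1 in 3 partitions (1a, 1b, 1c), a unifier (U) and operator 2]

NxM Partitioning (Shuffle):

[Figure: logical DAG with operators 1, 2 and 3; physical DAG with partitions 1a, 1b, 1c and 2a, 2b, unifiers (U, U1, U2) and operator 3]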
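
Partitioning with a unifier can be sketched in Python as follows: tuples are routed to one of three partitions by key, each partition computes a partial count, and a unifier merges the partial results. The functions `partition_of`, `count_partition` and `unify` are hypothetical, and the CRC-based routing is only one possible choice.

```python
import zlib
from collections import Counter


def partition_of(key, count):
    """Stable key-to-partition mapping (CRC32 instead of Python's salted hash)."""
    return zlib.crc32(key.encode("utf-8")) % count


def count_partition(tuples):
    """Work done by one partition: count the keys it received."""
    return Counter(tuples)


def unify(partials):
    """Unifier: merge the partial counts of all partitions into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


tuples = ["a", "b", "a", "c", "b", "a"]
partition_count = 3

inputs = [[] for _ in range(partition_count)]
for tup in tuples:
    inputs[partition_of(tup, partition_count)].append(tup)

partials = [count_partition(part) for part in inputs]
unified = unify(partials)
```

The unified result does not depend on how keys are spread across partitions, which is what lets the partition count change without changing the downstream operator.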

slide-16
SLIDE 16

Scaling/Partitioning (cont’d)

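Parallel Partition:

[Figure: logical DAG with operators 1, 2, 3 and 4; physical DAG with partitions 1a, 1b, 2a, 2b, 3a, 3b, a unifier (U) and operator 4]

Cascading Unifiers:

[Figure: several partitions of operator 1 merged through unifiers U1, U2 and U3 before operator 2]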

slide-17
SLIDE 17

Dynamic Scaling

  • Partitioning change while application is running
    ᵒ Change number of partitions at runtime based on stats
    ᵒ Determine initial number of partitions dynamically
      • Kafka operators scale according to number of Kafka partitions
    ᵒ Supports redistribution of state when number of partitions changes
    ᵒ API for custom scaler or partitioner

[Figure: physical plans before and after repartitioning, with operator partitions 0a, 0b, 1a to 1d, 2a, 2b; unifiers not shown]
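
Redistribution of keyed state on a partition-count change might look like the following Python sketch. It assumes state is a dictionary per partition and that keys are routed by a stable hash; `partition_of`, `repartition` and `merged` are hypothetical helpers, not the Apex partitioner API.

```python
import zlib


def partition_of(key, count):
    """Stable key-to-partition mapping (CRC32 instead of Python's salted hash)."""
    return zlib.crc32(key.encode("utf-8")) % count


def repartition(partitions, new_count):
    """Move every key's state to the partition that owns it under `new_count`."""
    new_partitions = [{} for _ in range(new_count)]
    for partition in partitions:
        for key, value in partition.items():
            new_partitions[partition_of(key, new_count)][key] = value
    return new_partitions


def merged(partitions):
    """Combine per-partition state into one dictionary, for comparison."""
    combined = {}
    for partition in partitions:
        combined.update(partition)
    return combined


state = {"k1": 5, "k2": 1, "k3": 8, "k4": 2, "k5": 13}
two = repartition([state], 2)   # scale out from 1 to 2 partitions
four = repartition(two, 4)      # scale out again from 2 to 4 partitions
```

No state is lost or duplicated by either step, and after each step every key lives in the partition its hash selects, so new tuples for that key find their state.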

slide-18
SLIDE 18

Compute Locality

  • By default operators are distributed on different nodes in the cluster
  • Can be collocated on machine, container or thread basis for efficiency
  • Host Locality
    ᵒ Operators can be deployed on specific hosts
  • (Anti-)Affinity
    ᵒ Ability to express relative deployment without specifying a host

Locality options:

  • Default (serialization+IPC)
  • HOST (serialization, loopback)
  • CONTAINER (in-process queue)
  • THREAD (callstack)
slide-19
SLIDE 19

Compute Locality (cont’d)

Message size (bytes)   Default locality (bytes/s)   CONTAINER_LOCAL (bytes/s)   THREAD_LOCAL (bytes/s)
64                     59,176,512                   204,748,032                 2,480,432,448
128                    89,803,904                   395,023,360                 3,662,684,672
256                    137,019,648                  671,409,664                 5,218,227,968
512                    156,255,744                  1,255,749,632               4,416,738,304
1024                   167,139,328                  2,022,868,992               3,423,519,744
2048                   182,349,824                  3,508,013,056               4,050,688,000
4096                   255,229,952                  3,732,725,760               3,884,101,632

Source: https://www.datatorrent.com/blog/blog-apex-performance-benchmark/

slide-20
SLIDE 20
Recent Additions & Roadmap

  • Apex runner in Apache Beam
  • Iterative processing, integrated with Apache SAMOA, opens up ML
  • Integrated with Apache Calcite, enables SQL
  • Scalable, incremental state management
  • User defined control tuples (watermarks, batch control, …)
  • Apache Kudu connectors
  • Support for Python
  • Support for Docker, Mesos and Kubernetes
  • Enhanced support for Batch Processing
  • Encrypted Streams

slide-21
SLIDE 21

Adoption Challenges for Big Stream Processing

  • Functionality
  • Performance
  • Usability
  • Testing
  • Operations

slide-22
SLIDE 22

Resources


  • http://apex.apache.org/
  • Powered by Apex - http://apex.apache.org/powered-by-apex.html
  • Learn more - http://apex.apache.org/docs.html
  • Getting involved - http://apex.apache.org/community.html
  • Download - http://apex.apache.org/downloads.html
  • Follow @ApacheApex - https://twitter.com/apacheapex
  • Meetups - https://www.meetup.com/topics/apache-apex/
  • Examples - https://github.com/apache/apex-malhar/tree/master/examples
  • Slideshare - http://www.slideshare.net/ApacheApex/presentations