Apache Gearpump next-gen streaming engine Karol Brejna, Intel




SLIDE 1

Apache Gearpump

next-gen streaming engine

Karol Brejna, Intel (karolbrejna@apache.org)
Huafeng Wang, Intel (huafengw@apache.org)
Apache: Big Data Europe 2016
Sevilla, Spain
14 November 2016

SLIDE 2

Agenda

  • What is Gearpump?
  • Why Apache Gearpump?
  • Apache Gearpump features/internals
  • What’s next for Apache Gearpump

SLIDE 3

What is Gearpump?

  • A super simple pump that consists of only two gears but is very powerful at streaming water
  • An Akka[2] based real-time streaming engine
  • An Apache Incubator[1] project since March 8th, 2016

SLIDE 4

Why Gearpump?

SLIDE 5

Stream processing is hard

  • Fault tolerance
  • Infinite out-of-order data
  • Low latency assurance (e.g. real-time recommendation)
  • Correctness requirement (e.g. charge advertisers for ads)
  • Cheap to update applications (e.g. tune machine learning parameters)

SLIDE 6

Gearpump makes stream processing easier

  • fault tolerant stream processing at latency of milliseconds
  • handling out-of-order data
  • event-time based window aggregation
  • Akka-stream DSL and Apache Beam API support
  • runtime DAG modification
  • responsive UI with abundant metrics information
SLIDE 7

Gearpump on TAP

  • Gearpump on Trusted Analytics Platform (TAP)
  • Stream processing - performance experiments and results

SLIDE 8

Gearpump on TAP

  • Gearpump on Trusted Analytics Platform (TAP)
  • Stream processing - performance experiments and results

SLIDE 9

Trusted Analytics Platform (TAP)

  • Open Source project
  • Collaborative, cloud-ready platform to build applications powered by Big Data Analytics
  • Includes everything needed by data scientists, application developers and system operators
  • Optimized for performance and security

SLIDE 10

Analytics Solutions – Big Data Scale Out

  • Applications: Analytics-powered vertical and horizontal solutions
  • Analytics: Open source platform for collaborative data science and analytics application development
  • Data: Open source, Hadoop-centric platform for distributed and scalable storage and processing
  • Infrastructure: Software-defined storage, network and cloud infrastructure optimized for Intel Architecture
  • Machine Learning: Multi-layered, fully-optimized algorithms
  • Performance and Security: Silicon and software enhancements to protect and accelerate data and analytics

SLIDE 11

The Anatomy of Trusted Analytics Platform (TAP)

[Diagram: TAP layers, with TAP Core in the middle]

  • Applications: TAP-powered Big Data Analytics applications and solutions
  • Services: Polyglot services and APIs for application developers
  • Analytics: Data Scientists workbench including models, algorithms, pipelines, engines and frameworks (ATK, Spark*, Impala, H2O, Hue*, iPython)
  • Marketplace: Extensible Marketplace of built-in tools, packages and recipes
  • Ingestion: Message brokers and queues for batch and stream data ingestion (Kafka*, GearPump, RabbitMQ, MQTT, WS, REST)
  • Data Platform: Distributed processing and scalable data storage (Cloudera CDH (Hadoop/HDFS, HBase)*, PostgreSQL, MySQL, Redis, MongoDB, InfluxDB, Cassandra)
  • Infrastructure: Public or private clouds (AWS, Rackspace, OVH, OpenStack, On/Off-prem)
  • Management: User, tenant, security, provisioning and monitoring for system operators
  • Other labels in the diagram: REST; Java, Go

* Leverages Cloudera Distribution of Apache Hadoop

SLIDE 14

Gearpump on TAP

  • Gearpump on Trusted Analytics Platform (TAP)
  • Stream processing - performance experiments and results

SLIDE 15

The problem

  • correlate messages using a key in a one-second sliding window and produce a stream of latency messages
  • consume latency messages and compute average latency per firm in one-minute buckets
  • send the aggregate message to HBase
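
The second step (average latency per firm in one-minute buckets) can be sketched with plain Scala collections. This is only an illustration of the aggregation, not the job that was benchmarked: the `LatencyMsg` type, its field names and the millisecond timestamps are assumptions, and event times are assumed to be non-negative.

```scala
// Hypothetical record type; the field names are illustrative, not from the talk.
final case class LatencyMsg(firm: String, eventTimeMs: Long, latencyMs: Double)

object LatencyBuckets {
  val BucketMs: Long = 60000L // one-minute buckets

  /** Average latency keyed by (firm, start of the one-minute bucket in ms). */
  def averagePerFirm(msgs: Seq[LatencyMsg]): Map[(String, Long), Double] =
    msgs
      .groupBy(m => (m.firm, (m.eventTimeMs / BucketMs) * BucketMs))
      .map { case (key, group) => key -> group.map(_.latencyMs).sum / group.size }
}
```

In a streaming engine the same grouping would run incrementally per window rather than over a complete `Seq`.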
SLIDE 16

The expectations

  • Handle a sustained load of 0.5M msg per second
  • Handle a load of 7M msg per second for peaks of 1 hour
  • Message size: 250-500 bytes
  • Be able to scale beyond that
SLIDE 17

The hardware

  • CPU: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
  • Memory: 256 GB DDR4
  • Storage: 8 SATA SSDs
SLIDE 18

The results (1) - let’s start small: ~700k msg/sec

Initial attempt

SLIDE 19

The results (2) - 8 executors: ~1.6M msg/sec

Initial attempt

Findings:

  • We need to improve the Kafka source: queue size, fetch frequency
  • Improve the Kafka partition design for concurrency
  • Network throughput may be a bottleneck (1.6M msg/sec * 0.5 kB * 8 bits): consider compression
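
Worked out, the network estimate in the last finding looks like this. It assumes that "0.5 k" means 500 bytes per message (the upper end of the stated message size) and ignores protocol overhead.

```latex
1.6 \times 10^{6}\ \tfrac{\text{msg}}{\text{s}}
\times 500\ \tfrac{\text{B}}{\text{msg}}
\times 8\ \tfrac{\text{bit}}{\text{B}}
= 6.4 \times 10^{9}\ \tfrac{\text{bit}}{\text{s}}
= 6.4\ \text{Gbit/s}
```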

SLIDE 20

The results (3) - 16 executors: ~2.7M msg/sec

Initial attempt

Findings:

  • JVM defaults are designed for moderate workloads: we need to pump them up
  • Message marshalling starts to play a significant role in performance: look for better alternatives

SLIDE 21

The results (4) - 32 executors: ~5M msg/sec

Findings:

  • Backpressure introduced by JVM ⇔ JVM communication: use task fusing

SLIDE 22

The results (5) - 48 executors: 7.4M msg/sec

Mission accomplished!!!

SLIDE 23

The results (6) - 64 executors

We can go even further...

SLIDE 24

The results - summary

  • Great performance numbers on decent hardware
  • Predictable scalability

Number of executors    Req/sec
8                      1.6 M
16                     2.7 M
32                     5 M
48                     7.4 M
64                     10 M
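
One way to check the "predictable scalability" claim is to divide each throughput figure by the executor count. The sketch below does only that arithmetic on the numbers from the table; the object and method names are made up for illustration.

```scala
object Scaling {
  // (executors, millions of messages per second), copied from the summary table.
  val results: Seq[(Int, Double)] =
    Seq(8 -> 1.6, 16 -> 2.7, 32 -> 5.0, 48 -> 7.4, 64 -> 10.0)

  /** Thousands of messages per second handled by each executor. */
  def perExecutorK(executors: Int, millionsPerSec: Double): Double =
    millionsPerSec * 1000.0 / executors

  def main(args: Array[String]): Unit =
    results.foreach { case (n, m) =>
      println(f"$n%2d executors: ${perExecutorK(n, m)}%.1f k msg/sec per executor")
    }
}
```

Per-executor throughput comes out at about 200k msg/sec with 8 executors and about 169k with 16, then stays between roughly 154k and 156k from 32 to 64 executors, which is close to linear scaling over that range.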

SLIDE 25

Gearpump features

SLIDE 26

Gearpump Architecture

  • Actor concurrency
  • Message passing communication
  • Error handling and isolation with supervision hierarchy
  • Master HA with Akka Cluster

SLIDE 27

Use case - Windowed word count

[Diagram: KafkaSource → (1. Words) → WindowCounter → (2. Window Counts) → KafkaSink]
SLIDE 28

How Gearpump solves the hard parts

  • User interface
  • Flow control
  • Out-of-order processing
  • Exactly once
  • Dynamic DAG

SLIDE 29

User interface - DSL

val app = StreamApp("dsl", context)
app.source[String](kafkaSource)
  .flatMap(line => line.split("[\\s]+"))
  .map((_, 1))
  .window(FixedWindow.apply(Duration.ofMillis(5L)).triggering(EventTimeTrigger))
  // (word, count1), (word, count2) => (word, count1 + count2)
  .groupBy(_._1)
  .sum
  .sink(kafkaSink)

[Diagram: resulting DAG; visible node labels: Window.groupByKey, sink]

SLIDE 30

How Gearpump solves the hard parts

  • User interface
  • Flow control
  • Out-of-order processing
  • Exactly once
  • Dynamic DAG

SLIDE 31

Without Flow Control - OOM

[Diagram: KafkaSource (fast) → WindowCounter (fast) → KafkaSink (very slow)]

SLIDE 32

With Flow Control - Backpressure

[Diagram: KafkaSource (slow down, pull slower) ← backpressure ← WindowCounter (slow down) ← backpressure ← KafkaSink (very slow)]
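
The difference between the two pictures (no flow control versus backpressure) can be illustrated with a small deterministic simulation. It is a toy model, not Gearpump's flow-control protocol: the tick loop, the rates and the buffer capacity are all assumptions chosen for the example.

```scala
object FlowControlSim {
  /**
   * Simulates `ticks` steps of a source feeding a slower sink through a buffer.
   * `capacity = None` means no flow control (unbounded buffer).
   * Returns (largest buffer size seen, total messages the source emitted).
   */
  def run(ticks: Int, sourceRate: Int, sinkRate: Int, capacity: Option[Int]): (Int, Int) = {
    var buffered = 0
    var maxBuffered = 0
    var emitted = 0
    for (_ <- 1 to ticks) {
      // With flow control the source may only emit into free buffer space.
      val allowed = capacity.fold(sourceRate)(c => math.min(sourceRate, c - buffered))
      buffered += allowed
      emitted += allowed
      maxBuffered = math.max(maxBuffered, buffered)
      buffered -= math.min(buffered, sinkRate) // the sink drains at its own pace
    }
    (maxBuffered, emitted)
  }
}
```

With a source rate of 10, a sink rate of 1 and 100 ticks, `run(100, 10, 1, None)` returns `(901, 1000)`: the buffer keeps growing, which is the road to an OOM. With `Some(20)` as capacity it returns `(20, 119)`: the buffer stays bounded and the source is throttled to the sink's pace.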

SLIDE 33

How Gearpump solves the hard parts

  • User Interface
  • Flow control
  • Out-of-order processing
  • Exactly Once
  • Dynamic DAG

SLIDE 34

Out-of-order data

  • Event time: when the data was generated
  • Processing time: when the data is processed

[Figure: events plotted by event time (axis 1 to 3) against processing time (axis 1 to 6)]

SLIDE 35

On Watermark[4]

[Figure: event time (axis 1 to 3) against processing time (axis 1 to 6), with the watermark marked]

  • No timestamp earlier than the watermark will be seen

SLIDE 36

When can window counts be emitted?

[Figure: WindowCounter with an in-memory table of windows [0, 2), [2, 4), [4, 6); messages shown: (“gearpump”, 1), (“gearpump”, 3), (“gearpump”, 5), (“gearpump”, 4), (“gearpump”, 2)]

  • No window can be emitted since a message as early as time 1 has not arrived

SLIDE 37

Out-of-order processing with watermark

[Figure: WindowCounter receiving (“gearpump”, 1); in-memory table of windows [0, 2), [2, 4), [4, 6); messages shown: (“gearpump”, 1), (“gearpump”, 3), (“gearpump”, 5), (“gearpump”, 4), (“gearpump”, 2)]

watermark = 0: no window can be emitted

SLIDE 38

Out-of-order processing with watermark

[Figure: WindowCounter in-memory table now holding windows [2, 4), [4, 6); messages shown: (“gearpump”, 3), (“gearpump”, 5), (“gearpump”, 4), (“gearpump”, 2); emitted result: (“gearpump”, [0, 2), 1)]

watermark = 2: window [0, 2) can be emitted
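
The rule on these slides (a window may be emitted once the watermark reaches its end) can be sketched in plain Scala. This is an illustration, not Gearpump's WindowCounter: the table layout and the function names are assumptions, and event times are assumed to be non-negative.

```scala
object WindowCounter {
  val WindowSize: Long = 2L // matches the [0, 2), [2, 4), [4, 6) windows on the slides

  /** In-memory table: window start -> (word -> count). */
  type Table = Map[Long, Map[String, Int]]

  /** Start of the fixed window that contains `eventTime`. */
  def windowStart(eventTime: Long): Long = (eventTime / WindowSize) * WindowSize

  /** Counts one occurrence of `word` in the window that `eventTime` falls into. */
  def add(table: Table, word: String, eventTime: Long): Table = {
    val start = windowStart(eventTime)
    val counts = table.getOrElse(start, Map.empty[String, Int])
    table.updated(start, counts.updated(word, counts.getOrElse(word, 0) + 1))
  }

  /** Splits the table into (windows complete at `watermark`, windows still open). */
  def emit(table: Table, watermark: Long): (Table, Table) =
    table.partition { case (start, _) => start + WindowSize <= watermark }
}
```

Feeding in the five example messages (times 1, 3, 5, 4, 2) and calling `emit` with watermark 0 returns no complete windows; with watermark 2 it returns window [0, 2) with a count of 1 for "gearpump" and keeps [2, 4) and [4, 6) open.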

SLIDE 39

How to get watermark?

[Diagram: KafkaSource → (1. Words) → WindowCounter → (2. Window Counts) → Sink]
SLIDE 40

From upstream

[Figure: KafkaSource → WindowCounter → Sink, annotated with the values 50, W(50), 40, 30. Legend: watermark; watermark of the operator]

SLIDE 41

From upstream

[Figure: KafkaSource → WindowCounter → Sink, annotated with the values 60, W(50), 50, 40. Legend: watermark; watermark of the operator]

SLIDE 42

More on Watermark

  • The source watermark is defined by the user
  • It is usually heuristic based
  • Users decide whether to drop data arriving after the watermark
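
As an example of what a heuristic source watermark might look like, the sketch below trails the largest event time seen so far by a fixed delay. This particular heuristic is an assumption made for illustration; the slides do not say which heuristics Gearpump users typically choose, and the class name and `maxDelayMs` parameter are hypothetical.

```scala
/** Hypothetical heuristic: watermark = largest event time seen minus a fixed allowed delay. */
final class BoundedDelayWatermark(maxDelayMs: Long) {
  private var maxEventTimeMs: Long = Long.MinValue

  /** Current watermark; Long.MinValue until the first event has been observed. */
  def current: Long =
    if (maxEventTimeMs == Long.MinValue) Long.MinValue else maxEventTimeMs - maxDelayMs

  /** Records an event and reports whether it arrived behind the watermark (late). */
  def observe(eventTimeMs: Long): Boolean = {
    val late = eventTimeMs < current
    maxEventTimeMs = math.max(maxEventTimeMs, eventTimeMs)
    late
  }
}
```

The Boolean returned by `observe` is where the user's decision comes in: a late event can be dropped or still processed, depending on the application.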

SLIDE 43

How Gearpump solves the hard parts

  • User Interface
  • Flow control
  • Out-of-order processing
  • Exactly once
  • Dynamic DAG

SLIDE 44

Exactly Once with asynchronous checkpointing

[Figure: KafkaSource (watermark = 2) → WindowCounter (watermark = 0) → KafkaSink (watermark = 0); checkpoint store: (2, kafka_offset)]

SLIDE 45

Exactly Once with asynchronous checkpointing

[Figure: KafkaSource (watermark = 2) → WindowCounter (watermark = 2) → KafkaSink (watermark = 0); checkpoint store: (2, kafka_offset), (2, window_counts)]

SLIDE 46

Exactly Once with asynchronous checkpointing

[Figure: KafkaSource (watermark = 2) → WindowCounter (watermark = 2) → KafkaSink (watermark = 2); checkpoint store: (2, kafka_offset), (2, window_counts)]

Checkpoint succeeded

SLIDE 47

Crash

[Figure: KafkaSource (watermark = 3) → WindowCounter (watermark = 2) → Sink (watermark = 2); checkpoint store: (2, kafka_offset), (2, window_counts)]

SLIDE 48

Recover to latest checkpoint at 2

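[Figure: KafkaSource → WindowCounter → KafkaSink. KafkaSource reads kafka_offset from the checkpoint store and replays from Kafka; WindowCounter reads window_counts (get state at 2); checkpoint store: (2, kafka_offset), (2, window_counts)]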
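
The checkpoint-and-recover sequence on the last few slides can be modelled as a store keyed by watermark timestamp. The sketch below is a simplified model under stated assumptions, not Gearpump's checkpoint store: the `Snapshot` type, the function names and the rule "a checkpoint is usable once both entries exist" are illustrative.

```scala
object CheckpointSketch {
  /** State written at one checkpoint timestamp; both fields are simplified stand-ins. */
  final case class Snapshot(kafkaOffset: Option[Long], windowCounts: Option[Map[String, Int]]) {
    def complete: Boolean = kafkaOffset.isDefined && windowCounts.isDefined
  }

  /** Checkpoint store keyed by watermark timestamp, e.g. 2 -> (kafka_offset, window_counts). */
  type Store = Map[Long, Snapshot]

  private val empty = Snapshot(None, None)

  /** The source task writes its Kafka offset for timestamp `ts`. */
  def saveOffset(store: Store, ts: Long, offset: Long): Store =
    store.updated(ts, store.getOrElse(ts, empty).copy(kafkaOffset = Some(offset)))

  /** The counting task writes its window counts for timestamp `ts`. */
  def saveCounts(store: Store, ts: Long, counts: Map[String, Int]): Store =
    store.updated(ts, store.getOrElse(ts, empty).copy(windowCounts = Some(counts)))

  /** Latest timestamp for which every task has written its state, with that state. */
  def recover(store: Store): Option[(Long, Snapshot)] = {
    val done = store.filter { case (_, snap) => snap.complete }
    if (done.isEmpty) None else Some(done.maxBy(_._1))
  }
}
```

After a crash, `recover` yields the checkpoint at 2 in the slides' scenario: the source would replay from the stored offset and the counter would restore its counts, so each message is reflected in the state once even though some are read twice.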

SLIDE 49

How Gearpump solves the hard parts

  • User Interface
  • Flow control
  • Out-of-order processing
  • Exactly Once
  • Dynamic DAG

SLIDE 50

Update the DAG on-the-fly

[Diagram, before: KafkaSource → (1. Words) → WindowCounter → (2. Window Counts) → KafkaSink]
[Diagram, after: KafkaSource → (1. Words) → WindowCounter → (2. Window Counts) → HDFSSink]

Without restart

SLIDE 51

Advanced features

SLIDE 52

DAG Visualization

  • Watermark
  • Node size reflects throughput
  • Edge width represents flow rate

SLIDE 53

DAG Visualization

  • Data skew analysis

SLIDE 54

Apache Beam[6] Gearpump Runner

[Figure: Apache Beam architecture. SDKs: Beam Java, Beam Python, other languages. Layers: Beam Model: Pipeline Construction; Beam Model: Fn Runners. Execution: Apache Flink, Apache Spark, Cloud Dataflow, Apache Gearpump]

SLIDE 55

What’s next for Gearpump

SLIDE 56

Experimental features

  • Web UI Authorization / OAuth2 Authentication
  • CGroup Resource Isolation
  • Binary Storm compatibility
  • Akka Streams integration (Gearpump Materializer)

SLIDE 57

Summary

  • Gearpump is good at streaming infinite out-of-order data and guarantees correctness
  • Gearpump helps users easily program streaming applications, get runtime information and update them dynamically

SLIDE 58

References

1. gearpump.apache.org
2. akka.io
3. http://www.slideshare.net/SeanZhong/strata-singapore-gearpumpreal-time-dagprocessing-with-akka-at-scale
4. https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p734-akidau.pdf
5. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
6. Apache Beam [project overview]
7. www.trustedanalytics.org - learn more about TAP

SLIDE 59

Get involved

Our home: http://gearpump.apache.org
Contribute code: https://github.com/apache/incubator-gearpump
Report issues: https://issues.apache.org/jira/browse/GEARPUMP
The team: Kam Kasravi, Manu Zhang, Huafeng Wang, Weihua Jiang, Sean Zhong, Karol Brejna, Stanley Xu, …, YOU?

SLIDE 60

Learn More About TAP

www.trustedanalytics.org

Engage in Community events

Meetups, workshops, & webinars http://trustedanalytics.org/#resources
