StreamBox: Modern Stream Processing on a Multicore Machine


StreamBox: Modern Stream Processing on a Multicore Machine Hongyu Miao and Heejin Park, Purdue ECE; Myeongjae Jeon and Gennady Pekhimenko, Microsoft Research; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE http://xsel.rocks/p/streambox


SLIDE 1

StreamBox: Modern Stream Processing on a Multicore Machine

Hongyu Miao and Heejin Park, Purdue ECE; Myeongjae Jeon and Gennady Pekhimenko, Microsoft Research; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE

http://xsel.rocks/p/streambox

SLIDE 2

High velocity of streaming data requires real-time processing

Data sources: IoT, data centers, humans

SLIDE 3

Streaming Pipeline

[Figure: an infinite data stream feeds the pipeline input]

SLIDE 4

Streaming Pipeline

[Figure: the infinite data stream arrives at the pipeline input as a sequence of epochs]

SLIDE 5

Streaming Pipeline

[Figure: epochs flow from the input into Transform 0]

SLIDE 6

Streaming Pipeline

[Figure: epochs flow from the input through Transform 0, Transform 1, and Transform 2]

SLIDE 7

Streaming Pipeline

[Figure: epochs flow from the input through Transform 0, Transform 1, and Transform 2 to the output]

SLIDE 8

Why is it hard?

Records arrive out-of-order

[Figure: infinite data stream of epochs flowing through Input, Transform 0, Transform 1, Transform 2, and Output]

SLIDE 9

Why is it hard?

Records arrive out-of-order

High performance on multicore

  • Data parallelism
  • Pipeline parallelism
  • Memory locality

[Figure: Intel Xeon E7-4830 v4 machine with four NUMA nodes; labels include cores up to Core 13 per node, caches of 960 KB and 3584 KB per node, and a 35 MB L3]

[Figure: infinite data stream of epochs flowing through Input, Transform 0, Transform 1, Transform 2, and Output]

SLIDE 10

Prior work

Out-of-order processing within epochs
Processes only one epoch in each transform at a time

[Figure: pipeline of Transform 0, Transform 1, and Transform 2 holding epochs; time labels 5:00, 10:00, 15:00, 20:00]

SLIDE 11

Prior work

Out-of-order processing within epochs
Processes only one epoch in each transform at a time

[Figure: pipeline of Transform 0, Transform 1, and Transform 2 holding epochs; time labels 0:00, 5:00, 10:00, 15:00, 20:00]

SLIDE 12

Prior work

Out-of-order processing within epochs
Processes only one epoch in each transform at a time

[Figure: pipeline of Transform 0, Transform 1, and Transform 2 holding epochs; time labels 0:00, 5:00, 10:00, 15:00, 20:00]

SLIDE 13

StreamBox insight

Out-of-order processing across epochs
Process all epochs in all transforms in parallel

[Figure: pipeline of Transform 0, Transform 1, and Transform 2 holding epochs; time labels 10:00, 15:00]

SLIDE 14

Prior work vs. StreamBox

Prior work: processes only one epoch in each transform at a time
StreamBox: processes all epochs in all transforms in parallel

StreamBox: a highly pipeline-parallel and data-parallel processing system

[Figure: two pipeline diagrams of Transform 0, Transform 1, and Transform 2 holding epochs, one with time labels 10:00 and 15:00, the other with time labels 0:00, 5:00, 10:00, 15:00, 20:00]

SLIDE 15

Result: StreamBox vs. existing systems on multicore

High throughput & utilization of multicore hardware

[Figure: throughput (KRec/s) versus number of cores (4, 12, 32, 56) for StreamBox, Spark Streaming, and Beam; data labels 7K, 10K, 10K, 8K]

SLIDE 16

Roadmap

Background

Stream pipeline, streaming data, window, watermark, and epoch

StreamBox Design

  • Invariants to guarantee correctness
  • Out-of-order epoch processing

Evaluation

SLIDE 17

Streaming pipeline for data analytics

Transform: a computation that consumes and produces streams
Pipeline: a dataflow graph of transforms

A Simple WordCount Pipeline

[Figure: Ingress, Transform 1 (group by word), Transform 2 (count word occurrences), Egress]
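
The WordCount pipeline above can be sketched as a chain of functions, each consuming one stream and producing another. This is only an illustrative Python sketch, not StreamBox code: the function names (`ingress`, `group_by_word`, `count_occurrences`, `word_count`) are hypothetical, and windowing is left out for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def ingress(lines: Iterable[str]) -> Iterator[str]:
    """Ingress: turn raw text lines into a stream of word records."""
    for line in lines:
        yield from line.lower().split()


def group_by_word(words: Iterable[str]) -> Dict[str, List[int]]:
    """Transform 1: group records by word (one marker per occurrence)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for word in words:
        groups[word].append(1)
    return groups


def count_occurrences(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Transform 2: count the occurrences in each group."""
    return {word: sum(ones) for word, ones in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Egress: the final per-word counts for the whole input."""
    return count_occurrences(group_by_word(ingress(lines)))
```

Calling `word_count(["to be or", "not to be"])` yields a dictionary that maps each word to its count.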

SLIDE 18

Stream records = data + event time

Records arrive out of order

  • Records travel diverse network paths
  • Computations execute at different rates

[Figure: an infinite data stream of records with event times 0:05, 0:02, 0:03 entering the processing system]

SLIDE 19

Window

A temporal processing scope of records
Chopping up infinite data into finite pieces along temporal boundaries
Transforms do computation based on windows

[Figure: infinite input stream ordered by event time with records 1:11, 1:08, 1:09, 1:13; window 1:00 – 1:05 with records 1:02 and 1:03]

SLIDE 20

Window

A temporal processing scope of records

[Figure: windows by event time; infinite input stream with records 1:11, 1:08, 1:09, 1:13, 1:03, 1:02; window 1:00 – 1:05; record 1:02]

SLIDE 21

Window

A temporal processing scope of records

[Figure: windows by event time; infinite input stream with records 1:11, 1:08, 1:09, 1:13, 1:03, 1:02; window 1:00 – 1:05]

SLIDE 22

Window

A temporal processing scope of records

[Figure: windows by event time; infinite input stream with records 1:11, 1:08, 1:09, 1:13, 1:03, 1:02; window 1:00 – 1:05]

SLIDE 23

Window

A temporal processing scope of records

[Figure: windows by event time; records 1:11, 1:09, 1:13, 1:03, 1:02, 1:08; windows 1:00 – 1:05 and 1:05 – 1:10]

SLIDE 24

Window

A temporal processing scope of records

[Figure: windows by event time; records 1:09, 1:13, 1:03, 1:02, 1:08, 1:11; windows 1:00 – 1:05, 1:05 – 1:10, and 1:10 – 1:15]

SLIDE 25

Window

A temporal processing scope of records

[Figure: windows by event time; records 1:09, 1:13, 1:03, 1:02, 1:08, 1:11; windows 1:00 – 1:05, 1:05 – 1:10, and 1:10 – 1:15]

SLIDE 26

Window

A temporal processing scope of records

[Figure: windows by event time; records 1:09, 1:13, 1:03, 1:02, 1:08, 1:11; windows 1:00 – 1:05, 1:05 – 1:10, and 1:10 – 1:15]
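
Window assignment by event time, as animated in the slides above, can be sketched as follows. This is a minimal Python illustration under stated assumptions: labels such as `1:08` are read as hours:minutes, windows are fixed, five minutes long, and half-open (`[start, end)`), and the helper names are hypothetical.

```python
from collections import defaultdict

WINDOW_MINUTES = 5  # assumed from the 1:00 to 1:05 example window


def to_minutes(label):
    """Convert a label such as "1:08" to minutes (assumes hours:minutes)."""
    hours, minutes = label.split(":")
    return int(hours) * 60 + int(minutes)


def window_of(event_time):
    """Return the half-open window [start, end) containing event_time."""
    start = event_time - event_time % WINDOW_MINUTES
    return (start, start + WINDOW_MINUTES)


def assign_windows(labels):
    """Group records by window, regardless of their arrival order."""
    windows = defaultdict(list)
    for label in labels:
        windows[window_of(to_minutes(label))].append(label)
    return dict(windows)
```

Because the window depends only on the event time, a record that arrives late still lands in the same window as its neighbours in event time.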

SLIDE 27

Out-of-order records

[Figure: windows by event time; infinite input stream with records 1:11, 1:09, 1:13; window 1:05 – 1:10 with record 1:08]

SLIDE 28

When is a window complete?

[Figure: windows by event time; infinite input stream with records 1:13, 1:08, 1:11, 1:09; windows 1:05 – 1:10 and 1:10 – 1:15]

SLIDE 29

Watermark

Input completeness indicated by data source
Watermark X: all input data with event times less than X have arrived

[Figure: infinite input stream with records 1:11, 1:08, 1:09, 1:13, 1:03, 1:02 and watermarks 1:05 and 1:10]

SLIDE 30

Handling out-of-order with watermarks

[Figure: infinite input stream with records 1:11, 1:09, 1:13 and watermarks 1:05 and 1:10; window 1:05 – 1:10 with record 1:08]

SLIDE 31

Handling out-of-order with watermarks

[Figure: infinite input stream with records 1:11, 1:09, 1:13 and watermarks 1:05 and 1:10; window 1:05 – 1:10 with record 1:08]

SLIDE 32

Handling out-of-order with watermarks

[Figure: infinite input stream with records 1:09, 1:13 and watermarks 1:05 and 1:10; window 1:05 – 1:10 with record 1:08; window 1:10 – 1:15 with record 1:11]

SLIDE 33

Handling out-of-order with watermarks

[Figure: infinite input stream with records 1:09, 1:13 and watermarks 1:05 and 1:10; window 1:05 – 1:10 with record 1:08; window 1:10 – 1:15 with record 1:11]

SLIDE 34

Handling out-of-order with watermarks

[Figure: infinite input stream with records 1:09, 1:13; window 1:05 – 1:10 with record 1:08; window 1:10 – 1:15 with record 1:11; watermarks 1:05 and 1:10]
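
The watermark rule gives a simple completion test: a window `[start, end)` is complete once a watermark X with X >= end has arrived, because all records with event times below X are then known to be present. The sketch below is an illustrative Python simulation, not StreamBox code; the stream encoding (`("rec", t)` and `("wm", t)` tuples, times in minutes), the five-minute window length, and the function name `run` are assumptions.

```python
from collections import defaultdict

WINDOW = 5  # assumed fixed window length, in minutes


def run(stream):
    """Consume ("rec", t) and ("wm", t) items in arrival order.

    Returns the completed windows as (window, sorted event times) pairs,
    in the order in which watermarks closed them.
    """
    open_windows = defaultdict(list)
    completed = []
    for kind, t in stream:
        if kind == "rec":
            start = t - t % WINDOW
            open_windows[(start, start + WINDOW)].append(t)
        else:
            # Watermark t: every record with event time < t has arrived,
            # so any window ending at or before t can no longer grow.
            for win in sorted(w for w in open_windows if w[1] <= t):
                completed.append((win, sorted(open_windows.pop(win))))
    return completed
```

For example, with records at 62, 63, 68, 71, 69, 73 and watermarks 65 and 70 interleaved, the window (70, 75) stays open after watermark 70 even though record 71 has already arrived.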

SLIDE 35

Epoch

A set of records arriving between two watermarks

[Figure: infinite input stream with records 1:11, 1:08, 1:09, 1:13, 1:03, 1:02 and watermarks 1:05 and 1:10; the records between the two watermarks form an epoch]

A window may span multiple epochs
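
Splitting a stream into epochs can be sketched in a few lines. This is an illustrative Python sketch under the same assumed encoding as before (`("rec", t)` and `("wm", t)` tuples in arrival order); the function name `split_into_epochs` and the dictionary layout are hypothetical.

```python
def split_into_epochs(stream):
    """Cut the arrival-ordered stream at each watermark.

    Returns (closed_epochs, open_records): each closed epoch records the
    watermark that ended it and the records that arrived before it;
    open_records belong to the epoch whose watermark has not arrived yet.
    """
    epochs = []
    current = []
    for kind, t in stream:
        if kind == "rec":
            current.append(t)
        else:
            epochs.append({"end_watermark": t, "records": current})
            current = []
    return epochs, current
```

Epochs follow arrival order while windows follow event time, which is why a window may span multiple epochs: a record with event time 71 that arrives before watermark 70 and a record with event time 73 that arrives after it fall into different epochs but the same five-minute window.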

SLIDE 36

Roadmap

Background

StreamBox Design

  • Invariants to guarantee correctness
  • Out-of-order epoch processing

Evaluation

SLIDE 37

Stream processing engines

Most stream engines optimize for a distributed system

  • Neglect efficient multicore implementation
  • Assume a single machine is incapable of handling stream data

SLIDE 38

Goal: A stream engine for multicore

Multicore hardware with

  • High throughput I/O
  • Terabyte DRAMs
  • A large number of cores

A stream engine for multicore

  • Correctness: respect dependences with minimal synchronization
  • Dynamic parallelism: processes any records in any epochs
  • Target: throughput & latency

[Figure: multicore machine with four NUMA nodes; labels include cores up to Core 13 per node, caches of 960 KB and 3584 KB per node, and a 35 MB L3]

[Figure: pipeline of Transform 0, Transform 1, and Transform 2 holding epochs; time labels 10:00, 15:00]

SLIDE 39

Challenges

Correctness

  • Guarantee watermark semantics by meeting two invariants

Throughput

  • Never stall the pipeline

Latency

  • Do not relax the watermark
  • Dynamically adjust parallelism to relieve bottlenecks

SLIDE 40

Invariant 1: Watermark ordering

Transforms consume watermarks in order
Transforms consume all records in an epoch before consuming the watermark

[Figure: a transform consuming Epoch 1 and Epoch 2; record event times 0:20, 0:22, 0:12, 0:18, 0:05, 0:11, 0:10]
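
One way to read Invariant 1 is as a property of the order in which a transform consumes items. The sketch below is a hypothetical Python checker, not part of StreamBox: it assumes epochs are numbered in increasing order and that a trace lists `("rec", epoch)` and `("wm", epoch)` items in consumption order.

```python
def satisfies_watermark_ordering(trace):
    """Check Invariant 1 on a consumption trace.

    (a) Watermarks must be consumed in increasing epoch order.
    (b) No record may be consumed after the watermark of its own epoch
        (or of any later epoch) has been consumed.
    Records of different epochs may otherwise interleave freely.
    """
    last_watermark = None  # epoch of the most recently consumed watermark
    for kind, epoch in trace:
        if last_watermark is not None and epoch <= last_watermark:
            return False  # late record, or out-of-order watermark
        if kind == "wm":
            last_watermark = epoch
    return True
```

Note that a trace such as record of epoch 2, records of epoch 1, watermark 1, record of epoch 2, watermark 2 passes the check: the invariant constrains watermarks, not the relative order of records from different epochs.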

SLIDE 41

Invariant 2: Respect epoch boundaries

Once a transform assigns a record an epoch, the record never changes epochs

[Figure: a transform with Epoch 1 and Epoch 2; record event times 0:20, 0:22, 0:12, 0:18, 0:05, 0:11, 0:10]

SLIDE 42

Invariant 2: Respect epoch boundaries

Once a transform assigns a record an epoch, the record never changes epochs

[Figure: Epoch 1 and Epoch 2 before and after the transform; record event times 0:20, 0:22, 0:12, 0:18, 0:05, 0:11, 0:10 on input and 0:20, 0:12, 0:18, 0:05, 0:11, 0:10 on output]

SLIDE 43

Invariant 2: Respect epoch boundaries

What if a record changes to a later epoch?

It violates the watermark guarantee!

[Figure: Epoch 1 and Epoch 2 before and after the transform; record event times 0:20, 0:22, 0:12, 0:18, 0:05, 0:11, 0:10 on input and 0:20, 0:12, 0:18, 0:05, 0:11, 0:10 on output]

SLIDE 44

Invariant 2: Respect epoch boundaries

What if records change to an earlier epoch?

It relaxes the watermark and delays window completion!

[Figure: Epoch 1 and Epoch 2 before and after the transform; record event times 0:20, 0:22, 0:12, 0:18, 0:05, 0:11, 0:10 on input and 0:20, 0:12, 0:18, 0:05, 0:11, 0:10 on output]

SLIDE 45

Our solution: Cascading containers

Each cascading container

  • Corresponds to an epoch
  • Tracks an epoch state and the relationship between records and the watermark
  • Orchestrates worker threads to consume watermarks and records

[Figure: a container holding an epoch, with end watermark 20:00]

SLIDE 46

Each transform has multiple containers

A transform has multiple epochs
Each epoch corresponds to a container

[Figure: the containers of Transform 0, ordered between Oldest and Newest; watermarks 20:00 and 15:00]

SLIDE 47

Link each container to a downstream container defined by the transform

[Figure: containers of Transform 0 linked to containers of Transform 1; watermarks 20:00 and 15:00]

SLIDE 48

Records/watermarks flow through the pipeline by following the links

Meets invariant 2: records respect epoch boundary
Avoids relaxing watermark

[Figure: containers of Transform 0 linked to containers of Transform 1; watermarks 20:00 and 15:00 in both transforms]

SLIDE 49

A watermark will be processed after all records within the container have been processed

Guarantees invariant 1: watermark ordering

[Figure: containers of Transform 0 linked to containers of Transform 1; watermarks 15:00, 20:00, 20:00]

SLIDE 50

Watermarks will be processed in order

Guarantees invariant 1: watermark ordering

[Figure: containers of Transform 0 linked to containers of Transform 1; watermarks 20:00 and 15:00]

SLIDE 51

All records in all containers can be processed in parallel

Avoids stalling pipeline

[Figure: containers of Transform 0 linked to containers of Transform 1; watermarks 20:00 and 15:00]
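
The container rules from the last few slides can be put together in a small single-threaded sketch. This is a hypothetical Python illustration, not the StreamBox C++ implementation: the names `Container`, `process_one`, `retire_watermarks`, and `demo` are invented here, real worker threads and synchronization are omitted, and watermarks are plain integers.

```python
class Container:
    """Holds one epoch's pending records for one transform."""

    def __init__(self, downstream=None):
        self.records = []             # records not yet processed
        self.end_watermark = None     # set once the epoch's watermark arrives
        self.downstream = downstream  # same epoch's container, next transform

    def process_one(self, fn):
        """Process any one pending record and forward it along the link."""
        out = fn(self.records.pop())
        if self.downstream is not None:
            self.downstream.records.append(out)
        return out


def retire_watermarks(containers):
    """Consume watermarks oldest-first (containers[0] is the oldest).

    A watermark is consumed only when its container has no pending
    records, and never before the watermark of an older container.
    """
    consumed = []
    while (containers
           and containers[0].end_watermark is not None
           and not containers[0].records):
        oldest = containers.pop(0)
        if oldest.downstream is not None:
            oldest.downstream.end_watermark = oldest.end_watermark
        consumed.append(oldest.end_watermark)
    return consumed


def demo():
    double = lambda x: 2 * x
    d_old, d_new = Container(), Container()         # Transform 1 containers
    old, new = Container(d_old), Container(d_new)   # Transform 0 containers
    old.records, old.end_watermark = [1, 2], 15
    new.records, new.end_watermark = [3], 20
    transform0 = [old, new]
    new.process_one(double)                # a newer epoch may run first
    first = retire_watermarks(transform0)  # oldest epoch still has records
    old.process_one(double)
    old.process_one(double)
    second = retire_watermarks(transform0)
    return first, second, sorted(d_old.records), d_new.records
```

In `demo`, the record of the newer epoch is processed before the older epoch is drained, yet both watermarks are consumed only afterwards and in order, and every output record stays in the downstream container of its own epoch.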

SLIDE 52

Big picture

A pipeline: multiple transforms

  • Containers form a network
  • Records/watermarks flow through the links

Highly parallel pipeline

  • Guarantees watermark semantics
  • Avoids stalling pipeline (for throughput)
  • Avoids relaxing watermark (for latency)

[Figure: a network of containers across Transform 0, Transform 1, Transform 2, and Transform 3, laid out from upstream to downstream and from oldest to newest; watermarks 25:00, 20:00, 15:00, 10:00, 09:00, 04:00]

SLIDE 53

Other key optimizations

Organizing records into bundles

  • Minimize synchronization

Multi-input transforms

  • Defer container ordering in downstream

Pipeline scheduling

  • Prioritize externalization to minimize latency

Pipeline state management

  • Target NUMA-awareness and coarse-grained allocation

SLIDE 54

StreamBox implementation

Built from scratch in 22K SLoC of C++11

  • Supported transforms: Windowing, GroupBy, Aggregation, Mapper, Reducer, Temporal Join, Grep…
  • Source code @ http://xsel.rocks/p/streambox

C++ libraries

  • Intel TBB, Facebook folly, jemalloc, boost…

Concurrent hash tables

  • Wrapped TBB’s concurrent hash map

SLIDE 55

StreamBox implementation

Benchmarks:

  • Windowed grep
  • Word count
  • Counting distinct URLs
  • Network latency monitoring
  • Tweets sentiment analysis

Machine configurations:

  • CM56: 256GB DRAM, 4 x 14 cores
  • CM12: 256GB DRAM, 2 x 6 cores

SLIDE 56

Roadmap

Background

StreamBox Design

Evaluation

SLIDE 57

Evaluation

Throughput and scalability
Comparison with existing stream engines
Handling out-of-order input streaming data
Epoch parallelism effectiveness

SLIDE 58

Good throughput and scalability

[Figure: Tweets Sentiment Analysis; throughput (KRec/s) versus number of cores (4, 12, 32, 56); series CM56 (1sec)]

SLIDE 59

Good throughput and scalability

[Figure: Tweets Sentiment Analysis; throughput (KRec/s) versus number of cores (4, 12, 32, 56); series CM56 (1sec), CM56 (500ms)]

SLIDE 60

Good throughput and scalability

[Figure: Tweets Sentiment Analysis; throughput (KRec/s) versus number of cores (4, 12, 32, 56); series CM56 (1sec), CM56 (500ms), CM12 (1sec), CM12 (500ms)]

SLIDE 61

Good throughput and scalability

[Figure: six panels of throughput (KRec/s) versus number of cores (4, 12, 32, 56) on CM56 and CM12: Word Count (1sec, 50ms), Temporal Join (1sec, 50ms), Network Latency Monitoring (1sec, 500ms), Tweets Sentiment Analysis (1sec, 500ms), Counting Distinct URLs (1sec, 50ms), Windowed Grep (1sec, 50ms)]

SLIDE 62

StreamBox vs. existing stream engines

StreamBox achieves significantly better throughput and scalability

Spark: v2.1.0
Beam: v0.5.0

[Figure: throughput (KRec/s) versus number of cores (4, 12, 32, 56) for StreamBox, Spark Streaming, and Beam; data labels 7K, 10K, 10K, 8K]

SLIDE 63

Handling out-of-order records

StreamBox achieves good throughput even with lots of out-of-order records

[Figure: three panels (Netmon, Tweets, WordCount) of throughput (KRec/s) versus number of cores (4, 12, 32, 56); series 0%, 20%, 40%; annotation "Drop 7%"]

SLIDE 64

Epoch parallelism is effective

[Figure: two panels (WordCount, Grep) of throughput (KRec/s) versus number of cores (32, 56); series StreamBox and In-order; annotations "Drop 87%" and "NO parallel"]

[Figure: pipeline diagrams for prior work and StreamBox, each with Transform 0, Transform 1, and Transform 2 holding epochs]

SLIDE 65

Summary: StreamBox on multicores

Processes any records in any epochs in parallel by using all CPU cores
Achieves high throughput with low latency

  • Millions of records per second throughput, on a par with distributed engines on a cluster with a few hundred CPU cores
  • Tens of milliseconds latency, 20x shorter than other large-scale engines

[Figure: multicore machine with four NUMA nodes; labels include cores up to Core 13 per node, caches of 960 KB and 3584 KB per node, and a 35 MB L3]

[Figure: pipeline of Transform 0, Transform 1, and Transform 2 holding epochs; time labels 10:00, 15:00]

http://xsel.rocks/p/streambox