StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory


StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory. Hongyu Miao, Purdue ECE; Myeongjae Jeon, UNIST; Gennady Pekhimenko, UToronto; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE. http://xsel.rocks/p/streambox


slide-1
SLIDE 1

StreamBox-HBM

Stream Analytics on High Bandwidth Hybrid Memory

Hongyu Miao, Purdue ECE; Myeongjae Jeon, UNIST; Gennady Pekhimenko, UToronto; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE http://xsel.rocks/p/streambox

slide-2
SLIDE 2

Timely processing of streaming data


High Throughput & Low Latency!

On 100+ GB memory

slide-3
SLIDE 3

Hybrid Memory: 3D Memory + DRAM

DRAM (100+ GB, 80 GB/s)

  • Larger capacity, but lower bandwidth

3D Memory (16 GB, 375 GB/s)

  • Higher bandwidth, but smaller capacity
  • NO latency benefit (unlike a cache hierarchy: SRAM + DRAM)
  • Same as DRAM without high parallelism or sequential access
  • As cache of DRAM? → Poor performance…
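
To put the two bandwidth figures in perspective, here is a back-of-the-envelope sketch. It computes a lower bound on the time for one sequential pass over a buffer as size divided by peak bandwidth. The 16 GB buffer size and the helper name `min_scan_seconds` are illustrative assumptions; real throughput depends on parallelism and access pattern, as the bullets above note.

```python
# Lower bound on the time for one sequential pass over a buffer,
# using the peak bandwidths quoted on the slide.
DRAM_GBPS = 80.0    # DRAM peak bandwidth (from the slide)
HBM_GBPS = 375.0    # 3D memory peak bandwidth (from the slide)


def min_scan_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Size divided by peak bandwidth; ignores latency and contention."""
    return size_gb / bandwidth_gbps


size_gb = 16.0  # illustrative: a buffer that just fills the 3D memory
dram_s = min_scan_seconds(size_gb, DRAM_GBPS)
hbm_s = min_scan_seconds(size_gb, HBM_GBPS)
print(f"DRAM: {dram_s:.3f} s, 3D memory: {hbm_s:.3f} s, ratio: {dram_s / hbm_s:.2f}x")
```

The ratio is simply 375 / 80, so it only materializes when the workload can actually saturate the 3D memory.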
slide-4
SLIDE 4

Can hybrid mem speed up stream analytics?

Yes! StreamBox-HBM

  • The first stream engine optimized for 3D memory + DRAM on real hardware
  • Achieves the best reported throughput on a single node (win-avg: 110 MRec/s)
  • Speeds up stream analytics by 7x

[Figure: TopK Per Key. Throughput (MRec/s) vs. # cores. Series: 3D + DRAM in-mem-index; 3D as cache full-records. Annotation: 7x speedup.]

slide-5
SLIDE 5

Challenges

  • 1. Hash Grouping performs poorly on 3D memory
  • 2. 3D memory is capacity limited
  • 3. How to dynamically map streaming data to hybrid mem?


slide-6
SLIDE 6

Challenge 1: Hash Grouping performs poorly on 3D memory

  • Operators: computations that consume/produce streams
  • Pipeline: a graph of streaming operators
  • Data Grouping
    • A set of very common and expensive operators that reorganize records
    • Hash with random access in existing engines → Performs poorly on 3D memory…

[Figure: example pipeline with stages labeled Ingestion, Groupby key, Average per key, Window (10:00-10:05), Top Key. Example record: Time 10:01, ID: 0x1024, Value: 200. Grouping stages are highlighted.]
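
The pipeline in the figure can be sketched in a few lines. This is a minimal illustration, not StreamBox-HBM code: the `Record` type, the function name, and all records except the one shown on the slide are made up, and grouping uses a plain hash table, as in the existing engines described above.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    minute: int   # event time as minutes since midnight, e.g. 10:01 -> 601
    key: int      # e.g. an ID such as 0x1024
    value: int


def top_key_by_average(records, win_start, win_end):
    """Window -> group by key -> average per key -> key with the top average."""
    groups = defaultdict(list)
    for r in records:
        if win_start <= r.minute < win_end:     # windowing
            groups[r.key].append(r.value)       # hash-based grouping
    averages = {k: sum(v) / len(v) for k, v in groups.items()}
    return max(averages.items(), key=lambda kv: kv[1]) if averages else None


records = [
    Record(601, 0x1024, 200),   # the example record from the slide
    Record(602, 0x1024, 400),   # remaining records are made up
    Record(603, 0x2048, 150),
    Record(606, 0x2048, 900),   # falls outside the 10:00-10:05 window
]
print(top_key_by_average(records, 600, 605))
```

The grouping step (`groups[r.key].append(...)`) is the part that touches memory at data-dependent locations, which is why it dominates cost in this kind of pipeline.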

slide-7
SLIDE 7

Challenge 2: 3D memory is capacity limited

  • Streaming data
    • High data volume (100+ GB)
  • 3D Memory
    • Capacity limited (~16 GB)
  • 3D memory is NOT large enough to hold all streaming data…

[Diagram: Cores and 3D Memory (16 GB). Streaming data: "Cannot fit!"]

slide-8
SLIDE 8

Challenge 3: managing two types of memory

  • How to dynamically map data/operators to two types of memory?

What to map? Where to map?

  • Unbounded data
  • Various queries
  • Hybrid memory: benefit & limitation

[Figure: example pipeline with stages labeled Ingestion, Groupby key, Average per key, Window (10:00-10:05), Top Key.]

slide-9
SLIDE 9

StreamBox-HBM Solutions

  • 1. Hash grouping performs poorly on 3D memory
  • → Solution 1: Use highly parallel Sort for grouping
  • 2. 3D memory is capacity limited
  • → Solution 2: Only use 3D memory to store in-memory indexes
  • 3. How to manage two types of memory?
  • → Solution 3: Balance two limited resources with a single knob

slide-10
SLIDE 10

Solution 1: Parallel Sort for Grouping

Known duals of Grouping: Hash vs. Sort

  • DRAM: Hash is the best [VLDB’09, VLDB’13, SIGMOD’15]
  • Contribution: 3D memory reverses the debate. Sort outperforms Hash.

Sort is worse than Hash on algorithmic complexity

  • O(N log N) vs. O(N)

Yet, Sort outperforms Hash after we exploit all:

  • Abundant memory bandwidth
  • High task parallelism
  • Wide SIMD (avx512)

[VLDB’09] Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs.
[VLDB’13] Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited.
[SIGMOD’15] Rethinking SIMD Vectorization for In-Memory Databases.
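
The duality between the two grouping strategies can be sketched as follows. Both functions are illustrative (their names are not from StreamBox-HBM) and produce the same groups; they differ in access pattern. Python lists do not expose memory bandwidth or SIMD effects, so this shows only the algorithmic shape, not the performance result.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_by_hash(pairs):
    """Hash grouping: one pass, O(N) expected, but each insert lands in a
    data-dependent (effectively random) bucket."""
    table = defaultdict(list)
    for key, value in pairs:
        table[key].append(value)
    return dict(table)


def group_by_sort(pairs):
    """Sort grouping: O(N log N), but after sorting, equal keys are adjacent,
    so the grouping pass is a single sequential scan."""
    ordered = sorted(pairs, key=itemgetter(0))   # stable sort by key
    return {k: [v for _, v in run]
            for k, run in groupby(ordered, key=itemgetter(0))}


pairs = [(3, "a"), (1, "b"), (3, "c"), (2, "d"), (1, "e")]
assert group_by_hash(pairs) == group_by_sort(pairs)
print(group_by_sort(pairs))
```

The sort variant does more work in total, but its work is sequential and easy to split across tasks, which matches the three properties listed above.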

slide-11
SLIDE 11

Solution 1: Parallel Sort for Grouping

[Figure: two panels, each plotted against # cores. Throughput (million pairs / sec) and Mem bandwidth (GB / sec).]

Sort outperforms Hash on 3D memory

slide-12
SLIDE 12

Solution 1: Parallel Sort for Grouping

[Figure: two panels, each plotted against # cores. Throughput (million pairs / sec) and Mem bandwidth (GB / sec). Series: Hash DRAM.]

Sort outperforms Hash on 3D memory

slide-13
SLIDE 13

Solution 1: Parallel Sort for Grouping

[Figure: two panels, each plotted against # cores. Throughput (million pairs / sec) and Mem bandwidth (GB / sec). Series: Hash DRAM; Hash 3D mem.]

Sort outperforms Hash on 3D memory

slide-14
SLIDE 14

Solution 1: Parallel Sort for Grouping

[Figure: two panels, each plotted against # cores. Throughput (million pairs / sec) and Mem bandwidth (GB / sec). Series: Hash DRAM; Hash 3D mem; Sort DRAM.]

Sort outperforms Hash on 3D memory

slide-15
SLIDE 15

Solution 1: Parallel Sort for Grouping

[Figure: two panels, each plotted against # cores. Throughput (million pairs / sec) and Mem bandwidth (GB / sec). Series: Hash DRAM; Hash 3D mem; Sort DRAM; Sort 3D mem.]

Sort outperforms Hash on 3D memory

slide-16
SLIDE 16

Solution 2: Only use 3D memory for in-memory index

Streaming data

  • Full Records <key, key1, v1, v2, v3…>
  • Index <key, pointer>

[Diagram: Cores attached to 3D Memory (16 GB, 375 GB/s) and DRAM (96 GB, 80 GB/s).]

Minimize the use of precious 3D memory capacity while exploiting its high bandwidth

Smaller Faster More efficient K Swapping
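
The split between full records and a compact index can be sketched as below. This is an illustration under stated assumptions: a Python list stands in for DRAM, list positions stand in for pointers, and `build_index` and `sum_per_key` are hypothetical names, not StreamBox-HBM functions.

```python
# Full records stay in one (large) store; the index holds only
# <key, pointer> pairs, where "pointer" is a position in that store.
full_records = [                      # stands in for DRAM
    {"key": 42, "key1": 7, "v1": 1.5, "v2": 2.5, "v3": 3.5},
    {"key": 17, "key1": 9, "v1": 4.0, "v2": 5.0, "v3": 6.0},
    {"key": 42, "key1": 3, "v1": 7.5, "v2": 8.5, "v3": 9.5},
]


def build_index(records, key_field):
    """Extract compact <key, pointer> pairs (stands in for 3D memory)."""
    return [(rec[key_field], ptr) for ptr, rec in enumerate(records)]


def sum_per_key(index, records, value_field):
    """Group by sorting the small index; follow pointers only to read values."""
    totals = {}
    for key, ptr in sorted(index):
        totals[key] = totals.get(key, 0.0) + records[ptr][value_field]
    return totals


index = build_index(full_records, "key")
print(index)
print(sum_per_key(index, full_records, "v1"))
```

Only the two-field index is sorted and moved around; the wide records are read in place, which is what keeps the footprint in the fast memory small.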

slide-17
SLIDE 17

Solution 3: balance two limited resources

[Diagram: Cores attached to 3D Memory (16 GB) and DRAM (80 GB/s). Two limited resources: DRAM Bandwidth and 3D memory Capacity.]

slide-18
SLIDE 18

Solution 3: balance two limited resources

High pressure on 3D Memory capacity

[Diagram: Cores attached to 3D Memory (16 GB) and DRAM (80 GB/s). Two limited resources: DRAM Bandwidth and 3D memory Capacity.]

slide-19
SLIDE 19

Solution 3: balance two limited resources

High pressure on 3D Memory capacity → indexes on DRAM

[Diagram: Cores attached to 3D Memory (16 GB) and DRAM (80 GB/s). Two limited resources: DRAM Bandwidth and 3D memory Capacity.]

slide-20
SLIDE 20

Solution 3: balance two limited resources

Pressure rebalanced

[Diagram: Cores attached to 3D Memory (16 GB) and DRAM (80 GB/s). Two limited resources: DRAM Bandwidth and 3D memory Capacity.]

slide-21
SLIDE 21

Solution 3: balance two limited resources

High pressure on DRAM bandwidth

[Diagram: Cores attached to 3D Memory (16 GB) and DRAM (80 GB/s). Two limited resources: DRAM Bandwidth and 3D memory Capacity.]

slide-22
SLIDE 22

Solution 3: balance two limited resources

High pressure on DRAM bandwidth → more indexes on 3D memory

[Diagram: Cores attached to 3D Memory (16 GB) and DRAM (80 GB/s). Two limited resources: DRAM Bandwidth and 3D memory Capacity.]

slide-23
SLIDE 23

Solution 3: balance two limited resources

Pressure rebalanced

[Diagram: Cores attached to 3D Memory (16 GB) and DRAM (80 GB/s). Two limited resources: DRAM Bandwidth and 3D memory Capacity.]

slide-24
SLIDE 24

Solution 3: balance two limited resources

High pressure on both… → reach hardware limit → limit data ingestion (back pressure)

[Diagram: Cores attached to 3D Memory (16 GB) and DRAM (80 GB/s). Two limited resources: DRAM Bandwidth and 3D memory Capacity.]
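
The balancing steps on the preceding slides can be sketched as a small control rule. Everything specific here is an assumption made for illustration: the slides do not define the knob, so it is modeled as the fraction of new indexes placed on 3D memory, and the threshold `high` and adjustment `step` are hypothetical values.

```python
def adjust_knob(knob, hbm_capacity_used, dram_bw_used, high=0.9, step=0.05):
    """Return (new_knob, back_pressure).

    knob: fraction of new indexes placed on 3D memory, in [0.0, 1.0]
          (a hypothetical definition of the "single knob").
    hbm_capacity_used, dram_bw_used: utilization in [0, 1].
    """
    hbm_high = hbm_capacity_used >= high
    dram_high = dram_bw_used >= high
    if hbm_high and dram_high:
        return knob, True                       # hardware limit: throttle ingestion
    if hbm_high:
        return max(0.0, knob - step), False     # shift indexes toward DRAM
    if dram_high:
        return min(1.0, knob + step), False     # shift indexes toward 3D memory
    return knob, False                          # pressure is balanced


knob = 0.5
for hbm, dram in [(0.95, 0.40), (0.95, 0.40), (0.50, 0.92), (0.95, 0.95)]:
    knob, back_pressure = adjust_knob(knob, hbm, dram)
    print(f"knob={knob:.2f} back_pressure={back_pressure}")
```

A single scalar keeps the policy easy to reason about: each resource pushes the knob in one direction, and back pressure is the only response left when both push at once.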

slide-25
SLIDE 25

Other optimizations

  • Customized memory allocator
  • Customized task scheduler for high pipeline and data parallelism
  • Highly parallel merge-sort kernels using AVX-512
  • Dynamically handle key changes
  • Parallel aggregation
  • Co-design RDMA ingestion with memory management and task scheduling
  • Task parallelism to utilize all CPU cores
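
The task structure of a parallel merge sort can be sketched as follows: sort chunks as independent tasks, then merge the sorted runs. This is only a structural illustration with a hypothetical function name. It does not model the AVX-512 kernels or the custom scheduler, and CPython threads do not run the chunk sorts truly in parallel.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_merge_sort(values, num_tasks=4):
    """Sort roughly equal-sized chunks as independent tasks, then merge the runs."""
    if not values:
        return []
    chunk = -(-len(values) // num_tasks)          # ceiling division
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        runs = list(pool.map(sorted, chunks))     # one sort task per chunk
    return list(heapq.merge(*runs))               # sequential k-way merge


data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
print(parallel_merge_sort(data, num_tasks=3))
```

Both phases read and write memory sequentially, which is the access pattern that benefits from high-bandwidth memory.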


slide-26
SLIDE 26

StreamBox-HBM Implementation

  • Based on our prior work StreamBox [USENIX ATC’17]
  • Implemented on real hardware (Intel KNL) with an RDMA network
  • 61K lines of C++11, of which 38K lines are new
  • Open source: http://xsel.rocks/p/streambox

[USENIX ATC’17] StreamBox: Modern Stream Processing on a Multicore Machine, Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin, in Proc. USENIX Annual Technical Conference, 2017.

Hardware:

  • Ninja Developer Platform (KNL): 64 cores @ 1.3 GHz, 16 GB 3D memory, 96 GB DRAM
  • Mellanox ConnectX-2: 40 Gb/s

slide-27
SLIDE 27

Evaluation

  • Comparing to a widely used stream analytics engine
  • Validating our key system designs


slide-28
SLIDE 28

StreamBox-HBM is 10x faster than Flink

[Figure: Throughput (MRec/s) vs. # cores. Series: Flink @ x56; Flink @ KNL; Ours @ KNL. Reference line: RDMA ingestion limit. Annotation: 5-10x.]

  • KNL: Intel Xeon Phi Knights Landing w/ HBM. 64 cores @ 1.3 GHz. $5,000
  • x56: Intel Xeon E7-4830v4. 4x14 cores @ 2.0 GHz. 256 GB. $23,000
  • Benchmark: Yahoo Stream Benchmark. Output delay: 1 second

slide-29
SLIDE 29

Poor performance without any key designs

[Figure: TopK Per Key. Throughput (MRec/s) vs. # cores. Series: 3D as cache full-records.]

slide-30
SLIDE 30

In-mem-index performs better than full-record

[Figure: TopK Per Key. Throughput (MRec/s) vs. # cores. Series: 3D as cache in-mem-index; 3D as cache full-records. Annotation: Using in-mem index.]

slide-31
SLIDE 31

3D memory boosts performance

[Figure: TopK Per Key. Throughput (MRec/s) vs. # cores. Series: 3D as cache in-mem-index; DRAM only in-mem-index; 3D as cache full-records. Annotation: Using 3D memory.]

slide-32
SLIDE 32

SW better manages hybrid memory than HW

[Figure: TopK Per Key. Throughput (MRec/s) vs. # cores. Series: 3D + DRAM in-mem-index; 3D as cache in-mem-index; DRAM only in-mem-index; 3D as cache full-records. Annotation: SW manages hybrid memory.]

slide-33
SLIDE 33

Performance improves with all system designs

[Figure: TopK Per Key. Throughput (MRec/s) vs. # cores. Series: 3D + DRAM in-mem-index; 3D as cache in-mem-index; DRAM only in-mem-index; 3D as cache full-records. Annotation: Using all key system designs.]

slide-34
SLIDE 34

The first stream engine optimized for 3D Memory + DRAM on real hardware

StreamBox-HBM

  • 1. Grouping with Sort
    • Hash → Sort
    • Abundant memory bandwidth
    • High parallelism
    • Wide SIMD (avx512)
    • Sequential access
  • 2. In-memory index in 3D Memory
    • Minimize use of capacity
    • Exploit high bandwidth
  • 3. Manage hybrid memory
    • Balance limited resources: DRAM Bandwidth, 3D memory Capacity

http://xsel.rocks/p/streambox

slide-35
SLIDE 35

Lessons on exploiting 3D memory + DRAM

Layers: Apps, Runtime, OS kernel, Hybrid Memory

  • Cheap VM (huge page)
  • RDMA network: bypass kernel, free CPU
  • High task parallelism
  • Custom mem allocator
  • Sequential mem access
  • Thread pool + custom task scheduler
  • Wide SIMD (avx512)
  • Packed data structure