StreamBox-HBM
Stream Analytics on High Bandwidth Hybrid Memory
Hongyu Miao, Purdue ECE; Myeongjae Jeon, UNIST; Gennady Pekhimenko, UToronto; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE http://xsel.rocks/p/streambox
On 100+ GB memory
DRAM
[Diagram: cores connected to DRAM (100+ GB, 80 GB/s) and 3D memory (16 GB, 375 GB/s)]
[Figure: TopK Per Key, throughput (Mrec/s) vs. # cores; series: 3D + DRAM in-mem-index, 3D as cache full-records]
7x speedup
Ingestion → Groupby key → Average per key → Window Top Key
Window: 10:00-10:05
Example record: Time 10:01, ID: 0x1024, Value: 200
[Diagram values: 130, 500, 302, 100, 150, 500, 302]
Grouping
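As a rough illustration of the pipeline on this slide (not StreamBox-HBM's implementation), the sketch below groups the records of each fixed window by key, averages per key, and picks the top key. The record layout `(timestamp, key, value)`, the names `to_seconds` and `top_key_per_window`, and every sample record other than the 10:01 / 0x1024 / 200 one are assumptions made for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute fixed windows, as in the 10:00-10:05 example


def to_seconds(hhmm):
    """Convert an 'HH:MM' timestamp string to seconds since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 3600 + int(minutes) * 60


def top_key_per_window(records):
    """Group by key within each window, average per key, return the top key.

    records: iterable of (timestamp 'HH:MM', key, value) tuples (assumed layout).
    Returns {window_start_seconds: (key, average)}.
    """
    # window start -> key -> [running sum, count]
    windows = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for ts, key, value in records:
        start = to_seconds(ts) // WINDOW_SECONDS * WINDOW_SECONDS
        acc = windows[start][key]
        acc[0] += value
        acc[1] += 1

    result = {}
    for start, groups in windows.items():
        averages = {k: s / n for k, (s, n) in groups.items()}
        best = max(averages, key=averages.get)  # key with the highest average
        result[start] = (best, averages[best])
    return result


records = [
    ("10:01", 0x1024, 200),  # the record shown on the slide
    ("10:02", 0x1024, 400),  # made-up sample data from here on
    ("10:03", 0x2048, 250),
    ("10:06", 0x2048, 100),
]
print(top_key_per_window(records))
```

The grouping step (the nested dictionary here) is the part the rest of the talk focuses on.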
[Diagram: cores and 3D memory (16 GB)]
Cannot fit!
What to map? Where to map?
Unbounded data
Various queries
Hybrid memory: benefit & limitation
Known duals of Grouping: Hash vs. Sort
Sort is worse than Hash on algorithmic complexity
Yet, Sort outperforms Hash after we exploit all: abundant memory, high parallelism, wide SIMD (avx512), sequential access
[VLDB’09] Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs.
[VLDB’13] Multi-core, main-memory joins: Sort vs. hash revisited.
[SIGMOD’15] Rethinking SIMD vectorization for in-memory databases.
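To make the hash/sort duality concrete, here is a minimal sketch of the two grouping strategies. It only illustrates the access patterns (one table probe per pair versus a sort followed by a single sequential pass); it says nothing about relative performance in Python, and the function names are hypothetical. Hash grouping takes expected linear time, while sort grouping pays O(n log n) for the sort.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_by_hash(pairs):
    """Hash grouping: one hash-table probe per pair (random memory access)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def group_by_sort(pairs):
    """Sort grouping: sort by key, then collect runs of equal keys sequentially."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort keeps value order per key
    return {
        key: [value for _, value in run]
        for key, run in groupby(ordered, key=itemgetter(0))
    }


pairs = [(3, "a"), (1, "b"), (3, "c"), (2, "d"), (1, "e")]
print(group_by_hash(pairs))
print(group_by_sort(pairs))
```

Both functions produce the same groups; they differ in how they touch memory, which is what matters once bandwidth, parallelism, and SIMD come into play.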
[Figure, two panels vs. # cores: Throughput (million pairs / sec) and Mem bandwidth (GB / sec); series: Hash DRAM, Hash 3D mem, Sort DRAM, Sort 3D mem]
Sort outperforms Hash on 3D memory
Streaming data
Full Records: <key, key1, v1, v2, v3…>
Index: <key, pointer>
[Diagram: cores, 3D memory (16 GB, 375 GB/s), DRAM (80 GB/s)]
Minimize the use of precious 3D memory's capacity while exploiting its high bandwidth
Smaller, faster, more efficient
K swapping
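The idea of working on a compact `<key, pointer>` index instead of full records can be sketched as follows. This is an illustration under assumptions, not the engine's data structure: the "pointer" is simply a list position, records are plain tuples, and the names `extract_index`, `sort_index`, and `materialize` are hypothetical.

```python
def extract_index(records, key_column):
    """Build a compact <key, pointer> index; the pointer is a list position here."""
    return [(record[key_column], pos) for pos, record in enumerate(records)]


def sort_index(index):
    """Sort only the small index entries; the full records do not move."""
    return sorted(index)


def materialize(records, index):
    """Follow the pointers to fetch full records in index order."""
    return [records[pos] for _, pos in index]


# Full records laid out as <key, key1, v1, v2, v3> (sample values are made up).
records = [
    (7, 1, 10, 11, 12),
    (3, 2, 20, 21, 22),
    (5, 0, 30, 31, 32),
]

index = sort_index(extract_index(records, key_column=0))
print(index)                        # small entries: (key, pointer)
print(materialize(records, index))  # full records, touched only at the end
```

Only the index entries are moved during sorting, so the data that needs fast memory is much smaller than the full records.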
[Diagram: cores, 3D memory, and DRAM, with two monitored resources: DRAM bandwidth (80 GB/s) and 3D memory capacity (16 GB)]
High pressure on 3D memory capacity → indexes on DRAM → pressure rebalanced
High pressure on DRAM bandwidth → more indexes on 3D memory → pressure rebalanced
High pressure on both → reach hardware limit → limit data ingestion (back pressure)
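The balancing behavior walked through in these slides can be summarized as a small decision function. This is a sketch under assumptions rather than the engine's actual policy: the `Pressure` fields, the 0.8 threshold, and the returned labels are all hypothetical, chosen only to mirror the three cases shown.

```python
from dataclasses import dataclass


@dataclass
class Pressure:
    mem3d_capacity_used: float  # fraction of 3D memory capacity in use, 0.0-1.0
    dram_bandwidth_used: float  # fraction of DRAM bandwidth in use, 0.0-1.0


HIGH = 0.8  # assumed threshold for "high pressure"


def decide(p: Pressure) -> str:
    """Return where new indexes should go, or signal back pressure."""
    mem3d_high = p.mem3d_capacity_used >= HIGH
    dram_high = p.dram_bandwidth_used >= HIGH
    if mem3d_high and dram_high:
        return "back_pressure"       # hardware limit reached: limit ingestion
    if mem3d_high:
        return "index_on_dram"       # relieve 3D memory capacity
    if dram_high:
        return "index_on_3d_memory"  # relieve DRAM bandwidth
    return "keep_current_placement"  # neither resource is under high pressure


print(decide(Pressure(0.9, 0.3)))
print(decide(Pressure(0.4, 0.95)))
print(decide(Pressure(0.9, 0.9)))
```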
[USENIX ATC’17] StreamBox: Modern Stream Processing on a Multicore Machine, Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin, in Proc. USENIX Annual Technical Conference, 2017.
Ninja Developer Platform (KNL): 64 cores @ 1.3 GHz, 16 GB 3D memory, 96 GB DRAM
Mellanox ConnectX-2: 40 Gb/s
[Figure: throughput (MRec/s) vs. # cores; series: Flink @ x56, Flink @ KNL, Ours @ KNL; reference line: RDMA ingestion limit]
5-10x
KNL: Intel Xeon Phi Knights Landing w/ HBM. 64 cores @ 1.3 GHz. $5,000
x56: Intel Xeon E7-4830v4. 4x14 cores @ 2.0 GHz. 256 GB. $23,000
Benchmark: Yahoo Stream Benchmark. Output delay: 1 second
[Figure: TopK Per Key, throughput (Mrec/s) vs. # cores, built up one series at a time]
Series: 3D as cache, full-records; 3D as cache, in-mem-index; DRAM only, in-mem-index; 3D + DRAM, in-mem-index
Build captions, in order: Using in-mem index; Using 3D memory; SW manages hybrid memory; Using all key system designs
The first stream engine optimized for 3D Memory + DRAM on real hardware
Exploit high bandwidth: Hash → Sort
Abundant memory, high parallelism, wide SIMD (avx512), sequential access
Minimize use of capacity
Balance limited resources: DRAM bandwidth, 3D memory capacity
http://xsel.rocks/p/streambox
[Diagram: software stack: Apps, Runtime, OS kernel, Hybrid Memory]
Cheap VM (huge page)
RDMA network: bypass kernel, free CPU
High task parallelism
Custom mem allocator
Sequential mem access
Thread pool + custom task scheduler
Wide SIMD (avx512)
Packed data structure