StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory

Hongyu Miao, Purdue ECE; Myeongjae Jeon, UNIST; Gennady Pekhimenko, UToronto; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE
http://xsel.rocks/p/streambox

Timely processing of streaming data on 100+ GB of memory: high throughput and low latency!

Hybrid Memory: 3D Memory + DRAM
DRAM (100+ GB, 80 GB/s to the cores)
• Larger capacity, but lower bandwidth
3D Memory (16 GB, 375 GB/s to the cores)
• Higher bandwidth, but smaller capacity
• NO latency benefit (unlike a cache hierarchy: SRAM + DRAM)
• Same as DRAM without high parallelism or sequential access
• As a cache of DRAM? → Poor performance…

Can hybrid memory speed up stream analytics? Yes! StreamBox-HBM
• The first stream engine optimized for 3D memory + DRAM on real hardware
• Achieves the best reported throughput on a single node (win-avg: 110 MRec/s)
• Speeds up stream analytics by 7x
[Figure: TopK Per Key. Throughput (MRec/s) vs. # cores (0-60). Series: 3D + DRAM, in-mem-index; 3D as cache, full-records. Annotation: 7x speedup.]

Challenges
1. Hash Grouping performs poorly on 3D memory
2. 3D memory is capacity limited
3. How to dynamically map streaming data to hybrid memory?

Challenge 1: Hash Grouping performs poorly on 3D memory
• Operators: computations that consume/produce streams
• Pipeline: a graph of streaming operators
[Figure: example pipeline with operators Ingestion, Window, Groupby key, Average per key, and Top Key; a sample record (Time: 10:01, ID: 0x1024, Value: 200) and a 10:00-10:05 window; the grouping operators are labeled "Grouping".]
• Data Grouping
• A set of very common and expensive operators that reorganize records
• Hash with random access in existing engines → Performs poorly on 3D memory…
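
To make the example pipeline concrete, here is a minimal sketch of its operators (window, group by key, average per key, top-K), written with the hash-based grouping that the slide attributes to existing engines. The `Record` fields, the function names, and the 300-second window size are assumptions made for illustration; they are not taken from StreamBox-HBM.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Record(NamedTuple):
    ts: int       # event time in seconds (hypothetical field)
    key: int      # grouping key, e.g. an ID
    value: float


def window_of(ts: int, size: int = 300) -> int:
    """Return the start time of the fixed window that contains ts."""
    return ts - ts % size


def avg_per_key(records: Iterable[Record], size: int = 300) -> Dict[int, Dict[int, float]]:
    """Group records by (window, key) with hash tables and average the values."""
    # window start -> key -> [running sum, count]
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for r in records:
        acc = sums[window_of(r.ts, size)][r.key]  # random access into a hash table
        acc[0] += r.value
        acc[1] += 1
    return {w: {k: s / n for k, (s, n) in keys.items()} for w, keys in sums.items()}


def top_k(averages: Dict[int, float], k: int) -> List[Tuple[int, float]]:
    """Return the k keys with the largest average, ties broken by key."""
    return sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


stream = [
    Record(601, 0x1024, 200.0),
    Record(650, 0x1024, 100.0),
    Record(700, 7, 500.0),
    Record(905, 7, 302.0),
]
per_window = avg_per_key(stream)
best = top_k(per_window[600], k=1)
```

Every record triggers a lookup at an unpredictable location in the hash table, which is the random-access pattern the slide identifies as a poor fit for 3D memory.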

Challenge 2: 3D memory is capacity limited
• Streaming data
• High data volume (100+ GB)
• 3D Memory
• Capacity limited (~16 GB)
• 3D memory is NOT large enough to hold all streaming data: it cannot fit!

Challenge 3: managing two types of memory
• How to dynamically map data/operators to two types of memory?
• What to map? Unbounded data, various queries
• Where to map? Hybrid memory: benefit & limitation
[Figure: example pipeline with operators Ingestion, Window, Groupby key, Average per key, and Top Key over a 10:00-10:05 window.]

StreamBox-HBM Solutions
1. Hash grouping performs poorly on 3D memory
• → Solution 1: Use highly parallel Sort for grouping
2. 3D memory is capacity limited
• → Solution 2: Only use 3D memory to store in-memory indexes
3. How to manage two types of memory?
• → Solution 3: Balance two limited resources with a single knob

Solution 1: Parallel Sort for Grouping
Known duals of Grouping: Hash vs. Sort
• DRAM: Hash is the best [VLDB’09, VLDB’13, SIGMOD’15]
• Contribution: 3D memory reverses the debate. Sort outperforms Hash.
Sort is worse than Hash in algorithmic complexity
• O(N log N) vs. O(N)
Yet, Sort outperforms Hash once we exploit all of:
• Abundant memory bandwidth
• High task parallelism
• Wide SIMD (AVX-512)
[VLDB’09] Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs.
[VLDB’13] Multi-core, main-memory joins: Sort vs. hash revisited.
[SIGMOD’15] Rethinking SIMD vectorization for in-memory databases.
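
The duality between the two grouping strategies can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and single-threaded Python cannot reproduce the bandwidth, parallelism, or SIMD effects the slide credits for Sort's advantage. It only shows that both strategies produce the same groups while touching memory differently.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_by_hash(pairs):
    """O(N) grouping: each pair lands in a hash bucket (random access)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def group_by_sort(pairs):
    """O(N log N) grouping: sort by key, then scan equal-key runs sequentially."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable, so arrival order is kept per key
    return {
        key: [value for _, value in run]
        for key, run in groupby(ordered, key=itemgetter(0))
    }


pairs = [(3, "a"), (1, "b"), (3, "c"), (2, "d"), (1, "e")]
by_hash = group_by_hash(pairs)
by_sort = group_by_sort(pairs)
```

The sort-based version reads and writes its data in long sequential passes, the access pattern that can draw on high memory bandwidth, whereas the hash-based version jumps between buckets.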

Solution 1: Parallel Sort for Grouping
[Figure: two panels plotted against # cores (0-60): Throughput (million pairs/sec) and Mem bandwidth (GB/sec). Series in both panels: Hash DRAM, Hash 3D mem, Sort DRAM, Sort 3D mem.]
Sort outperforms Hash on 3D memory

Solution 2: Only use 3D memory for in-memory index
• Full Records: <key, key1, v1, v2, v3…>
• Index: <key, pointer>. Smaller, faster, more efficient
[Figure: streaming data flowing to the cores through 3D Memory (16 GB, 375 GB/s) and DRAM (96 GB, 80 GB/s); label: K Swapping.]
Minimize the use of precious 3D memory capacity while exploiting its high bandwidth
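
The index idea can be sketched as follows: extract small <key, pointer> pairs from the full records, sort only those pairs, and dereference the pointers when the full records are needed. In this sketch a "pointer" is simply a list position, and `FullRecord`, `build_index`, and `group_via_index` are hypothetical names. Placing the index in 3D memory and the records in DRAM is not modeled here.

```python
from typing import Dict, List, Tuple

# (key, payload, value): a hypothetical full-record layout for illustration
FullRecord = Tuple[int, str, float]


def build_index(records: List[FullRecord]) -> List[Tuple[int, int]]:
    """Extract compact <key, pointer> pairs; the pointer is the record's position."""
    return [(rec[0], ptr) for ptr, rec in enumerate(records)]


def group_via_index(records: List[FullRecord]) -> Dict[int, List[FullRecord]]:
    """Group full records by sorting only the small index, not the records."""
    index = sorted(build_index(records))  # moves 2-field pairs, never full records
    groups: Dict[int, List[FullRecord]] = {}
    for key, ptr in index:
        groups.setdefault(key, []).append(records[ptr])  # dereference on demand
    return groups


records = [(2, "x", 1.0), (1, "y", 2.0), (2, "z", 3.0)]
index = build_index(records)
groups = group_via_index(records)
```

Because the grouping step moves only fixed-size pairs, the data that needs the high-bandwidth memory stays small regardless of how wide the full records are.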

Solution 3: balance two limited resources
Two limited resources: 3D memory capacity (16 GB) and DRAM bandwidth (80 GB/s)
• High pressure on 3D memory capacity → indexes on DRAM → pressure rebalanced
• High pressure on DRAM bandwidth → more indexes on 3D memory → pressure rebalanced
• High pressure on both… → reach hardware limit → back pressure: limit data ingestion
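
The rebalancing steps above can be read as a small feedback loop around one placement knob. The sketch below is an assumption-laden illustration, not the engine's actual policy: the `Knob` class, the 90% pressure threshold, the 10-point step, and the utilization inputs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Knob:
    """Single knob: percentage of new indexes placed on 3D memory (hypothetical)."""
    pct_on_3d: int = 50   # current knob position, 0..100
    step: int = 10        # how far one adjustment moves the knob
    high: float = 0.9     # utilization at or above this counts as high pressure

    def update(self, cap_3d_util: float, dram_bw_util: float) -> bool:
        """Adjust placement from two utilizations in [0, 1].

        Returns True when both resources are under high pressure, meaning the
        caller should apply back pressure and limit data ingestion.
        """
        cap_high = cap_3d_util >= self.high
        bw_high = dram_bw_util >= self.high
        if cap_high and bw_high:
            return True  # hardware limit reached; the knob cannot help
        if cap_high:
            # 3D memory is filling up: place more indexes on DRAM
            self.pct_on_3d = max(0, self.pct_on_3d - self.step)
        elif bw_high:
            # DRAM bandwidth is saturated: place more indexes on 3D memory
            self.pct_on_3d = min(100, self.pct_on_3d + self.step)
        return False


knob = Knob()
throttle_a = knob.update(cap_3d_util=0.95, dram_bw_util=0.20)
after_capacity_pressure = knob.pct_on_3d
throttle_b = knob.update(cap_3d_util=0.20, dram_bw_util=0.95)
after_bandwidth_pressure = knob.pct_on_3d
throttle_c = knob.update(cap_3d_util=0.95, dram_bw_util=0.95)
```

A single knob suffices in this sketch because the two pressures pull in opposite directions; only when both are high at once does the loop fall back to throttling ingestion.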

Other optimizations
• Customized memory allocator
• Customized task scheduler for high pipeline and data parallelism
• Highly parallel merge-sort kernels using AVX-512
• Dynamically handle key changes
• Parallel aggregation
• Co-design RDMA ingestion with memory management and task scheduling
• Task parallelism to utilize all CPU cores
• …

StreamBox-HBM Implementation
• Based on our prior work StreamBox [USENIX ATC’17]
• Implemented on real hardware (Intel KNL) with an RDMA network
• 61K lines of C++11, of which 38K lines are new
• Open source: http://xsel.rocks/p/streambox
Hardware: Ninja Developer Platform (KNL) with 64 cores @ 1.3 GHz, 16 GB 3D memory, 96 GB DRAM, and Mellanox ConnectX-2 (40 Gb/s)
[USENIX ATC’17] StreamBox: Modern Stream Processing on a Multicore Machine. Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin. In Proc. USENIX Annual Technical Conference, 2017.

Evaluation
• Comparing to a widely used stream analytics engine
• Validating our key system designs

StreamBox-HBM is 10x faster than Flink
[Figure: Throughput (MRec/s) vs. # cores (2-58). Series: Ours @ KNL, Flink @ x56, Flink @ KNL. Annotations: RDMA ingestion limit; 5-10x.]
• Benchmark: Yahoo Stream Benchmark. Output delay: 1 second
• KNL: Intel Xeon Phi Knights Landing w/ HBM. 64 cores @ 1.3 GHz. $5,000
• x56: Intel Xeon E7-4830v4. 4x14 cores @ 2.0 GHz. 256 GB. $23,000

Poor performance without any key designs
[Figure: TopK Per Key. Throughput (MRec/s) vs. # cores (0-60). Series: 3D as cache, full-records.]

In-mem-index performs better than full-record
[Figure: same plot. Series: 3D as cache, in-mem-index; 3D as cache, full-records. Annotation: using in-mem index.]

3D memory boosts performance
[Figure: same plot. Series: 3D as cache, in-mem-index; DRAM only, in-mem-index; 3D as cache, full-records. Annotation: using 3D memory.]

SW better manages hybrid memory than HW
[Figure: same plot. Series: 3D + DRAM, in-mem-index; 3D as cache, in-mem-index; DRAM only, in-mem-index; 3D as cache, full-records. Annotation: SW manages hybrid memory.]

Performance improves with all system designs
[Figure: same plot with all four series. Annotation: using all key system designs.]
