CS 839: Design the Next-Generation Database Lecture 19: RDMA for OLAP

CS 839: Design the Next-Generation Database Lecture 19: RDMA for OLAP Xiangyao Yu 3/31/2020 1

Discussion Highlights
SmartNIC vs. SmartSSD
• Different application scenarios: one for storage, one for network
• SATA vs. PCIe?
• SmartNICs are used for reducing CPU overhead; SmartSSDs are used for reducing data movement
• SmartNICs seem more popular among hardware vendors
• Computation in a SmartNIC is stronger than in a SmartSSD
Database operators pushed to SmartNIC
• Common: encryption, caching
• OLTP: filtering, aggregation, locking, indexing
• OLAP: filtering, projection, aggregation, compression
Benefits of putting smartness into the NIC
• Packet processing, latency reduction
• Effect of SmartSSD is limited due to caching; caching does not apply in SmartNIC
• Isolate security checks from CPU
• Collect run-time statistics such as network usage and latencies
• Reduces burden on PCIe 2

Today’s Paper VLDB 2017 3

Bandwidth and Latency 4

Algorithm Designs
(columns: Shared Memory | RDMA & SmartNIC | Distributed System)
• Concurrency Control: Shared lock table | ??? | Partitioned lock table
• Fault Tolerance: Shared log | ??? | Two-phase commit
• Join: Radix join | ??? | Bloom-filter + semi-join 5

Message Passing [Figure: shared memory vs. message passing] 6

Message Passing Interface (MPI) Standard library interface for writing parallel programs in high-performance computing (HPC) • Hardware-independent interface • Can leverage performance of underlying hardware 7

MPI One-Sided Operations Memory Window: memory that is accessible by other processes through RMA operations [Figure: two multicore CPUs with their memories; RMA operations access a memory window] 8

MPI One-Sided Operations
• MPI_Win_create: exposes local memory to RMA operations by other processes (collective operation; creates a window object)
• MPI_Win_free: deallocates the window object
• MPI_Put: moves data from local memory to remote memory
• MPI_Get: retrieves data from remote memory into local memory
• MPI_Win_lock and MPI_Win_unlock: protect RMA operations on a specific window 9
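The one-sided semantics can be illustrated without an MPI installation. The following is only a toy, in-process model: the `Window` class and its `put`, `get`, `lock`, and `unlock` methods are hypothetical stand-ins for the roles of the MPI calls named on the slide, not the MPI API itself.

```python
import threading


class Window:
    """Toy stand-in for an RMA memory window (hypothetical, not real MPI)."""

    def __init__(self, size):
        self.buf = bytearray(size)     # memory the owner exposes to others
        self._lock = threading.Lock()  # plays the role of lock/unlock

    def lock(self):
        self._lock.acquire()

    def unlock(self):
        self._lock.release()

    def put(self, data, offset):
        # One-sided write: only the caller acts; the owner runs no code.
        self.buf[offset:offset + len(data)] = data

    def get(self, offset, length):
        # One-sided read of the owner's exposed memory.
        return bytes(self.buf[offset:offset + length])


def demo():
    remote = Window(16)          # window owned by some other process
    remote.lock()
    remote.put(b"tuple-42", 4)   # write 8 bytes at byte offset 4
    remote.unlock()
    remote.lock()
    out = remote.get(4, 8)       # read the same 8 bytes back
    remote.unlock()
    return out
```

The point of the sketch is that the target side never posts a matching receive: the initiator names the window and the offset, and the data lands there.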

Radix Hash Join
• Partitioned hash join achieves the best performance when each partition of the inner relation fits in cache ⇒ A large number of partitions
• [Figure: partitioning the input into partitions P0, P1, …, Pk]
• ⇒ Performance suffers when the # partitions > # TLB entries or # of cache lines in the cache
• Radix Join: Partition through multiple passes 10
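Multi-pass partitioning can be sketched as follows. This is a minimal single-node illustration, assuming integer keys whose low-order bits are used directly as the radix (a real system would typically hash first); the function name and parameters are hypothetical.

```python
def radix_partition(keys, bits_per_pass, num_passes):
    """Split keys into (2**bits_per_pass)**num_passes partitions.

    Each pass looks at only bits_per_pass bits, so its fan-out stays
    small even when the final number of partitions is large.
    """
    fanout = 1 << bits_per_pass
    mask = fanout - 1
    partitions = [list(keys)]
    for p in range(num_passes):
        shift = p * bits_per_pass          # which group of bits this pass uses
        next_parts = []
        for part in partitions:
            buckets = [[] for _ in range(fanout)]
            for k in part:
                buckets[(k >> shift) & mask].append(k)
            next_parts.extend(buckets)     # each bucket is refined in the next pass
        partitions = next_parts
    return partitions
```

With `bits_per_pass=2` and `num_passes=2`, 16 partitions are produced while each pass writes to only 4 output buckets at a time.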

Radix Hash Join • 1st pass of partitioning 11

Radix Hash Join • 1st pass of partitioning • Data shuffle 12

Radix Hash Join • 1st pass of partitioning • Data shuffle • Following passes of partitioning 13

Radix Hash Join • 1st pass of partitioning • Data shuffle • Following passes of partitioning • Partition outer relation 14

Radix Hash Join • 1st pass of partitioning • Data shuffle • Following passes of partitioning • Build and probe • Partition outer relation 15
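The phases listed above can be put together in a small single-node sketch. It assumes integer join keys, uses one partitioning pass for brevity, and omits the network shuffle entirely; `partition` and `radix_hash_join` are illustrative names, not code from the lecture or the paper.

```python
from collections import defaultdict


def partition(rel, bits):
    """Partition (key, payload) tuples on the low `bits` bits of the key."""
    mask = (1 << bits) - 1
    parts = [[] for _ in range(1 << bits)]
    for key, payload in rel:
        parts[key & mask].append((key, payload))
    return parts


def radix_hash_join(inner, outer, bits=2):
    """Join two relations partition by partition."""
    result = []
    # Both relations use the same partitioning function, so matching
    # keys always land in partitions with the same index.
    for r_part, s_part in zip(partition(inner, bits), partition(outer, bits)):
        table = defaultdict(list)
        for key, payload in r_part:          # build on the inner partition
            table[key].append(payload)
        for key, payload in s_part:          # probe with the outer partition
            for r_payload in table.get(key, ()):
                result.append((key, r_payload, payload))
    return result
```

Because each hash table is built from a single small partition, it can stay cache-resident during the probe, which is the motivation for partitioning first.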

Radix Hash Join – Performance Model Compute the histogram • Determine the size of memory windows • Assignment of partitions to nodes • Offsets within memory windows into which each process writes exclusively 16
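How a histogram yields window sizes and exclusive write offsets can be sketched with a prefix sum. This is a minimal illustration under the assumption that every process reports one tuple count per partition; `window_layout` is a hypothetical helper name.

```python
def window_layout(local_histograms):
    """local_histograms[p][j] = number of tuples process p holds for partition j.

    Returns (window_sizes, offsets), where window_sizes[j] is the total
    number of tuples in partition j and offsets[p][j] is the position in
    partition j's window at which process p starts writing.
    """
    num_parts = len(local_histograms[0])
    window_sizes = [0] * num_parts
    offsets = []
    for hist in local_histograms:
        # Exclusive prefix sum over processes: process p starts where
        # processes 0..p-1 end, so write ranges never overlap.
        offsets.append(list(window_sizes))
        for j, count in enumerate(hist):
            window_sizes[j] += count
    return window_sizes, offsets
```

For example, with histograms `[[2, 1], [0, 3], [4, 0]]`, the windows have sizes `[6, 4]` and the three processes write at offsets `[0, 0]`, `[2, 1]`, and `[2, 4]`. Since the ranges are disjoint, writers need no synchronization among themselves.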

Radix Hash Join – Performance Model Multi-pass partitioning • Number of passes, as a function of the partitioning fan-out • Time of partitioning 17
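The general relationship between fan-out and pass count can be sketched as below. This is the generic counting argument (each pass multiplies the number of partitions by the fan-out), not necessarily the exact formula used in the lecture; `num_passes` and its parameters are assumed names.

```python
def num_passes(target_partitions, fanout):
    """Smallest d such that fanout**d >= target_partitions.

    Integer arithmetic is used instead of math.log to avoid
    floating-point rounding at exact powers of the fan-out.
    """
    passes, reachable = 0, 1
    while reachable < target_partitions:
        reachable *= fanout   # one more pass multiplies the partition count
        passes += 1
    return passes
```

For instance, reaching 4096 partitions with a fan-out of 64 takes 2 passes, while 4097 partitions would take 3.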

Radix Hash Join – Performance Model Build and Probe • Build time • Probe time 18

Radix Hash Join – Performance Model [Equation: total execution time as a sum of four terms] 19

Sort-Merge Join • Range partitioning 20

Sort-Merge Join • Range partitioning • Sort individual runs 21

Sort-Merge Join • Range partitioning • Sort individual runs • Data shuffle 22

Sort-Merge Join • Range partitioning • Sort individual runs • Data shuffle • Merge 23

Sort-Merge Join • Range partitioning • Sort individual runs • Data shuffle • Merge • Sort-merge outer relation 24

Sort-Merge Join • Range partitioning • Sort individual runs • Data shuffle • Merge • Join • Sort-merge outer relation 25
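The sort-merge phases can be sketched on a single node as follows. The network shuffle is omitted, the splitter values are supplied by the caller, and all function names are illustrative assumptions rather than the lecture's or the paper's code.

```python
import bisect
import heapq


def range_partition(rel, splitters):
    """Assign each (key, payload) tuple to a key range given sorted splitters."""
    parts = [[] for _ in range(len(splitters) + 1)]
    for key, payload in rel:
        parts[bisect.bisect_right(splitters, key)].append((key, payload))
    return parts


def sort_runs_and_merge(rel, run_length):
    """Sort fixed-length runs individually, then merge them into one sorted list."""
    runs = [sorted(rel[i:i + run_length]) for i in range(0, len(rel), run_length)]
    return list(heapq.merge(*runs))


def merge_join(r, s):
    """Join two relations that are already sorted by key."""
    result, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            key, j_end = r[i][0], j
            while j_end < len(s) and s[j_end][0] == key:
                j_end += 1                      # extent of matching outer tuples
            while i < len(r) and r[i][0] == key:
                for k in range(j, j_end):
                    result.append((key, r[i][1], s[k][1]))
                i += 1
            j = j_end
    return result


def sort_merge_join(inner, outer, splitters, run_length=2):
    result = []
    for r_part, s_part in zip(range_partition(inner, splitters),
                              range_partition(outer, splitters)):
        r = sort_runs_and_merge(r_part, run_length)
        s = sort_runs_and_merge(s_part, run_length)
        result.extend(merge_join(r, s))
    return result
```

Range partitioning guarantees that matching keys fall into the same range, so each range can be sorted and joined independently of the others.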

Sort-Merge Join – Performance Model Partitioning 26

Sort-Merge Join – Performance Model Sorting individual runs of length l • Number of runs • Sorting performance • Sorting time 27

Sort-Merge Join – Performance Model Merging multiple runs into a sorted output • Number of iterations, as a function of the merge fan-in • Merge time 28
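The link between merge fan-in and the number of merge iterations can be seen by running the merge and counting. This is a minimal sketch that assumes a fan-in of at least 2; `multiway_merge` is a hypothetical name, and the counting argument is generic rather than the lecture's exact formula.

```python
import heapq


def multiway_merge(runs, fan_in):
    """Merge sorted runs, combining at most fan_in runs per merge step.

    Returns (sorted_output, iterations). Each iteration shrinks the
    number of runs by roughly a factor of fan_in.
    """
    iterations = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        iterations += 1
    return (runs[0] if runs else []), iterations
```

Five runs need three iterations with a fan-in of 2 but only two iterations with a fan-in of 4, which is why a larger fan-in reduces the number of passes over the data.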

Sort-Merge Join – Performance Model Joining sorted relations 29

Sort-Merge Join – Performance Model Total execution time [Equation: sum of four terms] 30

Radix-Hash Join vs. Sort-Merge Join [Equations: total execution time of radix join and of sort-merge join, each a sum of four terms] 31

Performance Evaluation 36

Baseline Experiments 37

Scale-Out Experiments • Compression improves performance • Radix join outperforms sort-merge join 38

Radix Join Execution Time Breakdown Time of histogram computation and window allocation remains largely constant 39

Radix Join Execution Time Breakdown Time of local partitioning and build/probe remains constant 40

Radix Join Execution Time Breakdown Time of network partitioning increases beyond 1,024 cores 41

Radix Join Execution Time Breakdown Time of network partitioning increases beyond 1,024 cores • Partitioning fan-out is increased beyond its optimal setting • Additional time spent in MPI_Put and MPI_Flush 42

Radix Join Execution Time Breakdown Time due to load imbalance increases with core count 43

Sort-Merge Join Execution Time Breakdown 44

Sort-Merge Join Execution Time Breakdown Partitioning fan-out is pushed beyond its optimal configuration 45

Sort-Merge Join Execution Time Breakdown Within sorting, time of network shuffling increases with core count 46

Sort-Merge Join Execution Time Breakdown • Time of merging and joining stays constant • Time due to load imbalance increases slightly with core count 47

Scale-Up Experiments • With more cores per machine, considerably more time is spent on MPI_Put and MPI_Flush • Difficult to fully interleave computation and communication 48

Comparison with the Model Network shuffling is the bottleneck 49

RDMA for OLAP – Q/A
• Collective communication scheduling for joins?
• Supercomputers used in the real world for database workloads?
• Radix join vs. hash join?
• Radix join does not achieve theoretical maximum performance
• What is partition fan-out?
• MPI vs. shared memory for join 50

Group Discussion
• How can Smart NICs help improve the performance of joins?
• Can you think of any hardware/software techniques that may close the performance gap between radix join and sort-merge join?
• Can you think of any hardware/software techniques that may allow radix join to achieve its theoretical maximum performance? 51

Before Next Lecture
Submit discussion summary to https://wisc-cs839-ngdb20.hotcrp.com
• Deadline: Wednesday 11:59pm
Submit review for
• Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
• [optional] Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes 52
