CS 839: Design the Next-Generation Database Lecture 19: RDMA for OLAP

SLIDE 1

Xiangyao Yu 3/31/2020

CS 839: Design the Next-Generation Database Lecture 19: RDMA for OLAP

SLIDE 2

Discussion Highlights

SmartNIC vs. SmartSSD

  • Different application scenarios: SmartSSD targets storage, SmartNIC targets the network
  • SATA vs. PCIe?
  • SmartNICs are used to reduce CPU overhead; SmartSSDs are used to reduce data movement
  • SmartNICs seem more popular among hardware vendors
  • Compute capability of a SmartNIC is stronger than that of a SmartSSD

Database operators pushed to SmartNIC

  • Common: encryption, caching
  • OLTP: filtering, aggregation, locking, indexing
  • OLAP: filtering, projection, aggregation, compression

Benefits of putting smartness into the NIC

  • Packet processing, latency reduction
  • The effect of SmartSSD is limited due to caching; caching does not apply to SmartNIC
  • Isolate security checks from the CPU
  • Collect runtime statistics such as network usage and latencies
  • Reduces burden on PCIe

SLIDE 3

Today’s Paper

VLDB 2017

SLIDE 4

Bandwidth and Latency

SLIDE 5

Algorithm Designs

Design problem       Shared Memory       RDMA & SmartNIC   Distributed System
Concurrency Control  Shared lock table   ???               Partitioned lock table
Fault Tolerance      Shared log          ???               Two-phase commit
Join                 Radix join          ???               Bloom-filter + semi-join

SLIDE 6

Message Passing

[Figure: shared memory vs. message passing]

SLIDE 7

Message Passing Interface (MPI)

Standard library interface for writing parallel programs in high-performance computing (HPC)

  • Hardware-independent interface
  • Can leverage the performance of the underlying hardware

SLIDE 8

MPI One-Sided Operations

Memory Window: memory that is accessible by other processes through RMA operations

[Figure: two nodes, each with a multicore CPU and memory; a memory window on one node is accessed from the other through RMA]

SLIDE 9

MPI One-Sided Operations

MPI_Win_create: exposes local memory to RMA operations by other processes.

  • Collective operation
  • Creates window object

MPI_Win_free: deallocates the window object.

MPI_Put: moves data from local memory to remote memory.

MPI_Get: retrieves data from remote memory into local memory.

MPI_Win_lock and MPI_Win_unlock: protect RMA operations on a specific window.
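The semantics of these calls can be illustrated with a toy, single-process model. This is not MPI: the class `MemoryWindow` and the helpers `put`, `get`, and `one_sided_demo` are hypothetical names invented for this sketch, and a thread lock stands in for MPI_Win_lock / MPI_Win_unlock. The point it tries to show is that the origin reads and writes the target's window directly, while the target runs no matching receive.

```python
import threading


class MemoryWindow:
    """Toy stand-in for an MPI window: a buffer that peers may access directly."""

    def __init__(self, size):
        self.buffer = [0] * size        # memory exposed to other processes
        self._lock = threading.Lock()   # models MPI_Win_lock / MPI_Win_unlock

    def lock(self):
        self._lock.acquire()

    def unlock(self):
        self._lock.release()


def put(window, offset, values):
    """Model of MPI_Put: copy local values into the remote window."""
    window.buffer[offset:offset + len(values)] = values


def get(window, offset, count):
    """Model of MPI_Get: copy data out of the remote window into local memory."""
    return window.buffer[offset:offset + count]


def one_sided_demo():
    # Every "process" creates a window (window creation is collective in MPI).
    windows = {rank: MemoryWindow(4) for rank in range(2)}
    target = windows[1]

    # Rank 0 writes into rank 1's window inside one lock/unlock epoch.
    target.lock()
    put(target, 1, [7, 8])
    target.unlock()

    # A second epoch reads the data back; rank 1 never takes part.
    target.lock()
    fetched = get(target, 0, 4)
    target.unlock()
    return fetched
```

The write and the read are placed in separate lock/unlock epochs on purpose, so the sketch does not rely on any ordering between operations issued within a single epoch.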

SLIDE 10

Radix Hash Join

Partitioned hash join achieves the best performance when each partition of the inner relation fits in cache

  ⇒ A large number of partitions
  ⇒ Performance suffers when the number of partitions exceeds the number of TLB entries or the number of cache lines in the cache

Radix join: partition through multiple passes

[Figure: partitioning a relation into partitions P0, P1, ..., Pk]
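A minimal sketch of multi-pass partitioning, assuming integer keys whose low-order bits serve directly as the radix (a real system would typically hash the key first). The function name `radix_partition` and its parameters are illustrative, not taken from the lecture. Each pass uses a small fan-out of `2 ** bits_per_pass`, so the number of output buffers written at once stays small even when the final partition count is large.

```python
def radix_partition(keys, bits_per_pass, num_passes):
    """Partition keys on their low-order bits, bits_per_pass bits per pass."""
    fanout = 1 << bits_per_pass
    partitions = [list(keys)]
    for p in range(num_passes):
        shift = p * bits_per_pass          # which group of bits this pass uses
        next_partitions = []
        for part in partitions:
            # Split every existing partition into `fanout` sub-partitions.
            buckets = [[] for _ in range(fanout)]
            for key in part:
                buckets[(key >> shift) & (fanout - 1)].append(key)
            next_partitions.extend(buckets)
        partitions = next_partitions
    return partitions
```

With `bits_per_pass=2` and `num_passes=2`, the call `radix_partition(range(64), 2, 2)` yields 16 partitions while never writing to more than 4 buckets at a time.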

SLIDE 11

Radix Hash Join

  • 1st pass of partitioning

SLIDE 12

Radix Hash Join

  • 1st pass of partitioning
  • Data shuffle

SLIDE 13

Radix Hash Join

  • 1st pass of partitioning
  • Data shuffle
  • Following passes of partitioning

SLIDE 14

Radix Hash Join

  • 1st pass of partitioning
  • Data shuffle
  • Following passes of partitioning
  • Partition outer relation

SLIDE 15

Radix Hash Join

  • 1st pass of partitioning
  • Data shuffle
  • Following passes of partitioning
  • Partition outer relation
  • Build and probe
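The steps above can be sketched on a single node as follows. This is a simplified illustration under stated assumptions: one partitioning pass instead of several, no network shuffle, relations given as lists of `(key, payload)` tuples, and the low-order key bits used as the radix. The names `partition` and `radix_hash_join` are hypothetical.

```python
from collections import defaultdict


def partition(tuples, num_bits):
    """Single-pass radix partitioning on the low-order bits of the join key."""
    mask = (1 << num_bits) - 1
    parts = [[] for _ in range(1 << num_bits)]
    for key, payload in tuples:
        parts[key & mask].append((key, payload))
    return parts


def radix_hash_join(inner, outer, num_bits=2):
    """Partition both relations, then build and probe partition by partition."""
    results = []
    inner_parts = partition(inner, num_bits)
    outer_parts = partition(outer, num_bits)
    for r_part, s_part in zip(inner_parts, outer_parts):
        # Build: hash table over one (ideally cache-sized) inner partition.
        table = defaultdict(list)
        for key, payload in r_part:
            table[key].append(payload)
        # Probe: only the matching outer partition can contain join partners.
        for key, payload in s_part:
            for r_payload in table.get(key, ()):
                results.append((key, r_payload, payload))
    return results
```

Because matching keys always land in partitions with the same index, each build/probe pair is independent, which is what lets the distributed version assign partitions to different nodes.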

SLIDE 16

Radix Hash Join – Performance Model

Compute the histogram

  • Determine the size of memory windows
  • Assignment of partitions to nodes
  • Offsets within memory windows into which each process writes exclusively
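A sketch of how the histogram can drive these three decisions, under assumptions not stated on the slide: partitions are assigned to nodes round-robin, and each node holds a single window. The function `plan_windows` and its argument layout are hypothetical. The per-process offsets come from an exclusive prefix sum, so no two processes ever write to the same window slot.

```python
def plan_windows(local_histograms, num_nodes):
    """local_histograms[p][i] = number of tuples of partition i held by process p."""
    num_procs = len(local_histograms)
    num_parts = len(local_histograms[0])

    # Global histogram: total tuples per partition across all processes.
    global_hist = [sum(h[i] for h in local_histograms) for i in range(num_parts)]

    # Assumed policy: assign partitions to nodes round-robin.
    owner = [i % num_nodes for i in range(num_parts)]

    # Window size per node, and the base offset of each partition in its window.
    window_size = [0] * num_nodes
    base = [0] * num_parts
    for i in range(num_parts):
        base[i] = window_size[owner[i]]
        window_size[owner[i]] += global_hist[i]

    # Exclusive prefix sum over processes: where process p starts writing
    # its tuples of partition i inside the owner's window.
    offsets = [[0] * num_parts for _ in range(num_procs)]
    for i in range(num_parts):
        running = base[i]
        for p in range(num_procs):
            offsets[p][i] = running
            running += local_histograms[p][i]
    return owner, window_size, offsets
```

Since every write range is disjoint, processes can issue their remote writes without coordinating with each other once the histogram has been exchanged.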

SLIDE 17

Radix Hash Join – Performance Model

Multi-pass partitioning

  • Number of passes [equation not preserved]
  • Partitioning fan-out [symbol not preserved]
  • Time of partitioning [equation not preserved]
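The slide's own formula did not survive extraction, so the following is only a generic illustration of how the number of passes relates to the fan-out, not the model from the paper. It assumes every pass multiplies the partition count by the same fan-out; the function name `num_partitioning_passes` is hypothetical.

```python
def num_partitioning_passes(total_partitions, fanout):
    """Smallest number of passes d such that fanout ** d >= total_partitions."""
    if fanout < 2:
        raise ValueError("fan-out must be at least 2")
    passes = 0
    reachable = 1          # partitions reachable after `passes` passes
    while reachable < total_partitions:
        reachable *= fanout
        passes += 1
    return passes
```

For example, 1,024 target partitions with a fan-out of 32 need two passes, while 1,025 need three. Integer arithmetic is used instead of a floating-point logarithm to avoid rounding errors at exact powers.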

SLIDE 18

Radix Hash Join – Performance Model

Build and probe

  • Build time [equation not preserved]
  • Probe time [equation not preserved]

SLIDE 19

Radix Hash Join – Performance Model

[Equation not preserved: sum of four terms]

SLIDE 20

Sort-Merge Join

  • Range partitioning

SLIDE 21

Sort-Merge Join

  • Range partitioning
  • Sort individual runs

SLIDE 22

Sort-Merge Join

  • Range partitioning
  • Sort individual runs
  • Data shuffle

SLIDE 23

Sort-Merge Join

  • Range partitioning
  • Sort individual runs
  • Data shuffle
  • Merge

SLIDE 24

Sort-Merge Join

  • Range partitioning
  • Sort individual runs
  • Data shuffle
  • Merge
  • Sort-merge outer relation

SLIDE 25

Sort-Merge Join

  • Range partitioning
  • Sort individual runs
  • Data shuffle
  • Merge
  • Sort-merge outer relation
  • Join
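The pipeline above can be sketched on a single node as follows. The sketch is an illustration under stated assumptions: the network shuffle is omitted, relations are lists of `(key, payload)` tuples, and the splitter keys are supplied by the caller. All function names (`range_partition`, `sorted_ranges`, `merge_join`, `sort_merge_join`) are hypothetical.

```python
import heapq
from bisect import bisect_right


def range_partition(tuples, splitters):
    """Assign each tuple to a key range delimited by the sorted splitters."""
    parts = [[] for _ in range(len(splitters) + 1)]
    for key, payload in tuples:
        parts[bisect_right(splitters, key)].append((key, payload))
    return parts


def sorted_ranges(tuples, splitters, run_length):
    """Per range: sort fixed-length runs, then merge the runs into one output."""
    out = []
    for part in range_partition(tuples, splitters):
        runs = [sorted(part[i:i + run_length])
                for i in range(0, len(part), run_length)]
        out.append(list(heapq.merge(*runs)))
    return out


def merge_join(r, s):
    """Join two key-sorted lists, handling duplicate keys on both sides."""
    results, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            key, i_end, j_end = r[i][0], i, j
            while i_end < len(r) and r[i_end][0] == key:
                i_end += 1
            while j_end < len(s) and s[j_end][0] == key:
                j_end += 1
            for _, r_payload in r[i:i_end]:
                for _, s_payload in s[j:j_end]:
                    results.append((key, r_payload, s_payload))
            i, j = i_end, j_end
    return results


def sort_merge_join(inner, outer, splitters, run_length=2):
    """Sort-merge both relations per key range, then join matching ranges."""
    results = []
    for r, s in zip(sorted_ranges(inner, splitters, run_length),
                    sorted_ranges(outer, splitters, run_length)):
        results.extend(merge_join(r, s))
    return results
```

Because both relations use the same splitters, tuples with equal keys always fall into the same range, so each pair of ranges can be joined independently.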

SLIDE 26

Sort-Merge Join – Performance Model

Partitioning

SLIDE 27

Sort-Merge Join – Performance Model

Sorting individual runs of length l

  • Number of runs [equation not preserved]
  • Sorting performance [equation not preserved]
  • Sorting time [equation not preserved]

SLIDE 28

Sort-Merge Join – Performance Model

Merging multiple runs into a sorted output

  • Number of iterations [equation not preserved]
  • Merge fan-in [symbol not preserved]
  • Merge time [equation not preserved]
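The relationship between merge fan-in and the number of iterations can be illustrated with a small sketch. It is not the slide's formula, which was not preserved; it simply merges at most `fan_in` runs at a time and counts the passes needed. The function name `merge_runs` is hypothetical.

```python
import heapq


def merge_runs(runs, fan_in):
    """Merge sorted runs, at most fan_in at a time; return (output, iterations)."""
    if fan_in < 2:
        raise ValueError("fan-in must be at least 2")
    runs = [list(r) for r in runs]
    iterations = 0
    while len(runs) > 1:
        # One iteration: every group of up to fan_in runs becomes one longer run.
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        iterations += 1
    return (runs[0] if runs else []), iterations
```

With five input runs, a fan-in of 2 takes three iterations, a fan-in of 4 takes two, and a fan-in of 8 takes one. A larger fan-in means fewer passes over the data but more input streams open at once.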

SLIDE 29

Sort-Merge Join – Performance Model

Joining sorted relations

SLIDE 30

Sort-Merge Join – Performance Model

Total execution time

[Equation not preserved: sum of four terms]

SLIDE 31

Radix-Hash Join vs. Sort-Merge Join

[Equations not preserved: one sum of four terms for radix join and one for sort-merge join]

SLIDE 36

Performance Evaluation

SLIDE 37

Baseline Experiments

SLIDE 38

Scale-Out Experiments

  • Compression improves performance
  • Radix join outperforms sort-merge join

SLIDE 39

Radix Join Execution Time Breakdown

Time for histogram computation and window allocation remains largely constant

SLIDE 40

Radix Join Execution Time Breakdown

Time for local partitioning and build/probe remains constant

SLIDE 41

Radix Join Execution Time Breakdown

Time for network partitioning increases beyond 1,024 cores

SLIDE 42

Radix Join Execution Time Breakdown

Time for network partitioning increases beyond 1,024 cores

  • Partitioning fan-out is increased beyond its optimal setting
  • Additional time is spent in MPI_Put and MPI_Flush

SLIDE 43

Radix Join Execution Time Breakdown

Time due to load imbalance increases with core count

SLIDE 44

Sort-Merge Join Execution Time Breakdown

SLIDE 45

Sort-Merge Join Execution Time Breakdown

Partitioning fan-out is pushed beyond its optimal configuration

SLIDE 46

Sort-Merge Join Execution Time Breakdown

Within sorting, the time for network shuffling increases with core count

SLIDE 47

Sort-Merge Join Execution Time Breakdown

  • Time for merging and joining stays constant
  • Time due to load imbalance increases slightly with core count

SLIDE 48

Scale-Up Experiments

With more cores per machine, considerably more time is spent on MPI_Put and MPI_Flush. It is difficult to fully interleave computation and communication.

SLIDE 49

Comparison with the Model

Network shuffling is the bottleneck

SLIDE 50

RDMA for OLAP – Q/A

  • Collective communication scheduling for joins?
  • Supercomputers used in the real world for database workloads?
  • Radix join vs. hash join?
  • Radix join does not achieve theoretical maximum performance
  • What is partition fan-out?
  • MPI vs. shared memory for join

SLIDE 51

Group Discussion

  • How can SmartNICs help improve the performance of joins?
  • Can you think of any hardware/software techniques that may close the performance gap between radix join and sort-merge join?
  • Can you think of any hardware/software techniques that may allow radix join to achieve its theoretical maximum performance?

SLIDE 52

Before Next Lecture

Submit discussion summary to https://wisc-cs839-ngdb20.hotcrp.com

  • Deadline: Wednesday 11:59pm

Submit review for

  • Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
  • [optional] Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes