s9557 effective scalable multi gpu joins
play

S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS Tim Kaldewey, Nikolay - PowerPoint PPT Presentation

S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS Tim Kaldewey, Nikolay Sakharnykh and Jiri Kraus, March 20 th 2019 RECAP JOINS Joins are implicit in a business question SQL Database Operators Business question Counts the number of select orders in


  1. S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS Tim Kaldewey, Nikolay Sakharnykh and Jiri Kraus, March 20 th 2019

  2. RECAP JOINS Joins are implicit in a business question SQL Database Operators Business question Counts the number of select orders in a given o_orderpriority, aggregate quarter of a given year count(o_orderkey) as order_count, from in which at least one orders lineitem was received where by the customer later predicate (filter) o_orderdate >= date '[DATE]' and than its committed o_orderdate < date '[DATE]' + interval '3' month and exists ( select * from lineitem date. The query lists join where l_orderkey = o_orderkey and the count of such predicate (filter) l_commitdate < l_receiptdate) orders for each order aggregate group by priority sorted in o_orderpriority, order by ascending priority sort o_orderpriority; order 4

  3. TPC-H SCHEMA part (p_) PARTKEY customer (c_) NAME order (o_) lineitem (l_) CUSTKEY MFGR ORDERKEY ORDERKEY NAME CATEGORY LINENUMBER CUSTKEY ADDRESS BRAND PARTKEY ORDERDATE CITY … SUPPKEY ORDPRIORITY … COMMITDATE ORDERSTATUS supplier (s_) … RECEIPTDATE SUPPKEY … NAME … ADDRESS nation (n_) CITY NATIONKEY NATIONKEY NAME … … 5

  4. RELATIONAL JOIN Join Results Lineitem 1 Order 2 l_orderkey o_orderkey o_orderpriority o_orderkey o_orderpriority 23 11 1 23 5 = 14 23 5 11 1 56 27 2 27 2 11 29 4 23 5 39 27 Payload Primary Key 23 Foreign Key 1 after applying predicate “ l_commitdate < l_receiptdate ” 2 after applying predicates “ o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month ” 6

  5. HASH JOIN Join Results Lineitem 1 Order 2 l_orderkey o_orderkey o_orderpriority o_orderkey o_orderpriority 23 11 1 23 5 = 14 23 5 11 1 56 27 2 27 2 11 29 4 23 5 39 27 Payload Primary Key 23 Build hash table Foreign Key = Probe inputs 1 after applying predicate “ l_commitdate < l_receiptdate ” 2 after applying predicates “ o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month ” 7

  6. JOINS & E2E PERFORMANCE CPU TPC-H Q4 execution breakdown GPU TPC-H Q4 execution breakdown join group-by join group-by 1% 1% 99% 99% 18/22 TPC-H Queries involve Joins and are the longest running ones 1 1 c.f. recently published TPC-H results at http://www.tpc.org/tpch/results/tpch_last_ten_results.asp 8

  7. IMPLEMENTING GPU JOINS In Heterogeneous Systems Hash Table(s) If the hash table fits in Key Payload 32GB GPU memory, performance 23 5 HBM is primarily bound by 27 2 random memory access. 1 Let’s ignore CPU -GPU Build & Probe interconnect for a moment. 1TB+ DDR DB 9 1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289

  8. PERFORMANCE Peak memory Random 8B bandwidth 1 access 1 High-end CPU 120 GB/s 6GB/s (6-channel DDR4) 10x NVIDIA Tesla V100 900 GB/s 60GB/s 1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289 10 http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75

  9. PERFORMANCE VS. CAPACITY Peak memory Random 8B Memory capacity bandwidth 1 access 1 High-end CPU 120 GB/s 6GB/s 1 TB+ (6-channel DDR4) 1/32 NVIDIA Tesla V100 900 GB/s 60GB/s 32GB 1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289 11 http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75

  10. PERFORMANCE VS. CAPACITY Peak memory Random 8B Memory capacity bandwidth 1 access 1 High-end CPU 120 GB/s 6GB/s 1 TB+ (6-channel DDR4) 1/2 NVIDIA Tesla V100 900 GB/s 60GB/s 32GB NVIDIA DGX-2 16 x 900 GB/s 16x 60GB/s 512 GB (16x V100) 1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289 12 http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75

  11. IS A SINGLE V100 FAST/LARGE ENOUGH? TPC-H query 4 @SF1000 = 1000GB data warehouse Hash table sizes GPU execution breakdown GPU execution breakdown, compressed data Query SF1K SF3K SF10K join group-by join group-by 1% 1% Q4 1.5 GB 4.5 GB 15 GB Q18 21 GB 63 GB 210 GB Q21 10.5 GB 31.5 GB 105 GB 7.0 s 3.8 s 99% 99% For further speedup or > SF 1000 need to to distribute hash table across multiple GPUs 13

  12. DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE NVIDIA DGX-2 Two GPU Boards 2 8 V100 32GB GPUs per board 6 NVSwitches per board 512GB Total HBM2 Memory NVIDIA Tesla V100 32GB 1 interconnected by Plane Card Twelve NVSwitches Eight EDR Infiniband/100 GigE 3 4 2.4 TB/sec bi-section 1600 Gb/sec Total bandwidth Bi-directional Bandwidth Two High-Speed Ethernet 8 10/25/40/100 GigE 5 Two Intel Xeon Platinum CPUs 30 TB NVME SSDs 7 Internal Storage 6 1.5 TB System Memory 14

  13. POTENTIAL DGX-2 IMPLEMENTATION Use 2.4TB/s bisection BW to exchange FT chunks GPU GPU GPU GPU GPU GPU GPU GPU 8 9 10 11 12 13 14 15 NVSwitch Fabric GPU GPU GPU GPU GPU GPU GPU GPU 0 1 2 3 4 5 6 7 15

  14. SCALING OF INNER JOIN 16

  15. DISCLAIMER This investigation is ongoing For a production system some additional aspects need to be considered: - Data Skew - Cardinality estimation - Query optimizer 17

  16. SCALING OF INNER JOIN redundant build of replicated HT (step 0) GPU 0 GPU 1 GPU 2 GPU #GPU Full HT Full HT Full HT Full HT … Build table 0…B 1 -1 B 1 …B 2 -1 B 2 …B 3 -1 B # …B -1 0…P 1 -1 P 1 …P 2 -1 P 2 …P 3 -1 P # …P -1 18

  17. SCALING OF INNER JOIN redundant build of replicated HT (step 1..#GPU-1) GPU 0 GPU 1 GPU 2 GPU #GPU Full HT Full HT Full HT Full HT … Build table 0…B 1 -1 B 1 …B 2 -1 B 2 …B 3 -1 B # …B -1 0…P 1 -1 P 1 …P 2 -1 P 2 …P 3 -1 P # …P -1 19

  18. SCALING OF INNER JOIN redundant build of replicated HT (step #GPU) GPU 0 GPU 1 GPU 2 GPU #GPU Full HT Full HT Full HT Full HT … Build table 0…B 1 -1 B 1 …B 2 -1 B 2 …B 3 -1 B # …B -1 0…P 1 -1 P 1 …P 2 -1 P 2 …P 3 -1 P # …P -1 20

  19. SCALING OF INNER JOIN parallel probe of replicated HT GPU 0 GPU 1 GPU 2 GPU #GPU Full HT Full HT Full HT Full HT … 0…B 1 -1 B 1 …B 2 -1 B 2 …B 3 -1 B # …B -1 Probe table 0…P 1 -1 P 1 …P 2 -1 P 2 …P 3 -1 P # …P -1 21

  20. SCALING OF INNER JOIN Benchmark Problem randomly generated 8 bytes keys build table size = probe table size = 335544320 rows (worst case for HT creation fitting in the memory of a single GPU: 2x 2.5GiB for tables, 2x10GiB for HT + staging buffers (for strong scaling experiment)) HT occupancy = 50% selectivity = 0 for analytical purposes we will look at a real problem later build and probe tables are evenly partitioned across GPUs GPU 0 GPU 1 GPU 2 GPU #GPU Build table 0…B 1 -1 B 1 …B 2 -1 B 2 …B 3 -1 B # …B -1 … Probe table 0…P 1 -1 P 1 …P 2 -1 P 2 …P 3 -1 P # …P -1 22

  21. SCALING OF INNER JOIN ON DGX-2 with redundant build of replicated HT 600 120% 500 100% 400 80% Parallel efficiency Runtime [ms] Runtime [ms] Build runtime [ms] 300 60% Probe runtime [ms] Parallel Efficiency build 200 40% Parallel Efficiency probe Parallel Efficiency 100 20% 0 0% 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 #GPUs 23 Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)

  22. SCALING OF INNER JOIN Basic Idea Open addressing hash table with N buckets key -> hash_value = hf(key) -> bucket_idx = hash_value%N Partition N hash table buckets equally onto GPUs: GPU 0 GPU #GPU GPU 1 Hash table 0…N 1 -1 N # ...N-1 N 1 …N 2 -1 … The bucket_idx and target HT partition can be computed locally from the key 24

  23. SCALING OF INNER JOIN parallel build of a replicated HT (step 0 of phase 1) GPU 0 GPU #GPU GPU 1 temp HT temp HT temp HT if hash to if hash to if hash to bucket bucket bucket 0..N 1 -1 N 1 ..N 2 -1 N # ..N-1 … 0…B 1 -1 B # …B -1 B 1 …B 2 -1 0…P 1 -1 P # …P -1 P 1 …P 2 -1 25

  24. SCALING OF INNER JOIN parallel build of a replicated HT (step 1..#GPU-1 of phase 1) GPU 0 GPU #GPU GPU 1 temp HT temp HT temp HT if hash to if hash to if hash to bucket bucket bucket 0..N 1 -1 N 1 ..N 2 -1 N # ..N-1 … 0…B 1 -1 B # …B -1 B 1 …B 2 -1 0…P 1 -1 P # …P -1 P 1 …P 2 -1 26

  25. SCALING OF INNER JOIN parallel build of a replicated HT (step #GPU of phase 1) GPU 0 GPU #GPU GPU 1 temp HT temp HT temp HT if hash to if hash to if hash to bucket bucket bucket 0..N 1 -1 N 1 ..N 2 -1 N # ..N-1 … 0…B 1 -1 B # …B -1 B 1 …B 2 -1 0…P 1 -1 P # …P -1 P 1 …P 2 -1 27

  26. SCALING OF INNER JOIN parallel build of a replicated HT (phase 2 – merge step) GPU 0 GPU 1 GPU 2 GPU # temp HT temp HT temp HT temp HT 0…N 1 -1 N 1 …N 2 -1 N 2 …N 3 -1 N # …N -1 28

  27. SCALING OF INNER JOIN parallel build of a replicated HT (phase 2 – merge step) GPU 0 GPU 1 GPU 2 GPU # GPU 0 GPU 1 GPU 2 GPU # temp HT temp HT temp HT temp HT res HT res HT res HT res HT 0…N 1 -1 0…N 1 -1 0…N 1 -1 0…N 1 -1 0…N 1 -1 N 1 …N 2 -1 N 1 …N 2 -1 N 1 …N 2 -1 N 1 …N 2 -1 N 1 …N 2 -1 N 2 …N 3 -1 N 2 …N 3 -1 N 2 …N 3 -1 N 2 …N 3 -1 N 2 …N 3 -1 N # …N -1 N # …N -1 N # …N -1 N # …N -1 N # …N -1 29

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend