S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS
Tim Kaldewey, Nikolay Sakharnykh and Jiri Kraus, March 20th, 2019

RECAP JOINS
Joins are implicit in a business question

Business question:
Counts the number of orders in a given quarter of a given year in which at least one lineitem was received by the customer later than its committed date. The query lists the count of such orders for each order priority, sorted in ascending priority order.

SQL (database operators shown as comments):
    select o_orderpriority, count(o_orderkey) as order_count   -- aggregate
    from orders
    where o_orderdate >= date '[DATE]'                          -- predicate (filter)
      and o_orderdate < date '[DATE]' + interval '3' month
      and exists (select * from lineitem
                  where l_orderkey = o_orderkey                 -- join
                    and l_commitdate < l_receiptdate)           -- predicate (filter)
    group by o_orderpriority                                    -- aggregate
    order by o_orderpriority;                                   -- sort

TPC-H SCHEMA
part (p_): PARTKEY, NAME, MFGR, CATEGORY, BRAND, …
customer (c_): CUSTKEY, NAME, ADDRESS, CITY, …
order (o_): ORDERKEY, CUSTKEY, ORDERDATE, ORDPRIORITY, ORDERSTATUS, …
lineitem (l_): ORDERKEY, LINENUMBER, PARTKEY, SUPPKEY, COMMITDATE, RECEIPTDATE, …
supplier (s_): SUPPKEY, NAME, ADDRESS, CITY, NATIONKEY, …
nation (n_): NATIONKEY, NAME, …

RELATIONAL JOIN
Lineitem [1], l_orderkey (foreign key): 23, 14, 56, 11, 39, 27, 23

Order [2]
    o_orderkey (primary key)   o_orderpriority (payload)
    11                         1
    23                         5
    27                         2
    29                         4

Join Results (l_orderkey = o_orderkey)
    o_orderkey   o_orderpriority
    23           5
    11           1
    27           2
    23           5

[1] after applying predicate "l_commitdate < l_receiptdate"
[2] after applying predicates "o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month"

HASH JOIN
Build hash table: Order [2]
    o_orderkey (primary key)   o_orderpriority (payload)
    11                         1
    23                         5
    27                         2
    29                         4

Probe inputs: Lineitem [1], l_orderkey (foreign key): 23, 14, 56, 11, 39, 27, 23

Join Results (l_orderkey = o_orderkey)
    o_orderkey   o_orderpriority
    23           5
    11           1
    27           2
    23           5

[1] after applying predicate "l_commitdate < l_receiptdate"
[2] after applying predicates "o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month"
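The build and probe phases of the hash join slide can be sketched in a few lines of Python. This is only a minimal illustration using the example rows from the slide: a Python dict stands in for the GPU hash table, and the names `hash_join`, `orders` and `lineitem_keys` are made up for this sketch.

```python
# Example rows from the slide (after the predicates have been applied).
orders = [(11, 1), (23, 5), (27, 2), (29, 4)]   # (o_orderkey, o_orderpriority)
lineitem_keys = [23, 14, 56, 11, 39, 27, 23]    # l_orderkey values


def hash_join(build_rows, probe_keys):
    """Join probe_keys against build_rows on the key column."""
    # Build phase: insert every (primary key, payload) pair into the hash table.
    hash_table = {key: payload for key, payload in build_rows}

    # Probe phase: look up every foreign key; emit a result row on a match.
    results = []
    for key in probe_keys:
        if key in hash_table:
            results.append((key, hash_table[key]))
    return results


join_results = hash_join(orders, lineitem_keys)
print(join_results)  # [(23, 5), (11, 1), (27, 2), (23, 5)]
```

The smaller table (Order) is used on the build side, so the hash table stays small and the larger table (Lineitem) is only streamed through once.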

JOINS & E2E PERFORMANCE
CPU TPC-H Q4 execution breakdown: join 99%, group-by 1%
GPU TPC-H Q4 execution breakdown: join 99%, group-by 1%
18/22 TPC-H queries involve joins and are the longest running ones [1]
[1] cf. recently published TPC-H results at http://www.tpc.org/tpch/results/tpch_last_ten_results.asp

IMPLEMENTING GPU JOINS
In Heterogeneous Systems
If the hash table fits in GPU memory, performance is primarily bound by random memory access. [1]
Let's ignore the CPU-GPU interconnect for a moment.
[Diagram: the DB resides in 1TB+ DDR on the CPU side; build & probe run against the hash table(s) (key/payload pairs such as 23/5 and 27/2) in 32GB HBM on the GPU.]
[1] cf. "How to Get the Most out of GPU Accelerated Database Operators", GTC Silicon Valley 2018, Session ID S8289

PERFORMANCE
                                 Peak memory bandwidth [1]   Random 8B access [1]
High-end CPU (6-channel DDR4)    120 GB/s                    6 GB/s
NVIDIA Tesla V100                900 GB/s                    60 GB/s
Random 8B access: the V100 is 10x faster than the CPU.
[1] cf. "How to Get the Most out of GPU Accelerated Database Operators", GTC Silicon Valley 2018, Session ID S8289, http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75

PERFORMANCE VS. CAPACITY
                                 Peak memory bandwidth [1]   Random 8B access [1]   Memory capacity
High-end CPU (6-channel DDR4)    120 GB/s                    6 GB/s                 1 TB+
NVIDIA Tesla V100                900 GB/s                    60 GB/s                32 GB
Memory capacity: the V100 has 1/32 of the CPU's capacity.
[1] cf. "How to Get the Most out of GPU Accelerated Database Operators", GTC Silicon Valley 2018, Session ID S8289, http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75

PERFORMANCE VS. CAPACITY
                                 Peak memory bandwidth [1]   Random 8B access [1]   Memory capacity
High-end CPU (6-channel DDR4)    120 GB/s                    6 GB/s                 1 TB+
NVIDIA Tesla V100                900 GB/s                    60 GB/s                32 GB
NVIDIA DGX-2 (16x V100)          16 x 900 GB/s               16 x 60 GB/s           512 GB
Memory capacity: the DGX-2 has 1/2 of the CPU's capacity.
[1] cf. "How to Get the Most out of GPU Accelerated Database Operators", GTC Silicon Valley 2018, Session ID S8289, http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75
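The ratios annotated on the performance and capacity slides (10x, 1/32, 1/2) follow directly from the table values. The short Python calculation below reproduces them; it assumes 1 TB = 1024 GB for the CPU row (the slide only says "1 TB+") and treats the DGX-2 figures as exactly 16 times the V100 figures, as the slide does.

```python
from fractions import Fraction

# Table values from the slides. The CPU capacity of 1024 GB is an assumption
# standing in for "1 TB+".
cpu = {"peak_gb_s": 120, "random_gb_s": 6, "capacity_gb": 1024}
v100 = {"peak_gb_s": 900, "random_gb_s": 60, "capacity_gb": 32}

NUM_GPUS = 16  # V100 GPUs in a DGX-2
dgx2 = {metric: value * NUM_GPUS for metric, value in v100.items()}


def ratio(system, baseline, metric):
    """Exact ratio of one metric between two systems."""
    return Fraction(system[metric], baseline[metric])


print(ratio(v100, cpu, "random_gb_s"))   # 10   (10x random 8B access)
print(ratio(v100, cpu, "capacity_gb"))   # 1/32 (capacity of one V100)
print(ratio(dgx2, cpu, "capacity_gb"))   # 1/2  (capacity of a DGX-2)
```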

IS A SINGLE V100 FAST/LARGE ENOUGH?
TPC-H query 4 @SF1000 = 1000GB data warehouse
GPU execution breakdown: join 99%, group-by 1%; 7.0 s
GPU execution breakdown, compressed data: join 99%, group-by 1%; 3.8 s

Hash table sizes
Query   SF1K      SF3K      SF10K
Q4      1.5 GB    4.5 GB    15 GB
Q18     21 GB     63 GB     210 GB
Q21     10.5 GB   31.5 GB   105 GB

For further speedup or for > SF 1000, the hash table needs to be distributed across multiple GPUs.
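The hash table sizes in the table grow linearly with the scale factor, so a rough lower bound on the number of GPUs needed to hold a partitioned hash table can be estimated. The Python sketch below assumes that all 32 GB of each V100 is available for the hash table, which is optimistic because input tables and buffers also need memory; the function names are made up for this illustration.

```python
import math

# Hash table sizes at SF1K, in GB, taken from the table on the slide.
HT_SIZE_SF1K_GB = {"Q4": 1.5, "Q18": 21.0, "Q21": 10.5}
GPU_MEMORY_GB = 32  # one V100; assumes the whole memory can hold hash table data


def ht_size_gb(query, scale_factor):
    """Hash table size, assuming linear growth with the scale factor."""
    return HT_SIZE_SF1K_GB[query] * scale_factor / 1000


def min_gpus(query, scale_factor):
    """Lower bound on GPUs needed to hold one partitioned copy of the hash table."""
    return math.ceil(ht_size_gb(query, scale_factor) / GPU_MEMORY_GB)


for query in HT_SIZE_SF1K_GB:
    for sf in (1000, 3000, 10000):
        print(query, sf, ht_size_gb(query, sf), "GB ->", min_gpus(query, sf), "GPU(s)")
```

Under these assumptions Q18 already needs at least two GPUs at SF3K and seven at SF10K, while Q4 fits on a single GPU at all three scale factors.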

DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE
NVIDIA DGX-2
1. NVIDIA Tesla V100 32GB
2. Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB Total HBM2 Memory, interconnected by Plane Card
3. Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
4. Eight EDR Infiniband/100 GigE: 1600 Gb/sec Total Bi-directional Bandwidth
5. Two Intel Xeon Platinum CPUs
6. 1.5 TB System Memory
7. 30 TB NVME SSDs Internal Storage
8. Two High-Speed Ethernet 10/25/40/100 GigE

POTENTIAL DGX-2 IMPLEMENTATION
Use 2.4TB/s bisection BW to exchange FT chunks
[Diagram: GPU 8 to GPU 15 on one side and GPU 0 to GPU 7 on the other, all connected through the NVSwitch Fabric.]

SCALING OF INNER JOIN

DISCLAIMER
This investigation is ongoing.
For a production system some additional aspects need to be considered:
- Data Skew
- Cardinality estimation
- Query optimizer

SCALING OF INNER JOIN
redundant build of replicated HT (step 0, steps 1..#GPU-1, step #GPU)

              GPU 0       GPU 1         GPU 2         …   GPU #GPU
              Full HT     Full HT       Full HT       …   Full HT
Build table   0…B_1-1     B_1…B_2-1     B_2…B_3-1     …   B_#…B-1
              0…P_1-1     P_1…P_2-1     P_2…P_3-1     …   P_#…P-1

SCALING OF INNER JOIN
parallel probe of replicated HT

              GPU 0       GPU 1         GPU 2         …   GPU #GPU
              Full HT     Full HT       Full HT       …   Full HT
              0…B_1-1     B_1…B_2-1     B_2…B_3-1     …   B_#…B-1
Probe table   0…P_1-1     P_1…P_2-1     P_2…P_3-1     …   P_#…P-1
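The replicated-HT scheme (redundant build, then parallel probe) can be simulated on the CPU to make the data flow concrete. The Python sketch below is only an illustration under stated assumptions: dicts stand in for the per-GPU hash tables, the order in which build partitions visit the GPUs (a ring, `(gpu + step) % num_gpus`) is an assumption since the slides only label the steps, and all function names are hypothetical.

```python
def partition(rows, num_gpus):
    """Split rows into num_gpus contiguous, near-equal partitions."""
    bounds = [len(rows) * g // num_gpus for g in range(num_gpus + 1)]
    return [rows[bounds[g]:bounds[g + 1]] for g in range(num_gpus)]


def replicated_join(build_rows, probe_keys, num_gpus):
    build_parts = partition(build_rows, num_gpus)
    probe_parts = partition(probe_keys, num_gpus)

    # Redundant build: every GPU inserts every build partition into its own
    # full copy of the hash table. In step 0 each GPU inserts its local
    # partition; in later steps it inserts partitions received from the others
    # (the ring order used here is an assumption).
    hash_tables = [dict() for _ in range(num_gpus)]
    for step in range(num_gpus):
        for gpu in range(num_gpus):
            source = (gpu + step) % num_gpus
            hash_tables[gpu].update(build_parts[source])

    # Parallel probe: each GPU probes only its local probe partition against
    # its local full hash table, so no communication is needed.
    results = []
    for gpu in range(num_gpus):
        table = hash_tables[gpu]
        results.extend((key, table[key]) for key in probe_parts[gpu] if key in table)
    return results


orders = [(11, 1), (23, 5), (27, 2), (29, 4)]   # (o_orderkey, o_orderpriority)
lineitem_keys = [23, 14, 56, 11, 39, 27, 23]    # l_orderkey values
print(replicated_join(orders, lineitem_keys, num_gpus=3))
```

Note that every GPU performs the complete build, so only the probe phase gets faster as GPUs are added.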

SCALING OF INNER JOIN
Benchmark Problem
- randomly generated 8-byte keys
- build table size = probe table size = 335544320 rows (worst case for HT creation fitting in the memory of a single GPU: 2x 2.5GiB for tables, 2x 10GiB for HT + staging buffers (for strong scaling experiment))
- HT occupancy = 50%
- selectivity = 0 for analytical purposes; we will look at a real problem later
- build and probe tables are evenly partitioned across GPUs

              GPU 0       GPU 1         GPU 2         …   GPU #GPU
Build table   0…B_1-1     B_1…B_2-1     B_2…B_3-1     …   B_#…B-1
Probe table   0…P_1-1     P_1…P_2-1     P_2…P_3-1     …   P_#…P-1
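The memory figures quoted for the benchmark problem can be checked with a few lines of arithmetic. The Python sketch below reproduces the 2.5 GiB per table from the row count and key size given on the slide. The 16-byte hash table slot (an 8-byte key plus an 8-byte payload) is an assumption that is not stated on the slide; it happens to give exactly 10 GiB at 50% occupancy.

```python
GIB = 2**30

rows = 335_544_320        # build table rows = probe table rows (from the slide)
key_bytes = 8             # 8-byte keys (from the slide)
occupancy = 0.5           # HT occupancy = 50% (from the slide)
slot_bytes = 16           # ASSUMPTION: 8-byte key + 8-byte payload per slot

table_gib = rows * key_bytes / GIB     # size of one input table
slots = int(rows / occupancy)          # slots needed to stay at 50% occupancy
ht_gib = slots * slot_bytes / GIB      # size of the hash table

print(table_gib)  # 2.5
print(slots)      # 671088640
print(ht_gib)     # 10.0
```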

SCALING OF INNER JOIN ON DGX-2
with redundant build of replicated HT
[Chart: x-axis #GPUs (1 to 16); left y-axis Runtime [ms] (0 to 600); right y-axis Parallel efficiency (0% to 120%); series: Runtime [ms], Build runtime [ms], Probe runtime [ms], Parallel Efficiency, Parallel Efficiency build, Parallel Efficiency probe.]
Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)
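The slide does not spell out how parallel efficiency is computed. A common strong-scaling definition, assumed here, is E(n) = T(1) / (n * T(n)). The Python sketch below applies that definition to an idealized model, not to measured data: with a redundant build every GPU still inserts all rows, so the build time stays roughly constant and the build efficiency drops like 1/n, whereas an ideally scaling probe keeps an efficiency of 1.

```python
def parallel_efficiency(t1_ms, tn_ms, num_gpus):
    """Strong-scaling efficiency E(n) = T(1) / (n * T(n)) (assumed definition)."""
    return t1_ms / (num_gpus * tn_ms)


# Idealized model with a made-up single-GPU runtime; NOT values from the chart.
T1_MS = 100.0

for n in (1, 2, 4, 8, 16):
    redundant_build = parallel_efficiency(T1_MS, T1_MS, n)    # build time unchanged
    ideal_probe = parallel_efficiency(T1_MS, T1_MS / n, n)    # probe time divided by n
    print(n, redundant_build, ideal_probe)
```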

SCALING OF INNER JOIN
Basic Idea
Open addressing hash table with N buckets:
    key -> hash_value = hf(key) -> bucket_idx = hash_value % N
Partition the N hash table buckets equally onto GPUs:

              GPU 0       GPU 1         …   GPU #GPU
Hash table    0…N_1-1     N_1…N_2-1     …   N_#…N-1

The bucket_idx and target HT partition can be computed locally from the key.
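The mapping from key to bucket to owning GPU can be written down directly. In the Python sketch below, `hf` is a hypothetical multiplicative hash chosen only for illustration (the slides do not say which hash function is used), and `partition_bounds` and `target_gpu` are made-up helper names. The point is that nothing but the key, N and the number of GPUs is needed, so each GPU can compute the target partition locally.

```python
import bisect

MASK_64 = (1 << 64) - 1


def hf(key):
    """Hypothetical 64-bit multiplicative hash; the slides do not specify hf."""
    return (key * 0x9E3779B97F4A7C15) & MASK_64


def bucket_idx(key, n_buckets):
    return hf(key) % n_buckets


def partition_bounds(n_buckets, num_gpus):
    """Boundaries 0 = N_0 <= N_1 <= ... <= N of an equal bucket partitioning."""
    return [n_buckets * g // num_gpus for g in range(num_gpus + 1)]


def target_gpu(key, n_buckets, num_gpus):
    """GPU whose bucket range N_g .. N_(g+1)-1 contains the key's bucket."""
    bounds = partition_bounds(n_buckets, num_gpus)
    return bisect.bisect_right(bounds, bucket_idx(key, n_buckets)) - 1


print(partition_bounds(8, 4))            # [0, 2, 4, 6, 8]
for key in (11, 23, 27, 29):
    print(key, bucket_idx(key, 8), target_gpu(key, 8, 4))
```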

SCALING OF INNER JOIN
parallel build of a replicated HT (phase 1: step 0, steps 1..#GPU-1, step #GPU)

              GPU 0                GPU 1                …   GPU #GPU
              temp HT              temp HT              …   temp HT
              if hash to bucket    if hash to bucket        if hash to bucket
              0..N_1-1             N_1..N_2-1               N_#..N-1
              0…B_1-1              B_1…B_2-1            …   B_#…B-1
              0…P_1-1              P_1…P_2-1            …   P_#…P-1

SCALING OF INNER JOIN
parallel build of a replicated HT (phase 2 – merge step)

Before the merge, each GPU's temp HT holds one bucket range:

              GPU 0       GPU 1         GPU 2         …   GPU #
temp HT       0…N_1-1     N_1…N_2-1     N_2…N_3-1     …   N_#…N-1

After the merge, the res HT on every GPU (GPU 0, GPU 1, GPU 2, …, GPU #) holds all bucket ranges:
    0…N_1-1, N_1…N_2-1, N_2…N_3-1, …, N_#…N-1
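The two phases of the parallel build can be simulated on the CPU. The Python sketch below makes several simplifying assumptions that are not from the slides: buckets hold small lists (chaining) instead of the open addressing scheme the slides describe, `hf` is a hypothetical hash function, build partitions visit the GPUs in ring order, and the merge is modeled as a plain copy of every temp HT into every result HT.

```python
MASK_64 = (1 << 64) - 1


def hf(key):
    """Hypothetical 64-bit multiplicative hash; the slides do not specify hf."""
    return (key * 0x9E3779B97F4A7C15) & MASK_64


def parallel_build(build_rows, n_buckets, num_gpus):
    bounds = [n_buckets * g // num_gpus for g in range(num_gpus + 1)]
    cuts = [len(build_rows) * g // num_gpus for g in range(num_gpus + 1)]
    build_parts = [build_rows[cuts[g]:cuts[g + 1]] for g in range(num_gpus)]

    # Phase 1: every build partition passes by every GPU; a GPU inserts only
    # the rows that hash into its own bucket range (its temp HT).
    temp_hts = [dict() for _ in range(num_gpus)]   # bucket_idx -> [(key, payload)]
    for step in range(num_gpus):
        for gpu in range(num_gpus):
            for key, payload in build_parts[(gpu + step) % num_gpus]:
                idx = hf(key) % n_buckets
                if bounds[gpu] <= idx < bounds[gpu + 1]:
                    temp_hts[gpu].setdefault(idx, []).append((key, payload))

    # Phase 2 (merge): every GPU collects all bucket ranges, so each one ends
    # up with the same full result HT.
    merged = {}
    for temp_ht in temp_hts:
        merged.update(temp_ht)   # bucket ranges are disjoint, so no overwrites
    res_hts = [dict(merged) for _ in range(num_gpus)]
    return temp_hts, res_hts, bounds


def lookup(res_ht, key, n_buckets):
    """Return the payload for key, or None if the key is absent."""
    for stored_key, payload in res_ht.get(hf(key) % n_buckets, []):
        if stored_key == key:
            return payload
    return None


orders = [(11, 1), (23, 5), (27, 2), (29, 4)]   # (o_orderkey, o_orderpriority)
temp_hts, res_hts, bounds = parallel_build(orders, n_buckets=8, num_gpus=4)
print(lookup(res_hts[2], 23, 8))   # 5
print(lookup(res_hts[0], 14, 8))   # None
```

Compared with the redundant build, each row is inserted exactly once across all GPUs, at the price of the extra merge step.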
