MULTI GPU PROGRAMMING MODELS

Jiri Kraus, Senior Devtech Compute, GTC March 2019


SLIDE 1

MULTI GPU PROGRAMMING MODELS

Jiri Kraus, Senior Devtech Compute, GTC March 2019

SLIDE 2

MOTIVATION

Why use multiple GPUs?

- Need to compute larger problems, e.g. bigger networks, car models, …
- Need to compute faster, e.g. weather prediction
- Better energy efficiency with dense nodes with multiple GPUs

SLIDE 3

NVIDIA DGX-2

DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE

- Two Intel Xeon Platinum CPUs
- 1.5 TB system memory
- 30 TB NVMe SSD internal storage
- NVIDIA Tesla V100 32 GB: two GPU boards, 8 V100 32 GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
- Twelve NVSwitches: 2.4 TB/s bisection bandwidth
- Eight EDR InfiniBand/100 GigE: 1600 Gb/s total bidirectional bandwidth
- Two high-speed Ethernet ports: 10/25/40/100 GigE

SLIDE 4

EXAMPLE: JACOBI SOLVER

Solves the 2D Laplace equation on a rectangle:

    Δu(x,y) = 0   ∀ (x,y) ∈ Ω \ ∂Ω

- Dirichlet boundary conditions (constant values) on the left and right boundary
- Periodic boundary conditions on the top and bottom boundary
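The Jacobi update used on the following slides comes from the standard 5-point finite-difference discretization of the Laplacian; a short derivation (with uniform grid spacing h):

```latex
% 5-point stencil for the discrete Laplacian at interior point (ix, iy):
(\Delta u)_{ix,iy} \approx
  \frac{u_{ix+1,iy} + u_{ix-1,iy} + u_{ix,iy+1} + u_{ix,iy-1} - 4\,u_{ix,iy}}{h^2} = 0
% Solving for the center value gives the Jacobi update: the average of the four neighbors.
u^{\mathrm{new}}_{ix,iy} = \tfrac{1}{4}\left( u_{ix+1,iy} + u_{ix-1,iy} + u_{ix,iy+1} + u_{ix,iy-1} \right)
```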

SLIDE 5

EXAMPLE: JACOBI SOLVER

Single GPU

While not converged:
- Do Jacobi step:

      for( int iy = 1; iy < ny-1; iy++ )
          for( int ix = 1; ix < nx-1; ix++ )
              a_new[iy*nx+ix] = 0.25 * ( a[iy*nx+(ix+1)] + a[iy*nx+(ix-1)]
                                       + a[(iy-1)*nx+ix] + a[(iy+1)*nx+ix] );

- Apply periodic boundary conditions
- Swap a_new and a
- Next iteration

SLIDE 6

DOMAIN DECOMPOSITION

Different ways to split the work between processes:

- Minimize the number of neighbors: communicate with fewer neighbors; optimal for latency-bound communication.
- Minimize the surface-to-volume ratio: communicate less data; optimal for bandwidth-bound communication.

Stripe partitions are contiguous in memory if the data is row-major (horizontal stripes) or column-major (vertical stripes).

SLIDE 7

EXAMPLE: JACOBI SOLVER

Multi GPU: one step with ring exchange

While not converged:
- Do Jacobi step:

      for( int iy = iy_start; iy < iy_end; iy++ )
          for( int ix = 1; ix < nx-1; ix++ )
              a_new[iy*nx+ix] = 0.25 * ( a[iy*nx+(ix+1)] + a[iy*nx+(ix-1)]
                                       + a[(iy-1)*nx+ix] + a[(iy+1)*nx+ix] );

- Apply periodic boundary conditions
- Exchange halo with 2 neighbors
- Swap a_new and a
- Next iteration

SLIDE 8

SINGLE THREADED MULTI GPU PROGRAMMING

while ( l2_norm > tol && iter < iter_max ) {
    for ( int dev_id = 0; dev_id < num_devices; ++dev_id ) {
        const int top    = dev_id > 0 ? dev_id - 1 : (num_devices-1);
        const int bottom = (dev_id+1) % num_devices;
        cudaSetDevice( dev_id );
        cudaMemsetAsync( l2_norm_d[dev_id], 0, sizeof(real) );
        jacobi_kernel<<<dim_grid,dim_block>>>( a_new[dev_id], a[dev_id], l2_norm_d[dev_id],
                                               iy_start[dev_id], iy_end[dev_id], nx );
        cudaMemcpyAsync( l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(real), cudaMemcpyDeviceToHost );
        cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx),
                         a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ... );
        cudaMemcpyAsync( a_new[bottom],
                         a_new[dev_id]+(iy_end[dev_id]-1)*nx, nx*sizeof(real), ... );
    }
    l2_norm = 0.0;
    for ( int dev_id = 0; dev_id < num_devices; ++dev_id ) {
        cudaSetDevice( dev_id );
        cudaDeviceSynchronize();
        l2_norm += *(l2_norm_h[dev_id]);
    }
    l2_norm = std::sqrt( l2_norm );
    for ( int dev_id = 0; dev_id < num_devices; ++dev_id )
        std::swap( a_new[dev_id], a[dev_id] );
    iter++;
}

SLIDE 9

EXAMPLE JACOBI

Top/Bottom Halo

cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx), a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ...);

SLIDE 10

EXAMPLE JACOBI

Top/Bottom Halo

cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx),
                 a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ... );

SLIDE 11

EXAMPLE JACOBI

Top/Bottom Halo

cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx),                       /* (1) push top */
                 a_new[dev_id]+iy_start[dev_id]*nx, nx*sizeof(real), ... );
cudaMemcpyAsync( a_new[bottom],                                     /* (2) push bottom */
                 a_new[dev_id]+(iy_end[dev_id]-1)*nx, nx*sizeof(real), ... );

SLIDE 12

CONTROLLING GPU BOOST

Using application clocks: the short runtime of the benchmark makes the time spent spinning clocks up visible, and with application clocks set the application can safely run at max clocks.

$ sudo nvidia-smi -ac 958,1597
Applications clocks set to "(MEM 958, SM 1597)" for GPU 00000000:34:00.0
... (repeated for each of the 16 GPUs) ...
All done.

[Chart: GPU clock (MHz) over time, with and without application clocks (AC)]

SLIDE 13

EXAMPLE: JACOBI SOLVER

[Chart: single-GPU performance (Mcells/s) and efficiency (%) vs. problem size (nx = ny, 1024 to 18432), Tesla V100 SXM3 32 GB]

Benchmark setup: DGX-2 with OS 4.0.5, GCC 7.3.0, CUDA 10.0 with 410.104 driver, CUB 1.8.0, CUDA-aware OpenMPI 4.0.0, NVSHMEM EA2 (0.2.3), GPUs at 1597 MHz application clocks. Reported performance is the minimum of 5 repetitions.

SLIDE 14

SCALABILITY METRICS FOR SUCCESS

- Serial time Ts: how long it takes to run the problem with a single GPU
- Parallel time Tp: how long it takes to run the problem with multiple GPUs
- Number of GPUs P: the number of GPUs operating on the task at hand
- Speedup S = Ts / Tp: how much faster the parallel version is vs. serial (optimal is P)
- Efficiency E = S / P: how efficiently the GPUs are used (optimal is 1)

SLIDE 15

MULTI GPU JACOBI RUNTIME

[Chart: runtime (s) and parallel efficiency vs. #GPUs (1-16), Single Threaded Copy; DGX-2, 18432 x 18432, 1000 iterations]

SLIDE 16

MULTI GPU JACOBI NVVP TIMELINE

Single Threaded Copy 4 V100 on DGX-2

SLIDE 17

GPUDIRECT P2P

[Diagram: 16 GPUs, each with its own memory, fully connected through NVSwitch]

NVSwitch maximizes intra-node inter-GPU bandwidth and avoids host-memory and system-topology bottlenecks.

SLIDE 18

GPUDIRECT P2P

Enable P2P:

for ( int dev_id = 0; dev_id < num_devices; ++dev_id ) {
    cudaSetDevice( dev_id );
    const int top = dev_id > 0 ? dev_id - 1 : (num_devices-1);
    int canAccessPeer = 0;
    cudaDeviceCanAccessPeer( &canAccessPeer, dev_id, top );
    if ( canAccessPeer ) cudaDeviceEnablePeerAccess( top, 0 );
    const int bottom = (dev_id+1) % num_devices;
    if ( top != bottom ) {
        cudaDeviceCanAccessPeer( &canAccessPeer, dev_id, bottom );
        if ( canAccessPeer ) cudaDeviceEnablePeerAccess( bottom, 0 );
    }
}

SLIDE 19

MULTI GPU JACOBI NVVP TIMELINE

Single Threaded Copy 4 V100 on DGX-2 with P2P

SLIDE 20

MULTI GPU JACOBI RUNTIME

[Chart: parallel efficiency vs. #GPUs (1-16) for Single Threaded Copy and Single Threaded Copy P2P; DGX-2, 18432 x 18432, 1000 iterations]

SLIDE 21

MULTI GPU JACOBI NVVP TIMELINE

Single Threaded Copy 4 V100 on DGX-2 with P2P

SLIDE 22

MULTI THREADED MULTI GPU PROGRAMMING

Using OpenMP:

int num_devices = 0;
cudaGetDeviceCount( &num_devices );

#pragma omp parallel num_threads( num_devices )
{
    int dev_id = omp_get_thread_num();
    cudaSetDevice( dev_id );
}

SLIDE 23

MULTI GPU JACOBI RUNTIME

[Chart: parallel efficiency vs. #GPUs (1-16) for Single Threaded Copy P2P and Multi Threaded Copy (no thread pinning); DGX-2, 18432 x 18432, 1000 iterations]

SLIDE 24

MULTI GPU JACOBI NVVP TIMELINE

Multi Threaded Copy 4 V100 on DGX-2 with P2P

SLIDE 25

GPU/CPU AFFINITY

Querying system topology with nvidia-smi topo -m:

$ nvidia-smi topo -m
        GPU0  GPU1  ...  GPU15  ...  CPU Affinity
GPU0     X    NV6   ...  NV6    ...  0-23,48-71
GPU1    NV6    X    ...  NV6    ...  0-23,48-71
...
GPU7    NV6   NV6   ...  NV6    ...  0-23,48-71
GPU8    NV6   NV6   ...  NV6    ...  24-47,72-95
...
GPU15   NV6   NV6   ...   X     ...  24-47,72-95

Every GPU pair is connected with six NVLinks (NV6). GPUs 0-7 share affinity with CPU 0 (cores 0-23,48-71), GPUs 8-15 with CPU 1 (cores 24-47,72-95).

SLIDE 26

GPU/CPU AFFINITY

Using OpenMP environment variables:

export OMP_PROC_BIND=TRUE
export OMP_PLACES="{0},{1},{2},{3},{4},{5},{6},{7},{24},{25},{26},{27},…,{31}"

SLIDE 27

GPU/CPU AFFINITY

Using OpenMP environment variables:

export OMP_PROC_BIND=TRUE
export OMP_PLACES="{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},…,{15}"

SLIDE 28

MULTI GPU JACOBI RUNTIME

[Chart: parallel efficiency vs. #GPUs (1-16) for Single Threaded Copy P2P, Multi Threaded Copy (no thread pinning) and Multi Threaded Copy; DGX-2, 18432 x 18432, 1000 iterations]

SLIDE 29

MULTI GPU JACOBI NVVP TIMELINE

Multi Threaded Copy 4 V100 on DGX-2 with P2P and thread pinning

SLIDE 30

COMMUNICATION + COMPUTATION OVERLAP

No overlap: the whole domain is processed, then communication follows.

Overlap: the boundary region and the inner domain are processed independently; only the communication depends on the boundary, so the halo exchange and boundary processing can run concurrently with inner-domain processing, giving a possible gain.

SLIDE 31

COMMUNICATION + COMPUTATION OVERLAP

//Compute bulk
cudaStreamWaitEvent( compute_stream, push_top_done[(iter%2)][dev_id], 0 );
cudaStreamWaitEvent( compute_stream, push_bottom_done[(iter%2)][dev_id], 0 );
jacobi_kernel<<<dim_grid,dim_block,0,compute_stream>>>(
    a_new[dev_id], a, l2_norm_d, (iy_start+1), (iy_end[dev_id]-1), nx );

//Compute boundaries
cudaStreamWaitEvent( push_top_stream, reset_l2norm_done, 0 );
cudaStreamWaitEvent( push_top_stream, push_bottom_done[(iter%2)][top], 0 );
jacobi_kernel<<<nx/128+1,128,0,push_top_stream>>>(
    a_new[dev_id], a, l2_norm_d, iy_start, (iy_start+1), nx );

cudaStreamWaitEvent( push_bottom_stream, reset_l2norm_done, 0 );
cudaStreamWaitEvent( push_bottom_stream, push_top_done[(iter%2)][bottom], 0 );
jacobi_kernel<<<nx/128+1,128,0,push_bottom_stream>>>(
    a_new[dev_id], a, l2_norm_d, (iy_end[dev_id]-1), iy_end[dev_id], nx );

//Apply periodic boundary conditions and exchange halo
cudaMemcpyAsync( a_new[top]+(iy_end[top]*nx), a_new[dev_id]+iy_start*nx,
                 nx*sizeof(real), cudaMemcpyDeviceToDevice, push_top_stream );
cudaEventRecord( push_top_done[((iter+1)%2)][dev_id], push_top_stream );
cudaMemcpyAsync( a_new[bottom], a_new[dev_id]+(iy_end[dev_id]-1)*nx,
                 nx*sizeof(real), cudaMemcpyDeviceToDevice, push_bottom_stream );
cudaEventRecord( push_bottom_done[((iter+1)%2)][dev_id], push_bottom_stream );

SLIDE 32

COMMUNICATION + COMPUTATION OVERLAP

High priority streams:

int leastPriority = 0;
int greatestPriority = leastPriority;
cudaDeviceGetStreamPriorityRange( &leastPriority, &greatestPriority );
cudaStreamCreateWithPriority( &compute_stream, cudaStreamDefault, leastPriority );
cudaStreamCreateWithPriority( &push_top_stream, cudaStreamDefault, greatestPriority );
cudaStreamCreateWithPriority( &push_bottom_stream, cudaStreamDefault, greatestPriority );

SLIDE 33

MULTI GPU JACOBI NVVP TIMELINE

Multi Threaded Copy Overlap 4 V100 on DGX-2 with P2P

SLIDE 34

MULTI GPU JACOBI RUNTIME

[Chart: parallel efficiency vs. #GPUs (1-16) for Single Threaded Copy P2P, Multi Threaded Copy and Multi Threaded Copy Overlap; DGX-2, 18432 x 18432, 1000 iterations]

SLIDE 35

MULTI THREADED MULTI GPU PROGRAMMING

Using OpenMP and P2P mappings:

while ( l2_norm > tol && iter < iter_max ) {
    cudaMemsetAsync( l2_norm_d, 0, sizeof(real), compute_stream );
#pragma omp barrier
    cudaStreamWaitEvent( compute_stream, compute_done[iter%2][top], 0 );
    cudaStreamWaitEvent( compute_stream, compute_done[iter%2][bottom], 0 );
    jacobi_kernel<<<dim_grid,dim_block,0,compute_stream>>>(
        a_new[dev_id], a, l2_norm_d, iy_start, iy_end[dev_id], nx,
        a_new[top], iy_end[top], a_new[bottom], 0 );
    cudaEventRecord( compute_done[(iter+1)%2][dev_id], compute_stream );
    cudaMemcpyAsync( l2_norm, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream );
    // l2_norm reduction between threads skipped ...
#pragma omp barrier
    std::swap( a_new[dev_id], a );
    iter++;
}

SLIDE 36

MULTI THREADED MULTI GPU PROGRAMMING

Using OpenMP and P2P mappings:

__global__ void jacobi_kernel( ... )
{
    const int ix = blockIdx.x*blockDim.x + threadIdx.x;
    const int iy = blockIdx.y*blockDim.y + threadIdx.y + iy_start;
    real local_l2_norm = 0.0;
    if ( iy < iy_end && ix >= 1 && ix < (nx-1) ) {
        const real new_val = 0.25 * ( a[iy*nx+ix+1] + a[iy*nx+ix-1]
                                    + a[(iy+1)*nx+ix] + a[(iy-1)*nx+ix] );
        a_new[iy*nx+ix] = new_val;
        // write the halo rows directly into the neighbors' arrays via P2P mappings
        if ( iy_start == iy )     a_new_top[ top_iy*nx + ix ]       = new_val;
        if ( (iy_end - 1) == iy ) a_new_bottom[ bottom_iy*nx + ix ] = new_val;
        real residue = new_val - a[iy*nx+ix];
        local_l2_norm += residue * residue;
    }
    atomicAdd( l2_norm, local_l2_norm );
}

SLIDE 37

MULTI THREADED MULTI GPU PROGRAMMING

L2 norm reduction (can be hidden if the L2 norm check is delayed):

cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream );
#pragma omp barrier
#pragma omp single
{ l2_norm = 0.0; }
#pragma omp barrier
cudaStreamSynchronize( compute_stream );
#pragma omp atomic
l2_norm += *(l2_norm_h);
#pragma omp barrier
#pragma omp single
{ l2_norm = std::sqrt( l2_norm ); }
#pragma omp barrier

SLIDE 38

MULTI THREADED MULTI GPU PROGRAMMING

Delayed L2 norm reduction: issue the D2H copy of the current iteration's L2 norm, then check the L2 norm of the last iteration (hidden behind compute):

cudaMemcpyAsync( l2_norm_h[curr], l2_norm_d[curr], sizeof(real),
                 cudaMemcpyDeviceToHost, compute_stream );
cudaEventRecord( copy_done[curr], compute_stream );
cudaEventSynchronize( copy_done[prev] );
#pragma omp atomic
l2_norm[prev] += *( l2_norm_h[prev] );
#pragma omp barrier
l2_norm[prev] = std::sqrt( l2_norm[prev] );
#pragma omp barrier
l2_norm[prev] = 0.0;

SLIDE 39

MULTI GPU JACOBI NVVP TIMELINE

Multi Threaded P2P 4 V100 and application clocks on DGX-2

SLIDE 40

MULTI GPU JACOBI RUNTIME

[Chart: parallel efficiency vs. #GPUs (1-16) for Single Threaded Copy P2P, Multi Threaded Copy, Multi Threaded Copy Overlap and Multi Threaded P2P (norm opt); DGX-2, 18432 x 18432, 1000 iterations]

SLIDE 41

MESSAGE PASSING INTERFACE - MPI

Standard to exchange data between processes via messages

- Defines an API to exchange messages
  - Point to point: e.g. MPI_Send, MPI_Recv
  - Collectives: e.g. MPI_Reduce
- Multiple implementations (open source and commercial)
  - E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, …
- Bindings for C/C++, Fortran, Python, …

SLIDE 42

MPI - SKELETON

#include <mpi.h>
int main(int argc, char *argv[]) {
    int rank, size;
    /* Initialize the MPI library */
    MPI_Init(&argc, &argv);
    /* Determine the calling process' rank and the total number of ranks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Call MPI routines like MPI_Send, MPI_Recv, ... */
    ...
    /* Shut down the MPI library */
    MPI_Finalize();
    return 0;
}

SLIDE 43

MPI

Compiling and launching:

$ mpicc -o myapp myapp.c
$ mpirun -np 4 ./myapp <args>

mpirun starts four instances of myapp, with ranks 0, 1, 2 and 3.

SLIDE 44

EXAMPLE JACOBI

Top/Bottom Halo

MPI_Sendrecv( a_new+iy_start*nx, nx, MPI_FLOAT, top, 0,
              a_new+(iy_end*nx), nx, MPI_FLOAT, bottom, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );

SLIDE 45

EXAMPLE JACOBI

Top/Bottom Halo

MPI_Sendrecv( a_new+iy_start*nx, nx, MPI_FLOAT, top, 0,
              a_new+(iy_end*nx), nx, MPI_FLOAT, bottom, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );

SLIDE 46

EXAMPLE JACOBI

Top/Bottom Halo

MPI_Sendrecv( a_new+iy_start*nx, nx, MPI_FLOAT, top, 0,        /* (1) */
              a_new+(iy_end*nx), nx, MPI_FLOAT, bottom, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );
MPI_Sendrecv( a_new+(iy_end-1)*nx, nx, MPI_FLOAT, bottom, 0,   /* (2) */
              a_new, nx, MPI_FLOAT, top, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );

SLIDE 47

HANDLING MULTIPLE MULTI GPU NODES

GPU affinity: use the local rank.

int local_rank = -1;
MPI_Comm_rank( local_comm, &local_rank );

int num_devices = 0;
cudaGetDeviceCount( &num_devices );
cudaSetDevice( local_rank % num_devices );

(See backup slides for further details on how to create local_comm.) 12-Mar-19

SLIDE 48

MULTI GPU JACOBI RUNTIME

[Chart: parallel efficiency vs. #GPUs (1-16) for Multi Threaded Copy Overlap and MPI; DGX-2, 18432 x 18432, 1000 iterations]

SLIDE 49

MULTI GPU JACOBI NVVP TIMELINE

MPI 4 V100 on DGX-2

SLIDE 50

COMMUNICATION + COMPUTATION OVERLAP

launch_jacobi_kernel( a_new, a, l2_norm_d, iy_start,     (iy_start+1), nx, push_top_stream );
launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_end-1),   iy_end,       nx, push_bottom_stream );
launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_start+1), (iy_end-1),   nx, compute_stream );

const int top    = rank > 0 ? rank - 1 : (size-1);
const int bottom = (rank+1) % size;

cudaStreamSynchronize( push_top_stream );
MPI_Sendrecv( a_new+iy_start*nx, nx, MPI_REAL_TYPE, top, 0,
              a_new+(iy_end*nx), nx, MPI_REAL_TYPE, bottom, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );
cudaStreamSynchronize( push_bottom_stream );
MPI_Sendrecv( a_new+(iy_end-1)*nx, nx, MPI_REAL_TYPE, bottom, 0,
              a_new, nx, MPI_REAL_TYPE, top, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );

SLIDE 51

MULTI GPU JACOBI NVVP TIMELINE

MPI Overlapping 4 V100 on DGX-2

SLIDE 52

MULTI GPU JACOBI RUNTIME

[Chart: parallel efficiency vs. #GPUs (1-16) for Multi Threaded Copy Overlap, MPI and MPI Overlap; DGX-2, 18432 x 18432, 1000 iterations]

SLIDE 53

NVSHMEM

Implementation of OpenSHMEM, a Partitioned Global Address Space (PGAS) library

- Defines an API to (symmetrically) allocate memory that is remotely accessible
- Defines an API to access remote data
  - One-sided: e.g. shmem_putmem, shmem_getmem
  - Collectives: e.g. shmem_broadcast

NVSHMEM features:
- Symmetric memory allocations in device memory
- Communication API calls on CPU (standard and stream-ordered)
- Allows kernel-side communication (API and LD/ST*)
- Interoperable with MPI

SLIDE 54

NVSHMEM - SKELETON

#include <shmem.h>
#include <shmemx.h>
int main(int argc, char *argv[]) {
    ...
    MPI_Comm mpi_comm;
    shmemx_init_attr_t attr;
    mpi_comm = MPI_COMM_WORLD;
    attr.mpi_comm = &mpi_comm;
    shmemx_init_attr( SHMEMX_INIT_WITH_MPI_COMM, &attr );
    int npes = shmem_n_pes();
    int mype = shmem_my_pe();
    ...
    return 0;
}

SLIDE 55

NVSHMEM – HOST CODE

a     = (float *) shmem_malloc( nx*(chunk_size+2)*sizeof(float) );
a_new = (float *) shmem_malloc( nx*(chunk_size+2)*sizeof(float) );
…
while ( l2_norm > tol && iter < iter_max ) {
    …
    jacobi_kernel<<<dim_grid,dim_block,0,compute_stream>>>(
        a_new, a, l2_norm_d, iy_start, iy_end, nx,
        top, iy_end_top, bottom, iy_start_bottom );
    shmemx_barrier_all_on_stream( compute_stream );
    …
}
shmem_barrier_all();
shmem_free(a);
shmem_free(a_new);

SLIDE 56

NVSHMEM - KERNEL

Optimized for DGX-2/NVLink; another approach might be better for portable performance.

__global__ void jacobi_kernel( ... )
{
    const int ix = blockIdx.x*blockDim.x + threadIdx.x;
    const int iy = blockIdx.y*blockDim.y + threadIdx.y + iy_start;
    float local_l2_norm = 0.0;
    if ( iy < iy_end && ix >= 1 && ix < (nx-1) ) {
        const float new_val = 0.25 * ( a[iy*nx+ix+1] + a[iy*nx+ix-1]
                                     + a[(iy+1)*nx+ix] + a[(iy-1)*nx+ix] );
        a_new[iy*nx+ix] = new_val;
        // push the halo rows to the neighbor PEs with one-sided puts
        if ( iy_start == iy )     shmem_float_p( a_new + top_iy*nx + ix,    new_val, top_pe );
        if ( (iy_end - 1) == iy ) shmem_float_p( a_new + bottom_iy*nx + ix, new_val, bottom_pe );
        float residue = new_val - a[iy*nx+ix];
        local_l2_norm += residue * residue;
    }
    atomicAdd( l2_norm, local_l2_norm );
}

SLIDE 57

MULTI GPU JACOBI RUNTIME

[Chart: parallel efficiency vs. #GPUs (1-16) for Multi Threaded Copy Overlap, MPI, MPI Overlap and NVSHMEM; DGX-2, 18432 x 18432, 1000 iterations]

SLIDE 58

NOT COVERED IN THIS TALK I

MPI with GPUDirect Async support (under development):

- Enables MPI communication to follow CUDA stream ordering
- Avoids unwanted CPU/GPU synchronization

SLIDE 59

NOT COVERED IN THIS TALK II

NCCL: accelerating multi-GPU collective communications

GOAL: a library for collective communication using CUDA kernels for reduction and data movement.

APPROACH:
- Allreduce, Reduce, Broadcast, ReduceScatter and Allgather primitives, similar to MPI primitives.
- CUDA oriented: works on CUDA pointers only, enqueues operations to CUDA streams.
- Supports any mapping between processes, threads and GPUs per thread, to integrate into any hybrid model.

SLIDE 60

CONCLUSION

CE9104 - Connect with the Experts: Multi-GPU Programming: 3PM today (directly after this talk), SJCC Hall 3 Pod C (Concourse Level)

Source is on GitHub: https://github.com/NVIDIA/multi-gpu-programming-models

*Please reach out to nvshmem@nvidia.com for early access to NVSHMEM

Thank you for your attention!

Programming Model     Uses                            GPUDirect P2P/RDMA   Multi Node
Single Threaded       CUDA                            Improves perf.       No
Multi Threaded        CUDA + OpenMP/TBB/…             Improves perf.       No
Multi Threaded P2P    CUDA + OpenMP/TBB/…             Required             No
MPI                   CUDA + MPI                      Improves perf.       Yes
NVSHMEM*              CUDA + NVSHMEM (MPI interop.)   Required             Yes

SLIDE 61

BACKUP

SLIDE 62

HANDLING MULTIPLE MULTI GPU NODES

[Diagram: 32 MPI ranks distributed across four multi-GPU nodes: ranks 0-7, 8-15, 16-23, 24-31]

SLIDE 63

HANDLING MULTIPLE MULTI GPU NODES

How to determine the local rank? MPI-3:

MPI_Comm local_comm;
MPI_Info info;
MPI_Info_create( &info );
MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, info, &local_comm );

int local_rank = -1;
MPI_Comm_rank( local_comm, &local_rank );

MPI_Comm_free( &local_comm );
MPI_Info_free( &info );

SLIDE 64

HANDLING MULTIPLE MULTI GPU NODES

[Diagram: MPI_Comm_split_type creates one shared-memory communicator ("Shared Comm") per node, so the global ranks 0-7, 8-15, 16-23 and 24-31 each map to local ranks 0-7 within their node]

SLIDE 65

MULTI GPU JACOBI NVVP TIMELINE

MPI 1 V100 on DGX-2