MULTI-GPU PROGRAMMING MODELS Jiri Kraus, Senior Devtech Compute - PowerPoint PPT Presentation

MULTI-GPU PROGRAMMING MODELS Jiri Kraus, Senior Devtech Compute Sreeram Potluri, Senior CUDA Software Engineer

MOTIVATION Why use multiple GPUs? Need to compute larger, e.g. bigger networks, car models, … Need to compute faster, e.g. weather prediction Better energy efficiency with dense nodes with multiple GPUs 3

DGX-1 Two fully connected quads, connected at corners GPU0 GPU1 GPU5 GPU4 160GB/s per GPU bidirectional to Peers Load/store access to Peer Memory GPU2 GPU3 GPU7 GPU6 Full atomics to Peer GPUs High speed copy engines for bulk CPU 0 CPU 1 data copy 0 - 19 20-39 PCIe to/from CPU 4

EXAMPLE: JACOBI SOLVER Solves the 2D-Laplace Equation on a rectangle ∆𝒗 𝒚, 𝒛 = 𝟏 ∀ 𝒚, 𝒛 ∈ Ω\𝜺Ω Dirichlet boundary conditions (constant values on boundaries) on left and right boundary Periodic boundary conditions on top and bottom boundary 5

EXAMPLE: JACOBI SOLVER Single GPU While not converged Do Jacobi step: for( int iy = 1 ; iy < ny - 1 ; iy ++ ) for( int ix = 1 ; ix < ny - 1 ; ix ++ ) a_new [ iy * nx + ix ] = - 0.25 * -( a [ iy * nx +( ix + 1 )] + a [ iy * nx + ix - 1 ] + a [( iy - 1 )* nx + ix ] + a [( iy + 1 )* nx + ix ] ); Apply periodic boundary conditions Swap a_new and a Next iteration 6

DOMAIN DECOMPOSITION Different Ways to split the work between processes: Minimize number of neighbors: Minimize surface area/volume ratio: Communicate to less neighbors Communicate less data Optimal for latency bound communication Optimal for bandwidth bound communication Contiguous if data Contiguous if data is row-major is column-major 7

EXAMPLE: JACOBI SOLVER Multi GPU While not converged Do Jacobi step: for ( int iy = iy_start; iy < iy_end; iy ++ ) for( int ix = 1 ; ix < ny - 1 ; ix ++ ) a_new [ iy * nx + ix ] = - 0.25 * -( a [ iy * nx +( ix + 1 )] + a [ iy * nx + ix - 1 ] + a [( iy - 1 )* nx + ix ] + a [( iy + 1 )* nx + ix ] ); Apply periodic boundary conditions One-step with ring exchange Exchange halo with 2 neighbors Swap a_new and a Next iteration 8

SINGLE THREADED MULTI GPU PROGRAMMING while ( l2_norm > tol && iter < iter_max ) { for ( int dev_id = 0 ; dev_id < num_devices ; ++ dev_id ) { const int top = dev_id > 0 ? dev_id - 1 : ( num_devices - 1 ); const int bottom = ( dev_id + 1 )% num_devices ; cudaSetDevice( dev_id ); cudaMemsetAsync ( l2_norm_d [ dev_id ], 0 , sizeof( real ) ); jacobi_kernel <<< dim_grid , dim_block >>>( a_new [ dev_id ], a [ dev_id ], l2_norm_d [ dev_id ], iy_start [ dev_id ], iy_end [ dev_id ], nx ); cudaMemcpyAsync ( l2_norm_h [ dev_id ], l2_norm_d [ dev_id ], sizeof( real ), cudaMemcpyDeviceToHost ); cudaMemcpyAsync ( a_new [ top ]+( iy_end [ top ]* nx ), a_new [ dev_id ]+ iy_start [ dev_id ]* nx , nx *sizeof( real ), ...); cudaMemcpyAsync ( a_new [ bottom ], a_new [ dev_id ]+( iy_end [ dev_id ]- 1 )* nx , nx *sizeof( real ), ... ); } l2_norm = 0.0 ; for ( int dev_id = 0 ; dev_id < num_devices ; ++ dev_id ) { cudaSetDevice( dev_id ); cudaDeviceSynchronize (); l2_norm += *( l2_norm_h [ dev_id ]); } l2_norm = std :: sqrt ( l2_norm ); for ( int dev_id = 0 ; dev_id < num_devices ; ++ dev_id ) std :: swap ( a_new [ dev_id ], a [ dev_id ]); iter ++; } 9

EXAMPLE JACOBI Top/Bottom Halo cudaMemcpyAsync ( a_new [ top ]+( iy_end [ top ]* nx ), a_new [ dev_id ]+ iy_start [ dev_id ]* nx , nx *sizeof( real ), ...); 5/10/2 10 017

EXAMPLE JACOBI Top/Bottom Halo cudaMemcpyAsync ( 1 a_new [ top ]+( iy_end [ top ]* nx ), a_new [ dev_id ]+ iy_start [ dev_id ]* nx , nx *sizeof( real ), ...); 1 5/10/2 11 017

EXAMPLE JACOBI Top/Bottom Halo 2 cudaMemcpyAsync ( cudaMemcpyAsync ( 1 a_new [ top ]+( iy_end [ top ]* nx ), a_new [ top ]+( iy_end [ top ]* nx ), a_new [ dev_id ]+ iy_start [ dev_id ]* nx , nx *sizeof( real ), ...); a_new [ dev_id ]+ iy_start [ dev_id ]* nx , nx *sizeof( real ), ...); cudaMemcpyAsync ( a_new [ bottom ], 2 a_new [ dev_id ]+( iy_end [ dev_id ]- 1 )* nx , nx *sizeof( real ), ... ); 1 5/10/2 12 017

SCALABILTY METRICS FOR SUCCESS Serial Time: 𝑈 𝑡 : How long it takes to run the problem with a single process Parallel Time: 𝑈 𝑞 : How long it takes to run the problem with multiple processes Number of Processes: 𝑄 : The number of Processes operating on the task at hand 𝑈 Speedup: 𝑇 = 𝑞 : How much faster is the parallel version vs. serial. (optimal is 𝑄 ) 𝑡 𝑈 𝑇 Efficiency: 𝐹 = 𝑄 : How efficient are the processors used (optimal is 1 ) 13

MULTI GPU JACOBI RUNTIME DGX1 - 1024 x 1024, 1000 iterations Chart Title 2,5 120,00% 100,00% 2 Parallel Efficiency 80,00% Runtime (s) 1,5 60,00% 1 40,00% 0,5 20,00% 0 0,00% 1 2 3 4 5 6 7 8 #GPUs Single Threaded Copy Parallel Efficiency 14

MULTI GPU JACOBI NVVP TIMELINE Single Threaded Copy 4 P100 on DGX-1 15

MULTI GPU JACOBI NVVP TIMELINE Single Threaded Copy 4 P100 on DGX-1 16

GPUDIRECT P2P MEM MEM MEM MEM MEM MEM MEM GPU0 GPU1 MEM GPU5 GPU4 MEM MEM MEM MEM MEM MEM MEM MEM GPU2 GPU3 GPU7 GPU6 Maximizes intra node inter GPU Bandwidth Avoids Host memory and system topology bottlenecks 17

GPUDIRECT P2P Enable P2P for ( int dev_id = 0 ; dev_id < num_devices ; ++ dev_id ) { cudaSetDevice ( dev_id ); const int top = dev_id > 0 ? dev_id - 1 : ( num_devices - 1 ); int canAccessPeer = 0 ; cudaDeviceCanAccessPeer ( & canAccessPeer , dev_id , top ); if ( canAccessPeer ) cudaDeviceEnablePeerAccess ( top , 0 ); const int bottom = ( dev_id + 1 )% num_devices ; if ( top != bottom ) { cudaDeviceCanAccessPeer ( & canAccessPeer , dev_id , bottom ); if ( canAccessPeer ) cudaDeviceEnablePeerAccess ( bottom , 0 ); } } 18

MULTI GPU JACOBI NVVP TIMELINE Single Threaded Copy 4 P100 on DGX-1 with P2P 19

MULTI GPU JACOBI RUNTIME DGX1 - 1024 x 1024, 1000 iterations Chart Title 120,00% 100,00% Parallel Efficiency 80,00% 60,00% 40,00% 20,00% 0,00% 1 2 3 4 5 6 7 8 #GPUs Single Threaded Copy Single Threaded Copy P2P 20

1D RING EXCHANGE … Halo updates for 1D domain decomposition with periodic boundary conditions Unidirectional rings are important building block for collective algorithms 21

MAPPING 1D RING EXCHANGE TO DGX-1 GPU0 GPU1 GPU5 GPU4 GPU2 GPU3 GPU7 GPU6 Dom. Dom. Dom. Dom. Dom. Dom. Dom. Rank 0 1 2 3 4 5 6 7 22

MAPPING 1D RING EXCHANGE TO DGX-1 GPU0 GPU1 GPU5 GPU4 GPU2 GPU3 GPU7 GPU6 Dom. Dom. Dom. Dom. Dom. Dom. Dom. Rank 0 1 2 3 4 5 6 7 export CUDA_VISIBLE_DEVICES = "0,1,2,3,7,6,5,4“ 23

MULTI GPU JACOBI RUNTIME DGX1 - 1024 x 1024, 1000 iterations Chart Title 120,00% 100,00% Parallel Efficiency 80,00% 60,00% 40,00% 20,00% 0,00% 1 2 3 4 5 6 7 8 #GPUs Single Threaded Copy Single Threaded Copy P2P (no opt) Single Threaded Copy P2P 24

MULTI GPU JACOBI NVVP TIMELINE Single Threaded Copy 4 P100 on DGX-1 with P2P 25

MULTI THREADED MULTI GPU PROGRAMMING Using OpenMP int num_devices = 0 ; cudaGetDeviceCount ( & num_devices ); #pragma omp parallel num_threads( num_devices ) { int dev_id = omp_get_thread_num (); cudaSetDevice ( dev_id ); } 26

MULTI GPU JACOBI NVVP TIMELINE Multi Threaded Copy 4 P100 on DGX-1 with P2P 27

MULTI GPU JACOBI RUNTIME DGX1 - 1024 x 1024, 1000 iterations Chart Title 120,00% 100,00% Parallel Efficiency 80,00% 60,00% 40,00% 20,00% 0,00% 1 2 3 4 5 6 7 8 #GPUs Single Threaded Copy P2P Multi Threaded Copy (no thread pinning) 28

GPU/CPU AFFINITY GPU0 GPU1 GPU5 GPU4 GPU2 GPU3 GPU7 GPU6 CPU 0 CPU 1 0 - 19 20-39 thread thread thread thread thread thread thread thread 0 1 2 3 4 5 6 7 29

GPU/CPU AFFINITY Querying system topology with nvidia-smi topo – m $ nvidia-smi topo -m GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_2 mlx5_1 mlx5_3 CPU Affinity GPU0 X NV1 NV1 NV1 NV1 SOC SOC SOC PIX SOC PHB SOC 0-19 CPU 0 GPU1 NV1 X NV1 NV1 SOC NV1 SOC SOC PIX SOC PHB SOC 0-19 GPU2 NV1 NV1 X NV1 SOC SOC NV1 SOC PHB SOC PIX SOC 0-19 GPU3 NV1 NV1 NV1 X SOC SOC SOC NV1 PHB SOC PIX SOC 0-19 GPU4 NV1 SOC SOC SOC X NV1 NV1 NV1 SOC PIX SOC PHB 20-39 CPU 1 GPU5 SOC NV1 SOC SOC NV1 X NV1 NV1 SOC PIX SOC PHB 20-39 GPU6 SOC SOC NV1 SOC NV1 NV1 X NV1 SOC PHB SOC PIX 20-39 GPU7 SOC SOC SOC NV1 NV1 NV1 NV1 X SOC PHB SOC PIX 20-39 mlx5_0 PIX PIX PHB PHB SOC SOC SOC SOC X SOC PHB SOC mlx5_2 SOC SOC SOC SOC PIX PIX PHB PHB SOC X SOC PHB mlx5_1 PHB PHB PIX PIX SOC SOC SOC SOC PHB SOC X SOC mlx5_3 SOC SOC SOC SOC PHB PHB PIX PIX SOC PHB SOC X Legend: 30

GPU/CPU AFFINITY Using CUDA_VISIBLE_DEVICES and OpenMP env. vars. export OMP_PROC_BIND = TRUE export CUDA_VISIBLE_DEVICES = "0,1,2,3,7,6,5,4“ export OMP_PLACES = "{0},{1},{2},{3},{20},{21},{22},{23}" 31

MULTI GPU JACOBI RUNTIME DGX1 - 1024 x 1024, 1000 iterations Chart Title 120,00% 100,00% Parallel Efficiency 80,00% 60,00% 40,00% 20,00% 0,00% 1 2 3 4 5 6 7 8 #GPUs Single Threaded Copy P2P Multi Threaded Copy (no thread pinning) Multi Threaded Copy 33

MULTI-GPU PROGRAMMING MODELS Jiri Kraus, Senior Devtech Compute - PowerPoint PPT Presentation

MULTI-GPU PROGRAMMING MODELS Jiri Kraus, Senior Devtech Compute Sreeram Potluri, Senior CUDA Software Engineer MOTIVATION Why use multiple GPUs? Need to compute larger, e.g. bigger networks, car models, Need to compute faster, e.g.

GPU programming in Haskell Henning Thielemann 2015-01-23 GPU programming in Haskell Motivation:

GPU PROGRAMMING 2 GPU Programming Assignment 4 Consists of

SLICING THE WORKLOAD MULTI-GPU OPENGL RENDERING APPROACHES INGO ESSER NVIDIA DEVTECH PROVIZ

MULTI-GPU TRAINING WITH NCCL Sylvain Jeaugey MULTI-GPU COMPUTING Harvesting the power of

Super GPU & Super Kernels: Make programming of multi-GPU systems easy Michael Frumkin, May 8,

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

GPU programming Dr. Bernhard Kainz 1 Overview About myself Last week Motivation GPU

S7546 Multi-GPU Programming with OpenACC Jeff Larkin, May 9, 2017, GTC17 Multi-GPU

MULTI GPU PROGRAMMING WITH MPI Jiri Kraus, Senior Devtech Compute, April 4th 2016 MPI+CUDA

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

Advancements in V-Ray RT GPU Vlado Koylazov, CTO & Co-founder Blagovest Taskov, RT GPU Team

GPU Architecture and chitecture and GPU Ar The good The good The bad The bad

Real-Time GPU Management Heechul Yun 1 This Week Topic: General Purpose Graphic Processing

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU Steve Rennich, Darko Stosic, Tim

Use Tesla to provide first GPU VM Service in China Feng Zhu

Form 1: Presentation Checklist Amending The Syllabus Template Cady-Mae Koon 12/13/2018 Name :

When a Single IR IRB Reviews for Multiple Sites: The Complexities of Simplification John

Device Placement Optimization with Reinforcement Learning Azalia Mirhoseini et al. (Google, ICML

A journey of a thousand miles begins with a single step BATTLEFIELD HIGH SCHOOL CLASS OF

Migrating a Java Solution to HP Integrity NonStop A mix of Whats, Whys and Hows Moore Ewing

Waste: a resource to recycle, reuse and recover raw materials no Topic Scope Call Project

Addressing dietary inequalities Information/ media campaigns largely ineffective among

TRAINING PROGRAM STANDARD RFP & PREPARATION OF RFP FOR A SPECIFIC ASSIGNMENT 1 Request

MULTI-GPU PROGRAMMING MODELS Jiri Kraus, Senior Devtech Compute - PowerPoint PPT Presentation

MULTI-GPU PROGRAMMING MODELS Jiri Kraus, Senior Devtech Compute Sreeram Potluri, Senior CUDA Software Engineer MOTIVATION Why use multiple GPUs? Need to compute larger, e.g. bigger networks, car models, Need to compute faster, e.g.

GPU programming in Haskell Henning Thielemann 2015-01-23 GPU programming in Haskell Motivation:

GPU PROGRAMMING 2 GPU Programming Assignment 4 Consists of

SLICING THE WORKLOAD MULTI-GPU OPENGL RENDERING APPROACHES INGO ESSER NVIDIA DEVTECH PROVIZ

MULTI-GPU TRAINING WITH NCCL Sylvain Jeaugey MULTI-GPU COMPUTING Harvesting the power of

Super GPU &amp; Super Kernels: Make programming of multi-GPU systems easy Michael Frumkin, May 8,

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

GPU programming Dr. Bernhard Kainz 1 Overview About myself Last week Motivation GPU

S7546 Multi-GPU Programming with OpenACC Jeff Larkin, May 9, 2017, GTC17 Multi-GPU

MULTI GPU PROGRAMMING WITH MPI Jiri Kraus, Senior Devtech Compute, April 4th 2016 MPI+CUDA

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

Advancements in V-Ray RT GPU Vlado Koylazov, CTO &amp; Co-founder Blagovest Taskov, RT GPU Team

GPU Architecture and chitecture and GPU Ar The good The good The bad The bad

Real-Time GPU Management Heechul Yun 1 This Week Topic: General Purpose Graphic Processing

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU Steve Rennich, Darko Stosic, Tim

Use Tesla to provide first GPU VM Service in China Feng Zhu

Form 1: Presentation Checklist Amending The Syllabus Template Cady-Mae Koon 12/13/2018 Name :

When a Single IR IRB Reviews for Multiple Sites: The Complexities of Simplification John

Device Placement Optimization with Reinforcement Learning Azalia Mirhoseini et al. (Google, ICML

A journey of a thousand miles begins with a single step BATTLEFIELD HIGH SCHOOL CLASS OF

Migrating a Java Solution to HP Integrity NonStop A mix of Whats, Whys and Hows Moore Ewing

Waste: a resource to recycle, reuse and recover raw materials no Topic Scope Call Project

Addressing dietary inequalities Information/ media campaigns largely ineffective among

TRAINING PROGRAM STANDARD RFP &amp; PREPARATION OF RFP FOR A SPECIFIC ASSIGNMENT 1 Request

Super GPU & Super Kernels: Make programming of multi-GPU systems easy Michael Frumkin, May 8,

Advancements in V-Ray RT GPU Vlado Koylazov, CTO & Co-founder Blagovest Taskov, RT GPU Team

TRAINING PROGRAM STANDARD RFP & PREPARATION OF RFP FOR A SPECIFIC ASSIGNMENT 1 Request