S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY FOR NVIDIA GPU CLUSTERS


SLIDE 1

S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY FOR NVIDIA GPU CLUSTERS

Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA

SLIDE 2

AGENDA

• GPU Programming Models
• Overview of NVSHMEM
• Porting to NVSHMEM
• Future Work
• Conclusion

SLIDE 3

GPU FOR COMPUTE OFFLOAD

• Compute on GPU, communication from CPU
• Synchronization at the boundaries
• Offload latencies sit in the critical path
• Hiding them increases code complexity

[Figure: timeline of CPU-driven communication - the CPU launches cuda_kernel<<<...>>>, waits with cudaStreamSynchronize(), then drives the transfer with MPI_Isend/MPI_Wait over PCIe/Network]

SLIDE 4

GPU MASTERS COMMUNICATION

• Avoids offload latencies
• Compute-communication overlap
• Easier to express algorithms with inline communication
• Improves performance while making it easier to program

[Figure: timeline of GPU-initiated communication - the CPU only launches cuda_kernel<<<...>>> and calls cudaStreamSynchronize(); the kernel itself issues shmem_put operations and a shmem_quiet directly to the network]

SLIDE 5

AGENDA

• GPU Programming Models
• Overview of NVSHMEM
• Porting to NVSHMEM
• Future Work
• Conclusion

SLIDE 6

WHAT IS NVSHMEM?

• Experimental implementation of OpenSHMEM for NVIDIA GPUs, 1 PE per GPU
• Shared (symmetric) memory: allocated with shmem_malloc
• Private memory: allocated with cudaMalloc
• shmem communication APIs work shared->shared or private->shared (see the sketch below)
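A minimal host-side sketch of this memory model, assuming the OpenSHMEM-style names used throughout this deck (shmem_init, shmem_my_pe, shmem_malloc, shmem_free, shmem_finalize); a given NVSHMEM release may spell these differently, e.g. with an nvshmem_ prefix:

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <shmem.h>   // OpenSHMEM-style API, as named on this slide

    int main(void) {
        shmem_init();                     // one PE per GPU
        int mype = shmem_my_pe();
        int npes = shmem_n_pes();
        printf("PE %d of %d\n", mype, npes);

        // Shared (symmetric) memory: same allocation on every PE,
        // remotely accessible through the shmem communication APIs.
        float *sym = (float *) shmem_malloc(1024 * sizeof(float));

        // Private memory: local to this PE's GPU; can be the source of a
        // private->shared transfer, but never a remote target.
        float *priv = NULL;
        cudaMalloc((void **) &priv, 1024 * sizeof(float));

        /* ... launch kernels that communicate through sym ... */

        cudaFree(priv);
        shmem_free(sym);
        shmem_finalize();
        return 0;
    }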

SLIDE 7

DEVICE-INITIATED COMMUNICATION

• Thread-level communication APIs
• Allow finer-grained control and overlap
• Map well onto the NVLink fabric - DGX-1/DGX-2

[Figure: PE i exchanges boundary rows with PE i-1 and PE i+1; each local block is ny rows of nx elements]

__global__ void stencil_single_step (float *u, ...) {
    int ix = threadIdx.x, iy = threadIdx.y;
    // compute ...
    // data exchange: single-element puts of the boundary rows
    if (iy == ny) {
        shmem_float_p (u + ny*nx + ix, *(u + ix), top_pe);
    }
    if (iy == 1) {
        shmem_float_p (u + nx + ix, *(u + (ny+1)*nx + ix), bottom_pe);
    }
}


SLIDE 8

THREAD-GROUP COMMUNICATION

• Operations can be issued by a whole warp/CTA
• Coarser, hence more efficient, transfers over networks like IB
• Still allows inter-warp/inter-block overlap

[Figure: PE i pushes whole boundary rows to PE i-1 and PE i+1]

__global__ void stencil_single_step (float *u, ...) {
    // compute ...
    // data exchange: the thread block issues whole-row, non-blocking puts
    shmem_float_put_block_nbi (u + ny*nx, u, nx, top_pe);
    shmem_float_put_block_nbi (u + nx, u + (ny+1)*nx, nx, bottom_pe);
}

SLIDE 9

IN-KERNEL SYNCHRONIZATION

• Allows inter-PE synchronization from inside the kernel
• Can offload larger portions of the application as long-running CUDA kernels
• Combines data transfer + synchronization in a single kernel

[Figure: PE i synchronizes with PE i-1 and PE i+1 after the data transfer]

__global__ void stencil_uber (float *u, ...) {
    for (int iter = 0; iter < N; iter++) {
        // compute ...
        // data exchange
        shmem_float_put_nbi_block (u + ny*nx, u, nx, top_pe);
        shmem_float_put_nbi_block (u + nx, u + (ny+1)*nx, nx, bottom_pe);
        shmem_barrier_all();
    }
}

SLIDE 10

COLLECTIVE KERNEL LAUNCH

                                     GPUs supported    CUDA kernel launch
Device-Initiated Communication       Kepler or newer   regular <<<>>> or launch APIs
Device-Initiated Synchronization     Volta or newer    shmemx_collective_launch

• Provides progress when using device-side inter-kernel synchronization
• Built on CUDA cooperative launch and the 1 PE/GPU requirement (a launch sketch follows)
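A minimal launch sketch; the assumption, consistent with how the call appears on the "HOST CODE - NVSHMEM FOR COMMS + SYNC" slide later in this deck, is that shmemx_collective_launch takes the same argument list as cudaLaunchCooperativeKernel (kernel, grid, block, argument array, dynamic shared memory bytes, stream). The kernel and its arguments here are hypothetical:

    __global__ void jacobi_kernel(float *a_new, float *a);  // illustration only

    void launch(float *a_new, float *a, cudaStream_t compute_stream) {
        dim3 dim_grid(16, 16), dim_block(32, 8);
        void *args[] = { &a_new, &a };
        // Collective launch: needed for device-side synchronization so that
        // all CTAs on all PEs are co-resident and can make progress together.
        shmemx_collective_launch((const void *) jacobi_kernel, dim_grid,
                                 dim_block, args, 0, compute_stream);
    }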

SLIDE 11

STREAM-ORDERED OPERATIONS

• Not always optimal to move all communication/synchronization into CUDA kernels
• Inter-CTA synchronization latencies can be longer than kernel launch latencies
• Allows mixing fine-grained communication with coarse-grained synchronization (see the sketch after the figure)

[Figure: two GPUs connected over PCIe/Network; shmem_barrier_all_on_stream is enqueued on each GPU's stream between kernel launches]
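A host-side sketch of this mixing, assuming the shmemx_barrier_all_on_stream API named above and a hypothetical kernel step that issues its fine-grained shmem puts internally:

    __global__ void step(float *u);   // issues fine-grained shmem_put calls inside

    void run(float *u, int niters, cudaStream_t stream) {
        for (int iter = 0; iter < niters; ++iter) {
            step<<<128, 256, 0, stream>>>(u);      // fine-grained communication
            shmemx_barrier_all_on_stream(stream);  // coarse-grained synchronization,
        }                                          // enqueued on the stream: no CPU round-trip
        cudaStreamSynchronize(stream);             // join the host once at the end
    }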

SLIDE 12

INTRA-NODE IMPLEMENTATION

• Over NVLink or PCIe, uses CUDA IPC under the hood
• shmem_put/get on the device map to ld/st instructions (a sketch follows the figure)
• shmem_put/get_on_stream map to cudaMemcpyAsync

[Figure: GPU 0, GPU 1, and GPU 2 each map the peers' physical symmetric memory into their own virtual address space]
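A sketch of what "put/get becomes ld/st" means. shmem_ptr is the standard OpenSHMEM way to obtain a direct pointer to a peer's symmetric object; treating it as callable from device code is an assumption made here purely for illustration:

    __global__ void put_as_store(float *sym, int peer, int n) {
        // With CUDA IPC, the peer's symmetric heap is mapped into this
        // process's virtual address space, so a "put" is an ordinary store.
        float *remote = (float *) shmem_ptr(sym, peer);
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (remote != NULL && i < n)
            remote[i] = 1.0f;   // direct st over NVLink/PCIe, no library call
    }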

SLIDE 13

MULTI-NODE SUPPORT

• Reverse-offloads network transfers to the CPU
• Avoids memory fences when signaling the CPU
• Uses standard IB verbs (Mellanox OFED for GPUDirect RDMA); a simplified sketch of the proxy pattern follows the figure

[Figure: the GPU signals a proxy thread on the CPU through a ring buffer; the proxy posts the transfers to the network on an IB QP]
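A deliberately simplified sketch of the reverse-offload pattern; every name here is hypothetical, and post_send() stands in for the actual IB verbs calls (ibv_post_send and friends), not the NVSHMEM implementation:

    #include <cstddef>

    struct Desc { const void *src; void *dst; size_t bytes; int pe; };

    struct Ring {
        Desc slots[1024];
        volatile unsigned head;   // advanced by the GPU (producer)
        unsigned tail;            // advanced by the CPU proxy (consumer)
    };

    void post_send(const void *src, void *dst, size_t bytes, int pe);  // placeholder

    // CPU proxy thread: drain descriptors written by the GPU and drive the QP.
    void proxy_loop(Ring *ring, volatile bool *stop) {
        while (!*stop) {
            while (ring->tail != ring->head) {           // new work from the GPU
                Desc d = ring->slots[ring->tail % 1024];
                post_send(d.src, d.dst, d.bytes, d.pe);  // network transfer on the CPU
                ++ring->tail;
            }
        }
    }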

SLIDE 14

NVSHMEM STATUS

• Research vehicle for designing and evaluating GPU-centric workloads
• Early access (EA2) available - please reach out to nvshmem@nvidia.com
• Main features:
  • NVLink and PCIe support
  • InfiniBand support (new)
  • x86 and POWER9 support (new)
  • Interoperability with MPI and OpenSHMEM libraries (new)

SLIDE 15

AGENDA

• GPU Programming Models
• Overview of NVSHMEM
• Porting to NVSHMEM
• Future Work
• Conclusion

SLIDE 16

PORTING TO USE NVSHMEM FROM GPU

Step I: only communication from inside the kernel (on Kepler or newer GPUs)

Step II: both communication and synchronization from inside the kernel (on Pascal or newer Tesla GPUs)

Using the Jacobi solver, we will walk through Steps I and II and compare them with the MPI version. Code available at: github.com/NVIDIA/multi-gpu-programming-models

See also GTC 2019 S9139, Multi-GPU Programming Models - Jiri Kraus, Senior DevTech Compute, NVIDIA

SLIDE 17

EXAMPLE: JACOBI SOLVER

while not converged:
    // Jacobi step
    for (int iy = 1; iy < ny-1; iy++)
        for (int ix = 1; ix < nx-1; ix++)
            a_new[iy*nx+ix] = 0.25 * ( a[ iy   *nx+(ix+1)] + a[ iy   *nx+(ix-1)]
                                     + a[(iy-1)*nx+ ix   ] + a[(iy+1)*nx+ ix   ] );
    // apply periodic boundary conditions
    // swap a_new and a
    // next iteration

SLIDE 18

COMPUTE KERNEL – SINGLE GPU

__global__ void jacobi_kernel( ... ) {
    const int ix = blockIdx.x * blockDim.x + threadIdx.x;
    const int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
    real local_l2_norm = 0.0;
    if ( iy < iy_end && ix >= 1 && ix < (nx-1) ) {
        const real new_val = 0.25 * ( a[ iy * nx + ix + 1 ] + a[ iy * nx + ix - 1 ]
                                    + a[ (iy+1) * nx + ix ] + a[ (iy-1) * nx + ix ] );
        a_new[ iy * nx + ix ] = new_val;
        real residue = new_val - a[ iy * nx + ix ];
        local_l2_norm += residue * residue;
    }
    atomicAdd( l2_norm, local_l2_norm );
}

github.com/NVIDIA/multi-gpu-programming-models/tree/master/single_gpu

SLIDE 19

HOST CODE - MPI

[Figure: three CUDA streams; the norm path runs once every n iterations.
compute_stream: cudaMemsetAsync(norm), cudaEventRecord(event0), compute_jacobi<<<>>>(interior), cudaStreamWaitEvent(event1), cudaStreamWaitEvent(event2), cudaMemcpyAsync(norm), MPI_Allreduce(norm).
top_stream: cudaStreamWaitEvent(event0), compute_jacobi<<<>>>(top_boundary), cudaStreamSynchronize(), MPI_SendRecv(top), cudaEventRecord(event1).
bottom_stream: cudaStreamWaitEvent(event0), compute_jacobi<<<>>>(bottom_boundary), cudaStreamSynchronize(), MPI_SendRecv(bottom), cudaEventRecord(event2).]

SLIDE 20

HOST CODE - MPI

while ( l2_norm > tol && iter < iter_max ) {
    // reset norm
    CUDA_RT_CALL( cudaMemsetAsync( l2_norm_d, 0, sizeof(real), compute_stream ) );
    CUDA_RT_CALL( cudaEventRecord( reset_l2norm_done, compute_stream ) );

    // compute boundary
    CUDA_RT_CALL( cudaStreamWaitEvent( push_top_stream, reset_l2norm_done, 0 ) );
    launch_jacobi_kernel( a_new, a, l2_norm_d, iy_start, (iy_start+1), nx, push_top_stream );
    CUDA_RT_CALL( cudaEventRecord( push_top_done, push_top_stream ) );

    CUDA_RT_CALL( cudaStreamWaitEvent( push_bottom_stream, reset_l2norm_done, 0 ) );
    launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_end-1), iy_end, nx, push_bottom_stream );
    CUDA_RT_CALL( cudaEventRecord( push_bottom_done, push_bottom_stream ) );

    // compute interior
    launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_start+1), (iy_end-1), nx, compute_stream );

    // apply periodic boundary conditions
    CUDA_RT_CALL( cudaStreamSynchronize( push_top_stream ) );
    MPI_CALL( MPI_Sendrecv( a_new + iy_start*nx, nx, MPI_REAL_TYPE, top,    0,
                            a_new + iy_end*nx,   nx, MPI_REAL_TYPE, bottom, 0,
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE ) );
    CUDA_RT_CALL( cudaStreamSynchronize( push_bottom_stream ) );
    MPI_CALL( MPI_Sendrecv( a_new + (iy_end-1)*nx, nx, MPI_REAL_TYPE, bottom, 0,
                            a_new,                 nx, MPI_REAL_TYPE, top,    0,
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE ) );

    // periodic convergence check
    if ( (iter % nccheck) == 0 || (!csv && (iter % 100) == 0) ) {
        CUDA_RT_CALL( cudaStreamWaitEvent( compute_stream, push_top_done, 0 ) );
        CUDA_RT_CALL( cudaStreamWaitEvent( compute_stream, push_bottom_done, 0 ) );
        CUDA_RT_CALL( cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real),
                                       cudaMemcpyDeviceToHost, compute_stream ) );
        CUDA_RT_CALL( cudaStreamSynchronize( compute_stream ) );
        MPI_CALL( MPI_Allreduce( l2_norm_h, &l2_norm, 1, MPI_REAL_TYPE, MPI_SUM, MPI_COMM_WORLD ) );
        l2_norm = std::sqrt( l2_norm );
    }
    // swap a_new and a, advance iter (elided on the slide)
}

github.com/NVIDIA/multi-gpu-programming-models/tree/master/mpi_overlap

SLIDE 21

CUDA KERNEL - NVSHMEM FOR COMMS

__global__ void jacobi_kernel( ... ) {
    const int ix = blockIdx.x * blockDim.x + threadIdx.x;
    const int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
    real local_l2_norm = 0.0;
    if ( iy < iy_end && ix >= 1 && ix < (nx-1) ) {
        const real new_val = 0.25 * ( a[ iy * nx + ix + 1 ] + a[ iy * nx + ix - 1 ]
                                    + a[ (iy+1) * nx + ix ] + a[ (iy-1) * nx + ix ] );
        a_new[ iy * nx + ix ] = new_val;

        // push the updated boundary elements straight to the neighbors
        if ( iy_start == iy )     shmem_float_p( a_new + top_iy*nx + ix,    new_val, top_pe );
        if ( (iy_end - 1) == iy ) shmem_float_p( a_new + bottom_iy*nx + ix, new_val, bottom_pe );

        real residue = new_val - a[ iy * nx + ix ];
        local_l2_norm += residue * residue;
    }
    atomicAdd( l2_norm, local_l2_norm );
}

github.com/NVIDIA/multi-gpu-programming-models/tree/master/nvshmem

SLIDE 22

HOST CODE - NVSHMEM FOR COMMS

a = (real *) shmem_malloc( nx * (chunk_size+2) * sizeof(real) );
a_new = (real *) shmem_malloc( nx * (chunk_size+2) * sizeof(real) );
...
while ( iter < iter_max && l2_norm > tol ) {
    ...
    jacobi_kernel<<<dim_grid, dim_block, 0, compute_stream>>>(
        a_new, a, l2_norm_d, iy_start, iy_end, nx,
        top, iy_end_top, bottom, iy_start_bottom );
    shmemx_barrier_all_on_stream( compute_stream );

    // convergence check
    if ( (iter % nccheck) == 0 ) {
        cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real),
                         cudaMemcpyDeviceToHost, compute_stream );
        cudaStreamSynchronize( compute_stream );
        MPI_Allreduce( l2_norm_h, &l2_norm, 1, MPI_REAL_TYPE, MPI_SUM, MPI_COMM_WORLD );
        l2_norm = std::sqrt( l2_norm );
    }
}
...

github.com/NVIDIA/multi-gpu-programming-models/tree/master/nvshmem

SLIDE 23

CUDA KERNEL - NVSHMEM FOR COMMS + SYNC

__global__ void jacobi_uber_kernel( ... ) {
    grid_group g = this_grid();
    // comms only (as in the previous kernel)
    ...
    g.sync();
    if ( (iter % nccheck) == 0 ) {
        // reduction across shmem PEs, done by one thread per PE
        if (!tid_x && !tid_y) {
            shmem_barrier_all();
            shmem_float_sum_to_all( l2_norm + 1, l2_norm, 1, 0, 0, npes, NULL, NULL );
            l2_norm[1] = (float) __frsqrt_rn( (float) l2_norm[1] );
            l2_norm[0] = 0;
        }
    }
    g.sync();
}

SLIDE 24

HOST CODE - NVSHMEM FOR COMMS + SYNC

a = (real *) shmem_malloc( nx * (chunk_size+2) * sizeof(real) );
a_new = (real *) shmem_malloc( nx * (chunk_size+2) * sizeof(real) );
...
void *args[] = { &a_new, &a, &l2_norm_d, &iy_start, &iy_end, &nx,
                 &top, &iy_end_top, &bottom, &iy_start_bottom };
// argument array passed to the cooperative launch, as with cudaLaunchCooperativeKernel
shmemx_collective_launch( (const void *) jacobi_kernel, dim_grid, dim_block,
                          args, 0, compute_stream );
...

SLIDE 25

JACOBI SOLVER (ON X86)

[Chart: parallel efficiency (0-100%) vs. number of GPUs (1-16) for MPI Overlap and NVSHMEM. DGX-2 (1 node), 18432 x 18432, 1000 iterations.]

Benchmark setup: DGX-2 with OS 4.0.5, GCC 7.3.0, CUDA 10.0 with 410.104 driver, CUB 1.8.0, CUDA-aware OpenMPI 4.0.0, NVSHMEM EA2 (0.2.3), GPUs @ 1597 MHz AC. Reported runtime is the minimum of 5 repetitions.

SLIDE 26

JACOBI SOLVER (ON POWER9)

[Chart: parallel efficiency (0-100%, left axis) and NVSHMEM-to-MPI ratio (0.0x-5.0x, right axis) vs. number of GPUs (1, 6, 12, 24, 48, 96, 192, 384) for mpi, nvshmem, and their ratio. Summit (64 nodes), 16384 x 16384, 1000 iterations.]

Benchmark setup: Summit with RHEL 4.14.0, GCC 4.8.5, CUDA 9.2.148 with 396.64 driver, CUB 1.8.0, CUDA-aware OpenMPI 4.0.0, NVSHMEM EA2 (0.2.3).

SLIDE 27

AGENDA

• GPU Programming Models
• Overview of NVSHMEM
• Porting to NVSHMEM
• Future Work
• Conclusion

SLIDE 28

IN-HEADER IMPLEMENTATION

• Put/Get translates to LD/ST on NVLink/PCIe
• Each API call incurs:
  • function call overhead
  • remote address recalculation
• Both avoided through an in-header (inlined) implementation, as in the sketch below
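A sketch of the idea with hypothetical bookkeeping (peer_bases and my_base stand in for whatever translation tables the library keeps internally): once the put is inlined from a header, it compiles down to address arithmetic plus a single store:

    __device__ __forceinline__
    void inlined_float_p(float *dest, float value, int pe,
                         char * const *peer_bases, const char *my_base) {
        // Translate the symmetric address into the peer's mapping. An
        // out-of-line library call would redo this arithmetic and pay the
        // call overhead on every single-element put.
        float *remote = (float *)(peer_bases[pe] + ((const char *) dest - my_base));
        *remote = value;   // one st over NVLink/PCIe
    }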

[Chart: bandwidth (GB/s, 10-60) vs. message size in bytes for three variants: shmem_p_bw with the implementation in the library, shmem_p_bw with the implementation in the header, and shmem_st_bw where the application stores directly to memory.]

Benchmark setup: DGX-1 with OS 4.14.0, GCC 5.4.0, CUDA 10.0.130 with 418.39 driver.

SLIDE 29

COLLECTIVES OPTIMIZATION

• Barriers currently use direct all-to-all communication
• Works well over NVLink in a DGX-1/2, but limits scalability inter-node
• Improving the implementations for device-side ops, e.g. a dissemination barrier (sketched below)
• Leverage NCCL goodness for CPU/on-stream ops
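A device-side sketch of the dissemination barrier compared in the chart below, assuming the point-to-point shmem_int_p / shmem_int_wait_until calls (standard OpenSHMEM) are usable from the kernel; flags is a symmetric array with one slot per round, and epoch distinguishes successive barriers:

    __device__ void dissemination_barrier(int *flags, int epoch, int mype, int npes) {
        // ceil(log2(npes)) rounds instead of signaling all npes peers directly
        for (int k = 0, dist = 1; dist < npes; ++k, dist <<= 1) {
            int peer = (mype + dist) % npes;
            shmem_int_p(&flags[k], epoch, peer);                  // signal the PE ahead
            shmem_int_wait_until(&flags[k], SHMEM_CMP_GE, epoch); // wait on the PE behind
        }
    }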

[Chart: barrier latency (microseconds, 50-450) vs. number of GPUs (4-256) for the all-to-all barrier (EA2), a dissemination barrier, and their ratio (0x-9x).]

Benchmark setup: Summit with RHEL 4.14.0, GCC 4.8.5, CUDA 9.2.148 with 396.64 driver.

SLIDE 30

OTHERS

• Interoperability: currently tied to OpenMPI for MPI interoperability; the MPI-based bootstrap is to become an external module
• Strided transfers, for multi-dimensional decompositions (see the sketch below)
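For the strided-transfers item, a sketch of what the feature would enable, using shmem_float_iput (the standard OpenSHMEM strided put; its availability in NVSHMEM is exactly the future work named above). With a 2-D decomposition of a row-major nx-wide block, exchanging a column halo means one element per row:

    // dest and src are symmetric row-major arrays on all PEs; push my last
    // interior column into the halo column of the right neighbor.
    void push_right_halo(float *dest, const float *src, int nx, int ny, int right_pe) {
        shmem_float_iput(dest,            // target: halo column on right_pe
                         src + (nx - 2),  // source: my last interior column
                         nx, nx,          // target / source strides = row length
                         ny,              // one element per row
                         right_pe);
    }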

SLIDE 31

SUMMARY

• Allows design of and experimentation with GPU-initiated communication
• Early access (EA2) available
• Support for P2P- and InfiniBand-connected GPUs
• x86 and POWER9 support
• Interoperable with MPI and OpenSHMEM
• Please reach out to nvshmem@nvidia.com with your application use case and how GPU-initiated communication may help
