NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY FOR NVIDIA GPU CLUSTERS
Sreeram Potluri, Anshuman Goswami - NVIDIA
3/28/2018
2
AGENDA
- GPU Programming Models
- Overview of NVSHMEM
- Porting to NVSHMEM
- Performance Evaluation
- Conclusion and Future Work
3
GPU CLUSTER PROGRAMMING
- Offload model
  - Compute on GPU
  - Communication from CPU
  - Synchronization at boundaries
- Performance overheads
  - Offload latencies
  - Synchronization overheads
  - Limits scaling
- Increases code complexity
- More CPU means more power

void 2dstencil (u, v, …) {
    for (timestep = 0; …) {
        interior_compute_kernel <<<…>>> (…)
        pack_kernel <<<…>>> (…)
        cudaStreamSynchronize(…)
        MPI_Irecv(…)
        MPI_Isend(…)
        MPI_Waitall(…)
        unpack_kernel <<<…>>> (…)
        boundary_compute_kernel <<<…>>> (…)
        …
    }
}
4
GPU CLUSTER PROGRAMMING
- Offload model
  - Compute on GPU
  - Communication from CPU
  - Synchronization at boundaries
- Performance overheads
  - Offload latencies
  - Synchronization overheads
  - Limits scaling
- Increases code complexity
- More CPU means more power

void 2dstencil (u, v, …) {
    for (timestep = 0; …) {
        interior_compute_kernel <<<…>>> (…)
        pack_kernel <<<…>>> (…)
        MPI_Irecv_on_stream(…,stream)
        MPI_Isend_on_stream(…,stream)
        MPI_Wait_on_stream(…,stream)
        unpack_kernel <<<…>>> (…)
        boundary_compute_kernel <<<…>>> (…)
        …
    }
}

Stream-aware MPI extensions and NCCL help improve this, but the CPU still orchestrates the communication!
5
GPU-INITIATED COMMUNICATION
- Removing reliance on the CPU for communication avoids offload and synchronization overheads
- Parallelism gives implicit compute-communication overlap
- Continuous fine-grained accesses smooth traffic over the network
- Direct accesses to remote memory simplify programming
- Improves performance while making it easier to program
6
COMMUNICATION FROM CUDA KERNELS
- Long-running CUDA kernels
- Communication from within the parallel compute

__global__ void 2dstencil (u, v, sync, …) {
    for (timestep = 0; …) {
        u[i] = (u[i] + (v[i+1] + v[i-1] . . .

        //data exchange
        if (i+1 > nx) {
            shmem_float_p (v[1], v[i+1], rightpe);
        }
        if (i-1 < 1) {
            shmem_float_p (v[nx], v[i-1], leftpe);
        }

        //synchronization
        if (i < 2) {
            shmem_fence();
            shmem_int_p (sync + i, 1, peers[i]);
            shmem_wait_until (sync + i, EQ, 1);
        }

        //intra-kernel sync
        …
    }
}

void 2dstencil (u, v, …) {
    stencil_kernel <<<…>>> (…)
}
7
AGENDA
- GPU Programming Models
- Overview of NVSHMEM
- Porting to NVSHMEM
- Performance Evaluation
- Conclusion and Future Work
8
WHAT IS OPENSHMEM ?
- OpenSHMEM is a PGAS library interface specification
- Distributed shared memory - defined locality of segments to application instances
- OpenSHMEM constructs:
  - Processing Elements (PEs) - execution contexts
  - Symmetric objects - global memory constructs that have the same address offsets across all PEs
[Figure: each PE (PE 0, PE 1, …, PE N-1) holds global and static variables and a symmetric heap, which are remotely accessible symmetric data objects (e.g. Variable: X, or X = shmalloc(sizeof(long)) allocated on every PE), plus local variables that are private data objects.]
Symmetry allows
- Ease of use
- Fast address translation
9
QUICK EXAMPLE
[Figure: symmetric variable a allocated at the same offset in the virtual address space of PE 0 and PE 1.]

int *a, *a_remote;
int value = 1;
a = (int *) shmem_malloc (sizeof(int));
if (shmem_my_pe() == 0) {
    //accessing remote memory using the Put API
    shmem_int_p (a/*remote addr*/, value, 1/*remote PE*/);

    //can do the same using a ST
    a_remote = (int *) shmem_ptr(a, 1);
    *a_remote = value;
}
10
OPENSHMEM FEATURES
- Point-to-point and group data movement operations
  - Remote memory Put and Get
  - Collectives (broadcast, reductions, etc.)
- Remote memory atomic operations
- Synchronization operations (barrier, sync)
- Ordering operations (fence, quiet)
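To make these categories concrete, here is a minimal host-side sketch using the standard OpenSHMEM 1.3 C API (plain OpenSHMEM, not NVSHMEM; error handling omitted):

#include <shmem.h>

int main(void) {
    shmem_init();
    int mype = shmem_my_pe();
    int npes = shmem_n_pes();
    int peer = (mype + 1) % npes;

    int *flag = (int *) shmem_malloc(sizeof(int));  /* symmetric object */
    *flag = 0;
    shmem_barrier_all();                            /* synchronization */

    shmem_int_p(flag, 1, peer);                     /* remote memory put */
    shmem_fence();                                  /* ordering: put delivered before inc */
    shmem_int_inc(flag, peer);                      /* remote atomic increment */
    shmem_quiet();                                  /* complete outstanding puts */

    shmem_barrier_all();
    shmem_free(flag);
    shmem_finalize();
    return 0;
}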
11
NVSHMEM
- Experimental implementation of OpenSHMEM for NVIDIA GPUs
- Symmetric heap on GPU memory
- Adds CUDA-specific extensions for performance

Where operations can be called from:
- Host only: library setup, exit and query; memory management; collective CUDA kernel launch; CUDA stream ordered operations
- Host/GPU: data movement operations; atomic memory operations; synchronization operations; memory ordering
- GPU: CTA-wide operations
12
COLLECTIVE CUDA KERNEL LAUNCH
- CUDA threads across GPUs can use NVSHMEM to synchronize or collectively move data
- Such kernels must be launched concurrently and be resident across all GPUs
- OpenSHMEM extension built on top of CUDA cooperative launch:
  shmemx_collective_launch (…)  //takes the same arguments as a CUDA kernel launch
- A regular CUDA launch can be used if the kernel does not call any synchronization or collective APIs
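A minimal sketch of how such a launch might look; the argument order is assumed from the later Jacobi example and CUDA's cooperative launch, and the early-access signature may differ. exchange_kernel and run are hypothetical names:

__global__ void exchange_kernel(float *buf, int n);  // kernel that uses NVSHMEM sync/collectives

void run(float *buf, int n, cudaStream_t stream) {
    dim3 grid(32), block(256);
    void *args[] = { &buf, &n };
    // Launched by every PE so the kernel is resident on all GPUs before any
    // device-side NVSHMEM synchronization or collective executes.
    shmemx_collective_launch( exchange_kernel, grid, block, args, 0 /*sharedMem*/, stream );
}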
13
CTA-WIDE OPERATIONS
Parallelism on the GPU can be used to optimize OpenSHMEM operations Extensions allow threads within a CTA to participate in a single OpenSHMEM call Collective operations translate to a multiple point-to-point interactions between PEs threads can be used to parallelize this Eg: shmemx_barrier_all_cta(…), shmemx_broadcast_cta(…), semantic is still as if a single collective operation is executed Bulk point-to-point transfers benefit from concurrency and with coalescing
- Eg: shmemx_putmem_cta(…)
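A hedged device-side sketch of the CTA-wide idea; the exact extension signatures are not given on the slide, so the arguments below are assumptions modeled on shmem_putmem and shmem_barrier_all:

// Assumed signatures:
//   shmemx_putmem_cta(void *dest, const void *src, size_t nbytes, int pe)
//   shmemx_barrier_all_cta()
__global__ void halo_exchange(float *remote_buf, const float *local_buf,
                              size_t count, int peer_pe) {
    // Every thread of the CTA enters the call with identical arguments;
    // the library spreads the copy across the CTA's threads for coalescing.
    shmemx_putmem_cta(remote_buf, local_buf, count * sizeof(float), peer_pe);

    // CTA-wide barrier: semantically one barrier_all, executed cooperatively.
    shmemx_barrier_all_cta();
}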
14
CUDA STREAM ORDERED EXTENSIONS
- Not all communication can be moved into CUDA kernels
- Not all compute can be fused into a single kernel
- Synchronization or communication at kernel boundaries is still required
- Extensions offload CPU-initiated SHMEM operations onto a CUDA stream, e.g.:
  kernel1<<<…,stream>>>(…)
  shmemx_barrier_all_on_stream(stream)  //can be a collective or p2p operation
  kernel2<<<…,stream>>>(…)
15
NVSHMEM STATUS
- Working towards an early access for external customers
- Initial version will support P2P-connected GPUs (single node)
  - atomics not supported over PCIe
  - full feature set on Pascal or newer GPUs
- Non-P2P and multi-node support to follow
16
AGENDA
- GPU Programming Models
- Overview of NVSHMEM
- Porting to NVSHMEM
- Performance Evaluation
- Conclusion and Future Work
17
PORTING TO USE NVSHMEM FROM GPU
- Step I: only communication from inside the kernel (on Kepler or newer GPUs)
- Step II: both communication and synchronization APIs from inside the kernel (on Pascal or newer Tesla GPUs)
- Using a Jacobi solver, we walk through Steps I and II and compare with the MPI version
GTC 2018 Multi-GPU Programming Models, Jiri Kraus - Senior Devtech Compute, NVIDIA
18
EXAMPLE: JACOBI SOLVER
While not converged:
    Do Jacobi step:
        for( int iy = 1; iy < ny-1; iy++ )
            for( int ix = 1; ix < nx-1; ix++ )
                a_new[iy*nx+ix] = 0.25 * ( a[ iy   *nx+(ix+1)] + a[ iy   *nx+(ix-1)]
                                         + a[(iy-1)*nx+ ix   ] + a[(iy+1)*nx+ ix   ] );
    Apply periodic boundary conditions
    Swap a_new and a
    Next iteration
19
COMPUTE KERNEL
//bIdx, bDim, tIdx, gDim abbreviate blockIdx, blockDim, threadIdx, gridDim
__global__ void jacobi_kernel( ... ) {
    real local_l2_norm = 0.0;
    for (int iy = bIdx.y*bDim.y+tIdx.y+iy_start; iy <= iy_end; iy += bDim.y * gDim.y) {
        for (int ix = bIdx.x*bDim.x+tIdx.x+1; ix < (nx-1); ix += bDim.x * gDim.x) {
            const real new_val = 0.25 * ( a[ iy * nx + ix + 1 ] + a[ iy * nx + ix - 1 ]
                                        + a[ (iy+1) * nx + ix ] + a[ (iy-1) * nx + ix ] );
            a_new[ iy * nx + ix ] = new_val;
            real residue = new_val - a[ iy * nx + ix ];
            local_l2_norm += residue * residue;
        }
    }
    atomicAdd( l2_norm, local_l2_norm );
}
20
HOST CODE - MPI
[Diagram: operations mapped onto three CUDA streams (top_stream, bottom_stream, compute_stream): cudaMemsetAsync(norm) and cudaEventRecord(event0) on compute_stream; top_stream and bottom_stream wait on event0 and run compute_jacobi<<<>>>(top_boundary) / compute_jacobi<<<>>>(bottom_boundary); compute_stream runs compute_jacobi<<<>>>(interior); after cudaStreamSynchronize(), MPI_SendRecv(top) and MPI_SendRecv(bottom) exchange the halos and event1/event2 are recorded; compute_stream then waits on event1/event2, and cudaMemcpyAsync(norm) plus MPI_Allreduce(norm) run once every n iterations.]
21
HOST CODE - MPI
while ( l2_norm > tol && iter < iter_max ) {
    //reset norm
    CUDA_RT_CALL( cudaMemsetAsync( l2_norm_d, 0, sizeof(real), compute_stream ) );
    CUDA_RT_CALL( cudaEventRecord( reset_l2norm_done, compute_stream ) );

    //compute boundary
    CUDA_RT_CALL( cudaStreamWaitEvent( push_top_stream, reset_l2norm_done, 0 ) );
    launch_jacobi_kernel( a_new, a, l2_norm_d, iy_start, (iy_start+1), nx, push_top_stream );
    CUDA_RT_CALL( cudaEventRecord( push_top_done, push_top_stream ) );

    CUDA_RT_CALL( cudaStreamWaitEvent( push_bottom_stream, reset_l2norm_done, 0 ) );
    launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_end-1), iy_end, nx, push_bottom_stream );
    CUDA_RT_CALL( cudaEventRecord( push_bottom_done, push_bottom_stream ) );

    //compute interior
    launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_start+1), (iy_end-1), nx, compute_stream );

    //Apply periodic boundary conditions
    CUDA_RT_CALL( cudaStreamSynchronize( push_top_stream ) );
    MPI_CALL( MPI_Sendrecv( a_new+iy_start*nx, nx, MPI_REAL_TYPE, top, 0,
                            a_new+(iy_end*nx), nx, MPI_REAL_TYPE, bottom, 0,
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE ) );
    CUDA_RT_CALL( cudaStreamSynchronize( push_bottom_stream ) );
    MPI_CALL( MPI_Sendrecv( a_new+(iy_end-1)*nx, nx, MPI_REAL_TYPE, bottom, 0,
                            a_new, nx, MPI_REAL_TYPE, top, 0,
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE ) );

    //Periodic convergence check
    if ( (iter % nccheck) == 0 || (!csv && (iter % 100) == 0) ) {
        CUDA_RT_CALL( cudaStreamWaitEvent( compute_stream, push_top_done, 0 ) );
        CUDA_RT_CALL( cudaStreamWaitEvent( compute_stream, push_bottom_done, 0 ) );
        CUDA_RT_CALL( cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real),
                                       cudaMemcpyDeviceToHost, compute_stream ) );
        CUDA_RT_CALL( cudaStreamSynchronize( compute_stream ) );
        MPI_CALL( MPI_Allreduce( l2_norm_h, &l2_norm, 1, MPI_REAL_TYPE, MPI_SUM, MPI_COMM_WORLD ) );
        l2_norm = std::sqrt( l2_norm );
    }
    …
}
22
CUDA KERNEL - NVSHMEM FOR COMMS
//bIdx, bDim, tIdx, gDim abbreviate blockIdx, blockDim, threadIdx, gridDim
__global__ void jacobi_kernel( ... ) {
    real local_l2_norm = 0.0;
    for (int iy = bIdx.y*bDim.y+tIdx.y+iy_start; iy <= iy_end; iy += bDim.y * gDim.y) {
        for (int ix = bIdx.x*bDim.x+tIdx.x+1; ix < (nx-1); ix += bDim.x * gDim.x) {
            const real new_val = 0.25 * ( a[ iy * nx + ix + 1 ] + a[ iy * nx + ix - 1 ]
                                        + a[ (iy+1) * nx + ix ] + a[ (iy-1) * nx + ix ] );
            a_new[ iy * nx + ix ] = new_val;

            //push updated boundary rows directly into the neighbors' halo regions
            if ( iy_start == iy ) shmem_float_p(a_new + top_iy*nx + ix, new_val, top_pe);
            if ( iy_end == iy )   shmem_float_p(a_new + bottom_iy*nx + ix, new_val, bottom_pe);

            real residue = new_val - a[ iy * nx + ix ];
            local_l2_norm += residue * residue;
        }
    }
    atomicAdd( l2_norm, local_l2_norm );
}
23
HOST CODE - NVSHMEM FOR COMMS
a = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real));
a_new = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real));
…
while (iter < iter_max && l2_norm > tol ) {
    …
    jacobi_kernel<<<dim_grid,dim_block,0,compute_stream>>>( a_new, a, l2_norm_d, iy_start, iy_end, nx,
                                                            top, iy_end_top, bottom, iy_start_bottom );
    shmemx_barrier_all_on_stream(compute_stream);

    //convergence check
    if ((iter % nccheck) == 0) {
        cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream );
        cudaStreamSynchronize( compute_stream );
        MPI_Allreduce( l2_norm_h, &l2_norm, 1, MPI_REAL_TYPE, MPI_SUM, MPI_COMM_WORLD );
        l2_norm = std::sqrt( l2_norm );
    }
}
…
24
CUDA KERNEL - NVSHMEM FOR COMMS + SYNC
__global__ void jacobi_uber_kernel( ... ) {
    grid_group g = this_grid();
    //comms only ...
    g.sync();
    if ( (iter % nccheck) == 0 ) {
        //reduction across shmem pes
        if (!tid_x && !tid_y) {
            shmem_barrier_all();
            shmem_float_sum_to_all (l2_norm + 1, l2_norm, 1, 0, 0, npes, NULL, NULL);
            l2_norm[1] = (float) __frsqrt_rn( (float)l2_norm[1] );
            l2_norm[0] = 0;
        }
    }
    g.sync();
}
25
HOST CODE - NVSHMEM FOR COMMS + SYNC
a = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real));
a_new = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real));
…
void *args[] = {&a_new, &a, &l2_norm_d, &iy_start, &iy_end, &nx,
                &top, &iy_end_top, &bottom, &iy_start_bottom};
shmemx_collective_launch ( jacobi_kernel, dim_grid, dim_block, args, 0, compute_stream );
…
26
AGENDA
- GPU Programming Models
- Overview of NVSHMEM
- Porting to NVSHMEM
- Performance Evaluation
- Conclusion and Future Work
27
CTA-WIDE BARRIER PERFORMANCE
Disclaimer: results from a pre-production DGX-2 system
- GPUs: 16x V100/32 GB
- Dual-socket Intel Xeon Platinum 8168 CPU, 2.7 GHz, 24 cores

[Chart: barrier latency relative to 2 GPUs, for thread-scope and CTA-scope barriers, on 2, 4, 8, and 16 GPUs.]
JACOBI SOLVER USING NVSHMEM

[Chart: scaling efficiency (%) vs. stencil dimension (NxN floats, 1K to 32K), CUDA-aware MPI vs. NVSHMEM, on 8 V100 GPUs + NVLink (2x4).]
29
MULTI-GPU TRANSPOSE
- Bandwidth limited
- The MPI version carefully pipelines local transposes (packing and unpacking) with inter-process data movement
- NVSHMEM moves data in place, which significantly reduces code complexity
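A minimal device-side sketch of the in-place idea, assuming a row-partitioned N x N matrix of floats and hypothetical helper values (my_pe, rows_per_pe); this is an illustration, not the benchmarked implementation:

// Each PE owns rows [my_pe*rows_per_pe, (my_pe+1)*rows_per_pe) of both the
// source matrix 'in' and the symmetric destination matrix 'out'.
__global__ void transpose_inplace(const float *in, float *out,
                                  int n, int rows_per_pe, int my_pe) {
    int local_row = blockIdx.y * blockDim.y + threadIdx.y;   // row within this PE's block
    int col       = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    if (local_row >= rows_per_pe || col >= n) return;

    int global_row = my_pe * rows_per_pe + local_row;
    float val = in[local_row * n + col];

    // In the transposed matrix, element (global_row, col) lands at (col, global_row):
    // the PE owning row 'col' receives it directly, with no pack/unpack staging.
    int dest_pe        = col / rows_per_pe;
    int dest_local_row = col % rows_per_pe;
    shmem_float_p(out + dest_local_row * n + global_row, val, dest_pe);
    // A barrier_all after the kernel makes the transposed matrix visible everywhere.
}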
[Figure: element layout of an 8x8 example matrix partitioned across GPU 0 and GPU 1, before and after the transpose.]

[Chart: bus bandwidth (GB/sec) vs. matrix dimension (512 to 16384), MPI vs. NVSHMEM, on 4 GP100s connected with NVLink.]
30
MULTI-GPU BREADTH FIRST SEARCH
- Key subroutine in several graph algorithms; naturally leads to random accesses
- MPI implementations pack updates or exchange a bitmap of the frontier at the end of each step
- The NVSHMEM version directly updates the frontier map at the target PE using atomics
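A minimal sketch of remotely marking a frontier vertex with an atomic, assuming a hypothetical symmetric int-per-vertex frontier array and helper values (local_v, owner_pe); the atomic signature follows standard OpenSHMEM:

// 'next_frontier' is a symmetric int array with one entry per locally owned vertex.
// While relaxing an edge (u -> v), the visiting PE marks v directly on its owner.
__device__ void mark_neighbor(int *next_frontier, int local_v, int owner_pe) {
    // Atomic compare-and-swap: set the flag to 1 only if it is still 0, so
    // concurrent updates from many threads and PEs remain well defined.
    shmem_int_cswap(next_frontier + local_v, 0, 1, owner_pe);
}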
[Chart: GTEPS vs. graph size (scale 21 to 26), CUDA-aware MPI vs. NVSHMEM, on 8 P100 GPUs + NVLink (2x4); annotations: 22%, 75%, 52%.]
31