

SLIDE 1

April 4-7, 2016 | Silicon Valley

Sreeram Potluri, Nathan Luehr, and Nikolay Sakharnykh

SIMPLIFYING MULTI-GPU COMMUNICATION WITH NVSHMEM

SLIDE 2

GOAL

• Limitations for strong scaling on GPU clusters
• A possible way to address them: a GPU global address space, NVSHMEM
• Case studies using NVSHMEM
• The start of a discussion, not a finished solution

SLIDE 3

PROGRAMMING WITH NVSHMEM

SLIDE 4

GPU CLUSTER PROGRAMMING

Offload model
• Compute on the GPU
• Communication driven from the CPU
• Synchronization at boundaries

Overheads on GPU clusters
• Offload latencies
• Synchronization overheads
• Limits strong scaling
• More CPU involvement means more power

void 2dstencil (u, v, …)
{
    for (timestep = 0; …) {
        interior_compute_kernel <<<…>>> (…);
        pack_kernel <<<…>>> (…);
        cudaStreamSynchronize(…);
        MPI_Irecv(…);
        MPI_Isend(…);
        MPI_Waitall(…);
        unpack_kernel <<<…>>> (…);
        boundary_compute_kernel <<<…>>> (…);
        …
    }
}

SLIDE 5

GPU-CENTRIC COMMUNICATION

GPU capabilities
• Compute state (many resident threads) to hide latencies to global memory
• Implicit coalescing of loads/stores to achieve efficiency
• CUDA makes it natural to program to these

The same capabilities should also help when accessing data over the network
• Direct access to remote memory simplifies programming
• Efficiency while being easier to program
• Continuous fine-grained accesses smooth traffic over the network
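To make the idea of direct, fine-grained remote access concrete, here is a minimal sketch using plain CUDA peer-to-peer access (not NVSHMEM), assuming two GPUs in one node with P2P support between them: a kernel running on GPU 0 stores directly into memory allocated on GPU 1.

    #include <cuda_runtime.h>

    // Each thread writes one element straight into the peer GPU's buffer.
    __global__ void write_remote(float *peer_buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            peer_buf[i] = (float)i;   // store lands in GPU 1's memory
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *buf_on_gpu1;

        cudaSetDevice(1);
        cudaMalloc(&buf_on_gpu1, n * sizeof(float));   // allocate on GPU 1

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);              // let GPU 0 access GPU 1
        write_remote<<<(n + 255) / 256, 256>>>(buf_on_gpu1, n);
        cudaDeviceSynchronize();
        return 0;
    }

Remote loads work the same way, and the usual coalescing of a warp's accesses applies to the peer traffic as well as to local global-memory traffic.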

SLIDE 6

GPU GLOBAL ADDRESS SPACE

SLIDE 7

NVSHMEM

A subset of OpenSHMEM, interoperable with MPI/OpenSHMEM, usable in CUDA kernels and OpenACC regions

• Initialization and cleanup (host): nvstart_pes, nvstop_pes
• Allocation and deallocation (host): nvshmalloc, nvshmcleanup
• Barrier (host): nvshmem_barrier_all
• Address query (host/GPU): nvshmem_get_ptr
• Put and get routines (GPU): nvshmem_(float/int)_(p/g), nvshmem_(float/int)_(put/get)
• Ordering (GPU): nvshmem_(quiet/fence)
• Point-to-point synchronization (GPU): nvshmem_wait, nvshmem_wait_until
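A minimal host-side lifecycle sketch built only from the calls listed above; the header name, exact prototypes, and the arguments of nvstart_pes/nvshmcleanup are assumptions (modeled on OpenSHMEM), not the prototype library's real headers.

    #include <cuda_runtime.h>
    #include "nvshmem.h"                     // header name is an assumption

    #define N 1024

    __global__ void compute(float *u)        // real code would use put/get here
    {
        u[threadIdx.x] = 0.0f;
    }

    int main(void)
    {
        nvstart_pes(0);                      // start PEs, one per GPU (argument assumed)

        // Symmetric allocation: every PE allocates the same size.
        float *u = (float *) nvshmalloc(N * sizeof(float));

        compute<<<N / 256, 256>>>(u);
        cudaDeviceSynchronize();
        nvshmem_barrier_all();               // host-side barrier across all PEs

        nvshmcleanup();                      // release symmetric allocations (assumed form)
        nvstop_pes();                        // shut down (assumed form)
        return 0;
    }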

SLIDE 8

COMMUNICATION FROM CUDA KERNELS

Long-running CUDA kernels, with communication issued from within the parallel code

__global__ void 2dstencil (u, v, sync, …)
{
    for (timestep = 0; …) {
        if (i+1 > nx) {
            v[i+1] = shmem_float_g (&v[1], rightpe);
        }
        if (i-1 < 1) {
            v[i-1] = shmem_float_g (&v[nx], leftpe);
        }
        u[i] = (u[i] + (v[i+1] + v[i-1] …

        if (i < 2) {
            shmem_int_p (sync + i, 1, peers[i]);
            shmem_quiet();
            shmem_wait_until (sync + i, EQ, 1);
        } //intra-kernel sync
        …
    }
}
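The notify/wait idiom buried in the stencil above can be isolated as follows; this is a stripped-down sketch meant for a single-block launch, with names taken from the slide and the exact prototypes and the EQ constant assumed.

    __global__ void notify_and_wait(int *sync_flag, int peer_pe)
    {
        if (threadIdx.x == 0) {
            shmem_int_p(sync_flag, 1, peer_pe);     // write flag into the peer's memory
            shmem_quiet();                          // ensure the put is delivered
            shmem_wait_until(sync_flag, EQ, 1);     // wait for the peer's matching put
        }
        __syncthreads();                            // remaining threads wait here
    }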

SLIDE 9

EXPERIMENTAL PLATFORMS

  • Single node – GPUs directly connected with NVLink
  • Single node – up to 8 GPUs – 2 per card – 4 cards under the same PCIe root complex, using riser cards with a PCIe switch
  • CUDA IPC and P2P
  • Multi-node platform – top-of-rack PCIe switch – ExpressFabric, proprietary technology from Avago Technologies
  • Inter-host communication with TWC – Tunneled Window Connection

Source: http://www.avagotech.com/applications/datacenters/expressfabric/

SLIDE 10

CURRENT ARCHITECTURES

Operations: communication (write / read / atomics) and execution (inter-thread synchronization)

Volta + NVLink (single node):              Write ☑   Read ☑   Atomics ☑   Inter-thread sync ☑
Kepler + PCIe ExpressFabric (multi node):  Write ☑   Read ☒   Atomics ☒
Kepler + P2P over PCIe (single node):      Write ☑   Read ☑   Atomics ☒
Pascal + NVLink:                           Write ☑   Read ☑   Atomics ☑

(1) Avoid intra-WARP synchronization (2) Ensure synchronizing blocks are scheduled
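Note (1) can be handled by electing a single thread to spin on the incoming flag while the rest of the block waits at a barrier, instead of letting threads of one warp wait on one another. A minimal sketch in plain CUDA, with a hypothetical flag that a remote put would set:

    __global__ void safe_block_wait(volatile int *flag)
    {
        if (threadIdx.x == 0) {
            while (*flag == 0)   // only one thread spins on the remotely written flag
                ;
        }
        __syncthreads();         // everyone else waits here, avoiding intra-warp spinning
    }

Note (2) is a launch-side constraint: a kernel whose blocks synchronize with each other must not launch more blocks than can be resident on the GPU at once, or some of the synchronizing blocks may never be scheduled.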

SLIDE 11

PERFORMANCE STUDIES

SLIDE 12

CoMD MOLECULAR DYNAMICS

MPI vs. NVSHMEM for halo exchange in EAM force evaluation

• GPU-driven communication
• Fine-grained communication at the thread level
• Avoids synchronization and artificial serialization

[Diagram: EAM force-evaluation timelines.
 MPI version: EAM-1 kernel, then for each of the Z, Y, and X exchanges: pack MPI buffer → MPI Send/Recv → unpack MPI buffer, followed by the EAM-3 kernel.
 NVSHMEM version: EAM-1 kernel → send data → wait → EAM-3 kernel.]

SLIDE 13

CoMD FORCE EVALUATION

NVProf timeline for EAM forces using Link Cells

SLIDE 14

CoMD PERFORMANCE

4 K80s (8 GPUs) connected over PCIe

[Chart: speedup (0.5–3.5×) vs. atoms per GPU (2048, 6912, 27436, 108000, 364500) for atom redistribution, force exchange, and the full timestep.]

SLIDE 15

MULTI-GPU TRANSPOSE

• Bandwidth limited
• MPI version carefully pipelines local transposes and inter-process data movement
• NVSHMEM significantly reduces code complexity

[Chart: bi-directional bandwidth (GB/sec) vs. matrix dimension (384–12288), MPI vs. NVSHMEM.]

[Diagram: an 8×8 element matrix distributed across GPU 0 and GPU 1, shown before and after the transpose.]

2 K40s connected over PCIe
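As an illustration of the fine-grained style (a sketch, not the measured implementation from the slide): with an n×n matrix split row-wise across two PEs and the output held in a symmetric allocation, each thread can put its element straight into the transposed position on the owning PE, with no staging buffers. The nvshmem_float_p signature and put-to-self behavior are assumptions.

    __global__ void transpose_put(const float *in, float *out, int n, int my_pe)
    {
        int rows_per_pe = n / 2;                    // two PEs, row-wise split
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= rows_per_pe * n) return;

        int row = my_pe * rows_per_pe + idx / n;    // global coordinates of in[idx]
        int col = idx % n;

        // Transposed element (col, row) lives on the PE that owns global row 'col'.
        int dst_pe  = col / rows_per_pe;
        int dst_idx = (col % rows_per_pe) * n + row;

        // 'out' is symmetric, so the same address is valid on dst_pe; a put to
        // one's own PE is assumed to behave like a local store.
        nvshmem_float_p(out + dst_idx, in[idx], dst_pe);
        // A quiet/barrier after the kernel is still needed before 'out' is used.
    }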

SLIDE 16

COLLECTIVE COMMUNICATION

• NCCL collective communication library
• Uses fine-grained loads/stores between GPUs – no DMAs used!
• Pipelines data movement and overlaps it with computation (by virtue of warp scheduling)
• Implemented over NVSHMEM

[Charts: results on single-node PCIe, multi-node PCIe, and NVLink platforms.]
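A sketch of the load/store (no-DMA) copy style described above: threads read directly from a pointer into the peer GPU's buffer and write locally, with warp scheduling hiding the remote-load latency. It assumes that nvshmem_get_ptr (from the API slide) returns a directly dereferenceable address for the peer's symmetric buffer.

    __global__ void pull_copy(float *dst, const float *src_remote, int n)
    {
        // Grid-stride loop: fine-grained remote loads, local stores.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            dst[i] = src_remote[i];
    }

    // Host side (sketch): float *src_remote = (float *) nvshmem_get_ptr(src, peer_pe);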

SLIDE 17

HPGMG-FV

• Proxy for geometric multi-grid linear solvers
• Boundary exchange is symmetric, point-to-point between neighbors
• MPI version uses 3 steps:
  1 – send data (boundary -> MPI buffer)
  2 – local exchange (internal -> internal)
  3 – receive data (MPI buffer -> boundary)

Intra-level communication

SLIDE 18

HPGMG-FV – BOUNDARY EXCHANGE

Implementation complexity

MPI:
  CopyKernel(BOUNDARY-TO-BUFFER)
  cudaDeviceSync
  MPI_Irecv + MPI_Isend
  CopyKernel(INTERNAL-TO-INTERNAL)
  MPI_Waitall
  CopyKernel(BUFFER-TO-BOUNDARY)

NVSHMEM:
  CopyKernel(ALL-TO-ALL)
  nvshmem_barrier_all_offload
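A sketch of what the NVSHMEM column could look like in code: one copy kernel puts boundary data straight into the neighbor's symmetric halo buffer (internal-to-internal copies would be plain local stores in the same kernel), and the barrier afterwards makes it visible. Buffer names and the element-put signature are assumptions.

    __global__ void exchange_all_to_all(const float *boundary_out, float *halo_in,
                                        int count, int neighbor_pe)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < count;
             i += gridDim.x * blockDim.x)
            nvshmem_float_p(halo_in + i, boundary_out[i], neighbor_pe);
    }

    // Host side (sketch): exchange_all_to_all<<<grid, block>>>(...);
    //                     followed by the barrier before the next smoother step.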

SLIDE 19

HPGMG CHEBYSHEV SMOOTHER

• Limited by latencies – more so at coarser levels
• Use fine-grained put/get with NVSHMEM

[Chart: "HPGMG - Chebyshev Smoother - 8 GPUs" – time in msec vs. granularity (128^3 finer to 32^3 coarser), MPI vs. NVSHMEM.]

4 K80s (8 GPUs) connected over PCIe

SLIDE 20

SUMMARY

• Strong scaling is important on GPU clusters
• CPU-orchestrated communication adds overheads
• NVSHMEM is a prototype library for GPU-initiated communication
• Better performance and better programmability
• Promising results with the NVIDIA collectives library (NCCL) and mini-apps

SLIDE 21

April 4-7, 2016 | Silicon Valley

THANK YOU!