S7128 - HOW TO ENABLE NVIDIA CUDA STREAM SYNCHRONOUS COMMUNICATIONS USING GPUDIRECT

Davide Rossetti, Elena Agostini


GPUDIRECT ELSEWHERE AT GTC2017

  • H7130 - CONNECT WITH THE EXPERTS: NVIDIA GPUDIRECT TECHNOLOGIES ON MELLANOX NETWORK INTERCONNECTS (Today 5pm, D.Rossetti @Mellanox)
  • S7155 - OPTIMIZED INTER-GPU COLLECTIVE OPERATIONS WITH NCCL (Tue 9am, S.Jeaugey @NVIDIA)
  • S7142 - MULTI-GPU PROGRAMMING MODELS (Wed 1pm, S.Potluri, J.Krauss @NVIDIA)
  • S7489 - CLUSTERING GPUS WITH ETHERNET (Wed 4pm, F.Osman @Broadcom)
  • S7356 - MVAPICH2-GDR: PUSHING THE FRONTIER OF HPC AND DEEP LEARNING (Thu 2pm, D.K.Panda @OSU)


AGENDA

  • GPUDirect technologies
  • NVLINK-enabled multi-GPU systems
  • GPUDirect P2P
  • GPUDirect RDMA
  • GPUDirect Async
  • Async benchmarks & application results


INTRODUCTION TO GPUDIRECT TECHNOLOGIES


GPUDIRECT FAMILY¹

Technologies enabling products!

  • GPUDIRECT SHARED GPU-SYSMEM: GPU pinned memory shared with other RDMA-capable devices; avoids intermediate copies
  • GPUDIRECT P2P: accelerated GPU-GPU memory copies; inter-GPU direct load/store access
  • GPUDIRECT RDMA²: direct GPU to 3rd party device transfers, e.g. direct I/O, optimized inter-node communication
  • GPUDIRECT ASYNC: direct GPU to 3rd party device synchronizations, e.g. optimized inter-node communication

[1] https://developer.nvidia.com/gpudirect [2] http://docs.nvidia.com/cuda/gpudirect-rdma


GPUDIRECT

scopes

  • GPUDirect P2P → data
  • Intra-node
  • GPUs both master and slave
  • Over PCIe or NVLink
  • GPUDirect RDMA → data
  • Inter-node
  • GPU slave, 3rd party device master
  • Over PCIe
  • GPUDirect Async → control
  • GPU & 3rd party device, master & slave
  • Over PCIe

(Diagram: GPUDirect RDMA/P2P form the data plane, GPUDirect Async the control plane, between the HOST and the GPUs.)


NVLINK-enabled Multi-GPU servers


NVIDIA DGX-1

AI Supercomputer-in-a-Box

170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh 2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U — 3200W


DGX-1 SYSTEM TOPOLOGY

GPU – CPU link:

PCIe 12.5+12.5 GB/s eff BW

GPUDirect P2P:

GPU – GPU link is NVLink; cube-mesh topology, not all-to-all

GPUDirect RDMA:

GPU – NIC link is PCIe


IBM MINSKY

  • 2 POWER8 CPUs with NVLink
  • 4 NVIDIA Tesla P100 GPUs
  • 256 GB system memory
  • 2 SSD storage devices
  • High-speed interconnect: IB or Ethernet

Optional: up to 1 TB system memory; PCIe-attached NVMe storage


(System topology diagram: two POWER8 CPUs, each attached to two P100 GPUs over NVLink; 80 GB/s and 115 GB/s link bandwidths shown between GPU memory, GPUs and system memory.)

IBM MINSKY SYSTEM TOPOLOGY

GPU – CPU link:

2x NVLINK 40+40 GB/s raw BW

GPUDirect P2P:

GPU – GPU link is 2x NVLink; two-cliques topology

GPUDirect RDMA:

Not supported


GPUDIRECT AND MULTI-GPU SYSTEMS THE CASE OF DGX-1


HOW TO’S

Device topology, link type and capabilities:

  • GPUa – GPUb link: P2P over NVLINK vs PCIe, speed, etc.
  • Same for the CPU – GPU link: NVLINK or PCIe
  • Same for the NIC – GPU link (HWLOC)

Select an optimized GPU/CPU/NIC combination in MPI runs; enable GPUDirect RDMA


CUDA LINK CAPABILITIES

// CUDA driver API

typedef enum CUdevice_P2PAttribute_enum {
    CU_DEVICE_P2P_ATTRIBUTE_PERFORMANCE_RANK        = 0x01,
    CU_DEVICE_P2P_ATTRIBUTE_ACCESS_SUPPORTED        = 0x02,
    CU_DEVICE_P2P_ATTRIBUTE_NATIVE_ATOMIC_SUPPORTED = 0x03
} CUdevice_P2PAttribute;

CUresult cuDeviceGetP2PAttribute(int *value, CUdevice_P2PAttribute attrib,
                                 CUdevice srcDevice, CUdevice dstDevice);

// CUDA runtime API

cudaError_t cudaDeviceGetP2PAttribute(int *value, enum cudaDeviceP2PAttr attr,
                                      int srcDevice, int dstDevice);

Basic info, GPU–GPU links only. PERFORMANCE_RANK is a relative value indicating the performance of the link between two GPUs (NVLINK ranks higher than PCIe); NATIVE_ATOMIC_SUPPORTED indicates whether GPU kernels can perform remote native atomics.


GPUDIRECT P2P: NVLINK VS PCIE

Note: some GPU pairs are not connected, e.g. GPU0–GPU7. Note 2: others have multiple potential links (NVLINK and PCIe) but cannot use both at the same time!

NVLINK transparently picked if available

cudaSetDevice(0);
cudaMalloc(&buf0, size);
cudaDeviceCanAccessPeer(&access, 0, 1);
assert(access == 1);
cudaDeviceEnablePeerAccess(1, 0);
cudaSetDevice(1);
cudaMalloc(&buf1, size);
…
cudaSetDevice(0);
cudaMemcpy(buf0, buf1, size, cudaMemcpyDefault);


MULTI GPU RUNS ON DGX-1

  • Create a wrapper script
  • Use the local MPI rank (MPI implementation dependent)
  • Don't use CUDA_VISIBLE_DEVICES, it hurts P2P!
  • Use environment variables to pass the selection down to MPI and the app
  • In the application: cudaSetDevice(atoi(getenv("USE_GPU")))
  • Run the wrapper script

Select best GPU/CPU/NIC for each MPI rank

$ cat wrapper.sh
if [ ! -z $OMPI_COMM_WORLD_LOCAL_RANK ]; then
    lrank=$OMPI_COMM_WORLD_LOCAL_RANK
elif [ ! -z $MV2_COMM_WORLD_LOCAL_RANK ]; then
    lrank=$MV2_COMM_WORLD_LOCAL_RANK
fi
if (( $lrank > 7 )); then echo "too many ranks"; exit; fi
case ${HOSTNAME} in
*dgx*)
    USE_GPU=$((2*($lrank%4)+$lrank/4))   # 0,2,4,6,1,3,5,7
    export USE_SOCKET=$(($USE_GPU/4))    # 0,0,1,1,0,0,1,1
    HCA=mlx5_$(($USE_GPU/2))             # 0,1,2,3,0,1,2,3
    export OMPI_MCA_btl_openib_if_include=${HCA}
    export MV2_IBA_HCA=${HCA}
    export USE_GPU;;
…
esac
numactl --cpunodebind=${USE_SOCKET} -l $@

$ mpirun -np N wrapper.sh myapp param1 param2 …


NVML¹ NVLINK

nvmlReturn_t nvmlDeviceGetNvLinkVersion(nvmlDevice_t device, unsigned int link, unsigned int *version);
nvmlReturn_t nvmlDeviceGetNvLinkState(nvmlDevice_t device, unsigned int link, nvmlEnableState_t *isActive);
nvmlReturn_t nvmlDeviceGetNvLinkCapability(nvmlDevice_t device, unsigned int link, nvmlNvLinkCapability_t capability, unsigned int *capResult);
nvmlReturn_t nvmlDeviceGetNvLinkRemotePciInfo(nvmlDevice_t device, unsigned int link, nvmlPciInfo_t *pci);

Link discovery and info APIs

1 http://docs.nvidia.com/deploy/nvml-api/

  • nvmlDevice handles are separate from CUDA GPU ids (NVML sees all devices vs CUDA_VISIBLE_DEVICES)
  • NVML_NVLINK_MAX_LINKS = 6
  • See later for capabilities
  • RemotePciInfo returns the domain:bus:device.function PCI identifier of the device on the other side of the link; it can be a socket PCIe bridge (IBM POWER8)


NVLINK CAPABILITIES

nvidia-smi nvlink -i <GPU id> -c

typedef enum nvmlNvLinkCapability_enum {
    NVML_NVLINK_CAP_P2P_SUPPORTED  = 0,
    NVML_NVLINK_CAP_SYSMEM_ACCESS  = 1,
    NVML_NVLINK_CAP_P2P_ATOMICS    = 2,
    NVML_NVLINK_CAP_SYSMEM_ATOMICS = 3,
    NVML_NVLINK_CAP_SLI_BRIDGE     = 4,
    NVML_NVLINK_CAP_VALID          = 5,
} nvmlNvLinkCapability_t;

On DGX-1


NVLINK COUNTERS

  • Per GPU (-i 0), per link (-l <0..3>)
  • Two sets of counters (-g <0|1>)
  • Per-set counter types: cycles, packets, bytes (-sc xyz)
  • Reset individually (-r <0|1>)

On DGX-1


NVML TOPOLOGY

nvmlReturn_t nvmlDeviceGetTopologyNearestGpus(nvmlDevice_t device, nvmlGpuTopologyLevel_t level, unsigned int *count, nvmlDevice_t *deviceArray);
nvmlReturn_t nvmlDeviceGetTopologyCommonAncestor(nvmlDevice_t device1, nvmlDevice_t device2, nvmlGpuTopologyLevel_t *pathInfo);
nvmlReturn_t nvmlSystemGetTopologyGpuSet(unsigned int cpuNumber, unsigned int *count, nvmlDevice_t *deviceArray);
nvmlReturn_t nvmlDeviceGetCpuAffinity(nvmlDevice_t device, unsigned int cpuSetSize, unsigned long *cpuSet);

GPU-GPU & GPU-CPU topology query APIs¹

1 http://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html

NVML_TOPOLOGY_INTERNAL, NVML_TOPOLOGY_SINGLE, NVML_TOPOLOGY_MULTIPLE, NVML_TOPOLOGY_HOSTBRIDGE, NVML_TOPOLOGY_CPU, NVML_TOPOLOGY_SYSTEM,


SYSTEM TOPOLOGY

$ nvidia-smi topo -m

On DGX-1


SYSTEM TOPOLOGY

$ nvidia-smi topo -mp

On DGX-1, PCIe only


GPUDIRECT P2P


DGX-1 P2P PERFORMANCE

In the CUDA toolkit samples:
Sources: samples/1_Utilities/p2pBandwidthLatencyTest
Binary: samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest

p2pBandwidthLatencyTest


DGX-1 P2P PERFORMANCE

In the CUDA toolkit demo suite: /usr/local/cuda-8.0/extras/demo_suite/busGrind -h
Usage:
  -h: print usage
  -p [0,1] enable or disable pinned memory tests (default on)
  -u [0,1] enable or disable unpinned memory tests (default off)
  -e [0,1] enable or disable p2p enabled memory tests (default on)
  -d [0,1] enable or disable p2p disabled memory tests (default off)
  -a enable all tests
  -n disable all tests

busGrind


Intra-node MPI BW


GPU-aware MPI running over GPUDirect P2P; dual IVB Xeon 2U server (K40 PCIe) vs DGX-1 (P100 NVLink).

(Chart: bandwidth in GB/s vs message size, 1K–16M bytes, for k40-pcie, k40-pcie-bidir, p100-nvlink and p100-nvlink-bidir; P100 NVLink reaches ~17 GB/s uni-directional and ~35 GB/s bi-directional.)


GPUDIRECT RDMA


GPUDirect RDMA over RDMA networks

For the Linux rdma subsystem:

  • Open-source nvidia_peer_memory kernel module¹
    • important bug fix in ver 1.0-3!
    • enables NVIDIA GPUDirect RDMA on the OpenFabrics stack
  • Multiple vendors:
    • Mellanox²: ConnectX-3 to ConnectX-5, IB/RoCE
    • Chelsio³: T5, iWARP
    • Others to come

For better network communication latency

1 https://github.com/Mellanox/nv_peer_memory
2 http://www.mellanox.com/page/products_dyn?product_family=116
3 http://www.chelsio.com/gpudirect-rdma


GPUDirect RDMA over Infiniband

For bandwidth:

$ git clone git://git.openfabrics.org/~grockah/perftest.git
$ cd perftest
$ ./autogen.sh
$ export CUDA_H_PATH=/usr/local/cuda-8.0/include/cuda.h
$ ./configure --prefix=$HOME/test
$ make all install

E.g. host to GPU memory (H-G) BW test:

server$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda
client$ ~/test/bin/ib_write_bw -n 1000 -O -a server.name.org

GPU to GPU memory (G-G) BW test:

server$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda
client$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda server.name.org

Benchmarking bandwidth


DGX-1 GPUDirect RDMA uni-dir BW

IB message size, 5000 iterations, RC protocol

(Chart: RDMA BW in MB/s vs IB message size, 2 B – 8 MB, for H-H, H-G, G-H and G-G transfers; DGX-1, ConnectX-4, P100 SXM, socket0 to socket1.)


GPUDIRECT ASYNC


GPUDIRECT ASYNC

Leverage the GPU front-end unit:

  • Communications prepared by the CPU
    • hardly parallelizable, branch intensive
  • Communications run by the GPU front-end unit
    • the same unit scheduling GPU work
    • now also scheduling network communications
  • The GPU orchestrates the flow

(Diagram: GPU front-end unit and compute engines.)


GPUDIRECT ASYNC OVER OFA VERBS

  • Prerequisites: nvidia_peer_memory driver, GDRcopy1 library
  • CUDA 8.0+ Stream Memory Operations (MemOps) APIs
  • MLNX OFED 4.0+ Peer-Direct Async Verbs APIs
  • libgdsync2: bridging CUDA & IB Verbs
  • MPI: experimental support in MVAPICH2-GDS3
  • libmp: lightweight, MPI-like stream-sync primitives, internal benchmarking

1 http://github.com/NVIDIA/gdrcopy
2 http://github.com/gpudirect/libgdsync, devel branch
3 see DK Panda's talk

SW stack bits


SW STACK

(Diagram: user-mode — applications and benchmarks on top of CUDA RT, libmp, Open MPI and MVAPICH2 (with extensions for RDMA/Async), over libgdsync and IB Verbs with extensions for Async; kernel-mode — CUDA driver and NV display driver (proprietary), nv_peer_mem, IB core, mlx5 and cxgb4 (open-source); HW — GPU and HCA.)


GPUDIRECT ASYNC OVER INFINIBAND

May need special HCA configuration on Kepler/Maxwell GPUs, e.g. on Mellanox:

$ mlxconfig -d /dev/mst/mtxxx_pciconf0 set NON_PREFETCHABLE_PF_BAR=1
$ reboot

Enable GPU peer mappings:

$ cat /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RegistryDwords="PeerMappingOverride=1"

Requirements


CUDA STREAM MemOps

CU_STREAM_WAIT_VALUE_GEQ   = 0x0,
CU_STREAM_WAIT_VALUE_EQ    = 0x1,
CU_STREAM_WAIT_VALUE_AND   = 0x2,
CU_STREAM_WAIT_VALUE_NOR   = 0x3,
CU_STREAM_WAIT_VALUE_FLUSH = 1<<30

CUresult cuStreamWaitValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
CUresult cuStreamWaitValue64(CUstream stream, CUdeviceptr addr, cuuint64_t value, unsigned int flags);

CU_STREAM_WRITE_VALUE_NO_MEMORY_BARRIER = 0x1

CUresult cuStreamWriteValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
CUresult cuStreamWriteValue64(CUstream stream, CUdeviceptr addr, cuuint64_t value, unsigned int flags);

CU_STREAM_MEM_OP_WAIT_VALUE_32       = 1,
CU_STREAM_MEM_OP_WRITE_VALUE_32      = 2,
CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES = 3,
CU_STREAM_MEM_OP_WAIT_VALUE_64       = 4,
CU_STREAM_MEM_OP_WRITE_VALUE_64      = 5

CUresult cuStreamBatchMemOp(CUstream stream, unsigned int count, CUstreamBatchMemOpParams *paramArray, unsigned int flags);

  • wait ops: polling on a 32/64-bit word
  • write ops: 32/64-bit word write
  • cuStreamBatchMemOp: lower-overhead batched work submission
  • FLUSH: guarantees memory consistency for RDMA


CUDA STREAM MemOps

API features:

  • batching multiple consecutive MemOps saves ~1.5 us per op
    • use cuStreamBatchMemOp()
  • the APIs accept device pointers
    • memory needs registration (cuMemHostRegister)
    • device pointer retrieval (cuMemHostGetDevicePointer)
  • 3rd party device PCIe resources (aka BARs)
    • assumed physically contiguous & uncached
    • special flag needed in cuMemHostRegister

ASYNC: LIBMP


LIBMP

Lightweight message passing library

  • thin layer on top of IB Verbs
  • in-order receive buffer matching: no tags, no wildcards, no data types
  • zero-copy transfers, uses the flow control of the IB RC transport

Eases benchmarking of multi-GPU applications Not released (yet)


LIBMP

Prototype message passing library

                      PREPARED BY   TRIGGERED BY
CPU synchronous       CPU           CPU
Stream synchronous    CPU           GPU front-end unit
Kernel initiated      CPU           GPU SMs


LIBMP CPU-SYNC COMMUNICATION

// Send/Recv

int mp_irecv(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);
int mp_isend(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);

// Put/Get

int mp_window_create(void *addr, size_t size, mp_window_t *window_t);
int mp_iput(void *src, int size, mp_reg_t *src_reg, int peer, size_t displ, mp_window_t *dst_window_t, mp_request_t *req);
int mp_iget(void *dst, int size, mp_reg_t *dst_reg, int peer, size_t displ, mp_window_t *src_window_t, mp_request_t *req);

// Wait

int mp_wait(mp_request_t *req);
int mp_wait_all(uint32_t count, mp_request_t *req);


LIBMP STREAM-SYNC COMMUNICATION

// Send

int mp_isend_on_stream(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req, cudaStream_t stream);

// Put/Get

int mp_iput_on_stream(void *src, int size, mp_reg_t *src_reg, int peer, size_t displ, mp_window_t *dst_window_t, mp_request_t *req, cudaStream_t stream);
int mp_iget_on_stream(void *dst, int size, mp_reg_t *dst_reg, int peer, size_t displ, mp_window_t *src_window_t, mp_request_t *req, cudaStream_t stream);

// Wait

int mp_wait_on_stream(mp_request_t *req, cudaStream_t stream);
int mp_wait_all_on_stream(uint32_t count, mp_request_t *req, cudaStream_t stream);


CUDA stream-sync communications

Loop {
    mp_irecv(…);
    compute<<<…, stream>>>(buf);
    mp_isend_on_stream(…);
    mp_wait_all_on_stream(…);
}



LIBMP STREAM-SYNC batch submission

// Prepare single requests, IB Verbs side only

int mp_send_prepare(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);

// Post set of prepared requests

int mp_isend_post_on_stream(mp_request_t *req, cudaStream_t stream);
int mp_isend_post_all_on_stream(uint32_t count, mp_request_t *req, cudaStream_t stream);


LIBMP KERNEL-SYNC COMMUNICATION

// Host side code // extract descriptors from prepared requests

int mp::mlx5::get_descriptors(send_desc_t *send_info, mp_request_t *req);
int mp::mlx5::get_descriptors(wait_desc_t *wait_info, mp_request_t *req);

// Device side code

__device__ int mp::device::mlx5::send(send_desc_t *send_info);
__device__ int mp::device::mlx5::wait(wait_desc_t *wait_info);


Kernel-sync communication

Loop {
    mp_irecv(…, rreq);
    mp_get_descriptors(wi, rreq);
    mp_send_prepare(…, sreq);
    mp_get_descriptors(si, sreq);
    compute_and_communicate<<<…, stream>>>(buf, wi, si) {
        do_something(buf);
        mlx5::send(si);
        do_something_else();
        mlx5::wait(wi);
        keep_working();
    }
}



ASYNC: experimental MPI BINDINGS


CODE SAMPLE

MPI_Comm comm = MPI_COMM_WORLD;
cudaStreamCreate(&stream);
Loop {
    kernel<<<nblocks, nthreads, 0, stream>>>(buf1, buf2);
    cudaStreamSynchronize(stream);
    MPI_Irecv(buf2, count, MPI_INT, peer, 0, comm, &req[0]);
    MPI_Isend(buf1, count, MPI_INT, peer, 0, comm, &req[1]);
    MPI_Waitall(2, req, statuses);
}

Standard GPU-aware MPI


STREAM-SYNCHRONOUS VERSION

MPI_Comm stream_comm;
cudaStreamCreate(&stream);
MPIX_Comm_dup_with_stream(MPI_COMM_WORLD, &stream_comm, &stream);
Loop {
    kernel<<<nblocks, nthreads, 0, stream>>>(buf1, buf2);
    MPI_Recv(buf2, count, MPI_BYTE, peer, 0, stream_comm);
    MPI_Send(buf1, count, MPI_BYTE, peer, 0, stream_comm);
}
MPIX_Wait_stream_completion(stream_comm);

Experimental MPI extensions


ASYNC BENCHMARKS

  • LibMP models
  • HPGMG-FV
  • CoMD
  • Lulesh2


LIBMP MODELS

Three different execution models:

  • CPU-synchronous (default)
  • CUDA Stream-synchronous, CPU asynchronous (SA)
    • communications triggered by a CUDA stream
  • CUDA Kernel-synchronous (or Kernel-Initiated, KI)
    • communications triggered by CUDA kernel threads

Need to evaluate performance: 3 multi-process applications ported from MPI to LibMP

Summary


HPGMG-CUDA


HPGMG-FV

Overview

High Performance Geometric Multigrid (Lawrence Berkeley National Laboratory):

  • Geometric multi-grid solver for variable-coefficient elliptic problems on isotropic Cartesian grids using the Finite Volume method
  • F-cycle: multiple V-cycles → a finer grid is smoothed and restricted into a coarser grid; a direct solver is applied to the coarsest grid and the solution is then iteratively interpolated to finer grids

Multi-process execution:

  • Each grid is divided into several same-size boxes
  • Workload is fairly distributed among processes

HPGMG-FV CUDA

Overview

  • GPU acceleration
  • Multi-process and multi-GPU, MPI
  • Level size threshold:
    • lower (coarser) levels computed on the CPU
    • higher (finer) levels computed on the GPU

HPGMG-FV CUDA

Multi-process execution: MPI communications

(Diagram: V-cycle pipeline — GPU levels (finer) and CPU levels (coarser), each running smoother, residual and exchange-boundary steps, connected by restriction and interpolation.)

  • Inter-level communication: restriction (moving from a finer level to a coarser level) and interpolation (moving from a coarser level to a finer level)
  • Intra-level communication: the exchange_boundary function (boundary region exchange between processes)


EXCHANGE BOUNDARY IMPLEMENTATION

Code comparison: MPI vs SA-model

MPI version:

cudaDeviceSynchronize();
for (p = 0; p < num_peers; p++) {
    MPI_Irecv(rBuf, …, rReqs[p]);
}
pack_kernel<<<…, stream>>>(sBuf, …);
cudaDeviceSynchronize();
for (p = 0; p < num_peers; p++) {
    MPI_Isend(sBuf, …, sReqs[p]);
}
local_kernel<<<…, stream>>>();
MPI_Waitall(rReqs, …);
unpack_kernel<<<…, stream>>>(rBuf, …);
MPI_Waitall(sReqs, …);

LibMP SA-model version:

for (p = 0; p < num_peers; p++) {
    mp_irecv(rBuf, …, rReqs[p]);
}
pack_kernel<<<…, stream>>>(sBuf, …);
for (p = 0; p < num_peers; p++) {
    mp_isend_on_stream(sBuf, …, sReqs[p], stream);
}
local_kernel<<<…, stream>>>();
mp_wait_all_on_stream(…, rReqs, stream);
unpack_kernel<<<…, stream>>>(rBuf, …);
mp_wait_all_on_stream(…, sReqs, stream);


MPI IMPLEMENTATION

Consecutive exchange_boundary() calls, CUDA Visual Profiler

(Profiler timeline: host, GPU stream and network rows showing synchronization, CUDA kernels and communications — Isend, WaitAll.)


MPI IMPLEMENTATION

Consecutive exchange_boundary() calls, CUDA Visual Profiler

(Profiler timeline: GPU idle time visible between kernels on the stream row.)


SA MODEL IMPLEMENTATION

Consecutive exchange_boundary() calls, CUDA Visual Profiler

(Profiler timeline: host, GPU stream and network rows; CUDA kernels and communications — Isend, Wait, Recv, Wait, Isend — enqueued via CUDA 8.0 driver functions.)


SA MODEL IMPLEMENTATION

Wilkes cluster - performance gain

  • Wilkes cluster (Cambridge, UK):
    • Tesla K20 GPUs
    • CUDA 8.0
    • Mellanox Connect-IB, OFED 3.2
    • OpenMPI
  • Up to 24% time improvement wrt MPI
  • 4 processes, single GPU level:
    • 64% CPU work time reduction
    • 28% GPU activity time reduction
  • The bigger the box size, the larger the messages: communication overhead becomes less important

KI MODEL IMPLEMENTATION

How to

For each exchange_boundary() execution:

  • Kernel fusion: a single fused CUDA kernel instead of three
  • Move communications inside the fused kernel
  • Respect mutual dependencies (hard!)

(Diagram: CPU launches the fused exchange_boundary kernel; the GPU runs it.)


KI MODEL IMPLEMENTATION

exchange_boundary() fused kernel – CUDA blocks organization

(Diagram: fused kernel with CUDA block sets 1–3; coordination via the CUB library and a GPU global-memory variable.)


KI MODEL IMPLEMENTATION

Wilkes cluster - performance gain

  • Wilkes cluster (Cambridge, UK):
    • Tesla K20 GPUs
    • CUDA 8.0
    • Mellanox Connect-IB, OFED 3.2
    • OpenMPI
  • Up to 26% time improvement wrt MPI
  • 4 processes, single GPU level:
    • 77% CPU work time reduction
    • 32% GPU activity time reduction

DGX-1 CLUSTER BENCHMARKS

Performance gain

  • DGX-1 cluster (NVIDIA):
    • Tesla P100 GPUs
    • CUDA 8.0
    • Mellanox ConnectX-4, OFED 3.4
    • LibMP v2.0
    • OpenMPI 1.10.5
  • 2 processes only: gain about 8% less than on Wilkes
  • Gain wrt MPI for the CPU-sync default, the SA model and the KI model

COMD-CUDA


COMD-CUDA

Implementation

  • CoMD (www.exmatex.org project) is a classical molecular dynamics proxy application
  • O(N²) within cutoff: it considers materials where the interatomic potentials are short range, and the simulation requires the evaluation of all forces between atom pairs within the cutoff distance
  • Dependency between CUDA kernels and communications
  • Explored SA-model only

SA MODEL IMPLEMENTATION

Communication periods — code comparison

MPI version:

Repeat 3 times {
    load_kernel<<<…, stream>>>(sBufA, …);
    load_kernel<<<…, stream>>>(sBufB, …);
    cudaDeviceSynchronize();
    MPI_Sendrecv(sBufA, rBufA, …);
    MPI_Sendrecv(sBufB, rBufB, …);
    cuda_tasks(stream);
    unload_kernel<<<…, stream>>>(rBufA, …);
    unload_kernel<<<…, stream>>>(rBufB, …);
}

LibMP SA-model version:

for (p = 0; p < (3 * 2); p++) {
    mp_irecv(…, rReqs[p]);
}
for (p = 0; p < (3 * 2); p += 2) {
    load_kernel<<<…, stream>>>(sBufA, …);
    mp_isend_on_stream(…, sReqs[p], stream);
    load_kernel<<<…, stream>>>(sBufB, …);
    mp_isend_on_stream(…, sReqs[p+1], stream);
    cuda_tasks(stream);
    mp_wait_all_on_stream(…, rReqs+p, stream);
    unload_kernel<<<…, stream>>>(rBufA, …);
    unload_kernel<<<…, stream>>>(rBufB, …);
    mp_wait_all_on_stream(…, sReqs+p, stream);
}


SA MODEL PERFORMANCE GAIN

Wilkes cluster – communications gain

  • Wilkes cluster (Cambridge, UK):
    • Tesla K20 GPUs
    • CUDA 8.0
    • Mellanox Connect-IB, OFED 3.2
    • OpenMPI
  • 25% ~ 35% time gain
  • Communication periods only
  • Gain of the SA model wrt MPI

SA MODEL PERFORMANCE GAIN

DGX-1 cluster – communications

  • DGX-1 cluster (NVIDIA):
    • Tesla P100 GPUs
    • CUDA 8.0
    • Mellanox ConnectX-4, OFED 3.4
    • LibMP v2.0
    • OpenMPI 1.10.5
  • 4 processes only: gain 13% less than on Wilkes
  • Communication periods only
  • Gain of the SA model wrt MPI

LULESH2-CUDA


LULESH2-CUDA

Implementation

  • Proxy application developed at Lawrence Livermore National Laboratory
  • "… It approximates the hydrodynamics equations discretely by partitioning the spatial problem domain into a collection of volumetric elements defined by a mesh." (https://codesign.llnl.gov/lulesh.php)
  • Dependency between CUDA kernels and communications
  • Explored SA-model only

SA MODEL IMPLEMENTATION

Communication periods — code comparison

MPI version:

Repeat up to 26 times {
    MPI_Irecv(rBuf, …);
}
……
Repeat up to 26 times {
    cuda_kernel<<<…, stream>>>(sBuf, …);
    cudaDeviceSynchronize();
    cudaMemcpyAsync(…, stream);
    MPI_Isend(sBuf, …);
}
MPI_Waitall(…);
……
Repeat up to 26 times {
    MPI_Wait(…);
    cuda_kernel<<<…, stream>>>(rBuf, …);
    cudaMemcpyAsync(…, stream);
    cudaDeviceSynchronize();
}

LibMP SA-model version:

Repeat up to 26 times {
    mp_irecv(rBuf, …, rReqs);
}
……
Repeat up to 26 times {
    cuda_kernel<<<…, stream>>>(…);
    cudaMemcpyAsync(…, stream);
    mp_isend_on_stream(…, sReqs, stream);
}
mp_wait_all_on_stream(…, sReqs, stream);
……
Repeat up to 26 times {
    mp_wait_on_stream(…, rReqs, stream);
    cudaMemcpyAsync(…, stream);
    cuda_kernel<<<…, stream>>>(…);
}


SA MODEL IMPLEMENTATION

Wilkes cluster - performance gain

  • Progressively increasing the number of iterations to intensify the GPU workload
  • Wilkes cluster (Cambridge, UK):
    • Tesla K20 GPUs
    • CUDA 8.0
    • Mellanox Connect-IB, OFED 3.2
    • OpenMPI
  • Average gain of 13%
  • 27 and 64 processes

GAME OVER
