S7128 - HOW TO ENABLE NVIDIA CUDA STREAM SYNCHRONOUS COMMUNICATIONS USING GPUDIRECT

Davide Rossetti, Elena Agostini


GPUDIRECT ELSEWHERE AT GTC2017

  • H7130 - CONNECT WITH THE EXPERTS: NVIDIA GPUDIRECT TECHNOLOGIES ON MELLANOX NETWORK INTERCONNECTS (Today 5pm, D.Rossetti @Mellanox)
  • S7155 - OPTIMIZED INTER-GPU COLLECTIVE OPERATIONS WITH NCCL (Tue 9am, S.Jeaugey @NVIDIA)
  • S7142 - MULTI-GPU PROGRAMMING MODELS (Wed 1pm, S.Potluri, J.Krauss @NVIDIA)
  • S7489 - CLUSTERING GPUS WITH ETHERNET (Wed 4pm, F.Osman @Broadcom)
  • S7356 - MVAPICH2-GDR: PUSHING THE FRONTIER OF HPC AND DEEP LEARNING (Thu 2pm, D.K.Panda @OSU)


AGENDA

  • GPUDirect technologies
  • NVLINK-enabled multi-GPU systems
  • GPUDirect P2P
  • GPUDirect RDMA
  • GPUDirect Async
  • Async benchmarks & application results


INTRODUCTION TO GPUDIRECT TECHNOLOGIES


GPUDIRECT FAMILY¹

Technologies enabling products!

  • GPUDIRECT SHARED GPU-SYSMEM: GPU pinned memory shared with other RDMA-capable devices; avoids intermediate copies
  • GPUDIRECT P2P: accelerated GPU-GPU memory copies; inter-GPU direct load/store access
  • GPUDIRECT RDMA²: direct GPU to 3rd party device transfers, e.g. direct I/O, optimized inter-node communication
  • GPUDIRECT ASYNC: direct GPU to 3rd party device synchronizations, e.g. optimized inter-node communication

[1] https://developer.nvidia.com/gpudirect [2] http://docs.nvidia.com/cuda/gpudirect-rdma


GPUDIRECT

scopes

  • GPUDirect P2P → data
  • Intra-node
  • GPUs both master and slave
  • Over PCIe or NVLink
  • GPUDirect RDMA → data
  • Inter-node
  • GPU slave, 3rd party device master
  • Over PCIe
  • GPUDirect Async → control
  • GPU & 3rd party device, master & slave
  • Over PCIe

(Diagram: GPUDirect RDMA/P2P form the data plane, GPUDirect Async the control plane, between the HOST and the GPUs.)


NVLINK-enabled Multi-GPU servers


NVIDIA DGX-1

AI Supercomputer-in-a-Box

170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh 2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U — 3200W


DGX-1 SYSTEM TOPOLOGY

GPU – CPU link:

PCIe 12.5+12.5 GB/s eff BW

GPUDirect P2P:

GPU – GPU link is NVLink; cube-mesh topology, not all-to-all

GPUDirect RDMA:

GPU – NIC link is PCIe


IBM MINSKY

  • 2 POWER8 CPUs with NVLink
  • 4 NVIDIA Tesla P100 GPUs
  • 256 GB system memory
  • 2 SSD storage devices
  • High-speed interconnect: IB or Ethernet

Optional: up to 1 TB system memory; PCIe-attached NVMe storage


(System topology diagram: two POWER8 CPUs, each attached to two P100 GPUs over NVLink; 80 GB/s and 115 GB/s link bandwidths shown between GPU memory, GPUs and system memory.)

IBM MINSKY SYSTEM TOPOLOGY

GPU – CPU link:

2x NVLINK 40+40 GB/s raw BW

GPUDirect P2P:

GPU – GPU link is 2x NVLink; two-cliques topology

GPUDirect RDMA:

Not supported


GPUDIRECT AND MULTI-GPU SYSTEMS THE CASE OF DGX-1


HOW TO’S

Device topology, link type and capabilities:

  • GPUa – GPUb link: P2P over NVLINK vs PCIe, speed, etc.
  • Same for the CPU – GPU link: NVLINK or PCIe
  • Same for the NIC – GPU link (HWLOC)

Select an optimized GPU/CPU/NIC combination in MPI runs; enable GPUDirect RDMA


CUDA LINK CAPABILITIES

// CUDA driver API

typedef enum CUdevice_P2PAttribute_enum {
    CU_DEVICE_P2P_ATTRIBUTE_PERFORMANCE_RANK        = 0x01,
    CU_DEVICE_P2P_ATTRIBUTE_ACCESS_SUPPORTED        = 0x02,
    CU_DEVICE_P2P_ATTRIBUTE_NATIVE_ATOMIC_SUPPORTED = 0x03
} CUdevice_P2PAttribute;

CUresult cuDeviceGetP2PAttribute(int *value, CUdevice_P2PAttribute attrib,
                                 CUdevice srcDevice, CUdevice dstDevice);

// CUDA runtime API

cudaError_t cudaDeviceGetP2PAttribute(int *value, enum cudaDeviceP2PAttr attr,
                                      int srcDevice, int dstDevice);

Basic info, GPU–GPU links only. PERFORMANCE_RANK is a relative value indicating the performance of the link between two GPUs (NVLINK ranks higher than PCIe); NATIVE_ATOMIC_SUPPORTED indicates whether GPU kernels can perform remote native atomics.


GPUDIRECT P2P: NVLINK VS PCIE

Note: some GPU pairs are not connected, e.g. GPU0–GPU7. Note 2: others have multiple potential links (NVLINK and PCIe) but cannot use both at the same time!

NVLINK transparently picked if available

cudaSetDevice(0);
cudaMalloc(&buf0, size);
cudaDeviceCanAccessPeer(&access, 0, 1);
assert(access == 1);
cudaDeviceEnablePeerAccess(1, 0);
cudaSetDevice(1);
cudaMalloc(&buf1, size);
…
cudaSetDevice(0);
cudaMemcpy(buf0, buf1, size, cudaMemcpyDefault);


MULTI GPU RUNS ON DGX-1

  • Create a wrapper script
  • Use the local MPI rank (MPI implementation dependent)
  • Don't use CUDA_VISIBLE_DEVICES, it hurts P2P!
  • Use environment variables to pass the selection down to MPI and the app
  • In the application: cudaSetDevice(atoi(getenv("USE_GPU")))
  • Run the wrapper script

Select best GPU/CPU/NIC for each MPI rank

$ cat wrapper.sh
if [ ! -z $OMPI_COMM_WORLD_LOCAL_RANK ]; then
    lrank=$OMPI_COMM_WORLD_LOCAL_RANK
elif [ ! -z $MV2_COMM_WORLD_LOCAL_RANK ]; then
    lrank=$MV2_COMM_WORLD_LOCAL_RANK
fi
if (( $lrank > 7 )); then echo "too many ranks"; exit; fi
case ${HOSTNAME} in
*dgx*)
    USE_GPU=$((2*($lrank%4)+$lrank/4))   # 0,2,4,6,1,3,5,7
    export USE_SOCKET=$(($USE_GPU/4))    # 0,0,1,1,0,0,1,1
    HCA=mlx5_$(($USE_GPU/2))             # 0,1,2,3,0,1,2,3
    export OMPI_MCA_btl_openib_if_include=${HCA}
    export MV2_IBA_HCA=${HCA}
    export USE_GPU;;
…
esac
numactl --cpunodebind=${USE_SOCKET} -l $@

$ mpirun -np N wrapper.sh myapp param1 param2 …


NVML¹ NVLINK

nvmlReturn_t nvmlDeviceGetNvLinkVersion(nvmlDevice_t device, unsigned int link, unsigned int *version);
nvmlReturn_t nvmlDeviceGetNvLinkState(nvmlDevice_t device, unsigned int link, nvmlEnableState_t *isActive);
nvmlReturn_t nvmlDeviceGetNvLinkCapability(nvmlDevice_t device, unsigned int link, nvmlNvLinkCapability_t capability, unsigned int *capResult);
nvmlReturn_t nvmlDeviceGetNvLinkRemotePciInfo(nvmlDevice_t device, unsigned int link, nvmlPciInfo_t *pci);

Link discovery and info APIs

1 http://docs.nvidia.com/deploy/nvml-api/

  • nvmlDevice handles are separate from CUDA GPU ids (NVML sees all devices vs CUDA_VISIBLE_DEVICES)
  • NVML_NVLINK_MAX_LINKS = 6
  • See later for capabilities
  • RemotePciInfo returns the domain:bus:device.function PCI identifier of the device on the other side of the link; it can be a socket PCIe bridge (IBM POWER8)


NVLINK CAPABILITIES

nvidia-smi nvlink -i <GPU id> -c

typedef enum nvmlNvLinkCapability_enum {
    NVML_NVLINK_CAP_P2P_SUPPORTED  = 0,
    NVML_NVLINK_CAP_SYSMEM_ACCESS  = 1,
    NVML_NVLINK_CAP_P2P_ATOMICS    = 2,
    NVML_NVLINK_CAP_SYSMEM_ATOMICS = 3,
    NVML_NVLINK_CAP_SLI_BRIDGE     = 4,
    NVML_NVLINK_CAP_VALID          = 5,
} nvmlNvLinkCapability_t;

On DGX-1


NVLINK COUNTERS

  • Per GPU (-i 0), per link (-l <0..3>)
  • Two sets of counters (-g <0|1>)
  • Per-set counter types: cycles, packets, bytes (-sc xyz)
  • Reset individually (-r <0|1>)

On DGX-1


NVML TOPOLOGY

nvmlReturn_t nvmlDeviceGetTopologyNearestGpus(nvmlDevice_t device, nvmlGpuTopologyLevel_t level, unsigned int *count, nvmlDevice_t *deviceArray);
nvmlReturn_t nvmlDeviceGetTopologyCommonAncestor(nvmlDevice_t device1, nvmlDevice_t device2, nvmlGpuTopologyLevel_t *pathInfo);
nvmlReturn_t nvmlSystemGetTopologyGpuSet(unsigned int cpuNumber, unsigned int *count, nvmlDevice_t *deviceArray);
nvmlReturn_t nvmlDeviceGetCpuAffinity(nvmlDevice_t device, unsigned int cpuSetSize, unsigned long *cpuSet);

GPU-GPU & GPU-CPU topology query APIs¹

1 http://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html

NVML_TOPOLOGY_INTERNAL, NVML_TOPOLOGY_SINGLE, NVML_TOPOLOGY_MULTIPLE, NVML_TOPOLOGY_HOSTBRIDGE, NVML_TOPOLOGY_CPU, NVML_TOPOLOGY_SYSTEM,


SYSTEM TOPOLOGY

$ nvidia-smi topo -m

On DGX-1


SYSTEM TOPOLOGY

$ nvidia-smi topo -mp

On DGX-1, PCIe only


GPUDIRECT P2P


DGX-1 P2P PERFORMANCE

In the CUDA toolkit samples:
Sources: samples/1_Utilities/p2pBandwidthLatencyTest
Binary: samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest

p2pBandwidthLatencyTest


DGX-1 P2P PERFORMANCE

In the CUDA toolkit demo suite: /usr/local/cuda-8.0/extras/demo_suite/busGrind -h
Usage:
  -h: print usage
  -p [0,1] enable or disable pinned memory tests (default on)
  -u [0,1] enable or disable unpinned memory tests (default off)
  -e [0,1] enable or disable p2p enabled memory tests (default on)
  -d [0,1] enable or disable p2p disabled memory tests (default off)
  -a enable all tests
  -n disable all tests

busGrind


Intra-node MPI BW


GPU-aware MPI running over GPUDirect P2P; dual IVB Xeon 2U server (K40 PCIe) vs DGX-1 (P100 NVLink).

(Chart: bandwidth in GB/s vs message size, 1K–16M bytes, for k40-pcie, k40-pcie-bidir, p100-nvlink and p100-nvlink-bidir; P100 NVLink reaches ~17 GB/s uni-directional and ~35 GB/s bi-directional.)


GPUDIRECT RDMA


GPUDirect RDMA over RDMA networks

For the Linux rdma subsystem:

  • Open-source nvidia_peer_memory kernel module¹
    • important bug fix in ver 1.0-3!
    • enables NVIDIA GPUDirect RDMA on the OpenFabrics stack
  • Multiple vendors:
    • Mellanox²: ConnectX-3 to ConnectX-5, IB/RoCE
    • Chelsio³: T5, iWARP
    • Others to come

For better network communication latency

1 https://github.com/Mellanox/nv_peer_memory
2 http://www.mellanox.com/page/products_dyn?product_family=116
3 http://www.chelsio.com/gpudirect-rdma


GPUDirect RDMA over Infiniband

For bandwidth:

$ git clone git://git.openfabrics.org/~grockah/perftest.git
$ cd perftest
$ ./autogen.sh
$ export CUDA_H_PATH=/usr/local/cuda-8.0/include/cuda.h
$ ./configure --prefix=$HOME/test
$ make all install

E.g. host to GPU memory (H-G) BW test:

server$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda
client$ ~/test/bin/ib_write_bw -n 1000 -O -a server.name.org

GPU to GPU memory (G-G) BW test:

server$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda
client$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda server.name.org

Benchmarking bandwidth


DGX-1 GPUDirect RDMA uni-dir BW

IB message size, 5000 iterations, RC protocol

(Chart: RDMA BW in MB/s vs IB message size, 2 B – 8 MB, for H-H, H-G, G-H and G-G transfers; DGX-1, ConnectX-4, P100 SXM, socket0 to socket1.)


GPUDIRECT ASYNC


GPUDIRECT ASYNC

Leverage the GPU front-end unit:

  • Communications prepared by the CPU
    • hardly parallelizable, branch intensive
  • Communications run by the GPU front-end unit
    • the same unit scheduling GPU work
    • now also scheduling network communications
  • The GPU orchestrates the flow

(Diagram: GPU front-end unit and compute engines.)


GPUDIRECT ASYNC OVER OFA VERBS

  • Prerequisites: nvidia_peer_memory driver, GDRcopy1 library
  • CUDA 8.0+ Stream Memory Operations (MemOps) APIs
  • MLNX OFED 4.0+ Peer-Direct Async Verbs APIs
  • libgdsync2: bridging CUDA & IB Verbs
  • MPI: experimental support in MVAPICH2-GDS3
  • libmp: lightweight, MPI-like stream-sync primitives, internal benchmarking

1 http://github.com/NVIDIA/gdrcopy
2 http://github.com/gpudirect/libgdsync, devel branch
3 see DK Panda's talk

SW stack bits


SW STACK

(Diagram: user-mode — applications and benchmarks on top of CUDA RT, libmp, Open MPI and MVAPICH2 (with extensions for RDMA/Async), over libgdsync and IB Verbs with extensions for Async; kernel-mode — CUDA driver and NV display driver (proprietary), nv_peer_mem, IB core, mlx5 and cxgb4 (open-source); HW — GPU and HCA.)


GPUDIRECT ASYNC OVER INFINIBAND

May need special HCA configuration on Kepler/Maxwell GPUs, e.g. on Mellanox:

$ mlxconfig -d /dev/mst/mtxxx_pciconf0 set NON_PREFETCHABLE_PF_BAR=1
$ reboot

Enable GPU peer mappings:

$ cat /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RegistryDwords="PeerMappingOverride=1"

Requirements


CUDA STREAM MemOps

CU_STREAM_WAIT_VALUE_GEQ   = 0x0,
CU_STREAM_WAIT_VALUE_EQ    = 0x1,
CU_STREAM_WAIT_VALUE_AND   = 0x2,
CU_STREAM_WAIT_VALUE_NOR   = 0x3,
CU_STREAM_WAIT_VALUE_FLUSH = 1<<30

CUresult cuStreamWaitValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
CUresult cuStreamWaitValue64(CUstream stream, CUdeviceptr addr, cuuint64_t value, unsigned int flags);

CU_STREAM_WRITE_VALUE_NO_MEMORY_BARRIER = 0x1

CUresult cuStreamWriteValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
CUresult cuStreamWriteValue64(CUstream stream, CUdeviceptr addr, cuuint64_t value, unsigned int flags);

CU_STREAM_MEM_OP_WAIT_VALUE_32       = 1,
CU_STREAM_MEM_OP_WRITE_VALUE_32      = 2,
CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES = 3,
CU_STREAM_MEM_OP_WAIT_VALUE_64       = 4,
CU_STREAM_MEM_OP_WRITE_VALUE_64      = 5

CUresult cuStreamBatchMemOp(CUstream stream, unsigned int count, CUstreamBatchMemOpParams *paramArray, unsigned int flags);

  • wait ops: polling on a 32/64-bit word
  • write ops: 32/64-bit word write
  • cuStreamBatchMemOp: lower-overhead batched work submission
  • FLUSH: guarantees memory consistency for RDMA


CUDA STREAM MemOps

API features:

  • batching multiple consecutive MemOps saves ~1.5 us per op
    • use cuStreamBatchMemOp()
  • the APIs accept device pointers
    • memory needs registration (cuMemHostRegister)
    • device pointer retrieval (cuMemHostGetDevicePointer)
  • 3rd party device PCIe resources (aka BARs)
    • assumed physically contiguous & uncached
    • special flag needed in cuMemHostRegister

ASYNC: LIBMP


LIBMP

Lightweight message passing library

  • thin layer on top of IB Verbs
  • in-order receive buffer matching: no tags, no wildcards, no data types
  • zero-copy transfers, uses the flow control of the IB RC transport

Eases benchmarking of multi-GPU applications Not released (yet)


LIBMP

Prototype message passing library

                      PREPARED BY   TRIGGERED BY
CPU synchronous       CPU           CPU
Stream synchronous    CPU           GPU front-end unit
Kernel initiated      CPU           GPU SMs


LIBMP CPU-SYNC COMMUNICATION

// Send/Recv

int mp_irecv(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);
int mp_isend(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);

// Put/Get

int mp_window_create(void *addr, size_t size, mp_window_t *window_t);
int mp_iput(void *src, int size, mp_reg_t *src_reg, int peer, size_t displ, mp_window_t *dst_window_t, mp_request_t *req);
int mp_iget(void *dst, int size, mp_reg_t *dst_reg, int peer, size_t displ, mp_window_t *src_window_t, mp_request_t *req);

// Wait

int mp_wait(mp_request_t *req);
int mp_wait_all(uint32_t count, mp_request_t *req);


LIBMP STREAM-SYNC COMMUNICATION

// Send

int mp_isend_on_stream(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req, cudaStream_t stream);

// Put/Get

int mp_iput_on_stream(void *src, int size, mp_reg_t *src_reg, int peer, size_t displ, mp_window_t *dst_window_t, mp_request_t *req, cudaStream_t stream);
int mp_iget_on_stream(void *dst, int size, mp_reg_t *dst_reg, int peer, size_t displ, mp_window_t *src_window_t, mp_request_t *req, cudaStream_t stream);

// Wait

int mp_wait_on_stream(mp_request_t *req, cudaStream_t stream);
int mp_wait_all_on_stream(uint32_t count, mp_request_t *req, cudaStream_t stream);


CUDA stream-sync communications

Loop {
    mp_irecv(…);
    compute<<<…, stream>>>(buf);
    mp_isend_on_stream(…);
    mp_wait_all_on_stream(…);
}



LIBMP STREAM-SYNC batch submission

// Prepare single requests, IB Verbs side only

int mp_send_prepare(void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);

// Post set of prepared requests

int mp_isend_post_on_stream(mp_request_t *req, cudaStream_t stream);
int mp_isend_post_all_on_stream(uint32_t count, mp_request_t *req, cudaStream_t stream);


LIBMP KERNEL-SYNC COMMUNICATION

// Host side code // extract descriptors from prepared requests

int mp::mlx5::get_descriptors(send_desc_t *send_info, mp_request_t *req);
int mp::mlx5::get_descriptors(wait_desc_t *wait_info, mp_request_t *req);

// Device side code

__device__ int mp::device::mlx5::send(send_desc_t *send_info);
__device__ int mp::device::mlx5::wait(wait_desc_t *wait_info);


Kernel-sync communication

Loop {
    mp_irecv(…, rreq);
    mp_get_descriptors(wi, rreq);
    mp_send_prepare(…, sreq);
    mp_get_descriptors(si, sreq);
    compute_and_communicate<<<…, stream>>>(buf, wi, si) {
        do_something(buf);
        mlx5::send(si);
        do_something_else();
        mlx5::wait(wi);
        keep_working();
    }
}



ASYNC: experimental MPI BINDINGS


CODE SAMPLE

MPI_Comm comm = MPI_COMM_WORLD;
cudaStreamCreate(&stream);
Loop {
    kernel<<<nblocks, nthreads, 0, stream>>>(buf1, buf2);
    cudaStreamSynchronize(stream);
    MPI_Irecv(buf2, count, MPI_INT, peer, 0, comm, &req[0]);
    MPI_Isend(buf1, count, MPI_INT, peer, 0, comm, &req[1]);
    MPI_Waitall(2, req, statuses);
}

Standard GPU-aware MPI


STREAM-SYNCHRONOUS VERSION

MPI_Comm stream_comm;
cudaStreamCreate(&stream);
MPIX_Comm_dup_with_stream(MPI_COMM_WORLD, &stream_comm, &stream);
Loop {
    kernel<<<nblocks, nthreads, 0, stream>>>(buf1, buf2);
    MPI_Recv(buf2, count, MPI_BYTE, peer, 0, stream_comm);
    MPI_Send(buf1, count, MPI_BYTE, peer, 0, stream_comm);
}
MPIX_Wait_stream_completion(stream_comm);

Experimental MPI extensions


ASYNC BENCHMARKS

  • LibMP models
  • HPGMG-FV
  • CoMD
  • Lulesh2


LIBMP MODELS

Three different execution models:

  • CPU-synchronous (default)
  • CUDA Stream-synchronous, CPU asynchronous (SA)
    • communications triggered by a CUDA stream
  • CUDA Kernel-synchronous (or Kernel-Initiated, KI)
    • communications triggered by CUDA kernel threads

Need to evaluate performance: 3 multi-process applications ported from MPI to LibMP

Summary


HPGMG-CUDA


HPGMG-FV

Overview

High Performance Geometric Multigrid (Lawrence Berkeley National Laboratory):

  • Geometric multi-grid solver for variable-coefficient elliptic problems on isotropic Cartesian grids using the Finite Volume method
  • F-cycle: multiple V-cycles → a finer grid is smoothed and restricted into a coarser grid; a direct solver is applied to the coarsest grid and the solution is then iteratively interpolated to finer grids

Multi-process execution:

  • Each grid is divided into several same-size boxes
  • Workload is fairly distributed among processes

HPGMG-FV CUDA

Overview

  • GPU acceleration
  • Multi-process and multi-GPU, MPI
  • Level size threshold:
    • lower (coarser) levels computed on the CPU
    • higher (finer) levels computed on the GPU

HPGMG-FV CUDA

Multi-process execution: MPI communications

(Diagram: V-cycle pipeline — GPU levels (finer) and CPU levels (coarser), each running smoother, residual and exchange-boundary steps, connected by restriction and interpolation.)

  • Inter-level communication: restriction (moving from a finer level to a coarser level) and interpolation (moving from a coarser level to a finer level)
  • Intra-level communication: the exchange_boundary function (boundary region exchange between processes)


EXCHANGE BOUNDARY IMPLEMENTATION

Code comparison: MPI vs SA-model

MPI version:

cudaDeviceSynchronize();
for (p = 0; p < num_peers; p++) {
    MPI_Irecv(rBuf, …, rReqs[p]);
}
pack_kernel<<<…, stream>>>(sBuf, …);
cudaDeviceSynchronize();
for (p = 0; p < num_peers; p++) {
    MPI_Isend(sBuf, …, sReqs[p]);
}
local_kernel<<<…, stream>>>();
MPI_Waitall(rReqs, …);
unpack_kernel<<<…, stream>>>(rBuf, …);
MPI_Waitall(sReqs, …);

LibMP SA-model version:

for (p = 0; p < num_peers; p++) {
    mp_irecv(rBuf, …, rReqs[p]);
}
pack_kernel<<<…, stream>>>(sBuf, …);
for (p = 0; p < num_peers; p++) {
    mp_isend_on_stream(sBuf, …, sReqs[p], stream);
}
local_kernel<<<…, stream>>>();
mp_wait_all_on_stream(…, rReqs, stream);
unpack_kernel<<<…, stream>>>(rBuf, …);
mp_wait_all_on_stream(…, sReqs, stream);


MPI IMPLEMENTATION

Consecutive exchange_boundary() calls, CUDA Visual Profiler

(Profiler timeline: host, GPU stream and network rows showing synchronization, CUDA kernels and communications — Isend, WaitAll.)


MPI IMPLEMENTATION

Consecutive exchange_boundary() calls, CUDA Visual Profiler

(Profiler timeline: GPU idle time visible between kernels on the stream row.)


SA MODEL IMPLEMENTATION

Consecutive exchange_boundary() calls, CUDA Visual Profiler

(Profiler timeline: host, GPU stream and network rows; CUDA kernels and communications — Isend, Wait, Recv, Wait, Isend — enqueued via CUDA 8.0 driver functions.)


SA MODEL IMPLEMENTATION

Wilkes cluster - performance gain

  • Wilkes cluster (Cambridge, UK):
    • Tesla K20 GPUs
    • CUDA 8.0
    • Mellanox Connect-IB, OFED 3.2
    • OpenMPI
  • Up to 24% time improvement wrt MPI
  • 4 processes, single GPU level:
    • 64% CPU work time reduction
    • 28% GPU activity time reduction
  • The bigger the box size, the larger the messages: communication overhead becomes less important

KI MODEL IMPLEMENTATION

How to

For each exchange_boundary() execution:

  • Kernel fusion: a single fused CUDA kernel instead of three
  • Move communications inside the fused kernel
  • Respect mutual dependencies (hard!)

(Diagram: CPU launches the fused exchange_boundary kernel; the GPU runs it.)


KI MODEL IMPLEMENTATION

exchange_boundary() fused kernel – CUDA blocks organization

(Diagram: fused kernel with CUDA block sets 1–3; coordination via the CUB library and a GPU global-memory variable.)


KI MODEL IMPLEMENTATION

Wilkes cluster - performance gain

  • Wilkes cluster (Cambridge, UK):
    • Tesla K20 GPUs
    • CUDA 8.0
    • Mellanox Connect-IB, OFED 3.2
    • OpenMPI
  • Up to 26% time improvement wrt MPI
  • 4 processes, single GPU level:
    • 77% CPU work time reduction
    • 32% GPU activity time reduction

DGX-1 CLUSTER BENCHMARKS

Performance gain

  • DGX-1 cluster (NVIDIA):
    • Tesla P100 GPUs
    • CUDA 8.0
    • Mellanox ConnectX-4, OFED 3.4
    • LibMP v2.0
    • OpenMPI 1.10.5
  • 2 processes only: gain about 8% less than on Wilkes
  • Gain wrt MPI for the CPU-sync default, the SA model and the KI model

COMD-CUDA


COMD-CUDA

Implementation

  • CoMD (www.exmatex.org project) is a classical molecular dynamics proxy application
  • O(N²) within cutoff: it considers materials where the interatomic potentials are short range, and the simulation requires the evaluation of all forces between atom pairs within the cutoff distance
  • Dependency between CUDA kernels and communications
  • Explored SA-model only

SA MODEL IMPLEMENTATION

Communication periods — code comparison

MPI version:

Repeat 3 times {
    load_kernel<<<…, stream>>>(sBufA, …);
    load_kernel<<<…, stream>>>(sBufB, …);
    cudaDeviceSynchronize();
    MPI_Sendrecv(sBufA, rBufA, …);
    MPI_Sendrecv(sBufB, rBufB, …);
    cuda_tasks(stream);
    unload_kernel<<<…, stream>>>(rBufA, …);
    unload_kernel<<<…, stream>>>(rBufB, …);
}

LibMP SA-model version:

for (p = 0; p < (3 * 2); p++) {
    mp_irecv(…, rReqs[p]);
}
for (p = 0; p < (3 * 2); p += 2) {
    load_kernel<<<…, stream>>>(sBufA, …);
    mp_isend_on_stream(…, sReqs[p], stream);
    load_kernel<<<…, stream>>>(sBufB, …);
    mp_isend_on_stream(…, sReqs[p+1], stream);
    cuda_tasks(stream);
    mp_wait_all_on_stream(…, rReqs+p, stream);
    unload_kernel<<<…, stream>>>(rBufA, …);
    unload_kernel<<<…, stream>>>(rBufB, …);
    mp_wait_all_on_stream(…, sReqs+p, stream);
}


SA MODEL PERFORMANCE GAIN

Wilkes cluster – communications gain

  • Wilkes cluster (Cambridge, UK):
    • Tesla K20 GPUs
    • CUDA 8.0
    • Mellanox Connect-IB, OFED 3.2
    • OpenMPI
  • 25% ~ 35% time gain
  • Communication periods only
  • Gain of the SA model wrt MPI

SA MODEL PERFORMANCE GAIN

DGX-1 cluster – communications

  • DGX-1 cluster (NVIDIA):
    • Tesla P100 GPUs
    • CUDA 8.0
    • Mellanox ConnectX-4, OFED 3.4
    • LibMP v2.0
    • OpenMPI 1.10.5
  • 4 processes only: gain 13% less than on Wilkes
  • Communication periods only
  • Gain of the SA model wrt MPI

LULESH2-CUDA


LULESH2-CUDA

Implementation

  • Proxy application developed at Lawrence Livermore National Laboratory
  • "… It approximates the hydrodynamics equations discretely by partitioning the spatial problem domain into a collection of volumetric elements defined by a mesh." (https://codesign.llnl.gov/lulesh.php)
  • Dependency between CUDA kernels and communications
  • Explored SA-model only

SA MODEL IMPLEMENTATION

Communication periods — code comparison

MPI version:

Repeat up to 26 times {
    MPI_Irecv(rBuf, …);
}
……
Repeat up to 26 times {
    cuda_kernel<<<…, stream>>>(sBuf, …);
    cudaDeviceSynchronize();
    cudaMemcpyAsync(…, stream);
    MPI_Isend(sBuf, …);
}
MPI_Waitall(…);
……
Repeat up to 26 times {
    MPI_Wait(…);
    cuda_kernel<<<…, stream>>>(rBuf, …);
    cudaMemcpyAsync(…, stream);
    cudaDeviceSynchronize();
}

LibMP SA-model version:

Repeat up to 26 times {
    mp_irecv(rBuf, …, rReqs);
}
……
Repeat up to 26 times {
    cuda_kernel<<<…, stream>>>(…);
    cudaMemcpyAsync(…, stream);
    mp_isend_on_stream(…, sReqs, stream);
}
mp_wait_all_on_stream(…, sReqs, stream);
……
Repeat up to 26 times {
    mp_wait_on_stream(…, rReqs, stream);
    cudaMemcpyAsync(…, stream);
    cuda_kernel<<<…, stream>>>(…);
}


SA MODEL IMPLEMENTATION

Wilkes cluster - performance gain

  • Progressively increasing the number of iterations to intensify the GPU workload
  • Wilkes cluster (Cambridge, UK):
    • Tesla K20 GPUs
    • CUDA 8.0
    • Mellanox Connect-IB, OFED 3.2
    • OpenMPI
  • Average gain of 13%
  • 27 and 64 processes

GAME OVER
