SLIDE 1

April 4-7, 2016 | Silicon Valley

STATE OF GPUDIRECT TECHNOLOGIES

Davide Rossetti(*), Sreeram Potluri, David Fontaine

SLIDE 2

OUTLOOK

  • GPUDirect overall
  • GPUDirect Async
  • SW architecture
  • CUDA Async APIs

SLIDE 3

GPUDIRECT FAMILY [1]

  • GPUDirect Shared GPU-Sysmem for optimized inter-node copy
  • GPUDirect P2P for intra-node
    • accelerated GPU-GPU memcpy
    • inter-GPU direct load/store access
  • GPUDirect RDMA [2] for optimized inter-node communication
  • GPUDirect Async for optimized inter-node communication control

[1] developer info: https://developer.nvidia.com/gpudirect
[2] http://docs.nvidia.com/cuda/gpudirect-rdma

SLIDE 4

GPUDIRECT IN THE CAVE

CERN’s NA62 experiment “probes decays of the charged kaon” [*]

[*] http://apegate.roma1.infn.it/mediawiki/index.php/NaNet_overview

SLIDE 5

GPUDIRECT

scopes

  • GPUDirect P2P → data
    • GPUs both master and slave
  • GPUDirect RDMA → data
    • GPU slave, 3rd party device master
  • GPUDirect Async → control
    • GPU & 3rd party device master & slave

[Diagram: data plane vs control plane; GPUDirect RDMA/P2P carries data between HOST and GPUs, GPUDirect Async carries control]

SLIDE 6

GPUDIRECT

scopes (2)

  • GPUDirect RDMA & Async
    • over PCIe, for low latency
  • GPUDirect P2P
    • over PCIe
    • over NVLink (Pascal only)

[Diagram: CPU, PCIe switch, 3rd party device, and two GPUs; paths marked RDMA, P2P, Async]

SLIDE 7

GPUDIRECT RDMA ON PASCAL

peak results, optimal PCIe fabric

[Chart: RDMA read and RDMA write bandwidth (GB/s, 2–14) by GPU family, GK110 vs P100]

SLIDE 8

GPUDIRECT P2P ON PASCAL

early results, P2P through NVLink

[Chart: Open MPI intra-node GPU-to-GPU point-to-point bandwidth (MB/s) for message sizes 4KB–4MB, peaking at 17.9 GB/s]

SLIDE 9

ASYNC: MOTIVATION

SLIDE 10

VISUAL PROFILE - TRADITIONAL

[Profiler timeline. Time marked for one step; domain size/GPU: 1024, boundary: 16, ghost width: 1]

SLIDE 11

VISUAL PROFILE - TRADITIONAL

[Profiler timeline. Time marked for one step; domain size/GPU: 128, boundary: 16, ghost width: 1]

CPU bound

SLIDE 12

SW ARCHITECTURE

SLIDE 13

GPUDIRECT SW ECOSYSTEM

[Diagram: the GPUDirect SW stack. User mode: applications and benchmarks on top of MVAPICH2 and Open MPI, CUDA RT, and IB verbs. Kernel mode: the NV display driver and CUDA driver, IB core with the mlx5 and cxgb4 drivers carrying peer-direct extensions[*] for RDMA, and nv_peer_mem bridging the two stacks. HW: HCA and GPU. Legend marks components as open-source, proprietary, or mixed.]

[*] MLNX OFED; Chelsio: www.openfabrics.org/~swise/ofed-3.12-1-peer-direct/OFED-3.12-1-peer-direct-20150330-1122.tgz

SLIDE 14

EXTENDED STACK

[Diagram: the same stack extended for Async. User mode adds libgdsync and libmp, IB Verbs extensions for Async, and Async extensions in MVAPICH2 and the CUDA RT; kernel mode carries peer-direct extensions[*] for RDMA/Async in mlx5 and cxgb4 Async, alongside nv_peer_mem, the NV display driver, CUDA driver, and IB core. HW: HCA and GPU.]

[*] MLNX OFED; Chelsio: www.openfabrics.org/~swise/ofed-3.12-1-peer-direct/OFED-3.12-1-peer-direct-20150330-1122.tgz

SLIDE 15

GPUDIRECT ASYNC + INFINIBAND

preview release of components

  • CUDA Async extensions, preview in CUDA 8.0 EA
  • Peer-direct async extension, in MLNX OFED 3.x, soon
  • libgdsync, on github.com/gpudirect, soon
  • libmp, on github.com/gpudirect, soon

SLIDE 16

ASYNC: APIS

SLIDE 17

GPUDIRECT ASYNC

  • CPU prepares the work plan
    • hardly parallelizable, branch-intensive
  • GPU orchestrates the flow
    • runs on the optimized front-end unit
    • the same unit scheduling GPU work, now also scheduling network communications

expose the GPU front-end unit

[Diagram: GPU front-end unit in front of the compute engines]

SLIDE 18

STREAM MEMORY OPERATIONS

  CU_STREAM_WAIT_VALUE_GEQ   = 0x0,
  CU_STREAM_WAIT_VALUE_EQ    = 0x1,
  CU_STREAM_WAIT_VALUE_AND   = 0x2,
  CU_STREAM_WAIT_VALUE_FLUSH = 1<<30
  CUresult cuStreamWaitValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);

  CU_STREAM_WRITE_VALUE_NO_MEMORY_BARRIER = 0x1
  CUresult cuStreamWriteValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);

  CU_STREAM_MEM_OP_WAIT_VALUE_32       = 1,
  CU_STREAM_MEM_OP_WRITE_VALUE_32      = 2,
  CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES = 3
  CUresult cuStreamBatchMemOp(CUstream stream, unsigned int count, CUstreamBatchMemOpParams *paramArray, unsigned int flags);

  • cuStreamWaitValue32: polling on a 32-bit word
  • cuStreamWriteValue32: 32-bit word write
  • cuStreamBatchMemOp: low-overhead batched work submission
  • CU_STREAM_WAIT_VALUE_FLUSH / CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES: guarantee memory consistency for RDMA
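
A minimal sketch of a batched submission, assuming CUDA 8.0 EA with the Async extensions; stream, d_recv_flag, and d_doorbell are hypothetical names, and CUCHECK is the error-checking macro used on the GPU PEER MAPPING slide below. The batch enqueues a wait followed by a write in a single call, amortizing the per-op submission cost noted on Slide 20:

  #include <string.h>   /* memset */

  /* enqueue a wait-then-write pair in one submission on `stream` */
  static void enqueue_wait_then_write(CUstream stream,
                                      CUdeviceptr d_recv_flag,
                                      CUdeviceptr d_doorbell)
  {
      CUstreamBatchMemOpParams ops[2];
      memset(ops, 0, sizeof(ops));

      /* op 0: block the stream until *d_recv_flag == 1 */
      ops[0].operation         = CU_STREAM_MEM_OP_WAIT_VALUE_32;
      ops[0].waitValue.address = d_recv_flag;
      ops[0].waitValue.value   = 1;
      ops[0].waitValue.flags   = CU_STREAM_WAIT_VALUE_EQ;

      /* op 1: then store a 32-bit value, e.g. into a peer-mapped doorbell */
      ops[1].operation          = CU_STREAM_MEM_OP_WRITE_VALUE_32;
      ops[1].writeValue.address = d_doorbell;
      ops[1].writeValue.value   = 0xfaf0;
      ops[1].writeValue.flags   = 0;

      /* one batched submission instead of two separate mem-op calls */
      CUCHECK(cuStreamBatchMemOp(stream, 2, ops, 0));
  }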

SLIDE 19

STREAM MEMORY OPERATIONS

GPU front-end unit

  // CPU: clear the flag in registered host memory (d_flag is its device alias)
  *(volatile uint32_t*)h_flag = 0;
  …
  // enqueued on the stream: the front-end unit polls until the flag becomes 1 (step 1),
  // then the kernel runs (step 2), then the front-end writes 2 (step 3)
  cuStreamWaitValue32(stream, d_flag, 1, CU_STREAM_WAIT_VALUE_EQ);
  calc_kernel<<<GSZ,BSZ,0,stream>>>();
  cuStreamWriteValue32(stream, d_flag, 2, 0);
  …
  // CPU: release the stream by setting the flag to 1
  *(volatile uint32_t*)h_flag = 1;
  …
  cudaStreamSynchronize(stream);
  assert(*(volatile uint32_t*)h_flag == 2);

SLIDE 20

GPUDIRECT ASYNC

API features

  • batching multiple consecutive mem ops saves ~1us per op
    • use cuStreamBatchMemOp
  • APIs accept device pointers
    • memory needs registration (cuMemHostRegister)
    • device pointer retrieval (cuMemHostGetDevicePointer); see the sketch below
  • 3rd party device PCIe resources (aka BARs)
    • assumed physically contiguous & uncached
    • special flag needed
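
A minimal sketch of the registration path for ordinary host memory, assuming a Linux host and the CUDA driver API; the function and variable names are illustrative, and page alignment is used to stay on the safe side of cuMemHostRegister's platform requirements:

  #include <cuda.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* pin a page-aligned host flag word and return its device alias */
  static CUdeviceptr map_host_flag(uint32_t **h_flag_out)
  {
      size_t pg = (size_t)sysconf(_SC_PAGESIZE);
      void *p = NULL;
      posix_memalign(&p, pg, pg);                 /* page-aligned host buffer */
      /* pin and map the buffer so the GPU can access it */
      CUCHECK(cuMemHostRegister(p, pg, CU_MEMHOSTREGISTER_DEVICEMAP));
      /* retrieve the device pointer backing the same memory */
      CUdeviceptr d_flag;
      CUCHECK(cuMemHostGetDevicePointer(&d_flag, p, 0));
      *h_flag_out = (uint32_t *)p;
      return d_flag;   /* usable with cuStreamWaitValue32 / cuStreamWriteValue32 */
  }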
SLIDE 21

GPU PEER MAPPING

accessing a 3rd party device PCIe resource from the GPU

  struct device_bar { void *ptr; CUdeviceptr d_ptr; size_t len; };

  void map_device_bar(struct device_bar *db)
  {
      // obtain the BAR mapping from the device driver
      device_driver_get_bar(&db->ptr, &db->len);
      // registration is mandatory; CU_MEMHOSTREGISTER_IOMEMORY is the new flag for I/O memory
      CUCHECK(cuMemHostRegister(db->ptr, db->len, CU_MEMHOSTREGISTER_IOMEMORY));
      // the GPU accesses the device through this device pointer
      CUCHECK(cuMemHostGetDevicePointer(&db->d_ptr, db->ptr, 0));
  }
  …
  cuStreamWriteValue32(stream, db->d_ptr + off, 0xfaf0, 0);
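
A possible teardown path, as a sketch: cuMemHostUnregister is the documented inverse of cuMemHostRegister, while device_driver_put_bar is a hypothetical counterpart to the slide's device_driver_get_bar:

  void unmap_device_bar(struct device_bar *db)
  {
      // drop the GPU mapping; db->d_ptr must no longer be used after this
      CUCHECK(cuMemHostUnregister(db->ptr));
      // hypothetical driver call releasing the BAR mapping
      device_driver_put_bar(db->ptr);
      db->ptr = NULL; db->d_ptr = 0; db->len = 0;
  }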

SLIDE 22

GPU PEER MAPPING + ASYNC

[Diagram: cuStreamWriteValue32(stream, db->d_ptr+off, 0xfaf0, 0) issued through the GPU's PCIe interface travels over the PCIe bus to phys_ptr+off in the 3rd party device's PCIe resources, storing 0xfaf0]

SLIDE 23

2DSTENCIL PERFORMANCE

weak scaling, RDMA vs RDMA+Async

[Chart: 2DStencil percentage improvement (0%–35%) over local lattice sizes 8–8192, for NP=2 and NP=4]

two/four nodes, IVB Xeon CPUs, K40m GPUs, Mellanox Connect-IB FDR, Mellanox FDR switch

SLIDE 24

CAVEATS

  • GPUDirect RDMA & Async
    • need correct/reliable forwarding of PCIe transactions
  • GPUDirect Async
    • GPU peer mapping limited to privileged processes (CUDA 8.0 EA)
  • Platform:
    • best: PCIe switch
    • limited: CPU root-complex

[Diagram: a good platform topology, with the devices behind a PCIe switch]

SLIDE 25

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join