State of GPUDirect Technologies


  1. April 4-7, 2016 | Silicon Valley. STATE OF GPUDIRECT TECHNOLOGIES. Davide Rossetti(*), Sreeram Potluri, David Fontaine

  2. OUTLOOK: GPUDirect overall • GPUDirect Async • SW architecture • CUDA Async APIs

  3. GPUDIRECT FAMILY [1]
     • GPUDirect Shared GPU-Sysmem: for optimized inter-node copy
     • GPUDirect P2P: for intra-node, accelerated GPU-GPU memcpy, inter-GPU direct load/store access
     • GPUDirect RDMA [2]: for optimized inter-node communication
     • GPUDirect Async: for optimized inter-node communication
     [1] developer info: https://developer.nvidia.com/gpudirect
     [2] http://docs.nvidia.com/cuda/gpudirect-rdma

  4. GPUDIRECT IN THE CAVE: CERN’s NA62 experiment “probes decays of the charged kaon” [*]
     [*] http://apegate.roma1.infn.it/mediawiki/index.php/NaNet_overview

  5. GPUDIRECT scopes
     • GPUDirect P2P → data: GPUs both master and slave
     • GPUDirect RDMA → data: GPU slave, 3rd party device master
     • GPUDirect Async → control: GPU & 3rd party device master & slave
     [diagram: GPU and HOST; GPUDirect RDMA/P2P on the data plane, GPUDirect Async on the control plane]

  6. GPUDIRECT scopes (2)
     • GPUDirect RDMA & Async: over PCIe, for low latency, between the GPU and a 3rd party device
     • GPUDirect P2P: over PCIe (P2P across a PCIe switch), or over NVLink (Pascal only)
     [diagram: GPU, CPU, PCIe switch, 3rd party device]

  7. GPUDIRECT RDMA ON PASCAL: peak results, optimal PCIe fabric
     [bar chart: bandwidth (GB/s) for RDMA read and RDMA write, GK110 vs. P100 GPU families]

  8. GPUDIRECT P2P ON PASCAL: early results, P2P through NVLink
     [chart: Open MPI intra-node GPU-to-GPU point-to-point bandwidth (MB/s) vs. message size, 4KB to 4MB, reaching 17.9 GB/s]

  9. ASYNC: MOTIVATION

  10. VISUAL PROFILE - TRADITIONAL (Time marked for one step; Domain size/GPU – 1024, Boundary – 16, Ghost Width – 1) [profiler timeline]

  11. VISUAL PROFILE - TRADITIONAL, CPU bound (Time marked for one step; Domain size/GPU – 128, Boundary – 16, Ghost Width – 1) [profiler timeline]

  12. SW ARCHITECTURE

  13. GPUDIRECT SW ECOSYSTEM
      [stack diagram]
      • applications / benchmarks
      • MPI: MVAPICH2, Open MPI
      • user-mode: CUDA RT, CUDA driver, IB verbs
      • kernel-mode: NV display driver, IB core, nv_peer_mem, extensions[*] for RDMA (cxgb4, mlx5)
      • HW: GPU, HCA (RDMA)
      (legend: proprietary, open-source, mixed)
      [*] MLNX OFED, Chelsio www.openfabrics.org/~swise/ofed-3.12-1-peer-direct/OFED-3.12-1-peer-direct-20150330-1122.tgz

  14. EXTENDED STACK
      [stack diagram]
      • applications / benchmarks
      • MPI: MVAPICH2, Open MPI; libmp
      • user-mode: CUDA RT, libgdsync, CUDA driver extensions for Async, IB verbs + IB Verbs extensions for Async
      • kernel-mode: NV display driver, IB core, nv_peer_mem extensions[*] for RDMA/Async, proprietary ext. for Async (cxgb4, mlx5)
      • HW: GPU, HCA (RDMA)
      (legend: proprietary, open-source, mixed, Async)
      [*] MLNX OFED, Chelsio www.openfabrics.org/~swise/ofed-3.12-1-peer-direct/OFED-3.12-1-peer-direct-20150330-1122.tgz

  15. GPUDIRECT ASYNC + INFINIBAND: preview release of components
      • CUDA Async extensions: preview in CUDA 8.0 EA
      • Peer-direct async extension: in MLNX OFED 3.x, soon
      • libgdsync: on github.com/gpudirect, soon
      • libmp: on github.com/gpudirect, soon

  16. ASYNC: APIS

  17. GPUDIRECT ASYNC: expose the GPU front-end unit
      • CPU prepares the work plan (hardly parallelizable, branch intensive)
      • GPU orchestrates the flow
      • Runs on the optimized front-end unit: the same one scheduling GPU work, now also scheduling network communications
      [diagram: GPU front-end unit and compute engines]

  18. STREAM MEMORY OPERATIONS: guarantee memory consistency for RDMA

      // polling on a 32-bit word
      CU_STREAM_WAIT_VALUE_GEQ   = 0x0,
      CU_STREAM_WAIT_VALUE_EQ    = 0x1,
      CU_STREAM_WAIT_VALUE_AND   = 0x2,
      CU_STREAM_WAIT_VALUE_FLUSH = 1<<30
      CUresult cuStreamWaitValue32(CUstream stream, CUdeviceptr addr,
                                   cuuint32_t value, unsigned int flags);

      // 32-bit word write
      CU_STREAM_WRITE_VALUE_NO_MEMORY_BARRIER = 0x1
      CUresult cuStreamWriteValue32(CUstream stream, CUdeviceptr addr,
                                    cuuint32_t value, unsigned int flags);

      // low-overhead batched work submission
      CU_STREAM_MEM_OP_WAIT_VALUE_32       = 1,
      CU_STREAM_MEM_OP_WRITE_VALUE_32      = 2,
      CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES = 3
      CUresult cuStreamBatchMemOp(CUstream stream, unsigned int count,
                                  CUstreamBatchMemOpParams *paramArray, unsigned int flags);
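      A minimal sketch of how the wait operation can guarantee RDMA consistency, assuming d_flag and d_data are device pointers into memory that a remote NIC targets with RDMA writes (payload first, then the 32-bit flag), and consume_kernel is a hypothetical consumer kernel:

      #include <cuda.h>
      #include <stdint.h>
      #include <stdio.h>

      // hypothetical consumer of the RDMA-delivered payload
      __global__ void consume_kernel(const unsigned int *data)
      {
          if (threadIdx.x == 0 && blockIdx.x == 0)
              printf("flag observed, first word = %u\n", data[0]);
      }

      void wait_for_rdma_payload(CUstream stream, CUdeviceptr d_flag, CUdeviceptr d_data)
      {
          // block the stream until the flag equals 1; OR-ing in FLUSH also flushes
          // outstanding remote writes, so the payload at d_data is visible to
          // work enqueued after the wait
          cuStreamWaitValue32(stream, d_flag, 1,
                              CU_STREAM_WAIT_VALUE_EQ | CU_STREAM_WAIT_VALUE_FLUSH);
          consume_kernel<<<64, 256, 0, stream>>>((const unsigned int *)(uintptr_t)d_data);
      }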

  19. STREAM MEMORY OPERATIONS
      [diagram: GPU front-end unit and compute engines on one side, host mem with h_flag (values 0, 1, 2) on the other; numbered steps 1-3]

      *(volatile uint32_t*)h_flag = 0;
      ...
      /* 1 */ cuStreamWaitValue32(stream, d_flag, 1, CU_STREAM_WAIT_VALUE_EQ);
              calc_kernel<<<GSZ,BSZ,0,stream>>>();
      /* 2 */ cuStreamWriteValue32(stream, d_flag, 2, 0);
      /* 3 */ ...
      *(volatile uint32_t*)h_flag = 1;
      ...
      cudaStreamSynchronize(stream);
      assert(*(volatile uint32_t*)h_flag == 2);
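      In the example above the stream blocks until the host stores 1 into the flag, runs calc_kernel, then writes 2 back, which the host observes after cudaStreamSynchronize. The slide assumes h_flag and d_flag already alias the same pinned word; one possible setup, sketched here with cuMemHostAlloc rather than the cuMemHostRegister path listed on the next slide, is:

      #include <cuda.h>
      #include <stdint.h>

      // h_flag: CPU-visible 32-bit flag; d_flag: device pointer aliasing the same word
      static volatile uint32_t *h_flag;
      static CUdeviceptr        d_flag;

      static void setup_flag(void)
      {
          void *p;
          // pinned, device-mapped host allocation; registering an existing buffer
          // with cuMemHostRegister would work the same way
          cuMemHostAlloc(&p, sizeof(uint32_t),
                         CU_MEMHOSTALLOC_PORTABLE | CU_MEMHOSTALLOC_DEVICEMAP);
          h_flag  = (volatile uint32_t *)p;
          *h_flag = 0;
          // retrieve the device pointer that the stream memory operations poll/write
          cuMemHostGetDevicePointer(&d_flag, p, 0);
      }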

  20. GPUDIRECT ASYNC APIs: features
      • batching multiple consecutive mem ops saves ~1 us per op: use cuStreamBatchMemOp (see the sketch below)
      • APIs accept device pointers:
        • memory needs registration (cuMemHostRegister)
        • device pointer retrieval (cuMemHostGetDevicePointer)
      • 3rd party device PCIe resources (aka BARs):
        • assumed physically contiguous & uncached
        • special flag needed
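      A sketch of such a batch, assuming d_flag is the mapped flag from the earlier example and d_doorbell is a hypothetical device pointer to a mapped BAR word of a 3rd party device (see slide 21); the three operations go through a single cuStreamBatchMemOp call instead of three separate submissions:

      #include <cuda.h>
      #include <string.h>

      void post_batch(CUstream stream, CUdeviceptr d_flag, CUdeviceptr d_doorbell)
      {
          CUstreamBatchMemOpParams ops[3];
          memset(ops, 0, sizeof(ops));

          // wait until the 32-bit flag equals 1
          ops[0].waitValue.operation = CU_STREAM_MEM_OP_WAIT_VALUE_32;
          ops[0].waitValue.address   = d_flag;
          ops[0].waitValue.value     = 1;
          ops[0].waitValue.flags     = CU_STREAM_WAIT_VALUE_EQ;

          // make preceding remote (RDMA) writes visible
          ops[1].flushRemoteWrites.operation = CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES;

          // ring the (hypothetical) device doorbell
          ops[2].writeValue.operation = CU_STREAM_MEM_OP_WRITE_VALUE_32;
          ops[2].writeValue.address   = d_doorbell;
          ops[2].writeValue.value     = 0xfaf0;
          ops[2].writeValue.flags     = 0;

          cuStreamBatchMemOp(stream, 3, ops, 0);
      }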

  21. GPU PEER MAPPING: accessing a 3rd party device PCIe resource from the GPU

      struct device_bar {
          void       *ptr;
          CUdeviceptr d_ptr;
          size_t      len;
      };

      void map_device_bar(device_bar *db)
      {
          device_driver_get_bar(&db->ptr, &db->len);
          // registration is mandatory; CU_MEMHOSTREGISTER_IOMEMORY is the new flag
          CUCHECK( cuMemHostRegister(db->ptr, db->len, CU_MEMHOSTREGISTER_IOMEMORY) );
          CUCHECK( cuMemHostGetDevicePointer(&db->d_ptr, db->ptr, 0) );
      }
      ...
      // GPU access to the device through the device pointer
      cuStreamWriteValue32(stream, db->d_ptr + off, 0xfaf0, 0);
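      The slide shows only the mapping path; a teardown counterpart could look like the sketch below, where device_driver_put_bar stands for a hypothetical release call in the same 3rd party driver:

      void unmap_device_bar(device_bar *db)
      {
          // undo the cuMemHostRegister above; db->d_ptr becomes invalid afterwards
          CUCHECK( cuMemHostUnregister(db->ptr) );
          // hypothetical: hand the BAR mapping back to the 3rd party device driver
          device_driver_put_bar(db->ptr, db->len);
          db->ptr = NULL; db->d_ptr = 0; db->len = 0;
      }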

  22. GPU PEER MAPPING + ASYNC
      cuStreamWriteValue32(stream, db->d_ptr + off, 0xfaf0, 0);
      [diagram: the GPU's PCIe interface posts the 32-bit write of 0xfaf0 over the PCIe bus to phys_ptr + off in the 3rd party device's PCIe resources]

  23. 2DSTENCIL PERFORMANCE: weak scaling, RDMA vs. RDMA+Async
      [chart: 2DStencil percentage improvement (0% to 35%) vs. local lattice size (8 to 8192), for NP=2 and NP=4]
      two/four nodes, IVB Xeon CPUs, K40m GPUs, Mellanox Connect-IB FDR, Mellanox FDR switch

  24. CAVEATS
      • GPUDirect RDMA & Async need a good platform: correct/reliable forwarding of PCIe transactions
      • GPUDirect Async: GPU peer mapping limited to privileged processes (CUDA 8.0 EA)
      • Platform: best with a PCIe switch, limited across the CPU root-complex

  25. April 4-7, 2016 | Silicon Valley THANK YOU JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join
