SLIDE 1

April 4-7, 2016 | Silicon Valley

STATE OF GPUDIRECT TECHNOLOGIES

Davide Rossetti(*), Sreeram Potluri, David Fontaine

SLIDE 2

OUTLOOK

  • GPUDirect overall
  • GPUDirect Async
  • SW architecture
  • CUDA Async APIs

SLIDE 3

GPUDIRECT FAMILY [1]

  • GPUDirect Shared GPU-Sysmem for optimized inter-node copy
  • GPUDirect P2P for intra-node
    • accelerated GPU-GPU memcpy
    • inter-GPU direct load/store access
  • GPUDirect RDMA [2] for optimized inter-node communication
  • GPUDirect Async for optimized inter-node communication control

[1] developer info: https://developer.nvidia.com/gpudirect
[2] http://docs.nvidia.com/cuda/gpudirect-rdma

SLIDE 4

GPUDIRECT IN THE CAVE

CERN’s NA62 experiment “probes decays of the charged kaon” [*]

[*] http://apegate.roma1.infn.it/mediawiki/index.php/NaNet_overview

SLIDE 5

GPUDIRECT

scopes

  • GPUDirect P2P → data
    • GPUs both master and slave
  • GPUDirect RDMA → data
    • GPU slave, 3rd party device master
  • GPUDirect Async → control
    • GPU & 3rd party device master & slave

[Diagram: data plane vs control plane; GPUDirect RDMA/P2P carries data between HOST and GPUs, GPUDirect Async carries control]

SLIDE 6

GPUDIRECT

scopes (2)

  • GPUDirect RDMA & Async
    • over PCIe, for low latency
  • GPUDirect P2P
    • over PCIe
    • over NVLink (Pascal only)

[Diagram: CPU, PCIe switch, 3rd party device, and two GPUs; paths marked RDMA, P2P, Async]

SLIDE 7

GPUDIRECT RDMA ON PASCAL

peak results, optimal PCIe fabric

[Chart: RDMA read and RDMA write bandwidth (GB/s, 2–14) by GPU family, GK110 vs P100]

SLIDE 8

GPUDIRECT P2P ON PASCAL

early results, P2P through NVLink

[Chart: Open MPI intra-node GPU-to-GPU point-to-point bandwidth (MB/s) for message sizes 4KB–4MB, peaking at 17.9 GB/s]

SLIDE 9

ASYNC: MOTIVATION

SLIDE 10

VISUAL PROFILE - TRADITIONAL

[Profiler timeline. Time marked for one step; domain size/GPU: 1024, boundary: 16, ghost width: 1]

SLIDE 11

VISUAL PROFILE - TRADITIONAL

[Profiler timeline. Time marked for one step; domain size/GPU: 128, boundary: 16, ghost width: 1]

CPU bound

SLIDE 12

SW ARCHITECTURE

SLIDE 13

GPUDIRECT SW ECOSYSTEM

[Diagram: the GPUDirect SW stack. User mode: applications and benchmarks on top of MVAPICH2 and Open MPI, CUDA RT, and IB verbs. Kernel mode: the NV display driver and CUDA driver, IB core with the mlx5 and cxgb4 drivers carrying peer-direct extensions[*] for RDMA, and nv_peer_mem bridging the two stacks. HW: HCA and GPU. Legend marks components as open-source, proprietary, or mixed.]

[*] MLNX OFED; Chelsio: www.openfabrics.org/~swise/ofed-3.12-1-peer-direct/OFED-3.12-1-peer-direct-20150330-1122.tgz

SLIDE 14

EXTENDED STACK

[Diagram: the same stack extended for Async. User mode adds libgdsync and libmp, IB Verbs extensions for Async, and Async extensions in MVAPICH2 and the CUDA RT; kernel mode carries peer-direct extensions[*] for RDMA/Async in mlx5 and cxgb4 Async, alongside nv_peer_mem, the NV display driver, CUDA driver, and IB core. HW: HCA and GPU.]

[*] MLNX OFED; Chelsio: www.openfabrics.org/~swise/ofed-3.12-1-peer-direct/OFED-3.12-1-peer-direct-20150330-1122.tgz

SLIDE 15

GPUDIRECT ASYNC + INFINIBAND

preview release of components

  • CUDA Async extensions, preview in CUDA 8.0 EA
  • Peer-direct async extension, in MLNX OFED 3.x, soon
  • libgdsync, on github.com/gpudirect, soon
  • libmp, on github.com/gpudirect, soon

SLIDE 16

ASYNC: APIS

SLIDE 17

GPUDIRECT ASYNC

  • CPU prepares the work plan
    • hardly parallelizable, branch-intensive
  • GPU orchestrates the flow
    • runs on the optimized front-end unit
    • the same unit scheduling GPU work, now also scheduling network communications

expose the GPU front-end unit

[Diagram: GPU front-end unit in front of the compute engines]

SLIDE 18

STREAM MEMORY OPERATIONS

  CU_STREAM_WAIT_VALUE_GEQ   = 0x0,
  CU_STREAM_WAIT_VALUE_EQ    = 0x1,
  CU_STREAM_WAIT_VALUE_AND   = 0x2,
  CU_STREAM_WAIT_VALUE_FLUSH = 1<<30
  CUresult cuStreamWaitValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);

  CU_STREAM_WRITE_VALUE_NO_MEMORY_BARRIER = 0x1
  CUresult cuStreamWriteValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);

  CU_STREAM_MEM_OP_WAIT_VALUE_32       = 1,
  CU_STREAM_MEM_OP_WRITE_VALUE_32      = 2,
  CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES = 3
  CUresult cuStreamBatchMemOp(CUstream stream, unsigned int count, CUstreamBatchMemOpParams *paramArray, unsigned int flags);

  • cuStreamWaitValue32: polling on a 32-bit word
  • cuStreamWriteValue32: 32-bit word write
  • cuStreamBatchMemOp: low-overhead batched work submission
  • CU_STREAM_WAIT_VALUE_FLUSH / CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES: guarantee memory consistency for RDMA
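
A minimal sketch of a batched submission, assuming CUDA 8.0 EA with the Async extensions; stream, d_recv_flag, and d_doorbell are hypothetical names, and CUCHECK is the error-checking macro used on the GPU PEER MAPPING slide below. The batch enqueues a wait followed by a write in a single call, amortizing the per-op submission cost noted on Slide 20:

  #include <string.h>   /* memset */

  /* enqueue a wait-then-write pair in one submission on `stream` */
  static void enqueue_wait_then_write(CUstream stream,
                                      CUdeviceptr d_recv_flag,
                                      CUdeviceptr d_doorbell)
  {
      CUstreamBatchMemOpParams ops[2];
      memset(ops, 0, sizeof(ops));

      /* op 0: block the stream until *d_recv_flag == 1 */
      ops[0].operation         = CU_STREAM_MEM_OP_WAIT_VALUE_32;
      ops[0].waitValue.address = d_recv_flag;
      ops[0].waitValue.value   = 1;
      ops[0].waitValue.flags   = CU_STREAM_WAIT_VALUE_EQ;

      /* op 1: then store a 32-bit value, e.g. into a peer-mapped doorbell */
      ops[1].operation          = CU_STREAM_MEM_OP_WRITE_VALUE_32;
      ops[1].writeValue.address = d_doorbell;
      ops[1].writeValue.value   = 0xfaf0;
      ops[1].writeValue.flags   = 0;

      /* one batched submission instead of two separate mem-op calls */
      CUCHECK(cuStreamBatchMemOp(stream, 2, ops, 0));
  }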

SLIDE 19

STREAM MEMORY OPERATIONS

GPU front-end unit

  // CPU: clear the flag in registered host memory (d_flag is its device alias)
  *(volatile uint32_t*)h_flag = 0;
  …
  // enqueued on the stream: the front-end unit polls until the flag becomes 1 (step 1),
  // then the kernel runs (step 2), then the front-end writes 2 (step 3)
  cuStreamWaitValue32(stream, d_flag, 1, CU_STREAM_WAIT_VALUE_EQ);
  calc_kernel<<<GSZ,BSZ,0,stream>>>();
  cuStreamWriteValue32(stream, d_flag, 2, 0);
  …
  // CPU: release the stream by setting the flag to 1
  *(volatile uint32_t*)h_flag = 1;
  …
  cudaStreamSynchronize(stream);
  assert(*(volatile uint32_t*)h_flag == 2);

SLIDE 20

GPUDIRECT ASYNC

API features

  • batching multiple consecutive mem ops saves ~1us per op
    • use cuStreamBatchMemOp
  • APIs accept device pointers
    • memory needs registration (cuMemHostRegister)
    • device pointer retrieval (cuMemHostGetDevicePointer); see the sketch below
  • 3rd party device PCIe resources (aka BARs)
    • assumed physically contiguous & uncached
    • special flag needed
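
A minimal sketch of the registration path for ordinary host memory, assuming a Linux host and the CUDA driver API; the function and variable names are illustrative, and page alignment is used to stay on the safe side of cuMemHostRegister's platform requirements:

  #include <cuda.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* pin a page-aligned host flag word and return its device alias */
  static CUdeviceptr map_host_flag(uint32_t **h_flag_out)
  {
      size_t pg = (size_t)sysconf(_SC_PAGESIZE);
      void *p = NULL;
      posix_memalign(&p, pg, pg);                 /* page-aligned host buffer */
      /* pin and map the buffer so the GPU can access it */
      CUCHECK(cuMemHostRegister(p, pg, CU_MEMHOSTREGISTER_DEVICEMAP));
      /* retrieve the device pointer backing the same memory */
      CUdeviceptr d_flag;
      CUCHECK(cuMemHostGetDevicePointer(&d_flag, p, 0));
      *h_flag_out = (uint32_t *)p;
      return d_flag;   /* usable with cuStreamWaitValue32 / cuStreamWriteValue32 */
  }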
SLIDE 21

GPU PEER MAPPING

accessing a 3rd party device PCIe resource from the GPU

  struct device_bar { void *ptr; CUdeviceptr d_ptr; size_t len; };

  void map_device_bar(struct device_bar *db)
  {
      // obtain the BAR mapping from the device driver
      device_driver_get_bar(&db->ptr, &db->len);
      // registration is mandatory; CU_MEMHOSTREGISTER_IOMEMORY is the new flag for I/O memory
      CUCHECK(cuMemHostRegister(db->ptr, db->len, CU_MEMHOSTREGISTER_IOMEMORY));
      // the GPU accesses the device through this device pointer
      CUCHECK(cuMemHostGetDevicePointer(&db->d_ptr, db->ptr, 0));
  }
  …
  cuStreamWriteValue32(stream, db->d_ptr + off, 0xfaf0, 0);
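
A possible teardown path, as a sketch: cuMemHostUnregister is the documented inverse of cuMemHostRegister, while device_driver_put_bar is a hypothetical counterpart to the slide's device_driver_get_bar:

  void unmap_device_bar(struct device_bar *db)
  {
      // drop the GPU mapping; db->d_ptr must no longer be used after this
      CUCHECK(cuMemHostUnregister(db->ptr));
      // hypothetical driver call releasing the BAR mapping
      device_driver_put_bar(db->ptr);
      db->ptr = NULL; db->d_ptr = 0; db->len = 0;
  }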

SLIDE 22

GPU PEER MAPPING + ASYNC

[Diagram: cuStreamWriteValue32(stream, db->d_ptr+off, 0xfaf0, 0) issued through the GPU's PCIe interface travels over the PCIe bus to phys_ptr+off in the 3rd party device's PCIe resources, storing 0xfaf0]

SLIDE 23

2DSTENCIL PERFORMANCE

weak scaling, RDMA vs RDMA+Async

[Chart: 2DStencil percentage improvement (0%–35%) over local lattice sizes 8–8192, for NP=2 and NP=4]

two/four nodes, IVB Xeon CPUs, K40m GPUs, Mellanox Connect-IB FDR, Mellanox FDR switch

SLIDE 24

CAVEATS

  • GPUDirect RDMA & Async
    • need correct/reliable forwarding of PCIe transactions
  • GPUDirect Async
    • GPU peer mapping limited to privileged processes (CUDA 8.0 EA)
  • Platform:
    • best: PCIe switch
    • limited: CPU root-complex

[Diagram: a good platform topology, with the devices behind a PCIe switch]

SLIDE 25

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join