Support for GPUs with GPUDirect RDMA in MVAPICH2


  1. Support for GPUs with GPUDirect RDMA in MVAPICH2, SC’13 NVIDIA Booth, by D.K. Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

  2. Outline • Overview of MVAPICH2-GPU Project • GPUDirect RDMA with Mellanox IB adaptors • Other Optimizations for GPU Communication • Support for MPI + OpenACC • CUDA and OpenACC extensions in OMB

  3. Drivers of Modern HPC Cluster Architectures • Three building blocks: multi-core processors, high-performance interconnects such as InfiniBand (<1usec latency, >100Gbps bandwidth), and accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip) • Multi-core processors are ubiquitous and InfiniBand is widely accepted • MVAPICH2 has constantly evolved to provide superior performance • Accelerators/Coprocessors are becoming common in high-end systems, e.g., Tianhe-2 (#1), Titan (#2), Stampede (#6) and Tianhe-1A (#10) on the Top500 list • How does MVAPICH2 help development on these emerging architectures?

  4. InfiniBand + GPU systems (Past) • Many systems today have both GPUs and high-speed networks such as InfiniBand • Problem: Lack of a common memory registration mechanism – Each device has to pin the host memory it will use – Many operating systems do not allow multiple devices to register the same memory pages • Previous solution: – Use a different buffer for each device and copy data between them

  5. GPU-Direct • Collaboration between Mellanox and NVIDIA to converge on one memory registration technique • Both devices register a common host buffer – GPU copies data to this buffer, and the network adapter can directly read from this buffer (or vice-versa) • Note that GPU-Direct does not allow you to bypass host memory

  6. MPI + CUDA • Data movement in applications with standard MPI and CUDA interfaces: data is staged through host memory on its way between the GPU (over PCIe) and the network (NIC and switch) – At Sender: cudaMemcpy(s_hostbuf, s_devbuf, . . .); MPI_Send(s_hostbuf, size, . . .); – At Receiver: MPI_Recv(r_hostbuf, size, . . .); cudaMemcpy(r_devbuf, r_hostbuf, . . .); – High Productivity and Low Performance (a sketch of this staging pattern follows below) • Users can do the pipelining at the application level using non-blocking MPI and CUDA interfaces – Low Productivity and High Performance
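  To make the staging pattern on this slide concrete, here is a minimal sketch of the naive host-staged exchange between two ranks. This is not code from the slides; buffer names, the 1 MB payload and the tag are illustrative, and a real program would check return codes.

      /* Naive host-staged GPU-to-GPU exchange (illustrative sketch).
       * Rank 0 sends a device buffer to rank 1 by staging through host memory. */
      #include <mpi.h>
      #include <cuda_runtime.h>
      #include <stdlib.h>

      int main(int argc, char **argv)
      {
          int rank;
          const int size = 1 << 20;              /* illustrative 1 MB payload */

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          void *devbuf, *hostbuf = malloc(size);
          cudaMalloc(&devbuf, size);

          if (rank == 0) {
              /* Stage device data into host memory, then send from the host buffer */
              cudaMemcpy(hostbuf, devbuf, size, cudaMemcpyDeviceToHost);
              MPI_Send(hostbuf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
          } else if (rank == 1) {
              /* Receive into host memory, then copy up to the device */
              MPI_Recv(hostbuf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              cudaMemcpy(devbuf, hostbuf, size, cudaMemcpyHostToDevice);
          }

          cudaFree(devbuf);
          free(hostbuf);
          MPI_Finalize();
          return 0;
      }

  Every message pays for a blocking CUDA copy plus an MPI transfer in sequence, which is what the slide calls high productivity but low performance.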

  7. GPU-Aware MPI Library: MVAPICH2-GPU • Standard MPI interfaces used for unified data movement • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0) • Optimizes data movement from GPU memory inside MVAPICH2 – At Sender: MPI_Send(s_devbuf, size, …); – At Receiver: MPI_Recv(r_devbuf, size, …); • High Performance and High Productivity (see the sketch below)
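  For comparison with the staged version above, a minimal sketch of the same exchange with a GPU-aware MPI library: the device pointers are passed straight to MPI_Send/MPI_Recv and the library handles the data movement. Buffer names and the payload size are illustrative, not from the slides.

      /* GPU-aware MPI sketch: device buffers passed directly to MPI calls.
       * With MVAPICH2-GPU, the library detects the device pointers (via UVA)
       * and performs the data movement internally. */
      #include <mpi.h>
      #include <cuda_runtime.h>

      int main(int argc, char **argv)
      {
          int rank;
          const int size = 1 << 20;   /* illustrative 1 MB payload */

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          void *devbuf;
          cudaMalloc(&devbuf, size);

          if (rank == 0)
              MPI_Send(devbuf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
          else if (rank == 1)
              MPI_Recv(devbuf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          cudaFree(devbuf);
          MPI_Finalize();
          return 0;
      }

  Running this requires a CUDA-enabled MVAPICH2 build; in MVAPICH2 the CUDA support path is typically enabled with the MV2_USE_CUDA=1 run-time parameter.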

  8. Pipelined Data Movement in MVAPICH2 • Pipelines data movement from the GPU, overlapping device-to-host CUDA copies, inter-process data movement (network transfers or shared-memory copies) and host-to-device CUDA copies • 45% improvement compared with a naive implementation (Memcpy+Send) • 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend, sketched below) [Chart: internode osu_latency for large messages (32K-2M bytes) comparing Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU; lower is better]
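  The "advanced user-level implementation" baseline refers to pipelining done by hand in the application. A rough sketch of what that looks like on the sender side, under illustrative assumptions (chunk size, one stream, one tag per chunk; the host buffer should be pinned with cudaMallocHost for the copies to be truly asynchronous):

      /* Hand-rolled sender-side pipelining sketch (MemcpyAsync + Isend).
       * The device buffer is copied to host in chunks on a CUDA stream;
       * each chunk is sent as soon as its copy completes, overlapping
       * D2H copies with network transfers. Chunk size is illustrative. */
      #include <mpi.h>
      #include <cuda_runtime.h>
      #include <stdlib.h>

      #define CHUNK (256 * 1024)   /* illustrative pipeline unit */

      void pipelined_send(const char *devbuf, char *hostbuf, size_t size,
                          int dest, MPI_Comm comm)
      {
          size_t nchunks = (size + CHUNK - 1) / CHUNK;
          cudaStream_t stream;
          cudaEvent_t *done = malloc(nchunks * sizeof(cudaEvent_t));
          MPI_Request *reqs = malloc(nchunks * sizeof(MPI_Request));

          cudaStreamCreate(&stream);

          /* Issue all chunk copies asynchronously, each followed by an event */
          for (size_t i = 0; i < nchunks; i++) {
              size_t off = i * CHUNK;
              size_t len = (off + CHUNK <= size) ? CHUNK : size - off;
              cudaMemcpyAsync(hostbuf + off, devbuf + off, len,
                              cudaMemcpyDeviceToHost, stream);
              cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);
              cudaEventRecord(done[i], stream);
          }

          /* As each chunk's copy completes, start its non-blocking send */
          for (size_t i = 0; i < nchunks; i++) {
              size_t off = i * CHUNK;
              size_t len = (off + CHUNK <= size) ? CHUNK : size - off;
              cudaEventSynchronize(done[i]);
              MPI_Isend(hostbuf + off, len, MPI_BYTE, dest, (int)i, comm, &reqs[i]);
          }

          MPI_Waitall((int)nchunks, reqs, MPI_STATUSES_IGNORE);

          for (size_t i = 0; i < nchunks; i++)
              cudaEventDestroy(done[i]);
          cudaStreamDestroy(stream);
          free(done);
          free(reqs);
      }

  The receiver would mirror this with MPI_Irecv and host-to-device copies. MVAPICH2-GPU performs this kind of pipelining inside the library, which is why it outperforms even the hand-tuned version while keeping the simple MPI_Send/MPI_Recv interface.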

  9. Outline • Overview of MVAPICH2-GPU Project • GPUDirect RDMA with Mellanox IB adaptors • Other Optimizations for GPU Communication • Support for MPI + OpenACC • CUDA and OpenACC extensions in OMB

  10. GPU-Direct RDMA (GDR) with CUDA • Network adapter can directly read/write data from/to GPU device memory • Avoids copies through the host • Fastest possible communication between GPU and IB HCA • Allows for better asynchronous communication • OFED with GDR support is under development by Mellanox and NVIDIA [Diagram: CPU and system memory, chipset, GPU and GPU memory, and IB adapter connected to InfiniBand]

  11. GPU-Direct RDMA (GDR) with CUDA • OFED with support for GPUDirect RDMA is under work by NVIDIA and Mellanox • OSU has an initial design of MVAPICH2 using GPUDirect RDMA – Hybrid design using GPUDirect RDMA and host-based pipelining • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge (SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680V2: P2P write 6.4 GB/s, P2P read 3.5 GB/s) – Support for communication using multi-rail – Support for Mellanox Connect-IB and ConnectX VPI adapters – Support for RoCE with Mellanox ConnectX VPI adapters

  12. Performance of MVAPICH2 with GPU-Direct-RDMA: Latency • GPU-GPU internode MPI latency, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR configurations • Small messages (1 byte-4K): 67% improvement with GDR, down to 5.49 usec • Large messages (8K-2M): about 10% improvement • Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPU-Direct-RDMA patch

  13. Performance of MVAPICH2 with GPU-Direct-RDMA: Bandwidth • GPU-GPU internode MPI uni-directional bandwidth, same four configurations • Small messages (1 byte-4K): about 5x higher bandwidth with GDR • Large messages (8K-2M): up to 9.8 GB/s • Same testbed: MVAPICH2-2.0b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPU-Direct-RDMA patch

  14. Performance of MVAPICH2 with GPU-Direct-RDMA: Bi-Bandwidth • GPU-GPU internode MPI bi-directional bandwidth, same four configurations • Small messages (1 byte-4K): about 4.3x higher bi-bandwidth with GDR • Large messages (8K-2M): 19% improvement, reaching 19 GB/s • Same testbed: MVAPICH2-2.0b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPU-Direct-RDMA patch

  15. How can I get started with GDR Experimentation? • Two modules are needed – Alpha version of OFED kernel and libraries with GPUDirect RDMA (GDR) support from Mellanox – Alpha version of MVAPICH2-GDR from OSU (currently a separate distribution) • Send a note to hpc@mellanox.com • You will get alpha versions of the GDR driver and MVAPICH2-GDR (based on the MVAPICH2 2.0a release) • You can get started with this version • The MVAPICH2 team is working on multiple enhancements (collectives, datatypes, one-sided) to exploit the advantages of GDR • As the GDR driver matures, successive versions of MVAPICH2-GDR with enhancements will be made available to the community

  16. Outline • Overview of MVAPICH2-GPU Project • GPUDirect RDMA with Mellanox IB adaptors • Other Optimizations for GPU Communication • Support for MPI + OpenACC • CUDA and OpenACC extensions in OMB

  17. Multi-GPU Configurations • Multi-GPU node architectures are becoming common (two processes, each driving a GPU, share host memory and an I/O hub connected to GPU 0, GPU 1 and the HCA) • Until CUDA 3.2 – Communication between processes staged through the host – Shared memory (pipelined) – Network loopback (asynchronous) • CUDA 4.0 and later – Inter-Process Communication (IPC) – Host bypass, handled by a DMA engine – Low latency and asynchronous – Requires creation, exchange and mapping of memory handles, which adds overhead (see the sketch below)
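  The handle creation, exchange and mapping overhead mentioned above comes from the CUDA IPC API itself. A minimal sketch of what two ranks on the same node would have to do by hand; it assumes MPI is already initialized, and the buffer names and message pattern are illustrative:

      /* CUDA IPC between two ranks on the same node (illustrative sketch).
       * Rank 0 exports a handle for its device buffer; rank 1 maps it and
       * copies from it directly, bypassing host memory. */
      #include <mpi.h>
      #include <cuda_runtime.h>

      void ipc_example(int rank, size_t size)
      {
          if (rank == 0) {
              void *devbuf;
              cudaMalloc(&devbuf, size);

              /* Create an IPC handle for the allocation and ship it to rank 1 */
              cudaIpcMemHandle_t handle;
              cudaIpcGetMemHandle(&handle, devbuf);
              MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);

              MPI_Barrier(MPI_COMM_WORLD);   /* wait until rank 1 has finished reading */
              cudaFree(devbuf);
          } else if (rank == 1) {
              cudaIpcMemHandle_t handle;
              MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);

              /* Map the peer's buffer into this process and copy device-to-device */
              void *peerbuf, *localbuf;
              cudaIpcOpenMemHandle(&peerbuf, handle, cudaIpcMemLazyEnablePeerAccess);
              cudaMalloc(&localbuf, size);
              cudaMemcpy(localbuf, peerbuf, size, cudaMemcpyDeviceToDevice);

              cudaIpcCloseMemHandle(peerbuf);
              cudaFree(localbuf);
              MPI_Barrier(MPI_COMM_WORLD);
          }
      }

  As the next slide notes, MVAPICH2 hides these steps behind the standard MPI calls, so applications keep passing device pointers to MPI_Send/MPI_Recv.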

  18. Designs in MVAPICH2 and Performance • MVAPICH2 takes advantage of CUDA IPC for MPI communication between GPUs within a node • Hides the complexity and overheads of handle creation, exchange and mapping • Available in standard releases from MVAPICH2 1.8 onwards • Compared with the shared-memory design, the CUDA IPC design improves intranode latency by about 70% for small messages (osu_latency small) and about 46% for large messages (osu_latency large), and intranode bandwidth by about 78% (osu_bw)

  19. Collectives Optimizations in MVAPICH2: Overview • Optimizes data movement at the collective level for small messages • Pipelines data movement in each send/recv operation for large messages • Several collectives have been optimized: Bcast, Gather, Scatter, Allgather, Alltoall, Scatterv, Gatherv, Allgatherv, Alltoallv • Collective-level optimizations are completely transparent to the user (see the sketch below) • Pipelining can be tuned using point-to-point parameters
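  To illustrate what "transparent to the user" means here, a minimal sketch of a broadcast on device buffers with a GPU-aware MPI build; the root rank, element count and buffer name are illustrative:

      /* GPU-aware collective sketch: MPI_Bcast called directly on a device buffer.
       * With a CUDA-enabled MVAPICH2, no explicit cudaMemcpy staging is needed;
       * the collective-level optimizations happen inside the library. */
      #include <mpi.h>
      #include <cuda_runtime.h>

      int main(int argc, char **argv)
      {
          const int count = 1 << 16;     /* illustrative number of doubles */
          double *devbuf;

          MPI_Init(&argc, &argv);
          cudaMalloc((void **)&devbuf, count * sizeof(double));

          /* The root's device data is broadcast into every rank's device buffer */
          MPI_Bcast(devbuf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

          cudaFree(devbuf);
          MPI_Finalize();
          return 0;
      }

  The same call works unchanged with host buffers; the library decides how to stage or pipeline the data based on where the buffer lives.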
