
MVAPICH2-GDR: Pushing the Frontier of Designing MPI Libraries Enabling GPUDirect Technologies



  1. MVAPICH2-GDR: Pushing the Frontier of Designing MPI Libraries Enabling GPUDirect Technologies
GPU Technology Conference (GTC) 2016
by Khaled Hamidouche and Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: hamidouc@cse.ohio-state.edu, panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~hamidouc, http://www.cse.ohio-state.edu/~panda

  2. Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect RDMA (GDR)
• What's new with MVAPICH2-GDR
  - Efficient MPI-3 Non-Blocking Collective support
  - Maximal overlap in MPI Datatype Processing
  - Efficient Support for Managed Memory
  - RoCE and Optimized Collectives
  - Initial support for the GPUDirect Async feature
  - Efficient Deep Learning with MVAPICH2-GDR
  - OpenACC-Aware support
• Conclusions

  3. Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  - MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  - MVAPICH2-X (MPI + PGAS), available since 2011
  - Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  - Support for Virtualization (MVAPICH2-Virt), available since 2015
  - Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  - Used by more than 2,550 organizations in 79 countries
  - More than 360,000 (> 0.36 million) downloads from the OSU site directly
  - Empowering many TOP500 clusters (Nov '15 ranking):
    - 10th-ranked 519,640-core cluster (Stampede) at TACC
    - 13th-ranked 185,344-core cluster (Pleiades) at NASA
    - 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  - Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  - http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  - From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) to Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

  4. MVAPICH2 Architecture
[Architecture diagram] High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk), built on a high-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, job startup, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, and introspection & analysis. Support for modern multi-/many-core architectures (Intel Xeon, OpenPower, Xeon Phi (MIC, KNL*), NVIDIA GPGPU) and modern networking technology (InfiniBand, iWARP, RoCE, OmniPath); transport protocols (RC, XRC, UD, DC), transport mechanisms (shared memory, CMA, IVSHMEM), and modern features (SR-IOV, multi-rail, UMR, ODP*, MCDRAM*, NVLink*, CAPI*). (* upcoming)

  5. Optimizing MPI Data Movement on GPU Clusters
• GPUs are connected as PCIe devices - flexibility but complexity
• [Diagram: Node 0 and Node 1, each with two CPU sockets linked by QPI, GPUs attached over PCIe, and an IB adapter] Memory buffers can move along many paths:
  1. Intra-GPU
  2. Intra-socket GPU-GPU
  3. Inter-socket GPU-GPU
  4. Inter-node GPU-GPU
  5. Intra-socket GPU-Host
  6. Inter-socket GPU-Host
  7. Inter-node GPU-Host
  8. Inter-node GPU-GPU with the IB adapter on the remote socket
  ... and more
• For each path there are different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, ...
• It is critical for runtimes to optimize data movement while hiding the complexity

  6. GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
• Overlaps data movement from the GPU with RDMA transfers
• At the sender: MPI_Send(s_devbuf, size, ...);
• At the receiver: MPI_Recv(r_devbuf, size, ...);
  (the GPU data movement happens inside MVAPICH2)
• High performance and high productivity
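As a hedged illustration of the interface described on this slide (not taken from the presentation), the following is a minimal sketch of a device-to-device transfer between two ranks, assuming one GPU per rank. A CUDA-aware MPI such as MVAPICH2 uses UVA to detect that the buffers live in GPU memory and performs the staging or GPUDirect RDMA transfer internally; error checks are omitted for brevity.

    /* Minimal sketch: CUDA-aware point-to-point transfer between two ranks. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int size = 4 * 1024 * 1024;    /* 4 MB payload */
        char *devbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaSetDevice(0);                     /* one GPU per rank assumed */
        cudaMalloc((void **)&devbuf, size);

        if (rank == 0)
            MPI_Send(devbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);    /* s_devbuf */
        else if (rank == 1)
            MPI_Recv(devbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                               /* r_devbuf */

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }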

  7. CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
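To make the last bullet concrete, here is a minimal sketch (not from the slides) of sending a non-contiguous piece of GPU data with a derived datatype: one column of a row-major N x N matrix resident in device memory. Two ranks with one GPU each are assumed, N is chosen arbitrarily, and the matrix contents and error checks are elided; the library packs and unpacks the strided GPU data internally.

    /* Sketch: derived-datatype send directly from a GPU buffer. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define N 1024    /* matrix dimension, arbitrary for this sketch */

    int main(int argc, char **argv)
    {
        int rank;
        double *d_matrix;                     /* row-major N x N matrix on the GPU */
        MPI_Datatype column;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaSetDevice(0);
        cudaMalloc((void **)&d_matrix, (size_t)N * N * sizeof(double));

        /* One element per row with a stride of N elements: one matrix column. */
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(d_matrix, 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_matrix, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        cudaFree(d_matrix);
        MPI_Finalize();
        return 0;
    }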

  8. Using the MVAPICH2-GPUDirect Version
• MVAPICH2-2.2b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements:
  - Mellanox OFED 2.1 or later
  - NVIDIA Driver 331.20 or later
  - NVIDIA CUDA Toolkit 7.0 or later
  - Plugin for GPUDirect RDMA: http://www.mellanox.com/page/products_dyn?product_family=116
  - Strongly recommended: GDRCOPY module from NVIDIA, https://github.com/NVIDIA/gdrcopy
• Contact the MVAPICH help list (mvapich-help@cse.ohio-state.edu) with any questions related to the package

  9. Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size for MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR; annotated speedups of roughly 10x and 2X for latency (2.18 us small-message latency), 11X for bandwidth, and 11x and 2x for bi-directional bandwidth]
• Platform: MVAPICH2-GDR 2.2b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct RDMA

  10. Application-Level Evaluation (HOOMD-blue)
[Charts: average time steps per second (TPS) vs. number of processes (4, 8, 16, 32) for 64K and 256K particles, comparing MV2 and MV2+GDR; roughly 2X improvement in both cases]
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5, GDRCOPY enabled
• Runtime parameters: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

  11. Full and Efficient MPI-3 RMA Support
[Chart: small-message latency vs. message size, showing a 6X improvement and 2.88 us small-message latency]
• Platform: MVAPICH2-GDR 2.2b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct RDMA
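As a hedged companion to the chart, the sketch below (not from the slides, and an assumption about the kind of one-sided operation being measured) shows MPI-3 RMA issued directly against GPU memory: a window created over a device buffer and an MPI_Put targeting it. Two ranks with one GPU each are assumed; error checks are omitted.

    /* Sketch: MPI-3 one-sided Put into a window backed by GPU memory. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int count = 1 << 20;            /* 1 MB window */
        char *d_buf;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaSetDevice(0);
        cudaMalloc((void **)&d_buf, count);

        /* Expose the device buffer as an RMA window on every rank. */
        MPI_Win_create(d_buf, count, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Put(d_buf, count, MPI_CHAR, 1 /* target rank */, 0 /* disp */,
                    count, MPI_CHAR, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }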

  12. Performance of MVAPICH2-GDR with GPU-Direct RDMA and Multi-Rail Support
[Charts: GPU-GPU inter-node MPI uni-directional and bi-directional bandwidth vs. message size for MV2-GDR 2.1 and MV2-GDR 2.1 RC2; peaks of 8,759 MB/s (uni-directional) and 15,111 MB/s (bi-directional), with annotated improvements of 40% and 20%]
• Platform: MVAPICH2-GDR 2.2b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct RDMA

  13. Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect RDMA (GDR)
• What's new with MVAPICH2-GDR
  - Efficient MPI-3 Non-Blocking Collective support
  - Maximal overlap in MPI Datatype Processing
  - Efficient Support for Managed Memory
  - RoCE and Optimized Collectives
  - Initial support for the GPUDirect Async feature
  - Efficient Deep Learning with MVAPICH2-GDR
  - OpenACC-Aware support
• Conclusions

  14. Non-Blocking Collectives (NBC) using CORE-Direct Offload
• MPI NBC decouples the initiation (Ialltoall) and completion (Wait) phases and provides overlap potential (Ialltoall + compute + Wait), but the CPU largely drives progress inside Wait (effectively zero overlap)
• The CORE-Direct feature provides true overlap capability through an a priori specification of a list of network tasks that is progressed by the NIC instead of the CPU (freeing the CPU)
• We propose a design that combines GPUDirect RDMA and CORE-Direct features to provide efficient support for CUDA-aware NBC collectives on GPU buffers:
  - Overlap communication with CPU computation
  - Overlap communication with GPU computation
• Extend OMB with CUDA-aware NBC benchmarks to evaluate:
  - Latency
  - Overlap on both CPU and GPU
A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, "Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters," HiPC 2015
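A hedged sketch of the overlap pattern described above (not code from the paper or the slides): initiate MPI_Ialltoall on device buffers, keep the GPU busy with independent work, then complete with MPI_Wait. cudaMemsetAsync stands in for an application kernel launch; with the CORE-Direct design the NIC, rather than the CPU, is what progresses the collective during the compute phase. Payload initialization and error checks are omitted.

    /* Sketch: overlapping a CUDA-aware non-blocking collective with GPU work. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int count = 1 << 16;            /* floats exchanged with each peer */
        float *d_send, *d_recv, *d_work;
        cudaStream_t stream;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        cudaSetDevice(0);
        cudaStreamCreate(&stream);
        cudaMalloc((void **)&d_send, (size_t)nprocs * count * sizeof(float));
        cudaMalloc((void **)&d_recv, (size_t)nprocs * count * sizeof(float));
        cudaMalloc((void **)&d_work, (size_t)count * sizeof(float));

        /* Initiate the collective on GPU buffers ... */
        MPI_Ialltoall(d_send, count, MPI_FLOAT,
                      d_recv, count, MPI_FLOAT, MPI_COMM_WORLD, &req);

        /* ... overlap it with independent GPU work (a real application would
           launch its own kernels here; cudaMemsetAsync is a stand-in) ... */
        cudaMemsetAsync(d_work, 0, (size_t)count * sizeof(float), stream);

        /* ... then complete the collective and the compute. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        cudaStreamSynchronize(stream);

        cudaFree(d_send); cudaFree(d_recv); cudaFree(d_work);
        cudaStreamDestroy(stream);
        MPI_Finalize();
        return 0;
    }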
