MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

  1. MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
  GPU Technology Conference (GTC) 2017
  Hari Subramoni and Dhabaleswar K. (DK) Panda, The Ohio State University
  E-mail: subramon@cse.ohio-state.edu, panda@cse.ohio-state.edu
  http://www.cse.ohio-state.edu/~subramon, http://www.cse.ohio-state.edu/~panda

  2. Outline
  • Overview of the MVAPICH2 Project
  • MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • What's new with MVAPICH2-GDR
    – Efficient MPI-3 Non-Blocking Collective support
    – Maximal overlap in MPI Datatype Processing
    – Efficient Support for Managed Memory
    – RoCE and Optimized Collectives
    – Initial support for the GPUDirect Async feature
  • Efficient Deep Learning with MVAPICH2-GDR
  • OpenACC-Aware support
  • Conclusions

  3. Overview of the MVAPICH2 Project
  • High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
    – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0): started in 2001, first version available in 2002
    – MVAPICH2-X (MPI + PGAS), available since 2011
    – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
    – Support for virtualization (MVAPICH2-Virt), available since 2015
    – Support for energy-awareness (MVAPICH2-EA), available since 2015
    – Support for InfiniBand network analysis and monitoring (OSU INAM), available since 2015
    – Used by more than 2,750 organizations in 84 countries
    – More than 416,000 (> 0.4 million) downloads directly from the OSU site
    – Empowering many TOP500 clusters (Nov '16 ranking):
      • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
      • 13th: 241,108-core Pleiades at NASA
      • 17th: 462,462-core Stampede at TACC
      • 40th: 74,520-core Tsubame 2.5 at Tokyo Institute of Technology
    – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
    – http://mvapich.cse.ohio-state.edu
  • Empowering Top500 systems for over a decade: from System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) to Sunway TaihuLight (1st in Jun '16, 10M cores, 100 PFlops)

  4. MVAPICH2 Release Timeline and Downloads
  [Chart: cumulative downloads from Sep 2004 to Mar 2017, rising from near zero to over 400,000, annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2 2.1, MV2-Virt 2.1rc2, MV2-GDR 2.2rc1, MV2-X 2.2, and MV2 2.3a]

  5. MVAPICH2 Software Family
  Requirements → Library to use
  • MPI with IB, iWARP and RoCE → MVAPICH2
  • Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE → MVAPICH2-X
  • MPI with IB & GPU → MVAPICH2-GDR
  • MPI with IB & MIC → MVAPICH2-MIC
  • HPC Cloud with MPI & IB → MVAPICH2-Virt
  • Energy-aware MPI with IB, iWARP and RoCE → MVAPICH2-EA

  6. Architecture of the MVAPICH2 Software Family
  • High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
  • High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, job startup, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, introspection and analysis
  • Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPower, Xeon Phi (MIC, KNL), NVIDIA GPGPU)
  • Transport protocols: RC, XRC, UD, DC, UMR, ODP
  • Transport mechanisms: shared memory, CMA, IVSHMEM
  • Modern features: SR-IOV, multi-rail, MCDRAM*, NVLink*, CAPI* (* upcoming)

  7. Optimizing MPI Data Movement on GPU Clusters
  • GPUs are connected as PCIe devices – flexibility, but complexity
  • Memory buffers can sit at many locations across two nodes (Node 0 and Node 1, each with CPUs linked by QPI, GPUs on PCIe, and an IB adapter):
    1. Intra-GPU
    2. Intra-socket GPU-GPU
    3. Inter-socket GPU-GPU
    4. Inter-node GPU-GPU
    5. Intra-socket GPU-Host
    6. Inter-socket GPU-Host
    7. Inter-node GPU-Host
    8. Inter-node GPU-GPU with IB adapter on remote socket
    ... and more
  • For each path, different schemes apply: shared memory, IPC, GPUDirect RDMA, pipelining, ...
  • Critical for runtimes to optimize data movement while hiding the complexity (a sketch of buffer-location detection follows below)
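To hint at what "hiding the complexity" involves, here is a minimal sketch of how a CUDA-aware runtime can tell device buffers from host buffers using CUDA's UVA pointer attributes. The helper name is hypothetical (this is not MVAPICH2's actual code), but cudaPointerGetAttributes is the standard API for this purpose:

```c
#include <cuda_runtime.h>
#include <stdbool.h>

/* Hypothetical helper in the spirit of what a CUDA-aware runtime does:
   classify a buffer so the right transfer path can be chosen. */
bool is_device_buffer(const void *buf) {
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess) {
        cudaGetLastError();  /* pre-CUDA-11: plain host memory reports an error */
        return false;
    }
#if CUDART_VERSION >= 10000
    return attr.type == cudaMemoryTypeDevice;       /* field name in CUDA 10+ */
#else
    return attr.memoryType == cudaMemoryTypeDevice; /* field name in older CUDA */
#endif
}
```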

  8. GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
  • Standard MPI interfaces used for unified data movement
  • Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
  • Overlaps data movement from the GPU with RDMA transfers
  • At the sender: MPI_Send(s_devbuf, size, ...);
  • At the receiver: MPI_Recv(r_devbuf, size, ...);
  • All staging and pipelining happen inside MVAPICH2: high performance and high productivity (a complete sketch follows below)
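To make the calling convention concrete, here is a minimal two-rank sketch with a CUDA-aware MPI library such as MVAPICH2-GDR. The buffer size is an illustrative assumption; the point is that device pointers go straight into MPI_Send/MPI_Recv:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;  /* 1M floats -- illustrative message size */
    float *devbuf;
    cudaMalloc((void **)&devbuf, n * sizeof(float));

    if (rank == 0) {
        /* Device pointer passed directly to MPI_Send; the CUDA-aware
           library pipelines the GPU-to-network transfer internally. */
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive directly into GPU memory -- no explicit cudaMemcpy staging. */
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```

Run with at least two ranks (e.g., mpirun -np 2) and a CUDA-aware MPI build; with a non-GPU-aware library, passing device pointers to MPI calls is undefined.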

  9. CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
  • Support for MPI communication from NVIDIA GPU device memory
  • High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
  • High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
  • Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
  • Optimized and tuned collectives for GPU device buffers
  • MPI datatype support for point-to-point and collective communication from GPU device buffers (see the allreduce sketch below)
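As a hedged illustration of the collective support, the sketch below runs a standard MPI_Allreduce directly on device buffers; the element count and zero-initialization are arbitrary choices, and GPU-resident collectives rely on a CUDA-aware build such as MVAPICH2-GDR:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const int n = 1024;  /* illustrative element count */
    double *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(double));
    cudaMalloc((void **)&d_out, n * sizeof(double));
    cudaMemset(d_in, 0, n * sizeof(double));  /* placeholder input data */

    /* With a CUDA-aware MPI, collectives accept device pointers too;
       the library picks staged, IPC, or GDR paths internally. */
    MPI_Allreduce(d_in, d_out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_in);
    cudaFree(d_out);
    MPI_Finalize();
    return 0;
}
```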

  10. Installing MVAPICH2-GDR
  • MVAPICH2-2.2 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
  • Please select the best-matching package for your system. Most common combinations are provided; if you do not find a match, please e-mail OSU with the following details:
    – OS version
    – OFED version
    – CUDA version
    – Compiler (GCC, Intel or PGI)
  • Install instructions:
    – With root permissions, on the default path: rpm -Uvh --nodeps mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm
    – With root permissions, on a specific path: rpm --prefix /custom/install/prefix -Uvh --nodeps mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm
    – Without root permissions: rpm2cpio mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm | cpio -id
  • For more details on the installation process, refer to: http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2#_installing_mvapich2_gdr_library

  11. Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
  [Charts: GPU-GPU inter-node latency, uni-directional bandwidth, and bi-directional bandwidth vs. message size, comparing MV2-GDR 2.2, MV2-GDR 2.0b, and MV2 without GDR. MV2-GDR 2.2 reaches 2.18 us small-message latency (annotated 10x vs. MV2 without GDR and 3x vs. MV2-GDR 2.0b), with 11x/2x gains in bandwidth and bi-bandwidth.]
  Testbed: MVAPICH2-GDR 2.2; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox ConnectX-4 EDR HCA; CUDA 8.0; Mellanox OFED 3.0 with GPU-Direct RDMA
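Numbers like these are typically produced with ping-pong micro-benchmarks in the style of the OSU suite. Below is a simplified, hedged sketch of such a GPU-GPU latency test (iteration and warm-up counts are arbitrary choices, and the real osu_latency does considerably more):

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int size = 4;                  /* message size in bytes (illustrative) */
    const int iters = 1000, skip = 100;  /* arbitrary timing and warm-up counts */
    char *buf;
    cudaMalloc((void **)&buf, size);

    double start = 0.0;
    for (int i = 0; i < iters + skip; i++) {
        if (i == skip) start = MPI_Wtime();  /* start the clock after warm-up */
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)  /* halve the round-trip time to get one-way latency */
        printf("avg one-way latency: %.2f us\n",
               (MPI_Wtime() - start) * 1e6 / (2.0 * iters));

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
```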

  12. Application-Level Evaluation (HOOMD-blue)
  [Charts: average time steps per second (TPS) for 64K and 256K particles at 4, 8, 16 and 32 processes; MV2+GDR achieves about 2x the TPS of MV2.]
  • Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
  • HOOMD-blue version 1.0.5
  • GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

  13. Full and Efficient MPI-3 RMA Support
  [Chart: small-message MPI-3 RMA latency vs. message size (0 bytes to 8K); 2.88 us small-message latency, a 6x improvement.]
  Testbed: MVAPICH2-GDR 2.2; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPU-Direct RDMA. (A one-sided RMA sketch follows below.)
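To show what MPI-3 RMA on GPU memory looks like from the application side, here is a minimal sketch that exposes a device buffer through an MPI window and does a one-sided put into it. The window size, fence-based synchronization, and zero-filled source data are illustrative assumptions, not a prescription from the slide:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);  /* run with at least 2 ranks */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 4096;  /* illustrative window size */
    double *d_win_buf;
    cudaMalloc((void **)&d_win_buf, n * sizeof(double));

    /* Expose GPU memory through an MPI-3 window; a CUDA-aware library
       can service Put/Get on it over GPUDirect RDMA. */
    MPI_Win win;
    MPI_Win_create(d_win_buf, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);  /* open the access epoch on all ranks */
    if (rank == 0) {
        double *d_src;
        cudaMalloc((void **)&d_src, n * sizeof(double));
        cudaMemset(d_src, 0, n * sizeof(double));  /* placeholder data */
        /* One-sided put from local device memory into rank 1's window. */
        MPI_Put(d_src, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);  /* close the epoch; the put is now complete */
        cudaFree(d_src);
    } else {
        MPI_Win_fence(0, win);  /* fences are collective: match rank 0's call */
    }

    MPI_Win_free(&win);
    cudaFree(d_win_buf);
    MPI_Finalize();
    return 0;
}
```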

  14. Performance of MVAPICH2-GDR with GPU-Direct RDMA and Multi-Rail Support
  [Charts: GPU-GPU inter-node MPI uni-directional and bi-directional bandwidth vs. message size for MV2-GDR 2.1 and MV2-GDR 2.1 RC2; multi-rail support raises peak uni-directional bandwidth by 20% to 8,759 MB/s and peak bi-directional bandwidth by 40% to 15,111 MB/s.]
  Testbed: MVAPICH2-GDR 2.2b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPU-Direct RDMA
