MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies

GPU Technology Conference (GTC 2018)

by

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Hari Subramoni, The Ohio State University
E-mail: subramon@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~subramon

Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• Current Features
    • Multi-stream Communication for IPC
    • CMA-based Intra-node Host-to-Host Communication Support
    • Maximal Overlap in MPI Datatype Processing
    • Efficient Support for Managed Memory
    • Streaming Support with InfiniBand Multicast and GDR
    • Support for Deep Learning
    • Support for OpenPOWER with NVLink
    • Support for Container
• Upcoming Features
    • CMA-based Intra-node Collective Communication Support
    • XPMEM-based Collective Communication Support
    • Optimized Collectives for Deep Learning
    • Out-of-core processing for Deep Learning
• Conclusions

Network Based Computing Laboratory, GTC 2018

Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
    – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
    – MVAPICH2-X (MPI + PGAS), Available since 2011
    – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
    – Support for Virtualization (MVAPICH2-Virt), Available since 2015
    – Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
    – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
    – Used by more than 2,875 organizations in 86 countries
    – More than 460,000 (> 0.46 million) downloads from the OSU site directly
    – Empowering many TOP500 clusters (Nov '17 ranking)
        • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
        • 9th, 556,104-core (Oakforest-PACS) in Japan
        • 12th, 368,928-core (Stampede2) at TACC
        • 17th, 241,108-core (Pleiades) at NASA
        • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
    – Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
    – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade

MVAPICH2 Release Timeline and Downloads
[Figure: Number of Downloads (axis from 0 to 500,000) plotted against a Timeline running from Sep-04 to Jan-18. Releases marked along the timeline, in order: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2 Virt 2.2, MV2-X 2.3b, MV2-GDR 2.3a, MV2 2.3rc1, OSU INAM 0.9.3.]

MVAPICH2 Architecture
• High Performance Parallel Programming Models
    – Message Passing Interface (MPI)
    – PGAS (UPC, OpenSHMEM, CAF, UPC++)
    – Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime
    – Diverse APIs and Mechanisms: Point-to-point Primitives, Collectives Algorithms, Job Startup, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Introspection & Analysis
• Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, OmniPath)
    – Transport Protocols: RC, XRC, UD, DC
    – Modern Features: UMR, ODP*, SR-IOV, Multi Rail
• Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPower, Xeon-Phi (MIC, KNL*), NVIDIA GPGPU)
    – Transport Mechanisms: Shared Memory, CMA, IVSHMEM
    – Modern Features: MCDRAM*, NVLink*, CAPI*
* Upcoming

MVAPICH2 Software Family
• High-Performance Parallel Programming Libraries
    – MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
    – MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
    – MVAPICH2-GDR: Optimized MPI for clusters with NVIDIA GPUs
    – MVAPICH2-Virt: High-performance and scalable MPI for hypervisor and container based HPC cloud
    – MVAPICH2-EA: Energy aware and High-performance MPI
    – MVAPICH2-MIC: Optimized MPI for clusters with Intel KNC
• Microbenchmarks
    – OMB: Microbenchmarks suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
• Tools
    – OSU INAM: Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
    – OEMT: Utility to measure the energy consumption of MPI applications

MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters
• Connected as PCIe devices – Flexibility but Complexity
• Memory buffers
    1. Intra-GPU
    2. Intra-Socket GPU-GPU
    3. Inter-Socket GPU-GPU
    4. Inter-Node GPU-GPU
    5. Intra-Socket GPU-Host
    6. Inter-Socket GPU-Host
    7. Inter-Node GPU-Host
    8. Inter-Node GPU-GPU with IB adapter on remote socket
    and more . . .
• NVLink is leading to more paths
• For each path different schemes: Shared_mem, IPC, GPUDirect RDMA, pipeline …
• Critical for runtimes to optimize data movement while hiding the complexity
[Diagram: Node 0 and Node 1, each showing CPUs linked by QPI, GPUs attached over PCIe, and an IB adapter.]
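To make the path taxonomy on this slide concrete, the following is a minimal Python sketch that labels the path between two buffers using the slide's own names (paths 1 to 7). The `Buffer` type and `classify_path` function are hypothetical illustrations, not part of MVAPICH2. Path 8 (IB adapter on a remote socket) and NVLink are not modelled, and no claim is made about which scheme the library chooses for each path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    """Location of a communication buffer (hypothetical model, not an MVAPICH2 type)."""
    node: int      # node index in the cluster
    socket: int    # CPU socket index within the node
    on_gpu: bool   # True for GPU device memory, False for host memory
    gpu: int = -1  # GPU index within the node; ignored for host buffers


def classify_path(src: Buffer, dst: Buffer) -> str:
    """Name the communication path between two buffers, using the slide's labels."""
    if not (src.on_gpu or dst.on_gpu):
        return "Host-Host"  # not one of the numbered GPU paths on the slide
    both_gpu = src.on_gpu and dst.on_gpu
    kind = "GPU-GPU" if both_gpu else "GPU-Host"
    if src.node != dst.node:
        return f"Inter-Node {kind}"
    if both_gpu and src.socket == dst.socket and src.gpu == dst.gpu:
        return "Intra-GPU"
    if src.socket == dst.socket:
        return f"Intra-Socket {kind}"
    return f"Inter-Socket {kind}"


if __name__ == "__main__":
    a = Buffer(node=0, socket=0, on_gpu=True, gpu=0)
    b = Buffer(node=1, socket=0, on_gpu=True, gpu=0)
    print(classify_path(a, b))
```

A runtime would use a classification like this only as the first step; selecting among Shared_mem, IPC, GPUDirect RDMA, or pipelining for each path is the part the library hides from the application.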

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);
(data movement is handled inside MVAPICH2)

High Performance and High Productivity
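The benefit of overlapping GPU data movement with RDMA transfers can be estimated with a simple two-stage pipeline timing model. The Python sketch below is a back-of-the-envelope illustration only, not MVAPICH2's implementation: the function names, the chunk count, and the per-chunk times are all hypothetical, and it assumes fixed per-chunk costs and enough staging buffers that neither stage ever waits for space.

```python
def serial_time(chunks: int, copy_us: float, rdma_us: float) -> float:
    """Total time when each chunk is copied off the GPU and then sent, with no overlap."""
    return chunks * (copy_us + rdma_us)


def pipelined_time(chunks: int, copy_us: float, rdma_us: float) -> float:
    """Total time when the copy of the next chunk overlaps the RDMA send of the current one."""
    if chunks <= 0:
        return 0.0
    # The first chunk fills the pipeline; after that the slower stage sets the pace.
    return copy_us + rdma_us + (chunks - 1) * max(copy_us, rdma_us)


if __name__ == "__main__":
    # Hypothetical numbers: 4 chunks, 10 us per copy and 10 us per RDMA send.
    print(serial_time(4, 10.0, 10.0), pipelined_time(4, 10.0, 10.0))
```

With balanced stages the pipelined total approaches half of the serial total as the number of chunks grows, which is why chunk size is a tuning parameter in pipelined designs.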

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory

Using MVAPICH2-GPUDirect Version
• MVAPICH2-2.3 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
    • Mellanox OFED 3.2 or later
    • NVIDIA Driver 367.48 or later
    • NVIDIA CUDA Toolkit 7.5/8.0/9.0 or later
    • Plugin for GPUDirect RDMA
      http://www.mellanox.com/page/products_dyn?product_family=116
    • Strongly recommended: GDRCOPY module from NVIDIA
      https://github.com/NVIDIA/gdrcopy
• Contact MVAPICH help list with any questions related to the package: mvapich-help@cse.ohio-state.edu
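The minimum versions listed on this slide can be checked mechanically. The Python sketch below is one hypothetical way to do so: `parse_version`, `MINIMUMS`, and `unmet_requirements` are illustrative names, not tools shipped with MVAPICH2. It assumes the installed versions have already been collected as plain dotted numeric strings, and real version strings with suffixes would need extra parsing.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as '367.48' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum versions as listed on the slide (CUDA 7.5 is the oldest toolkit named).
MINIMUMS = {
    "Mellanox OFED": "3.2",
    "NVIDIA Driver": "367.48",
    "NVIDIA CUDA Toolkit": "7.5",
}


def unmet_requirements(installed: dict) -> list:
    """Return the names of components that are missing or older than the minimum."""
    unmet = []
    for name, minimum in MINIMUMS.items():
        found = installed.get(name)
        if found is None or parse_version(found) < parse_version(minimum):
            unmet.append(name)
    return unmet


if __name__ == "__main__":
    # Hypothetical installed versions, for illustration only.
    print(unmet_requirements({"Mellanox OFED": "4.0", "NVIDIA Driver": "384.81"}))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "3.10" as older than "3.2".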

MVAPICH2-GDR 2.3a
• Released on 11/09/2017
• Major Features and Enhancements
    – Based on MVAPICH2 2.2
    – Support for CUDA 9.0
    – Add support for Volta (V100) GPU
    – Support for OpenPOWER with NVLink
    – Efficient Multiple CUDA stream-based IPC communication for multi-GPU systems with and without NVLink
    – Enhanced performance of GPU-based point-to-point communication
    – Leverage Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
    – Enhanced performance of MPI_Allreduce for GPU-resident data
    – InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications
        • Basic support for IB-MCAST designs with GPUDirect RDMA
        • Advanced support for zero-copy IB-MCAST designs with GPUDirect RDMA
        • Advanced reliability support for IB-MCAST designs
    – Efficient broadcast designs for Deep Learning applications
    – Enhanced collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1 systems

Optimized MVAPICH2-GDR Design
[Figures: three panels comparing MV2-(NO-GDR) with MV2-GDR-2.3a over Message Size (Bytes): GPU-GPU Inter-node Latency (us), GPU-GPU Inter-node Bandwidth (MB/s), and GPU-GPU Inter-node Bi-Bandwidth (MB/s). Annotations on the plots: 1.88us, 11X, 10x, 9x.]
Test platform:
• MVAPICH2-GDR-2.3a
• Intel Haswell (E5-2687W @ 3.10 GHz) node - 20 cores
• NVIDIA Volta V100 GPU
• Mellanox Connect-X4 EDR HCA
• CUDA 9.0
• Mellanox OFED 4.0 with GPU-Direct-RDMA
