SLIDE 1

Exploiting Computation and Communication Overlap in MVAPICH2 MPI Library

Keynote Talk at the Charm++ Workshop (April ‘18)
by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

SLIDE 2

High-End Computing (HEC): Towards Exascale

Expected to have an ExaFlop system in 2020-2022!

100 PFlops in 2016; 1 EFlops in 2020-2022?

SLIDE 3

Parallel Programming Models Overview

[Figure: three abstract machine models. Shared Memory Model (SHMEM, DSM): processes P1-P3 over a single shared memory. Distributed Memory Model (MPI, Message Passing Interface): processes P1-P3, each with its own memory. Partitioned Global Address Space (PGAS) model (Global Arrays, UPC, Chapel, X10, CAF, …): processes P1-P3 with their own memories combined into a logical shared memory.]

  • Programming models provide abstract machine models
  • Models can be mapped on different types of systems

– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

  • PGAS models and hybrid MPI+PGAS models are gradually gaining importance
  • Task-based models (Charm++) are being used extensively
SLIDE 4

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: software stack with co-design opportunities and challenges across the layers]
– Application Kernels/Applications
– Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Charm++, Hadoop (MapReduce), Spark (RDD, DAG)
– Middleware: communication library or runtime for programming models (point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance)
– Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path); Multi-/Many-core Architectures; Accelerators (GPU and FPGA)
– Cross-cutting concerns: performance, scalability, resilience

SLIDE 5

Basic Concept of Overlapping Communication with Computation

[Figure: three layers and their roles]
– Networking Technology: provides overlap capabilities through network mechanisms
– Runtime (MPI, Charm++): designs runtime primitives exploiting the overlap capabilities of the network mechanisms
– Application: takes advantage of overlap, either transparently or through co-design
SLIDE 6

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 – MVAPICH2-X (MPI + PGAS), Available since 2011 – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014 – Support for Virtualization (MVAPICH2-Virt), Available since 2015 – Support for Energy-Awareness (MVAPICH2-EA), Available since 2015 – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 2,875 organizations in 86 countries – More than 462,000 (> 0.46 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘17 ranking)

  • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
  • 9th, 556,104 cores (Oakforest-PACS) in Japan
  • 12th, 368,928-core (Stampede2) at TACC
  • 17th, 241,108-core (Pleiades) at NASA
  • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

  • Empowering Top500 systems for over a decade
SLIDE 7

[Figure: cumulative number of downloads from Sep-04 to Jan-18, annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3rc1, MV2-GDR 2.3a, MV2-X 2.3b, MV2-Virt 2.2, and OSU INAM 0.9.3]

MVAPICH2 Release Timeline and Downloads

SLIDE 8

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime, with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
– Transport protocols: RC, XRC, UD, DC
– Modern features: UMR, ODP, SR-IOV, Multi-Rail, MCDRAM*, NVLink*, CAPI*, XPMEM* (* upcoming)
– Transport mechanisms: shared memory, CMA, IVSHMEM

SLIDE 9

MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
– MVAPICH2: support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
– MVAPICH2-X: advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
– MVAPICH2-GDR: optimized MPI for clusters with NVIDIA GPUs
– MVAPICH2-Virt: high-performance and scalable MPI for hypervisor- and container-based HPC cloud
– MVAPICH2-EA: energy-aware and high-performance MPI
– MVAPICH2-MIC: optimized MPI for clusters with Intel KNC
Microbenchmarks
– OMB: microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
– OSU INAM: network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
– OEMT: utility to measure the energy consumption of MPI applications

SLIDE 10

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 11

Overlapping Application Compute with MPI Startup

[Figure: timelines for processes P0-P3. Left: MPI_Init (Initialize HCA, Obtain Endpoint Address, Exchange Addresses) finishes before the application (Read Input Files, Set Up Problem, Compute / Communicate) begins, so there is no overlap between MPI_Init and application computation. Right: communication-independent tasks of the application proceed while addresses are exchanged; MPI can continue to initialize in the background while the application starts.]

SLIDE 12

  • Near-constant MPI and OpenSHMEM initialization time at any process count
  • 10x and 30x improvement in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
  • Memory consumption for remote endpoint information reduced by O(processes per node)
  • 1 GB of memory saved per node with 1M processes and 16 processes per node

Towards High Performance and Scalable Startup at Exascale

[Figure: job startup performance and memory required to store endpoint information. Legend: P = PGAS (state of the art), M = MPI (state of the art), O = PGAS/MPI (optimized); techniques labeled a-e: On-demand Connection, PMIX_Ring, PMIX_Ibarrier, PMIX_Iallgather, and Shmem-based PMI.]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)
PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)
Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)
SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16)

SLIDE 13

Startup Performance on KNL + Omni-Path

[Figure: MPI_Init time (s) vs. number of processes. Left: MPI_Init on TACC Stampede-KNL (64 to 232K processes), Intel MPI 2018 beta vs. MVAPICH2 2.3a. Right: MPI_Init and Hello World time with MVAPICH2-2.3a on Oakforest-PACS (64 to 64K processes).]

  • MPI_Init takes 51 seconds on 231,956 processes on 3,624 KNL nodes (Stampede – Full scale)
  • 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC)
  • At 64K processes, MPI_Init and Hello World takes 5.8s and 21s respectively (Oakforest-PACS)
  • All numbers reported with 64 processes per node


New designs available in MVAPICH2-2.3a and as patch for SLURM-15.08.8 and SLURM-16.05.1

SLIDE 14

  • SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments

  • Only a single copy per node - O(processes per node) reduction in memory usage
  • Estimated savings of 1GB per node with 1 million processes and 16 processes per node
  • Up to 1,000 times faster PMI Gets compared to default design
  • Available since MVAPICH2 2.2rc1 and SLURM-15.08.8

Process Management Interface (PMI) over Shared Memory (SHMEMPMI)

[Figure: (left) time taken by one PMI_Get (ms) vs. processes per node, Default vs. SHMEMPMI: up to 1,000x faster (estimated); (right) memory usage per node (MB) for remote EP information vs. processes per job, Fence/Allgather with default vs. shared-memory designs: 16x actual reduction.]

SLIDE 15

On-demand Connection Management for OpenSHMEM+MPI

[Figure: (left) breakdown of OpenSHMEM startup time (connection setup, PMI exchange, memory registration, shared memory setup, other) for 32 to 4K processes; (right) OpenSHMEM initialization and Hello World time for 16 to 8K processes, static vs. on-demand connection establishment.]

  • Static connection establishment wastes memory and takes a lot of time
  • On-demand connection management improves OpenSHMEM initialization time by 29.6 times
  • Time taken for Hello World reduced by 8.31 times at 8,192 processes
  • Available since MVAPICH2-X 2.1rc1
SLIDE 16

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 17

– Good communication performance for smaller messages – No synchronization required between sender and receiver – Cost of extra copies is high for large messages

Communication Costs of Point-to-point Protocols - Eager

[Figure: eager protocol data path. Application data is copied into pre-registered communication buffers on the sender (cost: memcpy), transferred over the network (cost: network transfer), and copied from pre-registered buffers into application data on the receiver (cost: memcpy).]

SLIDE 18

Communication Costs of Point-to-point Protocols - Rendezvous

[Figure: rendezvous protocol data path; three control messages (cost: half RTT each) plus the data transfer (cost: network transfer).]

– Avoid extra copies for larger messages – Synchronization required between sender and receiver – Can be based on RDMA Read or RDMA Write (shown here)

SLIDE 19

  • Application process schedules the communication operation
  • Network adapter progresses the communication in the background
  • Application process is free to perform useful compute in the foreground
  • Overlap of computation and communication => better overall application performance
  • Increased buffer requirement
  • Poor communication performance if used for all types of communication operations

Analyzing Overlap Potential of Eager Protocol

[Figure: timeline of an eager transfer: each application process schedules its send/receive operation with the network interface card, computes while the NIC progresses the communication, and later checks for completion. Also shown: impact of changing the eager threshold on a multi-pair message-rate benchmark with 32 processes on Stampede.]

SLIDE 20

  • Application process schedules the communication operation
  • Application process is free to perform useful compute in the foreground
  • Little communication progress in the background
  • All communication takes place at the final synchronization
  • Reduced buffer requirement
  • Good communication performance if used for large message sizes and for operations where the communication library is progressed frequently
  • Poor overlap of computation and communication => poor overall application performance

Analyzing Overlap Potential of Rendezvous Protocol

[Figure: timeline of a rendezvous transfer: the RTS/CTS handshake makes little progress while both processes compute; repeated completion checks return "not complete" until the final synchronization, when the communication actually takes place.]
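One way applications mitigate this is to progress the MPI library explicitly from the compute loop. The sketch below is illustrative only and not taken from the slides; do_compute_chunk(), the peer rank, and the message size are placeholders.

/* Sketch: progressing a large (rendezvous-path) transfer with MPI_Test */
#include <mpi.h>

extern void do_compute_chunk(void);   /* placeholder for independent work */

void overlap_large_send(double *buf, int count, int peer, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Isend(buf, count, MPI_DOUBLE, peer, 0, comm, &req);  /* large count: rendezvous path */

    while (!done) {
        do_compute_chunk();                          /* useful computation */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);    /* lets MPI progress the handshake/transfer */
    }
}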

SLIDE 21

Impact of Tuning Eager Threshold on 3D-Stencil

[Figure: 3D-stencil, Default vs. Tuned, vs. message size (1 B to 256 KB): (left) communication time (ms), (center) overlap potential (%), (right) overall performance (ms).]

  • Increased eager threshold from 16KB to 512KB
  • Very small degradation in raw communication performance
  • Significant improvement in overlap of computation and communication
  • ~18% Improvement in overall performance

Runtime parameters: MV2_IBA_EAGER_THRESHOLD=512K and MV2_SMP_EAGERSIZE=512K (applicable to both InfiniBand and Omni-Path); 8,192 processes, SandyBridge + FDR
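To illustrate the pattern that benefits from such tuning, here is a minimal, hedged sketch of a non-blocking halo exchange; compute_interior(), compute_boundary(), the neighbor ranks, and the launch command are placeholders and not from the slides. When the halo messages stay under the raised eager threshold, the sends complete without waiting for the receiver and the interior computation overlaps with the transfers.

/* Sketch: halo exchange whose medium-sized messages stay on the eager path.
 * Hypothetical launch with the tuned thresholds:
 *   mpirun_rsh -np 8192 -hostfile hosts MV2_IBA_EAGER_THRESHOLD=512K \
 *       MV2_SMP_EAGERSIZE=512K ./stencil                              */
#include <mpi.h>

extern void compute_interior(void);   /* work that does not need halo data */
extern void compute_boundary(void);   /* work that needs the halo data */

void halo_exchange(double *send_up, double *send_dn,
                   double *recv_up, double *recv_dn,
                   int n, int up, int dn, MPI_Comm comm)
{
    MPI_Request reqs[4];

    MPI_Irecv(recv_up, n, MPI_DOUBLE, up, 0, comm, &reqs[0]);
    MPI_Irecv(recv_dn, n, MPI_DOUBLE, dn, 1, comm, &reqs[1]);
    MPI_Isend(send_up, n, MPI_DOUBLE, up, 1, comm, &reqs[2]);
    MPI_Isend(send_dn, n, MPI_DOUBLE, dn, 0, comm, &reqs[3]);

    compute_interior();                        /* overlapped with transfers */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    compute_boundary();
}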

SLIDE 22

Impact of Tuning Rendezvous Protocol on 3D-Stencil

  • RDMA Read based protocol (RGET) used instead of RDMA Write
  • Very minor penalty in raw performance
  • Offers more overlap due to less synchronization overhead
  • Up to 15% improvement in overall execution time

Runtime parameter: MV2_RNDV_PROTOCOL=RGET (applicable to InfiniBand); 64 processes, Broadwell + EDR

[Figure: 3D-stencil with 64 processes, Default vs. Tuned (RGET), vs. message size: (left) communication time (us), (center) overlap potential (%), (right) overall performance (us).]

SLIDE 23

Dynamic and Adaptive MPI Point-to-point Communication Protocols

  • Different communication protocols have different trade-offs

– Need to consider performance, overlap, memory requirement – Manual tuning is difficult and time-consuming

  • Can the MPI library select the best protocol at runtime?

– Use different protocols and thresholds between different pair of processes – Deliver good performance and minimize resource consumption – Dynamically adapt to the application’s communication requirements at runtime

Design comparison
– Default: poor overlap; low memory requirement => low performance; high productivity
– Manually Tuned: good overlap; high memory requirement => high performance; low productivity
– Dynamic + Adaptive: good overlap; optimal memory requirement => high performance; high productivity

SLIDE 24

Dynamic and Adaptive MPI Point-to-point Communication Protocols (Cont.)

[Figure: eager thresholds chosen for an example communication pattern between processes on two nodes. Default: 16 KB for every pair; Manually Tuned: 128 KB for every pair; Dynamic + Adaptive: per-pair thresholds of 32 KB, 64 KB, 128 KB, and 32 KB.]

  • H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper

[Figure: execution time (wall clock, s) and relative memory consumption of Amber at 128 to 1K processes for Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold.]

Desired eager threshold per process pair:
– pair 0-4: 32 KB
– pair 1-5: 64 KB
– pair 2-6: 128 KB
– pair 3-7: 32 KB

SLIDE 25

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 26

  • Non-blocking one-sided communication routines

– Put, Get (Rput, Rget) – Accumulate, Get_accumulate – Atomics

  • Flexible synchronization operations to control initiation and completion

MPI-3 RMA: Communication and synchronization Primitives

MPI one-sided synchronization/completion primitives
– Synchronization: Win_sync, Lock/Unlock, Lock_all/Unlock_all, Fence, Post-Wait/Start-Complete
– Completion: Flush, Flush_all, Flush_local, Flush_local_all

MVAPICH2 supports all RMA communication with the best performance and overlap
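As a concrete illustration of overlapping computation with one-sided communication, here is a minimal sketch using a passive-target epoch; the window, target rank, and do_local_work() are placeholders, and this is not code from the library or the slides.

/* Sketch: MPI_Put overlapped with local computation under Lock/Unlock,
 * completed with MPI_Win_flush. */
#include <mpi.h>

extern void do_local_work(void);   /* placeholder for independent computation */

void put_with_overlap(const double *src, int count, int target,
                      MPI_Aint target_disp, MPI_Win win)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

    MPI_Put(src, count, MPI_DOUBLE, target,
            target_disp, count, MPI_DOUBLE, win);

    do_local_work();               /* overlapped with the RMA transfer */

    MPI_Win_flush(target, win);    /* Put is complete at origin and target */
    MPI_Win_unlock(target, win);
}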

SLIDE 27

Overlap between Computation and RMA Operations

[Figure: overlap (%) vs. message size (1 B to 4 MB) for MPI_Put and MPI_Get with Lock/Unlock and MPI_Win_flush, MVAPICH2-2.3rc1. Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, Mellanox ConnectX-4 EDR HCA, Mellanox OFED 4.3.]

  • 67-99% overlap between MPI_Put and computation
  • 75-99% overlap between MPI_Get and computation

SLIDE 28

  • Proposed design performs better than the default implementation
  • For Weakly Connected Components (WCC) on 256 cores, the proposed design reduces total execution time by 2X compared with the default scheme

Graph Processing Framework with Optimized MPI RMA

[Figure: execution time (s) of Mizan-Default vs. Mizan-RMA-Opt on 64, 128, and 256 cores; PageRank with LiveJournal1 (up to 2X better) and WCC with LiveJournal1 (up to 3X better); lower is better.]

  • M. Li, X. Lu, K. Hamidouche, J. Zhang and D. K. Panda, "Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA," IEEE HiPC, 2016
SLIDE 29

  • Proposed design shows good strong scaling
  • Proposed design scales better than default implementation

Graph Processing Framework with Optimized MPI RMA

[Figure: total execution time (s) of Mizan-Default vs. Mizan-RMA-Opt for PageRank with the Arabic dataset at 128 processes; up to 2.5X better; lower is better.]

  • M. Li, X. Lu, K. Hamidouche, J. Zhang and D. K. Panda, "Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA," IEEE HiPC, 2016
SLIDE 30

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 31

Collective Communication in MVAPICH2

Run-time flags:
– All shared-memory based collectives: MV2_USE_SHMEM_COLL (default: ON)
– Hardware multicast-based collectives: MV2_USE_MCAST (default: OFF)
– CMA-based collectives: MV2_USE_CMA_COLL (default: ON)

[Figure: taxonomy of blocking and non-blocking collective algorithms in MV2; conventional (flat) vs. multi/many-core aware designs; inter-node communication via point-to-point, hardware multicast, SHARP, or RDMA; intra-node communication via point-to-point (SHMEM, LiMIC, CMA, XPMEM), direct shared memory, or direct kernel-assisted (CMA, XPMEM, LiMIC) mechanisms; designed for performance and overlap.]

SLIDE 32

Hardware Multicast-aware MPI_Bcast on TACC Stampede

[Figure: MPI_Bcast latency (us), Default vs. Multicast: small messages (2-512 B) and large messages (2K-128K) at 102,400 cores, plus 16-byte and 32-KByte message latency vs. number of nodes; up to 80-85% lower latency with multicast.]

  • MCAST-based designs improve latency of MPI_Bcast by up to 85%
  • Use MV2_USE_MCAST=1 to enable MCAST-based designs


SLIDE 33

Optimized CMA-based Collectives for Large Messages

[Figure: MPI_Gather latency (us) vs. message size (1K to 4M) on KNL nodes (64 PPN) at 2 nodes/128 procs, 4 nodes/256 procs, and 8 nodes/512 procs, comparing MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and Tuned CMA; Tuned CMA is roughly 2.5x to 17x better.]

  • Significant improvement over the existing implementation for Scatter/Gather with 1 MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPOWER)
  • New two-level algorithms for better scalability
  • Improved performance for other collectives (Bcast, Allgather, and Alltoall)

  • S. Chakraborty, H. Subramoni, and D. K. Panda, Contention-Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17, Best Paper Finalist

Performance of MPI_Gather on KNL nodes (64PPN)

Available in MVAPICH2-X 2.3b

SLIDE 34

Shared Address Space (XPMEM)-based Collectives Design

[Figure: OSU_Allreduce latency (us) vs. message size (16K to 4M) on Broadwell with 256 procs, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-Opt.]

  • “Shared Address Space”-based true zero-copy reduction collective designs in MVAPICH2
  • Offloads computation/communication to peer ranks in the reduction collective operation
  • Up to 4X improvement for 4 MB Reduce and up to 1.8X improvement for 4 MB Allreduce

[Figure: OSU_Reduce latency (us) vs. message size (16K to 4M) on Broadwell with 256 procs; MVAPICH2-Opt is up to 4X better for Reduce and up to 1.8X better for Allreduce at 4 MB.]

  • J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.

Will be available in a future release

SLIDE 35

Application-Level Benefits of XPMEM-Based Collectives

  • Up to 20% benefit over IMPI for CNTK DNN training using Allreduce
  • Up to 27% benefit over IMPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel

[Figure: execution time (s) vs. number of processes for IMPI-2017v1.132, MVAPICH2-2.3b, and MVAPICH2-Opt: CNTK AlexNet training (Broadwell, default batch size, 50 iterations, ppn=28) at 28 to 224 processes, and MiniAMR (Broadwell, ppn=16) at 16 to 256 processes; MVAPICH2-Opt improves by up to 20%/9% and 27%/15%, respectively.]

SLIDE 36

Problems with Blocking Collective Operations

[Figure: four application processes performing a blocking collective; time splits into separate computation and communication phases.]

  • Communication time cannot be used for compute

– No overlap of computation and communication – Inefficient

SLIDE 37

  • Application processes schedule collective operation
  • Check periodically if operation is complete
  • Overlap of computation and communication => Better Performance
  • Catch: who will progress the communication?

Concept of Non-blocking Collectives

[Figure: four application processes each schedule the collective operation with a communication support entity, compute, and periodically check whether the operation is complete; computation overlaps with communication.]

SLIDE 38

  • Enables overlap of computation with communication
  • Non-blocking calls do not match blocking collective calls

– MPI may use different algorithms for blocking and non-blocking collectives – Blocking collectives: Optimized for latency – Non-blocking collectives: Optimized for overlap

  • A process calling an NBC operation

– Schedules collective operation and immediately returns – Executes application computation code – Waits for the end of the collective

  • The communication is progressed by

– Application code through MPI_Test – Network adapter (HCA) with hardware support – Dedicated processes / thread in MPI library

  • There is a non-blocking equivalent for each blocking operation

– Has an “I” in the name

  • MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce

Non-blocking Collective (NBC) Operations

SLIDE 39

How do I write applications with NBC?

int main(int argc, char **argv) {
    MPI_Request req;
    int flag;
    MPI_Init(&argc, &argv);
    …
    /* sendbuf, recvbuf, and count are set up by the application */
    MPI_Ialltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                  MPI_COMM_WORLD, &req);
    /* Computation that does not depend on the result of Alltoall */
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);   /* Check if complete (non-blocking) */
    /* Computation that does not depend on the result of Alltoall */
    MPI_Wait(&req, MPI_STATUS_IGNORE);          /* Wait till complete (blocking) */
    …
    MPI_Finalize();
}

SLIDE 40

P3DFFT Performance with Non-Blocking Alltoall using RDMA Primitives

  • Weak scaling experiments; problem size increases with job size
  • RDMA-Aware delivers 19% improvement over Default @ 8,192 procs
  • Default-Thread exhibits worst performance

– Possibly because threads steal CPU cycles from P3DFFT
– Not considered for large-scale experiments

[Figure: CPU time per loop (s) vs. number of processes: small-scale runs (128-512 processes; Default, RDMA-Aware, Default-Thread) and large-scale runs (128-8K processes; Default, RDMA-Aware); RDMA-Aware is 19% better at 8,192 processes.]

Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters, H. Subramoni, A. A. Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and D. K. Panda, ISC '15, Jul 2015

Will be available in a future release

SLIDE 41

  • Management and execution of MPI operations in the network by using SHArP
  • Manipulation of data while it is being transferred in the switch network
  • SHArP provides an abstraction to realize the reduction operation
  • Defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
  • AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
  • Uses RC for communication between ANs and between ANs and hosts in the Aggregation Tree*

Offloading with Scalable Hierarchical Aggregation Protocol (SHArP)

[Figure: physical network topology and the corresponding logical SHArP aggregation tree.*]

* Bloch et al. Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction

SLIDE 42

Evaluation of SHArP-based Non-Blocking Allreduce (MPI_Iallreduce benchmark)

[Figure: 1 PPN (process per node), 8 nodes, MVAPICH2 vs. MVAPICH2-SHArP: (left) pure communication latency (us) vs. message size (4-128 B), lower is better; (right) communication-computation overlap (%) vs. message size, higher is better, up to 2.3x more overlap with SHArP.]

  • Complete offload of the Allreduce collective operation to the switch enables much higher overlap of communication and computation

Available since MVAPICH2 2.3a
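The pattern that benefits from this offload is the usual post-compute-wait structure around MPI_Iallreduce; the sketch below is illustrative (buffer setup and do_independent_work() are placeholders), not library code.

/* Sketch: non-blocking Allreduce overlapped with independent computation. */
#include <mpi.h>

extern void do_independent_work(void);   /* placeholder: work not needing the result */

void overlapped_allreduce(const double *in, double *out, int count, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Iallreduce(in, out, count, MPI_DOUBLE, MPI_SUM, comm, &req);

    do_independent_work();                /* reduction can progress (e.g., in the switch) */

    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* result now available in out */
}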

SLIDE 43

  • Mellanox ConnectX-2, ConnectX-3, Connect-IB, ConnectX-4, and ConnectX-5 adapters feature a "task-list" offload interface
– An extension to existing InfiniBand APIs
  • Collective communication with 'blocking' semantics is usually a scaling bottleneck
– Matches the need for non-blocking collectives in MPI
  • Accordingly, MPI software stacks need to be re-designed to leverage offload in a comprehensive manner
  • Can applications be modified to take advantage of non-blocking collectives, and what will be the benefits?

Collective Offload in ConnectX-2, ConnectX-3, Connect-IB and ConnectX-4, ConnectX-5

SLIDE 44

Collective Offload Support in ConnectX InfiniBand Adapter (Recv followed by Multi-Send)

  • The sender creates a task-list consisting of only send and wait WQEs
– One send WQE is created for each registered receiver and is appended to the rear of a singly linked task-list
– A wait WQE is added to make the ConnectX-2 HCA wait for an ACK packet from the receiver

[Figure: the application posts a task-list (Send, Wait, Send, Send, Send, Wait) to a management queue (MQ) on the InfiniBand HCA; the HCA executes it through its send/receive queues and completion queues (Send CQ, Recv CQ, MCQ) over the physical link.]

SLIDE 45

Co-designing HPL with Core-Direct and Performance Benefits

[Figure: HPL performance comparison with 512 processes: normalized HPL performance vs. problem size (N) as % of total memory, and throughput (GFlops) with memory consumption (%) vs. system size (64-512 processes), for HPL-Offload, HPL-1ring, and HPL-Host.]

  • HPL-Offload consistently offers higher throughput than HPL-1ring and HPL-Host; it improves peak throughput by up to 4.5% for large problem sizes
  • HPL-Offload surpasses the peak throughput of HPL-1ring with significantly smaller problem sizes and run-times!

  • K. Kandalla, H. Subramoni, J. Vienne, S. Pai Raikar, K. Tomko, S. Sur, and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HOTI 2011

SLIDE 46

Pre-conditioned Conjugate Gradient (PCG) Solver Performance with Non-Blocking Allreduce based on CX-2 Collective Offload

[Figure: PCG run-time (s) vs. number of processes (64-512), PCG-Default vs. Modified-PCG-Offload.]

64,000 unknowns per process. Modified PCG with Offload-Allreduce performs 21.8% better than the default PCG.

  • K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12, May 2012.

SLIDE 47

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 48

At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);

Inside MVAPICH2:
  • Standard MPI interfaces used for unified data movement
  • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
  • Overlaps data movement from GPU with RDMA transfers
  • High performance and high productivity

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
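A minimal end-to-end sketch of the same idea follows; it assumes a CUDA-aware MPI build (e.g., MVAPICH2-GDR run with MV2_USE_CUDA=1), and the message size and rank roles are illustrative.

/* Sketch: passing GPU device pointers directly to MPI_Send/MPI_Recv. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    const int n = 1 << 20;          /* illustrative message length */
    double *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&devbuf, n * sizeof(double));

    if (rank == 0)
        MPI_Send(devbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}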

SLIDE 49

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases

  • Support for MPI communication from NVIDIA GPU device memory
  • High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
  • High-performance intra-node point-to-point communication for multi-GPU nodes (GPU-GPU, GPU-Host, and Host-GPU)
  • Takes advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
  • Optimized and tuned collectives for GPU device buffers
  • MPI datatype support for point-to-point and collective communication from GPU device buffers
  • Unified memory
SLIDE 50

GPU-Direct RDMA (GDR) with CUDA

  • OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
  • OSU has a design of MVAPICH2 using GPUDirect RDMA
– Hybrid design using GPUDirect RDMA and host-based pipelining
  • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  • Similar bottlenecks on Haswell
– Support for communication using multi-rail
– Support for Mellanox Connect-IB and ConnectX VPI adapters
– Support for RoCE with Mellanox ConnectX VPI adapters

[Figure: IB adapter, chipset, CPU, system memory, GPU, and GPU memory data paths.]

P2P bandwidth limits (SNB E5-2670 / IVB E5-2680V2):
– P2P read: intra-socket <1.0 GB/s (SNB) and 3.5 GB/s (IVB); inter-socket <300 MB/s on both
– P2P write: intra-socket 5.2 GB/s (SNB) and 6.4 GB/s (IVB); inter-socket <300 MB/s on both

SLIDE 51

Optimized MVAPICH2-GDR Design

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size (1 B to 8 KB), MV2 (no GDR) vs. MV2-GDR 2.3a: about 11X lower latency (1.88 us) and roughly 9-10X higher bandwidth and bi-bandwidth for small messages. Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA.]

SLIDE 52

Overlap with Optimized MVAPICH2-GDR Design

[Figure: overlap (%) vs. message size (1 B to 1 MB) for GPU-GPU intra-node and inter-node communication, MVAPICH2 (no GDR) vs. MVAPICH2-GDR-2.3a. Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA.]

  • Up to 69% overlap* for intra-node GPU-GPU communication
  • With GDR, up to 78% overlap* for inter-node small and medium message transfers
  • With intelligent pipelining, up to 88% overlap* for inter-node large message transfers

*Overlap between GPU-to-GPU communication and CPU computation

SLIDE 53

  • Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
  • HoomdBlue Version 1.0.5
  • GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (HOOMD-blue)

[Figure: average time steps per second (TPS) vs. number of processes (4-32) for 64K and 256K particles, MV2 vs. MV2+GDR; about 2X higher TPS with GDR.]

SLIDE 54

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figure: normalized execution time vs. number of GPUs for Default, Callback-based, and Event-based designs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs).]

  • 2X improvement on 32 GPU nodes
  • 30% improvement on 96 GPU nodes (8 GPUs/node)
  • C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application

Cosmo model: http://www2.cosmo-model.org/content /tasks/operational/meteoSwiss/

SLIDE 55

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 56

  • Applications use GPU/CPU resources for computation and MPI for communication directly from GPU buffers
  • MPI collectives are common in GPU applications, e.g., Alltoall for FFTs
  • Collectives are time-consuming at scale, so MPI-3.0 introduced NBCs
  • Non-blocking communication operations from GPU buffers can
– Allow the CPU to overlap GPU-based communication with CPU compute
– Ease GPU-kernel redundancy in waiting for non-dependent communication
– Allow power-efficient execution from the CPU perspective
  • A rich set of GPU and network primitives is available for NBC designs, but architectural limitations must be addressed

Motivation: Exploiting CORE-Direct and GPUDirect RDMA

  • A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC ’15
SLIDE 57

  • Realized through mapping of the MPICH schedule abstraction
– Schedule composed of sched_send, sched_barrier, sched_recv, sched_start, etc.
– Mapped to Core-Direct primitives, with collective-specific GPU↔Host movement done additionally
  • Multiple designs explored
– Naïve design: host-assisted GPU NBC (Scatter)
– Offload-Staged: host-assisted GPU NBC + Core-Direct
– Offload-GDR: (GDR + Core-Direct)-based NBC
– Offload-Callback: (Core-Direct, GDR, CUDA)-based NBC

Overview of Core-Direct + GPUDirect Designs

SLIDE 58

  • Use of GDR and CUDA callback mechanisms improves latency (comparable for Alltoall)
  • Latency remains high for Alltoall even though the callback design avoids staging latency

Latency Comparison with Blocking Collectives

[Figure: 64-node Iallgather latency and 64-node Ialltoall latency (us) vs. message size (4K, 64K, 1 MB) for gdr-offload-cb, staged-offload, offload-gdr, and blocking designs.]

SLIDE 59

  • New schemes are able to exploit overlap well

Effect of Compute Location on Overlap/Latency

[Figure: 64-node Iallgather overlap (%) and latency (us) vs. message size (4K to 256K) for gdr-offload-cb and offload-gdr variants with the compute placed on the GPU vs. the CPU.]

Available in MVAPICH2-GDR 2.3a

SLIDE 60

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 61

  • Scientific parallel applications spend a considerable amount of time in GPU-based collective communication operations
– E.g., deep learning frameworks such as TensorFlow and Caffe
  • Optimized computation-intensive collectives in MVAPICH2-GDR
– MPI_Reduce and MPI_Allreduce
– Exploring the best combinations: computation on the CPU or the GPU, and communication through host or GPU memory

GPU-kernel based Reduction

[Figure: two nodes (A and B), each with CPU, host memory, GPU, PCIe, and IB adapter, with numbered data-movement steps for the reduction paths.]

SLIDE 62

[Figure: MPI_Reduce latency (us) vs. message size for Default, BD-DD, GR-DD, GR-HD, GR-HH, and GR-H-HH designs; small messages (4 B to 8 KB) and large messages (16 KB to 4 MB).]

Evaluation - MPI_Reduce @ CSCS (96 GPUs)

Gather-first approaches* win for small messages; K-nomial GPU-based approaches* win for large messages

*Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, and Dhabaleswar K. Panda, "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, ” IEEE/ACM CCGrid’16.

SLIDE 63

[Figure: MPI_Allreduce latency (ms) for Default, RD-DD, and BRB-DD designs vs. message size on 96 GPUs at CSCS, and vs. system size (2-32 nodes) on Wilkes, showing good scalability.]

Evaluation - MPI_Allreduce


Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, and Dhabaleswar K. Panda, "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, ” IEEE/ACM CCGrid’16.

Available in MVAPICH2-GDR 2.3a

SLIDE 64

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 65

Non-contiguous Data Exchange

  • Multi-dimensional data
– Row-based organization
– Contiguous in one dimension
– Non-contiguous in the other dimensions
  • Halo data exchange
– Duplicate the boundary
– Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring sub-domains.]

SLIDE 66

MPI Datatype support in MVAPICH2

  • Datatypes support in MPI

– Operate on customized datatypes to improve productivity – Enable MPI library to optimize non-contiguous data

At Sender:

MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
MPI_Type_commit(&new_type);
…
MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

  • Inside MVAPICH2
  • Use datatype specific CUDA Kernels to pack data in chunks
  • Efficiently move data between nodes using RDMA
  • In progress - currently optimizes vector and hindexed datatypes
  • Transparent to the user
  • H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
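To show the matching receive side of such a derived-datatype exchange, here is a small, hedged sketch: the grid layout, ranks, and the use of MPI_Sendrecv are illustrative and not taken from the slides; with device buffers, a CUDA-aware library such as MVAPICH2-GDR would pack/unpack the strided column with GPU kernels.

/* Sketch: exchanging one strided column of a row-major rows x cols array. */
#include <mpi.h>

void exchange_column(double *grid, int rows, int cols, int peer, MPI_Comm comm)
{
    MPI_Datatype column;

    /* one element per row, consecutive elements separated by `cols` */
    MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* send our last interior column, receive the peer's into our ghost column */
    MPI_Sendrecv(&grid[cols - 2], 1, column, peer, 0,
                 &grid[cols - 1], 1, column, peer, 0,
                 comm, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
}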

SLIDE 67

MPI Datatype Processing (Computation Optimization )

  • Comprehensive support
  • Targeted kernels for regular datatypes - vector, subarray, indexed_block
  • Generic kernels for all other irregular datatypes
  • Separate non-blocking stream for kernels launched by MPI library
  • Avoids stream conflicts with application kernels
  • Flexible set of parameters for users to tune kernels
  • Vector
  • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
  • MV2_CUDA_KERNEL_VECTOR_YSIZE
  • Subarray
  • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
  • MV2_CUDA_KERNEL_SUBARR_XDIM
  • MV2_CUDA_KERNEL_SUBARR_YDIM
  • MV2_CUDA_KERNEL_SUBARR_ZDIM
  • Indexed_block
  • MV2_CUDA_KERNEL_IDXBLK_XDIM
SLIDE 68

MPI Datatype Processing (Communication Optimization)

Common Scenario:
    MPI_Isend(Buf1, ..., req1);
    MPI_Isend(Buf2, ..., req2);
    Application work on the CPU/GPU
    MPI_Waitall(requests, …);

*Buf1, Buf2, … contain non-contiguous MPI Datatypes

Waste of computing resources on CPU and GPU

SLIDE 69

  • Modified ‘CUDA-Aware’ DDTBench for NAS_MG_y

– Up to 90% overlap between datatype processing and other computation

[Figure: overlap (%) for input sizes [32x16x16], [128x64x64], [256x128x128], and [512x256x256], comparing Default, Event-based, and Callback-based designs.]

  • C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

MPI Datatype Processing (Communication Optimization )

Available in MVAPICH2-GDR 2.3a

SLIDE 70

  • MVAPICH2/MVAPICH2-X

– Job Startup – Point-to-point Communication – Remote Memory Access (RMA) – Collective Communication

  • MVAPICH2-GDR

– Support for InfiniBand Core-Direct – GPU-kernel based Reduction – Datatype Processing

  • Deep Learning Application: OSU Caffe

Presentation Outline

SLIDE 71

  • Deep Learning frameworks are a different game altogether
– Unusually large message sizes (on the order of megabytes)
– Most communication based on GPU buffers
  • Existing state of the art
– cuDNN, cuBLAS, NCCL --> scale-up performance
– NCCL2, CUDA-Aware MPI --> scale-out performance
  • For small and medium message sizes only!
  • Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
– Efficient overlap of computation and communication
– Efficient large-message communication (reductions)
– What application co-designs are needed to exploit communication-runtime co-designs?

Deep Learning: New Challenges for MPI Runtimes

[Figure: communication substrates placed on scale-up vs. scale-out performance axes (cuDNN, cuBLAS, NCCL, NCCL2, gRPC, Hadoop, MPI), with the proposed co-designs targeting both.]

  • A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)


SLIDE 72

  • To address the limitations of Caffe and existing MPI runtimes, we propose the OSU-Caffe (S-Caffe) framework
  • At the application (DL framework) level
– Develop a fine-grained workflow, i.e., layer-wise communication instead of communicating the entire model
  • At the runtime (MPI) level
– Develop support to perform reduction of very large GPU buffers
– Perform the reduction using GPU kernels

OSU-Caffe: Proposed Co-Design Overview

OSU-Caffe is available from the HiDL project page http://hidl.cse.ohio-state.edu

SLIDE 73

Optimized Data Propagation and Gradient Aggregation using NBC Designs

  • Exploit Non-Blocking Collective (NBC) operations in MPI-3
– Divide communication into fine-grained steps
– Overlap computation of layer "i" with communication of layer "i+1"
– MPI_Ibcast to post all communication in advance (a sketch of this layer-wise pattern follows below)
  • Wait in an on-demand fashion
– Allows runtime selection of the data propagation design
  • Based on message (DL model) size, number of GPUs, and number of nodes
  • Co-design gradient aggregation at the application level
– Helper-thread based approach to realize a non-blocking MPI_Reduce

  • A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, PPoPP '17
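The following is a minimal illustration of the layer-wise "post in advance, wait on demand" pattern described above; it is not the S-Caffe code, and num_layers, param[], count[], MAX_LAYERS, and compute_layer() are placeholders.

/* Sketch: post an MPI_Ibcast per layer up front, then overlap computation of
 * layer i with the still-pending broadcasts of later layers. */
#include <mpi.h>

#define MAX_LAYERS 64                     /* illustrative bound */
extern void compute_layer(int layer);     /* placeholder for per-layer compute */

void forward_pass(float **param, const int *count, int num_layers,
                  int root, MPI_Comm comm)
{
    MPI_Request req[MAX_LAYERS];

    /* post all layer-wise broadcasts in advance */
    for (int i = 0; i < num_layers; i++)
        MPI_Ibcast(param[i], count[i], MPI_FLOAT, root, comm, &req[i]);

    /* wait on demand: complete layer i just before its compute, while the
     * broadcasts for layers i+1, i+2, ... continue in the background */
    for (int i = 0; i < num_layers; i++) {
        MPI_Wait(&req[i], MPI_STATUS_IGNORE);
        compute_layer(i);
    }
}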

SLIDE 74

S-Caffe vs. Inspur-Caffe and Microsoft CNTK

  • AlexNet: notoriously hard to scale out on multiple nodes due to communication overhead!
– Large number of parameters, ~64 million (communication buffer size = 256 MB)
  • GoogLeNet is a popular DNN
– 13 million parameters (communication buffer size = ~50 MB)
  • S-Caffe delivers better or comparable performance than other multi-node capable DL frameworks
– Up to 14% improvement (scale-up)

SLIDE 75

  • Exploiting overlap between computation and communication is significant in HPC
  • Presented some of the approaches and results along these directions taken by the MVAPICH2 and MVAPICH2-GDR libraries
  • These designs allow applications to take advantage of the overlap capabilities
  • As exascale systems become more complicated in their architectures, solutions exploiting overlap capabilities will be important

Concluding Remarks

SLIDE 76

Funding Acknowledgments

Funding Support by Equipment Support by

SLIDE 77

Personnel Acknowledgments

Current Students (Graduate)

  • A. Awan (Ph.D.)

  • R. Biswas (M.S.)

  • M. Bayatpour (Ph.D.)

  • S. Chakraborty (Ph.D.)

  • C.-H. Chu (Ph.D.)

  • S. Guganani (Ph.D.)

Past Students

  • A. Augustine (M.S.)

  • P. Balaji (Ph.D.)

  • S. Bhagvat (M.S.)

  • A. Bhat (M.S.)

  • D. Buntinas (Ph.D.)

  • L. Chai (Ph.D.)

  • B. Chandrasekharan (M.S.)

  • N. Dandapanthula (M.S.)

  • V. Dhanraj (M.S.)

  • T. Gangadharappa (M.S.)

  • K. Gopalakrishnan (M.S.)

  • R. Rajachandrasekar (Ph.D.)

  • G. Santhanaraman (Ph.D.)

  • A. Singh (Ph.D.)

  • J. Sridhar (M.S.)

  • S. Sur (Ph.D.)

  • H. Subramoni (Ph.D.)

  • K. Vaidyanathan (Ph.D.)

  • A. Vishnu (Ph.D.)

  • J. Wu (Ph.D.)

  • W. Yu (Ph.D.)

Past Research Scientist

  • K. Hamidouche

  • S. Sur

Past Post-Docs

  • D. Banerjee

  • X. Besseron

  • H.-W. Jin

  • W. Huang (Ph.D.)

  • W. Jiang (M.S.)

  • J. Jose (Ph.D.)

  • S. Kini (M.S.)

  • M. Koop (Ph.D.)

  • K. Kulkarni (M.S.)

  • R. Kumar (M.S.)

  • S. Krishnamoorthy (M.S.)

  • K. Kandalla (Ph.D.)

  • M. Li (Ph.D.)

  • P. Lai (M.S.)

  • J. Liu (Ph.D.)

  • M. Luo (Ph.D.)

  • A. Mamidala (Ph.D.)

  • G. Marsh (M.S.)

  • V. Meshram (M.S.)

  • A. Moody (M.S.)

  • S. Naravula (Ph.D.)

  • R. Noronha (Ph.D.)

  • X. Ouyang (Ph.D.)

  • S. Pai (M.S.)

  • S. Potluri (Ph.D.)

  • J. Hashmi (Ph.D.)

  • H. Javed (Ph.D.)

  • P. Kousha (Ph.D.)

  • D. Shankar (Ph.D.)

  • H. Shi (Ph.D.)

  • J. Zhang (Ph.D.)

  • J. Lin

  • M. Luo

  • E. Mancini

Current Research Scientists

  • X. Lu

  • H. Subramoni

Past Programmers

  • D. Bureddy

  • J. Perkins

Current Research Specialist

  • J. Smith

  • M. Arnold

  • S. Marcarelli

  • J. Vienne

  • H. Wang

Current Post-doc

  • A. Ruhela

  • K. Manian

Current Students (Undergraduate)

  • N. Sarkauskas (B.S.)
SLIDE 78

Upcoming 6th Annual MVAPICH User Group (MUG) Meeting

  • August 6-8, 2018; Columbus, Ohio, USA
  • Keynote talks, invited talks, contributed presentations, and tutorials on MVAPICH2, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-Virt, OSU INAM, and high-performance deep learning optimization and tuning

  • Student Travel Award
  • More details at:

http://mug.mvapich.cse.ohio-state.edu

SLIDE 79

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

panda@cse.ohio-state.edu

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/