Benchmarks and Middleware for Designing Convergent HPC, Big Data and - PowerPoint PPT Presentation

Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale Systems Keynote Talk at Bench ’19 Conference by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Follow us on https://twitter.com/mvapich

High-End Computing (HEC): PetaFlop to ExaFlop 100 PFlops in 2017 149 PFlops in 2018 1 EFlops in 2020-2021? Expected to have an ExaFlop system in 2020-2021! Bench ‘19 2 Network Based Computing Laboratory

Increasing Usage of HPC, Big Data and Deep Learning
• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
Convergence of HPC, Big Data, and Deep Learning!
Increasing Need to Run these applications on the Cloud!!

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure? [Figure label: Physical Compute]

Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure? [Figure labels: Hadoop Job, Spark Job, Deep Learning Job]

Presentation Overview
• MVAPICH Project
– MPI and PGAS Library with CUDA-Awareness
• HiBD Project
– High-Performance Big Data Analytics Library
• HiDL Project
– High-Performance Deep Learning
• Public Cloud Deployment
– Microsoft-Azure and Amazon-AWS
• Conclusions

Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 614,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘18 ranking)
  • 3rd, 10,649,600 cores (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
  • 5th, 448,448 cores (Frontera) at TACC
  • 8th, 391,680 cores (ABCI) in Japan
  • 15th, 570,020 cores (Nurion) in South Korea
  • and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
Partner in the TACC Frontera System

MVAPICH2 Release Timeline and Downloads
[Figure: cumulative Number of Downloads (0 to 600,000) plotted against Timeline (Sep-04 to Apr-19), annotated with releases: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2 Virt 2.2, OSU INAM 0.9.3, MV2-X 2.3 rc2, MV2-GDR 2.3.2, MV2 2.3.2, MV2-Azure 2.3.2, MV2-AWS 2.3]

Architecture of MVAPICH2 Software Family
• High Performance Parallel Programming Models
– Message Passing Interface (MPI)
– PGAS (UPC, OpenSHMEM, CAF, UPC++)
– Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime: Diverse APIs and Mechanisms
– Point-to-point Primitives
– Collectives Algorithms
– Job Startup
– Energy-Awareness
– Remote Memory Access
– I/O and File Systems
– Fault Tolerance
– Virtualization
– Active Messages
– Introspection & Analysis
• Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter)
– Transport Protocols: RC, SRD, UD, DC
– Modern Features: UMR, ODP, SR-IOV, Multi Rail
• Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)
– Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM
– Modern Features: Optane *, NVLink, CAPI *
* Upcoming

MVAPICH2 Software Family (Requirements: Library)
• MPI with IB, iWARP, Omni-Path, and RoCE: MVAPICH2
• Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE: MVAPICH2-X
• MPI with IB, RoCE & GPU and Support for Deep Learning: MVAPICH2-GDR
• HPC Cloud with MPI & IB: MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP and RoCE: MVAPICH2-EA
• MPI Energy Monitoring Tool: OEMT
• InfiniBand Network Analysis and Monitoring: OSU INAM
• Microbenchmarks for Measuring MPI and PGAS Performance: OMB

Convergent Software Stacks for HPC, Big Data and Deep Learning
• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)

Need for Micro-Benchmarks to Design and Evaluate Programming Models
• Message Passing Interface (MPI) is the common programming model in scientific computing
– Has 100’s of APIs and Primitives (Point-to-point, RMA, Collectives, Datatypes, …)
• Multiple challenges for MPI developers, users, managers of HPC centers
– How to optimize the designs of these APIs on various hardware platforms and configurations?
  • Designers and developers
– How to compare the performance of an MPI library (at the API-level) across various platforms and configurations?
  • Designers, developers and users
– How to compare the performance of multiple MPI libraries (at the API-level) on a given platform and across platforms?
  • Procurement decision by managers
– How to correlate the performance from the micro-benchmark level to the overall application level?
  • Application developers and users, also beneficial for co-designs
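As a rough illustration of the API-level comparison question above, the sketch below parses two micro-benchmark result tables and reports a per-message-size ratio. It is not part of OMB or MVAPICH2. It assumes the results are plain text with `#`-prefixed header lines followed by two whitespace-separated columns (message size, measured value); the function names `parse_omb_table` and `compare_runs` are hypothetical.

```python
def parse_omb_table(text):
    """Parse 'size value' rows, skipping blank lines and '#' header lines.

    Assumption: each data row starts with an integer message size followed
    by one floating-point measurement (for example, latency in us).
    """
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        size, value = line.split()[:2]
        table[int(size)] = float(value)
    return table


def compare_runs(baseline, candidate):
    """Return {size: candidate / baseline} for sizes present in both runs."""
    common = sorted(baseline.keys() & candidate.keys())
    return {s: candidate[s] / baseline[s] for s in common if baseline[s] > 0}
```

For a latency table, a ratio below 1.0 at a given message size means the candidate run was faster than the baseline at that size. Only sizes measured in both runs are compared, so tables with different size ranges can still be lined up.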

OSU Micro-Benchmarks (OMB)
• Available since 2004 (https://mvapich.cse.ohio-state.edu/benchmarks)
• Suite of microbenchmarks to study communication performance of various programming models
• Benchmarks available for the following programming models
– Message Passing Interface (MPI)
– Partitioned Global Address Space (PGAS)
  • Unified Parallel C (UPC)
  • Unified Parallel C++ (UPC++)
  • OpenSHMEM
• Benchmarks available for multiple accelerator based architectures
– Compute Unified Device Architecture (CUDA)
– OpenACC Application Program Interface
• Part of various national resource procurement suites like NERSC-8 / Trinity Benchmarks
• Continuing to add support for newer primitives and features

OSU Micro-Benchmarks (MPI): Examples and Capabilities
• Host-Based
– Point-to-point
– Collectives
  • Blocking and Non-Blocking
– Job-startup
• GPU-Based
– CUDA-aware
  • Point-to-point: Device-to-Device (DD), Device-to-Host (DH), Host-to-Device (HD)
  • Collectives
– Managed Memory
  • Point-to-point: Managed-Device-to-Managed-Device (MD-MD)
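A point-to-point latency micro-benchmark is commonly built as a ping-pong loop: one side sends a message, the other echoes it back, and one-way latency is taken as half the average round-trip time after some warm-up iterations. The sketch below shows that measurement pattern only. It is not OMB code and does not use MPI: as a stand-in it sends bytes over a local `socket.socketpair()` between two threads, so the numbers reflect the Python and OS overhead rather than any interconnect. The function names and the default iteration counts are assumptions.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from sock (a single recv may return fewer)."""
    buf = bytearray(nbytes)
    view = memoryview(buf)
    got = 0
    while got < nbytes:
        n = sock.recv_into(view[got:], nbytes - got)
        if n == 0:
            raise ConnectionError("peer closed the connection")
        got += n
    return bytes(buf)


def echo_peer(sock, nbytes, rounds):
    """The 'pong' side: send every received message straight back."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, nbytes))


def one_way_latency_us(nbytes, iters=1000, warmup=100):
    """Estimate one-way latency (us) as half the mean round-trip time."""
    a, b = socket.socketpair()
    peer = threading.Thread(target=echo_peer, args=(b, nbytes, warmup + iters))
    peer.start()
    payload = b"x" * nbytes
    start = time.perf_counter()
    try:
        for i in range(warmup + iters):
            if i == warmup:
                # Restart the clock once the warm-up rounds are done.
                start = time.perf_counter()
            a.sendall(payload)
            recv_exact(a, nbytes)
        elapsed = time.perf_counter() - start
    finally:
        a.close()
        peer.join()
        b.close()
    return elapsed / (2 * iters) * 1e6


if __name__ == "__main__":
    for size in (1, 64, 1024, 4096):
        print(f"{size:>6} bytes  {one_way_latency_us(size):8.2f} us")
```

Excluding the warm-up rounds from the timed region keeps one-time costs (connection setup, cache and buffer warm-up) out of the reported average, which is why the clock is restarted inside the loop.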

One-way Latency: MPI over IB with MVAPICH2
[Figure: two panels, Small Message Latency and Large Message Latency, each plotting Latency (us) against Message Size (bytes) for the six interconnects listed below. The small-message panel is annotated with latencies of 1.01, 1.04, 1.1, 1.11, 1.15 and 1.19 us; the mapping of these values to the individual series is not recoverable from the extracted text.]
• TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
• ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch
• ConnectX-6-HDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
