

SLIDE 1

High Performance MPI on IBM 12x InfiniBand Architecture

Abhinav Vishnu, Brad Benton1 and Dhabaleswar K. Panda

{vishnu, panda}@cse.ohio-state.edu, {brad.benton}@us.ibm.com1

SLIDE 2

Presentation Road-Map

  • Introduction and Motivation
  • Background
  • Enhanced MPI design for IBM 12x Architecture

  • Performance Evaluation
  • Conclusions and Future Work
SLIDE 3

Introduction and Motivation

  • Demand for more compute power is driven by parallel applications
    – Molecular Dynamics (NAMD), Car Crash Simulations (LS-DYNA), ...
  • Cluster sizes keep increasing to meet these demands
    – 9K processors (Sandia Thunderbird, ASCI Q)
    – Larger-scale clusters are planned using upcoming multi-core architectures
  • MPI is the primary programming model for writing these applications

SLIDE 4

Emergence of InfiniBand

  • Interconnects with very low latency and very high throughput have become available
    – InfiniBand, Myrinet, Quadrics, ...
  • InfiniBand
    – High performance and open standard
    – Advanced features
  • PCI-Express based InfiniBand adapters are becoming popular
    – 8X (1X ~ 2.5 Gbps) with Double Data Rate (DDR) support
    – MPI designs for these adapters are emerging
  • In addition to PCI-Express, GX+ I/O bus based adapters are also emerging
    – 4X and 12X link support

SLIDE 5

InfiniBand Adapters

[Figure: InfiniBand HCA block diagrams. Each HCA has an I/O bus interface to the host, an HCA chipset, and two ports (P1, P2) to the network. I/O bus options: PCI-X (4x bidirectional), PCI-Express (16x bidirectional), and GX+ (>24x bidirectional bandwidth); network links are 4x or 12x, SDR/DDR.]

MPI designs for PCI-Express based adapters are coming up; IBM 12x InfiniBand adapters on GX+ are also coming up.

SLIDE 6

Problem Statement

  • How do we design an MPI with low overhead for the IBM 12x InfiniBand architecture?
  • What are the performance benefits of the enhanced design over existing designs?
    – Point-to-point communication
    – Collective communication
    – MPI applications

SLIDE 7

Presentation Road-Map

  • Introduction and Motivation
  • Background
  • Enhanced MPI design for IBM 12x Architecture

  • Performance Evaluation
  • Conclusions and Future Work
SLIDE 8

Overview of InfiniBand

  • An interconnect technology to connect I/O nodes and processing nodes
  • InfiniBand provides multiple transport semantics
    – Reliable Connection: supports reliable notification and Remote Direct Memory Access (RDMA)
    – Unreliable Datagram: data delivery is not reliable; send/recv is supported
    – Reliable Datagram: currently not implemented by vendors
    – Unreliable Connection: notification is not supported
  • InfiniBand uses a queue pair (QP) model for data transfer (see the sketch below)
    – Send queue (for send operations)
    – Receive queue (not involved in RDMA operations)
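To make the QP model concrete, here is a minimal sketch (not from the slides) that creates a Reliable Connection QP with the OpenFabrics verbs API; the queue depths and the choice of the first device are illustrative assumptions, and error handling is abbreviated.

```c
/* Minimal sketch: create a Reliable Connection (RC) queue pair with the
 * verbs API to illustrate the QP model (send queue + receive queue). */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no HCA found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first HCA (assumption) */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,              /* completions for the send queue    */
        .recv_cq = cq,              /* completions for the receive queue */
        .qp_type = IBV_QPT_RC,      /* Reliable Connection transport     */
        .cap = {
            .max_send_wr  = 128,    /* send queue depth (illustrative)   */
            .max_recv_wr  = 128,    /* receive queue depth               */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    printf("created RC QP number 0x%x\n", qp->qp_num);

    /* The QP must still be transitioned INIT -> RTR -> RTS (ibv_modify_qp)
     * and connected to a remote QP before any data transfer can happen. */
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```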

SLIDE 9

MultiPathing Configurations

[Figure: Multipathing configurations through one or more switches: multiple adapters and multiple ports (multi-rail configurations); a combination of these is also possible. Multi-rail provides multiple send/recv engines (sketched below).]
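As a small illustration of what a multi-rail configuration looks like to software, the following sketch (an assumption-based example, not code from the talk) enumerates the HCAs and active ports that a multi-rail MPI could open QPs on.

```c
/* Sketch: enumerate candidate rails, i.e. (HCA, port) pairs, with the
 * verbs API. A multi-rail MPI would open one or more QPs on each
 * active port that is found here. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_device_attr dev_attr;
        ibv_query_device(ctx, &dev_attr);

        for (int port = 1; port <= dev_attr.phys_port_cnt; port++) {
            struct ibv_port_attr pa;
            ibv_query_port(ctx, port, &pa);
            if (pa.state == IBV_PORT_ACTIVE)     /* usable rail */
                printf("rail: %s port %d (active_width code %d)\n",
                       ibv_get_device_name(devs[i]), port, (int)pa.active_width);
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```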

SLIDE 10

Presentation Road-Map

  • Introduction and Motivation
  • Background
  • Enhanced MPI design for IBM 12x Architecture

  • Performance Evaluation
  • Conclusions and Future Work
SLIDE 11

MPI Design for 12x Architecture

[Figure: Enhanced MPI design. Components: ADI layer (eager and rendezvous protocols; point-to-point and collective communication), communication scheduler with its scheduling policies, communication marker, completion notifier and notification, and EPC using multiple QPs per port, above the InfiniBand layer.]

Jiuxing Liu, Abhinav Vishnu and Dhabaleswar K. Panda, "Building Multi-rail InfiniBand Clusters: MPI-level Design and Performance Evaluation", SuperComputing 2004.

SLIDE 12

Discussion on Scheduling Policies

  • Policies
    – Reverse Multiplexing
    – Even Striping (see the sketch after this list)
    – Binding
    – Round Robin
    – Enhanced Pt-to-Pt and Collective (EPC)
  • Overhead
    – Multiple stripes
    – Multiple completions
  • Communication types: blocking, non-blocking, and collective communication
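The sketch below illustrates even striping over multiple QPs (an illustrative example, not the MVAPICH source): the message is split into equal stripes, one RDMA-write work request is posted per QP, and the sender later reaps one completion per stripe, which is exactly the multiple-stripes/multiple-completions overhead listed above. The qps array, memory region, and remote address/rkey are assumed to have been set up and exchanged already.

```c
/* Sketch: even striping of a large message across several QPs on one port.
 * Each stripe is posted as a separate signaled RDMA-write work request. */
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

int stripe_rdma_write(struct ibv_qp **qps, int num_qps,
                      void *buf, size_t len, struct ibv_mr *mr,
                      uint64_t remote_addr, uint32_t rkey)
{
    size_t stripe = (len + num_qps - 1) / num_qps;   /* even striping */

    for (int i = 0; i < num_qps; i++) {
        size_t off = (size_t)i * stripe;
        if (off >= len)
            break;
        size_t chunk = (off + stripe > len) ? len - off : stripe;

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf + off,
            .length = (uint32_t)chunk,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = (uint64_t)i,       /* identifies the stripe      */
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED, /* one completion per stripe  */
        };
        wr.wr.rdma.remote_addr = remote_addr + off;
        wr.wr.rdma.rkey        = rkey;

        struct ibv_send_wr *bad_wr;
        if (ibv_post_send(qps[i], &wr, &bad_wr))
            return -1;
    }
    return 0;
}
```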

SLIDE 13

EPC Characteristics

  • For small messages, the round-robin policy is used
    – Striping leads to overhead for small messages
  • Policy by communication type (a selection sketch follows):

    pt-2-pt blocking        striping
    pt-2-pt non-blocking    round-robin
    collective              striping
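A small selection function capturing the policy table above (a sketch based on the slide, not the actual MVAPICH code; the small-message threshold is an assumed, tunable value):

```c
/* Sketch: choose an EPC scheduling policy from the message size and the
 * kind of operation, following the table reconstructed from the slide. */
enum epc_policy  { EPC_ROUND_ROBIN, EPC_STRIPING };
enum epc_op_kind { EPC_PT2PT_BLOCKING, EPC_PT2PT_NONBLOCKING, EPC_COLLECTIVE };

#define SMALL_MSG_THRESHOLD (8 * 1024)   /* bytes; illustrative cutoff */

enum epc_policy epc_select_policy(enum epc_op_kind kind, unsigned long nbytes)
{
    /* Small messages always go round-robin: striping them only adds
     * per-stripe posting and completion overhead. */
    if (nbytes <= SMALL_MSG_THRESHOLD)
        return EPC_ROUND_ROBIN;

    switch (kind) {
    case EPC_PT2PT_BLOCKING:    return EPC_STRIPING;     /* finish one message fast     */
    case EPC_PT2PT_NONBLOCKING: return EPC_ROUND_ROBIN;  /* overlap messages across QPs */
    case EPC_COLLECTIVE:        return EPC_STRIPING;
    }
    return EPC_ROUND_ROBIN;
}
```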

SLIDE 14

MVAPICH/MVAPICH2

  • We have used MVAPICH as our MPI framework for the enhanced design
  • MVAPICH/MVAPICH2
    – High-performance MPI-1/MPI-2 implementations over InfiniBand and iWARP
    – Has powered many supercomputers in the TOP500 supercomputing rankings
    – Currently used by more than 450 organizations (academia and industry worldwide)
    – http://nowlab.cse.ohio-state.edu/projects/mpi-iba
  • The enhanced design is available with MVAPICH
    – It will become available with MVAPICH2 in upcoming releases

SLIDE 15

Presentation Road-Map

  • Introduction and Motivation
  • Background
  • Enhanced MPI design for IBM 12x Architecture

  • Performance Evaluation
  • Conclusions and Future Work
SLIDE 16

Experimental TestBed

  • The experimental test bed consists of:
    – Power5 based systems with SLES9 SP2
    – GX+ at 950 MHz clock speed
    – 2.6.9 kernel version
    – 2.8 GHz processors with 8 GB of memory
    – TS120 switch for connecting the adapters
  • One port per adapter and one adapter are used for communication
    – The objective is to see the benefit of using only one physical port

SLIDE 17

Ping-Pong Latency Test

  • EPC adds negligible overhead to small-message latency
  • Large-message latency is reduced by 41% using EPC with the IBM 12x architecture (a benchmark sketch follows)
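For reference, a ping-pong latency test of the kind reported here can be sketched as follows (illustrative code, not the exact benchmark used; the message size and iteration count are assumptions):

```c
/* Sketch: ping-pong latency micro-benchmark. Rank 0 sends 'size' bytes to
 * rank 1 and waits for the echo; one-way latency is half the averaged
 * round-trip time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank, size = (argc > 1) ? atoi(argv[1]) : 4;   /* message size in bytes */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("%d bytes: %.2f us one-way latency\n",
               size, elapsed / ITERS / 2.0 * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```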

SLIDE 18

Small Messages Throughput

  • Unidirectional bandwidth doubles for small messages using EPC
  • Bidirectional bandwidth does not improve with an increasing number of QPs, due to the copy-bandwidth limitation (a bandwidth-test sketch follows)
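The bandwidth results on this and the next slide come from window-based throughput tests; a sketch of such a unidirectional test is shown below (illustrative, not the exact benchmark source; window size and iteration count are assumptions):

```c
/* Sketch: unidirectional bandwidth test. The sender posts a window of
 * non-blocking sends, the receiver pre-posts matching receives, and an
 * acknowledgement closes each window. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64
#define ITERS  100

int main(int argc, char **argv)
{
    int rank, size = (argc > 1) ? atoi(argv[1]) : (1 << 20);   /* bytes */
    char *buf, ack = 0;
    MPI_Request req[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)   /* MB/s, counting 10^6 bytes per MB */
        printf("%d bytes: %.1f MB/s\n", size,
               (double)size * WINDOW * ITERS / elapsed / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```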

SLIDE 19

Large Messages Throughput

  • EPC improves the unidirectional and bidirectional throughput significantly for medium-size messages
  • We achieve a peak unidirectional bandwidth of 2731 MB/s and a peak bidirectional bandwidth of 5421 MB/s

SLIDE 20

Collective Communication

  • MPI_Alltoall shows significant benefits for large messages
  • MPI_Bcast shows more benefit for very large messages (a timing sketch follows)
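A collective timing loop of the kind behind these numbers can be sketched as follows (illustrative; MPI_Bcast would be timed the same way, and the message size is an assumed parameter):

```c
/* Sketch: time MPI_Alltoall for one message size and report the slowest
 * rank's average time per call. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 100

int main(int argc, char **argv)
{
    int rank, nprocs, size = (argc > 1) ? atoi(argv[1]) : (256 * 1024);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sendbuf = malloc((size_t)size * nprocs);
    char *recvbuf = malloc((size_t)size * nprocs);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Alltoall(sendbuf, size, MPI_CHAR,
                     recvbuf, size, MPI_CHAR, MPI_COMM_WORLD);
    double local = (MPI_Wtime() - start) / ITERS, max;

    MPI_Reduce(&local, &max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Alltoall, %d bytes per pair: %.3f ms\n", size, max * 1e3);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```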

SLIDE 21

NAS Parallel Benchmarks

  • For Class A and Class B problem sizes, the x1 configuration shows improvement
  • There is no degradation for the other configurations on Fourier Transform

SLIDE 22

NAS Parallel Benchmarks

  • Integer Sort shows 7-11% improvement for x1 configurations
  • The other NAS Parallel Benchmarks show no performance degradation

SLIDE 23

Presentation Road-Map

  • Introduction and Motivation
  • Background
  • Enhanced MPI design for IBM 12x Architecture

  • Performance Evaluation
  • Conclusions and Future Work
SLIDE 24

Conclusions

  • We presented an enhanced design for the IBM 12x InfiniBand architecture
    – EPC (Enhanced Point-to-Point and Collective communication)
  • We have implemented our design and evaluated it with micro-benchmarks, collectives, and MPI application kernels
  • IBM 12x HCAs can significantly improve communication performance
    – 41% for the ping-pong latency test
    – 63-65% for the unidirectional and bidirectional bandwidth tests
    – 7-13% improvement for the NAS Parallel Benchmarks
    – Peak unidirectional and bidirectional bandwidths of 2731 MB/s and 5421 MB/s, respectively

SLIDE 25

Future Directions

  • We plan to evaluate EPC with multi-rail configurations on upcoming multi-core systems
    – Multi-port configurations
    – Multi-HCA configurations
  • Scalability studies of using multiple QPs on large-scale clusters
    – Impact of QP caching
    – Network fault tolerance

SLIDE 26

Acknowledgements

Our research is supported by the following organizations

  • Current Funding support by
  • Current Equipment support by
SLIDE 27

Web Pointers

http://nowlab.cse.ohio-state.edu/
MVAPICH web page: http://mvapich.cse.ohio-state.edu
E-mail: {vishnu, panda}@cse.ohio-state.edu, brad.benton@us.ibm.com