Design Challenges of High-Performance and Scalable MPI over InfiniBand


  1. Design Challenges of High-Performance and Scalable MPI over InfiniBand
  Presented by Karthik

  2. Presentation Overview
  • In-depth analysis of High-Performance and Scalable MPI with Reduced Memory Usage
  • Zero-Copy Protocol using Unreliable Datagram
  • MVAPICH-Aptus: A Scalable High-Performance Multi-Transport MPI over InfiniBand

  3. High-Performance and Scalable MPI with Reduced Memory Usage
  Motivation
  • Does aggressively reducing communication buffer memory lead to degradation of end-application performance?
  • How much memory can we expect the MPI library to consume during execution of a typical application, while still providing the best available performance?

  4. High-Performance and Scalable MPI with Reduced Memory Usage
  IB provides several types of transport services (the transport type is chosen at QP creation; see the sketch after this list):
  • Reliable Connection (RC)
    - Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
    - Most feature-rich: supports RDMA and provides reliable service.
    - A dedicated QP must be created for each communicating peer.
  • Reliable Datagram (RD)
    - Most of the same features as RC, but a dedicated QP is not required.
    - Not implemented by current hardware.
  • Unreliable Connection (UC)
    - Provides RDMA capability.
    - No guarantees on ordering or reliability.
    - A dedicated QP must be created for each communicating peer.
  • Unreliable Datagram (UD)
    - Connection-less: a single QP can communicate with any other peer QP.
    - Limited message size.
    - No guarantees on ordering or reliability.
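  The transport type is fixed when a queue pair is created through the verbs API. Below is a minimal sketch, assuming libibverbs with an already-created protection domain and completion queue; the queue depths are illustrative and error handling is omitted.

    #include <infiniband/verbs.h>

    /* Create a QP of the requested transport type. RC requires one such
     * QP per communicating peer; a single UD QP can reach any peer QP. */
    struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                             enum ibv_qp_type type /* IBV_QPT_RC or IBV_QPT_UD */)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap = {
                .max_send_wr  = 128,   /* illustrative queue depths */
                .max_recv_wr  = 128,
                .max_send_sge = 1,
                .max_recv_sge = 1,
            },
            .qp_type = type,
        };
        return ibv_create_qp(pd, &attr);   /* NULL on failure */
    }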

  5. High-Performance and Scalable MPI with Reduced Memory Usage
  Upper-level software service: Shared Receive Queue (SRQ)
  - Allows multiple QPs to be attached to one receive queue (even for connection-oriented transports); see the sketch below.
  - This approach is memory-efficient, since receive buffers are pooled rather than allocated per connection.
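  The verbs API exposes the SRQ directly: a QP is bound to an SRQ at creation time and then draws its receive buffers from it. A minimal sketch, assuming libibverbs with an existing protection domain and completion queue; queue depths are illustrative and error handling is omitted.

    #include <infiniband/verbs.h>

    /* Create one SRQ and attach a QP to it. Any number of QPs can be
     * attached the same way, so receive buffers are posted once, to the
     * SRQ, instead of once per connection. */
    struct ibv_qp *create_qp_on_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                                    struct ibv_srq **srq_out)
    {
        struct ibv_srq_init_attr srq_attr = {
            .attr = { .max_wr = 4096, .max_sge = 1, .srq_limit = 0 },
        };
        struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);

        struct ibv_qp_init_attr qp_attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .srq     = srq,   /* this QP now receives from the shared queue */
            .cap     = { .max_send_wr = 128, .max_send_sge = 1 },
            .qp_type = IBV_QPT_RC,
        };
        *srq_out = srq;
        return ibv_create_qp(pd, &qp_attr);
    }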

  6. High-Performance and Scalable MPI with Reduced Memory Usage
  Remote Direct Memory Access (RDMA)
  - An application can directly access the memory of a remote process (sketched below).
  - RDMA has very low latency.
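  For illustration, this is what a one-sided transfer looks like at the verbs level: the initiator supplies the remote virtual address and rkey (exchanged out of band beforehand), and the data lands in the remote buffer without involving the remote CPU. A sketch assuming libibverbs and an already-registered local buffer.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Post an RDMA write from a registered local buffer to a remote one. */
    int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len,
                   uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_WRITE,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED,   /* ask for a completion */
            .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
        };
        struct ibv_send_wr *bad;
        return ibv_post_send(qp, &wr, &bad);   /* 0 on success */
    }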

  7. High-Performance and Scalable MPI with Reduced Memory Usage
  MVAPICH Design Overview: MVAPICH uses two major protocols (a selection sketch follows below):
  1. Eager Protocol
    - Used to transfer small messages.
    - Messages are buffered inside the MPI library.
    - Pre-allocated communication buffers are required on both the sender and receiver side.
  2. Rendezvous Protocol
    - Used to transfer large messages.
    - Messages are sent directly to the receiver's user memory.
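  A minimal sketch of the protocol split. The threshold value and the two helper functions are hypothetical, used only to make the control flow concrete; MVAPICH's actual constants and internals differ.

    #include <stddef.h>

    #define EAGER_THRESHOLD (8 * 1024)   /* assumed switch-over point */

    /* Hypothetical helpers standing in for the two protocol paths. */
    void eager_send(const void *buf, size_t len, int dest);
    void rendezvous_send(const void *buf, size_t len, int dest);

    void mpi_send_internal(const void *buf, size_t len, int dest)
    {
        if (len <= EAGER_THRESHOLD) {
            /* Eager: copy into a pre-allocated library buffer; the
             * receiver buffers it until the matching recv is posted. */
            eager_send(buf, len, dest);
        } else {
            /* Rendezvous: handshake first, then move the payload
             * directly into the receiver's user buffer. */
            rendezvous_send(buf, len, dest);
        }
    }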

  8. High-Performance and Scalable MPI with Reduced Memory Usage
  1. Adaptive RDMA with Send/Receive (ARDMA-SR)
  - To avoid a memory-scalability problem as the number of nodes increases, this channel is adaptive (see the sketch below).
  - Limited buffers are allocated initially.
  - Once a threshold number of messages has been exchanged with a peer, subsequent messages are transferred using RDMA.
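  A sketch of the adaptive upgrade, with hypothetical names and threshold: every peer starts on the cheap send/receive path, and RDMA buffers are allocated only for peers that actually communicate often.

    #include <stddef.h>

    #define RDMA_UPGRADE_THRESHOLD 16   /* assumed: messages before upgrading */

    struct peer { unsigned msg_count; int rdma_ready; };

    /* Hypothetical helpers for the two channel paths. */
    void setup_rdma_channel(struct peer *p);
    void rdma_channel_send(struct peer *p, const void *buf, size_t len);
    void send_recv_channel_send(struct peer *p, const void *buf, size_t len);

    void channel_send(struct peer *p, const void *buf, size_t len)
    {
        if (!p->rdma_ready && ++p->msg_count >= RDMA_UPGRADE_THRESHOLD) {
            setup_rdma_channel(p);   /* buffers only for frequent peers */
            p->rdma_ready = 1;
        }
        if (p->rdma_ready)
            rdma_channel_send(p, buf, len);
        else
            send_recv_channel_send(p, buf, len);   /* low-memory default */
    }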

  9. High-Performance and Scalable MPI with Reduced Memory Usage
  2. Adaptive RDMA with SRQ Channel (ARDMA-SRQ)
  - The idea is based on ARDMA-SR; the only difference is that a Shared Receive Queue is used.
  - Drawback: the sender does not know the receiver's buffer availability.
  - Solution: set a "low-watermark" on the SRQ, so the receiver reposts buffers before the queue runs dry (sketched below).
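  The verbs API supports such a low-watermark directly: an SRQ can be armed with a limit, and the HCA raises an asynchronous event once the number of posted receives falls below it. A sketch assuming libibverbs; the repost helper and watermark value are illustrative, and the limit event is one-shot, so it must be re-armed after each firing.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    void repost_receive_buffers(struct ibv_srq *srq);   /* hypothetical */

    /* Arm the SRQ so IBV_EVENT_SRQ_LIMIT_REACHED fires when fewer than
     * `watermark` receive buffers remain posted. */
    int arm_srq_watermark(struct ibv_srq *srq, uint32_t watermark)
    {
        struct ibv_srq_attr attr = { .srq_limit = watermark };
        return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
    }

    void srq_event_loop(struct ibv_context *ctx, struct ibv_srq *srq)
    {
        struct ibv_async_event ev;
        while (ibv_get_async_event(ctx, &ev) == 0) {
            if (ev.event_type == IBV_EVENT_SRQ_LIMIT_REACHED) {
                repost_receive_buffers(srq);
                arm_srq_watermark(srq, 64);   /* re-arm the one-shot limit */
            }
            ibv_ack_async_event(&ev);
        }
    }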

  10. High-Performance and Scalable MPI with Reduced Memory Usage
  3. Shared Receive Queue Channel
  - This channel exclusively utilizes the SRQ feature.
  - It follows the same "low-watermark" technique as ARDMA-SRQ.
  - Even though the RDMA channels have lower latency, they consume more memory.

  11. High-Performance and Scalable MPI with Reduced Memory Usage
  NAS Parallel Benchmarks (results figure)

  12. High-Performance and Scalable MPI with Reduced Memory Usage
  High Performance Linpack (HPL)
  - Benchmark that solves a dense system of linear equations.
  - Used as the primary measure for ranking the biannual Top500 list of the world's fastest supercomputers.

  13. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

  14. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Motivation
  1. Performance Scalability
    - Memory copies are detrimental to the overall performance of the application.
    - The HCA cache can only hold a limited number of QPs.
  2. Resource Scalability
    - With a connection-oriented transport, memory requirements increase linearly with the number of connected processes.

  15. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Traditional Zero-Copy
  1. Matched Queues Interface
    - The receiver deciphers the message tag from the sent message and matches it against the posted receive operations.
  2. Rendezvous Protocol using RDMA
    - A handshake protocol runs first, followed by RDMA of the payload.

  16. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  UD vs. RC memory usage, for 16K connections:
  - UD: 40 MB/process
  - RC: 240 MB/process

  17. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Challenges for a true zero-copy design
  • Limited MTU size
    - The UD transport has a Maximum Transfer Unit (MTU) limit of 2 KB.
    - Segmentation is required.
  • Lack of dedicated receive buffers
    - It is difficult to post receive buffers for a particular peer, as they are all shared.
    - If no buffer is posted to a QP, a sent message is silently dropped.
  • Lack of reliability
    - There is no guarantee that a message will arrive at the receiver.
  • Lack of ordering
    - Messages may not arrive in the order they were sent.
  • Lack of RDMA
    - RDMA only works for connection-oriented transports.

  18. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Proposed Design
  - The design is based on serialized communication, since RDMA is not specified for the UD transport.
  - Serialized means that the order of transfer is agreed upon beforehand, and only one sender transmits to a given QP at a time.

  19. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Solutions to the design challenges (a segmentation sketch follows below)
  1. Efficient Segmentation
    - The design requests a completion signal only for the last packet.
    - The underlying reliability layer marks packets as missing at the receiver's end, and the sender is notified.
  2. Zero-Copy Pool
    - A pool of QPs is maintained.
    - When a message transfer is initiated, a QP is taken from the pool and the application receive buffer is posted to it.
  3. Optimized Reliability and Ordering for Large Messages
    - One approach is to perform a checksum over the entire receive buffer.
    - Each operation can specify a 32-bit immediate field that is available to the receiver as part of the completion entry.
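  A sketch of the segmentation idea from point 1, assuming libibverbs, a registered buffer, and a pre-resolved address handle for the peer; the 2 KB MTU comes from slide 17, everything else is illustrative. All packets are chained into one post, and only the final work request asks for a completion.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    #define UD_MTU 2048   /* UD payload limit per slide 17 */

    int send_segmented(struct ibv_qp *qp, struct ibv_mr *mr, char *buf,
                       size_t len, struct ibv_ah *ah,
                       uint32_t remote_qpn, uint32_t remote_qkey)
    {
        size_t nseg = (len + UD_MTU - 1) / UD_MTU;
        struct ibv_sge     sge[nseg];
        struct ibv_send_wr wr[nseg], *bad;

        for (size_t i = 0; i < nseg; i++) {
            size_t off  = i * UD_MTU;
            size_t left = len - off;
            sge[i] = (struct ibv_sge){
                .addr   = (uintptr_t)(buf + off),
                .length = (uint32_t)(left < UD_MTU ? left : UD_MTU),
                .lkey   = mr->lkey,
            };
            wr[i] = (struct ibv_send_wr){
                .wr_id      = i,
                .next       = (i + 1 < nseg) ? &wr[i + 1] : NULL,
                .sg_list    = &sge[i],
                .num_sge    = 1,
                .opcode     = IBV_WR_SEND,
                /* completion signal only for the last packet */
                .send_flags = (i + 1 == nseg) ? IBV_SEND_SIGNALED : 0,
                .wr.ud      = { .ah = ah, .remote_qpn = remote_qpn,
                                .remote_qkey = remote_qkey },
            };
        }
        return ibv_post_send(qp, wr, &bad);   /* 0 on success */
    }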

  20. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Experimental Evaluation: Ping-Pong Latency

  21. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram Uni-Directional Bandwidth

  22. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram Bi-Directional Bandwidth

  23. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand

  24. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Motivation: this paper seeks to address two main questions.
  1. What are the different protocols developed for MPI over IB? How well do they perform at scale?
  2. Given this knowledge, can the MPI library be designed to dynamically select protocols, optimizing for both performance and scalability?

  25. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  IB provides several types of transport services:
  • Reliable Connection (RC)
    - Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
    - Most feature-rich: supports RDMA and provides reliable service.
    - A dedicated QP must be created for each communicating peer.
  • Reliable Datagram (RD)
    - Most of the same features as RC, but a dedicated QP is not required.
    - Not implemented by current hardware.
  • Unreliable Connection (UC)
    - Provides RDMA capability.
    - No guarantees on ordering or reliability.
    - A dedicated QP must be created for each communicating peer.
  • Unreliable Datagram (UD)
    - Connection-less: a single QP can communicate with any other peer QP.
    - Limited message size.
    - No guarantees on ordering or reliability.

  26. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Message Channel: Eager Protocol Channel

  27. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Message Channel: Rendezvous Protocol Channel

  28. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Evaluation (Performance): Eager Latency

  29. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Evaluation (Performance): Uni-Directional Bandwidth

  30. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Evaluation (Scalability): Memory Usage

  31. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Evaluation (Scalability): Latency

  32. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Characteristics Summary

  33. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Overview of Design
  • As seen from the experimental results, using only one channel is not sufficient to achieve both performance and scalability.
  • The solution is to use a combination of message channels and transports, optimizing for performance as well as scalability (a selection sketch follows below).
  Design Challenges
  1. When should a channel be created?
  2. When should a channel be used?
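  A hedged sketch of what such dynamic selection could look like; the channel names and thresholds are illustrative, not Aptus's actual internals. The scalable UD channel is the default, and connection-oriented channels are created lazily, only for peers that prove to be frequent communicators, which keeps memory usage flat in the number of processes.

    #include <stddef.h>

    enum channel { CH_UD, CH_RC_SRQ, CH_RDMA_FASTPATH };

    enum channel pick_channel(unsigned msgs_to_peer, size_t msg_len)
    {
        if (msgs_to_peer < 16)        /* assumed creation threshold */
            return CH_UD;             /* scalable default channel */
        if (msg_len <= 128)           /* assumed: tiny and frequent */
            return CH_RDMA_FASTPATH;  /* lowest latency, most memory */
        return CH_RC_SRQ;             /* fast, still memory-lean */
    }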

  34. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Allocation
