Design Challenges of High-Performance and Scalable MPI over InfiniBand


  1. Design Challenges of High-Performance and Scalable MPI over InfiniBand
  Presented by Karthik

  2. Presentation Overview
  • In-depth analysis of High-Performance and Scalable MPI with Reduced Memory Usage
  • Zero-Copy Protocol using Unreliable Datagram
  • MVAPICH-Aptus: A Scalable High-Performance Multi-Transport MPI over InfiniBand

  3. High-Performance and Scalable MPI with Reduced Memory Usage
  Motivation
  • Does aggressively reducing communication buffer memory lead to degradation of end-application performance?
  • How much memory can we expect the MPI library to consume during execution of a typical application, while still providing the best available performance?

  4. High-Performance and Scalable MPI with Reduced Memory Usage
  IB provides several types of transport services (the transport type is chosen at QP creation; see the sketch after this list):
  • Reliable Connection (RC)
    - Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
    - Most feature-rich: supports RDMA and provides reliable service.
    - A dedicated QP must be created for each communicating peer.
  • Reliable Datagram (RD)
    - Most of the same features as RC, but a dedicated QP is not required.
    - Not implemented by current hardware.
  • Unreliable Connection (UC)
    - Provides RDMA capability.
    - No guarantees on ordering or reliability.
    - A dedicated QP must be created for each communicating peer.
  • Unreliable Datagram (UD)
    - Connection-less: a single QP can communicate with any other peer QP.
    - Limited message size.
    - No guarantees on ordering or reliability.
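  The transport type is fixed when a queue pair is created through the verbs API. Below is a minimal sketch, assuming libibverbs with an already-created protection domain and completion queue; the queue depths are illustrative and error handling is omitted.

    #include <infiniband/verbs.h>

    /* Create a QP of the requested transport type. RC requires one such
     * QP per communicating peer; a single UD QP can reach any peer QP. */
    struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                             enum ibv_qp_type type /* IBV_QPT_RC or IBV_QPT_UD */)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap = {
                .max_send_wr  = 128,   /* illustrative queue depths */
                .max_recv_wr  = 128,
                .max_send_sge = 1,
                .max_recv_sge = 1,
            },
            .qp_type = type,
        };
        return ibv_create_qp(pd, &attr);   /* NULL on failure */
    }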

  5. High-Performance and Scalable MPI with Reduced Memory Usage
  Upper-level software service: Shared Receive Queue (SRQ)
  - Allows multiple QPs to be attached to one receive queue (even for connection-oriented transports); see the sketch below.
  - This approach is memory-efficient, since receive buffers are pooled rather than allocated per connection.
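  The verbs API exposes the SRQ directly: a QP is bound to an SRQ at creation time and then draws its receive buffers from it. A minimal sketch, assuming libibverbs with an existing protection domain and completion queue; queue depths are illustrative and error handling is omitted.

    #include <infiniband/verbs.h>

    /* Create one SRQ and attach a QP to it. Any number of QPs can be
     * attached the same way, so receive buffers are posted once, to the
     * SRQ, instead of once per connection. */
    struct ibv_qp *create_qp_on_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                                    struct ibv_srq **srq_out)
    {
        struct ibv_srq_init_attr srq_attr = {
            .attr = { .max_wr = 4096, .max_sge = 1, .srq_limit = 0 },
        };
        struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);

        struct ibv_qp_init_attr qp_attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .srq     = srq,   /* this QP now receives from the shared queue */
            .cap     = { .max_send_wr = 128, .max_send_sge = 1 },
            .qp_type = IBV_QPT_RC,
        };
        *srq_out = srq;
        return ibv_create_qp(pd, &qp_attr);
    }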

  6. High-Performance and Scalable MPI with Reduced Memory Usage
  Remote Direct Memory Access (RDMA)
  - An application can directly access the memory of a remote process (sketched below).
  - RDMA has very low latency.
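  For illustration, this is what a one-sided transfer looks like at the verbs level: the initiator supplies the remote virtual address and rkey (exchanged out of band beforehand), and the data lands in the remote buffer without involving the remote CPU. A sketch assuming libibverbs and an already-registered local buffer.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Post an RDMA write from a registered local buffer to a remote one. */
    int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len,
                   uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_WRITE,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED,   /* ask for a completion */
            .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
        };
        struct ibv_send_wr *bad;
        return ibv_post_send(qp, &wr, &bad);   /* 0 on success */
    }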

  7. High-Performance and Scalable MPI with Reduced Memory Usage
  MVAPICH Design Overview: MVAPICH uses two major protocols (a selection sketch follows below):
  1. Eager Protocol
    - Used to transfer small messages.
    - Messages are buffered inside the MPI library.
    - Pre-allocated communication buffers are required on both the sender and receiver side.
  2. Rendezvous Protocol
    - Used to transfer large messages.
    - Messages are sent directly to the receiver's user memory.
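  A minimal sketch of the protocol split. The threshold value and the two helper functions are hypothetical, used only to make the control flow concrete; MVAPICH's actual constants and internals differ.

    #include <stddef.h>

    #define EAGER_THRESHOLD (8 * 1024)   /* assumed switch-over point */

    /* Hypothetical helpers standing in for the two protocol paths. */
    void eager_send(const void *buf, size_t len, int dest);
    void rendezvous_send(const void *buf, size_t len, int dest);

    void mpi_send_internal(const void *buf, size_t len, int dest)
    {
        if (len <= EAGER_THRESHOLD) {
            /* Eager: copy into a pre-allocated library buffer; the
             * receiver buffers it until the matching recv is posted. */
            eager_send(buf, len, dest);
        } else {
            /* Rendezvous: handshake first, then move the payload
             * directly into the receiver's user buffer. */
            rendezvous_send(buf, len, dest);
        }
    }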

  8. High-Performance and Scalable MPI with Reduced Memory Usage
  1. Adaptive RDMA with Send/Receive (ARDMA-SR)
  - To avoid a memory-scalability problem as the number of nodes increases, this channel is adaptive (see the sketch below).
  - Limited buffers are allocated initially.
  - Once a threshold number of messages has been exchanged with a peer, subsequent messages are transferred using RDMA.
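  A sketch of the adaptive upgrade, with hypothetical names and threshold: every peer starts on the cheap send/receive path, and RDMA buffers are allocated only for peers that actually communicate often.

    #include <stddef.h>

    #define RDMA_UPGRADE_THRESHOLD 16   /* assumed: messages before upgrading */

    struct peer { unsigned msg_count; int rdma_ready; };

    /* Hypothetical helpers for the two channel paths. */
    void setup_rdma_channel(struct peer *p);
    void rdma_channel_send(struct peer *p, const void *buf, size_t len);
    void send_recv_channel_send(struct peer *p, const void *buf, size_t len);

    void channel_send(struct peer *p, const void *buf, size_t len)
    {
        if (!p->rdma_ready && ++p->msg_count >= RDMA_UPGRADE_THRESHOLD) {
            setup_rdma_channel(p);   /* buffers only for frequent peers */
            p->rdma_ready = 1;
        }
        if (p->rdma_ready)
            rdma_channel_send(p, buf, len);
        else
            send_recv_channel_send(p, buf, len);   /* low-memory default */
    }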

  9. High-Performance and Scalable MPI with Reduced Memory Usage
  2. Adaptive RDMA with SRQ Channel (ARDMA-SRQ)
  - The idea is based on ARDMA-SR; the only difference is that a Shared Receive Queue is used.
  - Drawback: the sender does not know the receiver's buffer availability.
  - Solution: set a "low-watermark" on the SRQ, so the receiver reposts buffers before the queue runs dry (sketched below).
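  The verbs API supports such a low-watermark directly: an SRQ can be armed with a limit, and the HCA raises an asynchronous event once the number of posted receives falls below it. A sketch assuming libibverbs; the repost helper and watermark value are illustrative, and the limit event is one-shot, so it must be re-armed after each firing.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    void repost_receive_buffers(struct ibv_srq *srq);   /* hypothetical */

    /* Arm the SRQ so IBV_EVENT_SRQ_LIMIT_REACHED fires when fewer than
     * `watermark` receive buffers remain posted. */
    int arm_srq_watermark(struct ibv_srq *srq, uint32_t watermark)
    {
        struct ibv_srq_attr attr = { .srq_limit = watermark };
        return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
    }

    void srq_event_loop(struct ibv_context *ctx, struct ibv_srq *srq)
    {
        struct ibv_async_event ev;
        while (ibv_get_async_event(ctx, &ev) == 0) {
            if (ev.event_type == IBV_EVENT_SRQ_LIMIT_REACHED) {
                repost_receive_buffers(srq);
                arm_srq_watermark(srq, 64);   /* re-arm the one-shot limit */
            }
            ibv_ack_async_event(&ev);
        }
    }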

  10. High-Performance and Scalable MPI with Reduced Memory Usage
  3. Shared Receive Queue Channel
  - This channel exclusively utilizes the SRQ feature.
  - It follows the same "low-watermark" technique as ARDMA-SRQ.
  - Even though the RDMA channels have lower latency, they consume more memory.

  11. High-Performance and Scalable MPI with Reduced Memory Usage
  NAS Parallel Benchmarks (results figure)

  12. High-Performance and Scalable MPI with Reduced Memory Usage
  High Performance Linpack (HPL)
  - Benchmark that solves a dense system of linear equations.
  - Used as the primary measure for ranking the biannual Top500 list of the world's fastest supercomputers.

  13. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

  14. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Motivation
  1. Performance Scalability
    - Memory copies are detrimental to the overall performance of the application.
    - The HCA cache can only hold a limited number of QPs.
  2. Resource Scalability
    - With a connection-oriented transport, memory requirements increase linearly with the number of connected processes.

  15. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Traditional Zero-Copy
  1. Matched Queues Interface
    - The receiver deciphers the message tag from the sent message and matches it against the posted receive operations.
  2. Rendezvous Protocol using RDMA
    - A handshake protocol runs first, followed by RDMA of the payload.

  16. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  UD vs. RC memory usage, for 16K connections:
  - UD: 40 MB/process
  - RC: 240 MB/process

  17. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Challenges for a true zero-copy design
  • Limited MTU size
    - The UD transport has a Maximum Transfer Unit (MTU) limit of 2 KB.
    - Segmentation is required.
  • Lack of dedicated receive buffers
    - It is difficult to post receive buffers for a particular peer, as they are all shared.
    - If no buffer is posted to a QP, a sent message is silently dropped.
  • Lack of reliability
    - There is no guarantee that a message will arrive at the receiver.
  • Lack of ordering
    - Messages may not arrive in the order they were sent.
  • Lack of RDMA
    - RDMA only works for connection-oriented transports.

  18. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Proposed Design
  - The design is based on serialized communication, since RDMA is not specified for the UD transport.
  - Serialized means that the order of transfer is agreed upon beforehand, and only one sender transmits to a given QP at a time.

  19. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Solutions to the design challenges (a segmentation sketch follows below)
  1. Efficient Segmentation
    - The design requests a completion signal only for the last packet.
    - The underlying reliability layer marks packets as missing at the receiver's end, and the sender is notified.
  2. Zero-Copy Pool
    - A pool of QPs is maintained.
    - When a message transfer is initiated, a QP is taken from the pool and the application receive buffer is posted to it.
  3. Optimized Reliability and Ordering for Large Messages
    - One approach is to perform a checksum over the entire receive buffer.
    - Each operation can specify a 32-bit immediate field that is available to the receiver as part of the completion entry.
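  A sketch of the segmentation idea from point 1, assuming libibverbs, a registered buffer, and a pre-resolved address handle for the peer; the 2 KB MTU comes from slide 17, everything else is illustrative. All packets are chained into one post, and only the final work request asks for a completion.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    #define UD_MTU 2048   /* UD payload limit per slide 17 */

    int send_segmented(struct ibv_qp *qp, struct ibv_mr *mr, char *buf,
                       size_t len, struct ibv_ah *ah,
                       uint32_t remote_qpn, uint32_t remote_qkey)
    {
        size_t nseg = (len + UD_MTU - 1) / UD_MTU;
        struct ibv_sge     sge[nseg];
        struct ibv_send_wr wr[nseg], *bad;

        for (size_t i = 0; i < nseg; i++) {
            size_t off  = i * UD_MTU;
            size_t left = len - off;
            sge[i] = (struct ibv_sge){
                .addr   = (uintptr_t)(buf + off),
                .length = (uint32_t)(left < UD_MTU ? left : UD_MTU),
                .lkey   = mr->lkey,
            };
            wr[i] = (struct ibv_send_wr){
                .wr_id      = i,
                .next       = (i + 1 < nseg) ? &wr[i + 1] : NULL,
                .sg_list    = &sge[i],
                .num_sge    = 1,
                .opcode     = IBV_WR_SEND,
                /* completion signal only for the last packet */
                .send_flags = (i + 1 == nseg) ? IBV_SEND_SIGNALED : 0,
                .wr.ud      = { .ah = ah, .remote_qpn = remote_qpn,
                                .remote_qkey = remote_qkey },
            };
        }
        return ibv_post_send(qp, wr, &bad);   /* 0 on success */
    }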

  20. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram
  Experimental Evaluation: Ping-Pong Latency

  21. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram Uni-Directional Bandwidth

  22. Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram Bi-Directional Bandwidth

  23. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand

  24. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Motivation: this paper seeks to address two main questions.
  1. What are the different protocols developed for MPI over IB? How well do they perform at scale?
  2. Given this knowledge, can the MPI library be designed to dynamically select protocols, optimizing for both performance and scalability?

  25. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  IB provides several types of transport services:
  • Reliable Connection (RC)
    - Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
    - Most feature-rich: supports RDMA and provides reliable service.
    - A dedicated QP must be created for each communicating peer.
  • Reliable Datagram (RD)
    - Most of the same features as RC, but a dedicated QP is not required.
    - Not implemented by current hardware.
  • Unreliable Connection (UC)
    - Provides RDMA capability.
    - No guarantees on ordering or reliability.
    - A dedicated QP must be created for each communicating peer.
  • Unreliable Datagram (UD)
    - Connection-less: a single QP can communicate with any other peer QP.
    - Limited message size.
    - No guarantees on ordering or reliability.

  26. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Message Channel: Eager Protocol Channel

  27. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Message Channel: Rendezvous Protocol Channel

  28. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Evaluation (Performance): Eager Latency

  29. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Evaluation (Performance): Uni-Directional Bandwidth

  30. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Evaluation (Scalability): Memory Usage

  31. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Evaluation (Scalability): Latency

  32. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Characteristics Summary

  33. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Overview of Design
  • As seen from the experimental results, using only one channel is not sufficient to achieve both performance and scalability.
  • The solution is to use a combination of message channels and transports, optimizing for performance as well as scalability (a selection sketch follows below).
  Design Challenges
  1. When should a channel be created?
  2. When should a channel be used?
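  A hedged sketch of what such dynamic selection could look like; the channel names and thresholds are illustrative, not Aptus's actual internals. The scalable UD channel is the default, and connection-oriented channels are created lazily, only for peers that prove to be frequent communicators, which keeps memory usage flat in the number of processes.

    #include <stddef.h>

    enum channel { CH_UD, CH_RC_SRQ, CH_RDMA_FASTPATH };

    enum channel pick_channel(unsigned msgs_to_peer, size_t msg_len)
    {
        if (msgs_to_peer < 16)        /* assumed creation threshold */
            return CH_UD;             /* scalable default channel */
        if (msg_len <= 128)           /* assumed: tiny and frequent */
            return CH_RDMA_FASTPATH;  /* lowest latency, most memory */
        return CH_RC_SRQ;             /* fast, still memory-lean */
    }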

  34. MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand
  Channel Allocation
