
A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast
T. Hoefler, C. Siebert, W. Rehm
Open Systems Lab, Indiana University, Bloomington, USA
Computer Architecture Group, Chemnitz University of Technology, Chemnitz, Germany
IPDPS’07 - CAC’07 Workshop, Long Beach, CA, USA, 26th March 2007
T. Hoefler, C. Siebert, W. Rehm Constant-time Multicast

Introduction
- MPI is (still) the de-facto standard in parallel programming
- systems are going to extreme scale
- applications start to use high scalability
- collective operations are an important tool
- scalable collective operations are very important
Our approach: use special hardware features to improve scalability of collective operations.

Traditional Approach
- ensure scalability with O(log2 P) algorithms
- optimized implementations available for different collectives
- looks promising, but:
  - grows fast for small process-counts (e.g., 256 processes need t = 8 · t_send)
  - processes are skewed by the algorithm (e.g., node 1 leaves the tree faster than node 7)
[Figure: broadcast tree over nodes 0 to 7 in three rounds (round 1: node 1; round 2: nodes 3, 2; round 3: nodes 7, 5, 6, 4)]
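The round count quoted for 256 processes can be checked with a small sketch. It assumes a binomial-tree broadcast rooted at rank 0, which is consistent with the node numbering in the figure; the function names are illustrative and not taken from the talk.

```python
def rounds_needed(num_procs):
    """Number of rounds of a binomial-tree broadcast: ceil(log2(P))."""
    return (num_procs - 1).bit_length()


def receive_round(num_procs):
    """Map each rank to the round in which it first holds the message.

    Assumption: rank 0 is the root, and in round r every rank that already
    holds the message sends it to rank + 2**(r - 1).
    """
    holders = {0: 0}
    for rnd in range(1, rounds_needed(num_procs) + 1):
        for src in list(holders):
            dst = src + 2 ** (rnd - 1)
            if dst < num_procs and dst not in holders:
                holders[dst] = rnd
    return holders


if __name__ == "__main__":
    print(rounds_needed(256))   # rounds for 256 processes
    print(receive_round(8))     # receive round of each rank for P = 8
```

Under this assumption, rank 1 holds the message after the first round while rank 7 only receives it in the last one, which is the kind of skew the slide refers to.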

Multicast Support
Multicast characteristics
- unreliable
- no guaranteed in-order delivery
- datagrams limited in size (MTU)
- MC groups must be network-wide unique
MPI Interface
- reliable transmission
- virtually unlimited message size
- multiple independent MPI jobs on a single network

Traditional Approaches to Ensure Reliability
ACK Schemes
- linear ACK: hot-spot problems
- tree-based ACK: high latency
- co-root scheme: combination of both, similar problems
- every (co-)root waits for last process in its group
- retransmission timeout
NACK Schemes
- topologies similar to ACK
- root has to wait for some time (or save the message buffer)
- timeout very hard to determine and not reliable
- synchronization problems (delayed processes?)

A new Approach
The new algorithm
- two-stage approach
- packets are fragmented to the MTU
- first stage sends fragmented message via Multicast
- processes that received the fragment correctly become new root
- second stage performs a reliable ring-broadcast
⇒ highest possible parallelism
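The two stages can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that after the multicast stage every rank holding a fragment forwards it to its ring successor, so a rank that missed the multicast needs as many ring hops as its distance to the nearest predecessor that received it. The names `fragment` and `ring_stage_hops` are hypothetical.

```python
def fragment(message, mtu):
    """Split a message into fragments of at most `mtu` bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]


def ring_stage_hops(size, received):
    """Ring hops each rank needs in stage two to obtain one fragment.

    `received` is the set of ranks that hold the fragment after the
    multicast stage (the root always does). Assumption: the fragment is
    passed reliably from each rank to rank + 1 (mod size).
    """
    if not any(0 <= r < size for r in received):
        raise ValueError("at least one rank must hold the fragment")
    hops = {}
    for rank in range(size):
        distance = 0
        while (rank - distance) % size not in received:
            distance += 1
        hops[rank] = distance
    return hops


if __name__ == "__main__":
    print(fragment(b"abcdefghij", 4))
    # Ranks 3, 4 and 7 missed the multicast in this made-up scenario.
    print(ring_stage_hops(8, {0, 1, 2, 5, 6}))
```

In this model the second stage finishes after as many hops as the longest run of consecutive ranks that missed the multicast, so a low loss rate keeps it short.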

The algorithm
[Figure: nodes 1 to 8 arranged in a ring]

Multicast Group Management
- problematic if multiple MPI jobs run in a subnet
- ideal solution: MADCAP for InfiniBand™
  - does not exist (subnet-manager?)
- select MCGID randomly
  - carefully seeded cryptographically secure pseudorandom number generator (Blum-Blum-Shub)
  - 112 bit address space
  - collision probability for 1000 groups: 10^-18
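The random selection can be sketched with a toy Blum-Blum-Shub generator that draws 112 bits. The modulus below is an assumption chosen for illustration and is far too small to be secure; how the generator is seeded in the actual implementation is not stated on the slide, and the function names are hypothetical.

```python
from math import gcd

# Toy parameters for illustration only: both primes are congruent to 3 mod 4,
# as Blum-Blum-Shub requires, but they are much too small for real use.
P = 2**31 - 1
Q = 2**61 - 1
M = P * Q


def bbs_bits(seed, count):
    """Return `count` pseudorandom bits (as an int) from Blum-Blum-Shub."""
    if seed % M in (0, 1) or gcd(seed, M) != 1:
        raise ValueError("seed must be coprime to M and not 0 or 1 mod M")
    x = (seed * seed) % M
    bits = 0
    for _ in range(count):
        x = (x * x) % M              # x_{n+1} = x_n^2 mod M
        bits = (bits << 1) | (x & 1)  # keep the least significant bit
    return bits


def random_mcgid_bits(seed):
    """Draw the 112 freely selectable bits of a multicast group ID."""
    return bbs_bits(seed, 112)


if __name__ == "__main__":
    print(hex(random_mcgid_bits(123456789)))
```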

Packet Format
[Figure: packet layout: Data (Payload) | Sequence Number | BID | CRC-32]
Fields
- Sequence Number: number of fragment
- BID: Broadcast Identifier
- CRC: (optional) checksum
packet error rate: 0.287%
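A minimal sketch of such a packet, using Python's `struct` and `zlib` modules. The field widths (32 bits each), the network byte order, and the choice to cover payload, sequence number and BID with the checksum are assumptions; the slide only gives the field order.

```python
import struct
import zlib

# Assumed trailer layout: sequence number, BID, CRC-32 (32 bits each).
TRAILER = struct.Struct("!III")


def pack_fragment(payload, seq, bid):
    """Append sequence number, broadcast ID and CRC-32 to the payload."""
    crc = zlib.crc32(payload + struct.pack("!II", seq, bid))
    return payload + TRAILER.pack(seq, bid, crc)


def unpack_fragment(packet):
    """Return (payload, seq, bid), or None if the checksum does not match."""
    if len(packet) < TRAILER.size:
        raise ValueError("packet shorter than trailer")
    payload = packet[:-TRAILER.size]
    seq, bid, crc = TRAILER.unpack(packet[-TRAILER.size:])
    if zlib.crc32(payload + struct.pack("!II", seq, bid)) != crc:
        return None
    return payload, seq, bid


if __name__ == "__main__":
    pkt = pack_fragment(b"hello", seq=3, bid=42)
    print(unpack_fragment(pkt))
```

A receiver that gets `None` would treat the fragment as lost and rely on the reliable second stage to obtain it.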

Implementation
- implemented as collv1 component
- MCGID is selected per communicator
- one UD QP per communicator (scalable)
- n pre-posted RRs on this QP (selectable, default 5)
- falls back to “tuned” for small communicators/large messages
- API-independent macro layer for OFED/MVAPI

Performance Results
Benchmark Environment
- odin cluster at Indiana University
- 128 InfiniBand™ nodes
- 2 GHz dual-core AMD Opteron(tm) processor 270
→ 1-byte IMB latency
[Plot: time in microseconds vs. communicator size (0 to 120); series: IB, TUNED]

Performance Results
1-byte latency for each rank
[Plot: time in microseconds vs. MPI rank (0 to 120); series: IB, TUNED]

Performance Results
1-byte latency for rank 1
[Plot: time in microseconds vs. communicator size (0 to 120); series: IB, TUNED]

Performance Results
1-byte latency for rank N-1
[Plot: time in microseconds vs. communicator size (0 to 120); series: IB, TUNED]

Conclusions and Future Work
Conclusions
- a new algorithm to use Multicast for MPI_BCAST
- massively parallel scheme to deal with reliability issues
- (average) constant-time (2 · t_send) bcast implementation
- tree-based algorithms cause process skew
- the newly proposed algorithm does not skew processes
Future Work
- investigate other collective operations
- investigate the influence of process skew on applications
- investigate large message support
