A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast. T. Hoefler, C. Siebert, W. Rehm. Open Systems Lab, Indiana University; Computer Architecture Group, Chemnitz University of Technology.


SLIDE 1

A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast

  • T. Hoefler, C. Siebert, W. Rehm

Open Systems Lab, Indiana University, Bloomington, USA
Computer Architecture Group, Chemnitz University of Technology, Chemnitz, Germany

IPDPS’07 - CAC’07 Workshop

Long Beach, CA, USA

26th March 2007

  • T. Hoefler, C. Siebert, W. Rehm

Constant-time Multicast

SLIDE 2

Introduction

  • MPI is (still) the de-facto standard in parallel programming
  • systems are going to extreme scale
  • applications start to use high scalability
  • collective operations are an important tool
  • scalable collective operations are very important

Our approach

Use special hardware features to improve scalability of collective operations.

SLIDE 4

Traditional Approach

  • ensure scalability with O(log₂ P) algorithms
  • optimized implementations available for different collectives

looks promising, but:

  • grows fast for small process counts (e.g., 256 processes need t = 8 · t_send)
  • processes are skewed by the algorithm (e.g., node 1 leaves the tree faster than node 7)

[Figure: broadcast tree in which nodes 1 to 7 receive the message over three rounds (round 1, round 2, round 3)]
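
The two numbers quoted on this slide can be reproduced with a small sketch. It assumes a doubling tree rooted at rank 0, in which every rank that already holds the message forwards it to one new rank per round. The slide does not name the exact tree, so the rank-to-round mapping and the function names `broadcast_rounds` and `receive_round` are illustrative assumptions.

```python
def broadcast_rounds(num_procs):
    """Rounds needed to reach all ranks when the informed set doubles each round."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (num_procs - 1).bit_length()


def receive_round(rank):
    """Round in which `rank` receives the message (assumed root: rank 0, round 0)."""
    # Ranks 1 -> round 1, ranks 2..3 -> round 2, ranks 4..7 -> round 3, and so on.
    return rank.bit_length()


if __name__ == "__main__":
    print(broadcast_rounds(256))                 # 8, i.e. t = 8 * t_send
    print([receive_round(r) for r in range(8)])  # [0, 1, 2, 2, 3, 3, 3, 3]
```

Under this assumed mapping, rank 1 is done after one round while rank 7 waits for round three, which is the skew the slide refers to.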

SLIDE 5

Multicast Support

Multicast characteristics
  • unreliable
  • no guaranteed in-order delivery
  • datagrams limited in size (MTU)
  • MC groups must be network-wide unique

MPI Interface
  • reliable transmission
  • virtually unlimited message size
  • multiple independent MPI jobs on a single network

SLIDE 7

Traditional Approaches to Ensure Reliability

ACK Schemes
  • linear ACK - hot-spot problems
  • tree-based ACK - high latency
  • co-root scheme - combination of both, similar problems
  • every (co-)root waits for last process in its group
  • retransmission timeout

NACK Schemes
  • topologies similar to ACK
  • root has to wait for some time (or save the message buffer)
  • timeout very hard to determine and not reliable
  • synchronization problems (delayed processes?)

SLIDE 9

A new Approach

The new algorithm
  • two-stage approach
  • packets are fragmented to the MTU
  • first stage sends fragmented message via Multicast
  • processes that received the fragment correctly become new root
  • second stage performs a reliable ring-broadcast

⇒ highest possible parallelism
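
The two stages can be illustrated with a toy simulation. This is a simplified model, not the authors' implementation: it assumes that after the multicast every rank holding the data forwards it reliably to its ring successor `(rank + 1) % P` once per step, and that one step costs one t_send. The function name `simulate_two_stage_bcast` and the `lost` parameter are hypothetical.

```python
def simulate_two_stage_bcast(num_procs, lost, root=0):
    """Return the number of send steps until every rank holds the data.

    lost: set of ranks whose multicast datagram was dropped (assumed input).
    """
    # Stage 1: one unreliable multicast; ranks in `lost` miss it.
    has_data = [rank not in lost for rank in range(num_procs)]
    has_data[root] = True  # the root always holds its own message
    steps = 1

    # Stage 2: reliable ring; all current holders send in parallel each step.
    while True:
        senders = [rank for rank in range(num_procs) if has_data[rank]]
        for rank in senders:
            has_data[(rank + 1) % num_procs] = True
        steps += 1
        if all(has_data):
            return steps


if __name__ == "__main__":
    print(simulate_two_stage_bcast(8, lost=set()))   # 2: multicast + one ring step
    print(simulate_two_stage_bcast(8, lost={3, 5}))  # 2: isolated losses are repaired in parallel
    print(simulate_two_stage_bcast(8, lost={3, 4}))  # 3: two consecutive losses cost one extra step
```

In this model the step count depends only on the longest run of consecutive ranks that missed the multicast, not on the number of processes, which is why the common case finishes in two steps.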

SLIDE 10

The algorithm

[Figure: the algorithm illustrated on a small set of numbered processes]

SLIDE 12

Multicast Group Management

  • problematic if multiple MPI jobs run in a subnet
  • ideal solution: MADCAP (does not exist for InfiniBand™; subnet manager?)
  • select MCGID randomly
  • carefully seeded cryptographically secure pseudorandom number generator (Blum-Blum-Shub)
  • 112 bit address space
  • collision probability for 1000 groups: 10⁻¹⁸

SLIDE 13

Packet Format

[Figure: packet layout with the fields Sequence Number, BID, CRC-32, and Data (Payload)]

Fields
  • Sequence Number: number of fragment
  • BID: Broadcast Identifier
  • CRC: (optional) checksum
  • packet error rate: 0.287%
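
A fragment header with these three fields could be packed and checked as sketched below. The slide gives neither field widths nor field order, so the 32-bit width of each field, the order (sequence number, BID, CRC-32), network byte order, and the names `pack_fragment`, `unpack_fragment`, `fragment_message`, and `max_payload` are all assumptions made for illustration.

```python
import struct
import zlib

# Assumed header layout: sequence number, BID, CRC-32 (32 bits each, big-endian).
HEADER = struct.Struct("!III")


def pack_fragment(seq, bid, payload):
    """Prefix the payload with sequence number, broadcast ID and its CRC-32."""
    return HEADER.pack(seq, bid, zlib.crc32(payload)) + payload


def unpack_fragment(packet):
    """Return (seq, bid, payload), or None if the checksum does not match."""
    seq, bid, crc = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:]
    if zlib.crc32(payload) != crc:
        return None  # corrupted datagram: treat it like a lost one
    return seq, bid, payload


def fragment_message(message, bid, max_payload):
    """Split a message into packets whose payload fits the (assumed) MTU budget."""
    offsets = range(0, len(message), max_payload)
    return [pack_fragment(seq, bid, message[off:off + max_payload])
            for seq, off in enumerate(offsets)]


if __name__ == "__main__":
    packets = fragment_message(b"x" * 5000, bid=7, max_payload=2048)
    print(len(packets))                     # 3 fragments: 2048 + 2048 + 904 bytes
    print(unpack_fragment(packets[2])[:2])  # (2, 7)
```

A receiver that gets `None` from `unpack_fragment` can handle the fragment exactly like a dropped datagram and wait for the reliable second stage.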

SLIDE 14

Implementation

  • implemented as collv1 component
  • MCGID is selected per communicator
  • one UD QP per communicator (scalable)
  • n pre-posted RRs on this QP (selectable, default 5)
  • uses the “tuned” component for small communicators/large messages
  • API-independent macro layer for OFED/MVAPI

SLIDE 15

Performance Results

Benchmark Environment

  • odin cluster at Indiana University
  • 128 InfiniBand™ nodes
  • 2 GHz dual-core AMD Opteron(tm) Processor 270

→ 1-byte IMB latency

[Plot: x-axis: communicator size; y-axis: time in microseconds; series: IB, TUNED]

SLIDE 16

Performance Results

1-byte latency for each rank

[Plot: x-axis: MPI rank; y-axis: time in microseconds; series: IB, TUNED]

SLIDE 17

Performance Results

1-byte latency for rank 1

[Plot: x-axis: communicator size; y-axis: time in microseconds; series: IB, TUNED]

SLIDE 18

Performance Results

1-byte latency for rank N − 1

[Plot: x-axis: communicator size; y-axis: time in microseconds; series: IB, TUNED]

SLIDE 19

Conclusions and Future Work

Conclusions
  • a new algorithm to use Multicast for MPI_BCAST
  • massively parallel scheme to deal with reliability issues
  • (average) constant-time (2 · t_send) bcast implementation
  • tree-based algorithms cause process skew
  • the newly proposed algorithm does not skew processes

Future Work
  • investigate other collective operations
  • investigate the influence of process skew on applications
  • investigate large message support
