SLIDE 1
A Network-Failure-Tolerant Message-Passing System for Terascale Clusters
Richard L. Graham
Advanced Computing, Los Alamos National Laboratory
SLIDE 2 LA-MPI team
- David Daniel
- Nehal Desai
- Rich Graham
- Dean Risinger
- Mitch Sukalski
lampi-support@lanl.gov
Past Contributors
- Ron Minnich
- Sung-Eun Choi
- Craig Rasmussen
- Ling-Ling Chen
- MaryDell Nochumson
- Steve Karmesin
- Peter Beckman
SLIDE 3 Why yet another MPI?
- We build very large clusters with the latest and best interconnects
  - Integration issue: “end-to-end reliability is not assured”
SLIDE 4 Definitions
- Path – A homogeneous network transport object
- Fragment striping – Sending fragments of a single message along several different physical devices of a given path (see the sketch after this list)
- Message striping – Striping different messages along different paths
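
To make the two striping definitions concrete, here is a minimal C sketch; the round-robin policy and names below are illustrative assumptions, not LA-MPI's actual scheduling code:

    /* Fragment striping: fragments of ONE message rotate over the
     * physical devices belonging to the path the message is bound to. */
    #define NUM_DEVICES 2                 /* devices within one path */
    static int device_for_fragment(int frag_index)
    {
        return frag_index % NUM_DEVICES;  /* round-robin over devices */
    }

    /* Message striping: successive MESSAGES rotate over whole paths;
     * all fragments of a given message stay on its chosen path. */
    #define NUM_PATHS 2                   /* e.g. an ELAN3 path and a UDP path */
    static int path_for_message(int msg_seq)
    {
        return msg_seq % NUM_PATHS;       /* round-robin over paths */
    }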
SLIDE 5 Definitions
- Reliability – Correcting non-catastrophic, transient failures
- Resilience – Surviving catastrophic network failures
SLIDE 6 LA-MPI design goals
- Message passing support for terascale clusters
- Fault tolerant (reliable, resilient)
- High performance
- Thread safe
- Support for the widely used message passing API (MPI)
- Multi-platform and supportable (open source)
SLIDE 7 LA-MPI architecture
* Run time job control
- job startup
- standard I/O
- job monitoring
* Message passing library
- resource management
- message management
SLIDE 8
[Architecture diagram: the User Application calls the MPI Interface, which sits on the Memory & Message Management layer (MML); below it, a send/receive layer (SRL) contains the Network Path Scheduler, Shared Memory, Network Communication, and Network Drivers over the Memory Subsystem, reaching networks Net A, Net B, and Net C. The library runs at user level with OS bypass to other hosts; the user-level/kernel-level boundary is marked.]
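
The layering in the diagram can be summarized in a short C sketch. The names here (path_ops, mml_send) are hypothetical, not LA-MPI's interfaces: the MML owns message state, the path scheduler binds a message to a path, and each path supplies a device-specific fragment send.

    #include <stddef.h>

    /* One entry per path (shared memory, UDP, ELAN3, ...); assumed layout. */
    typedef struct path_ops {
        const char *name;                                    /* path label */
        int (*send_fragment)(const void *frag, size_t len);  /* SRL hook */
    } path_ops_t;

    /* MML side: push every fragment of a message through the path
     * chosen by the path scheduler for that message. */
    static int mml_send(const void *frags[], const size_t lens[],
                        int nfrags, path_ops_t *path)
    {
        for (int i = 0; i < nfrags; i++)
            if (path->send_fragment(frags[i], lens[i]) != 0)
                return -1;    /* the reliability layer would retransmit */
        return 0;
    }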
SLIDE 9
[Flowchart of the reliability protocol. Sender: a message created by MPI is associated with a path, and its fragments are sent to the destination process, each starting a retransmission timer. Receiver: once the receive is posted by the destination process and a fragment arrives, the fragment is checked; an ACK is generated if it was received OK, otherwise a NACK. Back at the sender: a NACK triggers retransmission; a specific ACK records aggregate information and releases the fragment; a fragment whose timer expires is retransmitted. (Sketched in code below.)]
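
The sender side of this protocol can be sketched in C as follows. This is a hedged sketch: the function names, fragment fields, and timeout value are assumptions, not LA-MPI's implementation.

    #include <stdbool.h>
    #include <time.h>

    typedef struct fragment {
        int    seq;        /* sequence number the ACK/NACK refers to */
        time_t sent_at;    /* when the fragment was (re)transmitted  */
        bool   acked;
    } fragment_t;

    /* Illustrative helpers, assumed to exist elsewhere. */
    void retransmit(fragment_t *f);
    void record_aggregate_info(fragment_t *f);
    void release_fragment(fragment_t *f);

    #define FRAG_TIMEOUT 1   /* seconds; illustrative value */

    /* The receiver checked the fragment and returned an ACK or NACK. */
    void on_ack_or_nack(fragment_t *f, bool is_nack)
    {
        if (is_nack) {                    /* fragment arrived corrupted */
            retransmit(f);
            f->sent_at = time(NULL);
        } else {                          /* specific ACK: delivered OK */
            f->acked = true;
            record_aggregate_info(f);
            release_fragment(f);
        }
    }

    /* Timer path: no ACK within the timeout means the fragment (or
     * its ACK) was lost, so send it again. */
    void on_timer_tick(fragment_t *f)
    {
        if (!f->acked && time(NULL) - f->sent_at > FRAG_TIMEOUT) {
            retransmit(f);
            f->sent_at = time(NULL);
        }
    }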
SLIDE 10
Platforms Supported

OS's
* Linux (i686, Alpha)
* Tru64
* 32 and 64 bit versions
* IRIX

“Interconnects”
* Shared memory
* UDP
* ELAN3
* HIPPI-800
SLIDE 11
CICE – 64 time steps (times in seconds)

Implementation   Total time   Boundary exchange
MPICH            117.0        13.5
MPT              96.5         8.43
LA-MPI           99.2         8.02
SLIDE 12
Zero Byte Half Round Trip Latency (uSec)

Platform               LA-MPI   MPT     MPICH
O2K (shared mem)       7.0      6.5     19.9
O2K (HIPPI-800)        155.3    143.5   N/A
O2K (IP)               526.7    525.6   586.0
i686 (shared mem)      2.3      N/A     23.5
i686 (IP)              132.8    N/A     123.5
ES45 (1G shared mem)   2.3      N/A     5.5
ES45 (1G ELAN-3)       13.8     N/A     4.5
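
For context on how such numbers are typically obtained (an assumed methodology, not necessarily the benchmark used for this table): a zero-byte ping-pong measures the round trip and reports half of it.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        const int iters = 10000;
        char buf[1];
        MPI_Status st;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {              /* ping */
                MPI_Send(buf, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {       /* pong */
                MPI_Recv(buf, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)   /* half round trip, in microseconds */
            printf("latency: %.2f usec\n", (t1 - t0) / iters / 2.0 * 1e6);
        MPI_Finalize();
        return 0;
    }

The same loop with large messages, reporting bytes moved per second instead of elapsed time, yields peak-bandwidth figures like those on the next slide.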
SLIDE 13
Peak Bandwidth (MB/Sec)

Platform               LA-MPI        MPT     MPICH
O2K (shared mem)       145           135     85.7
O2K (HIPPI-800)        135           73      N/A
O2K (IP)               36.8          34.8    8.7
i686 (shared mem)      168           N/A     131
i686 (IP)              11.3          N/A     11.0
ES45 (1G shared mem)   690           N/A     760
ES45 (1G IP)           281 (1 NIC)   N/A     290
SLIDE 14 Allgather (uSec/call)
Host x nProcs   LA-MPI (40 / 40000 bytes)   MPT (40 / 40000 bytes)
1 x 2           24.5 / 623                  35.2 / 959
1 x 4           39.3 / 1380                 70 / 3140
1 x 32          595 / 15500                 403 / 57600
1 x 64          2590 / 48800                774 / 153000
1 x 120         8480 / 129000               would not run
2 x 4           329 / 7150                  1660 / 19900
2 x 32          907 / 165000                9700 / 245000
4 x 4           633 / 18400                 2400 / 51200
4 x 32          1669.2 / 407891.9           13500 / 639000
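
A per-call timing loop of the kind presumably behind these numbers (an assumed sketch, not the actual benchmark; the buffer size follows the 40-byte column):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, i;
        const int iters = 100, bytes = 40;   /* per-process contribution */
        char *sendbuf, *recvbuf;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        sendbuf = malloc(bytes);
        recvbuf = malloc((size_t)bytes * nprocs);
        memset(sendbuf, 0, bytes);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++)
            MPI_Allgather(sendbuf, bytes, MPI_BYTE,
                          recvbuf, bytes, MPI_BYTE, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%.1f usec/call\n", (t1 - t0) / iters * 1e6);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }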
SLIDE 15
[Chart: HIPPI-800 ping-pong bandwidth (MB/Sec) versus message size (8 bytes to 1 MB) for LA-MPI with 1, 2, 3, and 4 NICs (1N, 2N, 3N, 4N) and for MPT; bandwidth axis 0-150 MB/Sec.]
SLIDE 16 Future Directions
- Finish the resilience work
- Additional interconnects (Myrinet 2000)
- “Progress” engine
- Dynamic reconfiguration
SLIDE 17 LA-MPI design goals
- Message passing support for terascale clusters
- Fault tolerant (reliable, resilient)
- High performance
- Support for the widely used message passing API (MPI)
- Thread safe
- Multi-platform and supportable (open source)
SLIDE 18 Summary
- The only known MPI implementation with end-to-end fault tolerance
- Performance comparable to existing MPI implementations
- Currently supports several platforms