
A Network-Failure-Tolerant Message-Passing System for Terascale Clusters



  1. A Network-Failure-Tolerant Message-Passing System for Terascale Clusters Richard L. Graham Advanced Computing Los Alamos National Laboratory

  2. LA-MPI team • David Daniel • Nehal Desai • Rich Graham • Dean Risinger • Mitch Sukalski
     Past Contributors • Ron Minnich • Sung-Eun Choi • Craig Rasmussen • Ling-Ling Chen • MaryDell Nochumson • Steve Karmesin • Peter Beckman
     Lampi-support@lanl.gov

  3. Why yet another MPI? • We build very large clusters with the latest and best interconnects – integration issue: “End to End Reliability not assured”

  4. Definitions • Path – Homogeneous network transport object • Fragment striping – Sending fragments of a single message along several different physical devices of a given Path. • Message striping – Striping different messages along different paths.
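
The two striping definitions above can be illustrated with a minimal sketch. It is not LA-MPI code: the round-robin assignment, the function names, and the device and path labels (`nic0`, `udp`, and so on) are assumptions chosen only to make the distinction concrete.

```python
def fragment(message: bytes, frag_size: int):
    """Cut one message into fragments of at most frag_size bytes."""
    return [message[i:i + frag_size] for i in range(0, len(message), frag_size)]


def stripe_fragments(message: bytes, devices, frag_size: int):
    """Fragment striping: fragments of ONE message are spread over the
    physical devices of a single path (round-robin is an assumption)."""
    frags = fragment(message, frag_size)
    return [(devices[i % len(devices)], i, f) for i, f in enumerate(frags)]


def stripe_messages(messages, paths):
    """Message striping: WHOLE messages are spread over different paths
    (round-robin is an assumption)."""
    return [(paths[i % len(paths)], m) for i, m in enumerate(messages)]


def reassemble(striped):
    """Receiver side: put fragments back in order by sequence number."""
    return b"".join(f for _, _, f in sorted(striped, key=lambda t: t[1]))
```

For example, `stripe_fragments(b"abcdefghij", ["nic0", "nic1"], 4)` places the first and third fragments on `nic0` and the second on `nic1`, and `reassemble` restores the original bytes regardless of arrival order.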

  5. Definitions • Reliability – Correcting non-catastrophic, transient failures • Resilience – Surviving catastrophic network failures

  6. LA-MPI design goals • Message passing support for terascale clusters • Fault tolerant (reliable, resilient) • High performance • Thread Safe • Support widely used message passing API (MPI) • Multi-platform and supportable (Open source)

  7. LA-MPI architecture • Two component design * Run time job control - job startup - standard I/O - job monitoring * Message passing library - resource management - message management

  8. [Architecture diagram. User level: User Application; MPI Interface; MML (Memory & Message Management, Network Path Scheduler); SRL (Shared Memory, Network Communication over Net A, Net B, Net C); OS Bypass. Kernel level: Memory Subsystem; Network Drivers. Network connects to Other Hosts.]

  9. [Flowchart: fragment delivery and acknowledgment. Boxes: Message Created By MPI; Recv Posted by Dest Proc; Path Associated With Message; Fragment Sent To Dest Proc; Timer; Timeout? (Yes: Retransmit Fragment); Fragment Recv’ed; Was Fragment Recv’ed OK?; Generate Ack/Nack; Ack/Nack Recv; NACK? (Yes/No); Specific Ack? (Yes/No); Record Aggregate Information; Release Fragment.]
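
The retransmission loop in the flowchart above can be sketched as a small simulation. This is a simplified illustration, not the LA-MPI implementation: it sends one fragment at a time, ignores aggregate acknowledgments, and uses a hypothetical `channel` callable that returns `"ack"`, `"nack"`, or `None` (no reply before the timer expires). The `max_attempts` limit is also an assumption.

```python
def send_reliably(fragments, channel, max_attempts=5):
    """Send each fragment until it is acknowledged.

    Returns a log of (sequence number, attempt, outcome) tuples.
    """
    log = []
    for seq, payload in enumerate(fragments):
        for attempt in range(1, max_attempts + 1):
            reply = channel(seq, payload)
            if reply == "ack":
                # Receiver confirmed the fragment: the sender may release it.
                log.append((seq, attempt, "released"))
                break
            # A NACK or a timer expiry both lead to retransmission.
            reason = "nack" if reply == "nack" else "timeout"
            log.append((seq, attempt, "retransmit:" + reason))
        else:
            raise RuntimeError(
                f"fragment {seq} undelivered after {max_attempts} attempts"
            )
    return log


def scripted_channel(script):
    """Hypothetical test channel: script maps a sequence number to the
    replies returned on successive attempts; the default reply is "ack"."""
    pending = {seq: list(replies) for seq, replies in script.items()}

    def channel(seq, payload):
        queue = pending.get(seq)
        return queue.pop(0) if queue else "ack"

    return channel
```

With `scripted_channel({0: [None, "ack"], 1: ["nack", "ack"]})`, fragment 0 is retransmitted once after a timeout, fragment 1 once after a NACK, and any further fragment is released on its first attempt.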

  10. Platforms Supported
      OS’s • Linux (i686, alpha) • TRU64 • 32 and 64 bit versions • IRIX
      “Interconnects” • Shared Memory • UDP • ELAN3 • HIPPI-800

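  11. [Bar chart: CICE – 64 time steps. Y-axis: Time - Seconds (0 to 140). Categories: MPICH, MPT, LA-MPI. Series: Boundary exchange, Total time. Data labels as extracted: 117.0, 99.2, 96.5 (total time) and 13.5, 8.43, 8.02 (boundary exchange).]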

  12. Zero Byte Half Round Trip Latency (uSec)
      Platform                LA-MPI    MPT      MPICH
      O2K (shared Mem)        7.0       6.5      19.9
      O2K (HIPPI-800)         155.3     143.5    N/A
      O2K (IP)                526.7     525.6    586.0
      i686 (shared Mem)       2.3       N/A      23.5
      i686 (IP)               132.8     N/A      123.5
      ES45 (1G shared Mem)    2.3       N/A      5.5
      ES45 (1G ELAN-3)        13.8      N/A      4.5
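
As a rough illustration of the metric in the table above, half round trip latency is usually taken as the mean round-trip time of a ping-pong exchange divided by two. The sketch below assumes that convention; the slides do not describe the benchmark itself, and `ping_pong` is a hypothetical callable standing in for one zero-byte send plus its echoed reply.

```python
import time


def half_round_trip_usec(elapsed_seconds, round_trips):
    """Half of the mean round-trip time, in microseconds."""
    if round_trips <= 0:
        raise ValueError("round_trips must be positive")
    return elapsed_seconds / round_trips / 2.0 * 1e6


def measure(ping_pong, round_trips=1000, warmup=10):
    """Time round_trips calls of a hypothetical zero-byte ping_pong()."""
    for _ in range(warmup):
        ping_pong()  # warm-up iterations are not timed
    start = time.perf_counter()
    for _ in range(round_trips):
        ping_pong()
    elapsed = time.perf_counter() - start
    return half_round_trip_usec(elapsed, round_trips)
```

Under this convention, 1000 round trips completed in 0.014 seconds correspond to a half round trip latency of about 7 microseconds.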

  13. Peak Bandwidth (MB/Sec)
      Platform                LA-MPI        MPT     MPICH
      O2K (shared Mem)        145           135     85.7
      O2K (HIPPI-800)         135           73      N/A
      O2K (IP)                36.8          34.8    8.7
      i686 (shared Mem)       168           N/A     131
      i686 (IP)               11.3          N/A     11.0
      ES45 (1G shared Mem)    690           N/A     760
      ES45 (1G IP)            281 (1 NIC)   N/A     290

  14. Allgather (uSec/call)
      Host x nProcs    LA-MPI 40 bytes    LA-MPI 40000 bytes    MPT 40 bytes     MPT 40000 bytes
      1 x 2            24.5               623                   35.2             959
      1 x 4            39.3               1380                  70               3140
      1 x 32           595                15500                 403              57600
      1 x 64           2590               48800                 774              153000
      1 x 120          8480               129000                Would not run    Would not run
      2 x 4            329                7150                  1660             19900
      2 x 32           907                165000                9700             245000
      4 x 4            633                18400                 2400             51200
      4 x 32           1669.2             407891.9              13500            639000

  15. [Line chart: HIPPI-800 Ping-Pong Bandwidth (MB/Sec). Y-axis: MB/Sec (0 to 150). X-axis ticks: 8, 256, 1K, 16K, 64K, 1M. Series: 1N, 2N, 3N, 4N, MPT.]

  16. Future Directions • Finish the resilience work • Additional interconnects (Myrinet 2000) • “Progress” engine • Dynamic reconfiguration

  17. LA-MPI design goals • Message passing support for tera-scale clusters • Fault tolerant (reliable, resilient) • High performance • Support widely used message passing API (MPI) • Thread Safe • Multi-platform and supportable (Open source)

  18. Summary • Only known end-to-end fault tolerant MPI implementation • Well performing implementation • Current support for several platforms
