
SLIDE 1

A Network-Failure-Tolerant Message-Passing System for Terascale Clusters

Richard L. Graham Advanced Computing Los Alamos National Laboratory

SLIDE 2

LA-MPI team

  • David Daniel
  • Nehal Desai
  • Rich Graham
  • Dean Risinger
  • Mitch Sukalski

lampi-support@lanl.gov

Past Contributors

  • Ron Minnich
  • Sung-Eun Choi
  • Craig Rasmussen
  • Ling-Ling Chen
  • MaryDell Nochumson
  • Steve Karmesin
  • Peter Beckman

SLIDE 3

Why yet another MPI?

  • We build very large clusters with the latest and best interconnects
  • Integration issue: “End to End Reliability not assured”

SLIDE 4

Definitions

  • Path – Homogeneous network transport object
  • Fragment striping – Sending fragments of a single message along several different physical devices of a given path (see the sketch after this list).
  • Message striping – Striping different messages along different paths.
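
A minimal sketch can make the two striping modes concrete. The code below illustrates fragment striping only; path_t, device_send, FRAG_SIZE, and the round-robin policy are invented for illustration, not LA-MPI's actual interfaces.

    /* Hypothetical sketch of fragment striping: one message is cut into
     * fragments, and successive fragments are sent round-robin across
     * the physical devices (NICs) that make up one homogeneous path. */
    #include <stddef.h>

    #define FRAG_SIZE 65536          /* illustrative fragment size */

    typedef struct {
        int num_devices;             /* NICs belonging to this path */
        int next_device;             /* round-robin cursor */
    } path_t;

    /* Assumed transport primitive: send one fragment on one device. */
    void device_send(path_t *path, int device, const char *buf, size_t len);

    void send_message_striped(path_t *path, const char *msg, size_t len)
    {
        size_t off;
        for (off = 0; off < len; off += FRAG_SIZE) {
            size_t frag_len = (len - off < FRAG_SIZE) ? len - off : FRAG_SIZE;
            /* Each fragment may travel on a different device, but all
             * devices belong to the same path. */
            device_send(path, path->next_device, msg + off, frag_len);
            path->next_device = (path->next_device + 1) % path->num_devices;
        }
    }

Message striping, by contrast, would leave each message whole and rotate between paths from one message to the next.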

SLIDE 5

Definitions

  • Reliability – Correcting non-catastrophic, transient failures
  • Resilience – Surviving catastrophic network failures (the contrast is sketched below)
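
The operational difference between the two terms can be sketched in code. This is illustrative only: the error codes and recovery primitives are invented, not LA-MPI's actual interfaces.

    /* Hypothetical contrast between reliability and resilience. */
    typedef struct fragment fragment_t;   /* opaque fragment handle */

    typedef enum {
        ERR_CORRUPT_FRAG,   /* transient: checksum failure, dropped packet */
        ERR_PATH_DOWN       /* catastrophic: NIC, cable, or switch is gone */
    } net_error_t;

    /* Assumed primitives: resend a fragment, or rebind it to another path. */
    void retransmit_fragment(fragment_t *frag);
    void fail_over_path(fragment_t *frag);

    void handle_network_error(net_error_t err, fragment_t *frag)
    {
        switch (err) {
        case ERR_CORRUPT_FRAG:
            /* Reliability: correct the transient fault by resending the
             * same fragment over the same path. */
            retransmit_fragment(frag);
            break;
        case ERR_PATH_DOWN:
            /* Resilience: survive the catastrophic fault by moving the
             * fragment to a surviving path before resending it. */
            fail_over_path(frag);
            retransmit_fragment(frag);
            break;
        }
    }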

SLIDE 6

LA-MPI design goals

  • Message passing support for terascale clusters
  • Fault tolerant (reliable, resilient)
  • High performance
  • Thread safe
  • Support widely used message passing API (MPI)
  • Multi-platform and supportable (open source)

SLIDE 7

LA-MPI architecture

  • Two-component design:

* Run-time job control
  • job startup
  • standard I/O
  • job monitoring

* Message-passing library

  • resource management
  • message management

SLIDE 8

[Architecture diagram: the user application calls the MPI interface, which rests on the memory and message management layer (MML), containing memory & message management and the network path scheduler (a sketch of the scheduler follows below). Beneath it, the send/receive layer (SRL) carries out shared-memory and network communication through the memory subsystem, OS-bypass transports, and kernel-level network drivers (Net A, Net B, Net C) to reach other hosts; the stack spans user level and kernel level.]
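
As a rough illustration of the network path scheduler's role in the diagram (binding each new message to a usable path, which enables both message striping and failover), here is a hypothetical sketch; the types, names, and round-robin policy are assumptions, not LA-MPI code.

    #include <stddef.h>

    #define MAX_PATHS 4

    typedef struct path path_t;      /* opaque path handle */

    typedef struct {
        path_t *paths[MAX_PATHS];
        int     usable[MAX_PATHS];   /* cleared when a path fails */
        int     num_paths;
        int     cursor;              /* round-robin position */
    } path_scheduler_t;

    /* Bind the next message to a path: rotate over configured paths,
     * skipping any that have been marked unusable. */
    path_t *schedule_path(path_scheduler_t *s)
    {
        int tries;
        for (tries = 0; tries < s->num_paths; tries++) {
            int i = s->cursor;
            s->cursor = (s->cursor + 1) % s->num_paths;
            if (s->usable[i])
                return s->paths[i];
        }
        return NULL;   /* no usable path is left */
    }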

SLIDE 9

[Flowchart: per-fragment reliability protocol]

  • Sender: a message is created by MPI and a path is associated with it; each fragment is sent to the destination process and a retransmission timer is started.
  • Receiver: once a receive is posted by the destination process and the fragment arrives, the receiver checks whether the fragment was received OK and generates an ACK or NACK.
  • Sender, on the ACK/NACK: a NACK triggers a retransmit; a specific ACK lets the sender record aggregate information and release the fragment. Independently, any fragment whose timer expires is retransmitted. (A sender-side sketch follows.)
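
The sender side of this flow can be sketched as two event handlers, reconstructed from the flowchart rather than from LA-MPI source; the types, function names, and the 5-second retransmit window are all illustrative assumptions.

    #include <stdbool.h>
    #include <stddef.h>
    #include <time.h>

    typedef struct frag {
        struct frag *next;
        time_t       deadline;      /* retransmit if no ACK by this time */
    } frag_t;

    /* Assumed send/receive-layer primitives. */
    void send_frag(frag_t *f);
    void release_frag(frag_t *f);   /* free buffers, record aggregate info */

    /* Sender-side reaction to an incoming ACK or NACK for one fragment. */
    void on_ack_or_nack(frag_t *f, bool is_nack, bool is_specific_ack)
    {
        if (is_nack)
            send_frag(f);           /* receiver saw a corrupt fragment */
        else if (is_specific_ack)
            release_frag(f);        /* delivered: the fragment can be freed */
        /* otherwise: aggregate ACK only; the fragment stays queued */
    }

    /* Periodic timer sweep: a fragment past its deadline is assumed lost. */
    void on_timer(frag_t *list, time_t now)
    {
        frag_t *f;
        for (f = list; f != NULL; f = f->next) {
            if (now >= f->deadline) {
                send_frag(f);
                f->deadline = now + 5;   /* illustrative 5 s window */
            }
        }
    }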

SLIDE 10

Platforms Supported

OSes

  • Linux (i686, Alpha)
  • Tru64
  • 32- and 64-bit versions
  • IRIX

“Interconnects”

  • Shared memory
  • UDP
  • ELAN3
  • HIPPI-800

SLIDE 11

CICE – 64 time steps (time in seconds)

  MPI library    Total time    Boundary exchange
  MPICH             117.0           13.5
  MPT                96.5            8.43
  LA-MPI             99.2            8.02

SLIDE 12

Zero Byte Half Round Trip Latency (µs)

  Platform                 LA-MPI      MPT      MPICH
  O2K (shared mem)            7.0      6.5       19.9
  O2K (HIPPI-800)           155.3    143.5        N/A
  O2K (IP)                  526.7    525.6      586.0
  i686 (shared mem)           2.3      N/A       23.5
  i686 (IP)                 132.8      N/A      123.5
  ES45 (1G shared mem)        2.3      N/A        5.5
  ES45 (1G ELAN-3)           13.8      N/A        4.5

SLIDE 13

Peak Bandwidth (MB/s)

  Platform                 LA-MPI         MPT      MPICH
  O2K (shared mem)            145         135       85.7
  O2K (HIPPI-800)             135          73        N/A
  O2K (IP)                   36.8        34.8        8.7
  i686 (shared mem)           168         N/A        131
  i686 (IP)                  11.3         N/A       11.0
  ES45 (1G shared mem)        690         N/A        760
  ES45 (1G IP)          281 (1 NIC)       N/A        290

SLIDE 14

Allgather (µs/call)

  Host x nProcs    LA-MPI, 40 B    LA-MPI, 40000 B    MPT, 40 B    MPT, 40000 B
  1 x 2                  24.5               623            35.2             959
  1 x 4                  39.3              1380            70              3140
  1 x 32                595               15500           403             57600
  1 x 64               2590               48800           774            153000
  1 x 120              8480              129000           (would not run)
  2 x 4                 329                7150          1660             19900
  2 x 32                907              165000          9700            245000
  4 x 4                 633               18400          2400             51200
  4 x 32               1669.2            407891.9       13500            639000

SLIDE 15

HIPPI-800 Ping-Pong Bandwidth (MB/s)

[Plot: ping-pong bandwidth (y-axis 0–150 MB/s) versus message size (8 B to 1 MB) for LA-MPI striping over 1, 2, 3, and 4 HIPPI-800 NICs (1N, 2N, 3N, 4N) and for MPT.]

SLIDE 16

Future Directions

  • Finish the resilience work
  • Additional interconnects (Myrinet 2000)
  • “Progress” engine
  • Dynamic reconfiguration

SLIDE 17

LA-MPI design goals

  • Message passing support for terascale clusters
  • Fault tolerant (reliable, resilient)
  • High performance
  • Support widely used message passing API (MPI)
  • Thread safe
  • Multi-platform and supportable (open source)

SLIDE 18

Summary

  • The only known end-to-end fault-tolerant MPI implementation
  • A well-performing implementation
  • Currently supports several platforms