Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand
Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of Computer Science & Engineering The Ohio State University
– Queue Pairs (QP)
– Completion Queues (CQ)
– Protection Domain (PD)
– Memory Regions (MR) and access keys
[Figures: InfiniBand Stack and Queuing Model (courtesy of the IB Spec.)]
Coordinated
– Pros: consistency
– Cons: rollback upon failure
Uncoordinated
– Pros:
– Cons: message logging
Communication Induced
– Pros: without global coordination
– Cons: message processing
System Level
– Pros: user applications; independent of the progress of the application
– Cons: consistency issue
Application Level
– Pros: can be customized
– Cons: code needs to be rewritten according to the checkpointing interface
Compiler Assisted
– Pros: checkpointing without source code modification
– Cons: compiler techniques for consistency
[Figure: C/R framework. MPI Job Console: Global C/R Coordinator, Control Message Manager. Each MPI Process: Local C/R Controller, Communication Channel Manager, C/R Library.]
– Process Manager: Multi-Purpose Daemon (MPD), developed at ANL, extended with C/R messaging support
– C/R Library: Berkeley Lab’s Checkpoint/Restart (BLCR)
[Figure: C/R framework with the Process Manager, the data network, and point-to-point data connections.]
[Figure: checkpoint/restart cycle.]
– Checkpoint Request: Running, Initial Synchronization, Pre-checkpoint Coordination, Local Checkpointing, Post-checkpoint Coordination, Running
– Restart Request: Restarting, Post-checkpoint Coordination, Running
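The cycle in the figure can be read as a small state machine. The sketch below is only an illustration of that reading: the phase names come from the slide, while the `Phase` enum, the `NEXT` table, and `phases_for` are hypothetical names, and the ordering of the phases is an assumption taken from the figure labels rather than from any code in the framework.

```python
from enum import Enum, auto
from typing import Dict, List


class Phase(Enum):
    RUNNING = auto()
    INITIAL_SYNCHRONIZATION = auto()
    PRE_CHECKPOINT_COORDINATION = auto()
    LOCAL_CHECKPOINTING = auto()
    POST_CHECKPOINT_COORDINATION = auto()
    RESTARTING = auto()


# Assumed successor of each non-running phase (ordering inferred from the figure).
NEXT: Dict[Phase, Phase] = {
    Phase.INITIAL_SYNCHRONIZATION: Phase.PRE_CHECKPOINT_COORDINATION,
    Phase.PRE_CHECKPOINT_COORDINATION: Phase.LOCAL_CHECKPOINTING,
    Phase.LOCAL_CHECKPOINTING: Phase.POST_CHECKPOINT_COORDINATION,
    Phase.POST_CHECKPOINT_COORDINATION: Phase.RUNNING,
    Phase.RESTARTING: Phase.POST_CHECKPOINT_COORDINATION,
}

# Phase entered when a request arrives (hypothetical request names).
ENTRY: Dict[str, Phase] = {
    "checkpoint": Phase.INITIAL_SYNCHRONIZATION,
    "restart": Phase.RESTARTING,
}


def phases_for(request: str) -> List[Phase]:
    """List the phases a process passes through until it is running again."""
    path = [ENTRY[request]]
    while path[-1] is not Phase.RUNNING:
        path.append(NEXT[path[-1]])
    return path


if __name__ == "__main__":
    for request in ("checkpoint", "restart"):
        print(request, "->", ", ".join(p.name for p in phases_for(request)))
```

Both paths end with Post-checkpoint Coordination before the process returns to Running, which is how the figure is drawn for checkpoint and for restart.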
[Figure: state involved in checkpointing an MPI process over InfiniBand. Software layers: User Application, MPI Upper Layers, MPI InfiniBand Channel. Channel state: Network Connection Information, Dedicated Communication Buffers, Registered User Buffers, Channel Progress Information. InfiniBand Host Adapter (HCA): QPs, MRs, CQs, PDs. Peer MPI processes are reached over the InfiniBand fabric; checkpoints go to storage. Phases: Running, Initial Synchronization, Pre-checkpoint Coordination, Local Checkpointing, Post-checkpoint Coordination.]
– MPI-1 (MVAPICH)
– MPI-2 (MVAPICH2)
world to take advantage of InfiniBand
– Largest being Sandia Thunderbird Cluster (4512 nodes with 9024 processors)
– Time tested and stable code base with novel features
http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
– InfiniBand cluster with 12 nodes, dual Intel Xeon 3.4 GHz CPUs, 2 GB memory, Redhat Linux AS 4 with kernel version 2.6.11
– Ext3 file system on top of local SATA disks
– Mellanox InfiniHost MT23108 HCAs
– Analysis of the overhead of taking one checkpoint and of one restart
– Performance impact on applications when checkpointing periodically
Benchmark    Checkpoint size per process
LU.C.8       126 MB
BT.C.9       213 MB
SP.C.9       193 MB
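As a rough worked example, the per-process sizes above can be turned into the total data written per checkpoint. The sketch assumes the trailing number in each benchmark name (8 or 9) is the process count, which is the usual NAS naming convention but is not stated on the slide; `process_count` and `total_checkpoint_mb` are illustrative names.

```python
# Per-process checkpoint sizes in MB, as listed on the slide.
CHECKPOINT_MB_PER_PROCESS = {
    "LU.C.8": 126,
    "BT.C.9": 213,
    "SP.C.9": 193,
}


def process_count(benchmark: str) -> int:
    """Assumption: the last dot-separated field is the number of processes."""
    return int(benchmark.rsplit(".", 1)[1])


def total_checkpoint_mb(benchmark: str) -> int:
    """Total MB written by all processes for one checkpoint of the job."""
    return process_count(benchmark) * CHECKPOINT_MB_PER_PROCESS[benchmark]


if __name__ == "__main__":
    for name in CHECKPOINT_MB_PER_PROCESS:
        print(f"{name}: {total_checkpoint_mb(name)} MB per checkpoint")
```

Under that assumption each checkpoint writes roughly 1 GB to 1.9 GB across the job, spread over the local disks of the nodes.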
[Figure: checkpoint and restart time (seconds) for lu.C.8, bt.C.9, and sp.C.9, broken down into File Access and Coordination.]
File access time is the dominant factor in checkpoint/restart overhead
checkpoint/restart request to program resumes execution
local disk before program continues
[Figure: execution time (seconds) of lu.C.8, bt.C.9, and sp.C.9 with checkpointing intervals of 1 min, 2 min, 4 min, and none.]
[Figure: performance (GFLOPS) by checkpointing interval and number of checkpoints: 2 min (6), 4 min (2), 8 min (1), and none.]
[Figure: execution time (seconds) and performance impact (0% to 10%) by checkpointing interval and number of checkpoints: 1 min (14), 2 min (7), 4 min (3), 8 min (1), and none.]
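The charts compare runs with periodic checkpointing against a run with none. One plausible way to compute the plotted performance impact is the relative increase in execution time. This definition is an assumption, not something the slides state; the function names are illustrative and the numbers in the usage lines are made up for the example, not measurements from the talk.

```python
def performance_impact_percent(time_with_ckpt: float, time_without: float) -> float:
    """Relative slowdown, in percent, caused by periodic checkpointing."""
    if time_without <= 0:
        raise ValueError("baseline execution time must be positive")
    return 100.0 * (time_with_ckpt - time_without) / time_without


def cost_per_checkpoint(time_with_ckpt: float, time_without: float,
                        num_checkpoints: int) -> float:
    """Average extra seconds attributable to each checkpoint taken."""
    if num_checkpoints <= 0:
        raise ValueError("need at least one checkpoint")
    return (time_with_ckpt - time_without) / num_checkpoints


if __name__ == "__main__":
    # Hypothetical run: 800 s without checkpoints, 828 s with 7 checkpoints.
    print(performance_impact_percent(828.0, 800.0))  # relative slowdown in %
    print(cost_per_checkpoint(828.0, 800.0, 7))      # seconds per checkpoint
```

Dividing the extra time by the number of checkpoints gives a per-checkpoint cost that can be compared across intervals.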
– Molecular dynamics for biochemical analysis.
– DPPC dataset running on 10 processes.