 
Charm++ Workshop
Scalable in-memory checkpoint for hard and soft error protection with automatic restart on failure
Xiang Ni
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
May 2013
Outline
1 Fault Tolerance Philosophy in Charm++
2 Asynchronous Checkpoint/Restart
3 Replication enhanced Checkpoint Restart
Our Philosophy
Keep progress rate despite failures
- Optimize for the common case
- Minimize performance overhead
[Figure: progress (up to 100%) versus time, with and without fault tolerance support; annotations: checkpoint, failure, recovery, slowdown]
Optimize for the common case
- Failures rarely bring down more than one node at a time.
- On Jaguar (now Titan, the number 1 supercomputer), 92.27% of failures are individual node crashes.
- So our strategies are geared to handle all single-node failures and most multi-node failures.
[Figure: frequency (%) of failures, log scale, by number of nodes affected (1, 2, 3, 4, > 4 nodes) for System 12, System 18, System 19, System 20, System 21, MPP2, Tsubame, and Mercury]
Minimize performance overhead
- Automatic restart
  - Failure detection in runtime system
  - Immediate rollback-recovery
  - Parallel recovery
- Faster checkpoint
  - Double in-memory checkpoint/restart
  - Semi-blocking checkpointing: asynchronously store the checkpoint remotely
[Figure: double in-memory checkpointing across Node A, Node B, and Node C; each node holds its objects, a local checkpoint, and a remote checkpoint. B is the buddy of A.]
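The double in-memory scheme above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the ring-shaped buddy mapping, the `Node` class, and the `checkpoint` and `recover` helpers are hypothetical names chosen for illustration, not the Charm++ runtime's API or its actual buddy assignment.

```python
import copy


def buddy_of(rank, num_nodes):
    # Assumption: buddies form a ring (the slide only says "B is the buddy of A").
    return (rank + 1) % num_nodes


class Node:
    def __init__(self, rank, state):
        self.rank = rank
        self.state = state
        self.local_ckpt = None   # this node's own checkpoint
        self.remote_ckpt = None  # checkpoint held on behalf of the previous node


def checkpoint(nodes):
    # Every node keeps one copy locally and one copy in its buddy's memory.
    n = len(nodes)
    for node in nodes:
        node.local_ckpt = copy.deepcopy(node.state)
        nodes[buddy_of(node.rank, n)].remote_ckpt = copy.deepcopy(node.state)


def recover(nodes, failed_rank):
    # Rebuild the failed node from the copy stored on its buddy.
    n = len(nodes)
    buddy = nodes[buddy_of(failed_rank, n)]
    replacement = Node(failed_rank, copy.deepcopy(buddy.remote_ckpt))
    replacement.local_ckpt = copy.deepcopy(buddy.remote_ckpt)
    # The failed node also held its predecessor's remote copy; restore it too.
    predecessor = nodes[(failed_rank - 1) % n]
    replacement.remote_ckpt = copy.deepcopy(predecessor.local_ckpt)
    nodes[failed_rank] = replacement
    # All nodes roll back to the last checkpoint.
    for node in nodes:
        node.state = copy.deepcopy(node.local_ckpt)
```

Because two copies of every checkpoint exist on different nodes, any single-node failure can be recovered without touching the file system, which matches the common case the slides optimize for.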
2 Asynchronous Checkpoint/Restart
Blocking Checkpoint
- Each node has a buddy node to store the checkpoint.
- Computation resumes only after all nodes have successfully saved their checkpoints in their buddy nodes.
[Figure: timeline for NODE 1 and NODE 2 showing the checkpoint interval (τ), the barrier, the blocking checkpoint overhead (δ), and the point where the checkpoint is done]
Semi-blocking Checkpointing
- Resume computation as soon as each node stores its own checkpoint (local checkpoint).
- Interleave the transmission of the checkpoint to the buddy with application execution (remote checkpoint).
[Figure: timeline for NODE 1 and NODE 2 showing the checkpoint interval (τ), the barrier, the local checkpoint overhead (δ), the overlap period (θ), the remote checkpoint interference (φ), and the points where the local and remote checkpoints are done]
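The semi-blocking idea can be sketched as follows: only the local copy blocks the application, while the transfer to the buddy runs in the background. The class name `SemiBlockingCheckpointer` and the `send_to_buddy` callback are hypothetical; a Python thread stands in for the asynchronous network transfer that a real runtime would perform.

```python
import copy
import threading


class SemiBlockingCheckpointer:
    def __init__(self, send_to_buddy):
        # send_to_buddy is a hypothetical transport callback (assumption).
        self.send_to_buddy = send_to_buddy
        self.local_ckpt = None
        self._transfer = None

    def checkpoint(self, state):
        # Make sure the previous remote checkpoint has finished first.
        self.wait_remote()
        # Blocking part: take the local checkpoint.
        self.local_ckpt = copy.deepcopy(state)
        # Non-blocking part: ship the snapshot to the buddy in the background.
        self._transfer = threading.Thread(
            target=self.send_to_buddy, args=(self.local_ckpt,)
        )
        self._transfer.start()

    def wait_remote(self):
        # Returns once the remote checkpoint (if any) is complete.
        if self._transfer is not None:
            self._transfer.join()
            self._transfer = None
```

After `checkpoint` returns, the application may keep mutating its state, because the transfer works on the snapshot rather than on live data. The checkpoint only counts as complete once `wait_remote` has returned.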
Single Checkpoint Overhead
[Figure: checkpoint overhead (s) versus number of cores (128, 256, 512, 1024) for blocking and semi-blocking checkpoint; panels: Wave2D weak scaling and ChaNGa strong scaling]
Semi-blocking checkpointing reduces checkpoint overhead significantly.
Leveraging Solid State Drives
Solid state drives (SSDs) are becoming increasingly available on individual nodes.
- Full SSD strategy
- Half SSD strategy
  - Only store the remote checkpoint in SSD
  - Faster checkpoint and restart
Asynchronous Checkpointing to SSD with IO thread
IO threads
- Write the checkpoint to, or read the checkpoint from, the SSD when a request arrives from a worker thread.
- Notify the worker thread when the SSD is done with a given request.
[Figure: several worker threads send "write to SSD" requests to an IO thread, which notifies them when the checkpoint finishes]
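The worker/IO-thread interaction described above resembles a producer-consumer queue. The sketch below is an illustration only: the function names, the pickle-based file format, and the use of an ordinary file path in place of an SSD mount are all assumptions, not the implementation from the talk.

```python
import os
import pickle
import queue
import threading


def io_thread_main(requests):
    # Serve write requests until a None sentinel arrives.
    while True:
        req = requests.get()
        if req is None:
            break
        path, payload, done = req
        with open(path, "wb") as f:
            pickle.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the data reached the device
        done.set()  # notify the worker that this request has finished


def start_io_thread():
    requests = queue.Queue()
    thread = threading.Thread(target=io_thread_main, args=(requests,), daemon=True)
    thread.start()
    return requests, thread


def request_checkpoint_write(requests, path, payload):
    # Called by a worker thread; returns immediately with a completion event.
    done = threading.Event()
    requests.put((path, payload, done))
    return done
```

A worker would call `request_checkpoint_write`, continue computing, and later call `done.wait()` before relying on that checkpoint. The payload must not be modified until the event is set.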
Checkpoint/Restart on SSD
[Figure: timing penalty (s) and restart time (s) versus checkpoint size per node (0.45, 1.34, 2.23 GB); series: in-memory, half-aio, full-aio, half-sio, full-sio]
aio: asynchronous IO; sio: synchronous IO
- The half SSD strategy with asynchronous IO reduces the timing penalty for checkpointing to SSD.
- Restart from SSD does not incur extra overhead.
3 Replication enhanced Checkpoint Restart
New challenge: soft error
- Not just from cosmic rays
- The sensitivity of computer electronics to radiation increases as their dimensions and operating voltages decrease, a trend driven by the requirements for high performance and low power.
- What may happen if the soft failure rate keeps increasing?
[Figure: vulnerability (0 to 1) versus number of sockets (2K to 1024K) for FIT rates (soft data corruption) of 1, 10, 100, 1000, and 10000]
Partition framework in Charm++
- Ranking
  - Local rank
  - Global rank
- Inter-partition communication
  - CmiInterSyncSend(local rank, partition, size, message)
  - CmiInterSyncSendAndFree(local rank, partition, size, message)
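The slides name local and global ranks without giving the relation between them. One plausible relation, shown here purely as an assumption, is a contiguous block layout in which every partition has the same size. The helper names below are hypothetical and are not Charm++ calls; they only illustrate why an inter-partition send is addressed by a (local rank, partition) pair.

```python
def global_rank(local_rank, partition, partition_size):
    # Assumption: partitions are equal-sized and occupy contiguous rank ranges.
    if not 0 <= local_rank < partition_size:
        raise ValueError("local rank out of range for this partition size")
    return partition * partition_size + local_rank


def local_rank_and_partition(rank, partition_size):
    # Inverse mapping: split a global rank into (local rank, partition).
    partition, local_rank = divmod(rank, partition_size)
    return local_rank, partition
```

Under this assumed layout, local rank 3 in partition 1 of an 8-rank partition corresponds to global rank 11, and the same local rank in a different partition is the natural peer for replica-to-replica messages.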
Replication enhanced Fault Tolerance Overview
- Periodic soft data corruption detection
- Automatically correct soft errors from checkpoint
- Yes, there are benefits for hard failure!
  - No need for remote checkpointing
Replication enhanced Fault Tolerance Overview
Extension from the double in-memory checkpointing
[Figure: Replica 1 and Replica 2, each with Node A and Node B holding objects, a local checkpoint, and a remote checkpoint; a buddy link connects the two replicas]
Replication enhanced Fault Tolerance Overview
[Figure: timeline from job start for replica 1 and replica 2, with events marked T1 to T3; legend: application execution, transfer checkpoint for soft error detection, checkpoint, recovery]
- Checkpoints are transferred between replicas for soft error detection.
- Hard error: detected by replica 2; replica 2 sends checkpoints to replica 1 for recovery.
- Soft error detected: both replicas roll back.
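The timeline above can be mimicked with a toy model of two replicas. This is a simplified sketch under stated assumptions: the `Replica` class and both helper functions are hypothetical, states are compared directly in one process, and a hard error is modeled as the loss of one replica's state.

```python
import copy


class Replica:
    def __init__(self, state):
        self.state = state
        # Last checkpoint that both replicas agreed on.
        self.verified_ckpt = copy.deepcopy(state)


def detect_and_commit(r1, r2):
    # Compare the two replicas' checkpoints; commit on agreement,
    # otherwise roll both back to the last verified checkpoint.
    if r1.state == r2.state:
        r1.verified_ckpt = copy.deepcopy(r1.state)
        r2.verified_ckpt = copy.deepcopy(r2.state)
        return True
    r1.state = copy.deepcopy(r1.verified_ckpt)
    r2.state = copy.deepcopy(r2.verified_ckpt)
    return False


def recover_from_hard_error(crashed, healthy):
    # The healthy replica sends its checkpoint so the crashed one can resume.
    crashed.state = copy.deepcopy(healthy.state)
    crashed.verified_ckpt = copy.deepcopy(healthy.verified_ckpt)
```

With only two replicas a mismatch reveals that a corruption happened but not which side is wrong, which is why both replicas roll back in this model.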
Initial Result: soft error detection overhead
[Figure: checkpoint time (s) versus number of cores per replica (1k, 2k, 4k, 8k, 16k); panels: Jacobi3D AMPI and LeanMD]
Optimization
Topology aware mapping
[Figure: placement of Replica 1 nodes and Replica 2 nodes, with the number of inter-replica messages (0 to 4) on each link; panels: (a) Default-mapping, (b) Optimal-mapping]
Optimization
Checksum
- Issue with checksum: how to handle floating point round-off error?
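The slide leaves the round-off question open. One possible approach, shown here only as an illustration and not as the method used in the talk, is to discard a few low-order mantissa bits of each double before computing the checksum, so that values differing only by round-off map to the same bytes. The function names and the `dropped_bits` parameter are hypothetical.

```python
import struct
import zlib


def truncated_bits(x, dropped_bits):
    # Reinterpret the double as a 64-bit integer and clear the low-order bits.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    mask = ~((1 << dropped_bits) - 1) & 0xFFFFFFFFFFFFFFFF
    return bits & mask


def checksum(values, dropped_bits=0):
    # CRC32 over the (optionally truncated) bit patterns of all values.
    crc = 0
    for x in values:
        crc = zlib.crc32(struct.pack(">Q", truncated_bits(x, dropped_bits)), crc)
    return crc
```

For example, `0.1 + 0.2` and `0.3` differ in their last bit, so their exact checksums differ, while their checksums with `dropped_bits=8` agree. Truncation is not a complete answer: two nearly equal values can still straddle a truncation boundary and produce different checksums, and dropping bits also hides genuine corruption confined to those bits.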
Result: after optimization
[Figure: time (s) versus number of cores per replica (1k, 2k, 4k, 8k, 16k) for checkpoint, checksum, and optimal mapping; panels: Jacobi3D AMPI and LeanMD]
Result: recovery from hard failures
[Figure: recovery time (s) versus number of cores per replica (1k, 2k, 4k, 8k, 16k) for default and optimal mapping; panels: Jacobi3D AMPI and LeanMD]
Thanks! Questions?