Charm++ Workshop
Scalable in-memory checkpoint for hard and soft error protection with automatic restart on failure
Xiang Ni
Parallel Programming Laboratory University of Illinois at Urbana-Champaign
May, 2013
Fault Tolerance Philosophy in Charm++
[Figure: execution with Fault Tolerance Support versus No Fault Tolerance Support. Labels: 100%, Slowdown, Checkpoint, Failure, Recovery.]
[Figure: Frequency (%), log scale from 0.1 to 100, for System 12, System 18, System 19, System 20, System 21, MPP2, Tsubame, and Mercury. Series: 1 node, 2 nodes, 3 nodes, 4 nodes, > 4 nodes.]
[Figure: legend: Objects, Remote Checkpoint, Local Checkpoint. B is the buddy of A.]
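The buddy arrangement named in the figure can be sketched in a few lines. This is a minimal illustration and not Charm++ code: the `Node` class, the function names, and the ring assignment `(node_id + 1) % n` are assumptions made for the example. The slide only states that a node has a buddy and that there is a local and a remote checkpoint. Rolling every node back on recovery is also an assumption of this sketch.

```python
import copy

class Node:
    """Hypothetical node: live objects plus two in-memory checkpoints."""
    def __init__(self, node_id, objects):
        self.node_id = node_id
        self.objects = objects      # live application state
        self.local_ckpt = None      # copy of this node's own state
        self.remote_ckpt = None     # copy of the state of the node we are buddy for

def buddy_of(node_id, n):
    # Assumption: buddies are assigned in a ring.
    return (node_id + 1) % n

def checkpoint(nodes):
    """Each node saves its state locally and on its buddy."""
    n = len(nodes)
    for node in nodes:
        node.local_ckpt = copy.deepcopy(node.objects)
        nodes[buddy_of(node.node_id, n)].remote_ckpt = copy.deepcopy(node.objects)

def recover(nodes, failed_id):
    """Rebuild a failed node from its buddy, then roll all nodes back."""
    n = len(nodes)
    buddy = nodes[buddy_of(failed_id, n)]
    replacement = Node(failed_id, None)
    # The failed node's own checkpoint survives on its buddy.
    replacement.local_ckpt = copy.deepcopy(buddy.remote_ckpt)
    # The copy it held for its predecessor is rebuilt from the predecessor.
    predecessor = nodes[(failed_id - 1) % n]
    replacement.remote_ckpt = copy.deepcopy(predecessor.local_ckpt)
    nodes[failed_id] = replacement
    for node in nodes:
        node.objects = copy.deepcopy(node.local_ckpt)
```

Because every checkpoint exists in two places, this sketch survives the loss of any single node; losing a node and its buddy at the same time would lose that node's state.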
Asynchronous Checkpoint/Restart
[Figure: timeline for NODE 1 and NODE 2. Labels: barrier, checkpoint done, α, β.]
[Figure: timeline for NODE 1 and NODE 2. Labels: barrier, local checkpoint done, remote checkpoint done, α, β.]
[Figure: two plots of Checkpoint Overhead (s) versus Number of Cores (128, 256, 512, 1024), each comparing blocking checkpoint with semiblocking checkpoint.]
[Figure: four worker threads and one IO thread. Labels: SSD request, write to SSD, Checkpoint finishes.]
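The worker-thread and IO-thread split in the figure can be illustrated with a request queue. This is a sketch under assumptions, not the Charm++ implementation: the function names, the file naming, and the use of Python's `queue` and `threading` modules are choices made for the example. A worker hands a snapshot to the IO thread and continues computing while the write proceeds.

```python
import os
import queue
import threading

def io_thread_main(requests):
    """Drain checkpoint requests and write each one to storage."""
    while True:
        item = requests.get()
        if item is None:               # sentinel: no more checkpoints
            break
        path, payload, done = item
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())       # push the data out to the device
        done.set()                     # signals "Checkpoint finishes"

def request_checkpoint(requests, path, state):
    """Called by a worker: enqueue a snapshot and return immediately."""
    done = threading.Event()
    requests.put((path, bytes(state), done))   # bytes() copies the snapshot
    return done

def demo(directory, n_workers=4):
    requests = queue.Queue()
    io = threading.Thread(target=io_thread_main, args=(requests,))
    io.start()
    events = []
    for w in range(n_workers):
        state = bytearray([w] * 8)
        path = os.path.join(directory, f"worker{w}.ckpt")
        events.append(request_checkpoint(requests, path, state))
        state[0] = 255                 # the worker keeps computing
    for e in events:
        e.wait()                       # all checkpoints are on storage
    requests.put(None)
    io.join()
    result = []
    for w in range(n_workers):
        with open(os.path.join(directory, f"worker{w}.ckpt"), "rb") as f:
            result.append(f.read())
    return result
```

Copying the snapshot before enqueueing it is what lets the worker modify its state while the write is still in flight; the price is the extra memory held until the write completes.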
[Figure: two plots against Checkpoint Size/Node (GB) at 0.45, 1.34, and 2.23. First: Timing Penalty (s) for half-aio, full-aio, half-sio, and full-sio. Second: Restart Time (s) for in-memory, half-aio, and full-aio.]
Replication enhanced Checkpoint Restart
[Figure: FIT rate (soft data corruption), log scale from 1 to 10000, and Vulnerability, 0.1 to 1, versus Number of Sockets (2K to 1024K).]
[Figure: timeline (Job Starts, T1, T2, T3) for replica 1 and replica 2. Phases: application execution, checkpoint, recovery. Annotations: transfer checkpoint for soft error detection; hard error; hard error detected by replica 2; replica 2 sends checkpoints to replica 1 for recovery; soft error detected, both replicas roll back.]
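The two recovery paths annotated on the timeline can be sketched as follows. This is a minimal illustration assuming checkpoints are plain Python values that can be compared for equality; the function names and the notion of a "last verified" checkpoint are assumptions of the example, not names taken from the slides.

```python
import copy

def verify_checkpoints(ckpt1, ckpt2, last_verified):
    """Compare the two replicas' checkpoints for soft error detection.

    Returns (state1, state2, last_verified, outcome).
    """
    if ckpt1 == ckpt2:
        # Replicas agree: this checkpoint becomes the new rollback point.
        return ckpt1, ckpt2, copy.deepcopy(ckpt1), "ok"
    # Mismatch: one replica is corrupted, but a two-way comparison cannot
    # tell which, so both replicas roll back.
    return (copy.deepcopy(last_verified), copy.deepcopy(last_verified),
            last_verified, "soft error: both replicas roll back")

def recover_hard_error(healthy_ckpt):
    """The healthy replica sends its checkpoint to the crashed replica."""
    return copy.deepcopy(healthy_ckpt)
```

For example, `verify_checkpoints({"x": 1}, {"x": 9}, {"x": 0})` reports a soft error and returns `{"x": 0}` as the state for both replicas.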
[Figure: two plots of Time (s) versus Number of Cores per Replica (1k to 16k). Series: checkpoint.]
[Figure: Replica 1 nodes and Replica 2 nodes under (a) Default-mapping and (b) Optimal-mapping, showing # inter-replica messages [0-4].]
[Figure: two plots of Time (s) versus Number of Cores per Replica (1k to 16k). Series: checkpoint, checksum.]
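One plausible reading of the checkpoint-versus-checksum comparison is that replicas exchange a short checksum of the checkpoint instead of the checkpoint itself. The sketch below illustrates that idea only. The slides do not name the checksum algorithm, so SHA-256 from Python's `hashlib` is a stand-in, and the function names are hypothetical.

```python
import hashlib

def checkpoint_digest(ckpt: bytes) -> bytes:
    """Fixed-size digest of a serialized checkpoint (SHA-256 is an assumption)."""
    return hashlib.sha256(ckpt).digest()

def replicas_agree(local_ckpt: bytes, remote_digest: bytes) -> bool:
    """Compare the local checkpoint against the digest sent by the other replica."""
    return checkpoint_digest(local_ckpt) == remote_digest
```

With this scheme only 32 bytes cross the replica boundary per comparison, regardless of checkpoint size, at the cost of computing the digest on both sides.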
[Figure: two plots of Time (s) versus Number of Cores per Replica (1k to 16k). Series: default.]