Fault Tolerance Support for Supercomputers with Multicore Nodes
Esteban Meneses Xiang Ni
Monday, April 18, 2011
Exascale Supercomputer: 100 million cores
“an Exascale system could be expected to have a failure ... every 35–39 minutes”
Exascale Computing Study
“insufficient resilience of the software infrastructure would likely render extreme scale systems effectively unusable”
The International Exascale Software
Node X Node Y Node Z
Proactive, Checkpoint/Restart, Message Logging, SMP Checkpoint/Restart, SMP Message Logging
Node W
Node X Node Y Node Z Predictor
Node X’
Node W Node X Node Y Node Z
[Figure: time per step (s) versus timestep (1 to 601), with and without load balancing (LB)]
Restart with or without spare nodes
Checkpoint in buddy’s memory
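The two points above (a checkpoint kept in a buddy's memory, and restart with or without a spare node) can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Node` class, the disjoint pairing in `pair_buddies`, and the dictionary-of-objects state are hypothetical and are not taken from the slides or from any runtime system.

```python
from copy import deepcopy


class Node:
    """Simulated node: holds its own state plus its buddy's checkpoint in memory."""

    def __init__(self, name, state):
        self.name = name
        self.state = state            # objects currently hosted on this node
        self.own_checkpoint = None    # local copy of this node's last checkpoint
        self.buddy_checkpoint = None  # checkpoint stored on behalf of the buddy
        self.buddy = None


def pair_buddies(nodes):
    """Assumed pairing: consecutive nodes form disjoint buddy pairs."""
    for a, b in zip(nodes[0::2], nodes[1::2]):
        a.buddy, b.buddy = b, a


def checkpoint(nodes):
    """Every node saves its state locally and in its buddy's memory."""
    for node in nodes:
        node.own_checkpoint = deepcopy(node.state)
        node.buddy.buddy_checkpoint = deepcopy(node.state)


def fail_and_restart(failed, survivors, spare=None):
    """Recover from the crash of `failed` and return the node hosting its objects.

    Survivors roll back to their own local checkpoint. The failed node's
    checkpoint is read from its buddy. With a spare node, the spare receives
    it; without one, the buddy hosts the recovered objects next to its own.
    """
    recovered = deepcopy(failed.buddy.buddy_checkpoint)
    for node in survivors:
        node.state = deepcopy(node.own_checkpoint)
    if spare is not None:
        spare.state = recovered
        return spare
    failed.buddy.state = {**failed.buddy.state, **recovered}
    return failed.buddy


if __name__ == "__main__":
    x, y = Node("X", {"x0": 1}), Node("Y", {"y0": 2})
    pair_buddies([x, y])
    checkpoint([x, y])
    x.state["x0"], y.state["y0"] = 99, 50   # progress made after the checkpoint
    host = fail_and_restart(x, [y])         # Node X fails, no spare available
    print(host.name, host.state)
```

The sketch leaves out what a real system would also need, such as choosing a new buddy after the failure and rebalancing load when no spare is available.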
Node W Node X Node Y Node Z
Team-based Message Logging
Parallel Restart
Team Q Team R
[Figure: execution time (seconds) and memory overhead (MB) for NoLB(8), GreedyLB(8), TeamLB(1), and TeamLB(8)]
Team-based Load Balancer
Node X PE A PE B PE C PE D
Shared Memory (SM)
The minimum unit of failure is a node
Single node failure support
Node X PE A PE B
SM
Node Y PE C PE D
SM
Causal Message Logging ➙ determinants in shared memory
Load balancing ➙ increase communication inside a node
Lock contention ➙ hybrid scheme
[Figure: Checkpoint/Restart, Ring (Abe): time (seconds) versus number of cores (64 to 1024) for 8 MB, 80 MB, and 400 MB]
[Figure: Checkpoint/Restart, Jacobi (Abe): progress (iteration) versus time (seconds)]
[Figure: Jacobi (Ranger): time per iteration (seconds) versus number of cores (128 to 1024), Message Logging and Checkpoint/Restart]
[Figure: frequency (0.1% to 100%) by number of nodes (1, 2, 3, 4, > 4) for Tsubame, MPP2, and Mercury]
A B C D E F G H
A B C D E F G H
[Figure: Multiple Failure Survivability (n=1024): probability versus number of multiple concurrent failures, for degree=2, degree=4, degree=8, and degree=16]
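The simplest case behind a survivability curve like this one can be worked out directly. The sketch below covers only pairwise buddy checkpointing and rests on assumptions that the slides do not state: the n nodes form n/2 disjoint buddy pairs, failed nodes are picked uniformly at random, and a set of failures is survivable as long as no node fails together with its buddy. The function name is hypothetical, and the sketch does not model the degree=2 to degree=16 curves.

```python
def survivable_probability(n, k):
    """Probability that k concurrent node failures leave every buddy pair
    with at least one live member.

    Assumes n is even, nodes form n/2 disjoint buddy pairs, and the k failed
    nodes are drawn uniformly at random without replacement.
    """
    if k > n // 2:
        return 0.0  # more failures than pairs: some pair must be lost entirely
    prob = 1.0
    for i in range(k):
        # The next failure must avoid the i nodes already down and their
        # i buddies, out of the n - i nodes still running.
        prob *= (n - 2 * i) / (n - i)
    return prob


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(k, round(survivable_probability(1024, k), 6))
```

For two failures the result is 1 - 1/(n - 1), since the only fatal outcome is that the second failure hits the buddy of the first.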
f(x) = (1 - p)^{x-1} p
[Figure: Multiple Failure Distribution (n=1024), p=0.7: probability versus number of multiple concurrent failures]
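The distribution above is a geometric distribution, and a few values make its shape concrete. The sketch below assumes that x counts the nodes failing together in one event (x = 1, 2, 3, ...) and uses p = 0.7 from the plot. The function names are hypothetical.

```python
def failure_pmf(x, p):
    """f(x) = (1 - p)^(x - 1) * p, defined for x = 1, 2, 3, ..."""
    if x < 1:
        raise ValueError("x must be at least 1")
    return (1 - p) ** (x - 1) * p


def more_than(k, p):
    """P(X > k) = (1 - p)^k: probability that more than k nodes fail together."""
    return (1 - p) ** k


if __name__ == "__main__":
    p = 0.7
    for x in range(1, 6):
        print(f"f({x}) = {failure_pmf(x, p):.5f}")
    print(f"P(X > 4) = {more_than(4, p):.5f}")
```

With p = 0.7, single-node failures carry 70% of the probability mass and each additional concurrent failure is 0.3 times as likely as the previous count.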
Protocol                S
Checkpoint/Restart      0.999402
Message Logging (2)     0.997624
Message Logging (4)     0.995285
Message Logging (8)     0.990716
Message Logging (16)    0.981973
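One plausible reading of S is the probability of surviving a failure event, averaged over the multiple-failure distribution f(x) = (1 - p)^{x-1} p. The sketch below tries that reading for Checkpoint/Restart only. It assumes n = 1024 nodes in disjoint buddy pairs, p = 0.7, uniformly random failed nodes, and an event that is survivable when no node fails together with its buddy. The function name and the truncation at `max_failures` are assumptions, and the Message Logging rows are not modeled.

```python
def expected_survivability(n, p, max_failures=64):
    """Sum over x of f(x) * P(x concurrent failures are survivable).

    f(x) = (1 - p)^(x - 1) * p is the assumed multiple-failure distribution.
    Survivability assumes n/2 disjoint buddy pairs: the x-th failed node must
    avoid the x - 1 nodes already down and their x - 1 buddies.
    Requires max_failures <= n.
    """
    total = 0.0
    survive = 1.0  # running probability that x failures are survivable
    for x in range(1, max_failures + 1):
        survive *= max(n - 2 * (x - 1), 0) / (n - (x - 1))
        total += (1 - p) ** (x - 1) * p * survive
    return total


if __name__ == "__main__":
    print(f"S = {expected_survivability(1024, 0.7):.6f}")
```

Under these assumptions the sum comes out at about 0.9994, which appears consistent with the Checkpoint/Restart entry in the table, although the slides do not spell out the model used.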