October 6, 2016
Multi-level checkpointing and silent data corruption
Anne Benoit 2, Franck Cappello 1, Aurélien Cavelan 2, Sheng Di 1, Hongyang Sun 2, Yves Robert 2, Frédéric Vivien 2
1 Argonne National Laboratory 2 INRIA
◮ Component failure (node, network, power, ...)
◮ Application fails and data is lost
◮ 2013: Pre-production Blue Waters requires repairs every ≈ 4 hours [2, 1]
◮ 2014: Titan loses a node every ≈ 1.5 days [2, 3, 1]
◮ 2014: Blue Waters loses ≈ 2 nodes per day [1]
Instantaneous error detection
Standard approach: periodic checkpoint, rollback, and recovery:
Time
C W C W C W C
Time
C C W C W C
Time
C R W C W C
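The checkpoint, rollback, and recovery cycle shown in the timelines above can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the authors' simulator: the function name and parameters are hypothetical, failures are drawn from an exponential distribution, detection is instantaneous, no failure strikes during recovery, and neither downtime nor the initial checkpoint is counted.

```python
import random


def simulate_periodic_checkpointing(total_work, period, ckpt_cost,
                                    recovery_cost, failure_rate, seed=0):
    """Return the wall-clock time needed to complete `total_work` seconds of work.

    Each segment is `period` seconds of work (W) followed by a checkpoint (C).
    A failure during W or C discards the segment; a recovery (R) is paid and
    the segment restarts from the last checkpoint.
    """
    rng = random.Random(seed)
    clock = 0.0
    done = 0.0  # amount of work saved by the last successful checkpoint
    if failure_rate > 0:
        next_failure = rng.expovariate(failure_rate)
    else:
        next_failure = float("inf")  # error-free execution
    while done < total_work:
        chunk = min(period, total_work - done)
        segment_end = clock + chunk + ckpt_cost
        if next_failure >= segment_end:
            # W and C both completed: the work is now safe
            clock = segment_end
            done += chunk
        else:
            # failure detected at once: roll back, recover, draw the next failure
            clock = next_failure + recovery_cost
            next_failure = clock + rng.expovariate(failure_rate)
    return clock


# Error-free run: 3 segments of (100 + 10) seconds each
print(simulate_periodic_checkpointing(300, 100, 10, 5, 0.0))
# Same job with a failure rate of 0.01 per second
print(simulate_periodic_checkpointing(300, 100, 10, 5, 0.01, seed=1))
```

Averaging the result over many seeds gives an estimate of the expected execution time for a given period.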
◮ Different kinds of checkpoints: local disk storage,
◮ Different kinds of errors: node failure, router failure, etc.
◮ Each checkpoint has a cost and some resilience capabilities
Time
C W1 C W2 C W3 C
Time
C W1 C R W2 C W3 C
Time
C C R W1 C W2 C W3 C
◮ Type-1 faults follow an exponential distribution with failure rate λ1
◮ Type-2 faults follow an exponential distribution with failure rate λ2
◮ Type-2 checkpoints take time C2 (recovery R2)
◮ Type-1 checkpoints take time C1 (recovery R1)
Other assumptions
◮ A fault of type i is followed by a downtime and a type-i recovery
◮ No faults during recoveries
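The two-rate fault model above can be explored with a short sampling sketch. It assumes the two fault types are independent (an assumption of this sketch; the function names are hypothetical). Under that assumption the delay to the next fault of either type is exponential with rate λ1 + λ2, and the fault is of type 1 with probability λ1 / (λ1 + λ2).

```python
import random


def next_fault(rate1, rate2, rng):
    """Sample the delay until the next fault and its type (1 or 2)."""
    t1 = rng.expovariate(rate1)  # delay to the next type-1 fault
    t2 = rng.expovariate(rate2)  # delay to the next type-2 fault
    return (t1, 1) if t1 <= t2 else (t2, 2)


def empirical_stats(rate1, rate2, samples=100_000, seed=0):
    """Return (mean delay to next fault, fraction of faults that are type 1)."""
    rng = random.Random(seed)
    total_delay = 0.0
    type1_count = 0
    for _ in range(samples):
        delay, kind = next_fault(rate1, rate2, rng)
        total_delay += delay
        if kind == 1:
            type1_count += 1
    return total_delay / samples, type1_count / samples


mean_delay, type1_fraction = empirical_stats(1.0, 3.0)
print(mean_delay)      # close to 1 / (1.0 + 3.0)
print(type1_fraction)  # close to 1.0 / (1.0 + 3.0)
```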
◮ Pattern: work of some size W divided into K chunks
w1 C1 w2 C1 w3 C1 ... C1 wK C1 C2
◮ Objective: overhead minimization
◮ First property:
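The pattern layout above (K chunks, each followed by a type-1 checkpoint, then one type-2 checkpoint) can be written out as a short sketch. The helper names are hypothetical, and the overhead computed here covers only the error-free case with equal-size chunks, which is a simplification of the objective stated on the slide.

```python
def pattern_layout(num_chunks):
    """Return the pattern as a string: w1 C1 w2 C1 ... wK C1 C2."""
    parts = []
    for i in range(1, num_chunks + 1):
        parts += [f"w{i}", "C1"]  # every chunk ends with a type-1 checkpoint
    parts.append("C2")            # the pattern ends with a type-2 checkpoint
    return " ".join(parts)


def error_free_overhead(chunk_size, num_chunks, c1, c2):
    """Checkpoint time divided by useful work, when no fault strikes."""
    work = num_chunks * chunk_size
    checkpoints = num_chunks * c1 + c2
    return checkpoints / work


print(pattern_layout(3))
print(error_free_overhead(chunk_size=100, num_chunks=4, c1=5, c2=30))
```

Larger chunks lower this error-free overhead but increase the work lost per fault, which is why the optimal chunk size and K have to balance the two.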
◮ Chunks have size wopt where:
◮ There are K chunks in a pattern where:
◮ Missing notations
λ = λ1 + λ2,
β = R(1 + L(e^(λC2) − 1)),
◮ Ugly implicit equations: solve them numerically!
◮ Total size of job: Wtotal
◮ Chunks have the same size wopt as before
◮ There are p∗ patterns where:
◮ Ugly implicit equations: solve them numerically!
[Figure: nine panels (Case 1 to Case 9) plotting wall-clock time (seconds) against Wopt (seconds), one curve per Wopt2 value. Wopt2 values, in steps of 20: Case 1, 1215 to 1375; Case 2, 693 to 853; Case 3, 631 to 791; Case 4, 406 to 566; Case 5, 239 to 399; Case 6, 420 to 580; Case 7, 333 to 493; Case 8, 370 to 530; Case 9, 370 to 530.]
◮ We know how to use two-level checkpointing efficiently
◮ What about silent data corruption?
◮ Bit flip (Disk, RAM, Cache, Bus, ...)
◮ Problems: detection latency, potentially wrong results
◮ 2002: ASCI Q at Los Alamos, unprotected address bus
◮ 2003: Virginia Tech, 1,100 Apple Power Mac G5, no ECC
◮ 2010: ECC-protected Jaguar saw 350 bit flips/min [3]
◮ 2010: ECC-protected Jaguar saw 1 double-bit error/day [3]
◮ 2014: Titan reported > 1 double-bit error per week [4]
Time W W Detect! corrupted! silent error
C C C
Time W W Detect! corrupted! corrupted?
C C C
General-purpose approaches
◮ Replication [Fiala et al. 2012] or triple modular redundancy and voting [Lyons and Vanderkulk 1962]
Application-specific approaches
◮ Algorithm-based fault tolerance (ABFT): checksums in dense matrices; limited to one error detection and/or correction in practice [Huang and Abraham 1984]
◮ Partial differential equations (PDE): use lower-order scheme as verification mechanism [Benson, Schmit and Schreiber 2014]
◮ Generalized minimal residual method (GMRES): inner-outer iterations [Hoemmen and Heroux 2011]
◮ Preconditioned conjugate gradients (PCG): orthogonalization check every k iterations, re-orthogonalization if problem detected [Sao and Vuduc 2013, Chen 2013]
Data-analytics approaches
◮ Dynamic monitoring of HPC datasets based on physical laws (e.g., temperature limit, speed limit) and space or temporal proximity [Bautista-Gomez and Cappello 2014]
◮ Time-series prediction, spatial multivariate interpolation [Di et al. 2014]
Solution: coupling checkpointing with verification
Time W W Error Detection
V ∗ C V ∗ C V ∗ C
◮ Before each checkpoint, run some verification mechanism or error detection test
◮ Silent error, if any, is detected by verification
◮ Last checkpoint is always valid
Problem solved! But we can do better than that!
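The verify-then-checkpoint rule above can be sketched as a short loop. This is a toy model under stated assumptions: the application state is just a count of completed segments, a silent error is modeled by a random corruption flag, and the verification is assumed to be guaranteed (it always sees the flag). All names are hypothetical.

```python
import random


def run_with_verified_checkpoints(num_segments, error_prob, seed=0):
    """Run `num_segments` segments, verifying before every checkpoint.

    Returns (segments saved in the last checkpoint, segments executed).
    `error_prob` must be below 1, otherwise no segment can ever be validated.
    """
    rng = random.Random(seed)
    checkpoint = 0  # last checkpoint; it only ever holds verified state
    executed = 0
    while checkpoint < num_segments:
        state = checkpoint + 1                 # W: compute one segment
        corrupted = rng.random() < error_prob  # a silent error may strike
        executed += 1
        if corrupted:
            # V*: the verification detects the error, so nothing is saved
            # and execution restarts from the last (valid) checkpoint
            continue
        checkpoint = state                     # C: save the verified state
    return checkpoint, executed


print(run_with_verified_checkpoints(5, 0.0))  # no errors: 5 segments executed
print(run_with_verified_checkpoints(5, 0.3))  # errors force re-execution
```

Because a checkpoint is written only after a successful verification, a rollback never has to go further back than one checkpoint.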
Time Error Detection
V ∗ C V ∗ V ∗ V ∗ C V ∗ V ∗ V ∗ C
◮ Pro: silent error detected earlier in pattern
◮ Con: additional overhead in error-free executions
◮ Need to find the best trade-off
◮ Not all verification mechanisms have 100% accuracy!
◮ Lower accuracy: recall r = (#detected errors) / (#total errors)
◮ Lower cost, i.e., V < V∗
Time Error Detect? Detect!
V ∗ C V1 V2 V ∗ C V1 V2 V ∗ C
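The effect of recall on detection can be illustrated with a small calculation. The sketch assumes that successive partial verifications detect an existing error independently, each with probability r; this independence is an assumption of the sketch, and the function names are hypothetical.

```python
def recall(detected_errors, total_errors):
    """Recall r of a verification: fraction of errors that it detects."""
    return detected_errors / total_errors


def detection_probability(r, num_partial):
    """Probability that at least one of `num_partial` partial verifications
    catches an error that struck before the first of them."""
    return 1.0 - (1.0 - r) ** num_partial


r = recall(detected_errors=8, total_errors=10)
for k in range(4):
    # with k = 0 only the guaranteed verification V* can catch the error
    print(k, detection_probability(r, k))
```

Whatever the partial verifications miss is still caught by the guaranteed verification V∗ that closes the segment, only later.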
◮ Disk checkpoint: stable storage (slow but resilient)
◮ Memory checkpoint: local copy (fast but lost on fail-stop)
◮ Fail-stop error ⇒ rollback to last disk checkpoint
◮ Silent errors ⇒ rollback to last memory checkpoint
◮ Combine everything into a single periodic pattern
◮ Minimize the overhead due to faults and to fault-tolerance
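The rollback rules above can be sketched as a small two-level checkpoint store. The class and method names are hypothetical, both levels are simulated with in-process copies, and a disk checkpoint is assumed to be taken together with a memory checkpoint.

```python
import copy


class TwoLevelCheckpoint:
    """Keeps one memory checkpoint and one disk checkpoint of the state."""

    def __init__(self, initial_state):
        self.disk = copy.deepcopy(initial_state)
        self.memory = copy.deepcopy(initial_state)

    def save_memory(self, state):
        """Fast local copy, lost if a fail-stop error strikes."""
        self.memory = copy.deepcopy(state)

    def save_disk(self, state):
        """Slow but resilient copy; the memory copy is refreshed as well."""
        self.memory = copy.deepcopy(state)
        self.disk = copy.deepcopy(state)

    def recover(self, error_kind):
        """Return the state to restart from after an error of the given kind."""
        if error_kind == "fail-stop":
            # memory content is gone: restore it from the disk checkpoint
            self.memory = copy.deepcopy(self.disk)
            return copy.deepcopy(self.disk)
        if error_kind == "silent":
            return copy.deepcopy(self.memory)
        raise ValueError(f"unknown error kind: {error_kind}")


store = TwoLevelCheckpoint({"step": 0})
store.save_disk({"step": 10})
store.save_memory({"step": 12})
print(store.recover("silent"))     # restart from the memory checkpoint
print(store.recover("fail-stop"))  # restart from the disk checkpoint
```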
Time W Pattern
V ∗ CM CD V ∗ CM CD
Time w1 w2 wn W · · · · · ·
V ∗ CM CD V ∗ CM V ∗ CM V ∗ CM V ∗ CM CD
Time wi,1 wi,2 wi,mi W · · · · · ·
V ∗ CM CD V V V V ∗ CM CD
Time w1,1 w1,m1 wn,1 wn,mn w1 wn W · · · · · · · · · · · · · · · · · · · · ·
V ∗ CM CD V V V ∗ CM V ∗ CM V V V ∗ CM CD
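The nested pattern drawn above (n segments, segment i split into mi chunks) can be written out programmatically. This is a sketch with hypothetical helper names: chunks inside a segment are separated by partial verifications V, each segment ends with V* and a memory checkpoint CM, and the pattern ends with a disk checkpoint CD. The cost function covers only the error-free case.

```python
def pattern_layout(chunks_per_segment):
    """Return the pattern as a string, e.g. 'w1,1 V w1,2 V* CM ... CD'."""
    parts = []
    for i, m in enumerate(chunks_per_segment, start=1):
        for j in range(1, m + 1):
            parts.append(f"w{i},{j}")
            # partial verification between chunks, guaranteed one at the end
            parts.append("V" if j < m else "V*")
        parts.append("CM")  # memory checkpoint closes the segment
    parts.append("CD")      # disk checkpoint closes the pattern
    return " ".join(parts)


def error_free_time(chunks_per_segment, work, v, v_star, c_mem, c_disk):
    """Time to run one pattern of total work `work` when no error strikes."""
    num_segments = len(chunks_per_segment)
    num_partial = sum(m - 1 for m in chunks_per_segment)
    return work + num_partial * v + num_segments * (v_star + c_mem) + c_disk


print(pattern_layout([2, 1]))
print(error_free_time([2, 1], work=100, v=1, v_star=5, c_mem=2, c_disk=10))
```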
[Table: optimal W∗, n∗, m∗ and Overhead(Pattern) for the patterns PD, PDV∗, PDV, PDM, PDMV∗, and PDMV, as functions of λs, λf, r, V, V∗, CM, and CD]
[Figure: expected overhead (axis from 0.05 to 0.2), predicted vs. simulated, for patterns PD, PDV*, PDV, PDM, PDMV*, and PDMV; four panels, one per platform: Hera, Atlas, Coastal, Coastal SSD]
◮ We know how to use two-level checkpointing efficiently
◮ Caveat: we assumed full freedom to place checkpoints and
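◮ Application modeled as a linear task graph
◮ Checkpoints and verifications are performed in between tasks
[Figure: linear chain of tasks 1, 2, ..., i, i+1, ..., j, with a partial verification V(1) after task 1, V∗(i) and CM(i) after task i, and V∗(j), CM(j), and CD(j) after task j]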
◮ Question: when to take which checkpoint and verification in
[Figure: dynamic-programming levels Edisk(d2), Emem(d1, m2), Everif(d1, m1, v2), Epartial(d1, m1, v1, p1, v2), E−(d1, m1, v1, p1, p2, v2), Eleft(v1, p1), Eright(d1, m1, v1, p2, v2), over task indices d0, d1, d2, m1, m2, v1, v2, p1, p2]
◮ Optimal solution: O(n^6) dynamic programming algorithm
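To give a feel for this kind of dynamic program, here is a much simpler single-level variant; it is not the O(n^6) algorithm of the talk. It assumes only fail-stop errors with exponential rate `rate`, one kind of checkpoint, no verification, no downtime, and no faults during recoveries. Under these assumptions the expected time to run a segment of work W followed by a checkpoint C is (1/rate + R)(e^(rate·(W + C)) − 1). All names are hypothetical.

```python
import math


def expected_segment_time(work, ckpt, recovery, rate):
    """Expected time to complete `work` plus a checkpoint, retrying on failure."""
    return (1.0 / rate + recovery) * math.expm1(rate * (work + ckpt))


def place_checkpoints(weights, ckpt, recovery, rate):
    """Return (expected makespan, tasks after which to checkpoint), O(n^2).

    best[j] is the optimal expected time to run tasks 1..j with a checkpoint
    taken right after task j; the last task is always checkpointed.
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):  # previous checkpoint taken after task i
            segment = expected_segment_time(prefix[j] - prefix[i],
                                            ckpt, recovery, rate)
            if best[i] + segment < best[j]:
                best[j], prev[j] = best[i] + segment, i
    positions = []
    j = n
    while j > 0:  # walk back through the chosen checkpoints
        positions.append(j)
        j = prev[j]
    return best[n], positions[::-1]


print(place_checkpoints([10, 10, 10], ckpt=100, recovery=0, rate=1e-9))
print(place_checkpoints([10, 10, 10], ckpt=0, recovery=0, rate=0.1))
```

The algorithm in the talk has to choose disk checkpoints, memory checkpoints, guaranteed verifications, and partial verifications at once, which is where the additional nested levels and the higher complexity come from.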
◮ Mix of silent and fail-stop errors
◮ Mix of partial and guaranteed verifications
◮ Results limited to 2 levels...
◮ Exponential failure distribution
◮ S. Di, Y. Robert, F. Vivien, and F. Cappello.
◮ A. Benoit, A. Cavelan, Y. Robert, and H. Sun.
◮ A. Benoit, A. Cavelan, Y. Robert, and H. Sun.