 
Multi-level checkpointing and silent data corruption

Anne Benoit (2), Franck Cappello (1), Aurélien Cavelan (2), Sheng Di (1), Hongyang Sun (2), Yves Robert (2), Frédéric Vivien (2)
(1) Argonne National Laboratory
(2) INRIA

October 6, 2016
Fail-stop errors

Characteristics
◮ Component failure (node, network, power, ...)
◮ Application fails and data is lost

Fault rate proportional to number of components
◮ 2013: Preproduction Blue Waters requires repairs every ≈ 4 hours [2, 1]
◮ 2014: Titan loses a node every ≈ 1.5 days [2, 3, 1]
◮ 2014: Blue Waters loses ≈ 2 nodes per day [1]

F. Vivien - Multi-level checkpointing and silent error detection October 6, 2016 - 2/28
Coping with fail-stop errors

Instantaneous error detection
Standard approach: periodic checkpoint, rollback, and recovery

[Figure: three timelines. Without errors, the execution alternates checkpoints (C) and work segments (W). When a fail-stop error strikes during a work segment, the application rolls back to the last checkpoint, performs a recovery (R), and executes the segment again.]
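The checkpoint, rollback, and recovery cycle can be illustrated with a small simulation. This is a minimal sketch, not taken from the slides: it assumes exponentially distributed errors of rate `lam`, no downtime, no errors during recovery, and that the initial state can also be restored at cost `recovery`. The function name `simulate_run` and all parameter names are hypothetical.

```python
import random

def simulate_run(n_segments, work, ckpt, recovery, lam, rng):
    """Total time to complete n_segments, each made of work W then checkpoint C.

    Assumptions (not from the slides): errors arrive with rate lam,
    recovery itself is error-free, and there is no downtime.
    """
    elapsed = 0.0
    segment = work + ckpt
    for _ in range(n_segments):
        while True:
            # Time until the next fail-stop error (never, if lam == 0).
            next_fault = rng.expovariate(lam) if lam > 0 else float("inf")
            if next_fault >= segment:
                elapsed += segment          # W and C both completed
                break
            # Error: the partial segment is lost, roll back and recover (R).
            elapsed += next_fault + recovery
    return elapsed

rng = random.Random(0)
print(simulate_run(3, 10.0, 1.0, 2.0, 0.0, rng))    # error-free execution
print(simulate_run(3, 10.0, 1.0, 2.0, 0.05, rng))   # with fail-stop errors
```

Without errors the total time is simply the number of segments times W + C; every error adds the lost partial segment plus one recovery.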
Multi-Level Checkpointing

◮ Different kinds of checkpoints: local disk storage, partner-copy, Reed-Solomon encoding technique, file system
◮ Different kinds of errors: node failure, router failure, etc.
◮ Each checkpoint has a cost and some resilience capabilities

[Figure: timelines of work segments W1, W2, W3 separated by checkpoints (C) of different kinds. Two further timelines each show a fail-stop error followed by a recovery (R) from one of the checkpoints.]

When should we checkpoint? Using which mechanism?
Two-level checkpointing: assumptions

Two types of faults
◮ Type-1: follow an exponential distribution of failure rate $\lambda_1$
◮ Type-2: follow an exponential distribution of failure rate $\lambda_2$ (more dramatic faults)

Two types of checkpoints
◮ Type-2 checkpoints take time $C_2$ (recovery $R_2$) and enable recovery from type-1 and type-2 faults (more expensive checkpoints)
◮ Type-1 checkpoints take time $C_1$ (recovery $R_1$) and enable recovery from type-1 faults (cheap checkpoints)

Other assumptions
◮ A fault of type $i$ is followed by a downtime and a type-$i$ recovery
◮ No faults during recoveries
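These assumptions can be turned into a small simulator of one checkpointing pattern. The sketch below is only an illustration under stated assumptions: a pattern is taken to be K chunks of size w, each followed by a type-1 checkpoint, and closed by one type-2 checkpoint; a type-1 fault restarts the current segment, a type-2 fault restarts the whole pattern; downtime D and recoveries are fault-free. The function `simulate_pattern` and its parameter names are hypothetical.

```python
import random

def simulate_pattern(K, w, C1, C2, R1, R2, D, lam1, lam2, rng):
    """Simulated time to complete one pattern: K chunks (w + C1), then C2."""
    lam = lam1 + lam2
    # Segment lengths: K chunks each followed by a type-1 checkpoint,
    # then the closing type-2 checkpoint (layout assumed for illustration).
    segments = [w + C1] * K + [C2]
    elapsed = 0.0
    k = 0  # index of the segment currently being executed
    while k < len(segments):
        next_fault = rng.expovariate(lam) if lam > 0 else float("inf")
        if next_fault >= segments[k]:
            elapsed += segments[k]   # segment completed without fault
            k += 1
            continue
        elapsed += next_fault + D    # lost time plus downtime
        if rng.random() < lam1 / lam:
            elapsed += R1            # type-1 fault: restart current segment
        else:
            elapsed += R2            # type-2 fault: restart the whole pattern
            k = 0
    return elapsed

rng = random.Random(1)
runs = [simulate_pattern(3, 10.0, 1.0, 5.0, 1.0, 5.0, 0.5, 0.02, 0.005, rng)
        for _ in range(1000)]
print(sum(runs) / len(runs))  # empirical mean pattern time
```

Averaging many runs gives an empirical estimate of the expected pattern time, which can never fall below the fault-free time K(w + C1) + C2.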
Execution time of a pattern

◮ Pattern: work of some size $W$ divided in $K$ chunks

[Figure: a pattern $w_1\, C_1\, w_2\, C_1\, w_3\, C_1\, \dots\, w_K\, C_1\, C_2$]

◮ Objective: overhead minimization

$$\mathrm{Overhead}(\mathrm{Pattern}(K, W, w_1, \dots, w_K)) = \frac{E(\mathrm{Pattern}(K, W, w_1, \dots, w_K))}{W} - 1$$

◮ First property: execution time is minimized when all chunks have the same size
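As a small worked example of the overhead definition, the sketch below evaluates it for a pattern in which no fault strikes. This is an assumption made purely for illustration: the fault-free time is only a lower bound on the expected time E(Pattern). The helper names `overhead` and `fault_free_time` are hypothetical.

```python
def overhead(expected_time, work):
    """Overhead of a pattern: expected time per unit of work, minus 1."""
    return expected_time / work - 1

def fault_free_time(chunks, C1, C2):
    """Pattern time when no fault strikes (illustrative lower bound on E):
    every chunk pays one type-1 checkpoint, the pattern pays one type-2."""
    return sum(chunks) + len(chunks) * C1 + C2

chunks = [10.0, 10.0, 10.0]   # K = 3 equal chunks, W = 30
print(overhead(fault_free_time(chunks, 1.0, 5.0), sum(chunks)))
```

With faults, E(Pattern) grows beyond this bound, so the true overhead is strictly larger than the checkpointing cost alone.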
Unknown job length: optimal solution

◮ Chunks have size $w_{opt}$ where:

$$N(w_{opt}) \ln(N(w_{opt})) = \lambda L w_{opt} \left(e^{\lambda(w_{opt} + C_1)} - 1\right)$$

◮ There are $K$ chunks in a pattern where:

$$\beta \lambda K w_{opt}\, e^{\lambda(w_{opt} + C_1)} \left(1 + L\left(e^{\lambda(w_{opt} + C_1)} - 1\right)\right)^{K-1} = \alpha + \frac{\beta}{L} \left(1 + L\left(e^{\lambda(w_{opt} + C_1)} - 1\right)\right)^{K}$$

◮ Missing notations:

$$N(w) = 1 + L\left(e^{\lambda(w + C_1)} - 1\right), \quad L = \frac{\lambda_2}{\lambda}, \quad \lambda = \lambda_1 + \lambda_2,$$

$$\alpha = R\left(e^{\lambda C_2} - 1\right) - \frac{\beta}{L}, \quad \beta = R\left(1 + L\left(e^{\lambda C_2} - 1\right)\right), \quad R = \frac{1 + \lambda_1 R_1 + \lambda_2 R_2}{\lambda} + D$$

◮ Ugly implicit equations: solve them numerically!
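One possible way to get numbers out of this slide is sketched below. It does not solve the implicit equations themselves; as a swapped-in technique, it minimizes the overhead directly. It rests on two assumptions that the slides do not state explicitly: the flattened fractions are read as L = λ2/λ, β/L, and R = (1 + λ1 R1 + λ2 R2)/λ + D, and the right-hand side of the second condition, α + (β/L) N(w)^K, is taken to be the expected time of a pattern of K equal chunks. The names `make_model`, `golden_min`, `optimize`, the cap `K_max`, and the search interval for w are all hypothetical; the sketch also requires λ2 > 0.

```python
import math

def make_model(lam1, lam2, C1, C2, R1, R2, D):
    """Build the overhead function from the slide's notations (as read above)."""
    lam = lam1 + lam2
    L = lam2 / lam
    R = (1 + lam1 * R1 + lam2 * R2) / lam + D
    beta = R * (1 + L * (math.exp(lam * C2) - 1))
    alpha = R * (math.exp(lam * C2) - 1) - beta / L

    def N(w):
        return 1 + L * (math.exp(lam * (w + C1)) - 1)

    def overhead(K, w):
        # Assumption: expected pattern time = alpha + (beta / L) * N(w)**K
        expected = alpha + (beta / L) * N(w) ** K
        return expected / (K * w) - 1

    return lam, overhead

def golden_min(f, lo, hi, iters=100):
    """Golden-section search for the minimizer of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:                 # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                       # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2

def optimize(lam1, lam2, C1, C2, R1, R2, D, K_max=50):
    """Return (overhead, K, w) minimizing the overhead over integer K <= K_max."""
    lam, overhead = make_model(lam1, lam2, C1, C2, R1, R2, D)
    best = None
    for K in range(1, K_max + 1):
        # Assumed search interval: chunk sizes up to one overall MTBF (1 / lam).
        w = golden_min(lambda x: overhead(K, x), 1e-6 / lam, 1.0 / lam)
        cand = (overhead(K, w), K, w)
        if best is None or cand < best:
            best = cand
    return best

# Illustrative parameters (seconds), not taken from the slides.
print(optimize(1e-4, 1e-5, 10, 100, 10, 100, 5))
```

Restricting K to integers and scanning it by brute force keeps the search one-dimensional in w, which is enough for golden-section search as long as the overhead is unimodal in w for fixed K.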
Known job length: optimal solution

◮ Total size of job: $W_{total}$
◮ Chunks have the same size $w_{opt}$ as previously
◮ There are $p^*$ patterns where:

$$p^* = \frac{W_{total} \ln(N(w_{opt}))}{w_{opt} \left(\mathbb{L}\!\left(\frac{\alpha L}{\beta e}\right) + 1\right)}$$

with the same notations as previously and $\mathbb{L}(z) = x$ if $x e^x = z$.

◮ Ugly implicit equations: solve them numerically!
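The function defined by x e^x = z is the Lambert W function, so p* can be evaluated once w_opt is known. The sketch below is a hedged illustration: it assumes the principal branch (x ≥ -1), which the slide does not specify, reads the flattened fractions as L = λ2/λ, β/L, and R = (1 + λ1 R1 + λ2 R2)/λ + D, and uses plain bisection rather than a library routine. The names `lambert_w0` and `optimal_pattern_count`, and the value of `w_opt` in the usage line, are hypothetical.

```python
import math

def lambert_w0(z):
    """Principal-branch solution x >= -1 of x * exp(x) = z, by bisection."""
    if z < -1 / math.e:
        raise ValueError("no real solution for z < -1/e")
    # x * exp(x) is increasing on [-1, +inf), so the root is bracketed here.
    lo, hi = -1.0, max(1.0, z)
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * math.exp(mid) < z:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def optimal_pattern_count(W_total, w_opt, lam1, lam2, C1, C2, R1, R2, D):
    """Evaluate p* from the slide's formula (real-valued, not rounded)."""
    lam = lam1 + lam2
    L = lam2 / lam
    R = (1 + lam1 * R1 + lam2 * R2) / lam + D
    beta = R * (1 + L * (math.exp(lam * C2) - 1))
    alpha = R * (math.exp(lam * C2) - 1) - beta / L
    N = 1 + L * (math.exp(lam * (w_opt + C1)) - 1)
    w_branch = lambert_w0(alpha * L / (beta * math.e))
    return W_total * math.log(N) / (w_opt * (w_branch + 1))

# Illustrative values (seconds); w_opt = 450 is a placeholder, not a result
# from the slides.
print(optimal_pattern_count(1e6, 450.0, 1e-4, 1e-5, 10, 100, 10, 100, 5))
```

The formula returns a real number; in practice the number of patterns has to be an integer, so one would compare the two neighboring integers.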