On the Combination of Silent Error Detection and Checkpointing

Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien & Dounia Zaidouni. PRDC 2013.


  1. On the Combination of Silent Error Detection and Checkpointing. Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien & Dounia Zaidouni. PRDC 2013

  2. Outline
     1. Introduction, motivation
     2. Optimal checkpointing strategy
        • Exponential distribution
        • Arbitrary distribution
     3. Limited resources
     4. Incorporating detection
        • k checkpoints for 1 verification
        • k verifications for 1 checkpoint
     5. Conclusion, future work
     6. Announcement

  4. A few definitions
     • Many types of faults: software error, hardware malfunction, memory corruption
     • Many possible behaviors: transient, unrecoverable, silent
     • Restrict to silent errors
     • This includes some software faults, some hardware errors (soft errors in L1 cache), double bit flip
     • Silent error detected when corrupt data is activated
     • Silent errors are the black swans of errors (Marc Snir)

  5. Error sources (courtesy Franck Cappello)
     • Analysis of error and failure logs
     • In 2005 (Ph.D. of Charng-Da Lu): “Software halts account for the most number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5 hours). Hardware problems, albeit rarer, need 6.3-100.7 hours to solve.”
     • In 2007 (Garth Gibson, ICPP Keynote): [chart not recovered; surviving label: Hardware, 50%]
     • In 2008 (Oliner and J. Stearley, DSN Conf.): [chart not recovered]
     Software errors: applications, OS bugs (kernel panic), communication libraries, file system errors and others. Hardware errors: disks, processors, memory, network.
     Conclusion: both hardware and software failures have to be considered.

  7. Figure: Error and detection latency (a time axis marking an error, its detection, and the intervals $X_e$ and $X_d$).
     • $X_e$: inter-arrival time between errors; mean time $\mu_e$
     • $X_d$: error detection time; mean time $\mu_d$
     • Assume $X_d$ and $X_e$ independent

  8. Notations
     • $C$: checkpointing time
     • $R$: recovery time
     • $W$: total work
     • $w$: some piece of work

  12. For one chunk
     When $X_e$ follows an Exponential law of parameter $\lambda_e = \frac{1}{\mu_e}$, in order to execute a total work of $w + C$, we need:
     $$E(T(w)) = e^{-\lambda_e (w+C)} (w+C) + \left(1 - e^{-\lambda_e (w+C)}\right) \left(E(T_{lost}) + E(X_d) + E(T_{rec}) + E(T(w))\right)$$
     • $e^{-\lambda_e (w+C)}$: probability of execution without error
     • $1 - e^{-\lambda_e (w+C)}$: probability of error during $w + C$
     • $E(T_{lost}) + E(X_d) + E(T_{rec}) + E(T(w))$: execution time with an error
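The recursion for $E(T(w))$ is linear in the unknown, so it can be solved directly. As a worked step (a sketch; the shorthands $p = e^{-\lambda_e (w+C)}$ and $L = E(T_{lost}) + E(X_d) + E(T_{rec})$ are introduced here for brevity and do not appear on the slides):

```latex
\begin{align*}
E(T(w)) &= p\,(w+C) + (1-p)\bigl(L + E(T(w))\bigr) \\
p\,E(T(w)) &= p\,(w+C) + (1-p)\,L \\
E(T(w)) &= (w+C) + \frac{1-p}{p}\,L
         = (w+C) + \bigl(e^{\lambda_e (w+C)} - 1\bigr)\,L
\end{align*}
```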

  14. Let us focus on the time lost due to an error: $E(T_{lost}) + E(X_d) + E(T_{rec})$
     $E(T_{lost})$: this is the time elapsed between the completion of the last checkpoint and the error.
     $$E(T_{lost}) = \int_0^{\infty} x \, P(X = x \mid X < w + C) \, dx = \frac{1}{P(X < w + C)} \int_0^{w+C} x \, \lambda_e e^{-\lambda_e x} \, dx = \frac{1}{\lambda_e} - \frac{w + C}{e^{\lambda_e (w+C)} - 1}$$
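The closed form for $E(T_{lost})$ can be checked numerically against the integral it comes from. The sketch below is only an illustration: the function names and the parameter values ($\lambda_e = 10^{-3}$, $w + C = 110$) are assumptions chosen for the example, not values from the slides.

```python
import math


def expected_lost_closed_form(lam, span):
    """Closed form 1/lam - span / (exp(lam*span) - 1), with span = w + C."""
    return 1.0 / lam - span / math.expm1(lam * span)


def expected_lost_numeric(lam, span, steps=100_000):
    """Midpoint-rule estimate of the conditional mean of X given X < span."""
    h = span / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        # integrand: x times the exponential density lam * exp(-lam * x)
        total += x * lam * math.exp(-lam * x) * h
    # divide by P(X < span) = 1 - exp(-lam * span)
    return total / (1.0 - math.exp(-lam * span))


if __name__ == "__main__":
    lam, span = 1e-3, 110.0  # illustrative values only
    print(expected_lost_closed_form(lam, span))
    print(expected_lost_numeric(lam, span))
```

The two printed values should agree to several digits. Since the exponential density is decreasing, the expected lost time is below half the chunk length.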

  15. Let us focus on the time lost due to an error: $E(T_{lost}) + E(X_d) + E(T_{rec})$
     $E(X_d)$: this is the time needed for error detection, $E(X_d) = \mu_d$.

  17. Let us focus on the time lost due to an error: $E(T_{lost}) + E(X_d) + E(T_{rec})$
     $E(T_{rec})$: this is the time to recover from the error (there can be a fault during recovery):
     $$E(T_{rec}) = e^{-\lambda_e R} R + \left(1 - e^{-\lambda_e R}\right) \left(E(R_{lost}) + E(X_d) + E(T_{rec})\right)$$
     Similarly to $E(T_{lost})$, we have:
     $$E(R_{lost}) = \frac{1}{\lambda_e} - \frac{R}{e^{\lambda_e R} - 1}$$
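Putting the pieces together, both recursions (for $E(T_{rec})$ and for $E(T(w))$) are linear in their unknown and can be solved numerically. The sketch below does that step by step. The function names and parameter values are hypothetical, and `simplified_chunk_time` is a simplification obtained here by algebra on the formulas above, included only as a cross-check rather than quoted from the slides.

```python
import math


def expected_lost(lam, span):
    """E(T_lost) for a chunk of length span (use span = R for E(R_lost))."""
    return 1.0 / lam - span / math.expm1(lam * span)


def expected_recovery(lam, R, mu_d):
    """Solve E = q*R + (1-q)*(E(R_lost) + mu_d + E) for E, with q = exp(-lam*R)."""
    q = math.exp(-lam * R)
    return (q * R + (1.0 - q) * (expected_lost(lam, R) + mu_d)) / q


def expected_chunk_time(lam, w, C, R, mu_d):
    """Solve the recursion for E(T(w)) with p = exp(-lam*(w+C))."""
    span = w + C
    p = math.exp(-lam * span)
    penalty = expected_lost(lam, span) + mu_d + expected_recovery(lam, R, mu_d)
    return (p * span + (1.0 - p) * penalty) / p


def simplified_chunk_time(lam, w, C, R, mu_d):
    """Cross-check derived here: exp(lam*R) * (exp(lam*(w+C)) - 1) * (1/lam + mu_d)."""
    return math.exp(lam * R) * math.expm1(lam * (w + C)) * (1.0 / lam + mu_d)


if __name__ == "__main__":
    # Illustrative parameters only: lambda_e, w, C, R, mu_d
    args = (1e-3, 100.0, 10.0, 5.0, 20.0)
    print(expected_chunk_time(*args))
    print(simplified_chunk_time(*args))
```

Agreement between the two printed values is a quick sanity check that the recursions were solved consistently.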
