algorithm based fault tolerance

Algorithm-Based Fault Tolerance, by Thomas Hérault, Yves Robert & Frédéric Vivien - PowerPoint PPT Presentation


  1. Algorithm-Based Fault Tolerance. Thomas Hérault (1), Yves Robert (1, 2) & Frédéric Vivien (2). (1) University of Tennessee Knoxville, USA; (2) ENS Lyon & INRIA, France. frederic.vivien@inria.fr. 3rd JLESC Summer School. Sections: Introduction; ABFT for block LU factorization; Composite approach: ABFT & Checkpointing.

  2. Outline: 1. Introduction: Matrix-Matrix Multiplication; 2. ABFT for block LU factorization; 3. Composite approach: ABFT & Checkpointing.

  4. Generic vs. application-specific approaches. Generic solutions: universal; very low prerequisites; one size fits all (pros and cons). Application-specific solutions: require (deep) study of the application/algorithm; tailored solution, hence higher efficiency.

  5. Backward recovery vs. forward recovery. Backward recovery (rollback): goes back in the execution history to recover from failures; spends time re-executing computations; rebuilds states already reached. Typical example: checkpointing techniques.

  6. Backward recovery vs. forward recovery. Forward recovery: proceeds without rolling back; either pays additional costs during (failure-free) computation to maintain consistent redundancy, or pays for additional computations when failures happen. General technique: replication. Application-specific techniques: iterative algorithms with fixed-point convergence, ABFT, ...

  7. Algorithm-Based Fault Tolerance (ABFT). Principle: limited to linear algebra computations; matrices are extended with rows and/or columns of checksums.

  8. Algorithm-Based Fault Tolerance (ABFT). Principle: limited to linear algebra computations; matrices are extended with rows and/or columns of checksums. Example: $M = \begin{pmatrix} 5 & 1 & 7 \\ 4 & 3 & 5 \\ 4 & 6 & 9 \end{pmatrix}$

  9. Algorithm-Based Fault Tolerance (ABFT). Principle: limited to linear algebra computations; matrices are extended with rows and/or columns of checksums. Example with a column of row checksums appended: $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 12 \\ 4 & 6 & 9 & 19 \end{pmatrix}$
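
The checksum extension on the slide above can be sketched in a few lines of Python. This is only an illustration using the slide's 3 x 3 example; the function name `add_row_checksums` is a hypothetical helper, not something taken from the presentation.

```python
def add_row_checksums(matrix):
    """Return a copy of `matrix` in which each row is extended by the sum of its entries."""
    # Hypothetical helper name, chosen for this illustration only.
    return [row + [sum(row)] for row in matrix]


M = [[5, 1, 7],
     [4, 3, 5],
     [4, 6, 9]]

M_ext = add_row_checksums(M)
for row in M_ext:
    print(row)
# [5, 1, 7, 13]
# [4, 3, 5, 12]
# [4, 6, 9, 19]
```

The original matrix is left untouched, so the data and its checksummed copy can be compared later.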

  10. ABFT and fail-stop errors. Missing checksum data: $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & ? \\ 4 & 6 & 9 & 19 \end{pmatrix}$

  11. ABFT and fail-stop errors. Missing checksum data: $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & ? \\ 4 & 6 & 9 & 19 \end{pmatrix}$ Simple recomputation: $4 + 3 + 5 = 12$.

  12. ABFT and fail-stop errors. Missing checksum data: $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & ? \\ 4 & 6 & 9 & 19 \end{pmatrix}$ Simple recomputation: $4 + 3 + 5 = 12$. Missing original data: $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & ? & 5 & 12 \\ 4 & 6 & 9 & 19 \end{pmatrix}$

  13. ABFT and fail-stop errors. Missing checksum data: $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & ? \\ 4 & 6 & 9 & 19 \end{pmatrix}$ Simple recomputation: $4 + 3 + 5 = 12$. Missing original data: $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & ? & 5 & 12 \\ 4 & 6 & 9 & 19 \end{pmatrix}$ Simple recomputation: $12 - (4 + 5) = 3$.
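
Both recovery cases on the slides above (lost checksum, lost data entry) can be sketched as follows. This is a minimal illustration, assuming the lost value is marked with `None` and that exactly one entry per row is missing; `recover_row` is a hypothetical name, not part of the presentation.

```python
def recover_row(row):
    """Fill in the single missing entry (None) of a row whose last element is its checksum."""
    # Hypothetical helper: `None` marks the entry lost in a fail-stop error.
    missing = [i for i, value in enumerate(row) if value is None]
    if len(missing) != 1:
        raise ValueError("exactly one missing entry is required")
    i = missing[0]
    known_sum = sum(value for value in row if value is not None)
    repaired = list(row)
    if i == len(row) - 1:
        # The checksum was lost: recompute it from the surviving data entries.
        repaired[i] = known_sum
    else:
        # A data entry was lost: checksum minus the surviving data entries.
        checksum = row[-1]
        repaired[i] = checksum - (known_sum - checksum)
    return repaired


print(recover_row([4, 3, 5, None]))   # [4, 3, 5, 12]
print(recover_row([4, None, 5, 12]))  # [4, 3, 5, 12]
```

With a single checksum per row, a second missing entry in the same row cannot be recovered, which is why the sketch rejects that case.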

  14. ABFT and silent data corruption. $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 13 \\ 4 & 6 & 9 & 19 \end{pmatrix}$

  15. ABFT and silent data corruption. $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 13 \\ 4 & 6 & 9 & 19 \end{pmatrix}$ Error detection: $4 + 3 + 5 \neq 13$. Limitations: the following matrix would have successfully passed the sanity check: $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 5 & 3 & 5 & 13 \\ 4 & 6 & 9 & 19 \end{pmatrix}$ Can detect one error and correct zero.
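
The sanity check and its limitation can be reproduced with a short sketch. The function name `rows_failing_check` is hypothetical, and exact integer comparison is an assumption made for this example; floating-point data would need a tolerance instead.

```python
def rows_failing_check(matrix_ext):
    """Return the indices of rows whose data no longer sums to the stored checksum."""
    # Hypothetical helper; assumes the last element of each row is its checksum.
    return [i for i, row in enumerate(matrix_ext) if sum(row[:-1]) != row[-1]]


# Slide example: the stored checksum of the second row is 13 but the data sums to 12.
corrupted = [[5, 1, 7, 13],
             [4, 3, 5, 13],
             [4, 6, 9, 19]]
print(rows_failing_check(corrupted))   # [1]

# Two corruptions in the same row (4 -> 5 and 12 -> 13) cancel out and go unnoticed.
undetected = [[5, 1, 7, 13],
              [5, 3, 5, 13],
              [4, 6, 9, 19]]
print(rows_failing_check(undetected))  # []
```

A failing row reveals that something is wrong but not which entry, which matches the slide's conclusion that one checksum detects one error and corrects none.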

  16. ABFT and silent data corruption. One row and one column of checksums: $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 11 \\ 4 & 6 & 9 & 19 \\ 13 & 9 & 21 & 43 \end{pmatrix}$

  17. ABFT and silent data corruption. One row and one column of checksums: $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 11 \\ 4 & 6 & 9 & 19 \\ 13 & 9 & 21 & 43 \end{pmatrix}$ Checksum recomputation to look for silent data corruptions: $\begin{pmatrix} 5 + 1 + 7 = 13 \\ 4 + 3 + 5 = 12 \\ 4 + 6 + 9 = 19 \\ 13 + 10 + 21 = 44 \end{pmatrix}$ Checksums do not match!

  18. ABFT and silent data corruption. $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 11 \\ 4 & 6 & 9 & 19 \\ 13 & 9 & 21 & 43 \end{pmatrix}$, recomputed checksums: $\begin{pmatrix} 5 + 1 + 7 = 13 \\ 4 + 3 + 5 = 12 \\ 4 + 6 + 9 = 19 \\ 13 + 10 + 21 = 44 \end{pmatrix}$ Both checksums are affected (row 2 gives 12 instead of 11, column 2 gives 10 instead of 9), which gives the location of the error: entry (2, 2).

  19. ABFT and silent data corruption. $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 11 \\ 4 & 6 & 9 & 19 \\ 13 & 9 & 21 & 43 \end{pmatrix}$, recomputed checksums: $\begin{pmatrix} 5 + 1 + 7 = 13 \\ 4 + 3 + 5 = 12 \\ 4 + 6 + 9 = 19 \\ 13 + 10 + 21 = 44 \end{pmatrix}$ Both checksums are affected (row 2 gives 12 instead of 11, column 2 gives 10 instead of 9), which gives the location of the error: entry (2, 2). We solve: $4 + x + 5 = 11$ and $1 + x + 6 = 9$.

  20. ABFT and silent data corruption. $M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 11 \\ 4 & 6 & 9 & 19 \\ 13 & 9 & 21 & 43 \end{pmatrix}$, recomputed checksums: $\begin{pmatrix} 5 + 1 + 7 = 13 \\ 4 + 3 + 5 = 12 \\ 4 + 6 + 9 = 19 \\ 13 + 10 + 21 = 44 \end{pmatrix}$ Both checksums are affected (row 2 gives 12 instead of 11, column 2 gives 10 instead of 9), which gives the location of the error: entry (2, 2). We solve: $4 + x + 5 = 11$ and $1 + x + 6 = 9$, so $x = 2$. Recomputing the checksums we find that: $\begin{pmatrix} 5 + 1 + 7 = 13 \\ 4 + 2 + 5 = 11 \\ 4 + 6 + 9 = 19 \\ 13 + 9 + 21 = 43 \end{pmatrix}$ Checksums match. Can detect two errors and correct one.
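
The locate-and-correct step can be sketched as below. This is an illustration under stated assumptions: at most one data entry is corrupted, the stored checksums themselves are intact, and values are integers so exact comparison is valid. `locate_and_correct` is a hypothetical name, not taken from the presentation.

```python
def locate_and_correct(full):
    """Repair one corrupted data entry of a full-checksum matrix in place.

    `full` has n+1 rows and m+1 columns: the last column holds row checksums
    and the last row holds column checksums. Returns the (row, col) index of
    the corrected entry, or None if every checksum matches.
    """
    n, m = len(full) - 1, len(full[0]) - 1
    bad_rows = [i for i in range(n) if sum(full[i][:m]) != full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(full[i][j] for i in range(n)) != full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single corrupted data entry")
    i, j = bad_rows[0], bad_cols[0]
    # Solve for x in: (sum of the other entries of row i) + x = row checksum.
    others = sum(full[i][:m]) - full[i][j]
    full[i][j] = full[i][m] - others
    return (i, j)


M = [[5, 1, 7, 13],
     [4, 3, 5, 11],
     [4, 6, 9, 19],
     [13, 9, 21, 43]]
print(locate_and_correct(M))  # (1, 1)
print(M[1])                   # [4, 2, 5, 11]
```

The sketch uses zero-based indices, so the slide's entry (2, 2) is reported as `(1, 1)`. The row equation alone determines the corrected value; the column equation from the slide yields the same result and could serve as a cross-check.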

  21. ABFT for matrix-matrix multiplication. Aim: computation of $C = A \times B$. Let $e^T = [1, 1, \cdots, 1]$; we define $A_c := \begin{pmatrix} A \\ e^T A \end{pmatrix}$, $B_r := \begin{pmatrix} B & Be \end{pmatrix}$, $C_f := \begin{pmatrix} C & Ce \\ e^T C & e^T C e \end{pmatrix}$, where $A_c$ is the column checksum matrix, $B_r$ is the row checksum matrix and $C_f$ is the full checksum matrix.

  22. ABFT for matrix-matrix multiplication. Aim: computation of $C = A \times B$. Let $e^T = [1, 1, \cdots, 1]$; we define $A_c := \begin{pmatrix} A \\ e^T A \end{pmatrix}$, $B_r := \begin{pmatrix} B & Be \end{pmatrix}$, $C_f := \begin{pmatrix} C & Ce \\ e^T C & e^T C e \end{pmatrix}$, where $A_c$ is the column checksum matrix, $B_r$ is the row checksum matrix and $C_f$ is the full checksum matrix. Then $A_c \times B_r = \begin{pmatrix} A \\ e^T A \end{pmatrix} \times \begin{pmatrix} B & Be \end{pmatrix} = \begin{pmatrix} AB & ABe \\ e^T AB & e^T ABe \end{pmatrix} = \begin{pmatrix} C & Ce \\ e^T C & e^T C e \end{pmatrix} = C_f$.
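
The identity $A_c \times B_r = C_f$ can be checked numerically with a small sketch. The 2 x 2 matrices and all function names below are assumptions chosen for illustration; the presentation gives no numeric example for this step.

```python
def matmul(X, Y):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def column_checksum(A):
    """A_c: append the row e^T A holding the sum of each column."""
    return A + [[sum(col) for col in zip(*A)]]


def row_checksum(B):
    """B_r: append the column Be holding the sum of each row."""
    return [row + [sum(row)] for row in B]


def full_checksum(C):
    """C_f: row checksums first, then column checksums (including the corner e^T C e)."""
    return column_checksum(row_checksum(C))


# Illustrative values, not taken from the slides.
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

C = matmul(A, B)
product = matmul(column_checksum(A), row_checksum(B))
print(product)                      # [[19, 22, 41], [43, 50, 93], [62, 72, 134]]
print(product == full_checksum(C))  # True
```

Multiplying the extended operands yields the result together with its checksums, so the checksums of $C$ are maintained by the multiplication itself rather than recomputed afterwards.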
