SLIDE 1

Introduction ABFT for block LU factorization Composite approach: ABFT & Checkpointing

Algorithm-Based Fault Tolerance

Thomas Hérault¹, Yves Robert¹,² & Frédéric Vivien²
1 – University of Tennessee Knoxville, USA
2 – ENS Lyon & INRIA, France

frederic.vivien@inria.fr

3rd JLESC Summer School

Thomas Hérault, Yves Robert, and Frédéric Vivien ABFT 1/45

SLIDE 2

Outline

1 Introduction: Matrix-Matrix Multiplication
2 ABFT for block LU factorization
3 Composite approach: ABFT & Checkpointing

SLIDE 4

Generic vs. Application specific approaches

Generic solutions
  • Universal
  • Very low prerequisite
  • One size fits all (pros and cons)

Application specific solutions
  • Requires (deep) study of the application/algorithm
  • Tailored solution: higher efficiency

SLIDE 5

Backward Recovery vs. Forward Recovery

Backward Recovery
  • Rollback / Backward Recovery: goes back in the execution history to recover from failures
  • Spends time re-executing computations
  • Rebuilds states already reached
  • Typical: checkpointing techniques

SLIDE 6

Backward Recovery vs. Forward Recovery

Forward Recovery
  • Forward Recovery: proceeds without rolling back
  • Pays additional costs during (failure-free) computation to maintain consistent redundancy
  • Or pays for additional computations when failures happen
  • General technique: Replication
  • Application-Specific techniques: Iterative algorithms with fixed point convergence, ABFT, ...

SLIDE 9

Algorithm Based Fault Tolerance (ABFT)

Principle
  • Limited to Linear Algebra computations
  • Matrices are extended with rows and/or columns of checksums

\[
M = \begin{pmatrix} 5 & 1 & 7 \\ 4 & 3 & 5 \\ 4 & 6 & 9 \end{pmatrix}
\quad\longrightarrow\quad
M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 12 \\ 4 & 6 & 9 & 19 \end{pmatrix}
\]

SLIDE 13

ABFT and fail-stop errors

Missing checksum data

\[
M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & ? \\ 4 & 6 & 9 & 19 \end{pmatrix}
\]

Simple recomputation: 4 + 3 + 5 = 12.

Missing original data

\[
M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & ? & 5 & 12 \\ 4 & 6 & 9 & 19 \end{pmatrix}
\]

Simple recomputation: 12 − (4 + 5) = 3.
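The two recomputations above can be sketched in a few lines of Python. This is only an illustration: the function name `recover_missing` and the use of `None` to mark the lost entry are assumptions made for this example, not part of the slides.

```python
def recover_missing(row):
    """Rebuild the single missing entry (None) of a row whose last entry is the checksum."""
    missing = [i for i, v in enumerate(row) if v is None]
    if len(missing) != 1:
        raise ValueError("exactly one entry must be missing")
    idx = missing[0]
    data, checksum = row[:-1], row[-1]
    if idx == len(row) - 1:
        # The checksum itself was lost: recompute it from the data.
        value = sum(data)
    else:
        # A data entry was lost: subtract the surviving data from the checksum.
        value = checksum - sum(v for v in data if v is not None)
    return row[:idx] + [value] + row[idx + 1:]


print(recover_missing([4, 3, 5, None]))   # [4, 3, 5, 12]
print(recover_missing([4, None, 5, 12]))  # [4, 3, 5, 12]
```

With a single checksum per row, only one lost entry per row can be rebuilt this way.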

SLIDE 15

ABFT and silent data corruption

\[
M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 13 \\ 4 & 6 & 9 & 19 \end{pmatrix}
\]

Error detection: 4 + 3 + 5 = 12 ≠ 13

Limitations
The following matrix would have successfully passed the sanity check:

\[
M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 5 & 3 & 5 & 13 \\ 4 & 6 & 9 & 19 \end{pmatrix}
\]

Can detect one error and correct zero.

SLIDE 20

ABFT and silent data corruption

One row and one column of checksums

\[
M = \begin{pmatrix} 5 & 1 & 7 & 13 \\ 4 & 3 & 5 & 11 \\ 4 & 6 & 9 & 19 \\ 13 & 9 & 21 & 43 \end{pmatrix}
\]

Checksum recomputation to look for silent data corruptions:

\[
\begin{array}{l}
5 + 1 + 7 = 13 \\
4 + 3 + 5 = 12 \\
4 + 6 + 9 = 19 \\
13 + 10 + 21 = 44
\end{array}
\]

Checksums do not match! Both checksums are affected (row 2: 12 ≠ 11; column 2: 10 ≠ 9), giving out the location of the error.

We solve:

\[
4 + x + 5 = 11, \qquad 1 + x + 6 = 9
\]

Recomputing the checksums, we find that:

\[
\begin{array}{l}
5 + 1 + 7 = 13 \\
4 + 2 + 5 = 11 \\
4 + 6 + 9 = 19 \\
13 + 9 + 21 = 43
\end{array}
\]

Checksums match.

Can detect two errors and correct one.
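The locate-then-solve procedure of this example can be sketched as follows. The function name `correct_single_error` and the list-of-rows layout are assumptions chosen for illustration, and the sketch uses exact integer arithmetic, so it ignores the rounding issues that real floating-point data would raise.

```python
def correct_single_error(full):
    """Locate and fix one corrupted data entry in a full-checksum matrix.

    `full` is a list of rows: the last column holds row sums and the last
    row holds column sums. Returns (row, col, value) for the corrected
    entry, or None if every checksum matches.
    """
    n_rows, n_cols = len(full) - 1, len(full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(full[i][:n_cols]) != full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(full[i][j] for i in range(n_rows)) != full[n_rows][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single data error: cannot correct")
    i, j = bad_rows[0], bad_cols[0]
    # Solve the row equation for the unknown entry x.
    x = full[i][n_cols] - sum(full[i][k] for k in range(n_cols) if k != j)
    full[i][j] = x
    return i, j, x


m = [[5, 1, 7, 13],
     [4, 3, 5, 11],
     [4, 6, 9, 19],
     [13, 9, 21, 43]]
print(correct_single_error(m))  # (1, 1, 2)
```

The failing row checksum and the failing column checksum intersect at the corrupted entry, which is why one checksum row plus one checksum column is enough to correct a single error.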

SLIDE 22

ABFT for Matrix-Matrix multiplication

Aim: Computation of C = A × B

Let \(e^T = [1, 1, \cdots, 1]\). We define

\[
A^c := \begin{pmatrix} A \\ e^T A \end{pmatrix}, \qquad
B^r := \begin{pmatrix} B & Be \end{pmatrix}, \qquad
C^f := \begin{pmatrix} C & Ce \\ e^T C & e^T C e \end{pmatrix},
\]

where \(A^c\) is the column checksum matrix, \(B^r\) is the row checksum matrix and \(C^f\) is the full checksum matrix.

\[
A^c \times B^r
= \begin{pmatrix} A \\ e^T A \end{pmatrix} \times \begin{pmatrix} B & Be \end{pmatrix}
= \begin{pmatrix} AB & ABe \\ e^T AB & e^T ABe \end{pmatrix}
= \begin{pmatrix} C & Ce \\ e^T C & e^T C e \end{pmatrix}
= C^f
\]
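The identity \(A^c \times B^r = C^f\) can be checked numerically on a small case. The following is a minimal sketch in plain Python; the helper names (`matmul`, `with_column_checksum`, `with_row_checksum`, `full_checksum`) and the 2 × 2 inputs are illustrative assumptions.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """A^c: append the row e^T A of column sums."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """B^r: append the column B e of row sums."""
    return [row + [sum(row)] for row in b]


def full_checksum(c):
    """C^f: row sums in the last column, column sums in the last row."""
    return with_column_checksum(with_row_checksum(c))


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul(with_column_checksum(a), with_row_checksum(b))
print(product == full_checksum(matmul(a, b)))  # True
```

The checksums of C come out of the multiplication itself, so they do not have to be computed from C afterwards.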

SLIDE 23

In practice... things are more complicated!

  • When do errors strike? Are all data always protected?
  • Computations are approximate because of floating-point rounding
  • Error detection and error correction capabilities depend on the number of checksum rows and columns
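The floating-point point can be illustrated with a short sketch: a checksum accumulated in one order need not equal, bit for bit, the same sum recomputed in another order, so a practical check compares up to a tolerance. The function name `checksum_matches` and the tolerance `rel_tol=1e-9` are assumptions for this example; a real code would derive the threshold from the problem's conditioning.

```python
import math


def checksum_matches(row, rel_tol=1e-9):
    """Compare the stored checksum (last entry) with the recomputed sum, up to rounding."""
    return math.isclose(math.fsum(row[:-1]), row[-1], rel_tol=rel_tol)


row = [0.1, 0.2, 0.3]
stored = (0.1 + 0.2) + 0.3      # checksum accumulated in one order
recomputed = 0.1 + (0.2 + 0.3)  # same data, summed in another order
print(stored == recomputed)              # False
print(checksum_matches(row + [stored]))  # True
```

An exact equality test would report a silent error here even though no data was corrupted.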


SLIDE 27

Block LU factorization

[Figure: one step of block LU factorization, turning A into factored L and U panels plus a trailing matrix A']

  • Solve A · x = b (hard)
  • Transform A into a LU factorization
  • Solve L · y = b, then U · x = y

Steps of one iteration:
  • GETF2: factorize a column block
  • TRSM: update the row block
  • GEMM: update the trailing matrix
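The three steps can be sketched as a right-looking block LU in plain Python. This is a simplified illustration under stated assumptions: it performs no pivoting (the real GETF2 uses partial pivoting), it works on a single process, and the loops only mimic what the GETF2, TRSM and GEMM kernels compute. The name `block_lu` and the block size parameter `nb` are chosen for this example.

```python
def block_lu(a, nb):
    """In-place right-looking block LU without pivoting (L unit lower, U upper)."""
    n = len(a)
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # GETF2-like step: unblocked LU of the column panel a[k:n][k:e].
        for j in range(k, e):
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # TRSM-like step: row block U12 = L11^{-1} A12 (forward substitution).
        for j in range(e, n):
            for i in range(k, e):
                for p in range(k, i):
                    a[i][j] -= a[i][p] * a[p][j]
        # GEMM-like step: trailing matrix A22 -= L21 * U12.
        for i in range(e, n):
            for j in range(e, n):
                for p in range(k, e):
                    a[i][j] -= a[i][p] * a[p][j]
    return a


lu = block_lu([[4.0, 2.0], [2.0, 3.0]], 1)
print(lu)  # [[4.0, 2.0], [0.5, 2.0]]
```

The strictly lower part of the result holds L (with an implicit unit diagonal) and the upper part holds U. Most of the arithmetic lands in the GEMM-like update.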

SLIDE 28

Block LU factorization

[Figure: matrix blocks labeled with the rank (0 to 5) that owns them under a 2D block cyclic distribution; the blocks of failed rank 2 are lost]

  • Failure of rank 2
  • 2D Block Cyclic Distribution (here 2 × 3)
  • A single failure ⇒ many data lost
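To see why one failure loses many blocks, the ownership rule of a 2D block cyclic distribution can be sketched as below. The rank numbering (column by column, so that the first block row is owned by ranks 0, 2, 4 and the second by 1, 3, 5) is an assumption read off the figure labels, and the names `owner` and `lost_blocks` as well as the 8 × 6 block count are hypothetical.

```python
def owner(i, j, p=2, q=3):
    """Rank owning block (i, j) on a p x q grid, ranks numbered column by column."""
    return (i % p) + p * (j % q)


def lost_blocks(failed_rank, n_block_rows, n_block_cols, p=2, q=3):
    """All block coordinates stored on the failed rank."""
    return [(i, j) for i in range(n_block_rows) for j in range(n_block_cols)
            if owner(i, j, p, q) == failed_rank]


for i in range(2):
    print([owner(i, j) for j in range(6)])
# [0, 2, 4, 0, 2, 4]
# [1, 3, 5, 1, 3, 5]
print(len(lost_blocks(2, 8, 6)))  # 8
```

The lost blocks are scattered over the whole matrix, so the checksums must cover every block row and block column.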

SLIDE 29

Algorithm Based Fault Tolerant LU decomposition

[Figure: M × N matrix in mb × nb blocks on a P × Q process grid (block owners 0 to 5), extended with checksum blocks; labels: M, N, P, Q, mb, nb, "< 2N/Q + nb", "+ + +"]

  • Checksum: invertible operation on the data of the row / column
  • Checksum blocks are doubled, to allow recovery when data and checksum are lost together

SLIDE 30

Algorithm Based Fault Tolerant LU decomposition

[Figure: same matrix, with checksum blocks held by additional owners labeled A and B; labels: M, N, P, Q, mb, nb, N/Q, "+ + +"]

  • Checksum: invertible operation on the data of the row / column
  • Checksum replication can be avoided by dedicating computing resources to checksum storage

SLIDE 31

Algorithm Based Fault Tolerant LU decomposition

[Figure: block-cyclic matrix with checksum blocks (A, B); regions labeled GETF2, GEMM, TRSM]

Idea of ABFT: applying the operation on data and checksum preserves the checksum properties
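The checksum-preservation idea can be illustrated on a single Gaussian elimination step: because row operations are linear, applying them to the checksum column as well keeps every row's checksum valid. This is a minimal sketch on the 3 × 3 example matrix used earlier; the function name `eliminate_column` is hypothetical, and exact `Fraction` arithmetic is used to keep rounding out of the picture.

```python
from fractions import Fraction


def eliminate_column(rows, k):
    """One Gaussian elimination step on column k, applied to every column of `rows`."""
    pivot = rows[k]
    for i in range(k + 1, len(rows)):
        factor = rows[i][k] / pivot[k]
        rows[i] = [x - factor * y for x, y in zip(rows[i], pivot)]
    return rows


data = [[5, 1, 7], [4, 3, 5], [4, 6, 9]]
# Extend each row with its checksum; Fractions keep the arithmetic exact.
ext = [[Fraction(v) for v in row] + [Fraction(sum(row))] for row in data]
eliminate_column(ext, 0)
print(all(sum(row[:-1]) == row[-1] for row in ext))  # True
```

After the step, the last column still equals the sum of the data entries in each row, without any recomputation.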

SLIDE 32

Algorithm Based Fault Tolerant LU decomposition

[Figure: block-cyclic matrix with checksum blocks (A, B); label: "+"]

For the part of the data that is not updated this way, the checksum must be re-calculated

SLIDE 35

Algorithm Based Fault Tolerant LU decomposition

[Figure: block-cyclic matrix with checksum blocks (A, B); regions labeled GETF2, GEMM, TRSM]

To avoid slowing down all processors and panel operation, group checksum updates every Q block columns

SLIDE 36

Algorithm Based Fault Tolerant LU decomposition

[Figure: block-cyclic matrix with checksum blocks (A, B); label: "+"]

Then, update the missing coverage. Keep a checkpoint of the block column to cover failures during that time

SLIDE 37

Algorithm Based Fault Tolerant LU decomposition

[Figure: block-cyclic matrix with checksum blocks (A, B) and a “Checkpoint” area; labels: M, N, P, Q, mb, nb, N/Q, "+ + +"]

Checkpoint the next set of Q-Panels to be able to return to it in case of failures

SLIDE 38

Algorithm Based Fault Tolerant LU decomposition

[Figure: block-cyclic matrix with checksum blocks (A, B)]

  • In case of failure, conclude the operation, then
    Missing Data = Checksum - Sum(Existing Data)

SLIDE 39

Algorithm Based Fault Tolerant LU decomposition

[Figure: block-cyclic matrix with checksum blocks (A, B); label: "+ + +"]

  • In case of failure, conclude the operation, then
    Missing Checksum = Sum(Existing Data)
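Both recovery rules apply block-wise. The sketch below assumes the checksum is a plain entry-wise sum of the blocks of one block row; the helper names `block_sum` and `recover_block` and the 2 × 2 block values are hypothetical.

```python
def block_sum(blocks):
    """Entry-wise sum of equally sized blocks (each block is a list of rows)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*blocks)]


def recover_block(surviving, checksum):
    """Missing Data = Checksum - Sum(Existing Data), entry by entry."""
    existing = block_sum(surviving)
    return [[c - e for c, e in zip(crow, erow)]
            for crow, erow in zip(checksum, existing)]


blocks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]]
checksum = block_sum(blocks)  # Missing Checksum = Sum(Existing Data)
lost = blocks[1]              # pretend the owner of this block failed
print(recover_block([blocks[0], blocks[2]], checksum) == lost)  # True
```

If the checksum block itself is lost, calling `block_sum` on the surviving data blocks rebuilds it.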

SLIDE 43

Failure inside a Q−panel factorization

[Figure: block-cyclic matrix with checksum blocks (A, B) and a “Checkpoint” area; labels: M, N, P, Q, mb, nb, N/Q, "+ + +"]

  • Failures may happen while inside a Q−panel factorization
  • Valid checksum information allows recovering most of the missing data, but not all: the checksums for the current Q−panels are not valid
  • We use the checkpoint to restore the Q−panel to its initial state
  • and re-execute that part of the factorization, without applying it outside of the scope

SLIDE 44

ABFT LU decomposition: implementation

MPI Implementation
  • PBLAS-based: need to provide a “Fault-Aware” version of the library
  • Cannot enter recovery state at any point in time: need to complete ongoing operations despite failures
  • Recovery starts by defining the position of each process in the factorization and bringing them all to a consistent state (checksum property holds)
  • Need to test the return code of each and every MPI-related call

SLIDE 45

ABFT LU decomposition: performance

[Figure: performance (TFlop/s) and relative overhead (%) versus #processors (PxQ grid) and matrix size (N), from 6x6; 20k to 192x192; 640k. Series: ScaLAPACK PDGETRF, FT-PDGETRF (no error), FT-PDGETRF (w/1 recovery), Overhead: FT-PDGETRF (no error), Overhead: FT-PDGETRF (w/1 recovery)]

Open MPI with ULFM; Kraken supercomputer.

SLIDE 46

ABFT QR decomposition: performance

[Figure: performance (TFlop/s) and relative overhead (%) versus #processors (PxQ grid) and matrix size (N), from 6x6; 20k to 48x48; 160k. Series: ScaLAPACK PDGEQRF, FT-PDGEQRF (no errors), FT-PDGEQRF (one error), Overhead: FT-PDGEQRF (no errors), Overhead: FT-PDGEQRF (one error)]

Open MPI with ULFM; Kraken supercomputer.

SLIDE 47

ABFT LU decomposition: implementation

[Figure: diagram labeled "?" and "ABFT Recovery"]

Checkpoint on Failure - MPI Implementation
  • FT-MPI / MPI-Next FT: not easily available on large machines
  • Checkpoint on Failure = workaround

SLIDE 48

ABFT QR decomposition: performance

[Figure: performance (Tflops/s) versus matrix size (N), from 20k to 100k. Series: ScaLAPACK QR, CoF-QR (w/o failure), CoF-QR (w/1 failure)]

Checkpoint on Failure - MPI Performance
Open MPI; Kraken supercomputer.


SLIDE 50

Fault Tolerance Techniques

General Techniques
  • Replication
  • Rollback Recovery
      ◦ Coordinated Checkpointing
      ◦ Uncoordinated Checkpointing & Message Logging
      ◦ Hierarchical Checkpointing
      ◦ Multilevel Checkpointing

Application-Specific Techniques
  • Algorithm Based Fault Tolerance (ABFT)
  • Iterative Convergence
  • Approximated Computation

SLIDE 53

Application

Typical Application

    for (aninsanenumber) {
        /* Extract data from simulation, fill up matrix */
        sim2mat();
        /* Factorize matrix, Solve */
        dgeqrf();
        dsolve();
        /* Update simulation with result vector */
        vec2sim();
    }

[Figure: timeline of Process 0, 1 and 2 alternating between Application and Library code; labels: LIBRARY Phase, GENERAL Phase]

Characteristics
  • Large part of (total) computation spent in factorization/solve
  • Between LA operations:
      ◦ use resulting vector / matrix with operations that do not preserve the checksums on the data
      ◦ modify data not covered by ABFT algorithms

Goodbye ABFT?!

Problem Statement
How to use fault tolerant operations(∗) within a non-fault tolerant(∗∗) application?(∗∗∗)

(∗) ABFT, or other application-specific FT
(∗∗) Or within an application that does not have the same kind of FT
(∗∗∗) And keep the application globally fault tolerant...

SLIDE 54

ABFT&PeriodicCkpt

ABFT&PeriodicCkpt: no failure

[Figure: timeline of Process 0, 1 and 2 alternating Application and Library phases; labels: Periodic Checkpoint, Split Forced Checkpoints]

SLIDE 55

ABFT&PeriodicCkpt

ABFT&PeriodicCkpt: failure during Library phase

[Figure: timeline of Process 0, 1 and 2 alternating Application and Library phases; labels: Failure (during LIBRARY), Rollback (partial) Recovery, ABFT Recovery]

SLIDE 56

ABFT&PeriodicCkpt

ABFT&PeriodicCkpt: failure during General phase

[Figure: timeline of Process 0, 1 and 2 alternating Application and Library phases; labels: Failure (during GENERAL), Rollback (full) Recovery]

SLIDE 58

ABFT&PeriodicCkpt: Optimizations

[Figure: timeline of Process 0, 1 and 2 alternating Application and Library phases; labels: ABFT&PERIODICCKPT, GENERAL Checkpoint Interval]

  • If the duration of the General phase is too small: don’t add checkpoints
  • If the duration of the Library phase is too small: don’t do ABFT recovery, remain in General mode
      ◦ this assumes a performance model for the library call

SLIDE 59

A few notations

[Figure: timeline of Process 0, 1 and 2 alternating Application and Library phases; labels: T_0, T_G, T_L, P_G]

Times, Periods
  • \(T_0\): duration of an epoch (without FT)
  • \(T_L = \alpha T_0\): time spent in the Library phase
  • \(T_G = (1 - \alpha) T_0\): time spent in the General phase
  • \(P_G\): periodic checkpointing period
  • \(T^{ff}\), \(T^{ff}_G\), \(T^{ff}_L\): “fault free” times
  • \(t^{lost}_G\), \(t^{lost}_L\): lost time (recovery overheads)
  • \(T^{final}_G\), \(T^{final}_L\): total times (with faults)

SLIDE 60

A few notations

[Figure: timeline of Process 0, 1 and 2 alternating Application and Library phases; labels: C, C_L, C_L̄]

Costs
  • \(C_L = \rho C\): time to take a checkpoint of the Library data set
  • \(C_{\bar{L}} = (1 - \rho) C\): time to take a checkpoint of the General data set
  • \(R\), \(R_{\bar{L}}\): time to load a full / General data set checkpoint
  • \(D\): down time (time to allocate a new machine / reboot)
  • \(\mathrm{Recons}_{ABFT}\): time to apply the ABFT recovery
  • \(\phi\): slowdown factor on the Library phase, when applying ABFT

slide-61
SLIDE 61

Introduction ABFT for block LU factorization Composite approach: ABFT & Checkpointing

General phase, fault free waste

[Figure: General phase timeline for Process 0, Process 1, and Process 2 (Application and Library phases), showing Periodic Checkpoint and Split Forced Checkpoints]

Without Failures

\[
T^{\mathrm{ff}}_G =
\begin{cases}
T_G + C_{\bar L} & \text{if } T_G < P_G \\
\dfrac{T_G}{P_G - C} \times P_G & \text{if } T_G \ge P_G
\end{cases}
\]
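The two cases of the fault-free General time can be checked numerically with a small sketch. The function and parameter names are illustrative, not from the slides; `c_lbar` stands for the cost of checkpointing the General data set.

```python
def fault_free_general_time(t_g, p_g, c, c_lbar):
    """Fault-free duration of the General phase.

    t_g:    useful work in the General phase (T_G)
    p_g:    periodic checkpointing period (P_G)
    c:      cost of a full checkpoint (C)
    c_lbar: cost of checkpointing the General data set only
    """
    if t_g < p_g:
        # Phase shorter than one period: only the forced checkpoint is paid.
        return t_g + c_lbar
    if p_g <= c:
        raise ValueError("the period must exceed the checkpoint cost")
    # Each period of length p_g holds p_g - c units of useful work.
    return t_g * p_g / (p_g - c)
```

For example, with `p_g = 100` and `c = 10`, 900 units of work take 1000 units of time, because one tenth of every period goes to checkpointing.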

SLIDE 62

Library phase, fault free waste

[Figure: Library phase timeline for Process 0, Process 1, and Process 2 (Application and Library phases), showing Periodic Checkpoint and Split Forced Checkpoints]

Without Failures

\[
T^{\mathrm{ff}}_L = \phi \times T_L + C_L
\]

SLIDE 63

General phase, failure overhead

[Figure: General phase timeline for Process 0, Process 1, and Process 2, showing a Failure (during GENERAL), a full Rollback, and the Recovery]

Failure Overhead

\[
t^{\mathrm{lost}}_G =
\begin{cases}
D + R + \dfrac{T^{\mathrm{ff}}_G}{2} & \text{if } T_G < P_G \\
D + R + \dfrac{P_G}{2} & \text{if } T_G \ge P_G
\end{cases}
\]

SLIDE 64

Library phase, failure overhead

[Figure: Library phase timeline for Process 0, Process 1, and Process 2, showing a Failure (during LIBRARY), a partial Rollback, the Recovery, and the ABFT Recovery]

Failure Overhead

\[
t^{\mathrm{lost}}_L = D + R_{\bar L} + \mathrm{Recons}_{\mathrm{ABFT}}
\]

SLIDE 65

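Overall

Time (with overheads) of the Library phase is constant (in $P_G$):

\[
T^{\mathrm{final}}_L = \frac{1}{1 - \dfrac{D + R_{\bar L} + \mathrm{Recons}_{\mathrm{ABFT}}}{\mu}} \times (\phi \times T_L + C_L)
\]

Time (with overheads) of the General phase has two cases:

\[
T^{\mathrm{final}}_G =
\begin{cases}
\dfrac{1}{1 - \dfrac{D + R + \frac{T_G + C_{\bar L}}{2}}{\mu}} \times (T_G + C_{\bar L}) & \text{if } T_G < P_G \\[2ex]
\dfrac{T_G}{\left(1 - \dfrac{C}{P_G}\right)\left(1 - \dfrac{D + R + \frac{P_G}{2}}{\mu}\right)} & \text{if } T_G \ge P_G
\end{cases}
\]

In the second case, this is minimal for

\[
P_G = \sqrt{2C(\mu - D - R)}
\]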
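A short derivation, added here as a sketch rather than taken from the slides, shows where that period comes from. In the case $T_G \ge P_G$, minimizing $T^{\mathrm{final}}_G$ means maximizing its denominator, written $f(P_G)$ below; the shorthand $a$ is introduced only for this derivation.

```latex
\begin{align*}
f(P_G) &= \left(1 - \frac{C}{P_G}\right)\left(a - \frac{P_G}{2\mu}\right),
  \qquad a = 1 - \frac{D + R}{\mu} \\
f'(P_G) &= \frac{C}{P_G^2}\left(a - \frac{P_G}{2\mu}\right)
  - \frac{1}{2\mu}\left(1 - \frac{C}{P_G}\right)
  = \frac{aC}{P_G^2} - \frac{1}{2\mu} \\
f'(P_G) = 0 &\iff P_G^2 = 2\mu a C = 2C(\mu - D - R)
\end{align*}
```

Since $f''(P_G) = -2aC / P_G^3 < 0$ whenever $\mu > D + R$, this stationary point is a maximum of $f$, hence a minimum of $T^{\mathrm{final}}_G$.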

SLIDE 66

Waste

From the previous formulas, we derive the waste:

\[
\mathrm{Waste} = 1 - \frac{T_0}{T^{\mathrm{final}}_G + T^{\mathrm{final}}_L}
\]
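The whole model can be put together in one function, as a sketch. The function name and argument names are illustrative; `mu` is taken to be the platform MTBF. The Library total time uses the fault-free Library time $\phi T_L + C_L$ from the earlier slide, and the period defaults to $\sqrt{2C(\mu - D - R)}$ when none is given.

```python
import math


def abft_periodic_ckpt_waste(t0, alpha, rho, phi, c, r, r_lbar,
                             d, recons_abft, mu, p_g=None):
    """Waste of ABFT&PeriodicCkpt for one epoch of useful duration t0."""
    t_l = alpha * t0            # time in the Library phase
    t_g = (1 - alpha) * t0      # time in the General phase
    c_l = rho * c               # checkpoint cost, Library data set
    c_lbar = (1 - rho) * c      # checkpoint cost, General data set
    if p_g is None:
        p_g = math.sqrt(2 * c * (mu - d - r))

    # Library phase: ABFT slowdown plus one checkpoint, ABFT recovery on failure.
    t_ff_l = phi * t_l + c_l
    t_lost_l = d + r_lbar + recons_abft
    t_final_l = t_ff_l / (1 - t_lost_l / mu)

    # General phase: forced checkpoint only, or periodic checkpointing.
    if t_g < p_g:
        t_ff_g = t_g + c_lbar
        t_lost_g = d + r + t_ff_g / 2
    else:
        t_ff_g = t_g / (1 - c / p_g)
        t_lost_g = d + r + p_g / 2
    t_final_g = t_ff_g / (1 - t_lost_g / mu)

    return 1 - t0 / (t_final_g + t_final_l)
```

With a very large `mu` the failure terms vanish and the waste reduces to the fault-free overhead of checkpoints and ABFT slowdown, which is a convenient sanity check.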

SLIDE 67

Toward Exascale, and Beyond!

Let’s think at scale
  • Number of components ↗ ⇒ MTBF ↘
  • Number of components ↗ ⇒ Problem Size ↗
  • Problem Size ↗ ⇒ Computation Time spent in Library phase ↗

ABFT&PeriodicCkpt should perform better with scale

By how much?

SLIDE 68

Competitors

FT algorithms compared
  • PeriodicCkpt: basic periodic checkpointing
  • Bi-PeriodicCkpt: applies incremental checkpointing techniques to save only the library data during the library phase
  • ABFT&PeriodicCkpt: the algorithm described above

SLIDE 69

Weak Scale #1

Weak Scale Scenario #1
  • Number of components, $n$, increases
  • Memory per component remains constant
  • Problem size increases in $O(\sqrt{n})$ (e.g. matrix operation)
  • $\mu$ at $n = 10^5$: 1 day, is in $O(1/n)$
  • $C$ ($= R$) at $n = 10^5$ is 1 minute, is in $O(n)$
  • $\alpha$ is constant at 0.8, as is $\rho$ (both Library and General phases increase in time at the same speed)
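To make the scaling concrete, the sketch below computes the scenario's $\mu$, $C$, and $R$ for a given node count, together with the period $\sqrt{2C(\mu - D - R)}$. Several things are assumptions not stated in the slides: the $O(1/n)$ and $O(n)$ laws are read as exactly proportional, anchored at $n = 10^5$; the downtime $D$ defaults to zero; and the function and constant names are illustrative.

```python
import math

N_REF = 10**5            # reference node count from the scenario
MU_REF = 24 * 3600.0     # MTBF at N_REF: 1 day, in seconds
C_REF = 60.0             # checkpoint cost at N_REF: 1 minute, in seconds


def scenario1_parameters(n, downtime=0.0):
    """Model parameters for Weak Scale Scenario #1 at n nodes.

    Assumes mu scales exactly as 1/n and C (= R) exactly as n.
    downtime (D) is not given in the scenario; 0 is a placeholder.
    """
    mu = MU_REF * N_REF / n
    c = C_REF * n / N_REF
    r = c
    if mu - downtime - r <= 0:
        raise ValueError("MTBF too small for the period formula to apply")
    p_g = math.sqrt(2 * c * (mu - downtime - r))
    return {"mu": mu, "C": c, "R": r, "P_G": p_g}


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, scenario1_parameters(n))
```

Under these assumptions the MTBF shrinks while the checkpoint cost grows as the platform scales, which is the pressure on pure periodic checkpointing that the scenario is designed to expose.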

SLIDE 70

Weak Scale #1

[Figure: # Faults and Waste versus Nodes (1k to 1M) for PeriodicCkpt, Bi-PeriodicCkpt, and ABFT&PeriodicCkpt]

SLIDE 71

Weak Scale #2

Weak Scale Scenario #2
  • Number of components, $n$, increases
  • Memory per component remains constant
  • Problem size increases in $O(\sqrt{n})$ (e.g. matrix operation)
  • $\mu$ at $n = 10^5$: 1 day, is in $O(1/n)$
  • $C$ ($= R$) at $n = 10^5$ is 1 minute, is in $O(n)$
  • $\rho$ remains constant at 0.8, but the Library phase is $O(n^3)$ while the General phase progresses in $O(n^2)$ ($\alpha$ is 0.8 at $n = 10^5$ nodes)

SLIDE 72

Weak Scale #2

[Figure: # Faults and Waste versus Nodes (1k to 1M) for PeriodicCkpt, Bi-PeriodicCkpt, and ABFT&PeriodicCkpt; a second axis (0 to 1) shows the ratio of time spent in the ABFT routine (ABFT Ratio)]

SLIDE 73

Weak Scale #3

Weak Scale Scenario #3
  • Number of components, $n$, increases
  • Memory per component remains constant
  • Problem size increases in $O(\sqrt{n})$ (e.g. matrix operation)
  • $\mu$ at $n = 10^5$: 1 day, is in $O(1/n)$
  • $C$ ($= R$) at $n = 10^5$ is 1 minute, and stays independent of $n$ ($O(1)$)
  • $\rho$ remains constant at 0.8, but the Library phase is $O(n^3)$ while the General phase progresses in $O(n^2)$ ($\alpha$ is 0.8 at $n = 10^5$ nodes)

SLIDE 74

Weak Scale #3

[Figure: # Faults and Waste versus Nodes for PeriodicCkpt, Bi-PeriodicCkpt, and ABFT&PeriodicCkpt; x-axis ticks: 1k (α = 0.55), 10k (α = 0.8), 100k (α = 0.92), 1M (α = 0.975)]

SLIDE 75

Conclusion

Algorithm-Based Fault Tolerance
  • Application-specific solution for linear algebra kernels
  • Low-overhead forward-recovery solution
  • Used alone or in conjunction with backward-recovery solutions

Going further
  • Algorithm-Based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy. A. Bouteiller, Th. Herault, G. Bosilca, P. Du, J. Dongarra. ACM Transactions on Parallel Computing 1(2), 2015.
  • Composing resilience techniques: ABFT, periodic and incremental checkpointing. G. Bosilca, A. Bouteiller, Th. Herault, Y. Robert, J. Dongarra. IJNC 5(1), 2015.
  • Fault tolerance techniques for high-performance computing. J. Dongarra, Th. Herault, Y. Robert. http://www.netlib.org/lapack/lawnspdf/lawn289.pdf
