Fault Tolerant Linear Algebra: goals and methods. Julien Langou, University of Colorado Denver
Fault‐tolerant Linear Algebra: Goals and Methods.
0‐ GOALS
0.1‐ ERASURE OR ERROR?
1‐ METHODS
1.1‐ ERASURE: DISKLESS CHECKPOINTING AND ROLLBACK
1.2‐ ERASURE & ERROR: ABFT: ALGORITHM BASED FAULT TOLERANCE
1.3‐ OTHERS
2‐ ERROR: DETECTING/LOCATING/CORRECTING IN FLOATING‐POINT ARITHMETIC
3‐ NOVEL ABFT‐ALGORITHM (GEMM, LU, QR, ETC.) (ERASURE OR ERROR)
4‐ ABFT‐BLAS LIBRARY
5‐ ABFT‐BLAS EXPERIMENTS (ERASURE)
Goals
Perform reliable and efficient computation with unreliable units.
• Unreliable units: process crash, hardware failure, erroneous communication, erroneous computation, …
• Our method: at the algorithm level.
• Motivation: cost effective, large unit count.
Erasure Problem
4 processors available (P1, P2, P3, P4). They compute 1+1, 2+2, 3+3, 4+4, so the expected results are 2, 4, 6, 8.
Lost processor 2:
‐ we know whether there is an erasure or not,
‐ we know where the erasure is,
‐ so we only need to recover.

Error Problem
4 processors available (P1, P2, P3, P4). They compute 1+1, 2+2, 3+3, 4+4, but the returned results are 2, 5, 6, 8.
Processor 2 returns an incorrect result:
‐ we do not know whether there is an error,
‐ even assuming we know that an error occurred, we do not know where it is,
‐ we also need to recover.
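The distinction can be made concrete with a toy single-process model. This is only an illustrative sketch: the use of `None` to mark a lost result and the helper name `erased_positions` are assumptions, not part of the slides.

```python
# Toy model: "processor" i computes i + i. All names here are illustrative.
expected = [i + i for i in range(1, 5)]   # [2, 4, 6, 8]

# Erasure: processor 2 is lost, so its slot is visibly empty.
erased = [2, None, 6, 8]
# Error: processor 2 silently returns a wrong value.
erroneous = [2, 5, 6, 8]

def erased_positions(results):
    """Return the 0-based positions whose result is missing."""
    return [k for k, r in enumerate(results) if r is None]

# The erasure is detected and located by inspection alone.
print(erased_positions(erased))       # [1]
# The erroneous vector has no missing slot: nothing flags the fault
# unless some redundancy is added to the computation.
print(erased_positions(erroneous))    # []
```

In the erasure case only the recovery step remains; in the error case detection and location have to be solved first.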
Diskless checkpointing
• 4 processors available: P1, P2, P3, P4.
• Add a 5th one, Pc, and perform a checksum (MPI_Reduce): Pc = P1 + P2 + P3 + P4.
• Ready for the computations on P1, P2, P3, P4, Pc.
• Lost a processor (here P2): P1, P3, P4, Pc remain.
• Recover the processor (FT‐MPI).
• Recover the data (MPI_Reduce): P2 = Pc - P1 - P3 - P4.
• Ready for the computations on P1, P2, P3, P4, Pc.
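The checkpoint-and-recover cycle can be sketched as a single-process simulation. This is a minimal sketch under stated assumptions: each "processor" is a plain Python list, the functions `checksum` and `recover` are hypothetical helpers, and no real MPI or FT-MPI call is made (in a real code the sum would be delivered to Pc by a reduction such as MPI_Reduce).

```python
# Single-process simulation of diskless checkpointing with a sum checksum.
# Each "processor" is a list of floats; no real MPI or FT-MPI calls are made.

def checksum(blocks):
    """Element-wise sum of the blocks (what a sum reduction would give Pc)."""
    return [sum(column) for column in zip(*blocks)]

def recover(check, survivors):
    """Rebuild the lost block: Pc minus the sum of the surviving blocks."""
    partial = checksum(survivors)
    return [c - s for c, s in zip(check, partial)]

# Data held by P1..P4 (values chosen to be exactly representable).
data = {
    "P1": [1.0, 2.0],
    "P2": [3.0, 4.0],
    "P3": [5.0, 6.0],
    "P4": [7.0, 8.0],
}
pc = checksum(list(data.values()))            # checkpoint held by Pc

lost = data.pop("P2")                         # P2 fails; its data is gone
recovered = recover(pc, list(data.values()))  # Pc - P1 - P3 - P4
data["P2"] = recovered                        # replacement process is refilled

print(recovered)   # [3.0, 4.0]
```

The checkpoint costs one reduction and the recovery costs one more, which is why the time of a reduction is the quantity of interest for this method.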
Diskless checkpointing (remarks)
• You can use either floating‐point arithmetic or binary arithmetic for the checksum.
• Multiple failures/errors are supported through the Reed‐Solomon algorithm, which is optimal in the sense that supporting p simultaneous failures/errors only requires adding p processes.
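The two checksum arithmetics from the first remark can be compared on a toy example. This is a sketch under assumptions: the binary checksum is taken to be a bitwise XOR of the 64-bit patterns of the values, the floating-point checksum a plain sum, and the helpers `to_bits` and `from_bits` are illustrative names. Only the single-failure case is shown; Reed-Solomon coding is not implemented here.

```python
import struct

def to_bits(x):
    """Reinterpret a Python float as its 64-bit IEEE 754 pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(b):
    """Inverse of to_bits."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

values = [0.1, 0.2, 0.3, 0.4]   # one value per processor P1..P4
lost_index = 1                  # P2 fails
survivors = [v for k, v in enumerate(values) if k != lost_index]

# Binary checksum: XOR of the bit patterns; recovery is XOR again.
parity = 0
for v in values:
    parity ^= to_bits(v)
bits = parity
for v in survivors:
    bits ^= to_bits(v)
xor_recovered = from_bits(bits)

# Floating-point checksum: sum; recovery is subtraction.
total = sum(values)
fp_recovered = total
for v in survivors:
    fp_recovered -= v

print(xor_recovered == values[lost_index])      # True: recovery is bit-exact
print(abs(fp_recovered - values[lost_index]))   # small, subject to rounding
```

The XOR checksum returns the lost value bit for bit, while the floating-point checksum returns it only up to rounding error.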
[Figure: Time for an MPI_Reduce (using MVAPICH) on InfiniBand on jacquard.nersc.gov. x‐axis: # of processors (64, 81, 100, 121, 256); y‐axis: time (sec), from 0 to 2.5; one curve per message size: 7.6 MB, 30.5 MB, 68.7 MB, 122.0 MB.]
ABFT = Algorithm Based Fault Tolerance.
• K. Huang, J. Abraham, "Algorithm‐Based Fault Tolerance for Matrix Operations," IEEE Trans. on Comp. (Spec. Issue Reliable & Fault‐Tolerant Comp.), C‐33, 1984, pp. 518‐528.
• If checkpoints are performed in floating‐point arithmetic, then we can exploit the linearity of the mathematical relations on the object to maintain the checksums.
ABFT concept in an example
We want to perform z = λx + μy.
• x, y and z are distributed over 4 processors: Proc i holds Xi and Yi and computes Zi = λXi + μYi (i = 1 to 4).
• checkX: a 5th processor, Proc c, holds the checksum Xc = X1 + X2 + X3 + X4.
• checkY: Proc c also holds the checksum Yc = Y1 + Y2 + Y3 + Y4.
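The linearity argument behind this example can be checked numerically. This is a hedged sketch, not the author's implementation: it assumes that Proc c applies the same local update to the checksums Xc and Yc as the other processors apply to their data, and the scalar and vector values are chosen only for illustration.

```python
# Toy check that a sum checksum commutes with z = lam*x + mu*y.
lam, mu = 2.0, 3.0                 # illustrative scalars
x = [1.0, 2.0, 3.0, 4.0]           # X1..X4, one entry per processor
y = [0.5, 1.5, 2.5, 3.5]           # Y1..Y4

xc = sum(x)                        # checkX, held by Proc c
yc = sum(y)                        # checkY, held by Proc c

# Every processor, including Proc c, applies the same local operation.
z = [lam * xi + mu * yi for xi, yi in zip(x, y)]
zc = lam * xc + mu * yc            # assumed update on Proc c

# By linearity, zc is the checksum of z; no extra reduction was needed.
print(zc, sum(z))

# If Proc 2 is lost after the update, Z2 follows from zc and the survivors.
z2 = zc - (z[0] + z[2] + z[3])
print(z2, z[1])
```

With these exactly representable values both printed pairs agree; in general floating-point arithmetic the checksum relation holds only up to rounding error.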