Fault-Tolerant Linear Algebra: Goals and Methods.
Julien Langou, University of Colorado Denver
0 - GOALS
0.1 - ERASURE OR ERROR?
1 - METHODS
1.1 - ERASURE: DISKLESS CHECKPOINTING AND ROLLBACK
1.2 - ERASURE & ERROR: ABFT: ALGORITHM BASED FAULT TOLERANCE
1.3 - OTHERS
2 - ERROR: DETECTING/LOCATING/CORRECTING IN FLOATING-POINT ARITHMETIC
3 - NOVEL ABFT ALGORITHMS (GEMM, LU, QR, ETC.) (ERASURE OR ERROR)
4 - ABFT-BLAS LIBRARY
5 - ABFT-BLAS EXPERIMENTS (ERASURE)
Goals
- Perform reliable and efficient computation with unreliable units.
- Unreliable units: process crash, hardware failure, erroneous communication, erroneous computation, …
- Our method: at the algorithm level.
- Motivation: cost effective, large unit count.
Erasure Problem vs. Error Problem

[Figure: four processors P1-P4, each computing a local sum (P1: 1+1 = 2, P2: 2+2 = 4, P3: 3+3 = 6, P4: 4+4 = 8). Erasure case: processor 2 is lost. Error case: processor 2 returns the incorrect result 5.]

Erasure Problem:
- we know whether there is an erasure or not,
- we know where the erasure is,
- so we only need to recover.

Error Problem:
- we do not know if there is an error,
- assuming we know that an error occurs, we do not know where it is,
- we also need to recover.
1.1 - ERASURE: DISKLESS CHECKPOINTING AND ROLLBACK
Diskless checkpointing

[Figure: four processors P1-P4 plus a checksum processor Pc. Steps: add a 5th processor and perform a checksum Pc = P1 + P2 + P3 + P4 (MPI_Reduce); ready for the computations; a processor is lost; recover the processor (FT-MPI); recover the data with P2 = Pc - P1 - P3 - P4 (MPI_Reduce); ready for the computations again.]
Diskless checkpointing (remarks)
- You can use either floating-point arithmetic or binary arithmetic for the checksum.
- Multiple failures/errors are supported through the Reed-Solomon algorithm, which is optimal in the sense that, to support p simultaneous failures/errors, you only need to add p processes.
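The checkpoint/recovery steps of the slides can be sketched serially; this is a minimal illustration (the function names and data are invented for the example, not the talk's actual code; in the real setting the sums are MPI_Reduce operations and the process restart comes from FT-MPI):

```python
# Diskless checkpointing sketch: a 5th "processor" Pc holds the
# entrywise floating-point sum of all processors' local data.

def make_checksum(data_per_proc):
    """Pc = P1 + P2 + ... + Pn, entrywise (an MPI_Reduce in the real code)."""
    n = len(data_per_proc[0])
    return [sum(d[i] for d in data_per_proc) for i in range(n)]

def recover(lost, data_per_proc, checksum):
    """Rebuild the lost processor's data: P_lost = Pc - sum(survivors)."""
    n = len(checksum)
    return [checksum[i] - sum(d[i] for p, d in enumerate(data_per_proc) if p != lost)
            for i in range(n)]

# Four processors, each holding a small local vector.
procs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pc = make_checksum(procs)        # the checksum processor's data

# Processor 2 (index 1) is lost; recover its data from the survivors.
restored = recover(1, procs, pc)
print(restored)                  # [3.0, 4.0]
```

Supporting p simultaneous failures replaces the single sum by p Reed-Solomon checksum rows, but the recover step stays a small linear solve of the same flavor.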
Time for an MPI_Reduce (using MVAPICH) over InfiniBand on jacquard.nersc.gov

[Figure: MPI_Reduce time (0 to 2.5 sec) vs. number of processors (64, 81, 100, 121, 256) for message sizes 7.6 MB, 30.5 MB, 68.7 MB, and 122.0 MB.]
1.2 - ERASURE & ERROR: ABFT: ALGORITHM BASED FAULT TOLERANCE
ABFT = Algorithm-Based Fault Tolerance.
- K. Huang, J. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Trans. on Comp. (Spec. Issue Reliable & Fault-Tolerant Comp.), C-33, 1984, pp. 518-528.
- If checkpoints are performed in floating-point arithmetic then we can exploit the linearity of the mathematical relations on the object to maintain the checksums.
ABFT concept in an example

We want to perform z = λx + μy.

[Figure: x and y are distributed over Proc 1-4 as (x1, x2, x3, x4) and (y1, y2, y3, y4); a checksum processor Proc c holds xc = x1 + x2 + x3 + x4 and yc = y1 + y2 + y3 + y4. Each processor computes zi = λxi + μyi; Proc c computes zc = λxc + μyc, which is the checksum of z.]

No overhead to compute the checksum of z. Property used: (λx1 + μy1) + (λx2 + μy2) = λ(x1 + x2) + μ(y1 + y2), i.e., distributivity of external multiplication over internal addition, and associativity of internal addition.
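The z = λx + μy example above fits in a few lines; this serial sketch (the data is illustrative) shows the key point that the checksum processor's local axpy already produces the checksum of z:

```python
# ABFT axpy: if xc and yc are the checksums (entry sums) of x and y,
# then lam*xc + mu*yc is automatically the checksum of z = lam*x + mu*y.
lam, mu = 2.0, 3.0
x = [1.0, 2.0, 3.0, 4.0]    # one entry per processor
y = [4.0, 3.0, 2.0, 1.0]
xc, yc = sum(x), sum(y)     # checksums held by Proc c

z  = [lam * xi + mu * yi for xi, yi in zip(x, y)]
zc = lam * xc + mu * yc     # Proc c performs the same local axpy

print(zc, sum(z))           # 50.0 50.0 -- no extra reduction needed
```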
ABFT summary.
- Relies on floating-point arithmetic checksums.
- Exploits the checksum processors.
- Algorithms exist for any linear operation:
  - AXPY, SCAL (BLAS1)
  - GEMV (BLAS2)
  - GEMM (BLAS3)
  - LU, QR, Cholesky (LAPACK)
  - FFT
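For GEMM, the classic Huang-Abraham encoding works the same way: append a column-checksum row to A and a row-checksum column to B, and their product is the fully encoded C. A minimal pure-Python sketch (small dense matrices, illustrative data):

```python
# Checksum matrix product: Ac = [A; e^T A], Br = [B, B e];
# then Ac @ Br = [C, C e; e^T C, e^T C e], the fully encoded C.
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]

Ac = A + [[sum(col) for col in zip(*A)]]    # column-checksum row
Br = [row + [sum(row)] for row in B]        # row-checksum column

Cf = matmul(Ac, Br)                         # fully checksummed C
C  = matmul(A, B)

# Row and column checksums of C come out consistent for free.
assert all(abs(Cf[i][-1] - sum(C[i])) < 1e-12 for i in range(2))
assert all(abs(Cf[-1][j] - sum(C[i][j] for i in range(2))) < 1e-12 for j in range(2))
```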
Our contribution (1)
- The lack of generalization in the previous approaches has restricted the number of algorithms used: Cholesky, QR through Gram-Schmidt, LU without pivoting.
- With an ABFT BLAS and the LAPACK algorithms, we have developed:
  - QR with Householder reflections,
  - LU with pivoting,
  - Hessenberg reduction.
Our contribution (2)
- If there is no error then ABFT guarantees that the checksums of L and U are consistent at the end of the LU factorization.
- If there is an error, you can then detect it.
- However, you cannot correct it in the case where the error propagates.
- (Then why even use ABFT?)
- We have a light-weight mechanism ( O(n^2) ) to detect errors before they propagate.
Our contribution (3)
- Error-correcting codes are known to be unstable in floating-point arithmetic.
- We have developed a stable error-correcting code (although naive and not optimal, it works for us and is efficient enough).
Our contributions
- 1. Generalize ABFT to ``all'' LAPACK algorithms.
- 2. Avoid error propagation.
- 3. Stable error-correcting code in floating-point arithmetic.
=> Maybe ABFT might work after all!
2 - ERROR: DETECTING/LOCATING/CORRECTING IN FLOATING-POINT ARITHMETIC
Error DETECTION: Residual checking
- To detect an error in C ← αA*B + βCin (1)
- 1. Save Cin.
- 2. Perform C ← αA*B + βCin.
- 3. Take a random (vector) x, check that || Cx − ( α( A*( B x )) + ( βCin x ) ) || < ε.
- 4. If the check is not good, start again from step 2.
- Works with almost anything (e.g. A = VDV^T).
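The residual check above costs only a few matrix-vector products (O(n^2)) instead of redoing the O(n^3) multiplication. A minimal pure-Python sketch (data and tolerance are illustrative):

```python
# Residual checking for C = alpha*A*B + beta*Cin: compare C x against
# alpha*A*(B*x) + beta*(Cin*x) for a random vector x.
import random
random.seed(0)   # deterministic for the demo

def matvec(M, v):
    return [sum(m * vj for m, vj in zip(row, v)) for row in M]

def residual_check(C, alpha, A, B, beta, Cin, tol=1e-8):
    x = [random.random() for _ in range(len(C[0]))]
    lhs = matvec(C, x)
    rhs = [alpha * a + beta * c
           for a, c in zip(matvec(A, matvec(B, x)), matvec(Cin, x))]
    return max(abs(l - r) for l, r in zip(lhs, rhs)) < tol

A   = [[1.0, 2.0], [3.0, 4.0]]
B   = [[5.0, 6.0], [7.0, 8.0]]
Cin = [[1.0, 1.0], [1.0, 1.0]]
alpha, beta = 2.0, 1.0
AB = [[19.0, 22.0], [43.0, 50.0]]           # A @ B, worked out by hand
C  = [[alpha * ab + beta * c for ab, c in zip(ra, rc)]
      for ra, rc in zip(AB, Cin)]

ok_before = residual_check(C, alpha, A, B, beta, Cin)
C[1][0] += 1.0                               # inject an error
ok_after = residual_check(C, alpha, A, B, beta, Cin)
print(ok_before, ok_after)                   # True False
```

A single random x misses an error only if x happens to be (nearly) orthogonal to the error, which has negligible probability; repeating the check with a fresh x drives that probability down further.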
Encoding

- Error detection: encode x as y = G x, where G is a 2-by-n checksum matrix; to check a (possibly corrupted) copy x_ of x, recompute G x_ and compare it with the stored y.
- Note that G could have been 1-by-n (no need for 2-by-n) for detection alone.
- Note that correction is possible if the location is known.
- Location: form the syndrome s = y − y_ = G e, where e = x − x_, and solve G e = s under the constraint that e has one nonzero. In other words, find the column of G collinear to s :: location problem solved.

Example with a random 2-by-8 checksum matrix:

  G = [ 3 1 3 1 0 1 7 9 ]
      [ 2 5 8 2 1 1 1 3 ]

  G e = s with e = (0 0 0 0 0 3 0 0)^T gives s = (3 3)^T, which is 3 times column 6 of G.

Reed-Solomon:

  G = [ 1 1 1 1 1 1 1 1 ]
      [ 1 2 3 4 5 6 7 8 ]

  G e = s with e = (0 0 0 2 0 0 0 0)^T gives s = (2 8)^T, so the error has magnitude s1 = 2 and location s2/s1 = 4.

- Two errors: solve G e = s under the constraint that e has two nonzeros; in other words, find the unique two columns of G that generate s. Complexity: choose(n,2) = 28 for n = 8.

Comparison of the encodings (nberr = number of errors):

  Encoding      | Stable | Recovery | Cost    | Extra memory
  Reed-Solomon  |   ✗    |    ✓     | nberr.  |      ✓
  Random        |   ✓    |    ✗     | nberr.  |      ✓
  Coordinate    |   ✓    |    ✓     | sqrt(n) |      ✗

[Figure: the "Coordinate" encoding illustrated on x = (1, 2, …, 8), with checksum y.]
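The single-error locate-and-correct step with the Reed-Solomon rows above can be sketched in a few lines (the data vector is illustrative):

```python
# Locating one error with the 2-row Reed-Solomon code G = [[1,...,1],
# [1,2,...,8]]: the syndrome s = G*x_bad - y equals G*e; with one
# error, s is collinear to exactly one column of G, and the ratio
# s[1]/s[0] names that column, while s[0] gives the error magnitude.
n = 8
G = [[1.0] * n, [float(j + 1) for j in range(n)]]
x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
y = [sum(g * xi for g, xi in zip(row, x)) for row in G]   # stored checksums

x_bad = list(x)
x_bad[3] += 2.0                                           # corrupt entry 4
y_bad = [sum(g * xi for g, xi in zip(row, x_bad)) for row in G]

s = [yb - yi for yb, yi in zip(y_bad, y)]                 # syndrome = G*e
loc = round(s[1] / s[0]) - 1                              # error position
x_bad[loc] -= s[0]                                        # error magnitude

print(loc, x_bad == x)                                    # 3 True
```

With two errors the same idea needs a search over pairs of columns, which is where the choose(n,2) cost in the table comes from.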
Timing for recovery

[Figure: recovery time (sec) vs. n, the size of x.]

Accuracy comparison after recovery:

[Figure: max( |xi − yi| / |xi| ) vs. n, the size of x, for maxerr = nberr = 3, 4, and 10.]
3 - NOVEL ABFT ALGORITHMS (GEMM, LU, QR, ETC.) (ERASURE OR ERROR)
LU factorization:

[Figure: blocked LU factorization, For j = 1:nb:n. The computation starts from the encoding of A together with its row checksum Ar, its column checksum Ac, and the full checksum Af. At each step the panel is factored (with pivoting P) and the trailing matrix and the checksums Br, Bc are updated, so that at the end the factors L and U come with consistent checksums Lc and Ur.]
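Why the checksums of the LU factors stay consistent can be seen already on unblocked Gaussian elimination without pivoting; this is an illustration of the underlying invariant, not the talk's actual ABFT-LU code:

```python
# Append the column-checksum row e^T A to A and run Gaussian
# elimination on the (n+1) x n augmented matrix, eliminating the
# checksum row along with the others.  The invariant checked below is
# what ABFT exploits: after each step k, the checksum row still equals
# the sum of the not-yet-pivoted rows of the partially eliminated matrix.
n = 3
A  = [[4.0, 2.0, 1.0], [2.0, 5.0, 3.0], [1.0, 3.0, 6.0]]
Ac = [row[:] for row in A] + [[sum(col) for col in zip(*A)]]

ok = True
for k in range(n):
    for i in range(k + 1, n + 1):          # includes the checksum row
        m = Ac[i][k] / Ac[k][k]
        Ac[i] = [aij - m * akj for aij, akj in zip(Ac[i], Ac[k])]
    # invariant: checksum row == sum of the remaining rows
    for j in range(n):
        ok &= abs(Ac[n][j] - sum(Ac[i][j] for i in range(k + 1, n))) < 1e-10

print(ok)   # True
```

The multiplier applied to the checksum row is the sum of the multipliers applied to the ordinary rows, which is exactly the linearity argument from the ABFT slides; pivoting only permutes which rows enter the sum.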
4 - ABFT-BLAS LIBRARY
Experiments on MM
- Goal:
  - Write an FT-PDGEMM (Fault-Tolerant parallel Matrix-Matrix multiply).
- Testing:
  - Perform FT-PDGEMM in a loop and check the results with residual checking; on top of this, add an automatic process killer.

[Figure: the residual check compares Cout x against α A ( B x ) + β Cin x for the computation C ← αAB + βCin.]
ABFT-BLAS: a Parallel Fault-Tolerant BLAS library based on ABFT techniques
- Constructed on top of FT-MPI.
- Provides users with a fault-tolerant environment:
  - Detects failures.
  - Recovers data automatically.
  - Enables the user to stack computational routines one on top of the others (two general problems).
  - Goal: research library for conducting experiments on fault tolerance.
- Provides developers with an automatic process killer.
EXAMPLE CODE

int rc;
struct Vector v;
struct Matrix a;
struct Dataworld worldmpi;
struct Global_ddata normv;
struct Global_idata nbr_iter;
…
rc = MPI_Init(&argc, &argv);
rc = init_world(&worldmpi, p, q, rc);
rc = get_info_on_grid(&worldmpi, &me, &myrow, &mycol, &nprow, &npcol);
…
rc = allocate_vector(&v, POS_ROW, 0, nb_n, &worldmpi, "v");
rc = allocate_matrix(&a, m, n, nb_m, nb_n, &worldmpi, "a");
rc = allocate_dglobal(&normv, 1, &worldmpi);
rc = allocate_iglobal(&nbr_iter, 1, &worldmpi);
…
if (!worldmpi.recovering) {
    /* ... here goes the user code to initialize objects ... */
    rc = make_checksum_matrix(&a, &worldmpi);
    rc = make_checksum_vector(&v, &worldmpi);
}
if (worldmpi.user_state == 0) {
    rc = gdnrm2(&worldmpi, &v, normv.data);
    worldmpi.user_state = 1;
}
if (worldmpi.user_state == 1) {
    /* ... here goes any call to the ABFT-BLAS numerical routines ... */
    worldmpi.user_state = 2;
}
free_vector(&v);
free_matrix(&a);
free_dglobal(&normv);
free_iglobal(&nbr_iter);
exit(0);
5 - ABFT-BLAS EXPERIMENTS (ERASURE)
PDGEMM-SUMMA

[Figure: PDGEMM (SUMMA algorithm): For k = 1:nb:n, C ← A(:, k-block) * B(k-block, :) + C.]

ABFT-PDGEMM.
- The algorithm naturally maintains the consistency of the checkpoints of the matrix C.

[Figure: ABFT-PDGEMM: the same rank-nb update loop, For k = 1:nb:n, applied to the checksum-encoded A, B, and C.]
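Why the SUMMA loop keeps C's checkpoints consistent can be sketched serially (tiny illustrative matrices, rank-1 instead of rank-nb updates): every partial sum of outer products of the encoded operands is itself a consistently encoded matrix.

```python
# SUMMA-style accumulation on encoded operands: A gets a column-
# checksum row, B gets a row-checksum column, and the encoded C stays
# consistent after EVERY update step, not just at the end.
def encode_cols(A):
    """Append a column-checksum row (sum of each column)."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]

def encode_rows(B):
    """Append a row-checksum column (sum of each row)."""
    return [row[:] + [sum(row)] for row in B]

n = 2
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
Ae, Be = encode_cols(A), encode_rows(B)     # (n+1) x n and n x (n+1)
Ce = [[0.0] * (n + 1) for _ in range(n + 1)]

ok = True
for k in range(n):                          # one rank-1 update per step
    for i in range(n + 1):
        for j in range(n + 1):
            Ce[i][j] += Ae[i][k] * Be[k][j]
    # both checksums of C are consistent after each step
    ok &= all(abs(Ce[i][n] - sum(Ce[i][:n])) < 1e-9 for i in range(n + 1))
    ok &= all(abs(Ce[n][j] - sum(Ce[i][j] for i in range(n))) < 1e-9
              for j in range(n + 1))

print(ok, Ce[0][0], Ce[1][1])   # True 19.0 50.0
```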
jacquard.nersc.gov
- Processor type: Opteron 2.2 GHz
- Processor theoretical peak: 4.4 GFlops/sec
- Number of application processors: 712
- System theoretical peak (computational nodes): 3.13 TFlops/sec
- Number of shared-memory application nodes: 356
- Processors per node: 2
- Physical memory per node: 6 GBytes
- Usable memory per node: 3-5 GBytes
- Switch interconnect: InfiniBand
- Switch MPI unidirectional latency: 4.5 μsec
- Switch MPI unidirectional bandwidth (peak): 620 MB/s
- Global shared disk: GPFS, usable disk space 30 TBytes
- Batch system: PBS Pro
MVAPICH vs. FT-MPI

[Figure: GFlops/sec/proc vs. number of processors (64 to 454) for matrix sizes 1000 to 4000, comparing MVAPICH and FT-MPI.]
FT-PDGEMM -- nloc = 4,000

[Figure: left, GFlops/sec/proc vs. number of processors (64 to 484) for FT off, FT on, and FT on with 1 fault; right, the corresponding aggregate GFlops/sec.]
Performance modeling

[Figure: GFlops/sec/proc vs. number of processors (100, 400, 1024): Model SUMMA, Model ABFT, Measured SUMMA, Measured ABFT.]
Strong scalability

[Figure: GFlops/sec/proc vs. number of processors (100, 400, 1024) for SUMMA and ABFT at nloc = 10,000 through 60,000.]

ABFT represents the only known alternative to address fault tolerance in strong scalability.
ABFT advantages over diskless checkpointing
- Independent of the surface (n^2) / volume (n^3) ratio.
  - Important for n/n operations (e.g. FFT).
  - Important for MM with small n.
- Independent of the failure rate.
  - No need to guess parameters.
- Fits nicely in the algorithm.
  - No need for explicit synchronization, for example.
ABFT disadvantage over diskless checkpointing
- Relies on floating-point arithmetic checksums.
  - Cancellation, ill-conditioned matrices, etc.