 
SIAM EX14 Workshop, July 7, Chicago, IL
Preliminary Investigations on Resilient Parallel Numerical Linear Algebra Solvers
Luc Giraud
Joint work with E. Agullo, P. Salas, E. F. Yetkin, M. Zounon
Funded by ANR RESCUE and G8-ECS
HiePACS Inria Project, Joint Inria-CERFACS lab, INRIA Bordeaux Sud-Ouest
Context
◮ HPC systems are not fault-free
◮ A faulty component (node, core, memory) loses all its data
◮ Simulations at exascale have to be resilient
Resilience: the ability to compute a correct output in the presence of faults
◮ Context: numerical linear algebra
◮ Goal: keep converging in the presence of faults
◮ Method: recover-restart strategy without checkpointing
L. Giraud - Resilient numerical linear algebra solvers 2/ 25
Outline
◮ Faults in HPC Systems
◮ Sparse linear systems
◮ Interpolation methods
◮ Numerical experiments
◮ Resilience in eigensolvers
◮ Concluding remarks and perspectives
Faults in HPC Systems: Framework
Forecast for extreme scale systems
◮ Mean Time Between Failures (MTBF): less than one hour
◮ Checkpoint time might be larger than the MTBF
Objectives
◮ Explore fault-tolerant schemes with little or no overhead
◮ Use numerical algorithms to address the overhead issue
Faults in this presentation
◮ Detected corrupted memory space (node crashes, damaged memory pages, uncorrected bit-flips, ...)
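To give a feel for why checkpointing becomes costly when the MTBF drops below one hour, here is a small back-of-the-envelope sketch. It is not from the talk: it assumes Young's classical first-order estimate of the checkpoint period, sqrt(2 * C * MTBF), which is only a good approximation when the checkpoint cost C is small compared with the MTBF. The function names and the sample checkpoint costs are illustrative assumptions.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order estimate of the compute time between checkpoints.

    Assumption (not from the slides): period = sqrt(2 * C * MTBF).
    Both arguments are in seconds.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def checkpoint_overhead_fraction(checkpoint_cost, mtbf):
    """Fraction of wall-clock time spent writing checkpoints.

    Lost work and restart time after a fault are ignored, so this is
    an optimistic lower bound on the real overhead.
    """
    period = young_interval(checkpoint_cost, mtbf)
    return checkpoint_cost / (period + checkpoint_cost)


if __name__ == "__main__":
    mtbf = 3600.0  # one hour, the forecast upper bound quoted on the slide
    for cost in (60.0, 600.0, 1800.0):  # hypothetical checkpoint costs
        period = young_interval(cost, mtbf)
        share = checkpoint_overhead_fraction(cost, mtbf)
        print(f"C = {cost:6.0f} s  period = {period:7.1f} s  overhead = {share:.1%}")
```

Under these assumptions, a 30-minute checkpoint with a one-hour MTBF gives a one-hour period, so about a third of the machine time goes to writing checkpoints before any fault has even been handled. This is the regime where checkpoint-free recovery becomes attractive.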
Sparse linear systems
We attempt to design fault-tolerant solvers for sparse linear systems Ax = b
Two classes of iterative methods
◮ Stationary methods (Jacobi, Gauss-Seidel, ...)
◮ Krylov subspace methods (CG, GMRES, Bi-CGStab, ...)
◮ Krylov methods have attractive potential for extreme-scale computing
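As a reminder of what a Krylov iteration looks like, here is a minimal sketch of unpreconditioned CG in plain Python. It is a textbook version written for illustration, not the solver used in the talk. It assumes a symmetric positive definite matrix stored as dense lists, and the 1D Laplacian test matrix and helper names are assumptions chosen to keep the example self-contained.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive definite A.

    Stops when ||r|| <= tol * ||b||. Returns (x, iterations used).
    """
    x = list(x0)
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    p = list(r)
    rs = dot(r, r)
    b_norm = dot(b, b) ** 0.5 or 1.0
    for k in range(max_iter):
        if rs ** 0.5 <= tol * b_norm:
            return x, k
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, max_iter


def laplacian_1d(n):
    """Tridiagonal (-1, 2, -1) matrix, a standard SPD test case."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    n = 8
    A = laplacian_1d(n)
    x_true = [float(i + 1) for i in range(n)]
    b = matvec(A, x_true)
    x, iters = conjugate_gradient(A, b, [0.0] * n)
    print("iterations:", iters)
    print("max error:", max(abs(u - v) for u, v in zip(x, x_true)))
```

The state that a fault can destroy is visible here: `x`, `r` and `p` change at every iteration, whereas `A` and `b` never do.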
Interpolation methods: Block row distribution
[Figure: A, x and b in Ax = b split into block rows owned by P1 to P4. Legend: static data, dynamic data, lost data, interpolated data.]
We distinguish two categories of data:
◮ Static data
◮ Dynamic data
Let's assume that P1 fails
◮ Failed processor is replaced
◮ Static data are restored
Reset: set x_1 to its initial value
Our algorithms aim at recovering x_1 and restarting
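The difference between resetting x_1 and interpolating it can be sketched as follows. The slides do not spell out the interpolation formula at this point, so the sketch assumes a linear interpolation that solves the local block equation A_11 x_1 = b_1 - sum over j != 1 of A_1j x_j, using the restored static data (A, b) and the surviving entries of x. All function names are hypothetical, and the dense Gaussian elimination stands in for whatever local solver a real code would use.

```python
def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def reset_block(x, lost, x0):
    """Reset strategy: lost entries fall back to the initial guess x0."""
    y = list(x)
    for i in lost:
        y[i] = x0[i]
    return y


def interpolate_block(A, b, x, lost):
    """Assumed linear interpolation: solve A[lost, lost] y = b[lost] - A[lost, kept] x[kept]."""
    lost_set = set(lost)
    kept = [j for j in range(len(x)) if j not in lost_set]
    A_ll = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    y = list(x)
    for i, v in zip(lost, solve_dense(A_ll, rhs)):
        y[i] = v
    return y


def laplacian_1d(n):
    """Tridiagonal (-1, 2, -1) test matrix (an illustrative choice)."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    n = 8
    A = laplacian_1d(n)
    x_true = [float(i + 1) for i in range(n)]
    b = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]
    # Current iterate: close to the solution, then P1 (rows 0 and 1) fails.
    x = [v + 1e-3 for v in x_true]
    lost = [0, 1]
    x_reset = reset_block(x, lost, [0.0] * n)
    x_interp = interpolate_block(A, b, x, lost)
    err = lambda y: max(abs(u - v) for u, v in zip(y, x_true))
    print("error after reset:        ", err(x_reset))
    print("error after interpolation:", err(x_interp))
```

In this toy setting, the reset throws away the progress made on the lost block, while the interpolated block inherits the accuracy of the surviving entries. If the surviving entries were exact, the interpolated block would be exact as well.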
Interpolation methods: Overview of our fault tolerant algorithm
[Figure: timeline of P1 to P4 over time. Legend: fault, successful iteration, failed iteration, interpolation, restart.]
◮ Sequential simulations
◮ Simulation of parallel environment
◮ Generation of fault trace
◮ Realistic probability distribution
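The fault-trace idea can be sketched in a few lines. The slide only says that a realistic probability distribution is used. The Weibull law below is an assumption made for illustration (it is a common model for hardware failures), and the scale, shape, iteration time and function names are all hypothetical.

```python
import random


def generate_fault_trace(n_procs, horizon, scale, shape, seed=0):
    """Draw per-processor inter-fault times from a Weibull law (assumed model).

    Returns a time-sorted list of (fault_time, processor) pairs with
    fault_time < horizon.
    """
    rng = random.Random(seed)
    trace = []
    for p in range(n_procs):
        t = rng.weibullvariate(scale, shape)
        while t < horizon:
            trace.append((t, p))
            t += rng.weibullvariate(scale, shape)
    return sorted(trace)


def classify_iterations(trace, iter_time, horizon):
    """Map faults onto fixed-length iteration slots.

    Entry k lists the processors that fail during iteration k.
    An empty list means the iteration is successful.
    """
    n_iters = int(horizon // iter_time)
    hit = [[] for _ in range(n_iters)]
    for t, p in trace:
        k = int(t // iter_time)
        if k < n_iters:
            hit[k].append(p)
    return hit


if __name__ == "__main__":
    trace = generate_fault_trace(n_procs=4, horizon=100.0, scale=60.0, shape=0.7)
    slots = classify_iterations(trace, iter_time=1.0, horizon=100.0)
    for k, procs in enumerate(slots):
        if procs:
            # In a sequential simulation, this is where the lost block would be
            # interpolated and the solver restarted from the recovered iterate.
            print(f"iteration {k}: failed on processors {procs}")
```

Fixing the seed makes a trace reproducible, so different recovery strategies can be compared against exactly the same sequence of faults.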