Preliminary Investigations on Resilient Parallel Numerical Linear Algebra Solvers

  1. SIAM EX14 Workshop, July 7, Chicago, IL
     Preliminary Investigations on Resilient Parallel Numerical Linear Algebra Solvers
     Luc Giraud, joint work with E. Agullo, P. Salas, E. F. Yetkin, M. Zounon
     Funded by ANR RESCUE and G8-ECS
     HiePACS Inria Project, Joint Inria-CERFACS lab, INRIA Bordeaux Sud-Ouest

  2. Context
     ◮ HPC systems are not fault-free
     ◮ A faulty component (node, core, memory) loses all its data
     ◮ Simulations at exascale have to be resilient
     Resilience: ability to compute a correct output in the presence of faults
     ◮ Context: numerical linear algebra
     ◮ Goal: keep converging in the presence of faults
     ◮ Method: recover-restart strategy without checkpointing
     L. Giraud - Resilient numerical linear algebra solvers 2/25

  3. Outline
     Faults in HPC Systems
     Sparse linear systems
     Interpolation methods
     Numerical experiments
     Resilience in eigensolvers
     Concluding remarks and perspectives

  7. Faults in HPC Systems: Framework
     Forecast for extreme-scale systems
     ◮ Mean Time Between Failures (MTBF): less than one hour
     ◮ Checkpoint time might be larger than the MTBF
     Objectives
     ◮ Explore fault-tolerant schemes with less or no overhead
     ◮ Numerical algorithms to deal with the overhead issue
     Faults in this presentation
     ◮ Detected corrupted memory space (node crashes, damaged memory pages, uncorrected bit-flips, ...)

  9. Sparse linear systems
     Ax = b
     We attempt to design fault-tolerant solvers for sparse linear systems
     Two classes of iterative methods
     ◮ Stationary methods (Jacobi, Gauss-Seidel, ...)
     ◮ Krylov subspace methods (CG, GMRES, Bi-CGStab, ...)
     ◮ Krylov methods have attractive potential for extreme scale
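
To make the two classes of iterative methods concrete, here is a minimal dense NumPy sketch of one stationary method (Jacobi) and one Krylov method (CG). It is an illustration only, not the solvers used in the talk: the function names `jacobi` and `conjugate_gradient`, the stopping tolerance, and the use of dense matrices are all assumptions, and CG as written requires a symmetric positive definite matrix.

```python
import numpy as np

def jacobi(A, b, tol=1e-8, max_iter=10_000):
    """Stationary method: x <- D^{-1} (b - (A - D) x), with D = diag(A)."""
    d = np.diag(A)
    x = np.zeros_like(b)
    for k in range(1, max_iter + 1):
        x = (b - (A @ x - d * x)) / d
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            return x, k
    return x, max_iter

def conjugate_gradient(A, b, tol=1e-8, max_iter=10_000):
    """Krylov method for symmetric positive definite A (textbook CG)."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x for the zero initial guess
    p = r.copy()          # first search direction
    for k in range(1, max_iter + 1):
        Ap = A @ p
        rr = r @ r
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
        p = r + ((r @ r) / rr) * p
    return x, max_iter
```

On a small 1D Laplacian both functions return the same solution, but CG typically needs far fewer iterations than Jacobi.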

  11. Interpolation methods: Block row distribution
      [Figure: Ax = b, with the block rows of A, x and b distributed over P1 to P4]
      We distinguish two categories of data:
      ◮ Static data
      ◮ Dynamic data

  18. Interpolation methods
      [Figure: Ax = b distributed over P1 to P4; legend: static data, dynamic data, lost data; the block x1 held by P1 is set to 0]
      We distinguish two categories of data:
      ◮ Static data
      ◮ Dynamic data
      Let's assume that P1 fails
      ◮ Failed processor is replaced
      ◮ Static data are restored
      Reset: set x1 to its initial value

  19. Interpolation methods
      [Figure: Ax = b distributed over P1 to P4; legend: static data, dynamic data, lost data, interpolated data]
      Let's assume that P1 fails
      ◮ Failed processor is replaced
      ◮ Static data are restored
      Our algorithms aim at recovering x1 and then restarting
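
The difference between resetting x1 and recovering it can be sketched as follows. The slides shown here do not give the recovery formula, so this sketch assumes a linear interpolation: the lost block is obtained by solving the local system built from the restored static data (taken here to be A and b) and the surviving entries of x. The names `recover_reset`, `recover_interpolate`, and `example` are hypothetical.

```python
import numpy as np

def recover_reset(x, lost, x0):
    """Reset strategy: put the initial guess back into the lost entries."""
    x_new = x.copy()
    x_new[lost] = x0[lost]
    return x_new

def recover_interpolate(A, b, x, lost):
    """Assumed interpolation: solve
    A[lost, lost] y = b[lost] - A[lost, kept] x[kept]
    using the static data (A, b) and the surviving entries of x."""
    kept = np.setdiff1d(np.arange(len(b)), lost)
    rhs = b[lost] - A[np.ix_(lost, kept)] @ x[kept]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x_new

def example():
    """Compare the error of both strategies after losing the block of P1."""
    n = 8
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # SPD test matrix
    x_true = np.arange(1.0, n + 1)
    b = A @ x_true
    x = x_true + 1e-3 * np.sin(np.arange(n))  # current iterate, close to x_true
    lost = np.arange(0, 2)                    # entries owned by the failed P1
    x[lost] = np.nan                          # dynamic data lost in the fault
    x_reset = recover_reset(x, lost, np.zeros(n))
    x_interp = recover_interpolate(A, b, x, lost)
    return (np.linalg.norm(x_reset - x_true),
            np.linalg.norm(x_interp - x_true))
```

Calling `example()` returns the error norms of the reset and interpolated iterates. In this toy setting the interpolated iterate stays close to the solution, while the reset iterate discards the progress made on the lost block.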

  27. Interpolation methods: Overview of our fault tolerant algorithm
      [Figure: timeline of P1 to P4; legend: successful iteration, failed iteration, fault, interpolation, restart]
      ◮ Sequential simulations
      ◮ Simulation of parallel environment
      ◮ Generation of fault trace
      ◮ Realistic probability distribution
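
The overview above (fault, failed iteration, interpolation, restart) can be sketched as a sequential simulation of a block-row-distributed solver. Everything specific in this sketch is an assumption: the slides do not name the probability distribution, so a Weibull law is used as a stand-in, CG is used as a representative Krylov solver, the lost block is recovered by the linear interpolation assumed earlier, and the names `fault_trace` and `resilient_cg` are hypothetical.

```python
import numpy as np

def fault_trace(n_faults, scale, shape, n_procs, seed=0):
    """Hypothetical fault trace: {iteration: failed process}, with
    Weibull-distributed gaps (in iterations) between consecutive faults."""
    rng = np.random.default_rng(seed)
    gaps = np.ceil(scale * rng.weibull(shape, size=n_faults)).astype(int)
    iters = np.cumsum(np.maximum(gaps, 1))   # strictly increasing
    procs = rng.integers(0, n_procs, size=n_faults)
    return dict(zip(iters.tolist(), procs.tolist()))

def resilient_cg(A, b, n_procs, faults, tol=1e-10, max_iter=500):
    """Sequential simulation of CG with block row distribution.
    Returns (x, iterations used, number of restarts)."""
    n = len(b)
    blocks = np.array_split(np.arange(n), n_procs)  # rows owned by each process
    b_norm = np.linalg.norm(b)
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    restarts = 0
    for k in range(1, max_iter + 1):
        if k in faults:
            # Failed iteration: the process loses its block of x.
            lost = blocks[faults[k]]
            kept = np.setdiff1d(np.arange(n), lost)
            # Interpolation from static data (A, b) and surviving entries of x.
            rhs = b[lost] - A[np.ix_(lost, kept)] @ x[kept]
            x[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
            # Restart: rebuild the residual and search direction from x.
            r = b - A @ x
            p = r.copy()
            restarts += 1
            if np.linalg.norm(r) <= tol * b_norm:
                return x, k, restarts
            continue
        Ap = A @ p
        rr = r @ r
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k, restarts
        p = r + ((r @ r) / rr) * p
    return x, max_iter, restarts
```

A possible use is `resilient_cg(A, b, 4, fault_trace(3, 10.0, 0.7, 4))` for a symmetric positive definite `A`. The scale and shape values are arbitrary illustration parameters, not values from the talk.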
