Preliminary Investigations on Resilient Parallel Numerical Linear Algebra Solvers

SIAM EX14 Workshop July 7, Chicago - IL Preliminary Investigations on Resilient Parallel Numerical Linear Algebra Solvers Luc Giraud joint work with E. Agullo , P. Salas , E. F. Yetkin , M. Zounon funded by ANR RESCUE and G8-ECS HiePACS Inria


slide-1
SLIDE 1

SIAM EX14 Workshop July 7, Chicago - IL

Preliminary Investigations on Resilient Parallel Numerical Linear Algebra Solvers

Luc Giraud

joint work with

  • E. Agullo, P. Salas, E. F. Yetkin, M. Zounon

funded by ANR RESCUE and G8-ECS

HiePACS Inria Project Joint Inria-CERFACS lab INRIA Bordeaux Sud-Ouest

slide-2
SLIDE 2

Context

  • L. Giraud - Resilient numerical linear algebra solvers

Resilience: Ability to compute a correct output in the presence of faults

◮ Context: Numerical linear algebra
◮ Goal: Keep converging in the presence of faults
◮ Method: Recover-restart strategy without checkpointing
◮ HPC systems are not fault-free
◮ A faulty component (node, core, memory) loses all its data
◮ Simulations at exascale have to be resilient

slide-3
SLIDE 3

Outline

◮ Faults in HPC Systems
◮ Sparse linear systems
◮ Interpolation methods
◮ Numerical experiments
◮ Resilience in eigensolvers
◮ Concluding remarks and perspectives

slide-7
SLIDE 7

Faults in HPC Systems

Framework

Forecast for extreme scale systems

◮ Mean Time Between Failures (MTBF): less than one hour
◮ Checkpoint time might be larger than MTBF

Objectives

◮ Explore fault-tolerant schemes with less/no overhead
◮ Numerical algorithms to deal with the overhead issue

Faults in this presentation

◮ Detected corrupted memory space (node crashes, damaged memory pages, uncorrected bit-flips, ...)
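The forecast above can be made concrete with a back-of-the-envelope sketch. It assumes, purely for illustration, that failures are exponentially distributed with mean equal to the MTBF (the slides do not state a failure model). Under that assumption, the probability that a checkpoint of duration C completes before the next failure is exp(-C / MTBF). The function name and the sample checkpoint durations are hypothetical.

```python
import math


def checkpoint_success_probability(checkpoint_time_s: float, mtbf_s: float) -> float:
    """P(no failure during one checkpoint), assuming exponential failures."""
    return math.exp(-checkpoint_time_s / mtbf_s)


if __name__ == "__main__":
    mtbf = 3600.0  # one hour, the upper bound quoted on the slide
    for minutes in (5, 30, 60, 120):  # hypothetical checkpoint durations
        p = checkpoint_success_probability(minutes * 60.0, mtbf)
        print(f"checkpoint of {minutes:3d} min: success probability {p:.3f}")
```

Once the checkpoint time approaches the MTBF, most checkpoints are interrupted before they finish, which is the motivation for schemes that avoid checkpointing.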

slide-9
SLIDE 9

Sparse linear systems

Ax = b

We attempt to design fault-tolerant solvers for sparse linear systems.

Two classes of iterative methods

◮ Stationary methods (Jacobi, Gauss-Seidel, ...)
◮ Krylov subspace methods (CG, GMRES, Bi-CGStab, ...)
◮ Krylov methods have attractive potential for extreme scale
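To illustrate the two classes named above, here is a minimal dense NumPy sketch of one stationary method (Jacobi) and one Krylov method (unpreconditioned CG). It is not the solver code used in the talk; the function names and the 1D Poisson test matrix are choices made for this example, and CG as written assumes a symmetric positive definite matrix.

```python
import numpy as np


def jacobi(A, b, x0, n_iters):
    """Stationary method: x <- D^{-1} (b - (A - D) x), D = diag(A)."""
    d = np.diag(A)
    x = x0.copy()
    for _ in range(n_iters):
        x = (b - (A @ x - d * x)) / d
    return x


def conjugate_gradient(A, b, x0, tol=1e-10, max_iters=1000):
    """Krylov method for SPD matrices; returns (solution, iterations used)."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for k in range(max_iters):
        if np.sqrt(rs) <= tol * b_norm:
            return x, k
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iters


def poisson_1d(n):
    """Tridiagonal SPD test matrix (1D Poisson), chosen for illustration."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
```

On this test matrix Jacobi reduces the error slowly at every sweep, while CG builds a Krylov subspace and reaches the tolerance in far fewer iterations. That subspace is exactly the dynamic state that is lost when a restart is forced by a fault.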

slide-18
SLIDE 18

Interpolation methods

Block row distribution

[Figure: Ax = b distributed by block rows over processes P1, P2, P3, P4. Legend: static data, dynamic data, lost data, interpolated data]

We distinguish two categories of data:

◮ Static data
◮ Dynamic data

Let’s assume that P1 fails

◮ Failed processor is replaced
◮ Static data are restored

Reset: Set (x1) to initial value

Our algorithms aim at recovering x1 and restarting
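The block row distribution and the Reset strategy can be sketched in a few lines of NumPy. This is an illustration only: it assumes, as the figure legend suggests, that the iterate x is the dynamic data, and it emulates the processes sequentially. The helper names are hypothetical.

```python
import numpy as np


def block_row_ranges(n, n_procs):
    """Split row indices 0..n-1 into n_procs contiguous blocks (one per process)."""
    bounds = np.linspace(0, n, n_procs + 1).astype(int)
    return [slice(int(bounds[p]), int(bounds[p + 1])) for p in range(n_procs)]


def reset_recovery(x, x_initial, lost):
    """Reset: entries owned by the failed process go back to the initial guess."""
    x_new = x.copy()
    x_new[lost] = x_initial[lost]
    return x_new


if __name__ == "__main__":
    blocks = block_row_ranges(8, 4)        # P1..P4 own two rows each
    x = np.arange(8, dtype=float)          # current iterate (dynamic data)
    x0 = np.zeros(8)                       # initial guess
    print(reset_recovery(x, x0, blocks[0]))  # P1 failed, its entries are reset
```

Reset discards everything the failed process had computed, which is why the interpolation strategies try to rebuild x1 from the surviving entries instead.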

slide-30
SLIDE 30

Interpolation methods

Overview of our fault tolerant algorithm

[Figure: timeline of processes P1, P2, P3, P4. Legend: fault, successful iteration, failed iteration, interpolation, restart]

◮ Sequential simulations
◮ Simulation of parallel environment
◮ Generation of fault trace
◮ Realistic probability distribution
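A fault trace for such a sequential simulation could be generated along the following lines. The slides only say "realistic probability distribution"; the Weibull law, its shape parameter, and the function name below are assumptions made for this sketch, not details taken from the talk.

```python
from math import gamma

import numpy as np


def generate_fault_trace(n_iters, mean_iters_between_faults, n_procs,
                         shape=0.7, seed=0):
    """Return a list of (iteration, failed_process) pairs, ordered by iteration.

    Gaps between faults are drawn from a Weibull distribution (assumed here),
    scaled so that the mean gap equals mean_iters_between_faults.
    """
    rng = np.random.default_rng(seed)
    # Mean of a Weibull(shape, scale) variable is scale * Gamma(1 + 1/shape).
    scale = mean_iters_between_faults / gamma(1.0 + 1.0 / shape)
    trace, t = [], 0.0
    while True:
        t += scale * rng.weibull(shape)
        if t >= n_iters:
            return trace
        # The failed process is picked uniformly at random.
        trace.append((int(t), int(rng.integers(n_procs))))


if __name__ == "__main__":
    for iteration, proc in generate_fault_trace(1000, 100.0, 4, seed=1):
        print(f"iteration {iteration}: process P{proc + 1} fails")
```

The solver loop then checks the trace at each iteration: when a fault is scheduled, the iteration is marked as failed, the lost block is interpolated, and the solver restarts from the recovered iterate.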

slide-34
SLIDE 34

Interpolation methods

Interpolation methods

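Fault in linear system

\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} ? \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
\]

  • How to recover x1?

Linear Interpolation (LI)

[Langou, Chen, Bosilca, Dongarra, SISC, 2007]

Solve \( A_{11} x_1 = b_1 - A_{12} x_2 \)

Least Squares Interpolation (LSI)

\[
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x_1
+ \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2
= \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
\]

\[
x_1 = \operatorname*{argmin}_{x}
\left\| \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
- \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x
- \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2 \right\|_2
\]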
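The two interpolation formulas translate directly into dense NumPy. This is a minimal sketch under stated assumptions: A is stored as a dense array, `lost` is an integer index array for the rows owned by the failed process, and the function names are hypothetical. A parallel sparse implementation would look different.

```python
import numpy as np


def linear_interpolation(A, b, x, lost):
    """LI: solve A11 x1 = b1 - A12 x2 for the lost block x1."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    A11 = A[np.ix_(lost, lost)]
    A12 = A[np.ix_(lost, kept)]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A11, b[lost] - A12 @ x[kept])
    return x_new


def least_squares_interpolation(A, b, x, lost):
    """LSI: x1 = argmin_x || b - A[:, lost] x - A[:, kept] x2 ||_2."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b - A[:, kept] @ x[kept]
    x1, *_ = np.linalg.lstsq(A[:, lost], rhs, rcond=None)
    x_new = x.copy()
    x_new[lost] = x1
    return x_new
```

Neither function reads the entries `x[lost]`, so whatever the replaced process holds there after the fault is irrelevant. LI needs only the block row owned by the failed process, whereas LSI uses the full block column and therefore involves the other processes.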

slide-37
SLIDE 37

Interpolation methods

Main properties - basic linear algebra

Proposition

The initial guess generated by LI after a fault ensures that the A-norm of the forward error associated with the iterates computed by restarted CG or PCG is monotonically decreasing

[LI might not be defined for non-SPD matrices as diagonal blocks might be singular]

Proposition

The initial guess generated by LSI after a fault ensures the monotonic decrease of the residual norm of minimal residual Krylov subspace methods such as GMRES and MinRES after a restart due to a failure
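Both propositions follow from a short argument, sketched here from the definitions of LI and LSI rather than copied from the slides. The notation is introduced for this sketch only: x^{(k)} is the iterate just before the fault, x^* the exact solution, and e = x^* - x the forward error.

```latex
% LSI: the lost block x_1^{(k)} is itself a candidate in the minimisation, so
\[
\left\| b - A \begin{pmatrix} x_1^{\mathrm{LSI}} \\ x_2^{(k)} \end{pmatrix} \right\|_2
= \min_{x} \left\| \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
  - \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x
  - \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2^{(k)} \right\|_2
\le \left\| b - A x^{(k)} \right\|_2 .
\]

% LI, with A symmetric positive definite: for fixed e_2,
\[
\| e \|_A^2 = e_1^T A_{11} e_1 + 2\, e_1^T A_{12} e_2 + e_2^T A_{22} e_2
\]
% is minimised over e_1 when A_{11} e_1 + A_{12} e_2 = 0, that is
\[
A_{11} x_1 = A_{11} x_1^* + A_{12} x_2^* - A_{12} x_2^{(k)} = b_1 - A_{12} x_2^{(k)} ,
\]
% which is exactly the LI system. Hence
\[
\| x^* - x^{\mathrm{LI}} \|_A \le \| x^* - x^{(k)} \|_A .
\]
```

In both cases the recovered iterate is no worse than the pre-fault iterate in the norm that the restarted solver minimises, so the monotone convergence of that quantity is preserved across the fault.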

slide-39
SLIDE 39

Numerical experiments

Impact of fault rate

Preconditioned GMRES (Kim1 - 2 % data lost)

[Figure: relative residual ||b - Ax||/||b|| (log scale, 1e-11 to 1) versus iteration (0 to 1400) for Reset, LI, LSI, SC and REF. Four panels: 4 faults, 8 faults, 17 faults, 40 faults]

slide-43
SLIDE 43

Numerical experiments

Impact of lost data volume

Preconditioned GMRES(100) (Averous/epb3 - 10 faults)

[Figure: relative residual ||b - Ax||/||b|| (log scale, 1e-11 to 1) versus iteration (0 to 980) for Reset, LI, LSI, SC and REF. Four panels: 3 % data lost, 0.8 % data lost, 0.2 % data lost, 0.001 % data lost]

slide-47
SLIDE 47

Numerical experiments

Penalty of restart strategy

◮ Recover-restart strategy
◮ When restarting, we lose the Krylov subspace built before the fault
◮ Consequence: delay of convergence due to restart
◮ Restarting mechanism is naturally implemented in GMRES to reduce the computational resource consumption
◮ CG does not need to be restarted

slide-49
SLIDE 49

Numerical experiments

Penalty of restart strategy on PCG

[Figure: A-norm of the error (log scale, 1e-13 to 1) versus iterations (0 to 830) for Reset, LI, LSI, SC and REF]

Figure: PCG on a 7-point stencil 3D Poisson equation with 70 faults - 5 % data lost

slide-52
SLIDE 52

Resilience in eigensolvers

Recovery-restart for eigensolvers

Fault in eigenproblem

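\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} ? \\ x_2 \end{pmatrix}
= \lambda
\begin{pmatrix} ? \\ x_2 \end{pmatrix}
\]

  • How to recover x1?

Linear Interpolation (LI)

Solve the linear system

\[
\left( A_{11} - \lambda I_1 \right) x_1 = - A_{12} x_2
\]

Least Squares Interpolation (LSI)

\[
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x_1
+ \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2
= \lambda \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
\]

\[
x_1 = \operatorname*{argmin}_{x}
\left\| \begin{pmatrix} A_{11} - \lambda I_1 \\ A_{21} \end{pmatrix} x
+ \begin{pmatrix} A_{12} \\ A_{22} - \lambda I_2 \end{pmatrix} x_2 \right\|_2
\]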
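The eigenproblem variants differ from the linear-system ones only by the shift A - λI and the zero right-hand side. The sketch below assumes a dense NumPy matrix, an integer index array `lost`, and that the current eigenvalue estimate `lam` survives the fault (for instance because every process holds a copy of this scalar). These are assumptions of the example, and the function names are hypothetical.

```python
import numpy as np


def eigen_linear_interpolation(A, lam, x, lost):
    """LI: solve (A11 - lam * I1) x1 = -A12 x2 for the lost block x1."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    A11 = A[np.ix_(lost, lost)]
    A12 = A[np.ix_(lost, kept)]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A11 - lam * np.eye(len(lost)), -A12 @ x[kept])
    return x_new


def eigen_least_squares_interpolation(A, lam, x, lost):
    """LSI: x1 = argmin_x || (A - lam I)[:, lost] x + (A - lam I)[:, kept] x2 ||_2."""
    n = A.shape[0]
    kept = np.setdiff1d(np.arange(n), lost)
    shifted = A - lam * np.eye(n)
    rhs = -shifted[:, kept] @ x[kept]
    x1, *_ = np.linalg.lstsq(shifted[:, lost], rhs, rcond=None)
    x_new = x.copy()
    x_new[lost] = x1
    return x_new
```

When (lam, x) is an exact eigenpair and the shifted blocks have full rank, both functions return the lost entries exactly. With an approximate eigenpair they return an interpolated block from which the eigensolver can restart.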

slide-53
SLIDE 53

Resilience in eigensolvers

If \( Ax = \lambda x \) with \( x \neq 0 \), where \( A \in \mathbb{C}^{n \times n} \), \( x \in \mathbb{C}^n \), and \( \lambda \in \mathbb{C} \), then

◮ λ : eigenvalue
◮ x : eigenvector
◮ (λ, x) : eigenpair

Two classes of methods

◮ Fixed Point Methods (Power Method, Subspace iteration)
◮ Subspace Methods (Jacobi-Davidson, Arnoldi, IRA/Krylov-Schur)
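As a concrete instance of the fixed point class, here is a minimal power method in NumPy. For simplicity it assumes a real symmetric matrix, so that the Rayleigh quotient gives the eigenvalue estimate, and it stops on the scaled residual ||Ax - λx|| / |λ|. The function name and tolerances are choices made for this sketch.

```python
import numpy as np


def power_method(A, x0, tol=1e-10, max_iters=10_000):
    """Return (eigenvalue estimate, unit eigenvector estimate, iterations used)."""
    x = x0 / np.linalg.norm(x0)
    lam = x @ (A @ x)
    for k in range(max_iters):
        y = A @ x
        x = y / np.linalg.norm(y)
        lam = x @ (A @ x)  # Rayleigh quotient, valid for real symmetric A
        if np.linalg.norm(A @ x - lam * x) <= tol * abs(lam):
            return lam, x, k + 1
    return lam, x, max_iters


if __name__ == "__main__":
    A = np.diag([1.0, 2.0, 5.0])
    lam, x, its = power_method(A, np.ones(3))
    print(f"dominant eigenvalue estimate {lam:.6f} after {its} iterations")
```

The only dynamic data here is the current vector x, so a fault on one process loses one block of x. That is the situation the interpolation formulas for eigenproblems are meant to repair.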

slide-54
SLIDE 54

Resilience in eigensolvers

Thermo-acoustic test example

(a few smallest eigenvalues)


slide-55
SLIDE 55

Resilience in eigensolvers

Jacobi-Davidson method

[Figure: eigenpair residual ||Ax - lambda*x||/||lambda|| (log scale, 1e-07 to 1e+01) versus iteration (0 to 240). Curves: LSI, REF]

Figure: Jacobi-Davidson method with 5 faults - 1 % lost data. Convergence history using LSI and Checkpoint of current iterate

slide-57
SLIDE 57

Concluding remarks and perspectives

Concluding remarks

Summary

◮ We have designed techniques to interpolate meaningful lost data based on simple linear algebra tools
◮ Our techniques preserve some of the key monotonicity properties of Krylov solvers, but LI lacks robustness for non-SPD problems
◮ The restarting effect remains reasonable within the GMRES context
◮ No fault, no overhead
◮ These techniques can be adapted to multiple faults
◮ What about silent soft errors? (CGPOP preliminary experiments)

slide-58
SLIDE 58

Thank you for your attention. Questions?

https://team.inria.fr/hiepacs/