

SLIDE 1

Algorithm-based checkpoint-recovery for the conjugate gradient method

Carlos Pachajoa, Christina Pacher, Markus Levonyak, Wilfried N. Gansterer. 49th International Conference on Parallel Processing

SLIDE 2

Acknowledgements

This work has been funded by the Vienna Science and Technology Fund through project ICT15-113. Experiments are run on the VSC3 machine of the Vienna Scientific Cluster.

SLIDE 3

Motivation

1 Unreliability at larger scales

  • The reliability of larger-scale computer systems is predicted to decline.
  • Computers can no longer be thought of as reliable machines; resilience is an active research field.
  • We focus on node failures: events in which a node stops working and the data contained in it is lost. Several nodes can fail simultaneously if, for example, a switch stops working.

2 Resilience for the conjugate gradient method

  • An iterative solver for symmetric, positive definite (SPD) linear systems.
  • Important in many physically motivated problems.
  • Particularly well suited to sparse matrices, and therefore usable for very large systems.
SLIDE 4

Problem statement

  • An unreliable computer cluster, with the possibility of node failures occurring.
  • Find the solution of a linear system with an SPD matrix using the conjugate gradient method.
  • Sparse matrices are stored with a block-row distribution; vector elements are distributed in the same way as the rows.
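The block-row distribution described above can be sketched as follows. This is a single-process illustration, not the paper's MPI code; the matrix, node count, and variable names are assumptions.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical setup: an 8x8 sparse matrix distributed over 4 nodes.
n, nodes = 8, 4
A = sp.random(n, n, density=0.3, random_state=0, format="csr")
b = np.arange(n, dtype=float)

# Block-row distribution: node k owns a contiguous block of rows,
# and the vector entries with the same indices as those rows.
rows = np.array_split(np.arange(n), nodes)
local_A = [A[idx, :] for idx in rows]   # rows owned by node k
local_b = [b[idx] for idx in rows]      # matching vector entries
```

Each node can then compute its block of the product A p once it has gathered the remote entries of p that its rows reference.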

SLIDE 5

Key idea 1: Matrix-vector product

  • The matrix-vector product provides some redundancy for the input vector, and can be augmented to guarantee complete redundancy.

SLIDE 6

Storing redundantly for a node failure

The matrix-vector product provides some redundancy for the input vector (Chen 2011). In this example, we focus on the second rank: one of its entries is not necessary for the SpMV and must be sent additionally.

(Figure: block-row SpMV, Ap, distributed across Node 0, Node 1, Node 2, and Node 3.)

SLIDE 7

Multiple node failures

  • These ideas can be generalized to deal with multiple, simultaneous node failures, for example in the event of a switch failure.

SLIDE 8

Augmented SpMV product for multiple node failures

m_i: multiplicity of entry i. In the example: m_4 = 2, m_5 = 0, m_6 = 2, m_7 = 1.

(Figure: augmented SpMV, Ap, distributed across Node 0, Node 1, Node 2, and Node 3.)

To guarantee that we can recover from up to φ node failures, the SpMV must be augmented until the multiplicity of every entry of each node satisfies m_i ≥ φ.
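The multiplicity condition can be made concrete with a small sketch: for a block-row distribution, the multiplicity of entry i is the number of *other* nodes that receive p_i during the SpMV, i.e. nodes owning a row with a nonzero in column i. The banded matrix, node layout, and helper name below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
import scipy.sparse as sp

def multiplicities(A, owner):
    """m_i = number of other nodes that receive entry i of the input
    vector during a block-row SpMV (nodes owning a row with a nonzero
    in column i). Illustrative helper, not the paper's code."""
    A = sp.csc_matrix(A)
    m = np.zeros(A.shape[1], dtype=int)
    for i in range(A.shape[1]):
        rows = A.indices[A.indptr[i]:A.indptr[i + 1]]
        m[i] = len({owner[r] for r in rows} - {owner[i]})
    return m

# 8x8 banded example, 4 nodes owning 2 rows each
owner = np.repeat(np.arange(4), 2)
A = sp.diags([1.0] * 8) + sp.diags([0.5] * 6, 2) + sp.diags([0.5] * 6, -2)
m = multiplicities(A, owner)      # -> [1 1 2 2 2 2 1 1]
phi = 2
extra = np.where(m < phi)[0]      # entries that must be sent additionally
```

Here the boundary entries (indices 0, 1, 6, 7) reach only one other node, so for φ = 2 the SpMV would have to be augmented with additional sends for them.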

SLIDE 9

Key idea 2: State reconstruction

The complete state can be recovered from the last two search directions (the p vector) (Pachajoa et al. 2019).

Preconditioned conjugate gradient method

1  r(0) := b − Ax(0), z(0) := Pr(0), p(0) := z(0)
2  repeat
3      α(j) := r(j)⊤z(j) / p(j)⊤Ap(j);
4      x(j+1) := x(j) + α(j)p(j);
5      r(j+1) := r(j) − α(j)Ap(j);
6      z(j+1) := Pr(j+1);
7      β(j) := r(j+1)⊤z(j+1) / r(j)⊤z(j);
8      p(j+1) := z(j+1) + β(j)p(j);
9  until ‖r‖₂/‖b‖₂ < rtol;

Exact state reconstruction method

1  Retrieve the static data A_{If,I}, P_{If,I}, and b_{If};
2  Gather r(j)_{I∖If} and x(j)_{I∖If};
3  Retrieve the redundant copies of β(j−1), p(j−1)_{If}, and p(j)_{If};
4  Compute z(j)_{If} := p(j)_{If} − β(j−1) p(j−1)_{If};
5  Compute v := z(j)_{If} − P_{If,I∖If} r(j)_{I∖If};
6  Solve P_{If,If} r(j)_{If} = v for r(j)_{If};
7  Compute w := b_{If} − r(j)_{If} − A_{If,I∖If} x(j)_{I∖If};
8  Solve A_{If,If} x(j)_{If} = w for x(j)_{If};
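The reconstruction steps above can be checked numerically on a dense toy system. This is a sketch: the random SPD matrix, the Jacobi preconditioner, and the failed index set If = {2, 3} are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)         # SPD system matrix (assumed)
P = np.diag(1.0 / np.diag(A))       # Jacobi preconditioner, z = P r (assumed)
b = rng.standard_normal(n)

# Run two PCG iterations to obtain a state at j = 2,
# remembering beta(j-1), p(j-1), and p(j) as the redundant copies.
x = np.zeros(n); r = b - A @ x; z = P @ r; p = z.copy()
hist = []
for _ in range(2):
    Ap = A @ p
    alpha = (r @ z) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    z_new = P @ r_new
    beta = (r_new @ z_new) / (r @ z)
    p_new = z_new + beta * p
    hist.append((beta, p.copy(), p_new.copy()))
    r, z, p = r_new, z_new, p_new

beta_prev, p_prev, p_cur = hist[-1]         # beta(j-1), p(j-1), p(j)
If = np.array([2, 3])                       # indices lost in the failure
Is = np.setdiff1d(np.arange(n), If)         # surviving indices

# Steps 4-8 of the reconstruction
z_f = p_cur[If] - beta_prev * p_prev[If]                # z(j)_If
v = z_f - P[np.ix_(If, Is)] @ r[Is]
r_f = np.linalg.solve(P[np.ix_(If, If)], v)             # r(j)_If
w = b[If] - r_f - A[np.ix_(If, Is)] @ x[Is]
x_f = np.linalg.solve(A[np.ix_(If, If)], w)             # x(j)_If
```

The reconstructed blocks z_f, r_f, and x_f agree with the true entries z[If], r[If], and x[If] up to floating-point error, which is the point of the exactness claim.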

SLIDE 10

Key idea 3: Reduce the overhead by storing every T iterations

(Diagram: iterations j, j+1, …, j+6, each holding the vectors p, x, z, and r of that iteration.)

ESR algorithm

1  Retrieve the static data A_{If,I}, P_{If,I}, and b_{If};
2  Gather r(j)_{I∖If} and x(j)_{I∖If};
3  Retrieve the redundant copies of β(j−1), p(j−1)_{If}, and p(j)_{If};
4  Compute z(j)_{If} := p(j)_{If} − β(j−1) p(j−1)_{If};
5  Compute v := z(j)_{If} − P_{If,I∖If} r(j)_{I∖If};
6  Solve P_{If,If} r(j)_{If} = v for r(j)_{If};
7  Compute w := b_{If} − r(j)_{If} − A_{If,I∖If} x(j)_{I∖If};
8  Solve A_{If,If} x(j)_{If} = w for x(j)_{If};

Two problems:

  • To use the reconstructed parts, we also need the corresponding entries of the vectors for that iteration. Therefore, all nodes must store their local parts of the vectors at the checkpoint.
  • We need p for two consecutive iterations to be able to perform the reconstruction. Therefore, we need a queue of redundantly stored data.
SLIDE 11

Storing redundant data every few iterations

Start    j = 0       Q = [ , , ]
         j = 1       Q = [ , , ]
         j = T − 1   Q = [ , , ]
         j = T       Q = [ , , p′(T)]
         j = T + 1   Q = [ , p′(T), p′(T+1)]
         j = T + 2   Q = [ , p′(T), p′(T+1)]
         j = 2T − 1  Q = [ , p′(T), p′(T+1)]
         j = 2T      Q = [p′(T), p′(T+1), p′(2T)]
         j = 2T + 1  Q = [p′(T+1), p′(2T), p′(2T+1)]
         j = 2T + 2  Q = [p′(T+1), p′(2T), p′(2T+1)]
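The queue evolution above can be reproduced with a bounded queue: copies of p are stored at iterations kT and kT + 1, and the oldest copy falls out once three are held. T = 5 and the string labels are illustrative choices, not from the slides.

```python
from collections import deque

T = 5
Q = deque(maxlen=3)          # holds at most 3 redundant p vectors
trace = {}
for j in range(2 * T + 3):
    # store at j = kT and j = kT + 1 (for j > 2), as on the slide
    if j > 2 and (j % T == 0 or (j - 1) % T == 0):
        Q.append(f"p'({j})")
    trace[j] = list(Q)
```

After j = 2T + 1 the queue contents are [p′(T+1), p′(2T), p′(2T+1)], matching the last rows of the diagram: the two most recent consecutive copies needed for reconstruction are always present.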

SLIDE 12

Definition of ASpMV

  • The function SpMV takes a matrix

and a vector as inputs, and outputs a vector.

  • The function ASpMV additionally takes

a target number of redundant copies (φ) and a queue to store them (Q). ̺ := SpMV(A, p) ̺ := ASpMV(A, p, φ, Q)
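A minimal single-process sketch of the two signatures. In the real algorithm the augmented SpMV would place φ redundant copies of p's entries on other nodes; here that is simulated by pushing one local copy into the bounded queue Q, so the sketch shows the interface rather than the communication.

```python
from collections import deque
import numpy as np

def spmv(A, p):
    return A @ p

def aspmv(A, p, phi, Q):
    # Simulated redundancy: keep a full local copy of p in Q instead of
    # sending phi partial copies to remote nodes.
    Q.append(p.copy())
    return A @ p

Q = deque(maxlen=3)            # holds at most 3 redundant vectors
A = np.eye(4)
p = np.arange(4.0)
rho = aspmv(A, p, phi=1, Q=Q)  # same output as spmv(A, p)
```

The product itself is unchanged; ASpMV only adds the storage side effect, which is why it can replace SpMV at checkpoint iterations without altering the iteration.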

SLIDE 13

Reducing the frequency: CG-ESRP

Preconditioned conjugate gradient method

1  r(0) := b − Ax(0), z(0) := Pr(0), p(0) := z(0)
2  repeat
3      α(j) := r(j)⊤z(j) / p(j)⊤Ap(j);
4      x(j+1) := x(j) + α(j)p(j);
5      r(j+1) := r(j) − α(j)Ap(j);
6      z(j+1) := Pr(j+1);
7      β(j) := r(j+1)⊤z(j+1) / r(j)⊤z(j);
8      p(j+1) := z(j+1) + β(j)p(j);
9  until ‖r‖₂/‖b‖₂ < rtol;

Conjugate gradient method using exact state reconstruction with periodic storage (CG-ESRP)

1   r(0) := b − Ax(0), z(0) := Pr(0), p(0) := z(0), j := 0,
2   Q := [ , , ];
3   repeat
4       if j mod T = 0 and j > 2 then
5           ϱ(j) := ASpMV(A, p(j), φ, Q);
6           β∗∗ := β(j);
7       else if (j − 1) mod T = 0 and j > 2 then
8           ϱ(j) := ASpMV(A, p(j), φ, Q);
9           x∗ := x(j), r∗ := r(j), z∗ := z(j), p∗ := p(j);
10          β∗ := β∗∗;
11      else
12          ϱ(j) := SpMV(A, p(j));
13      α(j) := r(j)⊤z(j) / p(j)⊤ϱ(j);
14      x(j+1) := x(j) + α(j)p(j);
15      r(j+1) := r(j) − α(j)ϱ(j);
16      z(j+1) := Pr(j+1);
17      β(j) := r(j+1)⊤z(j+1) / r(j)⊤z(j);
18      p(j+1) := z(j+1) + β(j)p(j);
19      j := j + 1;
20  until ‖r‖₂/‖b‖₂ < rtol;
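The control flow of CG-ESRP can be sketched in runnable, single-process form. This is an illustration under stated assumptions: redundant storage is simulated by a local bounded queue rather than an MPI-augmented SpMV, and the test matrix, preconditioner, and parameter values are invented for the example.

```python
from collections import deque
import numpy as np

def cg_esrp(A, b, P, T=5, phi=1, rtol=1e-10, maxit=500):
    """PCG with periodic redundant storage of p every T iterations
    (sketch of CG-ESRP; Q stands in for the redundant copies an
    augmented SpMV would place on other nodes)."""
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x
    z = P @ r
    p = z.copy()
    Q = deque(maxlen=3)            # queue of redundant p copies
    beta = beta_ckpt = 0.0
    snapshot = None
    j = 0
    while np.linalg.norm(r) / np.linalg.norm(b) >= rtol and j < maxit:
        if j % T == 0 and j > 2:
            Q.append(p.copy())     # ASpMV would store this redundantly
            beta_ckpt = beta       # last beta computed so far
        elif (j - 1) % T == 0 and j > 2:
            Q.append(p.copy())
            snapshot = (x.copy(), r.copy(), z.copy(), p.copy(), beta_ckpt)
        rho = A @ p                # SpMV / ASpMV result
        alpha = (r @ z) / (p @ rho)
        x = x + alpha * p
        r_new = r - alpha * rho
        z_new = P @ r_new
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
        j += 1
    return x, j

rng = np.random.default_rng(1)
M = rng.standard_normal((20, 20))
A = M @ M.T + 20 * np.eye(20)      # SPD test matrix (assumed)
b = rng.standard_normal(20)
P = np.diag(1.0 / np.diag(A))      # Jacobi preconditioner (assumed)
x, iters = cg_esrp(A, b, P)
```

The iteration body is exactly standard PCG; only the two checkpoint branches are added, which is why the failure-free overhead is small when T is large.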

SLIDE 14

Experimental setup

  • 128 nodes of the VSC3.
  • Two recovery strategies: ESRP and in-memory CR (IMCR).
  • Simulated node failures.
  • Checkpointing intervals of 20, 50, and 100 iterations.
  • Resilience with 1, 3, and 8 redundant copies.
  • Runs without resilience, and with resilience both with and without node failures.

Test matrices from the SuiteSparse collection (Davis and Hu 2011):

Matrix       Problem type  Problem size  #NZ
Emilia_923   Structural    923 136       40 373 538
audikw_1     Structural    943 695       77 651 847

SLIDE 15

Results for matrix Emilia 923

  • Reference time t0 = 14.66 s.
  • σ_{t0} is 0.93% of t0.

(Figures: runtime overhead (log scale, 0.1%–10%) vs. checkpointing interval T ∈ {20, 50, 100} for ESRP, ESR, and IMCR. (a) Failure-free solver; (b) node failures introduced.)

SLIDE 16

Results for matrix audikw 1

  • Reference time t0 = 23.22 s.
  • σ_{t0} is 0.14% of t0.

(Figures: runtime overhead (log scale) vs. checkpointing interval T ∈ {20, 50, 100} for ESRP, ESR, and IMCR. (a) Failure-free solver; (b) node failures introduced.)

SLIDE 17

Conclusions and perspectives

Conclusions

  • In our first experiments, ESRP drastically reduces the overhead of ESR.
  • In failure-free cases, ESRP is also faster than in-memory CR.
  • In our experiments, the cost of communication appears to be too low; we cannot conclude that IMCR is faster than ESRP in this setting.
  • The recovery time of ESRP is dominated by the solution of the local linear system during reconstruction.

Perspectives

  • Experiments with larger problems and a larger number of nodes, to reach a different regime of the computation/communication ratio.
  • Application of matrix-partitioning algorithms.
  • Implementation with real node failures.
SLIDE 18

Contact us

REPEAL Project: https://repeal.taa.univie.ac.at/
Carlos Pachajoa: carlos.pachajoa@univie.ac.at