Algorithm-based checkpoint-recovery for the conjugate gradient method
Carlos Pachajoa, Christina Pacher, Markus Levonyak, Wilfried N. Gansterer
49th International Conference on Parallel Processing
Acknowledgements
This work has been funded by the Vienna Science and Technology Fund through project ICT15-113. Experiments are run on the VSC3 machine of the Vienna Scientific Cluster.
Motivation
1 Unreliability at larger scales
- The reliability of larger-scale computer systems is predicted to decline.
- Computers can no longer be thought of as reliable machines; resilience is an active research field.
- We focus on node failures: events in which a node stops working and the data contained in it is lost. Several nodes can fail simultaneously if, for example, a switch stops working.
2 Resilience for the conjugate gradient method
- Iterative solver for symmetric, positive definite (SPD) linear systems.
- Significant in many physically-motivated problems.
- Particularly suitable for work with sparse matrices, and therefore usable for very large systems.
Problem statement
- Unreliable computer cluster. Possibility of node failures occurring.
- Find the solution of a linear system for an SPD matrix using the conjugate
gradient method.
- Sparse matrices are stored with a block-row distribution. Vector elements are distributed in the same way as the rows.
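As a minimal illustration of this data distribution, the following sketch splits the rows of an n × n matrix into contiguous blocks, one per node; the vector entries follow the same split. The function name is ours, not from the paper's implementation.

```python
import numpy as np

def block_row_partition(n, num_nodes):
    """Assign contiguous row blocks of an n x n matrix to nodes.

    Vector elements are distributed in the same way as the rows.
    """
    # Split row indices 0..n-1 into num_nodes contiguous blocks.
    bounds = np.linspace(0, n, num_nodes + 1, dtype=int)
    return [range(bounds[k], bounds[k + 1]) for k in range(num_nodes)]

blocks = block_row_partition(10, 4)
print([list(b) for b in blocks])
# → [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```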
Key idea 1: Matrix-vector product
- The matrix-vector product provides some redundancy for the input vector,
and can be augmented to guarantee complete redundancy.
Storing redundantly for a node failure

The matrix-vector product provides some redundancy for the input vector (Chen 2011). In this example, we focus on the second rank: one of its entries is not needed for the SpMV and must be sent additionally.

[Figure: the product Ap with rows and vector entries distributed over Node 0 to Node 3.]
Multiple node failures
- These ideas can be generalized to deal with multiple, simultaneous node failures,
for example, in the event of a switch failure.
Augmented SpMV product for multiple node failures

mi: multiplicity of entry i.

[Figure: the product Ap distributed over Node 0 to Node 3, with example multiplicities m4 = 2, m5 = 0, m6 = 2, m7 = 1.]

To guarantee that we can recover from up to φ node failures, the SpMV must be augmented until the multiplicity of every entry of each node satisfies mi ≥ φ.
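The multiplicity count can be sketched as follows: an entry of p is already received by every non-owner node whose row block has a nonzero in that column, and entries still below multiplicity φ need extra copies. This is a dense stand-in (names are ours; a real implementation works on distributed sparse data):

```python
import numpy as np

def redundancy_plan(A, blocks, phi):
    """Count, for each vector entry, how many non-owner nodes already
    receive it through the SpMV, and list the extra copies needed so
    that every entry reaches multiplicity >= phi."""
    A = np.asarray(A)
    n = A.shape[0]
    owner = np.empty(n, dtype=int)
    for k, rows in enumerate(blocks):
        owner[list(rows)] = k
    receivers = [set() for _ in range(n)]
    for k, rows in enumerate(blocks):
        # Columns with a nonzero in this node's row block: entries of
        # p that node k must receive to compute its part of A @ p.
        cols = np.unique(np.nonzero(A[list(rows), :])[1])
        for j in cols:
            if owner[j] != k:
                receivers[j].add(k)
    mult = np.array([len(r) for r in receivers])
    # Entries whose multiplicity is still below phi need extra copies.
    deficit = {j: phi - m for j, m in enumerate(mult) if m < phi}
    return mult, deficit

A = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 1.],
              [0., 0., 2., 0.],
              [0., 1., 0., 2.]])
blocks = [range(0, 2), range(2, 4)]
mult, deficit = redundancy_plan(A, blocks, phi=1)
print(mult.tolist(), deficit)
# → [0, 1, 0, 1] {0: 1, 2: 1}
```

Entries 0 and 2 are used only by their owner, so one redundant copy each must be sent additionally to survive a single node failure.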
Key idea 2: State reconstruction
The complete state can be recovered from the last two search directions (the p vector) (Pachajoa et al. 2019).

Preconditioned conjugate gradient method

1  r(0) := b − A x(0), z(0) := P r(0), p(0) := z(0)
2  repeat
3      α(j) := r(j)ᵀ z(j) / p(j)ᵀ A p(j);
4      x(j+1) := x(j) + α(j) p(j);
5      r(j+1) := r(j) − α(j) A p(j);
6      z(j+1) := P r(j+1);
7      β(j) := r(j+1)ᵀ z(j+1) / r(j)ᵀ z(j);
8      p(j+1) := z(j+1) + β(j) p(j);
9  until ‖r‖₂ / ‖b‖₂ < rtol;
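The listing above translates directly into a minimal NumPy sketch (no resilience; the preconditioner P is applied as a function, and all names are ours):

```python
import numpy as np

def pcg(A, b, apply_P, rtol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradient, lines 1-9 of the listing.
    apply_P applies the preconditioner P to a vector."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_P(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) / b_norm < rtol:
            break
        z = apply_P(r)
        rz_new = r @ z
        beta = rz_new / rz
        rz = rz_new
        p = z + beta * p
    return x

# Usage on a small SPD system with a Jacobi (diagonal) preconditioner.
A = np.array([[4., 1.], [1., 3.]])
b = np.array([1., 2.])
x = pcg(A, b, lambda v: v / np.diag(A))
```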
Exact state reconstruction method

1  Retrieve the static data A_{If,I}, P_{If,I}, and b_{If};
2  Gather r(j)_{I\If} and x(j)_{I\If};
3  Retrieve the redundant copies of β(j−1), p(j−1)_{If}, and p(j)_{If};
4  Compute z(j)_{If} := p(j)_{If} − β(j−1) p(j−1)_{If};
5  Compute v := z(j)_{If} − P_{If,I\If} r(j)_{I\If};
6  Solve P_{If,If} r(j)_{If} = v for r(j)_{If};
7  Compute w := b_{If} − r(j)_{If} − A_{If,I\If} x(j)_{I\If};
8  Solve A_{If,If} x(j)_{If} = w for x(j)_{If};

Here, I is the set of all indices and If the set of indices owned by the failed nodes.
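Steps 4–8 can be sketched with dense NumPy blocks (function and argument names are ours; a real implementation operates on distributed sparse data). The sketch relies on the invariants p(j) = z(j) + β(j−1) p(j−1), z = P r, and r = b − A x:

```python
import numpy as np

def esr_reconstruct(A, P, b, If, Is, beta_prev, p_prev_If, p_If, r_Is, x_Is):
    """Exact state reconstruction on the replacement node.
    If / Is are the failed / surviving index sets; arguments ending in
    _If or _Is are the retrieved redundant copies and the gathered
    surviving parts."""
    # Step 4: recover z on the failed indices from the two p copies.
    z_If = p_If - beta_prev * p_prev_If
    # Steps 5-6: solve P[If,If] r_If = z_If - P[If,Is] r_Is.
    v = z_If - P[np.ix_(If, Is)] @ r_Is
    r_If = np.linalg.solve(P[np.ix_(If, If)], v)
    # Steps 7-8: solve A[If,If] x_If = b_If - r_If - A[If,Is] x_Is.
    w = b[If] - r_If - A[np.ix_(If, Is)] @ x_Is
    x_If = np.linalg.solve(A[np.ix_(If, If)], w)
    return r_If, x_If, z_If
```

The two `np.linalg.solve` calls are the local linear systems that, per the conclusions later in the talk, dominate the recovery time.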
Key idea 3: Reduce the overhead by storing every T iterations
[Figure: the iteration state (p, x, z, r) at iterations j, j + 1, …, j + 6; redundant copies are stored only every T iterations instead of at every iteration.]
ESR algorithm

The ESR algorithm is the exact state reconstruction method listed above.
Two problems:
- To use the reconstructed parts, we also need the corresponding entries of the vectors for that iteration. Therefore, all nodes must store their local parts of the vectors at the checkpoint.
- We need p for two consecutive iterations to be able to perform the reconstruction. Therefore, we need a queue of redundantly stored data.
Storing redundant data every few iterations

The queue holds up to three redundant copies p′ of the search direction:

Start:       [ , , ]
j = 1:       [ , , ]
j = T − 1:   [ , , ]
j = T:       [ , , p′(T)]
j = T + 1:   [ , p′(T), p′(T+1)]
j = T + 2:   [ , p′(T), p′(T+1)]
j = 2T − 1:  [ , p′(T), p′(T+1)]
j = 2T:      [p′(T), p′(T+1), p′(2T)]
j = 2T + 1:  [p′(T+1), p′(2T), p′(2T+1)]
j = 2T + 2:  [p′(T+1), p′(2T), p′(2T+1)]
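The timeline above amounts to a bounded queue: a copy is pushed at iterations j = kT and j = kT + 1, and the oldest copy is dropped once three are held, so the two most recent consecutive copies are always available. A minimal sketch (names are ours):

```python
from collections import deque

def update_checkpoint_queue(Q, j, p_copy, T):
    """Push a redundant copy of p at iterations j = kT and j = kT + 1
    (for j > 2, as in CG-ESRP); keep at most three copies."""
    if (j % T == 0 or (j - 1) % T == 0) and j > 2:
        if len(Q) == 3:
            Q.popleft()          # drop the oldest redundant copy
        Q.append((j, p_copy))

Q = deque()
T = 5
for j in range(1, 12):
    update_checkpoint_queue(Q, j, f"p({j})", T)
print([j for j, _ in Q])   # → [6, 10, 11]
```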
Definition of ASpMV

- The function SpMV takes a matrix and a vector as inputs, and outputs a vector: ϱ := SpMV(A, p).
- The function ASpMV additionally takes a target number of redundant copies (φ) and a queue to store them (Q): ϱ := ASpMV(A, p, φ, Q).
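A single-process sketch of the two signatures (in the real distributed version, only the extra entries needed to reach multiplicity φ are sent to other nodes; here a full local copy stands in for that redundancy, and φ is unused):

```python
import numpy as np

def spmv(A, p):
    """Plain matrix-vector product (dense stand-in for the SpMV)."""
    return A @ p

def aspmv(A, p, phi, Q):
    """Augmented SpMV sketch: same result as SpMV, but as a side
    effect stores a redundant copy of p in the queue Q."""
    Q.append(p.copy())
    return A @ p

A = np.array([[2., 1.], [1., 3.]])
p = np.array([1., 1.])
Q = []
rho = aspmv(A, p, phi=1, Q=Q)
print(rho, len(Q))   # → [3. 4.] 1
```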
Reducing the frequency: CG-ESRP

Conjugate gradient method using exact state reconstruction with periodic storage (CG-ESRP); the slide compares it against the plain preconditioned conjugate gradient method listed above, with the SpMV result denoted ϱ(j) = A p(j):

1   r(0) := b − A x(0), z(0) := P r(0), p(0) := z(0), j := 0;
2   Q := [ , , ];
3   repeat
4       if j mod T = 0 and j > 2 then
5           ϱ(j) := ASpMV(A, p(j), φ, Q);
6           β∗∗ := β(j);
7       else if (j − 1) mod T = 0 and j > 2 then
8           ϱ(j) := ASpMV(A, p(j), φ, Q);
9           x∗ := x(j), r∗ := r(j), z∗ := z(j), p∗ := p(j);
10          β∗ := β∗∗;
11      else
12          ϱ(j) := SpMV(A, p(j));
13      α(j) := r(j)ᵀ z(j) / p(j)ᵀ ϱ(j);
14      x(j+1) := x(j) + α(j) p(j);
15      r(j+1) := r(j) − α(j) ϱ(j);
16      z(j+1) := P r(j+1);
17      β(j) := r(j+1)ᵀ z(j+1) / r(j)ᵀ z(j);
18      p(j+1) := z(j+1) + β(j) p(j);
19      j := j + 1;
20  until ‖r‖₂ / ‖b‖₂ < rtol;
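The checkpointing branches can be sketched on top of the plain PCG loop. This is a single-process simulation (the distributed ASpMV and the recovery path are replaced by local copies, and β∗∗ is captured after β(j) is computed later in the same iteration, since line 6 of the listing refers to that value):

```python
import numpy as np

def cg_esrp(A, b, apply_P, T=20, rtol=1e-8, max_iter=1000):
    """Sketch of the CG-ESRP loop structure with simulated
    redundant storage (Q) and starred snapshot (snap)."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_P(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    Q = []               # queue of redundant p copies
    snap = {}            # starred checkpoint values
    beta_ss = 0.0        # beta** in the listing
    for j in range(max_iter):
        checkpoint = j % T == 0 and j > 2
        post_checkpoint = (j - 1) % T == 0 and j > 2
        rho = A @ p      # ASpMV at checkpoint iterations
        if checkpoint:
            Q.append((j, p.copy()))
        elif post_checkpoint:
            Q.append((j, p.copy()))
            # All nodes also store their local state so the
            # reconstructed entries can be combined with it later.
            snap = {"x": x.copy(), "r": r.copy(), "z": z.copy(),
                    "p": p.copy(), "beta": beta_ss}
        alpha = rz / (p @ rho)
        x = x + alpha * p
        r = r - alpha * rho
        if np.linalg.norm(r) / b_norm < rtol:
            break
        z = apply_P(r)
        rz_new = r @ z
        beta = rz_new / rz
        rz = rz_new
        p = z + beta * p
        if checkpoint:
            beta_ss = beta
    return x
```

Between checkpoints the iteration is identical to plain PCG, which is why the overhead shrinks as T grows.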
Experimental setup
- 128 nodes of the VSC3.
- Two strategies to recover: ESRP and in-memory CR (IMCR).
- Simulated node failures.
- Checkpointing intervals of 20, 50, and 100 iterations.
- Resilience with 1, 3 and 8 redundant copies.
- Runs without resilience, and with resilience with and without node failures.
Test matrices from the SuiteSparse collection (Davis and Hu 2011):

Matrix      | Problem type | Problem size | #NZ
Emilia_923  | Structural   | 923 136      | 40 373 538
audikw_1    | Structural   | 943 695      | 77 651 847
Results for matrix Emilia 923
- Reference time t0 = 14.66s
- σt0 is 0.93% of t0.
[Figure: runtime overhead (log scale) vs. checkpointing interval (T = 20, 50, 100) for ESRP, ESR, and IMCR. (a) Failure-free solver. (b) Node failures introduced.]
Results for matrix audikw 1
- Reference time t0 = 23.22s
- σt0 is 0.14% of t0.
[Figure: runtime overhead (log scale) vs. checkpointing interval (T = 20, 50, 100) for ESRP, ESR, and IMCR. (a) Failure-free solver. (b) Node failures introduced.]
Conclusions and perspectives
Conclusions
- In our first experiments, ESRP drastically reduces the overhead of ESR.
- In failure-free cases, ESRP is also faster than in-memory CR.
- In our experiments, the cost of communication seems to be too low; we cannot conclude that IMCR is faster than ESRP in this setting.
- The recovery time for ESRP is dominated by the solution of the local linear system during reconstruction.
Perspectives
- Experiments with larger problems and a larger number of nodes, to reach a
different regime for computation/communication ratio.
- Application of matrix partitioning algorithms.
- Implementation with real node failures.