Coded QR Decomposition
Quang Minh Nguyen, MIT
Haewon Jeong, Harvard University
Pulkit Grover, Carnegie Mellon University
Motivation
Coded Computing = Coding Theory + Distributed Computing
Straggling Issue in Cloud Computing; Other Issues
Straggling Issue in Cloud Computing
[Yu et al. ’17, Jeong et al. ’17, ’18, Baharav ’18, Sinong et
et al. ’17 Halbawi et al. ’18, Ye ’18]
Fugaku supercomputer (2021): 150,000 nodes
Mean time between failures (MTBF)
System-level MTBF = 24-48 hours ~ node-level MTBF = 411-822 years!!
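The two MTBF figures above are consistent with a simple back-of-the-envelope model. The sketch below assumes (this is not stated on the slide) that nodes fail independently at the same rate, so that the system-level MTBF is the node-level MTBF divided by the number of nodes, and that a year has 365 days.

```python
# Assumption: independent, identically distributed node failures,
# so system MTBF = node MTBF / number of nodes.
HOURS_PER_YEAR = 24 * 365
NODES = 150_000

def node_mtbf_years(system_mtbf_hours, nodes=NODES):
    """Node-level MTBF (in years) implied by a system-level MTBF (in hours)."""
    return system_mtbf_hours * nodes / HOURS_PER_YEAR

low = node_mtbf_years(24)    # system fails about once a day
high = node_mtbf_years(48)   # system fails about once every two days
print(round(low), round(high))
```

Under these assumptions, even nodes that individually fail only once in several centuries produce a system that fails every day or two.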
HPC’s Solution: Algorithm-based fault-tolerance (ABFT)
adding encoded redundancy tailored to a specific algorithm.
Same idea as Coded Computing!!
bridge the gap
where ABFT faces challenges
computing literature:
systematic nodes)
Orthogonal $Q$ (i.e. $Q^T Q = I$), upper-triangular $R$
checksums
$R'$ is upper-triangular → $R$ is upper-triangular.
So we can retrieve $A = QR$ as the QR decomposition of A.
checksums
$Q'^T Q' = I$ does not imply $Q^T Q = I$
Can we efficiently restore the orthogonality of Q?
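The loss of orthogonality can be seen numerically. The sketch below is an illustration under assumptions, not the authors' setup: it appends a single checksum row (the column sums of A, generated by a hypothetical all-ones matrix `Gv`) to a small example matrix, takes the QR decomposition of the encoded matrix with `numpy.linalg.qr`, and checks the block `Q` of the encoded Q-factor that lines up with the original rows of A.

```python
import numpy as np

# Small, well-conditioned example matrix (hypothetical data).
A = np.array([[4.0, 1.0, 2.0, 0.0],
              [1.0, 3.0, 0.0, 1.0],
              [2.0, 0.0, 5.0, 1.0],
              [0.0, 1.0, 1.0, 3.0]])
n = A.shape[0]

Gv = np.ones((1, n))              # assumed checksum generator: one row of column sums
A_enc = np.vstack([A, Gv @ A])    # encoded matrix: A with its checksum row appended

Q_enc, R = np.linalg.qr(A_enc)    # reduced QR: Q_enc is (n+1) x n, R is n x n
Q = Q_enc[:n]                     # block aligned with the original rows of A

encoded_is_orthonormal = np.allclose(Q_enc.T @ Q_enc, np.eye(n))
factorization_holds = np.allclose(Q @ R, A)
deviation = np.linalg.norm(Q.T @ Q - np.eye(n))   # Frobenius norm of Q^T Q - I

print(encoded_is_orthonormal, factorization_holds, deviation)
```

The encoded factor has orthonormal columns and `Q @ R` still reproduces A, but `Q.T @ Q` differs from the identity by a clearly nonzero amount.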
checksums =
Not orthogonal
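In-node checksum storage:
[Figure: blocks A0, A1 and the checksum A0+A1 stored on two nodes]
Out-of-node checksum storage (conventional setting):
[Figure: blocks A0, A1 and the checksum A0+A1 stored on three nodes, with the checksum on its own node]
→ Can we still have some optimality guarantee like the MDS condition?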
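Why the in-node setting is harder can be shown with a toy model. The sketch below is hypothetical: the node numbering and the choice of which node holds the checksum in the in-node case are assumptions made for illustration. It relies only on the fact that any two of A0, A1, A0+A1 determine the third.

```python
# Which blocks each node stores (node ids and placements are illustrative assumptions).
out_of_node = {0: ["A0"], 1: ["A1"], 2: ["A0+A1"]}
in_node = {0: ["A0"], 1: ["A1", "A0+A1"]}   # checksum shares a node with A1

def recoverable(placement, failed_node):
    """True if A0 and A1 can both be rebuilt after `failed_node` loses its data.

    Any two of {A0, A1, A0+A1} determine the third, so recovery needs
    at least two distinct surviving blocks.
    """
    surviving = {block
                 for node, stored in placement.items() if node != failed_node
                 for block in stored}
    return len(surviving) >= 2

out_ok = [recoverable(out_of_node, f) for f in out_of_node]
in_ok = [recoverable(in_node, f) for f in in_node]
print(out_ok, in_ok)
```

With the checksum on its own node, every single-node failure is recoverable. With this naive in-node placement, losing node 1 removes both A1 and the checksum, so one checksum is no longer enough.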
both vertical and horizontal checksums as follows: the encoded matrix holds $A$ together with the checksums $G_v A$ and $A G_h$, where $G_v$ and $G_h$ are checksum-generator matrices.
distributed over the new set of checksum processors.
Coded Computing: Master-Worker Setting
[Figure: the master node holds input blocks A0, A1, A2, A3 and distributes them, with redundancy, to Workers 1 to 5; the output is collected at the master node.]
processors.
computation.
HPC Setting: 2D block-cyclic distribution
Systematic processors, checksum processors
Single-node fail-stop failures:
completely stops responding, and loses its part of the global data.
external source.
Real-time Recovery:
QR decomposition, immediately triggering the recovery process.
from its latest failure.
We consider modified Gram-Schmidt (MGS), one of the three most widely used algorithms for QR decomposition.
Checksums preserved to facilitate fault-tolerant computation
→ Post-orthogonalization: post-processing to restore the degraded orthogonality
Minimal number of checksums for single-node failure tolerance
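[Figure: the encoded matrix ($A$, $AG_h$, $G_vA$) at step $t$ of MGS, with intermediate factor blocks such as $R_1^{(t)}$ and $R_1^{(t)}G_h$.]
Checksums preserved!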
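Checksum preservation can be checked on a small example. The following is a minimal serial sketch, not the authors' distributed implementation: the function name `mgs_with_checksums`, the all-ones generators and the test matrix are assumptions made for illustration. MGS runs on the row-encoded matrix, and every projection step is also applied to the horizontal checksum columns, which are never normalized.

```python
import numpy as np

def mgs_with_checksums(A, Gv, Gh):
    """Modified Gram-Schmidt on [A; Gv A], carrying the checksum columns [A; Gv A] Gh along."""
    n = A.shape[1]
    V = np.vstack([A, Gv @ A]).astype(float)  # columns to orthogonalize (with checksum rows)
    C = V @ Gh                                # horizontal checksum columns
    R = np.zeros((n, n))
    R_chk = np.zeros((n, Gh.shape[1]))
    for j in range(n):
        R[j, j] = np.linalg.norm(V[:, j])
        V[:, j] /= R[j, j]                    # column j becomes q_j
        q = V[:, j]
        R[j, j + 1:] = q @ V[:, j + 1:]       # project remaining columns onto q_j
        V[:, j + 1:] -= np.outer(q, R[j, j + 1:])
        R_chk[j] = q @ C                      # same projection for the checksum columns
        C -= np.outer(q, R_chk[j])
    return V, R, R_chk

A = np.array([[4.0, 1.0, 2.0, 0.0],
              [1.0, 3.0, 0.0, 1.0],
              [2.0, 0.0, 5.0, 1.0],
              [0.0, 1.0, 1.0, 3.0]])
n = A.shape[0]
Gv = np.ones((1, n))   # assumed vertical checksum generator (column sums)
Gh = np.ones((n, 1))   # assumed horizontal checksum generator (row sums)

Q_enc, R1, R1_chk = mgs_with_checksums(A, Gv, Gh)
Q1 = Q_enc[:n]

print(np.allclose(Q_enc[n:], Gv @ Q1))   # vertical checksum still equals Gv Q1
print(np.allclose(R1_chk, R1 @ Gh))      # horizontal checksum still equals R1 Gh
```

Both relations hold because every MGS step is a linear operation on columns, so the checksum rows and columns follow the data they encode.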
[Figure: the encoded matrix ($A$, $AG_h$, $G_vA$) after MGS completes, with the factor blocks $R_1$ and $R_1G_h$.]
→ Retrieve $A = Q_1 R_1$, where $Q_1$ is non-orthogonal (first challenge) and $R_1$ is upper-triangular.
Challenge 1: In this work, we raise the question “How ‘non-orthogonal’ is $Q_1$?”
[Figure: the encoded matrix ($A$, $AG_h$, $G_vA$) and its factors ($Q_1$, $G_vQ_1$, $R_1$, $R_1G_h$); $Q_1$ is labeled "Not orthogonal".]
Main Idea: Cheap post-processing: $Q_1 \to G_0 Q_1$
→ Post-orthogonalization: $G_0 Q_1$ is an orthogonal matrix!
[Figure: $Q_1$ and $G_v Q_1$, with the labels "Orthogonal" and "Not orthogonal"]
$G_v$: $c \times n$ matrix
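$G_v = \begin{bmatrix} G_1 & V \end{bmatrix}$, with $G_1$ of size $c \times c$ and $V$ of size $c \times (n-c)$
$G_0 = \begin{bmatrix} I_c + G_1 & V \\ V^T & -I_{n-c} \end{bmatrix}$, with block sizes $c$ and $n-c$ in each dimension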
Post-orthogonaliza)on condi)on
Main Result:
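The statement of the condition did not survive extraction, so the sketch below rests on an assumption rather than on the slide. It uses the block matrix with blocks $I_c + G_1$, $V$, $V^T$, $-I_{n-c}$ as $G_0$, and picks one sufficient choice of $G_1$ that can be verified by direct multiplication: $G_1$ symmetric with $2G_1 + VV^T = 0$. Under that choice $G_0^T G_0 = I + G_v^T G_v$, which is what makes $G_0 Q_1$ orthogonal. The values of `V` and `A` are hypothetical.

```python
import numpy as np

n, c = 4, 1
V = np.array([[1.0, 1.0, 1.0]])             # c x (n-c), hypothetical values
G1 = -0.5 * (V @ V.T)                       # assumed choice: symmetric, 2*G1 + V V^T = 0
Gv = np.hstack([G1, V])                     # checksum generator Gv = [G1 V], c x n

# Structured post-processing matrix built from the four blocks.
G0 = np.block([[np.eye(c) + G1, V],
               [V.T, -np.eye(n - c)]])

A = np.array([[4.0, 1.0, 2.0, 0.0],
              [1.0, 3.0, 0.0, 1.0],
              [2.0, 0.0, 5.0, 1.0],
              [0.0, 1.0, 1.0, 3.0]])

Q_enc, R1 = np.linalg.qr(np.vstack([A, Gv @ A]))  # QR of the row-encoded matrix
Q1 = Q_enc[:n]                                    # non-orthogonal block with A = Q1 R1
Q_post = G0 @ Q1                                  # post-orthogonalized factor

print(np.allclose(G0.T @ G0, np.eye(n) + Gv.T @ Gv))
print(np.allclose(Q_post.T @ Q_post, np.eye(n)))
```

Because $G_0$ is an identity-like matrix plus $c$ dense rows and columns, applying it costs far less than a general $n \times n$ product when $c$ is small, which fits the "cheap post-processing" description.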
Reminder: checksum-generator matrix: $G_v = \begin{bmatrix} G_1 & V \end{bmatrix}$
system of linear equations:
$Ax = b \Leftrightarrow A'x = (G_0 A)x = G_0 b$
$(G_0 Q_1) R x = G_0 b \Leftrightarrow R x = (G_0 Q_1)^T (G_0 b)$
Overhead of post-orthogonalization: matrix multiplications $G_0 Q_1$ and $G_0 b$
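The equivalence above can be sketched end to end. This example does not use the structured $G_0$ from the slides. As a stand-in it takes $G_0 = L^T$, where $L$ is the Cholesky factor of $I + G_v^T G_v$, which is a generic (and not especially cheap) way to get $G_0^T G_0 = I + G_v^T G_v$ and hence an orthogonal $G_0 Q_1$. The matrices `A`, `b` and the all-ones `Gv` are hypothetical.

```python
import numpy as np

A = np.array([[4.0, 1.0, 2.0, 0.0],
              [1.0, 3.0, 0.0, 1.0],
              [2.0, 0.0, 5.0, 1.0],
              [0.0, 1.0, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])
n = A.shape[0]
Gv = np.ones((1, n))                               # assumed checksum generator

Q_enc, R = np.linalg.qr(np.vstack([A, Gv @ A]))    # QR of the row-encoded matrix
Q1 = Q_enc[:n]                                     # A = Q1 R, but Q1 is not orthogonal

# Stand-in for G0 (an assumption, not the slides' construction):
L = np.linalg.cholesky(np.eye(n) + Gv.T @ Gv)      # L @ L.T = I + Gv^T Gv
G0 = L.T                                           # so G0^T G0 = I + Gv^T Gv

Q_post = G0 @ Q1                                   # orthogonal after post-processing
rhs = Q_post.T @ (G0 @ b)                          # (G0 Q1)^T (G0 b)

# R is upper-triangular, so a triangular solver would normally be used here;
# np.linalg.solve keeps the sketch NumPy-only.
x = np.linalg.solve(R, rhs)

print(np.allclose(A @ x, b))
```

The only extra work compared with an ordinary QR solve is forming `G0 @ Q1` and `G0 @ b`, matching the overhead named on the slide.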
Note:
Recap:
R-factor protection:
Q-factor protection:
→ Construction of $G_h$ to tolerate single-node failures.
$G_v = \begin{bmatrix} G_1 & V \end{bmatrix}$
Post-orthogonalization condition on $G_v$
→ Can we still have some optimality guarantee like the MDS condition?
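[Figure: in-node checksum storage, with blocks A0, A1 and the checksum A0+A1 stored on two nodes]
$K = 2 \left\lceil \frac{L}{\rho} \right\rceil$
$K \ge \left\lceil \frac{L}{\rho - 1} \right\rceil$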
                       Out-of-node Checksum Storage    In-node Checksum Storage
R-factor protection    $\lceil n/p_c \rceil$           $\lceil n/(p_c - 1) \rceil$
Q-factor protection    $\lceil n/p_r \rceil$           $2 \lceil n/p_r \rceil$
Application to coded QR decomposition under in-node checksum storage: The $n \times n$ matrix $A$ is distributed over $P = p_r p_c$ processors.
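To get a feel for the sizes involved, the four ceiling expressions from the table can be evaluated for a concrete grid. This is a rough sketch under an assumption about how the flattened table should be read: out-of-node storage pairs with $\lceil n/p_c \rceil$ and $\lceil n/p_r \rceil$, in-node storage with $\lceil n/(p_c-1) \rceil$ and $2\lceil n/p_r \rceil$. The function names and the example values of `n`, `p_r`, `p_c` are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def checksum_counts(n, p_r, p_c):
    """Checksum counts per factor and storage scheme, under the assumed table reading."""
    return {
        "out_of_node": {"R": ceil_div(n, p_c), "Q": ceil_div(n, p_r)},
        "in_node": {"R": ceil_div(n, p_c - 1), "Q": 2 * ceil_div(n, p_r)},
    }

# Example: a 1000 x 1000 matrix on a 4 x 4 processor grid (P = 16).
counts = checksum_counts(1000, 4, 4)
print(counts)
```

Under this reading, moving the checksums into the systematic nodes raises the R-factor count from n/p_c to roughly n/(p_c - 1) and doubles the Q-factor count.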
Challenge 1: Q protection via coding?
Modified Gram-Schmidt (MGS) algorithm:
→ The trick: use low-cost post-orthogonalization of A to A' to restore the degraded orthogonality of Q.
→ A' can then be used in place of A with negligible overhead for solving a non-singular square system of linear equations.
Challenge 2: minimal number of checksums required under in-node checksum storage?
under the in-node checksum storage setting.
Note: