Coded QR Decomposition: Quang Minh Nguyen, MIT; Haewon Jeong, Harvard



SLIDE 1

Coded QR Decomposition

Quang Minh Nguyen, MIT
Haewon Jeong, Harvard University
Pulkit Grover, Carnegie Mellon University

SLIDE 2

Motivation

SLIDE 3

Motivation

Coded Computing

Coding Theory + Distributed Computing

Straggling Issue in Cloud Computing

  • Coded Matrix Multiplication [Lee et al. ’15, ’17, Yu et al. ’17, Jeong et al. ’17, ’18, Baharav ’18, Sinong et al. ’18, Shahrzad et al. ’19]
  • Coded MapReduce [Li et al. ’15, ’17, ’18]
  • Coded Gradient Descent [Tandon et al. ’16, Raviv et al. ’17, Halbawi et al. ’18, Ye ’18]

Other Issues ??

SLIDE 4

Motivation

Other Issues ?? - Processor Failure Issue in High Performance Computing (HPC)

SLIDE 5

Motivation

Other Issues ?? - Processor Failure Issue in High Performance Computing (HPC)

Fugaku supercomputer (2021): 150,000 nodes

Mean-time-between-failures (MTBF): System-level MTBF = 24-48 hours ~ node-level MTBF = 411-822 years!!

Larger Scale → Unreliability !!
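
The node-level figure follows from the system-level one by simple scaling. A minimal sketch, assuming independent and identical node failures so that the system failure rate is the sum of the node rates (the function name is illustrative, not from the slides):

```python
HOURS_PER_YEAR = 24 * 365

def node_mtbf_years(system_mtbf_hours, num_nodes):
    """Node-level MTBF implied by a system-level MTBF.

    Assumption: node failures are independent with equal rates, so
    system_rate = num_nodes * node_rate, i.e. node_MTBF = num_nodes * system_MTBF.
    """
    return system_mtbf_hours * num_nodes / HOURS_PER_YEAR

low = node_mtbf_years(24, 150_000)
high = node_mtbf_years(48, 150_000)
print(round(low), round(high))  # 411 822
```

Read the other way round: even nodes that fail once in several centuries give a machine of this size a failure every day or two.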

SLIDE 6

Motivation

Other Issues ?? - Processor Failure Issue in High Performance Computing (HPC)

Fugaku supercomputer (2021): 150,000 nodes

Mean-time-between-failures (MTBF): System-level MTBF = 24-48 hours ~ node-level MTBF = 411-822 years!!

Larger Scale → Unreliability !!

HPC’s Solution: Algorithm-based fault-tolerance (ABFT): adding encoded redundancy tailored to a specific algorithm.

= Same idea as Coded Computing!!

SLIDE 7

Motivation

Bridge the gap between Coded Computing and ABFT for HPC

  • QR Decomposition: an important matrix factorization in HPC, where ABFT faces challenges
  • More practical HPC setting that was not considered in coded computing literature:
      • Block-cyclic distribution
      • In-node checksum storage (storing redundancies in systematic nodes)

→ Coded QR Decomposition

SLIDE 8

What is QR Decomposition?

  • QR decomposition is widely used in many HPC applications: solving systems of linear equations, SVM, linear least squares problems, etc.

$A = QR$: orthogonal $Q$ (i.e. $Q^T Q = I$), upper triangular $R$

SLIDE 9

ABFT for QR Decomposition

Key idea: [O. Maslennikow et al. ’98, P. Du et al. ’12, P. Wu et al. ’14]

$[A \mid \text{checksums}] = Q \times [R \mid \text{checksums}] = Q \times R'$

$R'$ is upper-triangular → $R$ is upper-triangular. So we can retrieve $A = Q \times R$ as the QR decomposition of $A$.
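
Why the trick works for R can be written as a one-line identity. This is a sketch, assuming the column checksums are formed as $A G_h$ with the checksum-generator matrix $G_h$ that the slides introduce later:

```latex
\begin{bmatrix} A & A G_h \end{bmatrix}
  = \begin{bmatrix} QR & QR\,G_h \end{bmatrix}
  = Q \begin{bmatrix} R & R\,G_h \end{bmatrix}
  = Q R'
```

The same orthogonal $Q$ factorizes both the data and the checksum columns, so the checksums of $A$ turn into checksums of $R$.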

SLIDE 10

Challenges in Coding for QR Decomposition

  • Can we do the same trick for Q protection? NO.
  • Proven in [Theorem 5.1, P. Du et al. ’12].

$\begin{bmatrix} A \\ \text{checksums} \end{bmatrix} = \begin{bmatrix} Q \\ \text{checksums} \end{bmatrix} \times R = Q' \times R$, where $Q$ is not orthogonal:

$Q'^T \times Q' = I$ does not imply $Q^T \times Q = I$

→ Challenge 1: Q protection via coding?

Can we efficiently restore the orthogonality of Q?
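
The loss of orthogonality can be made explicit. This is a sketch, assuming the row checksums take the form $G_v Q$ with the checksum-generator matrix $G_v$ used later in the slides:

```latex
Q' = \begin{bmatrix} Q \\ G_v Q \end{bmatrix}, \qquad
Q'^{T} Q' = Q^{T} Q + (G_v Q)^{T} (G_v Q) = I
\;\Longrightarrow\;
Q^{T} Q = I - Q^{T} G_v^{T} G_v\, Q
```

So $Q^T Q = I$ would require $G_v Q = 0$, which cannot hold for a non-trivial checksum of an invertible $Q$.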

SLIDE 11

Challenges in Coding for QR Decomposition

In-node checksum storage:

  • was recently proposed for ABFT [P. Du et al. ’12].
  • stores coded data (checksums) in original processors instead of adding extra processors for fault tolerance.

SLIDE 12

Challenges in Coding for QR Decomposition

Out-of-node checksum storage (conventional setting): A0 on Node 0, A1 on Node 1, checksum A0+A1 on Node 2.

  • Optimal coding strategy: MDS

In-node checksum storage: A0, A1 and the checksum A0+A1 all reside on Node 0 and Node 1.

  • Fundamental Limit??

→ Can we still have some optimality guarantee like MDS condition?

→ Challenge 2: minimal number of checksums required under in-node checksum storage?
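
A toy model shows why in-node storage changes the counting. This is a sketch under stated assumptions: the placement of the single checksum on Node 0 is hypothetical (the slide does not say where it sits), and `recover_A0` is an illustrative helper.

```python
# Toy model: each node holds named blocks; a fail-stop failure erases the whole node.
A0, A1 = [1.0, 2.0], [3.0, 5.0]
checksum = [a + b for a, b in zip(A0, A1)]   # A0 + A1

def recover_A0(nodes, failed):
    """Try to rebuild A0 after node `failed` is erased; return None if impossible."""
    alive = {}
    for idx, blocks in enumerate(nodes):
        if idx != failed:
            alive.update(blocks)
    if "A0" in alive:
        return alive["A0"]
    if "A1" in alive and "A0+A1" in alive:
        return [s - b for s, b in zip(alive["A0+A1"], alive["A1"])]
    return None

# Out-of-node: the checksum lives on its own node.
out_of_node = [{"A0": A0}, {"A1": A1}, {"A0+A1": checksum}]
# In-node with one checksum, placed on Node 0 (hypothetical placement).
in_node_naive = [{"A0": A0, "A0+A1": checksum}, {"A1": A1}]

print(recover_A0(out_of_node, failed=0))    # [1.0, 2.0]
print(recover_A0(in_node_naive, failed=0))  # None
```

With in-node storage a failure takes checksums down together with data, so one checksum per protected group is no longer enough. That is the gap Challenge 2 asks to quantify.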

SLIDE 13

Summary of Challenges

Challenge 1: Q protection via coding?

Challenge 2: minimal number of checksums required under in-node checksum storage?

→ Our Contribution: Address these 2 challenges

SLIDE 14

System Model

  • For fault tolerance, we encode the $n \times n$ matrix $A$ with both vertical and horizontal checksums into $\tilde{A}$ (blocks $A$, $A G_h$, $G_v A$), where $G_h$ and $G_v$ are checksum-generator matrices.
  • Out-of-node checksum storage: The checksums are distributed over the new set of checksum processors.

SLIDE 15

System Model

Coded Computing: Master-Worker Setting

[Figure: a master node distributes the input blocks A0, A1, A2, A3 plus redundancy to Workers 1-5 and collects their output.]

SLIDE 16

System Model

HPC Setting: 2D block-cyclic distribution (in contrast to Coded Computing: Master-Worker Setting)

  • The input matrix A is distributed among processors.
  • The layout below is maintained throughout the computation.

[Figure: 2D block-cyclic layout over systematic processors and checksum processors, shown next to the master-worker setting.]
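
The block-to-processor mapping can be sketched in a few lines. This assumes the standard zero-offset block-cyclic rule, in which block $(i, j)$ goes to process $(i \bmod p_r, j \bmod p_c)$ on a $p_r \times p_c$ grid; the function names are illustrative.

```python
def owner(i, j, pr, pc):
    """Grid coordinates of the process that owns block (i, j).

    Assumption: zero-offset 2D block-cyclic rule on a pr x pc process grid.
    """
    return (i % pr, j % pc)

def layout(num_block_rows, num_block_cols, pr, pc):
    """Owner of every block of a matrix split into num_block_rows x num_block_cols blocks."""
    return [[owner(i, j, pr, pc) for j in range(num_block_cols)]
            for i in range(num_block_rows)]

for row in layout(4, 4, pr=2, pc=2):
    print(row)
# [(0, 0), (0, 1), (0, 0), (0, 1)]
# [(1, 0), (1, 1), (1, 0), (1, 1)]
# [(0, 0), (0, 1), (0, 0), (0, 1)]
# [(1, 0), (1, 1), (1, 0), (1, 1)]
```

Each process owns blocks scattered over the whole matrix, so a single failure punches holes into every block row and block column it touches.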

SLIDE 17

Failure Model and Real-time Recovery in HPC

Single-node fail-stop failures:

  • A failure corresponds to a systematic processor that completely stops responding, and loses its part of the global data.
  • The identity of the failed processor is provided by some external source.

Real-time Recovery:

  • The failure can occur at any point during the execution of QR decomposition, immediately triggering the recovery process.
  • Computation continues once the system has recovered from its latest failure.

SLIDE 18

QR Decomposition: Modified Gram-Schmidt (MGS) algorithm

[Figure: MGS iteration, with the Q computation and the R computation highlighted.]

We consider MGS, one of the 3 most widely used algorithms for QR decomposition.
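
For reference, a textbook MGS can be sketched as below. It is a serial, pure-Python illustration on a small hypothetical matrix, not the distributed implementation the slides refer to.

```python
import math

def mgs(A):
    """Modified Gram-Schmidt on a matrix given as a list of rows.

    Returns (Q, R) with A = Q R, orthonormal columns in Q, upper-triangular R.
    Assumes A has full column rank.
    """
    m, n = len(A), len(A[0])
    Q = [row[:] for row in A]            # working copy, overwritten column by column
    R = [[0.0] * n for _ in range(n)]
    for t in range(n):
        # R computation (diagonal) and Q computation: normalize column t.
        R[t][t] = math.sqrt(sum(Q[i][t] ** 2 for i in range(m)))
        for i in range(m):
            Q[i][t] /= R[t][t]
        # Remove the q_t component from every remaining column right away.
        for j in range(t + 1, n):
            R[t][j] = sum(Q[i][t] * Q[i][j] for i in range(m))
            for i in range(m):
                Q[i][j] -= R[t][j] * Q[i][t]
    return Q, R

A = [[2.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 0.0, 4.0]]
Q, R = mgs(A)
```

Iteration `t` finishes column `t` of Q and row `t` of R, which is the granularity at which the later slides track $Q^{(t)}$ and $R^{(t)}$.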

SLIDE 19

Main Results

Checksum-preservation for MGS: Checksums preserved to facilitate fault-tolerant computation

Challenge 1: Q protection via coding?

→ Post-orthogonalization: Post-processing to restore the Degraded Orthogonality

Challenge 2: minimal number of checksums required under in-node checksum storage?

→ Optimality for in-node checksum storage setting: Minimal number of checksums for single-node failure tolerance

SLIDE 20

Checksum-preservation for MGS

  • To facilitate real-time recovery, we want the checksums to be preserved at any iteration of MGS (or GS).
  • We encode $A \to \tilde{A}$, and QR-factorize $\tilde{A}$.
  • At each iteration $t = 1, \dots, T$, the algorithm maintains the updates $Q^{(t)}$ and $R^{(t)}$, so that at the end $\tilde{A} = Q^{(T)} R^{(T)}$ is the QR decomposition of $\tilde{A}$.

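SLIDE 21

Checksum-preservation for MGS

We prove that: At any iteration $t$ of MGS, for $\tilde{A}$ with blocks $A$, $A G_h$, $G_v A$:

$Q^{(t)} = \begin{bmatrix} Q_1^{(t)} \\ G_v Q_1^{(t)} \end{bmatrix}, \qquad R^{(t)} = \begin{bmatrix} R_1^{(t)} & R_1^{(t)} G_h \end{bmatrix}$

Checksums preserved!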

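SLIDE 22

Checksum-preservation for MGS

At the end, i.e. $t = T$, we have, for $\tilde{A}$ with blocks $A$, $A G_h$, $G_v A$:

$Q^{(T)} = \begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}, \qquad R^{(T)} = \begin{bmatrix} R_1 & R_1 G_h \end{bmatrix}$

→ Retrieve $A = Q_1 R_1$ where $Q_1$ is non-orthogonal (first challenge), and $R_1$ is upper-triangular.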
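
The preservation property can be checked numerically on a toy instance. This is a sketch under stated assumptions: the generators `Gh` and `Gv` are arbitrary illustrative choices, the corner block $G_v A G_h$ is assumed so that the encoded matrix is rectangular (the slide shows only three blocks), and MGS runs for $T = n$ iterations so that only data columns are normalized.

```python
import math

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def close(X, Y, tol=1e-9):
    return all(abs(x - y) < tol for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

n = 3
A = [[2.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 0.0, 4.0]]
Gh = [[1.0], [1.0], [1.0]]      # n x 1 horizontal generator (hypothetical)
Gv = [[1.0, 2.0, 3.0]]          # 1 x n vertical generator (hypothetical)

# Encoded matrix [[A, A Gh], [Gv A, Gv A Gh]]; the corner block is an assumption.
top = [ra + rc for ra, rc in zip(A, matmul(A, Gh))]
W = top + matmul(Gv, top)       # working matrix, overwritten in place by MGS
m, width = len(W), len(W[0])
R = [[0.0] * width for _ in range(n)]

preserved = []
for t in range(n):
    R[t][t] = math.sqrt(sum(W[i][t] ** 2 for i in range(m)))
    for i in range(m):
        W[i][t] /= R[t][t]
    for j in range(t + 1, width):
        R[t][j] = sum(W[i][t] * W[i][j] for i in range(m))
        for i in range(m):
            W[i][j] -= R[t][j] * W[i][t]
    # Invariant 1: the checksum rows of Q(t) equal Gv times its data rows.
    q_ok = close(W[n:], matmul(Gv, W[:n]))
    # Invariant 2: on the rows filled so far, checksum columns of R(t) equal R1(t) Gh.
    R1_t = [row[:n] for row in R[: t + 1]]
    r_ok = close([row[n:] for row in R[: t + 1]], matmul(R1_t, Gh))
    preserved.append(q_ok and r_ok)

Q1 = [row[:n] for row in W[:n]]
Q1tQ1 = matmul([list(c) for c in zip(*Q1)], Q1)
identity = [[float(i == j) for j in range(n)] for i in range(n)]
print(preserved, close(Q1tQ1, identity))
```

Both invariants hold after every iteration, while the final check shows that the data part $Q_1$ on its own is not orthogonal, which is exactly the first challenge.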

SLIDE 23

Challenge 1: Degraded Orthogonality of Conventional Coding

$\tilde{A}$ (blocks $A$, $A G_h$, $G_v A$) factorizes into $Q$ (blocks $Q_1$, $G_v Q_1$) and $R$ (blocks $R_1$, $R_1 G_h$); $Q_1$ is not orthogonal.

Challenge 1: In this work, we raise the question “How ‘non-orthogonal’ is $Q_1$?”

SLIDE 24

Challenge 1: Degraded Orthogonality of Conventional Coding

$\tilde{A}$ (blocks $A$, $A G_h$, $G_v A$) factorizes into $Q$ (blocks $Q_1$, $G_v Q_1$) and $R$ (blocks $R_1$, $R_1 G_h$); $Q_1$ is not orthogonal.

Challenge 1: In this work, we raise the question “How ‘non-orthogonal’ is $Q_1$?”

Main Idea: Cheap Post-processing: $Q_1 \to$ orthogonal matrix!

SLIDE 25

Challenge 1: Degraded Orthogonality of Conventional Coding

$\tilde{A}$ (blocks $A$, $A G_h$, $G_v A$) factorizes into $Q$ (blocks $Q_1$, $G_v Q_1$) and $R$ (blocks $R_1$, $R_1 G_h$); $Q_1$ is not orthogonal.

Challenge 1: In this work, we raise the question “How ‘non-orthogonal’ is $Q_1$?”

Main Idea: Cheap Post-processing:

→ Post-orthogonalization: $Q_1 \to G_0 Q_1$, an orthogonal matrix!

SLIDE 26

Post-orthogonalization

Question: Can we always construct $G_0$ such that $G_0 Q_1$ is orthogonal?

$\begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}$ is orthogonal; $Q_1$ alone is not orthogonal.

It depends on $G_v$ (a $c \times n$ matrix): the checksum-generator matrix is under our control !!

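SLIDE 27

Construction of $G_0$

$G_v = \begin{bmatrix} G_1 & V \end{bmatrix}$, where $G_1$ is $c \times c$ and $V$ is $c \times (n-c)$, so $G_v$ is $c \times n$.

$G_0 = \begin{bmatrix} I_c + G_1 & V \\ V^T & -I_{n-c} \end{bmatrix}$, with block rows and block columns of sizes $c$ and $n-c$.

$G_0$ is sparse.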

SLIDE 28

Post-orthogonalization Condition for Checksum-generator Matrix

Main Result: We can prove that if $G_v$ satisfies the post-orthogonalization condition, then:

  • $G_0 Q_1$ is orthogonal
  • $G_0$ is invertible

→ $A' = G_0 A = (G_0 Q_1) R$ is now the QR decomposition of $A'$! But would $A'$ be useful?

Reminder: checksum-generator matrix: $G_v = \begin{bmatrix} G_1 & V \end{bmatrix}$
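
One way to see what such a condition has to look like is to expand the block forms of $G_v$ and $G_0$ from the construction slide. The derivation below is a sketch worked out from those block forms, not a quotation of the slide's own condition; it assumes $A$ (and hence $Q_1$) is invertible.

```latex
\begin{aligned}
Q'^{T} Q' = I_n \text{ with } Q' = \begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}
  &\;\Longrightarrow\; Q_1^{T}\,(I_n + G_v^{T} G_v)\,Q_1 = I_n, \\
(G_0 Q_1)^{T} (G_0 Q_1) = I_n
  &\;\Longleftrightarrow\; G_0^{T} G_0 = I_n + G_v^{T} G_v, \\
G_0^{T} G_0 - (I_n + G_v^{T} G_v)
  &= \begin{bmatrix} G_1 + G_1^{T} + V V^{T} & 0 \\ 0 & 0 \end{bmatrix}.
\end{aligned}
```

Under these assumptions $G_0 Q_1$ is orthogonal exactly when $G_1 + G_1^{T} + V V^{T} = 0$; one symmetric choice that satisfies this is $G_1 = -\tfrac{1}{2} V V^{T}$.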

SLIDE 29

Post-orthogonalization for Linear Solvers

  • We consider QR decomposition in solving a non-singular square system of linear equations:

    $Ax = b \Leftrightarrow A'x = (G_0 A)x = G_0 b$

  • QR factorization of $A'$ can now be used to find $x$:

    $(G_0 Q_1) R x = G_0 b \Leftrightarrow R x = (G_0 Q_1)^T (G_0 b)$

  • Finally, $x$ can be found by triangular solve.

Overhead of post-orthogonalization: Matrix multiplications $(G_0 Q_1)$ and $(G_0 b)$

→ As $G_0$ is sparse, the total overhead for fault-tolerance is negligible.
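
The whole pipeline fits in a short serial sketch. Assumptions, all illustrative: $n = 3$, $c = 1$, $V = [1\ 1]$ and $G_1 = -\tfrac{1}{2} V V^T = [-1]$ (a choice that makes $G_0^T G_0 = I + G_v^T G_v$ for this toy case, not necessarily the generator used by the authors), and a hypothetical right-hand side `b`.

```python
import math

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def mgs(M):
    """Modified Gram-Schmidt: M = Q R, orthonormal columns in Q, upper-triangular R."""
    m, n = len(M), len(M[0])
    Q = [row[:] for row in M]
    R = [[0.0] * n for _ in range(n)]
    for t in range(n):
        R[t][t] = math.sqrt(sum(Q[i][t] ** 2 for i in range(m)))
        for i in range(m):
            Q[i][t] /= R[t][t]
        for j in range(t + 1, n):
            R[t][j] = sum(Q[i][t] * Q[i][j] for i in range(m))
            for i in range(m):
                Q[i][j] -= R[t][j] * Q[i][t]
    return Q, R

def back_substitute(R, y):
    """Solve R x = y for upper-triangular R."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x

# Hypothetical generator: Gv = [G1 V] with V = [1 1], G1 = -(1/2) V V^T = [-1].
Gv = [[-1.0, 1.0, 1.0]]
G0 = [[0.0, 1.0, 1.0],     # [[I_c + G1, V      ],
      [1.0, -1.0, 0.0],    #  [V^T,      -I_{n-c}]]
      [1.0, 0.0, -1.0]]

A = [[2.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 0.0, 4.0]]
b = [[4.0], [6.0], [5.0]]  # equals A times (1, 1, 1)

Q_enc, R = mgs(A + matmul(Gv, A))            # factorize A with its checksum row appended
Q1 = Q_enc[:3]                               # data rows: A = Q1 R, Q1 not orthogonal
G0Q1 = matmul(G0, Q1)                        # post-orthogonalization
y = matmul(transpose(G0Q1), matmul(G0, b))   # (G0 Q1)^T (G0 b)
x = back_substitute(R, [row[0] for row in y])
print(x)                                     # approximately [1.0, 1.0, 1.0]
```

The only extra work compared with an unprotected solve is the two products with $G_0$, which is the overhead the slide refers to.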

SLIDE 30

Checksum-Generator Matrices for Single-Node Failures

Note:

  • Single-node failure is the most common scenario in HPC.
  • Anything related to multiple-node failure scenarios would be interesting future work!

SLIDE 31

Checksum-Generator Matrices for Single-Node Failures

Recap: R-factor protection:

  • Designing $G_h$ is straightforward, as there is no restriction.
  • We can use MDS code for optimality.

Q-factor protection:

  • $G_v = \begin{bmatrix} G_1 & V \end{bmatrix}$ must satisfy the post-orthogonalization condition.

→ Construction of $G_v$ to tolerate single-node failures.

SLIDE 32

In-node Checksum Storage

SLIDE 33

In-node Checksum Storage

A0, A1 and the checksum A0+A1 all reside on Node 0 and Node 1.

  • This new setting could be more appealing in practice as it does not require additional processors.

→ Can we still have some optimality guarantee like MDS condition?

→ Challenge 2: minimal number of checksums required under this setting?

SLIDE 34

In-node Checksum Storage: Lower Bound for the Number of Checksums

Settings & parameters: $L$ data blocks distributed cyclically over $\rho$ processors. $K$ checksums are used.

What was achieved in previous work?

  • [P. Du et al. ’12] needed $K = 2 \left\lceil \frac{L}{\rho} \right\rceil$ for single-node failure tolerance.

Our results:

  • We prove the minimal $K$ to tolerate a single node failure and design the coding strategy to achieve it: $K \ge \left\lceil \frac{L}{\rho - 1} \right\rceil$
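
To get a feel for the gap between the two counts, they can simply be evaluated side by side. A small sketch; the function names and the sample values of $L$ and $\rho$ are illustrative.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return (a + b - 1) // b

def checksums_du_2012(L, rho):
    """K = 2 * ceil(L / rho), the count attributed to [P. Du et al. '12] on the slide."""
    return 2 * ceil_div(L, rho)

def checksums_lower_bound(L, rho):
    """K >= ceil(L / (rho - 1)), the minimal count stated on the slide (needs rho >= 2)."""
    return ceil_div(L, rho - 1)

for L, rho in [(12, 4), (100, 10), (1000, 32)]:
    print(L, rho, checksums_du_2012(L, rho), checksums_lower_bound(L, rho))
# 12 4 6 4
# 100 10 20 12
# 1000 32 64 33
```

For larger $\rho$ the minimal count approaches $L/\rho$, roughly half of the earlier scheme.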

SLIDE 35

Coded QR Decomposition under In-node Checksum Storage Setting

Application to coded QR decomposition under in-node checksum storage: The $n \times n$ matrix $A$ is distributed over $P = p_r p_c$ processors.

  • R-factor protection: $\left\lceil \frac{n}{p_c} \right\rceil$ (out-of-node checksum storage), $\left\lceil \frac{n}{p_c - 1} \right\rceil$ (in-node checksum storage)
  • Q-factor protection: $\left\lceil \frac{n}{p_r} \right\rceil$ (out-of-node checksum storage), $2 \left\lceil \frac{n}{p_r} \right\rceil$ (in-node checksum storage)

SLIDE 36

Contribution

Challenge 1: Q protection via coding?

  • Propose a new linear encoding strategy for Q matrix protection in Modified Gram-Schmidt (MGS) algorithm:
      → The trick: use low-cost post-orthogonalization of A to A’, to restore the degraded orthogonality of Q.
      → A’ can then be used in place of A with negligible overhead for solving a non-singular square system of linear equations.

Challenge 2: minimal number of checksums required under in-node checksum storage?

  • Obtain the minimal number of checksums required for single-node failures under the in-node checksum storage setting.

Note:

  • Single-node failure is the most common scenario in HPC.
  • Anything related to multiple-node failure scenarios would be interesting future work!