Coded QR Decomposition

Coded QR Decomposition. Quang Minh Nguyen, MIT; Haewon Jeong, Harvard - PowerPoint PPT Presentation



  1. Coded QR Decomposition. Quang Minh Nguyen, MIT; Haewon Jeong, Harvard University; Pulkit Grover, Carnegie Mellon University

  2. Motivation

  3. Motivation. Coded Computing: Coding Theory + Distributed Computing. Straggling Issue in Cloud Computing: - Coded Matrix Multiplication [Lee et al. ’15, ’17, Yu et al. ’17, Jeong et al. ’17, ’18, Baharav ’18, Sinong et al. ’18, Shahrzad et al. ’19] - Coded MapReduce [Li et al. ’15, ’17, ’18] - Coded Gradient Descent [Tandon et al. ’16, Raviv et al. ’17, Halbawi et al. ’18, Ye ’18]. Other Issues ??

  4. Motivation. Other Issues ?? - Processor Failure Issue in High Performance Computing (HPC)

  5. Motivation. Other Issues ?? - Processor Failure Issue in High Performance Computing (HPC). Larger Scale → Unreliability !! Fugaku supercomputer (2021): 150,000 nodes. Mean-time-between-failures (MTBF): system-level MTBF = 24-48 hours ~ node-level MTBF = 411-822 years!!
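
The MTBF figures on this slide can be checked with a short calculation. This is a minimal sketch, assuming independent nodes with constant failure rates, so that the system-level MTBF equals the node-level MTBF divided by the number of nodes; the function name `node_mtbf_years` is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def node_mtbf_years(system_mtbf_hours: float, num_nodes: int) -> float:
    """Node-level MTBF (in years) implied by a system-level MTBF.

    Assumption: nodes fail independently at a constant rate, so
    system MTBF = node MTBF / num_nodes.
    """
    return system_mtbf_hours * num_nodes / HOURS_PER_YEAR


low = node_mtbf_years(24, 150_000)
high = node_mtbf_years(48, 150_000)
print(f"{low:.0f}-{high:.0f} years")  # prints "411-822 years"
```

Under this assumption, even nodes that individually fail only once every few centuries produce a system-level failure every one or two days at this scale.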

  6. Motivation. Other Issues ?? - Processor Failure Issue in High Performance Computing (HPC). Larger Scale → Unreliability !! Fugaku supercomputer (2021): 150,000 nodes. Mean-time-between-failures (MTBF): system-level MTBF = 24-48 hours ~ node-level MTBF = 411-822 years!! HPC’s Solution: Algorithm-based fault-tolerance (ABFT) = adding encoded redundancy tailored to a specific algorithm. Same idea as Coded Computing !!

  7. Motivation: bridge the gap between ABFT for HPC and Coded Computing. • QR Decomposition: an important matrix factorization in HPC, where ABFT faces challenges • A more practical HPC setting that was not considered in the coded computing literature: - Block-cyclic distribution - In-node checksum storage (storing redundancies in systematic nodes) → Coded QR Decomposition

  8. What is QR Decomposition? A = QR, with orthogonal Q (i.e. $Q^T Q = I$) and upper triangular R. • QR decomposition is widely used in many HPC applications: solving systems of linear equations, SVM, linear least squares problems, etc.
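
To make the definition concrete, the following sketch factors a small random matrix with NumPy's `numpy.linalg.qr`, checks the two defining properties, and then uses the factors to solve a linear system. The matrix size, seed and variable names are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

# Reduced QR; for a square A, Q is 5 x 5 and R is 5 x 5.
Q, R = np.linalg.qr(A)

orthogonal = np.allclose(Q.T @ Q, np.eye(5))   # Q^T Q = I
upper_triangular = np.allclose(R, np.triu(R))  # zeros below the diagonal
reconstructs = np.allclose(Q @ R, A)           # A = QR

# Solving Ax = b reduces to the triangular system Rx = Q^T b.
x = np.linalg.solve(R, Q.T @ b)
solves = np.allclose(A @ x, b)

print(orthogonal, upper_triangular, reconstructs, solves)
```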

  9. ABFT for QR Decomposition. Key idea [O. Maslennikow et al. ’98, P. Du et al. ’12, P. Wu et al. ’14]: $[A \mid \text{checksums}] = Q \times R'$, where $R' = [R \mid \text{checksums}]$. R' is upper-triangular → R is upper-triangular. So we can retrieve A = Q x R as the QR decomposition of A.
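
The key idea can be checked numerically. In this sketch the column checksums are generated as `A @ G_h` with a random, hypothetical generator `G_h` (the name follows the checksum-generator notation used later in the deck); sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 2
A = rng.standard_normal((n, n))
G_h = rng.standard_normal((n, d))   # hypothetical checksum generator

A_enc = np.hstack([A, A @ G_h])     # [A | checksums]
Q, R_prime = np.linalg.qr(A_enc)    # Q is n x n, R' is n x (n + d)

R = R_prime[:, :n]                  # systematic part of R'
R_checks = R_prime[:, n:]           # checksum part of R'

r_upper = np.allclose(R, np.triu(R))             # R is upper triangular
checks_kept = np.allclose(R_checks, R @ G_h)     # R' = [R | R G_h]
a_recovered = np.allclose(Q @ R, A)              # A = Q x R

print(r_upper, checks_kept, a_recovered)
```

Because the checksum columns of R' equal `R @ G_h`, lost entries of R can in principle be rebuilt from the surviving entries and the checksums.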

  10. Challenges in Coding for QR Decomposition • Can we do the same trick for Q protection? NO. Appending checksum rows to A gives $\begin{bmatrix} A \\ \text{checksums} \end{bmatrix} = Q' \times R$ with $Q' = \begin{bmatrix} Q \\ \text{checksums} \end{bmatrix}$, and Q is not orthogonal: $Q'^T Q' = I$ does not imply $Q^T Q = I$. • Proven in [Theorem 5.1, P. Du et al. ’12]. → Challenge 1: Q protection via coding? Can we efficiently restore the orthogonality of Q?
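
A small numerical experiment illustrates the problem. The sketch below appends row checksums `G_v @ A`, using a random hypothetical generator `G_v`, and inspects the top block of the resulting Q-factor; it illustrates the statement above and is not a substitute for the cited theorem.

```python
import numpy as np

rng = np.random.default_rng(2)
n, c = 6, 2
A = rng.standard_normal((n, n))
G_v = rng.standard_normal((c, n))    # hypothetical checksum generator

A_enc = np.vstack([A, G_v @ A])      # A with checksum rows appended
Q_prime, R = np.linalg.qr(A_enc)     # Q' is (n + c) x n, R is n x n

Q = Q_prime[:n, :]                   # systematic part of Q'
Q_checks = Q_prime[n:, :]            # checksum part of Q'

checks_kept = np.allclose(Q_checks, G_v @ Q)               # Q' = [Q; G_v Q]
q_prime_ok = np.allclose(Q_prime.T @ Q_prime, np.eye(n))   # Q'^T Q' = I
q_ok = np.allclose(Q.T @ Q, np.eye(n))                     # Q^T Q = I ?
still_factors = np.allclose(Q @ R, A)                      # A = Q R

print(checks_kept, q_prime_ok, q_ok, still_factors)
```

In this run the checksums are preserved and `Q @ R` still reproduces `A`, but `Q.T @ Q` differs from the identity by exactly `Q_checks.T @ Q_checks`, so Q has lost orthogonality.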

  11. Challenges in Coding for QR Decomposition. In-node checksum storage: • was recently proposed for ABFT [P. Du et al. ’12]. • stores coded data (checksums) in the original processors instead of adding extra processors for fault tolerance.

  12. Challenges in Coding for QR Decomposition. In-node checksum storage: A_0, A_1 and the checksum A_0 + A_1 are all stored on Node 0 and Node 1. - Fundamental limit?? Out-of-node checksum storage (conventional setting): A_0 on Node 0, A_1 on Node 1, and the checksum A_0 + A_1 on Node 2. - Optimal coding strategy: MDS. → Can we still have some optimality guarantee like the MDS condition? → Challenge 2: minimal number of checksums required under in-node checksum storage?

  13. Summary of Challenges. Challenge 1: Q protection via coding? Challenge 2: minimal number of checksums required under in-node checksum storage? → Our Contribution: address these 2 challenges

  14. System Model • For fault tolerance, we encode the n x n matrix A into $\tilde{A}$ with both vertical checksums $G_v A$ and horizontal checksums $A G_h$, where $G_v$ and $G_h$ are checksum-generator matrices. • Out-of-node checksum storage: the checksums are distributed over a new set of checksum processors.

  15. System Model. Coded Computing: Master-Worker Setting. [Figure: the master node splits the input into blocks A_0, A_1, A_2, A_3, distributes them with redundancy to Workers 1-5, and a master node collects the output.]

  16. System Model. Coded Computing: Master-Worker Setting [Figure: as on the previous slide]. HPC Setting: 2D block-cyclic distribution • The input matrix A is distributed among processors. • The layout of systematic processors and checksum processors is maintained throughout the computation.

  17. Failure Model and Real-time Recovery in HPC. Single-node fail-stop failures: • A failure corresponds to a systematic processor that completely stops responding and loses its part of the global data. • The identity of the failed processor is provided by some external source. Real-time Recovery: • The failure can occur at any point during the execution of QR decomposition, immediately triggering the recovery process. • Computation continues once the system has recovered from its latest failure.
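
As an illustration of this failure model, the sketch below recovers the data of one failed systematic node from the simple sum checksum A_0 + A_1 used in the earlier storage example. It assumes out-of-node storage (the checksum lives on a separate node) and that the failed node's identity is known; the dictionary `blocks` and the other names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two systematic nodes, each holding one block of the data.
blocks = {0: rng.standard_normal((3, 3)), 1: rng.standard_normal((3, 3))}

# Checksum kept on a separate checksum node (out-of-node storage).
checksum = blocks[0] + blocks[1]

# Fail-stop failure: node 0 stops responding and loses its block.
failed = 0
lost = blocks.pop(failed)        # kept here only to verify the recovery

# Recovery: the failed node's identity is known, so subtract the survivor.
survivor = blocks[1 - failed]
recovered = checksum - survivor

print(np.allclose(recovered, lost))
```

Real-time recovery during QR decomposition additionally requires that such checksum relations keep holding while the data is being updated, which is what the checksum-preservation result addresses.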

  18. QR Decomposition: Modified Gram-Schmidt (MGS) algorithm. We consider MGS, one of the 3 most widely used algorithms for QR decomposition. Each iteration consists of an R computation and a Q computation.
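
For reference, here is a minimal sequential sketch of MGS, written in the common row-oriented form and assuming the input has full column rank. It is not the distributed block-cyclic implementation the slides have in mind, and the function name `mgs_qr` is hypothetical.

```python
import numpy as np


def mgs_qr(A):
    """Modified Gram-Schmidt QR of a full-column-rank matrix A (m x n, m >= n)."""
    Q = np.array(A, dtype=float)   # working copy; columns become orthonormal
    n = Q.shape[1]
    R = np.zeros((n, n))
    for k in range(n):
        # R computation: norm of the current column and its inner
        # products with the remaining (not yet orthogonalized) columns.
        R[k, k] = np.linalg.norm(Q[:, k])
        Q[:, k] /= R[k, k]
        R[k, k + 1:] = Q[:, k] @ Q[:, k + 1:]
        # Q computation: remove the q_k component from the remaining columns.
        Q[:, k + 1:] -= np.outer(Q[:, k], R[k, k + 1:])
    return Q, R


rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q, R = mgs_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(5)))
```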

  19. Main Results. Checksum-preservation for MGS: checksums are preserved to facilitate fault-tolerant computation. Challenge 1: Q protection via coding? → Post-orthogonalization: post-processing to restore the degraded orthogonality. Challenge 2: minimal number of checksums required under in-node checksum storage? → Optimality for the in-node checksum storage setting: minimal number of checksums for single-node failure tolerance.

  20. Checksum-preservation for MGS • To facilitate real-time recovery, we want the checksums to be preserved at any iteration of MGS (or GS). • We encode $A \to \tilde{A}$, and QR-factorize $\tilde{A}$. • At each iteration $t = 1, \ldots, T$, the algorithm maintains the updates $Q^{(t)}$ and $R^{(t)}$, so that at the end $\tilde{A} = Q^{(T)} R^{(T)}$ is the QR decomposition of $\tilde{A}$.

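  21. Checksum-preservation for MGS. We prove that: at any iteration t of MGS on the encoded matrix $\tilde{A}$ (with blocks $A$, $A G_h$ and $G_v A$), $Q^{(t)} = \begin{bmatrix} Q_1^{(t)} \\ G_v Q_1^{(t)} \end{bmatrix}$ and $R^{(t)} = \begin{bmatrix} R_1^{(t)} & R_1^{(t)} G_h \end{bmatrix}$. Checksums preserved!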
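
The claim can be spot-checked numerically. The sketch below is one possible reading of the setup, under several assumptions not stated on the slides: the bottom-right block of the encoded matrix is taken to be `G_v @ A @ G_h`, MGS runs for T = n iterations (only the n systematic columns are normalized), and the generators `G_v` and `G_h` are random.

```python
import numpy as np

rng = np.random.default_rng(4)
n, c, d = 5, 2, 2
A = rng.standard_normal((n, n))
G_v = rng.standard_normal((c, n))   # hypothetical vertical generator
G_h = rng.standard_normal((n, d))   # hypothetical horizontal generator

# Encoded matrix; the bottom-right block G_v A G_h is an assumption.
A_enc = np.block([[A, A @ G_h], [G_v @ A, G_v @ A @ G_h]])

W = A_enc.copy()                    # working matrix, overwritten in place
R = np.zeros((n, n + d))
preserved = []
for t in range(n):                  # T = n iterations (assumption)
    R[t, t] = np.linalg.norm(W[:, t])
    W[:, t] /= R[t, t]
    R[t, t + 1:] = W[:, t] @ W[:, t + 1:]
    W[:, t + 1:] -= np.outer(W[:, t], R[t, t + 1:])
    # Vertical checksums: bottom c rows equal G_v times the top n rows.
    vertical_ok = np.allclose(W[n:, :], G_v @ W[:n, :])
    # Horizontal checksums on the rows of R computed so far.
    horizontal_ok = np.allclose(R[:t + 1, n:], R[:t + 1, :n] @ G_h)
    preserved.append(bool(vertical_ok and horizontal_ok))

Q1, R1 = W[:n, :n], R[:, :n]
print(all(preserved), np.allclose(Q1 @ R1, A))
```

In this experiment both checksum relations hold after every iteration, and the systematic blocks satisfy A = Q_1 R_1 at the end.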

  22. Checksum-preservation for MGS. At the end, i.e. $t = T$, we have: $Q^{(T)} = \begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}$ and $R^{(T)} = \begin{bmatrix} R_1 & R_1 G_h \end{bmatrix}$. → Retrieve $A = Q_1 R_1$, where $Q_1$ is non-orthogonal (first challenge), and $R_1$ is upper-triangular.

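  23. Challenge 1: Degraded Orthogonality of Conventional Coding. In the factorization of $\tilde{A}$ with $Q = \begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}$ and $R = \begin{bmatrix} R_1 & R_1 G_h \end{bmatrix}$, $Q_1$ is not orthogonal. In this work, we raise the question “How ‘non-orthogonal’ is $Q_1$?”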

  24. Challenge 1: Degraded Orthogonality of Conventional Coding. In the factorization of $\tilde{A}$ with $Q = \begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}$ and $R = \begin{bmatrix} R_1 & R_1 G_h \end{bmatrix}$, $Q_1$ is not orthogonal. In this work, we raise the question “How ‘non-orthogonal’ is $Q_1$?” Main Idea: cheap post-processing: $Q_1$ → orthogonal matrix.

  25. Challenge 1: Degraded Orthogonality of Conventional Coding. In the factorization of $\tilde{A}$ with $Q = \begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}$ and $R = \begin{bmatrix} R_1 & R_1 G_h \end{bmatrix}$, $Q_1$ is not orthogonal. In this work, we raise the question “How ‘non-orthogonal’ is $Q_1$?” Main Idea: cheap post-processing: $Q_1 \to G_0 Q_1$, an orthogonal matrix. → Post-orthogonalization

  26. Post-orthogonalization. Question: Can we always construct $G_0$ such that $G_0 Q_1$ is orthogonal? It depends on $G_v$. Here $Q_1$ is not orthogonal, $\begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}$ is orthogonal, and $G_v$ is a c x n checksum-generator matrix under our control !!

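  27. Construction of $G_0$. The checksum-generator matrix is partitioned as $G_v = \begin{bmatrix} G_1 & V \end{bmatrix}$, where $G_v$ is c x n, $G_1$ is c x (n-c) and $V$ is c x c. $G_0$ is sparse: it is a block matrix with block sizes c and n-c, built from $G_1$, $V$, $V^T$, $I_c$ and $-I_{n-c}$.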

  28. Post-orthogonalization Condition for Checksum-generator Matrix. Main Result: We could prove that if the checksum-generator matrix $G_v = \begin{bmatrix} G_1 & V \end{bmatrix}$ satisfies the post-orthogonalization condition, then: • $G_0 Q_1$ is orthogonal • $G_0$ is invertible. → $A' = G_0 A = (G_0 Q_1) R$ is now the QR decomposition of $A'$! But would $A'$ be useful?

  29. Post-orthogonalization for Linear Solvers • We consider QR decomposition in solving a non-singular square system of linear equations: $Ax = b \Leftrightarrow A'x = (G_0 A)x = G_0 b$ • The QR factorization of $A'$ can now be used to find x: $(G_0 Q_1) R x = G_0 b \Leftrightarrow R x = (G_0 Q_1)^T (G_0 b)$. Overhead of post-orthogonalization: the matrix multiplications $(G_0 Q_1)$ and $(G_0 b)$. • Finally, x can be found by a triangular solve. → As $G_0$ is sparse, the total overhead for fault-tolerance is negligible.
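
The solver steps can be sketched as follows. Note the substitution: the sparse $G_0$ of the slides is not reproduced here. Instead, this sketch uses a dense stand-in, the transposed Cholesky factor satisfying $G_0^T G_0 = I + G_v^T G_v$, which also makes $G_0 Q_1$ orthogonal but does not share the sparsity (and hence the low overhead) claimed for the original construction. `G_v` is random and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n, c = 6, 2
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
G_v = rng.standard_normal((c, n))    # hypothetical checksum generator

# Coded QR: A = Q1 R, where Q1 (top block of the Q-factor) is not orthogonal.
Q_full, R = np.linalg.qr(np.vstack([A, G_v @ A]))
Q1 = Q_full[:n, :]

# Stand-in G_0 (assumption, dense): G_0^T G_0 = I + G_v^T G_v.
L = np.linalg.cholesky(np.eye(n) + G_v.T @ G_v)   # lower triangular, L L^T
G0 = L.T

# Post-orthogonalization: G_0 Q1 is orthogonal.
Q_fixed = G0 @ Q1
orthogonal = np.allclose(Q_fixed.T @ Q_fixed, np.eye(n))

# Solve R x = (G_0 Q1)^T (G_0 b); R is upper triangular, so a dedicated
# triangular solver could be used here instead of a general solve.
x = np.linalg.solve(R, Q_fixed.T @ (G0 @ b))

print(orthogonal, np.allclose(A @ x, b))
```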

  30. Checksum-Generator Matrices for Single-Node Failures. Note: • Single-node failure is the most common scenario in HPC. • Anything related to multiple-node failure scenarios would be interesting future work!

  31. Checksum-Generator Matrices for Single-Node Failures. Recap: R-factor protection: • Designing $G_h$ is straightforward, as there is no restriction. • We can use an MDS code for optimality. Q-factor protection: • $G_v = \begin{bmatrix} G_1 & V \end{bmatrix}$ must satisfy the post-orthogonalization condition. → Construction of $G_v$ to tolerate single-node failures.

  32. In-node Checksum Storage

  33. In-node Checksum Storage. A_0, A_1 and the checksum A_0 + A_1 are stored on Node 0 and Node 1. • This new setting could be more appealing in practice as it does not require additional processors. → Can we still have some optimality guarantee like the MDS condition? → Challenge 2: minimal number of checksums required under this setting?
