  1. Communication-avoiding Cholesky-QR2 for rectangular matrices. Edward Hutter and Edgar Solomonik, Laboratory for Parallel Numerical Algorithms, Department of Computer Science, University of Illinois at Urbana-Champaign. March 9, 2018.

  2. Abstract
Novelty: a new distributed-memory QR factorization algorithm that extends the recent Cholesky-QR2 algorithm to rectangular matrices, exploits a tunable processor grid to provably reduce communication, and uses the first distributed-memory implementation of recursive 3D Cholesky factorization.
Benefits: practical, flexible, and achieves minimal communication (data movement between processors).
Drawbacks: the matrix must be sufficiently well-conditioned, and the machine must have sufficient memory.

  3. Motivation for reducing parallel algorithmic costs
Communication cost model: α = cost of sending a message over the network, β = cost of injecting a byte of data into the network, γ = cost of a floating point operation.
Current cost trend for distributed-memory machines: α ≫ β ≫ γ.
Goal: a QR factorization algorithm that prioritizes minimizing synchronization and communication cost.
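
In this model the running time of a parallel algorithm along its critical path is usually accounted as (a standard formulation, added here for reference rather than taken from the slides)

    T = α · S + β · W + γ · F,

where S is the number of messages (synchronizations), W is the number of words moved, and F is the number of floating point operations on the critical path. Because α ≫ β ≫ γ, reducing S and W matters most.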

  4. 2D and 3D QR algorithm characteristics
2D algorithms are communication-optimal assuming a minimal memory footprint. 3D algorithms take advantage of extra memory to reduce communication; they exist in theory but have not been implemented or studied in practice [1, 2].
[1] A. Tiskin 2007, "Communication-efficient generic pairwise elimination"
[2] E. Solomonik et al., "A communication-avoiding parallel algorithm for the symmetric eigenvalue problem"

  5. Parallel QR factorization cost comparison
ScaLAPACK's PGEQRF is a 2D Householder QR. CA-QR uses TSQR along panels to reduce synchronization. Communication-avoiding CholeskyQR2 is a tunable Cholesky-based QR factorization algorithm. We compare both the theoretical cost and the performance:

    T_2D-Householder-QR = O( n log P · α + (mn/√P) · β )
    T_CAQR-HR           = O( √P log² P · α + (mn/√P) · β )
    T_CA-CholeskyQR2    = O( (Pn/m)^(2/3) log P · α + (n²m/P)^(2/3) · β )

  6. Cholesky-QR algorithm
[Q, R] = CholeskyQR(A):
    B ← A^T A
    R ← Cholesky(B)
    Q ← A R^(-1)
A^T A = R^T Q^T Q R = R^T R if Q is orthogonal.
Both Cholesky and the triangular solve are backward stable, yet sensitive to the conditioning of B. Cholesky can break down due to numerical error, losing positive definiteness. CholeskyQR is not stable: the deviation from orthogonality of the computed Q is O(κ(A)² · ε), where ε is machine epsilon.
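
To make the three steps concrete, here is a minimal single-node sketch in C++ using CBLAS/LAPACKE (the routines MKL exposes). It illustrates only the algorithm above, not the distributed implementation; the function name cholesky_qr and its signature are assumptions made for this example.

#include <mkl.h>   // or any CBLAS + LAPACKE headers

// CholeskyQR: B = A^T A, R = chol(B), Q = A R^{-1}.
// A is m x n column-major (m >= n) and is overwritten with Q.
// Only the upper triangle of the n x n buffer R is written/valid.
// Returns dpotrf's info: nonzero signals loss of positive definiteness.
int cholesky_qr(int m, int n, double* A, int lda, double* R, int ldr) {
    // B = A^T * A, accumulated directly into R's upper triangle
    cblas_dsyrk(CblasColMajor, CblasUpper, CblasTrans, n, m,
                1.0, A, lda, 0.0, R, ldr);
    // R = Cholesky(B), upper-triangular factor
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'U', n, R, ldr);
    if (info != 0) return info;   // breakdown: B is not numerically SPD
    // Q = A * R^{-1}, computed as a right triangular solve; A now holds Q
    cblas_dtrsm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans, CblasNonUnit,
                m, n, 1.0, R, ldr, A, lda);
    return 0;
}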

  7. Cholesky-QR2 algorithm
[Q, R] = CholeskyQR2(A):
    [Q1, R1] ← CholeskyQR(A)
    [Q, R2]  ← CholeskyQR(Q1)
    R ← R2 R1
By performing CholeskyQR twice, the residual and the deviation from orthogonality are O(ε) provided κ(A) = O(1/√ε) [3].
Proposed as a replacement for TSQR for tall-and-skinny matrices: lower theoretical communication cost (by an O(log P) factor), better performance, and a simpler implementation [4]. TSQR, however, is unconditionally stable.
[3] Y. Yamamoto et al., "Roundoff Error Analysis of the CholeskyQR2 ..."
[4] Y. Yamamoto et al., "CholeskyQR2: A communication-avoiding algorithm"
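
Continuing the sketch from the previous note (same caveats: serial, illustrative, assumed names and the CBLAS/LAPACKE headers from above), CholeskyQR2 is two CholeskyQR passes plus the triangular product R = R2 R1:

#include <vector>

int cholesky_qr(int m, int n, double* A, int lda, double* R, int ldr);  // sketch above

int cholesky_qr2(int m, int n, double* A, int lda, double* R, int ldr) {
    std::vector<double> R1(n * n), R2(n * n);
    int info = cholesky_qr(m, n, A, lda, R1.data(), n);   // A <- Q1
    if (info != 0) return info;
    info = cholesky_qr(m, n, A, lda, R2.data(), n);       // A <- Q (reorthogonalized)
    if (info != 0) return info;
    // R = R2 * R1: copy R1 into R (zeroing the unused lower triangle), then
    // multiply from the left by the upper-triangular R2.
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            R[i + j * ldr] = (i <= j) ? R1[i + j * n] : 0.0;
    cblas_dtrmm(CblasColMajor, CblasLeft, CblasUpper, CblasNoTrans, CblasNonUnit,
                n, n, 1.0, R2.data(), n, R, ldr);
    return 0;
}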

  8. 1D CholeskyQR2

  9. 1D CholeskyQR2
T_1D-CholeskyQR2(m, n, P) = O( log P · α + n² · β + (mn²/P + n³) · γ )
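
A hedged sketch of where each cost term comes from in the 1D layout: every rank owns a block of rows of A, a single allreduce of the n × n Gram matrix gives the log P · α and n² · β terms, and the redundant n × n Cholesky on every rank gives the extra n³ · γ. The routine below (names such as cholesky_qr_1d and m_local are assumptions) performs one CholeskyQR pass with MPI; CholeskyQR2 would call it twice as before.

#include <mpi.h>
#include <vector>
// CBLAS/LAPACKE headers as in the earlier sketches.

int cholesky_qr_1d(int m_local, int n, double* A_local, int lda,
                   double* R, int ldr, MPI_Comm comm) {
    // Local Gram contribution: B = A_local^T * A_local (upper triangle)
    std::vector<double> B(n * n, 0.0);
    cblas_dsyrk(CblasColMajor, CblasUpper, CblasTrans, n, m_local,
                1.0, A_local, lda, 0.0, B.data(), n);
    // One allreduce of n^2 words: the log P * alpha and n^2 * beta terms
    MPI_Allreduce(MPI_IN_PLACE, B.data(), n * n, MPI_DOUBLE, MPI_SUM, comm);
    // Every rank factors the same n x n Gram matrix: the redundant n^3 * gamma term
    for (int j = 0; j < n; ++j)
        for (int i = 0; i <= j; ++i)
            R[i + j * ldr] = B[i + j * n];
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'U', n, R, ldr);
    if (info != 0) return info;
    // Local triangular solve: Q_local = A_local * R^{-1} (the mn^2/P flops)
    cblas_dtrsm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans, CblasNonUnit,
                m_local, n, 1.0, R, ldr, A_local, lda);
    return 0;
}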

  10. Figure: 3D algorithm for square matrix multiplication [1, 2, 3]
[1] Berntsen 1989, "Communication-efficient matrix multiplication on hypercubes"
[2] Aggarwal, Chandra, Snir 1990, "Communication complexity of PRAMs"
[3] Agarwal et al. 1995, "A three-dimensional approach to parallel matrix multiplication"

  11. Figure: 3D algorithm for square matrix multiplication (cost)
T_3D-MM(n, P) = O( log P · α + (n²/P^(2/3)) · β + (n³/P) · γ )
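
For intuition, these terms can be read off the classical 3D algorithm (a standard argument, sketched here under the usual assumption of a P^(1/3) × P^(1/3) × P^(1/3) grid): each processor receives an (n/P^(1/3)) × (n/P^(1/3)) block of each input via broadcasts and contributes one block to a reduction, so W = O(n²/P^(2/3)) words; it multiplies its two blocks locally, so F = (n/P^(1/3))³ = n³/P flops; and the broadcasts and the reduction along the grid fibers take O(log P) messages.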

  12. 3D recursive CholeskyInverse
[L, L^(-1)] = CholeskyInverse(A):
    [L11, L11^(-1)] ← CholeskyInverse(A11)
    L21 ← A21 L11^(-T)
    [L22, L22^(-1)] ← CholeskyInverse(A22 − L21 L21^T)
    L^(-1)_21 ← −L22^(-1) L21 L11^(-1)
T_CholeskyInverse3D(n, P) = 2 T_CholeskyInverse3D(n/2, P) + O(3D matrix multiplication)

  13. 3D recursive CholeskyInverse (cost)
Solving the recurrence T_CholeskyInverse3D(n, P) = 2 T_CholeskyInverse3D(n/2, P) + O(3D matrix multiplication) gives
T_CholeskyInverse3D(n, P) = O( P^(2/3) log P · α + (n²/P^(2/3)) · β + (n³/P) · γ )
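
The recursion is easiest to see in a shared-memory sketch. The routine below is illustrative only (assumed names, serial BLAS/LAPACK; in the 3D algorithm the triangular solve and the two multiplications are performed as 3D matrix multiplications across the processor cube). It overwrites A's lower triangle with L and fills the lower triangle of a separate buffer with L^(-1).

// CBLAS/LAPACKE headers as before.

void cholesky_inverse(int n, double* A, int lda, double* Linv, int ldi) {
    const int base = 64;   // assumed base-case size
    if (n <= base) {
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, lda);          // A <- L
        for (int j = 0; j < n; ++j)                                 // Linv <- L
            for (int i = j; i < n; ++i)
                Linv[i + j * ldi] = A[i + j * lda];
        LAPACKE_dtrtri(LAPACK_COL_MAJOR, 'L', 'N', n, Linv, ldi);  // Linv <- L^{-1}
        return;
    }
    int k = n / 2;
    double* A11 = A;                double* I11 = Linv;
    double* A21 = A + k;            double* I21 = Linv + k;
    double* A22 = A + k + k * lda;  double* I22 = Linv + k + k * ldi;
    cholesky_inverse(k, A11, lda, I11, ldi);                // [L11, L11^{-1}]
    // L21 = A21 * L11^{-T}: right triangular solve against L11^T
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                n - k, k, 1.0, A11, lda, A21, lda);
    // A22 <- A22 - L21 * L21^T: symmetric update of the trailing block
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, n - k, k,
                -1.0, A21, lda, 1.0, A22, lda);
    cholesky_inverse(n - k, A22, lda, I22, ldi);            // [L22, L22^{-1}]
    // (L^{-1})_21 = -L22^{-1} * L21 * L11^{-1}
    for (int j = 0; j < k; ++j)                             // I21 <- L21
        for (int i = 0; i < n - k; ++i)
            I21[i + j * ldi] = A21[i + j * lda];
    cblas_dtrmm(CblasColMajor, CblasRight, CblasLower, CblasNoTrans, CblasNonUnit,
                n - k, k, 1.0, I11, ldi, I21, ldi);         // I21 <- L21 * L11^{-1}
    cblas_dtrmm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasNonUnit,
                n - k, k, -1.0, I22, ldi, I21, ldi);        // I21 <- -L22^{-1} * I21
}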

  14. Triangular solve with multiple right-hand sides (TRSM)
TRSM3D is an option to reduce the computation cost by a factor of 2 at the highest levels of recursion: skip two matrix multiplications by not forming L^(-1)_21. Instead, utilize the inverted diagonal blocks and iterate along column panels, with two 3D matrix multiplications per iteration to solve for a panel and update the trailing matrix (see the sketch below).
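
A serial sketch of that panel iteration, written here for a right solve X R = A against an upper-triangular R (so X = A R^(-1), the solve needed to form Q); the orientation, block size, and names are assumptions for illustration. Each iteration inverts only the diagonal block, then uses one matrix multiply to produce the panel and one to update the trailing columns.

#include <algorithm>
#include <vector>
// CBLAS/LAPACKE headers as before.

// Overwrites the m x n matrix A with X = A * R^{-1}; R is n x n upper triangular.
void blocked_trsm_right_upper(int m, int n, int b, const double* R, int ldr,
                              double* A, int lda) {
    std::vector<double> Dinv(b * b), panel(m * b);
    for (int j = 0; j < n; j += b) {
        int nb = std::min(b, n - j);
        // Invert the diagonal block R(j:j+nb, j:j+nb) (precomputed up front in the 3D code)
        for (int c = 0; c < nb; ++c)
            for (int r = 0; r < nb; ++r)
                Dinv[r + c * b] = (r <= c) ? R[(j + r) + (j + c) * ldr] : 0.0;
        LAPACKE_dtrtri(LAPACK_COL_MAJOR, 'U', 'N', nb, Dinv.data(), b);
        // (1) Solve for the panel by multiplication: X_j = A_j * R_jj^{-1}
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, nb, nb,
                    1.0, A + j * lda, lda, Dinv.data(), b, 0.0, panel.data(), m);
        for (int c = 0; c < nb; ++c)
            cblas_dcopy(m, panel.data() + c * m, 1, A + (j + c) * lda, 1);
        // (2) Update the trailing columns: A(:, j+nb:) -= X_j * R(j:j+nb, j+nb:)
        if (j + nb < n)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n - (j + nb), nb,
                        -1.0, A + j * lda, lda, R + j + (j + nb) * ldr, ldr,
                        1.0, A + (j + nb) * lda, lda);
    }
}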

  15. Figure: A^T A over a tunable c × d × c processor grid

  16. Figure: Start with a tunable c × d × c processor grid

  17. Figure: Broadcast columns of A
Cost: 2 log₂ c · α + (2mn/(dc)) · β

  18. Figure: Reduce contiguous groups of size c
Cost: 2 log₂ c · α + (2n²/c²) · β + (n²/c²) · γ

  19. Figure: Allreduce alternating groups of size d/c
Cost: 2 log₂(d/c) · α + (2n²/c²) · β + (n²/c²) · γ

  20. Figure: Broadcast missing pieces of B along the depth dimension
Cost: 2 log₂ c · α + (2n²/c²) · β
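
The four steps on slides 17-20 are all collectives over subcommunicators of the c × d × c grid. The demo below shows that pattern in MPI; the rank-to-grid mapping, the buffer handling, and the use of a single size-d allreduce in place of the reduce-then-allreduce pair of slides 18-19 are simplifying assumptions made for illustration, not the exact layout used in the implementation.

#include <mpi.h>

void grid_collectives_demo(int c, int d, double* buf, int len, MPI_Comm world) {
    int rank;
    MPI_Comm_rank(world, &rank);
    // Assumed coordinates: x fastest, then y (the d dimension), then z (the depth)
    int x = rank % c, y = (rank / c) % d, z = rank / (c * d);
    MPI_Comm row, col, depth;
    MPI_Comm_split(world, y + d * z, x, &row);    // groups of size c (slides 17, 18)
    MPI_Comm_split(world, x + c * z, y, &col);    // groups of size d (slide 19 works on subsets of these)
    MPI_Comm_split(world, x + c * y, z, &depth);  // groups of size c along the depth (slide 20)
    // Slide 17: broadcast the owned columns of A within each size-c group
    MPI_Bcast(buf, len, MPI_DOUBLE, 0, row);
    // ... local multiplies form a partial Gram contribution in buf ...
    // Slides 18-19: sum the partial contributions (collapsed here into one allreduce)
    MPI_Allreduce(MPI_IN_PLACE, buf, len, MPI_DOUBLE, MPI_SUM, col);
    // Slide 20: broadcast the missing pieces of B along the depth dimension
    MPI_Bcast(buf, len, MPI_DOUBLE, 0, depth);
    MPI_Comm_free(&row);
    MPI_Comm_free(&col);
    MPI_Comm_free(&depth);
}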

  21. Figure: d/c simultaneous 3D CholeskyInverse factorizations on cubes of dimension c
Cost: O( c² log c · α + (n²/c²) · β + (n³/c³) · γ )

  22. Figure: d/c simultaneous 3D matrix multiplications or TRSMs on cubes of dimension c
Cost: O( log c · α + (4mn/(dc) + n²/c²) · β + (mn²/(c²d)) · γ )

  23. Figure: Tunable Cholesky-QR2 algorithm

  24. Cost comparison
By setting m/d = n/c and enforcing P = c²d, we can solve for the optimal grid and minimize communication.

                      1D CholeskyQR2     2D Householder QR   2D CAQR-HR        CA-CholeskyQR2 (optimal grid)
    # of messages     O(log P)           O(n log P)          O(√P log² P)      O((Pn/m)^(2/3) log P)
    # of words        O(n²)              O(mn/√P)            O(mn/√P)          O((n²m/P)^(2/3))
    # of flops        O(mn²/P + n³)      O(mn²/P)            O(mn²/P)          O(mn²/P)
    Memory footprint  O(mn/P + n²)       O(mn/P)             O(mn/P)           O((n²m/P)^(2/3))
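
As a sanity check on the last column, the optimal grid shape follows directly from the two conditions above (my derivation, consistent with the table entries): substituting d = mc/n into P = c²d gives P = (m/n) · c³, so

    c = (Pn/m)^(1/3),   d = (Pm²/n²)^(1/3).

The per-processor bandwidth n²/c² then becomes (n²m/P)^(2/3), and the c² log P synchronization term becomes (Pn/m)^(2/3) log P, matching the CA-CholeskyQR2 entries.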

  25. Implementation and performance testing
All code was written from scratch in modern C++ and MPI. This is a first implementation: TRSM is not used in the results, there is no overlap of computation and communication, no topology-aware mapping, and no threading except for multi-threaded BLAS.
For each processor count and matrix size, ScaLAPACK PGEQRF was tuned over block sizes and processor grid dimensions, CA-CholeskyQR2 was tuned over a range of 3D grid dimensions, and the best performance numbers were chosen for the comparison.

  26. Architecture details
Cray XC40 system at the Argonne Leadership Computing Facility. Compute nodes: Intel Knights Landing; each node is a single Xeon Phi chip with 64 cores, 16 GB MCDRAM, and 192 GB DDR4; each core supports 4 hardware threads, for up to 256 threads per node. Interconnect topology: Cray Aries dual-plane Dragonfly with 10 groups.

  27. Architecture details (continued): our configuration
16 MPI processes per node, 4 OpenMP threads per process, 2 hyperthreads per core. MKL 2018 libraries were used for BLAS and LAPACK, and the MKL 2018 ScaLAPACK library was used as the comparison for benchmarking. The (d, c) pair of the c × d × c processor grid is given for each data point.

  28. Weak scaling on Theta (XC40)
[Figure: weak-scaling performance in Gigaflops/s versus node count (16 processes per node), with m = #nodes · 256 and n = 128, on 128 to 1024 nodes. Curves: Cholesky-QR2 and ScaLAPACK QR; each Cholesky-QR2 point is annotated with its (d, c) grid, from (512, 2) at 128 nodes to (4096, 2) at 1024 nodes.]
