 
              Communication-avoiding Cholesky-QR2 for rectangular matrices Edward Hutter and Edgar Solomonik Laboratory for Parallel Numerical Algorithms Department of Computer Science University of Illinois at Urbana-Champaign March 9, 2018 Edward Hutter Parallel 3D Cholesky-QR2
Abstract Novelty: new distributed-memory QR factorization algorithm extends recent Cholesky-QR2 algorithm to rectangular matrices exploits a tunable processor grid to provably reduce communication utilizes first distributed-memory implementation of recursive 3D Cholesky factorization Benefits practical, flexible, and achieves minimal communication (data movement between processors) Drawbacks matrix must be su ffi ciently well-conditioned machine must have su ffi cient memory Edward Hutter Parallel 3D Cholesky-QR2
Motivation for reducing parallel algorithmic costs Communication cost model: ↵ - cost of sending a message over a network � - cost of injecting a byte of data into a network � - cost of a floating point operation Current cost trend for distributed-memory machines: ↵ � � � � Goal: A QR factorization algorithm that prioritizes minimizing synchronization and communication cost Edward Hutter Parallel 3D Cholesky-QR2
2D and 3D QR algorithm characteristics 2D algorithms are communication-optimal assuming minimal memory footprint 3D algorithms take advantage of extra memory to reduce communication exist in theory but have not been implemented or studied in practice. 1 2 1A. Tiskin 2007, ”Communication-e ffi cient generic pairwise elimination” 2E. Solomonik et al., ”A communication-avoiding parallel algorithm for the symmetric eigenvalue problem” Edward Hutter Parallel 3D Cholesky-QR2
Parallel QR factorization cost comparison Scalapack PGEQRF method is a 2D Householder QR CA-QR utilizes TSQR along panels to reduce synchronization Communication-avoiding CholeskyQR2 is a tunable Cholesky-based QR factorization algorithm We will compare both the theoretical cost and performance ✓ n log P · ↵ + mn ◆ T 2D Householder QR = O · � p P ✓ p ◆ P log 2 P · ↵ + mn T CAQR-HR = O · � p P ◆ 2 ◆ 2 ✓ Pn ! ✓ n 2 m 3 3 T CA-CholeskyQR2 = O log P · ↵ + · � m P Edward Hutter Parallel 3D Cholesky-QR2
Cholesky-QR algorithm [ Q , R ] CholeskyQR ( A ) B A T A R Cholesky( B ) Q AR − 1 A T A = R T Q T QR = R T R if Q is orthogonal Both Cholesky and triangular solve are backwards stable, yet sensitive to conditioning of B Cholesky can break down due to numerical error, losing positive-definiteness CholeskyQR is not stable, deviation from orthogonality of computed Q is O (  ( A ) 2 · ✏ ), where ✏ is machine epsilon Edward Hutter Parallel 3D Cholesky-QR2
Cholesky-QR2 algorithm [ Q , R ] CholeskyQR2 ( A ) Q 1 , R 1 CholeskyQR ( A ) Q , R 2 CholeskyQR ( Q 1 ) R R 2 R 1 By performing CholeskyQR 2x, the residual and deviation ⇣ ⌘ 1 3 from orthogonality are O ( ✏ ) if  ( A ) = O √ ✏ Proposed as a replacement for TSQR for tall-and-skinny matrices Lower theoretical communication cost by O (log P ), better performance, and simpler implementation 4 TSQR is unconditionably stable 3 Y. Yamamoto et al., ”Roundo ff Error Analysis of the CholeskyQR2 ..” 4 Y. Yamamoto et al., ”CholeskyQR2: A communication-avoiding algorithm” Edward Hutter Parallel 3D Cholesky-QR2
1D CholeskyQR2 Edward Hutter Parallel 3D Cholesky-QR2
1D CholeskyQR2 n 2 m ! ! log P · α + n 2 · β + + n 3 T CholeskyQR2 1D ( m , n , P ) = O · γ P Edward Hutter Parallel 3D Cholesky-QR2
Figure: 3D algorithm for square matrix multiplication 1 2 3 1Bersten 1989, ”Communication-e ffi cient matrix multiplication on hypercubes” 2Aggarwal, Chandra, Snir 1990, ”Communication complexity of PRAMs” 3Agarwal et al. 1995, ”A three-dimensional approach to parallel matrix multiplication” Edward Hutter Parallel 3D Cholesky-QR2
Figure: 3D algorithm for square matrix multiplication 1 2 3 log P · ↵ + n 2 · � + n 3 ✓ ◆ T 3D MM ( n , P ) = O P · � 2 P 3 1Bersten 1989, ”Communication-e ffi cient matrix multiplication on hypercubes” 2Aggarwal, Chandra, Snir 1990, ”Communication complexity of PRAMs” 3Agarwal et al. 1995, ”A three-dimensional approach to parallel matrix multiplication” Edward Hutter Parallel 3D Cholesky-QR2
3D recursive CholeskyInverse [ L , L − 1 ] CholeskyInverse ( A ) h i L − 1 L 11 ← CholeskyInverse ( A 11 ) 11 L 21 ← A 21 L − T 11 h L − 1 i ← CholeskyInverse ( A 22 − L 21 L T 21 ) L 22 22 L − 1 ← − L − 1 22 L 21 L − 1 21 11 T CholeskyInverse3D ( n , P ) = 2 T ↵ − � ⇣ n ⌘ 2 , P + O (3D Matrix Multiplication) Edward Hutter Parallel 3D Cholesky-QR2
3D recursive CholeskyInverse [ L , L − 1 ] CholeskyInverse ( A ) h i L − 1 L 11 ← CholeskyInverse ( A 11 ) 11 L 21 ← A 21 L − T 11 h L − 1 i ← CholeskyInverse ( A 22 − L 21 L T 21 ) L 22 22 L − 1 ← − L − 1 22 L 21 L − 1 21 11 T CholeskyInverse3D ( n , P ) = 2 T ↵ − � ⇣ n ⌘ 2 , P + O (3D Matrix Multiplication) 3 log P · ↵ + n 2 · � + n 3 ✓ ◆ 2 T CholeskyInverse3D ( n , P ) = O P P · � 2 P 3 Edward Hutter Parallel 3D Cholesky-QR2
Triangular solve with multiple right hand sides (TRSM) TRSM3D is an option to reduce computation cost by 2 at the highest levels of recursion Skip two matrix multiplications by not obtaining L − 1 21 Utilize diagonally inverted blocks and iterate along column panels Two 3D Matrix Multiplications per iteration to solve for panel and update trailing matrix Edward Hutter Parallel 3D Cholesky-QR2
Figure: A T A over a tunable c ⇥ d ⇥ c processor grid Edward Hutter Parallel 3D Cholesky-QR2
Figure: Start with a tunable c ⇥ d ⇥ c processor grid Edward Hutter Parallel 3D Cholesky-QR2
Figure: Broadcast columns of A Cost: 2 log 2 c · ↵ + 2 mn dc · � Edward Hutter Parallel 3D Cholesky-QR2
Figure: Reduce contiguous groups of size c Cost: 2 log 2 c · ↵ + 2 n 2 c 2 · � + n 2 c 2 · � Edward Hutter Parallel 3D Cholesky-QR2
Figure: Allreduce alternating groups of size d c c · ↵ + 2 n 2 c 2 · � + n 2 d Cost: 2 log 2 c 2 · � Edward Hutter Parallel 3D Cholesky-QR2
Figure: Broadcast missing pieces of B along depth Cost: 2 log 2 c · ↵ + 2 n 2 c 2 · � Edward Hutter Parallel 3D Cholesky-QR2
Figure: d c simultaneous 3D CholeskyInverse on cubes of dimension c ⇣ c 2 log c 3 · ↵ + n 2 ⌘ c 2 · � + n 3 Cost: O c 3 · � Edward Hutter Parallel 3D Cholesky-QR2
Figure: d c simultaneous 3D matrix multiplication or TRSM on cubes of dimension c ⇣ ⌘ Cost: O (2 log 2 c 3 · ↵ + dc + n 2 + nc · � + n 2 m 4 mn c 2 d · � ) c 2 Edward Hutter Parallel 3D Cholesky-QR2
Figure: Tunable Cholesky-QR2 Algorithm Edward Hutter Parallel 3D Cholesky-QR2
Cost comparison By setting m d = n c and enforcing P = c 2 d , we are able to solve for the optimal grid and minimize communication 1D CholeskyQR2 2D Householder QR 2D CAQR-HR optimal ⇣ ⌘ 2 ! √ 3 log P Pn # of messages O (log P ) n log P P log P O m ◆ 2 ✓ ! n 2 m 3 ⇣ n 2 ⌘ O ( mn O ( mn # of words P ) P ) O O √ √ P ✓ n 2 m ◆ O ( mn 2 O ( mn 2 ✓ n 2 m ◆ + n 3 # of flops O ) ) O P P P P ◆ 2 ✓ ! n 2 m 3 ⇣ + n 2 ⌘ mn O ( mn O ( mn Memory footprint P ) P ) O O P P Edward Hutter Parallel 3D Cholesky-QR2
Implementation and performance testing All code written from scratch in modern C++ and MPI This is a first implementation TRSM not used in results No overlap in computation and communication No topology-aware mapping No threading except for multi-threaded BLAS For each processor count and matrix size: Scalapack PGEQRF was tuned over block sizes and processor grid dimensions CA-CholeskyQR2 was tuned over range of 3D grid dimensions Best performance numbers were chosen to compare Edward Hutter Parallel 3D Cholesky-QR2
Architecture details Cray XC40 system at Argonne Leadership Computing Facility Compute nodes: Intel Knights Landing Each node is a single Xeon Phi chip with 64 cores, 16 GB MCDRAM, 192GB DDR4 Each core supports 4 hardware threads, up to 256 threads per node Interconnect topology: Cray Aries dual-place Dragonfly with 10 groups Edward Hutter Parallel 3D Cholesky-QR2
Architecture details Cray XC40 system at Argonne Leadership Computing Facility Compute nodes: Intel Knights Landing Each node is a single Xeon Phi chip with 64 cores, 16 GB MCDRAM, 192GB DDR4 Each core supports 4 hardware threads, up to 256 threads per node Interconnect topology: Cray Aries dual-place Dragonfly with 10 groups Our configuration: 16 MPI processes per node, 4 OpenMP threads per process, 2 hyperthreads per core MKL 2018 libraries used for BLAS and LAPACK MKL 2018 SCALAPACK library used as comparison for benchmarking ( d , c ) pairs for c ⇥ d ⇥ c processor grid given for each data point Edward Hutter Parallel 3D Cholesky-QR2
Weak scaling on Theta (XC40) 256 Cholesky-QR2, m=#nodes*256, n=128 ScaLAPACK QR, m=#nodes*256, n=128 128 (4096, 2) Gigaflops/s 64 32 (2048, 2) (1024, 2) 16 (512, 2) 8 128 256 512 1024 #nodes (16 processes/node) Edward Hutter Parallel 3D Cholesky-QR2
Recommend
More recommend