SLIDE 1

Communication-avoiding Cholesky-QR2 for rectangular matrices

Edward Hutter and Edgar Solomonik

Laboratory for Parallel Numerical Algorithms Department of Computer Science University of Illinois at Urbana-Champaign

March 9, 2018

Edward Hutter Parallel 3D Cholesky-QR2

SLIDE 2

Abstract

Novelty:

  • new distributed-memory QR factorization algorithm
  • extends the recent Cholesky-QR2 algorithm to rectangular matrices
  • exploits a tunable processor grid to provably reduce communication
  • utilizes the first distributed-memory implementation of recursive 3D Cholesky factorization

Benefits

practical, flexible, and achieves minimal communication (data movement between processors)

Drawbacks

  • matrix must be sufficiently well-conditioned
  • machine must have sufficient memory

SLIDE 3

Motivation for reducing parallel algorithmic costs

Communication cost model:

  • α — cost of sending a message over the network
  • β — cost of injecting a byte of data into the network
  • γ — cost of a floating-point operation

Current cost trend for distributed-memory machines: α ≫ β ≫ γ

Goal: a QR factorization algorithm that prioritizes minimizing synchronization and communication cost.
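The model above can be sketched in a few lines of Python. The constants below are hypothetical order-of-magnitude values for illustration only, not measurements from any particular machine:

```python
# Hypothetical alpha-beta-gamma model constants (illustrative only):
# on distributed-memory machines, alpha >> beta >> gamma.
ALPHA = 1e-6   # seconds per message (network latency)
BETA = 1e-9    # seconds per byte injected into the network
GAMMA = 1e-12  # seconds per floating-point operation

def model_time(messages, bytes_moved, flops):
    """Estimated runtime of an algorithm phase under the alpha-beta-gamma model."""
    return messages * ALPHA + bytes_moved * BETA + flops * GAMMA

# Moving 1 MB as a single message is far cheaper than as 1024 separate
# 1 KB messages, which is why minimizing synchronization (the alpha term)
# matters most under this trend.
one_message = model_time(1, 2**20, 0)
many_messages = model_time(1024, 2**20, 0)
```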

SLIDE 4

2D and 3D QR algorithm characteristics

  • 2D algorithms are communication-optimal assuming a minimal memory footprint
  • 3D algorithms take advantage of extra memory to reduce communication; for QR, they exist in theory but have not been implemented or studied in practice¹ ²

  • 1A. Tiskin 2007, ”Communication-efficient generic pairwise elimination”
  • 2E. Solomonik et al., ”A communication-avoiding parallel algorithm for the symmetric eigenvalue problem”

SLIDE 5

Parallel QR factorization cost comparison

ScaLAPACK's PGEQRF routine is a 2D Householder QR. CA-QR utilizes TSQR along panels to reduce synchronization. Communication-avoiding CholeskyQR2 is a tunable Cholesky-based QR factorization algorithm. We will compare both the theoretical cost and the performance:

  T_2D-Householder-QR = O(n log P · α + (mn/√P) · β)
  T_CAQR-HR = O(√P log² P · α + (mn/√P) · β)
  T_CA-CholeskyQR2 = O((Pn/m)^(2/3) log P · α + (n²m/P)^(2/3) · β)

SLIDE 6

Cholesky-QR algorithm

[Q, R] ← CholeskyQR(A)

  B ← AᵀA
  R ← Cholesky(B)
  Q ← AR⁻¹

AᵀA = RᵀQᵀQR = RᵀR if Q is orthogonal.
Both Cholesky and the triangular solve are backward stable, yet sensitive to the conditioning of B.
Cholesky can break down due to numerical error, losing positive-definiteness.
CholeskyQR is not stable: the deviation from orthogonality of the computed Q is O(κ(A)² · ε), where ε is machine epsilon.
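The steps above can be written down serially with NumPy. This is a sequential sketch of the algorithm, not the distributed implementation; a general solve stands in for the triangular solve so the snippet stays dependency-free:

```python
import numpy as np

def cholesky_qr(A):
    """One CholeskyQR pass: B = A^T A, R = Cholesky(B), Q = A R^{-1}.
    np.linalg.cholesky raises LinAlgError if B loses positive-definiteness."""
    B = A.T @ A
    R = np.linalg.cholesky(B).T        # upper-triangular factor, so B = R^T R
    Q = np.linalg.solve(R.T, A.T).T    # solve R^T Q^T = A^T rather than forming R^{-1}
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))    # well-conditioned tall-and-skinny matrix
Q, R = cholesky_qr(A)
# For well-conditioned A, Q^T Q is close to the identity;
# in general the deviation grows as O(kappa(A)^2 * eps).
```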

SLIDE 7

Cholesky-QR2 algorithm

[Q, R] ← CholeskyQR2(A)

  Q₁, R₁ ← CholeskyQR(A)
  Q, R₂ ← CholeskyQR(Q₁)
  R ← R₂R₁

By performing CholeskyQR twice, the residual and the deviation from orthogonality are O(ε) if κ(A) = O(1/√ε).³

Proposed as a replacement for TSQR for tall-and-skinny matrices:

  • lower theoretical communication cost by a factor of O(log P), better performance, and a simpler implementation⁴
  • TSQR, however, is unconditionally stable

  • 3Y. Yamamoto et al., ”Roundoff Error Analysis of the CholeskyQR2 ..”
  • 4Y. Yamamoto et al., ”CholeskyQR2: A communication-avoiding algorithm”
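A serial NumPy sketch of the two-pass scheme shows the effect on a deliberately ill-conditioned test matrix (the 1e5 condition number below is an arbitrary choice, well inside the κ(A) = O(1/√ε) regime):

```python
import numpy as np

def cholesky_qr(A):
    R = np.linalg.cholesky(A.T @ A).T
    return np.linalg.solve(R.T, A.T).T, R

def cholesky_qr2(A):
    # The second pass reorthogonalizes Q1, squaring down the orthogonality error.
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1

# Build a 500 x 20 matrix with condition number 1e5 via its SVD factors.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((500, 20)))
V, _ = np.linalg.qr(rng.standard_normal((20, 20)))
A = U @ np.diag(np.logspace(0, 5, 20)) @ V.T

Q1, _ = cholesky_qr(A)
Q2, R = cholesky_qr2(A)
err1 = np.linalg.norm(Q1.T @ Q1 - np.eye(20))   # O(kappa(A)^2 * eps)
err2 = np.linalg.norm(Q2.T @ Q2 - np.eye(20))   # O(eps)
```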

SLIDE 8

1D CholeskyQR2

SLIDE 9

1D CholeskyQR2

T_1D-CholeskyQR2(m, n, P) = O(log P · α + n² · β + (n²m/P + n³) · γ)

SLIDE 10

Figure: 3D algorithm for square matrix multiplication¹ ² ³

  • ¹Berntsen 1989, "Communication-efficient matrix multiplication on hypercubes"
  • ²Aggarwal, Chandra, Snir 1990, "Communication complexity of PRAMs"
  • ³Agarwal et al. 1995, "A three-dimensional approach to parallel matrix multiplication"

SLIDE 11

Figure: 3D algorithm for square matrix multiplication¹ ² ³

T_3D-MM(n, P) = O(log P · α + (n²/P^(2/3)) · β + (n³/P) · γ)

  • ¹Berntsen 1989, "Communication-efficient matrix multiplication on hypercubes"
  • ²Aggarwal, Chandra, Snir 1990, "Communication complexity of PRAMs"
  • ³Agarwal et al. 1995, "A three-dimensional approach to parallel matrix multiplication"

SLIDE 12

3D recursive CholeskyInverse

[L, L⁻¹] ← CholeskyInverse(A)

  [L₁₁, L₁₁⁻¹] ← CholeskyInverse(A₁₁)
  L₂₁ ← A₂₁L₁₁⁻ᵀ
  [L₂₂, L₂₂⁻¹] ← CholeskyInverse(A₂₂ − L₂₁L₂₁ᵀ)
  L₂₁⁻¹ ← −L₂₂⁻¹L₂₁L₁₁⁻¹

T_3D-CholeskyInverse(n, P) = 2 · T_3D-CholeskyInverse(n/2, P) + O(3D matrix multiplication)

SLIDE 13

3D recursive CholeskyInverse

[L, L⁻¹] ← CholeskyInverse(A)

  [L₁₁, L₁₁⁻¹] ← CholeskyInverse(A₁₁)
  L₂₁ ← A₂₁L₁₁⁻ᵀ
  [L₂₂, L₂₂⁻¹] ← CholeskyInverse(A₂₂ − L₂₁L₂₁ᵀ)
  L₂₁⁻¹ ← −L₂₂⁻¹L₂₁L₁₁⁻¹

T_3D-CholeskyInverse(n, P) = 2 · T_3D-CholeskyInverse(n/2, P) + O(3D matrix multiplication)
T_3D-CholeskyInverse(n, P) = O(P^(2/3) log P · α + (n²/P^(2/3)) · β + (n³/P) · γ)
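The recursion can be written down serially in NumPy. This is a sequential sketch; in the actual algorithm each recursive call and each matrix product runs on a 3D processor grid:

```python
import numpy as np

def cholesky_inverse(A):
    """Recursive blocked Cholesky returning (L, L^{-1}) for symmetric
    positive definite A, following the recursion on the slide."""
    n = A.shape[0]
    if n == 1:
        ell = np.sqrt(A[0, 0])
        return np.array([[ell]]), np.array([[1.0 / ell]])
    k = n // 2
    L11, L11inv = cholesky_inverse(A[:k, :k])
    L21 = A[k:, :k] @ L11inv.T                             # L21 <- A21 L11^{-T}
    L22, L22inv = cholesky_inverse(A[k:, k:] - L21 @ L21.T)
    L21inv = -L22inv @ L21 @ L11inv                        # lower-left block of L^{-1}
    zeros = np.zeros((k, n - k))
    L = np.block([[L11, zeros], [L21, L22]])
    Linv = np.block([[L11inv, zeros], [L21inv, L22inv]])
    return L, Linv

rng = np.random.default_rng(2)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8.0 * np.eye(8)    # symmetric positive definite test matrix
L, Linv = cholesky_inverse(A)
```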

SLIDE 14

Triangular solve with multiple right hand sides (TRSM)

  • TRSM3D is an option to reduce the computation cost by 2× at the highest levels of recursion
  • Skip two matrix multiplications by not obtaining L₂₁⁻¹
  • Utilize diagonally inverted blocks and iterate along column panels
  • Two 3D matrix multiplications per iteration to solve for the panel and update the trailing matrix

SLIDE 15

Figure: AᵀA over a tunable c × d × c processor grid

SLIDE 16

Figure: Start with a tunable c × d × c processor grid

SLIDE 17

Figure: Broadcast columns of A

Cost: 2 log₂ c · α + (2mn/dc) · β

SLIDE 18

Figure: Reduce contiguous groups of size c

Cost: 2 log₂ c · α + (2n²/c²) · β + (n²/c²) · γ

SLIDE 19

Figure: Allreduce alternating groups of size d/c

Cost: 2 log₂(d/c) · α + (2n²/c²) · β + (n²/c²) · γ

SLIDE 20

Figure: Broadcast missing pieces of B along depth

Cost: 2 log₂ c · α + (2n²/c²) · β

SLIDE 21

Figure: d/c simultaneous 3D CholeskyInverse on cubes of dimension c

Cost: O(c² log c³ · α + (n²/c²) · β + (n³/c³) · γ)

SLIDE 22

Figure: d/c simultaneous 3D matrix multiplications or TRSMs on cubes of dimension c

Cost: O(2 log₂ c³ · α + (4mn/dc + (n² + nc)/c²) · β + (n²m/c²d) · γ)

SLIDE 23

Figure: Tunable Cholesky-QR2 Algorithm

SLIDE 24

Cost comparison

By setting m/d = n/c and enforcing P = c²d, we are able to solve for the optimal grid and minimize communication.

Metric           | 1D CholeskyQR2 | 2D Householder QR | 2D CAQR-HR  | Optimal tunable
# of messages    | O(log P)       | O(n log P)        | O(√P log P) | O((Pn/m)^(2/3) log P)
# of words       | O(n²)          | O(mn/√P)          | O(mn/√P)    | O((n²m/P)^(2/3))
# of flops       | O(n²m/P + n³)  | O(mn²/P)          | O(mn²/P)    | O(n²m/P)
Memory footprint | O(mn/P + n²)   | O(mn/P)           | O(mn/P)     | O((n²m/P)^(2/3))

SLIDE 25

Implementation and performance testing

All code was written from scratch in modern C++ and MPI. This is a first implementation:

  • TRSM not used in the results
  • no overlap of computation and communication
  • no topology-aware mapping
  • no threading except for multi-threaded BLAS

For each processor count and matrix size:

  • ScaLAPACK PGEQRF was tuned over block sizes and processor grid dimensions
  • CA-CholeskyQR2 was tuned over a range of 3D grid dimensions

The best performance numbers were chosen for comparison.

SLIDE 26

Architecture details

Cray XC40 system at Argonne Leadership Computing Facility

Compute nodes: Intel Knights Landing

  • each node is a single Xeon Phi chip with 64 cores, 16 GB MCDRAM, 192 GB DDR4
  • each core supports 4 hardware threads, up to 256 threads per node

Interconnect topology: Cray Aries dual-plane Dragonfly with 10 groups

SLIDE 27

Architecture details

Cray XC40 system at Argonne Leadership Computing Facility

Compute nodes: Intel Knights Landing

  • each node is a single Xeon Phi chip with 64 cores, 16 GB MCDRAM, 192 GB DDR4
  • each core supports 4 hardware threads, up to 256 threads per node

Interconnect topology: Cray Aries dual-plane Dragonfly with 10 groups

Our configuration:

  • 16 MPI processes per node, 4 OpenMP threads per process, 2 hyperthreads per core
  • MKL 2018 libraries used for BLAS and LAPACK
  • MKL 2018 ScaLAPACK library used as the comparison for benchmarking
  • (d, c) pairs for the c × d × c processor grid are given for each data point

SLIDE 28

Figure: Weak scaling on Theta (XC40) — Gigaflops/s vs. #nodes (16 processes/node, 128 to 1024 nodes); Cholesky-QR2 (m = #nodes·256, n = 128) vs. ScaLAPACK QR; (d, c) grid dimensions per data point: (512, 2), (1024, 2), (2048, 2), (4096, 2)

SLIDE 29

Figure: Weak scaling on Theta (XC40) — Gigaflops/s vs. #nodes (16 processes/node, 128 to 1024 nodes); Cholesky-QR2 (m = #nodes·128, n = 512) vs. ScaLAPACK QR; (d, c) grid dimensions per data point: (128, 4), (256, 4), (512, 4), (1024, 4)

SLIDE 30

Figure: Weak scaling on Theta (XC40) — Gigaflops/s vs. #nodes (16 processes/node, 128 to 1024 nodes); Cholesky-QR2 (m = #nodes·512, n = 1024) vs. ScaLAPACK QR; (d, c) grid dimensions per data point: (32, 8), (64, 8), (128, 8), (256, 8)

SLIDE 31

Figure: Strong scaling on Theta (XC40) — Gigaflops/s vs. #nodes (16 processes/node, 128 to 1024 nodes); Cholesky-QR2 (m = 16384, n = 256) vs. ScaLAPACK QR; (d, c) grid dimensions per data point: (128, 4), (256, 4), (512, 4), (1024, 4)

SLIDE 32

Performance analysis

CholeskyQR2 outperforms ScaLAPACK on rectangular matrices.
ScaLAPACK outperforms CholeskyQR2 on tall-and-skinny and near-square matrices.
Neither shows strong scaling to 1024 nodes.

Flop cost is a concern:

  • Householder QR performs 2mn² − 2n³/3 flops
  • CholeskyQR2 performs 4mn² + 5n³/3 flops

Communication improvement of O(P^(1/6)).
Perhaps P is not large enough to make a difference?
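The roughly 2× flop gap for tall-and-skinny matrices follows directly from the leading terms above; a quick numerical check (the sizes below are arbitrary illustrative values):

```python
def householder_flops(m, n):
    # Leading-order flop count for Householder QR: 2mn^2 - 2n^3/3
    return 2 * m * n**2 - 2 * n**3 / 3

def choleskyqr2_flops(m, n):
    # Leading-order flop count for CholeskyQR2: 4mn^2 + 5n^3/3
    return 4 * m * n**2 + 5 * n**3 / 3

# For a tall-and-skinny matrix (m >> n) the n^3 terms are negligible,
# so the ratio approaches 4mn^2 / 2mn^2 = 2.
m, n = 1_000_000, 128
ratio = choleskyqr2_flops(m, n) / householder_flops(m, n)
```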

SLIDE 33

Future work

Numerical library integration is our focus. Much work remains:

  • optimize and tune the code; run more large-scale numerical tests to better understand performance
  • evaluate more tunable grid shapes to further analyze patterns in performance
  • develop memory-tunable variants for machines without enough memory
  • reduce the flop cost and improve the stability

MS87 talk at 3:45 on fixing the stability of CholeskyQR2.

Acknowledgement: Argonne Leadership Computing Facility

SLIDE 34

Optimum cost of CholeskyQR2-Tunable

The advantage of using a tunable grid lies in the ability to frame the shape of the grid around the shape of the rectangular m × n matrix A. Optimal communication can be attained by ensuring that the grid perfectly fits the dimensions of A, or that the dimensions of the grid are proportional to the dimensions of the matrix. We derive the cost for the optimal ratio m/d = n/c below.

Using the equation P = c²d together with m/d = n/c, we solve for d and c in terms of m, n, and P. Solving the system of equations yields

  c = (Pn/m)^(1/3),   d = (Pm²/n²)^(1/3).

We can plug these values into the cost of CholeskyQR2-Tunable to find the optimal cost:

  T_CholeskyQR2-Tunable(m, n, (Pn/m)^(1/3), (Pm²/n²)^(1/3))
    = O((Pn/m)^(2/3) log P · α + (n²m/P)^(2/3) · β + (n²m/P) · γ)    (1)

Grid shape: optimal

  # of messages:    O((Pn/m)^(2/3) log P)
  # of words:       O((n²m/P)^(2/3))
  # of flops:       O(n²m/P)
  Memory footprint: O((n²m/P)^(2/3))
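The derived grid dimensions are easy to sanity-check numerically against the two constraints; the P, m, n below are arbitrary powers of two chosen so the cube roots come out near-exact:

```python
# Check that c = (Pn/m)^(1/3) and d = (Pm^2/n^2)^(1/3) satisfy
# both P = c^2 d and the aspect-ratio condition m/d = n/c.
P, m, n = 8192, 2**20, 2**10

c = (P * n / m) ** (1.0 / 3.0)
d = (P * m**2 / n**2) ** (1.0 / 3.0)
# With these values, Pn/m = 8 and Pm^2/n^2 = 2^33, so c = 2 and d = 2048.
```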