HiCMA: Hierarchical Computations on Manycore Architectures

Hatem Ltaief, Extreme Computing Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. NVIDIA GTC, San Jose, April 5th, 2016.


SLIDE 1

HiCMA: Hierarchical Computations on Manycore Architectures

Hatem Ltaief Extreme Computing Research Center King Abdullah University of Science and Technology Thuwal, Saudi Arabia NVIDIA GTC - San Jose April 5th, 2016

SLIDE 2

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 3

Students/Collaborators/Support

Academic:

◮ Extreme Computing Research Center @ KAUST

  • W. Boukaram, A. Charara, G. Chávez, D. Keyes, D. Sukkari and G. Turkiyyah

◮ Tokyo Institute of Technology

  • R. Yokota

◮ Institut Polytechnique de Bordeaux - INRIA Bordeaux

  • M. Faverge

◮ Innovative Computing Laboratory - UTK

Industry:

SLIDE 4

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 5

Hardware/Software Trends

◮ Flops are free
◮ On/off-chip network bandwidth is limited
◮ Increasing gap between flops and bandwidth
◮ Data movement is the most energy-consuming operation
◮ Synchronization-reducing algorithms
◮ Communication-reducing algorithms

SLIDE 6

Going hierarchical all the way down the software stack

◮ Recursive formulation (increase data locality)
◮ Old concept!
◮ Tree structure (depth-first vs. breadth-first tree traversal)
◮ Reduce vertical/horizontal data motion
◮ Without compromising concurrency
◮ Trade-off between data reuse and parallelism

SLIDE 7

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 8

Standard SVD solver

One-stage reduction:

Figure: Computational stages of the standard SVD algorithm: (a) bidiagonal reduction, (b) bidiagonal SVD solver and (c) back transformation.

SLIDE 9

Two-stage SVD solver

Two-stage reduction:

Figure: Reduction of a general dense matrix to bidiagonal form using a two-stage approach.

SLIDE 10

QDWH-SVD[1,2] solver

Three computational stages:

◮ Polar decomposition A = UpH: iterative procedure using the matrix-inversion-free formulation based on QR/Cholesky factorizations

◮ Symmetric eigensolver H = VΣV⊤ to calculate the singular values and the right singular vectors

◮ Matrix-matrix multiplication U = UpV to calculate the left singular vectors

[1] Y. Nakatsukasa and N. J. Higham, Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD, SISC, 35 (2013), pp. A1325-A1349.

[2] D. Sukkari, H. Ltaief and D. Keyes, A High Performance QDWH-SVD Solver Using Hardware Accelerators, submitted to Trans. on Math. Soft., 2015.
SLIDE 11

Divide-and-Conquer

Figure: The recursive QDWH-SVD algorithm. The matrix Ai,j corresponds to the submatrix indexed j at the ith level of recursion.

SLIDE 12

Performance results

[Plot: log time (s) vs. matrix size; curves: MKL-DGESVD, MKL-DGESDD, MAGMA-QDWH-SVD; annotated speedup: 2.3x.]

(a) Ill-conditioned matrix.

[Plot: log time (s) vs. matrix size; curves: MKL-DGESVD, MAGMA-QDWH-SVD; annotated speedup: 3.5x.]

(b) Well-conditioned matrix.

Figure: Performance comparisons of MAGMA-QDWH-SVD (GPU) against Intel MKL (CPU).

SLIDE 13

Performance results

[Plot: log time (s) vs. matrix size; curves: MAGMA-DGESVD, MAGMA-DGESDD, MAGMA-QDWH-SVD; annotated gain: 18%.]

(a) Ill-conditioned matrix.

[Plot: log time (s) vs. matrix size; curves: MAGMA-DGESVD, MAGMA-QDWH-SVD; annotated speedup: 2.1x.]

(b) Well-conditioned matrix.

Figure: Performance comparisons against existing MAGMA SVD solvers (GPU).

SLIDE 14

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 15

Recursive formulation

◮ Usually used for Level 2 BLAS algorithms (e.g., panel factorization)

◮ Increases data locality
◮ Runs at cache-level speed
◮ Again, not new, and the literature is quite rich
◮ And it does pay off for Level 3 BLAS too!

SLIDE 16

Triangular matrix-matrix multiplication (TRMM)

Figure: Illustrating a Right-Lower-NonTranspose-NonUnit recursive TRMM, splitting along the vertical direction (B into Bl and Br; A into All, Au and Ar). The three sub-operations (1-RecTRMM, 2-GEMM, 3-RecTRMM) are performed according to their numbering.
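A minimal sketch of this recursive splitting in NumPy, assuming a Right-Lower-NonTranspose case (B := B A, with A lower triangular). The names rec_trmm, All, Au, Ar follow the figure labels but are illustrative, not the KBLAS API:

```python
import numpy as np

def rec_trmm(B, A, nb=8):
    """Recursive Right-Lower-NonTranspose TRMM: B <- B @ A in place, with A
    lower triangular and B split as [Bl | Br] along the vertical direction.
    Blocks of size <= nb use a plain multiply where a tuned kernel would run."""
    n = A.shape[0]
    if n <= nb:
        B[:] = B @ np.tril(A)                 # base-case TRMM kernel
        return
    n1 = n // 2
    All, Au, Ar = A[:n1, :n1], A[n1:, :n1], A[n1:, n1:]
    Bl, Br = B[:, :n1], B[:, n1:]
    rec_trmm(Bl, All, nb)                     # 1-RecTRMM: Bl <- Bl @ All
    Bl += Br @ Au                             # 2-GEMM: off-diagonal update
    rec_trmm(Br, Ar, nb)                      # 3-RecTRMM: Br <- Br @ Ar
```

The ordering matters: the GEMM must read the still-unmodified Br before the second recursive call overwrites it, which is exactly the dependency the numbered operations in the figure encode.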

SLIDE 17

Triangular matrix-matrix multiplication

[Tree: 7 GEMM inner nodes and 8 TRMM leaves.]

Figure: A hypothetical tree representing the operations executed by the recursive algorithm. Operations are to be executed by traversing the tree in depth-first order.

SLIDE 18

Performance results on NVIDIA GPUs

[Plot: performance (GFlop/s) vs. matrix dimension (x1024), 1 to 14; curves: Theo-Peak, DGEMM, cuBLAS_DTRMM (OOP), KBLAS_DTRMM (IP), cuBLAS_DTRMM (IP).]

Figure: Performance comparisons of KBLAS DTRMM against that of IP and OOP cuBLAS DTRMM (Integration to CUDA 8.0).

SLIDE 19

Performance results on higher DLA computations using GPUs

[Plot: performance (GFlop/s) vs. matrix dimension (x1024), 1 to 14; curves: Theo-Peak, DGEMM, DPOTRI + KBLAS_TRMM, DPOTRI + cuBLAS_TRMM.]

Figure: Performance speedup of matrix inversion in MAGMA library (DPOTRI) using KBLAS DTRMM vs using cuBLAS DTRMM.

SLIDE 20

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 21

Low Rank Approximation using H-Matrices

◮ Introduced by E. Tyrtyshnikov and revisited later by W. Hackbusch [1,2]

◮ Low-rank block representation: R = U X V⊤

[1] W. Hackbusch, A Sparse Matrix Arithmetic based on H-Matrices (Part I), Computing, 62(2), pp. 89-108, 1999.

[2] W. Hackbusch and B. Khoromskij, A Sparse H-Matrix Arithmetic (Part II): Application to multi-dimensional problems, Computing, 64(1), pp. 21-47, 2000.

SLIDE 22

Nice Properties of H-Matrices

◮ Memory footprint saving: from O(n²) to O(k n log(n))

◮ Linear arithmetic complexity:

MVM: from O(n²) to O(k n log(n))
MMM and A⁻¹: from O(n³) to O(k² n log²(n))
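To make the savings concrete, here is a small sketch (illustrative names, not library code) of why a rank-k block R = U X V⊤ cuts both storage and matrix-vector cost for a single admissible block:

```python
import numpy as np

def lowrank_storage(n, k):
    """Entries stored for a rank-k factorization R = U @ X @ V.T of an
    n-by-n block: two n-by-k bases plus a k-by-k coupling matrix,
    versus n*n entries for the dense block."""
    return 2 * n * k + k * k

def lowrank_mvm(U, X, V, x):
    """Apply R = U @ X @ V.T to a vector with three thin products,
    O(n k) flops, never forming the dense n-by-n block."""
    return U @ (X @ (V.T @ x))
```

Summed over the O(log n) levels of the block tree, this per-block saving is what yields the O(k n log(n)) totals quoted above.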

SLIDE 23

Examples of H-Matrices


Figure: Example of H-matrix approximation for BEM.

SLIDE 24

Examples of H-Matrices


Figure: Examples of H-matrix approximation for covariance matrix.

SLIDE 25

Tree Structure

SLIDE 26

H-MVM

SLIDE 27

Implementation Details

Dense MVM:

◮ Calculate the products V⊤x for the leaves of the tree (batch MVM operation)

Upsweep:

◮ Sweep up the column basis tree, calculating the products of the inner nodes from the products of their children (block SpMV, BSR)

Mult:

◮ Also a block SpMV (BSR) per level of the tree

Downsweep:

◮ Transpose operation of the upsweep phase (block SpMV, BSR)

Pipelining:

◮ Overlapping computations possible within the Dense MVM / Upsweep / Mult phases. Downsweep, however, requires a sync point!
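The full H-MVM traverses nested basis trees; as a simplified illustration of the same flow, here is the analogous matrix-vector product on a one-level block low-rank matrix (flat partition, no nested bases). All names are illustrative, and the project/expand steps stand in for the upsweep/mult/downsweep phases:

```python
import numpy as np

def blr_mvm(D, lowrank_blocks, x, bs):
    """One-level block low-rank MVM sketch.  D[i] are the dense diagonal
    blocks; off-diagonal block (i, j) is stored as a pair (U, V) with
    A_ij ~= U @ V.T.  Phases: dense leaf MVM, then for each low-rank
    block a projection z = V.T @ x_j followed by an expansion U @ z."""
    n = len(D) * bs
    y = np.zeros(n)
    for i, Di in enumerate(D):                      # dense MVM phase
        y[i * bs:(i + 1) * bs] = Di @ x[i * bs:(i + 1) * bs]
    for (i, j), (U, V) in lowrank_blocks.items():   # low-rank blocks
        z = V.T @ x[j * bs:(j + 1) * bs]            # project, O(k*bs)
        y[i * bs:(i + 1) * bs] += U @ z             # expand, O(k*bs)
    return y
```

The dense loop and the per-block projections are independent, which mirrors why the first three phases of the real algorithm can be pipelined while the final accumulation needs a synchronization point.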

SLIDE 28

Performance Results

Figure: Performance (GB/s) of H-MVM using k = 8 and n_min = 32.

SLIDE 29

Advanced Hierarchical Matrix Operations

Context:

◮ Very small sizes!
◮ Batch operation executions at each level of the tree
◮ (usually) Fixed sizes
◮ Recursive formulation, stressing register usage
◮ State-of-the-art implementations not well optimized for this scope, or not supported
◮ NVIDIA K40 GPU (single GPU)

SLIDE 30

Advanced Hierarchical Matrix Operations

H-Matrix compression:

◮ Batch QR factorizations (square and tall-and-skinny)
◮ Batch SVD

H-Matrix computations:

◮ Level 3 BLAS: SYRK, TRSM
◮ Factorizations: POTRF
◮ Solves: POTRS, POSV

SLIDE 31

Performance Results (preliminary)

Batch QR (square matrices):

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 256; curves: KBLAS_10000, KBLAS_1000, CUBLAS_10000, CUBLAS_1000, MAGMA_10000, MAGMA_1000.]

SLIDE 32

Performance Results (preliminary)

SLIDE 33

Performance Results (preliminary)

SLIDE 34

Performance Results (preliminary)

DSYRK_Batch:

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 512; curves: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]

SLIDE 35

Performance Results (preliminary)

DTRSM_Batch:

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 512; curves: KBLAS_10240, KBLAS_1024, MAGMA_IP_10240, MAGMA_IP_1024, CUBLAS_10240, CUBLAS_1024.]

SLIDE 36

Performance Results (preliminary)

DPOTRF_Batch:

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 256; curves: KBLAS_1024, KBLAS_10240, MAGMA_1024, MAGMA_10240.]

SLIDE 37

Performance Results (preliminary)

DPOTRS_Batch:

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 512; curves: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]

SLIDE 38

Performance Results (preliminary)

DPOSV_Batch:

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 512; curves: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]

SLIDE 39

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 40

HiCMA’s Scope

The Hierarchical Computations on Manycore Architectures library aims to:

◮ Develop high performance numerical solvers: Dense / Data-Sparse (H)
◮ Increase data reuse thanks to a recursive/hierarchical formulation
◮ Exploit high levels of concurrency
◮ Perform asynchronous execution
◮ Target various architectures: shared/distributed-memory, accelerators/co-processors, ARM

SLIDE 41

HiCMA Software Stack

SLIDE 42

HiCMA’s Backbone

SLIDE 43

HiCMA’s Horsepower

SLIDE 44

HiCMA’s MoC