HiCMA: Hierarchical Computations on Manycore Architectures
Hatem Ltaief
Extreme Computing Research Center
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
NVIDIA GTC - San Jose, April 5th, 2016
Outline
◮ Motivations
◮ QR-based Dynamically Weighted Halley for SVD
◮ Level 3 BLAS
◮ H-Matrices
◮ HiCMA in a nutshell
Students/Collaborators/Support
Academic:
◮ Extreme Computing Research Center @ KAUST
- W. Boukaram, A. Charara, G. Chávez, D. Keyes, D. Sukkari and G. Turkiyyah
◮ Tokyo Institute of Technology
- R. Yokota
◮ Institut Polytechnique de Bordeaux - INRIA Bordeaux
- M. Faverge
◮ Innovative Computing Laboratory - UTK
Industry:
Hardware/Software Trends
◮ Flops are free
◮ On/off-chip network bandwidth is limited
◮ Increasing gap between flops and bandwidth
◮ Data movement is the most energy-consuming operation
◮ Synchronization-reducing
◮ Communication-reducing
Going hierarchical all the way down the software stack
◮ Recursive formulation (increases data locality)
◮ Old concept!
◮ Tree structure (depth-first vs breadth-first tree traversal)
◮ Reduce vertical/horizontal data motion
◮ Without compromising concurrency
◮ Trade-off between data reuse and parallelism
Standard SVD solver
One-stage reduction:
Figure: Computational stages of the standard SVD algorithm: (a) bidiagonal reduction, (b) bidiagonal SVD solver and (c) back transformation.
Two-stage SVD solver
Two-stage reduction:
Figure: Reduction of a general dense matrix to bidiagonal form using a two-stage approach.
QDWH-SVD[1,2] solver
Three computational stages:
◮ Polar decomposition A = Up H: iterative procedure using the matrix-inversion-free formulation based on QR/Cholesky factorization
◮ Symmetric eigensolver H = V Σ V⊤ to calculate the singular values and the right singular vectors
◮ Matrix-matrix multiplication U = Up V to calculate the left singular vectors
[1] Y. Nakatsukasa and N. J. Higham, Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD, SISC, 35 (2013), pp. A1325-A1349.
[2] D. Sukkari, H. Ltaief and D. Keyes, A High Performance QDWH-SVD Solver Using Hardware Accelerators, submitted to Trans. on Math. Soft., 2015.
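The three stages above can be sketched in a few lines of NumPy. This is a minimal illustration, not the HiCMA/MAGMA implementation: the `qdwh_svd` name is hypothetical, and the polar factor is computed here with a plain (un-weighted) Halley iteration rather than the QR-based dynamically weighted variant the talk describes.

```python
import numpy as np

def qdwh_svd(A, iters=50):
    """Three-stage SVD via the polar decomposition (illustrative sketch:
    un-weighted Halley iteration, not the QR-based weighted QDWH)."""
    n = A.shape[0]
    X = A / np.linalg.norm(A, 2)          # scale so all singular values <= 1
    for _ in range(iters):                # Halley: X -> X(3I + X^T X)(I + 3 X^T X)^-1
        G = X.T @ X
        X = X @ (3 * np.eye(n) + G) @ np.linalg.inv(np.eye(n) + 3 * G)
    Up = X                                # stage 1: orthogonal polar factor, A = Up H
    H = (Up.T @ A + A.T @ Up) / 2         # symmetric positive semidefinite factor
    s, V = np.linalg.eigh(H)              # stage 2: symmetric eigensolver H = V Sigma V^T
    U = Up @ V                            # stage 3: GEMM for the left singular vectors
    return U, s, V
```

The eigenvalues of H are the singular values of A (in ascending order from `eigh`), so A = U Σ V⊤ holds on return.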
Divide-and-Conquer
Figure: The recursive QDWH-SVD algorithm. The matrix Ai,j corresponds to the submatrix indexed j at the ith level of recursion.
Performance results
Figure: Performance comparisons of MAGMA-QDWH-SVD (GPU) against Intel MKL (CPU), log time (s) vs matrix size. (a) Ill-conditioned matrix: up to 2.3x speedup over MKL-DGESVD/MKL-DGESDD. (b) Well-conditioned matrix: up to 3.5x speedup over MKL-DGESVD.
Performance results
Figure: Performance comparisons against existing MAGMA SVD solvers (GPU), log time (s) vs matrix size. (a) Ill-conditioned matrix: up to 18% faster than MAGMA-DGESVD/MAGMA-DGESDD. (b) Well-conditioned matrix: up to 2.1x speedup over MAGMA-DGESVD.
Recursive formulation
◮ Usually used for Level 2 BLAS algorithms (e.g., panel factorization)
◮ Increases data locality
◮ Runs at cache-level speed
◮ Again, not new, and the literature is quite rich
◮ And it pays off for Level 3 BLAS too!
Triangular matrix-matrix multiplication (TRMM)
Figure: Illustrating a Right-Lower-NonTranspose-NonUnit recursive TRMM, splitting B = [Bl | Br] along the vertical direction (N = N1 + N2). The three operations (1-RecTRMM, 2-GEMM, 3-RecTRMM) are performed according to their numbering.
Triangular matrix-matrix multiplication
Figure: A hypothetical tree (TRMM leaves, GEMM inner nodes) representing the operations executed by the recursive algorithm. Operations are executed by traversing the tree in depth-first order.
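The recursion above can be sketched in NumPy. This is an illustrative assumption of how the splitting works (the `rec_trmm` name and the base-case block size are made up, not the KBLAS code): B is split by columns, the two triangular halves recurse, and the off-diagonal block contributes a GEMM.

```python
import numpy as np

def rec_trmm(B, A, nb=16):
    """In-place B := B @ tril(A): recursive Right-Lower-NonTranspose TRMM.
    A is n x n (only its lower triangle is referenced), B is m x n."""
    n = A.shape[0]
    if n <= nb:                        # base case: direct small TRMM
        B[:] = B @ np.tril(A)
        return
    n1 = n // 2                        # split along the vertical direction
    A11, A21, A22 = A[:n1, :n1], A[n1:, :n1], A[n1:, n1:]
    rec_trmm(B[:, :n1], A11, nb)       # 1-RecTRMM: Bl := Bl * A11
    B[:, :n1] += B[:, n1:] @ A21       # 2-GEMM:    Bl += Br * A21
    rec_trmm(B[:, n1:], A22, nb)       # 3-RecTRMM: Br := Br * A22
```

The ordering matters: the GEMM must read the still-unmodified right half Br before the second recursive call overwrites it, which is exactly the numbering shown in the figure.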
Performance results on NVIDIA GPUs
Figure: Performance (GFlop/s) vs matrix dimension (x1024) of KBLAS DTRMM (in-place) against in-place and out-of-place cuBLAS DTRMM, with DGEMM and the theoretical peak as references (integration into CUDA 8.0).
Performance results on higher DLA computations using GPUs
Figure: Performance speedup (GFlop/s) vs matrix dimension (x1024) of matrix inversion in the MAGMA library (DPOTRI) using KBLAS DTRMM vs using cuBLAS DTRMM, with DGEMM and the theoretical peak as references.
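For context on where TRMM shows up in the benchmark above, here is a minimal sketch of the standard Cholesky-based SPD inversion path (the `spd_inverse` name is hypothetical; the final triangular product is the TRMM-like step that DPOTRI spends much of its time in):

```python
import numpy as np

def spd_inverse(A):
    """Inverse of a symmetric positive definite matrix via Cholesky:
    A = L L^T  =>  A^{-1} = L^{-T} L^{-1} (the DPOTRF + DPOTRI path)."""
    n = A.shape[0]
    L = np.linalg.cholesky(A)                   # DPOTRF: A = L L^T
    Linv = np.linalg.solve(L, np.eye(n))        # triangular inverse (DTRTRI)
    return Linv.T @ Linv                        # TRMM-like triangular product
```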
Low Rank Approximation using H-Matrices
◮ Introduced by E. Tyrtyshnikov and revisited later by W. Hackbusch [1,2]
◮ Low-rank blocks R = U X V⊤
[1] W. Hackbusch, A Sparse Matrix Arithmetic based on H-Matrices (Part I), Computing, 62(2), pp. 89-108, 1999.
[2] W. Hackbusch and B. Khoromskij, A Sparse H-Matrix Arithmetic (Part II): Application to multi-dimensional problems, Computing, 64(1), pp. 21-47, 2000.
Nice Properties of H-Matrices
◮ Memory footprint saving: from O(n²) to O(k n log(n))
◮ (Log-)linear arithmetic complexity:
  MVM: from O(n²) to O(k n log(n))
  MMM and A⁻¹: from O(n³) to O(k² n log²(n))
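These savings come from replacing each admissible block by a low-rank factorization R ≈ U V⊤ (2nk entries instead of n²). A minimal sketch of compressing one such block with a truncated SVD (the `compress_block` helper and the kernel choice are illustrative assumptions, not a specific library API):

```python
import numpy as np

def compress_block(R, tol=1e-8):
    """Truncated-SVD compression of one admissible block: R ~ U @ V.T."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))      # numerical rank at tolerance tol
    return U[:, :k] * s[:k], Vt[:k].T    # two n x k factors instead of n x n

# A smooth kernel evaluated on well-separated clusters is numerically low-rank.
x = np.linspace(0.0, 1.0, 256)
y = np.linspace(4.0, 5.0, 256)
R = 1.0 / (1.0 + np.abs(x[:, None] - y[None, :]))
U, V = compress_block(R)                 # U.size + V.size << R.size
```

With the block in factored form, a matrix-vector product costs 2nk flops (V⊤x, then U times the result) instead of n², which is where the O(k n log n) MVM complexity comes from.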
Examples of H-Matrices
Figure: Example of H-matrix approximation for BEM (the numbers in the original figure are the ranks of the individual blocks).
Examples of H-Matrices
Figure: Example of H-matrix approximation for a covariance matrix (the numbers in the original figure are the ranks of the individual blocks).
Tree Structure
H-MVM
Implementation Details
Dense MVM:
◮ Calculate the products V⊤x for the leaves in the tree (batch MVM operation)
Upsweep:
◮ Sweep up the column basis tree, calculating the products of the inner nodes from the products of their children (block SpMV, BSR)
Mult:
◮ Also a block SpMV (BSR) per level of the tree
Downsweep:
◮ Transpose operation of the upsweep phase (block SpMV, BSR)
Pipelining:
◮ Overlapping computations is possible within the Dense MVM / Upsweep / Mult phases; Downsweep, however, requires a sync point!
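A much-simplified sketch of the idea (one level of blocking and no nested basis tree, so the upsweep/downsweep phases collapse into a single project-and-expand step; the `blr_mvm` name and data layout are illustrative assumptions):

```python
import numpy as np

def blr_mvm(D, lowrank, x, nb):
    """y = A @ x for a one-level block low-rank matrix.
    D[i] are the dense diagonal blocks (size nb x nb);
    lowrank[(i, j)] = (U, V) approximates the off-diagonal block A_ij ~ U @ V.T."""
    y = np.zeros(len(D) * nb)
    for i, Di in enumerate(D):                 # dense MVM on the leaves
        y[i*nb:(i+1)*nb] = Di @ x[i*nb:(i+1)*nb]
    for (i, j), (U, V) in lowrank.items():     # low-rank blocks:
        z = V.T @ x[j*nb:(j+1)*nb]             #   project down to rank k
        y[i*nb:(i+1)*nb] += U @ z              #   expand back up
    return y
```

Each low-rank block costs 2·nb·k flops instead of nb², and in the real H-MVM the per-level products are batched (block SpMV in BSR format) rather than looped over as here.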
Performance Results
Figure: Performance (GB/s) of H-MVM using k = 8 and n_min = 32.
Advanced Hierarchical Matrix Operations
Context:
◮ Very small sizes!
◮ Batch operation executions at each level of the tree
◮ (Usually) fixed sizes
◮ Recursive formulation, stressing register usage
◮ State-of-the-art implementations not well optimized for this scope, or not supported
◮ NVIDIA K40 GPU (single GPU)
Advanced Hierarchical Matrix Operations
H-Matrix compression:
◮ Batch QR factorizations (square and tall-and-skinny)
◮ Batch SVD
H-Matrix computations:
◮ Level 3 BLAS: SYRK, TRSM
◮ Factorizations: POTRF
◮ Solves: POTRS, POSV
Performance Results (preliminary)
Figure: Batch QR (square matrices) performance (GFlop/s, log2 scale) vs matrix size, comparing KBLAS, cuBLAS and MAGMA at batch sizes of 1,000 and 10,000.
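The pattern being benchmarked above is "many independent factorizations of very small matrices". A minimal NumPy sketch of the semantics (the `batch_qr` name is hypothetical; the sequential loop stands in for the concurrent thread blocks a GPU batch kernel would launch):

```python
import numpy as np

def batch_qr(As):
    """Reduced QR of every small matrix in a batch As of shape (batch, m, n),
    m >= n. On a GPU each factorization would run in its own thread block."""
    batch, m, n = As.shape
    Qs = np.empty_like(As)                 # (batch, m, n) orthonormal factors
    Rs = np.empty((batch, n, n))           # (batch, n, n) triangular factors
    for b in range(batch):                 # sequential stand-in for the batch kernel
        Qs[b], Rs[b] = np.linalg.qr(As[b])
    return Qs, Rs
```

At these sizes (8 to 256) the per-matrix work is tiny, which is why batching across thousands of problems is the only way to keep the GPU busy.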