HiCMA: Hierarchical Computations on Manycore Architectures

Hatem Ltaief, Extreme Computing Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. NVIDIA GTC, San Jose, April 5th, 2016.


SLIDE 1

HiCMA: Hierarchical Computations on Manycore Architectures

Hatem Ltaief Extreme Computing Research Center King Abdullah University of Science and Technology Thuwal, Saudi Arabia NVIDIA GTC - San Jose April 5th, 2016

SLIDE 2

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 3

Students/Collaborators/Support

Academic:

◮ Extreme Computing Research Center @ KAUST

  • W. Boukaram, A. Charara, G. Chávez, D. Keyes, D. Sukkari and G. Turkiyyah

◮ Tokyo Institute of Technology

  • R. Yokota

◮ Institut Polytechnique de Bordeaux - INRIA Bordeaux

  • M. Faverge

◮ Innovative Computing Laboratory - UTK

Industry:

SLIDE 4

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 5

Hardware/Software Trends

◮ Flops are free
◮ On/off-chip network bandwidth is limited
◮ Increasing gap between flops and bandwidth
◮ Data movement is the most energy-consuming operation
◮ Synchronization-reducing algorithms
◮ Communication-reducing algorithms

SLIDE 6

Going hierarchical all the way down the software stack

◮ Recursive formulation (increase data locality)
◮ Old concept!
◮ Tree structure (depth-first vs. breadth-first tree traversal)
◮ Reduce vertical/horizontal data motion
◮ Without compromising concurrency
◮ Trade-off between data reuse and parallelism

SLIDE 7

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 8

Standard SVD solver

One-stage reduction:

Figure: Computational stages of the standard SVD algorithm: (a) bidiagonal reduction, (b) bidiagonal SVD solver and (c) back transformation.

SLIDE 9

Two-stage SVD solver

Two-stage reduction:

Figure: Reduction of a general dense matrix to bidiagonal form using a two-stage approach.

SLIDE 10

QDWH-SVD[1,2] solver

Three computational stages:

◮ Polar decomposition A = UpH: iterative procedure using the matrix-inversion-free formulation based on QR/Cholesky factorizations

◮ Symmetric eigensolver H = VΣV⊤ to calculate the singular values and the right singular vectors

◮ Matrix-matrix multiplication U = UpV to calculate the left singular vectors

[1] Y. Nakatsukasa and N. J. Higham, Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD, SISC, 35 (2013), pp. A1325-A1349.

[2] D. Sukkari, H. Ltaief and D. Keyes, A High Performance QDWH-SVD Solver Using Hardware Accelerators, submitted to Trans. on Math. Soft., 2015.
SLIDE 11

Divide-and-Conquer

Figure: The recursive QDWH-SVD algorithm. The matrix Ai,j corresponds to the submatrix indexed j at the ith level of recursion.

SLIDE 12

Performance results

[Plot: log time (s) vs. matrix size; curves: MKL-DGESVD, MKL-DGESDD, MAGMA-QDWH-SVD; annotated speedup: 2.3x.]

(a) Ill-conditioned matrix.

[Plot: log time (s) vs. matrix size; curves: MKL-DGESVD, MAGMA-QDWH-SVD; annotated speedup: 3.5x.]

(b) Well-conditioned matrix.

Figure: Performance comparisons of MAGMA-QDWH-SVD (GPU) against Intel MKL (CPU).

SLIDE 13

Performance results

[Plot: log time (s) vs. matrix size; curves: MAGMA-DGESVD, MAGMA-DGESDD, MAGMA-QDWH-SVD; annotated gain: 18%.]

(a) Ill-conditioned matrix.

[Plot: log time (s) vs. matrix size; curves: MAGMA-DGESVD, MAGMA-QDWH-SVD; annotated speedup: 2.1x.]

(b) Well-conditioned matrix.

Figure: Performance comparisons against existing MAGMA SVD solvers (GPU).

SLIDE 14

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 15

Recursive formulation

◮ Usually used for Level 2 BLAS algorithms (e.g., panel factorization)

◮ Increases data locality
◮ Runs at cache-level speed
◮ Again, not new, and the literature is quite rich
◮ And it does pay off for Level 3 BLAS too!

SLIDE 16

Triangular matrix-matrix multiplication (TRMM)

Figure: Illustrating a Right-Lower-NonTranspose-NonUnit recursive TRMM, splitting along the vertical direction (B into Bl and Br; A into All, Au and Ar). The three sub-operations (1-RecTRMM, 2-GEMM, 3-RecTRMM) are performed according to their numbering.
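A minimal sketch of this recursive splitting in NumPy, assuming a Right-Lower-NonTranspose case (B := B A, with A lower triangular). The names rec_trmm, All, Au, Ar follow the figure labels but are illustrative, not the KBLAS API:

```python
import numpy as np

def rec_trmm(B, A, nb=8):
    """Recursive Right-Lower-NonTranspose TRMM: B <- B @ A in place, with A
    lower triangular and B split as [Bl | Br] along the vertical direction.
    Blocks of size <= nb use a plain multiply where a tuned kernel would run."""
    n = A.shape[0]
    if n <= nb:
        B[:] = B @ np.tril(A)                 # base-case TRMM kernel
        return
    n1 = n // 2
    All, Au, Ar = A[:n1, :n1], A[n1:, :n1], A[n1:, n1:]
    Bl, Br = B[:, :n1], B[:, n1:]
    rec_trmm(Bl, All, nb)                     # 1-RecTRMM: Bl <- Bl @ All
    Bl += Br @ Au                             # 2-GEMM: off-diagonal update
    rec_trmm(Br, Ar, nb)                      # 3-RecTRMM: Br <- Br @ Ar
```

The ordering matters: the GEMM must read the still-unmodified Br before the second recursive call overwrites it, which is exactly the dependency the numbered operations in the figure encode.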

SLIDE 17

Triangular matrix-matrix multiplication

[Tree: 7 GEMM inner nodes and 8 TRMM leaves.]

Figure: A hypothetical tree representing the operations executed by the recursive algorithm. Operations are to be executed by traversing the tree in depth-first order.

SLIDE 18

Performance results on NVIDIA GPUs

[Plot: performance (GFlop/s) vs. matrix dimension (x1024), 1 to 14; curves: Theo-Peak, DGEMM, cuBLAS_DTRMM (OOP), KBLAS_DTRMM (IP), cuBLAS_DTRMM (IP).]

Figure: Performance comparisons of KBLAS DTRMM against that of IP and OOP cuBLAS DTRMM (Integration to CUDA 8.0).

SLIDE 19

Performance results on higher DLA computations using GPUs

[Plot: performance (GFlop/s) vs. matrix dimension (x1024), 1 to 14; curves: Theo-Peak, DGEMM, DPOTRI + KBLAS_TRMM, DPOTRI + cuBLAS_TRMM.]

Figure: Performance speedup of matrix inversion in MAGMA library (DPOTRI) using KBLAS DTRMM vs using cuBLAS DTRMM.

SLIDE 20

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 21

Low Rank Approximation using H-Matrices

◮ Introduced by E. Tyrtyshnikov and revisited later by W. Hackbusch [1,2]

◮ Low-rank block representation: R = U X V⊤

[1] W. Hackbusch, A Sparse Matrix Arithmetic based on H-Matrices (Part I), Computing, 62(2), pp. 89-108, 1999.

[2] W. Hackbusch and B. Khoromskij, A Sparse H-Matrix Arithmetic (Part II): Application to multi-dimensional problems, Computing, 64(1), pp. 21-47, 2000.

SLIDE 22

Nice Properties of H-Matrices

◮ Memory footprint saving: from O(n²) to O(k n log(n))

◮ Linear arithmetic complexity:

MVM: from O(n²) to O(k n log(n))
MMM and A⁻¹: from O(n³) to O(k² n log²(n))
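To make the savings concrete, here is a small sketch (illustrative names, not library code) of why a rank-k block R = U X V⊤ cuts both storage and matrix-vector cost for a single admissible block:

```python
import numpy as np

def lowrank_storage(n, k):
    """Entries stored for a rank-k factorization R = U @ X @ V.T of an
    n-by-n block: two n-by-k bases plus a k-by-k coupling matrix,
    versus n*n entries for the dense block."""
    return 2 * n * k + k * k

def lowrank_mvm(U, X, V, x):
    """Apply R = U @ X @ V.T to a vector with three thin products,
    O(n k) flops, never forming the dense n-by-n block."""
    return U @ (X @ (V.T @ x))
```

Summed over the O(log n) levels of the block tree, this per-block saving is what yields the O(k n log(n)) totals quoted above.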

SLIDE 23

Examples of H-Matrices


Figure: Example of H-matrix approximation for BEM.

SLIDE 24

Examples of H-Matrices


Figure: Examples of H-matrix approximation for covariance matrix.

SLIDE 25

Tree Structure

SLIDE 26

H-MVM

SLIDE 27

Implementation Details

Dense MVM:

◮ Calculate the products V⊤x for the leaves of the tree (batch MVM operation)

Upsweep:

◮ Sweep up the column basis tree, calculating the products of the inner nodes from the products of their children (block SpMV, BSR)

Mult:

◮ Also a block SpMV (BSR) per level of the tree

Downsweep:

◮ Transpose operation of the upsweep phase (block SpMV, BSR)

Pipelining:

◮ Overlapping computations possible within the Dense MVM / Upsweep / Mult phases. Downsweep, however, requires a sync point!
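The full H-MVM traverses nested basis trees; as a simplified illustration of the same flow, here is the analogous matrix-vector product on a one-level block low-rank matrix (flat partition, no nested bases). All names are illustrative, and the project/expand steps stand in for the upsweep/mult/downsweep phases:

```python
import numpy as np

def blr_mvm(D, lowrank_blocks, x, bs):
    """One-level block low-rank MVM sketch.  D[i] are the dense diagonal
    blocks; off-diagonal block (i, j) is stored as a pair (U, V) with
    A_ij ~= U @ V.T.  Phases: dense leaf MVM, then for each low-rank
    block a projection z = V.T @ x_j followed by an expansion U @ z."""
    n = len(D) * bs
    y = np.zeros(n)
    for i, Di in enumerate(D):                      # dense MVM phase
        y[i * bs:(i + 1) * bs] = Di @ x[i * bs:(i + 1) * bs]
    for (i, j), (U, V) in lowrank_blocks.items():   # low-rank blocks
        z = V.T @ x[j * bs:(j + 1) * bs]            # project, O(k*bs)
        y[i * bs:(i + 1) * bs] += U @ z             # expand, O(k*bs)
    return y
```

The dense loop and the per-block projections are independent, which mirrors why the first three phases of the real algorithm can be pipelined while the final accumulation needs a synchronization point.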

SLIDE 28

Performance Results

Figure: Performance (GB/s) of H-MVM using k = 8 and n_min = 32.

SLIDE 29

Advanced Hierarchical Matrix Operations

Context:

◮ Very small sizes!
◮ Batch operation executions at each level of the tree
◮ (usually) Fixed sizes
◮ Recursive formulation, stressing register usage
◮ State-of-the-art implementations not well optimized for this scope, or not supported
◮ NVIDIA K40 GPU (single GPU)

SLIDE 30

Advanced Hierarchical Matrix Operations

H-Matrix compression:

◮ Batch QR factorizations (square and tall-and-skinny)
◮ Batch SVD

H-Matrix computations:

◮ Level 3 BLAS: SYRK, TRSM
◮ Factorizations: POTRF
◮ Solves: POTRS, POSV

SLIDE 31

Performance Results (preliminary)

Batch QR (square matrices):

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 256; curves: KBLAS_10000, KBLAS_1000, CUBLAS_10000, CUBLAS_1000, MAGMA_10000, MAGMA_1000.]

SLIDE 32

Performance Results (preliminary)

SLIDE 33

Performance Results (preliminary)

SLIDE 34

Performance Results (preliminary)

DSYRK_Batch:

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 512; curves: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]

SLIDE 35

Performance Results (preliminary)

DTRSM_Batch:

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 512; curves: KBLAS_10240, KBLAS_1024, MAGMA_IP_10240, MAGMA_IP_1024, CUBLAS_10240, CUBLAS_1024.]

SLIDE 36

Performance Results (preliminary)

DPOTRF_Batch:

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 256; curves: KBLAS_1024, KBLAS_10240, MAGMA_1024, MAGMA_10240.]

SLIDE 37

Performance Results (preliminary)

DPOTRS_Batch:

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 512; curves: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]

SLIDE 38

Performance Results (preliminary)

DPOSV_Batch:

[Plot: performance (GFlop/s, log2) vs. matrix size, 8 to 512; curves: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.]

SLIDE 39

Outline

Motivations • QR-based Dynamically Weighted Halley for SVD • Level 3 BLAS • H-Matrices • HiCMA in a nutshell

SLIDE 40

HiCMA’s Scope

The Hierarchical Computations on Manycore Architectures library aims to:

◮ Develop high performance numerical solvers: Dense / Data-Sparse (H)
◮ Increase data reuse thanks to a recursive/hierarchical formulation
◮ Exploit high levels of concurrency
◮ Perform asynchronous execution
◮ Target various architectures: shared/distributed-memory, accelerators/co-processors, ARM

SLIDE 41

HiCMA Software Stack

SLIDE 42

HiCMA’s Backbone

SLIDE 43

HiCMA’s Horsepower

SLIDE 44

HiCMA’s MoC