slide-1
SLIDE 1

Energy Efficient, High-Performance Solvers through Small Dense Matrix Computations on GPUs

Azzam Haidar and Stan Tomov

with J. Dongarra, P. Luszczek, I. Yamazaki, M. Gates, H. Anzt, and T. Dong

University of Tennessee, Knoxville

GPU Technology Conference March 17 – 20, 2015 San Jose, CA

slide-2
SLIDE 2

Outline

  • Motivation
  • Overview of the MAGMA library

– Hybrid GPU+CPU dense linear algebra
– Sparse linear algebra
– Batched linear algebra
– Power considerations and GPU-only algorithms

  • Batched one-sided factorizations
  • Future Directions
slide-3
SLIDE 3

Motivation

  • Important scientific applications rely on sparse linear algebra
  • K40 GPU computing efficiency on compute-intensive (dense LU) vs. memory-bound computation (SpMV)

slide-4
SLIDE 4

Motivation

  • Many dense and sparse direct solvers need high-performance, energy-efficient linear algebra on many small independent dense matrices

– Tiled linear algebra algorithms
– Multifrontal methods
– Preconditioners (using DLA) in sparse iterative solvers, many applications, …

[Diagram: a sparse/dense matrix system leads to many small independent factorizations (LU, QR, or Cholesky on small diagonal matrices), triangular solves (TRSMs, TRMMs), and Schur-complement updates (GEMMs, SYRKs, TRMMs), organized as a DAG-based factorization over batched LA; plus many other BLAS/LAPACK kernels for application-specific solvers, preconditioners, and matrices.]

slide-5
SLIDE 5

MAGMA: LAPACK for GPUs

  • MAGMA

– Matrix Algebra on GPU and Multicore Architectures
– Next generation of DLA software: LAPACK/ScaLAPACK on hybrid architectures
– http://icl.cs.utk.edu/magma/

  • Developers & collaborators:

– UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)
– Community effort, similar to LAPACK/ScaLAPACK

MAGMA Hybrid Algorithms (heterogeneity friendly) rely on:

  • a hybrid scheduler
  • hybrid kernels (for nested parallelism)

slide-6
SLIDE 6

Key Features of MAGMA 1.6

HYBRID ALGORITHMS

MAGMA uses hybrid algorithms where the computation is split into tasks of varying granularity and their execution scheduled over the hardware components. Scheduling can be static or dynamic. In either case, small non-parallelizable tasks, often on the critical path, are scheduled on the CPU, and larger, more parallelizable ones, often Level 3 BLAS, are scheduled on the GPU.

PERFORMANCE & ENERGY EFFICIENCY

slide-7
SLIDE 7

Methodology overview

  • MAGMA uses hybrid algorithms based on

– Representing linear algebra algorithms as collections of tasks and data dependencies among them
– Properly scheduling tasks' execution over multicore and GPU hardware components

  • Successfully applied to fundamental linear algebra algorithms

– One- and two-sided factorizations and solvers
– Iterative linear and eigensolvers

  • Productivity

– 1) High level; 2) Leveraging prior developments; 3) Exceeding in performance homogeneous solutions

A methodology to use all available resources: hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs)

slide-8
SLIDE 8

A Hybrid Algorithm Example

for( j=0; j<n; j+=nb ) {
    jb = min(nb, n-j);
    zherk( MagmaUpper, MagmaConjTrans, jb, j, one, dA(0,j), ldda, one, dA(j,j), ldda );
    if (j+jb < n)
        zgemm( MagmaConjTrans, MagmaNoTrans, jb, n-j-jb, j, one, dA(0,j), ldda, dA(0,j+jb), ldda, one, dA(j,j+jb), ldda );
    zpotrf( MagmaUpper, jb, dA(j,j), ldda, info );
    if (info != 0)
        *info += j;
    if (j+jb < n) {
        ztrsm( MagmaLeft, MagmaUpper, MagmaConjTrans, MagmaNonUnit, jb, n-j-jb, one, dA(j,j), ldda, dA(j,j+jb), ldda );
    }
}

Left-looking hybrid Cholesky

LAPACK

MAGMA

From sequential to parallel hybrid

  • MAGMA and LAPACK look similar
  • Difference is lines in red, specifying data transfers and dependencies
  • Differences are further hidden in a dynamic scheduler making the top level

representation of MAGMA algorithms almost identical to LAPACK

Note:

 1  for( j=0; j<n; j+=nb ) {
 2      jb = min(nb, n-j);
 3      magma_zherk( MagmaUpper, MagmaConjTrans, jb, j, one, dA(0,j), ldda, one, dA(j,j), ldda, queue );
 4      magma_zgetmatrix_async( jb, jb, dA(j,j), ldda, work, jb, queue, &event );
 5      if (j+jb < n)
 6          magma_zgemm( MagmaConjTrans, MagmaNoTrans, jb, n-j-jb, j, one, dA(0,j), ldda, dA(0,j+jb), ldda, one, dA(j,j+jb), ldda, queue );
 7      magma_event_sync( event );
 8      zpotrf( MagmaUpperStr, &jb, work, &jb, info );
 9      if (info != 0)
10          *info += j;
11      magma_zsetmatrix_async( jb, jb, work, jb, dA(j,j), ldda, queue, &event );
12      if (j+jb < n) {
13          magma_event_sync( event );
14          magma_ztrsm( MagmaLeft, MagmaUpper, MagmaConjTrans, MagmaNonUnit, jb, n-j-jb, one, dA(j,j), ldda, dA(j,j+jb), ldda, queue );
        }
    }

slide-9
SLIDE 9

A Hybrid Algorithm Example

(Same left-looking hybrid Cholesky example as Slide 8: the LAPACK and MAGMA versions, differing only in the lines that specify data transfers and dependencies.)

Note: MAGMA runtime environment

  • Scheduling can be static or dynamic
  • Dynamic scheduling is based on QUARK
  • Uses CUDA streams to offload computation to the GPU

slide-10
SLIDE 10

A Hybrid Algorithm Example

(Same left-looking hybrid Cholesky example as Slide 8.)

[Execution trace: tasks from code lines 3, 4, 6, 7, 11, 13, and 14 scheduled across the CPU, the CUDA queues, and PCIe transfers over time; the panel task (line 8) runs on the CPU while the update tasks run on the GPU.]


Note: CPU task #8 and CPU-GPU communications are overlapped with GPU computations (MAGMA runtime environment).

slide-11
SLIDE 11

Mixed precision iterative refinement

[Performance chart, GFlop/s vs. matrix size: SP solve, DP solve with mixed-precision iterative refinement, and DP solve.]

Keeneland GPU M2090 (14 MP @1.3 GHz, peak 583 GFlop/s) CPU Intel Xeon X5660@2.80GHz (2 x 6 cores)

Solving general dense linear systems using mixed precision iterative refinement
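The idea is sketched below as a minimal, CPU-only C illustration in the spirit of LAPACK's dsgesv and MAGMA's mixed-precision solvers (not the MAGMA code itself; LAPACKE/CBLAS are assumed available, and the helper name mp_refine is hypothetical): do the O(n^3) factorization once in fast single precision, then recover double-precision accuracy with a few cheap O(n^2) refinement steps that reuse the SP factors.

#include <stdlib.h>
#include <lapacke.h>
#include <cblas.h>

/* Solve Ax = b (column-major, n x n): SP factorization + DP iterative
   refinement. Convergence test and error checking omitted for brevity. */
int mp_refine(int n, const double *A, const double *b, double *x, int max_iter)
{
    float  *As   = malloc((size_t)n*n*sizeof(float));
    float  *rs   = malloc((size_t)n*sizeof(float));
    double *r    = malloc((size_t)n*sizeof(double));
    int    *ipiv = malloc((size_t)n*sizeof(int));

    for (int i = 0; i < n*n; i++) As[i] = (float)A[i];    /* demote A once */
    for (int i = 0; i < n;   i++) x[i] = 0.0;

    LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, n, ipiv);  /* O(n^3) in SP */

    for (int it = 0; it < max_iter; it++) {
        for (int i = 0; i < n; i++) r[i] = b[i];          /* r = b - A*x in DP */
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, -1.0, A, n, x, 1, 1.0, r, 1);

        for (int i = 0; i < n; i++) rs[i] = (float)r[i];  /* correction solve reuses SP factors */
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, rs, n);
        for (int i = 0; i < n; i++) x[i] += (double)rs[i];
    }
    free(As); free(rs); free(r); free(ipiv);
    return 0;
}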

slide-12
SLIDE 12

Out of GPU Memory Algorithms

[Performance chart for DGETRF, GFlop/s vs. matrix size.]

Solving large problems that do not fit in the GPU memory

Keeneland GPU M2090 (14 MP @1.3 GHz, peak 583 GFlop/s) CPU Intel Xeon X5660@2.80GHz (2 x 6 cores) Matrices of size that do not fit in a specified GPU memory

slide-13
SLIDE 13

Out of GPU Memory Algorithms

[Performance chart for DGETRF, GFlop/s vs. matrix size.]

Solving large problems that do not fit in the GPU memory

Keeneland GPU M2090 (14 MP @1.3 GHz, peak 583 GFlop/s) CPU Intel Xeon X5660@2.80GHz (2 x 6 cores) Out-of-GPU-memory Algorithms can now solve large problems

slide-14
SLIDE 14

Out of GPU Memory Algorithm

  • Perform left-looking factorizations on sub-matrices that fit in the GPU memory (using existing algorithms)

  • The rest of the matrix stays on the CPU
  • Left-looking versions minimize writing on the CPU

1) Copy A2 to the GPU
2) Update A2 using A1 (a panel of A1 at a time)
3) Factor the updated A2 using existing hybrid code
4) Copy factored A2 to the CPU

Trivially extended to multi-GPUs: A2 is “larger”, with a 1-D block-cyclic distribution, again reusing existing algorithms.

[Diagram: factored sub-matrix A1 on the CPU, sub-matrix A2 to be factored on the GPU, untouched part of the matrix.]

slide-15
SLIDE 15

MultiGPU Support

  • Data distribution

– 1-D block-cyclic distribution

  • Algorithm

– The GPU holding the current panel sends it to the CPU
– All updates are done in parallel on the GPUs
– Look-ahead is done by the GPU holding the next panel

[Diagram: panels of width nb distributed 1-D block-cyclically across GPU 0, GPU 1, GPU 2, …; see the sketch below.]
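A tiny illustration of the 1-D block-cyclic mapping (not MAGMA code; the nb and ngpu values are arbitrary): the GPU that owns a given panel is simply its block index modulo the number of GPUs.

#include <stdio.h>

/* 1-D block-cyclic column distribution: column block j/nb lives on GPU (j/nb) % ngpu */
static int owner_of_panel(int j, int nb, int ngpu) { return (j / nb) % ngpu; }

int main(void)
{
    int nb = 256, ngpu = 3;
    for (int j = 0; j < 8*nb; j += nb)
        printf("panel starting at column %4d -> GPU %d\n", j, owner_of_panel(j, nb, ngpu));
    return 0;
}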

slide-16
SLIDE 16

[Performance chart: LU on multi-GPUs in DP, GFlop/s vs. matrix size, 1 GPU vs. CPU (MKL).]

Keeneland GPU M2090 (14 MP @1.3 GHz, peak 583 GFlop/s) CPU Intel Xeon X5660@2.80GHz (2 x 6 cores)

slide-17
SLIDE 17

[Performance chart: LU on multi-GPUs in DP, GFlop/s vs. matrix size, 2 GPUs vs. 1 GPU vs. CPU (MKL).]

Keeneland GPU M2090 (14 MP @1.3 GHz, peak 583 GFlop/s) CPU Intel Xeon X5660@2.80GHz (2 x 6 cores)

slide-18
SLIDE 18

[Performance chart: LU on multi-GPUs in DP, GFlop/s vs. matrix size, 3 GPUs vs. 2 GPUs vs. 1 GPU vs. CPU (MKL).]

Keeneland GPU M2090 (14 MP @1.3 GHz, peak 583 GFlop/s) CPU Intel Xeon X5660@2.80GHz (2 x 6 cores)

slide-19
SLIDE 19

[Performance chart: LU on Kepler in DP, GFlop/s vs. matrix size, Kepler (K20X) vs. 3 / 2 / 1 GPUs vs. CPU (MKL).]

Keeneland GPU M2090 (14 MP @1.3 GHz, peak 583 GFlop/s) CPU Intel Xeon X5660@2.80GHz (2 x 6 cores)

slide-20
SLIDE 20

Support for K80 “GK210-Duo”

[Performance chart: GFlop/s vs. matrix size N x N, for 1 K80 and 0.5 K80.]

Host

Ivy Bridge (2 x 10 cores @3.0 GHz) DP Peak 480 GFlop/s DGEMM Peak ~ 430 GFlop/s

GPUs

K80 ( 2 x 1664 CUDA cores @823.5 MHz) DGEMM Peak ~ 1000 GFlop/s

Performance of the MAGMA LU factorization in double precision

95% of dgemm peak 89% of dgemm peak

slide-21
SLIDE 21

Support for K80 “GK210-Duo”

[Performance chart: GFlop/s vs. matrix size N x N, for 4, 2, 1, and 0.5 K80.]

Host

Ivy Bridge (2 x 10 cores @3.0 GHz) DP Peak 480 GFlop/s DGEMM Peak ~ 430 GFlop/s

GPUs

K80 ( 2 x 1664 CUDA cores @823.5 MHz) DGEMM Peak ~ 1000 GFlop/s

Performance of the MAGMA LU factorization in double precision

1,530 Gflop/s per K80

slide-22
SLIDE 22

Eigenproblem Solvers in MAGMA

  • A x = λ x

– Quantum mechanics (Schrödinger equation)
– Quantum chemistry
– Principal component analysis (in data mining)
– Vibration analysis (of mechanical structures)
– Image processing, compression, face recognition
– Eigenvalues of graphs, e.g., in Google’s PageRank
– . . .

  • Need to solve it fast
slide-23
SLIDE 23

[Diagram: the two stages of the reduction (first stage, second stage).]

Characteristics

  • Stage 1: BLAS-3, increasing computational intensity
  • Stage 2: BLAS-1.5, new cache-friendly kernel
  • 4x/12x faster than the standard approach
  • Bottleneck: if all eigenvectors are required, there is one extra back-transformation cost

Keeneland system, using one node 3 NVIDIA GPUs (M2090@ 1.1 GHz, 5.4 GB) 2 x 6 Intel Cores (X5660 @ 2.8 GHz, 23 GB)

Flop/s formula: (4n^3/3) / time

Higher is better

Acceleration w/ 3 GPUs:

15 X vs. 12 Intel cores

Fast symmetric Eigensolvers

slide-24
SLIDE 24

Distributed-MAGMA Symmetric Eigensolvers with Thomas Schulthess CSCS

  • R. Solca, A. Kozhevnikov, A. Haidar, S. Tomov and T. Schulthess.

Towards extreme-scale all-electron quantum simulations for materials design. In preparation 2015

  • A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca,

A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks, International Journal of High Performance Computing Applications , vol. 28, no 2, pp. 196—209, May 2014.

  • A. Haidar, R. Solca, S. Tomov, T. Schulthess and J. Dongarra.

Leading edge multi-GPU algorithms for generalized eigenproblems for electronic structure calculations. International Supercomputing Conference IEEE-ISC 2013.

The hybrid CPU-GPU version is 1.5 times faster than the CPU version, and 2 times more energy efficient.

Cray XC30:

  • 8-core Intel Xeon E5-2670 Sandy Bridge socket
  • Nvidia K20X
slide-25
SLIDE 25

Fast non-symmetric Eigensolvers

n = 16000,

  • 2x8-core Intel Xeon E5-2670 Sandy Bridge socket
  • NVIDIA Kepler K40 GPU
  • A is n x n, nonsymmetric
  • Ax = λ x
  • Three phases:

– Hessenberg reduction: H = Q1^T A Q1
– QR iteration to triangular form: T = Q2^T H Q2
– Compute eigenvectors Z of T and back-transform to eigenvectors X of A

[Diagram: A → H → T → Z → X]
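For clarity, the three phases compose through the standard back-transformation identity:

\[
H = Q_1^T A Q_1, \qquad T = Q_2^T H Q_2, \qquad T z = \lambda z
\;\Longrightarrow\; x = Q_1 Q_2 z \ \text{satisfies}\ A x = \lambda x .
\]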

  • M. Gates, A. Haidar, and J. Dongarra.

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem. VecPar 2014 the 11th International Meeting High Performance Computing for Computational Science.

  • A. Haidar, M. Gates, S. Tomov, and J. Dongarra.

Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication. International Conference on Supercomputing, ICS 2013, Oregon, USA

slide-26
SLIDE 26

Scalability and efficiency:

  • Snapshot of the execution trace of the Cholesky factorization on System A for a matrix of size 40K using six K20c GPUs.
  • As expected, the trace appears tightly packed, which means that our implementation is able to schedule and balance the tasks across the six GPU devices.

Dynamic MAGMA Scheduling

Trace of Dynamic MAGMA DPOTRF using QUARK on six GPUs

slide-27
SLIDE 27

Dynamic MAGMA with QUARK

  • A. Haidar, C. Cao, A. YarKhan, P. Luszczek, S. Tomov, K. Kabir, and J. Dongarra.

Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment. IEEE IPDPS 2014, 28th IEEE International Parallel and Distributed Processing Symposium.

  • J. Dongarra, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and A. YarKhan

Model-Driven One-Sided Factorizations on Multicore Accelerated Systems. Supercomputing frontiers and innovations journal 2014.

  • 2x8-core Intel Xeon E5-2670 Sandy Bridge socket
  • NVIDIA Kepler K20c GPU
slide-28
SLIDE 28

Distributed MAGMA

  • Preliminary work on distributed memory systems
  • Extensions of the Dynamic MAGMA

– ScaLAPACK 2D block-cyclic data distribution
– Lightweight “local” (node) scheduling with QUARK + MPI communications
– Matches the performance of previous results using “tile” algorithms

[Performance charts for the distributed Cholesky factorization. Left: single-node performance (N = 34K) against kernel peak x (#cores + #GPUs). Right: weak scaling (N = 340K) over 1 to 100 nodes against the DGEMM upper bound, reaching about 75% of it.]

  • F. Song, S. Tomov, and J. Dongarra, “ Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU

Systems”, ACM International Conference on Supercomputing (ICS 2012), San Servolo Island, Venice, Italy, June 2012.

slide-29
SLIDE 29

SpMM on block of 32 vectors in double precision

slide-30
SLIDE 30

Algorithms to avoid communication

  • Communication avoiding GMRES (CA-GMRES)

(1) Replacing GMRES’ SpMV with a Matrix Powers Kernel (MPK): v_{k+1} = A v_k for k = j, …, j+s (sketched at the end of this slide)
(2) Replacing BLAS-2 with BLAS-3 based orthogonalization (next …)

[Charts: MPK used to generate 100 vectors for various values of s; overall performance improvement on up to 3 GPUs.]

  • I. Yamazaki, H. Anzt, S. Tomov, M. Hoemmen, and J. Dongarra, “Improving the Performance of CA-GMRES on Multicores with Multiple GPUs”, Proc. of IPDPS 2014, Phoenix, Arizona, May 19–23, 2014.

[Chart: roughly 2x acceleration of the SpMV and orthogonalization phases.]
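For reference, a plain-C sketch of the matrix powers kernel loop mentioned above (sequential CSR SpMV, illustration only; not the GPU or communication-avoiding implementation):

#include <stddef.h>

/* y = A*x for a CSR matrix (rowptr, col, val). */
static void spmv_csr(int n, const int *rowptr, const int *col,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int p = rowptr[i]; p < rowptr[i+1]; p++)
            s += val[p] * x[col[p]];
        y[i] = s;
    }
}

/* V holds s+1 column vectors of length n; V[:,0] = v_j on entry.
   Generates v_{k+1} = A v_k for k = 0..s-1 in one sweep. */
static void matrix_powers(int n, int s, const int *rowptr, const int *col,
                          const double *val, double *V /* n x (s+1) */)
{
    for (int k = 0; k < s; k++)
        spmv_csr(n, rowptr, col, val, &V[(size_t)k*n], &V[(size_t)(k+1)*n]);
}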

slide-31
SLIDE 31

Orthogonalization procedures

  • Mixed-precision Cholesky QR

– CholQR obtains BLAS-3 performance (sketched below), but its error is bounded by ε κ(V)^2
– Remove the “square” by selectively using double-double (doubled) precision
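A minimal CBLAS/LAPACKE sketch of the textbook Cholesky QR referred to above (illustrative only; not the MAGMA or mixed-precision variant): one SYRK, one small Cholesky, one TRSM, all BLAS-3, which is why CholQR maps well to GPUs.

#include <lapacke.h>
#include <cblas.h>

/* QR of a tall-skinny V (m x k, m >> k), column-major; V is overwritten by Q. */
int chol_qr(int m, int k, double *V, int ldv, double *R, int ldr)
{
    /* G = V^T V, stored in R's upper triangle */
    cblas_dsyrk(CblasColMajor, CblasUpper, CblasTrans,
                k, m, 1.0, V, ldv, 0.0, R, ldr);
    /* R = chol(G); breaks down when kappa(V)^2 approaches 1/eps,
       which is what the mixed double-double variant addresses */
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'U', k, R, ldr);
    if (info != 0) return info;
    /* Q = V * R^{-1}, overwriting V */
    cblas_dtrsm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans, CblasNonUnit,
                m, k, 1.0, R, ldr, V, ldv);
    return 0;
}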

  • I. Yamazaki, S. Tomov, T. Dong, and J. Dongarra, “Mixed-precision orthogonalization scheme and adaptive step size for

CA-GMRES on GPUs”, VECPAR 2014 (Best paper award), Eugene, Oregon, USA, June 30—July 3, 2014.

slide-32
SLIDE 32

Motivation: power and performance advantages

GPU: NVIDIA K40c (15 MP x 192 cores @ 0.88 GHz, DP peak 1690 GFlop/s)
CPU: Intel Sandy Bridge (2 x 8 cores @ 2.6 GHz, DP peak 332.8 GFlop/s)

  • The power consumption of a K40c is about the same as that of 2 x 8 Sandy Bridge cores
  • But the K40c is about 3 to 4 times faster, making it 3 to 4 times more energy efficient

=> To minimize energy consumption, use an entirely-GPU implementation

GPU only routines in MAGMA

slide-33
SLIDE 33

GPU only routines in MAGMA

MAGMA QR in double precision on a K40 GPU: GPU only vs. Hybrid

  • 2x8-core Intel Xeon E5-2670 Sandy Bridge socket
  • NVIDIA Kepler K40 GPU
slide-34
SLIDE 34

Motivation: power and performance advantages

MAGMA Embedded

QR factorization in single precision on the Jetson TK1

GPU only routines in MAGMA

slide-35
SLIDE 35

MAGMA Batched Computations

slide-36
SLIDE 36

Motivation: factorization of thousands of small matrices
  • Structural mechanics
  • Astrophysics
  • High order FEM
  • Sparse direct solver
  • Hydrodynamics
  • Image processing
  • Ranking and recommender systems, etc

MAGMA Batched Computations

slide-37
SLIDE 37

We present here a feasibility design study; the idea is to target the new high-end technologies.

Observations and current situation:
  • There is a lack of linear algebra software for small problems, especially for GPUs
  • CPU: this can be done easily using the existing software infrastructure
  • GPUs are efficient for large data-parallel computations, and have therefore often been used in combination with CPUs, where the CPU handles the small and difficult-to-parallelize tasks

  • What programming model is best for small problems?

MAGMA Batched Computations

slide-38
SLIDE 38

We present here a feasibility design study; the idea is to target the new high-end technologies.

Our goals:

  • Deliver a high-performance numerical library of batched linear algebra subroutines, tuned for modern processor architectures and system designs, that outperforms multicore CPUs in both performance and energy efficiency.
  • Account for both the higher execution rates and the memory model of the new emerging accelerators and coprocessors.
  • Define modular interfaces that allow code-replacement techniques. This will provide the developers of applications, compilers, and runtime systems with the option of expressing computation as a loop or as a single call to a routine from the new batched-operation standard.

MAGMA Batched Computations

slide-39
SLIDE 39

Algorithmic basics:

  • Linear solvers Ax = b follow the LAPACK-style blocked algorithmic design
  • Two distinctive phases:
    – panel factorization: a latency-bound workload
    – trailing matrix update: a compute-bound operation

Hardware characteristics and limits:

  • GPU memory is limited (48 KB of shared memory per SMX, limited number of registers)
  • Prefer implementations that extensively use a large number of threads per block (a warp is 32 threads)
  • Prefer coalesced memory accesses (32 threads can read 32 contiguous elements in parallel)

[Diagram: factored part of A, current panel Pi, trailing matrix update.]
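For reference, the two phases above in a LAPACK-style blocked LU, sketched for a single column-major matrix with LAPACKE/CBLAS (structure only; not the batched GPU code, and the function name blocked_getrf is illustrative):

#include <stddef.h>
#include <lapacke.h>
#include <cblas.h>

/* Right-looking blocked LU with partial pivoting: panel + trailing update. */
void blocked_getrf(int m, int n, int nb, double *A, int lda, int *ipiv)
{
    int mn = (m < n) ? m : n;
    for (int j = 0; j < mn; j += nb) {
        int jb = (mn - j < nb) ? (mn - j) : nb;

        /* panel factorization of the jb-wide column block (latency-bound) */
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, m-j, jb, &A[j + (size_t)j*lda], lda, &ipiv[j]);
        for (int i = j; i < j+jb; i++) ipiv[i] += j;      /* make 1-based pivots global */

        /* apply the panel's row interchanges left and right of the panel */
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, j, A, lda, j+1, j+jb, ipiv, 1);
        if (j+jb < n) {
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, n-j-jb, &A[(size_t)(j+jb)*lda], lda, j+1, j+jb, ipiv, 1);

            /* U12 = L11^{-1} A12, then trailing update A22 -= L21*U12 (compute-bound GEMM) */
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                        jb, n-j-jb, 1.0, &A[j + (size_t)j*lda], lda, &A[j + (size_t)(j+jb)*lda], lda);
            if (j+jb < m)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m-j-jb, n-j-jb, jb,
                            -1.0, &A[(j+jb) + (size_t)j*lda], lda, &A[j + (size_t)(j+jb)*lda], lda,
                             1.0, &A[(j+jb) + (size_t)(j+jb)*lda], lda);
        }
    }
}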

MAGMA Batched Computations

slide-40
SLIDE 40

Classical design strategy

  • For standard problems, the strategy is to prioritize the data-intensive operations for execution on the accelerator and to keep the memory-bound ones for the CPUs, since the hierarchical caches are more appropriate to handle them

Difficulties

  • This cannot be used here, since the matrices are very small and communication becomes expensive

Proposition

  • Go with a native GPU implementation

MAGMA Batched Computations

slide-41
SLIDE 41

Classical design strategy

  • For standard problems, performance is driven by the update operations

Difficulties

  • For batched small matrices it is more complicated: both phases must be efficient

Proposition

  • Redesign both phases in a tuned, efficient way

MAGMA Batched Computations

slide-42
SLIDE 42

Classical design strategy

  • A recommended way of writing efficient GPU kernels is to use the GPU’s shared memory: load it with data and reuse that data in computations as much as possible.

Difficulties

  • Our study and experience show that this procedure provides very good performance for classical GPU kernels, but it is not as appealing for batched algorithms, for several reasons.

MAGMA Batched Computations

slide-43
SLIDE 43

Difficulties

  • Completely saturating the shared memory per SMX can decrease the performance of memory-bound operations, since only one thread block will be mapped to that SMX at a time (low occupancy)
  • Due to the limited parallelism in the panel computation, the number of threads used in the thread block is limited, resulting in low occupancy and subsequently poor core utilization
  • Shared memory is too small (48 KB/SMX) to fit the whole panel
  • The panel computation involves different types of operations:
    – column vectors (find the max, scale, norm, reduction)
    – row interchanges (swap)
    – a small number of vectors (apply)

Proposition: custom design per operation type

MAGMA Batched Computations

slide-44
SLIDE 44

MAGMA Batched Computations

  • 2x8-core Intel Xeon E5-2670 Sandy Bridge socket
  • NVIDIA Kepler K40 GPU
slide-45
SLIDE 45

swap kernel 60% gemm kernel 15%

MAGMA Batched Computations

Classic swap:

slide-46
SLIDE 46

swap kernel 60% gemm kernel 15%

Classic swap: How does the swap work?

slide-47
SLIDE 47

swap kernel 60% gemm kernel 15%

How does the swap work? Classic swap:

slide-48
SLIDE 48

swap kernel 60% gemm kernel 15%

How does the swap work? Classic swap:

slide-49
SLIDE 49

swap kernel 60% gemm kernel 15%

classical swap: How does the swap work?

slide-50
SLIDE 50

swap kernel 60% gemm kernel 15%

Bottlenecks:

  • The swapping consists of nb successive interchanges of two rows of the matrix (serial).
  • Data reading is not coalesced: a GPU warp cannot read 32 values at the same time unless the matrix is stored in transposed form. However, if the matrix is stored in transposed form, the swap is fast BUT the other components become very slow.

Proposition:

  • We propose to modify the kernel to apply all nb row swaps in parallel (see the illustration below)
  • This modification also allows a coalesced write-back of the top nb rows of the matrix
  • Note that the top nb rows are those used by the dtrsm kernel applied right after the dlaswp, so one optimization is to use shared memory to load a chunk of the nb rows

Classic swap:
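A host-side C illustration of the observation that enables the parallel swap (assuming 0-based pivot indices and lda == m; not the MAGMA kernel): the nb sequential interchanges are equivalent to one gather through a precomputed permutation, so on the GPU every row can be moved concurrently, and the top nb rows can be written back in one coalesced pass.

#include <stdlib.h>
#include <string.h>

void laswp_as_gather(int m, int n, double *A, int lda, const int *ipiv, int nb)
{
    /* simulate the nb interchanges on row indices only: src[r] = which
       original row ends up in physical row r after all swaps */
    int *src = malloc(m * sizeof(int));
    for (int r = 0; r < m; r++) src[r] = r;
    for (int i = 0; i < nb; i++) {
        int t = src[i]; src[i] = src[ipiv[i]]; src[ipiv[i]] = t;
    }
    /* one gather pass: every moved row is independent of the others,
       which is what a GPU kernel can exploit (one thread block per row) */
    double *tmp = malloc((size_t)m * n * sizeof(double));
    memcpy(tmp, A, (size_t)m * n * sizeof(double));   /* lda == m assumed for brevity */
    for (int r = 0; r < m; r++)
        if (src[r] != r)
            for (int c = 0; c < n; c++)
                A[r + (size_t)c*lda] = tmp[src[r] + (size_t)c*lda];
    free(src); free(tmp);
}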

slide-51
SLIDE 51

[Profiles: classic swap (swap kernel 60%, gemm kernel 15%) vs. parallel swap (gemm kernel 30%, swap kernel 10%).]

slide-52
SLIDE 52

MAGMA Batched Computations

  • 2x8-core Intel Xeon E5-2670 Sandy Bridge socket
  • NVIDIA Kepler K40 GPU
slide-53
SLIDE 53

panel: classical getf2 38%

Panel factorization classic dgetf2:

[Diagram: factored part of A, panel of width 32, trailing matrix update.]

MAGMA Batched Computations

slide-54
SLIDE 54

panel: classical getf2 38%

Bottlenecks:

  • nb large: the panel gets slower -> very bad performance
  • nb small: the panel gets faster, but the update is no longer efficient since it deals with gemm’s of small sizes -> very bad performance
  • Trade-off? No effect, since we are dealing with small sizes

Proposition:

  • We propose a two-layer blocking scheme: a recursive and nested blocking technique that also blocks the panel (see the sketch below).

[Diagram: factored part of A, panel of width 32, trailing matrix update.]
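A single-matrix CBLAS sketch of the recursive (nested) blocking idea for the panel, with 0-based pivots (illustrative only; the real code is a batched GPU kernel, and singular pivots are not handled):

#include <stddef.h>
#include <math.h>
#include <cblas.h>

/* Recursive LU of an m x n panel (m >= n), column-major, leading dimension lda. */
static void rec_panel_lu(int m, int n, double *A, int lda, int *ipiv)
{
    if (n == 1) {                                  /* base case: one column */
        int p = 0;
        for (int i = 1; i < m; i++)
            if (fabs(A[i]) > fabs(A[p])) p = i;
        ipiv[0] = p;
        double t = A[0]; A[0] = A[p]; A[p] = t;
        for (int i = 1; i < m; i++) A[i] /= A[0];
        return;
    }
    int n1 = n / 2, n2 = n - n1;
    rec_panel_lu(m, n1, A, lda, ipiv);             /* factor left half */
    for (int i = 0; i < n1; i++)                   /* apply its pivots to the right half */
        if (ipiv[i] != i)
            cblas_dswap(n2, &A[i + (size_t)n1*lda], lda, &A[ipiv[i] + (size_t)n1*lda], lda);
    /* A12 := L11^{-1} A12, then A22 := A22 - A21*A12 (small trsm/gemm) */
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                n1, n2, 1.0, A, lda, &A[(size_t)n1*lda], lda);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m-n1, n2, n1,
                -1.0, &A[n1], lda, &A[(size_t)n1*lda], lda,
                 1.0, &A[n1 + (size_t)n1*lda], lda);
    rec_panel_lu(m-n1, n2, &A[n1 + (size_t)n1*lda], lda, ipiv+n1);   /* recurse on the rest */
    for (int i = 0; i < n2; i++) {                 /* globalize pivots, fix left columns */
        ipiv[n1+i] += n1;
        if (ipiv[n1+i] != n1+i)
            cblas_dswap(n1, &A[n1+i], lda, &A[ipiv[n1+i]], lda);
    }
}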

MAGMA Batched Computations

Panel factorization classic dgetf2:

slide-55
SLIDE 55

MAGMA Batched Computations

Two-layer blocking:

slide-56
SLIDE 56

[Profiles: panel with classical getf2, 38% of the time, vs. panel with recursive blocked getf2, 8% of the time.]

MAGMA Batched Computations

slide-57
SLIDE 57

MAGMA Batched Computations

  • 2x8-core Intel Xeon E5-2670 Sandy Bridge socket
  • NVIDIA Kepler K40 GPU
slide-58
SLIDE 58

batched dgemm

MAGMA Batched Computations

slide-59
SLIDE 59

batched dgemm

MAGMA Batched Computations

  • NVIDIA Kepler K40 GPU
slide-60
SLIDE 60

batched dgemm

MAGMA Batched Computations

  • NVIDIA Kepler K40 GPU
slide-61
SLIDE 61

batched dgemm

Bottlenecks:

  • The batched gemm kernels from cuBLAS and MAGMA are well suited for small matrix sizes (up to about 128) but stagnate for larger sizes (>128)

Proposition:

  • Streamed gemm provides higher performance for large matrix sizes (>128), so we propose to use both streamed and batched gemm according to the size of the trailing matrix (sketched below)
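A sketch of this dispatch with cuBLAS (the 128 cutoff follows the slide; pointer-array setup and error checking are omitted, and the function name is illustrative). dA_dev/dB_dev/dC_dev are device-resident arrays of device matrix pointers, as cublasDgemmBatched requires; hA/hB/hC are host copies of the same pointers used for the streamed path.

#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Square n x n matrices, batch of them; C = A*B for each. */
void batched_or_streamed_dgemm(cublasHandle_t handle, int n, int batch,
                               const double **dA_dev, const double **dB_dev, double **dC_dev,
                               const double **hA,     const double **hB,     double **hC,
                               cudaStream_t *streams, int nstreams)
{
    const double one = 1.0, zero = 0.0;
    if (n <= 128) {
        /* many tiny GEMMs: a single batched call keeps the GPU busy */
        cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                           &one, dA_dev, n, dB_dev, n, &zero, dC_dev, n, batch);
    } else {
        /* each GEMM is large enough to fill the GPU on its own:
           issue them individually on a few concurrent streams */
        for (int i = 0; i < batch; i++) {
            cublasSetStream(handle, streams[i % nstreams]);
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &one, hA[i], n, hB[i], n, &zero, hC[i], n);
        }
    }
}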

MAGMA Batched Computations

slide-62
SLIDE 62

batched dgemm streamed dgemm

MAGMA Batched Computations

slide-63
SLIDE 63

MAGMA Batched Computations

  • 2x8-core Intel Xeon E5-2670 Sandy Bridge socket
  • NVIDIA Kepler K40 GPU
slide-64
SLIDE 64

Comparison with CPU:

  • Version 1: the simple CPU implementation loops over the matrices, factorizing one after the other, where each factorization uses the multithreaded version of the MKL library. This is expected to perform poorly because each matrix is small: it does not exhibit enough parallelism, so the multithreaded code cannot feed all 16 Sandy Bridge cores with work.

  • Version 2: we therefore propose another CPU implementation. Since the matrices are small (< 512) and at least 16 of them fit in the L3 cache, one of the best techniques is to have each thread factorize a matrix independently; this way, 16 factorizations are conducted independently in parallel (a sketch follows).
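A minimal sketch of Version 2 with OpenMP and per-matrix LAPACK calls (LAPACKE shown; with MKL, link the sequential library or call mkl_set_num_threads(1) so each per-matrix factorization stays single-threaded):

#include <omp.h>
#include <lapacke.h>

/* One thread factorizes one small n x n matrix at a time. */
void batched_getrf_cpu(int n, int batch, double **A, int lda, int **ipiv)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < batch; i++)
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A[i], lda, ipiv[i]);
}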

MAGMA Batched Computations

slide-65
SLIDE 65

MAGMA Batched Computations

  • 2x8-core Intel Xeon E5-2670 Sandy Bridge socket
  • NVIDIA Kepler K40 GPU

Higher is better

slide-66
SLIDE 66

MAGMA Batched Computations

  • 2x8-core Intel Xeon E5-2670 Sandy Bridge socket
  • NVIDIA Kepler K40 GPU

Higher is better

slide-67
SLIDE 67

MAGMA Batched Computations

  • 2x8-core Intel Xeon E5-2670 Sandy Bridge socket
  • NVIDIA Kepler K40 GPU

Higher is better

slide-68
SLIDE 68

MAGMA Batched Computations

dgeqrf of 1000 batched matrices of size 1024x1024

  • 2x8-core Intel Xeon E5-2670 Sandy Bridge socket
  • NVIDIA Kepler K40 GPU

CPU does not include DRAM power

slide-69
SLIDE 69

Applications

  • Accelerate XNet, a fully implicit, general purpose solver for

thermonuclear reaction networks in astrophysical applications

– Simulates the evolution of the nuclear kinetics, where each time step on a single zone requires solving a small problem with LU
– LU factorizations, including the backward substitution, account for 75% of the time
– Zones can be solved independently
– The number of zones can grow with domain size and dimension to tens of thousands
– Can be set, for example, to 3,900 zones per GPU
– Problem sizes are small, in the range 14 to 150

slide-70
SLIDE 70

Acceleration Results

Titan supercomputer at ORNL (hybrid nodes): CPU: 16 AMD Opteron cores; GPU: NVIDIA K20x

  • T. Dong, A. Haidar, P. Luszczek, J. Harris, S. Tomov, and J. Dongarra. LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU. HPCC 2014, the 16th IEEE International Conference on High Performance Computing and Communications.
slide-71
SLIDE 71

MAGMA Batched Computations

References

  • A. Haidar, T. Dong, P. Luszczek, S. Tomov and J. Dongarra.

Batched Matrix Computations on Hardware Accelerators Based on GPUs. International Journal of High Performance Computing Applications 2014.

  • A. Haidar, T. Dong, S. Tomov, P. Luszczek, and J. Dongarra.

Optimization for Performance and Energy for Batched Matrix Computations on GPUs. 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Workshop on General Purpose Processing Using GPUs (GPGPU/PPoPP 2015).
  • T. Dong, A. Haidar, P. Luszczek, J. Harris, S. Tomov, and J. Dongarra.

LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU. HPCC 2014, the 16th IEEE International Conference on High Performance Computing and Communications.

  • T. Dong, A. Haidar, S. Tomov, and J. Dongarra.

A Fast Batched Cholesky Factorization on a GPU. ICPP 2014, the 43rd International Conference on Parallel Processing.

slide-72
SLIDE 72

Future Directions

  • Extended functionality

– Variable sizes
– Mixed-precision techniques
– Sparse direct multifrontal solvers & preconditioners
– Applications

  • Further tuning

– autotuning

  • GPU-only algorithms and implementations
  • MAGMA Embedded
slide-73
SLIDE 73

Collaborators / Support

  • MAGMA [Matrix Algebra on GPU and

Multicore Architectures] team http://icl.cs.utk.edu/magma/

  • PLASMA [Parallel Linear Algebra for

Scalable Multicore Architectures] team http://icl.cs.utk.edu/plasma

  • Collaborating partners

– University of Tennessee, Knoxville
– University of California, Berkeley
– University of Colorado, Denver
– INRIA, France
– KAUST, Saudi Arabia