SLIDE 1

Accelerating Linear Algebra on Small Matrices – from Batched BLAS to Large Scale Solvers

Stan Tomov and Ichitaro Yamazaki

Azzam Haidar, Ahmad Abdelfattah, Mark Gates, and Jack Dongarra

Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville

In collaboration with:
LLNL, Livermore, CA, USA
University of Manchester, Manchester, UK
University of Paris-Sud, France

GTC 2018, San Jose, CA, March 26–29, 2018

SLIDE 2

Outline

  • Introduction
  • Batched BLAS
  • MAGMA Batched functionalities and techniques
  • Accelerating applications with Batched LA

    – MAGMA DNN, Templates, and Tensors
    – Fused Batched BLAS
    – Applications in exascale discretizations (CEED project)

  • PART II: Batched computations for large-scale solvers with low-rank approximations and preconditioning

  • Conclusions

SLIDE 3

Dense Linear Algebra in Applications

Dense Linear Algebra (DLA) is needed in a wide variety of science and engineering applications:

  • Linear systems: Solve Ax = b
    – Computational electromagnetics, material science, applications using boundary integral equations, airflow past wings, fluid flow around ships and other offshore structures, and many more
  • Least squares: Find x to minimize || Ax – b ||
    – Computational statistics (e.g., linear least squares or ordinary least squares), econometrics, control theory, signal processing, curve fitting, and many more
  • Eigenproblems: Solve Ax = λx
    – Computational chemistry, quantum mechanics, material science, face recognition, PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational analysis, compression, and many more
  • SVD: A = U Σ V* (Av = σu and A*u = σv)
    – Information retrieval, web search, signal processing, big data analytics, low-rank matrix approximation, total least squares minimization, pseudo-inverse, and many more
  • Many variations depending on the structure of A
    – A can be symmetric, positive definite, tridiagonal, Hessenberg, banded, sparse with dense blocks, etc.
  • DLA is crucial to the development of sparse solvers

SLIDE 4


Provided in MAGMA 2.3

http://icl.cs.utk.edu/magma https://bitbucket.org/icl/magma

SLIDE 5

Why use GPUs in HPC?

PERFORMANCE & ENERGY EFFICIENCY

[Figure: MAGMA 2.3 LU factorization in double precision arithmetic – performance (GFLOP/s) vs. matrix size N x N, for CPU, K40, P100, and V100]

CPU: Intel Xeon E5-2650 v3 (Haswell), 2 x 10 cores @ 2.30 GHz
K40: NVIDIA Kepler GPU, 15 MP x 192 @ 0.88 GHz
P100: NVIDIA Pascal GPU, 56 MP x 64 @ 1.19 GHz
V100: NVIDIA Volta GPU, 80 MP x 64 @ 1.38 GHz

[Figure: GFLOPs/Watt for CPU, K40, P100, and V100 – about 10x energy efficiency for the GPUs under ~ the same power draw]

SLIDE 6

Many applications need LA on many small matrices

Data analytics and the associated linear algebra on many small problems are needed in many applications:

  • Machine learning,
  • Data mining,
  • High-order FEM,
  • Numerical LA,
  • Graph analysis,
  • Neuroscience,
  • Astrophysics,
  • Quantum chemistry,
  • Multi-physics problems,
  • Signal processing, etc.

Machine learning:

[Figure: deep network pipeline – input data D, convolution with filters Fn, pooling, convolution, pooling, fully connected, output predictions On (e.g., chicken 0.4, boat 0.3, person 0.1, dog 0.01)]

Convolution of filters Fn (feature detection) and input image D:

  • For every filter Fn and every channel, the computation of every pixel value On,k is a tensor contraction:

      O(n,k) = Σi F(n,i) · D(k,i)

  • Plenty of parallelism; small operations that must be batched
  • With data “reshape” the computation can be transformed into a batched GEMM (for efficiency; among other approaches)

Sparse/dense solvers & preconditioners: batched LAPACK, single calls to Batched BLAS, and DAG-based factorizations on sparse/dense matrix systems.

Applications using high-order FEM:

  • Matrix-free basis evaluation needs efficient tensor contractions:

      C(i1,i2,i3) = Σk A(k,i1) · B(k,i2,i3)

  • Within the ECP CEED project, designed MAGMA batched methods to split the computation into many small high-intensity GEMMs, grouped together (batched) for efficient execution:

      Batch_{ C_i3 = A^T B_i3, for a range of i3 }
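This contraction maps directly onto a batched GEMM with a shared A. A minimal host-side sketch using cuBLAS's cublasDgemmBatched (the sizes and the shared-A setup are illustrative assumptions; error checking omitted):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        // Illustrative sizes: A is k x m (shared), each B_i is k x n, C_i = A^T B_i is m x n.
        const int k = 8, m = 8, n = 64, batchCount = 1000;
        double *dA, *dB, *dC;
        cudaMalloc((void**)&dA, sizeof(double) * k * m);
        cudaMalloc((void**)&dB, sizeof(double) * k * n * batchCount);
        cudaMalloc((void**)&dC, sizeof(double) * m * n * batchCount);
        // ... fill dA (basis matrix) and dB (batch of inputs) here ...

        // Array-of-pointers interface: A repeats, B_i and C_i are strided.
        std::vector<const double*> hA(batchCount), hB(batchCount);
        std::vector<double*> hC(batchCount);
        for (int i = 0; i < batchCount; ++i) {
            hA[i] = dA;
            hB[i] = dB + (size_t)i * k * n;
            hC[i] = dC + (size_t)i * m * n;
        }
        const double **dA_array, **dB_array; double **dC_array;
        cudaMalloc((void**)&dA_array, batchCount * sizeof(double*));
        cudaMalloc((void**)&dB_array, batchCount * sizeof(double*));
        cudaMalloc((void**)&dC_array, batchCount * sizeof(double*));
        cudaMemcpy(dA_array, hA.data(), batchCount * sizeof(double*), cudaMemcpyHostToDevice);
        cudaMemcpy(dB_array, hB.data(), batchCount * sizeof(double*), cudaMemcpyHostToDevice);
        cudaMemcpy(dC_array, hC.data(), batchCount * sizeof(double*), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;
        // C_i (m x n) = A^T (m x k) * B_i (k x n), for all i in one call
        cublasDgemmBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                           &alpha, dA_array, k, dB_array, k,
                           &beta, dC_array, m, batchCount);
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cudaFree(dA_array); cudaFree(dB_array); cudaFree(dC_array);
        return 0;
    }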

SLIDE 7
1. Non-batched computation

  • Loop over the matrices one by one and compute each with multithreaded BLAS. Since the matrices are small, there is not enough work for all the cores, so we expect low performance; thread contention may further hurt it.

    for (i = 0; i < batchcount; i++)
        dgemm(…);

There is not enough work to fill all the cores. A low percentage of the resources is used.

MAGMA Batched Computations

SLIDE 8

Batched_dgemm(…)

2. Batched computation

  • Distribute all the matrices over the available resources by assigning each matrix to a group of cores (CPU) or thread blocks (TBs, GPU) that operates on it independently:
    – For very small matrices, assign one matrix per core (CPU) or per TB (GPU)
    – For medium sizes, a matrix goes to a team of cores (CPU) or to many TBs (GPU)
    – For large sizes, switch to the classical multithreaded approach, one matrix per round
  • The kernel design decides the number of TBs or threads (GPU/CPU), dispatched through the Nvidia/OpenMP scheduler (task manager/dispatcher)

A high percentage of the resources is used.

MAGMA Batched Computations
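The size-based dispatch can be summarized as in the following sketch (the thresholds are illustrative assumptions, not MAGMA's tuned crossover points):

    enum class BatchStrategy {
        OnePerThreadBlock,          // very small: one matrix per core (CPU) / thread block (GPU)
        ManyThreadBlocksPerMatrix,  // medium: a team of cores / many thread blocks per matrix
        NonBatched                  // large: classical multithreaded BLAS, one matrix at a time
    };

    // Illustrative thresholds only; real libraries tune these per architecture.
    BatchStrategy choose_strategy(int n) {
        if (n <= 32)   return BatchStrategy::OnePerThreadBlock;
        if (n <= 1000) return BatchStrategy::ManyThreadBlocksPerMatrix;
        return BatchStrategy::NonBatched;
    }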

SLIDE 11

Implementation on current hardware is becoming challenging.

Draft reports:
Batched BLAS Draft Report: https://www.dropbox.com/s/olocmipyxfvcaui/batched_api_03_30_2016.pdf?dl=0
Batched BLAS Poster: https://www.dropbox.com/s/ddkym76fapddf5c/Batched%20BLAS%20Poster%2012.pdf?dl=0
Batched BLAS Slides: https://www.dropbox.com/s/kz4fhcipz3e56ju/BatchedBLAS-1.pptx?dl=0
Webpage on ReproBLAS: http://bebop.cs.berkeley.edu/reproblas/
Efficient Reproducible Floating Point Summation and BLAS: http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-229.pdf

Workshops on Batched, Reproducible, and Reduced Precision BLAS

University of Tennessee, Knoxville, May 18-19, 2016

http://bit.ly/Batch-BLAS-2016

Georgia Tech, Computational Science and Engineering, Atlanta, GA, February 23–25, 2017

http://bit.ly/Batch-BLAS-2017

Memory hierarchies and flops-per-byte ratios for different architectures (all flop rates for 64-bit operands):

                     Intel Haswell    Intel KNL 7250    ARM Cortex A57   Nvidia P100     Nvidia V100
                     E5-2650 v3       (DDR4 | MCDRAM)
Cores                10               68                4                56 SM x 64      80 SM x 64
Peak (DP)            368 Gflop/s      2662 Gflop/s      32 Gflop/s       4700 Gflop/s    7500 Gflop/s
Power                105 Watts        215 Watts         7 Watts          250 Watts       300 Watts
Registers            16/core AVX2     32/core AVX-512   32/core          256 KB/SM       256 KB/SM
L1 cache / GPU
  shared memory      32 KB/core       32 KB/core        32 KB/core       64 KB/SM        96 KB/SM
L2 cache             256 KB/core      1024 KB/2 cores   2 MB             4 MB            6 MB
L3 cache             25 MB            0...16 GB         N/A              N/A             N/A
Main memory          64 GB            384 | 16 GB       4 GB             16 GB           16 GB
Main memory BW       68 GB/s          115 | 421 GB/s    26 GB/s          720 GB/s        900 GB/s
  flops/byte         5.4              23 | 6            1.2              6.5             8.3
PCIe gen3 x16 /
  NVLink             16 GB/s          16 GB/s           16 GB/s          16 GB/s         300 GB/s (NVL)
  flops/byte         23               166               2                294             25
InfiniBand EDR       12 GB/s          12 GB/s           12 GB/s          12 GB/s         12 GB/s
  flops/byte         30               221               2.6              392             625

SLIDE 12

Toward a standard Batched BLAS API

Status and goal:

  • Batched BLAS functionality has become a major factor in our community
  • Batched routines are gradually making their way into vendor libraries (Nvidia, Intel, etc.) as well as into research software (MAGMA, Kokkos, etc.)
  • Today’s APIs differ significantly, which can lead to poor portability
  • Thus the community needs to make an effort to standardize the Batched BLAS API

Reference: J. Dongarra, I. Duff, M. Gates, A. Haidar, S. Hammarling, N. J. Higham, J. Hogg, P. Valero-Lara, S. D. Relton, S. Tomov, and M. Zounon. A Proposed API for Batched Basic Linear Algebra Subprograms. MIMS EPrint 2016.25, 2016. (Available at http://eprints.ma.man.ac.uk/2464/)

SLIDE 13

Toward a standard Batched BLAS API

Matrices are stored BLAS-like, “the usual storage that we know”:

  • Array of pointers: an array containing a pointer to each matrix
    – Data could belong to one memory allocation
    – Data could be anywhere, in different allocations
    – Matrices could be equidistant from each other or not
  • Suitable for CPU, GPU, and Phi
  • Accommodates most of the cases
  • User has to fill in the array of pointers

SLIDE 14

Toward a standard Batched BLAS API

Matrices are stored BLAS-like, “the usual storage that we know”:

  • Array of pointers: an array containing a pointer to each matrix
  • Strided: one pointer to memory, with the matrices strided inside it
    – Fixed stride
    – Variable stride
  • Suitable for CPU, GPU, and Phi
  • For variable stride, the user has to fill in the stride array
  • Cannot accommodate data that was not allocated within the same chunk of memory. Think about adding matrices to the batch.
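A minimal sketch contrasting the two layouts (plain C++, illustrative, not a specific library's API): one strided allocation, and the array of pointers that an array-of-pointers interface would consume, derived from it.

    #include <vector>

    int main() {
        const int n = 16, lda = 16, batchCount = 1000;
        // Strided layout: one allocation, matrix i starts at a fixed stride.
        std::vector<double> storage((size_t)lda * n * batchCount);
        const size_t stride = (size_t)lda * n;

        // Array-of-pointers layout derived from it: one pointer per matrix.
        // (The pointers could equally come from separate allocations.)
        std::vector<double*> A_array(batchCount);
        for (int i = 0; i < batchCount; ++i)
            A_array[i] = storage.data() + i * stride;
        return 0;
    }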

SLIDE 15

Toward a standard Batched BLAS API

Matrices are stored BLAS-like, “the usual storage that we know”:

  • Array of pointers: an array containing a pointer to each matrix
  • Strided: one pointer to memory, with the matrices strided inside it

Matrices are stored in an interleaved or compact fashion:

  • Data can be interleaved by batchcount or by chunk (SIMD, AVX, warp)
  • Only good for sizes less than ~20, and only for some routines such as GEMM and TRSM; it has performance and implementation issues for routines like LU or QR factorization
  • Requires the user or the implementation to convert/reshuffle the memory storage, since most storage is BLAS-like

SLIDE 16

Toward a standard Batched BLAS API

API discussion:

  • Same or separate API for fixed- and variable-size batches?
    – Have two separate APIs?
    – Have a flag that switches between fixed and variable?
  • To simplify the user’s life and avoid a combinatorial explosion of parameters, we propose to distinguish between fixed- and variable-size APIs:

void batchedblas_dgemm_batched(
    batched_trans_t transA, batched_trans_t transB,
    batched_int_t m, batched_int_t n, batched_int_t k,
    double alpha,
    double const * const * dA_array, batched_int_t ldda,
    double const * const * dB_array, batched_int_t lddb,
    double beta,
    double ** dC_array, batched_int_t lddc,
    batched_int_t batchCount, batched_queue_t queue);

void batchedblas_dgemm_vbatched(
    batched_trans_t transA, batched_trans_t transB,
    batched_int_t *m, batched_int_t *n, batched_int_t *k,
    double alpha,
    double const * const * dA_array, batched_int_t *ldda,
    double const * const * dB_array, batched_int_t *lddb,
    double beta,
    double ** dC_array, batched_int_t *lddc,
    batched_int_t batchCount, batched_queue_t queue);

SLIDE 17

Toward a standard Batched BLAS API

  • Consider GEMM as an example
  • ‘F’ = Fixed, unified across a batch/group of problems
  • ‘V’ = Variable, each problem has its own assigned value

                                option args        scaling args    sizes, ld’s, inc’s          data pointers
                                (transA, transB)   (alpha, beta)   (m, n, k), (lda, ldb, ldc)  (A, B, C)
Flag-based      BATCH_FIXED     F                  F               F                           V
(flat,          BATCH_VAR       V                  V               V                           V
 standard)
Group (MKL)     per group       F                  F               F                           V
                across groups   V                  V               V                           V
Flat (MAGMA)    fixed API       F                  F               F                           V
                variable API    F                  F               V                           V
cuBLAS (flat)                   F                  F               F                           V

SLIDE 18
  • This API supports over 1000 possibilities for GEMM
  • The size of C must be equal to batchCount
  • The size of other arguments can be either 1 or batchCount
  • Error checking through std::vector<blas_int> info
  • The group API can be supported by promoting batchCount to be a std::vector

template <typename T>
void gemm_batch(
    std::vector<Op> const &transA,
    std::vector<Op> const &transB,
    std::vector<blas_int> const &m,
    std::vector<blas_int> const &n,
    std::vector<blas_int> const &k,
    std::vector<T> const &alpha,
    std::vector<T*> const &A,
    std::vector<blas_int> const &lda,
    std::vector<T*> const &B,
    std::vector<blas_int> const &ldb,
    std::vector<T> const &beta,
    std::vector<T*> const &C,
    std::vector<blas_int> const &ldc,
    const blas_int batchCount);
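As an illustration of the length-1 broadcast rule above (a hedged sketch assuming the gemm_batch declaration just shown; Op, blas_int, and Op::NoTrans come from the proposed C++ API and are assumptions here): a batch of equally sized problems passes length-1 vectors for everything except the data pointers.

    // All problems share transA/transB, sizes, leading dimensions, and scalars,
    // so those vectors have length 1; only A, B, C have length batchCount.
    const blas_int batchCount = 100, n = 16;
    std::vector<Op> transA = { Op::NoTrans }, transB = { Op::NoTrans };
    std::vector<blas_int> m = { n }, nn = { n }, k = { n };
    std::vector<blas_int> lda = { n }, ldb = { n }, ldc = { n };
    std::vector<double> alpha = { 1.0 }, beta = { 0.0 };
    std::vector<double*> A(batchCount), B(batchCount), C(batchCount); // filled by the caller
    gemm_batch<double>(transA, transB, m, nn, k, alpha, A, lda, B, ldb, beta, C, ldc, batchCount);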

Batched BLAS API for C++

SLIDE 19

Batched routines released in MAGMA

SLIDE 20

How to implement fast batched DLA?

Problem sizes influence algorithms & optimization techniques.

[Figure: Gflop/s of batched dgemm (BLAS 3) vs. standard dgemm (BLAS 3) on an Nvidia V100 GPU for 50–1,000 matrices, plotted against the (fixed) matrix sizes in the batch, for batch sizes 1,000, 300, and 50. Batched is ~19x faster for small sizes and ~1.4x for medium sizes; for large sizes, switch to non-batched.]

[Figure: GEMM kernel blocking – C (M x N) is partitioned into BLK_M x BLK_N tiles (C11 … C44); each thread block of threads (thx, thy) computes one tile of C, sweeping over BLK_K-wide blocks of A (M x K) and B (K x N).]

  • Reading/writing the elements is based on the thread block size (# threads), so it is an extra tuning parameter
  • It could also be different for A, B, and C

Optimizing GEMMs: Kernel design

Kernels are designed for various scenarios and parameterized, so that an autotuning framework can find the “best” performing kernels.
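As an illustration of such parameterization, here is a deliberately naive batched DGEMM kernel (a sketch only, not MAGMA's kernel): the thread-block shape is a compile-time parameter an autotuner could sweep, and blockIdx.z selects the matrix within the batch. Production kernels additionally block in shared memory and registers (the BLK_M, BLK_N, BLK_K tiling above).

    #include <cuda_runtime.h>

    // Naive batched DGEMM: C_i = alpha*A_i*B_i + beta*C_i, column-major,
    // one matrix per blockIdx.z, one element of C per thread.
    template <int DIM_X, int DIM_Y>
    __global__ void dgemm_batched_naive(int M, int N, int K, double alpha,
                                        const double* const* dA_array, int ldda,
                                        const double* const* dB_array, int lddb,
                                        double beta, double* const* dC_array, int lddc)
    {
        const int row = blockIdx.x * DIM_X + threadIdx.x;   // row of C
        const int col = blockIdx.y * DIM_Y + threadIdx.y;   // column of C
        if (row >= M || col >= N) return;
        const double* A = dA_array[blockIdx.z];             // this block's matrix
        const double* B = dB_array[blockIdx.z];
        double*       C = dC_array[blockIdx.z];
        double acc = 0.0;
        for (int k = 0; k < K; ++k)
            acc += A[row + (size_t)k * ldda] * B[k + (size_t)col * lddb];
        C[row + (size_t)col * lddc] = alpha * acc + beta * C[row + (size_t)col * lddc];
    }

    // Example launch (16 x 16 thread blocks; an autotuner would sweep DIM_X/DIM_Y):
    // dim3 grid((M + 15) / 16, (N + 15) / 16, batchCount), block(16, 16);
    // dgemm_batched_naive<16,16><<<grid, block>>>(M, N, K, alpha, dA_array, ldda,
    //                                             dB_array, lddb, beta, dC_array, lddc);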

SLIDE 21

GPU Optimization Summary

  • Hardware concepts
    – CUDA core
    – Warp
    – Half-warp
    – Register file
    – Shared memory
    – Atomics
    – Shuffles
    – SMX
  • Software concepts
    – Stream
    – Thread block
    – Kernel
    – Inlining
    – Intrinsics
  • Algorithmic concepts
    – Blocking
    – Recursive blocking
    – Kernel replacement
    – Out-of-place operations

How to implement fast batched DLA?

SLIDE 22

Performance of MAGMA Batched (GEMM)

SLIDE 23

Performance of MAGMA Batched (LU)

SLIDE 24

Performance of MAGMA Batched (QR)

SLIDE 25

Performance of MAGMA Batched (Cholesky)

SLIDE 26

Performance of MAGMA Batched (LU)

SLIDE 27

Performance of MAGMA Batched (QR)

SLIDE 28

Batched computations in applications

SLIDE 29

MagmaDNN – Data Analytics Tool

Ø MagmaDNN 0.1-Alpha – HP Data analytics and ML

GPU-accelerated numerical software using MAGMA as computational backend to accelerate its LA computations

Ø Open source; looking for feedback and contributions

Started with students from REU/RECSEM program https://bitbucket.org/icl/magmadnn

Ø Implemented/proposed so far

Ø Tensors and tensor operations Ø Deep learning primitives: Fully-connected layers, convolutional layers, pooling layers, activation layers, and output layers. All of them support SGD back-propagation training Ø Established adapters for calling CuDNN Ø Applied MagmaDNN to the MNIST benchmark using multilayer perceptron or a convolutional neural network.

Provided in MAGMA 2.3

http://icl.cs.utk.edu/magma https://bitbucket.org/icl/magmadnn

w/ Lucien Ng, The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational

Sciences (JICS), UTK

SLIDE 30

Fully connected layers

Fully-connected 3-layer Neural Network example

Ø Data (input, output, NN weights, etc.) is handled through tensor abstractions

// 2d tensor for n_images and n_features in the corresponding dimensions Tensor<float> Images = Tensor<float>({n_images, n_features});

Ø Support for various layers:

Fully connected (FCLayer), convolution, activation, flatten, pooling, input, output, etc. layers

// Create layers for the network FCLayer<float> *FC1 = new FCLayer<float>(&inputLayer, 128); ActivationLayer<float> *actv1 = new ActivationLayer<float>(FC1, SIGMOID); FCLayer<float> *FC2 = new FCLayer<float>(actv1, n_output_classes);

Ø Support networks – composed of layers

std::vector<Layer<float>*> vec_layer; vec_layer.push_back(&inputLayer); vec_layer.push_back(FC1); vec_layer.push_back(actv1); vec_layer.push_back(FC2); …

SLIDE 31

Convolutional network layers

Convolution Network (ConvNet) example

Ø Layers are typically 3D volumes Ø Handled through tensors Ø Each layer transforms 3D tensor to 3D tensor Ø Layers support the forward and backward pass algorithms for the training Ø Support for optimization solvers (GD and derivatives) Ø Gradient Descent (GD) Ø Stochastic Gradient Descent (SGD) Ø Mini-Batch Gradient Descent (MB-GD)

SLIDE 32

How to accelerate on manycore GPUs and CPUs?

Convolution Network (ConvNet) example

Ø Convolutions can be accelerated in various ways: Ø Unfold and GEMM Ø FFT Ø Winograd minimal filtering – reduction to batched GEMMs Ø Use autotuning to handle complexity of tuning

Require matrix-matrix products of various sizes, including batched GEMMs
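To make the unfold-and-GEMM idea concrete, here is a minimal single-channel sketch in plain C++ (an illustration of the technique, not MagmaDNN's implementation): im2col copies each k x k input patch into a column, after which the convolution is a single GEMM of the flattened filter against the patch matrix.

    #include <vector>

    // Unfold ("im2col"): each kxk patch of an h x w input becomes a column,
    // so convolution with a kxk filter reduces to a GEMM. Single channel,
    // stride 1, no padding, row-major storage, for clarity.
    std::vector<float> im2col(const std::vector<float>& img, int h, int w, int k) {
        int oh = h - k + 1, ow = w - k + 1;              // output dimensions
        std::vector<float> cols((size_t)k * k * oh * ow);
        for (int i = 0; i < oh; ++i)
            for (int j = 0; j < ow; ++j)
                for (int di = 0; di < k; ++di)
                    for (int dj = 0; dj < k; ++dj)
                        // row (di*k + dj) of the patch matrix, column (i*ow + j)
                        cols[(size_t)(di * k + dj) * oh * ow + i * ow + j] =
                            img[(size_t)(i + di) * w + (j + dj)];
        return cols;
    }

    // Convolution as GEMM: out(1 x n) = filter(1 x kk) * cols(kk x n),
    // with kk = k*k and n = oh*ow.
    std::vector<float> conv_as_gemm(const std::vector<float>& filter,
                                    const std::vector<float>& cols,
                                    int kk, int n) {
        std::vector<float> out(n, 0.0f);
        for (int r = 0; r < kk; ++r)
            for (int c = 0; c < n; ++c)
                out[c] += filter[r] * cols[(size_t)r * n + c];
        return out;
    }

With many filters and many images, each of these GEMMs becomes one entry of a batched GEMM, which is where the batched kernels of this talk come in.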

SLIDE 33

Examples …

SLIDE 34

Problem sizes influence algorithms & optimization techniques

Used in High-order (HO) Finite Element Methods (FEM)

Code Generation

C++11 features will be used as much as possible. Additional needs will be handled by defining a domain-specific embedded language (DSEL). This technique is used in C++ to take advantage of DSL features while using the optimizations provided by a standard compiler. It will handle the generation of versions (index reordering, next) to be empirically evaluated as part of the autotuning framework.

Autotuning

We are developing fixed-size gemm kernels for GPUs, Xeon Phi, and multicore (see the figure on the right for a single core of an Intel Xeon E5-2620 and a K40) through an autotuning framework. A number of generic versions are developed and parameterized for performance. The parameters are autotuned (empirically) to find the “best” kernels for a specific size.

Tensor operations in high-order FEM

Consider the FE mass matrix ME for an element/zone E with weight ρ, as a 2-dimensional tensor:

    (ME)ij = Σk αk ρ(qk) |JE(qk)| φi(qk) φj(qk),   i, j = 1, …, nd,

where nd is the number of degrees of freedom, nq the number of quadrature points, {φi} the FE basis functions, {qk, αk} the quadrature points and weights, and |JE| the determinant of the element transformation. Take the nq x nd matrix B, Bki = φi(qk), and the diagonal matrix DE, (DE)kk = αk ρ(qk) |JE(qk)|. Then ME = BT DE B, or omitting the E subscript, M = BT D B. Using FE of order p, we have nd = O(p^d) and nq = O(p^d), so B is a dense O(p^d) x O(p^d) matrix. If the FE basis and the quadrature rule have tensor product structure, we can decompose the dof and quadrature point indices into logical coordinate axes, i = (i1, …, id), j = (j1, …, jd), k = (k1, …, kd), so Mij can be viewed as a 2d-dimensional tensor Mi1,…,id,j1,…,jd.
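As a plain-C++ illustration of the factorization M = BT D B above (a sketch under the definitions of B and D just given, not a library kernel):

    #include <vector>

    // Assemble the element mass matrix M = B^T D B (column-major storage).
    // B is nq x nd with B(k,i) = phi_i(q_k); d holds the diagonal of D,
    // d[k] = alpha_k * rho(q_k) * |J(q_k)|.
    std::vector<double> assemble_mass(const std::vector<double>& B,
                                      const std::vector<double>& d,
                                      int nq, int nd)
    {
        std::vector<double> M((size_t)nd * nd, 0.0);
        for (int j = 0; j < nd; ++j)
            for (int i = 0; i < nd; ++i)
                for (int k = 0; k < nq; ++k)
                    M[i + (size_t)j * nd] += B[k + (size_t)i * nq] * d[k] * B[k + (size_t)j * nq];
        return M;
    }

In practice this triple loop is a GEMM with a diagonal scaling folded in, which is how it maps onto the batched GEMMs discussed throughout this talk.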

Summary of kernels needed:

  • Assembly of M, referred to as equations (1) & (2) below
  • Evaluations of M times V, referred to as equations (3) & (4) below

Towards a High-Performance Tensor Algebra Package for Accelerators

M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, T. Kolev, I. Masliah, and S. Tomov

Abstract

Numerous important applications, e.g., high-order FEM simulations, can be expressed through tensors. Examples are the computation of FE matrices and SpMV products expressed as generalized tensor contractions. Contractions by the first index can often be represented as tensor index reordering plus gemm, which is a key factor in achieving high performance. We present ongoing work on the design of a high-performance package in MAGMA for tensor algebra that includes techniques to organize tensor contractions, data storage, and parametrization related to the batched execution of large numbers of small tensor contractions. We apply autotuning and code generation techniques to provide an architecture-aware, user-friendly interface.

Motivation

Numerous important applications can be expressed through tensors:

  • High-order FEM simulations
  • Signal Processing
  • Numerical Linear Algebra
  • Numerical Analysis
  • Data Mining
  • Deep Learning
  • Graph Analysis
  • Neuroscience and more

The goal is to design a:

  • High-performance package for Tensor algebra
  • Built-in architecture-awareness (GPU, Xeon Phi, multicore)
  • User-friendly interface

Example cases

Numerical linear algebra:

  • A 4-dimensional tensor contraction
  • Rank-k update on matrices in tile format (k can be small, e.g., sub-vector/warp size)
  • Must determine (in software) whether it is possible to do it through batched GEMM kernels

[1] V. Dobrev, T. Kolev, R. Rieben. High order curvilinear finite element methods for Lagrangian hydrodynamics. SIAM J. Sci. Comp. 34(5), B606–B641. (36 pages)

APPROACH AND RESULTS

User-friendly interface

We provide various interfaces, including one using C++11. The top-level design provides features similar to the mshadow library: https://github.com/dmlc/mshadow

Index reordering/reshape

If we store tensors as column-wise 1D arrays, M can be interpreted as a 4th-order tensor, an nd x nd matrix, or a vector of size nd^2, without changing the storage. In general, an n1 x … x nr tensor can be reshaped into an m1 x … x mq tensor as long as n1…nr = m1…mq and, for corresponding indices i1..r and j1..q,

    i1 + n1 i2 + … + n1 n2 … nr-1 ir = j1 + m1 j2 + … + m1 m2 … mq-1 jq.

Contractions can be implemented as a sequence of pairwise contractions. There is enough complexity here to search for something better: code generation, index reordering, and autotuning will be used; e.g., contractions (3a)–(4f) can be implemented as tensor index reordering plus a gemm A, B -> ATB.
// Our current interface:
// create a 2 x 5 x 2 float tensor; default locality is cpu,
// using std::vector as the default backend for the data
Tensor<2,5,2> ts;
// create a 2 x 5 x 2 tensor on the gpu, using thrust as the default backend for the data
Tensor<2,5,2,gpu_> d_ts;
// Call a thrust function to set values to 9
thrust::fill(d_ts.begin(), d_ts.end(), 9);
// Send back values to the cpu tensor
ts = d_ts;
// Reorder the 2 x 5 x 2 tensor into a 2 x 10 matrix using views
view<2,10> mat = ts;


Batched LA

Tensor contractions are transformed through reshapes into batched LA operations, many of which are available in MAGMA [2] (http://icl.cs.utk.edu/magma/), including LU, QR, Cholesky, GEMM, GEMV, TRSM, SYRK.

[2] A. Haidar, T. Dong, S. Tomov, P. Luszczek, and J. Dongarra. A framework for batched and GPU-resident factorization algorithms applied to block Householder transformations. ISC High Performance 2015, Frankfurt, Germany, July 12-16, 2015.

Conclusions and Future directions

  • A high-performance package for tensor algebra has the potential for high impact on a number of important applications
  • Multidisciplinary effort
  • Current results show promising performance; various components will be leveraged from autotuned MAGMA Batched linear algebra kernels and BLAST from LLNL
  • This is ongoing work

Figure: Batched dgemms on a K40 GPU, batch count 2,000. MAGMA exceeds cuBLAS in performance for “small” sizes, currently tuned for sizes above 32. Current work concentrates on kernels for fixed smaller (sub-warp) sizes.

Gatlinburg, Tennessee, Aug 31 – Sept 2, 2015. http://computing.ornl.gov/workshops/SMC15/

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

LLNL release number LLNL-POST-676632

ICL's work on this material was supported by the National Science Foundation under Grant ACI-1339822, the Department of Energy, and NVIDIA.

Tensor contractions for high-order FEM
http://ceed.exascaleproject.org/

Matrix-free basis evaluation needs efficient tensor contractions. Within the ECP CEED project, we designed batched methods to split the computation into many small high-intensity GEMMs, grouped together (batched) for efficient execution:

    C(i1,i2,i3) = Σk A(k,i1) · B(k,i2,i3),  computed as  Batch_{ C_i3 = A^T B_i3, for a range of i3 }

SLIDE 35

http://ceed.exascaleproject.org/

Tensor contractions for high-order FEM

Reference: A. Abdelfattah, M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, Tz. Kolev, I. Masliah, S. Tomov. High-Performance Tensor Contractions for GPUs. ICCS 2016, San Diego, CA, June 6–8, 2016.

SLIDE 36

http://ceed.exascaleproject.org/

Tensor contractions for high-order FEM

Fused kernels: [comparison of non-fused vs. fused kernel implementations]

SLIDE 37

http://ceed.exascaleproject.org/

Tensor contractions for high-order FEM

Fused kernels:

[Figure: Gflop/s of batched B^T D (B A B^T) B contractions in double precision on a Tesla V100 GPU – cuBLAS vs. MAGMA vs. MAGMA-Fused – for sizes and batch counts from (2x2) x 1,048,576 to (10x9) x 8,192, with speedup annotations ranging from 11.69x to 100.28x.]

Performance on realistic (application) sizes. Device BLAS interfaces for fused Batched BLAS.

SLIDE 38

Collaborators and Support

MAGMA team

http://icl.cs.utk.edu/magma

Collaborating partners

University of Tennessee, Knoxville
Lawrence Livermore National Laboratory, Livermore, CA
LLNL-led ECP CEED: Center for Efficient Exascale Discretizations
University of Manchester, Manchester, UK
University of Paris-Sud, France
INRIA, France

PLASMA team

http://icl.cs.utk.edu/plasma