Energy Consumption Evaluation for Krylov Methods on a Cluster of GPU Accelerators
Serge G. Petiton (a) and Langshi Chen (b)
Paris-Saclay, France, April 6, 2016
(a) Université de Lille, Sciences et Technologies and CNRS, Maison de la Simulation, Paris-Saclay, France
(b) School of Informatics and Computing, Indiana University Bloomington, USA
This work was supported by the HPC Centre Champagne-Ardenne ROMEO
Outline • Introduction • Krylov Method, GMRES as an example • Energy consumption evaluation for Krylov methods • Conclusion April 6, 2016 GTC 2
Outline • Introduction • Krylov Method, GMRES as an example • Energy consumption evaluation for Krylov methods • Conclusion April 6, 2016 GTC 3
With new programming paradigms and languages, extreme computing (exascale and beyond) will have to face several critical problems, such as the following:
• Minimize the global computing time
• Accelerate the convergence, using a good preconditioner
• Maintain (at least) numerical stability
• Minimize the number of communications (optimized Ax, asynchronous computation, communication compiler and mapper, …)
• Minimize the number of large-size scalar products
• Minimize memory space, optimize cache usage, …
• Select the best sparse-matrix compressed format
• Use mixed arithmetic
• Minimize energy consumption
• ……
The goal of this talk is to illustrate that we need "smart" auto-tuning of several parameters to minimize the computing time and the energy consumption of intelligent linear algebra methods, in order to create the next generation of high-performance numerical software for extreme computing.
April 6, 2016 GTC 4
Outline • Introduction • Krylov Method, GMRES as an example • Energy consumption evaluation for Krylov methods • Conclusion April 6, 2016 GTC 5
GMRES example: about memory space, dot products and sparse matrix-vector multiplication
A: matrix of size n x n; Krylov basis: matrix of size n x m.
Memory space:
• sparse matrix: nnz elements
• Krylov basis vectors: n x m
• Hessenberg matrix: m x m
Scalar products, at j fixed:
• sparse matrix-vector product: n dot products of size C
• orthogonalization: j dot products of size n
m, the subspace size, may be auto-tuned at runtime to minimize the memory space occupation and the number of scalar products, with better or approximately the same convergence behavior.
GMRES example: about memory space, dot products and sparse matrix-vector multiplication
Per iteration, for a subspace of size m: 1 matrix-vector multiplication and j scalar products; subspace computation: O(m^3).
Memory space:
• sparse matrix: nnz elements
• Krylov basis vectors: n x m
• Hessenberg matrix: m x m
Scalar products, at j fixed:
• sparse matrix-vector product: n dot products of size C
• orthogonalization: j dot products of size n
m, the subspace size, may be auto-tuned at runtime to minimize the memory space occupation and the number of scalar products, with better or approximately the same convergence behavior.
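To make the cost breakdown above concrete, here is a minimal sketch (not the authors' code) of the Arnoldi process at the heart of restarted GMRES(m), assuming a SciPy sparse matrix and NumPy vectors; the comments mark where the one sparse matrix-vector product and the j dot products of size n occur at each iteration, and where the n x m basis and the m x m Hessenberg matrix are stored.

```python
# Illustrative sketch of Arnoldi with modified Gram-Schmidt inside GMRES(m).
import numpy as np
import scipy.sparse as sp

def arnoldi_mgs(A, r0, m):
    """Build an orthonormal Krylov basis V (n x (m+1)) and Hessenberg matrix H ((m+1) x m)."""
    n = A.shape[0]
    V = np.zeros((n, m + 1))           # Krylov basis vectors: ~ n * m memory
    H = np.zeros((m + 1, m))           # Hessenberg matrix: ~ m * m memory
    V[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(m):
        w = A @ V[:, j]                # 1 sparse matrix-vector product: n dot products of size C
        for i in range(j + 1):         # full orthogonalization: j+1 dot products of size n
            H[i, j] = np.dot(V[:, i], w)
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0.0:         # happy breakdown: exact solution already in the subspace
            return V[:, :j + 1], H[:j + 2, :j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

# Example usage on a small random sparse system (illustrative sizes only):
A = sp.random(1000, 1000, density=1e-2, format="csr") + sp.identity(1000, format="csr")
r0 = np.random.rand(1000)
V, H = arnoldi_mgs(A, r0, m=30)
```

Growing m captures more information per restart cycle, but it also grows the n x m basis storage and the number of dot products, which is exactly the tradeoff the runtime auto-tuning of m targets.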
GMRES: about memory space and dot products
Incomplete orthogonalization (Y. Saad): orthogonalize the new vector only against the most recent basis vectors, i.e. i from max(1, j-q) to j, with q > 0. The Hessenberg matrix then has only about q+1 bands.
Memory space:
• sparse matrix: nnz elements (i.e. < C n)
• Krylov basis vectors: n x m
• Hessenberg matrix: m x m
Scalar products, at j fixed:
• sparse matrix-vector product: n dot products of size C
• orthogonalization: m dot products of size n
m, the subspace size, may be auto-tuned at runtime to minimize the memory space occupation and the number of scalar products, with better or approximately the same convergence behavior. The number of vectors orthogonalized against the new one may also be auto-tuned at runtime. The subspace size may be large!
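A hedged sketch of the incomplete-orthogonalization variant described above: the only change with respect to the full Arnoldi sketch is the inner loop, which now runs over the most recent basis vectors (following the slide's i from max(1, j-q) to j), so the per-iteration dot-product count is bounded by q+1 instead of growing with j.

```python
# Illustrative sketch of Arnoldi with incomplete orthogonalization (Saad).
import numpy as np

def arnoldi_incomplete(A, r0, m, q):
    """Orthogonalize each new vector against at most the q+1 most recent basis vectors."""
    n = A.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))                   # banded: each column has at most q+2 nonzeros
    V[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(m):
        w = A @ V[:, j]                        # still 1 sparse matrix-vector product per iteration
        for i in range(max(0, j - q), j + 1):  # at most q+1 dot products of size n (full GMRES: j+1)
            H[i, j] = np.dot(V[:, i], w)
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0.0:
            return V[:, :j + 1], H[:j + 2, :j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H
```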
GMRES: about memory space and dot products
Memory space:
• sparse matrix: nnz elements (i.e. < C n)
• Krylov basis vectors: n x m
• Hessenberg matrix: m x m
Scalar products, at j fixed:
• sparse matrix-vector product: n dot products of size C
• orthogonalization: m dot products of size n
[see papers with P.-Y. Aquilanti (TOTAL) and T. Katagiri (U. Tokyo)]
Other technique: the so-called "Communication Avoiding" (CA) approach: first compute a non-orthogonal basis, then use TSQR to orthogonalize the vectors.
GMRES: about memory space and dot products
What about the energy consumption of these different versions? Is the fastest version also the most energy efficient? Does a unique solution exist that minimizes all these criteria (computing time, energy, memory space, …)?
Memory space:
• sparse matrix: nnz elements (i.e. < C n)
• Krylov basis vectors: n x m
• Hessenberg matrix: m x m
Scalar products, at j fixed:
• sparse matrix-vector product: n dot products of size C
• orthogonalization: m dot products of size n
[see papers with P.-Y. Aquilanti (TOTAL) and T. Katagiri (U. Tokyo)]
Other technique: the so-called "Communication Avoiding" (CA) approach: first compute a non-orthogonal basis, then use TSQR to orthogonalize the vectors.
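A minimal sketch of the communication-avoiding idea mentioned above, assuming a monomial basis for simplicity (Newton or Chebyshev bases are commonly used in practice for numerical stability): the s basis vectors are generated with sparse matrix-vector products only, and all dot products are deferred to a single block orthogonalization; NumPy's QR stands in here for a distributed TSQR.

```python
# Illustrative sketch of the CA idea: non-orthogonal basis first, one block QR afterwards.
import numpy as np

def ca_basis_qr(A, r0, s):
    """Generate s+1 Krylov basis vectors without interleaved dot products, then orthogonalize once."""
    n = A.shape[0]
    W = np.zeros((n, s + 1))
    W[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(s):
        W[:, j + 1] = A @ W[:, j]      # s sparse matrix-vector products, no dot products in between
    Q, R = np.linalg.qr(W)             # one block orthogonalization (a distributed TSQR in practice)
    return Q, R
```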
Different Orthogonalizations April 6, 2016 GTC 11
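The comparison figure on this slide is not recoverable from the extraction. As an assumption about what is typically contrasted under "different orthogonalizations" in GMRES, the sketch below shows a classical versus a modified Gram-Schmidt step: CGS performs all dot products against the new vector as one block operation (a single global reduction in parallel), while MGS interleaves them (one reduction per basis vector) and is numerically more robust, which is one instance of the time/communication/stability tradeoffs this talk is concerned with.

```python
# Illustrative sketch of two orthogonalization variants for one new vector w against a basis V.
import numpy as np

def cgs_step(V, w):
    """Classical Gram-Schmidt: one block of dot products, then one update."""
    h = V.T @ w                          # all dot products at once -> a single global reduction
    w = w - V @ h
    return w, h

def mgs_step(V, w):
    """Modified Gram-Schmidt: dot products one at a time, numerically more robust."""
    h = np.zeros(V.shape[1])
    for i in range(V.shape[1]):
        h[i] = np.dot(V[:, i], w)        # one reduction per basis vector
        w = w - h[i] * V[:, i]
    return w, h
```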
Outline • Introduction • Krylov Method, GMRES as an example • Energy consumption evaluation for Krylov methods • Conclusion April 6, 2016 GTC 12
[Slides 13-21: measured performance and energy-consumption results for the Krylov/GMRES variants on the GPU cluster; the result figures, including the slide titled "GMRES" (slide 21), are not recoverable from this text extraction.]
Outline • Introduction • Krylov Method, GMRES as an example • Energy consumption evaluation for Krylov methods • Conclusion April 6, 2016 GTC 22
Conclusion
• We have to find a tradeoff between several minimizations: time per iteration, global time to convergence (i.e. number of iterations), accuracy, memory space, cache utilization, and energy consumption.
• Optimizing some parameters of the architecture, numerical method, algorithm, parallelism, memory space, multi-core utilization, … would not lead to a unique solution.
• End users would have to decide which criteria to minimize.
• Expertise from end users would be exploited through new high-level languages and/or frameworks (YML, PGAS, …), cf. yml.prism.uvsq.fr.
• We have to analyse auto-tuned numerical methods to find new criteria to evaluate the quality of the convergence and to decide on actions. The method may decide to compute other parameters just to take these decisions, and the decisions themselves may be learnt (linear algebra learning?), leading to intelligent linear algebra; a hedged sketch of such a runtime decision follows after this slide.
April 6, 2016 GTC 23
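As an illustration of the kind of runtime decision the conclusion argues for (not from the talk), the sketch below picks a restart size m among candidates by minimizing a user-weighted combination of measured time, energy, and memory; measure_run and the weights are hypothetical placeholders for real instrumentation (for example, GPU power readings via NVML).

```python
# Illustrative sketch of a multi-criteria auto-tuning decision for the GMRES restart size m.
def choose_restart(candidates, measure_run, w_time=1.0, w_energy=1.0, w_memory=0.0):
    """measure_run(m) is a hypothetical probe returning (seconds, joules, bytes) for a few restart cycles."""
    best_m, best_score = None, float("inf")
    for m in candidates:
        seconds, joules, nbytes = measure_run(m)
        score = w_time * seconds + w_energy * joules + w_memory * nbytes
        if score < best_score:
            best_m, best_score = m, score
    return best_m

# Hypothetical usage: m = choose_restart([16, 32, 64, 128], measure_run=my_probe)
```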