

SLIDE 1

Dense Triangular Solvers on Multicore Clusters using UPC

Jorge González-Domínguez*, María J. Martín, Guillermo L. Taboada, Juan Touriño

Computer Architecture Group University of A Coruña (Spain) {jgonzalezd,mariam,taboada,juan}@udc.es

International Conference on Computational Science ICCS 2011

SLIDE 2

1. Introduction
2. BLAS2 Triangular Solver
3. BLAS3 Triangular Solver
4. Experimental Evaluation
5. Conclusions

SLIDE 3

Outline (current section: 1. Introduction)

SLIDE 4

UPC: a Suitable Alternative for HPC in the Multicore Era

Programming models:
- Traditionally: shared/distributed memory programming models
- Challenge: hybrid memory architectures → PGAS (Partitioned Global Address Space)

PGAS languages:
- UPC → C
- Titanium → Java
- Co-Array Fortran → Fortran

UPC compilers:
- Berkeley UPC
- GCC UPC (Intrepid)
- MuPC (Michigan Tech)
- HP, Cray and IBM UPC compilers

SLIDE 5

Studied Numerical Operations

BLAS libraries:
- Basic Linear Algebra Subprograms: specification of a set of numerical functions
- Widely used by scientists and engineers
- Related efforts: SparseBLAS and PBLAS (Parallel BLAS)
- This work: development of UPCBLAS

Related routines:
- gemv: matrix-vector product (y ← α ∗ A ∗ x + β ∗ y)
- gemm: matrix-matrix product (C ← α ∗ A ∗ B + β ∗ C)

Studied routines:
- trsv: BLAS2 triangular solver (M ∗ x = b)
- trsm: BLAS3 triangular solver (M ∗ X = B)
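For concreteness, a minimal sketch of the two studied operations through the standard CBLAS interface, i.e. the sequential kernels the parallel UPC routines build on (the UPCBLAS API itself is not shown; the lower-triangular, non-unit-diagonal, row-major case is an assumption for the example):

    #include <cblas.h>

    /* Sketch: the two studied solvers for a lower triangular n x n matrix M,
       stored row-major with leading dimension n. */
    void triangular_solve_examples(int n, int nrhs,
                                   const double *M, double *b, double *B)
    {
        /* trsv (BLAS2): solve M * x = b; the solution x overwrites b */
        cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                    n, M, n, b, 1);

        /* trsm (BLAS3): solve M * X = B for nrhs right-hand sides;
           X overwrites B (ldb = nrhs for row-major storage) */
        cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasNonUnit, n, nrhs, 1.0, M, n, B, nrhs);
    }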

SLIDE 6

Outline (current section: 2. BLAS2 Triangular Solver)

SLIDE 7

Parallel BLAS2 Triangular Solver (M ∗ x = b) (I)

Example: 8×8 matrix, 2 threads, 2 rows per block.

Types of blocks (block Mij):
- i < j: zero matrix
- i = j: triangular matrix
- i > j: square matrix
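In code terms, the distribution and the block classification could look like this minimal C sketch (the cyclic owner formula is an assumption consistent with the traces on the following slides; a lower triangular M is assumed):

    /* Block-cyclic distribution by rows: with 1-based block indices,
       block row i belongs to thread (i-1) % THREADS.  For the 8x8 example
       (2 threads, 2 rows per block): thread 0 owns block rows 1 and 3,
       thread 1 owns block rows 2 and 4. */
    int owner_of_block_row(int i, int nthreads) { return (i - 1) % nthreads; }

    /* Classification of block Mij of a lower triangular matrix */
    typedef enum { ZERO_BLOCK, TRIANGULAR_BLOCK, SQUARE_BLOCK } block_kind;

    block_kind classify_block(int i, int j)
    {
        if (i < j)  return ZERO_BLOCK;        /* above the diagonal: skipped      */
        if (i == j) return TRIANGULAR_BLOCK;  /* diagonal: small triangular solve */
        return SQUARE_BLOCK;                  /* below: dense, used for updates   */
    }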

SLIDE 8

Parallel BLAS2 Triangular Solver (M ∗ x = b) (II)

THREAD 0 → trsv(M11,x1,b1)

SLIDE 9

Parallel BLAS2 Triangular Solver (M ∗ x = b) (III)

THREAD 0 → gemv(M31,x1,b3)
THREAD 1 → gemv(M21,x1,b2) → trsv(M22,x2,b2) → gemv(M41,x1,b4)

SLIDE 10

Parallel BLAS2 Triangular Solver (M ∗ x = b) (IV)

THREAD 0 → gemv(M32,x2,b3) → trsv(M33,x3,b3)
THREAD 1 → gemv(M42,x2,b4)

SLIDE 11

Parallel BLAS2 Triangular Solver (M ∗ x = b) (V)

THREAD 1 → gemv(M43,x3,b4) → trsv(M44,x4,b4)
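The schedule traced on slides 8-11 generalizes to the loop sketched below. This is a simplified sketch, not the UPCBLAS implementation: owner(), trsv_block() and gemv_update() are hypothetical helpers, and synchronization is coarsened to barriers (the traced schedule is more asynchronous, since the owner of the next diagonal block starts its trsv as soon as its own right-hand side block is up to date):

    #include <upc.h>

    extern int  owner(int i);              /* thread owning block row i (hypothetical) */
    extern void trsv_block(int k);         /* x_k = solve(M_kk, b_k)     (hypothetical) */
    extern void gemv_update(int i, int k); /* b_i -= M_ik * x_k          (hypothetical) */

    void blocked_trsv(int nblocks)
    {
        for (int k = 1; k <= nblocks; k++) {
            /* The owner of block row k solves the diagonal block: x_k becomes known */
            if (MYTHREAD == owner(k))
                trsv_block(k);
            upc_barrier;                   /* publish x_k to all threads */

            /* Each thread updates the right-hand side blocks it owns below row k */
            for (int i = k + 1; i <= nblocks; i++)
                if (MYTHREAD == owner(i))
                    gemv_update(i, k);
            upc_barrier;                   /* b_(k+1) complete before the next solve */
        }
    }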

SLIDE 12

Parallel BLAS2 Triangular Solver (M ∗ x = b) (and VI)

Impact of the block size: the more blocks the matrix is divided into, the more...
- computations can be performed simultaneously (↑ performance)
- synchronizations are needed (↓ performance)
The best block size is determined automatically (details in the paper).

SLIDE 13

Outline (current section: 3. BLAS3 Triangular Solver)

SLIDE 14

Parallel BLAS3 Triangular Solver (M ∗ X = B) (I)

Studied distributions:
- Triangular and dense matrices distributed by rows → same approach as the BLAS2 solver, but replacing the sequential kernels gemv → gemm and trsv → trsm (see the sketch after this list)
- Dense matrices distributed by columns
- Triangular and dense matrices with a 2D distribution (multicore-aware)
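In the row-distributed BLAS3 case the loop structure of the BLAS2 sketch (slide 11) carries over unchanged; only the hypothetical sequential helpers swap kernels. For example, the gemv update becomes a gemm update over nrhs right-hand sides (block sizes and names are illustrative):

    #include <cblas.h>

    /* BLAS3 update: B_i -= M_ik * X_k, where M_ik is bs x bs,
       X_k and B_i are bs x nrhs, all stored row-major. */
    void gemm_update(int bs, int nrhs, const double *M_ik,
                     const double *X_k, double *B_i)
    {
        /* B_i = -1.0 * M_ik * X_k + 1.0 * B_i */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    bs, nrhs, bs, -1.0, M_ik, bs, X_k, nrhs, 1.0, B_i, nrhs);
    }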

SLIDE 15

Parallel BLAS3 Triangular Solver (M ∗ X = B) (II)

Dense Matrices Distributed by Columns
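The appeal of the column distribution is that, once every thread can read the whole triangular matrix M, each thread solves its own block of columns of B independently (M ∗ X_j = B_j), with no synchronization between threads. A minimal sketch, assuming a replicated (or locally accessible) M and row-major storage, with B_local holding this thread's columns contiguously (names are illustrative):

    #include <cblas.h>

    /* Each thread solves M * X_j = B_j for its own ncols_local columns.
       M is m x m lower triangular; B_local is m x ncols_local. */
    void trsm_by_columns(int m, int ncols_local,
                         const double *M, double *B_local)
    {
        cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasNonUnit, m, ncols_local, 1.0, M, m,
                    B_local, ncols_local);
    }

The trade-off is memory: keeping a full copy of M per thread (presumably the M_rep variant in the evaluation) removes communication at the cost of a larger footprint.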

SLIDE 16

Parallel BLAS3 Triangular Solver (M ∗ X = B) (and III)

Triangular and dense matrices with a 2D distribution (multicore-aware):
- Node 1 → cores 0 & 1
- Node 2 → cores 2 & 3
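A minimal sketch of the thread-to-node mapping the example implies (assuming consecutive UPC threads are placed on the same node and that THREADS_PER_NODE describes that placement; both are assumptions for illustration):

    #include <upc.h>

    #define THREADS_PER_NODE 2  /* assumed placement: threads 0,1 -> node 1; 2,3 -> node 2 */

    /* 2D distribution indices: block rows of M assigned per node, columns of
       B/X split among the cores of each node, so the cores cooperating on a
       row block also share that node's memory. */
    int my_row_block(void) { return MYTHREAD / THREADS_PER_NODE; }  /* node index */
    int my_col_block(void) { return MYTHREAD % THREADS_PER_NODE; }  /* core within node */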

SLIDE 17

Outline (current section: 4. Experimental Evaluation)

SLIDE 18

Evaluation of the BLAS2 Triangular Solver

Departmental cluster; InfiniBand; 8 nodes; 2 threads/node

[Figure: speedup vs. number of threads (2, 4, 8, 16), m = 30000; series: UPC, ScaLAPACK]

SLIDE 19

Evaluation of the BLAS3 Triangular Solver (I)

Finis Terrae supercomputer; InfiniBand; 32 nodes; 4 threads/node

[Figure: speedup vs. number of threads (2 to 128), Itanium2 supercomputer, m = 12000, n = 12000; series: UPC M_dist, UPC M_rep, UPC multi, ScaLAPACK]

SLIDE 20

Evaluation of the BLAS3 Triangular Solver (II)

Finis Terrae supercomputer; InfiniBand; 32 nodes; 4 threads/node

[Figure: speedup vs. number of threads (2 to 128), Itanium2 supercomputer, m = 15000, n = 4000; series: UPC M_dist, UPC M_rep, UPC multi, ScaLAPACK]

SLIDE 21

Evaluation of the BLAS3 Triangular Solver (and III)

Finis Terrae supercomputer; InfiniBand; 32 nodes; 4 threads/node

[Figure: speedup vs. number of threads (2 to 128), Itanium2 supercomputer, m = 8000, n = 25000; series: UPC M_dist, UPC M_rep, UPC multi, ScaLAPACK]

SLIDE 22

Outline (current section: 5. Conclusions)

SLIDE 23

Main Conclusions

Summary:
- Implementation of BLAS triangular solvers for UPC
- Several techniques to improve their performance
- Special effort to find the most appropriate data distributions

BLAS2 → block-cyclic distribution by rows, with the block size automatically determined according to the characteristics of the scenario

BLAS3 → distribution chosen depending on the memory constraints

Comparison with ScaLAPACK (MPI):
- UPC easier to use
- Similar or better performance

SLIDE 24

Dense Triangular Solvers on Multicore Clusters using UPC

Jorge González-Domínguez*, María J. Martín, Guillermo L. Taboada, Juan Touriño

Computer Architecture Group University of A Coruña (Spain) {jgonzalezd,mariam,taboada,juan}@udc.es

International Conference on Computational Science ICCS 2011
