

SLIDE 1

Dense Triangular Solvers on Multicore Clusters using UPC

Jorge González-Domínguez*, María J. Martín, Guillermo L. Taboada, Juan Touriño

Computer Architecture Group University of A Coruña (Spain) {jgonzalezd,mariam,taboada,juan}@udc.es

International Conference on Computational Science ICCS 2011

SLIDE 2

1. Introduction
2. BLAS2 Triangular Solver
3. BLAS3 Triangular Solver
4. Experimental Evaluation
5. Conclusions

SLIDE 3

Outline (current section: 1. Introduction)

SLIDE 4

UPC: a Suitable Alternative for HPC in the Multicore Era

Programming models:
- Traditionally: shared/distributed memory programming models
- Challenge: hybrid memory architectures → PGAS (Partitioned Global Address Space)

PGAS languages:
- UPC → C
- Titanium → Java
- Co-Array Fortran → Fortran

UPC compilers:
- Berkeley UPC
- GCC UPC (Intrepid)
- MuPC (Michigan Tech)
- HP, Cray and IBM UPC compilers

SLIDE 5

Studied Numerical Operations

BLAS libraries:
- Basic Linear Algebra Subprograms: specification of a set of numerical functions
- Widely used by scientists and engineers
- Related efforts: SparseBLAS and PBLAS (Parallel BLAS)
- This work: development of UPCBLAS

Related routines:
- gemv: matrix-vector product (y ← α ∗ A ∗ x + β ∗ y)
- gemm: matrix-matrix product (C ← α ∗ A ∗ B + β ∗ C)

Studied routines:
- trsv: BLAS2 triangular solver (M ∗ x = b)
- trsm: BLAS3 triangular solver (M ∗ X = B)
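For concreteness, a minimal sketch of the two studied operations through the standard CBLAS interface, i.e. the sequential kernels the parallel UPC routines build on (the UPCBLAS API itself is not shown; the lower-triangular, non-unit-diagonal, row-major case is an assumption for the example):

    #include <cblas.h>

    /* Sketch: the two studied solvers for a lower triangular n x n matrix M,
       stored row-major with leading dimension n. */
    void triangular_solve_examples(int n, int nrhs,
                                   const double *M, double *b, double *B)
    {
        /* trsv (BLAS2): solve M * x = b; the solution x overwrites b */
        cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                    n, M, n, b, 1);

        /* trsm (BLAS3): solve M * X = B for nrhs right-hand sides;
           X overwrites B (ldb = nrhs for row-major storage) */
        cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasNonUnit, n, nrhs, 1.0, M, n, B, nrhs);
    }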

SLIDE 6

Outline (current section: 2. BLAS2 Triangular Solver)

SLIDE 7

Parallel BLAS2 Triangular Solver (M ∗ x = b) (I)

Example: 8×8 matrix, 2 threads, 2 rows per block.

Types of blocks (block Mij):
- i < j: zero matrix
- i = j: triangular matrix
- i > j: square matrix
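In code terms, the distribution and the block classification could look like this minimal C sketch (the cyclic owner formula is an assumption consistent with the traces on the following slides; a lower triangular M is assumed):

    /* Block-cyclic distribution by rows: with 1-based block indices,
       block row i belongs to thread (i-1) % THREADS.  For the 8x8 example
       (2 threads, 2 rows per block): thread 0 owns block rows 1 and 3,
       thread 1 owns block rows 2 and 4. */
    int owner_of_block_row(int i, int nthreads) { return (i - 1) % nthreads; }

    /* Classification of block Mij of a lower triangular matrix */
    typedef enum { ZERO_BLOCK, TRIANGULAR_BLOCK, SQUARE_BLOCK } block_kind;

    block_kind classify_block(int i, int j)
    {
        if (i < j)  return ZERO_BLOCK;        /* above the diagonal: skipped      */
        if (i == j) return TRIANGULAR_BLOCK;  /* diagonal: small triangular solve */
        return SQUARE_BLOCK;                  /* below: dense, used for updates   */
    }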

SLIDE 8

Parallel BLAS2 Triangular Solver (M ∗ x = b) (II)

THREAD 0 → trsv(M11,x1,b1)

SLIDE 9

Parallel BLAS2 Triangular Solver (M ∗ x = b) (III)

THREAD 0 → gemv(M31,x1,b3)
THREAD 1 → gemv(M21,x1,b2) → trsv(M22,x2,b2) → gemv(M41,x1,b4)

SLIDE 10

Parallel BLAS2 Triangular Solver (M ∗ x = b) (IV)

THREAD 0 → gemv(M32,x2,b3) → trsv(M33,x3,b3)
THREAD 1 → gemv(M42,x2,b4)

SLIDE 11

Parallel BLAS2 Triangular Solver (M ∗ x = b) (V)

THREAD 1 → gemv(M43,x3,b4) → trsv(M44,x4,b4)
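The schedule traced on slides 8-11 generalizes to the loop sketched below. This is a simplified sketch, not the UPCBLAS implementation: owner(), trsv_block() and gemv_update() are hypothetical helpers, and synchronization is coarsened to barriers (the traced schedule is more asynchronous, since the owner of the next diagonal block starts its trsv as soon as its own right-hand side block is up to date):

    #include <upc.h>

    extern int  owner(int i);              /* thread owning block row i (hypothetical) */
    extern void trsv_block(int k);         /* x_k = solve(M_kk, b_k)     (hypothetical) */
    extern void gemv_update(int i, int k); /* b_i -= M_ik * x_k          (hypothetical) */

    void blocked_trsv(int nblocks)
    {
        for (int k = 1; k <= nblocks; k++) {
            /* The owner of block row k solves the diagonal block: x_k becomes known */
            if (MYTHREAD == owner(k))
                trsv_block(k);
            upc_barrier;                   /* publish x_k to all threads */

            /* Each thread updates the right-hand side blocks it owns below row k */
            for (int i = k + 1; i <= nblocks; i++)
                if (MYTHREAD == owner(i))
                    gemv_update(i, k);
            upc_barrier;                   /* b_(k+1) complete before the next solve */
        }
    }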

SLIDE 12

Parallel BLAS2 Triangular Solver (M ∗ x = b) (and VI)

Impact of the block size: the more blocks the matrix is divided into, the more...
- computations can be performed simultaneously (↑ performance)
- synchronizations are needed (↓ performance)
The best block size is determined automatically (details in the paper).

SLIDE 13

Outline (current section: 3. BLAS3 Triangular Solver)

SLIDE 14

Parallel BLAS3 Triangular Solver (M ∗ X = B) (I)

Studied distributions:
- Triangular and dense matrices distributed by rows → same approach as the BLAS2 solver, but replacing the sequential kernels gemv → gemm and trsv → trsm (see the sketch after this list)
- Dense matrices distributed by columns
- Triangular and dense matrices with a 2D distribution (multicore-aware)
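In the row-distributed BLAS3 case the loop structure of the BLAS2 sketch (slide 11) carries over unchanged; only the hypothetical sequential helpers swap kernels. For example, the gemv update becomes a gemm update over nrhs right-hand sides (block sizes and names are illustrative):

    #include <cblas.h>

    /* BLAS3 update: B_i -= M_ik * X_k, where M_ik is bs x bs,
       X_k and B_i are bs x nrhs, all stored row-major. */
    void gemm_update(int bs, int nrhs, const double *M_ik,
                     const double *X_k, double *B_i)
    {
        /* B_i = -1.0 * M_ik * X_k + 1.0 * B_i */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    bs, nrhs, bs, -1.0, M_ik, bs, X_k, nrhs, 1.0, B_i, nrhs);
    }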

SLIDE 15

Parallel BLAS3 Triangular Solver (M ∗ X = B) (II)

Dense Matrices Distributed by Columns
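The appeal of the column distribution is that, once every thread can read the whole triangular matrix M, each thread solves its own block of columns of B independently (M ∗ X_j = B_j), with no synchronization between threads. A minimal sketch, assuming a replicated (or locally accessible) M and row-major storage, with B_local holding this thread's columns contiguously (names are illustrative):

    #include <cblas.h>

    /* Each thread solves M * X_j = B_j for its own ncols_local columns.
       M is m x m lower triangular; B_local is m x ncols_local. */
    void trsm_by_columns(int m, int ncols_local,
                         const double *M, double *B_local)
    {
        cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasNonUnit, m, ncols_local, 1.0, M, m,
                    B_local, ncols_local);
    }

The trade-off is memory: keeping a full copy of M per thread (presumably the M_rep variant in the evaluation) removes communication at the cost of a larger footprint.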

SLIDE 16

Parallel BLAS3 Triangular Solver (M ∗ X = B) (and III)

Triangular and dense matrices with a 2D distribution (multicore-aware):
- Node 1 → cores 0 & 1
- Node 2 → cores 2 & 3
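A minimal sketch of the thread-to-node mapping the example implies (assuming consecutive UPC threads are placed on the same node and that THREADS_PER_NODE describes that placement; both are assumptions for illustration):

    #include <upc.h>

    #define THREADS_PER_NODE 2  /* assumed placement: threads 0,1 -> node 1; 2,3 -> node 2 */

    /* 2D distribution indices: block rows of M assigned per node, columns of
       B/X split among the cores of each node, so the cores cooperating on a
       row block also share that node's memory. */
    int my_row_block(void) { return MYTHREAD / THREADS_PER_NODE; }  /* node index */
    int my_col_block(void) { return MYTHREAD % THREADS_PER_NODE; }  /* core within node */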

SLIDE 17

Outline (current section: 4. Experimental Evaluation)

SLIDE 18

Evaluation of the BLAS2 Triangular Solver

Departmental cluster; InfiniBand; 8 nodes; 2 threads/node

[Figure: speedup vs. number of threads (2, 4, 8, 16), m = 30000; series: UPC, ScaLAPACK]

SLIDE 19

Evaluation of the BLAS3 Triangular Solver (I)

Finis Terrae supercomputer; InfiniBand; 32 nodes; 4 threads/node

[Figure: speedup vs. number of threads (2 to 128), Itanium2 supercomputer, m = 12000, n = 12000; series: UPC M_dist, UPC M_rep, UPC multi, ScaLAPACK]

SLIDE 20

Evaluation of the BLAS3 Triangular Solver (II)

Finis Terrae supercomputer; InfiniBand; 32 nodes; 4 threads/node

[Figure: speedup vs. number of threads (2 to 128), Itanium2 supercomputer, m = 15000, n = 4000; series: UPC M_dist, UPC M_rep, UPC multi, ScaLAPACK]

SLIDE 21

Evaluation of the BLAS3 Triangular Solver (and III)

Finis Terrae supercomputer; InfiniBand; 32 nodes; 4 threads/node

[Figure: speedup vs. number of threads (2 to 128), Itanium2 supercomputer, m = 8000, n = 25000; series: UPC M_dist, UPC M_rep, UPC multi, ScaLAPACK]

SLIDE 22

Outline (current section: 5. Conclusions)

SLIDE 23

Main Conclusions

Summary:
- Implementation of BLAS triangular solvers for UPC
- Several techniques to improve their performance
- Special effort to find the most appropriate data distributions

BLAS2 → block-cyclic distribution by rows, with the block size automatically determined according to the characteristics of the scenario

BLAS3 → distribution chosen depending on the memory constraints

Comparison with ScaLAPACK (MPI):
- UPC easier to use
- Similar or better performance

SLIDE 24

Dense Triangular Solvers on Multicore Clusters using UPC

Jorge González-Domínguez*, María J. Martín, Guillermo L. Taboada, Juan Touriño

Computer Architecture Group University of A Coruña (Spain) {jgonzalezd,mariam,taboada,juan}@udc.es

International Conference on Computational Science ICCS 2011
