SLIDE 1


Design and Performance Issues of Cholesky and LU Solvers using UPCBLAS

Jorge González-Domínguez*, Osni A. Marques**, María J. Martín*, Guillermo L. Taboada*, Juan Touriño*

*Computer Architecture Group, University of A Coruña, Spain ({jgonzalezd,mariam,taboada,juan}@udc.es)
**Computational Research Division, Lawrence Berkeley National Laboratory, CA, USA (OAMarques@lbl.gov)

10th IEEE International Symposium on Parallel and Distributed Processing with Applications ISPA 2012

SLIDE 2


1. Introduction
2. Cholesky Solver
3. LU Solver
4. Experimental Evaluation
5. Conclusions

SLIDE 4


UPC: a Suitable Alternative for HPC in Multi-core Era

Programming Models:

- Traditionally: shared-memory and distributed-memory programming models
- Challenge: hybrid memory architectures
- PGAS (Partitioned Global Address Space)

PGAS Languages:

- UPC (C)
- Titanium (Java)
- Co-Array Fortran (Fortran)

Main advantages of the PGAS model:
- Simplifies programming
- Allows an efficient use of one-sided communications
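
For illustration only (this sketch is not part of the slides): a minimal UPC example of one-sided communication, in which a thread copies a block of a shared array owned by another thread with upc_memget, with no matching send/receive on the owner's side. The array v and the block size BLK are made-up names for the example.

#include <upc.h>
#include <stdio.h>

#define BLK 4                               /* elements with affinity to each thread */

shared [BLK] double v[BLK*THREADS];         /* one block of BLK elements per thread */

int main(void) {
    double local[BLK];
    /* each thread initializes the block it has affinity to */
    for (int i = 0; i < BLK; i++)
        v[MYTHREAD*BLK + i] = MYTHREAD;
    upc_barrier;
    /* one-sided get: fetch the block owned by the next thread into private memory */
    upc_memget(local, &v[((MYTHREAD + 1) % THREADS) * BLK], BLK * sizeof(double));
    printf("thread %d read %g from thread %d\n",
           MYTHREAD, local[0], (MYTHREAD + 1) % THREADS);
    return 0;
}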

SLIDE 5


UPCBLAS

Characteristics of the Library
- Includes parallel BLAS routines built on top of UPC
- Focused on increasing programmability
- Distributed matrices and vectors are represented by shared arrays
  - Advantage: shared arrays are implicitly distributed
  - Drawback: only 1D distributions are allowed
- Good trade-off between programmability and performance
- UPCBLAS parallel functions internally call BLAS routines to perform the sequential computations in each thread

SLIDE 6


UPCBLAS Matrix Vector Product

int upc_blas_sgemv(UPCBLAS_DIMMDIST dimmDist, int block_size,
                   int sec_block_size, UPCBLAS_TRANSPOSE transpose,
                   int m, int n, float alpha, shared void *A, int lda,
                   shared void *x, float beta, shared void *y);

- Syntax similar to sequential BLAS
- Pointers point to shared memory
- Additional parameters specify the distribution:
  - dimmDist: enumerated value that specifies the type of distribution (by rows or by columns)
  - The meaning of block_size and sec_block_size depends on the dimmDist value

SLIDE 10


UPCBLAS Matrix Vector Product

shared [16] float A[64];   /* 8x8 matrix: blocks of 2 rows (16 elements) dealt out cyclically to the threads */
shared [4] float x[8];     /* input vector: blocks of 4 elements */
shared [2] float y[8];     /* result vector: blocks of 2 elements */

upc_blas_sgemv(upcblas_rowDist, 2, 4, upcblas_noTrans, 8, 8, alpha,
               (shared void *)A, 8, (shared void *)x, beta,
               (shared void *)y);

SLIDE 11


UPCBLAS More Information

Described in:

J. González-Domínguez, M. J. Martín, G. L. Taboada, J. Touriño, R. Doallo, D. A. Mallón and B. Wibecan, “UPCBLAS: A Library for Parallel Matrix Computations in Unified Parallel C”, Concurrency and Computation: Practice and Experience, 2012 (in press), available at http://dx.doi.org/10.1002/cpe.1914

SLIDE 12


Description of the Problem

Solution of Systems of Equations using UPCBLAS: A * X = B, where
- A is an m x m matrix
- X and B are m x n matrices (X overwrites B)

First step: factorization
- Cholesky: A = L * L^T
- LU: A = L * U

Second step: two triangular solvers
- Cholesky: L * Y = B and L^T * X = Y
- LU: L * Y = B and U * X = Y
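
As a sequential point of reference only (a sketch, not the parallel UPCBLAS code): the same two-step solve written with LAPACKE, assuming a row-major m x m matrix A and an m x n right-hand side B that is overwritten with X. The helper names cholesky_solve and lu_solve are illustrative.

#include <lapacke.h>

/* Cholesky-based solve of A*X = B for a symmetric positive definite A:
 * factorization A = L*L^T followed by the two triangular solves.             */
void cholesky_solve(int m, int n, double *A, double *B) {
    LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', m, A, m);            /* A = L*L^T          */
    LAPACKE_dpotrs(LAPACK_ROW_MAJOR, 'L', m, n, A, m, B, n);   /* L*Y = B, L^T*X = Y */
}

/* LU-based solve of A*X = B for a general A:
 * factorization with partial pivoting followed by the two triangular solves. */
void lu_solve(int m, int n, double *A, double *B, lapack_int *ipiv) {
    LAPACKE_dgetrf(LAPACK_ROW_MAJOR, m, m, A, m, ipiv);              /* A = P*L*U        */
    LAPACKE_dgetrs(LAPACK_ROW_MAJOR, 'N', m, n, A, m, ipiv, B, n);   /* L*Y = B, U*X = Y */
}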

SLIDE 14


Cholesky Factorization
- Two different block algorithms were implemented; both are based on BLAS3 routines:
  - Algorithm based on gemm (LAPACK)
  - Algorithm based on syrk (ScaLAPACK)
- Only 1D distributions available: block-cyclic distribution by rows or by columns
- Block-cyclic distribution by rows: the A_{i,j} are submatrices
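
A minimal sketch (not taken from the paper) of what a block-cyclic distribution by rows looks like in UPC, and of one way to write the "MYTHREAD has affinity to block i" test used in the algorithms below. The sizes N and BLK and the helper owner_of_block are illustrative, and a static number of UPC threads is assumed.

#include <upc.h>
#include <stdio.h>

#define N   8                        /* matrix dimension                   */
#define BLK 2                        /* rows per block of the distribution */

/* blocks of BLK rows (BLK*N elements) are dealt out cyclically to the threads */
shared [BLK*N] double A[N*N];

/* the thread that owns block row i is the one with affinity to its first element */
int owner_of_block(int i) {
    return (int)upc_threadof(&A[i*BLK*N]);
}

int main(void) {
    for (int i = 0; i < N/BLK; i++)
        if (owner_of_block(i) == MYTHREAD)   /* "MYTHREAD has affinity to block i" */
            printf("thread %d owns block row %d (rows %d..%d)\n",
                   MYTHREAD, i, i*BLK, i*BLK + BLK - 1);
    return 0;
}

With this layout the owner of block i is simply i modulo THREADS, which is exactly what upc_threadof returns here.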

SLIDE 15


Cholesky Solver based on gemm

for i = 0; i < NB; i = i+1 do
    if MYTHREAD has affinity to block i then
        A_{i,i} = A_{i,i} - A_{i,0..i-1} * A_{i,0..i-1}^T                   → syrk
        Sequential Cholesky Factorization of A_{i,i}
    end
    A_{i+1..N,i} = A_{i+1..N,i} - A_{i+1..N,0..i-1} * A_{i,0..i-1}^T        → gemm
    Solve Z * A_{i,i}^T = A_{i+1..N,i}                                      → trsm
    A_{i+1..N,i} = Z
end
Solve Y * A^T = B                                                           → trsm
Solve X * A = Y                                                             → trsm
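
To make the mapping of each step to a BLAS routine concrete, the following is a sequential sketch of the same gemm-based blocked factorization written with CBLAS and LAPACKE on the lower triangle of a row-major n x n matrix. It only illustrates the per-block operations; it is not the UPC code, and chol_gemm_blocked is a hypothetical name.

#include <cblas.h>
#include <lapacke.h>

/* Left-looking blocked Cholesky (lower triangle), block size nb: for each block
 * column k, update the diagonal block (syrk), factorize it (potrf), update the
 * panel below it (gemm) and solve the panel against L(k,k)^T (trsm).          */
void chol_gemm_blocked(int n, int nb, double *A) {
    for (int k = 0; k < n; k += nb) {
        int b = (k + nb <= n) ? nb : n - k;      /* current block size          */
        int r = n - k - b;                       /* number of rows below it     */
        /* A[k:k+b, k:k+b] -= A[k:k+b, 0:k] * A[k:k+b, 0:k]^T            (syrk) */
        cblas_dsyrk(CblasRowMajor, CblasLower, CblasNoTrans, b, k,
                    -1.0, &A[k*n], n, 1.0, &A[k*n + k], n);
        /* sequential Cholesky factorization of the diagonal block      (potrf) */
        LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', b, &A[k*n + k], n);
        if (r > 0) {
            /* A[k+b:n, k:k+b] -= A[k+b:n, 0:k] * A[k:k+b, 0:k]^T        (gemm) */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans, r, b, k,
                        -1.0, &A[(k+b)*n], n, &A[k*n], n, 1.0, &A[(k+b)*n + k], n);
            /* solve Z * L(k,k)^T = panel and overwrite the panel with Z (trsm) */
            cblas_dtrsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, r, b, 1.0, &A[k*n + k], n, &A[(k+b)*n + k], n);
        }
    }
}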

SLIDE 20


Cholesky Solver based on syrk

for i = 0; i < NB; i = i+1 do
    if MYTHREAD has affinity to block i then
        Sequential Cholesky Factorization of A_{i,i}
    end
    Solve Z * A_{i,i}^T = A_{i+1..N,i}                                      → trsm
    A_{i+1..N,i} = Z
    A_{i+1..N,i+1..N} = A_{i+1..N,i+1..N} - A_{i+1..N,i} * A_{i+1..N,i}^T   → syrk
end
Solve Y * A^T = B                                                           → trsm
Solve X * A = Y                                                             → trsm

Algorithm based on syrk vs. algorithm based on gemm:
- Advantage: fewer computations are performed by only one thread
- Drawback: stronger data dependencies -> increase of the overhead due to synchronizations

SLIDE 23


Cholesky Solver

Optimization Techniques
- Appropriate choice of NB (synchronization-parallelism trade-off)
- Numerical computations within each UPC thread are parallelized to exploit NUMA systems:
  - Calls to multithreaded BLAS routines
  - Implementation of the sequential Cholesky factorization with OpenMP support
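
A rough sketch (an assumption about one possible implementation, not the authors' code) of how the sequential Cholesky factorization of a diagonal block can be given OpenMP support, so that each UPC process exploits all the cores of its NUMA node. The helper chol_block_omp is a hypothetical name, the loops follow the textbook unblocked algorithm, and OpenMP must be enabled at compile time (e.g. -fopenmp).

#include <math.h>

/* Unblocked Cholesky of the lower triangle of an n x n block stored with
 * leading dimension lda; the trailing update is split among OpenMP threads. */
static void chol_block_omp(double *A, int n, int lda) {
    for (int k = 0; k < n; k++) {
        A[k*lda + k] = sqrt(A[k*lda + k]);
        for (int i = k + 1; i < n; i++)
            A[i*lda + k] /= A[k*lda + k];
        /* rank-1 update of the trailing submatrix, parallelized over rows */
        #pragma omp parallel for schedule(static)
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j <= i; j++)
                A[i*lda + j] -= A[i*lda + k] * A[j*lda + k];
    }
}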

SLIDE 25


Basic LU Solver

for i = 0; i < NB; i = i+1 do
    if MYTHREAD has affinity to block i then
        Sequential LU Factorization of A_{i,i..N}
    end
    Solve Z * A_{i,i}^T = A_{i+1..N,i}                                      → trsm
    A_{i+1..N,i} = Z
    A_{i+1..N,i+1..N} = A_{i+1..N,i+1..N} - A_{i+1..N,i} * A_{i,i+1..N}     → gemm
end
Solve A * Y = B                                                             → trsm
Solve A * X = Y                                                             → trsm
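
For orientation, an illustrative sequential fragment (a hypothetical helper, not the paper's code) of the per-iteration trsm and gemm updates above, written with CBLAS on a row-major n x n matrix. It follows the standard right-looking form, in which the column panel is solved against the upper-triangular block U(k,k) obtained from the already factorized block row.

#include <cblas.h>

/* One trailing update of the basic blocked LU: k is the offset of the current
 * block, b its size; the block row A[k:k+b, k:n] is assumed already factorized. */
static void lu_trailing_update(double *A, int n, int k, int b) {
    int r = n - k - b;                      /* size of the trailing part           */
    if (r <= 0) return;
    /* column panel: A[k+b:n, k:k+b] = A[k+b:n, k:k+b] * U(k,k)^{-1}        (trsm) */
    cblas_dtrsm(CblasRowMajor, CblasRight, CblasUpper, CblasNoTrans, CblasNonUnit,
                r, b, 1.0, &A[k*n + k], n, &A[(k+b)*n + k], n);
    /* trailing matrix: A[k+b:n, k+b:n] -= A[k+b:n, k:k+b] * A[k:k+b, k+b:n] (gemm) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, r, r, b,
                -1.0, &A[(k+b)*n + k], n, &A[k*n + (k+b)], n,
                1.0, &A[(k+b)*n + (k+b)], n);
}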

SLIDE 30


LU Solver with Partial Pivoting

Partial Pivoting
- Avoids inconsistencies caused by dividing by 0
- If the matrix is distributed by rows, the pivoting is performed by columns -> it can be parallelized
- A vector P of size N is used to store the information about the pivoting
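
A small sketch of the idea (hypothetical helpers, not the library's API): with a row-wise distribution the pivot is searched within the current row, the chosen column index is recorded in the pivot vector P, and every thread can then apply the same column swap to the rows it owns without communication.

#include <math.h>

/* choose the column j >= col with the largest |A[row][j]| and record it in P */
static void pivot_columns(const double *A, int n, int lda, int row, int col, int *P) {
    int best = col;
    for (int j = col + 1; j < n; j++)
        if (fabs(A[row*lda + j]) > fabs(A[row*lda + best]))
            best = j;
    P[col] = best;
}

/* every thread swaps columns col and P[col] in the nrows rows it owns locally */
static void apply_pivot(double *Arows, int nrows, int lda, int col, const int *P) {
    for (int i = 0; i < nrows; i++) {
        double t = Arows[i*lda + col];
        Arows[i*lda + col]    = Arows[i*lda + P[col]];
        Arows[i*lda + P[col]] = t;
    }
}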

SLIDE 31


LU Solver with Partial Pivoting

for i = 0; i < NB; i = i+1 do
    if MYTHREAD has affinity to block i then
        P_i = Partial Pivoting of A_{i,i..N}
        Swap A_{i,i..N} according to P_i
        Sequential LU Factorization of A_{i,i..N}
    end
    Swap A_{0..N,i..N} according to P_i
    Solve Z * A_{i,i}^T = A_{i+1..N,i}                                      → trsm
    A_{i+1..N,i} = Z
    A_{i+1..N,i+1..N} = A_{i+1..N,i+1..N} - A_{i+1..N,i} * A_{i,i+1..N}     → gemm
end
Swap B according to P
Solve A * Y = B                                                             → trsm
Solve A * X = Y                                                             → trsm

SLIDE 34


LU Solver

Optimization Techniques
- Appropriate choice of the block size
- Numerical computations within each UPC thread are parallelized to exploit NUMA systems:
  - Calls to multithreaded BLAS routines
  - Implementation of the sequential LU factorization with OpenMP support

SLIDE 36


Characteristics of the Testbed

Carver Supercomputer
- Installed at the National Energy Research Scientific Computing Center (NERSC) of the Lawrence Berkeley National Laboratory
- 320 nodes with 2 quad-core Intel Xeon X5550 processors (8 cores and 24 GB per node)
- 4X QDR InfiniBand network (32 Gbps of theoretical effective bandwidth)

SLIDE 37


Characteristics of the Testbed

Software
- Berkeley UPC 2.12.2 compiler (internode communications through GASNet over InfiniBand)
- Multithreaded Intel Math Kernel Library (MKL) 10.2.2 (8 internal threads per node)
- ScaLAPACK 1.8.0 library

SLIDE 38


Characteristics of the Experiments

Type of Experiments
- Weak scaling
- Double-precision data
- Best block size
- 1 UPC thread per node and 8 OpenMP threads per UPC thread (one per core)

SLIDE 39


Characteristics of the Experiments

Measures
- Speedups relative to the ScaLAPACK times
- Comparisons with ScaLAPACK:
  - ScaLAPACK-2D: results using ScaLAPACK and a 2D distribution
  - ScaLAPACK-1D: results using ScaLAPACK and a 1D distribution

SLIDE 40


Experimental Evaluation of Cholesky

[Figure: Cholesky solver, 1200 GFLOP per thread. Speedup vs. number of cores (16-256) for UPC-row (based on gemm), UPC-col (based on gemm), UPC-row (based on syrk) and UPC-col (based on syrk).]

SLIDE 41


Experimental Evaluation of Cholesky

[Figure: Cholesky solver, 1200 GFLOP per thread. Speedup vs. number of cores (16-256) for ScaLAPACK-2D, ScaLAPACK-1D and UPC-row (based on gemm).]

SLIDE 42


Experimental Evaluation of LU with Partial Pivoting

[Figure: LU solver with pivoting, 1365 GFLOP per thread. Speedup vs. number of cores (16-256) for ScaLAPACK-2D, ScaLAPACK-1D and UPC-row.]

SLIDE 44


Summary
- UPC implementations of Cholesky and LU solvers using UPCBLAS
- UPCBLAS main advantage: easier to use than MPI-based libraries
- UPCBLAS main drawback: performance limited as shared arrays can only work with 1D distributions

Conclusions
- UPCBLAS is a good trade-off between programmability and performance
- UPCBLAS is a good alternative for increasing productivity

SLIDE 45


Design and Performance Issues of Cholesky and LU Solvers using UPCBLAS

Jorge González-Domínguez*, Osni A. Marques**, María J. Martín*, Guillermo L. Taboada*, Juan Touriño*

*Computer Architecture Group, University of A Coruña, Spain ({jgonzalezd,mariam,taboada,juan}@udc.es)
**Computational Research Division, Lawrence Berkeley National Laboratory, CA, USA (OAMarques@lbl.gov)

10th IEEE International Symposium on Parallel and Distributed Processing with Applications ISPA 2012
