SLIDE 1

A Distributed and Parallel Asynchronous Unite and Conquer Method to Solve Large Scale Non-Hermitian Linear Systems

Xinzhe Wu 1,2 and Serge G. Petiton 1,2

1 Maison de la Simulation/CNRS, Gif-sur-Yvette, 91191, France
2 CRIStAL, University of Lille 1, Science and Technology

January 29, 2018 HPC Asia 2018, Tokyo, Japan

slide-2
SLIDE 2

Outline

1. Introduction, toward extreme computing
2. Asynchronous Unite and Conquer GMRES/LS-ERAM (UCGLE) method
3. Experiments, evaluation and analysis
4. Conclusion and perspectives

SLIDE 3

Krylov Methods

Krylov Subspace

K_m = span{ r_0, A r_0, …, A^{m−1} r_0 }

Different Krylov methods:

1. Solution of linear systems: GMRES, CG, BiCG, etc.
2. Solution of eigenvalue problems: ERAM, IRAM, etc.
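As a concrete illustration (not from the slides), a small non-Hermitian system can be solved with a restarted GMRES from SciPy; the `restart` argument plays the role of the Krylov subspace size m:

```python
# Minimal sketch: solving a non-symmetric sparse system with restarted
# GMRES (a Krylov method); illustrative only, not the authors' code.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 200
# Non-symmetric tridiagonal operator (diagonally dominant, well conditioned).
A = sp.diags([-1.0, 4.0, -2.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# 'restart' is the Krylov subspace size m used between restarts.
x, info = spla.gmres(A, b, restart=30)

print(info)                        # 0 means the iteration converged
print(np.linalg.norm(b - A @ x))   # residual norm
```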

SLIDE 4

Future Parallel Programming Trends

1. Highly hierarchical architectures: computing, memory
2. Increasing levels and degrees of parallelism
3. Heterogeneity: computing, memory, scalability
4. Requirements on parallel programming: multi-grain, multi-level memory, reducing synchronizations and promoting asynchronicity, multi-level scheduling strategies

SLIDE 5

Toward extreme computing, some correlated goals

- Minimize the global computing time
- Accelerate the convergence
- Minimize the number of communications
- Minimize the number of large dot products and reductions
- Minimize the memory footprint, optimize cache usage
- Select the best sparse matrix compressed format
- Mixed-precision arithmetic
- Minimize energy consumption
- Fault tolerance, resilience

SLIDE 6

Toward extreme computing, some correlated goals

Two approaches to address these goals:
- Preconditioning
- Unite and conquer

SLIDE 7

Unite and Conquer Approach

Unite and conquer approach: improve the convergence of an iterative method by exploiting results from other iterative methods running concurrently [Nahid Emad and Serge Petiton, 2016].

Figure: Multiple Explicitly Restarted Arnoldi Method (MERAM) [Nahid Emad et al., 2005].

SLIDE 8

Outline

1. Introduction, toward extreme computing
2. Asynchronous Unite and Conquer GMRES/LS-ERAM (UCGLE) method
3. Experiments, evaluation and analysis
4. Conclusion and perspectives

SLIDE 9

UCGLE Method Implementation

The UCGLE method is proposed to solve non-Hermitian linear systems, building on the work of [Azeddine Essai, Guy Bergère and Serge G. Petiton, 1999].

Figure: Workflow of UCGLE method

The workflow couples three computation components — a GMRES component (GMRES processes #1–#3), an ERAM component (ERAM processes #1–#3) and a Least Squares (LS) component (LS process) — through a manager process: the ERAM component sends eigenvalues to the LS component, the LS component sends the least-squares residual parameters to the GMRES component, and residual vectors circulate back through the manager.
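To convey the asynchronous collaboration in a self-contained way, here is a toy sketch (threads and a queue standing in for MPI processes and communicators; the solver loops are deliberate stand-ins, not the real GMRES/ERAM components):

```python
# Toy model of the UCGLE workflow: an "ERAM-like" producer thread ships
# spectral information through a queue (the "manager"), and a "GMRES-like"
# iteration consumes it whenever available, never blocking on the producer.
import queue
import threading

import numpy as np

n = 50
A = np.diag(np.linspace(1.0, 2.0, n))        # toy operator with known spectrum
b = np.ones(n)
manager: queue.Queue = queue.Queue()

def eram_like():
    # Stand-in for the ERAM component: send eigenvalue estimates (exact here).
    manager.put(np.linalg.eigvalsh(A))

threading.Thread(target=eram_like, daemon=True).start()

x = np.zeros(n)
step = 0.5                                   # conservative default step size
for _ in range(200):
    try:
        lam = manager.get_nowait()           # asynchronous receive: never blocks
        step = 2.0 / (lam.min() + lam.max()) # use spectrum info to accelerate
    except queue.Empty:
        pass
    x = x + step * (b - A @ x)               # stand-in for the GMRES component

print(np.linalg.norm(b - A @ x))             # residual after 200 sweeps
```

The essential point mirrored from the slide: the consumer keeps iterating at full speed whether or not the spectral information has arrived yet.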

SLIDE 10

Least Squares Method

Polynomial preconditioner iterates: x_n = x_0 + P_n(A) r_0, so r_n = R_n(A) r_0 with R_n(λ) = 1 − λ P_n(λ). The goal is to find a polynomial P_n that minimizes ‖R_n(A) r_0‖. For details of this method, see [Youssef Saad, 1987].

Figure: Eigenvalues, convex hull and ellipse

Workflow:
1. Load the eigenvalues of matrix A
2. Construct the convex hull enclosing the eigenvalues
3. Compute the ellipse from the convex hull
4. Compute the matrices M and T in the Chebyshev polynomial basis
5. Factor M = LLᵀ by Cholesky factorization, and set F = Lᵀ
6. Compute the new residual
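The identity r_n = R_n(A) r_0 can be checked numerically. The sketch below uses the simplest residual polynomial, the Richardson polynomial R_n(λ) = (1 − ωλ)^n, instead of the Chebyshev least-squares polynomial of the slides (an assumption made purely to keep the example short):

```python
# Numerical check of the polynomial-preconditioning identity
# r_n = R_n(A) r_0 with R_n(λ) = (1 − ωλ)^n (Richardson polynomial).
import numpy as np

rng = np.random.default_rng(0)
n_dim, n_deg, omega = 30, 8, 0.9
lam = np.linspace(0.5, 1.5, n_dim)          # spectrum inside (0, 2/ω)
A = np.diag(lam)
b = rng.standard_normal(n_dim)
x0 = np.zeros(n_dim)
r0 = b - A @ x0

# Applying P_n(A) implicitly: n steps of x <- x + ω r give exactly
# r_n = (I − ωA)^n r_0.
x = x0.copy()
for _ in range(n_deg):
    x = x + omega * (b - A @ x)
r_iter = b - A @ x

# Direct evaluation of R_n(A) r_0 for comparison.
r_poly = np.linalg.matrix_power(np.eye(n_dim) - omega * A, n_deg) @ r0

print(np.linalg.norm(r_iter - r_poly))      # ~0: both routes agree
print(np.linalg.norm(r_iter) / np.linalg.norm(r0))  # < 1: polynomial damps r0
```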

SLIDE 11

Least Squares Method

Least Squares method residual

r = \left(R_k(A)\right)^{\ell} r_0 = \sum_{i=1}^{m} \rho_i \left(R_k(\lambda_i)\right)^{\ell} u_i + \sum_{i=m+1}^{n} \rho_i \left(R_k(\lambda_i)\right)^{\ell} u_i

where r_0 = \sum_i \rho_i u_i is the expansion of the initial residual in the eigenbasis; the first sum runs over the m computed eigenvalues and the second over the remaining ones.
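The expansion above just says that a matrix polynomial acts componentwise in the eigenbasis; a small NumPy check (with an arbitrary illustrative polynomial R_k, not the actual least-squares one):

```python
# Check that (R_k(A))^l r0 equals the eigenbasis sum over rho_i (R_k(lambda_i))^l u_i
# for a diagonalizable non-symmetric A; R_k here is an arbitrary example.
import numpy as np

rng = np.random.default_rng(1)
n = 6
lam = np.array([0.2, 0.5, 0.8, 1.1, 1.4, 1.7])   # chosen eigenvalues
U = rng.standard_normal((n, n))                   # eigenvector columns u_i
A = U @ np.diag(lam) @ np.linalg.inv(U)

ell = 4
r0 = rng.standard_normal(n)
rho = np.linalg.solve(U, r0)                      # r0 = sum_i rho_i u_i

# R_k(t) = (1 - 0.9 t)^3, applied ell times as a matrix polynomial.
RkA = np.linalg.matrix_power(np.eye(n) - 0.9 * A, 3)
lhs = np.linalg.matrix_power(RkA, ell) @ r0

# Componentwise action on the eigenbasis.
rhs = sum(rho[i] * (1.0 - 0.9 * lam[i]) ** (3 * ell) * U[:, i] for i in range(n))

print(np.allclose(lhs, rhs))
```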

Figure: Residual convergence of UCGLE vs. GMRES with no preconditioner (residual, log scale from 1 down to 1e−10, against iteration steps up to ~3000); an example of UCGLE accelerating convergence.

SLIDE 12

Software engine, orchestration of UCGLE

All three computation components are implemented with the scientific libraries PETSc and SLEPc, building on the work of Pierre-Yves Aquilanti in his PhD thesis at University of Lille 1 [Pierre-Yves Aquilanti, 2011].

Figure: Asynchronous Communication and Parallelism of UCGLE method

MPI_COMM_WORLD is split into three communicators — ERAM_COMM, GMRES_COMM and LS_COMM — plus a manager processor. Residual vectors, some Ritz values and parameters for the preconditioner are exchanged asynchronously among them, at coarse, medium and fine granularity.
SLIDE 13

Components Implementation

The implementation of each component builds on the work of Pierre-Yves Aquilanti in his PhD thesis at University of Lille 1 [Pierre-Yves Aquilanti, 2011].

GMRES Component

The GMRES component uses the implementation provided by the PETSc library.

Arnoldi Component

The Arnoldi component uses the SLEPc library to compute the eigenvalues of the matrix operator A.

LS Component

The LS component uses the Cholesky algorithm provided by PETSc; PETSc exposes it as a preconditioner, but it can equally be used directly as a factorization method.
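The linear-algebra kernel involved can be sketched densely in a few lines: solve a small least-squares problem through the normal equations with a Cholesky factorization M = LLᵀ (NumPy here instead of PETSc, purely for a self-contained example):

```python
# Minimal dense sketch of the LS component's kernel: least squares
# min ||c - T y|| via normal equations and Cholesky M = L L^T.
import numpy as np

rng = np.random.default_rng(2)
T = rng.standard_normal((20, 8))    # tall basis matrix (e.g. a Chebyshev basis)
c = rng.standard_normal(20)

M = T.T @ T                         # Gram matrix, SPD for full-rank T
L = np.linalg.cholesky(M)           # M = L L^T
# Two triangular solves give the least-squares coefficients.
z = np.linalg.solve(L, T.T @ c)
y = np.linalg.solve(L.T, z)

print(y)                            # matches np.linalg.lstsq(T, c)
```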

SLIDE 14

Important Parameters

UCGLE exposes a large number of parameters that users can select and autotune to obtain the best performance.

I. GMRES component
  - m_g: GMRES Krylov subspace size
  - ε_g: absolute tolerance used for the GMRES convergence test
  - P_g: number of GMRES processors
  - s_use: number of times the polynomial is applied before taking the new eigenvalues into account
  - L: number of GMRES restarts between LS preconditioning applications

II. Arnoldi component
  - m_a: Arnoldi Krylov subspace size
  - r: number of eigenvalues required
  - ε_a: convergence tolerance
  - P_a: number of Arnoldi processors

III. LS component
  - d: Least Squares polynomial degree
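Collected in one place, such a parameter set might look like the following sketch; the structure, names and default values are hypothetical, not the actual UCGLE configuration API:

```python
# Hypothetical grouping of the UCGLE parameters listed above;
# values are illustrative placeholders only.
ucgle_params = {
    "gmres": {
        "m_g": 60,        # GMRES Krylov subspace size
        "eps_g": 1e-10,   # absolute tolerance of the convergence test
        "P_g": 256,       # number of GMRES processors
        "s_use": 3,       # polynomial uses before new eigenvalues are taken
        "L": 2,           # GMRES restarts between LS preconditionings
    },
    "arnoldi": {
        "m_a": 40,        # Arnoldi Krylov subspace size
        "r": 10,          # number of eigenvalues required
        "eps_a": 1e-6,    # convergence tolerance
        "P_a": 32,        # number of Arnoldi processors
    },
    "ls": {
        "d": 10,          # Least Squares polynomial degree
    },
}

print(sorted(ucgle_params))
```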

SLIDE 15

Outline

1. Introduction, toward extreme computing
2. Asynchronous Unite and Conquer GMRES/LS-ERAM (UCGLE) method
3. Experiments, evaluation and analysis
4. Conclusion and perspectives

SLIDE 16

Test Matrices

All the following results come from [Xinzhe Wu and Serge G. Petiton, 2017].

Matrix utm300 from Matrix Market

Figure: Two strategies of the large and sparse matrix generator

Table: Test matrices information

Matrix Name   n              nnz            Matrix Type
matLine       1.8 × 10^7     2.9 × 10^7     non-symmetric
matBlock      1.8 × 10^7     1.9 × 10^8     non-symmetric
MEG1          1.024 × 10^7   7.27 × 10^9    non-Hermitian
MEG2          5.1 × 10^6     3.64 × 10^9    non-Hermitian

SLIDE 17

Experimental Hardware

Experiments were run on the ROMEO supercomputer in Reims (Champagne, France). ROMEO has 130 nodes; each node has 2 CPUs of 8 cores and 2 GPUs. The node specification is as follows:

Table: Node specifications of the ROMEO cluster

Nodes          BullX R421 × 130
Motherboard    SuperMicro X9DRG-QF
CPU            Intel Ivy Bridge, 8 cores, 2.6 GHz × 2 sockets
Memory         32 GB DDR
GPU            NVIDIA Tesla K20X × 2
GPU memory     6 GB GDDR5 per GPU

SLIDE 18

Convergence and Fault Tolerance Evaluation

[Convergence plots: residual (log scale, 1 down to 1e−10) vs. iteration steps, with fault points marked, for (a) matLine, (b) matBlock, (c) MEG1 and (d) MEG2; curves for SOR, Jacobi, no preconditioner, UCGLE FT(G), UCGLE FT(E) and UCGLE.]

Figure: Convergence comparison of matLine, matBlock, MEG1 and MEG2 by UCGLE, classic GMRES, Jacobi preconditioned GMRES, SOR preconditioned GMRES, UCGLE FT(G) and UCGLE FT(E); the X-axis gives the iteration step for each method; the Y-axis gives the residual on a base-10 logarithmic scale.

SLIDE 19

Summary of Iteration Number for Convergence

Table: Summary of the iteration counts to convergence for the 4 test matrices using SOR, Jacobi, non-preconditioned GMRES, UCGLE FT(G), UCGLE FT(E) and UCGLE; a red × indicates that the solve did not converge to an accurate solution (absolute residual tolerance 1 × 10^−10 for the GMRES convergence test) within an acceptable iteration count (20000 here).

Matrix Name   SOR    Jacobi   No preconditioner   UCGLE FT(G)   UCGLE FT(E)   UCGLE
matLine       1430   ×        1924                995           1073          900
matBlock      2481   3579     3027                2048          2005          1646
MEG1          217    386      400                 81            347           74
MEG2          750    ×        ×                   82            ×             64

SLIDE 20

Strong Scalability Results

[Plot: average solve time per iteration (log scale, 10^−2 to 10^2 s) vs. GMRES CPU core count from 1 to 256 or GPU count, for SOR, Jacobi, no preconditioner and UCGLE, each on CPU and GPU.]

Figure: Strong scalability test of solve time per iteration for UCGLE, GMRES without preconditioner, and Jacobi and SOR preconditioned GMRES, using matrix MEG1 on CPU and GPU; the X-axis gives the number of GMRES CPU cores (1 to 256) or GMRES GPUs (2 to 128); the Y-axis gives the average execution time per iteration. A base-2 logarithmic scale is used for the X-axis and a base-10 logarithmic scale for the Y-axis.

SLIDE 21

Performance Evaluation

[Plot: average solve time per iteration (log scale, 10^−2 to 10^2 s) vs. total CPU core or GPU count from 2 to 256, for SOR, Jacobi, no preconditioner and UCGLE, each on CPU and GPU.]

Figure: Performance comparison of solve time per iteration for UCGLE, GMRES without preconditioner, and Jacobi and SOR preconditioned GMRES, using matrix MEG1 on CPU and GPU; the X-axis gives the total number of CPU cores or GPUs for the four methods; the Y-axis gives the average execution time per iteration. A base-2 logarithmic scale is used for the X-axis and a base-10 logarithmic scale for the Y-axis.

SLIDE 22

Outline

1. Introduction, toward extreme computing
2. Asynchronous Unite and Conquer GMRES/LS-ERAM (UCGLE) method
3. Experiments, evaluation and analysis
4. Conclusion and perspectives

SLIDE 23

Conclusion and Perspectives

1. UCGLE is an asynchronous preconditioned method: it minimizes communications, is fault tolerant, and allows the GMRES/LS computation to be reused efficiently for other systems with different right-hand sides;
2. Several other preconditioners may be used (e.g. FGMRES) between LS polynomial accelerations;
3. Many parameters have to be analyzed: smart tuning at runtime, learning, toward intelligent linear algebra;
4. We have to experiment with very large matrices and on the world's larger supercomputers to evaluate the impact of high reduction latency;
5. Adapted programming paradigms have to be used for such asynchronous multi-granularity distributed and parallel computing (YML-XMP/YML-XACC, …);
6. SMG2S: generation of non-Hermitian matrices, including some generated with a given spectrum (soon available online).

SLIDE 24

References

- Nahid Emad and Serge Petiton (2016). Unite and conquer approach for high scale numerical computing. Journal of Computational Science, 5–14.
- Nahid Emad, Serge Petiton and Guy Edjlali (2005). Multiple explicitly restarted Arnoldi method for solving large eigenproblems. SIAM Journal on Scientific Computing, 253–277.
- Xinzhe Wu and Serge G. Petiton (2018). A Distributed and Parallel Asynchronous Unite and Conquer Method to Solve Large Scale Non-Hermitian Linear Systems. International Conference on High Performance Computing in Asia-Pacific Region, accepted.
- Azeddine Essai, Guy Bergère and Serge G. Petiton (1999). Heterogeneous Parallel Hybrid GMRES/LS-Arnoldi Method. PPSC 1999.
- Youssef Saad (1987). Least squares polynomials in the complex plane and their use for solving nonsymmetric linear systems. SIAM Journal on Numerical Analysis, 155–169.
- Pierre-Yves Aquilanti (2011). Méthodes de Krylov réparties et massivement parallèles pour la résolution de problèmes de géoscience sur machines hétérogènes dépassant le pétaflop. PhD dissertation, Université de Lille.

SLIDE 25

Thank you for your attention! Questions?
