SLIDE 1

Member of the Helmholtz Association

A Parallel and Scalable Iterative Solver for Sequences of Dense Eigenproblems Arising in FLAPW

PPAM 2013 Warsaw, Poland, Sept. 10th

  • M. Berljafa and E. Di Napoli
SLIDE 3

Motivation and Goals

Electronic structure: band energy gap, conductivity, forces, etc.

SIMULATIONS

Mathematical model + algorithmic structure: extracting & exploiting information

Performance, efficiency, scalability → more physics

SLIDE 4

Outline

1. Sequences of generalized eigenproblems in all-electron computations
2. The algorithm: Chebyshev Filtered Sub-space Iteration method (ChFSI)
3. ChFSI parallelization and numerical tests

SLIDE 7

Density Functional Theory scheme

Self-consistent cycle

1. Initial guess for the charge density n_start(r).
2. Compute the discretized Kohn–Sham equations; solve a set of eigenproblems P^(ℓ)_{k1} … P^(ℓ)_{kN}.
3. Compute the new charge density n^(ℓ)(r).
4. Converged? |n^(ℓ) − n^(ℓ−1)| < η. If no, return to step 2; if yes, OUTPUT the electronic structure, …

1. Every P^(ℓ)_k : A^(ℓ)_k x = λ B^(ℓ)_k x is a generalized eigenvalue problem;
2. A and B are DENSE and Hermitian (B is also positive definite);
3. P_k with different k index have different sizes and are independent of each other;
4. k = 1 : 10–100; ℓ = 1 : 20–50.

SLIDE 8

Density Functional Theory scheme

Adjacent cycles

ITERATION (ℓ):   P^(ℓ)_{k1} → (X^(ℓ)_{k1}, Λ^(ℓ)_{k1}),   P^(ℓ)_{k2} → (X^(ℓ)_{k2}, Λ^(ℓ)_{k2}),   …,   P^(ℓ)_{kN} → (X^(ℓ)_{kN}, Λ^(ℓ)_{kN}),   each solved with a direct solver.

ITERATION (ℓ+1):   P^(ℓ+1)_{k1} → (X^(ℓ+1)_{k1}, Λ^(ℓ+1)_{k1}),   …,   P^(ℓ+1)_{kN} → (X^(ℓ+1)_{kN}, Λ^(ℓ+1)_{kN}),   each solved with a direct solver.

Next cycle …

Here X ≡ {x_1, …, x_n} and Λ ≡ diag(λ_1, …, λ_n).

SLIDE 12

Sequences of eigenproblems

Current solving strategy

The set of generalized eigenproblems P^(1) … P^(ℓ), P^(ℓ+1) … P^(N) is handled as a set of N × P disjoint problems; each problem P^(ℓ) is solved independently using a direct solver as a black box from a standard library (e.g., ScaLAPACK).

Extracting information → searching for a new strategy

• Treated the set of eigenproblems N × P as a sequence {P^(ℓ)};
• Investigated the presence of a connection between adjacent problems, collecting data on the angles between eigenvectors of adjacent eigenproblems:
  Θ^(ℓ)_{k_i} ≡ {θ_1, …, θ_n} = diag(1 − ⟨X^(ℓ−1)_{k_i}, X̃^(ℓ)_{k_i}⟩);
• Uncovered the evolution of the eigenvectors along the sequence: for fixed k_i,
  θ^(2)_j ≥ θ^(3)_j ≥ ··· ≥ θ^(N)_j,   with θ^(2)_j ≫ θ^(N)_j.
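As an illustration of the quantity monitored above, here is a minimal C sketch (real case for brevity; the FLAPW eigenvectors are complex, and this is not the analysis code used in the study) computing the angle between the j-th eigenvectors of two adjacent iterations. It assumes both vectors are normalized and a CBLAS installation; the helper name eigvec_angle is ours.

#include <math.h>
#include <cblas.h>

/* theta_j = arccos(|<x_prev, x_curr>|) for normalized eigenvectors of
   length n taken from iterations (l-1) and (l). */
double eigvec_angle(int n, const double *x_prev, const double *x_curr)
{
    double d = fabs(cblas_ddot(n, x_prev, 1, x_curr, 1));
    if (d > 1.0) d = 1.0;   /* guard against round-off slightly above 1 */
    return acos(d);
}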

SLIDE 13

Angles evolution

fixed k

Example: a metallic compound at fixed k

[Figure: evolution of the subspace angle between eigenvectors of adjacent iterations, for k-point 1 and the lowest 75 eigenpairs of the AuAg system; over iterations 2–22 the angles decay from roughly 10^−2 to 10^−10.]

SLIDE 14

Correlation and its exploitation

Correlation

∃ a correlation between successive eigenvectors x^(ℓ−1)_j and x^(ℓ)_j; the angles become small after the first few iterations.

Exploiting information: SIMULATION ⇒ ALGORITHM
The setting favors an iterative eigensolver in which the solution of P^(ℓ−1) is used to solve P^(ℓ).

Note: the correlation does not follow from the mathematical model; it emerges from numerical analysis of the simulation.

SLIDE 15

Improved Density Functional Theory scheme

Adjacent cycles

ITERATION (ℓ):   P^(ℓ)_{k1} → (X^(ℓ)_{k1}, Λ^(ℓ)_{k1}),   P^(ℓ)_{k2} → (X^(ℓ)_{k2}, Λ^(ℓ)_{k2}),   …,   P^(ℓ)_{kN} → (X^(ℓ)_{kN}, Λ^(ℓ)_{kN}),   each solved with an iterative solver.

ITERATION (ℓ+1):   P^(ℓ+1)_{k1} → (X^(ℓ+1)_{k1}, Λ^(ℓ+1)_{k1}),   …,   P^(ℓ+1)_{kN} → (X^(ℓ+1)_{kN}, Λ^(ℓ+1)_{kN}),   each solved with an iterative solver.

Next cycle …

Here X ≡ {x_1, …, x_n} and Λ ≡ diag(λ_1, …, λ_n).

SLIDE 18

Algorithmic choice

Direct solvers are traditionally paired with dense matrices, iterative solvers with sparse matrices.

For an iterative solver on dense problems the disadvantage is the many matrix-vector multiplications; the advantage is that xGEMM can be applied to blocks of approximate eigenvectors.

SLIDE 19

Outline

1. Sequences of generalized eigenproblems in all-electron computations
2. The algorithm: Chebyshev Filtered Sub-space Iteration method (ChFSI)
3. ChFSI parallelization and numerical tests

SLIDE 21

Selecting an iterative eigensolver

An iterative algorithm has to comply with two main properties:

1. the ability to receive as input a sizable set of approximate eigenvectors Z_0 (extracted from X^(ℓ−1)_{k_i});
2. the capacity to solve simultaneously for a substantial portion of the eigenpairs.

ChFSI constitutes the natural choice:
• it accepts the full set of multiple starting vectors;
• it avoids stalling when facing small clusters of eigenvalues;
• when augmented with polynomial accelerators it has a much faster convergence rate;
• converged eigenvectors can be easily locked;
• the degree of the polynomial can be suitably optimized.

SLIDE 22

ChFSI pseudocode

INPUT: Hamiltonian H, approximate eigenpairs (Λ, Z_0), TOL, DEG.
OUTPUT: wanted eigenpairs W.

1. Lanczos step. Identify the bounds of the eigenspectrum interval.
REPEAT UNTIL CONVERGENCE:
2. Chebyshev filter. Filter a block of vectors W ← Z_0.
3. Re-orthogonalize the vectors output by the filter: W = QR.
4. Compute the Rayleigh quotient G = Q†HQ.
5. Compute the primitive Ritz pairs (Λ, Y) by solving GY = YΛ.
6. Compute the approximate Ritz pairs (Λ, W ← QY).
7. Check which of the Ritz vectors have converged.
8. Deflate and lock the converged vectors.
END REPEAT
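A minimal C sketch of the Rayleigh–Ritz core of this loop (steps 3–6), written for the real symmetric case with CBLAS and LAPACKE for brevity; the actual ChFSI implementation works on complex Hermitian matrices and distributed data, so this illustrates the algorithmic structure rather than the authors' code. The helper name rayleigh_ritz is ours.

#include <stdlib.h>
#include <cblas.h>
#include <lapacke.h>

/* One Rayleigh-Ritz step of the subspace iteration (steps 3-6 above):
   W = QR, G = Q^T H Q, G Y = Y Lambda, W <- Q Y.
   H is n x n, W holds nev filtered vectors (column major), lambda
   receives the nev Ritz values.  Error checking is omitted. */
void rayleigh_ritz(int n, int nev, const double *H, double *W, double *lambda)
{
    double *tau = malloc((size_t)nev * sizeof *tau);
    double *HQ  = malloc((size_t)n * nev * sizeof *HQ);
    double *G   = malloc((size_t)nev * nev * sizeof *G);

    /* Step 3: W = QR; the orthonormal factor Q overwrites W. */
    LAPACKE_dgeqrf(LAPACK_COL_MAJOR, n, nev, W, n, tau);
    LAPACKE_dorgqr(LAPACK_COL_MAJOR, n, nev, nev, W, n, tau);

    /* Step 4: Rayleigh quotient G = Q^T H Q (two xGEMMs). */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, nev, n, 1.0, H, n, W, n, 0.0, HQ, n);
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                nev, nev, n, 1.0, W, n, HQ, n, 0.0, G, nev);

    /* Step 5: primitive Ritz pairs G Y = Y Lambda; Y overwrites G. */
    LAPACKE_dsyev(LAPACK_COL_MAJOR, 'V', 'U', nev, G, nev, lambda);

    /* Step 6: approximate Ritz vectors W <- Q Y (HQ reused as scratch). */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, nev, nev, 1.0, W, n, G, nev, 0.0, HQ, n);
    for (size_t k = 0; k < (size_t)n * nev; ++k) W[k] = HQ[k];

    free(tau); free(HQ); free(G);
}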

SLIDE 23

The core of the algorithm: Chebyshev filter

Chebyshev polynomials

The Chebyshev polynomial C_m of the first kind of order m is defined as

C_m(x) = cos(m arccos(x)) for x ∈ [−1, 1],   and   C_m(x) = cosh(m arccosh(x)) for |x| > 1.

[Plots of C_m(x) on [−3, 3] for degrees m = 5, 10, 15, 20: outside [−1, 1] the magnitude grows rapidly with the degree, reaching order 10^6 for m = 10, 10^10 for m = 15, and 10^14 for m = 20.]

Three-term recurrence relation

C_{m+1}(x) = 2x C_m(x) − C_{m−1}(x),   m ∈ ℕ,   with C_0(x) = 1 and C_1(x) = x.
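A short, self-contained C illustration of this recurrence (an added example, not part of the original slides); the rapid growth of C_m(x) outside [−1, 1] is exactly the property the filter exploits to damp unwanted eigendirections.

#include <stdio.h>

/* Evaluate C_m(x) via the three-term recurrence
   C_{m+1}(x) = 2x C_m(x) - C_{m-1}(x), with C_0 = 1, C_1 = x. */
double cheb(int m, double x)
{
    double c_prev = 1.0, c_curr = x;   /* C_0 and C_1 */
    if (m == 0) return c_prev;
    for (int i = 1; i < m; ++i) {
        double c_next = 2.0 * x * c_curr - c_prev;
        c_prev = c_curr;
        c_curr = c_next;
    }
    return c_curr;
}

int main(void)
{
    /* Outside [-1, 1] the value grows quickly with the degree. */
    for (int m = 5; m <= 20; m += 5)
        printf("C_%d(1.5) = %e\n", m, cheb(m, 1.5));
    return 0;
}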

SLIDE 24

The core of the algorithm: Chebyshev filter

The pseudocode

A simple linear transformation mapping [−1, 1] → [α, β] ⊂ ℝ defines c = (β + α)/2 as the center of the interval and e = (β − α)/2 as its half-width.

INPUT: Hamiltonian H, approximate eigenvectors Z_0, lowest eigenvalue λ_1, interval parameters c and e, DEG.
OUTPUT: filtered vectors W.

1. σ_1 ← e / (λ_1 − c)
2. Z_1 ← (σ_1/e) (H − c I_n) Z_0   [xGEMM]
FOR i = 1 → DEG − 1:
3.   σ_{i+1} ← 1 / (2/σ_1 − σ_i)
4.   Z_{i+1} ← (2σ_{i+1}/e) (H − c I_n) Z_i − σ_{i+1} σ_i Z_{i−1}   [xGEMM]
END FOR
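A minimal C sketch of this filter loop, written for the real symmetric case with CBLAS (the production kernels operate on complex Hermitian, distributed matrices); the block update is carried by xGEMM exactly as annotated in the pseudocode. The helper name cheb_filter is ours.

#include <stdlib.h>
#include <string.h>
#include <cblas.h>

/* Apply the scaled, shifted Chebyshev filter of degree deg to a block Z
   of nev vectors.  H is n x n, Z and W are n x nev (column major).
   Follows steps 1-4 above; no error checking. */
void cheb_filter(int n, int nev, const double *H, const double *Z,
                 double lambda1, double c, double e, int deg, double *W)
{
    size_t len = (size_t)n * nev;
    double *Zp = malloc(len * sizeof *Zp);   /* Z_{i-1} */
    double *Zc = malloc(len * sizeof *Zc);   /* Z_i     */
    double *Zn = malloc(len * sizeof *Zn);   /* Z_{i+1} */
    double sigma1 = e / (lambda1 - c), sigma = sigma1;

    /* Step 2: Z_1 = (sigma1/e) (H - c I) Z_0, done as one xGEMM with
       C preloaded with Z_0 and beta = -c*sigma1/e. */
    memcpy(Zp, Z, len * sizeof *Zp);
    memcpy(Zc, Z, len * sizeof *Zc);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, nev, n,
                sigma1 / e, H, n, Zp, n, -c * sigma1 / e, Zc, n);

    for (int i = 1; i < deg; ++i) {
        double signew = 1.0 / (2.0 / sigma1 - sigma);   /* step 3 */
        /* Step 4: preload Zn with the two scaled terms, then add the
           (2 signew/e) H Z_i contribution with one xGEMM. */
        for (size_t k = 0; k < len; ++k)
            Zn[k] = -signew * sigma * Zp[k] - (2.0 * signew * c / e) * Zc[k];
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, nev, n,
                    2.0 * signew / e, H, n, Zc, n, 1.0, Zn, n);
        double *tmp = Zp; Zp = Zc; Zc = Zn; Zn = tmp;   /* rotate buffers */
        sigma = signew;
    }
    memcpy(W, Zc, len * sizeof *W);
    free(Zp); free(Zc); free(Zn);
}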

SLIDE 25

Outline

1. Sequences of generalized eigenproblems in all-electron computations
2. The algorithm: Chebyshev Filtered Sub-space Iteration method (ChFSI)
3. ChFSI parallelization and numerical tests

SLIDE 26

Experimental tests setup

C language implementation of ChFSI

• OMPChFSI: OpenMP parallelization for shared memory;
• EleChFSI: Elemental (MPI) parallelization for distributed memory;
• Matrix sizes: 2,600–13,300;
• We used the standard form for the problem: Ax = λBx → A′y = λy with A′ = L^−1 A L^−T and y = L^T x (a reduction sketch follows this list);
• Tests were performed on the JUROPA cluster: 2 Intel Xeon 5570 (Nehalem-EP) quad-core processors per node, 2.93 GHz; 24 GB per node;
• Theoretical peak performance per core: 11.71 GFLOPS;
• Minimum absolute tolerance of residuals r_x = 10^−10;
• All numerical data are MEDIAN values over 12 distinct measurements.
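A minimal sketch of the reduction to standard form using LAPACKE, for the real symmetric case (the FLAPW matrices are complex Hermitian, for which zpotrf/zhegst would be used instead); error handling is reduced to returning the LAPACK info code, and the helper name reduce_to_standard is ours.

#include <lapacke.h>

/* Reduce A x = lambda B x to A' y = lambda y with A' = L^-1 A L^-T and
   y = L^T x.  On exit B holds its Cholesky factor L and A holds A'. */
int reduce_to_standard(int n, double *A, double *B)
{
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, B, n);  /* B = L L^T */
    if (info != 0) return info;
    /* itype = 1, lower factor: A is overwritten with L^-1 A L^-T. */
    return LAPACKE_dsygst(LAPACK_COL_MAJOR, 1, 'L', n, A, n, B, n);
}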

SLIDE 27

Distributed memory: “Elemental” framework

The core of the library is the two-dimensional cyclic element-wise matrix distribution. p MPI processes are logically viewed as a two-dimensional r × c process grid with p = r × c. The matrix A = [a_ij] is distributed over the grid in such a way that process (s, t) owns the local matrix

A_{s,t} = [ a_{γ,δ}     a_{γ,δ+c}     …
            a_{γ+r,δ}   a_{γ+r,δ+c}   …
            ⋮           ⋮               ]

with γ ≡ (s + σ_r) mod r and δ ≡ (t + σ_c) mod c, where σ_r and σ_c are arbitrarily chosen alignment parameters.
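As an illustration of this distribution rule (a stand-alone sketch, not Elemental's actual API), a small C helper that maps a global index pair (i, j) to the owning process (s, t) of an r × c grid with alignments σ_r, σ_c:

#include <stdio.h>

/* Owner (s, t) of global element (i, j) under element-wise 2D cyclic
   distribution: i = ((s + sr) mod r) + k*r and j = ((t + sc) mod c) + m*c. */
void owner(int i, int j, int r, int c, int sr, int sc, int *s, int *t)
{
    *s = ((i - sr) % r + r) % r;
    *t = ((j - sc) % c + c) % c;
}

int main(void)
{
    int s, t;
    /* Example: an 8 x 8 grid with zero alignment; element (13, 5). */
    owner(13, 5, 8, 8, 0, 0, &s, &t);
    printf("element (13,5) lives on process (%d,%d)\n", s, t);  /* (5,5) */
    return 0;
}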

SLIDE 28

Distributed memory: “Elemental” framework

The overall performance is influenced by the grid shape and the algorithmic block size.

Grid shape

For a given number p > 1 of processes there are several possible choices of r and c forming a grid shape (r, c) with p = r × c (see the enumeration sketch below). The choice can have a significant impact on the overall performance and is usually influenced by the algorithm.

Algorithmic block size

The algorithmic block size is the size of the blocks of input data, correlated to the square root of the L2 cache size. It depends not only on the algorithm itself but is also affected by the architecture.
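For illustration, a tiny C snippet (ours, not part of the original setup) enumerating the admissible grid shapes (r, c) for a given p, i.e. the shapes that appear on the performance plot of the next slide:

#include <stdio.h>

/* Print every grid shape (r, c) with r * c == p, e.g. p = 16 gives
   (1,16), (2,8), (4,4), (8,2), (16,1). */
int main(void)
{
    int p = 16;
    for (int r = 1; r <= p; ++r)
        if (p % r == 0)
            printf("(%d,%d)\n", r, p / r);
    return 0;
}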

SLIDE 29

Distributed memory: “Elemental” framework

ℓ = 20, size n = 13,379, and number of sought-after eigenpairs nev = 972. The problem was solved with EleChFSI using 16, 32, and 64 cores, all possible grid shapes (r, c), and three distinct algorithmic block sizes.

[Figure: the interplay between algorithmic block size and grid shape; solve time (seconds) for every admissible grid shape on 16, 32, and 64 cores, with algorithmic block sizes 64, 128, and 256.]

SLIDE 30

Speed-up

Speed-up = CPU time (input: random vectors) / CPU time (input: approximate eigenvectors)

[Figure: speed-up versus iteration index (3–23) for Au98Ag10, n = 13,379, on 128 cores; the measured speed-up lies roughly between 1 and 4.]

SLIDE 31

Scalability

Strong scalability (the size of the eigenproblem is kept fixed while the number of cores is progressively increased) of EleChFSI for two systems of size n = 13,379 and n = 9,273, respectively.

[Figure: solve time (seconds, log scale) versus number of cores (8–256) for Au98Ag10 and Na15Cl14Li; the times scale down nearly proportionally with the core count, from roughly 800 s and 170 s on 8 cores to roughly 32 s and 7 s on 256 cores.]

SLIDE 32

Scalability

Weak scalability (fixed ratio of data per process) for EleChFSI. Times are weighted a posteriori to take into account the ratio of operations per data.

[Figure: weighted time (seconds) versus number of cores (16–184) as the system size grows from 3,893 to 13,379; the weighted times stay nearly constant, at roughly 33–35 s for Au98Ag10 and 12 s for Na15Cl14Li.]

SLIDE 33

EleChFSI versus direct solvers (parallel MRRR)

For the size of the eigenproblems tested here, the ScaLAPACK implementation of BXINV is comparable with EleMRRR. For this reason a direct comparison with ScaLAPACK is not included.

[Figure: time (seconds) per iteration, iterations 3–12, for EleChFSI versus EleMRRR on 64 and 128 cores.]

SLIDE 34

Conclusions and future work

• Computing MORE (with higher efficiency) grants more performance than computing LESS (with lower efficiency).
• (Preconditioning) ChFSI with the eigenvectors of P^(ℓ−1) to speed up the solution of P^(ℓ) is a successful strategy;
• The parallelization of ChFSI shows great potential for scalability and efficiency;
• The improved use of computing resources, together with strong scalability, will give access to larger physical systems.

ONGOING AND FUTURE WORK

1. Finalizing the filter optimization by adjusting the degree of the polynomial so as to just achieve the required eigenvector residuals;
2. Parallelization of ChFSI for GPUs.

SLIDE 35

References

1. E. Di Napoli and M. Berljafa, A Parallel and Scalable Iterative Solver for Sequences of Dense Eigenproblems Arising in FLAPW, to be published in the proceedings of PPAM 2013 [arXiv:1305.5120].
2. E. Di Napoli and M. Berljafa, Block Iterative Eigensolvers for Sequences of Correlated Eigenvalue Problems, Comp. Phys. Comm. 184 (2013), pp. 2478–2488 [arXiv:1206.3768].
3. E. Di Napoli, P. Bientinesi, and S. Blügel, Correlation in sequences of generalized eigenproblems arising in Density Functional Theory, Comp. Phys. Comm. 183 (2012), pp. 1674–1682 [arXiv:1108.2594].
