SLIDE 1

Member of the Helmholtz Association

Parallel block Chebyshev subspace iteration algorithm optimized for sequences of correlated dense eigenproblems

ERCIM 2012, Oviedo, Spain, Dec. 2nd

  • M. Berljafa and E. Di Napoli
SLIDE 2

Motivation and Goals

Reverse Simulation

Physical quantities (total energy, band energy, gap, conductivity, forces, etc.) ← Simulations ⟺ Mathematical model + Algorithmic structure

Goal: increasing the performance of large legacy codes by exploiting physical information extracted from the simulations, which can be used to speed up the algorithms used in such codes.

SLIDE 3

Outline

  • Stating the problem: how sequences of generalized eigenproblems arise in all-electron computations
  • Eigenvectors angle evolution
  • The algorithm: Chebyshev Filtered Sub-space Iteration method (ChFSI)
  • Exploiting approximate eigenvectors: numerical results

SLIDE 5

The Foundations

Investigative framework

Quantum Mechanics and its ingredients

Hamiltonian:

H = −(ħ²/2m) ∑_{i=1}^{n} ∇ᵢ² − ∑_{i=1}^{n} ∑_α Z_α/|xᵢ − a_α| + ∑_{i<j} 1/|xᵢ − xⱼ|

Wavefunction:

Φ(x₁;s₁, x₂;s₂, …, xₙ;sₙ),   Φ : (ℝ³ × {±½})ⁿ → ℝ

a high-dimensional anti-symmetric function that describes the orbitals of atoms and molecules. In the Born–Oppenheimer approximation, it is the solution of the

Electronic Schrödinger Equation

H Φ(x₁;s₁, x₂;s₂, …, xₙ;sₙ) = E Φ(x₁;s₁, x₂;s₂, …, xₙ;sₙ)

SLIDE 6

Density Functional Theory (DFT)

1. The many-body wavefunction is reduced to single-particle orbitals: Φ(x₁;s₁, x₂;s₂, …, xₙ;sₙ) ⟹ Λ_{i,a} φ_a(xᵢ;sᵢ)
2. Electron density: n(r) = ∑_a |φ_a(r)|²
3. In the Schrödinger equation the exact Coulomb interaction is substituted with an effective potential V₀(r) = V_I(r) + V_H(r) + V_xc(r)

Hohenberg–Kohn theorem
  • ∃ a one-to-one correspondence n(r) ↔ V₀(r) ⟹ V₀(r) = V₀(r)[n]
  • ∃! a functional E[n] such that E₀ = min_n E[n]

The high-dimensional Schrödinger equation translates into a set of coupled, non-linear, low-dimensional, self-consistent Kohn–Sham (KS) equations: ∀a solve

Ĥ_KS φ_a(r) = ( −(ħ²/2m) ∇² + V₀(r) ) φ_a(r) = ε_a φ_a(r)

SLIDE 7

Discretized Kohn-Sham scheme

Self-consistent cycle

Initial guess n_start(r) ⟹
Compute KS potential V₀(r)[n] →
Solve a set of eigenproblems P_{k1} … P_{kN} →
Compute new density n(r) = ∑_{k,ν} |φ_{k,ν}(r)|² →
Converged? |n^(ℓ+1) − n^(ℓ)| < η
  Yes ⟹ OUTPUT: energy, …
  No → back to "Compute KS potential"

FLAPW details

Observations:

1. every P_k : Ax = λBx is a generalized eigenvalue problem;
2. A and B are DENSE and Hermitian (B is also positive definite);
3. P_k's with different k index have different sizes and are independent from each other;
4. k = 1:10–100; ℓ = 1:20–50.
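The cycle above can be made concrete with a toy model in which the effective "potential" is simply diag(n); everything here (the model, the function name `scf_cycle`, the mixing parameter) is an illustrative sketch, not the FLAPW discretization:

```python
import numpy as np

def scf_cycle(H0, nev=2, eta=1e-10, mix=0.5, max_iter=100):
    """Toy self-consistent cycle: the effective 'potential' is diag(n)."""
    n = np.zeros(H0.shape[0])                    # initial density guess
    for it in range(max_iter):
        H = H0 + np.diag(n)                      # compute the 'KS potential' V0[n]
        w, V = np.linalg.eigh(H)                 # solve the eigenproblem
        n_new = np.sum(V[:, :nev] ** 2, axis=1)  # n(r) = sum_a |phi_a(r)|^2
        if np.linalg.norm(n_new - n) < eta:      # converged?
            return n_new, it + 1
        n = (1 - mix) * n + mix * n_new          # simple linear mixing
    return n, max_iter

# A small symmetric "Hamiltonian" with well-separated levels.
H0 = np.diag([0.0, 1.0, 3.0, 4.0]) + 0.1
n, iters = scf_cycle(H0)
```

With a large gap between occupied and unoccupied levels, the fixed-point iteration contracts quickly; the linear mixing is the simplest stand-in for the density-mixing schemes used in real codes.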

SLIDE 8

Algorithmic digression

Direct solvers. Iterative solvers.

SLIDE 10

Algorithmic digression

Direct solvers. Iterative solvers.

       ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗               ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗       

|λ1|>|λ2| > |λ3| > ... Axj = λjxj v = ∑j γjxj Av = ∑j λjγjxj ⇒ Akv = ∑j λk

j γjxj

Rate of convergence → magnitude of

  • λ1

λj
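The expansion above is the essence of the power method: repeated application of A amplifies the dominant eigendirection at a rate set by |λ₂/λ₁|. A minimal NumPy sketch (illustrative, not part of the talk):

```python
import numpy as np

def power_iteration(A, num_iters=200, seed=0):
    """Approximate the dominant eigenpair of A by repeated application."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = A @ v                   # A^k v = sum_j lambda_j^k gamma_j x_j
        v = w / np.linalg.norm(w)   # renormalize to avoid overflow
    lam = v @ A @ v                 # Rayleigh quotient estimate of lambda_1
    return lam, v

# Well-separated spectrum: |lambda_2/lambda_1| = 0.4, so convergence is fast.
A = np.diag([5.0, 2.0, 1.0])
lam, v = power_iteration(A)
```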

SLIDE 11

Algorithmic digression

Direct solvers. Iterative solvers. Dense matrices. Sparse matrices.

SLIDE 12

Outline

  • Stating the problem: how sequences of generalized eigenproblems arise in all-electron computations
  • Eigenvectors angle evolution
  • The algorithm: Chebyshev Filtered Sub-space Iteration method (ChFSI)
  • Exploiting approximate eigenvectors: numerical results

SLIDE 13

Sequences of eigs: an ALGORITHM ⇐ SIM case

Sequences of eigenproblems

Consider the set of generalized eigenproblems P(1) … P(ℓ) P(ℓ+1) … P(N) not as a set of disjoint problems {P(ℓ)}_N, but as a sequence. Could this sequence of eigenproblems evolve following a convergence pattern in line with the convergence of n(r)?

REVERSE SIMULATION method:
  • numerical simulations analyzed employing a parameter-based "inverse" problem method;
  • collected data on deviation angles between eigenvectors of adjacent eigenproblems;
  • identified "evolutions" of eigenvectors along the sequence.

SLIDE 14

Angle evolution

fixed k

Example: a metallic compound at fixed k

Figure: evolution of the subspace angle between eigenvectors of adjacent iterations (iterations 2 → 22, angles from 10⁻¹⁰ to 10⁻²) for k-point 1 and the lowest 75 eigenpairs of AuAg.

SLIDE 15

Correlation and its exploitation

  • ∃ correlation between successive eigenvectors x(ℓ−1) and x(ℓ);
  • angles decrease monotonically, with some oscillation;
  • the majority of angles are small after the first few iterations.

Note: the correlation does not follow from the mathematical model; it emerges from numerical analysis of the simulation.

ALGORITHM ⇐ SIM: the stage is favorable to an iterative eigensolver where the eigenvectors of P(ℓ−1) are fed to the solver for P(ℓ).

Next stages of the investigation:
1. development of a block iterative eigensolver that can exploit the correlation;
2. investigate whether approximate eigenvectors can speed up the iterative solver of choice;
3. understand whether such an iterative method can be competitive with direct methods for dense problems.

SLIDE 16

Algorithmic choice

Direct solvers. Iterative solvers. Dense matrices. Sparse matrices.

SLIDE 17

Outline

  • Stating the problem: how sequences of generalized eigenproblems arise in all-electron computations
  • Eigenvectors angle evolution
  • The algorithm: Chebyshev Filtered Sub-space Iteration method (ChFSI)
  • Exploiting approximate eigenvectors: numerical results

SLIDE 18

Chebyshev Filtered Sub-space Iteration method

Two essential properties the iterative algorithm has to comply with:
1. the ability to receive as input a sizable set of approximate eigenvectors;
2. the capacity to solve simultaneously for a substantial portion of eigenpairs.

ChFSI constitutes the natural choice:
  • it accepts the full set of multiple starting vectors;
  • it avoids stalling when facing small clusters of eigenvalues;
  • when augmented with polynomial accelerators it has a much faster convergence rate;
  • converged eigenvectors can easily be locked.

SLIDE 19

Pseudocode

INPUT: Hamiltonian, approximation for the eigenpairs – (Λ,W), TOL, DEG. OUTPUT: Wanted eigenpairs.

1 Lanczos step. Identify the bounds for the interval to be filtered out.

REPEAT UNTIL CONVERGENCE:

2. Chebyshev filter: filter a block of vectors W.
3. QR decomposition: re-orthogonalize the vectors output by the filter.
4. Compute the Rayleigh quotient G = WᴴHW.
5. Compute the primitive Ritz pairs (Λ, Q).
6. Compute the approximate Ritz pairs (Λ, WQ).
7. Check which of the Ritz vectors have converged.
8. Deflate and lock the converged vectors.

END REPEAT
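The loop can be sketched for a standard Hermitian problem as follows. This is an illustrative NumPy translation of the pseudocode, not the authors' implementation: the Lanczos bound estimation is replaced by an exact eigvalsh call, and deflation/locking is omitted for brevity.

```python
import numpy as np

def cheb_filter(H, Z, deg, a, b):
    """Damp eigencomponents inside [a, b] with a Chebyshev recurrence."""
    c, e = (a + b) / 2.0, (b - a) / 2.0   # center and half-width of [a, b]
    Y = (H @ Z - c * Z) / e               # degree-1 term
    for _ in range(deg - 1):
        Y_new = 2.0 * (H @ Y - c * Y) / e - Z
        Z, Y = Y, Y_new
    return Y

def chfsi(H, W, deg=15, tol=1e-9, max_iter=50):
    nev = W.shape[1]
    spectrum = np.linalg.eigvalsh(H)      # stand-in for the Lanczos bounds step
    a, b = spectrum[nev], spectrum[-1]    # interval to be filtered out
    for _ in range(max_iter):
        W = cheb_filter(H, W, deg, a, b)  # 2: Chebyshev filter
        W, _ = np.linalg.qr(W)            # 3: re-orthogonalize
        G = W.T @ H @ W                   # 4: Rayleigh quotient
        lam, Q = np.linalg.eigh(G)        # 5: primitive Ritz pairs
        W = W @ Q                         # 6: approximate Ritz pairs
        R = H @ W - W * lam               # 7: residuals (no locking here)
        if np.max(np.linalg.norm(R, axis=0)) < tol:
            break
    return lam, W

rng = np.random.default_rng(0)
H = np.diag(np.arange(60.0))
P = rng.standard_normal((60, 60))
H += 0.005 * (P + P.T)                    # symmetric test Hamiltonian
lam, W = chfsi(H, rng.standard_normal((60, 4)))
```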

SLIDE 20

The core of the algorithm: Chebyshev filter

Definition

The Chebyshev polynomial C_m of the first kind of order m is defined as

C_m(x) = cos(m arccos(x)),  x ∈ [−1, 1];
C_m(x) = cosh(m arccosh(x)),  x > 1;  and C_m(−x) = (−1)^m C_m(x) for x < −1.
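A quick check of this definition against the three-term recurrence (illustrative Python; the (−1)^m reflection handles x < −1, where arccosh alone is undefined):

```python
import math

def cheb(m, x):
    """C_m via the cos/cosh definition, extended to x < -1 by parity."""
    if abs(x) <= 1:
        return math.cos(m * math.acos(x))
    sign = 1 if x > 0 else (-1) ** m      # C_m(-x) = (-1)^m C_m(x)
    return sign * math.cosh(m * math.acosh(abs(x)))

def cheb_rec(m, x):
    """Same polynomial via C_{m+1}(x) = 2x C_m(x) - C_{m-1}(x)."""
    c_prev, c_cur = 1.0, x
    if m == 0:
        return c_prev
    for _ in range(m - 1):
        c_prev, c_cur = c_cur, 2.0 * x * c_cur - c_prev
    return c_cur

# Outside [-1, 1] the polynomial blows up: this is what the filter exploits.
growth = cheb(20, 1.5)
```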

Figure: C_m(x) plotted on [−3, 3] for degrees 5, 10, 15, and 20; outside [−1, 1] the magnitude grows rapidly, reaching order 10⁶, 10¹⁰, and 10¹⁴ for degrees 10, 15, and 20.
SLIDE 21

The core of the algorithm: Chebyshev filter

C₀(x) = 1,  C₁(x) = x,  C_{m+1}(x) = 2x C_m(x) − C_{m−1}(x),  m ∈ ℕ.

INPUT: Hamiltonian H, approximate eigenvectors Z₀, lowest eigenvalue λ₁, parameters for the interval c, e, DEG.
OUTPUT: filtered vectors Z_DEG.

1. σ₁ ← e/(λ₁ − c)
2. Z₁ ← (σ₁/e)(H − cIₙ)Z₀
FOR i = 1 → DEG − 1:
3.   σ_{i+1} ← 1/(2/σ₁ − σᵢ)
4.   Z_{i+1} ← (2σ_{i+1}/e)(H − cIₙ)Zᵢ − σ_{i+1}σᵢZ_{i−1}
END FOR
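Term by term, this recurrence builds Zᵢ = Cᵢ((H − cI)/e) Z₀ / Cᵢ((λ₁ − c)/e): eigendirections inside [c − e, c + e] are damped while those near λ₁ are kept at magnitude ~1. An illustrative NumPy transcription (the function name and the diagonal test matrix are mine, not the authors'):

```python
import numpy as np

def scaled_cheb_filter(H, Z0, lam1, c, e, deg):
    """Scaled Chebyshev filter: damp eigendirections in [c - e, c + e],
    normalized so the component at lam1 keeps magnitude ~1."""
    sigma1 = e / (lam1 - c)                       # step 1
    sigma = sigma1
    Z_prev = Z0
    Z = (sigma1 / e) * (H @ Z0 - c * Z0)          # step 2
    for _ in range(deg - 1):
        sigma_next = 1.0 / (2.0 / sigma1 - sigma)            # step 3
        Z_next = (2.0 * sigma_next / e) * (H @ Z - c * Z) \
                 - sigma_next * sigma * Z_prev               # step 4
        Z_prev, Z, sigma = Z, Z_next, sigma_next
    return Z

# Diagonal test: the lowest eigenvalue 0 is kept, [2, 4] is filtered out.
H = np.diag([0.0, 2.0, 3.0, 4.0])
Y = scaled_cheb_filter(H, np.ones((4, 1)), lam1=0.0, c=3.0, e=1.0, deg=10)
v = Y[:, 0] / np.linalg.norm(Y[:, 0])
```

After degree 10 the filtered components are suppressed by roughly 1/C₁₀(3) ≈ 4·10⁻⁸ relative to the kept one.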

SLIDE 22

Outline

  • Stating the problem: how sequences of generalized eigenproblems arise in all-electron computations
  • Eigenvectors angle evolution
  • The algorithm: Chebyshev Filtered Sub-space Iteration method (ChFSI)
  • Exploiting approximate eigenvectors: numerical results

SLIDE 23

Experimental tests setup

Matrix sizes: 2,600 ÷ 13,300.

B ill-conditioned: B is in general almost singular. Examples: size(A) = 50 → κ(A) ≈ 10⁴; size(A) = 500 → κ(A) ≈ 10⁷.

We used the standard form of the problem: Ax = λBx → A′y = λy with A′ = L⁻¹AL⁻ᵀ and y = Lᵀx.
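The reduction to standard form comes from the Cholesky factorization B = LLᵀ. A minimal NumPy sketch (illustrative, not the authors' C implementation; at scale one would use triangular solves rather than an explicit inverse):

```python
import numpy as np

def to_standard_form(A, B):
    """Reduce A x = lambda B x to A' y = lambda y via B = L L^T,
    with A' = L^{-1} A L^{-T} and y = L^T x."""
    L = np.linalg.cholesky(B)          # requires B positive definite
    Linv = np.linalg.inv(L)            # triangular solves would be used at scale
    return Linv @ A @ Linv.T, L

def back_transform(L, Y):
    """Recover the generalized eigenvectors x from y = L^T x."""
    return np.linalg.solve(L.T, Y)

rng = np.random.default_rng(1)
n = 20
A = rng.standard_normal((n, n))
A = (A + A.T) / 2.0                    # Hermitian (real symmetric here)
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)            # SPD, playing the role of the overlap matrix
Ap, L = to_standard_form(A, B)
lam, Y = np.linalg.eigh(Ap)            # standard symmetric eigenproblem
X = back_transform(L, Y)               # generalized eigenvectors
```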

The solver is a version of ChFSI implemented in C.

Tests:
  • Approximate vs. random vectors fed to ChFSI, against iteration index;
  • Parallel ChFSI vs. direct methods, against iteration index.

Tests were performed on JUROPA using 1 node with 8 cores: 2 Intel Xeon 5570 (Nehalem-EP) quad-core processors at 2.93 GHz; 24 GB/node; SuSE Linux. All numerical data are MEDIAN values over 12 distinct measurements.

SLIDE 24

Speed-up for sequential ChFSI

Figure: speed-up vs. iteration index for Nev = 256 and three distinct matrix sizes (NaCl, n = 3893, 6217, 9273); speed-ups range from 1x to about 2.8x over iterations 2–14.
SLIDE 25

Speed-up for sequential ChFSI

Figure: speed-up vs. iteration index for Nev = 972 and two distinct matrix sizes (AuAg, n = 5638, 8970); speed-ups range from 1x to about 4x over iterations 5–30.
SLIDE 26

Sequential to Parallel: what and how?

Fraction of computing time for sequential ChFSI extracted from AuAg system (n=8970, nev=972) at iteration 23.

Figure: fraction of computing time per task; the Chebyshev filter dominates at about 90%, with Rayleigh–Ritz around 6%, the Lanczos step around 4%, and the residuals/convergence check below 1%.

SLIDE 27

Sequential to Parallel: what and how?

Speed-up of multi-threaded BLAS ChFSI vs sequential ChFSI over 8 cores when fed approx. eigenvectors.

Figure: speed-up of multi-threaded ChFSI over sequential ChFSI vs. iteration index, between 7x and 8x, for AuAg (n = 8970) and NaCl (n = 6217).
SLIDE 28

Sequential to Parallel: what and how?

Speed-up of sequential ChFSI and multi-threaded BLAS ChFSI.

Figure: speed-up of approximate vs. random starting vectors w.r.t. iteration index, from about 0.5x to 5x, for AuAg (n = 8970) on 8 cores and on 1 core.
SLIDE 29

Recapping

1. Approximate vs. random:
   • sequential ChFSI achieves speed-ups in the range 1.5x ÷ 3.5x;
   • the multi-threaded version of ChFSI achieves speed-ups up to 5x.
2. Multi-threaded ChFSI:
   • just by using the multi-threaded version of BLAS, ChFSI achieves speed-ups above 7x;
   • the larger the eigenproblem, the better ChFSI performs.

UNANSWERED QUESTIONS: Can we do better on shared-memory architectures? Can we be faster than the direct methods?

SLIDE 30

Recapping

ANSWERS: Can we do better on shared-memory architectures? YES → OpenMP. Can we be faster than the direct methods? It depends on the percentage of the eigenspectrum sought after and on the iteration index.

SLIDE 31

Parallelizing the filter with OMP

Input: Hamiltonian H, vectors to be filtered X, DEG, lower, upper.
Output: filtered vectors Y.

1: for i = 1 to DEG do
2:   Compute α. Compute β.
3:   Y = αHX + βY.
4:   SWAP(X, Y).
5: end for
6: SWAP(X, Y).

Figure: matrix–matrix multiplication scheme (Hamiltonian × vectors to be filtered = filtered vectors).
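The column-block decomposition of the multiplication can be mimicked in a few lines. The sketch below uses Python threads instead of OpenMP (NumPy's BLAS-backed matmul releases the GIL, so blocks genuinely overlap); `filter_step_blocked` and its parameters are illustrative names, not the authors' code:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def filter_step_blocked(H, X, Y, alpha, beta, nthreads=4):
    """One filter step Y <- alpha*H@X + beta*Y on independent column
    blocks, mirroring the OpenMP decomposition of the figure."""
    blocks = np.array_split(np.arange(X.shape[1]), nthreads)
    def work(cols):
        # Each thread owns a disjoint set of columns: no write conflicts.
        Y[:, cols] = alpha * (H @ X[:, cols]) + beta * Y[:, cols]
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        list(ex.map(work, blocks))
    return Y

rng = np.random.default_rng(3)
H = rng.standard_normal((30, 30))
X = rng.standard_normal((30, 8))
Y0 = rng.standard_normal((30, 8))
Y = filter_step_blocked(H, X, Y0.copy(), alpha=1.3, beta=-0.7)
```

Because the vector blocks are independent, this is the part of ChFSI that scales best, which is why the filter is the natural target for parallelization.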

SLIDE 32

OMP ChFSI vs multi-threaded ChFSI

Figure: time to completion for AuAg (n = 13379), OpenMP ChFSI vs. multi-threaded BLAS ChFSI, roughly 800–1700 seconds over iteration indices 2–22.
SLIDE 33

Scalability

OMP ChFSI obtains almost ideal strong scalability: the larger the better.

Figure: BChFSI strong scalability; speed-up vs. number of threads (1–8) is nearly ideal for AuAg (n = 8970), NaCl (n = 6217), and CaFeAs (n = 2612).
SLIDE 34

OMP ChFSI vs Direct eigensolvers

on shared architectures

Comparison with LAPACK MR³ + multi-threaded BLAS and MKL LAPACK MR³ + multi-threaded BLAS.

Figure: CPU time vs. iteration index for AuAg (n = 13379), roughly 200–1800 seconds: LAPACK + multi-threaded BLAS, MKL LAPACK + multi-threaded BLAS, and OpenMP ChFSI solving for the lowest 7.3% of the eigenspectrum.

SLIDE 35

OMP ChFSI vs Direct eigensolvers

on shared architectures

Comparison with LAPACK MR³ + multi-threaded BLAS and MKL LAPACK MR³ + multi-threaded BLAS.

Figure: CPU time vs. iteration index for NaCl (n = 9273), roughly 50–450 seconds: LAPACK + multi-threaded BLAS, MKL LAPACK + multi-threaded BLAS, and OpenMP ChFSI solving for the lowest 2.8% of the eigenspectrum.

SLIDE 36

Conclusions and future work

  • Feeding eigenvectors of P(ℓ−1) to a block iterative solver like ChFSI speeds up the iterative solution of P(ℓ);
  • the algorithmic structure of FLAPW could be re-thought: "reverse simulation" can substantially influence the computational paradigm of an application;
  • ChFSI can be extended and improved:
    1. ongoing work to parallelize ChFSI for distributed-memory architectures ⟹ Elemental;
    2. ongoing effort to optimize the filter by adjusting the degree of the polynomial so as to just achieve the required eigenvector residuals.

SLIDE 37

References

1. E. Di Napoli and M. Berljafa, Block Iterative Eigensolvers for Sequences of Correlated Eigenvalue Problems, submitted to Comp. Phys. Comm. [arXiv:1206.3768].
2. E. Di Napoli, P. Bientinesi, and S. Blügel, Correlation in sequences of generalized eigenproblems arising in Density Functional Theory, Comp. Phys. Comm. 183 (2012), pp. 1674–1682 [arXiv:1108.2594].

SLIDE 38

Kohn-Sham scheme and DFT

Self-consistent cycle

Typically this set of equations is solved using an iterative self-consistent cycle

Initial guess n_init(r) ⟹
Compute KS potential V₀(r)[n] →
Solve KS equations Ĥ_KS φ_a(r) = ε_a φ_a(r) →
Compute new density n(r) = ∑_a |φ_a(r)|² →
Converged? |n^(ℓ+1) − n^(ℓ)| < η
  Yes ⟹ OUTPUT: energy, forces, …
  No → back to "Compute KS potential"

In practice this iterative cycle is much more computationally challenging and requires some form of broadly defined discretization.

SLIDE 39

Generalized eigenvalue problems Ax = λBx

A common way of discretizing the KS equations is to expand the wave functions φ_a(r) on a basis set:

φ_a(r) → φ_{k,ν}(r) = ∑_{|G+k| ≤ K_max} c^G_{k,ν} ψ_G(k,r)

This expansion is then inserted in the KS equations:

ψ*_G(k,r) ∑_{G′} Ĥ_KS c^{G′}_{k,ν} ψ_{G′}(k,r) = λ_{k,ν} ψ*_G(k,r) ∑_{G′} c^{G′}_{k,ν} ψ_{G′}(k,r),

and, by defining the matrix entries for the left- and right-hand side respectively as the Hamiltonian A_k and the overlap matrix B_k,

[A_k]_{GG′} = ∫ dr ψ*_G(k,r) Ĥ_KS ψ_{G′}(k,r),   [B_k]_{GG′} = ∫ dr ψ*_G(k,r) ψ_{G′}(k,r),

one arrives at generalized eigenvalue equations parametrized by k:

P_k : ∑_{G′} (A_k)_{GG′} c^{G′}_{k,ν} = λ_{k,ν} ∑_{G′} (B_k)_{GG′} c^{G′}_{k,ν} ≡ A_k xᵢ = λᵢ B_k xᵢ.
