Enabling large scale LAPW DFT calculations by a scalable iterative - - PowerPoint PPT Presentation

enabling large scale lapw dft calculations by a scalable
SMART_READER_LITE
LIVE PREVIEW

Enabling large scale LAPW DFT calculations by a scalable iterative - - PowerPoint PPT Presentation

Mitglied der Helmholtz-Gemeinschaft Enabling large scale LAPW DFT calculations by a scalable iterative eigensolver CSE15, Salt Lake City. March 17th E. Di Napoli , D. Wortmann, and M. Berljafa Typical Applications Atomic Structure Magnetic


slide-1
SLIDE 1

Mitglied der Helmholtz-Gemeinschaft

Enabling large scale LAPW DFT calculations by a scalable iterative eigensolver

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and
  • M. Berljafa
slide-2
SLIDE 2

Typical Applications

Electronic Structure Atomic Structure Magnetic Structure

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 2

slide-3
SLIDE 3

Outline

The FLAPW method Sequences of correlated eigenproblems The algorithm: Chebyshev Accelerated Subspace Iteration (CHASE) CHASE parallelization and numerical tests

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 3

slide-4
SLIDE 4

Outline

The FLAPW method Sequences of correlated eigenproblems The algorithm: Chebyshev Accelerated Subspace Iteration (CHASE) CHASE parallelization and numerical tests

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 4

slide-5
SLIDE 5

Density Functional Theory (DFT)

1 Φ(x1;s1,x2;s2,...,xn;sn) =

⇒ Λi,aφa(xi;si)

2 density of states n(r) = ∑a fa |φa(r)|2 3 In the Schrödinger equation the exact Coulomb interaction is substituted

with an effective potential V0(r) = VI(r)+VH(r)+Vxc(r)

Hohenberg-Kohn theorem

∃ one-to-one correspondence n(r) ↔ V0(r) = ⇒ V0(r) = V0(r)[n] ∃! a functional E[n] : E0 = minnE[n] The high-dimensional Schrödinger equation translates into a set of coupled non-linear low-dimensional self-consistent Kohn-Sham (KS) equation ∀ a solve ˆ HKSφa(r) =

  • − ¯

h2 2m∇2 +V0(r)

  • φa(r) = εaφa(r)

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 5

slide-6
SLIDE 6

DFT self-consistent field cycle

Initial guess for charge density

nstart(r)

Compute discretized Kohn-Sham equations Solve a set of eigenproblems

P(ℓ)

k1 ...P(ℓ) kN

Compute new charge density

n(ℓ)(r)

Converged?

|n(ℓ) −n(ℓ−1)| < η

OUTPUT Electronic structure,

... No Yes

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 6

slide-7
SLIDE 7

Zoo of methods

LDA GGA LDA + U Hybrid functionals GW-approximation Plane waves Localized basis set Real space grids Green functions All-electron Pseudo-potential Shape approximations Full-potential Spin polarized calculations Finite differences Non-relaticistic eqs. Scalar-relativistic approx, Spin-orbit coupling Dirac equation

  • − ¯

h2 2m∇2 +V0(r)

  • φa(r) = εaφa(r)

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 7

slide-8
SLIDE 8

Introduction to FLAPW

LAPW basis set

ψk,ν(r) =

|G+k|≤Gmax

cG

k,νφG(k,r)

k Bloch vector ν band index φG(k,r) =    ei(k+G)r Interstitial (I)

ℓ,m

  • aα,G

ℓm (k)uα ℓ (r)+bα,G ℓm (k)˙

ℓ (r)

  • Yℓm(ˆ

rα) Muffin Tin

boundary conditions

Continuity of wavefunction and its derivative at MT boundary ⇓ aα,G

ℓm (k)

and bα,G

ℓm (k)

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 8

slide-9
SLIDE 9

Where does the CPU time go?

H and S Eigensolver Charge CPU time PE 50 % 13 % 33% 28 min. 1 27 % 20 % 44 % 36 min. 12 33 % 50 % 17 % 10 min. 30 23 % 61 % 11 % 12 min. 40

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 9

slide-10
SLIDE 10

Where does the CPU time go?

H and S Eigensolver Charge CPU time PE 50 % 13 % 33% 28 min. 1 27 % 20 % 44 % 36 min. 12 33 % 50 % 17 % 10 min. 30 23 % 61 % 11 % 12 min. 40

Solving the generalized eigenvalue problem

1 every P(ℓ)

k

: A(ℓ)

k ck = B(ℓ) k λck is a generalized eigenvalue problem;

2 A and B are DENSE and hermitian (B is positive definite); 3 required: lower 2÷10 % of eigenpairs; 4 momentum vector index: k = 1 : 10÷100; 5 iteration cycle index: ℓ = 1 : 20÷50.

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 9

slide-11
SLIDE 11

Outline

The FLAPW method Sequences of correlated eigenproblems The algorithm: Chebyshev Accelerated Subspace Iteration (CHASE) CHASE parallelization and numerical tests

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 10

slide-12
SLIDE 12

Sequences of Eigenproblems

Adjacent iteration cycles

ITERATION (ℓ) P(ℓ)

k1

(X(ℓ)

k1 ,Λ(ℓ) k1 )

P(ℓ)

k2

(X(ℓ)

k2 ,Λ(ℓ) k2 )

P(ℓ)

kN

(X(ℓ)

kN ,Λ(ℓ) kN )

X ≡ {x1,...,xn}

direct solver direct solver direct solver

ITERATION (ℓ+1) P(ℓ+1)

k1

(X(ℓ+1)

k1

,Λ(ℓ+1)

k1

) P(ℓ+1)

k2

(X(ℓ+1)

k2

,Λ(ℓ+1)

k2

) P(ℓ+1)

kN

(X(ℓ+1)

kN

,Λ(ℓ+1)

kN

) Λ ≡ diag(λ1,...,λn)

direct solver direct solver direct solver

Next cycle

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 11

slide-13
SLIDE 13

Sequences of Eigenproblems

Adjacent iteration cycles

ITERATION (ℓ) P(ℓ)

k1

(X(ℓ)

k1 ,Λ(ℓ) k1 )

P(ℓ)

k2

(X(ℓ)

k2 ,Λ(ℓ) k2 )

P(ℓ)

kN

(X(ℓ)

kN ,Λ(ℓ) kN )

X ≡ {x1,...,xn}

direct solver direct solver direct solver

ITERATION (ℓ+1) P(ℓ+1)

k1

(X(ℓ+1)

k1

,Λ(ℓ+1)

k1

) P(ℓ+1)

k2

(X(ℓ+1)

k2

,Λ(ℓ+1)

k2

) P(ℓ+1)

kN

(X(ℓ+1)

kN

,Λ(ℓ+1)

kN

) Λ ≡ diag(λ1,...,λn)

direct solver direct solver direct solver

Next cycle

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 11

slide-14
SLIDE 14

Sequences of Eigenproblems

Adjacent iteration cycles

ITERATION (ℓ) P(ℓ)

k1

(X(ℓ)

k1 ,Λ(ℓ) k1 )

P(ℓ)

k2

(X(ℓ)

k2 ,Λ(ℓ) k2 )

P(ℓ)

kN

(X(ℓ)

kN ,Λ(ℓ) kN )

X ≡ {x1,...,xn}

direct solver direct solver direct solver

ITERATION (ℓ+1) P(ℓ+1)

k1

(X(ℓ+1)

k1

,Λ(ℓ+1)

k1

) P(ℓ+1)

k2

(X(ℓ+1)

k2

,Λ(ℓ+1)

k2

) P(ℓ+1)

kN

(X(ℓ+1)

kN

,Λ(ℓ+1)

kN

) Λ ≡ diag(λ1,...,λn)

direct solver direct solver direct solver

Next cycle

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 11

slide-15
SLIDE 15

Angles evolution

An example

Example: a metallic compound at fixed k

2 6 10 14 18 22 10

−10

10

−8

10

−6

10

−4

10

−2

10

Evolution of subspace angle for eigenvectors of k−point 1 and lowest 75 eigs

Iterations (2 −> 22) Angle b/w eigenvectors of adjacent iterations

AuAg

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 12

slide-16
SLIDE 16

An alternative solving strategy

Adjacent cycles

ITERATION (ℓ) P(ℓ)

k1

(X(ℓ)

k1 ,Λ(ℓ) k1 )

P(ℓ)

k2

(X(ℓ)

k2 ,Λ(ℓ) k2 )

P(ℓ)

kN

(X(ℓ)

kN ,Λ(ℓ) kN )

X ≡ {x1,...,xn}

iterative solver iterative solver iterative solver

ITERATION (ℓ+1) P(ℓ+1)

k1

(X(ℓ+1)

k1

,Λ(ℓ+1)

k1

) P(ℓ+1)

k2

(X(ℓ+1)

k2

,Λ(ℓ+1)

k2

) P(ℓ+1)

kN

(X(ℓ+1)

kN

,Λ(ℓ+1)

kN

) Λ ≡ diag(λ1,...,λn)

iterative solver iterative solver iterative solver

Next cycle

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 13

slide-17
SLIDE 17

Outline

The FLAPW method Sequences of correlated eigenproblems The algorithm: Chebyshev Accelerated Subspace Iteration (CHASE) CHASE parallelization and numerical tests

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 14

slide-18
SLIDE 18

Chebyshev Filtered Subspace Iteration method

Properties and algorithm evolution

Iterative solver musts input: the full set of multiple starting vectors Z0 ≡ X(ℓ−1)

ki

(:,1 : NEV); needed: it can efficiently use dense linear algebra kernels (i.e. xGEMM); needed: it avoids stalling when facing small clusters of eigenvalues; Chebyshev Subspace Iteration Firstly introduced in [Rutishauser 1969] A version (called CheFSI) tailored to electronic structure computation in [Zhou, Saad, Tiago and Chelikowski 2006] for sparse eigenvalue problems. Our ChASE : 1) is tailored for dense eigenproblem sequences, 2) introduces a locking mechanism, 3) contains a refining inner loop, and 4) optimizes the polynomial degree.

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 15

slide-19
SLIDE 19

The core of the algorithm: Chebyshev filter

Chebyshev polynomials

A generic vector v = ∑n

i=1 sixi is very quickly aligned in the direction of the

eigenvector corresponding to the extremal eigenvalue λ1

vm = pm(A)v =

n

i=1

si pm(A)xi =

n

i=1

si pm(λi)xi = s1x1 +

n

i=2

si Cm( λi−c

e )

Cm( λ1−c

e

) xi ∼ s1x1

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 16

slide-20
SLIDE 20

The core of the algorithm: Chebyshev filter

In practice

Three-terms recurrence relation

Cm+1 (t) = 2xCm (t)−Cm−1 (t); m ∈ N, C0 (t) = 1, C1 (t) = x Zm . = pm( ˜ H) Z0 with ˜ H = H −cIn FOR: i = 1 → DEG −1 Zi+1 ← 2σi+1 e ˜ H × Zi −σi+1σi Zi−1 xGEMM END FOR.

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 17

slide-21
SLIDE 21

Polynomial degree optimization

Convergence ratio and residuals

Definition

The convergence ratio for the eigenvector xi corresponding to eigenvalue λi / ∈ [α,β] is defined as τ(λi) = |ρi|−1 = min

±

  • λi −c

e ± λi −c e 2 −1

  • .

The further away λi is from the interval [α,β] the smaller is |ρi|−1 and the faster the convergence to xi is.

For a set of input vectors V = {v1,v2,...,vnev}

Residuals are a function of m and |ρ|

Res(vm

i ) ∼

Const×

  • 1

ρi

  • m

1 ≤ i ≤ k. Res(vm+m0

i

) ≈ Res(vm0

i )

  • 1

ρi

  • m

⇒ mi ≥ ln

  • TOL

Res(v

m0 i

)

  • /lnρi

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 18

slide-22
SLIDE 22

ChASE pseudocode (optimized)

1 Chebyshev filter. Initial filter W ←

− Z0. with DEG= m0.

2 Re-orthogonalize W = QR & compute the Rayleigh quotient G = Q†HQ. 3 Solve the reduced problem GY = YΛ and compute the approximate Ritz

pairs (Λ,W ← QY) and store their residuals Res(wi).

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 19

slide-23
SLIDE 23

ChASE pseudocode (optimized)

1 Chebyshev filter. Initial filter W ←

− Z0. with DEG= m0.

2 Re-orthogonalize W = QR & compute the Rayleigh quotient G = Q†HQ. 3 Solve the reduced problem GY = YΛ and compute the approximate Ritz

pairs (Λ,W ← QY) and store their residuals Res(wi). REPEAT UNTIL CONVERGENCE:

4 Optimizer. Compute the polynomial degrees mi ≥ ln

  • TOL

Res(wi)

  • /lnρi.

5 Chebyshev filter. Filter W ←

− Z0 with DEG= mi.

6 Re-orthogonalize W = QR & compute the Rayleigh quotient G = Q†HQ. 7 Solve the reduced problem GY = YΛ and compute the approximate Ritz

pairs (Λ,W ← QY).

8 lock the converged vectors. 9 Store the residuals Res(wi) of the unconverged vectors.

END REPEAT

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 19

slide-24
SLIDE 24

Outline

The FLAPW method Sequences of correlated eigenproblems The algorithm: Chebyshev Accelerated Subspace Iteration (CHASE) CHASE parallelization and numerical tests

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 20

slide-25
SLIDE 25

Experimental tests setup

C++ implementation of ChASE

EleChASE – Elemental (MPI) parallelization for distributed memory OMPChASE – OpenMP 4.0 parallelization for shared memory CUChASE – CUDA parallelization for GPUs Interface C++/Fortran so as to call ChASE from FLEUR Tests were performed on the JUROPA and the RWTH RZ cluster. 2 Intel Xeon 5570 (Nehalem-EP) 4d-core processors/node; 2 Intel Xeon E5 2670 (Sandy-Bridge) 8-core processors/node; NVIDIA K20m Xeon Phi Matrix sizes: 2,600 ÷ 29,500.

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 21

slide-26
SLIDE 26

ChASE time profile

As a function of iteration cycles

Time spent in each stage of the algorithm as a function of the iteration index ℓ for a system of size n = 9,273. 3 6 9 12 iteration index ℓ 5 10 15 20 25 time [s] convergence filter Rayleigh - Ritz QR

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 22

slide-27
SLIDE 27

Speed-up

Speed-up =

CPU time (input random vectors) CPU time (input approximate eigenvectors)

3 7 11 15 19 23 Iteration Index

  • 1

2 3 4 Speed-up Au98Ag10 - n =13,379 - 128 cores.

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 23

slide-28
SLIDE 28

Scalability (MPI implementation)

Strong Scalability (the size of the eigenproblems are kept fixed while the number of cores is progressively increased) for EleChASE over three systems of size n = 13,379−12,455−9,273 respectively. 16 32 64 128 256 number of cores 10 102 103 time [s] 196.0 138.0 36.0 102.0 71.0 19.2 55.7 38.2 10.8 384.0 269.0 69.8 30.2 21.4 6.5 Au98Ag10 TiO2 Na15Cl14Li

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 24

slide-29
SLIDE 29

EleChASE versus direct solvers (parallel MRRR)

For the size of eigenproblems here tested the ScaLAPAK implementation of BXINV

  • r MRRR is on par of worse than EleMRRR. For this reason a direct comparison

with ScaLAPACK is not included.

3 6 9 12 eigensystem index ℓ 5 10 15 20 25 30 time [s] EleMRRR EleChFSI, 1e-10 EleChFSI OPT, 1e-10 64 cores 128 cores

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 25

slide-30
SLIDE 30

EleChASE versus direct solvers (parallel MRRR)

For the size of eigenproblems here tested the ScaLAPAK implementation of BXINV

  • r MRRR is on par of worse than EleMRRR. For this reason a direct comparison

with ScaLAPACK is not included.

620 1358 2037 25 50 75 100 125 150 175 200 time [s] ℓ =4 620 1358 2037 number of eigenpairs sought after ℓ =11 620 1358 2037 ℓ =18 EleMRRR ChFSI, 1e-10 ChFSI, 1e-08

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 25

slide-31
SLIDE 31

Offloading to Xeon Phis and GPUs

The role of affinity on Xeon Phi

xGEMM performs at best at 66% of peak performance (1000 GFlops)

60 80 100 120 140 160 180 200 220 240 5 10 15 20 25 Time [s] iteration scatter compact compact small pages

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 26

slide-32
SLIDE 32

Offloading to Xeon Phis and GPUs

GPUs vs CPUs

50 100 150 200 250 300 350 400 5 10 15 20 25 Time [s] iteration 2x Intel Xeon E5-2670 (16 cores) standard 2x Intel Xeon E5-2670 (16 cores) optimized 1x NVIDIA K20m standard 1x NVIDIA K20m optimized 2x NVIDIA K20m standard 2x NVIDIA K20m optimized

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 26

slide-33
SLIDE 33

Offloading to Xeon Phis and GPUs

multi-cores vs many-cores

50 100 150 200 250 300 5 10 15 20 25 Time [s] iteration 2x Intel Xeon E5-2670 (16 cores) 1x NVIDIA K20m 2x NVIDIA K20m 1x XeonPhi optimized (compact)

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 26

slide-34
SLIDE 34

Conclusions and future work

Algorithmic strategy

Sequences of “correlated” eigenproblems ⇒ Tailored algorithms Exploiting the correlation of the eigenproblem sequence to speedup the solution of each P(ℓ) is a successful strategy; Combining iterative methods with kernels for dense linear algebra can pay off. The parallelization shows great potential for scalability and parallel efficiency; Uncovering information can lead to further algorithmic optimizations; ONGOING AND FUTURE WORK

1 Exploring hybrid parallelizations of the code. 2 Implement in FLEUR a mixed direct-iterative solver;

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 27

slide-35
SLIDE 35

References

1 M.Berliafa, D. Wortmann and EDN An Optimized and Scalable Eigensolver for Sequences of Eigenvalue Problems Concurrency and Computation: Practice and Experience, 27 (2015) 905 2 EDN, and M. Berljafa A Parallel and Scalable Iterative Solver for Sequences of Dense Eigenproblems Arising in FLAPW Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, Vol 8385. [arXiv:1305.5120] 3 EDN, and M. Berljafa Block Iterative Eigensolvers for Sequences of Correlated Eigenvalue Problems

  • Comp. Phys. Comm. 184 (2013), pp. 2478-2488, [arXiv:1206.3768].

4 EDN, P . Bientinesi, and S. Blügel, Correlation in sequences of generalized eigenproblems arising in Density Functional Theory,

  • Comp. Phys. Comm. 183 (2012), pp. 1674-1682, [arXiv:1108.2594].

CSE15, Salt Lake City. March 17th

  • E. Di Napoli, D. Wortmann, and M. Berljafa

Folie 28