

SLIDE 1

Member of the Helmholtz Association

(Preconditioning) Chebyshev subspace iteration applied to sequences of dense eigenproblems in ab initio simulations

NASCA 2013 Calais, France, June 24th

  • M. Berljafa and E. Di Napoli
SLIDE 3

Motivation and Goals

SIMULATIONS of the electronic structure: band energy gap, conductivity, forces, etc.

Mathematical model + algorithmic structure → extracting & exploiting information.

Goals: performance, efficiency, scalability, more physics.

SLIDE 4

Outline

  • Sequences of generalized eigenproblems in all-electron computations
  • The algorithm: Chebyshev Filtered Subspace Iteration method (ChFSI)
  • Exploiting approximate eigenvectors: numerical results
  • Optimization schemes


SLIDE 7

Density Functional Theory scheme

Self-consistent cycle:

  1. Start from an initial guess for the charge density n_start(r).
  2. Compute the discretized Kohn-Sham equations and solve the resulting set of eigenproblems P(ℓ)_{k_1}, ..., P(ℓ)_{k_N}.
  3. Compute the new charge density n(ℓ)(r).
  4. Check convergence: |n(ℓ) − n(ℓ−1)| < η. If not converged, repeat from step 2; otherwise OUTPUT the electronic structure, ...

Observations:

  1. Every P(ℓ)_k : A(ℓ)_k x = λ B(ℓ)_k x is a generalized eigenvalue problem;
  2. A and B are DENSE and Hermitian (B is also positive definite);
  3. P_k's with different k index have different sizes and are independent from each other;
  4. k = 1 : 10−100; ℓ = 1 : 20−50.
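The self-consistent cycle above is just a fixed-point loop around a set of eigenproblem solves. A minimal sketch, with hypothetical callables (`build_kohn_sham`, `solve_problem`, `density_from`) standing in for the physics-specific steps:

```python
import numpy as np

def scf_cycle(n_start, build_kohn_sham, solve_problem, density_from,
              eta=1e-6, max_iter=50):
    """Skeleton of the self-consistent DFT cycle.  The three callables are
    hypothetical stand-ins: build_kohn_sham discretizes the Kohn-Sham
    equations for the current density (one problem per k-point),
    solve_problem solves a single eigenproblem P_k, and density_from
    assembles the new charge density from all eigenpairs."""
    n = n_start
    for _ in range(max_iter):
        problems = build_kohn_sham(n)                 # P_k1 ... P_kN
        eigenpairs = [solve_problem(p) for p in problems]
        n_new = density_from(eigenpairs)
        if np.linalg.norm(n_new - n) < eta:           # |n(l) - n(l-1)| < eta
            return n_new                              # converged: OUTPUT
        n = n_new
    return n
```

Note that the inner solves are independent across k, which is what the parallel treatment later in the talk exploits.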


SLIDE 9

Sequences of eigenproblems

Current solving strategy

The set of generalized eigenproblems P(1), ..., P(ℓ), P(ℓ+1), ..., P(N) is handled as a set of disjoint problems {P}_N; each problem P(ℓ) is solved independently, using a direct solver as a black box from a standard library (e.g. ScaLAPACK).

Extracting information → searching for a new strategy

Treating the set of eigenproblems {P}_N as a sequence {P(ℓ)}, we:
  • investigated the presence of a connection between adjacent problems;
  • analyzed numerical simulations employing a parameter-based inverse-problem method;
  • collected data on the angles between eigenvectors of adjacent eigenproblems;
  • discovered an evolution of the eigenvectors along the sequence.

SLIDE 10

Angles evolution

Example: a metallic compound (AuAg) at fixed k.

[Figure: evolution of the subspace angle between eigenvectors of adjacent iterations, for k-point 1 and the lowest 75 eigenvectors; over iterations 2 → 22 the angles decrease from roughly 10^−2 to 10^−10.]
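Subspace angles like those plotted above can be computed directly with SciPy's principal-angle routine. A sketch using random Hermitian stand-ins for two adjacent Kohn-Sham matrices (the 1e-4 perturbation size is an arbitrary choice):

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = A + A.T                                    # stand-in for iteration l-1
B = A + 1e-4 * rng.standard_normal((50, 50))   # stand-in for iteration l
B = (B + B.T) / 2

_, X = np.linalg.eigh(A)
_, Y = np.linalg.eigh(B)
# principal angles between the spans of the lowest 10 eigenvectors
angles = subspace_angles(X[:, :10], Y[:, :10])
```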

SLIDE 11

Correlation and its exploitation

There exists a correlation between successive eigenvectors x(ℓ−1) and x(ℓ); the angles are small after the first few iterations.

Note: the correlation is not apparent from the mathematical model; it emerges from numerical analysis of the simulation.

Exploiting information: SIMULATION ⇒ ALGORITHM. The stage is favorable to an iterative eigensolver where the solution of P(ℓ−1) is used to solve P(ℓ).

  1. Approximate eigenvectors can speed up iterative solvers (EDN, M. Berljafa [arXiv:1206.3768]).
  2. Developed a block iterative eigensolver (ChFSI) that can maximally exploit the correlation (EDN, M. Berljafa [arXiv:1305.5120]).
  3. ChFSI is competitive with direct methods for dense problems in ab initio methods (EDN, M. Berljafa, in preparation).



SLIDE 14

Algorithmic digression

Direct solvers. Iterative solvers.

[Figure: a fully dense matrix versus a sparse/banded matrix.]

Power iteration: with |λ_1| > |λ_2| > |λ_3| > ..., A x_j = λ_j x_j, and v = ∑_j γ_j x_j,

  A v = ∑_j λ_j γ_j x_j  ⇒  A^k v = ∑_j λ_j^k γ_j x_j = λ_1^k ( γ_1 x_1 + ∑_{j≥2} (λ_j/λ_1)^k γ_j x_j ).

  • The rate of convergence is governed by the magnitude of λ_j/λ_1.
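The expansion above is the classic power-method argument; a minimal sketch:

```python
import numpy as np

def power_iteration(A, iters=200, seed=0):
    """Minimal power method: A^k v aligns with the dominant eigenvector,
    at a rate governed by |lambda_2 / lambda_1| (see the expansion above)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)      # renormalize to avoid overflow
    return v @ A @ v, v             # Rayleigh quotient and eigenvector

lam, v = power_iteration(np.diag([5.0, 2.0, 1.0]))
```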


SLIDE 16

Algorithmic choice

Direct solvers. Iterative solvers. Dense matrices. Sparse matrices.


SLIDE 17

Outline

  • Sequences of generalized eigenproblems in all-electron computations
  • The algorithm: Chebyshev Filtered Subspace Iteration method (ChFSI)
  • Exploiting approximate eigenvectors: numerical results
  • Optimization schemes


SLIDE 19

Selecting an iterative eigensolver

The iterative algorithm has to comply with two main properties:

  1. the ability to receive as input a sizable set Z_0 of approximate eigenvectors;
  2. the capacity to solve simultaneously for a substantial portion of the eigenpairs.

ChFSI constitutes the natural choice:
  • it accepts the full set of multiple starting vectors;
  • it avoids stalling when facing small clusters of eigenvalues;
  • when augmented with polynomial accelerators it has a much faster convergence rate;
  • converged eigenvectors can be easily locked;
  • the degree of the polynomial can be opportunely optimized.

SLIDE 20

The core of the algorithm: Chebyshev filter

Chebyshev polynomials

The Chebyshev polynomial C_m of the first kind of order m is defined as

  C_m(x) = cos(m arccos(x)),   x ∈ [−1,1];
  C_m(x) = cosh(m arccosh(x)),  |x| > 1.

Three-term recurrence relation:

  C_{m+1}(x) = 2x C_m(x) − C_{m−1}(x),  m ∈ N;  C_0(x) = 1,  C_1(x) = x.

[Figure: C_m on [−3,3] for degrees 5, 10, 15 and 20; inside [−1,1] the polynomial stays bounded, while outside it grows rapidly, reaching magnitudes of order 10^6, 10^10 and 10^14 for degrees 10, 15 and 20.]
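The three-term recurrence is also the practical way to evaluate C_m; a small sketch, consistent with both branch formulas of the definition:

```python
import math

def cheb(m, x):
    """Evaluate C_m(x) via the three-term recurrence above; valid for all
    real x, unlike the separate cos/cosh branch formulas."""
    c_prev, c = 1.0, x              # C_0 and C_1
    if m == 0:
        return c_prev
    for _ in range(m - 1):
        c_prev, c = c, 2.0 * x * c - c_prev
    return c
```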

SLIDE 21

The core of the algorithm: Chebyshev filter

The basic principle

Theorem. Let |γ| > 1 and let P_m denote the set of polynomials of degree smaller than or equal to m. Then the extremum

  min_{p ∈ P_m, p(γ)=1}  max_{t ∈ [−1,1]} |p(t)|

is reached by p_m(t) := C_m(t) / C_m(γ).

Consequently a generic vector v is very quickly aligned in the direction of the eigenvector corresponding to the extremal eigenvalue λ_1:

  v_m = p_m(A) v = ∑_{i=1}^n s_i p_m(A) x_i = ∑_{i=1}^n s_i p_m(λ_i) x_i
      = s_1 x_1 + ∑_{i=2}^n s_i [ C_m((λ_i − c)/e) / C_m((λ_1 − c)/e) ] x_i ∼ s_1 x_1.

SLIDE 22

The core of the algorithm: Chebyshev filter

The pseudocode

A simple linear transformation maps [−1,1] → [α,β] ⊂ R and defines c = (β+α)/2 as the center of the interval and e = (β−α)/2 as its half-width.

INPUT: Hamiltonian H; approximate eigenvectors Z_0; lowest eigenvalue λ_1; interval parameters c, e; degree DEG.
OUTPUT: filtered vectors W.

  1  σ_1 ← e / (λ_1 − c)
  2  Z_1 ← (σ_1/e) (H − c I_n) Z_0                                [xGEMM]
     FOR i = 1 → DEG − 1:
  3      σ_{i+1} ← 1 / (2/σ_1 − σ_i)
  4      Z_{i+1} ← (2σ_{i+1}/e) (H − c I_n) Z_i − σ_{i+1} σ_i Z_{i−1}   [xGEMM]
     END FOR
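A line-by-line transcription of the pseudocode into NumPy (a sketch; the production code is C built on xGEMM calls):

```python
import numpy as np

def chebyshev_filter(H, Z0, lam1, c, e, deg):
    """Scaled Chebyshev filter from the pseudocode above: damps the
    eigencomponents inside [c - e, c + e] while amplifying those near
    lam1, using the scaled three-term recurrence."""
    sigma1 = e / (lam1 - c)                              # step 1
    sigma, Z_prev = sigma1, Z0
    Z = (sigma1 / e) * (H @ Z0 - c * Z0)                 # step 2, one GEMM
    for _ in range(deg - 1):                             # steps 3-4
        sigma_next = 1.0 / (2.0 / sigma1 - sigma)
        Z, Z_prev, sigma = ((2.0 * sigma_next / e) * (H @ Z - c * Z)
                            - sigma_next * sigma * Z_prev), Z, sigma_next
    return Z
```

The σ scaling keeps the iterates normalized, avoiding the overflow that the raw Chebyshev growth (10^14 at degree 20) would otherwise cause.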

SLIDE 23

ChFSI pseudocode

INPUT: Hamiltonian H; approximate eigenpairs (Λ, Z_0); TOL; DEG.
OUTPUT: wanted eigenpairs (Λ, W).

  1  Lanczos step. Identify the bounds of the interval to be filtered out.
     REPEAT UNTIL CONVERGENCE:
  2      Chebyshev filter. Filter the block of vectors: W ← Z_0.
  3      QR decomposition. Re-orthogonalize the vectors output by the filter: W = QR.
  4      Compute the Rayleigh quotient G = Q†HQ.
  5      Compute the primitive Ritz pairs (Λ, Y) by solving GY = YΛ.
  6      Compute the approximate Ritz pairs (Λ, W ← QY).
  7      Check which of the Ritz vectors have converged.
  8      Deflate and lock the converged vectors.
     END REPEAT
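Putting the filter and the Rayleigh-Ritz steps together, the loop above can be sketched as follows. Two simplifications are mine: deflation/locking (step 8) is omitted, and the Lanczos estimate of step 1 is replaced by a crude 1-norm upper bound on the spectrum; Z0 carries the nev wanted vectors plus a few extra buffer columns.

```python
import numpy as np

def cheb_filter(H, Z, lam1, c, e, deg):
    # Scaled Chebyshev filter: damp eigencomponents in [c - e, c + e].
    sigma1 = e / (lam1 - c)
    sigma, Z_prev = sigma1, Z
    Z = (sigma1 / e) * (H @ Z - c * Z)
    for _ in range(deg - 1):
        sigma_next = 1.0 / (2.0 / sigma1 - sigma)
        Z, Z_prev, sigma = ((2.0 * sigma_next / e) * (H @ Z - c * Z)
                            - sigma_next * sigma * Z_prev), Z, sigma_next
    return Z

def chfsi(H, Z0, nev, tol=1e-10, deg=20, max_iter=50):
    """Sketch of the ChFSI loop for the nev lowest eigenpairs of the
    Hermitian H, started from the approximate block Z0."""
    upper = np.linalg.norm(H, 1)          # 1. crude bound >= largest eigenvalue
    Z = Z0
    for _ in range(max_iter):
        Q, _ = np.linalg.qr(Z)            # 3. re-orthogonalize
        G = Q.conj().T @ (H @ Q)          # 4. Rayleigh quotient
        lam, Y = np.linalg.eigh(G)        # 5. primitive Ritz pairs
        W = Q @ Y                         # 6. approximate Ritz pairs
        res = np.linalg.norm(H @ W - W * lam, axis=0)   # 7. residuals
        if res[:nev].max() < tol:
            break
        # 2. filter: damp [first unwanted Ritz value, upper bound]
        c, e = (upper + lam[nev]) / 2, (upper - lam[nev]) / 2
        Z = cheb_filter(H, W, lam[0], c, e, deg)
    return lam[:nev], W[:, :nev]
```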

SLIDE 24

Outline

  • Sequences of generalized eigenproblems in all-electron computations
  • The algorithm: Chebyshev Filtered Subspace Iteration method (ChFSI)
  • Exploiting approximate eigenvectors: numerical results
  • Optimization schemes

SLIDE 25

Experimental tests setup

Solving with a ChFSI version implemented in C:
  • approximate vs. random vectors fed to ChFSI, against the iteration index;
  • parallel ChFSI vs. direct methods, against the iteration index.

Matrix sizes: 2,600 ÷ 13,300. We used the standard form of the problem: Ax = λBx → A′y = λy with A′ = L^{−1} A L^{−T} and y = L^T x.

Tests were performed on the JUROPA cluster: 2 Intel Xeon 5570 (Nehalem-EP) quad-core processors per node at 2.93 GHz; 24 GB per node; theoretical peak performance per core = 11.71 GFLOPS.

Minimum absolute tolerance on the residuals: r_x = 10^{−10}. All numerical data are MEDIAN values over 12 distinct measurements.
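The reduction to standard form amounts to a Cholesky factorization of B followed by two triangular solves; a sketch (for reference, `scipy.linalg.eigh(A, B)` performs an equivalent reduction internally):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def to_standard_form(A, B):
    """Reduce A x = lambda B x to A' y = lambda y, with A' = L^{-1} A L^{-T}
    and y = L^T x, where B = L L^T is the Cholesky factorization of the
    Hermitian positive definite B."""
    L = cholesky(B, lower=True)
    tmp = solve_triangular(L, A, lower=True)                      # L^{-1} A
    Ap = solve_triangular(L, tmp.conj().T, lower=True).conj().T   # L^{-1} A L^{-T}
    return Ap, L

def recover_x(L, y):
    # y = L^T x  =>  x = L^{-T} y
    return solve_triangular(L.conj().T, y, lower=False)
```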

SLIDE 26

Speed-up for sequential ChFSI

Speed-up = CPU time (input random vectors) / CPU time (input approximate eigenvectors).

[Figure: speed-up vs. iteration index (2 → 14) for Nev = 256 and three distinct matrix sizes, NaCl with n = 3893, 6217 and 9273; the speed-up grows along the sequence, roughly from 1 to 2.8.]

SLIDE 27

Speed-up for sequential ChFSI

[Figure: speed-up vs. iteration index (5 → 30) for Nev = 972 and two distinct matrix sizes, AuAg with n = 5638 and 8970; the speed-up grows along the sequence, roughly from 1 to 4.]

SLIDE 28

Sequential to Parallel: “Elemental” framework

Fraction of computing time for sequential ChFSI, extracted from the AuAg system (n = 8970, nev = 972) at iteration 23:

[Pie chart: the Chebyshev filter dominates at ≈ 90%, with the Rayleigh-Ritz step (≈ 6%), the Lanczos step (≈ 4%) and the residual-convergence checks (< 1%) accounting for the rest.]

SLIDE 29

Sequential to Parallel: “Elemental” framework

Strong scalability (the size of the eigenproblem is kept fixed while the number of cores is progressively increased) for EleChFSI over two systems, of size n = 13,379 and n = 9,273 respectively.

  Cores:              8       16      32      64      128     256
  Au98Ag10 [sec]:     801.76  406.28  208.62  108.25  57.51   31.99
  Na15Cl14Li [sec]:   167.83  85.79   45.47   24.22   12.99   7.36

SLIDE 30

Sequential to Parallel: “Elemental” framework

Weak scalability (fixed ratio of data per processor) for EleChFSI; times are weighted a posteriori, taking into account the ratio of operations per data.

[Bar chart: over 16, 32, 40, 80, 92 and 184 cores and system sizes n = 3893 to 13,379 (Au98Ag10 and Na15Cl14Li), the weighted times stay essentially flat, ≈ 33.1–34.8 sec for one series and ≈ 11.8–12.4 sec for the other, indicating good weak scalability.]

SLIDE 31

EleChFSI versus direct solvers (parallel MRRR)

For the size of the eigenproblems tested here, the ScaLAPACK implementation of BXINV is comparable with EleMRRR. For this reason a direct comparison with ScaLAPACK is not included.

[Figure: time (≈ 5 → 30 sec) vs. iteration index (3 → 12) for EleChFSI and EleMRRR, on 64 and 128 cores.]


SLIDE 33

Recapping

  1. Approximate vs. random:
     • sequential ChFSI achieves speed-ups in the range 1.5X ÷ 3.5X;
     • parallel versions of ChFSI can achieve speed-ups up to 5X.
  2. Parallel ChFSI:
     • the algorithm scales extremely well even for medium-size eigenproblems;
     • depending on the number of cores and the percentage of the eigenspectrum sought, ChFSI is competitive with direct eigensolvers;
     • having access to more cores enables the investigation of larger physical systems (larger eigenproblems).

THERE IS ROOM FOR FURTHER OPTIMIZATIONS: the degree of the polynomial can be fine-tuned so as to avoid filtering “too much”; profiling the degrees along the sequence provides additional savings.

SLIDE 34

Outline

  • Sequences of generalized eigenproblems in all-electron computations
  • The algorithm: Chebyshev Filtered Subspace Iteration method (ChFSI)
  • Exploiting approximate eigenvectors: numerical results
  • Optimization schemes


SLIDE 36

Convergence ratio and residuals

Definition

The convergence ratio for the eigenvector w_i corresponding to an eigenvalue λ_i ∉ [α,β] is defined as

  τ(λ_i) = |ρ_i|^{−1} = min_± | (λ_i − c)/e ± √( ((λ_i − c)/e)² − 1 ) |^{−1}.

The further away λ_i is from the interval [α,β], the smaller |ρ_i|^{−1} is and the faster the convergence to w_i.

For a set of input vectors V = {v_1, v_2, ..., v_nev}:

Claim

  Res(v_1^m) ≲ const · |ρ_1|^{−m};
  Res(v_i^m) ≲ const · |ρ_i|^{−m} + const · (|ρ_1| / |ρ_i|)^m · eps,   k ≥ i ≥ 2.
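The definition translates directly into code; a small sketch:

```python
import math

def convergence_ratio(lam, c, e):
    """tau(lambda_i) = |rho_i|^{-1} from the definition above; lam must lie
    outside the filtered interval [c - e, c + e], so the square root is real."""
    t = (lam - c) / e
    s = math.sqrt(t * t - 1.0)
    return min(1.0 / abs(t + s), 1.0 / abs(t - s))
```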


SLIDE 38

Predicting the minimal polynomial degree

  Res(v_i^m) = Res(v_i^{m_0}) · |1/ρ_i|^{m−m_0} + eps · |ρ_1|^{m−m_0} / |ρ_i|^{m−m_0}.
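Ignoring the eps-floor term, the residual model above can be inverted to predict the smallest additional degree that reaches a target residual; a sketch:

```python
import math

def extra_degree(res_m0, target, rho_abs):
    """Smallest m - m0 with res_m0 * (1/rho_abs)**(m - m0) <= target, using
    only the first term of the model above (rho_abs = |rho_i| > 1).  The
    eps-floor term limits how far the residual can actually be pushed."""
    return math.ceil(math.log(target / res_m0) / math.log(1.0 / rho_abs))
```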

SLIDE 39

Optimizations: gains and losses

ChFSI sequential, optimized vs. standard → total CPU times.

[Figure: total, filter and Rayleigh-Ritz times (≈ 100–200 sec) vs. iteration index (2 → 13) for standard and optimized ChFSI; highlighted panel: ChFSI total.]

SLIDE 40

Optimizations: gains and losses

ChFSI sequential, optimized vs. standard → polynomial filter CPU times.

[Figure: same comparison, filter times ≈ 80–180 sec vs. iteration index (2 → 13); highlighted panel: polynomial filter.]

SLIDE 41

Optimizations: gains and losses

ChFSI sequential, optimized vs. standard → Rayleigh-Ritz CPU times.

[Figure: same comparison, Rayleigh-Ritz times ≈ 6–15 sec vs. iteration index (2 → 13); highlighted panel: Rayleigh-Ritz.]

SLIDE 42

Conclusions and future work

  • (Preconditioning) ChFSI with the eigenvectors of P(ℓ−1) to speed up the solution of P(ℓ) is a successful strategy;
  • the parallelization of ChFSI shows great potential for scalability and efficiency;
  • the improved use of computing resources, together with strong scalability, will enable access to larger physical systems.

ONGOING AND FUTURE WORK

  1. Finalizing the filter optimization by adjusting the degree of the polynomial so as to just achieve the required eigenvector residuals.
  2. Parallelization of ChFSI for GPUs using OpenACC directives.

SLIDE 43

References

  1. EDN and M. Berljafa, A Parallel and Scalable Iterative Solver for Sequences of Dense Eigenproblems Arising in FLAPW, submitted to PPAM 2013. [arXiv:1305.5120]
  2. EDN and M. Berljafa, Block Iterative Eigensolvers for Sequences of Correlated Eigenvalue Problems, accepted for publication in Comp. Phys. Comm. [arXiv:1206.3768]
  3. EDN, P. Bientinesi and S. Blügel, Correlation in sequences of generalized eigenproblems arising in Density Functional Theory, Comp. Phys. Comm. 183 (2012), pp. 1674–1682. [arXiv:1108.2594]