Member of the Helmholtz Association
Parallel block Chebyshev subspace iteration algorithm optimized for sequences of correlated dense eigenproblems
ERCIM 2012, Oviedo, Spain, Dec. 2nd
- M. Berljafa and E. Di Napoli
Reverse Simulation
Investigative framework
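Hamiltonian (only fragments of the formula survive): a term with prefactor $\hbar^2/2m$ summed over $i = 1, \ldots, n$; a term summed over $i = 1, \ldots, n$ and over $\alpha$; and a term summed over pairs $i < j$.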
1 $\Phi(x_1;s_1, x_2;s_2, \ldots, x_n;s_n) =$
2 Density of states $n(r) = \sum_a |\phi_a(r)|^2$
3 In the Schrödinger equation the exact Coulomb interaction is substituted
Self-consistent cycle
Initial guess
-> Compute KS potential
-> Solve a set of eigenproblems
-> Compute new density
-> Converged? Yes: OUTPUT Energy, ... No: back to Compute KS potential
FLAPW details
1 every $P_k$: $Ax = \lambda Bx$ is a generalized eigenvalue problem;
2 A and B are dense and Hermitian (B is also positive definite);
3 $P_k$s with different k index have different sizes and are independent from each other;
4 k = 1:10-100; i = 1:20-50
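Since B is Hermitian positive definite, each generalized problem can be reduced to a standard Hermitian one through a Cholesky factorization. The sketch below illustrates this well-known reduction with NumPy; the function names `to_standard_form` and `back_transform` are hypothetical, and the slides do not state which routine the authors used for this step.

```python
import numpy as np

def to_standard_form(A, B):
    """Reduce A x = lambda B x to H y = lambda y, with B = L L^H and H = L^{-1} A L^{-H}."""
    L = np.linalg.cholesky(B)                      # lower-triangular Cholesky factor of B
    T = np.linalg.solve(L, A)                      # T = L^{-1} A
    H = np.linalg.solve(L, T.conj().T).conj().T    # H = T L^{-H}
    return (H + H.conj().T) / 2, L                 # symmetrize to remove round-off asymmetry

def back_transform(L, Y):
    """Map eigenvectors y of H back to eigenvectors x = L^{-H} y of the pencil (A, B)."""
    return np.linalg.solve(L.conj().T, Y)

# Illustrative data only: a small random dense Hermitian pencil.
rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
N = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (M + M.conj().T) / 2                # Hermitian
B = N @ N.conj().T + n * np.eye(n)      # Hermitian positive definite

H, L = to_standard_form(A, B)
lam, Y = np.linalg.eigh(H)
X = back_transform(L, Y)
print(np.linalg.norm(A @ X - (B @ X) * lam))   # residual of the generalized problem
```

An iterative solver for the standard problem can then work on H, and the eigenvectors are mapped back only at the end.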
$\sum_j \gamma_j x_j$
Figure: Evolution of subspace angle for eigenvectors of k-point 1 and lowest 75 eigs (AuAg, fixed k). x-axis: Iterations (2 -> 22); y-axis: Angle b/w eigenvectors of adjacent iterations (logarithmic scale, ticks from 10^-10 upward).
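One plausible way to quantify how far an eigenvector moves between adjacent iterations is the deviation 1 - |<x_i, y_i>| between corresponding normalized eigenvectors. The sketch below assumes this measure; the slide does not spell out the exact definition behind the plot, and the function name `eigenvector_deviation` is hypothetical.

```python
import numpy as np

def eigenvector_deviation(X_prev, X_next):
    """Return 1 - |<x_i, y_i>| for each pair of corresponding columns.

    Columns are normalized first; the absolute value makes the measure
    insensitive to the arbitrary sign or phase of an eigenvector.
    """
    Xp = X_prev / np.linalg.norm(X_prev, axis=0)
    Xn = X_next / np.linalg.norm(X_next, axis=0)
    overlaps = np.abs(np.sum(Xp.conj() * Xn, axis=0))
    return 1.0 - np.clip(overlaps, 0.0, 1.0)

# Illustrative data only: two nearby symmetric matrices standing in for
# the same k-point at adjacent iterations of the self-consistent cycle.
rng = np.random.default_rng(0)
n, nev = 50, 5
P = rng.standard_normal((n, n))
H0 = np.diag(np.arange(n, dtype=float)) + 0.01 * (P + P.T)
E = rng.standard_normal((n, n))
H1 = H0 + 1e-4 * (E + E.T)

_, X0 = np.linalg.eigh(H0)
_, X1 = np.linalg.eigh(H1)
print(eigenvector_deviation(X0[:, :nev], X1[:, :nev]))
```

The pairing by column index is only meaningful when the eigenvalues are well separated; clusters of nearly degenerate eigenvalues would call for comparing subspaces instead.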
Note: the correlation is not predicted by the mathematical model; it is observed through numerical analysis of the simulation.
1 Development of a block iterative eigensolver that can exploit the
2 Investigate whether approximate eigenvectors can speed up the iterative solver
3 Understand whether such an iterative method can be competitive with direct
1 the ability to receive as input a sizable set of approximate eigenvectors;
2 the capacity to solve simultaneously for a substantial portion of
1 Lanczos step. Identify the bounds for the interval to be filtered out.
2 Chebyshev filter. Filter a block of vectors W.
3 QR decomposition. Re-orthogonalize the vectors output by the filter.
4 Compute the Rayleigh quotient $G = W^H H W$.
5 Compute the primitive Ritz pairs $(\Lambda, Q)$.
6 Compute the approximate Ritz pairs $(\Lambda, WQ)$.
7 Check which of the Ritz vectors have converged.
8 Deflate and lock the converged vectors.
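The eight steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice of filter bounds (largest current Ritz value as lower bound, a short Lanczos run for the upper bound), the convergence test, and the locking policy are all assumptions, and the filter uses the scaled three-term Chebyshev recurrence commonly given in the literature.

```python
import numpy as np

def lanczos_upper_bound(H, steps=10, seed=0):
    """Step 1 (simplified): estimate an upper bound of the spectrum of H."""
    n = H.shape[0]
    v = np.random.default_rng(seed).standard_normal(n)
    v /= np.linalg.norm(v)
    v_old, beta, alphas, betas = np.zeros(n), 0.0, [], []
    for _ in range(min(steps, n)):
        w = H @ v - beta * v_old
        alpha = np.vdot(v, w).real
        w = w - alpha * v
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-12:
            break
        v_old, v = v, w / beta
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    return np.linalg.eigvalsh(T)[-1] + betas[-1]   # largest Ritz value + residual norm

def chebyshev_filter(H, W, degree, lam1, lower, upper):
    """Step 2: damp components in [lower, upper], amplify those below lower."""
    c, e = (upper + lower) / 2, (upper - lower) / 2
    sigma1 = e / (lam1 - c)
    sigma = sigma1
    Z_old, Z = W, (sigma1 / e) * (H @ W - c * W)
    for _ in range(2, degree + 1):
        sigma_new = 1.0 / (2.0 / sigma1 - sigma)
        Z_old, Z = Z, (2 * sigma_new / e) * (H @ Z - c * Z) - sigma * sigma_new * Z_old
        sigma = sigma_new
    return Z

def chfsi(H, W0, nev, degree=10, tol=1e-8, max_iter=50):
    """Return the nev lowest eigenpairs of Hermitian H, starting from the block W0."""
    assert W0.shape[1] >= nev
    upper = lanczos_upper_bound(H)                          # step 1
    W, _ = np.linalg.qr(W0)
    ritz = np.linalg.eigvalsh(W.conj().T @ H @ W)
    lam1, lower = ritz[0], ritz[-1]
    X = np.zeros((H.shape[0], 0), dtype=W.dtype)            # locked eigenvectors
    vals = np.zeros(0)                                      # locked eigenvalues
    for _ in range(max_iter):
        W = chebyshev_filter(H, W, degree, lam1, lower, upper)   # step 2
        W = W - X @ (X.conj().T @ W)                        # stay orthogonal to locked vectors
        W, _ = np.linalg.qr(W)                              # step 3
        G = W.conj().T @ H @ W                              # step 4
        lam, Q = np.linalg.eigh(G)                          # step 5
        W = W @ Q                                           # step 6
        res = np.linalg.norm(H @ W - W * lam, axis=0)       # step 7
        conv = res < tol * max(abs(lam1), abs(upper))
        k = len(conv) if conv.all() else int(np.argmin(conv))    # leading converged run
        X = np.hstack([X, W[:, :k]])                        # step 8: lock ...
        vals = np.concatenate([vals, lam[:k]])
        W = W[:, k:]                                        # ... and deflate
        if len(vals) >= nev:
            break
        lam1 = min(lam1, lam[0])
        lower = lam[-1]
    return vals[:nev], X[:, :nev]

# Illustrative data only: a symmetric matrix with well-separated eigenvalues.
rng = np.random.default_rng(1)
n, nev, block = 100, 5, 8
P = rng.standard_normal((n, n))
H = np.diag(np.arange(n, dtype=float)) + 0.01 * (P + P.T)
vals, vecs = chfsi(H, rng.standard_normal((n, block)), nev)
print(vals)
print(np.linalg.eigvalsh(H)[:nev])
```

Passing approximate eigenvectors from a previous, correlated eigenproblem as `W0` instead of random vectors is what the talk's comparison of "approximate vs. random" starting vectors refers to; the block is taken slightly larger than `nev` because the vectors at the edge of the damped interval converge slowly.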
Figure: four panels plotted over the interval [-3, 3], titled Degree 5, Degree 10, Degree 15 and Degree 20. The vertical scales reach about 500 for degree 5 and are of order 10^6, 10^10 and 10^14 for degrees 10, 15 and 20.
1 $\sigma_1 \leftarrow e/(\lambda_1 - c)$
2 $Z_1 \leftarrow \sigma_1$
3 $\sigma_{i+1} \leftarrow$
4 $Z_{i+1} \leftarrow 2\sigma_{i+1}$
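The right-hand sides of the four steps above are truncated in this transcript. For orientation, the scaled three-term Chebyshev recurrence as it is commonly written in the literature on Chebyshev-filtered subspace iteration is given below. It agrees with the surviving fragments, but whether the slide used exactly this form cannot be confirmed. The meanings assumed here are: $c$ and $e$ are the center and half-width of the interval to be damped, $\lambda_1$ is an estimate of the lowest eigenvalue, and $Z_0$ is the block of vectors to be filtered.

```latex
\begin{align*}
\sigma_1 &= \frac{e}{\lambda_1 - c}, &
Z_1 &= \frac{\sigma_1}{e}\,(H - cI)\,Z_0, \\
\sigma_{i+1} &= \frac{1}{2/\sigma_1 - \sigma_i}, &
Z_{i+1} &= \frac{2\sigma_{i+1}}{e}\,(H - cI)\,Z_i - \sigma_{i+1}\sigma_i\,Z_{i-1}.
\end{align*}
```

The scaling factors $\sigma_i$ keep the filtered vectors from overflowing, since unscaled Chebyshev polynomials grow very quickly outside the damped interval.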
Figure: Speed-up vs. iteration index for Nev=256 and three distinct matrix sizes (NaCl: n=3893, n=6217, n=9273). x-axis: Iteration index (2 to 14); y-axis: Speed-up (1 to 2.8).
Figure: Speed-up vs. iteration index for Nev=972 and 2 distinct matrix sizes (AuAg: n=5638, n=8970). x-axis: Iteration index (5 to 30); y-axis: Speed-up (1 to 4).
Figure: pie chart with slices labelled Residuals convergence, Rayleigh-Ritz, Chebyshev filter and Lanczos. The values < 1%, 90%, 6% and 4% appear, but their pairing with the labels is not preserved.
Figure: Speed-up of multi-threaded ChFSI over sequential ChFSI w.r.t. iteration index (AuAg n=8970; NaCl n=6217). x-axis: Iteration index (5 to 30); y-axis: Speed-up (7 to 8).
Figure: Speed-up of approx. vs. random vectors w.r.t. iteration index (AuAg n=8970 on 8 cores; AuAg n=8970 on 1 core). x-axis: Iteration index (5 to 30); y-axis: Speed-up (0.5 to 5).
1 Approximate vs. Random:
- sequential ChFSI achieves speed-ups in the range 1.5X to 3.5X;
- the multi-threaded version of ChFSI achieves speed-ups up to 5X.
2 Multi-threaded ChFSI:
- by just using the multi-threaded version of BLAS, ChFSI achieves speed-ups above 7X;
- the larger the eigenproblem, the better ChFSI performs.
DEG, lower, upper.
1: for i = 1 to DEG do
2:
3:
4: SWAP(X,Y).
5: end for
6: SWAP(X,Y).
Figure: Matrix–matrix multiplication scheme: Filtered vectors = Hamiltonian × Vectors to be filtered.
Figure: Time to completion for AuAg (n=13379), OpenMP vs. multi-threaded BLAS. Series: ChFSI multi-threaded BLAS; ChFSI OpenMP. x-axis: Iteration index (2 to 22); y-axis: CPU time in seconds (800 to 1700).
Figure: BChFSI, strong scalability. Series: AuAg n=8970; NaCl n=6217; CaFeAs n=2612; ideal speed-up. x-axis: Number of threads (1 to 8); y-axis: Speed-up (1 to 8).
Figure: Time for LAPACK + multi-threaded BLAS and OpenMP ChFSI for AuAg (n=13379), with ChFSI solving for the lowest 7.3%. Legend as extracted: LAPACK + multi-thrd BLAS; MKL LAPACK + multi-thrd BLAS; ChFSI OpenMP. x-axis: Iteration index (2 to 22); y-axis: CPU time in seconds (200 to 1800).
Figure: Time for LAPACK + multi-threaded BLAS and OpenMP ChFSI for NaCl (n=9273), with ChFSI solving for the lowest 2.8% of the eigenspectrum. Legend as extracted: LAPACK + multi-thrd BLAS; MKL LAPACK + multi-thrd BLAS; OpenMP ChFSI. x-axis: Iteration index (2 to 14); y-axis: CPU time in seconds (50 to 450).
1 Ongoing work to parallelize ChFSI for distributed memory architectures ⇒ Elemental;
2 Ongoing effort to optimize the filter by adjusting the degree of the polynomial so as to achieve just the required eigenvector residuals.
Self-consistent cycle
Initial guess
-> Compute KS potential
-> Solve KS equations
-> Compute new density
-> Converged? Yes: OUTPUT Energy, forces, ... No: back to Compute KS potential
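Equations (only fragments survive): an expansion over basis functions $\psi_G(k,r)$ restricted to $|G+k| \le K_{max}$; a relation whose two sides both contain $\psi^*_G(k,r)$ and a sum over $G'$ of terms in $\psi_{G'}(k,r)$, with $\lambda_{k\nu}$ multiplying the right-hand side; and a final relation in which a sum over $G'$ equals $\lambda_{k\nu}$ times another sum over $G'$. The coefficients and operators are not recoverable.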