Parallel solution of large sparse eigenproblems using a Block-Jacobi-Davidson method

Melven Röhrig-Zöllner
Simulation and Software Technology
Background

◮ Aim: calculate a set of outer eigenpairs (λ_i, v_i) of a large sparse matrix:

  A v_i = λ_i v_i

◮ Project: Equipping Sparse Solvers for Exascale (ESSEX) of the DFG SPPEXA programme
Outline
Block-Jacobi-Davidson algorithm
  - Jacobi-Davidson QR method
  - Block JDQR method
Performance analysis
Implementation
Results
Jacobi-Davidson QR method

Background
◮ Introduced in 1998 by Fokkema et al.
◮ Iteratively calculates a small set of eigenvalues (near a specific target)
◮ Based on a subspace iteration and a correction equation
◮ Motivated by the Rayleigh quotient iteration (RQI):

  v̄_{k+1} = (A − λ_k I)⁻¹ v_k,   v_{k+1} = v̄_{k+1} / ‖v̄_{k+1}‖,   λ_{k+1} = v_{k+1}ᵀ A v_{k+1}

◮ Can also be derived from an inexact Newton process
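The RQI update can be sketched in a few lines of NumPy (a toy dense example; the test matrix and starting vector are made up for illustration):

```python
import numpy as np

def rayleigh_quotient_iteration(A, v0, iters=5):
    """RQI: v_{k+1} = (A - lam_k I)^{-1} v_k (normalized),
    lam_{k+1} = v_{k+1}^T A v_{k+1}."""
    v = v0 / np.linalg.norm(v0)
    lam = v @ A @ v                      # initial Rayleigh quotient
    for _ in range(iters):
        # Shifted solve; the shift converges to an eigenvalue, so the system
        # becomes nearly singular -- exactly the issue the Jacobi-Davidson
        # correction equation later avoids by deflation.
        v_bar = np.linalg.solve(A - lam * np.eye(A.shape[0]), v)
        v = v_bar / np.linalg.norm(v_bar)
        lam = v @ A @ v                  # updated Rayleigh quotient
    return lam, v

A = np.diag([1.0, 2.0, 5.0]) + 0.01 * np.ones((3, 3))  # small symmetric example
lam, v = rayleigh_quotient_iteration(A, np.array([0.9, 0.1, 0.0]))
```

Starting near the first coordinate direction, the iteration converges (cubically, for symmetric matrices) to the eigenvalue near 1.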
Jacobi-Davidson QR method

Sketch of the algorithm
1: while not converged do                          ⊲ Outer iteration
2:   Project the problem to a small subspace
3:   Solve the small eigenvalue problem
4:   Calculate an approximation and its residual
5:   Approximately solve the correction equation   ⊲ Inner iteration
6:   Orthogonalize the new direction
7:   Enlarge the subspace
8: end while
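The loop above can be sketched as follows (a dense NumPy toy for a symmetric matrix; the correction equation is solved exactly via least squares instead of a few GMRES steps, purely to keep the sketch short):

```python
import numpy as np

def jdqr_sketch(A, v0, target, iters=20, tol=1e-10):
    """Toy Jacobi-Davidson iteration for one eigenpair of symmetric A
    near `target` (dense; a real implementation works matrix-free)."""
    n = A.shape[0]
    W = (v0 / np.linalg.norm(v0)).reshape(n, 1)      # subspace basis
    for _ in range(iters):
        H = W.T @ A @ W                              # project to small subspace
        thetas, S = np.linalg.eigh(H)                # solve small problem
        j = np.argmin(np.abs(thetas - target))
        theta, v = thetas[j], W @ S[:, j]            # Ritz approximation
        r = A @ v - theta * v                        # residual
        if np.linalg.norm(r) < tol:
            break
        # Correction equation (I - vv^T)(A - theta I)(I - vv^T) w = -r,
        # solved via least squares because the projected operator is singular
        P = np.eye(n) - np.outer(v, v)
        w = np.linalg.lstsq(P @ (A - theta * np.eye(n)) @ P, -r, rcond=None)[0]
        w -= W @ (W.T @ w)                           # orthogonalize new direction
        W = np.hstack([W, (w / np.linalg.norm(w)).reshape(n, 1)])  # enlarge
    return theta, v

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 8))
A = (B + B.T) / 2                                    # symmetric test matrix
theta, v = jdqr_sketch(A, rng.standard_normal(8), target=-10.0)
```

With the target below the spectrum, the iteration locks onto the smallest eigenpair.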
Outer iteration

Subspace iteration
◮ Galerkin projection onto a small subspace W = span{w_1, w_2, . . . }:

  Av − λv ⊥ W, v ∈ W  ⇔  (WᵀAW)s − λ̃s = 0  (for an orthonormal basis W)

◮ Approximation λ̃ with ṽ = Ws

Deflation
◮ Project out the already known eigenvector space Q = span{q_1, q_2, . . . }:

  Ā := (I − QQᵀ) A (I − QQᵀ)
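Both properties are easy to verify numerically (NumPy sketch; matrix and subspace sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 5
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                            # symmetric example matrix

W, _ = np.linalg.qr(rng.standard_normal((n, m)))   # orthonormal basis of W

# Galerkin projection: small m-by-m eigenproblem (W^T A W) s = theta s
thetas, S = np.linalg.eigh(W.T @ A @ W)
V = W @ S                                    # Ritz vectors v_tilde = W s

# The residuals A v - theta v are orthogonal to the subspace W
R = A @ V - V * thetas
assert np.allclose(W.T @ R, 0.0, atol=1e-10)

# Deflation: A_bar = (I - QQ^T) A (I - QQ^T) maps the locked space Q to zero
Q = V[:, :1]                                 # pretend this Ritz vector is locked
P = np.eye(n) - Q @ Q.T
A_bar = P @ A @ P
assert np.allclose(A_bar @ Q, 0.0, atol=1e-10)
```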
Inner iteration

Jacobi-Davidson correction equation
◮ Based on the current approximation: Aṽ − λ̃ṽ = r
◮ Avoid the (near) singularity of (A − λ̃I)⁻¹ through a deflation with Q̃ = (Q ṽ)

  → Solve approximately: (I − Q̃Q̃ᵀ)(A − λ̃I)(I − Q̃Q̃ᵀ) w_{k+1} = −r

◮ Use some steps of an iterative solver (GMRES, BiCGStab, . . . )
◮ Provides a new direction w_{k+1} for the subspace iteration
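A matrix-free sketch of the projected operator and an approximate GMRES solve (SciPy assumed; sizes and iteration counts are illustrative, and Q̃ holds only the current Ritz vector here):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(2)
n = 200
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                           # symmetric example matrix

v = rng.standard_normal(n)                  # current approximation v_tilde
v /= np.linalg.norm(v)
lam = v @ A @ v                             # lam_tilde
r = A @ v - lam * v                         # residual

Qt = v.reshape(n, 1)                        # Q_tilde = (Q  v_tilde); Q empty here

def matvec(x):
    """(I - Qt Qt^T)(A - lam I)(I - Qt Qt^T) x, applied matrix-free."""
    x = x - Qt @ (Qt.T @ x)
    y = A @ x - lam * x
    return y - Qt @ (Qt.T @ y)

op = LinearOperator((n, n), matvec=matvec, dtype=float)

# A rough solve is enough: the result only has to be a useful new direction
w, info = gmres(op, -r, maxiter=5, restart=10)
w -= Qt @ (Qt.T @ w)                        # keep w orthogonal to Q_tilde
```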
Block JDQR method

Idea
◮ Calculate corrections for n_b eigenvalues at once
◮ Block correction equation with Q̃ = (Q ṽ_1 . . . ṽ_{n_b}):

  (I − Q̃Q̃ᵀ)(A − λ̃_i I)(I − Q̃Q̃ᵀ) w_{k+i} = −r_i,   i = 1, . . . , n_b

  → Approximately solve n_b linear systems at once
◮ Provides new directions w_{k+1}, . . . , w_{k+n_b} for the subspace iteration

Numerical properties
◮ More robust
◮ Usually needs more operations
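The payoff of blocking is that the n_b shifted systems share the same sparse matrix, so one sweep over the matrix serves all shifts. A SciPy sketch of the shifted block operation (sizes and shift values are illustrative):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
n, nb = 1000, 4
A = sp.random(n, n, density=15 / n, random_state=rng, format="csr")
A = ((A + A.T) / 2).tocsr()                 # symmetric sparse example matrix

X = rng.standard_normal((n, nb))            # block of current directions
shifts = np.array([0.1, 0.2, 0.3, 0.4])     # lam_tilde_i, one shift per column

# Shifted block operation y_i = (A - lam_i I) x_i for all i at once:
# the matrix entries are streamed once instead of nb times
Y = A @ X - X * shifts

# Same result column by column (nb separate single-vector products)
Y_ref = np.column_stack([A @ X[:, i] - shifts[i] * X[:, i] for i in range(nb)])
```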
Block JDQR method

Sketch of the complete algorithm
1:  Setup initial subspace
2:  while not converged do
3:    Project the problem to a small subspace
4:    Solve the small eigenvalue problem
5:    Calculate n_b approximations and their residuals
6:    Lock converged eigenvalues
7:    Shrink subspace if required (thick restart)
8:    Approximately solve the n_b correction equations
9:    Block-orthogonalize the new directions
10:   Enlarge the subspace
11: end while
Outline
Block-Jacobi-Davidson algorithm
Performance analysis
  - Required linear algebra operations
  - spMMVM single node
  - spMMVM inter-node
  - Block vector operations
  - Jacobi-Davidson operator
Implementation
Results
Required linear algebra operations

Sparse matrix-multiple-vector multiplication (spMMVM)
◮ Large distributed sparse matrix A in CRS or SELL-C-σ format
◮ Distributed blocks of vectors X, Y ∈ R^{n×n_b}
◮ Shifted spMMVM: y_i ← (A − λ̃_i I) x_i,  i = 1, . . . , n_b

Block vector operations
◮ Different types of operations (local vs. requiring an all-reduction):
  - BLAS 1 (local): Y ← X + Y, applied column-wise to x_i, i = 1, . . . , n_b
  - BLAS 3 (local): Y ← XM
  - BLAS 3 (all-reduction): M ← XᵀY
◮ Redundantly stored small matrices M ∈ R^{n_b×n_b}
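The two block-vector GEMM types can be sketched as follows (NumPy; sizes are illustrative — in a distributed setting each process computes its local n_b×n_b contribution to M and a single all-reduce sums them, after which M is stored redundantly everywhere):

```python
import numpy as np

rng = np.random.default_rng(4)
n, nb = 100_000, 4
X = rng.standard_normal((n, nb))            # tall-and-skinny block vectors
Y = rng.standard_normal((n, nb))
M = rng.standard_normal((nb, nb))           # small redundant matrix

# BLAS 3, local: Y <- X M. M is tiny, so this streams X and Y once and is
# memory bound rather than compute bound.
Y = X @ M

# BLAS 3 with all-reduction: M <- X^T Y. Blocking turns what would be nb*nb
# separate dot-product reductions into a single small-matrix reduction.
M2 = X.T @ Y

# Equivalent per-column dot products (nb*nb individual reductions)
M2_ref = np.array([[X[:, i] @ Y[:, j] for j in range(nb)] for i in range(nb)])
```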
spMMVM single node

Roofline performance model
◮ Possible performance:

  P_ideal = min(P_peak, b_data / B_code)

◮ Ideal code balance:

  B_ideal = (6/n_b + 8/n_nzr) bytes/flop

  → memory bound on all current architectures
◮ Irregular accesses to X lead to higher data volume:

  Ω = V_measured / V_ideal ≥ 1
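Plugging the code balance into the roofline formula shows why blocking helps (Python sketch; the P_peak and b_data values are made-up example machine numbers, not from the slides):

```python
def spmmvm_roofline(nb, nnzr, peak_gflops=160.0, bandwidth_gbs=40.0):
    """P_ideal = min(P_peak, b_data / B_code) with the ideal spMMVM code
    balance B_ideal = 6/nb + 8/nnzr bytes/flop: blocking amortizes the
    matrix traffic over nb vectors, so B drops and the bandwidth-bound
    performance limit rises."""
    b_ideal = 6.0 / nb + 8.0 / nnzr          # bytes per flop
    return min(peak_gflops, bandwidth_gbs / b_ideal)  # GFlop/s

# Larger blocks lower the code balance and raise the memory-bound limit:
limits = {nb: spmmvm_roofline(nb, nnzr=15) for nb in (1, 2, 4, 8)}
```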
spMMVM single node

Row- vs. column-major storage

Setup
◮ Matrix dimensions: n ≈ 10^6 with n_nzr ≈ 10 non-zeros per row
◮ CRS format
◮ 6-core Intel Westmere CPU

[Figure: performance in GFlop/s over block size n_b = 1, 2, 4, 8 for row-major, row-major with NT stores, and column-major storage; measured data-traffic overheads Ω = 1.17, 1.32, 1.57, 1.94]
spMMVM single node

Intra-socket scaling

Setup
◮ Matrix dimensions: n ≈ 10^7 with n_nzr ≈ 15 non-zeros per row
◮ SELL-C-σ format
◮ 10-core Intel Ivy Bridge CPU

Results

  block size:  1     4     8
  Ω:           1.08  1.54  1.97
  Speedup:     1.0   2.1   2.8

[Figure: performance in GFlop/s over 1–10 cores for single-vector, block size 4, and block size 8]
spMMVM inter-node

Setup
◮ Strong scaling: n ≈ 4·10^7, n_nzr ≈ 15
◮ Distributed CRS
◮ RRZE's Emmy cluster, each node: 2× 10-core Intel Ivy Bridge
◮ Timings from the complete algorithm

Explanation
Two counteracting effects:
◮ Communication volume increases independently of blocking
◮ Message aggregation

[Figure: speedup through blocking over 2–64 nodes for single-vector and block sizes 2, 4, 8]
Block vector operations

Background
◮ All operations are memory bound
  (also the GEMM, as all matrices are very tall and skinny)
◮ The code balance accurately predicts the node-level performance!

Results of blocking
◮ Faster BLAS 3 operations (e.g. Y ← XM)
◮ Message aggregation for all-reductions (e.g. M ← XᵀY)

→ Improved performance of some operations
Complete Jacobi-Davidson operator

Setup
◮ spMMVM + 2× GEMM:

  y_i ← (I − QQᵀ)(A − λ̃_i I) x_i  with Q ∈ R^{n×8},  i = 1, . . . , n_b

◮ Matrix with n ≈ 10^7, n_nzr ≈ 15
◮ SELL-C-σ format
◮ 10-core Intel Ivy Bridge CPU

[Figure: runtime divided by block size n_b, in seconds, over block size n_b = 1, 2, 4, 8, broken down into spMMVM, GEMM 1, and GEMM 2]
Outline
Block-Jacobi-Davidson algorithm
Performance analysis
Implementation
  - Frameworks from the ESSEX project
  - Block pipelined GMRES algorithm
  - Kernel routines
Results
Frameworks from the ESSEX project

phist (Pipelined Hybrid-parallel Linear Solver Toolkit)
◮ General C interface to linear algebra libraries:
  ◮ GHOST
  ◮ Trilinos (C++, http://trilinos.sandia.gov)
  ◮ builtin (for prototyping; row-major storage, Fortran + C99)
◮ Iterative solvers
◮ Large test framework

GHOST (General Hybrid Optimized Sparse Toolkit)
◮ Developed at the RRZE in Erlangen
◮ Hybrid-parallel MPI+OpenMP+CUDA
◮ SELL-C-σ matrix format
Block pipelined GMRES algorithm

Idea
◮ Solve n_b linear systems Ax_i = b_i at once
◮ Remove converged systems and add new ones (pipelining)
◮ Standard GMRES method (for each system)

[Figure: the joint operation ṽ_{k+1} ← Av_k feeds the separate GMRES subspaces 1, 2, . . . ]
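The pipelining bookkeeping can be sketched as follows (NumPy toy; a damped Richardson update stands in for the per-system GMRES iteration purely to keep the sketch short — the point is that all active systems share each blocked operation, and converged columns are swapped for queued ones):

```python
import numpy as np

def block_pipelined_solve(A, rhs_queue, nb=2, tol=1e-8, max_sweeps=1000):
    """Toy pipelined block solver for several systems A x_i = b_i."""
    queue = [b.astype(float) for b in rhs_queue]
    B = np.column_stack([queue.pop(0) for _ in range(min(nb, len(queue)))])
    X = np.zeros_like(B)
    omega = 1.0 / np.linalg.norm(A, 2)       # Richardson damping (assumes SPD A)
    solutions = []
    for _ in range(max_sweeps):
        R = B - A @ X                        # ONE blocked operation for all systems
        done = np.linalg.norm(R, axis=0) < tol * np.linalg.norm(B, axis=0)
        for j in reversed(np.flatnonzero(done)):
            solutions.append((B[:, j].copy(), X[:, j].copy()))
            if queue:                        # pipelining: refill the freed column
                B[:, j] = queue.pop(0)
                X[:, j] = 0.0
            else:                            # no work left: shrink the block
                B = np.delete(B, j, axis=1)
                X = np.delete(X, j, axis=1)
        if B.shape[1] == 0:
            return solutions
        X = X + omega * (B - A @ X)          # stand-in for one GMRES step per column
    return solutions

A = np.diag(np.linspace(1.0, 2.0, 30))       # simple SPD example matrix
rng = np.random.default_rng(5)
sols = block_pipelined_solve(A, [rng.standard_normal(30) for _ in range(5)])
```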
Kernel routines

Performance pitfalls
◮ Common BLAS libraries are slow for the block vector GEMM operations here (e.g. M ← XᵀY).
◮ Getting NUMA accesses right is difficult in a complex library.
  (Works fine in GHOST!)
◮ Strided accesses in row-major block vectors are slow.
  (Hidden by the interface.)

Implementation details
◮ Dedicated kernel routines for individual block sizes
◮ Non-temporal stores where possible
◮ SSE / AVX intrinsics
Outline
Block-Jacobi-Davidson algorithm
Performance analysis
Implementation
Results
  - Numerical behavior
  - Performance of the complete algorithm
  - Conclusion
Numerical behavior

[Figure: relative overhead in # spMVMs over the number of eigenvalues found (5–50), for block size 2 (left) and block size 4 (right); test matrices: Andrews, cfd1, finan512, torsion1, ck656, cry10000, dw8192, rdb3200l]
Performance of the complete algorithm (1)

[Figure, left: speedup through blocking (t_1/t_block) over 1–32 nodes (20 cores each) for single vector and block sizes 2 and 4. Right: strong-scaling runtime in seconds over the same node counts]
Performance of the complete algorithm (2)

[Figure, left: speedup through blocking (t_1/t_block) over 2–64 nodes (20 cores each) for single vector and block sizes 2 and 4. Right: strong-scaling runtime in seconds over the same node counts]
Conclusion (1)

Block JDQR algorithm
◮ Performance analysis of block operations:
  ◮ Impact of memory layout (row- instead of column-major)
  ◮ Node-level: better code balance
  ◮ Inter-node: message aggregation
◮ Numerical behavior:
  ◮ Slight increase in the number of operations
  ◮ Overhead small for > 10 eigenvalues

→ Improved performance by a factor of 1.2–1.4
Conclusion (2)

Outlook
◮ Numerical improvements:
  ◮ Specialized solver for symmetric problems
  ◮ Parallel preconditioning of the linear problems
◮ Algorithmic improvements:
  ◮ Hiding spMMVM communication behind other operations
  ◮ More asynchronous communication (all-reductions)
  ◮ Fast block orthogonalization (TSQR)
◮ Hybrid calculations with GPUs