Parallel solution of large sparse eigenproblems using a Block-Jacobi-Davidson method (PowerPoint presentation)

SLIDE 1

Block-Jacobi-Davidson algorithm Performance analysis Implementation Results

Parallel solution of large sparse eigenproblems using a Block-Jacobi-Davidson method

Melven Röhrig-Zöllner Simulation and Software Technology

SLIDE 2

Background

◮ Aim: calculate a set of outer eigenpairs (λᵢ, vᵢ) of a large sparse matrix: A vᵢ = λᵢ vᵢ
◮ Project: Equipping Sparse Solvers for Exascale (ESSEX) of the DFG SPPEXA programme


SLIDE 4

Outline

◮ Block-Jacobi-Davidson algorithm
  ◮ Jacobi-Davidson QR method
  ◮ Block JDQR method
◮ Performance analysis
◮ Implementation
◮ Results


SLIDE 7

Jacobi-Davidson QR method

Background

◮ Introduced in 1998 by Fokkema et al.
◮ Iteratively calculates a small set of eigenvalues (near a specific target)
◮ Based on a subspace iteration and a correction equation
◮ Motivated by the Rayleigh quotient iteration (RQI):

  v̄_{k+1} = (A − λ_k I)^{−1} v_k,   v_{k+1} = v̄_{k+1} / ‖v̄_{k+1}‖,   λ_{k+1} = v_{k+1}ᵀ A v_{k+1}

◮ Can also be derived from an inexact Newton process
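
The RQI update above can be sketched in a few lines. The following NumPy version is my illustration (not from the talk); it uses a dense solve for (A − λ_k I)^{−1} v_k and a small symmetric example matrix.

```python
import numpy as np

def rayleigh_quotient_iteration(A, v0, steps=6, tol=1e-12):
    """RQI: v_{k+1} = (A - lambda_k I)^{-1} v_k (then normalized),
    lambda_{k+1} = v_{k+1}^T A v_{k+1}; cubically convergent for symmetric A."""
    v = v0 / np.linalg.norm(v0)
    lam = v @ A @ v                    # initial Rayleigh quotient
    for _ in range(steps):
        if np.linalg.norm(A @ v - lam * v) < tol:
            break                      # converged: avoid a singular solve
        v = np.linalg.solve(A - lam * np.eye(len(A)), v)
        v /= np.linalg.norm(v)
        lam = v @ A @ v                # updated Rayleigh quotient
    return lam, v

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
lam, v = rayleigh_quotient_iteration(A, np.array([1.0, 0.0, 0.0]))
```

Note that the shifted system becomes nearly singular as the iteration converges; that is exactly the issue the Jacobi-Davidson correction equation later avoids by deflation.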

SLIDE 8

Jacobi-Davidson QR method

Sketch of the algorithm

1: while not converged do                            ⊲ Outer iteration
2:     Project the problem to a small subspace
3:     Solve the small eigenvalue problem
4:     Calculate an approximation and its residual
5:     Approximately solve the correction equation   ⊲ Inner iteration
6:     Orthogonalize the new direction
7:     Enlarge the subspace
8: end while


SLIDE 12

Outer iteration

Subspace iteration

◮ Galerkin projection onto a small subspace W = span{w₁, w₂, . . . }:

  Av − λv ⊥ W,  v ∈ W   ⇔   (WᵀAW)s − λ̃s = 0   for an orthonormal basis W

◮ Approximation λ̃ with ṽ = Ws

Deflation

◮ Project out the already known eigenvector space Q = span{q₁, q₂, . . . }:

  Ā := (I − QQᵀ)A(I − QQᵀ)
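
Both pieces of the outer iteration can be checked directly. This small NumPy illustration (mine, not part of the talk) uses a random symmetric matrix and a random subspace as stand-ins; it verifies the Galerkin condition (the Ritz residual is orthogonal to W) and that the deflated operator annihilates a converged eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 5
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric test matrix

# Galerkin projection onto a small subspace with orthonormal basis W
W, _ = np.linalg.qr(rng.standard_normal((n, m)))
H = W.T @ A @ W                        # small m x m projected problem
theta, S = np.linalg.eigh(H)           # (W^T A W) s = theta s
v = W @ S[:, 0]                        # Ritz vector v = W s
r = A @ v - theta[0] * v               # residual; Galerkin says r is orthogonal to W

# Deflation: project out a converged eigenvector q
lams, V = np.linalg.eigh(A)
q = V[:, 0][:, None]
Abar = (np.eye(n) - q @ q.T) @ A @ (np.eye(n) - q @ q.T)
```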


SLIDE 17

Inner iteration

Jacobi-Davidson correction equation

◮ Based on the current approximation: Aṽ − λ̃ṽ = r
◮ Avoid the (near) singularity of (A − λ̃I)⁻¹ through a deflation with Q̃ = (Q ṽ)
  → Solve approximately: (I − Q̃Q̃ᵀ)(A − λ̃I)(I − Q̃Q̃ᵀ)w_{k+1} = −r
◮ Use some steps of an iterative solver (GMRES, BiCGStab, . . . )
◮ Provides a new direction w_{k+1} for the subspace iteration
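
The projected operator above can be formed and solved on a toy problem. In this NumPy sketch (my illustration; a dense least-squares solve stands in for the few GMRES/BiCGStab steps of the slide), the deflated operator is singular by construction, but the right-hand side −r lies in its range, and the computed correction stays orthogonal to Q̃.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric test matrix

# current approximation (a rough random guess for illustration)
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
theta = v @ A @ v                      # Rayleigh quotient as lambda-tilde
r = A @ v - theta * v                  # residual, orthogonal to v

Qt = v[:, None]                        # Qtilde = (v); no locked pairs yet
P = np.eye(n) - Qt @ Qt.T              # projector I - Qtilde Qtilde^T
M = P @ (A - theta * np.eye(n)) @ P    # deflated (hence singular) operator

# approximate solve; lstsq stands in for a few GMRES/BiCGStab steps
w = np.linalg.lstsq(M, -r, rcond=None)[0]
w = P @ w                              # keep the correction orthogonal to Qtilde
```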


SLIDE 22

Block JDQR method

Idea

◮ Calculate corrections for nb eigenvalues at once
◮ Block correction equation with Q̃ = (Q ṽ₁ . . . ṽ_nb):

  (I − Q̃Q̃ᵀ)(A − λ̃ᵢI)(I − Q̃Q̃ᵀ)w_{k+i} = −rᵢ,   i = 1, . . . , nb

  → Approximately solve nb linear systems at once
◮ Provides new directions w_{k+1}, . . . , w_{k+nb} for the subspace iteration

Numerical properties

◮ More robust
◮ Usually needs more operations


SLIDE 24

Block JDQR method

Sketch of the complete algorithm

 1: Setup initial subspace
 2: while not converged do
 3:     Project the problem to a small subspace
 4:     Solve the small eigenvalue problem
 5:     Calculate nb approximations and their residuals
 6:     Lock converged eigenvalues
 7:     Shrink subspace if required (thick restart)
 8:     Approximately solve the nb correction equations
 9:     Block-orthogonalize the new directions
10:     Enlarge the subspace
11: end while
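
The loop above can be sketched end to end. This minimal NumPy version is my illustration, not the ESSEX code: it runs the single-vector case (nb = 1), omits locking and thick restart, and replaces the inner GMRES by an exact least-squares solve of the correction equation.

```python
import numpy as np

def jacobi_davidson(A, tol=1e-9, max_iter=60, seed=0):
    """Minimal single-vector version of the sketch above (no locking/restart,
    least-squares solve in place of inner GMRES); targets the smallest eigenvalue."""
    n = len(A)
    I = np.eye(n)
    w = np.random.default_rng(seed).standard_normal(n)
    W = (w / np.linalg.norm(w))[:, None]           # 1: setup initial subspace
    for _ in range(max_iter):                      # 2: while not converged
        H = W.T @ A @ W                            # 3: project the problem
        theta, S = np.linalg.eigh(H)               # 4: small eigenvalue problem
        lam, v = theta[0], W @ S[:, 0]             # 5: approximation ...
        r = A @ v - lam * v                        #    ... and its residual
        if np.linalg.norm(r) < tol:
            break
        P = I - np.outer(v, v)                     # deflate current approximation
        M = P @ (A - lam * I) @ P                  # 8: correction equation ...
        w = np.linalg.lstsq(M, -r, rcond=None)[0]  #    ... solved approximately
        for _ in range(2):                         # 9: orthogonalize new direction
            w -= W @ (W.T @ w)
        nw = np.linalg.norm(w)
        if nw < 1e-12:
            break
        W = np.column_stack([W, w / nw])           # 10: enlarge the subspace
    return lam, v

rng = np.random.default_rng(7)
A = rng.standard_normal((30, 30))
A = (A + A.T) / 2
lam, v = jacobi_davidson(A)
```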

SLIDE 25

Outline

◮ Block-Jacobi-Davidson algorithm
◮ Performance analysis
  ◮ Required linear algebra operations
  ◮ spMMVM single node
  ◮ spMMVM inter-node
  ◮ Block vector operations
  ◮ Jacobi-Davidson operator
◮ Implementation
◮ Results


SLIDE 28

Required linear algebra operations

Sparse matrix-multiple-vector multiplication (spMMVM)

◮ Large distributed sparse matrix A in CRS or SELL-C-σ format
◮ Distributed blocks of vectors X, Y ∈ ℝ^{n×nb}
◮ Shifted spMMVM: yᵢ ← (A − λ̃ᵢI)xᵢ,   i = 1, . . . , nb

Block vector operations

◮ Different types of operations:

  local, BLAS 1:          Y ← X + Y   (per vector xᵢ, i = 1, . . . , nb)
  local, BLAS 3:          Y ← XM
  all-reduction:          M ← XᵀY

◮ Redundantly stored small matrices M ∈ ℝ^{nb×nb}
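
These kernels are easy to state concretely. The NumPy sketch below (my illustration; a dense matrix stands in for the distributed sparse A, and there is no MPI) shows the shifted spMMVM applied column-wise with one shift per vector, plus the two block vector operations from the table.

```python
import numpy as np

rng = np.random.default_rng(2)
n, nb = 200, 4
A = rng.standard_normal((n, n))        # dense stand-in for the sparse matrix
X = rng.standard_normal((n, nb))       # block of nb vectors
shifts = rng.standard_normal(nb)       # one shift lambda_i per column

# shifted spMMVM: y_i <- (A - lambda_i I) x_i for all i in one sweep
Y = A @ X - X * shifts                 # broadcasting applies shift i to column i

# block vector operations: BLAS-3 update and the small nb x nb result
M = rng.standard_normal((nb, nb))
Z = X @ M                              # Y <- X M (local)
G = X.T @ Y                            # M <- X^T Y (all-reduced in parallel)
```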

SLIDE 32

spMMVM single node

Roofline performance model

◮ Possible performance:

  P_ideal = min(P_peak, b_data / B_code)

◮ Ideal code balance:

  B_ideal = (6/nb + 8/nnzr) bytes/flop

  → memory bound on all current architectures
◮ Irregular accesses to X lead to a higher data volume:

  Ω = V_measured / V_ideal ≥ 1
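
Plugging numbers into the model makes the blocking benefit visible. The sketch below is my illustration: the peak performance and bandwidth values are assumed example figures, not measurements from the talk.

```python
def code_balance(nb, nnzr):
    """Ideal code balance of the spMMVM from the slide:
    B = 6/nb + 8/nnzr bytes/flop."""
    return 6.0 / nb + 8.0 / nnzr

def p_ideal(p_peak, b_data, b_code):
    """Roofline: P_ideal = min(P_peak, b_data / B_code)."""
    return min(p_peak, b_data / b_code)

# assumed example machine: 100 GFlop/s peak, 40 GB/s memory bandwidth, nnzr = 15;
# increasing nb lowers the code balance and raises the attainable spMMVM rate
roofs = {nb: p_ideal(100.0, 40.0, code_balance(nb, 15)) for nb in (1, 2, 4, 8)}
```

With these (assumed) numbers every block size stays below peak, matching the slide's conclusion that the kernel is memory bound.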

SLIDE 33

spMMVM single node

Row- vs. column-major storage

Setup

◮ Matrix dimensions: n ≈ 10⁶ with nnzr ≈ 10 non-zeros per row
◮ CRS format
◮ 6-core Intel Westmere CPU

[Figure: performance (GFlop/s) vs. block size nb ∈ {1, 2, 4, 8} for row-major, row-major with NT stores, and column-major storage; annotated data-volume overheads Ω = 1.17, 1.32, 1.57, 1.94]

SLIDE 34

spMMVM single node

Intra-socket scaling

Setup

◮ Matrix dimensions: n ≈ 10⁷ with nnzr ≈ 15 non-zeros per row
◮ SELL-C-σ format
◮ 10-core Intel Ivy Bridge CPU

Results

  block size | 1    | 4    | 8
  Ω          | 1.08 | 1.54 | 1.97
  Speedup    | 1.0  | 2.1  | 2.8

[Figure: performance (GFlop/s) vs. number of cores (1-10) for single-vector, block size 4, and block size 8]

SLIDE 35

spMMVM inter-node

Setup

◮ Strong scaling: n ≈ 4·10⁷, nnzr ≈ 15
◮ Distributed CRS
◮ RRZE's Emmy cluster, each node: 2× 10-core Intel Ivy Bridge
◮ Timings from the completed algorithm

Explanation

Two counteracting effects:

◮ Communication volume increases independently of blocking
◮ Message aggregation

[Figure: speedup through blocking vs. number of nodes (2-64) for single-vector and block sizes 2, 4, 8]


SLIDE 37

Block vector operations

Background

◮ All operations are memory bound (also the GEMM, as all matrices are very tall and skinny).
◮ The code balance accurately predicts the node-level performance!

Results of blocking

◮ Faster BLAS 3 operations (e.g. Y ← XM)
◮ Message aggregation for all-reductions (e.g. M ← XᵀY)

→ Improved performance of some operations

SLIDE 38

Complete Jacobi-Davidson operator

Setup

◮ spMMVM + 2× GEMM: yᵢ ← (I − QQᵀ)(A − λ̃ᵢI)xᵢ with Q ∈ ℝ^{n×8}, i = 1, . . . , nb
◮ Matrix with n ≈ 10⁷, nnzr ≈ 15
◮ SELL-C-σ format
◮ 10-core Intel Ivy Bridge CPU

[Figure: runtime per block size (s) vs. block size nb ∈ {1, 2, 4, 8}, split into spMMVM, GEMM 1, and GEMM 2]

SLIDE 39

Outline

◮ Block-Jacobi-Davidson algorithm
◮ Performance analysis
◮ Implementation
  ◮ Frameworks from the ESSEX project
  ◮ Block pipelined GMRES algorithm
  ◮ Kernel routines
◮ Results


SLIDE 44

Frameworks from the ESSEX project

phist (Pipelined Hybrid-parallel Linear Solver Toolkit)

◮ General C interface to linear algebra libraries:
  ◮ GHOST
  ◮ Trilinos (C++, http://trilinos.sandia.gov)
  ◮ builtin (for prototyping; row-major storage, Fortran + C99)
◮ Iterative solvers
◮ Large test framework

GHOST (General Hybrid Optimized Sparse Toolkit)

◮ Developed at the RRZE in Erlangen
◮ Hybrid-parallel: MPI+OpenMP+CUDA
◮ SELL-C-σ matrix format


SLIDE 48

Block pipelined GMRES algorithm

Idea

◮ Solve nb linear systems Axᵢ = bᵢ at once
◮ Remove converged systems and add new ones (pipelining)
◮ Standard GMRES method (for each system)

[Figure: pipelining scheme: the shared operation ṽ_{k+1} ← Avₖ feeds the individual GMRES subspaces 1, 2, . . . ]
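
The pipelining idea itself is independent of the inner solver. The NumPy sketch below is my illustration of just that bookkeeping: nb systems are kept "in flight" sharing one block operation per step, converged systems are retired, and queued right-hand sides take their slots. A damped Richardson iteration stands in for the per-system GMRES of the slide (assuming A is SPD).

```python
import numpy as np

def block_pipelined_solve(A, B_rhs, nb=2, tol=1e-8, max_steps=2000):
    """Pipelined block solve of A x_i = b_i (Richardson stand-in for GMRES)."""
    n, nrhs = B_rhs.shape
    omega = 1.0 / np.linalg.norm(A, 2)        # safe step size for SPD A
    X = np.zeros((n, nrhs))
    queue = list(range(nrhs))
    active = [queue.pop(0) for _ in range(min(nb, nrhs))]
    for _ in range(max_steps):
        if not active:
            break
        cols = np.array(active)
        R = B_rhs[:, cols] - A @ X[:, cols]   # block residual: one shared spMMVM
        X[:, cols] += omega * R               # one solver step for all systems
        done = [j for k, j in enumerate(active)
                if np.linalg.norm(R[:, k]) < tol]
        for j in done:
            active.remove(j)                  # retire the converged system ...
            if queue:
                active.append(queue.pop(0))   # ... and pull the next one in
    return X

rng = np.random.default_rng(4)
n, nrhs = 20, 5
S = rng.standard_normal((n, n))
A = S @ S.T / n + np.eye(n)                   # SPD, modestly conditioned
B_rhs = rng.standard_normal((n, nrhs))
X = block_pipelined_solve(A, B_rhs, nb=2)
```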


SLIDE 54

Kernel routines

Performance pitfalls

◮ Common BLAS libraries are slow for the block vector GEMM operations here (e.g. M ← XᵀY).
◮ Getting NUMA accesses right is difficult in a complex library. (Works fine in GHOST!)
◮ Strided accesses in row-major block vectors are slow. (Hidden by the interface.)

Implementation details

◮ Dedicated kernel routines for individual block sizes
◮ Non-temporal stores where possible
◮ SSE / AVX intrinsics
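
A small NumPy aside (mine, not from the talk) just to make the row- vs. column-major distinction concrete: both layouts hold the same block vector and give identical results for the kernels above, which is why the storage choice can be hidden behind the interface; the code does not reproduce the performance difference itself.

```python
import numpy as np

n, nb = 1000, 4
X_row = np.ascontiguousarray(
    np.random.default_rng(3).standard_normal((n, nb)))  # C order = row-major:
                                                        # the nb entries of one
                                                        # row are contiguous
X_col = np.asfortranarray(X_row)                        # same values, F order
                                                        # = column-major storage

# the block-vector kernels (e.g. M <- X^T X) are layout-independent in result
M_row = X_row.T @ X_row
M_col = X_col.T @ X_col
```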

SLIDE 55

Outline

◮ Block-Jacobi-Davidson algorithm
◮ Performance analysis
◮ Implementation
◮ Results
  ◮ Numerical behavior
  ◮ Performance of the complete algorithm
  ◮ Conclusion

SLIDE 56

Numerical behavior

[Figures for block sizes 2 and 4: relative overhead (number of spMVMs) vs. number of eigenvalues found (5-50); test matrices: Andrews, cfd1, finan512, torsion1, ck656, cry10000, dw8192, rdb3200l]

SLIDE 57

Performance of the complete algorithm (1)

[Figures: block speedup t₁/t_block and strong-scaling runtime (s) vs. number of nodes (1-32, 20 cores each) for single vector and block sizes 2 and 4]

SLIDE 58

Performance of the complete algorithm (2)

[Figures: block speedup t₁/t_block and strong-scaling runtime (s) vs. number of nodes (2-64, 20 cores each) for single vector and block sizes 2 and 4]


SLIDE 61

Conclusion (1)

Block JDQR algorithm

◮ Performance analysis of block operations:
  ◮ Impact of memory layout (row-major instead of column-major)
  ◮ Node-level: better code balance
  ◮ Inter-node: message aggregation
◮ Numerical behavior:
  ◮ Slight increase in the number of operations
  ◮ Overhead small for > 10 eigenvalues

→ Improved performance by a factor of 1.2-1.4


SLIDE 64

Conclusion (2)

Outlook

◮ Numerical improvements:
  ◮ Specialized solver for symmetric problems
  ◮ Parallel preconditioning of the linear problems
◮ Algorithmic improvements:
  ◮ Hiding spMMVM communication behind other operations
  ◮ More asynchronous communication (all-reductions)
  ◮ Fast block orthogonalization (TSQR)
◮ Hybrid calculations with GPUs