SLIDE 1

Can Block Iterative Eigensolvers Fulfill their Performance Promise?

Jonas Thies, Melven Röhrig-Zöllner, Achim Basermann
DLR Simulation and Software Technology
Distributed Systems and Component Software, High Performance Computing Group
Project ESSEX

SLIDE 2

Motivation

Mathematical problem

  • Find O(10) eigenpairs

      A x_i = λ_i x_i

    of a large, sparse matrix A.
  • Interior or extreme λ_i,
  • symmetric or general A.

Memory gap

  • Small memory bandwidth vs. high peak Flop rate
    → increase the compute intensity

Roofline performance model (2× 12-core Haswell EP)

[Figure: roofline plot, DP GFlop/s vs. compute intensity [Flop / DP element]; peak bandwidth and peak Flop/s roofs, with BLAS1 (ddot) far below the Flop/s roof]
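The underlying roofline bound is

  attainable Flop/s = min( peak Flop/s, peak bandwidth × compute intensity )

As a back-of-envelope check (the node numbers here are illustrative assumptions, not measurements): ddot performs 2 flops per 2 DP elements loaded, i.e. an intensity of 1 Flop per DP element (1/8 Flop/byte). Assuming ~100 GB/s memory bandwidth, ddot is capped near 12.5 GFlop/s regardless of a DP peak in the hundreds of GFlop/s, so only a higher compute intensity can move a kernel toward the Flop/s roof.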

SLIDE 3

Regular vs. block JDQR

 1: start from initial spaces: V^H V = I, V^A := A V, H = V^H A V
 2: while ... do
 3:   (sorted) Schur decomposition H = S T S^H, S^H S = I
 4:   select shift θ = T_{1,1}
 5:   u = V s_1, u^A = V^A s_1
 6:   r = u^A − θ u
 7:   (lock converged eigenpairs in Q)
 8:   (shrink subspaces if necessary)
 9:   Q̃ = [Q u]; solve approximately for t ⊥ Q̃:
10:     (I − Q̃ Q̃^H)(A − θ I)(I − Q̃ Q̃^H) t = −r
11:   orthogonalize t against V
12:   extend V = [V t̃], V^A, H = V^H V^A
13: end while

SLIDE 4

Regular vs. block JDQR

Block variant — the lines that change for block size n_b:

  (...)
  while ... do
    (...)
    select shifts θ_i = T_{i,i}, i = 1 ... n_b
    u = V S_{:,1:n_b},  u^A = V^A S_{:,1:n_b}
    r_{:,i} = u^A_{:,i} − θ_i u_{:,i}
    (...)
    Q̃ = [Q u]; solve n_b independent systems
      (I − Q̃ Q̃^H)(A − θ_i I)(I − Q̃ Q̃^H) t_{:,i} = −r_{:,i}
    block-orthogonalize t against V, extend subspaces (by n_b vectors)
  end while
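To see why the blocked residual costs no extra memory traffic per vector, here is a minimal C++ sketch of the line r_{:,i} = u^A_{:,i} − θ_i u_{:,i} on row-major block vectors; the layout and function name are illustrative assumptions, not the actual ESSEX kernel interface.

  #include <cstddef>
  #include <vector>

  // Blocked residual r(:,i) = uA(:,i) - theta[i]*u(:,i), i = 0..nb-1.
  // Row-major block vectors: element (row j, column i) lives at [j*nb + i],
  // so the inner loop over the block is stride-1 and SIMD-friendly, and the
  // whole update is a single sweep over memory independent of nb.
  void block_residual(std::size_t n, std::size_t nb,
                      const std::vector<double>& uA,
                      const std::vector<double>& u,
                      const std::vector<double>& theta,
                      std::vector<double>& r) {   // r pre-sized to n*nb
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t i = 0; i < nb; ++i)
        r[j*nb + i] = uA[j*nb + i] - theta[i] * u[j*nb + i];
  }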

SLIDE 5

Numerical behavior

[Figure, two panels (block size 2 and block size 4): relative overhead (# spMVMs) vs. number of eigenvalues found (5–50); test matrices: Andrews, cfd1, finan512, torsion1, ck656, cry10000, dw8192, rdb3200l]

SLIDE 6

Kernels needed for BJDQR

  • spMMVM: sparse matrix times multiple-vector multiplication
  • block projection: y ← (I − V V^H) x (see the sketch after this list)
  • block vector orthogonalization:
    given V, W with V^H V = I, find Q, R such that W = Q R, Q^H Q = I, V^H Q = 0
  • subspace transformation for ‘thick restart’:
    V_{:,1:k} ← V_{:,1:m} · S,  S ∈ C^{m×k}, k < m
  • preconditioning operation approximating (A − τ I)^{−1} x
    (so far unpreconditioned blocked Krylov → ESSEX-II)
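As referenced in the list above, a minimal C++ sketch of the block projection kernel y ← (I − V V^H) x for real data and a single vector x; plain loops stand in for the optimized GHOST kernels, and all names are illustrative assumptions.

  #include <cstddef>
  #include <vector>

  // y <- (I - V V^T) x for k orthonormal columns V(:,0..k-1).
  // Column-major for clarity here: V(j,c) lives at V[j + c*n].
  void block_project(std::size_t n, std::size_t k,
                     const std::vector<double>& V,
                     const std::vector<double>& x,
                     std::vector<double>& y) {
    std::vector<double> c(k, 0.0);
    for (std::size_t col = 0; col < k; ++col)   // c = V^T x (k reductions)
      for (std::size_t j = 0; j < n; ++j)
        c[col] += V[j + col*n] * x[j];
    y = x;                                      // y = x - V c
    for (std::size_t col = 0; col < k; ++col)
      for (std::size_t j = 0; j < n; ++j)
        y[j] -= V[j + col*n] * c[col];
  }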

SLIDE 7

Row major vs. column major storage

Kernels timed: y_j ← (A − θ_j I) x_j,  S ← Q^T Y,  Y ← Y − Q S

[Figure, three panels: runtime [s] (5–20) vs. block size n_b ∈ {1, 2, 4, 8} for Tpetra/TBB (column major), GHOST (column major), and GHOST (row major)]

  • consequences for overall implementation (avoid strides!) — see the spMMVM sketch below
  • SELL-C-σ most helpful if SIMD/SIMT width > n_b
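Why the row-major layout wins is visible in a minimal CRS spMMVM sketch: for each stored entry A(row, col), the innermost loop reads X(col, 0..n_b−1), which is contiguous (stride-1) in row-major storage but strided by n in column-major storage. The CRS arrays follow the usual convention; this is an illustrative sketch, not the GHOST implementation.

  #include <cstddef>
  #include <vector>

  // Y += A*X for a CRS matrix A (n rows) and row-major block vectors
  // X, Y with nb columns: element (j, i) lives at [j*nb + i].
  void spmmvm(std::size_t n, std::size_t nb,
              const std::vector<std::size_t>& rowptr,  // size n+1
              const std::vector<std::size_t>& colidx,  // size nnz
              const std::vector<double>& val,          // size nnz
              const std::vector<double>& X,
              std::vector<double>& Y) {                // pre-sized n*nb
    for (std::size_t row = 0; row < n; ++row)
      for (std::size_t e = rowptr[row]; e < rowptr[row+1]; ++e) {
        const double a = val[e];
        const std::size_t col = colidx[e];
        for (std::size_t i = 0; i < nb; ++i)  // stride-1 in both X and Y
          Y[row*nb + i] += a * X[col*nb + i];
      }
  }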
SLIDE 11

Overall block speedup (strong scaling)

Setup

  • Non-symmetric matrix from a 7-point 3D PDE discretization
    (n ≈ 1.3 · 10^8, nnz ≈ 9.4 · 10^8)
  • Seeking 20 eigenvalues
  • Ivy Bridge cluster

Results

  • n_b = 2: significantly faster
  • n_b = 4: no further improvement

See Röhrig-Zöllner et al., SISC 37(6), 2015.

[Figure: strong-scaling bar chart over 8, 16, 32, 64, 128 nodes for n_b = 1, 2, 4; constant bar height ≡ linear strong scaling]

SLIDE 12

Block orthogonalization schemes

  • Given orthogonal vectors [v_1, ..., v_k] = V
  • For X ∈ R^{n×n_b}, find orthogonal Y ∈ R^{n×ñ_b} with

      Y R_1 = X − V R_2  and  V^T Y = 0

Two-phase algorithms

  Phase 1 (project):        X̃ ← (I − V V^T) X
  Phase 2 (orthogonalize):  Y ← f(X̃)

  • communication-optimal choices for f:
    CholQR or SVQB (Stathopoulos and Wu, SISC 2002); TSQR (Demmel et al., SISC 2012)
  • Each phase degrades the accuracy of the other → iterate
    (a CholQR sketch follows below)
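As referenced above, a minimal C++ sketch of one communication-optimal choice for f, CholQR: one Gram matrix (a single global reduction in parallel), one small Cholesky factorization, one block update. Column-major data for clarity; it assumes X̃ has full numerical rank, where a production code would fall back (e.g. to SVQB or a second iteration) on breakdown.

  #include <cmath>
  #include <cstddef>
  #include <vector>

  // CholQR: given X (n x nb, column-major, X(j,c) at X[j + c*n]),
  // overwrite X with Q such that X_old = Q R, Q^T Q = I.
  // Returns false if the Cholesky factorization breaks down.
  bool cholqr(std::size_t n, std::size_t nb, std::vector<double>& X) {
    std::vector<double> G(nb*nb, 0.0);          // G = X^T X: one reduction
    for (std::size_t a = 0; a < nb; ++a)
      for (std::size_t b = 0; b < nb; ++b)
        for (std::size_t j = 0; j < n; ++j)
          G[a + b*nb] += X[j + a*n] * X[j + b*n];

    std::vector<double> R(nb*nb, 0.0);          // G = R^T R, R upper triangular
    for (std::size_t a = 0; a < nb; ++a) {
      double d = G[a + a*nb];
      for (std::size_t k = 0; k < a; ++k) d -= R[k + a*nb] * R[k + a*nb];
      if (d <= 0.0) return false;               // rank-deficient block
      R[a + a*nb] = std::sqrt(d);
      for (std::size_t b = a + 1; b < nb; ++b) {
        double s = G[a + b*nb];
        for (std::size_t k = 0; k < a; ++k) s -= R[k + a*nb] * R[k + b*nb];
        R[a + b*nb] = s / R[a + a*nb];
      }
    }

    for (std::size_t j = 0; j < n; ++j)         // Q = X R^{-1}, in place:
      for (std::size_t a = 0; a < nb; ++a) {    // column a uses columns < a
        double s = X[j + a*n];
        for (std::size_t k = 0; k < a; ++k) s -= X[j + k*n] * R[k + a*nb];
        X[j + a*n] = s / R[a + a*nb];
      }
    return true;
  }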
SLIDE 13

Raising the computational intensity

Kernel fusion

Fused kernels (each pair computed in a single sweep over the block vectors; a sketch follows below):

  Phase 2:  X̃ ← X B   fused with   C ← V^T X̃
  Phase 1:  X̃ ← X − V C   fused with   B̃ ← X̃^T X̃
  Phase 3:  X̃ ← X B   fused with   B̃ ← X̃^T X̃

⇒ use SVQB (its update is a plain block scaling X̃ ← X̃ B, which fuses well)
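A minimal C++ sketch of the middle fused pair: the projection update X̃ ← X − V C and the Gram matrix B̃ ← X̃^T X̃ computed in one sweep, so the tall block vector is loaded and stored only once. Layouts and names are illustrative assumptions.

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // Fused kernel: Xt = X - V*C and Bt = Xt^T * Xt in a single sweep.
  // Row-major: X is n x nb at X[j*nb+i], V is n x k at V[j*k+c],
  // C is k x nb at C[c*nb+i], Bt is nb x nb at Bt[a*nb+b].
  void fused_project_gram(std::size_t n, std::size_t k, std::size_t nb,
                          const std::vector<double>& V,
                          const std::vector<double>& C,
                          std::vector<double>& X,       // in: X, out: Xt
                          std::vector<double>& Bt) {
    std::fill(Bt.begin(), Bt.end(), 0.0);
    for (std::size_t j = 0; j < n; ++j) {
      double* xj = &X[j*nb];
      for (std::size_t c = 0; c < k; ++c) {     // xj -= V(j,c) * C(c,:)
        const double v = V[j*k + c];
        for (std::size_t i = 0; i < nb; ++i) xj[i] -= v * C[c*nb + i];
      }
      for (std::size_t a = 0; a < nb; ++a)      // Bt += xj xj^T
        for (std::size_t b = 0; b < nb; ++b)
          Bt[a*nb + b] += xj[a] * xj[b];
    }
  }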

Increased precision

Idea: calculate both the value and the rounding error of each arithmetic operation

  • Store intermediate results as double-double (DD) numbers
  • Based on arithmetic building blocks (2Sum, 2Mult)

Muller et al.: Handbook of Floating-Point Arithmetic, Springer 2010

  • Exploit FMA operations (AVX2)
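The two building blocks named above are standard error-free transformations (Muller et al., 2010). A minimal C++ sketch, with 2Mult written via FMA as suggested on the slide, plus a simplified double-double accumulation step to show how they combine; the accumulation scheme is a deliberately compact illustration, not the tuned AVX2 kernel.

  #include <cmath>

  // 2Sum (Knuth): s = fl(a+b) and err such that a + b = s + err exactly.
  inline void two_sum(double a, double b, double& s, double& err) {
    s = a + b;
    const double bb = s - a;
    err = (a - (s - bb)) + (b - bb);
  }

  // 2Mult via FMA: p = fl(a*b) and err = a*b - p, exact on FMA hardware.
  inline void two_prod(double a, double b, double& p, double& err) {
    p = a * b;
    err = std::fma(a, b, -p);
  }

  // One dot-product step accumulated as a double-double pair (hi, lo):
  // the rounding errors of the product and the sum are folded into lo.
  inline void dd_dot_step(double x, double y, double& hi, double& lo) {
    double p, pe; two_prod(x, y, p, pe);
    double s, se; two_sum(hi, p, s, se);
    hi = s;
    lo += pe + se;
  }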
SLIDE 14

Results: accuracy after one iteration

Error in V^T Y

[Figure: error (10^−18 to 10^0) vs. condition number κ(X, V) (10^0 to 10^25); curves: project+TSQR, project/SVQB, DD project/SVQB]

Error in Y^T Y

[Figure: error (10^−18 to 10^0) vs. condition number κ(X) (10^0 to 10^12), for different κ(X, W); marked breakdown]

SLIDE 15

Results: runtime to orthogonality

[Figure: runtime / block size n_b [s] (0.01–0.1) for n_proj × n_b ∈ {10, 20, 40} × {1, 2, 4}; series: iterated project/SVQB, the same with fused operations, and with fused DD operations]

SLIDE 16

Block orthogonalization alleviates the memory gap

Faster through

  • block algorithms
  • kernel fusion
  • increased-precision arithmetic

(more flops per datum, not more data)

[Figure: roofline plot, DP GFlop/s vs. compute intensity [Flop / DP element], with peak bandwidth and peak Flop/s roofs; modified Gram–Schmidt sits near the bandwidth roof, fused DD 40 × 4 much higher]

SLIDE 17

Summary

Lessons learned

  • don’t use BLAS-3 for ‘tall, skinny’ matrices
  • use row-major storage for block vectors
  • implement algorithms accordingly (avoid strides)
  • use ‘free’ operations to increase accuracy / reduce iterations

Outlook: Peta and beyond

  • reduced synchronization starts to pay off at scale
  • increasing memory gap favors block methods
  • crucial for ESSEX-II: matrix reordering and preconditioning