SLIDE 1

Can Block Iterative Eigensolvers Fulfill their Performance Promise?

Jonas Thies, Melven Röhrig-Zöllner, Achim Basermann
DLR Simulation and Software Technology
Distributed Systems and Component Software, High Performance Computing Group
Project ESSEX

SLIDE 2

Motivation

Mathematical problem

  • Find O(10) eigenpairs

      A x_i = λ_i x_i

    of a large, sparse matrix A.
  • Interior or extreme λ_i,
  • symmetric or general A.

Memory gap

  • Small memory bandwidth vs. high peak Flop rate
    → increase the compute intensity

Roofline performance model (2× 12-core Haswell EP)

[Figure: roofline plot, DP GFlop/s vs. compute intensity [Flop / DP element]; peak bandwidth and peak Flop/s roofs, with BLAS1 (ddot) far below the Flop/s roof]
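The underlying roofline bound is

  attainable Flop/s = min( peak Flop/s, peak bandwidth × compute intensity )

As a back-of-envelope check (the node numbers here are illustrative assumptions, not measurements): ddot performs 2 flops per 2 DP elements loaded, i.e. an intensity of 1 Flop per DP element (1/8 Flop/byte). Assuming ~100 GB/s memory bandwidth, ddot is capped near 12.5 GFlop/s regardless of a DP peak in the hundreds of GFlop/s, so only a higher compute intensity can move a kernel toward the Flop/s roof.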

SLIDE 3

Regular vs. block JDQR

 1: start from initial spaces: V^H V = I, V^A := A V, H = V^H A V
 2: while ... do
 3:   (sorted) Schur decomposition H = S T S^H, S^H S = I
 4:   select shift θ = T_{1,1}
 5:   u = V s_1, u^A = V^A s_1
 6:   r = u^A − θ u
 7:   (lock converged eigenpairs in Q)
 8:   (shrink subspaces if necessary)
 9:   Q̃ = [Q u]; solve approximately for t ⊥ Q̃:
10:     (I − Q̃ Q̃^H)(A − θ I)(I − Q̃ Q̃^H) t = −r
11:   orthogonalize t against V
12:   extend V = [V t̃], V^A, H = V^H V^A
13: end while

SLIDE 4

Regular vs. block JDQR

Block variant — the lines that change for block size n_b:

  (...)
  while ... do
    (...)
    select shifts θ_i = T_{i,i}, i = 1 ... n_b
    u = V S_{:,1:n_b},  u^A = V^A S_{:,1:n_b}
    r_{:,i} = u^A_{:,i} − θ_i u_{:,i}
    (...)
    Q̃ = [Q u]; solve n_b independent systems
      (I − Q̃ Q̃^H)(A − θ_i I)(I − Q̃ Q̃^H) t_{:,i} = −r_{:,i}
    block-orthogonalize t against V, extend subspaces (by n_b vectors)
  end while
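To see why the blocked residual costs no extra memory traffic per vector, here is a minimal C++ sketch of the line r_{:,i} = u^A_{:,i} − θ_i u_{:,i} on row-major block vectors; the layout and function name are illustrative assumptions, not the actual ESSEX kernel interface.

  #include <cstddef>
  #include <vector>

  // Blocked residual r(:,i) = uA(:,i) - theta[i]*u(:,i), i = 0..nb-1.
  // Row-major block vectors: element (row j, column i) lives at [j*nb + i],
  // so the inner loop over the block is stride-1 and SIMD-friendly, and the
  // whole update is a single sweep over memory independent of nb.
  void block_residual(std::size_t n, std::size_t nb,
                      const std::vector<double>& uA,
                      const std::vector<double>& u,
                      const std::vector<double>& theta,
                      std::vector<double>& r) {   // r pre-sized to n*nb
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t i = 0; i < nb; ++i)
        r[j*nb + i] = uA[j*nb + i] - theta[i] * u[j*nb + i];
  }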

SLIDE 5

Numerical behavior

[Figure, two panels (block size 2 and block size 4): relative overhead (# spMVMs) vs. number of eigenvalues found (5–50); test matrices: Andrews, cfd1, finan512, torsion1, ck656, cry10000, dw8192, rdb3200l]

SLIDE 6

Kernels needed for BJDQR

  • spMMVM: sparse matrix times multiple-vector multiplication
  • block projection: y ← (I − V V^H) x (see the sketch after this list)
  • block vector orthogonalization:
    given V, W with V^H V = I, find Q, R such that W = Q R, Q^H Q = I, V^H Q = 0
  • subspace transformation for ‘thick restart’:
    V_{:,1:k} ← V_{:,1:m} · S,  S ∈ C^{m×k}, k < m
  • preconditioning operation approximating (A − τ I)^{−1} x
    (so far unpreconditioned blocked Krylov → ESSEX-II)
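As referenced in the list above, a minimal C++ sketch of the block projection kernel y ← (I − V V^H) x for real data and a single vector x; plain loops stand in for the optimized GHOST kernels, and all names are illustrative assumptions.

  #include <cstddef>
  #include <vector>

  // y <- (I - V V^T) x for k orthonormal columns V(:,0..k-1).
  // Column-major for clarity here: V(j,c) lives at V[j + c*n].
  void block_project(std::size_t n, std::size_t k,
                     const std::vector<double>& V,
                     const std::vector<double>& x,
                     std::vector<double>& y) {
    std::vector<double> c(k, 0.0);
    for (std::size_t col = 0; col < k; ++col)   // c = V^T x (k reductions)
      for (std::size_t j = 0; j < n; ++j)
        c[col] += V[j + col*n] * x[j];
    y = x;                                      // y = x - V c
    for (std::size_t col = 0; col < k; ++col)
      for (std::size_t j = 0; j < n; ++j)
        y[j] -= V[j + col*n] * c[col];
  }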

SLIDE 7

Row major vs. column major storage

Kernels timed: y_j ← (A − θ_j I) x_j,  S ← Q^T Y,  Y ← Y − Q S

[Figure, three panels: runtime [s] (5–20) vs. block size n_b ∈ {1, 2, 4, 8} for Tpetra/TBB (column major), GHOST (column major), and GHOST (row major)]

  • consequences for overall implementation (avoid strides!) — see the spMMVM sketch below
  • SELL-C-σ most helpful if SIMD/SIMT width > n_b
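Why the row-major layout wins is visible in a minimal CRS spMMVM sketch: for each stored entry A(row, col), the innermost loop reads X(col, 0..n_b−1), which is contiguous (stride-1) in row-major storage but strided by n in column-major storage. The CRS arrays follow the usual convention; this is an illustrative sketch, not the GHOST implementation.

  #include <cstddef>
  #include <vector>

  // Y += A*X for a CRS matrix A (n rows) and row-major block vectors
  // X, Y with nb columns: element (j, i) lives at [j*nb + i].
  void spmmvm(std::size_t n, std::size_t nb,
              const std::vector<std::size_t>& rowptr,  // size n+1
              const std::vector<std::size_t>& colidx,  // size nnz
              const std::vector<double>& val,          // size nnz
              const std::vector<double>& X,
              std::vector<double>& Y) {                // pre-sized n*nb
    for (std::size_t row = 0; row < n; ++row)
      for (std::size_t e = rowptr[row]; e < rowptr[row+1]; ++e) {
        const double a = val[e];
        const std::size_t col = colidx[e];
        for (std::size_t i = 0; i < nb; ++i)  // stride-1 in both X and Y
          Y[row*nb + i] += a * X[col*nb + i];
      }
  }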
SLIDE 11

Overall block speedup (strong scaling)

Setup

  • Non-symmetric matrix from a 7-point 3D PDE discretization
    (n ≈ 1.3 · 10^8, nnz ≈ 9.4 · 10^8)
  • Seeking 20 eigenvalues
  • Ivy Bridge cluster

Results

  • n_b = 2: significantly faster
  • n_b = 4: no further improvement

See Röhrig-Zöllner et al., SISC 37(6), 2015.

[Figure: strong-scaling bar chart over 8, 16, 32, 64, 128 nodes for n_b = 1, 2, 4; constant bar height ≡ linear strong scaling]

SLIDE 12

Block orthogonalization schemes

  • Given orthogonal vectors [v_1, ..., v_k] = V
  • For X ∈ R^{n×n_b}, find orthogonal Y ∈ R^{n×ñ_b} with

      Y R_1 = X − V R_2  and  V^T Y = 0

Two-phase algorithms

  Phase 1 (project):        X̃ ← (I − V V^T) X
  Phase 2 (orthogonalize):  Y ← f(X̃)

  • communication-optimal choices for f:
    CholQR or SVQB (Stathopoulos and Wu, SISC 2002); TSQR (Demmel et al., SISC 2012)
  • Each phase degrades the accuracy of the other → iterate
    (a CholQR sketch follows below)
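As referenced above, a minimal C++ sketch of one communication-optimal choice for f, CholQR: one Gram matrix (a single global reduction in parallel), one small Cholesky factorization, one block update. Column-major data for clarity; it assumes X̃ has full numerical rank, where a production code would fall back (e.g. to SVQB or a second iteration) on breakdown.

  #include <cmath>
  #include <cstddef>
  #include <vector>

  // CholQR: given X (n x nb, column-major, X(j,c) at X[j + c*n]),
  // overwrite X with Q such that X_old = Q R, Q^T Q = I.
  // Returns false if the Cholesky factorization breaks down.
  bool cholqr(std::size_t n, std::size_t nb, std::vector<double>& X) {
    std::vector<double> G(nb*nb, 0.0);          // G = X^T X: one reduction
    for (std::size_t a = 0; a < nb; ++a)
      for (std::size_t b = 0; b < nb; ++b)
        for (std::size_t j = 0; j < n; ++j)
          G[a + b*nb] += X[j + a*n] * X[j + b*n];

    std::vector<double> R(nb*nb, 0.0);          // G = R^T R, R upper triangular
    for (std::size_t a = 0; a < nb; ++a) {
      double d = G[a + a*nb];
      for (std::size_t k = 0; k < a; ++k) d -= R[k + a*nb] * R[k + a*nb];
      if (d <= 0.0) return false;               // rank-deficient block
      R[a + a*nb] = std::sqrt(d);
      for (std::size_t b = a + 1; b < nb; ++b) {
        double s = G[a + b*nb];
        for (std::size_t k = 0; k < a; ++k) s -= R[k + a*nb] * R[k + b*nb];
        R[a + b*nb] = s / R[a + a*nb];
      }
    }

    for (std::size_t j = 0; j < n; ++j)         // Q = X R^{-1}, in place:
      for (std::size_t a = 0; a < nb; ++a) {    // column a uses columns < a
        double s = X[j + a*n];
        for (std::size_t k = 0; k < a; ++k) s -= X[j + k*n] * R[k + a*nb];
        X[j + a*n] = s / R[a + a*nb];
      }
    return true;
  }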
SLIDE 13

Raising the computational intensity

Kernel fusion

Fused kernels (each pair computed in a single sweep over the block vectors; a sketch follows below):

  Phase 2:  X̃ ← X B   fused with   C ← V^T X̃
  Phase 1:  X̃ ← X − V C   fused with   B̃ ← X̃^T X̃
  Phase 3:  X̃ ← X B   fused with   B̃ ← X̃^T X̃

⇒ use SVQB (its update is a plain block scaling X̃ ← X̃ B, which fuses well)
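A minimal C++ sketch of the middle fused pair: the projection update X̃ ← X − V C and the Gram matrix B̃ ← X̃^T X̃ computed in one sweep, so the tall block vector is loaded and stored only once. Layouts and names are illustrative assumptions.

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // Fused kernel: Xt = X - V*C and Bt = Xt^T * Xt in a single sweep.
  // Row-major: X is n x nb at X[j*nb+i], V is n x k at V[j*k+c],
  // C is k x nb at C[c*nb+i], Bt is nb x nb at Bt[a*nb+b].
  void fused_project_gram(std::size_t n, std::size_t k, std::size_t nb,
                          const std::vector<double>& V,
                          const std::vector<double>& C,
                          std::vector<double>& X,       // in: X, out: Xt
                          std::vector<double>& Bt) {
    std::fill(Bt.begin(), Bt.end(), 0.0);
    for (std::size_t j = 0; j < n; ++j) {
      double* xj = &X[j*nb];
      for (std::size_t c = 0; c < k; ++c) {     // xj -= V(j,c) * C(c,:)
        const double v = V[j*k + c];
        for (std::size_t i = 0; i < nb; ++i) xj[i] -= v * C[c*nb + i];
      }
      for (std::size_t a = 0; a < nb; ++a)      // Bt += xj xj^T
        for (std::size_t b = 0; b < nb; ++b)
          Bt[a*nb + b] += xj[a] * xj[b];
    }
  }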

Increased precision

Idea: calculate both the value and the rounding error of each arithmetic operation

  • Store intermediate results as double-double (DD) numbers
  • Based on arithmetic building blocks (2Sum, 2Mult)

Muller et al.: Handbook of Floating-Point Arithmetic, Springer 2010

  • Exploit FMA operations (AVX2)
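The two building blocks named above are standard error-free transformations (Muller et al., 2010). A minimal C++ sketch, with 2Mult written via FMA as suggested on the slide, plus a simplified double-double accumulation step to show how they combine; the accumulation scheme is a deliberately compact illustration, not the tuned AVX2 kernel.

  #include <cmath>

  // 2Sum (Knuth): s = fl(a+b) and err such that a + b = s + err exactly.
  inline void two_sum(double a, double b, double& s, double& err) {
    s = a + b;
    const double bb = s - a;
    err = (a - (s - bb)) + (b - bb);
  }

  // 2Mult via FMA: p = fl(a*b) and err = a*b - p, exact on FMA hardware.
  inline void two_prod(double a, double b, double& p, double& err) {
    p = a * b;
    err = std::fma(a, b, -p);
  }

  // One dot-product step accumulated as a double-double pair (hi, lo):
  // the rounding errors of the product and the sum are folded into lo.
  inline void dd_dot_step(double x, double y, double& hi, double& lo) {
    double p, pe; two_prod(x, y, p, pe);
    double s, se; two_sum(hi, p, s, se);
    hi = s;
    lo += pe + se;
  }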
SLIDE 14

Results: accuracy after one iteration

Error in V^T Y

[Figure: error (10^−18 to 10^0) vs. condition number κ(X, V) (10^0 to 10^25); curves: project+TSQR, project/SVQB, DD project/SVQB]

Error in Y^T Y

[Figure: error (10^−18 to 10^0) vs. condition number κ(X) (10^0 to 10^12), for different κ(X, W); marked breakdown]

SLIDE 15

Results: runtime to orthogonality

[Figure: runtime / block size n_b [s] (0.01–0.1) for n_proj × n_b ∈ {10, 20, 40} × {1, 2, 4}; series: iterated project/SVQB, the same with fused operations, and with fused DD operations]

SLIDE 16

Block orthogonalization alleviates the memory gap

Faster through

  • block algorithms
  • kernel fusion
  • increased-precision arithmetic

(more flops per datum, not more data)

[Figure: roofline plot, DP GFlop/s vs. compute intensity [Flop / DP element], with peak bandwidth and peak Flop/s roofs; modified Gram–Schmidt sits near the bandwidth roof, fused DD 40 × 4 much higher]

SLIDE 17

Summary

Lessons learned

  • don’t use BLAS-3 for ‘tall, skinny’ matrices
  • use row-major storage for block vectors
  • implement algorithms accordingly (avoid strides)
  • use ‘free’ operations to increase accuracy / reduce iterations

Outlook: Peta and beyond

  • reduced synchronization starts to pay off at scale
  • increasing memory gap favors block methods
  • crucial for ESSEX-II: matrix reordering and preconditioning