SLIDE 1

Feeding of the Thousands – Leveraging the GPU's Computing Power for Sparse Linear Algebra

SPPEXA Annual Meeting 2016, January 25th, 2016, Garching, Germany
Hartwig Anzt

SLIDE 2

Inherently parallel operations

  • axpy, copy, gemv...
  • usually memory bound
  • Kernel fusion
Sparse matrix vector product

  • often computationally the most expensive part
  • variety of storage formats, kernels…
  • row-thread mapping can result in imbalance
  • malicious memory access
Bottlenecks

  • sequential operations, unstructured, random memory access
  • incomplete factorizations (ILU/IC)
  • sparse triangular solves

Sparse Linear Algebra on GPUs

SLIDE 3

  • Sliced Ellpack (SELL) format as trade-off between CSR and Ellpack

  • Sorting can improve load-balancing

Sparse Matrix Vector Product (SpMV)

Kreutzer et al.: A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units, SISC 36(5), 2014.

[Figure: sparse storage formats illustrated on a small example matrix: CSR (values, colind, rowptr, where rowptr points to the first element in each row), ELL (values and colind padded to a uniform row length), and SELL-P (values, colind, rowptr, where rowptr points to the first element in each block and X marks zero padding).]
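To make the layout difference concrete, here is a minimal C sketch of the two containers (field names are illustrative, not MAGMA's actual interface):

    /* Hypothetical sketch of CSR vs. SELL-P storage; names are illustrative. */
    typedef struct {
        int     n;       /* number of rows                                 */
        int     nnz;     /* number of stored nonzeros                      */
        double *values;  /* nonzero values, row by row                     */
        int    *colind;  /* column index of each stored value              */
        int    *rowptr;  /* n+1 entries; rowptr[i] = first value of row i  */
    } csr_matrix;

    typedef struct {
        int     n;          /* number of rows                              */
        int     blocksize;  /* rows per slice, e.g. 8                      */
        double *values;     /* per slice: column-major, zero-padded values */
        int    *colind;     /* matching column indices (padding -> col 0)  */
        int    *rowptr;     /* rowptr[s] = first value of slice s          */
    } sellp_matrix;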

SLIDE 4

  • Assign multiple threads to each row
  • 2-dimensional thread blocks

Sparse Matrix Vector Product (SpMV)

[Figure: 2D thread blocks TB 0 and TB 1; the partial products within each row are combined by a shared-memory reduction.]

Kreutzer et al.: A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units, SISC 36(5), 2014.
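A minimal CUDA sketch of the idea (illustrative only, not the MAGMA kernel); it assumes blockDim.x equals the SELL-P blocksize, blockDim.y is a power of two, and padding entries store the value 0 with a valid column index:

    __global__ void sellp_spmv(int n, const double *values, const int *colind,
                               const int *rowptr, const double *x, double *y)
    {
        extern __shared__ double partial[];  /* blockDim.x * blockDim.y doubles */
        int row  = blockIdx.x * blockDim.x + threadIdx.x;  /* global row        */
        int lrow = threadIdx.x;                            /* row within slice  */
        int lane = threadIdx.y;                            /* worker within row */

        double sum = 0.0;
        if (row < n) {
            int start = rowptr[blockIdx.x];
            int len   = (rowptr[blockIdx.x + 1] - start) / blockDim.x;
            /* slice is column-major: entry k of the local row sits at
               start + k * blockDim.x + lrow                            */
            for (int k = lane; k < len; k += blockDim.y) {
                int idx = start + k * blockDim.x + lrow;
                sum += values[idx] * x[colind[idx]];
            }
        }
        partial[lane * blockDim.x + lrow] = sum;
        __syncthreads();

        /* tree reduction over the row workers in shared memory */
        for (int s = blockDim.y / 2; s > 0; s >>= 1) {
            if (lane < s)
                partial[lane * blockDim.x + lrow] +=
                    partial[(lane + s) * blockDim.x + lrow];
            __syncthreads();
        }
        if (lane == 0 && row < n)
            y[row] = partial[lrow];
    }

Launched with one block per slice, dim3 block(blocksize, t), and blocksize * t * sizeof(double) bytes of shared memory.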

SLIDE 5

Sparse Matrix Vector Product with multiple Vectors (SpMM)

  • 3-dimensional thread blocks for processing multiple vectors simultaneously

Anzt et al.: Energy efficiency and performance frontiers for sparse computations on GPU supercomputers, PMAM 2015.

[Figure: the 2D thread-block scheme (TB 0, TB 1, shared-memory reduction) replicated along a third thread-block dimension, one layer per vector, so that multiple vectors are multiplied simultaneously.]
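On the host side, the third dimension can be configured roughly as follows (a hedged sketch inside a host function; sellp_spmm and the chosen block shape are illustrative, not the talk's actual code):

    /* 3D launch: x = rows per slice, y = workers per row, z = vectors. */
    dim3 block(8, 4, 4);                          /* 8 x 4 x 4 = 128 threads */
    dim3 grid((n + block.x - 1) / block.x, 1,
              (num_vecs + block.z - 1) / block.z);
    size_t shmem = block.x * block.y * block.z * sizeof(double);
    /* sellp_spmm<<<grid, block, shmem>>>(n, values, colind, rowptr,
                                          num_vecs, x, y);   (hypothetical) */

Reading each matrix entry once per thread block and reusing it for all vectors in the z-dimension is what lifts SpMM above the SpMV bandwidth bound.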

SLIDE 6

Sparse Matrix Vector Product with multiple Vectors (SpMM)

  • 3-dimensional thread blocks for processing multiple vectors simultaneously
  • Performance on NVIDIA K40, 64 vectors, DP:

Anzt et al.: Energy efficiency and performance frontiers for sparse computations on GPU supercomputers, PMAM 2015.

[Figure: double-precision SpMM performance in GFLOP/s (20 to 120) comparing cuSPARSE CSRSpMM, cuSPARSE CSRSpMM v2, and MAGMA SELL-P SpMM on audikw_1, bmw3_2, bmwcra_1, bone_010, bone_S10, cant, crankseg_2, F1, Fault_639, Hook_1498, inline_1, ldoor, pwtk, Stoc_1465, stomach, and xenon_2.]

SLIDE 7

  • Memory bandwidth in many cases the performance bottleneck.
  • Sequence of consecutive vector updates (BLAS 1) benefits from enhanced data locality.
  • Design of algorithm-specific kernels.

Kernel Fusion in Sparse Iterative Algorithms

[Figure: kernel1 and kernel2 launched separately, each thread block reading and writing (r+w) every vector element, versus one fused kernel that touches the data a single time.]
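To illustrate the principle, a minimal sketch (not code from the talk): two consecutive BLAS-1 updates fused into one kernel, so y is loaded and stored once instead of twice.

    /* Unfused: two launches, y traverses memory twice. */
    __global__ void axpy(int n, double a, const double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    __global__ void scal(int n, double b, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = b * y[i];
    }

    /* Fused: y = b * (a*x + y) in a single pass over the data. */
    __global__ void axpy_scal_fused(int n, double a, double b,
                                    const double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = b * (a * x[i] + y[i]);
    }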

SLIDE 8

  • Memory bandwidth in many cases the performance bottleneck.
  • Sequence of consecutive vector updates (BLAS 1) benefits from enhanced data locality.

Kernel Fusion in Sparse Iterative Algorithms

    // Fused version: five custom kernels per iteration, one device-to-host copy.
    while ( ( k < maxiter ) && ( res > epsilon ) ) {
        scalar_fusion_1 <<<Gs, Bs, Ms>>> ( n, rowA, colA, valA, d, z, beta, rho, gamma, vtmp );
        fusion_2 ( Gs, Bs, Ms, n, beta, rho, vtmp );
        fusion_3 <<<Gs, Bs, Ms>>> ( n, rho, d, x, z, r, vtmp );
        fusion_4 ( Gs, Bs, Ms, n, vtmp, vtmp2 );
        fusion_5 <<<Gs, Bs>>> ( n, beta, gamma, alpha, d, r, vtmp );
        cudaMemcpy( &res, beta, sizeof(float), cudaMemcpyDeviceToHost );
        res = sqrt( res );
        k++;
    } // end-while

    // Reference version: one library call per operation (SpMV plus BLAS 1).
    while ( ( k < maxiter ) && ( res > epsilon ) ) {
        Scalar_SpMV <<<Gs, Bs>>> ( n, rowA, colA, valA, d, z );
        tmp   = cublasSdot ( n, d, 1, z, 1 );
        rho   = beta / tmp;
        gamma = beta;
        cublasSaxpy ( n,  rho, d, 1, x, 1 );
        cublasSaxpy ( n, -rho, z, 1, r, 1 );
        tmp   = cublasSdot ( n, r, 1, r, 1 );
        beta  = tmp;
        alpha = beta / gamma;
        cublasSscal ( n, alpha, d, 1 );
        cublasSaxpy ( n, one, r, 1, d, 1 );
        res = sqrt( beta );
        k++;
    } // end-while

Aliaga et al.: Reformulated Conjugate Gradient for the Energy-Aware Solution of Linear Systems on GPUs, Parallel Processing (ICPP), 2013.

SLIDE 9

  • Which operations can be merged into a single kernel?

  • kernels compatible in terms of component-thread mapping
  • example classification for Jacobi-CG:

Kernel Fusion in Sparse Iterative Algorithms

Aliaga et al.: Systematic Fusion of CUDA Kernels for Iterative Sparse Linear System Solvers, Euro-Par 2015, LNCS 9233, 2015.

Operation      Formula       Input vector(s)   Output
AXPY           y = αx + y    mapped            mapped
COPY           y = x         mapped            mapped
DOT            α = ⟨x, y⟩    mapped            unmapped
Jacobi-Prec    y = M⁻¹x      mapped            mapped
SpMV (CSR)     y = Ax        unmapped          mapped
SpMV (ELL)     y = Ax        unmapped          mapped
SpMV (SELL-P)  y = Ax        unmapped          unmapped
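Why the DOT output is "unmapped": each thread owns one component of the inputs, but the scalar result comes out of a cross-thread reduction. A minimal sketch (illustrative, not the paper's code; assumes blockDim.x is a power of two, and a second pass combines the per-block sums):

    __global__ void dot_partial(int n, const double *x, const double *y,
                                double *block_sums)
    {
        extern __shared__ double tmp[];                   /* blockDim.x doubles */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tmp[threadIdx.x] = (i < n) ? x[i] * y[i] : 0.0;   /* mapped part   */
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {    /* unmapped part */
            if (threadIdx.x < s) tmp[threadIdx.x] += tmp[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) block_sums[blockIdx.x] = tmp[0];
    }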

SLIDE 10

  • Which operations can be merged into a single kernel?

  • Which kernels do we want to merge?

  • performance vs. flexibility…

Kernel Fusion in Sparse Iterative Algorithms

Operation      Formula       Input vector(s)   Output
AXPY           y = αx + y    mapped            mapped
COPY           y = x         mapped            mapped
DOT            α = ⟨x, y⟩    mapped            unmapped
Jacobi-Prec    y = M⁻¹x      mapped            mapped
SpMV (CSR)     y = Ax        unmapped          mapped
SpMV (ELL)     y = Ax        unmapped          mapped
SpMV (SELL-P)  y = Ax        unmapped          unmapped

Code Jacobi-preconditioned J-CG, J-BiCGSTAB, J-IDR, J-GMRES…? How about ILU/IC?

SLIDE 11

  • Which operations can be merged into a single kernel?

  • Which kernels do we want to merge?

  • performance vs. flexibility…

Kernel Fusion in Sparse Iterative Algorithms

Operation      Formula       Input vector(s)   Output
AXPY           y = αx + y    mapped            mapped
COPY           y = x         mapped            mapped
DOT            α = ⟨x, y⟩    mapped            unmapped
Jacobi-Prec    y = M⁻¹x      mapped            mapped
SpMV (CSR)     y = Ax        unmapped          mapped
SpMV (ELL)     y = Ax        unmapped          mapped
SpMV (SELL-P)  y = Ax        unmapped          unmapped

One stand-alone code for each SpMV kernel?

SLIDE 12

  • Which operations can be merged into a single kernel?

  • Which kernels do we want to merge?

  • performance vs. flexibility…

Kernel Fusion in Sparse Iterative Algorithms

[Figure: runtime in seconds (1 to 10) of basic JCG, fusion JCG, and Jacobi-fusion for the matrices AUD, G3, INL, and LDO.]

Matrix   Size        Nonzeros
AUD        943,695   77,651,847
G3       1,585,478    7,660,826
INL        503,712   36,816,342
LDO        952,203   46,522,475

  • Fusing the Jacobi preconditioner into the solver benefits very large and sparse matrices.
  • Smaller benefits for more complex algorithms (BiCGSTAB, CGS, QMR, IDR…)
SLIDE 13

  • How close can kernel fusion bring us to the theoretical performance bound induced by memory bandwidth?

  • Cooperation with Moritz Kreutzer, Eduardo Ponce.
  • NVIDIA K40, theoretical bandwidth: 288 GB/s…

Kernel Fusion in Sparse Iterative Algorithms

[Figure: achieved memory bandwidth b in GB/s (40 to 200) over vector lengths from 10^3 to 10^7 for SCR, TDK, and WEB; the curves level off around b = 193 GB/s.]

Matrix   Size
SCR        170,998
TDK        204,316
WEB      1,000,005
DIE      1,157,456

SLIDE 14

  • Efficiency compared to the roofline performance model P = min(P_peak, I·b) GFLOP/s, with P_peak the theoretical compute peak, I the computational intensity, and b the memory bandwidth.

Kernel Fusion in Sparse Iterative Algorithms

[Figure: efficiency in percent (10 to 100) relative to the roofline bound for IDR(1), IDR(2), IDR(4), and IDR(8) on the matrices below. IDR(s) is a general Krylov solver. Joint work with Moritz Kreutzer, Eduardo Ponce.]

Matrix   Size        Nonzeros
SCR        170,998       958,936
TDK        204,316     2,846,228
WEB      1,000,005     3,105,536
NLP      1,062,400    28,704,672
DIE      1,157,456    48,538,952
THM      1,228,045     8,580,313
AFS      1,508,065    52,672,325
MLG      1,504,002   110,879,972
G3       1,585,478     7,660,826
TRA      1,602,111    23,500,731
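For a concrete feel of the bound, a small worked example (the peak and intensity values are illustrative assumptions; only b = 193 GB/s is taken from the previous slide):

    #include <stdio.h>

    /* Roofline: attainable performance P = min(P_peak, I * b). */
    int main(void) {
        double p_peak = 1430.0; /* nominal K40 DP peak, GFLOP/s (assumed)   */
        double b      = 193.0;  /* measured bandwidth, GB/s (slide 13)      */
        double I      = 0.25;   /* flop/byte, illustrative BLAS-1-like value */
        double P      = I * b < p_peak ? I * b : p_peak;
        printf("attainable performance: %.1f GFLOP/s\n", P);  /* ~48 */
        return 0;
    }

At such low intensities the compute peak is irrelevant; the bandwidth term alone sets the bound, which is why efficiency is reported against min(P_peak, I·b) rather than against P_peak.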

SLIDE 15

  • Concurrent kernel execution to exploit unused GPU compute resources.
  • Rare in sparse linear algebra (most algorithms are composed of memory-bound operations).
  • Small benefits if datasets are too small to saturate the memory bandwidth.

Kernel Overlap in Sparse Iterative Algorithms

[Figure: performance improvement in percent (1 to 10) from concurrent kernel execution for IDR(1), IDR(2), IDR(4), and IDR(8) on the matrices below.]

Matrix   Size
SCR        170,998
TDK        204,316
WEB      1,000,005
DIE      1,157,456
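Concurrent execution is typically expressed with CUDA streams; a minimal host-side sketch (kernel_a and kernel_b are placeholders for independent kernels, not names from the talk):

    /* Independent kernels launched into separate streams can overlap
       when neither one alone saturates the GPU.                       */
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    /* kernel_a<<<grid_a, block_a, 0, s1>>>(...);   placeholder kernels, */
    /* kernel_b<<<grid_b, block_b, 0, s2>>>(...);   independent work only */
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);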

SLIDE 16

Operations not parallelizable to GPU thread concurrency:
  • sequential operations
  • unstructured, random memory access
  • incomplete factorizations (ILU/IC)
  • sparse triangular solves

Bottlenecks

SLIDE 17

Operations not parallelizable to GPU thread concurrency:
  • sequential operations
  • unstructured, random memory access
  • incomplete factorizations (ILU/IC)
  • sparse triangular solves
Don't even try – rethink the problem!

  • A different algorithm may take you to the same goal.
  • Choose algorithms with fine-grained parallelism, avoid synchronizations.

  • Sparse iterative solvers provide approximations: relax the bit-wise reproducibility criterion. Most popular example: iterative ILU (Chow et al.)

Bottlenecks

Chow et al.: Fine-grained Parallel Incomplete LU Factorization, SIAM Journal on Scientific Computing, 37, pp. C169–C193, 2015.
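The core of the Chow et al. approach is a fixed-point sweep over all entries of the sparsity pattern; a simplified dense-indexing C sketch (the paper works on sparse data structures, and the parallel version updates all entries from the previous sweep's values rather than in place):

    /* One fixed-point sweep of iterative ILU (after Chow et al. 2015).
       L and U hold an initial guess (e.g. taken from A); L has an
       implicit unit diagonal. pattern[i*n+j] marks (i,j) in S.        */
    void ilu_sweep(int n, const double *A, double *L, double *U,
                   const int *pattern)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                if (!pattern[i*n + j]) continue;
                double s = A[i*n + j];
                int m = i < j ? i : j;                 /* min(i, j) */
                for (int k = 0; k < m; k++)            /* subtract l_ik u_kj */
                    s -= L[i*n + k] * U[k*n + j];
                if (i > j) L[i*n + j] = s / U[j*n + j]; /* strict lower: l_ij */
                else       U[i*n + j] = s;              /* upper incl. diag   */
            }
    }

Every pattern entry can be updated independently, which is exactly the fine-grained parallelism a GPU needs.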

SLIDE 18

Example: sparse triangular solves in ILU preconditioning:

  • inherently sequential
  • parallelism from level-scheduling/multi-color ordering
  • unable to exploit fine-grained parallelism of GPUs

Bottlenecks

SLIDE 19

Example: sparse triangular solves in ILU preconditioning:

  • inherently sequential
  • parallelism from level-scheduling/multi-color ordering
  • unable to exploit fine-grained parallelism of GPUs
Take an unconventional approach: approximate sparse triangular solves

  • Replace forward/backward substitutions with an iterative method.
  • Only low solution accuracy is required, as LU ≈ A is typically just a rough approximation.
  • Better scalability of iterative methods.
  • Jacobi converges if the spectral radius of the iteration matrix M is smaller than 1: x_{k+1} = D⁻¹b + M x_k
  • Performance depends on SpMV.

Bottlenecks

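A minimal C sketch of the resulting approximate triangular solve (dense indexing for brevity; on the GPU the i-loop becomes one thread per row and the inner loop an SpMV-like product):

    /* Jacobi sweeps for an approximate lower-triangular solve L x = b:
       x_{k+1} = D^{-1}(b - E x_k), with D = diag(L) and E = L - D.
       Every row update is independent within a sweep. The final
       iterate is left in x.                                           */
    void jacobi_trisolve(int n, const double *L, const double *b,
                         double *x, double *xnew, int sweeps)
    {
        for (int k = 0; k < sweeps; k++) {
            for (int i = 0; i < n; i++) {      /* parallel over rows     */
                double s = b[i];
                for (int j = 0; j < i; j++)    /* strictly lower part E  */
                    s -= L[i*n + j] * x[j];
                xnew[i] = s / L[i*n + i];      /* divide by diagonal D   */
            }
            for (int i = 0; i < n; i++)        /* adopt the new iterate  */
                x[i] = xnew[i];
        }
    }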

SLIDE 20

Bottlenecks

Top-level PCG, exact IC triangular solves vs. 10 Jacobi sweeps:

Matrix             Exact IC                10 Jacobi sweeps
                   Iterations   Runtime    Iterations   Runtime
Laplace 3D 27pt      58           1.83       63           1.14
parabolic_fem       645          37.24      721           7.29
thermal2           1771         305.58     2457          67.41
G3_circuit         1625          45.60     1647          35.43

Example: sparse triangular solves in ILU preconditioning:

  • inherently sequential
  • parallelism from level-scheduling/multi-color ordering
  • unable to exploit fine-grained parallelism of GPUs
Take an unconventional approach: approximate sparse triangular solves

Anzt et al.: Iterative Sparse Triangular Solves for Preconditioning, Euro-Par 2015.

SLIDE 21

Summary

  • SpMV optimization is a central challenge in iterative sparse linear algebra.
  • Algorithms are memory bound and often benefit from kernel fusion.
  • Trade-off between performance and flexibility:
  • SpMV/Jacobi as building block enhances modularity.
  • Kernel fusion can bring performance close to the theoretical bound.
  • Concurrent kernel execution only beneficial for small problems.
  • We need unconventional approaches for bottleneck operations:
  • Iterative ILU generation (Chow et al.)
  • Iterative sparse triangular solves for ILU/IC.

This research is based on cooperation with Enrique Quintana-Ortí from the Universitat Jaume I, Edmond Chow from Georgia Tech, and Moritz Kreutzer from the University of Erlangen.