Feeding of the Thousands – Leveraging the GPU's Computing Power for Sparse Linear Algebra
SPPEXA Annual Meeting 2016, January 25th, 2016, Garching, Germany Hartwig Anzt
Sparse Linear Algebra on GPUs: inherently parallel operations
Slide 2
Slide 3
Kreutzer et al.: A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units, SISC 36(5), 2014.
[Figure: an example sparse matrix and its storage in the CSR, ELL, and SELL-P formats. CSR stores three arrays (values, colind, rowptr); rowptr points to the first element of each row. ELL pads every row to the length of the longest row (padding marked X). SELL-P sorts rows into blocks and pads only within each block, so its rowptr points to the first element of each block and far less padding is stored than in ELL.]
Slide 4
Slide 5
Anzt et al.: Energy efficiency and performance frontiers for sparse computations on GPU supercomputers, PMAM 2015.
[Animation: the SELL-P SpMM kernel. 2D thread blocks (TB 0, TB 1) accumulate partial products, followed by a shared-memory reduction within each thread block.]
Slide 6
[Figure: SpMM performance in GFLOP/s (0 to 120) for cuSPARSE CSRSpMM, cuSPARSE CSRSpMM v2, and MAGMA SELL-P SpMM on the test matrices audikw_1, bmw3_2, bmwcra_1, bone_010, bone_S10, cant, crankseg_2, F1, Fault_639, Hook_1498, inline_1, ldoor, pwtk, Stoc_1465, stomach, xenon_2.]
Slide 7
[Animation: each separately launched kernel performs its own reads and writes (r+w) of the vector data per thread block; fusing kernels keeps intermediate results in registers and shared memory and avoids the repeated memory traffic.]
Slide 8
// Reformulated CG with fused kernels:
while ( ( k < maxiter ) && ( res > epsilon ) ) {
    scalar_fusion_1<<< Gs, Bs, Ms >>>( n, rowA, colA, valA, d, z, beta, rho, gamma, vtmp );
    fusion_2( Gs, Bs, Ms, n, beta, rho, vtmp );
    fusion_3<<< Gs, Bs, Ms >>>( n, rho, d, x, z, r, vtmp );
    fusion_4( Gs, Bs, Ms, n, vtmp, vtmp2 );
    fusion_5<<< Gs, Bs >>>( n, beta, gamma, alpha, d, r, vtmp );
    cudaMemcpy( &res, beta, sizeof(float), cudaMemcpyDeviceToHost );
    res = sqrt( res );
    k++;
} // end-while

// Standard CG built from individual kernels and cuBLAS calls:
while ( ( k < maxiter ) && ( res > epsilon ) ) {
    Scalar_SpMV<<< Gs, Bs >>>( n, rowA, colA, valA, d, z );
    tmp   = cublasSdot( n, d, 1, z, 1 );
    rho   = beta / tmp;
    gamma = beta;
    cublasSaxpy( n,  rho, d, 1, x, 1 );
    cublasSaxpy( n, -rho, z, 1, r, 1 );
    tmp   = cublasSdot( n, r, 1, r, 1 );
    beta  = tmp;
    alpha = beta / gamma;
    cublasSscal( n, alpha, d, 1 );
    cublasSaxpy( n, one, r, 1, d, 1 );
    res = sqrt( beta );
    k++;
} // end-while
Aliaga et al.: Reformulated Conjugate Gradient for the Energy-Aware Solution of Linear Systems on GPUs, Parallel Processing (ICPP), 2013.
Slide 9
Aliaga et al.: Systematic Fusion of CUDA Kernels for Iterative Sparse Linear System Solvers, Euro-Par 2015, LNCS 9233, 2015.
Slide 10
Slide 11
Slide 12
[Figure: runtime in seconds (1 to 10) of the basic JCG, fusion JCG, and Jacobi-fusion variants on the matrices AUD, G3, INL, and LDO.]
Slide 13
[Figure: achieved memory bandwidth b in GB/s (40 to 200) over vector length (10^3 to 10^7) for the matrices SCR, TDK, WEB; the bandwidth saturates around b = 193 GB/s.]
Slide 14
[Figure: bandwidth efficiency in percent (10 to 100) of IDR(1), IDR(2), IDR(4), and IDR(8) on the test matrices listed below.]

Matrix      Size        Nonzeros
SCR         170,998     958,936
TDK         204,316     2,846,228
WEB         1,000,005   3,105,536
NLP         1,062,400   28,704,672
DIE         1,157,456   48,538,952
THM         1,228,045   8,580,313
AFS         1,508,065   52,672,325
MLG         1,504,002   110,879,972
G3          1,585,478   7,660,826
TRA         1,602,111   23,500,731
Slide 15
[Figure: performance improvement in percent (1 to 10) of IDR(1), IDR(2), IDR(4), and IDR(8) on the matrices SCR, TDK, WEB, and DIE.]
Slide 16
Slide 17
Chow et al.: Fine-grained Parallel Incomplete LU Factorization, SIAM Journal on Scientific Computing, 37, pp. C169-C193 (2015).
Slide 18
Slide 19
Slide 20
Anzt et al.: Iterative Sparse Triangular Solves for Preconditioning, Euro-Par 2015.
Slide 21
This research is based on a cooperation with Enrique Quintana-Ortí from the Universitat Jaume I, Edmond Chow from Georgia Tech, and Moritz Kreutzer from the University of Erlangen.