ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU

STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY
DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO
TIM DAVIS, PROFESSOR, CSE, TEXAS A&M UNIVERSITY
SPARSE MATRIX FACTORIZATION ON GPUS
Objective:
Find methods for GPU acceleration of Sparse Cholesky Factorization
Experiment using SuiteSparse 4.4.3 / CHOLMOD
Outline:
- Sparse Cholesky factorization
- Previous work / issues
- ‘Branches’ approach
- Dense block Cholesky
DIRECT SPARSE FACTORIZATION
Supernodes
Block Cholesky of a supernode (stored in compressed-column form):

\[
\begin{bmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{bmatrix}
=
\begin{bmatrix} L_{11} & \\ L_{21} & I \end{bmatrix}
\begin{bmatrix} I & \\ & A^{*}_{22} \end{bmatrix}
\begin{bmatrix} L_{11}^T & L_{21}^T \\ & I \end{bmatrix}
\]

- POTRF (dense Cholesky): \( L_{11} L_{11}^T = A_{11} \)
- TRSM (triangular solve): \( L_{11} L_{21}^T = A_{21}^T \)
- GEMM (matrix multiplication, Schur complement): \( A^{*}_{22} = A_{22} - L_{21} L_{21}^T \)
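As a concrete illustration of these three kernels, here is a minimal sketch of the dense factorization of one supernode using cuSOLVER and cuBLAS on device-resident data. The function name, argument layout, and the use of DSYRK for the symmetric Schur update are illustrative assumptions, not CHOLMOD's actual interface.

```c
#include <cublas_v2.h>
#include <cusolverDn.h>

/* Sketch: dense block Cholesky of one supernode.
 * d_A11 (n1 x n1), d_A21 (n2 x n1), d_A22 (n2 x n2) are column-major
 * device arrays; d_work/lwork come from cusolverDnDpotrf_bufferSize(). */
void factor_supernode(cusolverDnHandle_t solver, cublasHandle_t blas,
                      double *d_A11, int ld11, double *d_A21, int ld21,
                      double *d_A22, int ld22, int n1, int n2,
                      double *d_work, int lwork, int *d_info)
{
    const double one = 1.0, minus_one = -1.0;

    /* POTRF: L11 * L11^T = A11 (dense Cholesky of the diagonal block) */
    cusolverDnDpotrf(solver, CUBLAS_FILL_MODE_LOWER, n1,
                     d_A11, ld11, d_work, lwork, d_info);

    /* TRSM: solve L21 * L11^T = A21 for L21 (overwrites A21) */
    cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                n2, n1, &one, d_A11, ld11, d_A21, ld21);

    /* Schur complement: A*22 = A22 - L21 * L21^T (GEMM/SYRK) */
    cublasDsyrk(blas, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                n2, n1, &minus_one, d_A21, ld21, &one, d_A22, ld22);
}
```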
DIRECT SPARSE FACTORIZATION

‘Left-looking supernodal’:
- Apply block Cholesky to supernodes, following the elimination tree (control flow sketched below)
- Bulk of work is in assembling supernodes (wide range of descendant sizes)

[Figure: elimination tree over supernodes 1-7; factorization proceeds up the tree as a sequence of POTRF, TRSM, and GEMM calls per supernode, introducing fill-in.]
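For reference, the left-looking control flow can be sketched as host-side C pseudocode; first_descendant, next_descendant, descendant_update, and factor_supernode_dense are hypothetical placeholders for the corresponding CHOLMOD machinery.

```c
/* Host-side sketch of the left-looking supernodal loop. Supernodes are
 * visited in elimination-tree postorder; each one first assembles
 * updates from its descendants, then is factored densely. All names
 * below are hypothetical placeholders. */
void left_looking_factorize(int n_supernodes)
{
    for (int s = 0; s < n_supernodes; s++) {                /* postorder */
        for (int d = first_descendant(s); d >= 0; d = next_descendant(d)) {
            /* GEMM/SYRK contribution of descendant d, scatter-added
             * into supernode s (the "assembly" step) */
            descendant_update(s, d);
        }
        /* POTRF on the diagonal block, TRSM on the rows below it */
        factor_supernode_dense(s);
    }
}
```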
DIRECT SPARSE FACTORIZATION
- Lots of ‘small’ math
- Irregular access patterns
- Larger matrices -> more dense math
- Greater connectivity -> more dense math
- Factors can be large (> 128 GB)
PREVIOUS WORK
Just send large BLAS-3 calls to the GPU
- Works well for large, dense matrices
- Not so good for:
  - small matrices
  - large matrices with low connectivity (shells / beams in FEA)

Find methods for further GPU acceleration of sparse factorization
PREVIOUS WORK
Approach:
- Send appropriately-sized BLAS calls to the GPU
- ‘Hide’ PCIe communication
- Assemble supernodes on the GPU
- Hybrid computing

Supernodes above the row/column threshold (ndrow >= 256, ndcol >= 32) are sent to the GPU; the rest stay on the CPU, in order of decreasing cost to assemble (supernode score).

[Chart: GFlops/s on the Florida Sparse Matrix Collection, CPU vs. CPU + GPU; roughly 1.5x speedup. Annotations: “why not higher?”, “why so low?”]

2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off); SuiteSparse (CHOLMOD) 4.4.3; http://faculty.cse.tamu.edu/davis/suitesparse.html
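For clarity, the size test implied by those thresholds might look like the following; the function name is illustrative and the exact test used in CHOLMOD may differ.

```c
/* Sketch of the per-supernode offload test in the CPU+GPU path: only
 * supernodes large enough in both dimensions justify the PCIe transfer
 * and kernel-launch cost. Thresholds are the ones quoted above. */
int send_supernode_to_gpu(int ndrow, int ndcol)
{
    return (ndrow >= 256) && (ndcol >= 32);
}
```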
ISSUES
PCIe communication
- Limits which BLAS operations can be accelerated on the GPU

Small BLAS
- Low occupancy
- Launch overhead
- Most BLAS calls don’t get sent to the GPU

Seek methods which better accelerate factorization of small / minimally-connected matrices

[Chart: audikw_1.mtx, % of BLAS work performed on the CPU]
PROPOSED SOLUTION
Factor branches on the GPU
- Use previous methods for the root
- No use of the CPU
- Eliminates PCIe communication
- Requires POTRF, TRSM & GEMM on the GPU

Batch and stream BLAS operations
- Within levels
- Batching amortizes launch overhead
- Streaming improves occupancy

No size restriction; maps well to multi-GPU / hybrid computing (see the sketch below)

[Figure: elimination tree split into branches 1-4, each with levels 0, 1, 2, ...; supernode data 1..n resides partly on the device and partly on the host.]
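A sketch of the branch/level scheduling described above, assuming the supernodes of each branch have already been grouped by elimination-tree level. Branch, Level, and the batched_* wrappers are hypothetical names; the wrappers stand for the loop-over-cuBLAS pattern described on the next slide.

```c
#include <cuda_runtime.h>

/* Hypothetical per-level lists of supernode operations for one branch. */
typedef struct {
    int n_ops;   /* descriptors for this level's POTRF/TRSM/GEMM calls */
} Level;
typedef struct { Level *levels; int n_levels; } Branch;

/* Hypothetical batched wrappers: one cuBLAS/cuSOLVER call per list
 * entry, round-robined over a pool of CUDA streams (next slide). */
void batched_gemm (Level *L);
void batched_potrf(Level *L);
void batched_trsm (Level *L);

/* Sketch: factor one branch entirely on the GPU, level by level.
 * All supernode data for the branch is device-resident, so no PCIe
 * traffic occurs while the branch is being factored. */
void factor_branch(Branch *b)
{
    for (int lev = 0; lev < b->n_levels; lev++) {
        Level *L = &b->levels[lev];
        batched_gemm (L);          /* assemble updates from descendants */
        batched_potrf(L);          /* factor diagonal blocks            */
        batched_trsm (L);          /* solve for off-diagonal blocks     */
        cudaDeviceSynchronize();   /* dependencies cross level bounds   */
    }
}
```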
BATCHED / STREAMED BLAS
- Batch all BLAS calls to amortize kernel launch latency
- Stream multiple batches to increase occupancy
- Simply wrap the cuBLAS subroutine with a batch loop (sketched below)
- DGEMM w/ m,n,k=16 -> 40 GF

[Timeline figure: DGEMM example, m,n,k=16; host <-> device transfers vs. kernel time per stream; 100 Mflops : 500 Mflops, batched: 1.2 Gflops, streamed: 4.8 Gflops]
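The slide describes the batched/streamed routines as simply wrapping the cuBLAS subroutine with a batch loop and cycling the calls over a pool of CUDA streams. A minimal sketch of that pattern for the Schur-complement DGEMMs is below; the GemmArgs descriptor and the fixed scalars are illustrative assumptions, not the actual CHOLMOD wrappers.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Illustrative descriptor for one small DGEMM in the batch. */
typedef struct {
    int m, n, k;
    const double *A, *B;   /* device pointers */
    double *C;             /* device pointer  */
    int lda, ldb, ldc;
} GemmArgs;

/* Sketch: batch many small DGEMMs (amortizes launch overhead) and
 * round-robin them over a pool of CUDA streams (raises occupancy).
 * Scalars are fixed for the Schur-style update C = C - A * B^T. */
void batched_streamed_dgemm(cublasHandle_t handle,
                            cudaStream_t *streams, int n_streams,
                            const GemmArgs *ops, int n_ops)
{
    const double minus_one = -1.0, one = 1.0;
    for (int i = 0; i < n_ops; i++) {
        const GemmArgs *g = &ops[i];
        cublasSetStream(handle, streams[i % n_streams]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                    g->m, g->n, g->k, &minus_one,
                    g->A, g->lda, g->B, g->ldb,
                    &one, g->C, g->ldc);
    }
    for (int s = 0; s < n_streams; s++)
        cudaStreamSynchronize(streams[s]);   /* wait for the whole batch */
}
```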
BATCHED / STREAMED DGEMM
- Square DGEMM, 64 streams/threads
- Batched / streamed cuBLAS performance matches MKL for small sizes
- Created by wrapping existing, non-batched routines and passing lists of arguments

[Chart: Gflop/s vs. DGEMM m,n,k; curves for GPU batched/streamed, GPU streamed, and CPU]

2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off)
PLENTY OF PARALLELISM
Lower levels: many supernodes, few descendants
Upper levels: few supernodes, many descendants

[Chart: audikw_1.mtx, number of supernodes and of GEMM + SYRK ops per elimination-tree level]
BRANCHES
Matrix       # branches   # levels   # supernodes      # root levels   # root supernodes
Fault_639        2         18-19     14931 - 15794          1                 1
nd24k            2         11          302 - 325            1                 1
inline_1         4         16-17      3909 - 10633          1                 1
Emilia_923       4         17-18     10314 - 11570          3                 4
boneS10          4         18-23      7045 - 26182          1                 1
ldoor            3         19-20     17413 - 35704          1                 1
bone010          6         16-20      1957 - 23610          1                 1
Hook_1498        9          1-18         1 - 33608          3                 5
Geo_1438         8         17-18      8102 - 9335           5                 9
Serena          60         10-17       189 - 4910          10                60
audikw_1         4         17-19      5631 - 22300          1                 1
Flan_1564        8         15-17      3937 - 16309          2                 2
CHOLMOD RESULTS
- 1.38x average speedup vs. previous CPU+GPU
- 2x average speedup vs. CPU
- Poorly performing matrices see the greatest speedup

[Chart: GFlop/s on the Florida Sparse Matrix Collection; CPU, CPU + GPU, and GPU Branches, CHOLMOD 4.4.3]

2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off); http://faculty.cse.tamu.edu/davis/suitesparse.html
PCIE DEPENDENCE
PCIe gen3 -> gen1: 12 GB/s -> 3 GB/s (75% loss of bandwidth)
- CPU+GPU: 23% performance loss
- Branches: 17% performance loss

[Chart: Gflop/s on the Florida Sparse Matrix Collection; 4.4.3 CPU+GPU and GPU Branches, each at PCIe gen1 and gen3]

1 x i7 3930K + K40 (max boost, ECC=on)
SHELL MODEL PERFORMANCE
[Chart: numerical factorization rate (GF/s) vs. millions of degrees of freedom; 4.4.3 CPU, 4.4.3 CPU+GPU, and Branches 1xK40]

PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd
- 506,082 supernodes
- 640 branches (114 - 1,730 supernodes each, 8-20 levels)
- Root branch: 49 levels, 637 supernodes

2 socket x 16 core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2xK40 (ECC=ON, full boost)
SHELL MODEL PERFORMANCE
‘Branches’ algorithm is well-suited for multi-GPU
- We’ve ported the previous algorithm to multi-GPU (see the sketch below)
- 4 x K40: 1.5x overall speedup, 3.1x speedup on the branches
[Timeline figure: host <-> device transfers and compute kernels, 1 x K40 vs. 4 x K40]
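Since branches are independent subtrees, a straightforward multi-GPU mapping is to give each device its own set of branches. Below is a minimal sketch with one OpenMP host thread per GPU; the round-robin assignment and the factor_branch helper from the earlier sketch are assumptions, not necessarily the scheduling used here.

```c
#include <cuda_runtime.h>
#include <omp.h>

typedef struct Branch Branch;        /* as defined in the earlier sketch */
void factor_branch(Branch *b);       /* as defined in the earlier sketch */

/* Sketch: distribute independent branches over the available GPUs.
 * One OpenMP host thread is bound to each device; the shared root
 * branch is factored afterwards (single GPU or hybrid CPU+GPU path). */
void factor_branches_multi_gpu(Branch **branches, int n_branches, int n_gpus)
{
    #pragma omp parallel num_threads(n_gpus)
    {
        int gpu = omp_get_thread_num();
        cudaSetDevice(gpu);                         /* bind thread to GPU  */
        for (int b = gpu; b < n_branches; b += n_gpus)
            factor_branch(branches[b]);             /* round-robin mapping */
    }
}
```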
SHELL MODEL PERFORMANCE
[Chart: numerical factorization rate (GF/s) vs. millions of degrees of freedom; 4.4.3 CPU, 4.4.3 CPU+GPU, Branches 1xK40, Branches 2xK40, Branches 4xK40, and projected 2xK40 / 4xK40 curves assuming 87.5% parallel efficiency]

PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd
2 socket x 16 core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2xK40 (ECC=ON, full boost)
CONCLUSIONS
- Factoring ‘branches’ on the GPU avoids the PCIe bottleneck
- Batching and streaming permits higher performance on small matrices
- Universally beneficial
- Aspects apply to other factorization methods

Future work:
- Improved performance of batched routines
- Support hybrid computing
- Complete multi-GPU support
RELATED WORK
S5232 - GPU Acceleration of WSMP (Watson Sparse Matrix Package)
Natalia Gimelshein, Anshul Gupta
S5316 - DAG-Scheduled Linear Algebra Using Template-Based Building Blocks
Jonathan Hogg
S5476 - Energy Efficient, High-Performance Solvers through Small Dense Matrix Computations on GPUs
Azzam Haidar, Stanimire Tomov
S5424 - Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement
Tim Davis
S5237 - Jacobi-Davidson Eigensolver in Cusolver Library
Lung-Sheng Chien