ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU

STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY
DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO
TIM DAVIS, PROFESSOR, CSE, TEXAS A&M UNIVERSITY
SPARSE MATRIX FACTORIZATION ON GPUS
Objective:
Find methods for GPU acceleration of Sparse Cholesky Factorization
Experiment using SuiteSparse 4.4.3 / CHOLMOD
Outline:
- Sparse Cholesky factorization
- Previous work / issues
- ‘Branches’ approach
- Dense block Cholesky
DIRECT SPARSE FACTORIZATION
Supernodes
Block Cholesky of a supernode (stored in compressed-column form):

\[
\begin{bmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{bmatrix}
=
\begin{bmatrix} L_{11} & \\ L_{21} & I \end{bmatrix}
\begin{bmatrix} I & \\ & A^{*}_{22} \end{bmatrix}
\begin{bmatrix} L_{11}^T & L_{21}^T \\ & I \end{bmatrix}
\]

- POTRF (dense Cholesky): \( L_{11} L_{11}^T = A_{11} \)
- TRSM (triangular solve): \( L_{11} L_{21}^T = A_{21}^T \)
- GEMM (matrix multiplication, Schur complement): \( A^{*}_{22} = A_{22} - L_{21} L_{21}^T \)
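As a concrete illustration of these three kernels, here is a minimal sketch of the dense factorization of one supernode using cuSOLVER and cuBLAS on device-resident data. The function name, argument layout, and the use of DSYRK for the symmetric Schur update are illustrative assumptions, not CHOLMOD's actual interface.

```c
#include <cublas_v2.h>
#include <cusolverDn.h>

/* Sketch: dense block Cholesky of one supernode.
 * d_A11 (n1 x n1), d_A21 (n2 x n1), d_A22 (n2 x n2) are column-major
 * device arrays; d_work/lwork come from cusolverDnDpotrf_bufferSize(). */
void factor_supernode(cusolverDnHandle_t solver, cublasHandle_t blas,
                      double *d_A11, int ld11, double *d_A21, int ld21,
                      double *d_A22, int ld22, int n1, int n2,
                      double *d_work, int lwork, int *d_info)
{
    const double one = 1.0, minus_one = -1.0;

    /* POTRF: L11 * L11^T = A11 (dense Cholesky of the diagonal block) */
    cusolverDnDpotrf(solver, CUBLAS_FILL_MODE_LOWER, n1,
                     d_A11, ld11, d_work, lwork, d_info);

    /* TRSM: solve L21 * L11^T = A21 for L21 (overwrites A21) */
    cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                n2, n1, &one, d_A11, ld11, d_A21, ld21);

    /* Schur complement: A*22 = A22 - L21 * L21^T (GEMM/SYRK) */
    cublasDsyrk(blas, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                n2, n1, &minus_one, d_A21, ld21, &one, d_A22, ld22);
}
```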
DIRECT SPARSE FACTORIZATION

‘Left-looking supernodal’:
- Apply block Cholesky to supernodes, following the elimination tree (control flow sketched below)
- Bulk of work is in assembling supernodes (wide range of descendant sizes)

[Figure: elimination tree over supernodes 1-7; factorization proceeds up the tree as a sequence of POTRF, TRSM, and GEMM calls per supernode, introducing fill-in.]
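For reference, the left-looking control flow can be sketched as host-side C pseudocode; first_descendant, next_descendant, descendant_update, and factor_supernode_dense are hypothetical placeholders for the corresponding CHOLMOD machinery.

```c
/* Host-side sketch of the left-looking supernodal loop. Supernodes are
 * visited in elimination-tree postorder; each one first assembles
 * updates from its descendants, then is factored densely. All names
 * below are hypothetical placeholders. */
void left_looking_factorize(int n_supernodes)
{
    for (int s = 0; s < n_supernodes; s++) {                /* postorder */
        for (int d = first_descendant(s); d >= 0; d = next_descendant(d)) {
            /* GEMM/SYRK contribution of descendant d, scatter-added
             * into supernode s (the "assembly" step) */
            descendant_update(s, d);
        }
        /* POTRF on the diagonal block, TRSM on the rows below it */
        factor_supernode_dense(s);
    }
}
```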
DIRECT SPARSE FACTORIZATION
- Lots of ‘small’ math
- Irregular access patterns
- Larger matrices -> more dense math
- Greater connectivity -> more dense math
- Factors can be large (> 128 GB)
PREVIOUS WORK
Just send large BLAS-3 calls to the GPU
- Works well for large, dense matrices
- Not so good for:
  - small matrices
  - large matrices with low connectivity (shells / beams in FEA)

Find methods for further GPU acceleration of sparse factorization
PREVIOUS WORK
Approach:
- Send appropriately-sized BLAS calls to the GPU
- ‘Hide’ PCIe communication
- Assemble supernodes on the GPU
- Hybrid computing

Supernodes above the row/column threshold (ndrow >= 256, ndcol >= 32) are sent to the GPU; the rest stay on the CPU, in order of decreasing cost to assemble (supernode score).

[Chart: GFlops/s on the Florida Sparse Matrix Collection, CPU vs. CPU + GPU; roughly 1.5x speedup. Annotations: “why not higher?”, “why so low?”]

2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off); SuiteSparse (CHOLMOD) 4.4.3; http://faculty.cse.tamu.edu/davis/suitesparse.html
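For clarity, the size test implied by those thresholds might look like the following; the function name is illustrative and the exact test used in CHOLMOD may differ.

```c
/* Sketch of the per-supernode offload test in the CPU+GPU path: only
 * supernodes large enough in both dimensions justify the PCIe transfer
 * and kernel-launch cost. Thresholds are the ones quoted above. */
int send_supernode_to_gpu(int ndrow, int ndcol)
{
    return (ndrow >= 256) && (ndcol >= 32);
}
```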
ISSUES
PCIe communication
- Limits which BLAS operations can be accelerated on the GPU

Small BLAS
- Low occupancy
- Launch overhead
- Most BLAS calls don’t get sent to the GPU

Seek methods which better accelerate factorization of small / minimally-connected matrices

[Chart: audikw_1.mtx, % of BLAS work performed on the CPU]
PROPOSED SOLUTION
Factor branches on the GPU
- Use previous methods for the root
- No use of the CPU
- Eliminates PCIe communication
- Requires POTRF, TRSM & GEMM on the GPU

Batch and stream BLAS operations
- Within levels
- Batching amortizes launch overhead
- Streaming improves occupancy

No size restriction; maps well to multi-GPU / hybrid computing (see the sketch below)

[Figure: elimination tree split into branches 1-4, each with levels 0, 1, 2, ...; supernode data 1..n resides partly on the device and partly on the host.]
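A sketch of the branch/level scheduling described above, assuming the supernodes of each branch have already been grouped by elimination-tree level. Branch, Level, and the batched_* wrappers are hypothetical names; the wrappers stand for the loop-over-cuBLAS pattern described on the next slide.

```c
#include <cuda_runtime.h>

/* Hypothetical per-level lists of supernode operations for one branch. */
typedef struct {
    int n_ops;   /* descriptors for this level's POTRF/TRSM/GEMM calls */
} Level;
typedef struct { Level *levels; int n_levels; } Branch;

/* Hypothetical batched wrappers: one cuBLAS/cuSOLVER call per list
 * entry, round-robined over a pool of CUDA streams (next slide). */
void batched_gemm (Level *L);
void batched_potrf(Level *L);
void batched_trsm (Level *L);

/* Sketch: factor one branch entirely on the GPU, level by level.
 * All supernode data for the branch is device-resident, so no PCIe
 * traffic occurs while the branch is being factored. */
void factor_branch(Branch *b)
{
    for (int lev = 0; lev < b->n_levels; lev++) {
        Level *L = &b->levels[lev];
        batched_gemm (L);          /* assemble updates from descendants */
        batched_potrf(L);          /* factor diagonal blocks            */
        batched_trsm (L);          /* solve for off-diagonal blocks     */
        cudaDeviceSynchronize();   /* dependencies cross level bounds   */
    }
}
```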
BATCHED / STREAMED BLAS
- Batch all BLAS calls to amortize kernel launch latency
- Stream multiple batches to increase occupancy
- Simply wrap the cuBLAS subroutine with a batch loop (sketched below)
- DGEMM w/ m,n,k=16 -> 40 GF

[Timeline figure: DGEMM example, m,n,k=16; host <-> device transfers vs. kernel time per stream; 100 Mflops : 500 Mflops, batched: 1.2 Gflops, streamed: 4.8 Gflops]
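The slide describes the batched/streamed routines as simply wrapping the cuBLAS subroutine with a batch loop and cycling the calls over a pool of CUDA streams. A minimal sketch of that pattern for the Schur-complement DGEMMs is below; the GemmArgs descriptor and the fixed scalars are illustrative assumptions, not the actual CHOLMOD wrappers.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Illustrative descriptor for one small DGEMM in the batch. */
typedef struct {
    int m, n, k;
    const double *A, *B;   /* device pointers */
    double *C;             /* device pointer  */
    int lda, ldb, ldc;
} GemmArgs;

/* Sketch: batch many small DGEMMs (amortizes launch overhead) and
 * round-robin them over a pool of CUDA streams (raises occupancy).
 * Scalars are fixed for the Schur-style update C = C - A * B^T. */
void batched_streamed_dgemm(cublasHandle_t handle,
                            cudaStream_t *streams, int n_streams,
                            const GemmArgs *ops, int n_ops)
{
    const double minus_one = -1.0, one = 1.0;
    for (int i = 0; i < n_ops; i++) {
        const GemmArgs *g = &ops[i];
        cublasSetStream(handle, streams[i % n_streams]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                    g->m, g->n, g->k, &minus_one,
                    g->A, g->lda, g->B, g->ldb,
                    &one, g->C, g->ldc);
    }
    for (int s = 0; s < n_streams; s++)
        cudaStreamSynchronize(streams[s]);   /* wait for the whole batch */
}
```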
BATCHED / STREAMED DGEMM
- Square DGEMM, 64 streams/threads
- Batched / streamed cuBLAS performance matches MKL for small sizes
- Created by wrapping existing, non-batched routines and passing lists of arguments

[Chart: Gflop/s vs. DGEMM m,n,k; curves for GPU batched/streamed, GPU streamed, and CPU]

2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off)
PLENTY OF PARALLELISM
Lower levels: many supernodes, few descendants
Upper levels: few supernodes, many descendants

[Chart: audikw_1.mtx, number of supernodes and of GEMM + SYRK ops per elimination-tree level]
BRANCHES
Matrix       # branches   # levels   # supernodes      # root levels   # root supernodes
Fault_639        2         18-19     14931 - 15794          1                 1
nd24k            2         11          302 - 325            1                 1
inline_1         4         16-17      3909 - 10633          1                 1
Emilia_923       4         17-18     10314 - 11570          3                 4
boneS10          4         18-23      7045 - 26182          1                 1
ldoor            3         19-20     17413 - 35704          1                 1
bone010          6         16-20      1957 - 23610          1                 1
Hook_1498        9          1-18         1 - 33608          3                 5
Geo_1438         8         17-18      8102 - 9335           5                 9
Serena          60         10-17       189 - 4910          10                60
audikw_1         4         17-19      5631 - 22300          1                 1
Flan_1564        8         15-17      3937 - 16309          2                 2
CHOLMOD RESULTS
- 1.38x average speedup vs. previous CPU+GPU
- 2x average speedup vs. CPU
- Poorly performing matrices see the greatest speedup

[Chart: GFlop/s on the Florida Sparse Matrix Collection; CPU, CPU + GPU, and GPU Branches, CHOLMOD 4.4.3]

2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off); http://faculty.cse.tamu.edu/davis/suitesparse.html
PCIE DEPENDENCE
PCIe gen3 -> gen1: 12 GB/s -> 3 GB/s (75% loss of bandwidth)
- CPU+GPU: 23% performance loss
- Branches: 17% performance loss

[Chart: Gflop/s on the Florida Sparse Matrix Collection; 4.4.3 CPU+GPU and GPU Branches, each at PCIe gen1 and gen3]

1 x i7 3930K + K40 (max boost, ECC=on)
SHELL MODEL PERFORMANCE
[Chart: numerical factorization rate (GF/s) vs. millions of degrees of freedom; 4.4.3 CPU, 4.4.3 CPU+GPU, and Branches 1xK40]

PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd
- 506,082 supernodes
- 640 branches (114 - 1,730 supernodes each, 8-20 levels)
- Root branch: 49 levels, 637 supernodes

2 socket x 16 core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2xK40 (ECC=ON, full boost)
SHELL MODEL PERFORMANCE
‘Branches’ algorithm is well-suited for multi-GPU
- We’ve ported the previous algorithm to multi-GPU (see the sketch below)
- 4 x K40: 1.5x overall speedup, 3.1x speedup on the branches
[Timeline figure: host <-> device transfers and compute kernels, 1 x K40 vs. 4 x K40]
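Since branches are independent subtrees, a straightforward multi-GPU mapping is to give each device its own set of branches. Below is a minimal sketch with one OpenMP host thread per GPU; the round-robin assignment and the factor_branch helper from the earlier sketch are assumptions, not necessarily the scheduling used here.

```c
#include <cuda_runtime.h>
#include <omp.h>

typedef struct Branch Branch;        /* as defined in the earlier sketch */
void factor_branch(Branch *b);       /* as defined in the earlier sketch */

/* Sketch: distribute independent branches over the available GPUs.
 * One OpenMP host thread is bound to each device; the shared root
 * branch is factored afterwards (single GPU or hybrid CPU+GPU path). */
void factor_branches_multi_gpu(Branch **branches, int n_branches, int n_gpus)
{
    #pragma omp parallel num_threads(n_gpus)
    {
        int gpu = omp_get_thread_num();
        cudaSetDevice(gpu);                         /* bind thread to GPU  */
        for (int b = gpu; b < n_branches; b += n_gpus)
            factor_branch(branches[b]);             /* round-robin mapping */
    }
}
```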
SHELL MODEL PERFORMANCE
[Chart: numerical factorization rate (GF/s) vs. millions of degrees of freedom; 4.4.3 CPU, 4.4.3 CPU+GPU, Branches 1xK40, Branches 2xK40, Branches 4xK40, and projected 2xK40 / 4xK40 curves assuming 87.5% parallel efficiency]

PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd
2 socket x 16 core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2xK40 (ECC=ON, full boost)
CONCLUSIONS
- Factoring ‘branches’ on the GPU avoids the PCIe bottleneck
- Batching and streaming permits higher performance on small matrices
- Universally beneficial
- Aspects apply to other factorization methods

Future work:
- Improved performance of batched routines
- Support hybrid computing
- Complete multi-GPU support
RELATED WORK
S5232 - GPU Acceleration of WSMP (Watson Sparse Matrix Package)
Natalia Gimelshein, Anshul Gupta
S5316 - DAG-Scheduled Linear Algebra Using Template-Based Building Blocks
Jonathan Hogg
S5476 - Energy Efficient, High-Performance Solvers through Small Dense Matrix Computations on GPUs
Azzam Haidar, Stanimire Tomov
S5424 - Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement
Tim Davis
S5237 - Jacobi-Davidson Eigensolver in Cusolver Library
Lung-Sheng Chien