Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks
Yusuke Nagasaka†, Akira Nukada†, Ryosuke Kojima‡, Satoshi Matsuoka†§
†Tokyo Institute of Technology ‡Kyoto University §RIKEN Center for Computational Science
– Chemical compounds and proteins are expressed as graphs
– Knowledge graphs
(Figure quoted from https://arxiv.org/pdf/1711.05859.pdf)
!" = $
%
&",% (% ) Feature Adjacency matrix
1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 0 1
A= Y = AXW 1 2 3 4 5
Input (Graph structure and features)
MatMul and SpMM
GraphConvolution (Y, A, X, W, bias)
  for b ← 0 to batchsize do
    for ch ← 0 to channel do
      U ← MatMul (X[b], W[ch])
      B ← Add (bias[ch], U)
      C[ch] ← SpMM (A[b][ch], B)
    Y[b] ← ElementWiseAdd (C)
Executes many SpMM kernels; each SpMM is independent of the others.
– Key kernels: GEMM and Sparse-Dense Matrix Multiplication (SpMM)
■ Launch overhead of repeated CUDA kernels is not negligible
– Not clear how to develop a batched routine for (small) sparse matrices
■ Load balance issue
  – Number of nodes / sparsity varies across input graphs
■ Occupancy issue
  – Requires architecture-specific kernels
– Sub-Warp-Assigned (SWA) SpMM for small matrices
■ Supports both SparseTensor and CSR formats
– Batched SpMM
■ High occupancy and utilization of fast shared memory
■ Reduces the overhead of CUDA kernel launches
■ Routines for both SparseTensor and CSR
– Executes tens or hundreds of small SpMMs in a single kernel
– Significant performance boost
■ Up to 9.27x speedup over non-batched SpMM approaches
■ Up to 1.59x speedup for training and 1.37x for inference on a GCN application
– Store only non-zero elements, reducing memory usage and computation
– The format should suit the architecture and the given matrices
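As a reference for the kernels that follow, here is a minimal C/CUDA sketch of the two storage formats; the field names (ids, rpt, colids, values) mirror the deck's pseudocode, but the struct layout itself is an illustrative assumption.

  // SparseTensor (COO): an array of {row, column} id pairs plus values.
  struct SparseTensor {
      int   *ids;     // 2 * nnz entries: {row_0, col_0, row_1, col_1, ...}
      float *values;  // nnz non-zero values
      int    nnz;     // number of non-zero elements
  };

  // CSR: row pointers delimit each row's non-zeros in colids/values.
  struct CsrMatrix {
      int   *rpt;     // mA + 1 row pointers
      int   *colids;  // nnz column ids
      float *values;  // nnz non-zero values
      int    mA;      // number of rows
  };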
– Array of {Row, Column} ids (SparseTensor / COO)
– 1 CUDA thread per mul-add operation: nnz * nDense threads in total
– Load-balanced workload
– Additions are done by atomicAdd
■ Expensive on global memory
(Figure: SparseTensor SpMM — each thread handles one non-zero element of A and one column of the dense matrix (mDense × nDense); overlapping results are accumulated into the output matrix with atomicAdd.)
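A minimal CUDA sketch of this SparseTensor-style SpMM, not the authors' code: one thread per (non-zero, dense-column) pair, with atomicAdd resolving collisions on the output. The kernel and parameter names are assumptions, and C is assumed to be zero-initialized.

  // C (mA x nB) += A (sparse, COO) * B (dense, kA x nB), row-major storage.
  __global__ void cooSpmmNaive(float *C, const int *ids, const float *values,
                               int nnz, const float *B, int nB)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x; // one thread per mul-add
      if (i >= nnz * nB) return;
      int nzid = i / nB;            // which non-zero element of A
      int j    = i % nB;            // which column of B
      int rid  = ids[nzid * 2];     // row id of the non-zero
      int cid  = ids[nzid * 2 + 1]; // column id of the non-zero
      // Several threads may target the same C[rid][j], so add atomically.
      atomicAdd(&C[rid * nB + j], values[nzid] * B[cid * nB + j]);
  }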
– subWarp is set to a power of two
■ Division and modulo are executed as low-cost bit operations
– Reduces instructions for memory accesses to the same non-zero element
SWA_SpMM (C, A, B, subWarp)
  // set matrix C to O
  i ← threadId
  nzid ← i / subWarp
  rid ← idsA[nzid * 2]
  cid ← idsA[nzid * 2 + 1]
  val ← valuesA[nzid]
  for j ← (i % subWarp) to nB by subWarp do
    Atomic (C[rid][j] ← C[rid][j] + val * B[cid][j])
(Figure: each sub-warp of threads shares one non-zero element and strides across the dense columns; rows touched by several non-zeros are combined with atomicAdd().)
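A hedged CUDA sketch of the SWA pseudocode above for SparseTensor input. SUBWARP is a compile-time power of two so the division and modulo reduce to cheap bit operations; the template parameter and kernel name are illustrative, and C is assumed zero-initialized. Launch with nnz * SUBWARP threads.

  template<int SUBWARP> // power of two, e.g. 4
  __global__ void swaSpmm(float *C, const int *idsA, const float *valuesA,
                          int nnz, const float *B, int nB)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int nzid = i / SUBWARP;       // cheap bit ops: SUBWARP is a power of two
      int lane = i & (SUBWARP - 1); // i % SUBWARP as a bit mask
      if (nzid >= nnz) return;
      int rid   = idsA[nzid * 2];
      int cid   = idsA[nzid * 2 + 1];
      float val = valuesA[nzid];    // loaded once, reused for nB/SUBWARP columns
      for (int j = lane; j < nB; j += SUBWARP)
          atomicAdd(&C[rid * nB + j], val * B[cid * nB + j]);
  }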
– Reduces instructions for memory accesses to the same non-zero element
– Atomic-free addition to the output matrix: in CSR, each sub-warp owns a whole row, so every output element has exactly one writer
SWA_SpMM_CSR (C, A, B, subWarp)
  // set matrix C to O
  i ← threadId
  rid ← i / subWarp
  for nzid ← rptA[rid] to rptA[rid + 1] do
    cid ← colidsA[nzid]
    val ← valuesA[nzid]
    for j ← (i % subWarp) to nB by subWarp do
      C[rid][j] ← C[rid][j] + val * B[cid][j]
(Figure: one sub-warp per row of A walks that row's non-zeros; no atomics are needed.)
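The CSR variant under the same assumptions: one sub-warp per row of A means no two threads ever write the same C[rid][j], so the atomicAdd disappears. Names are illustrative and C is again assumed zero-initialized; launch with mA * SUBWARP threads.

  template<int SUBWARP> // power of two
  __global__ void swaSpmmCsr(float *C, const int *rptA, const int *colidsA,
                             const float *valuesA, int mA,
                             const float *B, int nB)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int rid  = i / SUBWARP;        // one sub-warp per row of A
      int lane = i & (SUBWARP - 1);
      if (rid >= mA) return;
      for (int nzid = rptA[rid]; nzid < rptA[rid + 1]; ++nzid) {
          int cid   = colidsA[nzid];
          float val = valuesA[nzid];
          for (int j = lane; j < nB; j += SUBWARP)
              C[rid * nB + j] += val * B[cid * nB + j]; // atomic-free
      }
  }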
– Reduce the overhead of the CUDA kernel launches for initializing the output matrix
– Hardware support for atomic operations on shared memory
– Divide the output matrix along the column
■ Larger output matrices can be placed on shared memory
■ Also improves the locality of memory accesses to the input dense matrix
(Figure: (a) for small matrices, the whole output matrix is kept in shared memory; (b) cache blocking with SparseTensor divides the output matrix along columns. Legend: threads on an SM read the sparse and dense input matrices from global memory and accumulate the output tile in shared memory.)
– No need for each thread block to keep the whole output matrix (= mA * nB)
– (c) subWarp * mA > TB: row-wise division of the input sparse matrix
– (d) TB / subWarp * nB > 32KB: column-wise division of the output
■ Capacity of shared memory used per block is 32KB
■ TB is the thread block size
– Improves the locality of memory accesses to the input dense matrix
(Figure: (c) CSR for larger sparse matrices — rows are divided across thread blocks; (d) CSR for wider dense matrices — the output is divided along columns so each block's tile fits in shared memory.)
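A simplified CUDA sketch of the cache-blocking idea: one thread block per (batch element, column tile), accumulating its output tile in shared memory and writing it back once, which removes both the global-memory atomics and the per-matrix init/compute launches. The names, the TILE_N value, and the assumption that all matrices in the batch share the dimension mA are illustrative simplifications, not the authors' exact kernel.

  #define TILE_N 32 // columns of B/C handled per thread block

  __global__ void batchedSpmmCsrTiled(float *C, const int *const *rptA,
                                      const int *const *colidsA,
                                      const float *const *valuesA, int mA,
                                      const float *const *B, int nB)
  {
      extern __shared__ float tile[];    // mA * TILE_N floats (kept under 32KB)
      int bid   = blockIdx.x;            // which small matrix in the batch
      int jBase = blockIdx.y * TILE_N;   // column tile of B and C
      const int *rpt = rptA[bid], *cid = colidsA[bid];
      const float *val = valuesA[bid], *Bb = B[bid];
      // Zero the output tile in shared memory: no separate init kernel.
      for (int t = threadIdx.x; t < mA * TILE_N; t += blockDim.x)
          tile[t] = 0.0f;
      __syncthreads();
      // One thread per row (simplified from the sub-warp assignment).
      for (int rid = threadIdx.x; rid < mA; rid += blockDim.x)
          for (int nz = rpt[rid]; nz < rpt[rid + 1]; ++nz)
              for (int j = 0; j < TILE_N && jBase + j < nB; ++j)
                  tile[rid * TILE_N + j] += val[nz] * Bb[cid[nz] * nB + jBase + j];
      __syncthreads();
      // Write the finished tile back to global memory once.
      for (int t = threadIdx.x; t < mA * TILE_N; t += blockDim.x) {
          int rid = t / TILE_N, j = t % TILE_N;
          if (jBase + j < nB)
              C[bid * mA * nB + rid * nB + jBase + j] = tile[t];
      }
  }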
– Reduce the overhead of CUDA kernel launches
– Select kernel (a) or (b) based on the maximum size of the output
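A hedged host-side sketch of that selection: if even the largest output matrix in the batch fits in the 32KB shared-memory budget, the simple variant (a) is used; otherwise the cache-blocked variant (b) is chosen. The function name and the exact rule are assumptions for illustration.

  #include <cstddef>

  // true  -> variant (a): whole output per matrix fits in shared memory
  // false -> variant (b): cache blocking along the output columns
  static bool useWholeOutputKernel(int mA_max, int nB)
  {
      return (size_t)mA_max * nB * sizeof(float) <= 32 * 1024;
  }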
– CPU: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
– GPU: NVIDIA Tesla P100
■ #SMs: 56, shared memory: 64 KB/SM
■ Memory: 16GB
– OS: SUSE Linux Enterprise, compiler: NVCC V9.0.176
– csrmm() and csrmm2() in cuSPARSE (non-batched)
– SpMM following SparseTensorDenseMatMul in TensorFlow (non-batched)
– gemmBatched() from Batched BLAS (batched, but for dense × dense MM)
– Batched SpMM for SparseTensor (batched)
– Batched SpMM for CSR (batched)
– Parameters: row/column size (= dim), sparsity (= nnz/row), batch size
– FLOPS = 2 * nnzA * nB / exe_time
■ Operations between zero elements in gemmBatched() are not counted
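For concreteness, a tiny host-side helper matching this metric (each stored non-zero of A contributes one multiply and one add per column of B); the function name is an assumption.

  static double spmmGflops(long long nnzA, int nB, double exe_time_sec)
  {
      return 2.0 * (double)nnzA * (double)nB / exe_time_sec / 1.0e9;
  }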
– Better sm_efficiency with Batched SpMM
(Plots: batch=100, dim=50, nnz/row=3 and batch=50, dim=50, nnz/row=2; up to 9.27x and 6.09x speedup)
sm_efficiency: TensorFlow 35.51%, BatchedSpMM (ST) 89.07%, BatchedSpMM (CSR) 87.87%
– Batch=50 cases do not use all SMs on GPU
(Plots: batch=50, dim=64, nnz/row=3 and batch=100, dim=64, nnz/row=3)
– Improvement comes from increased parallelism
■ Batched SpMM for CSR launches more threads in proportion to mA
– Increasing dim increases sparsity, leading to more zero-related operations
– More cache blocking causes repeated memory accesses to the same non-zero elements
(Plots: batch=100, nnz/row=3, dim = 32 / 64 / 128)
– Improvement of Batched SpMM (ST) is limited
■ More race conditions caused by atomic operations
(Plots: batch=100, dim=64, nnz/row = 3, 1, and 5)
– dim = [32, 256], nnz/row = [1, 5], batch = 100
– cuBLAS is excluded because it requires all input matrices to be the same size
– 3.29x performance improvement at n_B = 1024
Dataset       #Matrices   Max Dimension   Epochs   Batch size (Training / Inference)   #Layers of GraphCNN
Tox21         7,862       50              50       50 / 200                             2
Reaction100   75,477      50              20       100 / 200                            3
!" = $
%
&",% (% ) Feature Adjacency matrix
1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 0 1
A= Y = AXW 1 2 3 4 5
Input (Graph structure and features)
MatMul and SpMM
GraphConvolution (Y, A, X, W, bias)
  for b ← 0 to batchsize do
    for ch ← 0 to channel do
      U ← MatMul (X[b], W[ch])
      B ← Add (bias[ch], U)
      C[ch] ← SpMM (A[b][ch], B)
    Y[b] ← ElementWiseAdd (C)

GraphConvolutionBatched (Y, A, X, W, bias)
  for ch ← 0 to channel do
    Xr ← Reshape (X, (mx * batchsize, nx))
    U ← MatMul (Xr, W[ch])
    B ← Add (bias[ch], U)
    Alist ← [A[0][ch], ..., A[batchsize – 1][ch]]
    C[ch] ← BatchedSpMM (Alist, B)
  Y ← ElementWiseAdd (C)
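On the host side, the restructuring amounts to replacing a loop of small kernel launches with a single batched launch. A sketch reusing the illustrative kernels from earlier slides; all names, launch parameters, and the device-side pointer arrays (d_rptA, d_colidsA, d_valuesA, d_B, d_C) are assumptions.

  // Non-batched: batchsize tiny launches, none large enough to fill the GPU.
  int threads = 128;
  int blocks  = (mA * 4 + threads - 1) / threads; // subWarp = 4
  for (int b = 0; b < batchsize; ++b)
      swaSpmmCsr<4><<<blocks, threads>>>(C[b], rptA[b], colidsA[b],
                                         valuesA[b], mA, B[b], nB);

  // Batched: one launch covers the whole batch.
  dim3 grid(batchsize, (nB + TILE_N - 1) / TILE_N);
  batchedSpmmCsrTiled<<<grid, threads, mA * TILE_N * sizeof(float)>>>(
      d_C, d_rptA, d_colidsA, d_valuesA, mA, d_B, nB);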
– Training: up to 59% improvement
– Inference: up to 37% improvement
– The Tox21 data fits in the last-level cache in the CPU case
Execution time [sec]:
                         CPU           GPU
                         Non-Batched   Non-Batched   Batched    Speedup
Training    Tox21        854.51        918.03        723.80     1.18x
            Reaction100  16223.98      3029.13       1905.32    1.59x
Inference   Tox21        2.71          2.56          1.97       1.30x
            Reaction100  44.66         22.42         16.32      1.37x
– CUDA kernel launches reduced from 50 * 3 to 3
              MatMul       Add          SpMM
Non-Batched   1.571 msec   1.316 msec   1.981 msec (SparseTensorDenseMatMul)
Batched       0.031 msec   0.023 msec   0.190 msec (BatchedSpMM)
– Batched BLAS (e.g. gemmBatched)
■ Handles many operations on dense matrices or vectors in a single kernel
■ High throughput for kernels on small matrices
– Batched SpMV
■ Highly application-specific (e.g. assumes the same non-zero pattern)
– DeepChem
■ Graph structure is expressed as an adjacency list
– Chainer Chemistry
■ Treats sparse matrices as dense matrices, causing many zero-related operations
– Sub-Warp-Assigned SpMM
– Batched SpMM
■ Improves the locality of memory accesses and exploits shared memory
– Detailed preliminary performance evaluation
■ Up to 9.27x speedup over non-batched SpMM kernels
■ Performance advantage over batched GEMM for small matrices
– Evaluation on a GCN application
■ Up to 1.59x speedup for training and 1.37x speedup for inference
Code will be available at the end of May: https://github.com/YusukeNagasaka/Batched-SpMM