SLIDE 1

Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks

Yusuke Nagasaka†, Akira Nukada†, Kojima Ryosuke‡, Satoshi Matsuoka†§

†Tokyo Institute of Technology  ‡Kyoto University  §RIKEN Center for Computational Science

SLIDE 2

Graph Convolutional Networks (GCNs)

■ “Graph” is the input data of the neural network

– Chemical compounds and proteins are expressed as graphs
– Knowledge graphs


Quoted from https://arxiv.org/pdf/1711.05859.pdf

SLIDE 3

Formulation of Graph Convolution

Y = AXW   (per node: y_i = Σ_j a_{i,j} x_j W)

A: adjacency matrix, X: feature matrix, W: weight matrix. Example adjacency matrix of a 5-node graph:

A = | 1 0 0 0 0 |
    | 0 1 0 0 0 |
    | 1 1 1 1 0 |
    | 0 0 1 1 1 |
    | 0 0 0 0 1 |

Input (graph structure and features)

MatMul and SpMM

GraphConvolution (Y, A, X, W, bias)
  for b ← 0 to batchsize do
    for ch ← 0 to channel do
      U ← MatMul(X[b], W[ch])
      B ← Add(bias[ch], U)
      C[ch] ← SpMM(A[b][ch], B)
    Y[b] ← ElementWiseAdd(C)

Executes many SpMM kernels; each operation is independent of the others.
SLIDE 4

Performance Issues of GCN Applications

■ Many small computing kernels occupy the execution time

– GEMM, Sparse-Dense Matrix Multiplication (SpMM)

■ Launch overhead of repeated CUDA kernels is not negligible

– Not clear how to develop a batched routine for (small) sparse matrices

■ Load balance issue
– Number of nodes / sparsity varies across input graphs
■ Occupancy issue
– Requires architecture-specific kernels

■ How to efficiently compute tens or hundreds of small SpMMs?

SLIDE 5

Contribution

■ Batched approaches for SpMM on GPU to improve the performance of GCN applications

– Sub-Warp-Assigned (SWA) SpMM for SpMM on small matrices

■ Support both SparseTensor and CSR

– Batched SpMM

■ High occupancy and utilization of fast shared memory
■ Reduce the overhead of CUDA kernel launches
■ Develop routines both for SparseTensor and CSR

– Execute tens or hundreds of small SpMMs in a single kernel
– Significant performance boost

■ Up to 9.27x speedup compared to non-batched approaches for SpMM
■ Up to 1.59x speedup for training and 1.37x speedup for inference on a GCN application

SLIDE 6

Sparse Matrix Format

■ Compressing needless zero elements

– Storing only non-zero elements
– Reducing memory usage and computation

■ Many formats have been proposed

– Each suited to particular architectures and given matrices

SLIDE 7


Implementation of SpMM in TensorFlow

■ COO-like sparse matrix format

– Array of {Row, Column} ids

■ SparseTensorDenseMatmul

– 1 CUDA thread per mul-add operation
– nnz * nDense threads in total
– Load-balanced workload
– Addition is done by atomicAdd

■ Atomic additions on global memory are expensive (a CUDA sketch of this scheme follows the figure below)

[Figure: SparseTensorDenseMatmul example — value/id arrays of the sparse matrix, one thread (IDs 1–12) per mul-add, results accumulated into the mDense x nDense output matrix with atomicAdd]
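The scheme above can be sketched as a CUDA kernel. This is a minimal illustration of the one-thread-per-mul-add idea with hypothetical names, not TensorFlow's actual source; C is assumed pre-zeroed and ids stores the {row, col} pairs consecutively.

#include <cuda_runtime.h>

// One thread per multiply-add: nnz * nDense threads in total.
// C is mA x nDense and assumed pre-zeroed before the launch.
__global__ void st_dense_matmul(float *C, const int *ids, const float *values,
                                const float *B, int nnz, int nDense)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nnz * nDense) return;
    int nzid = t / nDense;               // which non-zero element
    int j    = t % nDense;               // which column of the dense matrix
    int rid  = ids[2 * nzid];            // row index of the non-zero
    int cid  = ids[2 * nzid + 1];        // column index of the non-zero
    // The additions meet in global memory via atomicAdd, which is what
    // makes this baseline expensive despite its perfect load balance.
    atomicAdd(&C[rid * nDense + j], values[nzid] * B[cid * nDense + j]);
}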

SLIDE 8

Sub-Warp-Assigned (SWA) SpMM

■ Assign subWarp to each non-zero element

– subWarp size is set to a power of two (see the CUDA sketch below)

■ Division and modulo are executed as low-cost bit operations

– Reduces instructions for memory access to the same non-zero element


SWA_SpMM (C, A, B, subWarp)
  // set matrix C to O
  i ← threadId
  nzid ← i / subWarp
  rid ← idsA[nzid * 2]
  cid ← idsA[nzid * 2 + 1]
  val ← valuesA[nzid]
  for j ← (i % subWarp) to nB by subWarp do
    Atomic (C[rid][j] ← C[rid][j] + val * B[cid][j])

[Figure: non-zeros 1–12 assigned to sub-warps of threads; partial products combined with atomicAdd()]
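A minimal CUDA sketch of the SWA_SpMM pseudocode above, with illustrative names rather than the authors' released code. Making SUBWARP a compile-time power of two lets the compiler turn the division and modulo into shift and mask instructions, as the slide describes.

#include <cuda_runtime.h>

// One sub-warp of SUBWARP threads per non-zero; SUBWARP is a power of two.
// C (mA x nB) is assumed pre-zeroed.
template <int SUBWARP>
__global__ void swa_spmm_st(float *C, const int *idsA, const float *valuesA,
                            const float *B, int nnz, int nB)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int nzid = i / SUBWARP;              // non-zero owned by this sub-warp
    if (nzid >= nnz) return;
    int rid = idsA[2 * nzid];
    int cid = idsA[2 * nzid + 1];
    float val = valuesA[nzid];           // loaded once, reused for every j
    // The sub-warp strides together over the nB output columns.
    for (int j = i % SUBWARP; j < nB; j += SUBWARP)
        atomicAdd(&C[rid * nB + j], val * B[cid * nB + j]);
}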

SLIDE 9

Sub-Warp-Assigned (SWA) SpMM for CSR

■ Assign subWarp to each row of input sparse matrix

– Reduces instructions for memory access to the same non-zero element
– Atomic-free addition to the output matrix (see the CUDA sketch below)


SWA_SpMM_CSR (C, A, B, subWarp)
  // set matrix C to O
  i ← threadId
  rid ← i / subWarp
  for nzid ← rptA[rid] to rptA[rid + 1] do
    cid ← colidsA[nzid]
    val ← valuesA[nzid]
    for j ← (i % subWarp) to nB by subWarp do
      C[rid][j] ← C[rid][j] + val * B[cid][j]

[Figure: one sub-warp per CSR row; each thread accumulates into disjoint output columns, no atomics]
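The CSR variant in the same illustrative CUDA style: since exactly one sub-warp owns each output row and each of its threads owns distinct columns, the accumulation needs no atomics.

#include <cuda_runtime.h>

// One sub-warp per row of the sparse matrix; C (mA x nB) is pre-zeroed.
template <int SUBWARP>
__global__ void swa_spmm_csr(float *C, const int *rptA, const int *colidsA,
                             const float *valuesA, const float *B,
                             int mA, int nB)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int rid = i / SUBWARP;               // row owned by this sub-warp
    if (rid >= mA) return;
    int lane = i % SUBWARP;              // this thread's offset in the row
    for (int nzid = rptA[rid]; nzid < rptA[rid + 1]; nzid++) {
        int cid = colidsA[nzid];
        float val = valuesA[nzid];
        // Atomic-free: this sub-warp is the only writer of row rid, and
        // each lane touches a disjoint set of columns.
        for (int j = lane; j < nB; j += SUBWARP)
            C[rid * nB + j] += val * B[cid * nB + j];
    }
}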

SLIDE 10

Efficient Use of Shared Memory

■ Utilize shared memory for output matrix

– Reduces the overhead of a CUDA kernel launch for initializing the output matrix
– Hardware support for atomic operations on shared memory

■ Cache blocking optimization for larger inputs

– Divide the output matrix along the column

■ A larger output matrix can be placed on shared memory
■ Also improves the locality of memory access to the input dense matrix (see the CUDA sketch below)


[Figure: (a) for small matrices the whole output lives in shared memory; (b) cache blocking with SparseTensor — threads on an SM read the sparse and dense inputs from global memory and accumulate the output tile in shared memory]
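A sketch of how case (a) could look for the SparseTensor kernel, under my own assumptions about the launch configuration (one thread block per output matrix, dynamic shared memory sized to mA * nB floats): the tile is zeroed inside the kernel, so no separate initialization launch is needed, and the atomics use the hardware-supported shared-memory path.

#include <cuda_runtime.h>

// Launch per matrix: swa_spmm_st_shared<<<1, TB, mA * nB * sizeof(float)>>>.
__global__ void swa_spmm_st_shared(float *C, const int *idsA,
                                   const float *valuesA, const float *B,
                                   int nnz, int mA, int nB, int subWarp)
{
    extern __shared__ float tile[];      // mA * nB output elements

    // Zero the tile in-kernel, replacing a dedicated initialization kernel.
    for (int t = threadIdx.x; t < mA * nB; t += blockDim.x)
        tile[t] = 0.0f;
    __syncthreads();

    // Same sub-warp assignment as before, but atomicAdd now targets
    // shared memory instead of global memory.
    for (int i = threadIdx.x; i < nnz * subWarp; i += blockDim.x) {
        int nzid = i / subWarp;
        int rid = idsA[2 * nzid], cid = idsA[2 * nzid + 1];
        float val = valuesA[nzid];
        for (int j = i % subWarp; j < nB; j += subWarp)
            atomicAdd(&tile[rid * nB + j], val * B[cid * nB + j]);
    }
    __syncthreads();

    // Single pass writing the finished output matrix to global memory.
    for (int t = threadIdx.x; t < mA * nB; t += blockDim.x)
        C[t] = tile[t];
}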

SLIDE 11

Efficient Use of Shared Memory for CSR

■ Each subWarp keeps its output row (= nB)

– No need for each thread block to keep the whole output matrix (= mA * nB)

■ More thread blocks for larger mA

– subWarp * mA > TB
– Row-wise division of the input sparse matrix

■ Cache blocking for wider dense matrix

– (TB / subWarp) * nB > 32 KB

■ Capacity of shared memory per block is 32 KB
■ TB is the thread block size (a host-side dispatch sketch follows below)

– Improves the locality of memory access to the input dense matrix


[Figure: (c) CSR for a larger sparse matrix, (d) CSR for a wider dense matrix — threads on an SM; output rows kept in shared memory]
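To make the two thresholds concrete, here is a hypothetical host-side dispatch mirroring the conditions on this slide; the byte count assumes single-precision values, which the 32 KB figure implies. Variant names are mine, not the paper's.

#include <stddef.h>

enum CsrVariant { WHOLE_MATRIX, ROW_DIVIDED, CACHE_BLOCKED };

// TB = thread block size; each block holds TB / subWarp output rows.
enum CsrVariant pick_csr_variant(int mA, int nB, int subWarp, int TB)
{
    const size_t SMEM_BUDGET = 32 * 1024;       // 32 KB per thread block
    if ((size_t)subWarp * mA > (size_t)TB)      // more rows than one block covers
        return ROW_DIVIDED;                     // (c) split A row-wise
    if ((size_t)(TB / subWarp) * nB * sizeof(float) > SMEM_BUDGET)
        return CACHE_BLOCKED;                   // (d) divide B along columns
    return WHOLE_MATRIX;
}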

SLIDE 12

Batched Algorithm for SpMMs

■ 1 CUDA kernel manages multiple SpMMs

– Reduces the overhead of CUDA kernel launches

■ Statically decide whether the cache blocking optimization is applied

– Select (a) or (b) based on maximum size of output

■ Assign one thread block to each SpMM, for the whole matrix or a sub-matrix (see the CUDA sketch below)


[Figure: (a) for small matrices, (b) cache blocking with SparseTensor — one thread block per SpMM; output tiles in shared memory, as on slide 10]
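A condensed CUDA sketch of the batched idea: one launch covers the whole batch, blockIdx.x selects the SpMM, and per-problem CSR arrays arrive as device-side pointer arrays. This is an illustration of the scheme (shared-memory tiling and the cache-blocking branch are omitted), not the released kernel.

#include <cuda_runtime.h>

// Launch with <<<batchsize, TB>>>; every output Cs[b] is pre-zeroed.
__global__ void batched_spmm_csr(float * const *Cs,
                                 const int * const *rpts,
                                 const int * const *colids,
                                 const float * const *values,
                                 const float * const *Bs,
                                 const int *mAs, const int *nBs, int subWarp)
{
    int b = blockIdx.x;                  // which SpMM of the batch
    float *C = Cs[b];
    const float *B = Bs[b];
    const int *rpt = rpts[b];
    const int *cid = colids[b];
    const float *val = values[b];
    int mA = mAs[b], nB = nBs[b];

    int lane = threadIdx.x % subWarp;
    int subWarpsPerBlock = blockDim.x / subWarp;
    // The block's sub-warps stride over the rows of this problem's A,
    // so each row still has exactly one writer (atomic-free as in CSR).
    for (int rid = threadIdx.x / subWarp; rid < mA; rid += subWarpsPerBlock)
        for (int nz = rpt[rid]; nz < rpt[rid + 1]; nz++)
            for (int j = lane; j < nB; j += subWarp)
                C[rid * nB + j] += val[nz] * B[cid[nz] * nB + j];
}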

SLIDE 13

Performance Evaluation

SLIDE 14

Evaluation Environment

■ TSUBAME 3.0

– CPU: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
– GPU: NVIDIA Tesla P100
  ■ #SMs: 56
  ■ Shared memory: 64 KB/SM
  ■ Memory: 16 GB
– OS: SUSE Linux Enterprise
– NVCC V9.0.176

SLIDE 15

Benchmark of Batched Approaches for SpMM

■ Compare the performance of

– csrmm() and csrmm2() in cuSPARSE (non-batched)
– SpMM following SparseTensorDenseMatMul in TensorFlow (non-batched)
– gemmBatched() from Batched BLAS (batched, but for dense x dense MM)
– Batched SpMM for SparseTensor (batched)
– Batched SpMM for CSR (batched)

■ Randomly generate sparse matrix

– Parameters: row/column size (= dim), sparsity (= nnz/row), batch size

■ FLOPS in single precision

– FLOPS = 2 * nnzA * nB / exe_time
– Operations involving zero elements in gemmBatched() are not counted

SLIDE 16

Benchmark Results

■ Parameter settings are based on the dataset and configuration of the GCN application
■ Significant speedups by the Batched SpMM family

– Better sm_efficiency with Batched SpMM


[Charts: batch=100, dim=50, nnz/row=3 and batch=50, dim=50, nnz/row=2 — up to 9.27x and 6.09x speedup; sm_efficiency: TensorFlow 35.51%, BatchedSpMM (ST) 89.07%, BatchedSpMM (CSR) 87.87%]

SLIDE 17

Benchmark Results

Batch size
■ Precise comparison between the batched approaches
■ Larger batch size simply brings higher SpMM throughput

– Batch=50 cases do not use all SMs on GPU

[Charts: batch=50, dim=64, nnz/row=3 and batch=100, dim=64, nnz/row=3]

SLIDE 18

Benchmark Results

Dimension
■ BatchedSpMM (CSR) achieves better performance as the dimension grows

– Improvement of parallelism

■ Batched SpMM for CSR launches more threads in proportion to mA

■ Improvement of cuBLAS and BatchedSpMM (ST) is limited

– Increasing dim increases sparsity, causing more zero-related operations
– More cache blocking causes repeated memory accesses to the same non-zero elements

[Charts: batch=100, dim=32/64/128, nnz/row=3]

SLIDE 19

Benchmark Results

Sparsity
■ Batched SpMM kernels work efficiently on sparser matrices

– Improvement of Batched SpMM (ST) is limited

■ More race conditions from atomic operations

■ cuBLAS appears to show better performance on denser matrices

[Charts: batch=100, dim=64, nnz/row=1/3/5]

SLIDE 20

Benchmark Results

Mixed
■ Various inputs with varying dimension and sparsity

– dim = [32, 256], nnz/row = [1, 5], batch = 100
– cuBLAS is excluded because it requires identical input matrix sizes
– 3.29x performance improvement at nB = 1024

SLIDE 21

Evaluation on GCNs Application

■ ChemGCN implemented with TensorFlow
■ Dataset and configuration (table below)
■ Average time of 5 executions


Dataset       #Matrices   Max Dimension   Epochs   Batch size (Training / Inference)   #Layers of GraphCNN
Tox21         7,862       50              50       50 / 200                            2
Reaction100   75,477      50              20       100 / 200                           3

SLIDE 22

Formulation of Graph Convolution (again)

Y = AXW   (per node: y_i = Σ_j a_{i,j} x_j W)

A: adjacency matrix, X: feature matrix, W: weight matrix. Example adjacency matrix of a 5-node graph:

A = | 1 0 0 0 0 |
    | 0 1 0 0 0 |
    | 1 1 1 1 0 |
    | 0 0 1 1 1 |
    | 0 0 0 0 1 |

Input (graph structure and features)

MatMul and SpMM

GraphConvolution (Y, A, X, W, bias)
  for b ← 0 to batchsize do
    for ch ← 0 to channel do
      U ← MatMul(X[b], W[ch])
      B ← Add(bias[ch], U)
      C[ch] ← SpMM(A[b][ch], B)
    Y[b] ← ElementWiseAdd(C)

GraphConvolutionBatched (Y, A, X, W, bias)
  for ch ← 0 to channel do
    Xr ← Reshape(X, (mx * batchsize, nx))
    U ← MatMul(Xr, W[ch])
    B ← Add(bias[ch], U)
    Alist ← [A[0][ch], ..., A[batchsize – 1][ch]]
    C[ch] ← BatchedSpMM(Alist, B)
  Y ← ElementWiseAdd(C)

SLIDE 23

Evaluation on GCNs Application

■ Batched SpMM is used as Batched version

– Training: up to 59% improvement
– Inference: up to 37% improvement
– Tox21 data fits in the last-level cache in the CPU case


Execution time [sec]

                          CPU           GPU
                          Non-Batched   Non-Batched   Batched    Speedup
Training    Tox21         854.51        918.03        723.80     1.18x
            Reaction100   16223.98      3029.13       1905.32    1.59x
Inference   Tox21         2.71          2.56          1.97       1.30x
            Reaction100   44.66         22.42         16.32      1.37x

SLIDE 24

Profiling with Timeline

■ Profiling result of the GraphConvolution layer with Tox21 data
■ Reduction of kernel launches

– CUDA kernel launches: 50 * 3 => 3


              MatMul       Add          SpMM
Non-Batched   1.571 msec   1.316 msec   1.981 msec (SparseTensorDenseMatmul)
Batched       0.031 msec   0.023 msec   0.190 msec (BatchedSpMM)

SLIDE 25

Related Work

■ Batched BLAS

– Handles many operations on dense matrices or vectors in a single kernel
– High throughput for kernels on small matrices
– Batched SpMV

■ Highly application-specific (e.g., assumes the same non-zero pattern)

■ Libraries and Framework for GCNs

– DeepChem

■ Graph structure is expressed as adjacency list

– Chainer Chemistry

■ Treats the sparse matrix as a dense matrix
– Many zero-related operations

SLIDE 26

Conclusion

■ Efficient algorithms for many SpMM operations on small matrices

– Sub-Warp-Assigned SpMM
– Batched SpMM

■ Improve the locality of memory access and exploit shared memory

■ Significant performance boost

– Detailed preliminary performance evaluation

■ Up to 9.27x speedup over non-batched SpMM kernels
■ Performance advantage over Batched GEMM for small matrices

– Evaluation on GCNs application

■ Up to 1.59x speedup for training and 1.37x speedup for inference


Code will be ready at the end of May: https://github.com/YusukeNagasaka/Batched-SpMM