SpArch: Efficient Architecture for Sparse Matrix Multiplication

Zhekai Zhang*¹, Hanrui Wang*¹, Song Han¹, William J. Dally²
¹Massachusetts Institute of Technology  ²Stanford University / Nvidia
*Equal Contributions

sparch.mit.edu
Applications of sparse matrix-matrix multiplication: graph computing and compressed neural networks.
[Figure: example matrices from both domains. Graph computing: dimension ≈ 4×10⁶, density ≈ 8×10⁻⁶; compressed neural networks: dimension ≈ 10⁴, density ≈ 10⁻¹. The matrices are large and extremely sparse.]
Double-precision SpMM runs at very low utilization on existing hardware: averaged over 20 benchmarks¹,², the achieved GFLOPS of MKL on an Intel Core i7-5930K, cuSPARSE and CUSP on a TITAN Xp, and Armadillo on an Arm Cortex-A53 are only a small fraction of each platform's theoretical GFLOPS³,⁴,⁵.

¹Leskovec, Jure, and Rok Sosič. "SNAP: A general-purpose network analysis and graph-mining library." ACM Transactions on Intelligent Systems and Technology (TIST) 8.1 (2016): 1.
²Davis, Timothy A., and Yifan Hu. "The University of Florida Sparse Matrix Collection." ACM Transactions on Mathematical Software (TOMS) 38.1 (2011): 1.
³https://www.pugetsystems.com/labs/hpc/Linpack-performance-Haswell-E-Core-i7-5960X-and-5930K-594/
⁴https://www.techpowerup.com/gpu-specs/titan-x-pascal.c2863
⁵http://web.eece.maine.edu/~vweaver/group/machines.html
The outer-product algorithm: given left matrix B and right matrix C, the multiply phase forms one intermediate partial matrix per column/row pair, Qᵢ = B[:,i] × C[i,:]; the merge phase then accumulates them into the result matrix D = BC = ∑ᵢ Qᵢ.
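A minimal SciPy sketch of this two-phase dataflow (the function name and matrix sizes are illustrative, not SpArch's implementation):

```python
import numpy as np
from scipy import sparse

def outer_product_spgemm(B, C):
    # Multiply phase: one rank-1 sparse partial matrix Q_i per
    # (column i of B, row i of C) pair.
    partials = [B[:, [i]] @ C[[i], :] for i in range(B.shape[1])]
    # Merge phase: element-wise accumulation of all partial matrices.
    D = partials[0]
    for Q in partials[1:]:
        D = D + Q
    return D

B = sparse.random(64, 64, density=0.05, format="csc", random_state=0)
C = sparse.random(64, 64, density=0.05, format="csr", random_state=1)
assert np.allclose(outer_product_spgemm(B, C).toarray(), (B @ C).toarray())
```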
Hardware view: a multiplier array and a merger, both connected to DRAM.

Multiply phase: perfect input reuse. Each input matrix is read from DRAM only once.
Merge phase: bad output reuse. Intermediate partial matrices must be stored to DRAM and loaded back, round after round.
Baseline: Pal, Subhankar, et al. "OuterSPACE: An outer product based sparse matrix multiplication accelerator." HPCA, IEEE, 2018.

[Plot: # loads and stores vs. partial-matrix size (#non-zeros) for the outer-product baseline]

SpArch keeps the outer-product algorithm and adds four techniques:
Technique 1: Pipelined Multiply and Merge
Technique 2: Matrix Condensing
Technique 3: Huffman Tree Scheduler
Technique 4: Row Prefetcher
Merger example: element-wise add of two sparse rows, MatA and MatB, over indices 1-15. Entries that share an index are summed (0.5 + 0.6 = 1.1, 0.3 + 1.2 = 1.5); the remaining non-zeros (0.1, 0.2, 1.2 from MatA; 1.3, 2.2, 1.1 from MatB) pass through unchanged.
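A scalar sketch of this merge, written as the sequential two-pointer loop that the animation on the next slides steps through (the (index, value) pair-list format is illustrative):

```python
def merge_sparse_vectors(a, b):
    """Merge two index-sorted lists of (index, value) pairs,
    adding the values of entries that share an index."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:                       # same index: add values
            out.append((a[i][0], a[i][1] + b[j][1])); i += 1; j += 1
        elif a[i][0] < b[j][0]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])             # drain the leftover tail
    return out

mat_a = [(1, 0.1), (3, 0.5), (4, 0.2), (7, 0.3)]
mat_b = [(3, 0.6), (5, 1.3), (7, 1.2), (9, -0.8)]
print(merge_sparse_vectors(mat_a, mat_b))
# [(1, 0.1), (3, 1.1), (4, 0.2), (5, 1.3), (7, 1.5), (9, -0.8)]
```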
[Animation: two read pointers, ptr0 and ptr1, step through the two sorted inputs one comparison at a time, as in the loop sketched above]
For throughput, the merger is instead built as a comparator array. The input index sequences are padded with sentinels: MatA contributes indices (−∞, 1, 3, 4, 7, +∞) with values (0.1, 0.5, 0.2, 0.3); MatB contributes indices (−∞, 3, 5, 7, 9, +∞) with values (0.6, 1.3, 1.2, −0.8). All pairwise ≥/< comparisons are evaluated in parallel, and the comparison results place every element into the merged sequence (1, 3, 3, 4, 5, 7, 7, 9). Values of the same indices are then added (0.5 + 0.6 = 1.1, 0.3 + 1.2 = 1.5), yielding indices (1, 3, 4, 5, 7, 9) with values (0.1, 1.1, 0.2, 1.3, 1.5, −0.8).
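A NumPy sketch of the comparator-array idea on the slide's example. The all-at-once comparison matrix mirrors the hardware; deriving merged positions from comparison counts, with ties broken in MatA's favor, is my formulation of what the comparators compute:

```python
import numpy as np

def comparator_array_merge(idx_a, val_a, idx_b, val_b):
    # All pairwise index comparisons, formed at once like the comparator array.
    b_lt_a = idx_b[None, :] < idx_a[:, None]    # B[j] <  A[i]
    a_le_b = idx_a[:, None] <= idx_b[None, :]   # A[i] <= B[j]
    # Merged position = own rank + number of smaller entries in the
    # other stream (ties resolved in favor of MatA).
    pos_a = np.arange(len(idx_a)) + b_lt_a.sum(axis=1)
    pos_b = np.arange(len(idx_b)) + a_le_b.sum(axis=0)
    n = len(idx_a) + len(idx_b)
    idx, val = np.empty(n, dtype=int), np.empty(n)
    idx[pos_a], val[pos_a] = idx_a, val_a
    idx[pos_b], val[pos_b] = idx_b, val_b
    # Add values of same indices (duplicates are adjacent after merging).
    dup = idx[1:] == idx[:-1]
    val[:-1][dup] += val[1:][dup]
    keep = np.concatenate(([True], ~dup))
    return idx[keep], val[keep]

idx, val = comparator_array_merge(
    np.array([1, 3, 4, 7]), np.array([0.1, 0.5, 0.2, 0.3]),
    np.array([3, 5, 7, 9]), np.array([0.6, 1.3, 1.2, -0.8]))
print(idx, val)  # [1 3 4 5 7 9] [ 0.1  1.1  0.2  1.3  1.5 -0.8]
```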
Mergers are stacked into a merge tree: the multiplier array feeds multiple partial matrices, each already sorted by (row, column) coordinate, into the leaf mergers; comparator arrays combine them level by level; and the root streams one merged result toward DRAM.

[Figure: a merge tree combining four coordinate-sorted partial-matrix streams A, B, C, D]
Technique 1: Pipelined multiply and merge. The multiplier array streams partial matrices directly into the merger; ideally, partial matrices are never stored to DRAM.
In practice, however, a workload can produce up to 2¹⁸ partial matrices while the merger is limited to 64-way merging, so the merge still takes too many rounds.
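As a back-of-envelope illustration (assuming every round's output spills to DRAM): merging 2¹⁸ streams 64 at a time takes ⌈log₆₄ 2¹⁸⌉ = ⌈18/6⌉ = 3 rounds, so the intermediate data crosses DRAM twice before the final pass.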
[Chart: breakdown of memory access (0 to 8 GB), split into read left matrix / read right matrix / R/W intermediate results / write final result, for the OuterSPACE baseline vs. pipelined multiply and merge]
[Plot: # loads and stores vs. partial-matrix size (#non-zeros), OuterSPACE baseline vs. after pipelining]
Technique 2: Matrix condensing. All non-zeros of the left matrix are pushed to the left, producing a condensed matrix B′ (CSR) that is multiplied against the right matrix C (CSR). Each condensed column packs together the k-th non-zero of every row, so the number of partial matrices drops from up to 2¹⁸ before condensing to only 2¹~2¹⁴ after condensing.
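A sketch of the condensing transform (CSR in, list of condensed columns out; the names are illustrative). Each entry keeps its original column index, since that index still selects which row of C it must multiply:

```python
from scipy.sparse import csr_matrix

def condense(B):
    """Condensed column k holds the k-th non-zero of every row of B,
    stored as (row, original_column, value)."""
    B = csr_matrix(B)
    width = int(max(B.indptr[1:] - B.indptr[:-1]))  # longest row
    cols = [[] for _ in range(width)]
    for r in range(B.shape[0]):
        lo, hi = B.indptr[r], B.indptr[r + 1]
        for k in range(hi - lo):
            cols[k].append((r, int(B.indices[lo + k]), float(B.data[lo + k])))
    return cols

B = csr_matrix([[0, 5, 0, 7],
                [1, 0, 0, 0],
                [0, 0, 2, 3]])
for k, col in enumerate(condense(B)):
    print(k, col)
# 0 [(0, 1, 5.0), (1, 0, 1.0), (2, 2, 2.0)]
# 1 [(0, 3, 7.0), (2, 3, 3.0)]
```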
With condensing, the same multiplier array and merger need far fewer rounds.

[Chart: breakdown of memory access for baseline (OuterSPACE), pipelined multiply and merge, and matrix condensing; condensing cuts DRAM traffic about 5×]
[Plot: # loads and stores vs. partial-matrix size, after pipelining vs. after matrix condensing]
[Histogram: number of partial matrices vs. partial-matrix size (#non-zeros)] Most partial matrices contain only a few non-zeros, so the order in which they are merged matters.
Technique 3: Huffman tree scheduler. Example: twelve partial matrices A through L with sizes (#non-zeros) 15, 15, 13, 12, 9, 7, 3, 2, 2, 2, 2, 2, merged on a multi-way merger.

A naive merge schedule produces intermediate partial matrices of sizes 40, 21, and 8:
# DRAM accesses of intermediate partial results = ∑_{k is an intermediate node} weight[k] = 40 + 21 + 8 = 69.

Scheduling the same merges as a Huffman tree (always merge the smallest matrices first) produces intermediates of sizes 6, 13, and 41 instead:
# DRAM accesses of intermediate partial results = ∑_{k is an intermediate node} weight[k] = 6 + 13 + 41 = 60.

Why a Huffman tree is optimal: each node's weight is the sum of the leaf weights below it, so
# DRAM accesses = ∑_{k is an intermediate node} weight[k]
  = ∑_{k is a node} weight[k] − ∑_{k is a leaf node} weight[k]
  = ∑_{k is a leaf node} weight[k] · depth[k] − Constant.
Huffman coding minimizes exactly ∑_{k is a symbol} weight[k] · length[k], so a Huffman merge tree minimizes the DRAM access of intermediate partial results.
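A minimal heapq sketch of the scheduler's cost, assuming a 4-way merger and that the final (root) merge streams straight to the output and so is not counted, matching the slide's arithmetic:

```python
import heapq

def huffman_merge_dram_access(sizes, ways=4):
    sizes = list(sizes)
    # Pad with empty matrices so every step can merge exactly `ways` inputs.
    while (len(sizes) - 1) % (ways - 1) != 0:
        sizes.append(0)
    heapq.heapify(sizes)
    dram_access = 0
    while len(sizes) > ways:           # the final merge is streamed, not stored
        merged = sum(heapq.heappop(sizes) for _ in range(ways))
        dram_access += merged          # intermediate result hits DRAM
        heapq.heappush(sizes, merged)
    return dram_access

# The slide's example: intermediates of sizes 6, 13, 41.
print(huffman_merge_dram_access([15, 15, 13, 12, 9, 7, 3, 2, 2, 2, 2, 2]))  # 60
```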
[Chart: breakdown of memory access for baseline (OuterSPACE), pipelined multiply and merge, matrix condensing, and Huffman tree scheduler]
[Plots: # loads and stores vs. partial-matrix size, after matrix condensing vs. after the Huffman scheduler, and OuterSPACE baseline vs. after the Huffman scheduler]
Technique 4: Row prefetcher. After condensing, access to the right matrix becomes irregular. A cache to the rescue! Because the condensed left matrix B′ determines which rows of the second input matrix will be consumed, SpArch can accurately predict the access order ahead of time and prefetch those rows into an on-chip buffer before they are used.
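A toy model of why knowing the future order helps (my illustration, not SpArch's exact buffer design): with the access sequence known in advance, the on-chip buffer can use optimal Belady replacement, evicting the row whose next use is furthest away:

```python
def dram_row_fetches(order, capacity):
    """Count DRAM fetches for a row buffer with Belady (optimal)
    replacement, which requires knowing the future access order."""
    buffer, fetches = set(), 0
    for t, row in enumerate(order):
        if row in buffer:
            continue                   # already on chip
        fetches += 1
        if len(buffer) >= capacity:
            future = order[t + 1:]
            # Evict the resident row whose next use is furthest away
            # (or that is never used again).
            victim = max(buffer, key=lambda r: future.index(r)
                         if r in future else len(future) + 1)
            buffer.remove(victim)
        buffer.add(row)
    return fetches

# Hypothetical access order of right-matrix rows:
print(dram_row_fetches([0, 2, 5, 0, 3, 2, 6, 2, 7, 0], capacity=3))  # 7
```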
[Chart: breakdown of memory access for baseline (OuterSPACE), pipelined multiply and merge, matrix condensing, Huffman tree scheduler, and row prefetch, split into read left matrix / read right matrix / R/W intermediate results / write final result]
Results. Average performance on the 20 benchmarks:

Platform                      GFLOPS
Armadillo (Arm Cortex-A53)    0.008
MKL (Intel Core i7-5930K)     0.56
cuSPARSE (TITAN Xp)           0.6
CUSP (TITAN Xp)               0.63
OuterSPACE                    2.5
SpArch                        10.4

Energy efficiency (performance over platform power):

Platform                      MFLOPS/W
Armadillo (Arm Cortex-A53)    20
MKL (Intel Core i7-5930K)     7.5
cuSPARSE (TITAN Xp)           2.8
CUSP (TITAN Xp)               4
OuterSPACE                    202
SpArch                        1224
SpArch area and power breakdown:

Module               Area     Power
Merge Tree           60.6%    55.4%
Row Prefetcher       20.4%    13.5%
Column Fetcher       9.3%     1.2%
Partial Mat Writer   8.2%     2.8%
Multiplier Array     1.6%     0.9%
HBM                  n/a      26.2%
sparch.mit.edu