SpArch: Efficient Architecture for Sparse Matrix Multiplication


SLIDE 1

SpArch: Efficient Architecture for Sparse Matrix Multiplication

Zhekai Zhang*1, Hanrui Wang*1, Song Han1, William J. Dally2
1Massachusetts Institute of Technology  2Stanford University / Nvidia
*Equal Contributions

sparch.mit.edu

SLIDE 2

Accelerate Sparse Matrix Multiplication

  • Graph Computing: Dimension ≈ 4×10⁶, Sparsity ≈ 8×10⁻⁶
  • Compressed Neural Networks: Dimension ≈ 10⁴, Sparsity ≈ 10⁻¹

SLIDE 3

Double-Precision SpMM: CPUs and GPUs are Slow and Under-Utilized

Platform | Average GFLOPS on 20 benchmarks¹,² | Theoretical GFLOPS³,⁴,⁵ | Utilization
MKL on Intel Core i7-5930K | 0.560 | 289 | 0.194%
cuSPARSE on TITAN Xp | 0.595 | 343 | 0.173%
CUSP on TITAN Xp | 0.631 | 343 | 0.184%
Armadillo on Arm Cortex-A53 | 0.00813 | 5.47 | 0.149%

¹ Leskovec, Jure, and Rok Sosič. "SNAP: A general-purpose network analysis and graph-mining library." ACM Transactions on Intelligent Systems and Technology (TIST) 8.1 (2016): 1.
² Davis, Timothy A., and Yifan Hu. "The University of Florida sparse matrix collection." ACM Transactions on Mathematical Software (TOMS) 38.1 (2011): 1.
³ https://www.pugetsystems.com/labs/hpc/Linpack-performance-Haswell-E-Core-i7-5960X-and-5930K-594/
⁴ https://www.techpowerup.com/gpu-specs/titan-x-pascal.c2863
⁵ http://web.eece.maine.edu/~vweaver/group/machines.html

SLIDE 4

Challenges

Dimension ≈ 4×10⁶, Density ≈ 8×10⁻⁶ (graphs); Dimension ≈ 10⁴, Density ≈ 10⁻¹ (compressed neural networks)

  • Super-large
  • Limited on-chip memory
SLIDE 5

Challenges

  • Super-large
  • Limited on-chip memory
  • Ultra-sparse
  • Low operational intensity
  • Limited memory bandwidth

Dimension ≈ 4×10⁶, Density ≈ 8×10⁻⁶ (graphs); Dimension ≈ 10⁴, Density ≈ 10⁻¹ (compressed neural networks)

SLIDE 6

Background: Outer Product

Left Matrix B and Right Matrix C

Intermediate Partial Matrix: Q_k = B[:,k] × C[k,:]

Result Matrix: D = BC = Σ_k Q_k

[Figure: the Multiply Phase produces the partial matrices Q_k; the Merge Phase sums them]
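To make the dataflow concrete, here is a minimal Python sketch of outer-product SpGEMM following the slide's formula (D = BC = Σ_k Q_k). The dictionary-based storage and function name are illustrative, not SpArch's actual data structures:

```python
from collections import defaultdict

def outer_product_spgemm(B_cols, C_rows):
    """Sparse D = B x C via outer products.
    B_cols: k -> list of (row, value), the non-zeros of column B[:, k]
    C_rows: k -> list of (col, value), the non-zeros of row C[k, :]"""
    D = defaultdict(float)
    for k, b_col in B_cols.items():      # one partial matrix Q_k per index k
        for i, b in b_col:               # Multiply Phase: Q_k = B[:,k] x C[k,:]
            for j, c in C_rows.get(k, []):
                D[(i, j)] += b * c       # Merge Phase: D += Q_k
    return dict(D)

# Toy example with two outer products:
B_cols = {0: [(0, 1.0), (2, 2.0)], 1: [(1, 3.0)]}
C_rows = {0: [(0, 4.0)], 1: [(2, 5.0)]}
print(outer_product_spgemm(B_cols, C_rows))
# {(0, 0): 4.0, (2, 0): 8.0, (1, 2): 15.0}
```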

SLIDE 7

Background: Outer Product

[Figure: DRAM → Multiplier Array (Multiply Phase) → DRAM → Merger (Merge Phase) → DRAM]

Perfect input reuse: each input matrix is read only once.


SLIDE 13

Background: Outer Product

[Figure: DRAM → Multiplier Array (Multiply Phase) → DRAM → Merger (Merge Phase) → DRAM]

Bad output reuse: intermediate results must be stored to DRAM and loaded back.

SLIDE 14

DRAM Access of the Intermediate Matrix in the Baseline Implementation

  • Baseline implementation: OuterSPACE
  • Row-wise output stationary
  • Each intermediate matrix has one round of store and load

Pal, Subhankar, et al. "OuterSPACE: An outer product based sparse matrix multiplication accelerator." HPCA, IEEE, 2018.

[Figure: Distribution of DRAM access (# Load and Store vs. Partial Matrix Size in #Non-zeros)]

SLIDE 15

Key idea: reduce both input and partial-matrix DRAM access

  • Algorithm: Outer Product
  • Technique 1: Pipelined Multiply and Merge
  • Technique 2: Matrix Condensing
  • Technique 3: Huffman Tree Scheduler
  • Technique 4: Row Prefetcher

SLIDE 16

Multiply and Merge

Element-wise add of two partial matrices (indices 1-15; only non-zero entries shown):

Index | 1   | 3   | 4   | 5   | 7   | 9    | 10  | 12  | 13
MatA  | 0.1 | 0.5 | 0.2 |     | 0.3 |      |     |     | 1.2
MatB  |     | 0.6 |     | 1.3 | 1.2 | −0.8 | 2.2 | 1.1 |
Add   | 0.1 | 1.1 | 0.2 | 1.3 | 1.5 | −0.8 | 2.2 | 1.1 | 1.2

SLIDE 17

Merge Phase

The partial matrices are stored as sorted (index, value) pairs:

MatA: (1, 0.1) (3, 0.5) (4, 0.2) (7, 0.3) (13, 1.2)
MatB: (3, 0.6) (5, 1.3) (7, 1.2) (9, −0.8) (10, 2.2) (12, 1.1)

SLIDE 18

Merge Phase

Two pointers (ptr0, ptr1) walk the sorted index lists [1, 3, 4, 7, …] and [3, 5, 7, 9, …] of MatA and MatB.

SLIDES 19-23

Merge Phase, step by step: the merged result grows one step at a time as the pointers advance.

Merge results: [1] → [1, 3, 3] → [1, 3, 3, 4] → [1, 3, 3, 4, 5] → [1, 3, 3, 4, 5, 7, 7] → …
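The stepping shown above is a standard two-pointer merge. As a reference point for Technique 1, here is a minimal Python sketch of this serial merge, producing one output element per step (the function name is illustrative):

```python
def serial_merge(A, B):
    """Merge two index-sorted (index, value) lists; values sharing an
    index are added. One comparison and one output per step."""
    out, i, j = [], 0, 0
    while i < len(A) and j < len(B):
        if A[i][0] == B[j][0]:               # same index: element-wise add
            out.append((A[i][0], A[i][1] + B[j][1]))
            i += 1; j += 1
        elif A[i][0] < B[j][0]:
            out.append(A[i]); i += 1
        else:
            out.append(B[j]); j += 1
    out.extend(A[i:]); out.extend(B[j:])     # drain whichever list remains
    return out

MatA = [(1, 0.1), (3, 0.5), (4, 0.2), (7, 0.3), (13, 1.2)]
MatB = [(3, 0.6), (5, 1.3), (7, 1.2), (9, -0.8), (10, 2.2), (12, 1.1)]
print(serial_merge(MatA, MatB))
# [(1, 0.1), (3, 1.1), (4, 0.2), (5, 1.3), (7, 1.5),
#  (9, -0.8), (10, 2.2), (12, 1.1), (13, 1.2)]
```

Merging this way costs one step per output element, which is exactly the serialization Technique 1 removes.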

SLIDE 24

Technique 1: Pipelined Multiply and Merge

Index lists padded with sentinels: MatA [−∞, 1, 3, 4, 7, +∞], MatB [−∞, 3, 5, 7, 9, +∞]

[Figure: MatA values 0.1, 0.5, 0.2, 0.3 loaded into the element-wise add array]

Parallelized in space instead of serialized in time.

SLIDE 25

Technique 1: Pipelined Multiply and Merge

[Figure: MatB values 0.6, 1.3, 1.2, −0.8 loaded into the element-wise add array]

Parallelized in space instead of serialized in time.
slide-26
SLIDE 26

Technique 1: Pipelined Multiply and Merge

26

−∞ 1 3 4 7 +∞ −∞ 3 5 7 9 +∞ 1 3 3 4 5 7 7 9 1 3 3 4 5 7 7 9 1 3 4 5 7 9 1 3 4 5 7 9

Add values of same indices

≥ ≥ < < ≥ ≥ ≥ < ≥ ≥ ≥ ≥ ≥ ≥ ≥ ≥ < < < < ≥ < ≥ < ≥ < ≥ < ≥ ≥ ≥ ≥

Index 1 2 3 4 5 6 7 8 9 MatA 0.1 0.5 0.2 0.3 MatB 0.6 1.3 1.2

  • 0.8

Element- wise Add

0.1 1.1 0.2 1.3 1.5 -0.8

slide-27
SLIDE 27

Technique 1: Pipelined Multiply and Merge

27

−∞ 1 3 4 7 +∞ −∞ 3 5 7 9 +∞ 1 3 4 5 7 9 1 3 4 5 7 9

Add values of same indices

≥ ≥ < < ≥ ≥ ≥ < ≥ ≥ ≥ ≥ ≥ ≥ ≥ ≥ < < < < ≥ < ≥ < ≥ < ≥ < ≥ ≥ ≥ ≥

Index 1 2 3 4 5 6 7 8 9 MatA 0.1 0.5 0.2 0.3 MatB 0.6 1.3 1.2

  • 0.8

Element- wise Add 0.1 1.1 0.2 1.3 1.5

  • 0.8

0.1 1.1 0.2 1.3 1.5 -0.8

Get the results in one clock cycle
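The comparator array can be modeled in software: each element's slot in the merged sequence depends only on all-pairs index comparisons, so every slot can be computed independently, which is how the hardware finishes in one cycle. A sketch of the idea under that assumption (not the actual RTL; names are illustrative):

```python
def parallel_merge(A, B):
    """Model of a comparator-array merger for two index-sorted
    (index, value) lists. Each element's output slot is computed
    independently from the comparison results."""
    n, m = len(A), len(B)
    slots = [None] * (n + m)
    for i, (idx, val) in enumerate(A):
        # slot = rank within A + number of B indices strictly smaller
        slots[i + sum(b < idx for b, _ in B)] = (idx, val)
    for j, (idx, val) in enumerate(B):
        # ties go after the A element, so count A indices <= idx
        slots[j + sum(a <= idx for a, _ in A)] = (idx, val)
    merged = []                      # second stage: add equal indices
    for idx, val in slots:
        if merged and merged[-1][0] == idx:
            merged[-1] = (idx, merged[-1][1] + val)
        else:
            merged.append((idx, val))
    return merged

MatA = [(1, 0.1), (3, 0.5), (4, 0.2), (7, 0.3)]
MatB = [(3, 0.6), (5, 1.3), (7, 1.2), (9, -0.8)]
print(parallel_merge(MatA, MatB))
# [(1, 0.1), (3, 1.1), (4, 0.2), (5, 1.3), (7, 1.5), (9, -0.8)]
```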

SLIDE 28

Technique 1: Pipelined Multiply and Merge

Merge Tree: a hierarchy of mergers (comparator arrays) combines the sorted output streams of the multiplier array before anything reaches DRAM.

Multiplier array output streams:
A: (24)(26)(31)(52)(54)(56)(57)(58)(73)(75)
B: (22)(28)(42)(44)(46)(47)(48)
C: (11)(13)(15)(21)(23)(25)(41)(43)(45)
D: (12)(14)(16)(17)(18)(32)(34)(36)(37)(38)(72)

[Figure: two-level merge tree emitting (11, 12), then (13, 14), …]
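Functionally, the merge tree is a multi-way merge of sorted streams built from smaller mergers. A minimal sketch using the four streams from the slide, with Python's heapq.merge standing in for one hardware merger:

```python
from heapq import merge

# Sorted output streams of the multiplier array (from the slide):
A = [24, 26, 31, 52, 54, 56, 57, 58, 73, 75]
B = [22, 28, 42, 44, 46, 47, 48]
C = [11, 13, 15, 21, 23, 25, 41, 43, 45]
D = [12, 14, 16, 17, 18, 32, 34, 36, 37, 38, 72]

# Level 1: two mergers run in parallel; Level 2: one merger combines them.
ab = merge(A, B)
cd = merge(C, D)
result = list(merge(ab, cd))
print(result[:6])   # [11, 12, 13, 14, 15, 16] -- first outputs of the tree
```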

SLIDE 29

Technique 1: Pipelined Multiply and Merge

[Figure: DRAM → Multiplier Array → Merger → DRAM, with no partial-matrix traffic in between]

Ideally, partial matrices will not be stored in DRAM.

SLIDE 30

Technique 1: Pipelined Multiply and Merge

However… there are up to 2¹⁸ partial matrices, and the merger is limited to 64-way merging: too many rounds.

SLIDE 31

Technique 1: Pipelined Multiply and Merge

[Chart: Breakdown of Memory Access (0 to 8 GB), Baseline (OuterSPACE) vs. Pipelined Multiply and Merge; components: Read Left Matrix, Read Right Matrix, R/W Intermediate Results, Write Final Result]

[Figure: Distribution of DRAM access (# Load and Store vs. Partial Matrix Size in #Non-zeros), OuterSPACE baseline vs. after pipelining]

SLIDE 32

Technique 2: Matrix Condensing

SLIDE 33

Technique 2: Matrix Condensing

[Figure: Condensed Matrix B′ (CSR) beside Right Matrix C (CSR); the non-zeros of each row of the left matrix are packed into the leftmost columns]

SLIDE 34

Technique 2: Matrix Condensing

Before condensing: up to 2¹⁸ partial matrices. After condensing: only 2¹-2¹⁴ partial matrices.
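A minimal sketch of matrix condensing, assuming the left matrix is given as a CSR-like list of rows; each non-zero keeps its original column index so it still selects the correct row of the right matrix. The helper name is hypothetical:

```python
def condense(rows):
    """Pack each row's non-zeros into the leftmost columns.
    rows: per-row sorted list of (col, val).
    Returns the condensed columns; condensed column j holds the
    j-th non-zero of every row, tagged with its original column."""
    width = max((len(r) for r in rows), default=0)
    condensed = [[] for _ in range(width)]
    for i, nonzeros in enumerate(rows):
        for j, (col, val) in enumerate(nonzeros):
            condensed[j].append((i, col, val))
    return condensed

rows = [[(0, 1.0), (5, 2.0)],        # row 0: non-zeros in columns 0 and 5
        [(3, 4.0)],                  # row 1: column 3
        [(2, 0.5), (7, 1.5)]]        # row 2: columns 2 and 7
print(len(condense(rows)))           # 2 partial matrices instead of 8
```

The number of partial matrices drops from the column count to the maximum number of non-zeros in any row, which is the 2¹⁸ → 2¹-2¹⁴ reduction above.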

SLIDE 35

Technique 2: Matrix Condensing

[Figure: DRAM → Multiplier Array → Merger → DRAM]

Fewer rounds.

SLIDE 36

Technique 2: Matrix Condensing

[Chart: Breakdown of Memory Access for Baseline (OuterSPACE), Pipelined Multiply and Merge, and Matrix Condensing; matrix condensing gives 5× less DRAM access than pipelining alone]

[Figure: Distribution of DRAM access, after pipelining vs. after matrix condensing]

SLIDE 37

Technique 3: Huffman Tree Scheduler

[Figure: number of partial matrices vs. partial matrix size (#Non-zeros)]

Most partial matrices contain few non-zeros.
SLIDE 38

Technique 3: Huffman Tree Scheduler

Partial matrices A-L with sizes (weights) 15, 15, 13, 12, 9, 7, 3, 2, 2, 2, 2, 2.

[Figure: a naive merge schedule produces intermediate nodes of weight 40, 21, and 8 under the root of weight 84]

# DRAM accesses of intermediate partial results = Σ_{k is an intermediate node} weight[k] = 40 + 21 + 8 = 69

SLIDE 39

Technique 3: Huffman Tree Scheduler

Partial matrices A-L with sizes (weights) 15, 15, 13, 12, 9, 7, 3, 2, 2, 2, 2, 2.

[Figure: Huffman merge schedule, smallest first: intermediate nodes 6 = 2+2+2, 13 = 6+3+2+2, 41 = 13+7+9+12, root 84 = 41+13+15+15]

# DRAM accesses of intermediate partial results = Σ_{k is an intermediate node} weight[k] = 6 + 13 + 41 = 60

SLIDE 40

Technique 3: Huffman Tree Scheduler

# DRAM accesses = Σ_{k is an intermediate node} weight[k]
                = Σ_{k is a node} weight[k] − Σ_{k is a leaf node} weight[k]
                = Σ_{k is a leaf node} weight[k] × depth[k] − Constant

Huffman coding minimizes Σ_{k is a symbol} weight[k] × length[k], so scheduling the merges as a Huffman tree minimizes the DRAM traffic of intermediate partial results.
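A minimal sketch of the resulting schedule, assuming an n-way merger (the slides' example uses 4-way): repeatedly merge the n smallest partial matrices, exactly as in n-ary Huffman coding. The function name and the zero-weight padding trick are illustrative:

```python
import heapq

def huffman_merge_cost(weights, ways=4):
    """Total size of all intermediate merge results (a proxy for their
    DRAM traffic) under a greedy Huffman schedule of `ways`-way merges."""
    heap = list(weights)
    heapq.heapify(heap)
    # Pad with zero-weight dummies so every merge can use the full fan-in.
    while (len(heap) - 1) % (ways - 1) != 0:
        heapq.heappush(heap, 0)
    cost = 0
    while len(heap) > 1:
        merged = sum(heapq.heappop(heap) for _ in range(ways))
        if heap:                 # the final (root) merge is the result
            cost += merged       # matrix, which is written out anyway
        heapq.heappush(heap, merged)
    return cost

# The example from the slides: 12 partial matrices, 4-way merger.
print(huffman_merge_cost([15, 15, 13, 12, 9, 7, 3, 2, 2, 2, 2, 2]))  # 60
```

Running it on the slide's weights reproduces the Huffman schedule's cost of 60 (6 + 13 + 41) rather than the naive schedule's 69.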

SLIDE 41

Technique 3: Huffman Tree Scheduler

[Chart: Breakdown of Memory Access for Baseline (OuterSPACE), Pipelined Multiply and Merge, Matrix Condensing, and Huffman Tree Scheduler; components: Read Left Matrix, Read Right Matrix, R/W Intermediate Results, Write Final Result]

[Figure: Distribution of DRAM access, after matrix condensing vs. after Huffman scheduler]

SLIDE 42

Technique 3: Huffman Tree Scheduler

[Chart: same memory-access breakdown, OuterSPACE baseline vs. after Huffman scheduler]

[Figure: Distribution of DRAM access, OuterSPACE baseline vs. after Huffman scheduler]

SLIDE 43

Technique 4: Row Prefetcher

After condensing, access to the right matrix becomes irregular.

A cache to the rescue!
slide-44
SLIDE 44

Technique 4: Row Prefetcher

Accurately predict the access order ahead of time Prefetch the rows before used Rows of seocnd input Matrix B On-chip Buffer Predicted Order using Matrix 𝐵

44 0.0 KB 1.0 GB 2.0 GB 3.0 GB 4.0 GB 5.0 GB 6.0 GB 7.0 GB 8.0 GB Baseline (OuterSPACE) Pipelined Multiply and Merge Matrix Condensing Huffman Tree Scheduler Row Prefetch

Breakdown of Memory Access

Read Left Matrix Read Right Matrix R/W Intermediate Results Write Final Result
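The prediction is possible because the condensed left matrix already encodes, in order, which rows of the right matrix each step will touch. An illustrative model of that idea, assuming an unbounded buffer (the real prefetcher manages a finite on-chip buffer and eviction):

```python
def predicted_fetch_order(condensed_cols):
    """Scan the condensed left matrix ahead of the multiplier and emit
    the right-matrix rows in exactly the order they will be consumed."""
    order, seen = [], set()
    for col in condensed_cols:            # processing order of the partials
        for _row, orig_col, _val in col:  # the original column index selects
            if orig_col not in seen:      # a row of the right matrix
                order.append(orig_col)
                seen.add(orig_col)
    return order                          # FIFO consumed by the fetcher

# Condensed columns matching the condensing sketch above:
cols = [[(0, 0, 1.0), (1, 3, 4.0), (2, 2, 0.5)],   # condensed column 0
        [(0, 5, 2.0), (2, 7, 1.5)]]                # condensed column 1
print(predicted_fetch_order(cols))                 # [0, 3, 2, 5, 7]
```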

SLIDE 45

Evaluation Setup

Setup | OuterSPACE | Ours
Technology | TSMC 32nm | TSMC 40nm
Area | 87 mm² | 28.49 mm²
Power | 12.39 W | 9.26 W
DRAM | HBM @ 128 GB/s | HBM @ 128 GB/s

Benchmarks:
  • SuiteSparse Matrix Collection
  • Stanford Network Analysis Project (SNAP) dataset
SLIDE 46

Evaluation

On SuiteSparse & SNAP:
  • 4.2× speedup and 6× energy efficiency over the prior state of the art (OuterSPACE)

Platform | Performance (GFLOPS) | SpArch speedup
Armadillo (ARM) | 0.008 | 1285×
MKL (CPU) | 0.56 | 19×
cuSPARSE (GPU) | 0.60 | 18×
CUSP (GPU) | 0.63 | 17×
OuterSPACE (ASIC) | 2.5 | 4×
SpArch (ASIC) | 10.4 | -

SLIDE 47

Evaluation

On SuiteSparse & SNAP:
  • 4.2× speedup and 6× energy efficiency over the prior state of the art (OuterSPACE)

Platform | Power (W) | Energy Efficiency (MFLOPS/W) | SpArch gain
Armadillo (ARM) | 0.4 | 20 | 62×
MKL (CPU) | 74 | 7.5 | 164×
cuSPARSE (GPU) | 210 | 2.8 | 435×
CUSP (GPU) | 157 | 4.0 | 307×
OuterSPACE (ASIC) | 12.4 | 202 | 6×
SpArch (ASIC) | 8.5 | 1224 | -
SLIDE 48

Evaluation

On RMAT matrices:
  • 10×-30× faster than Intel CPU (MKL)
  • Higher scalability on ultra-sparse matrices
  • MKL performance degradation: 5.9×; ours: 2.7×

SLIDE 49

Evaluation

Area Breakdown: Merge Tree 60.6%, Row Prefetcher 20.4%, Column Fetcher 9.3%, Partial Mat Writer 8.2%, Multiplier Array 1.6%

Power Breakdown: Merge Tree 55.4%, HBM 26.2%, Row Prefetcher 13.5%, Partial Mat Writer 2.8%, Column Fetcher 1.2%, Multiplier Array 0.9%

SLIDE 50

Conclusion

  • SpMM is an important primitive (graphs, sparse neural networks)
  • SpArch accelerates SpMM by using a spatial merge array and a Huffman tree scheduler to reduce the DRAM access of partial matrices
  • It achieves 4.2× speedup and 6× energy efficiency over the prior state of the art (OuterSPACE)

sparch.mit.edu

SLIDE 51

Thank you!