SpArch: Efficient Architecture for Sparse Matrix Multiplication - PowerPoint PPT Presentation

sparch.mit.edu SpArch: Efficient Architecture for Sparse Matrix Multiplication Zhekai Zhang* 1 , Hanrui Wang * 1 , Song Han 1 , William J. Dally 2 1 Massachusetts Institute of Technology 2 Stanford University / Nvidia *Equal Contributions 1

Accelerate Sparse Matrix Multiplication Graph Computing Dimension ≈ 4×10 ! ≈ 8×10 "! Sparsity Compressed Neural Networks Dimension ≈ 10 ## ≈ 10 "$ Sparsity 2

CPUs and GPUs are Slow and Under-Utilized for SpMM MKL on Intel cuSPARSE on CUSP on TITAN Armadillo on Double Precision Core i7-5930K TITAN Xp Xp Arm Cortex-A53 SpMM Average 0.560 0.595 0.631 0.00813 GFLOPS on 20 benchmarks 1,2 Theoretical 289 343 343 5.47 GFLOPS 3,4,5 0.194% 0.173% 0.184% 0.149% Utilization 1 Leskovec, Jure, and Rok Sosi č . "Snap: A general-purpose network analysis and graph-mining library." ACM Transactions on Intelligent Systems and Technology (TIST) 8.1 (2016): 1. 2 Davis, Timothy A., and Yifan Hu. "The University of Florida sparse matrix collection." ACM Transactions on Mathematical Software (TOMS) 38.1 (2011): 1. 3 https://www.pugetsystems.com/labs/hpc/Linpack-performance-Haswell-E-Core-i7-5960X-and-5930K-594/ 4 https://www.techpowerup.com/gpu-specs/titan-x-pascal.c2863 5 http://web.eece.maine.edu/~vweaver/group/machines.html 3

Challenges • Super-large • Limited on-chip memory Dimension ≈ 4×10 ! ≈ 8×10 "! Density Dimension ≈ 10 ## ≈ 10 "$ Density 4

Challenges • Super-large • Limited on-chip memory Dimension ≈ 4×10 ! ≈ 8×10 "! Density • Ultra-sparse • Low operational intensity • Limited memory bandwidth Dimension ≈ 10 ## ≈ 10 "$ Density 5

Background: Outer Product Intermediate Partial Matrix 𝑄 ! = 𝐵 :,! ×𝐶 !,: Right Matrix 𝐶 Result Matrix Multiply Phase 𝐷 = 𝐵𝐶 = ∑ ! 𝑄 ! Merge Phase Left Matrix 𝐵 6

Background: Outer Product Perfect input reuse: read input matrix only once Multiplier DRAM Merger DRAM DRAM Array Multiply Phase Merge Phase 7

Background: Outer Product Multiplier DRAM Merger DRAM DRAM Array Multiply Phase Merge Phase 8

Background: Outer Product Bad output reuse: intermediate results need storing to DRAM and loading back Multiplier DRAM Merger DRAM DRAM Array Multiply Phase Merge Phase 13

DRAM access of Intermediate Matrix in the baseline implementation Distribution of DRAM access • Baseline Implementation: OuterSPACE. # Load and Store • Row-wise Output Stationary • Each Intermediate Matrix has one- round of Store and Load. Partial Matrix Size (#Non-zeros) Pal, Subhankar, et al. "Outerspace: An outer product based sparse matrix multiplication accelerator." (HPCA) . IEEE, 2018. 14

Key idea: reduce both input and partial matrix DRAM access Algorithm: Outer Product Technique 1: Pipelined Multiply and Merge Technique 2: Matrix Condensing Technique 4: Technique 3: Row Prefetcher Huffman Tree Scheduler 15

Multiply and Merge 0.1 0.5 0.6 0.1 1.1 0.2 0.3 1.3 1.2 0.2 1.3 1.5 + = -0.8 2.2 -0.8 2.2 1.2 1.1 1.1 1.2 Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 MatA 0 0.1 0 0.5 0.2 0 0 0.3 0 0 0 0 0 1.2 0 0 MatB 0 0 0 0.6 0 1.3 0 1.2 0 -0.8 2.2 0 1.1 0 0 0 Element- 0 0.1 0 1.1 0.2 1.3 0 1.5 0 -0.8 2.2 0 1.1 1.2 0 0 wise Add 16

Merge Phase MatA: (1,0.1) (3, 0.5) (4, 0.2) (7, 0.3) (13, 1.2) MatB: (3, 0.6) (5,1.3) (7, 1.2) (9, -0.8) (10, 2.2) (12, 1.1) Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 MatA 0 0.1 0 0.5 0.2 0 0 0.3 0 0 0 0 0 1.2 0 0 MatB 0 0 0 0.6 0 1.3 0 1.2 0 -0.8 2.2 0 1.1 0 0 0 Element- 0 0.1 0 1.1 0.2 1.3 0 1.5 0 -0.8 2.2 0 1.1 1.2 0 0 wise Add 17

Merge Phase MatA: (1,0.1) (3, 0.5) (4, 0.2) (7, 0.3) (13, 1.2) MatB: (3, 0.6) (5,1.3) (7, 1.2) (9, -0.8) (10, 2.2) (12, 1.1) ptr0 [1, 3, 4, 7…] [3, 5, 7, 9…] ptr1 18

Merge Phase MatA: (1,0.1) (3, 0.5) (4, 0.2) (7, 0.3) (13, 1.2) MatB: (3, 0.6) (5,1.3) (7, 1.2) (9, -0.8) (10, 2.2) (12, 1.1) ptr0 Merge results [1, 3, 4, 7…] [1] [3, 5, 7, 9…] ptr1 19

Merge Phase MatA: (1,0.1) (3, 0.5) (4, 0.2) (7, 0.3) (13, 1.2) MatB: (3, 0.6) (5,1.3) (7, 1.2) (9, -0.8) (10, 2.2) (12, 1.1) ptr0 Merge results [1, 3, 4, 7…] [1,3,3] [3, 5, 7, 9…] ptr1 20

Merge Phase MatA: (1,0.1) (3, 0.5) (4, 0.2) (7, 0.3) (13, 1.2) MatB: (3, 0.6) (5,1.3) (7, 1.2) (9, -0.8) (10, 2.2) (12, 1.1) ptr0 Merge results [1, 3, 4, 7…] [1,3,3,4] [3, 5, 7, 9…] ptr1 21

Merge Phase MatA: (1,0.1) (3, 0.5) (4, 0.2) (7, 0.3) (13, 1.2) MatB: (3, 0.6) (5,1.3) (7, 1.2) (9, -0.8) (10, 2.2) (12, 1.1) ptr0 Merge results [1, 3, 4, 7…] [1,3,3,4,5] [3, 5, 7, 9…] ptr1 22

Merge Phase MatA: (1,0.1) (3, 0.5) (4, 0.2) (7, 0.3) (13, 1.2) MatB: (3, 0.6) (5,1.3) (7, 1.2) (9, -0.8) (10, 2.2) (12, 1.1) ptr0 Merge results [1, 3, 4, 7…] [1,3,3,4,5,7,7] [3, 5, 7, 9…] ptr1 23

Technique 1: Pipelined Multiply and Merge −∞ 1 3 4 7 +∞ −∞ 3 Index 0 1 2 3 4 5 6 7 8 9 5 MatA 0 0.1 0 0.5 0.2 0 0 0.3 0 0 7 MatB 9 Element- +∞ wise Add Paralleled in space instead of serialized in time 24

Technique 1: Pipelined Multiply and Merge −∞ 1 3 4 7 +∞ −∞ 3 Index 0 1 2 3 4 5 6 7 8 9 5 MatA 7 MatB 0 0 0 0.6 0 1.3 0 1.2 0 -0.8 9 Element- +∞ wise Add Paralleled in space instead of serialized in time 25

Technique 1: Pipelined Multiply and Merge −∞ 1 3 4 7 +∞ −∞ < < < < 3 ≥ ≥ ≥ < < < Index 0 1 2 3 4 5 6 7 8 9 5 ≥ ≥ ≥ ≥ < < MatA 0 0.1 0 0.5 0.2 0 0 0.3 0 0 7 ≥ ≥ ≥ ≥ ≥ < MatB 0 0 0 0.6 0 1.3 0 1.2 0 -0.8 9 ≥ ≥ ≥ ≥ ≥ < Element- +∞ ≥ ≥ ≥ ≥ wise Add 1 1 1 3 3 0 3 3 3 4 4 4 5 5 5 0 7 7 7 7 7 9 9 9 Add values of same indices 1 3 4 5 7 9 0.1 1.1 0.2 1.3 1.5 -0.8 26

Technique 1: Pipelined Multiply and Merge −∞ 1 3 4 7 +∞ −∞ < < < < 3 ≥ ≥ ≥ < < < Index 0 1 2 3 4 5 6 7 8 9 5 ≥ ≥ ≥ ≥ < < MatA 0 0.1 0 0.5 0.2 0 0 0.3 0 0 7 ≥ ≥ ≥ ≥ ≥ < MatB 0 0 0 0.6 0 1.3 0 1.2 0 -0.8 9 ≥ ≥ ≥ ≥ ≥ < Element- 0 0.1 0 1.1 0.2 1.3 0 1.5 0 -0.8 +∞ ≥ ≥ ≥ ≥ wise Add 1 0 3 4 5 0 7 9 Add values of same indices Get the results in one clock cycle 1 3 4 5 7 9 0.1 1.1 0.2 1.3 1.5 -0.8 27

Technique 1: Pipelined Multiply and Merge −∞ 1 3 4 7 +∞ 11, 12 DRAM −∞ < < < < 13 14 3 ≥ ≥ ≥ < < < Merger (Comparator Array) 5 ≥ ≥ ≥ ≥ < < 22 24 15 16 7 ≥ ≥ ≥ ≥ ≥ < Merger (Comparator Array) 9 ≥ ≥ ≥ ≥ ≥ < 26 31 28 42 21 23 17 19 +∞ ≥ ≥ ≥ ≥ A B C D 1 0 3 4 5 0 7 9 Multiplier Array Merge Tree Add values of same indices A: (24)(26)(31)(52)(54)(56)(57)(58)(73)(75) 1 3 4 5 7 9 B: (22)(28)(42)(44)(46)(47)(48) C: (11)(13)(15)(21)(23)(25)(41)(43)(45) 0.1 1.1 0.2 1.3 1.5 -0.8 D: (12)(14)(16)(17)(18)(32)(34)(36)(37)(38)(72) 28

Technique 1: Pipelined Multiply and Merge Ideally, partial matrices will not be stored on DRAM. Multiplier Merger DRAM DRAM Array 29

Technique 1: Pipelined Multiply and Merge Too Many rounds However…. Multiplier Merger DRAM DRAM Array Up to 𝟐𝟏 𝟖 Limited to 64-way partial merging matrices 30

Technique 1: Pipelined Multiply and Merge Distribution of DRAM access # Load and Store # Load and Store Partial Matrix Size (#Non-zeros) Partial Matrix Size (#Non-zeros) After pipeline OuterSPACE baseline Breakdown of Memory Access 0.0 KB 1.0 GB 2.0 GB 3.0 GB 4.0 GB 5.0 GB 6.0 GB 7.0 GB 8.0 GB Baseline (OuterSPACE) Pipelined Multiply and Merge Read Left Matrix Read Right Matrix R/W Intermediate Results Write Final Result 31

Technique 2: Matrix Condensing 32

Technique 2: Matrix Condensing Right Matrix 𝐶 (CSR) Condensed Matrix 𝐵′ (CSR) 33

Technique 2: Matrix Condensing Before Condensing: After Condensing: Up to 𝟐𝟏 𝟖 partial matrices Only 𝟐𝟏~𝟐𝟏 𝟒 partial matrices 34

Technique 2: Matrix Condensing Fewer rounds Multiplier DRAM Merger DRAM DRAM Array 35

Technique 2: Matrix Condensing Distribution of DRAM access # Load and Store # Load and Store Partial Matrix Size (#Non-zeros) Partial Matrix Size (#Non-zeros) After Matrix Condensing After pipeline Breakdown of Memory Access 0.0 KB 1.0 GB 2.0 GB 3.0 GB 4.0 GB 5.0 GB 6.0 GB 7.0 GB 8.0 GB Baseline (OuterSPACE) Pipelined Multiply and Merge 5x less Matrix Condensing Read Left Matrix Read Right Matrix R/W Intermediate Results Write Final Result 36

SpArch: Efficient Architecture for Sparse Matrix Multiplication - PowerPoint PPT Presentation

sparch.mit.edu SpArch: Efficient Architecture for Sparse Matrix Multiplication Zhekai Zhang* 1 , Hanrui Wang * 1 , Song Han 1 , William J. Dally 2 1 Massachusetts Institute of Technology 2 Stanford University / Nvidia *Equal Contributions 1

High-performance and Memory-saving Sparse General Matrix-Matrix Multiplication for Pascal GPU

Sparse Matrices Example Of Sparse Matrices diagonal tridiagonal sparse many elements are

Exploiting Matrix Reuse and Data Locality in Sparse Matrix-Vector and Matrix-Transpose-Vector

Parallel Sparse Matrix-Vector and Matrix- Transpose-Vector Multiplication using Compressed Sparse

Sparse Matrix Partitioning, Reordering and Vector Multiplication Albert-Jan Yzelman, Utrecht

[3] The Matrix What is a matrix? Traditional answer Neo: What is the Matrix? Trinity: The answer

Matrix Multiplication Matrix Multiplication via Matrix-Vector Mult Defn. If matrix A is m n

CNBC Matlab Mini-Course Sparse Matrices Sparse matrices provide an efficient means to store

A User-Friendly Hybrid Sparse Matrix Class in C++ Conrad Sanderson, Ryan R. Curtin July 19, 2018

Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of

The Input/Output Complexity of Sparse Matrix Multiplication Rasmus Pagh, Morten St ockel IT

Sparse Matrices sparse many elements are zero dense few elements are zero Example Of

Multiresolution Algorithms for Sparse Matrix Representation By: Mario Barela Mentor: Professor

Building an IoT Platform with Matrix matthew@matrix.org http://www.matrix.org What is Matrix?

Introductory Matrix Operations Matrix Entries Defn. For matrix A , notation a ij means the en-

Gov 2000: 10. Multiple Regression in Matrix Form Matthew Blackwell Fall 2016 1 / 64 1. Matrix

The rational group structure of modular Jacobians with applications to torsion points on elliptic

On Eigenvalues of Geometrically Finite Hyperbolic Manifolds with Infinite Volume Xiaolong Hans

Local to global formulas in geometry and number theory Gerard Freixas i Montplet C.N.R.S.

Gluon scattering amplitudes/Wilson loops duality in gauge theories Gregory Korchemsky

Modelling Biochemical Reaction Networks Lecture 21: Phase diagrams Marc R. Roussel Department

Statistics of ionospheric disturbances and their correlation with GNSS positioning errors at high

RESEARCH IN CALIFORNIA DUAL ENROLLMENT: BARRIERS, INNOVATIONS, POLICY LANDSCAPE California

The Impact of Computing on Noncongruence Modular Forms ANTS X, San Diego July 10, 2012 Winnie