High-performance and Memory-saving Sparse General Matrix-Matrix Multiplication for Pascal GPU
Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka Tokyo Institute of Technology
Sparse General Matrix-Matrix Multiplication (SpGEMM)
■ Numerical applications, graph processing
– AMG method, graph clustering
■ Low performance
– Non-zero pattern of output matrix is unknown before execution
■ Accumulate intermediate products into one non-zero element
■ Hard to manage memory allocation
[Figure: SpGEMM example C = A x B; each output entry accumulates intermediate products (e.g. ae + bh), and the sparse matrices are stored in CSR format with value, column, and row pointer arrays]
Sparse Accumulator (SPA) [Gilbert, SIAM1992]
[Figure: SPA accumulates the intermediate products of one output row (e.g. ah + bk and ai + bl for the 0th row) into dense value and index arrays, with bit flags marking occupied columns; the finished row is then stored to the output matrix in sparse format]
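As a reference point, here is a minimal sequential sketch of the SPA idea for one output row, assuming CSR inputs; the array and function names are illustrative, not Gilbert's original code:

```cpp
// Sketch of SPA for one row i of C = A * B with CSR inputs (0-based indices).
// "spa" is a dense value array of length #columns, "flag" marks occupied
// columns, and "idx" records which columns were touched.
void spa_row(int i,
             const int *a_rpt, const int *a_col, const double *a_val,
             const int *b_rpt, const int *b_col, const double *b_val,
             double *spa, bool *flag, int *idx, int &nnz_row)
{
    nnz_row = 0;
    for (int j = a_rpt[i]; j < a_rpt[i + 1]; ++j) {        // non-zeros of row i of A
        int k = a_col[j];                                  // column of A = row of B
        for (int l = b_rpt[k]; l < b_rpt[k + 1]; ++l) {
            int c = b_col[l];                              // target column in C
            if (!flag[c]) {                                // first product for column c
                flag[c] = true;
                spa[c] = 0.0;
                idx[nnz_row++] = c;
            }
            spa[c] += a_val[j] * b_val[l];                 // accumulate intermediate product
        }
    }
    // The nnz_row columns in idx[] (values in spa[]) form row i of C in sparse
    // form; after storing them, the touched flag[] entries are reset.
}
```

The GPU algorithm presented in this talk replaces these dense per-row arrays with a per-row hash table held in shared memory.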
■ Non-zero pattern of output is unknown before execution
– Cannot allocate exact memory space for output before execution
■ Two ways to allocate memory for the output
– 1-phase
■ Allocate a sufficiently large memory space for the output
– 2-phase
■ Count #non-zero elements of the output, then allocate memory for the output
Approach | Computation cost | Memory usage | Libraries
1-phase | Low | Large | CUSP, BHSPARSE
2-phase | High | Small | cuSPARSE
■ Massive parallelism
– Simple row/column-based parallelization causes load imbalance
■ Computation cost differs largely between rows/columns
■ Difficulty of memory management
– Small global memory
■ Up to 16GB (P100 GPU)
– Hierarchical memory
■ Shared memory (fast, but only 64KB/SM on P100)
■ We propose a GPU-optimized fast SpGEMM algorithm with low memory usage
– Efficiently manage the column indices of the output matrix and accumulate intermediate products with a hash table
■ Utilize GPU’s shared memory for the hash table
– Make row groups by the number of non-zero elements or intermediate products to improve load balance
– Evaluate the performance of SpGEMM on the University of Florida Sparse Matrix Collection
■ Speedup is up to x4.3 in single precision and x4.4 in double precision
■ Memory usage is reduced by
– 14.7% in single precision
– 10.9% in double precision
■ ESC Algorithm [Bell, SIAM2012]
– Expansion: generate the list of all intermediate products
– Sorting: sort by column and row indices
– Contraction: accumulate intermediate products
– Each part can be executed with high parallelism
■ Overall performance is low since ESC requires many memory accesses and a large memory space
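For reference, the Sorting and Contraction steps map naturally onto Thrust primitives. A hedged sketch, assuming the Expansion step has already produced one (row, col) key (encoded in a single integer) and one value per intermediate product; the names and key encoding are illustrative:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Sort intermediate products by their (row, col) key, then sum products that
// share a key. out_keys/out_vals are assumed pre-sized to keys.size().
void esc_sort_contract(thrust::device_vector<long long> &keys,
                       thrust::device_vector<double>    &vals,
                       thrust::device_vector<long long> &out_keys,
                       thrust::device_vector<double>    &out_vals)
{
    // Sorting: order all intermediate products by row-major (row, col) key
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // Contraction: accumulate products with identical keys
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      out_keys.begin(), out_vals.begin());
    out_keys.resize(ends.first  - out_keys.begin());
    out_vals.resize(ends.second - out_vals.begin());
}
```

Every intermediate product is materialized before contraction, which is why ESC needs both a large temporary buffer and a lot of memory traffic.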
■ BHSPARSE [Liu, IPDPS2014]
– For irregular matrices
– Binning by the number of intermediate products per row
■ Switch the accumulation algorithm by bin
– Heap method, bitonic ESC method, mergepath
■ Better load-balance
■ Balanced Hash [Anh, ICS’16]
– Improve load balance
■ Worklist: pair of indices for computation of intermediate products
– Worklist is stored on global memory
– Improve the process of accumulation
■ Use hash table
– Fixed size of hash table on shared memory
■ Wastes shared memory when the number of non-zeros is small
– When a hash collision occurs, the products are added to a queue
■ Store the calculated elements in the table to memory, refresh the table, and then process the products in the queue
■ Repeat until the queue becomes empty
■ Additional memory usage and memory accesses for the queue
Key Points
– (1 - 4): Count #non-zero elements of output matrix
– (6 - 7): Calculate output matrix
– Minimize the usage of memory
(1) Count #intermediate products
(2) Divide the rows into groups by #intermediate products
(3) Count #non-zero elements
(4) Set row pointers of output matrix
(5) Memory allocation of output matrix
(6) Divide the rows into groups by #non-zero elements
(7) Compute the output matrix
a. Calculate values and column indices on hash table
b. Shrink the hash table
c. Store to the memory with sorting
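On the host, steps (4) and (5) reduce to an exclusive scan of the per-row counts followed by an exact allocation. A minimal sketch, assuming device-resident CSR inputs; the commented-out launches stand in for the group-wise kernels of steps (1)-(3) and (6)-(7) and are not the actual nsparse kernels:

```cpp
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

// Two-phase SpGEMM skeleton: count, scan, allocate exactly, then compute.
void spgemm_two_phase(int m,
                      const int *a_rpt, const int *a_col, const double *a_val,
                      const int *b_rpt, const int *b_col, const double *b_val,
                      int **c_rpt, int **c_col, double **c_val)
{
    cudaMalloc(c_rpt, sizeof(int) * (m + 1));

    // (1)-(3): symbolic phase writes the non-zero count of each output row
    //          into the first m entries of *c_rpt (hypothetical kernel).
    // count_nnz_per_row<<<grid, block>>>(m, a_rpt, a_col, b_rpt, b_col, *c_rpt);

    // (4): exclusive scan turns per-row counts into row pointers;
    //      entry m then holds the total number of non-zeros of C.
    thrust::device_ptr<int> rpt(*c_rpt);
    thrust::exclusive_scan(rpt, rpt + m + 1, rpt);

    // (5): allocate the output column/value arrays with the exact size.
    int nnz_c = 0;
    cudaMemcpy(&nnz_c, *c_rpt + m, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMalloc(c_col, sizeof(int) * nnz_c);
    cudaMalloc(c_val, sizeof(double) * nnz_c);

    // (6)-(7): numeric phase fills *c_col and *c_val (hypothetical kernel).
    // compute_output<<<grid, block>>>(m, a_rpt, a_col, a_val,
    //                                 b_rpt, b_col, b_val,
    //                                 *c_rpt, *c_col, *c_val);
}
```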
Key Points
– Hash table is allocated on fast shared memory
– Improve load balance by appropriate thread assignment
– Better utilization of shared memory by coordinating hash table size
Count #intermediate products / Grouping
■ Rows are divided into several groups by #intermediate products or non-zero elements
– Improve the load balance
– Utilize shared memory
– #intermediate products is an upper bound of #non-zero elements
■ The cost of counting #intermediate products is relatively small (see the sketch below)
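A sketch of step (1) with one thread per row; the kernel name is illustrative. The counting only reads the row pointers of B, which is why it is cheap:

```cpp
// One thread per row of A: the number of intermediate products of row i is the
// sum of the lengths of the rows of B selected by the non-zeros of A's row i.
__global__ void count_intermediate_products(int m,
                                            const int *a_rpt, const int *a_col,
                                            const int *b_rpt,
                                            int *nprod)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;

    int count = 0;
    for (int j = a_rpt[i]; j < a_rpt[i + 1]; ++j) {
        int k = a_col[j];                      // column of A = row of B
        count += b_rpt[k + 1] - b_rpt[k];      // length of that row of B
    }
    nprod[i] = count;                          // rows are then binned by this value
}
```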
Count #Non-zero Elements / Compute the output
■ Two-way thread assignment and memory access to input matrices for load balance
– Appropriate thread assignment for both dense and sparse rows
■ Column indices of output matrix are managed by hash table
– Tables are on shared memory
■ CUDA kernel for each group
– In order to execute concurrently, each kernel is assigned to a different CUDA stream (see the sketch below)
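A minimal sketch of the stream setup, assuming one kernel configuration per group; the launch itself is left as a comment because grid/block sizes and kernels differ per group:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Give every row group its own stream so the per-group kernels can overlap.
void launch_group_kernels(const std::vector<int> &group_sizes)
{
    std::vector<cudaStream_t> streams(group_sizes.size());
    for (size_t g = 0; g < streams.size(); ++g)
        cudaStreamCreate(&streams[g]);

    for (size_t g = 0; g < streams.size(); ++g) {
        if (group_sizes[g] == 0) continue;     // nothing to launch for empty groups
        // group_kernel<<<grid[g], block[g], 0, streams[g]>>>(...);  // hypothetical
    }

    cudaDeviceSynchronize();                   // wait for all groups to finish
    for (size_t g = 0; g < streams.size(); ++g)
        cudaStreamDestroy(streams[g]);
}
```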
Two-way Thread Assignment -1-
■ PWARP/ROW: Partial warp / row
– A partial warp (pwarp) is a bundle of 4 threads
– 1 pwarp for each row of matrix A, and 1 thread for each non-zero element of A and the corresponding row of B
– Selected for the groups with sparser rows
[Figure: PWARP/ROW thread assignment, with each 4-thread pwarp covering one row of A]
Two-way Thread Assignment -2-
■ TB/ROW: Thread block / row
– Assign 1 thread block (TB) for each row of matrix A, 1 warp for each non-zero element of A, and 1 thread for each non-zero element of B
– Selected for the groups with denser rows
[Figure: TB/ROW thread assignment, with one thread block per row of A and one warp per non-zero element of A]
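To make the two assignments concrete, here is an illustrative skeleton of the PWARP/ROW index mapping (TB/ROW is analogous, with a block per row of A, a warp per non-zero of A, and a thread per non-zero of B); the kernel and the hash-insert placeholder are illustrative, not the authors' exact code:

```cpp
#define PWARP 4   // a "partial warp" is a bundle of 4 threads

// Each 4-thread pwarp owns one row of A; within the pwarp, each thread takes
// every 4th non-zero of that row and walks the corresponding row of B.
__global__ void pwarp_row_skeleton(int m,
                                   const int *a_rpt, const int *a_col,
                                   const int *b_rpt, const int *b_col)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int row  = tid / PWARP;                 // one pwarp per row of A
    int lane = tid % PWARP;                 // this thread's position in its pwarp
    if (row >= m) return;

    for (int j = a_rpt[row] + lane; j < a_rpt[row + 1]; j += PWARP) {
        int k = a_col[j];                   // one thread per non-zero of A
        for (int l = b_rpt[k]; l < b_rpt[k + 1]; ++l) {
            int c = b_col[l];               // one intermediate product per iteration
            // ... insert column c into the row's hash table (next slides) ...
            (void)c;
        }
    }
}
```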
Hash Table
[Figure: hash table for the 0th row of the output; column indices 1 and 2 both hash to entry 0 (hash(1)=0, hash(2)=0), so the second insert collides]
■ Key is the column index of B
– If the entry is empty, add the element by compare-and-swap
– Each thread counts the number of elements it newly inserts
■ Linear probing
– When a hash collision occurs, the algorithm tries the next entry
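A hedged sketch of the counting-phase insert: compare-and-swap to claim an empty slot, linear probing on collision. The table size and hash function are illustrative; the real table size is chosen per row group and the table lives in shared memory, initialized to -1:

```cpp
#define TABLE_SIZE 256   // illustrative; assumed to be a power of two

// Returns 1 if the column index was newly inserted (a new non-zero of this
// row), 0 if it was already present. Each thread sums these return values to
// get its contribution to the row's non-zero count.
__device__ int hash_insert(int *table, int key /* column index of B */)
{
    int pos = key & (TABLE_SIZE - 1);                // hash(key) = key mod table size
    while (true) {
        int old = atomicCAS(&table[pos], -1, key);   // claim the slot if empty
        if (old == -1) return 1;                     // inserted a new column index
        if (old == key) return 0;                    // column already in the table
        pos = (pos + 1) & (TABLE_SIZE - 1);          // collision: linear probing
    }
}
```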
Count #non-zero elements
■ Accumulate the per-thread counts of non-zero elements for each row
– PWARP/ROW: utilize warp shuffle
– TB/ROW: accumulate by warp shuffle at warp level, and then accumulate the sum of each warp through shared memory
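A sketch of the warp-level accumulation with shuffle instructions (shown with the current `__shfl_down_sync` intrinsic; code written against the deck's CUDA 8.0 would use the older `__shfl_down`):

```cpp
// Sum the per-thread non-zero counts across a full warp (TB/ROW case); after
// the loop, lane 0 holds the warp's total, which is then combined with the
// other warps through shared memory. The PWARP/ROW variant reduces only
// across its 4 lanes (offsets 2 and 1).
__device__ int warp_reduce_sum(int count)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        count += __shfl_down_sync(0xffffffff, count, offset);
    return count;
}
```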
Compute the output matrix
■ Calculate values and column indices in the same way as counting #non-zero elements
– Allocate another hash table for the values
– Accumulate the values by atomicAdd
■ Shrink the table to hold only non-zero elements
■ Output with sorting by column index
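A sketch of the numeric-phase insert, reusing the counting-phase probing but with a second shared-memory table for values and atomicAdd for the accumulation; names and the power-of-two table size are illustrative:

```cpp
// Insert an intermediate product targeting column "key": claim or find the
// column's slot in col_table, then accumulate into val_table.
// col_table is initialized to -1 and val_table to 0.0 before the row starts.
__device__ void hash_accumulate(int *col_table, double *val_table,
                                int table_size, int key, double product)
{
    int pos = key & (table_size - 1);
    while (true) {
        int old = atomicCAS(&col_table[pos], -1, key);
        if (old == -1 || old == key) {               // slot belongs to this column
            atomicAdd(&val_table[pos], product);     // accumulate (double atomicAdd needs sm_60+)
            return;
        }
        pos = (pos + 1) & (table_size - 1);          // linear probing on collision
    }
}
```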
■ Pascal GPU Machine
– CPU : Intel Xeon CPU E5-2650 v3
– GPU : NVIDIA Tesla P100
■ SM : 56
■ CUDA cores : 3584 (64 / SM)
■ Memory size : 16 [GB]
■ Memory bandwidth : 732 [GB/sec]
■ ECC : Off
■ L2 cache size : 4 [MB]
– CUDA : Version 8.0
– OS : CentOS release 7.2.1511
■ Sparse Libraries
– cuSPARSE
■ CUDA 8.0 version
– CUSP : ESC algorithm [Dalton, 2014]
■ v0.5.1
– BHSPARSE [Liu, IPDPS2014]
■ Effective for irregular matrices
■ FLOPS Performance
– Evaluate the performance of A^2
■ FLOPS = #(intermediate products) × 2 / (execution time)
Florida Sparse Matrix Collection
Name | Rows | Non-zeros | Nnz/row | Max nnz/row | Intermediate products of A^2 | Nnz of A^2
Protein | 36,417 | 4,344,765 | 119.3 | 204 | 555,322,659 | 19,594,581
FEM/Spheres | 83,334 | 6,010,480 | 72.1 | 81 | 463,845,030 | 26,539,736
FEM/Cantilever | 62,451 | 4,007,383 | 64.2 | 78 | 269,486,473 | 17,440,029
FEM/Ship | 140,874 | 7,813,404 | 55.5 | 102 | 450,639,288 | 24,086,412
Wind Tunnel | 217,918 | 11,634,424 | 53.4 | 180 | 626,054,402 | 32,772,236
FEM/Harbor | 46,835 | 2,374,001 | 50.7 | 145 | 156,480,259 | 7,900,917
QCD | 49,152 | 1,916,928 | 39.0 | 39 | 74,760,192 | 10,911,744
FEM/Accelerator | 121,192 | 2,624,331 | 21.7 | 81 | 79,883,385 | 18,705,069
Economics | 206,500 | 1,273,389 | 6.2 | 44 | 7,556,897 | 6,704,899
Circuit | 170,998 | 958,936 | 5.6 | 353 | 8,676,313 | 5,222,525
Epidemiology | 525,825 | 2,100,225 | 4.0 | 4 | 8,391,680 | 5,245,952
webbase | 1,000,005 | 3,105,536 | 3.1 | 4700 | 69,524,195 | 51,111,996
cage15 | 5,154,859 | 99,199,551 | 19.2 | 47 | 2,078,631,615 | 929,023,247
wb-edu | 9,845,725 | 57,156,537 | 5.8 | 3841 | 1,559,579,990 | 630,077,764
cit-Patents | 3,774,768 | 16,518,948 | 4.4 | 770 | 82,152,992 | 68,848,721
The matrices are grouped into High-Throughput Matrix Data, Low-Throughput Matrix Data, and Large-size Graph Data.
(3) #intermediate products | (6) #non-zero elements | Assignment | Thread block size
8193 - | 4097 - | TB / ROW | 1024
4097 - 8192 | 2049 - 4096 | TB / ROW | 1024
2049 - 4096 | 1025 - 2048 | TB / ROW | 512
1025 - 2048 | 513 - 1024 | TB / ROW | 256
513 - 1024 | 257 - 512 | TB / ROW | 128
33 - 512 | 17 - 256 | TB / ROW | 64
0 - 32 | 0 - 16 | PWARP / ROW | 512
The hash tables of these groups are placed on shared memory, except for the largest group, whose table is placed on global memory.
High-Throughput Matrix Data
■ Proposal > cuSPARSE > BHSPARSE
– Speedup is up to x2.26
[Chart: GFLOPS of CUSP, cuSPARSE, BHSPARSE, and PROPOSAL for the High-Throughput matrices, single precision]
Low-Throughput Matrix Data
■ Proposal > BHSPARSE > cuSPARSE
■ Dividing rows into groups improves load balance for irregular matrices like ‘webbase’
– Speedup is up to x4.3
[Chart: GFLOPS of CUSP, cuSPARSE, BHSPARSE, and PROPOSAL for Economics, Circuit, Epidemiology, and webbase, single precision]
Double Precision
■ Similar performance trend as single precision
– Speedup is up to x2.1 for High-Throughput – Speedup is up to x4.4 for Low-Throughput
[Charts: double-precision GFLOPS of CUSP, cuSPARSE, BHSPARSE, and PROPOSAL for the High-Throughput matrices and for Economics, Circuit, Epidemiology, and webbase]
Large-size Graph Data
■ Our approach shows significant speedups for large size graph data
– BHSPARSE cannot handle ‘cage15’ and ‘wb-edu’ because of memory shortage
Precision | Matrix | CUSP | cuSPARSE | BHSPARSE | PROPOSAL | Speedup from cuSPARSE | Speedup from BHSPARSE
Single | cage15 | | | out of memory | | x11.5 | -
Single | wb-edu | | | out of memory | | x2.4 | -
Single | cit-Patents | 0.837 | 0.028 | 0.880 | 3.351 | x119.6 | x3.8
Double | cage15 | | | out of memory | | x11.6 | -
Double | wb-edu | | | out of memory | | x2.2 | -
Double | cit-Patents | 0.780 | 0.028 | 0.813 | 2.980 | x106.8 | x3.7
(Performance values in GFLOPS)
■ Lower memory usage compared to other sparse matrix libraries for all matrix data
– Compared to cuSPARSE, memory usage is reduced by 14.7% in single precision and 10.9% in double precision on average
– For the matrix data webbase, our proposal not only achieves better performance but also reduces memory usage by 67.7%
[Chart: memory usage in MByte (lower is better) of CUSP, cuSPARSE, BHSPARSE, and PROPOSAL for Protein, FEM/Spheres, and webbase in single and double precision]
■ We propose a fast and memory-saving SpGEMM algorithm for GPU
– Appropriate grouping and utilization of shared memory
– Performance evaluation against cuSPARSE and BHSPARSE
■ Speedups are up to x4.3 in single precision and x4.4 in double precision
■ Memory usage is reduced by 14.7% in single precision and 10.9% in double precision on average
■ For Low-Throughput matrix data, our algorithm achieves higher performance and reduces memory usage by 67.7%
■ Future work
– Evaluate on AMD GPU and Xeon Phi – Evaluate our SpGEMM algorithm in real-world application
■ This work is partially supported by JST CREST Grant Numbers JPMJCR1303 and JPMJCR1687
■ Source code of the proposed SpGEMM algorithm for GPU is released as the nsparse library
– https://github.com/EBD-CREST/nsparse
■ Calculation time is largely reduced compared to cuSPARSE
■ The grouping phase has little effect on total performance
■ On sparser matrices, cudaMalloc becomes the bottleneck
[Chart: execution time ratio (setup, count, calculation, cudaMalloc) of cuSPARSE and PROPOSAL for each matrix]