High-Performance Sparse Matrix-Matrix Products on Intel KNL and - PowerPoint PPT Presentation

High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures Yusuke Nagasaka † , Satoshi Matsuoka � † Ariful Azad ‡ , Aydın Buluç ‡ † Tokyo Institute of Technology � Riken Center for Computational Science ‡ Lawrence Berkeley National Laboratory

Sparse G General M Matrix-Matrix M Multiplication (S (SpGEMM) ■ Key kernel in graph processing and numerical applications – Markov clustering, Betweenness centrality, triangle counting, ... – Preconditioner for linear solver ■ AMG (Algebraic Multigrid) method – Time-consuming part ah+ ai+b a b h i bk l j c cj d k l dl dk e f g m eh fj gm ei Output Matrices Input Matrices 1

Accumulation o of i f intermediate p products Sparse A Accumulator ( (SPA) [ [Gilbert, S t, SIA IAM1992] a b h i ah+ ai+ bk bl c j cj d k l dk dl e f g m eh fj ei gm Input Matrices Output Matrices 0 0 0 0 value ah ai 0 2 a b h i 0 2 1 c j 3 bit flag 0 0 0 0 1 1 2 d k l 0 2 e f g 0 1 3 m 1 index 0 2 Value Column id SPA Input matrices in sparse format 2

Accumulation o of i f intermediate p products Sparse A Accumulator ( (SPA) [ [Gilbert, S t, SIA IAM1992] a b h i ah+ ai+ bk bl c j cj d k l dk dl e f g m eh fj ei gm Input Matrices Output Matrices ai ah 0 0 0 0 value + ah + ai 0 2 a b h i 0 2 bl bk 1 c j 3 bit flag 0 0 0 0 1 1 2 d k l 0 2 e f g 0 1 3 m 1 index 0 2 Value Column id SPA Input matrices in sparse format 3

Accumulation o of i f intermediate p products Sparse A Accumulator ( (SPA) [ [Gilbert, S t, SIA IAM1992] a b h i ah+ ai+ ah ai bk bl value + + c j cj bk bl d k l dk dl index e f g m eh fj ei gm 0 2 0 th row of Output Input Matrices Output Matrices ah ai 0 0 0 0 value + + ah ai 0 2 a b h i 0 2 bk bl 1 c j 3 bit flag 0 0 0 0 1 1 2 d k l 0 2 e f g 0 1 3 m 1 index 0 2 Value Column id SPA Input matrices in sparse format 4

Accumulation o of i f intermediate p products Sparse A Accumulator ( (SPA) [ [Gilbert, S t, SIA IAM1992] a b h i ah+ ai+ ah ai bk bl value J Efficient a accumulation o of i interm rmediate + + c j cj bk bl pr produ ducts: Lookup c d k l cost i is O O(1 (1) dk dl index e f g m eh fj ei gm 0 2 L Requires O(#columns) memory by one thread 0 th row of Output Input Matrices Output Matrices ah ai 0 0 0 0 value + + ah ai 0 2 a b h i 0 2 bk bl 1 c j 3 bit flag 0 0 0 0 1 1 2 d k l 0 2 e f g 0 1 3 m 1 index 0 2 Value Column id SPA Input matrices in sparse format 5

Existing a approaches f for S SpGEMM ■ Several sequential and parallel SpGEMM algorithms – Also packaged in software/libraries Algorithm ( (Library) Ac Accumulator So Sortedness (In Input/Output) MKL - Any/Select MKL-inspector - Any/Unsorted KokkosKernels HashMap Any/Unsorted Heap Heap Sorted/Sorte Hash Hash Table Any/Select 6

Existing a approaches f for S SpGEMM ■ Several sequential and parallel SpGEMM algorithms – Also packaged in software/libraries Questi Qu tions? (a) What is the best algorithm/implementation for a a Algorithm ( (Library) Accumulator Ac So Sotedness (In Input/Output) blem at hand? pr probl MKL - Any/Select MKL-inspector - Any/Unsorted (b) What is the best algorithm/implementation for t the KokkosKernels HashMap Any/Unsorted ture to be used in solving the problem? ar architectu Heap Heap Sorted/Sorte Hash Hash Table Any/Select 7

Cont Contrib ibut ution ion ■ We characterize, optimize and evaluate existing SpGEMM algorithms for real-world applications on modern Multi-core and Many-core architectures – Characterizing the performance of SpGEMM on shared- memory platforms ■ Intel Haswell and Intel KNL architectures ■ Identify b bottlenecks a and m mitigate t them – Evaluation including several use cases ■ A 2 , Square x Tall-skinny, L*U for triangle counting – Showing the impact o of k keeping u unsorted o output – A r recipe f for s selecting t the b best-performing a algorithm f for a s specific a application s scenario 8

Benchmark f for S SpGEMM Thread s scheduling c cost ■ Evaluates the scheduling cost on Haswell and KNL architectures – OpenMP: static, dynamic and guided ■ Scheduling cost hurts the SpGEMM performance 9

Benchmark f for S SpGEMM Memory a allocati tion/deallocati tion c cost ■ Identifies that allocation/deallocation of large memory space is expensive ■ Parallel memory allocation scheme – Each thread independently allocates/deallocates memory and accesses only its own memory space – For S SpGEMM, w we c can r reduce d deallocation c cost Parallel memory allocation Deallocation cost 10 10

Benchmark f for S SpGEMM Im Impact o t of M MCDRAM ■ MCDRAM provides high memory bandwidth – Obviously improves s stream b benchmark – Performance of stanza-like memory access is unclear ■ Small blocks of consecutive elements ■ Access to rows of B in SpGEMM Hard to get the benefits of MCDRAM on very sparse matrices in SpGEMM 11 11

Architecture S Specific O Optimization Thread s scheduling ■ Good l load-balance w with s static s scheduling – Assigning work to threads by FLOP – Work assignment can be efficiently executed in parallel ■ Counting required FLOP of each row ■ PrefixSum to get total FLOP of SpGEMM ■ Assigning rows to thread (Eg. shows the case of 3 threads) – Average FLOP = 11/3 0 a b h i 4 4 4 c j 1 1 5 k l 2 2 d 7 e f m g 4 4 11 Input Matrices FLOP PrefixSum 12 12

Architecture S Specific O Optimization Accumulator f for S Symbolic a and N Numeric P Phases ■ Optimizing algorithms for Intel architectures ■ Heap [Azad, 2016] – Priority queue indexed by column indices – Requires logarithmic time to extract elements – Space e efficient : O(nnz(a i* )) ■ Better cache utilization ■ Hash [Nagasaka, 2016] – Uses hash table for accumulator, based on GPU work ■ Low m memory u usage a and h high p performance – Each thread once allocates the hash table and reuses it – Extended t to Ha HashV hVector to e exploit w wide v vector r register 13 13

Architecture S Specific O Optimization Ha HashV shVector ■ Utilizing 256 and 512-bit wide vector register of Intel architectures for hash probing – Reduces t the n number o of p probing c caused b by h hash c collision – Requires a few more instructions for each check ■ Degrades the performance when the collisions in Hash are rare (a) Hash 1) Check the entry 2) If hash is collided, check next entry 3) If the entry is empty, add the element 2) If the element is not found and the (b) HashVector row has empty entry, add the element 1) Check multiple entries with vector register : element to be added : non-empty entry : empty entry 14 14

Performance Evaluation 15 15

Matrix D Data ■ Synthetic matrix – R-MAT, the recursive matrix generator – Two different non-zero patterns of synthetic matrices ■ ER ER : Erd ő s–Rényi random graphs ■ G500 G500 : Graphs with power-law degree distributions – Used for Graph500 benchmark – Scale n matrix: 2 n -by-2 n – Edge f factor : the average number of non-zero elements per row of the matrix ■ SuiteSparse Matrix Collection – 26 sparse matrices used in several past work 16 16

Evaluation E Envi vironment ■ Cori s sys ystem @ @NERSC – Haswell Cluster ■ Intel Xeon Processor E5-2698 v3 ■ 128GB DDR4 memory – KNL Cluster ■ Intel Xeon Phi Processor 7250 – 68 cores – 32KB/core L1 cache, 1MB/tile L2 cache – 16GB MCDRAM – Quadrant, cache ■ 96GB DDR4 memory – OS: SuSE Linux Enterprise Server 12 SP3 – Intel C++ Compiler (icpc) ver18.0.0 ■ -g –O3 -qopenmp 17 17

Benefit o of P f Performance O Optimization Scheduling a and m memory a allocati tion ■ Good l load b balance with static scheduling ■ For larger matrices, parallel memory allocation scheme keeps high performance A^2 of G500 matrices with edge factor=16 18 18

Benefit o of P f Performance O Optimization Use o of M MCDRAM ■ Benefit o of M MCDRAM e especially o y on d denser m matrices 19 19

Performance E Evaluation A^2 ^2: S : Scaling w with th d density ty ( (KNL, E , ER) ■ Scale = 16 ■ Different performance trends – Performance of MKL degrades with increasing density 20 20

Performance E Evaluation A^2 ^2: S : Scaling w with th d density ty ( (KNL, E , ER) ■ Performance g gain w with k keeping o output u unsorted 21 21

Performance E Evaluation A^2 ^2: S : Scaling w with th d density ty ( (KNL, G , G500) ■ Denser inputs do not simply bring performance gain – Different f from E ER m matrices 22 22

Performance E Evaluation A^2 ^2: S : Scaling w with th d density ty ( (Haswell) ■ HashVector achieves much higher performance 23 23

Performance E Evaluation A^2 ^2: S : Scaling w with th i input s t size ( (KNL, E , ER) ■ Edge factor = 16 ■ Hash and HashVector show good performance in any input size 24 24

High-Performance Sparse Matrix-Matrix Products on Intel KNL and - PowerPoint PPT Presentation

High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures Yusuke Nagasaka , Satoshi Matsuoka Ariful Azad , Aydn Bulu Tokyo Institute of Technology Riken Center for Computational

High-performance and Memory-saving Sparse General Matrix-Matrix Multiplication for Pascal GPU

Sparse Matrices Example Of Sparse Matrices diagonal tridiagonal sparse many elements are

Tensor-Matrix Products with a Compressed Sparse Tensor Shaden Smith George Karypis University

Parallel Sparse Matrix-Vector and Matrix- Transpose-Vector Multiplication using Compressed Sparse

Sparse Matrix Partitioning, Reordering and Vector Multiplication Albert-Jan Yzelman, Utrecht

Sparse Matrices sparse many elements are zero dense few elements are zero Example Of

Exploiting Matrix Reuse and Data Locality in Sparse Matrix-Vector and Matrix-Transpose-Vector

Parameter efficient training of deep convolutional neural networks by dynamic sparse

A User-Friendly Hybrid Sparse Matrix Class in C++ Conrad Sanderson, Ryan R. Curtin July 19, 2018

Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of

The Input/Output Complexity of Sparse Matrix Multiplication Rasmus Pagh, Morten St ockel IT

Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi Erik Saule 1 ,

Intel Case Intel Case Processor Serial Number (PSN) Processor Serial Number (PSN) 5/9/99 Group

Validation Labs with OpenStack Shuquan Huang, Intel IT Engineering Computing Weibo: @

5G Cloud Native from RAN to Core Christian Maciocco, Intel Shilpa Talwar, Intel Saikrishna

AFS at Intel AFS at Intel Travis Broughton Travis Broughton Agenda Agenda Intels

Department of Public Health, Environmental Health (DPH-EH)s Role Freddie Agyin, Branch

Chapter 2 Software Unit 2.2 Operating System 2019-09-27 1. An

A Fully Associative Software-Managed Cache Design Erik G. Hallnor and Steven K. Reindhardt

Composition of Markets with Conflicting Incentives Alex Peysakhovich & Mikkel Plagborg-Moller

On the Erds-Hajnal conjecture for trees Anita Liebenau Monash University joint work with

Measurement of IPv6 Extension Header Support Geoff Huston APNIC Labs IPv6 Extension Header The

Lecture 7: Term Structure Models Simon Gilchrist Boston Univerity and NBER EC 745 Fall, 2013

Tie public debt-to-GDP ratios for CEE countries have improved substantially Debt reduction PEAK

High-Performance Sparse Matrix-Matrix Products on Intel KNL and - PowerPoint PPT Presentation

High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures Yusuke Nagasaka , Satoshi Matsuoka Ariful Azad , Aydn Bulu Tokyo Institute of Technology Riken Center for Computational

High-performance and Memory-saving Sparse General Matrix-Matrix Multiplication for Pascal GPU

Sparse Matrices Example Of Sparse Matrices diagonal tridiagonal sparse many elements are

Tensor-Matrix Products with a Compressed Sparse Tensor Shaden Smith George Karypis University

Parallel Sparse Matrix-Vector and Matrix- Transpose-Vector Multiplication using Compressed Sparse

Sparse Matrix Partitioning, Reordering and Vector Multiplication Albert-Jan Yzelman, Utrecht

Sparse Matrices sparse many elements are zero dense few elements are zero Example Of

Exploiting Matrix Reuse and Data Locality in Sparse Matrix-Vector and Matrix-Transpose-Vector

Parameter efficient training of deep convolutional neural networks by dynamic sparse

A User-Friendly Hybrid Sparse Matrix Class in C++ Conrad Sanderson, Ryan R. Curtin July 19, 2018

Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of

The Input/Output Complexity of Sparse Matrix Multiplication Rasmus Pagh, Morten St ockel IT

Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi Erik Saule 1 ,

Intel Case Intel Case Processor Serial Number (PSN) Processor Serial Number (PSN) 5/9/99 Group

Validation Labs with OpenStack Shuquan Huang, Intel IT Engineering Computing Weibo: @

5G Cloud Native from RAN to Core Christian Maciocco, Intel Shilpa Talwar, Intel Saikrishna

AFS at Intel AFS at Intel Travis Broughton Travis Broughton Agenda Agenda Intels

Department of Public Health, Environmental Health (DPH-EH)s Role Freddie Agyin, Branch

Chapter 2 Software Unit 2.2 Operating System 2019-09-27 1. An

A Fully Associative Software-Managed Cache Design Erik G. Hallnor and Steven K. Reindhardt

Composition of Markets with Conflicting Incentives Alex Peysakhovich &amp; Mikkel Plagborg-Moller

On the Erds-Hajnal conjecture for trees Anita Liebenau Monash University joint work with

Measurement of IPv6 Extension Header Support Geoff Huston APNIC Labs IPv6 Extension Header The

Lecture 7: Term Structure Models Simon Gilchrist Boston Univerity and NBER EC 745 Fall, 2013

Tie public debt-to-GDP ratios for CEE countries have improved substantially Debt reduction PEAK

Composition of Markets with Conflicting Incentives Alex Peysakhovich & Mikkel Plagborg-Moller