High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures - PowerPoint PPT Presentation



SLIDE 1

High-Performance Sparse Matrix-Matrix Products on Intel KNL and Multicore Architectures

Yusuke Nagasaka†, Satoshi Matsuoka† Ariful Azad‡, Aydın Buluç‡

† Tokyo Institute of Technology / RIKEN Center for Computational Science ‡ Lawrence Berkeley National Laboratory

SLIDE 2

Sparse General Matrix-Matrix Multiplication (SpGEMM)

■ Key kernel in graph processing and numerical applications

– Markov clustering, betweenness centrality, triangle counting, ...
– Preconditioners for linear solvers

■ AMG (Algebraic Multigrid) method

– Time-consuming part

[Figure: two sparse input matrices and their product; sample output entries ah+bk, ai+bl, cj, dk, dl, eh, fj, ei, gm]
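The multiplication itself is row-wise: each nonzero a_ik in row i of A scales row k of B, and the scaled rows are summed into row i of the output (Gustavson's formulation). A minimal pure-Python sketch, with dict-of-dicts standing in for the sparse format (illustrative only; the algorithms on the following slides differ in how the accumulator `acc` is implemented):

```python
# Row-wise SpGEMM (Gustavson). A, B: {row: {col: value}} sparse matrices.
def spgemm(A, B):
    C = {}
    for i, a_row in A.items():
        acc = {}                        # accumulator for output row i
        for k, a_ik in a_row.items():   # nonzero a_ik selects row k of B
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C
```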

SLIDE 3

Accumulation of intermediate products

Sparse Accumulator (SPA) [Gilbert, SIAM 1992]

[Figure: SPA accumulating the 0th output row: dense value array, index list, and bit flags; input matrices stored as (value, column id) pairs]

SLIDE 4

Accumulation of intermediate products

Sparse Accumulator (SPA) [Gilbert, SIAM 1992]

[Figure: SPA accumulation example, continued]

SLIDE 5

Accumulation of intermediate products

Sparse Accumulator (SPA) [Gilbert, SIAM 1992]

[Figure: SPA accumulation example, continued]

SLIDE 6

Accumulation of intermediate products

Sparse Accumulator (SPA) [Gilbert, SIAM 1992]

[Figure: SPA accumulation example, continued]

Pro: efficient accumulation of intermediate products; lookup cost is O(1)
Con: requires O(#columns) memory per thread
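A sketch of the SPA idea under the same dict-of-dicts convention: a dense value array plus a flag array give O(1) accumulation per intermediate product, at the cost of O(#columns) memory per thread (names are illustrative, not the paper's code):

```python
def spa_row(a_row, B, ncols):
    """One output row via a Sparse Accumulator (SPA): dense values,
    dense occupied flags (the "bit flag"), and a list of touched columns."""
    values = [0.0] * ncols        # O(ncols) memory per thread
    occupied = [False] * ncols
    touched = []
    for k, a_ik in a_row.items():
        for j, b_kj in B.get(k, {}).items():
            if not occupied[j]:   # O(1) lookup per intermediate product
                occupied[j] = True
                touched.append(j)
            values[j] += a_ik * b_kj
    return {j: values[j] for j in touched}  # gather the sparse result
```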

SLIDE 7

Existing approaches for SpGEMM

■ Several sequential and parallel SpGEMM algorithms

– Also packaged in software/libraries

Algorithm (Library)   Accumulator   Sortedness (Input/Output)
MKL                   -             Any/Select
MKL-inspector         -             Any/Unsorted
KokkosKernels         HashMap       Any/Unsorted
Heap                  Heap          Sorted/Sorted
Hash                  Hash Table    Any/Select

SLIDE 8

Existing approaches for SpGEMM

■ Several sequential and parallel SpGEMM algorithms

– Also packaged in software/libraries

Algorithm (Library)   Accumulator   Sortedness (Input/Output)
MKL                   -             Any/Select
MKL-inspector         -             Any/Unsorted
KokkosKernels         HashMap       Any/Unsorted
Heap                  Heap          Sorted/Sorted
Hash                  Hash Table    Any/Select

Questions?

(a) What is the best algorithm/implementation for a problem at hand?
(b) What is the best algorithm/implementation for the architecture to be used in solving the problem?

SLIDE 9

Contribution

■ We characterize, optimize and evaluate existing SpGEMM algorithms for real-world applications on modern multi-core and many-core architectures

– Characterizing the performance of SpGEMM on shared-memory platforms

■ Intel Haswell and Intel KNL architectures
■ Identify bottlenecks and mitigate them

– Evaluation including several use cases

■ A^2, Square x Tall-skinny, L*U for triangle counting

– Showing the impact of keeping the output unsorted
– A recipe for selecting the best-performing algorithm for a specific application scenario

SLIDE 10

Benchmark for SpGEMM

Thread scheduling cost
■ Evaluates the scheduling cost on Haswell and KNL architectures

– OpenMP: static, dynamic and guided

■ Scheduling cost hurts SpGEMM performance

SLIDE 11

Benchmark for SpGEMM

Memory allocation/deallocation cost
■ Identifies that allocation/deallocation of large memory space is expensive
■ Parallel memory allocation scheme

– Each thread independently allocates/deallocates memory and accesses only its own memory space
– For SpGEMM, we can reduce the deallocation cost

[Figures: parallel memory allocation scheme; deallocation cost]

SLIDE 12

Benchmark for SpGEMM

Impact of MCDRAM
■ MCDRAM provides high memory bandwidth

– Clearly improves the STREAM benchmark
– Performance of stanza-like memory access is unclear

■ Small blocks of consecutive elements
■ Access to rows of B in SpGEMM

Hard to get the benefits of MCDRAM on very sparse matrices in SpGEMM

SLIDE 13

Architecture-Specific Optimization

Thread scheduling
■ Good load balance with static scheduling

– Assigning work to threads by FLOPs
– Work assignment can be efficiently executed in parallel

■ Count the required FLOPs of each row
■ Prefix-sum to get the total FLOPs of SpGEMM
■ Assign rows to threads (example shows the case of 3 threads)
– Average FLOPs = 11/3

[Figure: input matrix with per-row FLOPs 4, 1, 2, 4; prefix sum 4, 5, 7, 11; rows split across 3 threads]
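The FLOP-balanced static assignment described above can be sketched as follows: `partition_rows` cuts the prefix-summed FLOP counts into near-equal contiguous blocks (a simplified serial sketch; the real assignment runs in parallel):

```python
import bisect
from itertools import accumulate

def partition_rows(flops, nthreads):
    """Cut rows into contiguous blocks of roughly total/nthreads FLOPs.
    flops[i] = multiply-adds needed for output row i."""
    prefix = list(accumulate(flops))       # running FLOP totals
    total = prefix[-1] if prefix else 0
    bounds = [0]
    for t in range(1, nthreads):
        target = total * t / nthreads
        i = bisect.bisect_left(prefix, target, lo=bounds[-1])
        # cut before or after row i, whichever lands closer to the target
        if i < len(prefix) and (i == bounds[-1]
                                or prefix[i] - target <= target - prefix[i - 1]):
            i += 1
        bounds.append(i)
    bounds.append(len(flops))
    return bounds  # thread t handles rows bounds[t]..bounds[t+1]-1
```

With the slide's example, `partition_rows([4, 1, 2, 4], 3)` yields `[0, 1, 3, 4]`: blocks of 4, 3 and 4 FLOPs around the average of 11/3.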

SLIDE 14

Architecture-Specific Optimization

Accumulator for Symbolic and Numeric Phases
■ Optimizing algorithms for Intel architectures
■ Heap [Azad, 2016]

– Priority queue indexed by column indices
– Requires logarithmic time to extract elements
– Space-efficient: O(nnz(a_i*))

■ Better cache utilization

■ Hash [Nagasaka, 2016]

– Uses a hash table for the accumulator, based on GPU work

■ Low memory usage and high performance

– Each thread allocates the hash table once and reuses it
– Extended to HashVector to exploit wide vector registers
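The hash accumulator can be sketched as an open-addressing table with linear probing; in the optimized code each thread allocates the table once and reuses it across rows (illustrative pure-Python, not the paper's implementation; `table_size` must be a power of two, sized from the symbolic phase):

```python
def hash_row(a_row, B, table_size):
    """One output row via an open-addressing hash accumulator
    with linear probing. table_size must be a power of two."""
    keys = [-1] * table_size     # -1 marks an empty slot
    vals = [0.0] * table_size
    mask = table_size - 1
    for k, a_ik in a_row.items():
        for j, b_kj in B.get(k, {}).items():
            h = (j * 107) & mask           # simple multiplicative hash
            while keys[h] != -1 and keys[h] != j:
                h = (h + 1) & mask         # probe next slot on collision
            keys[h] = j
            vals[h] += a_ik * b_kj
    return {keys[h]: vals[h] for h in range(table_size) if keys[h] != -1}
```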

SLIDE 15

Architecture-Specific Optimization

HashVector
■ Utilizes the 256- and 512-bit wide vector registers of Intel architectures for hash probing

– Reduces the number of probes caused by hash collisions
– Requires a few more instructions for each check

■ Degrades performance when collisions in Hash are rare

(a) Hash: 1) check the entry; 2) on a hash collision, check the next entry; 3) if the entry is empty, add the element
(b) HashVector: 1) check multiple entries at once with a vector register; 2) if the element is not found and the row has an empty entry, add the element
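HashVector's probing can be simulated in scalar code by comparing a register-width group of slots per step; the real implementation does each group comparison with one AVX2/AVX-512 instruction (the width `W` and names here are illustrative assumptions):

```python
W = 8  # slots compared per probing step (one wide-register compare)

def hv_insert(keys, vals, j, v):
    """Insert/accumulate key j, probing W hash-table slots per step.
    len(keys) must be a multiple of W; -1 marks an empty slot."""
    nchunks = len(keys) // W
    c = (j * 107) % nchunks                 # hash selects a W-wide chunk
    while True:
        chunk = keys[c * W:(c + 1) * W]     # one vector compare in real code
        if j in chunk:                      # element already present
            vals[c * W + chunk.index(j)] += v
            return
        if -1 in chunk:                     # chunk has an empty slot
            s = c * W + chunk.index(-1)
            keys[s], vals[s] = j, v
            return
        c = (c + 1) % nchunks               # chunk full: probe next chunk
```

Fewer probe iterations are needed when collisions cluster, but each step does extra work, matching the observation that HashVector can lose to Hash when collisions are rare.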

SLIDE 16

Performance Evaluation

SLIDE 17

Matrix Data

■ Synthetic matrices

– R-MAT, the recursive matrix generator
– Two different non-zero patterns of synthetic matrices

■ ER: Erdős–Rényi random graphs
■ G500: graphs with power-law degree distributions
– Used for the Graph500 benchmark

– Scale-n matrix: 2^n-by-2^n
– Edge factor: the average number of non-zero elements per row of the matrix

■ SuiteSparse Matrix Collection

– 26 sparse matrices used in several past works
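An ER-style generator matching the description above might look like this (a hypothetical sketch, not the R-MAT generator used in the paper; duplicate column draws collapse, so rows hold at most `edge_factor` nonzeros):

```python
import random

def er_matrix(scale, edge_factor, seed=0):
    """2^scale-row matrix with ~edge_factor uniformly random nonzeros
    per row (duplicate column draws collapse, so rows may hold fewer)."""
    rng = random.Random(seed)
    n = 2 ** scale
    return {i: {rng.randrange(n): 1.0 for _ in range(edge_factor)}
            for i in range(n)}
```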

SLIDE 18

Evaluation Environment

■ Cori system @NERSC

– Haswell Cluster

■ Intel Xeon Processor E5-2698 v3
■ 128GB DDR4 memory

– KNL Cluster

■ Intel Xeon Phi Processor 7250
– 68 cores
– 32KB/core L1 cache, 1MB/tile L2 cache
– 16GB MCDRAM (quadrant cluster mode, cache memory mode)
■ 96GB DDR4 memory

– OS: SuSE Linux Enterprise Server 12 SP3
– Intel C++ Compiler (icpc) ver. 18.0.0

■ -g -O3 -qopenmp

SLIDE 19

Benefit of Performance Optimization

Scheduling and memory allocation
■ Good load balance with static scheduling
■ For larger matrices, the parallel memory allocation scheme keeps performance high

A^2 of G500 matrices with edge factor=16

SLIDE 20

Benefit of Performance Optimization

Use of MCDRAM
■ Benefit of MCDRAM especially on denser matrices

SLIDE 21

Performance Evaluation

A^2: Scaling with density (KNL, ER)
■ Scale = 16
■ Different performance trends

– Performance of MKL degrades with increasing density

SLIDE 22

Performance Evaluation

A^2: Scaling with density (KNL, ER)
■ Performance gain from keeping the output unsorted

SLIDE 23

Performance Evaluation

A^2: Scaling with density (KNL, G500)
■ Denser inputs do not simply bring performance gains

– Different from ER matrices

SLIDE 24

Performance Evaluation

A^2: Scaling with density (Haswell)
■ HashVector achieves much higher performance

SLIDE 25

Performance Evaluation

A^2: Scaling with input size (KNL, ER)
■ Edge factor = 16
■ Hash and HashVector show good performance at any input size

SLIDE 26

Performance Evaluation

A^2: Scaling with input size (KNL, ER)
■ Performance gain from keeping the output unsorted
■ MKL for small scales ⇔ HashVector for large scales

SLIDE 27

Performance Evaluation

A^2: Scaling with input size (KNL, G500)
■ Hash is the best performer

SLIDE 28

Performance Evaluation

A^2: Scaling with input size (Haswell)
■ Clearer performance trend than on KNL

– MKL for smaller scales
– Hash and HashVector for larger scales

SLIDE 29

Performance Evaluation

A^2: Scalability (KNL)
■ Good scalability of Hash and HashVector even beyond 64 threads

SLIDE 30

Performance Evaluation

A^2: Sensitivity to compression ratio (KNL)
■ Evaluation on SuiteSparse matrices
■ Compression ratio (CR): #flops / #non-zeros of the output
■ Heap: stable performance
■ MKL and Hash: better performance with higher CR
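Under the dict-of-dicts convention used earlier, CR can be computed with a symbolic pass that tracks only output column sets (illustrative sketch):

```python
def compression_ratio(A, B):
    """CR = (#multiply-adds) / nnz(A @ B): how strongly the accumulator
    compresses intermediate products into output nonzeros."""
    flops = sum(len(B.get(k, {})) for row in A.values() for k in row)
    nnz_c = 0
    for row in A.values():              # symbolic phase: column sets only
        cols = set()
        for k in row:
            cols.update(B.get(k, {}))
        nnz_c += len(cols)
    return flops / nnz_c if nnz_c else 0.0
```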

SLIDE 31

Performance Evaluation

A^2: Sensitivity to compression ratio (KNL)
■ Hash for low CR ⇔ MKL family for high CR
■ KokkosKernels underperforms the other kernels

SLIDE 32

Performance Evaluation

A^2: Profile of relative performance
■ Sorted: Hash is the best performer for 70% of matrices

– Runtime of Hash is always within 1.6x of the best

■ Unsorted: Hash, HashVector and MKL-inspector perform equally

– Each of them performs the best for about 30%

[Figure: performance profiles, performance relative to the best algorithm vs. fraction of problems. Sorted panel: Hash, HashVector, MKL, Heap. Unsorted panel: Hash, HashVector, MKL, MKL-inspector, Kokkos]

SLIDE 33

Performance Evaluation

Square x Tall-skinny matrix (KNL)
■ Multiple BFS, betweenness centrality
■ Hash or HashVector is the best performer

SLIDE 34

Performance Evaluation

Triangle counting on SuiteSparse matrices (KNL)
■ Reorders and transforms a matrix into L and U

– L is the lower triangular part and U is the upper triangular part

■ Similar performance trend to that of A^2

– Hash and HashVector generally outperform MKL
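One common masked-SpGEMM formulation of this scheme (cf. Azad et al.): with symmetric adjacency matrix A split into strictly lower part L and strictly upper part U, every triangle contributes exactly twice to sum(A .* (L*U)). A pure-Python sketch over adjacency sets (illustrative, not the benchmark code):

```python
def count_triangles(adj):
    """adj: {vertex: set(neighbors)}, symmetric, no self-loops."""
    total = 0
    for i, nbrs in adj.items():
        lower = [k for k in nbrs if k < i]       # nonzeros of row i of L
        for j in nbrs:                           # mask: entries where A_ij = 1
            # (L*U)[i][j] = #{k : k < i, (i,k) edge, k < j, (k,j) edge}
            total += sum(1 for k in lower if k < j and k in adj[j])
    return total // 2                            # each triangle counted twice
```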

SLIDE 35

Empirical Recipe for SpGEMM on KNL

(a) Real data, specified by compression ratio (CR):

                      High CR (>2)     Low CR (<=2)
A x A    Sorted       Hash             Hash
         Unsorted     MKL-inspector    Hash
L x U    Sorted       Hash             Heap

(b) Synthetic data, specified by sparsity and non-zero pattern:

                      Sparse (edge factor <= 8)    Dense (edge factor > 8)
                      Uniform      Skewed          Uniform      Skewed
A x A       Sorted    Heap         Heap            Heap         Hash
            Unsorted  HashVec      HashVec         HashVec      Hash
Tall-skinny Sorted    Hash                         HashVec
            Unsorted  Hash                         Hash

SLIDE 36

Conclusion

■ Performance analysis of SpGEMM on Intel KNL and multicore architectures

– Optimized implementations for these architectures

■ Identify the bottlenecks

– Evaluation in various use cases

■ Clarify which SpGEMM algorithm works well

– Highlighting the benefit of leaving matrices unsorted
– Empirical recipe for selecting the best-performing algorithm for a specific application scenario

Source code is publicly available at https://bitbucket.org/YusukeNagasaka/mtspgemmlib