Strategies for parallel SpMV multiplication
Albert-Jan Yzelman June 2012
© 2012, ExaScience Lab - Albert-Jan Yzelman
This presentation outlines the pre-print ‘High-level strategies for parallel shared-memory sparse matrix–vector multiplication’, tech. rep. TW-614, KU Leuven (2012), which has been submitted for publication. This is joint work of Albert-Jan Yzelman and Dirk Roose.
This work is funded by Intel and by the Institute for the Promotion of Innovation through Science and Technology (IWT), in the framework of the Flanders ExaScience Lab, part of Intel Labs Europe.
1. Summary
2. Classification
3. Strategies
4. Experiments
5. Conclusions and outlook
Given a sparse m × n matrix A and an n × 1 input vector x, how do we calculate

    y = Ax

efficiently on shared-memory parallel architectures?
Three obstacles oppose an efficient shared-memory parallel sparse matrix–vector (SpMV) multiplication kernel: inefficient cache use, limited memory bandwidth, and non-uniform memory access (NUMA).
Visualisation of the SpMV multiplication y = Ax with nonzeroes processed in row-major order: accesses to the input vector x are completely unpredictable.
The arithmetic intensity of an SpMV multiply lies between 2/4 and 2/5 flop per byte. On an 8-core 2.13 GHz machine (with AVX) and 10.67 GB/s DDR3:

              CPU speed                 Memory speed
    1 core    4 · 2.13 · 10^9 nz/s      2/5 · 10.67 · 10^9 nz/s
    8 cores   32 · 2.13 · 10^9 nz/s     2/5 · 10.67 · 10^9 nz/s

The CPU speed exceeds the memory speed by far.
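These bounds can be verified with a few lines of C (a back-of-the-envelope sketch; the 4-nonzeroes-per-cycle and 2/5-flop-per-byte figures are taken from the slide above, not measured):

```c
#include <stdio.h>

int main(void)
{
    const double clock     = 2.13e9;  /* clock frequency in Hz              */
    const double bandwidth = 10.67e9; /* DDR3 bandwidth in bytes per second */

    /* With AVX, one core processes roughly 4 nonzeroes per cycle
       (an SpMV performs 2 flops per nonzero).                      */
    printf("CPU,  1 core : %5.2f Gnz/s\n", 4.0 * clock / 1e9);
    printf("CPU, 8 cores : %5.2f Gnz/s\n", 32.0 * clock / 1e9);

    /* The memory system feeds 2/5 * bandwidth nonzeroes per second,
       regardless of the number of cores sharing it.                 */
    printf("Memory       : %5.2f Gnz/s\n", 0.4 * bandwidth / 1e9);

    return 0;
}
```

This prints roughly 8.5 and 68.2 Gnz/s for the CPU against 4.3 Gnz/s for the memory system: even a single core is memory-bound.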
E.g., a dual-socket machine, with two quad-core processors:

    [Diagram: two sockets, each with four cores; every core has a 32 kB L1 cache, pairs of cores share a 4 MB L2 cache, and both processors connect to main memory through the system interface.]

Pairs of cores may have different bandwidths.
If each processor moves data from and to the same memory element, the effective bandwidth is shared.
Assuming a row-major order of nonzeroes:

    A = [ 4 1 3 0 ]
        [ 0 0 2 3 ]
        [ 1 0 0 2 ]
        [ 7 0 1 1 ]

CRS storage:

    V = [ 4 1 3 2 3 1 2 7 1 1 ]
    J = [ 0 1 2 2 3 0 3 0 2 3 ]
    Î = [ 0 3 5 7 10 ]

Kernel:

    for i = 0 to m − 1 do
        for k = Î_i to Î_{i+1} − 1 do
            add V_k · x_{J_k} to y_i
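In C, this kernel reads as follows (a minimal sketch; double values and int indices are assumed, and y is overwritten rather than accumulated into):

```c
/* y = A x, with A an m-by-n sparse matrix in CRS storage:
 *   V[nz]  holds the nonzero values in row-major order,
 *   J[nz]  the corresponding column indices, and
 *   I[m+1] the offset of each row's first nonzero within V and J. */
void spmv_crs(int m, const double *V, const int *J, const int *I,
              const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = I[i]; k < I[i + 1]; k++)
            sum += V[k] * x[J[k]];  /* unpredictable access to x */
        y[i] = sum;
    }
}
```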
The same CRS data structure and kernel parallelise with a single OpenMP directive:

    #omp parallel for private( i, k ) schedule( dynamic, 8 )
    for i = 0 to m − 1 do
        for k = Î_i to Î_{i+1} − 1 do
            add V_k · x_{J_k} to y_i
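In C this is one pragma on top of the sequential kernel (a sketch mirroring the slide; schedule(dynamic, 8) hands out chunks of 8 rows at a time to combat load imbalance):

```c
#include <omp.h>

void spmv_crs_omp(int m, const double *V, const int *J, const int *I,
                  const double *x, double *y)
{
    int i, k;
    /* threads dynamically grab chunks of 8 rows; i and k are private */
    #pragma omp parallel for private(k) schedule(dynamic, 8)
    for (i = 0; i < m; i++) {
        double sum = 0.0;
        for (k = I[i]; k < I[i + 1]; k++)
            sum += V[k] * x[J[k]];
        y[i] = sum;
    }
}
```

This is the ‘OpenMP CRS’ baseline in the tables that follow: it is 1D and implicit, and inherits all three obstacles listed earlier.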
Average speedups relative to sequential CRS (the number of threads used is given in brackets):

    Method                        DL-2000    DL-580     DL-980
    OpenMP CRS (1D)               2.6 (8)    3.5 (8)    3.0 (8)
    CSB (1D)                      4.9 (12)   9.9 (30)   8.6 (32)
    Interleaved CSB (1D)          6.0 (12)   18.2 (40)  17.7 (64)
    pOSKI (1D)                    5.4 (12)   14.8 (40)  12.2 (64)
    Row-distr. block CO-H (1D)    7.0 (12)   24.7 (40)  34.0 (64)
    Fully distributed (2D)        4.0 (12)   8.3 (40)   6.9 (32)
                                  4.0 (8)    7.4 (32)   6.4 (32)
                                  4.2 (8)    7.0 (16)   6.8 (64)
    Block CO-H+ (ND)              2.5 (12)   3.1 (16)   3.2 (16)
2. Classification
Implicit distribution, centralised local allocation: if each processor moves data from and to its own unique memory element, the bandwidth multiplies with the number of memory elements.

Implicit distribution, centralised interleaved allocation: if each processor moves data from and to its own unique memory element, the bandwidth multiplies with the number of memory elements.

Explicit distribution, distributed local allocation: if each processor moves data from and to its own unique memory element, the bandwidth multiplies with the number of memory elements.
Parallelisation maps nonzeroes to processes:

    π_A : {0, ..., m − 1} × {0, ..., n − 1} → {0, ..., p − 1};

process π_A(i, j) performs y_i = y_i + a_{ij} x_j. Vectors can be distributed similarly:

    π_y : {0, ..., m − 1} → {0, ..., p − 1}, and
    π_x : {0, ..., n − 1} → {0, ..., p − 1}.
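As a concrete example, the typical 1D block-row choice mentioned below can be written as such maps (a hypothetical sketch; the function names are ours):

```c
/* 1D block-row distribution over p processes: roughly m/p consecutive
 * rows per process; a nonzero a_ij and output element y_i go to the
 * owner of row i, and input elements are split likewise by column.   */
static int pi_y(int i, int m, int p) { return (int)((long long)i * p / m); }
static int pi_x(int j, int n, int p) { return (int)((long long)j * p / n); }
static int pi_A(int i, int j, int m, int p) { (void)j; return pi_y(i, m, p); }
```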
ND (no distribution): π_x and π_y are left undefined; maximum freedom for choosing π_A.

1D distribution: π_A(i, j) depends on either i or j; typically, π_A(i, j) = π_A(i) = π_y(i).

2D distribution: π_A, π_x and π_y are all defined.

Implicit distribution: process s performs only computations with nonzeroes a_{ij} s.t. π_A(i, j) = s.

Explicit distribution (of A): process s additionally allocates storage for those a_{ij} for which π_A(i, j) = s, on locally fast memory. Explicit distribution of x and y is defined analogously.
Cache-aware: auto-tuning, or coding for specific architectures.

Cache-oblivious: runs well regardless of architecture.

Matrix-aware: exploits structural properties of A (usually combined with auto-detection within cache-aware schemes).
Coarse-grained: p equals the number of available units of execution (UoEs).

Fine-grained: p is much larger than the number of available UoEs.

A coarse grain size easily incorporates explicit distributions; a fine grain size spends less effort to attain load balance.
3. Strategies
We define π_A such that for all s ∈ {0, ..., p − 1},

    |{a_{ij} ∈ A | π_A(i, j) = s}| = ⌊nz/p⌋ + (1 if s < nz mod p, 0 otherwise);

i.e., a perfect load balance. Which nonzeroes go where, and the order in which they are processed, is determined by the Hilbert curve.
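The Hilbert-curve position of a nonzero can be computed with standard bit manipulation; below is the textbook conversion (a sketch, not the paper's implementation; n must be a power of two covering both matrix dimensions):

```c
/* Rotate/flip a quadrant so the curve's orientation becomes canonical. */
static void rot(int n, int *x, int *y, int rx, int ry)
{
    if (ry == 0) {
        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        int t = *x; *x = *y; *y = t; /* swap x and y */
    }
}

/* Map a coordinate (x, y) on an n-by-n grid, n a power of two, to its
 * one-dimensional position along the Hilbert curve.                   */
int xy2d(int n, int x, int y)
{
    int d = 0;
    for (int s = n / 2; s > 0; s /= 2) {
        int rx = (x & s) > 0;
        int ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        rot(n, &x, &y, rx, ry);
    }
    return d;
}
```

Sorting the nonzeroes by xy2d(n, i, j) and cutting the sorted sequence into p nearly equal parts realises the balanced π_A above.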
x is implicitly distributed (using interleaved allocation):
y is explicitly distributed, but must be replicated:
This strategy is called block CO-H+, and is non-distributed, implicit, fully cache-oblivious, and coarse-grained.
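Putting the two previous slides together, the block assignment can be sketched as follows (a hypothetical helper, not the paper's code; xy2d is the conversion shown earlier):

```c
#include <stdlib.h>

int xy2d(int n, int x, int y); /* Hilbert conversion from the earlier sketch */

typedef struct { int i, j; double v; long h; } nonzero; /* h: Hilbert index */

static int by_hilbert(const void *a, const void *b)
{
    long ha = ((const nonzero *)a)->h, hb = ((const nonzero *)b)->h;
    return (ha > hb) - (ha < hb);
}

/* Assign Hilbert-sorted nonzeroes to p processes in (nearly) equal blocks:
 * the first nz mod p processes receive one extra nonzero, realising the
 * balanced pi_A of the previous slides.                                   */
void hilbert_blocks(nonzero *nzs, int nz, int n_grid, int p, int *owner)
{
    for (int k = 0; k < nz; k++)
        nzs[k].h = xy2d(n_grid, nzs[k].i, nzs[k].j);
    qsort(nzs, (size_t)nz, sizeof(nonzero), by_hilbert);
    for (int k = 0, s = 0, taken = 0; k < nz; k++) {
        int quota = nz / p + (s < nz % p ? 1 : 0);
        if (taken == quota) { s++; taken = 0; }
        owner[k] = s;
        taken++;
    }
}
```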
Many existing methods fall in this distribution type. The OpenMP parallelisation shown at the start is: 1D, implicit, not optimised for cache, fine-grained.

The parallel Optimised Sparse Kernel Interface (pOSKI, by Byun et al., UCB) does better, as it is (by default): 1D, explicit, cache- and matrix-aware (auto-tuned), coarse-grained.
For a good fine-grained 1D method we refer to Compressed Sparse Blocks (CSB, by Buluç et al., UCSB/LBNL/MIT): 1D, implicit (originally locally allocated, but can be interleaved), cache-oblivious, fine-grained.

We also constructed the row-distributed block CO-H strategy, which is: 1D, explicit, fully cache-oblivious, coarse-grained.
Sketch of the row-distributed block CO-H strategy:
(Left: p = 1, right: p = 2.)
This method uses πA = πy, allocates x in an interleaved fashion, and allocates A and y in a distributed, explicit way.
Pre-processing: full matrix partitioning.

    [Figure: a 14 × 14 sparse matrix partitioned and reordered into separated block form.]

The communication metric minimised (the λ − 1 metric) is exact (see Yzelman & Bisseling, 2009). Available software: Mondriaan (Bisseling et al., UU), Zoltan (Boman et al., Sandia), PaToH (Çatalyürek & Aykanat, Ohio State), and Scotch (Pellegrini, Bordeaux).
Full matrix partitioning yields the three maps π_A, π_x, and π_y. These are used in an explicit way; each process allocates its own parts of A, x, and y in local memory.

Local vectors overlap with each other; the number of redundantly stored elements is given by the λ − 1 metric. That metric also measures communication, and is minimised by the partitioning.
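For reference, the λ − 1 volume of a given nonzero-to-process assignment can be counted directly (a sketch with hypothetical argument names; O((m + n)p) extra work and memory for clarity):

```c
#include <stdlib.h>

/* Sum over all rows and columns of (lambda - 1), where lambda is the
 * number of distinct processes owning nonzeroes in that row/column.
 * row[k], col[k], and part[k] describe the k-th nonzero and its owner. */
long lambda_minus_one(int nz, int m, int n, int p,
                      const int *row, const int *col, const int *part)
{
    char *rt = calloc((size_t)m * p, 1); /* rt[i*p+s]: s touches row i */
    char *ct = calloc((size_t)n * p, 1); /* ct[j*p+s]: s touches col j */
    long volume = 0;
    for (int k = 0; k < nz; k++) {
        rt[(size_t)row[k] * p + part[k]] = 1;
        ct[(size_t)col[k] * p + part[k]] = 1;
    }
    for (int i = 0; i < m; i++) {
        int lambda = 0;
        for (int s = 0; s < p; s++)
            lambda += rt[(size_t)i * p + s];
        if (lambda > 1)
            volume += lambda - 1;
    }
    for (int j = 0; j < n; j++) {
        int lambda = 0;
        for (int s = 0; s < p; s++)
            lambda += ct[(size_t)j * p + s];
        if (lambda > 1)
            volume += lambda - 1;
    }
    free(rt);
    free(ct);
    return volume;
}
```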
The classical parallel SpMV algorithm consists of three steps and one synchronisation. Each process s executes:

1. copy remote input vector elements (all needed j with π_x(j) ≠ s);
2. perform a local (optimised) SpMV multiply;
   synchronise;
3. get and add remote contributions to the local output vector.
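In shared memory these steps can be rendered schematically as below (our sketch, not the paper's code: every type and field name is hypothetical, the ‘remote’ gets and writes become reads and atomic adds on shared arrays, and each thread calls this function from inside one OpenMP parallel region):

```c
#include <omp.h>

typedef struct {
    int     m_s;        /* number of local rows                      */
    int    *I, *J;      /* local CRS offsets; global column indices  */
    double *V;          /* local nonzero values                      */
    int    *rows;       /* global row index of each local row        */
} local_block;

void classical_spmv(const local_block *A, const double *x_shared,
                    double *x_copy, double *y_local, double *y_shared)
{
    /* 1. copy remote input vector elements: snapshot the columns used */
    for (int k = 0; k < A->I[A->m_s]; k++)
        x_copy[A->J[k]] = x_shared[A->J[k]];

    /* 2. perform a local (optimised) SpMV multiply                    */
    for (int i = 0; i < A->m_s; i++) {
        double sum = 0.0;
        for (int k = A->I[i]; k < A->I[i + 1]; k++)
            sum += A->V[k] * x_copy[A->J[k]];
        y_local[i] = sum;
    }

    /* synchronise: all local multiplies finish before the fan-in      */
    #pragma omp barrier

    /* 3. add contributions to the (possibly remote) output elements   */
    for (int i = 0; i < A->m_s; i++) {
        #pragma omp atomic
        y_shared[A->rows[i]] += y_local[i];
    }
}
```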
In the paper, we augment this strategy with the Separated Block Diagonal (SBD) form. We denote the local indices of the red separator cross by r_s^−, r_s^+, c_s^− and c_s^+, and the local matrix size by m_s × n_s.
In the paper, we augment this strategy with the Separated Block Diagonal (SBD) form. Each processor s ∈ {0, 1, ..., p − 1} executes:

    1: for j = c_s^− to c_s^+ do
    2:     get x_j^s from its remote process
    3: calculate y_s = L^(s) x_s    (CO-SBD)
    4: calculate y_s = A^(s) x_s
    5: synchronise
    6: for i = r_s^− to r_s^+ do
    7:     write y_i^s to its remote process

We can add in locally refined SBD forms as well.
Locally refined SBD: this is the full, explicitly 2D-distributed CO-SBD strategy.
With the data structures used annotated, each processor s ∈ {0, 1, ..., p − 1} executes:

    1: for j = c_s^− to c_s^+ do
    2:     get x_j^s from its remote process
    3: calculate y_s = L^(s) x_s    (cache-oblivious SBD)
    4: calculate y_s = S^(s) x_s    (Compressed BICRS)
    5: synchronise
    6: for i = r_s^− to r_s^+ do
    7:     write y_i^s to its remote process

We can add in locally refined SBD forms as well.
Overheads per strategy:

    Block CO-H+: O(mp) data overhead, O(m) time overhead; NUMA overhead on x and y.
    CSB (fine-grained): low data and time overheads; NUMA overhead on x.
    pOSKI: NUMA overhead on x.
    RD block-H: NUMA overhead on x.
    2D CO-SBD: µλ−1 + qp data overhead, µλ−1 + l time overhead; no NUMA overhead.

Data overhead: extra storage required over Θ(2 nz + m). Time overhead: extra real time required for one SpMV. NUMA overhead: potentially unbounded data movement across the memory hierarchy (sockets). ‘Low’ overhead is negligible in practice; l is the latency of a global synchronisation.
4. Experiments
We use 23 matrices from four different categories:

    structured, small (vectors fit into the L3 cache)
    structured, large
    unstructured, small
    unstructured, large

We ran experiments on three different architectures:

    DL-2000: two-socket, six-core Intel Xeon X5660
    DL-580: four-socket, ten-core Intel Xeon E7-4870
    DL-980: eight-socket, eight-core Intel Xeon E7-2830

All machines have a bandwidth of 10.67 GB/s per socket.
Average speedups on the small matrices (threads used in brackets):

    Small structured               DL-2000    DL-580     DL-980
    Interleaved CSB (1D)           6.0 (12)   12.6 (40)  8.7 (48)
    pOSKI (1D)                     4.7 (12)   10.9 (20)  8.4 (16)
    Row-distr. block CO-H (1D)     10.1 (12)  36.1 (40)  52.4 (64)
    Fully distributed (2D)         3.1 (12)   2.3 (8)    2.1 (8)
                                   3.6 (8)    3.1 (8)    2.8 (8)

    Small unstructured
    Interleaved CSB (1D)           4.8 (12)   17.9 (40)  13.8 (64)
    pOSKI (1D)                     6.3 (12)   20.3 (40)  15.5 (48)
    Row-distr. block CO-H (1D)     8.7 (12)   32.6 (40)  49.0 (56)
    Fully distributed (2D)         4.4 (12)   9.2 (40)   6.1 (16)
                                   4.4 (8)    7.8 (16)   9.0 (16)
Average speedups on the large matrices (threads used in brackets):

    Large structured               DL-2000    DL-580     DL-980
    Interleaved CSB (1D)           5.4 (12)   18.0 (40)  22.0 (64)
    pOSKI (1D)                     3.9 (12)   12.2 (40)  11.9 (64)
    Row-distr. block CO-H (1D)     3.8 (12)   10.9 (40)  16.4 (64)
    Fully distributed (2D)         3.5 (12)   10.6 (30)  10.2 (40)
                                   3.6 (8)    7.8 (32)   –

    Large unstructured
    Interleaved CSB (1D)           7.9 (12)   24.3 (40)  26.3 (64)
    pOSKI (1D)                     6.6 (12)   19.4 (40)  16.2 (64)
    Row-distr. block CO-H (1D)     5.4 (12)   19.2 (40)  24.6 (64)
    Fully distributed (2D)         5.1 (10)   13.6 (40)  12.3 (64)
                                   5.4 (8)    11.7 (32)  –
Partitioning for qp = 512 (using Mondriaan with ε = 0.3) was only successful on two structured matrices. We test on the DL-980:

             cage15                      adaptive
    q\p      8     16    32    64       8     16    32    64
    1        3.0   3.9   5.7   4.8      6.0   11.1  11.7  15.3
    4        2.5   3.5   −     5.5      6.4   9.7   −     15.8
    8        2.5   −     5.9   6.7      5.8   −     14.8  16.3
    16       −     3.7   5.3   −        −     10.9  17.5  −
    32       2.2   3.4   −     −        5.5   10.9  −     −
    64       2.0   −     −     −        6.0   −     −     −
    iCSB     3.2   6.0   10.6  14.6     8.4   15.9  28.5  38.1
    1D-H     5.2   6.9   9.2   14.7     4.4   6.5   10.7  18.8
5. Conclusions and outlook
We categorised existing techniques, and used this classification to construct new SpMV multiplication methods.

Of the considered methods, the row-distributed block CO-H strategy performs best, on average.

On highly NUMA systems, explicitly distributed strategies continue to speed up, while implicit methods start to stall.

The overhead of 2D methods is costlier than the uncontrolled input vector accesses of 1D methods, but this gap decreases as more sockets are added; 2D methods do seem mandatory for future large-scale computing.
Incorporate low-level tuning; the row-distributed block CO-H strategy may then display a gain similar to that of parallel CRS versus pOSKI presented here.

Load balancing for 1D coarse-grained schemes must improve in order to compete with CSB.

Investigate hybrid shared-memory schemes; e.g., using 2D distributions only across sockets reduces p tremendously.

Implement overlap of communication and computation in the 2D scheme.
Thank you for your attention! Pre-print and software are available at: http://people.cs.kuleuven.be/~albert-jan.yzelman