SLIDE 1

Strategies for parallel SpMV multiplication

Albert-Jan Yzelman, June 2012

© 2012, ExaScience Lab - Albert-Jan Yzelman

SLIDE 2

Acknowledgements

This presentation outlines the pre-print ‘High-level strategies for parallel shared-memory sparse matrix–vector multiplication’, tech. rep. TW-614, KU Leuven (2012), which has been submitted for publication. This is joint work of Albert-Jan Yzelman and Dirk Roose,

  • Dept. of Computer Science, KU Leuven, Belgium.

SLIDE 3

Acknowledgements

This work is funded by Intel and by the Institute for the Promotion of Innovation through Science and Technology (IWT), in the framework of the Flanders ExaScience Lab, part of Intel Labs Europe.

SLIDE 4

Summary

1. Summary
2. Classification
3. Strategies
4. Experiments
5. Conclusions and outlook

SLIDE 5

Central question

Given a sparse m × n matrix A and an n × 1 input vector x, how do we calculate

    y = Ax

on a shared-memory parallel computer, as fast as possible?

SLIDE 6

Central obstacles

Three obstacles oppose an efficient shared-memory parallel sparse matrix–vector (SpMV) multiplication kernel: inefficient cache use, limited memory bandwidth, and non-uniform memory access (NUMA).

SLIDE 7

Inefficient caching

Visualisation of the SpMV multiplication Ax = y with nonzeroes processed in row-major order: accesses to the input vector are completely unpredictable.

SLIDE 8

Limited bandwidth

When using unsigned 64-bit integers for matrix indices and 64-bit floating-point numbers for nonzero values, the arithmetic intensity of an SpMV multiplication lies between 2/4 and 2/5 flop per byte. Consider an 8-core 2.13 GHz processor with Intel AVX, using 10.67 GB/s DDR3 memory:

                CPU speed                Memory speed
    1 core      4 · 2.13 · 10^9 nz/s     4/10 · 10.67 · 10^9 nz/s
    8 cores     32 · 2.13 · 10^9 nz/s    4/10 · 10.67 · 10^9 nz/s

Already with one core, the CPU speed of 8.5 Gnz/s by far exceeds the memory speed of 4.3 Gnz/s; this gap only widens as the number of cores on-chip increases.
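For concreteness, the comparison can be scripted. The snippet below is not from the pre-print; it only re-evaluates the slide's own constants:

    #include <stdio.h>

    int main(void) {
        const double clock = 2.13e9;        /* Hz                            */
        const double nz_per_cycle = 4.0;    /* per core, with Intel AVX      */
        const double bandwidth = 10.67e9;   /* bytes/s, DDR3                 */
        const double factor = 4.0 / 10.0;   /* the slide's 4/10 memory factor */

        printf("CPU,  1 core : %4.1f Gnz/s\n", nz_per_cycle * clock / 1e9);
        printf("CPU, 8 cores : %4.1f Gnz/s\n", 8.0 * nz_per_cycle * clock / 1e9);
        printf("Memory       : %4.1f Gnz/s\n", factor * bandwidth / 1e9);
        return 0;   /* prints 8.5, 68.2 and 4.3 Gnz/s */
    }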

SLIDE 9

NUMA architectures

E.g., a dual-socket machine, with two quad-core processors:

[Diagram: two quad-core processors; each core has a 32kB L1 cache, each processor a 4MB L2 cache and a system interface to memory.]

A processor typically has local memory available through its interface, but can reach remote memory too. (These additional interconnects are not pictured here.)

SLIDE 10

NUMA architectures

If each processor moves data from and to the same memory element, the effective bandwidth is shared.

SLIDE 11

Starting point: CRS

Assuming a row-major order of nonzeroes:

    A = [ 4 1 3 .
          . . 2 3
          1 . . 2
          7 . 1 1 ]

Storage:

    V = [ 4 1 3 2 3 1 2 7 1 1 ]
    J = [ 0 1 2 2 3 0 3 0 2 3 ]
    Î = [ 0 3 5 7 10 ]

Kernel:

    for i = 0 to m − 1 do
        for k = Î_i to Î_{i+1} − 1 do
            add V_k · x_{J_k} to y_i
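As a concrete reference, here is a minimal C sketch of this data structure and kernel (the type and variable names are ours, not the pre-print's):

    #include <stddef.h>

    /* CRS storage: V holds the nonzero values, J their column indices,
     * and I (the row-start array Î above) has m + 1 entries. */
    struct crs {
        size_t m;       /* number of rows              */
        double *V;      /* nonzero values, length nz   */
        size_t *J;      /* column indices, length nz   */
        size_t *I;      /* row pointers, length m + 1  */
    };

    /* y += A x with A in CRS form; y must be zero-initialised by the caller. */
    static void spmv_crs(const struct crs *A, const double *x, double *y) {
        for (size_t i = 0; i < A->m; ++i)
            for (size_t k = A->I[i]; k < A->I[i+1]; ++k)
                y[i] += A->V[k] * x[A->J[k]];
    }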

SLIDE 12

Starting point: CRS

Assuming a row-major order of nonzeroes:

    A = [ 4 1 3 .
          . . 2 3
          1 . . 2
          7 . 1 1 ]

Storage:

    V = [ 4 1 3 2 3 1 2 7 1 1 ]
    J = [ 0 1 2 2 3 0 3 0 2 3 ]
    Î = [ 0 3 5 7 10 ]

Kernel:

    #pragma omp parallel for private( i, k ) schedule( dynamic, 8 )
    for i = 0 to m − 1 do
        for k = Î_i to Î_{i+1} − 1 do
            add V_k · x_{J_k} to y_i
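The compilable form of this parallel kernel, reusing the struct crs sketch from the previous slide (again our illustration), would look as follows; rows are dealt out dynamically in chunks of 8:

    #include <stddef.h>

    static void spmv_crs_omp(const struct crs *A, const double *x, double *y) {
        size_t i, k;
        #pragma omp parallel for private(i, k) schedule(dynamic, 8)
        for (i = 0; i < A->m; ++i)
            for (k = A->I[i]; k < A->I[i+1]; ++k)
                y[i] += A->V[k] * x[A->J[k]];
    }

Each thread owns distinct rows i, so the updates to y are race-free; the shared, unpredictable reads of x are exactly where the caching problem of slide 7 bites.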

SLIDE 13

Results

Methods                        DL-2000     DL-580      DL-980
OpenMP CRS (1D)                2.6 (8)     3.5 (8)     3.0 (8)
CSB (1D)                       4.9 (12)    9.9 (30)    8.6 (32)
Interleaved CSB (1D)           6.0 (12)    18.2 (40)   17.7 (64)
pOSKI (1D)                     5.4 (12)    14.8 (40)   12.2 (64)
Row-distr. block CO-H (1D)     7.0 (12)    24.7 (40)   34.0 (64)
Fully distributed (2D)         4.0 (12)    8.3 (40)    6.9 (32)
Distr. CO-SBD, qp = 32 (2D)    4.0 (8)     7.4 (32)    6.4 (32)
Distr. CO-SBD, qp = 64 (2D)    4.2 (8)     7.0 (16)    6.8 (64)
Block CO-H+ (ND)               2.5 (12)    3.1 (16)    3.2 (16)

Average speedups relative to sequential CRS; the actual number of threads used is shown in brackets.

SLIDE 14

Classification

1. Summary
2. Classification
3. Strategies
4. Experiments
5. Conclusions and outlook

SLIDE 15

Distribution types

Implicit distribution, central local allocation: if each processor moves data from and to its own unique memory element, the bandwidth multiplies with the number of available elements.

SLIDE 16

Distribution types

Implicit distribution, central interleaved allocation: if each processor moves data from and to its own unique memory element, the bandwidth multiplies with the number of available elements.

SLIDE 17

Distribution types

Explicit distribution, distributed local allocation: if each processor moves data from and to its own unique memory element, the bandwidth multiplies with the number of available elements.

SLIDE 18

Distribution types

Regardless of implicit or explicit distribution, parallelisation requires a function mapping nonzeroes to processes:

    πA : {0, . . . , m − 1} × {0, . . . , n − 1} → {0, . . . , p − 1},

with p the total number of concurrent processes. Nonzero aij ∈ A is used in multiplication by process πA(i, j); that process performs the operation yi = yi + aij xj. Vectors can be distributed similarly:

    πy : {0, . . . , m − 1} → {0, . . . , p − 1}, and
    πx : {0, . . . , n − 1} → {0, . . . , p − 1}.
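In code, these maps are just functions from indices to process ranks. A hypothetical 1D row-block choice (our example; the slides do not prescribe any particular π) could look like this:

    #include <stddef.h>

    /* pi_y: owner of output element i under a row-block distribution. */
    static unsigned pi_y(size_t i, size_t m, unsigned p) {
        return (unsigned)((i * (size_t)p) / m);
    }

    /* A 1D row distribution: the owner of nonzero a_ij depends on i only. */
    static unsigned pi_A(size_t i, size_t j, size_t m, unsigned p) {
        (void)j;   /* unused; this is what makes the distribution 1D */
        return pi_y(i, m, p);
    }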

SLIDE 19

Distribution types

  • No distribution (ND): πx and πy are left undefined.
  • 1D distribution: πA(i, j) depends on either i or j; typically, πA(i, j) = πA(i) = πy(i).
  • 2D distribution: πA, πx and πy are all defined.
  • Implicit distribution: process s performs only the computations with nonzeroes aij such that πA(i, j) = s.
  • Explicit distribution (of A): process s additionally allocates storage for those aij for which πA(i, j) = s, on locally fast memory (and similarly for explicit distribution of x and y).

SLIDE 20

Cache-optimisation

  • Cache-aware: auto-tuning, or coding for specific architectures.
  • Cache-oblivious: runs well regardless of architecture.
  • Matrix-aware: exploits structural properties of A (usually combined with auto-detection within cache-aware schemes).

SLIDE 21

Parallelisation type

  • Coarse-grained: p equals the number of available units of execution (UoEs; e.g., cores).
  • Fine-grained: p is much larger than the number of available UoEs.

A coarse grain size easily incorporates explicit distributions; a fine grain size needs less effort to attain load balance.

SLIDE 22

Strategies

1. Summary
2. Classification
3. Strategies
4. Experiments
5. Conclusions and outlook

SLIDE 23

No vector distribution

We define πA such that for all s ∈ {0, . . . , p − 1},

    |{aij ∈ A | πA(i, j) = s}| = ⌊nz/p⌋ + (1 if s < nz mod p; 0 otherwise);

i.e., a perfect load balance. Which nonzeroes go where, and the order in which nonzeroes are processed, is determined by the Hilbert curve.
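A minimal sketch of the textbook Hilbert-curve index computation that such an ordering can be built on (the standard algorithm, not necessarily the pre-print's implementation): sort the nonzeroes by hilbert_d(n, i, j), with n the matrix dimension rounded up to a power of two, then cut the sorted sequence into p chunks per the formula above.

    /* Rotate/flip a quadrant so the curve keeps its orientation. */
    static void hilbert_rot(long n, long *x, long *y, long rx, long ry) {
        if (ry == 0) {
            if (rx == 1) {              /* flip the quadrant */
                *x = n - 1 - *x;
                *y = n - 1 - *y;
            }
            long t = *x; *x = *y; *y = t;   /* transpose */
        }
    }

    /* Map (x, y) in an n-by-n grid (n a power of two) to its
     * one-dimensional index along the Hilbert curve. */
    static long hilbert_d(long n, long x, long y) {
        long d = 0;
        for (long s = n / 2; s > 0; s /= 2) {
            long rx = (x & s) > 0;
            long ry = (y & s) > 0;
            d += s * s * ((3 * rx) ^ ry);
            hilbert_rot(n, &x, &y, rx, ry);
        }
        return d;
    }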

SLIDE 24

No vector distribution

x is implicitly distributed (using interleaved allocation):

SLIDE 25

No vector distribution

But y must be locally replicated:

SLIDE 26

No vector distribution

This strategy is called block CO-H+, and is non-distributed, implicit, fully cache-oblivious, and coarse-grained.

SLIDE 27

1d distribution

Many existing methods fall in this distribution type. The OpenMP parallelisation shown at the start is:
  • 1D, implicit, non-optimised for cache, fine-grained.
The parallel Optimised Sparse Kernel Interface (pOSKI, by Byun et al., UCB) does better, as it is (by default):
  • 1D, explicit, cache-aware optimised (auto-tuned), coarse-grained.

SLIDE 28

1d distribution

For a good fine-grained 1D method we refer to Compressed Sparse Blocks (CSB, by Buluç et al., UCSB/LBNL/MIT), which is:
  • 1D, implicit (originally locally allocated, but can be interleaved), cache-oblivious, fine-grained.
We also constructed the row-distributed block CO-H strategy, which is:
  • 1D, explicit, fully cache-oblivious, coarse-grained.

SLIDE 29

1d distribution

Sketch of the row-distributed block CO-H strategy:

(Left: p = 1, right: p = 2.)

This method uses πA = πy, allocates x in an interleaved fashion, and allocates A and y in a distributed, explicit way.
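On Linux, such a mixed allocation policy can be expressed with libnuma. A sketch, under the assumption that one process (thread) is pinned per socket; numa_alloc_interleaved, numa_alloc_local and numa_free are real libnuma calls, the surrounding scaffolding is ours:

    #include <numa.h>      /* link with -lnuma */
    #include <stddef.h>

    /* Allocate the shared input vector interleaved over all sockets, and
     * a process-local output slice on the calling thread's own socket. */
    static void alloc_rd_coh(size_t n, size_t m_local,
                             double **x, double **y_local) {
        *x       = numa_alloc_interleaved(n * sizeof(double));
        *y_local = numa_alloc_local(m_local * sizeof(double));
    }

Both areas are later released with numa_free(ptr, size).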

SLIDE 30

2d distribution

Pre-processing: full matrix partitioning. First model A as a hypergraph, then partition that hypergraph:

[Figure: the matrix modelled as a hypergraph on its 14 rows and columns, and its partitioning.]

The communication metric minimised is exact (both in the parallel and the sequential case; Yzelman & Bisseling, 2009). Available software includes: Mondriaan (Bisseling et al., UU), Zoltan (Boman et al., Sandia), PaToH (Çatalyürek & Aykanat, Ohio State), and Scotch (Pellegrini, Bordeaux).

SLIDE 31

2d distribution

Full matrix partitioning yields the three maps π{A,x,y}. These are used in an explicit way; each process allocates its own local A(s) matrix and x(s) and y(s) vectors.

Local vectors overlap with each other; the number of redundantly stored elements is given by the (λ − 1)-metric. That is also the measure for communication, and it is minimised by the partitioning (held back only by load balance).
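Written out, with λi denoting the number of processes that net (row or column) i spans under the partitioning, the (λ − 1)-metric reads (our restatement of the standard formula):

    communication volume = Σ_i (λi − 1),

so a net kept entirely within one process (λi = 1) causes neither communication nor redundant storage.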

SLIDE 32

2d distribution

The classical parallel SpMV algorithm consists of three steps and one synchronisation. Each process s executes:
  • copy remote input-vector elements (all needed j with πx(j) ≠ s),
  • perform a local (optimised) SpMV multiply,
  • synchronise,
  • get and add remote contributions to the local output vector.
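Schematically, in C with OpenMP, reusing the spmv_crs sketch from the CRS slide; the exchange bookkeeping below is our invention, the pre-print's actual data structures differ:

    #include <stddef.h>

    /* Index lists describing which elements to copy from where. */
    struct exchange { size_t n, *src, *dst; };

    /* One SpMV for process s; every thread of the parallel region calls this. */
    static void spmv_2d_step(const struct crs *A_s, double *x_s, double *y_s,
                             const double *x_remote, struct exchange fetch,
                             const double *y_remote, struct exchange contrib) {
        for (size_t t = 0; t < fetch.n; ++t)       /* 1. copy remote x elements */
            x_s[fetch.dst[t]] = x_remote[fetch.src[t]];
        spmv_crs(A_s, x_s, y_s);                   /* 2. local multiply         */
        #pragma omp barrier                        /* 3. synchronise            */
        for (size_t t = 0; t < contrib.n; ++t)     /* 4. add remote y parts     */
            y_s[contrib.dst[t]] += y_remote[contrib.src[t]];
    }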

SLIDE 33

2d distribution

In the paper, we augment this strategy with the Separated Block Diagonal (SBD) form. We denote the local indices of the red separator cross by r−_s, r+_s and c−_s, c+_s.

SLIDE 34

2d distribution

In the paper, we augment this strategy with the Separated Block Diagonal (SBD) form. Each processor s ∈ {0, 1, . . . , p − 1} executes:

    1: for j = 0 to c−_s do
    2:     get x(s)_j from remote process
    3: for j = c+_s to n_s do
    4:     get x(s)_j from remote process
    5: calculate y(s) = L(s) x(s)    (CO-SBD)
    6: calculate y(s) = A(s) x(s)
    7: synchronise and add remote contributions to y(s)

We can add in locally refined SBD forms as well.

SLIDE 35

2d distribution

In the paper, we augment this strategy with the Separated Block Diagonal (SBD) form. Each processor s ∈ {0, 1, . . . , p − 1} executes:

    1: for j = 0 to c−_s do
    2:     get x(s)_j from remote process
    3: for j = c+_s to n_s do
    4:     get x(s)_j from remote process
    5: calculate y(s) = L(s) x(s)    (cache-oblivious SBD)
    6: calculate y(s) = S(s) x(s)    (compressed BICRS)
    7: synchronise and add remote contributions to y(s)

We can add in locally refined SBD forms as well.

SLIDE 36

2d distribution

Sketch of the full explicitly 2D distributed CO-SBD strategy:

SLIDE 37

Overview

Strategy          Characteristics      Data overhead    Time overhead   NUMA overhead
Block CO-H+       impl., ND, coarse    O(mp)            O(m)            x, y
OpenMP CRS        impl., 1D, fine      –                low             x
CSB               impl., 1D, fine      low              low             x
pOSKI             expl., 1D, coarse    ?                –               x
RD block CO-H     expl., 1D, coarse    –                –               x
2D CO-SBD         expl., 2D, coarse    µ_{λ−1} + qp     µ_{λ−1} + l     –

Data overhead: extra storage required over Θ(2 nz + m). Time overhead: extra real time required for one SpMV. NUMA overhead: potentially unbounded data movement across the memory hierarchy (sockets); the column lists the vectors affected.

Not-scalable overhead was printed in red on the original slide. ‘Low’ overhead is negligible in practice. l is the latency of a global synchronisation.

SLIDE 38

Experiments

1. Summary
2. Classification
3. Strategies
4. Experiments
5. Conclusions and outlook

SLIDE 39

Setup

We use 23 matrices from four different categories:

  • structured, small (vectors fit into the L3 cache),
  • structured, large,
  • unstructured, small,
  • unstructured, large.

We ran experiments on three different architectures:

  • DL-2000: two-socket, six-core Intel Xeon X5660,
  • DL-580: four-socket, ten-core Intel Xeon E7-4870,
  • DL-980: eight-socket, eight-core Intel Xeon E7-2830.

All machines have a bandwidth of 10.67 GB/s per socket.

SLIDE 40

Results per category

Small structured               DL-2000     DL-580      DL-980
Interleaved CSB (1D)           6.0 (12)    12.6 (40)   8.7 (48)
pOSKI (1D)                     4.7 (12)    10.9 (20)   8.4 (16)
Row-distr. block CO-H (1D)     10.1 (12)   36.1 (40)   52.4 (64)
Fully distributed (2D)         3.1 (12)    2.3 (8)     2.1 (8)
Distr. CO-SBD, qp = 64 (2D)    3.6 (8)     3.1 (8)     2.8 (8)

Small unstructured             DL-2000     DL-580      DL-980
Interleaved CSB (1D)           4.8 (12)    17.9 (40)   13.8 (64)
pOSKI (1D)                     6.3 (12)    20.3 (40)   15.5 (48)
Row-distr. block CO-H (1D)     8.7 (12)    32.6 (40)   49.0 (56)
Fully distributed (2D)         4.4 (12)    9.2 (40)    6.1 (16)
Distr. CO-SBD, qp = 64 (2D)    4.4 (8)     7.8 (16)    9.0 (16)

SLIDE 41

Results per category

Large structured               DL-2000     DL-580      DL-980
Interleaved CSB (1D)           5.4 (12)    18.0 (40)   22.0 (64)
pOSKI (1D)                     3.9 (12)    12.2 (40)   11.9 (64)
Row-distr. block CO-H (1D)     3.8 (12)    10.9 (40)   16.4 (64)
Fully distributed (2D)         3.5 (12)    10.6 (30)   10.2 (40)
Distr. CO-SBD, qp = 64 (2D)    3.6 (8)     7.8 (32)    –

Large unstructured             DL-2000     DL-580      DL-980
Interleaved CSB (1D)           7.9 (12)    24.3 (40)   26.3 (64)
pOSKI (1D)                     6.6 (12)    19.4 (40)   16.2 (64)
Row-distr. block CO-H (1D)     5.4 (12)    19.2 (40)   24.6 (64)
Fully distributed (2D)         5.1 (10)    13.6 (40)   12.3 (64)
Distr. CO-SBD, qp = 64 (2D)    5.4 (8)     11.7 (32)   –

SLIDE 42

2d-SBD with qp up to 512

Partitioning for qp = 512 (using Mondriaan with ε = 0.3) was only successful on two structured matrices. We test on the DL-980:

                 cage15                       adaptive
    q\p    8     16     32     64      8      16     32     64
    1      3.0   3.9    5.7    4.8     6.0    11.1   11.7   15.3
    4      2.5   3.5    −      5.5     6.4    9.7    −      15.8
    8      2.5   −      5.9    6.7     5.8    −      14.8   16.3
    16     −     3.7    5.3    −       −      10.9   17.5   −
    32     2.2   3.4    −      −       5.5    10.9   −      −
    64     2.0   −      −      −       6.0    −      −      −
    iCSB   3.2   6.0    10.6   14.6    8.4    15.9   28.5   38.1
    1D-H   5.2   6.9    9.2    14.7    4.4    6.5    10.7   18.8

SLIDE 43

Conclusions and outlook

1. Summary
2. Classification
3. Strategies
4. Experiments
5. Conclusions and outlook

SLIDE 44

Conclusions

  • We categorised existing techniques, and used this classification to construct new SpMV multiplication methods.
  • Of the considered methods, the row-distributed block CO-H strategy performs best, on average.
  • On highly NUMA systems, explicitly distributed strategies continue to speed up, while implicit methods start to stall.
  • The overhead of 2D methods is costlier than the uncontrolled input-vector accesses of 1D methods, but this gap decreases when more sockets are added; 2D methods do seem mandatory for future large-scale computing.

SLIDE 45

Future work

  • Incorporate low-level tuning; the row-distributed block CO-H strategy may then display a gain similar to that of parallel CRS and pOSKI presented here.
  • Improve load balancing for 1D coarse-grained schemes, in order to compete with CSB.
  • Investigate hybrid shared-memory schemes; e.g., the overhead of the 2D schemes increases with p, but applying 2D distributions only over sockets reduces p tremendously.
  • Implement overlapping of communication in the 2D scheme.

SLIDE 46

Thank you

Thank you for your attention! Pre-print and software are available at: http://people.cs.kuleuven.be/~albert-jan.yzelman
