slide-1
SLIDE 1

Parallel SpMV multiplication

Albert-Jan Yzelman (ExaScience Lab / KU Leuven) Dirk Roose (KU Leuven) December 2013

© 2013, ExaScience Lab - A. N. Yzelman, D. Roose

slide-2
SLIDE 2

Acknowledgements

This presentation outlines the paper

  • A. N. Yzelman and D. Roose, “High-level strategies for parallel shared-memory sparse matrix–vector multiplication”, IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), in press: http://dx.doi.org/10.1109/TPDS.2013.31

This work is funded by Intel and by the Institute for the Promotion of Innovation through Science and Technology (IWT), in the framework of the Flanders ExaScience Lab, part of Intel Labs Europe.


slide-3
SLIDE 3

Introduction

Given a sparse m × n matrix A and an n × 1 input vector x, we consider both sequential and parallel computation of

y = Ax.
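Written out componentwise (a standard identity, included here only to fix notation), each output entry gathers the nonzeroes of its row:

    y_i = Σ_{j : a_ij ≠ 0} a_ij · x_j,    for i = 0, ..., m − 1,

so one sequential SpMV performs 2 nz floating-point operations: one multiplication and one addition per nonzero.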


slide-7
SLIDE 7

Introduction

First obstacle: inefficient cache use

Row-major ordering of nonzeroes: linear access of the output vector y; irregular access of the input vector x.


slide-9
SLIDE 9

Introduction

Second obstacle: the SpMV is bandwidth bound

[Figure: roofline plot of attainable GFLOP/sec versus arithmetic intensity (FLOP/byte), showing the peak memory bandwidth slope and the peak floating-point ceiling with vectorization.]

SpMV has low arithmetic intensity: 0.2–0.25. Compression is mandatory!

(Image courtesy of Prof. Wim Vanroose, UA)
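A back-of-the-envelope justification of the 0.2–0.25 figure (an illustration assuming double-precision values, not taken from the slides): each nonzero contributes one multiplication and one addition, i.e. 2 flops, and moves at least its own 8-byte value from memory, giving at most 2/8 = 0.25 flop per byte; also counting a 4-byte column index lowers this to 2/12 ≈ 0.17 flop per byte. Either way the kernel lies far to the left of the roofline's ridge point, i.e. it is bandwidth bound.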


slide-10
SLIDE 10

Introduction

Third obstacle: NUMA architectures

[Figure: a quad-core processor; each core has a 32kB L1 cache, and each pair of neighbouring cores shares a 4MB L2 cache connected to the system interface.]

Processor-level NUMAness affects cache behaviour: data from x or y can be shared between neighbouring cores.


slide-11
SLIDE 11

Introduction

Third obstacle: NUMA architectures

Socket-level NUMAness affects the effective bandwidth. If each processor moves data to (and from) the same memory bank, we are bound by the bandwidth of that single memory bank.


slide-12
SLIDE 12

Introduction: summary

Three factors impede creating an efficient shared-memory parallel SpMV multiplication kernel:

1. inefficient cache use,
2. limited memory bandwidth, and
3. non-uniform memory access (NUMA).


slide-13
SLIDE 13

Increasing cache-efficiency

Adapt the nonzero order:

Original situation

linear access of the output vector y; irregular access of the input vector x.

Ref.: A. N. Yzelman and Rob H. Bisseling, “Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods”, SIAM Journal on Scientific Computing 31(4), pp. 3128-3154 (2009).

slide-14
SLIDE 14

Increasing cache-efficiency

Adapt the nonzero order:

Zig-zag CRS

retains linear access of the output vector y; imposes O(m) additional locality on accesses to the input vector x.
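A minimal sketch of the zig-zag access pattern (illustrative, not the authors' data structure; names are assumed): standard CRS arrays, defined later in the deck, are traversed backwards on odd rows. ZZ-CRS itself stores the nonzeroes already in this order, so that a single forward pass over its arrays suffices.

    #include <stddef.h>

    /* y += A x with zig-zag row traversal over CRS arrays
     * (V: values, J: column indices, I: row pointers of length m+1). */
    void zzcrs_spmv( const size_t m, const double *V, const size_t *J,
                     const size_t *I, const double *x, double *y )
    {
        for( size_t i = 0; i < m; ++i ) {
            if( i % 2 == 0 ) {                /* even rows: left to right */
                for( size_t k = I[ i ]; k < I[ i + 1 ]; ++k )
                    y[ i ] += V[ k ] * x[ J[ k ] ];
            } else {                          /* odd rows: right to left */
                for( size_t k = I[ i + 1 ]; k > I[ i ]; --k )
                    y[ i ] += V[ k - 1 ] * x[ J[ k - 1 ] ];
            }
        }
    }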

Ref.: A. N. Yzelman and Rob H. Bisseling, “Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods”, SIAM Journal on Scientific Computing 31(4), pp. 3128-3154 (2009).

slide-15
SLIDE 15

Increasing cache-efficiency

Adapt the nonzero order using space-filling curves:

Fractal storage using the coordinate format, COO

no linear access of y, but better combined locality on x and y.

Ref.: Haase, Liebmann and Plank, “A Hilbert-Order Multiplication Scheme for Unstructured Sparse Matrices”, International Journal of Parallel, Emergent and Distributed Systems 22(4), pp. 213-220 (2007).

slide-16
SLIDE 16

Increasing cache-efficiency

Or reordering matrix rows and columns:

Reordering based on sparse matrix partitioning

combines with adapting the nonzero order; models an upper bound on cache misses (with ZZ-CRS).

Ref.: A. N. Yzelman and Rob H. Bisseling, “Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods”, SIAM Journal on Scientific Computing 31(4), pp. 3128-3154 (2009).


slide-19
SLIDE 19

Increasing cache-efficiency

Sparse blocking enhances reordering: the corresponding vector elements fit into cache, and block-wise reordering is faster.
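An illustrative calculation with assumed numbers (not from the slides): for β = 2^17 and 8-byte doubles, the parts of x and y touched by a single β × β block take at most 2 · 2^17 · 8 bytes = 2 MB, which fits in the 4MB L2 cache of the quad-core processor sketched earlier.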


slide-20
SLIDE 20

Increasing cache-efficiency

Two options: space-filling curves within blocks or across blocks. Within blocks: Compressed Sparse Blocks (CSB).

Ref.: Buluç, Williams, Oliker, and Demmel, “Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication”, Proc. Parallel & Distributed Processing (IPDPS), IEEE International, pp. 721-733 (2011).


slide-22
SLIDE 22

Increasing cache-efficiency

Two options: space-filling curves within blocks or across blocks. Across blocks: Morton-ordered or recursive block storage.

Ref.: Lorton and Wise, “Analyzing block locality in Morton-order and Morton-hybrid matrices”, SIGARCH Computer Architecture News 35(4), pp. 6-12 (2007).
Ref.: Martone, Filippone, Tucci, Paprzycki, and Ganzha, “Utilizing recursive storage in sparse matrix-vector multiplication - preliminary considerations”, Proceedings of the ISCA 25th International Conference on Computers and Their Applications (CATA), pp. 300-305 (2010).


slide-24
SLIDE 24

Increasing cache-efficiency

Techniques: adapting the nonzero order, adapting the row and column order, and sparse blocking.

(These are orthogonal approaches.)

Other techniques: cache-aware, matrix-aware blocking, e.g., Sparsity and OSKI.


slide-25
SLIDE 25

Compression

Assuming a Hilbert order of nonzeroes, take

    A = | 4 1 3 0 |
        | 0 0 2 3 |
        | 1 0 0 2 |
        | 7 0 1 1 |

The coordinate format (COO) then stores

    V = [ 7 1 4 1 2 3 3 2 1 1 ]
    J = [ 0 0 0 1 2 2 3 3 3 2 ]
    I = [ 3 2 0 0 1 0 1 2 3 3 ]

Storage requirements: Θ(3 nz).
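A minimal SpMV kernel over this COO storage (a sketch, not the authors' implementation; names are illustrative). The triplets are visited in whatever order they are stored in, here the Hilbert order:

    #include <stddef.h>

    /* y += A x over coordinate storage: the k-th triplet is a_{I[k],J[k]} = V[k]. */
    void coo_spmv( const size_t nz, const double *V, const size_t *I,
                   const size_t *J, const double *x, double *y )
    {
        for( size_t k = 0; k < nz; ++k )
            y[ I[ k ] ] += V[ k ] * x[ J[ k ] ];
    }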


slide-26
SLIDE 26

Compression

Compressed COO: Use COO to store each β × β block. Elements of I, J then require only log2 β bits, for a total storage of O(2nz + m).
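For example (an illustration; β is a tunable block size): with β = 2^16, a block-local row or column index fits in 16 bits instead of a full 32- or 64-bit word, so the I and J arrays shrink by a factor of two to four while V stays unchanged.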

Ref.: Buluç, Williams, Oliker, and Demmel, “Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication”, Proc. Parallel & Distributed Processing (IPDPS), IEEE International, pp. 721-733 (2011).

slide-27
SLIDE 27

Compression

Assuming a row-major order of nonzeroes over the same matrix

    A = | 4 1 3 0 |
        | 0 0 2 3 |
        | 1 0 0 2 |
        | 7 0 1 1 |

CRS stores

    V = [ 4 1 3 2 3 1 2 7 1 1 ]
    J = [ 0 1 2 2 3 0 3 0 2 3 ]
    Î = [ 0 3 5 7 10 ]

Storage requirements: Θ(2 nz + m + 1).

(nz is the number of nonzeroes in A.)
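For reference, a minimal sequential kernel over these CRS arrays (a sketch; the row-pointer array Î is written I here, and the names are illustrative):

    #include <stddef.h>

    /* y += A x over CRS storage: row i owns the nonzeroes I[i] .. I[i+1]-1. */
    void crs_spmv( const size_t m, const double *V, const size_t *J,
                   const size_t *I, const double *x, double *y )
    {
        for( size_t i = 0; i < m; ++i )
            for( size_t k = I[ i ]; k < I[ i + 1 ]; ++k )
                y[ i ] += V[ k ] * x[ J[ k ] ];
    }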


slide-28
SLIDE 28

Compression

For the same matrix, bi-directional incremental CRS (BICRS) stores the Hilbert-ordered nonzeroes as

    V  = [ 7 1 4 1 2 3 3 2 1 1 ]
    ∆J = [ 0 4 4 1 5 4 5 4 3 1 ]
    ∆I = [ 3 -1 -2 1 -1 1 1 1 ]

Storage requirements, allowing arbitrary traversals: Θ(2 nz + row jumps + 1).
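For the 4 × 4 example (nz = 10, m = 4, and 7 row jumps under the Hilbert order) this gives: COO 3 · 10 = 30 array entries, BICRS 2 · 10 + 7 + 1 = 28 entries, and row-major CRS 2 · 10 + 4 + 1 = 25 entries. CRS is smallest but fixes the row-major traversal; BICRS supports the Hilbert traversal at close to CRS cost, and Compressed BICRS (next slide) shrinks ∆I and ∆J further.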

Ref.: Yzelman and Bisseling, “A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve”, Progress in Industrial Mathematics at ECMI 2010, pp. 627-634 (2012).

slide-29
SLIDE 29

Compression

Compressed BICRS: Using BICRS to store each β × β block, compressing ∆I, ∆J. Total storage may be less than O(2nz + m).

Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31, in press (2013).

slide-30
SLIDE 30

Distributions and NUMA

Implicit distribution, centralised local allocation: If each processor moves data to the same single memory element, the bandwidth is limited by a single controller.


slide-31
SLIDE 31

Distributions and NUMA

Implicit distribution, centralised interleaved allocation: If each processor moves data from all memory elements, the bandwidth multiplies if accesses are uniformly random.


slide-32
SLIDE 32

Distributions and NUMA

Explicit distribution, distributed local allocation: If each processor moves data from and to its own unique memory element, the bandwidth multiplies.


slide-33
SLIDE 33

Distributions and NUMA

Use interleaved allocation for everything:
  • no bounds on non-local data movement!

Explicitly distribute everything:
  • only local data; explicit inter-process data movement.

Mixed approach, y and A explicit, x interleaved:
  • no bounds on non-local data movement for x.
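A sketch of how the mixed approach could be realised with libnuma (an illustration under assumed names, not the authors' code; error handling and the matching numa_free calls are omitted): the local parts of A and y are placed on the socket that owns them, while x is interleaved across all sockets.

    #include <stddef.h>
    #include <numa.h>   /* libnuma: numa_alloc_onnode, numa_alloc_interleaved */

    /* Allocate the per-socket parts of A (values) and y on a given socket,
     * and the globally shared input vector x interleaved over all sockets. */
    double * alloc_mixed( const size_t local_nnz, const size_t local_m,
                          const size_t n, const int socket,
                          double **V, double **y )
    {
        *V = (double*) numa_alloc_onnode( local_nnz * sizeof(double), socket );
        *y = (double*) numa_alloc_onnode( local_m   * sizeof(double), socket );
        /* pages of x are spread round-robin, so no single memory bank bounds it */
        return (double*) numa_alloc_interleaved( n * sizeof(double) );
    }

Alternatively, under the default first-touch policy each thread can simply initialise its own parts of A and y so that their pages land on its local socket.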


slide-34
SLIDE 34

Choices

A state-of-the-art SpMV strategy has to choose:

  • a cache-efficient strategy (aware, oblivious, ...),
  • a competitive data structure (compression), and
  • a NUMA-efficient distribution strategy.

Fine-grained examples:

  • OpenMP CRS: regular CRS, interleaved;

        #pragma omp parallel for private( i, k ) schedule( dynamic, 8 )
        for i = 0 to m − 1 do
            for k = Î_i to Î_{i+1} − 1 do
                add V_k · x_{J_k} to y_i

  • CSB: Z-curves, compressed (blocked) COO, interleaved.

When fine-grained, is interleaving everything mandatory?
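A compilable rendering of the OpenMP CRS example above (a sketch over the CRS arrays from the Compression slides; the schedule parameters come from the slide, everything else is illustrative):

    #include <stddef.h>

    /* Fine-grained parallel y += A x over CRS: rows are handed out to threads
     * in chunks of 8; all arrays are assumed to be allocated interleaved. */
    void openmp_crs_spmv( const size_t m, const double *V, const size_t *J,
                          const size_t *I, const double *x, double *y )
    {
        #pragma omp parallel for schedule( dynamic, 8 )
        for( long i = 0; i < (long) m; ++i )
            for( size_t k = I[ i ]; k < I[ i + 1 ]; ++k )
                y[ i ] += V[ k ] * x[ J[ k ] ];
    }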


slide-35
SLIDE 35

Coarse-grained SpMVs

1D: Hilbert curve, compressed (blocked) BICRS, explicit allocation of y and A.

Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31, in press (2013).

slide-36
SLIDE 36

Coarse-grained SpMVs

2D: using partitioning and reordering:

Ref.: Yzelman and Bisseling, “Two-dimensional cache-oblivious sparse matrix-vector multiplication”, Parallel Computing 37(12), pp. 806-819 (2011).


slide-41
SLIDE 41

Coarse-grained SpMVs

2D: using partitioning and reordering:

Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31, in press (2013).

slide-42
SLIDE 42

BSP 2D implementation

for each local a_ij ≠ 0 do
    if x_j is not local then
        bsp_get x_j from the remote process
bsp_sync
multiply y = Ax (using local nonzeroes only)
for each local a_ij ≠ 0 do
    if y_i is non-local then
        bsp_send (i, y_i) to the remote process
bsp_sync
while bsp_qsize() > 0 do
    bsp_move (i, α) from the queue
    add α to the local y_i

Ref.: Yzelman, Bisseling, Roose, and Meerbergen, “MulticoreBSP for C: a high-performance library for shared-memory parallel programming”, International Journal of Parallel Programming, doi: 10.1007/s10766-013-0262-9, in press (2013).

slide-43
SLIDE 43

Experiments: CPUs

Matrix        Size          Nonzeroes     Origin
Freescale1     3 428 755    17 052 626    Semiconductor industry
adaptive       6 815 744    27 248 640    Numerical simulation
ldoor            952 203    42 493 817    Structural engineering
wiki2007       3 566 907    45 030 389    Link matrix
road_usa      23 947 347    57 708 624    Road network

Average speedups relative to sequential CRS:

              4 x 10    8 x 8
OpenMP CRS      8.8       7.2
PThread 1D     13.6      20.0
Cilk CSB       22.9      26.9
BSP 2D         21.3      30.8

4 x 10: HP DL-580, 4 sockets, 10-core Intel Xeon E7-4870
8 x 8: HP DL-980, 8 sockets, 8-core Intel Xeon E7-2830

Ref.: Yzelman, Bisseling, Roose, and Meerbergen, “MulticoreBSP for C: a high-performance library for shared-memory parallel programming”, International Journal of Parallel Programming, doi: 10.1007/s10766-013-0262-9, in press (2013).

slide-44
SLIDE 44

Experiments: Xeon Phi

60 cores, 4 HW threads per core, scales ‘too slowly’:

[Figure: average SpMV speed (Gflop/s) on an Intel Xeon Phi versus the number of HW threads, for the matrices adaptive and wiki05 using the 1D PThreads scheme.]

Latency bound; add more threads, or employ vectorisation


slide-45
SLIDE 45

Experiments: Xeon Phi

[Figure: average SpMV multiplication speed (Gflop/s) on the Xeon Phi with 240 threads for the matrices bcsstk17, lhr34, bcsstk32, nug30, s3dkt3m2, tbdlinux, Freescale1, cage14, wiki05, adaptive, nd24k, RM07R, and GL7d18, comparing blockings 1x1, 1x8, 2x4, 4x2, and 8x1.]


slide-46
SLIDE 46

Conclusion

We have seen techniques for cache-oblivious SpMV multiplication and for sparse matrix compression. For new and upcoming architectures: highly NUMA architectures require 2D partitioning, and vectorisation can help hide latency (even within a bandwidth-bound computation).

http://www.multicorebsp.com

people.cs.kuleuven.be/~albert-jan.yzelman/software.php
