SLIDE 1

Generalised vectorisation for sparse matrix–vector multiplication

Albert-Jan Yzelman 22nd of July, 2014

© 2014, ExaScience Lab - A. N. Yzelman

SLIDE 2

Context

Solve y = Ax, with A an m × n input matrix, x an input vector of length n, and y an output vector of length m.

Structured and unstructured sparse matrices, for example Emilia 923 (unstructured mesh computations) and RH Pentium (circuit simulation).

SLIDE 3

Context

Past work on high-level approaches to SpMV multiplication:

             4 x 10    8 x 8
OpenMP CRS     8.8      7.2
PThread 1D    13.6     20.0
Cilk CSB      22.9     26.9
BSP 2D        21.3     30.8

4 x 10: HP DL-580, 4 sockets, 10-core Intel Xeon E7-4870
8 x 8: HP DL-980, 8 sockets, 8-core Intel Xeon E7-2830

Yzelman and Roose, High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).

Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memory parallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).

This talk instead focuses on sequential, low-level optimisations.

SLIDE 4

Context

One operation, one flop:
scalar addition: a := b + c
scalar multiplication: a := b · c

SLIDE 5

Context

One operation, l flops: vectorised addition

$$\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{l-1} \end{pmatrix} := \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{l-1} \end{pmatrix} + \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{l-1} \end{pmatrix}$$

SLIDE 6

Context

One operation, l flops: vectorised multiplication

$$\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{l-1} \end{pmatrix} := \begin{pmatrix} b_0 \cdot c_0 \\ b_1 \cdot c_1 \\ \vdots \\ b_{l-1} \cdot c_{l-1} \end{pmatrix}$$
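For concreteness, a minimal sketch of such single-instruction, l-flop operations, assuming AVX on x86 with l = 4 doubles; the function name and the choice of l are illustrative, not part of the talk:

```cpp
#include <immintrin.h>

// One AVX instruction performs l = 4 double-precision additions or multiplications.
void vector_add_mul(const double* b, const double* c, double* add_out, double* mul_out) {
    __m256d vb = _mm256_loadu_pd(b);   // (b0, b1, b2, b3)
    __m256d vc = _mm256_loadu_pd(c);   // (c0, c1, c2, c3)
    _mm256_storeu_pd(add_out, _mm256_add_pd(vb, vc)); // (b0+c0, ..., b3+c3), one instruction
    _mm256_storeu_pd(mul_out, _mm256_mul_pd(vb, vc)); // (b0*c0, ..., b3*c3), one instruction
}
```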

SLIDE 7

Context

Exploiting sparsity through computation using only nonzeroes leads to sparse data structures:

i = (0, 0, 1, 1, 2, 2, 2, 3)
j = (0, 4, 2, 4, 1, 3, 5, 2)
v = (a00, a04, ..., a32)

for k = 0 to nz − 1
    y_{i_k} := y_{i_k} + v_k · x_{j_k}

The coordinate (COO) format: two flops versus five data words.
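A direct C++ rendering of the COO kernel above; the container types and the function name are illustrative:

```cpp
#include <vector>
#include <cstddef>

// Coordinate (COO) SpMV, y := y + A x: per nonzero, two flops against five data
// accesses (i[k], j[k], v[k], x[j[k]], y[i[k]]).
void spmv_coo(const std::vector<std::size_t>& i, const std::vector<std::size_t>& j,
              const std::vector<double>& v, const std::vector<double>& x,
              std::vector<double>& y) {
    for (std::size_t k = 0; k < v.size(); ++k)
        y[i[k]] += v[k] * x[j[k]];
}
```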

SLIDE 8

Context

[Roofline figure: attainable Gflop/s versus arithmetic intensity (flop/byte), showing the peak memory bandwidth bound and the peak floating-point bound with vectorisation.]

Theoretical turnover points for the Intel Xeon E3-1225:
• 64 operations per word (with vectorisation),
• 16 operations per word (without vectorisation).

(Image courtesy of Prof. Wim Vanroose, UA)
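As a rough check, combining the two-flops-versus-five-words count of the COO format with these turnover points (reading "operations per word" as flops per data word) shows how far a plain sparse kernel sits below the compute-bound regime:

$$\frac{2\ \text{flops}}{5\ \text{words}} = 0.4\ \text{flops/word} \ll 16 \ll 64,$$

so the SpMV sits firmly on the bandwidth-bound side of the roofline.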

SLIDE 9

Context

[Roofline figure as on the previous slide (image courtesy of Prof. Wim Vanroose, UA).]

Theoretical turnover points for the Intel Xeon Phi:
• 28 operations per word (with vectorisation),
• 4 operations per word (without vectorisation).


SLIDE 10

Motivation

Why care about vectorisation (in sparse computations)?

• Computations can become latency bound.
• Good caching increases the effective bandwidth and reduces data access latencies.
• Vectorisation allows retrieving multiple data elements per CPU cycle, giving better latency hiding.

SLIDE 11

Motivation

Vectorisation also strongly relates to blocking and tiling. Lots of earlier work:

1. classical vector computing (ELLPACK / segmented scans),
2. SpMV register blocking (Blocked CRS, OSKI),
3. sparse blocking and tiling.

This work generalises the earlier approaches (1, 2) and illustrates sparse blocking and tiling (3) through the SpMV multiplication as well as the sparse matrix powers kernel.

SLIDE 12

Sequential SpMV

Blocking to fit subvectors into cache, cache-obliviousness to increase cache efficiency.

Ref.: Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).


SLIDE 16

SpMV vectorisation

What is needed for vectorisation:
• support for arbitrary nonzero traversals,
• handling of non-contiguous columns,
• handling of non-contiguous rows.

Basic operation using vector registers r_i: r1 := r1 + r2 · r3, the vectorised multiply-add.

How to get the right data into the vector registers?
• nonzeroes of A: streaming loads;
• elements from x, y: gather/scatter.
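A minimal sketch of one such step, assuming AVX2 with FMA and l = 4 doubles; since AVX2 lacks a scatter instruction (AVX-512 adds one), the y elements are kept contiguous here, so this illustrates only the gather and the multiply-add, not the talk's actual kernel:

```cpp
#include <immintrin.h>
#include <cstdint>

// One vectorised multiply-add r1 := r1 + r2 * r3 for l = 4 doubles (AVX2 + FMA).
// Nonzero values of A are loaded contiguously; x is gathered through 32-bit column indices.
void spmv_quad(const double* a_vals, const int32_t* col_idx,
               const double* x, double* y_block /* 4 contiguous outputs */)
{
    __m128i idx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(col_idx));
    __m256d r3  = _mm256_loadu_pd(a_vals);         // streaming-style load of nonzeroes
    __m256d r2  = _mm256_i32gather_pd(x, idx, 8);  // gather x[col_idx[0..3]]
    __m256d r1  = _mm256_loadu_pd(y_block);        // load current y values
    r1 = _mm256_fmadd_pd(r2, r3, r1);              // r1 := r2 * r3 + r1
    _mm256_storeu_pd(y_block, r1);                 // write back; a scatter would be needed
                                                   // if the y elements were not contiguous
}
```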

SLIDE 17

SpMV vectorisation

Streaming loads only apply to the sparse matrix data structure:

for k = 0 to nz − 1
    y_{i_k} := y_{i_k} + v_k · x_{j_k}

The streaming loads (shown in blue on the slide) cover i, j, and v; for accesses to x and y, alternatives are necessary.

SLIDE 18

SpMV vectorisation

‘Gather’: read from random memory areas into a single vector register.

SLIDE 20

SpMV vectorisation

‘Scatter’: write back to random memory areas (the inverse of gather).

SLIDE 21

ELLPACK

Sketch of the (sliced) ELLPACK SpMV multiply (l = 3):

Kincaid and Fassiotto, ITPACK software and parallelization, in Advances in Computer Methods for Partial Differential Equations, Proc. 7th Intl. Conf. on Computer Methods for Partial Differential Equations (1992).
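A scalar sketch of a sliced-ELLPACK style SpMV, assuming slices of l rows padded to the longest row within each slice and stored column-major; the struct layout and names are illustrative, not the talk's implementation:

```cpp
#include <vector>
#include <cstddef>

// Sliced ELLPACK: rows are grouped into slices of l rows; every row in a slice is
// padded to the slice's longest row. Within a slice, entries are stored column-major
// so that one "column" of the slice maps onto one vector register of length l.
struct SlicedEllpack {
    std::size_t l;                       // slice height (vector length)
    std::vector<std::size_t> slice_ptr;  // start of each slice in val/col (length #slices + 1)
    std::vector<std::size_t> col;        // column indices; padded entries point at column 0
    std::vector<double>      val;        // values; padded entries are explicit zeroes
};

// Assumes y has been padded to a multiple of l entries.
void spmv(const SlicedEllpack& A, const std::vector<double>& x, std::vector<double>& y) {
    const std::size_t numSlices = A.slice_ptr.size() - 1;
    for (std::size_t s = 0; s < numSlices; ++s) {
        const std::size_t begin = A.slice_ptr[s], end = A.slice_ptr[s + 1];
        const std::size_t width = (end - begin) / A.l;   // padded row length of this slice
        for (std::size_t c = 0; c < width; ++c)          // one vector op per padded column
            for (std::size_t r = 0; r < A.l; ++r) {      // the l lanes of that vector op
                const std::size_t k = begin + c * A.l + r;
                y[s * A.l + r] += A.val[k] * x[A.col[k]];
            }
    }
}
```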

SLIDE 25

Segmented scan/reduce

Sketch of a segmented-reduce enabled SpMV multiply:

j = (0, 4, 2, 4, 1, 3, 5, 2)  →  (x0, x4, x2, x4, x1, x3, x5, x2)   [gather]
v = (a00, a04, a12, ..., a32)  →  (x0a00, x4a04, x2a12, ..., x2a32)   [vectorised multiply]
summation control array  →  (x0a00 + x4a04,  x2a12 + x4a14,  x1a21 + x3a23 + x5a25,  x2a32)   [segmented reduce]
i = (0, 0, 1, 1, 2, 2, 2, 3)  →  (y0, y1, y2, y3)   [gather/load + vectorised summation]
→ scatter to y

Blelloch, Heroux, and Zagha, Segmented operations for sparse matrix computation on vector multiprocessors (1993).
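The semantics of the segmented-reduce step, sketched in scalar C++; the flag array marking segment starts stands in for the summation control array, while a vectorised version performs the same reduction inside vector registers:

```cpp
#include <vector>
#include <cstddef>

// Segmented reduce: sum consecutive products that belong to the same output row.
// flag[k] is true when product k starts a new segment (i.e. a new row of A).
std::vector<double> segmented_reduce(const std::vector<double>& products,
                                     const std::vector<bool>& flag) {
    std::vector<double> sums;
    for (std::size_t k = 0; k < products.size(); ++k) {
        if (flag[k] || sums.empty()) sums.push_back(0.0);  // open a new segment
        sums.back() += products[k];                        // accumulate into the current segment
    }
    return sums;
}
```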

SLIDE 26

Blocked CRS

The compressed row storage (CRS) format:

s = (0, 2, 4, 7, 8)
j = (0, 4, 2, 4, 1, 3, 5, 2)
v = (a00, a04, ..., a32)

for i = 0 to m − 1
    for k = s_i to s_{i+1} − 1
        y_i := y_i + v_k · x_{j_k}

Eisenstat, Gursky, Schultz, and Sherman, Yale sparse matrix package II: the nonsymmetric codes, Defense Technical Information Center, technical report (1977).
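The CRS pseudocode above translates directly into C++; containers and the function name are illustrative:

```cpp
#include <vector>
#include <cstddef>

// Plain CRS SpMV, y := y + A x. 's' holds row start offsets (length m + 1),
// 'j' the column index and 'v' the value of each nonzero, stored row by row.
void spmv_crs(const std::vector<std::size_t>& s, const std::vector<std::size_t>& j,
              const std::vector<double>& v, const std::vector<double>& x,
              std::vector<double>& y) {
    const std::size_t m = s.size() - 1;
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t k = s[i]; k < s[i + 1]; ++k)
            y[i] += v[k] * x[j[k]];
}
```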

SLIDE 27

Blocked CRS

Blocking the example matrix on 2 × 2 blocks yields small dense blocks V_k, a block-row pointer array s, and block-column indices j:

$$V_0 = \begin{pmatrix} a_{00} & \cdot \\ \cdot & \cdot \end{pmatrix},\quad V_1 = \begin{pmatrix} \cdot & \cdot \\ a_{12} & \cdot \end{pmatrix},\quad V_2 = \begin{pmatrix} a_{04} & \cdot \\ a_{14} & \cdot \end{pmatrix},\quad V_3,\ V_4,\ V_5$$

s = (0, 3, 6), j = (0, 1, 2, 0, 1, 2)

for i = 0 to m/2 − 1
    for k = s_i to s_{i+1} − 1
        $\begin{pmatrix} y_{2i} \\ y_{2i+1} \end{pmatrix} := \begin{pmatrix} y_{2i} \\ y_{2i+1} \end{pmatrix} + V_k \cdot \begin{pmatrix} x_{2j_k} \\ x_{2j_k+1} \end{pmatrix}$

Pinar & Heath, Improving performance of sparse matrix–vector multiplication (1999); Vuduc & Moon, Fast sparse matrix–vector multiplication by exploiting variable block structure (2005).
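A scalar sketch of the 2 × 2 Blocked CRS multiply, assuming each block V_k is stored as four contiguous values in row-major order (that storage order is an assumption, not stated on the slide):

```cpp
#include <vector>
#include <cstddef>

// Blocked CRS with fixed 2x2 blocks: V stores each block as 4 contiguous values
// (row-major within the block), s holds block-row offsets, j the block-column index.
void spmv_bcrs2x2(const std::vector<std::size_t>& s, const std::vector<std::size_t>& j,
                  const std::vector<double>& V, const std::vector<double>& x,
                  std::vector<double>& y) {
    const std::size_t blockRows = s.size() - 1;       // m / 2
    for (std::size_t i = 0; i < blockRows; ++i)
        for (std::size_t k = s[i]; k < s[i + 1]; ++k) {
            const double* b = &V[4 * k];              // the 2x2 block V_k
            const std::size_t c = 2 * j[k];           // first column this block touches
            y[2 * i]     += b[0] * x[c] + b[1] * x[c + 1];
            y[2 * i + 1] += b[2] * x[c] + b[3] * x[c + 1];
        }
}
```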

SLIDE 28

Vectorised BICRS

while there are nonzeroes do
    load nonzeroes into r3 (load nonzero indices into r4, r5)
    gather corresponding elements from x into r2 (using r4)
    gather corresponding elements from y into r1 (using r5)
    do the vectorised multiply-add r1 := r1 + r2 · r3
    sum-reduce r1 into the unique elements from y
    scatter the unique elements from r1 back to y
end while

Freedom to choose the number p of unique elements from y:
• choosing p = 1 leads to a segmented-reduce type SpMV,
• choosing p = l leads to sliced ELLPACK,
• other choices lead to generic p × q blocks, reminiscent of Blocked CRS, with pq = l the vector register width.
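A scalar emulation of one generic p × q step, to make the lane bookkeeping explicit; absolute row and column indices are used here for illustration, whereas the actual BICRS structure stores compressed relative increments:

```cpp
#include <vector>
#include <cstddef>

// Generic p x q vectorised step, emulated with scalar code: each step consumes
// l = P*Q (possibly zero-padded) nonzeroes, gathers their x entries, performs one
// multiply per lane, sum-reduces the Q lanes of every row, and scatters into y.
// Lane layout assumption: lanes p*Q .. p*Q+Q-1 belong to output row rowIdx[p].
template <std::size_t P, std::size_t Q>
void vectorised_block_step(const double* vals,         // l packed nonzero values (r3)
                           const std::size_t* colIdx,  // l column indices (r4)
                           const std::size_t* rowIdx,  // P unique row indices (r5)
                           const std::vector<double>& x, std::vector<double>& y) {
    double r2[P * Q], r1[P * Q];
    for (std::size_t lane = 0; lane < P * Q; ++lane)
        r2[lane] = x[colIdx[lane]];                    // gather from x
    for (std::size_t lane = 0; lane < P * Q; ++lane)
        r1[lane] = vals[lane] * r2[lane];              // vectorised multiply(-add)
    for (std::size_t p = 0; p < P; ++p) {
        double sum = 0.0;
        for (std::size_t q = 0; q < Q; ++q)            // sum-reduce the Q lanes of row p
            sum += r1[p * Q + q];
        y[rowIdx[p]] += sum;                           // scatter back to y
    }
}
```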

SLIDE 35

Vectorised BICRS

Example multiplication using 2 × 2 blocking. The example matrix A has ten nonzeroes (values 4, 1, 3, 2, 3, 1, 2, 7, 1, 1); they are stored as zero-padded blocks of four values (V), with corresponding index arrays J̃ and Ĩ:

V = [ 7 1 0 0 | 4 1 2 0 | 3 3 0 0 | 2 0 1 1 ]
J̃ = [ 0 0 0 0 | 4 1 2 0 | 2 1 0 0 | 5 0 -1 0 ]
Ĩ = [ 3 2 | -3 -1 | 2 1 ]

The multiplication proceeds block by block:
1. Multiply (7, 1, 0, 0) with (x0, x0, 0, 0) and add the result to (y3, y2).
2. Multiply (4, 1, 2, 0) with (x0, x1, x2, 0) and add the result to (y0, y1).
3. Multiply (3, 3, 0, 0) with (x2, x3, 0, 0) and add the result to (y0, y1).
4. Multiply (2, 0, 1, 1) with (x3, 0, x2, x3) and add the result to (y2, y3).

SLIDE 39

SpMV parallelisation

Implementation on the Intel Xeon Phi:
• load-balanced row distribution over 240 threads;
• each thread locally uses Hilbert-ordered sparse blocks;
• nonzeroes within blocks are stored using vectorised BICRS;
• 1 × 8, 2 × 4, 4 × 2, and 8 × 1 block sizes.

Yzelman and Roose, “High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication”, IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).

SLIDE 40

SpMV results

[Bar chart: speed of SpMV multiplication on the Intel Xeon Phi, in Gflop/s, for nd24k, s3dkt3m2, Freescale1, wiki2007, cage15, and adaptive, comparing the 1x1 (not vectorised), 1x8, 2x4, 4x2, and 8x1 block sizes.]

Results of the partially distributed parallel SpMV multiplication using vectorised BICRS on an Intel Xeon Phi 7120A.

SLIDE 41

SpMV results

[Bar chart: fill-in relative to non-vectorised BICRS for the 1x8, 2x4, 4x2, and 8x1 block sizes on the same six matrices.]

In general, blocking sizes that result in the least fill-in achieve the highest performance.

SLIDE 42

SpMV results

[Bar chart: SpMV performance in Gflop/s on the same six matrices for the Intel Xeon Phi 7120A (vectorised BICRS), a non-vectorised dual-socket Intel Xeon (Ivy Bridge), and an NVIDIA K20x running cuSPARSE (HYB).]

Comparing vectorised, non-vectorised, and a competing approach on the Xeon Phi, a dual-socket machine, and a GPU.

SLIDE 43

SpMV results

Aggregate results over 24 matrices:

                    Structured   Unstructured   Average
Intel Xeon Phi         21.6           8.7         15.2
2x Ivy Bridge CPU      23.5          14.6         19.0
NVIDIA K20X GPU        16.7          13.3         15.0

Generalised statements:
• Large structured matrices: GPUs perform best.
• Large unstructured matrices: CPUs perform best.
• Smaller matrices: Xeon Phi or CPUs.

But mostly:

no one solution fits all.

SLIDE 44

Conclusions

Optimised the BICRS data structure so that it:
• fully exploits vector operations through gather/scatter,
• controls fill-in through block size selection,
• supports arbitrary nonzero traversals,
• retains all opportunities for compression.

The new approach generalises several earlier approaches.

Software: http://albert-jan.yzelman.net/software#SL
