CSR SpMV with guaranteed workload balance

Merge-based Parallel Decomposition

NVIDIA Research

Duane Merrill (dumerrill@nvidia.com) Michael Garland (mgarland@nvidia.com)

January 26, 2019

  • D. Merrill and M. Garland, "Merge-Based Parallel Sparse Matrix-Vector Multiplication," in SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, 2016, pp. 678-689. doi: 10.1109/SC.2016.57

slide-2
SLIDE 2

2

My soapbox

1. Algorithmic parallel decomposition matters too

  • Versus delegating scheduling entirely to the compiler/runtime

2. Workload imbalance in sparse applications

  • The biggest killer of machine utilization
  • Performance response for arbitrary inputs: reliable vs. capricious “face-planting”

3. Standard data formats

  • Performance portability

4. Evaluation methodology

  • Avoid overfitting by benchmarking on thousands to millions of datasets, not tens

PERFORMANCE (IN)CONSISTENCY

“Consistency is far better than rare moments of greatness”

  • Scott Ginsberg



SPARSE MATRIX-VECTOR MULTIPLICATION

Lots of available parallelism

[Figure: y = A * x for the running example]

       sparse matrix A           dense vector x     dense vector y
    | 1.0   --  1.0   -- |          | 1.0 |     | (1.0)(1.0) + (1.0)(1.0)                           |
    |  --   --   --   -- |    *     | 1.0 |  =  | 0.0                                               |
    |  --   --  3.0  3.0 |          | 1.0 |     | (3.0)(1.0) + (3.0)(1.0)                           |
    | 4.0  4.0  4.0  4.0 |          | 1.0 |     | (4.0)(1.0) + (4.0)(1.0) + (4.0)(1.0) + (4.0)(1.0) |
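For reference in the decomposition discussion that follows, here is a minimal sequential CsrMV over this example. It is an illustrative sketch (not code from the talk), assuming the usual CSR arrays; the column indices are read off the figure above.

```cpp
// Minimal sequential CsrMV for the example above (illustrative sketch, not from the talk).
#include <cstdio>
#include <vector>

int main() {
    // CSR arrays for the example matrix A (row 1 is empty); column indices are inferred from the figure.
    std::vector<int>    row_offsets    = {0, 2, 2, 4, 8};
    std::vector<int>    column_indices = {0, 2, 2, 3, 0, 1, 2, 3};
    std::vector<double> values         = {1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0};
    std::vector<double> x(4, 1.0), y(4, 0.0);

    // One dot product per row: y[row] = sum of values[k] * x[column_indices[k]]
    for (int row = 0; row < 4; ++row) {
        double sum = 0.0;
        for (int k = row_offsets[row]; k < row_offsets[row + 1]; ++k)
            sum += values[k] * x[column_indices[k]];
        y[row] = sum;
    }

    for (double v : y) printf("%.1f ", v);   // prints: 2.0 0.0 6.0 16.0
    printf("\n");
    return 0;
}
```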


CSR PARALLEL DECOMPOSITION

Option (a): row-based splitting of A

  • Assign each thread an equal share of rows (here, one row per thread p0-p3)

[Figure: the CSR arrays of A, partitioned by rows]
    row_offsets:    0  2  2  4  8
    column_indices: 0  2  2  3  0  1  2  3
    values:         1.0  1.0  3.0  3.0  4.0  4.0  4.0  4.0

Imbalance! Per-thread work is proportional to row length: p3 owns four nonzeros while p1 owns none. A sketch of this row-splitting scheme follows the slide.

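A minimal sketch of option (a), written here with an OpenMP static schedule so that each thread owns one contiguous, equally-sized block of rows. The names and the OpenMP choice are illustrative, not the talk's; the point is that a thread's workload is whatever nonzeros its rows happen to contain.

```cpp
// Option (a) sketch: row splitting. Each thread gets an equal share of *rows*;
// its workload is proportional to the nonzeros those rows contain.
#include <vector>

void csrmv_row_split(int num_rows,
                     const std::vector<int>& row_offsets,        // size num_rows + 1
                     const std::vector<int>& column_indices,
                     const std::vector<double>& values,
                     const std::vector<double>& x,
                     std::vector<double>& y,
                     int num_threads)
{
    // schedule(static) hands each thread one contiguous, equally-sized block of rows,
    // regardless of how many nonzeros fall into that block.
    #pragma omp parallel for num_threads(num_threads) schedule(static)
    for (int row = 0; row < num_rows; ++row) {
        double sum = 0.0;
        for (int k = row_offsets[row]; k < row_offsets[row + 1]; ++k)
            sum += values[k] * x[column_indices[k]];
        y[row] = sum;
    }
}
```

On the example matrix this gives p3 twice the work of p0 and leaves p1 idle, which is exactly the imbalance called out above.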


CSR PARALLEL DECOMPOSITION

Option (b): nonzero splitting of A

  • Assign each thread an equal share of nonzeros (here, two nonzeros per thread p0-p3)
  • Each thread must search row_offsets to find the row containing its first nonzero, and rows that straddle thread boundaries need their partial sums combined afterwards

[Figure: the CSR arrays of A, partitioned by nonzeros]
    row_offsets:    0  2  2  4  8
    column_indices: 0  2  2  3  0  1  2  3
    values:         1.0  1.0  3.0  3.0  4.0  4.0  4.0  4.0

Imbalance! The nonzeros are evenly divided, but the row-side work (searching and writing row_offsets entries, including empty rows) is not, so a thread whose nonzero range spans many rows still does more work. A sketch of this nonzero-splitting scheme follows the slide.

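A minimal sketch of option (b): each thread takes an equal slice of the nonzeros and binary-searches row_offsets for the row owning its first nonzero. To keep the sketch short it gathers per-thread partial sums into a scratch array and combines them serially at the end; a real implementation keeps only the boundary partials, and the per-row bookkeeping that this sketch hides is where imbalance re-enters. Names are illustrative, not from the talk.

```cpp
// Option (b) sketch: nonzero splitting with a row search and a (serial) combine pass.
#include <algorithm>
#include <vector>

void csrmv_nonzero_split(int num_rows, int num_threads,
                         const std::vector<int>& row_offsets,     // size num_rows + 1
                         const std::vector<int>& column_indices,
                         const std::vector<double>& values,
                         const std::vector<double>& x,
                         std::vector<double>& y)
{
    int num_nonzeros = (int)values.size();
    int share = (num_nonzeros + num_threads - 1) / num_threads;

    // Simplistic scratch: one partial sum per (thread, row). Real implementations only
    // carry partials for the rows that straddle thread boundaries.
    std::vector<double> partial((size_t)num_threads * num_rows, 0.0);

    for (int t = 0; t < num_threads; ++t) {                       // (would run in parallel)
        int nz_begin = std::min(t * share, num_nonzeros);
        int nz_end   = std::min(nz_begin + share, num_nonzeros);
        if (nz_begin >= nz_end) continue;

        // Binary search: the row containing this thread's first nonzero
        int row = (int)(std::upper_bound(row_offsets.begin(), row_offsets.end(), nz_begin)
                        - row_offsets.begin()) - 1;

        for (int k = nz_begin; k < nz_end; ++k) {
            while (k >= row_offsets[row + 1]) ++row;              // step past finished/empty rows
            partial[(size_t)t * num_rows + row] += values[k] * x[column_indices[k]];
        }
    }

    // Combine the per-thread partials into the output vector
    for (int row = 0; row < num_rows; ++row) {
        double sum = 0.0;
        for (int t = 0; t < num_threads; ++t)
            sum += partial[(size_t)t * num_rows + row];
        y[row] = sum;
    }
}
```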


CSR PARALLEL DECOMPOSITION

Option (c): logical merger

  • Treat the row_offsets list and the sequence of nonzero indices as if they were merged into a single logical list of length (rows + nonzeros)
  • Assign each thread an equal share of that merged list (p0-p3), so every thread receives the same total amount of row work plus nonzero work, regardless of how the nonzeros are distributed across rows

[Figure: the CSR arrays of A (row_offsets, column_indices, values) with threads p0-p3 each covering an equal span of the logically merged row_offsets / nonzero sequence]


IMBALANCE: CsrMV WITH ~35M NON-ZEROS

                                   thermomech_dK                cnr-2000             ASIC_320k
                                   (temperature deformation)    (Web connectivity)   (circuit simulation)
  Row-length coeff. of variation   0.10                         2.1                  61.4
  MKL (DP-GFLOPs)                  17.9                         13.4                 11.8
  Merge-based (DP-GFLOPs)          21.2                         22.8                 23.2
  cuSPARSE (DP-GFLOPs)             12.4                         5.9                  0.12
  Merge-based (DP-GFLOPs)          15.5                         16.7                 14.1

Platforms: NVIDIA K40M; 2x Intel Xeon E5-2690 (24-cores each)

  • 35% slower: MKL drops from 17.9 to 11.8 DP-GFLOPs as row-length variation grows
  • 100x faceplant: cuSPARSE drops from 12.4 to 0.12 DP-GFLOPs


GPU CsrMV PERFORMANCE LANDSCAPE

The entire Florida Sparse Matrix Collection (4.2K datasets, NVIDIA K40M)

[Charts: per-matrix runtime, 0.001-1000 ms (log scale), matrices ordered by size; one panel for cuSPARSE CsrMV and one for Merge-based CsrMV]

Merge-based runtime is highly correlated with problem size!


CPU CsrMV PERFORMANCE LANDSCAPE

The entire Florida Sparse Matrix Collection (4.2K datasets, 2x Intel Xeon E5-2690)

[Charts: per-matrix runtime, 0.001-1000 ms (log scale), matrices ordered by size; one panel for MKL CsrMV and one for Merge-based CsrMV]

Merge-based runtime is highly correlated with problem size!


CSRMV VISUALIZATION AS 2D “MERGE-PATH”

CsrMV visualization as 2D "merge-path"

[Figure: a 2D merge grid. The row_offsets list (2, 2, 4, 8) runs along the top; the nonzero index sequence ℕ runs down the side, each entry annotated with its Ax dot-product term: (1.0)(1.0), (1.0)(1.0), (3.0)(1.0), (3.0)(1.0), (4.0)(1.0), (4.0)(1.0), (4.0)(1.0), (4.0)(1.0).]

  ▪ The decision path runs from its start at the top-left of the grid to its end at the bottom-right
      Each step advances the pointer whose current item is smaller, so that it moves on to the bigger item; ties are broken by always preferring the element from row_offsets
  ▪ The path moves down when advancing within ℕ
      Action: accumulate that nonzero's dot-product value into the running total
  ▪ The path moves right when advancing within row_offsets
      Action: flush the accumulator to the corresponding output row and reset it

Traced over the example, the path accumulates +1.0, +1.0 and flushes 2.0 for row 0; immediately flushes 0.0 for the empty row 1; accumulates +3.0, +3.0 and flushes 6.0 for row 2; and finally accumulates +4.0, +4.0, +4.0, +4.0 and flushes 16.0 for row 3. A sketch of this sequential traversal follows the slide.
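The decision rules above can be written as a short sequential routine. This is an illustrative sketch, not the talk's reference code (that lives at https://github.com/dumerrill/merge-spmv): list A is the row end-offsets, list B is implicitly the counting sequence 0..nnz-1, and the comparison row_end_offsets[i] > j decides whether the path moves down (accumulate) or right (flush).

```cpp
// Sequential walk of the CsrMV merge-path (illustrative sketch).
#include <vector>

void csrmv_merge_sequential(int num_rows, int num_nonzeros,
                            const std::vector<int>& row_end_offsets,   // = row_offsets + 1, size num_rows
                            const std::vector<int>& column_indices,
                            const std::vector<double>& values,
                            const std::vector<double>& x,
                            std::vector<double>& y)
{
    int i = 0;           // position in row_end_offsets (current output row)
    int j = 0;           // position in the nonzero sequence ℕ
    double acc = 0.0;
    while (i < num_rows) {
        if (j < num_nonzeros && row_end_offsets[i] > j) {
            acc += values[j] * x[column_indices[j]];   // move down: accumulate a dot-product term
            ++j;
        } else {
            y[i] = acc;                                // move right: flush this row and reset
            acc  = 0.0;
            ++i;
        }
    }
}
```

For the example, row_end_offsets = {2, 2, 4, 8} (the CSR row_offsets with the leading 0 dropped), and the routine reproduces y = (2.0, 0.0, 6.0, 16.0).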


PARALLELIZING THE CSRMV MERGE-PATH


STEP 1: Partition the merge-path

  • 1. Partition the grid into |P| equally-sized diagonal regions (one thread per region)
      E.g., with three threads p0, p1, p2, the example's 12-step path is cut by diagonals at 4, 8, and 12 into regions of 4 path-steps each
  • 2. Threads binary-search along their diagonals for their 2D starting coordinates
  • 3. Threads run the serial merge algorithm from their starting points
  • 4. Aggregate per-thread run-outs by row, adding them to the prior per-row totals in the output vector

[Figure: the example merge grid (row_offsets 2, 2, 4, 8 across the top; the nonzero sequence down the side) divided into three diagonal regions for p0, p1, p2]


STEP 2: Coordinate search

  • 2. Threads binary-search along their diagonals for their 2D starting coordinates
      Find the first (i, j) on the diagonal (i + j == diagonal) where the row_offsets item at position i is greater than every ℕ item before position j
      O(p log n) work complexity
      The resulting coordinates also provide tight storage balance if the dataset needs physical partitioning (e.g., staging into GPU shared scratch)
      A sketch of this diagonal search follows the slide

[Figure: successive binary-search probes along the p1 and p2 diagonals of the example grid, e.g.
    p1: '4' > '1'? Yes: search left        p2: '4' > '5'? No: search right
    p1: '2' > '2'? No: search right        p2: '8' > '4'? Yes: search left
    p1: '2' > '2'? No: search right        p2: '4' > '5'? No: search right
 converging on each thread's starting coordinate]
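A minimal sketch of the diagonal search, assuming 0-based coordinates and the same row end-offsets list as in the earlier sketch (names are illustrative, not the talk's reference API). Because list B is just the counting sequence, each probe compares a row_offsets entry against the ℕ value immediately before the candidate split.

```cpp
// Diagonal binary search for a merge-path starting coordinate (illustrative sketch).
#include <algorithm>
#include <utility>
#include <vector>

std::pair<int, int> merge_path_search(int diagonal,
                                      const std::vector<int>& row_end_offsets,
                                      int num_rows, int num_nonzeros)
{
    int i_min = std::max(diagonal - num_nonzeros, 0);   // smallest feasible row coordinate on this diagonal
    int i_max = std::min(diagonal, num_rows);            // largest feasible row coordinate on this diagonal
    while (i_min < i_max) {
        int pivot = (i_min + i_max) / 2;
        // The ℕ item just before the candidate split is simply (diagonal - pivot - 1),
        // since ℕ is the counting sequence 0, 1, 2, ...
        if (row_end_offsets[pivot] <= diagonal - pivot - 1)
            i_min = pivot + 1;   // row_offsets item too small: the split lies further along row_offsets
        else
            i_max = pivot;       // row_offsets item big enough: search the lower half
    }
    return { i_min, diagonal - i_min };   // (row coordinate i, nonzero coordinate j)
}
```

For the example with three threads (diagonals 0, 4, 8), this returns (0, 0), (2, 2), and (3, 5): p1 starts at row 2 / nonzero 2, and p2 starts partway through row 3 at nonzero 5.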


STEP 3: Consume path segments

  • 3. Threads run the serial merge algorithm from their starting coordinates
      Each thread consumes the same number of path items, so the work is tightly balanced: O(n) work complexity
      Rows that end inside a thread's segment are written directly to the output vector; the partial sum for a row left unfinished at the segment's end becomes that thread's "run-out" for step 4
      A sketch of one thread's segment consumption follows the slide

[Figure: threads p0, p1, p2 sweep their 4-item segments of the example path. p0 accumulates +1.0, +1.0 and writes 2.0 (row 0) and 0.0 (row 1); p1 accumulates +3.0, +3.0, writes 6.0 (row 2), then starts row 3 with +4.0; p2 accumulates +4.0, +4.0, +4.0 and writes 12.0 (row 3).]
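A minimal sketch of one thread's share of step 3 (illustrative names, written sequentially): starting from its coordinate (i, j), the thread consumes exactly items_per_thread path steps, writes every row that ends inside its segment, and returns the <row, run-out> tuple for the row it leaves unfinished.

```cpp
// One thread's segment consumption along the merge-path (illustrative sketch).
#include <utility>
#include <vector>

std::pair<int, double> consume_segment(int i, int j, int items_per_thread,
                                       int num_rows, int num_nonzeros,
                                       const std::vector<int>& row_end_offsets,
                                       const std::vector<int>& column_indices,
                                       const std::vector<double>& values,
                                       const std::vector<double>& x,
                                       std::vector<double>& y)
{
    double acc = 0.0;
    for (int step = 0; step < items_per_thread && i < num_rows; ++step) {
        if (j < num_nonzeros && row_end_offsets[i] > j) {
            acc += values[j] * x[column_indices[j]];   // move down: accumulate
            ++j;
        } else {
            y[i] = acc;                                // move right: this row ends here
            acc  = 0.0;
            ++i;
        }
    }
    return { i, acc };   // run-out: the row left unfinished (if any) and its partial sum
}
```

On the example, the three threads return <row 2, 0.0>, <row 3, 4.0>, and <row 4, 0.0>, and the output vector holds (2.0, 0.0, 6.0, 12.0) pending the step-4 fixup.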


STEP 4: "Fixup" for rows that cross partitions

  • 4. Aggregate per-thread run-outs by row, adding them to the prior per-row totals in the output vector
      Compute a "reduce-by-key" across the per-thread <last-row#, run-out> tuples
      O(p) work complexity

[Figure: for the example, the three threads carry out <row2, 0.0>, <row3, 4.0>, and <row4, 0.0>; the fixup adds the +4.0 run-out to row 3's partial total of 12.0, completing y = (2.0, 0.0, 6.0, 16.0)]

A sketch of the end-to-end scheme, combining the partitioning, search, consumption, and fixup steps, follows the slide.
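Putting the four steps together, here is a self-contained, sequential-for-clarity sketch (illustrative names; the talk's actual CPU and GPU implementations are at https://github.com/dumerrill/merge-spmv). The two helpers are condensed repeats of the step-2 and step-3 sketches, and the fixup is written as a plain loop, which is the serial equivalent of the reduce-by-key.

```cpp
// End-to-end merge-based CsrMV sketch (illustrative; thread loop written sequentially).
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Diagonal search (condensed from the step-2 sketch).
static std::pair<int, int> merge_path_search(int diagonal, const std::vector<int>& row_end_offsets,
                                             int num_rows, int num_nonzeros)
{
    int lo = std::max(diagonal - num_nonzeros, 0);
    int hi = std::min(diagonal, num_rows);
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (row_end_offsets[mid] <= diagonal - mid - 1) lo = mid + 1;
        else                                            hi = mid;
    }
    return { lo, diagonal - lo };
}

// Per-thread segment consumption (condensed from the step-3 sketch).
static std::pair<int, double> consume_segment(int i, int j, int items_per_thread,
                                              int num_rows, int num_nonzeros,
                                              const std::vector<int>& row_end_offsets,
                                              const std::vector<int>& column_indices,
                                              const std::vector<double>& values,
                                              const std::vector<double>& x,
                                              std::vector<double>& y)
{
    double acc = 0.0;
    for (int step = 0; step < items_per_thread && i < num_rows; ++step) {
        if (j < num_nonzeros && row_end_offsets[i] > j) { acc += values[j] * x[column_indices[j]]; ++j; }
        else                                            { y[i] = acc; acc = 0.0; ++i; }
    }
    return { i, acc };   // <row, run-out> tuple
}

void csrmv_merge_based(int num_threads, int num_rows,
                       const std::vector<int>& row_end_offsets,
                       const std::vector<int>& column_indices,
                       const std::vector<double>& values,
                       const std::vector<double>& x,
                       std::vector<double>& y)
{
    int num_nonzeros     = (int)values.size();
    int path_length      = num_rows + num_nonzeros;
    int items_per_thread = (path_length + num_threads - 1) / num_threads;

    std::vector<std::pair<int, double>> run_outs(num_threads);

    // Steps 1-3: partition, search, consume (one iteration per "thread"; would run in parallel)
    for (int t = 0; t < num_threads; ++t) {
        int diagonal = std::min(t * items_per_thread, path_length);
        std::pair<int, int> start = merge_path_search(diagonal, row_end_offsets, num_rows, num_nonzeros);
        run_outs[t] = consume_segment(start.first, start.second, items_per_thread,
                                      num_rows, num_nonzeros, row_end_offsets,
                                      column_indices, values, x, y);
    }

    // Step 4: fixup - fold the <row, run-out> tuples back into the output vector
    for (int t = 0; t < num_threads; ++t)
        if (run_outs[t].first < num_rows) y[run_outs[t].first] += run_outs[t].second;
}

int main()
{
    // The example matrix from the earlier slides (row end-offsets = row_offsets without the leading 0)
    std::vector<int>    row_end_offsets = {2, 2, 4, 8};
    std::vector<int>    column_indices  = {0, 2, 2, 3, 0, 1, 2, 3};
    std::vector<double> values = {1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0};
    std::vector<double> x(4, 1.0), y(4, 0.0);

    csrmv_merge_based(3, 4, row_end_offsets, column_indices, values, x, y);
    for (double v : y) printf("%.1f ", v);   // expected: 2.0 0.0 6.0 16.0
    printf("\n");
    return 0;
}
```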


RESULTS


CsrMV SPEEDUP

[Charts: per-matrix speedup vs. matrix nonzeros (log scales), split into small* and large** datasets, for each platform]

                                     Tesla K40M                    2x Xeon E5-2690
                                     (Merge-based vs. cuSPARSE)    (Merge-based vs. MKL)
  Max speedup                        198x                          15.8x
  Small-problem* average (win %)     0.79x (39%)                   1.22x (90%)
  Small-problem* min speedup         0.34x                         0.51x
  Large-problem** average (win %)    1.13x (60%)                   1.06x (39%)
  Large-problem** min speedup        0.43x                         0.89x

  * Fits in aggregate cache (nonzeros < 300K / < 6M)
  ** Off-chip (nonzeros > 300K / > 6M)


ROW-LENGTH “IMPERVIOUSNESS”:

The correlation between throughput (GFLOPs) and row-length variation (closer to 0.0 is better)

[Chart: correlation of GFLOPs to row-length variation, per implementation]
    CPU:  MKL CsrMV 0.16,       Merge-based CsrMV 0.07,  CSB SpMV [1] 0.06,  pOSKI SpMV [2] 0.03
    GPU:  cuSPARSE CsrMV 0.24,  Merge-based CsrMV 0.01,  HYB SpMV [3] 0.07,  yaSpMV [4] 0.04

[1] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, "Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks," in Proc. SPAA, Calgary, Canada, 2009.
[2] A. Jain, "pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures," Master's Thesis, University of California at Berkeley, 2008.
[3] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, New York, NY, USA, 2009, pp. 18:1-18:11.
[4] S. Yan, C. Li, Y. Zhang, and H. Zhou, "yaSpMV: Yet Another SpMV Framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, USA, 2014, pp. 107-118.


PERFORMANCE “PREDICTABILITY”

The correlation between elapsed running time and matrix nonzeros (closer to 1.0 is better)

[Chart: correlation of runtime to nnz, per implementation]
    CPU:  MKL CsrMV 0.98,       Merge-based CsrMV 0.97,  CSB SpMV 0.99,  pOSKI SpMV 0.94
    GPU:  cuSPARSE CsrMV 0.30,  Merge-based CsrMV 0.87,  HYB SpMV 0.69,  yaSpMV 0.65


CSRMV THROUGHPUT

For commonly-evaluated large-matrix datasets

[Charts: double-precision GFLOPs (0-35) per dataset. CPU panel: MKL CsrMV vs. Merge-based CsrMV vs. NUMA Merge-based CsrMV. GPU panel: cuSPARSE CsrMV vs. Merge-based CsrMV.]


REFERENCES / ACKNOWLEDGEMENTS

  • Narsingh Deo, Amit Jain, and Muralidhar Medidi. 1994. An optimal parallel algorithm for merging using multiselection. Inf. Process. Lett. 50, 2 (April 1994), 81-87.
  • Odeh, S. et al. 2012. Merge Path - Parallel Merging Made Simple. Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (Washington, DC, USA, 2012), 1611-1618.

This research was developed, in part, with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings contained in this presentation are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Source and reproducibility instructions at: https://github.com/dumerrill/merge-spmv

Questions?