Parallel Segmented Merge and Its Applications to Two Sparse Matrix Kernels - PowerPoint PPT Presentation




SLIDE 1

Parallel Segmented Merge and Its Applications to Two Sparse Matrix Kernels

Weifeng Liu, Norwegian University of Science and Technology, Norway
Hao Wang, Ohio State University, USA
Brian Vinter, University of Copenhagen

June 8th, 2018, Bergen, Norway
SIAM Workshop on Combinatorial Scientific Computing 2018 (CSC18)

SLIDE 2

Background

  • 1. Merge (sort) and its use
  • 2. A definition of segmented merge
  • 3. Merge in parallel
SLIDE 3

Merge (sort)

  • We consider merging two sorted key-value sequences, in ascending or descending order, into one.
  • A serial algorithm first creates two pointers running along the two sequences, compares the entries they point to, and saves the smaller (or larger) one to the resulting sequence.

[Figure: two sorted key-value sequences are merged into one sorted resulting sequence; applying the left merging to sparse vector-vector addition combines entries with equal indices, e.g. 2+b.]
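The two-pointer procedure described above can be sketched as follows. This is a minimal serial illustration, not the authors' implementation; the function name and the (key, value) tuple representation are my own.

```python
def merge(seq_a, seq_b, descending=False):
    """Merge two sorted (key, value) sequences into one sorted sequence."""
    out = []
    i = j = 0
    # Two pointers run along the sequences; the entry with the smaller
    # (or, for descending order, larger) key is saved to the result.
    while i < len(seq_a) and j < len(seq_b):
        if descending:
            take_a = seq_a[i][0] >= seq_b[j][0]
        else:
            take_a = seq_a[i][0] <= seq_b[j][0]
        if take_a:
            out.append(seq_a[i]); i += 1
        else:
            out.append(seq_b[j]); j += 1
    out.extend(seq_a[i:])   # at most one of these two tails is non-empty
    out.extend(seq_b[j:])
    return out
```

For sparse vector-vector addition, a post-pass would additionally combine merged entries whose keys (indices) are equal.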

SLIDE 4

Use case 1: sparse matrix-matrix addition

  • Add a sparse matrix A to a sparse matrix B, and obtain a resulting sparse matrix C.

[Figure: A (4x4 sparse, 6 nonzeros) + B (4x4 sparse, 6 nonzeros) = C (4x4 sparse, 10 nonzeros); entries sharing a position are combined, e.g. 2+b and 5+f.]

SLIDE 5

Use case 1: sparse matrix-matrix addition

  • Each resulting row is merged from two rows.

[Figure: each of the four rows of C is produced by merging the corresponding row of A with the corresponding row of B.]
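The row-wise merging for sparse matrix-matrix addition can be sketched for CSR storage as below. This is a serial illustration with hypothetical names, assuming the usual CSR triple (row pointer, column index, value); it is not the authors' GPU code.

```python
def spadd_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, rows):
    """C = A + B for CSR matrices: each row of C is merged from
    one row of A and one row of B, combining equal column indices."""
    c_ptr, c_idx, c_val = [0], [], []
    for r in range(rows):
        i, iend = a_ptr[r], a_ptr[r + 1]
        j, jend = b_ptr[r], b_ptr[r + 1]
        while i < iend or j < jend:
            if j == jend or (i < iend and a_idx[i] < b_idx[j]):
                c_idx.append(a_idx[i]); c_val.append(a_val[i]); i += 1
            elif i == iend or b_idx[j] < a_idx[i]:
                c_idx.append(b_idx[j]); c_val.append(b_val[j]); j += 1
            else:  # equal column index: combine the two entries
                c_idx.append(a_idx[i]); c_val.append(a_val[i] + b_val[j])
                i += 1; j += 1
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```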

SLIDE 6

Use case 2: sparse matrix-matrix multiplication

  • Multiply a sparse matrix A by a sparse matrix B, and obtain a resulting sparse matrix C.

[Figure: A (4x4 sparse, 6 nonzeros) x B (4x4 sparse, 6 nonzeros) = C (4x4 sparse, 8 nonzeros), with entries such as 4a+5e accumulated from multiple partial products.]

SLIDE 7


Use case 2: sparse matrix-matrix multiplication

  • Each resulting row is merged from multiple rows.

[Figure: each row of C is obtained by merging the scaled rows of B selected by the nonzeros of the corresponding row of A; for example, one row merges the partial products 5d, 6f and 5e, 4a.]
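Forming one row of C by merging scaled rows of B can be sketched as follows. This is a serial stand-in for illustration only: the names are hypothetical, and Python's `heapq.merge` plays the role of the multiway merge (the slides' parallel method is different).

```python
import heapq

def spgemm_row(a_row, b_rows):
    """One row of C = A x B: merge the scaled rows of B selected by the
    nonzeros of a row of A, then accumulate equal column indices."""
    # a_row: list of (col, val); b_rows[k]: sorted list of (col, val) for row k of B.
    scaled = [[(c, av * bv) for c, bv in b_rows[ac]] for ac, av in a_row]
    out = []
    for col, val in heapq.merge(*scaled):   # multiway merge by column index
        if out and out[-1][0] == col:
            out[-1] = (col, out[-1][1] + val)   # accumulate partial products
        else:
            out.append((col, val))
    return out
```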

SLIDE 8

Background

  • 1. Merge (sort) and its use
  • 2. A definition of segmented merge
  • 3. Merge in parallel
SLIDE 9

A definition of segmented merge (segmerge)

  • The segmented merge operation merges q sorted sub-segments into p sorted segments.

[Figure: two examples of segmented merge, one with q = 7, p = 4 and one with q = 6, p = 3.]
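A serial reference version of the definition might look like the sketch below, assuming a flat key/value array with `sub_ptr` delimiting the q sub-segments and `seg_ptr` recording which sub-segments belong to each of the p output segments (this layout and the names are my own, not from the slides).

```python
import heapq

def segmerge(keys, vals, sub_ptr, seg_ptr):
    """Reference (serial) segmented merge: merge q sorted sub-segments
    into p sorted segments."""
    out_keys, out_vals = [], []
    for s in range(len(seg_ptr) - 1):
        # Gather the sub-segments belonging to output segment s.
        runs = []
        for t in range(seg_ptr[s], seg_ptr[s + 1]):
            lo, hi = sub_ptr[t], sub_ptr[t + 1]
            runs.append(list(zip(keys[lo:hi], vals[lo:hi])))
        for k, v in heapq.merge(*runs):
            out_keys.append(k); out_vals.append(v)
    return out_keys, out_vals
```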

SLIDE 10

Background

  • 1. Merge (sort) and its use
  • 2. A definition of segmented merge
  • 3. Merge in parallel
SLIDE 11

Merge in parallel

  • There are two parallel merge methods:
  • (1) merging two sub-segments: merge path works in a vectorized and balanced way;
  • (2) merging multiple segments: assign each segment to a core.

Green, McColl, Bader. GPU Merge Path: a GPU Merging Algorithm. ICS '12.

[Figure: the merge path of Input 1 (1 3 5 6 8) and Input 2 (2 4 7) through Output positions 0-8, partitioned evenly across Cores 1-4.]

Though each segment can be merged in a balanced way, cores may have imbalanced workload due to very diverse segment sizes.

This motivates our segmerge algorithm.
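The merge-path idea can be sketched as below: a binary search along each cross diagonal splits the two inputs so that every worker gets an equal share of the output, regardless of how the elements interleave. This is a serial simulation (one loop iteration stands for one core) with made-up names, not the GPU kernel of Green et al.

```python
def merge_path_split(a, b, diag):
    """Find (i, j) with i + j == diag such that a[:i] and b[:j]
    are exactly the first `diag` outputs of merging a and b."""
    lo, hi = max(0, diag - len(b)), min(diag, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        if a[i] < b[diag - 1 - i]:
            lo = i + 1   # a[i] belongs before the split: move right
        else:
            hi = i
    return lo, diag - lo

def parallel_merge(a, b, num_cores):
    """Merge two sorted lists, giving each 'core' an equal slice of the output."""
    total = len(a) + len(b)
    out = []
    for c in range(num_cores):                 # stands for one core each
        i0, j0 = merge_path_split(a, b, c * total // num_cores)
        i1, j1 = merge_path_split(a, b, (c + 1) * total // num_cores)
        ai, bj = i0, j0                        # serially merge the balanced slice
        while ai < i1 or bj < j1:
            if bj == j1 or (ai < i1 and a[ai] <= b[bj]):
                out.append(a[ai]); ai += 1
            else:
                out.append(b[bj]); bj += 1
    return out
```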

SLIDE 12

Parallel Segmented Merge

  • 1. Algorithm description
  • 2. Experiments: SpTRANS and SpGEMM
  • 3. Micro-benchmarks
SLIDE 13

Overview of our segmerge

Step 1. Create level-1 and level-2 pointers.
Step 2. Fix each thread's workload (#entries) and compute #threads.
Step 3. Sub-segments scatter info to threads.
Step 4. Threads gather info to find their workload (similar to merge path).
Step 5. Threads are evenly assigned to cores.
Step 6. Iterate steps 2-5 until finished.
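The iteration in steps 2-6 can be sketched as repeated pairwise merging: within every segment, neighboring sorted sub-segments are merged in rounds until one sub-segment per segment remains. In this serial sketch `heapq.merge` is a stand-in for the merge-path merge performed by evenly loaded threads; the data layout (nested lists rather than pointer arrays) is simplified for illustration.

```python
import heapq

def segmerge_rounds(segments):
    """Merge each segment's sorted sub-segments pairwise, round by round,
    until every segment holds a single sorted sub-segment."""
    # segments: list of segments, each a list of sorted sub-segment lists.
    while any(len(subs) > 1 for subs in segments):
        # One round: the sub-segment count of each segment roughly halves.
        segments = [
            [list(heapq.merge(*subs[k:k + 2])) for k in range(0, len(subs), 2)]
            for subs in segments
        ]
    return [subs[0] if subs else [] for subs in segments]
```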

SLIDE 14

Parallel Segmented Merge

  • 1. Algorithm description
  • 2. Experiments: SpTRANS and SpGEMM
  • 3. Micro-benchmarks
SLIDE 15

Experiment 1: sparse matrix transposition

  • Transpose a sparse matrix A (in CSR) to a sparse matrix B (i.e., AT, in CSR). This is equivalent to a conversion from CSR to CSC, or vice versa.

[Figure: A (4x4 sparse, 6 nonzeros) and AT (4x4 sparse, 6 nonzeros), each shown as its CSR value, column index, and row pointer arrays.]

First collect entries, then segmerge them.
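The collect-then-merge idea can be sketched serially as below (hypothetical names, not the authors' kernel). In this serial version the row-order scan already leaves each column bucket sorted by row index; in the parallel algorithm, entries are collected per partition and the per-column pieces are then combined with segmerge.

```python
def sptrans(rows, cols, ptr, idx, val):
    """CSR -> CSR transpose (equivalently, CSR -> CSC conversion):
    collect each entry under its column, then emit columns in order."""
    buckets = [[] for _ in range(cols)]
    for r in range(rows):
        for k in range(ptr[r], ptr[r + 1]):
            buckets[idx[k]].append((r, val[k]))  # rows arrive in sorted order
    t_ptr, t_idx, t_val = [0], [], []
    for b in buckets:
        for r, v in b:
            t_idx.append(r); t_val.append(v)
        t_ptr.append(len(t_idx))
    return t_ptr, t_idx, t_val
```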

SLIDE 16

Experiment 1: sparse matrix transposition

GPU: Titan X. #matrices: 956. Speedups: ~10x over cuSPARSE, ~7x over CUSP. CUSP behaves better when the matrix is very sparse; in some cases our method is slower than CUSP and cuSPARSE.

SLIDE 17

Experiment 2: sparse matrix-matrix multiplication

Weifeng Liu, Brian Vinter. A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors. JPDC. 2015. (extended from IPDPS14)

Weifeng Liu, Brian Vinter. An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data. IPDPS14.

Only use segmerge for 'long' rows.

SLIDE 18

Experiment 2: sparse matrix-matrix multiplication

GPU: Titan X. #matrices: 956. Speedups: ~20x over CUSP, ~100x over cuSPARSE, ~7x over bhSPARSE. In some cases cuSPARSE is faster, and bhSPARSE rarely gives higher performance.

SLIDE 19

Parallel Segmented Merge

  • 1. Algorithm description
  • 2. Experiments: SpTRANS and SpGEMM
  • 3. Micro-benchmarks
SLIDE 20

Micro-benchmark 1: segmerge vs sort

  • In the SpGEMM test, if we replace segmerge with CUDA Thrust sort:

In this case, sort is faster than segmerge

SLIDE 21

Micro-benchmark 2: #threads in iterations

  • In the SpGEMM test, we record #threads issued in each iteration step:

[Figure: #threads issued per iteration step for two example inputs. One drops quickly: 108813822, 552638, 547690, 515739, 349187, 32396, 12215, 6863, 5987, 5971, 5963, 5959, 5956. The other stays high: 14406767, 13862538, 13592550, 13520724, 13306832, 13139154, 13130214, 13125706, 13123376, 13121288, 11973924, 9579396, 9579268, 9579188, 9579132, 9579116, 9579088.]

SLIDE 22

Conclusion

  • Though some parallel segmented operations exist (segsum [Blelloch TR93, Liu PARCO15], segsort [Hou ICS17], etc.), parallel segmerge has been largely ignored.
  • Our segmerge algorithm always evenly distributes entries to threads, thus runs in a very balanced way on GPUs.
  • SpTRANS and SpGEMM are largely accelerated by using segmerge at the proper stages.
  • According to the micro-benchmarks, there are still some optimizations that can be done.
SLIDE 23

Thank you! Any questions?