Parallel Segmented Merge and Its Applications to Two Sparse Matrix Kernels - PowerPoint PPT Presentation




SLIDE 1

Parallel Segmented Merge and Its Applications to Two Sparse Matrix Kernels

Weifeng Liu, Norwegian University of Science and Technology, Norway
Hao Wang, Ohio State University, USA
Brian Vinter, University of Copenhagen

June 8th, 2018, Bergen, Norway
SIAM Workshop on Combinatorial Scientific Computing 2018 (CSC18)

SLIDE 2

Background

  • 1. Merge (sort) and its use
  • 2. A definition of segmented merge
  • 3. Merge in parallel
SLIDE 3

Merge (sort)

  • We consider merging two sorted key-value sequences, in ascending or descending order, into one.
  • A serial algorithm first creates two pointers running along the two sequences, compares the entries they point to, and saves the smaller (or larger) one to the resulting sequence.

[Figure: two sorted key-value sequences are merged into one sorted resulting sequence; applying the left merging to sparse vector-vector addition combines entries with equal indices, e.g. 2+b.]
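The two-pointer procedure described above can be sketched as follows. This is a minimal serial illustration, not the authors' implementation; the function name and the (key, value) tuple representation are my own.

```python
def merge(seq_a, seq_b, descending=False):
    """Merge two sorted (key, value) sequences into one sorted sequence."""
    out = []
    i = j = 0
    # Two pointers run along the sequences; the entry with the smaller
    # (or, for descending order, larger) key is saved to the result.
    while i < len(seq_a) and j < len(seq_b):
        if descending:
            take_a = seq_a[i][0] >= seq_b[j][0]
        else:
            take_a = seq_a[i][0] <= seq_b[j][0]
        if take_a:
            out.append(seq_a[i]); i += 1
        else:
            out.append(seq_b[j]); j += 1
    out.extend(seq_a[i:])   # at most one of these two tails is non-empty
    out.extend(seq_b[j:])
    return out
```

For sparse vector-vector addition, a post-pass would additionally combine merged entries whose keys (indices) are equal.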

SLIDE 4

Use case 1: sparse matrix-matrix addition

  • Add a sparse matrix A to a sparse matrix B, and obtain a resulting sparse matrix C.

[Figure: A (4x4 sparse, 6 nonzeros) + B (4x4 sparse, 6 nonzeros) = C (4x4 sparse, 10 nonzeros); entries sharing a position are combined, e.g. 2+b and 5+f.]

SLIDE 5

Use case 1: sparse matrix-matrix addition

  • Each resulting row is merged from two rows.

[Figure: each of the four rows of C is produced by merging the corresponding row of A with the corresponding row of B.]
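The row-wise merging for sparse matrix-matrix addition can be sketched for CSR storage as below. This is a serial illustration with hypothetical names, assuming the usual CSR triple (row pointer, column index, value); it is not the authors' GPU code.

```python
def spadd_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, rows):
    """C = A + B for CSR matrices: each row of C is merged from
    one row of A and one row of B, combining equal column indices."""
    c_ptr, c_idx, c_val = [0], [], []
    for r in range(rows):
        i, iend = a_ptr[r], a_ptr[r + 1]
        j, jend = b_ptr[r], b_ptr[r + 1]
        while i < iend or j < jend:
            if j == jend or (i < iend and a_idx[i] < b_idx[j]):
                c_idx.append(a_idx[i]); c_val.append(a_val[i]); i += 1
            elif i == iend or b_idx[j] < a_idx[i]:
                c_idx.append(b_idx[j]); c_val.append(b_val[j]); j += 1
            else:  # equal column index: combine the two entries
                c_idx.append(a_idx[i]); c_val.append(a_val[i] + b_val[j])
                i += 1; j += 1
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```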

SLIDE 6

Use case 2: sparse matrix-matrix multiplication

  • Multiply a sparse matrix A by a sparse matrix B, and obtain a resulting sparse matrix C.

[Figure: A (4x4 sparse, 6 nonzeros) x B (4x4 sparse, 6 nonzeros) = C (4x4 sparse, 8 nonzeros), with entries such as 4a+5e accumulated from multiple partial products.]

SLIDE 7


Use case 2: sparse matrix-matrix multiplication

  • Each resulting row is merged from multiple rows.

[Figure: each row of C is obtained by merging the scaled rows of B selected by the nonzeros of the corresponding row of A; for example, one row merges the partial products 5d, 6f and 5e, 4a.]
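Forming one row of C by merging scaled rows of B can be sketched as follows. This is a serial stand-in for illustration only: the names are hypothetical, and Python's `heapq.merge` plays the role of the multiway merge (the slides' parallel method is different).

```python
import heapq

def spgemm_row(a_row, b_rows):
    """One row of C = A x B: merge the scaled rows of B selected by the
    nonzeros of a row of A, then accumulate equal column indices."""
    # a_row: list of (col, val); b_rows[k]: sorted list of (col, val) for row k of B.
    scaled = [[(c, av * bv) for c, bv in b_rows[ac]] for ac, av in a_row]
    out = []
    for col, val in heapq.merge(*scaled):   # multiway merge by column index
        if out and out[-1][0] == col:
            out[-1] = (col, out[-1][1] + val)   # accumulate partial products
        else:
            out.append((col, val))
    return out
```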

SLIDE 8

Background

  • 1. Merge (sort) and its use
  • 2. A definition of segmented merge
  • 3. Merge in parallel
SLIDE 9

A definition of segmented merge (segmerge)

  • The segmented merge operation merges q sorted sub-segments into p sorted segments.

[Figure: two examples of segmented merge, one with q = 7, p = 4 and one with q = 6, p = 3.]
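A serial reference version of the definition might look like the sketch below, assuming a flat key/value array with `sub_ptr` delimiting the q sub-segments and `seg_ptr` recording which sub-segments belong to each of the p output segments (this layout and the names are my own, not from the slides).

```python
import heapq

def segmerge(keys, vals, sub_ptr, seg_ptr):
    """Reference (serial) segmented merge: merge q sorted sub-segments
    into p sorted segments."""
    out_keys, out_vals = [], []
    for s in range(len(seg_ptr) - 1):
        # Gather the sub-segments belonging to output segment s.
        runs = []
        for t in range(seg_ptr[s], seg_ptr[s + 1]):
            lo, hi = sub_ptr[t], sub_ptr[t + 1]
            runs.append(list(zip(keys[lo:hi], vals[lo:hi])))
        for k, v in heapq.merge(*runs):
            out_keys.append(k); out_vals.append(v)
    return out_keys, out_vals
```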

SLIDE 10

Background

  • 1. Merge (sort) and its use
  • 2. A definition of segmented merge
  • 3. Merge in parallel
SLIDE 11

Merge in parallel

  • There are two parallel merge methods:
  • (1) merging two sub-segments: merge path works in a vectorized and balanced way;
  • (2) merging multiple segments: assign each segment to a core.

Green, McColl, Bader. GPU Merge Path: a GPU Merging Algorithm. ICS '12.

[Figure: the merge path of Input 1 (1 3 5 6 8) and Input 2 (2 4 7) through Output positions 0-8, partitioned evenly across Cores 1-4.]

Though each segment can be merged in a balanced way, cores may have imbalanced workload due to very diverse segment sizes.

This motivates our segmerge algorithm.
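The merge-path idea can be sketched as below: a binary search along each cross diagonal splits the two inputs so that every worker gets an equal share of the output, regardless of how the elements interleave. This is a serial simulation (one loop iteration stands for one core) with made-up names, not the GPU kernel of Green et al.

```python
def merge_path_split(a, b, diag):
    """Find (i, j) with i + j == diag such that a[:i] and b[:j]
    are exactly the first `diag` outputs of merging a and b."""
    lo, hi = max(0, diag - len(b)), min(diag, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        if a[i] < b[diag - 1 - i]:
            lo = i + 1   # a[i] belongs before the split: move right
        else:
            hi = i
    return lo, diag - lo

def parallel_merge(a, b, num_cores):
    """Merge two sorted lists, giving each 'core' an equal slice of the output."""
    total = len(a) + len(b)
    out = []
    for c in range(num_cores):                 # stands for one core each
        i0, j0 = merge_path_split(a, b, c * total // num_cores)
        i1, j1 = merge_path_split(a, b, (c + 1) * total // num_cores)
        ai, bj = i0, j0                        # serially merge the balanced slice
        while ai < i1 or bj < j1:
            if bj == j1 or (ai < i1 and a[ai] <= b[bj]):
                out.append(a[ai]); ai += 1
            else:
                out.append(b[bj]); bj += 1
    return out
```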

SLIDE 12

Parallel Segmented Merge

  • 1. Algorithm description
  • 2. Experiments: SpTRANS and SpGEMM
  • 3. Micro-benchmarks
SLIDE 13

Overview of our segmerge

Step 1. Create level-1 and level-2 pointers.
Step 2. Fix each thread's workload (#entries) and compute #threads.
Step 3. Sub-segments scatter info to threads.
Step 4. Threads gather info to find their workload (similar to merge path).
Step 5. Threads are evenly assigned to cores.
Step 6. Iterate steps 2-5 until finished.
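The iteration in steps 2-6 can be sketched as repeated pairwise merging: within every segment, neighboring sorted sub-segments are merged in rounds until one sub-segment per segment remains. In this serial sketch `heapq.merge` is a stand-in for the merge-path merge performed by evenly loaded threads; the data layout (nested lists rather than pointer arrays) is simplified for illustration.

```python
import heapq

def segmerge_rounds(segments):
    """Merge each segment's sorted sub-segments pairwise, round by round,
    until every segment holds a single sorted sub-segment."""
    # segments: list of segments, each a list of sorted sub-segment lists.
    while any(len(subs) > 1 for subs in segments):
        # One round: the sub-segment count of each segment roughly halves.
        segments = [
            [list(heapq.merge(*subs[k:k + 2])) for k in range(0, len(subs), 2)]
            for subs in segments
        ]
    return [subs[0] if subs else [] for subs in segments]
```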

SLIDE 14

Parallel Segmented Merge

  • 1. Algorithm description
  • 2. Experiments: SpTRANS and SpGEMM
  • 3. Micro-benchmarks
SLIDE 15

Experiment 1: sparse matrix transposition

  • Transpose a sparse matrix A (in CSR) to a sparse matrix B (i.e., AT, in CSR). This is equivalent to a conversion from CSR to CSC, or vice versa.

[Figure: A (4x4 sparse, 6 nonzeros) and AT (4x4 sparse, 6 nonzeros), each shown as its CSR value, column index, and row pointer arrays.]

First collect entries, then segmerge them.
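The collect-then-merge idea can be sketched serially as below (hypothetical names, not the authors' kernel). In this serial version the row-order scan already leaves each column bucket sorted by row index; in the parallel algorithm, entries are collected per partition and the per-column pieces are then combined with segmerge.

```python
def sptrans(rows, cols, ptr, idx, val):
    """CSR -> CSR transpose (equivalently, CSR -> CSC conversion):
    collect each entry under its column, then emit columns in order."""
    buckets = [[] for _ in range(cols)]
    for r in range(rows):
        for k in range(ptr[r], ptr[r + 1]):
            buckets[idx[k]].append((r, val[k]))  # rows arrive in sorted order
    t_ptr, t_idx, t_val = [0], [], []
    for b in buckets:
        for r, v in b:
            t_idx.append(r); t_val.append(v)
        t_ptr.append(len(t_idx))
    return t_ptr, t_idx, t_val
```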

SLIDE 16

Experiment 1: sparse matrix transposition

GPU: Titan X. #matrices: 956. Speedups: ~10x over cuSPARSE, ~7x over CUSP. CUSP behaves better when the matrix is very sparse; in some cases our method is slower than CUSP and cuSPARSE.

SLIDE 17

Experiment 2: sparse matrix-matrix multiplication

Weifeng Liu, Brian Vinter. A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors. JPDC. 2015. (extended from IPDPS14)

Weifeng Liu, Brian Vinter. An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data. IPDPS14.

Only use segmerge for 'long' rows.

SLIDE 18

Experiment 2: sparse matrix-matrix multiplication

GPU: Titan X. #matrices: 956. Speedups: ~20x over CUSP, ~100x over cuSPARSE, ~7x over bhSPARSE. In some cases cuSPARSE is faster, and bhSPARSE rarely gives higher performance.

SLIDE 19

Parallel Segmented Merge

  • 1. Algorithm description
  • 2. Experiments: SpTRANS and SpGEMM
  • 3. Micro-benchmarks
SLIDE 20

Micro-benchmark 1: segmerge vs sort

  • In the SpGEMM test, if we replace segmerge with CUDA Thrust sort:

In this case, sort is faster than segmerge

SLIDE 21

Micro-benchmark 2: #threads in iterations

  • In the SpGEMM test, we record #threads issued in each iteration step:

[Figure: #threads issued per iteration step for two example inputs. One drops quickly: 108813822, 552638, 547690, 515739, 349187, 32396, 12215, 6863, 5987, 5971, 5963, 5959, 5956. The other stays high: 14406767, 13862538, 13592550, 13520724, 13306832, 13139154, 13130214, 13125706, 13123376, 13121288, 11973924, 9579396, 9579268, 9579188, 9579132, 9579116, 9579088.]

SLIDE 22

Conclusion

  • Though some parallel segmented operations exist (segsum [Blelloch TR93, Liu PARCO15], segsort [Hou ICS17], etc.), parallel segmerge has been largely ignored.
  • Our segmerge algorithm always evenly distributes entries to threads, thus runs in a very balanced way on GPUs.
  • SpTRANS and SpGEMM are largely accelerated by using segmerge at the proper stages.
  • According to the micro-benchmarks, there are still some optimizations that can be done.
SLIDE 23

Thank you! Any questions?