
CSR SpMV with guaranteed workload balance: Merge-based Parallel Decomposition

  1. CSR SpMV with guaranteed workload balance: Merge-based Parallel Decomposition
     NVIDIA Research
     Duane Merrill (dumerrill@nvidia.com), Michael Garland (mgarland@nvidia.com)
     January 26, 2019
     D. Merrill and M. Garland, "Merge-Based Parallel Sparse Matrix-Vector Multiplication," SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, 2016, pp. 678-689. doi: 10.1109/SC.2016.57

  2. My soapbox
     1. Algorithmic parallel decomposition matters too
        • Versus delegation of scheduling entirely to compiler/runtime
     2. Workload imbalance in sparse applications
        • The biggest killer of machine utilization
        • Performance response for arbitrary inputs: reliable vs. capricious “face-planting”
     3. Standard data formats
        • Performance portability
     4. Evaluation methodology
        • Avoid overfitting by benchmarking on 1Ks-1Ms of datasets, not 10s of datasets

  3. PERFORMANCE (IN)CONSISTENCY
     Faceplant
     “Consistency is far better than rare moments of greatness” - Scott Ginsberg

  4. SPARSE MATRIX-VECTOR MULTIPLICATION
     Lots of available parallelism
     y = A * x, with sparse matrix A, dense vector x, dense vector y ("--" marks an empty entry):

       A                     x      y
       1.0  --   1.0  --     1.0    (1.0)(1.0) + (1.0)(1.0)
       --   --   --   --     1.0    0.0
       --   --   3.0  3.0    1.0    (3.0)(1.0) + (3.0)(1.0)
       4.0  4.0  4.0  4.0    1.0    (4.0)(1.0) + (4.0)(1.0) + (4.0)(1.0) + (4.0)(1.0)
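
The product on this slide can be checked with a few lines of sequential code. The following is a minimal sketch in plain Python, assuming the CSR encoding of the same matrix (values, column indices, row offsets); the function name `csr_spmv` is a hypothetical helper for illustration, not something taken from the deck.

```python
# Sequential CSR SpMV on the 4x4 example matrix (illustrative sketch only).
values      = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0]  # nonzeros, row by row
col_indices = [0, 2, 2, 3, 0, 1, 2, 3]                  # column of each nonzero
row_offsets = [0, 2, 2, 4, 8]                           # row r owns [row_offsets[r], row_offsets[r+1])
x = [1.0, 1.0, 1.0, 1.0]                                # dense input vector

def csr_spmv(values, col_indices, row_offsets, x):
    """Return y = A * x for a CSR matrix A (hypothetical helper name)."""
    y = []
    for row in range(len(row_offsets) - 1):
        acc = 0.0
        # Each row is an independent dot product, hence the available parallelism.
        for k in range(row_offsets[row], row_offsets[row + 1]):
            acc += values[k] * x[col_indices[k]]
        y.append(acc)
    return y

y = csr_spmv(values, col_indices, row_offsets, x)
print(y)  # [2.0, 0.0, 6.0, 16.0]
```

Every row's dot product is independent of the others, which is where the parallelism comes from; the rows differ in length, which is where the trouble starts.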

  5. CSR PARALLEL DECOMPOSITION
     Option (a): row-based (rows of A assigned to processors p0, p1, p2, p3)
       values:         1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0
       column indices: 0 2 2 3 0 1 2 3
       row offsets:    0 2 2 4 8

  6. CSR PARALLEL DECOMPOSITION
     Option (a): row-based (rows of A assigned to processors p0, p1, p2, p3): imbalance!
       values:         1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0
       column indices: 0 2 2 3 0 1 2 3
       row offsets:    0 2 2 4 8
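
The imbalance flagged for option (a) can be read straight off the row offsets. A small sketch, assuming one row per processor as in the 4x4 example; the names `work` and `imbalance` are illustrative, and the max-over-mean ratio is just one common way to quantify imbalance, not a metric taken from the deck.

```python
# Row-based decomposition: processor p handles row p, so its work is that row's length.
row_offsets = [0, 2, 2, 4, 8]

# Nonzeros per row = difference of consecutive row offsets.
work = [row_offsets[r + 1] - row_offsets[r] for r in range(len(row_offsets) - 1)]

mean_work = sum(work) / len(work)
imbalance = max(work) / mean_work  # 1.0 would be a perfectly even split

print(work)       # [2, 0, 2, 4]
print(imbalance)  # 2.0
```

Here one processor has nothing to do while another carries half of all nonzeros, so the slowest processor takes twice as long as an even split would.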

  7. CSR PARALLEL DECOMPOSITION
     Option (b): nonzero splitting (nonzeros divided among processors p0, p1, p2, p3)
       values:         1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0
       column indices: 0 2 2 3 0 1 2 3
       row offsets:    0 2 2 4 8

  8. CSR PARALLEL DECOMPOSITION
     Option (b): nonzero splitting (nonzeros divided among processors p0, p1, p2, p3): imbalance!
       values:         1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0
       column indices: 0 2 2 3 0 1 2 3
       row offsets:    0 2 2 4 8
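
To make option (b) concrete, the sketch below gives each processor an equal slice of the nonzeros and then looks up which rows that slice touches. It assumes the nonzero count divides evenly by the processor count; `row_of` and `assignment` are hypothetical names. It only illustrates the bookkeeping and does not claim to reproduce exactly what the slide marks as imbalance.

```python
from bisect import bisect_right

row_offsets = [0, 2, 2, 4, 8]
num_procs = 4
num_nonzeros = row_offsets[-1]
per_proc = num_nonzeros // num_procs  # 2 nonzeros each (assumes an even split)

def row_of(k):
    """Row that owns nonzero k: the last row whose offset is <= k."""
    return bisect_right(row_offsets, k) - 1

assignment = []
for p in range(num_procs):
    begin, end = p * per_proc, (p + 1) * per_proc
    rows = sorted({row_of(k) for k in range(begin, end)})
    assignment.append((begin, end, rows))

for p, (begin, end, rows) in enumerate(assignment):
    print(f"p{p}: nonzeros [{begin}, {end}) touch rows {rows}")
```

The nonzeros are split evenly, but the rows are not: empty row 1 falls in nobody's slice, and row 3 is shared by two processors whose partial sums must be combined. In general an even nonzero split puts no bound on how many row boundaries (for example, long runs of empty rows) land in one processor's range.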

  9. CSR PARALLEL DECOMPOSITION
     Option (c): logical merger (merged sequence divided among processors p0, p1, p2, p3)
       values:         1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0
       row_offsets:    0 2 2 4
       column indices: 0 2 2 3 0 1 2 3
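
One way to split a logical merger evenly is a binary search along equally spaced diagonals of the merge grid. The sketch below is a hedged illustration rather than the authors' code: it assumes the merge is between the row-end offsets (`row_offsets[1:]`) and the nonzero indices 0..7, with ties going to the row-end offsets, and `merge_path_search` is a hypothetical helper name.

```python
def merge_path_search(diagonal, row_end_offsets, num_nonzeros):
    """Return (rows_consumed, nonzeros_consumed) after `diagonal` merge steps."""
    lo = max(diagonal - num_nonzeros, 0)
    hi = min(diagonal, len(row_end_offsets))
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare row_end_offsets[mid] with the nonzero index on this diagonal.
        # "<=" sends ties to the row-end offsets.
        if row_end_offsets[mid] <= diagonal - mid - 1:
            lo = mid + 1
        else:
            hi = mid
    return lo, diagonal - lo

row_offsets = [0, 2, 2, 4, 8]
row_end_offsets = row_offsets[1:]            # [2, 2, 4, 8] (assumption, see lead-in)
num_nonzeros = row_offsets[-1]
num_procs = 4

total = len(row_end_offsets) + num_nonzeros  # 12 merge items: 4 rows + 8 nonzeros
per_proc = total // num_procs                # 3 items per processor
starts = [merge_path_search(p * per_proc, row_end_offsets, num_nonzeros)
          for p in range(num_procs + 1)]
print(starts)  # [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)]
```

In this example every processor receives exactly three merge items, each a mix of row ends and nonzeros, so neither long rows nor empty rows can pile up on a single processor.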

  10. IMBALANCE: CsrMV WITH ~35M NON-ZEROS

                                              thermomech_dK               cnr-2000             ASIC_320k
                                              (temperature deformation)   (Web connectivity)   (circuit simulation)
      Row-length coeff. of variation          0.10                        2.1                  61.4
      2x Intel Xeon E5-2690 (24-cores each)
        MKL (DP-GFLOPs)                       17.9                        13.4                 11.8

      Annotation on the MKL row: 35% slower

  11. IMBALANCE: CsrMV WITH ~35M NON-ZEROS

                                              thermomech_dK               cnr-2000             ASIC_320k
                                              (temperature deformation)   (Web connectivity)   (circuit simulation)
      Row-length coeff. of variation          0.10                        2.1                  61.4
      2x Intel Xeon E5-2690 (24-cores each)
        MKL (DP-GFLOPs)                       17.9                        13.4                 11.8
      NVIDIA K40M
        cuSPARSE (DP-GFLOPs)                  12.4                        5.9                  0.12

      Annotation on the cuSPARSE row: 100x faceplant

  12. IMBALANCE: CsrMV WITH ~35M NON-ZEROS

                                              thermomech_dK               cnr-2000             ASIC_320k
                                              (temperature deformation)   (Web connectivity)   (circuit simulation)
      Row-length coeff. of variation          0.10                        2.1                  61.4
      2x Intel Xeon E5-2690 (24-cores each)
        MKL (DP-GFLOPs)                       17.9                        13.4                 11.8
        Merge-based (DP-GFLOPs)               21.2                        22.8                 23.2
      NVIDIA K40M
        cuSPARSE (DP-GFLOPs)                  12.4                        5.9                  0.12
        Merge-based (DP-GFLOPs)               15.5                        16.7                 14.1
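
The "row-length coeff. of variation" row is the standard deviation of the row lengths divided by their mean. A minimal sketch, assuming the population standard deviation (the slides do not say which variant was used); `row_length_cv` is a hypothetical helper name, applied here to the small 4x4 example rather than to the three benchmark matrices.

```python
from statistics import mean, pstdev

def row_length_cv(row_offsets):
    """Coefficient of variation of CSR row lengths (population std / mean)."""
    lengths = [b - a for a, b in zip(row_offsets, row_offsets[1:])]
    return pstdev(lengths) / mean(lengths)

cv = row_length_cv([0, 2, 2, 4, 8])  # row lengths 2, 0, 2, 4
print(round(cv, 3))                  # 0.707
```

A value near 0 means rows of nearly uniform length (as for thermomech_dK); larger values mean a few rows hold a disproportionate share of the nonzeros (as for ASIC_320k).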

  13. GPU CsrMV PERFORMANCE LANDSCAPE
      The entire Florida Sparse Matrix Collection (4.2K datasets, NVIDIA K40M)
      [Figure: two panels, cuSPARSE CsrMV and Merge-based CsrMV. x-axis: matrices by size; y-axis: runtime (ms), log scale from 0.001 to 1000. Annotation: "Highly correlated with problem size!"]

  14. CPU CsrMV PERFORMANCE LANDSCAPE
      The entire Florida Sparse Matrix Collection (4.2K datasets, 2x Intel Xeon E5-2690)
      [Figure: two panels, MKL CsrMV and Merge-based CsrMV. x-axis: matrices by size; y-axis: runtime (ms), log scale from 0.001 to 1000. Annotation: "Highly correlated with problem size!"]

  15. CSRMV VISUALIZATION AS 2D “MERGE-PATH”

  16. CsrMV visualization as 2D “merge-path”
      ▪ Decision path runs from top-left to bottom-right
        Each step consumes the smaller of the two current items and advances that list's pointer
        Breaks ties by always preferring the element from row_offsets
      ▪ Moves down when advancing within ℕ
        Action: accumulate nonzero dp-values
      ▪ Moves right when advancing within row_offsets
        Action: flush and reset accumulator
      Horizontal axis (row_offsets): 0 2 2 4 8 (highlighted entry: 2)
      Vertical axis (ℕ, with the Ax dot products):
        0  (1.0)(1.0)
        1  (1.0)(1.0)
        2  (3.0)(1.0)
        3  (3.0)(1.0)
        4  (4.0)(1.0)
        5  (4.0)(1.0)
        6  (4.0)(1.0)
        7  (4.0)(1.0)
      The path is marked "start" at the top-left and "end" at the bottom-right.

  17. CsrMV visualization as 2D “merge-path”
      Same grid and rules as slide 16; row_offsets 0 2 2 4 8 (highlighted entry: 2).
      Path state: nonzero 0 accumulated (+1.0).

  18. CsrMV visualization as 2D “merge-path”
      Same grid and rules as slide 16; row_offsets 0 2 2 4 8 (highlighted entry: 2).
      Path state: nonzeros 0 and 1 accumulated (+1.0 each).

  19. CsrMV visualization as 2D “merge-path”
      Same grid and rules as slide 16; row_offsets 0 2 2 4 8 (highlighted entry: 2).
      Path state: nonzeros 0 and 1 accumulated (+1.0 each); flush emits 2.0.

  20. CsrMV visualization as 2D “merge-path”
      Same grid and rules as slide 16; row_offsets 0 2 2 4 8 (highlighted entry: 4).
      Path state: nothing accumulated since the previous flush; flush emits 0.0.

  21. CsrMV visualization as 2D “merge-path”
      Same grid and rules as slide 16; row_offsets 0 2 2 4 8 (highlighted entry: 4).
      Path state: nonzero 2 accumulated (+3.0).

  22. CsrMV visualization as 2D “merge-path”
      Same grid and rules as slide 16; row_offsets 0 2 2 4 8 (highlighted entry: 4).
      Path state: nonzeros 2 and 3 accumulated (+3.0 each).

  23. CsrMV visualization as 2D “merge-path”
      Same grid and rules as slide 16; row_offsets 0 2 2 4 8 (highlighted entry: 8).
      Path state: nonzeros 2 and 3 accumulated (+3.0 each); flush emits 6.0.
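
The walk animated on slides 16 through 23 can be replayed sequentially. The sketch below is an illustration under stated assumptions, not the authors' implementation: it merges the row-end offsets (`row_offsets[1:]`) with the nonzero indices, moves down on a nonzero, moves right on a row end, and lets row ends win ties. The name `merge_path_spmv` and the `trace` format are hypothetical.

```python
def merge_path_spmv(values, col_indices, row_offsets, x):
    """Sequential merge-path traversal; returns (y, trace of path moves)."""
    row_end_offsets = row_offsets[1:]
    y = [0.0] * len(row_end_offsets)
    trace = []
    i = j = 0   # i: rows flushed so far, j: nonzeros consumed so far
    acc = 0.0
    while i < len(row_end_offsets):
        if j < row_end_offsets[i]:
            # Nonzero index is the smaller item: move down and accumulate.
            acc += values[j] * x[col_indices[j]]
            trace.append(("down", j))
            j += 1
        else:
            # Row end is smaller or tied: move right, flush, reset accumulator.
            y[i] = acc
            trace.append(("right", i, acc))
            acc = 0.0
            i += 1
    return y, trace

values      = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0]
col_indices = [0, 2, 2, 3, 0, 1, 2, 3]
row_offsets = [0, 2, 2, 4, 8]
x = [1.0, 1.0, 1.0, 1.0]

y, trace = merge_path_spmv(values, col_indices, row_offsets, x)
for step in trace[:7]:
    print(step)
print(y)  # [2.0, 0.0, 6.0, 16.0]
```

The first seven moves match the builds above: two accumulations, a flush of 2.0, a flush of 0.0 for the empty row, two more accumulations, then a flush of 6.0. The whole path has 12 moves (4 rows plus 8 nonzeros), and it is this fixed length that a parallel version can cut into equal pieces.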
