CSR SpMV with guaranteed workload balance Merge-based Parallel - PowerPoint PPT Presentation

CSR SpMV with guaranteed workload balance Merge-based Parallel Decomposition NVIDIA Research Duane Merrill (dumerrill@nvidia.com) Michael Garland (mgarland@nvidia.com) January 26, 2019 D. Merrill and M. Garland, "Merge-Based Parallel Sparse Matrix-Vector Multiplication," SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , Salt Lake City, UT , 2016, pp. 678-689. doi: 10.1109/SC.2016.57 1

My soapbox 1. Algorithmic parallel decomposition matters too • Versus delegation of scheduling entirely to compiler/runtime 2. Workload imbalance in sparse applications • The biggest killer of machine utilization • Performance response for arbitrary inputs: reliable vs . capricious “face - planting” 3. Standard data formats • Performance portability 4. Evaluation methodology Avoid overfitting by benchmarking on 1Ks-1Ms of datasets, not 10s of datasets • 2

PERFORMANCE (IN)CONSISTENCY Faceplant “Consistency is far better than rare moments of greatness” -Scott Ginsberg 3

SPARSE MATRIX-VECTOR MULTIPLICATION Lots of available parallelism 1.0 -- 1.0 -- 1.0 (1.0)(1.0) + (1.0)(1.0) -- -- -- -- 1.0 0.0 = * -- -- 3.0 3.0 1.0 (3.0)(1.0) + (3.0)(1.0) 4.0 4.0 4.0 4.0 1.0 (4.0)(1.0) + (4.0)(1.0) + (4.0)(1.0) +(4.0)(1.0) sparse matrix dense vector dense vector A x y 4

CSR PARALLEL DECOMPOSITION Option (a): row-based 1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0 values 1.0 -- 1.0 -- column indices 0 2 2 3 0 1 2 3 -- -- -- -- -- -- 3.0 3.0 p 1 p 2 p 3 p 0 4.0 4.0 4.0 4.0 row 0 2 2 4 8 offsets A 5

CSR PARALLEL DECOMPOSITION imbalance! Option (a): row-based p 0 p 1 p 2 p 3 1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0 values 1.0 -- 1.0 -- column indices 0 2 2 3 0 1 2 3 -- -- -- -- -- -- 3.0 3.0 p 1 p 2 p 3 p 0 4.0 4.0 4.0 4.0 row 0 2 2 4 8 offsets A 6

CSR PARALLEL DECOMPOSITION Option (b): nonzero splitting p 0 p 1 p 2 p 3 1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0 values 1.0 -- 1.0 -- column indices 0 2 2 3 0 1 2 3 -- -- -- -- -- -- 3.0 3.0 4.0 4.0 4.0 4.0 row 0 2 2 4 8 offsets A 7

CSR PARALLEL DECOMPOSITION Option (b): nonzero splitting p 0 p 1 p 2 p 3 1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0 values 1.0 -- 1.0 -- column indices 0 2 2 3 0 1 2 3 -- -- -- -- -- -- 3.0 3.0 p 1 p 2 p 3 p 0 4.0 4.0 4.0 4.0 row 0 2 2 4 8 offsets A imbalance! 8

CSR PARALLEL DECOMPOSITION Option (c): logical merger p 0 p 1 p 2 p 3 1.0 -- 1.0 -- 1.0 1.0 3.0 3.0 4.0 4.0 4.0 4.0 values -- -- -- -- 0 2 2 4 row_offsets -- -- 1.0 1.0 column indices 0 2 2 3 0 1 2 3 1.0 1.0 1.0 1.0 A 9

IMBALANCE: CsrMV WITH ~35M NON-ZEROS thermomech_dK cnr-2000 ASIC_320k (temperature deformation) (Web connectivity) (circuit simulation) Row-length coeff. of variation 0.10 2.1 61.4 (24-cores each) Xeon E5-2690 2x Intel MKL (DP-GFLOPs) 17.9 13.4 11.8 35% slower 10

IMBALANCE: CsrMV WITH ~35M NON-ZEROS thermomech_dK cnr-2000 ASIC_320k (temperature deformation) (Web connectivity) (circuit simulation) Row-length coeff. of variation 0.10 2.1 61.4 (24-cores each) Xeon E5-2690 2x Intel MKL (DP-GFLOPs) 17.9 13.4 11.8 NVIDIA K40M cuSPARSE (DP-GFLOPs) 12.4 5.9 0.12 100x faceplant 11

IMBALANCE: CsrMV WITH ~35M NON-ZEROS thermomech_dK cnr-2000 ASIC_320k (temperature deformation) (Web connectivity) (circuit simulation) Row-length coeff. of variation 0.10 2.1 61.4 (24-cores each) Xeon E5-2690 2x Intel MKL (DP-GFLOPs) 17.9 13.4 11.8 Merge-based (DP-GFLOPs) 21.2 22.8 23.2 cuSPARSE (DP-GFLOPs) 12.4 5.9 0.12 NVIDIA K40M Merge-based (DP-GFLOPs) 15.5 16.7 14.1 12

GPU CsrMV PERFORMANCE LANDSCAPE The entire Florida Sparse Matrix Collection (4.2K datasets, NVIDIA K40M) 1000 1000 100 100 Highly correlated with 10 10 Runtime (ms) Runtime (ms) problem size! 1 1 0.1 0.1 0.01 0.01 0.001 0.001 Matrices by size Matrices by size cuSPARSE CsrMV Merge-based CsrMV 13

CPU CsrMV PERFORMANCE LANDSCAPE The entire Florida Sparse Matrix Collection (4.2K datasets, 2x Intel Xeon E5-2690) 1000 1000 100 100 Highly correlated with 10 10 Runtime (ms) Runtime (ms) problem size! 1 1 0.1 0.1 0.01 0.01 0.001 0.001 Matrices by size Matrices by size MKL CsrMV Merge-based CsrMV 14

CSRMV VISUALIZATION AS 2D “MERGE - PATH” 15

CsrMV visualization as 2D “merge -path ” 2 0 2 4 8 ▪ Decision path runs from top-left to bottom-right row_offsets start 0 (1.0)(1.0) Each step advances the pointer to the bigger item Breaks ties by always preferring the element from 1 (1.0)(1.0) row_offsets 2 (3.0)(1.0) ▪ Moves down when advancing within ℕ Ax dot products 3 (3.0)(1.0) ( ℕ ) 4 Action : accumulate nonzero dp-values (4.0)(1.0) 5 (4.0)(1.0) ▪ Moves right when advancing within row_offsets 6 (4.0)(1.0) Action : flush and reset accumulator 7 (4.0)(1.0) end 16 16

CsrMV visualization as 2D “merge - path” 2 0 2 4 8 ▪ Decision path runs from top-left to bottom-right row_offsets start 0 (+1.0) (1.0)(1.0) Each step advances the pointer to the bigger item Breaks ties by always preferring the element from 1 (1.0)(1.0) row_offsets 2 (3.0)(1.0) ▪ Moves down when advancing within ℕ Ax dot products 3 (3.0)(1.0) ( ℕ ) 4 Action : accumulate nonzero dp-values (4.0)(1.0) 5 (4.0)(1.0) ▪ Moves right when advancing within row_offsets 6 (4.0)(1.0) Action : flush and reset accumulator 7 (4.0)(1.0) end 17 17

CsrMV visualization as 2D “merge - path” 2 0 2 4 8 ▪ Decision path runs from top-left to bottom-right row_offsets start 0 (+1.0) (1.0)(1.0) Each step advances the pointer to the bigger item Breaks ties by always preferring the element from 1 (+1.0) (1.0)(1.0) row_offsets 2 (3.0)(1.0) ▪ Moves down when advancing within ℕ Ax dot products 3 (3.0)(1.0) ( ℕ ) 4 Action : accumulate nonzero dp-values (4.0)(1.0) 5 (4.0)(1.0) ▪ Moves right when advancing within row_offsets 6 (4.0)(1.0) Action : flush and reset accumulator 7 (4.0)(1.0) end 18 18

CsrMV visualization as 2D “merge - path” 2 0 2 4 8 ▪ Decision path runs from top-left to bottom-right row_offsets start 0 (+1.0) (1.0)(1.0) Each step advances the pointer to the bigger item Breaks ties by always preferring the element from 1 (+1.0) (1.0)(1.0) row_offsets 2 (3.0)(1.0) 2.0 ▪ Moves down when advancing within ℕ Ax dot products 3 (3.0)(1.0) ( ℕ ) 4 Action : accumulate nonzero dp-values (4.0)(1.0) 5 (4.0)(1.0) ▪ Moves right when advancing within row_offsets 6 (4.0)(1.0) Action : flush and reset accumulator 7 (4.0)(1.0) end 19 19

CsrMV visualization as 2D “merge - path” 4 0 2 2 8 ▪ Decision path runs from top-left to bottom-right row_offsets start 0 (1.0)(1.0) Each step advances the pointer to the bigger item Breaks ties by always preferring the element from 1 (1.0)(1.0) row_offsets 2 (3.0)(1.0) 0.0 ▪ Moves down when advancing within ℕ Ax dot products 3 (3.0)(1.0) ( ℕ ) 4 Action : accumulate nonzero dp-values (4.0)(1.0) 5 (4.0)(1.0) ▪ Moves right when advancing within row_offsets 6 (4.0)(1.0) Action : flush and reset accumulator 7 (4.0)(1.0) end 20 20

CsrMV visualization as 2D “merge - path” 4 0 2 2 8 ▪ Decision path runs from top-left to bottom-right row_offsets start 0 (1.0)(1.0) Each step advances the pointer to the bigger item Breaks ties by always preferring the element from 1 (1.0)(1.0) row_offsets 2 (+3.0) (3.0)(1.0) ▪ Moves down when advancing within ℕ Ax dot products 3 (3.0)(1.0) ( ℕ ) 4 Action : accumulate nonzero dp-values (4.0)(1.0) 5 (4.0)(1.0) ▪ Moves right when advancing within row_offsets 6 (4.0)(1.0) Action : flush and reset accumulator 7 (4.0)(1.0) end 21 21

CsrMV visualization as 2D “merge - path” 4 0 2 2 8 ▪ Decision path runs from top-left to bottom-right row_offsets start 0 (1.0)(1.0) Each step advances the pointer to the bigger item Breaks ties by always preferring the element from 1 (1.0)(1.0) row_offsets 2 (+3.0) (3.0)(1.0) ▪ Moves down when advancing within ℕ Ax dot products 3 (+3.0) (3.0)(1.0) ( ℕ ) 4 Action : accumulate nonzero dp-values (4.0)(1.0) 5 (4.0)(1.0) ▪ Moves right when advancing within row_offsets 6 (4.0)(1.0) Action : flush and reset accumulator 7 (4.0)(1.0) end 22 22

CsrMV visualization as 2D “merge - path” 8 0 2 2 4 ▪ Decision path runs from top-left to bottom-right row_offsets start 0 (1.0)(1.0) Each step advances the pointer to the bigger item Breaks ties by always preferring the element from 1 (1.0)(1.0) row_offsets 2 (+3.0) (3.0)(1.0) ▪ Moves down when advancing within ℕ Ax dot products 3 (+3.0) (3.0)(1.0) ( ℕ ) 4 Action : accumulate nonzero dp-values (4.0)(1.0) 6.0 5 (4.0)(1.0) ▪ Moves right when advancing within row_offsets 6 (4.0)(1.0) Action : flush and reset accumulator 7 (4.0)(1.0) end 23 23

CSR SpMV with guaranteed workload balance Merge-based Parallel - PowerPoint PPT Presentation

CSR SpMV with guaranteed workload balance Merge-based Parallel Decomposition NVIDIA Research Duane Merrill (dumerrill@nvidia.com) Michael Garland (mgarland@nvidia.com) January 26, 2019 D. Merrill and M. Garland, "Merge-Based Parallel

Workload, Fatigue, and Sleep Disruption 1 Workload 1.What is workload? 2.What is the

CH CHALLENGES ALLENGES OF OF CS CSR Intr troduction oduction CSR continues to evolve CSR

Corporate Social Responsibility (CSR) Index What is Corporate Social Responsibility?

Communicating CSR: Enhancing or inhibiting socially responsible business practice? Organized by

REPORTING? o CSR with deponent together and counsel in a different location o CSR with counsel

Strategies for parallel SpMV multiplication Albert-Jan Yzelman June 2012 2012, ExaScience

Strategies for parallel SpMV multiplication Albert-Jan Yzelman June 2012 2012, ExaScience

Query Processing Query Processing Steps balance < 2500 ( balance ( account)) balance

WORKLOAD WORKLOAD WORKLOAD During exercise, nasal breathing causes a reduction in FEO 2

ASHA Workload Calculator What is Direct and Other indirect workload? activities Services

NCST Review Meeting Presentation On Major CSR Projects Selection of CSR Projects Base-Line

DAY 2 Agenda for Today Introduce the workload characterization problem. Discuss a

Day 3 Agenda for Today Formulate simple problem statement Revisit the workload

Local 006 Workload Appeal COLLECTIVE AGREEMENT 2014:LETTER OF INTENT #2 Why a Workload Appeal?

Workload Formulas Judicial Branch Workload Formulas and On-Bench Time Reporting | September 23,

CS 147: Computer Systems Performance Analysis Workload Selection 1 / 39 Overview CS147

2-3-4 Trees and Red- Black Trees 204 erm CS 16: Balanced Trees 2-3-4 Trees Revealed Nodes

Deletion from Okasakis Red-Black Trees: A Functional Pearl Matt Might University of Utah

Provisions Privacy-preserving proofs of solvency for Bitcoin exchanges Real World Crypto 2016

A Longitudinal View of Gender Balance in a Large Computer Science Program University of Michigan

Tie r m o d y n a m i c s o f I n f o r m a t i o n P r o c e s s i

Structure Preserving Numerical Methods for Hyperbolic Systems of Conservation and Balance Laws

st str If implemented, creates a string representation of an instance of the

Human rights in the balance David Clark MIT Why am I here? I wrote a paper that talked about

Sambuz

Useful Links

Newsletter

Mail Us

CSR SpMV with guaranteed workload balance Merge-based Parallel - PowerPoint PPT Presentation

CSR SpMV with guaranteed workload balance Merge-based Parallel Decomposition NVIDIA Research Duane Merrill (dumerrill@nvidia.com) Michael Garland (mgarland@nvidia.com) January 26, 2019 D. Merrill and M. Garland, "Merge-Based Parallel

Workload, Fatigue, and Sleep Disruption 1 Workload 1.What is workload? 2.What is the

CH CHALLENGES ALLENGES OF OF CS CSR Intr troduction oduction CSR continues to evolve CSR

Corporate Social Responsibility (CSR) Index What is Corporate Social Responsibility?

Communicating CSR: Enhancing or inhibiting socially responsible business practice? Organized by

REPORTING? o CSR with deponent together and counsel in a different location o CSR with counsel

Strategies for parallel SpMV multiplication Albert-Jan Yzelman June 2012 2012, ExaScience

Strategies for parallel SpMV multiplication Albert-Jan Yzelman June 2012 2012, ExaScience

Query Processing Query Processing Steps balance &lt; 2500 ( balance ( account)) balance

WORKLOAD WORKLOAD WORKLOAD During exercise, nasal breathing causes a reduction in FEO 2

ASHA Workload Calculator What is Direct and Other indirect workload? activities Services

NCST Review Meeting Presentation On Major CSR Projects Selection of CSR Projects Base-Line

DAY 2 Agenda for Today Introduce the workload characterization problem. Discuss a

Day 3 Agenda for Today Formulate simple problem statement Revisit the workload

Local 006 Workload Appeal COLLECTIVE AGREEMENT 2014:LETTER OF INTENT #2 Why a Workload Appeal?

Workload Formulas Judicial Branch Workload Formulas and On-Bench Time Reporting | September 23,

CS 147: Computer Systems Performance Analysis Workload Selection 1 / 39 Overview CS147

2-3-4 Trees and Red- Black Trees 204 erm CS 16: Balanced Trees 2-3-4 Trees Revealed Nodes

Deletion from Okasakis Red-Black Trees: A Functional Pearl Matt Might University of Utah

Provisions Privacy-preserving proofs of solvency for Bitcoin exchanges Real World Crypto 2016

A Longitudinal View of Gender Balance in a Large Computer Science Program University of Michigan

Tie r m o d y n a m i c s o f I n f o r m a t i o n P r o c e s s i

Structure Preserving Numerical Methods for Hyperbolic Systems of Conservation and Balance Laws

__ __st str__ __ If implemented, creates a string representation of an instance of the

Human rights in the balance David Clark MIT Why am I here? I wrote a paper that talked about

Sambuz

Useful Links

Newsletter

Mail Us

Query Processing Query Processing Steps balance < 2500 ( balance ( account)) balance

st str If implemented, creates a string representation of an instance of the