CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse - PowerPoint PPT Presentation

49th International Conference on Parallel Processing - ICPP CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs Jiya Su ⋄ ‡, Feng Zhang ⋄ , Weifeng Liu ★ , Bingsheng He+, Ruofan Wu ⋄ , Xiaoyong Du ⋄ , Rujia Wang‡ ⋄ Renmin University of China ⋆ China University of Petroleum +National University of Singapore ‡ Illinois Institute of Technology 1/48

Outline 1. Background 2. Motivation 3. Challenges 4. CapelliniSpTRSV 5. Evaluation 6. Source Code at Github 7. Conclusion 2/48

Outline 1. Background 2. Motivation 3. Challenges 4. CapelliniSpTRSV 5. Evaluation 6. Source Code at Github 7. Conclusion 3/48

1. Background Lower Triangular Matrix L Sparse Matrix 0 1 2 3 4 5 6 7 in CSR format 0 1 Level 0 1 1 Level 0 2 1 1 Level 1 3 1 1 1 Level 2 4 1 1 1 Level 1 5 1 1 Level 2 6 1 1 1 1 Level 3 7 1 1 1 1 Level 2 (a) Matrix L. csrRowPtr = (0, 1, 2, 4, 7, 10, 12, 16, 20) csrColIdx = (0, 1, 1, 2, 1, 2, 3, 0, 1, 4, 2, 5, 0, 2, 5, 6, 0, 1, 2, 7) csrVal = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) 4/48 (b) CSR representation.

1. Background Sparse Triangular Solve Example: Lx = b Lower Triangular Matrix L 0 1 2 3 4 5 6 7 1 ? 1 0 1 Level 0 1 ? 1 1 1 Level 0 1 1 ? 2 2 1 1 Level 1 1 1 1 ? 3 × = 3 1 1 1 ? 3 1 1 1 Level 2 4 1 1 1 ? 2 1 1 Level 1 ? 4 1 1 1 1 5 1 1 Level 2 ? 4 1 1 1 1 6 1 1 1 1 Level 3 7 1 1 1 1 Level 2 L x b Matrix L 5/48

1. Background Sparse Triangular Solve Example: Lx = b Lower Triangular Matrix L 0 1 2 3 4 5 6 7 1 1 1 0 1 Level 0 1 1 1 1 1 Level 0 1 1 1 2 2 1 1 Level 1 1 1 1 1 3 × = 3 1 1 1 1 3 1 1 1 Level 2 4 1 1 1 1 2 1 1 Level 1 1 4 1 1 1 1 5 1 1 Level 2 1 4 1 1 1 1 6 1 1 1 1 Level 3 7 1 1 1 1 Level 2 L x b Matrix L 6/48

1. Background Concepts : · Component component Lower Triangular Matrix L 0 1 2 3 4 5 6 7 1 1 1 0 1 Level 0 1 1 1 1 1 Level 0 1 1 1 2 2 1 1 Level 1 1 1 1 1 3 × = 3 1 1 1 1 3 1 1 1 Level 2 4 1 1 1 1 2 1 1 Level 1 1 4 1 1 1 1 5 1 1 Level 2 1 4 1 1 1 1 6 1 1 1 1 Level 3 7 1 1 1 1 Level 2 L x b Matrix L 7/48

1. Background Concepts : · Component · Element element Lower Triangular Matrix L 0 1 2 3 4 5 6 7 1 1 1 0 1 Level 0 1 1 1 1 1 Level 0 1 1 1 2 2 1 1 Level 1 1 1 1 1 3 × = 3 1 1 1 1 3 1 1 1 Level 2 4 1 1 1 1 2 1 1 Level 1 1 4 1 1 1 1 5 1 1 Level 2 1 4 1 1 1 1 6 1 1 1 1 Level 3 7 1 1 1 1 Level 2 L x b Matrix L 8/48

1. Background Concepts : · Component · Element · Dependency dependency Lower Triangular Matrix L 0 1 2 3 4 5 6 7 1 1 1 0 1 Level 0 1 1 1 1 1 Level 0 1 1 1 2 2 1 1 Level 1 1 1 1 1 3 × = 3 1 1 1 1 3 1 1 1 Level 2 4 1 1 1 1 2 1 1 Level 1 1 4 1 1 1 1 5 1 1 Level 2 1 4 1 1 1 1 6 1 1 1 1 Level 3 7 1 1 1 1 Level 2 L x b Matrix L 9/48

1. Background Concepts : · Component · Element · Dependency · Level Lower Triangular Matrix L 0 1 2 3 4 5 6 7 1 1 1 0 1 Level 0 Level-set 1 1 1 1 1 Level 0 1 1 1 2 2 1 1 Level 1 1 1 1 1 3 × = 3 1 1 1 1 3 1 1 1 Level 2 4 1 1 1 1 2 1 1 Level 1 1 4 1 1 1 1 5 1 1 Level 2 1 4 1 1 1 1 6 1 1 1 1 Level 3 7 1 1 1 1 Level 2 L x b Matrix L 10/48

1. Background Level-set SpTRSV The level-set method has two phases: (1) grouping nodes (rows or columns) that can be consumed in parallel, and (2) solving nodes group by group with barriers between. Level 0 Lower Triangular Matrix L 0 1 0 1 2 3 4 5 6 7 0 1 Level 0 Level 1 1 1 Level 0 2 4 2 1 1 Level 1 3 1 1 1 Level 2 Level 2 4 1 1 1 Level 1 3 5 7 5 1 1 Level 2 6 1 1 1 1 Level 3 Level 3 7 1 1 1 1 6 Level 2 (a) Matrix L . (b) Components x in the level-sets. 11/48

1. Background Level-set SpTRSV The level-set method has two phases: (1) grouping nodes es (rows or columns) that can be e consumed ed in parallel el , and (2) solving nodes group by group with barriers between. Level 0 Lower Triangular Matrix L 0 1 0 1 2 3 4 5 6 7 0 1 Level 0 Level 1 1 1 Level 0 2 4 2 1 1 Level 1 3 1 1 1 Level 2 Level 2 4 1 1 1 Level 1 3 5 7 5 1 1 Level 2 6 1 1 1 1 Level 3 Level 3 7 1 1 1 1 6 Level 2 (a) Matrix L . (b) Components x in the level-sets. 12/48

1. Background Level-set SpTRSV The level-set method has two phases: (1) grouping nodes (rows or columns) that can be consumed in parallel, and (2) (2) solving nodes es group by group with barrier ers between een . Level 0 Lower Triangular Matrix L 0 1 0 1 2 3 4 5 6 7 0 1 Level 0 Level 1 1 1 Level 0 2 4 2 1 1 Level 1 3 1 1 1 Level 2 Level 2 4 1 1 1 Level 1 3 5 7 5 1 1 Level 2 6 1 1 1 1 Level 3 Level 3 7 1 1 1 1 6 Level 2 (a) Matrix L . (b) Components x in the level-sets. 13/48

1. Background Synchronization-Free SpTRSV (warp-level) The algorithm computes components x in the original row order of the input matrix and uses one warp to compute one row. It uses a new flag array in in_degree to show whether the component x is solved, which avoids the synchronization and greatly reduces the processing time. Liu W, Li A, Hogg J, et al. A synchronization-free algorithm for parallel sparse triangular solves[C]//European 14/48 Conference on Parallel Processing. Springer, Cham, 2016: 617-630.

1. Background Synchronization-Free SpTRSV (warp-level) Th The alg algorit ithm computes components x in in the orig igin inal al row order of the input matrix and uses on of one warp to o com ompute on one row. It uses a new flag array in in_degree to show whether the component x is solved, which avoids the synchronization and greatly reduces the processing time. Liu W, Li A, Hogg J, et al. A synchronization-free algorithm for parallel sparse triangular solves[C]//European 15/48 Conference on Parallel Processing. Springer, Cham, 2016: 617-630.

1. Background Synchronization-Free SpTRSV (warp-level) The algorithm computes components x in the original row order of the input matrix and uses one warp to compute one row. It It uses es a a new flag lag ar array in in_degree to to show whether the component x is solved, which ch avoids the synch chronization and greatly reduce ces the proce cessing time. Liu W, Li A, Hogg J, et al. A synchronization-free algorithm for parallel sparse triangular solves[C]//European 16/48 Conference on Parallel Processing. Springer, Cham, 2016: 617-630.

1. Background Case study for preprocessing time and execution time of different SpTRSV algorithms Algorithm time (ms) nlpkkt160 wiki-Talk cant Level-Set preprocessing 310.07 31.09 4.81 execution 28.07 12.89 28.79 cuSPARSE preprocessing 16.24 1.99 0.28 execution 37.98 11.88 7.69 Sync-Free preprocessing 8.07 0.42 0.28 execution 27.73 10.02 5.02 17/48

2. Motivation Performance trend of warp-level synchronization-free SpTRSV. 18/48

2. Motivation Performance trend of warp-level synchronization-free SpTRSV. The performance declines after reaching the peak state. 19/48

2. Motivation thread 1 L(0,0) L(2,1) L(2,2) L(3,1) L(3,2) L(3,3) L(6,0) L(6,2) L(6,5) L(6,6) warp 1 thread 2 L(1,1) L(4,0) L(4,1) L(4,4) L(5,2) L(5,5) thread 3 L(7,0) L(7,1) L(7,2) L(7,7) thread 4 thread 5 warp 2 (a) Level-Set SpTRSV. thread 6 thread 1 L(0,0) L(2,1) L(2,2) L(4,0) L(4,4) L(5,2) L(5,5) L(7,0) L(7,7) warp 1 thread 2 L(4,1) L(7,1) thread 3 L(7,2) thread 4 L(1,1) L(3,1) L(3,2) L(3,3) L(6,0) L(6,5) L(6,6) thread 5 L(6,2) warp 2 Data thread 6 transmission (b) Warp-Level Synchronization-Free SpTRSV. Level 0 thread 1 L(0,0) L(6,0) L(6,2) L(6,5) L(6,6) Level 1 warp 1 thread 2 L(1,1) L(7,0) L(7,1) L(7,2) L(7,7) thread 3 L(2,1) L(2,2) Level 2 thread 4 L(3,1) L(3,2) L(3,3) Level 3 thread 5 L(4,0) L(4,1) L(4,4) warp 2 thread 6 L(5,2) L(5,5) (c) Thread-Level Synchronization-Free SpTRSV (CapelliniSpTRSV). time 20/48

2. Motivation • Observation: Warp-level synchronization-free SpTRSV algorithm cannot fully utilize GPU resources when parallel granularity is large. • Insight: Capellini fine-grained 21/48

3. Challenges • Challenge 1: avoiding deadlocks • In thread-level design, the threads in one warp may have dependencies. 22/48

3. Challenges • Challenge 2: last element checking • We need to verify whether the processed element is on the diagonal, which causes time overhead. 23/48

3. Challenges • Challenge 3: thread execution model • Although we use a thread to handle one component, the GPUs are still executed in the warp execution mode. 24/48

4. CapelliniSpTRSV • Design to avoid deadlocks • A two-phase mechanism to avoid the deadlocks in CapelliniSpTRSV 25/48

CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse - PowerPoint PPT Presentation

49th International Conference on Parallel Processing - ICPP CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs Jiya Su , Feng Zhang , Weifeng Liu , Bingsheng He+, Ruofan Wu , Xiaoyong Du ,

Thread and Synchronization Synchronization Mechanisms (Module 20) Yann-Hang Lee Arizona State

13 IN THIS CHAPTER Benefits of Thread Pooling 308 Considerations and Costs of Thread

POSIX Thread Synchronization Mutex Locks Condition Variables Read-Write Locks

Chapter 6: Process [& Thread] Synchronization Why is synchronization needed? CSCI [4|6]

Content Synchronization Content Synchronization March 2nd 2005 Jukka Honkola T-110.456

To thread or not to thread? Why PETSc favors MPI-only Plenary Discussion PETSc User Meeting 2016

Design of Thread-Safe Classes 1 Topic Outline Thread-Safe Classes Principles Confinement

CPL 2016, week 5 Inter-thread collaboration Oleg Batrashev Institute of Computer Science, Tartu,

CPL 2016, week 4 Inter-thread communication Oleg Batrashev Institute of Computer Science, Tartu,

Thread synchronization David Hovemeyer 2 December 2019 David Hovemeyer Computer Systems

CPL 2016, week 2 Inter-thread synchronization: locks and monitors Oleg Batrashev Institute of

Thread and Synchronization Synchronization Mechanisms (Module 21) Yann-Hang Lee Arizona State

CSCI [4|6] 730 Operating Systems Synchronization Part 1 : The Basics Maria Hybinette, UGA

LOCK/WAIT FREE SYNCHRONIZATION Synchronization Mutex Blocking Lock-free At

Synchronization-Free Parallelism Today SPMD and OpenMP programming models

Semaphores and Monitors: High-level Synchronization Constructs 1 Synchronization Constructs

CSE-571 Robot moves from to . Probabilistic Robotics x , y , x ' ,

Probabilistic Fundamentals in Robotics Pr Pr obabilistic Mode ls of Mobile Robots obabilistic

Simulation Discrete-Event System Simulation Dr. Mesut Gne Computer Science, Informatik

Addressing the remaining questions on SGLT2 & CKD: a review of new outcome trials Colin

Dense Triangular Solvers on Multicore Clusters using UPC Jorge Gonzlez-Domnguez*, Mara J.

Lecture 12 Gaussian Process Models 10/16/2018 1 Multivariate Normal Multivariate Normal

Nonconvex Phase Retrieval with Random Gaussian Measurements Yuejie Chi Department of Electrical

Comments on Sigma Filter Degradation of a smoothed image is due to blurring of object

CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse - PowerPoint PPT Presentation

49th International Conference on Parallel Processing - ICPP CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs Jiya Su , Feng Zhang , Weifeng Liu , Bingsheng He+, Ruofan Wu , Xiaoyong Du ,

Thread and Synchronization Synchronization Mechanisms (Module 20) Yann-Hang Lee Arizona State

13 IN THIS CHAPTER Benefits of Thread Pooling 308 Considerations and Costs of Thread

POSIX Thread Synchronization Mutex Locks Condition Variables Read-Write Locks

Chapter 6: Process [&amp; Thread] Synchronization Why is synchronization needed? CSCI [4|6]

Content Synchronization Content Synchronization March 2nd 2005 Jukka Honkola T-110.456

To thread or not to thread? Why PETSc favors MPI-only Plenary Discussion PETSc User Meeting 2016

Design of Thread-Safe Classes 1 Topic Outline Thread-Safe Classes Principles Confinement

CPL 2016, week 5 Inter-thread collaboration Oleg Batrashev Institute of Computer Science, Tartu,

CPL 2016, week 4 Inter-thread communication Oleg Batrashev Institute of Computer Science, Tartu,

Thread synchronization David Hovemeyer 2 December 2019 David Hovemeyer Computer Systems

CPL 2016, week 2 Inter-thread synchronization: locks and monitors Oleg Batrashev Institute of

Thread and Synchronization Synchronization Mechanisms (Module 21) Yann-Hang Lee Arizona State

CSCI [4|6] 730 Operating Systems Synchronization Part 1 : The Basics Maria Hybinette, UGA

LOCK/WAIT FREE SYNCHRONIZATION Synchronization Mutex Blocking Lock-free At

Synchronization-Free Parallelism Today SPMD and OpenMP programming models

Semaphores and Monitors: High-level Synchronization Constructs 1 Synchronization Constructs

CSE-571 Robot moves from to . Probabilistic Robotics x , y , x ' ,

Probabilistic Fundamentals in Robotics Pr Pr obabilistic Mode ls of Mobile Robots obabilistic

Simulation Discrete-Event System Simulation Dr. Mesut Gne Computer Science, Informatik

Addressing the remaining questions on SGLT2 &amp; CKD: a review of new outcome trials Colin

Dense Triangular Solvers on Multicore Clusters using UPC Jorge Gonzlez-Domnguez*, Mara J.

Lecture 12 Gaussian Process Models 10/16/2018 1 Multivariate Normal Multivariate Normal

Nonconvex Phase Retrieval with Random Gaussian Measurements Yuejie Chi Department of Electrical

Comments on Sigma Filter Degradation of a smoothed image is due to blurring of object

Chapter 6: Process [& Thread] Synchronization Why is synchronization needed? CSCI [4|6]

Addressing the remaining questions on SGLT2 & CKD: a review of new outcome trials Colin