slide-1
SLIDE 1

CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs


Jiya Su⋄‡, Feng Zhang⋄, Weifeng Liu⋆, Bingsheng He+, Ruofan Wu⋄, Xiaoyong Du⋄, Rujia Wang‡

⋄Renmin University of China ⋆ China University of Petroleum +National University of Singapore ‡ Illinois Institute of Technology

49th International Conference on Parallel Processing - ICPP

slide-2
SLIDE 2

Outline

1. Background
2. Motivation
3. Challenges
4. CapelliniSpTRSV
5. Evaluation
6. Source Code at GitHub
7. Conclusion


slide-4
SLIDE 4
  • 1. Background

Sparse Matrix in CSR format

(a) Matrix L: an 8×8 lower triangular matrix with 20 nonzeros (all equal to 1). Per row, the nonzero columns (0-indexed) are:

row 0: (0)
row 1: (1)
row 2: (1, 2)
row 3: (1, 2, 3)
row 4: (0, 1, 4)
row 5: (2, 5)
row 6: (0, 2, 5, 6)
row 7: (0, 1, 2, 7)

Row levels (from the figure): 0, 0, 1, 2, 1, 2, 3, 2.

(b) CSR representation:

csrRowPtr = (0, 1, 2, 4, 7, 10, 12, 16, 20)
csrColIdx = (0, 1, 1, 2, 1, 2, 3, 0, 1, 4, 2, 5, 0, 2, 5, 6, 0, 1, 2, 7)
csrVal = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
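As a quick illustration of how these arrays are read (a hypothetical snippet, not from the paper), slicing csrColIdx by consecutive csrRowPtr entries recovers each row's nonzero columns:

```python
# Sanity check: the CSR arrays above describe the matrix L shown in (a).
csr_row_ptr = [0, 1, 2, 4, 7, 10, 12, 16, 20]
csr_col_idx = [0, 1, 1, 2, 1, 2, 3, 0, 1, 4, 2, 5, 0, 2, 5, 6, 0, 1, 2, 7]

# row i's nonzero columns live in csr_col_idx[csr_row_ptr[i]:csr_row_ptr[i+1]]
rows = [csr_col_idx[csr_row_ptr[i]:csr_row_ptr[i + 1]] for i in range(8)]
print(rows)
# [[0], [1], [1, 2], [1, 2, 3], [0, 1, 4], [2, 5], [0, 2, 5, 6], [0, 1, 2, 7]]
```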

slide-5
SLIDE 5
  • 1. Background

Sparse Triangular Solve Example: Lx = b

With the matrix L above, solve Lx = b for x, where b = (1, 1, 2, 3, 3, 2, 4, 4)ᵀ and all components of x are unknown.

[Figure: matrix L with its level structure (levels 0–3), the unknown vector x, and the right-hand side b.]

slide-6
SLIDE 6
  • 1. Background

Sparse Triangular Solve Example: Lx = b

Solving Lx = b with b = (1, 1, 2, 3, 3, 2, 4, 4)ᵀ gives x = (1, 1, 1, 1, 1, 1, 1, 1)ᵀ.

[Figure: matrix L with its level structure (levels 0–3), the solved vector x, and the right-hand side b.]
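For reference, the forward substitution that produces this x can be sketched sequentially (an illustrative Python version, not the paper's GPU code):

```python
# Sequential SpTRSV on the slides' CSR matrix: forward substitution row by row.
row_ptr = [0, 1, 2, 4, 7, 10, 12, 16, 20]
col_idx = [0, 1, 1, 2, 1, 2, 3, 0, 1, 4, 2, 5, 0, 2, 5, 6, 0, 1, 2, 7]
val = [1.0] * 20
b = [1, 1, 2, 3, 3, 2, 4, 4]

x = [0.0] * 8
for i in range(8):
    s = 0.0
    # columns are sorted within a row, so the diagonal is the row's last entry;
    # everything before it is an off-diagonal dependency on earlier x values
    for k in range(row_ptr[i], row_ptr[i + 1] - 1):
        s += val[k] * x[col_idx[k]]
    x[i] = (b[i] - s) / val[row_ptr[i + 1] - 1]
print(x)  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```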

slide-7
SLIDE 7
  • 1. Background

Concepts : · Component

[Figure: the Lx = b example with one component of x highlighted.]

slide-8
SLIDE 8
  • 1. Background

Concepts : · Component · Element

[Figure: the Lx = b example with one nonzero element of L highlighted.]

slide-9
SLIDE 9
  • 1. Background

Concepts : · Component · Element · Dependency

[Figure: the Lx = b example with a dependency between components highlighted.]

slide-10
SLIDE 10
  • 1. Background

Concepts : · Component · Element · Dependency · Level

[Figure: the Lx = b example with the level-set structure (levels 0–3) highlighted.]

slide-11
SLIDE 11
  • 1. Background

Level-Set SpTRSV. The level-set method has two phases: (1) grouping nodes (rows or columns) that can be consumed in parallel, and (2) solving the nodes group by group, with a barrier between groups.

[Figure: (a) Matrix L, rows annotated by level. (b) Components of x grouped into level-sets: Level 0 = {x1, x2}, Level 1 = {x3, x5}, Level 2 = {x4, x6, x8}, Level 3 = {x7}.]
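The grouping phase can be sketched in a few lines (a minimal Python illustration on the slides' matrix; a row's level is one more than the deepest level it depends on):

```python
# Level-set construction: rows with no off-diagonal entries form level 0;
# any other row sits one level above the deepest row it reads from.
row_ptr = [0, 1, 2, 4, 7, 10, 12, 16, 20]
col_idx = [0, 1, 1, 2, 1, 2, 3, 0, 1, 4, 2, 5, 0, 2, 5, 6, 0, 1, 2, 7]
n = 8

level = [0] * n
for i in range(n):
    # off-diagonal columns of row i are its dependencies
    deps = [col_idx[k] for k in range(row_ptr[i], row_ptr[i + 1] - 1)]
    level[i] = 1 + max((level[j] for j in deps), default=-1)
print(level)  # [0, 0, 1, 2, 1, 2, 3, 2] -- matching the slide's level-sets
```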


slide-14
SLIDE 14
  • 1. Background

Synchronization-Free SpTRSV (warp-level)

The algorithm computes the components of x in the original row order of the input matrix and uses one warp to compute one row.

It uses a new flag array, in_degree, to indicate whether a component of x has been solved, which avoids the level-set synchronization and greatly reduces processing time.

Liu W, Li A, Hogg J, et al. A synchronization-free algorithm for parallel sparse triangular solves[C]//European Conference on Parallel Processing. Springer, Cham, 2016: 617-630.
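As a rough illustration of the bookkeeping in this scheme (my row-wise reading of it; the original algorithm works on CSC with atomic counter updates), in_degree can be seen as the number of off-diagonal nonzeros a row must wait on:

```python
# in_degree[i]: how many already-solved components row i needs before x[i]
# can be computed. Rows with in_degree 0 can start immediately.
row_ptr = [0, 1, 2, 4, 7, 10, 12, 16, 20]
n = 8

# each row has one diagonal entry; the rest are dependencies
in_degree = [row_ptr[i + 1] - row_ptr[i] - 1 for i in range(n)]
print(in_degree)  # [0, 0, 1, 2, 2, 1, 3, 3]
```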


slide-17
SLIDE 17
  • 1. Background

Case study: preprocessing time and execution time of different SpTRSV algorithms

Time (ms)                   nlpkkt160   wiki-Talk    cant
Level-Set  preprocessing       310.07       31.09    4.81
           execution            28.07       12.89   28.79
cuSPARSE   preprocessing        16.24        1.99    0.28
           execution            37.98       11.88    7.69
Sync-Free  preprocessing         8.07        0.42    0.28
           execution            27.73       10.02    5.02

slide-19
SLIDE 19
  • 2. Motivation

Performance trend of warp-level synchronization-free SpTRSV.

Performance declines after reaching its peak.

slide-20
SLIDE 20
  • 2. Motivation

[Figure: element-to-thread assignment over time for the example matrix (warp 1 = threads 1–3, warp 2 = threads 4–6; levels 0–3 and data transmission are marked in the original figure).]

(a) Level-Set SpTRSV:
  thread 1: L(0,0) L(2,1) L(2,2) L(3,1) L(3,2) L(3,3) L(6,0) L(6,2) L(6,5) L(6,6)
  thread 2: L(1,1) L(4,0) L(4,1) L(4,4) L(5,2) L(5,5)
  thread 3: L(7,0) L(7,1) L(7,2) L(7,7)
  threads 4–6: idle

(b) Warp-Level Synchronization-Free SpTRSV:
  thread 1: L(0,0) L(2,1) L(2,2) L(4,0) L(4,4) L(5,2) L(5,5) L(7,0) L(7,7)
  thread 2: L(4,1) L(7,1)
  thread 3: L(7,2)
  thread 4: L(1,1) L(3,1) L(3,2) L(3,3) L(6,0) L(6,5) L(6,6)
  thread 5: L(6,2)
  thread 6: idle

(c) Thread-Level Synchronization-Free SpTRSV (CapelliniSpTRSV):
  thread 1: L(0,0) L(6,0) L(6,2) L(6,5) L(6,6)
  thread 2: L(1,1) L(7,0) L(7,1) L(7,2) L(7,7)
  thread 3: L(2,1) L(2,2)
  thread 4: L(3,1) L(3,2) L(3,3)
  thread 5: L(4,0) L(4,1) L(4,4)
  thread 6: L(5,2) L(5,5)

slide-21
SLIDE 21
  • 2. Motivation
  • Observation: the warp-level synchronization-free SpTRSV algorithm cannot fully utilize GPU resources when the parallel granularity is large.
  • Insight: use a fine-grained, thread-level design (Capellini).
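The slides do not define "parallel granularity" precisely; a plausible reading (my assumption, for illustration only) is the average number of components per level, i.e. rows divided by the number of levels:

```python
# Assumed definition: parallel_granularity = n_rows / n_levels.
# Computed here for the slides' 8x8 example matrix.
row_ptr = [0, 1, 2, 4, 7, 10, 12, 16, 20]
col_idx = [0, 1, 1, 2, 1, 2, 3, 0, 1, 4, 2, 5, 0, 2, 5, 6, 0, 1, 2, 7]
n = 8

level = [0] * n
for i in range(n):
    deps = [col_idx[k] for k in range(row_ptr[i], row_ptr[i + 1] - 1)]
    level[i] = 1 + max((level[j] for j in deps), default=-1)

granularity = n / (max(level) + 1)
print(granularity)  # 2.0 -- 8 rows spread over 4 levels
```

A value near 1 (such as 1.18 for matrix lp1 on slide 42) means almost every level holds a single component, so warp-level schemes leave most lanes idle.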

slide-22
SLIDE 22
  • 3. Challenges
  • Challenge 1: avoiding deadlocks
  • In a thread-level design, the threads within one warp may have dependencies on each other.

slide-23
SLIDE 23
  • 3. Challenges
  • Challenge 2: last element checking
  • We need to verify whether the element being processed is on the diagonal, which introduces time overhead.

slide-24
SLIDE 24
  • 3. Challenges
  • Challenge 3: thread execution model
  • Although we use one thread to handle one component, GPUs still execute threads in warp (SIMT) mode.

slide-25
SLIDE 25
  • 4. CapelliniSpTRSV
  • Design to avoid deadlocks
  • A two-phase mechanism to avoid deadlocks in CapelliniSpTRSV.

slide-26
SLIDE 26
  • 4. CapelliniSpTRSV

main() {                              // host code
    InputMatrix(L);                   // Rows = L.row_number
    InitiateVector(x, b, get_value);  // x = 0, get_value = 0
    launchKernel(Rows);               // create Rows threads
}
kernel(L, x, b, get_value) {          // GPU kernel
    rowID = globalID;
    sum = 0;
    B = getBoundary(L, rowID);
    processWhileLoop(L, b, B, rowID, sum, get_value);
    processWrtFst(L, b, B, rowID, sum, get_value, x);
}

(a) Two-Phase CapelliniSpTRSV

slide-27
SLIDE 27
  • 4. CapelliniSpTRSV
  • Design to avoid deadlocks
  • A two-phase mechanism to avoid deadlocks in CapelliniSpTRSV.
  • Efficient last element checking
  • A novel design to reduce the number of last element checks.

slide-28
SLIDE 28
  • 4. CapelliniSpTRSV

processWhileLoop(L, b, B, rowID, sum, get_value) {
    For id = L.rowID.start to B {
        While !checkSolve(L, id, get_value);
        recordValue(L, id, b, sum);
    }
}
processWrtFst(L, b, B, rowID, sum, get_value, x) {
    id = B;
    While id < L.rowID.end {
        While checkSolve(L, id, get_value) {
            recordValue(L, id, b, sum);
            id++;
        }
        If id == (L.rowID.end - 1) {
            computeXValue(L, x, b, sum, rowID);
            setValue_get(rowID, get_value);
            id++;
        }
    }
}
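The thread-level idea behind this pseudocode can be mimicked on a CPU. The following is a hypothetical Python sketch (not the authors' CUDA code): one OS thread per row, a solved-flag array standing in for get_value, and the checkSolve/recordValue pair collapsed into a spin loop plus a partial sum. GPU-specific details such as warp lockstep and the getBoundary split are omitted.

```python
# Thread-level synchronization-free SpTRSV sketch: each thread solves one
# row, spinning on per-row "solved" flags instead of level barriers.
import threading
import time

row_ptr = [0, 1, 2, 4, 7, 10, 12, 16, 20]
col_idx = [0, 1, 1, 2, 1, 2, 3, 0, 1, 4, 2, 5, 0, 2, 5, 6, 0, 1, 2, 7]
val = [1.0] * 20
b = [1, 1, 2, 3, 3, 2, 4, 4]
n = 8

x = [0.0] * n
solved = [False] * n  # plays the role of the get_value flag array

def solve_row(i):
    s = 0.0
    for k in range(row_ptr[i], row_ptr[i + 1] - 1):  # off-diagonal entries
        j = col_idx[k]
        while not solved[j]:   # busy-wait until x[j] is available
            time.sleep(0)      # yield to the other threads
        s += val[k] * x[j]
    x[i] = (b[i] - s) / val[row_ptr[i + 1] - 1]  # diagonal is the last entry
    solved[i] = True

threads = [threading.Thread(target=solve_row, args=(i,)) for i in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(x)  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

The dependency graph of a triangular matrix is acyclic, so every spin loop eventually sees its flag set; on a GPU the same progress argument needs the two-phase (or Writing-First) design because threads in one warp advance in lockstep.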


slide-30
SLIDE 30
  • 4. CapelliniSpTRSV
  • Design to avoid deadlocks
  • A two-phase mechanism to avoid deadlocks in CapelliniSpTRSV.
  • Efficient last element checking
  • A novel design to reduce the number of such last element checks.
  • Adaptation to GPU thread execution
  • A Writing-First optimization: threads compute elements and write partial results immediately, without waiting for the other threads.


slide-33
SLIDE 33

main() {                              // host code
    InputMatrix(L);                   // Rows = L.row_number
    InitiateVector(x, b, get_value);  // x = 0, get_value = 0
    launchKernel(Rows);               // create Rows threads
}
kernel(L, x, b, get_value) {          // GPU kernel
    rowID = globalID;
    sum = 0;
    processWrtFst(L, b, L.rowID.start, rowID, sum, get_value, x);
}

  • 4. CapelliniSpTRSV

(b) Writing-First CapelliniSpTRSV

slide-34
SLIDE 34
  • 4. CapelliniSpTRSV

Features:

  • No preprocessing
  • Our algorithm can be easily applied in various situations.
  • Strong effectiveness
  • Our algorithm complements current synchronization-free SpTRSV designs.
  • CSR format
  • Our algorithm works directly on CSR, the most popular sparse format.


slide-38
SLIDE 38
  • 5. Evaluation

Experimental Setup

  • Methods
  • Capellini
  • SyncFree
  • cuSPARSE
  • Platforms
  • Pascal: GeForce GTX 1080
  • Volta: Tesla V100
  • Turing: GeForce RTX 2080 Ti
  • Datasets
  • 245 matrices from the University of Florida Sparse Matrix Collection


slide-41
SLIDE 41
  • 5. Evaluation

Average performance:

  • cuSPARSE: 1.92 GFLOPS
  • SyncFree: 1.78 GFLOPS
  • CapelliniSpTRSV: 6.84 GFLOPS

[Figure: performance on (a) Pascal (GeForce GTX 1080), (b) Volta (Tesla V100), (c) Turing (GeForce RTX 2080 Ti).]

Capellini delivers the highest performance for 87% of the matrices.

slide-42
SLIDE 42
  • 5. Evaluation

Average speedup: 4.97× over SyncFree and 4.74× over cuSPARSE.

Example: for matrix lp1 (parallel granularity 1.18), the speedup over SyncFree reaches 34.77×.

slide-43
SLIDE 43
  • 5. Evaluation

Algorithm preference distribution


slide-44
SLIDE 44
  • 5. Evaluation

Detailed Analysis: bandwidth utilization (sum of read and write bandwidth)

Capellini reaches 56.09 GB/s, 5.17× that of SyncFree and 5.25× that of cuSPARSE.

slide-45
SLIDE 45
  • 5. Evaluation

Detailed Analysis

  • Executed instructions: Capellini saves 76.02% of executed instructions relative to SyncFree and 56.02% relative to cuSPARSE.
  • Instruction dependency stalls: Capellini at 12.55%, lower than SyncFree (25.60%) and cuSPARSE (65.40%).

[Figure: (a) number of GPU instructions executed; (b) percentage of instruction dependency stalls.]

slide-46
SLIDE 46
  • 6. Source Code at GitHub
  • https://github.com/JiyaSu/CapelliniSpTRSV

slide-47
SLIDE 47
  • 7. Conclusion
  • We present our insights into current SpTRSV algorithms and propose parallel granularity to characterize sparse matrices.
  • We develop CapelliniSpTRSV to process sparse matrices that previous SpTRSV algorithms cannot handle efficiently.
  • We evaluate CapelliniSpTRSV on 245 matrices and demonstrate its benefits over state-of-the-art SpTRSV algorithms.


slide-50
SLIDE 50

Thank you!

  • Any questions?

CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs

Jiya Su⋄, Feng Zhang⋄, Weifeng Liu⋆, Bingsheng He+, Ruofan Wu⋄, Xiaoyong Du⋄, Rujia Wang‡

⋄Renmin University of China ⋆ China University of Petroleum +National University of Singapore ‡ Illinois Institute of Technology

Jiya_Su@ruc.edu.cn, fengzhang@ruc.edu.cn, weifeng.liu@cup.edu.cn, hebs@comp.nus.edu.sg, 2017202106@ruc.edu.cn, duyong@ruc.edu.cn, rwang67@iit.edu