swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture
Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu
Outline
1. Background
2. Sunway architecture
3. Sparse Level Tile
clarencewxl@gmail.com
Compute a solution vector x from the sparse linear system Lx = b, where L is a square lower triangular sparse matrix and b is the right-hand vector.
Example:
system:
1*x0 = a
1*x1 = b
2*x1 + 1*x2 = c
3*x0 + 1*x3 = d
solution:
x0 = a
x1 = b
x2 = c - 2b
x3 = d - 3a
[Figure: L (4x4, nnzL = 6, known) multiplied by x (4x1, dense, unknown) equals b (4x1, dense, known).]
Use case: In direct methods for solving a sparse linear system Ax = b, A is first decomposed into LU, and LUx = b is then solved by calling two sparse triangular solves, Ly = b and Ux = y. In iterative solvers, an incomplete LU preconditioner uses sparse triangular solves in a similar way.
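As a point of reference, the 4x4 example can be solved with a plain sequential forward substitution. The sketch below assumes L is stored in CSR form; the function name `sptrsv_csr` and the sample values chosen for a, b, c and d are illustrative and not taken from the slides.

```python
def sptrsv_csr(row_ptr, col_idx, values, b):
    """Sequentially solve Lx = b for a lower triangular L stored in CSR."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = values[k]
            else:
                # j < i for a lower triangular matrix, so x[j] is already known
                acc -= values[k] * x[j]
        if diag is None or diag == 0.0:
            raise ValueError(f"row {i} has no usable diagonal entry")
        x[i] = acc / diag
    return x


# L from the example: rows are [1], [1], [2 1], [3 . . 1]
row_ptr = [0, 1, 2, 4, 6]
col_idx = [0, 1, 1, 2, 0, 3]
values = [1.0, 1.0, 2.0, 1.0, 3.0, 1.0]

# Hypothetical right-hand side: a = 1, b = 2, c = 7, d = 10
rhs = [1.0, 2.0, 7.0, 10.0]
x = sptrsv_csr(row_ptr, col_idx, values, rhs)
print(x)  # x0 = a, x1 = b, x2 = c - 2b, x3 = d - 3a
```

Row i can only be finished once every x[j] it references is available, which is the dependency that the parallel methods discussed next have to respect.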
[Figure: level-set example. Rows 0 to 15 are assigned to Thread 0, Thread 1 and Thread 2 and grouped into Level 0, Level 1, Level 2 and Level 3.]
Park J, et al. Sparsifying synchronization for high-performance shared-memory sparse triangular solver[C] International Supercomputing Conference. Springer, Cham, 2014: 124-140.
Thread 0 and Thread 2
Liu W, et al. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves[C] European Conference on Parallel Processing. Springer International Publishing, 2016: 617-630.
the same value by atomic operations.
Entire System
Peak Performance: 125 PFlops
Linpack Performance: 93 PFlops / 74.4%
Total Memory: 1310.72 TB
Total Memory Bandwidth: 5591.45 TB/s
# nodes: 40,960
# cores: 10,649,600
[Figure: processor architecture. Core Group 0 to Core Group 3 are connected by a NoC; each core group contains an MPE, an 8*8 CPE Mesh, a PPU, and an iMC attached to Memory through the Data Transfer Network. Inside the 8*8 CPE Mesh: Computing Core, Registers, LDM (SPM), Transfer Agent (TA), Control Network, Row Communication Bus and Column Communication Bus. Storage hierarchy: Memory Level, LDM Level, Register Level, Computing Level.]
Direct Memory Access (DMA): 22.6 GB/s
Global Load/Store (Gload/Gstore): 1.5 GB/s
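The system figures and the architecture diagram can be cross-checked with a little arithmetic. The sketch below assumes that each node holds one processor with four core groups, that each core group has one MPE plus an 8*8 CPE mesh, and that TB means 1000 GB; the variable names are illustrative.

```python
# Figures quoted on the system slide
nodes = 40_960
cores = 10_649_600
peak_pflops = 125.0
linpack_pflops = 93.0
total_mem_tb = 1310.72
total_bw_tbs = 5591.45

# Assumption: one processor per node, four core groups, 1 MPE + 8*8 CPEs each
core_groups_per_node = 4
cores_per_group = 1 + 8 * 8

cores_per_node = cores // nodes
linpack_efficiency = linpack_pflops / peak_pflops

# Assumption: decimal units (1 TB = 1000 GB)
mem_per_node_gb = total_mem_tb * 1000 / nodes
bw_per_core_group_gbs = total_bw_tbs * 1000 / nodes / core_groups_per_node

# DMA versus Gload/Gstore bandwidth from the architecture slides
dma_over_gload = 22.6 / 1.5

print(cores_per_node, cores_per_node == core_groups_per_node * cores_per_group)
print(f"Linpack efficiency: {linpack_efficiency:.1%}")
print(f"Memory per node: {mem_per_node_gb:.0f} GB")
print(f"Bandwidth per core group: {bw_per_core_group_gbs:.1f} GB/s")
print(f"DMA / Gload bandwidth ratio: {dma_over_gload:.1f}x")
```

Under these assumptions the per-core-group bandwidth comes out near the 34 GB/s quoted for a single CG in the testbed list, and DMA is roughly an order of magnitude faster than Gload/Gstore, which is why bulk, predictable transfers matter on this architecture.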
//P2P Test
if (id % 2 == 0)
    while (1) putr(data, id + 1);
else
    while (1) getr(&data);
Xu, Zhigeng, James Lin, and Satoshi Matsuoka. "Benchmarking SW26010 Many-Core Processor." Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 2017.
Latency: less than 11 cycles
Integrated Bandwidth: 637 GB/s
Limitation of register communication: it can only happen between CPEs in the same row or the same column.
Communication cycle + Random communication size
Lin H, et al. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores[C] Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 2017: 635-645.
[Figure: the solution vector x and the right-hand vector b are each divided into regions: X-Region 0 to X-Region 3 and B-Region 0 to B-Region 3.]
Each Tile only targets
fine-grained, random, unprefetchable vs. coarse-grained, predictable, prefetchable
Step 1:
Regions
Each off-diagonal nonzero $l_{ij}$ performs $b_i = b_i - l_{ij} \cdot y_j$, using $y_j$ from an X-Region to update $b_i$ from a B-Region
Step 2:
multiple levels if crossing multiple X-Regions.
[Figure: levels L0 to L3 over X-Region 0 to X-Region 3 and B-Region 0 to B-Region 3.]
Step 3:
with their "nearest" off-diagonal nonzeros to form Diagonal-Tiles
Benefit:
elements and b elements must be cached.
Step 4:
form Offdiagonal-Tiles
smaller than Region Size
Benefit:
elements and b elements must be cached
based on X-region
Step 5:
the maximum B-Region each Tile modifies;
in front of those with bigger ID
front of Offdiagonal-Tiles.
Benefit:
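The region bookkeeping behind Step 1 can be illustrated as follows. This is only a sketch under stated assumptions: regions are taken to be contiguous index ranges of a fixed `region_size`, and the helper `group_by_region` is hypothetical. It shows which X-Region an off-diagonal nonzero reads and which B-Region it updates; it does not reproduce the tile construction of Steps 2 to 5.

```python
from collections import defaultdict


def group_by_region(rows, cols, region_size):
    """Group off-diagonal nonzeros by (B-Region, X-Region) index pair.

    A nonzero at (i, j) reads x[j], which lives in X-Region j // region_size,
    and updates b[i], which lives in B-Region i // region_size.
    """
    groups = defaultdict(list)
    for k, (i, j) in enumerate(zip(rows, cols)):
        if i == j:
            continue  # diagonal entries are handled separately
        groups[(i // region_size, j // region_size)].append(k)
    return dict(groups)


# Coordinates of the six nonzeros of the 4x4 example matrix
rows = [0, 1, 2, 2, 3, 3]
cols = [0, 1, 1, 2, 0, 3]

# Assumed region size of 2: X-Region 0 = {x0, x1}, B-Region 1 = {b2, b3}
groups = group_by_region(rows, cols, region_size=2)
print(groups)
```

With a region size of 2, both off-diagonal nonzeros of the example read X-Region 0 and update B-Region 1, so one contiguous piece of x and one contiguous piece of b are enough to process them together.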
[Figure: Tiles 1 to 10 with their data movement. Legend: inherit from previous Tile; load from main memory; store to main memory; obtained from; used in.]
Producer-Consumer Pairs on the 8*8 CPE mesh (CPE 0,0 to CPE 7,7): Consumers and Producers each occupy 8 rows by 4 columns.
Consumers compute $x_i = b_i / l_{ii}$.
Send $x$ from Consumers to Producers.
Share x across the same column.
Producers set $y_j = x$ and compute $\Delta b_i = l_{ij} y_j$.
Send $\Delta b_i$ from Producers to Consumers.
Consumers update $b_i = b_i - \Delta b_i$.
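The arithmetic split between the two roles can be followed in a sequential sketch. It assumes L is stored in CSC form with the diagonal as the first entry of each column, and it models neither the CPE mesh nor register communication; the function name `solve_with_roles` and the right-hand side values are illustrative.

```python
def solve_with_roles(col_ptr, row_idx, values, b):
    """Column-oriented forward substitution, annotated with the two roles."""
    b = list(b)  # work on a copy; b is updated in place as columns finish
    n = len(b)
    x = [0.0] * n
    for j in range(n):
        start, end = col_ptr[j], col_ptr[j + 1]
        if start == end or row_idx[start] != j:
            raise ValueError(f"column {j} must start with its diagonal entry")
        # Consumer role: finish the solution element from the updated b
        x[j] = b[j] / values[start]
        # Producer role: keep a local copy of the value that was sent over
        y_j = x[j]
        for k in range(start + 1, end):
            i = row_idx[k]
            delta_b = values[k] * y_j  # Producer role: delta_b_i = l_ij * y_j
            b[i] -= delta_b            # Consumer role: b_i = b_i - delta_b_i
    return x


# CSC form of the 4x4 example matrix
col_ptr = [0, 2, 4, 5, 6]
row_idx = [0, 3, 1, 2, 2, 3]
values = [1.0, 3.0, 1.0, 2.0, 1.0, 1.0]

# Hypothetical right-hand side: a = 1, b = 2, c = 7, d = 10
x = solve_with_roles(col_ptr, row_idx, values, [1.0, 2.0, 7.0, 10.0])
print(x)
```

Separating the division from the multiply-and-subtract is what allows the two kinds of work to sit on different CPEs, with only x and the partial updates crossing between them.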
Testbeds and methods

A single CG of SW26010 @ 34 GB/s
1. A sequential method on MPE
2. A basic level-set method on CPEs
3. swSpTRSV method on CPEs

A single chip of NVIDIA K80 @ 240 GB/s
1. The method from cuSPARSE v1
2. The method from cuSPARSE v2
3. The synchronization-free method [1]

An Intel Xeon Phi 7120 KNC @ 352 GB/s
1. The method from Intel MKL
2. The P2P synchronization method [2]

[1] Liu W, et al. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves[C] European Conference on Parallel Processing. Springer International Publishing, 2016: 617-630.
[2] Park J, et al. Sparsifying synchronization for high-performance shared-memory sparse triangular solver[C] International Supercomputing Conference. Springer, Cham, 2014: 124-140.
Parallelism = #nonzeros/#levels
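The parallelism metric, #nonzeros/#levels, can be computed for the 4x4 example from earlier in the deck. The sketch below assumes a CSR matrix and the usual level-set definition, in which a row's level is one more than the deepest row it depends on; the function name `level_sets` is illustrative.

```python
def level_sets(row_ptr, col_idx):
    """Return the level of every row of a lower triangular CSR matrix."""
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:
                # Row i must wait for row j, so it sits at least one level deeper
                level[i] = max(level[i], level[j] + 1)
    return level


# CSR structure of the 4x4 example matrix (nnzL = 6)
row_ptr = [0, 1, 2, 4, 6]
col_idx = [0, 1, 1, 2, 0, 3]

level = level_sets(row_ptr, col_idx)
num_levels = max(level) + 1
nnz = len(col_idx)
parallelism = nnz / num_levels
print(level, num_levels, parallelism)
```

Here x0 and x1 fall in the first level and x2 and x3 in the second, so the six nonzeros spread over two levels. Matrices with many levels and few nonzeros per level offer little parallelism by this metric.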
The cheapest cost is 0.3 ms, and the most expensive cost is 28.2 s. The harmonic mean is 3.3 ms.
[Figure: Performance (MFlops) versus Parallelism (nnz/levels) for Group A, Group B, Group C and Group D; annotated value: 3406 MFlops.]
sequential method:
set method:
The largest power is 38.18 W, and the best performance/power reaches 89.22 MFlops/W.
[Figure: Performance/Power (MFlops/Watt) and Power (Watt). Series: ddr idle, core idle, ddr, core, Performance/Power.]
method can achieve the best performance compared with other methods
method on KNC in 1856 benchmarks
SyncFree method on K80 in 1672 benchmarks
best performance in 1624 benchmarks
1. Sparse Level Tile layout;
method on MPE;
level-set method on CPEs;
benchmarks