slide-1
SLIDE 1

swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture

Xinliang Wang (1,3), Weifeng Liu (2), Wei Xue (1,3), Li Wu (1,3)

slide-2
SLIDE 2
Outline

  • 1. Background
  • 2. Sunway architecture
  • 3. Sparse Level Tile layout
  • 4. Producer-Consumer Pairing method
  • 5. Experiment
  • 6. Conclusion

clarencewxl@gmail.com

slide-3
SLIDE 3

Sparse Triangular Solve

Compute a solution vector x from the sparse linear system Lx = b, where L is a square lower triangular sparse matrix and b is the right-hand vector.

Example:

system:
  1*x0 = a
  1*x1 = b
  2*x1 + 1*x2 = c
  3*x0 + 1*x3 = d

solution:
  x0 = a
  x1 = b
  x2 = c - 2b
  x3 = d - 3a

slide-4
SLIDE 4

Sparse Triangular Solve

The same example written as L * x = b:

  | 1 0 0 0 |   | x0 |   | a |
  | 0 1 0 0 | * | x1 | = | b |
  | 0 2 1 0 |   | x2 |   | c |
  | 3 0 0 1 |   | x3 |   | d |

  L (4x4), nnzL = 6, known
  x (4x1), dense, unknown
  b (4x1), dense, known

slide-5
SLIDE 5

Sparse Triangular Solve

Use case: In direct methods for solving a sparse linear system Ax = b, A is first decomposed into LU, and LUx = b is then solved by calling two sparse triangular solves, Ly = b and Ux = y. In iterative solvers, an incomplete LU preconditioner uses sparse triangular solves in a similar way.

slide-6
SLIDE 6

Single core: Sequential method

Lx = b

A sequential method based on CSC layout

[Figure: a 16-row example matrix, rows 0 to 15.]
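A minimal Python sketch of a sequential solve on CSC storage, applied to the 4x4 example system from the earlier slides. The function name `sptrsv_csc`, the array names, and the assumption that every column stores its diagonal entry first are illustrative choices, not taken from the swSpTRSV code.

```python
def sptrsv_csc(n, col_ptr, row_idx, val, b):
    """Solve Lx = b for a lower triangular L stored in CSC.

    Assumption: the first stored entry of each column is its diagonal.
    """
    x = [0.0] * n
    left = list(b)  # running right-hand side, updated column by column
    for j in range(n):
        start, end = col_ptr[j], col_ptr[j + 1]
        x[j] = left[j] / val[start]          # x_j = b_j / l_jj
        for p in range(start + 1, end):      # off-diagonal entries l_ij
            left[row_idx[p]] -= val[p] * x[j]
    return x


# 4x4 example: x0 = a, x1 = b, 2*x1 + x2 = c, 3*x0 + x3 = d
col_ptr = [0, 2, 4, 5, 6]
row_idx = [0, 3, 1, 2, 2, 3]
val = [1.0, 3.0, 1.0, 2.0, 1.0, 1.0]
a, b_, c, d = 1.0, 2.0, 10.0, 20.0
x = sptrsv_csc(4, col_ptr, row_idx, val, [a, b_, c, d])
print(x)  # x2 = c - 2b and x3 = d - 3a
```

The column loop is inherently ordered: x_j must be final before column j is applied, which is the dependency that the parallel methods on the following slides have to respect.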

slide-7
SLIDE 7

A few cores: Level-set method

Parallel within each level, sequential between levels

[Figure: rows 0 to 15 of the example matrix grouped into Level 0 to Level 3 and distributed over Thread 0, Thread 1 and Thread 2.]
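A short sketch of how level sets can be built, assuming the strictly lower part of L is given row by row. The name `build_level_sets` and the input format are hypothetical; rows in the same level have no dependencies on each other and can be solved in parallel, while levels are processed one after another.

```python
def build_level_sets(n, lower_rows):
    """Group rows into levels.

    lower_rows[i] lists the columns j < i with a nonzero l_ij,
    i.e. the unknowns x_j that row i depends on.
    """
    level = [0] * n
    for i in range(n):
        for j in lower_rows[i]:
            # Row i must wait for every x_j it depends on.
            level[i] = max(level[i], level[j] + 1)
    sets = [[] for _ in range(max(level) + 1 if n else 0)]
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets


# 4x4 example from the earlier slides: row 2 uses x1, row 3 uses x0.
level_sets = build_level_sets(4, [[], [], [1], [0]])
print(level_sets)
```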

slide-13
SLIDE 13

More cores: P2P method (CPU/MIC)

  • No full synchronization
  • Only synchronize between Thread 0 and Thread 2

[Figure: the example partitioned into Level 0 to Level 3.]

Park J, et al. Sparsifying synchronization for high-performance shared-memory sparse triangular solver[C] International Supercomputing Conference. Springer, Cham, 2014: 124-140.

slide-14
SLIDE 14

More cores: Sync-free method (GPU)

  • Thread 0 and Thread 2 modify the same value by atomic operations.

[Figure: the example partitioned into Level 0 to Level 3.]

Liu W, et al. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves[C] European Conference on Parallel Processing. Springer International Publishing, 2016: 617-630.
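The idea can be illustrated with a sequential Python simulation, which is only a sketch and not the GPU algorithm from the cited paper. It assumes CSC storage with the diagonal first in each column; each unknown keeps a counter of pending updates, and the two lines marked below are the ones that would be atomic operations when several threads update the same entry.

```python
from collections import deque


def sptrsv_syncfree_sim(n, col_ptr, row_idx, val, b):
    """Sequential simulation of a counter-driven (sync-free style) solve."""
    pending = [0] * n  # number of off-diagonal nonzeros in each row
    for j in range(n):
        for p in range(col_ptr[j] + 1, col_ptr[j + 1]):
            pending[row_idx[p]] += 1
    left = list(b)
    x = [0.0] * n
    ready = deque(i for i in range(n) if pending[i] == 0)
    while ready:
        j = ready.popleft()
        x[j] = left[j] / val[col_ptr[j]]
        for p in range(col_ptr[j] + 1, col_ptr[j + 1]):
            i = row_idx[p]
            left[i] -= val[p] * x[j]  # atomic update on a GPU
            pending[i] -= 1           # atomic decrement on a GPU
            if pending[i] == 0:
                ready.append(i)       # row i has received all its updates
    return x


# 4x4 example from the earlier slides, with a=1, b=2, c=10, d=20.
x = sptrsv_syncfree_sim(4, [0, 2, 4, 5, 6], [0, 3, 1, 2, 2, 3],
                        [1.0, 3.0, 1.0, 2.0, 1.0, 1.0],
                        [1.0, 2.0, 10.0, 20.0])
print(x)
```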

slide-15
SLIDE 15

Background

Problem: Sparse Triangular Solve
Architecture: Sunway Processor

slide-16
SLIDE 16

Sunway TaihuLight: Overview

Entire System
  Peak Performance          125 PFlops
  Linpack Performance       93 PFlops / 74.4%
  Total Memory              1310.72 TB
  Total Memory Bandwidth    5591.45 TB/s
  # nodes                   40,960
  # cores                   10,649,600
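The figures above can be cross-checked with a few lines of arithmetic. This sketch assumes one SW26010 per node, 4 core groups per processor, 1 MPE plus 8*8 CPEs per core group, and decimal units (1 TB = 1000 GB); the variable names are illustrative.

```python
nodes = 40_960
core_groups_per_node = 4        # assumption: one SW26010 per node
cores_per_group = 1 + 8 * 8     # 1 MPE + 8*8 CPE mesh

total_cores = nodes * core_groups_per_node * cores_per_group
linpack_efficiency = 93 / 125                     # Linpack / peak
mem_per_node_gb = 1310.72 * 1000 / nodes          # TB -> GB, decimal
bw_per_cg_gbs = 5591.45 * 1000 / nodes / core_groups_per_node

print(total_cores, linpack_efficiency, mem_per_node_gb, bw_per_cg_gbs)
```

The per-core-group bandwidth obtained this way is about 34 GB/s, which matches the value quoted for a single CG in the experimental setup.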

slide-17
SLIDE 17

SW26010 Processor

[Figure: SW26010 block diagram. Four Core Groups (Core Group 0 to 3) are connected by a NoC; each Core Group has an MPE, an 8*8 CPE Mesh, a PPU, an iMC and Memory. Labels inside the 8*8 CPE Mesh: Row Communication Bus, Column Communication Bus, Control Network, Transfer Agent (TA), Data Transfer Network. Labels inside a CPE: Computing Core, Registers, LDM (SPM). Levels: Memory Level, LDM Level, Register Level, Computing Level.]

slide-19
SLIDE 19

SW26010 Processor

Direct Memory Access (DMA): 22.6 GB/s

slide-20
SLIDE 20

SW26010 Processor

Global Load/Store (Gload/Gstore): 1.5 GB/s

slide-21
SLIDE 21

Register Communication

[Figure: four CPEs, each with a Put buffer and two receive buffers, Get C and Get R.]

slide-22
SLIDE 22

Register Communication

putr <-> getr
putc <-> getc

[Figure: four CPEs, each with a Put buffer and two receive buffers, Get C and Get R.]

slide-23
SLIDE 23

Register Communication

// P2P Test
if (id % 2 == 0)
    while (1) putr(data, id + 1);   // even id: keep sending to CPE id + 1
else
    while (1) getr(&data);          // odd id: keep receiving

Latency: less than 11 cycles
Integrated Bandwidth: 637 GB/s

Xu, Zhigeng, James Lin, and Satoshi Matsuoka. "Benchmarking SW26010 Many-Core Processor." Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 2017.

slide-24
SLIDE 24

SW26010 Processor

  • Manual cache system (SPM)
  • Direct memory access (DMA)
  • Limited register communication
slide-25
SLIDE 25

Mismatch between SpTRSV and Sunway

  • Manual cache system
  • Direct memory access
  • Register communication

  • Branch code is needed to check whether an access hits or misses the cache;
  • The cost of the branch is high:
  • It costs a lot even on a cache hit
  • It hurts the instruction pipeline
  • Difficult to prefetch
slide-26
SLIDE 26

Mismatch between SpTRSV and Sunway

  • Manual cache system
  • Direct memory access
  • Register communication

Limitation of register communication: it can only happen within the same row or the same column

[Figure: CPE (0,0), CPE (0,1), CPE (1,1)]
slide-27
SLIDE 27

Mismatch between SpTRSV and Sunway

  • Manual cache system
  • Direct memory access
  • Register communication

Limitation of register communication: it can only happen within the same row or the same column

Communication cycle + Random communication size ≈ Dead-Lock

[Figure: a communication cycle among CPE (0,0), CPE (0,1), CPE (1,1) and CPE (1,0).]

Lin H, et al. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores[C] Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 2017: 635-645.

slide-28
SLIDE 28

Contributions

  • Sparse Level Tile (SLT) Layout:
    • Manual Cache System
    • Direct Memory Access (DMA)
  • Producer-Consumer pairing method:
    • Register Communication
slide-29
SLIDE 29

Contributions

  • Sparse Level Tile layout
    • Make sure all the computation is cache-hit;
    • Replace fine-grained, random and unprefetchable memory access with coarse-grained, predictable and prefetchable memory access;

  • Manual Cache System
  • Direct Memory Access (DMA)
slide-30
SLIDE 30
Contributions

  • Producer-Consumer pairing method:
    • Make sure that a communication cycle and a random communication size do not happen at the same time;

  • Register Communication

slide-31
SLIDE 31

Sparse Level Tile (SLT) Layout

Each Tile only targets a sub-vector of x and b

Memory access: fine-grained, random, unprefetchable -> coarse-grained, predictable, prefetchable

[Figure: the matrix divided into Tiles numbered 1 to 10; the solution vector x is split into X-Region 0 to 3 and the right-hand vector b into B-Region 0 to 3.]

slide-32
SLIDE 32

Sparse Level Tile (SLT) Layout: Step 1

Step 1:
  • Divide x and b into multiple Regions

Each off-diagonal nonzero l_{ij} contributes the update b_i = b_i - l_{ij} * x_j, using x_j from an X-Region to update b_i from a B-Region

[Figure: X-Region 0 to 3 and B-Region 0 to 3.]
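A small sketch of the bookkeeping behind Step 1, assuming equal-size contiguous regions so that the region of an index is simply index // region_size. The function name `group_by_region` and the (row, column, value) triplet format are hypothetical.

```python
from collections import defaultdict


def group_by_region(nonzeros, region_size):
    """Map each off-diagonal nonzero (i, j, l_ij) to (B-Region, X-Region).

    The update for l_ij reads x_j (X-Region j // region_size) and
    writes b_i (B-Region i // region_size).
    """
    groups = defaultdict(list)
    for i, j, v in nonzeros:
        if i != j:  # diagonal entries are handled separately
            groups[(i // region_size, j // region_size)].append((i, j, v))
    return dict(groups)


# 4x4 example from the earlier slides, with regions of 2 elements.
nonzeros = [(0, 0, 1.0), (1, 1, 1.0), (2, 1, 2.0),
            (2, 2, 1.0), (3, 0, 3.0), (3, 3, 1.0)]
groups = group_by_region(nonzeros, region_size=2)
print(groups)
```

Here both off-diagonal nonzeros read from X-Region 0 and write to B-Region 1, so one coarse-grained load of each region serves both updates.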

slide-33
SLIDE 33

Sparse Level Tile (SLT) Layout: Step 2

Step 2:
  • Split an original level into multiple levels if it crosses multiple X-Regions.

[Figure: original levels L0 to L3 split into new levels 1 to 6 over X-Region 0 to 3 and B-Region 0 to 3.]

slide-34
SLIDE 34

Sparse Level Tile (SLT) Layout: Step 3

Step 3:
  • Combine the diagonal nonzeros with their "nearest" off-diagonal nonzeros to form Diagonal-Tiles
    • The width is the number of diagonal nonzeros
    • The height is (width + Region Size)

Benefit:
  • Guarantee that the corresponding x elements and b elements are cached.

[Figure: levels 1 to 6 over X-Region 0 to 3 and B-Region 0 to 3.]

slide-35
SLIDE 35

Sparse Level Tile (SLT) Layout: Step 4

Step 4:
  • Combine the residual nonzeros to form Offdiagonal-Tiles
    • Both the width and the height are smaller than Region Size

Benefit:
  • Guarantee that the corresponding x elements and b elements are cached
  • Load x elements in a coarse-grained way, one X-Region at a time

[Figure: levels 1 to 6 over X-Region 0 to 3 and B-Region 0 to 3.]

slide-36
SLIDE 36

Sparse Level Tile (SLT) Layout: Step 5

Step 5:
  • Sort the Tiles;
    • Each Tile has an ID, which is the maximum B-Region that the Tile modifies;
    • Tiles with a smaller ID are stored in front of those with a bigger ID;
    • Diagonal-Tiles are stored in front of Offdiagonal-Tiles.

Benefit:
  • Increase the data reuse of b

[Figure: Tiles numbered 1 to 10 in storage order over B-Region 0 to 3 and X-Region 0 to 3.]
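The ordering rule of Step 5 can be sketched as a two-part sort key. The `Tile` record and its fields are hypothetical, and the sketch assumes that the "Diagonal-Tiles first" rule acts as a tie-break among Tiles with the same ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    name: str
    b_regions: tuple      # B-Regions that this Tile modifies
    is_diagonal: bool

    @property
    def tile_id(self):
        # ID = the maximum B-Region the Tile modifies
        return max(self.b_regions)


def sort_tiles(tiles):
    # Smaller ID first; within one ID, Diagonal-Tiles before Offdiagonal-Tiles.
    return sorted(tiles, key=lambda t: (t.tile_id, not t.is_diagonal))


tiles = [
    Tile("O0", (1,), False),
    Tile("D1", (2,), True),
    Tile("D0", (0, 1), True),
    Tile("O1", (0,), False),
]
order = [t.name for t in sort_tiles(tiles)]
print(order)
```

Tiles that touch the same B-Region end up next to each other in storage, so the cached b elements can be reused before they are written back.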

slide-37
SLIDE 37

The SpTRSV process based on SLT layout

[Figure: Tiles 1 to 10 processed in storage order. Legend entries: Inherit from previous Tile; Load from main memory; Obtained from; Used in; Store to main memory. Labels shown across the animation steps: b0 to b3 first, then b3 to b6 together with x0 to x2.]

slide-44
SLIDE 44

Sparse Level Tile (SLT) Layout

  • Sparse Level Tile layout
    • Make sure all the computation is cache-hit;
    • Replace fine-grained, random and unprefetchable memory access with coarse-grained, predictable and prefetchable memory access;

  • Manual Cache System
  • Direct Memory Access (DMA)
slide-45
SLIDE 45
Producer-Consumer pairing method

  • Producer-Consumer pairing method:
    • Make sure that a communication cycle and a random communication size do not happen at the same time;

  • Register Communication

slide-46
SLIDE 46

Producer-Consumer pairing method

Original computation:
  x_j = b_j / l_{jj};
  b_i = b_i - l_{ij} x_j

Split into three steps:
  x_j = b_j / l_{jj};
  \Delta_{ij} = l_{ij} x_j
  b_i = b_i - \Delta_{ij}

[Figure: the 8*8 CPE mesh, CPE 0,0 to CPE 7,7, divided into Consumers (8 rows, 4 columns) and Producers (8 rows, 4 columns).]

slide-47
SLIDE 47

Producer-Consumer pairing method

Producer-Consumer Pair

x_i = b_i / l_{ii}

[Figure: the 8*8 CPE mesh, CPE 0,0 to CPE 7,7, with 8 rows of Consumers and Producers; one Producer-Consumer Pair is highlighted.]

slide-48
SLIDE 48

Producer-Consumer pairing method

Send b from Consumers to Producers

x_j = b_j / l_{jj};

  • 1. Acyclic communication
  • 2. Pre-known communication size

[Figure: the 8*8 CPE mesh, CPE 0,0 to CPE 7,7.]

slide-49
SLIDE 49

Producer-Consumer pairing method

Share x across the same column

\Delta_{ij} = l_{ij} x_j

  • 1. Acyclic communication
  • 2. Pre-known communication size

[Figure: the 8*8 CPE mesh, CPE 0,0 to CPE 7,7.]

slide-50
SLIDE 50

Producer-Consumer pairing method

Send \Delta from Producers to Consumers

b_i = b_i - \Delta_{ij}

  • 1. Acyclic communication
  • 2. Pre-known communication size

[Figure: the 8*8 CPE mesh, CPE 0,0 to CPE 7,7.]
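The three steps on the preceding slides (send b to the Producers, compute x and the products there, send the products back) can be modelled sequentially in Python. This is only a sketch of the data flow: the function names are hypothetical, and plain function calls stand in for register communication between CPEs.

```python
def producer_step(b_j, l_jj, column_entries):
    """Producer side: x_j = b_j / l_jj, then delta_ij = l_ij * x_j."""
    x_j = b_j / l_jj
    deltas = [(i, l_ij * x_j) for i, l_ij in column_entries]
    return x_j, deltas


def consumer_step(b, deltas):
    """Consumer side: b_i = b_i - delta_ij for every received delta."""
    for i, delta in deltas:
        b[i] -= delta


def solve(diag, columns, b):
    """columns[j] holds the off-diagonal entries (i, l_ij) of column j."""
    b = list(b)                      # the Consumers own b
    x = [0.0] * len(diag)
    for j in range(len(diag)):
        # Consumer -> Producer: b_j; Producer -> Consumer: the deltas.
        x[j], deltas = producer_step(b[j], diag[j], columns[j])
        consumer_step(b, deltas)
    return x


# 4x4 example from the earlier slides, with a=1, b=2, c=10, d=20.
x = solve([1.0, 1.0, 1.0, 1.0],
          [[(3, 3.0)], [(2, 2.0)], [], []],
          [1.0, 2.0, 10.0, 20.0])
print(x)
```

Data moves in one direction per step and the number of deltas per column is known from the matrix structure, which is the sense in which communication is acyclic and has a pre-known size.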

slide-51
SLIDE 51

  • Sparse Level Tile layout
    • The cache always hits;
    • Coarse-grained memory access
  • Producer-Consumer pairing method:
    • Dead-lock free;
slide-52
SLIDE 52

Experimental Setup

  • 3 platforms

Testbeds
  A single CG of SW26010 @ 34 GB/s
  A single chip of NVIDIA K80 @ 240 GB/s
  An Intel Xeon Phi 7120 KNC @ 352 GB/s

slide-54
SLIDE 54

Experimental Setup

  • 3 platforms
  • 3+3+2 = 8 methods
  • 2057 benchmarks from the University of Florida Sparse Matrix Collection

Parallelism = #nonzeros / #levels

Testbeds                                   Methods
A single CG of SW26010 @ 34 GB/s           1. A sequential method on MPE
                                           2. A basic level-set method on CPEs
                                           3. swSpTRSV method on CPEs
A single chip of NVIDIA K80 @ 240 GB/s     1. The method from cuSPARSE v1
                                           2. The method from cuSPARSE v2
                                           3. The synchronization-free method [1]
An Intel Xeon Phi 7120 KNC @ 352 GB/s      1. The method from Intel MKL
                                           2. The P2P synchronization method [2]

[1] Liu W, et al. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves[C] European Conference on Parallel Processing. Springer International Publishing, 2016: 617-630.
[2] Park J, et al. Sparsifying synchronization for high-performance shared-memory sparse triangular solver[C] International Supercomputing Conference. Springer, Cham, 2014: 124-140.
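The parallelism metric defined on this slide can be computed directly from the matrix structure. The sketch below assumes a nonzero diagonal and takes the strictly lower part row by row; the function name `parallelism` is hypothetical.

```python
def parallelism(n, lower_rows):
    """Parallelism = #nonzeros / #levels for a lower triangular matrix.

    lower_rows[i] lists the columns j < i with a nonzero l_ij.
    Assumption: all n diagonal entries are nonzero.
    """
    level = [0] * n
    nnz = n  # diagonal entries
    for i in range(n):
        for j in lower_rows[i]:
            level[i] = max(level[i], level[j] + 1)
        nnz += len(lower_rows[i])
    return nnz / (max(level) + 1)


# 4x4 example from the earlier slides: nnzL = 6, two levels.
p = parallelism(4, [[], [], [1], [0]])
print(p)
```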

slide-55
SLIDE 55

Pre-processing cost

The cheapest cost is 0.3 ms, and the most expensive cost is 28.2 s. The harmonic average is 3.3 ms.
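A harmonic average is pulled toward the small values, which is why it can sit near a few milliseconds even when the largest cost is tens of seconds. The sketch below uses three hypothetical timings (only the smallest and largest match the slide; the middle one is invented for illustration) and Python's standard `statistics.harmonic_mean`.

```python
from statistics import harmonic_mean, mean

# Hypothetical pre-processing times in milliseconds, not measured data.
times_ms = [0.3, 3.0, 28200.0]

h = harmonic_mean(times_ms)   # n / sum(1 / t)
a = mean(times_ms)            # arithmetic mean, dominated by the largest value

print(h, a)
```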

slide-56
SLIDE 56

Methods on Sunway Processor

  • 1. Best performance: 3406 MFlops
  • 2. Compared with sequential method:
    • Average speedup: 7.8;
    • Best speedup: 117.3;
  • 3. Compared with level-set method:
    • Average speedup: 6.9;
    • Best speedup: 38.5;

[Figure: Performance (MFlops) versus Parallelism (nnz/levels), with the benchmarks divided into Group A, Group B, Group C and Group D.]

slide-57
SLIDE 57

The power of 20 typical benchmarks

The largest power is 38.18 Watt, and the best performance/power reaches 89.22 MFlops/W.

[Figure: Power (Watt), broken down into ddr idle, core idle, ddr and core, together with Performance/Power (MFlops/Watt) for each benchmark.]

slide-58
SLIDE 58

Different Methods on Different Processors

  • The number of benchmarks in each group on which each method achieves the best performance among all methods
  • swSpTRSV outperforms the MKL and P2P methods on KNC in 1856 benchmarks
  • swSpTRSV outperforms the cuSPARSE and SyncFree methods on K80 in 1672 benchmarks
  • swSpTRSV achieves the best performance in 1624 benchmarks

[Figure: results for Group A, Group B, Group C and Group D.]

slide-59
SLIDE 59

Conclusion

Method:
  • 1. Sparse Level Tile layout;
    • Manual Cache System
    • Direct Memory Access (DMA)
  • 2. Producer-Consumer pairing method;
    • Register Communication

Performance:
  • An average speedup of 7.8, compared with the sequential method on MPE;
  • An average speedup of 6.9, compared with the basic level-set method on CPEs;
  • Achieve the best performance in 1624/2057 (78.95%) benchmarks

slide-60
SLIDE 60

Code: https://github.com/clarencewxl/swSpTRSV.git

slide-61
SLIDE 61

Login: http://www.nsccwx.cn/wxcyw/

slide-62
SLIDE 62

Welcome to Wuxi ! Welcome to TaihuLight !

Thanks, Q&A