
swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture

Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu


  1. swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture
     Xinliang Wang (1,3), Weifeng Liu (2), Wei Xue (1,3), Li Wu (1,3)

  2. Outline
     1. Background
     2. Sunway architecture
     3. Sparse Level Tile layout
     4. Producer-Consumer Pairing method
     5. Experiment
     6. Conclusion
     clarencewxl@gmail.com

  3. Sparse Triangular Solve
     Example: $Lx = b$. Compute a solution vector $x$ from the sparse linear system, where $L$ is a square lower triangular sparse matrix and $b$ is the right-hand-side vector.
     System:
     $$\begin{aligned} 1 \cdot x_0 &= a \\ 1 \cdot x_1 &= b \\ 2 \cdot x_1 + 1 \cdot x_2 &= c \\ 3 \cdot x_0 + 1 \cdot x_3 &= d \end{aligned}$$
     Solution:
     $$x_0 = a, \quad x_1 = b, \quad x_2 = c - 2b, \quad x_3 = d - 3a$$

  4. Sparse Triangular Solve
     Same example and system as slide 3, now written in matrix form:
     $$\underbrace{\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 2 & 1 & 0 \\ 3 & 0 & 0 & 1 \end{bmatrix}}_{L\ (4 \times 4)} \underbrace{\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}}_{x\ (4 \times 1)} = \underbrace{\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}}_{b\ (4 \times 1)}$$
     $L$: sparse, nnzL = 6, known. $x$: dense, unknown. $b$: dense, known.
     Solution: $x_0 = a$, $x_1 = b$, $x_2 = c - 2b$, $x_3 = d - 3a$.

  5. Sparse Triangular Solve
     Same example and matrix form as slides 3 and 4, with the use case added.
     Use case:
     • In direct methods for solving a sparse linear system $Ax = b$, $A$ can first be decomposed into $LU$, and the system is then solved as $LUx = b$. This is done by calling two sparse triangular solves, $Ly = b$ and $Ux = y$.
     • In iterative solvers, an incomplete LU preconditioner uses sparse triangular solves in a similar way.
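The example and the use case can be written in general form. The following is a sketch in standard notation; the entry names $l_{ij}$ and the size $n$ are notation assumed here, not taken from the slides.

```latex
% Forward substitution for a lower triangular system L x = b
x_i = \frac{1}{l_{ii}} \Bigl( b_i - \sum_{j < i} l_{ij}\, x_j \Bigr),
\qquad i = 0, 1, \dots, n-1.

% Applied to the 4x4 example:
x_2 = \frac{c - 2\,x_1}{1} = c - 2b, \qquad
x_3 = \frac{d - 3\,x_0}{1} = d - 3a.

% Direct method: with A = LU, solving A x = b becomes two triangular solves
A x = b, \quad A = LU
\;\Longrightarrow\;
L y = b \ \text{(forward substitution)}, \qquad
U x = y \ \text{(backward substitution)}.
```

Each $x_i$ depends only on components with a smaller index that have a nonzero $l_{ij}$, which is the dependency structure the parallel methods on the later slides exploit.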

  6. Single core: Sequential method
     A sequential method based on the CSC layout, solving $Lx = b$.
     [Figure: lower triangular matrix with rows 0 to 15.]
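A minimal sketch of what a sequential CSC-based solve can look like. The slide shows no code, so the function name `sptrsv_csc`, its parameters, and the assumption that the diagonal entry is stored first in each column are all illustrative choices, not the authors' implementation.

```c
/* Sequential sparse triangular solve L x = b with L in CSC format.
 * Assumptions (not from the slides): L is lower triangular with a
 * nonzero diagonal, and the diagonal entry is the first stored entry
 * of every column (row indices sorted within a column). */
void sptrsv_csc(int n,
                const int *col_ptr,   /* size n + 1 */
                const int *row_idx,   /* size nnz   */
                const double *val,    /* size nnz   */
                const double *b,
                double *x)
{
    /* Start from the right-hand side; contributions are subtracted in place. */
    for (int i = 0; i < n; i++)
        x[i] = b[i];

    for (int j = 0; j < n; j++) {
        int start = col_ptr[j];

        /* All contributions to x[j] come from columns < j, so it is final
         * after dividing by the diagonal entry. */
        x[j] /= val[start];

        /* Push the contribution of x[j] down column j. */
        for (int k = start + 1; k < col_ptr[j + 1]; k++)
            x[row_idx[k]] -= val[k] * x[j];
    }
}
```

For the 4x4 example from the earlier slides, the CSC arrays would be `col_ptr = {0, 2, 4, 5, 6}`, `row_idx = {0, 3, 1, 2, 2, 3}` and `val = {1, 3, 1, 2, 1, 1}`, which matches nnzL = 6.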

  7. A few cores: Level-set method
     Parallel within each level, sequential between levels.
     [Figure: rows 0 to 15 of the matrix distributed over Thread 0, Thread 1 and Thread 2.]

  8. A few cores: Level-set method
     Parallel within each level, sequential between levels.
     [Figure: rows 0 to 15 of the matrix grouped into Level 0, Level 1, Level 2 and Level 3, and distributed over Thread 0, Thread 1 and Thread 2.]

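The level-set idea can be sketched in C as follows. This is an illustrative, single-threaded sketch under assumptions: the names `compute_levels_csc` and `sptrsv_levelset` are hypothetical, the matrix is in CSC format with the diagonal stored first in each column, and the inner loop over a level is written sequentially where a real implementation would distribute it over threads.

```c
/* Assign each unknown to a level: level[i] = 0 if x[i] depends on nothing,
 * otherwise 1 + the largest level among its dependencies.
 * Returns the number of levels. */
int compute_levels_csc(int n, const int *col_ptr, const int *row_idx,
                       int *level)
{
    int num_levels = 0;
    for (int i = 0; i < n; i++)
        level[i] = 0;

    for (int j = 0; j < n; j++) {
        /* level[j] is final here: every column i < j was already visited. */
        if (level[j] + 1 > num_levels)
            num_levels = level[j] + 1;
        for (int k = col_ptr[j] + 1; k < col_ptr[j + 1]; k++) {
            int r = row_idx[k];
            if (level[r] < level[j] + 1)
                level[r] = level[j] + 1;
        }
    }
    return num_levels;
}

/* Solve L x = b level by level. Unknowns inside one level do not depend
 * on each other; the outer loop over levels must stay sequential. */
void sptrsv_levelset(int n, const int *col_ptr, const int *row_idx,
                     const double *val, const double *b, double *x,
                     const int *level, int num_levels)
{
    for (int i = 0; i < n; i++)
        x[i] = b[i];

    for (int l = 0; l < num_levels; l++) {
        /* Scanning all n columns per level keeps the sketch short;
         * a real code would store the members of each level. */
        for (int j = 0; j < n; j++) {
            if (level[j] != l)
                continue;
            x[j] /= val[col_ptr[j]];
            for (int k = col_ptr[j] + 1; k < col_ptr[j + 1]; k++)
                x[row_idx[k]] -= val[k] * x[j];
        }
    }
}
```

With this column-oriented update, two columns of the same level may subtract from the same `x[r]`, so a threaded version would need atomic updates or a row-oriented (CSR) formulation. For the 4x4 example, the levels come out as `{0, 0, 1, 1}`.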

  13. More cores: P2P method (CPU/MIC)
      • No full synchronization
      • Only synchronize between Thread 0 and Thread 2
      [Figure: dependency graph across Level 0, Level 1, Level 2 and Level 3.]
      Park J, et al. Sparsifying synchronization for high-performance shared-memory sparse triangular solver[C]. International Supercomputing Conference. Springer, Cham, 2014: 124-140.

  14. More cores: Sync-free method (GPU)
      • Thread 0 and Thread 2 modify the same value by atomic operations.
      [Figure: dependency graph across Level 0, Level 1, Level 2 and Level 3.]
      Liu W, et al. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves[C]. European Conference on Parallel Processing. Springer International Publishing, 2016: 617-630.
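The dependency-counting idea behind such synchronization-free schemes can be illustrated with a single-threaded emulation. This is a sketch under assumptions and not the GPU kernel from the cited paper: the name `sptrsv_counting` is hypothetical, the matrix is in CSC format with the diagonal first in each column, and a ready queue stands in for the busy-waiting threads. In a parallel version, the updates to `left_sum` and `in_degree` are the ones that would have to be atomic.

```c
#include <stdlib.h>

/* Solve L x = b by tracking, for every unknown, how many dependencies are
 * still unresolved. Returns 0 on success, -1 on allocation failure or if
 * not every unknown could be resolved. Assumes n > 0. */
int sptrsv_counting(int n, const int *col_ptr, const int *row_idx,
                    const double *val, const double *b, double *x)
{
    int *in_degree = calloc((size_t)n, sizeof *in_degree);
    double *left_sum = calloc((size_t)n, sizeof *left_sum);
    int *ready = malloc((size_t)n * sizeof *ready);
    int head = 0, tail = 0;

    if (!in_degree || !left_sum || !ready) {
        free(in_degree); free(left_sum); free(ready);
        return -1;
    }

    /* in_degree[r] = number of off-diagonal entries in row r. */
    for (int j = 0; j < n; j++)
        for (int k = col_ptr[j] + 1; k < col_ptr[j + 1]; k++)
            in_degree[row_idx[k]]++;

    /* Unknowns without dependencies can be solved immediately. */
    for (int i = 0; i < n; i++)
        if (in_degree[i] == 0)
            ready[tail++] = i;

    while (head < tail) {
        int j = ready[head++];
        x[j] = (b[j] - left_sum[j]) / val[col_ptr[j]];

        for (int k = col_ptr[j] + 1; k < col_ptr[j + 1]; k++) {
            int r = row_idx[k];
            left_sum[r] += val[k] * x[j];   /* atomic add in a parallel version */
            if (--in_degree[r] == 0)        /* atomic decrement in a parallel version */
                ready[tail++] = r;
        }
    }

    free(in_degree); free(left_sum); free(ready);
    return tail == n ? 0 : -1;
}
```

No level information and no barrier between levels is needed here: an unknown becomes solvable as soon as its own counter reaches zero.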

  15. Background
      Problem: Sparse Triangular Solve
      Architecture: Sunway Processor
      [Figure: lower triangular matrix with rows 0 to 15.]

  16. Sunway TaihuLight: Overview
      Entire system
      Peak performance:         125 PFlops
      Linpack performance:      93 PFlops / 74.4%
      Total memory:             1310.72 TB
      Total memory bandwidth:   5591.45 TB/s
      Number of nodes:          40,960
      Number of cores:          10,649,600
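The figures on this slide are consistent with each other, as a short calculation shows. The per-node values below are derived here from the slide's totals (taking 1 TB = 1000 GB) and are not stated on the slide.

```latex
% Linpack efficiency
\frac{93\ \text{PFlops}}{125\ \text{PFlops}} = 0.744 = 74.4\%

% Cores per node: four core groups, each with an 8 x 8 CPE mesh and one MPE
\frac{10{,}649{,}600\ \text{cores}}{40{,}960\ \text{nodes}} = 260 = 4 \times (8 \times 8 + 1)

% Memory and memory bandwidth per node
\frac{1310.72\ \text{TB}}{40{,}960} = 32\ \text{GB}, \qquad
\frac{5591.45\ \text{TB/s}}{40{,}960} \approx 136.51\ \text{GB/s}
```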

  17. SW26010 Processor
      [Figure: processor block diagram. Four core groups (Core Group 0 to Core Group 3) are connected by a NoC. Each core group has an MPE, an 8*8 CPE mesh, a PPU, and an iMC attached to its own memory. Each CPE contains a computing core, registers, LDM (SPM), and a Transfer Agent (TA), and is attached to a row communication bus, a column communication bus, a control network and a data transfer network. The hierarchy is labeled Memory Level, LDM Level, Register Level and Computing Level.]

  19. SW26010 Processor
      Direct Memory Access (DMA): 22.6 GB/s
      [Figure: same processor block diagram as slide 17.]

  20. SW26010 Processor
      Global Load/Store (Gload/Gstore): 1.5 GB/s
      [Figure: same processor block diagram as slide 17.]

  21. Register Communication
      [Figure: CPEs, each with Get C, Get R and Put buffers.]

  22. Register Communication
      putr → getr
      putc → getc
      [Figure: CPEs, each with Get C, Get R and Put buffers.]

  23. Register Communication
      // P2P test: even-numbered CPEs send, odd-numbered CPEs receive
      if (id % 2 == 0)
          while (1) putr(data, id + 1);   // row-bus send to destination id + 1
      else
          while (1) getr(&data);          // row-bus receive
      Latency: less than 11 cycles
      Integrated bandwidth: 637 GB/s
      [Figure: CPEs, each with Get C, Get R and Put buffers.]
      Xu, Zhigeng, James Lin, and Satoshi Matsuoka. "Benchmarking SW26010 Many-Core Processor." Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 2017.

  24. SW26010 Processor
      • Manual cache system (SPM)
      • Direct memory access (DMA)
      • Limited register communication

  25. Mismatch between SpTRSV and Sunway
      Sunway features:
      • Manual cache system
      • Direct memory access
      • Register communication
      Issues noted on the slide:
      • Branch code is needed to check whether an access is a cache miss
      • The cost of the branch is high
      • The check is costly even on a cache hit
      • It hurts the instruction pipeline
      • Prefetching is difficult

  26. Mismatch between SpTRSV and Sunway
      Sunway features:
      • Manual cache system
      • Direct memory access
      • Register communication
      Limitation of register communication: it can only happen within the same column or row.
      [Figure: CPE (0,0), CPE (0,1) and CPE (1,1).]

  27. Mismatch between SpTRSV and Sunway
      Sunway features:
      • Manual cache system
      • Direct memory access
      • Register communication
      Limitation of register communication: it can only happen within the same column or row.
      Communication cycle + random communication size ≈ deadlock
      [Figure: CPE (0,0), CPE (0,1), CPE (1,0) and CPE (1,1) forming a communication cycle.]
      Lin H, et al. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores[C]. Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 2017: 635-645.
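The row/column restriction can be made concrete with a small helper. This is a sketch under assumptions: the row-major numbering `id = row * 8 + col` and the names `can_reg_comm` and `relay_cpe` are hypothetical and serve only to illustrate the constraint on an 8*8 mesh.

```c
#include <stdbool.h>

#define MESH_DIM 8   /* 8*8 CPE mesh per core group, as on the slides */

/* Assumed numbering (not from the slides): id = row * MESH_DIM + col. */
int cpe_row(int id) { return id / MESH_DIM; }
int cpe_col(int id) { return id % MESH_DIM; }

/* Register communication is only possible between two different CPEs
 * that share a row or a column. */
bool can_reg_comm(int src, int dst)
{
    return src != dst &&
           (cpe_row(src) == cpe_row(dst) || cpe_col(src) == cpe_col(dst));
}

/* One possible relay for a pair that shares neither row nor column:
 * the CPE in the row of src and the column of dst. */
int relay_cpe(int src, int dst)
{
    return cpe_row(src) * MESH_DIM + cpe_col(dst);
}
```

With this numbering, CPE (0,0) is id 0, CPE (0,1) is id 1 and CPE (1,1) is id 9: `can_reg_comm(0, 1)` holds, `can_reg_comm(0, 9)` does not, and `relay_cpe(0, 9)` gives id 1, that is CPE (0,1). Relaying like this is only an illustration; as the slide warns, cyclic communication patterns with unpredictable message sizes can deadlock.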
