Sunway TaihuLight: Designing and Tuning Scientific Applications at the Scale of 10 Million Cores

Haohuan Fu, National Supercomputing Center in Wuxi


  1. Sunway TaihuLight: Designing and Tuning Scientific Applications at the Scale of 10 Million Cores. Haohuan Fu, National Supercomputing Center in Wuxi; Department of Earth System Science, Tsinghua University. September [rest of date line illegible]

  2. Outline. Sunway Machine: the Challenges and Opportunities; Scientific Computing with 10 Million Cores; Long Term Plan for Sunway TaihuLight.

  3. The Sunway Machine Family. Sunway-I: CMA service, 1998; commercial chip; 0.384 Tflops; 48th of TOP500. Sunway BlueLight: NSCC-Jinan, 2011; 16-core processor; 1 Pflops; 14th of TOP500. Sunway TaihuLight: NSCC-Wuxi, 2016; 260-core processor; 125 Pflops; 1st of TOP500.

  4. SW26010: Sunway 260-Core Processor. [Figure: four core groups (Core Group 0 to 3) joined by a NoC, each with an MPE, an 8*8 CPE mesh, a PPU, and an iMC attached to memory. CPE detail labels: computing core, control network, row communication bus, column communication bus, registers, LDM, data transfer network, transfer agent (TA). Hierarchy labels: memory level, LDM level, register level, computing level.]

  5-9. High-Density Integration of the Computing System. A five-level integration hierarchy: computing node; computing board; super node; cabinet; entire computing system.

  10-14. How to Connect the 10 Million Cores? 2D core array with row and column buses; Network on Chip; Customized Network Board to Fully Connect 256 Nodes; Sunway Net.

  15-19. Tweet Comments from Prof. Satoshi Matsuoka (slide 19 adds the label "Sunway Micro").

  20. Outline. Sunway Machine: the Challenges and Opportunities; Scientific Computing with 10 Million Cores; Long Term Plan for Sunway TaihuLight.

  21. Machine Capability Comparison. [Figure: two panels comparing TaihuLight, Tianhe-2, Titan, and K Computer on a scale of 0 to 3. Metrics: Peak Performance, Linpack, HPCG, hpgmg, Graph, Memory Size, memory bandwidth, communication bandwidth, Gflops/Watt, Tflops/m^3.]

  22-23. Major Features to Consider. Sunway TaihuLight: 125 Pflops; 10 million cores; MPE + CPE; user-controlled 64 KB LDM; register communication among CPEs; 32 GB and 136 GB/s per node; 22 flops/byte. For comparison (slide 23): Intel KNL 7250 of Cori: 6.5 flops/byte; NVIDIA P100 of Piz Daint: 7.2 flops/byte.

  24. Major Challenge 1: Scaling. Sunway TaihuLight: 125 Pflops; 10 million cores; MPE + CPE; user-controlled 64 KB LDM; register communication among CPEs; 32 GB and 136 GB/s per node; 22 flops/byte.

  25-26. Major Challenge 2: Memory Wall. Sunway TaihuLight: 125 Pflops; 10 million cores; MPE + CPE; user-controlled 64 KB LDM; register communication among CPEs; 32 GB and 136 GB/s per node; 22 flops/byte. Refactoring and Redesigning (added on slide 26).

  27-30. An Incomplete List of Full-Scale Applications (the builds add the highlight labels "2016 Gordon Bell Prize" and "2017 Gordon Bell Finalists"). Fully Implicit Solver for Atmospheric Dynamics; Extreme-scale Graph Processing Framework; Surface Wave Modeling; Simulation of Planetary Rings; Simulations of Quantum Spin Liquid States via PEPS++; Phase Field Simulations of Coarsening Dynamics; Molecular Dynamics Simulation of Condensed Covalent Materials; Atomistic Simulation of Silicon Nanowires; Run-away Electron Trajectory Simulation; cryo-EM Macromolecule Structure Determination; Genome Functional Annotation and Homeotic Gene Building; Redesigning CAM-SE; Spacecraft CFD Numerical Simulation; Nonlinear Earthquake Simulation.

  31. Application I: Implicit Solver for Atmospheric Dynamics. 163,840 processes × 65 threads (total number of cores); hierarchy: racks, chips, core-groups, cores. DD-MG K-cycle with uniform DD. "Now let's find a very shallow way to design a plug & play subdomain solver."

  32. Geometry-based pipelined ILU (GP-ILU) inside the DD-MG K-cycle (163,840 processes × 65 threads). Our goal of design: 1. Single sweep; 2. Synchronization-free; 3. Improved data-locality. Two-level pipeline; subdomain matrix of 1st-order with geometric index; synchronization avoiding. [Figure labels: 8×8, 1×1, X, Y, Z, blk_height, reg_size, cell_size; constraint: num_cores - 1 + blk_height < dim_z.]

  33. Strong-scaling results. [Figure: parallel efficiency (0% to 100%) versus total number of cores (1M to 11M); annotated values 67%, 45%, and 33% (GB'15). Footnote illegible in the extraction.]

  34. Weak-scaling results. [Figure: SYPD (axis ticks 0.00125 to 0.16) versus total number of cores (0.33M, 0.67M, 1.33M, 2.66M, 5.32M, 10.64M) for the implicit and explicit solvers; DOFs=772B; resolution (km): 2.480, 1.389, 0.920, 0.620, 0.488; annotations: 7.95 DP-PF, 23.66 DP-PF, "Exa-scale", 34X for exp, 89.5X.] The 488-m res run: 0.07 SYPD, 10.6M cores, dt=240s, 89.5X speedup over explicit.
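
Several figures quoted on the slides can be cross-checked with simple arithmetic. The Python sketch below is not from the talk; it is a back-of-envelope check under stated assumptions: that each of the 163,840 processes maps to one core group, that a node holds one processor with four core groups, that SYPD means simulated years per wall-clock day, and that a model year has 365 days. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption: a 365-day model year


def total_cores(processes: int, threads_per_process: int) -> int:
    """Cores in use when every process drives the same number of threads."""
    return processes * threads_per_process


def flops_per_byte(peak_flops: float, nodes: int, bytes_per_s_per_node: float) -> float:
    """Peak floating-point rate of one node divided by its memory bandwidth."""
    return peak_flops / nodes / bytes_per_s_per_node


def wall_seconds_per_step(sypd: float, dt_seconds: float) -> float:
    """Wall-clock seconds spent per time step at a given SYPD and step size."""
    steps_per_simulated_year = DAYS_PER_YEAR * SECONDS_PER_DAY / dt_seconds
    steps_per_wall_day = sypd * steps_per_simulated_year
    return SECONDS_PER_DAY / steps_per_wall_day


if __name__ == "__main__":
    processes = 163_840            # quoted on the slides
    cores = total_cores(processes, 65)

    nodes = processes // 4         # assumption: 4 core groups per node
    ratio = flops_per_byte(125e15, nodes, 136e9)  # 125 Pflops, 136 GB/s per node

    step = wall_seconds_per_step(0.07, 240)       # 0.07 SYPD, dt = 240 s

    print(f"total cores:            {cores:,}")
    print(f"flops per byte (node):  {ratio:.1f}")
    print(f"wall seconds per step:  {step:.1f}")
```

Under these assumptions the core count comes to 10,649,600, matching the 10.64M label on the weak-scaling plot, and the per-node ratio comes to roughly 22 flops/byte, in line with the value quoted on the features slide and about three times the 6.5 and 7.2 flops/byte quoted for the KNL and P100 systems. The 488-m run works out to roughly 9.4 wall-clock seconds per 240-second time step.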
