Sunway TaihuLight: Designing and Tuning Scientific Applications at the Scale of 10 Million Cores

Haohuan Fu, National Supercomputing Center in Wuxi


  1. Sunway TaihuLight: Designing and Tuning Scientific Applications at the Scale of 10 Million Cores. Haohuan Fu, National Supercomputing Center in Wuxi; Department of Earth System Science, Tsinghua University. September [date and venue not recoverable]

  2. Outline: Sunway Machine: the Challenges and Opportunities; Scientific Computing with 10 Million Cores; Long Term Plan for Sunway TaihuLight

  3. The Sunway Machine Family
     - Sunway-I: CMA service, 1998; commercial chip; 0.384 Tflops; 48th of TOP500
     - Sunway BlueLight: NSCC-Jinan, 2011; 16-core processor; 1 Pflops; 14th of TOP500
     - Sunway TaihuLight: NSCC-Wuxi, 2016; 260-core processor; 125 Pflops; 1st of TOP500

  4. SW26010: Sunway 260-Core Processor. [Figure: block diagram of the chip. Four core groups (Core Group 0 to Core Group 3), each with an MPE, an 8*8 CPE mesh, a PPU, and an iMC attached to memory, connected by the NoC. Inside a CPE mesh: computing cores, row and column communication buses, Transfer Agent (TA), data transfer network, control network. Levels shown: memory level, LDM level, register level, computing level.]

  5. High-Density Integration of the Computing System. A Five-Level Integration Hierarchy: computing node, computing board, super node, cabinet, entire computing system


  10. How to Connect the 10 Million Cores?
     - 2D core array with row and column buses
     - Network on Chip
     - Customized Network Board to Fully Connect 256 Nodes
     - Sunway Net

  15. Tweet Comments from Prof. Satoshi Matsuoka (label on the final build: Sunway Micro)

  20. Outline: Sunway Machine: the Challenges and Opportunities; Scientific Computing with 10 Million Cores; Long Term Plan for Sunway TaihuLight

  21. Machine Capability Comparison. [Figure: two radar charts comparing TaihuLight, Tianhe-2, Titan, and K Computer. Axes: Peak Performance, Linpack, communication bandwidth, Memory Size, memory bandwidth, HPCG, hpgmg, Graph, Gflops/Watt, Tflops/m^3; radial scale 0 to 3.]

  22. Major Features to Consider (Sunway TaihuLight)
     - 125 Pflops
     - 10 million cores
     - 32 GB and 136 GB/s per node
     - 22 flops/byte (Intel KNL 7250 of Cori: 6.5 flops/byte; NVIDIA P100 of Piz Daint: 7.2 flops/byte)
     - MPE + CPE
     - user-controlled 64 KB LDM
     - register communication among CPEs

  24. Major Challenge #1: Scaling (shown against the same feature list)

  25. Major Challenge #2: Memory Wall (shown against the same feature list). Refactoring and Redesigning
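
The 22 flops/byte figure on the Major Features slide can be checked from the other numbers quoted there. The sketch below is a back-of-the-envelope calculation, not something taken from the slides: it assumes the machine has 40,960 nodes (a node count that does not appear on this slide) and that the 125 Pflops peak is spread evenly across them. The function name is made up for illustration.

```python
# Figures quoted on the slide.
PEAK_FLOPS = 125e15       # 125 Pflops for the whole machine
NODE_BANDWIDTH = 136e9    # 136 GB/s of memory bandwidth per node

# Assumption (not on this slide): 40,960 nodes sharing the peak evenly.
NUM_NODES = 40_960


def flops_per_byte(peak_flops: float, nodes: int, bytes_per_second: float) -> float:
    """Peak flops a node can issue per byte it can move to or from memory."""
    return peak_flops / nodes / bytes_per_second


ratio = flops_per_byte(PEAK_FLOPS, NUM_NODES, NODE_BANDWIDTH)
print(f"TaihuLight: {ratio:.1f} flops/byte")            # about 22
print(f"vs KNL 7250 (6.5):  {ratio / 6.5:.1f}x higher")
print(f"vs P100 (7.2):      {ratio / 7.2:.1f}x higher")
```

Under these assumptions the ratio comes out roughly three times higher than the two comparison processors, which is one way to read the "Memory Wall" challenge: a kernel needs about three times more arithmetic per byte fetched before it stops being bandwidth-bound.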

  27. An Incomplete List of Full-Scale Applications (columns: 2016 and 2017; callouts: 2016 Gordon Bell Prize, 2017 Gordon Bell Finalists)
     - Fully Implicit Solver for Atmospheric Dynamics
     - Extreme-scale Graph Processing Framework
     - Surface Wave Modeling
     - Simulation of Planetary Rings
     - Phase Field Simulations of Coarsening Dynamics
     - Simulations of Quantum Spin Liquid States via PEPS++
     - Atomistic Simulation of Silicon Nanowires
     - Molecular Dynamics Simulation of Condensed Covalent Materials
     - Run-away Electron Trajectory Simulation
     - cryo-EM Macromolecule Structure Determination
     - Redesigning CAM-SE
     - Genome Functional Annotation and Homeotic Gene Building
     - Spacecraft CFD Numerical Simulation
     - Nonlinear Earthquake Simulation

  31. Application I: Implicit Solver for Atmospheric Dynamics. 163,840 processes, 65 threads. [Figure labels: racks, chips, core-groups, cores, total number of cores.] DD-MG K-cycle, uniform DD. "Now let's find a very shallow way to design a plug & play subdomain solver."
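
The process and thread counts on this slide line up with the core counts quoted elsewhere in the deck. The sketch below is only an illustration under stated assumptions: one process per core group, four core groups per chip (as in the SW26010 diagram), and 65 threads read as 1 MPE plus an 8*8 CPE mesh. None of these mappings is stated explicitly on the slide.

```python
# Quoted on the slide.
PROCESSES = 163_840
THREADS_PER_PROCESS = 65

# Assumptions drawn from the SW26010 diagram, not from this slide:
CORE_GROUPS_PER_CHIP = 4          # Core Group 0 to Core Group 3
CPES_PER_CORE_GROUP = 8 * 8       # one 8*8 CPE mesh per core group
MPES_PER_CORE_GROUP = 1           # one MPE per core group

# 65 threads per process matches 1 MPE + 64 CPEs in one core group.
cores_per_core_group = MPES_PER_CORE_GROUP + CPES_PER_CORE_GROUP
assert cores_per_core_group == THREADS_PER_PROCESS

total_cores = PROCESSES * THREADS_PER_PROCESS
chips = PROCESSES // CORE_GROUPS_PER_CHIP
cores_per_chip = CORE_GROUPS_PER_CHIP * cores_per_core_group

print(f"total cores:    {total_cores:,}")   # 10,649,600
print(f"chips:          {chips:,}")         # 40,960
print(f"cores per chip: {cores_per_chip}")  # 260
```

The result agrees with the "260-core processor" and "10 million cores" figures on the earlier slides, which suggests the full-machine run used every core group.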

  32. Geometry-based pipelined ILU (GP-ILU). 163,840 processes, 65 threads; DD-MG K-cycle. Our goal of design: 1. Single sweep 2. Synchronization-free 3. Improved data-locality. [Figure: two-level pipeline across 8×8 core meshes; subdomain matrix of 1st-order with geometric index; axes X, Y, Z; labels blk_height, cell_size, reg_size, dim_z, num_cores; condition fragment "num_cores - 1 + blk_height < dim_z"; synchronization avoiding.]

  33. Strong-scaling results. [Figure: parallel efficiency (0% to 100%) versus total number of cores (1M to 11M); data labels 67%, 45%, and 33% (GB'15).]

  34. Weak-scaling results. [Figure: SYPD (0.00125 to 0.16) versus total number of cores (0.33 M to 10.64 M) for the implicit and explicit solvers; resolutions 2.480, 1.389, 0.920, 0.620, 0.488 km; DOFs=772B; labels 7.95 DP-PF, 23.66 DP-PF, "Exa-scale", 34X for exp, 89.5X.] The 488-m res run: 0.07 SYPD, 10.6M cores, dt=240s, 89.5X speedup over explicit
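
To make the 488-m figures more concrete, the sketch below converts SYPD (simulated years per wall-clock day) and the time step into wall-clock seconds per step. It assumes a 365-day model year, which the slide does not specify, and the function name is hypothetical. The explicit-solver SYPD is derived only by dividing by the quoted 89.5X speedup.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption: the slide does not say which calendar is used


def wall_seconds_per_step(sypd: float, dt_seconds: float) -> float:
    """Wall-clock seconds spent on one time step at a given SYPD and step size."""
    steps_per_simulated_year = DAYS_PER_YEAR * SECONDS_PER_DAY / dt_seconds
    steps_per_wall_day = sypd * steps_per_simulated_year
    return SECONDS_PER_DAY / steps_per_wall_day


# Figures quoted on the slide for the 488-m run.
implicit_sypd = 0.07
dt = 240.0
speedup_over_explicit = 89.5

seconds_per_step = wall_seconds_per_step(implicit_sypd, dt)
explicit_sypd = implicit_sypd / speedup_over_explicit

print(f"implicit: {seconds_per_step:.2f} wall-clock s per 240 s step")
print(f"explicit: {explicit_sypd:.5f} SYPD implied by the 89.5X speedup")
```

Under these assumptions each 240-second implicit step costs a little over nine seconds of wall-clock time on 10.6M cores.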
