Sunway TaihuLight: Designing and Tuning Scientific Applications at the Scale of 10 Million Cores
Haohuan Fu
National Supercomputing Center in Wuxi
Department of Earth System Science, Tsinghua University
Outline
- Sunway Machine: the Challenges and Opportunities
- Scientific Computing with 10 Million Cores
- Long Term Plan for Sunway TaihuLight
The Sunway Machine Family
- Sunway-I: CMA service, 1998; commercial chip; 0.384 Tflops; 48th of TOP500
- Sunway BlueLight: NSCC-Jinan, 2011; 16-core processor; 1 Pflops; 14th of TOP500
- Sunway TaihuLight: NSCC-Wuxi, 2016; 260-core processor; 125 Pflops; 1st of TOP500
SW26010: Sunway 260-Core Processor
[Figure: block diagram of the processor. Four core groups (Core Group 0 to Core Group 3) are connected by a network on chip (NoC). Each core group has its own memory and memory controller (iMC), one MPE, one PPU, and an 8*8 CPE mesh. Inside the mesh, each computing core has registers and an LDM; the cores are linked by row and column communication buses, a control network, a data transfer network, and a transfer agent (TA). The diagram marks four levels: memory level, LDM level, register level, and computing level.]
High-Density Integration of the Computing System
A Five-Level Integration Hierarchy
- computing node
- computing board
- super node
- cabinet
- entire computing system
How to Connect the 10 Million Cores?
- 2D core array with row and column buses
- Network on Chip
- Customized Network Board to Fully Connect 256 Nodes
- Sunway Net
Tweet Comments from Prof. Satoshi Matsuoka
- Sunway Micro
Machine Capability Comparison
[Figure: two radar charts comparing TaihuLight, Tianhe-2, Titan, and the K Computer on a scale of 0 to 3. Axes: Peak Performance, Linpack, communication bandwidth, Memory Size, hpgmg, Graph, memory bandwidth, Gflops/Watt, HPCG, Tflops/m^3.]
Major Features to Consider
Sunway TaihuLight:
- 10 million cores, 125 Pflops
- MPE + CPE
- user-controlled 64 KB LDM
- register communication among CPEs
- 32 GB and 136 GB/s per node
- 22 flops/byte
For comparison:
- Intel KNL 7250 of Cori: 6.5 flops/byte
- NVIDIA P100 of Piz Daint: 7.2 flops/byte
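The 22 flops/byte figure can be roughly reproduced from other numbers in the slides. The sketch below is only a back-of-the-envelope check, not the speaker's calculation: it assumes one 260-core processor per node, takes the 10.64 M total core count from the weak-scaling slide, and divides the per-node share of the 125 Pflops peak by the 136 GB/s memory bandwidth. The function and variable names are hypothetical.

```python
def flops_per_byte(peak_flops, total_cores, cores_per_node, mem_bw_bytes_per_s):
    """Ratio of per-node peak compute to per-node memory bandwidth."""
    nodes = total_cores / cores_per_node   # assumption: one processor per node
    node_peak = peak_flops / nodes         # per-node share of the system peak
    return node_peak / mem_bw_bytes_per_s


# Figures quoted on the slides: 125 Pflops, 10.64 M cores,
# a 260-core processor, and 136 GB/s per node.
ratio = flops_per_byte(
    peak_flops=125e15,
    total_cores=10.64e6,
    cores_per_node=260,
    mem_bw_bytes_per_s=136e9,
)
print(f"{ratio:.1f} flops/byte")
```

Under these assumptions the ratio comes out a little above 22, about three times the KNL and P100 figures quoted on the same slide, which is the memory-wall pressure the talk goes on to discuss.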
Major Challenge #1: Scaling
Major Challenge #2: Memory Wall
- Refactoring and Redesigning
An Incomplete List of Full-Scale Applications (2016 and 2017)
- Fully Implicit Solver for Atmospheric Dynamics (2016 Gordon Bell Prize)
- Extreme-scale Graph Processing Framework
- Surface Wave Modeling
- Simulation of Planetary Rings
- Simulations of Quantum Spin Liquid States via PEPS++
- Phase Field Simulations of Coarsening Dynamics
- Molecular Dynamics Simulation of Condensed Covalent Materials
- Atomistic Simulation of Silicon Nanowires
- Run-away Electron Trajectory Simulation
- cryo-EM Macromolecule Structure Determination
- Genome Functional Annotation and Homeotic Gene Building
- Redesigning CAM-SE (2017 Gordon Bell Finalist)
- Spacecraft CFD Numerical Simulation
- Nonlinear Earthquake Simulation (2017 Gordon Bell Finalist)
Application I: Implicit Solver for Atmospheric Dynamics
- 163,840 processes, 65 threads each
- [Figure: machine hierarchy of racks, chips, core-groups, and cores, labeled with the total number of cores.]
- DD-MG K-cycle with uniform DD
- "Now let's find a very shallow way to design a plug & play subdomain solver."

Geometry-based pipelined ILU (GP-ILU)
Our design goals:
1. Single sweep
2. Synchronization-free
3. Improved data locality
[Figure: two-level pipeline over blocks labeled 8×8 and 1×1; subdomain matrix of 1st-order with geometric index; axes X, Y, Z; labels blk_height, reg_size, cell_size, and "num_cores - 1 + blk_height < dim_z"; synchronization avoiding.]
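The process and thread counts on this slide multiply out to the machine's full core count. The following is a minimal sketch of that arithmetic, assuming one process per core group and one thread per core (1 MPE plus the 8*8 CPE mesh), which is how the processor diagram earlier in the deck reads; the constant names are hypothetical.

```python
CORE_GROUPS_PER_CHIP = 4   # Core Group 0..3 in the processor diagram
CPES_PER_GROUP = 8 * 8     # one 8*8 CPE mesh per core group
MPES_PER_GROUP = 1         # one MPE per core group

# Assumption: one thread per core inside a core group.
threads_per_process = MPES_PER_GROUP + CPES_PER_GROUP
cores_per_chip = CORE_GROUPS_PER_CHIP * threads_per_process

processes = 163_840        # from the slide; assumed one per core group
total_cores = processes * threads_per_process
chips = processes // CORE_GROUPS_PER_CHIP

print(f"threads per process: {threads_per_process}")
print(f"cores per chip:      {cores_per_chip}")
print(f"chips:               {chips:,}")
print(f"total cores:         {total_cores:,}")
```

Read this way, 65 threads per process and 260 cores per chip agree with the "260-core processor" description, and the total lands at roughly the 10.6 million cores quoted in the scaling results.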
Strong-scaling results
[Figure: parallel efficiency (0% to 100%) versus total number of cores (1M to 11M). Labeled values: 67%, 45%, and 33% (GB'15).]
Weak-scaling results
[Figure: SYPD (simulated years per day, 0.00125 to 0.16) versus total number of cores (0.33 M, 0.67 M, 1.33 M, 2.66 M, 5.32 M, 10.64 M) for the implicit and explicit solvers, with resolution (km) labels 2.480, 1.389, 0.920, 0.620, 0.488. Annotations: DOFs = 772B; 7.95 DP-PF; 23.66 DP-PF; "Exa-scale"; 34X for exp; 89.5X.]
The 488-m resolution run: 0.07 SYPD on 10.6M cores with dt = 240 s, an 89.5X speedup over the explicit solver.
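To make the SYPD figure more concrete, it can be converted into time steps per wall-clock day and wall-clock seconds per step. This is only an illustrative conversion under stated assumptions: a 365-day model year, a constant time step of 240 s, and hypothetical helper names. The implied explicit-solver throughput is derived purely from the quoted 89.5X speedup and does not appear on the slide.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption: 365-day model year


def steps_per_wall_day(sypd, dt_seconds):
    """Model time steps completed in one wall-clock day."""
    simulated_seconds = sypd * DAYS_PER_YEAR * SECONDS_PER_DAY
    return simulated_seconds / dt_seconds


def wall_seconds_per_step(sypd, dt_seconds):
    """Average wall-clock cost of a single time step."""
    return SECONDS_PER_DAY / steps_per_wall_day(sypd, dt_seconds)


sypd_implicit = 0.07   # 488-m run on 10.6M cores
dt = 240.0             # seconds of simulated time per step
speedup = 89.5         # implicit over explicit, as quoted

steps = steps_per_wall_day(sypd_implicit, dt)
cost = wall_seconds_per_step(sypd_implicit, dt)
sypd_explicit = sypd_implicit / speedup  # derived, not from the slide

print(f"steps per wall-clock day: {steps:.0f}")
print(f"wall seconds per step:    {cost:.2f}")
print(f"implied explicit SYPD:    {sypd_explicit:.5f}")
```

Under these assumptions the implicit run takes on the order of nine thousand steps per day, or roughly nine to ten wall-clock seconds per 240-second model step.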