Tianhe-3 and the Exascale Road in China
Ruibo WANG, National University of Defense Technology
Contents
❑ NUDT & TianHe ❑ the Exascale Road in China ❑ Tianhe-3
NUDT & Tianhe
❑ NUDT
❑ 1953: originally founded in Harbin ❑ 1970: moved to Changsha ❑ 1978: renamed the National University of Defense Technology
[Map: Harbin and Changsha]
NUDT & Tianhe
❑ Galaxy-I
❑ 1983, the 1st supercomputer in China ❑ Peak performance: 100 Mflops ❑ Project started in 1978; widely used in oil exploration and weather forecasting
Galaxy-I supercomputer
NUDT & Tianhe
❑ Galaxy-I
❑ 1983, 100 Mflops
❑ Galaxy-II
❑ 1994, Gflops level ❑ Vector architecture
❑ Galaxy-III
❑ 1997, 13 Gflops ❑ MPP ❑ MIPS CPU
Galaxy-II Galaxy-III
NUDT & Tianhe
❑ TianHe-1, deployed in 2009, 1.2Pflops
❑ Rank No.1 in China ❑ Rank No.5 in Top500 (Nov. 2009)
❑ TianHe-1A, deployed in 2010, 4.7Pflops
❑ Rank No.1 in Top500 (Nov. 2010)
TianHe-1 TianHe-1A
NUDT & Tianhe
❑ TianHe-1A, deployed in 2010, 4.7Pflops
❑ Rank No.1 in Top500 (Nov. 2010) ❑ the 1st time China took the No.1 spot ❑ deployed in the National Supercomputer Center in Tianjin
NUDT & Tianhe
❑ TianHe-2 made its pre-release @ IHPCF2013
❑ International High Performance Computing Forum ❑ http://www.ihpcf.org/ ❑ Changsha, May 2013
NUDT & Tianhe
❑ TianHe-2 ranked No.1 from Jun. 2013 to Nov. 2015
❑ No.1 for 3 years (6 times) ❑ Peak 55 Pflops, Linpack 33.86 Pflops (efficiency worked out below)
- Jun. 2013, Leipzig
- Nov. 2013, Denver
- Jun. 2014, Leipzig
- Nov. 2014, New Orleans
- Jun. 2015, Frankfurt
- Nov. 2015, Austin
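From the figures above, Tianhe-2's Linpack efficiency works out to 33.86 / 55 ≈ 62% of peak.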
NUDT & Tianhe
❑ TianHe-2
❑ 16,000 compute nodes ❑ Frame: 32 compute nodes ❑ Rack: 4 compute frames ❑ Whole system: 125 racks (multiplied out below)
[Photos: compute blade → compute frame → compute rack → full system]
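As a check, the packaging hierarchy multiplies out to the full machine: 32 nodes/frame × 4 frames/rack × 125 racks = 16,000 nodes.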
NUDT & Tianhe
❑ TianHe-2 Background
❑ Sponsored by the 863 High Tech. Program, the Government of Guangdong Province, and the Government of Guangzhou City
❑ deployed in National Supercomputer Center in
Guangzhou (NSCC-GZ)
❑ Oct. 2013: Phase 1 system was moved to NSCC-GZ
NUDT & Tianhe
❑ Jan. 2014: Tianhe-2 began providing service at NSCC-GZ
NUDT & Tianhe
❑ Originally planned to finish its upgrade to Phase 2 in 2015
❑ Replace the KNC with the new-generation KNL ❑ The peak performance would reach 100 Pflops
❑ In early 2015, for various reasons, we turned to the homegrown accelerator to upgrade the system
❑ The Phase 2 system was ready at the end of 2017
NUDT & Tianhe
❑ Comparison of Tianhe-2 & Tianhe-2A
                        Tianhe-2                          Tianhe-2A
Nodes & Performance     16,000 nodes, Intel CPU + KNC,    17,792 nodes, Intel CPU + Matrix-2000,
                        54.9 Pflops                       100.68 Pflops
Interconnect            10 Gbps, 1.57 us                  14 Gbps, 1 us
Memory                  1.4 PB                            3 PB
Storage                 12.4 PB, 512 GB/s                 19 PB, 1 TB/s
Energy Efficiency       17.8 MW, 1.9 Gflops/W             18.5 MW, 5.4 Gflops/W
Programming Environment MPSS for Intel KNC                OpenMP/OpenCL for Matrix-2000
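Note that the upgrade nearly doubled peak performance (54.9 → 100.68 Pflops) at roughly the same power (17.8 → 18.5 MW), i.e. 5.4 / 1.9 ≈ 2.8× better energy efficiency.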
NUDT & Tianhe
❑ Matrix-2000
❑ 4 super-nodes (SN) ❑ 8 clusters per SN ❑ 4 cores per cluster ❑ Core:
❑ Self-defined 256-bit vector ISA ❑ 16 DP flops/cycle per core
❑ Peak performance: 2.4576 Tflops @ 1.2 GHz ❑ Power: ~240 W ❑ 8 DDR4-2400 channels ❑ x16 PCIe Gen3
4 SNs × 8 clusters × 4 cores × 16 flops/cycle × 1.2 GHz = 2.4576 Tflops
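From the same figures, the chip's energy efficiency comes out to roughly 2.4576 Tflops / 240 W ≈ 10 Gflops/W.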
[Block diagram: Matrix-2000 with 4 super-nodes (SN0–SN3), each containing 8 clusters of 4 cores, joined by an on-chip interconnect to PCIe and four DDR4 interfaces]
NUDT & Tianhe
❑ Heterogeneous Compute Nodes
❑ Intel Xeon CPU ×2 ❑ Matrix-2000 ×2 ❑ Memory: 192 GB
❑ Interconnect: 14 Gbps homegrown network
❑ Peak performance: 5.34 Tflops
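Of that peak, the two Matrix-2000s contribute 2 × 2.4576 ≈ 4.92 Tflops, leaving roughly 0.42 Tflops (about 0.21 Tflops each) for the two Xeons.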
NUDT & Tianhe
❑ Heterogeneous Compute Blades
❑ Compute blade = Xeon part + Matrix-2000 part ❑ Use the Matrix-2000 part to replace the KNC part
[Photo: a compute blade holding 4 Intel Xeon CPUs and 4 Matrix-2000s, forming 2 compute nodes]
NUDT & Tianhe
❑ Heterogeneous programming environment
❑ Supports OpenMP 4.x and OpenCL (see the sketch after the diagram below)
[Software stack diagram: on the Xeon host, OpenMP 4.x and OpenCL applications run on the OpenMP/OpenCL runtimes, a heterogeneous computing library, a symmetric communication library, an API wrapper, and the driver over the host OS; on the Matrix-2000 device, a math library, OpenCL runtime plugin, X compiler, device driver, and symmetric communication library run over the device OS]
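To give a concrete feel for this programming model, below is a minimal OpenMP 4.x offload sketch in C. It is a generic illustration that assumes the Matrix-2000 is exposed as an OpenMP target device; the array names and kernel are hypothetical, not taken from the Tianhe-2A toolchain.

    #include <stdio.h>
    #include <omp.h>

    #define N 1024

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* OpenMP 4.x offload: copy a and b to the accelerator,
         * run the loop there in parallel, copy c back to the host. */
        #pragma omp target map(to: a, b) map(from: c)
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[42] = %f\n", c[42]);   /* expect 126.0 */
        return 0;
    }

If no accelerator is present, OpenMP falls back to executing the target region on the host, so the same code stays portable across Xeon-only and Xeon + Matrix-2000 configurations.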
Contents
❑ NUDT & TianHe ❑ the Exascale Road in China ❑ Tianhe-3
Next step: Exascale
❑ Governments are targeting Exascale computing
❑ US, Japan, EU, China
❑ China has reached the 100P level, but Exascale poses far greater challenges
❑ Memory wall ❑ Communication wall ❑ Reliability wall ❑ Energy consumption wall ❑ etc.
More Walls for China
❑ Microelectronics & chip industry
❑ Still at an underdeveloped stage ❑ Calls for more technology accumulation
❑ Various & complex needs
❑ Huge & highly diverse market ❑ Calls for multiple design & development roads
❑ Self-controllable road
❑ Processor ❑ Platform & OS ❑ APP ❑ Eco-system
China’s Development
National Projects & Plans in China
❑ Since 1990, China has launched an HPC project in every 5-year plan, sponsored by the 863 High Tech. Program of the Ministry of Science & Technology
❑ the 10th 5-year plan (2001~2005)
❑ Project: High Performance Computer and Software System ❑ Targets: TFlops supercomputer and high-performance computing environment
❑ Successfully developed TF-scale computers and the China National Grid (CNGrid) testbed
❑ the 11th 5-year plan (2006~2010)
❑ Project: High-productivity computer and network computing environment
❑ Targets: PFlops supercomputer and Grid computing environment ❑ Successfully developed Peta-scale computers and upgraded CNGrid into the national HPC service environment
❑ the 12th 5-year plan (2011~2015)
❑ Project: High-productivity computer and computing environment
❑ Targets: 100 PFlops supercomputer and cloud computing environment
❑ Developed world-class computer systems
❑ Tianhe-2 ❑ Sunway TaihuLight
❑ the 13th 5-year plan (2016~2020)
❑ Project: Exascale system ❑ Targets: key technologies for an EFlops supercomputer
National Projects & Plans in China
❑ GOALS
❑ Develop self-dependent and controllable core technology for exascale computing, and keep China's leading position
❑ Develop a series of critical HPC applications and software centers, building the HPC application eco-system
❑ Build a national HPC environment with globally top-level resources and services
❑ Two Steps to Exascale
❑ Support vendors to develop prototypes (2016-2018) ❑ Choose and support vendors to achieve exascale
The 13th 5-year plan (2016~2020)
Exascale goals in the 2016 proposal
❑ System performance 1 Eflops ❑ Node performance > 10Tflops ❑ Network bandwidth > 400Gbps ❑ Network scale up to more than 100,000 nodes ❑ MPI latency < 1.2us ❑ Linpack efficiency > 60% ❑ Power efficiency > 30Gflops/W
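At these targets, the power budget follows directly: 1 Eflops at ≥ 30 Gflops/W caps the whole system at 10^18 / (30 × 10^9) ≈ 33 MW.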
Vendors in China
❑ University
❑ NUDT
❑ homegrown CPU, accelerator and interconnect
❑ Institute
❑ National Research Center of Parallel Computer Engineering and Technology (NRCPC)
❑ homegrown many-core CPU
❑ Company
❑ Dawning (Sugon)
❑ Various product lines besides HPC: servers, PCs, data center products, etc.
❑ High share of the market
NUDT exascale prototype system
Deployed in the National Supercomputer Center in Tianjin, 2018
NUDT exascale prototype system
❑ 512 nodes
❑ 3 MT-2000+ processors per node ❑ 6 Tflops peak performance per node
❑ Matrix-2000+
❑ 128 cores ❑ 2 GHz ❑ 2 Tflops ❑ ~130 W, ~15 Gflops/W
❑ 400 Gbps homegrown network
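These figures are mutually consistent: 3 processors × 2 Tflops = 6 Tflops per node, and 2 Tflops / ~130 W ≈ 15 Gflops/W.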
NUDT exascale prototype system
❑ Air and water hybrid cooling ❑ PUE < 1.15 ❑ High density
NRCPC exascale prototype system
❑ SW26010 CPU
❑ Used in the Sunway TaihuLight system
❑ 512 nodes
❑ Each node has 2 CPUs
❑ Homegrown network
Sugon exascale prototype system
❑ Heterogeneous architecture
❑ Hygon CPU + DCU
❑ 6D torus network
Sugon exascale prototype system
❑ Hierarchy
❑ 512 Nodes ❑ 32 Supernodes ❑ 6 Silicon Units ❑ 1 Silicon Cube
❑ Cooling
❑ Total immersion cooling ❑ No noise ❑ Better heat-exchange performance
Sugon exascale prototype system
Exascale prototype systems
❑ Compute
❑ Traditional multi-core CPU ❑ Many-core CPU ❑ CPU + DCU
❑ Network
❑ Homegrown interconnect network ❑ Commercial network
❑ Cooling
❑ Air & water hybrid cooling ❑ Traditional water cooling ❑ Total immersion cooling
Contents
❑ NUDT & TianHe ❑ the Exascale Road in China ❑ Tianhe-3
❑ Heterogeneous (CPU + Accelerator) is the trend
❑ Summit (US) ❑ Effectively increases single-node performance ❑ Mitigates the Communication Wall & Reliability Wall ❑ Heterogeneous programming is prevalent ❑ A practical way to Exascale
❑ Our plan
❑ Based on our current Matrix accelerator technology (upcoming model: Matrix-3000)
❑ Better manufacturing process ❑ Increased peak performance ❑ Optimized vector performance
Architecture
❑ Heterogeneous architecture
Architecture
[Diagram: two blade configurations — a multi-processor blade (CPU CPU CPU CPU) and a multi-processor blade with accelerators (CPU CPU CPU CPU + MT MT MT MT)]
❑ Heterogeneous, flexible architecture
Architecture
[Diagram: flexible node configurations over the interconnect — accelerator-dominant (mostly MTs), CPU-dominant (mostly CPUs), and heterogeneous (CPUs and MTs mixed)]
Engineering
❑ Easily replaceable
❑ Two kinds of compute blades ❑ High density
❑ Fast interconnect, support > 8
CPU
❑ 64 cores, > 2 Tflops ❑ DDR4 ❑ PCIe Gen4 ❑ Half-precision support (rough per-core arithmetic after the diagram)
[Block diagram: CPU built from dual-core tiles with shared L2 caches (C C L2C), connected through a crossbar mesh (X) to L3 cache slices, DDR4 channels, and two PCIe 4.0 interfaces]
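The slide gives no CPU clock rate, but as a rough sanity check, 2 Tflops across 64 cores is about 31 Gflops per core; at an assumed 2 GHz (an assumption, not a stated figure) that would correspond to roughly 16 DP flops/cycle per core.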
Matrix-3000
❑ GPDSP ❑ Cores ≥ 96, > 10 Tflops ❑ HBM2 ❑ PCIe Gen4 ❑ Half-precision support
[Block diagram: Matrix-3000 with four GPDSP clusters (0–3), each paired with its own HBM2 stack (HBM2_0–HBM2_3), plus a PCIe 4.0 x16 interface]
Interconnect Network
❑ Homegrown ❑ Bandwidth > 400 Gbps ❑ MPI latency < 2 us ❑ Supports ~100,000 nodes, max 5 hops
Interconnect Network
❑ 3D Butterfly topology
❑ Maximum of 5 hops between any two nodes ❑ Intelligent network management: path tracing, link testing, fault reporting, chip configuration, etc.
❑ Other features: QoS, fault tolerance, etc. ❑ Multiple network planes
[Network diagram: nodes node1…node24, each with an MNI connecting its CPUs to ZNR router chips; a node's ZNI attaches through PCIe_0…PCIe_3]
Cooling
❑ Air and water hybrid cooling
❑ An efficient approach to cooling ❑ Practical, with good cost-performance ❑ PUE < 1.1
Software
❑ OpenCL support
❑ Software-defined super node
❑ GPDSP library
❑ BLAS library optimized for the underlying hardware (see the sketch after this list)
❑ 3 major platforms
❑ Traditional scientific computing ❑ Big Data ❑ AI
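As an illustration of how an application would consume such a tuned BLAS, here is a minimal sketch in C. It assumes the GPDSP library exposes the standard CBLAS interface; cblas_dgemm and the cblas.h header are the generic convention, not confirmed details of the Tianhe-3 toolchain.

    /* C = alpha*A*B + beta*C on row-major 2x2 matrices. */
    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        double A[4] = {1, 2, 3, 4};
        double B[4] = {5, 6, 7, 8};
        double C[4] = {0, 0, 0, 0};

        /* C = 1.0 * A * B + 0.0 * C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);   /* 19 22 / 43 50 */
        return 0;
    }

The point of shipping a hardware-optimized BLAS is that unchanged application code like this picks up the GPDSP-specific tuning simply by linking against the vendor library.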
Software
❑ Container support
❑ For future supercomputer-center use ❑ Supercomputing cloud ❑ Cloud supercomputing
❑ Fault tolerance
❑ Autonomic management system ❑ Handles hardware and software errors
Tianhe-3
Blade (8 CPUs: 16 Tflops + 8 MT-3000s: 80 Tflops) → Frame (32 blades) → Cabinet (4 frames) → System (100 cabinets); per chip: CPU 2 Tflops, MT-3000 10 Tflops
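Multiplying out the published hierarchy shows how the design reaches exascale: (16 + 80) Tflops per blade × 32 blades/frame × 4 frames/cabinet × 100 cabinets = 1,228.8 Pflops ≈ 1.23 Eflops peak.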
HPC Eco-system
❑ We aim to develop the eco-system
❑ homegrown processors and ISAs are the basic part
❑ Many competitors
Development road
❑ Government guiding & leading
❑ National Supercomputer Centers
❑ Huge systems ❑ General purpose, various users & needs
❑ National support
❑ Sufficient and constant funding & support ❑ An HPC system is a strategic tool for a nation
❑ Cannot rely on the market alone to push things forward ❑ Push technology forward and use technology to push markets forward
❑ HPC as an engine to drive other high-tech industries
❑ Keep pace with friends (Japan, the EU and the US)