

Slide 1

Tianhe-3 and the exascale road in China

Ruibo WANG, National University of Defense Technology

Slide 2

Contents

❑ NUDT & TianHe
❑ The Exascale Road in China
❑ Tianhe-3

Slide 3

Contents

❑ NUDT & TianHe
❑ The Exascale Road in China
❑ Tianhe-3

Slide 4

NUDT & Tianhe

❑ NUDT

  ❑ 1953: originally founded in Harbin
  ❑ 1970: moved to Changsha
  ❑ 1978: renamed the National University of Defense Technology

[Map: Harbin and Changsha]

Slide 5

NUDT & Tianhe

❑ Galaxy-I

  ❑ 1983: the first supercomputer in China
  ❑ Peak performance: 100 Mflops
  ❑ Project started in 1978; widely used in oil exploration and weather forecasting

[Photo: the Galaxy-I supercomputer]

Slide 6

NUDT & Tianhe

❑ Galaxy-I
  ❑ 1983, 100 Mflops
❑ Galaxy-II
  ❑ 1994, Gflops level
  ❑ Vector architecture
❑ Galaxy-III
  ❑ 1997, 13 Gflops
  ❑ MPP architecture
  ❑ MIPS CPUs

[Photos: Galaxy-II and Galaxy-III]

Slide 7

NUDT & Tianhe

❑ TianHe-1, deployed in 2009, 1.2 Pflops
  ❑ Ranked No. 1 in China
  ❑ Ranked No. 5 in the Top500 (Nov. 2009)
❑ TianHe-1A, deployed in 2010, 4.7 Pflops
  ❑ Ranked No. 1 in the Top500 (Nov. 2010)

[Photos: TianHe-1 and TianHe-1A]

Slide 8

NUDT & Tianhe

❑ TianHe-1A, deployed in 2010, 4.7 Pflops
  ❑ Ranked No. 1 in the Top500 (Nov. 2010)
  ❑ The first time China took the No. 1 spot
  ❑ Deployed in the National Supercomputer Center in Tianjin

Slide 9

NUDT & Tianhe

❑ TianHe-2 made its pre-release at IHPCF 2013
  ❑ International High Performance Computing Forum
  ❑ http://www.ihpcf.org/
  ❑ Changsha, May 2013

Slide 10

NUDT & Tianhe

❑ TianHe-2 ranked No. 1 from Jun. 2013 to Nov. 2015
  ❑ No. 1 for 3 years (6 consecutive Top500 lists)
  ❑ Peak 55 Pflops, Linpack 33.86 Pflops

  • Jun. 2013, Leipzig
  • Nov. 2013, Denver
  • Jun. 2014, Leipzig
  • Nov. 2014, New Orleans
  • Jun. 2015, Frankfurt
  • Nov. 2015, Austin
Slide 11

NUDT & Tianhe

❑ TianHe-2

  ❑ 16,000 compute nodes
  ❑ Frame: 32 compute nodes
  ❑ Rack: 4 compute frames
  ❑ Whole system: 125 racks

[Diagram: compute blade → compute frame → compute rack → full system]

Slide 12

NUDT & Tianhe

❑ TianHe-2 Background
  ❑ Sponsored by the 863 High-Tech Program, the government of Guangdong Province, and the government of Guangzhou City
  ❑ Deployed in the National Supercomputer Center in Guangzhou (NSCC-GZ)
  ❑ Oct. 2013: the Phase 1 system was moved to NSCC-GZ

Slide 13

NUDT & Tianhe

❑ Jan. 2014: Tianhe-2 began providing service at NSCC-GZ

Slide 14

NUDT & Tianhe

❑ Originally planned to finish the upgrade to Phase 2 in 2015
  ❑ Replace the KNC accelerators with the new-generation KNL
  ❑ Peak performance would reach 100 Pflops
❑ In early 2015, for various reasons, we turned to the homegrown accelerator to upgrade the system instead
❑ The Phase 2 system was ready at the end of 2017

Slide 15

NUDT & Tianhe

❑ Comparison of Tianhe-2 & Tianhe-2A

Item                      Tianhe-2                   Tianhe-2A
Nodes & Performance       16,000 nodes               17,792 nodes
                          Intel CPU + KNC            Intel CPU + Matrix-2000
                          54.9 Pflops                100.68 Pflops
Interconnect              10 Gbps, 1.57 µs           14 Gbps, 1 µs
Memory                    1.4 PB                     3 PB
Storage                   12.4 PB, 512 GB/s          19 PB, 1 TB/s
Energy Efficiency         17.8 MW, 1.9 Gflops/W      18.5 MW, 5.4 Gflops/W
Programming Environment   MPSS for Intel KNC         OpenMP/OpenCL for Matrix-2000
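As a quick consistency check on the Tianhe-2A column, power times efficiency reproduces the peak figure (suggesting the Gflops/W numbers here are computed against peak performance):

\[
18.5\ \text{MW} \times 5.4\ \text{Gflops/W} \approx 99.9\ \text{Pflops} \approx 100.68\ \text{Pflops}
\]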

Slide 16

NUDT & Tianhe

❑ Matrix-2000

  ❑ 4 super-nodes (SN)
  ❑ 8 clusters per SN
  ❑ 4 cores per cluster
  ❑ Core
    ❑ Self-defined 256-bit vector ISA
    ❑ 16 DP flops/cycle per core
  ❑ Peak performance: 2.4576 Tflops @ 1.2 GHz
  ❑ Power: ~240 W
  ❑ 8 DDR4-2400 channels
  ❑ x16 PCIe Gen3

4 SNs × 8 clusters × 4 cores × 16 flops/cycle × 1.2 GHz = 2.4576 Tflops
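Dividing the stated peak by the stated power also gives the chip-level energy efficiency, useful for comparison with the exascale targets later in the deck:

\[
\frac{2457.6\ \text{Gflops}}{240\ \text{W}} \approx 10.2\ \text{Gflops/W}
\]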

[Block diagram: four super-nodes (SN0–SN3), each containing 8 clusters of 4 cores, linked by the on-chip interconnect to the PCIe interface and DDR4 memory channels]

Slide 17

NUDT & Tianhe

❑ Heterogeneous Compute Nodes

  ❑ Intel Xeon CPU x2
  ❑ Matrix-2000 x2
  ❑ Memory: 192 GB
  ❑ Interconnect: 14 Gbps homegrown network
  ❑ Peak performance: 5.34 Tflops
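Simple arithmetic on the figures above shows how the node peak splits between the parts (the Xeon share is inferred, not stated on the slide):

\[
2 \times 2.4576\ \text{Tflops} \approx 4.92\ \text{Tflops (Matrix-2000s)}, \qquad 5.34 - 4.92 \approx 0.42\ \text{Tflops (two Xeons)}
\]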

Slide 18

NUDT & Tianhe

❑ Heterogeneous Compute Blades

  ❑ Compute blade = Xeon part + Matrix-2000 part
  ❑ The Matrix-2000 part replaces the original KNC part

[Diagram: one blade holds 4 Intel Xeon CPUs and 4 Matrix-2000 accelerators, forming 2 compute nodes]

Slide 19

NUDT & Tianhe

❑ Heterogeneous programming environment

  ❑ Supports OpenMP 4.x and OpenCL

[Software stack diagram: user-level OpenMP 4.x and OpenCL front ends over an X compiler, OpenMP/OpenCL runtimes, math library, heterogeneous computing library, and symmetric communication libraries; the Xeon host runs the host OS with an API wrapper and driver, while the Matrix-2000 runs a device OS with its own driver]
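To give a feel for the OpenMP 4.x style this stack supports, here is a minimal, generic offload sketch in C; it targets whatever accelerator the toolchain exposes as the default device, and the array size and contents are illustrative, not tied to Matrix-2000 specifics:

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* OpenMP 4.x offload: copy a and b to the device, run the loop
     * there across its teams/threads, and copy c back to the host.
     * Falls back to host execution if no device is available. */
    #pragma omp target teams distribute parallel for \
            map(to: a[0:N], b[0:N]) map(from: c[0:N])
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %f (expect 30.0)\n", c[10]);
    return 0;
}
```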

Slide 20

Contents

❑ NUDT & TianHe
❑ The Exascale Road in China
❑ Tianhe-3

Slide 21

Next step: Exascale

❑ Governments are targeting exascale computing
  ❑ US, Japan, EU, China
❑ China has reached the 100P level, but exascale poses far greater challenges
  ❑ Memory wall
  ❑ Communication wall
  ❑ Reliability wall
  ❑ Energy consumption wall
  ❑ etc.

Slide 22

More Walls for China

❑ Microelectronics & chip industry
  ❑ Still at an underdeveloped stage
  ❑ Calls for more technology accumulation
❑ Various & complex needs
  ❑ Huge & highly diverse market
  ❑ Calls for multiple design & development roads
❑ Self-controllable road
  ❑ Processor
  ❑ Platform & OS
  ❑ Applications
  ❑ Eco-system

Slide 23


China’s Development

Slide 24

National Projects & Plans in China

❑ Since 1990, China has launched an HPC project in every 5-year plan, sponsored by the 863 High-Tech Program of the Ministry of Science & Technology
❑ The 10th 5-year plan (2001~2005)
  ❑ Project: High Performance Computer and Software System
  ❑ Targets: TFlops supercomputer and high-performance computing environment
  ❑ Successfully developed TF-scale computers and the China National Grid (CNGrid) testbed
❑ The 11th 5-year plan (2006~2010)
  ❑ Project: High Productivity Computer and Network Computing Environment
  ❑ Targets: PFlops supercomputer and Grid computing environment
  ❑ Successfully developed peta-scale computers and upgraded CNGrid into the national HPC service environment

Slide 25

National Projects & Plans in China

❑ The 12th 5-year plan (2011~2015)
  ❑ Project: High Productivity Computer and Computing Environment
  ❑ Targets: 100 PFlops supercomputer and cloud computing environment
  ❑ Developed world-class computer systems
    ❑ Tianhe-2
    ❑ Sunway TaihuLight
❑ The 13th 5-year plan (2016~2020)
  ❑ Project: Exascale system
  ❑ Targets: key technologies of an EFlops supercomputer

Slide 26

The 13th 5-year plan (2016~2020)

❑ GOALS
  ❑ Develop self-dependent, controllable core technologies for exascale computing, and keep China's leading position
  ❑ Develop a series of critical HPC applications and software centers, building the HPC application eco-system
  ❑ Build a national HPC environment with world-class resources and services
❑ Two steps to exascale
  ❑ Support vendors to develop prototypes (2016–2018)
  ❑ Choose and support vendors to achieve exascale

Slide 27

Exascale Goal in 2016 proposal

❑ System performance: 1 Eflops
❑ Node performance: > 10 Tflops
❑ Network bandwidth: > 400 Gbps
❑ Network scale: more than 100,000 nodes
❑ MPI latency: < 1.2 µs
❑ Linpack efficiency: > 60%
❑ Power efficiency: > 30 Gflops/W
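Read together, the first and last targets bound the machine's power draw; a back-of-the-envelope division:

\[
\frac{1\ \text{Eflops}}{30\ \text{Gflops/W}} = \frac{10^{18}}{30 \times 10^{9}}\ \text{W} \approx 33\ \text{MW}
\]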

Slide 28

Vendors in China

❑ University
  ❑ NUDT
    ❑ Homegrown CPU, accelerator, and interconnect
❑ Institute
  ❑ National Research Center of Parallel Computer Engineering and Technology (NRCPC)
    ❑ Homegrown many-core CPU
❑ Company
  ❑ Dawning (Sugon)
    ❑ Various product lines besides HPC: servers, PCs, data-center products, etc.
    ❑ High market share

Slide 29

NUDT exascale prototype system

Deployed in the National Supercomputer Center in Tianjin, 2018

Slide 30

NUDT exascale prototype system

❑ 512 nodes
  ❑ 3 MT-2000+ processors per node
  ❑ 6 Tflops peak performance per node
❑ Matrix-2000+
  ❑ 128 cores
  ❑ 2 GHz
  ❑ 2 Tflops
  ❑ ~130 W, ~15 Gflops/W
❑ 400 Gbps homegrown network
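Multiplying out the stated per-node figures gives the prototype's approximate system peak (not stated on the slide, so treat it as a rough estimate):

\[
512\ \text{nodes} \times 3 \times 2\ \text{Tflops} \approx 3.1\ \text{Pflops}
\]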

Slide 31

NUDT exascale prototype system

❑ Air and water hybrid cooling
❑ PUE < 1.15
❑ High density

Slide 32

NRCPC exascale prototype system

❑ SW26010 CPU
  ❑ Used in the Sunway TaihuLight system
❑ 512 nodes
  ❑ Each node has 2 CPUs
❑ Homegrown network

Slide 33

Sugon exascale prototype system

Slide 34

Sugon exascale prototype system

❑ Heterogeneous architecture
  ❑ Hygon CPU + DCU
❑ 6D torus network

Slide 35

Sugon exascale prototype system

❑ Hierarchy
  ❑ 512 nodes
  ❑ 32 supernodes
  ❑ 6 silicon units
  ❑ 1 silicon cube
❑ Cooling
  ❑ Total immersion cooling
  ❑ No noise
  ❑ Better heat-exchange performance

Slide 36

Exascale prototype systems

❑ Compute
  ❑ Traditional multi-core CPU
  ❑ Many-core CPU
  ❑ CPU + DCU
❑ Network
  ❑ Homegrown interconnect network
  ❑ Commercial network
❑ Cooling
  ❑ Air & water hybrid cooling
  ❑ Traditional water cooling
  ❑ Total immersion cooling

Slide 37

Contents

❑ NUDT & TianHe
❑ The Exascale Road in China
❑ Tianhe-3

Slide 38

Architecture

❑ Heterogeneous (CPU + accelerator) is the trend
  ❑ Summit (US)
  ❑ Effectively increases single-node performance
  ❑ Mitigates the communication wall & reliability wall
  ❑ Heterogeneous programming is prevalent
  ❑ A practical way to exascale
❑ Our plan
  ❑ Based on our current Matrix accelerator technology (upcoming model: Matrix-3000)
  ❑ Better manufacturing process
  ❑ Increased peak performance
  ❑ Optimized vector performance

Slide 39

Architecture

❑ Heterogeneous architecture

[Diagram: a pure multi-processor node (4 CPUs) beside a multi-processor node with accelerators (4 CPUs + 4 MT accelerators)]

Slide 40

Architecture

❑ Heterogeneous flexible architecture

[Diagram: three node flavors built on the same interconnect — accelerator-dominant (MT accelerators only), CPU-dominant (CPUs only), and heterogeneous (CPUs + MT accelerators)]

Slide 41

Engineering

❑ Easily replaceable
  ❑ Two kinds of compute blades
❑ High density

Slide 42

CPU

❑ 64 cores, > 2 Tflops
❑ DDR4 memory
❑ PCIe Gen4
❑ Half-precision support
❑ Fast interconnect, support > 8

[Block diagram: the 64 cores are grouped in pairs sharing L2 caches, tied by on-chip routers ("X") to banks of L3 cache slices, eight DDR4 memory channels, and two PCIe 4.0 interfaces]

Slide 43

Matrix-3000

❑ GPDSP (general-purpose DSP) architecture
❑ ≥ 96 cores, > 10 Tflops
❑ HBM2 memory
❑ PCIe Gen4
❑ Half-precision support

[Block diagram: four GPDSP clusters (0–3), each attached to its own HBM2 stack (HBM2_0–HBM2_3), plus a x16 PCIe 4.0 interface]

Slide 44

Interconnect Network

❑ Homegrown
❑ Bandwidth > 400 Gbps
❑ MPI latency < 2 µs
❑ Supports ~100,000 nodes, with a maximum of 5 hops
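MPI latency targets like this are conventionally measured with a two-rank ping-pong microbenchmark; a minimal sketch in C (the iteration count is an arbitrary choice):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;   /* arbitrary; enough to average out noise */
    char byte = 0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {            /* send, then wait for the echo */
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* echo everything back */
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* each iteration is one round trip, i.e. two one-way latencies */
    if (rank == 0)
        printf("one-way latency: %.3f us\n", (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}
```

Run with two ranks placed on different nodes (e.g. `mpirun -np 2 ./pingpong`) to exercise the network rather than shared memory.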

Slide 45

Interconnect Network

❑ 3D butterfly topology
  ❑ Maximum of 5 hops between any two nodes
  ❑ Intelligent network management: path tracing, link tests, fault reporting, chip configuration, etc.
  ❑ Other features: QoS, fault tolerance, etc.
  ❑ Multiple planes

[Topology diagram: each node's CPUs attach through an MNI/ZNI network interface (over PCIe) to ZNR router chips, which form the multi-plane 3D butterfly fabric connecting groups of nodes (node1–node24 shown)]

Slide 46

Cooling

❑ Air and water hybrid cooling
  ❑ An efficient way of cooling
  ❑ Practical, with good cost performance
  ❑ PUE < 1.1
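For reference, PUE (power usage effectiveness) is the ratio of total facility power to the power delivered to the computing equipment, so PUE < 1.1 means less than 10% overhead for cooling and power delivery:

\[
\text{PUE} = \frac{P_{\text{total facility}}}{P_{\text{IT equipment}}} < 1.1
\]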

Slide 47

Software

❑ OpenCL support
  ❑ Software-defined super node
❑ GPDSP library
  ❑ BLAS library optimized for the underlying hardware
❑ 3 major platforms
  ❑ Traditional scientific computing
  ❑ Big Data
  ❑ AI
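Application code would reach such a library through the standard BLAS interface; here is a minimal sketch in C against the generic CBLAS API (the vendor library's actual headers and link flags are not specified in the deck, so this is illustrative):

```c
#include <stdio.h>
#include <cblas.h>   /* generic CBLAS header; link against the optimized BLAS */

int main(void) {
    /* C = 1.0 * A * B + 0.0 * C, with 2x2 row-major matrices */
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,      /* M, N, K */
                1.0, A, 2,    /* alpha, A, leading dimension of A */
                     B, 2,    /* B, leading dimension of B */
                0.0, C, 2);   /* beta, C, leading dimension of C */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);  /* [19 22; 43 50] */
    return 0;
}
```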

Slide 48

Software

❑ Container support
  ❑ For future supercomputer-center use
  ❑ Supercomputing cloud
  ❑ Cloud supercomputing
❑ Fault tolerance
  ❑ Autonomic management system
  ❑ Covers hardware errors and software errors

Slide 49

Tianhe-3

❑ CPU: 2 Tflops
❑ MT-3000: 10 Tflops
❑ Blade: 8 CPUs (16 Tflops) or 8 MTs (80 Tflops)
❑ Frame: 32 blades
❑ Cabinet: 4 frames
❑ System: 100 cabinets
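Multiplying the hierarchy out gives a rough system peak; assuming accelerator blades dominate (the CPU/MT blade mix is not given on the slide), this lands at roughly an exaflop:

\[
100\ \text{cabinets} \times 4\ \text{frames} \times 32\ \text{blades} \times 80\ \text{Tflops} \approx 1.02\ \text{Eflops}
\]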

Slide 50

HPC Eco-system

❑ We aim to develop the eco-system
  ❑ Homegrown processors and ISAs are the basic part
❑ Many competitors

Slide 51

Development road

❑ Government guiding & leading
❑ National Supercomputer Centers
  ❑ Huge systems
  ❑ General purpose, with various users & needs
❑ National support
  ❑ Sufficient and constant funding & support
  ❑ An HPC system is a strategic tool for a nation
    ❑ Cannot rely on the market alone to push it forward
    ❑ Push technology forward, and use technology to push markets forward
  ❑ HPC as an engine for developing other high-tech industries
❑ Keep pace with our friends (Japan, the EU, and the US)

Slide 52

Thank you

Thank you for your attention!