SLIDE 1

Sunway TaihuLight:

Designing and Tuning Scientific Applications

at the Scale of 10 Million Cores

Haohuan Fu, National Supercomputing Center in Wuxi; Department of Earth System Science, Tsinghua University. September 2017, ICAS

SLIDE 2

Outline

Sunway Machine: the Challenges and Opportunities
Scientific Computing with 10 Million Cores
Long Term Plan for Sunway TaihuLight

SLIDE 3

The Sunway Machine Family

Sunway-I:

CMA service, 1998
commercial chip
0.384 Tflops
48th of TOP500

Sunway BlueLight:

NSCC-Jinan, 2011
16-core processor
1 Pflops
14th of TOP500

Sunway TaihuLight:

NSCC-Wuxi, 2016
260-core processor
125 Pflops
1st of TOP500

SLIDE 4

SW26010: Sunway 260-Core Processor

[Figure: processor block diagram. Four core groups (Core Group 0 to 3) are linked by a network on chip (NoC). Each core group contains an MPE, an 8*8 CPE mesh, a PPU, and an iMC connected to memory. Inside the 8*8 CPE mesh: computing cores with registers and LDM, row and column communication buses, a control network, a data transfer network, and a transfer agent (TA). Levels shown: memory level, LDM level, register level, computing level.]

SLIDE 5

High-Density Integration of the Computing System

- A Five-Level Integration Hierarchy
  - computing node
  - computing board
  - super node
  - cabinet
  - entire computing system

SLIDE 10

How to Connect the 10 Million Cores?

2D core array with row and column buses
Network on Chip
Customized Network Board to Fully Connect 256 Nodes
Sunway Net

SLIDE 15

Tweet Comments from Prof. Satoshi Matsuoka

Sunway Micro

SLIDE 20

Outline

Sunway Machine: the Challenges and Opportunities
Scientific Computing with 10 Million Cores
Long Term Plan for Sunway TaihuLight

SLIDE 21

Machine Capability Comparison

[Figure: two charts comparing TaihuLight, Tianhe-2, Titan, and K Computer on a relative scale from 0.5 to 3. First chart axes: Peak Performance, Memory Size, Gflops/Watt, Tflops/m^3, memory bandwidth, communication bandwidth. Second chart axes: Linpack, Graph, HPCG, hpgmg.]

SLIDE 22

Major Features to Consider

Sunway TaihuLight

125 Pflops
32 GB and 136 GB/s per node
22 flops/byte
10 million cores
MPE + CPE
user-controlled 64 KB LDM
register communication among CPEs
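The 22 flops/byte figure can be checked with a few lines of arithmetic. The sketch below is only a sanity check: the peak performance and the per-node bandwidth come from the slide, while the node count of 40,960 is an assumption that the slide does not state.

```python
# Machine-balance arithmetic for one Sunway TaihuLight node.
PEAK_FLOPS = 125e15        # 125 Pflops for the whole system (from the slide)
NODE_BANDWIDTH = 136e9     # 136 GB/s per node (from the slide)
NODES = 40_960             # assumption: node count, not stated on the slide


def flops_per_byte(peak_flops: float, nodes: int, bandwidth: float) -> float:
    """Peak flops of one node divided by its memory bandwidth in bytes/s."""
    return peak_flops / nodes / bandwidth


balance = flops_per_byte(PEAK_FLOPS, NODES, NODE_BANDWIDTH)
print(f"{balance:.1f} flops/byte")
```

Under these assumptions the ratio rounds to 22, consistent with the slide. A higher ratio means more arithmetic must be done per byte fetched from memory before the cores can be kept busy.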

SLIDE 23

Major Features to Consider

Intel KNL 7250 of Cori: 6.5 flops/byte
NVIDIA P100 of Piz Daint: 7.2 flops/byte

SLIDE 24

Major Challenge 1: Scaling

SLIDE 25

Major Challenge 2: Memory Wall

Refactoring and Redesigning

SLIDE 27

Incomplete List of Full-Scale Applications

2016

Fully Implicit Solver for Atmospheric Dynamics
Surface Wave Modeling
Phase Field Simulations of Coarsening Dynamics
Atomistic Simulation of Silicon Nanowires
Run-away Electron Trajectory Simulation
Genome Functional Annotation and Homeotic Gene Building
Spacecraft CFD Numerical Simulation

2017

Extreme-scale Graph Processing Framework
Simulation of Planetary Rings
Simulations of Quantum Spin Liquid States via PEPS++
Molecular Dynamics Simulation of Condensed Covalent Materials
cryo-EM Macromolecule Structure Determination
Redesigning CAM-SE
Nonlinear Earthquake Simulation

SLIDE 28

Incomplete List of Full-Scale Applications

Labels added to the 2016 and 2017 lists:

2016 Gordon Bell Finalists
2016 Gordon Bell Prize
2017 Gordon Bell Finalists

SLIDE 31

Application I: Implicit Solver for Atmospheric Dynamics

[Diagram labels: racks, chips, core-groups, cores, total number of cores]

163,840 processes
65 threads

DD-MG
K-cycle
Very shallow
Uniform DD
Plug & Play

Now let’s find a way to design a subdomain solver.
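The process and thread counts on this slide multiply out to the "10 million cores" quoted throughout the talk. The sketch below assumes one process per core group and one thread per core (1 MPE plus the 8*8 CPE mesh); that mapping is an inference from the processor diagram, not something the slide spells out.

```python
# Core-count arithmetic for the full-machine run.
PROCESSES = 163_840            # from the slide
THREADS_PER_PROCESS = 65       # from the slide; assumed to be 1 MPE + 64 CPEs
CORE_GROUPS_PER_CHIP = 4       # from the processor diagram (Core Group 0 to 3)
CPES_PER_CORE_GROUP = 8 * 8    # the 8*8 CPE mesh

# Assumption: one process per core group, one thread per core.
assert THREADS_PER_PROCESS == 1 + CPES_PER_CORE_GROUP

chips = PROCESSES // CORE_GROUPS_PER_CHIP
total_cores = PROCESSES * THREADS_PER_PROCESS
cores_per_chip = CORE_GROUPS_PER_CHIP * THREADS_PER_PROCESS

print(chips, cores_per_chip, total_cores)
```

With these inputs the chip count is 40,960, each chip has 260 cores, and the total is 10,649,600 cores, which matches the 10.64 M tick on the weak-scaling plot.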

SLIDE 32

Geometry-based pipelined ILU (GP-ILU)

[Figure: a subdomain with axes X, Y, Z mapped onto 8×8 CPE meshes. Labels: two-level pipeline, blk_height, synchronization avoiding.]

[Constraint garbled in extraction; it is an inequality relating dim_z, blk_height, num_cores, cell_size, and reg_size.]

DD-MG K-cycle

Subdomain matrix

1st-order with geometric index

Our goal of design:

1. Single sweep
2. Synchronization-free
3. Improved data-locality

SLIDE 33

Strong scaling results

[Figure: parallel efficiency (0% to 100%) versus total number of cores (1M to 11M). Annotated values: 33% (GB’15), 67%, 45%.]

[Caption garbled in extraction; it begins "The ... res run:" and mentions SYPD and cores.]

SLIDE 34

Weak scaling results

[Figure: SYPD (0.00125 to 0.16) versus total number of cores (0.33 M, 0.67 M, 1.33 M, 2.66 M, 5.32 M, 10.64 M) and resolution (2.480, 1.389, 0.920, 0.620, 0.488 km), for the implicit and explicit solvers. Annotations: 34X, 89.5X, 7.95 DP-PF, 23.66 DP-PF, DOFs=772B, “Exa-scale” for exp.]

The 488-m res run: 0.07 SYPD, 10.6M cores, dt=240s, 89.5X speedup over explicit

SLIDE 35

Application II: Porting CESM / Redesigning CAM-SE for Sunway TaihuLight

CESM1.2.0 (components shown: CAM5.0, CPL7)
Tsinghua + BNU, 30+ Professors and Students

Four component models, millions of lines of code
Large-scale run on Sunway TaihuLight
24,000 MPI processes
Over one million cores
10-20x speedup for kernels
2-3x speedup for the entire model

“Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer”, in Proceedings of SC 2016.

SLIDE 37

Major Challenges

- a high complexity in application, and a heavy legacy in the code base (millions of lines of code)
- an extremely complicated MPMD program with no hotspots (or hundreds of hotspots)
- misfit between the in-place design philosophy and the new architecture
- lack of people with interdisciplinary knowledge and experience

SLIDE 38

OpenACC-based Refactoring of CAM

[Diagram: CAM workflow with stages CAM initial, Dyn_run, Phy_run1, Phy_run2; arrows labeled "Pass state variables", "Pass state variables and tracers", and "Pass tracers (u, v) to dynamics".]

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
          Qtens(k,q,ie) = …
        end do
      end do
    end do

    Euler_step:
      do ie = nets, nete
        compute Q min/max values for lim8
        compute Biharmonic mixing term f
      end do
      do ie = nets, nete
        2D advection step
        data packing
      end do
      Boundary exchange
      Data extracting

    do ie = nets, nete
      do k = 1, nlev
        dp(k) = func_1()
        do q = 1, qsize
          Qtens(k,q,ie) = func_2(dp(k))
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          Qtens(k,q,ie) = …
        end do
      end do
    end do

    Data packing

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          ….
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        dp0 = func_3()
        dpdiss = func_4()
        do q = 1, qsize
          Qtens(k,q,ie) = func_5(dp0, dpdiss)
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        dp(k) = func_5()
        Vstar(k) = func_6()
      end do
      do q = 1, qsize
        do k = 1, nlev
          Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
        end do
        do k = 1, nlev
          dp_star(k) = func_8(dp(k))
        end do
        do k = 1, nlev
          Qtens(k,q,ie) = func_9(dp_star(k))
        end do
      end do
      Data packing
    end do

Optimized:

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          Qtens(k,q,ie) = func_2(func_1())
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
        end do
      end do
    end do

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          ….
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          Qtens(k,q,ie) = func_5(func_3(),func_4())
        end do
      end do
    end do

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          Qtens(k,q,ie) = func_7(func_5(),func_6())
        end do
        do k = 1, nlev
          Qtens(k,q,ie) = func_9(func_8(func_5()))
        end do
      end do
      Data packing
    end do

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
          Qtens(k,q,ie) = …
        end do
      end do
    end do

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          Qtens(k,q,ie) = …
        end do
      end do
    end do

    Data packing

    !$ACC PARALLEL LOOP
    do ie_q = 1, qsize*(nete-nets)
      do k = 1, nlev
        q = func(ie_q)
        ie = func(ie_q)
        qmin(k,q,ie) = …
        qmax(k,q,ie) = …
        Qtens(k,q,ie) = …
      end do
    end do

    !$ACC PARALLEL LOOP
    do ie_q = 1, qsize*(nete-nets)
      do k = 1, nlev
        q = func(ie_q)
        ie = func(ie_q)
        Qtens(k,q,ie) = …
      end do
    end do

    !$ACC PARALLEL LOOP
    Data packing

[Step markers on the figure: 1 2 3 4 5 6]

manual transformation of loops

manual OpenACC parallelization and optimization on code and data structures

    do begin_chunk to end_chunk
      tphysbc() {
        convect_deep_tend(6.47%)
        convect_shallow_tend(15.57%)
        macrop_driver_tend(8.38%)
        microp_aero_run(4.29%)
        microp_driver_tend(7.13%)
        aerosol_wet_intr(4.29%)
        convect_deep_tend_2(0.51%)
        radiation_tend(54.07%)
      }
    enddo

    tphysbc() {
      do begin_chunk to end_chunk
        convect_deep_tend(6.47%)
        convect_shallow_tend(15.57%)
        macrop_driver_tend(8.38%)
        microp_aero_run(4.29%)
        microp_driver_tend(7.13%)
        aerosol_wet_intr(4.29%)
        convect_deep_tend_2(0.51%)
        radiation_tend(54.07%)
      enddo
    }

    tphysbc() {
      do begin_chunk to end_chunk
        convect_deep_tend(6.47%)
      enddo
      ……
      do begin_chunk to end_chunk
        microp_driver_tend(7.13%)
      enddo
      ……
      do begin_chunk to end_chunk
        radiation_tend(54.07%)
      enddo
    }

    do begin_chunk to end_chunk
      convect_deep_tend(6.47%) {
        zm_conv_tend(6.47%) {
          zm_convr(2.03%)
          zm_conv_evap()
          montran()
          convtranc(0.06%)
        }
      }
    enddo

    convect_deep_tend(6.47%) {
      zm_conv_tend(6.47%) {
        do begin_chunk to end_chunk
          zm_convr(2.03%)
        enddo
        do begin_chunk to end_chunk
          zm_conv_evap()
        enddo
        do begin_chunk to end_chunk
          montran()
        enddo
        do begin_chunk to end_chunk
          convtranc(0.06%)
        enddo
      }
    }

tool based transformation of loops
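The last step of the manual transformation collapses the `ie` and `q` loops into a single `ie_q` loop and recovers both indices inside the body (`q = func(ie_q)`, `ie = func(ie_q)`). The slide does not show `func`; the following is a minimal sketch of one possible index mapping, written in Python for illustration only. The helper names are hypothetical, and the sketch uses `nete - nets + 1` elements, which is the trip count of the original `do ie = nets, nete` loop.

```python
def split_index(ie_q: int, nets: int, qsize: int) -> tuple[int, int]:
    """Recover 1-based (ie, q) from the 1-based collapsed index ie_q.

    Hypothetical stand-in for the two `func(ie_q)` calls on the slide.
    """
    ie = nets + (ie_q - 1) // qsize
    q = 1 + (ie_q - 1) % qsize
    return ie, q


def nested_order(nets: int, nete: int, qsize: int) -> list[tuple[int, int]]:
    """Iteration order of the original nested loops (ie outer, q inner)."""
    return [(ie, q) for ie in range(nets, nete + 1) for q in range(1, qsize + 1)]


def collapsed_order(nets: int, nete: int, qsize: int) -> list[tuple[int, int]]:
    """Iteration order of the single collapsed loop over ie_q."""
    trip_count = qsize * (nete - nets + 1)
    return [split_index(ie_q, nets, qsize) for ie_q in range(1, trip_count + 1)]


print(collapsed_order(3, 4, 2))
```

The collapsed loop visits the same `(ie, q)` pairs in the same order, but exposes them as one long iteration space, which gives a parallel-loop directive more independent iterations to distribute over the 64 CPEs than the `ie` loop alone.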

SLIDE 39

CAM model: scalability and speedup

[Figure: simulation speed in Model Years Per Day (MYPD, axis 0.5 to 3) versus number of CGs (1024, 2400, 4096, 5120, 7350, 9600, 12000, 24000; each CG includes 1 MPE and 64 CPEs). Series: MPE only; MPE+CPE for dynamic core; MPE+CPE for both dynamic core and physics schemes. Data labels: 0.04, 0.15, 0.24, 0.25, 0.6, 0.78, 0.87, 1.54, 1.2, 1.62, 1.75, 2.81.]

million-core scale, 2[…] SYPD
manycore refactoring for the entire model
competitive simulation speed to the same model on NCAR Yellowstone

SLIDE 40

Athread-based Fine-grained Redesign

- Step 1: rewrite of Fortran OpenACC code to Athread C code
  - finer memory control through a specific DMA scheme
  - more efficient vectorization
- Step 2: register communication based redesign
  - remove data dependency
  - expose more parallelism

[Figure: three-stage computation on the 8×8 CPE mesh (C0,0 to C7,7), with elements labeled elek to elek+7. Stage 1: each CPE Ci,j accumulates its own 16 values: a16*i, a16*i + a16*i+1, ..., a16*i + ... + a16*i+15. Stage 2: the partial sums p15, p31, ..., p111, p127 are passed from CPE to CPE and added. Stage 3: each CPE combines the incoming partial sum p16*i with its local results.]
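The stage labels in the figure read like a blocked prefix sum (scan): a local scan per CPE, a pass that carries block totals from one CPE to the next, and a final pass that adds the carried offset. The sketch below illustrates that pattern sequentially in Python. It is an interpretation of the diagram, not the authors' Athread code; the function name and the default chunk size of 16 are assumptions taken from the `a16*i ... a16*i+15` labels.

```python
from itertools import accumulate


def three_stage_scan(values: list[float], chunk: int = 16) -> list[float]:
    """Inclusive prefix sum computed in three stages over fixed-size chunks."""
    # Stage 1: every worker scans its own chunk independently.
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    local = [list(accumulate(c)) for c in chunks]

    # Stage 2: carry the running total from worker to worker
    # (on the Sunway processor this would be register communication).
    offsets = []
    running = 0
    for scanned in local:
        offsets.append(running)
        running += scanned[-1]

    # Stage 3: every worker adds the incoming offset to its local results.
    return [x + off for scanned, off in zip(local, offsets) for x in scanned]


data = list(range(1, 129))            # 128 values, i.e. 8 chunks of 16
result = three_stage_scan(data)
print(result[15], result[-1])
```

Only stage 2 is serial, and it moves one number per worker, so the dependency chain along the data shrinks from the full array length to the number of workers.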

SLIDE 41

Performance Improvement through Redesign

1 Sunway CG (64 CPEs) could be equivalent to:

0.1x Intel Core
-> 1.8x Intel Core
-> 7.2x Intel Core
-> 43.1x Intel Core (in certain cases)

SLIDE 42

Scaling the Dynamic Core to Millions of Cores

[Figure: PFlops (0.01 to 5) versus number of processes (512, 2048, 8192, 32768, 131072), with series for 48, 192, 768, and 650 elements in each process. Annotated results: 3.3 PFlops (parallel efficiency 98.55%), 2.4 PFlops (92.2%), 1.76 PFlops (88.3%), 2.72 PFlops (92.9%).]

SLIDE 43

Simulation of Hurricane Katrina

SLIDE 44

Application III: Nonlinear Earthquake Simulation on Sunway TaihuLight

- Dynamic rupture source generator, originated from CG-FDM
- Seismic wave propagation, originated from AWP-ODC
- Other utilities:
  - source partitioner
  - 3D Model Interpolator
  - Restart controller

[Diagram: software framework. Utilities: Source Partitioner, Restart Controller, 3D Model Interpolator, Snapshot/Seismo Recorder, with LZ4 Compression, Group I/O, and Balanced I/O Forwarding. Dynamic Rupture Source Generator (based on CG-FDM): 3D Vel/Den Model, Fault Stress Init, Friction Law Ctrl, Wave Eqn Solver. Seismic Wave Propagation (based on AWP-ODC): Velocity Update, Stress Update, Source Injection, Stress Adjustment for Plasticity, Next Timestep.]

SLIDE 45

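Multi-Level Domain Decomposition

(1) MPI decomposition
(2) CG blocking
(3) Athread decomposition
(4) LDM buffering scheme

[Figure: a 3D domain with axes x, y, z, decomposed at the four levels listed above, with an arrow for the compute direction. Legend: finished area, computing area, buffer area, unfinished area. The dimension symbols on the figure (N, C, D, and x with axis subscripts) are partly garbled in extraction.]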

SLIDE 46

Balanced Memory Scheme

1. array fusion
2. halo exchange through register communication
3. optimized blocking configuration guided by an analytical model

[Figure: array fusion, before and after. Before: separate arrays xx, yy, zz, xy, xz, yz and u, v, w, each holding points 1 to n. After: the components of each point are stored together, as (xx, yy, zz, xy, xz, yz) and (u, v, w). Kernels shown: dvelcx, dstrqc.]

[Figure: halo exchange among CPEs (ID: 63, 00, 01, 02). Regions: left boundary, inner part, right boundary. Labels: DMA transfer, register communication.]
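Array fusion, as drawn in the figure, replaces several per-component arrays with one array in which all components of a grid point sit next to each other, so that one contiguous transfer brings in everything a kernel needs for that point. The sketch below shows the layout change on plain Python lists; the function names are hypothetical and the real code operates on Fortran/C arrays moved by DMA.

```python
def fuse(*components: list[float]) -> list[float]:
    """Interleave equal-length component arrays into one flat array."""
    n = len(components[0])
    if any(len(c) != n for c in components):
        raise ValueError("all component arrays must have the same length")
    return [c[i] for i in range(n) for c in components]


def point(fused: list[float], i: int, ncomp: int) -> list[float]:
    """Return the ncomp contiguous components of grid point i (0-based)."""
    return fused[i * ncomp:(i + 1) * ncomp]


u = [1.0, 2.0, 3.0]
v = [10.0, 20.0, 30.0]
w = [100.0, 200.0, 300.0]
velocity = fuse(u, v, w)
print(point(velocity, 1, 3))
```

Before fusion, reading `u`, `v`, and `w` for one block takes three separate strided transfers; after fusion it is a single contiguous slice, which is the access pattern that DMA engines tend to handle best.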

SLIDE 47

On-the-fly Compression

[Figure, panel (b), computation workflow between host main memory and the LDM of a CPE: (1) dma_get with 16b to 32b decompression, (2) general 32b computation, (3) 32b to 16b compression with dma_put.]

[Figure, panel (d), compression algorithms from the IEEE754 32-bit floating point format (labeled sign, exp (8b), frac (24b)) to 16-bit floating point formats:
- IEEE754 32b to 16b FP conversion: sign, exp (5b), frac (10b); variables (vel, ww0, phi, cohes, taxx, ..., taxz)
- sign, exp (0-8b), frac (7-15b), with the field widths derived from a log2 of the exponent range (E_max, E_min); variables (str, r1, r2, ..., r6, sigma2, yldfac)
- sign, frac (15b), with values normalized by (V_max - V_min) to give V_cmpr; variables (d1, lam, mu, qp, qs, vx1, vx2, ww)
The two formulas on the slide are garbled in extraction and are not reproduced here.]
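Two of the 16-bit schemes named on the slide can be illustrated briefly. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses Python's `struct` format `"e"` for the IEEE 754 half-precision conversion, and a plain min/max normalization into a 15-bit integer code as a stand-in for the garbled normalization formula. All function names are hypothetical.

```python
import struct

FRAC_BITS = 15                    # width of the fraction field shown on the slide
SCALE = (1 << FRAC_BITS) - 1      # largest integer code, 32767


def to_half(x: float) -> bytes:
    """Pack a value into 2 bytes of IEEE 754 half precision (5b exp, 10b frac)."""
    return struct.pack("<e", x)


def from_half(raw: bytes) -> float:
    """Unpack 2 bytes of IEEE 754 half precision."""
    return struct.unpack("<e", raw)[0]


def compress_normalized(v: float, vmin: float, vmax: float) -> int:
    """Map v in [vmin, vmax] to an integer code in [0, SCALE] (assumed scheme)."""
    return round((v - vmin) / (vmax - vmin) * SCALE)


def decompress_normalized(code: int, vmin: float, vmax: float) -> float:
    """Invert compress_normalized up to the quantization error."""
    return vmin + code / SCALE * (vmax - vmin)


code = compress_normalized(0.3, -1.0, 1.0)
print(from_half(to_half(0.1)), decompress_normalized(code, -1.0, 1.0))
```

The half-precision route keeps a wide dynamic range with about three decimal digits, whereas the normalized route spends all 15 bits on resolution inside a known value range, which suits fields whose minimum and maximum are known in advance.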

SLIDE 48

Speedup: CPE vs 1 MPE

[Figure: bar chart of speedup (axis 10 to 60) for the variants MPE, PAR, MEM, CMPR. Bar labels: 40.6, 47.8, 45.4, 28.9, 27.6, 22.9, 4.2, 39.3, 12.9, 13.1.]

SLIDE 49

Memory Bandwidth Utilization

[Figure: bar chart of DMA bandwidth (axis 5 to 30) for the variants PAR, MEM, CMPR. Bar labels: 21.2 (62%), 21.2 (62%), 18.5 (54%), 18.5 (54%), 23.8 (70%), 24.8 (73%), 12.4 (36%), 26.9 (79%), 27 (79%), 27 (79%).]

SLIDE 50

Weak Scaling

[Figure: PFLOPS (0.6 to 18) versus number of processes (8K to 160K), with measured and ideal curves for four configurations:
Linear (Peak: 10.7 PFlops, Para. eff. 97.9%)
Non-linear (Peak: 15.2 PFlops, Para. eff. 80.1%)
Linear+Compress (Peak: 14.2 PFlops, Para. eff. 96.5%)
Non-linear+Compress (Peak: 18.9 PFlops, Para. eff. 79.5%)]

SLIDE 51

Strong Scaling

[Figure: four panels of speedup (1 to 22) versus number of processes (8K to 160K), each with curves Ideal, dx=100m, dx=50m, dx=16m. Annotated percentages per panel:
Linear: 79.9%, 73.6%, 63.6%
Non-Linear: 75.5%, 75.6%, 53.3%
Linear+Compress: 75.8%, 72.4%, 51.2%
Non-Linear+Compress: 67.5%, 67.2%, 51.7%]

SLIDE 52

Simulation Results: 200 m vs 16 m

[Figure: panels (a) and (b) show velocity time series from 0 s to 100 s at Ninghe and Cangzhou, for the 200 m and the 16 m runs respectively. Panels (c) and (d) show maps between 38˚N and 40˚N with velocity in m/s (color scale −0.2 to 0.2); places marked: Beijing, Tangshan, Shunyi, Tianjin, Ninghe, Luanxian, Cangzhou, Bohai, Luannan, Wuqing.]

SLIDE 53

Outline

Sunway Machine: the Challenges and Opportunities
Scientific Computing with 10 Million Cores
Long Term Plan for Sunway TaihuLight

SLIDE 54

Long Term Plan

Traditional HPC Applications (Science -> Service)
Deep Learning Related Applications
Sunway Micro

SLIDE 55

Long Term Plan

“15-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of Realistic 10 Hz Scenarios”, Gordon Bell Prize Finalist, SC 2017.

SLIDE 57

Long Term Plan

Training AlexNet with swCaffe

[Figure: average time per iteration (axis 10 to 80) for Intel (24 cores), swBLAS (1 CG), and swDNN (1 CG), in three groups: total, convolution, fully connected. Relative labels on the bars: 1.0x, 2.2x, 3.5x; 1.0x, 2.7x, 4.5x; 1.0x, 7.1x, 8.5x.]

SLIDE 59

Acknowledgements

- MOST China: major sponsors of the HPC hardware and software development
- NRCPC: vendor of the machine
- NCAR: Rich Loft, John Dennis, Allison Baker, Haiying Xu: support and advice on the CAM-SE work
- SCEC: Yifeng Cui, Steve Day, Daniel Roten, Kim Olsen, Josh Tobin, Alex Breuer, and Dawei Mu: discussion and advice on the earthquake simulation work

SLIDE 60