Sunway TaihuLight: Designing and Tuning Scientific Applications at the Scale of 10 Million Cores



slide-1
SLIDE 1

Sunway TaihuLight: Designing and Tuning Scientific Applications at the Scale of 10 Million Cores

Haohuan Fu
National Supercomputing Center in Wuxi
Department of Earth System Science, Tsinghua University
September 2017 @ICAS

slide-2
SLIDE 2

Outline

  • Sunway Machine: the Challenges and Opportunities
  • Scientific Computing with 10 Million Cores
  • Long Term Plan for Sunway TaihuLight

slide-3
SLIDE 3

Sunway-I:

  • CMA service, 1998
  • commercial chip
  • 0.384 Tflops
  • 48th of TOP500

Sunway BlueLight:

  • NSCC-Jinan, 2011
  • 16-core processor
  • 1 Pflops
  • 14th of TOP500

Sunway TaihuLight:

  • NSCC-Wuxi, 2016
  • 260-core processor
  • 125 Pflops
  • 1st of TOP500

The Sunway Machine Family

slide-4
SLIDE 4

SW26010: Sunway 260-Core Processor

[Figure: processor block diagram. Four core groups (Core Group 0 to 3) are linked by a network on chip (NoC); each core group has an MPE, an 8*8 CPE mesh, a PPU, and an iMC attached to memory. Other label: Data Transfer Network. Inset of the 8*8 CPE mesh: each computing core has registers and an LDM, with a row communication bus, a column communication bus, a control network, and a transfer agent (TA). Levels shown: memory level, LDM level, register level, computing level.]

slide-5
SLIDE 5

High-Density Integration of the Computing System

A Five-Level Integration Hierarchy

  • computing node
  • computing board
  • super node
  • cabinet
  • entire computing system

slide-6
SLIDE 6


slide-7
SLIDE 7


slide-8
SLIDE 8


slide-9
SLIDE 9


slide-10
SLIDE 10

How to Connect the 10 Million Cores?

slide-11
SLIDE 11

How to Connect the 10 Million Cores?

  • 2D core array with row and column buses

slide-12
SLIDE 12

How to Connect the 10 Million Cores?

  • 2D core array with row and column buses
  • Network on Chip

slide-13
SLIDE 13

How to Connect the 10 Million Cores?

  • 2D core array with row and column buses
  • Network on Chip
  • Customized Network Board to Fully Connect 256 Nodes

slide-14
SLIDE 14

How to Connect the 10 Million Cores?

  • 2D core array with row and column buses
  • Network on Chip
  • Customized Network Board to Fully Connect 256 Nodes
  • Sunway Net

slide-15
SLIDE 15

Tweet Comments from Prof. Satoshi Matsuoka

slide-16
SLIDE 16


slide-17
SLIDE 17


slide-18
SLIDE 18


slide-19
SLIDE 19

Tweet Comments from Prof. Satoshi Matsuoka

Sunway Micro

slide-20
SLIDE 20

Outline

  • Sunway Machine: the Challenges and Opportunities
  • Scientific Computing with 10 Million Cores
  • Long Term Plan for Sunway TaihuLight

slide-21
SLIDE 21

Machine Capability Comparison

[Figure: two charts comparing TaihuLight, Tianhe-2, Titan, and K Computer on a relative scale from 0.5 to 3. First chart: Peak Performance, Memory Size, Gflops/Watt, Tflops/m^3, memory bandwidth, communication bandwidth. Second chart: Linpack, Graph, HPCG, hpgmg.]

slide-22
SLIDE 22

Sunway TaihuLight

125 Pflops

32 GB and 136GB/s per node 22 flops/byte

10 million cores

MPE + CPE

user-controlled 64 KB LDM

register communication among CPEs

Major Features to Consider

slide-23
SLIDE 23

Sunway TaihuLight

125 Pflops

32 GB and 136GB/s per node 22 flops/byte

10 million cores

MPE + CPE

user-controlled 64 KB LDM

register communication among CPEs

Major Features to Consider

Intel KNL 7250 of Cori: 6.5 flops/byte NVIDIA P100 of Piz Daint: 7.2 flops/byte
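The 22 flops/byte figure can be checked against the other numbers on the slide. The sketch below is only a back-of-the-envelope calculation: it assumes one 260-core processor per node and uses the exact core count 10,649,600 (the "10 million cores" of the slide); the function name is illustrative.

```python
# Rough check of the flops-per-byte ratio quoted on the slide.
PEAK_FLOPS = 125e15          # 125 Pflops peak, from the slide
TOTAL_CORES = 10_649_600     # the "10 million cores" (assumed exact count)
CORES_PER_NODE = 260         # assumption: one 260-core processor per node
NODE_BANDWIDTH = 136e9       # 136 GB/s per node, from the slide


def flops_per_byte(peak_flops, nodes, bandwidth_per_node):
    """Peak flops available per byte of memory traffic on one node."""
    return peak_flops / nodes / bandwidth_per_node


nodes = TOTAL_CORES // CORES_PER_NODE
ratio = flops_per_byte(PEAK_FLOPS, nodes, NODE_BANDWIDTH)
print(f"{nodes} nodes, {ratio:.1f} flops/byte")
```

A higher ratio means more arithmetic must be done per byte fetched before the memory system stops being the bottleneck, which is why the slide contrasts it with the lower ratios of KNL and P100.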

slide-24
SLIDE 24

Sunway TaihuLight

125 Pflops

32 GB and 136GB/s per node 22 flops/byte

10 million cores

MPE + CPE

user-controlled 64 KB LDM

register communication among CPEs

Major Challenge 1: Scaling

slide-25
SLIDE 25

Sunway TaihuLight

125 Pflops

32 GB and 136GB/s per node 22 flops/byte

10 million cores

MPE + CPE

user-controlled 64 KB LDM

register communication among CPEs

Major Challenge 2: Memory Wall

slide-26
SLIDE 26

Sunway TaihuLight

125 Pflops

32 GB and 136GB/s per node 22 flops/byte

10 million cores

MPE + CPE

user-controlled 64 KB LDM

register communication among CPEs

Major Challenge 2: Memory Wall

Refactoring and Redesigning

slide-27
SLIDE 27

Incomplete List of Full-Scale Applications

2016

  • Fully Implicit Solver for Atmospheric Dynamics
  • Surface Wave Modeling
  • Phase Field Simulations of Coarsening Dynamics
  • Atomistic Simulation of Silicon Nanowires
  • Run-away Electron Trajectory Simulation
  • Genome Functional Annotation and Homeotic Gene Building
  • Spacecraft CFD Numerical Simulation

2017

  • Extreme-scale Graph Processing Framework
  • Simulation of Planetary Rings
  • Simulations of Quantum Spin Liquid States via PEPS++
  • Molecular Dynamics Simulation of Condensed Covalent Materials
  • cryo-EM Macromolecule Structure Determination
  • Redesigning CAM-SE
  • Nonlinear Earthquake Simulation

slide-28
SLIDE 28

Incomplete List of Full-Scale Applications

2016 Gordon Bell Finalists

  • Fully Implicit Solver for Atmospheric Dynamics
  • Surface Wave Modeling
  • Phase Field Simulations of Coarsening Dynamics
  • Atomistic Simulation of Silicon Nanowires
  • Run-away Electron Trajectory Simulation
  • Genome Functional Annotation and Homeotic Gene Building
  • Spacecraft CFD Numerical Simulation

2017

  • Extreme-scale Graph Processing Framework
  • Simulation of Planetary Rings
  • Simulations of Quantum Spin Liquid States via PEPS++
  • Molecular Dynamics Simulation of Condensed Covalent Materials
  • cryo-EM Macromolecule Structure Determination
  • Redesigning CAM-SE
  • Nonlinear Earthquake Simulation

slide-29
SLIDE 29

Incomplete List of Full-Scale Applications

2016 Gordon Bell Prize

  • Fully Implicit Solver for Atmospheric Dynamics
  • Surface Wave Modeling
  • Phase Field Simulations of Coarsening Dynamics
  • Atomistic Simulation of Silicon Nanowires
  • Run-away Electron Trajectory Simulation
  • Genome Functional Annotation and Homeotic Gene Building
  • Spacecraft CFD Numerical Simulation

2017

  • Extreme-scale Graph Processing Framework
  • Simulation of Planetary Rings
  • Simulations of Quantum Spin Liquid States via PEPS++
  • Molecular Dynamics Simulation of Condensed Covalent Materials
  • cryo-EM Macromolecule Structure Determination
  • Redesigning CAM-SE
  • Nonlinear Earthquake Simulation

slide-30
SLIDE 30

Incomplete List of Full-Scale Applications

2016 Gordon Bell Prize

  • Fully Implicit Solver for Atmospheric Dynamics
  • Surface Wave Modeling
  • Phase Field Simulations of Coarsening Dynamics
  • Atomistic Simulation of Silicon Nanowires
  • Run-away Electron Trajectory Simulation
  • Genome Functional Annotation and Homeotic Gene Building
  • Spacecraft CFD Numerical Simulation

2017 Gordon Bell Finalists

  • Extreme-scale Graph Processing Framework
  • Simulation of Planetary Rings
  • Simulations of Quantum Spin Liquid States via PEPS++
  • Molecular Dynamics Simulation of Condensed Covalent Materials
  • cryo-EM Macromolecule Structure Determination
  • Redesigning CAM-SE
  • Nonlinear Earthquake Simulation

slide-31
SLIDE 31

racks chips core-groups cores total number of cores

163,840 processes 65 threads

DD-MG K-cycle Very shallow Uniform DD Plug & Play

Now let’s find a way to design a subdomain solver.

Application I: Implicit Solver for Atmospheric Dynamics
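The "163,840 processes, 65 threads" figures on this slide account for the full machine. As a small worked check, assuming each process occupies one core group of 1 MPE plus 64 CPEs (the core-group composition stated elsewhere in the deck), with illustrative variable names:

```python
# Worked check: processes x threads per process = total cores.
MPES_PER_CG = 1      # management processing element per core group
CPES_PER_CG = 64     # computing processing elements (8 x 8 mesh)

processes = 163_840                              # from the slide
threads_per_process = MPES_PER_CG + CPES_PER_CG  # 65, as on the slide
total_cores = processes * threads_per_process

print(f"{processes:,} processes x {threads_per_process} threads = {total_cores:,} cores")
```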

slide-32
SLIDE 32

racks chips core-groups cores total number of cores

163,840 processes 65 threads

Geometry-based pipelined ILU (GP-ILU)

[Figure: subdomain with X, Y, Z axes mapped onto 8×8 CPE meshes. Labels: two-level pipeline, blk_height, synchronization avoiding.]

[Constraint on blk_height in terms of dim_z, num_cores, cell_size, and reg_size; the inequality is not recoverable from the extracted text.]

  • DD-MG K-cycle

Subdomain matrix of 1st-order with geometric index

Our goal of design:

  • 1. Single sweep
  • 2. Synchronization-free
  • 3. Improved data-locality
slide-33
SLIDE 33

The … res run: … SYPD with 10.6M cores, dt=240s, I/O penalty <5%

Strong-Scaling Results

[Figure: parallel efficiency (0% to 100%) versus total number of cores (1M to 11M). Labeled values: 33% (GB’15), 67%, 45%.]

slide-34
SLIDE 34

Weak-Scaling Results

[Figure: SYPD (0.00125 to 0.16) versus total number of cores (0.33 M, 0.67 M, 1.33 M, 2.66 M, 5.32 M, 10.64 M) for the implicit and explicit solvers, with resolution (km) labels 2.480, 1.389, 0.920, 0.620, 0.488. Speedup labels: 34X, 89.5X. Performance labels: 7.95 DP-PF, 23.66 DP-PF, DOFs=772B, “Exa-scale” for exp.]

The 488-m res run: 0.07 SYPD, 10.6M cores, dt=240s, 89.5X speedup over explicit

slide-35
SLIDE 35

Application II: Porting CESM and Redesigning CAM-SE for Sunway TaihuLight

CAM5.0, CPL7, CESM1.2.0
Tsinghua + BNU, 30+ Professors and Students

  • Four component models, millions of lines of code
  • Large-scale run on Sunway TaihuLight
  • 24,000 MPI processes
  • Over one million cores
  • 10-20x speedup for kernels
  • 2-3x speedup for the entire model

“Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer”, in Proceedings of SC 2016.

slide-36
SLIDE 36


slide-37
SLIDE 37

Major Challenges

  • a high complexity in application, and a heavy legacy in the code base (millions of lines of code)
  • an extremely complicated MPMD program with no hotspots (or hundreds of hotspots)
  • misfit between the in-place design philosophy and the new architecture
  • lack of people with interdisciplinary knowledge and experience

slide-38
SLIDE 38

OpenACC-based Refactoring of CAM

CAM workflow: CAM initial, Dyn_run, Phy_run1, Phy_run2

  • Pass state variables
  • Pass state variables and tracers
  • Pass tracers (u, v) to dynamics

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
          Qtens(k,q,ie) = …
        end do
      end do
    end do

    Euler_step:
    do ie = nets, nete
      compute Q min/max values for lim8
      compute Biharmonic mixing term f
    end do
    do ie = nets, nete
      2D advection step
      data packing
    end do
    Boundary exchange
    Data extracting

    do ie = nets, nete
      do k = 1, nlev
        dp(k) = func_1()
        do q = 1, qsize
          Qtens(k,q,ie) = func_2(dp(k))
        end do
      end do
    end do
    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
        end do
      end do
    end do
    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          Qtens(k,q,ie) = …
        end do
      end do
    end do
    Data packing

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          ….
        end do
      end do
    end do
    do ie = nets, nete
      do k = 1, nlev
        dp0 = func_3()
        dpdiss = func_4()
        do q = 1, qsize
          Qtens(k,q,ie) = func_5(dp0, dpdiss)
        end do
      end do
    end do
    do ie = nets, nete
      do k = 1, nlev
        dp(k) = func_5()
        Vstar(k) = func_6()
      end do
      do q = 1, qsize
        do k = 1, nlev
          Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
        end do
        do k = 1, nlev
          dp_star(k) = func_8(dp(k))
        end do
        do k = 1, nlev
          Qtens(k,q,ie) = func_9(dp_star(k))
        end do
      end do
      Data packing
    end do
Optimized:

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          Qtens(k,q,ie) = func_2(func_1())
        end do
      end do
    end do
    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
        end do
      end do
    end do
    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          ….
        end do
      end do
    end do
    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          Qtens(k,q,ie) = func_5(func_3(),func_4())
        end do
      end do
    end do
    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          Qtens(k,q,ie) = func_7(func_5(),func_6())
        end do
        do k = 1, nlev
          Qtens(k,q,ie) = func_9(func_8(func_5()))
        end do
      end do
      Data packing
    end do

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
          Qtens(k,q,ie) = …
        end do
      end do
    end do
    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          Qtens(k,q,ie) = …
        end do
      end do
    end do
    Data packing

    !$ACC PARALLEL LOOP
    do ie_q = 1, qsize*(nete-nets+1)
      do k = 1, nlev
        q = func(ie_q)
        ie = func(ie_q)
        qmin(k,q,ie) = …
        qmax(k,q,ie) = …
        Qtens(k,q,ie) = …
      end do
    end do
    !$ACC PARALLEL LOOP
    do ie_q = 1, qsize*(nete-nets+1)
      do k = 1, nlev
        q = func(ie_q)
        ie = func(ie_q)
        Qtens(k,q,ie) = …
      end do
    end do
    !$ACC PARALLEL LOOP
    Data packing

  • manual transformation of loops
  • manual OpenACC parallelization and optimization on code and data structures

    do begin_chunk to end_chunk
      tphysbc() {
        convect_deep_tend(6.47%)
        convect_shallow_tend(15.57%)
        macrop_driver_tend(8.38%)
        microp_aero_run(4.29%)
        microp_driver_tend(7.13%)
        aerosol_wet_intr(4.29%)
        convect_deep_tend_2(0.51%)
        radiation_tend(54.07%)
      }
    enddo

    tphysbc() {
      do begin_chunk to end_chunk
        convect_deep_tend(6.47%)
        convect_shallow_tend(15.57%)
        macrop_driver_tend(8.38%)
        microp_aero_run(4.29%)
        microp_driver_tend(7.13%)
        aerosol_wet_intr(4.29%)
        convect_deep_tend_2(0.51%)
        radiation_tend(54.07%)
      enddo
    }

    tphysbc() {
      do begin_chunk to end_chunk
        convect_deep_tend(6.47%)
      enddo
      ……
      do begin_chunk to end_chunk
        microp_driver_tend(7.13%)
      enddo
      ……
      do begin_chunk to end_chunk
        radiation_tend(54.07%)
      enddo
    }

    do begin_chunk to end_chunk
      convect_deep_tend(6.47%) {
        zm_conv_tend(6.47%) {
          zm_convr(2.03%)
          zm_conv_evap()
          montran()
          convtranc(0.06%)
        }
      }
    enddo

    convect_deep_tend(6.47%) {
      zm_conv_tend(6.47%) {
        do begin_chunk to end_chunk
          zm_convr(2.03%)
        enddo
        do begin_chunk to end_chunk
          zm_conv_evap()
        enddo
        do begin_chunk to end_chunk
          montran()
        enddo
        do begin_chunk to end_chunk
          convtranc(0.06%)
        enddo
      }
    }

  • tool based transformation of loops
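The collapsed loops on this slide recover q and ie from a single fused index ie_q through an unspecified func(ie_q). A minimal sketch of one possible mapping follows; split_index is a hypothetical stand-in for func, and the row-major ordering (q varying fastest) is an assumption, not something the slide states.

```python
def split_index(ie_q, qsize, nets):
    """Map a 1-based fused index back to (ie, q).

    Hypothetical stand-in for the slide's func(ie_q); assumes q varies fastest.
    """
    ie_offset, q_offset = divmod(ie_q - 1, qsize)
    return nets + ie_offset, q_offset + 1


def fused_pairs(nets, nete, qsize):
    """Iterate the single collapsed loop over all (ie, q) pairs."""
    count = qsize * (nete - nets + 1)  # inclusive element range nets..nete
    return [split_index(ie_q, qsize, nets) for ie_q in range(1, count + 1)]


def nested_pairs(nets, nete, qsize):
    """Reference: the original two-level loop nest."""
    return [(ie, q) for ie in range(nets, nete + 1) for q in range(1, qsize + 1)]


# The fused loop visits exactly the same (ie, q) pairs as the nested one.
print(fused_pairs(3, 5, 4) == nested_pairs(3, 5, 4))
```

Fusing the two outer loops gives the parallel loop a single, longer iteration space, which is easier to spread evenly over 64 CPEs than a short element loop.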
slide-39
SLIDE 39

CAM model: scalability and speedup

[Figure: simulation speed in Model Years Per Day (MYPD), axis 0.5 to 3, versus number of CGs (each CG includes 1 MPE and 64 CPEs): 1024, 2400, 4096, 5120, 7350, 9600, 12000, 24000. Three versions: MPE only; MPE+CPE for dynamic core; MPE+CPE for both dynamic core and physics schemes. Data labels: 0.04, 0.15, 0.24, 0.25, 0.6, 0.78, 0.87, 1.54, 1.2, 1.62, 1.75, 2.81.]

  • million-core scale, 2.81 SYPD
  • many-core refactoring for the entire model
  • competitive simulation speed to the same model on NCAR Yellowstone

slide-40
SLIDE 40

Athread-based Fine-grained Redesign

  • Step 1: rewrite of Fortran OpenACC code to Athread C code
      • finer memory control through a specific DMA scheme
      • more efficient vectorization
  • Step 2: register-communication-based redesign
      • remove data dependency
      • expose more parallelism

[Figure: three-stage accumulation over the CPE mesh (C0,0 to C7,7), covering elek to elek+7. Stage 1: each CPE Ci,j forms running sums of its own 16 values: a16*i, a16*i+a16*i+1, ..., a16*i+...+a16*i+15. Stage 2: block totals (p15, p31, ..., p111, p127) are combined between CPEs, for example C0,0 holds a0+...+a15, C1,0 holds a16+...+a31, Ck,0 holds ak+...+ak+15. Stage 3: each CPE Ci,j combines the incoming total p16*i with its local running sums.]
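The three stages in the figure read like a blocked prefix sum. The sketch below is a sequential Python illustration under that reading, with 16 values per block as in the figure; on the real machine each block would live on one CPE and stage 2 would use register communication, which is not modeled here. The function name is illustrative.

```python
from itertools import accumulate


def three_stage_scan(values, block=16):
    """Inclusive prefix sum computed block by block, in three stages."""
    # Stage 1: every block (one CPE in the figure) scans its own values.
    blocks = [values[i:i + block] for i in range(0, len(values), block)]
    local = [list(accumulate(b)) for b in blocks]

    # Stage 2: combine the block totals so each block learns the sum of
    # everything before it (an exclusive scan over the totals).
    totals = [b[-1] for b in local]
    offsets = [0] + list(accumulate(totals))[:-1]

    # Stage 3: every block adds its incoming offset to its local sums.
    return [x + off for b, off in zip(local, offsets) for x in b]


data = list(range(1, 129))  # 8 blocks of 16 values
print(three_stage_scan(data) == list(accumulate(data)))
```

Only stage 2 involves communication between blocks, and it moves one number per block, which is what makes the scheme attractive when neighbor-to-neighbor transfers are cheap.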

slide-41
SLIDE 41

Performance Improvement through Redesign

1 Sunway CG (64 CPEs) could be equivalent to:

  • 0.1x Intel Core
  • or 1.8x Intel Core
  • or 7.2x Intel Core
  • or, in certain cases, 43.1x Intel Core

slide-42
SLIDE 42

Scaling the Dynamic Core to Millions of Cores

[Figure: PFlops (0.01 to 5) versus number of processes (512, 2048, 8192, 32768, 131072) for 48, 192, 768, and 650 elements in each process. Labels: 3.3 PFlops, Para.eff 98.55%; 2.4 PFlops, Para.eff 92.2%; 1.76 PFlops, Para.eff 88.3%; 2.72 PFlops, Para.eff 92.9%.]

slide-43
SLIDE 43

Simulation of Hurricane Katrina

slide-44
SLIDE 44

Application III: Nonlinear Earthquake Simulation on Sunway TaihuLight

  • Dynamic rupture source generator, originated from CG-FDM
  • Seismic wave propagation, originated from AWP-ODC
  • Other utilities:
      • source partitioner
      • 3D model interpolator
      • restart controller

Source Partitioner Restart Controller 3D Model Interpolator LZ4 Compression, Group I/O, Balanced I/O Forwarding Snapshot/Seismo Recorder

Velocity Update Stress Update Source Injection

Stress Adjustment For Plasticity

Dynamic Rupture Source Generator (Based on CG-FDM) Seismic Wave Propagation (Based on AWP-ODC)

Next Timestep

3D Vel/Den Model Fault Stress Init Friction Law Ctrl Wave Eqn Solver

slide-45
SLIDE 45

Multi-Level Domain Decomposition

[Figure: four levels of decomposition along the x, y, z axes, with the compute direction marked: (1) MPI decomposition, (2) CG blocking, (3) Athread decomposition, (4) LDM buffering scheme. Legend: finished area, computing area, buffer area, unfinished area.]

slide-46
SLIDE 46

Balanced Memory Scheme

  • 1. array fusion
  • 2. halo exchange through register communication
  • 3. optimized blocking configuration guided by analytical model

[Figure: fusing arrays, before and after, for dvelcx and dstrqc. Before: separate arrays each hold one field for points 1 to n (u, v, w and xx, yy, zz, xy, xz, yz). After: the fields of each point are stored together (u, v, w of point 1, then of point 2, and so on; likewise xx, yy, zz, xy, xz, yz). A second panel shows CPEs (ID: 00, 01, 02, ..., 63) with the labels Left boundary, Inner part, Right boundary, DMA transfer, and Register communication.]
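Array fusion, the first item on this slide, can be illustrated with plain lists. This is a sketch of the layout change only (separate per-field arrays interleaved so that all fields of one grid point are contiguous); the function names are illustrative and no DMA is modeled.

```python
def fuse_arrays(*fields):
    """Interleave equal-length field arrays point by point.

    [u1, u2, ...], [v1, v2, ...], [w1, w2, ...] -> [u1, v1, w1, u2, v2, w2, ...]
    """
    n = len(fields[0])
    if any(len(f) != n for f in fields):
        raise ValueError("all fields must have the same length")
    return [value for point in zip(*fields) for value in point]


def point_values(fused, num_fields, i):
    """Read all fields of grid point i (0-based) as one contiguous slice."""
    start = i * num_fields
    return fused[start:start + num_fields]


u = [1, 2, 3]
v = [10, 20, 30]
w = [100, 200, 300]
fused = fuse_arrays(u, v, w)
print(fused)                      # interleaved layout
print(point_values(fused, 3, 1))  # all three fields of the second point
```

With the fused layout, one contiguous transfer brings in every field a grid point needs, instead of one small transfer per field.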

slide-47
SLIDE 47

On-the-fly Compression

(b) Computation workflow

Main memory (host memory) → dma_get → LDM (CPE): 16b to 32b decompression, general 32b computation, 32b to 16b compression → dma_put → main memory

(d) Compression algorithms

Source: IEEE 754 32-bit floating point format: sign, exp (8b), frac (23b)

16-bit floating point formats:

  • IEEE 754 32b to 16b FP conversion: sign, exp (5b), frac (10b). Variables listed: (vel,ww0,phi,cohes,taxx, ...,taxz)
  • sign, exp (0-8b), frac (7-15b), with the exponent width derived from a log2 of the exponent range (E_max, E_min). Variables listed: (str, r1,r2, ...,r6,sigma2,yldfac)
  • sign, frac (15b), with values normalized using V_min and V_max. Variables listed: (d1,lam,mu,qp,qs,vx1,vx2,ww)
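Two of the 16-bit schemes named on this slide can be tried out in a few lines. The sketch below is illustrative only: the half-precision round trip uses Python's `struct` format `'e'` (IEEE 754 binary16), and the normalized scheme is a simplified stand-in that stores an unsigned 15-bit code per value rather than the slide's exact bit layout. All function names are assumptions.

```python
import struct


def to_half_and_back(x):
    """Round-trip x through IEEE 754 binary16 (sign, 5-bit exp, 10-bit frac)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


def normalize_to_u15(values):
    """Quantize values to 15-bit codes after scaling into [0, 1] by min/max."""
    v_min, v_max = min(values), max(values)
    span = (v_max - v_min) or 1.0      # avoid division by zero for flat data
    scale = (1 << 15) - 1
    codes = [round((v - v_min) / span * scale) for v in values]
    return codes, v_min, v_max


def restore_from_u15(codes, v_min, v_max):
    """Invert normalize_to_u15 up to quantization error."""
    scale = (1 << 15) - 1
    return [v_min + c / scale * (v_max - v_min) for c in codes]


print(to_half_and_back(0.1))           # close to 0.1, not exact
codes, lo, hi = normalize_to_u15([0.0, 1.0, 2.0])
print(restore_from_u15(codes, lo, hi)) # close to the original values
```

The half-precision format spends bits on an exponent and so keeps relative accuracy over a wide range, while the normalized format spends all 15 bits on resolution inside a known [min, max] interval, which suits fields whose range is fixed in advance.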

slide-48
SLIDE 48

Speedup: CPE vs 1 MPE

[Figure: bar chart of speedup (axis 10 to 60) for the MPE, PAR, MEM, and CMPR versions. Data labels: 40.6, 47.8, 45.4, 28.9, 27.6, 22.9, 4.2, 39.3, 12.9, 13.1.]

slide-49
SLIDE 49

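Memory Bandwidth Utilization

[Figure: bar chart of DMA bandwidth (axis 5 to 30) for the PAR, MEM, and CMPR versions. Data labels (bandwidth, utilization): 21.2 (62%), 21.2 (62%), 18.5 (54%), 18.5 (54%), 23.8 (70%), 24.8 (73%), 12.4 (36%), 26.9 (79%), 27 (79%), 27 (79%).]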

slide-50
SLIDE 50

Weak Scaling

[Figure: PFLOPS (0.6 to 18) versus number of processes (8K, 12K, 16K, 24K, 32K, 40K, 48K, 64K, 80K, 96K, 120K, 160K), with an ideal curve and a measured curve for each version.]

  • Linear (Peak: 10.7 PFlops, Para. eff. 97.9%)
  • Non-linear (Peak: 15.2 PFlops, Para. eff. 80.1%)
  • Linear+Compress (Peak: 14.2 PFlops, Para. eff. 96.5%)
  • Non-linear+Compress (Peak: 18.9 PFlops, Para. eff. 79.5%)
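The "Para. eff." percentages compare measured performance with the ideal curve. A minimal sketch of that calculation for weak scaling follows; the function name and the numbers in the example are hypothetical and are not taken from the chart.

```python
def weak_scaling_efficiency(base_procs, base_pflops, procs, pflops):
    """Measured performance divided by the ideal, linearly scaled performance."""
    ideal = base_pflops * procs / base_procs
    return pflops / ideal


# Hypothetical example: 1.0 PFlops on 8,000 processes would ideally become
# 20 PFlops on 160,000 processes; measuring 16 PFlops gives 80% efficiency.
eff = weak_scaling_efficiency(8_000, 1.0, 160_000, 16.0)
print(f"{eff:.1%}")
```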

slide-51
SLIDE 51

Strong Scaling

[Figure: four panels of speedup (1 to 22) versus number of processes (8K to 160K), each with an ideal curve and curves for dx=100m, dx=50m, and dx=16m. Efficiency labels per panel, in extracted order: Linear: 79.9%, 73.6%, 63.6%. Non-Linear: 75.5%, 75.6%, 53.3%. Linear+Compress: 75.8%, 72.4%, 51.2%. Non-Linear+Compress: 67.5%, 67.2%, 51.7%.]

slide-52
SLIDE 52

Simulation Results: 200m vs 16m

[Figure: panels (a) and (b) show velocity time series from 0s to 100s at Ninghe and Cangzhou for the 200m and 16m runs. Panels (c) and (d) show maps between 38˚N and 40˚N with velocity (m/s) from −0.2 to 0.2, marking Beijing, Tangshan, Shunyi, Tianjin, Ninghe, Luanxian, Cangzhou, Bohai, Luannan, and Wuqing.]

slide-53
SLIDE 53

Outline

  • Sunway Machine: the Challenges and Opportunities
  • Scientific Computing with 10 Million Cores
  • Long Term Plan for Sunway TaihuLight

slide-54
SLIDE 54

Long Term Plan

  • Traditional HPC Applications (Science -> Service)
  • Deep Learning Related Applications
  • Sunway Micro

slide-55
SLIDE 55

Long Term Plan

  • Traditional HPC Applications (Science -> Service)
  • Deep Learning Related Applications
  • Sunway Micro

“15-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of Realistic 10 Hz Scenarios”, Gordon Bell Prize Finalist, SC 2017.

slide-56
SLIDE 56

Long Term Plan

  • Traditional HPC Applications (Science -> Service)
  • Deep Learning Related Applications
  • Sunway Micro

slide-57
SLIDE 57

Long Term Plan

  • Traditional HPC Applications (Science -> Service)
  • Deep Learning Related Applications
  • Sunway Micro

Training AlexNet with swCaffe

[Figure: average time per iteration (axis 10 to 80) for Intel (24 cores), swBLAS (1CG), and swDNN (1CG), grouped as total, convolution, and fully connected. Speedup labels: 1.0x, 2.2x, 3.5x; 1.0x, 2.7x, 4.5x; 1.0x, 7.1x, 8.5x.]

slide-58
SLIDE 58

Long Term Plan

  • Traditional HPC Applications (Science -> Service)
  • Deep Learning Related Applications
  • Sunway Micro

slide-59
SLIDE 59

Acknowledgements

  • MOST China: major sponsors of the HPC hardware and software development
  • NRCPC: vendor of the machine
  • NCAR: Rich Loft, John Dennis, Allison Baker, Haiying Xu: support and advice on the CAM-SE work
  • SCEC: Yifeng Cui, Steve Day, Daniel Roten, Kim Olsen, Josh Tobin, Alex Breuer, and Dawei Mu: discussion and advice on the earthquake simulation work

slide-60
SLIDE 60