Sunway TaihuLight: Designing and Tuning Scientific Applications at the Scale of 10 Million Cores
Haohuan Fu
National Supercomputing Center in Wuxi
Department of Earth System Science, Tsinghua University
September
Sunway-I:
Sunway BlueLight:
Sunway TaihuLight:
[Figure: processor architecture. Four core groups (Core Group 0 to Core Group 3), each with an MPE, an 8*8 CPE mesh, a PPU, an iMC and memory, connected by a network on chip (NoC). Inside the 8*8 CPE mesh: computing core, registers, LDM, row communication bus, column communication bus, control network, data transfer network, transfer agent (TA). Levels shown: memory level, LDM level, register level, computing level.]
- A Five-Level Integration Hierarchy
  - computing node
  - computing board
  - super node
  - cabinet
  - entire computing system
- 2D core array with row and column buses
- Network on Chip
- Customized Network Board to Fully Connect 256 Nodes
- Sunway Net
[Figure: comparison of TaihuLight, Tianhe-2, Titan and K Computer (scale 0.5 to 3) on peak performance, memory size, Gflops/Watt, Tflops/m^3, memory bandwidth and communication bandwidth.]
[Figure: comparison of TaihuLight, Tianhe-2, Titan and K Computer (scale 0.5 to 3) on Linpack, Graph, HPCG and hpgmg.]
32 GB of memory and 136 GB/s of memory bandwidth per node: 22 flops/byte
MPE + CPE
register communication among CPEs
Intel KNL 7250 of Cori: 6.5 flops/byte
NVIDIA P100 of Piz Daint: 7.2 flops/byte
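The flops/byte figures above are machine-balance ratios: peak floating-point rate divided by memory bandwidth. The sketch below is a minimal roofline-style estimate built only from the two TaihuLight numbers quoted here (136 GB/s and 22 flops/byte). The implied peak is derived from those two values rather than quoted from the slides, and the function name and the example kernel intensity are hypothetical.

```python
def attainable_gflops(intensity_flops_per_byte: float,
                      peak_gflops: float,
                      bandwidth_gb_per_s: float) -> float:
    """Roofline bound: a kernel is limited by either compute or memory traffic."""
    return min(peak_gflops, intensity_flops_per_byte * bandwidth_gb_per_s)


BANDWIDTH_GB_PER_S = 136.0      # per node, from the slide
BALANCE_FLOPS_PER_BYTE = 22.0   # per node, from the slide

# Peak implied by the two numbers above (derived, not quoted).
IMPLIED_PEAK_GFLOPS = BALANCE_FLOPS_PER_BYTE * BANDWIDTH_GB_PER_S

# Hypothetical stencil-like kernel doing 1 flop per byte moved.
bound = attainable_gflops(1.0, IMPLIED_PEAK_GFLOPS, BANDWIDTH_GB_PER_S)
print(f"implied peak: {IMPLIED_PEAK_GFLOPS:.0f} Gflops per node")
print(f"bound at 1 flop/byte: {bound:.0f} Gflops "
      f"({100 * bound / IMPLIED_PEAK_GFLOPS:.1f}% of peak)")
```

Under these assumptions, any kernel below 22 flops/byte is bandwidth-bound on this node, which is why the later slides concentrate on data reuse in the LDM and on register communication.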
Refactoring and Redesigning
- Fully Implicit Solver for Atmospheric Dynamics
- Surface Wave Modeling
- Phase Field Simulations of Coarsening Dynamics
- Atomistic Simulation of Silicon Nanowires
- Run-away Electron Trajectory Simulation
- Genome Functional Annotation and Homeotic Gene Building
- Spacecraft CFD Numerical Simulation
- Extreme-scale Graph Processing Framework
- Simulation of Planetary Rings
- Simulations of Quantum Spin Liquid States via PEPS++
- Molecular Dynamics Simulation of Condensed Covalent Materials
- cryo-EM Macromolecule Structure Determination
- Redesigning CAM-SE
- Nonlinear Earthquake Simulation
racks chips core-groups cores total number of cores
163,840 processes 65 threads
DD-MG K-cycle Very shallow Uniform DD Plug & Play
Now let’s find a way to design a subdomain solver.
Application I: Implicit Solver for Atmospheric Dynamics
Geometry-based pipelined ILU (GP-ILU)
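[Figure: GP-ILU mapped onto 8×8 CPE meshes, with X, Y and Z axes: a two-level pipeline with block height blk_height and synchronization avoiding; the subdomain matrix is addressed by geometric index. An accompanying inequality relates dim_z, blk_height, num_cores, cell_size and reg_size.]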
Our goal of design:
[Figure: parallel efficiency (0% to 100%) versus total number of cores (1M to 11M). Labels: 33% (GB'15), 67%, 45%.]
[Figure: SYPD (0.00125 to 0.16) versus total number of cores (0.33 M, 0.67 M, 1.33 M, 2.66 M, 5.32 M, 10.64 M) and resolution (2.480, 1.389, 0.920, 0.620, 0.488 km), implicit versus explicit. Labels: 34X, 89.5X; 7.95 DP-PF, 23.66 DP-PF, DOFs=772B; "Exa-scale" for exp.]
The 488-m res run: 0.07 SYPD, 10.6M cores, dt=240s, 89.5X speedup over explicit
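To put the throughput on the line above in perspective, the sketch below converts simulated years per day (SYPD) and a time step into the wall-clock time available for each model step. It assumes a 365-day model year; the function name is hypothetical and not taken from the original work.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MODEL_YEAR = 365  # assumption: no leap days in the model calendar


def wallclock_seconds_per_step(sypd: float, dt_seconds: float) -> float:
    """Wall-clock seconds spent on each model time step at a given SYPD."""
    steps_per_model_year = DAYS_PER_MODEL_YEAR * SECONDS_PER_DAY / dt_seconds
    steps_per_wallclock_day = sypd * steps_per_model_year
    return SECONDS_PER_DAY / steps_per_wallclock_day


# Values quoted for the 488-m run: 0.07 SYPD with dt = 240 s.
print(round(wallclock_seconds_per_step(0.07, 240.0), 2))
```

With the quoted values this comes to roughly 9.4 seconds of wall-clock time per implicit step on 10.6M cores.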
Application II: Porting CESM and Redesigning CAM-SE for Sunway TaihuLight
CAM5.0
CESM1.2.0 Tsinghua + BNU 30+ Professors and Students
“Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer”, in Proceedings of SC 2016.
- a high complexity in the application, and a heavy legacy in the code base (millions of lines of code)
- an extremely complicated MPMD program with no hotspots (or hundreds of hotspots)
- misfit between the in-place design philosophy and the new architecture
- lack of people with interdisciplinary knowledge and experience
[Figure: CAM workflow: CAM initial, Dyn_run, Phy_run1, Phy_run2. Arrows: pass state variables; pass state variables and tracers; pass tracers (u, v) to dynamics.]
do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      qmin(k,q,ie) = …
      qmax(k,q,ie) = …
      Qtens(k,q,ie) = …
    end do
  end do
end do

Euler_step:
do ie = nets, nete
  compute Q min/max values for lim8
  compute Biharmonic mixing term f
end do
do ie = nets, nete
  2D advection step
  data packing
end do
Boundary exchange
Data extracting

do ie = nets, nete
  do k = 1, nlev
    dp(k) = func_1()
    do q = 1, qsize
      Qtens(k,q,ie) = func_2(dp(k))
    end do
  end do
end do

do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      qmin(k,q,ie) = …
      qmax(k,q,ie) = …
    end do
  end do
end do

do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      Qtens(k,q,ie) = …
    end do
  end do
end do
Data packing

do ie = nets, nete
  do q = 1, qsize
    do k = 1, nlev
      ….
    end do
  end do
end do

do ie = nets, nete
  do k = 1, nlev
    dp0 = func_3()
    dpdiss = func_4()
    do q = 1, qsize
      Qtens(k,q,ie) = func_5(dp0, dpdiss)
    end do
  end do
end do

do ie = nets, nete
  do k = 1, nlev
    dp(k) = func_5()
    Vstar(k) = func_6()
  end do
  do q = 1, qsize
    do k = 1, nlev
      Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
    end do
    do k = 1, nlev
      dp_star(k) = func_8(dp(k))
    end do
    do k = 1, nlev
      Qtens(k,q,ie) = func_9(dp_star(k))
    end do
  end do
  Data packing
end do

loops
parallelization and
data structures
do begin_chunk to end_chunk
  tphysbc() {
    convect_deep_tend(6.47%)
    convect_shallow_tend(15.57%)
    macrop_driver_tend(8.38%)
    microp_aero_run(4.29%)
    microp_driver_tend(7.13%)
    aerosol_wet_intr(4.29%)
    convect_deep_tend_2(0.51%)
    radiation_tend(54.07%)
  }
enddo

tphysbc() {
  do begin_chunk to end_chunk
    convect_deep_tend(6.47%)
    convect_shallow_tend(15.57%)
    macrop_driver_tend(8.38%)
    microp_aero_run(4.29%)
    microp_driver_tend(7.13%)
    aerosol_wet_intr(4.29%)
    convect_deep_tend_2(0.51%)
    radiation_tend(54.07%)
  enddo
}

tphysbc() {
  do begin_chunk to end_chunk
    convect_deep_tend(6.47%)
  enddo
  ……
  do begin_chunk to end_chunk
    microp_driver_tend(7.13%)
  enddo
  ……
  do begin_chunk to end_chunk
    radiation_tend(54.07%)
  enddo
}

do begin_chunk to end_chunk
  convect_deep_tend(6.47%) {
    zm_conv_tend(6.47%) {
      zm_convr(2.03%)
      zm_conv_evap()
      montran()
      convtranc(0.06%)
    }
  }
enddo

convect_deep_tend(6.47%) {
  zm_conv_tend(6.47%) {
    do begin_chunk to end_chunk
      zm_convr(2.03%)
    enddo
    do begin_chunk to end_chunk
      zm_conv_evap()
    enddo
    do begin_chunk to end_chunk
      montran()
    enddo
    do begin_chunk to end_chunk
      convtranc(0.06%)
    enddo
  }
}

[Figure: simulation speed in model years per day (MYPD, axis 0.5 to 3) versus number of CGs (each CG includes 1 MPE and 64 CPEs; 1024, 2400, 4096, 5120, 7350, 9600, 12000, 24000) for MPE only, MPE+CPE for dynamic core, and MPE+CPE for both dynamic core and physics schemes. Other labels: entire model; the same model on NCAR Yellowstone. Data labels: 0.04, 0.15, 0.24, 0.25, 0.6, 0.78, 0.87, 1.54, 1.2, 1.62, 1.75, 2.81.]
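The tphysbc listing above pushes the chunk loop from the caller down into each physics routine, so that every routine ends up with its own loop over chunks. A minimal sketch of that loop-distribution step follows. It assumes each routine reads and writes only the state of the chunk it is given; the two routines are hypothetical stand-ins, not the CAM physics schemes.

```python
from typing import Callable, List, Sequence

Routine = Callable[[float], float]


def run_chunk_loop_outside(chunks: Sequence[float],
                           routines: Sequence[Routine]) -> List[float]:
    """Original shape: one chunk loop that calls every routine in turn."""
    results = []
    for state in chunks:
        for routine in routines:
            state = routine(state)
        results.append(state)
    return results


def run_chunk_loop_per_routine(chunks: Sequence[float],
                               routines: Sequence[Routine]) -> List[float]:
    """Distributed shape: each routine gets its own loop over all chunks."""
    states = list(chunks)
    for routine in routines:
        # This inner loop is the unit that could be handed to many cores.
        states = [routine(state) for state in states]
    return states


# Hypothetical per-chunk routines standing in for the physics tendencies.
routines = [lambda s: s + 1.0, lambda s: s * 2.0]
chunks = [1.0, 2.0, 3.0]
assert run_chunk_loop_outside(chunks, routines) == \
    run_chunk_loop_per_routine(chunks, routines)
```

The two shapes give the same result only because no routine depends on another chunk's state; where that does not hold, the loop cannot be distributed this way without extra synchronization.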
- Step 1: rewrite of Fortran OpenACC code to Athread C code
  - finer memory control through a specific DMA scheme
  - more efficient vectorization
- Step 2: register communication based redesign
  - remove data dependency
  - expose more parallelism
[Figure: three-stage scan over the 8*8 CPE mesh (C0,0 to C7,7; elements labelled elek to elek+7). Stage 1: CPE Ci,j forms the running sums a16*i, a16*i+a16*i+1, ..., a16*i+...+a16*i+15 of its own elements. Stage 2: the per-CPE totals (a0+...+a15 on C0,0, a16+...+a31 on C1,0, ak+...+ak+15 on Ck,0) are combined across CPEs into p15, p31, ..., p111, p127. Stage 3: each CPE Ci,j combines the incoming partial result p16*i with its running sums.]
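One reading of the three stages in the figure above is a block-wise prefix sum: local running sums per CPE, a scan over the per-CPE totals, then an offset added back to each block. The sketch below follows that reading sequentially in plain Python. The sizes (8 CPEs with 16 elements each, giving p127 as the last result) are taken from the figure labels; the function name is hypothetical, and the register-communication mechanics are not modelled.

```python
from typing import List, Sequence


def three_stage_scan(values: Sequence[int],
                     num_cpes: int = 8,
                     per_cpe: int = 16) -> List[int]:
    """Inclusive prefix sum computed block by block in three stages."""
    assert len(values) == num_cpes * per_cpe

    # Stage 1: each CPE builds running sums over its own 16 elements.
    local_sums = []
    for cpe in range(num_cpes):
        running, sums = 0, []
        for x in values[cpe * per_cpe:(cpe + 1) * per_cpe]:
            running += x
            sums.append(running)
        local_sums.append(sums)

    # Stage 2: exclusive scan over the per-CPE totals gives each CPE's offset.
    offsets, running = [], 0
    for sums in local_sums:
        offsets.append(running)
        running += sums[-1]

    # Stage 3: every CPE adds its offset to its local running sums.
    return [offset + s
            for offset, sums in zip(offsets, local_sums)
            for s in sums]


prefix = three_stage_scan(list(range(128)))
print(prefix[15], prefix[16], prefix[127])
```

Stages 1 and 3 are independent across CPEs, so only stage 2 needs data from neighbours, which is where a fast on-chip exchange would matter.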
Performance Improvement through Redesign
1 Sunway CG (64 CPEs) could be equivalent to 0.1x Intel Core
1.8x Intel Core
7.2x Intel Core
43.1x Intel Core
[Figure: PFlops (axis 0.01 to 5) versus number of processes (512, 2048, 8192, 32768, 131072) with 48, 192, 768 and 650 elements in each process. Labels: 3.3 PFlops, Para. eff. 98.55%; 2.4 PFlops, Para. eff. 92.2%; 1.76 PFlops, Para. eff. 88.3%; 2.72 PFlops, Para. eff. 92.9%.]
- Dynamic rupture source generator, originated from CG-FDM
- Seismic wave propagation, originated from AWP-ODC
- Other utilities:
  - source partitioner
  - 3D model interpolator
  - restart controller
[Figure: software framework. Main components: Dynamic Rupture Source Generator (based on CG-FDM) and Seismic Wave Propagation (based on AWP-ODC). Per-timestep stages: Velocity Update, Stress Update, Source Injection, Stress Adjustment for Plasticity, Next Timestep. Other labels: 3D Vel/Den Model, Fault Stress Init, Friction Law Ctrl, Wave Eqn Solver; Source Partitioner, Restart Controller, 3D Model Interpolator, Snapshot/Seismo Recorder; LZ4 Compression, Group I/O, Balanced I/O Forwarding.]
Multi-Level Domain Decomposition
[Figure: four levels over the x, y, z grid. (1) MPI decomposition (Ny, Nz); (2) CG blocking (block sizes along y and z); (3) Athread decomposition; (4) LDM buffering scheme along the compute direction, with finished area, computing area, buffer area and unfinished area.]
[Figure: fuse arrays, before and after. Before: separate arrays u, v, w and xx, yy, zz, xy, xz, yz, each holding the values for points 1 to n (u1 u2 ... un, v1 v2 ... vn, ...). After: the components of each point are stored together (u1 v1 w1 u2 v2 w2 ...; xx1 yy1 zz1 xy1 xz1 yz1 xx2 yy2 ...) in the fused arrays dvelcx and dstrqc.]
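[Figure: halo exchange between neighbouring CPEs (ID: 63, ID: 00, ID: 01, ID: 02). Each CPE's data has a left boundary, an inner part and a right boundary; labels: DMA transfer for the inner part, register communication for the two boundaries.]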
- array fusion
- halo exchange through register communication
- optimized blocking configuration guided by analytical model
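Array fusion, the first technique listed above, can be sketched as interleaving several per-component arrays so that all components of one grid point sit next to each other and can be fetched in one contiguous transfer. The sketch below uses plain Python lists; the function names are hypothetical, and only the interleaved layout is taken from the figure.

```python
from typing import List, Sequence


def fuse_arrays(*arrays: Sequence[float]) -> List[float]:
    """Interleave equal-length arrays: [u1, v1, w1, u2, v2, w2, ...]."""
    n = len(arrays[0])
    assert all(len(a) == n for a in arrays), "arrays must have equal length"
    return [a[i] for i in range(n) for a in arrays]


def unfuse_array(fused: Sequence[float], components: int) -> List[List[float]]:
    """Inverse of fuse_arrays: recover one list per component."""
    assert len(fused) % components == 0
    return [list(fused[c::components]) for c in range(components)]


u = [1.0, 2.0]
v = [10.0, 20.0]
w = [100.0, 200.0]
fused_velocity = fuse_arrays(u, v, w)
print(fused_velocity)  # [1.0, 10.0, 100.0, 2.0, 20.0, 200.0]
```

With the fused layout, a block of consecutive grid points is one contiguous span instead of three separate spans, which means fewer and longer transfers for the same data.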
[Figure: compression scheme. (b) Computation workflow between main memory (host memory) and the LDM of a CPE: dma_get, 16b to 32b decompression, general 32b computation, 32b to 16b compression, dma_put. (d) Compression algorithms, from the IEEE754 32-bit floating point format (sign, exp (8b), frac (24b)) to three 16-bit floating point formats:
- sign, exp (0-8b), frac (7-15b), with the exponent width set from log2 of the exponent range (E_min to E_max); shown with (str, r1, r2, ..., r6, sigma2, yldfac)
- IEEE754 32b to 16b FP conversion to sign, exp (5b), frac (10b); shown with (vel, ww0, phi, cohes, taxx, ..., taxz)
- sign, frac (15b), with values normalized using V_min and V_max; shown with (d1, lam, mu, qp, qs, vx1, vx2, ww)]
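Two of the 16-bit ideas in the figure above can be sketched with the Python standard library: a direct IEEE 754 32b-to-16b conversion (1 sign, 5 exponent, 10 fraction bits), and a range-normalized encoding that spends 15 bits on the position of a value between a minimum and a maximum. The function names are hypothetical, the normalized variant here ignores the sign bit, and neither function is the implementation used in the original work.

```python
import struct


def f32_to_f16_bits(x: float) -> int:
    """Round a float to IEEE 754 half precision and return its 16 raw bits."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def f16_bits_to_f32(bits: int) -> float:
    """Expand 16 raw half-precision bits back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def quantize_range(v: float, v_min: float, v_max: float, bits: int = 15) -> int:
    """Store v as an integer position inside [v_min, v_max] (assumed layout)."""
    levels = (1 << bits) - 1
    return round((v - v_min) / (v_max - v_min) * levels)


def dequantize_range(q: int, v_min: float, v_max: float, bits: int = 15) -> float:
    """Inverse of quantize_range, up to the quantization step."""
    levels = (1 << bits) - 1
    return v_min + q / levels * (v_max - v_min)


# Round trips: the compressed value is close to, not equal to, the input.
half = f16_bits_to_f32(f32_to_f16_bits(0.1))
ranged = dequantize_range(quantize_range(0.1, 0.0, 1.0), 0.0, 1.0)
print(abs(half - 0.1), abs(ranged - 0.1))
```

The half-precision format keeps a wide dynamic range with a coarse fraction, while the normalized form gives uniform resolution across a known range, which is one plausible reason for assigning different variables to different formats.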
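[Figure: speedup (axis 10 to 60) for the MPE, PAR, MEM and CMPR versions. Data labels: 40.6, 47.8, 45.4, 28.9, 27.6, 22.9, 4.2, 39.3, 12.9, 13.1.]
[Figure: DMA bandwidth (axis 5 to 30) for the PAR, MEM and CMPR versions. Data labels: 21.2 (62%), 21.2 (62%), 18.5 (54%), 18.5 (54%), 23.8 (70%), 24.8 (73%), 12.4 (36%), 26.9 (79%), 27 (79%), 27 (79%).]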
[Figure: PFLOPS (axis 0.6 to 18) versus number of processes (8K to 160K), with ideal and measured curves for four cases. Linear: peak 10.7 PFlops, para. eff. 97.9%. Non-linear: peak 15.2 PFlops, para. eff. 80.1%. Linear+Compress: peak 14.2 PFlops, para. eff. 96.5%. Non-linear+Compress: peak 18.9 PFlops, para. eff. 79.5%.]
[Figure: speedup (axis 1 to 22) versus number of processes (8K to 160K) in four panels, each with an ideal curve and curves for dx=100m, dx=50m and dx=16m. Efficiency labels per panel: Linear: 79.9%, 73.6%, 63.6%; Non-Linear: 75.5%, 75.6%, 53.3%; Linear+Compress: 75.8%, 72.4%, 51.2%; Non-Linear+Compress: 67.5%, 67.2%, 51.7%.]
[Figure: earthquake simulation results for the region between 38˚N and 40˚N, with Beijing, Tangshan, Shunyi, Tianjin, Ninghe, Luanxian, Cangzhou, Bohai, Luannan and Wuqing marked. (a) Time series from 0s to 100s at Ninghe and Cangzhou at 200m resolution; (b) the same at 16m resolution; (c) and (d) maps of velocity (m/s), color scale −0.2 to 0.2.]
Traditional HPC Applications (Science -> Service) Deep Learning Related Applications Sunway Micro
“15-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of Realistic 10 Hz Scenarios”, Gordon Bell Prize Finalist, SC 2017.
[Figure: training AlexNet with swCaffe. Average time per iteration (axis 10 to 80) for Intel (24 cores), swBLAS (1CG) and swDNN (1CG), split into total, convolution and fully connected. Speedup labels: 1.0x, 2.2x, 3.5x; 1.0x, 2.7x, 4.5x; 1.0x, 7.1x, 8.5x.]
- MOST China: major sponsors of the HPC hardware and software development
- NRCPC: vendor of the machine
- NCAR: Rich Loft, John Dennis, Allison Baker, Haiying Xu (support and advice on the CAM-SE work)
- SCEC: Yifeng Cui, Steve Day, Daniel Roten, Kim Olsen, Josh Tobin, Alex Breuer and Dawei Mu (discussion and advice on the earthquake simulation work)