Porting Atmospheric Programs on Various Systems
Lin Gan
Tsinghua University NSCC-Wuxi lingan@tsinghua.edu.cn
2017.10.23-27 @Schloss Dagstuhl
Dagstuhl Seminar on Performance Portability in Extreme Scale Computing: Metrics, Challenges, Solutions
lingan@tsinghua.edu.cn www.thuhpgc.org
Performance portable applications can be efficiently executed on a wide variety of HPC architectures without significant manual modifications.
§ a high complexity in the application, and a heavy legacy in the code base (millions of lines of code)
§ an extremely complicated MPMD program with no hotspots (or hundreds of hotspots)
§ a misfit between the in-place design philosophy and the new architecture
§ a lack of people with interdisciplinary knowledge and experience
[2010 SISC]: 2D SWE model 2k CPU cores
[2011 JCP]: 2D SWE model 80k CPU cores
[2013 PPoPP]: 2D SWE model 100K CPU-GPU cores 0.8 DP-TF on Tianhe-1A
[2014 IPDPS]: 2D SWE model 1.6m CPU-MIC cores 1.63 DP-PF on Tianhe-2
[2015 ToC]: 3D Nonhyd model 1.2m CPU-MIC cores 1.74 DP-PF on Tianhe-2
[2016 Gordon Bell Prize]: 3D Nonhyd model 10.6m Sunway cores, both on TaihuLight
Implicit solver, Explicit solver
Our FPGA Work
8 Altera Stratix5 D8 FPGAs: 3x in perf., 1.2x in energy
8 Altera Stratix5 D8 FPGAs: 6x in perf., 5x in energy
Virtex-6 SX475T FPGA: 4.3x in perf., 10x in energy
[Table columns: Original LOC, Further Opti., Final LOC, LOC/perf., perf./LOC]
Sunway TaihuLight
§ X: OpenACC / Athread
§ One MPI process runs on one management core (MPE)
§ OpenACC handles data transfer between main memory and on-chip memory (SPM), and distributes the kernel workload across the compute cores (CPEs)
§ Athread is the threading library that manages threads on the compute cores (CPEs); it is used in the OpenACC implementation
int A[1024][1024];
int B[1024][1024];
int C[1024][1024];
int i, j;

/* copy B and C to the device, copy A back, and spread the i-loop across the CPEs */
#pragma acc parallel loop \
    copyin(B, C) copyout(A)
for (i = 0; i < 1024; i++) {
    for (j = 0; j < 1024; j++) {
        A[i][j] = B[i][j] + C[i][j];
    }
}
…
CPEs_spawn(CPE_kernel, args);
…
__SPM_local int SPM_A[1][1024];
__SPM_local int SPM_B[1][1024];
__SPM_local int SPM_C[1][1024];

void CPE_kernel(args) {
    /* rows are dealt out cyclically: CPE_id, CPE_id + CPE_num, ... */
    for (i = CPE_id; i < 1024; i += CPE_num) {
        /* 4096 bytes = one row of 1024 4-byte ints */
        dma_get(&B[i][0], SPM_B, 4096);
        dma_get(&C[i][0], SPM_C, 4096);
        for (j = 0; j < 1024; j++) {
            SPM_A[0][j] = SPM_B[0][j] + SPM_C[0][j];
        } //j-loop
        dma_put(SPM_A, &A[i][0], 4096);
    } //i-loop
}
Compilation flow: SWACC -> Basic Compiler -> a.out
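The row-by-row staging in the CPE kernel above can be tried out on an ordinary host. The following is a minimal sketch, not the Sunway toolchain's actual output: `dma_get_emul`, `dma_put_emul`, `cpe_kernel_emul`, and `add_matrices_emul` are hypothetical names, `memcpy` stands in for the DMA calls, stack buffers stand in for the SPM, and the "CPEs" run one after another instead of in parallel. The sizes `N` and `CPE_NUM` are reduced assumptions (the slide uses 1024 rows).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N 64        /* rows and columns; reduced from the slide's 1024 (assumption) */
#define CPE_NUM 8   /* number of emulated compute cores (assumption) */

/* Hypothetical stand-ins for the slide's dma_get/dma_put: plain host memcpy. */
static void dma_get_emul(const int *main_mem, int *spm, size_t bytes) {
    memcpy(spm, main_mem, bytes);
}

static void dma_put_emul(const int *spm, int *main_mem, size_t bytes) {
    memcpy(main_mem, spm, bytes);
}

/* Work of one emulated CPE: rows cpe_id, cpe_id + CPE_NUM, cpe_id + 2*CPE_NUM, ... */
static void cpe_kernel_emul(int cpe_id, int A[N][N], int B[N][N], int C[N][N]) {
    int spm_a[N], spm_b[N], spm_c[N];   /* stand-ins for the on-chip SPM buffers */

    assert(cpe_id >= 0 && cpe_id < CPE_NUM);
    for (int i = cpe_id; i < N; i += CPE_NUM) {
        /* stage one input row of B and one of C into local memory */
        dma_get_emul(&B[i][0], spm_b, sizeof spm_b);
        dma_get_emul(&C[i][0], spm_c, sizeof spm_c);
        for (int j = 0; j < N; j++) {
            spm_a[j] = spm_b[j] + spm_c[j];
        }
        /* write the finished row back to main memory */
        dma_put_emul(spm_a, &A[i][0], sizeof spm_a);
    }
}

/* Host-side stand-in for the spawn call: runs every emulated CPE sequentially. */
void add_matrices_emul(int A[N][N], int B[N][N], int C[N][N]) {
    for (int id = 0; id < CPE_NUM; id++) {
        cpe_kernel_emul(id, A, B, C);
    }
}
```

Calling `add_matrices_emul(A, B, C)` on three `int[N][N]` arrays fills `A` with the element-wise sum. Because each emulated core touches only its own rows, the cyclic distribution needs no synchronization between cores.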
The performance improvements for the entire CAM model in ne30. MPE\ori refers to the original version based on MPE,
! Original loops: the temporaries dp(k) and Vstar(k) sit between loop levels
do ie = nets, nete
  do k = 1, nlev
    dp(k) = func_1()
    do q = 1, qsize
      Qtens(k,q,ie) = func_2(dp(k))
    end do
  end do
end do
do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      qmin(k,q,ie) = …
      qmax(k,q,ie) = …
    end do
  end do
end do
do ie = nets, nete
  do k = 1, nlev
    dp(k) = func_5()
    Vstar(k) = func_6()
  end do
  do q = 1, qsize
    do k = 1, nlev
      Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
    end do
  end do
end do
! Restructured loops: the temporaries are replaced by inline calls,
! so every loop nest is perfectly nested
do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      Qtens(k,q,ie) = func_2(func_1())
    end do
  end do
end do
do ie = nets, nete
  do k = 1, nlev
    do q = 1, qsize
      qmin(k,q,ie) = …
      qmax(k,q,ie) = …
    end do
  end do
end do
do ie = nets, nete
  do q = 1, qsize
    do k = 1, nlev
      Qtens(k,q,ie) = func_7(func_5(),func_6())
    end do
  end do
end do
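The first loop nest of this restructuring can be checked in a small host-side sketch. The slide does not define `func_1` or `func_2`, so the bodies below are hypothetical placeholders, as are the sizes `NELEM`, `NLEV`, and `QSIZE` and the function names `qtens_with_temp` and `qtens_perfect_nest`. The sketch shows only that dropping the temporary `dp` leaves the result unchanged while making the nest perfect, which is presumably what allows the whole iteration space to be distributed across the CPEs (for example with an OpenACC `collapse` clause).

```c
#include <assert.h>
#include <stddef.h>

#define NELEM 4   /* elements per process, nets..nete (assumption) */
#define NLEV  8   /* vertical levels (assumption) */
#define QSIZE 3   /* number of tracers (assumption) */

/* Hypothetical placeholders for the slide's func_1 and func_2. */
static double func_1(int k, int ie) { return 1.0 + 0.5 * k + 0.25 * ie; }
static double func_2(double dp, int q) { return dp * (q + 1); }

/* Original shape: dp[k] is assigned between the k and q loops,
 * so the nest is imperfect. Index order mirrors Fortran Qtens(k,q,ie). */
void qtens_with_temp(double qtens[NELEM][QSIZE][NLEV]) {
    double dp[NLEV];
    assert(qtens != NULL);
    for (int ie = 0; ie < NELEM; ie++) {
        for (int k = 0; k < NLEV; k++) {
            dp[k] = func_1(k, ie);
            for (int q = 0; q < QSIZE; q++) {
                qtens[ie][q][k] = func_2(dp[k], q);
            }
        }
    }
}

/* Restructured shape: func_1 is evaluated inline, so the three loops are
 * perfectly nested. The price is QSIZE evaluations of func_1 per (k, ie). */
void qtens_perfect_nest(double qtens[NELEM][QSIZE][NLEV]) {
    assert(qtens != NULL);
    for (int ie = 0; ie < NELEM; ie++) {
        for (int k = 0; k < NLEV; k++) {
            for (int q = 0; q < QSIZE; q++) {
                qtens[ie][q][k] = func_2(func_1(k, ie), q);
            }
        }
    }
}
```

The trade-off is redundant computation in exchange for exposed parallelism: every `(ie, k, q)` iteration becomes independent, at the cost of recomputing `func_1` for each tracer.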
3.3 PFlops for a 750-m resolution using 10,075,000 cores
3.4 SYPD for 25-km horizontal resolution
21.5 SYPD for 100-km horizontal resolution
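SYPD (simulated years per day) is the number of model years completed in one wall-clock day. To put the two SYPD figures above on a more tangible scale, the sketch below converts SYPD into wall-clock seconds per simulated day. The function name is hypothetical, and a 365-day model year is assumed.

```c
#include <assert.h>

/* Wall-clock seconds needed to advance the model by one simulated day.
 * Assumes a 365-day model year (assumption, not stated on the slide). */
double wallclock_seconds_per_simulated_day(double sypd) {
    const double seconds_per_wall_day = 86400.0;
    const double days_per_model_year = 365.0;

    assert(sypd > 0.0);
    return seconds_per_wall_day / (sypd * days_per_model_year);
}
```

Under that assumption, 3.4 SYPD corresponds to roughly 70 s of wall-clock time per simulated day, and 21.5 SYPD to roughly 11 s.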
Sunway BlueLight (2012): 16-core processors, center in Jinan
Sunway TaihuLight (2016): 260-core processors, center in Wuxi
OpenFoam LAMMPS xMath VASP Thunder Star bowtie Petsc