Porting Atmospheric Programs on Various Systems

Porting Atmospheric Programs on Various Systems Lin Gan Tsinghua University NSCC-Wuxi lingan@tsinghua.edu.cn 2017.10.23-27 @Schloss Dagstuhl Dagstuhl Seminar on Performance Portability in Extreme Scale Computing: Metrics, Challenges,


  1. Porting Atmospheric Programs on Various Systems
     Lin Gan, Tsinghua University, NSCC-Wuxi
     lingan@tsinghua.edu.cn
     2017.10.23-27 @ Schloss Dagstuhl
     Dagstuhl Seminar on Performance Portability in Extreme Scale Computing: Metrics, Challenges, Solutions

  2. About me
     Performance portable applications can be efficiently executed on a wide variety of HPC architectures without significant manual modifications.
     • Application areas: Climate Modeling, Geophysics Exploration
     • Affiliations: Tsinghua University, NSCC-Wuxi
     • Platforms: MIC, FPGA, Sunway, GPU
     lingan@tsinghua.edu.cn www.thuhpgc.org

  3. Climate Modeling
     Performance portable applications can be efficiently executed on a wide variety of HPC architectures without significant manual modifications.
     • High complexity in the application, and a heavy legacy in the code base (millions of lines of code)
     • An extremely complicated MPMD program with no hotspots (or hundreds of hotspots)
     • Misfit between the in-place design philosophy and the new architecture
     • Lack of people with interdisciplinary knowledge and experience

  4. Our atmosphere study
     CPU and many-core results:
     • [2010 SISC]: 2D SWE model, 2k CPU cores, on BlueGene/L
     • [2011 JCP]: 2D SWE model, 80k CPU cores, on Tianhe-1A
     • [2013 PPoPP]: 2D SWE model, 100K CPU-GPU cores, 0.8 DP-TF on Tianhe-1A
     • [2014 IPDPS]: 2D SWE model, explicit solver, 1.6m CPU-MIC cores, 1.63 DP-PF on Tianhe-2
     • [2015 ToC]: 3D Nonhyd model, 1.2m CPU-MIC cores, 1.74 DP-PF on Tianhe-2
     • [2016 Gordon Bell Prize]: 3D Nonhyd model, implicit solver, 10.6m Sunway cores, on TaihuLight
     FPGA results (in the order they appear on the slide):
     • 8 Altera Stratix5 D8 FPGAs: 3x in perf., 1.2x in energy
     • 8 Altera Stratix5 D8 FPGAs: 6x in perf., 5x in energy
     • Virtex-6 SX475T FPGA: 4.3x in perf., 10x in energy
     Callouts on the slide: "Programming ?", "both"
     Our FPGA Work
     1. “Global Atmosphere Simulation based on Reconfigurable Platform” (short paper), in FCCM 2013
     2. “Accelerating Solvers for Global SWEs through Mixed-precision Data Flow Engines”, in FPL 2013
     3. “Solving the Global Atmospheric Equations through Heterogeneous Reconfigurable Platforms”, in ACM TRETS
     4. “A Highly-Efficient and Green Data Flow Engine for Solving Euler Atmospheric Equations”, in FPL 2014
     5. “Solving Mesoscale Atmospheric Dynamics Using a Reconfigurable Dataflow Architecture”, in IEEE MICRO 2017

  5. Comparison of Lines of Code
     Platforms shown: CPU, CPU+GPU, CPU+MIC, CPU+FPGA (OpenCL for FPGA)

     Platform         CPU     1x GPU   1x MIC   1x FPGA
     Original LOC     969     -        -        -
     Direct Porting   -       +96      +10      +644
     Further Opti.    -       +90      +80      +115
     Final LOC        969     1155     1059     1728
     LOC/perf.        46.8    7.3      4.4      1.8
     perf./LOC        0.02    0.14     0.22     0.56

     What is your choice?
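The arithmetic behind the table can be checked with a short sketch. The numbers are copied from the slide; the variable names are illustrative only, and the slide does not state the unit of "perf.", so the sketch only verifies that the rows are consistent with each other. It assumes that Final LOC is the original 969 lines plus both porting stages, and that perf./LOC is the reciprocal of LOC/perf.

```python
# Figures copied from the slide's table; names are illustrative, not from the deck.
ORIGINAL_LOC = 969

# Lines added per accelerator: direct porting, then further optimization.
added_loc = {
    "GPU": {"direct": 96, "further": 90},
    "MIC": {"direct": 10, "further": 80},
    "FPGA": {"direct": 644, "further": 115},
}

# Rows as printed on the slide (already rounded there).
loc_per_perf = {"CPU": 46.8, "GPU": 7.3, "MIC": 4.4, "FPGA": 1.8}
slide_perf_per_loc = {"CPU": 0.02, "GPU": 0.14, "MIC": 0.22, "FPGA": 0.56}

# Final LOC = original code + direct porting + further optimization.
final_loc = {"CPU": ORIGINAL_LOC}
for platform, stages in added_loc.items():
    final_loc[platform] = ORIGINAL_LOC + stages["direct"] + stages["further"]

# perf./LOC recomputed as the reciprocal of the (rounded) LOC/perf. row.
recomputed_perf_per_loc = {p: 1.0 / v for p, v in loc_per_perf.items()}

for platform in loc_per_perf:
    print(platform, final_loc[platform], f"{recomputed_perf_per_loc[platform]:.3f}")
```

The recomputed reciprocals match the slide's perf./LOC row to within rounding (the MIC value comes out slightly above 0.22 because the inputs were already rounded on the slide).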

  6. The Sunway TaihuLight Supercomputer
     Sunway TaihuLight
     • Peak: 125 PFlops
     • Linpack: 93 PFlops
     • 40960 SW26010 CPUs
     • Over 10M cores
     • 163840 MPI processes, 65 threads
     Callouts on the slide: "Disaster ?", "Reg. Comm."
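The machine-size figures on this slide fit together arithmetically, as the following sketch shows. The CPU count, process count, and thread count are from the slide, and the 260 cores per processor comes from the later slide on homegrown supercomputers. The figure of four core groups per processor (one MPI process each) is background knowledge about the SW26010 that the deck does not state, so treat it as an assumption.

```python
NUM_CPUS = 40960            # from the slide
CORES_PER_CPU = 260         # "260-core" many-core processor, from a later slide
THREADS_PER_PROCESS = 65    # from the slide
CORE_GROUPS_PER_CPU = 4     # assumption: not stated in the deck

# Total core count of the machine.
total_cores = NUM_CPUS * CORES_PER_CPU

# One MPI process per core group (assumed mapping).
mpi_processes = NUM_CPUS * CORE_GROUPS_PER_CPU

# Each process drives 65 threads, so this should equal total_cores.
total_threads = mpi_processes * THREADS_PER_PROCESS

print(f"cores: {total_cores:,}")
print(f"MPI processes: {mpi_processes:,}")
print(f"threads: {total_threads:,}")
```

Under these assumptions the process count reproduces the slide's 163840, and the thread total equals the core total, which is consistent with "over 10M cores".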

  7. Principal Programming Model on TaihuLight
     • MPI+X
     • X: OpenACC / Athread
     • Each MPI process runs on one management core (MPE)
     • OpenACC handles data transfer between main memory and on-chip memory (SPM), and distributes the kernel workload across the compute cores (CPEs)
     • Athread is the threading library that manages threads on the compute cores (CPEs); the OpenACC implementation is built on it

  8. OpenACC-level and Athread-level code
     Toolchain labels on the slide: SWACC, Basic Compiler, a.out

     OpenACC-level:
         int A[1024][1024];
         int B[1024][1024];
         int C[1024][1024];
         #pragma acc parallel loop \
             copyin(B, C) copyout(A)
         for(i = 0; i < 1024; i ++) {
           for(j = 0; j < 1024; j ++) {
             A[i][j] = B[i][j] + C[i][j];
           }
         }

     Athread-level, Host Code:
         …
         CPEs_spawn(CPE_kernel, args);
         …

     Athread-level, Kernel Code:
         __SPM_local int SPM_A[1][1024];
         __SPM_local int SPM_B[1][1024];
         __SPM_local int SPM_C[1][1024];
         void CPE_kernel(args) {
           for(i = CPE_id; i < 1024; i += CPE_num) {
             dma_get(&B[i][0], SPM_B, 4096);
             dma_get(&C[i][0], SPM_C, 4096);
             for(j = 0; j < 1024; j ++) {
               SPM_A[0][j] = SPM_B[0][j] + SPM_C[0][j];
             } //j-loop
             dma_put(SPM_A, &A[i][0], 4096);
           } //i-loop
         }
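The Athread-level kernel on this slide assigns rows to compute cores cyclically and stages each row through a small local buffer. The sketch below emulates that pattern sequentially in plain Python so the data movement can be followed step by step. It is not Sunway code: the Python `dma_get` and `dma_put` functions are hypothetical stand-ins that simply copy lists, 64 compute cores per process is an assumption, and the matrix size is reduced from the slide's 1024 to keep the run short.

```python
N = 128        # matrix dimension (the slide uses 1024; reduced here)
CPE_NUM = 64   # assumed number of compute cores sharing the work

def dma_get(main_row):
    """Stand-in for dma_get: copy one row from 'main memory' to a local buffer."""
    return list(main_row)

def dma_put(local_row, main_matrix, i):
    """Stand-in for dma_put: copy a local buffer back to row i of 'main memory'."""
    main_matrix[i] = list(local_row)

def cpe_kernel(cpe_id, A, B, C):
    """Work done by one compute core: rows cpe_id, cpe_id + CPE_NUM, ..."""
    rows_done = []
    for i in range(cpe_id, N, CPE_NUM):
        spm_b = dma_get(B[i])                          # fetch row of B
        spm_c = dma_get(C[i])                          # fetch row of C
        spm_a = [b + c for b, c in zip(spm_b, spm_c)]  # compute on local data only
        dma_put(spm_a, A, i)                           # write the result row back
        rows_done.append(i)
    return rows_done

B = [[i + j for j in range(N)] for i in range(N)]
C = [[i * j for j in range(N)] for i in range(N)]
A = [[0] * N for _ in range(N)]

# Sequential emulation of launching the kernel on every compute core.
rows_by_cpe = [cpe_kernel(cpe_id, A, B, C) for cpe_id in range(CPE_NUM)]

print("rows handled by core 1:", rows_by_cpe[1])
```

Because each core touches a disjoint set of rows, the cores never write to the same output row, which is what makes the cyclic split safe to run in parallel on real hardware.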

  9. Porting CAM: OpenACC + Athread
     Figure: The performance improvements for the entire CAM model in ne30. MPE\ori refers to the original version based on MPE, openacc refers to the version using OpenACC directives, and athread refers to the version that additionally uses Athread.
     Can you accept the performance of OpenACC?

  10. Loop transformation tools
      Tools for loop transformation. Labels on the slide: OpenACC optimized, Athread, Data locality.

      First example, left column (temporary array dp):
          do ie = nets, nete
            do k = 1, nlev
              dp(k) = func_1()
              do q = 1, qsize
                Qtens(k,q,ie) = func_2(dp(k))
              end do
            end do
          end do

      First example, right column (call inlined):
          do ie = nets, nete
            do k = 1, nlev
              do q = 1, qsize
                Qtens(k,q,ie) = func_2(func_1())
              end do
            end do
          end do

      Second example, left column (temporary arrays dp and Vstar):
          do ie = nets, nete
            do k = 1, nlev
              do q = 1, qsize
                qmin(k,q,ie) = …
                qmax(k,q,ie) = …
              end do
            end do
          end do
          do ie = nets, nete
            do k = 1, nlev
              dp(k) = func_5()
              Vstar(k) = func_6()
            end do
            do q = 1, qsize
              do k = 1, nlev
                Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
              end do
            end do
          end do

      Second example, right column (calls inlined):
          do ie = nets, nete
            do k = 1, nlev
              do q = 1, qsize
                qmin(k,q,ie) = …
                qmax(k,q,ie) = …
              end do
            end do
          end do
          do ie = nets, nete
            do k = 1, nlev
              do q = 1, qsize
                Qtens(k,q,ie) = func_7(func_5(), func_6())
              end do
            end do
          end do
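The first pair of loop nests on this slide differs only in whether the intermediate value is stored in a temporary array (`dp(k)`) or recomputed inline. A minimal Python sketch of why the two forms are interchangeable is given below. The bodies of `func_1` and `func_2` are hypothetical (the slide leaves them abstract and shows no arguments), and the equivalence holds only under the assumption that `func_1` is a pure function of the loop indices.

```python
nets, nete, nlev, qsize = 1, 3, 4, 2   # small illustrative loop bounds

def func_1(k, ie):
    """Hypothetical pure function standing in for the slide's func_1()."""
    return 0.5 * k + ie

def func_2(x, q):
    """Hypothetical pure function standing in for the slide's func_2()."""
    return x * x + q

def with_temporary():
    """Form that stores func_1's result in a per-level temporary dp(k)."""
    qtens = {}
    for ie in range(nets, nete + 1):
        dp = {}
        for k in range(1, nlev + 1):
            dp[k] = func_1(k, ie)
            for q in range(1, qsize + 1):
                qtens[(k, q, ie)] = func_2(dp[k], q)
    return qtens

def with_inlined_call():
    """Form that recomputes func_1 inside the innermost loop."""
    qtens = {}
    for ie in range(nets, nete + 1):
        for k in range(1, nlev + 1):
            for q in range(1, qsize + 1):
                qtens[(k, q, ie)] = func_2(func_1(k, ie), q)
    return qtens

same = with_temporary() == with_inlined_call()
print("identical results:", same)
```

The inlined form recomputes `func_1` once per tracer `q` instead of once per level, trading extra arithmetic for a perfectly nested loop with no temporary array to keep in local memory.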

  11. Porting CAM on Sunway
      • 3.3 PFlops for a 750-m resolution using 10,075,000 cores
      • 3.4 SYPD for 25-km horizontal resolution
      • 21.5 SYPD for 100-km horizontal resolution
      • CAM original total is 754,129 LOC
      • We modified 152,336 LOC
      • We further added 57,709 LOC
      • 2 years
      • Redesign the atmosphere model
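To put these figures in proportion, the sketch below converts the line counts into fractions of the original code base and the SYPD values into wall-clock time per simulated year. It uses the standard reading of SYPD as simulated years per wall-clock day; the variable names are illustrative and not from the deck.

```python
ORIGINAL_LOC = 754_129
MODIFIED_LOC = 152_336
ADDED_LOC = 57_709

# Share of the original code base that was modified, and size of the additions
# relative to the original.
modified_fraction = MODIFIED_LOC / ORIGINAL_LOC
added_fraction = ADDED_LOC / ORIGINAL_LOC

# SYPD = simulated years per wall-clock day, so one simulated year
# takes 24 / SYPD wall-clock hours.
sypd = {"25-km": 3.4, "100-km": 21.5}
hours_per_simulated_year = {res: 24.0 / v for res, v in sypd.items()}

print(f"modified: {modified_fraction:.1%}, added: {added_fraction:.1%}")
for res, hours in hours_per_simulated_year.items():
    print(f"{res}: {hours:.2f} wall-clock hours per simulated year")
```

Roughly a fifth of the original lines were modified, and at 25-km resolution a simulated year takes about seven wall-clock hours.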

  12. Climate Models on Sunway
      CAM, WRF, AM3, POM, MASN, CESM, CIESM, UM

  13. Software on Sunway
      xMath, PETSc, LAMMPS, OpenFOAM, VASP, Thunder Star, bowtie

  14. China’s homegrown supercomputers
      • Sunway BlueLight (2012): National Supercomputing Center in Jinan; 1 PFlops; multi-core processor (16-core)
      • Sunway TaihuLight (2016): National Supercomputing Center in Wuxi; 125 PFlops; many-core processor (260-core)
      Software: VASP, OpenFOAM, xMath, LAMMPS, Thunder Star, bowtie, PETSc
      Would you be interested in using a future machine?

  15. Thanks, and you are welcome to visit us in Wuxi
      ICCS 2018 will take place in June 2018

