Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer. Haohuan Fu (haohuan@tsinghua.edu.cn), CESS, Tsinghua University and National Supercomputing Center in Wuxi. Oct 5th, 2016 @ CCDSC.


  1. Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer
     Haohuan Fu (haohuan@tsinghua.edu.cn)
     CESS, Tsinghua University; National Supercomputing Center in Wuxi
     Oct 5th, 2016 @ CCDSC

  2. Sunway TaihuLight: an Overview
     Homegrown many-core processor: SW26010
     • 260 cores per chip
     • 3 Tflops
     The first system in the world to provide over 100 Pflops of performance with over 10 million cores
     • theoretical peak of 125 Pflops, a 2.5x improvement over the previous best
     • LINPACK performance of 93 Pflops, a 3x improvement over the previous best
     High efficiency of the overall system
     • 6.05 Gflops/Watt, 3 to 6 times better than Tianhe-2, Titan, and K
     Three full-scale applications selected as 2016 Gordon Bell finalists
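As a rough consistency check on the figures quoted on this slide, the arithmetic below uses only the rounded numbers given there (3 Tflops and 260 cores per chip, 125 Pflops peak, 93 Pflops LINPACK). The chip count is derived here and is not stated on the slide, so treat it as an estimate.

```latex
\[
\text{LINPACK efficiency} \approx \frac{93\ \text{Pflops}}{125\ \text{Pflops}} \approx 0.74
\]
\[
\text{chips} \approx \frac{125 \times 10^{15}\ \text{flops}}{3 \times 10^{12}\ \text{flops per chip}} \approx 4.2 \times 10^{4},
\qquad
\text{cores} \approx 4.2 \times 10^{4} \times 260 \approx 1.1 \times 10^{7}
\]
```

The estimate of roughly 11 million cores agrees with the "over 10 million cores" stated on the slide.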

  3. SW26010: General Architecture
     [Diagram: four core groups (Core Group 0 to 3), each shown with its own memory, a memory controller (iMC), an MPE, a PPU, and an 8*8 CPE mesh, connected through a NoC. Other labels: computing core, LDM, registers, Transfer Agent (TA), row communication bus, column communication bus, control network, data transfer network. Levels shown: memory level, LDM level, register level, computing level.]

  4. Earth System Modeling and HPC: the Current Computational Challenges

  5. More and more component models
     [Diagram: atmosphere model, ocean model, land model, and ice model linked through a coupler. Boundaries shown: ocean-atmosphere, ocean-ice, land-atmosphere, ice-land. Additional components: space weather, marine biology, atmospheric chemistry, land biology, dynamic ice, hydrological process, solid earth.]

  6. Increase in Spatial and Temporal Resolution to be Cloud-Resolving and Eddy-Resolving

  7. Simulation of more and more detailed physical processes
     [Figure: Simulation of Cloud Droplet Formation]

  8. Online Ensembles
     [Figure labels: TH240_N_111, TH240_CAM, TH240_ATMP3, TH240_BCC; caption: Simulation of Cloud Droplet Formation]

  9. The Gap between Software and Hardware
     100P (supercomputers) vs. 100T (models):
     • millions of lines of legacy code
     • poor scalability
     • written for multi-core, rather than many-core
     China's supercomputers
     • heterogeneous systems with many-core chips
     • millions of cores
     China's models
     • pure CPU code
     • scaling to hundreds or thousands of cores

  10. Our Research Goals
      • highly scalable framework that can efficiently utilize many-core processors
      • automated tools to deal with the legacy code
      Models: 100T~1P
      China's supercomputers
      • heterogeneous systems with many-core chips
      • millions of cores
      China's models
      • pure CPU code
      • scaling to hundreds or thousands of cores

  12. Example: Highly-Scalable Atmospheric Simulation Framework
      Application: cloud resolving
      Algorithm: cube-sphere grid or other grid; explicit, implicit, or semi-implicit method
      Architecture: Sunway, GPU, MIC, FPGA; C/C++, Fortran, MPI, CUDA, Java, …
      • Yang, Chao (Institute of Software, CAS): computational mathematics
      • Wang, Lanning (Beijing Normal University): climate modeling
      • Xue, Wei (Tsinghua University): computer science
      • Fu, Haohuan (Tsinghua University): geo-computing
      The "Best" Computational Solution

  13. • [2013 PPoPP]: 2D SWE model, 0.8m CPU-GPU cores, 0.8 Pflops on Tianhe-1A
      • [2013 FPL]: 2D SWE on one FPGA chip, a further 6~10x improvement in performance and power efficiency
      • [2014 IPDPS]: 2D SWE model, 1.6m CPU-MIC cores, 1.63 Pflops on Tianhe-2
      • [2014 TC]: 3D Nonhyd model, 1.2m CPU-MIC cores, 1.74 Pflops on Tianhe-2
      • [2016 SC]: 3D Nonhyd model, 10.6m Sunway cores, 8 Pflops on TaihuLight

  14. Our Research Goals
      • highly scalable framework that can efficiently utilize many-core accelerators
      • automated tools to deal with the legacy code
      Models: 100T~1P
      China's supercomputers
      • heterogeneous systems with many-core chips
      • millions of cores
      China's models
      • pure CPU code
      • scaling to hundreds or thousands of cores

  15. Earth System Modeling and HPC: Our Efforts on Refactoring CAM

  16. THE CESM PROJECT
      F case (atmosphere + land), G case (ocean + sea ice), B case (fully coupled)
      • Four component models, millions of lines of code
      • Large-scale run on Sunway TaihuLight
        • 24,000 MPI processes
        • Over one million cores
      • 10-20x speedup for kernels
      • 2-3x speedup for the entire model
      Tsinghua + BNU

  17. Major Challenges
      • high complexity in the application, and a heavy legacy in the code base (millions of lines of code)
      • an extremely complicated MPMD program with no hotspots (or hundreds of hotspots)
      • misfit between the in-place design philosophy and the new architecture
      • lack of people with interdisciplinary knowledge and experience

  18. Workflow of CAM
      [Diagram boxes: CAM initial, Phy_run1, Phy_run2, Dyn_run. Arrow labels: "Pass tracers (u, v) to dynamics", "Pass state variables", "Pass state variables and tracers".]
      After initialization, the physics and the dynamics are executed in turn during each simulation time-step.
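The alternation described on this slide can be sketched as a small driver loop. This is only an illustration: the routine names echo the slide's boxes (Phy_run1, Phy_run2, Dyn_run), but the bodies are empty stand-ins that count calls, the number of steps is made up, and the call order within a step (both physics phases, then dynamics) is an assumption rather than something the slide text states.

```fortran
! Hypothetical sketch of the per-time-step control flow; not the real CAM driver.
program cam_workflow_sketch
  implicit none
  integer, parameter :: nsteps = 3   ! assumed number of time steps
  integer :: step, n_phys, n_dyn

  n_phys = 0
  n_dyn = 0

  call cam_initial()                 ! one-time initialization
  do step = 1, nsteps
     call phy_run1(n_phys)           ! physics, phase 1
     call phy_run2(n_phys)           ! physics, phase 2
     call dyn_run(n_dyn)             ! dynamics
  end do

  print '(a,i0,a,i0)', 'physics calls: ', n_phys, ', dynamics calls: ', n_dyn

contains

  subroutine cam_initial()
    print '(a)', 'initialization done'
  end subroutine cam_initial

  ! The stand-ins below only count how often they are called.
  subroutine phy_run1(counter)
    integer, intent(inout) :: counter
    counter = counter + 1
  end subroutine phy_run1

  subroutine phy_run2(counter)
    integer, intent(inout) :: counter
    counter = counter + 1
  end subroutine phy_run2

  subroutine dyn_run(counter)
    integer, intent(inout) :: counter
    counter = counter + 1
  end subroutine dyn_run

end program cam_workflow_sketch
```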

  19. Porting of CAM: General Idea
      • Entire code base: 530,000 lines of code
      • Components with regular code patterns
        • e.g. the CAM-SE dynamic core
        • manual OpenACC parallelization and optimization on code and data structures
      • Components with irregular and complex code patterns
        • e.g. the CAM physics schemes
        • loop transformation tool to expose the right level of parallelism and code size
        • memory footprint analysis and reduction tool
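For a component with a regular code pattern, manual OpenACC parallelization typically means annotating a loop nest with directives. The sketch below uses standard OpenACC syntax on a made-up two-level loop; the array names, sizes, and the computation are illustrative assumptions, not CAM-SE code, and the directive set accepted by the Sunway compiler may differ from the standard one. Without an OpenACC compiler the directive is treated as a comment and the loop simply runs serially.

```fortran
! Illustrative only: a regular loop nest annotated with a standard OpenACC directive.
program openacc_sketch
  implicit none
  integer, parameter :: nelem = 8, nlev = 4   ! assumed problem sizes
  real(8) :: a(nlev, nelem), b(nlev, nelem)
  integer :: ie, k

  ! Fill the input on the host.
  do ie = 1, nelem
     do k = 1, nlev
        a(k, ie) = real(k + ie, 8)
     end do
  end do

  ! Both loops are independent, so they can be collapsed into one parallel
  ! iteration space; a is copied to the device and b is copied back.
  !$acc parallel loop collapse(2) copyin(a) copyout(b)
  do ie = 1, nelem
     do k = 1, nlev
        b(k, ie) = 2.0d0 * a(k, ie)
     end do
  end do

  ! Check the result against the same expression evaluated on the host.
  if (maxval(abs(b - 2.0d0 * a)) > 1.0d-12) error stop 'unexpected result'
  print '(a)', 'loop nest result verified'
end program openacc_sketch
```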

  20. Refactoring the Euler Step
      Outline of Euler_step:
      • do ie = nets, nete: compute Q min/max values for lim8; compute biharmonic mixing term f (code group 1)
      • do ie = nets, nete: 2D advection step; data packing (code group 2)
      • boundary exchange
      • do ie = nets, nete: data extracting
      Code group 1 (loops over ie = nets, nete; k = 1, nlev; q = 1, qsize):
      • dp(k) = func_1(); Qtens(k,q,ie) = func_2(dp(k))
      • qmin(k,q,ie) = …; qmax(k,q,ie) = …
      • dp0 = func_3(); dpdiss = func_4(); Qtens(k,q,ie) = func_5(dp0, dpdiss)
      Code group 2 (loops over ie = nets, nete; k = 1, nlev; q = 1, qsize):
      • dp(k) = func_5(); Vstar(k) = func_6(); Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
      • dp_star(k) = func_8(dp(k)); Qtens(k,q,ie) = func_9(dp_star(k))
      • data packing

  21. Refactoring the Euler Step
      Original versus optimized statements (each inside loops over ie = nets, nete; k = 1, nlev; q = 1, qsize):
      • original: dp(k) = func_1(); Qtens(k,q,ie) = func_2(dp(k))
        optimized: Qtens(k,q,ie) = func_2(func_1())
      • original: dp0 = func_3(); dpdiss = func_4(); Qtens(k,q,ie) = func_5(dp0, dpdiss)
        optimized: Qtens(k,q,ie) = func_5(func_3(), func_4())
      • original: dp(k) = func_5(); Vstar(k) = func_6(); Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
        optimized: Qtens(k,q,ie) = func_7(func_5(), func_6())
      • original: dp_star(k) = func_8(dp(k)); Qtens(k,q,ie) = func_9(dp_star(k))
        optimized: Qtens(k,q,ie) = func_9(func_8(func_5()))
      • qmin(k,q,ie) = … and qmax(k,q,ie) = … appear in both versions, as does the data packing step
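The first transformation on this slide, replacing the temporary dp(k) with a direct call so that Qtens(k,q,ie) = func_2(func_1()), can be illustrated with a small self-checking program. The bodies of func_1 and func_2, their arguments, the loop bounds, and the loop ordering are all hypothetical placeholders chosen for the sketch; the slide only gives the names.

```fortran
! Sketch of fusing a producer loop into its consumer loop nest.
program euler_fusion_sketch
  implicit none
  integer, parameter :: nets = 1, nete = 4, nlev = 8, qsize = 3  ! assumed sizes
  real(8) :: dp(nlev)
  real(8) :: qt_ref(nlev, qsize, nets:nete), qt_fused(nlev, qsize, nets:nete)
  integer :: ie, k, q

  ! Original form: fill the temporary dp(k), then read it in a second loop nest.
  do ie = nets, nete
     do k = 1, nlev
        dp(k) = func_1(k, ie)
     end do
     do q = 1, qsize
        do k = 1, nlev
           qt_ref(k, q, ie) = func_2(dp(k), q)
        end do
     end do
  end do

  ! Fused form: a single loop nest with no temporary array.
  do ie = nets, nete
     do q = 1, qsize
        do k = 1, nlev
           qt_fused(k, q, ie) = func_2(func_1(k, ie), q)
        end do
     end do
  end do

  ! Both forms must produce the same tendencies.
  if (maxval(abs(qt_ref - qt_fused)) > 1.0d-12) error stop 'fused result differs'
  print '(a)', 'fused and original loop nests agree'

contains

  ! Placeholder for the slide's func_1 (arguments are an assumption).
  pure function func_1(k, ie) result(v)
    integer, intent(in) :: k, ie
    real(8) :: v
    v = 1.0d0 + 0.1d0 * real(k, 8) + 0.01d0 * real(ie, 8)
  end function func_1

  ! Placeholder for the slide's func_2 (arguments are an assumption).
  pure function func_2(dp_k, q) result(v)
    real(8), intent(in) :: dp_k
    integer, intent(in) :: q
    real(8) :: v
    v = real(q, 8) / dp_k
  end function func_2

end program euler_fusion_sketch
```

The fused nest removes the intermediate array at the cost of recomputing func_1 once per tracer q, so it trades extra arithmetic for a smaller memory footprint.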
