Refactoring and Optimizing the Community Atmosphere Model (CAM) on - PowerPoint PPT Presentation

Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer Haohuan Fu haohuan@tsinghua.edu.cn CESS, Tsinghua University National Supercomputing Center in Wuxi Oct 5 th 2016 @ CCDSC

Sunway TaihuLight: an Overview Homegrown many-core processor: SW26010 • 260 cores per chip • 3 Tflops The first system in the world that provides over 100 Pflops performance with over 10 million cores • theoretical peak 125 Pflops, 2.5 times improvement over before • LINPACK performance 93 Pflops, 3 times improvement over before High efficiency of the overall system • 6.05 Gflops/Watt, 3 to 6 times improvement over Tianhe-2, Titan, and K Three full-scale applications elected as 2016 Gordon Bell finalists

SW26010: General Architecture Memory Memory iMC iMC Core Group 0 Core Group 1 Memory Level 8*8 CPE Mesh PPU PPU Computing Row Core Communication 8*8 CPE 8*8 CPE Bus MPE MPE Mesh Mesh LDM Level Registers NoC Data Transfer LDM Network 8*8 CPE 8*8 CPE MPE MPE Register Level Mesh Mesh Transfer Agent (TA) PPU PPU Control Column iMC iMC Core Group 3 Core Group 2 Network Communication Bus Computing Level Memory Memory

Earth System Modeling and HPC: the Current Computational Challenges

More and more component models ocean- atmosphere marine ocean space atmosphere boundary biology model model weather ocean-ice land-atmosphere atmospheric coupler boundary boundary chemistry land land model ice model dynamic ice biology ice-land boundary hydrological solid earth process

Increase in Spatial and Temporal Resolution to be Cloud-Resolving and Eddy-Resolving

Simulation of more and more detailed physics processes Simulation of Cloud Droplet Formation

Online Ensembles TH240_N_111 TH240_CAM Simulation of Cloud Droplet Formation TH240_ATMP3 TH240_BCC

The Gap between Software and Hardware 100P • millions lines of legacy code • poor scalability • written for multi-core, rather than many-core 100T China’s supercomputers China’s models • heterogeneous systems • pure CPU code with many-core chips • scaling to hundreds or • millions of cores thousands of cores 10

Our Research Goals • highly scalable framework that can efficiently utilize many-core processors • automated tools to deal with the legacy code 100T~1P China’s supercomputers China’s models • heterogeneous systems • pure CPU code with many-core chips • scaling to hundreds or • millions of cores thousands of cores 11

Our Research Goals • highly scalable framework that can efficiently utilize many-core processors • automated tools to deal with the legacy code 100T~1P China’s supercomputers China’s models • heterogeneous systems • pure CPU code with many-core chips • scaling to hundreds or • millions of cores thousands of cores 12

Example: Highly-Scalable Atmospheric Simulation Framework Yang, Chao Institute of Software, CAS cube-sphere grid or computational mathematics cloud resolving other grid explicit, implicit, or Wang, Lanning semi-implicit Beijing Normal University method climate modeling Application Algorithm Xue, Wei Tsinghua University computer science Architecture Fu, Haohuan Tsinghua University Sunway, GPU, MIC, geo-computing FPGA C/C++, Fortran, MPI, CUDA, Java, … The “Best” Computational Solution 13

[2014 IPDPS]: 2D SWE model 1.6m CPU-MIC cores 1.63 Pflops on Tianhe-2 [2013 PPoPP]: 2D SWE model 0.8m CPU-GPU cores 0.8 Pflops on Tianhe-1A [2016 SC]: 3D Nonhyd model 10.6m Sunway cores 8 Pflops on TiahuLight [2013 FPL]: [2014 TC]: 2D SWE on one FPGA chip 3D Nonhyd model a further 6~10x improvement 1.2m CPU-MIC cores on performance and power 1.74 Pflops on Tianhe-2 efficiency

Our Research Goals • highly scalable framework that can efficiently utilize many-core accelerators • automated tools to deal with the legacy code 100T~1P China’s supercomputers China’s models • heterogeneous systems • pure CPU code with many-core chips • scaling to hundreds or • millions of cores thousands of cores 15

Earth System Modeling and HPC: Our Efforts on Refactoring CAM

THE CESM PROJECT F 算例（大气 + 陆面） G 算例（海洋 + 海冰） B 算例（全耦合） • Four component models, millions lines of code • Large-scale run on Sunway TaihuLight • 24,000 MPI processes • Over one million cores • 10-20x speedup for kernels Tsinghua + BNU • 2-3x speedup for the entire model

Major Challenges a high complexity in application, and a heavy legacy in the code base (millions lines of code) an extremely complicated MPMD program with no hotspots (or hundreds of hotspots) misfit between the in-place design philosophy and the new architecture lack of people with interdisciplinary knowledge and experience

Workflow of CAM Pass tracers (u, v) to dynamics CAM Phy_run Phy_run Dyn_run initial 1 2 Pass state Pass state variables and variables tracers After initialization, the physics and the dynamics are executed in turn during each simulation time-step.

Porting of CAM: General Idea n Entire code base: 530, 000 lines of code n Components with regular code patterns q e.g. the CAM-SE dynamic core q manual OpenACC parallelization and optimization on code and data structures n Components with irregular and complex code patterns q e.g. the CAM physics schemes q loop transformation tool to expose the right level of parallelism and code size q memory footprint analysis and reduction tool

Refactoring the Euler Step Euler_step : do ie = nets, nete do ie = nets, nete do k = 1, nlev do q = 1, qsize dp(k) = func_1() do ie = nets, nete do k = 1, nlev do q = 1, qsize compute Q min/max values for lim8 …. Qtens(k,q,ie) = func_2(dp(k)) compute Biharmonic mixing term f end do end do end do end do end do end do end do do ie = nets, nete do ie = nets, nete 2D advection step do ie = nets, nete do k = 1, nlev data packing do k = 1, nlev dp0 = func_3() end do do q = 1, qsize dpdiss = func_4() qmin(k,q,ie) = … do q = 1, qsize Bonundary exchange qmax(k,q,ie) = … Qtens(k,q,ie) = end do func_5(dp0, dpdiss) Data extracting end do end do 1 end do end do end do do k = 1, nlev do ie = nets, nete dp_star(k) = func_8(dp(k)) do k = 1, nlev end do dp(k) = func_5() Vstar(k) = func_6() do k = 1, nlev end do Qtens(k,q,ie) = do q = 1, qsize func_9(dp_star(k)) do k = 1, nlev end do Qtens(k,q,ie) = end do func_7(dp(k), Vstar(k)) Data packing end do end do 2

Refactoring the Euler Step do ie = nets, nete do ie = nets, nete do q = 1, qsize do ie = nets, nete do ie = nets, nete do k = 1, nlev do k = 1, nlev do k = 1, nlev do q = 1, qsize dp(k) = func_1() …. do q = 1, qsize do k = 1, nlev do q = 1, qsize end do Qtens(k,q,ie) = …. Qtens(k,q,ie) = end do func_2(func_1()) end do func_2(dp(k)) end do end do end do end do end do end do end do do ie = nets, nete end do end do do k = 1, nlev do ie = nets, nete optimized: dp0 = func_3() do ie = nets, nete do k = 1, nlev do ie = nets, nete dpdiss = do k = 1, nlev do q = 1, qsize do k = 1, nlev func_4() do q = 1, qsize Qtens(k,q,ie) = do q = 1, qsize do q = 1, qsize qmin(k,q,ie) = … qmin(k,q,ie) = … Qtens(k,q,ie) qmax(k,q,ie) = … func_5(func_3(),func_4()) qmax(k,q,ie) = … = func_5(dp0, dpdiss) end do end do end do end do end do end do end do end do end do end do end do end do do k = 1, nlev do k = 1, nlev dp_star(k) = do ie = nets, nete Qtens(k,q,ie) = func_8(dp(k)) do k = 1, nlev func_9(func_8(func_5())) end do do ie = nets, nete dp(k) = func_5() end do do q = 1, qsize Vstar(k) = func_6() end do do k = 1, nlev do k = 1, nlev end do Data packing Qtens(k,q,ie) Qtens(k,q,ie) = end do = func_7(func_5(),func_6()) do q = 1, qsize end do do k = 1, nlev func_9(dp_star(k)) Qtens(k,q,ie) = 3 end do func_7(dp(k), Vstar(k)) end do 2 end do Data packing end do

Refactoring and Optimizing the Community Atmosphere Model (CAM) on - PowerPoint PPT Presentation

Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer Haohuan Fu haohuan@tsinghua.edu.cn CESS, Tsinghua University National Supercomputing Center in Wuxi Oct 5 th 2016 @ CCDSC Sunway

Refactoring Your Code A Key Step to Agility Venkat Subramaniam (svenkat@cs.uh.edu)

Design Patterns & Refactoring Introduction to Refactoring Oliver Haase HTWG Konstanz Oliver

Constraint-Based Refactoring Rename Field Problem Proven Correct Solution Constraint- Based

A Refactoring Approach for Optimizing Mobile Networks Ashwin Rao University of Helsinki 1 A

COMMUNITY MANAGEMENT jono bacon COMMUNITY COMMUNITY COMMUNITY COMMUNITY COMMUNITY COMMUNITY

Formation of the Earth Formation of the Earth s Atmosphere s Atmosphere and Oceans and

MODIS Atmosphere Products MODIS Atmosphere Products Michael D. King Michael D. King NASA

Optimizing monitoring networks for Optimizing monitoring networks for Optimizing monitoring

Lecture 4 Based on Fowler Refactoring and UWaterloo slides Dr. Tom Way CSC 4700 1

How to get away with murder refactoring @qcmaude This is me (and Chewbacca). I work at &

Refactoring, Refinement, and Reasoning A Logical Characterization for Hybrid Systems Stefan Mitsch

Automated T esting of Refactoring Engines Brett Daniel Danny Dig Kely Garcia Darko Marinov

Refactoring Noun: A change made to the internal structure of Refactoring software to make

Refactoring functional programs Simon Thompson, Claus Reinke Computing Laboratory, University of

Refactoring R Programs Tobias Verbeke Business & Decision 2008-08-12 Plan of the

Refactoring Fundamentals The Gilded Rose Refactoring Kata Steve Smith Ardalis.com @ardalis The

PERCEPTION IS IS Market is headed to the bottom of the current real estate cycle with

Sunway Architecture 1,3 2 1,3 1,3 Xinliang Wang , Weifeng Liu, Wei Xue, Li Wu 2 1 3 Ou

Architecting a Kotlin JVM and JS multiplatform project FELIPE LIMA / OCT 4TH, 2018 / KOTLINCONF

High-Performance Physics Solver Design for Next Generation Consoles Vangelis Kokkevis Steven

Task Superscalar: Using Processors as Functional Units Yoav Etsion Alex Ramirez Rosa M.

Total Work-Flow: Exploiting Hybrid Computing Architectures for Scientific Computing ScicomP 15

Structure Groups (and Rings) Wolfgang Rump Instead of Groups and associated Structures I

Full System Simulator Simulates different new IBM architectures like PERCS, PowerPC 970 and

Refactoring and Optimizing the Community Atmosphere Model (CAM) on - PowerPoint PPT Presentation

Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer Haohuan Fu haohuan@tsinghua.edu.cn CESS, Tsinghua University National Supercomputing Center in Wuxi Oct 5 th 2016 @ CCDSC Sunway

Refactoring Your Code A Key Step to Agility Venkat Subramaniam (svenkat@cs.uh.edu)

Design Patterns &amp; Refactoring Introduction to Refactoring Oliver Haase HTWG Konstanz Oliver

Constraint-Based Refactoring Rename Field Problem Proven Correct Solution Constraint- Based

A Refactoring Approach for Optimizing Mobile Networks Ashwin Rao University of Helsinki 1 A

COMMUNITY MANAGEMENT jono bacon COMMUNITY COMMUNITY COMMUNITY COMMUNITY COMMUNITY COMMUNITY

Formation of the Earth Formation of the Earth s Atmosphere s Atmosphere and Oceans and

MODIS Atmosphere Products MODIS Atmosphere Products Michael D. King Michael D. King NASA

Optimizing monitoring networks for Optimizing monitoring networks for Optimizing monitoring

Lecture 4 Based on Fowler Refactoring and UWaterloo slides Dr. Tom Way CSC 4700 1

How to get away with murder refactoring @qcmaude This is me (and Chewbacca). I work at &amp;

Refactoring, Refinement, and Reasoning A Logical Characterization for Hybrid Systems Stefan Mitsch

Automated T esting of Refactoring Engines Brett Daniel Danny Dig Kely Garcia Darko Marinov

Refactoring Noun: A change made to the internal structure of Refactoring software to make

Refactoring functional programs Simon Thompson, Claus Reinke Computing Laboratory, University of

Refactoring R Programs Tobias Verbeke Business &amp; Decision 2008-08-12 Plan of the

Refactoring Fundamentals The Gilded Rose Refactoring Kata Steve Smith Ardalis.com @ardalis The

PERCEPTION IS IS Market is headed to the bottom of the current real estate cycle with

Sunway Architecture 1,3 2 1,3 1,3 Xinliang Wang , Weifeng Liu, Wei Xue, Li Wu 2 1 3 Ou

Architecting a Kotlin JVM and JS multiplatform project FELIPE LIMA / OCT 4TH, 2018 / KOTLINCONF

High-Performance Physics Solver Design for Next Generation Consoles Vangelis Kokkevis Steven

Task Superscalar: Using Processors as Functional Units Yoav Etsion Alex Ramirez Rosa M.

Total Work-Flow: Exploiting Hybrid Computing Architectures for Scientific Computing ScicomP 15

Structure Groups (and Rings) Wolfgang Rump Instead of Groups and associated Structures I

Full System Simulator Simulates different new IBM architectures like PERCS, PowerPC 970 and

Design Patterns & Refactoring Introduction to Refactoring Oliver Haase HTWG Konstanz Oliver

How to get away with murder refactoring @qcmaude This is me (and Chewbacca). I work at &

Refactoring R Programs Tobias Verbeke Business & Decision 2008-08-12 Plan of the