SLIDE 1

Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer

Haohuan Fu (haohuan@tsinghua.edu.cn)
CESS, Tsinghua University; National Supercomputing Center in Wuxi
Oct 5th, 2016 @ CCDSC

SLIDE 2

Sunway TaihuLight: an Overview

Homegrown many-core processor: SW26010

  • 260 cores per chip
  • about 3 Tflops of peak performance per chip

The first system in the world to provide over 100 Pflops of performance with over 10 million cores

  • theoretical peak of 125 Pflops, a 2.5x improvement over the previous No. 1 system (Tianhe-2)
  • LINPACK performance of 93 Pflops, a 3x improvement over the previous No. 1 system

High efficiency of the overall system

  • 6.05 Gflops/Watt, a 3 to 6x improvement over Tianhe-2, Titan, and the K computer

Three full-scale applications selected as 2016 Gordon Bell finalists
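A quick consistency check of the headline numbers above, using the published system configuration of 40,960 SW26010 chips (a figure that is not on the slide):

    40,960 chips × 260 cores/chip = 10,649,600 cores (the "over 10 million cores")
    40,960 chips × ~3.06 Tflops/chip ≈ 125 Pflops theoretical peak
    93 Pflops / 125 Pflops ≈ 74% LINPACK efficiency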

SLIDE 3

SW26010: General Architecture

[Figure: SW26010 general architecture. Each chip holds four core groups (CG 0-3) connected by a network-on-chip (NoC); each core group contains one management processing element (MPE), an 8*8 mesh of computing processing elements (CPEs), a protocol processing unit (PPU), and an integrated memory controller (iMC) with its own attached memory, linked by a data transfer network. Within the CPE mesh, each computing core has its own registers and local data memory (LDM); the cores communicate over row and column communication buses under a control network, with a transfer agent (TA) handling data movement. The hierarchy thus spans four levels: memory, LDM, register, and computing.]

SLIDE 4

Earth System Modeling and HPC: the Current Computational Challenges

SLIDE 5

More and more component models

[Figure: schematic of a coupled Earth system model. A central coupler connects the atmosphere model (with atmospheric chemistry), the ocean model (with marine biology), the sea ice model (with dynamic ice), and the land model (with hydrological processes and land biology), alongside emerging components such as space weather and solid earth. The component interfaces include the ocean-ice, land-atmosphere, ice-land, and ocean-atmosphere boundaries.]

SLIDE 6

Increase in Spatial and Temporal Resolution to be Cloud-Resolving and Eddy-Resolving

SLIDE 7

Simulation of more and more detailed physics processes: e.g., the simulation of cloud droplet formation

SLIDE 8

Online Ensembles

[Figure: online ensemble experiments, with members TH240_CAM, TH240_BCC, TH240_N_111, and TH240_ATMP3.]

SLIDE 9

SLIDE 10

The Gap between Software and Hardware

China's supercomputers (now at the 100 Pflops scale):

  • heterogeneous systems with many-core chips
  • millions of cores

China's models (still at the 100 Tflops scale):

  • pure CPU code
  • scaling to hundreds or thousands of cores
  • millions of lines of legacy code
  • poor scalability
  • written for multi-core rather than many-core processors

SLIDE 11

Our Research Goals

China's supercomputers:

  • heterogeneous systems with many-core chips
  • millions of cores

China's models:

  • pure CPU code
  • scaling to hundreds or thousands of cores (100T~1P)

Goals:

  • a highly scalable framework that can efficiently utilize many-core processors
  • automated tools to deal with the legacy code
SLIDE 12

SLIDE 13

Example: Highly-Scalable Atmospheric Simulation Framework

The "best" computational solution comes from co-design across three layers:

  • Application: cloud-resolving simulation
  • Algorithm: explicit, implicit, or semi-implicit methods; cube-sphere or other grids
  • Architecture: Sunway, GPU, MIC, FPGA; programmed in C/C++, Fortran, MPI, CUDA, Java, …

The team:

  • Wang, Lanning (Beijing Normal University): climate modeling
  • Yang, Chao (Institute of Software, CAS): computational mathematics
  • Xue, Wei (Tsinghua University): computer science
  • Fu, Haohuan (Tsinghua University): geo-computing

SLIDE 14

Milestones of the framework:

  • [2013 PPoPP] 2D SWE model: 0.8M CPU-GPU cores, 0.8 Pflops on Tianhe-1A
  • [2013 FPL] 2D SWE on one FPGA chip: a further 6~10x improvement in performance and power efficiency
  • [2014 IPDPS] 2D SWE model: 1.6M CPU-MIC cores, 1.63 Pflops on Tianhe-2
  • [2014 TC] 3D nonhydrostatic model: 1.2M CPU-MIC cores, 1.74 Pflops on Tianhe-2
  • [2016 SC] 3D nonhydrostatic model: 10.6M Sunway cores, 8 Pflops on TaihuLight

SLIDE 15

SLIDE 16

Earth System Modeling and HPC: Our Efforts on Refactoring CAM

SLIDE 17

Tsinghua + BNU

CESM experiment cases: F case (atmosphere + land), G case (ocean + sea ice), B case (fully coupled)

  • four component models, millions of lines of code
  • large-scale runs on Sunway TaihuLight
  • 24,000 MPI processes
  • over one million cores
  • 10-20x speedup for kernels
  • 2-3x speedup for the entire model

SLIDE 18

Major Challenges

  • high complexity in the application, and a heavy legacy code base (millions of lines of code)
  • an extremely complicated MPMD program with no single hotspot (or rather, hundreds of hotspots)
  • a mismatch between the design philosophy already in place and the new architecture
  • a lack of people with interdisciplinary knowledge and experience

SLIDE 19

Workflow of CAM

[Figure: after initialization (CAM initial), each simulation time step runs Dyn_run, Phy_run1, and Phy_run2 in turn: the dynamics passes state variables to Phy_run1, Phy_run1 passes state variables and tracers to Phy_run2, and Phy_run2 passes the tracers (u, v) back to the dynamics.]

After initialization, the physics and the dynamics are executed in turn during each simulation time step.
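As a minimal runnable sketch of this control flow (the routine names mirror the diagram, but the bodies are placeholders, not actual CAM code):

    ! Sketch of the CAM time-stepping structure described above; the
    ! routine bodies are stand-ins for the real dynamics and physics.
    program cam_workflow_sketch
      implicit none
      integer :: step
      integer, parameter :: nsteps = 3   ! illustrative step count
      call cam_initial()
      do step = 1, nsteps
         call dyn_run()      ! dynamics: advance the state variables
         call phy_run1()     ! physics, part 1: state variables and tracers in
         call phy_run2()     ! physics, part 2: tracers (u, v) back to dynamics
      end do
    contains
      subroutine cam_initial()
         print *, 'initialize'
      end subroutine
      subroutine dyn_run()
         print *, 'dynamics step'
      end subroutine
      subroutine phy_run1()
         print *, 'physics step 1'
      end subroutine
      subroutine phy_run2()
         print *, 'physics step 2'
      end subroutine
    end program cam_workflow_sketch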

SLIDE 20

Porting of CAM: General Idea

  • Entire code base: 530,000 lines of code
  • Components with regular code patterns
      - e.g. the CAM-SE dynamic core
      - manual OpenACC parallelization and optimization of code and data structures (see the sketch after this list)
  • Components with irregular and complex code patterns
      - e.g. the CAM physics schemes
      - a loop transformation tool to expose the right level of parallelism and code size
      - a memory footprint analysis and reduction tool
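A minimal sketch of such a manually annotated loop nest (illustrative routine and variable names, not actual CAM-SE code; the real port targeted Sunway's customized OpenACC compiler):

    ! Illustrative OpenACC parallelization of a regular, CAM-SE-style loop
    ! nest: the element (ie) and tracer (q) loops are collapsed to expose
    ! enough independent iterations for the 64 CPEs of a core group.
    subroutine scale_tracers(Qtens, dp, nets, nete, nlev, qsize)
      implicit none
      integer, intent(in) :: nets, nete, nlev, qsize
      real, intent(in)    :: dp(nlev, nets:nete)
      real, intent(inout) :: Qtens(nlev, qsize, nets:nete)
      integer :: ie, q, k
      !$ACC PARALLEL LOOP COLLAPSE(2)
      do ie = nets, nete
         do q = 1, qsize
            do k = 1, nlev
               Qtens(k, q, ie) = Qtens(k, q, ie) * dp(k, ie)
            end do
         end do
      end do
    end subroutine scale_tracers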

SLIDE 21

Refactoring the Euler Step (stages 1 and 2)

Stage 1, the high-level structure of Euler_step:

    do ie = nets, nete
       compute Q min/max values for lim8
       compute the biharmonic mixing term
    end do
    do ie = nets, nete
       2D advection step
       data packing
    end do
    boundary exchange
    data extracting

Stage 2, the original loop nests of the 2D advection step:

    do ie = nets, nete
       do k = 1, nlev
          dp(k) = func_1()
          do q = 1, qsize
             Qtens(k,q,ie) = func_2(dp(k))
          end do
       end do
    end do

    do ie = nets, nete
       do k = 1, nlev
          do q = 1, qsize
             qmin(k,q,ie) = ...
             qmax(k,q,ie) = ...
          end do
       end do
    end do

    do ie = nets, nete
       do q = 1, qsize
          do k = 1, nlev
             ...
          end do
       end do
    end do

    do ie = nets, nete
       do k = 1, nlev
          dp0 = func_3()
          dpdiss = func_4()
          do q = 1, qsize
             Qtens(k,q,ie) = func_5(dp0, dpdiss)
          end do
       end do
    end do

    do ie = nets, nete
       do k = 1, nlev
          dp(k) = func_5()
          Vstar(k) = func_6()
       end do
       do q = 1, qsize
          do k = 1, nlev
             Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
          end do
          do k = 1, nlev
             dp_star(k) = func_8(dp(k))
          end do
          do k = 1, nlev
             Qtens(k,q,ie) = func_9(dp_star(k))
          end do
       end do
       data packing
    end do

SLIDE 22

Refactoring the Euler Step (stage 2 to stage 3)

Starting from the original loop nests (stage 2, previous slide), the scalar and vector temporaries (dp, dp0, dpdiss, Vstar, dp_star) are inlined into the innermost statements, so each loop nest becomes perfectly nested over ie, k, and q. Optimized (stage 3):

    do ie = nets, nete
       do k = 1, nlev
          do q = 1, qsize
             Qtens(k,q,ie) = func_2(func_1())
          end do
       end do
    end do

    do ie = nets, nete
       do k = 1, nlev
          do q = 1, qsize
             qmin(k,q,ie) = ...
             qmax(k,q,ie) = ...
          end do
       end do
    end do

    do ie = nets, nete
       do q = 1, qsize
          do k = 1, nlev
             ...
          end do
       end do
    end do

    do ie = nets, nete
       do k = 1, nlev
          do q = 1, qsize
             Qtens(k,q,ie) = func_5(func_3(),func_4())
          end do
       end do
    end do

    do ie = nets, nete
       do q = 1, qsize
          do k = 1, nlev
             Qtens(k,q,ie) = func_7(func_5(),func_6())
          end do
          do k = 1, nlev
             Qtens(k,q,ie) = func_9(func_8(func_5()))
          end do
       end do
       data packing
    end do

SLIDE 23

Refactoring the Euler Step (stages 3 to 6)

From the fused version (stage 3, previous slide), the remaining statements are merged into two combined loop nests plus the data packing (stage 4):

    do ie = nets, nete
       do k = 1, nlev
          do q = 1, qsize
             qmin(k,q,ie) = ...
             qmax(k,q,ie) = ...
             Qtens(k,q,ie) = ...
          end do
       end do
    end do

    do ie = nets, nete
       do k = 1, nlev
          do q = 1, qsize
             Qtens(k,q,ie) = ...
          end do
       end do
    end do

    data packing

The k and q loops are then interchanged, so the parallel q dimension sits directly under ie (stage 5):

    do ie = nets, nete
       do q = 1, qsize
          do k = 1, nlev
             qmin(k,q,ie) = ...
             qmax(k,q,ie) = ...
             Qtens(k,q,ie) = ...
          end do
       end do
    end do

    do ie = nets, nete
       do q = 1, qsize
          do k = 1, nlev
             Qtens(k,q,ie) = ...
          end do
       end do
    end do

    data packing

Finally, the ie and q loops are collapsed into a single ie_q loop, and each nest is offloaded to the CPEs with an OpenACC directive (stage 6):

    !$ACC PARALLEL LOOP
    do ie_q = 1, qsize*(nete-nets)
       do k = 1, nlev
          q = func(ie_q)
          ie = func(ie_q)
          qmin(k,q,ie) = ...
          qmax(k,q,ie) = ...
          Qtens(k,q,ie) = ...
       end do
    end do

    !$ACC PARALLEL LOOP
    do ie_q = 1, qsize*(nete-nets)
       do k = 1, nlev
          q = func(ie_q)
          ie = func(ie_q)
          Qtens(k,q,ie) = ...
       end do
    end do

    !$ACC PARALLEL LOOP
    data packing

SLIDE 24

Refactoring of the Physics Schemes

[Figure: data layout of the physics schemes. Grid columns (col) are grouped into chunks, and each column carries pver vertical levels.]

SLIDE 25

A Loop Transformation Tool: Typical Scenario

Before the transformation, the loop over i sits outside the call, so each invocation of F works on a single column:

    do i = 1, m
       call F(A, B(i), C(i,:))
    end do

    subroutine F(A, B, C)
       ! parameter declaration
       real :: A, B
       real, dimension(:) :: C
       ! local variable declaration
       real :: X, Y
       ! execution
       X = 1
       Y = 1
       call lower1(X, C)
       call lower2(Y, C)
       B = X + Y
       C(:) = C(:) + X*Y
    end subroutine

After the transformation, the tool pushes the i loop into F, promotes the locals X and Y (and the argument B) to arrays over i, and distributes the loop around the lower-level calls:

    call F(A, B(:), C(:,:), m)

    subroutine F(A, B, C, m)
       ! parameter declaration
       real :: A
       real, dimension(:) :: B
       real, dimension(:,:) :: C
       integer :: m
       ! local variable declaration
       real, dimension(m) :: X, Y
       integer :: i
       ! execution
       do i = 1, m
          X(i) = 1
          Y(i) = 1
          call lower1(X(i), C(i,:))
       end do
       do i = 1, m
          call lower2(Y(i), C(i,:))
          B(i) = X(i) + Y(i)
          C(i,:) = C(i,:) + X(i)*Y(i)
       end do
    end subroutine

SLIDE 26

Loop Transformation for Phys_run1

Stage 1: the original chunk loop wraps the whole physics driver (percentages are shares of the physics runtime):

    do begin_chunk to end_chunk
       tphysbc() {
          convect_deep_tend     (6.47%)
          convect_shallow_tend  (15.57%)
          macrop_driver_tend    (8.38%)
          microp_aero_run       (4.29%)
          microp_driver_tend    (7.13%)
          aerosol_wet_intr      (4.29%)
          convect_deep_tend_2   (0.51%)
          radiation_tend        (54.07%)
       }
    enddo

Stage 2: the chunk loop is pushed inside tphysbc:

    tphysbc() {
       do begin_chunk to end_chunk
          convect_deep_tend ... radiation_tend
       enddo
    }

Stage 3: the chunk loop is distributed, one loop per physics routine:

    tphysbc() {
       do begin_chunk to end_chunk
          convect_deep_tend (6.47%)
       enddo
       ...
       do begin_chunk to end_chunk
          microp_driver_tend (7.13%)
       enddo
       ...
       do begin_chunk to end_chunk
          radiation_tend (54.07%)
       enddo
    }

Stage 4: zooming into one routine, the chunk loop still wraps the whole call tree:

    do begin_chunk to end_chunk
       convect_deep_tend (6.47%) {
          zm_conv_tend (6.47%) {
             zm_convr    (2.03%)
             zm_conv_evap()
             montran()
             convtranc   (0.06%)
          }
       }
    enddo

Stage 5: the chunk loop is pushed down and distributed over the leaf routines:

    convect_deep_tend (6.47%) {
       zm_conv_tend (6.47%) {
          do begin_chunk to end_chunk
             zm_convr (2.03%)
          enddo
          do begin_chunk to end_chunk
             zm_conv_evap()
          enddo
          do begin_chunk to end_chunk
             montran()
          enddo
          do begin_chunk to end_chunk
             convtranc (0.06%)
          enddo
       }
    }
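A compact, runnable sketch of the distribution step, with placeholder routines standing in for the physics schemes (hypothetical names, not the actual CAM code):

    ! The chunk loop is distributed over the calls, so each physics routine
    ! becomes its own loop nest that can be offloaded independently.
    module phys_sketch
      implicit none
    contains
      subroutine leaf_a(c)              ! stands in for, e.g., convect_deep_tend
         integer, intent(in) :: c
         print *, 'leaf_a on chunk', c
      end subroutine leaf_a
      subroutine leaf_b(c)              ! stands in for, e.g., radiation_tend
         integer, intent(in) :: c
         print *, 'leaf_b on chunk', c
      end subroutine leaf_b
    end module phys_sketch

    program driver
      use phys_sketch
      implicit none
      integer :: c
      integer, parameter :: begin_chunk = 1, end_chunk = 4
      ! before: do c = begin_chunk, end_chunk
      !            call leaf_a(c); call leaf_b(c)
      !         end do
      ! after: the loop is distributed, one loop per routine
      do c = begin_chunk, end_chunk
         call leaf_a(c)
      end do
      do c = begin_chunk, end_chunk
         call leaf_b(c)
      end do
    end program driver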

SLIDE 27

Variable Storage Space Analysis and Reduction Tool

Basic functions:

  • estimate the storage requirements of the variables and arrays
  • identify the lifespans of the variables and arrays
  • determine whether the variables and arrays of each CPE thread can fit into the 64KB SPM

Example: the original Fortran function accesses 7 intermediate arrays (A to G) during the computation. By analyzing the lifespans of these 7 arrays (annotated by the lines above the arrays in the figure), the tool determines that 4 arrays provide sufficient space to hold all 7 across the different stages of execution.
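A hypothetical illustration of the idea (not the tool's actual output): seven logical intermediates A to G, whose lifespans do not all overlap, share four physical buffers, shrinking the per-thread working set toward the 64KB LDM budget.

    ! Seven logical arrays (A..G) mapped onto four physical buffers based
    ! on lifespan analysis; comments track which logical array lives where.
    subroutine storage_sketch(x, y, n)
      implicit none
      integer, intent(in)  :: n
      real,    intent(in)  :: x(n)
      real,    intent(out) :: y(n)
      real :: buf1(n), buf2(n), buf3(n), buf4(n)
      buf1 = 2.0 * x                    ! A -> buf1
      buf2 = x + 1.0                    ! B -> buf2
      buf3 = buf1 * buf2                ! C -> buf3
      buf4 = buf1 - buf2                ! D -> buf4 (last use of A and B)
      buf1 = buf3 + buf4                ! E reuses A's buffer
      buf2 = buf3 * buf4                ! F reuses B's buffer (last use of C, D)
      buf3 = buf1 / (1.0 + buf2*buf2)   ! G reuses C's buffer
      y = buf3
    end subroutine storage_sketch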

SLIDE 28

Speedup of Major Kernels in CAM-SE

7x to 22x speedup for compute-intensive kernels; 2x to 7x speedup for memory-bound kernels

SLIDE 29

Speedup of Major Kernels in CAM-PHY

The microp_mg1_0 kernel achieves a significant 14x speedup, as its intermediate variables and arrays fit nicely into the SPM of the CPE clusters after the automated optimizations.

SLIDE 30

CAM model: scalability and speedup

[Figure: simulation speed, in model years per day (MYPD), versus the number of CGs (from 1,024 to 24,000; each CG includes 1 MPE and 64 CPEs), for three configurations: MPE only, MPE+CPE for the dynamic core, and MPE+CPE for both the dynamic core and the physics schemes. Measured speeds range from 0.04 MYPD up to 2.81 MYPD for the fully accelerated model at 24,000 CGs.]

  • million-core scale, 2.81 SYPD
  • many-core refactoring of the entire model
  • simulation speed competitive with the same model on NCAR's Yellowstone

SLIDE 31

Future Work

Further improvement from 2.81 SYPD to 5~8 SYPD:

  • dynamic core: another factor of 2, from
      - computation-communication overlapping
      - data sharing among CPEs via register communication
  • physics schemes: another factor of 2~4, from
      - further improvement of the loop transformation and variable storage space reduction tools
      - targeting a 20x speedup for most physics schemes

SLIDE 32

Sunway TaihuLight (太湖之光, "Light of Taihu Lake")

SLIDE 33

Thank you.

haohuan@tsinghua.edu.cn