Hybrid Fortran
High Productivity GPU Porting Framework Applied to Japanese Weather Prediction Model
Michel Müller
supervised by Takayuki Aoki, Tokyo Institute of Technology
Outline
1. Motivation & Problem Description
2. Proposed …
What is ASUCA? [6]
- JMA's new nonhydrostatic weather prediction model, in production since end of 2014
- horizontal domain IJ, vertical domain K
- code mostly sequential

Goals of Hybrid ASUCA
- keep Fortran

[6] Kawano K., Ishida J. and Muroi C.: "Development of a New Nonhydrostatic Model ASUCA at JMA", 2010

[Figure: cloud cover result with ASUCA using a 2 km resolution grid and real-world data]
[Figure: ASUCA call graph. The simulation loops over t ∈ [0,tend]; the physics run loops over j ∈ [1,ny] and i ∈ [1,nx]; pointwise processes such as shortwave radiation and the planetary boundary phi calculation loop over k ∈ [1,nz]. Further branches: dycore, radiation, surface, planetary boundary. Legend: boxes are routines, arrows are calls, and "for x ∈ [a, b]: .. statements .." denotes a loop repeating .. statements .. for each x ∈ [a, b].]
→ Physics are hard to port. However, leaving them on the CPU requires Host-Device-Host data transfers for each timestep (see the sketch below).
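To illustrate that cost, a minimal CUDA Fortran sketch of the transfer pattern when only the dycore runs on GPU (the routine names dycore_gpu and physics_cpu and the single state field a are hypothetical; grid sizes are assumed declared elsewhere):

real, allocatable         :: a(:,:,:)    ! model state on the host
real, allocatable, device :: a_d(:,:,:)  ! model state on the device
integer :: t

allocate(a(nz, nx, ny), a_d(nz, nx, ny))
do t = 1, num_timesteps
  a_d = a                ! host -> device transfer, every timestep
  call dycore_gpu(a_d)   ! dynamical core runs on the GPU
  a = a_d                ! device -> host transfer, every timestep
  call physics_cpu(a)    ! physics left on the CPU
end do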
Definition of granularity: the amount of work done by one thread. For our purposes, we distinguish between two types of granularity:
a) runtime-defined granularity
b) code-defined granularity
- coarse code granularity → GPU-unfriendly, but performant on CPU (simply parallelize the j-loop)
- fine code granularity → GPU-friendly
(sketch below)
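A minimal sketch of the two code granularities (the directives and the physics_column routine are illustrations only, not ASUCA code; arrays are assumed declared elsewhere):

! Coarse code granularity (reference ASUCA): parallelism sits high in
! the call tree; each thread processes whole vertical columns.
!$omp parallel do private(i)
do j = 1, ny
  do i = 1, nx
    call physics_column(i, j)   ! k-loops hidden inside
  end do
end do

! Fine code granularity (GPU-friendly): parallelism is expressed at the
! pointwise level, exposing nx*ny*nz-way parallelism.
!$acc parallel loop collapse(3)
do j = 1, ny
  do i = 1, nx
    do k = 1, nz
      a_out(k, i, j) = 0.5 * (a(k, i, j) + a_prev(k, i, j))
    end do
  end do
end do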
- arrays offer a simple-to-use and efficient data structure
- varying vertical domain stays in cache → k-first memory access

Example stencil in original code:

A_out(k,i,j) = A(k,i,j) + A(k,i-1,j) …
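One way to see why the storage order matters: a minimal sketch that abstracts the layout behind preprocessor macros. The DOM/AT macro names are an assumption for illustration; Hybrid Fortran performs this rewrite in its backend.

! compile e.g.: gfortran -cpp sketch.F90        (CPU, k-first layout)
!               gfortran -cpp -DGPU sketch.F90  (GPU, i-first layout)
#ifdef GPU
#define DOM(i, j, k) i, j, k
#define AT(i, j, k) i, j, k
#else
#define DOM(i, j, k) k, i, j
#define AT(i, j, k) k, i, j
#endif

program layout_sketch
  implicit none
  integer, parameter :: nx = 8, ny = 8, nz = 16
  ! k-first on CPU keeps the vertical column in cache;
  ! i-first on GPU gives coalesced access across threads.
  real :: A(DOM(0:nx, ny, nz)), A_out(DOM(nx, ny, nz))
  integer :: i, j, k
  A = 1.0
  do j = 1, ny
    do i = 1, nx
      do k = 1, nz
        A_out(AT(i, j, k)) = A(AT(i, j, k)) + A(AT(i-1, j, k))
      end do
    end do
  end do
  print *, sum(A_out)
end program layout_sketch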
➡ User needs to refine coarse kernels manually at first.
➡ Difficult to manage across functions and modules in a deep call tree.
➡ Rewrite of pointwise code necessary.
[1] Shimokawabe T., Aoki T. and Onodera N.: "High-productivity framework on GPU-rich supercomputers for operational weather prediction code ASUCA", 2014
[2] Fuhrer O. et al.: "Towards a performance portable, architecture agnostic implementation strategy for weather and climate models", 2014
๏ We assume the number of parallel loop constructs to be small (ASUCA: 200-300).
➡ Rewrite of these structures is manageable.
๏ Manual rewrite of memory access patterns is time-consuming and error-prone.
➡ We automate this process in the backend. In the case of ASUCA:
- privatization (I-J extension) of thread-local scalars and vertical arrays (see the sketch below)
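A minimal sketch of such a privatization; the routine names, the grid_dims module and the lapse-rate formula are hypothetical, and the actual rewrite is performed by the Hybrid Fortran backend:

! Before (coarse granularity): one thread per column, the vertical
! work array is simply thread-local.
subroutine pbl_profile_column(height, t_profile)
  use grid_dims, only: nz                ! hypothetical module with nz, nx, ny
  real, intent(in)  :: height(nz)
  real, intent(out) :: t_profile(nz)
  integer :: k
  do k = 1, nz
    t_profile(k) = 300.0 - 0.0065 * height(k)
  end do
end subroutine pbl_profile_column

! After I-J extension (fine granularity): every (i,j) thread needs its
! own copy, so the array gains the parallel dimensions and all accesses
! are rewritten accordingly.
subroutine pbl_profile_point(i, j, height, t_profile)
  use grid_dims, only: nz, nx, ny
  integer, intent(in) :: i, j
  real, intent(in)  :: height(nz)
  real, intent(out) :: t_profile(nz, nx, ny)
  integer :: k
  do k = 1, nz
    t_profile(k, i, j) = 300.0 - 0.0065 * height(k)
  end do
end subroutine pbl_profile_point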
Single codebase; the backend produces CUDA Fortran, OpenACC and OpenMP parallel versions.
Main Advantages versus DSLs
Main Advantages versus Directives (e.g. OpenACC)
- allows multiple parallelization granularities
- explicit parallelization

Example: a CPU-targeted parallel region replaces the explicit loop nest:

! explicit loops:
do i = 1, nx
  do j = 1, ny
    ! ..pointwise code..
  end do
end do

! Hybrid Fortran:
@parallelRegion{ domName(i,j), domSize(nx,ny), appliesTo(CPU) }
! ..pointwise code..
@end parallelRegion
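The key difference versus directive-based approaches is that one codebase can declare both a coarse CPU granularity and a fine GPU granularity. A sketch with hypothetical routine names, mirroring the hybrid call graph shown later in this deck:

subroutine run_physics()
  ! coarse granularity: on CPU targets, parallelize high in the call tree
  @parallelRegion{ domName(i,j), domSize(nx,ny), appliesTo(CPU) }
  call shortwave_rad()
  call pb_phi_calc()
  @end parallelRegion
end subroutine run_physics

subroutine shortwave_rad()
  ! fine granularity: on GPU targets, each (i,j) point becomes a thread here
  @parallelRegion{ domName(i,j), domSize(nx,ny), appliesTo(GPU) }
  ! ..pointwise code..
  @end parallelRegion
end subroutine shortwave_rad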
Example reference code from the surface flux. Data parallelism is not exposed at this layer:

lt = tile_land
if ( tlcvr(lt) > 0._r_size ) then
  call sf_slab_flx_land_run( &
    ! ... inputs and further tile variables
    & taux_tile_ex(lt), tauy_tile_ex(lt) &
  & )
  u_f(lt) = sqrt( sqrt( taux_tile_ex(lt)**2 + tauy_tile_ex(lt)**2 ) )
else
  taux_tile_ex(lt) = 0._r_size
  tauy_tile_ex(lt) = 0._r_size
  ! ... further tile variables
end if
[Figure: the hybrid call graph. The physics run is annotated with a CPU parallel region over i,j ∈ [1,nx], [1,ny], while the pointwise processes (shortwave rad., p.b. phi calc) and the dycore carry GPU parallel regions over the same domain; radiation, surface and planetary boundary branches as before. Legend: "CPU i,j ∈ [1,nx], [1,ny]" over .. statements .. means: execute .. statements .. in parallel for each i,j ∈ [1,nx], [1,ny] if the executable is compiled for CPU; otherwise run .. statements .. a single time.]
Example code from the surface flux using Hybrid Fortran. Pointwise code can be reused as-is; Hybrid Fortran rewrites it automatically to apply fine-grained parallelism, using the appliesTo clause and the global call graph:

@parallelRegion{ appliesTo(GPU), domName(i,j), domSize(nx,ny) }
lt = tile_land
if ( tlcvr(lt) > 0._r_size ) then
  call sf_slab_flx_land_run( &
    ! ... inputs and further tile variables
    & taux_tile_ex(lt), tauy_tile_ex(lt) &
  & )
  u_f(lt) = sqrt( sqrt( taux_tile_ex(lt)**2 + tauy_tile_ex(lt)**2 ) )
else
  taux_tile_ex(lt) = 0._r_size
  tauy_tile_ex(lt) = 0._r_size
  ! ... further tile variables
end if
! ... sea tiles code and variable summing omitted
@end parallelRegion
Surface flux example including data specifications, given by the @domainDependant directive:

@domainDependant{ domName(i,j), domSize(nx,ny), attribute(autoDom, present) }
tlcvr, taux_tile_ex, tauy_tile_ex, u_f
@end domainDependant

@parallelRegion{ appliesTo(GPU), domName(i,j), domSize(nx,ny) }
lt = tile_land
if ( tlcvr(lt) > 0._r_size ) then
  call sf_slab_flx_land_run( &
    ! ... inputs and further tile variables
    & taux_tile_ex(lt), tauy_tile_ex(lt) &
  & )
  u_f(lt) = sqrt( sqrt( taux_tile_ex(lt)**2 + tauy_tile_ex(lt)**2 ) )
else
  taux_tile_ex(lt) = 0._r_size
  tauy_tile_ex(lt) = 0._r_size
  ! ... further tile variables
end if
! ... sea tiles code and variable summing omitted
@end parallelRegion
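For intuition only, a hedged sketch of what the backend could emit from the above for a GPU target in an OpenACC flavor. The real generated CUDA Fortran/OpenACC output differs in detail (dimension ordering, kernel extraction, handling of the subroutine call); the point is the automatic (i,j) extension of the tile variables:

!$acc parallel loop collapse(2) private(lt) &
!$acc&  present(tlcvr, taux_tile_ex, tauy_tile_ex, u_f)
do j = 1, ny
  do i = 1, nx
    lt = tile_land
    if ( tlcvr(lt, i, j) > 0._r_size ) then
      ! ... call rewritten to pass the (i,j)-extended tile variables
      u_f(lt, i, j) = sqrt( sqrt( taux_tile_ex(lt, i, j)**2 &
        & + tauy_tile_ex(lt, i, j)**2 ) )
    else
      taux_tile_ex(lt, i, j) = 0._r_size
      tauy_tile_ex(lt, i, j) = 0._r_size
    end if
  end do
end do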
Transformation pipeline:
1. Process macros in input
2. Sanitize input: delete whitespace & comments, merge continued lines
3. Parse global call graph ("parse")
4. Apply the user-defined, target-specific parallelization granularity to the call graph ("analyze")
5. Parse module data specifications
6. Link module data specifications to the routines where the data is imported
7. Generate the global application model: an intermediate representation containing modules, routines and code regions, each linked with all relevant user code and meta information
8. Transform code for the target architecture ("transform"): one implementation class per routine, with hooks called for each detected pattern that requires transformation; this is also where the memory layout is implemented
9. Sanitize output: split lines that are too long for the Fortran standard
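To make step 8 concrete, a hedged sketch of a possible CPU-target output for a coarse appliesTo(CPU) region like the one in the comparison example above (the real generated code differs in detail):

! Input (hybrid source):
! @parallelRegion{ domName(i,j), domSize(nx,ny), appliesTo(CPU) }
! ! ..pointwise code..
! @end parallelRegion

! Possible CPU output after the transform step (OpenMP backend):
!$omp parallel do private(i, j)
do j = 1, ny
  do i = 1, nx
    ! ..pointwise code..
  end do
end do
!$omp end parallel do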
[Figure: toolchain. Hybrid sources (hybrid files, i.e. files with a combined CPU+GPU version, user facing) pass through the parse, analyze and transform stages (implemented in Python) and a GNU Make build, producing the executable. Build dependencies, build configuration and macro definitions enter as machine-facing inputs; the global information gathered during analysis is applied per target architecture, and the transformed sources are plain Fortran.]
[Figure: the same call graph, with routines classified relative to the parallel regions.]
- "outside" → routine calling kernel routines
- "kernel" → routine containing a parallel region (kernel routine)
- "inside" → routine called inside a kernel
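A sketch of the three roles (routine names hypothetical):

subroutine run_physics()          ! "outside": calls kernel routines
  call shortwave_rad()
end subroutine run_physics

subroutine shortwave_rad()        ! "kernel": contains the parallel region
  @parallelRegion{ domName(i,j), domSize(nx,ny), appliesTo(GPU) }
  call rad_column()
  @end parallelRegion
end subroutine shortwave_rad

subroutine rad_column()           ! "inside": called inside a kernel
  ! ..pointwise code..
end subroutine rad_column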
Code Reuse and Changes: Comparison with OpenACC (Estimate)
Strong scaling results

[Figure: strong scaling on the full 1581 x 1301 x 58 grid (Japan and surrounding region), and kernel performance on a reduced 301 x 301 x 58 grid, with speedups of 4.9x and 3x shown.]
- 21 sample codes
- LGPL license
- PDF documentation