SLIDE 1

Hybrid Fortran

High Productivity GPU Porting Framework Applied to Japanese Weather Prediction Model

Michel Müller

supervised by Takayuki Aoki, Tokyo Institute of Technology

SLIDE 2

Outline

  • 1. Motivation & Problem Description
  • 2. Proposed Solution
  • 3. Example & Application Status
  • 4. Code Transformation
  • 5. Performance & Productivity Results
  • 6. Conclusion
SLIDE 3

What is ASUCA? [6]

  • Non-hydrostatic weather prediction model
  • Main Japanese mesoscale weather model, in production since end of 2014
  • Dynamical + physical core
  • Regular grid: horizontal domain IJ, vertical domain K
  • Mostly parallelizable in IJ; K is mostly sequential

Goals of Hybrid ASUCA

  • Performant GPU implementation
  • Low code divergence
  • Code as close to the original as possible - keep Fortran

[Figure: cloud cover result with ASUCA, using a 2 km resolution grid and real-world data]

[6] Kawano K., Ishida J. and Muroi C.: “Development of a New Nonhydrostatic Model ASUCA at JMA”, 2010

SLIDE 4

ASUCA… another point of view

  • 155k LOC
  • 338 kernels
  • one lonely Fortran GPU programmer
SLIDE 5

[Figure: ASUCA call graph. The simulation loop over t ∈ [0, tend] calls the dycore and the physics run; the physics run loops over j ∈ [1, ny] and i ∈ [1, nx] and calls radiation (shortwave rad.), surface (surf. flux) and planetary boundary (p.b. phi calc) routines, each applying pointwise processes in a k ∈ [1, nz] loop. Legend: boxes are routines, arrows are calls, and “for x ∈ [a, b]: .. statements ..” denotes a loop repeating the statements for each x.]

→ The physics are hard to port. However, leaving them on the CPU requires host-device-host data transfers in each timestep.

SLIDE 6

Focus Problems

  • 1. Code Granularity
  • 2. Memory Layout
SLIDE 7

Focus Problems

  • 1. Code Granularity
  • 2. Memory Layout

Definition of granularity: the amount of work done by one thread. For our purposes, we distinguish between two types of granularity: a) runtime-defined granularity, b) code-defined granularity.

[Figure: the ASUCA call graph from Slide 5, repeated.]

  • Coarse code granularity → GPU unfriendly, performant on CPU (simply parallelize the j-loop)
  • Fine code granularity → GPU friendly
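
To make the contrast concrete, here is a minimal sketch (not ASUCA code; names like nx, ny, nz are placeholders) of the same loop nest at both granularities, using plain OpenMP and OpenACC directives purely for illustration:

! Coarse granularity: parallelize only the j-loop; each CPU thread runs a
! full i/k column workload sequentially (cache friendly, GPU unfriendly).
subroutine coarse_version(nx, ny, nz, a)
  implicit none
  integer, intent(in) :: nx, ny, nz
  real, intent(inout) :: a(nz, nx, ny)
  integer :: i, j, k
  !$omp parallel do private(i, k)
  do j = 1, ny
    do i = 1, nx
      do k = 1, nz
        a(k, i, j) = a(k, i, j) + 1.0   ! stand-in for a pointwise process
      end do
    end do
  end do
end subroutine coarse_version

! Fine granularity: every (i, j, k) point is an independent unit of work,
! exposing enough threads for a GPU.
subroutine fine_version(nx, ny, nz, a)
  implicit none
  integer, intent(in) :: nx, ny, nz
  real, intent(inout) :: a(nz, nx, ny)
  integer :: i, j, k
  !$acc parallel loop collapse(3)
  do j = 1, ny
    do i = 1, nx
      do k = 1, nz
        a(k, i, j) = a(k, i, j) + 1.0
      end do
    end do
  end do
end subroutine fine_version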

slide-8
SLIDE 8

Focus Problems

  • 1. Code Granularity
  • 2. Memory Layout

[Figure: the ASUCA call graph from Slide 5, repeated.]

  • Regular grid → Fortran’s multi-dimensional arrays offer a simple-to-use and efficient data structure
  • Performant layout on CPU: keep the fast-varying vertical domain in cache → k-first. Example stencil in original code:
    A_out(k,i,j) = A(k,i,j) + A(k,i-1,j) …
  • GPU: requires i-first or j-first for coalesced access
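
A minimal sketch (not ASUCA code; names are illustrative) of the same stencil under both layouts:

! k-first layout (CPU): consecutive k values are adjacent in memory, so a
! thread sweeping the vertical stays in cache.
subroutine stencil_kfirst(nz, nx, ny, a, a_out)
  implicit none
  integer, intent(in) :: nz, nx, ny
  real, intent(in)    :: a(nz, nx, ny)
  real, intent(inout) :: a_out(nz, nx, ny)
  integer :: i, j, k
  do j = 1, ny
    do i = 2, nx
      do k = 1, nz
        a_out(k, i, j) = a(k, i, j) + a(k, i-1, j)
      end do
    end do
  end do
end subroutine stencil_kfirst

! i-first layout (GPU): threads with consecutive i indices touch consecutive
! addresses, which is what coalesced access requires.
subroutine stencil_ifirst(nx, ny, nz, a, a_out)
  implicit none
  integer, intent(in) :: nx, ny, nz
  real, intent(in)    :: a(nx, ny, nz)
  real, intent(inout) :: a_out(nx, ny, nz)
  integer :: i, j, k
  do k = 1, nz
    do j = 1, ny
      do i = 2, nx
        a_out(i, j, k) = a(i, j, k) + a(i-1, j, k)
      end do
    end do
  end do
end subroutine stencil_ifirst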

SLIDE 9

OpenACC is not high level enough for this use case.

SLIDE 10

What others* do …

  • 1. Code Granularity
  • 2. Memory Layout

Kernel fusion in the backend

  ➡ The user needs to refine coarse kernels manually at first.
  ➡ Difficult to manage across functions and modules in a deep call tree.

Stencil DSL abstraction in the frontend

  ➡ Rewrite of the point-wise code necessary.

* [1] Shimokawabe T., Aoki T. and Onodera N.: “High-productivity framework on GPU-rich supercomputers for operational weather prediction code ASUCA”, 2014
  [2] Fuhrer O. et al.: “Towards a performance portable, architecture agnostic implementation strategy for weather and climate models”, 2014

SLIDE 11

… and what we propose

  • 1. Code Granularity
  • 2. Memory Layout

Abstraction in the frontend

  • We assume the number of parallel loop constructs to be small (ASUCA: 200-300).
    ➡ Rewriting these structures is manageable.

Transformation in the backend

  • Manual rewriting of memory access patterns is time consuming and error-prone.
    ➡ We automate this process in the backend. In the case of ASUCA:
    1. Reordering of K-I-J to I-J-K
    2. Due to the granularity change for the physics: auto-privatization (I-J extension) of thread-local scalars and vertical arrays (sketched below)
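
A hedged sketch of point 2 (names are illustrative, not taken from ASUCA): a column-local scalar and vertical array before the transformation, and the privatized form the backend generates:

! Before: per-column physics with column-local state.
subroutine phys_column(nz, t_in, t_out)
  implicit none
  integer, intent(in) :: nz
  real, intent(in)  :: t_in(nz)
  real, intent(out) :: t_out(nz)
  real :: flux_sum      ! thread-local scalar
  real :: work(nz)      ! thread-local vertical array
  flux_sum = sum(t_in)
  work = t_in / max(flux_sum, 1.0e-12)
  t_out = work
end subroutine phys_column

! After the granularity change, the backend extends the locals with the
! parallel domain so each (i, j) GPU thread owns a private copy:
!   real :: flux_sum(nx, ny)
!   real :: work(nx, ny, nz)
! and rewrites accesses like work(k) to work(i, j, k).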

SLIDE 12

→ Hybrid Fortran

  • A language extension for Fortran
  • A code transformation for targeting GPU and multi-core CPU parallelizations with the same codebase; produces CUDA Fortran, OpenACC and OpenMP parallel versions in the backend
  • Goal: make GPU retargeting of existing Fortran code as productive as possible
  • Idea: combine the strengths of DSLs and directives

Main Advantages versus DSLs

  • No change of programming language necessary
  • Code with coarse parallelization granularity can easily be ported

Main Advantages versus Directives (e.g. OpenACC)

  • Memory layout is abstracted → optimized layouts for GPUs and CPUs
  • No rewrite and/or code duplication necessary for code with coarse parallelization granularity

Sequential loops in the original code:

do i = 1, nx
  do j = 1, ny
    ! ..pointwise code..
  end do
end do

Parallel region in Hybrid Fortran:

@parallelRegion{domName(i,j), domSize(nx,ny), appliesTo(CPU)}
! ..pointwise code..
@end parallelRegion

  • Allows multiple parallelization granularities
  • Explicit parallelization - orthogonal to sequential loops
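
The same region is parallelized differently per target. As a hedged illustration (not Hybrid Fortran's actual generated code), the CPU backend can emit an OpenMP loop nest around the unchanged pointwise code, while the GPU backend can emit e.g. a CUDA Fortran kernel with one thread per (i, j) point:

! CPU target: OpenMP loop nest around the unchanged pointwise code.
!$omp parallel do private(i)
do j = 1, ny
  do i = 1, nx
    ! ..pointwise code..
  end do
end do

! GPU target (sketched): a CUDA Fortran kernel, one thread per (i, j) point.
! attributes(global) subroutine region_kernel(nx, ny)
!   integer, value :: nx, ny
!   integer :: i, j
!   i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
!   j = (blockIdx%y - 1) * blockDim%y + threadIdx%y
!   if (i <= nx .and. j <= ny) then
!     ! ..pointwise code..
!   end if
! end subroutine region_kernel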
SLIDE 13

[Figure: the ASUCA call graph from Slide 5; the surface flux routine serves as the example below.]

Example reference code from surface flux. Data parallelism is not exposed at this layer of code → coarse-grained parallelization.

lt = tile_land
if (tlcvr(lt) > 0.0_r_size) then
  call sf_slab_flx_land_run( &
    ! ... inputs and further tile variables omitted
    & taux_tile_ex(lt), tauy_tile_ex(lt) &
    & )
  u_f(lt) = sqrt(sqrt(taux_tile_ex(lt)**2 + tauy_tile_ex(lt)**2))
else
  taux_tile_ex(lt) = 0.0_r_size
  tauy_tile_ex(lt) = 0.0_r_size
  ! ... further tile variables omitted
end if


SLIDE 14

Example using Hybrid Fortran

[Figure: the call graph annotated with parallel regions. The physics run carries a CPU parallel region over i,j ∈ [1,nx], [1,ny] around its calls; shortwave rad., surf. flux and p.b. phi calc each carry a GPU parallel region over the same domain around their pointwise code. Legend: a CPU region executes its statements in parallel for each i,j ∈ [1,nx], [1,ny] if the executable is compiled for CPU; otherwise it runs the statements a single time.]

Example code from surface flux using Hybrid Fortran. Pointwise code can be reused as is - Hybrid Fortran rewrites this code automatically to apply fine-grained parallelism, using the appliesTo clause and the global call graph.

@parallelRegion{appliesTo(GPU), domName(i,j), domSize(nx,ny)}
lt = tile_land
if (tlcvr(lt) > 0.0_r_size) then
  call sf_slab_flx_land_run( &
    ! ... inputs and further tile variables omitted
    & taux_tile_ex(lt), tauy_tile_ex(lt) &
    & )
  u_f(lt) = sqrt(sqrt(taux_tile_ex(lt)**2 + tauy_tile_ex(lt)**2))
else
  taux_tile_ex(lt) = 0.0_r_size
  tauy_tile_ex(lt) = 0.0_r_size
  ! ... further tile variables omitted
end if
! ... sea tiles code and variable summing omitted
@end parallelRegion

SLIDE 15

Surface flux example including data specifications

  • autoDom → extend the existing data domain specification with the parallel domain given by the @domainDependant directive
  • present → data is already present on the device

@domainDependant{domName(i,j), domSize(nx,ny), attribute(autoDom, present)}
tlcvr, taux_tile_ex, tauy_tile_ex, u_f
@end domainDependant

@parallelRegion{appliesTo(GPU), domName(i,j), domSize(nx,ny)}
lt = tile_land
if (tlcvr(lt) > 0.0_r_size) then
  call sf_slab_flx_land_run( &
    ! ... inputs and further tile variables omitted
    & taux_tile_ex(lt), tauy_tile_ex(lt) &
    & )
  u_f(lt) = sqrt(sqrt(taux_tile_ex(lt)**2 + tauy_tile_ex(lt)**2))
else
  taux_tile_ex(lt) = 0.0_r_size
  tauy_tile_ex(lt) = 0.0_r_size
  ! ... further tile variables omitted
end if
! ... sea tiles code and variable summing omitted
@end parallelRegion
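
As a hedged illustration of what autoDom implies here (the underlying declaration and the name n_tiles are assumptions, not shown in the deck), the user keeps the per-point view while the backend merges in the parallel domain:

! Written by the user; the parallel domain stays invisible:
!   real(r_size) :: taux_tile_ex(n_tiles)
!   taux_tile_ex(lt) = 0.0_r_size

! Generated by the backend for the GPU target, with the parallel domain
! from @domainDependant merged into declaration and accesses:
!   real(r_size) :: taux_tile_ex(nx, ny, n_tiles)
!   taux_tile_ex(i, j, lt) = 0.0_r_size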


SLIDE 16

Code Transformation Process

1. Process macros in the input
2. Sanitize the input: delete whitespace & comments, merge continued lines
3. Parse the global call graph (“parse”)
4. Apply the user-defined, target-specific parallelization granularity to the call graph (“analyze”)
5. Parse module data specifications
6. Link module data specifications to the routines where the data is imported
7. Generate the global application model: an intermediate representation containing modules, routines and code regions, each linked with all relevant user code and meta information
8. Transform the code for the target architecture (“transform”): an implementation class per routine, with hooks called for each detected pattern that requires transformation
9. Sanitize the output: split lines that are too long for the Fortran standard
10. Process macros in the output: implementation of the memory layout (sketched below)
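
A hedged sketch of how step 10 can realize the layout: a preprocessor macro baked into the generated sources selects the storage order per target (the macro name AT is illustrative, not Hybrid Fortran's actual one):

#ifdef GPU
#define AT(i, j, k) i, j, k
#else
#define AT(i, j, k) k, i, j
#endif

subroutine stencil(nx, ny, nz, a, a_out)
  implicit none
  integer, intent(in) :: nx, ny, nz
  ! The macro orders both the declared dimensions and every access, so CPU
  ! builds get k-first storage while GPU builds get i-first storage.
  real, intent(in)    :: a(AT(nx, ny, nz))
  real, intent(inout) :: a_out(AT(nx, ny, nz))
  integer :: i, j, k
  do k = 1, nz
    do j = 1, ny
      do i = 2, nx
        a_out(AT(i, j, k)) = a(AT(i, j, k)) + a(AT(i-1, j, k))
      end do
    end do
  end do
end subroutine stencil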

[Figure: build pipeline. Hybrid source files (one file carrying both the CPU and GPU version, user facing) pass through the Python-based parse, analyze and transform steps, driven by GNU Make together with the build dependencies, build configuration and macro definitions; the global information is applied per target architecture, yielding implemented Fortran (F90) sources (machine facing) that are compiled into the executable.]

SLIDE 17

Analysis Step: CPU

[Figure: the ASUCA call graph (simulation → timestep → physics run → shortwave rad., surf. flux, p.b. phi calc; plus dycore, radiation, surface, planetary boundary), not yet annotated.]

SLIDE 18

Analysis Step: CPU

[Figure: the same call graph annotated for the CPU target: the physics run is marked K, its caller O, and the physics routines below it I.]

  • O (“outside”): a routine calling kernel routines
  • K: “kernel” routines
  • I (“inside”): a routine called inside a kernel

SLIDE 19

Analysis Step: GPU

[Figure: the same call graph annotated for the GPU target: the kernel boundary moves down the tree. The physics run becomes O (“outside”), and shortwave rad., surf. flux and p.b. phi calc each become their own kernel K. Legend as on Slide 18. A sketch of how the two classifications arise follows.]
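
To make the two classifications concrete, here is a hedged Hybrid Fortran sketch (routine names are illustrative, not taken from ASUCA) of how the appliesTo clause moves the kernel boundary between the CPU and GPU targets:

subroutine run_physics(nx, ny)
  implicit none
  integer, intent(in) :: nx, ny
  ! CPU target: this coarse region is the kernel (K); run_physics' caller
  ! is "outside" (O) and the callees below are "inside" (I).
  ! GPU target: the region is inactive, the calls run a single time, and
  ! each callee's own fine-grained region becomes a kernel instead.
  @parallelRegion{domName(i,j), domSize(nx,ny), appliesTo(CPU)}
  call shortwave_rad(nx, ny)
  ! (further physics calls, e.g. surface flux, analogous and omitted)
  @end parallelRegion
end subroutine run_physics

subroutine shortwave_rad(nx, ny)
  implicit none
  integer, intent(in) :: nx, ny
  ! GPU target: this fine-grained region is its own kernel (K).
  @parallelRegion{domName(i,j), domSize(nx,ny), appliesTo(GPU)}
  ! .. pointwise process over k ..
  @end parallelRegion
end subroutine shortwave_rad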

SLIDE 20

ASUCA: Productivity

[Charts: code reuse and changes; comparison with an OpenACC estimate.]

SLIDE 21

ASUCA: Performance

[Charts: strong scaling results on Reedbush-H with a 1581 × 1301 × 58 grid (Japan and surrounding region), and kernel performance on a reduced grid (301 × 301 × 58); annotated speedups: 4.9x and 3x.]

SLIDE 22

Hybrid Fortran on GitHub

  • 21 sample codes
  • LGPL license
  • PDF documentation

SLIDE 23

Conclusions

  • A performant GPU port of a mesoscale weather prediction model has been achieved (physics + dynamics).
  • Using Hybrid Fortran, 85% of the ported code is a direct copy of the original - without counting whitespace, comments and line continuations.
  • Overall, the code size has grown by less than 4%.
  • A library of code examples has been constructed and open-sourced together with Hybrid Fortran.