STELLA: A Domain Specific Language for Stencil Computations Carlos - - PowerPoint PPT Presentation

stella a domain specific language for stencil computations
SMART_READER_LITE
LIVE PREVIEW

STELLA: A Domain Specific Language for Stencil Computations Carlos - - PowerPoint PPT Presentation

STELLA: A Domain Specific Language for Stencil Computations Carlos Osuna, Center for Climate Systems Modeling, ETH Zurich Tobias Gysi, Department of Computer Science, ETH Zurich Oliver Fuhrer, Federal Office of Meteorology and Climatology,


slide-1
SLIDE 1

Carlos Osuna 1 Dagstuhl Seminar, 14th April 2015

STELLA: A Domain Specific Language for Stencil Computations

Carlos Osuna, Center for Climate Systems Modeling, ETH Zurich Tobias Gysi, Department of Computer Science, ETH Zurich Oliver Fuhrer, Federal Office of Meteorology and Climatology, Meteoswiss Zurich Mauro Bianco, Swiss National Supercomputing Centre, CSCS Lugano. Thomas C. Schulthess, Swiss National Supercomputing Centre, CSCS Lugano. Work funded by HP2C and PASC initiatives

slide-2
SLIDE 2

Carlos Osuna 2 Dagstuhl Seminar – 14th April 2015

Outline

Dynamical Core of COSMO STELLA: DSL using C++ template metaprogramming STELLA DSL elements Exploiting performance: Loop & kernel fusion, data locality Extending Parallelization models

slide-3
SLIDE 3

Carlos Osuna 3 Dagstuhl Seminar – 14th April 2015

MOTIVATION: COSMO MODEL

Dynamical core of COSMO solves the Navier Stokes equations using finite difference methods on structured grids. Stencils on 3D grids: 2-7 km resolution in the horizontal, 60/80 levels in the vertical Split-explicit time integrator with slow processes integrated using Runge-Kutta method. Collection of (Fortran) operators (advection, diffusion, fast waves, etc.)

  • n prognostic variables (ρ, u, v, w, T, qx, ...)
slide-4
SLIDE 4

Carlos Osuna 4 Dagstuhl Seminar– 14th April 2015

COSMO tjme step

Initialization

Boundary conditions Physics Dynamics Data assimilation Halo-update Diagnostics I/O

Δt

Cleanup

slide-5
SLIDE 5

Carlos Osuna 5 Dagstuhl Seminar – 14th April 2015

COSMO Dynamical Core

Initialization

Boundary conditions Physics Dynamics Data assimilation Halo-update Diagnostics I/O

Δt

Cleanup

Vertical Diffusion (compute U, V, W) Vertical Diffusion (compute T)

S t a r t c

  • m

m U , V , W W a i t c

  • m

m U , V , W

Time Integrator (FastWaves RungeKutta) …. …. N small ∆t Dynamical code: ~35 stencils per time step Horizontal Diffusion

slide-6
SLIDE 6

Carlos Osuna 6 Dagstuhl Seminar– 14th April 2015

Algorithmic Motifs In COSMO

Horizontal PDE operators explicitly solved -> compact stencils with horizontal data dependencies

for k=1, ke { for i=1, ie-1 { for j=1, je-1 { lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1)

} }

for i=1, ie-1 { for j=1, je-1 { difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1) } } }

slide-7
SLIDE 7

Carlos Osuna 7 Dagstuhl Seminar – 14th April 2015

// forward for j=1, je for i=1, ie for k=2, ke { c(i,j,k) = 1.0 / b(i,j,k) - c(i,j,k-1) * a(i,j,k) ) d(i,j,k) = ( d(i,j,k) - d(i,j,k-1) * a(i,j,k) ) * c(i,j,k) } } } // backward substjtutjon for j=1, je { for i=1, ie { for k=2, ke { x(i,j,k) = d(i,j,k) - c(i,j,k) * x(i,j,k+1) } } }

Vertical PDE operators implicitly solved -> tridiagonal systems

Algorithmic Motifs In COSMO

slide-8
SLIDE 8

Carlos Osuna 8 Dagstuhl Seminar – 14th April 2015

U =U( x, z,t) U (x, z ,t=0)=U0(x, z) ∂U ∂t =−α ∇

4U=−α ∇ 2(∇ 2U)

2U i, j=Ui+1, j+U i−1, j+U i, j+1+U i, j−1−4U i, j

Δ

2

for i=1, ie-1 for j=1, je-1 lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) for i=1, ie-1 for j=1, je-1 difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1)

Model development

slide-9
SLIDE 9

Carlos Osuna 9 Dagstuhl Seminar– 14th April 2015

for i=1, ie-1 for j=1, je-1 lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) for i=1, ie-1 for j=1, je-1 difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1)

U =U( x, z,t) U (x, z ,t=0)=U0(x, z) ∂U ∂t =−α ∇

4U=−α ∇ 2(∇ 2U)

2U i, j=Ui+1, j+U i−1, j+U i, j+1+U i, j−1−4U i, j

Δ

2

slide-10
SLIDE 10

Carlos Osuna 10 Dagstuhl Seminar – 14th April 2015

U=U( x, z,t) U (x, z ,t=0)=U0(x, z) ∂U ∂ t =−α ∇

4U=−α ∇ 2(∇ 2U)

?

2U i, j=U i+1, j+U i−1, j+U i, j+1+U i, j−1−4U i, j

Δ

2

slide-11
SLIDE 11

Carlos Osuna 11 Dagstuhl Seminar – 14th April 2015

static void Lap(Context ctx) { ctx[lap::Center()] = ctx[u::At(iplus1)] + ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)] - 4*ctx[u::Center()]; }

2U i, j=U i+1, j+U i−1, j+U i, j+1+U i, j−1−4U i, j

Δ

2

slide-12
SLIDE 12

Carlos Osuna 12 Dagstuhl Seminar – 14th April 2015

const int i = threadIdx.x; const int j = threadIdx.y; int i_h = 0; int j_h = 0; if(j < 2) { i_h = i; j_h = (j==0 ? -1 : blockDim.y); } else if(j < 4 && i <= blockDim.y) { i_h = (j==2 ? -1 : blockDim.x); j_h = i; } for(int k=0; k < kdim; ++k) { lap(i,j) = - 4.0 * U(i,j,k) + U(i+1,j,k) + U(i-1,j,k) + U(i,j+1,k) + U(i,j-1,k) if(i_h != 0 || j_h != 0) lap(i_h, j_h) = - 4.0 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) + U(i_h,j_h+1,k) + U(i_h,j_h-1,k) __syncthreads(); result(i,j) = -4*(i,j,k) - alpha(i,j,k)*( lap(i,j,k) - lap(i-1,j,k) + lap(i,j,k) - lap(i,j-1,k)) }

slide-13
SLIDE 13

Carlos Osuna 13 Dagstuhl Seminar – 14th April 2015

STELLA

DSL using C++ template metaprogramming Separate model and algorithm from hardware specifjc implementatjon and optjmizatjons. Single source code compiling multjple architectures, performance portable. Concise syntax / close to mathematjcal descriptjon of operators Same toolchain as the full model

– (access to debugger/data fjelds, interoperability with other programming languages) Supports CPU & GPU backends

Used to rewrite a full dynamical core (COSMO)

slide-14
SLIDE 14

Carlos Osuna 14 Dagstuhl Seminar– 14th April 2015

How it works?

Operators are encapsulated as type informatjon (functors) Types are composed/organized and manipulated Specifjc backends expand the code of operators into template kernels with loop structures Many optjmizatjons (memory layout, tjling, looping, data locality) depend on the backend

template<typename Context> struct LapStage{ static void Do(Context ctx) { ctx[lap::Center()] = ctx[u::At(iplus1)] + ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)] - 4*ctx[T::Center()]; } };

slide-15
SLIDE 15

Carlos Osuna 15 Dagstuhl Seminar – 14th April 2015

Stencil Stages

template<typename Tenv> struct Lap { STENCIL_STAGE(TEnv) STAGE_PARAMETER(u) STAGE_PARAMETER(lap) statjc void Do(Context ctx, FullDomain) { ctx[lap::Center()] = -4*ctx[u::Center()] + ctx[u::At(iplus1)] +

ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)];

} }

Stencil Stages describe the single operatjon applied to each grid point. Assumed parallel executjon in the IJ plane

slide-16
SLIDE 16

Carlos Osuna 16 Dagstuhl Seminar– 14th April 2015

Full example

Instantiate data fields (memory layout abstracted, backend dependent)

IJKRealField in_, out_; StencilCompiler::Build( Stencil, pack_parameters( Param<in, cIn, cDataField>( in_ ), Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) );

slide-17
SLIDE 17

Carlos Osuna 17 Dagstuhl Seminar – 14th April 2015

Full example

Pack the parameters and associate them to place holders used in stages

IJKRealField in_, out_; StencilCompiler::Build( Stencil, pack_parameters( Param<in, cIn, cDataField>( in_ ), Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) );

slide-18
SLIDE 18

Carlos Osuna 18 Dagstuhl Seminar – 14th April 2015

Full example

Define temporary buffers used by the stencil

IJKRealField in_, out_; StencilCompiler::Build( Stencil, pack_parameters( Param<in, cIn, cDataField>( in_ ), Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) );

slide-19
SLIDE 19

Carlos Osuna 19 Dagstuhl Seminar – 14th April 2015

Full example

Compose multiple stencil stages.

( Operations with data dependencies placed in different stages )

Build Sweeps (sequential loops in K )

IJKRealField in_, out_; StencilCompiler::Build( Stencil, pack_parameters( Param<in, cIn, cDataField>( in_ ), Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) );

slide-20
SLIDE 20

Carlos Osuna 20 Dagstuhl Seminar– 14th April 2015

Computation on the fly or buffering?

∂U ∂ t =−α ∇

4=−α ∇ 2(∇ 2U)

2U i, j=U i+1, j+U i−1, j+U i, j+1+U i, j−1−4U i, j

Δ

2

2U

2T

4T

slide-21
SLIDE 21

Carlos Osuna 21 Dagstuhl Seminar– 14th April 2015

2U

2U

StencilCompiler::Build( stencil, pack_parameters(...), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) );

4U

template<typename TEnv> struct Lap { STENCIL_STAGE(TEnv) STAGE_PARAMETER(u) STAGE_PARAMETER(diff) static void Do(Context ctx, FullDomain) { ctx[diff::Center()] = ctx[Call<LapFn>::With( Call<LapFn>::With(u::Center() ) )]; } }; StencilCompiler::Build( //... define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) );

Using DSL Temporaries Using DSL Functions

slide-22
SLIDE 22

Carlos Osuna 22 Dagstuhl Seminar– 14th April 2015

2U

2U

4U

slide-23
SLIDE 23

Carlos Osuna 23 Dagstuhl Seminar – 14th April 2015

Loop and Kernel Fusion

STELLA applies loop fusion to stages in a sweep (only GPU)

const int i = threadIdx.x; const int j = threadIdx.y; int i_h = 0; int j_h = 0; if(j < 2) { i_h = i; j_h = (j==0 ? -1 : blockDim.y); } else if(j < 4 && i <= blockDim.y) { i_h = (j==2 ? -1 : blockDim.x); j_h = i; } for(int k=0; k < kdim; ++k) { lap(i,j) = - 4.0 * U(i,j,k) + U(i+1,j,k) + U(i-1,j,k) + U(i,j+1,k) + U(i,j-1,k) if(i_h != 0 || j_h != 0) lap(i_h, j_h) = - 4.0 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) + U(i_h,j_h+1,k) + U(i_h,j_h-1,k) __syncthreads(); result(i,j) = -4*(i,j,k) - alpha(i,j,k)*( lap(i,j,k) - lap(i-1,j,k) + lap(i,j,k) - lap(i,j-1,k)) }

define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Result, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) );

slide-24
SLIDE 24

Carlos Osuna 24 Dagstuhl Seminar– 14th April 2015

const int i = threadIdx.x; const int j = threadIdx.y; int i_h = 0; int j_h = 0; if(j < 2) { i_h = i; j_h = (j==0 ? -1 : blockDim.y); } else if(j < 4 && i <= blockDim.y) { i_h = (j==2 ? -1 : blockDim.x); j_h = i; } for(int k=0; k < kdim; ++k) { lap(i,j) = - 4.0 * U(i,j,k) + U(i+1,j,k) + U(i-1,j,k) + U(i,j+1,k) + U(i,j-1,k) if(i_h != 0 || j_h != 0) lap(i_h, j_h) = - 4.0 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) + U(i_h,j_h+1,k) + U(i_h,j_h-1,k) __syncthreads(); result(i,j) = -4*(i,j,k) - alpha(i,j,k)*( lap(i,j,k) - lap(i-1,j,k) + lap(i,j,k) - lap(i,j-1,k)) }

! Loop fusion requires duplicated computation

at halo points CUDA block (32x?)

slide-25
SLIDE 25

Carlos Osuna 25 Dagstuhl Seminar – 14th April 2015

STELLA applies kernel fusion to all sweeps of a stencil

define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<ForwardStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) define_sweep<cKDecrement>( define_stages( StencilStage<BackwardStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) );

Loop and Kernel Fusion

slide-26
SLIDE 26

Carlos Osuna 26 Dagstuhl Seminar – 14th April 2015

Performance implications of Loop & Kernel fusion

Loop and Kernel Fusion

slide-27
SLIDE 27

Carlos Osuna 27 Dagstuhl Seminar – 14th April 2015

Data locality? DSL elements for describing data reuse

define_loops( define_sweep<cKIncrement>( define_caches( KCache<buff, cLocal, KWindow<-2,1>, KRange<FullDomain> define_stages( StencilStage<Avg, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(), StencilStage<Interp, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >() ) )

+

Avg Interp

slide-28
SLIDE 28

Carlos Osuna 28 Dagstuhl Seminar– 14th April 2015

KCache cache a window of a vertical column in registers: KCache<buff, cLocal, KWindow<-2,1>, KRange<FullDomain> IJKCache cache a 3D window in shared memory. IJKCache<buff, cFill, IJWindow<-2,2,-2,2,-2,0>, KRange<FullDomain,0,0> Policies: cFill, cLocal, cFlush, cFillAndFlush

KCache IJKCache

Data locality? DSL elements for describing data reuse

slide-29
SLIDE 29

Carlos Osuna 29 Dagstuhl Seminar– 14th April 2015

KCache and IJKCache effect in COSMO dycore K20c

slide-30
SLIDE 30

Carlos Osuna 30 Dagstuhl Seminar – 14th April 2015

Performance on the COSMO Dycore

Overall speedup of 1.8x for CPU and 5.8x for GPU with respect to legacy code.

slide-31
SLIDE 31

Carlos Osuna 31 Dagstuhl Seminar – 14th April 2015

Scalability

slide-32
SLIDE 32

Carlos Osuna 32 Dagstuhl Seminar– 14th April 2015

∂qs ∂ t =−α ∇

4qs

s∈{1,...,100} α=α( x, y ,z)

Some operators applied to multiple fields, like concentration of certain species in the atmosphere. Ex: pollen, sea salt, volcano ashes... Processing operator on multiple fields in single kernel is beneficial due to:

  • Reuse of common intermediate computations
  • Data locality of input coefficients or fields

Reusing operators

slide-33
SLIDE 33

Carlos Osuna 33 Dagstuhl Seminar – 14th April 2015 // setup the tracer stencil StencilCompiler::Build( stencil_, "HorizontalDiffusionTracers", repository.calculationDomain(), StencilConfiguration<Real, HorizontalDiffusionTracersBlockSize>(), pack_parameters( /* output fields */ ExpandableParam<data_out, cInOut>(tracOut.begin(), tracOut.end()), /* input fields */ ExpandableParam<data_in, cIn>(tracIn.begin(), tracIn.end()), Param<hdmasktr, cIn>(repository.hdmask()), Param<ofahdx, cIn>(repository.ofahdx()), Param<ofahdy, cIn>(repository.ofahdy()), Param<crlato, cIn>(repository.crlato()), Param<crlatu, cIn>(repository.crlatu()) ), define_temporaries( StencilExpandableBuffer<lap, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<flx, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<fly, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<rxp, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<rxm, Real, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_caches( IJCache<lap, KRange<FullDomain,0,0> >(), IJCache<flx, KRange<FullDomain,0,0> >(), IJCache<fly, KRange<FullDomain,0,0> >(), IJCache<rxp, KRange<FullDomain,0,0> >(), IJCache<rxm, KRange<FullDomain,0,0> >() ), define_stages( StencilStage<LapStage, IJRange<cComplete,-2,2,-2,2>, KRange<FullDomain,0,0> >(), StencilStage<FluxStage, IJRange<cComplete,-2,1,-2,1>, KRange<FullDomain,0,0> >(), StencilStage<RXStage, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<LimitFluxStage, IJRange<cIndented,-1,0,-1,0>, KRange<FullDomain,0,0> >(), StencilStage<DataStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) );

Expandable parameters

slide-34
SLIDE 34

Carlos Osuna 34 Dagstuhl Seminar– 14th April 2015 // setup the tracer stencil StencilCompiler::Build( stencil_, "HorizontalDiffusionTracers", repository.calculationDomain(), StencilConfiguration<Real, HorizontalDiffusionTracersBlockSize>(), pack_parameters( /* output fields */ ExpandableParam<NumTracersPerStencil::value, data_out, cInOut>(tracOut.begin(), tracOut.end()), /* input fields */ ExpandableParam<NumTracersPerStencil::value, data_in, cIn>(tracIn.begin(), tracIn.end()), Param<hdmasktr, cIn>(repository.hdmask()), Param<ofahdx, cIn>(repository.ofahdx()), Param<ofahdy, cIn>(repository.ofahdy()), Param<crlato, cIn>(repository.crlato()), Param<crlatu, cIn>(repository.crlatu()), ), define_temporaries( StencilExpandableBuffer<lap, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<flx, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<fly, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<rxp, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<rxm, Real, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_caches( IJCache<lap, KRange<FullDomain,0,0> >(), IJCache<flx, KRange<FullDomain,0,0> >(), IJCache<fly, KRange<FullDomain,0,0> >(), IJCache<rxp, KRange<FullDomain,0,0> >(), IJCache<rxm, KRange<FullDomain,0,0> >() ), define_stages( StencilStage<LapStage, IJRange<cComplete,-2,2,-2,2>, KRange<FullDomain,0,0> >(), StencilStage<FluxStage, IJRange<cComplete,-2,1,-2,1>, KRange<FullDomain,0,0> >(), StencilStage<RXStage, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<LimitFluxStage, IJRange<cIndented,-1,0,-1,0>, KRange<FullDomain,0,0> >(), StencilStage<DataStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) );

Expandable parameters Expandable buffers

slide-35
SLIDE 35

Carlos Osuna 35 Dagstuhl Seminar– 14th April 2015 // setup the tracer stencil StencilCompiler::Build( stencil_, "HorizontalDiffusionTracers", repository.calculationDomain(), StencilConfiguration<Real, HorizontalDiffusionTracersBlockSize>(), pack_parameters( /* output fields */ ExpandableParam<data_out, cInOut>(tracOut.begin(), tracOut.end()), /* input fields */ ExpandableParam<data_in, cIn>(tracIn.begin(), tracIn.end()), Param<hdmasktr, cIn>(repository.hdmask()), Param<ofahdx, cIn>(repository.ofahdx()), Param<ofahdy, cIn>(repository.ofahdy()), Param<crlato, cIn>(repository.crlato()), Param<crlatu, cIn>(repository.crlatu()) ), define_temporaries( StencilExpandableBuffer<lap, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<flx, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<fly, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<rxp, Real, KRange<FullDomain,0,0> >(), StencilExpandableBuffer<rxm, Real, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_caches( IJCache<lap, KRange<FullDomain,0,0> >(), IJCache<flx, KRange<FullDomain,0,0> >(), IJCache<fly, KRange<FullDomain,0,0> >(), IJCache<rxp, KRange<FullDomain,0,0> >(), IJCache<rxm, KRange<FullDomain,0,0> >() ), define_stages( StencilStage<LapStage, IJRange<cComplete,-2,2,-2,2>, KRange<FullDomain,0,0> >(), StencilStage<FluxStage, IJRange<cComplete,-2,1,-2,1>, KRange<FullDomain,0,0> >(), StencilStage<RXStage, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<LimitFluxStage, IJRange<cIndented,-1,0,-1,0>, KRange<FullDomain,0,0> >(), StencilStage<DataStage, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) );

Expandable parameters Expandable buffers Expandable caches

slide-36
SLIDE 36

Carlos Osuna 36 Dagstuhl Seminar– 14th April 2015

Extending the Parallelization Model

Warp: 32 threads

  • Tiling in IJ plane
  • Blocks are extended with halo points
  • Parallel execution of IJ plane
  • K dimension executed sequentially

by each (CUDA/OpenMP) thread

slide-37
SLIDE 37

Carlos Osuna 37 Dagstuhl Seminar – 14th April 2015

Parallelization Model: KParallel

Compact stencils of horizontal operators can be parallelized in K. Vertical dependencies only on input fields. Only horizontal dependencies on intermediate computed values K parallelization increases parallelism and occupancy in GPUs

slide-38
SLIDE 38

Carlos Osuna 38 Dagstuhl Seminar – 14th April 2015

StencilCompiler::Build( stencil, pack_parameters(...), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKParallel>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Result, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) );

Parallelization Model: KParallel

slide-39
SLIDE 39

Carlos Osuna 39 Dagstuhl Seminar – 14th April 2015

Parallelization Model: KParallel

dycore stencils on 32x32, K20c

slide-40
SLIDE 40

Carlos Osuna 40 Dagstuhl Seminar – 14th April 2015

Parallelization Model: Parallel Tridiagonal Solver

Basic Thomas algorithm (sequential in K) for large domains.

Performance deteriorates at small domain, due to lack of parallelism STELLA integrates a parallel HPCR algorithm (Jeremy Appleyard – NVIDIA) Additionally we can prepare the system coefficients on the fly in the same stencil based on the prognostic variables.

slide-41
SLIDE 41

Carlos Osuna 41 Dagstuhl Seminar – 14th April 2015

Parallelization Model: Parallel Tridiagonal Solves

StencilCompiler::Build( stencil_, "TridiagonalSolve_HPCR", repository.calculationDomain(), StencilConfiguration<Real, TridiagonalSolve_HPCRBlockSize>(), pack_parameters( /* output fields */ Param<w_out, cInOut>(repository.w_out()), Param<acol_dat, cInOut>(repository.acol()), Param<bcol_dat, cInOut>(repository.bcol()), Param<ccol_dat, cInOut>(repository.ccol()), Param<dcol_dat, cInOut>(repository.dcol()) ), define_loops( define_sweep<cTridiagonalSolve>( define_stages( StencilStage<SetupStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(), StencilStage<TridiagonalSolveFWStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >(), StencilStage<WriteOutputStage, IJRange<cIndented,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) );

  • STELLA keyword to trigger a HPCR algorithm for the tridiagonal solve
slide-42
SLIDE 42

Carlos Osuna 42 Dagstuhl Seminar – 14th April 2015

HPCR Performance vs Thomas for domain sizes of 32 x (J domain size) Thomas HPCR

slide-43
SLIDE 43

Carlos Osuna 43 Dagstuhl Seminar – 14th April 2015

Conclusions

  • STELLA used to port the dynamical core of COSMO to

GPUs:

  • Retain single source code.
  • Performance portable
  • 2 backends, CPU(1.8x) and GPU(5.8x)
  • C++ template metaprogramming -> interoperatility.
  • Extended parallelization modes further exploits

performance for characteristic algorithmics motifs of STELLA

slide-44
SLIDE 44

Carlos Osuna 44 Dagstuhl Seminar – 14th April 2015

BACKUP

slide-45
SLIDE 45

Carlos Osuna 45 Dagstuhl Seminar – 14th April 2015