stella a domain specific language for stencil computations
play

STELLA: A Domain Specific Language for Stencil Computations Carlos - PowerPoint PPT Presentation

STELLA: A Domain Specific Language for Stencil Computations Carlos Osuna, Center for Climate Systems Modeling, ETH Zurich Tobias Gysi, Department of Computer Science, ETH Zurich Oliver Fuhrer, Federal Office of Meteorology and Climatology,


  1. STELLA: A Domain Specific Language for Stencil Computations Carlos Osuna, Center for Climate Systems Modeling, ETH Zurich Tobias Gysi, Department of Computer Science, ETH Zurich Oliver Fuhrer, Federal Office of Meteorology and Climatology, Meteoswiss Zurich Mauro Bianco, Swiss National Supercomputing Centre, CSCS Lugano. Thomas C. Schulthess, Swiss National Supercomputing Centre, CSCS Lugano. Work funded by HP2C and PASC initiatives Carlos Osuna Dagstuhl Seminar, 14th April 2015 1

  2. Outline Dynamical Core of COSMO STELLA: DSL using C++ template metaprogramming STELLA DSL elements Exploiting performance: Loop & kernel fusion, data locality Extending Parallelization models Carlos Osuna Dagstuhl Seminar – 14th April 2015 2

  3. MOTIVATION: COSMO MODEL Dynamical core of COSMO solves the Navier Stokes equations using finite difference methods on structured grids. Stencils on 3D grids: 2-7 km resolution in the horizontal, 60/80 levels in the vertical Split-explicit time integrator with slow processes integrated using Runge-Kutta method. Collection of (Fortran) operators (advection, diffusion, fast waves, etc.) on prognostic variables ( ρ , u, v, w, T, qx, ...) Carlos Osuna Dagstuhl Seminar – 14th April 2015 3

  4. COSMO tjme step Initialization Boundary conditions Physics Dynamics Data assimilation Halo-update Δt Diagnostics I/O Cleanup Carlos Osuna Dagstuhl Seminar– 14th April 2015 4

  5. COSMO Dynamical Core Dynamical code: ~35 stencils per time step Initialization Vertical Diffusion (compute U, V, W) S t U a r Boundary conditions , t V c , o Vertical Diffusion W Physics m m (compute T) Dynamics W U a i , t Data assimilation …. V c , o W m Halo-update m Δt Time Integrator Diagnostics N small ∆ t (FastWaves RungeKutta) I/O …. Horizontal Diffusion Cleanup Carlos Osuna Dagstuhl Seminar – 14th April 2015 5

  6. Algorithmic Motifs In COSMO Horizontal PDE operators explicitly solved -> compact stencils with horizontal data dependencies for k=1, ke { for i=1, ie-1 { for j=1, je-1 { lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) } } for i=1, ie-1 { for j=1, je-1 { difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1) } } } Carlos Osuna Dagstuhl Seminar– 14th April 2015 6

  7. Algorithmic Motifs In COSMO Vertical PDE operators implicitly solved -> tridiagonal systems // forward for j=1, je for i=1, ie for k=2, ke { c(i,j,k) = 1.0 / b(i,j,k) - c(i,j,k-1) * a(i,j,k) ) d(i,j,k) = ( d(i,j,k) - d(i,j,k-1) * a(i,j,k) ) * c(i,j,k) } } } // backward substjtutjon for j=1, je { for i=1, ie { for k=2, ke { x(i,j,k) = d(i,j,k) - c(i,j,k) * x(i,j,k+1) } } } Carlos Osuna Dagstuhl Seminar – 14th April 2015 7

  8. Model development U = U ( x, z,t ) U ( x, z ,t = 0 )= U 0 ( x, z ) ∂ U 4 U =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ for i=1, ie-1 for j=1, je-1 lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) for i=1, ie-1 for j=1, je-1 difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1) Carlos Osuna Dagstuhl Seminar – 14th April 2015 8

  9. U = U ( x, z,t ) U ( x, z ,t = 0 )= U 0 ( x, z ) ∂ U 4 U =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ for i=1, ie-1 for j=1, je-1 lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) for i=1, ie-1 for j=1, je-1 difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1) Carlos Osuna Dagstuhl Seminar– 14th April 2015 9

  10. U = U ( x, z,t ) U ( x, z ,t = 0 )= U 0 ( x, z ) ∂ U 4 U =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ ? Carlos Osuna Dagstuhl Seminar – 14th April 2015 10

  11. 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ static void Lap(Context ctx) { ctx[lap::Center()] = ctx[u::At(iplus1)] + ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)] - 4*ctx[u::Center()]; } Carlos Osuna Dagstuhl Seminar – 14th April 2015 11

  12. const int i = threadIdx.x; const int j = threadIdx.y; int i_h = 0; int j_h = 0; if(j < 2) { i_h = i; j_h = (j==0 ? -1 : blockDim.y); } else if(j < 4 && i <= blockDim.y) { i_h = (j==2 ? -1 : blockDim.x); j_h = i; } for(int k=0; k < kdim; ++k) { lap(i,j) = - 4.0 * U(i,j,k) + U(i+1,j,k) + U(i-1,j,k) + U(i,j+1,k) + U(i,j-1,k) if(i_h != 0 || j_h != 0) lap(i_h, j_h) = - 4.0 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) + U(i_h,j_h+1,k) + U(i_h,j_h-1,k) __syncthreads(); result(i,j) = -4*(i,j,k) - alpha(i,j,k)*( lap(i,j,k) - lap(i-1,j,k) + lap(i,j,k) - lap(i,j-1,k)) } Carlos Osuna Dagstuhl Seminar – 14th April 2015 12

  13. STELLA DSL using C++ template metaprogramming Separate model and algorithm from hardware specifjc implementatjon and optjmizatjons. Single source code compiling multjple architectures, performance portable. Concise syntax / close to mathematjcal descriptjon of operators Same toolchain as the full model – (access to debugger/data fjelds, interoperability with other programming languages) Supports CPU & GPU backends Used to rewrite a full dynamical core (COSMO) Carlos Osuna Dagstuhl Seminar – 14th April 2015 13

  14. How it works? Operators are encapsulated as type informatjon (functors) Types are composed/organized and manipulated template<typename Context> struct LapStage{ static void Do(Context ctx) { ctx[lap::Center()] = ctx[u::At(iplus1)] + ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)] - 4*ctx[T::Center()]; } }; Specifjc backends expand the code of operators into template kernels with loop structures Many optjmizatjons (memory layout, tjling, looping, data locality) depend on the backend Carlos Osuna Dagstuhl Seminar– 14th April 2015 14

  15. Stencil Stages Stencil Stages describe the single operatjon applied to each grid point. Assumed parallel executjon in the IJ plane template<typename Tenv> struct Lap { STENCIL_STAGE(TEnv) STAGE_PARAMETER(u) STAGE_PARAMETER(lap) statjc void Do(Context ctx, FullDomain) { ctx[lap::Center()] = -4*ctx[u::Center()] + ctx[u::At(iplus1)] + ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)]; } } Carlos Osuna Dagstuhl Seminar – 14th April 2015 15

  16. Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Instantiate data fields pack_parameters( (memory layout abstracted, Param<in, cIn, cDataField>( in_ ), backend dependent) Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Carlos Osuna Dagstuhl Seminar– 14th April 2015 16

  17. Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Pack the parameters and associate pack_parameters( them to place holders used in stages Param<in, cIn, cDataField>( in_ ), Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Carlos Osuna Dagstuhl Seminar – 14th April 2015 17

  18. Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Define temporary buffers used by the pack_parameters( stencil Param<in, cIn, cDataField>( in_ ), Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Carlos Osuna Dagstuhl Seminar – 14th April 2015 18

  19. Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Compose multiple stencil stages. pack_parameters( ( Operations with data dependencies placed Param<in, cIn, cDataField>( in_ ), in different stages ) Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( Build Sweeps (sequential loops in K ) StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Dagstuhl Seminar – 14th April 2015 19 Carlos Osuna

  20. Computation on the fly or buffering? 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∂ U ∇ 4 =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 Δ 4 T 2 U ∇ ∇ 2 T ∇ Carlos Osuna Dagstuhl Seminar– 14th April 2015 20

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend