STELLA: A Domain Specific Language for Stencil Computations Carlos - PowerPoint PPT Presentation

STELLA: A Domain Specific Language for Stencil Computations Carlos Osuna, Center for Climate Systems Modeling, ETH Zurich Tobias Gysi, Department of Computer Science, ETH Zurich Oliver Fuhrer, Federal Office of Meteorology and Climatology, Meteoswiss Zurich Mauro Bianco, Swiss National Supercomputing Centre, CSCS Lugano. Thomas C. Schulthess, Swiss National Supercomputing Centre, CSCS Lugano. Work funded by HP2C and PASC initiatives Carlos Osuna Dagstuhl Seminar, 14th April 2015 1

Outline Dynamical Core of COSMO STELLA: DSL using C++ template metaprogramming STELLA DSL elements Exploiting performance: Loop & kernel fusion, data locality Extending Parallelization models Carlos Osuna Dagstuhl Seminar – 14th April 2015 2

MOTIVATION: COSMO MODEL Dynamical core of COSMO solves the Navier Stokes equations using finite difference methods on structured grids. Stencils on 3D grids: 2-7 km resolution in the horizontal, 60/80 levels in the vertical Split-explicit time integrator with slow processes integrated using Runge-Kutta method. Collection of (Fortran) operators (advection, diffusion, fast waves, etc.) on prognostic variables ( ρ , u, v, w, T, qx, ...) Carlos Osuna Dagstuhl Seminar – 14th April 2015 3

COSMO tjme step Initialization Boundary conditions Physics Dynamics Data assimilation Halo-update Δt Diagnostics I/O Cleanup Carlos Osuna Dagstuhl Seminar– 14th April 2015 4

COSMO Dynamical Core Dynamical code: ~35 stencils per time step Initialization Vertical Diffusion (compute U, V, W) S t U a r Boundary conditions , t V c , o Vertical Diffusion W Physics m m (compute T) Dynamics W U a i , t Data assimilation …. V c , o W m Halo-update m Δt Time Integrator Diagnostics N small ∆ t (FastWaves RungeKutta) I/O …. Horizontal Diffusion Cleanup Carlos Osuna Dagstuhl Seminar – 14th April 2015 5

Algorithmic Motifs In COSMO Horizontal PDE operators explicitly solved -> compact stencils with horizontal data dependencies for k=1, ke { for i=1, ie-1 { for j=1, je-1 { lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) } } for i=1, ie-1 { for j=1, je-1 { difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1) } } } Carlos Osuna Dagstuhl Seminar– 14th April 2015 6

Algorithmic Motifs In COSMO Vertical PDE operators implicitly solved -> tridiagonal systems // forward for j=1, je for i=1, ie for k=2, ke { c(i,j,k) = 1.0 / b(i,j,k) - c(i,j,k-1) * a(i,j,k) ) d(i,j,k) = ( d(i,j,k) - d(i,j,k-1) * a(i,j,k) ) * c(i,j,k) } } } // backward substjtutjon for j=1, je { for i=1, ie { for k=2, ke { x(i,j,k) = d(i,j,k) - c(i,j,k) * x(i,j,k+1) } } } Carlos Osuna Dagstuhl Seminar – 14th April 2015 7

Model development U = U ( x, z,t ) U ( x, z ,t = 0 )= U 0 ( x, z ) ∂ U 4 U =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ for i=1, ie-1 for j=1, je-1 lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) for i=1, ie-1 for j=1, je-1 difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1) Carlos Osuna Dagstuhl Seminar – 14th April 2015 8

U = U ( x, z,t ) U ( x, z ,t = 0 )= U 0 ( x, z ) ∂ U 4 U =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ for i=1, ie-1 for j=1, je-1 lap(i,j) = -4*U(i,j) + U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) for i=1, ie-1 for j=1, je-1 difg(i,j) = -4*lap(i,j) + lap(i+1,j) + lap(i-1,j) + lap(i,j+1) + lap(i,j-1) Carlos Osuna Dagstuhl Seminar– 14th April 2015 9

U = U ( x, z,t ) U ( x, z ,t = 0 )= U 0 ( x, z ) ∂ U 4 U =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ ? Carlos Osuna Dagstuhl Seminar – 14th April 2015 10

2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∇ 2 Δ static void Lap(Context ctx) { ctx[lap::Center()] = ctx[u::At(iplus1)] + ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)] - 4*ctx[u::Center()]; } Carlos Osuna Dagstuhl Seminar – 14th April 2015 11

const int i = threadIdx.x; const int j = threadIdx.y; int i_h = 0; int j_h = 0; if(j < 2) { i_h = i; j_h = (j==0 ? -1 : blockDim.y); } else if(j < 4 && i <= blockDim.y) { i_h = (j==2 ? -1 : blockDim.x); j_h = i; } for(int k=0; k < kdim; ++k) { lap(i,j) = - 4.0 * U(i,j,k) + U(i+1,j,k) + U(i-1,j,k) + U(i,j+1,k) + U(i,j-1,k) if(i_h != 0 || j_h != 0) lap(i_h, j_h) = - 4.0 * U(i_h,j_h,k) + U(i_h+1,j_h,k) + U(i_h-1,j_h,k) + U(i_h,j_h+1,k) + U(i_h,j_h-1,k) __syncthreads(); result(i,j) = -4*(i,j,k) - alpha(i,j,k)*( lap(i,j,k) - lap(i-1,j,k) + lap(i,j,k) - lap(i,j-1,k)) } Carlos Osuna Dagstuhl Seminar – 14th April 2015 12

STELLA DSL using C++ template metaprogramming Separate model and algorithm from hardware specifjc implementatjon and optjmizatjons. Single source code compiling multjple architectures, performance portable. Concise syntax / close to mathematjcal descriptjon of operators Same toolchain as the full model – (access to debugger/data fjelds, interoperability with other programming languages) Supports CPU & GPU backends Used to rewrite a full dynamical core (COSMO) Carlos Osuna Dagstuhl Seminar – 14th April 2015 13

How it works? Operators are encapsulated as type informatjon (functors) Types are composed/organized and manipulated template<typename Context> struct LapStage{ static void Do(Context ctx) { ctx[lap::Center()] = ctx[u::At(iplus1)] + ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)] - 4*ctx[T::Center()]; } }; Specifjc backends expand the code of operators into template kernels with loop structures Many optjmizatjons (memory layout, tjling, looping, data locality) depend on the backend Carlos Osuna Dagstuhl Seminar– 14th April 2015 14

Stencil Stages Stencil Stages describe the single operatjon applied to each grid point. Assumed parallel executjon in the IJ plane template<typename Tenv> struct Lap { STENCIL_STAGE(TEnv) STAGE_PARAMETER(u) STAGE_PARAMETER(lap) statjc void Do(Context ctx, FullDomain) { ctx[lap::Center()] = -4*ctx[u::Center()] + ctx[u::At(iplus1)] + ctx[u::At(iminus1)] + ctx[u::At(jplus1)] + ctx[u::At(jminus1)]; } } Carlos Osuna Dagstuhl Seminar – 14th April 2015 15

Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Instantiate data fields pack_parameters( (memory layout abstracted, Param<in, cIn, cDataField>( in_ ), backend dependent) Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Carlos Osuna Dagstuhl Seminar– 14th April 2015 16

Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Pack the parameters and associate pack_parameters( them to place holders used in stages Param<in, cIn, cDataField>( in_ ), Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Carlos Osuna Dagstuhl Seminar – 14th April 2015 17

Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Define temporary buffers used by the pack_parameters( stencil Param<in, cIn, cDataField>( in_ ), Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Carlos Osuna Dagstuhl Seminar – 14th April 2015 18

Full example IJKRealField in_, out_; StencilCompiler::Build( Stencil, Compose multiple stencil stages. pack_parameters( ( Operations with data dependencies placed Param<in, cIn, cDataField>( in_ ), in different stages ) Param<out, cInOut, cDataField>( out_ ) ), define_temporaries( Build Sweeps (sequential loops in K ) StencilBuffer<lap, double, KRange<FullDomain,0,0> >() ), define_loops( define_sweep<cKIncrement>( define_stages( StencilStage<Lap, IJRange<cIndented,-1,1,-1,1>, KRange<FullDomain,0,0> >(), StencilStage<Lap2, IJRange<cComplete,0,0,0,0>, KRange<FullDomain,0,0> >() ) ) ) ); Dagstuhl Seminar – 14th April 2015 19 Carlos Osuna

Computation on the fly or buffering? 2 U i, j = U i + 1, j + U i − 1, j + U i, j + 1 + U i, j − 1 − 4 U i, j ∂ U ∇ 4 =−α ∇ 2 (∇ 2 U ) ∂ t =−α ∇ 2 Δ 4 T 2 U ∇ ∇ 2 T ∇ Carlos Osuna Dagstuhl Seminar– 14th April 2015 20

STELLA: A Domain Specific Language for Stencil Computations Carlos - PowerPoint PPT Presentation

STELLA: A Domain Specific Language for Stencil Computations Carlos Osuna, Center for Climate Systems Modeling, ETH Zurich Tobias Gysi, Department of Computer Science, ETH Zurich Oliver Fuhrer, Federal Office of Meteorology and Climatology,

Stencil Buffer Algorithms CS418 Computer Graphics John C. Hart Stencil Buffer

Precision solder paste stencil for fine pitch printing applications www.microstencil.com

Stella TUESDAY JUNE 18, 2019 What is Stella? Stella is a strategy and analysis tool that helps

Stella Performance Strategy & Analysis Tool June 5 & 6, 2019 1 Stella Performance

Realizing OutofCore Stencil Computations using MultiTier Memory Hierarchy on GPGPU

Domain-Specific Languages for Stencil Computations Azamat Mametjanov Boyana Norris

Embarrassingly Parallel Computations 3.2 1 Embarrassingly Parallel Computations A computation

An Auto-Tuning Framework for Parallel Multicore Stencil Computations Shoaib Kamil , Cy Chan ,

DSL Engineering with Sven Efftinge - itemis.com DOMAIN-SPECIFIC LANGUAGE A Domain Specific

Creative surprises from Undercover 29.09.2017 1 Stencil set 2 sticker sheets, stencil, 10

SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou

Autotuning OpenCL Workgroup Size for Stencil Patterns Chris Cummins http://chriscummins.cc

Domain Specific Languages Domain Specific Languages in Erlang Dennis Byrne

Organization of DSLE part Tooling Domain Specific Language Domain Specific Language

Developmental Developmental Disorders affecting Disorders affecting language language

Realizing Extremely LargeScale Stencil Applications on GPU Supercomputers Toshio Endo, Yuki

Chart 1 U.S. Personal Saving Rate and Household Debt (consumer plus mortgage) as a % of

Changing Network Detection Using Bro and Distributed Computing Concepts Mike Reeves @TOoSmOotH

BenchCouncil AIBench --- A Datacenter AI Benchmark Suite Wanling Gao, Fei Tang, Jianfeng Zhan

New opportunities in optical testing given by photochromic materials Giorgio Pariani, Martino

Women on Boards in Italy Magda Bianco, Angela Ciavarella, Rossella Signoretti Banca dItalia,

On Model Checking Techniques for Randomized Distributed Systems Christel Baier Technische

JRA2 Task 24.7 Testing JRA Face to Face Meeting ESRF Ivan Andrian 2018.05.24 2 Data

New observable for CP asymmetry in charm decays Hsiang-nan Li ( ) Academia Sinica,

STELLA: A Domain Specific Language for Stencil Computations Carlos - PowerPoint PPT Presentation

STELLA: A Domain Specific Language for Stencil Computations Carlos Osuna, Center for Climate Systems Modeling, ETH Zurich Tobias Gysi, Department of Computer Science, ETH Zurich Oliver Fuhrer, Federal Office of Meteorology and Climatology,

Stencil Buffer Algorithms CS418 Computer Graphics John C. Hart Stencil Buffer

Precision solder paste stencil for fine pitch printing applications www.microstencil.com

Stella TUESDAY JUNE 18, 2019 What is Stella? Stella is a strategy and analysis tool that helps

Stella Performance Strategy &amp; Analysis Tool June 5 &amp; 6, 2019 1 Stella Performance

Realizing OutofCore Stencil Computations using MultiTier Memory Hierarchy on GPGPU

Domain-Specific Languages for Stencil Computations Azamat Mametjanov Boyana Norris

Embarrassingly Parallel Computations 3.2 1 Embarrassingly Parallel Computations A computation

An Auto-Tuning Framework for Parallel Multicore Stencil Computations Shoaib Kamil , Cy Chan ,

DSL Engineering with Sven Efftinge - itemis.com DOMAIN-SPECIFIC LANGUAGE A Domain Specific

Creative surprises from Undercover 29.09.2017 1 Stencil set 2 sticker sheets, stencil, 10

SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou

Autotuning OpenCL Workgroup Size for Stencil Patterns Chris Cummins http://chriscummins.cc

Domain Specific Languages Domain Specific Languages in Erlang Dennis Byrne

Organization of DSLE part Tooling Domain Specific Language Domain Specific Language

Developmental Developmental Disorders affecting Disorders affecting language language

Realizing Extremely LargeScale Stencil Applications on GPU Supercomputers Toshio Endo, Yuki

Chart 1 U.S. Personal Saving Rate and Household Debt (consumer plus mortgage) as a % of

Changing Network Detection Using Bro and Distributed Computing Concepts Mike Reeves @TOoSmOotH

BenchCouncil AIBench --- A Datacenter AI Benchmark Suite Wanling Gao, Fei Tang, Jianfeng Zhan

New opportunities in optical testing given by photochromic materials Giorgio Pariani, Martino

Women on Boards in Italy Magda Bianco, Angela Ciavarella, Rossella Signoretti Banca dItalia,

On Model Checking Techniques for Randomized Distributed Systems Christel Baier Technische

JRA2 Task 24.7 Testing JRA Face to Face Meeting ESRF Ivan Andrian 2018.05.24 2 Data

New observable for CP asymmetry in charm decays Hsiang-nan Li ( ) Academia Sinica,

Stella Performance Strategy & Analysis Tool June 5 & 6, 2019 1 Stella Performance