spcl.inf.ethz.ch @spcl_eth
MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures
Tobias Gysi, Tobias Grosser, and Torsten Hoefler
Motivation:
- COSMO is a regional climate model used by 7 national weather services
- The real-world application COSMO contains hundreds of different stencils
Analysis:
- Stencil programs can be represented using directed acyclic graphs
- Nodes and edges correspond to stencils and dependencies respectively
Stencil Programs
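Figure: simplified horizontal diffusion example, a dependency graph with the nodes in, lap, fli, flj, and out.
The edges combine access offsets with $a \oplus b = \{\, a' + b' \mid a' \in a,\ b' \in b \,\}$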
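The ⊕ operator is a Minkowski sum of offset sets, so chaining two stencils widens the region of the input that one output point touches. A minimal Python sketch, assuming 2D integer offsets; the five-point offsets for lap are the usual Laplacian neighborhood, while the offsets for fli are a hypothetical choice made only for illustration:

```python
from itertools import product

def minkowski_sum(a, b):
    """Return a (+) b = {a' + b' | a' in a, b' in b} for sets of 2D offsets."""
    return {(x0 + y0, x1 + y1) for (x0, x1), (y0, y1) in product(a, b)}

# Five-point Laplacian neighborhood read by lap.
LAP = {(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)}
# Hypothetical offsets read by fli: the point itself and its right neighbor.
FLI = {(0, 0), (1, 0)}

# Offsets of the input field reached from one fli evaluation through lap.
fli_via_lap = minkowski_sum(FLI, LAP)
print(sorted(fli_via_lap))
```

With these assumed offsets the combined footprint has eight points, because the two shifted five-point neighborhoods overlap in two positions.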
TODO: Quickly say what tiling and loop fusion are?
- Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)
How to Deal with Data Dependencies?
Figure: tiles 0, 1, and 2 along the i-dimension, evolving over time.
Inter-tile data dependencies:
- Perform halo exchange communication
- Introduce redundant computation
Halo Exchange Parallel (hp):
- Update tiles in parallel
- Perform halo exchange communication
Pros and Cons:
- Avoid redundant computation
- At the cost of additional synchronization
Halo Exchange Sequential (hs):
- Update tiles sequentially
- E.g., the innermost loop updates tile by tile
Pros and Cons:
- Avoid redundant computation
- At the cost of sequential execution
Computation on-the-fly (of):
- Compute all dependencies on-the-fly
- «Overlapped tiling»
Pros and Cons:
- Avoid synchronization
- At the cost of redundant computation
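The trade-off between halo exchange and computation on-the-fly can be made concrete by counting stencil evaluations along the i-dimension. The sketch below is an illustration only: the access offsets of out and fli are assumptions, and the function names are hypothetical.

```python
def required_range(lo, hi, offsets):
    """Range of producer indices read when the consumer covers [lo, hi)."""
    return lo + min(offsets), hi + max(offsets)

# Hypothetical 1D access offsets along the i-dimension (assumptions).
OUT_READS_FLI = (-1, 0)
FLI_READS_LAP = (0, 1)

def evaluations_on_the_fly(n, tile):
    """Count evaluations when every tile recomputes its own dependencies."""
    counts = {"lap": 0, "fli": 0, "out": 0}
    for lo in range(0, n, tile):
        hi = min(lo + tile, n)
        fli_lo, fli_hi = required_range(lo, hi, OUT_READS_FLI)
        lap_lo, lap_hi = required_range(fli_lo, fli_hi, FLI_READS_LAP)
        counts["out"] += hi - lo
        counts["fli"] += fli_hi - fli_lo
        counts["lap"] += lap_hi - lap_lo
    return counts

def evaluations_halo_exchange(n):
    """Count evaluations when halo values are communicated, not recomputed."""
    fli_lo, fli_hi = required_range(0, n, OUT_READS_FLI)
    lap_lo, lap_hi = required_range(fli_lo, fli_hi, FLI_READS_LAP)
    return {"lap": lap_hi - lap_lo, "fli": fli_hi - fli_lo, "out": n}

print(evaluations_on_the_fly(12, 4))   # {'lap': 18, 'fli': 15, 'out': 12}
print(evaluations_halo_exchange(12))   # {'lap': 14, 'fli': 13, 'out': 12}
```

For 12 output points in tiles of width 4, overlapped tiling evaluates lap 18 times instead of 14. That extra work is the price paid for removing the synchronization between tiles.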
- STELLA is a stencil DSL used to refactor the dynamical core of COSMO
Case Study: STELLA (STEncil Loop LAnguage)
// define stencil functors
struct Lap { ... };
struct Fli { ... };
...
// stencil assembly
Stencil stencil;
StencilCompiler::Build(
    stencil,
    pack_parameters( ... ),
    define_temporaries(
        StencilBuffer<lap, double>(),
        StencilBuffer<fli, double>(),
        ...
    ),
    define_loops(
        define_sweep(
            StencilStage<Lap, IJRange<-1,1,-1,1> >(),
            StencilStage<Fli, IJRange<-1,0,0,0> >(),
            ...
        )
    )
);
// stencil execution
stencil.Apply();
- STELLA is implemented using C++ template metaprogramming
- STELLA defines a virtual tiling hierarchy that facilitates platform-independent code generation
Tiling Hierarchy of STELLA’s GPU-Backend
DSL      Tile Size     Strategy                  Memory     Communication
sweep    1 x 1 x 1     halo exchange parallel    registers  scratchpad
sweep    ∞ x ∞ x 1     halo exchange sequential  registers  registers
loop     64 x 4 x 64   computation on-the-fly    GDDR       -
stencil  ∞ x ∞ x ∞     computation on-the-fly    GDDR       -
Each row is one level of the tiling hierarchy.
- Map stencils to the tiling hierarchy using a bracket expression
- Enumerate the stencil execution orders that respect the dependencies
- Enumerate implementation variants by adding/removing brackets
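The two enumeration steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the MODESTO implementation: the dependency table mirrors the lap, fli, flj, out example, only one bracket level is enumerated, and the function names are hypothetical.

```python
# Dependencies of the horizontal diffusion example: consumer -> producers.
DEPS = {"lap": set(), "fli": {"lap"}, "flj": {"lap"}, "out": {"fli", "flj"}}

def execution_orders(deps):
    """Yield every ordering of the stencils that respects the dependencies."""
    def extend(order, remaining):
        if not remaining:
            yield tuple(order)
            return
        for s in sorted(remaining):
            if deps[s] <= set(order):  # all producers already scheduled
                yield from extend(order + [s], remaining - {s})
    yield from extend([], frozenset(deps))

def groupings(order):
    """Yield every split of an order into contiguous groups (one bracket level)."""
    n = len(order)
    for mask in range(2 ** (n - 1)):
        groups, start = [], 0
        for cut in range(1, n):
            if mask >> (cut - 1) & 1:  # bit set: close a bracket before position cut
                groups.append(list(order[start:cut]))
                start = cut
        groups.append(list(order[start:]))
        yield groups

orders = list(execution_orders(DEPS))
variants = list(groupings(orders[0]))
print(orders)
print(len(variants))
```

The example has two valid execution orders, since fli and flj may be swapped. Each order of four stencils yields eight single-level groupings, among them [[lap,fli,flj],[out]].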
Stencil Program Algebra
Figure: the stencil sequence ..., lap, fli, flj, out, ... is mapped to the tiling hierarchy by inserting brackets, for example [lap,fli,flj,out] or [[lap,fli,flj],[out]].
- Our model considers peak computation and communication throughputs
Machine Model
Figure: target machine and the derived machine model.
- Target machine: cores 1 to 3 (30 Gflop/s each), one 256 kB cache per core, 8 GB of DDR; the links are annotated with 100 GB/s (three links), 25 GB/s (two links), and 10 GB/s (one link)
- Machine model: C = 90 Gflop/s; V1 = 300 GB/s, L1 = 50 GB/s, M1 = 256 kB; V0 = 10 GB/s, L0 = 0 GB/s, M0 = 8 GB
Lateral communication refers to communication within one tiling hierarchy level, vertical communication to communication between different tiling hierarchy levels.
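- Given a stencil $s$ and the amount of computation $c_s$:
$t_s = c_s / C$
- Given a group $g$ and, for each child $c$, the vertical and lateral communication $v_c$ and $l_c^1, \dots, l_c^m$:
$t_g = \sum_{c \in g.\mathrm{child}} \max\left( t_c,\ \frac{v_c}{V_m},\ \frac{l_c^1}{L_1},\ \dots,\ \frac{l_c^m}{L_m} \right)$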
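As a worked instance of the model, the Python sketch below evaluates the stencil time $t_s = c_s / C$ and a group time as the sum over children of the largest of the compute and communication terms. The machine parameters are those of the machine model slide; the workloads are invented for illustration, and the handling of zero bandwidth is an assumption of this sketch.

```python
def stencil_time(flops, peak_flops):
    """t_s = c_s / C: compute-bound time of a single stencil."""
    return flops / peak_flops

def transfer_time(volume, bandwidth):
    """Time to move `volume` bytes; zero volume is free even at zero bandwidth."""
    if volume == 0:
        return 0.0
    return float("inf") if bandwidth == 0 else volume / bandwidth

def group_time(children, vertical_bw, lateral_bws):
    """t_g = sum over children c of max(t_c, v_c / V, l_c^1 / L_1, ...)."""
    total = 0.0
    for t_c, v_c, l_c in children:
        terms = [t_c, transfer_time(v_c, vertical_bw)]
        terms += [transfer_time(l, bw) for l, bw in zip(l_c, lateral_bws)]
        total += max(terms)
    return total

C = 90e9    # peak computation in flop/s (machine model slide)
V1 = 300e9  # vertical bandwidth at level 1 in bytes/s
L1 = 50e9   # lateral bandwidth at level 1 in bytes/s

# Hypothetical workloads: (flops, vertical bytes, [lateral bytes]) per stencil.
work = [(9e9, 60e9, [0.0]), (18e9, 30e9, [5e9])]
children = [(stencil_time(f, C), v, l) for f, v, l in work]
t_group = group_time(children, V1, [L1])
print(t_group)
```

In this made-up example the first child is bound by vertical communication (0.2 s) and the second by computation (0.2 s), so the group time is about 0.4 s.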
Performance Modeling
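Figure: execution times along the tiling hierarchy for [[lap,fli,flj],[out]].
- Stencils: $t_{lap}$, $t_{fli}$, $t_{flj}$, $t_{out}$ with communication terms $v_{lap}/V_2$, $v_{fli}/V_2$, $v_{flj}/V_2$, $v_{out}/V_2$
- Groups: $t_{[lap,fli,flj]}$ and $t_{[out]}$ with communication terms $v_{[lap,fli,flj]}/V_1$ and $v_{[out]}/V_1$
- Program: $t_{[[lap,fli,flj],[out]]}$
Remember the performance model parameters $C$, $V_m$, $L_1, \dots, L_m$.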
- The stencil program analysis is based on (quasi-) affine sets and maps
$S = \{\, i \mid i \in \mathbb{Z}^n \wedge (0, \dots, 0) < i < (10, \dots, 10) \,\}$
$M = \{\, i \to j \mid i \in \mathbb{Z}^n,\ j \in \mathbb{Z}^n,\ j = 2 \cdot i \,\}$
- For example, data dependencies are expressed using a named map
$D_{lap} = \{\, (lap, i) \to (in, i + j) \mid i \in \mathbb{Z}^2,\ j \in \{(0,0), (1,0), (-1,0), (0,1), (0,-1)\} \,\}$
Affine Sets and Maps
$D = D_{lap} \cup D_{fli} \cup D_{flj} \cup D_{out}$
$E = D^{+}(\{(out, \vec{0})\})$
Apply the transitive closure of all dependencies to the out origin vector.
Figure: dependency graph with the nodes in, lap, fli, flj, and out.
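For a finite example, the set of evaluations reached from the out origin can be computed with a plain worklist instead of a symbolic transitive closure. The sketch below is an illustration only: the lap offsets follow the five-point dependency map, whereas the fli, flj, and out offsets are assumptions chosen for the example.

```python
LAP5 = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

# Dependency maps as consumer -> list of (producer, offset). Only the lap
# offsets follow the slides; the other offsets are illustrative assumptions.
DEPS = {
    "lap": [("in", o) for o in LAP5],
    "fli": [("lap", (0, 0)), ("lap", (1, 0))],
    "flj": [("lap", (0, 0)), ("lap", (0, 1))],
    "out": [("fli", (0, 0)), ("fli", (-1, 0)), ("flj", (0, 0)), ("flj", (0, -1))],
}

def transitive_dependencies(start):
    """Return all (stencil, i0, i1) evaluations reachable from `start`."""
    reached, worklist = set(), [start]
    while worklist:
        name, i0, i1 = worklist.pop()
        for producer, (d0, d1) in DEPS.get(name, []):
            point = (producer, i0 + d0, i1 + d1)
            if point not in reached:
                reached.add(point)
                worklist.append(point)
    return reached

E = transitive_dependencies(("out", 0, 0))
per_stencil = {s: sum(1 for p in E if p[0] == s) for s in ("fli", "flj", "lap", "in")}
print(per_stencil)  # {'fli': 2, 'flj': 2, 'lap': 5, 'in': 13}
```

Under these assumed offsets a single out evaluation requires 2 fli, 2 flj, 5 lap, and 13 input points. A symbolic closure yields the same information for every origin at once.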
- Define a tiling using a map that associates stencil evaluations to tile ids
$T_{out} = \{\, (out, i_0, i_1) \to (\lfloor i_0 / 2 \rfloor, \lfloor i_1 / 2 \rfloor) \,\}$
Tiling Maps
Figure: 2 x 2 tiling of the (i0, i1) plane with tile ids (0,0) to (2,2); for example, the evaluation out (2,5) maps to tile id (1,2).
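The tiling map is a floor division of the evaluation coordinates by the tile size. A minimal Python sketch, assuming 2 x 2 tiles and a hypothetical 6 x 6 domain:

```python
from collections import defaultdict

def tile_id(i0, i1, tile=(2, 2)):
    """Map a stencil evaluation (i0, i1) to its tile id via floor division."""
    return i0 // tile[0], i1 // tile[1]

def group_by_tile(points, tile=(2, 2)):
    """Collect the evaluations that belong to each tile id."""
    tiles = defaultdict(list)
    for i0, i1 in points:
        tiles[tile_id(i0, i1, tile)].append((i0, i1))
    return dict(tiles)

print(tile_id(2, 5))  # (1, 2)

# Hypothetical 6 x 6 domain: nine tiles with four evaluations each.
domain = [(i0, i1) for i0 in range(6) for i1 in range(6)]
tiles = group_by_tile(domain)
print(len(tiles), len(tiles[(0, 0)]))
```

Python's `//` rounds toward negative infinity, which matches the floor in the tiling map also for negative coordinates.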
- Intersect the range of the tiling map with the base tiling hierarchy origin tile
$S = \{\, (0, 0, y_0, y_1) \mid y_j \in \mathbb{Z} \,\}$
- Count the number of stencil evaluations associated to the remaining tiles
$c_{out} = |T_{out} \cap_{ran} S| \cdot \#flops$
Count Stencil Evaluations
Count the points in the filtered tiling map using the Barvinok algorithm.
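For small domains the count can be checked by brute force. The sketch below enumerates the points explicitly rather than using the Barvinok algorithm, and it models overlapped tiles with a hypothetical `halo` parameter; the domain and the flop count per evaluation are assumptions.

```python
def tiles_of(i0, i1, tile=(2, 2), halo=0):
    """Tile ids that evaluate point (i0, i1); halo > 0 models overlapped tiles."""
    def ids(i, size):
        return range((i - halo) // size, (i + halo) // size + 1)
    return {(a, b) for a in ids(i0, tile[0]) for b in ids(i1, tile[1])}

def count_in_origin_tile(points, halo=0):
    """Count the evaluations associated with the origin tile (0, 0)."""
    return sum(1 for p in points if (0, 0) in tiles_of(*p, halo=halo))

# Hypothetical evaluation domain and flop count per evaluation.
DOMAIN = [(i0, i1) for i0 in range(-2, 8) for i1 in range(-2, 8)]
FLOPS_PER_EVALUATION = 5

evaluations = count_in_origin_tile(DOMAIN, halo=1)
computation = evaluations * FLOPS_PER_EVALUATION
print(evaluations, computation)  # 16 80
```

Without a halo the origin tile holds 4 evaluations. With a halo of one point it holds 16, which is the redundant computation the model charges to overlapped tiling.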
- Put it all together (stencil algebra, performance model, stencil analysis)
- 1. Optimize the stencil execution order (brute force search)
- 2. Optimize the stencil grouping (dynamic programming)
Stencil Program Optimization
Figure: four implementation variants of the lap, fli, flj, out stencil program.
minimize $t(x)$ over $x \in I$, subject to $m(x) \le M$
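Read as code, the optimization problem selects the fastest implementation variant whose memory footprint fits the available capacity. The Python sketch below uses invented time and memory estimates purely for illustration; `best_variant` is a hypothetical helper, not part of MODESTO.

```python
def best_variant(variants, time_of, memory_of, capacity):
    """Return the variant x minimizing t(x) subject to m(x) <= M, or None."""
    feasible = [x for x in variants if memory_of(x) <= capacity]
    return min(feasible, key=time_of, default=None)

# Hypothetical variants with made-up (time in ms, memory in kB) estimates.
ESTIMATES = {
    "[lap,fli,flj,out]": (2.0, 384),
    "[[lap,fli,flj],[out]]": (2.6, 192),
    "[[lap],[fli],[flj],[out]]": (4.1, 64),
}

choice = best_variant(
    ESTIMATES,
    time_of=lambda x: ESTIMATES[x][0],
    memory_of=lambda x: ESTIMATES[x][1],
    capacity=256,  # e.g. a 256 kB cache
)
print(choice)  # [[lap,fli,flj],[out]]
```

The fully fused variant is the fastest in this toy data, but it exceeds the capacity, so the search falls back to the next best feasible variant.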
Dynamic Programming (simplified)
Figure: dynamic programming table over contiguous stencil ranges, evaluated for tiling hierarchy level 1 and then for level 0:
lap⟷lap
lap⟷fli  fli⟷fli
lap⟷flj  fli⟷flj  flj⟷flj
lap⟷out  fli⟷out  flj⟷out  out⟷out
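For a single tiling hierarchy level, the grouping step can be sketched as a classic dynamic program over prefixes of the execution order. The range costs below are invented, and a missing range stands for an infeasible group (for example one that exceeds the memory capacity); neither is taken from the talk.

```python
from functools import lru_cache

ORDER = ("lap", "fli", "flj", "out")

# Hypothetical cost (ms) of fusing the contiguous range ORDER[i:j] into one
# group. Ranges that are absent are treated as infeasible.
RANGE_COST = {
    (0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 4): 1.0,
    (0, 2): 1.6, (1, 3): 1.5, (2, 4): 1.7,
    (0, 3): 2.0, (1, 4): 2.4,
}

@lru_cache(maxsize=None)
def best_grouping(end=len(ORDER)):
    """Return (cost, groups) of the cheapest grouping of ORDER[:end]."""
    if end == 0:
        return 0.0, ()
    candidates = []
    for start in range(end):
        cost = RANGE_COST.get((start, end))
        if cost is not None:
            prefix_cost, prefix_groups = best_grouping(start)
            candidates.append((prefix_cost + cost, prefix_groups + (ORDER[start:end],)))
    return min(candidates)

cost, groups = best_grouping()
print(cost, groups)
```

With these made-up costs the optimum is [[lap,fli,flj],[out]] at 3.0 ms. Each table entry reuses the best grouping of a shorter prefix, so the search avoids enumerating all bracketings explicitly.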
Evaluation
Relative performance (no fusion = 1):
CPU Experiments (i5-3330):
variant      hd   uv   div  uv&div
no fusion    1    1    1    1
hand-tuned   3.1  2.7  2.1  2.4
optimized    3.1  2.7  2.1  2.4
GPU Experiments (Tesla K20c):
variant      hd   uv   div  uv&div
no fusion    1    1    1    1
hand-tuned   2.3  2.1  1.1  1.5
optimized    2.3  2.4  2    2.1
Model accuracy (m = measured time [ms], e = estimated time [ms]): m ~ 1.6e on the CPU and m ~ 1.5e on the GPU.
- We categorize data locality transformations for stencil programs
- We enumerate stencil program implementation variants using an algebra
- Our performance model estimates the stencil program execution time
- Using MODESTO we can automatically tune STELLA stencil programs
- 2.0-3.1x speedup against naive implementations
- 1.0-1.8x speedup against expert tuned implementations