MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures



SLIDE 1

spcl.inf.ethz.ch @spcl_eth

TOBIAS GYSI, TOBIAS GROSSER, AND TORSTEN HOEFLER

MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures

SLIDE 2

Stencil Programs

Motivation:

  • COSMO is a regional climate model used by 7 national weather services
  • The real-world application COSMO contains hundreds of different stencils

Analysis:

  • Stencil programs can be represented using directed acyclic graphs
  • Nodes and edges correspond to stencils and dependencies, respectively

[Figure: simplified horizontal diffusion example, a dependency graph over in, lap, fli, flj, and out]

The ⊕ operator in the figure is defined as:

$$a \oplus b = \{\, a' + b' \mid a' \in a,\ b' \in b \,\}$$
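As a rough illustration of the stencil chain named on this slide, the sketch below evaluates lap, fli, flj, and out at a single grid point. The `Field` type, the flux and update formulas, and the coefficient `alpha` are assumptions chosen to match the dependency graph; the slide does not spell them out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major 2D field; not part of STELLA or COSMO.
struct Field {
    int ni, nj;
    std::vector<double> data;
    Field(int ni_, int nj_)
        : ni(ni_), nj(nj_), data(static_cast<std::size_t>(ni_) * nj_, 0.0) {}
    double& at(int i, int j) { return data[static_cast<std::size_t>(i) * nj + j]; }
    double at(int i, int j) const { return data[static_cast<std::size_t>(i) * nj + j]; }
};

// lap: 5-point Laplacian of the input field.
double lap(const Field& in, int i, int j) {
    return in.at(i + 1, j) + in.at(i - 1, j) + in.at(i, j + 1) + in.at(i, j - 1)
           - 4.0 * in.at(i, j);
}

// fli, flj: fluxes in i and j (assumed form: forward difference of lap).
double fli(const Field& in, int i, int j) { return lap(in, i + 1, j) - lap(in, i, j); }
double flj(const Field& in, int i, int j) { return lap(in, i, j + 1) - lap(in, i, j); }

// out: assumed update, the input minus alpha times the flux divergence.
double out(const Field& in, int i, int j, double alpha) {
    return in.at(i, j)
           - alpha * (fli(in, i, j) - fli(in, i - 1, j) + flj(in, i, j) - flj(in, i, j - 1));
}
```

Each function is one node of the graph and each call is one edge: out reads fli and flj at two points each, and every flux reads lap at two points.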

SLIDE 3


TODO: Quickly say what tiling and loop fusion are?

SLIDE 4

How to Deal with Data Dependencies?

  • Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)

Inter-tile data dependencies:

  • Perform halo exchange communication
  • Introduce redundant computation

[Figure: tiles 0, 1, and 2 along the i-dimension, plotted over time]

SLIDE 5

How to Deal with Data Dependencies?

  • Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)

Halo Exchange Parallel (hp):

  • Update tiles in parallel
  • Perform halo exchange communication

Pros and Cons:

  • Avoid redundant computation
  • At the cost of additional synchronization

[Figure: tiles along the i-dimension, plotted over time]

SLIDE 6

How to Deal with Data Dependencies?

  • Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)

Halo Exchange Sequential (hs):

  • Update tiles sequentially
  • E.g., the innermost loop updates tile by tile

Pros and Cons:

  • Avoid redundant computation
  • At the cost of being sequential

[Figure: tiles along the i-dimension, plotted over time]

SLIDE 7

How to Deal with Data Dependencies?

  • Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)

Computation on-the-fly (of):

  • Compute all dependencies on-the-fly
  • «Overlapped tiling»

Pros and Cons:

  • Avoid synchronization
  • At the cost of redundant computation

[Figure: tiles along the i-dimension, plotted over time]
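The trade-off of computation on-the-fly can be made concrete with a small 1D sketch of the lap-fli-out chain. The stencil formulas, the `Result` struct, and the function names are assumptions for illustration only. Each tile recomputes the lap and fli values in its halo, so no tile waits for a neighbor, but the evaluation counters grow beyond those of the untiled version.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// 1D Laplacian; the concrete formulas in this sketch are assumptions.
double lap_at(const std::vector<double>& in, int i) {
    return in[i - 1] - 2.0 * in[i] + in[i + 1];
}

struct Result {
    std::vector<double> out;  // out[i] is valid for 2 <= i < n - 2
    long lap_evals;           // number of lap evaluations performed
    long fli_evals;           // number of fli evaluations performed
};

// Untiled reference: every stencil is evaluated once per point.
Result untiled(const std::vector<double>& in) {
    const int n = static_cast<int>(in.size());
    assert(n >= 5);
    Result r{std::vector<double>(n, 0.0), 0, 0};
    std::vector<double> lap(n, 0.0), fli(n, 0.0);
    for (int i = 1; i < n - 1; ++i) { lap[i] = lap_at(in, i); ++r.lap_evals; }
    for (int i = 1; i < n - 2; ++i) { fli[i] = lap[i + 1] - lap[i]; ++r.fli_evals; }
    for (int i = 2; i < n - 2; ++i) r.out[i] = in[i] - (fli[i] - fli[i - 1]);
    return r;
}

// Overlapped tiling: each tile [a, b) of out computes its own lap and fli halo.
Result overlapped(const std::vector<double>& in, int tile) {
    const int n = static_cast<int>(in.size());
    assert(n >= 5 && tile >= 1);
    Result r{std::vector<double>(n, 0.0), 0, 0};
    for (int a = 2; a < n - 2; a += tile) {
        const int b = std::min(a + tile, n - 2);
        const int base = a - 1;  // first index held in the tile-local buffers
        std::vector<double> lap(b - a + 2), fli(b - a + 1);
        for (int i = a - 1; i <= b; ++i) { lap[i - base] = lap_at(in, i); ++r.lap_evals; }
        for (int i = a - 1; i < b; ++i) {
            fli[i - base] = lap[i + 1 - base] - lap[i - base];
            ++r.fli_evals;
        }
        for (int i = a; i < b; ++i)
            r.out[i] = in[i] - (fli[i - base] - fli[i - 1 - base]);
    }
    return r;
}
```

Both functions produce the same `out` values. For 18 input points and a tile size of 4, the overlapped version evaluates lap 22 times instead of 16, which is the redundant computation listed under cons.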

SLIDE 8

Case Study: STELLA (STEncil Loop LAnguage)

  • STELLA is a stencil DSL used to refactor the dynamical core of COSMO
  • It is implemented using C++ template metaprogramming

    // define stencil functors
    struct Lap { ... };
    struct Fli { ... };
    ...

    // stencil assembly
    Stencil stencil;
    StencilCompiler::Build(
        stencil,
        pack_parameters( ... ),
        define_temporaries(
            StencilBuffer<lap, double>(),
            StencilBuffer<fli, double>(),
            ...
        ),
        define_loops(
            define_sweep(
                StencilStage<Lap, IJRange<-1,1,-1,1> >(),
                StencilStage<Fli, IJRange<-1,0,0,0> >(),
                ...
            )
        )
    );

    // stencil execution
    stencil.Apply();

STELLA defines a virtual tiling hierarchy that facilitates platform-independent code generation.

SLIDE 9

Tiling Hierarchy of STELLA’s GPU-Backend

DSL     | Tile Size | Strategy                 | Memory    | Communication
sweep   | 1 x 1 x 1 | halo exchange parallel   | registers | scratchpad
sweep   | ∞ x ∞ x 1 | halo exchange sequential | registers | registers
loop    | 64x4x64   | computation on-the-fly   | GDDR      | -
stencil | ∞ x ∞ x ∞ | computation on-the-fly   | GDDR      | -

The rows form the tiling hierarchy.
SLIDE 10

Stencil Program Algebra

  • Map stencils to the tiling hierarchy using a bracket expression
  • Enumerate the stencil execution orders that respect the dependencies
  • Enumerate implementation variants by adding/removing brackets

[Figure: the horizontal diffusion graph (lap, fli, flj, out) mapped to bracket expressions]

Example bracket expressions:

[lap,fli,flj,out]
[[lap,fli,flj],[out]]
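The second step, enumerating execution orders that respect the dependencies, amounts to listing the topological orders of the stencil graph. The sketch below does this with a plain depth-first search. The type names and the helper `bracket` are hypothetical, and the dependency list used in the example (lap feeds fli and flj, which feed out) is an assumed reading of the figure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::string> Order;
typedef std::pair<std::string, std::string> Dependency;  // (producer, consumer)

// True if every producer of `stencil` already appears in `placed`.
bool ready(const std::string& stencil, const Order& placed,
           const std::vector<Dependency>& deps) {
    for (std::size_t d = 0; d < deps.size(); ++d) {
        if (deps[d].second != stencil) continue;
        if (std::find(placed.begin(), placed.end(), deps[d].first) == placed.end())
            return false;
    }
    return true;
}

// Depth-first enumeration of all orders that respect the dependencies.
void extend(const Order& stencils, const std::vector<Dependency>& deps,
            Order& placed, std::vector<Order>& result) {
    if (placed.size() == stencils.size()) { result.push_back(placed); return; }
    for (std::size_t s = 0; s < stencils.size(); ++s) {
        const std::string& name = stencils[s];
        if (std::find(placed.begin(), placed.end(), name) != placed.end()) continue;
        if (!ready(name, placed, deps)) continue;
        placed.push_back(name);
        extend(stencils, deps, placed, result);
        placed.pop_back();
    }
}

std::vector<Order> execution_orders(const Order& stencils,
                                    const std::vector<Dependency>& deps) {
    std::vector<Order> result;
    Order placed;
    extend(stencils, deps, placed, result);
    return result;
}

// Render an order as a single-group bracket expression such as "[lap,fli,flj,out]".
std::string bracket(const Order& order) {
    std::string text = "[";
    for (std::size_t s = 0; s < order.size(); ++s) {
        if (s > 0) text += ",";
        text += order[s];
    }
    return text + "]";
}
```

With the assumed dependencies, only two orders survive: lap, fli, flj, out and lap, flj, fli, out. Implementation variants would then be derived from each order by inserting or removing inner brackets.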

SLIDE 11

Machine Model

  • Our model considers peak computation and communication throughputs

Target machine: three cores (30 Gflop/s each), three caches (256 kB each), and DDR memory (8 GB). The links in the figure are labeled 100 GB/s (three times), 25 GB/s (twice), and 10 GB/s (once).

Machine model:

  C = 90 Gflop/s
  V1 = 300 GB/s, L1 = 50 GB/s, M1 = 256 kB
  V0 = 10 GB/s, L0 = 0 GB/s, M0 = 8 GB

Lateral communication refers to communication within one tiling hierarchy level; vertical communication refers to communication between different tiling hierarchy levels.
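One plausible reading of the figure, stated here as an assumption, is that each model parameter aggregates the peak values of the corresponding cores or links of the target machine:

```latex
\begin{aligned}
C   &= 3 \times 30\ \text{Gflop/s}  = 90\ \text{Gflop/s} \\
V_1 &= 3 \times 100\ \text{GB/s}    = 300\ \text{GB/s} \\
L_1 &= 2 \times 25\ \text{GB/s}     = 50\ \text{GB/s} \\
V_0 &= 1 \times 10\ \text{GB/s}     = 10\ \text{GB/s}
\end{aligned}
```

Under the same reading, $L_0 = 0$ GB/s because there is a single DDR memory and therefore no lateral link at level 0.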

SLIDE 12

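Performance Modeling

  • Given a stencil $s$ and the amount of computation $c_s$:

$$t_s = \frac{c_s}{C}$$

  • Given a group $g$ and the vertical and lateral communication $v_c$ and $l_c^1, \dots, l_c^m$:

$$t_g = \sum_{c \in g.\mathrm{child}} \max\left(t_c,\ \frac{v_c}{V_m},\ \frac{l_c^1}{L_1},\ \dots,\ \frac{l_c^m}{L_m}\right)$$

Remember the performance model parameters $C$, $V_m$, and $L_1, \dots, L_m$.

[Figure: tiling hierarchy versus execution time for horizontal diffusion. Stencils: $t_{lap}$, $t_{fli}$, $t_{flj}$, $t_{out}$ with $v_{lap}/V_2$, $v_{fli}/V_2$, $v_{flj}/V_2$, $v_{out}/V_2$. Groups: $t_{[lap,fli,flj]}$ and $t_{[out]}$ with $v_{[lap,fli,flj]}/V_1$ and $v_{[out]}/V_1$. Program: $t_{[[lap,fli,flj],[out]]}$]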
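A minimal sketch of how the two formulas could be evaluated, assuming that a group's time is the sum over its children of the per-child maximum of compute time and communication times. The `Child` struct and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One child of a group (a stencil or a nested group); all names are hypothetical.
struct Child {
    double t;               // execution time of the child itself
    double v;               // vertical communication volume
    std::vector<double> l;  // lateral communication volume for levels 1..m
};

// t_s = c_s / C
double stencil_time(double flops, double peak_flops) {
    return flops / peak_flops;
}

// t_g = sum over children of max(t_c, v_c / V_m, l_c^1 / L_1, ..., l_c^m / L_m)
double group_time(const std::vector<Child>& children, double V_m,
                  const std::vector<double>& L) {
    double total = 0.0;
    for (std::size_t c = 0; c < children.size(); ++c) {
        const Child& child = children[c];
        assert(child.l.size() == L.size());
        double t = std::max(child.t, child.v / V_m);
        for (std::size_t k = 0; k < L.size(); ++k) {
            // Levels without lateral traffic do not constrain the child.
            if (child.l[k] > 0.0) t = std::max(t, child.l[k] / L[k]);
        }
        total += t;
    }
    return total;
}
```

Taking the maximum per child models computation and communication that overlap; the slowest of them determines the time of that child.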

SLIDE 13

Affine Sets and Maps

  • The stencil program analysis is based on (quasi-)affine sets and maps

$$S = \{\, i \mid i \in \mathbb{Z}^n \wedge (0, \dots, 0) < i < (10, \dots, 10) \,\}$$

$$M = \{\, i \to j \mid i \in \mathbb{Z}^n,\ j \in \mathbb{Z}^n,\ j = 2 \cdot i \,\}$$

  • For example, data dependencies are expressed using a named map

$$D_{lap} = \{\, (lap, i) \to (in, i + j) \mid i \in \mathbb{Z}^2,\ j \in \{(0,0), (1,0), (-1,0), (0,1), (0,-1)\} \,\}$$

$$D = D_{lap} \cup D_{fli} \cup D_{flj} \cup D_{out}$$

$$E = D^{+}(\{(out, \vec{0})\})$$

Apply the out origin vector to the transitive closure of all dependencies.

[Figure: dependency graph over in, lap, fli, flj, and out]
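For a program this small, the effect of applying the transitive closure to the out origin can be reproduced by a breadth-first search over offset lists, as sketched below. Only the lap offsets are taken from the slide; the offsets for fli, flj, and out are assumptions, and the real analysis works on symbolic (quasi-)affine maps rather than enumerated points.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <tuple>
#include <vector>

struct Dep { std::string target; int di, dj; };
typedef std::tuple<std::string, int, int> Eval;  // (stencil or field name, i, j)

// Dependency maps as offset lists. D_lap follows the slide; the rest is assumed.
std::map<std::string, std::vector<Dep> > dependencies() {
    return {
        {"lap", {{"in", 0, 0}, {"in", 1, 0}, {"in", -1, 0}, {"in", 0, 1}, {"in", 0, -1}}},
        {"fli", {{"lap", 0, 0}, {"lap", 1, 0}}},
        {"flj", {{"lap", 0, 0}, {"lap", 0, 1}}},
        {"out", {{"in", 0, 0}, {"fli", 0, 0}, {"fli", -1, 0}, {"flj", 0, 0}, {"flj", 0, -1}}},
    };
}

// Everything reachable from `start` through one or more dependency steps.
std::set<Eval> closure(const Eval& start) {
    const std::map<std::string, std::vector<Dep> > deps = dependencies();
    std::set<Eval> reached;
    std::queue<Eval> work;
    work.push(start);
    while (!work.empty()) {
        const Eval current = work.front();
        work.pop();
        std::map<std::string, std::vector<Dep> >::const_iterator it =
            deps.find(std::get<0>(current));
        if (it == deps.end()) continue;  // the input field has no dependencies
        for (std::size_t d = 0; d < it->second.size(); ++d) {
            const Dep& dep = it->second[d];
            Eval next(dep.target, std::get<1>(current) + dep.di,
                      std::get<2>(current) + dep.dj);
            if (reached.insert(next).second) work.push(next);
        }
    }
    return reached;
}

// Number of reached evaluations (or accesses) that belong to `name`.
int count_evals(const std::set<Eval>& evals, const std::string& name) {
    int n = 0;
    for (std::set<Eval>::const_iterator e = evals.begin(); e != evals.end(); ++e)
        if (std::get<0>(*e) == name) ++n;
    return n;
}
```

Starting from `Eval("out", 0, 0)`, the search reaches 2 fli, 2 flj, and 5 lap evaluations plus 13 input points under the assumed offsets. The lap footprint spans -1 to 1 in both dimensions.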

SLIDE 14

Tiling Maps

  • Define a tiling using a map that associates stencil evaluations with tile ids

$$T_{out} = \{\, (out, i_0, i_1) \to (\lfloor i_0 / 2 \rfloor, \lfloor i_1 / 2 \rfloor) \,\}$$

[Figure: grid with axes i0 and i1 (ticks 1 to 5) partitioned into 2 x 2 tiles with tile ids (0,0) to (2,2); the evaluation out (2,5) maps to tile id (1,2)]
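The tiling map for a 2 x 2 tile can be written as a small function, sketched below with hypothetical names. One detail needs care in C++: integer division truncates toward zero, so a floor division helper is needed for negative indices.

```cpp
#include <cassert>
#include <utility>

// Floor division; the built-in '/' truncates toward zero for negative operands.
int floor_div(int a, int b) {
    int q = a / b;
    if ((a % b != 0) && ((a < 0) != (b < 0))) --q;
    return q;
}

// (out, i0, i1) -> (floor(i0 / tile0), floor(i1 / tile1)), with 2 x 2 tiles by default.
std::pair<int, int> tile_id(int i0, int i1, int tile0 = 2, int tile1 = 2) {
    return std::make_pair(floor_div(i0, tile0), floor_div(i1, tile1));
}
```

For the example in the figure, `tile_id(2, 5)` yields the tile id (1, 2).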
SLIDE 15

Count Stencil Evaluations

  • Intersect the range of the tiling map with the base tiling hierarchy origin tile

$$S = \{\, (0, 0, y_0, y_1) \mid y_j \in \mathbb{Z} \,\}$$

  • Count the number of stencil evaluations associated with the remaining tiles

$$c_{out} = |T_{out} \cap_{ran} S| \cdot \#flops$$

Count the points in the filtered tiling map using the Barvinok algorithm.
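For intuition only, the count can be imitated by brute-force enumeration instead of the Barvinok algorithm, which computes it symbolically. The sketch restricts itself to two dimensions and counts the evaluations assigned to the origin tile, including a halo that the tile computes on-the-fly. The `Halo` struct, the function names, and the reading of the four `IJRange` arguments as i-minus, i-plus, j-minus, j-plus are assumptions.

```cpp
#include <cassert>

// Halo of a stencil stage; assumed to mirror IJRange<i_minus, i_plus, j_minus, j_plus>.
struct Halo { int i_minus, i_plus, j_minus, j_plus; };

// True if evaluation (i0, i1) is assigned to the origin tile of size tile0 x tile1.
bool in_origin_tile(int i0, int i1, int tile0, int tile1, const Halo& h) {
    return i0 >= h.i_minus && i0 < tile0 + h.i_plus &&
           i1 >= h.j_minus && i1 < tile1 + h.j_plus;
}

// Brute-force stand-in for the symbolic count: evaluations in the origin tile times flops.
long count_flops(int tile0, int tile1, const Halo& h, long flops_per_evaluation) {
    // The search window below only covers halos that are no larger than the tile.
    assert(-h.i_minus <= tile0 && h.i_plus <= tile0);
    assert(-h.j_minus <= tile1 && h.j_plus <= tile1);
    long evaluations = 0;
    for (int i0 = -tile0; i0 < 2 * tile0; ++i0)
        for (int i1 = -tile1; i1 < 2 * tile1; ++i1)
            if (in_origin_tile(i0, i1, tile0, tile1, h)) ++evaluations;
    return evaluations * flops_per_evaluation;
}
```

With a 64 x 4 tile, a stage with halo (-1, 1, -1, 1) is evaluated at 66 x 6 = 396 points per tile, and a stage with halo (-1, 0, 0, 0) at 65 x 4 = 260 points.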

SLIDE 16

Stencil Program Optimization

  • Put it all together (stencil algebra, performance model, stencil analysis)
  • 1. Optimize the stencil execution order (brute force search)
  • 2. Optimize the stencil grouping (dynamic programming)

[Figure: four candidate variants of the horizontal diffusion graph (lap, fli, flj, out)]

$$\min_{x \in I}\ t(x) \quad \text{subject to} \quad m(x) \le M$$
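Read as code, the optimization problem selects the fastest variant whose memory footprint fits the capacity. The sketch below assumes that the time and memory of every variant have already been estimated; the `Variant` struct and the function name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One implementation variant x with its estimated time t(x) and memory m(x).
struct Variant {
    std::string bracket;  // bracket expression describing the variant
    double time;          // t(x)
    double memory;        // m(x)
};

// Index of the variant that minimizes t(x) subject to m(x) <= capacity, or -1 if none fits.
int best_variant(const std::vector<Variant>& variants, double capacity) {
    int best = -1;
    for (std::size_t x = 0; x < variants.size(); ++x) {
        if (variants[x].memory > capacity) continue;  // violates m(x) <= M
        if (best < 0 || variants[x].time < variants[best].time)
            best = static_cast<int>(x);
    }
    return best;
}
```

A linear scan like this is only practical for small variant sets, which is why the grouping step on this slide relies on dynamic programming rather than full enumeration.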

SLIDE 17

Dynamic Programming (simplified)

[Figure: dynamic programming table over the stencil ranges lap⟷lap, lap⟷fli, fli⟷fli, lap⟷flj, fli⟷flj, flj⟷flj, lap⟷out, fli⟷out, flj⟷out, and out⟷out, with numbered evaluation steps 1 to 4 on tiling hierarchy levels 1 and 0; entries are combined as [ ... ] :: ...]
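A strongly simplified, single-level version of such a dynamic program is sketched below. It assumes a fixed execution order and a caller-supplied function `cost(i, j)` that estimates the time of fusing stencils i to j-1 into one group; both the function and the `Grouping` struct are hypothetical. The best grouping of the first j stencils is built from the best grouping of a shorter prefix plus one final group.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <vector>

struct Grouping {
    double time;            // estimated time of the best grouping
    std::vector<int> cuts;  // index of the first stencil of every group
};

// best[j] = min over i < j of best[i] + cost(i, j)
Grouping optimal_grouping(int n, const std::function<double(int, int)>& cost) {
    std::vector<double> best(n + 1, std::numeric_limits<double>::infinity());
    std::vector<int> prev(n + 1, 0);
    best[0] = 0.0;
    for (int j = 1; j <= n; ++j) {
        for (int i = 0; i < j; ++i) {
            const double t = best[i] + cost(i, j);
            if (t < best[j]) { best[j] = t; prev[j] = i; }
        }
    }
    Grouping g;
    g.time = best[n];
    // Walk the predecessor links backwards to recover where each group starts.
    for (int j = n; j > 0; j = prev[j]) g.cuts.insert(g.cuts.begin(), prev[j]);
    return g;
}
```

For four stencils in the order lap, fli, flj, out, cuts of {0, 3} correspond to the bracket expression [[lap,fli,flj],[out]].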

SLIDE 18

Evaluation

CPU Experiments (i5-3330), speedup relative to no fusion:

           | hd  | uv  | div | uv&div
no fusion  | 1   | 1   | 1   | 1
hand-tuned | 3.1 | 2.7 | 2.1 | 2.4
optimized  | 3.1 | 2.7 | 2.1 | 2.4

GPU Experiments (Tesla K20c), speedup relative to no fusion:

           | hd  | uv  | div | uv&div
no fusion  | 1   | 1   | 1   | 1
hand-tuned | 2.3 | 2.1 | 1.1 | 1.5
optimized  | 2.3 | 2.4 | 2   | 2.1

[Figure: measured time m [ms] versus estimated time e [ms]; CPU fit m ~ 1.6e, GPU fit m ~ 1.5e]
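Assuming the second group of bar values belongs to the GPU chart and lists the hand-tuned series (2.3, 2.1, 1.1, 1.5) before the optimized series (2.3, 2.4, 2, 2.1), the speedup of the optimized over the hand-tuned variants for hd, uv, div, and uv&div works out as:

```latex
\frac{2.3}{2.3} = 1.0, \qquad
\frac{2.4}{2.1} \approx 1.14, \qquad
\frac{2.0}{1.1} \approx 1.82, \qquad
\frac{2.1}{1.5} = 1.4
```

Under that assumption the ratios span roughly 1.0 to 1.8.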

SLIDE 19

Conclusions

  • We categorize data locality transformations for stencil programs
  • We enumerate stencil program implementation variants using an algebra
  • Our performance model estimates the stencil program execution time
  • Using MODESTO we can automatically tune STELLA stencil programs
  • 2.0-3.1x speedup against naive implementations
  • 1.0-1.8x speedup against expert-tuned implementations